BB Unix Network Monitor - Message
{bb} The bb_rename error problem
- To: bb@bb4.com
- Subject: {bb} The bb_rename error problem
- From: brent.mccrackin@bell.ca
- Date: Wed, 26 Jan 2005 14:57:57 -0500
- Content-class: urn:content-classes:message
- Content-type: multipart/alternative; boundary="----_=_NextPart_001_01C503E1.54E6ACA4"
- Reply-to: bb@bb4.com
- Sender: owner-bb@bb4.com
- Thread-index: AcUD4VTd+LVt1uqQTAK02f+rb8on2g==
- Thread-topic: The bb_rename error problem
I have a very large and very busy BB server, and I see a lot of bb_rename
errors logged to BBOUT where new status logs cannot be renamed, with an
errno 2 message.
Wed Jan 26 08:56:23 2005 bbd bb_rename Could not rename /bbvar/logs/.somehost.http to /bbvar/logs/somehost.http - errno: 2
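For reference, errno 2 is ENOENT ("No such file or directory"), which rename() returns when the source file or part of the destination path is missing. The snippet below is only a minimal sketch in Python, not bbd code: the helper try_rename and the temporary directory are hypothetical stand-ins for the BBVAR/logs paths in the log line above.

```python
import errno
import os
import tempfile


def try_rename(src, dst):
    """Attempt the rename and return the errno on failure, or 0 on success."""
    try:
        os.rename(src, dst)
        return 0
    except OSError as exc:
        return exc.errno


with tempfile.TemporaryDirectory() as logs:
    # Hypothetical stand-ins for /bbvar/logs/.somehost.http and its target.
    tmp_path = os.path.join(logs, ".somehost.http")
    final_path = os.path.join(logs, "somehost.http")

    # The temporary file does not exist at rename time, so the call fails.
    code = try_rename(tmp_path, final_path)
    print(code, errno.errorcode[code], os.strerror(code))
```

This says only that the temporary file was not there when the rename ran; it does not explain why it was missing.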
I've looked through the list archives, and the suggestions have been that
this is the result of a host appearing multiple times in the BB-HOSTS file,
either on the server or the client. This is most definitely not the case
for me.
The root cause of
the problem is high I/O wait times on the disks where the logs are stored, with
the affected disks running at a fairly constant 80% to 90% busy. This can
be alleviated with faster disks, higher I/O bandwidth, and separating the BBVAR
directory across multiple filesystems.
However, in examining and troubleshooting this problem, I found a number of
side effects of the error that are having a negative impact on the usability
of the BB logs and history. One or all of the effects below typically happen
when the bb_rename error is produced.
1) When the error happens, the updates to the
BBHISTLOGS/HOST/COLUMN/Time_date_log, BBHIST/host, BBHIST/host.column and
HIST/allevents files do not happen, and the new status is essentially lost.
While the BBLOGS/host.column file holds the new status, the history logs
used by reporting and the events display are incorrect. This results in bad
numbers in the SLA reporting or the history view.
2) When a recovery status message is mislogged, the next time a pageable
event occurs the page is sent immediately, because the BBHIST logs indicate
the previous status was also pageable some time in the past. The
intermediate recovery message was lost.
For example:
- somehost.http is currently green, and all logs agree on this state.
- somehost.http goes red, and all files are updated okay.
- somehost.http goes green again, but an error happens in the BBVAR/logs
  update and the history events are not recorded.
- Some time later, somehost.http goes red again and the updates happen, but
  BBPAGER sees that there has been no green since the last red in the BBHIST
  logs for the host, so it sends a page immediately if the time period
  matches the pager delay periods (I've typically seen reds separated by
  weeks cause this). The history view shows two red alerts in a row in the
  column, with no intermediate green, even though the dot on the BBDISPLAY
  was showing green in between.
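The sequence above can be reproduced with a toy model. This is only a sketch of the behaviour as I understand it, not BBPAGER's actual logic: the functions record and should_page_immediately are hypothetical, the history is a plain list rather than the BBHIST files, and the pager delay periods are ignored.

```python
def should_page_immediately(history, new_color, pageable=("red",)):
    """Toy rule: a pageable event that follows a pageable history entry
    is treated as a continuation of the same alert."""
    previous = history[-1] if history else "green"
    return new_color in pageable and previous in pageable


def record(history, color, history_write_ok=True):
    """Append a status change to the history only if the write succeeded."""
    if history_write_ok:
        history.append(color)


history = ["green"]

# First outage: a new alert, and the history is updated normally.
first = should_page_immediately(history, "red")
record(history, "red")

# Recovery arrives, but the rename error means the history is not updated.
record(history, "green", history_write_ok=False)

# Second outage, possibly weeks later: the last recorded entry is still red.
second = should_page_immediately(history, "red")
record(history, "red")

print(history, first, second)
```

In this model the history ends up with two reds in a row and the second outage is treated as a continuation, which matches what the history view shows.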
3) When a rename error happens, the old status log timestamp may not be
updated to 30 minutes in the future. If enough errors happen to the same
host, or the timestamp update was never completed, the next time BBDISPLAY
looks through the logs for expired files, the host will show up as purple,
even though recent status messages have been received. This is most
apparent when the status message was supposed to be generated by the
BBSERVER network tests.
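A small sketch of how a skipped timestamp update turns into a purple. The expiry rule here (a log is expired once its mtime is in the past) is my assumption about how BBDISPLAY decides, and mark_valid and is_purple are hypothetical names; the clock values are simulated rather than waited for.

```python
import os
import tempfile
import time

VALIDITY_SECONDS = 30 * 60  # the 30-minute window described above


def mark_valid(path, now):
    """Push the file's mtime into the future to mark the status as current."""
    os.utime(path, (now, now + VALIDITY_SECONDS))


def is_purple(path, now):
    """Assumed expiry rule: a log whose mtime is in the past has expired."""
    return os.stat(path).st_mtime < now


with tempfile.TemporaryDirectory() as logs:
    path = os.path.join(logs, "somehost.http")
    open(path, "w").close()

    t0 = time.time()
    mark_valid(path, t0)  # last successful status update

    # 20 minutes later a new status arrives, but the rename fails, so
    # mark_valid is never called and the old timestamp stays in place.
    purple_at_20_min = is_purple(path, t0 + 20 * 60)

    # 35 minutes after the last good update the old timestamp has passed.
    purple_at_35_min = is_purple(path, t0 + 35 * 60)

    print(purple_at_20_min, purple_at_35_min)
```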
The system in question is:
- Solaris 8 on a Sunfire 280R
- BB 1.9c (registered and licensed, but using the BTF code due to many
  required modifications and hacks)
- BBGEN 3.5
- LARDD 0.42 with 0.43 mixed in (heavily hacked and added onto)
From the BBGEN
stats, there are approximately 1300 network tests performed by the server in 40
seconds, and 3600 status messages processed by BBDISPLAY in 10 seconds. BB
has been configured to run polls every 2.5 minutes (Management requirement -
BBGEN has been a HUGE help in achieving workable results).
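To put those figures in per-second terms, here is a quick calculation. The only inputs are the numbers quoted above; the assumption that each of the 3600 statuses is rewritten once per poll cycle is mine and is not something the BBGEN stats report.

```python
POLL_INTERVAL_S = 2.5 * 60   # polls every 2.5 minutes
NETWORK_TESTS = 1300         # network tests per run
TEST_WINDOW_S = 40           # seconds taken by the network tests
STATUS_MESSAGES = 3600       # statuses handled per run

# Rate of network tests while the test run is active.
tests_per_second = NETWORK_TESTS / TEST_WINDOW_S

# Average status rewrites per second, ASSUMING one rewrite per status
# per poll cycle.
average_updates_per_second = STATUS_MESSAGES / POLL_INTERVAL_S

# Number of poll cycles in a day at this interval.
polls_per_day = 24 * 60 * 60 / POLL_INTERVAL_S

print(tests_per_second, average_updates_per_second, polls_per_day)
```

Each status update touches several files (the temporary log, the rename, and the history files listed earlier), so the number of disk operations is a multiple of the update rate.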
Some filesystem tuning has been performed to help (noatime has been enabled
on the mount points, for instance), but faster disks appear to be the best
solution at the present time. A RAM disk may be another possibility, but I
want to make sure that all logs and BB statuses are maintained between
reboots (not that this server is rebooted that often).
What I am looking for is some way to have BBD handle filesystem errors
better and guarantee that a received status message is properly updated
throughout the system under almost all conditions. I'm not completely sure
where the problem, if any, resides, or whether it even exists and can be
fixed. However, I'm fairly certain that filesystem contention and busyness
are a large factor in the problem and the symptoms being seen.
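To make the request concrete, here is a rough sketch of the behaviour I have in mind, written in Python rather than BBD's own code. Everything in it is hypothetical: write_status, the on_committed callback standing in for the history updates, and the retry count and delay. It retries the write and rename with a fresh temporary file each time, and only runs the history update once the rename has succeeded.

```python
import os
import tempfile
import time


def write_status(logs_dir, name, text, on_committed, retries=3, delay=0.05):
    """Write a status log via a temporary file and rename, retrying on error.

    on_committed(name, text) stands in for the history updates and is only
    called after the rename has succeeded. Returns True on success.
    """
    final_path = os.path.join(logs_dir, name)
    for attempt in range(retries):
        tmp_path = None
        try:
            # A fresh, uniquely named temporary file for every attempt.
            fd, tmp_path = tempfile.mkstemp(prefix="." + name + ".", dir=logs_dir)
            with os.fdopen(fd, "w") as handle:
                handle.write(text)
            os.rename(tmp_path, final_path)
        except OSError:
            # Remove any leftover temporary file, back off, and try again.
            if tmp_path is not None:
                try:
                    os.unlink(tmp_path)
                except OSError:
                    pass
            time.sleep(delay * (attempt + 1))
            continue
        on_committed(name, text)
        return True
    return False


events = []
with tempfile.TemporaryDirectory() as logs:
    ok = write_status(logs, "somehost.http", "green all ok",
                      lambda n, t: events.append((n, t)))
    print(ok, events)
```

Retrying does nothing for the underlying I/O wait, and a caller still has to decide what to do when every attempt fails, but the status log and the history would at least not drift apart silently.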
Any help from the
developers or community is appreciated.
---
Brent B McCrackin
UNIX Systems Specialist - Bell Sympatico
"Serenity through viciousness."