{bb} The bb_rename error problem



I run a very large and very busy BB server, and I see a lot of bb_rename errors logged to BBOUT: new status logs cannot be renamed, and the rename fails with errno 2.
 
Wed Jan 26 08:56:23 2005 bbd bb_rename Could not rename /bbvar/logs/.somehost.http to /bbvar/logs/somehost.http - errno: 2
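
For reference, errno 2 is ENOENT ("No such file or directory") on Solaris, as on other Unix systems. For a rename, that usually means the source file (here the dot-prefixed temporary log) did not exist when the call was made, or that a directory in one of the paths is missing. A minimal Python sketch that reproduces the same error code in a scratch directory; the helper name rename_errno is made up for illustration and nothing here is BB code:

```python
import errno
import os
import tempfile

def rename_errno(src, dst):
    """Try to rename src to dst; return 0 on success or the errno on failure."""
    try:
        os.rename(src, dst)
        return 0
    except OSError as exc:
        return exc.errno

with tempfile.TemporaryDirectory() as logs:
    tmp = os.path.join(logs, ".somehost.http")
    final = os.path.join(logs, "somehost.http")
    missing = rename_errno(tmp, final)   # the temporary file was never created
    open(tmp, "w").close()
    ok = rename_errno(tmp, final)        # now the source exists
    print(missing == errno.ENOENT, ok)
```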
 
I've looked through the list archives, and the suggestion has been that this results from a host appearing multiple times in the BB-HOSTS file, either on the server or the client.  This is most definitely not the case for me.
 
The root cause of the problem is high I/O wait times on the disks where the logs are stored, with the affected disks running at a fairly constant 80% to 90% busy.  This can be alleviated with faster disks, higher I/O bandwidth, and separating the BBVAR directory across multiple filesystems.
 
 
However, in examining and troubleshooting this problem, I found a number of side effects of the error that have a negative impact on the usability of the BB logs and history.  One or all of the effects below typically occur when the bb_rename error is produced.
 
1) When the error happens, the updates to the BBHISTLOGS/HOST/COLUMN/Time_date_log, BBHIST/host, BBHIST/host.column and HIST/allevents files do not happen, and the new status is essentially lost.  While the BBLOGS/host.column file holds the new status, the history logs used by reporting and the events display are incorrect.  This results in bad numbers in the SLA reporting or the history view.
 
2) When a recovery status message is mislogged, the next pageable event triggers a page immediately, because the BBHIST logs indicate that the previous status was also pageable some time in the past.  The intermediate recovery message was lost.
 
For example:
 
somehost.http is currently green, and all logs agree to this state.
 
somehost.http goes red, and all files are updated okay.
 
somehost.http goes green again, but an error happens in the BBVAR/logs update and the history events are not recorded.
 
Some time later, somehost.http goes red again and the updates happen, but BBPAGER sees that there has been no green since the last red in the BBHIST logs for the host, so it sends a page immediately if the time period matches the pager delay periods (typically I've seen reds separated by weeks cause this).  The history view shows two red alerts in a row in the column, with no intermediate green, even though the dot on the BBDISPLAY has been showing green all along.
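
The sequence above can be reproduced with a toy model. This is a sketch under assumptions, not BBPAGER's actual logic: the function names, the list used as the history, and the rule "page at once if the last recorded history entry is already a pageable color" are all hypothetical simplifications of the behavior described.

```python
PAGEABLE = {"red"}  # assumption: only red is pageable in this toy model

def record(history, color, rename_ok=True):
    """Append a status change to the history unless the log update failed."""
    if rename_ok:
        history.append(color)

def should_page_immediately(history, new_color):
    """Hypothetical rule: a pageable status that follows a pageable status
    in the recorded history is treated as a continuing outage."""
    return new_color in PAGEABLE and bool(history) and history[-1] in PAGEABLE

history = ["green"]
decisions = []
# (color, rename_ok): the recovery to green is the update that gets lost.
for color, rename_ok in [("red", True), ("green", False), ("red", True)]:
    decisions.append(should_page_immediately(history, color))
    record(history, color, rename_ok)

print(history)    # ['green', 'red', 'red']
print(decisions)  # [False, False, True]
```

The recorded history ends up with two reds in a row, so in this model the second red is treated as a continuation of the first outage rather than as a new event.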
 
3) When a rename error happens, the old status log timestamp may not be updated to 30 minutes in the future.  If enough errors happen for the same host, or the timestamp update was never completed, the next time BBDISPLAY looks through the logs for expired files the host will show up as purple, even though recent status messages have been received.  This is most apparent when the status message was supposed to be generated by the BBSERVER network tests.
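
A rough illustration of how a skipped timestamp update leads to a purple status. It assumes that expiry is judged by comparing the log file's modification time with the current time; the helper names touch_valid_until and is_expired are invented for this sketch and are not BB functions.

```python
import os
import tempfile
import time

VALIDITY_SECONDS = 30 * 60  # the 30-minute window mentioned above

def touch_valid_until(path, now):
    """Mark a status log as valid by setting its mtime 30 minutes ahead."""
    os.utime(path, (now, now + VALIDITY_SECONDS))

def is_expired(path, now):
    """Assumed expiry rule: a log whose mtime is in the past is stale (purple)."""
    return os.stat(path).st_mtime < now

with tempfile.TemporaryDirectory() as logs:
    log = os.path.join(logs, "somehost.http")
    open(log, "w").close()
    t0 = time.time()
    touch_valid_until(log, t0)               # status received at t0
    fresh = is_expired(log, t0 + 10 * 60)    # checked 10 minutes later
    # Suppose a newer status arrives at t0 + 25 minutes but its rename fails,
    # so the timestamp is never pushed forward and the old expiry stays.
    stale = is_expired(log, t0 + 35 * 60)    # checked 35 minutes after t0
    print(fresh, stale)
```

In this model the host is reported as expired at the 35-minute check even though a status message arrived ten minutes earlier.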
 
 
The system in question is:
 
Solaris 8 on a Sunfire 280R
BB 1.9c (registered and licensed, but using the BTF code due to many required modifications and hacks)
BBGEN 3.5
LARDD 0.42 with 0.43 mixed in (heavily hacked and added onto)
 
From the BBGEN stats, there are approximately 1300 network tests performed by the server in 40 seconds, and 3600 status messages processed by BBDISPLAY in 10 seconds.  BB has been configured to run polls every 2.5 minutes (Management requirement - BBGEN has been a HUGE help in achieving workable results).
 
Some filesystem tuning has been performed to help (noatime has been enabled on the mount points, for instance), but faster disks appear to be the best solution at the present time.  A RAM disk may be another possibility, but I want to make sure that all logs and BB statuses are maintained between reboots (not that this server is rebooted that often).
 
 
What I am looking for is some way to have BBD handle filesystem errors better, so that a received status message is guaranteed to be properly updated throughout the system under nearly all conditions.  I'm not completely sure where the problem, if any, resides, or whether it even exists and can be fixed.  However, I'm fairly certain that filesystem contention and busy-ness are a large factor in the problem and the symptoms being seen.
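
One shape such handling could take, as a sketch and not a patch to BBD (which is written in C, not Python): write each status to a temporary file whose name is unique to the writer, rename it into place, retry briefly on failure, and report the outcome to the caller so the history update can still be attempted or the message requeued. The function name store_status, the pid-based temporary name, and the retry count and delay are assumptions for illustration. Whether concurrent writers sharing one temporary file name is really the cause here is not established.

```python
import os
import tempfile
import time

def store_status(logdir, name, text, retries=3, delay=0.05):
    """Write text to logdir/name via a uniquely named temp file and a rename.
    Returns True if the final file was updated, False otherwise."""
    final = os.path.join(logdir, name)
    for attempt in range(retries):
        tmp = os.path.join(logdir, ".%s.%d.%d" % (name, os.getpid(), attempt))
        try:
            with open(tmp, "w") as fh:
                fh.write(text)
            os.replace(tmp, final)   # atomic on POSIX within one filesystem
            return True
        except OSError:
            # Remove any leftover temp file, wait briefly, then retry.
            try:
                os.unlink(tmp)
            except OSError:
                pass
            time.sleep(delay)
    return False

with tempfile.TemporaryDirectory() as logs:
    updated = store_status(logs, "somehost.http", "green all ok\n")
    with open(os.path.join(logs, "somehost.http")) as fh:
        content = fh.read()
    # A missing log directory makes every attempt fail, so False is returned.
    failed = store_status(os.path.join(logs, "nosuchdir"), "somehost.http", "x")
    print(updated, content.strip(), failed)
```

Because the function returns False instead of silently dropping the update, the caller can still record the history event or queue the message for another attempt.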
 
Any help from the developers or community is appreciated.
 
---
Brent B McCrackin
UNIX Systems Specialist - Bell Sympatico
Brent.McCrackin@Bell.ca   PH: 416-353-0692
"Serenity through viciousness."
 
 
