How to know when a disk subsystem has stopped responding

Status
Not open for further replies.

darkstar (MIS, US)
Aug 25, 1999
Greetings,

We had a case where a server with external disks ran into a serious problem. Somehow the power cord for the disks got pulled out, and the disks stopped responding (of course).

Oddly enough, the kernel kept running, at least for the time being, and the box responded to pings, SNMP traffic, etc. Our management software never knew there was anything amiss with the box.

Now that the post-mortem is in, managers want to know whether we can detect when the disk stops responding on any given system.

This is a dilemma. A shell program that runs periodically will not suffice: once the disks are gone, the system won't be able to read the script. And even if it does run (as in constantly memory resident), what would a script do? In all probability it would block waiting for disk I/O and never respond.

Has anyone ever worked out a problem like this before?
 
Darkstar,

I have a few questions for you. Obviously, the external drives are the ones that lost power. Typically, a machine will have at least one internal drive where you load the O/S. Is this not the case with your server?

If you do have an internal drive, you could easily write a script that lives on a file system physically residing on the internal drive and monitors the status of your external drives.

Also, there are a myriad of third-party products that monitor everything from drive status to network traffic, CPU utilization, etc., and that could run on a separate machine to monitor this one.

slars
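
For anyone wanting to try the monitoring-script idea, here is a minimal sketch in Python (not something posted in the thread) that also tries to address the worry from the original question about the check itself hanging on I/O. The probe runs in a forked child, and the parent only polls for it with a deadline, so a stuck read should block the child rather than the monitor. The function name `disk_responds`, the timeout values, and the choice of a directory listing as the probe are all assumptions; a listing may be answered from cache, so a real probe might need to read data that is known not to be cached.

```python
import os
import signal
import time


def disk_responds(path, timeout_s=10.0, poll_s=0.1):
    """Return True if listing `path` finishes within `timeout_s` seconds.

    Hypothetical helper: the name, defaults, and probe are illustrative.
    """
    pid = os.fork()
    if pid == 0:
        # Child: touch the disk, then exit immediately with a status code.
        status = 1
        try:
            os.listdir(path)
            status = 0
        finally:
            os._exit(status)

    # Parent: poll without blocking until the child exits or time runs out.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
        time.sleep(poll_s)

    # Timed out. Send SIGKILL but do not wait for the child: a process
    # stuck in uninterruptible I/O may never exit.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return False
```

A call such as `disk_responds("/data", timeout_s=5)` (with `/data` standing in for a mount point on the external disks) would return False both when the path cannot be read and when the read never comes back. Forking instead of launching a new program is deliberate: starting a fresh executable could itself require a disk read.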
 
In this case there are no internal disks. And we are indeed using a 3rd party product (ITO) to monitor the system. All of ITO's processes that check the wellness of the machine locked up as well, apparently pending on I/O.
 
Do you have another host you can use? If so, why not have a cron job on the first host check the disk subsystem and, if all is well, write a checkpoint file elsewhere (on another host)? At predetermined intervals, a cron (Unix) or AT (NT) job on the second host checks whether that file has changed. If the file hasn't changed over that period, or over a predetermined number of periods, a rule flags the event via email or some other form of notification. It's a little cumbersome, but at a minimum it takes the hang notification away from the host that is potentially going to hang.
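
The check on the second host could look roughly like the sketch below. It is only an illustration of the checkpoint idea above: the function name `checkpoint_is_stale`, the use of the file's modification time as the "has it changed" test, and the threshold are assumptions, not anything specified in this thread.

```python
import os
import time


def checkpoint_is_stale(path, max_age_s, now=None):
    """Return True if the checkpoint is missing or older than `max_age_s`.

    Hypothetical helper for the second host; `now` can be passed in
    explicitly to make the check easy to test.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No checkpoint at all is treated the same as a stale one.
        return True
    return (now - mtime) > max_age_s
```

Run from cron on the second host, a True result would trigger the notification. Setting `max_age_s` to a few multiples of the first host's checkpoint interval gives the "predetermined number of time periods" tolerance mentioned above.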
 
You might want to think about monitoring /var/adm/syslog/syslog.log with something that mails you when a bad thing happens.

You could base this on something that reads the output of

tail -f /var/adm/syslog/syslog.log

Mike

Mike Lacey
Mike_Lacey@Cargill.Com
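
As a rough sketch of the "something that reads the output" part, the Python filter below yields only the log lines that match an alert pattern. The pattern is a guess: the actual messages a disk failure produces depend on the OS and driver, so `ALERT_PATTERN` and the function name `find_alerts` are hypothetical and would need adjusting against real log entries. If the log file itself lives on the failed disks, nothing new may ever be written to it.

```python
import re

# Hypothetical pattern; replace with strings seen in your own syslog.
ALERT_PATTERN = re.compile(r"scsi|i/o error|timeout|not responding",
                           re.IGNORECASE)


def find_alerts(lines, pattern=ALERT_PATTERN):
    """Yield each line (without its trailing newline) that matches `pattern`.

    `lines` can be any iterable of strings, for example sys.stdin when
    the output of `tail -f` is piped into the script.
    """
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")
```

Because it is a generator, it can sit at the end of a `tail -f /var/adm/syslog/syslog.log |` pipeline and hand each matching line to whatever sends the mail as soon as it appears.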
 
How about putting in a memory-resident script (load it into a RAM disk) that sends you a simple e-mail message whenever it CAN access files on the naughty drives? That way you'd know something was up if you didn't get the message. (Make the message have ONLY a subject, such as "drives working", to keep the download time used by messages to a minimum.)

Do you think this would help with your problem?

-Robherc
robherc@netzero.net
*nix installation & program collector/reseller. Contact me if you think you've got one that I don't :)
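
A subject-only "drives working" message could be built and sent along these lines. This is a sketch under assumptions: the function names, the addresses passed in, and the SMTP host are all placeholders, and it assumes a mail relay is reachable without going through the failed disks (a local mail spool on those disks would defeat the purpose).

```python
import smtplib
from email.message import EmailMessage


def build_heartbeat(sender, recipient, subject="drives working"):
    """Build a message that carries only headers: no body is set."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    return msg


def send_heartbeat(msg, smtp_host="localhost"):
    """Send the heartbeat through `smtp_host` (an assumed relay)."""
    # The timeout keeps a dead relay from hanging the monitor.
    with smtplib.SMTP(smtp_host, timeout=10) as smtp:
        smtp.send_message(msg)
```

The script would call `send_heartbeat(build_heartbeat(...))` only after a successful disk check, so a missing message is the alarm. The weak point of this design is that someone, or something, has to notice the absence.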
 
Take a look at Big Brother. It will watch your system logs for WARNING messages and the like, and it would be easy to add your own script that raises a warning in Big Brother if something went wrong.
 