Tek-Tips is the largest IT community on the Internet today!

Unknown Issue Causing Shutdown

Status
Not open for further replies.

bjdobs

Programmer
Mar 11, 2002
261
CA
A 5-year-old SCO box running 24/7 is randomly shutting down ... is there a log file or event file that might shed some light on what is happening?

This machine is currently running a Point of Sale program for a small business and needs to be fixed ASAP.

The machine doesn't appear to be experiencing any power disruption; rather, it's as if some process is issuing a shutdown command ... the system terminal is sitting at the SCO "press any key to reboot" message. This has happened 4 times in the last week, including twice today.
 
Hi. Are there any clues as to what's happening in /var/adm/messages?

Does a core file get created when the system shuts down? Try:

find / -name core

to see whether there are any which you can examine for clues. Just a couple of starters - no doubt others will be along with additions later.
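A rough sketch of that search, limited to regular files so that directories which happen to be named core are skipped. The function names find_cores and describe_cores are made up for illustration, and the second one assumes the file utility is installed. A full walk of / touches every directory on the disk, so it may be worth pointing it at a smaller tree first.

```shell
#!/bin/sh
# Sketch only: find_cores and describe_cores are hypothetical helper names.

# Print the path of every regular file named "core" under the given
# directory (defaults to / when no argument is given).
find_cores() {
    root=${1:-/}
    find "$root" -type f -name core -print 2>/dev/null
}

# Show size, date and detected file type for each core file found.
# Assumes the "file" utility is available on the system.
describe_cores() {
    find_cores "$1" | while read path; do
        ls -l "$path"
        file "$path"
    done
}
```

For example, describe_cores /usr would list any core files under /usr along with their timestamps, which can then be compared against the times of the shutdowns.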
 
Is this system running with a monitored UPS?
On the console screen, what is displayed directly above the "Safe to power off" message?
Is the shutdown occurring at any specific time, or randomly?
The primary log files are /usr/adm/messages and /usr/adm/syslog (links to /var/adm directory).
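One possible way to pull the interesting lines out of those files is sketched below. The helper name scan_log is invented for this example, and the search words (warning, panic) are only a guess at what is worth looking for. It also assumes a grep that accepts -E; on an older system egrep may be needed instead.

```shell
#!/bin/sh
# Sketch only: scan_log is a hypothetical helper name.

# Print, with line numbers, every line of the given log file that
# mentions "warning" or "panic" in any mix of upper and lower case.
# Assumes grep supports -E; substitute egrep if it does not.
scan_log() {
    grep -i -n -E 'warning|panic' "$1"
}
```

It could be run as scan_log /usr/adm/messages and again as scan_log /usr/adm/syslog. The line numbers make it easier to go back and read the surrounding block of messages.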

Most instances like this are actually PANIC shutdowns, and oftentimes you won't have entries in those logs, as the system is too sick to update them.

This will take more clues. You might want to set up a little script to dump a process list to a text file every few minutes and review it afterwards.
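Something along these lines might do as a starting point. The function names (snapshot_ps, watch_ps), the log path and the interval are all placeholders rather than anything specific to SCO. It sticks to backticks and expr so that an older Bourne-style shell has a better chance of running it, and it calls sync after each snapshot so the latest entry has a chance of reaching the disk before a crash.

```shell
#!/bin/sh
# Sketch only: function names, log path and interval are illustrative.

# Append a timestamp marker and the full process list to the log file,
# then flush buffers so the entry is more likely to survive a crash.
snapshot_ps() {
    {
        echo "==== `date` ===="
        ps -ef
    } >> "$1"
    sync
}

# Take a snapshot every <interval> seconds, <count> times.
# A count of 0 means keep going until the script is killed.
watch_ps() {
    log=$1
    interval=$2
    count=$3
    n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        snapshot_ps "$log"
        n=`expr $n + 1`
        sleep "$interval"
    done
}
```

Running watch_ps /tmp/ps_snapshots.log 300 0 in the background would record a process list every five minutes. After the next shutdown, the last block in the file shows what was running shortly before it happened.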

Were there any recent changes (Hardware/OS/Application/environment)?
 
The system has been extremely stable ... no new HW/SW ... PS was replaced 8 months ago because of a failing fan ... the messages above the press any key message look like the first kernel load screen after any key is pressed (sort of the cart before the horse) ... This system is not currently on a UPS; however, within the last 2 months there have been some power outages and we are in the process of UPS selection.
 
find / -name core is taking a long time to complete ... find finally came back with a prompt find: and then the system appears to have crashed again.

syslog entries prior to the last two crashes appear to have messages to the effect that Samba was shut down and then started back up again.

messages has no date stamp ... the tail of the file appears to be a hardware profile detailing devices, interrupt assignments etc. ... there are two warnings in the middle of the second-to-last message block for wd(0)

Is there a UNIX utility that can do a check disk with repair???
 
... the messages above the press any key message look like the first kernel load screen after any key is pressed

I think what you are describing is the process where the system has PANIC'd, and is saving contents of memory to a disk area (uses the SWAP partition). During this procedure, you get many rows of dots (......) across the screen. Unfortunately, the cause of the PANIC is probably scrolled off the screen.

If this is a "mission critical" system, and it's 5 years old, I'd recommend replacement with current hardware and an updated O/S. You might end up spending the same amount of labor troubleshooting and repairing this system as you'd spend updating to a current box. The difference is that if you troubleshoot and repair this system, you'll still have a 5-year-old system.

If that's not an option at this point, you need to find a way to "see" the PANIC message. Does anybody know how to redirect this output to a serial port?
I found this entry on Tony's site:

echo panic | crash -d /dev/swap

You might try running that command immediately after rebooting the system.
 
We actually have a new system being shipped in as we speak ... but for the moment we need to keep this system limping along, if necessary, for at least another week or so.

Here is my guess ... the find / -name core just crashed the system ... I did a disk scan last night remotely via Samba, which would have touched every file on the disk, and that also crashed the system ... I think there is a file subsystem issue ... either the inode tables are corrupted or the hard drive is starting to fail.

If this is the case, I'm not sure how this can be resolved.

We do have a backup from Saturday night (data only, from the point of sale database) plus we have a core image which is approx 1 year old ... I may have to re-image a new drive and restore Saturday's backup.
 
>>>> messages has no date stamp ... the tail of the file appears to be a hardware profile detailing devices, interrupt assignments etc. ... there are two warnings in the middle of the second-to-last message block for wd(0)

The messages file does have a date/timestamp on a separate line, followed by the messages. Those two warnings are going to be a key to what is going on. Is this a SCSI drive?

Is there a UNIX utility that can do a check disk with repair???

fsck
 
Yes, fsck. I was thinking exactly the same myself reading the remainder of the thread.
 
There was a file system issue ... I have run fsck several times and finally it has finished without a PANIC (possibly because I ran a utility called SpinRite on the drive) ... I have also restored a year-old Ghost image onto a new drive and am now in the process of attempting to restore yesterday's backups. (Did I say new drive ... I had to dredge up a drive from one of our retired systems because the motherboard wouldn't recognize a new drive.)

Thanks for the help.

This should hopefully keep us going until the new system arrives next week.
 