Server restarts itself.

Status
Not open for further replies.

Nostradamus

Technical User
May 3, 2000
419
SE
SCO Openserver Enterprise Systems 5.0.5. + patches
The server is a Compaq Proliant 3000 and therefore Compaq EFS 5.36 has been installed.

The server has been going up and down for two days, totally at random and without warning. There are no messages on the console as far as I can see, and nothing unusual is written to /var/adm/messages or syslog.
Are there any other logs I might find some more information in?

I've tried a couple of things but nothing has worked. I don't know what to do next.

I managed to be around during one of the reboots and it ran fsck on one of the partitions. It displayed "possibly damaged filesystem or partition full" or something similar. I then entered single-user mode and ran fsck (on that partition) manually and it completed successfully.
Could I have a corrupted filesystem somewhere?

I suspect that it's the backup that's causing problems.
Not long ago we stopped using our internal tape drive for backups. Instead I installed the Backup Exec remote agent (v. 5.0.1) on the SCO machine and began doing backups with our tape robot (Win2k, Backup Exec v.8.6). A crontab job copies my directories to a newly installed hard drive, and Backup Exec then backs up that hard drive.

I removed the Backup Exec remote agent last night when the server restarted. It's been stable since, but I doubt it will stay that way.

Any input on this would be really appreciated. /Sören
 
If it ran fsck, it either lost power or did a panic crash.

If it is on a UPS, that UPS could be defective.

If it was a panic crash, Backup Exec probably did NOT crash it, but the ACCESS of a corrupt or damaged file system may have.

You must have PANICBOOT=YES set in /etc/default/boot. Change it to NO so that you can see what the panic is ( see and )
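As a minimal sketch of how that setting could be checked (not part of the original reply), the function below prints the PANICBOOT value found in a boot defaults file. The function name `panicboot_setting` is made up for illustration, and the file is assumed to hold plain `KEY=value` lines; only the path /etc/default/boot comes from the post.

```shell
#!/bin/sh
# Hypothetical helper: print the PANICBOOT value from a boot defaults file.
# Defaults to /etc/default/boot (the path named in the post); pass another
# path as $1 to inspect a copy. Assumes plain KEY=value lines.
panicboot_setting() {
    file="${1:-/etc/default/boot}"
    # Keep the last PANICBOOT= line, drop the key, upper-case the value.
    value=`grep '^PANICBOOT=' "$file" 2>/dev/null | sed -n '$p' | cut -d= -f2 | tr '[a-z]' '[A-Z]'`
    # An absent line (or unreadable file) is reported as UNSET.
    echo "${value:-UNSET}"
}
```

Calling `panicboot_setting` with no argument reads the live file; it prints YES, NO, or UNSET.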

Tony Lawrence
SCO Unix/Linux Resources tony@pcunix.com
 
The server never lost power. It has redundant power supplies as well as a UPS that supplies our entire server room. No other servers were affected.
The UPS also handles any electrical peaks that can occur, so I really don't think the problem is with power.

The PANICBOOT is set to NO and has been since the beginning.
If it was a panic crash, where would I find information about what kind of panic it was and why it happened? I mean, in which logs should I look?
Hmm, come to think of it, the server never rebooted that time. It went off. I had to push the on-button to start it. Does that tell you anything?
/Sören
 
One of my new ML370s was fine for 2 weeks. I moved it to a different room, and after the restart it kept rebooting for no apparent reason.
This turned out to be a PowerChute daemon which was reporting low battery on my UPS. The server WAS NOT CONNECTED to the UPS, and never had been either.
***************************************
Party on, dudes!
 
panic info is displayed on the console screen.
you are most likely having a hardware problem.
aren't atx power systems fun.. the motherboard has the ability to turn the system off so you can't see what/if anything reported on the screen.

you could also make sure that extra compaq utilities aren't running if you don't need them (/etc/rc2.d/SXXcpq...)
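A rough sketch of how those startup scripts could be listed (an illustration, not something from the original reply): the function below prints the names of start scripts in a run-level directory whose names contain `cpq`. The function name `list_cpq_startup` is hypothetical; the directory defaults to the /etc/rc2.d path mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list start scripts (S*) whose names contain "cpq".
# Defaults to /etc/rc2.d as mentioned in the post; pass another directory as $1.
list_cpq_startup() {
    dir="${1:-/etc/rc2.d}"
    for f in "$dir"/S*cpq*; do
        # If nothing matches, the pattern stays literal and is skipped here.
        [ -f "$f" ] && basename "$f"
    done
    return 0
}
```

A common System V habit is to rename an unwanted script so that it no longer starts with `S` rather than deleting it, which keeps the change easy to undo.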

stan hubble
 
Karver: What did you do with the server? Move it back to the other room? Compaq EFS has several utilities to monitor UPS/power supplies and alert as soon as any of them fail.
We've had some hardware problems with power supplies and the errors (on console) have always been reported (saved in both /var/adm/messages and syslog). Since I don't have anything reported in these logs I believe that the power/UPS is doing fine.

Stan: I thought that everything reported to console ended up in /var/adm/messages. Am I wrong?
Unfortunately I've never been around to see the console during one of the restarts. Are there any other logs I can look for error-messages/Panic-info? If it's a panic it should notify me some way or another.

The reason I have the EFS loaded is that I need the driver support for the raid-controllers as well as some system monitor tools (cpqmon and such).
I've already removed some cpq utilities I don't want/need.
When the problems began I also tried removing cpq ASR (Active Server Recovery or something) since I don't think I need it. That didn't help: the machine restarted an hour later. It was after that that I removed the Backup Exec agents, and the server hasn't rebooted since.
It has been "stable" for 36 hours now.

Do any of you have experience using remote backup servers/programs/tape robots instead of a local tape?
It doesn't have to be Backup Exec on Win2k.

Thanks for your replies. Appreciate it! /Sören
 

I had such a problem in the past with a Dell PowerEdge system, and after several checks I found that the problem came from a memory stick. Try replacing the memory sticks and see if the problem persists.

I hope this might help you.

PS: Don't rely fully on the memory diagnostic utilities that come with the software. They sometimes DO NOT detect failures.

Regards.
Salim
 
Hmm... It's a fact that faulty memory modules can cause really bizarre errors.

I'll leave the machine on for now and I hope it doesn't go down during my trip to the Alps next week.
I'll try switching memory sticks if the prior steps don't help.

Thanks for the input.
/Sören
 
No Sören, not everything that goes to the console screen gets logged. The majority of panics are caused by hardware problems, and most of these are memory and hard drive errors.
Once the system panics it generally is not able to write to a log file, because the kernel space is corrupted, or the memory or hard disk controller is at fault, or whatever. And if things are corrupted enough that the system panics, then I wouldn't want it to try to write anything back to the disk.


 
Sorry, been on site for a few days:
- My SCO box would give me 60 seconds before rebooting due to an incorrect PowerChute module being loaded.
The really weird thing was it had never run before and had never been connected to any UPS; only after moving the server and restarting it did PowerChute wake up.
I didn't ask it to, and as soon as I killed off the process the server went back to being fine.

I believe at some point I restored cron from another machine and that caused my problem. (and the fact that the UPS from the other machine was not a smart UPS which really confused things :))
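Since a restored cron setup was the culprit here, one way to look for similar leftovers is to scan the crontab files for commands that can stop or restart the machine. This is only a sketch under assumptions: the function name `find_shutdown_jobs` is made up, the default directory /usr/spool/cron/crontabs is a guess that should be replaced with the real crontab location, the list of command names is illustrative rather than complete, and `grep -E` is assumed to be available (older systems may need `egrep`).

```shell
#!/bin/sh
# Hypothetical helper: print crontab lines that mention shutdown-type commands.
# The default directory is an assumption; pass the real crontab directory as $1.
find_shutdown_jobs() {
    dir="${1:-/usr/spool/cron/crontabs}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Drop comment lines, keep lines naming a command that can stop the
        # system, and prefix each hit with the file it came from.
        grep -v '^#' "$f" | grep -E 'shutdown|haltsys|reboot|init [056]' | sed "s|^|$f: |"
    done
}
```

A hit is not proof of anything by itself; it only points at a job worth reading closely, such as one that starts a UPS monitoring daemon copied from another machine.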

Don't ya just love IT
***************************************
Party on, dudes!
 
The server still restarts itself.

Here's a thing I found in syslog.
I doubt that it has anything to do with my problem, but anyway. Anyone know what it is?

Apr 13 10:04:08 server syslog: SCOADM: localhost {sco_UUCPdevices} {ACU_tty1a} error SCO_OFACE_MSG_ERROR {error {{SCO_OSA_ERR_NO_SUCH_OBJECT_INSTANCE {The object instance ACU_tty1a does not exist.}}}}

Back to my real problem...
What can be causing my annoying restarts?
Last Saturday it restarted itself. It then ran like clockwork the whole week, and this Saturday and Sunday it restarted again. It's been stable since last night, but I know it's just a matter of time.

I've spoken with Compaq Support regarding hardware problems, but they don't know what's wrong. They are sending over a couple of new memory modules anyway.

What do you think it could be? Hardware?
Is there ANY way I can capture panic messages for examination?

Thanks again for your input. /Sören
 
If PANICBOOT is set to NO, it isn't a Panic. You are either losing power (ups or system power supply) or software is shutting you down.

If it's the latter, the script at "How do I find out who or what halted my system?" will help you find out what.
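The referenced script is not reproduced in this thread. Purely as an illustration of the general idea (not Tony's script), a small logging helper could record which process invoked a shutdown-type command. Everything named here is hypothetical: the function `log_caller`, the log path passed to it, and the way it is wired in. It assumes a shell that sets `$PPID` (ksh or a POSIX sh) and a `ps` that accepts `-p`.

```shell
#!/bin/sh
# Hypothetical helper: append a record of who invoked a shutdown-type command.
# Assumes the shell sets $PPID and that ps accepts -f and -p.
log_caller() {
    logfile="$1"    # file to append to (a path of your choosing)
    label="$2"      # name of the command being wrapped, e.g. shutdown
    {
        echo "=== `date` : $label invoked by PID $$ (parent $PPID)"
        # Show the parent process; fall back to a note if ps cannot show it.
        ps -f -p "$PPID" 2>/dev/null || echo "(parent process not visible)"
    } >> "$logfile"
}
```

One possible arrangement would be a wrapper script that calls `log_caller` and then runs the real command under a different name, so that the log survives the restart and can be read afterwards.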

Tony Lawrence
SCO Unix/Linux Resources tony@pcunix.com
 
No user except me has access to shutdown or haltsys.
I don't think it is something like that, but I'll take an extra look.

The software that runs on this machine is a MUMPS database with medical records. Does anyone have experience with MUMPS databases (or similar) on SCO? If so, have you ever run into problems like mine?

Is there any way I can check the power supplies from within the OS?
What does the apm.cmd=arg in /etc/default/boot do?
I have redundant power supplies on this machine.
Is PWRCHECK=Y in /etc/default/boot something I should look into?

Thx again. /Sören
 
The point was not that some USER is shutting you down. The point is to check whether some PROCESS is doing it.


man boot
man apm
man pwrsh

explain what this stuff is all about, and yes, having defective hardware could cause this stuff.

Tony Lawrence
SCO Unix/Linux Resources tony@pcunix.com
 
I found on a few of our servers that the Compaq server agent was shutting down the system. It would report that the system had a problem with the power supply (or another monitored hardware function) and was shutting down. I do believe that this was logged in a file called agenterrs.log; I'm not sure where, but /usr/adm/logs seems right. I ended up removing the Compaq agent from the server.

 
I had found that it was the Compaq Insight Manager... just remembered which agent...
 
Tony: I understood that you meant processes. I was just making a point that no USER had access to the commands. I did look into the thing, to see if some PROCESSES shut down the system, but I didn't find anything. I've never thought of putting scripts into, for instance, the shutdown script/command. Pretty nifty.
That's an excellent site you have there at
sturgis: OK, I'll keep that in mind. The Insight Manager isn't installed. I've heard of people having problems with it, so I left it alone.

I finally discovered the problem.
It was the on-off button.
It's been replaced and the server is fine.

Thank you all for your help. /Sören
 
He he. The old 'engineer reset' rears its head again, eh? Glad it's sorted. Cheers.
 