
Sick 4.11 Server

Status
Not open for further replies.

jeffkelly

Technical User
Aug 18, 2003
71
US
I have a NW4.11 server that started a new behavior about six months ago where it would abruptly freeze with 99% utilization. Normal utilization is about 6% at any given time. Since the behavior began, the interval between failures has slowly decreased and is now anywhere from 1/2 day to 4 days. The only things that log into the server are 61 utility computers (no people) with minimal network and server activity.

I'm about to rebuild the server (see my questions in another thread), but I have to band-aid my current server until I'm done. Below is the log for two crashes yesterday. Can someone shed some light on what's happening?

The server is a P3-500 w/ 128MB RAM and two 40GB IDE HDDs (mirrored). The server has run flawlessly for years.

Thanks in advance to all who answer,

Jeff



---------------------------------
First crash yesterday, as logged in the abend log. It is not normal to receive an abend when the server freezes. This is something else...HDD related perhaps. If you want the loaded NLM list from the abend log, let me know.
---------------------------------

Server JEFF halted Saturday, December 16, 2006 9:45:57 am
Abend 1: Server-4.11a: SubAllocFreeSectors given invalid FAT chain end that was already free.

Registers:
CS = 0008 DS = 0010 ES = 0010 FS = 0010 GS = 0010 SS = 0010
EAX = 00000001 EBX = 07C1B020 ECX = 0000012A EDX = 0094B000
ESI = 067EA980 EDI = 0000012A EBP = 07E91FDC ESP = 07E91FC8
EIP = 00000000 FLAGS = 00007202


Running process: Server 02 Process
Created by: SERVER.NLM
Stack pointer: 7E92434
Stack limit: 7E8F440
Scheduling priority: 0
Wait state: 00
Stack: F809A077 ?
--D7F213AC ?
--300A504F ?
--31323136 ?
--0000001B ?
--00000000 ?
--06827010 ?
--067EA980 ?
--00000000 ?
F8010C24 ?
--00000000 ?
--9B00012A ?
--067EA980 ?
--00000000 ?
--00000000 ?
--00003476 (DS.NLM|DSF90AA0B0+143)
--0000004F ?
--9B00012A ?
--00000000 ?
--000003E8 ?
--00000000 ?
--00000000 ?
--00000000 ?
F800FA95 (SERVER.NLM|(Code Start)+FA95)
--05015C00 ?
--D81C53A0 ?
--00000000 ?
--07E92018 (IDEATA.HAM|consoleScreen+7184)
--00000000 ?
F800FA95 (SERVER.NLM|(Code Start)+FA95)
--05017C00 ?
--D81C54A8 ?

Additional Information:
The NetWare OS detected a problem with the system while executing
a process owned by SERVER.NLM. It may be the source of the problem or
there may have been a memory corruption.


---------------------------------
First and second crash from SYS$LOG.ERR. Second crash didn't log an abend.
---------------------------------

12-16-06 9:45:57 am: SERVER-4.11-4
Severity = 4 Locus = 18 Class = 6
WARNING! Server JEFF has experienced a critical error. It is going down in 2 minutes. Save your files and logout.

12-16-06 9:53:21 am: SERVER-4.11-745
Severity = 4 Locus = 14 Class = 6
User UTILITY1 on station 75 cleared by connection watchdog.
Connection cleared due to communication or station failure.


---61 of these messages when all attached computers disconnected---


12-16-06 10:02:15 am: SERVER-4.11-745
Severity = 4 Locus = 14 Class = 6
User UTILITY61 on station 66 cleared by connection watchdog.
Connection cleared due to communication or station failure.

12-16-06 11:17:07 am: SERVER-4.11-1741
Severity = 4 Locus = 3 Class = 6
Remirroring partition #0.

12-16-06 11:17:09 am: DS-6.11-28
Severity = 1 Locus = 17 Class = 19
Bindery open requested by the SERVER

12-16-06 11:17:10 am: DS-6.11-26
Severity = 1 Locus = 17 Class = 19
Directory Services: Local database is open

12-16-06 11:17:18 am: SERVER-4.11-2541
Severity = 0 Locus = 18 Class = 19
System time changed from file server console.
New time is 12-16-2006 11:17:19 am

12-16-06 11:17:28 am: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 11:17:31 am: DS-6.11-30
Severity = 0 Locus = 17 Class = 19
Bindery close requested by the SERVER

12-16-06 11:17:31 am: DS-6.11-27
Severity = 1 Locus = 17 Class = 19
Directory Services: Local database has been closed

12-16-06 11:17:31 am: SERVER-4.11-2009
Severity = 4 Locus = 7 Class = 19
JEFF TTS shut down
because backout volume SYS was dismounted.

12-16-06 11:32:45 am: SERVER-4.11-1741
Severity = 4 Locus = 3 Class = 6
Remirroring partition #2.

12-16-06 11:32:47 am: DS-6.11-28
Severity = 1 Locus = 17 Class = 19
Bindery open requested by the SERVER

12-16-06 11:32:47 am: DS-6.11-26
Severity = 1 Locus = 17 Class = 19
Directory Services: Local database is open

12-16-06 11:32:56 am: SERVER-4.11-2541
Severity = 0 Locus = 18 Class = 19
System time changed from file server console.
New time is 12-16-2006 11:32:57 am

12-16-06 11:33:06 am: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 11:34:41 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:35:09 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:35:37 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:36:05 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:36:33 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:37:01 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:37:09 am: DS-6.11-50
Severity = 1 Locus = 17 Class = 19
Established communication with server KELLY

12-16-06 11:37:13 am: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 11:37:17 am: DS-6.11-50
Severity = 1 Locus = 17 Class = 19
Established communication with server KELLY

12-16-06 11:45:16 am: SERVER-4.11-2366
Severity = 4 Locus = 3 Class = 6
Redirected block 4C4ABh to 17h on Device #0.

12-16-06 12:02:21 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 12:02:21 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 12:32:59 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 12:32:59 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 1:03:39 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 1:03:39 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 1:34:16 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 1:34:16 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 1:52:29 pm: SERVER-4.11-1633
Severity = 0 Locus = 3 Class = 19
Synchronized partition #2.

12-16-06 1:52:29 pm: SERVER-4.11-1632
Severity = 4 Locus = 3 Class = 6
All mirrored partitions on this system are synchronized.

12-16-06 2:14:31 pm: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 21 successful polling loops.

12-16-06 2:14:56 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 3:56:44 pm: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 22 successful polling loops.

12-16-06 3:57:10 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 7:17:50 pm: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 32 successful polling loops.

12-16-06 7:18:13 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 8:33:02 pm: SERVER-4.11-1741
Severity = 4 Locus = 3 Class = 6
Remirroring partition #0.

12-16-06 8:33:05 pm: DS-6.11-28
Severity = 1 Locus = 17 Class = 19
Bindery open requested by the SERVER

12-16-06 8:33:05 pm: DS-6.11-26
Severity = 1 Locus = 17 Class = 19
Directory Services: Local database is open

12-16-06 8:33:13 pm: SERVER-4.11-2541
Severity = 0 Locus = 18 Class = 19
System time changed from file server console.
New time is 12-16-2006 8:33:14 pm

12-16-06 8:33:23 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 8:33:34 pm: DS-6.11-30
Severity = 0 Locus = 17 Class = 19
Bindery close requested by the SERVER

12-16-06 8:33:34 pm: DS-6.11-27
Severity = 1 Locus = 17 Class = 19
Directory Services: Local database has been closed

12-16-06 8:33:35 pm: SERVER-4.11-2009
Severity = 4 Locus = 7 Class = 19
JEFF TTS shut down
because backout volume SYS was dismounted.

12-16-06 8:47:33 pm: SERVER-4.11-1741
Severity = 4 Locus = 3 Class = 6
Remirroring partition #0.

12-16-06 8:47:35 pm: DS-6.11-28
Severity = 1 Locus = 17 Class = 19
Bindery open requested by the SERVER

12-16-06 8:47:35 pm: DS-6.11-26
Severity = 1 Locus = 17 Class = 19
Directory Services: Local database is open

12-16-06 8:47:43 pm: SERVER-4.11-2541
Severity = 0 Locus = 18 Class = 19
System time changed from file server console.
New time is 12-16-2006 8:47:46 pm

12-16-06 8:47:55 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 8:49:31 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:49:59 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:50:27 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:50:55 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:51:23 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:51:51 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:52:00 pm: DS-6.11-50
Severity = 1 Locus = 17 Class = 19
Established communication with server KELLY

12-16-06 8:52:02 pm: DS-6.11-47
Severity = 1 Locus = 17 Class = 19
Unable to communicate with server KELLY

12-16-06 8:52:07 pm: DS-6.11-50
Severity = 1 Locus = 17 Class = 19
Established communication with server KELLY

12-16-06 9:00:07 pm: SERVER-4.11-2366
Severity = 4 Locus = 3 Class = 6
Redirected block 45EB9h to 17h on Device #2.

12-16-06 9:17:14 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 9:17:14 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 9:47:52 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 9:47:52 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 10:18:31 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 10:18:31 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 10:49:10 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 10:49:10 pm: SERVER-4.11-2883
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on server JEFF are not all synchronized.

12-16-06 11:06:48 pm: SERVER-4.11-1633
Severity = 0 Locus = 3 Class = 19
Synchronized partition #0.

12-16-06 11:06:48 pm: SERVER-4.11-1632
Severity = 4 Locus = 3 Class = 6
All mirrored partitions on this system are synchronized.

12-16-06 11:09:33 pm: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 18 successful polling loops.

12-16-06 11:09:57 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-16-06 11:33:12 pm: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 14 successful polling loops.

12-16-06 11:33:36 pm: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-17-06 12:16:05 am: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 16 successful polling loops.

12-17-06 12:16:30 am: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-17-06 1:08:52 am: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 17 successful polling loops.

12-17-06 1:09:17 am: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.

12-17-06 2:51:08 am: TIMESYNC-4.15-72
Severity = 1 Locus = 17 Class = 19
Time synchronization has been lost after 22 successful polling loops.

12-17-06 2:51:32 am: TIMESYNC-4.15-138
Severity = 0 Locus = 17 Class = 19
Time synchronization has been established.



 
It looks like you might have a disk problem. Make a backup of your data, dismount your volumes, and run VREPAIR on them.

There is a chance that you will lose data, so be sure of your backup.
 
Each time the server crashes, we run VRepair (twice yesterday). There are never any errors reported during/after the scan. Could it be a controller flaking out or the HDD itself?
 
See if you can run any hardware diagnostics which may identify this. The issue is probably occurring sporadically and will therefore not be picked up by a VREPAIR after a reboot. Running some form of hardware diagnostics may prompt the issue to arise and be logged. Does the issue occur when there is a lot of disk activity?

Check your hardware vendor to see if they offer any diagnostic tools.

--------------------------------------
"Insert funny comment in here!"
--------------------------------------
 
The network runs with very low utilization. Every hour, we copy all files -- except system and public -- to our backup server. That's the only time CPU utilization goes up. We've compared the backup times to the crash times and they don't match.
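The comparison of backup times to crash times described above could be sketched roughly as follows. This is only an illustration: the post does not say at which minute of the hour the copy starts or how long it runs, so `backup_minute` and `window_minutes` are hypothetical parameters, and the only crash time used is the 9:45:57 am halt recorded in the abend log.

```python
from datetime import datetime, timedelta

def near_backup(event, backup_minute, window_minutes):
    """Return True if `event` falls within `window_minutes` after the most
    recent hourly backup start (assumed to begin at minute `backup_minute`)."""
    # Most recent backup start at or before the event.
    start = event.replace(minute=backup_minute, second=0, microsecond=0)
    if start > event:
        start -= timedelta(hours=1)
    return event - start <= timedelta(minutes=window_minutes)

# Halt time taken from the abend log; further crash times would be added here.
crashes = [datetime(2006, 12, 16, 9, 45, 57)]

if __name__ == "__main__":
    # Hypothetical schedule: backup starts on the hour and runs about 10 minutes.
    for crash in crashes:
        print(crash, "during backup window:", near_backup(crash, 0, 10))
```

If none of the crash times fall inside the window for the real schedule, the hourly copy is unlikely to be the trigger, which matches what was observed here.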

It's extremely frustrating because whatever is happening (without warning) causes the server to freeze w/ 99% utilization. Nothing is typically logged except an occasional generic message saying the server is dismounting in 2 minutes. Usually the server just freezes.

The server was built using the on-board IDE controllers (Abit motherboard). At this point, I'm apprehensive about running intrusive diagnostic processes like DSRepair, HDD scanning utility, etc. out of fear it will kill the server. I'm also worried that whatever is happening here will materialize on the new server I'm building. Based on a few cryptic log entries over the past few months, I'm thinking the problem is hardware...possibly bad memory and/or storage system (HDD, controller, etc.).
 
Running a DSREPAIR will not help unless the server crashes are causing DS corruption.

However the following TID from Novell suggests memory rather than Hard Disk as a possible cause?


--------------------------------------
"Insert funny comment in here!"
--------------------------------------
 
Hi

Looks like one of the disks to me. I'm not sure how you have your disks and partitions set up, but I suspect it's one of the disks which has the SYS volume on it, probably Device #2. Here are my thoughts.
"SubAllocFreeSectors given invalid FAT chain end that was already free" - Sounds like corruption of the FAT (File Allocation Table). While this can be caused by server memory or the disk controller, I think you would then see some other issues as well, e.g. errors in your VREPAIR and the like.
"Redirected block 45EB9h to 17h on Device #2." - Indicates a bad block on device #2, which is being redirected to a spare block. Drives are numbered 0, 1, 2, etc.
"JEFF TTS shut down because backout volume SYS was dismounted." - A critical error has dismounted the volume. NetWare does this automatically to protect the integrity of the volume when it receives an error that the OS cannot handle.
"Remirroring partition #0." - Indicates one partition/drive does not match the other, i.e. a mirror mismatch.
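For anyone wanting to pull these messages out of a long SYS$LOG.ERR rather than read it by eye, here is a minimal Python sketch. It assumes the entry layout shown in the log posted above (a timestamp/module header line, a Severity line, then the message text); the function names `parse_entries` and `summarize` are made up for this example and are not part of any NetWare tool.

```python
import re
from collections import Counter

# Header line of an entry, e.g. "12-16-06 11:45:16 am: SERVER-4.11-2366"
HEADER = re.compile(
    r"^(\d{1,2}-\d{1,2}-\d{2}) +(\d{1,2}:\d{2}:\d{2}) (am|pm): "
    r"([A-Z]+)-([\d.]+)-(\d+)$"
)
# Hot Fix message, e.g. "Redirected block 45EB9h to 17h on Device #2."
REDIRECT = re.compile(
    r"Redirected block ([0-9A-F]+)h to ([0-9A-F]+)h on Device #(\d+)"
)

def parse_entries(text):
    """Split log text into entries with module, message ID and body lines."""
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            current = {"date": m.group(1),
                       "time": f"{m.group(2)} {m.group(3)}",
                       "module": m.group(4),
                       "msg_id": int(m.group(6)),
                       "body": []}
            entries.append(current)
        elif current is not None and line and not line.startswith("Severity"):
            current["body"].append(line)
    return entries

def summarize(entries):
    """Count entries per (module, message ID) and list redirected blocks
    as (device, bad block, spare block) with block numbers as integers."""
    counts = Counter((e["module"], e["msg_id"]) for e in entries)
    redirects = []
    for e in entries:
        m = REDIRECT.search(" ".join(e["body"]))
        if m:
            redirects.append(
                (int(m.group(3)), int(m.group(1), 16), int(m.group(2), 16)))
    return counts, redirects

# Excerpt copied from the log posted in this thread.
SAMPLE = """\
12-16-06 11:45:16 am: SERVER-4.11-2366
Severity = 4 Locus = 3 Class = 6
Redirected block 4C4ABh to 17h on Device #0.

12-16-06 12:02:21 pm: SERVER-4.11-1631
Severity = 4 Locus = 3 Class = 6
The mirrored partitions on this system are not all synchronized.

12-16-06 9:00:07 pm: SERVER-4.11-2366
Severity = 4 Locus = 3 Class = 6
Redirected block 45EB9h to 17h on Device #2.
"""

if __name__ == "__main__":
    counts, redirects = summarize(parse_entries(SAMPLE))
    for (module, msg_id), n in counts.most_common():
        print(f"{module}-{msg_id}: {n}")
    for device, bad, spare in redirects:
        print(f"Device #{device}: block {bad:X}h redirected to {spare:X}h")
```

Run against the posted log, a tally like this shows a redirected block on Device #0 after the first restart and on Device #2 after the second, so it may be worth checking both drives rather than only one.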

Hope this is helpful.

Good luck

David



 
Well the server crashed twice yesterday and again this morning. We moved our utility machines to the backup server. Since then, we noted the following on our primary server:

1. CPU fan wasn't turning (which probably caused all of this).
2. The capacitors around the CPU slot are bulged (bad thing).

On top of this, the primary HDD suddenly started sounding like a jet engine after my engineer moved it to another computer. To make matters worse, the mirror won't boot (I'm hoping the guy who installed the OS forgot to write the boot track). The hard drives are being overnighted to me.

It is paramount that I finish building my new 4.11 servers. There is a post above this one related to the new build. Feel free to answer my questions as they arise.

Thanks Much.





 
It is possible that he hasn't copied the boot, as this is a common fault with software mirrors: people always forget about the boot and sometimes the nwserver directory.

Out of curiosity, is the disk a Seagate?
 
Western Digital WD150 (ATA66). Pretty old. I hope the person forgot to write the boot info. Beyond that, will the mirror load Netware and mount the volumes if the DOS partition is intact and identical to the primary drive that crashed?
 