Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!


Problems with system shutting down

Status
Not open for further replies.

phonesaz

Vendor
Dec 18, 2006
880
US
I am involved in a processor repair unlike anything I have ever seen. Here is the scenario:

A Merlin Magix is shutting down for no reason; we have added a battery backup and had an electrician check the power. Skipping through other details so this post isn't too long, last night I did a backup, put in a new processor that I had restored to factory defaults, and reloaded from the backup. The processor rebooted in F state. Did the whole thing again with a separate backup file, same result. F state and key mode. So I put the old processor back in, and the system came up with the correct programming.

The customer insists the system reboots all the time, but that it is far worse at approximately 8-9 at night and 10-11 in the morning. I happened to be on site doing all of the above at about 7:30 PM. I had the system put back together and was sitting there for a few minutes pondering my next step when it spontaneously rebooted. Checking the clock, it was 8:25. It proceeded to reboot 4 times in the next 20 minutes, corroborating what the customer said had been happening.

Once the rebooting stopped, I cleared all the errors and did a board renumber.

I also noticed that when the system was rebooting over and over, first the C would come on, then the T1 would light orange, go blank for a few seconds, and then go to the red light one would expect until the system rebooted. I don't remember ever seeing the orange on in that sequence on another system before.

I did a maintenance status on the boards, and here is what it said before I cleared the errors (I didn't check it after):

01 - errors yes (412)
02 - errors yes (412)
03 - errors no (412)
04 - errors no (016)
05 - errors yes (DCD)
06 - errors no (008OPT)
07 - errors no (008OPT)
08 - errors yes (Mer Msg)

The transient error log has errors relating to reboots and restarts that began June 7, which was when the system began having difficulties. I have attached the error log; the entry I am puzzled by is the one dealing with slot 8.

I am not sure where to go with this next. The fact that the problem increases at certain times doesn't make sense. This system has been in place for about 10 years, nothing has changed, and there were no issues until this started. The following is an error log I saved last week; the errors last night were similar, just more of them. I am wondering what the card inserted/removed on slot 8 (the voice mail) is about and whether it could be part of the problem. Thanks for any input...

ERROR LOG


A Last 99 System Errors:

A Message ss/pp Cnt First Last Code
A SOFTWARE COLD START 00/00 - - 06/16 11:41:27 0003
A DUART STREAMING INT 00/00 - - 06/16 11:41:54 0013
A SOFTWARE WARM START 00/00 - - 06/16 11:41:54 0004

A Permanent Errors:

A Message ss/pp Cnt First Last Code

A Transient Errors:

A Message ss/pp Cnt First Last Code
A DUART STREAMING INT 00/00 098 06/07 21:28:48 06/16 11:53:29 0013
A SOFTWARE WARM START 00/00 098 06/07 21:28:48 06/16 11:53:29 0004
A CARD INSERTED/REMOVED 08/00 098 06/07 21:28:48 06/16 11:53:29 000B
A SOFTWARE COLD START 00/00 069 06/07 21:28:49 06/16 11:53:30 0003
A POOL BUSY 00/01 037 06/07 21:29:31 06/16 11:36:28 4C02
A POOL BUSY &/OR OOS 00/01 121 06/07 21:29:44 06/16 11:42:16 4C03
A ON HOOK BEFORE READY 05/01 005 06/07 22:05:54 06/16 00:07:34 8405
A TIMEOUT COLD START 00/00 029 06/07 22:28:23 06/16 11:11:45 0001
A ON HOOK BEFORE READY 05/03 002 06/08 13:33:00 06/13 11:53:24 8405
A POOL BUSY 00/02 007 06/08 21:25:48 06/13 01:31:11 4C02
A DS1 MISFRAME ALARM 05/00 028 06/10 22:47:49 06/16 11:52:08 6C09
A ON HOOK BEFORE WINK 05/02 002 06/11 21:55:30 06/16 00:16:23 8404
A ON HOOK BEFORE READY 05/05 002 06/12 18:39:52 06/13 11:53:24 8405
A DS1 SLIP ALARM 05/00 001 06/13 01:15:46 06/13 01:15:46 6C0A
A ON HOOK BEFORE READY 05/10 001 06/13 09:59:18 06/13 09:59:18 8405
A ON HOOK BEFORE READY 05/06 001 06/13 11:53:24 06/13 11:53:24 8405
A INVALID SLOT INTERRUPT 00/00 003 06/13 21:07:11 06/15 23:11:02 0010
A ON HOOK BEFORE WINK 05/06 001 06/14 01:34:51 06/14 01:34:51 8404
A ON HOOK BEFORE READY 05/02 001 06/14 22:14:03 06/14 22:14:03 8405
A ON HOOK BEFORE WINK 05/07 001 06/15 07:52:31 06/15 07:52:31 8404
A ON HOOK BEFORE WINK 05/04 001 06/16 00:16:23 06/16 00:16:23 8404
A ON HOOK BEFORE WINK 05/10 001 06/16 00:16:23 06/16 00:16:23 8404
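To sort through a log like the one above, it can help to split each transient entry into its fields, total the counts by slot, and group entries that share identical timing. A minimal Python sketch, assuming the single-space layout shown in the post and reading ss/pp as slot/port (as the thread does for the 08/00 slot 8 entry); the names `parse_transient`, `counts_by_slot`, and `shared_timing` are hypothetical, and the sample lines are copied from the log above:

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass

# Layout assumed from the post, e.g.
# "A DS1 MISFRAME ALARM 05/00 028 06/10 22:47:49 06/16 11:52:08 6C09"
# Fields: message, ss/pp, count, first (MM/DD hh:mm:ss), last, hex code.
ENTRY_RE = re.compile(
    r"^(?:A )?(?P<message>.+?) (?P<slot>\d{2})/(?P<port>\d{2}) "
    r"(?P<count>\d{3}) (?P<first>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<last>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?P<code>[0-9A-F]{4})$"
)


@dataclass
class Entry:
    message: str
    slot: int   # assumed: "ss" = slot number
    port: int   # assumed: "pp" = port (or pool index on 00/pp entries)
    count: int
    first: str
    last: str
    code: str


def parse_transient(lines):
    """Return an Entry for every line matching the transient-log layout.
    Header lines and "- -" entries from the last-99 list are skipped."""
    entries = []
    for line in lines:
        m = ENTRY_RE.match(line.strip())
        if m:
            entries.append(Entry(m["message"], int(m["slot"]), int(m["port"]),
                                 int(m["count"]), m["first"], m["last"], m["code"]))
    return entries


def counts_by_slot(entries):
    """Total occurrence count per slot."""
    totals = Counter()
    for e in entries:
        totals[e.slot] += e.count
    return totals


def shared_timing(entries):
    """Group messages whose count, first time, and last time are identical."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e.count, e.first, e.last)].append(e.message)
    return {key: msgs for key, msgs in groups.items() if len(msgs) > 1}


SAMPLE = """\
A DUART STREAMING INT 00/00 098 06/07 21:28:48 06/16 11:53:29 0013
A SOFTWARE WARM START 00/00 098 06/07 21:28:48 06/16 11:53:29 0004
A CARD INSERTED/REMOVED 08/00 098 06/07 21:28:48 06/16 11:53:29 000B
A POOL BUSY &/OR OOS 00/01 121 06/07 21:29:44 06/16 11:42:16 4C03
A DS1 MISFRAME ALARM 05/00 028 06/10 22:47:49 06/16 11:52:08 6C09"""

if __name__ == "__main__":
    entries = parse_transient(SAMPLE.splitlines())
    print(counts_by_slot(entries))
    for key, msgs in shared_timing(entries).items():
        print(key, msgs)
```

Grouping by identical timing makes coincidences easy to spot: in the transient log above, DUART STREAMING INT, SOFTWARE WARM START, and CARD INSERTED/REMOVED on 08/00 all show a count of 098 between 06/07 21:28:48 and 06/16 11:53:29.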
 
Sounds like Power problems to me.

(Or heat)

Considering you are in Arizona, what's the environment like in the phone room?

Now, if it's power, you will probably come out ahead using a Dranetz device to monitor and record power issues.
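Once a recorder is in place, the useful follow-up is lining its event times up against the reboot times. A minimal Python sketch, assuming both the recorder's events and the reboot times have already been exported as plain timestamps; it does not read any particular recorder's export format, and the function `match_events`, the two-minute window, and the sample times are all hypothetical:

```python
from datetime import datetime, timedelta


def match_events(reboots, power_events, window=timedelta(minutes=2)):
    """For each reboot time, list the recorded power events that fall within
    +/- `window` of it. Both arguments are lists of datetime objects."""
    return {
        reboot: [p for p in power_events if abs(p - reboot) <= window]
        for reboot in reboots
    }


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    day = datetime(2000, 1, 1)
    reboots = [day.replace(hour=20, minute=25), day.replace(hour=20, minute=31)]
    sags = [day.replace(hour=20, minute=24, second=40), day.replace(hour=22)]
    for reboot, hits in match_events(reboots, sags).items():
        print(reboot.time(), "->", [h.time() for h in hits] or "no power event")
```

A reboot with no recorded event nearby does not rule power out (the recorder's thresholds may miss it), but a consistent match is strong evidence.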



 
The room is surprisingly clean and air conditioned. They had an electrician out, who said everything was fine. But I don't know whether the electrician was there while the problem was occurring, or whether that would make a difference. Is there anything in the monitor that would capture what caused the shutdown, if it were running when the system went down?
 
Have you checked the power supply voltages during and after the rebooting sequences? If you suspect the Mer Msg, you might remove it for a while to see what effect, if any, it has. The error messages related to the DS1 on the 100DCD module indicate it is configured for wink-start E/M signaling (point of info). You might also try re-seating each module, since this could be a backplane issue.

....JIM....
 
Is there something in the building being turned on or off when this happens?

Since you have eliminated HEAT, then power could be your next suspect.

Coming up in "F" (Frigid Start) can mean that the backup battery on the processor is shot, or strapped out.

We used to take the strap off of them in our lab to make a quick re-config easier. But you don't want that in the real world.




 
Power definitely hints at the problem, but here are some other random thoughts:

* Don't get too hung up on the two 412 boards, as their errors are likely generated by the pools being OOS. "POOL BUSY 00/01" means that the first pool, usually Pool 70, is busy or OOS. That holds if you have POTS lines on both 412 boards that are in that pool.

* Same goes with the DCD, obviously.

* The 000B error on the MerMSG blade should probably be overlooked.

* The hours of 9:00pm to 10:30pm DO seem suspicious, but could be coincidental.

* DUART STREAMING INT. This one bugs me. A DUART is a Dual UART, or Dual Universal Asynchronous Receiver/Transmitter. While asynchronous suggests "serial", it can also be made to function for synchronous transmission, which depends entirely on timing (no start or stop bits). The whole UART thing was originally an invention of National Semiconductor but was, through a series of improvements and manufacturers, bested by Motorola in the mid '70s. Their UART, with its combination of buffers, clock, and read/write logic, was adopted by AT&T. The errors you see in the MAGIX regarding buffer overrun, underrun, framing/parity errors, and "long break" come directly from Motorola's UART and MC68EC020 processor integration. My experience with Motorola's design is that they also employ "watchdog timers" for endless-loop detection. That condition would also trigger a reboot if buffers hit an overflow mark.
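The buffer-overflow and watchdog behavior described in the DUART item can be illustrated with a toy model. This is a generic Python sketch, not MAGIX firmware: `RxFifo`, `Watchdog`, `run`, the FIFO depth of 3, and the 5-tick timeout are all hypothetical. It shows a receive buffer that starts losing bytes once the service loop stops draining it, and a watchdog that forces a restart when the loop stops checking in:

```python
from collections import deque


class RxFifo:
    """Toy receive FIFO. The depth is hypothetical, not the real DUART's."""

    def __init__(self, depth=3):
        self.depth = depth
        self.buf = deque()
        self.overruns = 0

    def receive(self, byte):
        # Bytes arrive on the line whether or not software keeps up.
        if len(self.buf) >= self.depth:
            self.overruns += 1  # byte lost: a real UART would flag an overrun
        else:
            self.buf.append(byte)

    def drain(self):
        # What a service routine does: read everything that is buffered.
        data = list(self.buf)
        self.buf.clear()
        return data


class Watchdog:
    """Toy watchdog: fires if it is not kicked for `timeout` ticks."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout
        self.resets = 0

    def kick(self):
        self.remaining = self.timeout

    def tick(self):
        self.remaining -= 1
        if self.remaining <= 0:
            self.resets += 1
            self.remaining = self.timeout
            return True  # time to force a restart
        return False


def run(ticks, stuck_from, fifo, wd):
    """One byte arrives per tick. The service loop drains the FIFO and kicks
    the watchdog every tick until it hangs at tick `stuck_from`; it stays
    hung afterwards, so each watchdog restart is followed by another."""
    restarts = []
    for t in range(ticks):
        fifo.receive(t & 0xFF)
        if t < stuck_from:
            fifo.drain()
            wd.kick()
        if wd.tick():
            fifo.buf.clear()  # a restart throws away whatever was buffered
            restarts.append(t)
    return restarts


if __name__ == "__main__":
    fifo, wd = RxFifo(depth=3), Watchdog(timeout=5)
    print(run(ticks=20, stuck_from=8, fifo=fifo, wd=wd), fifo.overruns)
```

Running it with `ticks=20, stuck_from=8` prints `[11, 16] 3`: two watchdog restarts and three lost bytes. With `stuck_from=20` the loop never hangs and there are no restarts. A real watchdog resets the processor in hardware; here `tick()` only returns a flag.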

The short version of this diatribe is that the processor is suspect. Now that's a serious accusation, as I almost NEVER, EVER blame the processor. But something is definitely rotten in the state of Denmark.

Tim Alberstein
 
Thanks guys, I can always count on you... It turns out they 'may' have had serious work done on the AC earlier this year, so I am going to replace the processor with another one. I will check it in my office first (I have NEVER had the frigid start problem before, so I guess I learn something new every day...) and also have the circuit the system is on monitored for power fluctuations. More to follow...
 
