Tek-Tips is the largest IT community on the Internet today!

Multiple APE units fail on overnight sync (sometimes pass); common error seen on failures

Status
Not open for further replies.

kevin906

MIS
Aug 4, 2006
167
US
I don't want to go way out in the weeds on this issue but maybe if someone has seen it before and the fix wasn't overly involved...
Multiple APE units for this site. V4.0 on old patches, and no avenue for upgrades with no service contracts. Some nights the APEs sync up with no issue. Most nights they fail. The only common denominator I can find is the errors logged when they fail: each APE logs multiple Int0D general protection error soft restarts against roughly similar software addresses. Then the process just moves the "working" copy of software back to :PDS: and tries again the next night. It attempts the restore about 5 times before giving up. Since the same problems occur on all of the units, it seems like whatever the problem is must be the same for each unit.

We installed a new disk on one of these units 3 weeks ago and it has yet to sync up with the mothership database. All of the proper commands were run to init the APE when the disk was replaced: DBC, APE, CPCI, APESM, etc. Access from the APE to the main site is working for FTP to the IPDA subdirectory. It all looks good but it won't sync. From the looks of the /.AS/BACKUP/var_hbr/rmx directory, it is copying the files from the main site. This unit logs the same type of Int0D errors the other locations show when they fail.

I have attached two text files from two different APE units just to illustrate the common (closely matching) errors recorded. I am not overly concerned, as this site is likely not long for this world, but if anyone has any input, thanks.
 
 https://files.engineering.com/getfile.aspx?folder=838fd4cb-4d14-4b8e-9168-11ec2a7020f3&file=APE_20_Interrupts.txt
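One way to check how closely the Int0D errors line up across units is to pull the fault addresses out of each log and see which ones appear on more than one APE. The sketch below is only illustrative: it assumes each failure line contains the text "Int0D" and a hex selector:offset style address, which is a guess at the log layout, and the unit names are placeholders. The regex would need adjusting to the real files.

```python
import re

# ASSUMPTION: fault addresses appear as hex "selector:offset" pairs,
# e.g. 1A2B:0000F3C4. The real log layout may differ.
ADDRESS = re.compile(r"\b[0-9A-Fa-f]{4}:[0-9A-Fa-f]{4,8}\b")


def int0d_addresses(log_text):
    """Return every address found on lines that mention Int0D."""
    found = []
    for line in log_text.splitlines():
        if "int0d" in line.lower():
            found.extend(match.upper() for match in ADDRESS.findall(line))
    return found


def shared_addresses(logs_by_unit):
    """Map each address to the sorted units that logged it.

    Only addresses seen on more than one unit are kept.
    """
    units_per_address = {}
    for unit, text in logs_by_unit.items():
        for address in set(int0d_addresses(text)):
            units_per_address.setdefault(address, []).append(unit)
    return {
        address: sorted(units)
        for address, units in units_per_address.items()
        if len(units) > 1
    }
```

This only catches exact matches. Since the addresses are described as roughly similar rather than identical, comparing just the selector part, or sorting the addresses and eyeballing neighbours, may be more telling.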
If you have problems with multiple APEs it makes sense the issue is with the host 4K rather than the APEs.

What software version is it exactly, for RMX and Assistant?

Are you able to reinstall the Assistant? Install files should be on the :SCR: area of the HD if you're lucky. You can force a reinstall by connecting on V24 and catching the UIXBIOS during Assistant reboot.

Do you have an APE which always works and never fails? I wonder if something in the network could be corrupting the FTP. It's unlikely, but this seems an odd problem. If you had an APE on a subnet local to the 4K which always worked, it might point the finger at the network.
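If the files can be pulled off both the host and an APE onto one PC, a rough way to test the corruption theory is to hash both copies and compare. This is a minimal sketch, assuming both trees are available as local directories; how the files are retrieved from the 4K and the APE is not covered here, and the function names are made up for illustration.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash one file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_hashes(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    hashes = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            hashes[os.path.relpath(full, root)] = sha256_of(full)
    return hashes


def compare_trees(source_root, copy_root):
    """Report files missing from the copy, extra in it, or different."""
    source = tree_hashes(source_root)
    copy = tree_hashes(copy_root)
    return {
        "missing": sorted(set(source) - set(copy)),
        "extra": sorted(set(copy) - set(source)),
        "different": sorted(
            path for path in source.keys() & copy.keys()
            if source[path] != copy[path]
        ),
    }
```

An empty "different" list would suggest the transfer itself is clean and the problem sits in what is being sent, not in how it travels.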
 
Alternatively, you need to force the speed to be set on all 100 Mb ethernet ports (STMI ports, NCUI ports, Cisco ports, etc.). There should be no auto-detection.
 
That is not true. AUTO is fine to use as long as it is used on both sides, i.e. on both the 4K and the data switch side. If one side is fixed but the other side is auto detect, that’s when you get the half duplex connection on the auto side.

In any case, neither STMI nor NCUI NIC ports are used here. APE HBR is a function of the Assistant/DSCXL, and although an accidentally negotiated half duplex connection will have an impact on voice RTP, FTP uses TCP, so its error recovery would likely mask the issue. The FTP would work, but with poor performance.

I’m not saying don’t check it, nobody wants half duplex connections, but AUTO is absolutely fine to use, as long as both sides are auto.
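The rule described above can be written out as a small truth table. This is a simplified sketch, not a model of any particular switch or board: it assumes both ends are capable of full duplex and that an auto side facing a fixed side falls back to half duplex, as stated above. The setting names are illustrative.

```python
def link_duplex(side_a, side_b):
    """Return the duplex each side ends up with.

    Each argument is "auto", "full" or "half" ("full"/"half" meaning forced).
    """
    def resolve(mine, peer):
        if mine != "auto":
            return mine  # a forced setting is used as-is
        # Auto against auto agrees on full; auto against a fixed peer
        # cannot learn the peer's duplex and falls back to half.
        return "full" if peer == "auto" else "half"

    return resolve(side_a, side_b), resolve(side_b, side_a)


def is_mismatch(side_a, side_b):
    """True when the two ends settle on different duplex modes."""
    duplex_a, duplex_b = link_duplex(side_a, side_b)
    return duplex_a != duplex_b


for pair in [("auto", "auto"), ("full", "full"), ("full", "auto")]:
    print(pair, link_duplex(*pair), "MISMATCH" if is_mismatch(*pair) else "ok")
```

Only the mixed case, fixed full on one end and auto on the other, comes out as a mismatch, which is the point being made: pick one approach and use it on both ends.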
 

Moriendi: software release is HiPath V4.1.76, Unixware V4_R4.0.25. No lectures on upgrading please, I have no way to get an upgrade. There is no single APE that ALWAYS succeeds; there is one that succeeds more often than the others. No APEs are on the same subnet; yes, that would be a great way to prove file corruption. We did recently reinstall Unixware-Assistant on one of the APEs, still no joy. Still seeing the same INT0D soft restarts over and over until it gives up, copies GLA back to PDS and stabilizes till the next try. We are wondering if there is a problem with the processor/memory capacity. Again, it seems odd for that many units to exhibit the same issue, with the soft restarts on virtually the same task address.
 
It's V4 R4.1?

:(

Can you post the output of dis-aps:,psgl,y*;

I don't think reinstalling on the APE will help here. But you have discovered that.

I think the problem is on the main 4K. I think you either have corruption on your current HD and it's being blindly copied to the APEs, or there is no corruption and you've found some weird bug that's taken years to surface but now you're in it.

Do you have good backups to the compact flash? If you made a copy-ddrsm today, you might copy corruption across. But you could make a regen now, and gendb it onto a compact flash backup which you know was taken before the problem arrived. Or, you could reload the ADP from the CF, and make an exec-updat:bp onto the CF, then reload. That might copy corruption down to the CF though, if there is any, but it might also load your DB onto a clean OS and solve it.

Maybe there is no corruption and it would be solved with a software upgrade. If you are on the earliest of the early V4's, there's a good chance an upgrade would solve it. Your existing codeword should still be valid on the last V4, so you just need the software. Perhaps someone could supply it to you. Are you confident enough to apply it? If you have a DDRSM CF backup AND the unixware install files (check in the :SCR: area), it can always be restored anyway.

 
DIS-APS:,PSGL,Y*;
H500: AMO APS STARTED
ADINIT STARTED
PROGRAM SYSTEM : Y0-EO0YC
VERSION NUMBER : 10
CORRECTION VERSION NUMBER : 003
PART NUMBER : P30252N4604BH7601|V4 R4.1.76
PROGRAM SYSTEM WITH CODE SUBSYSTEMS
INTERFACE VERSION:
PROGRAM SYSTEM DOES NOT CONTAIN ANY INTERFACE VERSIONS

DIR SUBSYSTEM | | OMF SUBSYSTEM
-----------------------+-+-----------------------
ZMITSC00.Y0-EM0.10.001 |*|ZMITSC00.Y7-PMT.10.001

PROGRAM SYSTEM : Y7-POTYT
VERSION NUMBER : 10
CORRECTION VERSION NUMBER : 002
PART NUMBER : P30252N4600U00400|V4 R4.0.25
PROGRAM SYSTEM WITH TEXT SUBSYSTEMS
INTERFACE VERSION:
PROGRAM SYSTEM DOES NOT CONTAIN ANY INTERFACE VERSIONS

DIR SUBSYSTEM | | OMF SUBSYSTEM
-----------------------+-+-----------------------
ZMITSC00.Y7-PMT.10.001 | |ZMITSC00.Y7-PMT.10.001

ADINIT COMPLETED
STATUS = H'0000
AMO-APS -111 SOFTWARE LOAD UPGRADE
DISPLAY COMPLETED;


I just found out the APE with the recently recovered Unixware had its disk replaced as well. In preparation for this a copy-ddrsm was done at the main site, and that CF was used to copy PDS onto the replacement disk at the APE unit. It loads and is stable on that copy. The commands needed to change it to APE configuration were done without issue. It's just when it fetches the overnight sync DB from the main site that it, like all the others, goes into fits: it copies GLA back to PDS, reloads and stabilizes till the next sync time. We have serviced this customer since early 2013, so I am not sure when this all started.
Yes, if we could get our hands on the next software version we can do the upgrade. Most of our sources are no longer available.
One of my colleagues has a lab at home; we might try SAVDB/REGEN from the main site to a replacement disk for the main and try that route. Thanks.
 
Lack of PM on this site is a pain in the ass. Do you want to drop me an email to 'itsabrandnewemail' at gmail.
 