AIX 4.2.0.0 7013 box broken - Final call for help.

szewczykm

MIS
Jul 31, 2002
37
US
This is my last SOS flare. If you read all of this, I greatly appreciate your attention. None of the suggestions given to me in past posts have helped me. I can only assume that the suggestions were based on newer versions of AIX or that they don't take into account that the only access I have to the OS is through a maintenance boot using a mksysb tape.

Here's the chronology:

I have an IBM 7013 with an external SSA cabinet using AIX 4.2.0.0 and 2 6214 SSA cards. My documentation and disks were all lost. I have limited knowledge of AIX and RS-6000 systems.

- System would boot fine, but the SSA drives would not come online

- The SSA cards would not activate.

- I searched for the same drivers online (SSA drivers 4.2.0.0) so I could reinstall them. I figured 2 bad cards seemed odd so it should be software.

- I called IBM because the 4.2.0.0 drivers were not available. I was directed to find the 4.2.1.0 drivers and I followed the instructions to install them.

- After following the instructions on installing the drivers, reboots of the system gave me an 888 700 102 0c8 error during the {299}.

- Assuming that the drivers must have messed something up I booted the system in maintenance mode. I ran installp -ug on the driver fileset but only part of the fileset could be removed.

- I found a CD with SSA drivers on it and a mksysb tape from 1996. This system is a legacy system and has not been changed since 1998 but is used for reading old data.

- In maintenance mode, the /dev/cd0 is missing. I have tried about 10 different commands (mkdev, cfgmgr, etc) based on many old newsgroup suggestions and nothing has brought back the /dev/cd0. So I cannot mount the CD to read the older drivers off of it.

- The mksysb tape is old and management is currently leery about restoring the OS with this tape. They want me to continue my search.

- I tried to turn on the sysdump with a -K, and that is not working on bootup (the 0c8 means that sysdump has been disabled; I don't know how to enable it). I figured I could get a better clue as to exactly why the system won't boot if I had this data.

- In the meantime, I was able to pick up two new replacement SSA cards. The system now sees them, but 1) I can't boot normally and 2) loading the drivers in maintenance mode doesn't work because I screwed up the drivers.

Through this process, I have not been able to use SMIT in maintenance mode. This has made it very difficult. I have an IBM 3151, but no matter how I change the TERM setting, etc., the SMIT screen comes up in dumb terminal mode (I'm guessing) and I cannot navigate any of the screens. I can hit enter and select the top of each menu, or I can use escape-3 to go back a screen and exit. I have tried using a laptop hooked to the serial port and no matter what emulation I try, I get the same results. SMIT is currently unusable. I can't use it to help me.

I have read two manuals on the 7013. I have read at least a hundred newsgroup articles. I have gone to tek-tips.com. I have called IBM. I have tried just about every venue I could to get past my problems and have dead-ended each time.

I'm planning on declaring failure on all fronts except for restore from the 1996 backup. But I figured I'd send this last SOS flare and hope that someone can give me new avenues to go down. These are the things I feel I need to get my problem solved:

1) Make my CDROM work in maintenance mode.
2) Get me drivers. 4.2.0.0 drivers for the IBM 6214 controller - on a floppy or downloaded.
3) Help me get SMIT to work in maintenance mode.
4) Help me understand how loading the driver would have made my system unbootable and help me to undo my changes. Or at least get the system booting normally so I can use my CDROM, SMIT, and other tools that work best when the system is running on a normal boot.

If I can't get past any of these 4, then I have no more avenues to go down except for A) Call IBM and get an RS-6000 specialist out here for $250 an hour or B) Attempt the OS restore from the old mksysb tape.

Any help you give me will be greatly appreciated. Thank you.

Mike
 
Yes, the data is on the SSA disks that aren't working right now.
 
Have you considered reinstalling AIX from scratch, and importing your root volume group? You can even install the latest version of AIX, which should have whatever drivers you need, then import the old VG.
 
Ugh, I meant import your OTHER volume group, not the root volume group.
 
This is as extreme as restoring the backup don't you think? Can you think of a reason the system would stop booting entirely because of a driver being loaded? I could understand if I updated the drivers for the SCSI card that all of the system drives are on, but that's not the case. It's like the system no longer booting because I updated the TTY card driver. You would assume that the card would stop working but the whole OS goes into the toilet?

Is there a file someplace I can modify to skip that step maybe? Or is it deeper than that?
 
`sysdumpdev -P -p /dev/hd3` where hd3 is where you want your sysdump. sysdumpdevstart fails but continues the boot.

An LED of 299 shows that the BLV will be loaded. If this LED code is passed, then the load has been successful. If, after passing 299, you get a stable 201, then you have to re-create the BLV.

SSA subsystem components use microcode to control their function. You should ensure that the microcode level and any drivers on all devices are correct.

Your bootlist must include the CD-ROM to boot from it. (bootlist -m normal cd0 hdisk0)

I am guessing that you are not getting past 299, which means a hardware error. Try removing the SSA adapters so that the system cannot find them during phase 1 of the boot, where base devices are configured to prepare the system for activating rootvg. In phase 2, rootvg is activated, and phase 3 is initiated by the init process loaded from rootvg.

Also do an `alog -ot boot` and check the bootlog for any errors.

 
I can boot from CD rom, but when I do it, I can't open it to load anything else. I boot from tape and I can't access the CDROM because it's not in the /dev/cd0.

I removed the controllers and it still stalls right after 299 (or during 299)

After I get rebooted again (30-40 minutes) I'll try the 'alog -ot boot'
 
If it is getting past 299 and you get a 201 then try to re-create the BLV. Are you in service mode when you load the CD? I assume you are.
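For anyone following along, here is a rough sketch of what "re-create the BLV" usually involves once rootvg has been accessed from the maintenance menu. It is a dry run that only prints the commands. The function name `print_blv_plan` and the disk name `hdisk0` are placeholders of mine, not something taken from this system, so check which disk really holds hd5 before running anything.

```sh
#!/bin/sh
# Dry run only: prints the usual BLV re-creation steps, does not run them.
# Assumes rootvg is already accessed from the maintenance menu and that the
# boot disk is hdisk0 (a placeholder; `lslv -m hd5` shows the real one).
print_blv_plan() {
    disk=${1:-hdisk0}
    echo "lslv -m hd5"                 # which disk holds the boot logical volume
    echo "bosboot -ad /dev/$disk"      # rebuild the boot image on that disk
    echo "bootlist -m normal $disk"    # point the normal bootlist at that disk
    echo "sync"                        # flush to disk before rebooting
}
print_blv_plan hdisk0
```

Review the printed list, then type the commands by hand in the maintenance shell with the correct disk name.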
 
Well, I've dealt with this kind of situation before, actually. At this point, any resolution is going to be "extreme".

You can continue to try to repair the existing system. You will need to boot into maintenance mode off a 4.2.0 cd (don't use a 4.3 cd as I believe it changes the format of cmos memory), remove all of the devices dependent on the mangled driver, then remove the driver itself. You may need to reboot once or twice to accomplish all of the necessary removing, and it might be of help to physically remove the SSA card for the duration so nothing during any bootup detects it and tries to load a driver for it.

As for obtaining a driver, that's a problem. I disagree with whoever told you to use a 4.2.1 driver on a 4.2.0 system. I can't help you on that matter.

If you had a well-mirrored rootvg, you could install a new OS on one of the old rootvg disks, that would leave the old rootvg importable if you need data off it.

Apart from that, assuming that all your data is truly in the offline VG, just install a new OS.

You don't have a working OS right now. Installing a new one will make sure that you do get a working OS.
 
Rereading your original post: is your LED sequence 888-102-700-0c8? This error code usually indicates a kernel panic or trap.
 
The panic() routine in the kernel puts its message into a buffer, writes it to the debug tty using the kernel debug program, and calls brkpoint(). If the kernel debugger is loaded, and an ASCII terminal is connected to a serial port, this will start the debugger; otherwise, it will cause a dump.
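To make the sequence easier to read off the display, here is a small sketch that splits a flashing-888 sequence into its parts. The field layout (after 888-102, a crash code and then a dump status code) is my assumption from general RS/6000 documentation, not something stated in this thread, and `decode_888` is a made-up name. The only code it interprets is 0c8, using the meaning given in the original post.

```sh
#!/bin/sh
# Split a flashing-888 sequence such as "888-102-700-0c8" into its parts.
# Assumption (not from this thread): after 888-102 the next two values are
# the crash code and the dump status code.
decode_888() {
    seq=$(printf '%s' "$1" | tr ' ' '-')   # accept spaces or dashes
    old_ifs=$IFS
    IFS=-
    set -- $seq                            # split on "-" into $1 $2 $3 $4
    IFS=$old_ifs
    if [ "$1" != "888" ]; then
        echo "not an 888 sequence"
        return 1
    fi
    case "$2" in
        102)
            echo "type=102 crash=$3 dump=$4"
            if [ "$4" = "0c8" ]; then
                # Meaning taken from the original post in this thread.
                echo "dump status 0c8: sysdump has been disabled"
            fi
            ;;
        103|105)
            echo "type=$2 (remaining values not decoded here)"
            ;;
        *)
            echo "type=$2 unknown"
            ;;
    esac
}
```

Calling `decode_888 "888 102 700 0c8"` prints the crash and dump fields separately, which makes it easier to tell which value to look up in the service guide.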
 
I'll try removing the drivers with the cards out, this makes sense. After all the learning I've done in the past week I think that if I can get the OS going again then re-load the old drivers I will be OK.

What messed me up was the 4.2.1.0 device drivers. If I can eradicate those completely I think I'll be OK. If not, then I'll have no choice but to start with the OS again.

I'll let you know what happens once I get the system back in maintenance mode. I hope removing the cards allows me to fully remove the drivers.
 
OK, some random ray of sunshine hit my brain and I figured the reason I couldn't remove the 4.2.1.1 driver was that the device was still defined. I removed (rmdev) the SSA adapter and then the fileset could be removed.
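In case it helps anyone else, the order that worked can be sketched like this. It is a dry run that only prints the commands. The device name `ssa0`, the function name `print_removal_plan`, and the fileset argument are placeholders for illustration, so check the lsdev and lslpp output for the real names.

```sh
#!/bin/sh
# Dry run: prints the removal steps in order instead of executing them.
# "ssa0" and the fileset name are placeholders; check lsdev/lslpp output
# on the real system for the actual names.
print_removal_plan() {
    device=$1
    fileset=$2
    echo "lsdev -C | grep -i ssa"     # see which SSA devices are still defined
    echo "rmdev -dl $device"          # delete the device definition first
    echo "lslpp -l $fileset"          # confirm the installed fileset level
    echo "installp -ug $fileset"      # then remove the fileset and dependents
}
print_removal_plan ssa0 devices.ssa.disk
```

The point is simply that the device definition has to go before installp will let go of the fileset.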

I rebooted and still had the same problem. However, I left the devices.ssa.disk fileset in there (higher than 4.2.0.0, but I forgot the numbers). After this reboot I will remove the SSA disks and then that driver. If it still won't boot even with every post-4.2.0.0 driver removed, then I'm officially suggesting the tape restore.

My only fear is that I'll damage data on the SSA disks. They're disconnected now so it's not a problem. But when I re-install the SSA drivers I hope nothing happens to the data. It stands to reason that you should be able to re-load your OS without blowing out data.
 
OK, I managed to remove anything higher than 4.2.0.0, driver-wise, and I still can't boot.

If anyone has any ideas for making the emulation work correctly for SMIT and/or accessing the CDROM after I've booted in maintenance mode, I'd like to hear them.

I'm at the end of my "Fix it" rope and I don't think I have any other options besides reloading the OS from the mksysb tape. If that doesn't work, well.... then I don't know.
 
You have to have an LED code during the boot phase; what is the LED code that it fails on? That is the only way to see where it is failing. Like my earlier postings said, if it makes it past 299 and you get a stable 201, then re-create the BLV. The LED is probably the tell-all clue to solving the boot problem in normal mode.
 
I get nothing after the 299. I watched it very closely to see if anything flashes on the screen or anything. Nothing. 299 then the screen clears for a few seconds and I get the flashing 888.
 
299 is a progress indicator: IPL ROM passed control to the loaded code. A flashing 888 indicates that a problem was detected but couldn't be displayed on the console. It will be followed by 102, 103, or 105. The reset button is used to scroll through the message.

Did you do an `errpt -a | grep CHECKSTOP`?

If system stops at 201 it could be either hardware or software. If it goes past 299 then back to 201 it is probably a damaged boot image. If it never gets to 299 then it is not a software problem because the boot process hasn't loaded any software yet.

I am guessing if it stops AT 299 then it is hardware.
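Put as a quick lookup, the rules of thumb above come out roughly like this. It is only a sketch of my guesses, not an official decode table. The function name and the four observation labels are invented for illustration.

```sh
#!/bin/sh
# Rules of thumb from this post, keyed by what the LED display does.
# The labels (before-299, stops-at-299, ...) are made up for this sketch.
classify_boot_led() {
    case "$1" in
        before-299)   echo "not software: nothing has been loaded yet" ;;
        stops-at-299) echo "probably hardware (a guess)" ;;
        299-then-201) echo "probably a damaged boot image: re-create the BLV" ;;
        stops-at-201) echo "could be either hardware or software" ;;
        *)            echo "not covered by these rules" ;;
    esac
}
```

For example, `classify_boot_led 299-then-201` points at the boot image, which is why re-creating the BLV keeps coming up.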
 
I would still try to re-create the BLV, though. Although I suspect hardware, I can't rule out a bad boot image.
 