Do I have a poorly SP frame/nodes?

reclspeak · Sep 11, 2006

Hi!

I'm investigating an SP frame and associated nodes that haven't been supported for some years (the previous admin team having been made redundant to cut costs). There was a power outage some months ago and some nodes can't be reached.

Unfortunately my knowledge of SP's is postage-stamp in size.

spmon -G -d displays:

1. Checking server process
Process 14196 has accumulated 36 minutes and 13 seconds.
Check ok

2. Opening connection to server
Connection opened
Check ok

3. Querying frame(s)
1 frame(s)
Check ok

4. Checking frames

        Controller  Slot 17  Switch  Switch    Power supplies
Frame   Responds    Switch   Power   Clocking  A    B    C    D
------------------------------------------------------------
1       yes         yes      on      0         on   on   on   on

5. Checking nodes
--------------------------------- Frame 1 ---------------
Frame  Node    Node         Host/Swch     Key     Env   FrontPanel
Slot   Number  Type  Power  Responds      Switch  Fail  LCD/LED
---------------------------------------------------------
1      1       high  on     no   autojn   normal  no    LCD is blank
5      5       wide  on     no   no       normal  no    LED blank
9      9       high  on     no   no       normal  no    LCD blank
13     13      wide  on     no   autojn   N/A     no    LCDs blank

I'm concerned about the "Host Responds"=no. I can ping the nodes, but cannot log in. telnet responds but no login prompt is displayed. s1term works to a node, displaying a Console login prompt, but the password isn't processed (this might be an unrelated problem with a NIS domain message being constantly displayed, though "root" isn't in the NIS passwd map).
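For anyone who wants to pull the unresponsive nodes out of a saved copy of the node section rather than reading it by eye, here is a minimal sketch. It assumes the column order implied by the header above (slot, node number, type, power, host responds, switch responds, key switch, env fail, front panel text); `NodeStatus` and `parse_node_rows` are made-up names for illustration, not part of PSSP.

```python
from typing import List, NamedTuple


class NodeStatus(NamedTuple):
    slot: int
    node: int
    node_type: str
    power: str
    host_responds: str
    switch_responds: str
    key: str
    env_fail: str
    panel: str


def parse_node_rows(text: str) -> List[NodeStatus]:
    """Parse node rows from pasted 'spmon -G -d' output (assumed layout)."""
    rows = []
    for line in text.splitlines():
        # Split into at most 9 fields; the last one keeps the panel text whole.
        fields = line.split(None, 8)
        # Data rows start with two integers: slot and node number.
        if len(fields) < 8 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        panel = fields[8] if len(fields) > 8 else ""
        rows.append(NodeStatus(int(fields[0]), int(fields[1]), *fields[2:8], panel))
    return rows


# Rows copied from the output in this post.
SAMPLE = """\
Slot Number Type Power Responds Switch Fail LCD/LED
1 1 high on no autojn normal no LCD is blank
5 5 wide on no no normal no LEDblank
9 9 high on no no normal no LCD blank
13 13 wide on no autojn N/A no LCDsblank
"""

rows = parse_node_rows(SAMPLE)
not_responding = [r.node for r in rows if r.host_responds != "yes"]
print("Nodes with Host Responds != yes:", not_responding)
```

The header and separator lines are skipped simply because they do not begin with two numbers, so the same function can be fed the whole report.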

A command like cshutdown produces (in its log file)

"Node spnode1en0 is not running. Unable to rsh to node"

However I can start and stop nodes with spmon and change the key switch setting. SMIT verification routines report that "Information in the SDR indicates that the node is not up"

PSSP version 4.1 is installed. The Control Workstation and nodes have AIX 4.3 (yes!)

Can any SP whizz out there point me in the right direction as to what I should be seeing if all was OK, or can a problem be diagnosed from what I've got so far?

Many thanks

recl

ogniemi · Sep 11, 2006

I guess you have a NIM server on your PSSP control workstation (the SP CWS). If so, you can configure a "Remote Maintenance Boot" (normally, if you had a CD, you would boot from the CD instead and go to the Maintenance Shell).

Boot your SP node from the network (NIM) and open a writable s1term (writable means: "s1term -w 1 1"). You should land in the maintenance shell, where you can check or change root's password.

kHz · Sep 11, 2006

I would be VERY surprised if your PSSP version was 4.1, because I thought 3.5 was the last version when they switched to CSM. There is a Parallel Environment 4.1, but that is for AIX 5L and you are on AIX 4.3. I would guess you are at PSSP 3.1 or 3.2.

But anyway....

Before I go on... did you realize at one time in the not too distant past there were only something like 2000 SP complexes installed worldwide? So not too many people have knowledge and experience on the SP and PSSP.

But anyway....

When the server went down it probably jacked the switch. But not sure if you have the old HPS or the SP Switch (that replaced it). With the HPS you had to fence and unfence the nodes, but not with the SP Switch.

You will also want to check the Primary and Primary Backup with Eprimary. The primary node initializes the switch.

Is the Emonitor subsystem started? You can check:

Code:

lssrc -a | grep emon

which should show active. If not then run:

Code:

Estart -m

Emonitor checks the nodes' host_responds value in the SDR. After you run this, run:

Code:

Eunfence <node>

You can check if the switch responds on the nodes by doing:

Code:

SDRGetObjects switch_responds

You also need to check that the worm is running on the nodes:

Code:

ps -ef | grep Worm

If you cannot start the switch then you will have to look in the error report and run an Eclock.
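The order of the checks above can be sketched as a small helper that looks at saved lssrc output and lists which of the commands from this post to run next. This is only an illustration: the sample lssrc lines are hypothetical, the function names are made up, and it assumes the status is the last word on the matching line.

```python
from typing import List


def emon_is_active(lssrc_output: str) -> bool:
    """True if the first line containing 'emon' ends with the status 'active'."""
    for line in lssrc_output.splitlines():
        if "emon" in line:  # same match as `lssrc -a | grep emon`
            return line.split()[-1] == "active"
    return False


def next_steps(lssrc_output: str, node: str) -> List[str]:
    """List the commands from this post in the order described (a sketch only)."""
    steps = []
    if not emon_is_active(lssrc_output):
        steps.append("Estart -m")  # start Emonitor, as suggested above
    steps.append(f"Eunfence {node}")
    steps.append("SDRGetObjects switch_responds")
    return steps


# Hypothetical lssrc line, for illustration only.
for cmd in next_steps(" emon                              inoperative", "spnode1en0"):
    print(cmd)
```

It only builds the command list; nothing is executed, so it is safe to try on a workstation that has no PSSP installed.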

Gotta like the SP!!

kHz · Sep 11, 2006

The s1term works because it's a serial connection.

kHz · Sep 11, 2006

You could try to rebuild the NIS maps and see if that works. Also, can users other than root log in without a problem? What is the domain message you get? You can also recreate the maps and transfer them using smitty mkmaps.

Never used NIS, I always used SP user management instead.

Breslau · Sep 12, 2006

sounds like your host(s) are having an issue with NIS to the point that the SP daemons on the host are not able to respond to the CWS, making it think the node(s) are down. same being true for telnet and your CLI login process.

unfortunately i don't have experience in NIS if you're using that for authentication.

the s1term is really the last line of defense in accessing a node within the SP env. if you still cannot get in that way, you could always attach a bootable device (like a CD drive) to your host and instruct the node to use that device to boot while you're watching the s1term. once up, get the box into maintenance mode and access the rootvg... it might be possible to then dig around and figure out what's going on.

the PSSP manuals have decent instructions on setting up kerberos tickets if you want to can NIS within your SP.

kHz · Sep 12, 2006

Guess it wasn't that important [bigsmile]

kHz · Sep 12, 2006

There is no need to boot a node into maintenance mode to fix any problem. The problem lies on the CWS and the switch getting jacked during the power outage.

NIS authentication is being used instead of SP user authentication and the master (on the CWS) is serving the collections.

Once the switch is fixed and the nodes respond, recreating and transferring the maps should take care of the NIS problem.

I just really like the SP! It is (ok was) a lot of fun to manage!

reclspeak · Sep 13, 2006

Terrific replies everyone. Sorry for the delay in replying but I'm on the road a lot at present.

I've managed to get two high nodes up, using the spmon -key service frame<x>/node<y> routine, then using spmon -p on frame<x>/node<y> and s1term -w x y. There's probably a better way of doing things, but this sufficed.
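The sequence above can be written down as a tiny helper that fills in the frame and node numbers. It is a sketch that only assembles the three command strings exactly as quoted in this post; `service_boot_commands` is a made-up name, and nothing is executed.

```python
from typing import List


def service_boot_commands(frame: int, node: int) -> List[str]:
    """Build the three commands quoted above for one frame/node pair."""
    target = f"frame{frame}/node{node}"
    return [
        f"spmon -key service {target}",  # key switch to service
        f"spmon -p on {target}",         # power the node on
        f"s1term -w {frame} {node}",     # writable serial console
    ]


for cmd in service_boot_commands(1, 9):
    print(cmd)
```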

Neither node had been up since early July after the outage (you can see how unused this frame is). /var was 100% full on one node, and neither could ypbind to the NIS server because their default route isn't working through FDDI (though other servers including a 43P node have the same route through FDDI working fine). I went into single-user mode, disabled NIS and loads of defunct NFS mounts, and rebooted; both nodes are now in multi-user state, albeit contactable only through their Ethernet NIC from the Control Workstation.

One node won't boot at all, whether in multi-user, single-user, diagnostics, or even a network boot from the Control Workstation/NIM. The LED shows "888", and when I use spmon -reset -t frame<x>/node<y> and spmon -L frame<x>/node<y> to cycle through the error codes I get either 103 (can't figure out the model number) and 803 (a spurious tape library configuration error code), or 404 (dunno what this is 'cos I'm away from my normal desktop with the codes recorded). I think this indicates some firmware corruption or a knackered motherboard.

I'm looking into the frame issues now. When I'm back on site I'll update with any new routines I find that might help someone in the future.

recl

Breslau · Sep 14, 2006

"There is no need to boot a node into maintenance mode to fix any problem. The problem lies on the CWS and the switch getting jacked during the power outage."

In general, any node on a switch will boot up without that interface, it's just another network as far as the host is concerned.

since the poster was able to get the machine up, but not log in, then they need a method to access the host at a point before it gets hosed in order to fix it. booting it off another boot image and accessing the rootvg is a way to troubleshoot a problem such as this one.

kHz · Sep 14, 2006

From the original description of the problem it appeared to be all because of the power outage which jacked the switch and the NIS maps.

If those had been the problem, then there wouldn't have been any need to boot nodes into maintenance mode.

And I am quite certain the login problem was caused by the NIS maps, since NIS authentication was being used instead of SP user authentication.

DukeSSD · Sep 14, 2006

888-103-803-404 is a crash code.
888: it crashed
103: it is a hardware problem (probably, could be a critical software problem like /var is full)
803-404: is the SRN, service request number.

SRN 803-xxx: filesystem full or device failure, where xxx is the FFC, failing function code.

FFC 404: not known but most 4xx are disk drives
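The breakdown above can be sketched as a small decoder. It covers only what is stated in this post (888 as the crash marker, 103 as the probable-hardware case, the last two groups forming the SRN, the final group being the FFC); `decode_888` and its return keys are made-up names, and any other code is reported as not covered rather than guessed.

```python
# Meanings taken from this post only; other type codes are deliberately left out.
CRASH_TYPES = {"103": "hardware problem (probably)"}


def decode_888(sequence: str) -> dict:
    """Split an LED sequence such as '888-103-803-404' into its parts."""
    parts = sequence.strip().split("-")
    if len(parts) != 4 or parts[0] != "888":
        raise ValueError(f"not a four-part 888 sequence: {sequence!r}")
    _, kind, srn_head, ffc = parts
    return {
        "crash_type": CRASH_TYPES.get(kind, "not covered in this thread"),
        "srn": f"{srn_head}-{ffc}",  # service request number
        "ffc": ffc,                  # failing function code
        "ffc_hint": "most 4xx are disk drives" if ffc.startswith("4") else "unknown",
    }


info = decode_888("888-103-803-404")
for key, value in info.items():
    print(f"{key}: {value}")
```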

http://publib.boulder.ibm.com/infocenter/eserver/v1r3s/index.jsp?topic=/iphau/x0070.htm
