I have an AIX 5.2 LPAR on a POWER5 670 in an HACMP cluster. During periods of heavy network I/O, paging goes up and lrud shows a CPU utilization of 300%. This causes the system to stop responding. We do not get any return codes or detailed logs pointing to the root cause on the machine, which drops out of the cluster; the cluster then reforms on the redundant node.
As far as I can see, the adapter and TCP kernel values are absolutely standard: tcp_sendspace 131072, tcp_recvspace 65536, MTU 1500, 1000 Mbps full duplex, auto-negotiate, jumbo frames off.
I'm normally used to disk i/o issues, so this is an unusual one for me.
Does anyone have any suggestions as to which path to follow to identify a bottleneck or diagnose an insufficiently configured parameter? A complication is that the problems occur out of hours.
Thanks in anticipation!
ave