We just migrated our production servers from AIX 4.3.3 to 5.1. It worked fine on the test machines, but we had a severe problem in production. We've not been able to figure out what happened (nor has IBM). Our systems, a timeline of events, and the problem follow.
2 7017-S85's in HA cluster
AIX 4.3.3 -> 5.1.0.0
SDD 1.3.3.6 -> 1.5.0.0 to communicate with ESS
HA 4.4.1
We failed over from secondary to primary. Apps running on primary.
Remove SDD from secondary.
Migration install on secondary.
Right at the end, we believe just after the auto reboot at the end of the install, the primary machine lost sight of some vpaths, which crashed the prod DB running on primary.
Finished the upgrade on secondary (including ML05, HA patches, and reinstalling the new SDD).
Bring systems up, and failover to secondary.
Remove SDD from primary.
Migration install on primary.
Right at the end, we believe just after the auto reboot at the end of the install, the secondary machine lost sight of some vpaths, which crashed the prod DB running on secondary (same issue as before).
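For reference, this is roughly how we looked at the vpath state from the surviving node, using the standard SDD commands (device names are just placeholders; exact output varies by SDD level):

    # show the vpath-to-hdisk/ESS serial mapping
    lsvpcfg
    # show the state of each path behind the vpaths (open/close/dead counts)
    datapath query device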
I had IBM software support look at my plan, and they saw no issues with it.
What actually happened was that the machine that was out of the cluster at that point put a persistent reserve on the disks being used by the other machine. What could have caused that? I thought only a varyonvg would do that, and I didn't run one. The concurrent VGs are not set to autovaryon; HA is the only thing that varies them on, and it doesn't come up automatically.
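For completeness, this is roughly how the reserve and the autovaryon setting can be checked (the lquerypr flags are from memory, so verify them against the SDD manual before touching anything; device and VG names are placeholders):

    # query the persistent reserve state on a vpath (SDD utility)
    lquerypr -Vh /dev/vpath0
    # preempt a reserve held by another host (use with care; check the SDD manual first)
    lquerypr -ph /dev/vpath0
    # confirm the VG is not set to autovaryon (AUTO ON: no); run while the VG is varied on
    lsvg datavg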
Has anyone experienced this before? Any ideas? I've run out of them.
One more note: we had a hardware failure on the shark a couple of weeks ago, which basically made the disks unavailable and had the same effect (crashing the DB). So when it first happened on the primary machine, it looked just like another hardware failure on the shark (not my machine, and the shark is redundant), and I pushed on with the upgrade. When it happened the second time, I realized it probably wasn't another hardware failure on the shark.
Thanks
Tim