We have a customer who has been experiencing a large number of Passport 8100 crashes and partial crashes over the last 6-8 months. They're running 3.2.2.2 code and most of the switches have 64M of RAM. We're in the process of upgrading to 3.5.5.0/256M RAM but have only done a couple so far.
The boxes all have two 8190 CPUs, an 8148TX card in slot 1 and an 8108GB in slot 7. Some are loaded the rest of the way with 8148TXs...others have no more than 4 cards (slots 1,5,6,7).
The symptoms at these sites have "roughly" followed one of two patterns: CPU lockups, where the whole switch is down...or disappearing cards, which doesn't usually kill the whole switch.
We've had several cases with anywhere from 4-10 cards in a chassis where, when you look at it via the CLI or JDM, it only shows the CPUs present. In some cases we're accessing the switch through cards that aren't showing up. That problem is usually fixed by reseating or replacing the card in slot 1...(Slot 1 or 2 is the timing source for the chassis. I'm beginning to think there may be issues with 8148s providing timing).
The CPU lockups are usually fixed by switching over to the standby. Sometimes we have to boot or power clear the switch. A day...two days...a week later the other CPU will lock up. We've had situations where we've swapped both CPUs and still get lockups.
We've looked for viruses, done "very verbose" CPU traces and Sniffer traces...we haven't spotted any suspicious traffic patterns. The forwarding databases aren't huge...CPU utilization is staying low. We've been round and round with Nortel but nothing's been isolated. In one case they had us swap a chassis and a week later it crashed again.
Is anyone else experiencing similar problems with the type of hardware/code we're running? Or just problems that are difficult to isolate with PP8100s? Any ideas?