Passport 8300 Random Fail-Over problem

Status
Not open for further replies.

Googer

Technical User
Apr 30, 2004
60
US
We have two brand new Passport 8300 switches fully loaded with 8324GTX (10/100/1000) cards, running version 2.0.0.1 of the code. Our servers connect to these switches at a mix of 100 Mbps full duplex and 1000 Mbps full duplex. Both switches are uplinked to a Split MLT core of two Passport 8600 switches. The problem we are having is that the Passport 8300s will occasionally execute a fail-over from one switch fabric to the other for no apparent reason. This happens no matter which switch fabric is primary. I have tried replacing the switch fabric modules on one of these switches and the problem is still occurring. I have also replaced multiple Ethernet cards on this switch with no result. The only thing I see in the logs is that the switch has put the active switch fabric card into warm standby and that the standby card has come online as the primary, which of course drops all my servers offline for 60-120 seconds and sets off every monitoring alarm we have. I have been working with Nortel support, but so far they have been of no help whatsoever. Has anyone else seen this problem? Does anyone know if there is a fix for this?

P.S. We cannot upgrade to the newly released 2.1 version of code because that is not covered as part of the warranty.

Any help would be appreciated.

Googer
 
A quick look at the 2.1 release notes shows nothing that specifically addresses a CPU fail-over issue. Maybe the Nortel guys know more, if they have said it will fix the problem. We have had 2 chassis failures in the 8600 world. Have you replaced the chassis (2 bad ones is unlikely)?
Maybe 1 test could be to work with just 1 SF. If the problem is frequent enough, it may help narrow the problem down. An odd SF failure might then get logged rather than just causing a switch-over. Could be brutal outage-wise though :-(
 
I've tested this in my lab with just one Switch Fabric and it seems to work just fine. I did finally get the new software from Nortel. They won't guarantee that it fixes the issue, but they think it may help. I discovered something else in this process: under the standard warranty, software upgrades are not included even when the software does not work correctly.

Googer
 
FYI, the new code did not fix the problem.

Googer
 
From what I have seen/know, the 2.1 code has indeed resolved this issue at many sites. All information points to the fact that this is NOT related to number of SFs/8393s - problem was seen in both single and dual SF configurations. When you upgraded to the new 2.1.0.0 code, did you boot all of the 'A', 'B' and 'F' images?
 
The way I read the documentation for the upgrade, you only needed to boot the 'F' image for the Master CPU. Nortel support has since loaded some debug software to try to capture some data the next time this occurs. So far, the silent fail-over has not happened since the debug code was loaded.

Googer
 
We have had these same symptoms/events with our 8600s running dual switch fabrics. Very disappointing at the cost level of this hardware and the stated MTBF level they sell and compare themselves at. Also, ironically, I just agreed to install the 8300 product beside the 8600s to offload some of the load while they get to the bottom of our ongoing 8600 issues. Those issues have caused repeated random core network downtime, along with much intentional after-hours downtime spent trying to reproduce issues so that they can fix their OS bugs.
 
Do you see anything in the log when this happens on the 8600? We have 8600s also but I have not seen this problem on those systems yet.

Thanks,

Googer
 
Yeah, the logs show it fails over! That is it. Nothing in the logs indicates that things are going downhill or that impending doom is approaching. It has happened 3-4 times over the past year. How long have you been running your 8600s? Do you have more than one 8600, and if so, do you have IST trunking between them?
 
Yeah, we have 7 of them in production and I've never seen this. We have two running with IST. The only time I have seen anything even remotely like this is when we had a large number of backplane errors ("FAD Misalign, SWIP reset"), which took one of the systems down and forced VRRP to switch over to the other unit. I did find out recently from a Nortel engineer that, even though it isn't published, they don't think you should have more than four ports in your IST trunk. He also said that you have to have E cards on any downstream 8600s that split trunk to your IST boxes or they might not work right. One last thing was that CP rate limiting has to be off on the IST. Is any of that helpful?

Googer
 
VRRP transitions have been pinned specifically on at least 1 of the 8600 CPU failovers by Nortel support. We have 5 IST trunks between our 8600's. Can you get me any specific info on that 4 or less IST recommendation?
 
I'm waiting on documentation on this. I just finished this on an open case with Nortel. I also forgot to mention a Windows XP problem they told me about. XP runs a protocol called SSDP which can cause VRRP problems specifically. I'm waiting on documentation of this problem as well but I know it has to do with running a large number of XP machines on the network and this protocol causing VRRP to stop communicating properly. I'll let you know when I get some documentation on the IST issue.

Googer
 
Yes, the XP Universal PnP caused our VRRP CPU failover. They identified it on ours and wrote a patch for the OS. That issue is behind us but more keep coming up. Sounds like we are having the same type of issues.
 
Here is what the Nortel engineer had to say about the IST.

"There is no hard and fast rule regarding 4-port IST configuration, it's up to the customer how he configure the IST, we normally recommends 4 port IST. One can put as much as ports in the IST link but that will be wastage of the resources and extra load on CPU by running the IST protocol on 8 ports or 10 ports."

I like the way it was critical for me to do this when we were having problems, but now it is just a recommendation. He also said that he didn't have the XP problem documented and that he thinks it was just one customer. You must be the lucky one.

Googer
 
We have been lucky in so many ways, getting to expose 3 definite bugs in the 8600 OS so far, with 1 more in the hopper. Just last Thursday we had a total lock-up of one 8600. It was bad because it did not go down 100%, nor did it reload or fail over the CPU. Thus, the SMLT links to closet stacks and server rack 470's thought that 8600 was still alive and sent traffic to it. It in turn dropped the packets on the floor. So, the whole network was down for 50 minutes until we got someone there who could diagnose what it was (our NOC just showed everything RED because the whole LAN/WAN showed down) and in the end "knock it in the head" and power cycle that 8600. Here is what the error log showed over and over when it happened.

IP ERROR ipPktOut: Can't copy frame buffer!
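
For anyone who wants to check how often this message shows up in a saved log export, here is a minimal Python sketch. Only the message text comes from the post above; the function names and the assumption that the log is available as plain text lines are illustrative, not anything from Nortel.

```python
from typing import Iterable

# Message text copied from the log line quoted above; the rest is illustrative.
FRAME_BUFFER_ERROR = "ipPktOut: Can't copy frame buffer!"


def count_error_lines(lines: Iterable[str], needle: str = FRAME_BUFFER_ERROR) -> int:
    """Return how many lines contain the given error text."""
    return sum(1 for line in lines if needle in line)


def count_from_file(path: str) -> int:
    """Count matching lines in a plain-text log export (path is hypothetical)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_error_lines(handle)
```

Running count_from_file against exports taken before and after an incident would show whether the message only appears during the lock-up or builds up beforehand.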
 
Googer

Would you be open to a conference call this afternoon or Wednesday morning? I can set up a conference bridge at my cost. I think we have data that would be of benefit to each other...
 
This appears to have morphed into an 8600 conversation. With that in mind and the last posts, I am interested in the XP-induced VRRP problem. We have been slowly converting to XP and have unexplained VRRP events. Related?? Hmmm.. What have you noticed, and what version of 8600 code are you running?

Thanks
 
I would be open to a conference call. Let me know when and where. I would like to talk to someone else having the same problems.

As of yet I have not seen the XP/VRRP problem. I was warned of it as a possible problem as we are on the verge of moving to XP. I am currently investigating how the two could interact. Currently we are looking at the possibility of turning off the XP protocols.

Googer
 
I will get the details of the UPNP/ VRRP multicast issue from our ticket details. In any event, if you are rolling out XP it will enable this by default on the OS install. We have 700 client PC's and that count caused it. So, if you have over a few hundred PC's I would tread lightly.
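
To get a rough feel for how much SSDP traffic a client VLAN generates, a sketch like the following could be run on a host in that VLAN. SSDP uses multicast group 239.255.255.250 on UDP port 1900. The function names, the sampling window, and the per-source tally are assumptions for illustration; this is not the diagnostic Nortel used, and binding to port 1900 may fail if a local SSDP service already holds it exclusively.

```python
import socket
import struct
import time
from collections import Counter

SSDP_GROUP = "239.255.255.250"  # well-known SSDP multicast group
SSDP_PORT = 1900                # well-known SSDP UDP port


def tally_sources(source_ips):
    """Count datagrams per source IP, busiest first."""
    return Counter(source_ips).most_common()


def sample_ssdp(seconds=30.0):
    """Listen for SSDP datagrams for `seconds`; return the source IPs seen."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", SSDP_PORT))
    # Join the multicast group on the default interface.
    membership = struct.pack(
        "4s4s", socket.inet_aton(SSDP_GROUP), socket.inet_aton("0.0.0.0")
    )
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    sources = []
    deadline = time.monotonic() + seconds
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                _data, (ip, _port) = sock.recvfrom(2048)
            except socket.timeout:
                break
            sources.append(ip)
    finally:
        sock.close()
    return sources
```

Calling tally_sources(sample_ssdp(60)) would list the chattiest hosts over one minute, which gives a starting point for judging whether a few hundred XP clients are producing a meaningful multicast load.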
 
I have set up a conference call for 8:30 AM Pacific time Thursday. Here is the bridge info for you two guys.

CALL DATE: FEB-10-2005 (Thursday)
CALL TIME: 10:30 AM CENTRAL TIME
USA Toll Free Number: 877-939-1570
PASSCODE: 16705
 