8600 lane lockup issues on R/RS blades

curtismo (MIS)
Does anyone here have experience dealing with lane lockup issues on 8600 R or RS blades? This past week I experienced my second lane lockup on an 8634GTRS blade - affecting the same lane/group of ports that was affected earlier.

I had a case opened when the first lockup occurred, but learned that some of the lane-lockup prevention measures developed in the 4.x code stream had not been incorporated into the 5.x stream - which, of course, I have to run in order to use the RS blades in this chassis. Searching the knowledgebase finds lots of lane lockup issues. And since this is a core switch in my network, recovering from a lockup is not fun for anyone involved - this switch touches everything: voice, data, video.

Of course, this past week's issue happened when I was on vacation; I won't bore you with the gory details, but it wasn't pretty.

We've been promised a new 5.x release in December. It won't fix the problem (last I heard, Nortel STILL doesn't know what the cause is), but it is supposed to do a graceful reset of a locked lane. I can't really wait that long, though - I can't be available 24x7x365 until this crappy issue has been fixed.

I do believe I had a lane lockup at one time with 4.0.x code and R blades, but other than that, the 8600 has been a really stable switch, so this lockup (the second in two months, on the same switch, affecting the same ports) is both scary and infuriating, turning the "ho-ho-ho" season into "ho-ho-hum".

The partner we use says he actually proved that a certain multicast packet passing through the 8600 could reproduce the lane lockup. Given that the affected lane has only 3 ports active (on this combo blade the ports are divided into two lanes of 12 - 4 copper GigE plus 8 SFP each - plus 2 10Gig ports, and the affected ports are in the left-hand lane), and I know what is on them, has anyone come up with any cause/effect analysis? One item is an MS SQL DB server; another is an image-processor server; and the third is an uplink to a 5520 for normal workstations. The workstations have other uplinks, etc., which makes me discount that one as a likely source.
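Since the only outward symptom is that traffic just stops, a dumb reachability watchdog on the three devices in that lane would at least page someone the moment it locks up again, rather than waiting for users to call. A minimal sketch in Python (the addresses are placeholders for the SQL server, image server, and 5520 uplink - substitute your own; the ping flags are the Linux ones):

    #!/usr/bin/env python3
    """Watchdog: ping the hosts on the suspect lane and flag a sweep-wide loss."""
    import subprocess
    import time

    # Placeholder addresses for the SQL server, image server, and 5520 uplink.
    HOSTS = ["10.0.1.10", "10.0.1.11", "10.0.1.1"]
    INTERVAL = 30  # seconds between sweeps

    def alive(host: str) -> bool:
        """One ICMP echo with a 2-second timeout (Linux 'ping' flags)."""
        return subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ).returncode == 0

    while True:
        down = [h for h in HOSTS if not alive(h)]
        if len(down) == len(HOSTS):
            # All three dark at once smells like the lane, not a single host.
            print(f"{time.ctime()}: possible lane lockup - all hosts down")
        elif down:
            print(f"{time.ctime()}: hosts unreachable: {down}")
        time.sleep(INTERVAL)

It won't tell you why the lane locked, but "all three down at once" versus "one host down" is a useful first discriminator when deciding whether to reset the card.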

One of the things I've been considering is moving these items to other ports; the caveat is that the problem might move along with them and lock up other lanes, affecting more critical services on the other 12 combo ports or, heaven forbid, 24 ports on a 48-port Gig blade. I'm not sure turning these into MLT trunks or using adapter teaming on the servers will help, since I believe the 8600 (or the other end) still sees the MAC addresses of the switch ports, which would still make some things unavailable. Again, when this happens in the middle of a workday, the most important thing is getting the business working again, not gathering diagnostics.

Note that I do not want to turn this into a "bash Nortel" session, but want to see if anyone has suggestions that we can all learn from in working out this issue.
 
I have heard of lane lockups on the 8648GTR (though it has never happened on any 8648GTR cards that I have deployed). I believe there was a CSB that talked about duplex mismatches bringing down an entire lane.


I have yet to hear about any issues with the 8648GTRS (although I don't yet have any deployed).

I'll reach out to my contacts at Nortel and let you know if I hear anything.

PS: I don't think anyone can blame you for being upset at Nortel.
 
We have a Nortel engineer assigned to our account, and the case is moving up the ladder there. I'm just trying to find out whether people are seeing this in the field, and what they are doing to work around it until new code is released.

The problems we are having with 5.0 are on the new 8634XGRS combo blade; as of now, no issues on 8648 blades.
 
F.Y.I.
I'm just about to log a case with Nortel regarding what I assume is exactly the same problem: basically, layer 3 doesn't work, although layer 2 appears to be OK. Before I did, I thought I'd do a quick search of the net to see if there were any known issues out there, and hey presto, this one popped up!

We recently installed 2 of these cards, upgraded the PSUs, fans, and software to 5.0.0.1, and we've now experienced 3 faults in 3 months on 3 different 8634GTRS cards in 2 different switches, with the problem appearing to affect either the left-hand or the right-hand set of copper 1Gbps ports. When we first had this fault, a reset of the card cured the problem for about a week; then a problem with SMLT occurred, and that time disabling the affected port on the suspect card cured it. We then got the card replaced - it worked fine until today, when the same thing happened, except this time it was affecting ports 16-20 on the right-hand side rather than 1-4. A week ago my colleague also had a similar problem with a different 8634GTRS card in the cluster pair.

We are only using the copper 1Gbps ports at the moment, but we were going to connect our WAN link to one of these ports; I now think that unwise. Basically we're going to pass it to a support group we use who are Nortel partners, because, as you say, there is not always time to diagnose problems. However, we are in a fortunate situation, in as much as the cards only have 3 connections on one and 1 connection on the other at the moment, and we have the resiliency of 2 switches. The problem, as I see it, is that there is no indication of the fault except that things stop working. I believe a 5.1.0 code release is due soon, so it will be interesting to see if it fixes these problems. Personally, I've already lost faith in this hardware; my gut feeling is that the card is trying to be all things to all men and not quite achieving it.
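One quick way to pin down that "layer 2 OK, layer 3 dead" symptom, from a machine on the same subnet as the affected ports, is to check whether a host still answers ARP while ICMP goes unanswered. A rough sketch with Scapy (run as root; the target address is a placeholder for a host behind the suspect lane):

    #!/usr/bin/env python3
    """Probe one host at layer 2 (ARP) and layer 3 (ICMP) and compare."""
    from scapy.all import ARP, Ether, ICMP, IP, sr1, srp

    TARGET = "10.0.1.10"  # placeholder: a host behind the suspect lane

    # Layer 2: broadcast ARP who-has; any answer means L2 forwarding works.
    arp_ans, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=TARGET),
        timeout=2, verbose=False,
    )

    # Layer 3: ICMP echo; silence here while ARP answers points at L3.
    icmp_ans = sr1(IP(dst=TARGET) / ICMP(), timeout=2, verbose=False)

    print(f"ARP  answer: {'yes' if arp_ans else 'no'}")
    print(f"ICMP answer: {'yes' if icmp_ans is not None else 'no'}")
    if arp_ans and icmp_ans is None:
        print("L2 up, L3 dead - consistent with the symptom in this thread")

Having that pair of answers in the case notes ("ARP yes, ping no") is also a lot more useful to support than "things stopped working".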
 
I've not had the problem since doing two things. Unfortunately, two of the servers affected when the card goes down are the most critical in the organization (i.e., it's a PITA to recover operations), so we did both things at the same time, and I am unable to determine which one resolved the issue - and it's a workaround at that:

1. RMA'd and replaced the 8634GTRS blade;
2. Used a spare 5520 switch as a pass-through device (i.e., took one copper Gig link out of the 8634GTRS and another out of an 8648GTR blade in the same chassis, set up MLT and VLACP on the links, then put both servers' Gig SFP fiber links into the 5520).

I haven't had a lockup since then, almost 5 months later. But it's a waste of a 5520 (these are heavily-used image and SQL servers - items that should be on the core switch). Unfortunately, these servers are so old that I don't have dual fiber ports or any copper GigE on them.
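For anyone trying the same MLT/VLACP pass-through trick: since a locked lane can leave the link lights on while traffic blackholes, one cheap sanity check is to sample the interface octet counters on both MLT member links over an interval and confirm both are actually moving. A sketch that shells out to net-snmp's snmpget (the switch address, community string, and ifIndex values are placeholders you'd look up for your own ports):

    #!/usr/bin/env python3
    """Sample ifInOctets on both MLT member links to confirm both pass traffic."""
    import subprocess
    import time

    SWITCH = "10.0.0.2"       # placeholder: 8600 management address
    COMMUNITY = "public"      # placeholder: SNMP read community
    IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"  # IF-MIB::ifInOctets
    # Placeholder ifIndex values for the two MLT member ports.
    MLT_MEMBERS = {"8634GTRS link": 65, "8648GTR link": 129}

    def in_octets(if_index: int) -> int:
        """Fetch ifInOctets for one interface via net-snmp's snmpget."""
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Ovq",
             SWITCH, f"{IF_IN_OCTETS}.{if_index}"],
            text=True,
        )
        return int(out.strip())

    before = {name: in_octets(idx) for name, idx in MLT_MEMBERS.items()}
    time.sleep(60)  # sample interval; ignores 32-bit counter wrap
    for name, idx in MLT_MEMBERS.items():
        delta = in_octets(idx) - before[name]
        state = "passing traffic" if delta > 0 else "SILENT - investigate"
        print(f"{name}: {delta} octets in 60s ({state})")

A member link showing zero octets for a minute while its partner is busy is exactly the quiet failure mode this thread is about.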

I have heard that the next release will be 5.1.0, and that it will merge some of the lockup fixes from the 4.1 code.
 
...to add: I have another 8600 with an 8634GTRS blade that has never had an issue, which suggests this could be hardware-related, but again, I am unwilling at this time to risk the downtime to test that theory.
 
Yes, apparently the issue will be resolved in 5.1.0, which is due imminently, and we are told that the issue is definitely a software one. However, like you, I'm now a little reluctant to use this card for our primary WAN link, and will probably utilise one of the older cards with dual GBIC slots instead.
 
FYI.

5.1.0 has now been released, so we'll be implementing it fairly soon. We're also getting problems with the switches resetting: on one switch cluster, DRAM usage goes up to 99% and effectively makes it impossible to telnet in; on the other cluster pair, the CPU goes up to 99-100% and the switch still functions, if badly, until it eventually resets, generating some error messages along the way. We raised this with Nortel, but they were unable to determine what was causing it; it then went away for a while and came back. Hopefully this might also be resolved by the upgrade, but it would be nice to know the cause.
 