Large file transfer - network freezes

Status
Not open for further replies.

hunterdw

Technical User
Oct 25, 2002
US
Hi there,

I've got a puzzler that happens under a VERY specific set of circumstances.

I'll do the best I can to describe.

Site A - Cisco 6509 - Port Gig3/8 connected to AT&T Gigaman (Gigabit Metro E)

Site B - Cisco 4507 - Port Gig3/6 connected to AT&T Gigaman (Gigabit Metro E)

I am "switching" between the two sites, not routing. I have my trunk set appropriately and am only allowing a specific set of vlans.

In site A is a server called "FS01" - it is a MacOS server that is connected, via Fibre, to 100+ TB of Apple XSAN storage.

In site B are two servers (Playback A and Playback B) - they connect across the Gig Metro E, to FS01, via AFP.

When I copy large ISO files (5-6gig+) FROM FS01 (Site A) to either of the Playback machines (Site B) - my network will "Freeze" during the transfer.

The best word I can think of is "freeze" - the data stops copying, and pings to the destination servers fail - for between 35 and 40 seconds.

Tonight I tested with various files and under various circumstances, and each time the "freeze" lasted exactly 37 seconds.
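
A freeze like this can be timed from a ping log instead of by eye. The following is a minimal Python sketch, assuming a list of (timestamp, reply received) samples is already available; the function name `outage_windows` and the sample log are hypothetical and are not data from these tests.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, duration) for each run of failed pings.

    samples: (timestamp_in_seconds, reply_received) pairs in time order.
    An outage starts at the first failed probe and ends at the next
    successful one, so the resolution is one probe interval.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # first lost ping of a run
        elif ok and start is not None:
            windows.append((start, ts - start))
            start = None
    return windows  # an outage still in progress at the end is not reported


# Hypothetical log: one ping per second, replies lost from t=10 through t=46.
log = [(float(t), not (10 <= t <= 46)) for t in range(60)]
print(outage_windows(log))  # [(10.0, 37.0)]
```

With one probe per second, the measured duration is accurate to about a second, which is enough to tell a 37 second outage from a 30 or 50 second one.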

And, here's a weird thing too. During the "freeze" if I do a "show int gig3/8" on the 6509, the "counters" are zero.

Here's an example

BEFORE THE FREEZE
5 minute input rate 91000 bits/sec, 119 packets/sec
5 minute output rate 6193000 bits/sec, 745 packets/sec

DURING THE FREEZE
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
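
To catch the zeroed rates automatically, the two rate lines can be pulled out of saved "show int" output with a regular expression. This is only a sketch in Python: the helper names `parse_rates` and `looks_frozen` are made up for illustration, and the input text is the output quoted above.

```python
import re

RATE_RE = re.compile(
    r"5 minute (input|output) rate (\d+) bits/sec, (\d+) packets/sec"
)


def parse_rates(show_int_text: str) -> dict:
    """Map 'input'/'output' to (bits_per_sec, packets_per_sec)."""
    return {
        direction: (int(bps), int(pps))
        for direction, bps, pps in RATE_RE.findall(show_int_text)
    }


def looks_frozen(show_int_text: str) -> bool:
    """True when rate lines were found and every bit rate is zero."""
    rates = parse_rates(show_int_text)
    return bool(rates) and all(bps == 0 for bps, _ in rates.values())


before = """5 minute input rate 91000 bits/sec, 119 packets/sec
5 minute output rate 6193000 bits/sec, 745 packets/sec"""
during = """5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec"""

print(parse_rates(before))
print(looks_frozen(before), looks_frozen(during))
```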

I'm going to open a TAC case tomorrow, but thought I would see if anyone had any ideas.

Again, copying from Site A to Site B using any other servers as source or destination is just fine

It's just this specific set of circumstances that cause the "freeze"

Thanks for any input - I'm willing to try most anything

--DW
 
Sounds like the session is timing out after a certain time. In Linux (maybe Apple OS also) you can set keepalives, which keep the socket open. Check into that.
 
Is a transfer between the same endpoints ok with small files?

One issue I have seen involves the maximum transmission unit (MTU), which each endpoint uses to work out the TCP segment size it advertises during FTP connection establishment. Normally on Ethernet it's around 1500 bytes, but Gigabit can support larger frame sizes.

So, the FTP session establishes and starts to transfer data; as the server caches more data ready for transmission, the FTP packet size starts to increase.

It's possible to end up with a packet size that exceeds the value allowed on one of the links; this is down to poor network configuration. If you transfer a small file, the server may never send packets approaching this size, as the transfer completes before the caching catches up with the data on the server's drive, so it's not a problem.

With a large file, after a period the server starts to transmit packets that exceed the MTU of one of the links, and everything stops.

The end-to-end segment size can be altered by using the command ip tcp adjust-mss 1400 on a layer 3 interface at one end of the link. If the path is layer 2 only, this may be difficult to resolve; the MTU would need to be set on a server.
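
The relationship between MTU and TCP segment size is simple arithmetic. As a rough sketch in Python, assuming IPv4 and TCP headers of 20 bytes each with no options (the function names are made up for illustration):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packet_size_for_mss(mss: int) -> int:
    """IPv4 packet size produced by a full-sized segment with this MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


print(mss_for_mtu(1500))          # 1460
print(packet_size_for_mss(1400))  # 1440
```

So an MSS of 1400 keeps full-sized packets at 1440 bytes, which leaves 60 bytes of headroom under a 1500 byte MTU for any extra encapsulation along the path.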

 
@north323 - I'm not seeing keepalive issues, good thought though

@routerman - MTU sounds logical. From everything I can tell, we're at 1500 end to end. I've asked AT&T to double check their gear to make sure 1500 is their setting too. These file copies are not via FTP though. This happens using AFP to transfer files - you know - Just copying Mac to Mac, across the MetroE. I agree with your assessment and "ip tcp adjust-mss 1400" comment. Unfortunately, this is not a layer 3 connection. It is, in fact, layer 2 only. I'll double check all interfaces along the way - including the server. I may manually choose an MTU, like 1400 as you suggested.
 
Post a sh proc cpu 1 minute after initiating a transfer.

You could also run sh tech through Output Interpreter, with your CCO login...

Burt
 
Here are a few updates - and the resolution

@routerman MTU settings made no difference

@burtsbees CPU was "normal" - on the 6509, it stays around 1-2% and on the 4507, it stays around 30%.

I flew from my home office to the actual location of this equipment. As I arrived, I did the typical "reset / reseat / swap cables" stuff. No change. While I was working, I noticed a strange looking Red Alarm on the AT&T Gigaman equipment. It seemed to correspond to the network problems.

I worked with the AT&T SONET engineers, and they saw the problem as well. There was some optic loss - like (-18) dBm. So, we scheduled a fibre polish at all locations, including the CO, and the light loss improved. Data flowed better, but still had issues.
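
For a sense of scale, a dBm reading converts to milliwatts as 10^(dBm/10). The sketch below is Python and assumes the -18 figure was an optical power level in dBm; the post does not say exactly what was measured, so treat that as an assumption.

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)


print(round(dbm_to_mw(-18), 4))  # 0.0158
```

Whether a level like that is acceptable depends on the receive sensitivity of the particular optics, which is not stated in this thread.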

Through further observation, I began timing the downtime, and the behavior looked like spanning-tree. The 35+ second downtime was possibly spanning-tree convergence. Showing interface stats during the network outage proved, sure enough, that spanning-tree was BLOCKING data from the 6509 during the data loss.

I reconfigured the switch for RAPID spanning tree (which should have been done prior, but anyway). This didn't fix the problem, but the downtime was 4-7 seconds now (instead of 35-40) when the problem occurred. Rapid spanning-tree converges much quicker.
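
The numbers can be checked against the classic spanning-tree timers. This is a small Python sketch, assuming the IEEE 802.1D default timers (forward delay 15 seconds, max age 20 seconds); the thread does not show the timers actually configured on these switches, so the defaults are an assumption.

```python
# IEEE 802.1D default timers in seconds (assumed; the configured values
# are not shown in this thread).
FORWARD_DELAY = 15
MAX_AGE = 20


def port_up_delay(forward_delay: int = FORWARD_DELAY) -> int:
    """Listening plus learning time before a port that just came up forwards."""
    return 2 * forward_delay


def indirect_failure_delay(max_age: int = MAX_AGE,
                           forward_delay: int = FORWARD_DELAY) -> int:
    """Worst case when the failure is not on a directly attached link."""
    return max_age + 2 * forward_delay


observed = 37  # seconds, the freeze length reported earlier in the thread
print(port_up_delay())             # 30
print(indirect_failure_delay())    # 50
print(observed - port_up_delay())  # 7
```

Under those assumed defaults, a port that flaps and comes back spends 30 seconds in listening and learning, which accounts for most of the 37 seconds. The leftover 7 seconds is about the same as the upper end of the 4-7 second outage seen after moving to rapid spanning tree, which would fit the link itself being down for a few seconds each time, though that is a reading of the numbers and not something measured here.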

So, what was causing spanning-tree to reconverge? This is a single path. It's not a loop. It must be the root changing. What would cause that? I had previously configured vlan priorities and root guard on my switchports. It must be an interface flap.

I changed interfaces. Problem continued. I changed interfaces 6 more times (it was an 8-port fibre card) and the problem continued. For fun, I placed the interface directly into my switching supervisor. Problem solved.

So, it was an entire LINE CARD that was having a problem - and it was only exacerbated by the types of noisy traffic I was generating.

The best way to describe it is... like a balloon...

The combination of noisy traffic (AFP) and dirty fibre (at first) and a bad interface was causing bad packets. I was observing some CRC errors as well. This bad data was reflected through the SONET gear and was building up pressure. After the pressure got high enough - like a balloon - the interface blew up. It flapped, and because the interface flapped, spanning-tree was signaled to converge.

The downtime was normal. The interface went down, and spanning-tree needed to do its thing to protect my network.

The real problem was a bad interface (entire line card).

Anyway, a new line card is here and will be installed tonight.

Sorry for the long followup. Just thought it might be helpful for someone that reads this in the future.

--DW
 
Interesting.

So the Gigabit fiber card on the 4507 at the remote site was causing the whole problem?

Ross Perot and Ralph Nader were right...
 
@wabob no, the gigabit fibre card on the 6509 was causing the whole problem

it was sending out malformed packets, reflected into the SONET gear across the Gigaman, and they showed up as "input errors" and "CRC errors" on the gigabit fibre card on the 4507 side
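
Error counters like these are easiest to judge by how fast they grow between two snapshots. Here is a minimal Python sketch that assumes the usual IOS "show interface" line of the form "N input errors, N CRC, ..."; the helper names and both sample snapshots are made up for illustration and are not output from the switches in this thread.

```python
import re

ERR_RE = re.compile(r"(\d+) input errors, (\d+) CRC")


def error_counters(show_int_text: str) -> tuple:
    """Return (input_errors, crc_errors) from 'show interface' text."""
    match = ERR_RE.search(show_int_text)
    if match is None:
        raise ValueError("no input error line found")
    return int(match.group(1)), int(match.group(2))


def error_delta(earlier: str, later: str) -> tuple:
    """How much each counter grew between two snapshots."""
    a, b = error_counters(earlier), error_counters(later)
    return b[0] - a[0], b[1] - a[1]


# Made-up snapshots for illustration only.
snap1 = "     12 input errors, 10 CRC, 0 frame, 0 overrun, 0 ignored"
snap2 = "     96 input errors, 81 CRC, 0 frame, 0 overrun, 0 ignored"
print(error_delta(snap1, snap2))  # (84, 71)
```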
 
Do you have a CCO login? Output Interpreter is an INDISPENSABLE tool...

 
@burtsbees yes I have cco. the tool was great with my 4507R config. It wouldn't process my 6509 config - either copy-n-paste or uploading the file. I've used it prior and it's been really helpful to find little bugs along the way.
 
Would not process the file? Too big? Perhaps try to tweak MTU for grins, if you wanted to try again...

 
No need to try again. My problem is resolved. It was a flaky interface that was flapping. In fact, it was the flaky line card. I pulled out the 8-port 6509 blade and replaced it with a 16-port 6509 blade and all is well.
 
I know---but I'm just the curious type who sometimes can't leave an issue like that hanging...lol

Not busy today, really...lol

 