Lost Packets with No Errors!!!

Status
Not open for further replies.

JK9713

IS-IT--Management
Jan 24, 2006
4
US
Please help…

My company currently has 4 locations on our frame relay network that connect to our main office. Each remote location has a frame relay circuit with a 512K Port speed. The main office has a full T1. The frame is provided by AT&T and all circuits have a CIR that equals the port speed.

The traffic at all remote locations typically consists of a few SSH sessions to an AIX host at the Main Office, Browser based applications and the occasional print job generated from the main office and sent to a remote printer. Occasionally during the day an imaging application is used to capture "Tickets" locally on a Windows XP Pro system and transmit them to a Windows 2003 Server at the main office.

Starting about three weeks ago one location (Location X) started having errors with the imaging application. When saving a batch after capturing or making corrections they would occasionally receive errors from the application. The application did not consistently generate an error nor was the error always the same. The messages seemed to indicate a communication problem with the main server.

The following list has the most common messages:

"The semaphore timeout period has expired"

"The specified network name is no longer available"

"The path is too deep"

I was initially unable to recreate the problem outside the imaging application. I soon discovered, however, that if I copied files to the Main Office, the copies would occasionally fail with the same error messages that the imaging application was generating. Interestingly, I did not experience the same problems copying files from the Main Office.
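For anyone wanting to reproduce this kind of intermittent failure, the copy test described above can be sketched roughly as follows. This is only a minimal sketch, not the procedure actually used in this thread: the function name `repeated_copy_test`, the file names, and the attempt count are all hypothetical, and a local temporary directory stands in for the share at the Main Office.

```python
import shutil
import tempfile
from pathlib import Path


def repeated_copy_test(source, dest_dir, attempts=20, copy=shutil.copyfile):
    """Copy `source` into `dest_dir` repeatedly and collect any OS errors."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    failures = []
    for attempt in range(1, attempts + 1):
        target = dest_dir / f"copytest_{attempt}{source.suffix}"
        try:
            copy(source, target)
        except OSError as exc:
            # Assumption: network errors such as "The specified network name
            # is no longer available" are reported to Python as OSError.
            failures.append((attempt, str(exc)))
    return failures


if __name__ == "__main__":
    # Hypothetical setup: a temp directory stands in for the remote share.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.bin"
        src.write_bytes(b"\0" * 1024 * 1024)  # 1 MiB test file
        errors = repeated_copy_test(src, tmp, attempts=5)
        print(f"{len(errors)} of 5 copies failed")
```

Pointing `dest_dir` at the real share and running the test in both directions would show whether the failures follow the direction of the transfer, as they did here.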

In an attempt to isolate the problem I have taken the following steps:

Updated all drivers and software on PC at Location X – Errors
Used a PC known to operate without errors from another location at Location X – Errors
Used the PC from Location X at another site – No Errors
Replaced Network Switch, Router and CSU – Errors
Recertified network drops and replaced patch cords – Errors
Moved PC to certified network drop in another area of the building – Errors
Installed Voltage Regulating UPS at Location X and monitored for power anomalies (None found) – Errors
Reviewed differences between Location X 1841 router and routers at other locations, updated to same version of IOS – Errors
Corrected issue with minor T1 errors at Main Office – Errors
Corrected issue with high number of Frame-Relay BECNs on Router at Main Office by implementing traffic shaping for remote VCs – Errors

Based on these steps I came to the following conclusions:

As all hardware with the exception of the Telco Equipment (Smart Jack) has been replaced, the issue is not Network Equipment related.
As the PC in question operates correctly at an identically configured location, the issue is not software or driver related.
As no other location is experiencing the problem, the equipment and configurations at the Main Office are not the cause.

This has left me investigating the Frame Relay circuit. I had AT&T test the T1 and associated frame relay circuit. They are not reporting any problems. I have checked the statistics on both Routers (including CSU stats) and am not seeing any obvious errors.

After using Ethereal to capture traffic from both the server and the client I discovered a number of "TCP Duplicate ACK", "TCP Retransmission" and "TCP Previous Segment Lost" conditions. These error conditions do not seem to occur at another location.
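To illustrate what those capture flags mean, here is a simplified sketch of the kind of sequence-number bookkeeping behind them. It is not Ethereal's actual analysis logic (which also considers timing, window sizes and payload length). The function names and the sample numbers are made up for illustration.

```python
def classify_segments(segments):
    """Label each (seq, length) data segment sent by one host.

    Simplified heuristics:
      - "ok": the segment starts at the next expected byte
      - "previous segment lost": it starts beyond the next expected byte
      - "retransmission": it starts before the next expected byte
    """
    labels = []
    next_expected = None
    for seq, length in segments:
        if next_expected is None or seq == next_expected:
            labels.append("ok")
        elif seq > next_expected:
            labels.append("previous segment lost")
        else:
            labels.append("retransmission")
        end = seq + length
        # Track the highest byte position seen so far.
        if next_expected is None or end > next_expected:
            next_expected = end
    return labels


def count_duplicate_acks(acks):
    """Count ACK numbers that repeat the immediately preceding ACK."""
    duplicates = 0
    previous = None
    for ack in acks:
        if ack == previous:
            duplicates += 1
        previous = ack
    return duplicates


if __name__ == "__main__":
    # Hypothetical trace: the segment starting at 201 is dropped, then resent.
    sent = [(1, 100), (101, 100), (301, 100), (201, 100), (401, 100)]
    for segment, label in zip(sent, classify_segments(sent)):
        print(segment, label)
    # The receiver keeps acknowledging 201 until the gap is filled.
    print(count_duplicate_acks([101, 201, 201, 201, 401]))
```

A gap in the sequence numbers followed by repeated ACKs and a resend is the pattern one would expect when frames are being discarded somewhere between the two hosts.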

 
are you seeing input or output drops on the serial interface of the remote routers?
 
No, no input or output drops are reported on any of my routers.

New discoveries - After working with AT&T I finally got them to look at the frame-relay portion of the circuit. (Despite my many references to this being a frame relay problem they were only looking at and testing the T1 circuit).

It was discovered that they were reporting CRC errors on packets received from the remote branch and discarding them.

Much more testing ensued and, to keep a long story short, here is the new assessment of the problem:

The problem only seems to occur when using a Cisco WIC-1DSU-T1-V2 Card for the T1 interface at Location X. A non-V2 card seems to work without any problems. I have verified this with multiple V2 cards: all of them work at other locations and experience problems at Location X.

I am currently in the middle of a finger pointing issue between Cisco and AT&T. AT&T seems unwilling to make any changes due to the fact that the circuit has no problems when they run their automated tests. I do not think there is anything Cisco can do, as there does not seem to be anyone reporting widespread problems with the WIC-1DSU-T1-V2.

Any advice on how to deal with this situation would be greatly appreciated.

 
have you tried swapping the cables going into that wic?
 
The cable from the smart jack to the WIC has not been replaced.

I am not sure how this could be a problem as the Non-V2 WIC works fine and loopbacks from the telco to the router also have no issues.

However it may be worth a try...

Next time I am at that location I will attempt to use a different cable.
 
Solution

After working with both AT&T and Cisco and many hours of trying to prove or disprove the "It's not our problem..." responses, I learned more about T1 and Frame Relay than is probably healthy to know.

The end cause was this...

Any Cisco Router located at Location X using a WIC-1DSU-T1-V2 connected to a Westell SmartJack via a cable run of less than 110ft experienced the problem.

My theory is this...

The WIC-1DSU-T1-V2 is designed to have additional support for long/short cable runs. It seems that this WIC transmits a slightly stronger signal than its predecessor. The devices in question were connected by a 20' CAT5e patch cord. With such a short run the cable provides very little attenuation, so the signal appears to arrive too strong. The SmartJack seems to have a "Sensitive Receiver" that detects errors when packets are sent rapidly to its interface by the WIC-1DSU-T1-V2 at default settings.

The fix...

Implemented the following command on the router (under the T1 serial interface)...
service-module T1 cablelength short 110ft

It seems interesting that in all the router installs that I have done, and among everyone that I have spoken to (AT&T, Cisco, other technicians), nobody has ever had to make a change from the default setting on this WIC. I have multiple sites with the router and SmartJack connected in a similar fashion (some using shorter cables).

One of my biggest complaints about forums is that some people never post the solution to their problem once found.

I hope someone finds this useful...

 