
Cisco 3560 QOS packet drops


keithja

Hi,


I am trying to understand the reason for egress packet drops I am seeing on our 3560s.

The switches are workgroup switches directly connected to users. We use Nortel IP phones, with the phone inline with the user PC. PCs auto-negotiate to 100/full.

As a reminder, Nortel phones mark voice traffic DSCP 46 / CoS 6, signaling traffic DSCP 40 / CoS 5, and data traffic 0/0.

The typical port configuration is:
switchport mode access
switchport voice vlan 100
priority-queue out
mls qos trust dscp
spanning-tree portfast
end

Global qos config is:
mls qos map dscp-cos 46 to 6
mls qos map cos-dscp 0 8 16 24 32 40 46 56
mls qos srr-queue input bandwidth 98 2
mls qos srr-queue input buffers 95 5
mls qos srr-queue input priority-queue 2 bandwidth 1
mls qos srr-queue input dscp-map queue 1 threshold 1 40
mls qos queue-set output 2 threshold 1 3100 3100 100 3200
mls qos queue-set output 2 buffers 10 75 5 10
mls qos


What I am seeing is persistent drops from egress queue 1-1.

While I know this can happen during bursts, it seems to exceed what I would expect for the low-level usage I'm seeing on the interface. Here are stats from 5 minutes after doing a 'clear mls qos int stat':

FastEthernet0/17

dscp: incoming
-------------------------------

0 - 4 : 53078 0 0 0 0
5 - 9 : 0 0 0 0 0
10 - 14 : 0 0 0 0 0
15 - 19 : 0 0 0 0 0
20 - 24 : 0 0 0 0 0
25 - 29 : 0 0 0 0 0
30 - 34 : 0 0 0 0 0
35 - 39 : 0 0 0 0 0
40 - 44 : 29 0 0 0 0
45 - 49 : 0 0 0 0 0
50 - 54 : 0 0 0 0 0
55 - 59 : 0 0 0 0 0
60 - 64 : 0 0 0 0
dscp: outgoing
-------------------------------

0 - 4 : 77308 0 0 0 0
5 - 9 : 0 0 0 0 0
10 - 14 : 0 0 0 0 0
15 - 19 : 0 0 0 0 0
20 - 24 : 0 0 0 0 0
25 - 29 : 0 0 0 0 0
30 - 34 : 0 0 0 0 0
35 - 39 : 0 0 0 0 0
40 - 44 : 29 0 0 0 0
45 - 49 : 0 0 0 549 0
50 - 54 : 0 0 0 0 0
55 - 59 : 0 0 0 0 0
60 - 64 : 0 0 0 0
cos: incoming
-------------------------------

0 - 4 : 53099 0 0 0 0
5 - 7 : 0 29 0
cos: outgoing
-------------------------------

0 - 4 : 80501 0 0 0 0
5 - 7 : 29 553 0
output queues enqueued:
queue: threshold1 threshold2 threshold3
-----------------------------------------------
queue 0: 33 0 0
queue 1: 80164 86 615
queue 2: 0 0 0
queue 3: 549 0 260

output queues dropped:
queue: threshold1 threshold2 threshold3
-----------------------------------------------
queue 0: 0 0 0
queue 1: 490 0 0
queue 2: 0 0 0
queue 3: 0 0 0

Policer: Inprofile: 0 OutofProfile: 0


MainStack2#sh int fa0/17
FastEthernet0/17 is up, line protocol is up (connected)
Hardware is Fast Ethernet, address is 0011.93c1.0713 (bia 0011.93c1.0713)
MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
...
Full-duplex, 100Mb/s, media type is 10/100BaseTX
input flow-control is off, output flow-control is unsupported
...
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 83286
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 128000 bits/sec, 153 packets/sec
5 minute output rate 1653000 bits/sec, 227 packets/sec
14551956 packets input, 1566102266 bytes, 0 no buffer
Received 9428 broadcasts (420 multicasts)
... no errors ...
28850266 packets output, 23268567069 bytes, 0 underruns
... no errors ...


Here is a view of the queue maps:
Policed-dscp map:
d1 : d2 0 1 2 3 4 5 6 7 8 9
---------------------------------------
0 : 00 01 02 03 04 05 06 07 08 09
1 : 10 11 12 13 14 15 16 17 18 19
2 : 20 21 22 23 24 25 26 27 28 29
3 : 30 31 32 33 34 35 36 37 38 39
4 : 40 41 42 43 44 45 46 47 48 49
5 : 50 51 52 53 54 55 56 57 58 59
6 : 60 61 62 63

Dscp-cos map:
d1 : d2 0 1 2 3 4 5 6 7 8 9
---------------------------------------
0 : 00 00 00 00 00 00 00 00 01 01
1 : 01 01 01 01 01 01 02 02 02 02
2 : 02 02 02 02 03 03 03 03 03 03
3 : 03 03 04 04 04 04 04 04 04 04
4 : 05 05 05 05 05 05 06 05 06 06
5 : 06 06 06 06 06 06 07 07 07 07
6 : 07 07 07 07

Cos-dscp map:
cos: 0 1 2 3 4 5 6 7
--------------------------------
dscp: 0 8 16 24 32 40 46 56

IpPrecedence-dscp map:
ipprec: 0 1 2 3 4 5 6 7
--------------------------------
dscp: 0 8 16 24 32 40 48 56

Dscp-outputq-threshold map:
d1 :d2 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------------
0 : 02-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01
1 : 02-01 02-01 02-01 02-01 02-01 02-01 03-01 03-01 03-01 03-01
2 : 03-01 03-01 03-01 03-01 03-01 03-01 03-01 03-01 03-01 03-01
3 : 03-01 03-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01
4 : 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 04-01 04-01
5 : 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01
6 : 04-01 04-01 04-01 04-01

Dscp-inputq-threshold map:
d1 :d2 0 1 2 3 4 5 6 7 8 9
------------------------------------------------------------
0 : 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01
1 : 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01
2 : 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01
3 : 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01
4 : 01-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01 01-01 01-01
5 : 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01
6 : 01-01 01-01 01-01 01-01

Cos-outputq-threshold map:
cos: 0 1 2 3 4 5 6 7
------------------------------------
queue-threshold: 2-1 2-1 3-1 3-1 4-1 1-1 4-1 4-1

Cos-inputq-threshold map:
cos: 0 1 2 3 4 5 6 7
------------------------------------
queue-threshold: 1-1 1-1 1-1 1-1 1-1 2-1 1-1 1-1


Dscp-dscp mutation map:
Default DSCP Mutation Map:
d1 : d2 0 1 2 3 4 5 6 7 8 9
---------------------------------------
0 : 00 01 02 03 04 05 06 07 08 09
1 : 10 11 12 13 14 15 16 17 18 19
2 : 20 21 22 23 24 25 26 27 28 29
3 : 30 31 32 33 34 35 36 37 38 39
4 : 40 41 42 43 44 45 46 47 48 49
5 : 50 51 52 53 54 55 56 57 58 59
6 : 60 61 62 63



I have not analyzed QOS to this level before and would like to at least understand what I am seeing.

The drops are happening at times other than when calls are taking place, so it doesn't seem to be related to queues piling up while waiting for the priority queue to empty. The 5-minute traffic rate shown above is only about 1.5%. Additionally, I had a graphical monitor running at 20-second intervals on that port during the documented period and there were no bursts - the traffic level varied less than 1% during that period.


What I think I know is:

The phone is marking the voice traffic DSCP 46, and the port trusts - accepts - that marking. If a packet happened to be non-IP, I presume the port would accept the CoS instead and, because of the modified cos-dscp map, would derive DSCP 46 from CoS 6 and go on using that value for further operations (not sure about that part, though).
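
A quick way to double-check that end (assuming the standard 3560 show commands - the first confirms the port's trust state, the second shows the CoS-to-DSCP map actually in effect):

show mls qos interface fa0/17
show mls qos maps cos-dscp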

All of that at the port level is irrelevant, since the drops we're seeing are on egress rather than ingress, but I wanted to throw it in to make sure I was understanding that end correctly.

On the other end, voice traffic is classified the same way and trusted through the trunks, so any voice traffic is coming back into the switch and out the user port (in this case 17) with the same markings. This seems to be borne out by the incoming queue stats - it's not shown clearly here since the stats were cleared, but stats that include calls show large packet counts on DSCP 46 and a smaller number on 45.


OK, so classification seems fine - hopefully. Since it was set up to be simple, the queuing and policing mechanism seems pretty straightforward. Because priority-queue out is specified, voice traffic is dumped into queue 1 (does this wind up being shown as 02-01 on the queue map?) and that queue is serviced until empty. Even assuming a call is present - which it isn't - this should still be less than 120 kbps for a G.711 call, leaving plenty of service capability for data traffic.
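
The queue and threshold a given marking lands in can be read straight off the maps pasted above (assuming the standard 3560 show commands): DSCP 46 shows as 01-01 and CoS 5 as 1-1, which I read as queue 1, threshold 1 - the queue that becomes the expedite queue when priority-queue out is enabled.

show mls qos maps dscp-output-q
show mls qos maps cos-output-q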


So, since no call is present, the remaining data should be handled based on the queue and bandwidth settings. In this case - and they've been modified a little to try to resolve the issue - the buffers are set to 10 75 5 10, and since it's q1-t1 that is dropping packets (queue 1: 490 0 0), it seems I am giving about as much buffer space to the dropping queue as possible, correct?
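
It may also be worth confirming which queue-set the port is actually using and what buffer/threshold values that set carries (a sketch using the usual 3560 show commands):

show mls qos queue-set
show mls qos interface fa0/17 buffers
show mls qos interface fa0/17 queueing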


Then the only thing left to consider is the bandwidth allocation - how often that queue gets packets removed from the ring. Doing a 'show mls qos int queueing' gives:
Egress Priority Queue : enabled
Shaped queue weights (absolute) : 25 0 0 0
Shared queue weights : 25 25 25 25
The port bandwidth limit : 100 (Operational Bandwidth:100.0)
The port is mapped to qset : 1


At this point - since I am not specifying anything - I'm not sure if it is using shared or shaped mode. Even if I knew, I'm not sure how to translate the numbers.


If it is shaped, is 25% of the bandwidth reserved to service queue 1, with the remaining 75% shared evenly between the other three queues?

If it is shared, is it saying 'if there are packets in queue 1 they can have up to 25% of the bandwidth, then if there are packets in queue 2, queue 2 gets up to 25%, and so on, BUT if queue 1 is empty, queue 2 gets up to 33%'? Not sure at all how to read this one...
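
For what it's worth, my reading of the 3560 SRR defaults (an assumption based on the platform documentation, not something visible in the output above): shaped weights are inverse, so a weight of 25 caps that queue at 1/25 of the port rate, and a weight of 0 puts that queue in shared mode; shared weights are relative ratios split among whichever queues actually have traffic, so 25 25 25 25 is simply an even split of whatever bandwidth is left. With priority-queue out enabled, queue 1 is serviced as the expedite queue and its SRR weight is effectively ignored. The corresponding interface commands look like this:

interface FastEthernet0/17
 ! shaped: weight 25 = cap at 1/25 of the port rate; 0 = that queue runs in shared mode (these are the defaults shown above)
 srr-queue bandwidth shape 25 0 0 0
 ! shared: relative ratios among the queues that have traffic queued
 srr-queue bandwidth share 25 25 25 25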


In any case, even if we presume the queue that is dropping packets only gets 25% of the bandwidth, it should still be able to handle up to 20 Mbps (really 25 Mbps) with no drops or problems, shouldn't it? In which case, why am I seeing drops at only 2% utilization?


A couple of other things: in the case of fa0/17, I checked the end user's NIC config and verified that QoS marking on the PC was disabled, and flow control was disabled. Also, as I mentioned, he is running 100 Mbps/full. I chose fa0/17 because the issue seems more severe on this interface than on most of the other ports on this switch.


Also, I am not really getting any user complaints, but that isn't a definite sign that they're not being affected. And whereas I have always seen periodic drops with this configuration (minus the egress buffer changes made to try to reduce the problem), it seems to have become worse, at least on this switch/port, in the last two weeks with no obvious changes that would account for it.


Finally: in working on all of this, I came across the following statement in a Nortel document that doesn't seem to make sense to me, but maybe it's my understanding that is faulty:


"

Port based Configuration

Config terminal (Enter global configuration mode)
mls qos (Enable QoS globally)
mls qos map cos-dscp 0 8 16 40 32 46 48 56 (Define ingress CoS-to-DSCP mappings)
Interface level
interface GigabitEthernet1/0/1 (Specify the physical port)
switchport access vlan 10 (Native VLAN)
switchport mode access (Set the port to access mode)
switchport voice vlan 20 (Voice VLAN)
priority-queue out (Enable the egress expedite queue)
mls qos trust dscp (Trust IP Phone DSCP Values)
spanning-tree portfast (For Nortel IP Phones)


The Nortel IP Phone marks the voice payload with CoS 6 and DSCP EF when it sends the traffic to the switch. When the traffic enters the switch port Gi 1/0/1 (in our example), the switch trusts the CoS value. Then, the switch derives the DSCP value 48 for the CoS value 6 from the CoS-DSCP default table.
"



The description accompanying the configuration list seems totally wrong to me. Isn't it that the phone marks with CoS 6 and DSCP EF, and then the switch trusts the DSCP value? Is this a typo, or do I have a big hole in my understanding? If it is a typo, I really shouldn't need the "mls qos map cos-dscp 0 8 16 24 32 40 46 56" command, should I, since the port will automatically trust the DSCP and won't need to translate the CoS?


I would really like to thoroughly understand this, and understand why I'm seeing the drops when it seems like there is no reason for them. I've read quite a bit on it but obviously there are still some gaps. Should I modify egress sharing/shaping to try to improve the issue?

I REALLY appreciate anyone who can take the time to straighten me out!

Thanks a lot in advance!!!

k

PS: When I turn off QoS the packet drops stop, so it certainly seems related to QoS.
Thanks again!
 
I am fairly sure you are seeing the bug/behaviour detailed in Bug ID CSCsc96037 (and others, I think). I have hit this lots of times - with QoS enabled and some default settings, the switch drops traffic far too aggressively (you are seeing drops from the 1st threshold of the 1st queue - DSCP 0-15 maps to this queue/threshold by default). If you turn QoS off globally I suspect the drops will stop. The workaround is to increase the output thresholds for the queues - this is what I usually configure:
Code:
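! syntax: mls qos queue-set output <qset-id> threshold <queue-id> <threshold1> <threshold2> <reserved> <maximum>
! (each value is a percentage of the buffer space allocated to that queue)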
mls qos queue-set output 1 threshold 1 800 800 50 3200
mls qos queue-set output 1 threshold 2 560 640 100 800
mls qos queue-set output 1 threshold 3 800 800 50 3200
mls qos queue-set output 1 threshold 4 320 800 100 800
Good luck
Andy
 
Hi Andy,

Thanks for the reply. I will take a look at the body of that bug ID if I can get access to it. I am getting drops out of the second queue, threshold 1 (it's really confusing how the numbering starts with a zero in some places and a one in others!). But anyway, it shows up in a 'show mls qos int stats' this way:
queue 0: 0 0 0
queue 1: 490 0 0
queue 2: 0 0 0
queue 3: 0 0 0

So I'm not sure if this is the same queue/threshold you're speaking of or not.

I did find part of the problem. If you look closely at the config portion, you will see that I was making config changes to queue-set 2, but had queue-set 1 applied! I modified queue-set 1 with the settings I had applied to queue-set 2, and that significantly improved the problem. But I am still seeing some queue drops at low utilization and packet counts, while at other times I can see 40% to 80% utilization with no drops...
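
In case it helps anyone else hitting this, the other way to line those up (a sketch using the standard 3560 commands) is to attach the port to the queue-set that was actually tuned:

interface FastEthernet0/17
 ! move the port from the default queue-set 1 onto queue-set 2
 queue-set 2

A 'show mls qos interface fa0/17 queueing' then confirms which set the port is mapped to ("The port is mapped to qset : ...").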

So in the parlance of your command suggestion:
mls qos queue-set output 1 threshold 1 800 800 50 3200
mls qos queue-set output 1 threshold 2 560 640 100 800
mls qos queue-set output 1 threshold 3 800 800 50 3200
mls qos queue-set output 1 threshold 4 320 800 100 800

I would want to give a larger threshold value to Q2-T1:
mls qos queue-set output 1 threshold 2 800 640 100 800

or some such?

thx
k
 
Hi, I'm still looking at this issue, trying to figure out what is going on.

I have turned off QoS on a couple of switches ('no mls qos', but left the individual port configurations the same) but am still seeing drops.

With QoS turned off, one would presume the drops are general congestion - especially at the user edge, where busy-PC issues might contribute. So I wanted to see if I could catch any instances of packets building up in the output queues.

I wrote some scripts and macros that essentially did a snapshot of 'show int' every 20 seconds or so, and looked for instances of 'Queue: x/' where x was greater than zero.
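
For a quick manual spot check, standard IOS output filtering gives roughly the same view as the script snapshots:

show interfaces | include line protocol|Input queue|Output queue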

What I found, after several days of watching the switches most prone to the behavior, was that I NEVER saw ANY packets in output queues. I often saw packets in the input queue for Vlan1, and once in a great while I would see packets in input queues for Fa or Gi interfaces, but NEVER in output queues.

Does anyone have a clue what could be going on? In most cases, no error counters are incrementing for these interfaces. Is there some mechanism besides congestion that could cause output queue drops?

While the counts aren't critically high at this point, they are happening more frequently than in the past and I would like to understand what is going on...

Here is an example snapshot:


reference time is D497F931.F38DA0F7 (08:12:01.951 cst Wed Jan 9 2013)
Vlan1 is up, line protocol is up
Input queue: 3/75/15/0 (size/max/drops/flushes); Total output drops: 0
Output queue: 0/40 (size/max)
GigabitEthernet1/0/17 is up, line protocol is up (connected)
Input queue: 2/75/0/0 (size/max/drops/flushes); Total output drops: 175085
Output queue: 0/40 (size/max)
 
Those drops aren't incrementing as a result of that interface being briefly in blocking mode after a restart of whatever device is at the end of it?
 
A good thought, and I suspect it explains the morning bursts that seem to happen when people are coming in and logging in for the day (powering up or restarting their PCs). Is there any way to verify it other than circumstantially?

However, we see drops throughout the day, so it wouldn't be the only explanation.

Also, I am seeing drops on the core switch (less frequently). Additionally, I have received a couple of trouble tickets that may indicate the drops are affecting users, though the reported issues could be unrelated.

If the interfaces were in blocking mode during these drops, would the switch drop the packets without first queueing them? Otherwise I should still see packets in the output queues, unless it happens so quickly that I never capture it.

Also, a while back I implemented the following QoS changes:
mls qos map dscp-cos 46 to 6
mls qos map cos-dscp 0 8 16 24 32 40 46 56
mls qos srr-queue input bandwidth 98 2
mls qos srr-queue input buffers 95 5
mls qos srr-queue input priority-queue 2 bandwidth 1
mls qos srr-queue input dscp-map queue 1 threshold 1 40
mls qos queue-set output 1 threshold 1 3100 3100 100 3200
mls qos queue-set output 2 threshold 1 3100 3100 100 3200
mls qos queue-set output 1 buffers 10 75 5 10
mls qos queue-set output 2 buffers 10 75 5 10
!
!

Since I have disabled QoS on the switches in question (not the core, though), I am presuming these commands have no effect on switch operation and therefore cannot be related to the problem. Is this correct?
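
For what it's worth, a quick way to confirm the global state (assuming the standard show command, which reports whether QoS is globally enabled or disabled):

show mls qos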
 
I've recently been involved in this QoS-causing-drops issue myself.
If you've got rid of the default QoS, or fixed queue 2, and it's still happening... I reckon I'd set up a test with a host sending giant frames (or various sizes) and compare what a dropped giant frame looks like with what you're seeing. I've never done any troubleshooting work on giant frames, but from memory, the switch will accept frames even if they are too big, and it's not until it goes to egress them that it decides whether to drop them or not. From memory. I'd do it just to compare, anyway. It might help me understand these drops too.

Having said that, maybe take a step back and start from the beginning again - you've made some changes, you might be operating on some assumptions that no longer apply. Concentrate on "always up" interfaces for the moment.

I just noticed you say Nortel phones use a CoS of 6. In Cisco-world this is not a voice value, and the default queuing is therefore not good for voice marked CoS 6. In fact, according to the IETF RFCs this is not a voice value either, but I'm pretty sure I've seen HP(?) do the same thing. I've set up lots of Nortel phones, but I don't remember noticing this before. Could you change the Nortel CoS values to the ones Cisco usually uses (and the ones the IETF defines as proper voice values) - 5 for voice, 4 for video? It doesn't matter what you use for signalling.
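
If changing the phone markings isn't practical, the other direction would be to re-map CoS 6 into the egress priority queue on the switch side - a sketch along these lines, assuming the 3560 srr-queue output map commands (DSCP 46 already lands in queue 1 per the dscp-output-q map shown earlier in the thread):

! map CoS 6 into egress queue 1, the expedite queue when priority-queue out is enabled
mls qos srr-queue output cos-map queue 1 threshold 3 6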
 
Hi Vince,

Thanks for the input. I rebooted the core stack over the weekend and so far - knock on wood - I haven't seen any new queue drops there.
I'm not sure how that could affect the situation other than possibly memory fragmentation. I watch the core switches' memory and there wasn't a huge change anywhere (the levels did increase, of course). The largest contiguous block of I/O memory, for instance, went from around 3360-3499 KB to 3760-3952 KB across the five switches.

Any thoughts there?

Yeah, the Nortel phones use a rather different config. That's why I played with the setup a little. But of course I turned off QoS on some of the switches and continued to get drops - unless some of those commands are still in effect even with 'no mls qos'?

The workgroup switches had already been rebooted, so even if the reboot takes care of the core, I still have those to look into. I will start trying to see how I can watch for giants with NI Observer and either an NTAP or a SPAN session.

I just thought of something: if any interfaces are receiving giants, shouldn't I be seeing those in the 'show interface' stats?

Thanks for your help
k
 