
Queue time does not match

maya davidson
Level 1

Hello, 

I have a network topology as below:

R2 - R3 - R4
         |
         R1

My bandwidths are 100 Mbps, the queue sizes are 1000 packets (FIFO),
and I am sending 12,500 frames/second with 1000 bytes per frame,
one flow from R2 to R3
and another from R1 to R3.
Loss is around 1.96% for both flows, but the delay is 0.007 s for the first flow and 0.4 s for the second flow,
while in theory I can calculate the queuing time as below:
queue size / (bandwidth / packet size) = 1000 packets / (100 Mbps / 8000 bits) = 1000 / 12,500 = 0.08 s
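
For reference, the same calculation worked through as a minimal Python sketch (assuming a single FIFO drained at line rate):

# Back-of-the-envelope queuing delay for a full 1000-packet FIFO at 100 Mbps.
link_bps = 100e6          # 100 Mbps link
frame_bytes = 1000        # 1000-byte frames
queue_packets = 1000      # FIFO queue size

service_rate_pps = link_bps / (frame_bytes * 8)        # 12,500 packets/s
max_queue_delay_s = queue_packets / service_rate_pps   # 1000 / 12,500

print(service_rate_pps)      # 12500.0
print(max_queue_delay_s)     # 0.08 (seconds)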

R2 and R3 are Cisco Catalyst 9200 switches, and R1 is a C1111 router.

I don't know where I am going wrong: with the same configs I am getting delays, neither of which matches the calculation.

Thanks for your help in advance

30 Replies

@Joseph W. Doherty , @Ramblin Tech 

On the Cisco Catalyst 9200, I have found
qos queue-softmax-multiplier 1200, which increased my delay to 0.093 s, close to my theoretical calculation.
I wonder if there is anything similar on the C1111 that would bring the delay down on those routers?
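
For what it's worth, here is a rough back-calculation (a sketch only, assuming the packets wait in a single FIFO that drains at line rate) of the queue depth implied by that 0.093 s:

# Hypothetical back-calculation of the effective queue depth behind the observed delay.
service_rate_pps = 100e6 / (1000 * 8)     # 12,500 packets/s for 1000-byte frames
observed_delay_s = 0.093

implied_depth = observed_delay_s * service_rate_pps
print(round(implied_depth))               # ~1162 packets, in the ballpark of the 1000-packet queue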

Hello @maya davidson ,

the C1111 router may have a software queue in action here, which causes a big change in observed behaviour when traffic is near congestion, as noted by @Ramblin Tech.

Hint: the software queue may hold many packets, so if they are queued there before transmission you get a very high delay.

Hope to help

Giuseppe

 

On a 9200, the queue-softmax-multiplier would adjust logical egress queue limits.

On your router, see what the range limits are for the egress interface's tx-ring-limit. Try its max value with a minimum hold queue.

BTW, a software-based router's performance is often much more impacted by packet size than a switch's is. I.e., per-packet processing latency may increase as packets become smaller.

In other words, the effective per-packet processing time might grow. This would be like Ethernet's IPG needing to increase as frame size decreases.

This may be unexpected until you recognize that on a software-based router the CPU is much more involved with the physical data plane.

A possible example. . .

A large packet arrives and the CPU begins processing its forwarding logic. Concurrently, another large packet is incoming on the interface. The CPU finishes forwarding and then waits to process that next packet.

As above, but now small packets are being received. While processing forwarding logic, the CPU is interrupted to accept a newly received packet. Such interruptions can (very slightly) add additional (and variable [jitter]) latency to the packet being actively forwarded.
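
A toy illustration of that effect (the per-packet CPU cost below is a made-up number, purely to show the trend, not a measured C1111 figure):

# Toy model: a fixed per-packet CPU cost matters more as frames get smaller,
# because filling the same 100 Mbps link takes many more packets per second.
LINK_BPS = 100e6             # 100 Mbps
CPU_COST_PER_PKT_S = 10e-6   # assumed 10 us of CPU work per packet (hypothetical)

def cpu_share(frame_bytes):
    pps = LINK_BPS / (frame_bytes * 8)    # packets/s needed to fill the link
    return pps * CPU_COST_PER_PKT_S       # fraction of one CPU-second consumed

for size in (64, 256, 1000, 1500):
    print(f"{size:4d}-byte frames: {cpu_share(size):.1%} of a CPU for per-packet work")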

I mention the foregoing because you haven't described the purpose of your testing.

At a basic level, or conceptually, switches and routers appear to be simple devices. The actual hardware architecture is often pretty darn complex, especially considering costs and other design factors.

Real world network usage combinations can be pretty complex too.

When you start an analysis looking to confirm expectations at the millisecond level, considerations like the above may arise.

Unfortunately, a device like a Cisco router or switch uses proprietary elements, and some results cannot be explained (publicly) beyond educated guesses.

I have a network and I want to build a simulated version of it in OMNeT++, and I need to be able to tune some of the parameters.

This is the output of show controllers for the C1111-family router:
GE4 Statistics:
Input:
pkts 126428665, bytes 21185786830406
unicast 123866593, multicast 2561953, broadcast 119
total drops 0, total errors 8, overrun 458, crc 3
pkts64 2578173, pkts65to127 12606055, pkts128to255 2801522772,
pkts256to511 27783476, pkts512to1023 1848441697, pkts1024toMax 4023431092
oversize 0, undersize 0, jabber 0, fragments 0,
collision 0, pause 0, align 3
Output:
pkts 3500829855, bytes 9648697262426,
unicast 3499366153, multicast 1463401, broadcast 301
total drops 0, total errors 0, underrun 0
collision 0, pause 0
defer 0, late 0, excessive 0, fcs 0
ATU LearnLimit: 0, LearnCnt: 0
RxQ rsvd: 1, cnt: 0, XON Limit: 19, XOFF Limit: 38
TxQ total cnt: 53
TxQ 0 cnt: 78
TxQ 1 cnt: 0
TxQ 2 cnt: 0
TxQ 3 cnt: 0
TxQ 4 cnt: 0
TxQ 5 cnt: 0
TxQ 6 cnt: 0
TxQ 7 cnt: 0


I couldn't find anything that would help me tune this value to see its effect on the end-to-end result, which is very high (0.4-0.5 s).

I resolved the issue on the C9200 with the softmax-multiplier command, but I'm still stuck on this one.

BPE_LAB_C1121x_0002#sh buffers
Tracekey : 1#32a0dd251a9736b9f75e3ab6ddca0ad6

Buffer elements:
725 in free list
146382498 hits, 0 misses, 1019 created

Public buffer pools:
Small buffers, 104 bytes (total 1200, permanent 1200):
1198 in free list (200 min, 2500 max allowed)
75661926 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Middle buffers, 600 bytes (total 900, permanent 900):
899 in free list (100 min, 2000 max allowed)
65260082 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Big buffers, 1536 bytes (total 900, permanent 900, peak 901 @ 6w2d):
900 in free list (50 min, 1800 max allowed)
48378373 hits, 0 misses, 1 trims, 1 created
0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 100, permanent 100, peak 101 @ 6w2d):
100 in free list (0 min, 300 max allowed)
0 hits, 0 misses, 1 trims, 1 created
0 failures (0 no memory)
Large buffers, 5024 bytes (total 100, permanent 100, peak 101 @ 6w2d):
100 in free list (0 min, 300 max allowed)
0 hits, 0 misses, 1 trims, 1 created
0 failures (0 no memory)
VeryLarge buffers, 8280 bytes (total 100, permanent 100):
100 in free list (0 min, 300 max allowed)
0 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
Huge buffers, 18024 bytes (total 20, permanent 20, peak 21 @ 6w2d):
20 in free list (0 min, 33 max allowed)
0 hits, 0 misses, 1 trims, 1 created
0 failures (0 no memory)

Interface buffer pools:
CF Small buffers, 104 bytes (total 101, permanent 100, peak 101 @ 6w2d):
101 in free list (100 min, 200 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
Generic ED Pool buffers, 512 bytes (total 101, permanent 100, peak 101 @ 6w2d):
101 in free list (100 min, 100 max allowed)
0 hits, 0 misses
CF Middle buffers, 600 bytes (total 101, permanent 100, peak 101 @ 6w2d):
101 in free list (100 min, 200 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
Syslog ED Pool buffers, 600 bytes (total 1057, permanent 1056, peak 1057 @ 6w2d):
1025 in free list (1056 min, 1056 max allowed)
11825 hits, 0 misses
EOBC0 buffers, 1524 bytes (total 256, permanent 256):
256 in free list (0 min, 256 max allowed)
0 hits, 0 fallbacks
CF Big buffers, 1536 bytes (total 26, permanent 25, peak 26 @ 6w2d):
26 in free list (25 min, 50 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
IPC buffers, 4096 bytes (total 378, permanent 378):
377 in free list (126 min, 1260 max allowed)
1 hits, 0 fallbacks, 0 trims, 0 created
0 failures (0 no memory)
CF VeryBig buffers, 4520 bytes (total 3, permanent 2, peak 3 @ 6w2d):
3 in free list (2 min, 4 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
CF Large buffers, 5024 bytes (total 2, permanent 1, peak 2 @ 6w2d):
2 in free list (1 min, 2 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
IPC Medium buffers, 16384 bytes (total 2, permanent 2):
2 in free list (1 min, 8 max allowed)
0 hits, 0 fallbacks, 0 trims, 0 created
0 failures (0 no memory)
Private Huge IPC buffers, 18024 bytes (total 1, permanent 0, peak 1 @ 6w2d):
1 in free list (0 min, 4 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
Private Huge buffers, 65280 bytes (total 1, permanent 0, peak 1 @ 6w2d):
1 in free list (0 min, 4 max allowed)
0 hits, 0 misses, 393 trims, 394 created
0 failures (0 no memory)
IPC Large buffers, 65535 bytes (total 17, permanent 16, peak 17 @ 6w2d):
17 in free list (16 min, 16 max allowed)
0 hits, 0 misses, 63720 trims, 63721 created
0 failures (0 no memory)

Header pools:
Header buffers, 0 bytes (total 266, permanent 256, peak 266 @ 6w2d):
10 in free list (10 min, 512 max allowed)
253 hits, 3 misses, 0 trims, 10 created
0 failures (0 no memory)
256 max cache size, 256 in cache
71718662 hits in cache, 0 misses in cache

Particle Clones:
1024 clones, 0 hits, 0 misses

Public particle pools:
F/S buffers, 256 bytes (total 384, permanent 384):
128 in free list (128 min, 1024 max allowed)
256 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
256 max cache size, 256 in cache
0 hits in cache, 0 misses in cache
Normal buffers, 512 bytes (total 512, permanent 512):
384 in free list (128 min, 1024 max allowed)
128 hits, 0 misses, 0 trims, 0 created
0 failures (0 no memory)
128 max cache size, 128 in cache
0 hits in cache, 0 misses in cache

Private particle pools:
lsmpi_rx buffers, 416 bytes (total 8194, permanent 8194):
0 in free list (0 min, 8194 max allowed)
8194 hits, 0 misses
8194 max cache size, 0 in cache
134982757 hits in cache, 0 misses in cache
lsmpi_tx buffers, 416 bytes (total 4098, permanent 4098):
0 in free list (0 min, 4098 max allowed)
4098 hits, 0 misses
4098 max cache size, 4097 in cache
89785703 hits in cache, 0 misses in cache

Could you explain exactly what you're doing to obtain a .4 second delay?

I was just reading a Miercom report on the C1111, and such a delay seems abnormal.

I am sending 13,600 frames/second, 1000 bytes per frame; my bandwidth can handle 12,500 frames/second at 100 Mbps,
and on Ixia I am seeing 8% loss and 0.4 s end-to-end delay.

The 8% loss rate makes sense, but it's unclear how the 0.4 sec delay is computed (by Ixia?).

What's the full path topology and how are device(s) configured?

At 100 Mbps, I compute it would take almost 0.082 ms per frame. If your egress queue is still 1,000 packets, a full queue would take only about 0.082 seconds to transit. I.e., your 0.4 s value is about 5x too much (that is the queue time you consider wrong, correct?).
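
Rough arithmetic behind those figures (assuming 1000-byte frames plus the usual 20 bytes of Ethernet preamble and inter-frame gap):

# Per-frame time on a 100 Mbps wire, and the transit time of a full 1000-packet queue.
frame_bits = (1000 + 20) * 8           # frame plus preamble/IFG overhead = 8,160 bits
per_frame_s = frame_bits / 100e6       # ~0.0000816 s, i.e. ~0.082 ms per frame

full_queue_s = 1000 * per_frame_s      # ~0.082 s to drain a full 1,000-packet queue
print(per_frame_s, full_queue_s)
print(0.4 / full_queue_s)              # ~4.9, so 0.4 s is roughly 5x the expected maximum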

BTW, while looking into C1111 performance, I came across reports that it slows down when you bridge traffic. So, again, we need exact info.

Is there something about the packet headers and C1111 config that is forcing a punt and process switching? As you surmise, Joe, 400 msec is way too long for a packet being forwarded in a fast path (i.e., CEF), even with CPU forwarding.

Disclaimer: I am long in CSCO

Yep, something I wondered about.  One of the reasons I would like as many details as possible.

I did find some info (on these forums, https://community.cisco.com/t5/routing/isr1100-routing-performance-difference-sub-interface-vs-service/td-p/3315518) that someone was having very slow performance on a C1100 doing bridging. TAC was contacted and said that, on that platform, it's a slow CPU process (sounds like the equivalent of punting packets). That reference said a later IOS version did appear to be faster. (No real surprise that might happen, as, over the years, I've seen some features move from the slow path to the fast path, or take better advantage of platform hardware.)

I've also mentioned the interface may have its own hardware FIFO, so the overall queuing depth could exceed 1,000, although I would think it unlikely that queue is around 4k packets; worth checking.

BTW, just being curious, I went into PT (Packet Tracer) using a 4331 to see if it "knew" of tx-ring-limit, and if so, what the range was:

Obtained:

Router(config)#int g0/0/0
Router(config-if)#tx
Router(config-if)#tx-ring-limit ?
  <1-32767>  Number (ring limit)

Surprised to see a max value of 32K!  (Might be in "particles", not packets.)

That queue might be sized large enough to produce the 0.4 second delay. I'm trying to confirm how to find the current setting (if not explicitly configured; possibly via show controllers).
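
A quick sanity check (illustrative arithmetic only, and assuming the limit counts packets rather than particles) of whether a large tx ring could account for 0.4 s:

# Queue depth needed, at 100 Mbps line rate with 1000-byte frames, to produce 0.4 s,
# compared against the 32,767 maximum seen above.
per_frame_s = (1000 + 20) * 8 / 100e6      # ~82 us per frame on the wire
depth_for_0_4_s = 0.4 / per_frame_s

print(round(depth_for_0_4_s))              # ~4,902 frames
print(32767 * per_frame_s)                 # ~2.67 s if the ring were full at its maximum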

This platform doesn't have the tx-ring-limit option, and I wasn't able to find anything that would help me configure it.
This is my question: how can I configure this?
This is part of the show controllers output for this interface:

Output:
pkts 3500829855, bytes 9648697262426,
unicast 3499366153, multicast 1463401, broadcast 301
total drops 0, total errors 0, underrun 0
collision 0, pause 0
defer 0, late 0, excessive 0, fcs 0
ATU LearnLimit: 0, LearnCnt: 0
RxQ rsvd: 1, cnt: 0, XON Limit: 19, XOFF Limit: 38
TxQ total cnt: 53
TxQ 0 cnt: 78
TxQ 1 cnt: 0
TxQ 2 cnt: 0
TxQ 3 cnt: 0
TxQ 4 cnt: 0
TxQ 5 cnt: 0
TxQ 6 cnt: 0
TxQ 7 cnt: 0


@maya davidson unfortunately, I don't see a way to help you if you're unable to provide the detailed information that's been requested. (Of course, even if the requested detailed information is provided, I cannot guarantee success.)

Suggest you contact TAC.

If you do find the cause, please post that here too, even if months from now.

Possibly you don't understand the level of detail being requested. For example, your last topology reply shows Ixia traffic transiting R1 or R4 would also transit one or two other devices. I have no idea which, let alone the actual interfaces being transited or the relevant configuration information or stats.

 

Yes, I get this from Ixia, as the average end-to-end latency.
The topology is as below:
                 ixia
                   |         
IXIA - R1 -  R3 - R2 - IXIA
                    |
                   R4 - ixia

Routers R1 and R4 are from the C1111 and C1121 families, and on both of these routers I see a high delay, around 0.4 s.
There are no policies defined or configured on the links, and I set the queue sizes to 1000 with hold-queue.

Hello @maya davidson ,

Let me recap what you are doing in your tests:

a) you generate a constant packet rate in excess of the allowed outgoing packet rate

>> I am sending 13,600 frames/second, 1000 bytes per frame; my bandwidth can handle 12,500 frames/second at 100 Mbps,
and on Ixia I am seeing 8% loss and 0.4 s end-to-end delay

b) your network topology is:

                      ixia3
                        |
IXIA1 - R1 - R3 - R2 - IXIA
                    |
                   R4 - ixia4

So Ixia4, connected to R4, is the rx port on the instrument.

Ixia1 is connected to R1 and Ixia3 is connected to R3.

You have defined at least two traffic flows: one sending from Ixia1 to Ixia4 and one sending from Ixia3 to Ixia4.

The destination address of the flows is the IP address assigned to Ixia4, or an emulated downstream subnet with next-hop = the Ixia4 IP address.

The cumulative rate of traffic over the two flows, Ixia1 to Ixia4 and Ixia3 to Ixia4, is:

>> 13,600 frames/second, 1000 bytes per frame; my bandwidth can handle 12,500 frames/second at 100 Mbps

You have a packet loss of 8%

13600 / 12500 = 108.8%, i.e. 8.8% oversubscription, and you get 8% packet loss.
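
For reference, the expected steady-state tail-drop loss from that oversubscription (simple arithmetic, ignoring bursts):

# Expected loss when a constant 13,600 pps is offered to a link that carries at most 12,500 pps.
offered_pps = 13600
capacity_pps = 12500

expected_loss = (offered_pps - capacity_pps) / offered_pps
print(f"{expected_loss:.1%}")    # ~8.1%, matching the ~8% observed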

Ixia measures the delay by adding a timestamp to the payload of each sent packet, reading it back on the rx port, and taking the difference between the time of reception and the timestamp.

Both times are taken from the Ixia system itself, so there are no clock issues here.
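
A minimal sketch of that measurement idea (illustrative only, not Ixia's actual implementation):

# The tester stamps each test frame at transmit time and subtracts that stamp from its
# own receive time; both ends use the same clock, so there is no skew to correct for.
import time

def make_payload():
    return {"tx_timestamp": time.monotonic()}    # stamp carried inside the test frame

def one_way_latency(payload):
    return time.monotonic() - payload["tx_timestamp"]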

However, more than 8% of the packets are lost.

I would suggest you try a lighter load, something like 101% of the max packet rate on the 100 Mbps link.

This will allow you to run tests in a condition of light overload.
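
For example, at 101% load the numbers work out roughly like this (same simple arithmetic as above):

# At 101% of the 12,500 pps line rate, the steady-state loss should drop to about 1%.
capacity_pps = 12500
offered_pps = 1.01 * capacity_pps          # 12,625 pps

expected_loss = (offered_pps - capacity_pps) / offered_pps
print(round(offered_pps), f"{expected_loss:.2%}")   # 12625  ~0.99%

If the FIFO behaves as expected, the queue should then stay nearly full, and the delay should settle near the ~0.08 s figure calculated earlier (assuming the 1000-packet hold queue).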

Compare the results and let us know if they are a better fit to your expectations.

Explanation:

I say this because in the past I did a lot of tests with traffic generator instruments similar to Ixia (at that time it was the Agilent Router Tester, Ixia, or SmartBits, which was a simple traffic generator).

We made extensive tests to check the behaviour of Modular QoS, like CBWFQ with LLQ.

We noticed that when the offered load was much higher than the speed of the outgoing interface, we started to have packet losses in all traffic classes regardless of QoS settings and traffic composition (the traffic mix was prepared so that only one or two classes were non-conforming; we had to play with the total number of flows per class and the packet size per flow to achieve this).

With a slight overload, like 103%, the QoS behaviour was in line with the configuration, with packet losses confined to the non-conforming classes (a traffic class sending more than its own guaranteed rate).

Hope to help

Giuseppe

 


@Giuseppe Larosa wrote:

I say this because in the past I did a lot of tests with traffic generator instruments similar to Ixia (at that time it was the Agilent Router Tester, Ixia, or SmartBits, which was a simple traffic generator).

We made extensive tests to check the behaviour of Modular QoS, like CBWFQ with LLQ.

We noticed that when the offered load was much higher than the speed of the outgoing interface, we started to have packet losses in all traffic classes regardless of QoS settings and traffic composition (the traffic mix was prepared so that only one or two classes were non-conforming; we had to play with the total number of flows per class and the packet size per flow to achieve this).

With a slight overload, like 103%, the QoS behaviour was in line with the configuration, with packet losses confined to the non-conforming classes (a traffic class sending more than its own guaranteed rate).


Interesting, that's not behavior I've seen, but then I haven't done exactly what you describe. What I've often done is send one UDP flow at a rate exceeding the egress interface's total bandwidth capacity by anywhere from 110% to 200%, while also sending other traffic.

Probably not relevant to go into a sidebar discussion here, but if you're willing to discuss further, please drop me a private message.
