Today I ran into a problem related to DLSw circuits.
I have two locations, each with a DLSw peer. During certain periods of the day, I receive complaints about applications freezing, sessions timing out, and so on.
I only support the routers that form the DLSw tunnel, which have serial interfaces connected to the SNA environment.
I've checked almost everything...
WAN circuit utilization - ok
QoS policy - ok
WAN circuit - ok
No latency between the peers // No packet loss // No high utilization, etc.
Also, checked in the serial interface for the SNA environment:
FRMRs - ok
RNRs - ok (a steady increase would indicate possible congestion; hold queue less than 200)
Hold queue - ok (default is 200; more than 100, i.e. 50%, in the queue could indicate congestion)
Output queue - ok
No errors like CRC or any other kind of error.
No drops at all.
They are working in half-duplex mode.
I could not run a debug for long because the router's CPU would probably reach 100%.
The only thing I noticed, through "sh dlsw cir det", was Congestion: High(08).
Below is one of the "sh dlsw cir det" outputs:
Index local addr(lsap) remote addr(dsap) state uptime
XXXXXXX XXXXXXXXXX(04) XXXXXXXXX(04) CONNECTED 1w1d
PCEP: 66822634 UCEP: 667963C4
Port:Se0/1/0 peer X.X.X.X(2065)
Flow-Control-Tx CW:20, Permitted:35; Rx CW:1, Granted:1; Op: Repeat
Congestion: High(08), Flow Op: Half: 250/0 Reset 12363/0
RIF = --no rif--
Bytes: 44477541/511133894 Info-frames: 785165/2077339
XID-frames: 2/1 UInfo-frames: 0/0
XXXXXXXXX#sh int Se0/1/0
Serial0/1/0 is up, line protocol is up
Hardware is GT96K Serial
MTU 1500 bytes, BW 1544 Kbit, DLY 20000 usec,
reliability 255/255, txload 6/255, rxload 1/255
Encapsulation SDLC, loopback not set
CRC checking enabled
Router link station role: PRIMARY (DCE)
Router link station metrics:
slow-poll 10 seconds
T1 (reply time out) 3000 milliseconds
N1 (max frame size) 12016 bits
N2 (retry count) 20
poll-pause-timer 10 milliseconds
k (windowsize) 7
sdlc vmac: XXXXXXXXX--
sdlc addr 75 state is CONNECT
cls_state is CLS_IN_SESSION
VS 4, VR 4, Remote VR 5, Current retransmit count 0
Hold queue: 60/200 IFRAMEs 9393/3476
TESTs 0/0 XIDs 0/0, DMs 0/0 FRMRs 0/0
RNRs 0/0 SNRMs 0/0 DISC/RDs 0/0 REJs 0/0
Poll: set, Poll count: 0, chain: 75/75
Last input never, output 00:00:00, output hang never
Last clearing of "show interface" counters 00:07:38
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 3000 bits/sec, 10 packets/sec
5 minute output rate 41000 bits/sec, 21 packets/sec
4884 packets input, 203750 bytes, 0 no buffer
Received 0 broadcasts, 0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
9459 packets output, 2348202 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
0 carrier transitions
DCD=up DSR=up DTR=up RTS=down CTS=up
On the other side, we use a bridge group, since it is connected to a FastEthernet interface.
I want to understand the Congestion field.
I've looked for it on the Cisco website, but I could not find any specific information about it, nor about the Resets field.
Any suggestions or explanations about the Congestion and Reset fields?
thanks in advance
It looks like something on the remote (Ethernet) side is pumping traffic without any kind of SNA-level pacing to slow it down. Probably a file transfer of some sort. You can see that this peer has sent 250 HWOs (half-window operators) and 12363 RWOs (reset-window operators).
Congestion: High(08), Flow Op: Half: 250/0 Reset 12363/0
Currently the DLSw window is set to 1. This means that between the DLSw peers, each packet effectively requires an "ack" before the next one is sent. The reason is that this peer is receiving more traffic than it can ship out locally (i.e., to the local SDLC interface). This is visible in the hold queue (60/200). DLSw queues up some traffic and then starts throttling back the remote peer by sending the HWOs and RWOs.
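The halve/reset behavior described above can be sketched as a toy model (illustration only; the real adaptive-pacing algorithm in RFC 1795 carries window grants inside SSP messages, and the class and thresholds here are my own invention):

```python
# Simplified sketch of DLSw receiver-side adaptive pacing.
# Not the actual IOS algorithm -- just the window halve/reset idea.

class PacingWindow:
    """Pacing state for one DLSw circuit at the receiving peer."""

    def __init__(self, initial=20, maximum=35):
        self.window = initial      # current grant size ("CW" in the show output)
        self.maximum = maximum     # largest window we will permit
        self.half_ops = 0          # HWO counter: window was halved
        self.reset_ops = 0         # RWO counter: window was reset to 1

    def on_congestion(self, severe):
        """Local egress (e.g. the SDLC hold queue) is backing up."""
        if severe:
            self.window = 1        # RWO: sender now needs a grant per frame
            self.reset_ops += 1
        else:
            self.window = max(1, self.window // 2)  # HWO: halve the window
            self.half_ops += 1

    def on_drain(self):
        """Congestion cleared: grow the window again, up to the maximum."""
        self.window = min(self.maximum, self.window + 1)


w = PacingWindow()
w.on_congestion(severe=False)   # HWO: 20 -> 10
w.on_congestion(severe=True)    # RWO: window collapses to 1,
print(w.window)                 # matching "Rx CW:1, Granted:1" in the output
```

With the window pinned at 1, throughput toward the congested side collapses to one frame per round trip, which is exactly what the freezing applications would experience.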
What is the SDLC-attached device? How many LUs is it supporting? What is the actual line speed of the device (I assume it's a lot less than the 1544 kbit BW in the display)? What is the max packet size the device can support? If it's a real 3x74 it will be limited, but an SDLC adapter in a PC should be able to take larger frames. By default we only send 265 bytes. The interface stats indicate that the SDLC device isn't throttling back (no RNRs, etc.), but it is basically controlled from the router (we are the SDLC primary).
There are some other bits of tuning worth discussing that would keep unwanted traffic off the WAN and maximize throughput between the peers, but that is not the issue at the moment.
A sniffer trace would of course be useful to identify what is streaming packets. If it were interactive traffic, it would be somewhat self-regulating, since there is a limit to how quickly you can enter data.
Any chance of opening a TAC case so we can set up a WebEx?
Great post guys!
Mathew, I guess you're right...
The actual speed (clock rate) is 56k for each serial interface.
I don't know exactly how many LUs they have, but I am sure they have many.
I am really interested in this, but I haven't had time yet to look into everything you wrote. I've had a hard week with several other critical issues!
I have a contract that lets me open a TAC case.
Do you want to go further with this through TAC? How can I assign the case to you (in case you want to get involved)?
I can get more specific information from the SNA guys, since I only have access to the routers (the DLSw tunnel). After the initial Cisco checks, I can also set up a bridge call for this.
Let me know your thoughts...
And thanks for your help.
Thanks for your help.
I opened a TAC case for this. Unfortunately, we could not reproduce the issue, and it seems the congestion went away.
We are waiting for new congestion events so we can collect some outputs.
It was happening for several days; when I opened the TAC case, it just disappeared...
As soon as I have any news, I will update this thread.
Ok! Issue solved.
For several days we did not see any issues; then, during the peak hours at the beginning of this month, the issue came back.
After troubleshooting, as you guys stated, we found high utilization.
The clock rate was configured as 56k, and some interfaces were even lower. I did some calculations and found they were using up to 40k on normal days.
According to our calculations, we should have around 75k available for each controller at the far-end device to provide some headroom and avoid congestion.
We increased the clock rate to 128000 and the problem went away.
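For anyone following along, the sizing arithmetic works out as below (the 40k observed load and 75k target are the figures from this thread; the utilization helper is just for illustration):

```python
# Back-of-the-envelope sizing for the serial links to the SDLC controllers.

observed_peak_kbps = 40        # measured usage on normal days
target_per_controller = 75     # desired capacity per controller, incl. headroom

old_clock_kbps = 56            # original clock rate
new_clock_kbps = 128           # clock rate after the change


def utilization(load_kbps, clock_kbps):
    """Fraction of the serial clock consumed by the offered load."""
    return load_kbps / clock_kbps


print(f"old: {utilization(observed_peak_kbps, old_clock_kbps):.0%}")
print(f"new: {utilization(observed_peak_kbps, new_clock_kbps):.0%}")
assert new_clock_kbps >= target_per_controller   # 128k covers the 75k target
```

At 56k the links were already running over 70% busy on a normal day, so any peak-hour burst would push them into congestion and trigger the DLSw pacing behavior seen earlier; at 128k the same load is around 31% of the line.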
Thanks everyone that helped me on this!