06-30-2003 06:46 AM
Setup
6509 (IOS 12.1(13)E6) running standard DLSW ( i.e. non ethernet redundancy)over FastEthernet to 7513 (12.1(14)) connected to Mainframe over CIP2/IBM Channel.
1.
Our PC running LU6.2 over ethernet connected to the 6509 spordic loose thier connection due to the DLSW peer drops. (on the DLSW peer there is max 100 circuits)
2.
The 6509 have orther DSLW peers that run to other routers that are unaffected by the dropping peer.
3.
Likewise the 7513 that is the peer for the dropping DLSW connection has other peers that are unaffected.
4.
On a DLSW debug I see that the drop is caused due to failed DLSW keepalives.
5. As I found the CSCdw58350 I would appreciate some details how to determine if its the CSCdw58350 thats causing my problems ?? Or any other suggestions for the cause - thanks.
Regards
Gert Schaarup
06-30-2003 11:07 AM
Just before the DLSw peer drop, do a show tcp on the 7513. Look at the KRTT value. If the value is larger than 4.8 second, you run into CSCdw58350. Based on my experience, you only have 4 minutes to do the show tcp.
Another way is to take a sniffer trace between the 7513 and 6509. (i.e. any IP path between the DLSw connection). If you hit CSCdw58350, you will find the time between 2 retransmissions increases beyond 4.8 seconds.
07-03-2003 07:38 AM
Thanks for the "how to tell".
But:
1. I havent had the luck to catch a "just before" show tcp but on a sniffer trace I do see retransmit sjust before the DLSW drops. But I also see retransmit for other peers that seems to be unaffected(or mayby justsurvive this time) by the retransmit at that time (FYI they do also drop sporadic).
2. Just before the DLSW I see a lot of events going on. On the LU6.2 client side router a dlsw event debug shows flowing events could you please give me your comment on this thanks :
20:20:30: DLSw: Keepalive Request sent to peer 172.31.0.2(2065))
20:20:30: DLSw: dlsw_tcpd_fini() for peer 172.31.0.2(2065)
20:20:30: DLSw: tcp fini closing connection for peer 172.31.0.2(2065)
20:20:30: DLSw: START-TPFSM (peer 172.31.0.2(2065)): event:ADMIN-CLOSE CONNECTION
And on the CIP router:
-----------------------------------------------------------------------------------------------------
20:16:42: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:TCP-BUSY XPORT state:CONNECT
20:16:42: DLSw: dtp_action_r(), flow_ctrl (TCP_CONG_HIGH) for peer 172.16.0.1(2065)
20:16:42: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:CONNECT->WAN_BUSY
20:17:42: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY
20:17:42: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)
20:17:42: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)
20:17:42: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY
20:18:12: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY
20:18:12: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)
20:18:12: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)
20:18:12: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY
20:18:42: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY
20:18:42: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)
20:18:42: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)
20:18:42: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY
20:19:27: DLSw: Keepalive Request send failed to peer 172.16.0.1(2065)
20:19:30: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:DLX-KEEPALIVE REQ state:WAN_BUSY
20:19:30: DLSw: dtp_action_q() keepalive request from peer 172.16.0.1(2065)
20:19:30: DLSw: Keepalive Response send failed to peer 172.16.0.1(2065)
20:19:30: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->WAN_BUSY
20:20:29: DLSw: dlsw_tcpd_fini() for peer 172.16.0.1(2065)
20:20:29: DLSw: START-TPFSM (peer 172.16.0.1(2065)): event:ADMIN-CLOSE CONNECTION state:WAN_BUSY
20:20:29: DLSw: tcp fini closing connection for peer 172.16.0.1(2065)
20:20:29: DLSw: dtp_action_b() close connection for peer 172.16.0.1(2065)
20:20:29: DLSw: END-TPFSM (peer 172.16.0.1(2065)): state:WAN_BUSY->DISCONN
07-03-2003 09:07 AM
The debug shows that the 7206 fails to send DLSw keepalive to the LU6.2 client peer because the TCP queue on the 7206 is too long. I cannot tell if you hit the bug or there is a problem on the IP path from the 7206 to the LU6.2 client site.
As I do not know the network topology, the information about the other peers stay up does not help me. Do the peer staying up during the outage share the same IP network infrastructure?
As per my previous posting, you need to get show tcp just before the outage or a sniffer trace between the DLSw peer connection in order to confirm if you hit the bug. Even though I think that it is very likely the bug is the root cause, you need the information I mentioned before to confirm.
07-03-2003 04:52 PM
Thanks
The other peers that stays up during an outage on another peer share the same ip infrastructure (pure LAN to LAN / Fastethernet).
But I have a "suspicion" that the TCP queue on the LU6.2 client is getting too long due to LU6.2 data that cant get through the peer to the mainframe at the CIP router end (at the other end of the DLSW peer).
Does this make sence ? Is there any way to check this ?
07-03-2003 06:17 PM
As I do not know the network topology, it is difficult to comment on this. What is the minimum bandwidth between the DLSw peer connection?
DLSw has a built-in flow control mechanism. If the LU 6.2 sends too much data for the TCP pipe to handle, DLSw will sends half window and reset window indicator, which cause the router to send RNR (receiver not ready) to the LU6.2.
From what I seen in the debug messages, the 7206 is unable to send any TCP segments to the MSFC. Please be aware that the trigger point of the bug is signifcant amount of segment dropped on the network. (If MSS is 536, the minimum data dropped on the network is about 6000 bytes). I have encounter a situation where only one of the DLSw peer drops while other DLSw peer connection stays up, the root cause of the issue is still the TCP timer bug. Upgrading the IOS to 12.1(15) or above resolves 99% of related problems.
To verify your theory, do a show dlsw circuit detail and look at the half window and reset indicator. Also, look at the DLSw circuit send/receive window.
Go to the CIP console and do a llc stat all.
07-09-2003 06:05 AM
Thanks
1. I will go for the IOS upgrade, but do I need to upgrade both ends of the dlsw peer ? (i.e. the 7513 with the CIP and the 6500 with the LU6.2 workstations)
2. After we moved some of the LU6.2 client to Token Ring (i.e. direct SNA to the host) the dlsw peer seems to stay up but we still see that dlsw circuits gets disconnected and shortly after reconnect without direct loss of LU6.2 data.
3. I do see some few half flow op and reset (like Half: 6/2 Reset 1/0) but could you please comment following output (taken 4 and 9 July when the peer is up, as the peer hasnt gone done since we moved some of the load) to see if these confirm that it is the bug that is the problem. - Thanks
07:54:01.231 PDT Fri Jul 4 2003
----------------------------------------
At the 7513 end:
CIP-Slot4#llc stat all
Maximum connections = 1200
Highest connections = 169
Current connections = 59
Connections rejected = 0
Connect requests = 5615
Connect responses = 0
Connect confirms = 5615
Connect indications = 0
Disconnect confirms = 207
Disconnect indications = 5128
Flow off terminated conn = 0
Session timeouts = 99
FRMRs sent = 1
FRMRs received = 1
SABMEs sent = 5632
SABMEs received = 0
DISCs sent = 542
DISCs received = 5027
UAs sent = 5027
UAs received = 5811
DMs sent = 0
DMs received = 10
IDLE connections = 0
Connection thresheld = 1200
Frames transmitted = 130903182
Transmit completes = 130903182
Frames received = 124105764
Flow offs sent = 43844
Flow ons sent = 43810
Flows (on/off) cancel = 180
Flow offs received = 0
Flow ons received = 0
CCB indications sent = 92883
CCB indications freed = 92883
Unexpected events = 36
Ack Timer Missed = 0
Frames ignored = 4
Data requests = 101195500
Data requests freed = 101193980
Aggregate data req qlen = 0
Maximum data req qlen = 0
Data indications = 87183015
Data indications freed = 87183015
Aggregate data ind qlen = -1457917659
Maximum data ind qlen = 87183015
Polls sent = 152722
Finals sent = 18591962
Polls rcvd = 18625036
Finals rcvd = 158403
RR polls sent = 152722
RNR polls sent = 0
REJ_polls sent = 0
RRs sent = 29607348
RRs received = 36878480
RNRs sent = 80011
RNR received = 27452
REJs sent = 1487
REJs received = 4178
Bad PDUs received = 0
Bad N(r)s received = 5
Bad N(s)s received = 1747
Out-of-seq N(s)s received = 38
Acknowledgements delayed = 81673676
Iframes transmitted = 101203134
Iframes retransmitted = 9115
Retries = 7507
Local busy entered = 96226
Local busy exited = 96226
Remote busy entered = 9737
Remote busy exited = 9429
Unexpec Ack = 0
Adapter registered = 4
SAPs registered = 1
CEPs registered = 59
Msgs to entity mgr = 31
Msgs to PSAP = 58712
Msgs to PCEP = 101206483
Msgs to DLU = 87346909
Invalid handles = 0
SAPs activated = 8
Deactivate SAP requests = 5
Deactivate SAP confirms = 5
SAPs reset by user = 2
SAPs reset by network = 0
SAP deact/reset completed = 7
SAPs deact/reset w/ CCBs = 2
SAPs frame drops = 0
Hash buckets used = 3439
Hash table collisions = 2181
Hash table lookups = 124174704
Hash table caparisons = 198849566
Hash table lookups failed = 58305
Longest collistion chain = 19
Receive erroneous frame = 0
DMA write operations = 130999946
DMA write completions = 130999946
DMA write errors = 0
DMA write waits = 0
DMA read notifications = 124534436
DMA read operations = 124534436
DMA read completions = 124534435
DMA read errors = 0
DMA reads discarded = 0
DMA reads discarded small = 0
DMA read chains freed = 0
Frames exceed MAX frameSZ = 0
Timer ticks (100ms) = 222075154
Timeouts = 6502680
Timers started = 275098246
Timers stopped = 268595506
T1 timers expired = 7607
Idle timers expired = 91521
T2 timers expired = 6403552
Longest timeout chain = 58
P/F timer started = 163700
P/F timer stopped = 158403
Ack timer started = 66234646
Ack timer stopped = 58411329
Busy timer started = 13354
Idle timer started = 145383718
T2 timer started = 63301341
T2 timer stopped = 63301340
Reject timer started = 1487
Reject timer stopped = 1589
LLC Stack locked = 363336715
LLC Stack delayed events = 101
LLC stack pending events = 0
LLC stack most pndng evnts= 1
LLC stack not locked.
--------------------------------------------------------------------------------------------------------
15:17:44.199 PDT Wed Jul 9 2003
-------------------------------------------
At the 7513 end:
CIP-Slot4#llc stat all
Maximum connections = 1200
Highest connections = 169
Current connections = 69
Connections rejected = 0
Connect requests = 10778
Connect responses = 0
Connect confirms = 10778
Connect indications = 0
Disconnect confirms = 301
Disconnect indications = 10216
Flow off terminated conn = 0
Session timeouts = 118
FRMRs sent = 1
FRMRs received = 1
SABMEs sent = 10815
SABMEs received = 0
DISCs sent = 628
DISCs received = 10096
UAs sent = 10096
UAs received = 11037
DMs sent = 0
DMs received = 12
IDLE connections = 0
Connection thresheld = 1200
Frames transmitted = 153147701
Transmit completes = 153147701
Frames received = 146574711
Flow offs sent = 52506
Flow ons sent = 52472
Flows (on/off) cancel = 198
Flow offs received = 0
Flow ons received = 0
CCB indications sent = 115314
CCB indications freed = 115314
Unexpected events = 103
Ack Timer Missed = 0
Frames ignored = 4
Data requests = 119159228
Data requests freed = 119157594
Aggregate data req qlen = 0
Maximum data req qlen = 0
Data indications = 102290655
Data indications freed = 102290655
Aggregate data ind qlen = -2074303823
Maximum data ind qlen = 102290655
Polls sent = 182235
Finals sent = 21206449
Polls rcvd = 21242583
Finals rcvd = 193195
RR polls sent = 182235
RNR polls sent = 0
REJ_polls sent = 0
RRs sent = 33862716
RRs received = 44225534
RNRs sent = 93378
RNR received = 30175
REJs sent = 2075
REJs received = 4686
Bad PDUs received = 0
Bad N(r)s received = 5
Bad N(s)s received = 2441
Out-of-seq N(s)s received = 69
Acknowledgements delayed = 95797452
Iframes transmitted = 119167992
Iframes retransmitted = 10275
Retries = 7696
Local busy entered = 108966
Local busy exited = 108966
Remote busy entered = 11023
Remote busy exited = 10715
Unexpec Ack = 0
Adapter registered = 4
SAPs registered = 1
CEPs registered = 69
Msgs to entity mgr = 35
Msgs to PSAP = 85729
Msgs to PCEP = 119180620
Msgs to DLU = 102514497
Invalid handles = 0
SAPs activated = 9
Deactivate SAP requests = 6
Deactivate SAP confirms = 6
SAPs reset by user = 2
SAPs reset by network = 0
SAP deact/reset completed = 8
SAPs deact/reset w/ CCBs = 2
SAPs frame drops = 0
Hash buckets used = 8529
Hash table collisions = 2283
Hash table lookups = 146676148
Hash table caparisons = 224807587
Hash table lookups failed = 80517
Longest collistion chain = 19
Receive erroneous frame = 0
DMA write operations = 153281642
DMA write completions = 153281642
DMA write errors = 0
DMA write waits = 0
DMA read notifications = 147050284
DMA read operations = 147050284
DMA read completions = 147050284
DMA read errors = 0
DMA reads discarded = 0
DMA reads discarded small = 0
DMA read chains freed = 0
Frames exceed MAX frameSZ = 0
Timer ticks (100ms) = 226558498
Timeouts = 7361429
Timers started = 324359028
Timers stopped = 316997530
T1 timers expired = 7815
Idle timers expired = 109334
T2 timers expired = 7244280
Longest timeout chain = 58
P/F timer started = 203551
P/F timer stopped = 193195
Ack timer started = 78970086
Ack timer stopped = 68892536
Busy timer started = 14376
Idle timer started = 170688653
T2 timer started = 74480287
T2 timer stopped = 74480287
Reject timer started = 2075
Reject timer stopped = 2229
LLC Stack locked = 426985053
LLC Stack delayed events = 120
LLC stack pending events = 0
LLC stack most pndng evnts= 1
LLC stack not locked.
Discover and save your favorite ideas. Come back to expert answers, step-by-step guides, recent topics, and more.
New here? Get started with these tips. How to use Community New member guide