Re: dlsw problem : stuck in WAIT_CAP

rrsstefano · ‎02-01-2005

Hi all

We had a major problem with some SNASw/DLSw routers.

Almost all DLSw peers remained in WAIT_CAP state (except for a few resources where the DLSw connection was OK), while there were no IP issues between the central routers and the remote DLSw routers.

The APPN SNASw links seemed to be OK (judging by the snasw link command).

A question: is there a relationship between the DLSw peer state and the SNASw link, i.e. is there an interaction between the two processes that could block DLSw peer establishment if something is wrong in the SNASw (APPN) process?

Thanks for any feedback.

Stefano R.

rrsstefano · ‎02-02-2005

I want to clarify my request because I think it is a little incomplete.

We have a network with some SNASw/DLSw routers at the central site, talking upstream to the mainframe via APPN and to the branch offices via DLSw.

There are IPSec/GRE tunnels between the central and remote sites in a hub-and-spoke topology. The IPSec traffic is transported over an MPLS network provided by an external carrier.

All traffic goes into GRE tunnels and is then encapsulated in IPSec.

Problem: we noticed all DLSw peers down with WAIT_CAP status on the central site. The extended ping between loopbacks is OK, and it is encrypted just like the DLSw traffic, so I would presume there should be no problems for the DLSw TCP connections either.

Consider that the network had been working for some months with no problems such as MTU mismatch, etc.

Just before the problem there was a flap in the MPLS network, but it seemed to be resolved in about 1 hour.

My question:

Supposing that the SNASw routers were affected by the network problems: are the two processes, SNASw and DLSw, really separate, i.e. is the DLSw status ALWAYS independent of the APPN status, or could an abnormal APPN status (which may have occurred, but was not verified) influence the DLSw status?

Unfortunately we have no logs or debugs from the central site to pinpoint the problem.

Debugs on a remote site tell us that no response arrived from the central site during the DLSw handshake.

It could be an MPLS problem; that is what I suspect.

Thanks for any suggestions.

Stefano

jihicks · ‎02-02-2005

Hi Stefano,

SNASw has no effect on the DLSw peers establishing a connection. You could code only local and remote peers on the routers and the peer connection would come up as long as they have IP (TCP) connectivity. Sounds like TCP port 2065 is open in one direction only.
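A rough way to test the TCP side is to try a connection to port 2065 in each direction. Below is a minimal Python sketch; `tcp_port_open` is a hypothetical helper, and a test run from a host only approximates what the routers do between their own peer addresses, since a firewall may treat the two differently.

```python
import socket

DLSW_PORT = 2065  # TCP port mentioned above


def tcp_port_open(host: str, port: int = DLSW_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    import sys

    # Usage: pass one or more peer addresses on the command line.
    for host in sys.argv[1:]:
        print(host, "open" if tcp_port_open(host) else "closed/filtered")
```

A successful handshake only shows that the session can be opened. It does not prove that data sent afterwards gets through, so it should be run from both sides and read together with the router debugs.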

Best regards,

Jim

mbinzer · ‎02-02-2005

Stefano,

Not sure if I fully understand what your question is.

However, there is NO interaction between SNASw and a DLSw peer being connected or disconnected.

WAIT_CAP would indicate that the peer connection got established, at least partially, at that point and the capabilities exchange did not happen or did not complete. This should be a transitory status in any case. If it does not complete it should cycle back to DISCONNECTED after a while and start over.

Did it finally clear up, and did the DLSw peers reconnect?

Are you using TCP path MTU discovery on your routers?

If the DLSw peers are up you can do a show tcp and check the MSS value (the maximum segment size used by TCP). If you don't use path MTU discovery the value should be 536. If you use path MTU discovery it depends on your network. The value should be larger and will be negotiated during the TCP session setup.

Besides that, if the trouble started after a hiccup in the MPLS network then it is quite logical to assume that something interrupted the TCP sessions.
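As rough arithmetic behind those numbers: the MSS is the packet size minus the IP and TCP headers (20 bytes each without options), so the default of 536 corresponds to a 576-byte datagram. A small Python sketch follows; `mss_for_mtu` is a hypothetical helper and the tunnel overhead in the last line is an illustrative assumption, not a value measured on this network.

```python
IP_HEADER = 20   # IPv4 header without options
TCP_HEADER = 20  # TCP header without options


def mss_for_mtu(mtu: int, tunnel_overhead: int = 0) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - tunnel_overhead - IP_HEADER - TCP_HEADER


print(mss_for_mtu(576))   # 536, the value seen without path MTU discovery
print(mss_for_mtu(1500))  # 1460, plain Ethernet path
# Assumed example: 24 bytes of GRE (20 outer IP + 4 GRE) plus a
# hypothetical 56 bytes of IPSec overhead (varies with mode and cipher).
print(mss_for_mtu(1500, tunnel_overhead=24 + 56))  # 1380
```

If the MSS shown by show tcp is larger than what the tunnel path can actually carry, small packets such as pings would pass while full-size TCP segments could be dropped.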

thanks...

Matthias

rrsstefano · ‎02-02-2005

Hi Matthias,

Yes, you are right about my first question, even if it seems to be a stupid question. It is due to the very strange situation: so strange that I thought the SNASw routers were in some stuck state due to a particular condition met during the MPLS fault.

What I can't explain is why ICMP packets encapsulated in IPSec worked fine, as did some web services, while DLSw traffic did not.

MSS issues are certainly possible, but for 9 months everything worked correctly.

According to the MPLS provider, the MPLS problem was solved in about 1 hour, as confirmed by our management station, which had complete visibility of the remote sites just after 2 hours (SNMP polling and traps received), but the same was not true for the DLSw services.

Just to add: the problem arose at 0.30 am, then around 2 am IP connectivity was restored on the provider side, but at 7.00 am the customer discovered the DLSw issue. Reloading some of the SNASw, IPSec and CE concentrators didn't resolve the situation. At 9.30 am, by customer decision, ALL devices were reloaded, including the firewall between the DLSw and IPSec concentrators, and then, 10 minutes after the routing protocols converged, DLSw worked properly.

Stefano

mbinzer · ‎02-02-2005

Stefano,

I don't know what exactly triggered the problem. You asked why an extended ping worked and the DLSw peer did not. A ping is an ICMP message. The DLSw peer is a TCP session. You are mentioning firewalls etc.

Firewalls can act in many different ways, e.g. intercept the TCP session and terminate it towards both ends.

When the DLSw peer is in any state other than disconnected you should have a TCP session already at least partially up. This can be displayed with show tcp and show tcp brief. You can also clear those sessions with a

clear tcp tcb

A debug dlsw peer, debug dlsw ip tcp transaction to start with would be quite helpful to get a basic understanding of what the DLSw peer routers were doing while the problem happened.

There is not much we can do right now, after the fact, since all devices got rebooted. I don't have a clear explanation for the problem you have seen.

thanks...

Matthias

rrsstefano · ‎02-03-2005

Hi Matthias,

As you say, it is now difficult to explain exactly what happened.

Attached is a debug dlsw peers captured on a remote site.

From this, is it possible to tell whether the sockets were opened in both directions and then torn down by an application problem?

thanks!

10.237.88.136 is the central dlsw router

mbinzer · ‎02-03-2005

Stefano,

The debugging is telling us that the TCP write and read pipes were opened. OK so far. Next, the router sends its cap_ex message and then waits for the cap_ex from the peer, which never arrives.

So why is the cap exchange not arriving? We would really need to see the debug from the other end at the same time to have a chance of understanding what is going on.

Feb 1 09:12:25.524: DLSw: START-TPFSM (peer 10.237.88.136(2065)): event:ADMIN-OPEN CONNECTION state:DISCONN

Feb 1 09:12:25.524: DLSw: dtp_action_a() attempting to connect peer 10.237.88.136(2065)

Feb 1 09:12:25.528: DLSw: END-TPFSM (peer 10.237.88.136(2065)): state:DISCONN->WAIT_WR

Feb 1 09:12:26.037: DLSw: Async Open Callback 10.237.88.136(2065) -> 11241

Feb 1 09:12:26.037: DLSw: START-TPFSM (peer 10.237.88.136(2065)): event:TCP-WR PIPE OPENED state:WAIT_WR

Feb 1 09:12:26.037: DLSw: dtp_action_f() start read open timer for peer 10.237.88.136(2065)

Feb 1 09:12:26.041: DLSw: END-TPFSM (peer 10.237.88.136(2065)): state:WAIT_WR->WAIT_RD

Feb 1 09:12:26.726: DLSw: passive open 10.237.88.136(28757) -> 2065

Feb 1 09:12:26.726: DLSw: START-TPFSM (peer 10.237.88.136(2065)): event:TCP-RD PIPE OPENED state:WAIT_RD

Feb 1 09:12:26.726: DLSw: dtp_action_g() read pipe opened for peer 10.237.88.136(2065)

Feb 1 09:12:26.726: DLSw: CapExId Msg sent to peer 10.237.88.136(2065)

Feb 1 09:12:26.726: DLSw: END-TPFSM (peer 10.237.88.136(2065)): state:WAIT_RD->WAIT_CAP

Here you see when the write pipe opens and when the read pipe comes up (the TCP connection from the other router coming back to us). This router then sends its CapExId to the peer, but we don't get a response and time out.
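When there are many peers in a capture, the same reading can be done mechanically by pulling the state transitions out of the END-TPFSM lines. A minimal Python sketch, assuming the line format shown in the debug above; `peer_transitions` and `final_states` are hypothetical helper names, and the sample lines are copied from that debug.

```python
import re

# Matches e.g. "END-TPFSM (peer 10.237.88.136(2065)): state:WAIT_RD->WAIT_CAP"
TRANSITION = re.compile(
    r"END-TPFSM \(peer (?P<peer>[\d.]+)\((?P<port>\d+)\)\): "
    r"state:(?P<old>\w+)->(?P<new>\w+)"
)


def peer_transitions(lines):
    """Yield (peer, old_state, new_state) for each END-TPFSM line."""
    for line in lines:
        m = TRANSITION.search(line)
        if m:
            yield m["peer"], m["old"], m["new"]


def final_states(lines):
    """Map each peer address to the last state seen in the capture."""
    last = {}
    for peer, _old, new in peer_transitions(lines):
        last[peer] = new
    return last


SAMPLE = """\
Feb 1 09:12:25.528: DLSw: END-TPFSM (peer 10.237.88.136(2065)): state:DISCONN->WAIT_WR
Feb 1 09:12:26.041: DLSw: END-TPFSM (peer 10.237.88.136(2065)): state:WAIT_WR->WAIT_RD
Feb 1 09:12:26.726: DLSw: END-TPFSM (peer 10.237.88.136(2065)): state:WAIT_RD->WAIT_CAP
""".splitlines()

print(final_states(SAMPLE))  # {'10.237.88.136': 'WAIT_CAP'}
```

Any peer whose last state is WAIT_CAP got both pipes open but never completed the capabilities exchange, which is the pattern seen here.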

thanks...

Matthias