Solved: TC6 Firmware - Call disconnects after 2 hours

Giovanni Ceci · ‎05-29-2013

Hello all,

Our client is complaining that after upgrading the firmware on Telepresence endpoints (C40, C60, EX90) to TC6.x, there are two issues:

1. Endpoints with TC6.x disconnect themselves after 120 minutes on any call that was initiated by the remote end. They do not disconnect at the 120-minute mark if the TC6.x unit initiated the call.

2. Endpoints with TC6.x cannot connect to MXP units running F6 or earlier software. Since we have MXP codecs running F6 and F5 software, this is causing problems with point-to-point calls.

They have units with TC5.1.6 software which have no problems at all. The only thing in between other than routers and switches is a VCS Controller.

Anyone have any ideas on what the root cause is?

Thanks,

John

Bryan Deaver · ‎05-30-2013

In TC6 we silently turned on TCP keepalives for the H.225 session, running with the Linux default tcp_keepalive_time of 7200 seconds (120 minutes). This would likely explain the disconnect at 120 minutes that John is seeing after moving from TC5.1.6.
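To make the keepalive timing concrete, here is a minimal Python sketch of what enabling keepalives on a TCP socket looks like. This is not the codec's code; the function name enable_keepalive is made up for illustration, and the values are the usual Linux defaults (7200 s idle, 75 s between probes, 9 probes). The point is only that the first probe goes out after 7200 idle seconds, which lines up with the 120-minute mark.

```python
import socket

def enable_keepalive(sock, idle_s=7200, interval_s=75, probes=9):
    """Turn on TCP keepalives and, where the platform exposes them, set the timers.

    Hypothetical helper for illustration; defaults mirror the usual Linux sysctls.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket option names exist on Linux; other platforms may lack
    # some of them, so each one is only set if the socket module exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
idle_minutes = 7200 // 60  # first probe after this many idle minutes
sock.close()
```

If a device in the path has already discarded its state for the idle session by then, the probe would presumably go unanswered or be reset, and the session would be torn down.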

I would also suggest that, if this is the case, there is something TCP-aware between the two devices. A firewall is also my first thought. Maybe WAAS on a router? At times you can get a clue that this is occurring by comparing the TCP header seq/ack/window size to see whether it differs between the sending and the receiving side for the same packet. Something like "tcpdump -s0 -w /tmp/h225.pcap port 1720" run as root should limit the collected output.
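As a rough illustration of that comparison, the sketch below takes the seq/ack/window values for the same packets as seen on each side and reports every field that differs. It assumes you have already pulled those values out of the two captures (with absolute, not relative, sequence numbers) and lined the packets up in the same order. The TcpHeader type, the diff_headers function, and the sample numbers are all made up for illustration.

```python
from typing import NamedTuple

class TcpHeader(NamedTuple):
    seq: int
    ack: int
    window: int

def diff_headers(sent, received):
    """Return (packet_index, field, sent_value, received_value) per mismatch.

    Assumes both lists describe the same packets in the same order.
    """
    mismatches = []
    for index, (s, r) in enumerate(zip(sent, received)):
        for field in TcpHeader._fields:
            if getattr(s, field) != getattr(r, field):
                mismatches.append((index, field, getattr(s, field), getattr(r, field)))
    return mismatches

# Made-up values, NOT taken from a real capture.
sender_side = [TcpHeader(1000, 5000, 29200), TcpHeader(1100, 5000, 29200)]
receiver_side = [TcpHeader(1000, 5000, 29200), TcpHeader(91100, 5000, 65535)]

for mismatch in diff_headers(sender_side, receiver_side):
    print(mismatch)
```

Any mismatch on a packet that should be identical at both ends suggests that something in between is rewriting or proxying the TCP session.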

The TC6 troubleshooting guide located here is a good reference for the logs and tcpdump usage when troubleshooting signaling-related issues:

http://www.cisco.com/en/US/docs/telepresence/endpoint/codec-c-series/tc6/troubleshooting_guide/tc_troubleshooting_guide_tc60.pdf

The example in that document does not filter the traffic, so John, you would want to limit the tcpdump capture to the interesting signaling only, to avoid problems with the size of the collected pcap file.

We have a bug open to better control keepalives on the H.225 session (CSCub20591), but it is unclear if or when this will be implemented. For now, if something in the network is timing out the TCP sessions, changes will need to be made there to prevent this from occurring.

For the second issue, I am not aware of an issue with older MXP software either, but your approach, Martin, of trying a later software version is also what I would recommend. If you need to troubleshoot TC6 with the older MXP software, focus on the same H.323 logs from the troubleshooting guide to see where in the handshake the call fails, then compare that output between TC6 and TC5.1.6 to see what changed that may be causing this.

Martin Koch · ‎05-30-2013

Hi Giovanni!

Could you tell us a bit more, such as which VCS version is used and how the endpoints are registered (H.323/SIP)? Also, what does the call flow look like: does the VCS show it as a local or a traversal call, and is it plain H.323/SIP or interworked?

Regarding 1): I have not faced or heard about this issue, but that does not necessarily mean anything. In general, a 2h disconnect is often a firewall/router/TCP timeout. Routers can have quite advanced features, so I would look into that.
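One way to check for such an idle timeout independently of the codecs is to hold a plain TCP connection open across the same path, send nothing for a while, and then see whether a round trip still works. The sketch below is only an assumption about how one might do that: survives_idle and start_echo_server are hypothetical names, the echo peer here runs on localhost purely so the code can be tried, and it does not reproduce H.225 signaling. Keepalives are deliberately left off so the connection stays truly idle.

```python
import socket
import threading
import time

def survives_idle(host, port, idle_s, timeout_s=5.0):
    """Connect, stay silent for idle_s seconds, then check an echo round trip."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        time.sleep(idle_s)  # no traffic and no keepalives during this period
        try:
            conn.sendall(b"ping")
            return conn.recv(4) == b"ping"
        except OSError:
            # Covers resets and timeouts: the path no longer carries the session.
            return False

def start_echo_server():
    """Hypothetical one-shot echo peer on localhost, only for trying the probe."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)

    def serve():
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(4))  # echo the probe back once
        server.close()

    threading.Thread(target=serve, daemon=True).start()
    return server.getsockname()[1]

port = start_echo_server()
result = survives_idle("127.0.0.1", port, idle_s=0.2)
print(result)
```

For a real test you would run the echo side on a host behind the suspect device and use an idle time a little over two hours, for example idle_s=7260 with a longer timeout.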

Regarding 2): I could picture that a TAC answer would be to upgrade your endpoints, and that is something I would recommend as well. MXP F8.3 and up seem to be OK in our deployments.

I am not sure when SIP was introduced in MXP, but if you use it, better disable it below F8.3; it causes more issues than it solves, so let the VCS handle the interworking.

Did not find anything in the bug toolkit when I briefly searched for it.

If you do not find an answer and you have service contracts, please escalate this thread as a service request so TAC can help you.

Giovanni Ceci · ‎05-31-2013

Thanks Martin and Bryan for your responses.

We ran a test where we isolated the two endpoints to one switch off the network and tried the 2-hour call and, sure enough, the call did NOT drop. So it looks like the TCP keepalives are behind Issue 1, and we will look for the firewall sitting between the two endpoints. The bug you posted, Bryan, matches what is going on here, so that is likely the cause.

For issue 2, we will look into our options for upgrading the firmware for the MXPs.

I appreciate your responses guys. Have a good one.

Thanks,

John