TMS lost HTTPS communication to devices, but not really!!!

Chris Swinney
Level 5

Hey all,

Another weird one.

TMS - 14.3.2

Monitoring multiple devices via HTTPS.

I'm seeing some odd issues from TMS where it seems to think that it has lost HTTPS communication with multiple devices. We have many hundreds of devices that we monitor across multiple organisations in TMS (thankfully, we don't use it to schedule!) such as VCS-Es, VCS-Cs, MXP, C and SX series devices. Yesterday it appeared as though TMS wanted to stop monitoring a LOT of these devices. Most of the VCS-Cs, MXPs and other devices that seemed to lose communication are in organisations outside of our main campus.

However, the odd thing is that they haven't really! It's a "phantom" lost communication!

If you navigate to the connection tab in TMS and hit the "Save/Try" button (in fact you only need to navigate to this page, or the edit settings page, or force a refresh etc.), HTTPS communication is established immediately. In fact, we can also RDP into the TMS box and navigate to any and all devices from a browser with no issues using HTTPS. You can even have a browser session open to a device and have an auto refresh running (say on a camera/call page), with no problems. Further, if you navigate to the affected device (such as a VCS Control) and look in the "System --> External Manager" menu, the device shows that our TMS management server is active on HTTPS. However, after a couple of minutes, TMS again thinks it has lost the HTTPS connection!!

The really odd thing is that we haven't lost access to all devices. Those on campus are OK, as are the VCS-Es on the core, but most of the devices in other organisations have got this "phantom" lost communication - weirdly not all of them though!!!

Recently, the backbone network provider installed some DDoS devices on the MPLS network that caused havoc with our neighbour zones, effectively killing our entire VC network. I'm wondering if these devices might also be responsible for these problems? So, I need to know what is different between the TMS HTTPS monitoring communication and actually forcing a refresh of this connection, or navigating to an HTTPS server on a device via a browser.

I have tried a packet capture on TMS, looking at a device whose HTTPS response remains unaffected and a VCS Control that is affected, but it shows no traffic on either, so I don't really know what I'm looking for (I was expecting a heartbeat or some such).

Any suggestions, please let me know, before I go back to the core network provider....

Cheers

Chris

15 Replies

Adam Wamsley
Cisco Employee

Hi Chris,

It very well could be the DDoS devices they added to the network. I have seen similar behavior with proxies in the network. When you did the captures, did you get a capture from each side, both the TMS and the affected device? I would look for 401/403 responses. Try running the capture while you do a force refresh, and if the "no HTTPS response" error appears not long after that, keep the capture going on both sides.
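Alongside the captures, a small probe run from the TMS server can log what status each device returns over time. The following is only a minimal sketch: the device addresses are placeholders, the request is an unauthenticated GET (so it is not the same request TMS sends, and some devices may answer 401 to it even when healthy), and certificate checks are disabled on the assumption that the devices use self-signed certificates. The useful signal is a change in the result between polls.

```python
import ssl
import time
import urllib.error
import urllib.request


def classify_status(code):
    """Map an HTTP status code to a coarse label for the log."""
    if code in (401, 403):
        return "auth-rejected"
    if 200 <= code < 400:
        return "ok"
    return "http-error"


def probe(url, timeout=5.0):
    """Make one HTTPS GET and return (label, detail, elapsed_seconds)."""
    # Assumption: devices use self-signed certificates, so verification is off.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            code = resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx answers arrive as exceptions; the device did respond.
        code = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # Timeouts, resets and TLS failures: no usable HTTP response.
        return "no-response", str(exc), time.monotonic() - start
    return classify_status(code), str(code), time.monotonic() - start


def poll(urls, rounds=10, interval=30):
    """Probe every URL once per round and print one log line per probe."""
    for _ in range(rounds):
        for url in urls:
            label, detail, elapsed = probe(url)
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} {url} {label} {detail} {elapsed:.2f}s")
        time.sleep(interval)
```

For example, `poll(["https://192.0.2.10/", "https://192.0.2.20/"])` (hypothetical addresses) with one working and one affected device would show whether the affected one ever stops answering or starts returning 401/403 at the moments TMS reports the loss.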

Adam

Hey Adam,

Thanks for this; indeed, that was my next plan of action for when I'm back in work tomorrow.

 

Cheers

Chris

 

Hi Chris,

With their addition of the DDoS devices, they may have added some additional delay when TMS is polling the devices.  I've seen this before when a slow proxy server was in the chain.

In TMS, under Admin Tools > Configuration > Network Settings there are a number of timeout entries.  Try increasing the value of the Telnet/HTTP Connection Timeout by a few seconds and see if that helps.
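To judge whether added delay is plausible, the connection time can be measured from the TMS server and compared with the configured timeout. This is a rough sketch, not how TMS itself connects: it times the TCP connect and the TLS handshake separately, with certificate checks disabled on the assumption of self-signed device certificates. Older endpoints that only offer legacy TLS versions may fail the handshake from a modern Python build.

```python
import socket
import ssl
import time


def time_tls_connect(host, port=443, timeout=15.0):
    """Return (tcp_seconds, tls_seconds) for one connection attempt."""
    # Assumption: self-signed device certificates, so verification is off.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        tcp_done = time.monotonic()
        # The handshake completes inside wrap_socket by default.
        with context.wrap_socket(raw, server_hostname=host):
            tls_done = time.monotonic()
    return tcp_done - start, tls_done - tcp_done


def count_over_timeout(samples, timeout_seconds):
    """Count samples whose TCP plus TLS time reaches the timeout."""
    return sum(1 for tcp, tls in samples if tcp + tls >= timeout_seconds)
```

Collecting a few dozen samples per device with `time_tls_connect` and passing them to `count_over_timeout(samples, 5.0)` gives a quick idea of how often a 5 second connection timeout would have been hit.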

Wayne
--
Please remember to rate responses and to mark your question as answered if appropriate.

Hey Wayne,

Funnily enough, I had already tried a few different values here, up to the maximum of 15 seconds, but no can do.

I had been playing with these settings with regard to another issue I had been seeing relating to the odd Phonebook update timeout, but I have set this back to 5 seconds and the HTTP Command timeout back to 30 seconds, and the same issue persists.

I wasn't able to run the traces today, so will have another stab at this tomorrow. However, the network provider is convinced that their product is not the cause of the issue.

Chris

Network providers are good at saying that the Network is never the issue ;)

I have a couple of devices here that seem to be doing a similar thing following our upgrade to TMS14.4.2 - I'm pretty sure they were fine on TMS14.3.2 though.  The "problem" devices I have mainly seem to be VCS-Controls (on X7.2.3) and occasionally a 3241 ISDN Gateway on 2.2(1.94)P.

TMS Logs show "Incorrect Authentication Information" for the VCS-Controls (even though we don't change the details and it works the next minute, or on a Force Refresh).

The ISDN Gateway just shows "TMS Connection Error" which then gets magically fixed again on the next poll (see screenshot below).

Edit: I'm also seeing errors on my Content Servers - and I've confirmed they're all appearing post the TMS14.4.2 upgrade (there were no similar errors in our environment with TMS14.3.2).  I've logged a job with the TAC.

Wayne

Hey Wayne,

Thanks for this. I have uploaded a similar log, although with us it is purely lost connections. You can even see forced refreshes by me, then automatic monitoring re-established, only to be dropped again after a few minutes.

Chris

Hi Chris,

My VCSes are reporting incorrect authentication details - but then, without changing anything, they work again, then a bit later on, error again (see attached).  My SR is 631637421.

Wayne

Hey Wayne/Adam,

Riddle me this....

Ok, in my situation all appears to be working again. How? Well, I don't really understand, but settle down and I'll tell you a story...

I ran some captures on TMS and on a couple of devices (one that was working and one that wasn't). I turned off things like provisioning just to eliminate as much traffic as possible between the devices and TMS. However, the captures showed nothing unusual between either device and TMS. Of course, I could see traffic in both directions with no issue, but one device still timed out and ended with the "No HTTPS connection" error. Unusually, I saw no heartbeat-type traffic after a refresh - just nothing on either device, to or from TMS.

We noted that the devices that were not timing out in TMS were also accessible from other machines in our campus. We ask our customers to tie down this management to the single IP address of TMS, but sometimes they allow our campus range. So we thought, ah, firewall, yet TMS can still talk to these odd units, so again I didn't understand.

Anyhow, my colleague noted an error in the Windows Event log referring to DNS resolution (Event ID 1014 - which looked like a partial reverse lookup such as 100.100.in-addr.arpa for our class B campus network). However, the vast majority of the systems we manage from TMS are through direct IP, so I didn't think much of it. Anyhow, I thought we would update the DNS settings on the TMS NIC to point to other DNS servers, and an odd thing happened.

Firstly, TMS started to "see" the phantom devices. Secondly, the DNS settings didn't actually take. I definitely OK'ed the properties box on the NIC settings, but the 'ipconfig' details and 'nslookup' still showed the old DNS server addresses. Even more weird, I re-opened the properties box and the new servers were in place. I then closed the properties box down again and re-opened it and, as if by magic, the old servers re-appeared! All of this was watched by my colleague, who was as baffled as I.

I have no idea what just happened here and why the DNS setting may even have had an impact on this problem, but I'll take the win and walk away.
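For anyone wanting to check the DNS angle on their own TMS server, here is a minimal sketch that times reverse lookups for device addresses. Whether TMS actually depends on reverse lookups when monitoring is not established in this thread, so treat this purely as a way to spot slow or failing lookups on the host. The address used in the example is a placeholder.

```python
import ipaddress
import socket
import time


def reverse_name(ip):
    """Return the in-addr.arpa (or ip6.arpa) name queried for an address."""
    return ipaddress.ip_address(ip).reverse_pointer


def time_reverse_lookup(ip):
    """Return (hostname_or_None, elapsed_seconds) for one reverse lookup."""
    start = time.monotonic()
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        host = None
    return host, time.monotonic() - start
```

For instance, `reverse_name("192.0.2.10")` gives `10.2.0.192.in-addr.arpa`, and running `time_reverse_lookup` against an affected and an unaffected device shows whether one of them takes several seconds to fail.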

Our TMS is virtualised running on Windows 2008 R2 and I'm wondering if the adapter settings might have got messed up somehow, but to be honest, I don't really know.

 

I doubt this will help you Wayne, but it's worth a shot. If it doesn't work and if TMS is a VM, it may be worth adding a second adapter and recovering the old one.

Cheers

Chris

Thanks Chris,

That certainly sounds odd.  Our TMS is a VM (also on 2008r2).  Looking back at all our logs, the issue only started when we upgraded to TMS14.4.2 - there are no issues up until the hour that we did the upgrade.  Perhaps, with the server restart at that time, there was an issue created - I'll get our "Server People" to have a look and see if there are any problems at the Windows level like yours were.

Cheers

Wayne

Wayne, did you ever find a resolution to this?  I have a client experiencing the same problems with TMS 14.4.2.

Hi Anthony,

No, our problem is still on-going.  We're still working with the TAC - many packet captures have been sent, and we tried a few quick check/fix things that didn't do anything - we're now organising a WebEx with the BU guys from Norway for them to directly tap in and have a look.

Wayne

Hi Wayne,

I upgraded my TMS to 14.6.2 in April this year and also started getting this weird issue where my Cisco MSE 8050 will lose connection and then a second later be back online. 

TMS version: 14.6.2

MSE 8050: 2.3(1.46)

Any new developments on your side? I see the bug should have been fixed when TMS is upgraded to 14.5.

I'm now running TMS15.2.1 and not seeing this issue any more. 

Wayne

I will need to wait until September before I will be allowed to upgrade to the latest version. We are running IBM Lotus Notes and upgrading might break our current integration. So I would rather wait until I can upgrade to confirm whether this fixes the bug.