I have a SIP trunk deployment set up as follows:
ITSP --- <SIP> --- CUBE1 --- <SIP> --- CUCM
ITSP --- <SIP> --- CUBE2 --- <SIP> --- CUCM
The ITSP sends all calls to the primary link (CUBE1). If no response is received after three SIP INVITEs on port 5060, the ITSP sends the call to the secondary link (CUBE2).
Calls into devices registered to CUCM via the SIP trunk connect and work fine; for example, we can make 10-20 calls into the same number without any problem.
We start to get issues when we have simultaneous calls into multiple Jabber for Windows clients via the SIP trunk. The number of calls needed to trigger the issue is inconsistent: we have seen it occur with anywhere between 8 and 16 simultaneous calls.
When the issue occurs, all calls connect, but once the far end disconnects, the Jabber for Windows client does not tear down the call and the user has to physically disconnect it even though there is no one on the far end. Looking at the debugs on CUBE and CUCM, I can see that the client starts to return a 480 TEMPORARILY NOT AVAILABLE. At this point we stop seeing calls coming through on CUBE1 and calls start to appear on CUBE2 for some reason.
It then takes a considerable amount of time, roughly 20 minutes, for CUBE1 to start accepting calls again.
Has anyone seen anything similar? It does seem rather odd behaviour.
i can see that the client starts to return a 480 TEMPORARILY NOT AVAILABLE, at this point we stop seeing calls coming through on CUBE1 and calls start to appear on CUBE2 for some reason.
Which client: Jabber, CUCM, CUBE, or provider?
If I were troubleshooting this, I would collect logs from CUCM and CUBE for the same time period and then read through the traces for a call that experienced the failure. For example, when the far end hangs up, does the provider send a BYE? If yes, does CUBE forward this to CUCM, which in turn tells Jabber?
If it is CUBE that is telling the provider 480 Temp Unavailable when the provider sends the BYE, I would expect this symptom. The provider cannot send a message for an existing SIP dialog to a different CUBE, but it can send new calls elsewhere. Depending on the provider, they may try to send the BYE again after waiting a bit, or just tear down the SIP dialog on their side and give up.
So, if that is what is happening: does CUBE forward the BYE on to CUCM, which in turn responds with an error? If yes, then it's not CUBE that is at fault; it may just be remapping the response code (e.g. as it does when converting 180 Ringing into 183 Session Progress).
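If reading the traces by hand gets tedious, a rough Python sketch like the one below could help list what happened to each BYE. It assumes you have already split the debug output into individual raw SIP messages (one string per message); the function names `parse_sip_message` and `bye_outcomes` are made up for illustration. Keep in mind that CUBE is a back-to-back user agent, so the provider leg and the CUCM leg of the same call normally carry different Call-IDs and have to be matched up separately.

```python
from collections import defaultdict


def parse_sip_message(text):
    """Return (start_line, call_id, cseq_method) for one raw SIP message."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    start_line = lines[0]
    headers = {}
    for ln in lines[1:]:
        if not ln:
            break  # a blank line ends the header section
        name, _, value = ln.partition(":")
        headers[name.strip().lower()] = value.strip()
    # "i" is the compact form of the Call-ID header
    call_id = headers.get("call-id") or headers.get("i")
    cseq = headers.get("cseq", "")
    cseq_method = cseq.split()[-1] if cseq else None
    return start_line, call_id, cseq_method


def bye_outcomes(messages):
    """Map each Call-ID to the BYE requests and BYE responses seen, in order."""
    outcomes = defaultdict(list)
    for text in messages:
        start_line, call_id, method = parse_sip_message(text)
        if method == "BYE":
            outcomes[call_id].append(start_line)
    return dict(outcomes)
```

A Call-ID whose list shows a BYE request followed by a 480 instead of a 200 OK is a call worth following onto the other leg.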
That would tell you who to look at more closely: CUCM or CUBE. For example, if CUBE is sending that without ever talking to CUCM, then what are the memory and CPU utilization while this is happening? Is the router struggling to keep up? Remember that the Calls Per Second and Concurrent Call figures listed on the data sheet assume a very specific call flow of 14 SIP messages per call (seven provider to CUBE and seven CUBE to CUCM) with an average call connection time of three minutes. If your load profile differs from that, you might be maxing out the platform capacity before reaching the published number.
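To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 14 messages per call and the three-minute hold time come from the data sheet assumptions mentioned above; the function names and the example call rates are purely illustrative, not figures for any particular platform.

```python
def sip_message_rate(calls_per_second, messages_per_call=14):
    """SIP messages per second CUBE has to process at a given call rate."""
    return calls_per_second * messages_per_call


def steady_state_concurrent_calls(calls_per_second, avg_hold_seconds=180):
    """Concurrent calls at steady state: arrival rate times average hold time."""
    return calls_per_second * avg_hold_seconds


# Hypothetical example: 5 calls per second with the data sheet call flow
# versus a chattier flow of 30 messages per call.
print(sip_message_rate(5))               # 70 messages per second
print(sip_message_rate(5, 30))           # 150 messages per second
print(steady_state_concurrent_calls(5))  # 900 concurrent calls
```

The point is that a call flow with more messages per call (re-INVITEs, UPDATEs, session refreshes and so on) raises the message rate at the same calls per second, so the box can run out of headroom below the published figure.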
Thanks for taking the time to respond, this is really helpful.
I have been working from some old traces and some things just aren't adding up. I'm scheduled to do more testing next week.
From the old traces I can see BYE messages on the CUBE in all directions, both from/to the ITSP and from/to CUCM. The odd thing is that there is a gap of around 30 minutes in which I don't see any BYE messages, even though we were making test calls during that time and I can see the INVITEs coming in from the ITSP.
When I take more logs, I'm going to monitor all messages for a single inbound call from the ITSP to make sure the messaging is correct end-to-end. I'll then monitor again with multiple simultaneous calls when the issue happens; at that point I should be able to determine which messages are missing.
Thanks for the advice on calls per second. I'll check CPU utilisation while I'm testing.
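For the next round of traces, something like the following Python sketch might help pin down exactly when the BYEs stop and start again. It assumes the BYE timestamps have already been pulled out of the logs as `datetime` objects; the function name `find_gaps` and the five-minute threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive timestamps are further apart than threshold."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]
```

Running the same function over the INVITE timestamps would confirm whether calls were still arriving during any gap it reports.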