CCM MGCP Gateway Fails Over Several Times a Day

ruesch-eng · ‎11-29-2005

I have ccm-manager configured on my redundant gateways. Several times per day, a gateway fails over to the redundant ccm. See the config below from my gateway.

ccm-manager switchback immediate

ccm-manager redundant-host 10.108.0.10

ccm-manager mgcp

ccm-manager music-on-hold

ccm-manager config server 10.108.0.10

ccm-manager config

I captured the following during the last failover. I can see that keepalives are not being received and that is causing the failover. See the debug output below.

london_2821_1#

Nov 29 11:17:08.716: cmapp_mgr_handle_ka_timeout: Send Keepalive - last_traffic_time:5092 ipaddr:10.108.0.11

Nov 29 11:18:53.716: cmapp_mgr_handle_ka_timeout: Send Keepalive - last_traffic_time:16436 ipaddr:10.108.0.11

Nov 29 11:19:07.280: cmapp_mgr_handle_ka_timeout: Host failed -- last_traffic_time:30000

Nov 29 11:19:07.280: cmapp_mgr_process_ev_active_host_failed: Active host 0 (10.108.0.11) failed

Nov 29 11:19:07.280: cmapp_mgr_check_hostlist: Active host is 0 (10.108.0.11)

Nov 29 11:19:07.280: cmapp_mgr_switchover: New actv host will be 1 (10.108.0.10)

Nov 29 11:19:07.280: cmapp_host_fsm: Processing event GO_STANDBY for host 0 (10.108.0.11) in state REGISTERED

Nov 29 11:19:07.280: cmapp_mgr_send_rehome: new addr=0.0.0.0,port=2427

Nov 29 11:19:07.280: cmapp_mgcpapp_go_down: Setting mgc status to NO_RESPONSE

Nov 29 11:19:07.280: cmapp_mgcp_send_rsip: ip_addr=10.108.0.11 port=2427 if_type=-1, slot=0,subunit=0 rst_type=1

Nov 29 11:19:07.280: cmapp_start_host_tmr: Host 0 (10.108.0.11), tmr 0, duration 15000

Nov 29 11:19:07.280: cmapp_open_new_link: Open link for [0]:10.108.0.11

Nov 29 11:19:07.280: cmbh_open_tcp_link: Opening TCP link with Rem IP 10.108.0.11, Local IP 10.108.0.252, port 2428

Nov 29 11:19:07.280: cmapp_open_new_link: Open initiated OK: Host 0 (10.108.0.11), session_id=46556204

Nov 29 11:19:07.280: cmapp_host_fsm: New state STANDBY_OPENING for host 0 (10.108.0.11)

Nov 29 11:19:07.280: cmapp_host_fsm: Processing event GO_ACTIVE for host 1 (10.108.0.10) in state STANDBY_READY

Nov 29 11:19:07.280: cmapp_mgr_send_rehome: new addr=10.108.0.10,port=2427

Nov 29 11:19:07.280: cmapp_host_fsm: New state REGISTERING for host 1 (10.108.0.10)

The switch is not congested so there is no reason for the keepalives to not arrive. I have configured the ports and NICs for 100M full duplex. There are no errors on the switch ports and no errors in the ccm event logs.

Please help. Thanks.

Both callmanagers and both gateways are in the same vlan on the same switch.
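
To make the pattern in the debug output above easier to spot across a longer capture, the keepalive lines can be pulled out with a short script. This is a minimal sketch, not a Cisco tool: the function name `idle_events` and the 30000 threshold are illustrative, and it assumes `last_traffic_time` is the idle time in milliseconds since the gateway last heard from the active CCM, which is consistent with "Host failed" being logged at 30000.

```python
import re

# Matches the cmapp_mgr_handle_ka_timeout lines from the debug output above.
# Assumption: last_traffic_time is idle time in milliseconds.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"cmapp_mgr_handle_ka_timeout: (?P<event>Send Keepalive|Host failed)"
    r" -+ last_traffic_time:(?P<idle>\d+)"
)

def idle_events(log_text, failover_ms=30000):
    """Return (timestamp, event, idle_ms, reached_threshold) for each keepalive line."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            idle = int(m.group("idle"))
            events.append((m.group("ts"), m.group("event"), idle, idle >= failover_ms))
    return events

sample = """\
Nov 29 11:17:08.716: cmapp_mgr_handle_ka_timeout: Send Keepalive - last_traffic_time:5092 ipaddr:10.108.0.11
Nov 29 11:18:53.716: cmapp_mgr_handle_ka_timeout: Send Keepalive - last_traffic_time:16436 ipaddr:10.108.0.11
Nov 29 11:19:07.280: cmapp_mgr_handle_ka_timeout: Host failed -- last_traffic_time:30000
"""

events = idle_events(sample)
for ts, event, idle, reached in events:
    flag = "  <-- threshold reached" if reached else ""
    print(f"{ts}  {event:<14} idle={idle} ms{flag}")
```

Run against a full day of debug output, this would show whether the idle time climbs steadily before each failover or jumps suddenly, which helps separate a network drop from a CCM that stops responding.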

cplatt01 · ‎11-29-2005

The same thing happened to us. I had to bounce the switch, and I haven't seen it since.

adignan · ‎11-29-2005

Try the following:

1. Disable "ccm-manager config". I usually use "ccm-manager config" initially to get the router registered, but I remove the command after the voice gateway is up.

2. Add a loopback and bind the mgcp signalling and audio to it:

interface loopback 0

ip address 1.1.1.1 255.255.255.255

!

mgcp bind control source-interface loopback 0

mgcp bind media source-interface loopback 0

please rate posts.

andy - berbee

tristan · ‎11-29-2005

Does anyone know how to increase the timers for failover and keepalive?

Current active Call Manager: 10.108.0.11

Backhaul/Redundant link port: 2428

Failover Interval: 30 seconds

Keepalive Interval: 15 seconds
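
As a rough illustration of how the two intervals above relate, the sketch below parses the pasted status lines and divides one by the other. The helper name `parse_intervals` is hypothetical, and reading the failover interval as the silence threshold is an assumption, not documented behavior.

```python
import re

def parse_intervals(status_text):
    """Return {'failover': seconds, 'keepalive': seconds} from the pasted status lines."""
    intervals = {}
    for line in status_text.splitlines():
        m = re.match(r"\s*(Failover|Keepalive) Interval:\s*(\d+)\s*seconds", line)
        if m:
            intervals[m.group(1).lower()] = int(m.group(2))
    return intervals

status = """Current active Call Manager: 10.108.0.11
Backhaul/Redundant link port: 2428
Failover Interval: 30 seconds
Keepalive Interval: 15 seconds"""

intervals = parse_intervals(status)
# Assumption: failover triggers after this many keepalive intervals of silence.
missed_allowed = intervals["failover"] // intervals["keepalive"]
print(intervals, missed_allowed)
```

With these values the failover window covers two keepalive intervals, so under that assumption a CCM that stalls for about 30 seconds would be enough to trigger a switchover.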

ruesch-eng · ‎11-30-2005

Andy -

I removed the ccm-manager config statement. I already had bind configured to my g0/1 interface. Would configuring bind to a loopback instead of the gig int really make a difference regarding lost keepalives?

I have both ccms and both gateways on the same 3550 switch in the same vlan. I have auto-qos enabled.

The interfaces on the switch are showing no lost packets or errors. CPU and bandwidth utilization on the switch is fine. The ccms don't show any issues in the event logs. It almost seems like the CCMs are not sending the keepalives when they should.

Is there a way to change timers so that mgcp waits longer than 30 seconds to failover?

ruesch-eng · ‎12-02-2005

This was due to a bug in the SATA RAID driver on the new HP DL320 G3 based MCS-7825-H servers. There is a patch available now from HP or Cisco, and it's in the latest Service Release for the OS.

jschlimg · ‎12-02-2005

I have been wrestling with the exact same issue for the last few weeks. Can you provide more details, or a case or bug reference my TAC engineer can look up?

Thanks

John

ruesch-eng · ‎12-02-2005

My colleague actually tracked down the bug and installed the patch, but I believe this link from Cisco provides the info you need.

http://www.cisco.com/en/US/customer/products/hw/voiceapp/ps378/products_field_notice09186a008055528f.shtml