12-11-2011 02:51 PM
Hi,
I have a pair of 4710s (ACE01 and ACE02) in a fault-tolerant configuration, set up following the standard configuration guidelines. I have the following problem:
(1) I reload ACE01 and ACE02 seems to take control.
(2) After the reload completes, ACE01 does not accept SSH logins, so I have to log in via an async router. When I then run the 'sh arp' command on ACE01, it hangs for about 2 minutes and returns the following message:
ace01/Admin# sh arp
Context Admin
rpc call failure. retval = -998
ace01/Admin#
(3) About 4 or 5 minutes after ACE01 comes back up, I lose SSH connectivity to ACE02. When I log in to ACE02 via the async router, I get the following message:
Arpmgr busy, Possible ARP flood, 526801 arp pkts were dropped over last60 secs
(4) To get out of this state, I have to break the fault-tolerant link and shut down the primary network link (by shutting down the switchport that the ACE units connect to), then reload both devices again. After that I can log in via SSH.
Please could someone help me? I don't understand what is going on. I have googled the above messages, and the results suggested it might be related to a bug on the switches that the ACE units connect to (2960s). I have since upgraded the switches, but still no luck.
Here is the basic fault tolerant config.
interface gigabitEthernet 1/1
  switchport access vlan 711
  no shutdown
interface gigabitEthernet 1/2
  shutdown
interface gigabitEthernet 1/3
  shutdown
interface gigabitEthernet 1/4
  description Fault Tolerant (ea-ste10-ace02)
  ft-port vlan 999
  shutdown
ft interface vlan 999
  ip address 1.1.1.1 255.255.255.252
  peer ip address 1.1.1.2 255.255.255.252
  no shutdown
ft peer 1
  heartbeat interval 300
  heartbeat count 10
  ft-interface vlan 999
ft group 1
  peer 1
  priority 200
  associate-context ea
  inservice
Software Version: A5(1.0)
I also have: 1 Admin context and 1 user context
I really would appreciate some help/guidance as I am struggling and I have a deadline to meet as we are going live soon with this system.
Regards
Sajjad
12-11-2011 11:59 PM
HA cannot work because the interface carrying the FT Vlans is down.
Your ARP storm probably comes from a split-brain event.
12-12-2011 06:34 AM
Hi,
Sorry, the config I posted was slightly incorrect: the FT VLAN interface was not shut down, and there wasn't a split-brain event.
I have also downgraded to version A4(2.1a) to rule out faults in A5(1.0), and I am still getting the same problem.
12-12-2011 06:41 AM
What do the various "show ft xxx" commands show on both units when the issue occurs?
12-13-2011 04:08 AM
Hi,
Here is the 'show ft group det' result after ACE01 has been reloaded. The behaviour seems to be normal, but the problem is still there.
ea-ste10-ace01/Admin# sh ft group detail
FT Group : 1
No. of Contexts : 1
Context Name : ea
Context Id : 1
Configured Status : in-service
Maintenance mode : MAINT_MODE_OFF
My State : FSM_FT_STATE_ACTIVE
My Config Priority : 200
My Net Priority : 200
My Preempt : Enabled
Peer State : FSM_FT_STATE_STANDBY_HOT
Peer Config Priority : 100
Peer Net Priority : 100
Peer Preempt : Enabled
Peer Id : 1
Last State Change time : Tue Dec 13 10:48:32 2011
Running cfg sync enabled : Enabled
Running cfg sync status : Running configuration sync has completed
Startup cfg sync enabled : Enabled
Startup cfg sync status : Startup configuration sync has completed
Connection sync enabled : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace01/Admin#
ea-ste10-ace02/Admin# sh ft group detail
FT Group : 1
No. of Contexts : 1
Context Name : ea
Context Id : 1
Configured Status : in-service
Maintenance mode : MAINT_MODE_OFF
My State : FSM_FT_STATE_STANDBY_HOT
My Config Priority : 100
My Net Priority : 100
My Preempt : Enabled
Peer State : FSM_FT_STATE_ACTIVE
Peer Config Priority : 200
Peer Net Priority : 200
Peer Preempt : Enabled
Peer Id : 1
Last State Change time : Tue Dec 13 10:48:57 2011
Running cfg sync enabled : Enabled
Running cfg sync status : Running configuration sync has completed
Startup cfg sync enabled : Enabled
Startup cfg sync status : Startup configuration sync has completed
Connection sync enabled : Enabled
Bulk sync done for ARP: 0
Bulk sync done for LB: 0
Bulk sync done for ICM: 0
ea-ste10-ace02/Admin#
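For anyone comparing the two outputs above by hand, the check can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `parse_ft_group` and `ft_pair_status` are made up for this example, the sample text is abridged from the outputs above, and it assumes a split brain would show up as both units reporting FSM_FT_STATE_ACTIVE.

```python
def parse_ft_group(text):
    """Turn the 'Key : Value' lines of 'show ft group detail' into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; timestamps contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def ft_pair_status(unit_a, unit_b):
    """Compare the parsed output of both units (hypothetical helper)."""
    a, b = unit_a.get("My State"), unit_b.get("My State")
    # Assumption: split brain means both units claim the active role.
    if a == b == "FSM_FT_STATE_ACTIVE":
        return "split-brain"
    # Each unit's view of its peer should match what the peer reports.
    if unit_a.get("Peer State") != b or unit_b.get("Peer State") != a:
        return "inconsistent"
    if {a, b} == {"FSM_FT_STATE_ACTIVE", "FSM_FT_STATE_STANDBY_HOT"}:
        return "ok"
    return "check manually"


# Abridged from the outputs posted above.
ACE01 = """
My State : FSM_FT_STATE_ACTIVE
Peer State : FSM_FT_STATE_STANDBY_HOT
Last State Change time : Tue Dec 13 10:48:32 2011
"""
ACE02 = """
My State : FSM_FT_STATE_STANDBY_HOT
Peer State : FSM_FT_STATE_ACTIVE
Last State Change time : Tue Dec 13 10:48:57 2011
"""

print(ft_pair_status(parse_ft_group(ACE01), parse_ft_group(ACE02)))
```

Run against the outputs above it reports "ok", which matches the observation that FT looks normal at the moment the commands were captured. It says nothing about the state of the units while the ARP problem is actually occurring.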
12-13-2011 10:00 AM
Hi Sajjad,
It looks like the result of an ARP flood/storm (triggered by a loop?). While the issue is occurring, you could confirm this by executing:
show processes cpu
and checking the CPU usage of arp_mgr (ARP is handled on the control plane in ACE). You could also monitor the switch port connected to ACE01 and take a trace to see what the traffic actually is. It could also help to enable "mac-address-table notification mac-move" on the switches to detect loops.
Should the above not clarify the issue I would suggest opening a TAC SR to get this investigated further.
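To put a number on the flood, the "Arpmgr busy" message quoted earlier in the thread can be turned into a drop rate. The following is a rough Python sketch, not an official tool: the function name `arp_drop_rate` and the regular expression are assumptions based only on the single message shown in this thread.

```python
import re

# Pattern inferred from the one message in this thread (an assumption).
# "\s*" tolerates the missing space in "last60 secs".
ARP_DROP_RE = re.compile(r"(\d+) arp pkts were dropped over last\s*(\d+) secs")


def arp_drop_rate(message):
    """Return dropped ARP packets per second, or None if the line does not match."""
    match = ARP_DROP_RE.search(message)
    if match is None:
        return None
    dropped, seconds = int(match.group(1)), int(match.group(2))
    return dropped / seconds


msg = ("Arpmgr busy, Possible ARP flood, "
       "526801 arp pkts were dropped over last60 secs")
rate = arp_drop_rate(msg)
print(f"about {rate:.0f} ARP packets dropped per second")
```

For the message above this works out to roughly 8,780 dropped ARP packets per second, far more than two appliances and a handful of hosts on one VLAN would normally generate, which is consistent with the loop/storm theory.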
Cheers,
Francesco