06-16-2022 03:31 AM
Hi
I have 3 main sites with 5520s and 8500s in HA; these all connect to 4 external mobility anchors and 3 internal anchors in the DMZ.
I have upgraded to 8.5.182.104 and 8.10.171.0; the issue was happening both before and after the upgrade.
The issue I have is with WLCs on Site 3 going to 3rd Party Company 1.
The 5520 and 8500 on Site 3 seem to drop for a split second at different times, not at the same time, with no pattern. Each HA pair drops at a different time.
They are both on the same 10Gb blade, and the connection to the firewall is on a 1Gb port on the same distribution switch.
All WLANs are configured exactly the same across the board
An SFP or fibre issue would affect all 2,000+ devices on that uplink at the same time.
It is not an issue between the distribution switch and the firewall, as that would affect the other 2 sites.
There are no issues with the other mobility anchors to other organisations or our DMZ WLCs.
RADIUS is configured correctly.
I know how anchors work (I've been configuring them for years), so no need to suggest epings or mpings; the anchors are up 99.99999% of the day.
2022-06-16 08:18:57 Local7.Warning 10.*.*.* host00WLC: *mmMobility: Jun 16 08:18:58.162: %MM-4-INET_MEMBER_DOWN: [PA]mm_heartbeat.c:531 Data path to mobility member 192.168.226.10 is DOWN.
2022-06-16 08:18:57 Local7.Alert 10.*.*.* host00WLC: *mmMobility: Jun 16 08:18:58.162: %MM-1-ANCHORS_DOWN: [PA]mm_heartbeat.c:730 All Export-Anchors are down on WLAN 20
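For anyone following along, those two messages are the heartbeat logic firing: the WLC declares a peer's data path down only after a run of consecutive missed keepalives. A minimal sketch of that detection logic (not Cisco code; the defaults of a 10-second interval and a miss count of 3 are assumptions here, so verify the keepalive settings on your own controllers):

```python
class MobilityPeer:
    """Toy model of a WLC mobility peer's keepalive state machine."""

    def __init__(self, ip, keepalive_count=3):
        self.ip = ip
        self.keepalive_count = keepalive_count  # consecutive misses before DOWN
        self.missed = 0
        self.data_path_up = True

    def keepalive_received(self):
        # Any successful keepalive resets the miss counter and recovers the peer.
        self.missed = 0
        self.data_path_up = True

    def keepalive_missed(self):
        self.missed += 1
        if self.missed >= self.keepalive_count and self.data_path_up:
            # This is the point where the WLC would log %MM-4-INET_MEMBER_DOWN.
            self.data_path_up = False
            print(f"Data path to mobility member {self.ip} is DOWN.")


peer = MobilityPeer("192.168.226.10")
peer.keepalive_missed()
peer.keepalive_missed()
assert peer.data_path_up          # still up after 2 misses
peer.keepalive_missed()
assert not peer.data_path_up      # declared down on the 3rd consecutive miss
```

The takeaway: a "split second" outage in the logs actually means several keepalives in a row were lost in one direction, which is why one-way communication (see the debug suggestion below in the thread) is a prime suspect.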
TAC has looked at the 3rd-party site and can't find an issue.
06-16-2022 04:04 AM - edited 06-16-2022 04:05 AM
Hi
At first sight, the firewall is my suspect. Are they all the same firewall vendor, model, and version? I've seen a Check Point problem where the permit rule was in place but traffic still failed to pass, though not completely. The fix involved adding an additional command on the CLI and also applying a patch. I mention this because firewall admins sometimes point to a screenshot of the rule to say everything is good on their side, when in fact not even they are seeing the problem.
06-16-2022 05:02 AM
They are FortiGate 200s, old, but only used between the 2 companies. I've ruled out the firewall, as it would be down all the time if it were a rule issue. They are on the latest code.
Cheers
06-16-2022 05:52 AM - edited 06-16-2022 05:53 AM
"Ruled out the firewall as it would be down all the time if a rule issue"
That's something a firewall guy usually says. As I mentioned above, I have too much experience with this to buy that statement.
06-16-2022 06:44 AM
I know the guy was correct, as it was me who set it up; well, my side anyway.
Unfortunately I can't get access to the other firewall, as it belongs to the 3rd party.
cheers
06-16-2022 02:16 PM
Based on the mobility log, the tunnel went down because of a lack of mobility keepalives, so the WLC thinks the other end is down. I've seen many times that the network can end up with one-way communication on mobility tunnels, so I would recommend debugging the mobility keepalive exchange between the problematic controllers to see whether there are any errors in the debug output, or whether communication is one-way. Run these commands on two of the problematic WLCs (one anchor and one foreign), wait for the tunnel to go down, then compare the logs:
debug mobility keepalive enable
debug mobility peer-ip <IP-address>
I agree with @Flavio Miranda about the firewall admins; they often say everything is OK based on the firewall rules and don't take a deeper look. There is a very old Cisco document explaining that firewalls can cause a "route loop" inside the firewall due to their architecture, and that triggers the one-way communication. Unfortunately, I couldn't find the document, but I know the workaround is to clear the firewall's session for the mobility tunnel so that a new session is created and everything works again.
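Since the firewalls in this thread are FortiGates, the session-clear workaround would look roughly like the fragment below. This is a sketch only: the anchor IP is taken from the log messages earlier in the thread, and you should double-check the filter on your own unit before clearing, because `diagnose sys session clear` removes every session matching the current filter.

```
# FortiOS CLI sketch (run on the FortiGate between the foreign and anchor WLCs)
diagnose sys session filter dst 192.168.226.10   # filter on the anchor's IP
diagnose sys session list                        # inspect the matching sessions first
diagnose sys session clear                       # clear them; a fresh session is built
```

If the tunnel recovers immediately after the clear, that strongly suggests a stale or asymmetric session entry in the firewall rather than a WLC problem.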
09-12-2022 03:38 AM
Thanks both, I'll keep looking into it.