Solved: 4500X Routing/ARP Issue

John King · ‎10-30-2017

We are experiencing an odd issue with one of our Catalyst 4500X switches.

We recently installed a new one of these into one of our campuses as a replacement for the Core switch of the site. We had already installed one into the Core of another site with no issues.

The two switches are running different software versions.

Working site on:

Catalyst 4500 L3 Switch  Software (cat4500e-UNIVERSALK9-M), Version 03.07.00.E RELEASE SOFTWARE (fc4)

Issue site:

Catalyst 4500 L3 Switch  Software (cat4500e-UNIVERSALK9-M), Version 03.09.00.E RELEASE SOFTWARE (fc1)

The problem we are experiencing is a loss of connectivity *through* the switch. It first surfaced when our monitoring software could not ping certain access switches on the network. Now the print server cannot contact printers. It seems to affect all contact with the outlying devices: once a printer drops out of contact, you cannot reach its web page, ping it, or print to it at all.

Restarting the outlying device (the printer) will restore functionality for it. If you ping the device *from* the 4500X, it comes back up immediately. This is what led me to think it may be an ARP table issue.

It is a sporadic issue, but is troublesome when it does occur. Is there something I can do to determine the cause or troubleshoot further? Has anyone seen this before?

Reza Sharifi · ‎10-30-2017

It could be a software bug. To rule that out, you may want to downgrade to match the working 4500 switch (03.07.00) and see if the issue goes away.

HTH

John King · ‎03-21-2018

While not what I would say is a solution to the problem posed, downgrading does solve the symptoms of this issue.

jseely7150 · ‎04-11-2018

Thank you, John King, for the excellent problem description. I was in a rush last month to install gear at a remote location, and while I had copied our known working 4500X image (03.06.06.E) I never did the reload. It shipped with 03.09.00.E.

We've had four sporadic "events" where some devices at this location go down in our monitoring system. These events self-cleared after several minutes, and then we wouldn't see the issue again for a week or more. The devices affected on our network are some of the older ones: older Digis, SNMP cards in UPSs, older environmental monitoring devices, etc. I had suspected a broadcast or multicast storm, since I was only seeing a problem on the old, slow devices. I captured some irregular large bursts of ARP broadcasts from the 4500X switch (the LAN gateway) that coincided with the devices going down, did a search, and saw your post. I tried pinging the "down" devices from the 4500X and they immediately popped back up on the next poll.

I've scheduled a reload for that switch to get on the desired version. :-)

Thanks!

pp089x001 · ‎04-12-2018

Hi

We solved the problem by upgrading both 4500X switches to version 3.9.2E, as suggested by Cisco; we had opened a TAC case.

richcomscisco · ‎02-28-2019

Hi

Did this resolve the issue?

I have the same problem with the 03.09.00.E software.

jseely7150 · ‎05-02-2018

Just an update to my previous post:

Going down rev to 03.06.06.E from 03.09.00.E solved my issues.

elio.garcia · ‎05-22-2019

I had the same issue. It is a software bug in version 15.2.

There are some workarounds listed here:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvb78700/?rfs=iqvred

Symptom:
4500X unable to forward packets when they need to be unknown unicast flooded

Conditions:
Destination device's mac is NOT present on the switch mac table, but the ARP is resolved

Workaround:
This issue is observed when the switch does not have a known MAC address table entry for a given host.
In this scenario, the switch must perform an "Unknown Unicast Flood" which uses the Unicast floodset table that is broken due to this bug.

As a result, the issue can be worked around by ensuring the MAC address table never loses MAC table entries due to inactivity for a given host.

This can be accomplished with the following workarounds:

Workaround 1: Set the mac address aging timer so that it is longer than the ARP Aging timer.

The default ARP Aging timer is 4 hours (14400 seconds).
The default MAC aging timer is 5 minutes (300 seconds).

By increasing the MAC aging timer and decreasing the ARP aging timer, ARP will refresh an entry before the MAC aging timer ages out the MAC address.
In the process of ARP refreshing an entry, the corresponding MAC table entry will also be refreshed thus ensuring the switch does not have to perform an unknown unicast flood.
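As a rough illustration of the timer arithmetic above, the following Python sketch computes how long a host can sit in the state the bug needs (ARP entry still resolved, MAC entry already aged out). The function name `exposure_window` is made up for this example, and the model is deliberately simplified: it assumes an idle host whose ARP and MAC entries were last refreshed at the same moment.

```python
# Default timers quoted in the bug description, in seconds.
DEFAULT_ARP_TIMEOUT = 14400  # 4 hours
DEFAULT_MAC_AGING = 300      # 5 minutes


def exposure_window(arp_timeout: int, mac_aging: int) -> int:
    """Seconds during which ARP is still resolved but the MAC entry has aged out.

    Simplified model (an assumption, not Cisco's implementation): the host is
    idle and both entries were last refreshed at time zero.
    """
    if arp_timeout <= 0 or mac_aging <= 0:
        raise ValueError("timers must be positive")
    return max(0, arp_timeout - mac_aging)


if __name__ == "__main__":
    # With the defaults the window is open for most of the ARP lifetime.
    print(exposure_window(DEFAULT_ARP_TIMEOUT, DEFAULT_MAC_AGING))  # 14100
    # Hypothetical tuned values: MAC aging longer than the ARP timeout.
    print(exposure_window(arp_timeout=300, mac_aging=600))  # 0
```

A result of zero corresponds to the goal of Workaround 1: the MAC entry never ages out before ARP refreshes it.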

Workaround 2: Set a static MAC address table entry for the impacted hosts.

Static MAC table entries do not age out and as a result will never be unknown unicast flooded.
Note this workaround is inflexible and changes to the topology may result in the static entry pointing towards an incorrect port.

Workaround 3: Correct the floodset table directly.

The floodset table is reviewed and reprogrammed with the following activity:

1) Adding or removing a port from the affected VLAN.
2) Shut / No Shut of a port in the affected VLAN.
3) Removal and addition of the affected VLAN.
4) Reload of the switch with the issue.

Note that this workaround is temporary and the issue can come back again over time.

Further Problem Description:
"show platform software floodset vlan <>" and "show platform hardware floodset vlan <>" will be out of sync for the Unicast floodset.

The software unicast floodset will have all the required ports in specified vlan under it.
The hardware floodset may have no ports mapped or may have some ports missing.
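The out-of-sync condition described above amounts to a set difference between the ports in the software floodset and those in the hardware floodset. A minimal Python sketch of that comparison follows. The output format of the two show commands is not given here, so the port lists are entered by hand, and both the port names and the helper name `missing_from_hardware` are placeholders.

```python
def missing_from_hardware(software_ports, hardware_ports):
    """Return ports present in the software floodset but absent in hardware."""
    return sorted(set(software_ports) - set(hardware_ports))


if __name__ == "__main__":
    # Placeholder port names, copied by hand from the two show commands.
    software = ["Te1/1", "Te1/2", "Te1/3"]
    hardware = ["Te1/1"]
    gap = missing_from_hardware(software, hardware)
    if gap:
        print("Hardware floodset is missing:", ", ".join(gap))
    else:
        print("Floodsets agree")
```

Any port reported as missing would be one that unknown unicast floods in that VLAN never reach.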

paul driver · ‎10-31-2017

Hello

I encountered something extremely similar on a 4500 VSS core, which drove me crazy for a while; it turned out to be CSCue76243.

res
Paul


pp089x001 · ‎03-20-2018

Hi John
Did you solve this problem?
I have a plant in a critical situation due to exactly the same problem as yours.
Did you downgrade the IOS, and did that solve the issue?
Thanks

John King · ‎03-21-2018

I did end up downgrading, and the issue has not recurred at all. I hope your results match mine. At least it wasn't mass amounts of downtime, only a restart of the switch after changing the boot image.

scotts1919 · ‎04-12-2018

What version did you downgrade to? What version are you currently running? We had a similar issue running 3.8.3 and, as of this morning, changed to 3.8.6. Waiting to see if the issue manifests itself again.

John King · ‎05-01-2018

We downgraded to: cat4500e-universalk9.SPA.03.07.00.E.152-3.E.bin

So, version 03.07.00.E.

jerrymatson1 · ‎05-01-2018

We are on the same "TAC" approved version but still suffering from this issue. We have a TAC case open now to see why. We are thinking of downgrading software.

For a bit of clarification, before the 3.09.02E patch was applied we had issues with IPs on the same VLAN between switches connected to the 4500x core (the core itself could reach everything just fine). After the 3.09.02E patch was applied the problem was resolved on directly connected switches but was still a problem for indirectly connected switches daisy chained downstream from the 4500x.

System image file is "bootflash:cat4500e-universalk9.SPA.03.09.02.E.152-5.E2.bin" <--- Still broken for us.

John King · ‎05-01-2018

Not the best situation there. Is there a later version of the image you could use? If you can reliably reproduce the error, you could maybe test with a downgraded version. From your description, your situation seems slightly different from what we were experiencing. We had issues between different VLANs, not within the same VLAN.