4500x VSS odd issue, cannot ping any IP now

DAVE GENTON
Level 2

Customer network deployed two months ago. They had an STP topology change notification in the log this morning from replacing/swapping out a layer 3 IDF closet switch, but nothing further. From any and all IDFs, you cannot ping the default gateway of any VLAN, which means you cannot ping any SVI. I cannot see any changes made since I left other than that one switch. The Nexus 5548 vPC pair is STP root, with priorities hard coded due to vPC being in effect at that layer, and the vPC connects to a port-channel on the VSS across both switches.

All transit traffic appears to be passing/flowing as required. I see CDP neighbors, the new switch has ip routing enabled to advertise only the VLANs in that closet, the routing protocol adjacency is formed, and routes are exchanged. But I can no longer terminate any traffic on the VSS node; I cannot ping the default gateway from an IDF to test, which is what started all this. I can get to any and all devices in the network EXCEPT the 4500X core; we must use a console cable, as IP appears not to work for anything that must terminate on it. Odd; I've never seen anything like this in 21 years of Cisco engineering, let alone on VSS. It almost makes me think the VSS virtual MACs are the issue, but again, only terminating traffic is affected. Looking for a way to resolve this without impact, though I fear a reboot is inevitable.

The config is too simple to post, really: some SVIs for routing, with 6 IDFs that are port-channeled with one link to each 4500X switch, then of course links to the vPC pair of 5548s where all the servers reside. All works but management, it seems.

For management they have 10.96.0.x/24 on Fa1, which they don't really use, and then the same 10.96.0.x/24 management subnet on their VLAN 1 SVI, but that is NOT in the mgmt VRF. Could the two be conflicting now, somehow, after running fine for 8 weeks??? Just thinking out loud.


9 Replies

Marvin Rhoads
Hall of Fame (Accepted Solution)

We have seen some bugs (not always published ones) with 4500X VSS on 3.4.0SG software. Remember 3.4.0SG was the first release to support 4500X VSS and is considered a "dot 0" major release by Cisco.

The recommendation I received from the business unit (via TAC) was to upgrade to 3.4.1SG (just released ca. 24 July 2013).

Well, I got the new release in this morning and it's been stable and fine since then. Before the upgrade, it would never make it two hours after a reboot before going unreachable again. I wish I had a root cause, or at least something: it ran error-free for 8 weeks, and then it appeared as if nothing would clear it. Certainly one of the oddest things I have seen in over 20 years of Cisco engineering. I'm just glad it's stopped, as all appears fine now.

Glad to hear you're running more stably now. Hope it remains that way.

It's anecdotal, but in the installation where I saw the bug it only manifested after we had added some features - QoS and BGP - on the units.

I have had the same issue with my 4500x stack, due to the fact that Fa1 was active (?!!). The recommended fix IS the software upgrade, but there is also a workaround. See below.

CSCue76243 Bug Details

Symptom:

Sup7E, Sup7LE or 4500X loses all L3 connectivity to/from switch IP address. Switching continues to work, but IP traffic to/from the switch does not. This includes snmp, ntp, telnet, ssh, etc.

Conditions:

Sup7E, Sup7LE or 4500X using 3.4.0SG

Workaround: Disable CEF on the L3 interfaces to temporarily restore service (but it will be software switched). Once the problem occurs, the switch must be rebooted. To prevent the problem from occurring, shut down Fa1. If Fa1 is shut down, the problem will not appear.
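A hedged sketch of what those workaround commands might look like at the CLI (interface names here are examples, not from the bug report; verify the exact syntax on your release):

```
! Temporary recovery: disable CEF on the affected L3 interfaces
! (traffic to/from the switch will be software switched until reboot)
configure terminal
 interface Vlan10           ! example SVI; repeat for each L3 interface
  no ip route-cache cef
 exit
! Prevention: keep the management port down so the bug cannot trigger
 interface FastEthernet1
  shutdown
 end
copy running-config startup-config
```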

Upgrading to 3.4.1 will resolve this issue, or alternatively you can shut down the management interface FastEthernet1. Since this is a VSS configuration, I would recommend disabling the management interface on both physical switches. In order to do this, I recommend the following action plan:

1.  Shut FastEthernet1 on the active switch

2.  Save configuration with “copy run start”

3.  Failover the VSS

4.  Shut FastEthernet1

5.  Save configuration with “copy run start”

6.  WAIT until the peer switch is ready for failover

         a.  Check with “show redundancy” and wait until the peer supervisor is in STANDBY HOT

7.  Failover the VSS

8.  Layer 3 connectivity should be restored once the second failover is complete
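The action plan above might look like this at the CLI (a sketch only; `redundancy force-switchover` is the standard command to fail over a VSS, per the quote later in this thread, but verify against your software release before running it):

```
! Steps 1-2: on the active switch
configure terminal
 interface FastEthernet1
  shutdown
 end
copy running-config startup-config

! Step 3: fail over to the standby
redundancy force-switchover

! Steps 4-5: on the new active switch
configure terminal
 interface FastEthernet1
  shutdown
 end
copy running-config startup-config

! Step 6: wait until the peer supervisor shows STANDBY HOT
show redundancy

! Step 7: fail over again
redundancy force-switchover
```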

Hi, I just encountered the same issue on my 4500-x VSS.  Also affected all of my clients' dhcp requests.

Can someone please provide the procedure for failing over the VSS and the permanent solution of upgrading the IOS on the VSS?  Thanks.

Peter

"To force a switchover from the VSS Active to the VSS Standby supervisor engine, use the redundancy force-switchover command." (Source)

The bug is reported to be fixed in 3.4.1SG and subsequent releases. The procedure for the upgrade is further down in the same document I linked above, i.e. starting here.
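In broad strokes, a standard (disruptive) 4500-X VSS upgrade looks something like the sketch below. The TFTP server address and image filename are placeholders, not taken from the linked document; use the exact image name for the release you download, and follow the official upgrade procedure for your version:

```
! Copy the new image to both VSS members (filename is an example)
copy tftp://192.0.2.10/cat4500e-universalk9.bin bootflash:
copy bootflash:cat4500e-universalk9.bin slavebootflash:

! Point the boot variable at the new image
configure terminal
 no boot system
 boot system flash bootflash:cat4500e-universalk9.bin
 end
copy running-config startup-config

! Reload both VSS members together (disruptive)
redundancy reload shelf
```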

It seems to me that 3.4.1E1 does not yet have this fix, as we encountered the problem on it and backed out. But apparently the code we are running, 3.4.0SG, also has this bug, yet it is working better.

Can anyone confirm my assumption above? Should I move to 3.4.2SG2?

We need fa1 for now because we are staging and that is our access for the moment.

By the way, when we were running 3.4.1E1, Fa1 was not reachable and did not show a CDP neighbour on the adjacent switch.

None of our connected 3850 switches were reachable from the 4500X.

Hi Garry,

I would recommend 3.4.2 (151-2SG2)  on the 4K.  The bug you are referencing is fixed in this version.

https://tools.cisco.com/bugsearch/bug/CSCue76243

HTH

Luke

I know this is an old discussion but I'm having a very similar issue on a VSS pair using 4500E chassis (4510R+E & 4507R+E) running IOS version 3.6.4E.  Any known fix release for this?  A workaround would also be nice until I can fit this in for our maintenance window next week.
