08-01-2023 10:52 PM
Hi everyone,
I hope you can help me.
I'm having the weirdest of issues. I have two Cisco 3650-24TS switches connected to each other via a LAG of 4 x 1 Gbps links, with several VLANs configured to run HSRP between them. Besides these two switches, I also have two 2960s, each connected via a 1 Gbps link to each 3650 for redundancy.
At the moment the exit link for these switches goes via CSW1, not CSW2. For CSW2 to send data to other parts of the network, it needs to go via CSW1.
Here is the issue: everything was working correctly with IOS version 16.12.03a and I could ping all the SVIs, but after upgrading the switches to 16.12.09 (or even 16.12.08), I "lost" CSW2. That is, I can still access one of the 3650s without any issues and can ping all its SVIs, but on CSW2 all the SVIs stopped responding to ping and I can no longer access the switch remotely. Even when connected locally via console cable, the switch can't ping itself. To make things even weirder, all the clients that pass through this switch and use its gateways are working correctly. The clients can't ping their gateway, but they are forwarding data as if nothing had happened, and I can't access anything on that switch.
If I disconnect everything, leave the switch empty, and plug one computer into it, I can ping the gateway, but as soon as I connect it to the other core I lose it for some reason...
With this in mind, I decided to try another 3650. I installed it and everything worked with version 16.12.03a, but as soon as I upgrade to 16.12.09 I lose it, while all the unicast and multicast routing continues to work normally... I have never seen anything like this.
Any help please?
Thank you
08-02-2023 06:50 AM
Hi,
It appears that there is a major software bug in 16.12.09, as you have seen the same issue with multiple switches. If you have support, open a ticket with TAC; they may have more info on the bug you are encountering. If not, move to a different version.
HTH
08-02-2023 08:07 AM - edited 08-02-2023 08:08 AM
Thank you @Reza Sharifi
It's weird, but I ended up finding the culprit: the CPU was at 100% due to the IP Input process. I ran "debug ip packet detail" and found that it was caused by the multicast routing on one specific VLAN.
A little more info: I have 4 VLANs (VLAN 20, 30, 40, 50) that all have "ip pim sparse-dense-mode" configured, since they all participate in the multicast. These VLANs have HSRP configured on both switches, where one switch uses .252 and the other .253 as the final octet, with the virtual gateway being .254.
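For readers following along, the setup described above would look roughly like the sketch below on one of the core switches. This is only an illustration: the 192.0.2.0/24 subnet, the HSRP group number 20, and the priority value are placeholders I made up, and which switch holds .252 versus .253 is not stated in the thread.

```
! Hypothetical sketch of one SVI on one core switch (addresses and group number are placeholders)
ip multicast-routing
!
interface Vlan20
 ! Real address of this switch; the peer would use .253 (or vice versa)
 ip address 192.0.2.252 255.255.255.0
 ! PIM mode as described in the post
 ip pim sparse-dense-mode
 ! HSRP virtual gateway shared by both switches
 standby 20 ip 192.0.2.254
 ! Assumed priority; the higher value becomes the active gateway
 standby 20 priority 110
 standby 20 preempt
```

The same pattern would repeat for VLANs 30, 40, and 50 with their own subnets.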
With "debug ip packet detail" running, I could see thousands of packets entering on VLAN 20 with route null on CSW2, and when checking interface VLAN 20 I could see thousands of input drops on the interface. On CSW1, which has an identical config, everything was running smoothly and the CPU was at 1 to 2%.
Disabling PIM on VLAN 20 (this VLAN only has clients pulling data from the multicast groups) would "solve" the issue, and the CPU would drop immediately. CSW1, on exactly the same version 16.12.09, is working correctly.
I manipulated HSRP to make CSW2 the main gateway as a test, but got the same result. I configured IP PIM redundancy together with HSRP, but again the same result. The solution was to downgrade this switch. I found that I could downgrade to version 16.12.05b and everything would run smoothly, but if I upgrade to anything higher than that, the same issue appears and the CPU ramps up...
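For anyone wanting to reproduce these checks, the commands involved are roughly the ones below. Treat this as a sketch rather than a procedure: the output format differs between releases, and "debug ip packet detail" can itself overload a switch whose CPU is already busy, so it is usually safer from the console and for a short time only.

```
! Find which process is consuming the CPU (look for "IP Input")
show processes cpu sorted
! Check input drops and queue counters on the SVI
show interfaces Vlan20
! Inspect PIM neighbors and the multicast routing table
show ip pim neighbor
show ip mroute
! Packet-level debug, then switch it off again
debug ip packet detail
undebug all
```

Comparing the same outputs on CSW1 and CSW2 side by side is one way to see where the two switches diverge.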
This is a weird one... Can this issue really be present in 4 versions of IOS?
Thank you
Kind regards
08-02-2023 08:19 AM
Thanks for the feedback and I am glad you were able to find the culprit!
I am wondering if there is a hardware issue that triggers this bug with any version higher than 16.12.05b.
Are all of these 3650s the same hardware version? You can use "show switch" to find the hardware version.
HTH
08-02-2023 08:25 AM - edited 08-02-2023 08:28 AM
Just did the check @Reza Sharifi, and both have the same hardware version, V03. I have never seen one of these... at least now I know I can only go up to 16.12.05b in this network if I want to avoid issues.
08-02-2023 08:40 AM
This is very strange
Do you have support for these switches? If yes, can you open a ticket? I am curious now what Cisco has to say about this, and if this is a known issue.
HTH
08-02-2023 08:50 AM - edited 08-02-2023 08:59 AM
That's my problem @Reza Sharifi: my customers don't want to pay for Cisco TAC and my company won't pay for it either, so I'm in the middle trying to figure things out with the amazing help of the community.
In essence, the forum is my Cisco TAC.
08-02-2023 09:37 AM
We try to help as much as we can, and in a lot of cases our advice is more in line with what people need than TAC's. At the same time, we don't have access to the internal resources Cisco TAC does, and that is where the support contract comes in. Are all these switches running the same license?
HTH
08-03-2023 12:37 AM