
Issue with Internet UPLINK when moving from 4948 to 4948E

dajohnso
Level 1

dajohnso_0-1747968117491.png

I am upgrading my core switch from a 4948 to a 4948E so I can use a newer version of IOS (to fix a bug discovered in another post). I already upgraded IDF2 to a 4948E without issue. Now I am upgrading MDF1, so I installed a new switch (MDF1a) connected as shown. I copied all the interface configuration from MDF1 to MDF1a and added MDF1a to the VTP domain. So far it looked like MDF1a was connected: I can see it in "sho cdp nei" and I can ping the interface (as well as ssh into it) on the network. Everything looked good, so I started moving all the connections from MDF1 to MDF1a (same port). Everything was fine until I moved the Internet uplink from MDF1 to MDF1a, at which point I lost all Internet access. I moved everything back and it works fine.

I did some diagnostics and eventually moved ONLY the Internet uplink to the new MDF1a switch while watching ping responses for 1.1.1.1. After 2-3 normal pings the response time started climbing quickly, nearly doubling with every ping until it was 3500ms+, and then about half the pings were lost and the other half took around 4000ms. I moved the connection back to MDF1 and it was immediately back to 4ms. I moved it to MDF1a and it was again back to 4000ms and dropped packets. (Note: when all the switches were 4948s I never had any issues, and other than upgrading to the 4948E there were no other changes.)

The 4948 is cat4500-entservicesk9-mz.150-2.SG11.bin

Both MDF2 and MDF1a are cat4500e-entservicesk9-mz.152-2.E6.bin

I am upgrading the 4948Es to cat4500e-entservicesk9-mz.152-4.E5.bin to see if it does the same thing.

I also want to note that I have (2) vlans each with its own uplink. VLAN 400 is Verizon FTTI (1G fiber) and VLAN 200 is a Broadband connection. Same thing happened on BOTH uplinks, immediately went to 4000+ms and packet loss when moved to MDF1a

I suspect an L2 issue

here are the two highest utilization tasks.

PID  Runtime(ms)  Invoked  uSecs    5Sec    1Min    5Min  TTY  Process
 79       616949  2248766    274   2.47%   2.69%   2.60%    0  Cat4k Mgmt HiPri
 80     15004998  2025681   7407  70.95%  67.15%  67.12%    0  Cat4k Mgmt LoPri

I upgraded to IOS 15.2(4)E5; it is still happening.


dajohnso
Level 1

Very surprised this didn't get ANY responses? I resolved the issue. After upgrading to 15.2.4-E10 it was still happening. I started with a blank switch and kept updating the config until the issue happened.  It turns out after about 6 or 7 interfaces with "IP DEV TRACK MAX 10"  commands the switch CPU utilization goes to 99% and throughput to the internet uplink drops to <1Mbps. This was true even if I only had 2 interfaces active! (the rest were all shut and it still happened!)

Very surprised this didn't get ANY responses?

Well, as so many want to help, it generally means no one felt they could make a positive contribution.

It turns out after about 6 or 7 interfaces with "IP DEV TRACK MAX 10"  commands the switch CPU utilization goes to 99% and throughput to the internet uplink drops to <1Mbps. This was true even if I only had 2 interfaces active! (the rest were all shut and it still happened!)

Surprisingly (at least it initially was to me), switches often don't have a very powerful CPU because, I believe, dedicated specialized hardware usually does almost all the heavy lifting. Further, the efficiency of software can vary a lot depending on the function being done in software. So it's not surprising that a CPU effectively running flat out can cause operational issues.

BTW, I've also often written that a CPU running at a high percentage, even 100%, isn't necessarily an issue either.  Conversely, a low utilization doesn't preclude issues.

What might indicate a bug, if I understand correctly what's happened, is that once this situation is created, removing the elements that caused it doesn't let the switch recover on its own?

But if you're wondering how it can happen even with only a few active interfaces, sure, that's not really surprising either.

Also, if your issue was ping related, to the switch, itself, a high CPU can really be a problem for that.  However, did you see the same ping issues, pinging a host on the far side of an impacted switch?

The switch utilization without the IP Tracking command is 5%. I was working on this issue for several days and eventually started with a blank switch and started adding my config 8 ports at a time and was very surprised to see utilization jump to 99% after the first 8 ports were configured. On a whim, I took out IP track on all ports and the problem went away immediately so I started adding the IP TRACKING back in and at around 5-6 ports the switch tanked. The CLI was sluggish and speed test on my laptop dropped to 1Mbps on a 1Gbps uplink. I removed the IP DEV TRACKING on all the ports and it immediately jumped back up to 950+Mbps

It doesn't happen on the 4948, only on the 4948E. I tried several IOS versions (about 8 of them) from 15.2(2) all the way up to the most recent I have, 15.2(4)E10. It occurred on all versions. (NOTE: my 4948 has 15.0(2)SG11.) I had to upgrade to 15.2(4) to resolve another critical bug impacting vlan 1 on a QinQ config, so I had to move to the 4948E platform since there wasn't a 15.2 for the 4948. The issue did not just impact ping: the speedtest bandwidth dropped to 1Mbps and ping response times jumped from 4ms to 4000ms with significant packet loss. I see that when IP DEV TRACKING is configured, all of the processing is in Cat4k Mgmt LoPri, but clearly not low enough, as it kills all other processing. I also tried several different 4948E switches with various versions of supervisor and CPU models. It affects all the 4948Es I have.

I would also note that it only happened when the Internet uplink was on the same switch, which is confusing. For my initial migration I connected the old switch to the new switch with a trunk and started moving the devices over one port at a time, after configuring the 4948E the same as the 4948. Everything went fine until I moved the Internet uplink to the new switch. If I moved the uplink back to the old switch, it went back to normal operation! This was in the MDF; I had another 4948 switch in the next IDF that I upgraded to a 4948E, and it functioned fine even with IP DEV TRACKING enabled on every port.

I suspect the 4948E is examining all the IPs on the uplink (like the entire Verizon network?). I didn't try turning off the tracking just on the uplink port.
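
One way to test that suspicion without touching the other ports might look like the sketch below. It assumes the uplink is GigabitEthernet1/1 (the port shown in the uplink config posted in this thread). It also assumes the running release accepts `show ip device tracking interface` and the `no` form of the command without its numeric argument. Neither assumption was verified on a 4948E.

```
! How many hosts has the switch learned, and on which ports?
show ip device tracking all
! Assumed uplink port; check how many entries were learned there
show ip device tracking interface GigabitEthernet1/1
!
configure terminal
 interface GigabitEthernet1/1
  ! Negate the tracking limit on the uplink only (append the value if the parser requires it)
  no ip device tracking maximum
 end
!
! Re-check CPU after the change
show processes cpu sorted | exclude 0.00
```

If the tracking table holds a large number of entries on the uplink port, that would be consistent with the theory; if it is nearly empty, the cause probably lies elsewhere.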

How is the uplink port configured?

Plain vanilla vlan 400. 

interface GigabitEthernet1/1
description FTTI Uplink
switchport access vlan 400
switchport mode access

 

 

Okay, just another access port.

How does L3 traffic reach the Internet? Does VLAN 400 have an SVI on the switch? Does the switch have some kind of default route to the Internet, via an L3 hop on the other side of the g1/1 link?

Basically, I'm just wondering if you have something similar to the infamous case of a default route that points to an interface without a next-hop IP. Reading up on ip dev tracking, I did come across mention that it can lead to high CPU consumption and/or memory usage and, for instance, that it's not recommended to have it active on trunk interfaces.

Of course, none of this easily explains why it's not a problem on the 4948s but is a problem on the 4948Es.  (Although, as already seen with .1Q, the two different switches don't handle that the same, either.)

If these switches were still supported, this is likely the kind of situation that merits a TAC case, and possibly might only be solved via them. Perhaps the best you can hope for is to avoid triggering the issue, and that whatever the trigger is, it's not critical to your needs.

The Cisco "answer" to this, would likely be, migrate to a newer, and supported, platform.

From all that you've described, my guess would be, ip dev tracking, might record its CPU consumption under Low Priority Management, but this task's "Low Priority" isn't low enough to protect transit traffic.

The latter, if true, is surprising, as transit traffic is generally handled by dedicated hardware, which is why I was so interested in if the performance impact was just to traffic interacting with the switch, itself, like pings, or also to traffic just transiting the switch.  (If I understand correctly, it's both.)

Regardless, if the problem seems to be related to ip dev tracking, and you can disable it, and can live without it, you have a work around, correct?
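
A minimal sketch of that workaround, as the original poster describes applying it, is below. The port range is hypothetical (a 4948E has 48 Gigabit ports, but only the ports that actually carry the command matter), and the full command name `ip device tracking maximum` is assumed to be what the abbreviated "ip dev track max 10" expands to.

```
configure terminal
 ! Hypothetical range; narrow it to the ports that have the command configured
 interface range GigabitEthernet1/1 - 48
  ! Negate the per-port tracking limit (append the value if the parser requires it)
  no ip device tracking maximum
 end
!
! Confirm that Cat4k Mgmt LoPri has dropped back down
show processes cpu sorted | exclude 0.00
```

Applying this during a maintenance window and saving the config only after CPU utilization settles would keep the change easy to back out.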

vlan 400 does not have an IP address. All devices (routers/firewalls) in vlan 400 have a static public IP and a gateway that points to the FTTI fiber uplink router at Verizon.

 

My guess is that it impacted everything, as even the CLI (on the serial port) was sluggish to respond. I also noted that I have 2 Internet connections, one in vlan 200 and the other in vlan 400. BOTH networks were crippled when it occurred and neither could pass any normal traffic.

I don't get exactly which link causes the issue, but if the CPU spikes to 99% you need to check the sorted CPU output and see whether STP has high CPU.

Without IP, the only thing that makes the switch CPU spike is an STP loop (L2 broadcast storm).

MHM

. . . if the CPU spikes to 99% you need to check the sorted CPU output and see whether STP has high CPU.

That may have already been answered in OP, specifically:

here are the two highest utilization tasks.

79 616949 2248766 274 2.47% 2.69% 2.60% 0 Cat4k Mgmt HiPri

80 15004998 2025681 7407 70.95% 67.15% 67.12% 0 Cat4k Mgmt LoPri

#show processes cpu sorted | ex 0.00

Please share the output of the above.

MHM
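
For anyone collecting this data on a Catalyst 4500-series switch, a rough starting set of commands is sketched below. `show platform health` and `show platform cpu packet statistics` are the platform commands normally used to look inside the Cat4k Mgmt HiPri and LoPri processes. None of this output was posted in the thread, so what it would have shown here is unknown.

```
! Busiest IOS processes, hiding idle entries
show processes cpu sorted | exclude 0.00
! Break the Cat4k Mgmt HiPri/LoPri totals down into platform-specific jobs
show platform health
! Show which CPU queues are receiving packets punted from hardware
show platform cpu packet statistics
```

Taking the captures twice, once with the problem present and once without, makes it easier to spot which job or queue changes.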

Hello
You state you copied the switch config from MDF1 to MDF1a for the switch migration, and by the sounds of it, it's an L3 switch running SVIs for the associated Internet (Verizon/broadband) connections?

So for the relocation to succeed, you would need to shut down the associated L3 SVIs on MDF1 before relocating them to MDF1a and enabling them there, along with any routing process or static routes?

If the above is incorrect, can you elaborate a little more on where the routing is being performed and, if applicable, share the running config of MDF1 (include any CDP/LLDP, MAC, ARP and route tables, and interface status)?


Please rate and mark as an accepted solution if you have found any of the information provided useful.
This then could assist others on these forums to find a valuable answer and broadens the community’s global network.

Kind Regards
Paul

Sorry, I can't provide any more details. After realizing the issue was related to the "ip dev tracking max 10" command, I removed it from all interfaces in the production environment. I suspect it was the command on the uplink interface that caused the issue and not the quantity of them, since I had other switches in the network that had it on all interfaces and the issue only occurred when I moved the uplink to the new 4948E. I can't test in the production environment anymore, as the issue brings the whole network down (well, Internet access that is). So for now I will accept that I should use the "ip dev tracking" command on an as-needed basis and not apply it to all interfaces anymore.

I have an FTTI uplink in my office as well, and if I can get some time I will try out my theory that it only happens if "IP DEV TRACKING" is enabled on the uplink. I can tell you it did NOT happen when "ip dev tracking" was enabled on the uplink of a 4948, only on the 4948E. I'm not sure if it's the hardware or the software; I will have to try 15.0 on a 4948E, as right now all my 4948Es are running 15.2. (Note: I had to upgrade from the 4948 to the 4948E because of a bug, resolved in another post, about vlan 1 not operating correctly with a QinQ config.) I moved to 4948E 15.2(4)E10 and, after removing "ip dev tracking" from all interfaces, have not had any other issues with the config. I now have (2) full QinQ configs installed and plan to implement them next week.