2943 Views · 1 Helpful · 30 Replies

Need help in increasing CPU utilization for Cisco Switch

mahazz12
Level 1

Hello,

I have a Cisco switch (Cisco Catalyst C9300) and want to increase the CPU utilization of the switch. I am passing lots of different types of heavy traffic through the switch with TRex, but the CPU utilization is stuck at 1%. No matter how much I try, the CPU utilization does not increase.

I also tried disabling STP, which increased CPU utilization to about 40%, but every time it resulted in losing the SNMP connection between my monitoring tool and the switch, so I am unable to further monitor the switch (for throughput etc.).
 
I am new to networking and want to conduct experiments, so I want to drive the CPU utilization to 0%, 50%, and 100% (that is, vary CPU utilization up to the maximum) in order to carry out the tests. Can someone please help me out with how to increase the CPU utilization?
 
TIA.
 
30 Replies


@mahazz12 wrote:
Exactly, the same scenario is happening with me as well. Is there any solution that can increase the CPU utilisation and not choke the CPU and also maintain the SNMP connection so I can monitor the switch through Zabbix?

No way. 

This is IOS-XE we are talking about.  And with IOS-XE, once the CPU or memory utilization sends the stack into "runaway", all bets are off!

Also, I am varying the utilisation level through TRex as well. I start passing the traffic at 100% and then decrease it steadily to 80%, 60%, 40%, 20% and finally 0%, but the CPU utilisation is stable at 1% and there's no impact on it. Is it normal? Will the CPU utilisation not vary with the traffic load?

Normal?

More-or-less, yes it is.

Can CPU load vary with traffic?

Yes, it MAY, but it doesn't have to.  Further, if it does vary, it might not track with the traffic level; it might even be the inverse of what I suspect you would expect.
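The load-stepping procedure described earlier in the thread (offer traffic at 100%, then step down in 20% decrements to 0%, sampling CPU at each level) can be sketched as a small harness. This is a hypothetical outline, not real TRex code: the `send_at_percent` and `sample_cpu` callbacks are placeholders for a TRex client call and a CPU poll.

```python
# Sketch of a stepped-load trial: apply each load level, dwell, record CPU.
# send_at_percent and sample_cpu are hypothetical callbacks; with TRex you
# would drive its Python client API and poll the switch, respectively.
import time

def load_steps(start=100, step=20):
    """Return the descending load levels, e.g. 100, 80, 60, 40, 20, 0."""
    return list(range(start, -1, -step))

def run_trial(send_at_percent, sample_cpu, dwell_s=0.0):
    """Apply each load level, wait dwell_s, and record sampled CPU per level."""
    results = {}
    for pct in load_steps():
        send_at_percent(pct)   # placeholder for the traffic-generator call
        time.sleep(dwell_s)    # let counters settle before sampling
        results[pct] = sample_cpu()
    return results
```

As the replies note, on a hardware-forwarded switch the recorded CPU figures may stay flat across all of these levels.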

So is there any other way (except fluctuating traffic level) to increase the CPU utilisation?

I tried the debug all command which increased the CPU utilisation to around 10%-12%. Can I push the utilisation further through any other method?


@mahazz12 wrote:

So is there any other way (except fluctuating traffic level) to increase the CPU utilisation?

I tried the debug all command which increased the CPU utilisation to around 10%-12%. Can I push the utilisation further through any other method?


Yes, there are.  Many are described in @Ramblin Tech's first reply.

But you're missing two important points.

First, if you're trying to achieve precise CPU step changes, that's probably very difficult to achieve, except perhaps via traffic volume.

Second, as you "stress" the CPU, "bad things" happen, like you noticing SNMP stopped working.  Problem is, unless you know EXACTLY all the resource allowances for IOS components, you don't know exactly what will happen.

For example, you found one way to break SNMP.  Is that the only way?  For the way you found, what, exactly, is the level that causes the break?

You've described that your goal is internal temperature analysis.  As I've already noted, on a switch, the CPU probably isn't a big heat source.  Further, if the fans move enough air, even as the switch generates more heat, you probably won't see much of a temperature variance.

Have you ever operated a car with an engine temperature gauge?  If so, other than when the engine just starts, or truly overheats, have you noticed that the "normal" operation temperature might be about the same, regardless of the external temperature, or how "busy" the engine is.  This is by design.  A somewhat similar situation applies to the switch.

Not exactly the same situation, but have you ever boiled water in a pot?  Does the water's temperature rise (beyond 100C) if you increase the heat being applied?

You may think: if the pot's water starts at room temperature, as you add heat, it gets hotter (correct).  Further, if I add lots of heat, the water's temperature rises even faster (correct).  But the goal of air cooling in a computer device is much like the car's cooling: it's to keep the temperature from rising above an unacceptable level, regardless of how actively the hardware is being used.

Now to clarify a bit further, as you add heat, measuring that increase is possible, but not always the same way and/or your instrumentation might not be precise enough to indicate the difference.

The temperature sensors in a network device are more like a car's "idiot" lights, i.e. they don't tell you how hot, just too hot.

Thank you for such a detailed explanation. I noticed that when the CPU utilisation reached somewhere between 40% and 60%, the SNMP connection broke. I monitored the CPU utilisation through the show cpu history command.

So, from your explanation, what I conclude is that we can't increase the CPU utilisation precisely i.e. 50% then 100%, is that correct?
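Since SNMP polling breaks under load, one workaround is to scrape the CPU figures from the CLI (e.g. over SSH) instead. The sketch below parses the usual IOS/IOS-XE summary line from "show processes cpu"; treat the exact output format as an assumption and adjust the regex for your platform and software version.

```python
# Parse the summary line of "show processes cpu" output, e.g. when polling
# the switch over SSH rather than SNMP. The sample line below follows the
# common IOS/IOS-XE format (total%/interrupt%); verify against your device.
import re

CPU_RE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)

def parse_cpu_summary(text):
    """Return (five_sec_total, five_sec_interrupt, one_min, five_min) or None."""
    m = CPU_RE.search(text)
    return tuple(int(g) for g in m.groups()) if m else None

sample = "CPU utilization for five seconds: 41%/12%; one minute: 38%; five minutes: 22%"
```

This sidesteps the broken SNMP path entirely, at the cost of slower, text-based polling.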

"So, from your explanation, what I conclude is that we can't increase the CPU utilisation precisely i.e. 50% then 100%, is that correct?"

No, I'm not saying that.

I'm saying it's almost impossible to accomplish that, without, perhaps, using traffic volume management, because we, well I, don't KNOW how to, or even if it's possible, as a user of the switch.

Further, if you do use the traffic volume approach, by design, (ideally) switch traffic is forwarded by ASICs, not the CPU (there's exceptions to that too, but do we KNOW exactly what they are?).

I'm also saying, for lack of KNOWLEDGE, we, again, at least I, don't KNOW all the interrelationships of the switch's components, so we (I) don't KNOW what to expect (like when something is going to "break") such as in your SNMP case.

Lastly, I'm saying, if you can overcome all the foregoing (which might be possible), you might not obtain any truly worthwhile information for your stated goal of temperature analysis.

Ponder, on that last point, how often do you see a "professional" network device review analyze temperature vs. CPU utilization?

Back to my car analogy - you might see some report, that Brand X car, at 30 MPH takes 400' to stop, but at 60 MPH takes 1200' to stop, but do you see a Brand X car report describing, at 30 MPH engine shows a coolant temperature of 200 degrees but at 60 MPH engine shows a coolant temperature of 201 degrees?  (Another point I've been trying to make, might a car engine run hotter, even much hotter, at 60 MPH vs. 30 MPH?  Yup, very likely it does.  But the whole point of the car's cooling system is to carry away the excess heat.  If you do see a major rise in coolant temperature, you've got a coolant system problem.  Likewise for your switch, its cooling system is designed to avoid the interior of the switch baking in its own heat.  I.e., by its design, you might not see any huge temperature increase between CPU running at 1% vs. 100%!)

To recap, trying to PRECISELY control your switch's CPU utilization is likely very difficult to nearly impossible, and from your described goals, even if you could manipulate your switch's CPU as desired, with just on-board temperature sensors, I doubt (but I don't KNOW) you would get any useful information.

To put it another way, IMO, for your goals and equipment, you're wasting your time.  However, I could be wrong, and even if I'm not, it's YOUR time, not mine.

If you do succeed, please post what you did and how you did it; it may be interesting to read.

I understand your point. Temperature analysis is one of the aspects. We are also interested in measuring the power consumption of the switch, to see whether CPU usage can impact the amount of power the C9300 switch is using.

Yes, a busier CPU will draw more power (also generate more heat).  But, you're not seeing the forest because of your focus on one particular (small) tree, the CPU.

In a switch, the CPU is a small factor in a switch's heat/power considerations.  Again with my auto analogy, your goal is somewhat like asking, how does turning up my auto's radio's volume impact miles per gallon?

I believe @Ramblin Tech's reply is also trying to convey the same.

BTW, in Jim's same reply, he details how injecting specific kinds of traffic might allow precise CPU utilizations.  I agree, and have in a couple of prior replies noted that's likely the most plausible way, as a user of the switch, to accomplish that.

However, again, as we (or I) don't KNOW the internal "architecture", you might bump into "unintended consequences" (unlikely anything permanently damaging).

Remember, Cisco network devices usually don't perform their role well, if at all, when their CPU doesn't have a cushion of CPU cycles.

If your concern is with power/heat, you would probably be better off contacting Cisco for additional details.

However, if you want to work on DoS procedures, you're on the right track, although that's why these devices have features like CoPP.

"Is there any solution that can increase the CPU utilisation and not choke the CPU and also maintain the SNMP connection so I can monitor the switch through Zabbix?"

Use your traffic generator to slowly increase the number of packets punted up to the CPU for processing until you hit your target CPU level; ARP requests might be a simple way of doing this with proxy-arp enabled on the Cat9300. Ideally, your traffic generator would broadcast out ARP requests with incrementing IP addresses to be resolved (to defeat caching mechanisms), with a configurable interval between each ARP request, which you would decrease between your trials.

As overloading the CPU through punted traffic is a long-recognized DDoS attack vector against router control-planes (this is what you are seeing with the choked CPU and unresponsive SNMP), modern routers have protection schemes to mitigate this vulnerability. These protection schemes go by various names: control plane policing, CoPP, LPTS, etc., but the basic idea is that the number of punted packets per second is limited in the hardware (NPU) before the punted packets hit the CPU. To impact the CPU with punted traffic, you must disable any control-plane policing function that might be in place.

[My assumption here is that the Cat9300 responds to ARP requests in software (CPU) and not hardware (NPU), but ARP is far from the only packet type that gets punted, so you could use others as well (e.g., STP BPDUs).]
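The ARP idea above (broadcast requests with incrementing target IPs to defeat caching) can be sketched with stdlib frame construction. This only builds the frames; actually transmitting them needs a raw socket or your traffic generator, CoPP must be relaxed for the punts to reach the CPU, and all addresses here are illustrative.

```python
# Build broadcast ARP who-has frames whose target IP increments per frame,
# so each request forces a fresh resolution (defeating ARP caching).
# Frame construction only; sending them is left to a raw socket or TRex.
import ipaddress
import struct

def arp_request(src_mac: bytes, src_ip: str, target_ip: str) -> bytes:
    """Return an Ethernet II frame carrying an ARP request for target_ip."""
    # Ethernet header: broadcast dst, src MAC, EtherType 0x0806 (ARP)
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                 # HTYPE: Ethernet
        0x0800,            # PTYPE: IPv4
        6, 4,              # HLEN, PLEN
        1,                 # OPER: request (who-has)
        src_mac, ipaddress.IPv4Address(src_ip).packed,
        b"\x00" * 6, ipaddress.IPv4Address(target_ip).packed,
    )
    return eth + arp

def arp_sweep(src_mac: bytes, src_ip: str, first_target: str, count: int):
    """Yield ARP requests for count consecutive target addresses."""
    base = ipaddress.IPv4Address(first_target)
    for i in range(count):
        yield arp_request(src_mac, src_ip, str(base + i))
```

Decreasing the inter-frame interval between trials, as suggested above, then gives you a crude knob on how many punts per second hit the CPU.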

Going back to a couple of your premises early in this thread... (1) CPU utilization increases with traffic load and (2) temperature increases with CPU utilization; I do not believe you can rely on either of these being true in the typical case. In the typical case for a Catalyst switch, traffic is forwarded by the NPU and not the CPU, so CPU utilization does not increase with traffic load. Also, typically, most heat is not going to be generated by the CPU in a Cat switch, but by the NPU and the interface circuitry (including optics modules). While power consumption, and therefore heat dissipation, will rise with traffic levels (you can see that in the Cat 9300 data sheet), the CPU's contribution will be small by comparison to the other components.

Disclaimer: I am long in CSCO

Jim, NPU = networking processing unit?  New(er) term rather than the more generic ASIC?

Re: "punted".  I've understood this to mean "Plan B", not anything being done on the CPU, although, that's likely where NPU "Plan B" ends up.

Real story somewhat tied to what we're discussing,

A couple of decades back, I was working at a company which had many remote offices, each with a pair of 2811 ISRs and one or more Catalyst 3750s.

The (logical) 3750 (minimally 2 unit stack) was L3, running OSPF.

I was looking into whether we might run RIP at such sites, to avoid additional feature licensing and maintenance costs on the 3750s.

Lab testing a 3750 with a 2811, RIP worked fine, including also running OSPF on the 2811 with RIP<>OSPF redistribution.

Wanting to ensure faster RIP convergence, I began to reduce RIP timers.  As I approached minimum values, the 3750 CPU went to 100%!  The 2811 showed maybe a 1 or 2% increase.

A multi-gig L3 switch struggles while a 10 Mbps software based router barely notices.  Why?

I suspect the 2811 had a much faster/more capable CPU, because the 3750 has dedicated hardware for its expected work needs.

Again, to @mahazz12: the CPU is unlikely to matter much relative to the switch as a whole.  For a guess, going from 1% to 100% CPU might have a delta of less than 50 watts.

Ramblin Tech
Spotlight

Hi Joe,

Yep, right, NPU = Network Processing Unit, typically implemented as a purpose-built ASIC, though implementing with an FPGA is not inconceivable. In Cisco's Catalyst line, I believe all the higher-end stuff is now on Cisco's QFP (Quantum Flow Processor; marketing name) NPU. I have heard that many Nexus products are based on merchant silicon NPUs such as Broadcom's XGS and DNX families. Over on the XR side, future products seem to be shifting away from DNX and the Cisco custom silicon in the ASR9000 to Cisco's SiliconOne (another marketing name) NPU, but I would not be surprised to see all Cisco product lines move toward SiOne over time, including Catalyst and Nexus devices. The head of Cisco's PI hardware dev organization co-founded Leaba Semiconductor, which Cisco acquired a few years ago and from which SiOne was developed (he also founded Dune Networks, which developed the DNX line and was acquired by Broadcom).

Anyway,... CPUs in lower end switches with NPUs have traditionally been somewhat anemic to save power and cut costs, as they did not have to forward in s/w nor converge large routing tables.  By comparison, the ISR line 1800/2800/3800 (and follow-on products) forwarded in s/w (no NPU) and consequently had much beefier CPU and RAM resources.

Disclaimer: I am long in CSCO

"Anyway,..."

Exactly what I figured out when I initially stumbled across the CPU performance difference I described between the 3750 and 2811.  The difference in CPU performance needs, in hindsight, made sense; somewhat surprising, though, was that the difference in data plane capacity was almost the inverse of their control plane capacity.

The 3750s being used were the G variant, capable of 16 Gbps wire-speed throughput, while the 2811 is rated at between 1.5 and about 61 Mbps capacity, the latter if your packets are going downhill with a brisk tail wind.  (We found it could usually handle 10 Mbps Ethernet, duplex, w/o issues.)

BTW, among the intermediate sized business routers, the 7304 with NSE (vs. the 7200 with NPE) had the QFP precursor (?), the PXF.  I had a 7304-NSE-150 (800 Kpps) running side by side with a 7204VXR-NPE-G1 (1 Mpps); with about the same traffic load and traffic kind, the former's CPU averaged about half the latter's.  Although the former is rated at only 80% of the latter's non-PXF packet capacity, it had about 3x the bandwidth forwarding capacity.

An interesting aside on the 7200 was that, for a very long time, it had the most features of any IOS router, since it had more RAM available than any lower-end router and forwarded in s/w, thus not constrained by any feature limitations of a h/w NPU. If IOS could do it, you would most likely find it supported on the 7200, albeit at the speed of a CPU rather than an NPU.

On the subject of packet punts... generically, it just means kicking packets out of the normal forwarding path (whether implemented in s/w or h/w) and into more CPU-intensive processing. Typically, this means control- and management-plane packets, but can mean data-plane packets if something about the header says it cannot be handled by normal forwarding and the specialized handling has been implemented in s/w. Also note that some NPUs implement an "OAM Engine" that can handle some types of time-sensitive or high-scale traffic on-chip, rather than punting to a CPU (eg: BFD, CFM/Y.1731, TWAMP, PTP).
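The punt concept described above can be illustrated with a toy classifier: ordinary transit traffic stays on the fast path (NPU), while control/management-plane packet types, or anything normal forwarding cannot handle, are kicked up to the CPU. The packet-type names and the punt set here are illustrative, not any actual NPU's punt table.

```python
# Toy illustration of a punt decision: control/management-plane packets
# (and anything the forwarding hardware cannot handle) leave the normal
# forwarding path for CPU-intensive processing. Type names are made up.
PUNT_TYPES = {"ARP", "STP_BPDU", "OSPF_HELLO", "SSH", "SNMP", "ICMP_TTL_EXCEEDED"}

def forwarding_decision(pkt_type: str, hw_supported: bool = True) -> str:
    """Return 'fast-path' for hardware-forwardable traffic, else 'punt'."""
    if pkt_type in PUNT_TYPES or not hw_supported:
        return "punt"       # kicked out of the normal forwarding path to the CPU
    return "fast-path"      # stays on the NPU, never touches the CPU
```

This also shows why flooding a punted type (like ARP) loads the CPU while ordinary transit traffic does not, and why CoPP/LPTS rate-limits the punt path.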

A few references to punts from the XR world, both features were implemented for an SP customer I worked with and who experienced outages to their mobility network due to floods of punt traffic:

https://www.cisco.com/c/en/us/td/docs/routers/asr9000/software/asr9k_r5-3/bng/configuration/guide/b-bng-cg53xasr9k/b-bng-cg53xasr9k_chapter_0111.html

https://www.cisco.com/c/en/us/td/docs/iosxr/ncs5500/ip-addresses/66x/b-ip-addresses-cg-ncs5500-66x/m-implementing-lpts-ncs5500.html#Cisco_Concept.dita_365eb914-2cee-4691-98ec-9c3a73fd6c4c

 

Disclaimer: I am long in CSCO

"On the subject of packet punts... generically, it just means kicking packets out of the normal forwarding path (whether implemented in s/w or h/w) and into more CPU-intensive processing."

Yup, I'll go along with that; it conforms to how I understood it.

The reason I brought it up at all: things like ARPs, BPDUs, etc. are most likely processed the same way, and likely on the CPU (well, on non-monster systems), regardless of volume; but if the volume is way, way above normal, it will run the CPU out of cycles.

Just in my 3750 example, when I decreased the RIP timers, same processing, just much, much more than "normal".

So, in such cases, I (personally) wouldn't have used the term "punt", although the impact to the CPU is much like "real" (to me) "punted" traffic.  ("Classic" example: on a small router, a data packet that must be process switched vs. fast-path switched.)
