
Ansible spiking CPU on Cisco switches

johnlhawley
Level 1

I've been using Ansible to automate Cisco switch configurations, but I'm wondering whether I have it optimized correctly.  These are playbooks that modify port-level configs, and they take a long time to run and often max out the CPU on the switches, especially older ones like the 3750X.

 

Running Ansible 2.9.2 on an Ubuntu Bionic box, using the "ios_config" module.

 

Output as I chunk through a job looks like this:

changed: [esav1] => (item={u'iface': u'Gi4/0/17', u'vlan': 20})
changed: [esig2] => (item={u'iface': u'Gi4/0/1', u'vlan': 20})
changed: [esav1] => (item={u'iface': u'Gi4/0/18', u'vlan': 20})
changed: [esig2] => (item={u'iface': u'Gi4/0/2', u'vlan': 20})
changed: [esav1] => (item={u'iface': u'Gi4/0/19', u'vlan': 20})

 

I currently have "serial" set to 5, and it can take ~45 seconds for a particular switch's job connection/thread to get from one port to the next.  "show processes cpu sorted" on the switch shows that the "SSH" process is topping the CPU usage list.  I have Ansible set to "defaults: no", so at least it shouldn't be doing a "show run all" in the background.  I think previous versions of Ansible did not honor this flag, but I believe the version I'm on now has that fixed.

 

I can attach example copies of my playbook and variables if that would help.
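In the meantime, here is a minimal sketch of the pattern (group name, variable name, and values are placeholders for illustration, not my actual playbook or inventory):

---
# Illustrative sketch only; structure inferred from the task output above.
# "access_switches" and "port_map" are placeholder names; ansible_network_os=ios
# is assumed to be set in the inventory.
- name: Set access VLANs on user ports
  hosts: access_switches
  connection: network_cli
  gather_facts: no
  serial: 5                    # run against 5 switches at a time
  tasks:
    - name: Apply access VLAN per interface
      ios_config:
        defaults: no           # don't collect "show running-config all"
        parents:
          - "interface {{ item.iface }}"
        lines:
          - "switchport access vlan {{ item.vlan }}"
      loop: "{{ port_map }}"   # e.g. [{iface: 'Gi4/0/17', vlan: 20}, ...]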

 

Thanks, John

8 Replies

Leo Laohoo
Hall of Fame
What firmware is the switch running on?

The switch is running the newest Cisco-recommended release: 15.2(4)E8.

Good.
Post the complete output to the command "sh proc cpu sort | ex 0.00".

ESISETEST4#sh proc cpu sort | ex 0.00
CPU utilization for five seconds: 49%/1%; one minute: 68%; five minutes: 47%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
192 28235 6333 4458 12.15% 19.43% 7.23% 2 SSH Process
88 145281084 27797349 5226 4.00% 4.42% 4.27% 0 RedEarth Tx Mana
234 292891 704452 415 2.55% 1.09% 0.40% 0 IP Input
87 60558144 41227154 1468 1.59% 1.89% 1.81% 0 RedEarth I2C dri
370 333721 302667 1102 0.95% 0.37% 0.13% 0 TPLUS
3 1759 259 6791 0.63% 0.21% 0.23% 1 SSH Process
198 13699034 651802 21017 0.47% 0.43% 0.42% 0 HQM Stack Proces
171 1211260 16165569 74 0.31% 0.07% 0.03% 0 Hulc Storm Contr

ESISETEST4#sh proc cpu sort | ex 0.00
CPU utilization for five seconds: 100%/0%; one minute: 70%; five minutes: 48%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
192 31282 6793 4605 45.59% 20.59% 7.86% 2 SSH Process
372 234480 13875 16899 22.23% 13.81% 4.94% 0 hulc running con
88 145281497 27797423 5226 4.63% 4.47% 4.29% 0 RedEarth Tx Mana
87 60558327 41227261 1468 1.75% 1.87% 1.81% 0 RedEarth I2C dri
198 13699076 651804 21017 0.47% 0.42% 0.42% 0 HQM Stack Proces


ESISETEST4#sh proc cpu sort | ex 0.00
CPU utilization for five seconds: 65%/0%; one minute: 69%; five minutes: 48%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
372 236586 14000 16899 22.08% 13.46% 5.15% 0 hulc running con
182 685414737 81573820 8402 19.52% 19.79% 19.76% 0 Hulc LED Process
192 32025 7235 4426 9.12% 19.32% 8.01% 2 SSH Process
88 145281829 27797490 5226 4.48% 4.41% 4.28% 0 RedEarth Tx Mana
87 60558477 41227356 1468 1.75% 1.87% 1.81% 0 RedEarth I2C dri
234 293080 704968 415 1.12% 1.09% 0.45% 0 IP Input
370 333816 302800 1102 0.47% 0.43% 0.16% 0 TPLUS
198 13699093 651805 21017 0.31% 0.41% 0.41% 0 HQM Stack Proces
171 1211301 16165652 74 0.31% 0.13% 0.05% 0 Hulc Storm Contr
199 1319474 1303931 1011 0.15% 0.05% 0.01% 0 HRPC qos request

Red Earth and HULC (aside from SSH) are chewing the CPU.
Do you have the command "no logging event link-status" configured?

Yes, we have "no logging event link-status" on all the user-connected ports.  On the trunk uplink we leave logging enabled.
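For context, that setting is pushed with the same ios_config pattern as the VLAN changes; a simplified sketch with placeholder names, not the actual task:

    - name: Suppress link-status logging on user ports
      ios_config:
        parents:
          - "interface {{ item.iface }}"
        lines:
          - "no logging event link-status"
      loop: "{{ port_map }}"   # placeholder list of user-facing ports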

 

I'm not sure this is the right forum for my question.  I assume the issue lies with the way Ansible is doing its job.  I believe there is a lot of sanity checking that goes on as commands are applied.  Obviously, if I apply a config line to a port manually from the CLI, it takes < 1 second.  I need to figure out what all the overhead is that Ansible is injecting.
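One thing I can do to see exactly what ios_config sends for each item is to register the result and print the commands it reports back; a rough sketch along the lines of the loop above (placeholder names again):

    - name: Apply access VLAN per interface
      ios_config:
        defaults: no
        parents:
          - "interface {{ item.iface }}"
        lines:
          - "switchport access vlan {{ item.vlan }}"
      loop: "{{ port_map }}"
      register: vlan_result

    - name: Show what was pushed per port
      debug:
        msg: "{{ item.item.iface }} -> {{ item.commands | default([]) }}"
      loop: "{{ vlan_result.results }}"
      loop_control:
        label: "{{ item.item.iface }}"

Running the playbook with "ansible-playbook -vvv", or enabling the profile_tasks callback, should also show per-task timing and help separate connection setup from the actual config push.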

 

Thanks,  John


@johnlhawley wrote:

Yes, we have "no logging event link-status" on all the user-connected ports.


HULC and RedEarth Tx Manager load are attributed to STP engaging in quick succession (think stop-start).

Enable interface logging because you may have a wired client flapping and causing STP to shoot through the roof.

Perhaps, but I believe STP problems are pretty rare in our environment.  That's not the problem I'm trying to solve at the moment.
