395 Views · 4 Helpful · 7 Replies

C9300 IOS-XE 17.12.XX packet loss and CLI lags

mar0n
Level 1

Dear all

We are currently running seven C9300-48P switches, all on IOS-XE 17.12.XX.

We are experiencing a strange issue: on 5 of the devices we see high latency when pinging the switch virtual interface (SVI) used for CLI access, but only on roughly every 5th to 7th ping.
This is also noticeable when connected to the CLI: when the latency goes up, there is a small lag when entering commands, for example.
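To illustrate, an extended ping from a neighbouring device shows the pattern (the address below is just a placeholder for our management SVI):

```
Switch#ping 192.0.2.10 repeat 20
```

Roughly every 5th to 7th reply comes back with a clearly higher round-trip time than the rest.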

All 7 devices are set up the same and have the same function (access switches).

The devices were previously running 17.12.04, and I have since upgraded them to 17.12.06, but the behaviour is still the same.

When checking "sh proc cpu sorted | ex 0.00" I can see that the TPS IPC Process is running at around 60-70%, and this coincides with the lag/latency spikes:

Switch#sh proc cpu sorted | ex 0.00
CPU utilization for five seconds: 63%/0%; one minute: 38%; five minutes: 37%
PID  Runtime(ms)  Invoked  uSecs   5Sec    1Min    5Min   TTY  Process
363  419605       824      509229  62.00%  36.05%  34.72%   0  TPS IPC Process
490  197          1082     182      0.23%   0.06%   0.01%   1  SSH Process
192  2142         1697     1262     0.15%   0.16%   0.17%   0  CDP Protocol
168  2476         8257     299      0.15%   0.17%   0.17%   0  FED IPC process
480  2142         11130    192      0.07%   0.16%   0.16%   0  SISF Switcher Th
 78  2624         14832    176      0.07%   0.07%   0.07%   0  IOSD ipc task
 70  763          1750     436      0.07%   0.04%   0.05%   0  Net Background
155  1459         709      2057     0.07%   0.09%   0.09%   0  NGWC DOT1X Proce
397  956          570      1677     0.07%   0.02%   0.01%   0  Syslog Traps
117  679          21843    31       0.07%   0.04%   0.04%   0  IOSXE-RP Punt Se
212  710          8244     86       0.07%   0.07%   0.07%   0  UDLD
100  1675         3976     421      0.07%   0.11%   0.11%   0  Crimson flush tr
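
For context, whether the spikes recur or are one-off can also be seen in the CPU history graph (full output omitted here):

```
Switch#show processes cpu history
```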

In addition the log contains the following entries:

Switch#sh logg | s HOG
Jan 10 16:37:11.463: %SYS-3-CPUHOG: Task is running for (2597)msecs, more than (2000)msecs (3/3),process = TPS IPC Process.
Jan 10 16:37:19.753: %SYS-3-CPUHOG: Task is running for (2538)msecs, more than (2000)msecs (1/1),process = TPS IPC Process.
Jan 10 16:37:28.049: %SYS-3-CPUHOG: Task is running for (2491)msecs, more than (2000)msecs (1/1),process = TPS IPC Process.
Jan 10 16:37:36.316: %SYS-3-CPUHOG: Task is running for (2435)msecs, more than (2000)msecs (2/2),process = TPS IPC Process.
Jan 10 16:37:44.583: %SYS-3-CPUHOG: Task is running for (2365)msecs, more than (2000)msecs (3/3),process = TPS IPC Process.
Jan 10 16:37:52.879: %SYS-3-CPUHOG: Task is running for (2320)msecs, more than (2000)msecs (0/0),process = TPS IPC Process.
Jan 10 16:38:01.176: %SYS-3-CPUHOG: Task is running for (2209)msecs, more than (2000)msecs (2/2),process = TPS IPC Process.
Jan 10 16:38:09.438: %SYS-3-CPUHOG: Task is running for (2143)msecs, more than (2000)msecs (1/1),process = TPS IPC Process.
Jan 10 16:38:17.701: %SYS-3-CPUHOG: Task is running for (2097)msecs, more than (2000)msecs (1/1),process = TPS IPC Process.
Jan 10 16:38:25.944: %SYS-3-CPUHOG: Task is running for (2065)msecs, more than (2000)msecs (1/1),process = TPS IPC Process.
Jan 10 16:38:34.219: %SYS-3-CPUHOG: Task is running for (2033)msecs, more than (2000)msecs (1/1),process = TPS IPC Process.

My understanding is that the control plane is busy and therefore starts dropping packets, but I have not been able to confirm this, as I am not very well versed in CPU/control-plane troubleshooting. (Interestingly, the timestamps above show the CPUHOG condition recurring roughly every 8 seconds, which points to some periodic task.)
I have followed most of the troubleshooting guides I've found but didn't come to a proper finding/conclusion, hence me creating this post.

As mentioned, 2 of the 7 devices are not experiencing this issue even though they have the same config.
I've also tried to find differences in their configs, but I was not successful.

Has somebody encountered this already, or does anyone have other ideas?

Thank you and best!

1 Accepted Solution


mar0n
Level 1

Dear all

We found the issue when comparing configs again in detail.

The group our company belongs to uses DNAC (Cisco DNA Center) and had pushed 32 "telemetry ietf subscription" configurations to the 5 affected devices.
If we remove that config, the problem is gone completely.
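
In case somebody else hits this: the subscriptions can be listed and then removed one by one (the subscription ID below is only an example; the IDs pushed by DNAC will differ):

```
Switch#show telemetry ietf subscription all
Switch#configure terminal
Switch(config)#no telemetry ietf subscription 101
```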

Thank you all for your ideas and your time.

Best!


7 Replies

Mark Elsen
Hall of Fame

 

@mar0n Consider using the latest advisory software: https://software.cisco.com/download/home/286313983/type/282046477/release/IOSXE-17.15.4 and check if that helps.

 M.



-- Let everything happen to you
   Beauty and terror
   Just keep going
   No feeling is final
Rainer Maria Rilke (1899)

Cristian Matei
VIP Alumni

Hi,

    There might be a bug; however, since you've already upgraded to what I see is an MD-suggested release, it could be something related to the control plane. First, collect and post the complete output of the following commands:

show policy-map control-plane
show platform hardware fed active qos queue stats internal cpu policer
show platform software fed switch active punt cpuq all
show platform software fed switch active punt cause summary
show platform software fed switch active punt rates interfaces

  If there's control-plane overloading, we'll use this document as a guide to get to the root cause:

https://www.cisco.com/c/en/us/support/docs/switches/catalyst-9300-switch/221841-troubleshoot-control-plane-operations-on.html

Thanks,

Cristian. 

Thanks Cristian. Unfortunately I didn't reply to your post directly; see the outputs and answers below in the thread.

mar0n
Level 1

Hi both

Thank you for taking the time to respond.

I've already followed that document previously, but as I said, I am not well versed in this kind of troubleshooting.

I've captured the output of the commands stated by you and attached them below.

As far as I can tell, we are using the default CoPP policy-map.
Looking at the CPU policer, I can see that the switch has "Forus traffic" drops:

============================================================================================
                                      (default)  (set)   Queue        Queue
QId  PlcIdx  Queue Name     Enabled   Rate       Rate    Drop(Bytes)  Drop(Frames)
--------------------------------------------------------------------------------------------
2    14      Forus traffic  Yes       4000       4000    169370       285

However, these values are not increasing when I run the command a few times, and since the issue (lag/latency) is there permanently, I would expect the counters to increase...
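
For reference, I watched for increments by re-running a filtered view of the policer stats (the filter simply matches the queue name from the table above):

```
Switch#show platform hardware fed active qos queue stats internal cpu policer | include Forus
```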

Looking forward to hearing what you can interpret from these files.

Thank you very much again for your support!

Best
mar0n.

Joseph W. Doherty
Hall of Fame

The CLI lagging and ping latency spikes are probably perfectly "normal" with the corresponding CPU spikes.

What's probably not normal is why those CPU spikes are happening.

From a quick review, %SYS-3-CPUHOG messages during "normal" operations are generally considered abnormal. They may be indicative of a software defect, or of something unusual happening within your network.

If you have TAC support under contract, this might be worth opening a TAC case for.

Besides the information already requested by @Cristian Matei (which it appears you've provided), might you also be able to provide a "sanitized" config?

Hi Joseph

The devices are under contract, yes, but I am not able to open a TAC case directly and will have to go through our supplier.
Since I would need to explain the problem to their technician and then again to TAC, I thought I'd try my luck here first.
But sanitizing the config vs. explaining the problem twice for a TAC case is probably about the same amount of effort, so I'd rather go with the TAC option than risk leaking something by not sanitizing my config file properly. I hope you understand.

Thank you for replying to my post!

Best
