3945 Views · 0 Helpful · 8 Replies

Nexus 5000 kernel panic cdpd process

udid
Level 1

Hello,

A few months ago I had a kernel panic on a Nexus 5596 switch running 5.1(3)N1(1).

The stack trace pointed to the cdpd process.

I found that I had hit a known bug (CSCtx91413) that had been in "open" status for a long time.

This bug is now superseded by CSCtz13307, which is in "fixed" status.

However, the fixed-in versions are 6.1(0.280)S0, 6.0(4)S1, and 5.2(4.83)S0, which are N7K versions.

This bug also affects N5K devices, as described in the bug description.

Does anyone know in which N5K version this bug is fixed?

Thanks.

8 Replies

krun_shah
Level 1

A quick workaround would be to turn off CDP and use LLDP, if you can.
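If it helps, a minimal sketch of what that looks like on NX-OS (syntax from memory, so please verify against the command reference for your release):

conf t
  no cdp enable     ! turn CDP off globally (the same command also works per interface)
  feature lldp      ! LLDP transmit/receive is then on by default on physical ports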

vdsudame
Cisco Employee

Hi Ehud

CSCtz13307 will be fixed in the Goldcoast release, which is 5.2(1)N1(1). It's not released yet; the target is Q3 CY12.

Thanks, Vin.

Hi,

I'm running an N5K-C5548UP with the L3 module N55-D160L3-V2 on version 6.0(2)N1(2a).

After about a month of running, one of the two N5Ks crashed with this kernel error:

%KERN-0-SYSTEM_MSG: [1196189.202497] BUG: soft lockup - CPU#1 stuck for 11s! [bcm_usd:3276] - kernel.

As far as I can check, this bug is not mentioned in the release notes, neither as open nor as fixed.

So is this a known issue in 6.0(2)N1(2a) [the latest version at the moment], and is the fix on the roadmap?

Is there a workaround for it?

Thanks.

Hi,

We just had the exact same problem!!

I'm running the same NX-OS release (6.0(2)N1(2a)), but on a 5596UP switch (also with the L3 card).

The primary vPC peer went down and kept rebooting itself; after about 25 minutes the secondary went down as well.

On both switches I could see this log message:

1102          2013/08/28 17:21:31.863 IDT          10.45.127.53          : 2013 Aug 28 17:21:32.203 GMT: 28 17:21:31 %KERN-0-SYSTEM_MSG: [4834387.272491] BUG: soft lockup - CPU#1 stuck for 11s! [usd_mts_kthread:3366] - kernel

The answer from the Cisco TAC:

From: Mian Min (mianmin) <mianmin@cisco.com>

"I have finished my research and here is my findings:

The kernel panic crash is due to a software defect:

CSCug26811 - Kernel Panic, process hap reset caused by excessive traffic on mgmt port

The crash is triggered by a massive flood in the management (MGMT) network.

Conditions:

This problem happens only when the OOB management network (interface mgmt0) is flooded with FTP/TFTP traffic.

The problem was observed when the rate of TCP/UDP sweep traffic was above 200 Mb/s.

Workaround:

The mgmt interfaces can be configured with a RACL that allows access from the NMS only.

Besides the workaround above, there is no other solution so far.

On a Nexus switch that is connected to a FEX via the mgmt port, we can disable 'cfs ipv4 distribute' and reload the switch so that multicast is disabled on the port completely. Please see CSCuf38974 for that.

Our development team is currently working on a fix for this defect."

Here is the link for this bug: http://cdets.cisco.com/apps/dumpcr?&content=summary&format=html&identifier=CSCug26811 (but this is an internal bug, so we can't view it).
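For anyone hitting this now, a rough sketch of the mgmt0 RACL workaround the TAC describes, assuming a single NMS at 10.45.127.10 (a made-up address) and that your release supports ip access-group on mgmt0 - please double-check both against your setup:

conf t
  ip access-list MGMT-NMS-ONLY
    permit ip 10.45.127.10/32 any   ! hypothetical NMS station - replace with your own
    deny ip any any                 ! drop everything else, including flood traffic
  interface mgmt0
    ip access-group MGMT-NMS-ONLY in

The second workaround from the quote, if I read it correctly, is simply 'no cfs ipv4 distribute' in config mode followed by a reload, per CSCuf38974.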

In our case a broadcast storm caused the high utilization on the OOB interfaces, so my suggestion for now is to keep the storm-control thresholds throughout the OOB network as low as possible.
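If it helps anyone, storm control on the ports facing the OOB network would look roughly like this (the port and the 1.00 percent levels are only examples - tune them to your traffic profile):

conf t
  interface ethernet 1/10                  ! example port towards the OOB/management network
    storm-control broadcast level 1.00     ! drop broadcast above 1% of port bandwidth
    storm-control multicast level 1.00     ! same for multicast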

Thanks for sharing this information!!

Were you able to apply the fix and see if it works? I'm running into the same issue on our gear as well.

Hi Eithan,

This bug will be fixed in the upcoming 6.0(2)N2(2) release.

I suggest isolating the management network in a separate VLAN and applying an ACL on the VLAN interface that denies all traffic except SSH and Telnet from your PC.
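As a rough sketch only (VLAN 100, the admin PC at 10.45.127.100, and the ACL name are placeholders, so adjust them and verify the syntax for your release):

conf t
  feature interface-vlan
  ip access-list MGMT-VLAN-IN
    permit tcp 10.45.127.100/32 any eq 22   ! SSH from the admin PC
    permit tcp 10.45.127.100/32 any eq 23   ! Telnet from the admin PC
    deny ip any any                         ! everything else is dropped
  interface vlan 100
    ip access-group MGMT-VLAN-IN in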

