Nexus N3K-C3064PQ-10GE QSFP port crash

ss1
Level 1

Dear community members,

We have quite a lot of Cisco Nexus N3K-C3064 switches in our datacenter (approx. 80 switches) and hundreds of QSFP-40G-SR4 modules used to build our internal topology. We are facing some strange issues with some of these switches and modules and thought it would be wise to discuss them here.

Some of these QSFP modules are not genuine Cisco. I hope that doesn't bother the community, but you know how it is: logistics, pricing, etc.

 

Let's now explain the issues.

Some of my QSFP-40G-SR4 modules make an N3K-C3064 port crash as soon as they are plugged in. To make it more interesting, they make the neighboring port crash. Say I plug a suspicious QSFP into port 49. Then port 50 just dies: the orange LED of port 50 goes dark, and the switch no longer recognizes any QSFP plugged into port 50 (even a genuine Cisco one). It is very strange, at least to me, that a QSFP in port 49 can cause port 50 to die, yet it happens without any message being logged. For any QSFP module plugged into port 50, the switch just tells me that a transceiver is not present.
Let's make it even more interesting. Those modules are recognized fine during switch boot-up. If they are already plugged in during the boot-up process, the switch recognizes them properly and links properly until the next attempt to unplug them and plug them in again. Also, it happens only on the N3K-C3064PQ-10GE model and never happened on the N3K-C3064PQ-10GX until today. The software version makes no difference; I tried 6.0.2, 7.0.3, etc.

 

Today I got a similar case with a N3K-C3064-X. It is the first issue I have ever had after working with about 30 C3064-X units.
Let's dig deeper:

# show inventory 
NAME: "Chassis",  DESCR: "Nexus3000 C3064PQ Chassis"             
PID: N3K-C3064PQ-10GX    ,  VID: V01 ,  SN:           

NAME: "Slot 1",  DESCR: "48x10GE + 4x40G Supervisor"            
PID: N3K-C3064PQ-10GX    ,  VID: V01 ,  SN:           

NAME: "Power Supply 1",  DESCR: "Nexus3000 C3064PQ Chassis Power Supply"
PID: N2200-PAC-400W-B    ,  VID: V01 ,  SN:           

NAME: "Power Supply 2",  DESCR: "Nexus3000 C3064PQ Chassis Power Supply"
PID: N2200-PAC-400W-B    ,  VID: V01 ,  SN:           

NAME: "Fan 1",  DESCR: "Nexus3000 C3064PQ Chassis Fan Module"  
PID: N3K-C3064-FAN-B     ,  VID: V00 ,  SN: N/A                  


# show version  
Software
 BIOS: version 4.5.0
 NXOS: version 7.0(3)I7(6)
 BIOS compile time:  11/09/2017
 NXOS image file is: bootflash:///nxos.7.0.3.I7.6.bin
 NXOS compile time:  3/5/2019 13:00:00 [03/06/2019 00:04:55]


Hardware
 cisco Nexus3000 C3064PQ Chassis  
 Intel(R) Celeron(R) CPU        P4505  @ 1.87GHz with 3902992 kB of memory.
 Processor Board ID FOC1638328Z

 Device name: xxxxxxxxxxx
 bootflash:    1635720 kB
 usb1:               0 kB (expansion flash)

Kernel uptime is 55 day(s), 4 hour(s), 1 minute(s), 1 second(s)

# show interface status | include "1/51|1/52"
Eth1/51        xcvrAbsen trunk     auto    auto    --          
Eth1/52        xcvrAbsen trunk     auto    auto    --         

Seems all right? No. There are 2 QSFPs in slots Eth1/51 and Eth1/52, and they are even genuine Cisco ones right now :(
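
For anyone who wants to run the same check across many switches, here is a minimal Python sketch that scans saved `show interface status` text for ports reporting a missing transceiver. It assumes the output looks like the lines above (interface name first, the truncated status string `xcvrAbsen` in one of the following columns); the function name `absent_ports` is made up for illustration, and how you collect the output from the switches is left open.

```python
def absent_ports(status_output: str) -> list:
    """Return interface names whose row contains an 'xcvrAbsen' status."""
    ports = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip blank lines, headers and anything that is not an Ethernet row.
        if len(fields) < 2 or not fields[0].startswith("Eth"):
            continue
        # Assumption: the status appears as its own whitespace-separated field.
        if any(f.startswith("xcvrAbsen") for f in fields[1:]):
            ports.append(fields[0])
    return ports


# Rows copied from the output above.
sample = """\
Eth1/51        xcvrAbsen trunk     auto    auto    --
Eth1/52        xcvrAbsen trunk     auto    auto    --
"""
print(absent_ports(sample))  # ['Eth1/51', 'Eth1/52']
```

Comparing this list against the ports that physically hold a module is a quick way to spot slots that have stopped detecting transceivers.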

I checked deeper in bcm shell:

# bcm-shell module 1 "port 51"
PORT: Status (* indicates PHY link up)
  xe50    Forced(40GFD) Stad(6c:20:56:e6:e9:5a) STP(Block) Lrn(ARL,FWD) UtPri(0) Pfm(FloodNone) IF(KR4) Max_frame(1518) MDIX(ForcedNormal, Normal) Medium(Fiber) Fault(Local) VLANFILTER(3)

# bcm-shell module 1 "port 52"
PORT: Status (* indicates PHY link up)
  xe51    Forced(40GFD) Stad(6c:20:56:e6:e9:5b) STP(Block) Lrn(ARL,FWD) UtPri(0) Pfm(FloodNone) IF(KR4) Max_frame(1518) MDIX(ForcedNormal, Normal) Medium(Fiber) Fault(Local) VLANFILTER(3)

# bcm-shell module 1 "ports"
            ena/    speed/ link auto    STP                  lrn  inter   max  loop
      port  link    duplex scan neg?   state   pause  discrd ops   face frame  back      
      xe50  down   40G  FD None  No     Block          None   FA    KR4  1518      
      xe51  down   40G  FD None  No     Block          None   FA    KR4  1518      

When I started digging into this, the ports were shown as !ena in the bcm-shell, even though all switchports are enabled and configured via the NX-OS CLI. I then unplugged the QSFP modules from slots 51 and 52 and ran bcm-shell module 1 "port 50 enable=true", and the same for port 51. The !ena was replaced with 'down', but still no transceivers are recognized in the slots.

 

For now I'm out of ideas on how to reset those ports and make the switch reinitialize them.
What's strange is that plugging or unplugging modules does not cause any message in the log.

 

I'm not sure what the exact hardware differences between the N3K-C3064PQ-10GE and the N3K-C3064PQ-10GX are, but I even requested the technical specifications and diagrams of these modules from my dealer in order to check for things like a high current draw on module startup. Maybe that is stupid thinking, maybe not, I don't know. So far the N3K-C3064PQ-10GE was the only model that produced such issues, and the GX version never showed anything weird... until today.

I would rather not reboot that switch, as it is carrying quite a lot of traffic right now. I would prefer any option to reset or reconfigure the particular ports from the bcm shell, but I'm not sure what to execute.

 

I would appreciate any ideas and responses very much.

Thank you.

7 Replies

ss1
Level 1

Hi

Apologies for bumping :(
Has anybody ever faced such issues?

Thank you.

Hi,

Nexus 3Ks are not as common as other platforms like the 5Ks and the 7Ks, but I have never seen this type of behavior. Have you tried opening a ticket with Cisco? There may be a known bug that has not been published.

HTH

Hi

Thanks for your reply. Appreciated very much.

Unfortunately I don't have a support contract with Cisco, so all the knowledge I can gather comes either from the official documentation (which I believe I have read thoroughly several times) or from this forum :(

My research into how to reset the port via the bcm shell continues, but I have not been able to find the proper syntax so far. My hope is that I will find it somehow and get the port re-initialized without reloading the whole device.

So far I'm not sure what interface=XLAUI and interface=KR4 mean, but I think I have to change that setting somehow to bring the port back to life. It is the only difference I see for now: the crashed port says interface=KR4, while a working one says XLAUI. A Google search turns up very little useful information about these.

Thank you.
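
The KR4 versus XLAUI comparison described above can be done for all ports at once from saved `bcm-shell module 1 "ports"` output. The following is a rough Python sketch, not a Cisco or Broadcom tool. It assumes rows look like the ones pasted earlier in this thread, that the max-frame value is the last purely numeric field, and that the interface type is the field right before it; `ports_by_interface` is a hypothetical helper name. (For background, in IEEE 802.3 terms XLAUI is the 40G chip-to-module attachment interface and 40GBASE-KR4 is a backplane PHY; whether that difference is the cause here is not established.)

```python
from collections import defaultdict


def ports_by_interface(ports_output: str) -> dict:
    """Group (port, link state) pairs by the interface-type column."""
    groups = defaultdict(list)
    for line in ports_output.splitlines():
        fields = line.split()
        # Data rows start with a port name such as xe50; header rows do not.
        if len(fields) < 4 or not fields[0].startswith("xe"):
            continue
        # Assumption: max frame is the last all-digit field and the
        # interface type is the field immediately before it.
        numeric = [i for i, f in enumerate(fields) if f.isdigit()]
        if not numeric or numeric[-1] < 3:
            continue
        interface = fields[numeric[-1] - 1]
        groups[interface].append((fields[0], fields[1]))
    return dict(groups)


# Header and rows copied from the bcm-shell output earlier in the thread.
sample = """\
            ena/    speed/ link auto    STP                  lrn  inter   max  loop
      port  link    duplex scan neg?   state   pause  discrd ops   face frame  back
      xe50  down   40G  FD None  No     Block          None   FA    KR4  1518
      xe51  down   40G  FD None  No     Block          None   FA    KR4  1518
"""
print(ports_by_interface(sample))
# {'KR4': [('xe50', 'down'), ('xe51', 'down')]}
```

Running it over the full table of a switch would list every port grouped by interface type, which makes it easy to see whether only the dead ports report KR4.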

Hello,

 

just a thought, I am not sure if this is applicable in your situation: a lot of 3rd party SFPs do not support DOM. You could try and disable that feature on one of the 'problem' switches. I think the syntax is:

 

N3K# configure terminal
N3K(config)# no system ethernet dom polling

 

Possibly a mix of SFPs with and without DOM support causes your problem...

rodrigoyoshioka
Level 1

Hi, 

 

 

Have you solved this issue? I have 4 N3Ks with the same behaviour.

 

 

thanks.

@rodrigoyoshioka 

No, sorry friend. No solution. If you have some of these QSFPs, the only option is to plug them in, reload the switch, and then never unplug them.

To avoid such problems, I recommend using original Cisco SFPs with Nexus switches.