07-07-2014 04:55 PM - edited 03-07-2019 07:58 PM
Hi Guys,
I have a head-scratcher that I'm dealing with. I have two Catalyst 6880-X switches (about two months old) configured as a VSS, with two VSL links on each supervisor card, and switch 1 is the active switch in the VSS. Both 6880-X chassis also have two additional 10G modules each. I am dealing with two issues in this setup.
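For reference, the VSL side of the configuration is essentially the standard layout, something like this (the domain and port-channel numbers here are placeholders, not my exact config; the member ports match the Te1/5/15-16 and Te2/5/15-16 interfaces you'll see in the logs below):
switch virtual domain 100
 switch mode virtual
!
interface Port-channel1
 switch virtual link 1
 no shutdown
!
interface range TenGigabitEthernet1/5/15 - 16
 channel-group 1 mode on
!
interface Port-channel2
 switch virtual link 2
 no shutdown
!
interface range TenGigabitEthernet2/5/15 - 16
 channel-group 2 mode on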
The first issue: I have a 3750X with a 10G module that is port-channeled to the VSS, with one 10G link going to each chassis. Within the first three days it started dropping the link that is plugged into switch 1, and the link then comes back up right away. It does this randomly throughout the day. I have changed the optics on both ends, changed the cable, and changed the module on the 3750X, and I still get the same issue. The strange part is that when switch 2 becomes the active switch in the VSS, the problem goes away; I noticed this before changing out the optics, cable, and module on the 3750X.
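The checks I keep running against the flapping member are just the usual ones, roughly as follows (Te2/2/5 is the VSS-side member you'll see in the logs below; exact output varies by release):
show etherchannel summary
show interfaces TenGigabitEthernet2/2/5 counters errors
show interfaces TenGigabitEthernet2/2/5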
The second issue is the one I've been trying to figure out along with Cisco TAC. It started about a month after bringing the VSS online. When switch 1 was the active switch in the VSS, every couple of days the VSL links would drop one at a time, eventually killing the VSS and putting the standby unit into recovery mode because of dual-active detection. Once I rebooted the standby switch, the VSS came back up normally. It did this a couple of times until I decided to force switch 2 to be the active switch in the VSS. When switch 2 became active, the VSS was stable for about a month, then the VSL links died again and the system failed over to switch 1. After looking at the logs with Cisco TAC, we can see that the VSL links stop responding, which causes the failover, but so far we still can't determine what is causing the VSL links to fail. Cisco TAC suggested that the VSL links might be getting overloaded, but we have been monitoring the bandwidth utilization on the VSL links and it never goes beyond 1%. TAC's last suggestion was to add another VSL link through a module other than the sup. This was done a couple of days ago, so now I'm waiting to see whether the VSL links fail again, and since switch 1 is the active switch in the VSS, I'm having to deal with the first issue above with the 3750X.
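Adding the extra (non-sup) VSL member was basically just bundling one more 10G port into the existing VSL port-channel, along these lines (the interface and channel-group numbers here are illustrative, not my exact ones), and I'm watching utilization on the bundle:
configure terminal
 interface TenGigabitEthernet1/1/16
  channel-group 1 mode on
  no shutdown
 end
show switch virtual link port-channel
show interfaces Port-channel1 | include rate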
I've included some log entries from the dropped port-channel member and also logs for when the vsl links fail.
Logs for the 3750X port-channel member drops:
*Jul 5 05:27:10 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 5 05:27:10 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 5 05:27:10 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 5 05:28:10 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 6 11:28:52 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 6 11:28:52 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 6 11:28:53 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 6 11:29:53 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:27:25 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:27:25 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:27:26 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:27:33 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:33:43 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:33:43 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:33:43 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:34:43 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 07:38:35 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 07:38:35 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 07:38:36 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
Logs for the VSL link failures:
*Jun 27 23:59:25 PDT: %VSLP-SW2-3-VSLP_LMP_FAIL_REASON: Te2/5/15: Link down
*Jun 27 23:59:25 PDT: %VSL-SW2-5-VSL_CNTRL_LINK: New VSL Control Link 2/5/16
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/5/15, changed state to down
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/15, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet2/5/15, changed state to down
*Jun 27 23:59:25 PDT: %VSLP-SW2-3-VSLP_LMP_FAIL_REASON: Te2/5/16: Link down
*Jun 27 23:59:25 PDT: %VSLP-SW2-2-VSL_DOWN: Last VSL interface Te2/5/16 went down
*Jun 27 23:59:25 PDT: %VSLP-SW2-2-VSL_DOWN: All VSL links went down while switch is in ACTIVE role
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/5/16, changed state to down
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel2, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel2, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet2/5/16, changed state to down
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel1, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel1, changed state to down
*Jun 27 23:59:25 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 5 - Module Type LINE_CARD removed
*Jun 27 23:59:25 PDT: %OSPF-SW2-5-ADJCHG: Process 1, Nbr 10.253.0.3 on TenGigabitEthernet1/1/1 from FULL to DOWN, Neighbor Down: Interface down or detached
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel24, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel24, changed state to down
*Jun 27 23:59:26 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 1 - Module Type LINE_CARD removed
*Jun 27 23:59:26 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel17, changed state to down
*Jun 27 23:59:26 PDT: %PFREDUN-SW2-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
*Jun 27 23:59:26 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 2 - Module Type LINE_CARD removed
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/14, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/15, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/16, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/14, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/16, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/1, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/3, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/1, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/3, changed state to down
Press RETURN to get started!
*Jun 28 07:03:33.421: %USBFLASH-SW2_STBY-5-CHANGE: bootdisk has been inserted!
*Jun 28 07:03:53.297: %OIR-SW2_STBY-6-INSPS: Power supply inserted in slot 1
*Jun 28 07:03:53.301: %C6KPWR-SW2_STBY-4-PSOK: power supply 1 turned on.
*Jun 28 07:04:26.093: %FABRIC-SW2_STBY-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
*Jun 28 07:04:48.497: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 5: Running Minimal Diagnostics...
*Jun 28 07:04:48.497: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestLoopback due to: the port is used as a VSL link.
*Jun 28 07:04:48.497: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestLoopback due to: the port is used as a VSL link.
*Jun 28 07:04:55.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestFexModeLoopback due to: the port is used as a VSL link.
*Jun 28 07:04:55.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestFexModeLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:00.165: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestL2CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:00.165: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestL2CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:06.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestL3CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:06.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestL3CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:23.309: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 5: Passed Online Diagnostics
*Jun 28 00:05:38 PDT: %SYS-SW2_STBY-6-CLOCKUPDATE: System clock has been updated from 00:05:38 PDT Sat Jun 28 2014 to 00:05:38 PDT Sat Jun 28 2014, configured from console by console.
*Jun 28 00:05:38 PDT: %SYS-SW2_STBY-6-CLOCKUPDATE: System clock has been updated from 00:05:38 PDT Sat Jun 28 2014 to 00:05:38 PDT Sat Jun 28 2014, configured from console by console.
*Jun 28 00:05:38 PDT: %SSH-SW2_STBY-5-DISABLED: SSH 2.0 has been disabled
*Jun 28 00:05:54 PDT: %SYS-SW2_STBY-5-RESTART: System restarted --
Cisco IOS Software, c6880x Software (c6880x-ADVENTERPRISEK9-M), Version 15.1(2)SY2, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Wed 26-Feb-14 15:30 by prod_rel_team
*Jun 28 00:05:54 PDT: %SSH-SW2_STBY-5-ENABLED: SSH 2.0 has been enabled
*Jun 28 00:05:54 PDT: %SYS-SW2_STBY-3-LOGGER_FLUSHED: System was paused for 00:03:01 to ensure console debugging output.
*Jun 28 00:05:56 PDT: %C6KENV-SW2_STBY-4-LOWER_SLOT_EMPTY: The lower adjacent slot of module 5 might be empty. Airdam must be installed in that slot to be NEBS compliant
*Jun 28 00:07:36 PDT: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 1: Running Minimal Diagnostics...
*Jun 28 00:07:37 PDT: %SYS-SW2_STBY-3-LOGGER_FLUSHED: System was paused for 00:01:29 to ensure console debugging output.
*Jun 28 00:07:41 PDT: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 2: Running Minimal Diagnostics...
*Jun 28 00:08:10 PDT: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 1: Passed Online Diagnostics
*Jun 28 00:08:22 PDT: %EC-SW2_STBY-5-CANNOT_BUNDLE2: Te2/1/10 is not compatible with Te1/2/10 and will be suspended (Operational flow control send of Te2/1/10 is off, Te1/2/10 is on)
*Jun 28 00:08:31 PDT: %EC-SW2_STBY-5-COMPATIBLE: Te2/1/10 is compatible with port-channel members
*Jun 28 00:08:45 PDT: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 2: Passed Online Diagnostics
*Jun 28 00:08:46 PDT: %C6KENV-SW2_STBY-4-HIGHER_SLOT_EMPTY: The higher adjacent slot of module 2 might be empty. Airdam must be installed in that slot to be NEBS compliant
ELDC1C1-AG01-1 line 0
************************ W A R N I N G ***************************
* THIS IS A PRIVATE COMPUTER SYSTEM, AND FOR AUTHORIZED USE ONLY.*
* THIS SYSTEM IS MONITORED, AND ANY UNAUTHORIZED USE MAY BE *
* SUBJECT TO CRIMINAL PROSECUTION. *
* IF YOU ARE NOT AUTHORIZED, LOG OUT IMMEDIATELY!!!!! *
******************************************************************
So according to the logs, it seems like the switch was reloaded or something of that nature, but even Cisco TAC said that wasn't the case, and they couldn't determine what was causing it either. TAC went through the configuration on the VSS and said it is configured correctly.
Maybe someone else has experienced this issue, or someone can point out something that I can look at... Sorry for the very long post.
Thanks
07-07-2014 08:30 PM
Hi,
You have done lots of hardware changes. Since the switch is not really rebooting but is acting like it is, it may be a software issue. Have you tried loading a different version of IOS?
07-07-2014 08:39 PM
I have not made any hardware changes since putting it into production, other than replacing the optics and cable to the 3750X. The logs say that all removable modules were removed and reinserted, but I can assure you that is not the case. This is why I originally thought the switch was reloading itself, but when I did a show version, the uptime for the switch itself was not reset, while the uptime for the individual modules was.
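For anyone wanting to check the same thing, the comparison was roughly this (a sketch; the exact fields differ by release):
show version | include uptime
show module switch all
show logging | include OIR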
07-14-2014 11:37 PM
Hi,
I have the same problem as you described.
I have two 6880-X switches configured with two VSL links: one goes to the sup and one goes to the additional 10G module.
First I was running 15.1(2)SY2. After approximately two weeks the VSS cluster went down.
With 15.1(2)SY3, same effect: after two weeks the links went down.
Jul 14 16:53:51.264: %VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/5/1: Link down
Jul 14 16:53:51.264: %VSL-SW1-5-VSL_CNTRL_LINK: New VSL Control Link 1/1/1
Jul 14 16:53:51.324: %VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/1/1: Link down
Jul 14 16:53:51.324: %VSLP-SW1-2-VSL_DOWN: Last VSL interface Te1/1/1 went down
Jul 14 16:53:51.324: %VSLP-SW1-2-VSL_DOWN: All VSL links went down while switch is in ACTIVE role
08-19-2014 03:58 PM
We are having this issue on two separate VSS setups that we have. This is a major concern. The first time it occurred, luckily the standby chassis just showed "Last reload reason: dual-active".
We are running the latest code, 15.1(2)SY3.
The second occurrence happened, but this time it reloaded the active chassis and caused an outage.
I have dual-active fast-hello set up with a link between the sups.
I have the VSL links set up in a redundant manner, where one link is on the sup and the other is on the 10G line card.
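For comparison, the fast-hello part is essentially the standard configuration, along these lines (the domain number and interfaces are placeholders, not our exact ones):
switch virtual domain 100
 dual-active detection fast-hello
!
interface TenGigabitEthernet1/5/1
 dual-active fast-hello
!
interface TenGigabitEthernet2/5/1
 dual-active fast-hello
and I verify it with show switch virtual dual-active fast-hello.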
Any input from others having this issue would be appreciated. I believe we are hitting some type of software bug...
08-20-2014 10:48 AM
It is a bug; Cisco TAC finally confirmed it. My options were: we get put on a list of affected users and they notify us when a fix comes out, or they send us two replacement chassis and put us on the list. Now that I have the replacements, I have to schedule time to unrack those heavy suckers and put the replacements in.
08-20-2014 11:20 AM
Do you know the bug ID? I'm trying to research it some more. Thank you, jdavila94.
08-20-2014 11:29 AM
They never gave me one. They said it was an "unverified" bug. Since they still can't determine whether it's a hardware or software bug, they can't categorize it. It was all verbal over the phone, so I can't remember exactly what they said.
08-20-2014 12:04 PM
https://tools.cisco.com/bugsearch/bug/CSCup99867
That might be it... Trying to see if I can get a case opened with TAC to determine/confirm.
08-20-2014 06:02 PM
How funny: that bug ID you posted was created from my TAC incident. The timing is perfect, and the timestamps on the logs are exactly the same as mine.
Anyhow,
I'm hoping that the 4510s won't have that problem, as I'm about to build another VSS with these 4500s.
09-26-2014 03:33 PM
Looks like Cisco finally gave us a workaround for the VSS issue; it's on the bug ID.
I heard that a customer has tried it and it has been flawless.
https://tools.cisco.com/bugsearch/bug/CSCup99867/?reffering_site=dumpcr
09-24-2024 07:49 AM
We appear to be facing this issue as well. Before I set up the workaround mentioned in TAC's bug database, I really want to be sure that, first, we won't get any issues with the entire VSS cluster when entering the command, and second, there are no drawbacks. Has anyone implemented the workaround?
Regards
10-13-2014 11:33 PM
Hi all,
Facing bug CSCup99867 and the related proposed workaround, what are the drawbacks?
Do I understand correctly that with the LMP timers increased to 6 minutes, if there is a failure on the active member it will take 6 minutes to recover?
What is the control-plane behavior in case of a fault?
Has anybody experienced a fault after implementing the workaround?
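For my part, I plan to baseline the VSL/LMP state before and after the change with the standard show commands, something like the following (output format varies by release):
show switch virtual link
show switch virtual link detail
show switch virtual link port-channel
show switch virtual role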
Regards
02-10-2015 09:42 PM
Check the sea_console logs from both the active and the standby supervisor.
Typically you would see logs like the ones below:
%VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/5/15: Timeout (30000msec) waiting for Hello packet from peer
%VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/5/16: Timeout (30000msec) waiting for Hello packet from peer
%VSLP-SW1-2-VSL_DOWN: All VSL links went down while switch is in ACTIVE role
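If it helps, on our boxes the System Event Archive behind those sea_console entries can be dumped from the CLI; I'm not certain the syntax is identical on every release, but it is something like this, run on both the active and the standby:
show logging system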
04-08-2015 02:50 AM
I suppose we have the same problem. Only when switch 1 is the active switch of the 6880-X VSS system, the connection to the FEX stack suddenly breaks after running fine for some days, and the FEX stack (3 x 6800-IA) only becomes ready again when I reset the whole VSS system: both VSS switches and the FEX stack.
None of the access ports on the FEX can reach our infrastructure (DHCP, DNS, ...). Sometimes both of the connected 10G FEX-stack uplinks go down, and sometimes only the uplink to the active parent switch goes down. The SYST LED of FEX 1 lights amber.
When parent switch 2 is the active unit, the issue has never occurred.
The 6880-X is running 15.2(1)SY; the 6800-IA is running 15.2(3)E.
Cisco TAC is investigating the issue, but so far without any result.
When the issue occurs, the recorded logging messages are:
Syslog 6880-X
Apr 7 07:42:43.182: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1274 (FEX-101)
Apr 7 07:45:43.181: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1276 (FEX-101)
Syslog 6800-IA:
Apr 7 07:57:43.182: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1284
-Traceback= 5198C4z 21413F8z 1C3B35Cz 1C3C6FCz 1C3CA00z 2654CC0z 26509BCz
Apr 7 08:00:43.181: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1286
-Traceback= 5198C4z 21413F8z 1C3B35Cz 1C3C6FCz 1C3CA00z 2654CC0z 26509BCz