07-07-2014 04:55 PM - edited 03-07-2019 07:58 PM
Hi Guys,
I have a head-scratcher that I'm dealing with. I have two Catalyst 6880-X switches (about two months old) configured as a VSS, with two VSL links on each supervisor card, and switch 1 is the active switch in the VSS. Both 6880-X chassis also have two additional 10G modules each. I am dealing with two issues in this setup.
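For reference, the VSL side of the configuration is essentially the standard layout, something like this (the domain and port-channel numbers here are placeholders, not my exact config; the member ports match the Te1/5/15-16 and Te2/5/15-16 interfaces you'll see in the logs below):
switch virtual domain 100
 switch mode virtual
!
interface Port-channel1
 switch virtual link 1
 no shutdown
!
interface range TenGigabitEthernet1/5/15 - 16
 channel-group 1 mode on
!
interface Port-channel2
 switch virtual link 2
 no shutdown
!
interface range TenGigabitEthernet2/5/15 - 16
 channel-group 2 mode on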
The first issue: I have a 3750X with a 10G module that is port-channeled to the VSS, with one 10G link going to each chassis. Within the first three days it started dropping the link that is plugged into switch 1, and the link then comes back up right away. It does this randomly throughout the day. I have changed the optics on both ends, changed the cable, and changed the module on the 3750X, and I still get the same issue. The strange part is that when switch 2 becomes the active switch in the VSS, the problem goes away; I noticed this before changing out the optics, cable, and module on the 3750X.
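The checks I keep running against the flapping member are just the usual ones, roughly as follows (Te2/2/5 is the VSS-side member you'll see in the logs below; exact output varies by release):
show etherchannel summary
show interfaces TenGigabitEthernet2/2/5 counters errors
show interfaces TenGigabitEthernet2/2/5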
The second issue is the one I've been trying to figure out along with Cisco TAC. It started about a month after bringing the VSS online. When switch 1 was the active switch in the VSS, every couple of days the VSL links would drop one at a time, eventually killing the VSS and putting the standby unit into recovery mode because of dual-active detection. Once I rebooted the standby switch, the VSS came back up normally. It did this a couple of times until I decided to force switch 2 to be the active switch in the VSS. When switch 2 became active, the VSS was stable for about a month, then the VSL links died again and the system failed over to switch 1. After looking at the logs with Cisco TAC, we can see that the VSL links stop responding, which causes the failover, but so far we still can't determine what is causing the VSL links to fail. Cisco TAC suggested that the VSL links might be getting overloaded, but we have been monitoring the bandwidth utilization on the VSL links and it never goes beyond 1%. TAC's last suggestion was to add another VSL link through a module other than the sup. This was done a couple of days ago, so now I'm waiting to see whether the VSL links fail again, and since switch 1 is the active switch in the VSS, I'm having to deal with the first issue above with the 3750X.
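Adding the extra (non-sup) VSL member was basically just bundling one more 10G port into the existing VSL port-channel, along these lines (the interface and channel-group numbers here are illustrative, not my exact ones), and I'm watching utilization on the bundle:
configure terminal
 interface TenGigabitEthernet1/1/16
  channel-group 1 mode on
  no shutdown
 end
show switch virtual link port-channel
show interfaces Port-channel1 | include rate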
I've included some log entries from the dropped port-channel member and also logs for when the vsl links fail.
Logs for the 3750X port-channel member drops:
*Jul 5 05:27:10 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 5 05:27:10 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 5 05:27:10 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 5 05:28:10 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 6 11:28:52 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 6 11:28:52 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 6 11:28:53 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 6 11:29:53 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:27:25 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:27:25 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:27:26 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:27:33 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:33:43 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:33:43 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 06:33:43 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 06:34:43 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to up
*Jul 7 07:38:35 PDT: %LINEPROTO-SW1-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 07:38:35 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to down
*Jul 7 07:38:36 PDT: %LINK-SW1-3-UPDOWN: Interface TenGigabitEthernet2/2/5, changed state to up
Logs for the VSL link failures:
*Jun 27 23:59:25 PDT: %VSLP-SW2-3-VSLP_LMP_FAIL_REASON: Te2/5/15: Link down
*Jun 27 23:59:25 PDT: %VSL-SW2-5-VSL_CNTRL_LINK: New VSL Control Link 2/5/16
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/5/15, changed state to down
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/15, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet2/5/15, changed state to down
*Jun 27 23:59:25 PDT: %VSLP-SW2-3-VSLP_LMP_FAIL_REASON: Te2/5/16: Link down
*Jun 27 23:59:25 PDT: %VSLP-SW2-2-VSL_DOWN: Last VSL interface Te2/5/16 went down
*Jun 27 23:59:25 PDT: %VSLP-SW2-2-VSL_DOWN: All VSL links went down while switch is in ACTIVE role
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet2/5/16, changed state to down
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel2, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel2, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet2/5/16, changed state to down
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel1, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel1, changed state to down
*Jun 27 23:59:25 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 5 - Module Type LINE_CARD removed
*Jun 27 23:59:25 PDT: %OSPF-SW2-5-ADJCHG: Process 1, Nbr 10.253.0.3 on TenGigabitEthernet1/1/1 from FULL to DOWN, Neighbor Down: Interface down or detached
*Jun 27 23:59:25 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface Port-channel24, changed state to down
*Jun 27 23:59:25 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel24, changed state to down
*Jun 27 23:59:26 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 1 - Module Type LINE_CARD removed
*Jun 27 23:59:26 PDT: %LINK-SW2-3-UPDOWN: Interface Port-channel17, changed state to down
*Jun 27 23:59:26 PDT: %PFREDUN-SW2-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
*Jun 27 23:59:26 PDT: %OIR-SW2-6-INSREM: Switch 1 Physical Slot 2 - Module Type LINE_CARD removed
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/14, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/15, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/5/16, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/14, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/5/16, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/1, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to down
*Jun 27 23:59:27 PDT: %LINK-SW2-3-UPDOWN: Interface TenGigabitEthernet1/1/3, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/1, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to down
*Jun 27 23:59:27 PDT: %LINEPROTO-SW2-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/3, changed state to down
Press RETURN to get started!
*Jun 28 07:03:33.421: %USBFLASH-SW2_STBY-5-CHANGE: bootdisk has been inserted!
*Jun 28 07:03:53.297: %OIR-SW2_STBY-6-INSPS: Power supply inserted in slot 1
*Jun 28 07:03:53.301: %C6KPWR-SW2_STBY-4-PSOK: power supply 1 turned on.
*Jun 28 07:04:26.093: %FABRIC-SW2_STBY-5-FABRIC_MODULE_ACTIVE: The Switch Fabric Module in slot 5 became active.
*Jun 28 07:04:48.497: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 5: Running Minimal Diagnostics...
*Jun 28 07:04:48.497: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestLoopback due to: the port is used as a VSL link.
*Jun 28 07:04:48.497: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestLoopback due to: the port is used as a VSL link.
*Jun 28 07:04:55.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestFexModeLoopback due to: the port is used as a VSL link.
*Jun 28 07:04:55.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestFexModeLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:00.165: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestL2CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:00.165: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestL2CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:06.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 15 is skipped in TestL3CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:06.049: %CONST_DIAG-SW2_STBY-6-DIAG_PORT_SKIPPED: Module 5 port 16 is skipped in TestL3CTSLoopback due to: the port is used as a VSL link.
*Jun 28 07:05:23.309: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 5: Passed Online Diagnostics
*Jun 28 00:05:38 PDT: %SYS-SW2_STBY-6-CLOCKUPDATE: System clock has been updated from 00:05:38 PDT Sat Jun 28 2014 to 00:05:38 PDT Sat Jun 28 2014, configured from console by console.
*Jun 28 00:05:38 PDT: %SYS-SW2_STBY-6-CLOCKUPDATE: System clock has been updated from 00:05:38 PDT Sat Jun 28 2014 to 00:05:38 PDT Sat Jun 28 2014, configured from console by console.
*Jun 28 00:05:38 PDT: %SSH-SW2_STBY-5-DISABLED: SSH 2.0 has been disabled
*Jun 28 00:05:54 PDT: %SYS-SW2_STBY-5-RESTART: System restarted --
Cisco IOS Software, c6880x Software (c6880x-ADVENTERPRISEK9-M), Version 15.1(2)SY2, RELEASE SOFTWARE (fc3)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2014 by Cisco Systems, Inc.
Compiled Wed 26-Feb-14 15:30 by prod_rel_team
*Jun 28 00:05:54 PDT: %SSH-SW2_STBY-5-ENABLED: SSH 2.0 has been enabled
*Jun 28 00:05:54 PDT: %SYS-SW2_STBY-3-LOGGER_FLUSHED: System was paused for 00:03:01 to ensure console debugging output.
*Jun 28 00:05:56 PDT: %C6KENV-SW2_STBY-4-LOWER_SLOT_EMPTY: The lower adjacent slot of module 5 might be empty. Airdam must be installed in that slot to be NEBS compliant
*Jun 28 00:07:36 PDT: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 1: Running Minimal Diagnostics...
*Jun 28 00:07:37 PDT: %SYS-SW2_STBY-3-LOGGER_FLUSHED: System was paused for 00:01:29 to ensure console debugging output.
*Jun 28 00:07:41 PDT: %DIAG-SW2_STBY-6-RUN_MINIMUM: Switch 2 Module 2: Running Minimal Diagnostics...
*Jun 28 00:08:10 PDT: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 1: Passed Online Diagnostics
*Jun 28 00:08:22 PDT: %EC-SW2_STBY-5-CANNOT_BUNDLE2: Te2/1/10 is not compatible with Te1/2/10 and will be suspended (Operational flow control send of Te2/1/10 is off, Te1/2/10 is on)
*Jun 28 00:08:31 PDT: %EC-SW2_STBY-5-COMPATIBLE: Te2/1/10 is compatible with port-channel members
*Jun 28 00:08:45 PDT: %DIAG-SW2_STBY-6-DIAG_OK: Switch 2 Module 2: Passed Online Diagnostics
*Jun 28 00:08:46 PDT: %C6KENV-SW2_STBY-4-HIGHER_SLOT_EMPTY: The higher adjacent slot of module 2 might be empty. Airdam must be installed in that slot to be NEBS compliant
ELDC1C1-AG01-1 line 0
************************ W A R N I N G ***************************
* THIS IS A PRIVATE COMPUTER SYSTEM, AND FOR AUTHORIZED USE ONLY.*
* THIS SYSTEM IS MONITORED, AND ANY UNAUTHORIZED USE MAY BE *
* SUBJECT TO CRIMINAL PROSECUTION. *
* IF YOU ARE NOT AUTHORIZED, LOG OUT IMMEDIATELY!!!!! *
******************************************************************
So according to the logs, it seems like the switch was reloaded or something of that nature, but even Cisco TAC said that wasn't the case, and they couldn't determine what was causing it either. TAC went through the configuration on the VSS and said it is configured correctly.
Maybe someone else has experienced this issue, or someone can point out something that I can look at... Sorry for the very long post.
Thanks
07-07-2014 08:30 PM
Hi,
You have done lots of hardware changes. Since the switch is not really rebooting but is acting like it is, it may be a software issue. Have you tried loading a different version of IOS?
07-07-2014 08:39 PM
I have not made any hardware changes since putting it into production, other than replacing the optics and cable to the 3750X. The logs say that all removable modules were removed and reinserted, but I can assure you that is not the case. This is why I originally thought the switch was reloading itself, but when I did a show version, the uptime for the switch itself was not reset, while the uptime for the individual modules was.
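For anyone wanting to check the same thing, the comparison was roughly this (a sketch; the exact fields differ by release):
show version | include uptime
show module switch all
show logging | include OIR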
07-14-2014 11:37 PM
Hi,
I have the same problem as you described.
I have two 6880-X switches configured with two VSL links: one goes to the sup and one goes to the additional 10G module.
First I was running 15.1(2)SY2. After approximately two weeks the VSS cluster went down.
With 15.1(2)SY3, same effect: after two weeks the links went down.
Jul 14 16:53:51.264: %VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/5/1: Link down
Jul 14 16:53:51.264: %VSL-SW1-5-VSL_CNTRL_LINK: New VSL Control Link 1/1/1
Jul 14 16:53:51.324: %VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/1/1: Link down
Jul 14 16:53:51.324: %VSLP-SW1-2-VSL_DOWN: Last VSL interface Te1/1/1 went down
Jul 14 16:53:51.324: %VSLP-SW1-2-VSL_DOWN: All VSL links went down while switch is in ACTIVE role
08-19-2014 03:58 PM
We are having this issue on two separate VSS setups that we have. This is a major concern. The first time it occurred, luckily the standby chassis just showed "Last reload reason: dual-active".
We are running the latest code, 15.1(2)SY3.
The second occurrence happened, but this time it reloaded the active chassis and caused an outage.
I have dual-active fast-hello set up with a link between the sups.
I have the VSL links set up in a redundant manner, where one link is on the sup and the other is on the 10G line card.
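For comparison, the fast-hello part is essentially the standard configuration, along these lines (the domain number and interfaces are placeholders, not our exact ones):
switch virtual domain 100
 dual-active detection fast-hello
!
interface TenGigabitEthernet1/5/1
 dual-active fast-hello
!
interface TenGigabitEthernet2/5/1
 dual-active fast-hello
and I verify it with show switch virtual dual-active fast-hello.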
Any input from others having this issue would be appreciated. I believe we are hitting some type of software bug...
08-20-2014 10:48 AM
It is a bug; Cisco TAC finally confirmed it. My options were: we get put on a list of affected users and they notify us when a fix comes out, or they send us two replacement chassis and put us on the list. Now that I have the replacements, I have to schedule time to unrack those heavy suckers and put the replacements in.
08-20-2014 11:20 AM
Do you know the bug ID? I'm trying to research it some more. Thank you, jdavila94.
08-20-2014 11:29 AM
They never gave me one. They said it was an "unverified" bug. Since they still can't determine whether it's a hardware or software bug, they can't categorize it. It was all verbal over the phone, so I can't remember exactly what they said.
08-20-2014 12:04 PM
https://tools.cisco.com/bugsearch/bug/CSCup99867
That might be it... Trying to see if I can get a case opened with TAC to determine/confirm.
08-20-2014 06:02 PM
How funny: that bug ID you posted was created from my TAC incident. The timing is perfect, and the timestamps on the logs are exactly the same as mine.
Anyhow,
I'm hoping that the 4510s won't have that problem, as I'm about to build another VSS with these 4500s.
09-26-2014 03:33 PM
Looks like Cisco finally gave us a workaround for the VSS issue; it's on the bug ID.
I heard that a customer has tried it and it has been flawless.
https://tools.cisco.com/bugsearch/bug/CSCup99867/?reffering_site=dumpcr
09-24-2024 07:49 AM
We appear to be facing this issue as well. Before I set up the workaround mentioned in TAC's bug database, I really want to be sure that, first, we won't get any issues with the entire VSS cluster when entering the command, and second, there are no drawbacks. Has anyone implemented the workaround?
Regards
10-13-2014 11:33 PM
Hi all,
Facing bug CSCup99867 and the related proposed workaround, what are the drawbacks?
Do I understand correctly that with the LMP timers increased to 6 minutes, if there is a failure on the active member it will take 6 minutes to recover?
What is the control-plane behavior in case of a fault?
Has anybody experienced a fault after implementing the workaround?
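For my part, I plan to baseline the VSL/LMP state before and after the change with the standard show commands, something like the following (output format varies by release):
show switch virtual link
show switch virtual link detail
show switch virtual link port-channel
show switch virtual role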
Regards
02-10-2015 09:42 PM
Check the sea_console logs from both the active and the standby supervisor.
Typically you would see logs like the ones below:
%VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/5/15: Timeout (30000msec) waiting for Hello packet from peer
%VSLP-SW1-3-VSLP_LMP_FAIL_REASON: Te1/5/16: Timeout (30000msec) waiting for Hello packet from peer
%VSLP-SW1-2-VSL_DOWN: All VSL links went down while switch is in ACTIVE role
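If it helps, on our boxes the System Event Archive behind those sea_console entries can be dumped from the CLI; I'm not certain the syntax is identical on every release, but it is something like this, run on both the active and the standby:
show logging system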
04-08-2015 02:50 AM
I suppose we have the same problem. Only when switch 1 is the active switch of the 6880-X VSS system, the connection to the FEX stack suddenly breaks after running fine for some days, and the FEX stack (3 x 6800-IA) only becomes ready again when I reset the whole VSS system: both VSS switches and the FEX stack.
None of the access ports on the FEX can reach our infrastructure (DHCP, DNS, ...). Sometimes both of the connected 10G FEX-stack uplinks go down, and sometimes only the uplink to the active parent switch goes down. The SYST LED of FEX 1 lights amber.
When parent switch 2 is the active unit, the issue has never occurred.
The 6880-X is running 15.2(1)SY; the 6800-IA is running 15.2(3)E.
Cisco TAC is investigating the issue, but so far without any result.
When the issue occurs, the recorded logging messages are:
Syslog 6880-X
Apr 7 07:42:43.182: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1274 (FEX-101)
Apr 7 07:45:43.181: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1276 (FEX-101)
Syslog 6800-IA:
Apr 7 07:57:43.182: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1284
-Traceback= 5198C4z 21413F8z 1C3B35Cz 1C3C6FCz 1C3CA00z 2654CC0z 26509BCz
Apr 7 08:00:43.181: %PLATFORM_RPC-3-MSG_THROTTLED: RPC Msg Dropped by throttle mechanism: type 3, class 21, max_msg 32, total throttled 1286
-Traceback= 5198C4z 21413F8z 1C3B35Cz 1C3C6FCz 1C3CA00z 2654CC0z 26509BCz