VSS Supervisor Crash - Uplinks ?

Hey,

Our design consists of two VSS systems interconnected by a 4 x 1GE EtherChannel. The uplinks terminate on the supervisors, with each supervisor terminating 2 x 1GE.

Recently a supervisor engine failed in one of the VSS chassis. The failover from the active to the standby supervisor was not really transparent for some of our clustered applications.

Our logs show that while the crashing supervisor was writing its crashinfo, its uplink ports remained physically up, but no received traffic was processed, so it was all dropped.

From the application's point of view this caused unidirectional traffic and eventually a 'split brain' situation in which both servers of the cluster became active.

We had UDLD active in normal mode, but in normal mode it does not react to physical Layer 1 issues.

My question: if we enable UDLD aggressive mode on this port channel, will the crashing supervisor stop sending UDLD packets, so that the other end goes into error-disable and prevents a split-brain situation for our clusters?

3 Replies

Nathan Spitzer
Level 1

Two questions:

  • What type of dual-active detection are you using?
  • Are you using LACP or PAgP for the EtherChannels, and in which modes?

For dual-active detection we're using fast-hello and ePAGP.

Please note that the VSS never became dual-active; it was the server clusters interconnected over the VSS core that became dual-active.

We're not using any channeling protocol; the mode is just 'on'. There is no good reason for using 'on'; it is mainly historical.

(The port channel was previously in use between two pairs of 3750s, which were not able to use an EtherChannel protocol across two different chassis in a stack.)

You just found your issue. Because you have no channel protocol, there was nothing to tell the other end that it should take the ports out of the EtherChannel. I HIGHLY suggest (and Cisco HIGHLY suggests) you use active/active (LACP) or desirable/desirable (PAgP) for these EtherChannels to prevent exactly the issue you describe.
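
As a rough sketch of what that change could look like on one side of the inter-VSS bundle, assuming LACP is chosen: the interface range and the channel-group number 10 below are hypothetical placeholders, not values from this thread. Because all members of a bundle must use the same mode, the ports are taken out of the static bundle before rejoining it. The same change is needed on the far end, and it should be done in a maintenance window since the links leave the bundle briefly.

```
! Hypothetical interface numbers and channel-group ID; substitute the real uplinks.
interface range GigabitEthernet1/5/1 - 2
 ! Leave the static ("on") bundle first
 no channel-group 10
 ! Rejoin the bundle with LACP actively negotiating
 channel-group 10 mode active
 ! PAgP alternative instead of the line above:
 ! channel-group 10 mode desirable
```

With a negotiated bundle, a port whose peer stops sending LACP or PAgP packets is removed from the EtherChannel once the protocol times out, even if the link stays physically up. Note that with the default slow LACP timer the partner is only declared lost after about 90 seconds, so it is worth checking whether the platform and software release support a faster LACP rate if the clusters need quicker detection.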