Solved: How does stacking switch failover work?

eeebbunee · ‎01-13-2022

Hello Engineers & Professionals,

I am researching switch stacking, especially how it can serve as a failover plan.

When we stack switches (for example, two switches in one stack), there will be a master and a slave. (I know the word 'slave' is not preferred these days, but please take it as a simple expression.)

- If the master switch has an issue, the slave switch will become the master.

In this case, what kinds of problem count as an 'issue'?

1) Master switch powered off and never comes back

2) IOS is broken and the boot image can't be loaded

3) ... else?

- If the master switch can't perform, what happens to the devices that are connected directly to the master switch?

Does the new master switch (formerly the slave switch) take care of those devices' communication?

Please help me understand how a switch stack can serve as a backup plan for switch failover.

balaji.bandi · ‎01-13-2022

In any of your cases, when the master switch fails, the slave switch takes over the master role and remains master (even if the original master switch comes back).

If you require the original switch to be master again, you need to reload the stack.

Operationally I do not see any issue here. Devices connected to the failed switch will lose their connection and have to be re-patched (since the end devices are not dual-homed).

Some examples are explained here:

https://www.cisco.com/c/en/us/products/collateral/switches/catalyst-9300-series-switches/white-paper-c11-741468.html
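
To make the behaviour described in this reply concrete, here is a minimal Python sketch of it. It is a toy model, not Cisco's actual election algorithm: it assumes only that a healthy incumbent keeps the master role, that the highest priority wins otherwise, and that the lowest MAC address breaks ties. The `Member` class, the device names, and the MAC addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Member:
    number: int        # stack member number
    priority: int      # higher value is preferred (assumption of this model)
    mac: str           # tie-breaker: lowest MAC wins (same string format assumed)
    up: bool = True


def elect_master(members, current=None):
    """Return the member number that should hold the master role."""
    alive = [m for m in members if m.up]
    if not alive:
        return None
    # No preemption: a healthy incumbent keeps the role.
    if current is not None and any(m.number == current for m in alive):
        return current
    # Otherwise the highest priority wins and the lowest MAC breaks ties.
    best = min(alive, key=lambda m: (-m.priority, m.mac))
    return best.number


def reachable_devices(members, attachments):
    """Devices that still have at least one link to a live stack member."""
    live = {m.number for m in members if m.up}
    return sorted(d for d, links in attachments.items() if live & set(links))


stack = [Member(1, 15, "00:11:22:33:44:01"), Member(2, 14, "00:11:22:33:44:02")]
# Hypothetical end devices and the stack members they are patched into.
attachments = {"pc-a": [1], "server-b": [1, 2], "pc-c": [2]}

master = elect_master(stack)                       # initial election
stack[0].up = False                                # switch 1 fails
master_after_failure = elect_master(stack, current=master)
survivors = reachable_devices(stack, attachments)  # who still has a link
stack[0].up = True                                 # switch 1 rejoins
master_after_rejoin = elect_master(stack, current=master_after_failure)

print(master, master_after_failure, master_after_rejoin, survivors)
```

In this model, switch 2 keeps the master role after switch 1 rejoins, and only the single-homed device on switch 1 (`pc-a`) loses connectivity during the failure.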

BB


Leo Laohoo · ‎01-13-2022

@eeebbunee wrote:

- If the master switch has an issue, the slave switch will become the master.

NOTE: I am going to provide answers based on my technical experience and not something out of a sales brochure.

Currently, IOS-XE has (present tense) issues with the stack-mgr process. What this process does is self-explanatory: it manages the communication and traffic of the stack. (Whenever there is a stacking cable attached, the stack-mgr process starts. If there is no stacking cable attached, stack-mgr does not start.) From 16.10.X through 16.12.X (for 9300/9300L) and in 17.X.X (for 9200/9200L), there have been several instances where a memory leak caused issues (plural). Cisco BST lists many bugs attributed to stack-mgr or the "EHSA standby down" process. A lot of these bugs get "reintroduced" in newer firmware.

If the stack is running bad code, no crashinfo and no core dumps will be generated. In essence, there will be no evidence of a crash for TAC to look into.

IOS-XE is very management intensive. The memory and CPU must be regularly monitored. Daily.
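
As a rough illustration of what such daily monitoring could look like, here is a minimal Python sketch. It assumes the CLI output has already been collected by some other means and that it contains the familiar `show processes cpu` summary line; the exact text can differ between releases. The sample line, the memory figures, and the alert threshold are made up, and `parse_cpu` and `steadily_growing` are hypothetical helper names.

```python
import re

# Summary line as printed by "show processes cpu" (format assumed here).
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu(output):
    """Extract the 5 s / 1 min / 5 min CPU percentages from CLI text."""
    match = CPU_LINE.search(output)
    if match is None:
        raise ValueError("no CPU utilization summary line found")
    five_sec, interrupt, one_min, five_min = map(int, match.groups())
    return {"five_sec": five_sec, "interrupt": interrupt,
            "one_min": one_min, "five_min": five_min}


def steadily_growing(samples, min_samples=4):
    """True if each sample is higher than the previous one (possible leak)."""
    if len(samples) < min_samples:
        return False
    return all(b > a for a, b in zip(samples, samples[1:]))


# Made-up sample data for illustration only.
sample = ("CPU utilization for five seconds: 97%/12%; "
          "one minute: 85%; five minutes: 62%")
cpu = parse_cpu(sample)
cpu_alert = cpu["five_min"] >= 60          # hypothetical alert threshold
leak_suspected = steadily_growing([410, 422, 437, 455, 480])  # daily used MB

print(cpu, cpu_alert, leak_suspected)
```

A strictly increasing series is only a crude hint of a leak; a real check would want a longer history and some tolerance for normal fluctuation.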

Permit me to show an example:

From the picture above, this happened just a few days ago. The cause? The stacking cable "sagged" because someone did not tighten the stacking cable screws. A few minutes after I tightened all the screws and made sure the flapping stopped, the CPU just went ballistic. We had to manually intervene and cold reboot the entire stack just to recover it.

NOTE: I adjusted the stacking cable around 9:30 am.

If this stack had been running classic IOS, this would not have happened. However, IOS-XE behaves very differently and the stack-mgr process is very unstable.

eeebbunee · ‎01-13-2022

Hello Leo,

I really appreciate your "real-life" reply; I desperately needed reviews of realistic situations.

These look like serious defects to me, because a stack that is sometimes unstable is the worst case for a business like ours that runs 24/7.

May I ask how long you have been using stacking?

Despite these defects, are you still staying with stacking? Or do you have any plans to change your infrastructure?

Leo Laohoo · ‎01-13-2022

@eeebbunee wrote:

May I ask how long you have been using stacking?

Excluding my experience with Switch Clustering & GigaStacking GBIC, I have been using the "stacking" technology since the introduction of the 3750 around 2005.

With the 3750-series and 2960S/2960X stacking, I am very confident about the time it will take for the standby to "take over" when the switch master crashes.

With IOS-XE and stacking, there is a significant and noticeable "lag" when the standby takes over. Many factors affect the time it takes to fail over (from master to standby) that I never seemed to experience with classic IOS. The factors are:

- The software version the stack is running
- Uptime of the stack
- Downlink flapping

If redundancy is absolutely necessary, I would consider using VSS for critical services and applications.

eeebbunee · ‎01-14-2022

I think I need to read up on VSS.

I appreciate you sharing your opinion and experience with me!

Have a wonderful day!