Re: CSS 11000 redundancy problems - Both are masters

Greenwolf · ‎09-12-2005

Hi all,

I had a strange problem with the redundancy between two of my CSS 11000s.

They were both master at the same time, which resulted in total apocalypse :(

07:44:35 5/1 49369 IPV4-4: Duplicate IP address detected: xxx.xxx.xxx.xxx xx-xx-xx-xx-xx-xx

07:44:35 5/1 49370 IPV4-4: Incoming CE 0x401f00, incoming (0 based) SLP 0x1

Just before CSS01 switched to backup mode, I see it logging "SNTP-6: No SNTP replies in 3*poll-interval secs." When CSS01 switches back to master mode, I can see this same message on my CSS02. But I don't see CSS02 switching back to backup mode. So they were both master at the same time, and it was disaster time.

When I logged in and saw the problem, I rebooted CSS02. After the reboot the situation restored itself. But I now need to find out why it happened and how to prevent it from happening in the future.

The only thing I can see is the SNTP errors. Does anyone have any idea why this happened, and could it be a result of the SNTP errors? If you need additional information, just let me know.

css01

07:20:19 5/1 49322 SNTP-6: No SNTP replies in 3*poll-interval secs.

07:20:21 5/1 49323 REDUNDANCY-4: Transition to redundancy backup, master is x.x.x.x

…

07:43:58 5/1 49345 REDUNDANCY-4: Transition to redundancy master

css02

02:58:43 5/1 48126 SNTP-6: Setting time to <02:58:43>

07:20:22 5/1 48127 REDUNDANCY-4: Transition to redundancy master

…

07:43:57 5/1 48217 SNTP-6: No SNTP replies in 3*poll-interval secs.
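To read the two excerpts side by side, they can be merged into one timeline. The following is only a minimal Python sketch: it assumes the line layout seen in the excerpts above (time, an opaque `5/1` field, a sequence number, `FACILITY-severity:`, message), it assumes CSS01 started as master and CSS02 as backup, and the names `build_timeline` and `dual_master_since` are made up for illustration. Timestamps from two boxes are only comparable if their clocks agree, so the ordering is approximate.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts in this thread:
#   HH:MM:SS <field> <sequence> FACILITY-severity: message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<field>\S+)\s+(?P<seq>\d+)\s+"
    r"(?P<facility>[A-Z0-9]+)-(?P<severity>\d+):\s+(?P<message>.*)$"
)
ROLE_RE = re.compile(r"Transition to redundancy (master|backup)")

# Log lines copied from the post above.
LOGS = {
    "css01": [
        "07:20:19 5/1 49322 SNTP-6: No SNTP replies in 3*poll-interval secs.",
        "07:20:21 5/1 49323 REDUNDANCY-4: Transition to redundancy backup, master is x.x.x.x",
        "07:43:58 5/1 49345 REDUNDANCY-4: Transition to redundancy master",
    ],
    "css02": [
        "02:58:43 5/1 48126 SNTP-6: Setting time to <02:58:43>",
        "07:20:22 5/1 48127 REDUNDANCY-4: Transition to redundancy master",
        "07:43:57 5/1 48217 SNTP-6: No SNTP replies in 3*poll-interval secs.",
    ],
}


def build_timeline(logs):
    """Parse every line and return (time, host, facility, message) sorted by time."""
    events = []
    for host, lines in logs.items():
        for line in lines:
            m = LINE_RE.match(line.strip())
            if m is None:
                continue  # skip lines that do not fit the assumed layout
            when = datetime.strptime(m["time"], "%H:%M:%S").time()
            events.append((when, host, m["facility"], m["message"]))
    return sorted(events)


def dual_master_since(events, initial_roles):
    """Replay REDUNDANCY transitions; return the first time every box is master."""
    roles = dict(initial_roles)
    for when, host, facility, message in events:
        m = ROLE_RE.search(message)
        if facility == "REDUNDANCY" and m:
            roles[host] = m.group(1)
            if all(role == "master" for role in roles.values()):
                return when
    return None


timeline = build_timeline(LOGS)
for when, host, facility, message in timeline:
    print(when, host, facility, message)

# Initial roles are an assumption based on the description in this thread.
start = dual_master_since(timeline, {"css01": "master", "css02": "backup"})
print("both master since:", start)
```

With the six lines quoted here, the replay reports both boxes as master from the moment CSS01 logs its second transition.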

Thanks in advance for your time and help.

With kind regards,

Geert Hermans

pknoops · ‎09-12-2005

Geert,

Could we see a little more info from the sys.log prior to the master/master situation? Is it possible the MASTER was so busy that it did not answer the heartbeat polls to the backup and also could not process the SNTP polls?

Regards

Pete..

Greenwolf · ‎09-13-2005

Hi Pete,

First of all thank you for your reply.

"Is it possible the MASTER was so busy that it did not answer the heartbeat polls to the backup and also could not process the SNTP polls?"

That could be possible, but then explain this to me. Let's say it is so busy that it can't reply to the heartbeat polls to the backup and also can't process SNTP polls. Where did it then find the resources to send the syslog to the logging server, which is on the same subnet as the SNTP server? It doesn't have the resources to send heartbeat polls to the backup or to process the SNTP polls, but it does have the resources to process the logging! That just sounds strange to me!

Maybe I'm wrong, but I was under the impression that the master sends a redundancy protocol message every second to inform the backup CSS that it is alive, and that the backup doesn't send anything to the master.

If the backup CSS doesn't receive anything for 3 seconds, the backup CSS becomes the master CSS and begins sending out redundancy protocol messages. Or am I wrong?

Now, what did I notice? At 07:20:19 on CSS01 I see the SNTP poll errors, and right after that CSS01 transitioned from master to backup. Why would a master transition from master to backup? On CSS02 I see that at 07:20:22 (3 seconds later, the redundancy protocol timeout) it becomes the master.

At 07:43:57 I see the same SNTP errors on CSS02, and one second later CSS01 jumps back from backup to master. Why? Wasn't it receiving the redundancy protocol messages?
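The gaps mentioned above can be checked with a few lines of Python. This is just a sketch; `seconds_between` is a made-up helper, and it assumes both timestamps fall on the same day. Since the timestamps come from two different boxes whose SNTP sync was failing, the cross-box gaps are only approximate.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Seconds from start to end, both given as HH:MM:SS on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# SNTP error on CSS01 -> CSS02 becomes master
first_gap = seconds_between("07:20:19", "07:20:22")
# SNTP error on CSS02 -> CSS01 becomes master again
second_gap = seconds_between("07:43:57", "07:43:58")

print("first failover gap:", first_gap, "s")
print("second failover gap:", second_gap, "s")
```

The first gap matches the 3-second timeout described above; the second one is shorter than that timeout, which is part of what makes the second transition look odd.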

The complete syslog is included with this mail. If you need extra information, don't hesitate to ask.

Thanks a million for your help. If you're ever in Belgium, I'll buy you a beer.

With kind regards,

Geert

Gilles Dufour · ‎09-13-2005

Geert,

Unfortunately, we won't be able to tell you what happened.

The most important thing with this kind of problem is to capture a sniffer trace on the 2 CSS ports and see if VRRP messages are seen and/or sent.

I believe the SNTP message is just an indication that there is a traffic-related issue: the box was unable to receive or send SNTP messages and unable to receive VRRP messages.

Regards,

Gilles.

pknoops · ‎09-13-2005

Hi Geert,

I will take a look at the sys.log info. Maybe Gilles already has. It's actually the backup box that sends the polls to the MASTER. If it does not get a response to 3 of the polls, then the BACKUP will become MASTER.
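The takeover rule described here can be sketched as a small Python model. This is not the CSS implementation, only an illustration: the class `BackupBox` and the function `simulate` are made up, and the sketch assumes the 3 unanswered polls have to be consecutive, which the description above does not state.

```python
from dataclasses import dataclass

# From the description above: 3 unanswered polls trigger the takeover.
MISSED_POLLS_BEFORE_TAKEOVER = 3


@dataclass
class BackupBox:
    """Toy model of the backup side; not the actual CSS logic."""
    role: str = "backup"
    missed: int = 0

    def on_poll_result(self, answered: bool) -> str:
        """Record one poll outcome and return the resulting role."""
        if self.role == "master":
            return self.role  # already took over, nothing more to decide
        if answered:
            self.missed = 0  # assumption: any answer resets the counter
        else:
            self.missed += 1
            if self.missed >= MISSED_POLLS_BEFORE_TAKEOVER:
                self.role = "master"
        return self.role


def simulate(poll_results):
    """Feed a sequence of poll outcomes to a fresh backup box."""
    box = BackupBox()
    return [box.on_poll_result(answered) for answered in poll_results]


# One answered poll, two misses, a recovery, then three misses in a row.
history = simulate([True, False, False, True, False, False, False])
print(history)
```

In this model the backup only takes over on the last poll of the sequence, after the third miss in a row. If the link between the boxes drops, every poll goes unanswered, so the backup promotes itself while the original master has no reason to step down.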

As a side note, you can modify the amount of time allowed for the response by changing the "vrrp-backup-timer".

You would need to set this on both the MASTER and the BACKUP, and then you would need to "bounce" redundancy on the boxes, so a maintenance window would be needed.

For more info on this command, see this link:

http://www.cisco.com/univercd/cc/td/doc/product/webscale/css/css_720/advcggd/redndncy.htm#1031447

Regards

Pete..

Greenwolf · ‎09-13-2005

Hi Pete,

Thanks for this information. I didn't know it worked like that.

Just one more question about the polls. What happens with the master if it doesn't receive any more polls from the Backup?

Why did the MASTER become backup?

CSS01 was the master, but at 07:20:21 it transitioned to backup:

5/1 49323 REDUNDANCY-4: Transition to redundancy backup, master is xxx.xxx.xxx.xxx

Everything started because CSS01 became backup.

Thanks again for your help.

Geert

pknoops · ‎09-13-2005

Geert,

What is port e12? Is this the connection between the boxes? If so, it went down, which would leave the two boxes not knowing which one is MASTER, so they would both be MASTER.

Regards

Pete..

Greenwolf · ‎09-13-2005

Pete,

Yes. Port e12 is the connection between the two boxes.

I'll have another look at the config immediately. I guess I missed that.

With kind regards,

Geert Hermans

Greenwolf · ‎09-13-2005

Pete, Gilles,

Yes, port e12 is the connection between the two boxes.

But at 09:01:08 I issued the reboot command on CSS02:

5/1 48376 NETMAN-4: Reboot command entered via CLI

This resulted, at 09:01:12, in the port going down on CSS01. They are connected by a crossover cable, as you could probably guess.

5/1 52334 CIRCUIT-6: Port e12 is down for circuit VLANXXX .

The reason I rebooted CSS02 was that they were both in master mode.

Maybe this wasn't a good idea, but at the time it seemed to be the smart thing to do.

With kind regards,

Geert

pknoops · ‎09-13-2005

Geert,

What version of software are you running? I did some research on this type of thing, and quite honestly we have not seen it for several years.

Can you do a "show core" to see if you have any recent core dumps on either CSS that would have occurred around the time in question?

Regards

Pete..

Greenwolf · ‎09-14-2005

Hi Pete,

Thanks for the help, guys. We really appreciate this a lot.

We have 6 CSSs here that have now been running for almost 3.5 years. We had a hard disk failure on one of them a year or so ago, and now this. The hard disk failure wasn't so bad because the other one took over. But this caused some havoc :(

But the other ones are still running smoothly, so they're pretty stable.

Here is the information you requested:

CSS01# sh core

CSS01# sh ver

Version: ap0503034s (5.03 Build 34)

Flash (Locked): 5.00 Build 33

Flash (Operational): 5.03 Build 15

Type: PRIMARY

Licensed Cmd Set(s): Standard Feature Set

CSS02# show core

CSS02# sh version

Version: ap0503034s (5.03 Build 34)

Flash (Locked): 5.00 Build 45

Flash (Operational): 5.03 Build 15

Type: PRIMARY

Licensed Cmd Set(s): Standard Feature Set

No dump files. But we did not enable core dumps.

CSS02# show dump-status

Dump mode is disabled

with kind regards,

Geert Hermans