09-12-2005 04:53 AM
Hi all,
I had a strange problem with the redundancy between two of my CSS11000s.
They were both master at the same time, which resulted in total apocalypse :(
07:44:35 5/1 49369 IPV4-4: Duplicate IP address detected: xxx.xxx.xxx.xxx xx-xx-xx-xx-xx-xx
07:44:35 5/1 49370 IPV4-4: Incoming CE 0x401f00, incoming (0 based) SLP 0x1
Just before CSS01 switched to backup mode, I see it logging "SNTP-6: No SNTP replies in 3*poll-interval secs." When CSS01 switches back to master mode, I see the same message on CSS02. But I don't see CSS02 switching back to backup mode. So they were both master at the same time, and it was disaster time.
When I logged in and saw the problem, I rebooted CSS02. After the reboot the situation restored itself. But I now need to find out why it happened and how to prevent it from happening in the future.
The only thing I can see is the SNTP errors. Does anyone have any idea why this happened, and could it be a result of the SNTP errors? If you need additional information, just let me know.
css01
07:20:19 5/1 49322 SNTP-6: No SNTP replies in 3*poll-interval secs.
07:20:21 5/1 49323 REDUNDANCY-4: Transition to redundancy backup, master is x.x.x.x
07:43:58 5/1 49345 REDUNDANCY-4: Transition to redundancy master
css02
02:58:43 5/1 48126 SNTP-6: Setting time to <02:58:43>
07:20:22 5/1 48127 REDUNDANCY-4: Transition to redundancy master
07:43:57 5/1 48217 SNTP-6: No SNTP replies in 3*poll-interval secs.
Thanks in advance for your time and help.
With kind regards,
Geert Hermans
09-12-2005 05:05 AM
Geert,
Could we see a little more info from the sys.log leading up to the master/master situation? Is it possible the MASTER was so busy that it did not answer the backup's heartbeat polls and also could not process the SNTP polls?
Regards
Pete..
09-13-2005 01:55 AM
Hi Pete,
First of all thank you for your reply.
"It is possible the MASTER was so busy that it did not answer the heartbeat polls to the backup and also could not process the sntp polls?"
That could be possible, but then explain this to me. Let's say he's so busy that he can't answer the backup's heartbeat polls and also can't process the SNTP polls. Where did he then find the resources to send the syslog to the logging server, which is on the same subnet as the SNTP server? He doesn't have the resources to handle the heartbeat polls or to process the SNTP polls, but he does have the resources to process the logging! That just sounds strange to me!
Maybe I'm wrong, but I was under the impression that the master sends a redundancy protocol message every second to inform the backup CSS that it is alive, and that the backup doesn't send anything to the master.
If the backup CSS doesn't receive anything for 3 seconds, it becomes the master CSS and begins sending out redundancy protocol messages. Or am I wrong?
Now, here is what I noticed: at 07:20:21, CSS01, the master, transitioned from master to backup. Why would a master transition to backup? Just before the transition, at 07:20:19, I see the SNTP poll errors on CSS01. At 07:20:22 (3 seconds later, the redundancy protocol timeout) CSS02 becomes the master.
At 07:43:57 I see the same SNTP errors on CSS02, and one second later CSS01 jumps back from backup to master. Why? Wasn't it receiving the redundancy protocol messages?
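To make the timing in the two logs easier to compare, here is a small Python sketch that takes the timestamps quoted above and computes the gap between consecutive events. It assumes both CSS clocks were in agreement (both were being set via SNTP), which the excerpts alone cannot confirm, and it only measures time; it does not show that one event caused the next.

```python
from datetime import datetime

# Events copied from the CSS01/CSS02 syslog excerpts quoted above.
# Assumption: both CSS clocks agree (SNTP-synced); the logs cannot prove this.
EVENTS = [
    ("css01", "07:20:19", "SNTP-6: No SNTP replies in 3*poll-interval secs."),
    ("css01", "07:20:21", "REDUNDANCY-4: Transition to redundancy backup"),
    ("css02", "07:20:22", "REDUNDANCY-4: Transition to redundancy master"),
    ("css02", "07:43:57", "SNTP-6: No SNTP replies in 3*poll-interval secs."),
    ("css01", "07:43:58", "REDUNDANCY-4: Transition to redundancy master"),
]


def to_time(hms: str) -> datetime:
    """Parse an HH:MM:SS syslog timestamp (the date is not in the excerpt)."""
    return datetime.strptime(hms, "%H:%M:%S")


def gaps(events):
    """Return (previous_event, next_event, seconds) for each consecutive pair."""
    ordered = sorted(events, key=lambda e: to_time(e[1]))
    result = []
    for prev, cur in zip(ordered, ordered[1:]):
        delta = (to_time(cur[1]) - to_time(prev[1])).total_seconds()
        result.append((prev, cur, int(delta)))
    return result


if __name__ == "__main__":
    for prev, cur, secs in gaps(EVENTS):
        print(f"{secs:5d}s  {prev[0]} {prev[2][:20]:20} -> {cur[0]} {cur[2][:20]}")
```

On these five lines the gaps come out as 2, 1, 1415 and 1 seconds: the SNTP error on CSS01 and CSS02 becoming master are 3 seconds apart, and the SNTP error on CSS02 and CSS01 reclaiming master are 1 second apart.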
I've included the complete syslog with this mail. If you need extra information, don't hesitate to ask.
Thanks a million for your help. If you are ever in Belgium, I'll buy you a beer!
With kind regards,
Geert
09-13-2005 03:33 AM
Geert,
Unfortunately, we won't be able to tell you what happened.
The most important thing with this kind of problem is to capture a sniffer trace on the two CSS ports and see whether VRRP messages are seen and/or sent.
I believe the SNTP message is just an indication that there is a traffic-related issue: the box was unable to receive or send SNTP messages and unable to receive VRRP messages.
Regards,
Gilles.
09-13-2005 04:43 AM
Hi Geert,
I will take a look at the sys.log info. Maybe Gilles already has. It's actually the backup box that sends the polls to the MASTER. If it does not get a response to 3 of the polls, the BACKUP becomes MASTER.
As a side note, you can change how long the backup waits for a response with the "vrrp-backup-timer" command.
You would need to set this on both the MASTER and the BACKUP and then "bounce" redundancy on the boxes, so a maintenance window would be needed.
For more info on this command, see this link:
http://www.cisco.com/univercd/cc/td/doc/product/webscale/css/css_720/advcggd/redndncy.htm#1031447
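The failover rule described above can be sketched as a tiny Python model. This is a simplified illustration, not Cisco's CSS implementation: the names `Box`, `poll_once`, `simulate` and `MISSED_LIMIT` are hypothetical, and the real timing is governed by the vrrp-backup-timer setting. It also shows why a broken path between the boxes, rather than a dead master, can leave both boxes as master.

```python
from dataclasses import dataclass

# Simplified model of the rule described in the post: the BACKUP polls the
# MASTER and promotes itself after MISSED_LIMIT consecutive unanswered polls.
# Hypothetical names and numbers; NOT Cisco's CSS implementation.
MISSED_LIMIT = 3


@dataclass
class Box:
    name: str
    state: str  # "master", "backup" or "down"
    missed: int = 0


def poll_once(backup: Box, master_alive: bool, link_up: bool) -> None:
    """One poll from the backup. A reply arrives only if the master is up
    AND the path between the boxes (e.g. the crossover link) is up."""
    if backup.state != "backup":
        return
    if master_alive and link_up:
        backup.missed = 0
    else:
        backup.missed += 1
        if backup.missed >= MISSED_LIMIT:
            backup.state = "master"


def simulate(master_alive: bool, link_up: bool, polls: int) -> tuple[str, str]:
    """Return (state of box A, state of box B) after `polls` polls."""
    master = Box("A", "master")
    backup = Box("B", "backup")
    for _ in range(polls):
        poll_once(backup, master_alive, link_up)
    if not master_alive:
        master.state = "down"
    # In this model a live master that cannot hear the other box has no
    # reason to step down, so it stays master.
    return master.state, backup.state
```

Under these assumptions, `simulate(True, False, 3)` returns `("master", "master")`: the backup gives up on a master that is actually still running. With only 2 missed polls it stays backup.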
Regards
Pete..
09-13-2005 05:28 AM
Hi Pete,
Thanks for this information. I didn't know it worked like that.
Just one more question about the polls. What happens with the master if it doesn't receive any more polls from the Backup?
Why did the MASTER become backup?
CSS01 was the master, but at 07:20:21 it transitioned to backup:
07:20:21 5/1 49323 REDUNDANCY-4: Transition to redundancy backup, master is xxx.xxx.xxx.xxx
Everything started because CSS01 became backup.
Thanks again for your help.
Geert
09-13-2005 04:47 AM
Geert,
What is port e12? Is this the connection between the boxes? If so, it went down, which would cause the two boxes to not know which one is MASTER, so they would both become MASTER.
Regards
Pete..
09-13-2005 04:51 AM
Pete,
Yes. Port e12 is the connection between the two boxes.
I'll have a look at the config again immediately. I guess I missed that.
With kind regards,
Geert Hermans
09-13-2005 05:00 AM
Pete, Gilles,
Yes, port e12 is the connection between the two boxes.
But at 09:01:08 I submitted the reboot command at CSS02.
5/1 48376 NETMAN-4: Reboot command entered via CLI
They are connected by a crossover cable, as you probably guessed, so at 09:01:12 this brought the port down on CSS01:
5/1 52334 CIRCUIT-6: Port e12 is down for circuit VLANXXX .
The reason I rebooted CSS02 was that they were both in master mode.
Maybe this wasn't a good idea, but at the time it seemed like a smart thing to do.
With kind regards,
Geert
09-13-2005 06:12 AM
Geert,
What version of software are you running? I did some research on this, and quite honestly we have not seen this kind of thing for several years.
Can you do a "show core" to see if you have any recent core dumps on either CSS that occurred around the time in question?
Regards
Pete..
09-14-2005 12:21 AM
Hi Pete,
Thanks for the help, guys. We really appreciate this a lot.
We have 6 CSSs running here, for almost 3.5 years now. We had a hard disk failure on one of them a year or so ago, and now this. The hard disk failure wasn't so bad because the other box took over. But this caused some havoc :(
The other ones are still running smoothly, so they're pretty stable.
Here is the information you requested:
CSS01# sh core
CSS01# sh ver
Version: ap0503034s (5.03 Build 34)
Flash (Locked): 5.00 Build 33
Flash (Operational): 5.03 Build 15
Type: PRIMARY
Licensed Cmd Set(s): Standard Feature Set
CSS02# show core
CSS02# sh version
Version: ap0503034s (5.03 Build 34)
Flash (Locked): 5.00 Build 45
Flash (Operational): 5.03 Build 15
Type: PRIMARY
Licensed Cmd Set(s): Standard Feature Set
No dump files, but we did not enable core dumps.
CSS02# show dump-status
Dump mode is disabled
With kind regards,
Geert Hermans