on 11-21-2013 11:10 PM
This document briefly summarizes the 828-day problem often observed in CSS and CSM.
It explains in detail how the 828-day problem occurs and how to avoid it.
CSS and CSM are based on 32-bit VxWorks.
The VxWorks system clock ticks at 60 Hz (the counter increases by one every 1/60 of a second). You can read the counter with the tickGet() API, which is documented at the following URL. For example, if you call tickGet() five seconds after booting, it returns 0x12c (60 Hz * 5 sec = 300). CSS and CSM refer to this value for various purposes.
http://www.vxdev.com/docs/vx55man/vxworks/ref/tickLib.html
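The tick arithmetic above can be sketched in C as follows. This is only an illustration: TICK_RATE_HZ and seconds_to_ticks() are hypothetical names that do not come from VxWorks or CSS/CSM, and the 60 Hz rate is taken from the description above.

```c
#include <stdint.h>

/* Tick rate described in this document (assumption: fixed at 60 Hz). */
#define TICK_RATE_HZ 60u

/* Hypothetical helper: convert a duration in seconds to clock ticks. */
uint32_t seconds_to_ticks(uint32_t seconds)
{
    return seconds * TICK_RATE_HZ;
}
```

With these definitions, seconds_to_ticks(5) evaluates to 300, which is the 0x12c value mentioned above.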
Because tickGet() returns a 32-bit counter, its maximum value is 2^32 - 1 = 4294967295 (0xffffffff). After 2^32 = 4294967296 ticks the counter wraps and is reset to 0.
In other words, the tickGet() value will be reset to 0 after 828 days and 12 hours have elapsed, according to the following formula.
2^32 / (60Hz*60sec*60min*24hr) = 828.5 days
Therefore, various problems will occur after 828 days have elapsed.
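The wrap period can be double-checked with integer arithmetic. The sketch below is illustrative only; the macro and function names are hypothetical, and the 60 Hz tick rate is the one stated in this document.

```c
#include <stdint.h>

#define TICK_RATE_HZ    60u
#define SECONDS_PER_DAY 86400u

/* Number of ticks a 32-bit counter can count before it wraps: 2^32. */
#define TICKS_PER_WRAP  (UINT64_C(1) << 32)

/* Whole days that elapse before the counter wraps. */
uint32_t wrap_period_days(void)
{
    uint64_t seconds = TICKS_PER_WRAP / TICK_RATE_HZ;
    return (uint32_t)(seconds / SECONDS_PER_DAY);
}

/* Whole hours left over after the whole days are removed. */
uint32_t wrap_period_extra_hours(void)
{
    uint64_t seconds = TICKS_PER_WRAP / TICK_RATE_HZ;
    return (uint32_t)((seconds % SECONDS_PER_DAY) / 3600u);
}
```

Here wrap_period_days() gives 828 and wrap_period_extra_hours() gives 12, which matches the 828 days and 12 hours quoted above.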
Let us explain this problem in a bit more detail, using the keepalive feature as an example. CSS sends keepalives at regular intervals.
By default, CSS sends ICMP packets to a service every five seconds as an availability check.
Within CSS, the next transmission time is calculated from the current time and the keepalive interval (five seconds, 0x12c; next_keepalive = tickGet() + 0x12c).
For example, if a keepalive is sent 3600 seconds after booting, the next ICMP packets will be sent 3605 seconds after booting.
Once the value returned by tickGet() is larger than the value corresponding to 3605 seconds (tickGet() > next_keepalive), the keepalive packets are sent.
If the tickGet() value is 0xfffffff0, the next_keepalive value is set to 0x10000011c (0xfffffff0 + 0x12c). However, the maximum value of tickGet() is 2^32 - 1 = 0xffffffff. When this maximum is exceeded, tickGet() is reset to 0, while the next_keepalive value remains 0x10000011c.
In this case, the condition tickGet() > next_keepalive can never become true, and thus CSS stops sending keepalive packets.
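The failure mode described above can be reproduced in a small C sketch. It is not the CSS source code: the function names are hypothetical, and holding next_keepalive in a 64-bit variable is an assumption made only so that the value 0x10000011c from the text can be represented. The second function shows a common wrap-safe idiom (testing the sign of the 32-bit difference); it is a general technique and not necessarily the fix that was applied in CSS/CSM.

```c
#include <stdbool.h>
#include <stdint.h>

#define KEEPALIVE_INTERVAL_TICKS 0x12cu  /* 5 seconds at 60 Hz */

/* Naive check as described in the text: the raw 32-bit tick value is
 * compared with a deadline that was computed without wrapping.
 * Assumption: the deadline is held in a wider (64-bit) variable. */
bool keepalive_due_naive(uint32_t now, uint64_t next_keepalive)
{
    return (uint64_t)now > next_keepalive;
}

/* Wrap-safe check: both values are 32-bit, the subtraction wraps
 * modulo 2^32, and the result is reinterpreted as a signed value.
 * This works while the two values are less than 2^31 ticks apart and
 * relies on the usual two's-complement conversion to int32_t. */
bool keepalive_due_wrap_safe(uint32_t now, uint32_t next_keepalive)
{
    return (int32_t)(now - next_keepalive) > 0;
}
```

With a deadline of 0x10000011c, keepalive_due_naive() returns false for every possible 32-bit tick value, which is the "never sent again" behavior described above. With the 32-bit deadline 0x11c (the wrapped sum), keepalive_due_wrap_safe() returns false just before the wrap and true once the counter has passed 0x11c after the wrap.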
Changing the base OS from 32 bit to 64 bit would also require significant changes in CSS/CSM, which run on the OS. Therefore, we have decided not to upgrade the base OS.
Instead, many bugs that would have taken effect after 828 days have been corrected in both CSS and CSM.
The root problem, however, remains. Therefore, we suggest that you reboot CSS/CSM before 828 days have elapsed.
Note: End of SW Maintenance Releases Date: September 20, 2012
http://www.cisco.com/en/US/prod/collateral/contnetw/ps5719/ps792/end_of_life_c51-657403.html
End of SW Maintenance Releases Date: August 26, 2011
http://www.cisco.com/en/US/prod/collateral/modules/ps2706/ps780/end_of_life_c51-577764.html
Also, analysis of some of the reported failures determined that they could not be corrected.
To avoid these problems, it is recommended that CSS/CSM be rebooted every two years.
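To plan a reboot, it can help to estimate how long a device can run before the tick counter wraps next. The sketch below is a hypothetical helper, not a CSS/CSM command or API; it assumes the 60 Hz, 32-bit counter described in this document, and takes into account that the counter wraps every 2^32 ticks, not only once.

```c
#include <stdint.h>

#define TICK_RATE_HZ   60u
#define TICKS_PER_WRAP (UINT64_C(1) << 32)

/* Hypothetical helper: whole seconds remaining until the 32-bit tick
 * counter wraps next, given the uptime in seconds. */
uint64_t seconds_until_next_wrap(uint64_t uptime_seconds)
{
    uint64_t ticks = uptime_seconds * TICK_RATE_HZ;
    uint64_t remaining_ticks = TICKS_PER_WRAP - (ticks % TICKS_PER_WRAP);
    return remaining_ticks / TICK_RATE_HZ;
}
```

For a freshly booted device, seconds_until_next_wrap(0) is 71582788 seconds, which is about 828.5 days.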
When the CSS has an uptime of 828 days, it cannot send packets to the management port for 18 minutes. This issue affects the management port only. The circuit and VIP addresses work fine. We recommend that you reboot the CSS before its uptime reaches 828 days.
When 828 days have elapsed since the CSM was booted, the HTTP probe will fail and will stay in the down state for about 18 minutes. Reboot the CSM before 828 days have elapsed. (CSCso08858)
Original Document: Cisco Support Community Japan DOC-31076
Author: Yuji Shimazaki
https://supportforums.cisco.com/docs/DOC-31076
Posted on March 19, 2013
There is a typo in the formula.
2^32 / (60Hz*60sec*60min*24hr) = 828.5 days is the correct one.
I guess that, because the 32-bit clock tick is reset every 828.5 days, the issue will recur every 828.5 days unless the device is restarted.
You are right, I made a typo.
Thank you for pointing it out. I fixed it.
Hi,
The root cause is that tickGet() is reset to 0 whenever it reaches 2^32 (= 828.5 days),
and I don't think this issue happens just one time, because the problem will occur at 828.5 * n days after the CSS was booted.
Yes, this symptom is not a one-time issue, but occurs on 828.5 * n days.
I handled an SR in which the uptime was 1657 days.
Thanks yushimaz for your feedback.
I have one more question.
In the sentence below, I don't understand why the next_keepalive value will be 0x1000012b.
If 0x12C is added to 0xFFFFFFF0, the next value would be 0x10000011C.
Could you please explain this to me? Also, I guess the next value will be set to 0x11C after the overflow.
---------
If the tickGet() value is 0xfffffff0, the next_keepalive value is set to 0x1000012b, ....
Thank you for your comments.
Yes, you are right!
I calculated the value with '0xFFFFFFFF + 0x12C'.
So, I fixed it.