Re: SCE8000 reboots with RuC timeouts

Iulian Vaideanu · 05-19-2020

Hello everyone,

I know I'm asking about an old, end-of-everything platform, but maybe there is someone here who still remembers stuff...

We recently purchased a fully-equipped, refurbished SCE8000-10G. After configuring it for production use (same SCOS 4.1.0 and service config that we have on five more identical devices), we are experiencing random reboots with "SE Watchdog Module: An Error occurred" messages in the log file and "Line Card Watchdog: RuC number 3 (and 4) timeout" / "Line Card Watchdog: Line card failed." messages in the debug interpretation of last-failure. This happens almost daily, sometimes often enough (three times in half an hour) to put the device into Recovery Mode.

The device is under warranty with the supplier, but I'm trying to isolate the issue to a specific component so that we don't have to send the whole thing back. After some reading I learned that RuCs are traffic processors and that each SCM has twelve of them for user ("data plane") traffic processing and one for "control plane" traffic. So I figured I'd swap the SCMs: if RuCs 16 and 17 then started timing out instead, I'd know which SCM is the culprit.
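The arithmetic behind that expectation can be sketched as follows. This is only an illustration of my assumption, not anything taken from Cisco documentation: it assumes RuCs are numbered contiguously from zero, 13 per SCM (12 data plane + 1 control plane), in SCM position order. The function names are made up for the example.

```python
# Assumption (not confirmed by any documentation): RuCs are numbered
# contiguously from 0, with 13 per SCM (12 data plane + 1 control plane).
RUCS_PER_SCM = 13


def locate(ruc):
    """Split a global RuC number into (scm_position, index_within_scm)."""
    return divmod(ruc, RUCS_PER_SCM)


def after_swap(ruc, scm_count=2):
    """Global RuC number the same physical RuC would get once the two
    SCMs trade places, if numbering follows the SCM position."""
    position, local = locate(ruc)
    new_position = (position + 1) % scm_count
    return new_position * RUCS_PER_SCM + local


if __name__ == "__main__":
    for ruc in (3, 4):
        print(f"RuC {ruc} -> RuC {after_swap(ruc)} after swapping the SCMs")
```

Under that assumption, a fault that travels with the SCM would show up as RuCs 16 and 17 after the swap, while a fault that stays at 3 and 4 would point at something tied to the slot rather than the module.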

The thing is, even after swapping the SCMs, RuCs 3 and 4 are still the ones that fail. Is it possible that the chassis backplane has anything to do with this? Also (probably not related to our issue, given the previous sentence), is it possible to "map" the RuC numbering to the two large daughterboards on each SCM, which show three CPUs each (and maybe three more underneath)?

Thank you.

Iulian Vaideanu · 06-11-2020

Just a quick update here: we tested the device as a single-link SCE, using each SCM in turn. One of them seems stuck in an endless reset loop (not even a single successful boot), while the other one has been stable for two weeks now. We're waiting for the replacement SCM to arrive, and I'll keep you posted on how things go.