The Problem
As with any complex system, IOS-XR routers will encounter failures, whether they’re limited to a single component, or they affect modules across the entire device. In most cases, there is a standard path to debug the issue:
- Log in to the router and collect show techs, logs, etc.
- Pass the collected data to Cisco support
- Engineers analyze the data and come back with a solution
Of course, there may be some back-and-forth involved, but these are the general steps. However, what happens when the first step fails? Imagine you attempt to log in to the router via ssh, telnet, and even console, but the router appears to be non-responsive, and no command prompt appears. The command line is the lowest level of basic manageability, meaning that it should function even when all other roads into the router are closed. What can you do when the command line is not working either?
These cases are rare, but they do occur, and they generally lead to a more severe triage response than normal. This does not mean that these cases are due to a shell issue, but simply that the shell blockage prevented a rapid triage. For example, in one case, a dependency on a central server that was itself stuck on another service led to the shell blockage. Also, an inaccessible shell often leads to the assumption that the router is “bricked”.
The Solution
The goal here is to ensure that the user can always access the shell, without the shell becoming blocked. If the shell does become blocked, then the user should be able to force the shell to respond. To accomplish this, we have a multi-part solution:
- The shell should never block permanently during exit cleanup. The solution here is to spawn a new thread for the cleanup function, then wait a maximum of 1 minute for cleanup to complete. At 15-second intervals, the shell will display a message to the user indicating that cleanup is taking longer than normal, and that the shell will definitely exit in X seconds. Cleanup is always just a best effort: even if a process receives SIGKILL, the rest of the system should be capable of handling anything that the exiting process failed to clean up. This design is therefore acceptable, particularly since it is rare that the shell becomes stuck during cleanup in the first place.
- If the shell becomes blocked on parser_server or a stubborn child process, the user should have a way to force the shell to become unstuck. This is solved by introducing a special control sequence (CTRL-C 3 times followed by one CTRL-Z, all within 2 seconds). When the sequence is entered, the connection to parser_server will be ended, and all child processes will receive a SIGKILL signal, forcing them to end. Again, this is a last-resort option, designed to ensure that the user can access the shell no matter how impaired the rest of the system might be at the time.
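To make the bounded-cleanup idea from the first point concrete, here is a minimal Python sketch. It is not the IOS-XR implementation: the function name `run_cleanup_with_deadline`, its parameters, and the message wording are all hypothetical, chosen only to illustrate running cleanup in a separate thread, waiting at most 1 minute, and printing a notice every 15 seconds.

```python
import threading
import time


def run_cleanup_with_deadline(cleanup, timeout=60.0, notify_every=15.0, notify=print):
    """Run cleanup() in a worker thread and wait at most `timeout` seconds.

    Returns True if cleanup finished in time, False if the deadline passed.
    All names here are illustrative assumptions, not IOS-XR internals.
    """
    done = threading.Event()

    def worker():
        try:
            cleanup()
        finally:
            done.set()

    # daemon=True so a stuck cleanup thread cannot keep the process alive
    # once the caller gives up and exits.
    threading.Thread(target=worker, daemon=True).start()

    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return done.is_set()
        # Wait for completion, but wake up at each notification interval.
        if done.wait(min(notify_every, remaining)):
            return True
        remaining = deadline - time.monotonic()
        if remaining > 0:
            notify(f"Cleanup is taking longer than normal; exiting in {remaining:.0f} seconds")
```

A caller would invoke this on exit and then terminate regardless of the return value, which matches the best-effort nature of cleanup described above.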
After the shell is disconnected from parser_server, it enters “essential-ops mode”. This is indicated to the user via some info messages, and the user will then have access to a limited subset of reserved commands that do not require a connection to parser_server. Normal permission restrictions are still in full effect for these reserved commands.
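The control sequence that leads into essential-ops mode (three CTRL-C presses followed by one CTRL-Z, all within 2 seconds) can be pictured as a sliding-window check over the most recent keypresses. The sketch below is an assumption-laden illustration, not the actual shell code: the class `ForceUnblockDetector`, its `feed` method, and the injectable `clock` parameter are hypothetical names.

```python
import time
from collections import deque

CTRL_C = "\x03"  # ASCII ETX, sent by CTRL-C
CTRL_Z = "\x1a"  # ASCII SUB, sent by CTRL-Z


class ForceUnblockDetector:
    """Detects CTRL-C x3 followed by CTRL-Z within a time window (hypothetical helper)."""

    PATTERN = (CTRL_C, CTRL_C, CTRL_C, CTRL_Z)

    def __init__(self, window=2.0, clock=time.monotonic):
        self.window = window
        self.clock = clock  # injectable so the logic can be tested without sleeping
        # Keep only as many keypresses as the pattern is long.
        self.recent = deque(maxlen=len(self.PATTERN))

    def feed(self, key):
        """Record one keypress; return True if it completes the sequence in time."""
        now = self.clock()
        self.recent.append((key, now))
        if len(self.recent) < len(self.PATTERN):
            return False
        keys = tuple(k for k, _ in self.recent)
        first_time = self.recent[0][1]
        if keys == self.PATTERN and now - first_time <= self.window:
            self.recent.clear()  # require a fresh sequence next time
            return True
        return False
```

When `feed` returns True, a shell following the design above would end its parser_server connection, send SIGKILL to its child processes, and switch to essential-ops mode. Those steps are left out of the sketch because their details are not described here.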
Once in essential-ops mode, you may use the ‘reconnect’ command to attempt to re-enter normal mode.
Note that this enhancement is only available in IOS-XR from 7.11.1 onward.