Overall system health monitoring

Seth Beauchamp · 12-16-2015

I am looking for some guidance on overall system health monitoring for IOS XR devices. All of our devices (mostly 9Ks) have more than 4 GB of memory, and our current monitoring systems do not monitor them properly. I would like to know what others are doing for overall system health. I want to set up alerts via SNMP or syslog that can be sent to my NOC. I am mostly interested in memory and CPU, but anything else critical that differs from IOS would be good information.

Is there a certain OID or syslog message I should watch for?

tlewisflood · 12-16-2015

Cacti has some interesting features. It can be very fast, and I recall there being a third-party template specific to IOS XR CPU/memory, as well as a threshold plugin that has notification capabilities. Cisco's Prime Performance Monitor also seems promising, but I'm not clear on the notification part of that.

I'm no developer, so the tools my group has been using always seem to come with complications. We get stuff done, but a lack of prior experience frequently gets in the way of forward progress. If you're starting from scratch, you might want to look at a broader range of tools. I recall recently reading this: https://workaround.org/article/tired-of-nagios-and-cacti-try-zabbix and thinking that I would've liked the room to try something new.

AARON WEINTRAUB · 12-17-2015

One thing to watch out for that is completely different from IOS is the concept of the PFM (Platform Fault Manager). Some events that go into the PFM are syslogged only once. The condition is then said to be 'set' and won't syslog again until it clears, at which point you'll get one 'clear' syslog.

This could be something like an optical port whose power is getting too low (or too high). That might not be a large concern, but there are also more serious things like punt fabric errors or other hardware "failures" which are traffic impacting. If you miss that first syslog for some reason (router had issues, etc.), you won't get any more notifications until the issue clears. To look at the currently raised faults, do 'show pfm loc all'.
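
To make the set/clear behavior concrete, here is a rough Python sketch of how a collector on the NOC side might track which faults are still raised, and how it could spot a 'set' it never received by comparing against a periodic poll of the device. Everything in it is an assumption for illustration: the (timestamp, fault key, state) tuples, the fault key strings, and the function names are hypothetical, and it does not parse real PFM syslog text or 'show pfm loc all' output.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical, already-parsed event: (timestamp, fault_key, state),
# where state is "set" or "clear". Real syslog parsing is not shown.
Event = Tuple[int, str, str]


def active_faults(events: Iterable[Event]) -> Dict[str, int]:
    """Return faults that were set and not yet cleared, with the time first set."""
    active: Dict[str, int] = {}
    for timestamp, fault_key, state in events:
        if state == "set":
            # A fault is only announced once, so remember when it was raised.
            active.setdefault(fault_key, timestamp)
        elif state == "clear":
            # The single 'clear' message ends the condition.
            active.pop(fault_key, None)
    return active


def missed_sets(tracked: Dict[str, int], polled: Set[str]) -> Set[str]:
    """Faults the device reports as raised that the collector never saw set.

    `polled` is assumed to be a set of fault keys gathered by periodically
    checking the device's currently raised faults.
    """
    return polled - set(tracked)


if __name__ == "__main__":
    # Made-up fault keys, for illustration only.
    events = [
        (1, "lc0:fault-a", "set"),
        (2, "lc1:fault-b", "set"),
        (3, "lc1:fault-b", "clear"),
    ]
    tracked = active_faults(events)
    print(tracked)
    print(missed_sets(tracked, {"lc0:fault-a", "lc2:fault-c"}))
```

The point of the second function is that syslog alone cannot recover a missed 'set', so some periodic check of the device's current fault state is needed as a backstop.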

smilstea · 12-18-2015

There is a handy command, 'show logging events buffer bistate', which will list these sets and clears even if the 'show logging' buffer has wrapped.

Sam