I work for a medium-sized ISP, and we have an ASR1002-X as a broadband aggregation device for PPPoE accounts. I originally thought I was crashing it with the QoS profiles I was applying to the PPPoE profiles, but I have isolated it to something much more basic than that. When the following configuration is applied to the router, the router crashes. It takes a couple of minutes, but then it just blows up and hard resets. Curious if anyone has any thoughts on it.
Code Router Is Running:
Cisco IOS Software, IOS-XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.2(4)S4a, RELEASE SOFTWARE (fc1)
! Class Maps:
class-map match-all CM_QoS_Match_CS3
match dscp cs3
class-map match-all CM_QoS_Match_EF
match dscp ef
! Child Policy Map:
priority level 1
! Parent Policy-Map:
description Hierarchical QoS Policy To Apply 100 Mbps Policing Downstream
shape average 117760k
description Hierarchical QoS Policy To Apply 100 Mbps Policing Upstream
! Application to Interface:
service-policy input PM_HQoS_PPPoE_100Mbps_DarylTest_Ingress_Parent
service-policy output PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Parent
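As a side check on the numbers above, the parent policy is described as 100 Mbps while the shaper is set to 117760k. The short Python sketch below works out the difference. It assumes the trailing k means kilobits per second as a decimal multiplier; the helper name `parse_rate_bps` is made up for this illustration, and the post does not say why 117760k was chosen.

```python
def parse_rate_bps(rate: str) -> int:
    """Convert a rate string such as '117760k' to bits per second.

    Assumption: k/m/g are decimal multipliers (1e3, 1e6, 1e9).
    """
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    rate = rate.strip().lower()
    if rate and rate[-1] in multipliers:
        return int(rate[:-1]) * multipliers[rate[-1]]
    return int(rate)


shaped_bps = parse_rate_bps("117760k")   # value from the parent policy-map
nominal_bps = 100_000_000                # the 100 Mbps named in the description
headroom_pct = (shaped_bps - nominal_bps) / nominal_bps * 100

print(f"{shaped_bps / 1e6:.2f} Mbps shaped, {headroom_pct:.2f}% above nominal")
```

Under that assumption the shaper sits about 17.76% above the nominal 100 Mbps. Note also that 117760 is 115 x 1024, so the figure may have been derived from a 115 Mbps target.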
I thought this was a fairly basic configuration. I was planning to apply it to some PPPoE customer accounts so that we could do some very basic underlying QoS control for voice throughout our core. But the HQoS on a sub-interface crashes the router. This is a pretty big deal, as we have multiple sets of thousands of customers aggregating on this router, and they all went belly-up in a glorious display of complete network failure. The one time I did catch the crash I was on the console, and I got the following console notifications:
! Console Logs:
000376: Nov 1 18:35:38.544 MDT: priority command is not supported in input direction for this interface
000377: Nov 1 18:35:38.544 MDT: Configuration failed!
000378: Nov 1 18:35:38.545 MDT: %QOS-6-POLICY_INST_FAILED:
Service policy installation failed
000379: Nov 1 18:35:39.131 MDT: %SSH-3-RSA_SIGN_FAIL: Signature connection failed, status 3
000380: Nov 1 18:35:39.131 MDT: %SSH-3-RSA_SIGN_FAIL: Signature creation failed, status -1
000384: Nov 1 18:35:40.090 MDT: %SSH-3-RSA_SIGN_FAIL: Signature connection failed, status 3
000385: Nov 1 18:35:40.090 MDT: %SSH-3-RSA_SIGN_FAIL: Signature creation failed, status -1
000390: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: irq 23: nobody cared (try booting with the "irqpoll" option)
000391: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: Pid: 15907, comm: fuser Tainted: P 18.104.22.168 #4
000392: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: Call Trace:
000393: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: <IRQ> [<ffffffff8107f5e0>] ? __report_bad_irq+0x30/0x7d
000394: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8107f733>] ? note_interrupt+0x106/0x16f
000395: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8107fdd8>] ? handle_fasteoi_irq+0xb7/0xdf
000396: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100cbfc>] ? call_softirq+0x1c/0x2a
000397: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100e2c4>] ? handle_irq+0x7e/0x86
000398: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100d8dc>] ? do_IRQ+0x54/0xb2
000399: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100c413>] ? ret_from_intr+0x0/0xa
000400: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: <EOI>
000401: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: handlers:
000402: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: [<ffffffff8125291a>] (usb_hcd_irq+0x0/0x5d)
000403: Nov 1 18:35:47.435 MDT: %IOSXE-0-PLATFORM: R0/0: kernel: Disabling IRQ #23
000404: Nov 1 18:35:48.399 MDT: %SEC-6-IPACCESSLOGRL: access-list logging rate-limited or missed 2 packets
??????: Nov 1 18:35:53.037 R0/0: %PMAN-5-EXITACTION: Process manager is exiting: reload fru action requested
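When sifting through a capture like the one above, it can help to pull out only the most severe messages. IOS syslog tags have the form %FACILITY-SEVERITY-MNEMONIC, where severity 0 is the most severe and 7 the least. The following is a minimal Python sketch of that filter; the function name `most_severe` and the cutoff of 3 are arbitrary choices for this example, and the sample lines are copied from the log above.

```python
import re

# Matches the %FACILITY-SEVERITY-MNEMONIC tag used by IOS syslog messages.
TAG = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+):")


def most_severe(lines, max_severity=3):
    """Return (severity, facility, mnemonic, line) for messages at or below max_severity."""
    hits = []
    for line in lines:
        m = TAG.search(line)
        if m and int(m.group(2)) <= max_severity:
            hits.append((int(m.group(2)), m.group(1), m.group(3), line))
    # Lowest number first, i.e. most severe first; ties keep log order.
    return sorted(hits, key=lambda h: h[0])


sample = [
    "000378: Nov 1 18:35:38.545 MDT: %QOS-6-POLICY_INST_FAILED:",
    "000379: Nov 1 18:35:39.131 MDT: %SSH-3-RSA_SIGN_FAIL: Signature connection failed, status 3",
    "000390: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: irq 23: nobody cared",
    "000403: Nov 1 18:35:47.435 MDT: %IOSXE-0-PLATFORM: R0/0: kernel: Disabling IRQ #23",
    "??????: Nov 1 18:35:53.037 R0/0: %PMAN-5-EXITACTION: Process manager is exiting: reload fru action requested",
]

for severity, facility, mnemonic, _ in most_severe(sample):
    print(severity, facility, mnemonic)
```

Run against the sample, this surfaces the severity-0 kernel message about IRQ #23 first, followed by the severity-3 SSH and platform messages. The QoS installation failure is only severity 6, so it is filtered out at this cutoff.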
Thanks Everyone. This made for a couple of pretty bad days. Very curious to know if I'm doing something blatantly wrong.
show version | in retu
System returned to ROM by reload
I was hoping for a crash report or something similar stored in flash, but no such luck there either.
I did peruse the release notes a bit, and have just spent a lot more time in them, but found nothing specific to what I saw happen, although there is a lot of QoS material in those release notes. From my perspective it does sound like a software bug as well. I thought the configuration I was trying to apply was rather straightforward: not a whole lot of magic there, a bit of complexity with inbound and outbound QoS and the fact that one set is HQoS, but as far as QoS goes it's pretty basic stuff.
Cisco IOS Software, IOS-XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.2(4)S4a, RELEASE SOFTWARE (fc1) Technical Support: http://www.cisco.com/techsupport
hoc-telco-agg-gw-01#show flash | ex log
-#- --length-- ---------date/time--------- path
550 387831496 Jan 16 2014 02:28:51 +00:00 /bootflash/asr1002x-universalk9.03.07.04a.S.152-4.S4a.SPA.bin
So now I have to ask: how do I tell if a version of code is in a 'deferred' state? With a decade of Cisco experience, that sounds like something I should know how to find.
On this particular box I don't have SmartNet. Going without it is not something I would ever recommend to any company, but we are a small operator comparatively, and it is fairly cost-prohibitive for us to put SmartNet on these boxes. What we opted to do instead is purchase a redundant ASR1002-X, which I'll be putting in production in another month.
But it does sound like a code error. So I'll be upgrading to new code on the redundant box, trying the same configuration, and seeing what happens. Then sometime in the middle of the night I'll move a couple thousand PPPoE customers onto that new code and see what happens there... and then we'll go from there.
The redundant ASR1002-X arrived this week, and I just got it upgraded to Denali 16.3.5 this morning. I had to upgrade the ROMMON as well, though; after some Googling, there appears to be an issue with the older ROMMON not working with images larger than 512 MB.
Router# upgrade rom-monitor filename bootflash:asr1000-rommon.152-1r.S.pkg all
I'm going to migrate some PPPoE customers to this new box once it is in production, and then I'll upgrade the ASR1002-X that crashed in this thread. I want to make sure I can properly upgrade it in a maintenance window and that I have this other box ready to take on clients if something goes sideways. I'm being a little more cautious now that I have broken every customer. I never thought in my wildest dreams that I'd crash the box with that simple QoS configuration. So I'm going to get the new one up and running, with customer migration in play, before I go back and try to retrofit the current box. I'm also nervous about how it'll handle working with the RADIUS server for PPPoE after the code upgrade. I don't want to be searching for a newly added command at 3 a.m. when none of the PPPoE accounts will activate.
Thanks for all the input, Mark. I appreciate your time, to be sure.
I use ExamDiff for that when doing upgrades or downgrades. We have hundreds of lines of configuration on some of our routers, since we run IWAN and other CLI features that require a lot of lines, so I'm always worried when changing code. I have been caught out before, where the software removed one or two lines somewhere and caused issues. This is a good thing to start with to see if anything has changed: take a copy of the old show run before the upgrade and another one after, and it will compare them and highlight anything missing or changed.
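If a dedicated diff tool is not at hand, the same before/after comparison can be sketched with Python's standard difflib module. This is only an illustration of the idea, not a replacement for the tool mentioned above; the function name `config_diff` and the three-line sample configs are hypothetical stand-ins for real show run captures.

```python
import difflib


def config_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines between two running-config captures."""
    return list(difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="before-upgrade",
        tofile="after-upgrade",
        lineterm="",
        n=0,  # no context lines, only the changes themselves
    ))


# Hypothetical captures for illustration only; in practice these would be
# the saved "show run" output from before and after the upgrade.
before = "hostname example-rtr\naaa new-model\nip cef\n"
after = "hostname example-rtr\nip cef\n"

for line in config_diff(before, after):
    print(line)
```

Lines prefixed with a minus sign exist only in the pre-upgrade capture, which is exactly the "software removed a line on me" case described above. Lines prefixed with a plus sign were added by the new code.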