ASR1002-X crashes upon application of HQoS to sub-interface

Daryl Benson
Level 1

Hello All,

 

I work for a medium-sized ISP, and we have an ASR1002-X as a broadband aggregation device for PPPoE accounts.  At first I thought I was crashing it because of the QoS profiles I was applying to the PPPoE profiles, but I have isolated it to something much more basic than that.  When the following configuration is applied to the router, the router crashes.  It takes a couple of minutes for the router to crash, but it very much just blows up and hard resets.  Curious if anyone has any thoughts on it.

 

Code Router Is Running:

show ver
Cisco IOS Software, IOS-XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.2(4)S4a, RELEASE SOFTWARE (fc1)

 

! Class Maps:

 

class-map match-all CM_QoS_Match_CS3
 match dscp cs3
class-map match-all CM_QoS_Match_EF
 match dscp ef

 

! Child Policy Map:

 

policy-map PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Child
 class CM_QoS_Match_EF
  priority level 1
  police 7165k
 class CM_QoS_Match_CS3
  bandwidth 1024
 class class-default
  random-detect

 

! Parent Policy-Map:

 

policy-map PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Parent
 description Hierarchical QoS Policy To Apply 100 Mbps Policing Downstream
 class class-default
  shape average 117760k
  service-policy PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Child

policy-map PM_HQoS_PPPoE_100Mbps_DarylTest_Ingress_Parent
 description Hierarchical QoS Policy To Apply 100 Mbps Policing Upstream
 class class-default
  police 117760k
   conform-action transmit
   exceed-action drop
   violate-action drop

 

! Application to Interface:

 

int Te0/3/0.88010001
 service-policy input PM_HQoS_PPPoE_100Mbps_DarylTest_Ingress_Parent
 service-policy output PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Parent
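
For anyone checking the numbers, the rates in the policies above can be worked through with a few lines of Python. This is only a sketch of the arithmetic: it assumes the "k" suffix means 1000 bit/s and that "bandwidth 1024" is in kbit/s, and the function and variable names are made up for illustration.

```python
# Rates copied from the policy-maps in this post, in kbit/s.
# Assumption: "k" means 1000 bit/s and "bandwidth 1024" is in kbit/s.
PARENT_SHAPE_KBPS = 117760    # shape average 117760k (egress parent)
EF_POLICE_KBPS = 7165         # police 7165k under "priority level 1"
CS3_BANDWIDTH_KBPS = 1024     # bandwidth 1024


def summarize(parent_kbps, ef_kbps, cs3_kbps):
    """Compare what the child policy reserves with the parent shaper rate."""
    reserved = ef_kbps + cs3_kbps
    return {
        "parent_mbps": parent_kbps / 1000,
        "parent_in_1024_units": parent_kbps / 1024,
        "reserved_kbps": reserved,
        "left_for_class_default_kbps": parent_kbps - reserved,
        "child_fits_in_parent": reserved <= parent_kbps,
    }


summary = summarize(PARENT_SHAPE_KBPS, EF_POLICE_KBPS, CS3_BANDWIDTH_KBPS)
for name, value in summary.items():
    print(f"{name}: {value}")
```

117760 is exactly 115 x 1024, so the parent rate reads like 100 Mbps plus 15% headroom in 1024-based units, though that interpretation is a guess. Either way the child classes reserve well under the parent rate, so on the arithmetic alone nothing in the hierarchy is oversubscribed.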

 

Comments:

 

I thought this was fairly basic code. I was planning on applying it to some PPPoE customer accounts so that we could do some very basic underlying QoS control for voice throughout our core.  But the HQoS on a sub-interface crashes the router.  This is a pretty big deal, as we have multiple sets of thousands of customers aggregating on this router, and they all went belly-up in a glorious display of complete network failure.  The one time I did catch the crash I was consoled in, and I got the following console notifications:

 

! Console Logs:


000376: Nov 1 18:35:38.544 MDT: priority command is not supported in input direction for this interface
000377: Nov 1 18:35:38.544 MDT: Configuration failed!
000378: Nov 1 18:35:38.545 MDT: %QOS-6-POLICY_INST_FAILED:
Service policy installation failed
000379: Nov 1 18:35:39.131 MDT: %SSH-3-RSA_SIGN_FAIL: Signature connection failed, status 3
000380: Nov 1 18:35:39.131 MDT: %SSH-3-RSA_SIGN_FAIL: Signature creation failed, status -1
000384: Nov 1 18:35:40.090 MDT: %SSH-3-RSA_SIGN_FAIL: Signature connection failed, status 3
000385: Nov 1 18:35:40.090 MDT: %SSH-3-RSA_SIGN_FAIL: Signature creation failed, status -1
000390: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: irq 23: nobody cared (try booting with the "irqpoll" option)
000391: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: Pid: 15907, comm: fuser Tainted: P 2.6.32.39 #4
000392: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: Call Trace:
000393: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: <IRQ> [<ffffffff8107f5e0>] ? __report_bad_irq+0x30/0x7d
000394: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8107f733>] ? note_interrupt+0x106/0x16f
000395: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8107fdd8>] ? handle_fasteoi_irq+0xb7/0xdf
000396: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100cbfc>] ? call_softirq+0x1c/0x2a
000397: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100e2c4>] ? handle_irq+0x7e/0x86
000398: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100d8dc>] ? do_IRQ+0x54/0xb2
000399: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: [<ffffffff8100c413>] ? ret_from_intr+0x0/0xa
000400: Nov 1 18:35:47.435 MDT: %IOSXE-4-PLATFORM: R0/0: kernel: <EOI>
000401: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: handlers:
000402: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: [<ffffffff8125291a>] (usb_hcd_irq+0x0/0x5d)
000403: Nov 1 18:35:47.435 MDT: %IOSXE-0-PLATFORM: R0/0: kernel: Disabling IRQ #23
000404: Nov 1 18:35:48.399 MDT: %SEC-6-IPACCESSLOGRL: access-list logging rate-limited or missed 2 packets
??????: Nov 1 18:35:53.037 R0/0: %PMAN-5-EXITACTION: Process manager is exiting: reload fru action requested
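
To make the ordering in a capture like this easier to see, a short Python sketch can pull out the first time each %FACILITY-SEVERITY-MNEMONIC message appears. The sample lines are copied from the log above; the function name and the two regular expressions are assumptions written for this output, not a general IOS log parser.

```python
import re

# A few lines copied from the console log in this post.
SAMPLE = """\
000378: Nov 1 18:35:38.545 MDT: %QOS-6-POLICY_INST_FAILED:
000379: Nov 1 18:35:39.131 MDT: %SSH-3-RSA_SIGN_FAIL: Signature connection failed, status 3
000390: Nov 1 18:35:47.435 MDT: %IOSXE-3-PLATFORM: R0/0: kernel: irq 23: nobody cared (try booting with the "irqpoll" option)
000403: Nov 1 18:35:47.435 MDT: %IOSXE-0-PLATFORM: R0/0: kernel: Disabling IRQ #23
??????: Nov 1 18:35:53.037 R0/0: %PMAN-5-EXITACTION: Process manager is exiting: reload fru action requested
"""

# Assumed patterns: hh:mm:ss.mmm timestamp and %FACILITY-SEVERITY-MNEMONIC:
TIME_RE = re.compile(r"\b(\d\d:\d\d:\d\d\.\d+)\b")
MSG_RE = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")


def first_seen(log_text):
    """Map (facility, severity, mnemonic) to its first timestamp, in log order."""
    seen = {}
    for line in log_text.splitlines():
        stamp = TIME_RE.search(line)
        msg = MSG_RE.search(line)
        if stamp and msg:
            key = (msg.group(1), int(msg.group(2)), msg.group(3))
            seen.setdefault(key, stamp.group(1))
    return seen


events = first_seen(SAMPLE)
for (facility, severity, mnemonic), stamp in events.items():
    print(f"{stamp}  %{facility}-{severity}-{mnemonic}")

# Lower severity numbers are more urgent (0 is emergency).
worst = min(events, key=lambda key: key[1])
print("most severe:", worst)
```

On these lines it puts the QoS install failure at 18:35:38, the kernel IRQ messages at 18:35:47 and the process manager reload at 18:35:53. That is only the ordering already visible in the log; it does not show which event caused which.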

 

Thank You:

Thanks Everyone.  This made for a couple of pretty bad days.  Very curious to know if I'm doing something blatantly wrong. 

8 Replies

Mark Malone
VIP Alumni
Sounds like a bug. When it reboots, what does show version report on the "System returned to ROM by" line? That may indicate the issue.
Have you checked the release notes for the version you're running to see if there are any open QoS caveats? I see quite a few; it might be worth running through them at the link below.

https://www.cisco.com/c/en/us/td/docs/ios/15_2s/release/notes/15_2s_rel_notes/15_2s_caveats_15_2_4s.html

show version | in retu
System returned to ROM by reload

 

I was hoping for a crash report or something perhaps stored in flash, but no such luck there either.

I did peruse the release notes a bit, and just spent a lot of time in them now, but found nothing specific to what I saw happen, although there's a lot of QoS stuff in those release notes.  It does sound like a software bug from my perspective as well. I thought the code I was trying to apply was rather straightforward, not a whole lot of magic there. There is a bit of complexity with inbound and outbound QoS and the fact that one set is HQoS, but as far as QoS goes it's still pretty basic stuff.

I would start by upgrading the image. What's the full image name? It looks like it's in a deferred state on the Cisco website, not to be used. You could also open a TAC case if you have support.

show version

Cisco IOS Software, IOS-XE Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.2(4)S4a, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport

 

hoc-telco-agg-gw-01#show flash | ex log
-#- --length-- ---------date/time--------- path
550 387831496 Jan 16 2014 02:28:51 +00:00 /bootflash/asr1002x-universalk9.03.07.04a.S.152-4.S4a.SPA.bin
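
As an aside, the image filename already carries both version numbers. The Python sketch below decodes it; the naming pattern is inferred only from the two 3.x filenames that appear in this thread, so the regular expression and the function name are assumptions and will not match other trains such as 16.x.

```python
import re

# Pattern assumed from filenames in this thread, for example
# asr1002x-universalk9.03.07.04a.S.152-4.S4a.SPA.bin
IMAGE_RE = re.compile(
    r"^(?P<platform>[a-z0-9]+)-(?P<feature>[a-z0-9]+)"
    r"\.(?P<major>\d+)\.(?P<minor>\d+)\.(?P<rel>\d+[a-z]?)\.S"
    r"\.(?P<ios_major>\d\d)(?P<ios_minor>\d)-(?P<maint>\d+)"
    r"\.(?P<rebuild>S\d+[a-z]?)(?:-ext)?\.SPA\.bin$"
)


def decode_image_name(filename):
    """Return (IOS XE release, IOS version) or None if the name does not match."""
    match = IMAGE_RE.match(filename)
    if match is None:
        return None
    g = match.groupdict()
    rel = g["rel"].lstrip("0") or "0"
    xe_release = f"{int(g['major'])}.{int(g['minor'])}.{rel}S"
    ios_version = f"{g['ios_major']}.{g['ios_minor']}({g['maint']}){g['rebuild']}"
    return xe_release, ios_version


current = decode_image_name("asr1002x-universalk9.03.07.04a.S.152-4.S4a.SPA.bin")
print(current)
```

For the file in bootflash this gives IOS XE 3.7.4aS and IOS 15.2(4)S4a, which agrees with the show version output above.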

 

So now I have to ask: how do I tell if a version of code is in a 'deferred state'?  With a decade of Cisco experience, that sounds like something I should know how to find.

 

On this particular box I don't have SmartNet. Going without it is not something I would ever recommend to any company, but we are a small operator comparatively, and it is fairly cost prohibitive for us to put SmartNet on these boxes.  What we opted to do instead is to purchase a redundant ASR1002-X, which I'll be putting in production in another month.

 

But it does sound like a code error.  So I'll be upgrading to new code on the redundant box and trying the same configuration to see what happens. Then sometime in the middle of the night I'll move a couple thousand PPPoE customers onto that new code and see what happens there, and then we'll go from there.

OK, confirmed it's not deferred, but it is 4 years old, so I would upgrade. You could also run the show tech through the Cisco CLI Analyzer. It's a tool that TAC uses; it can pick up bugs and hardware issues on devices and report on the show tech. It's free on the Cisco website.

This is what they're currently saying to use, but always check the release notes first when downloading, just to be sure:
asr1002x-universalk9.03.16.06.S.155-3.S6-ext.SPA.bin

https://software.cisco.com/download/release.html?mdfid=284146581&softwareid=282046477&release=3.7.4aS&relind=AVAILABLE&rellifecycle=ED&reltype=latest

The redundant ASR1002-X that arrived this week I actually just got upgraded to Denali 16.3.5 this morning.  I had to upgrade the ROMMON though; after some Googling, there is something about the older ROMMON not working with images above 512 MB.

https://www.cisco.com/c/en/us/td/docs/routers/asr1000/rommon/asr1000-rommon-upg-guide.html

Router# upgrade rom-monitor filename bootflash:asr1000-rommon.152-1r.S.pkg all

 

I'm going to migrate some PPPoE customers to this new box once it is in production, and then I'll upgrade the ASR1002-X that crashed in this thread.  I want to make sure I can properly upgrade it in a maintenance window and that I have this other box ready to take on clients if something goes sideways.  I'm being a little more cautious now that I broke every customer.  I never thought in my wildest dreams I'd crash the box with that simple QoS code.  So I'm going to get the new one up and running with customer migration in play before I go back and try to retrofit the current box.  I'm also nervous about how it'll handle working with the Radius server for PPPoE after the code upgrade.  I don't want to be searching at 3 a.m. for a newly added command while none of the PPPoE accounts will activate.

 

Thanks for all the input, Mark. I appreciate your time, to be sure.

Hi

I use ExamDiff for that when doing upgrades or downgrades. We have hundreds of lines of code on some of our routers, as we have IWAN and other CLI features that require a lot of lines, so I'm always worried when changing code. I have been caught out before where the software removed one or two lines on me somewhere and caused issues. This is a good thing to start with to see if anything has changed: take a copy of the old sh run before and another after, and it will compare them and highlight anything missed or changed.

 

http://www.softpedia.com/get/PORTABLE-SOFTWARE/Programming/Portable-ExamDiff.shtml

Historically I've used WinDiff and WinMerge, both pretty solid programs.
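
If no diff tool is handy, the same before-and-after check on two saved sh run captures can be sketched with Python's standard difflib module. The policy-map lines are taken from this thread; the "after" capture with a missing police line is hypothetical, made up only to show what a dropped line looks like, and the function name is an assumption.

```python
import difflib


def changed_lines(before, after):
    """Return (removed, added) config lines between two 'show run' captures."""
    removed, added = [], []
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


BEFORE = """\
policy-map PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Child
 class CM_QoS_Match_EF
  priority level 1
  police 7165k
"""

# Hypothetical capture after an upgrade: the police line has gone missing.
AFTER = """\
policy-map PM_HQoS_PPPoE_100Mbps_DarylTest_Egress_Child
 class CM_QoS_Match_EF
  priority level 1
"""

removed, added = changed_lines(BEFORE, AFTER)
print("removed:", removed)
print("added:", added)
```

Anything listed as removed is a line the new code dropped or rewrote, which is the case worth checking before customers are moved over.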
