
ASR9010 BNG radiusd crash

j.restaino
Level 1

Hi,

We have a BNG ASR9010 running IOS XR version 5.2.2. By mistake we tried to establish 73,000 PPPoE sessions with only 65,535 IPs in our pool. The strange thing is that after an hour we saw these messages in the log:

RP/0/RSP0/CPU0:Dec 23 10:27:59.468 : syslog_dev[92]: syslog_dev: MALLOC_ERROR:radiusd:check_caller_guard - caller @0x82a9215 - fatal error, your application has corrupted the heap. 

RP/0/RSP0/CPU0:Dec 23 10:27:59.468 : syslog_dev[92]: syslog_dev: MALLOC_ERROR:(50): [pid:59879810, tid:1] Suspected memory address 0x1029872c - malloc check_caller_guard: tail data corruption.

After that, the radiusd, iedged, and enf_broker daemons crashed.

We manually restarted the daemons (show processes showed them running again), but the BNG did not send any RADIUS messages until we rebooted.

Despite our mistake of running the BNG without free IPs in the pool, we would like to know whether that could be the cause of the crash, and whether there is anything we can do to mitigate and handle this situation.

I attach the log.

We will be very grateful if anyone could help us.

Regards
                    José

Accepted Solution

xthuijs
Cisco Employee

Hi Jose,

regardless of the misconfig on the free addresses, the radius process (or any process) should handle this circumstance more gracefully.

I did some verification on the log you provided, and I think there is a bug in the radius code: it could handle this stress situation, or many calls arriving rapidly, by failing a bit more gracefully. While I don't think there is a substantial issue in the radius code for this problem, I think we can harden it.

By the way, the decode of the traces suggests a queue overrun. Do you have radius-server source-ports extended configured? If not, that would help alleviate the problem with larger queues.

We have some SMUs for that on 5.2.2 as well.

My recommendation is to install the 5.2.2 SMUs first and configure extended source ports if you haven't already; this will likely make the problem go away.
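For reference, the source-ports change can be applied like this on IOS XR (a minimal sketch using the command named above; hostname and prompt are placeholders, and you should verify the syntax against your release's command reference):

```
RP/0/RSP0/CPU0:BNG# configure
RP/0/RSP0/CPU0:BNG(config)# radius-server source-ports extended
RP/0/RSP0/CPU0:BNG(config)# commit
RP/0/RSP0/CPU0:BNG(config)# end
```

This expands the range of UDP source ports the router uses for RADIUS requests, which spreads the load across more queues.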

cheers!

xander


2 Replies


Thanks a lot, we will try your recommendations.

Regards
                         José