Occasional 100% CPU on 2811 doing T.37 fax

ffacilities · ‎10-03-2013

Hi,

We support many 2811 gateways at customer sites, all handling T.37 fax (almost entirely onramp) and nothing else. Occasionally we see bursts of the following in the logs, showing that incoming fax calls are being rejected due to high CPU:

Sep 12 12:07:14.980 CDT: %IVR-3-LOW_CPU_RESOURCE: IVR: System experiencing high cpu utilization (96/100).

Call (callID=1365621) is rejected.

Sep 12 12:07:20.864 CDT: %IVR-3-LOW_CPU_RESOURCE: IVR: System experiencing high cpu utilization (96/100).

Call (callID=1365622) is rejected.

Sep 12 12:07:50.564 CDT: %IVR-3-LOW_CPU_RESOURCE: IVR: System experiencing high cpu utilization (98/100).

Call (callID=1365625) is rejected.

In the above example, the logs continued for about 7 minutes. A 'show processes cpu history' was done about half an hour after the event and showed this in the 'last 60 minutes' graph:

11111111

1111 0000000099988777666554433332111

347444468456468864435765010640000000084064942720627596224963

100 *########

90 *##########*

80 *#############

70 ################*

60 ###################*

50 #####################**

40 ########################*

30 ###########################

20 * #############################*

10 * ** ** *##* ***####* ###############################

0....5....1....1....2....2....3....3....4....4....5....5....6

0 5 0 5 0 5 0 5 0 5 0

CPU% per minute (last 60 minutes)

* = maximum CPU% # = average CPU%

Most recent CPU is on the left, so this shows that the CPU was ramping up over a period of 20-25 minutes, remained at 100% for about 8 minutes, then abruptly dropped back to normal. Uptime was 30 weeks when this happened. Because it's so infrequent and we have no way to reproduce it, it's unlikely that we'll be able to grab diags while the CPU is at 100%.

The IOS version is 12.4(25f), running on a 2811 with the NM-HDV2-2T1/E1 board and the PVDM2-48 DSP card. The DSP firmware is overridden to 26.4.501 due to previous issues with out-of-spec tones and outbound fax not working. It has 512MB memory and 128MB flash. I see there's a 12.4(25g) IOS version available, but the caveats list doesn't seem to contain anything relevant.

Any ideas as to what might be causing this? A rogue fax? Is IOS 15 likely to help? Some of the fax gateways we support only have 256MB memory so won't be able to go to IOS 15 easily.

Suggestions appreciated -- thanks in advance.

ffacilities · ‎07-10-2015

An update on this. We've narrowed it down to long-lived T.37 onramp fax calls:

fgw#show call active fax brief
...
32FF : 1074599 01:31:49.633 CDT Fri Jul 10 2015.1 +10 pid:1 Answer  active
dur 03:08:22 tx:13/272 rx:0/0
Tele 1/0:23 (1074599) [1/0.15] tx:0/0/11302960ms 14400 noise:-1 acom:-1  i/0:0/0 dBm

32FF : 1074603 01:32:04.523 CDT Fri Jul 10 2015.1 +1000 pid:3 Originate FAX=1234567890@fax.company.com active
dur 03:08:07 tx:0/687 rx:0/0
MMOIP 10.154.19.231 AcceptedMime:0 DiscardedMime:0

3302 : 1074613 01:35:46.045 CDT Fri Jul 10 2015.1 +10 pid:1 Answer  active
dur 03:04:28 tx:13/272 rx:0/0
Tele 1/0:23 (1074613) [1/0.14] tx:0/0/11068700ms 14400 noise:-1 acom:-1  i/0:0/0 dBm

3302 : 1074615 01:36:00.965 CDT Fri Jul 10 2015.1 +1000 pid:3 Originate FAX=1234567890@fax.company.com active
dur 03:04:12 tx:0/687 rx:0/0
MMOIP 10.154.19.231 AcceptedMime:0 DiscardedMime:0

Note that these calls have been up for over 3 hours. This particular example is on a 2811 running IOS 15.1(4)M7.

We cured the high CPU by killing the two long-lived calls:

fgw#clear call voice causecode 31 id 32FF
fgw#clear call voice causecode 31 id 3302

We worked with Cisco TAC on this a while back, and found that when this happens the CPU is consumed by the DocMSP process getting into a loop. However, TAC refused to go any further because our onramp TCL script is customized to pass through the RDNIS when present, so that we can use that as the fax target number. Not really a relevant change, but rules are rules.

Anyway, we had DocMSP debugs enabled in this case (debug fax dmsp all, with console logging disabled). They confirmed looping for the two calls; here's one of the loops:

Jul 10 01:50:15.970 CDT: //1074616/BAB1C8C2A11D/DMSP/docmsp_xmit:
   srcCallId=1074614, dstCallId=1074616, direction=0
Jul 10 01:50:15.970 CDT: //1074614/BAB1C8C2A11D/DMSP/docmsp_process_rcv_data:
   evID=0, proto_flag=3, srcCallId=1074614, dstCallId=1074616, cause=0
Jul 10 01:50:15.970 CDT: //1074616/BAB1C8C2A11D/DMSP_ON/docmsp_tiff_writer_data_process:
   START_OF_CONNECTION
Jul 10 01:50:15.970 CDT: //1074616/BAB1C8C2A11D/DMSP_ON/docmsp_tiff_writer_data_process:
   START_OF_FAX_PAGE
Jul 10 01:50:16.422 CDT: //1074616/BAB1C8C2A11D/DMSP/docmsp_process_rcv_data:
   Done
Jul 10 01:50:19.118 CDT: //1074614/BAB1C8C2A11D/DMSP/docmsp_process_rcv_data:
   evID=0, proto_flag=3, srcCallId=1074614, dstCallId=1074616, cause=0
Jul 10 01:50:19.118 CDT: //1074616/BAB1C8C2A11D/DMSP_ON/docmsp_tiff_writer_data_process:
   START_OF_CONNECTION
Jul 10 01:50:19.122 CDT: //1074616/BAB1C8C2A11D/DMSP_ON/docmsp_tiff_writer_data_process:
   START_OF_FAX_PAGE
Jul 10 01:50:19.570 CDT: //1074616/BAB1C8C2A11D/DMSP/docmsp_process_rcv_data:
   Done
Jul 10 01:50:19.582 CDT: //1074616/BAB1C8C2A11D/DMSP/docmsp_xmit:
   srcCallId=1074614, dstCallId=1074616, direction=0
Jul 10 01:50:20.186 CDT: //1074616/BAB1C8C2A11D/DMSP/docmsp_xmit:
   srcCallId=1074614, dstCallId=1074616, direction=0
Jul 10 01:50:20.190 CDT: //1074614/BAB1C8C2A11D/DMSP/docmsp_process_rcv_data:
   evID=0, proto_flag=3, srcCallId=1074614, dstCallId=1074616, cause=0
Jul 10 01:50:20.190 CDT: //1074616/BAB1C8C2A11D/DMSP_ON/docmsp_tiff_writer_data_process:
   START_OF_CONNECTION
Jul 10 01:50:20.190 CDT: //1074616/BAB1C8C2A11D/DMSP_ON/docmsp_tiff_writer_data_process:
   START_OF_FAX_PAGE
Jul 10 01:50:20.642 CDT: //1074616/BAB1C8C2A11D/DMSP/docmsp_process_rcv_data:
   Done

So, interesting stuff, but it's still not clear exactly what causes this.