I have a customer experiencing an ongoing issue with their 2811 routers acting as fax gateways. See below for details. I would appreciate any feedback on the problem.
We have a wide range of customers using Cisco routers to provide T.37 on-ramp and off-ramp fax as part of our voicemail solution -- we refer to these routers as 'fax gateways'. All units use a PVDM2-48 or -64 DSP module on an NM-HDV2-2T1/E1 network module; we understand that this will soon need to change due to EOLs. Most fax gateways are in the US and receive their calls over one or two T1 PRIs; some are in other countries and use E1.
We require that the fax gateway routes incoming faxes based on the RDNIS (redirecting number) if present, otherwise using the DNIS (called number). The standard on-ramp TCL script only uses the DNIS, so we have a small customization to get the functionality that we need. I have attached our onramp script, which is exactly as used on all of our fax gateways. It's based on version 184.108.40.206 and the customizations are clearly indicated -- just search for 'DATA CONNECTION' which was our company name back when this solution was originally put together. I don't have any info on who wrote that customization.
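For readers without the attachment, the routing rule itself is small. The following is only a Python sketch of the selection logic described above, not the TCL customization from our script; the function name `pick_routing_number` and the treatment of blank values as "not present" are assumptions made for illustration.

```python
def pick_routing_number(rdnis, dnis):
    """Return the number to route on: RDNIS if present, otherwise DNIS.

    Assumption: a missing RDNIS may arrive as None, an empty string,
    or whitespace, and all three count as "not present".
    """
    rdnis = (rdnis or "").strip()
    if rdnis:
        return rdnis
    return (dnis or "").strip()


if __name__ == "__main__":
    # Redirected call: route on the redirecting number.
    print(pick_routing_number("5551234", "5559999"))
    # Direct call: no RDNIS, so fall back to the called number.
    print(pick_routing_number("", "5559999"))
```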
For the most part, these fax gateways just work (although we have found that they can be quite sensitive to audio glitches, leading to cut off pages, but that's not our concern right now).
For several years we've seen an intermittent issue, mainly on busier units that handle several hundred fax calls per day, where the CPU ramps up to 100% over a period of 20-25 minutes and then remains at 100%, during which time calls are rejected with logs like these:
Sep 12 12:07:14.980 CDT: %IVR-3-LOW_CPU_RESOURCE: IVR: System experiencing high cpu utilization (96/100).
Call (callID=1365621) is rejected.
Sep 12 12:07:20.864 CDT: %IVR-3-LOW_CPU_RESOURCE: IVR: System experiencing high cpu utilization (96/100).
Call (callID=1365622) is rejected.
Sep 12 12:07:50.564 CDT: %IVR-3-LOW_CPU_RESOURCE: IVR: System experiencing high cpu utilization (98/100).
Call (callID=1365625) is rejected.
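To see when an episode started and how many calls it cost, the rejections can be pulled out of a saved syslog file. This is a rough Python sketch that assumes the message layout is exactly as in the excerpt above (timestamp, then the %IVR-3-LOW_CPU_RESOURCE text, with the callID either on the same line or on the following line); the function name `rejected_calls` is hypothetical.

```python
import re

# Matches the timestamp and the CPU figure, e.g. "(96/100)".
LOW_CPU = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+[\d:.]+\s+\w+): "
    r"%IVR-3-LOW_CPU_RESOURCE: .*\((?P<cpu>\d+)/100\)"
)
CALL_ID = re.compile(r"callID=(\d+)")


def rejected_calls(lines):
    """Yield (timestamp, cpu_percent, call_id) for each rejected call."""
    pending = None
    for line in lines:
        m = LOW_CPU.search(line)
        if m:
            pending = (m.group("stamp"), int(m.group("cpu")))
        c = CALL_ID.search(line)
        if c and pending is not None:
            yield pending[0], pending[1], int(c.group(1))
            pending = None


if __name__ == "__main__":
    import sys

    # Usage: python rejected.py < gateway-syslog.txt
    for stamp, cpu, call_id in rejected_calls(sys.stdin):
        print(f"{stamp}  cpu={cpu}%  callID={call_id}")
```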
We mostly see it on 2811s, probably because there are more of those in heavy use than anything else, and the IOS version seems to make no difference -- it's been seen in 2811s running both IOS 12.4(25f) and IOS 15.1(4)M7 (among others). The problem generally resolves itself within an hour, although it looks like the units running IOS 15 don't resolve themselves as easily and often need intervention.
Investigation to date
I posted the problem on the Cisco support forums in Oct 2013 and have had no replies (excluding my own recent update).
The information at that point suggested that the problem was in image manipulation code (apparently a function called Fax3Decode2D), so we removed the 'image encoding MH' from some fax gateways and for a while it looked like it had worked. But eventually the problem surfaced again.
Working on our own, we then discovered that the high CPU condition could be cleared by bouncing the PRI(s) coming into the fax gateway. I then had the good fortune to be able to log into a unit while the high CPU was happening and found some active fax calls which had been up for over 3 hours, roughly as long as the CPU had been high:
fgw#show call active fax brief
32FF : 1074599 01:31:49.633 CDT Fri Jul 10 2015.1 +10 pid:1 Answer active
dur 03:08:22 tx:13/272 rx:0/0
Tele 1/0:23 (1074599) [1/0.15] tx:0/0/11302960ms 14400 noise:-1 acom:-1 i/0:0/0 dBm
32FF : 1074603 01:32:04.523 CDT Fri Jul 10 2015.1 +1000 pid:3 Originate FAXemail@example.com
dur 03:08:07 tx:0/687 rx:0/0
MMOIP 10.154.19.231 AcceptedMime:0 DiscardedMime:0
3302 : 1074613 01:35:46.045 CDT Fri Jul 10 2015.1 +10 pid:1 Answer active
dur 03:04:28 tx:13/272 rx:0/0
Tele 1/0:23 (1074613) [1/0.14] tx:0/0/11068700ms 14400 noise:-1 acom:-1 i/0:0/0 dBm
3302 : 1074615 01:36:00.965 CDT Fri Jul 10 2015.1 +1000 pid:3 Originate FAXfirstname.lastname@example.org
dur 03:04:12 tx:0/687 rx:0/0
MMOIP 10.154.19.231 AcceptedMime:0 DiscardedMime:0
I cleared the two calls ('clear call voice causecode 31 id <ID>') and the CPU immediately returned to normal (very low) levels.
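Spotting these by eye is easy on a quiet unit but not on a busy one, so a small filter over captured output may help. The Python sketch below assumes the 'show call active fax brief' layout is exactly as pasted above (a header line containing pid: and Answer/Originate, followed by a 'dur hh:mm:ss' line); the function name `long_running_legs` and the one-hour default threshold are arbitrary choices for illustration. It only reports the legs and does not decide which field the clear command expects.

```python
import re

# Header line, e.g. "32FF : 1074599 01:31:49.633 CDT ... pid:1 Answer active"
HEADER = re.compile(
    r"^\s*([0-9A-Fa-f]+)\s*:\s*(\d+)\s+.*\bpid:(\d+)\s+(Answer|Originate)\b"
)
# Duration line, e.g. "dur 03:08:22 tx:13/272 rx:0/0"
DURATION = re.compile(r"^\s*dur\s+(\d+):(\d{2}):(\d{2})\b")


def long_running_legs(output, min_seconds=3600):
    """Return call legs whose duration is at least min_seconds."""
    legs = []
    current = None
    for line in output.splitlines():
        h = HEADER.match(line)
        if h:
            current = {
                "id": h.group(1),          # first field, as printed
                "leg": int(h.group(2)),    # second field, as printed
                "peer": int(h.group(3)),   # dial-peer tag from pid:
                "direction": h.group(4),
            }
            continue
        d = DURATION.match(line)
        if d and current is not None:
            hours, minutes, seconds = (int(g) for g in d.groups())
            current["seconds"] = hours * 3600 + minutes * 60 + seconds
            if current["seconds"] >= min_seconds:
                legs.append(current)
            current = None
    return legs


if __name__ == "__main__":
    import sys

    # Usage: python stuck_calls.py < show-call-active-fax-brief.txt
    for leg in long_running_legs(sys.stdin.read()):
        print(leg["id"], leg["leg"], leg["direction"], leg["seconds"], "s")
```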
So it looks like some kind of rogue fax call causes DocMSP to loop, and once the CPU reaches 100% it stays there until the rogue call(s) is/are cleared. I would have thought that most sending fax machines would clear a call as long as this by themselves, but it appears not.
We don't have any practical way to reproduce the problem on a fax gateway without the TCL customization, and we've never seen it happen in a lab.
Since this involves a customized script, TAC would not be able to assist (as you mentioned). The only thing that comes to mind is to enable debugs on one of these gateways and capture them until the high CPU issue recurs. Then identify the stuck calls using "show call active fax brief" and compare the signaling in the debugs for the normal fax calls against the stuck ones. Any difference you find there could help.