Errors and Timeouts on 2901 running IPSLA Operations

gmacminn000 · 08-30-2012

Hi all,

We recently installed a new 2901 router as our IP SLA router. After adding 430 operations to it (215 ICMP and 215 UDP jitter) to cover our statewide sites, it's reporting over half of them as timing out. Which operations time out changes over the day, so our monitoring system shows the operations as down most of the time and in an up or warning state the rest of the time.

Some of the remote routers are reporting "SLA_FORMAT_FAIL" errors but I cannot find any references to this error.

A ping from the router to the remote site router returns a ping time of 50ms or less, and the network links are not congested, so QoS shouldn't be getting in the way. Our QoS policies would mark and prioritise the UDP jitter test traffic, and the ICMP would be in the default class.

The 2901 is running 15.2(4)M1 and has 512MB RAM and 256MB flash. It's single homed into our core network switch.

I've heard stories of 2900 series routers hosting thousands of operations, so I don't think we're taxing the router. CPU is sitting around 5% and memory is around 20%.

The output below is for one set of operations.

Any thoughts as to why these are not working reliably?

Thanks,

Gary

*******************************************************************************************

End node we're targeting (2951 running 15.2(3)T):

DC204RT04#ping 172.16.37.192

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 172.16.37.192, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 4/4/4 ms

DC204RT04#

*********************************************************************

UDP-Jitter Operation:

DC204RT04#sh ip sla stat 1180

IPSLAs Latest Operation Statistics

IPSLA operation id: 1180

Type of operation: udp-jitter

Latest RTT: NoConnection/Busy/Timeout

Latest operation start time: 09:18:32 AEST Fri Aug 31 2012

Latest operation return code: Timeout

RTT Values:

Number Of RTT: 0 RTT Min/Avg/Max: 0/0/0 milliseconds

Latency one-way time:

Number of Latency one-way Samples: 0

Source to Destination Latency one way Min/Avg/Max: 0/0/0 milliseconds

Destination to Source Latency one way Min/Avg/Max: 0/0/0 milliseconds

Jitter Time:

Number of SD Jitter Samples: 0

Number of DS Jitter Samples: 0

Source to Destination Jitter Min/Avg/Max: 0/0/0 milliseconds

Destination to Source Jitter Min/Avg/Max: 0/0/0 milliseconds

Packet Loss Values:

Loss Source to Destination: 0

Source to Destination Loss Periods Number: 0

Source to Destination Loss Period Length Min/Max: 0/0

Source to Destination Inter Loss Period Length Min/Max: 0/0

Loss Destination to Source: 0

Destination to Source Loss Periods Number: 0

Destination to Source Loss Period Length Min/Max: 0/0

Destination to Source Inter Loss Period Length Min/Max: 0/0

Out Of Sequence: 0 Tail Drop: 0

Packet Late Arrival: 0 Packet Skipped: 0

Voice Score Values:

Calculated Planning Impairment Factor (ICPIF): 0

Mean Opinion Score (MOS): 0

Number of successes: 0

Number of failures: 13

Operation time to live: 35254 sec

**************************************************************************

DC204RT04#sh ip sla stat 1181

IPSLAs Latest Operation Statistics

IPSLA operation id: 1181

Latest RTT: NoConnection/Busy/Timeout

Latest operation start time: 09:20:27 AEST Fri Aug 31 2012

Latest operation return code: Timeout

Number of successes: 1

Number of failures: 14

Operation time to live: 35133 sec

******************************************************************************************************

Remote router results:

MD802RT01#sh ip sla responder

General IP SLA Responder on Control port 1967

General IP SLA Responder is: Enabled

Number of control message received: 22 Number of errors: 20

Recent sources:

10.196.128.6 [09:32:53.502 EST Fri Aug 31 2012]

10.196.128.6 [09:31:53.501 EST Fri Aug 31 2012]

Recent error sources:

10.196.128.6 [09:32:53.690 EST Fri Aug 31 2012] SLA_FORMAT_FAIL

10.196.128.6 [09:32:53.670 EST Fri Aug 31 2012] SLA_FORMAT_FAIL

10.196.128.6 [09:32:53.650 EST Fri Aug 31 2012] SLA_FORMAT_FAIL

10.196.128.6 [09:32:53.630 EST Fri Aug 31 2012] SLA_FORMAT_FAIL

10.196.128.6 [09:32:53.610 EST Fri Aug 31 2012] SLA_FORMAT_FAIL

Permanent Port IP SLA Responder

Permanent Port IP SLA Responder is: Disabled

David Vasquez · 09-03-2012

This post covers basically the same issue, and it has no answer either:

https://supportforums.cisco.com/thread/2157658

I am looking for the same answer, so I figured I'd put it out there for you in the event this one gets answered.

gmacminn000 · 09-04-2012

Hi David,

Thanks for the comment. I have an update from my side. I think I've fixed it!!

In short, the solution looks to be using group schedules instead of individual schedules. What seems to have been happening is that I was overloading the router every minute by firing off 400+ jobs all (more or less) at once. I've created group schedules now and have all the jobs firing off once every 60 seconds but spread out over 60 seconds as well. This seems to be working at present.
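
To illustrate the overload described above, here is a minimal sketch. It assumes the original setup used one "ip sla schedule" line per operation with an identical start time. The thread does not show those original lines, so the life, start time and "recurring" keyword below are guesses; only the operation IDs are taken from the group schedule output further down.

```
! Assumed "before" state: one schedule per operation, all starting together
ip sla schedule 100 life 36000 start-time 08:00:00 recurring
ip sla schedule 110 life 36000 start-time 08:00:00 recurring
ip sla schedule 120 life 36000 start-time 08:00:00 recurring
! ...and so on, one line for each remaining operation
```

Scheduled this way, every operation starts at the same instant and then repeats in lockstep at each frequency interval. A group schedule in even mode instead spreads the start times across the schedule period rather than bunching them at the start of each minute.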

There are still a couple of unresolved questions:

1. With group schedules, show ip sla group schedule only shows the first 50 probe operations (see below), but all of them (215-odd) seem to be active. I'm not sure if this is a display bug, an IOS limitation, or something else.

DC204RT04# sh ip sla group schedule

Group Entry Number: 10

Probes to be scheduled: 100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590

Total number of probes: 50

Schedule period: 60

Mode: even

Group operation frequency: 60

Status of entry (SNMP RowStatus): Active

Next Scheduled Start Time: Start Time already passed

Life (seconds): 36000

Entry Ageout (seconds): never

2. Similarly, the config also shows only the 50 operations, so I'm not sure what's going to happen when the router reloads.

ip sla group schedule 10 100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590 schedule-period 60 frequency 60 start-time 08:00:00 Sep 05 life 36000

All in all, some improvement in the situation but a little way to go yet.

Thanks All,

Gary

gmacminn000 · 09-11-2012

Hi All,

Further updates for those interested... Group scheduling worked out OK to spread out the load, and all the operations are now happy. The limit of 50 operations per group schedule was a pain, but I generated a few groups and threw 40 operations in each one, and all is good.
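
A sketch of what that split might look like is below. The group numbers and the operation IDs above 590 are illustrative only (it assumes the numbering simply continues in steps of 10); the remaining options are copied from the group schedule shown earlier in the thread. Each group holds 40 operations, which stays under the 50 that the show output listed.

```
! Group 10: operations 100-490 (40 operations)
ip sla group schedule 10 100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490 schedule-period 60 frequency 60 start-time 08:00:00 Sep 05 life 36000
! Group 20: operations 500-890 (40 operations; IDs above 590 are assumed)
ip sla group schedule 20 500,510,520,530,540,550,560,570,580,590,600,610,620,630,640,650,660,670,680,690,700,710,720,730,740,750,760,770,780,790,800,810,820,830,840,850,860,870,880,890 schedule-period 60 frequency 60 start-time 08:00:00 Sep 05 life 36000
```

Further groups would follow the same pattern until every operation is covered. Because each group spreads its own operations over the same 60-second period, operations from different groups can still start close together.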

Group schedules have an annoying feature in that they don't have a recurrence option like the normal schedules do. To get around this, I ended up writing an EEM cron job to reset the schedules every morning, ready for the day's activity. Bit of a kludge, but it looks like it'll work.
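
The applet itself isn't shown in the thread, so the following is only a rough sketch of how such an EEM cron job might look, not the configuration that was actually used. The applet name, the 08:00 cron entry, and the approach of removing the group schedule and re-adding it with "start-time now" are all assumptions. The group number, timers and operation IDs are borrowed from the group schedule shown earlier in the thread, trimmed to 40 operations.

```
! Hypothetical applet: re-create group schedule 10 every day at 08:00
event manager applet SLA-GROUP-RESET
 event timer cron name SLA-GROUP-RESET cron-entry "0 8 * * *"
 action 1.0 cli command "enable"
 action 2.0 cli command "configure terminal"
 ! Remove the expired group schedule, then add it again starting immediately
 action 3.0 cli command "no ip sla group schedule 10"
 action 4.0 cli command "ip sla group schedule 10 100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490 schedule-period 60 frequency 60 start-time now life 36000"
 action 5.0 cli command "end"
```

A similar pair of actions (remove, then re-add) would be needed for each additional group. With a life of 36000 seconds, a schedule restarted at 08:00 would run until about 18:00.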

Thanks all for your ideas,

Gary