08-30-2012 04:51 PM - edited 03-04-2019 05:26 PM
Hi all,
We recently put in a new 2901 router to be our IP SLA router. After adding 430 operations to it (215 ICMP and 215 UDP jitter) to cover our statewide sites, it's reporting over half of them as timing out. The set of timed-out operations changes over the day, so our monitoring system shows the operations as down most of the time and in an up or warning state the rest of the time.
Some of the remote routers are reporting "SLA_FORMAT_FAIL" errors but I cannot find any references to this error.
A ping from the router to the remote site router returns a ping time of 50ms or less and the network links are not congested, so QoS shouldn't be getting in the way. Our QoS policies would mark and prioritise the UDP jitter test traffic, and the ICMP would be in the default class.
The 2901 is running 15.2(4)M1 and has 512MB RAM and 256MB flash. It's single homed into our core network switch.
I've heard stories of 2900 series routers hosting thousands of operations, so I don't think we're taxing the router. CPU is sitting around 5% and memory is around 20%.
The output below is for one set of operations.
Any thoughts as to why these are not working reliably?
Thanks,
Gary
*******************************************************************************************
End node we're targeting (2951 running 15.2(3)T):
DC204RT04#ping 172.16.37.192
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.16.37.192, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/4/4 ms
DC204RT04#
*********************************************************************
UDP-Jitter Operation:
DC204RT04#sh ip sla stat 1180
IPSLAs Latest Operation Statistics
IPSLA operation id: 1180
Type of operation: udp-jitter
Latest RTT: NoConnection/Busy/Timeout
Latest operation start time: 09:18:32 AEST Fri Aug 31 2012
Latest operation return code: Timeout
RTT Values:
Number Of RTT: 0 RTT Min/Avg/Max: 0/0/0 milliseconds
Latency one-way time:
Number of Latency one-way Samples: 0
Source to Destination Latency one way Min/Avg/Max: 0/0/0 milliseconds
Destination to Source Latency one way Min/Avg/Max: 0/0/0 milliseconds
Jitter Time:
Number of SD Jitter Samples: 0
Number of DS Jitter Samples: 0
Source to Destination Jitter Min/Avg/Max: 0/0/0 milliseconds
Destination to Source Jitter Min/Avg/Max: 0/0/0 milliseconds
Packet Loss Values:
Loss Source to Destination: 0
Source to Destination Loss Periods Number: 0
Source to Destination Loss Period Length Min/Max: 0/0
Source to Destination Inter Loss Period Length Min/Max: 0/0
Loss Destination to Source: 0
Destination to Source Loss Periods Number: 0
Destination to Source Loss Period Length Min/Max: 0/0
Destination to Source Inter Loss Period Length Min/Max: 0/0
Out Of Sequence: 0 Tail Drop: 0
Packet Late Arrival: 0 Packet Skipped: 0
Voice Score Values:
Calculated Planning Impairment Factor (ICPIF): 0
Mean Opinion Score (MOS): 0
Number of successes: 0
Number of failures: 13
Operation time to live: 35254 sec
**************************************************************************
DC204RT04#sh ip sla stat 1181
IPSLAs Latest Operation Statistics
IPSLA operation id: 1181
Latest RTT: NoConnection/Busy/Timeout
Latest operation start time: 09:20:27 AEST Fri Aug 31 2012
Latest operation return code: Timeout
Number of successes: 1
Number of failures: 14
Operation time to live: 35133 sec
******************************************************************************************************
Remote router results:
MD802RT01#sh ip sla responder
General IP SLA Responder on Control port 1967
General IP SLA Responder is: Enabled
Number of control message received: 22 Number of errors: 20
Recent sources:
10.196.128.6 [09:32:53.502 EST Fri Aug 31 2012]
10.196.128.6 [09:31:53.501 EST Fri Aug 31 2012]
Recent error sources:
10.196.128.6 [09:32:53.690 EST Fri Aug 31 2012] SLA_FORMAT_FAIL
10.196.128.6 [09:32:53.670 EST Fri Aug 31 2012] SLA_FORMAT_FAIL
10.196.128.6 [09:32:53.650 EST Fri Aug 31 2012] SLA_FORMAT_FAIL
10.196.128.6 [09:32:53.630 EST Fri Aug 31 2012] SLA_FORMAT_FAIL
10.196.128.6 [09:32:53.610 EST Fri Aug 31 2012] SLA_FORMAT_FAIL
Permanent Port IP SLA Responder
Permanent Port IP SLA Responder is: Disabled
09-03-2012 09:56 AM
This post is for basically the same thing and no answer on this one either:
https://supportforums.cisco.com/thread/2157658
I am looking for the same answer, so I figured I'd put it out there for you in the event this one gets answered.
09-04-2012 03:31 PM
Hi David,
Thanks for the comment. I have an update from my side. I think I've fixed it!!
In short, the solution looks to be using group schedules instead of individual schedules. What seems to have been happening is that I was overloading the router every minute by firing off 400+ jobs all (more or less) at once. I've created group schedules now and have all the jobs firing off once every 60 seconds but spread out over 60 seconds as well. This seems to be working at present.
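To illustrate the idea of spreading the start times, here is a minimal Python sketch of the arithmetic, assuming the operations in a group are simply started at equal intervals across the schedule period. The function name is hypothetical and this is not necessarily how IOS computes its start times internally.

```python
from __future__ import annotations


def even_start_offsets(num_ops: int, schedule_period_s: float) -> list[float]:
    """Return start offsets (in seconds) that spread num_ops evenly over the period.

    Illustrative only: assumes a plain equal-interval spread.
    """
    if num_ops <= 0:
        return []
    interval = schedule_period_s / num_ops
    return [round(i * interval, 3) for i in range(num_ops)]


# 40 operations spread over a 60-second schedule period
offsets = even_start_offsets(40, 60)
print(offsets[:4])  # [0.0, 1.5, 3.0, 4.5]
```

With individual schedules that all share the same start time, every offset is effectively 0, which is the burst that seemed to be causing the timeouts.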
There are still a couple of unresolved questions:
1. With group schedules, show ip sla group schedule only shows the first 50 probe operations (see below), but all 215 or so seem to be active. I'm not sure whether this is a display bug, an IOS limitation, or something else.
DC204RT04# sh ip sla group schedule
Group Entry Number: 10
Probes to be scheduled: 100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590
Total number of probes: 50
Schedule period: 60
Mode: even
Group operation frequency: 60
Status of entry (SNMP RowStatus): Active
Next Scheduled Start Time: Start Time already passed
Life (seconds): 36000
Entry Ageout (seconds): never
2. Similar to the above, the config only shows the 50 operations as well, so I'm not sure what's going to happen when the router reloads.
ip sla group schedule 10 100,110,120,130,140,150,160,170,180,190,200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,370,380,390,400,410,420,430,440,450,460,470,480,490,500,510,520,530,540,550,560,570,580,590 schedule-period 60 frequency 60 start-time 08:00:00 Sep 05 life 36000
All in all, some improvement in the situation but a little way to go yet.
Thanks All,
Gary
09-11-2012 10:15 PM
Hi All,
Further updates for those interested... Group scheduling worked out OK to spread out the load and all the operations are now happy. The limit of 50 operations per group schedule was a pain, but I generated a few groups and threw 40 operations in each one and all is good.
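For anyone wanting to do the same split, here is a rough Python sketch that chunks a list of operation IDs into groups of 40 and prints one group schedule command per group. The command layout is copied from the config line earlier in the thread. The ID range, the group numbering, the function names, and the use of start-time now are all assumptions for illustration, so adjust them to your own setup.

```python
from __future__ import annotations


def chunk(ids: list[int], size: int) -> list[list[int]]:
    """Split ids into consecutive lists of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def group_schedule_commands(op_ids: list[int], group_size: int = 40,
                            first_group: int = 10, period: int = 60,
                            frequency: int = 60, life: int = 36000,
                            start_time: str = "now") -> list[str]:
    """Build one 'ip sla group schedule' line per chunk of operation IDs."""
    cmds = []
    for n, group in enumerate(chunk(op_ids, group_size)):
        id_list = ",".join(str(i) for i in group)
        cmds.append(
            f"ip sla group schedule {first_group + n} {id_list} "
            f"schedule-period {period} frequency {frequency} "
            f"start-time {start_time} life {life}"
        )
    return cmds


# Hypothetical IDs: 215 operations numbered 100, 110, 120, ...
op_ids = list(range(100, 100 + 215 * 10, 10))
for cmd in group_schedule_commands(op_ids):
    print(cmd)
```

With 215 operations and 40 per group this produces six commands, the last one holding the remaining 15 operations.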
Group schedules have an annoying feature in that they don't have a recurrence option like the normal schedules do. To get around this I ended up writing an EEM cron job to reset the schedules every morning, ready for the day's activity. Bit of a kludge but it looks like it'll work.
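The applet itself isn't posted here, so the following is only a sketch of what such a reset could look like, written as a small Python generator that prints EEM applet lines. It assumes the reset is done by removing each group schedule and re-entering it from a cron-timed applet. The applet name, cron entry, action labels, and the shortened example command are hypothetical, and the generated config should be checked against your IOS release before use.

```python
from __future__ import annotations


def eem_reset_applet(group_cmds: dict[int, str],
                     applet_name: str = "RESET-SLA-GROUPS",
                     cron_entry: str = "0 8 * * *") -> list[str]:
    """Build EEM applet lines that remove and re-add each group schedule.

    group_cmds maps a group ID to its full 'ip sla group schedule' command.
    Action labels are zero-padded because EEM orders them as strings.
    """
    lines = [
        f"event manager applet {applet_name}",
        f' event timer cron cron-entry "{cron_entry}"',
        ' action 010 cli command "enable"',
        ' action 020 cli command "configure terminal"',
    ]
    step = 30
    for group_id, cmd in sorted(group_cmds.items()):
        lines.append(
            f' action {step:03d} cli command "no ip sla group schedule {group_id}"'
        )
        step += 10
        lines.append(f' action {step:03d} cli command "{cmd}"')
        step += 10
    lines.append(f' action {step:03d} cli command "end"')
    return lines


# Hypothetical, shortened group definition for illustration only
example = {
    10: "ip sla group schedule 10 100,110,120 schedule-period 60 "
        "frequency 60 start-time now life 36000",
}
print("\n".join(eem_reset_applet(example)))
```

Using start-time now in the re-entered command means each group restarts when the cron timer fires, so the cron entry sets the daily start time.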
Thanks all for your ideas,
Gary