I am running into a rather interesting issue and am curious whether anyone has seen it before or has any insight into the cause. On one of my ACE 4710s (running software A5(1.2)), I am running a fairly large number of Layer 7 probes (71) across both ports 80 and 443. At seemingly random points in the day, the system reports that probes are being skipped due to an internal error. I have seen this before when the system runs out of sockets for the probes, but I am not seeing any indication that this is the case here.
Here is an example probe config:
probe https CHECK-SOME-SITE
passdetect interval 30
ssl version all
request method get url /some/url
header Host header-value "www.somesite.com"
expect regex "SOMEREGEX"
Here is the relevant output from 'show probe detail':
real : some-rserver
x.x.x.x 443 PROBE 3093610 1749563 1344047 SUCCESS
Socket state : CLOSED
No. Passed states : 49 No. Failed states : 49
No. Probes skipped : 479 Last status code : 200
No. Out of Sockets : 0 No. Internal error: 0
Last disconnect err : -
Last probe time : Tue Mar 4 16:45:03 2014
Last fail time : Fri Feb 28 13:30:37 2014
Last active time : Mon Mar 3 22:08:53 2014
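To see whether the skipped count climbs only during these events, the counters could be pulled out of the output and compared between captures. A minimal sketch in Python, assuming the field layout shown above (other software releases may format this differently); the function name parse_probe_counters is made up for illustration:

```python
import re

# Matches "label : value" pairs for the integer counters in the output.
# Assumes the layout pasted above; this is not an official parser.
COUNTER_RE = re.compile(r"(No\. [A-Za-z ]+?|Last status code)\s*:\s*(\d+)")

def parse_probe_counters(text):
    """Return a dict of counter name -> int from 'show probe detail' text."""
    return {name.strip(): int(value) for name, value in COUNTER_RE.findall(text)}

sample = """\
No. Passed states : 49 No. Failed states : 49
No. Probes skipped : 479 Last status code : 200
No. Out of Sockets : 0 No. Internal error: 0
"""

counters = parse_probe_counters(sample)
print(counters["No. Probes skipped"], counters["No. Internal error"])
```

Running this against two captures taken a few minutes apart would show which counters actually move during an event.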
Here are the log messages that are popping up:
Mar 4 2014 14:36:41 : %ACE-3-251014: Could not probe server x.x.x.x on port 443 for 4 consecutive tries - Internal error
The log messages appear for all rservers being probed for about 30 seconds, then they go away until the next event. Considering the probes are skipped rather than failed, I do not believe this is actually causing failures at the moment. I have read that the ACE platform can only run 200 concurrent scripted probes, but I am at a loss as to how to check whether that is what I am running into here. The really confusing part is that the internal error and out-of-sockets counters both remain at zero.
Any help or insight would be very appreciated. Thanks in advance.
No. Probes skipped: Number of skipped probes. A skipped probe occurs when the ACE does not send out a probe because the scheduled interval to send a probe is shorter than the time it takes to complete the execution of the probe; that is, the send interval is shorter than the open timeout or receive timeout interval.
In your case the interval is 10, which is a little aggressive but still less than receive. However, if the probe execution takes longer than 10 seconds, you may see probes getting skipped. Increasing the interval by another 10 seconds would be a useful test to see whether this mitigates the issue.
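To illustrate the point about the interval versus execution time, here is a rough sketch in Python. It uses a simplified model that is my assumption, not the actual ACE scheduler: a probe is due every interval, and it is skipped if the previous one is still running. The function name and parameters are made up for illustration.

```python
def probes_skipped(interval_s, execution_s, window_s):
    """Estimate how many scheduled probes are skipped within a time window.

    Simplified model (an assumption, not the ACE scheduler): a probe is due
    every interval_s seconds and is skipped if the previous probe is still
    executing at that moment.
    """
    skipped = 0
    busy_until = 0
    t = 0
    while t < window_s:
        if t < busy_until:
            skipped += 1                    # previous probe still running
        else:
            busy_until = t + execution_s    # start a new probe
        t += interval_s
    return skipped

# A probe that needs 12 seconds to complete, observed for one minute:
print(probes_skipped(10, 12, 60))  # 10-second interval
print(probes_skipped(20, 12, 60))  # interval raised by 10 seconds
```

In this model, a 12-second probe on a 10-second interval skips every other slot, while raising the interval to 20 seconds removes the skips entirely.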
If you have UDP probes then you need to check this as well:
For UDP probes or UDP-based probes, we recommend a time interval value of 30 seconds. The reason for this recommendation is that the ACE data plane has a management connection limit of 100,000. Management connections are used by all probes as well as Telnet, SSH, SNMP, and other management applications. In addition, the ACE has a default timeout for UDP connections of 120 (ACE module) or 15 (ACE appliance) seconds. This means that the ACE does not remove the UDP connections even though the UDP probe has been closed for two minutes. Using a time interval less than 30 seconds may limit the number of UDP probes that can be configured to run without exceeding the management connection limit, which may result in skipped probes.
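As a back-of-the-envelope check on that recommendation, here is a rough Python sketch. It assumes each probe run leaves one UDP connection behind until the UDP timeout expires, and it ignores the other management traffic (Telnet, SSH, SNMP) that shares the same limit, so treat the results as an upper bound only. The function names are made up for illustration.

```python
import math

MGMT_CONN_LIMIT = 100_000  # management connection limit quoted above

def lingering_conns_per_probe(interval_s, udp_timeout_s):
    """Approximate connections one probe instance holds open at steady state.

    Assumption: every probe run leaves one connection that lingers for
    udp_timeout_s seconds after the probe closes.
    """
    return max(1, math.ceil(udp_timeout_s / interval_s))

def max_udp_probe_instances(interval_s, udp_timeout_s):
    """Upper bound on probe instances that fit under the connection limit."""
    return MGMT_CONN_LIMIT // lingering_conns_per_probe(interval_s, udp_timeout_s)

print(max_udp_probe_instances(30, 120))  # 120-second timeout, 30-second interval
print(max_udp_probe_instances(10, 120))  # same timeout, 10-second interval
```

With the 120-second timeout, dropping the interval from 30 to 10 seconds cuts the ceiling to roughly a third, which is the effect the recommendation is warning about.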
Are you running any scripted probes?
It could be a stupid bug as well, but I would suggest increasing the interval and seeing how it goes.
You can also try debug hm errors/events/all etc. and see if you get any detailed output there that can be sent to TAC for further investigation.
Thanks very much for the response. I did not take into account the time the probe actually takes to execute. I will scale the probes back a bit and see if that alleviates the issue.