ACE20 hangs

Anatoly Mazalsky · 10-30-2013

Hi!

We have a problem with two ACE20 modules installed in Cisco 6509 switches and running in fault-tolerant mode.

Both nodes keep working and load balancing, but it's not possible to configure them, show the config, change context, and so on.

It looks like some process in the Admin context stops working, or something like that.

Below is the login process when I try to connect to the ACE:

Escape character is '^]'.

Process did not respond within the expected timeframe, using defaults

Password:

Login incorrect

ACE1 login: admin

Password:

Unable to retrieve context name, using default

Unable to obtain config mode lock info

Disabling config mode

NOTE: Configuration mode has been disabled on all sessions

Unable to get role information, using default

Cisco Application Control Software (ACSW)

TAC support: http://www.cisco.com/tac

The copyrights to certain works contained herein are owned by

other third parties and are used and distributed under license.

Some parts of this software are covered under the GNU Public

License. A copy of the license is available at

http://www.gnu.org/licenses/gpl.html.

ACE1/Admin# sh run

Generating configuration....

Acfg: Device or resource busy

ACE1/Admin# changeto VC_SERVERS

Error: Called API timed out

ACE1/Admin#

So I have to reload the module by resetting the appropriate slot in the 6509. Then it works for another couple of weeks.

What could it be?

Thanks

Kanwaljeet Singh · 10-31-2013

Hi Anatoly,

Looking at the output, it seems the device is running out of resources. Please run "show resource usage" and check whether the denied counter is increasing. Also check the current and peak connections to see whether they have hit the max limit.

You might have to increase the resources reserved for managing the device, using "limit-resource mgmt-connections". You have to make that change in the "Admin" context, and since you can access the Admin context you can give it a shot.

Let me know if that fixes it.

Regards,

Kanwal

Anatoly Mazalsky · 10-31-2013

Hi Kanwal,

Thanks for your reply, but it doesn't work. I can't even see resource usage:

ACE1/Admin# sh resource usage

Error: Transport error

Maybe after I reset the ACE I'll see something, but as I mentioned in my previous message, the ACE works for a relatively long time after a hardware reset. And regarding increasing resources for mgmt: there should actually be no mgmt connections at all, since we log in only when we have to reconfigure load balancing.

Kanwaljeet Singh · 10-31-2013

Hi Anatoly,

A quick internal search for "Transport error" shows that this error usually appears while configuring FT, or when some process related to it has shut down. But in your case you can't even execute basic show commands. It seems we have no choice but to reload to get out of this situation, but I would also recommend opening a TAC case to get this investigated if you keep hitting this issue. Maybe there is a known bug in the version you are running. I see some bugs here, but none whose symptoms exactly match what you are facing.

Also, regarding resource utilization: if you haven't allocated dedicated resources to management, you can manage the device only while resources are free. If they are being used by other traffic, you will not get access. That's why it is always a good idea to have some dedicated resources allocated for management. We have seen this issue before. Even a single connection to the ACE would be denied or fail if the ACE has no free resources.
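
As a rough sketch only, a dedicated management reservation in the Admin context might look something like the fragment below, entered in config mode once it is available again after the reload. The class name EXAMPLE_MGMT and the 5.00 percent minimum are placeholders for illustration, not values taken from your setup, so please size them for your environment and check the syntax against the documentation for your software version.

```
resource-class EXAMPLE_MGMT
  limit-resource all minimum 0.00 maximum unlimited
  limit-resource mgmt-connections minimum 5.00 maximum unlimited

context Admin
  member EXAMPLE_MGMT
```

The idea is that the "all" line leaves every other resource unreserved and uncapped, while the mgmt-connections line guarantees a share of management connections to the Admin context even when other traffic is consuming the rest.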

Regards,

Kanwal

Anatoly Mazalsky · 10-31-2013

Hi Kanwal,

Thanks a lot, it's good to know we should allocate dedicated resources for management! Maybe it's the root of the problem. Maybe not. We'll do it after resetting the modules anyway.

And on the second ACE (active for all contexts; the previous one was standby) we have another error:

ACE2/Admin# sh resource usage

Error: Called API timed out

Not sure if it's the same problem. The software version is A2(3.1) [build 3.0(0)A2(3.1)].

Kanwaljeet Singh · 10-31-2013

Hi Anatoly,

This is what I found:

When processing large configurations, subsequent commands may encounter an "API timeout" error, until the processing ends. The way to know when the processing ends is to perform a "show processor cpu" and verify that the load has reduced to low single-digits, or 0. The queue stall messages are harmless, unless they do not stop appearing.

Unfortunately I don't see any workaround for this problem. The suggestion is to let the processing complete, after which the error should go away. If it doesn't, I would suggest opening a TAC case.

You might also consider upgrading your device to the latest code, as the version you are running is pretty old.

Let me know if you have any questions.

Regards,

Kanwal