Content Switch/Oracle problem

junior272 · 10-04-2006

The user (wherever they are, local or at another facility) clicks on a link on a portal that directs them to the VIP, 136.180.3.3.

This VIP is set up to use the 4 application servers.

The CSS then forwards the request to the application server that the user is matched up with via the sticky table, as long as the server is up.

The packets from the CSS have a source address of 136.180.3.3.

The packet arrives at the application server. The application server responds to the user.

Since the source address of the packet going to the application server is 136.180.3.3 (which is on another subnet), that is where the server sends its reply.

Because of that, the return flow has to go back to the CSS via the router.

The CSS then un-NATs it and responds back to the user.

I have also asked the user to telnet to the DNS name associated with the VIP, and they are able to do this successfully. (They get a username prompt; since they do not have logins on the server, that is as far as they get.)

This works fine.

Here is the issue, which we never get time to troubleshoot.

At certain regular intervals (during heavy user access, that is, when a lot of users hit the system to do their time cards on Monday mornings and Friday afternoons), when the user clicks on the link that directs them to the VIP, their browser times out.

When this issue happens, I am able to identify which server the user is pointed to via the sticky table. It is usually just one of the 4 servers having an issue. The system admins restart the Apache process on all the servers and things are back to normal. I do NOT do anything to the CSS.

Unfortunately we do not get even a half hour to try to figure out what the problem is. They just want to get production back going and restarting the process does the trick so that is what they do.

While the user is having the above-mentioned issue with the link on the portal, they are able to telnet successfully to the VIP, and it directs them to the same server (via the sticky table). The server responds to the telnet request with a username prompt. (That is why I do not think the CSS is the issue.)
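A rough sketch of a probe that could record this difference the next time it happens: it opens a TCP connection, optionally sends a request, and waits for the first bytes of a reply. The function name `probe`, the 5-second timeout, and the use of ports 23 and 80 are assumptions for illustration; the web port actually used by the portal link is not stated here.

```python
import socket
import time


def probe(host, port, payload=None, timeout=5.0):
    """Connect to host:port, optionally send payload, wait for the first reply bytes.

    Returns a dict showing how far the exchange got and how long each stage took.
    """
    result = {"connected": False, "replied": False,
              "connect_s": None, "reply_s": None, "error": None}
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            result["connected"] = True
            result["connect_s"] = time.monotonic() - start
            if payload is not None:
                sock.sendall(payload)
            sock.settimeout(timeout)
            data = sock.recv(1024)  # raises a timeout error if nothing arrives
            result["replied"] = bool(data)
            result["reply_s"] = time.monotonic() - start
    except OSError as exc:  # covers refused connections and timeouts
        result["error"] = repr(exc)
    return result


# Example use during an outage (ports are assumptions):
#   probe("136.180.3.3", 23)                           # telnet banner
#   probe("136.180.3.3", 80, b"GET / HTTP/1.0\r\n\r\n")  # web request
```

If the telnet probe replies while the web probe connects but never replies, that points at the web server process on that machine rather than at the path through the CSS.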

Questions

Is there anything you can think of that would help shed light on the problem? I know that will be tough without being able to troubleshoot during the outage.

Is there something in the CSS that might be able to help us?

Is there a way to make the CSS present its own IP address (the one defined in the circuit command), which is on the same network as the servers, instead of the VIP, so that the return traffic does not have to go through the router? I am not sure whether it would help.

The problem is not reproducible on command. We have to wait until the system fails, and then we cannot do anything about it.

A copy of the config is attached; please refer to the bolded text. Thanks!

cschneid · 10-04-2006

Jason,

The problem could be related to a number of things, including garbage collection occurring on the CSS or possibly the server just being overwhelmed.

I see that you are using source groups and 'add destination services', and as you mentioned, you are using the 136.180.3.3 address in those groups. However, in most cases the VIP address configured under the group doesn't have to be the same as the VIP address in the content rule. In fact, in your scenario it would be better if the VIP address in the group were in the same IP subnet as the services configured in the 'add destination' statement.
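As a quick sanity check before picking a group address, something like the following sketch could be used. All addresses in the usage lines except 136.180.3.3 are hypothetical, since the server subnet is only in the attached config, and `check_group_vip` is an illustrative helper, not a CSS feature. It also flags a clash with the CSS circuit address or a server address, on the assumption that the group address must be otherwise unused.

```python
import ipaddress


def check_group_vip(candidate, server_subnet, css_interface, servers):
    """Return a list of problems with using `candidate` as the source group address."""
    net = ipaddress.ip_network(server_subnet)
    addr = ipaddress.ip_address(candidate)
    problems = []
    if addr not in net:
        problems.append("not in server subnet")
    if addr == ipaddress.ip_address(css_interface):
        problems.append("same as CSS circuit address")
    if addr in {ipaddress.ip_address(s) for s in servers}:
        problems.append("already used by a server")
    if addr in (net.network_address, net.broadcast_address):
        problems.append("network or broadcast address")
    return problems


# Hypothetical server-side addressing, for illustration only:
#   check_group_vip("136.180.3.3", "10.10.20.0/24", "10.10.20.1", ["10.10.20.11"])
#   check_group_vip("10.10.20.50", "10.10.20.0/24", "10.10.20.1", ["10.10.20.11"])
```

An empty list means the candidate passes these checks; it does not prove the address is free on the live network.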

Because the same VIP address is currently configured in both the content rule and the source group Oracle-Apps, it's even possible (depending on the volume of traffic) that you are cycling through the 63000 TCP ports and causing some sort of collision on the server.
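To get a feel for when that could matter, here is a back-of-the-envelope sketch. The 63000 figure is the one above; the connection rates and the 60-second period during which a server might still remember an old connection are assumptions, not measurements from this setup.

```python
def seconds_until_port_reuse(pool_size, new_conns_per_second):
    """Time for a round-robin port pool to wrap around to the same source port."""
    return pool_size / new_conns_per_second


def reuse_is_risky(pool_size, new_conns_per_second, server_hold_seconds):
    """True if a port could be reused while the server may still hold the old connection."""
    return seconds_until_port_reuse(pool_size, new_conns_per_second) < server_hold_seconds


# Assumed numbers, for illustration only:
#   seconds_until_port_reuse(63000, 100)   # 630.0 seconds between reuses
#   reuse_is_risky(63000, 100, 60)         # False
#   reuse_is_risky(63000, 2000, 60)        # True: wraps in 31.5 seconds
```

Counting new flows per second on the CSS during a Monday-morning peak would show which side of that line the real traffic falls on.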

Suggestions:

1) Use source group addresses that are in the same subnet as your servers and see if this resolves the issue. This address should *not* be the interface address of the CSS; in fact, the CSS will complain if you try to do this. Use an available address in that network (the CSS will reply to ARPs for this address and will own it).

2) If #1 doesn't resolve the issue and the problem happens consistently, set up sniffers on the front and the back of the CSS. You can configure most sniffers to do rolling captures (in Ethereal/Wireshark it's called a Ring Buffer) and stop the capture once you know the problem has occurred. If you are unable to determine from the traces what's happening, open a TAC case and we can help.
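For the rolling capture in #2, a minimal sketch that builds a `dumpcap` command line (the capture tool shipped with Wireshark) with a ring buffer. The interface name, file name, file size, and file count are placeholders to adjust, and the helper `ring_buffer_capture_args` is only illustrative.

```python
def ring_buffer_capture_args(interface, out_file, vip, file_kb=50_000, file_count=20):
    """Build a dumpcap command that keeps only the newest `file_count` capture files."""
    return [
        "dumpcap",
        "-i", interface,               # capture interface (placeholder name)
        "-f", f"host {vip}",           # capture filter: traffic to or from this address
        "-b", f"filesize:{file_kb}",   # start a new file after this many kB
        "-b", f"files:{file_count}",   # ring buffer: overwrite the oldest file
        "-w", out_file,
    ]


# Example (placeholder interface and file name):
#   import subprocess
#   subprocess.run(ring_buffer_capture_args("eth0", "css-front.pcap", "136.180.3.3"))
```

One capture would run on each side of the CSS; stopping both right after a user reports a timeout leaves the most recent files covering the failure.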

-Chip