
High-availability network connectivity problem

felipesolis
Level 1

I'm having a lot of trouble trying to get this topology working. The image is not the actual topology I'm working on; I removed some things so it would be easier for you to focus on the problem I'm having.

SCENARIO

My goal is to remove any single point of failure (SPOF), so as you can see, we have two of everything. If a router/switch/server fails, there's another one. So far, I have GLBP and Pacemaker working just fine.

Switches have the default (blank) config. Routers have their own IP plus the GLBP virtual IP (on BVI1) with no additional options, and servers have their own IP plus the cluster IP (on bond0), also with no additional options.
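
For reference, the GLBP side is as minimal as it gets; roughly the following on each router's BVI1 (the addresses and the group number here are placeholders, the real ones are in the image):

    interface BVI1
     ! router's own (unique) address
     ip address 192.168.1.2 255.255.255.0
     ! shared virtual address the servers use as their .1 gateway
     glbp 1 ip 192.168.1.1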

LABELS

  • Green addresses are unique for each device.
  • Red (virtual) addresses are shared by the devices on each side in order to provide fault tolerance and/or load balancing. Servers use .1 as their gateway.
  • Purple addresses are used by servers to communicate/monitor each other and synchronize databases.

PROBLEMS

  • Packets get duplicated and/or arrive on both physical server interfaces.

EXAMPLE

Pinging SRV1 10 times:

  1. 10 ping requests are sent.
  2. 13 ping requests are received on bond0 (6 on eth1 and 7 on eth2).
  3. 10 ping replies are sent.

Yes, all 10 pings were successful despite the duplicated packets (some of you might think that's good enough), but when I use an upper-layer protocol such as SSH and packets arrive on both physical interfaces (eth1 and eth2), it just doesn't work. Sometimes even ping misbehaves. I don't know whether the packets are being dropped or never received at all (I didn't have time to capture network traffic on that issue today).
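
If it helps anyone, something like this on SRV1 should show on which slave the duplicates actually arrive (I just haven't run it yet; eth1/eth2 are my interface names):

    # watch ICMP on each physical slave separately while pinging the cluster IP
    tcpdump -n -i eth1 icmp
    tcpdump -n -i eth2 icmp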

This is my first time working on a high-availability network design, and I think this may be MAC-related.

Any help would be much appreciated.

http://dl.dropbox.com/u/34799987/help-local-pop.png

[EDITED] Solution (December 5th):

According to the Linux kernel documentation on bonding (Chapter 11: "Configuring Bonding for High Availability"), in this topology and with the equipment available it isn't possible to get both fault tolerance and load balancing on the servers' physical interfaces using bonding's default mode (balance-rr, a round-robin mode), so the solution was to switch to active-backup mode, which keeps only one interface active and provides fault tolerance only.

So now I have primary and backup links, which means there's a primary switch and a backup one. If one server's primary link goes down, each server would end up connected to a different switch, so I connected the switches to each other to keep that traffic from having to go through the routers.
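
In case it saves someone some typing, the bonding change boils down to something like this (module-options style; the exact file name and the miimon value depend on the distribution, and eth1 as primary is just my choice):

    # /etc/modprobe.d/bonding.conf (path/name varies by distro)
    # active-backup: one active slave, link checked every 100 ms
    options bonding mode=active-backup miimon=100 primary=eth1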

I hope this saves some time for anyone having the same issue.

3 Replies

Leo Laohoo
Hall of Fame

What model are your switches? Everything would've been easier if they were stacked 2960S or 3750 switches.

The switches are 2950s and the routers are 2801s. I've been reading about bonding over the past few hours and found that in a multiple-switch topology only the active-backup and broadcast modes are valid. The default bonding mode is balance-rr, and I didn't specify any mode, so that might be the issue. The problem is, I have to wait until Monday to test it.
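
To double-check which mode is actually running once I'm back at the servers, the kernel exposes it under /proc (as described in the bonding documentation):

    cat /proc/net/bonding/bond0
    # the "Bonding Mode:" line should read
    #   "load balancing (round-robin)"    -> balance-rr (current default)
    #   "fault-tolerance (active-backup)" -> after the change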

Anyhow, I'd like to know more about that stacked 2960S or 3750 option, because I'm using these 2950s to build a prototype and we're actually supposed to buy two 2960S switches later.

felipesolis
Level 1

Found the solution (see the initial post), but I can't find an option anywhere to mark this as solved.
