Installed a VSS 6800 pair with 3 IA groups for a customer in a pretty small network. This 6800 is the core, and four WAN routers terminate into the IAs for branches, Internet, etc. We transitioned them off a pair of 4507s, with a trunk between the two networks, and used HSRP to migrate everyone over.
Now that EVERY device is running on the 6800s and the 4500s are totally empty, it's time to change the HSRP priorities, move the active gateway so the 6800s take over layer 3, and turn down the 4500s. When we do so, the network comes to a crawl. I put the cable back in for the 4507 (router on a stick) and speed comes right back. I can find no errors and no reason for this to happen, having checked everything many times. They have a few static routes and the rest are EIGRP routes from the WAN routers mentioned. Traceroutes show the same paths, as the computers are all on IAs going to one router or another for their destination. From the switch I see no latency to computer 1, nor to computer 2, but computer 1 and computer 2 can hardly speak to each other, it's so slow. Any time routing goes back to the 4507, where all traffic must hairpin out and back in, speed is very fast; lose that "router on a stick" hairpin and the business crashes.

This is the final step to get the 4500s out of production. CPU stays around 22%, no bad stats on interfaces, no drops, and no syslog errors whatsoever. EIGRP routes and statics; in fact the config matches the 4507 except for the IAs being in place. It's a very small network: 4 routers plus the core, and 3 IA "IDF" stacks, 2 for users and 1 on the core for the data center routers. No ACLs, only default QoS is enabled, and traffic numbers are very, very low, but when the HSRP standby goes active on the 6880-X it crawls no matter how little traffic there is. It all dies on a single VLAN, VLAN 4, but that VLAN is where ALL layer 3 happens, as it's home to all the routers.

There are no line cards in the 6800s, so IA ports are used, which is not what I like to do, but Cisco markets and sells them as fully capable, so the customer ran with that. All infrastructure ports are dual 10Gb port-channels: dual 10Gb to each of the 3 FEX stacks, and dual 10Gb VSL links. All other ports are 1Gb IA-terminated PCs and routers.
Is anyone aware of layer 3 forwarding issues? Tonight was the last straw: I rebooted the VSS pair and it still does the same thing. I can recreate it at will; all I have to do is let an HSRP standby IP go active. And yes, I also tried removing the HSRP standby groups and changing the SVI IP to .1, since HSRP isn't needed once the 4500 is offline. For now I'm leaving things as they are, because without the 4500 the business cannot run.
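For anyone chasing something similar, this is the kind of first-pass check I'd run when the standby goes active (a sketch from memory; verify the exact syntax on your IOS version, and the prefix shown is a placeholder for one of the affected subnets):

```
! Confirm which chassis is actually the active HSRP gateway for each SVI
show standby brief

! Look for any process eating CPU that would indicate software switching
show processes cpu sorted | exclude 0.00

! Sanity-check that CEF has a valid hardware-forwarding entry for the hosts
show ip cef 10.1.4.0 255.255.255.0
show ip route 10.1.4.0
```

If CEF shows a "glean" or "punt" adjacency for the destination rather than a normal next hop, that would point at traffic being handled in software.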
thanks in advance,
As I alluded to, the core has 3 IA stacks; that IS the topology, with the exception of the 4500 as a router on a stick, which is going away once layer 3 performs "normally" on the 6800. The legacy and remaining network has been moved; everything is in this core via the IAs. We are testing from IA 3 on one VLAN to another VLAN also on IA 3, to isolate things.
I agree; yes, I checked all of that first as well. I've checked EVERYTHING. The VSS core is the STP root. With the legacy 4500 gone, the VSS core is pretty much all there is, with everything terminating into it. From the switch I see no latency going directly to PC1 or to PC2, but PC1 speaking to PC2 has a great deal of latency as it's being routed from one VLAN to another. Two devices on the same VLAN don't experience latency, nor do they if the 4500 is HSRP primary and doing all the routing.
There's no topology to draw out beyond 3 IA stacks connected to this VSS core, consisting of 2x 6880-Xs with supervisors only, no line cards, as the IAs are their "line cards". IA switch stacks 1 and 2 contain users only; IA switch stack 3 contains some users plus all the WAN routers, the firewall, etc. The VSS core is the STP root whether the 4500 is online and routing or not; that doesn't change, but yes, I went through STP as I typically do when I design a data center.
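For reference, the root check is quick to repeat after each failover (a sketch; VLAN 4 assumed as the routing VLAN):

```
! Per-VLAN view: confirms this switch is root for VLAN 4 and shows port roles
show spanning-tree vlan 4

! Summary of root bridge, cost, and root port for every VLAN
show spanning-tree root
```

If the root port or bridge ID changes between the "4500 active" and "6800 active" states, that would be worth correlating with the slowdown.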
If you're OK for traffic within a VLAN, but not for traffic between VLANs, is there anything obviously different between the 6800 and 4500 configurations with respect to layer 3, e.g., the SVI MTU? You might be getting IP fragmentation between VLANs, which could account for the poor performance.
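A quick way to test the fragmentation theory from the 6800 itself (hedged sketch; the target address is a placeholder for a PC on the far VLAN):

```
! Compare the SVI MTU against the 4500's configuration
show interface vlan 4 | include MTU

! Full-size ping with the DF bit set across the routed boundary;
! if 1500-byte DF pings fail while small pings succeed, suspect an MTU mismatch
ping 10.1.20.50 size 1500 df-bit repeat 10
```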
Can you capture the traffic that flows between the PCs when routing via the 4500, then do the same when routing via the 6800, and check whether there's any noticeable difference?
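For the capture, a local SPAN session on the VSS core would do it (a sketch; the session number and destination interface are placeholders, the destination port naming will differ for IA/FEX ports, and the destination needs a host running Wireshark):

```
monitor session 1 source vlan 4 both
monitor session 1 destination interface GigabitEthernet1/1
```

Comparing the two captures for retransmissions, duplicate packets, or out-of-order segments should narrow down where the latency is introduced.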
Nothing; it's too simple, as I alluded to, given they have such a small network. They have never had jumbo MTUs enabled and they're not enabled now. We have never been able to turn the old router off, so no real traffic is traversing, just our pings and such during the windows when we're allowed to bring down the port-channel or change HSRP priorities to let the 6800 take over as the ".1" gateway, so to speak. They do have EIGRP, but only for the four WAN routers directly connected into IA stack 3 for their branches. When adding a branch network, the provider manages the router, so the core learns the new subnet via EIGRP. Otherwise both routers have the same small set of static routes, one being the default gateway. The static routes point between the firewall and a WAN router; pretty much only those two destinations, but they are identical, and again, reachability has never been an issue per se. We can always "reach"; it's the performance that makes things appear unreachable at times.
I was mostly posting to see whether others have seen or experienced a bug, as this version was the only image available at the time of install that supported more than 3 switches in an IA stack; they required 4. I now see 5 maintenance releases have come out since, and many layer 3 forwarding and routing issues, both multicast and unicast, are documented. While nothing is identical to what I see here, several exhibit the same type of behavior.

My hang-up, and the reason for going public, is that if all layer 3 traffic were being process switched, then first, I would think it would still be faster than this given today's CPUs, but more importantly, the CPU would go up, and it doesn't. I couldn't watch it in production the night they first attempted shutting down the 4500 that's doing the routing, but since then I have recreated it many times at will, and the CPU always sticks at 22%, never higher, never lower. That said, during our windows nothing was happening, so 22% seems high to me as it is. A TAC case is open now, but no intrusive testing is allowed until the weekend; hoping they can bug scrub, since they see more than I have access to outside Cisco anymore.
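If memory serves, on the Sup2T-based platforms there's a way to capture what is actually being punted to the CPU inband path, which would confirm or rule out process switching even with the CPU flat at 22% (best-effort syntax from memory; TAC will likely run something equivalent):

```
! Capture packets arriving at the CPU (punted traffic)
debug netdr capture rx

! Inspect what was punted: source/dest addresses and the punt reason
show netdr captured-packets

! Clear the capture buffer when done
debug netdr clear-capture
```

If the capture fills with ordinary PC1-to-PC2 transit traffic while the 6800 is the active gateway, that would be strong evidence the inter-VLAN path is not being hardware switched.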