I have enabled storm-control at my access layer (C4K), and I'm happy with the results -- every few weeks, someone creates a loop using a mini-switch, and storm-control shuts down their port before they can take out that floor.
storm-control broadcast level 1.00
storm-control action shutdown
storm-control action trap
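For a sense of scale, here is a rough Python sketch of what a 1.00 level works out to. It assumes the level is a percentage of the port's total bandwidth and that the access port runs at 1 Gb/s (the port speed is my assumption, it isn't stated above). The function names are made up for illustration, and the frames-per-second figure is approximate, since how a given platform counts on-wire overhead may differ.

```python
# Rough arithmetic only; port speed and frame size are assumptions.
PREAMBLE_AND_IFG_BYTES = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap


def threshold_bps(port_bps, level_percent):
    """Broadcast bandwidth (bits/s) allowed before storm-control acts."""
    return port_bps * level_percent / 100.0


def threshold_fps(port_bps, level_percent, frame_bytes=64):
    """Approximate frames/s at the threshold for a given frame size."""
    bits_on_wire = (frame_bytes + PREAMBLE_AND_IFG_BYTES) * 8
    return threshold_bps(port_bps, level_percent) / bits_on_wire


gig = 1_000_000_000
print(threshold_bps(gig, 1.00))          # 10000000.0 (10 Mb/s of broadcast)
print(round(threshold_fps(gig, 1.00)))   # 14881 minimum-size frames per second
```

So on a gigabit port, 1.00 allows roughly 10 Mb/s of broadcast, or on the order of 15,000 minimum-size frames per second, before the action fires.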
Now I'd like to extend this protection to my data centers (C6K).
But my data centers contain VMWare installations. And I'm imagining that a single misbehaving VM, plus storm-control, could shut down an entire VMWare cluster in the following way:
-Pathological VM Guest starts emitting lots of broadcasts. The upstream C6K notices and shuts down the port feeding the VMWare Host.
-The fancy HA software in VMWare automatically migrates the Guest to another VMWare Host and reincarnates it there. The pathology picks up again, the Guest emits lots of broadcasts, the upstream C6K notices and shuts down the port.
-Our clusters vary from two VM Hosts to ten ... but I imagine that in either case, in a matter of seconds, a pathological Guest plus storm-control on the C6K could shut down all ports leading to all VM Hosts.
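The cascade I'm worried about can be sketched as a toy Python model. It assumes that HA restarts the Guest on the next Host whose uplink is still up, that the Guest keeps misbehaving wherever it lands, and that the upstream switch shuts the Host port every time. `hosts_cut_off` is a hypothetical name, and this is a thought experiment rather than a model of how VMWare HA actually picks a Host.

```python
def hosts_cut_off(num_hosts):
    """Count host uplinks shut down by one misbehaving guest.

    Toy model: each time the guest's current host loses its uplink,
    HA restarts the guest on the next host whose uplink is still up.
    Assumes num_hosts >= 1.
    """
    uplink_up = [True] * num_hosts
    host = 0  # the guest starts on host 0
    while host is not None:
        # Upstream storm-control shuts the port feeding this host.
        uplink_up[host] = False
        # HA moves the guest to the next host that still has an uplink.
        host = next((h for h in range(num_hosts) if uplink_up[h]), None)
    return uplink_up.count(False)


for size in (2, 10):
    print(size, hosts_cut_off(size))
```

Under those assumptions every Host in the cluster ends up isolated, whether the cluster has two Hosts or ten.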
Sounds like a bad idea to me.
So, then I was thinking, what if we bought Nexus 1000v for the VMWare Hosts and implemented "storm-control broadcast level 1.00" on the resulting virtual ports plus, say, "storm-control broadcast level 5.00" on the upstream C6K. Would this 'do the right thing'? i.e. if a pathological Guest starts spewing broadcasts, would the Nexus 1000v shut down the virtual port *before* the C6K noticed? [I suspect that this would be a bit of a crap shoot, as to which switch would notice the storm first, given the relatively coarse time granularity of this feature]
And then, would the 'shutdown' state of this virtual port "follow" the pathological Guest, as it tried to reincarnate on each of the other VMWare Hosts?
I think I would want both behaviors in order to get the effect I want, i.e. hardening transport in the data center against broadcast pathology.
Anyone doing this? Any 'design guides' available describing this?
Fred Hutchinson Cancer Research Center
Seattle, WA USA
I'm looking here, but it doesn't seem that Nexus 1000v supports storm-control. I'll have to check whether it's on the roadmap, but your post makes me really think this is needed.
Now... IF we did have storm-control, then if the port got shut down due to excessive bcast packets, it would still be shut down even if VMWare moved the server.
Here is my interface list on my N1k:
pdiwaas-n1kv# show int status
Port      Name                Status  Vlan    Duplex  Speed  Type
mgmt0     --                  up      routed  full    1000   --
Eth4/1    --                  up      trunk   full    1000   --
Eth4/3    --                  up      trunk   full    1000   --
Eth4/4    --                  up      trunk   full    1000   --
Eth5/2    --                  up      trunk   full    1000   --
Eth6/2    --                  up      trunk   full    1000   --
Eth7/2    --                  up      trunk   full    1000   --
Eth8/2    --                  up      trunk   full    1000   --
Po1       --                  up      trunk   full    1000   --
Veth1     pdi-vWAAS-rtp, Net  up      600     full    auto   --
Veth2     vcm-small-ovf, Net  up      200     full    auto   --
Veth3     win7-rtp, Network   up      500     full    auto   --
Veth4     pdiwaas-dc2, Netwo  up      100     full    auto   --
Veth5     pdi-vwaas-dc, Netw  up      200     full    auto   --
Veth6     pdiwaas-dc3, Netwo  up      100     full    auto   --
Veth7     pdi-waasMobile, Ne  up      100     full    auto   --
Veth8     pdiwaas-dc4, Netwo  up      100     full    auto   --
Veth9     pdiwaas-dc5, Netwo  up      100     full    auto   --
Veth10    chapeter-1, Networ  up      555     full    auto   --
Thanks for the sanity check. So, if Nexus 1000v had stormcontrol, then a disabled Veth would follow the affected Guest around the cluster -- that's good; we're part way there.
(1) Is stormcontrol on the Nexus 1000v roadmap?
(2) If Nexus 1000v had stormcontrol, and if the upstream sheetmetal & silicon Nexus (or C6K in my case) had stormcontrol, would this capability actually work the way I want it to?
==> Do you see my concern? I don't really know how Nexus (or C6K) figures out when to trigger the stormcontrol action, but let's say that it maintains a running 1 second average of broadcast frames. The upstream switch and the 1000v are both calculating these running averages starting at different time points ... so it is entirely likely that they will disagree on when the threshold has been passed. It /should/ be possible to configure the 1000v to trigger the stormcontrol action *before* the upstream switch does ... but it is not obvious to me how big of a 'spread' in sensitivity one would need to configure between the two, in order to guarantee that the 1000v *always* triggers first. [Or, I suppose, one leaves stormcontrol disabled on VMWare Host ports, and dodges the problem entirely.]
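To make the timing concern concrete, here is a toy Python model. It assumes each switch counts broadcasts in back-to-back 1-second windows and only checks the count when a window closes; that is my guess at the mechanism, not a description of how the C6K or Nexus actually measures. The function name, the window offsets, and the storm rates are all made up for illustration; the thresholds are roughly 1% and 5% of a gigabit port in minimum-size frames.

```python
def first_trigger_ms(storm_start_ms, storm_fps, threshold_fps,
                     window_offset_ms, window_ms=1000, horizon_ms=10_000):
    """Time (ms) at which a fixed-window meter first exceeds its threshold.

    The meter counts frames in back-to-back windows aligned to
    window_offset_ms and checks the count only when a window closes.
    The storm runs at a constant storm_fps from storm_start_ms onward.
    Returns None if nothing triggers before horizon_ms.
    """
    limit = threshold_fps * window_ms / 1000  # frames allowed per window
    start = window_offset_ms - window_ms      # window that covers time 0
    while start < horizon_ms:
        end = start + window_ms
        # Portion of this window during which the storm was active.
        overlap_ms = max(0, end - max(start, storm_start_ms))
        frames = storm_fps * overlap_ms / 1000
        if frames > limit:
            return end
        start = end
    return None


EDGE_FPS, UPSTREAM_FPS = 14_881, 74_405  # roughly 1% and 5% of 1 Gb/s

# Heavy storm starting at t=950 ms; upstream windows are offset by 500 ms.
print(first_trigger_ms(950, 200_000, EDGE_FPS, window_offset_ms=0))        # 2000
print(first_trigger_ms(950, 200_000, UPSTREAM_FPS, window_offset_ms=500))  # 1500

# Milder storm: only the edge threshold is ever exceeded.
print(first_trigger_ms(950, 50_000, EDGE_FPS, window_offset_ms=0))         # 2000
print(first_trigger_ms(950, 50_000, UPSTREAM_FPS, window_offset_ms=500))   # None
```

In this model, a storm well above both thresholds lets the upstream switch fire first purely because its window happens to close sooner, even with a 5x spread. The spread only guarantees ordering for storms that stay below the upstream threshold.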
Anyway, anything I can do to help, from the customer end? Submit an RFE to my local sales team?
I'll reach out to see if stormcontrol is on the roadmap.
So IF Nexus 1000v had stormcontrol I wouldn't normally suggest applying it at the upstream switch as well. You would want to leave the handling of it in the hands of the edge switch (N1k in our case). This way you don't get into a situation where your 6500 shuts down the port going to the nexus 1000v, when just 1 of the n1kv ports should have been shut.
If you want to implement storm control at both places, you would want the rate of your 6500 to be higher than the allowed rate of your 1k. But if you did implement stormcontrol on all ports of n1k you shouldn't really need it at the 6500.
Go ahead and reach out to your account team about this. You can have them reach out to me as well, and I'll see what I can do to help.
I haven't heard anything back ... but thinking about this ... it is perhaps not reasonable to implement storm-control on a pure software platform (e.g. N1KV) ... without rate-limiters in hardware, perhaps it is difficult or even impossible to implement this. Just speculation on my part.
Well, it's still not implemented in the Nexus 1000v. I'll follow up on this and update back with a possible timeline. We've had this request open since early 2011.
Ok, I've got a confirmation saying that this is going to be added in the next major release of the Nexus 1000v.
CSCtn08364 is the enhancement request that is being used to track this. The release is planned for later this year (no dates yet).
Yes, the same defect can be used to track it for Hyper-V as well. The next Hyper-V release won't have this, so most likely the release after that (sometime early next year) will.