Link inconsistency and no traffic on Gig-E from VSS distribution to 6504 core (inconsistent state & config weirdness)

Hi All,

I have encountered a very STRANGE behaviour that looks remarkably like a major series of bugs (plural) all combined, as I can find nothing else to account for it.

Overview of architecture:

6506 VSS system on one end (distribution)

6504 Sup720-3BXL on the other end (core)

In the distribution there is a WS-X6724-SFP line card in chassis 2 with two connections to the 6504 core.

On the VSS side, the ports are Gi2/1/3 and Gi2/1/4.  On the core side they are Gi3/15 and Gi3/16.

Previously these two connections were in an etherchannel carrying a dot1q trunk.  As part of engineering work this weekend, we removed the trunks and replaced them with Layer-3 point-to-point etherchannels.  This exact same operation was successful between 4 cores and 2 other VSS systems, but this last one started throwing some weird anomalies.

The exercise consisted of the following steps (a rough configuration sketch follows the list):

** reusing the same physical connections **

Shut down the old port-channel interface and the physical interfaces.

Remove the channel-group from the physical interfaces.

Remove the switchport ('no switchport') to revert to Layer-3.

Bind the interfaces to the new channel-group.

No-shut the physical interfaces and the new port-channel.
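For reference, the per-port conversion on the core side was roughly the sketch below; the old/new channel-group numbers, the LACP mode and the addressing are placeholders for illustration rather than the exact values used, and the VSS side was the mirror image on Gi2/1/3 - 4:

! old Layer-2 bundle taken down first
interface Port-channel1
 shutdown
!
! physical members: unbundle, revert to routed, rebundle with LACP
interface range GigabitEthernet3/15 - 16
 shutdown
 no channel-group 1
 no switchport
 channel-group 10 mode active
 no shutdown
!
! new Layer-3 point-to-point port-channel
interface Port-channel10
 ip address 192.0.2.1 255.255.255.252
 no shutdown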

Everywhere else this worked... but here's where things started going pear-shaped.

Firstly, both sides logged an LACP suspension message:

Jan 16 04:43:19.255 CET: %EC-SP-5-L3DONTBNDL2: Gi3/16 suspended: LACP currently not enabled on the remote port.
Jan 16 04:43:20.139 CET: %EC-SP-5-L3DONTBNDL2: Gi3/15 suspended: LACP currently not enabled on the remote port.

Verification on both sides with 'sh etherch sum' confirmed that both were indeed configured for LACP.
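As far as I can tell, the %EC-SP-5-L3DONTBNDL2 message simply means the port is not receiving any LACPDUs from the far end, which ties in with the CDP silence described further down. The verification on both sides was along these lines (nothing exotic, listed only for completeness):

core4-d#show etherchannel summary
core4-d#show lacp internal
core4-d#show lacp counters
core4-d#show lacp neighbor

'show lacp counters' in particular lists LACPDUs sent versus received per port, so it makes it obvious whether the peer's frames are arriving at all.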

So to troubleshoot this, I removed the channel-group and checked the individual connections.

Just with the bare interfaces, no IP config or anything, the links had the following behaviour:

All connections UP/UP:

Gi2/1/3      PTP to CORE4       connected    routed       full   1000 1000BaseSX
Gi2/1/4      PTP to CORE4       connected    routed       full   1000 1000BaseSX

Gi3/15       PTP to GDIST       connected    routed       full   1000 1000BaseSX
Gi3/16       PTP to GDIST       connected    routed       full   1000 1000BaseSX

but no CDP neighbor adjacency was seen on either side.
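(The CDP checks on each side were along these lines, with nothing learned on either port; shown here for the core side:)

core4-d#show cdp interface GigabitEthernet3/15
core4-d#show cdp neighbors GigabitEthernet3/15 detail
core4-d#show cdp traffic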

Next, I shut each interface in turn and observed the behaviour:

core4-d(config)#int gi3/15
core4-d(config-if)#shut
core4-d(config-if)#do sh int status | i GDIST
Gi3/15       PTP to GDIST       disabled     routed       full   1000 1000BaseSX
Gi3/16       PTP to GDIST       connected    routed       full   1000 1000BaseSX

The VSS side correctly shows the first port as not connected:

gdist-eqx2#sh int status | i CORE4
Gi2/1/3      PTP to CORE4       notconnect   routed       full   1000 1000BaseSX
Gi2/1/4      PTP to CORE4       connected    routed       full   1000 1000BaseSX

Same result for the second port.

Now I re-enabled all the ports and shut down from the VSS side instead.

gdist-eqx2(config)#int gi 2/1/3
gdist-eqx2(config-if)#shut
gdist-eqx2(config-if)#do sh int status | i CORE4       
Gi2/1/3      PTP to CORE4       disabled     routed       full   1000 1000BaseSX
Gi2/1/4      PTP to CORE4       connected    routed       full   1000 1000BaseSX
gdist-eqx2(config-if)#

BUT

the core side still shows the port as connected... (?)

core4-d#sh int status | i GDIST
Gi3/15       PTP to GDIST       connected    routed       full   1000 1000BaseSX
Gi3/16       PTP to GDIST       connected    routed       full   1000 1000BaseSX
core4-d#

Same for the second port...

Now I UN-SHUT the ports on the VSS side, and they go to "notconnect" until I shut/no-shut on the core side.

All this time, still no CDP traffic being seen by either side.

Even more interesting: with the VSS ports SHUT DOWN, they still send a lot of traffic, which is reflected in the interface counters on the core side (down, but still sending traffic?), with a 5-minute average of around 200 Kbps.
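(That figure is simply what the core-side counters show; for anyone reproducing this, the input rate and octet counters on the affected port can be read with something like the following:)

core4-d#show interfaces GigabitEthernet3/15 | include rate
core4-d#show interfaces GigabitEthernet3/15 counters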

Next, to further check, I simply set the FIRST port on both sides back to switchport.

On the VSS side, the port immediately re-added its original configuration that had been removed, including trunk mode, encapsulation, allowed VLANs and the original channel-group [which no longer actually existed], etc.  The core side behaved correctly, without re-adding all that junk...

A CDP relationship could then be seen on both sides of the link, though.

Then I manually removed the reinserted configuration, such that both sides now only contained:

interface GigabitEthernet2/1/3
 switchport
 no ip address
!

The interface came up UP/UP, but with no CDP traffic, and still this high (200 Kbps) output from the VSS towards the core.

Upon doing so, the following error messages appeared in the logs:

Jan 17 11:26:51 217.70.176.44 5040036: Jan 17 11:33:03.700 CET: %PM-SW1_SP-1-INCONSISTENT_PORT_STATE: Inconsistent HW/SW port state for Gi2/1/3. Please shut/no shut the interface
Jan 17 11:26:57 217.70.176.44 5040038: Jan 17 11:33:08.405 CET: %EM-SW1_SP-4-AGED: The specified EM client (EM_TYPE_SCP_LINK_STATUS type=2, id=47203)did not close the EM event within the permitted amount of time (300000 msec).

(Shut/no shut on the interface, as indicated, didn't resolve the problem.)

After scratching our heads over this one for 24 hours, we next changed fibres, SFPs, GBICs -- SAME result.

Next actually changed physical ports -- SAME result.

Next tested looping back ports on the 6724-SFP linecard...

Port 2/1/3 to 2/1/4 connected back to back - link down/down (not connected)   [yes, verified the tx/rx swapover]

Hard loopback on the same port:  2/1/3  link down/down

It seems that this can only be a combination of SEVERAL bugs hitting at once, given the number of different things all going wrong here.

Again, this exact same exercise was totally successful on 2 other VSS systems and 4 cores (one of those being the same core as in this scenario).

Any suggestions, advice, or even a strong shot of whiskey would be most appreciated!

Thanks,

Leland

2 Replies

scottwilliamson
Level 2

Hi Leland,

Assuming it is a series of bugs, is it worth upgrading one or other of the IOS versions on the core or distribution kit to see if any of the problems disappear? If that works it may make resolving the remaining issue(s) easier, perhaps? Does the Bug Toolkit give any clues as to the best side to try upgrading first?

Being Scottish, I would advise drinking whisky rather than the inferior Irish whiskey with its superfluous letter e, however.

Regards,

Scott

Hi Scott,

Well... I said that it *seems* like a grouping of a series of bugs, but the only two that I've found so far that are even remotely related are CSCtd21951 and CSCtd93384, and they don't really address the issue we're seeing.

The problem with the Bug Toolkit is that for every bug publicly visible on it, there are probably 30 or so marked as "Cisco internal", which makes searching a little hit-and-miss.

I would concur on upgrading the IOS version, but before I do that I would need to be sure of what version to upgrade TO (and make sure that the new version doesn't break something else already in place... been there, got the t-shirt... -- a classic example here was an earlier upgrade that totally wiped the IPv6 capabilities in VSS mode, even though they worked in the previous version).

Yes... agree on the whisky -- och aye the noo!

Mòran taing!!  Slàinte mhòr agad!  (Many thanks!!  Good health to you!)

Leland
