HSRP ip address unreachable over multiple switch path link

andybonyx
Level 1

I wonder if anyone can suggest what is wrong here, or advise on next debugging steps. We have control over all devices in the layout.

We have the following topology:

dc router1 (7206VXR)

| (trunk)

dc switch1 (2960)

| (trunk allow vlan 20)

customer switch1 (3750)

| (trunk)

| telco 1Gb layer2 link (forwarding dot1q trunk, cdp, stp)

| (trunk)

customer switch2 (3750)

| (trunk allow vlan 20)

dc switch2 (2960)

| (trunk)

dc router2 (7206VXR)

The DC routers are also directly connected to each other by a further L2/L3 link over our WAN, and have full reachability between them over that WAN link.

On each DC router we have defined a dot1q subinterface on vlan 20 carrying the customer's public IP block, in an HSRP pair setup:

dc router1:

interface GigabitEthernet0/2.20
encapsulation dot1Q 20
ip address 1.1.1.1 255.255.255.248
standby 20 ip 1.1.1.3
standby 20 timers 5 15
standby 20 preempt
standby 20 authentication test20

dc router2:

interface GigabitEthernet0/2.20
encapsulation dot1Q 20
ip address 1.1.1.2 255.255.255.248
standby 20 ip 1.1.1.3
standby 20 timers 5 15
standby 20 priority 200
standby 20 preempt
standby 20 authentication test20
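
(For reference, not from the original post: confirming the active/standby roles and the virtual MAC in use for group 20 is just the standard show commands on each 7206, something like:)

show standby GigabitEthernet0/2.20 brief
show standby GigabitEthernet0/2.20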

All the other devices are purely layer 2 in this scenario; vlan 20 is added to their local VLAN databases, and all relevant uplinks carry vlan 20 either as an open trunk or via a trunk allow statement.
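
(Purely for illustration, the uplink trunks are of this general shape on the 3750s; the interface name and the allowed-vlan handling are placeholders, not copied from the real configs.)

interface GigabitEthernet1/0/10
 ! placeholder uplink towards dc switch1
 switchport trunk encapsulation dot1q
 switchport mode trunk
 ! either leave the trunk open, or restrict the allowed list so it includes vlan 20
 switchport trunk allowed vlan 20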

dc router 1 and dc router 2 can ping each other's interface addresses (1.1.1.1 and 1.1.1.2), proving that connectivity over the dot1q subinterfaces works and the path between them is good.

dc router 1 can ping the hsrp 1.1.1.3 address without loss (dc router 2 is active, 1 is standby)

dc router 2 can ping the hsrp 1.1.1.3 address without loss (dc router 2 is active, 1 is standby)

dc router 1 can NOT ping the HSRP 1.1.1.3 address when sourcing the ping from another interface in its routing table

dc router 2 CAN ping the HSRP 1.1.1.3 address when sourcing the ping from another interface in its routing table

From anywhere else, you can NOT ping the hsrp 1.1.1.3 address (but can 1.1.1.1 and 1.1.1.2)
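
(For clarity, the source-interface tests were of this general form; Loopback0 here is only a stand-in for whichever other interface is in the routing table, not a name from the real configs.)

ping 1.1.1.3
ping 1.1.1.3 source Loopback0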

Checking the dynamic MAC tables on the two customer 3750s we can see the expected entries:

customer switch 1

Vlan Mac Address Type Ports
---- ----------- -------- -----
20 0000.0c08.ac31 DYNAMIC Gi1/0/10 (The HSRP mac address)
20 0009.e951.401b DYNAMIC Gi1/0/10 (dc router 1)
20 0019.aabd.221a DYNAMIC Gi1/1/1 (dc router 2)
20 44d3.caa6.b719 DYNAMIC Gi1/1/1 (customer switch 2)
Total Mac Addresses for this criterion: 4

Gi1/1/1 being the vlan trunk over the telco link

Gi1/0/10 being the vlan trunk to our dc switch1

customer switch 2

Vlan Mac Address Type Ports
---- ----------- -------- -----
20 0000.0c08.ac31 DYNAMIC Gi1/1/1 (The HSRP mac address)
20 0009.e951.401b DYNAMIC Gi1/1/1 (dc router 1)
20 0019.aabd.221a DYNAMIC Gi1/0/1 (dc router 2)
20 aca0.1674.f30a DYNAMIC Gi1/0/1 (dc switch 2 )
Total Mac Addresses for this criterion: 4

Gi1/1/1 being the vlan trunk over the telco link

Gi1/0/1 being the vlan trunk to our dc switch2
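
(The exact commands weren't shown above, but these tables would typically be gathered per switch with something like the following, the second command being handy for tracing just the virtual MAC hop by hop.)

show mac address-table dynamic vlan 20
show mac address-table address 0000.0c08.ac31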

Tests done so far:

Increased the HSRP timers, and 'watched' the MAC address tables to see if any entries were lost, returned, or changed. Deleted and re-created the VLANs and IPs, changed the IPs, changed the VLAN numbers. No ports are spanning-tree blocked; the STP root is negotiated correctly and all ports show FWD.
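
(For reference, those STP checks correspond to the usual show commands on each switch, along the lines of:)

show spanning-tree vlan 20
show spanning-tree interface gigabitEthernet 1/1/1 detail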

Thanks in advance.

11 Replies

saif musa
Level 4

Hi,

It seems you have done almost all the tests to identify this issue, to no avail. There is still one thing to try: swap the standby priorities between the DC routers so that the current standby router becomes the active one. Then let us know; hopefully that will help.
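
Roughly, the swap could be done like this on the two 7206s (a sketch only; 210 is just an arbitrary value higher than the 200 already configured on dc router2, and dropping dc router2 back to the default of 100 would work equally well):

! dc router1 - raise its priority so preempt makes it the active router
interface GigabitEthernet0/2.20
 standby 20 priority 210
! dc router2 - alternatively, remove its configured priority to fall back to 100
interface GigabitEthernet0/2.20
 no standby 20 priority 200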

Regards

Hi,

Sorry, I forgot to mention I'd tried that too. What happens is that the problem 'flips': it is still there, just the other way round.

A very curious problem! I've also had the telco check the link out now for any issues and none were found.

andybonyx
Level 1

Just something additional I've noticed. Looking at a lower level, I've spotted:

cust-switch-1#sh mls qos interface gig1/1/1 statistics | beg output queues dropped
output queues dropped:
queue: threshold1 threshold2 threshold3
-----------------------------------------------
queue 0: 0 0 0
queue 1: 3008339729 25612 0
queue 2: 0 0 0
queue 3: 0 0 0

Policer: Inprofile: 0 OutofProfile: 0

Incrementing at a reasonably high rate. 3 minutes after the above it goes to:

queue 1:  3008344378       25612           0

So I'm suspecting this may be having an impact, would that make sense?

cust-switch-1#show plat port-asic stat drop gigabitEthernet 1/1/1

Interface Gi1/1/1 TxQueue Drop Statistics
Queue 0
Weight 0 Frames 0
Weight 1 Frames 0
Weight 2 Frames 0
Queue 1
Weight 0 Frames 3008344425
Weight 1 Frames 25612
Weight 2 Frames 0

(All other values are 0)
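
(One way to put a rate on those drops, assuming the standard 3750 clear command, is to zero the QoS counters and re-check after a fixed interval:)

clear mls qos interface gigabitEthernet 1/1/1 statistics
show mls qos interface gigabitEthernet 1/1/1 statistics | begin output queues dropped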

andybonyx,

I hope we have got nearly to the root of the problem here. Queue 1 is oversubscribed, since queue 0 has a bigger share of the buffer; this is not normal behaviour for Gi1/1/1, so you have congestion there.

It may be related to the VTP configuration. Did you configure any? You didn't mention it.

Thanks Saif.

VTP is setup:

VTP Version capable : 1 to 3
VTP version running : 3
VTP Domain Name : cust-switch-1
VTP Pruning Mode : Disabled
VTP Traps Generation : Disabled
Device ID : ccef.4845.c700

Feature VLAN:
--------------
VTP Operating Mode : Primary Server
Number of existing VLANs : 36
Number of existing extended VLANs : 40
Configuration Revision : 86
Primary ID : ccef.4845.c700
Primary Description : cust-switch-1
MD5 digest : -removed-


Feature MST:
--------------
VTP Operating Mode : Transparent


Feature UNKNOWN:
--------------
VTP Operating Mode : Transparent

I don't think it's link congestion: at present the link is passing around 30Mb of traffic in each direction (it's a 1Gb link), yet the counters are still going up, so perhaps the queue itself is oversubscribed and causing the issue.

(That seems like very high packet loss for an oversubscribed queue, almost 80%+, but it's a problem to resolve one way or another, at least to eliminate it as an option.)

andybonyx,

Hope you are doing well...

Gi1/1/1 is experiencing packet drops, and that does not happen unless there is not enough space in the port buffer; not enough buffer space to store packets means there is congestion.

By default, each queue on a gigabit port gets 25% of the buffer unless we change that. Could you please copy and paste the output of the commands [ sh mls qos queue-set ] and [ sh int gig switching ]?

Now, we have to handle these dropped packets, either by increasing the queue's share of the buffer or by identifying the process that is causing this amount of traffic.
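
For example, something along these lines on the 3750 would give the dropping queue a bigger share of the buffer (values are purely illustrative, and bear in mind that the statistics output numbers the queues 0-3 while the queue-set configuration numbers them 1-4, so 'queue 1' in the drop counters should correspond to queue 2 in the commands below):

! illustrative only - enlarge queue 2's buffer allocation in queue-set 1 (the four values must total 100)
mls qos queue-set output 1 buffers 15 40 25 20
! and/or raise queue 2's drop thresholds, reserved and maximum settings
mls qos queue-set output 1 threshold 2 200 200 50 400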

Regards

Thank you Saif,

I think this may be a problem; I'm not 100% sure it's THE issue, but it's one to resolve! All QoS is on the default buffer splits, so there is something to fix there.

The output of mls qos queue-set:

Queueset: 1
Queue : 1 2 3 4
----------------------------------------------
buffers : 25 25 25 25
threshold1: 100 200 100 100
threshold2: 100 200 100 100
reserved : 50 50 50 50
maximum : 400 400 400 400
Queueset: 2
Queue : 1 2 3 4
----------------------------------------------
buffers : 25 25 25 25
threshold1: 100 200 100 100
threshold2: 100 200 100 100
reserved : 50 50 50 50
maximum : 400 400 400 400

I couldn't find a 'sh int gig switching' command, but I'm not sure if this is what you wanted:

sh int gigabitEthernet 1/1/1 switchport
Name: Gi1/1/1
Switchport: Enabled
Administrative Mode: trunk
Operational Mode: trunk
Administrative Trunking Encapsulation: dot1q
Operational Trunking Encapsulation: dot1q
Negotiation of Trunking: On
Access Mode VLAN: 1 (default)
Trunking Native Mode VLAN: 1 (default)
Administrative Native VLAN tagging: enabled
Voice VLAN: none
Administrative private-vlan host-association: none
Administrative private-vlan mapping: none
Administrative private-vlan trunk native VLAN: none
Administrative private-vlan trunk Native VLAN tagging: enabled
Administrative private-vlan trunk encapsulation: dot1q
Administrative private-vlan trunk normal VLANs: none
Administrative private-vlan trunk associations: none
Administrative private-vlan trunk mappings: none
Operational private-vlan: none
Trunking VLANs Enabled: ALL
Pruning VLANs Enabled: 2-1001
Capture Mode Disabled
Capture VLANs Allowed: ALL

Protected: false
Unknown unicast blocked: disabled
Unknown multicast blocked: disabled
Appliance trust: none

Andybonyx,

You are most welcome; helping others here is beneficial to me too, as we work in the same field.

Enter the command ( sh int gig1/1/1 switching ) exactly as it is; don't use the question mark, just copy and paste it. The output will show the types of process traffic the port is handling.

Also, one more thing: did you double-check that the VLANs on your remote switches are configured correctly to work with the VTP domain?
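
(A quick way to confirm that on each remote switch would be the standard checks, e.g.:)

show vlan id 20
show vtp status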

Regards

Thank you, no problem.

Command output:

sh int gig1/1/1 switching
GigabitEthernet1/1/1 Uplink
Throttle count 0
Drops RP 0 SP 0
SPD Flushes Fast 0 SSE 0
SPD Aggress Fast 0
SPD Priority Inputs 0 Drops 0

Protocol Path Pkts In Chars In Pkts Out Chars Out
Other Process 3400000 717843750 3662418 223407498
Cache misses 0
Fast 0 0 0 0
Auton/SSE 0 0 0 0
Spanning Tree Process 1568170515 100183734624 1558229673 99687244364
Cache misses 0
Fast 0 0 0 0
Auton/SSE 0 0 0 0
CDP Process 9491 4574662 1831218 882647004
Cache misses 0
Fast 0 0 0 0
Auton/SSE 0 0 0 0
VTP Process 0 0 360298 44045928
Cache misses 0
Fast 0 0 0 0
Auton/SSE 0 0 0 0
DTP Process 3134546 191207306 0 0
Cache misses 0
Fast 0 0 0 0
Auton/SSE 0 0 0 0

The remote switches are set up as local VTP servers, so they aren't syncing VTP with each other, but they all contain the VLANs in their local databases.

andybonyx,

I have already spent two days trying to figure out what is wrong from the results above. I have a sense that there is a malfunction in the VTP and DTP processes, but I really can't prove it so far; we may need some help from the Cisco support team now.

I suggest temporarily disabling the VTP process and checking the results. What's your opinion?

Hi Saif,

We've got a change scheduled to remove mls qos, to take those drops out of the situation.
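
For reference, on the 3750s that change is essentially just the following (a sketch; once mls qos is disabled globally the queue-set buffer carving no longer applies, so those egress queue drops should stop incrementing):

configure terminal
 no mls qos
 end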

Interesting point on VTP & DTP. These devices are primary servers in their own right, so I'm unsure where that issue would slot into the problem we are seeing. What impact would disabling the VTP process have on the operation of the 3750? (Would going to VTP transparent and then back to server be sufficient?)
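
If it comes to trying it, the toggle would be roughly as below (a sketch only; note that with VTP version 3 the switch has to be re-promoted afterwards):

configure terminal
 vtp mode transparent
 end

and later, to revert and re-take the VTPv3 primary server role for the VLAN feature:

configure terminal
 vtp mode server
 end
vtp primary vlan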
