Re: A record in adjacency table has the (incomplete) status

vakulenko.vv · 08-10-2023

Hello guys! I urgently need your help!

We are a local ISP. One of our city cores is a Cisco 6509E.

Core.KamPod.C6509E#show version

Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9-M), Version 15.1(2)SY16, RELEASE SOFTWARE (fc2)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2020 by Cisco Systems, Inc.
Compiled Fri 03-Jan-20 04:10 by prod_rel_team

ROM: System Bootstrap, Version 12.2(17r)SX7, RELEASE SOFTWARE (fc1)
BOOTLDR: Cisco IOS Software, s72033_rp Software (s72033_rp-ADVENTERPRISEK9-M), Version 15.1(2)SY16, RELEASE SOFTWARE (fc2)

Core.KamPod.C6509E uptime is 10 hours, 33 minutes
Uptime for this control processor is 10 hours, 33 minutes
System returned to ROM by reload at 02:49:37 EET Thu Aug 10 2023 (SP by reload)
System restarted at 02:53:01 EET Thu Aug 10 2023
System image file is "sup-bootdisk:s72033-adventerprisek9-mz.151-2.SY16.bin"
Last reload reason: Reload Command

cisco WS-C6509-E (R7000) processor (revision 1.3) with 983008K/65536K bytes of memory.
Processor board ID SMC1110004N
SR71000 CPU at 600Mhz, Implementation 0x504, Rev 1.2, 512KB L2 Cache
Last reset from s/w reset
538 Virtual Ethernet interfaces
99 Gigabit Ethernet interfaces
26 Ten Gigabit Ethernet interfaces
1917K bytes of non-volatile configuration memory.

65536K bytes of Flash internal SIMM (Sector size 512K).
Configuration register is 0x2102

Core.KamPod.C6509E#show module
Mod Ports Card Type                              Model              Serial No.
--- ----- -------------------------------------- ------------------ -----------
  1   48  CEF720 48 port 1000mb SFP              WS-X6748-SFP       SAL1444Y6ZU
  3   48  CEF720 48 port 10/100/1000mb Ethernet  WS-X6748-GE-TX     SAL1035ZUDL
  4   16  CEF720 16 port 10GE                    WS-X6716-10GE      SAL1246A0FZ
  5    5  Supervisor Engine 720 10GE (Active)    VS-S720-10G        SAL1543TGPE
  7    8  CEF720 8 port 10GE with DFC            WS-X6708-10GE      SAD10480A32

Mod MAC addresses                      Hw     Fw           Sw           Status
--- ---------------------------------- ------ ------------ ------------ -------
  1 1cdf.0f1a.6080 to 1cdf.0f1a.60af   2.4    12.2(18r)S1  15.1(2)SY16  Ok
  3 0018.ba3e.d8a0 to 0018.ba3e.d8cf   2.4    12.2(14r)S5  15.1(2)SY16  Ok
  4 0023.0455.62c8 to 0023.0455.62d7   1.0    12.2(18r)S1  15.1(2)SY16  Ok
  5 5475.d07b.304c to 5475.d07b.3053   4.1    8.5(4)       15.1(2)SY16  Ok
  7 001a.2f00.51f4 to 001a.2f00.51fb   1.1    12.2(18r)S1  15.1(2)SY16  Ok

Mod  Sub-Module                  Model              Serial      Hw      Status
---- --------------------------- ------------------ ----------- ------- -------
  1  Distributed Forwarding Card WS-F6700-DFC3CXL   SAL1321QPEQ 1.3     Ok
  3  Distributed Forwarding Card WS-F6700-DFC3CXL   SAL12372P0H 1.6     Ok
  4  Distributed Forwarding Card WS-F6700-DFC3CXL   SAL1247ATFB 1.2     Ok
  5  Policy Feature Card 3       VS-F6K-PFC3CXL     SAD120307H2 1.0     Ok
  5  MSFC3 Daughterboard         VS-F6K-MSFC3       SAL1540S5CN 5.1     Ok
  7  Distributed Forwarding Card WS-F6700-DFC3CXL   SAD104806WS 1.0     Ok

Mod  Online Diag Status
---- -------------------
  1  Pass
  3  Pass
  4  Pass
  5  Pass
  7  Pass

We have many different VLANs with subscribers. Every VLAN interface is unnumbered and linked to a Loopback interface that holds the IP addresses. For example:

interface Vlan508
ip unnumbered Loopback2
ip helper-address 172.16.255.9
ip helper-address 172.16.255.56
ip policy route-map RMAP_NAT
end

interface Loopback2
ip address 10.20.32.1 255.255.255.0 secondary
ip address 172.20.32.1 255.255.255.0
no ip redirects
end

We have 2 DHCP servers: 172.16.255.9 and 172.16.255.56.
Their databases depend on the billing system. If the subscriber's account has money, the subscriber's host receives an IP from 172.X.X.X, otherwise from 10.X.X.X.
After that, the city core has a DHCP route to the subscriber's host.

Core.KamPod.C6509E#show ip route dhcp 172.20.32.161
S 172.20.32.161/32 is directly connected, Vlan508
DHCP Server: 172.16.255.9 Lease expires at Aug 10 2023 02:17 PM

The route-map RMAP_NAT checks the subscriber's IP address. If it is from 172.X.X.X, IP packets from the host are redirected to the NAT and get Internet access; otherwise the host sees only local resources.
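
A minimal Python sketch of the decision described above, for illustration only. The actual route-map and its ACLs are not shown in this thread, so the two prefixes below (172.16.0.0/12 and 10.0.0.0/8) and the function name are assumptions that stand in for "172.X.X.X" and "10.X.X.X".

```python
import ipaddress

# Assumed prefixes: the post only says "172.X.X.X" and "10.X.X.X".
PAID_NET = ipaddress.ip_network("172.16.0.0/12")
UNPAID_NET = ipaddress.ip_network("10.0.0.0/8")


def classify(source_ip: str) -> str:
    """Return the treatment a packet from source_ip would get.

    "nat"        -> redirected to the NAT (Internet access)
    "local-only" -> normal routing, local resources only
    "unmatched"  -> address outside both assumed ranges
    """
    addr = ipaddress.ip_address(source_ip)
    if addr in PAID_NET:
        return "nat"
    if addr in UNPAID_NET:
        return "local-only"
    return "unmatched"


if __name__ == "__main__":
    for ip in ("172.20.32.160", "10.20.32.15"):
        print(ip, classify(ip))
```

This only models which branch a source address falls into; it says nothing about how the 6509E forwards the packet afterwards.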
This scheme has been working for many years, but suddenly, out of nowhere, a problem appeared.

Two subscribers can be inside the same VLAN. Both successfully receive IP addresses from a DHCP server, but the first one has access to the Internet and local resources and the second one doesn't.

For example, a normal host 172.20.32.160:

Core.KamPod.C6509E#show ip route dhcp 172.20.32.160
S 172.20.32.160/32 is directly connected, Vlan508
DHCP Server: 172.16.255.9 Lease expires at Aug 10 2023 02:27 PM

Core.KamPod.C6509E#ping 172.20.32.160
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.20.32.160, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/12 ms

It can be pinged from the NAT host:

NAT-1.KamPod.A10#ping 172.20.32.160
PING 172.20.32.160 (172.20.32.160) 56(84) bytes of data.
64 bytes from 172.20.32.160: icmp_seq=1 ttl=63 time=13.4 ms
64 bytes from 172.20.32.160: icmp_seq=2 ttl=63 time=10.0 ms
64 bytes from 172.20.32.160: icmp_seq=3 ttl=63 time=13.1 ms
^C
--- 172.20.32.160 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms

Abnormal host 172.20.32.161:

Core.KamPod.C6509E#show ip route dhcp 172.20.32.161
S 172.20.32.161/32 is directly connected, Vlan508
DHCP Server: 172.16.255.9 Lease expires at Aug 10 2023 02:27 PM

Core.KamPod.C6509E#ping 172.20.32.161
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 172.20.32.161, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms

It is unreachable from any other network (other VLANs, other cities); it can be reached only from within its own VLAN.

NAT-1.KamPod.A10#ping 172.20.32.161
PING 172.20.32.161 (172.20.32.161) 56(84) bytes of data.
^C
--- 172.20.32.161 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4055ms

Both hosts - normal and abnormal - can ping each other.

I checked both hosts and found the information below:

Normal host:

Core.KamPod.C6509E#show adjacency 172.20.32.160
Protocol Interface Address
IP Vlan508 172.20.32.160(8)
Core.KamPod.C6509E#show adjacency 172.20.32.160 detail
Protocol Interface Address
IP Vlan508 172.20.32.160(8)
3 packets, 294 bytes
epoch 1
sourced in sev-epoch 0
Encap length 14
045EA4C8FF23001AE3F728000800
L2 destination address byte offset 0
L2 destination address byte length 6
Link-type after encap: ip
ARP
Core.KamPod.C6509E#show ip arp 172.20.32.160
Protocol Address Age (min) Hardware Addr Type Interface
Internet 172.20.32.160 4 045e.a4c8.ff23 ARPA Vlan508

Abnormal host:

Core.KamPod.C6509E#show adjacency 172.20.32.161
Protocol Interface Address
IP Vlan508 172.20.32.161(5) (incomplete)
Core.KamPod.C6509E#show adjacency 172.20.32.161 detail
Protocol Interface Address
IP Vlan508 172.20.32.161(5) (incomplete)
1015480 packets, 1496540801 bytes
epoch 1
sourced in sev-epoch 0
punt (rate-limited) packets
no src set
Core.KamPod.C6509E#show ip arp 172.20.32.161
Protocol Address Age (min) Hardware Addr Type Interface
Internet 172.20.32.161 0 0495.e61f.cbb0 ARPA Vlan508
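
The mismatch shown above (a resolved ARP entry, yet an incomplete adjacency) could be detected automatically. Below is a minimal Python sketch, assuming the two command outputs have already been captured as text; the regular expressions are based only on the output format pasted in this post, and the function name is hypothetical.

```python
import re

# Matches resolved ARP rows such as:
#   Internet 172.20.32.161 0 0495.e61f.cbb0 ARPA Vlan508
ARP_RE = re.compile(
    r"^Internet\s+(\d{1,3}(?:\.\d{1,3}){3})\s+\S+\s+"
    r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\s+ARPA",
    re.IGNORECASE,
)
# Matches incomplete adjacency rows such as:
#   IP Vlan508 172.20.32.161(5) (incomplete)
ADJ_RE = re.compile(
    r"^IP\s+\S+\s+(\d{1,3}(?:\.\d{1,3}){3})\(\d+\)\s+\(incomplete\)"
)


def stale_adjacencies(arp_output, adjacency_output):
    """Return IPs that have a resolved ARP entry but an incomplete adjacency."""
    resolved = set()
    for line in arp_output.splitlines():
        match = ARP_RE.match(line.strip())
        if match:
            resolved.add(match.group(1))

    stale = []
    for line in adjacency_output.splitlines():
        match = ADJ_RE.match(line.strip())
        if match and match.group(1) in resolved:
            stale.append(match.group(1))
    return stale
```

Hosts whose ARP entry is itself unresolved are deliberately skipped, since for them an incomplete adjacency is expected.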

If I run the command

clear ip arp 172.20.32.161

the abnormal host starts to work perfectly and its adjacency record looks OK... but not forever. Some time later the problem can come back.

If I use the command

clear arp-cache 172.20.32.161

it doesn't help.

If I clear the DHCP route with the command "clear ip route dhcp 172.20.32.161", it helps too, once the host re-receives its IP within a couple of minutes (our DHCP lease time is 5 minutes).

I don't know how to resolve the problem permanently, so I need your help and advice!

Giuseppe Larosa · 08-11-2023

Hello @vakulenko.vv ,

Your network scenario is quite complex.

Let's start from the following show command:

>>

Core.KamPod.C6509E#show adjacency 172.20.32.161
Protocol Interface Address
IP Vlan508 172.20.32.161(5) (incomplete)
Core.KamPod.C6509E#show adjacency 172.20.32.161 detail
Protocol Interface Address
IP Vlan508 172.20.32.161(5) (incomplete)
1015480 packets, 1496540801 bytes
epoch 1
sourced in sev-epoch 0
punt (rate-limited) packets
no src set
Core.KamPod.C6509E#show ip arp 172.20.32.161
Protocol Address Age (min) Hardware Addr Type Interface
Internet 172.20.32.161 0 0495.e61f.cbb0 ARPA Vlan508

If I run the command

clear ip arp 172.20.32.161

the abnormal host starts to work perfectly and its adjacency-record looks OK... but not forever.

---------------------------------------------------------------------

My notes below:

Seeing that the packet and byte counters are quite high for the abnormal host, we can suppose that the CEF entry and the adjacency table were previously populated correctly, but at some point the ARP entry disappears, the entry is removed from the adjacency table, and the CEF table reports it as incomplete.

By issuing clear ip arp 172.20.32.161, the ARP process is triggered, the ARP entry is relearned, and the entry is inserted into the adjacency table again.

On the other hand, you have a Sup720 with PFC3CXL, and all the DFCs are 3CXL.

How big is the ARP table on the device?

I would expect your switch to be able to deal with several thousands of ARP entries with no problem.

As a first step I would reduce the arp timeout on the affected VLAN 508 to see if this helps.

For example, to 10 minutes instead of the default 4 hours, just to see if this helps the device keep the CEF entry in the table.

Warning: it is important to know the ARP table size before making the change. ARP activity will increase with a reduced timer.

Hope to help

Giuseppe

vakulenko.vv · 08-11-2023

Hello Giuseppe! Thank you very much for your attention to the situation I described!

I will answer you step by step, quoting your points:

Seeing that the packet counters and byte counters are quite high for the abnormal host, we can suppose that previously the CEF entry and the adjacency table were correctly populated, but at some point in time the ARP entry disappears, the entry is removed from the adjacency table and the CEF table reports incomplete.

This is true. Any normally working host can suddenly get an "incomplete" record in the adjacency table and stop working normally.

How big is the ARP table on the device?

At this moment:

Core.KamPod.C6509E#show arp summary
Total number of entries in the ARP table: 12075.
Total number of Dynamic ARP entries: 11754.
Total number of Incomplete ARP entries: 282.
Total number of Interface ARP entries: 38.
Total number of Static ARP entries: 1.
Total number of Alias ARP entries: 0.
Total number of Mobile ARP entries: 0.
Total number of Simple Application ARP entries: 0.
Total number of Application Alias ARP entries: 0.
Total number of Application Timer ARP entries: 0.

I think 12k entries is not too many, is it?
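
As a rough back-of-the-envelope check of that table size against the ARP timeouts discussed in this thread (4 hours by default, 10 minutes as proposed), here is a small Python sketch. It assumes every dynamic entry is refreshed exactly once per timeout, evenly spread over time, and that the timeout applies to all 11754 dynamic entries; real ARP behaviour on the switch may differ, and the function name is hypothetical.

```python
def arp_refreshes_per_second(dynamic_entries: int, timeout_seconds: int) -> float:
    """Average refresh rate if each entry is refreshed once per timeout."""
    return dynamic_entries / timeout_seconds


DYNAMIC_ENTRIES = 11754  # "Total number of Dynamic ARP entries" from the output above

default_rate = arp_refreshes_per_second(DYNAMIC_ENTRIES, 4 * 60 * 60)  # 4 hours
reduced_rate = arp_refreshes_per_second(DYNAMIC_ENTRIES, 10 * 60)      # 10 minutes

if __name__ == "__main__":
    print(f"default timeout: {default_rate:.2f} refreshes/s")
    print(f"10-minute timeout: {reduced_rate:.2f} refreshes/s")
```

Under these assumptions the average goes from under one refresh per second to roughly twenty per second if the shorter timeout were applied to every VLAN.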

As a first step I would reduce the arp timeout on the affected VLAN 508 to see if this helps. For example, to 10 minutes instead of the default 4 hours, just to see if this helps the device to keep the CEF entry in the table.

I'll try this advice ASAP and will let you know the result. BTW, all our VLANs are affected...

PS: For now I have created a couple of scripts. Script #1 runs permanently on a Linux host connected to our Cisco 6509E and checks the adjacency table there every minute using the filter "show adjacency | include Vlan.*incomplete". After parsing the output, it has a list of the IPs of abnormal hosts. For every record in this list it prepends the string "clear ip arp ". So we have a list containing many rows:

clear ip arp X.X.X.X
clear ip arp Y.Y.Y.Y
clear ip arp Z.Z.Z.Z

and so on.
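
The parsing step of script #1 is not shown in this thread; the following is only a minimal Python sketch of what it might look like. It assumes the filtered "show adjacency" output is already available as text (how it is fetched from the switch is out of scope), and the function names are hypothetical.

```python
import re
from typing import List

# Matches rows such as:  IP Vlan508 172.20.32.161(5) (incomplete)
INCOMPLETE_RE = re.compile(
    r"^IP\s+Vlan\d+\s+(\d{1,3}(?:\.\d{1,3}){3})\(\d+\)\s+\(incomplete\)\s*$"
)


def build_clear_commands(show_adjacency_output: str) -> List[str]:
    """Turn incomplete adjacency rows into "clear ip arp <ip>" commands."""
    commands = []
    seen = set()
    for line in show_adjacency_output.splitlines():
        match = INCOMPLETE_RE.match(line.strip())
        if not match:
            continue
        ip = match.group(1)
        if ip not in seen:  # emit each host only once
            seen.add(ip)
            commands.append("clear ip arp " + ip)
    return commands


def render_arp_cfg(commands: List[str]) -> str:
    """Render the command list as the text to be saved as arp.cfg."""
    return "".join(command + "\n" for command in commands)
```

The rendered text is one command per line, which is the shape of the list shown above.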

After that, it re-creates the file "arp.cfg" on our TFTP server, saves this list there, and calls TCL script #2 on the Cisco 6509E with the command "tclsh disk0://SCRIPTS/execute.tcl"

execute.tcl:

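# Read the list of "clear ip arp ..." commands from the TFTP server
set fp [open "tftp://172.16.255.1/arp.cfg" r]
set file_data [read $fp]
close $fp
# Run each non-empty line as an IOS exec command
foreach line [split $file_data "\n"] {
    set line [string trim $line]
    if {[string length $line] > 0} {
        exec $line
    }
}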

execute.tcl downloads arp.cfg and runs its contents (our list of "clear ip arp" commands).

This is certainly not an elegant solution but a crutch made "of sticks and tape", yet it can give us enough time to find the source of our problem and a correct solution...

vakulenko.vv · 08-24-2023

"As a first step I would reduce the arp timeout on the affected VLAN 508 to see if this helps.

For example, to 10 minutes instead of the default 4 hours, just to see if this helps the device to keep the CEF entry in the table."

I tried this, but it didn't help. An abnormal host is constantly sending packets, so its ARP entry is always refreshed by the 6509.