Ask the Expert: Switch and IOS Architecture and Unexpected Reboots on all Cisco Catalyst Switches - Page 3

ciscomoderator · ‎09-11-2015

This session will provide an opportunity to learn and ask questions about Cisco Catalyst Switches IOS architecture, and how to troubleshoot any unexpected reboots and other errors on switches.

Ask questions from Monday, October 5 to Friday, October 16, 2015

Featured Experts

Ivan Shirshin is a customer support engineer in High-Touch Technical Services (HTTS). He is an expert on Routing, LAN Switching and Data Center products. His areas of expertise include Cisco Catalyst 2000, 3000, 4000, 6500, Cisco Nexus 7000, ISRs, as well as Cisco routers ASR1000, 7600, 10000 and XR platforms. He has over 7 years of industry experience working with large Enterprise and Service Provider networks. Shirshin holds a CCNA, CCNP, CCDP, and CCIE (# 43481) in routing and switching, as well as XR specialist certifications.

Naveen Venkateshaiah is a customer support engineer in High-Touch Technical Services (HTTS). He is an expert on Routing, LAN Switching and Data Center products. His areas of expertise include Cisco Catalyst 3000, 4000, 6500, and Cisco Nexus 7000. He has over 7 years of industry experience working with large Enterprise and Service Provider networks. Venkateshaiah holds CCNA, CCNP, CCDP-ARCH, AWLANFE, and LCSAWLAN certifications. He is currently working to obtain a CCIE in routing and switching.

Find other events at https://supportforums.cisco.com/expert-corner/events.

** Ratings Encourage Participation! **
Please be sure to rate the Answers to Questions

Naveen Venkateshaiah · ‎10-09-2015

Hi Semaj,

We need to check for a crashinfo file on the switch to find the cause of the reload. If there is no crashinfo file, check the output of show stacks, which reports the stack usage of processes and interrupt routines. In this case, the switch shows no error and no crash information.

From the show version output, the reason for the last reload was "power-on", which means that something is wrong with either the power source or the power cables.

The switch might have reloaded due to a power fluctuation. Look for logs stored on the syslog server around the time of the reload.

Normally, to find the cause of a crash, we need to check the show tech output from the switch and the crashinfo file that is generated and stored in the switch flash memory.
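If you collect show version output from many switches, a small script can pull out the reload reason for you. The sketch below is only an illustration: the sample text is made up, and the two line formats it looks for ("System returned to ROM by ..." and "Last reload reason: ...") are the ones commonly seen in IOS output, so adjust the patterns to what your platform actually prints.

```python
import re

# Line formats assumed here; verify them against your own 'show version' output.
RELOAD_PATTERNS = [
    r"^System returned to ROM by (.+?)\s*$",
    r"^Last reload reason:\s*(.+?)\s*$",
]


def reload_reason(show_version: str):
    """Return the reload reason found in 'show version' text, or None."""
    for pattern in RELOAD_PATTERNS:
        match = re.search(pattern, show_version, re.MULTILINE | re.IGNORECASE)
        if match:
            return match.group(1)
    return None


# Made-up sample, not taken from the switch discussed in this thread.
sample = """\
Switch uptime is 2 days, 3 hours, 14 minutes
System returned to ROM by power-on
System image file is "flash:/example.bin"
"""

print(reload_reason(sample))  # power-on
```

A result of "power-on" with no crashinfo file is the pattern described above, so the next step would be the power source and the syslog server rather than the software.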

Let me know if you have any further doubt.

Regards,
Naveen Venkateshaiah.

icgonzales · ‎10-09-2015

Hello Experts,

We encountered high CPU usage on our 6500 switch. First, we are getting this error log message:

*Sep 6 16:45:08.947 PST: %EARL_NETFLOW-SP-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [92%]
*Sep 6 16:48:11.685 PST: %EARL_NETFLOW-SP-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization [92%]

Saw consistent overutilization on the TCAM netflow table
Earl in Module 5
Summary of Netflow CAM Utilization (as a percentage)
====================================================

TCAM Utilization : 86%

After changing the packet-based sampling to 4096, the TCAM utilization dropped to 26%

Summary of Netflow CAM Utilization (as a percentage)
====================================================

TCAM Utilization : 26%

Cisco TAC is still seeing some traffic coming from VLAN 56 that is possibly causing the CPU spikes
------- dump of incoming inband packet -------
interface Vl56, routine draco2_process_rx_packet_inline
dbus info: src_vlan 0x38(56), src_indx 0x4A(74), len 0x56(86)
bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)
38020400 00380000 004A0000 56000000 002F0438 00000400 00000000 0380A9D1
mistral hdr: req_token 0x0(0), src_index 0x4A(74), rx_offset 0x76(118)
requeue 0, obl_pkt 0, vlan 0x38(56)
destmac 00.19.A9.9D.FA.C0, srcmac 00.D0.83.05.A0.73, protocol 0800
protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 68, identifier 13053 df 0, mf 0, fo 0, ttl 30, src 192.168.56.195, dst 10.0.0.1, proto 47

Based on the findings gathered by Cisco TAC, best practice is to configure the WCCP assignment method as MASK: with HASH assignment, CPU utilization can reach up to 90% when the switch receives more than 750 CPS (connections per second). Upon checking, the WCCP 2 assignment status is “HASH” and the WCCP 1 assignment status is “MASK”.

any opinions?

Ivan Shirshin · ‎10-10-2015

Hello,

I will answer the issues you listed separately below.

First issue is the Netflow TCAM notifications occurring repeatedly.

The messages "%EARL_NETFLOW-4-TCAM_THRLD: Netflow TCAM threshold exceeded, TCAM Utilization" indicate that the NetFlow ternary content addressable memory (TCAM) is almost full. The Supervisor Engine 720 checks how full the NetFlow table is every 30 seconds. The Supervisor Engine turns on aggressive aging when the table size reaches 90 percent.

The idea behind aggressive aging is that the table is nearly full, so there are new active flows that cannot be created. Therefore, it makes sense to aggressively age-out the less active flows (or inactive flows) in the table in order to make space for more active flows.
The capacity for each policy feature card (PFC) NetFlow table (IPv4), for PFC3A and PFC3B, is 128,000 flows. For the PFC3BXL, the capacity is 256,000 flows.
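To put the percentages into flow counts, here is a small worked calculation. It uses only the capacities and the 90 percent threshold quoted above; which PFC is installed in the switch in question is not stated in the thread, so the PFC3B used in the loop is an assumption.

```python
# IPv4 NetFlow table capacities quoted above, in flows per PFC.
PFC_CAPACITY = {"PFC3A": 128_000, "PFC3B": 128_000, "PFC3BXL": 256_000}

# Utilization at which the supervisor turns on aggressive aging.
AGGRESSIVE_AGING_PCT = 90


def flows_in_use(pfc: str, utilization_pct: int) -> int:
    """Approximate number of flow entries for a given utilization."""
    return PFC_CAPACITY[pfc] * utilization_pct // 100


def aggressive_aging(utilization_pct: int) -> bool:
    """True when the table is full enough to trigger aggressive aging."""
    return utilization_pct >= AGGRESSIVE_AGING_PCT


# Utilization values reported in the question; PFC3B is an assumption.
for pct in (92, 86, 26):
    print(pct, flows_in_use("PFC3B", pct), aggressive_aging(pct))
# 92 117760 True
# 86 110080 False
# 26 33280 False
```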

This issue with Netflow TCAM may happen when you set the NetFlow mask to "full" mode or there are too many flows - TCAM for NetFlow can overflow because there are so many entries. WCCP also uses Netflow resources in its operation. You can use the "show mls netflow ip count" command in order to check how many entries are in the Netflow table. Another solution to reduce the number of entries is to change the sampling - which you did by changing the packet-based sampling to 4096.

Note that TCAM for packet forwarding and TCAM for NetFlow accounting are separate, so there is no impact to packet forwarding because of this issue.

Second issue is related to CPU spikes on the switch. Looking into the packet dump you collected, it seems that packet is sent to CPU for processing - dest_indx 0x380 is a Unicast packet punted to CPU - CPU port (15/1).

------- dump of incoming inband packet -------
interface Vl56, routine draco2_process_rx_packet_inline
dbus info: src_vlan 0x38(56), src_indx 0x4A(74), len 0x56(86)
bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x380(896)
38020400 00380000 004A0000 56000000 002F0438 00000400 00000000 0380A9D1
mistral hdr: req_token 0x0(0), src_index 0x4A(74), rx_offset 0x76(118)
requeue 0, obl_pkt 0, vlan 0x38(56)
destmac 00.19.A9.9D.FA.C0, srcmac 00.D0.83.05.A0.73, protocol 0800
protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 68, identifier 13053 df 0, mf 0, fo 0, ttl 30, src 192.168.56.195, dst 10.0.0.1, proto 47

Due to these packets there was likely high CPU utilization by interrupts.

You mentioned that this traffic source was related to WCCP. Also, protocol type shows 47 - which is GRE.
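When there are many such dumps to go through, the interesting fields can be extracted with a few lines of script. This is a rough sketch that only understands the "protocol ip:" line format shown in the dump above; the function name and the small protocol table are my own for illustration.

```python
import re

# A few well-known IP protocol numbers (IANA assigned).
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP", 47: "GRE"}


def parse_inband_ip_line(line: str) -> dict:
    """Extract ttl, proto, src and dst from a 'protocol ip:' dump line."""
    fields = {}
    for key in ("ttl", "proto"):
        match = re.search(rf"\b{key} (\d+)", line)
        if match:
            fields[key] = int(match.group(1))
    for key in ("src", "dst"):
        match = re.search(rf"\b{key} (\d+\.\d+\.\d+\.\d+)", line)
        if match:
            fields[key] = match.group(1)
    if "proto" in fields:
        fields["proto_name"] = IP_PROTOCOLS.get(fields["proto"], "unknown")
    return fields


line = ("protocol ip: version 0x04, hlen 0x05, tos 0x00, totlen 68, "
        "identifier 13053 df 0, mf 0, fo 0, ttl 30, "
        "src 192.168.56.195, dst 10.0.0.1, proto 47")

print(parse_inband_ip_line(line))

# The dbus fields print hex with the decimal value in parentheses,
# for example dest_indx 0x380(896):
print(int("0x380", 16))  # 896
```

Counting the parsed packets per source and protocol is a quick way to see which flow dominates the punted traffic.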

The reason this causes high CPU is indeed that there is HASH assignment instead of MASK. Using HASH is not recommended on Catalyst 6500 switches.

The assignment method in WCCP determines how traffic will be distributed among multiple WCCP clients in a given service group. There are two assignment methods available, hash-based and mask-based. The assignment method chosen for a given service-group is negotiated between the router and the WCCP clients.
The negotiation of the assignment method is performed between the router and the clients via the WCCPv2 ISU and WCCPv2 HIA messages, respectively. The Cisco Catalyst 6500 supports both the hash-based and mask-based assignment methods and will advertise these capabilities in its ISU messages. The WCCP client must be configured for mask-based assignment and then implicitly choose the mask-based assignment method by first observing the supported method in the router's ISU message and then advertising mask-based assignment in its subsequent HIA messages.

The hash-based assignment method is the default and will be chosen unless the client is configured to support the mask-based assignment method.

1. MASK assignment:
The combination of an ingress traffic intercept method with mask-based assignment provides a full hardware-based traffic assignment method. This means that CPU resources are not used with this type of assignment, so there would not be CPU spikes.

Traffic is filtered for WCCP redirection using an Access Control List. The WCCP mask value is then applied to the redirect ACL to create entries in the Cisco Catalyst 6500 ACL TCAM[3]. The TCAM entries are used to provide hardware accelerated lookups and to derive a specific WCCP client which will service the traffic flow. In this way the forwarding path is performed completely in the Cisco Catalyst 6500 hardware resources.

2. Hash-based assignment method is supported but not recommended on the Cisco Catalyst 6500. A hash-based assignment method will utilize a combination of software and hardware forwarding resources. Traffic flows will need to be forwarded via software initially while also setting up flow entries using the Cisco Catalyst 6500 Netflow resources. This approach is certainly viable for some deployments but is not the best practice solution for the Cisco Catalyst 6500.
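To illustrate why mask assignment fits in hardware, here is a simplified model of it. Every combination of the masked address bits forms one bucket, and each bucket is owned by one WCCP client, so the lookup is a fixed table that can be programmed into TCAM. The mask value 0x3, the client names and the round-robin distribution are assumptions made for this example only; in a real deployment the mask and the bucket ownership are chosen by the WCCP clients.

```python
import ipaddress


def mask_buckets(mask: int) -> list:
    """List every value the masked bits can take (one bucket per value)."""
    bit_positions = [i for i in range(mask.bit_length()) if (mask >> i) & 1]
    buckets = []
    for n in range(1 << len(bit_positions)):
        value = 0
        for j, pos in enumerate(bit_positions):
            if (n >> j) & 1:
                value |= 1 << pos
        buckets.append(value)
    return buckets


def assign(mask: int, clients: list) -> dict:
    """Spread buckets over clients round-robin (an assumption, see above)."""
    return {b: clients[i % len(clients)] for i, b in enumerate(mask_buckets(mask))}


def lookup(table: dict, mask: int, address: str) -> str:
    """Pick the client for an address: AND with the mask, then table lookup."""
    return table[int(ipaddress.ip_address(address)) & mask]


MASK = 0x3                       # hypothetical mask: two bits, four buckets
table = assign(MASK, ["cache-A", "cache-B"])

print(table)                     # {0: 'cache-A', 1: 'cache-B', 2: 'cache-A', 3: 'cache-B'}
print(lookup(table, MASK, "192.168.56.195"))  # cache-B
```

Because the table is fully known in advance, no per-flow software step is needed, which is the difference from hash assignment described above.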

Kind Regards,
Ivan

eagles-nest · ‎10-12-2015

Hi

I wonder if you could clarify the Spanning tree instance limits on lower end switches such as 2960's, 3560's and 3750's. The documentation states they are limited to 128 spanning tree instances. I previously thought that a spanning tree instance is created per port and per vlan. When I do the command "show spanning-tree summary totals" I thought the number of spanning tree instances was reflected in the STP active column. However, in testing on a switch with a documented limit of 128 instances I have no problem creating vlans until I hit the 128th vlan, even though I have 3 trunks on the switch and the STP active value is over 300 at that stage.

So in short is a spanning tree instance just a single instance per vlan no matter how many ports are in the vlan ?

Thanks, Stuart.

Naveen Venkateshaiah · ‎10-13-2015

Hi Stuart,

"is a spanning tree instance just a single instance per vlan no matter how many ports are in the vlan",

Yes for PVST type of STP.

The limitations are as follows on switches running PVST, PVST+ or Rapid-PVST:

2950 SI: Maximum 64 STP instances, Maximum 128 VLANs.

2950 EI: Maximum 64 STP instances, Maximum 250 VLANS.

3550, 3560, 3750: Maximum 128 STP instances, Maximum 1005 VLANs.

If you exceed the STP instance limit then you'll get an error like this:

“SPANTREE_VLAN_SW-2-MAX_INSTANCE: Platform limit of 64 STP instances exceeded. No instance created for VLANxxx”

The maximum number of Per VLAN Spanning Tree instances on the 6500 switch is 128. In your case, after you grow past 64 (or 128) VLANs, you will need to configure MSTP and begin grouping VLANs into common Spanning Tree instances. The 2950 and the 6500 both support MSTP. For a detailed description of how to configure it, check this link:
Catalyst 2950 and Catalyst 2955 Switch Software Configuration Guide
Configuring MSTP

http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst2950/software/release/12-1_14_ea1/configuration/guide/2950scg/swmstp.html
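The difference between instances and port counts can be shown with a short calculation. It is only a sketch: the trunk names and the VLAN-to-MST mapping are invented, and it assumes that the STP active column counts port and VLAN pairs, which would explain a value over 300 with 3 trunks.

```python
def pvst_instances(vlans) -> int:
    """PVST, PVST+ and Rapid-PVST run one instance per VLAN."""
    return len(set(vlans))


def port_vlan_pairs(ports: dict) -> int:
    """Count (port, VLAN) pairs; ports maps a port name to the VLANs it carries."""
    return sum(len(carried) for carried in ports.values())


def mst_instances(vlan_to_instance: dict, vlans) -> int:
    """MST runs one instance per group; unmapped VLANs stay in instance 0."""
    return len({vlan_to_instance.get(vlan, 0) for vlan in vlans})


vlans = range(1, 129)                                   # 128 VLANs
trunks = {name: set(vlans) for name in ("trunk1", "trunk2", "trunk3")}

print(pvst_instances(vlans))    # 128, at the platform limit
print(port_vlan_pairs(trunks))  # 384, far above 128 but not what is limited

# Hypothetical MST mapping: VLANs 1-64 in instance 1, 65-128 in instance 2.
mapping = {vlan: 1 if vlan <= 64 else 2 for vlan in vlans}
print(mst_instances(mapping, vlans))  # 2
```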

Regards,

Naveen

Rojer-bkk · ‎10-13-2015

Hello Ivan, Naveen,

For SUP7-E, SUP8-E, how can i monitor CPU by separate core?
Can you advise for OID? Thx in advance

Naveen Venkateshaiah · ‎10-13-2015

Hi Rojer,

There is no SNMP MIB to monitor the CPU usage of individual CPU cores.
A core is not a separate CPU: on the Sup7-E, two cores exist in a single CPU, and the latest SNMP MIB does not differentiate between the cores.
Alternatively, an EEM script can be used to send an email alert with the output of 'sh process cpu | include Core':

event manager applet CheckCPUCore
 event timer cron cron-entry "00 11 * * *" /----------- line (1)
 action 1.0 cli command "enable"
 action 2.0 cli command "show proc cpu | include Core"
 action 3.0 set cpu_output $_cli_result
 action 4.0 mail server <mail_server_IP address> to navevenk@cisco.com from logs@cisco.com subject "CPU Core 0 & 1 utilization" body "show proc cpu | i Core; current status is $_cli_result"
!
! If your router has TACACS configured, add the statement below too. Its purpose is to
! log the user that runs the script; it does not need a password.
!
event manager session cli username <TACACS username>

The "/--- line (N)" markers are labels for this explanation and are not part of the configuration.

Line (1) runs the applet once a day at 11:00. To run it every 5 minutes instead, replace line (1) with line (2):

 event timer cron name PERIODIC cron-entry "*/5 * * * *" /-------------- line (2)

To send the email only when the average CPU usage over the last 5 minutes exceeds a threshold (70%), replace line (1) with line (3) below. Once the average CPU usage drops to 30% or lower, the applet stops sending the command output by email.

 event snmp oid "1.3.6.1.4.1.9.9.109.1.1.1.1.8" get-type exact entry-op ge entry-val 70 exit-op le exit-val 30 poll-interval 5 /-------------- line (3)

EEM script options can be found from the below cisco doc:

http://www.cisco.com/en/US/docs/ios/netmgmt/command/reference/nm_06.html#wp1157622
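The entry and exit values in line (3) work as a hysteresis: the event fires when the value reaches the entry threshold and cannot fire again until the value has fallen to the exit threshold. The short simulation below is a simplified model of that behavior, not EEM code, and the sample values are invented.

```python
def triggers(samples, entry=70, exit_=30):
    """Return the indexes of the polls at which the event would fire."""
    armed = True
    fired = []
    for index, value in enumerate(samples):
        if armed and value >= entry:        # entry-op ge entry-val 70
            fired.append(index)
            armed = False                   # wait for the exit condition
        elif not armed and value <= exit_:  # exit-op le exit-val 30
            armed = True                    # re-arm for the next spike
    return fired


# Invented 5-minute CPU averages, one value per poll.
cpu = [50, 72, 85, 60, 28, 75]
print(triggers(cpu))  # [1, 5]
```

The polls at 85% and 60% do not fire again because the CPU has not yet dropped to 30%; the gap between the two thresholds is what keeps the mailbox from filling up during a long spike.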

Thanks,

Naveen Venkateshaiah

Pani Dharmawardana · ‎10-13-2015

Hello Ivan, Naveen

I'm trying to monitor a cat 3560G which is configured for storm control. I want to monitor the following OIDs via snmp but snmpwalk says these OIDs are not available.

cErrDisableInterfaceEventRev1 (1.3.6.1.4.1.9.9.548.0.2)
cErrDisableIfStatusCause (1.3.6.1.4.1.9.9.548.1.3.1.1.2)
portAdditionalOperStatus(1.3.6.1.4.1.9.5.1.4.1.1.23)

image I'm running is c3560-ipservicesk9-mz.122-55.SE10. According to http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?local=en

this image supports the above OIDs.

1.3.6.1.4.1.9.9.548.1 is available but not 1.3.6.1.4.1.9.9.548.0

1.3.6.1.4.1.9.9.548.1.2 is available but not 1.3.6.1.4.1.9.9.548.1.3

1.3.6.1.4.1.9.5.1.4.1.1 is available only up to 1.3.6.1.4.1.9.5.1.4.1.1.12

When I do a 'show snmp MIB' on the switch, all the above 3 are listed.

Any help is really appreciated.

Thanks

Pani

Ivan Shirshin · ‎10-13-2015

Hello Pani,

portAdditionalOperStatus is not supported on the 3560 devices. To check the errdisabled status of the ports, you would need to use cErrDisableIfStatusCause.

CISCO-STACK-MIB was primarily implemented for the CatOS and is not fully supported in 3560 and 3750.

You can see this issue explained in DDTS for IOS devices:

CSCdv75076 portAdditionalOperStatus support for Native IOS

To get information about err-disabled ports, you should use OID 1.3.6.1.4.1.9.9.548.1.3.1.1.2 (cErrDisableIfStatusCause) in CISCO-ERR-DISABLE-MIB on 3560 switch. The result is the table with reasons why the port is disabled, as described here:

http://tools.cisco.com/Support/SNMP/do/BrowseOID.do?local=en&translate=Translate&objectInput=1.3.6.1.4.1.9.9.548.1.3.1.1.2

When the system has no ports in the error-disabled state, this table is empty and doesn't contain any data. I see this OID was tested with 12.2(50)SE5 in our lab and worked fine. Can you double check it on your switch? If it does not work, I suggest testing with 12.2(50)SE5 and opening a service request with Cisco TAC if there is a discrepancy.

CISCO-ERR-DISABLE-MIB should be supported in your image. Note that it also has some traps like cErrDisableInterfaceEvent.

I want to additionally mention that for the cErrDisableInterfaceEventRev1 [1.3.6.1.4.1.9.9.548.0.2], from the description, you can see that interface identifiers are not contained in this notification but they are meant to be polled from the OIDs cErrDisableIfStatusCause [1.3.6.1.4.1.9.9.548.1.3.1.1.2] and cErrDisableIfStatusEntry [1.3.6.1.4.1.9.9.548.1.3.1.1], which correlates to the ifIndex and therefore identifies the interface that triggered the alarm.
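Once the table returns rows, the interface can be read from the instance part of each returned OID. The sketch below assumes the rows are indexed by ifIndex followed by a VLAN index; please confirm the INDEX clause in the MIB file for your release. The ifIndex 10101 in the example is invented.

```python
# cErrDisableIfStatusCause column, as given above.
BASE = "1.3.6.1.4.1.9.9.548.1.3.1.1.2"


def split_instance(oid: str, base: str = BASE):
    """Return the instance suffix of oid as a tuple of ints, or None if oid is not under base."""
    base_parts = base.split(".")
    parts = oid.lstrip(".").split(".")
    if parts[:len(base_parts)] != base_parts:
        return None
    return tuple(int(part) for part in parts[len(base_parts):])


# Invented walk result: ifIndex 10101, VLAN index 0 (index layout is an assumption).
print(split_instance("1.3.6.1.4.1.9.9.548.1.3.1.1.2.10101.0"))  # (10101, 0)

# An OID from another branch is rejected.
print(split_instance("1.3.6.1.4.1.9.9.548.1.2.0"))  # None
```

The first element can then be matched against ifIndex in IF-MIB to get the interface name.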

Kind Regards,

Ivan

Pani Dharmawardana · ‎10-14-2015

Thanks Ivan.

I tried again and 1.3.6.1.4.1.9.9.548.1.3.1.1.2 isn't available. Only up to 1.3.6.1.4.1.9.9.548.1.2 is available.

I'll try the image 12.2(50)SE5 and update you.

Thanks again.

Pani

Pani Dharmawardana · ‎10-14-2015

Hi Ivan,

Just tried the 12.2(50)SE5 image.

Switch Ports Model              SW Version            SW Image
------ ----- -----              ----------            ----------
*    1 52    WS-C3560G-48PS     12.2(50)SE5           C3560-IPSERVICESK9-M

OID 1.3.6.1.4.1.9.9.548.1.3.1.1.2 still isn't available. Only up to 1.3.6.1.4.1.9.9.548.1.2 is available.

Is it possible that there's a problem with this 3560 (license?)? I don't have another one to test.

Thanks

Pani

Ivan Shirshin · ‎10-16-2015

Hi Pani,

We are going to check that in the lab on our 3560 shortly.

Kind Regards,

Ivan

eroussos1 · ‎10-15-2015

Hello to all,

I'm looking to create a custom report through Cisco UCCX Historical Reports that tells me the number of repeat calls our organization is receiving per queue. We want to know when the repeat calls are coming in per day and what the queue is that the call is hitting. Is this something that would be easily customizable through Cisco? If so, what are the steps I would need to take in order to build this.

Thank you for your help,

Eric

Ivan Shirshin · ‎10-15-2015

Hello Eric,

This expert session is for Switch and IOS Architecture questions. You would need to contact the Unified Contact Center experts for a solution to your problem.

Kind Regards,

Ivan