CSCvh89372 - Memory leak in linux_iosd-image

alclark
Level 1

How would I determine if this memory leak is occurring?

2 Replies

David Spindola
Cisco Employee

Please be aware that you need TAC assistance to confirm a memory leak.

 

Memory in IOS XE can be monitored at the IOSd (Cisco IOS Daemon) level or at the Linux kernel level; this particular leak is identified at the kernel level.

 

I used a Catalyst 3650 running 16.3.5b to collect the command samples; the output can vary from platform to platform.

 

Step 1: Monitor the Used and Committed memory values. A leak is identified when memory usage increases but does not decrease over time; if memory usage returns to normal, you can rule out a memory leak.

 

Run show platform software status control-processor brief in order to check the kernel memory usage:

#show platform software status control-processor brief
Load for five secs: 4%/1%; one minute: 4%; five minutes: 5%

Memory (kB)
Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
1-RP0 Healthy  3978124  2514900 (63%)  1463224 (37%)   3162852 (80%)
2-RP0 Healthy  3978124  2453884 (62%)  1524240 (38%)   3094104 (78%)
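
If you want to automate this check, it can be scripted. The following is a minimal Python sketch, not a Cisco tool: it assumes you collect repeated samples of the command output yourself (for example over SSH on a timer) and it parses the Used and Committed columns from the table layout shown above.

import re

MEM_ROW = re.compile(
    r"^(?P<slot>\S+)\s+\S+\s+\d+\s+"      # Slot, Status, Total
    r"(?P<used>\d+)\s+\(\d+%\)\s+"        # Used (Pct)
    r"\d+\s+\(\d+%\)\s+"                  # Free (Pct)
    r"(?P<committed>\d+)\s+\(\d+%\)"      # Committed (Pct)
)

def parse_brief(output):
    """Return {slot: (used_kb, committed_kb)} from one command sample."""
    rows = {}
    for line in output.splitlines():
        m = MEM_ROW.match(line.strip())
        if m:
            rows[m["slot"]] = (int(m["used"]), int(m["committed"]))
    return rows

def report_trend(samples):
    """Compare the first and last samples; memory that only grows and
    never comes back down is the symptom to look for."""
    first, last = samples[0], samples[-1]
    for slot in first:
        delta_used = last[slot][0] - first[slot][0]
        delta_comm = last[slot][1] - first[slot][1]
        print("%s: Used %+d kB, Committed %+d kB across %d samples"
              % (slot, delta_used, delta_comm, len(samples)))

A leak shows up as deltas that stay positive across samples taken hours or days apart.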

 

It is possible to see a log message that warns about memory usage once it surpasses a certain threshold. This message is not exclusive to this issue, but it can be considered a symptom:

%PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand:  1/RP/0: Used Memory value 92% exceeds warning level 90%

 

The log message indicates the affected switch and processor; in this example, Switch 1 and its Route Processor (RP) are affected.
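
If you forward syslog to a collector, you can also alert on this message. A minimal sketch, assuming only the message format shown above (the field layout may differ per release and slot):

import re

WARN = re.compile(
    r"%PLATFORM-4-ELEMENT_WARNING: Switch (?P<switch>\d+) "
    r"R0/0: smand:.*Used Memory value (?P<used>\d+)% "
    r"exceeds warning level (?P<level>\d+)%"
)

line = ("%PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand:  1/RP/0: "
        "Used Memory value 92% exceeds warning level 90%")
m = WARN.search(line)
if m:
    print("switch %s used %s%% (threshold %s%%)"
          % (m["switch"], m["used"], m["level"]))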

 

Step 2: Run show platform software process memory switch <number> <processor> all sorted in order to monitor the switch and processor that show the increased memory usage:

#show platform software process memory switch 1 r0 all sorted
Load for five secs: 5%/1%; one minute: 5%; five minutes: 4%
   Pid      VIRT       RSS       PSS      Heap    Shared   Private              Name
------------------------------------------------------------------------------------
  5329   1870404    712544    617188        80    110276    602268   linux_iosd-imag
16294   2279920    274864    183683     89936    105024    169840    fed main event
   952    630820    223244    154414    135716     72664    150580             smand
 15755    968948    172380     85746     52900     99380     73000      platform_mgr  
  3336    851508    160564     69297      6052    103280     57284         cli_agent
  1244    579100    141708     58532       220     91128     50580               smd

This command displays the memory consumed by processes such as the Linux IOSd image (linux_iosd-imag) and the Platform Manager (platform_mgr). Check the resident set size (RSS), which is the portion of a process's memory held in main memory (RAM).
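
As with Step 1, this comparison can be scripted. A minimal sketch (an assumed helper, not a Cisco tool) that parses the table above and diffs the RSS column between two samples, so a process whose RSS only ever grows, such as linux_iosd-imag here, stands out:

import re

# Pid, VIRT, RSS, PSS, Heap, Shared, Private, Name -- RSS and Name captured
PROC_ROW = re.compile(r"^\s*(\d+)\s+\d+\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+(.+?)\s*$")

def rss_by_process(output):
    """Map process name -> RSS in kB from one sample of the command."""
    rss = {}
    for line in output.splitlines():
        m = PROC_ROW.match(line)
        if m:
            rss[m.group(3)] = int(m.group(2))
    return rss

def rss_growth(before, after):
    """Return (name, delta_kb) for processes whose RSS grew, largest first."""
    grew = [(name, after[name] - before[name])
            for name in after if name in before and after[name] > before[name]]
    return sorted(grew, key=lambda item: item[1], reverse=True)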

 

Step 3: Run debug platform software memory <platform-mgr|ios> switch <number> <processor> alloc callsite start in order to enable the memory allocation tracking debug on the affected switch. Then run show platform software memory <platform-mgr|ios> switch <number> <processor> alloc callsite brief in order to identify the callsite that is consuming the memory:

#debug platform software memory ios switch active R0 alloc callsite start
#show platform software memory ios switch 1 R0 alloc callsite brief
Load for five secs: 3%/1%; one minute: 6%; five minutes: 4%
callsite        thread    diff_byte    diff_call
----------------------------------------------------------
3890515970      5329      26495300     662359
2101852161      5329      16904        1
2101848064      5329      57360        1
1377000450      5329      20016        6
3890515973      5329      1184         1
1377000449      5329      57384        1
3535134737      5329      31752        1
3535134738      5329      51           2
3890515968      5329      1782         69
3890515969      5329      1104         69
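
To spot the dominant callsite without reading the table by eye, the brief output can be parsed the same way. A minimal sketch, assuming the four-column layout shown above; callsite 3890515970 dominates this capture:

import re

CALLSITE_ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")

def top_callsites(output, n=3):
    """Return the n (callsite, diff_byte, diff_call) rows with the most
    outstanding bytes; a steadily growing diff_byte marks the suspect."""
    rows = []
    for line in output.splitlines():
        m = CALLSITE_ROW.match(line)
        if m:
            callsite, _thread, diff_byte, diff_call = (int(g) for g in m.groups())
            rows.append((callsite, diff_byte, diff_call))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]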

 

Step 4: Run debug platform software memory <platform-mgr|ios> switch <number> <processor> alloc backtrace start <callsite number> depth 10 and then run show platform software memory <platform-mgr|ios> switch <number> <processor> alloc backtrace in order to get the backtrace information:

#debug platform software memory ios switch 1 R0 alloc backtrace start 3890515970 depth 10
#show platform software memory ios switch 1 R0 alloc backtrace
Load for five secs: 4%/1%; one minute: 5%; five minutes: 4%
backtrace: 1#924a998d29c6b41cd4d8f2471c7daed8   maroon:FFE8EC6000+4D70 tdllib:FFEC71A000+3B914 tdl_17b23f588d:FFCCC1F000+14B24750 tdl_17b23f588d:FFCCC1F000+14B246E4 :AAB4281000+180E240 :AAB4281000+180AC6C :AAB4281000+6801288 :AAB4281000+6801288 :AAB4281000+6801288
  callsite: 3890515970, thread_id: 5329
  allocs: 27, frees: 0, call_diff: 27
backtrace: 1#924a998d29c6b41cd4d8f2471c7daed8   maroon:FFE8EC6000+4D70 tdllib:FFEC71A000+3B914 tdl_17b23f588d:FFCCC1F000+149CC9F4 tdl_17b23f588d:FFCCC1F000+149CC950 tdl_17b23f588d:FFCCC1F000+14B247C4 tdl_17b23f588d:FFCCC1F000+14B246E4 :AAB4281000+180E240 :AAB4281000
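
The interesting records in this output are the ones where allocations are never freed. A minimal sketch, assuming each record carries the callsite/allocs/frees counters in the format shown above:

import re

COUNTERS = re.compile(
    r"callsite:\s*(\d+).*?allocs:\s*(\d+),\s*frees:\s*(\d+),"
    r"\s*call_diff:\s*(-?\d+)",
    re.DOTALL,
)

def leaky_records(output):
    """Return (callsite, allocs, call_diff) for records that allocate but
    never free -- the classic shape of a leak."""
    hits = []
    for m in COUNTERS.finditer(output):
        callsite, allocs, frees, call_diff = (int(g) for g in m.groups())
        if allocs > 0 and frees == 0:
            hits.append((callsite, allocs, call_diff))
    return hits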

 

Step 5: Collect all of the outputs and engage Cisco TAC; they decode the information and can determine whether or not the leak is triggered by the software bug.

We have confirmed this bug is affecting our 3850-48s running Denali 16.3.5b. The bug has a doomsday timer of around 43 weeks and 4 days, after which the active/master switch locks up. The switch stack sees it as "removed", and the switch must be manually power cycled to return it to service. We are told this bug is fixed in Denali 16.3.6, but we are moving to 16.3.7.