02-09-2018 04:57 AM - edited 03-20-2019 09:53 PM
How would I determine if this memory leak is occurring?
05-02-2018 06:49 AM - edited 05-02-2018 06:58 AM
Please be aware that you need TAC assistance to determine a memory leak.
Memory in IOS XE can be monitored at the IOSd (Cisco IOS daemon) level or at the Linux kernel level; this particular leak is identified at the kernel level.
I used a Catalyst 3650 running 16.3.5b to get the command samples; the output can vary from platform to platform.
Step 1: Monitor the Used and Committed memory. You have to see an increase in these values; a leak is identified when memory usage increases and does not decrease over time. If memory usage returns to normal, you can discard a memory leak.
Run show platform software status control-processor brief in order to check the kernel memory usage:
#show platform software status control-processor brief
Load for five secs: 4%/1%; one minute: 4%; five minutes: 5%

Memory (kB)
Slot    Status    Total     Used (Pct)      Free (Pct)      Committed (Pct)
1-RP0   Healthy   3978124   2514900 (63%)   1463224 (37%)   3162852 (80%)
2-RP0   Healthy   3978124   2453884 (62%)   1524240 (38%)   3094104 (78%)
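If you prefer to trend these numbers instead of eyeballing them, a small script can parse the table. The sketch below is only an illustration (the sample rows and the parse_memory helper are my own, not a Cisco tool); it pulls the Used and Committed percentages per slot so samples taken hours apart can be compared.

# Minimal sketch (not an official Cisco tool): parse the memory rows from
# "show platform software status control-processor brief" and report the
# Used/Committed percentages per slot so repeated samples can be compared.
import re

SAMPLE = """\
1-RP0  Healthy  3978124  2514900 (63%)  1463224 (37%)  3162852 (80%)
2-RP0  Healthy  3978124  2453884 (62%)  1524240 (38%)  3094104 (78%)
"""

ROW = re.compile(
    r"^(?P<slot>\S+)\s+\S+\s+(?P<total>\d+)\s+"
    r"(?P<used>\d+)\s+\((?P<used_pct>\d+)%\)\s+"
    r"\d+\s+\(\d+%\)\s+"
    r"(?P<committed>\d+)\s+\((?P<committed_pct>\d+)%\)"
)

def parse_memory(text):
    """Return {slot: (used_pct, committed_pct)} from the memory table rows."""
    stats = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            stats[m["slot"]] = (int(m["used_pct"]), int(m["committed_pct"]))
    return stats

if __name__ == "__main__":
    # Take one snapshot now and another later; a leak is suspected only if
    # Used/Committed keep climbing and never fall back.
    print(parse_memory(SAMPLE))   # {'1-RP0': (63, 80), '2-RP0': (62, 78)}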
It is possible to see a log message that warns about memory usage when it surpasses a certain threshold. This message is not exclusive to this issue, but it can be considered a symptom:
%PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: Used Memory value 92% exceeds warning level 90%
The log message indicates the affected switch and processor; in this example, Switch 1 and the Route Processor (RP) are affected.
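If you forward syslog to a collector, a quick regex can flag these warnings automatically. This is just a sketch based on the message format above; the pattern and field names are my own assumptions and may need adjusting on other releases.

# Hedged sketch: extract the switch number and memory percentages from the
# %PLATFORM-4-ELEMENT_WARNING syslog line shown above.
import re

LOG = ("%PLATFORM-4-ELEMENT_WARNING: Switch 1 R0/0: smand: 1/RP/0: "
       "Used Memory value 92% exceeds warning level 90%")

WARNING = re.compile(
    r"%PLATFORM-4-ELEMENT_WARNING: Switch (?P<switch>\d+) (?P<slot>\S+): .*"
    r"Used Memory value (?P<used>\d+)% exceeds warning level (?P<level>\d+)%"
)

m = WARNING.search(LOG)
if m:
    print(f"Switch {m['switch']} {m['slot']}: used {m['used']}% (threshold {m['level']}%)")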
Step 2: Run show platform software process memory switch <number> <processor> all sorted in order to monitor the switch and processor where the memory usage increased:
#show platform software process memory switch 1 r0 all sorted
Load for five secs: 5%/1%; one minute: 5%; five minutes: 4%

Pid    VIRT     RSS     PSS     Heap    Shared  Private  Name
------------------------------------------------------------------------------------
5329   1870404  712544  617188  80      110276  602268   linux_iosd-imag
16294  2279920  274864  183683  89936   105024  169840   fed main event
952    630820   223244  154414  135716  72664   150580   smand
15755  968948   172380  85746   52900   99380   73000    platform_mgr
3336   851508   160564  69297   6052    103280  57284    cli_agent
1244   579100   141708  58532   220     91128   50580    smd
This command displays the memory consumed by the Linux IOSd image (linux_iosd-imag) and the Platform Manager (platform_mgr). Check the resident set size (RSS), which is the portion of a process's memory that is held in main memory (RAM).
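To compare two captures of this output taken some hours apart, something like the sketch below can help. The parse_rss and growing helpers are my own illustration, assuming the column layout shown above; adjust the growth threshold to taste.

# Illustrative sketch only: parse "show platform software process memory ...
# all sorted" output and compare the RSS column between two snapshots to spot
# processes whose resident memory keeps growing.
def parse_rss(text):
    """Return {process_name: rss_kb} from the table rows (Pid VIRT RSS ...)."""
    rss = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 8 and parts[0].isdigit():
            # Columns: Pid VIRT RSS PSS Heap Shared Private Name...
            name = " ".join(parts[7:])
            rss[name] = int(parts[2])
    return rss

def growing(before, after, min_delta_kb=1024):
    """List processes whose RSS grew by more than min_delta_kb between samples."""
    return {p: after[p] - before[p]
            for p in after
            if p in before and after[p] - before[p] > min_delta_kb}

# Usage: feed two captures taken some hours apart.
# delta = growing(parse_rss(capture_1), parse_rss(capture_2))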
Step 3: Run debug platform software memory <platform-mgr|ios> switch <number> <processor> alloc callsite start in order to enable memory allocation tracking on the affected switch. Then run show platform software memory <platform-mgr|ios> switch <number> <processor> alloc callsite brief to identify the callsite that consumes the memory:
#debug platform software memory ios switch active R0 alloc callsite start
#show platform software memory ios switch 1 R0 alloc callsite brief
Load for five secs: 3%/1%; one minute: 6%; five minutes: 4%

callsite      thread    diff_byte    diff_call
----------------------------------------------------------
3890515970    5329      26495300     662359
2101852161    5329      16904        1
2101848064    5329      57360        1
1377000450    5329      20016        6
3890515973    5329      1184         1
1377000449    5329      57384        1
3535134737    5329      31752        1
3535134738    5329      51           2
3890515968    5329      1782         69
3890515969    5329      1104         69
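The callsite with the largest diff_byte is the one to chase (3890515970 in the sample above). A quick way to rank callsites from a saved capture is sketched below; top_callsites is my own helper, not part of IOS XE.

# Sketch (my own helper, not a Cisco utility): rank the callsites from
# "show platform software memory ... alloc callsite brief" by diff_byte so the
# largest outstanding allocation stands out.
def top_callsites(text, count=5):
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and all(p.lstrip("-").isdigit() for p in parts):
            callsite, thread, diff_byte, diff_call = (int(p) for p in parts)
            rows.append((diff_byte, callsite, thread, diff_call))
    rows.sort(reverse=True)
    return rows[:count]

# Usage: paste the command output into a file and pass its contents here;
# the top entry is the callsite to feed into the Step 4 backtrace debug.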
Step 4: Run debug platform software memory <platform-mgr|ios> switch <number> <processor> alloc backtrace start <callsite number> depth 10 and then run show platform software memory <platform-mgr|ios> switch <number> <processor> alloc backtrace in order to get the backtrace information:
#debug platform software memory ios switch 1 R0 alloc backtrace start 3890515970 depth 10
#show platform software memory ios switch 1 R0 alloc backtrace
Load for five secs: 4%/1%; one minute: 5%; five minutes: 4%

backtrace: 1#924a998d29c6b41cd4d8f2471c7daed8   maroon:FFE8EC6000+4D70 tdllib:FFEC71A000+3B914 tdl_17b23f588d:FFCCC1F000+14B24750 tdl_17b23f588d:FFCCC1F000+14B246E4 :AAB4281000+180E240 :AAB4281000+180AC6C :AAB4281000+6801288 :AAB4281000+6801288 :AAB4281000+6801288
  callsite: 3890515970, thread_id: 5329
  allocs: 27, frees: 0, call_diff: 27

backtrace: 1#924a998d29c6b41cd4d8f2471c7daed8   maroon:FFE8EC6000+4D70 tdllib:FFEC71A000+3B914 tdl_17b23f588d:FFCCC1F000+149CC9F4 tdl_17b23f588d:FFCCC1F000+149CC950 tdl_17b23f588d:FFCCC1F000+14B247C4 tdl_17b23f588d:FFCCC1F000+14B246E4 :AAB4281000+180E240 :AAB4281000
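If you want a quick sanity check on the backtrace output before opening the case, the counters can be parsed as sketched below. The regex is an assumption based on the sample above; a callsite that keeps allocating while frees stays at 0 is the typical leak signature that TAC will look at.

# Hedged sketch: pull the allocs/frees counters out of the backtrace output.
import re

COUNTERS = re.compile(
    r"callsite:\s*(?P<callsite>\d+),\s*thread_id:\s*(?P<thread>\d+)\s*"
    r"allocs:\s*(?P<allocs>\d+),\s*frees:\s*(?P<frees>\d+),\s*call_diff:\s*(?P<diff>\d+)"
)

def leak_candidates(text):
    """Return (callsite, allocs, frees, call_diff) tuples where frees == 0."""
    return [(m["callsite"], int(m["allocs"]), int(m["frees"]), int(m["diff"]))
            for m in COUNTERS.finditer(text)
            if int(m["frees"]) == 0]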
Step 5: Collect all the outputs and engage Cisco TAC; they decode the information and can determine whether or not the leak is triggered by a software bug.
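If you gather the outputs over SSH, a short script can bundle them for the TAC case. The sketch below assumes the Netmiko library and SSH access to the switch; the hostname, credentials, and exact command list are placeholders you would adapt to your switch number and processor.

# Sketch for collecting everything in one pass before opening the TAC case.
from netmiko import ConnectHandler

COMMANDS = [
    "show platform software status control-processor brief",
    "show platform software process memory switch 1 r0 all sorted",
    "show platform software memory ios switch 1 R0 alloc callsite brief",
    "show platform software memory ios switch 1 R0 alloc backtrace",
]

# Placeholder host/credentials -- replace with your own.
conn = ConnectHandler(device_type="cisco_xe", host="10.0.0.1",
                      username="admin", password="password")
with open("memory_leak_outputs.txt", "w") as f:
    for cmd in COMMANDS:
        f.write(f"----- {cmd} -----\n")
        f.write(conn.send_command(cmd) + "\n\n")
conn.disconnect()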
11-16-2018 11:39 AM
We have confirmed this bug is affecting us on our 3850-48s with Denali 16.3.5b. The bug has a doomsday timer of around 43 weeks and 4 days, after which the active/master switch locks up. The switch stack sees it as "removed" and it must be manually power cycled to return it to service. We are told this bug is fixed in Denali 16.3.6, but we are moving to 16.3.7.