CSCvp73396 - Cat9300 Memory Leak due to nyq_rsync.log

Leo Laohoo
Hall of Fame

Terminology Used: 

  • Switch - a physical unit;
  • Stack - a logical unit made up of multiple switches

NOTE:

  1. This bug only affects Catalyst 9300 in a stack (>14 weeks uptime).
  2. This bug DOES NOT affect stand-alone Catalyst 9300 switch.
  3. This bug DOES NOT affect Catalyst 9400/9600 in VSS or standalone.
  4. This bug DOES NOT affect Catalyst 9500 in Switch Virtual or standalone.
  5. At the time of writing, I am unable to determine whether the Catalyst 9200 is affected.

The cause of this bug is a file, named "/tmp/nyq_rsync.log", getting bloated (it just keeps getting bigger daily). Once the partition hits 100%, the physical switch crashes.

To check the size of the file, use the command "show platform software mount switch [active | standby | 1 to 16 ] r0 | include ^tmpfs.*tmp | exclude /tmp/"
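To check every member of a stack, the same command can be generated once per switch number. This is only a minimal sketch: the helper name `mount_check_commands` is made up for illustration, and it only builds the command strings (it does not connect to a switch).

```python
# Hypothetical helper: builds the check command quoted in this post for
# each stack member. It does not log in to or run anything on a switch.
MOUNT_CHECK = (
    "show platform software mount switch {member} r0 "
    "| include ^tmpfs.*tmp | exclude /tmp/"
)


def mount_check_commands(stack_size):
    """Return one check command per stack member, numbered 1..stack_size."""
    if not 1 <= stack_size <= 16:
        raise ValueError("a stack has between 1 and 16 members")
    return [MOUNT_CHECK.format(member=m) for m in range(1, stack_size + 1)]


if __name__ == "__main__":
    # Example: a four-switch stack.
    for command in mount_check_commands(4):
        print(command)
```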

Remember, deleting the nyq_rsync.log file is only a workaround. The file gets re-created and continues to grow even if it is repeatedly deleted. For a permanent fix, upgrade to a version not affected by this bug.

Ideally, the value should be <9%. If the value is >80%, call TAC. 

 

Switch#sh platform soft mount switch 4 r0 | i ^tmpfs.*tmp | e /tmp/
tmpfs                     179164    3676488      5% /tmp

The Use% value (above) is 5%.  This is good (normal).  Nothing to worry about. 

 

Switch#sh platform soft mount switch 1 r0 | i ^tmpfs.*tmp | e /tmp/
tmpfs                     2746588  1109064     72% /tmp

The Use% value (above) is 72%.  Contact TAC when possible. 

After the file has been deleted, use the command "df -h" to verify that the space in /tmp has been freed. 

IMPORTANT:  

If that value is >90%, put the donut down and call TAC immediately.  The switch has less than a week before it will crash.  
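The thresholds above can be applied to the command output with a small parser. This is a sketch under assumptions: the function names are made up, the parser only looks for the Use% column on the tmpfs line for /tmp, and the label for the band between 9% and 80% is my own wording (the post only gives the 72% example for that range).

```python
import re

# Matches the Use% column of the /tmp line, e.g. "...     72% /tmp".
USE_PCT = re.compile(r"(\d+)%\s+/tmp\s*$")


def tmp_usage_percent(output):
    """Return the Use% of /tmp from the command output, or None if absent."""
    for line in output.splitlines():
        match = USE_PCT.search(line)
        if match and line.lstrip().startswith("tmpfs"):
            return int(match.group(1))
    return None


def advice(percent):
    """Map a Use% value to the thresholds quoted in this post."""
    if percent > 90:
        return "call TAC immediately"
    if percent > 80:
        return "call TAC"
    if percent < 9:
        return "normal"
    # Assumption: the post does not name the 9-80% band explicitly.
    return "elevated, contact TAC when possible"


if __name__ == "__main__":
    sample = "tmpfs                     2746588  1109064     72% /tmp"
    print(advice(tmp_usage_percent(sample)))
```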

Help TAC: Point or direct the TAC engineer to this page so they immediately know what to do.

 

HIDDEN GEM:  Cisco Catalyst 9200/9300 can/may support UP TO sixteen (16) switches in a stack.  Use the enable command "switch ?" and notice that the option enumerates a switch member (of a stack) from 1 to 16.

 

CRASHINFO/CRASH LOGS: 

If/When a switch crashes, go to the crashinfo directory (Command: dir crashinfo-SWITCHMEMBER) and there should be a crashlog file that looks like this naming convention: system-report_SWITCHMEMBER_YEARMMDD-HHMMSS-TIMEZONE.tar.gz. 

Open the file and look under the "tmp" folder and there should be another file called HOSTNAME_SWITCHMEMBER_RP_0-bootuplog-YEARMMDD-HHMMSS-TIMEZONE.log
Scroll all the way to the bottom and you should see a line that goes like this:

HOSTNAME_SWITCHMEMBER_RP_0 cmand[11955]: %CMRP-0-CHASFS_PROPERTY_SET: Failed to write chassis filesystem object env/rp/0/sensor/0 property data because No space left on device -Traceback= 1#1173ffafaf74e8cab963841cc1ed720b   errmsg:7FDCB3A6C000+17C9 :5616EC5DC000+BDF5B :5616EC5DC000+C2C1C :5616EC5DC000+AF68C evlib:7FDCB450B000+A082 evlib:7FDCB450B000+8B1E evlib:7FDCB450B000+950C orchestrator_lib:7FDCA657A000+CB58 luajit:7FDC8E6FD000+62D76 luajit:7FDC8E6FD000+4B35A luajit:7FDC8E6FD000+333E9 luajit:7FDC8E6FD000+60DD7 luajit:7FDC8E6FD000+1D11D orche
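Once the bootuplog file has been extracted from the system-report archive, the search for that line can be scripted. This is a minimal sketch: the function names are made up, and it only looks for the two strings visible in the line above (the message ID and "No space left on device").

```python
# Strings taken from the log line quoted in this post.
SIGNATURE = "%CMRP-0-CHASFS_PROPERTY_SET"
REASON = "No space left on device"


def find_tmpfs_exhaustion(lines):
    """Return the log lines that carry both the CMRP message and the reason."""
    return [
        line.rstrip("\n")
        for line in lines
        if SIGNATURE in line and REASON in line
    ]


def scan_bootup_log(path):
    """Scan an already-extracted bootuplog file at the given path."""
    with open(path, errors="replace") as handle:
        return find_tmpfs_exhaustion(handle)
```

Calling `scan_bootup_log` with the path of the extracted .log file returns an empty list when the signature is absent.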

NOTE:

  1. I purposely did not include IOS-XE versions because that is the role of the Bug ID. 
  2. Find the hidden gem.  

Here are the steps TAC needs to undertake to delete the offending file:

1.  Command:  request consent generate shell-access auth-timeout <MINUTES>

NOTE:  “MINUTES” is how long the shell access will last.   

2.  The stack will generate a long string (aka token) which TAC will take into an internal DB in order to counter-generate a challenge or response.

3.  Command:  request consent accept shell-access <TAC-generated RESPONSE TOKEN>

4.  Wait for about 60 to 90 seconds for the stack to “compute” the challenge.

5.  If the token was generated correctly, a line will say “% Consent token authorization success”. 

NOTE: 

I do not recommend using the option “switch ACTIVE” or “switch STANDBY” and would prefer using the command “switch <1 to 16>”. 

6.  To go to the switch, use the command “request platform software system shell switch <1 to 16> r0”.  An “are you sure” question will appear.  Respond with “Y” to proceed or “N” to chicken out (below).

Activity within this shell can jeopardize the functioning of the system.
Are you sure you want to continue? [y/n]

7.  In shell mode, please observe the banner and the cursor: 

DATE TIME : Shell access was granted to user <anon>; Trace file: , /crashinfo/tracelogs/system_shell_R0-0.11225_0.YEARMODAHOUR.bin
********************************************************************** 
Activity within this shell can jeopardize the functioning 
of the system.
Use this functionality only under supervision of Cisco Support.

Session will be logged to:
  crashinfo:tracelogs/system_shell_R0-0.11225_0.YEARMODAHOUR.bin
********************************************************************** 
[SWITCHNAME_RP_0:/]$ 

8.  [OPTIONAL]  In shell, check first using the command “df -h” and look at the /tmp folder (see below):

[SWITCHNAME_MEMBER_RP_0:/]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
tmpfs                     3.7G  3.4G  386M  90% /tmp

9.  To delete the file, issue the command “rm [-f] /tmp/nyq_rsync.log”.    

WARNING

  • If the command “rm /tmp/nyq_rsync.log” doesn’t work, then use “rm -f /tmp/nyq_rsync.log”.  
  • DO NOT change into /tmp and try to delete the file from there, because it won't work!  The file needs to be deleted from the root directory:
[SWITCHNAME_MEMBER_RP_0:/]$ rm /tmp/nyq_rsync.log

10.  [OPTIONAL]  Verify again by using the command “df -h” and compare the before-and-after output of the /tmp folder.

[SWITCHNAME_MEMBER_RP_0:/]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
tmpfs                     3.7G  133M  3.6G   4% /tmp

11.  Use the command "exit" to get out of shell. 

NOTE: 

If there is a need to go to another stack member, then after step 11, go back to step 6 again. 

12.  IMPORTANT:  Terminate the shell access using the command:  request consent-token terminate-auth shell-access
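For a stack where several members need cleaning, steps 1 to 12 can be laid out as an ordered checklist. This is only a sketch for planning the session: the function name is made up, the commands are copied as written in the steps above, and the response token still has to come from TAC, so nothing here can be run unattended.

```python
def cleanup_checklist(members, timeout_minutes):
    """Return the ordered commands from steps 1-12 for the given stack members."""
    steps = [
        # Steps 1-3: consent token exchange (the response comes from TAC).
        f"request consent generate shell-access auth-timeout {timeout_minutes}",
        "request consent accept shell-access <TAC-generated RESPONSE TOKEN>",
    ]
    for member in members:
        if not 1 <= member <= 16:
            raise ValueError(f"invalid stack member: {member}")
        steps += [
            # Steps 6-11, repeated for each stack member.
            f"request platform software system shell switch {member} r0",
            "df -h",
            "rm /tmp/nyq_rsync.log",
            "df -h",
            "exit",
        ]
    # Step 12: always terminate the shell access at the end.
    steps.append("request consent-token terminate-auth shell-access")
    return steps


if __name__ == "__main__":
    for number, command in enumerate(cleanup_checklist([1, 3], 30), start=1):
        print(f"{number:2d}. {command}")
```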

 

Update: 

  • 01 July 2020 - This bug is not exclusive to the Catalyst 9k.  For 3650/3850, kindly refer to CSCvj86358.
  • 21 Jan 2021 - Similar bug for VSS/Switch Virtual:  CSCvp27552.

URGENT Update: 

  • 01 October 2019 -- Monitor the stack (approximately >8 weeks after the crash) and make sure "sessmgrd" is not >90% (CSCvp03655).  Use the command "show processes cpu platform monitor" or "show platform software process slot switch <1 to 16> r0 monitor" to verify.
15 Replies

Giuseppe Larosa
Hall of Fame

Hello Leo,

thanks for providing useful info like this about this bug.

 

Best Regards

Giuseppe

 

Thanks, checked my 9300 Everest stacks and they look OK. Luckily nothing is on Gibraltar yet, although our new 92s seem to be coming with 11.1 pre-installed and there is no other release to swap them over to yet.

Thanks for the ratings.

16.12.1 was released over the weekend.

Ah great, no, I didn't see that yet. I'll have a look through the notes now, thanks again.

Thanks for the ratings, Giuseppe.

Scott Hodgdon
Cisco Employee

Just a note for the community ...

This bug only affects those running IOS XE 16.11.1 and is fixed in 16.11.1c : https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvp73396

You will not find it in the 16.12.1 Resolved Caveats section of the Release Notes, but 16.12.1 will have the 16.11.1c fixes in them. Given that 16.12.1 is an Extended Maintenance Release, it would be a good idea to upgrade to that for the fix instead of 16.11.1c .

Cheers,
Scott Hodgdon

Senior Technical Marketing Engineer

Enterprise Networking Group


@Scott Hodgdon wrote:

This bug only affects those running IOS XE 16.11.1 and is fixed in 16.11.1c : https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvp73396


Thanks, Scott. 

Unfortunately 16.11.1c is only available for the 9800.

Leo,

It would be better to go with 16.12.1 as it is an Extended Maintenance release. 16.11 is a Standard Maintenance release and will have a much shorter lifespan (see https://www.cisco.com/c/en/us/products/collateral/ios-nx-os-software/ios-xe-16/bulletin-c25-741178.html ). If a customer really wants 16.11.1c, it can be made available through TAC, but there will be no further maintenance releases on this train most likely.

Cheers,

Scott Hodgdon

Senior Technical Marketing Engineer

Enterprise Networking Group

Thanks, Scott.
I'll check out 16.12.1.

Going to wait for the Amsterdam release.  

The Bug ID has been updated and now states "Gibraltar-16.12.1b" in the Known Fixed Releases.
Kudos to whoever posted that.

kenneth.lm
Level 1

I believe this has been reintroduced with 16.12.3. 

Seen on both 9300 and 9300L stacks.

Not seen on 9300L stack with 16.12.2.

 

Thanks for the update. 

If the stack master crashed, check & make sure the CPU process "sessmgrd" isn't growing (CSCvp03655). 

It did not crash, fortunately. I scheduled a reload to free up space. But thanks for the heads-up. 

With two 9300-24P in stack and an uptime of 3 weeks, 5 days the stack had a TMPFS usage of 3278MB(43%). The reload set it back to 236MB(3%). Now, with an up time of 5 days, 9 hours it is sitting at TMPFS usage of 779MB(10%). 

 

I have a reload scheduled tonight with a downgrade to 16.12.2. I will keep you posted if that resolves the problem. 
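The figures reported above allow a rough growth estimate. This is a back-of-the-envelope sketch, not a prediction: it assumes linear growth, the function names are made up, and the tmpfs capacity is inferred from the "3278MB(43%)" reading rather than measured.

```python
def growth_mb_per_day(start_mb, end_mb, days):
    """Average growth between two readings, in MB per day."""
    return (end_mb - start_mb) / days


def days_until_full(current_mb, capacity_mb, rate_mb_per_day):
    """Days left before the partition fills, assuming linear growth."""
    if rate_mb_per_day <= 0:
        return float("inf")
    return (capacity_mb - current_mb) / rate_mb_per_day


if __name__ == "__main__":
    # 236 MB right after the reload, 779 MB after 5 days and 9 hours.
    rate = growth_mb_per_day(236, 779, 5 + 9 / 24)
    # Assumption: capacity inferred from the "3278MB(43%)" reading.
    capacity = 3278 / 0.43
    print(f"growth: {rate:.0f} MB/day")
    print(f"days until full: {days_until_full(779, capacity, rate):.0f}")
```

With these inputs the growth works out to about 101 MB per day.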

Please raise a TAC Case and get TAC to confirm this bug has re-emerged in 16.12.3. 

If confirmed, make sure the TAC agent updates the Known Affected Version in the Bug ID.  

If the TAC agent refuses, let us know.  
