07-05-2023 12:37 AM - edited 07-05-2023 12:38 AM
Hi all,
We just installed an NCS-57C3-MOD in our datacenter as a core BGP router, and it reboots itself almost every 3 hours. Here is the log:
0/RP0/ADMIN0:Jul 5 07:33:48.000 UTC: shelf_mgr[2131]: %INFRA-SHELF_MGR-6-HW_EVENT : Rcvd HW event HW_EVENT_OK, event_reason_str 'remote card ok' for card 0/0
0/RP0/ADMIN0:Jul 5 07:33:48.000 UTC: shelf_mgr[2131]: %INFRA-SHELF_MGR-6-CARD_HW_OPERATIONAL : Card: 0/0 hardware state going to Operational
0/RP0/ADMIN0:Jul 5 07:34:16.000 UTC: linkmonsys_ncs[2065]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :DECLARE :: [8086:15ab 03:00.0]-[8086:15ab 03:00.0]
0/RP0/ADMIN0:Jul 5 07:34:17.000 UTC: linkmonsys_ncs[2065]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :DECLARE :: [8086:15ab 03:00.1]-[8086:15ab 03:00.1]
0/RP1/ADMIN0:Jul 5 07:34:16.000 UTC: linkmonsys_ncs[2064]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :DECLARE :: [8086:15ab 03:00.1]-[8086:15ab 03:00.1]
0/RP1/ADMIN0:Jul 5 07:34:17.000 UTC: linkmonsys_ncs[2064]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :DECLARE :: [8086:15ab 03:00.0]-[8086:15ab 03:00.0]
0/RP1/ADMIN0:Jul 5 07:34:23.000 UTC: linkmonsys_ncs[2064]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :CLEAR :: [8086:15ab 03:00.0]-[8086:15ab 03:00.0]
0/RP0/ADMIN0:Jul 5 07:34:23.000 UTC: linkmonsys_ncs[2065]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :CLEAR :: [8086:15ab 03:00.1]-[8086:15ab 03:00.1]
0/RP1/ADMIN0:Jul 5 07:34:24.000 UTC: linkmonsys_ncs[2064]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :CLEAR :: [8086:15ab 03:00.1]-[8086:15ab 03:00.1]
0/RP0/ADMIN0:Jul 5 07:34:24.000 UTC: linkmonsys_ncs[2065]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :Physical network interface controller (NIC) to ethernet switch (ESD) link down error :CLEAR :: [8086:15ab 03:00.0]-[8086:15ab 03:00.0]
0/RP0/ADMIN0:Jul 5 07:34:33.000 UTC: shelf_mgr[2131]: %INFRA-SHELF_MGR-6-CARD_SW_OPERATIONAL : Card: 0/0 software state going to Operational
0/RP1/ADMIN0:Jul 5 07:34:33.000 UTC: esdma[3512]: %INFRA-ESDMA-6-ESD_CONN_FOUND : ESDMA found connection with esd at 0/LC0/LC-SW1
0/0/ADMIN0:Jul 5 07:34:38.000 UTC: aaad[2166]: %MGBL-AAAD-7-DEBUG : Disaster-recovery account not configured. Using first user as disaster-recovery account
0/0/ADMIN0:Jul 5 07:34:38.000 UTC: inst_agent[2176]: %INFRA-INSTAGENT-4-XR_PART_PREP_REQ : Received SDR/XR partition request. Looking for available matching partition. If not found, new one will be created after copying relevant image and RPMs
RP/0/RP0/CPU0:Jul 5 07:34:34.337 UTC: fpd-serv[265]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :FPD-NEED-UPGRADE :DECLARE :0/0:
0/0/ADMIN0:Jul 5 07:34:50.000 UTC: inst_agent[2176]: %INFRA-INSTAGENT-4-XR_PART_PREP_RESP : SDR/XR partition preparation completed successfully
0/0/ADMIN0:Jul 5 07:34:57.000 UTC: vm_manager[2236]: %INFRA-VM_MANAGER-4-INFO : Info: vm_manager started VM default-sdr--1
RP/0/RP0/CPU0:Jul 5 07:35:31.567 UTC: sysdb_shared_nc[467]: %SYSDB-SYSDB-7-INFO : client 'bfd_agent' attempted duplicate registration for 'oper/overlays/gl/oc_bfd/openconfig-bfd/' from active node: rc 0x0 (Success)
LC/0/0/CPU0:Jul 5 07:35:46.000 UTC: fia_driver[216]: Warning: dnx_stats_update_engine_context Unable to get database id for j2_app_type 18, database_id 34, total database entry 0
LC/0/0/CPU0:Jul 5 07:35:49.429 UTC: fia_driver[216]: BCM-DPA: Optics Driver connection not established yet,Allow couple of min to establish
LC/0/0/CPU0:Jul 5 07:35:55.349 UTC: fia_driver[216]: %PLATFORM-OFA-6-INFO : NPU #0 Initialization Completed
LC/0/0/CPU0:Jul 5 07:35:55.439 UTC: fia_driver[216]: %PLATFORM-DPA-6-INFO : Fabric BANDWIDTH above configured threshold
LC/0/0/CPU0:Jul 5 07:35:55.822 UTC: fsyncmgr[325]: %L2-FSYNC-6-SSM_OFF : The Synchronization Status Messages for the source Clock-interface 0/0/CPU0-Sync0 are not enabled
LC/0/0/CPU0:Jul 5 07:35:55.822 UTC: fsyncmgr[325]: %L2-FSYNC-6-SSM_OFF : The Synchronization Status Messages for the source Clock-interface 0/0/CPU0-Sync1 are not enabled
LC/0/0/CPU0:Jul 5 07:35:55.822 UTC: fsyncmgr[325]: %L2-FSYNC-6-SSM_OFF : The Synchronization Status Messages for the source Clock-interface 0/0/CPU0-Sync2 are not enabled
LC/0/0/CPU0:Jul 5 07:35:55.822 UTC: fsyncmgr[325]: %L2-FSYNC-6-SSM_OFF : The Synchronization Status Messages for the source GNSS Receiver 0 location 0/0/CPU0 are not enabled
LC/0/0/CPU0:Jul 5 07:35:55.822 UTC: fsyncmgr[325]: %L2-FSYNC-6-SSM_OFF : The Synchronization Status Messages for the source PTP 0/0/CPU0 are not enabled
LC/0/0/CPU0:Jul 5 07:35:55.822 UTC: fsyncmgr[325]: %L2-FSYNC-6-SSM_OFF : The Synchronization Status Messages for the source Internal Oscillator 0/0/CPU0 are not enabled
RP/0/RP0/CPU0:Jul 5 07:35:51.804 UTC: fpd-serv[265]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :FPD-NEED-UPGRADE :DECLARE :0/0:
LC/0/0/CPU0:Jul 5 07:35:57.039 UTC: macsec_ea[218]: Platform Capability : EA-HA Support is set to : 0
LC/0/0/CPU0:Jul 5 07:35:57.039 UTC: macsec_ea[218]: Platform Capability : IF_CAPA support is set to : 1
LC/0/0/CPU0:Jul 5 07:35:57.039 UTC: macsec_ea[218]: Platform Capability : Macsec support is set to : 1
What causes this issue, please?
07-05-2023 01:06 AM
I often see these logs too:
LC/0/0/CPU0:Jul 5 08:09:26.382 UTC: fia_driver[216]: %FABRIC-FIA_DRVR-3-ASIC_RESET : [3501] : Fia asic 0 has to be Reset because Interrupt in block 66 leaf 0x3e084008 has occured
LC/0/0/CPU0:Jul 5 08:09:26.713 UTC: fia_driver[216]: %FABRIC-FIA_DRVR-3-ASIC_RESET : [3501] : Fia asic 0 has to be Reset because Interrupt in block 66 leaf 0x3e084008 has occured
LC/0/0/CPU0:Jul 5 08:09:26.968 UTC: fia_driver[216]: %FABRIC-FIA_DRVR-3-ASIC_RESET : [3501] : Fia asic 0 has to be Reset because Interrupt in block 66 leaf 0x3e084008 has occured
LC/0/0/CPU0:Jul 5 08:09:27.633 UTC: fia_driver[216]: %FABRIC-FIA_DRVR-3-ASIC_RESET : [3501] : Fia asic 0 has to be Reset because Interrupt in block 66 leaf 0x3e084008 has occured
08-05-2023 07:36 AM
Anyone? After replacing the chassis with a brand new one, I am receiving the absolute same logs and the router is still crashing:
FABRIC-FIA_DRVR-3-ASIC_RESET : [3501] : Fia asic 0 has to be Reset because Interrupt in block 66 leaf 0x3e084008 has occured
08-05-2023 08:16 AM - edited 08-05-2023 08:24 AM
What XR version? Any MPAs installed? If so, is it stable with the MPAs removed? Do you have redundant RPs? If so, is it stable with one RP removed?
We can speculate about the role of the FIA (fabric interface asic) log messages in the crashes, but a TAC case is going to be the most expedient path to RCA. A brand new device comes with a warranty (including TAC support), so you can open a case even if you do not have a support contract.
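As a side note, the logs above also show repeated FPD-NEED-UPGRADE alarms for 0/0. That is likely unrelated to the FIA resets, but it is worth clearing before TAC digs in. A minimal sketch, assuming the standard IOS XR FPD commands apply on this platform (verify against your release notes before upgrading firmware):

```
RP/0/RP0/CPU0:router# show hw-module fpd
! Review which FPDs report NEED UPGD, then upgrade them:
RP/0/RP0/CPU0:router# upgrade hw-module location all fpd all
! Some FPD upgrades require a reload to take effect.
```

This at least removes one variable before root-cause analysis of the crashes.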
08-05-2023 11:05 PM
Hello,
There are no MPAs installed. There are 2 RPs, and we also tried running it with only 1; the issue was the same. After 20 minutes, a reload of slot 0/0 happened.
TAC already replaced the box (chassis) last time, but the behavior is the same.
The software version is 7.8.2.
08-06-2023 12:30 AM
Hello,
there is hardly anything out there that helps with finding the cause of this error. I found a basic description of the error, and they ask for the output of:
show processes fia_driver location <loc>
show controllers fia trace all location <loc>
08-06-2023 12:55 AM
There are no options like the ones you sent:
#show processes ?
<1-2147483647> IOS(d) Process Number
all-events Show all notifications
bootup-init Show system init time
bootup-time Show system bootup time
cpu Show CPU usage per IOS(d) process
events Show events for which IOS(d) processes want notification
heapcheck Show IOS(d) scheduler heapcheck configuration
history Show ordered IOS(d) process history
memory Show memory usage per IOS(d) process
platform Show information per IOS-XE process
timercheck Show IOS(d) processes configured for timercheck
| Output modifiers
<cr> <cr>
#show controllers ?
E1 E1 controller internal state
T1 T1 controller internal state
VDSL vdsl controller internal state
pos POS framer state
| Output modifiers
<cr> <cr>
08-06-2023 07:09 AM
Is the system stable with no config? That is, after "commit replace" (saving the config elsewhere first, of course)? If so, the issue may be a bug triggered by your config. If not, then it is probably an issue with either 7.8.2 or an identically bad RMA chassis (which seems less likely).
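For reference, a rough sketch of that test on IOS XR (back up the running config first, since "commit replace" wipes it back to a near-empty baseline):

```
RP/0/RP0/CPU0:router# copy running-config harddisk:/backup-config.txt
RP/0/RP0/CPU0:router# configure
RP/0/RP0/CPU0:router(config)# commit replace
! Confirms before replacing the running configuration with the (empty) candidate.
```

You can later restore with "load harddisk:/backup-config.txt" in config mode followed by "commit".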
I suggest that you keep working the TAC case, as they will have the tools, knowledge-base, and access to the DEs that no one on the outside has. If this system is not in production yet, they may be treating it with less priority than cases they are working with production networks down. Keep pushing them to get to root cause; the TAC engineer assigned to your case will have an escalation team that can be brought in on this, and you can always ask to speak to the TAC duty manager if you do not feel like your case is getting adequate attention.
08-06-2023 07:38 AM
The system is stable if there is no load through it. I had it set up in my lab for 2 weeks and there weren't any issues. After installing it in the datacenter at 3 o'clock AM, it worked for 4 hours. The reload issues started after more load went through the box at 7-8:00 AM. More load means 2-3+ Gbps; directly after the migration we had no more than 1 Gbps of traffic through the box.
08-15-2023 12:45 AM
OK, it looks like the issue was identified as a bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwe06848
08-15-2023 07:22 AM - edited 08-15-2023 10:32 AM
Looks like there is an SMU available to fix this in your 7.8.2 release, which is good news if you have a lengthy qualification process for upgrading to a new release (e.g., 7.9.2).
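For what it's worth, a rough sketch of the SMU install flow on XR7 (LNT) platforms such as the NCS 5700 series; the filename below is a placeholder, and the exact commands can vary by release, so confirm with TAC or the SMU readme:

```
! Copy the SMU RPM to the router, then:
RP/0/RP0/CPU0:router# install package add source harddisk:/ <smu-file>.x86_64.rpm
RP/0/RP0/CPU0:router# show install request
! Once the add completes, apply (may restart processes or reload) and commit:
RP/0/RP0/CPU0:router# install apply reload
RP/0/RP0/CPU0:router# install commit
```

"show install active" afterwards should list the SMU as active and committed.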
12-29-2023 04:08 AM
After almost 3 months of uptime, the router rebooted again, with the same log message as before.
Any info about a related bug or workaround?
LC/0/0/CPU0:Dec 29 11:44:52.758 UTC: fia_driver[177]: %FABRIC-FIA_DRVR-3-ASIC_RESET : [3368] : Fia asic 0 has to be Reset because Interrupt in block 66 leaf 0x3e084008 has occured
0/0/ADMIN0:Dec 29 11:45:02.000 UTC: cm[2199]: %ROUTING-TOPO-5-PROCESS_UPDATE : Got process update: Card shutdown.
LC/0/0/CPU0:Dec 29 11:45:02.043 UTC: processmgr[51]: Received a graceful shutdown request
0/RP0/ADMIN0:Dec 29 11:45:02.000 UTC: shelf_mgr[2118]: %INFRA-SHELF_MGR-3-FAULT_ACTION_CARD_RELOAD : Graceful reload requested for card 0/0. Reason: Board reload on exceeding reset threshold
0/0/ADMIN0:Dec 29 11:45:02.000 UTC: aaad[2196]: %MGBL-AAAD-7-DEBUG : Disaster-recovery account not configured. Using first user as disaster-recovery account
12-29-2023 07:20 AM
I no longer have the entitlement to see any details for CSCwe06848 referenced above. Did TAC confirm that it is the cause of your crash? If so, did you apply the SMU or upgrade to a release that has the fix integrated?
12-30-2023 01:28 AM
Hi,
yes, TAC identified the issue as the bug I shared. Based on that, we upgraded to 7.9.2, which solved our issue for 3 months. But it crashed again with the same log after 3 months.
12-30-2023 09:36 AM
TAC is really your only option for a software fix. Reopen the TAC case (or open a new one). Also, reach out to your Cisco account team with the Service Request number (TAC case) so they can track and escalate it as well.