CSCvy14484 - How to implement workaround

I've recently figured out that we're also affected by this bug. We too experienced an ASA failover event due to a "MIO-blade heartbeat failure", even though we're running a higher version than the one in the bug report: FXOS 2.11(1.154) / ASA 9.16(3).

I googled and found this bug report, but saw that it's in status "terminated". It does mention the workaround, though: "Increase AppAgent timeout and retry settings. Suggested setting: timeout=1000ms, retry=6".

However, finding out how to do this is not that easy. I've found no resource that explains how to accomplish it.

Does anybody in the community know how to adjust this timeout before I open a TAC case?

Regards


10 Replies

tvotna (Spotlight)

There is an ASA configuration command:

app-agent heartbeat interval 1000 retry-count 3

In multiple-context mode, this is configured in the system execution space. You can increase the retry-count to 6.
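For reference, a minimal sketch of applying the workaround from the ASA CLI (hostname and values are illustrative; in multiple mode, do this from the system execution space, e.g. after "changeto system"):

ciscoasa# configure terminal
ciscoasa(config)# app-agent heartbeat interval 1000 retry-count 6
ciscoasa(config)# end
ciscoasa# write memory
ciscoasa# show running-config all | include app-agent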

As far as I know, there is an architectural enhancement (ENH) in ASA 9.18+ to prevent such failures:

CSCvz94573 MIO heartbeat failure caused by heartbeat dropped by delay

Bug toolkit says that the fix was backported to ASA 9.16.3 too, so you may want to open a TAC case and ask why you're still encountering this failure in 9.16.3.

patoberli (VIP Alumni)

Still getting this issue, running ASA 9.18.3 on an FPR 4115 with FXOS 2.12.0.450, so it doesn't seem fully fixed in 9.18.x yet.

@patoberli, actually, the "%ASA-1-104001: (Primary) Switching to ACTIVE - Other unit wants me Active. Secondary unit switch reason: MIO-blade heartbeat failure" error can be produced for various reasons. You need to correlate it with disk0:/log/asa-appagent.log and other log files to get to the root cause. Just as an example, when the blade CIMC/BMC crashes and FXOS doesn't have the "CSCvx66494 Handle CIMC Watchdog reset in MIO" fix, the asa-appagent.log contains the following:

Jan 25 05:16:41.479 WARN : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1223] : Set fsm to FSM_INIT as there is no response to HB requests since 3000 msec
Jan 25 05:16:41.479 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1226] : Inform SVCMOD_HEALTH_FAIL to cluster module
Jan 25 05:16:41.479 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:189] : Printing last 20 HB messages:
MessageType TimeStamp MessageID Status Description
HB REQUEST Jan 25 02:16:31.089 0000000003680e1c
HB RESPONSE Jan 25 02:16:31.090 0000000003680e1c 100
HB REQUEST Jan 25 02:16:32.090 0000000003680e1d
HB RESPONSE Jan 25 02:16:32.090 0000000003680e1d 100
HB REQUEST Jan 25 02:16:33.090 0000000003680e1e
HB RESPONSE Jan 25 02:16:33.090 0000000003680e1e 100
HB REQUEST Jan 25 02:16:34.090 0000000003680e1f
HB RESPONSE Jan 25 02:16:34.090 0000000003680e1f 100
HB REQUEST Jan 25 02:16:35.090 0000000003680e20
HB RESPONSE Jan 25 02:16:35.092 0000000003680e20 100
HB REQUEST Jan 25 02:16:36.090 0000000003680e21
HB RESPONSE Jan 25 02:16:36.091 0000000003680e21 100
HB REQUEST Jan 25 02:16:37.090 0000000003680e22
HB RESPONSE Jan 25 02:16:37.091 0000000003680e22 100
HB REQUEST Jan 25 02:16:38.091 0000000003680e23
HB RESPONSE Jan 25 02:16:38.092 0000000003680e23 100
HB REQUEST Jan 25 02:16:39.091 0000000003680e24
HB RESPONSE Jan 25 02:16:39.092 0000000003680e24 100
HB REQUEST Jan 25 02:16:40.091 0000000003680e25
HB REQUEST Jan 25 02:16:41.092 0000000003680e26
Jan 25 05:16:43.919 ERROR : [appAgent_reply_processor_thread] : [application_agent.c:3291] : appAgent_process_status_code failed. status code = 508

When the MIO disk fails due to the well-known FN-72077, the log can contain the following:

Nov 24 15:55:33.039 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1213] : Process HB_RESPONSE message
Nov 24 15:55:33.039 WARN : [appAgent_hb_resp_processor_thread] : [application_agent.c:1671] : Set fsm to FSM_REQUEST_REGISTER
Nov 24 15:55:33.039 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:1676] : Inform SVCMOD_HEALTH_FAIL to cluster module since AppAG issued re-register
Nov 24 15:55:33.039 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:189] : Printing last 20 HB messages:
MessageType TimeStamp MessageID Status Description
HB REQUEST Nov 24 15:56:58.643 00000000030e88a6
HB RESPONSE Nov 24 15:56:58.644 00000000030e88a6 100
HB REQUEST Nov 24 15:56:59.643 00000000030e88a7
HB RESPONSE Nov 24 15:56:59.644 00000000030e88a7 100
HB REQUEST Nov 24 15:57:00.644 00000000030e88a8
HB RESPONSE Nov 24 15:57:00.644 00000000030e88a8 100
HB REQUEST Nov 24 15:57:01.644 00000000030e88a9
HB RESPONSE Nov 24 15:57:01.644 00000000030e88a9 100
HB REQUEST Nov 24 15:57:02.644 00000000030e88aa
HB RESPONSE Nov 24 15:57:02.644 00000000030e88aa 100
HB REQUEST Nov 24 15:57:03.644 00000000030e88ab
HB RESPONSE Nov 24 15:57:03.645 00000000030e88ab 100
HB REQUEST Nov 24 15:57:04.644 00000000030e88ac
HB RESPONSE Nov 24 15:57:04.645 00000000030e88ac 100
HB REQUEST Nov 24 15:57:05.644 00000000030e88ad
HB RESPONSE Nov 24 15:57:05.735 00000000030e88ad 100
HB REQUEST Nov 24 15:57:06.645 00000000030e88ae
HB RESPONSE Nov 24 15:57:06.645 00000000030e88ae 100
HB REQUEST Nov 24 15:57:07.645 00000000030e88af
HB RESPONSE Nov 24 15:57:07.645 00000000030e88af 510
Nov 24 15:55:33.039 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:3291] : appAgent_process_status_code failed. status code = 510

Etc. There can be other reasons too, not necessarily just a delay in communications between the MIO and the ASA/FTD.
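If it helps, this is roughly how that log can be pulled from the ASA side for correlation (a sketch; the SCP destination is just a placeholder):

ciscoasa# more disk0:/log/asa-appagent.log
ciscoasa# copy disk0:/log/asa-appagent.log scp://user@192.0.2.10/asa-appagent.log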

Update for the FN was already installed a while ago.
The asa-appagent.log shows the following:

 

May 12 18:33:33.269 INFO : [appagent_sync_server_thread] : [application_agent_mio_interface.c:812] : asa_001_SERIAL.appAttributes.get, ssp-xml:5, 9, appAG, <appAttributesGetRequest></appAttributesGetRequest>
May 12 18:33:33.269 INFO : [appagent_sync_server_thread] : [application_agent_mio_interface.c:867] : appAG.appAttributes.get, ssp-xml:5, 9, 100, <appAttributesGetResponse><appAttributesList><item key="mgmt-ip">192.168.2.47</item><item key="mgmt-url">https://192.168.2.47/</item><item key="ha-role">active</item><item key="ha-link-intf">Ethernet1/5</item><item key="ha-lan-intf">Ethernet1/5</item></appAttributesList><response><code>100</code><message>Request success</message></response></appAttributesGetResponse>
May 12 18:33:36.530 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1219] : Process HB_RESPONSE message
May 12 18:33:36.530 WARN : [appAgent_hb_resp_processor_thread] : [application_agent.c:1729] : Set fsm to FSM_REQUEST_REGISTER
May 12 18:33:36.530 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:1736] : Inform SVCMOD_HEALTH_FAIL to cluster module since AppAG issued re-register. (hc_state = 0x1)
May 12 18:33:36.530 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:195] : Printing last 20 HB messages:
MessageType  TimeStamp            MessageID         Status  Description
HB REQUEST   May 12 18:33:27.520  00000000001e3ecf          
HB RESPONSE  May 12 18:33:27.521  00000000001e3ecf  100     
HB REQUEST   May 12 18:33:28.521  00000000001e3ed0          
HB RESPONSE  May 12 18:33:28.521  00000000001e3ed0  100     
HB REQUEST   May 12 18:33:29.521  00000000001e3ed1          
HB RESPONSE  May 12 18:33:29.521  00000000001e3ed1  100     
HB REQUEST   May 12 18:33:30.521  00000000001e3ed2          
HB REQUEST   May 12 18:33:31.522  00000000001e3ed3          
HB RESPONSE  May 12 18:33:32.521  00000000001e3ed2  100     Discarded: More recent response received
HB RESPONSE  May 12 18:33:32.522  00000000001e3ed3  100     
HB REQUEST   May 12 18:33:32.522  00000000001e3ed4          
HB RESPONSE  May 12 18:33:32.523  00000000001e3ed4  100     
HB REQUEST   May 12 18:33:33.522  00000000001e3ed5          
HB RESPONSE  May 12 18:33:33.523  00000000001e3ed5  100     
HB REQUEST   May 12 18:33:34.523  00000000001e3ed6          
HB RESPONSE  May 12 18:33:34.523  00000000001e3ed6  100     
HB REQUEST   May 12 18:33:35.523  00000000001e3ed7          
HB RESPONSE  May 12 18:33:35.523  00000000001e3ed7  100     
HB REQUEST   May 12 18:33:36.523  00000000001e3ed8          
HB RESPONSE  May 12 18:33:36.523  00000000001e3ed8  510     
May 12 18:33:36.530 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:3481] : appAgent_process_status_code failed. status code = 510
May 12 18:33:37.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:38.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:39.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:40.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:41.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:42.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:42.860 WARN : [appAgent_reply_processor_thread] : [application_agent.c:2716] : Retry number 1 for Register
May 12 18:33:42.860 INFO : [appagent_async_client_send_thread] : [application_agent_mio_interface.c:459] : appAG.register, ssp-xml:5, 0000000000000053, asa_001_SERIAL, <registerRequest><appInstanceId>asa_001_SERIAL</appInstanceId><agentName>ASA_LINA</agentName><ip>127.128.1.254</ip><port>5556</port><capabilities><capability>clusterConfig</capability><capability>interfaceMapping</capability><capability>interfaceStatus</capability><capability>clusterState</capability><capability>bundleStatus</capability><capability>rejoinCluster</capability><capability>managementInfo</capability><capability>netMgmtBootstrap</capability><capability>appAttributes</capability><capability>license</capability><capability>serviceChainStatus</capability><capability>inlinePair</capability><capability>licenseAuthRenewal</capability><capability>telemetry</capability></capabilities></registerRequest>
May 12 18:33:42.870 INFO : [appagent_async_client_receive_thread] : [application_agent_mio_interface.c:602] : asa_001_SERIAL.register, ssp-xml:5, 0000000000000053, 100, <registerResponse><response><code>100</code><message>OK</message></response><version>2.12(0.450)</version></registerResponse>
May 12 18:33:42.870 INFO : [appAgent_reply_processor_thread] : [application_agent.c:2727] : FSM stage change (FSM_REQUEST_REGISTER->FSM_HEARTBEAT)
May 12 18:33:42.870 INFO : [appAgent_reply_processor_thread] : [application_agent.c:2731] : reset the HB request count
May 12 18:33:43.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:43.490 ERROR : [fover_FSM_thread] : [application_agent_interface_commands.c:5062] : AppAgent is not online. Cannot send nic mac filtering message
May 12 18:33:43.650 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1219] : Process HB_RESPONSE message
May 12 18:33:43.650 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_msgHandler.c:5348] : Adapter 2 platform health status is ok

 

100 is OK and 510 is ERR_RE_REGISTER. You can see that heartbeat messages between the MIO and the ASA appAgent were flowing fine, then suddenly the MIO instructed the appAgent to re-register. Collect "show tech fprm detail" and try to find the corresponding events in the "show event detail" output from sam_techsupportinfo. This log is in reverse order. This issue could be caused by some FXOS process restarting, so also check "connect local-mgmt" > "show pmon state" and "dir cores".
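For example, roughly (the "firepower" prompt is just a placeholder hostname; a recently restarted process in "show pmon state" or a fresh core file around the failover timestamp would point to the FXOS side):

firepower# connect local-mgmt
firepower(local-mgmt)# show pmon state
firepower(local-mgmt)# dir cores
firepower(local-mgmt)# exit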

 

This was the very first event when it happened:

 

Creation Time: 2023-05-12T18:33:32.522
ID: 222016
Description: App-instance asa-fw-HOSTNAME-1 on slot 1 is not responding.
Affected Object: sec-svc/slot-1/app-inst-asa-fw-HOSTNAME-1
Trigger: Oper
User: internal
Cause: App Not Responding
Code: E4197973

 

After that, there were many more entries about reconnecting to the application, which it did within some 30 seconds. No cores were written at that time, and the ASA image didn't restart; it just did a failover.

This is weird. Check /var/sysmgr/sam_logs/svc_sam_appAG.log from the FPRM show tech. This is the FXOS service that responds to ASA heartbeats.
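In case it's useful, roughly how the FPRM show tech (which contains /var/sysmgr/sam_logs/svc_sam_appAG.log) can be generated and exported from local-mgmt; the bundle name and SCP target below are placeholders:

firepower# connect local-mgmt
firepower(local-mgmt)# show tech-support fprm detail
firepower(local-mgmt)# dir workspace:///techsupport/
firepower(local-mgmt)# copy workspace:///techsupport/<generated-bundle>.tar.gz scp://user@192.0.2.10/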

 

Looks like this:

[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][app_sam_appAG] getMemUsageOfAppAG: Executing ps -u -p `pidof svc_sam_appAG` | grep svc_sam_appAG
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][os:run]        Executing /bin/sh command
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][os:run]        argv[1] = -c
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][os:run]        argv[2] = ps -u -p `pidof svc_sam_appAG` | grep svc_sam_appAG
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.251][app_sam_appAG] ps -u -p `pidof svc_sam_appAG` | grep svc_sam_appAG: root      5422  0.0  1.0 238924 83780 ?        SNl  Jan18 112:10 svc_sam_appAG --x
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.251][app_sam_appAG] run: lMemUsed = 238924, lMemHeapSize = 660602880
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.251][app_sam_appAG] appAG mon task: don't need to restart now.
[WARN][0x6c8fcb40][CEST 2023-05-12 18:33:32.521][app_sam_appAG] FTWMonitorTask: 0 : timeout on queue: asa_001_SERIAL, timeout: 3000, monitor_app: 1
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] HBRecordList::dump asa_001_SERIAL
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed3: rcv=[CEST 2023-05-12 18:33:32.521] , snd=[CEST 2023-05-12 18:33:32.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed2: rcv=[CEST 2023-05-12 18:33:32.521] , snd=[CEST 2023-05-12 18:33:32.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed1: rcv=[CEST 2023-05-12 18:33:29.521] , snd=[CEST 2023-05-12 18:33:29.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed0: rcv=[CEST 2023-05-12 18:33:28.521] , snd=[CEST 2023-05-12 18:33:28.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecf: rcv=[CEST 2023-05-12 18:33:27.521] , snd=[CEST 2023-05-12 18:33:27.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ece: rcv=[CEST 2023-05-12 18:33:26.520] , snd=[CEST 2023-05-12 18:33:26.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecd: rcv=[CEST 2023-05-12 18:33:25.520] , snd=[CEST 2023-05-12 18:33:25.520] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecc: rcv=[CEST 2023-05-12 18:33:24.520] , snd=[CEST 2023-05-12 18:33:24.520] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecb: rcv=[CEST 2023-05-12 18:33:23.520] , snd=[CEST 2023-05-12 18:33:23.520] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3eca: rcv=[CEST 2023-05-12 18:33:22.520] , snd=[CEST 2023-05-12 18:33:22.520] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] FTWMonitorTask: run: sending heartbeat map update to DME, as app asa_001_SERIAL just came online
[INFO][0x73c52b40][CEST 2023-05-12 18:33:32.526][app_sam_appAG] AG_HandleLicenseCleanup: appInstanceID: asa_001_SERIAL
[INFO][0x73c52b40][CEST 2023-05-12 18:33:32.526][app_sam_appAG] LicenseMgr::send_recv: app_InstanceID: asa_001_SERIAL
[INFO][0x7364cb40][CEST 2023-05-12 18:33:33.271][app_sam_appAG] send_recv: msg: asa_001_SERIAL.appAttributes.get, ssp-xml:5, 9, appAG, <appAttributesGetRequest></appAttributesGetRequest>
[INFO][0x73c52b40][CEST 2023-05-12 18:33:35.531][app_sam_appAG] AG_deRegisterAppInstance: app: asa_001_SERIAL, de-register app, check and failover to bypass FTW if any
[INFO][0x6f868b40][CEST 2023-05-12 18:33:36.523][app_sam_appAG] appInst: asa_001_SERIAL not registered, reject: appAG.heartbeat.update
[WARN][0x6c8fcb40][CEST 2023-05-12 18:33:38.523][app_sam_appAG] FTWMonitorTask: 0 : timeout on queue: asa_001_SERIAL, timeout: 3000, monitor_app: 1
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:38.523][app_sam_appAG] HBRecordList::dump asa_001_SERIAL

After this, some hundreds of lines more follow.

Yes, there is a 3-second delay between the messages below, so a timeout occurred. You probably need to increase the retry count in this command (an adjusted example follows the log excerpt):

app-agent heartbeat interval 1000 retry-count 3

 

[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed2: rcv=[CEST 2023-05-12 18:33:32.521] , snd=[CEST 2023-05-12 18:33:32.521] 
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed1: rcv=[CEST 2023-05-12 18:33:29.521] , snd=[CEST 2023-05-12 18:33:29.521] 
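For example, keeping the 1-second interval and raising the retry count to the value from the bug's suggested workaround (a sketch; with 6 retries a gap of up to roughly 6 seconds should be tolerated before a health failure is declared, which covers the ~3-second gap above):

app-agent heartbeat interval 1000 retry-count 6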

 

Thanks, that's what I did. Let's hope it doesn't reappear. 
