07-14-2022 11:42 PM
I've recently discovered that we're also affected by this bug. We too experienced an ASA failover event due to "MIO-blade heartbeat failure", even though we're running a higher version than the one in the bug report: FXOS 2.11(1.154) / ASA 9.16(3).
I googled and found this bug report, but saw that it's in status "terminated". It does mention a workaround, though: "Increase AppAgent timeout and retry settings. Suggested setting: timeout=1000ms, retry=6".
However, finding out how to actually do this is not easy; I've found no resource that explains how to accomplish it.
Does anybody in the community know how to adjust this timeout before I open a TAC case?
Regards
07-22-2022 12:41 AM
There is an ASA configuration command:
app-agent heartbeat interval 1000 retry-count 3
In multiple-context mode, this is configured in the system execution space. You can increase the retry-count to 6.
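For example, applying the workaround values from the bug report (timeout=1000 ms, retry=6) would look roughly like this; the hostname is illustrative, and in multiple-context mode the command goes in the system execution space:

```
asa# configure terminal
asa(config)# app-agent heartbeat interval 1000 retry-count 6
asa(config)# end
asa# show running-config app-agent
app-agent heartbeat interval 1000 retry-count 6
```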
As far as I know, there is an architectural enhancement (ENH) in ASA 9.18+ to prevent such failures:
CSCvz94573 MIO heartbeat failure caused by heartbeat dropped by delay
Bug toolkit says that the fix was backported to ASA 9.16.3 too, so you may want to open a TAC case and ask them why you're still encountering this failure in 9.16.3.
05-16-2023 11:48 PM
Still getting this issue, running ASA 9.18.3 on an FPR 4115 with FXOS 2.12.0.450, so it doesn't seem to be fully fixed in 9.18.x yet.
05-17-2023 12:24 AM
@patoberli, actually, the "%ASA-1-104001: (Primary) Switching to ACTIVE - Other unit wants me Active. Secondary unit switch reason: MIO-blade heartbeat failure" error can be produced for various reasons. You need to correlate it with disk0:/log/asa-appagent.log and other log files to get to the root cause. Just as an example, when the blade CIMC/BMC crashes and FXOS doesn't have the "CSCvx66494 Handle CIMC Watchdog reset in MIO" fix, asa-appagent.log contains the following:
Jan 25 05:16:41.479 WARN : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1223] : Set fsm to FSM_INIT as there is no response to HB requests since 3000 msec
Jan 25 05:16:41.479 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1226] : Inform SVCMOD_HEALTH_FAIL to cluster module
Jan 25 05:16:41.479 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:189] : Printing last 20 HB messages:
MessageType TimeStamp MessageID Status Description
HB REQUEST Jan 25 02:16:31.089 0000000003680e1c
HB RESPONSE Jan 25 02:16:31.090 0000000003680e1c 100
HB REQUEST Jan 25 02:16:32.090 0000000003680e1d
HB RESPONSE Jan 25 02:16:32.090 0000000003680e1d 100
HB REQUEST Jan 25 02:16:33.090 0000000003680e1e
HB RESPONSE Jan 25 02:16:33.090 0000000003680e1e 100
HB REQUEST Jan 25 02:16:34.090 0000000003680e1f
HB RESPONSE Jan 25 02:16:34.090 0000000003680e1f 100
HB REQUEST Jan 25 02:16:35.090 0000000003680e20
HB RESPONSE Jan 25 02:16:35.092 0000000003680e20 100
HB REQUEST Jan 25 02:16:36.090 0000000003680e21
HB RESPONSE Jan 25 02:16:36.091 0000000003680e21 100
HB REQUEST Jan 25 02:16:37.090 0000000003680e22
HB RESPONSE Jan 25 02:16:37.091 0000000003680e22 100
HB REQUEST Jan 25 02:16:38.091 0000000003680e23
HB RESPONSE Jan 25 02:16:38.092 0000000003680e23 100
HB REQUEST Jan 25 02:16:39.091 0000000003680e24
HB RESPONSE Jan 25 02:16:39.092 0000000003680e24 100
HB REQUEST Jan 25 02:16:40.091 0000000003680e25
HB REQUEST Jan 25 02:16:41.092 0000000003680e26
Jan 25 05:16:43.919 ERROR : [appAgent_reply_processor_thread] : [application_agent.c:3291] : appAgent_process_status_code failed. status code = 508
When the MIO disk fails due to the well-known FN-72077, it can contain the following:
Nov 24 15:55:33.039 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1213] : Process HB_RESPONSE message
Nov 24 15:55:33.039 WARN : [appAgent_hb_resp_processor_thread] : [application_agent.c:1671] : Set fsm to FSM_REQUEST_REGISTER
Nov 24 15:55:33.039 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:1676] : Inform SVCMOD_HEALTH_FAIL to cluster module since AppAG issued re-register
Nov 24 15:55:33.039 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:189] : Printing last 20 HB messages:
MessageType TimeStamp MessageID Status Description
HB REQUEST Nov 24 15:56:58.643 00000000030e88a6
HB RESPONSE Nov 24 15:56:58.644 00000000030e88a6 100
HB REQUEST Nov 24 15:56:59.643 00000000030e88a7
HB RESPONSE Nov 24 15:56:59.644 00000000030e88a7 100
HB REQUEST Nov 24 15:57:00.644 00000000030e88a8
HB RESPONSE Nov 24 15:57:00.644 00000000030e88a8 100
HB REQUEST Nov 24 15:57:01.644 00000000030e88a9
HB RESPONSE Nov 24 15:57:01.644 00000000030e88a9 100
HB REQUEST Nov 24 15:57:02.644 00000000030e88aa
HB RESPONSE Nov 24 15:57:02.644 00000000030e88aa 100
HB REQUEST Nov 24 15:57:03.644 00000000030e88ab
HB RESPONSE Nov 24 15:57:03.645 00000000030e88ab 100
HB REQUEST Nov 24 15:57:04.644 00000000030e88ac
HB RESPONSE Nov 24 15:57:04.645 00000000030e88ac 100
HB REQUEST Nov 24 15:57:05.644 00000000030e88ad
HB RESPONSE Nov 24 15:57:05.735 00000000030e88ad 100
HB REQUEST Nov 24 15:57:06.645 00000000030e88ae
HB RESPONSE Nov 24 15:57:06.645 00000000030e88ae 100
HB REQUEST Nov 24 15:57:07.645 00000000030e88af
HB RESPONSE Nov 24 15:57:07.645 00000000030e88af 510
Nov 24 15:55:33.039 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:3291] : appAgent_process_status_code failed. status code = 510
There can be other reasons too; it is not necessarily just a delay in communications between the MIO and the ASA/FTD.
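As a quick illustration (my own one-liner, not an official tool), the non-100 status codes can be pulled out of a saved copy of asa-appagent.log like this, assuming the HB table format shown above, where the status code is the last field:

```shell
# List heartbeat responses whose status code (last field) is not 100;
# codes such as 508 or 510 pinpoint the moment the failure started.
grep 'HB RESPONSE' asa-appagent.log | awk '$NF != "100"'
```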
05-19-2023 12:32 AM - edited 05-19-2023 12:35 AM
The update for the FN was already installed a while ago.
The asa-appagent.log shows the following:
May 12 18:33:33.269 INFO : [appagent_sync_server_thread] : [application_agent_mio_interface.c:812] : asa_001_SERIAL.appAttributes.get, ssp-xml:5, 9, appAG, <appAttributesGetRequest></appAttributesGetRequest>
May 12 18:33:33.269 INFO : [appagent_sync_server_thread] : [application_agent_mio_interface.c:867] : appAG.appAttributes.get, ssp-xml:5, 9, 100, <appAttributesGetResponse><appAttributesList><item key="mgmt-ip">192.168.2.47</item><item key="mgmt-url">https://192.168.2.47/</item><item key="ha-role">active</item><item key="ha-link-intf">Ethernet1/5</item><item key="ha-lan-intf">Ethernet1/5</item></appAttributesList><response><code>100</code><message>Request success</message></response></appAttributesGetResponse>
May 12 18:33:36.530 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1219] : Process HB_RESPONSE message
May 12 18:33:36.530 WARN : [appAgent_hb_resp_processor_thread] : [application_agent.c:1729] : Set fsm to FSM_REQUEST_REGISTER
May 12 18:33:36.530 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:1736] : Inform SVCMOD_HEALTH_FAIL to cluster module since AppAG issued re-register. (hc_state = 0x1)
May 12 18:33:36.530 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:195] : Printing last 20 HB messages:
MessageType TimeStamp MessageID Status Description
HB REQUEST May 12 18:33:27.520 00000000001e3ecf
HB RESPONSE May 12 18:33:27.521 00000000001e3ecf 100
HB REQUEST May 12 18:33:28.521 00000000001e3ed0
HB RESPONSE May 12 18:33:28.521 00000000001e3ed0 100
HB REQUEST May 12 18:33:29.521 00000000001e3ed1
HB RESPONSE May 12 18:33:29.521 00000000001e3ed1 100
HB REQUEST May 12 18:33:30.521 00000000001e3ed2
HB REQUEST May 12 18:33:31.522 00000000001e3ed3
HB RESPONSE May 12 18:33:32.521 00000000001e3ed2 100 Discarded: More recent response received
HB RESPONSE May 12 18:33:32.522 00000000001e3ed3 100
HB REQUEST May 12 18:33:32.522 00000000001e3ed4
HB RESPONSE May 12 18:33:32.523 00000000001e3ed4 100
HB REQUEST May 12 18:33:33.522 00000000001e3ed5
HB RESPONSE May 12 18:33:33.523 00000000001e3ed5 100
HB REQUEST May 12 18:33:34.523 00000000001e3ed6
HB RESPONSE May 12 18:33:34.523 00000000001e3ed6 100
HB REQUEST May 12 18:33:35.523 00000000001e3ed7
HB RESPONSE May 12 18:33:35.523 00000000001e3ed7 100
HB REQUEST May 12 18:33:36.523 00000000001e3ed8
HB RESPONSE May 12 18:33:36.523 00000000001e3ed8 510
May 12 18:33:36.530 ERROR : [appAgent_hb_resp_processor_thread] : [application_agent.c:3481] : appAgent_process_status_code failed. status code = 510
May 12 18:33:37.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:38.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:39.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:40.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:41.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:42.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:42.860 WARN : [appAgent_reply_processor_thread] : [application_agent.c:2716] : Retry number 1 for Register
May 12 18:33:42.860 INFO : [appagent_async_client_send_thread] : [application_agent_mio_interface.c:459] : appAG.register, ssp-xml:5, 0000000000000053, asa_001_SERIAL, <registerRequest><appInstanceId>asa_001_SERIAL</appInstanceId><agentName>ASA_LINA</agentName><ip>127.128.1.254</ip><port>5556</port><capabilities><capability>clusterConfig</capability><capability>interfaceMapping</capability><capability>interfaceStatus</capability><capability>clusterState</capability><capability>bundleStatus</capability><capability>rejoinCluster</capability><capability>managementInfo</capability><capability>netMgmtBootstrap</capability><capability>appAttributes</capability><capability>license</capability><capability>serviceChainStatus</capability><capability>inlinePair</capability><capability>licenseAuthRenewal</capability><capability>telemetry</capability></capabilities></registerRequest>
May 12 18:33:42.870 INFO : [appagent_async_client_receive_thread] : [application_agent_mio_interface.c:602] : asa_001_SERIAL.register, ssp-xml:5, 0000000000000053, 100, <registerResponse><response><code>100</code><message>OK</message></response><version>2.12(0.450)</version></registerResponse>
May 12 18:33:42.870 INFO : [appAgent_reply_processor_thread] : [application_agent.c:2727] : FSM stage change (FSM_REQUEST_REGISTER->FSM_HEARTBEAT)
May 12 18:33:42.870 INFO : [appAgent_reply_processor_thread] : [application_agent.c:2731] : reset the HB request count
May 12 18:33:43.430 WARN : [fover_health_monitoring_thread] : [application_agent.c:1358] : health-check FAIL!!
May 12 18:33:43.490 ERROR : [fover_FSM_thread] : [application_agent_interface_commands.c:5062] : AppAgent is not online. Cannot send nic mac filtering message
May 12 18:33:43.650 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_mio_interface.c:1219] : Process HB_RESPONSE message
May 12 18:33:43.650 INFO : [appAgent_hb_resp_processor_thread] : [application_agent_msgHandler.c:5348] : Adapter 2 platform health status is ok
05-19-2023 12:56 AM
100 is OK and 510 is ERR_RE_REGISTER. You can see that heartbeat messages between the MIO and the ASA appAgent were flowing fine, then suddenly the MIO instructed the appAgent to re-register. Collect "show tech fprm detail" and try to find the corresponding events in the "show event detail" from sam_techsupportinfo (note this log is in reverse order). This issue could be caused by some FXOS process restarting, so also check "connect local-mgmt" > "show pmon state" and "dir cores".
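For reference, the FXOS-side checks mentioned above would look roughly like this on a 4100/9300 chassis (prompts are illustrative; "show tech-support fprm detail" is the full form of the command):

```
firepower# connect local-mgmt
firepower(local-mgmt)# show tech-support fprm detail
firepower(local-mgmt)# show pmon state
firepower(local-mgmt)# dir cores
```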
05-19-2023 02:44 AM - edited 05-19-2023 02:44 AM
This here was the very first event when that happened:
Creation Time: 2023-05-12T18:33:32.522
ID: 222016
Description: App-instance asa-fw-HOSTNAME-1 on slot 1 is not responding.
Affected Object: sec-svc/slot-1/app-inst-asa-fw-HOSTNAME-1
Trigger: Oper
User: internal
Cause: App Not Responding
Code: E4197973
After that there were many more entries about reconnecting to the application, which completed within some 30 seconds. No cores were written at that time, and the ASA image didn't restart; it just failed over.
05-19-2023 04:10 AM
This is weird. Check /var/sysmgr/sam_logs/svc_sam_appAG.log from the FPRM show tech. This FXOS service is what responds to the ASA heartbeats.
05-19-2023 04:28 AM
It looks like this:
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][app_sam_appAG] getMemUsageOfAppAG: Executing ps -u -p `pidof svc_sam_appAG` | grep svc_sam_appAG
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][os:run] Executing /bin/sh command
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][os:run] argv[1] = -c
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.202][os:run] argv[2] = ps -u -p `pidof svc_sam_appAG` | grep svc_sam_appAG
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.251][app_sam_appAG] ps -u -p `pidof svc_sam_appAG` | grep svc_sam_appAG: root 5422 0.0 1.0 238924 83780 ? SNl Jan18 112:10 svc_sam_appAG --x
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.251][app_sam_appAG] run: lMemUsed = 238924, lMemHeapSize = 660602880
[INFO][0x6f767b40][CEST 2023-05-12 18:28:53.251][app_sam_appAG] appAG mon task: don't need to restart now.
[WARN][0x6c8fcb40][CEST 2023-05-12 18:33:32.521][app_sam_appAG] FTWMonitorTask: 0 : timeout on queue: asa_001_SERIAL, timeout: 3000, monitor_app: 1
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] HBRecordList::dump asa_001_SERIAL
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed3: rcv=[CEST 2023-05-12 18:33:32.521] , snd=[CEST 2023-05-12 18:33:32.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed2: rcv=[CEST 2023-05-12 18:33:32.521] , snd=[CEST 2023-05-12 18:33:32.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed1: rcv=[CEST 2023-05-12 18:33:29.521] , snd=[CEST 2023-05-12 18:33:29.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed0: rcv=[CEST 2023-05-12 18:33:28.521] , snd=[CEST 2023-05-12 18:33:28.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecf: rcv=[CEST 2023-05-12 18:33:27.521] , snd=[CEST 2023-05-12 18:33:27.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ece: rcv=[CEST 2023-05-12 18:33:26.520] , snd=[CEST 2023-05-12 18:33:26.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecd: rcv=[CEST 2023-05-12 18:33:25.520] , snd=[CEST 2023-05-12 18:33:25.520]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecc: rcv=[CEST 2023-05-12 18:33:24.520] , snd=[CEST 2023-05-12 18:33:24.520]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ecb: rcv=[CEST 2023-05-12 18:33:23.520] , snd=[CEST 2023-05-12 18:33:23.520]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3eca: rcv=[CEST 2023-05-12 18:33:22.520] , snd=[CEST 2023-05-12 18:33:22.520]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] FTWMonitorTask: run: sending heartbeat map update to DME, as app asa_001_SERIAL just came online
[INFO][0x73c52b40][CEST 2023-05-12 18:33:32.526][app_sam_appAG] AG_HandleLicenseCleanup: appInstanceID: asa_001_SERIAL
[INFO][0x73c52b40][CEST 2023-05-12 18:33:32.526][app_sam_appAG] LicenseMgr::send_recv: app_InstanceID: asa_001_SERIAL
[INFO][0x7364cb40][CEST 2023-05-12 18:33:33.271][app_sam_appAG] send_recv: msg: asa_001_SERIAL.appAttributes.get, ssp-xml:5, 9, appAG, <appAttributesGetRequest></appAttributesGetRequest>
[INFO][0x73c52b40][CEST 2023-05-12 18:33:35.531][app_sam_appAG] AG_deRegisterAppInstance: app: asa_001_SERIAL, de-register app, check and failover to bypass FTW if any
[INFO][0x6f868b40][CEST 2023-05-12 18:33:36.523][app_sam_appAG] appInst: asa_001_SERIAL not registered, reject: appAG.heartbeat.update
[WARN][0x6c8fcb40][CEST 2023-05-12 18:33:38.523][app_sam_appAG] FTWMonitorTask: 0 : timeout on queue: asa_001_SERIAL, timeout: 3000, monitor_app: 1
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:38.523][app_sam_appAG] HBRecordList::dump asa_001_SERIAL
After this there are several hundred more lines.
05-19-2023 08:51 AM
Yes, there is a 3-second delay between the messages below, so a timeout occurred. You probably need to increase the retry count in:
app-agent heartbeat interval 1000 retry-count 3
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed2: rcv=[CEST 2023-05-12 18:33:32.521] , snd=[CEST 2023-05-12 18:33:32.521]
[INFO][0x6c8fcb40][CEST 2023-05-12 18:33:32.522][app_sam_appAG] hb message 00000000001e3ed1: rcv=[CEST 2023-05-12 18:33:29.521] , snd=[CEST 2023-05-12 18:33:29.521]
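To double-check dumps like this, here is a minimal parsing sketch (my own illustration, not Cisco tooling; the function name heartbeat_gaps is made up) that extracts the rcv= timestamps from the hb-message lines and flags any gap that exceeds the heartbeat budget (interval * retry-count):

```python
import re
from datetime import datetime

# Matches e.g. "hb message 00000000001e3ed1: rcv=[CEST 2023-05-12 18:33:29.521]"
HB_RE = re.compile(r"hb message \w+: rcv=\[\w+ ([\d-]+ [\d:.]+)\]")

def heartbeat_gaps(lines, interval_ms=1000, retry_count=3):
    """Return (prev, curr, gap_ms) tuples for consecutive received
    heartbeats whose spacing reaches interval_ms * retry_count."""
    budget = interval_ms * retry_count
    stamps = []
    for line in lines:
        m = HB_RE.search(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    stamps.sort()  # the log dump is printed newest-first
    gaps = []
    for prev, curr in zip(stamps, stamps[1:]):
        gap_ms = (curr - prev).total_seconds() * 1000
        if gap_ms >= budget:
            gaps.append((prev, curr, gap_ms))
    return gaps
```

Run against the two lines above, it reports the one 3000 ms gap that tripped the default 1000 ms x 3 budget.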
05-22-2023 09:11 AM
Thanks, that's what I did. Let's hope it doesn't reappear.