01-06-2023 07:56 PM - edited 03-20-2025 09:00 AM
The Portuguese version of this Article can be found at: ISE - Slow Replication (versão Português).
| For an offline or printed copy of this document, simply choose ⋮ Options > Printer Friendly Page. You may then Print > Print to PDF or Copy & Paste to any other document format you like. | 
This document focus on ISE Slow Replication caused by "external actors" (not ISE), it briefly introduces the meaning of Slow Replication, how to deal with it, its impact on ISE Deployment. and last but not least a list of Bug IDs. : )
If you see the Slow Replication event, I highly recommend checking it out.
The Slow Replication message is part of the Category: Administrative and Operational Audit Category, Class: Replication and with the Message Codes: 60150, 60151 and 60152 (at Administration > System > Logging > Message Catalog) :
There are 3 Events that triggers the Slow Replication at a certain threshold level when messages for a specific Node has queued up beyond a safe level. They are triggered based on an ISE Node not consuming replicated messages for the past 15+ min:
The totalPendingMsgCount is the pending messages of the Secondary Node in relation to the Primary Node (the Primary PAN).
The totalTimeDiffInMinutes is the difference in minutes between the time of the 1st unconsumed Primary message and the Secondary time.
You are able to check the Slow Replication via:
At ISE > Home you may see the Slow Replication record in the Alarms dashboard:
Using the following command:
ise/admin# show logging application deployment.log
...
2022-10-18 10:24:39,148 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Slow replication Warn level alarm will be generated for host <PSN HOSTNAME>
message : Node <PSN HOSTNAME> has slow replication since this node is not consuming messages for past 18 minutes. The number of pending messages are 32103
Status of this node is SYNC COMPLETED
Below are the overrall details of the nodes
a. Seq No in Primary : 799879069
b. Seq No in Secondary : 799846966
c. Current Time : 27768324
d. Primary Seq Time: 27768324
e. Secondary Seq Time: 27768319
f. Time of first unconsumed message in Primary: 27768306
2022-10-18 10:24:39,148 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Inside logAuditMessages to raise slow replication alarm with level [WARN] for node [<PSN HOSTNAME>]
2022-10-18 10:24:39,150 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Slow replication alarm is generated successfully for node <PSN HOSTNAME>
...
2022-10-18 10:24:39,181 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Slow replication Critical level alarm will be generated for host <PSN HOSTNAME>
message : Node <PSN HOSTNAME> has slow replication since this node is not consuming messages for past 23 minutes. The number of pending messages are 44006
Status of this node is SYNC COMPLETED
Below are the overrall details of the nodes
a. Seq No in Primary : 799879070
b. Seq No in Secondary : 799835064
c. Current Time : 27768324
d. Primary Seq Time: 27768324
e. Secondary Seq Time: 27768322
f. Time of first unconsumed message in Primary: 27768301
2022-10-18 10:24:39,181 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Inside logAuditMessages to raise slow replication alarm with level [CRITICAL] for node [<PSN HOSTNAME>]
2022-10-18 10:24:39,183 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Slow replication alarm is generated successfully for node <PSN HOSTNAME>
...
IMPORTANT: at ISE GUI we have the Slow Replication: Info, Warning and Error ... at ISE CLI we have the Slow Replication: Info, Warn and Critical.
You have the following options to check the totalPendingMsgCount and totalTimeDiffInMinutes:
1. you can easily check the totalPendingMsgCount at Administration > System > Deployment > check the Messages to be Synced:
2. you can check the totalTimeDiffInMinutes at ISE PPAN CLI (in the example below: the totalTimeDiffInMinutes = 26, the Time of first unconsumed message in Primary = 27712123 and the Secondary Seq Time = 27712149
ise/admin# show logging application deployment.log
...
2022-09-09 10:09:41,795 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Slow replication Critical level alarm will be generated for host <PSN Hostname>
message : Node <PSN Hostname> has slow replication since this node is not consuming messages for past 26 minutes. The number of pending messages are 74872
Status of this node is SYNC COMPLETED
Below are the overrall details of the nodes
a. Seq No in Primary : 775241426
b. Seq No in Secondary : 775166554
c. Current Time : 27712149
d. Primary Seq Time: 27712149
e. Secondary Seq Time: 27712149
f. Time of first unconsumed message in Primary: 27712123
...
3. you are also able to check the totalPendingMsgCount & totalTimeDiffInMinutes via ISE PPAN CLI:
ise/admin# show logging application deployment.log
...
2022-12-20 10:40:30,284 INFO [Deployment-RegistrationPoller][] cpm.infrastructure.deployment.notifier.NodeChangeLogger -::::- Sequence details for host <PSN Hostname> replicationStatus[SYNC COMPLETED] primarySequenceCount[838148783] primaryTimeInMins[27859060] currentTimeInMins[27859060] firstUnconsumedPrimarySeqTime[ 27859031] secondarySequenceCount[838107650] secondaryTimeInMins[27859060] totalPendingMsgCount[41133] totalTimeDiffInMinutes [29]
...
Examples and possible effects of the Slow Replication:
the Slow Replication Error event is very harmful for the ISE Deployment (especially when it is triggered repeatedly, as the Node may not consume Message for hours, in the example bellow: 86 min at 10:07AM
During a Slow Replication issue avoid the generation of:
Reservation in VMWare VM is a MUST to mitigate Slow Replication, obviously your infrastructure needs to have enough capacity to support these reservations as these resources will be NO MORE SHARED (i.e. you MUST specify a Reservation if you need to guarantee that the minimum required amounts of CPU or Memory are always available for the VM).
At VMWare Host Summary double check the CPU Cores info (for example: 80 CPUs with 2.39GHz each) and the Processor Type:
At Performance and Scalability Guide for Cisco ISE double check:
1. at Cisco ISE Hardware Appliance, if your Processor Type is greater than the Processor Specification:
2. at Cisco ISE on Virtual Appliance, if your CPU & Memory is compatible with the vCPU & Memory Specification:
At VMWare VM Summary click the Edit Settings button:
and double check the Reservation info:
IMPORTANT 1: in this example we have 24x CPU, each with 2.39GHz clock, with a total of 57.36 GHz (24x 2.39 GHz).
IMPORTANT 2: when you Power On a VM, the system checks the amount of CPU and Memory Resources that have not yet been reserved. Based on the available unreserved resources, the system determines whether it can guarantee the reservation for which the VM is configured (if any). This process is called Admission Control.
IMPORTANT 3: VMWare DRS and VMWare HA "impact" the CPU & Memory Reservation, take a look at:
Note 1: you are able to double check your CPU & Memory and Processor Type info via ISE CLI:
ise/admin# show inventory
NAME: "ISE-VM-K9 chassis", DESCR: "ISE-VM-K9 chassis"
PID: ISE-VM-K9 , VID: V01 , SN: <Serial Number>
Total RAM Memory: 263972560 kB
CPU Core Count: 24
CPU 0: Model Info: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
CPU 1: Model Info: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
....
CPU 22: Model Info: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
CPU 23: Model Info: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
Hard Disk Count(*): 1
Disk 0: Device Name: /dev/sda
Disk 0: Capacity: 644.20 GB
NIC Count: 1
NIC 0: Device Name: eth0:
NIC 0: HW Address: <MAC Addr>
NIC 0: Driver Descr: VMware vmxnet3 virtual NIC driver
(*) Hard Disk Count may be Logical.
Note 2: for more details about the CPU info:
ise/admin# show tech-support
...
*****************************************
Displaying /proc/cpuinfo...
*****************************************
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 47
model name : Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
stepping : 2
microcode : 0x3b
cpu MHz : 2394.000
cache size : 30720 KB
...
or
ise/admin# show cpu
processor : 0
model : Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
speed(MHz): 2394.000
cache size: 30720 KB
...
processor : 23
model : Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz
speed(MHz): 2394.000
cache size: 30720 KB
Note 3: to check the Reservation via ISE CLI:
ise/admin# show tech-support
...
*****************************************
Displaying UDI (Unique Device Identifier) information...
*****************************************
UDI PID: ISE-VM-K9
UDI SN: <Serial Number>
UDI VID: V01
UDI Platform type: VM
VM Memory Reservation: 262144 MB
VM CPU Reservation: 57360 MHz
VM Memory Limit: 4294967295 MB
VM CPU Limit: 4294967295 MHz
VM Swap Utilization: 0 MB
VM Memory Balloon Size: 0 MB
VM Session ID: 0x6ddb3ead2740836a
VM CPU Speed: 2394 MHz
VM Host Time: 20 Dec 2022 12:20:33
...
Hyper-Threading allows more than one Thread to run on each Core.
When Hyper-Threading is active, one Physical Core works like "two Logical Cores" that can work through tasks more efficiently by taking advantage of Idle Time when the Core would formerly be waiting for other tasks to complete.
IMPORTANT 1: Hyper-Threading improves CPU throughput by up to 30%.
IMPORTANT 2: if you are planning to use Hyper-Threading, remember the "up to 30% throughput" and avoid the CPU Ready issue !!!
IMPORTANT 3: more info at What is Hyper-Threading?.
CPU Ready Time is a vSphere metric that records the amount of time a VM is ready to use CPU but was unable to schedule time because ALL CPU Resources (on a ESX Host) are busy.
As a rule of thumb, the CPU Ready Time:
Note 1: a VM's CPU Usage and CPU Ready metrics are NOT directly correlated. A VM can have a serious issue with CPU Ready when its CPU Usage doesn't appear to be that high. In order to have a full picture of your CPU Performance, you need to look at both CPU Ready and CPU Usage.
Note 2: please take a look at:
A high Load Average usually mean that the system is being used heavily and the response time is correspondingly slow !!!
Check the Load Average in the last 5 min and identity the Threads that's wasting more CPU resources is an important step, to do that, use the ISE PPAN CLI > tech top command, with the P (to sort by CPU Usage) and H (to list the Threads), for example:
IMPORTANT 1: you MUST use the tech top command during the Slow Replication issue !!! : )
IMPORTANT 2: this information MUST be shared with TAC !!!
ISE Live Logs (at Operations > RADIUS > Live Logs) has two important dashboards:
Click the Misconfigured NAS, check and solve the Failure Reason of the NAS with the most Failed Attempts:
Note: at Operations > Reports > Reports > Diagnostics > Misconfigured NAS, you are able to check more details about the Misconfigured NAS that may help you solve the issue:
Click the Repeat Counter, check the Identity/Endpoint ID with the most Repeat Count:
Note 1: at Live Logs you are unable to use an Advanced Filter to filter by "Repeat Count greater than a Value", the following enhancement was created to deal with this situation:
Then get the Identity/Endpoint ID and at Live Logs (at Operations > RADIUS > Live Logs > filter by Identity/Endpoint ID) to double check the info and at RADIUS Accounting (at Operations > Reports > Reports > Endpoint and Users > filter by Identity/Endpoint ID) check the Account Terminate Cause column, for example:
Note 2: at this point work with your R&S Team about the Account Terminate Cause is a MUST !!!
Note 3: values of Account Terminate Cause: (take a look at RFC 2866 - Acct-Terminate-Cause)
At Operations > Reports > Reports > Endpoints and Users > Authentication Summary > check the Authentication by Failure Reason (you can also check this at Operations > Reports > Reports > Endpoints and Users > Top N Authentication by Failure Reason), try to solve or minimize the top Failure Reason, for example:
At Operations > Reports > Reports > Diagnostics > RADIUS Error > filter by Failure Reason = 5440 (Endpoint abandoned EAP Session and started new) and Export To = Repository (CSV) to check the Endpoints with the most 5440 Failure Reason counts:
Note: at this point work with your R&S Team and the Team responsible for the Supplicant is a MUST !!!
IMPORTANT 1: if you filter by Failure Reason = 5440 at Operations > Reports > Reports > Endpoint and Users > RADIUS Authentication or Operations > Reports > Reports > Endpoint and Users > Authentication Summary you will receive a No Data Found, that is why you have to check this on RADIUS Error.
IMPORTANT 2: the following enhancement was created to deal with this situation:
Your Slow Replication issue could be caused by Application Server "flapping" (from running to initializing state), double check your Active Directory Diagnostic Tool - Schedule Tests configuration (at Administration > Identity Management > External Identity Sources > Active Directory > Advanced Tools > Diagnostic Tools > check the Run Schedule Tests
please take a look at the following Bug IDs:
Attention to the following Bug IDs:
IMPORTANT: it's always a best practice to double check if your environment is up to date or if you are not using a Deferred Release, please check ISE Software Download.
This is one I'd like to explain in more detail, as it's the one that gave me "the most headache". : )
Take a look at: CSCwb29140 Threads getting exhaust post moving to latest patches were nss rpm is updated (Only 3.0p5&2.7p7,3.1P1).
Symptom: " ... you start having performance issues on HTTPS Portals like: Guest Portal, Posture Portal. Example of this are: Portal takes more time than usual to load. Another symptom is Application Server going to initializing state ... "
Known Fixed Releases: " ... 2.6 P11, 2.7 P8, 3.0 P6, 3.1 P3, 3.2  ... "
Before the Fixed Release you are able to check the issue at Operations > Reports > Reports > Audit > Operations Audit, take a look at Request = Maximum Resource Limit Reached.
IMPORTANT: the HP-2.7P7-CSCwb29140 Hot Patch solves the initializing state issue, but not the Maximum Resource Limit Reached issue.
This is another one I'd like to explain in more detail, as it's one of the highest number of Support Cases.
Take a look at: CSCwd45843 Auth Step latency for policy evaluation due to Garbage Collection activity.
Symptom: " ... include, but are no limited to the following:
- Authentication Step Latency on multiple Policy evaluation steps.
- High Average Request Latency during times of peak load.
- Authentication Request Latency does not recover until after the System is reloaded.
- Profiler Queue Full Alarms ... "
Known Fixed Releases: " ... 2.7 P8, 3.0 P7, 3.1 P5, 3.2 P1  ... "
The ONLY workaround without patching or upgrading the System is to reduce the traffic volume being sent to individual Nodes or spread the traffic across additional ISE Nodes. A reboot of the System may temporarily alleviate the issue but it may come back within minutes after reloading.
Take a look at: CSCwf38076 Block Posture port TCP/8905 and CPP port (default is TCP/8443) for Non-Postured connections .
Symptom: " ...
- "Maximum resource limit reached" reports for "Portal service thread pool reached threshold value" events. Navigate to Operations > Reports > Reports > Audit > Operations Audit to see the reports.Known Affected Releases: " ... 2.7, 3.0, 3.1 and 3.2 ... "
The ONLY workaround " ... block traffic from the Endpoints to all ISE PSNs for TCP/8905 and Client Provisioning Portal port (default is TCP/8443). This action is recommended for Posture with redirection and redirectionless. ...".
This is a very important FN, please take a look at:
This issue starts with ISE 2.7 P2 and reloading the system may temporarily alleviate the issue ... upgrading ISE is recommended !!!
Solve or minimize the Slow Replication issue is definitely far from easy : )
Work with TAC is highly recommended and gathering all of this troubleshooting information during the Slow Replication issue is a MUST !!!
I hope this document helps you "get through the dark times" of Slow Replication !!! : )
Hi @ajc ,
1st. use the 12 pCPU x 2.6 GHz = 31.2 GHz
2nd. since you are having a significant Slow Replication, it's better to "simulate" a SNS-3595:
What is your ISE version/patch?
 
					
				
		
I am still running ISE 2.7 patch 7 version. thanks for the advice.
Hi @ajc ,
since you are running ISE 2.7 P7, I STRONGLY recommended ISE 2.7 P8+, please take a look at CSCwb29140 Threads Getting Exhaust, this update is one of the most important steps that solves one of my Customer huge Slow Replication (400,000+ Pending Messages).
 
					
				
		
Hi Marcelo, just to let you know, I talked to Cisco TAC recently about something else and they told me that Patch 8 had lot of issues, it was removed and replaced by patch 9. See attached screenshot I just took from Cisco Website, it is not available anymore. thanks
Hi @ajc ,
ISE 2.7 P8 is a very good Patch, but you MUST pay attention to the following, please take a look at ISE Release Notes 2.7:
 
In other words, in my Case, I ONLY have to uninstall the Hotpatch CSCwa47133 - ISE Evaluation log4j CVE-2021-44228, (I updated ISE from P7 to P8 ... P9 was not available).
Note 1: ISE 2.7 P8 is a Deferred Release because of:
Note 2: that is why I STRONGLY recommended ISE 2.7 P8+ (P8 or P9).
Excellent document @Marcelo Morais @! Helped a lot with my slow replication problems!
Hi @Robson Ribeiro ,
thanks a lot my friend !!!
Find answers to your questions by entering keywords or phrases in the Search bar above. New here? Use these resources to familiarize yourself with the community: