ISE 2.6 PSN replication performance monitoring

jmanzell@cisco.com
Cisco Employee

Hello team. I have a customer with a distributed ISE 2.6 deployment. The PANs/MnTs are located in the Chicagoland area, with remote PSNs in London and Singapore. The Singapore PSN RTT is ~260 ms. How can they best monitor replication performance for the international PSNs? They see an occasional log message about replication taking a long time, and they are getting ready to go live with 802.1X wired/wireless, so they want to make sure they can monitor this situation. Any advice is greatly appreciated!


2 Replies

Greg Gibbs
Cisco Employee

The main alerting mechanisms in ISE are the Alarm Settings, where you can set up SMTP notifications, and Syslog, where you can send the alerts to an external SIEM and have it do the correlation and notification.

There are alarms for replication-related events, such as Slow Replication Info/Warning/Error, available in the Alarm Settings. These alarms can be sent to an admin or a mailer via SMTP.

You could also configure an external Syslog server with the 'Include Alarms For This Target' option enabled under External Logging Targets. You would then need to ensure the necessary Logging Category (in this case, Administrative and Operational Audit) is enabled for that target.
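
If the customer goes the Syslog route, a small script on the collector/SIEM side can watch for the slow-replication alarm names and raise a notification. This is a minimal sketch only, assuming the collector writes the forwarded ISE messages to a flat file and that the alarm text contains the strings shown below; the file path and notification hook are placeholders to adapt.

```python
#!/usr/bin/env python3
"""Tail a syslog file for ISE slow-replication alarms and notify.

Sketch only: the log path, alarm strings, and notification hook are
assumptions; adjust them to the actual forwarded ISE alarm format.
"""
import time

LOG_FILE = "/var/log/ise/alarms.log"        # hypothetical collector file
ALARM_LEVELS = {
    "Slow Replication Info": "info",
    "Slow Replication Warning": "warning",
    "Slow Replication Error": "error",
}

def notify(level: str, line: str) -> None:
    # Placeholder: swap in SMTP, a Splunk HEC post, or a ticketing call here.
    print(f"[{level.upper()}] {line.strip()}")

def follow(path: str):
    """Yield new lines appended to the file (a simple tail -f)."""
    with open(path, "r") as handle:
        handle.seek(0, 2)                    # start at end of file
        while True:
            line = handle.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

if __name__ == "__main__":
    for entry in follow(LOG_FILE):
        for alarm, level in ALARM_LEVELS.items():
            if alarm in entry:
                notify(level, entry)
```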

 

Outside of ISE, the customer could also monitor the network RTT between Chicago and remote sites using IP SLA or something similar. They should also ensure that RADIUS is prioritised over best effort traffic in their end-to-end QoS policy.
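
For the RTT piece, even a simple scheduled ping from the data centre toward each remote PSN will baseline latency and flag drift over time. A rough sketch of that idea, assuming ICMP is permitted and using placeholder PSN hostnames and per-site thresholds:

```python
#!/usr/bin/env python3
"""Periodically measure RTT to remote PSNs and warn when it drifts.

Sketch only: hostnames and thresholds are placeholders; ICMP must be
permitted from wherever this runs.
"""
import re
import subprocess

# Hypothetical PSN addresses and per-site RTT budgets (ms).
PSNS = {
    "psn-london.example.com": 120,
    "psn-singapore.example.com": 300,
}

def avg_rtt_ms(host: str, count: int = 5):
    """Return the average RTT reported by the system ping, or None on failure."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    # Linux ping summary: rtt min/avg/max/mdev = 1.1/1.2/1.3/0.1 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    for host, budget in PSNS.items():
        rtt = avg_rtt_ms(host)
        if rtt is None:
            print(f"{host}: unreachable")
        elif rtt > budget:
            print(f"{host}: RTT {rtt:.0f} ms exceeds budget {budget} ms")
        else:
            print(f"{host}: RTT {rtt:.0f} ms OK")
```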

Damien Miller
VIP Alumni
Accepted Solution
This is an issue I've seen customers struggle with over the years as well. I would suggest working with TAC on this one rather than just monitoring and alerting on it. Many of the past issues I've run into with this revolve around tuning and cleaning up the deployment, although those have also been large customers. Examples include aggressive SNMP query timers generating heavy profiling traffic, or misbehaving endpoints/NADs. Since it sounds like this deployment might not be under much load yet, these are of particular concern: adding load won't make it any better, and tuning is typically the fix when the problem creeps up later as a deployment expands.

From an alerting standpoint it's certainly good to know when it's happening, but there is usually little immediate resolution other than resyncing a node or waiting for replication traffic to drop, so finding a long-term fix should be high on the list. The three replication alarms (Slow Replication Info, Warning, and Error) cascade based on the severity of the situation. They can be configured for email alerting, or flagged in an external syslog destination such as Splunk. If the replication backlog hits 1,000,000 messages, the node will be disconnected and will require a manual resync.
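
To build on that, if the alarms are being picked up by a SIEM or a script rather than by ISE's own email notifications, the cascade lends itself to a simple escalation policy: note the Info alarm, mail on Warning, and page or open a ticket on Error, well before the backlog gets anywhere near the disconnect threshold. A minimal sketch of that idea, with the relay host, addresses, and alarm handling all as placeholder assumptions:

```python
#!/usr/bin/env python3
"""Escalate ISE slow-replication alarms by severity.

Sketch only: the relay host, addresses, and paging hook are placeholders;
ISE can also send these emails natively via its Alarm Settings.
"""
import smtplib
from email.message import EmailMessage

SMTP_RELAY = "mail.example.com"          # hypothetical relay
ALERT_FROM = "ise-alerts@example.com"
ALERT_TO = "netops@example.com"

def send_mail(subject: str, body: str) -> None:
    msg = EmailMessage()
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_RELAY) as relay:
        relay.send_message(msg)

def handle_alarm(alarm_name: str, detail: str) -> None:
    """Apply the cascading policy: Info -> log, Warning -> mail, Error -> escalate."""
    if "Slow Replication Error" in alarm_name:
        send_mail("ISE replication ERROR", detail)
        # Placeholder: also trigger paging/ticketing here.
    elif "Slow Replication Warning" in alarm_name:
        send_mail("ISE replication warning", detail)
    else:
        print(f"info: {detail}")

if __name__ == "__main__":
    handle_alarm("Slow Replication Warning", "PSN sg-psn-01 replication lagging")
```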

As a side note, it might also be worth exploring moving the primary PAN/MnT to London. While it would certainly be a significant amount of work, it's entirely possible that the RTT for London <> Singapore could be up to 100 ms less than Chicago <> Singapore. A lot of this depends on the customer's providers, but I've seen environments where this would be ideal. London to Chicago is relatively low latency in this context, so it's worth looking into. Some older Cisco on Cisco sessions from Cisco Live describe a very similar situation with the Cisco internal ISE deployment.