
 


 

Introduction

This document briefly introduces the meaning of Queue Link Error, how to deal with it, and its impact on ISE Deployment.

 

What is the Queue Link / Queue Link Error ?

Since ISE 2.6, the ISE RabbitMQ Container has been called the ISE Messaging Service (a Message Broker Container that runs on Docker). The ISE Messaging Service is started on each ISE Node and is used to exchange information between Nodes (via TLS, using a Certificate issued by ISE's Internal CA). A Queue Link is the connection between these Nodes, and a Queue Link Error means that something went wrong with that connection !!!

This Alarm is expected while you perform Deployment operations such as: registering a Node to the Deployment, manually syncing a Node from the PPAN, a Node being in an out-of-sync state, a Node's Application Service restarting, changing the Domain Name or Hostname of your PAN/PSN, restoring a Backup on a New Deployment, or promoting the Old PPAN to New PPAN post upgrade.

You can check the state of the ISE Messaging Service via ISE CLI using the following command:

ise/admin# show application status ise

ISE PROCESS NAME STATE PROCESS ID
-----------------------------------------------------
Database Listener running 15215
Database Server running 131 PROCESSES
Application Server running 27711
...
Docker Daemon running 16843
TC-NAC Service disabled
...
ISE Messaging Service running 43944
Segmentation Policy Service disabled
SSE Connector disabled
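
If you save the output of this command to a text file, a short script can confirm the state of the ISE Messaging Service for you. The following is only a minimal Python sketch: the function names are hypothetical, and the state keywords other than "running" and "disabled" (the two that appear in the output above) are assumptions.

```python
import re

# "running" and "disabled" appear in the sample output above.
# "not running" and "initializing" are assumed additional states.
_STATES = r"not running|running|disabled|initializing"
_ROW = re.compile(rf"^(?P<name>.+?)\s+(?P<state>{_STATES})(?:\s+(?P<detail>.*))?$")


def parse_app_status(text):
    """Return {process name: state} parsed from 'show application status ise' output."""
    states = {}
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            states[match.group("name")] = match.group("state")
    return states


def messaging_service_running(text):
    """True only if the 'ISE Messaging Service' row is present and running."""
    return parse_app_status(text).get("ISE Messaging Service") == "running"
```

Header, separator and "..." lines do not contain a state keyword, so they are skipped rather than parsed.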

IMPORTANT: " ... the Process Down alarm is no longer triggered when ISE Messaging Service fails on a Node. When ISE Messaging Service fails on a Node, ALL the Syslogs and the Process Down alarm will be lost until the Messaging Service is brought back up on that Node... " (at Cisco ISE 3.1 Maintain and Monitor) !!!

 

Troubleshooting - Queue Link Error

At ISE > Home you may see the Queue Link Error record in the Alarms dashboard:

Alarms Dashboard.png

 

Click the Queue Link Error record to open a detailed description of it:

 

Alarms Queue Link Error.png

 

The description of the Suggested Actions is:

"Please check and restore connectivity between the Nodes.

Ensure that the Nodes and the ISE Messaging Service are up and running.

Ensure that ISE Messaging Service ports are not blocked by Firewall.

Please note that these Alarms could occur between Nodes, when the Nodes are being registered to Deployment or manually-synced from PPAN or when the Nodes are in out-of-sync state or when the Nodes are getting restarted."

 

You can also check this info at Operations > Reports > Reports > Audit > Operations Audit > filtering by:

  • Object  Type = System-Management
  • Requested = The federation link was down or Event Unknown CA

Operations Audit.png

 

Note: the ISE Messaging Service uses port TCP/8671 !!! You can confirm this with the following ISE CLI command:

ise/admin# show ports
...
Process : docker-proxy (43916)
tcp: :::8671
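
To check whether a Firewall is blocking TCP/8671, a plain TCP connection test can help. This is a minimal Python sketch with a hypothetical function name. Keep in mind that it only tests reachability from the machine where the script runs, which is not necessarily the same path that the ISE Nodes use between themselves, and that it does not perform the TLS handshake.

```python
import socket

ISE_MESSAGING_PORT = 8671  # port used by the ISE Messaging Service (see the note above)


def can_reach(host, port=ISE_MESSAGING_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, can_reach("ise-psn01.example.com") (a hypothetical hostname) returns False if the port is filtered, the connection is refused, or the name does not resolve.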

 

Note: TCP/8671 is used by ALL Nodes (PAN, MnT and PSN) for Inter-Node Communication. Please take a look at the following ISE CLI commands:

isePAN/admin# show logging application ise-messaging/rabbit-ise-connection.log
...
2022-11-22 18:47:39.224 [error] <0.7017.0>@rabbit_reader:log_hard_error:785 Error on AMQP connection <0.7017.0> (<PSN IP Addr>:40486 -> 169.254.x.y:5671 - Federation link (upstream: E-Mesh-FOR-<PAN Hostname>:8671-TO-<PSN Hostname>, policy: Policy-FullMesh), vhost: '/', user: 'rabbitmq', state: running), channel 0:
...
isePSN/admin# show logging application ise-messaging/rabbit-ise-connection.log
2022-11-22 18:47:39.201 [warning] <0.3335.1078>@rabbit_reader:log_connection_exception_with_severity:447 closing AMQP connection <0.3335.1078> (<PAN IP Addr>:50533 -> 169.254.x.y:5671 - Federation link (upstream: E-Endpoints-FOR-<PSN Hostname>:8671-TO-<PAN Hostname>, policy: Policy-Endpoints), vhost: '/', user: 'rabbitmq'):
...

 

Examples of Causes that you may see in the Queue Link Error message:

  • Cause=basic_cancel
  • Cause=Timeout
  • Cause=Econnrefused
  • Cause={tls_alert;"handshake failure"}
  • Cause={tls_alert;"unknown Ca"}

Note: it's also possible to see the Queue Link Error message via ISE CLI with the following command:

ise/admin# show logging application ise-messaging/rabbit-ise-federation.log
...
2022-06-05 02:55:04.776 [warning] <0.446.0>@rabbit_federation_link_util:log:283 Federation exchange 'E-Mesh' in vhost '/' did not connect to exchange 'E-Mesh' in vhost '/' on amqps://<Node IP Addr>:8671
{error,{tls_alert,"unknown ca"}}
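
If the federation log is long, a script can list the failed links and their errors. The following is a minimal Python sketch that assumes the log has been saved as text and that entries follow the two-line layout shown above (a "did not connect" line followed by an {error,...} line). The function name is hypothetical.

```python
import re

# First line of an entry: timestamp, level, exchange name and upstream URI.
_LINK = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[(?P<level>\w+)\] .*?"
    r"Federation exchange '(?P<exchange>[^']+)' .*?did not connect to .*? on "
    r"(?P<uri>amqps?://\S+)"
)
# Second line of an entry: the Erlang error tuple, e.g. {error,{tls_alert,"unknown ca"}}
_ERROR = re.compile(r"^\{error,(?P<reason>.*)\}$")


def failed_links(text):
    """Return one dict per failed federation link: ts, level, exchange, uri, reason."""
    results = []
    for raw in text.splitlines():
        line = raw.strip()
        link = _LINK.match(line)
        if link:
            results.append({**link.groupdict(), "reason": None})
            continue
        error = _ERROR.match(line)
        if error and results and results[-1]["reason"] is None:
            results[-1]["reason"] = error.group("reason")
    return results
```

Grouping the results by uri and reason quickly shows whether the problem affects a single Node or the whole Deployment.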

 

Cause=basic_cancel

This error usually shows up after changing the hostname of ISE Nodes or after a Node Promotion (residual links from the Old PAN to the rest of the Deployment are sometimes not cleaned up on promotion) !!!

Please take a look at:

 

Cause=Timeout

This Alarm causes no functional impact. It is potentially caused by Network Congestion, at the time of the Alarms, between the Nodes the Alarm is triggered for ... these Alarms can be ignored !!!

Please take a look at:

Also double check the TCP 8671 flow between the Nodes, via ISE CLI:

ise/admin# tech dumptcp 0 | inc 8671
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10.10.10.1.34190 > 10.10.10.2.8671: Flags [S], cksum 0x9544 (correct), seq 113726506, win 29200, options [mss 1460,sackOK,TS val 2259048610 ecr 0,nop,wscale 7], length 0
--
10.10.10.2.8671 > 10.10.10.1.34190: Flags [S.], cksum 0x5f32 (incorrect -> 0xf139), seq 1497652336, ack 113726507, win 28960, options [mss 1460,sackOK,TS val 2258331801 ecr 2259048610,nop,wscale 7], length 0
--
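
In a capture like this one, a SYN (Flags [S]) that is answered by a SYN-ACK (Flags [S.]) in the opposite direction means the TCP handshake on 8671 is getting through. The following minimal Python sketch (hypothetical function name, assuming the capture has been saved as text in the tcpdump format shown above) lists the SYNs that never received a SYN-ACK, which may point to a filtered or dropped flow.

```python
import re

_PACKET = re.compile(
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): Flags \[(?P<flags>[^\]]+)\]"
)


def unanswered_syns(text):
    """Return sorted ((src_ip, src_port), (dst_ip, dst_port)) pairs with no SYN-ACK seen."""
    pending = set()
    for line in text.splitlines():
        m = _PACKET.search(line)
        if not m:
            continue
        src = (m["src"], m["sport"])
        dst = (m["dst"], m["dport"])
        if m["flags"] == "S":
            pending.add((src, dst))       # connection attempt
        elif m["flags"] == "S.":
            pending.discard((dst, src))   # reply to an earlier SYN
    return sorted(pending)
```

An empty list only says that every SYN in the capture was answered; it says nothing about the TLS handshake that follows.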

 

Cause=Econnrefused

This Alarm is considered cosmetic. It is potentially caused by a period of time during which a Node cannot be connected to and the connection is refused !!!

Please take a look at:

 

Cause={tls_alert;"handshake failure"}

There are several possible reasons for the error, but most often it is due to a problem with the Certificate Chain used by the ISE Messaging Service. Please take a look at Generate Signing Requests (CSR) below.

 

Cause={tls_alert;"unknown Ca"}

There are several possible reasons for the error:

1. when Dedicated MnT option is selected (at Administration > System > Deployment > select the MnT Node and check Dedicated MnT) ... take a look at:

2. when utilizing Third-Party signed certificate ... take a look at:

3. but most often it is due to a problem with the Certificate Chain used by the ISE Messaging Service. Please take a look at Generate Signing Requests (CSR) below.
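
The cause-by-cause notes above can be summarized as a small lookup. This is only a minimal Python sketch: the function names are hypothetical, the advice strings simply restate this document, and the "Message=...; Cause=...; Action=..." layout of the Alarm detail is assumed from the Alarm text quoted in the comments at the end of this page.

```python
import re

# Keys are matched case-insensitively as substrings of the Cause value.
GUIDANCE = {
    "basic_cancel": "Usually follows a hostname change or a Node Promotion.",
    "timeout": "No functional impact; potentially Network Congestion. Check the TCP 8671 flow.",
    "econnrefused": "Considered cosmetic; the Node refused the connection for a period of time.",
    "handshake failure": "Most often the Certificate Chain used by the ISE Messaging Service.",
    "unknown ca": "Check the Dedicated MnT option, Third-Party signed certificates, "
                  "and the Certificate Chain used by the ISE Messaging Service.",
}


def extract_cause(detail):
    """Pull the Cause value out of an Alarm detail string, or return None."""
    # A braced cause such as {tls_alert;"unknown Ca"} contains a ';' itself,
    # so the braced form is tried before the plain 'up to the next ;' form.
    match = re.search(r"Cause=(\{[^}]*\}|[^;]+)", detail)
    return match.group(1).strip() if match else None


def triage(cause):
    """Map a Cause value to the matching note from this document."""
    lowered = cause.lower()
    for key, advice in GUIDANCE.items():
        if key in lowered:
            return advice
    return "Not covered by the causes listed in this document."
```

Any cause that is not listed here (for example Cause=Closed) falls through to the last message.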

 

Generate Signing Requests (CSR)

1st, double check that the Certificate Authority is enabled, at Administration > System > Certificates > Certificate Authority > Internal CA Settings > if you see the Disable Certificate Authority option, then the CA is currently enabled !!!

Internal CA Settings.png

 

2nd, if the problem is limited to the ISE Messaging Service of particular Node(s), go to Administration > System > Certificates > Certificate Management > Certificate Signing Requests > click Generate Certificate Signing Requests (CSR):
CSR.png

 

Select ISE Messaging Service in Usage, select the Node(s) whose Certificate you want to reissue, and click Generate ISE Messaging Service Certificate:
ISE Messaging Service.png

 

The following message appears:

ISE Messaging Service Messaging.png

 

IMPORTANT: during the generation of the ISE Messaging Service Certificate there is NO reboot or Deployment break, only the ISE Messaging Service is re-initialized !!!

 

3rd, if the problem is not limited to the ISE Messaging Service of particular Node(s), you may need to replace the entire Chain of Internal CAs, at Administration > System > Certificates > Certificate Management > Certificate Signing Requests > click Generate Certificate Signing Requests (CSR) > select ISE Root CA in Usage and click Replace ISE Root CA Certification Chain:

ISE Root CA.png

 

The following message appears:
ISE Roota CA Messaging.png

 

IMPORTANT 1: when you replace the Cisco ISE Root CA chain, the Cisco ISE Messaging Service Certificate is also replaced. This is followed by the restart of the Cisco ISE Messaging Service with a downtime of about 2 minutes. During the replacement of the ISE Root CA there is NO reboot or Deployment break !!!

IMPORTANT 2: to avoid losing Syslogs during the downtime, temporarily stop sending them through the Cisco ISE Messaging Service (at Administration > System > Logging > Log Settings > uncheck Use "ISE Messaging Service" for UDP Syslogs delivery to MnT).

ISE Messaging Settings.png

 

IMPORTANT 3: if you notice Slow Replication on Secondary Nodes after re-generating the Root CA, re-registering the Secondary Node will fix the issue, please take a look at:

IMPORTANT 4: you are unable to regenerate the Root CA if the Essentials License is disabled on the License Page (at Administration > System > License), please take a look at:

 

4th, at Administration > System > Certificates > Certificate Authority > Certificate Authority Certificates, Delete & Revoke the old Certificates.

IMPORTANT 1: when you re-generate the Internal CA Root Chain, ISE does not delete the Old One automatically. As long as ISE retains the Old Root Chain, it will Trust Certificates presented by the Endpoints with Identity Certificates signed by that Chain (if that is the case) !!!

IMPORTANT 2: deleting the Old Internal Certificates is an important step to prevent some bugs where 200+ Internal Certificates on the PPAN cause Slow UI, Slow Replication and High CPU/Load, please take a look at:

CA Certificates.png

 

Other Causes

Other causes of Queue Link Error:

Please take a look at:

 

Effect of Queue Link Error !!!

The Queue Link Error may not be harmful, depending on how the ISE Messaging Service is being used. Let's take a look at some examples of where the ISE Messaging Service is used.

 

ISE Messaging Settings

If there is a problem with the Queue Link and with log transfer from the PSN to the MnT:

  • Live Logs are not displayed
  • Report is not displayed
  • Dashboard System Summary is not displayed

In addition to the above measures, this may be recovered by unchecking Use "ISE Messaging Service" for UDP Syslogs delivery to MnT (enabled by default since ISE 2.6 P2) and so switching to a log transfer that does not go through the Messaging Service, at Administration > System > Logging > Log Settings > ISE Messaging Settings:

ISE Messaging Settings.png

 

Please take a look at:

 

Light Data Distribution (LDD)

This setting is at Administration > System > Settings > Light Data Distribution. It was initially called Light Session Directory (LSD), but it was renamed because functions were added and because of the abbreviation.
LDD is used to store User Session Information and replicate it across the PSNs in a Deployment, thereby eliminating the need to depend on the PAN or MnT Nodes for User Session details. In case of connectivity issues between the PSNs, for example, when a PSN is down, the Session Details are retrieved from the MnT Session Directory and stored for future use.
LDD uses the Cisco ISE Messaging Service (which uses a Certificate signed by the Internal-CA Chain) for Inter-Node Communication.
If a function that relies on exchanging information between Nodes has a problem, one thing worth checking is whether there is a Queue Link Error Alarm.
 
 
Hope this helps !!!
Comments
marco.merlo
Level 1

Nice Document.

This morning I upgraded our deployment from 3.1 to 3.2. The upgrade procedure reported complete success, but after some hours "Queue Link Error" messages appeared on the dashboard, and indeed at least one of our PSNs seems not to be sending auth logs.

Unfortunately, from the messages it seems that all nodes are affected and, worst of all, the device certificates of secondary nodes are not browsable from the PPAN. Now I am afraid I have to regenerate all MSG service certificates for our 8 nodes. All the certificates are signed by our corporate CA. The more I struggle with ISE in a complex environment, the more I am convinced that the only working use case is a two-node deployment, at least after the 2.4 release.

Regards

Marco

EDIT: regenerating the CA and ISE messaging certificates using "self signed" ones stopped the error messages. In order to see the secondary nodes' certificates from the PPAN I had to promote the secondary and then promote back the original PPAN (I suppose that the restart did the trick). Unfortunately there are still no logs in live logs or reports: it seems that only accounting events are logged.

Hi @marco.merlo ,

 thanks !!!

 You upgraded from 3.1 to 3.2, which Patch (3.1 and 3.2) ?

 So you have 8 Nodes (2x PAN, 2x MnT and 4x PSNs), correct ?

 You said the Queue Link Error appeared after a few hours, what is the cause of the Queue Link Error (for ex.: Cause={tls_alert;"unknown Ca"}) ?

Best regards

Carl King
Level 1
Hi Marco,

The upgrade was from 3.1 latest patch at the time, to 3.2 including the latest patch at the time.

There were actually 12 nodes in this deployment, and I’m not sure why the queue-link error started. The fix is fairly simple (replace certificates for internode messaging). I found that the queue-link errors don’t go away immediately. Returning to check on it the next day revealed they were gone though.
ryan14
Level 1

TAC referred me to this post. Anyways thanks, it worked.

Hi @ryan14 ,

 it is a pleasure to help !!!

ce1
Level 1

Does the MnT node's internal CA have to be in the enabled state for ISE messaging to work?

Hi @ce1 ,

 even though the CA Responder Status of PAN and MnT is disabled (at Administration > System > Certificates > Certificate Authority > Internal CA Settings)

Internal CA Settings 01.png

 the ISE Messaging Service of both (PAN and MnT) is in the running state:

ise/admin# show application status ise
ISE PROCESS NAME STATE PROCESS ID
--------------------------------------------------------------------
...
ISE Messaging Service running 19147
...

Note: are you having any issue on the ISE Messaging Service ?

Hope this helps !!!

 

ce1
Level 1

@Marcelo Morais 

I have an issue with the ISE messaging service. It shows a queue link error with unknown CA. I've tried to regenerate the root and ISE messaging service certs but no luck. So I'm wondering if the cert chain is still broken somehow.

Hi @ce1 ,

 is it an issue with just one Node ? Remember that you can not only double check this via:

ise/admin# show logging application ise-messaging/rabbit-ise-federation.log

but also check when the issue started via Operations > Reports > Reports > Audit > Operations Audit > filtering by:

  • Object  Type = System-Management
  • Requested = The federation link was down or Event Unknown CA

Hope this helps !!!

thanks!!

CSCO12231061
Level 1

Hello,

has anyone ever faced queue link error with cause=closed (please see alarm below) on ISE 3.2 patch 6?

Any clue what could be a trigger and a possible fix ?

============================

Alarm Name :

Queue Link Error

Details :

Queue Link Error: Message=Network Issue From node A To node B; Cause=Closed; Action=Check if IP of node B is reachable from node A on port 8671

Description :

The queue link between two nodes in the ISE deployment is down.

Severity : Critical

Suggested Actions :

Please check and restore connectivity between the nodes. Ensure that the nodes and the ISE Messaging Service are up and running. Ensure that ISE Messaging Service ports are not blocked by firewall. Please note that these alarms could occur between nodes, when the nodes are being registered to deployment or manually-synced from PPAN or when the nodes are in out-of-sync state or when the nodes are getting restarted.

==========

This alarm appears randomly then after a while it clears out without any intervention.

=========

Could this be a networking issue or some bug / ISE messaging service issue ?

rgds,Ivan

Carl King
Level 1
I don’t know that it’s a bug, or at least one hasn’t been identified that I know of. The workaround has always been to regenerate the certificates for messaging and watch the paint dry. I did sit in on a TAC case a couple years ago for a 3.1 installation where the TAC had to do some edits for the files pertaining to the RabbitMQ process (root access only), as well as generate new certs. Seemed to fix it. Anytime I’ve seen this it’s been a product of an upgrade. I haven’t seen a **bleep** installation do it.