01-09-2026 05:15 AM
Morning Cisco Forums,
I'm looking for some guidance on the process of replacing the ISE administration nodes in a 4-node cluster. In our current environment we're running ISE version 3.4p4, with a primary and secondary administration node and two Policy Service Nodes.
After upgrading from 3.3p8 to 3.4p4 (GUI method, not rebuild/restore), we're running into a bunch of problems, specifically on the administration nodes. Disk space (/opt) is filling up within two weeks, going from 20% utilization to over 90%, which forces an M&T reset bi-weekly. The M&T database shows very low utilization even though the /opt directory is filling up, and services like Logstash and Elasticsearch are crashing periodically. We've engaged TAC, and although they were helpful in getting disk space restored by clearing M&T data, they haven't been able to figure out what's going on behind the scenes. They've checked the Linux subsystems for old logs that may be causing the issues, etc., but haven't found anything.
My thought process is to build a new 3.4p4 PAN from scratch, de-register our secondary PAN, and register the new node to the environment in its place. Let the data sync, then do the same for our primary. I've run into conflicting information on the forums as to whether or not I should restore my configuration backup to the new node prior to joining it to the environment, so I'm just looking for some clarification on the "right" way to replace the two administration nodes.
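For context, my rough plan was to take a fresh configuration backup from the CLI before touching either node, along these lines (the repository name and encryption key below are placeholders, not our actual values), and then do the actual registration from the primary's GUI under Administration > System > Deployment:
ise/admin# show repository
ise/admin# backup PreSwapConfig repository MyFTPRepo ise-config encryption-key plain <key>
The part I'm unsure about is whether that backup should ever be restored onto the fresh node before registering it, or whether registration alone pulls down everything it needs.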
As always, any help is appreciated!
01-09-2026 06:19 AM
If you're also seeing crashes with Logstash and Elasticsearch, could you be logging too much data? What are your session re-authentication and accounting update intervals? I mostly use session re-authentications of 8 hours, and the same for accounting updates. I saw a post here this week where someone had accounting updates set to every 5 minutes, which is 12/hour * 24 hours = 288 updates a day per endpoint. The point is that if you have low session re-auth timers and/or low accounting timers, you can overload the logging systems. Do that for 250k endpoints and you'd find you're logging roughly 72 million log lines a day just for one of those (session or accounting) timers.
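For what it's worth, on the switch side these are the knobs I mean, as a rough illustration only (classic IOS-style commands; the interface name and values are just examples, and IBNS 2.0 deployments set the re-authentication timer under the service policy instead):
! send interim accounting updates every 8 hours (480 minutes)
aaa accounting update newinfo periodic 480
!
! re-authenticate each session every 8 hours (28800 seconds)
interface GigabitEthernet1/0/1
 authentication timer reauthenticate 28800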
Regards,
David
01-09-2026 06:26 AM
Thanks for the reply!
Session re-authentication timers are currently 1 hour, and accounting updates are set with: aaa accounting update newinfo periodic 2880 (i.e., every 48 hours).
It's not a large deployment; we're talking less than 1K endpoints. Also, just to be clear, this was not happening prior to the 3.4p4 upgrade in our environment. It's only an issue post-patch/upgrade.
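Rough math on our volume, assuming a worst case of one re-authentication per endpoint per hour: 1,000 endpoints * 24 re-auths per day = about 24,000 authentication events a day, plus at most one interim accounting update per endpoint every 48 hours. Unless I'm missing something, that volume alone shouldn't come close to filling /opt in two weeks.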
01-09-2026 11:24 AM
@Inq_J ,
If I understand correctly, everything worked fine in 3.3 P8, and the issues started in 3.4 P4, am I correct ?
What is your Hardware (VM or SNS, 36xx or 37xx model, HD space is: 200 GB, 300 GB, 600 GB or 2TB, etc) ?
In your PPAN/PMnT and SPAN/SMnT, what is the result of the following command ?
ise/admin# tech top
Invoking tech top. Press Control-C to interrupt.
top - 16:20:53 up 44 days, 19:55, 1 user, load average: 1.82, 1.68, 1.81
Tasks: 954 total, 3 running, 951 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.1 us, 2.0 sy, 0.0 ni, 93.7 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 257403.3 total, 110243.4 free, 61461.1 used, 85698.8 buff/cache
MiB Swap: 8001.0 total, 7637.0 free, 364.0 used. 122545.2 avail Mem
...
Note: please take a look at "ISE - What we need to know about SNS / VM".
Hope this helps !
01-09-2026 11:35 AM - edited 01-09-2026 11:43 AM
Thanks for the response, see below answers to your questions:
Yes, we were not seeing any of these /opt disk utilization errors on 3.3P8; it only started once we upgraded the environment to 3.4P4. That said, there are some "ghosts" we're still in the process of identifying (replication issues, service crashes, etc.) that we are attributing to the upgrade as well. At this point I don't know whether replacing the PPAN and SPAN will make a difference, but that was our initial thought process (to see whether we still run into the disk utilization problem).
ISE VM - "Medium" sizing (600GB Disks for PPAN / SPAN).
PPAN:
Invoking tech top. Press Control-C to interrupt.
top - 19:33:30 up 31 days, 17:28, 1 user, load average: 3.75, 4.60, 2.42
Tasks: 716 total, 1 running, 715 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.9 us, 3.1 sy, 0.0 ni, 89.3 id, 0.2 wa, 0.2 hi, 0.2 si, 0.0 st
MiB Mem : 96127.9 total, 21548.1 free, 27647.8 used, 46932.1 buff/cache
MiB Swap: 7999.9 total, 7994.5 free, 5.4 used. 33427.3 avail Mem
SPAN:
Invoking tech top. Press Control-C to interrupt.
top - 19:31:42 up 34 days, 23:30, 1 user, load average: 2.61, 2.04, 1.64
Tasks: 686 total, 1 running, 685 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.3 us, 1.3 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 96127.9 total, 30885.3 free, 35153.8 used, 30088.8 buff/cache
MiB Swap: 7999.9 total, 7992.7 free, 7.2 used. 41894.5 avail Mem
Thanks!
01-10-2026 04:07 AM - edited 01-10-2026 04:07 AM
@Inq_J ,
you have an ISE 3.4P4, Medium Deployment, with 4x Nodes (2x PAN/MnT and 2x PSNs), and installed on a VM compatible with the SNS-3755 (40 vCPUs, 96GB RAM and 600GB HD).
The /opt partition (/dev/sda7) is filling up quickly on both PAN/MnT nodes, am I correct ?
What is the result of the following command ?
ise/admin# show disk
disks
Internal filesystems:
Filesystem Size Used Avail Use% Mounted on
...
/dev/sda7 550G 166G 357G 32% /opt
...
You can start by running the following on one of the two PAN/MnT nodes:
ise/admin# application reset-config ise
Initialize your Application configuration to factory defaults? (y/n): y
Leaving currently connected AD domains if any...
Please rejoin to AD domains from the administrative GUI
Retain existing Application server certificates? (y/n): y
...
If you notice an improvement, you can repeat it on the other PAN/MnT. In this way, we will check whether the problem is solved by redoing the PANs/MnTs.
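Also, before and after the reset, it is useful to see which application log files are actually growing under /opt. You can list them from the CLI (output elided here; the file names and sizes will differ on your nodes):
ise/admin# show logging application
...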
Hope this helps !
01-12-2026 05:09 AM
Thank you very much for the recommendation. I tried this over the weekend, and I'm still seeing the /opt partition filling up relatively quickly (it went from 13% to 24% in a matter of 24 hours). The other post-upgrade "ghosts" I'm seeing are errors like Cannot find device "podman2" when stopping/starting services, and services (Logstash/Elasticsearch) are still crashing every few hours.
I'm going to work on deploying a fresh PPAN/SPAN today to see if some of the issues we're seeing go away, or if they persist.
Much appreciated!
01-12-2026 06:25 AM
I would recommend opening a TAC case. You may be hitting this bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCws61409
01-12-2026 06:34 AM
Marvin,
Thanks! I'll read through the bug, and pass it along to our TAC engineer. We already have a case opened.
Much appreciated!