Re: ISE 3.2 slowness

ammahend · ‎02-17-2024

Hi members, I am seeing the following issues on ISE after upgrading to 3.2 patch 4:

Sometimes the guest page won't load and shows an "error loading page" message; at other times it works fine.
A lot of authentication requests are going to the secondary node, which was not the case before. It looks like there are a lot of RADIUS timeouts on the primary.
There is general sluggishness in both the GUI and the CLI.

This is a 2-node deployment with 3715s, and we are only handling around 7-8K active sessions. The only feature enabled in addition to the upgrade is log analysis, but my understanding was that it should not be resource intensive. I am seeing 90%-plus RAM usage and 100% CPU spikes from time to time, almost every hour or so. I understand that RAM usage under 90% is expected, but the CPU spikes and 90%-plus memory usage don't make much sense. As a test I have disabled log analysis to monitor the behavior, since it was the only feature added. I have also opened a case with Cisco. I wanted to pick your brains and see if anyone has any advice or input.

I will upgrade to patch 5 probably next weekend.

-hope this helps-

Arne Bier · ‎02-19-2024

Hi @ammahend

I don't have an SNS to compare this to. I have one customer with a few SNS-3615 servers currently running ISE 3.1 that I was planning to upgrade to ISE 3.2 soon.

Have you logged into the CIMC to have a look around for any hardware events that might be causing slowness? It's worth checking (although unlikely that a patch would affect the hardware).

Do you see any clues in the Dashboard Alarm panel?

I think you've done the right thing to get TAC involved. They should be able to point to the cause (checking the process table to see what's hogging the CPU).

ammahend · ‎02-26-2024

Hi Arne, no hardware events. I am waiting for Cisco to review the support bundle, but we did make some changes, after which we have not seen a high load average:

We disabled log analysis, deactivated any unused probes, and disabled the Profiler Forwarder Persistence queue. I will post when I hear back from Cisco.

-hope this helps-

alexhilton · ‎02-23-2024

Hi - I have exactly the same problem on ISE 3.2 Patch 4. I have two Admin nodes and 3 PSNs across multiple sites. A safe reboot of the Admin nodes helps for a short time, and then the slowness returns. Getting TACACS/RADIUS logs takes an age. Patch 4 was great initially, as it resolved a lot of our issues. I was looking to upgrade to Patch 5, so let me know how you get on with it and I will probably attempt the same, so long as it does not break anything else.

omehmetoglu · ‎03-13-2024

I too am having the same issue. I upgraded from a perfectly working v3.1 and am now on 3.2 patch 5, and my admin nodes are getting slower and slower to respond. A safe reboot does fix it for a short period, but then it eventually gets slow again.

Will be raising a TAC case for this tomorrow. Does anyone else have an update on this at all?

Jan Junker · ‎05-06-2024

Any news on this issue? I have exactly the same problem. Here it is on 3.2 patch 5.

omehmetoglu · ‎05-06-2024

Hi Jju,

No update from me. I'm certain more customers are experiencing this problem. I currently have an open TAC case with Cisco support and it is being escalated to the BU to investigate. However, they are only looking at it as a single-case issue, whereas I'm advising them that more than one org is having this problem. I never had a performance issue on version 3.1.

Jan Junker · ‎05-06-2024

I have now installed patch 6. The auth latency fell and the CPU load dropped from 50-60% to 3-5%. I am not sure it is due to the patch, but I will monitor it over the next weeks. If it comes back, I will open a TAC case.

omehmetoglu · ‎10-13-2024

I've now had a ticket open with TAC since March 2024. The BU may finally have an outcome, but we are still to test it. I am now on ISE v3.3 Patch 3 and still suffering from the performance issue. After a reload, performance gradually gets worse over a fortnight or so. I will wait for Patch 4, but in the meantime I will work with TAC to disable the swap script.

Here is a quick summary from TAC:

# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.
# Possible reasons:
# The system is out of physical RAM or swap space
# The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (gcTaskThread.cpp:48), pid=2342881, tid=0x00007fb980660b80
#
# JRE version: (8.0_372-b07) (build )
# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: //core or core.2342881
#
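For anyone comparing crash headers like the one above across nodes or support bundles, here is a minimal Python sketch that pulls out the failing source location, the process ID, the JRE version and whether compressed oops were in use. The function name `parse_hs_err_header` and the field names are my own (hypothetical), and the regular expressions are written only against the lines quoted in this post, so other JVM builds may need adjustments.

```python
import re

# A few lines copied from the crash header quoted in this thread.
HS_ERR_SAMPLE = """\
# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.
# Out of Memory Error (gcTaskThread.cpp:48), pid=2342881, tid=0x00007fb980660b80
# JRE version: (8.0_372-b07) (build )
# Java VM: OpenJDK 64-Bit Server VM (25.372-b07 mixed mode linux-amd64 compressed oops)
"""

# Patterns are assumptions based only on the lines shown above.
OOM_RE = re.compile(
    r"Out of Memory Error \((?P<site>[^)]+)\), "
    r"pid=(?P<pid>\d+), tid=(?P<tid>0x[0-9a-fA-F]+)"
)
JRE_RE = re.compile(r"JRE version:\s*\((?P<jre>[^)]+)\)")
VM_RE = re.compile(r"Java VM:\s*(?P<vm>.+?)\s*\((?P<details>[^)]*)\)$")


def parse_hs_err_header(text):
    """Extract a few fields from the '#'-prefixed header of a JVM crash log."""
    info = {}
    for raw in text.splitlines():
        # Drop the leading '#' marker and surrounding whitespace.
        line = raw.lstrip("#").strip()
        if not line:
            continue
        m = OOM_RE.match(line)
        if m:
            info["error_site"] = m.group("site")
            info["pid"] = int(m.group("pid"))
            info["tid"] = m.group("tid")
            continue
        m = JRE_RE.match(line)
        if m:
            info["jre_version"] = m.group("jre")
            continue
        m = VM_RE.match(line)
        if m:
            info["vm_name"] = m.group("vm")
            info["compressed_oops"] = "compressed oops" in m.group("details")
    return info


if __name__ == "__main__":
    print(parse_hs_err_header(HS_ERR_SAMPLE))
```

The failing site here is `gcTaskThread.cpp:48`, which matches the "Cannot create GC thread" message: the JVM could not get native resources for a new thread, rather than running out of Java heap as such.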

Observations/Recommendations:

After a reload, once the free memory drops below 20% of the total available memory, a swap cleanup job gets triggered (CSCwh25160). This causes some of the processes to be killed, which in turn causes cascading issues. We are reverting this change in upcoming patches.
Most of the blocked threads are blocked waiting for a connection from DB/oracle/timesten. This can happen when the total free memory is very low.
MnT needs more memory, and when it is combined with the Admin role it consumes more memory than the node has.
We need to do one of two things to stabilize the nodes:
a: Increase the resources on both Admin/MnT nodes to 256 GB of memory and 40 cores.

Or

b: Move both MnT personas onto 2 new dedicated MnT nodes with 96 GB RAM and 24 cores.

We also need to disable the swap script, as it causes more issues when memory on the system is low.
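To make the 20% figure in the TAC note concrete, here is a small Python sketch of the arithmetic only. It is not the actual ISE swap cleanup script, which I have not seen; the function names and the example memory sizes are hypothetical, and the only value taken from the note is the "below 20% of total available memory" threshold.

```python
# Threshold quoted in the TAC note: cleanup triggers when free memory
# drops below 20% of total memory. Everything else here is illustrative.
SWAP_CLEANUP_THRESHOLD = 0.20


def free_fraction(total_gb, free_gb):
    """Return free memory as a fraction of total memory."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return free_gb / total_gb


def would_trigger_cleanup(total_gb, free_gb, threshold=SWAP_CLEANUP_THRESHOLD):
    """True when free memory is strictly below the threshold fraction."""
    return free_fraction(total_gb, free_gb) < threshold


if __name__ == "__main__":
    # Hypothetical node with 100 GB of RAM at three different usage levels.
    for free_gb in (25, 20, 10):
        print(free_gb, would_trigger_cleanup(100, free_gb))
```

If the job really works as described, a node sitting at 90%-plus RAM usage (as reported at the top of this thread) has under 10% free and would be below the threshold most of the time, which would fit the pattern of slowness returning some time after each reload.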