
Cisco ISE 2.7 Patch 10 - corrupted, issue carried over in backup

AigarsK
Level 1

Hi All,

I am fully aware that 2.7 is EoL, but wanted to seek some guidance.

We have two Cisco ISE nodes running on EoL 3515-K9 appliances. I recall posting some time ago about issues I observed with the deployment when it was still running Patch 7; I believe this was around the time the Log4J vulnerability appeared. Either way, I believe the hot fix for Log4J was installed and the deployment was later patched to version 7.

The issues observed were high resource utilization, which caused high latency and DNAC reporting the ISE server as down every morning at around 03:40, along with replication errors, queue link errors, etc.

At one point this led to the issue I have now observed four times, the latest occurrence happening right now.

ISE is set up as follows: 1x PAN, pMnT, PSN and 1x sPAN, sMnT, PSN and pxGrid. It works fine until one day, when I log on to the Admin page, I get the error message "Oops, Something went wrong, Invalid Request". I am able to bypass this message by clearing the URL in the browser, removing anything past the FQDN.

Once logged in I see the same alerts: Slow Replication Error, High Load Average, Replication Failed, Slow Replication Warning and others. If I try to clear or acknowledge any alert, I get a popup in the bottom right corner saying "Alert(s) could not be acknowledged because you do not have enough permissions or due to an unexpected error", yet if I refresh the Alerts, I see that it has actually been removed.

Next, the deployment eventually states that it needs to be re-synced. If I look in Deployment, it states Not in Sync, and pressing Syncup gives the error "Unable to sync node NODE-NAME. Sync may already be in progress".

A service restart does not help, and a cold boot of both nodes does not either.

The issues are more serious now, as I am no longer able to add new endpoints. It plainly says "Unable to create the endpoint", but if I try to add it again it appears as already created, and if I then look in Context Visibility for the endpoint, it is not displayed. Another issue is that I cannot see either of my superadmin accounts, even though I am able to log in with them.

When this first happened at Patch 7, a case was raised with TAC. They looked at it and said that yes, the issue could be with the Log4J hot fix and that it was corrupting things. They advised me to rebuild and then restore the backup. I did that and patched to version 7, and then they followed up and advised that Patch 8 had been released and should be applied as well. I followed that through, and some 6 months later the same issue happened again. TAC said that this time Patch 8 had been completely recalled and that I should rebuild, go to Patch 9 directly and do a restore. I did this and again it was fine for some time until it happened again, and again now, even with Patch 10.

I have reason to believe that my entire 2-node setup is corrupted beyond repair and doubt that TAC would touch it, but the worst part is that the issue appears to follow the backup. The backup has issues as well: even at the first rebuild, after the restore it appeared that the certs I had exported were somehow stuck in the DB even though they were not visible, and I was not able to import them back. This required TAC to get root access to ISE and delete them from the DB. I cannot do that without TAC's help, and the last time I rebuilt I was lucky that the external cert used for the Guest Portal was due for renewal in a couple of weeks, so I generated a new CSR for it and got a new cert. I had to generate and apply new certs for PSN, Admin and pxGrid, but as they are on the internal CA, it is not much of an issue.

So what are my options? How can I get rid of this issue before I get my new hardware and support? I would need to install ISE 3.x and then restore my old 2.7 deployment, but I do not want the new deployment to inherit the same issues.

How feasible is it to extract every piece of information from the existing ISE, build ISE 2.7 Patch 10 on the side from scratch, and import Endpoints, Endpoint Identity Groups, Policies and NADs?

Worth mentioning that I also need to keep the DNAC integration for SDA.

What would you advise? Please do tell me how much pain I am in for.

7 Replies

marce1000
VIP

 

 - In my opinion, you should build an ISE 3.x deployment from scratch and configure it with all policies and NADs; the big benefit is that you can implement a sanitized policy, or review things that were wrong or not needed.
   An even bigger benefit, in my opinion, is that switchovers instead of upgrades can avoid lots of headaches.
   A few NADs can be pointed to the new ISE nodes first, for instance, to observe correct behavior and avoid the classical-upgrade-stuck-in-step-87-syndrome, ...

 M.



-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
    When the mirror will then always respond to me with ' The only thing that exceeds your brilliance is your beauty! '

Thanks for a reply,

I think the bigger thing in all of this is DNAC: what needs doing so that it does not get disrupted and is able to push all the config back into ISE, that is the SGT matrix, CTS for network devices, etc.?

A lot of pain if you stay on 2.7.  What about your Catalyst Center hardware?  What is it?  What version?  

It is also EoL, as it is Gen1 hardware. All of it is waiting on a few signatures to get Gen3.

I would personally just not touch this deployment at all until you have new Gen3 Catalyst Center hardware and new SNS hardware.  Rebuild both Catalyst Center and ISE on new hardware.

Thanks for a reply.

I am, however, proceeding with a fresh build of Cisco ISE 2.7 P10 and manually recreating all the policies. There were plenty of items I was able to export and import into the new deployment, and the rest were recreated manually. Certificates are added and network devices imported, same for endpoints.

The reason I am doing this is that I am not able to reliably create anything in the current deployment, even policies. If I delete something, it states that there was an error creating or deleting the item, but a refresh of the page shows that the intended change has taken place, and I have no clue whether it will actually propagate or be accessible.

I would love to leave it as it is, but the outlook for getting the new ISE is not so good, as there are still lead times to deal with once the order is placed. I was quoted some 35 days for the 3715-K9 to arrive, and before that, the site where these ISE nodes are hosted is about to undergo black building testing over the last two weekends of November, and the likelihood that this corrupted mess does not start up its services afterwards is quite high.

Just wanted to update,

I did redeploy ISE 2.7 P10 without restoring it from backups. Here is the order of operations I carried out:


• On the network switch where Cisco ISE is attached, I created ACLs and applied them to the ports where the ISE Prod NIC was attached. The ACL is there to prevent the OLD ISE from contacting the new one (I had this some time ago, where I built a new node and, as it was using the same credentials, it was picked up by the old deployment, which synced the corrupt state).

It was also there to prevent NADs from contacting the newly built ISE and hitting default policies, thus causing authentication problems. I also included ACEs to prevent DNA Center from reaching it. The thing to keep in mind is that the ACL needs to be quite specific, as you do not want to block ISE from speaking to DNS, the default gateway, NTP and Internet destinations. I went overboard and did both inbound and outbound ACLs.

• Built the ISE node from USB and applied Patch 10.

• Updated the Profiler Feed (this could sometimes be an issue, as I had seen cases where node profiling databases were not syncing up, and a new node added to the deployment could remain with outdated profiling policies. This, however, could just be my bad luck and the corrupt ISE I had running).

• Imported Certificates and assigned them to the services.
• Exported from the old ISE any Profiling Policies that had been created; look for any with System State = Administrator Created or Administrator Modified. If so configured, importing these will re-create the Endpoint Identity Groups.
• Manually created any missed Endpoint Identity Groups.
• Exported and imported Network Device Profiles, Network Device Groups and Network Devices.
• Configured System Settings to match those of old ISE.
• Joined ISE to AD and matched against what Security Groups should be pulled in.
• Sorted all Identity Source Sequences to match those of old node.
• Exported and Imported User Identity Groups and local ISE Identities.
• Created Policy Authentication Protocols to match the old deployment.
• Recreated Guest Portal and matched all the settings.
• Recreated all DACLs and Authorization Profiles, Library Conditions.
• Recreated all Policies, their AuthC and AuthZ rules.
• Exported from old node and imported into new one all Endpoints.
• Checked that I had enabled ERS and set the pxGrid setting "Automatically approve new certificate-based accounts" to true.
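
A minimal Python sketch of how the isolation ACL from the first step could be generated; it is not the ACL actually used in this deployment. The ACL name, the addresses (documentation ranges) and the function `build_isolation_acl` are hypothetical, and the sketch assumes IPv4 only. It follows a "deny specific sources, permit everything else" shape so that DNS, NTP, the default gateway and Internet destinations stay reachable.

```python
from ipaddress import ip_address, ip_network


def build_isolation_acl(name, new_ise, blocked):
    """Return IOS extended-ACL lines that stop `blocked` sources reaching `new_ise`.

    `blocked` is a list of (label, address-or-prefix) pairs, for example the
    old ISE node, DNA Center and the NAD management subnets (all hypothetical).
    Anything not listed remains permitted.
    """
    target = ip_address(new_ise)  # raises ValueError on a mistyped address
    lines = [f"ip access-list extended {name}"]
    for label, source in blocked:
        net = ip_network(source, strict=False)
        if net.prefixlen == net.max_prefixlen:
            src = f"host {net.network_address}"            # single host
        else:
            src = f"{net.network_address} {net.hostmask}"  # wildcard mask
        lines.append(f" remark block {label}")
        lines.append(f" deny ip {src} host {target}")
    lines.append(" permit ip any any")
    return lines
```

Calling `build_isolation_acl("ISOLATE-NEW-ISE", "192.0.2.10", [("old ISE", "192.0.2.20"), ("NADs", "198.51.100.0/24")])` yields lines for traffic heading towards the new ISE. For the opposite direction, source and destination would need to be mirrored, and whether an ACL can be applied outbound on a switch port depends on the platform.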

Now that ISE was close to its old self, I took the weekend to carry out the reintegration with DNA Center. Here are some of the issues I encountered and managed to resolve.

• I shut down the old ISE node. It had actually died overnight from Friday to Saturday, as its Application Server was stuck in Initializing when I checked it.
• Removed the ACL from the network switch port where the new ISE node was attached.
• I did start to see authentication attempts and some sessions being picked up, but ISE reported that TrustSec was not working correctly and that SGTs were being requested that did not exist in ISE.

NOTE: I did not create any of the SGTs in ISE, as in my deployment they were managed by DNAC.

This is where some of the problems started. DNA Center would of course report that it was able to see one node and not both (DNAC System 360). I hoped to fix this by going into Settings and, under Authentication and Policy Servers, editing the ISE server. I updated the username and password (which were the same as on the old ISE) and the FQDN, and was greeted with the popup that goes through the stages of integration; it promptly displayed everything as done and dusted, with green circles for each step. DNAC, however, would still report pxGrid as Unavailable, and in ISE I did not see any pxGrid client listed that matched DNAC.

I switched to troubleshooting the switch issues relating to CTS and PACs. I re-provisioned all my switches with only the one ISE, aka RADIUS server, hoping that this would resolve the issues I was seeing with TrustSec on the switches, but it did not help. I should have done the following step sooner.

As I was not having much luck getting DNAC to interact with ISE in the expected manner, that is updating PACs for NADs and repopulating TrustSec, I ended up adding a bogus RADIUS server under Authentication and Policy Servers.

This allowed me to go into Design - Network Settings, update the AAA settings and remove any mention of the ISE IP addresses. I had to do this for the Wireless AAA servers as well.

This in turn allowed me to completely remove ISE under Authentication and Policy Servers. I also cleared out any trusted certificates that referenced the ISE deployment (I was not entirely sure why they were there in the first place, as later, when I re-added ISE, there were none to be found).

Now I was able to add ISE to DNAC from scratch. The integration worked, pxGrid was marked as Available, and I was able to see the client in the ISE pxGrid dashboard. I then went back to DNAC and navigated to Policy - Group-Based Access Control, which presented me with integration steps that allowed me to sync the DNAC and ISE SGTs and policy matrix. (Previously this area was read-only and would just say that integration was not implemented, and there was nothing I could do apart from viewing which SGTs and policy matrix were in place.)

Then I reconfigured all my Network Settings and Wireless AAA servers again, removed the bogus RADIUS server I had created earlier, and was able to re-provision all network devices with this one ISE node.

I had to connect to every switch I had, clear the old PACs and refresh the CTS environment data:
clear cts pac all
clear cts environment-data
cts refresh pac
cts refresh environment-data

This was due to the following alert being logged in ISE:
=================================================
Trustsec PAC validation failed

Details :

TrustSec recieved invalid PAC';'ISE identified Trustsec request with invalid PAC. Verify network device uses a pac derived for the current ISE deployment. Try to refresh the PAC on the network device.

=================================================
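
Running the four CTS commands above by hand on every switch gets tedious, so here is a minimal Python sketch of how the loop could be scripted. The `send(host, command)` hook and the function `refresh_cts` are hypothetical; in practice the hook might wrap an SSH library session, and it would also need to cope with any confirmation prompts the platform shows. The sketch stops on the first failing command per switch, so a refresh is never attempted where the clear did not succeed.

```python
from typing import Callable, Dict, Iterable, List

# Command order taken from the post: clear stale state first, then refresh.
PAC_REFRESH_COMMANDS = (
    "clear cts pac all",
    "clear cts environment-data",
    "cts refresh pac",
    "cts refresh environment-data",
)


def refresh_cts(
    hosts: Iterable[str],
    send: Callable[[str, str], str],
) -> Dict[str, List[str]]:
    """Run the PAC / environment-data refresh on each host.

    `send` is a hypothetical transport hook that executes one command on one
    switch and raises on failure. Returns {host: [error, ...]} for the hosts
    where a command failed; hosts that completed cleanly are not listed.
    """
    failures: Dict[str, List[str]] = {}
    for host in hosts:
        for command in PAC_REFRESH_COMMANDS:
            try:
                send(host, command)
            except Exception as exc:  # record the failure and move to the next switch
                failures.setdefault(host, []).append(f"{command}: {exc}")
                break
    return failures
```

A dry run can be done by passing a `send` that only prints the host and command, which also gives a record of what would be executed where.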

I still believe I have some PAC issues, as on my WLC5520 I had to disable "PAC Provisioning" under the RADIUS server or I would not get a single authentication request from this WLC.

Now that I had both wired and wireless authentication working for MAB and EAP-TLS, I proceeded with the second ISE node: rebuilding, patching, updating the Profiler Feed and joining it to the new deployment. I was then able to re-provision all network devices again to have both ISE nodes for AAA.

One thing I did see, which I believe is a bug in DNAC: it would not keep the Secret field populated. It would let me type the secret in for both the network and client sections and save without issues, but the value would be gone when I came back to check the Network Settings.

I intend to build a virtual ISE 2.7 in an isolated environment where I can restore my old deployment, just in case there is some crucial setting I have forgotten about, but most things should be fixable once I know what is not working.

Hope this serves as a reminder: keep your platform in support and replace it in good time before anything goes EoL. For the 10,000 endpoints we have, with around 2,000 active sessions daily, this took me 4 days to complete. The longest part, of course, was recreating the Authorization Profiles and Policies, and I was fortunate to have a weekend where I could do the switchover of the deployment.
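
One way to look for forgotten items once the old deployment is restored in isolation would be to diff exports from both sides. Below is a minimal Python sketch for endpoint exports. The column name `MACAddress` is an assumption about the CSV header and may need adjusting to match the actual export, and the function names are hypothetical. MAC addresses are normalized first so that differences in case or separators do not show up as false mismatches.

```python
import csv
import io
import re


def normalize_mac(mac):
    """Reduce a MAC in any common notation to upper-case AA:BB:CC:DD:EE:FF."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).upper()
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def read_macs(csv_text, column="MACAddress"):
    """Collect the normalized MACs from one export (column name is assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {normalize_mac(row[column]) for row in reader if row.get(column)}


def diff_exports(old_csv, new_csv, column="MACAddress"):
    """Return (missing_in_new, only_in_new) as sorted lists of MACs."""
    old = read_macs(old_csv, column)
    new = read_macs(new_csv, column)
    return sorted(old - new), sorted(new - old)
```

The same set-difference approach would work for other exported object types, such as network devices, by keying on a name or IP column instead of the MAC.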