
9800 Serious Problems

neteng1
Level 1

We spent several months designing a move to the 9800 WLC according to Cisco best practices. We currently have over 4000 APs on a 9800-80 controller. The recommendation is fewer than 500 APs per site tag, which required us to reuse the same site tag configuration for many sites.

 

Shortly after a large migration, APs within several sites stopped allowing client connections. Clients were stuck in the 'Associating' state. The workaround we discovered was to change site tags. There is no configuration difference between sites.

 

The problem is reproducible, and we opened a Sev 1 case with Cisco about it. They observed the problem and found associated logs. However, we are going on two weeks without a resolution and are faced with migrating back to 8540 controllers. Does anyone have similar experience, or an environment of this size deployed on 9800?

 

Edit: The latest recommendation from Cisco is to try reducing to fewer than 200 APs per site tag.
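
For anyone planning a similar split, here is a minimal Python sketch of dividing one site's APs into several site tags that each stay under a chosen limit. It is only a planning aid, not anything Cisco provides: the function name, the `-ST01` tag naming scheme, and the AP names are all hypothetical, and the default of 199 simply reflects "fewer than 200".

```python
from math import ceil


def plan_site_tags(site_name, ap_names, max_aps_per_tag=199):
    """Split one site's APs into contiguous, evenly sized groups.

    Returns a dict mapping a (hypothetical) site tag name to the list of
    APs assigned to it. No group exceeds max_aps_per_tag.
    """
    if max_aps_per_tag < 1:
        raise ValueError("max_aps_per_tag must be at least 1")
    total = len(ap_names)
    if total == 0:
        return {}

    # Smallest number of tags that keeps every tag at or under the limit.
    n_tags = ceil(total / max_aps_per_tag)
    # Spread the APs evenly instead of filling tags to the limit.
    base, extra = divmod(total, n_tags)

    plan = {}
    start = 0
    for i in range(n_tags):
        size = base + (1 if i < extra else 0)
        tag_name = f"{site_name}-ST{i + 1:02d}"  # naming scheme is an assumption
        plan[tag_name] = ap_names[start:start + size]
        start += size
    return plan


# Example: a hypothetical site with 450 APs ends up with three tags of 150.
example_aps = [f"AP-{n:04d}" for n in range(450)]
example_plan = plan_site_tags("CAMPUS-A", example_aps)
for tag, aps in example_plan.items():
    print(tag, len(aps))
```

Keeping the groups contiguous assumes the input list is already ordered by building or floor, so that neighbouring APs land in the same tag.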

6 Replies

Arshad Safrulla
VIP Alumni

Assuming that your RF environment is perfect, could you share the following?

Which IOS-XE version is running on the 9800-80?

Which AP models are impacted?

Which clients are impacted?

What is the client driver version?

Which mode are the APs deployed in?

Are all 4000 APs in the same campus, or do you have remote locations as well?

Do you have FT enabled?

Are you using WPA3?

What authentication mechanism do the impacted SSIDs use? (EAP-TLS, EAP-PEAP, PSK, Open, etc.)

What does the Radioactive Trace say?

Hi, we're running 17.3.4. This problem is not isolated to specific APs, clients, or WLANs. It affects all connections assigned to a given site tag. Changing the site tag temporarily resolves the problem, even when the AP Join Profile is the same. We have determined that it is not related to any configuration in the Policy Tag or RF Tag.

We haven't done anything to that scale yet.

The guidelines included in the 9800 migration webinar series a few months back are pages 51-54 of the Session 1 presentation at https://web.cvent.com/event/bcba04b5-6a9b-4a17-ac1e-ae718fd184bd/websitePage:332afdf8-3ce9-492a-bc88-102ec737bf1e

There's more info in the Session 5 presentation, pages 9-15, and as per @Arshad Safrulla's question above, your exact design and architecture matter a LOT. For example, page 14:
- Don't use the same site tag across multiple Flex sites

- If support for Fast Seamless Roaming (802.11r, CCKM, OKC) is needed, then the max number of APs per site-tag for a Flex site is 100

So that limit of 500 is a general number for a basic local mode deployment, but you've not really given any proper details of your design/deployment. You mentioned using the same site tag for multiple sites, so you may already have breached the design best practice guidelines. I suggest you have a good read through those documents and see whether you need to revise your design.

 

Thank you. The document you provided references a syslog about WNCD overload, which we have no history of. To clarify: we are using the same Join Profile but different site tags across our environment. All APs are in local mode. We have adjusted all site tags to fewer than 200 APs per tag based on TAC's recommendation. TAC did find the following log, for which they have an internal bug.

 

2021/07/30 13:30:06.694250 {wncd_x_R0-3}{1}: [radius] [20560]: (ERR): RSPE- Crete New Socket Data : Dynamic socket pool limit reaced Max : 96
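
If you want to check your own exported trace logs for the same message, a small Python sketch along these lines might help. It assumes every relevant line has the same layout as the single line quoted above (timestamp, process name in braces, then the message), which is all that is known here; the function name and regular expression are illustrative, not part of any Cisco tool.

```python
import re

# Pattern modelled on the one log line quoted in this thread. It keys on the
# stable part of the message ("Dynamic socket pool limit") rather than the
# misspelled words around it.
SOCKET_POOL_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"\{(?P<process>[^}]+)\}.*"
    r"Dynamic socket pool limit.*Max\s*:\s*(?P<max>\d+)"
)


def find_socket_pool_errors(lines):
    """Return (timestamp, process, max_sockets) for each matching line."""
    hits = []
    for line in lines:
        match = SOCKET_POOL_PATTERN.search(line)
        if match:
            hits.append(
                (match["timestamp"], match["process"], int(match["max"]))
            )
    return hits


# The line below is copied verbatim from the post, typos included.
sample_lines = [
    "2021/07/30 13:30:06.694250 {wncd_x_R0-3}{1}: [radius] [20560]: (ERR): "
    "RSPE- Crete New Socket Data : Dynamic socket pool limit reaced Max : 96",
    "some unrelated line",
]
for hit in find_socket_pool_errors(sample_lines):
    print(hit)
```

Grouping the hits by process name would show which WNCD instances are running out of RADIUS sockets, assuming the process field identifies the instance as it appears to in the line above.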

 

We still have an open case and I will post if a fix is discovered.

OK, thanks. It will be interesting to hear the outcome, as we'll be looking to scale up at some point.

This is the bug Cisco provided. It is marked as Catastrophic. The status says fixed, but I have not been provided a service pack yet. The workaround is to disable AAA Accounting.

 

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvz30708
