C9120 Rx Stuck using 17.12.5

talasgair
Level 1

Since we upgraded to 17.12.5 on our 9800 I see these logs from 9120 APs:

...Rx stuck detected,doing phy forcecal for radio 1

Has anyone else noticed this or know what they mean? It doesn't sound good.


marce1000
Hall of Fame

 

  - @talasgair - Looks like a bug; report it to TAC.
  - You could try rebooting the (an) AP and check whether the issue persists.
  - Check whether clients are affected using:
    https://www.cisco.com/c/en/us/support/docs/wireless/catalyst-9800-series-wireless-controllers/217738-monitor-catalyst-9800-kpis-key-performa.html#toc-hId-866973845

   M.



-- Each morning when I wake up and look into the mirror I always say ' Why am I so brilliant ? '
    When the mirror will then always respond to me with ' The only thing that exceeds your brilliance is your beauty! '

Saikat Nandy
Cisco Employee

Could you please share the output of  - 

show controllers dot11Radio 1 reset
show flash crash
show flash cores

From the problem AP.

talasgair
Level 1

@Saikat Nandy 

Here is the output from one of the problem APs. 

Rich R
VIP

@talasgair 
There are problems with the Broadcom drivers https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwn27877
I believe the log comes from https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwk12169 which did not resolve the problem.
In the absence of an actual fix I think they've put something in the code to try to detect the Rx stuck condition (radio Rx queue is full and stops receiving frames) and restart the radio.  Clients which are already associated will have a dead service while the queue is stuck (no response from the AP, so they will eventually time out) and until after the radio is restarted.  New clients cannot associate while the queue is stuck.

Do you see any clients connected to the 5GHz radio when you see these logs?
If there are no clients, then do a shut/no shut on the radio (or reload the AP) and see whether it's working again (but it can fail again within hours or days).


@Rich R wrote:

@talasgair 
There are problems with the Broadcom drivers https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwn27877
I believe the log comes from https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwk12169 which did not resolve the


Sounds like an issue I had last year affecting 9166, 9136, and 9130 when we were on 17.9.4a/APSP8. At the time, the issue was attributed to CSCwj45141 or CSCwk48338. Ultimately, upgrading to 17.12.4/APSP2 solved it, and we're still fine as of 17.12.4/APSP6.

Anyway, I hope that it's not the same issue returning to 17.12.5.

Sadly, the bug CSCwk48338 you noted was updated yesterday and now includes 17.12.5 as affected, so unfortunately it appears there's a regression on that one. 

talasgair
Level 1

@Rich R There are no clients on these APs when I see the log. I just assumed that there were no clients on these APs as they are mostly in quiet areas, e.g. basements or areas usually unoccupied, but maybe it is because the radio is stuck. I also see the frequency of these logs increase overnight, and from more APs, when there are fewer people on site.
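
To put numbers on how often each AP and radio logs this, a tally along the following lines might help. This is only a rough sketch: the (ap_name, message) input shape is an assumption about however you export your logs, and the pattern matches only the message fragment quoted in this thread, since the start of the real log line is elided.

```python
import re
from collections import Counter

# Matches only the fragment quoted in the thread; the prefix of the real
# log line is elided ("...") in the original post, so it is not parsed here.
RX_STUCK = re.compile(r"Rx stuck detected,\s*doing phy forcecal for radio (\d+)")


def tally_rx_stuck(entries):
    """Count Rx-stuck messages per (AP name, radio slot).

    entries: iterable of (ap_name, message) pairs. This input shape is a
    hypothetical convenience, not a format produced by the WLC.
    """
    counts = Counter()
    for ap_name, message in entries:
        match = RX_STUCK.search(message)
        if match:
            counts[(ap_name, int(match.group(1)))] += 1
    return counts
```

Sorting the result with counts.most_common() would show which APs log it most often.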

This is something that you need to validate.  Don't assume that it's because of the log or because it's in a quiet area; put a device there so that it can connect to that AP and see what happens over time. On Windows machines, you can use netsh wlan show wlanreport to get a history of the device's wireless connection over time.  Correlating the controller logs with the netsh report can help you determine whether that log coincides with dropped client connections or not.
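
The correlation step could be sketched roughly like this, assuming you have already pulled the Rx-stuck log times and the client disconnect times out of the controller logs and the netsh report by hand or with your own parser. The function name and the 5-minute window are illustrative assumptions, not anything from Cisco or Microsoft.

```python
from datetime import datetime, timedelta


def disconnects_near_events(event_times, disconnect_times,
                            window=timedelta(minutes=5)):
    """Return the disconnect times that fall within `window` of any event.

    event_times: datetimes of the "Rx stuck detected" logs.
    disconnect_times: datetimes of client disconnects from the test device.
    Both lists are assumed to be in the same time zone.
    """
    hits = []
    for disconnect in disconnect_times:
        if any(abs(disconnect - event) <= window for event in event_times):
            hits.append(disconnect)
    return hits
```

If the result stays empty over a few days of testing, the log is probably not lining up with client drops on that AP.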

My opinion is, if you upgrade (anything) and then start having issues or seeing logs that were not there before, you need to open a TAC case and roll back. Keep your users happy and never drag things out unless there were already issues prior to the upgrade.

-Scott
*** Please rate helpful posts ***

We have a script running every 10 minutes doing "sh ap summ load-info" so if we see >1 clients on 2.4 GHz radio (slot 0) but zero on 5GHz radio (slot 1) then it's probably stuck and we do a shut/no shut on the 5GHz radio:

ap name <ap-name> dot11 5ghz shut
ap name <ap-name> no dot11 5ghz shut

That is the quickest way to get it working again.
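
For anyone wanting to build something similar, the decision logic can be sketched as below. It assumes the per-slot client counts have already been parsed out of the "sh ap summ load-info" output into a dict; that parsing is deliberately left out because the exact output format is not shown in this thread. The function names are made up for illustration.

```python
def suspect_stuck_5ghz(client_counts, min_24ghz_clients=2):
    """Return AP names with clients on slot 0 (2.4 GHz) but none on slot 1 (5 GHz).

    client_counts: dict of ap_name -> {slot_number: client_count}.
    This structure is a hypothetical result of your own parsing.
    min_24ghz_clients=2 mirrors the ">1 clients" rule described above.
    """
    suspects = []
    for ap_name, slots in sorted(client_counts.items()):
        if slots.get(0, 0) >= min_24ghz_clients and slots.get(1, 0) == 0:
            suspects.append(ap_name)
    return suspects


def bounce_commands(ap_name):
    """Build the shut/no shut pair quoted above for one AP's 5 GHz radio."""
    return [
        f"ap name {ap_name} dot11 5ghz shut",
        f"ap name {ap_name} no dot11 5ghz shut",
    ]
```

The sketch only builds the command strings; how they are pushed to the WLC (SSH, NETCONF, etc.) is up to your own tooling.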
TAC and BU were not able to suggest any better method of detecting the problem which scales well.

The most reliable way is by logging in to each AP (or running remote AP commands from the WLC with the results going into WLC logs, which is messy) and checking the radio stats directly: if you check show interfaces dot11Radio 1 a few times and you see that FCS errors are incrementing but none of the other Rx counters are incrementing, that means the Rx queue is stuck and the AP is not receiving any frames from the radio.

If you look at an Over The Air (OTA) capture you see clients trying to talk to the AP and zero response from the AP, because the AP never receives the client frames.  You still see the AP beaconing as normal (which is why clients try to join) because the Tx is still working fine.
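
The counter check described above could be expressed roughly as follows. The sketch assumes each snapshot is a dict holding only Rx counters taken from successive runs of show interfaces dot11Radio 1, and the key names (including "fcs_errors") are hypothetical placeholders, not the AP's real field names.

```python
def rx_looks_stuck(snapshots, fcs_key="fcs_errors"):
    """Return True if FCS errors keep rising while every other Rx counter is frozen.

    snapshots: list of dicts of Rx counters, oldest first, all with the
    same keys. Key names are assumptions made for this sketch.
    """
    if len(snapshots) < 2:
        return False  # need at least two samples to see any change
    for before, after in zip(snapshots, snapshots[1:]):
        fcs_grew = after[fcs_key] > before[fcs_key]
        others_frozen = all(
            after[key] == before[key] for key in before if key != fcs_key
        )
        if not (fcs_grew and others_frozen):
            return False
    return True
```

Requiring the pattern across every consecutive pair of samples is a conservative choice to avoid flagging a radio that was simply idle for one interval.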

Leo Laohoo
Hall of Fame

Reboot the APs daily -- This is going to be the new "fix" going forward.  

 

    - @Leo Laohoo  > ...Reboot the APs daily -- This is going to be the new "fix" going forward.
      I can't agree at all, actually; there are many places which need 24/7 wireless service.
      For instance, we have a chip factory with FABs in 24-hour production, hospitals, warehouses,
      airports and numerous others. Perhaps if Cisco would also provide an approach such
      as in flex upgrades, where APs can be rebooted in a pattern where some coverage is always kept
      and clients can hop to an available AP, it would be feasible.
      Better is for them to fix the bugs.
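
As an illustration of the staggered pattern suggested above, here is a minimal sketch that groups APs into reboot batches so that no two APs with overlapping coverage land in the same batch. It uses plain greedy graph colouring, which is a swapped-in generic technique and not how any Cisco feature is known to work; the neighbour map is a hypothetical input you would have to build yourself and it must be symmetric.

```python
def reboot_batches(neighbors):
    """Split APs into batches where no batch contains two neighbouring APs.

    neighbors: dict of ap_name -> set of AP names with overlapping coverage.
    The map is assumed symmetric (if A lists B, B lists A).
    """
    batch_of = {}
    for ap in sorted(neighbors):
        # Batches already taken by this AP's neighbours.
        used = {batch_of[n] for n in neighbors[ap] if n in batch_of}
        batch = 0
        while batch in used:
            batch += 1
        batch_of[ap] = batch

    batches = {}
    for ap, batch in batch_of.items():
        batches.setdefault(batch, []).append(ap)
    return [batches[index] for index in sorted(batches)]
```

Rebooting one batch at a time would leave every AP's neighbours up while it restarts, provided the neighbour map reflects real coverage.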

  M.
                        
                              
                             




I agree with the sentiments and "reboot the APs daily" is not a sustainable solution, however, it is faster (and easier) to reboot the APs daily than wait for Cisco to come up with a solution or fix.  

And when I say "wait for Cisco to come up with a solution or fix", I am talking about 5 to 10 years away, if they promised to fix it.

My latest "product enhancement request" (without an "executive support") took, at the very least, 4 years.  And it would have taken a lot longer had it not been for a "whale" leaning on Cisco -- And the solution is not even an APSP nor an APDP!

I hate to agree, but I have also had to write automation to find these and reboot the AP or radio.  Back in the day I had Prime run reports on client count to find issues like this, but it seems like you might need to pull this info into a DB so you can filter by client count and determine what you do next.
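
A rough sketch of the "pull it into a DB" idea, using SQLite from the Python standard library. The table layout, function names and the rule (2.4 GHz clients seen but zero 5 GHz clients across the whole period) are assumptions made for illustration; collecting the counts from the WLC is not shown.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the sample store. The schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(ts TEXT, ap TEXT, slot INTEGER, clients INTEGER)"
    )
    return db


def record(db, ts, ap, slot, clients):
    """Store one client-count sample; ts is an ISO 8601 string."""
    db.execute("INSERT INTO samples VALUES (?, ?, ?, ?)", (ts, ap, slot, clients))


def idle_5ghz_aps(db, since_ts):
    """APs with 2.4 GHz (slot 0) clients but no 5 GHz (slot 1) clients since since_ts."""
    rows = db.execute(
        """
        SELECT ap FROM samples
        WHERE ts >= ?
        GROUP BY ap
        HAVING SUM(CASE WHEN slot = 1 THEN clients ELSE 0 END) = 0
           AND SUM(CASE WHEN slot = 0 THEN clients ELSE 0 END) > 0
        ORDER BY ap
        """,
        (since_ts,),
    ).fetchall()
    return [row[0] for row in rows]
```

Looking at several samples instead of a single poll should cut down on false positives from APs that were just briefly empty on 5 GHz.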

-Scott
*** Please rate helpful posts ***

jasondodge
Level 1

Has anyone had luck with these radio monitoring settings found in the AP join profile? It seems to be a feature that should reset radios if there is no increment in the Tx and Rx statistics.  I haven't seen much improvement so far using it.

jasondodge_0-1748367848379.png

Source: https://www.cisco.com/c/en/us/td/docs/wireless/controller/9800/17-12/config-guide/b_wl_17_12_cg/m_ap_crash_file_upload_ewlc.html#info-ap-real-time-statistics
