Curious if anyone else has seen something similar to this, or has any thoughts on what the cause could be.
We maintain a small farm of WSA S670 boxes. Recently, they came under scrutiny after dropped traffic and interface resets began happening almost uniformly across the entire farm. In response, we implemented much more extensive SNMP monitoring on these devices and noticed an interesting trend.
Average CPU utilization sits around 5-7% on every S670, memory usage hovers around 15-20%, and disk IOPS never peaks above 200. The number of socket connections each WSA makes averages 8k-13k during business hours, well below the 40k maximum each is rated for. In other words, they're seemingly experiencing minimal load.
The Linux load average for them, however, always stays between '3' and '5'. In a nutshell, this means that on average three to five processes are either running or waiting for their turn (for the CPU or, on Linux, for I/O), which suggests some type of resource bottleneck on the WSAs that's causing processes to sit in the run queue instead of being handed to the CPU. Unless a Linux box is under heavy resource utilization, the load average should not stay above the number of CPU cores (that is, above '1' per core).
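For anyone who wants to sanity-check a load average against core count on a generic Unix-like host (not on the WSA itself, which doesn't give you a shell like this), here is a minimal Python sketch. The helper names `load_per_core` and `is_saturated` and the threshold of 1.0 per core are my own illustrative assumptions, not anything from Cisco.

```python
import os


def load_per_core(load_averages, cpu_count):
    """Divide the 1/5/15-minute load averages by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(value / cpu_count, 2) for value in load_averages)


def is_saturated(load_averages, cpu_count, threshold=1.0):
    """Return True when the 5-minute load per core exceeds the threshold.

    The 1.0 default is only a rule of thumb (an assumption), not a hard limit.
    """
    return load_per_core(load_averages, cpu_count)[1] > threshold


if __name__ == "__main__":
    loads = os.getloadavg()        # (1-min, 5-min, 15-min); Unix-only call
    cores = os.cpu_count() or 1    # cpu_count() can return None
    print("load per core:", load_per_core(loads, cores))
    print("saturated:", is_saturated(loads, cores))
```

By this measure, a load of '3' to '5' would be a clear bottleneck on a single-core box but unremarkable on an eight-core one, so the core count matters when reading the number.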
After some extensive head-scratching and digging, we found that the /proc directory on every single WSA in our farm was at 100% utilization. While I don't claim to be a Linux expert by any means, this immediately raised a red flag for me: my understanding is that if this directory is completely full, a traditional Linux installation won't let you start new processes because it has nowhere to store the ID files or related content.
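For comparison on a generic Unix-like host, this is a rough Python sketch of how a usage percentage for a mount point can be derived from `os.statvfs`. The helper names `usage_percent` and `mount_usage` are hypothetical. On a stock Linux machine /proc usually reports zero total blocks, so the sketch returns `None` there rather than a percentage; it does not settle whether a 100% figure on the WSA reflects real exhaustion or just how that filesystem reports itself.

```python
import os


def usage_percent(total_blocks, free_blocks):
    """Percent of blocks in use, or None if the filesystem reports no blocks.

    Virtual filesystems often report zero total blocks, in which case a
    usage percentage is not meaningful.
    """
    if total_blocks <= 0:
        return None
    return 100.0 * (total_blocks - free_blocks) / total_blocks


def mount_usage(path):
    """Return the usage percentage reported for the filesystem at `path`."""
    stats = os.statvfs(path)  # Unix-only call
    return usage_percent(stats.f_blocks, stats.f_bfree)


if __name__ == "__main__":
    for mount_point in ("/", "/proc"):
        if os.path.exists(mount_point):
            print(mount_point, mount_usage(mount_point))
```

Running the same check against a known-good box next to a suspect one would at least show whether the reported figure differs between them.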
Has anyone else seen a situation like this with their WSAs and a full /proc directory correlating to some bizarre traffic drops and latency? (and yes, I've already opened a case with TAC - I'm looking for community feedback, here)
In a normal state, the /proc directory on a WSA should not be at 100%; usage of this directory should be very minimal, and its contents should be cleared at shutdown.
If usage is at 100%, you need to identify which process is consuming that much. This most likely happened during system boot, since /proc contains entries for all processes on the WSA and is created on the fly when the system boots, and the process responsible is most likely having issues or is possibly corrupted.
A TAC case is definitely required so that the engineer can get root-level access to the WSA, check this from the backend, and escalate further if needed.