Hi!
We currently have two on-prem cloud infrastructures running Kubernetes/Rancher for dev and test, prod is coming along fast, and we've run into the following problem:
Each infrastructure consists of the following nodes:
1 "control" node
3 "master" nodes
3-4 "worker" nodes
3-4 "infra" nodes.
All of them have AMP installed, latest version.
On the worker and infra nodes, the memory usage of the ampdaemon process ramps up after a while until it triggers the OOM killer, which then starts killing the k8s/Rancher processes, and the infra and worker servers become unavailable.
This never happens on the control or master servers.
I suspect there may be a problem with the exclusion list, or a memory leak somewhere between AMP and k8s.
This happened again just half an hour ago on one of our infra nodes: the ampdaemon process was using 77% of the server's memory.
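For context, here's the kind of one-shot snapshot I've been taking on the affected nodes to watch the ramp-up (just a sketch; running it from cron to build a history, the interval is my own choice):

```shell
# Capture the top memory consumers with a timestamp; on an affected node,
# ampdaemon climbs toward the top of this list over time.
snapshot=$(ps -eo pid,comm,rss,pmem --sort=-rss | head -n 6)
printf '%s %s\n' "$(date '+%F %T')" "---- top RSS ----"
printf '%s\n' "$snapshot"
```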
I have disabled the AMP service on the test nodes and left the dev nodes in place for troubleshooting for now.
The exclusion list should be sound based on the documentation, and we don't have this issue on our other server, which doesn't run k8s.
I can't post logs right now, but I will later if needed.
All of the systems mentioned are running RHEL 8.5 with swap off; auto-update for the connector is enabled, and ClamAV is set to Linux-only.
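In case it helps, this is the quick sanity check I run on each node to confirm the details above (the release file path is RHEL-specific; adjust as needed):

```shell
# Confirm OS release, swap state, and kernel on a node.
release=$(cat /etc/redhat-release 2>/dev/null || echo "unknown")
swap=$(swapon --show 2>/dev/null)   # empty output means swap is off
kernel=$(uname -r)
echo "release: $release"            # expect RHEL 8.5 here
echo "swap:    ${swap:-off}"
echo "kernel:  $kernel"
```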