DNAC Services Down - (InfluxDB Issue)

Mike.Cifelli
VIP Alumni

Sharing a recent experience as a point of reference in case someone encounters similar issues. In our environment we run a 3-node DNAC cluster, currently on version 1.3.1.3.

The issue was that DNAC reported a services disruption when we logged into the admin GUI. The 5 services that were down were:

-task-service
-swim
-spf-service-manager-service
-grouping-service
-apic-em-inventory-manager-service

These services were down due to issues with InfluxDB (which is primarily used for metrics data). We were advised that we were encountering this bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvq09974. In our case the trigger was not the size of the environment but the length of time it had been running. The workaround was to move data (directories: telegraf & k8s) from the full partition to another partition. This mostly fixed the original issue, but it then created issues with Cassandra and Zookeeper, which again led to moving data around, scaling the db down and back up, restarting services, and checking cluster health. Per TAC/BU, the main issue is a flaw in earlier code releases that fails to properly prune the database.

If you notice that the same or similar services are down, I suggest checking your partitions and their available space. If you think you are encountering this issue, I strongly recommend opening a ticket with TAC and holding off on any major changes you are planning to make. We plan to upgrade our cluster in the near future. HTH!
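For anyone who wants to script the partition check, here is a rough Python sketch of the idea. It is not a Cisco tool and was not part of the TAC workaround. The function names, the 80% threshold, and the "/" example path are my own placeholders, so substitute the mount points that matter in your environment. It only reads disk usage and changes nothing.

```python
import shutil


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def partitions_over_threshold(paths, threshold_pct=80.0):
    """Return (path, percent_used) pairs at or above the threshold.

    `paths` and `threshold_pct` are hypothetical inputs chosen for
    illustration; they are not values recommended by TAC.
    """
    flagged = []
    for path in paths:
        pct = percent_used(path)
        if pct >= threshold_pct:
            flagged.append((path, round(pct, 1)))
    return flagged


if __name__ == "__main__":
    # "/" is only a placeholder; list the partitions you care about.
    for path, pct in partitions_over_threshold(["/"], threshold_pct=80.0):
        print(f"{path}: {pct}% used")
```

The percentage is computed as used/total, so it can differ by a few points from what `df` shows on filesystems that reserve blocks. Treat it as an early warning, not an exact figure.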
