I am having trouble with CPU utilization on a Cisco CSR1000V Series router: it periodically rises to 100%. I found a couple of documents on troubleshooting techniques for this kind of issue: "Troubleshooting High CPU Utilization on Cisco Routers" and "Troubleshooting High CPU Utilization". However, for the CSR1000V those articles are not very useful and have not helped me much, mostly because the CSR1000V is a virtual machine running on VMware, whereas the articles are written for "hardware" routers.
Can anybody recommend a good article about troubleshooting high CPU utilization on the CSR1000V Series or on a virtual router in general?
P.S.: I tried to find something relevant on the Cisco site but found nothing. Maybe I have overlooked something. If so, please point it out.
Update: Some articles were recommended to me, but they contain some small inaccuracies.
A possible starting point for troubleshooting: Cisco docwiki CSR1000V:Home
Hi, unfortunately it is not a short burst. It can last a very long time. Last time it lasted for about 30 minutes until I reloaded the router. The time before that it lasted for some hours, but I am not sure whether it recovered by itself or someone reloaded it by hand.
Of course, I opened an SR, but they are not in a hurry. Since the issue arose I have been trying to investigate it myself, but I have found very little information about this type of issue. I searched for the same issue in the Bug Search Tool without success. Maybe I am looking for answers in the wrong place.
We are not using the latest software version; we hold the view that a proven version is better than the newest one. We are running 3.14.2S. Also, we have not received any recommendation to change the IOS-XE version we use.
Yes, I collected "show tech-support" and "show process cpu...". The most surprising thing: from the IOS point of view the router consumes about 10% of the CPU, while from the VMware point of view it consumes 99%.
I have attached a snippet of the results. I really do not understand what the process "qfp-ucode-csr" is. But I have read a little about IOS-XE and, as I understand it, the process "linux_iosd-imag" is the ordinary IOS, run as a separate process within IOS-XE.
This is a brief excerpt from the "show process..." output.
# show proc cpu plat sorted
CPU utilization for five seconds: 99%, one minute: 99%, five minutes: 99%
# show proc cpu sorted
CPU utilization for five seconds: 14%/1%; one minute: 7%; five minutes: 10%
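To put the two summary lines above side by side, here is a minimal Python sketch that parses them and subtracts the IOSd figure from the platform figure. The function names are made up for illustration, and the subtraction is only a rough indicator, since the two commands do not necessarily measure CPU on the same basis.

```python
import re

# Summary lines copied verbatim from the two show commands quoted above.
PLATFORM_LINE = "CPU utilization for five seconds: 99%, one minute: 99%, five minutes: 99%"
IOSD_LINE = "CPU utilization for five seconds: 14%/1%; one minute: 7%; five minutes: 10%"

# Matches "<interval>: <total>%" with an optional "/<interrupt>%" suffix.
_FIELD = re.compile(r"(five seconds|one minute|five minutes):\s*(\d+)%(?:/(\d+)%)?")


def parse_cpu_summary(line):
    """Return {interval: total_percent} parsed from a 'CPU utilization for ...' line."""
    return {label: int(total) for label, total, _irq in _FIELD.findall(line)}


def unexplained_load(platform, iosd):
    """Per-interval platform load that the IOSd view does not account for (rough estimate)."""
    return {interval: platform[interval] - iosd[interval]
            for interval in platform if interval in iosd}


platform = parse_cpu_summary(PLATFORM_LINE)
iosd = parse_cpu_summary(IOSD_LINE)
gap = unexplained_load(platform, iosd)

for interval, percent in gap.items():
    print(f"{interval}: about {percent}% of CPU is used outside of what IOSd reports")
```

A large gap like this suggests looking at the other entries in the "show proc cpu plat sorted" process list rather than at the IOS process table.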
There is 1 vCPU allocated to the VM. According to the datasheet, 1 vCPU/4 GB is required for all technology packages up to 500 Mbps throughput, and we meet this requirement.
The maximum peak interface utilization over 7 days was 130 Mbps. It is mostly IPsec traffic. At first glance the issue arises irregularly: it can happen in the middle of the night or in the afternoon, although the traffic rate depends heavily on the time of day because this is a business service. So I cannot tie the high CPU utilization to the traffic rate.
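One way to check the "no link between CPU and traffic rate" impression more formally is to compute a correlation coefficient over paired samples. The sketch below is only an illustration: the sample values are made-up placeholders, not measurements from this router, and `pearson` is a hand-written helper rather than anything Cisco provides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL placeholder samples taken at the same timestamps.
# Replace them with real values exported from monitoring.
cpu_percent = [12, 15, 99, 14, 99, 13]
traffic_mbps = [40, 120, 35, 130, 90, 30]

r = pearson(cpu_percent, traffic_mbps)
print(f"correlation between CPU and traffic rate: {r:.2f}")
```

A coefficient near +1 would mean CPU follows traffic; a value near 0 would support the observation that the spikes are not driven by load.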
We ran an experiment some months ago and observed the expected dependence of CPU utilization on the IPsec traffic rate. So we passed the synthetic tests successfully, and the current traffic rate should not lead to high CPU utilization.
I also believe that this is a software bug, but I may not increase the number of vCPUs, upgrade, or change anything else without a TAC recommendation, because the devices are in production.
But TAC answers very slowly. I tried to find information that would help me find the root cause myself, but I found nothing that would help me analyse this situation. :(
This is why I am asking here.
No, it is not limited. I checked it one more time. Besides, the CSRs were deployed from the OVA image, and I suppose the OVA image contains the appropriate settings.
As for the bug you pointed to: as I understand it, that bug is mostly cosmetic, because under some circumstances a show command may simply display incorrect data.
I don't envy you. You are in a yucky place. It looks like you have a software bug. TAC can only try to reproduce and capture the errors; if they can, they can log a bug, and then it's over to the developers to develop a patch.
I don't think you are going to get your issue resolved any time soon.
Could you deploy a hardware appliance in the meantime?
Yeah, it looks like an issue in a software process that handles only transit traffic. At the same time, I may not upgrade the software because the root cause has not been determined yet.
I cannot deploy a hardware appliance because it would mean additional cost. Moreover, we "moved away" from hardware devices because they were not able to process the required amount of traffic.
An ASR1006, a mid-range model, can do over 100Gb/s of throughput. Then there is the ASR 9k range, the CSRs, etc. There is no way a piece of software running on generic Intel CPUs is going to outperform a custom-made piece of hardware with custom ASICs. No way.
I think the CSR1000V peaks out at 5Gb/s, and needs around 8 CPU cores to get there.
But as you say, there were budget constraints. The question is, have you ended up with something that can do the job?
We determined the type and amount of traffic the CSR should process: far less than 500 Mbps. We tested the CSR with the appropriate traffic type and rate and got very positive results. Also, everyday use shows that the device with its current configuration consumes less than 50% of CPU, so the device completely meets our requirements. The only option for me is to wait for TAC's answer.
By the way, I believe that every device has its own strengths and weaknesses, so simply replacing one device with another cannot guarantee the absence of issues. You know, there is "No silver bullet". :-(