09-30-2019 11:56 AM
Hello Cisco Community - can anyone help me understand how to monitor packet drops in Cisco UCS? UCSM shows error and loss counters in the port stats, but there is no basic packet-drop counter. More specifically, how can I check whether packets were being dropped at a specific time?
Do I need the CLI for this?
Do I need a 3rd party mgmt tool that can collect stats and store them for historical review?
HELP! :)
I've been focused on VMware technology for most of my career, and there this is a metric you can easily pull up and look at, with well-documented fixes if a problem occurs. I can't seem to figure out how to get packet-drop stats in UCS, and I can't find any documentation on it.
09-30-2019 12:04 PM
The question here is: packets dropping from where to where?
How is your UCS environment connected?
In most cases: VMware (vSwitch) -- UCS -- Fabric Interconnect -- Nexus -- Core -- (users on access switches).
Depending on where you see packet drops, we need to look at the interfaces at that connection point.
In a VMware environment, vSphere lets you monitor at the VM level, and you can monitor at the switch level if you have an NMS.
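To add to that: the per-port drop counters aren't exposed as one simple number in the UCSM GUI, but the Fabric Interconnects run NX-OS underneath, and you can read the counters from the FI CLI. A minimal sketch (`ethernet 1/1` is a placeholder; substitute your server or uplink port):

```
UCS-A# connect nxos a
UCS-A(nxos)# show interface ethernet 1/1          ! input/output discards and errors for one port
UCS-A(nxos)# show queuing interface ethernet 1/1  ! per-queue ingress/egress drop counters
```

`show queuing interface` is usually the most direct view of drops caused by congestion or pause behavior on the FI.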
10-01-2019 06:17 AM
The architecture is exactly as you listed it: VMware running on UCS blades, with a Nexus 5K upstream and a Nexus 7K above that.
I have looked at the VM stats; there were no packet drops at the time of the issue. Our F5 shows that the two servers became unavailable and stopped responding to the health check.
No packet drops on the VMs in question
No host mem swapping
No CPU contention on the host
Small deviation in storage latency, but the spike doesn't even touch 1 ms (thanks, HDS; there is no better array)
Small deviation in network throughput, but the spike is only to 400 KBps, and we have 10 Gbps networking
My review of the VMware stack shows it's not the culprit.
UCSM shows zero for all loss counters and zero for all error counters in the port stats.
The event logs don't go back far enough; they must be getting overwritten. I can't recall how that is configured, but I can only see the past week or so of events. I navigated to the specific blades the VMs were running on to check for faults and events, and there was no data in either. I'm not sure why the events section would be empty; I figured there should be some events logged, but there was nothing.
I simply want to understand where I can see whether packet drops took place within the UCS stack. I know where to look from the VMware host perspective, but not for the UCS hardware itself, or whether this is even something that is monitored and logged in the stack.
I figured that if an interface in the UCS stack went down I would see it in the event logs, but as I said, the logs don't go back far enough. I wasn't pulled in to diagnose this issue until over a week after it occurred. With VMware I can get a granular look at the stats using vRealize Operations Manager, but I have nothing like that tool for Cisco UCS, and I would certainly take a recommendation on something that captures events better for historical analysis.
10-01-2019 07:09 AM - edited 10-01-2019 07:25 AM
You would need to check:
Also, I have seen plenty of cases where the actual issue was storage, but the first symptoms show up as 'network' issues when hosts or guest VMs start to thrash against storage (check esxtop output for DAVG, KAVG, and GAVG).
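To review those esxtop latency counters after the fact rather than live, esxtop's batch mode can log everything to CSV (the interval and sample count below are arbitrary placeholders):

```
# On the ESXi host shell: one sample every 5 s, 120 samples (~10 minutes), all counters
esxtop -b -d 5 -n 120 > /tmp/esxtop-capture.csv
```

A commonly cited rule of thumb: sustained DAVG above roughly 25 ms points at the array or SAN fabric, while high KAVG with low DAVG points at queuing inside the host's storage stack.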
You can use VMware's pktcap-uw command to capture at the DVS/VMK/VMNIC level, and you'll likely want to set up a SPAN on the N5K links going to the FIs to see which direction the drops are in (i.e., is a guest VM sending and resending requests that aren't getting responses?).
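A sketch of both captures (the interface names and SPAN destination port are placeholders for your environment; check your build's options with `pktcap-uw -h`):

```
# ESXi host: capture on the physical uplink and on a vmkernel interface
pktcap-uw --uplink vmnic0 -o /tmp/vmnic0.pcap
pktcap-uw --vmk vmk0 -o /tmp/vmk0.pcap

# N5K: local SPAN of the FI-facing link to a sniffer port
conf t
interface ethernet 1/20
  switchport monitor          ! SPAN destination port must be in monitor mode on N5K
monitor session 1
  source interface ethernet 1/1 both
  destination interface ethernet 1/20
  no shut
```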
I would suggest getting pcaps to determine who stops getting responses from whom. Are only some of the hosts/guest VMs impacted? Try disabling some port-channel members, or half of the vPC: does the problem go away?
Set up some generic ping tests to isolate:
Kirk...
10-01-2019 07:23 AM
10-01-2019 07:29 AM - edited 10-01-2019 07:37 AM
You're not going to get individual per-drop timestamps.
You can enable CRC-increment threshold alerts, which would give you alert timestamps.
I would keep the interface counters cleared and keep checking.
It would be handy if you had an NMS polling and logging historical data.
What does the output of `show interface counters errors` (from the FI's nxos context) look like on each FI?
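A sketch of that clear-and-watch workflow, run from the UCSM CLI on each fabric (`connect nxos b` for the other FI):

```
UCS-A# connect nxos a
UCS-A(nxos)# clear counters interface all
UCS-A(nxos)# show interface counters errors   ! re-run periodically; any nonzero value incremented since the clear
```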
What model FIs and IOMs?
Kirk...