02-09-2016 09:03 AM - edited 03-01-2019 12:35 PM
We have been receiving PVSCSI errors in VMs that correlate with errors on the UCS blade adapters. Below is an example from the UCS blade adapter logs. Our storage team is currently working with EMC to determine whether RecoverPoint, VMAX, or Brocade could be causing these errors.
Example: 160209-04:54:55.535299 ecom.ecom_main ecom(4:0): abort called for exch 4cdc, status 3 rx_id 0 s_stat 0x0 xmit_recvd 0x0 burst_offset 0x0 burst_len 0x0 sgl_err 0x0 last_param 0x0 last_seq_cnt 0x0 tot_bytes_exp 0x200 h_seq_cnt 0x0 exch_type 0x1 s_id 0x3d5c5 d_id 0x3c980 host_tag 0x61
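The adapter log line above is mostly a series of name/value pairs, so it can be pulled apart programmatically when you have many of them to compare. The sketch below is an illustrative parser, not an official Cisco log-format specification; field names are taken only from the example line:

```python
import re

def parse_ecom_abort(line):
    """Parse an adapter 'abort called' log line into a dict of fields.

    Assumes the 'name value' pair layout seen in the example above;
    this is an illustrative helper, not a documented Cisco format.
    """
    fields = {}
    # generic "name value" pairs (hex or decimal values)
    for name, value in re.findall(r"(\w+) (0x[0-9a-fA-F]+|\d+)\b", line):
        fields[name] = value
    # the exchange id and status sit in a slightly different shape
    m = re.search(r"abort called for exch (\S+), status (\S+)", line)
    if m:
        fields["exch"], fields["status"] = m.group(1), m.group(2)
    return fields

line = ("160209-04:54:55.535299 ecom.ecom_main ecom(4:0): abort called for "
        "exch 4cdc, status 3 rx_id 0 s_stat 0x0 xmit_recvd 0x0 "
        "burst_offset 0x0 burst_len 0x0 sgl_err 0x0 last_param 0x0 "
        "last_seq_cnt 0x0 tot_bytes_exp 0x200 h_seq_cnt 0x0 exch_type 0x1 "
        "s_id 0x3d5c5 d_id 0x3c980 host_tag 0x61")
info = parse_ecom_abort(line)
print(info["s_id"], info["d_id"], info["status"])  # -> 0x3d5c5 0x3c980 3
```

Collecting the s_id/d_id pairs over time can show whether the aborts always involve the same FC source/destination, which helps narrow the fault domain.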
Has anyone seen similar errors? If so, what was the root cause?
Thank you.
02-15-2016 07:41 PM
Greetings.
From the UCSM CLI:
#connect adapter x/y/1   (x = chassis, y = blade; assumes only one adapter present)
#connect adapter 1/6/1
#connect
#show-macstats 0
#show-macstats 1 ... continue until you get an 'invalid uif' message
Do any of these have fields with counters for dropped or CRC errors?
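Rather than eyeballing every field, the macstats output can be scanned for nonzero drop/CRC/FCS counters. The sketch below assumes a simple 'counter_name: value' layout, which may not match your adapter firmware exactly; the sample counter names are invented for illustration:

```python
def nonzero_error_counters(output):
    """Return {name: value} for counters mentioning drop/crc/fcs with value > 0.

    Assumes each line looks like 'counter_name: value'; adjust the
    parsing for your firmware's actual show-macstats layout.
    """
    suspects = {}
    for line in output.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if not value.isdigit():
            continue
        if any(k in name.lower() for k in ("drop", "crc", "fcs")) and int(value) > 0:
            suspects[name] = int(value)
    return suspects

# hypothetical sample output, not real adapter counters
sample = """\
rx_frames_ok: 182773
rx_crc_errors: 12
rx_drop: 0
tx_frames_ok: 99120
tx_drop: 3
"""
print(nonzero_error_counters(sample))  # -> {'rx_crc_errors': 12, 'tx_drop': 3}
```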
You may also want to connect to each FI at the NX-OS level and run the following:
#connect nxos [a/b]
#show int count err
and see if you have any server ports or uplink ports with lots of CRC/FCS errors (please note whether they are tx or rx).
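When there are many ports, the error counter table can be filtered to just the ports worth investigating. The sketch below assumes a whitespace-separated table with the FCS error count in the third column; verify the column order against your NX-OS version's actual 'show interface counters errors' output, and the sample numbers are invented:

```python
def ports_with_fcs_errors(output, threshold=0):
    """Flag ports whose FCS error counter exceeds threshold.

    Assumes a table shaped like: Port  Align-Err  FCS-Err  Xmit-Err ...
    Check the real column layout on your NX-OS release before relying on it.
    """
    flagged = []
    for line in output.splitlines():
        cols = line.split()
        # skip headers, separators, and non-Ethernet rows
        if len(cols) < 3 or not cols[0].startswith("Eth"):
            continue
        try:
            fcs = int(cols[2])
        except ValueError:
            continue
        if fcs > threshold:
            flagged.append((cols[0], fcs))
    return flagged

# hypothetical sample output for illustration
sample = """\
Port        Align-Err    FCS-Err    Xmit-Err    Rcv-Err
Eth1/1      0            0          0           0
Eth1/9      0            1542       0           1542
Eth1/17     0            0          0           0
"""
print(ports_with_fcs_errors(sample))  # -> [('Eth1/9', 1542)]
```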
Please also check your FC port counters (if you have FC ports)
It is also possible to check the HIF/NIF interfaces from the IOM perspective, but you may want to open a ticket with TAC to check all your interfaces from the vNIC up to the pinned FI uplink port.
You will also want to make sure everything is in a supported configuration for firmware and drivers per the UCS interop matrix: http://www.cisco.com/web/techdoc/ucs/interoperability/matrix/matrix.html
Thanks,
Kirk
02-23-2016 12:36 PM
For what it's worth, we had a similar-sounding issue with PVSCSI and our Linux VMs; it never seemed to bite a Windows server. A reset would take place and we would lose a disk in the VM. After a reboot of the VM everything was OK, for a while.
We had a TAC ticket open and did much of the same stat gathering as suggested, including swapping out the twinax cables. Eventually TAC passed the buck to IBM, our storage vendor. Nothing came of that either.
I found I could force a reset on the VM's disk with the sg_reset command and watch it ripple through the VM, through ESXi, and down to the Cisco HBA, causing it to reset too.
What we did find is that changing the PVSCSI controllers to LSI Logic Parallel made the issue go away. That ripple effect no longer hit the physical HBA. We left it at that, as our VMs were now stable.
IBM XIV storage, 2.2.3c firmware, with ESXi 5.5 U2. I haven't retested with our current steady state of 2.2.6e firmware and ESXi 6.
If you find anything out, I'd love to hear about it.
02-23-2016 01:19 PM
Thank you Chris for sharing.
Question: Did you have a VMware support case open, and if so, what was the case number? I would like to run your findings by our VMware support engineer to get their thoughts.
Thanks again
Paul
02-24-2016 06:01 AM
We started with VMware support (ticket 15792380911). VMware will cry "Not it! Sorry." as soon as they see anything fnic- or enic-related. Off I went to Cisco (ticket 636945411), whose engineer, it must be said, really went out of his way to help. Finally Cisco passed it to IBM (no info).
I never did circle back with VMware about the workaround I found.
02-24-2016 08:58 AM
Thank you Chris for the feedback. I have passed along this info to our VMware support engineer. I'll update the discussion once I receive an update.
02-22-2016 08:41 AM
Hi Paul,
Were you able to fix this? Please share any links or details if you can.
02-22-2016 08:45 AM
No fix yet. We tried the recommendations by Kirk J. but didn't see anything in the logs that would cause such errors. We are still working with EMC, VMware and Cisco to find the root cause.
02-22-2016 08:59 AM
All right Paul, can you check some logs at the ESXi level with respect to I/O, and try incrementing the I/O throttle count on UCS to see if that helps? Not sure, but it's a point that can be checked; I came across this situation once with UCS and an EMC XIO box.
02-23-2016 01:33 PM
We are still working with EMC, VMware, and Cisco on uploading and reviewing logs. The issue is infrequent and is seen in two environments: ESXi 6.0 with VMAX storage w/RecoverPoint on MDS, and ESXi 5.5 with VMAX storage w/RecoverPoint on Brocade DCX. We have noticed the errors become less frequent since we rebalanced our VMAX FA port utilization.
Thanks
Paul