My UCS C460, installed with ESXi 6.0.0 Build 13635687 (VMware-ESXi-6.0.0-9313334-Custom-Cisco-184.108.40.206.iso), is running some Linux and Windows servers. After half a year of stable running, I began to get PSODs with GP Exception 13. At first it happened every 2 weeks. Then it became more and more frequent, and now the server needs a restart every 2 days.
Each time I get a GP Exception 13, but the message that follows is different. I have replaced the motherboard and the memory cards, and nothing changed, so I believe it should not be a hardware issue.
I collected some error messages, but I cannot locate the root cause.
My UCS C460 M4's BIOS version is C460M220.127.116.11b.0.062120160920, and I have tried the latest one before.
CPU Microcode Patch Revision is 0x0b00001d
ESXi 6.0.0 Build 3620759(VMware-ESXi-6.0.0-9313334-Custom-Cisco-18.104.22.168.iso)
2 X Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz Type 0, Family 6, Model 79, Stepping 1
4 X 64G DDR4
3 X Intel Ethernet Server Adapter I350-T4
1 X Intel X540 10 Gbps Network Controller
1 X Intel(R) I350 1 Gbps Network Controller
1 X RAID controller
I also have some USB network adapters connected to the UCS.
Could anybody give me some help? This problem has been bothering me for months, and any help would be appreciated. Thanks.
It would be better if you could collect the server's CIMC tech-support. However, looking at the earlier PSOD screenshots, it looks like an issue with CPU 2.
Unfortunately, this UCS was an internal order, so we don't have a tech-support service contract. Which part of the PSOD hints at a CPU 2 issue? I am not sure whether I should RMA again to replace the CPU.
All the PSOD screenshots point to PCPU 64, which should be on CPU 2 as per the given server configuration. However, the issue could be external to CPU 2, for example the DIMM modules managed by CPU 2 misbehaving. As you don't have tech-support with you, the following can be done:
### Swap CPU 1 and CPU 2 and check in the next PSOD events whether the error follows CPU 2 or not. If it follows CPU 2, you can replace CPU 2.
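To show where the "PCPU 64 means CPU 2" reading comes from, here is a rough sketch of the arithmetic. It assumes the E7-8867 v4 has 18 cores per socket with Hyper-Threading enabled (36 logical CPUs per socket, 72 in total) and that ESXi numbers PCPUs contiguously, socket by socket. The function name `pcpu_to_socket` is made up for illustration. It would be worth confirming the real layout on the host, for example by checking the Package Id column of `esxcli hardware cpu list`.

```python
def pcpu_to_socket(pcpu, sockets=2, cores_per_socket=18, threads_per_core=2):
    """Return the 1-based CPU (socket) number for a logical PCPU index.

    Assumption: PCPUs are numbered contiguously, one socket after another.
    The defaults are an assumed layout for 2 x E7-8867 v4 with
    Hyper-Threading on, not values read from the host.
    """
    per_socket = cores_per_socket * threads_per_core  # 36 logical CPUs per socket
    total = sockets * per_socket                      # 72 logical CPUs overall
    if not 0 <= pcpu < total:
        raise ValueError(f"PCPU {pcpu} is outside 0..{total - 1}")
    return pcpu // per_socket + 1


print(pcpu_to_socket(64))  # 64 // 36 + 1 = 2, i.e. CPU 2
```

Under these assumptions PCPUs 0 to 35 belong to CPU 1 and PCPUs 36 to 71 to CPU 2, so after swapping the two CPUs a fault that stays with the same physical part would be expected to show up in the 0 to 35 range instead.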
I would run the UCS diagnostic ISO through its tests and see if anything is flagged for the DIMMs or CPUs. See https://software.cisco.com/download/home/286265859/type/286123307/release/6.0(2a)
Also, GP Exception 13 is not necessarily a hardware problem (although it can be).
According to VMware, an Exception 13 (GPF) occurs under one of these circumstances:
If the diagnostic tests turn up clean, then you may want to have VMware evaluate the dump files from the VMware support bundle from that host.
As Kirk mentioned, run a diagnostic test on the server and choose the comprehensive test method.
Please note that comprehensive tests can run for several hours or days. They run exhaustive burn-in tests on your server, such as stress tests.