10-21-2015 11:06 AM - edited 03-01-2019 12:25 PM
It appears CSCue49366 has reared its ugly head again: after updating from a functional 2.2(3e) to 2.2(6c), we are fighting the symptoms outlined in that bug report (fans @ 100% or not functional, no PSU readings). After working with TAC for a week, it's been decided the last thing to try is a full chassis shutdown.
Hopefully it will come up without any errors. Heads up for anyone looking to update.
10-21-2015 01:40 PM
We would be able to quickly tell if it's an I2C congestion issue from your i2c.log file.
Could you supply the segment from that log below # I2C Bus 1 and # I2C Bus 2? (both IOMs will have different values)
I have seen some issues that looked similar but weren't necessarily due to I2C congestion.
10-21-2015 01:50 PM
Thanks for the reply
Here is IOM 1 (Currently Primary)
# I2C Bus Statistics Wed Oct 21 14:48:41 CST 2015
# I2C Bus 1 busn=0 nseg=2
segment 0 local wait_gt_deadline 1
segment 1 extended wait_gt_deadline 13
error_pca9541_per_device:
# I2C Bus 2 busn=1 nseg=5
segment 0 local
segment 1 chassis norxack 192 wait_gt_deadline 5
segment 2 blade
segment 3 fan
segment 4 psu timeout 1777 fixup 3490 hub_sw_mbb 1713 hub_sw_mbb_to 1713
gilroy.counter.reserved 1592
gilroy.counter.released 1592
gilroy.counter.status_chassis_off 3725
error_pca9541_per_device:
# I2C Device Statistics
iom.fru={SUCCESS=12} iom.rtc={SUCCESS=7}
NAME=iom.woodside No such path /sys/devices/platform/fsl-i2c.1/i2c-0/0-0048
NAME=iom.dcdc0
detach_state driver iout iout_cal_offset iout_oc_fault_limit mfr_date mfr_id mfr_location mfr_model mfr_revision mfr_serial name ot_fault_limit ot_warn_limit temperature1 temperature2 vin vin_ov_fault_limit vin_uv_fault_limit vout vout_mode vout_n vout_ov_fault_limit vout_uv_fault_limit
NAME=iom.dcdc1
detach_state driver iout iout_cal_offset iout_oc_fault_limit mfr_date mfr_id mfr_location mfr_model mfr_revision mfr_serial name ot_fault_limit ot_warn_limit temperature1 temperature2 vin vin_ov_fault_limit vin_uv_fault_limit vout vout_mode vout_n vout_ov_fault_limit vout_uv_fault_limit
iom.gpio0={SUCCESS=154} iom.gpio1={SUCCESS=154} iom.gpio2={SUCCESS=154} iom.gpio3={SUCCESS=372}
iom.temp.inlet1={SUCCESS=19} iom.temp.inlet2={SUCCESS=19} iom.temp.woodside={SUCCESS=17}
c.fru={SUCCESS=1} c.seeprom={SUCCESS=1523}
f.fm0.fru={SUCCESS=1} f.fm1.fru={SUCCESS=1} f.fm2.fru={SUCCESS=1} f.fm3.fru={SUCCESS=1}
f.fm4.fru={SUCCESS=1} f.fm5.fru={SUCCESS=1} f.fm6.fru={SUCCESS=1} f.fm7.fru={SUCCESS=1}
p.psu3.psmi={EBUSY=4}
p.psu0.fru={ETIMEDOUT=73} p.psu0.psmi={EBUSY=4}
p.psu1.fru={ETIMEDOUT=73} p.psu1.psmi={EBUSY=4}
p.psu2.fru={ETIMEDOUT=72} p.psu2.psmi={EBUSY=4}
p.psu3.fru={ETIMEDOUT=72}
c.gpio0={ENXIO=8} c.gpio1={ENXIO=8} c.gpio2={ENXIO=8} c.gpio3={ENXIO=8}
# I2C Driver sysctl entries
sysctl: error: permission denied on key 'net.ipv4.route.flush'
sysctl: error: permission denied on key 'kernel.cad_pid'
sysctl: error: permission denied on key 'kernel.cap-bound'
dev.i2c.disconnect_retry = 3
dev.i2c.post_trigger = 64
dev.i2c.norxack_blink = 5
dev.i2c.norxack_blink = 5
dev.i2c.fixup_blink = 0
dev.i2c.pca9541-workaround = 18
dev.i2c.wait_deadline = 30
dev.i2c.chassis_reservation.demand = 1
dev.i2c.chassis_reservation.lock_state = 1
dev.i2c.chassis_reservation.auto_release = 1
dev.i2c.chassis_reservation.on_demand = 0
dev.i2c.chassis_reservation.pause_gilroy_thread = 0
dev.i2c.chassis_reservation.min_notheld_ms = 150
dev.i2c.chassis_reservation.wait_extra_ms = 3000
dev.i2c.chassis_reservation.grace_ms = 750
dev.i2c.chassis_reservation.hold_ms = 1500
dev.i2c.chassis_reservation.wait_ms = 4500
dev.i2c.gilroy-debug-level = 3
dev.i2c.debug-level = 1
dev.i2c.pca9541-businit = 1
dev.i2c.pca9541-delay = 250
dev.i2c.bus2.write-cdelay = 100
dev.i2c.bus2.write-delay = 100
dev.i2c.bus1.write-cdelay = 30
dev.i2c.bus1.write-delay = 30
And IOM 2 (Subordinate)
# I2C Bus Statistics Wed Oct 21 14:49:16 CST 2015
# I2C Bus 1 busn=0 nseg=2
segment 0 local
segment 1 extended wait_gt_deadline 21
error_pca9541_per_device:
# I2C Bus 2 busn=1 nseg=5
segment 0 local
segment 1 chassis norxack 201 wait_gt_deadline 19
segment 2 blade
segment 3 fan norxack 34 pca9541clrerrprs 27 pca9541seterr 5 pca9541clrlasterr 1 wait_gt_deadline 14
segment 4 psu timeout 10993 fixup 17073 hub_sw_mbb 6080 hub_sw_mbb_to 6080
gilroy.error.pca9541_control_state 2
gilroy.error.do_reserve_pca9541_control 2
gilroy.counter.reserve 1
gilroy.counter.release 1
gilroy.counter.reserved 3670
gilroy.counter.already_reserved 1
gilroy.counter.released 3665
gilroy.counter.already_released 5
gilroy.counter.status_chassis_off 5723
error_pca9541_per_device: c.ms 2 f.fm0.ms 3 f.fm2.fru 5 f.fm2.ms 18 f.fm7.ms 6
# I2C Device Statistics
iom.fru={SUCCESS=12} iom.rtc={SUCCESS=7}
NAME=iom.woodside No such path /sys/devices/platform/fsl-i2c.1/i2c-0/0-0048
NAME=iom.dcdc0
detach_state driver iout iout_cal_offset iout_oc_fault_limit mfr_date mfr_id mfr_location mfr_model mfr_revision mfr_serial name ot_fault_limit ot_warn_limit temperature1 temperature2 vin vin_ov_fault_limit vin_uv_fault_limit vout vout_mode vout_n vout_ov_fault_limit vout_uv_fault_limit
NAME=iom.dcdc1
detach_state driver iout iout_cal_offset iout_oc_fault_limit mfr_date mfr_id mfr_location mfr_model mfr_revision mfr_serial name ot_fault_limit ot_warn_limit temperature1 temperature2 vin vin_ov_fault_limit vin_uv_fault_limit vout vout_mode vout_n vout_ov_fault_limit vout_uv_fault_limit
iom.gpio0={SUCCESS=780} iom.gpio1={SUCCESS=780} iom.gpio2={SUCCESS=780} iom.gpio3={SUCCESS=1624}
iom.temp.inlet1={SUCCESS=291} iom.temp.inlet2={SUCCESS=291} iom.temp.woodside={SUCCESS=255}
c.fru={SUCCESS=1} c.seeprom={SUCCESS=1309}
f.fm0.fc={SUCCESS=452} f.fm0.fru={SUCCESS=8}
f.fm1.fc={SUCCESS=452} f.fm1.fru={SUCCESS=6}
f.fm2.fc={SUCCESS=449} f.fm2.fru={SUCCESS=18,ETIMEDOUT=1}
f.fm3.fc={SUCCESS=449} f.fm3.fru={SUCCESS=6}
f.fm4.fc={SUCCESS=449,EBUSY=2} f.fm4.fru={SUCCESS=3}
f.fm5.fc={SUCCESS=449} f.fm5.fru={SUCCESS=3}
f.fm6.fc={SUCCESS=447} f.fm6.fru={SUCCESS=11}
f.fm7.fc={SUCCESS=447} f.fm7.fru={SUCCESS=3}
p.psu3.psmi={EBUSY=424}
p.psu0.fru={ETIMEDOUT=386} p.psu0.psmi={EBUSY=441}
p.psu1.fru={ETIMEDOUT=386} p.psu1.psmi={EBUSY=424}
p.psu2.fru={ETIMEDOUT=386} p.psu2.psmi={EBUSY=424}
p.psu3.fru={ETIMEDOUT=386}
c.gpio0={ENXIO=8} c.gpio1={ENXIO=8} c.gpio2={ENXIO=8} c.gpio3={ENXIO=8}
# I2C Driver sysctl entries
sysctl: error: permission denied on key 'net.ipv4.route.flush'
sysctl: error: permission denied on key 'kernel.cad_pid'
sysctl: error: permission denied on key 'kernel.cap-bound'
dev.i2c.disconnect_retry = 3
dev.i2c.post_trigger = 64
dev.i2c.norxack_blink = 5
dev.i2c.norxack_blink = 5
dev.i2c.fixup_blink = 0
dev.i2c.pca9541-workaround = 18
dev.i2c.wait_deadline = 30
dev.i2c.chassis_reservation.demand = 1
dev.i2c.chassis_reservation.lock_state = 1
dev.i2c.chassis_reservation.auto_release = 1
dev.i2c.chassis_reservation.on_demand = 0
dev.i2c.chassis_reservation.pause_gilroy_thread = 0
dev.i2c.chassis_reservation.min_notheld_ms = 150
dev.i2c.chassis_reservation.wait_extra_ms = 3000
dev.i2c.chassis_reservation.grace_ms = 750
dev.i2c.chassis_reservation.hold_ms = 1500
dev.i2c.chassis_reservation.wait_ms = 4500
dev.i2c.gilroy-debug-level = 3
dev.i2c.debug-level = 1
dev.i2c.pca9541-businit = 1
dev.i2c.pca9541-delay = 250
dev.i2c.bus2.write-cdelay = 100
dev.i2c.bus2.write-delay = 100
dev.i2c.bus1.write-cdelay = 30
dev.i2c.bus1.write-delay = 30
OBFL Logs are showing seeprom errors on IOM 1:
2015-10-21T20:44:13.910978+00:00 CMC NOCSN_-3-CMC OBFL:0:read_header_info:header read error
2015-10-21T20:44:18.367276+00:00 CMC NOCSN_-3-CMC OBFL:0:read_seeprom_data:seeprom read error: -11 (Resource temporarily unavailable)
2015-10-21T20:44:18.367458+00:00 CMC NOCSN_-3-CMC OBFL:0:mcserver_seeprom_read_data:seeprom read error
2015-10-21T20:44:22.677589+00:00 CMC NOCSN_-3-CMC OBFL:0:read_seeprom_data:seeprom read error: -11 (Resource temporarily unavailable)
2015-10-21T20:44:22.677696+00:00 CMC NOCSN_-3-CMC OBFL:0:read_header_info:header read error
2015-10-21T20:44:30.865812+00:00 CMC NOCSN_-3-CMC OBFL:0:lazy_dmclient_init:Error initializing the dmclient
2015-10-21T20:44:30.866799+00:00 CMC NOCSN_-3-CMC OBFL:0:seeprom_extension_init:lazy_dmclient_init error during seeprom_extension_init: -519
2015-10-21T20:44:30.868088+00:00 CMC NOCSN_-3-CMC OBFL:0:read_seeprom_data:seeprom read error: -501 (Initializatoin failed)
2015-10-21T20:44:30.868590+00:00 CMC NOCSN_-3-CMC OBFL:0:read_header_info:header read error
2015-10-21T20:44:31.972263+00:00 CMC NOCSN_-3-CMC OBFL:0:read_seeprom_data:seeprom read error: -11 (Resource temporarily unavailable)
2015-10-21T20:44:31.972457+00:00 CMC NOCSN_-3-CMC OBFL:0:mcserver_seeprom_read_data:seeprom read error
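For anyone reading these logs later, the negative numbers look like standard Linux errno values (the -11 entries even carry the usual "Resource temporarily unavailable" text). Below is a minimal sketch for tallying them from pasted log text. The function name and the small lookup table are my own, hardcoded from the generic Linux errno list; codes such as -501 and -519 are not standard errno values, so the sketch reports them as unknown rather than guessing.

```python
import re

# Generic Linux errno numbers, hardcoded so the result does not depend on
# the platform this script happens to run on.
LINUX_ERRNO = {
    6: "ENXIO",
    11: "EAGAIN",      # "Resource temporarily unavailable"
    16: "EBUSY",
    110: "ETIMEDOUT",
}

# Matches "error: -11" and "err -16" style suffixes seen in the OBFL lines.
ERR_RE = re.compile(r"\b(?:error|err):? (-\d+)")


def decode_errors(log_text):
    """Count each negative error code in the text, keyed by errno name."""
    counts = {}
    for match in ERR_RE.finditer(log_text):
        code = -int(match.group(1))
        # Anything outside the table is likely application-specific.
        name = LINUX_ERRNO.get(code, f"unknown({code})")
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Pasting a log excerpt into `decode_errors(...)` gives a quick per-code count, which makes it easier to see whether a bus is mostly busy (EBUSY/EAGAIN) or mostly timing out.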
IOM 2 is showing bad PSU readings. All PSUs show 'online', but the thermal readings, statistics, names, and serial numbers are all blank:
2015-10-21T20:34:20.719265+00:00 CMC NOCSN_dmserver-3-CMC OBFL:1429:scan_power_supplies:Restarting ps polling after 60, seconds.
2015-10-21T20:35:23.139424+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:ps 0 marked bad, sys mask: 1, psu_mask f
2015-10-21T20:36:44.364090+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:ps 1 marked bad, sys mask: 3, psu_mask f
2015-10-21T20:37:51.385312+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:ps 2 marked bad, sys mask: 7, psu_mask f
2015-10-21T20:39:07.027963+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:ps 3 marked bad, sys mask: f, psu_mask f
2015-10-21T20:39:07.028077+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:All PS marked bad, attempting to reset through other IOM
2015-10-21T20:39:23.479998+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-004b: read 128 bytes offset 0 to 4b, err -16
2015-10-21T20:39:47.126256+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-004f: read 128 bytes offset 0 to 4f, err -16
2015-10-21T20:40:00.182760+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-0056: read 128 bytes offset 0 to 56, err -16
2015-10-21T20:40:07.721113+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-005d: read 128 bytes offset 0 to 5d, err -16
2015-10-21T20:40:21.576658+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-004b: read 128 bytes offset 0 to 4b, err -16
2015-10-21T20:40:43.179415+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-004f: read 128 bytes offset 0 to 4f, err -16
2015-10-21T20:40:59.809516+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-0056: read 128 bytes offset 0 to 56, err -16
2015-10-21T20:41:05.899218+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-005d: read 128 bytes offset 0 to 5d, err -16
2015-10-21T20:41:21.434477+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-004b: read 128 bytes offset 0 to 4b, err -16
2015-10-21T20:41:40.027845+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-004f: read 128 bytes offset 0 to 4f, err -16
2015-10-21T20:42:00.060439+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-0056: read 128 bytes offset 0 to 56, err -16
2015-10-21T20:42:06.125467+00:00 CMC NOCSN_kernel-3-CMC OBFL: at24 1-005d: read 128 bytes offset 0 to 5d, err -16
2015-10-21T20:42:06.126910+00:00 CMC NOCSN_dmserver-3-CMC OBFL:1429:scan_power_supplies:Restarting ps polling after 60, seconds.
2015-10-21T20:43:16.547468+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:ps 0 marked bad, sys mask: 1, psu_mask f
2015-10-21T20:44:27.687322+00:00 CMC NOCSN_dmserver-3-CMC OBFL:0:scan_power_supplies:ps 1 marked bad, sys mask: 3, psu_mask f
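The `sys mask` values in the `scan_power_supplies` lines step through 1, 3, 7, f as ps 0 through ps 3 are marked bad, which suggests one bit per PSU. A rough sketch of decoding that, assuming bit n corresponds to PSU n and that `psu_mask` is the set of PSUs being polled (both are my reading of the log, not documented behavior; the function names are made up):

```python
import re

BAD_RE = re.compile(
    r"ps (\d+) marked bad, sys mask: ([0-9a-f]+), psu_mask ([0-9a-f]+)"
)


def bits_set(mask):
    """Return the bit positions that are set in an integer mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def bad_psu_events(log_text):
    """Decode each 'marked bad' line into the PSUs flagged so far."""
    events = []
    for m in BAD_RE.finditer(log_text):
        sys_mask = int(m.group(2), 16)
        psu_mask = int(m.group(3), 16)
        events.append({
            "ps": int(m.group(1)),
            "bad": bits_set(sys_mask),
            # Assumption: every polled PSU is bad once the masks are equal.
            "all_bad": sys_mask == psu_mask,
        })
    return events
```

Under that reading, the masks being equal lines up with the "All PS marked bad, attempting to reset through other IOM" message that follows ps 3.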
And IOM 2 is able to see fan status, whereas IOM 1 is showing Unknown.
IOM2 logs:
interval: 15 # seconds
write_ts: 1445460326 # Wed Oct 21 14:45:26 2015
stale_ts: 1445460348 # Wed Oct 21 14:45:48 2015 STALE
now: 1445460364 # Wed Oct 21 14:46:04 2015
status: 1 # ACTIVE
policy_state: 1 # COOL
xreading: 1 # DEVELOPER_MODE: FALSE
hwconf_valid: 1
maxfans: 8
fan[1].fault/read/req: 0/30/30 # OK
fan[2].fault/read/req: 0/30/30 # OK
fan[3].fault/read/req: 0/30/30 # OK
fan[4].fault/read/req: 0/30/30 # OK
fan[5].fault/read/req: 0/30/30 # OK
fan[6].fault/read/req: 0/30/30 # OK
fan[7].fault/read/req: 0/30/30 # OK
fan[8].fault/read/req: 0/30/30 # OK
I have re-seated both IOMs in the last day but that hasn't cleared up the errors.
10-21-2015 02:01 PM
Ryan,
So the norxack values are fairly low, not high enough for me to be too concerned.
The only thing I take issue with in that I2C log would be some of those timeout values, particularly IOM 2's PSU segment.
Does IOM1's thermal.log file show something similar to the below?
fan[1].fault/read/req: 3/0/30 # UNKNOWN
fan[2].fault/read/req: 3/0/30 # UNKNOWN
fan[3].fault/read/req: 3/0/30 # UNKNOWN
fan[4].fault/read/req: 3/0/30 # UNKNOWN
fan[5].fault/read/req: 3/0/30 # UNKNOWN
fan[6].fault/read/req: 3/0/30 # UNKNOWN
fan[7].fault/read/req: 3/0/30 # UNKNOWN
fan[8].fault/read/req: 3/0/30 # UNKNOWN
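If you want to check a thermal.log excerpt without eyeballing it, here is a small sketch that pulls out the `fan[N].fault/read/req` lines and lists any fan whose trailing label is not OK. The helper names are hypothetical, and the meaning of the three numbers is only inferred from the field name (fault code, value read, value requested).

```python
import re

FAN_RE = re.compile(
    r"fan\[(\d+)\]\.fault/read/req:\s*(\d+)/(\d+)/(\d+)\s*#\s*(\w+)"
)


def parse_fans(text):
    """Map fan index to its fault/read/req numbers and status label."""
    fans = {}
    for m in FAN_RE.finditer(text):
        idx, fault, read, req = (int(g) for g in m.groups()[:4])
        fans[idx] = {"fault": fault, "read": read, "req": req,
                     "label": m.group(5)}
    return fans


def fans_not_ok(text):
    """Return the sorted indices of fans whose label is anything but OK."""
    return sorted(i for i, f in parse_fans(text).items()
                  if f["label"] != "OK")
```

The regex does not depend on line breaks, so it works the same on a log pasted as one long line.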
10-21-2015 02:06 PM
Yep exactly. IOM 1 was reseated about 5 hours ago, maybe that reset the counters on norxack, not sure.
valid: 1
pid: 1335
interval: 15 # seconds
write_ts: 1445461350 # Wed Oct 21 15:02:30 2015
stale_ts: 1445461372 # Wed Oct 21 15:02:52 2015 OK
now: 1445461360 # Wed Oct 21 15:02:40 2015
status: 2 # PASSIVE
policy_state: 1 # COOL
xreading: 1 # DEVELOPER_MODE: FALSE
hwconf_valid: 1
maxfans: 8
fan[1].fault/read/req: 3/0/30 # UNKNOWN
fan[2].fault/read/req: 3/0/30 # UNKNOWN
fan[3].fault/read/req: 3/0/30 # UNKNOWN
fan[4].fault/read/req: 3/0/30 # UNKNOWN
fan[5].fault/read/req: 3/0/30 # UNKNOWN
fan[6].fault/read/req: 3/0/30 # UNKNOWN
fan[7].fault/read/req: 3/0/30 # UNKNOWN
fan[8].fault/read/req: 3/0/30 # UNKNOWN
After re-seating IOM2 last night I saw it come up after 20-25 mins as expected, but HA was in a not-ready state with 'Chassis Configuration Incomplete'. Last time I saw that I simply re-seated a fan module and it came up, so I did that this morning, and instead of it finishing, all fans except #1 and #2 turned off. The servers were heating up, so I tried re-seating IOM1 and luckily that brought the fans up. Right now they are spinning as they should (not 100%).
This is our only chassis so while I do have planned downtime this evening I'm trying to keep it going until then.
10-22-2015 07:49 AM
Appears as though the full chassis power cycle fixed the errors.
10-22-2015 08:33 AM
Ryan,
Good to hear. Presumably that bus is looking slightly cleaner now?
I don't think that was the cause of the issue you were seeing, but it would be difficult to get more details on this one outside of an actual case. It sounds like you have one open, so hopefully you can get some more info from your engineer.
10-22-2015 08:47 AM
Yes, I sent in a full Tech Support Log dump this morning, so hopefully I'll hear back from them. At least now I can see all the stats for PSU/fans. No more EBUSY signals in the i2c logs either. I do have some norxack, but overall much lower.
# I2C Bus Statistics Thu Oct 22 09:34:16 CST 2015
# I2C Bus 1 busn=0 nseg=2
segment 0 local wait_gt_deadline 1
segment 1 extended wait_gt_deadline 2
error_pca9541_per_device:
# I2C Bus 2 busn=1 nseg=5
segment 0 local
segment 1 chassis norxack 192 wait_gt_deadline 3
segment 2 blade
segment 3 fan norxack 3 wait_gt_deadline 247
segment 4 psu unfinished 3 fixup 3 pca9541clrerrprs 2 pca9541postio3 1 wait_gt_deadline 104

# I2C Bus Statistics Thu Oct 22 09:36:01 CST 2015
# I2C Bus 1 busn=0 nseg=2
segment 0 local
segment 1 extended wait_gt_deadline 2
error_pca9541_per_device:
# I2C Bus 2 busn=1 nseg=5
segment 0 local
segment 1 chassis norxack 198 wait_gt_deadline 1
segment 2 blade
segment 3 fan
segment 4 psu wait_gt_deadline 1
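To compare counters like norxack and timeout between two i2c.log snapshots, something along these lines may help. It is only a sketch: it assumes the counters that follow a `segment N name` header belong to that segment, and it drops the `gilroy.*` entries on the guess that they are bus-level rather than per-segment. All function names are my own.

```python
import re

# One "# I2C Bus N" block, up to the next "# I2C" header or end of text.
BUS_RE = re.compile(r"# I2C Bus (\d+)\b(.*?)(?=# I2C |\Z)", re.S)
# A segment header followed by zero or more "name value" pairs.
SEG_RE = re.compile(
    r"segment\s+(\d+)\s+(\w+)((?:\s+(?!segment\b)[\w.]+\s+\d+)*)"
)


def parse_segments(text):
    """Return {(bus, segment_name): {counter: value}} for one snapshot."""
    stats = {}
    for bus in BUS_RE.finditer(text):
        for seg in SEG_RE.finditer(bus.group(2)):
            tokens = seg.group(3).split()
            pairs = zip(tokens[0::2], (int(v) for v in tokens[1::2]))
            # Assumption: gilroy.* counters are not per-segment, so skip them.
            stats[(int(bus.group(1)), seg.group(2))] = {
                k: v for k, v in pairs if not k.startswith("gilroy.")
            }
    return stats


def counter(stats, bus, segment, name):
    """Look up one counter, treating a missing entry as zero."""
    return stats.get((bus, segment), {}).get(name, 0)
```

For example, `counter(parse_segments(before), 2, "psu", "timeout")` against the same call on the post-reboot snapshot shows at a glance whether the PSU segment is still timing out.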
The one thing I'm still curious about, which maybe you can help answer: on the subordinate IOM I still see Unknown for thermal status, though the primary IOM looks fine:
magic: 0x486f7403 # OK
valid: 1
pid: 1637
interval: 15 # seconds
write_ts: 1445528182 # Thu Oct 22 09:36:22 2015
stale_ts: 1445528204 # Thu Oct 22 09:36:44 2015 OK
now: 1445528187 # Thu Oct 22 09:36:27 2015
status: 2 # PASSIVE
policy_state: 1 # COOL
xreading: 1 # DEVELOPER_MODE: FALSE
hwconf_valid: 1
maxfans: 8
fan[1].fault/read/req: 3/0/30 # UNKNOWN
fan[2].fault/read/req: 3/0/30 # UNKNOWN
fan[3].fault/read/req: 3/0/30 # UNKNOWN
fan[4].fault/read/req: 3/0/30 # UNKNOWN
fan[5].fault/read/req: 3/0/30 # UNKNOWN
fan[6].fault/read/req: 3/0/30 # UNKNOWN
fan[7].fault/read/req: 3/0/30 # UNKNOWN
fan[8].fault/read/req: 3/0/30 # UNKNOWN
-Edit- I guess I should actually ask my question: is it normal behavior for the IOMs to show OK on one and Unknown on the other?
10-22-2015 08:59 AM
Ryan,
I believe that it is expected for the subordinate to not pull that fan information (instead we rely on the current primary), thus the 'UNKNOWN' status you see.