SAP HANA C460 SUSE Linux Failed to Boot after unexpected reboot/power outage

gchami · ‎01-09-2013

This Applies to SAP HANA M-Sized Appliance (C460)

May also apply to future appliances that make use of a FusionIO card

After an ungraceful shutdown of the SAP appliance (loss of power or an OS crash), the file system check (fsck) fails on /dev/saplogvg/saplog

SUSE will boot into single-user-mode or repair mode

Subsequent reboots of the system will also end up in this state

/dev/saplogvg/saplog will fail to mount

Once in single-user mode you will also see the following messages on the console screen, or by running tail -100 /var/log/dmesg:

[  418.142765] fioinf Fusion-io ioDrive 320GB 0000:84:00.0: Powercut detected
[  418.822921] fioinf Fusion-io ioDrive 320GB 0000:84:00.0: recovered data log append point 866:126959616.
[  418.897230] fioinf Fusion-io ioDrive 320GB 0000:84:00.0: Creating block device fioa: major: 252 minor: 0 sector size: 512...
[  418.973952]  fioa: unknown partition table
[  419.016524] fioinf Waiting for /dev/fioa to be created
[  420.705203] fioinf Fusion-io ioDrive 320GB 0000:87:00.0: found end of data log.
[  420.743656] fioinf Fusion-io ioDrive 320GB 0000:87:00.0: No powercut detected on data log
[  420.960294] fioinf Fusion-io ioDrive 320GB 0000:87:00.0: recovered data log append point 250:178633728.
[  421.036381] fioinf Fusion-io ioDrive 320GB 0000:87:00.0: Creating block device fiob: major: 252 minor: 16 sector size: 512...
[  421.112808]  fiob: unknown partition table
[  421.158446] fioinf Waiting for /dev/fiob to be created
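
To pick the FusionIO driver messages out of a long kernel log, a small filter along these lines may help. This is a minimal sketch: the function name fio_messages is made up for this example, and it assumes the driver tags its lines with "fioinf" as in the output above.

```shell
#!/bin/sh
# Print the most recent Fusion-io driver messages from a kernel log file.
# Assumptions: driver lines contain the tag "fioinf" (as in the sample
# output above); the log defaults to /var/log/dmesg. The function name
# is hypothetical, not part of any vendor tool.
fio_messages() {
    log="${1:-/var/log/dmesg}"
    # Keep only driver lines, then show at most the last 100 of them.
    grep 'fioinf' "$log" | tail -n 100
}
```

Calling fio_messages with no argument reads /var/log/dmesg. A "Powercut detected" line in the result indicates that the card was not shut down cleanly.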

Cause / Problem Description

This is expected behaviour of the FusionIO cards. If they are not cleanly shut down, the drives may take a long time to come up and can fail to be mounted during boot.

Failure to mount the drives will result in an fsck failure, which will stop the OS from booting.

Udev will wait 180 seconds for the driver to load, then it will exit. In most cases this is plenty of time, even with multiple ioDrives installed.
But if the drives were shut down improperly, loading the driver and attaching the drives can take longer than 180 seconds. In this case udev will exit. The driver will not exit, but will continue working on attaching the drives.

A problem does not always follow when udev exits early. The drivers will eventually load, and you will then be able to use the attached block devices.
However, if the drivers take too long to load, udev exits, and file systems on those devices are set to be mounted in /etc/fstab, then the file system check (fsck) will fail and the system will stop booting.
In most distributions the user is then dropped into single-user mode or repair mode.
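
The wait-with-timeout behaviour described above can be illustrated with a small polling loop. This is only a sketch of the pattern, not how udev is implemented. The function name wait_for_device is hypothetical; the 180-second default mirrors the udev timeout mentioned above.

```shell
#!/bin/sh
# Poll until a path exists or a timeout (in seconds) expires.
# Returns 0 if the path appeared, 1 if the timeout was reached first.
# Illustrative only: this mimics the "wait, then give up" pattern and
# is not udev's own code. The function name is hypothetical.
wait_for_device() {
    dev="$1"
    timeout="${2:-180}"
    waited=0
    # -e tests for existence; use -b instead to require a block device.
    while [ ! -e "$dev" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

From the repair shell, something like wait_for_device /dev/fioa 600 && echo "fioa is present" would show whether the driver eventually attaches the drive when it is given more time than udev allowed.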

Conditions / Environment

SAP HANA M-Sized Appliance (C460)

FusionIO card

SUSE Linux (OS)

Can also apply to any UCS server that has a FusionIO card whose file systems are set to mount during boot (i.e. listed in /etc/fstab)

This condition occurs on the next boot after the system is brought down ungracefully. It was discovered during a power outage, but there may be other scenarios that result in this behaviour.

Resolution

1. After the system boots into repair mode, comment out the /dev/saplogvg/saplog line in /etc/fstab by prefixing it with #

cishanar00:~ # cat /etc/fstab
/dev/rootvg/swapvol  swap            swap       defaults              0 0
/dev/rootvg/rootvol  /               ext3       acl,user_xattr        1 1
/dev/disk/by-id/scsi-3600605b000f829d014d0ded4383e3c0e-part1 /boot                ext3       acl,user_xattr        1 2
/dev/sapdatavg/sapdata /sap/data     ext3       acl,user_xattr        1 2
/dev/sapdatavg/sapmnt /sapmnt        ext3       acl,user_xattr        1 2
/dev/sapdatavg/usr_sap /usr/sap      ext3       acl,user_xattr        1 2
# /dev/saplogvg/saplog /sap/log        ext3       acl,user_xattr        1 2
proc                 /proc           proc       defaults              0 0
sysfs                /sys            sysfs      noauto                0 0
debugfs              /sys/kernel/debug   debugfs    noauto            0 0
usbfs                /proc/bus/usb        usbfs      noauto           0 0
devpts               /dev/pts             devpts     mode=0620,gid=5  0 0
cishanar00:~ #

2. Reboot the system
3. Run fsck on /dev/saplogvg/saplog to make sure the file system is okay
4. At this point the system should have booted normally, but /sap/log will not be mounted. Uncomment the /dev/saplogvg/saplog line in /etc/fstab

cishanar00:~ # cat /etc/fstab
/dev/rootvg/swapvol  swap            swap       defaults              0 0
/dev/rootvg/rootvol  /               ext3       acl,user_xattr        1 1
/dev/disk/by-id/scsi-3600605b000f829d014d0ded4383e3c0e-part1 /boot                ext3       acl,user_xattr        1 2
/dev/sapdatavg/sapdata /sap/data     ext3       acl,user_xattr        1 2
/dev/sapdatavg/sapmnt /sapmnt        ext3       acl,user_xattr        1 2
/dev/sapdatavg/usr_sap /usr/sap      ext3       acl,user_xattr        1 2
/dev/saplogvg/saplog /sap/log        ext3       acl,user_xattr        1 2
proc                 /proc           proc       defaults              0 0
sysfs                /sys            sysfs      noauto                0 0
debugfs              /sys/kernel/debug   debugfs    noauto            0 0
usbfs                /proc/bus/usb        usbfs      noauto           0 0
devpts               /dev/pts             devpts     mode=0620,gid=5  0 0
cishanar00:~ #

5. Reboot the system. Since this will be a graceful reboot from the regular SUSE Linux mode, the driver will go through its proper shutdown procedure and the drives should come up without any issues
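
The fstab edits in steps 1 and 4 can also be made with sed instead of an editor. The following is a minimal sketch under stated assumptions: the helper names disable_fstab_entry and enable_fstab_entry are hypothetical, the entry is assumed to start at the beginning of the line with the device path, and the device path is assumed to contain no regular-expression special characters.

```shell
#!/bin/sh
# Comment out or restore the fstab entry for one device.
# Both helpers take the device path and, optionally, the fstab file
# (default /etc/fstab), and leave a backup copy with a .bak suffix.
# The function names are hypothetical, not part of any SUSE tool.

disable_fstab_entry() {
    dev="$1"
    fstab="${2:-/etc/fstab}"
    # Prefix the line that starts with the device path with "# ".
    sed -i.bak "s|^${dev}[[:space:]]|# &|" "$fstab"
}

enable_fstab_entry() {
    dev="$1"
    fstab="${2:-/etc/fstab}"
    # Strip a leading "#" and any following spaces from that same line.
    sed -i.bak "s|^#[[:space:]]*\(${dev}[[:space:]]\)|\1|" "$fstab"
}
```

For step 1 this would be disable_fstab_entry /dev/saplogvg/saplog, and for step 4 enable_fstab_entry /dev/saplogvg/saplog. If the root file system is mounted read-only in repair mode, it may need to be remounted read-write first (for example with mount -o remount,rw /).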

Important Notes

1. Please try to avoid rebuilding the volume group, as this will wipe the /sap/log data and then require a full restore of the HANA DB.

2. With the above method, the data should still be present on the disk and should be recoverable (the issue is only that the FusionIO disks are not coming up quickly enough). Should the data or file system be corrupt, you could rebuild the volume group and, once that is complete, use HANA Studio to perform a restore.

juburnet · ‎05-22-2013

I ran into the same issue on UCS C460 with the Fusion IO card and this procedure worked perfectly.

Thanks!