Recover Ubuntu box from hack, part 11: Ensuring sane behaviour when a drive dies

So, since it is capable of it, I would prefer that mdadm automatically email me when my RAID looks like it might fail. I found the command to test whether it's working:

mdadm --monitor -1 -m lynden /dev/md0 -t
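For anyone who, like me, finds the short flags cryptic, the same command spelled out with long options should be (going by the man page):

mdadm --monitor --oneshot --mail lynden --test /dev/md0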

The --monitor flag puts it into monitor mode, -1 tells it to run just once, -m specifies the email address (here a user on the local system), then the md array to check is given, and -t tells it to generate a test message. A few minutes after running that, my CLI told me I had mail. I checked it with:

cat /var/mail/lynden

This gave me the raw email, headers and all. I decided I needed a CLI-based email client, did some searching, and found one that sounded good:

sudo aptitude install mutt
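To read the local mailbox with it, something like this should do (assuming the default spool location used above):

mutt -f /var/mail/lynden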

Works excellently. Then I set the notification address in /etc/mdadm/mdadm.conf to my username on the server, and ran the test again without specifying a recipient on the command line:

mdadm --monitor -1 /dev/md0 -t
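The mdadm.conf side of that is just a MAILADDR entry. Mine looks something like this, with my local user as the recipient:

MAILADDR lynden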

After playing around for a couple of minutes, bash told me I had mail in /var/mail/lynden. I opened mutt and sure enough, it was there. Everything should be swell now. Time to turn off the computer, pull out a drive, and see how it reacts.

Damn. It didn't boot. It sat at the boot screen saying the md array could not be started: keep waiting, S to skip, or M for manual recovery. OK, back to do some more searching. I skipped both the mdadm failure and the resulting mount failure, and from the CLI tried to mount the array. It came back saying it had assembled the array with 1 drive.

After some investigation, it turned out the drive I removed was the middle one of the array, whereas I thought it was the third disk. The partitions making up the array aren't uniform (they are sda2, sdb2 and sdc1), so pulling out /dev/sdb resulted in sdc being renamed to /dev/sdb. Since the mdadm.conf file specified exactly which partitions to use, it was looking for sdb2 and sdc1, neither of which existed at that point. So I removed that line and uncommented the original line "DEVICE partitions" to let mdadm examine all partitions itself. Then I tried to assemble the array again, and this time it assembled correctly with 2 of 3 devices. So now it will assemble. And hey, mail in my inbox! Sure enough, it was mdadm reporting a genuinely degraded array.
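For reference, the mdadm.conf change was roughly the following; the commented-out line is just an illustration of the sort of explicit device list I had, and the exact partitions will differ:

#DEVICE /dev/sda2 /dev/sdb2 /dev/sdc1
DEVICE partitions

Re-assembling from the CLI was then something along the lines of:

sudo mdadm --assemble --scan

which reads the config file and assembles whatever arrays it can find from the available partitions.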

So I rebooted again, but it still failed. It seems mdadm just won't automatically boot with a degraded array. And that is precisely the problem: after further research, this turns out to be intended behaviour, but behaviour that can be changed properly. There is a line in /etc/initramfs-tools/conf.d/mdadm which says "BOOT_DEGRADED=false" that I just needed to change to "true" (summarised at the end of this post). I did this, rebooted, and everything worked perfectly.

Now it was time to try again with all 3 drives plugged back in. Once again it didn't go as expected: it wouldn't boot because an error occurred mounting the share (mount this time, not mdadm). I skipped the mount, mounted it manually from the CLI, and it worked without error. Tried rebooting again to see: it worked fine. Interesting, I wonder why that is? It must be something about the automatic mount remembering the exact configuration of the drive it mounted from last time, so mounting it manually (still only from fstab, though) allowed it to auto-mount from then on.
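For anyone after the short version of the degraded-boot fix, it boils down to flipping this one setting in /etc/initramfs-tools/conf.d/mdadm:

BOOT_DEGRADED=true

and then, as far as I understand the initramfs tooling, regenerating the initramfs so the new value actually gets baked in:

sudo update-initramfs -u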
