Posts Tagged ‘ mdadm ’

Recover ubuntu box from hack, part 11: Ensuring sane behaviour when a drive dies

So I would prefer, since it is capable of it, that mdadm automatically email me when my RAID looks like it might fail. I found the command to test if it’s working:

mdadm --monitor -1 -m lynden /dev/md0 -t

The --monitor flag puts it into monitor mode, -1 tells it to run just once, -m specifies the email address (here a user on the local system), then comes the md array to check, and -t tells it to send a TestEvent. A few minutes after I did that, my CLI told me I had mail. I checked it with:

cat /var/mail/lynden

This gave me raw email format. I decided I needed to find a CLI based email client. Did some searching and found one that sounded good:

sudo aptitude install mutt

Works excellently. Then I set the MAILADDR address in /etc/mdadm/mdadm.conf to my username at my server. Then I ran the test without specifying a recipient:

mdadm --monitor -1 /dev/md0 -t

After playing around for a couple of minutes, bash told me I had mail in /var/mail/lynden. Opened mutt and sure enough it’s there. Everything should be swell now. Now to turn off my computer, pull out a drive, and see how it reacts.
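The mail address edit in mdadm.conf can also be scripted. This is only a minimal sketch: set_mailaddr is a made-up helper name, the lynden@localhost address in the example is an assumption standing in for "my username at my server", and the in-place edit assumes GNU sed.

```shell
#!/usr/bin/env bash
# Sketch: point mdadm's alert mail at a given recipient.
# set_mailaddr is a hypothetical helper, not part of mdadm.
set_mailaddr() {
    local conf="$1" addr="$2"
    if grep -q '^MAILADDR[[:space:]]' "$conf"; then
        # Replace the existing MAILADDR line in place (GNU sed syntax).
        # Assumes the address contains no "|" or "&" characters.
        sed -i "s|^MAILADDR[[:space:]].*|MAILADDR $addr|" "$conf"
    else
        # No MAILADDR line yet, so append one.
        printf 'MAILADDR %s\n' "$addr" >> "$conf"
    fi
}

# Example (needs a root shell for the real file; the address is an assumption):
#   set_mailaddr /etc/mdadm/mdadm.conf lynden@localhost
```

Trying it on a copy of the file first is the safer route, since a broken mdadm.conf is exactly the kind of thing that bites at the next boot.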

Damn. It didn’t boot. It sat at the boot screen saying the md array could not be started: keep waiting, S to skip, or M for manual recovery. OK, back to do some more searching. I skipped both the mdadm failure and the resultant mount failure. From the CLI I tried to assemble the array. It came back saying it assembled the array with 1 drive.

After some investigation, it seems the drive I removed was the middle one of the array, whereas I thought it was the 3rd disk. The partitions used on the drives aren’t uniform (sda2, sdb2 and sdc1), so pulling out /dev/sdb resulted in sdc being renamed to /dev/sdb. Since I specify exactly which partitions to use in the mdadm.conf file, mdadm was looking for sdb2 and sdc1, neither of which existed at this point. So I removed that line and uncommented the original line “DEVICE partitions” to allow mdadm to examine all partitions itself. Then I tried to assemble the array again, and it assembled correctly with 2 of 3 devices. And hey, mail in my inbox! Indeed, it was mdadm reporting an actual degraded array.
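For checking by hand whether an array is running degraded, /proc/mdstat shows a status like [U_U], where an underscore marks a missing member. Here is a rough sketch that pulls out the names of degraded arrays; degraded_arrays is a hypothetical helper, and it assumes the usual mdstat layout where the status line follows the line naming the array.

```shell
#!/usr/bin/env bash
# Sketch: list md arrays that /proc/mdstat reports as degraded.
# degraded_arrays is a hypothetical helper, not an mdadm command.
degraded_arrays() {
    local mdstat="${1:-/proc/mdstat}"
    awk '
        # Remember the array name from lines like "md0 : active raid5 ..."
        /^md[^ ]* *:/ { name = $1 }
        # A status such as [U_U] contains "_" for every missing member.
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    ' "$mdstat"
}

# Example: degraded_arrays            # reads /proc/mdstat
#          degraded_arrays some_file  # reads a saved copy instead
```

This only reads the file, so it is safe to run at any time, but it is no replacement for the mail alerts mdadm sends itself.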

So I rebooted again, but it still failed. It seems mdadm just won’t automatically boot with a degraded array. After further research, this turns out to be intended behaviour, but behaviour which can be changed: there is a line in the file /etc/initramfs-tools/conf.d/mdadm which says “BOOT_DEGRADED=false”, which I just needed to change to “true”. I did this and rebooted, and everything worked perfectly fine.

Now it was time to try it again with all 3 drives plugged back in. Once again it didn’t go as expected: it wouldn’t boot because an error occurred mounting the share (mount, not mdadm). I skipped the mount and manually mounted it via the CLI, which worked without error. I tried rebooting again and it worked fine. Interesting. I wonder why that is? My guess is that auto mount remembers the exact configuration of the drive it mounted last time, so mounting it manually (still only from fstab, though) allowed it to auto mount again from then on.
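The BOOT_DEGRADED change is a one-line edit, so it is easy to sketch as a script. enable_boot_degraded is a hypothetical helper name and the edit assumes GNU sed; the note about regenerating the initramfs is my assumption and not something the steps above included.

```shell
#!/usr/bin/env bash
# Sketch: allow booting from a degraded array by setting BOOT_DEGRADED=true.
# enable_boot_degraded is a hypothetical helper, not part of mdadm.
enable_boot_degraded() {
    local conf="${1:-/etc/initramfs-tools/conf.d/mdadm}"
    # Flip whatever value BOOT_DEGRADED currently has to "true" (GNU sed).
    sed -i 's/^BOOT_DEGRADED=.*/BOOT_DEGRADED=true/' "$conf"
    # Print the resulting line so the change can be checked by eye.
    grep '^BOOT_DEGRADED=' "$conf"
}

# Example (from a root shell): enable_boot_degraded
# Assumption: the initramfs may need regenerating afterwards, e.g. with
#   update-initramfs -u
```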

Recover ubuntu box from hack, part 8: Reconfigure RAID5 array with mdadm

So this time I’m going to retell my experience of the simple task of reconfiguring the RAID 5 array. This I already did from the live CD, when I was making sure I had all of my things still, so this was going to be fairly straightforward. All I had to do was install mdadm and set up the configuration file. So the first step:

sudo aptitude install mdadm

Excellent. Now I find the configuration already exists in the /etc/mdadm/mdadm.conf file, as mdadm found the RAID partitions already. However the RAID wouldn’t start:

sudo mdadm --assemble /dev/md0
mdadm: /dev/sdc1 has no superblock - assembly aborted

“Hmm”, I thought. “Maybe I’m doing something wrong. OK – I’ll just get the old /etc/mdadm/mdadm.conf file from the old installation. Where did I put that backup?” *sigh* The backup was a disk image stored on the RAID array. Cool. So no choice but to work out how to use mdadm properly.

Then I tried

sudo mdadm --assemble /dev/md0 --verbose /dev/sda2 /dev/sdb2 /dev/sdc1

But mdadm reported that it had started the array with only 2 of the 3 devices. Why the hell is it doing that? I wonder.

OK. So I did some research along with a mate from work. Turns out there’s actually a bug in Linux that causes the kernel to do something stupid, preventing mdadm from assembling RAIDs. I found the workaround on that page, which is to attach the offending partition to a loop device and then use that in the array instead:

sudo losetup /dev/loop0 /dev/sdc1
sudo mdadm -A /dev/md0 /dev/sda2 /dev/sdb2 /dev/loop0

This worked fine. But this was hardly a good solution. More research to be done.

Eventually I found that /dev/sdc1, which was one of the RAID partitions, was being claimed by an array called /dev/md_d127, which kept it from being assembled into another array. This I confirmed by checking /proc/mdstat. Simply running

sudo mdadm -S /dev/md_d127

to stop the unknown array allowed me to then run the correct RAID assemble command. So now I had a working RAID again. Rebooted and checked: sure enough the unknown array was running again, and mine wouldn’t assemble. So I stopped the unknown one again and assembled mine. I created a directory to mount on and then mounted the RAID to it:

sudo mkdir /media/share
sudo mount /dev/md0 /media/share
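Looking back at the /proc/mdstat check that exposed /dev/md_d127: finding which array has grabbed a given partition can be sketched like this. array_holding is a hypothetical helper, and it assumes members appear in mdstat as name[slot], optionally followed by flags such as (S).

```shell
#!/usr/bin/env bash
# Sketch: print the md array(s) that list a given partition as a member.
# array_holding is a hypothetical helper, not an mdadm command.
array_holding() {
    local part="${1#/dev/}" mdstat="${2:-/proc/mdstat}"
    awk -v part="$part" '
        /^md[^ ]* *:/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*/, "", member)   # drop the "[2]" slot number and flags such as "(S)"
                if (member == part) print $1
            }
        }
    ' "$mdstat"
}

# Example: array_holding /dev/sdc1
```

If it prints an array you did not expect, that is the one to stop with mdadm -S before assembling your own.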

Then I go and mount the image from the share to another new directory:

sudo mkdir /media/oldmachine
sudo mount -o loop /media/share/public/image /media/oldmachine

I get the mdadm.conf file from that and overwrite my own. Restart the machine and nothing at all happens differently. *le sigh*.

After another couple of hours of playing around, another friend comes online. I ask him if he’s had experience with mdadm before. He has, and he has his own RAID array going currently. Excellent. I explain the situation and he asks me what type the RAID partitions are:

sudo fdisk -l /dev/sda

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0000dda4

 Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        6079    48827392   83  Linux
/dev/sda2            6080      243201  1904682465   fd  Linux RAID autodetect

He goes on to tell me that partitions of type ‘fd’ will be automatically, and often incorrectly, assembled by the kernel, so I need to change them to type ‘83’, which is for Linux file systems like ext2, ext3 and ext4. That way the kernel won’t grab them first, and mdadm gets a chance to assemble the array itself.

So I use fdisk to change the partition types, so with each disk I do:

sudo fdisk /dev/sda
... 
Command (m for help): t
Partition number (1-4): 2
Hex code (type L to list codes): 83
Command (m for help): w

and so on with the other disks (using partition 1 on sdc). ‘t’ says I want to change a partition’s type, I choose partition 2 as it’s /dev/sda2 I need to change, 83 is the type for ‘Linux’, and w writes the changes. fdisk then warns that it couldn’t re-read the partition table because the disk is in use at the moment, so the change only takes effect after a reboot. So once I’ve done that to all the drives I reboot.
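To see at a glance which partitions are still marked ‘fd’, the fdisk listing can be filtered. A small sketch, assuming the older column layout shown in the listing earlier (Device, Boot, Start, End, Blocks, Id, System); newer fdisk versions print different columns, so this would need adjusting there. raid_autodetect_partitions is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: read "fdisk -l" output on stdin and print partitions of type fd.
# raid_autodetect_partitions is a hypothetical helper, not part of fdisk.
raid_autodetect_partitions() {
    awk '
        $1 ~ /^\/dev\// {
            # A "*" in the Boot column shifts the Id column one field right.
            id = ($2 == "*") ? $6 : $5
            if (id == "fd") print $1
        }
    '
}

# Example: sudo fdisk -l /dev/sda | raid_autodetect_partitions
```

After the type change and a reboot, the same pipeline should print nothing for that disk.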

I check /proc/mdstat to see if it’s solved the problem. Nope. After 15 minutes of wondering what to do now, I reread the chat with my friend: after changing the partitions I need to reconfigure mdadm:

sudo dpkg-reconfigure mdadm

Then I reboot and it’s running correctly. Then I just edit my fstab file to include the RAID:

echo '/dev/md0    /media/share    ext4            ' | sudo tee -a /etc/fstab
sudo mount -a

Check the mount and it’s working. Rebooted and checked again, still fine. Awesome. A whole day on this stupid mess, which took me all of 30 minutes from a live CD, simply because it just worked that time.
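One thing worth guarding against when appending to fstab is adding the same line twice. Here is a hedged sketch of an append that checks first; add_fstab_entry is a hypothetical helper, and the trailing "defaults 0 2" fields (default mount options, no dump, fsck after the root file system) are my assumption, not something taken from the entry above.

```shell
#!/usr/bin/env bash
# Sketch: append an fstab entry only if the mount point is not listed yet.
# add_fstab_entry is a hypothetical helper, not a standard tool.
add_fstab_entry() {
    local fstab="$1" dev="$2" dir="$3" fstype="$4"
    # Skip if some non-comment line already uses this mount point.
    if awk -v dir="$dir" '$1 !~ /^#/ && $2 == dir { found = 1 } END { exit !found }' "$fstab"; then
        echo "already present: $dir"
        return 0
    fi
    # "defaults 0 2" is an assumption: default options, no dump, fsck after root.
    printf '%s\t%s\t%s\tdefaults\t0\t2\n' "$dev" "$dir" "$fstype" >> "$fstab"
    echo "added: $dir"
}

# Example (from a root shell):
#   add_fstab_entry /etc/fstab /dev/md0 /media/share ext4
```

Running it against a copy of /etc/fstab first shows exactly what would be added before touching the real file.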

 

This post just goes to show the trade-off of solving the problem yourself versus asking someone who knows: sure, I learned a fair bit in the several hours I took researching the issue, but the most relevant stuff I learned was during the 15 minutes it took to solve it when talking to someone who knows.

 
