Weblog entry #6 for dkg
There are a few possibilities i can think of, but judging from the experimenting i've done, they can give inconsistent results:
[use mdadm to soft-fail the device] Tell mdadm that a device has failed and watch how it copes with device removal, notification, and rebuild. For example:
mdadm /dev/md0 --fail /dev/sdg
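A fuller version of that test would also remove the failed component and add it back, so the rebuild gets exercised too. The sketch below is one way to script the cycle; the `run` wrapper, the `soft_fail_cycle` name, and the `DRY_RUN` switch are made up for illustration, and by default it only prints the commands instead of touching an array.

```shell
# Hypothetical helper, not part of mdadm: prints each command unless
# DRY_RUN=0, in which case the commands really run (needs root and a
# real md array).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# soft_fail_cycle ARRAY COMPONENT
# Marks COMPONENT as failed, removes it, shows the array state,
# adds the component back, and shows the rebuild status.
soft_fail_cycle() {
    array="$1"
    component="$2"
    run mdadm "$array" --fail "$component"
    run mdadm "$array" --remove "$component"
    run mdadm --detail "$array"
    run mdadm "$array" --add "$component"
    run cat /proc/mdstat
}

soft_fail_cycle /dev/md0 /dev/sdg
```

If the array has a hot spare, the rebuild should start onto the spare as soon as the component is marked failed, and the re-added device would then come back as the new spare.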
This is a generally repeatable procedure, and behaves the same on most machines and raid setups that i've tried. However, it doesn't help me feel confident that my systems will properly detect actual hardware failures. If hardware trouble isn't detected promptly and reported back to the RAID layer as a failed component device, even the best post-detection behavior (which is what this step tests) is worthless.
[yanking disks] At the other end of the spectrum is physically yanking a disk out of its cage. Having a hotswap chassis makes this kind of test easier to do (though i'm not convinced it's entirely safe for the hardware to be subjected to this kind of abuse). However, i've seen inconsistent results for this, even with identical kernels.
For example, an mdadm RAID5 setup with a 3ware SATA controller in JBOD mode (kernel module 3w_9xxx) running kernel 2.6.16 doesn't detect the device removal until something tries to access the md device itself. Then a report is generated, and the array starts to rebuild into the RAID5's hotspare.
This delayed detection isn't great, but it's a ton better than the behavior i see from another installation: an mdadm RAID1 setup with an Intel SATA controller (kernel module ahci) with the same kernel (2.6.16) reacts terribly to device removal. The whole system grinds nearly to a halt, many /dev/sdX error messages show up on the console, and the RAID subsystem doesn't even seem to notice for a long time (at least 5 minutes, i think, though i haven't done proper timing to verify this). Meanwhile, the machine becomes nearly unresponsive on the network, even for services which are not accessing the removed device. Interestingly, plugging the disk back in while the system is still in this bad state lets it pick up where it left off, and it doesn't even appear to be a RAID failure.
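To put a number on how long detection takes, one option is to poll /proc/mdstat and note when an array first shows a missing member. This is a rough sketch, assuming the usual mdstat layout where a status map such as [U_] marks a missing component with an underscore; the function names `degraded_arrays` and `wait_for_degraded` are invented for this example.

```shell
# Print the names of md arrays whose status map (e.g. [U_]) contains
# an underscore, which mdstat uses for a missing or failed member.
# The file argument exists so the parser can be tried on saved output.
degraded_arrays() {
    mdstat_file="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    ' "$mdstat_file"
}

# wait_for_degraded [TIMEOUT_SECONDS] [MDSTAT_FILE]
# Polls once a second and reports how long it took for any array to
# show up as degraded, or gives up after TIMEOUT_SECONDS.
wait_for_degraded() {
    timeout="${1:-600}"
    mdstat_file="${2:-/proc/mdstat}"
    start=$(date +%s)
    while :; do
        found=$(degraded_arrays "$mdstat_file")
        now=$(date +%s)
        if [ -n "$found" ]; then
            echo "degraded after $((now - start))s: $found"
            return 0
        fi
        if [ $((now - start)) -ge "$timeout" ]; then
            echo "no degraded array after ${timeout}s"
            return 1
        fi
        sleep 1
    done
}
```

Starting `wait_for_degraded` just before pulling the disk gives an approximate figure. On a machine that grinds nearly to a halt, the polling loop itself may stall, so the result is only a rough guide.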
[using hdparm] A middle ground between these two testing tactics might be to use hdparm or some similar tool to disable the disk itself and see what happens to the kernel, the RAID subsystem, and the rest of the machine. I don't know hdparm well enough yet to know what parameters to use to take a device down like that, but as i figure it out, i'll post it to the comments here, or update this weblog entry.
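A different middle ground, which is not hdparm and is swapped in here only as a possibility, is asking the kernel to drop the disk through sysfs by writing to the device's `delete` attribute. The sketch below assumes the disk is exposed through the SCSI layer and that the running kernel provides `/sys/block/<disk>/device/delete`; the `drop_disk` name and `DRY_RUN` switch are made up for illustration, and how the RAID layer reacts will likely depend on the kernel and driver.

```shell
# Hypothetical helper: by default it only reports what it would do.
# Set DRY_RUN=0 to really detach the disk (needs root).
DRY_RUN="${DRY_RUN:-1}"

# drop_disk DISK   (DISK is a bare name such as sdg, not /dev/sdg)
drop_disk() {
    disk="$1"
    attr="/sys/block/$disk/device/delete"   # assumed to exist on this kernel
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write 1 to $attr"
        return 0
    fi
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr" >&2
        return 1
    fi
    echo 1 > "$attr"
}

drop_disk sdg
```

Getting the disk back afterwards would typically mean rescanning the SCSI host it hangs off, or a reboot, so this is best tried on a machine that can tolerate either.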
Comments on this Entry
At the other end of the spectrum is physically yanking a disk out of its cage. Having a hotswap chassis makes this kind of test easier to do (though i'm not convinced it's entirely safe for the hardware to be subjected to this kind of abuse). However, i've seen inconsistent results for this, even with identical kernels.
Remind me to write up sometime how we spent months dealing with intermittent faults with one server, running SCO, ultimately discovering that a test "yank" had been carried out when the machine was initially set up.
Course it wasn't hot-swappable ....
Multiple parts were replaced: motherboard, RAID controller, drives, etc. Ugh. The machine was eventually written off.
However, there is no more realistic way to test catastrophic hardware failure. If the devices are hotswappable and don't survive such experiments, they will almost certainly fail equally badly in use eventually, given enough installations (of which yours is one).
Indeed, the failure I was MOST expecting it to have to cope with was some prat in the computer room knocking one of the (supposedly redundant) cables out by accident.
At least most modern hardware costs a lot less than the HP kit I was yanking cables from, so you won't end up paying back a mortgage-sized repayment if the hardware dies and they hold you responsible.
Alternatively, you can wait for your systems to fail, and see how good the vendor's lawyers are.
For what it is worth, that HP Enterprise kit did exactly what the manual said it would do in all cases tested.