[COLUG] Linux RAID problem

Jonadab the Unsightly One jonadab at bright.net
Thu Jun 24 22:40:27 EDT 2004


Dane Miller wrote:

> Recently, whenever I do a full backup of /home on the server, about
> fifteen minutes into the backup the server completely locks up. No ping
> response, no console access, unable to wake monitor from power saving
> mode.  So I do a hard power-off. 

I'm guessing and could well be wrong, but this *sounds* to me like
it's hitting a point in the filesystem where reading a particular
section of the disk takes forever, possibly due to bizarre filesystem
corruption or possibly due to a hardware issue (e.g., bad medium).
In other words, you might be hitting an infinite delay in I/O when
the system is I/O-bound.  Possibly.
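
One way to look for such a slow spot is to read the device or a
large file sequentially and time each read.  The following is only a
rough sketch in Python, not something from the original setup: the
function name find_slow_reads, the 1 MiB chunk size, the one-second
threshold, and the progress callback are all assumptions made for
illustration.  Timings from the page cache will look faster than real
disk reads, so treat the numbers as a hint, not a diagnosis.

```python
import time


def find_slow_reads(path, chunk_size=1024 * 1024, threshold=1.0, progress=None):
    """Read `path` sequentially and return [(offset, seconds), ...] for
    every chunk whose read took at least `threshold` seconds.

    `progress`, if given, is called with the current offset *before* each
    read, so the last value reported shows where a hang occurred.
    (All names and defaults here are hypothetical.)
    """
    slow = []
    offset = 0
    # buffering=0: each read() call goes straight to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            if progress is not None:
                progress(offset)
            start = time.monotonic()
            chunk = f.read(chunk_size)
            elapsed = time.monotonic() - start
            if not chunk:  # end of file or device
                break
            if elapsed >= threshold:
                slow.append((offset, elapsed))
            offset += len(chunk)
    return slow
```

If the whole machine locks up, the function never returns, so the
progress callback is the useful part: have it write the offset to a
remote terminal or another machine, and the last offset seen is
roughly where the read stalled.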

> /home lives on a software RAID 1 array of 3 80gb Western Digital ATA100
> disks (2 active disks + 1 spare).  When I reboot the box the RAID 1
> mirror has to be rebuilt.  The specific error message in dmesg is "md:
> md0: raid array is not clean -- starting background reconstruction" 

Since this is RAID 1, can you test with the RAID out of the loop using
just one of the drives?  (Mount it read-only for this, so you don't
get them out of sync.)

> I am performing the backup with ssh, tar, and gzip over the network.

Can you test with the network out of the loop, e.g. by putting the
backup drive into the same system and mounting it locally?
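
As a rough illustration of a network-free test, the sketch below
writes a gzip-compressed tar archive to a locally mounted path using
Python's standard tarfile module.  This is a stand-in for the real
tar and gzip pipeline, not the same code path, and the function name
backup_locally and any paths passed to it are hypothetical.  It still
forces the same reads from the source filesystem, which is the part
under suspicion.

```python
import os
import tarfile


def backup_locally(source_dir, archive_path):
    """Archive `source_dir` into a .tar.gz at `archive_path` with no
    network involved, then reopen the archive and return its member names.
    (Hypothetical helper for illustration only.)
    """
    # Store entries under the directory's own name, e.g. "home/...".
    top = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top)
    # Reading the archive back confirms it was written completely.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

If a local run like this completes where the ssh backup locks up, the
network path becomes a suspect; if it locks up too, the network is
probably not the cause.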

> This problem started earlier this month and after repeated crashes,
> caused hardware failure in the two active 80gb disks.

Are you sure that this problem *caused* the hardware failure in the
disks?  Could it have been caused *by* a hardware failure related
to the disks?  Could the drive controller be flaky?

> Since then, I've
> replaced the power supply and the two active disks.

You replaced the disks, and are still experiencing the problem?
Can you test with a different drive controller card?

> Any thoughts?  Hardware? Software?  

Whatever it is, it's low-level.  If it's software, it's probably
in kernel space (e.g., a driver).  You could recompile tar and
the other software involved just for grins, but I wouldn't hold
my breath waiting for that to fix the problem.

