Software RAID Failure?

Carlosinfl · Jul 13, 2007

Guys - I am not sure what happened here, but I think I have a problem with software RAID on my home machine. The machine works fine, but during boot I see something scroll by in red, too fast to read. So I decided to check some md stats on the box to see if something happened to a disk or to a RAID array I set up. I can't understand what I am looking at, so perhaps you guys can help.

To make things as clear as possible: I have 2 identical drives in the machine, both on SATA. They are Western Digital 160GB disks, and I think they are both good, but I am not sure.

Here is what I see:

Code:

tunafish:/home/cwilliams/Desktop# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Fri Jun 22 21:20:43 2007
     Raid Level : raid1
     Array Size : 19534976 (18.63 GiB 20.00 GB)
  Used Dev Size : 19534976 (18.63 GiB 20.00 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Jul 13 15:03:20 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 93f4ddb3:70d5783e:f47a10c4:9fe19ef3
         Events : 0.6582

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       0        0        1      removed

As you can see, there is a slot below that should show /dev/sdb3, but instead it shows "removed"...

Then there is my 2nd RAID

Code:

tunafish:/home/cwilliams/Desktop# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Fri Jun 22 21:20:49 2007
     Raid Level : raid1
     Array Size : 135275264 (129.01 GiB 138.52 GB)
  Used Dev Size : 135275264 (129.01 GiB 138.52 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Fri Jul 13 15:08:02 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : c627213e:cbaed46d:6510c67a:3cf96311
         Events : 0.4564

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       0        0        1      removed

Is one of my disks bad? What should I do? Should I place an identical spare in place of /dev/sdb and see if it starts to rebuild?

Both drives feel warm to the touch, so they are both getting power, and both are visible in the BIOS, so I know it sees the drives. Perhaps one of them has failed sectors, I don't know...

Code:

tunafish:/home/cwilliams/Desktop# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda4[0]
      135275264 blocks [2/1] [U_]

md0 : active raid1 sda3[0]
      19534976 blocks [2/1] [U_]

unused devices: <none>

Suggestions or comments from data above?

cleric_retribution · Jul 13, 2007

I'm no mdadm expert, but you might try re-adding sdb into both arrays and see what mdadm tells you.

I've had drives seemingly "randomly" drop out of my softRAID5 array, and that's all I've ever done.

Also, have you configured your MAILADDR in /etc/mdadm/mdadm.conf (the location of this file may vary with your Linux distro)? This is the address mdadm sends RAID events to, such as disk failures and status messages. I recently figured this out and it helped me when debugging some weird problems with a shitty fileserver built out of ghetto parts that I use for testing stuff.
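
A quick way to see whether MAILADDR is actually set is something like the sketch below. The function name `mdadm_mailaddr` is made up for illustration, and the default path is just the Debian-style location mentioned above, so adjust it for your distro.

```shell
# Print the MAILADDR value from an mdadm config file, if one is set.
# Hypothetical helper; pass a different config path as $1 if your distro
# keeps the file elsewhere.
mdadm_mailaddr() {
    conf="${1:-/etc/mdadm/mdadm.conf}"
    # Commented-out lines start with "#", so their first field never
    # equals MAILADDR and they are skipped.
    awk 'toupper($1) == "MAILADDR" { $1 = ""; sub(/^ +/, ""); print; exit }' "$conf"
}
```

If that prints nothing, no address is configured. Once one is set, `mdadm --monitor --scan --oneshot --test` should send a test alert for each array, which is a reasonable way to confirm that mail delivery works.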

Carlosinfl · Jul 13, 2007

What do you mean "re-add" /dev/sdb? How exactly do I re-add it?

I checked /var/log/messages and found the following...

*****

Jul 11 21:22:42 tunafish kernel: md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
Jul 11 21:22:42 tunafish kernel: md: bitmap version 4.39
Jul 11 21:22:42 tunafish kernel: md: raid1 personality registered for level 1
Jul 11 21:22:42 tunafish kernel: md: md0 stopped.
Jul 11 21:22:42 tunafish kernel: md: bind<sda3>
Jul 11 21:22:42 tunafish kernel: md: bind<sdb3>
Jul 11 21:22:42 tunafish kernel: md: kicking non-fresh sda3 from array!
Jul 11 21:22:42 tunafish kernel: md: unbind<sda3>
Jul 11 21:22:42 tunafish kernel: md: export_rdev(sda3)
Jul 11 21:22:42 tunafish kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Jul 11 21:22:42 tunafish kernel: md: md1 stopped.
Jul 11 21:22:42 tunafish kernel: md: bind<sda4>
Jul 11 21:22:42 tunafish kernel: md: bind<sdb4>
Jul 11 21:22:42 tunafish kernel: md: kicking non-fresh sda4 from array!
Jul 11 21:22:42 tunafish kernel: md: unbind<sda4>
Jul 11 21:22:42 tunafish kernel: md: export_rdev(sda4)
Jul 11 21:22:42 tunafish kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Jul 11 21:22:42 tunafish kernel: Attempting manual resume
Jul 11 21:22:42 tunafish kernel: EXT3-fs: INFO: recovery required on readonly filesystem.
Jul 11 21:22:42 tunafish kernel: EXT3-fs: write access will be enabled during recovery.
Jul 11 21:22:42 tunafish kernel: kjournald starting. Commit interval 5 seconds
Jul 11 21:22:42 tunafish kernel: EXT3-fs: recovery complete.
Jul 11 21:22:42 tunafish kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 11 21:22:42 tunafish kernel: ts: Compaq touchscreen protocol output
Jul 11 21:22:42 tunafish kernel: input: PC Speaker as /class/input/input3
Jul 11 21:22:42 tunafish kernel: Real Time Clock Driver v1.12ac
Jul 11 21:22:42 tunafish kernel: i2c_adapter i2c-0: nForce2 SMBus adapter at 0x1c00
Jul 11 21:22:42 tunafish kernel: i2c_adapter i2c-1: nForce2 SMBus adapter at 0x1c80
Jul 11 21:22:42 tunafish kernel: ACPI: PCI Interrupt Link [AAZA] enabled at IRQ 20
Jul 11 21:22:42 tunafish kernel: ACPI: PCI Interrupt 0000:00:0f.1 -> Link [AAZA] -> GSI 20 (level, low) -> IRQ 90
Jul 11 21:22:42 tunafish kernel: hda_codec: Unknown model for AD1988, trying auto-probe from BIOS...
Jul 11 21:22:42 tunafish kernel: Adding 497972k swap on /dev/sda1. Priority:-1 extents:1 across:497972k
Jul 11 21:22:42 tunafish kernel: Adding 497972k swap on /dev/sdb1. Priority:-2 extents:1 across:497972k
Jul 11 21:22:42 tunafish kernel: EXT3 FS on md0, internal journal
Jul 11 21:22:42 tunafish kernel: loop: loaded (max 8 devices)
Jul 11 21:22:42 tunafish kernel: device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: dm-devel@redhat.com
Jul 11 21:22:42 tunafish kernel: kjournald starting. Commit interval 5 seconds
Jul 11 21:22:42 tunafish kernel: EXT3 FS on sda2, internal journal
Jul 11 21:22:42 tunafish kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 11 21:22:42 tunafish kernel: kjournald starting. Commit interval 5 seconds
Jul 11 21:22:42 tunafish kernel: EXT3 FS on md1, internal journal
Jul 11 21:22:42 tunafish kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 11 21:22:42 tunafish kernel: kjournald starting. Commit interval 5 seconds
Jul 11 21:22:42 tunafish kernel: EXT3 FS on sdb2, internal journal
Jul 11 21:22:42 tunafish kernel: EXT3-fs: mounted filesystem with ordered data mode.

Bones · Jul 14, 2007

You need to redo the mirroring for both of those arrays. Your situation looks suspiciously like something mounted the filesystems on sda directly, instead of using the md devices.

Carlosinfl · Jul 14, 2007

What do you mean "re-do"?

Do you mean back up my data and start from scratch?

cleric_retribution · Jul 14, 2007

Bones said:
You need to redo the mirroring for both of those arrays. Your situation looks suspiciously like something mounted the filesystems on sda directly, instead of using the md devices.

This could be possible ...
You should probably try to back up your data anyway (I know you have RAID 1, so you've got mirrors already, but just to be safe).

You may want to run

Code:

yourbox# mdadm --query /dev/sdb3

and

Code:

mdadm --query /dev/sdb4

just to see if your partitions actually still have their RAID info on them.

If so, then have you tried doing the following? This will re-add the partitions that are missing from each array back into the array:

(please double check and make sure that you are adding the correct partitions back -- I gleaned the info from the error messages you posted earlier, but it's morning, and I might get them wrong)

Code:

yourbox# mdadm --re-add /dev/md0 /dev/sdb3

Code:

yourbox# mdadm --re-add /dev/md1 /dev/sdb4

PS - if re-add doesn't work, try just doing

Code:

yourbox# mdadm --add [md device] [partition]
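
Since it is easy to pair the wrong partition with the wrong array, here is a rough sketch of a guard that compares UUIDs before adding anything. The function names `extract_uuid` and `add_if_member` are made up for illustration. It assumes the partition still carries its old RAID superblock and that the UUID line looks like the `mdadm --detail` output posted earlier in the thread.

```shell
# Pull the array UUID out of `mdadm --detail` or `mdadm --examine` output
# read from stdin. Accepts both "UUID : ..." and "Array UUID : ..." lines.
extract_uuid() {
    sed -n 's/^ *\(Array \)\{0,1\}UUID : *\([0-9a-f:]*\).*/\2/p' | head -n 1
}

# Add PART to MD only if the superblock on PART names the same array.
# Hypothetical helper. Usage (as root): add_if_member /dev/md0 /dev/sdb3
add_if_member() {
    md="$1"; part="$2"
    want=$(mdadm --detail "$md" | extract_uuid)
    have=$(mdadm --examine "$part" | extract_uuid)
    if [ -n "$want" ] && [ "$want" = "$have" ]; then
        mdadm "$md" --add "$part"
    else
        echo "refusing: $part (UUID '$have') does not match $md (UUID '$want')" >&2
        return 1
    fi
}
```

The point of the check is that mdadm will accept a partition from a different array and take it on as a spare, so a mismatch is better caught before the add than after.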

Carlosinfl · Jul 14, 2007

It appears to be re-building the broken mirror with the following command!

Thanks for your help and I hope this fixes the issue.

Code:

cwilliams@tunafish:~$ su
Password:
tunafish:/home/cwilliams# mdadm --re-add /dev/md0 /dev/sdb3
mdadm: re-added /dev/sdb3
tunafish:/home/cwilliams# mdadm --re-add /dev/md0 /dev/sdb4
mdadm: added /dev/sdb4
tunafish:/home/cwilliams# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda4[0]
      135275264 blocks [2/1] [U_]

md0 : active raid1 sdb4[2](S) sdb3[3] sda3[0]
      19534976 blocks [2/1] [U_]
      [>....................]  recovery =  4.1% (817536/19534976) finish=6.4min speed=48090K/sec

unused devices: <none>
tunafish:/home/cwilliams# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda4[0]
      135275264 blocks [2/1] [U_]

md0 : active raid1 sdb4[2](S) sdb3[3] sda3[0]
      19534976 blocks [2/1] [U_]
      [>....................]  recovery =  4.9% (971584/19534976) finish=6.6min speed=46265K/sec

unused devices: <none>

****EDIT****

OK - it appears to have completed rebuilding md0, but md1 still does not show both devices up for some reason.

Code:

tunafish:/home/cwilliams# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda4[0]
      135275264 blocks [2/1] [U_]

md0 : active raid1 sdb4[2](S) sdb3[1] sda3[0]
      19534976 blocks [2/2] [UU]

unused devices: <none>

Bones · Jul 14, 2007

Err, I'll just point out the obvious here. Of course md1 didn't rebuild.

This looks like an "oh, shit!" moment to me. You have a three-partition mirror on md0 now.

Carlosinfl · Jul 14, 2007

Yup - I noticed that too

Is there a way to fix this? I have no idea how this happened.

How can I move sdb4 to md1?

It should be

sda3 & sdb3 = md0
sda4 & sdb4 = md1

I am so confused as to what happened.

Bones · Jul 14, 2007

Carloswill said:
I am so confused as to what happened.

This is what happened:

Code:

tunafish:/home/cwilliams# mdadm --re-add /dev/md0 /dev/sdb3
tunafish:/home/cwilliams# mdadm --re-add /dev/md0 /dev/sdb4

mdadm did exactly what you told it to do.

It's easy enough to fix. Remove sdb4 from md0, and then add it to md1.
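
Roughly, that fix could look like the sketch below. The helpers `run` and `move_member` are made up for illustration, and `DRY_RUN=1` is just a convention of this sketch that prints the commands instead of running them.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Move PART out of the array it was wrongly added to and into the right one.
# Hypothetical helper. Usage (as root): move_member /dev/md0 /dev/md1 /dev/sdb4
move_member() {
    from="$1"; to="$2"; part="$3"
    # A spare can be removed directly; an active member would have to be
    # marked failed (mdadm --fail) before it can be removed.
    run mdadm "$from" --remove "$part" || return 1
    run mdadm "$to" --add "$part"
}
```

Running it once with `DRY_RUN=1` first shows exactly which array each command touches before anything is changed.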

Carlosinfl · Jul 14, 2007

Bones - thanks for showing me what a moron I am.

Now is there a separate command for simply removing /dev/sdb4 from /dev/md0? I know I can run the following:

Code:

#mdadm --re-add /dev/md1 /dev/sdb4

Thanks again for all your assistance...

cleric_retribution · Jul 14, 2007

Carlos,

in a terminal, type in

Code:

yourbox# man mdadm

this will give you the "man page" or manual for mdadm, and has all of the information that I've been posting here for you

hint: look for the --remove option

Carlosinfl · Jul 14, 2007

Thanks all for all your help!

Code:

tunafish:/home/cwilliams# mdadm --remove /dev/md0 /dev/sdb4
mdadm: hot removed /dev/sdb4
tunafish:/home/cwilliams# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[0] sdb3[1]
      19534976 blocks [2/2] [UU]

md1 : active raid1 sda4[0]
      135275264 blocks [2/1] [U_]

unused devices: <none>
tunafish:/home/cwilliams# mdadm --re-add /dev/md1 /dev/sdb4
mdadm: added /dev/sdb4
tunafish:/home/cwilliams# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda3[0] sdb3[1]
      19534976 blocks [2/2] [UU]

md1 : active raid1 sdb4[2] sda4[0]
      135275264 blocks [2/1] [U_]
      [>....................]  recovery =  0.1% (259200/135275264) finish=34.7min speed=64800K/sec
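
To check later that both mirrors are whole again, a small sketch like this lists any array that still has an empty slot. The function name `degraded_arrays` is made up for illustration, and it assumes the `[2/1] [U_]` style status line shown in the /proc/mdstat output above.

```shell
# List md arrays that are missing a member, reading /proc/mdstat-format
# text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        # Header line, e.g. "md1 : active raid1 sdb4[2] sda4[0]"
        /^md[0-9]+ :/ { name = $1 }
        # Status line, e.g. "blocks [2/1] [U_]": a "_" marks a missing member
        /\[[0-9]+\/[0-9]+\]/ && /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

An array that is still rebuilding keeps its `_` until recovery finishes, so it is listed until then. No output means every array has all of its members.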
