Recovery / re-silvering from imaged/cloned drives on ZFS RaidZ2

KPE · Jul 15, 2018

Hi everyone,

I am a longtime "lurker" here on Hard Forum, but I now need to ask for help with a problem I have not found covered in an existing post.

I reinstalled OmniosCE/Napp-IT last year to replace my previous Omniti version, and I stupidly forgot to set up scheduled scrubs for my pools.

I discovered this last week and initiated scrubs on my pools. Unfortunately one of my very active pools, a RAIDZ2 of eight 3TB drives, lost 2 drives during the scrub overnight and then a 3rd during the rebuild, leaving me with an unavailable pool.

I have successfully imaged the 3 failed drives using ddrescue. All three had inaccessible partition tables (bad sectors in the first 1 MB of the drive) and a small scattering of bad sectors in other areas (around 3MB in total). My plan was to insert the newly imaged drives into my system and have them rejoin the failed pool *somehow*, so that a resilver / scrub can restore integrity.

The newly imaged drives have been inserted into the system and are recognized as drives; I am just not sure what I can do from here.

As the pool is offline, I can't run the regular zpool maintenance commands, and I obviously need to get the new drive IDs recognized as belonging to the ZFS pool somehow. This makes me speculate that I might need to do some editing / zdb magic, or perhaps there is a way to set the drive ID of each new drive to that of the old one it is replacing?

Any input and suggestions will be much appreciated.

_Gea · Jul 16, 2018

A ZFS pool is available if enough disks are available. If failed disks come back, the pool goes online without further actions.

If more disks fail than the redundancy level allows, the pool is basically lost and a backup is required. The ZFS scrub itself does not do a structural repair, as the ZFS structure is always valid due to CopyOnWrite and metadata is stored twice on disk. But as a ZFS scrub reads all data, it is able to detect bad disks.

This is different from old filesystems, where a crash during a write can lead to a damaged filesystem whose metadata an fsck can repair.
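
To put numbers on the redundancy rule for the pool in this thread (8 disks, RAIDZ2, so 2 disks of parity), here is a minimal sketch. The function name raidz_importable is made up for illustration; it only does the arithmetic and does not talk to ZFS.

```shell
#!/bin/sh
# raidz_importable WIDTH PARITY MISSING   (hypothetical helper, arithmetic only)
# A raidz vdev stays usable as long as the number of missing disks
# does not exceed its parity level.
raidz_importable() {
    width=$1 parity=$2 missing=$3
    if [ "$missing" -le "$parity" ]; then
        echo "importable ($((width - missing)) of $width disks present)"
    else
        echo "unavailable ($missing missing, only $parity tolerated)"
    fi
}

raidz_importable 8 2 2   # two disks lost during the scrub
raidz_importable 8 2 3   # third disk lost during the rebuild
```

The first call prints "importable (6 of 8 disks present)", the second "unavailable (3 missing, only 2 tolerated)", which matches what happened to the pool here.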

What you can try is updating to the newest OmniOS CE 151026 first and then retrying the import, as it adds improved support for ZFS pool recovery, see https://www.delphix.com/blog/openzfs-pool-import-recovery

KPE · Jul 16, 2018

Thank you Gea, this certainly offers a new avenue of hope and possibility.

What is incredibly frustrating is to have 5 out of 8 disks be good, and the remaining 3 still be readable but with errors here and there. I strongly suspect that what really disqualified a disk from being considered part of the pool and used for a repair attempt (putting it in the UNAVAIL state) was crossing the threshold where its 1st track and partition table became unreadable. That was basically what I saw happen when the 3rd disk "went bad".

My intention is to find some way of writing a fresh partition label and "magic" onto my 3 rescued hard drive copies, which are missing this information at the beginning of the drive, in order to trigger that repair. It may take me a few months or years.

But right now I am in the process of imaging a 2nd copy of my set of 8 drives to work from, and I will upgrade to OmniOS CE 151026 and dig in further. Thank you for the good suggestion.

I have included some information here that I missed in my original post, for others who are interested in commenting.

------
  pool: avm
    id: 4688197856225759405
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://illumos.org/msg/ZFS-8000-3C
config:

        avm                        UNAVAIL  insufficient replicas
          raidz2-0                 UNAVAIL  insufficient replicas
            c0t50014EE0037F4CDAd0  ONLINE
            c0t50014EE0037F5209d0  UNAVAIL  cannot open
            c0t50014EE0AE2A1DB4d0  ONLINE
            c0t50014EE0AE27BF7Fd0  ONLINE
            c0t50014EE658538006d0  UNAVAIL  cannot open
            c0t50014EE65A76D6D7d0  ONLINE
            c0t50014EE6ADA80522d0  UNAVAIL  cannot open
            c0t50014EE6ADA85C2Dd0  ONLINE

destroyed pools to import:
-none-

I already have my replacement drives with the (imperfect) images from the failed drives on the system as:

c0t50014EE6ADA871F9d0
c0t5000CCA263C70BE2d0
c0t5000CCA263C7092Ed0
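
For anyone who wants to pull the missing members out of a listing like the one above without reading it by eye, here is a small sketch. The function name list_unavail is made up, and it assumes device names of the illumos c#t#d# form shown in this post; pool and vdev rows (avm, raidz2-0) are skipped because they do not match that pattern.

```shell
#!/bin/sh
# list_unavail (hypothetical helper): read `zpool import` output on stdin and
# print the leaf devices whose state column is UNAVAIL.
# Assumes illumos-style device names such as c0t50014EE0037F5209d0.
list_unavail() {
    awk '$2 == "UNAVAIL" && $1 ~ /^c[0-9]+t[0-9A-Fa-f]+d[0-9]+$/ { print $1 }'
}

# Fed with an excerpt of the listing from this post:
list_unavail <<'EOF'
        avm                        UNAVAIL  insufficient replicas
          raidz2-0                 UNAVAIL  insufficient replicas
            c0t50014EE0037F4CDAd0  ONLINE
            c0t50014EE0037F5209d0  UNAVAIL  cannot open
            c0t50014EE658538006d0  UNAVAIL  cannot open
            c0t50014EE6ADA80522d0  UNAVAIL  cannot open
EOF
```

On a live system the same function could be fed directly with "zpool import | list_unavail".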

KPE · Sep 15, 2018

Alright - looks like this story is going to have a happy ending.

On my Linux rescue system I used "dd if=/dev/sdi of=/rescued/avm5.mbr bs=512 count=1" to grab the MBR from one of the 5 good disks.

I then wrote it to each of the 3 partially recovered drives that had blank MBRs, one drive at a time ("dd if=/rescued/avm5.mbr of=/dev/sdX", with sdX being sda, sdb and sdc in turn).
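
The two dd steps above can be sketched as a small loop. This is only an illustration: the function names are made up, the blank-sector check is an extra safety guard that was not part of the procedure described here, and "cmp -n" is a GNU/BSD extension. It is worth trying on image files before pointing it at real devices.

```shell
#!/bin/sh
# first_sector_blank FILE (hypothetical helper)
# True when the first 512 bytes of FILE are all zero.
first_sector_blank() {
    cmp -s -n 512 "$1" /dev/zero
}

# copy_first_sector DONOR TARGET... (hypothetical helper)
# Writes the first 512-byte sector of DONOR over the first sector of each
# TARGET and leaves everything after that sector untouched.
copy_first_sector() {
    donor=$1; shift
    for target in "$@"; do
        # Added guard (an assumption, not from the post): never overwrite
        # a first sector that already contains data.
        first_sector_blank "$target" || {
            echo "skipping $target: first sector is not blank" >&2
            continue
        }
        # conv=notrunc keeps a regular-file target from being cut to 512 bytes
        dd if="$donor" of="$target" bs=512 count=1 conv=notrunc || return 1
    done
}

# With the names from this post it would be called as:
# copy_first_sector /rescued/avm5.mbr /dev/sda /dev/sdb /dev/sdc
```

The conv=notrunc option makes no difference on a block device, but it matters when practising on image files, where dd would otherwise truncate the target to a single sector.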

I inserted the 5 good imaged drives, and subsequently the 3 drives with the cloned MBRs, into my OmniOS box.

The drives were detected, but I received warnings that the primary disk label was corrupt and that the backup would be used. I don't know if it was necessary, but I fixed this with the format command (I ran format, selected each of the 3 disks in turn, and used the "backup" command to restore the backup disk label).

I then ran zpool import (my heart skipped a beat when I saw my failed pool), followed by zpool import avm, and after about 60 seconds my pool was back online.

I am busy transferring my data to secondary storage, and this should be the culmination of 2 months of patience and not panicking.

Have a nice weekend everyone, and thank you for your input Gea!

And a big cheers!
KPE

root@zfs:/dev/rdsk# zpool import
  pool: avm
    id: 4688197856225759405
 state: DEGRADED
status: One or more devices contains corrupted data.
action: The pool can be imported despite missing or damaged devices. The
        fault tolerance of the pool may be compromised if imported.
   see: http://illumos.org/msg/ZFS-8000-4J
config:

        avm                        DEGRADED
          raidz2-0                 DEGRADED
            c0t5000CCA228C0AB28d0  ONLINE
            c0t5000CCA228C34A64d0  ONLINE
            c0t5000CCA228C32757d0  ONLINE
            c0t5000CCA228C17999d0  ONLINE
            c0t5000CCA228C2F141d0  ONLINE
            c0t5000CCA228C0A65Cd0  ONLINE
            c0t5000CCA228C0B028d0  FAULTED  corrupted data
            c0t5000CCA228C0A7C5d0  ONLINE
