[ZFS] How can I stop the pool degrading before I recover the data?

fishfoodie

Hi Folks,

I'm looking for some advice on recovering a zpool.

I have a Solaris 11/Napp-it build on a Norco server. The pool started to get large numbers of checksum errors (>1000) when I rearranged the disks across the rows in the chassis. I realised after a bit that it was probably a HW fault generating the checksum errors, so I offlined the pool & disconnected everything to stop things getting any worse.

I found that the new disk row I was using in the chassis had a bad backplane. I got a replacement from Norco support, & I've verified that the checksum errors aren't happening any more, using a new small pool to test the HW & cables.

So now my question ....

How do I ensure that when I bring the faulted pool back online, I can recover the data off it before it collapses? I know I may have to deal with some lost data, but I just want to recover as much as possible before I have to blow away the pool & rebuild it.

Researching online, I see I can disable checksums, but is there anything more I can do to stop the pool being disabled, or any more disks being degraded?

Thanks !
 
Unless the pool was showing unrecoverable errors (and therefore degraded and/or lost files), if you have resolved the HW issue, just do 'zpool clear POOL' and move forward. Might want to do a scrub afterward to make sure it is okay. What is the pool makeup (e.g. raid1, raid10, raidz*)?
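
Something like this, once the HW is sorted ('tank' here is just a placeholder for your actual pool name):

zpool clear tank        # reset the error counters on every vdev
zpool scrub tank        # re-read every block & repair whatever the redundancy allows
zpool status -v tank    # watch progress & see any files that couldn't be repaired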
 
I think his worry is that when copying, ZFS will check the files and decide that there are too many errors, offline a first drive, then another, then another.
 
Precisely!

It's a raidz2 array, & the 2x spare disks I had are already pulled in for disks that 'failed', so I think I'm tip-toeing along the cliff edge of losing the whole pool if I can't stop Solaris from stepping in, by disabling as many safety features as possible. I think turning off checksums will help, but I'm worried that there may be some process that occasionally triggers to test disks, & it may decide to disable another disk, & then another ...
 
Yeah, I got that. My point was that if it was a bad backplane and the data is hosed, it is hosed. Clear the zpool errors and scrub it. If it's going to sh*t the bed, the data was likely lost anyway. I don't think disabling checksums will do what you think - all that will do is disable checksums for newly written records. I don't remember ZFS bringing the pool down just because one or more disks have excessive checksum errors, but if it doesn't, scrub the pool and it will fix what it can and flag the rest as permanently damaged.
 
An update.

For anyone who experiences an issue on their ZFS file system: don't panic, Sun has built a damn fine piece of software!

Now that I'm finished, it looks like I've only lost about seven files out of several thousand!

Once I reseated all my disks & provided a 2nd spare for a faulted disk, I powered up, & ZFS started doing its stuff. My only interventions were to set the pool to read-only & to disable checksums. I did this to stop accidental writes to the pool, which could potentially cause more issues.
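
Roughly this, reconstructing from memory ('tank' stands in for my pool & its top-level dataset):

zfs set checksum=off tank    # only affects newly written blocks, but I set it anyway
zfs set readonly=on tank     # stop anything accidentally writing to the datasets while I copy the data off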

It took ~36 hours to resilver the two 3TB drives, & at that point I had >1,300 data errors, but these only affected seven files.

One learning was to avoid any process touching the corrupt files, as this seems to generate errors that either hang or overload some system process, which means that new processes hang & the system needs to be rebooted :mad: Once I realised what was happening I added --exclude to the rsync commands I was using to copy data from the failed pool to my backup pool. Running 'zpool status -v poolname' will give you the list of corrupt files you need to avoid.
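
A rough sketch of what that looked like (the pool, path, & file names here are made up, not my real ones):

zpool status -v tank    # the errors: section at the bottom lists the permanently damaged files
rsync -av --exclude='movies/broken-file.mkv' /tank/data/ /backup/data/    # skip each corrupt file so nothing ever reads it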

The other learning is that even though this is a server in my home, it's a good idea to adopt some of the commissioning procedures we have at work. These would have shown me that one of the disk backplanes was defective, & a couple of days of testing would have avoided a lot of work & stress later on!
 
The other learning is that even though this is a server in my home, it's a good idea to adopt some of the commissioning procedures we have at work. These would have shown me that one of the disk backplanes was defective, & a couple of days of testing would have avoided a lot of work & stress later on!

Can you elaborate on what you do when testing?
 
I used to have a boss who was incredibly anal-retentive, & he insisted that when we got a new server from a different vendor than the ones we currently used, we beat it to death before we commissioned it for real use. New servers of already-verified models got a smaller test plan.

We'd create all the filesystems & shares we needed, & then run processes with plenty of writes & reads, & force network/power failovers by yanking cables. Unseating disks while they were running was also done, & we verified that backups & recoveries worked properly. You'd be amazed how many times I've found people never actually check that the data they religiously back up is actually recoverable :rolleyes:

It was overboard in a lot of ways, but the general principle of running in a new server & testing the key capabilities is a good one; it's better, too, to discover a weakness or broken part before your warranty runs out!!

It's nice to know that if a switch falls over, nothing will break; instead of relying on the salesman's assurance, nothing beats having seen it work for yourself, in your own environment.

In future I'd create a pool & make sure that it operated with all the disk ports, SAS cards, & cables that would be in use when the server was fully populated. My mistake this time was that when I started I only had 6x disks in a 20-disk server, so the majority of disk ports weren't occupied, & I didn't know the backplane was faulty until I created a new pool & put in a bunch of new disks. If I'd split my disks across all the disk rows, copied over some data, & run a few scrubs, I'd have seen the checksum failures before all my data was at risk. Something like the sketch below would have done it.
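
A rough burn-in sketch (the device names are examples only; the idea is one disk from each backplane row):

zpool create testpool raidz2 c0t0d0 c0t5d0 c0t10d0 c0t15d0 c1t0d0 c1t5d0    # spread the vdev across every row
dd if=/dev/urandom of=/testpool/burnin.bin bs=1024k count=100000            # push ~100GB through every port & cable
zpool scrub testpool                                                        # then scrub a few times
zpool status -v testpool    # per-disk checksum errors here point straight at a bad row or cable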
 
Yeah, if a disk shows weakness while in warranty, don't treat it lightly and hope it runs for just a bit longer.

Stress that thing so it completely fails while still in warranty and not afterwards.

Your data >> some piece of metal.
 