ZFS/nexenta instance failed drive question

hotcrandel · Nov 14, 2013

I have a 16x1TB server in RAIDZ3.

I had a disk throw a write error, so I went ahead and replaced it.

I am aware that the rebuild does take a long time, probably 5-7 days.

We're on day 2, and we have now lost a second disk.

Question is, do I go ahead and replace it now, or let the first disk finish?

I am inclined to replace it now, but I figure I'd get some thoughts first.

danswartz · Nov 14, 2013

Hmmmm. Without knowing if resilvering the 2nd disk might slow down the 1st, it's hard to say... You could run an iostat and then replace the 2nd disk and see if the iostat to the 1st disk takes a hit, and if so, pull the 2nd disk...
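
Something like the following is what I have in mind for the before/after comparison. It is only a sketch: the function name and the pool name `tank` are placeholders, and the interval and count are arbitrary.

```shell
# watch_pool_io POOL [INTERVAL] [COUNT]
# Print per-vdev I/O statistics for POOL every INTERVAL seconds, COUNT times.
# Hypothetical helper; run it once before replacing the 2nd disk and once
# after, then compare the write bandwidth going to the 1st replacement disk.
watch_pool_io() {
    pool="$1"
    interval="${2:-5}"
    count="${3:-12}"
    zpool iostat -v "$pool" "$interval" "$count"
}

# Usage (placeholder pool name):
#   watch_pool_io tank 10 6
```

`zpool status tank` will also show which devices are currently resilvering.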

HobartTas · Nov 14, 2013

Greetings

From what I've read about ZFS, it always completes the resilvering of the first disk before it starts resilvering a second one. If the second hard drive is effectively dead, then I presume ZFS would kick it out of the array or otherwise mark it inactive with a failed status, in which case it would continue with the resilvering of the first drive.

Kindly remember, however, that unlike hardware RAID, which starts the rebuild at the first sector of the drive and ends at the last one, ZFS walks the pool's block tree and resilvers in that order. That means the data on the drive being resilvered is always consistent for the portion of the work already done. It is my understanding that if the process is interrupted it will continue where it left off, since everything is timestamped and ZFS can work out what still needs doing to complete the job. I would not bother removing the second drive unless commands sent to it are not being answered, so that it hangs or otherwise impedes the entire process.

Another disadvantage of this resilvering method, which I think is what is happening in your case, is that if the data is highly fragmented the speed may drop to something approaching random 512 B/4 KB reads and writes on mechanical hard drives, i.e. 1-2 MB/s. If you can monitor the I/O over a couple of minutes or so, you can probably work out how long the process is going to take to complete (I'm assuming, of course, that the array is otherwise idle and not still having work thrown at it).

Having RAID-Z3 in your situation still gives you a large measure of protection: with two drives out you effectively still have the equivalent of RAID-Z in your array until a successful resilver of another drive bumps you back up to the next RAID-Z level.
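
To put rough numbers on the time estimate, here is a minimal sketch. It assumes you have already read an average write rate to the new disk off iostat and that the rate stays constant; the function name and the example figures are mine, not anything ZFS provides.

```shell
# resilver_eta_days DATA_MB RATE_MB_PER_S
# Rough number of days needed to write DATA_MB megabytes at a steady
# RATE_MB_PER_S. Hypothetical helper; assumes a constant rate and an
# otherwise idle pool.
resilver_eta_days() {
    awk -v mb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", mb / rate / 86400 }'
}

# Example: about 1 TB (1,000,000 MB) to rewrite at an observed 2 MB/s.
resilver_eta_days 1000000 2   # prints 5.8
```

At 2 MB/s a full 1 TB disk works out to just under six days, which is in line with the 5-7 days you mentioned. A disk that is only partly full needs proportionally less.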

If you would like to speed this process up in the future, then choosing a larger recordsize would mean the data is stored in larger chunks on the drives, and the resilvering process should work faster. Solaris, for example, allows recordsize values up to 1MB, although if you were running, say, an OLTP database you would not want large recordsizes in the first place. For performance purposes mirrored pools would be better, and it is apparently also recommended that triple mirroring be used instead for large hard drives in the 2-4TB range.
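
For reference, checking the current value looks roughly like this. It is a sketch only: the function name and the dataset name `tank/data` are placeholders, and whether 1M is accepted depends on your platform and pool version.

```shell
# dataset_recordsize DATASET
# Print only the recordsize value of DATASET (-H drops the header,
# -o value limits the output to the value column). Hypothetical helper.
dataset_recordsize() {
    zfs get -H -o value recordsize "$1"
}

# Usage (placeholder dataset name):
#   dataset_recordsize tank/data
# Raising it, where the platform supports the larger size:
#   zfs set recordsize=1M tank/data
```

Keep in mind that a new recordsize only applies to blocks written after the change; existing data keeps the block size it was written with.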

Hope this helps

Cheers

zrav · Nov 14, 2013

HobartTas said:
From what I've read about ZFS, it always completes the resilvering of the first disk before it starts resilvering a second one. If the second hard drive is effectively dead, then I presume ZFS would kick it out of the array or otherwise mark it inactive with a failed status, in which case it would continue with the resilvering of the first drive.

On ZFS on Linux I replaced two drives of a RAIDZ2 array consecutively before the first finished resilvering. The process restarted with both drives resilvering in parallel.

hotcrandel · Nov 14, 2013

On NexentaStor it appears to be finishing the first resilver.

The second disk, even replaced, is currently idle.

omniscence · Nov 15, 2013

I assume that two parallel resilvering processes will take more overall time to complete because of the additional seeks the remaining disks have to do.

zrav · Nov 15, 2013

I don't see why there would be extra seeks, the XORing for two drives operates on exactly the same data.

danswartz · Nov 15, 2013

This isn't standard raid though, it isn't a simple XOR on drive contents.

madrebel · Nov 15, 2013

FYI, from my own notes. If you've lost two drives already I would immediately adjust the resilver priority.

From a root bash shell, type the following:
echo zfs_resilver_delay/W1 |mdb -kw
echo zfs_resilver_min_time_ms/W0t4000|mdb -kw

To return to the default settings, type the following:
echo zfs_resilver_delay/W2 |mdb -kw
echo zfs_resilver_min_time_ms/W0t3000|mdb -kw

Note: in extreme scenarios where returning to N state is paramount you can use the following settings. Customer experience will be greatly impacted, so use with caution.
echo zfs_resilver_delay/W0 |mdb -kw
echo zfs_resilver_min_time_ms/W0t5000|mdb -kw
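
If it helps to keep the three sets of values straight, here is a small sketch that only prints the commands for a given profile rather than running them. The function and the profile names (default, priority, aggressive) are my own labels for the settings above, and it writes the delay in mdb's explicit decimal form (`W0t1`), which is the same value as `W1` for 0, 1 and 2.

```shell
# resilver_profile NAME
# Print (do not run) the mdb commands for one of the three settings above.
# Hypothetical helper; review the output before pasting it into a root shell.
resilver_profile() {
    case "$1" in
        default)    delay=2; min_ms=3000 ;;
        priority)   delay=1; min_ms=4000 ;;
        aggressive) delay=0; min_ms=5000 ;;
        *) echo "usage: resilver_profile default|priority|aggressive" >&2
           return 1 ;;
    esac
    printf 'echo zfs_resilver_delay/W0t%s | mdb -kw\n' "$delay"
    printf 'echo zfs_resilver_min_time_ms/W0t%s | mdb -kw\n' "$min_ms"
}

# Usage:
#   resilver_profile priority
```

Printing first and running by hand is deliberate, since these writes go straight into the live kernel.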

zrav · Nov 18, 2013

danswartz said:
This isn't standard raid though, it isn't a simple XOR on drive contents.

I know that.
I'll rephrase my statement: Both the disks were resilvered in perfect sync, the process taking about as long as a regular scrub, thus giving me reason to believe that it's not two independent resilvers running in parallel.
