zpool replace question

If I were to issue a "zpool replace tank olddisk newdisk" command on a raidz2 pool, is the pool at greater risk for the duration of the resilver?

I'm in the process of replacing a 10-disk x 3TB raidz2 pool with 10 disks of 6TB WDs. All "old" disks are fine and the last scrub had no errors.

Is it maybe even possible to issue all 10 disk replacements at the same time? As long as all disks are working fine, I would assume a direct sector-to-sector copy from the "old" to the "new" disk. Performance-wise it would be smarter to mirror the disks instead of rebuilding the "new" data from all the other disks.

Does anyone have any insight into how zpool replace handles this case?
I'm using OmniOS.
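
In command form, the all-at-once idea would look something like this; just a sketch, with placeholder disk names rather than my real devices:

    for i in 0 1 2 3 4 5 6 7 8 9; do
        zpool replace tank old$i new$i
    done
    zpool status tank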
 
If you are able to have ALL disks hooked up at the same time, you could just do a ZFS send/receive to get your data onto the new disks. That'd be the best way.

Since you are using raidz2, I wouldn't be too concerned about doing a disk at a time. It'll take an extremely long time though.

I'm not sure about a sector-by-sector copy. ZFS keeps some specific information about the disks in its pool, and I'm not sure what would happen if you tried to import a pool from copied disks.
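
The send/receive route would look roughly like this, assuming the new pool already exists and both pools are imported (pool and snapshot names are made up for illustration):

    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -F newtank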
 
As far as I know, there is no sector-level copy when resilvering; ZFS only copies active data to the new disk. Given that you are replacing the entire pool, doing a send/recv would be good, except for a window where new changes would not go over. You could try multiple replaces to see if it makes any difference speed-wise, e.g. do a 'replace A with B' and look at the resilver speed after a few minutes. Then do a 'replace C with D' and wait for the second resilver to get up to speed. Look and see if the first one has slowed down at all. If not, keep doing replaces. I would probably be paranoid and just do one at a time though :)
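
In command form, the experiment would be something like this (disk names are placeholders):

    zpool replace tank oldA newA
    zpool status tank    # after a few minutes, note the resilver speed
    zpool replace tank oldB newB
    zpool status tank    # did the first resilver slow down?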
 
Sure, but if there are writes going to it regularly, it's hard to seamlessly finish that without downtime. Resilvering doesn't have that issue...
 
It was actually a bit more complicated.
I started with 8x3TB raidz1 and did a send/receive to a pool of 2x3TB (borrowed from a friend) + 8x6TB, all because I cannot seem to get my hands on the last two 6TB drives, and because I just wanted to finally start migrating the data after almost 2 weeks of stress-testing disks and scrubbing the old pool.

Additionally, my old pool was nearly full and I needed some space now.
So now I have one z2 pool of 2+8 disks with the capacity of 8x3TB, which also gives me 3TB free on the pool.

Now, when I get the last 2 disks, is it more risky to do 2 replace operations at the same time, given that none of the disks is corrupt and the redundancy is not decreased?
So it comes down to how OmniOS handles this type of disk replacement. I don't think ZFS, being enterprise grade by design, would reduce the redundancy level in such a case.
I would assume some kind of 1:1 copy of the data, involving only the original and the corresponding replacement drive. But can anyone confirm or deny this?

Edit: this is called hot-replacing a still-functional/partly functional disk.
 
I'm pretty sure the redundancy is not reduced during the resilver. So like I said, replace A with B and C with D, etc...
 
I didn't think you could attach mirrors to members of a raidz vdev.

Sure, but if there are writes going to it regularly, it's hard to seamlessly finish that without downtime. Resilvering doesn't have that issue...

The solution to this is iteration. Two or three iterations will reduce the remaining deltas between the live system and the replacement system to a level where the downtime can be short.
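Each iteration is just an incremental send on top of the previous snapshot; a sketch with made-up snapshot names, where the last pass is done with writers stopped:

    zfs snapshot -r tank@pass1
    zfs send -R tank@pass1 | zfs receive -F newtank
    # later, a much smaller delta:
    zfs snapshot -r tank@pass2
    zfs send -R -i tank@pass1 tank@pass2 | zfs receive -F newtank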
 
Yeah, the mirror thing for raidz didn't make sense to me (I'm pretty sure nested vdevs are not supported at this time.)
 
I felt adventurous
  pool: bigfish6
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 6 16:39:38 2014
        0,2T scanned out of 24,2T at 345M/s, 22h8m to go
        0,05T resilvered, 0,12% done
config:

        NAME                         STATE     READ WRITE CKSUM
        bigfish6                     ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            replacing-0              ONLINE       0     0     0
              c3t5000CCA225CB050Ad0  ONLINE       0     0     0
              c3t5000CCA228C16886d0  ONLINE       0     0     0  (resilvering)
            replacing-1              ONLINE       0     0     0
              c3t5000CCA225CD53A9d0  ONLINE       0     0     0
              c3t5000CCA228C1A2C8d0  ONLINE       0     0     0  (resilvering)
            c3t50014EE20AF9E597d0    ONLINE       0     0     0
            c3t50014EE20AFA1D15d0    ONLINE       0     0     0
            c3t50014EE20AFD5C09d0    ONLINE       0     0     0
            c3t50014EE26051D866d0    ONLINE       0     0     0
            c3t50014EE2605326A3d0    ONLINE       0     0     0
            c3t50014EE2B5A532C9d0    ONLINE       0     0     0
            c3t50014EE2B5A8A118d0    ONLINE       0     0     0
            c3t50014EE2B5ABE44Bd0    ONLINE       0     0     0

errors: No known data errors

@devman, @danswartz no one said anything about nesting or mirrors. I was talking about how ZFS handles the replace command and whether it would be possible to replace all disks at the same time (like in Linux, which implemented a "hot replace" in mdadm v3.3).
http://www.heise.de/open/meldung/So...m-unterstuetzt-jetzt-Hot-Replace-1948533.html

It's all about raidz2 and the "zpool replace pool old_drive new_drive" command.
It's NOT about rebuilding the new disks from the data/parity of all the other disks.

So has anyone done a replace with more disks than the corresponding RAID level needs for parity? (4 or more simultaneous disk replacements in a z3, ... 2 or more simultaneous disk replacements in a z1)

By hot-replacing I mean actively using the soon-to-be-gone/defective disk to reconstruct its data onto the new disk.
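
For reference, the mdadm hot replace I mean looks roughly like this (mdadm 3.3 or newer; device names are just examples, and the new device must already be a spare in the array). The old member stays active until the copy is finished:

    mdadm /dev/md0 --replace /dev/sdb1 --with /dev/sdg1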
 
Interesting.

Your pool shows as online (and not degraded, which is what it would say if you had one or more disks missing from the pool).

I don't know if I've ever heard of anyone trying to do a replace on all disks at once. The only use case where I've done it is replacing a failed disk. I would think it would work, but I don't know for sure.
 
Thank you @RedShirt for the link; I did not find anyone ever doing this.

But still, my question is whether anyone has ever tried replacing more disks at the same time than the redundancy level provides. In this thread they are just guessing that it should work...
 
Unintentionally, I had two disks replacing an existing pair in a Z2 pool (within the redundancy margin, so it's not what the OP asked for, but anyway).
The first replace was intentional, for the purpose of increasing capacity.
During the resilvering phase, another disk degraded (some 30 read and/or checksum errors) and the system automatically started replacing it with the hot spare.

After the dual resilver finished, the pool was marked as healthy.
 
A guess, true, but an educated one. Given that a drive is resilvered from the one it is replacing, and that when you tried it the pool was online and not degraded, that would seem to confirm it.
 
You could test with a virtual machine (I would if I had time).
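
Actually, a throwaway pool on file vdevs would answer it even without a VM. A sketch (paths are made up; mkfile is the illumos/OmniOS way to create the backing files):

    mkfile 256m /tmp/d0 /tmp/d1 /tmp/d2 /tmp/d3 /tmp/n0 /tmp/n1 /tmp/n2
    zpool create testpool raidz2 /tmp/d0 /tmp/d1 /tmp/d2 /tmp/d3
    zpool replace testpool /tmp/d0 /tmp/n0
    zpool replace testpool /tmp/d1 /tmp/n1
    zpool replace testpool /tmp/d2 /tmp/n2   # a third replace, more than z2 parity covers
    zpool status testpool
    zpool destroy testpool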

I plan to have about 10 free bays when I add two more JBOD chassis to my server (and the same on the backup server) and was wondering what I could do with them; I guess replacing 10 drives at a time when increasing capacity could be useful.

Otherwise this idea seems good for heavily used systems, except that I would expect either performance to drop or the resilver to take forever. This can be tuned though: http://broken.net/uncategorized/zfs-performance-tuning-for-scrubs-and-resilvers/
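
On illumos/OmniOS the knobs from that article are live kernel variables you can poke with mdb. A hedged example, values illustrative only and reverting at reboot (a zero delay prioritizes the resilver over regular I/O):

    echo zfs_resilver_delay/W0t0 | mdb -kw
    echo zfs_resilver_min_time_ms/W0t5000 | mdb -kw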
 