How bad is raidz3 with 16 disks?

brutalizer
I am considering creating a raidz3 consisting of 16 disks. The recommendation is to use 11-15 disks for a raidz3, I think. If you use more disks, there can be problems such as long resilver times. For instance, on the FreeBSD forum someone tried a raidz2 of 20 disks and it was very bad: resilver times were so long it would almost never finish. They redid the zpool as several raidz2 vdevs and performance skyrocketed.

So, has anyone tried a 16 disk raidz3? How bad was it? I don't mind a slight penalty, but I don't want something that is very bad.
 
For my secondary and tertiary external 16 bay backup machines, I used raid-z3 with 15 1-2 TB disks for maximum capacity. Performance was quite OK and resilver time was up to 2 days with nightly backup tasks running.

For my current setups I prefer multiple raid-z2 vdevs, not only because of better performance but mainly because I can replace half of the disks with larger ones to increase capacity when needed, without destroying the pool or moving the data to another machine.
 
I can't really comment on the speed, but you are going to waste space if the number of data disks is not a power of 2. If you can't do that I would at least use an even number of data disks.
 
Yes, you waste space, especially with 4k disks, unless you reduce the blocksize. With a 16 disk raid-z3 and 4k disks you do not get the full capacity advantage over a 2 x raid-z2 config with 8 disks each, where you need 4 redundancy disks.

With 512B disks the problem is not the same.
 
With 512 byte disks the problem is still there, but much less significant. 2x8 disk RAIDZ2 is still not optimal, although a little bit better.
 
I don't mean to derail this thread, but I'm curious. I currently have a 6 disk raidz2 of 4TB drives (4k disks). I'm going to be adding an additional 5 disks to the system soon (that's all I have room for). All I'm concerned about is sequential read/write performance over gigabit. Am I better off creating a new raidz2 pool of 11 4TB disks, or a new 5 disk raidz2?
 
The best performance will probably come from a pool made of a 6 disk RAIDZ2 and a 5 disk RAIDZ2, but consider that your data is already on the old 6 disks and will not be redistributed automatically, so for a while all reads will go to the old vdev and all writes to the new vdev unless you move everything off the pool and back again. Personally, I would not add a 5 disk RAIDZ2 vdev to a 6 disk RAIDZ2 vdev.

With 11 disks you have the perfect number for a RAIDZ3. An 11 disk RAIDZ2 will likely have slightly higher performance, but not significantly so. I personally use an 11 disk RAIDZ3: very good linear speeds, and it does not suffer from the padding issue that all vdevs without a power-of-two number of data disks are subject to.
 
Thanks for the detailed response, I appreciate it! Do you think raidz3 is necessary with only 11 disks? I'm trying to find the best balance between total space and safety. I would imagine that no matter what I do I should be able to max out gigabit... gigabit is insanely slow, so optimal performance isn't a huge concern. It sounds like an 11 disk raidz3 will have more available space than the two separate raidz2 setups, which is appealing; I'm just not sure whether 11 disks calls for raidz3 or raidz2, although I do realize 4TB disks are starting to push it.
 
I've got a 16 disk RAIDZ3 setup on a server I'm using at work. Had a disk die in it about 5TB in; the resilver took ~60 hours, so quite a while. But the server was fully functional and speedy while the resilver took place (as I was restoring the backup sets to it at about 50-60 MB/s over gig-E).

AMD Opteron "8-core" CPU at 2.6GHz, 16GB ECC, 16x 1TB Hitachi Ultrastar, ESXi with IOMMU passthrough, M1015 flashed to 9211-8i in IT mode, Supermicro backplane/expander, 10GB RAM dedicated to NexentaStor. The resilver did finish fine, but again it took a while.

I know it's not the most efficient, but I had three disks in the server go at once (yes, odd; still not fully trustworthy yet) when using another LSI card, and the whole server flat-lined.
 
I have a 19x3 TB raidz3 in one of my servers now. Resilver times aren't that bad on this configuration, though it will vary with your controller setup. I was going to use 2x10 raidz2, but with a few DOA drives I didn't have the patience to wait for replacements.

I have backups of all my stuff though, and I wouldn't recommend this to anyone who doesn't have some additional redundancy.
 
Thanks for the detailed response, I appreciate it! Do you think raidz3 is necessary with only 11 disks? I'm trying to find the best balance between total space and safety. I would imagine that no matter what I do I should be able to max out gigabit... gigabit is insanely slow, so optimal performance isn't a huge concern. It sounds like an 11 disk raidz3 will have more available space than the two separate raidz2 setups, which is appealing; I'm just not sure whether 11 disks calls for raidz3 or raidz2, although I do realize 4TB disks are starting to push it.

I would assume all of these setups could saturate gigabit ethernet easily. My pool delivers more than 500 MB/s sequentially, and that is with an encryption layer between the disks and ZFS and two clients writing at 50 MB/s each. From a redundancy point of view, RAIDZ3 is not really required with 11 disks. However, with that configuration one of my backplanes or HBA ports can fail and I would still have one drive left for redundancy. Also, when one drive fails I will probably not get a replacement immediately, maybe only within a few weeks.
 
The thing is, I have a 16 disk JBOD chassis, and I was wondering how to configure that many disks. If I create an 11 disk raidz3, I have 5 disks doing nothing, which is not really optimal. Theoretically, I could add another 6 disks in my PC, which would allow two 11 disk raidz3 vdevs. That is quite a lot of disks, but it is doable.

So, how would you configure 16 disks? I would prefer raidz3 configurations over raidz2. I am only going to use my ZFS raid for backup, so I will not use it while resilvering; there will be no additional clients working on it during a resilver.

It seems the consensus is that the resilver time of a 16 disk raidz3 is a bit lengthy, but OK?



BLEOMYCIN,
The rule of thumb when creating ZFS configurations is easy: use 2, 4, or 8 disks to store the data, plus additional disks for parity. Say you want a raidz2; then you should have 4 (2+2), 6 (4+2) or 10 (8+2) disks. If you want raidz3, you should consider 5 (2+3), 7 (4+3) or 11 (8+3) disks. If you have larger configs than those recommended, there will be a performance penalty, for instance very long resilver times. In theory a 16+3 = 19 disk raidz3 would work, but resilver times would be bad, I suspect.
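To make that pattern explicit, here is a small Python sketch (my own illustration, not from the thread) that enumerates the vdev widths this rule of thumb implies:

Code:
# Hypothetical helper illustrating the "power-of-two data disks" rule of thumb;
# the printed widths include the examples above (4/6/10 for raidz2, 5/7/11 for
# raidz3) plus the wider 16-data-disk variants mentioned in passing.

def recommended_widths(parity, max_data=16):
    """Yield (data_disks, total_disks) pairs with 2^n data disks plus parity."""
    n = 1
    while 2 ** n <= max_data:
        data = 2 ** n
        yield data, data + parity
        n += 1

for parity, name in ((2, "raidz2"), (3, "raidz3")):
    for data, total in recommended_widths(parity):
        print(f"{name}: {total} disks ({data} data + {parity} parity)")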


HOTCRANDEL,
Did you really lose three disks at once?? Have you ever experienced it again? That is actually one of the reasons I would like raidz3 instead of raidz2 in a backup server.
 
If you really want the 16 bays filled and don't mind a few drives in the main chassis or slow rebuild times, you could try 16+3 plus a hotspare: a 19 disk raidz-3 plus a hotspare makes a nice even 4 disks in the main chassis and requires no intervention in the case of a single drive failure. I guess that would be 6 disks in the main chassis if using a mirrored rpool.

Or 3x 6-disk raidz-2 plus one hotspare, for what would likely be very good performance and uptime... there are way too many possibilities; you could always reconfigure it a few different ways and test to see what you like.
 
A 19 disk raidz3?? Wouldn't the resilver times be very bad?

I don't mind using a 16 disk raidz3, with respect to non-perfect alignment. With an 11 disk raidz3 the data will be spread out perfectly; with 16 disks it is not spread out perfectly, but I don't care. Today I have an 8 disk raidz2 and the data is not spread out perfectly either.
 
Yes, resilver on a 19 disk raidz-3 would probably be terrible :]
But with a hotspare in place you may not even notice anything has happened, depending on the device's workload. I'm somewhat compulsive about everything lining up properly, not necessarily because of the ideal performance; it just 'must be' or it eats away at my brain. Oddly, I'm okay with odd numbers of drives as long as the alignment is 'correct', go figure.
 
Just saying... you are only using 82% of the space if using a 16 drive RAIDZ3. The resilvering time depends on filesystem fragmentation.
 
A 128kB record striped over 13 data disks = 9.85 kB per drive. You need 3x 4kB = 12kB blocks (with ashift=12) per drive to store that, as it does not fit into 2 blocks. 9.85/12 = 82%.
If you have an ashift=9 pool, you would use 98% of the space. With ashift=13 you would be down to 62%! If you use a 2^n number of data disks you are always at 100% because the division of 128kB (=2^17) by the number of disks (=2^n) is always an integer (=2^(17-n)). Okay, you will have a problem once you exceed 2^5=32 data disks, but that should not happen.
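For anyone who wants to plug in other configurations, here is a minimal Python sketch of the same arithmetic (my own, assuming the default 128 kB recordsize and ignoring compression and metadata overhead):

Code:
import math

def raidz_data_efficiency(data_disks, ashift, recordsize=128 * 1024):
    """Fraction of each drive's allocated sectors that holds real data."""
    sector = 2 ** ashift                      # sector size implied by ashift
    per_drive = recordsize / data_disks       # bytes of one record per drive
    sectors = math.ceil(per_drive / sector)   # whole sectors needed per drive
    return per_drive / (sectors * sector)

# 16 disk raidz3 = 13 data disks, as in the example above
for ashift in (9, 12, 13):
    print(f"ashift={ashift}: {raidz_data_efficiency(13, ashift):.0%}")
# prints roughly 98%, 82% and 62%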
 
Or just change the recordsize variable to 104k and get 100% utilisation with 13 disks (104kB / 13 = 8kB, i.e. exactly two 4k sectors per drive).
 
Are you sure you can set the record size to non-power-of-two values? At least the documentation says only powers of two are allowed.
 
Interesting! Thank you for the information! So should I use ashift=9 then, since I only care about storage capacity, not speed? Will it be fine with ashift=9?


Staticlag, can I do that? Do you have more information on that? Links?
 
Unless I am mistaken, ashift=9 denotes a 512 byte block (2^9) while ashift=12 denotes a 4k block size (2^12). Your ashift setting should (must?) be the same as or greater than the block size defined by your hardware, i.e. the disk's sector size. If you are using disks with 512-byte sectors your ashift may be 9 or greater; if you are using disks with 4k sectors your ashift may be 12 or greater.
Some disks with a 4k sector size will emulate 512-byte sectors. I haven't used them personally, but everything I have read suggests avoiding them, or at least avoiding that feature.
 
In fact ALL SATA disks should have a logical sector size of 512, regardless of their physical sector size - I have yet to see a different drive. This is mainly for compatibility reasons. Some drives, however, report a 512 byte physical size even though they use 4k sectors. You can access a 4k drive with 512 byte sectors, but every read access will read 4k, and every write access will read 4k, modify 512 bytes and write 4k back, resulting in high access times - effectively turning your 5400 rpm drives into 2700 rpm drives if you disregard access coalescing. So if you don't care about speed, use ashift=9. Due to the read-modify-write pattern, a power loss could in theory result in data corruption of an adjacent sector; ZFS should take care of that, however.
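As a rough illustration of that read-modify-write penalty (my own sketch, assuming each write starts on a physical-sector boundary):

Code:
# Rough sketch of the 512e read-modify-write penalty described above.
# Assumes writes start on a physical-sector boundary; not taken from the post.

PHYSICAL = 4096   # physical sector size of a 4k ("512e") drive
LOGICAL = 512     # logical sector size presented for compatibility

def media_ops(write_bytes):
    """Physical-sector reads/writes needed for one logical write."""
    full = write_bytes // PHYSICAL                 # sectors overwritten entirely
    partial = 1 if write_bytes % PHYSICAL else 0   # tail sector needs read-modify-write
    return {"reads": partial, "writes": full + partial}

print(media_ops(LOGICAL))    # {'reads': 1, 'writes': 1} - one 512 B write costs a full RMW
print(media_ops(PHYSICAL))   # {'reads': 0, 'writes': 1} - aligned 4 KiB write, no RMW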
 
Create a 15 drive RAIDZ3 array and use the 16th slot for the biggest SSD you can afford.
 
With 15 drives (12 data drives) you are at 88%. 14 (11 data drives) would be better, 97% for 128k blocks.
 
Greetings

Yes, resilver on a 19 disk raidz-3 would probably be terrible :]

All other things being equal, the difference between RAID-Z3 and RAID-Z2 in calculating the parity is minimal according to Adam Leventhal, so I'm assuming the resilver would not be burdened with much extra overhead for exactly the same reason.

Further, from this investigation I was able to find a related method for doing triple-parity RAID-Z that was nearly as simple as its double-parity cousin. The math is a bit dense; but the key observation was that given that 3 is the smallest factor of 255 (the largest value representable by an unsigned byte) it was possible to find exactly 3 different seed or generator values after which there were no collections of failures that formed uncorrectable singularities. Using that technique I was able to implement a triple-parity RAID-Z scheme that performed nearly as well as the double-parity version.

A 128kB record striped over 13 data disks = 9.85 kB per drive. You need 3x 4kB = 12kB blocks (with ashift=12) per drive to store that, as it does not fit into 2 blocks. 9.85/12 = 82%.
If you have an ashift=9 pool, you would use 98% of the space. With ashift=13 you would be down to 62%! If you use a 2^n number of data disks you are always at 100% because the division of 128kB (=2^17) by the number of disks (=2^n) is always an integer (=2^(17-n)). Okay, you will have a problem once you exceed 2^5=32 data disks, but that should not happen.

If you're using Solaris 11, wouldn't it be better to select something larger, like a 1MB recordsize? I'm assuming it would pack the first 19 4K sectors on each drive correctly (19 x 13 = 247 4K sectors) and then just fit the last 9 4K sectors (247 + 9 = 256 4K sectors = 1MB) in as a shorter, variable-width stripe?
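For a rough sense of the numbers, here is my own back-of-the-envelope check (assuming ashift=12 and 13 data disks; not verified on an actual Solaris 11 pool):

Code:
import math

recordsize = 1024 * 1024                  # 1 MiB record
data_disks = 13
sector = 4096                             # ashift=12

per_drive = recordsize / data_disks       # ~80,660 bytes of data per drive
sectors = math.ceil(per_drive / sector)   # 20 x 4 KiB sectors per drive
print(f"{per_drive / (sectors * sector):.1%}")   # ~98.5% of allocated space holds data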

Cheers
 
Thanks for the clarification on that, omni. I haven't had the means (or necessity) to purchase a new drive in quite some time and have just been playing around with an old stack of them. I had no first-hand experience with 4K disks and wasn't aware that all of them had the emulation mode.
 
I just built a RAIDZ3 on FreeNAS 9.1.1 with 19x4TB drives. I'm copying all the data over from my old array (4x 5x2TB RAIDZ vdevs). I've got encryption turned on, so it'll affect benchmarks, but once I get the 22TB of old data copied over, I'll be happy to run some numbers if anyone wants to see how it performs when there's quite a bit of data on the drives. ETA: 5-6 days (having to use rsync over SSH for compatibility reasons between the old zpool under Solaris and the new one on FreeBSD, unfortunately).
 
All right, the data copy is done and I've been able to play around with it a bit. 19x4TB drives, RAIDZ3, LZ4 compression, and encryption all turned on. It can saturate my 1GbE connection to one client, which is all I cared about (both read and write). Scrubs are slow but tolerable (just started one; it's slowly ticking upwards in speed, right now 140MB/sec - 45 hours to scrub 22TB, which I consider tolerable for a home server).

Update: speed is still going up. 957G done so far, up to 209M/s, 29h to go.
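For what it's worth, the 45-hour figure is just the straight-line estimate at that throughput (my arithmetic, not a measurement from this pool):

Code:
# Straight-line scrub-time estimate at the quoted throughput; my arithmetic,
# not a measurement from the pool above.

pool_used_tb = 22          # roughly 22 TB of data on the pool
scrub_rate_mb_s = 140      # observed scrub rate in MB/s

hours = pool_used_tb * 1e12 / (scrub_rate_mb_s * 1e6) / 3600
print(f"{hours:.0f} hours")   # ~44 hours, in line with the ~45 h estimate above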
 
Does anyone know whether resilver speed is similar to scrub speed? In other words, will it take roughly 45 hours to resilver Firebug24k's setup?

Are you willing to pull a disk and then resilver your 19 disk raidz3? It might be good for you to know how long it takes. If you find out it takes a week, maybe you should break up your raidz3?
 
Which HBA(s) do you use?

I've thought about pulling a disk to see what happens, but I don't think I really want to go there intentionally :) Resilver and scrub ought to be about the same time, because the process is very similar. The server is very responsive during the resilver - it seems to run at a lower priority than regular disk accesses.

I'm using an LSI 9201-16i and a Supermicro 8 port card with the LSI SAS2008 chipset. - edited to reflect the correct chipset.

One more data point - it's FreeNAS running on ESXi. I've been using a combination of Solaris 11 and FreeNAS for a couple of years on ESXi and have never had any data corruption issues - the key is to pass through the cards directly to the OS - DO NOT virtualize the HBAs.
 
The LSI 1068e supports your 4 TB drives? I thought it could only handle drives up to 2 TB? Or maybe you didn't attach them to it?
 
My bad, sorry. There are actually three SAS cards in the box (I was doing a changeover from 2TB to 4TB disks, so I had 39 drives hooked up through an expander). Anyway, you're right, the 1068e does NOT work with 4TB disks. The 4TB drives are hooked up to the 9201-16i and another card with the SAS2008 chipset (I forget who makes the actual card, I think it's an IBM one that I reflashed - it's been a couple of years).

Code:
mps0: <LSI SAS2008> port 0x4000-0x40ff mem 0xd2400000-0xd2403fff,0xd2440000-0xd247ffff irq 18 at device 0.0 on pci3
mps0: Firmware: 15.00.00.00, Driver: 14.00.00.01-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
mps1: <LSI SAS2116> port 0x6000-0x60ff mem 0xd2600000-0xd2603fff,0xd2640000-0xd267ffff irq 16 at device 0.0 on pci19
mps1: Firmware: 09.00.00.00, Driver: 14.00.00.01-fbsd
mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
 
I've thought about pulling a disk to see what happens, but I don't think I really want to go there intentionally :)
But right now you have all the data on other servers, right? The data is currently duplicated, on your new raidz3 and on the old servers. So if you pull one disk now and resilver, and you find out it is below your expectations, you can easily rebuild your ZFS solution because all the data is somewhere else.

If you wait until a disk crashes, say in 2 years, and you resilver and then find out your huge raidz3 is not working well, you cannot easily rebuild your ZFS solution because the data on the other servers is no longer in sync. Maybe you will even have sold the other servers by then.

So if you want to pull a disk, the time to do it is now, while you are setting up a new solution?
 
I have a 16 bay chassis (cheap SE3016 JBOD, M1015) and decided to go for a 19 drive RAIDZ3 too, with 3 drives on my AMD board. Performance is great so far. I haven't tried resilvering, though (I have full backups); I'll do it once I've filled the pool, as a worst-case test.
 
I've already started selling off my 2TB drives, so nope, no backups left. It's a home server, so if somehow everything goes so badly wrong that I lose the whole server, well... that's life :) But I'm not going to tempt fate by yanking drives without a cause. I understand that'd be a good test in an enterprise environment, but not at home for me.

Here are the stats on the scrub. This was while still under moderate usage (I moved about 1TB around during the scrub - it's actually doing about 240MB/sec per zpool iostat, but the average is lower because of the file transfers).

Code:
  pool: placid
 state: ONLINE
  scan: scrub in progress since Sun Sep 22 10:13:27 2013
        21.7T scanned out of 21.8T at 213M/s, 0h15m to go
        0 repaired, 99.12% done
config:

        NAME                                                STATE     READ WRITE CKSUM
        placid                                              ONLINE       0     0     0
          raidz3-0                                          ONLINE       0     0     0
            gptid/dad766b0-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/dda9ba80-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/e07cab1e-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/e3cc5287-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/e72ca64e-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/ea93de87-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/edf81167-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/f1779483-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/f4ec4614-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/f8449bea-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/fbd6f2b9-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/fecb3cd8-1be4-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/01aaa40d-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/047a611f-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/0761599a-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/0a550254-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/0d277389-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/1004c93b-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0
            gptid/12ec193e-1be5-11e3-beaf-000c296a7c5b.eli  ONLINE       0     0     0

errors: No known data errors

Code:
[root@freenas] ~# zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
blazer       280G   188G      0     26  3.19K   130K
placid      22.9T  46.1T  2.03K     46   254M   199K
----------  -----  -----  -----  -----  -----  -----
blazer       280G   188G      0     32      0   211K
placid      22.9T  46.1T  2.07K     46   257M   199K
----------  -----  -----  -----  -----  -----  -----
 
I have a 16 bay chassis (cheap SE3016 JBOD, M1015) and decided to go for a 19 drive RAIDZ3 too, with 3 drives on my AMD board. Performance is great so far. I haven't tried resilvering, though (I have full backups); I'll do it once I've filled the pool, as a worst-case test.

How fast is scrubbing? If resilver speed is similar to scrub speed, then it would be interesting to know for your setup.
 