Raidz2 (raid6) vs mirrors: What raid config do you use?

For 12 10TB disks, I'm looking at two possible ZFS raid configurations (sketched just below the list):
  1. 2 vdevs, each with 6 drives in raidz2. 80TB total storage, with 40TB lost to parity.
  2. 6 mirrored vdevs. 60TB total storage, with 60TB lost to parity.
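
Here's roughly what each would look like at pool creation (pool name "tank" and device names da0-da11 are placeholders, not my actual disks):

Code:
# Option 1: two 6-wide raidz2 vdevs
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 \
                  raidz2 da6 da7 da8 da9 da10 da11

# Option 2: six 2-way mirror vdevs
zpool create tank mirror da0 da1 mirror da2 da3 mirror da4 da5 \
                  mirror da6 da7 mirror da8 da9 mirror da10 da11
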
I was almost sold on mirrors after reading this:
ZFS: You should use mirror vdevs, not RAIDZ. | JRS Systems: the blog

But now, I'm not so sure. The extra 20TB lost to mirrors kind of sucks, but I could live with that. The use case here is just NAS media storage, so mostly reads and few writes.

What really bothers me is the possibility that if one disk dies, I lose the whole pool if the other disk in that mirror also dies. With raidz2, any one other disk in the vdev can also die and I can still recover.

I'm only really concerned with redundancy and the possibility of permanent data loss here. Extra IO performance during normal operation is nice to have but not so important.

The tradeoffs appear to be:

Mirror:
  • (-) Only 60TB total storage
  • (+) Resilvering is fast and only does reads on one disk, making it less likely to trigger a second failure.
  • (-) Possibility of data loss if the other disk in the faulted mirror fails during resilvering
  • (+) Upgrading the capacity of a vdev requires only 2 disks at a time and would run rather quickly. Only need to buy 2 disks to start.
  • (+) Faster read / write performance (Not that important for my use case)
  • (+) With only 5 mirrors, 2 drive bays are free, allowing for a hot spare.
Raidz2:
  • (+) 80TB total storage
  • (-) Resilvering is extremely slow and does both reads and writes on all disks, making it more likely to trigger a second failure.
  • (+) Possibility of data loss only if any 2 other disks within the same vdev both fail during resilvering
  • (-) Upgrading the capacity of a vdev requires replacing 6 disks at a time and would take an extremely long time. Need to buy 6 disks to start.
  • (-) Slower read / write performance (Not important for my use case)
  • (-) No possibility of having a hotspare as the 2 vdevs consume all drive bays.
Raidz is off the table, as the industry pretty much fully agrees it's not suitable for large disks. Raidz3 would mean building an entire 11-disk + hot spare or 12-disk pool, so I'd have to spend a fortune buying 12 disks up front, and any future capacity upgrade would mean buying another 12 disks.

For those of you with a raid setup, what raid configuration did you end up using and why? If you could go back, would you change it to something else?

Does anyone have a good estimate for how long it would take to replace a single faulted disk in a raidz2 array of 6 10TB disks? How about for a mirror of 2 10TB disks?

Thanks!
 
In your situation, I would go for the Raidz2 setup. You have a minimal I/O workload; it isn't like you're hosting virtual machines on this array.

If you had real performance requirements (or in a production environment), I'd advocate mirrors all day long. It takes a LOT less time to rebuild a mirror than to rebuild a Raidz2 array.

That said..... do what you can to maintain backups, at least of the critical information. You want to be able to restore your data in the event your array suffers a catastrophic failure.
 
A replace or resilver is a low-priority background process that must walk all of the pool's metadata.
Time to finish depends on activity, fill level, fragmentation, and iops, and therefore on the raid layout: a raid-z vdev has the iops of a single disk, so many vdevs (e.g. with mirrors) reduce rebuild time, as iops scale with the number of vdevs (and, for reads on mirrors, with 2x the number of vdevs).

If a pool is quite full with medium to heavy activity, I would expect 1-3 days to resilver, and up to a week for a large single-vdev raid-z1/2/3 pool. Sequential resilvering, as in current Oracle Solaris 11.3, is much faster; resilver time can be a few hours there.

btw
Nexenta is working on sequential resilvering on their Illumos variant.
Maybe next year in OpenZFS as well.

Watch the video about resilvering for more info:
OpenZFS
 
It's good information, _Gea, but it's highly dependent on how the system is used. In some use cases a RAIDZ2 configuration can actually resilver faster than a mirror configuration (assuming the same disks in the system).

Since the OP mentions large media storage, it seems to be the same use case as mine.

I will tell you the story of what I have found in my actual experience with RAIDZ2 vs mirror configuration.

First off, my ZFS system just stores media files, 90% movies and TV shows, which are multi-GB files. All of that data lives on datasets where I have the recordsize property set to 1MiB.
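
For anyone wanting to do the same, it's a one-line property change per dataset (the dataset name below is just an example, and it only applies to files written after the change):

Code:
zfs set recordsize=1M tank/media   # needs the large_blocks pool feature on older OpenZFS releases
zfs get recordsize tank/media      # verify the setting took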

When I do a resilver, the limit I've actually seen is simply the sequential write speed of the drive I'm resilvering onto. With my 12 4TB WD Red (slow) drives configured as a single 40TB capacity RAIDZ2, and about half full with 20TB of data, that means that to resilver a disk it needs to write 2TB to the disk. I've resilvered it a few times and it has taken about 4 hours to resilver that 2TB of data at 135MB/s which is coincidentally the average write speed of my 4TB WD Red disks when I look at an HDTune sequential disk write test.

If I configured my 12 disks in striped mirrors, my pool would instead have 24TB capacity rather than 40TB. So with my same 20TB of data instead of being 50% full, it would be about 83% full. So to resilver a disk it would need to write 3.32TB which would end up taking about 6 hours 50 minutes at the disk's 135MB/s average sequential write speed limit.
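
Running those numbers as a quick sanity check: 2TB / 135MB/s is roughly 14,800 seconds, or about 4.1 hours, for the half-full RAIDZ2 case, while 3.32TB / 135MB/s is roughly 24,600 seconds, or about 6 hours 50 minutes, for the 83%-full mirror case, which is where those estimates come from.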

I kick off my resilvers when I go to sleep and they are done by the time I wake up and so there is also never any activity going on on the system while it runs.

Same with scrubs. I scrub monthly when I'm sleeping and it takes about 4 hours currently to scrub my 20TB (24TB including parity) of data with the scrub reporting an average speed of about 1.6GB/s.
 
Thank you SirMaster,

This is exactly the kind of real life example I was looking for.

That article I linked was talking on the order of days or even weeks to do a resilver. Why is there such a discrepancy? Is it because of a lack of active users doing reads and writes on the system while the resilver is taking place?

I'm also a bit confused about the claim that resilvering does lots of reads and writes on all of the drives. I would expect it to only be reading from the other drives and only writing to the new replacement drive.

With my 12 4TB WD Red (slow) drives configured as a single 40TB capacity RAIDZ2, and about half full with 20TB of data, that means that to resilver a disk it needs to write 2TB to the disk.

This is interesting. Just about every ZFS reference I've read recommends not using "too many" disks in a single raidz / raidz2 vdev. I thought I was pushing it with 10 1.5TB drives in raidz2 but you're using 12. Using that many disks in a single raidz2 vdev hasn't caused any problems for you?
 
Thank you SirMaster,

This is exactly the kind of real life example I was looking for.

That article I linked was talking on the order of days or even weeks to do a resilver. Why do you think there is such a big difference between what they claim and what you've actually experienced? The recordsize and the fact that it's mostly large files instead of lots of little files?

Multiple reasons, at least four that I can think of:
  • 1MiB recordsize (8x larger than the default 128KiB that most people use).
  • Nearly no fragmentation. I am the only one who writes to the zpool, I do all my writes serially since I'm just copying in new media one file at a time, and I don't modify my media; it's write once, read many.
  • No activity during scrub/resilver. I do these operations at night, when there is literally zero zpool activity.
  • ZFS kernel parameter tuning. I've adjusted a few of the kernel parameters, including:

Code:
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay        # don't throttle resilver I/O when the pool is busy
echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay           # don't throttle scrub I/O when the pool is busy
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight     # max queued scrub/resilver I/Os per top-level vdev (default 32)
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms   # minimum time (ms) spent on resilver work per txg

The 512 for zfs_top_maxinflight really increased my scrub and resilver throughput in my testing. The higher I went past the default of 32, the faster it got, until the gains stopped at around 512, where I seem to have hit my disks' maximum sequential throughput anyway.

These get set by my scheduled scrub script and get set back to the defaults after it completes.
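
In case it helps anyone, here is a minimal sketch of what such a wrapper can look like (the pool name and polling loop are placeholders rather than my exact script; it saves the current values instead of hard-coding defaults):

Code:
#!/bin/sh
# Speed up a scheduled scrub, then restore the previous tunables.
P=/sys/module/zfs/parameters

save_rdelay=$(cat $P/zfs_resilver_delay)
save_sdelay=$(cat $P/zfs_scrub_delay)
save_inflight=$(cat $P/zfs_top_maxinflight)
save_mintime=$(cat $P/zfs_resilver_min_time_ms)

echo 0    > $P/zfs_resilver_delay
echo 0    > $P/zfs_scrub_delay
echo 512  > $P/zfs_top_maxinflight
echo 5000 > $P/zfs_resilver_min_time_ms

zpool scrub tank                            # "tank" is a placeholder pool name
while zpool status tank | grep -q "scrub in progress"; do
    sleep 60                                # poll until the scrub completes
done

echo "$save_rdelay"   > $P/zfs_resilver_delay
echo "$save_sdelay"   > $P/zfs_scrub_delay
echo "$save_inflight" > $P/zfs_top_maxinflight
echo "$save_mintime"  > $P/zfs_resilver_min_time_ms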

I'm also a bit confused about the claim that resilvering does lots of reads and writes on all of the drives. I would expect it to only be reading from the other drives and only writing to the new replacement drive.

That's all resilver does in my experience. If I watch zpool iostat 1 while I resilver, I see my disks all reading at ~135MB/s and the disk being resilvered writing at ~135MB/s.
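
If you want to watch the same thing, the per-disk view is just:

Code:
zpool iostat -v tank 1   # per-vdev / per-disk throughput, refreshed every second ("tank" = your pool)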

This is interesting. Just about every ZFS reference I've read recommends not using "too many" disks in a single raidz / raidz2 vdev. I thought I was pushing it with 10 1.5TB drives in raidz2 but you're using 12. Using that many disks in a single raidz2 vdev hasn't caused any problems for you?


I've been operating it like this for about 2 years now with no problems that I have encountered. I've seen people use bigger vdevs as well, and I don't really know why I shouldn't run it like this or even larger. Performance absolutely flies: I can read and write my media files to the zpool at around 1.3GB/s, which is far, far more performance than I would ever need. Resilvers and scrubs take only a handful of hours, so I am never degraded for very long. Besides, I keep backups, so in the worst case I just blow away the zpool, make a new one, and restore the data from backup.
 
A media server with a single user that is not filled up is not critical. The same goes for a backup system. I also use backup systems with a single raid-z2/3 vdev and a lot of disks.

That said, very large raid-z vdevs are not suggested for multiple users or for systems where high iops are a requirement, as such a vdev has the same iops as a single disk: every disk must seek to read or write a data block. Because a resilver or scrub must process all metadata, it is an iops-limited process, which can result in a very long resilver/scrub time, especially on systems with a high workload.
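
As a rough illustration, assuming a spinning disk is good for on the order of 100 random iops:

Code:
# 12 disks as 2 x 6-wide raid-z2 : 2 vdevs -> ~200 random iops (read or write)
# 12 disks as 6 x 2-way mirrors  : 6 vdevs -> ~600 random write iops, ~1200 random read iops
# 12 disks as 1 x 12-wide raid-z : 1 vdev  -> ~100 random iops, regardless of disk count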

read
Sequential Resilvering (Bizarre ! Vous avez dit Bizarre ?)
 
Now you guys are making me think maybe I should go all out and maximize storage.

12 disk raidz2, 11 disk raidz2 + hotspare, or maybe 11 disk raidz3 + spare.

Lots to think about..
 
An 11-disk raidz2 + hot spare is quite senseless compared to a 12-disk raid-z3.
Think of the raid-z3 like a z2 array where the spare is already resilvered when a disk fails.

(or: a lightning strike is allowed to toast 3 disks at a time without data loss)
 
For your usage, higher RAIDZ level is always preferable to a hot spare. I'd go 12 disk RAIDZ2 or 12 disk RAIDZ3 assuming you are indeed using 1 MiB recordsize and storing large media files.

In fact 1 MiB recordsize will actually gain you about 9% more usable capacity (in the form of less wasted overhead) with a 12-disk RAIDZ2.
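
The reason, roughly: raidz allocations are rounded up to a multiple of (parity + 1) sectors, so padding eats proportionally more of a small record. Assuming 4KiB sectors (ashift=12) on a 12-wide RAIDZ2 (10 data disks per stripe), the arithmetic comes out about like this:

Code:
# 128KiB record = 32 data sectors  -> 4 stripe rows  -> 8 parity sectors  -> 40 total, padded to 42
#                 usable fraction = 32 / 42  = 76.2%
# 1MiB record   = 256 data sectors -> 26 stripe rows -> 52 parity sectors -> 308 total, padded to 309
#                 usable fraction = 256 / 309 = 82.8%
# 82.8% / 76.2% ≈ 1.09, i.e. roughly 9% more usable capacity from the larger recordsize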

I made a spreadsheet calculating the overhead differences for different RAIDZ configurations, sector sizes, and recordsize values.

The forum automatically embeds the sheet and you can't see the comments and notes in the cells, so here is the direct URL:

Code:
https://docs.google.com/spreadsheets/d/1pdu_X2tR4ztF6_HLtJ-Dc4ZcwUdt6fkCjpnXxAEFlyA/edit?usp=sharing
 
What are you guys running for backups? Identical drive configurations in a separate box?

I've got capacity for an additional 4 drives in my 16 port LSI HBA which will be connected to an external JBOD with 4 bays.

12 10TB drives in a raidz3 would be 90TB of space. I'm a bit more inclined to do raidz3, as my backup solution doesn't give me the ability to back up absolutely everything. Also, I think 90TB is more than enough for the foreseeable future for my use cases.

A second zpool of an extra 4 drives would be either 20TB in raidz2, 30TB in raidz1, or even 40TB with no redundancy at all. Using only raidz1 here could be fine since, worst case, if the backup pool does fail I can just make a new one and rebuild it from the primary.
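
If I go that route, refreshing or rebuilding the backup pool from the primary should be as simple as a recursive snapshot plus zfs send/receive, something along these lines (pool names are placeholders for mine):

Code:
zfs snapshot -r tank@backup1
zfs send -R tank@backup1 | zfs receive -Fdu backuppool
# later, incrementally:
zfs snapshot -r tank@backup2
zfs send -R -i tank@backup1 tank@backup2 | zfs receive -Fdu backuppool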
 
I have everything backed up to another similar system. I also have everything backed up online with Amazon Cloud Drive for $5 a month.
 
For your usage, higher RAIDZ level is always preferable to a hot spare. I'd go 12 disk RAIDZ2 or 12 disk RAIDZ3 assuming you are indeed using 1 MiB recordsize and storing large media files.

Isn't RAIDZ3 slower than RAIDZ2?

If so, you could get more performance out of raidz2 with a hot spare than out of raidz3, without sacrificing any storage space. Also, the hot spare would spend all of its life idle, so when it is eventually put into use it will, in theory, be like a new drive.
 
For home use, IMO raidz2 is too annoying. Eventually, when you want to change anything, it's a major PITA, especially if you're just writing large media files and never changing them.

I've since switched to combination of snapraid array + zfs mirrors + mhddfs.

I have a 14+ disk snapraid array with 2 parity disks, basically for media files that rarely if ever change.
I have a zfs mirror for stuff that does change, and for new files.

Combine it all together using mhddfs so on network it looks like one massive drive.

If you write a new file to the network drive, it goes to the mirror. When the mirror starts filling up, I move the huge media files over to the snapraid array to free up mirror space. When the snapraid array starts filling up, I buy a new HDD to 'expand' it.
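
The glue layer is basically one FUSE mount, something like this (paths are made up and mine differ; as I understand it, mhddfs writes new files to the first branch with enough free space, so the mirror goes first):

Code:
mhddfs /mnt/zfsmirror,/mnt/snap/d1,/mnt/snap/d2 /mnt/storage -o allow_other,mlimit=100G
# /mnt/storage is what gets shared on the network as the one big drive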

Also, I don't really care about performance too much... it's fast enough. Still basically limited by GbE.
 
One of the nice things about ZFS is the ability to grow a pool, but the downside is the final outcome may not be exactly what you want.

Currently my pool consists of 2 RAIDZ1 vdevs, each with 5 disks. So it's kind of like RAIDZ2, although limited to 1 disk of redundancy per set of 5. The upside is that if 1 disk fails, only 5 disks are involved instead of the whole thing, so it kind of compartmentalizes a failure.

But again, while not ideal, it was nice to have the option to do this. The pool started out as a RAIDZ1 of 5 disks, and then I added 5 more disks as a second vdev without having to break anything.
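
Adding that second vdev was basically a one-liner, roughly like this (pool and device names here are just examples):

Code:
zpool add tank raidz1 da5 da6 da7 da8 da9   # second 5-disk raidz1 vdev alongside the first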
 
But here is my idea if I ever replace this.

If you are looking at mirroring, why not just create two separate systems? For most of us, we use ZFS for mass file storage and the benefits it offers for that; we aren't using it because it's an ideal filesystem for our workflow. So instead of having a fully mirrored system, use half the disks in the native filesystem (Windows, as RAID1 or a span) and then set up a separate computer that has a basic ZFS setup (even just 1 redundancy disk). Have the ZFS system occasionally back up the main filesystem. That way you keep the performance and convenience of a native filesystem in Windows, but still have a backup (covering both physical damage and file damage like an accidental delete), and it additionally has 1 layer of protection from a faulty disk.

ZFS is awesome and I swear by it, but integrating it into a mainstream OS (Windows or OSX) is a hassle. If you can stay completely in Linux or Unix, then you are OK. But otherwise you are dealing with how the filesystem is accessed natively (OSX can do this) or you are dealing with network speed as a bandwidth limiter.
 
One of the nice things about ZFS is the ability to grow a pool, but the downside is the final outcome may not be exactly what you want.

Currently my pool consists of 2 RAIDZ1 vdevs, each with 5 disks. So it's kind of like RAIDZ2, although limited to 1 disk of redundancy per set of 5. The upside is that if 1 disk fails, only 5 disks are involved instead of the whole thing, so it kind of compartmentalizes a failure.

Except that none of the remaining 4 disks in that vdev may fail, or you lose the whole pool.

I'm currently building two NAS boxes. A 6-disk RAIDZ2 would be enough for performance and better for capacity, but storage upgrades are so much easier with mirrors that I'm going with them in the end. After the initial 6 drives, it will be way easier and cheaper to add two drives at a time down the line as needed than to buy 6 new drives all at once.
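
That expansion path is just one command per pair, along the lines of (pool and device names are examples):

Code:
# grow the pool by one more 2-way mirror vdev
zpool add tank mirror da6 da7

# or replace both disks of an existing mirror with bigger ones and let that vdev grow
zpool set autoexpand=on tank
zpool replace tank da0 da8
zpool replace tank da1 da9   # once both resilvers finish, the vdev expands to the new size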
 
But here is my idea if I ever replace this.

If you are looking at mirroring, why not just create two separate systems? For most of us, we use ZFS for mass file storage and the benefits it offers for that; we aren't using it because it's an ideal filesystem for our workflow. So instead of having a fully mirrored system, use half the disks in the native filesystem (Windows, as RAID1 or a span) and then set up a separate computer that has a basic ZFS setup (even just 1 redundancy disk). Have the ZFS system occasionally back up the main filesystem. That way you keep the performance and convenience of a native filesystem in Windows, but still have a backup (covering both physical damage and file damage like an accidental delete), and it additionally has 1 layer of protection from a faulty disk.

ZFS is awesome and I swear by it, but integrating it into a mainstream OS (Windows or OSX) is a hassle. If you can stay completely in Linux or Unix, then you are OK. But otherwise you are dealing with how the filesystem is accessed natively (OSX can do this) or you are dealing with network speed as a bandwidth limiter.

This way, you have no bit-flip protection before it gets sent to the ZFS pool.
 
Except that none of the remaining 4 disks in that vdev may fail, or you lose the whole pool.

I'm currently building two NAS boxes. A 6-disk RAIDZ2 would be enough for performance and better for capacity, but storage upgrades are so much easier with mirrors that I'm going with them in the end. After the initial 6 drives, it will be way easier and cheaper to add two drives at a time down the line as needed than to buy 6 new drives all at once.

Agreed. I have a storage pool of 6x2 1TB disks (six mirrored vdevs), serving up storage to 2 vSphere hosts over iSCSI/10GbE, with a dozen or so VMs running on virtual disks. I had 4x2 a few months ago, then added 1 more mirrored vdev, and yet another just recently. The zpool is slightly unbalanced after each addition, but as the guests write to their disks, things gradually spread out...
 
This way, you have no bit-flip protection before it gets sent to the ZFS pool.

As noted though, for most of us this is for media storage. 1 flipped bit in an MPEG or JPEG is not going to be noticed. It's a nice feature, but I would still use ZFS for its other features and abilities even if it didn't offer the self-healing.
 
So after being a bit more realistic about my storage requirements, I've settled on 8 Seagate IronWolf 10TB disks in a raidz2, which affords me 50TB of usable space in ZFS. I figure this will last me at least 5 years, and by then, if I need more space, there will be bigger disks available. I save money on disks, save money on power, it should give a bit more performance than using more disks, and it stays within the "up to 8 disks" manufacturer spec, for whatever that's worth. I also went and got a brand new SuperMicro SC743TQ case, which has exactly 8 SAS drive bays, so it all fits together nicely.

I'm using the LSI 9201-16i 16-port SAS HBA. I moved the system to the new case, hooked up the new disks to the first 8 ports, left it open, and put the old case, also open, right next to it to wire up my old 10 drives: 8 of the old drives on the remaining 8 LSI ports and the final 2 on the motherboard's onboard SATA ports. I did a big rsync -a of all my data (about 6.3TB) from old to new and it finished in less than 16 hours.

One very curious thing is that the reported disk usage of my filesystems on the new array is larger. One of the most noticeable offenders is a zfs filesystem that houses some git repositories: the old one reports 14.3M while the new one reports 50.6M. As far as I can tell, the actual contents of the files are identical. I also checked the properties on the zfs filesystems and they are the same.

Can anyone explain why that might be? Perhaps the sector size on the physical disks? Some kind of zfs overhead for larger disks? camcontrol on FreeBSD 11 is reporting 512-byte sectors on my new disks, but I'm not sure I believe it.
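
For reference, on FreeBSD the reported sector sizes and the ashift the pool actually uses can be checked with something like this (pool and device names are examples):

Code:
diskinfo -v /dev/da0 | grep -E 'sectorsize|stripesize'   # logical vs physical sector size
zdb -C tank | grep ashift                                # ashift per vdev as recorded in the pool config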
 
What are you guys running for backups? Identical drive configurations in a separate box?

No way. I have more than 40TB of file storage, but TBH, less than 2TB of it could be considered critical, or even important, data. The rest of it is video that could be lost; it would be an inconvenience, but not a disaster. I back up to a couple of external 4TB drives via USB, with one drive kept offsite and rotated monthly. The bulk of what I back up is my music library, at about 1.4TB. It would take years to re-rip and accurately tag it all again; that's one thing I never want to lose to carelessness.

All of my personal files, under 5GB, are kept on Dropbox, so they're duplicated on at least three computers in my home, plus in the cloud.
 
zfs scrub finished in 2 hours 40 minutes with the new array. Not bad at all..

That's with about 9.7TB of data spread across several zfs filesystems on the pool. I used SirMaster's suggested recordsize only on the filesystem that stores large video files, and that filesystem uses only 3.5TB of the total space.
 
Jeez I feel pretty weak over here with my now 6 year old mdadm raid 5 array. Started life as a 3x500GB array, now a 5x2TB array.
 