Optimal raidz vdev sizes for 24 spindles?

ponky

Limp Gawd
Joined
Nov 27, 2012
Messages
178
Currently running 8x 4TB drives in a single raidz2 vdev, going to add 16 more disks. Random performance is going to suck anyway, so I'm going to max out sequential throughput to saturate 10G.

Some options (?) :

1) 4x 6-disk raidz2

If I had to pick right now without more planning, I'd go for this one because:
6 disks per vdev gives an optimal stripe size.
It's future proof; any wider vdev with 6TB+ disks sounds dangerous.

2) 3x 8-disk raidz2

Better than 1) capacity-wise, but a non-optimal stripe size. No idea about performance vs 1).

3) 2x 12-disk raidz3

.. meh.



Mirrors are not an option; I don't want to waste that much space.

About optimal and non-optimal stripe sizes (from the FreeNAS forums):

sub.mesa wrote:
As i understand, the performance issues with 4K disks isn’t just partition alignment, but also an issue with RAID-Z’s variable stripe size.
RAID-Z basically spreads the 128KiB recordsize across its data disks. That would lead to a formula like:
128KiB / (nr_of_drives – parity_drives) = maximum (default) variable stripe size
Let’s do some examples:
3-disk RAID-Z = 128KiB / 2 = 64KiB = good
4-disk RAID-Z = 128KiB / 3 = ~43KiB = BAD!
5-disk RAID-Z = 128KiB / 4 = 32KiB = good
9-disk RAID-Z = 128KiB / 8 = 16KiB = good
4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = BAD!
6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good
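
Quick sanity check of that formula for the layouts I'm considering (plain shell arithmetic, nothing ZFS-specific, just a sketch):

# stripe size per data disk = 128KiB recordsize / (disks - parity)
for layout in "6 2" "8 2" "12 3"; do
  set -- $layout
  awk -v n="$1" -v p="$2" 'BEGIN { printf "%d-disk raidz%d: %.1f KiB per data disk\n", n, p, 128 / (n - p) }'
done
# 6-disk raidz2:  32.0 KiB per data disk (multiple of 4K, "good")
# 8-disk raidz2:  21.3 KiB per data disk (not a multiple, "BAD" by this rule)
# 12-disk raidz3: 14.2 KiB per data disk (not a multiple, "BAD" by this rule)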



Please, give me your opinions, thanks :)
 
I would use 2x 11-disk raidz3 vdevs, and one disk as a spare. This is for increased reliability; performance will suck (on a single 11-disk raidz3 I scrub at 500-600MB/sec).
 
I would use 2x 11-disk raidz3 vdevs, and one disk as a spare. This is for increased reliability; performance will suck (on a single 11-disk raidz3 I scrub at 500-600MB/sec).

Well, scrub speed doesn't tell you that much about performance, since OpenZFS (I take it you're not on Solaris? :p) doesn't do sequential resilver; raidz3 resilvering is also the most stressful on the disks.
 
Since you're starting from scratch, I would test your various options.
 
What kind of data are you storing?

Don't forget that 128KiB is not the only recordsize since the largeblocks feature was added. You can go up to 1024KiB now. Overhead space associated with sector padding is pretty much nonexistent with 1024KiB recordsize with any number of disks in a vdev and performance can be better if you are dealing with larger files.
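
For reference, on a system where the feature is available, turning it on is roughly this (pool/dataset names are made up):

zpool set feature@large_blocks=enabled tank   # one-time pool feature flag
zfs set recordsize=1M tank/media              # per dataset; only affects newly written blocks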
 
What kind of data are you storing?

Don't forget that 128KiB is not the only recordsize since the largeblocks feature was added. You can go up to 1024KiB now. Overhead space associated with sector padding is pretty much nonexistent with 1024KiB recordsize with any number of disks in a vdev and performance can be better if you are dealing with larger files.

How does this work though?

If you go with a 1024KB recordsize, does it have to read an entire meg of data every time you try to access a smaller file, no matter how small?

If this is the case it would seem that increased large file performance would come at some serious small file performance costs.
 
I don't have as many drives as you, but my setup is similar.

I have 12x 4TB drives in mine, and after much reading I decided to go with 2x 6-drive RAIDz2. It just seemed like the best balance of performance, drive redundancy and space lost to redundancy, at least for me.

This was about a year ago though, and OpenZFS changes fast.
 
Just restating some information I posted in an older thread on this subject, for posterity:

Zarathustra[H];1040650861 said:
A lot of this is based on an old theory by sub.mesa on these forums regarding how 4k sector size drives that emulate older 512 byte sector size drives would function under ZFS.

I don't know if this is even an issue anymore. Back when these discussions started four years ago, there was a lot of talk of ZFS needing to include fixes to alleviate this problem going forward, but who knows if it ever happened.

The theory is as follows (in layman's terms):

In order to be compatible with older OSes, 4k-sector drives (needed for modern large drive sizes) fool operating systems into thinking they are 512-byte-sector drives.

Whenever an OS makes a request that is not aligned to 4k, the drive is forced into 512-byte emulation mode, which significantly slows it down.

Since ZFS uses 128KiB chunks, he came up with the following formula:

128KiB / (number of drives - parity drives).

If the resulting number is divisible by 4, then you shouldn't have a problem. If it is NOT divisible by 4, then - if this has not yet been fixed in ZFS - you might.

If we follow these calculations, optimal configurations would be as follows:

RaidZ: 3, 5 or 9 drives. (17 drives also winds up being divisible by 4, but this is above the 12 recommended as a max by the ZFS documentation.)

RaidZ2: 4, 6 or 10 drives. (And 18, which is above 12, as above, and not recommended.)

Raidz3: 7 and 11 drives. (And 19, which is above 12, as above, and not recommended.)

I am not sure if this is still the case, or historical data only. I have tried PMing sub.mesa but haven't received a response.

Back in 2010 there was talk about issuing some sort of gnop (geom_nop) command to the drives, forcing them to report their true 4k sector size to the OS and making this problem mostly go away. The problem at the time was that gnop was not persistent through reboots.
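
For posterity, the old workaround looked roughly like this on FreeBSD (pool and device names are made up), and newer OpenZFS lets you set the ashift directly instead:

# historical gnop trick: present a 4K-sector view of one disk before creating the pool
gnop create -S 4096 /dev/ada0
zpool create tank raidz2 /dev/ada0.nop /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4 /dev/ada5
# the .nop device disappears on reboot (the persistence complaint),
# but the pool keeps the ashift=12 it was created with;
# e.g. on ZFS on Linux you can skip gnop entirely with: zpool create -o ashift=12 ...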

Personally I have been running RAIDz2 with 6 drives for years. I accidentally chose the right size when I set it up, before I read all of this. I have done a little research into this, but haven't found anything regarding whether or not these problems are still real, or if something has been fixed since 2010.

Essentially, this is a fault that occurs because drive makers ship drives with a 4k/512-byte compatibility hack, not because of any inherent flaw in ZFS. Maybe ZFS has since included a workaround to force the drives into 4k mode; I am not sure. A lot can happen in 4 years.

At the time I was running on a single 6 drive RAIDz2 vdev. Since then I added 6 more drives in a second vdev.

I should add that I wholeheartedly recommend AGAINST RAIDz (RAIDz2 and 3 are great though), as IMHO single-drive parity is completely and totally obsolete, but the RAIDz information is in there for posterity.

RAID5 / RAIDz just needs to be killed off as even an option, as all it does is provide a false sense of security, IMHO.
 
Zarathustra[H];1041812894 said:
How does this work though?

If you go with a 1024KB recordsize, does it have to read an entire meg of data every time you try to access a smaller file, no matter how small?

If this is the case it would seem that increased large file performance would come at some serious small file performance costs.

No, since ZFS uses a variable recordsize.

A downside though would be that smaller reads would have to "get in line" behind these larger 1MiB block reads on a busy system.

Also don't forget that largeblocks are a dataset property, not a pool property.

I mainly store large media files on my server, and some of my smallest files are still couple-MB JPEGs, so largeblocks makes perfect sense for me. Everything is faster, as there are just a lot more IOPS to go around when each I/O is so much larger.

I also gained about 3TB of capacity by switching to largeblocks, which was nice.

I run a 12-disk wide RAIDZ2 vdev of 4TB disks and the allocation overhead of a 12-wide RAIDZ2 was nearly 10% with 128KiB recordsize.

However, when I moved all my data onto 1MiB recordsize datasets, my data "appeared" to take 10% less space than what it was using before.

This is just a strange artifact of how largeblocks work because ZFS still uses 128KiB as the calculation for "free space" since it's the default. Better for it to underestimate your pool capacity than to overestimate it.

I mainly started using it to gain capacity and didn't really lose any sort of meaningful performance for my use case.

https://www.illumos.org/issues/5027
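
If you want to see the effect on your own datasets, comparing logical vs. charged space shows it directly (dataset name is made up):

# logicalused = data size before compression; used = space actually charged to the dataset
# (after rewriting data with a 1M recordsize on RAIDZ, 'used' can drop noticeably)
zfs get -o name,property,value used,logicalused,recordsize,compressratio tank/media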
 
What kind of data are you storing?

Don't forget that 128KiB is not the only recordsize since the largeblocks feature was added. You can go up to 1024KiB now. Overhead space associated with sector padding is pretty much nonexistent with 1024KiB recordsize with any number of disks in a vdev and performance can be better if you are dealing with larger files.

It will be mostly media streamed around the house, but also an iSCSI LUN to my Windows desktop; that's why I need the performance.

I tried to avoid largeblocks because it's not supported in ZoL and I might have to switch to ZoL in the future.
 
It will be mostly media streamed around the house, but also an iSCSI LUN to my Windows desktop; that's why I need the performance.

I tried to avoid largeblocks because it's not supported in ZoL and I might have to switch to ZoL in the future.

largeblocks is supported in ZoL. I've been using it on Debian for a couple months now. It was committed back in May.
https://github.com/zfsonlinux/zfs/commit/f1512ee61e2f22186ac16481a09d86112b2d6788

I think it's at the very least worth enabling for your datasets that store big files, especially if you are using RAIDZ2 and aren't using 6-disk-wide vdevs, as you will definitely gain some capacity.

https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html
 
largeblocks is supported in ZoL. I've been using it on Debian for a couple months now. It was committed back in May.
https://github.com/zfsonlinux/zfs/commit/f1512ee61e2f22186ac16481a09d86112b2d6788

I think it's at the very least worth enabling for your datasets that store big files, especially if you are using RAIDZ2 and aren't using 6-disk-wide vdevs, as you will definitely gain some capacity.

https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html

How does it work if enabled retroactively on a dataset that has been in use for some time?

Does it apply to the entire dataset, or only to new writes?
 
Zarathustra[H];1041815105 said:
How does it work if enabled retroactively on a dataset that has been in use for some time?

Does it apply to the entire dataset, or only to new writes?

As with all dataset properties (at least the properties I'm aware of, like compression and such), it only applies to new writes. But you could, say, make a new dataset, move all the data you want into it, delete the old one, and then rename the new dataset to what the old one was called.

Actually, now that I think about it, maybe it doesn't remove anything from the source until the end. You would have to move pieces of the dataset at a time, I suppose, if you don't have the space for the whole dataset.
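
Very roughly, that looks like this (names are made up; mind snapshots, shares and permissions before destroying anything):

zfs create -o recordsize=1M tank/media_new
rsync -aHAX --remove-source-files /tank/media/ /tank/media_new/   # frees source space as it goes
zfs destroy tank/media
zfs rename tank/media_new tank/media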
 
As with all dataset properties (at least the properties I'm aware of, like compression and such), it only applies to new writes. But you could, say, make a new dataset, move all the data you want into it, delete the old one, and then rename the new dataset to what the old one was called.

Hmm.

Trying to figure out if I have enough free space to move 7TB of data out of a dataset and back :p

I wonder how much of a difference largeblocks would make in streaming multi gigabyte media files to a networked front end.

It's odd. This really shouldn't take much in the way of bandwidth, but occasionally I still get hiccups, despite having benched my disk reads at over 900MB/s with my current setup...
 
If you are doing a mv command though doesn't it remove each file from the source after it successfully copies it to the destination?

You shouldn't need any more free-space than whatever your largest single file is.
 
If you are doing a mv command though doesn't it remove each file from the source after it successfully copies it to the destination?

You shouldn't need any more free-space than whatever your largest single file is.

You are right. When you are in different datasets it DOES do this.

I was confusing myself because back when I was moving data back and forth within the SAME dataset (I had added a second vdev and wanted to redistribute the data across them), mv just updated the directory metadata; it didn't actually delete and rewrite the data, so I had to cp it all and then remove the source.

If the data is in different datasets, I believe mv treats it as if it is on different physical drives, and thus writes and deletes each file.
 
Yes, datasets in ZFS are treated as separate filesystems entirely. They are each mounted separately too.

I think I did make a small mistake in my assumption about mv. I believe that if you are mv-ing a directory, it does not remove the source files until the end of the entire mv operation.

So you would still need to mv smaller pieces of the dataset at a time, I guess.
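
In practice that looks something like this, one top-level directory at a time (paths are made up):

# each directory is fully copied and then removed before the next one starts,
# so the extra space needed at any moment is roughly one directory's worth
for d in /tank/media/*/ ; do
    mv "$d" /tank/media_new/
done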
 
I wouldn't personally get hung up on alignment/optimum drive numbers as there's very little to be had there and worrying about that tends to catch you out elsewhere.

I'd say either:
a) Go 3x8 z2 if you don't want to have to move all the data off the pool then back on again.
b) Go 4x6 z2 if you don't mind doing it.

b) will give you greater performance, but I wouldn't imagine it'll be anything to write home about; random performance will be less than stellar either way.
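
For what it's worth, a) is also the only option that's a pure zpool add on top of your existing 8-disk vdev; roughly (pool and device names are made up, use your own):

# keep the existing 8-disk raidz2 vdev and grow the pool with two more like it
zpool add tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0
zpool add tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0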
 
I wouldn't personally get hung up on alignment/optimum drive numbers as there's very little to be had there and worrying about that tends to catch you out elsewhere.

I'd say either:
a) Go 3x8 z2 if you don't want to have to move all the data off the pool then back on again.
b) Go 4x6 z2 if you don't mind doing it.

b) will give you greater performance, but I wouldn't imagine it'll be anything to write home about; random performance will be less than stellar either way.

I don't mind, I'm looking for best all-around setup (reliability, performance, capacity).
 
A) Will give you the most capacity.
B) Will give you the best performance and (debatably) reliability.
 
Zarathustra[H];1041815263 said:
If the data is in different datasets, I believe mv treats it as if it is on different physical drives, and thus writes and deletes each file.

That is correct :D

Basically, mv is just a front-end that does the magic work in the background.

If you stay within the same dataset (or partition, on a standard filesystem), mv only manipulates the top-level file name (directory entry and attributes) without touching the real data.
When you go outside that scope, mv copies the file (the cp equivalent), sets its attributes (the chmod equivalent), renames it if needed, and then deletes the old file.
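
In command-line terms, the cross-dataset case boils down to roughly this (paths are made up):

# what mv effectively does when source and destination are on different filesystems
cp -p /tank/old_dataset/file.bin /tank/new_dataset/file.bin   # copy the data, preserving mode/owner/timestamps
rm /tank/old_dataset/file.bin                                 # then remove the original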
 
Note that while it was committed back in May, it is not yet part of the current release (0.6.4.2); it's labeled for the 0.7 release. You'd have to build it yourself to get it now, which isn't hard.

I've put more detailed information, laid out for easy reading, on the ZoL milestone:
https://github.com/zfsonlinux/zfs/issues/354
and one issue has already been discovered: https://github.com/zfsonlinux/zfs/pull/3703
Could there be more? Let's see....

Hopefully it will be merged for the 0.7.0 release once it's ironed out.


It's the same kind of thing I hit with the 0.6.4 release, which had a minor annoyance where things went "missing" after rebooting CentOS 7.x :p. That has already been ironed out for release 0.6.5.
 
A) Will give you the most capacity.
B) Will give you the best performance and (debatably) reliability.

Yes, 3x8 would give me the best capacity, and probably good-enough performance (I need 1GB/s sequential read/write), but are 8-disk-wide raidz2 vdevs a good idea with 6TB, or even 8TB, drives? I'm currently using only 4TB drives, but I'd like to make the pool future proof for bigger disks as well.
 
Yes, 3x8 would give me the best capacity, and probably good-enough performance (I need 1GB/s sequential read/write), but are 8-disk-wide raidz2 vdevs a good idea with 6TB, or even 8TB, drives? I'm currently using only 4TB drives, but I'd like to make the pool future proof for bigger disks as well.


Just my opinion, for simplicity:
With 2TB/3TB drives, raidz2 for 8-12 drives; beyond that, raidz3.
With 4TB/6TB drives, raidz3 for 9 or more drives; fewer than that, raidz2.

Do not forget to backup :p
I have a backup system too. My setup is currently 12 drives in raidz2, and I've already had to replace 3 drives to date. Usually it's a single drive that fails, and since I have hot-swap bays I just pull and replace it and kick off the resilver from the command line.
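
The replacement itself is only a couple of commands, roughly (pool and device names are made up):

zpool status -x                  # find the pool with the faulted disk
zpool replace tank c2t5d0        # new disk in the same slot; resilver starts automatically
zpool status tank                # watch resilver progress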
 
Do not forget to backup :p
I have a backup system too. My setup is currently 12 drives in raidz2, and I've already had to replace 3 drives to date. Usually it's a single drive that fails, and since I have hot-swap bays I just pull and replace it and kick off the resilver from the command line.

Agreed. This one deserves repeating time and time again, even if people have heard it before.

Drive redundancy is NOT backup, and it is NOT an adequate alternative to backup.

No matter what RAID/NAS setup you have, still ALWAYS backup.

RAID/ZFS will protect you against things like drive failures, but not against accidental deletions (oops, I just did an "rm -rf *" and was in the wrong folder :eek: ), filesystem corruption, or exceeding your parity with hardware failures (if you have RAIDz2 and 3 drives in the same vdev fail); and even if you don't exceed it (RAIDz2 with 2 failed drives), you will be resilvering without parity, with a higher risk of corruption.

Add in the "acts of god", you know, fires, floods, lightning strikes, power spikes, etc. etc.

Always always always always backup, if you care about your data.

(If you don't care about your data, why are you building an expensive setup with several thousand dollars' worth of disks anyway? :p )


I use Crashplan. It is a little slow, but their unlimited plan works for me. If I ever need to restore it will be a LONG process, but I can live with it. I have about 8TB of data on their servers now.

Also, I don't trust their encryption, or the fact that they hold the key on their servers rather than me, so I pre-encrypt everything with reverse filesystem encryption before it gets sent to them. My encryption keys are backed up elsewhere.
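
(As an illustration only, and not necessarily the exact tool used here: encfs's reverse mode is one way to get this kind of setup. Paths below are made up.)

# encfs --reverse presents an encrypted view of existing plaintext without storing a second copy
encfs --reverse /tank/data /backup/encrypted-view
# then point the backup client at /backup/encrypted-view instead of /tank/data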
 
Zarathustra[H];1041819780 said:
No matter what RAID/NAS setup you have, still ALWAYS backup.
True. :)

RAID/ZFS will protect you against things like drive failures, but not against accidental deletions (oops, I just did an "rm -rf *" and was in the wrong folder
Snapshots in ZFS protect against accidental deletions. I always have a snapshot of every filesystem; I delete files, edit files, copy files, etc., and I can always roll back in time to recover deletions or unwanted edits.

Once in a while, I delete the snapshot, which means all these changes become permanent. As soon as the snapshot is deleted (which takes a second or so), my first action is to immediately take a new snapshot; I am very careful not to do anything else in between. And then I am safe against unwanted deletions again. Repeat.
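
In commands, that routine is basically this (filesystem name is made up):

zfs snapshot tank/docs@safety        # the standing safety net
# ...delete, edit, copy files as usual...
zfs rollback tank/docs@safety        # undo everything since the snapshot, if needed
# or, once in a while, let the changes stick and re-arm:
zfs destroy tank/docs@safety && zfs snapshot tank/docs@safety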
 
Zarathustra[H];1041819780 said:
Agreed. This one deserves repeating time and time again, even if people have heard it before.

Drive redundancy is NOT backup, and it is NOT an adequate alternative to backup.

No matter what RAID/NAS setup you have, still ALWAYS backup.

RAID/ZFS will protect you against things like drive failures, but not against accidental deletions (oops, I just did an "rm -rf *" and was in the wrong folder :eek: ), filesystem corruption, or exceeding your parity with hardware failures (if you have RAIDz2 and 3 drives in the same vdev fail); and even if you don't exceed it (RAIDz2 with 2 failed drives), you will be resilvering without parity, with a higher risk of corruption.

Add in the "acts of god", you know, fires, floods, lightning strikes, power spikes, etc. etc.

Always always always always backup, if you care about your data.

(If you don't care about your data, why are you building an expensive setup with several thousand dollars' worth of disks anyway? :p )


I use Crashplan. It is a little slow, but their unlimited plan works for me. If I ever need to restore it will be a LONG process, but I can live with it. I have about 8TB of data on their servers now.

Also, I don't trust their encryption, or the fact that they hold the key on their servers rather than me, so I pre-encrypt everything with reverse filesystem encryption before it gets sent to them. My encryption keys are backed up elsewhere.

Backups are there, no worries.

Think I'm going for 6 disks per vdev. 8x 6/8/10TB drives in a single raidz2 vdev sounds hazardous. Thanks for all the suggestions!
 
Think I'm going for 6 disks per vdev. 8x 6/8/10TB drives in a single raidz2 vdev sounds hazardous. Thanks for all the suggestions!

Excellent choice!

I'd be curious what kind of performance results you get with that, specifically with regards to CPU load. (What CPU are you using?)

It probably won't be a problem, but I am curious how CPU load will scale with 4 vdevs with 2 drives' worth of parity each. That's a lot of parity calculations!

Is this bare metal or virtualized?
 
Zarathustra[H];1041821636 said:
Excellent choice!

I'd be curious what kind of performance results you get with that, specifically with regards to CPU load. (What CPU are you using?)

It probably won't be a problem, but I am curious how CPU load will scale with 4 vdevs with 2 drives' worth of parity each. That's a lot of parity calculations!

Is this bare metal or virtualized?

I'll just list everything, since other people might wanna know as well. It's currently running on bare metal.

Xeon E3-1230v3
32GB ddr3 ECC
Supermicro X10SLH-f
SC846BA-R920B
SAS3-846EL1 backplane
M1015
2x 40GB Intel 320 SSD for OmniOS
Intel X520 dual-port 10G NIC

24x assorted 4TB HDDs: a mix of WD Reds, Hitachi Deskstar NAS and Seagate NAS.


I'm still waiting for an SFP+ version of a Xeon D board from Supermicro. Once (if..) I get one of those I can upgrade to 128GB of RAM.
 
I'll just list everything, since other people might wanna know as well. It's currently running on bare metal.

Xeon E3-1230v3
32GB ddr3 ECC
Supermicro X10SLH-f
SC846BA-R920B
SAS3-846EL1 backplane
M1015
2x 40GB Intel 320 SSD for OmniOS
Intel X520 dual-port 10G NIC

24x assorted 4TB HDDs: a mix of WD Reds, Hitachi Deskstar NAS and Seagate NAS.


I'm still waiting for an SFP+ version of a Xeon D board from Supermicro. Once (if..) I get one of those I can upgrade to 128GB of RAM.

Nice. Yeah, I was going to say, you might want to increase the RAM a bit.

I know the often-thrown-about 1GB of RAM per TB of disk space rule for ZFS is an extreme worst-case scenario, but I don't think I'd want to do 96TB on 32GB of RAM.

I've been eyeballing those Xeon D boards as well. If it weren't for the fact that I'd have to spend so much money rebuying DDR4 RAM, I'd probably have one already.
 
Zarathustra[H];1041821743 said:
Nice. Yeah, I was going to say, you might want to increase the RAM a bit.

I know the often-thrown-about 1GB of RAM per TB of disk space rule for ZFS is an extreme worst-case scenario, but I don't think I'd want to do 96TB on 32GB of RAM.

I've been eyeballing those Xeon D boards as well. If it weren't for the fact that I'd have to spend so much money rebuying DDR4 RAM, I'd probably have one already.

I think this setup is still doable even with 32GB of RAM. Also, that's the max for this board/CPU.

Upgrading to socket 2011 now would be overkill. Xeon D is the optimal storage platform, but I am never going to touch any 10GBASE-T stuff. I wonder if Intel will release a Skylake Xeon SoC?
 
I think this setup is still doable even with 32GB of RAM. Also, that's the max for this board/CPU.

Upgrading to socket 2011 now would be overkill. Xeon D is the optimal storage platform, but I am never going to touch any 10GBASE-T stuff. I wonder if Intel will release a Skylake Xeon SoC?

Just because I am curious, why do you dislike 10GBASE-T?

I have a fiber run to my basement, but I would much prefer not having to deal with fibers and transceivers, such a pain.

I've been waiting for copper 10G to become affordable forever; now it looks like it might finally be happening...
 
Zarathustra[H];1041822857 said:
Just because I am curious, why do you dislike 10GBASE-T?

I have a fiber run to my basement, but I would much prefer not having to deal with fibers and transceivers, such a pain.

I've been waiting for copper 10G to become affordable forever; now it looks like it might finally be happening...

10GBASE-T draws too much power, is too expensive and fiber gives so much more flexibility.
 
10G fibre also has a lot less latency than 10GBASE-T. Not really relevant if sequential throughput is what you're after, but still...

0.3 microseconds per frame for fibre, 2 to 2.5 microseconds per frame for copper. Combine this with the increased power consumption of copper and the existence of fiberstore.com and you have to be nuts to run it on copper.

Source of the numbers:
http://www.datacenterknowledge.com/...benefits-of-deploying-sfp-fiber-vs-10gbase-t/
 

Interesting. That makes absolutely no sense to me. Fiber SHOULD be inherently higher latency due to needing a transceiver on each side... The act of changing the signal from one form of energy to another is always going to add some delay.

They must have REALLY botched the 10GBaseT implementation in that case.

The higher power use I understand, and expected, but didn't think it would be enough to be significant compared to the rest of the server (especially considering how many spinning HD's we are talking about)
 
I'm just annoyed, because for a decade I have been waiting for 10G to get to where 10Mbit, 100Mbit and 1Gbit got before it: mass consumer adoption, with adapters available for less than $10 apiece.

It just doesn't seem like it is going to happen because consumers are too infatuated with wifi garbage.

My biggest gripe with 10 Gigabit has been all the expense. Firstly, the adapters usually aren't cheap in and of themselves. Then you need transceivers both for your adapter and your switch, which are usually expensive too (even the old 1-gig HP-recommended transceivers for my switch are over $350 apiece!). Now this has gotten better - as you note - with Fiberstore.

And it only gets worse if you want a 10Gbit switch...

I feel like in 2015 (the bloody future, if you ask me), I should be able to buy a PCIe 10Gbit adapter for $10 and a 24-port 10Gig switch for ~$100, and connect the two with a $3 wire from Monoprice.

Instead I have this complicated setup where I have put a Brocade BR1020 adapter in my server and in my workstation, because it was the only affordable 10Gig adapter I could find used on eBay, and got some LC-LC OM3 cable and transceivers from Fiberstore.

Everything else in the house still uses my gigabit procurve managed switch, because 10G switches are still out there in crazy territory from a price perspective.

I have a lot of simultaneous traffic going back and forth to my server, so I used passive link aggregation across the four gigabit ports from the server to my switch, using VMware's kind of lame "route based on IP hash" setup.

All this would have been so much easier with a simple 10Gbit copper switch and adapters, or if I could even find an affordable switch with a couple of 10Gig uplink ports...
 