Optimal raidz vdev sizes for 24 spindles?

Discussion in 'SSDs & Data Storage' started by ponky, Aug 24, 2015.

  1. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    Currently running 8x 4TB drives in a single raidz2 vdev, going to add 16 more disks. Random performance is going to suck anyway, so I'm going to max out sequential throughput to saturate 10G.

    Some options (?) :

    1) 4x 6disk raidz2

    If I had to pick right now without further planning, I'd go for this one because:
    A 6-disk raidz2 vdev has an optimal stripe size.
    It's future-proof; any wider vdev with 6TB+ disks sounds dangerous.

    2) 3x 8disk raidz2

    Better than 1) capacity-wise, but a non-optimal stripe size. No idea how performance compares to 1).

    3) 2x 12disk raidz3

    .. meh.



    Mirrors are not an option; I don't want to waste that much space.

    About optimal and non-optimal stripe sizes (from the FreeNAS forums):

    sub.mesa wrote:
    As I understand it, the performance issues with 4K disks aren't just partition alignment, but also an issue with RAID-Z's variable stripe size.
    RAID-Z basically spreads the 128KiB recordsize across its data disks. That leads to a formula like:
    128KiB / (nr_of_drives - parity_drives) = maximum (default) variable stripe size
    Let’s do some examples:
    3-disk RAID-Z = 128KiB / 2 = 64KiB = good
    4-disk RAID-Z = 128KiB / 3 = ~43KiB = BAD!
    5-disk RAID-Z = 128KiB / 4 = 32KiB = good
    9-disk RAID-Z = 128KiB / 8 = 16KiB = good
    4-disk RAID-Z2 = 128KiB / 2 = 64KiB = good
    5-disk RAID-Z2 = 128KiB / 3 = ~43KiB = BAD!
    6-disk RAID-Z2 = 128KiB / 4 = 32KiB = good
    10-disk RAID-Z2 = 128KiB / 8 = 16KiB = good
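
    If you want to sanity-check those numbers for other widths, a quick shell loop does the same arithmetic (just a sketch, assuming the default 128KiB recordsize and raidz2):

        # print the per-data-disk stripe size for a few raidz2 vdev widths
        recordsize_kib=128
        parity=2
        for width in 4 5 6 8 10 12; do
            data_disks=$(( width - parity ))
            # use bc for the division so the non-integer ("BAD") cases are visible
            stripe=$(echo "scale=1; $recordsize_kib / $data_disks" | bc)
            echo "${width}-disk raidz2: ${stripe} KiB per data disk"
        done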



    Please, give me your opinions, thanks :)
     
  2. bds1904

    bds1904 Gawd

    Messages:
    1,006
    Joined:
    Aug 10, 2011
    4x 6 disk raidz2 will give you the best performance out of those options.
     
  3. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    I would use 2x 11-disk raidz3 vdevs, and one disk as a spare. This is for increased reliability; performance will suck (on a single 11-disk raidz3 I scrub at 500-600MB/sec).
     
  4. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    Well, scrub speed doesn't tell you that much about performance, since OpenZFS (I'm guessing you're not on Solaris? :p) doesn't do sequential resilvers. Also, raidz3 resilvering is the most stressful on the disks.
     
  5. ToddW2

    ToddW2 2[H]4U

    Messages:
    4,019
    Joined:
    Nov 8, 2004
    4x 6 disk raidz2
     
  6. Aesma

    Aesma [H]ard|Gawd

    Messages:
    1,844
    Joined:
    Mar 24, 2010
    Since you're starting from scratch, I would test your various options.
     
  7. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    What kind of data are you storing?

    Don't forget that 128KiB is not the only recordsize anymore, since the largeblocks feature was added. You can go up to 1024KiB now. The space overhead from sector padding is pretty much nonexistent with a 1024KiB recordsize, regardless of the number of disks in a vdev, and performance can be better if you're dealing with larger files.
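
    For reference, turning it on is just a pool feature flag plus a dataset property (the pool/dataset names here are made up):

        # enable the pool feature (once blocks >128KiB are written, older ZFS versions can't import the pool)
        zpool set feature@large_blocks=enabled tank
        # bump the recordsize on a dataset that holds big files
        zfs set recordsize=1M tank/media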
     
  8. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    How does this work though?

    If you go with a 1024KiB recordsize, does it have to read an entire megabyte of data every time you try to access a smaller file, no matter how small?

    If that's the case, it would seem that the increased large-file performance comes at some serious small-file performance cost.
     
  9. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    I don't have as many drives as you, but my setup is similar.

    I have 12x 4TB drives in mine, and after much reading I decided to go with 2x 6-drive RAIDz2. It just seemed like the best balance of performance, drive redundancy, and space lost to redundancy, at least for me.

    This was about a year ago though, and OpenZFS changes fast.
     
  10. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Just restating some information I posted in an older thread on this subject, for posterity:

    At the time I was running on a single 6 drive RAIDz2 vdev. Since then I added 6 more drives in a second vdev.

    I should add that I wholeheartedly recommend AGAINST RAIDz (RAIDz2 and 3 are great, though), as IMHO single-drive parity is completely and totally obsolete, but the RAIDz information is in there for posterity.

    RAID5 / RAIDz just needs to be killed off as even an option, as all it does is provide a false sense of security, IMHO.
     
  11. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    No, since ZFS uses a variable recordsize.

    A downside though would be that smaller reads would have to "get in line" behind these larger 1MiB block reads on a busy system.

    Also don't forget that largeblocks are a dataset property, not a pool property.

    I mainly store large media files on my server, and some of my smallest files are still couple-MB JPEGs, so largeblocks makes perfect sense for me. Everything is faster because there are a lot more IOPS to go around when each I/O moves so much more data.

    I also gained about 3TB of capacity by switching to largeblocks, which was nice.

    I run a 12-disk wide RAIDZ2 vdev of 4TB disks and the allocation overhead of a 12-wide RAIDZ2 was nearly 10% with 128KiB recordsize.

    However, when I moved all my data onto 1MiB recordsize datasets, my data "appeared" to take 10% less space than what it was using before.

    This is just a strange artifact of how largeblocks work: ZFS still assumes the default 128KiB recordsize when it calculates "free space". Better for it to underestimate your pool capacity than to overestimate it.

    I mainly started using it to gain capacity and didn't really lose any sort of meaningful performance for my use case.

    https://www.illumos.org/issues/5027
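
    If anyone wants to play with the padding math, here's roughly how I understand the allocation works (just a back-of-the-envelope sketch, assuming ashift=12, i.e. 4K sectors, not an authoritative implementation):

        # approximate RAIDZ allocation for one record: data sectors plus per-row parity,
        # rounded up to a multiple of (parity + 1)
        width=12; parity=2; recordsize_kib=128        # try recordsize_kib=1024 to compare
        sectors=$(( recordsize_kib / 4 ))             # 4KiB sectors per record
        data_disks=$(( width - parity ))
        rows=$(( (sectors + data_disks - 1) / data_disks ))
        alloc=$(( sectors + rows * parity ))
        rem=$(( alloc % (parity + 1) ))
        [ "$rem" -ne 0 ] && alloc=$(( alloc + (parity + 1) - rem ))
        echo "$recordsize_kib KiB record on a ${width}-wide raidz${parity}: $sectors data sectors, $(( alloc - sectors )) sectors of parity+padding"

    With 128KiB records that comes out noticeably worse than the nominal 10/12 data fraction; with 1MiB records the padding all but disappears, which lines up with the roughly 10% I got back.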
     
  12. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    It will be mostly media streamed around the house, but also an iSCSI LUN for my Windows desktop; that's why I need the performance.

    I've tried to avoid largeblocks because it's not supported in ZoL, and I might have to switch to ZoL in the future.
     
  13. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    largeblocks is supported in ZoL. I've been using it on Debian for a couple months now. It was committed back in May.
    https://github.com/zfsonlinux/zfs/commit/f1512ee61e2f22186ac16481a09d86112b2d6788

    I think it's at the very least worth enabling for your datasets that store big files, especially if you are using RAIDZ2 and aren't using 6-disk-wide vdevs, as you will definitely gain some capacity.

    https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html
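
    (To check whether it's actually in effect, something like this works; the pool name is just an example:)

        zpool get feature@large_blocks tank
        zfs get -r recordsize tank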
     
    Last edited: Aug 25, 2015
  14. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    How does it work if it's enabled retroactively on a dataset that has been in use for some time?

    Does it apply to the entire dataset, or only to new writes?
     
  15. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    As with all dataset properties (at least the ones I'm aware of, like compression), it only applies to new writes. But you could, say, make a new dataset, move all the data you want into it, delete the old one, and then rename the new dataset to what the old one was called.

    Actually, now that I think about it, maybe mv doesn't remove anything from the source until the end. You would have to move the dataset in pieces, I suppose, if you don't have the space for a full copy.
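
    Roughly what I mean, as commands (dataset names are made up, and this assumes there's room for whatever is in flight):

        # new dataset picks up the bigger recordsize for everything written into it
        zfs create -o recordsize=1M tank/media_new
        # move the data over (dotfiles at the top level aren't caught by this glob)
        mv /tank/media/* /tank/media_new/
        # once the old dataset is empty, swap the names
        zfs destroy -r tank/media
        zfs rename tank/media_new tank/media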
     
    Last edited: Aug 25, 2015
  16. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Hmm.

    Trying to figure out if I have enough free space to move 7TB of data out of a dataset and back :p

    I wonder how much of a difference largeblocks would make in streaming multi gigabyte media files to a networked front end.

    It's odd. This really shouldn't take much in the way of bandwidth, but occasionally I still get hiccups, despite having benched my disk reads at over 900MB/s with my current setup...
     
  17. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    If you are doing a mv, though, doesn't it remove each file from the source after it successfully copies it to the destination?

    You shouldn't need any more free-space than whatever your largest single file is.
     
  18. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    You are right. When you are moving between different datasets it DOES do this.

    I was confusing myself because back when I was moving data back and forth within the SAME dataset (I added a second vdev and wanted to redistribute the data across them), mv just updated the directory metadata; it didn't actually delete and rewrite the data, so I had to cp it all and then remove the source.

    If the data is in different datasets, I believe mv treats them as if they were on different physical drives, and thus rewrites and deletes each file.
     
  19. SirMaster

    SirMaster 2[H]4U

    Messages:
    2,122
    Joined:
    Nov 8, 2010
    Yes, datasets in ZFS are treated as entirely separate filesystems. They are each mounted separately too.

    I think I did make a small mistake in my assumption about mv. I believe that if you are mv'ing a directory, it does not remove the source files until the end of the entire mv operation.

    So you would still need to mv smaller pieces of the dataset at a time, I guess.
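
    A simple way around that is to move one top-level entry at a time (paths are placeholders), so you never need more free space than the largest single directory:

        # move one top-level file or directory at a time between datasets
        for d in /tank/media/* ; do
            mv "$d" /tank/media_new/ || break
        done

    rsync with --remove-source-files is another option if you'd rather have per-file deletion.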
     
  20. bmh.01

    bmh.01 Gawd

    Messages:
    610
    Joined:
    Mar 28, 2002
    I personally wouldn't get hung up on alignment/optimal drive counts; there's very little to be gained there, and worrying about it tends to catch you out elsewhere.

    I'd say either:
    a) Go 3x8 z2 if you don't want to have to move all the data off the pool then back on again.
    b) Go 4x6 z2 if you don't mind doing it.

    b) will give you greater performance, but I wouldn't imagine it'll be anything to write home about; random performance will be less than stellar either way.
     
  21. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    I don't mind, I'm looking for best all-around setup (reliability, performance, capacity).
     
  22. bmh.01

    bmh.01 Gawd

    Messages:
    610
    Joined:
    Mar 28, 2002
    A) Will give you the most capacity.
    B) Will give you the best performance and (debatably) reliability.
     
  23. zrav

    zrav Limp Gawd

    Messages:
    163
    Joined:
    Sep 22, 2011
    Note that while it was committed back in May, it is not yet part of the current release (0.6.4.2); it's slated for the 0.7 release. You'd have to build it yourself to get it now, which isn't hard.
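
    From memory it's roughly the usual autotools routine (with SPL built and installed first; check the ZoL docs for the exact steps for your distro):

        git clone https://github.com/zfsonlinux/zfs.git
        cd zfs
        ./autogen.sh && ./configure && make -s
        sudo make install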
     
  24. cantalup

    cantalup Gawd

    Messages:
    758
    Joined:
    Feb 8, 2012
    That is correct :D

    Basically, mv is just a front-end that does the magic work in the background.

    If you stay within the same dataset (or partition, on a standard filesystem), mv only manipulates the top-level filename (the directory entry and attributes) without touching the actual data.
    When the destination is outside that scope, mv will copy the file (the cp equivalent), set its attributes (the chmod equivalent), rename it if needed, and then delete the old file.
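
    You can watch mv do exactly that with strace (the file and destination path are placeholders):

        # same dataset: essentially just a rename() call
        # different dataset: opens on source and destination, then unlink() of the source
        # (the actual read/write calls aren't shown by the "file" filter)
        strace -f -e trace=file mv somefile /other/dataset/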
     
  25. cantalup

    cantalup Gawd

    Messages:
    758
    Joined:
    Feb 8, 2012
    I've put the detailed information on the ZoL milestone for easier reading:
    https://github.com/zfsonlinux/zfs/issues/354
    One issue has already been discovered: https://github.com/zfsonlinux/zfs/pull/3703
    and there could be more? Let's see....

    Hopefully it will be merged in the 0.7.0 release, once it's been ironed out.


    It's the same thing I ran into with the 0.6.4 release, which had a minor annoyance with things going "missing" after rebooting CentOS 7.x :p. That has already been ironed out for the 0.6.5 release.
     
  26. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    Yes, 3x8 would give me the best capacity, and probably good-enough performance (I need 1GB/s sequential read/write), but are 8-disk-wide raidz2 vdevs a good idea with 6TB, or even 8TB, drives? I'm currently using only 4TB drives, but I'd like to make the pool future-proof for bigger disks as well.
     
  27. cantalup

    cantalup Gawd

    Messages:
    758
    Joined:
    Feb 8, 2012

    Just my opinion, for simplicity:
    2TB/3TB drives: RAIDZ2 for 8-12 drives, RAIDZ3 beyond that.
    4TB/6TB drives: RAIDZ3 for 9 or more drives, RAIDZ2 for fewer.

    Don't forget to back up :p
    I have a backup system too. My setup is currently 12 drives in raidz2, and I've already replaced 3 drives as of today. Usually it's a single drive that fails, and since I have hot-swap bays, I just pull and replace it and kick off the resilver from the command line.
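
    The replace itself is a one-liner (the pool and device names below are just placeholders):

        # after pulling the dead disk and inserting the new one in the same bay
        zpool replace tank ata-OLD_DISK_SERIAL ata-NEW_DISK_SERIAL
        zpool status tank     # watch the resilver progress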
     
    Last edited: Aug 27, 2015
  28. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Agreed. This one deserves repeating time and time again, even if people have heard it before.

    Drive redundancy is NOT backup, and it is NOT an adequate alternative to backup.

    No matter what RAID/NAS setup you have, still ALWAYS backup.

    RAID/ZFS will protect you against things like drive failures, but not against accidental deletions (oops, I just did an "rm -rf *" and was in the wrong folder :eek: ), filesystem corruption, exceeding your parity with hardware failures (e.g. RAIDz2 and 3 drives in the same vdev fail), or even cases where you don't exceed it (with RAIDz2 and 2 failed drives, you will be resilvering without parity, with a higher risk of corruption).

    Add in the "acts of god", you know, fires, floods, lightning strikes, power spikes, etc. etc.

    Always always always always backup, if you care about your data.

    (If you don't care about your data, why are you building an expensive setup with several thousand dollars' worth of disks anyway? :p )


    I use Crashplan. It is a little slow, but their unlimited plan works for me. If I ever need to restore it will be a LONG process, but I can live with it. I have about 8TB of data on their servers now.

    Also, I don't trust their encryption, or the fact that they hold the key on their servers rather than me, so I pre-encrypt everything with reverse filesystem encryption before it gets sent to them. My encryption keys are backed up elsewhere.
     
  29. brutalizer

    brutalizer [H]ard|Gawd

    Messages:
    1,593
    Joined:
    Oct 23, 2010
    True. :)

    Snapshots in ZFS protect against accidental deletions. I always keep a snapshot of every filesystem; I delete, edit, and copy files as usual, and I can always roll back in time to recover deleted files or unwanted edits.

    Once in a while I delete the snapshot, which means all those changes become permanent. As soon as the snapshot is deleted (which takes a second or so), my first action is to immediately take a new one; I am very careful not to do anything else in between. Then I am safe against unwanted deletions again. Repeat.
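
    In commands, that routine is just a couple of lines (the dataset name is made up):

        zfs snapshot tank/documents@before-cleanup     # take the safety net
        # ...delete/edit/copy files as usual...
        zfs rollback tank/documents@before-cleanup     # undo everything since the snapshot
        # or, when you're sure, retire it and immediately take a fresh one:
        zfs destroy tank/documents@before-cleanup
        zfs snapshot tank/documents@$(date +%F)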
     
  30. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    Backups are there, no worries.

    I think I'm going for 6 disks per vdev. Eight 6/8/10TB drives in a single raidz2 sounds hazardous. Thanks for all the suggestions!
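
    For reference, the end state gets created something like this (device names are obviously placeholders, and getting there from the existing 8-disk vdev means destroying and re-creating the pool after moving the data off):

        # new pool: four 6-disk raidz2 vdevs
        zpool create tank \
            raidz2 disk1  disk2  disk3  disk4  disk5  disk6 \
            raidz2 disk7  disk8  disk9  disk10 disk11 disk12 \
            raidz2 disk13 disk14 disk15 disk16 disk17 disk18 \
            raidz2 disk19 disk20 disk21 disk22 disk23 disk24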
     
  31. bmh.01

    bmh.01 Gawd

    Messages:
    610
    Joined:
    Mar 28, 2002
    You'll be fine, good luck with it :).
     
  32. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Excellent choice!

    I'd be curious what kind of performance results you get with that, specifically with regards to CPU load. (What CPU are you using?)

    It probably won't be a problem, but I am curious how CPU load will scale with 4 vdevs with 2 drives' worth of parity each. That's a lot of parity calculations!

    Is this bare metal or virtualized?
     
  33. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    I'll just list everything, since other people might wanna know as well. It's currently running on bare metal.

    Xeon E3-1230v3
    32GB ddr3 ECC
    Supermicro X10SLH-f
    SC846BA-R920B
    SAS3-846EL1 backplane
    M1015
    2x 40GB Intel 320 SSD for OmniOS
    Intel X520 dual port 10G nic

    24x various 4TB HDDs: a mix of WD Reds, Hitachi Deskstar NAS, and Seagate NAS.


    I'm still waiting for the SFP+ version of Supermicro's Xeon D board. Once (if..) I get one of those, I can upgrade to 128GB of RAM.
     
  34. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Nice. Yeah, I was going to say, you might want to increase the RAM a bit.

    I know the oft-quoted 1GB of RAM per TB of disk space for ZFS is an extreme worst-case scenario, but I don't think I'd want to run 96TB on 32GB of RAM.

    I've been eyeballing those Xeon D boards as well. If it weren't for the fact that I'd have to spend so much money re-buying DDR4 RAM, I'd probably have one already.
     
  35. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    I think this setup is still doable even with 32GB of RAM. That's also the max for this board/CPU.

    Upgrading to socket 2011 now would be overkill. Xeon D is the optimal storage platform, but I am never going to touch any 10GBASE-T stuff. I wonder if Intel will release a Skylake Xeon SoC?
     
  36. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Just because I am curious, why do you dislike 10GBASE-T?

    I have a fiber run to my basement, but I would much prefer not having to deal with fibers and transceivers; such a pain.

    I've been waiting forever for copper 10G to become affordable, and now it looks like it might finally be happening...
     
  37. ponky

    ponky Limp Gawd

    Messages:
    180
    Joined:
    Nov 27, 2012
    10GBASE-T draws too much power and costs too much, and fiber gives you so much more flexibility.
     
  38. rsq

    rsq Limp Gawd

    Messages:
    246
    Joined:
    Jan 11, 2010
    10G fibre also has a lot less latency than 10GBASE-T. Not really relevant if sequential throughput is what you're after, but still...

    0.3 microseconds per frame for fibre, 2 to 2.5 microseconds per frame for copper. Combine this with the increased power consumption of copper and the existence of fiberstore.com, and you have to be nuts to run it on copper.

    Source of the numbers:
    http://www.datacenterknowledge.com/...benefits-of-deploying-sfp-fiber-vs-10gbase-t/
     
  39. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    Interesting. That makes absolutely no sense to me. Fiber SHOULD be inherently higher latency due to needing a transceiver on each side... The act of changing the signal from one form of energy to another is always going to add some delay.

    They must have REALLY botched the 10GBASE-T implementation in that case.

    The higher power use I understand and expected, but I didn't think it would be significant compared to the rest of the server (especially considering how many spinning HDDs we are talking about).
     
  40. Zarathustra[H]

    Zarathustra[H] Official Forum Curmudgeon

    Messages:
    28,629
    Joined:
    Oct 29, 2000
    I'm just annoyed, because for a decade I have been waiting for 10G to get to the point that 10Mbit, 100Mbit, and 1Gbit reached before it: mass consumer adoption, with adapters available for less than $10 apiece.

    It just doesn't seem like it is going to happen because consumers are too infatuated with wifi garbage.

    My biggest gripe with 10Gbit has been all the expense. Firstly, the adapters usually aren't cheap in and of themselves. Then you need transceivers for both your adapter and your switch, which are usually expensive too (even the old 1-gig HP-recommended transceivers for my switch are over $350 apiece!). Now this has gotten better - as you note - with Fiberstore.

    And it only gets worse if you want a 10Gbit switch...

    I feel like in 2015 (the bloody future if you ask me), I should be able to buy a PCIe 10Gbit adapter for $10, and a 24port 10Gig switch for ~$100, and connect the two with a $3 wire from Monoprice.

    Instead I have this complicated setup where I have put a Brocade BR1020 adapter in my server and in my workstation, because it was the only affordable 10Gig adapter I could find used on eBay, and got some LC-LC OM3 cable and transceivers from Fiberstore.

    Everything else in the house still uses my gigabit procurve managed switch, because 10G switches are still out there in crazy territory from a price perspective.

    I have a lot of simultaneous traffic going back and forth to my server, so I set up passive link aggregation across the four gigabit ports on the server to my switch, using VMware's kind of lame "route based on IP hash" setup.

    All this would have been so much easier with a simple 10Gbit copper switch and adapters, or if I could even find an affordable switch with a couple of 10Gig uplink ports...