Best way to set up 24-drive ZFS NAS?

IvanNavi

Hi all,

Your wisdom and experience would be highly appreciated with the following query.

I am setting up a 24-drive NAS (QNAP TS-h1283XU-RP). The hardware seems decent (Xeon E-2236 6-core 3.4 GHz, 128GB ECC RAM). The system volume will reside on an m.2 drive. For cold storage, there are 12 SATA HDDs size 16TB each, and 12 SATA HDDs size 18TB each. The NAS will be used by a small office of 10 people, mainly for storage, moderate back-up operations, and media editing/streaming. The data will be backed up externally.

The goal is to strike a good balance within the impossible trinity of Storage - Performance - Security.

We're looking to set up the NAS to run in ZFS, which is quite new to me. I have read through countless forums, but a lot of the posts are from several years ago and it's not clear whether the information is still relevant, especially when it comes to performance.

If I understand correctly, a zpool with more vdevs (each with fewer drives) will run faster than a zpool with fewer vdevs (each with more drives). Are there any general rules of thumb as to the appropriate number of drives in a vdev?

Our cost/storage calculations limit us to sacrificing no more than four drives from the overall array.

Specifically in our context, would we be better off creating:

Option 1: One zpool containing two 12-drive vdevs, each running in raidz2

Option 2: One zpool containing four 6-drive vdevs, each running in raidz1

Option 3: Two zpools, each containing two 6-drive vdevs, each running in raidz1?
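
For reference, on the command line the three options would look roughly like the following (a sketch only - device names such as sda/sdb are placeholders, and in practice the QNAP GUI would build the pool):

Code:
# Option 1: one pool, two 12-drive raidz2 vdevs
zpool create tank \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl \
    raidz2 sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx

# Option 2: one pool, four 6-drive raidz1 vdevs
zpool create tank \
    raidz1 sda sdb sdc sdd sde sdf \
    raidz1 sdg sdh sdi sdj sdk sdl \
    raidz1 sdm sdn sdo sdp sdq sdr \
    raidz1 sds sdt sdu sdv sdw sdx

# Option 3: two separate pools, each with two 6-drive raidz1 vdevs
zpool create tank1 raidz1 sda sdb sdc sdd sde sdf raidz1 sdg sdh sdi sdj sdk sdl
zpool create tank2 raidz1 sdm sdn sdo sdp sdq sdr raidz1 sds sdt sdu sdv sdw sdx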

My instinct would be to avoid raidz1 (raid 5), given the risk of single redundancy. However, I am concerned about the performance of a 12-drive vdev running in raidz2 (raid 6). The few tests I have managed to find online suggest a heavy performance penalty under this set-up vs regular EXT4.

Does anyone have experience with 24-drive ZFS configurations? Would the NAS be able to handle the aforementioned tasks with two 12-drive vdevs in raidz2? Would adding another m.2 drive as a Read Cache (or ZIL) vdev mitigate most of the performance penalty? What sort of scrubbing or resilvering times would we be looking at for a vdev with 12 drives of size 18TB each?

Many thanks in advance for your help!
 
do not use raidz1

maybe 3x 8-drive raidz2 vdevs in one zpool?

my ZFS experience has been that 12-drive vdevs work, but it's outside best practice. i currently run mirror vdevs striped into a single pool. this gives the most reliability and speed, and also the most space lost. i was running 6-drive raidz2 vdevs, but even the resilver (and scrub) times on those with big drives can get long.
 
ZFS is very flexible. Every setup is a trade-off within the triangle of capacity vs performance vs security.

If you rule out Z1 entirely, as well as a single 24-disk Z2/Z3 vdev, your main concern is iops: pool iops scales with the number of vdevs, and each vdev performs roughly like a single disk. If you count 100 iops per disk (so per vdev), the main options are

2 x 12 disks = 200 iops
3 x 8 disks = 300 iops
4 x 6 disks = 400 iops

(for comparison: the best NVMe, e.g. Intel Optane, is at around 500,000 iops)

Sequential performance is not so different. In theory, sequential performance is n x data disks. But as ZFS spreads data quite evenly over vdevs/disks, iops is mostly the limiting factor on access and especially on rebuild/resilver time - although on a current ZFS (Solaris since around 2015, OpenZFS since last year), a feature called sequential/sorted resilvering makes this less critical. With an older OpenZFS, resilver time on a slow, very high capacity pool may be up to a week; with newer ZFS it is more like two days - it all depends on fill rate, as a resilver must read all (meta)data. Resilver time of a near-empty pool is near zero.

I would probably choose between 2 x 12 (backup system) and 3 x 8 (filer). For high iops needs I would prefer a pool built from SSD/NVMe, or a special vdev to improve the performance of individual filesystems.

btw
Read cache and write cache on ZFS are both RAM. Only the read cache can be extended by an L2ARC, and this is helpful only in low-RAM situations. A ZIL/Slog is not a write cache but only a protector of the RAM-based write cache against data loss on a crash (you must use sync write to make use of it). Not really needed e.g. for an SMB filer, essential for VM datastores.
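
As a rough illustration (pool/dataset/device names below are placeholders, and on a QNAP the GUI would normally manage this), sync behaviour is a per-dataset property and an Slog is just a log vdev added to the pool:

Code:
# sync write behaviour is a per-dataset property
zfs get sync tank/office            # default "standard": the application decides
zfs set sync=always tank/vmstore    # force sync writes, e.g. for VM datastores
zfs set sync=disabled tank/scratch  # never sync (fast, but unsafe on power loss)

# a dedicated Slog only helps sync writes; a mirror of small SSDs is typical
zpool add tank log mirror nvme0n1 nvme1n1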
 
i was running 6-drive raidz2 vdevs, but even the resilver (and scrub) times on those with big drives can get long.
Thanks Zedicus. What size drives did you have in your raidz2 and how long were your scrub/resilvering times?
 
With an older OpenZFS, resilver time on a slow, very high capacity pool may be up to a week; with newer ZFS it is more like two days - it all depends on fill rate, as a resilver must read all (meta)data. Resilver time of a near-empty pool is near zero.

I would probably choose between 2 x 12 (backup system) and 3 x 8 (filer). For high iops needs I would prefer a pool built from SSD/NVMe, or a special vdev to improve the performance of individual filesystems.

btw
Read cache and write cache on ZFS are both RAM. Only the read cache can be extended by an L2ARC, and this is helpful only in low-RAM situations. A ZIL/Slog is not a write cache but only a protector of the RAM-based write cache against data loss on a crash (you must use sync write to make use of it). Not really needed e.g. for an SMB filer, essential for VM datastores.
Thanks Gea. 2 days scrubbing/resilvering for a relatively full volume (80-90%) would not be too bad in our case. It would be twice as long as the current EXT4 system, but we could live with that.

Regarding the RAM, it occurred to me that our set-up would be in clear breach of the old rule-of-thumb for ZFS, which says there should be 1GB of RAM per 1TB of storage. The effective size of our array will likely be around 280TB, way above the threshold implied by the 128GB of RAM in the NAS.

The problem is that the NAS is limited to 128GB of RAM by QNAP; they do not offer anything higher (and they categorically advise against using 3rd party memory upgrades).

At the same time, QNAP lists the 16TB and 18TB drives we will be using as compatible with this NAS.

All of which makes me think that either QNAP is confident the NAS can perform with its current 128GB of RAM when fully populated with 18TB drives, or they have not done their research properly and will soon have unhappy customers!

I should clarify that we won't be using deduplication, which I understand is the real killer in terms of RAM usage.

Does anyone know what performance is like with deduplication turned off? Would the general rule-of-thumb regarding RAM still hold?
 
There is no such 1GB RAM per TB of data rule for ZFS outside some forums. Sun/Oracle, who invented ZFS, claims a minimum of 2 GB RAM for stable use of their 64-bit Unix, regardless of pool size. The real problem is performance. As ZFS adds extra data due to checksums and is more affected by fragmentation due to Copy on Write, its performance is lousy in such low-RAM situations. You need more RAM to cache mainly metadata (around 1% of data) and last-accessed data.

This is why a real-world minimum of 8 GB RAM for Solarish, and 2-4 GB more for FreeBSD or Linux (QNAP is just Linux + ZoL + a web GUI), is seen as the baseline (ZFS internal memory management is still Solaris-like). If you consider 128GB RAM I must ask about the use case, as the ZFS read cache (ARC) only caches small random data and metadata, not whole files or sequential data. For a video server, for example, RAM is not as helpful as raw pool performance; for a mail server RAM is a killer advantage. And as you see in my Epyc tests, CPU is important for a really high-performance system.

If you enable dedup, add up to 5GB of additional RAM per TB of deduplicated data (not per TB of pool size), but with the new special vdev feature (e.g. a mirror of Intel Optane) you can hold the dedup table there instead of in RAM, so this rule is no longer as important.
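
Purely as an illustration (the OP plans to leave dedup off, and the pool/dataset/device names are placeholders): on a current OpenZFS the dedup table can be placed on a dedicated dedup vdev, and dedup itself is a per-dataset property:

Code:
# dedicated allocation-class vdev to hold the dedup table instead of RAM
zpool add tank dedup mirror nvme2n1 nvme3n1

# dedup is enabled per dataset, not per pool
zfs set dedup=on tank/archive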

For achievable performance, you may look at some tests that I made (with the Solaris fork OmniOS, but Linux should behave similarly, at least in principle); see my PDFs at https://napp-it.org/manuals/index_en.html, sections 1, 8.1 and 8.2.

Regarding disks:
always be sure to avoid SMR disks and use only CMR ones for RAID and ZFS
https://www.servethehome.com/wd-red-smr-vs-cmr-tested-avoid-red-smr/2/
 
i went from 3TB drives to 6TB drives in raidz2 and remember thinking, 'man, these scrubs are taking forever.' i do not remember what the actual time was, though. when i switched to 8TB drives i went with the striped-mirror config.

my data is mostly media streaming, and even with an Emby server and users' shared drives all running on the FreeNAS box i have never maxed out the array on performance. it is always network, and maybe CPU if Emby is doing multiple re-encodes. this is with just spinners, no ssd cache drives or any extra config.
 
This is why a real-world minimum of 8 GB RAM for Solarish, and 2-4 GB more for FreeBSD or Linux (QNAP is just Linux + ZoL + a web GUI), is seen as the baseline (ZFS internal memory management is still Solaris-like). If you consider 128GB RAM I must ask about the use case, as the ZFS read cache (ARC) only caches small random data and metadata, not whole files or sequential data. For a video server, for example, RAM is not as helpful as raw pool performance; for a mail server RAM is a killer advantage.
Thanks Gea, super helpful!

It's great to know the RAM rule need not be observed religiously.

In your example of a video server, what would be the best way to optimise raw pool performance? More vdevs with fewer drives in them (as opposed to fewer vdevs with more drives)?

Would the equivalent of RAID60 provide a performance advantage? What does that even look like in ZFS - two striped 12-drive vdevs each running in raidz2, presumably?
 
i went from 3TB drives to 6TB drives in raidz2 and remember thinking, 'man, these scrubs are taking forever.' i do not remember what the actual time was, though. when i switched to 8TB drives i went with the striped-mirror config.

my data is mostly media streaming, and even with an Emby server and users' shared drives all running on the FreeNAS box i have never maxed out the array on performance. it is always network, and maybe CPU if Emby is doing multiple re-encodes. this is with just spinners, no ssd cache drives or any extra config.
V.interesting, thanks for the colour. We have 8-drive arrays populated with 12TB drives (EXT4, not ZFS), where the scrub takes about 20-22hrs. Anything less than 1-1.5 days I would view as tolerable - hoping for that to be the case with ZFS also...

PS. Great to know you never managed to max out the array on performance, and using spinners only 👍
 
It all depends on network and use case.

Example: a single disk can give around 200 MB/s sequentially. On ZFS there is no pure sequential load, as ZFS spreads datablocks over vdevs, so for an average office load you may land at 120 MB/s. Even a single disk is faster than a 1G network. A pool with e.g. 2 x 12-disk Z2 vdevs (yes, from a security point of view this is like RAID-60, but without the write-hole problem of traditional RAID) is > 1000 MB/s sequentially and maybe 600 MB/s with an office-like workload, probably more, as RAM is very helpful for reading small datablocks.

If your network is 10G, then such a setup is where you typically land with 10G without special tuning, as long as you do not activate encryption (see my PDF). A different situation occurs when you need sync write, with the guarantee that every committed write is indeed on disk. A pool that reaches 1000 MB/s in a sequential benchmark then lands at maybe 40 MB/s, and a single disk at maybe 10 MB/s (see my PDF about Slog).
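
If you want to see this sync-write penalty on your own pool, a quick comparison with fio is one way (fio is not discussed above, just a common benchmarking tool; the directory and sizes are placeholders):

Code:
# async/buffered sequential write (what a typical SMB copy looks like)
fio --name=async --directory=/tank/bench --rw=write --bs=1M --size=8G --end_fsync=1

# the same write with O_SYNC, i.e. every write must be committed (ZIL/Slog path)
fio --name=sync --directory=/tank/bench --rw=write --bs=1M --size=8G --sync=1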

Resilver time generally should not be too bad, as QNAP uses a current ZoL with sorted resilvering. But I would strictly avoid filling ZFS to more than 90%. This is not a matter of stability; performance simply suffers in a near-full situation. From around 60% fill rate you will see a noticeable performance degradation on access and on resilver. This is not only the case with ZFS, but the effect is more pronounced here due to Copy on Write.

Regarding pool layout:
If performance is a concern in a multiuser environment, then prefer 4 x 6 disks. Maximum performance is with 12 x mirrors, but then if both disks of a single mirror fail the pool is lost (for all setups: what is your backup plan?). Multiple ZFS Z2 vdevs allow any two disks to fail, and the capacity loss is not as heavy.
 
It all depends on network and use case.

Example: a single disk can give around 200 MB/s sequentially. On ZFS there is no pure sequential load, as ZFS spreads datablocks over vdevs, so for an average office load you may land at 120 MB/s. Even a single disk is faster than a 1G network. A pool with e.g. 2 x 12-disk Z2 vdevs (yes, from a security point of view this is like RAID-60, but without the write-hole problem of traditional RAID) is > 1000 MB/s sequentially and maybe 600 MB/s with an office-like workload, probably more, as RAM is very helpful for reading small datablocks.

If your network is 10G, then such a setup is where you typically land with 10G without special tuning, as long as you do not activate encryption (see my PDF). A different situation occurs when you need sync write, with the guarantee that every committed write is indeed on disk. A pool that reaches 1000 MB/s in a sequential benchmark then lands at maybe 40 MB/s, and a single disk at maybe 10 MB/s (see my PDF about Slog).

Resilver time generally should not be too bad, as QNAP uses a current ZoL with sorted resilvering. But I would strictly avoid filling ZFS to more than 90%. This is not a matter of stability; performance simply suffers in a near-full situation. From around 60% fill rate you will see a noticeable performance degradation on access and on resilver. This is not only the case with ZFS, but the effect is more pronounced here due to Copy on Write.

Regarding pool layout:
If performance is a concern in a multiuser environment, then prefer 4 x 6 disks. Maximum performance is with 12 x mirrors, but then if both disks of a single mirror fail the pool is lost (for all setups: what is your backup plan?). Multiple ZFS Z2 vdevs allow any two disks to fail, and the capacity loss is not as heavy.
Thank you again for the wealth of information.

Regarding the backup plan, that will be twofold: one copy on cold drives on site and another copy residing on several smaller QNAP units off site running RAID 6.

I was aware of the recommendation to not overfill ZFS, but always wondered whether it truly made that much of an impact in real life. We were hoping to use 90% as the usable capacity threshold for the Zpool, but given your comments we might lower that to 85% or even 80%, just to be on the safe side!
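
If it helps anyone else, one way to enforce such a ceiling would be a quota on the pool's root dataset, or a reservation parked on an otherwise empty dataset (a sketch only - the pool name and sizes are examples loosely based on our ~280TB figure):

Code:
# cap the whole pool at roughly 85% of usable space
zfs set quota=240T tank

# alternative: reserve slack space on an empty dataset so it can never be consumed
zfs create tank/slack
zfs set refreservation=40T tank/slack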
 
What's the minimum usable capacity you need to provide overall?
This might help to narrow down what pool layout options are available to you in this scenario.
For example if you need at least 250TB usable, that would rule out a pool of 2-drive mirrors, as (by my calculations) that would only give you just under 200TB usable.

Just for awareness, you can use a mixture of the 18TB and 16TB drives in the same vdev (as their performance characteristics are likely to be very similar) - but the extra 2TB of space on the 18TB drives would be wasted (ZFS uses the same amount of space from each device, equal to the smallest in that vdev - the first 16TB in this case), so that's not ideal for getting your money's worth.

I would be considering something like one pool consisting of four 6-drive Z2 vdevs (two of the vdevs being all 18TB drives and the other two being all 16TB drives).
But does that net you enough usable space?
If not, consider one pool of three 8-drive Z2 vdevs (mixing the 16TB and 18TB drives in each vdev) - not perfect but should work out at slightly larger usable space.
For even more space, one pool of two 12-drive Z2 vdevs (matching drives together)
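
As a rough sanity check, nominal capacities for those three layouts (raw drive sizes, before the usual TB-to-TiB conversion and ZFS metadata overhead, so real usable space lands noticeably lower):

Code:
4 x 6-drive Z2:  (2 vdevs x 4 data x 18TB) + (2 vdevs x 4 data x 16TB) = 144 + 128 = 272TB
3 x 8-drive Z2:  mixed drives capped at 16TB each: 3 vdevs x 6 data x 16TB       = 288TB
2 x 12-drive Z2: (10 data x 18TB) + (10 data x 16TB)                              = 340TB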

You might also want to consider buying one or two more of each drive type as a cold standby so you have something on-hand to immediately swap a failed drive.

I'm not sure of the details of your workflow but I might also consider a ZIL - you only need two very small capacity SSDs in a mirror but it could be a benefit to your writes. You have a big RAM read cache but nothing there to help with writes except the raw disks themselves (in which case the more vdevs the better).
 
I'm not sure of the details of your workflow but I might also consider a ZIL - you only need two very small capacity SSDs in a mirror but it could be a benefit to your writes. You have a big RAM read cache but nothing there to help with writes except the raw disks themselves (in which case the more vdevs the better).

There is a lot of confusion about this.
A ZIL is an on-pool logging device for sync writes (only there to protect the RAM-based write cache). If you use dedicated disks, the logging device is called an Slog. In any case, this is not a write cache. It is never read, except after a crash, to redo missing committed writes on the next reboot.

For a normal filer you do not need or want sync write, so a ZIL or Slog is never part of any data transfer.
 
There is a lot of confusion about this.
A ZIL is an on-pool logging device for sync writes (only there to protect the RAM-based write cache). If you use dedicated disks, the logging device is called an Slog. In any case, this is not a write cache. It is never read, except after a crash, to redo missing committed writes on the next reboot.

For a normal filer you do not need or want sync write, so a ZIL or Slog is never part of any data transfer.
ZIL is the log itself and SLOG is a dedicated device that the ZIL sits on; by default if you enable sync writes but don't have a SLOG, the ZIL sits on your pool (hard drives in this case) which is bad for performance as you effectively write the data to the same pool drives twice. So if you're configuring sync writes, you should have a dedicated, SSD-based SLOG.
But this is all just a pedantic argument over details that the OP doesn't need to worry about.

I never said ZIL is a cache, but I did say it could "benefit writes", depending on the workload, which is true.
If his workload is 100% large sequential blocks then there will be no benefit, but if there's a mix of some random IO etc then there could be some benefit.
So just something for the OP to consider.
 
I just finished setting up my first pool, and did something similar. I was using refurb 4TB drives though, as they're super cheap. I used four 6-drive Z2 vdevs. I did not use a QNAP NAS though, and elected to use an older Supermicro server with additional interior/exterior storage. The reasons for this are:
Hot-swap rear ports for 1TB SSDs: a 3-way mirror for a special allocation class vdev.
An interior mirror of 128GB SSDs for the OMV OS.

The special allocation classes add a ton of versatility. I have mine set to accept all blocks below 64K (special_small_blocks), which should help tremendously with seek times for folder structures. This can be scaled up for other applications as well. I chose 4 x 6 over 3 x 8 as I wanted the extra IOPS, and the storage difference wasn't as critical. I would NOT go over 8 disks in a vdev. Mirrors are a great way to go, but I was funding this myself and Z2 offers more bang for the buck.
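
For reference, a setup like the one described above would roughly correspond to the following commands (a sketch only - pool and device names are placeholders):

Code:
# 3-way mirrored special vdev for metadata and small blocks
zpool add tank special mirror sdx sdy sdz

# send all blocks of 64K and smaller to the special vdev (per-dataset property)
zfs set special_small_blocks=64K tank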
 