35 Disk ZFS Pool, Go or no?

DataMine

n00b
Joined
Feb 8, 2012
Messages
41
Ok, I am about to upgrade my last set of 2TB hard drives to 4TB and am wondering how bad an idea it is to make just one large vdev instead of 3 smaller vdevs.

Right now I have 3 vdevs of 11 raidz3 disks each (11x4TB, 11x4TB, 11x2TB). I want to go to one 35x4TB raidz3 vdev + a hot spare (36 disks total), but I am seeing warnings not to go over 10 drives, or 16 drives (different sites say different things). I would get close to 18TB more storage with a 35-disk vdev vs. staying with 11-disk vdevs. This is mostly for storage, so high speed is not an issue (I currently get 290 MB/s; as long as I get around 150 MB/s I should have no issues at all).

According to the RAID reliability calculator at
https://www.servethehome.com/raid-calculator/raid-reliability-calculator-simple-mttdl-model/
my 3-year failure outlook is 0.00016% for an 11-disk vdev vs. 0.09998% for a 35-disk vdev, not counting the hot spare I want to add. Either way it is less than 1 percent.

So, has anyone here run large vdevs? Good, bad, very bad?

Update:
I plan on upgrading again in three years with a full system upgrade (new MB, CPU, more RAM, new hard drives, PSU, etc.). Upgrading my drives will not be an issue then, since I will be making a new zpool instead of incremental upgrades like I have done for the last 5 years (upgrading only one vdev when needed).

Current
Zpool Media01 (raidz3)
vdev 11x4TB
vdev 11x4TB
vdev 11x2TB

Zpool Media02 (raidz3)
vdev 11x3TB

Future ???? #1
Zpool Media01 (raidz3)
vdev 35x4TB + hot spare

Zpool Media02 (raidz3)
vdev 11x3TB
 
Ok, I have done a lot of reading and it looks like as long as I am using 4K drives (I am), going up to 32 disks per vdev is OK. But I can't find any performance info on it other than that IOPS will be in the range of a single disk, which is fine since most disks provide 120-180 MB/s now, so my main bottleneck will be my older M1015 cards and expanders.

Anyone have any issues using larger vdevs?
 
The biggest benefit, I think, is IOPS. More vdevs = more IO to spread around. There's also the value of increased redundancy, I'd think, but I'm not sure by how much.
 
I am not sure by how much my IOPS will increase, since my upgrade path was a bit odd. I did not start out with 33 empty drives; as my zpool filled up, I expanded or added a new vdev. So the pool is not filled evenly (files are not split among all drives and vdevs), and not every drive gets a read when a file is looked up. I don't think keeping more vdevs will help very much in my situation. And since I do not know of a way to "even out" my data other than moving it off, destroying the pool, making a new one and putting the data back on, I don't really think my IOPS will increase much otherwise.
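For what it's worth, one way data sometimes gets "evened out" without destroying the pool is to rewrite it into a new dataset on the same pool; it needs enough free space, and the Media01/videos dataset name below is just an example.

Code:
# Snapshot and copy the dataset; the received copy is written across
# the pool's *current* vdev layout, which spreads the blocks out.
zfs snapshot -r Media01/videos@rebalance
zfs send -R Media01/videos@rebalance | zfs recv Media01/videos_new

# Once the copy checks out, swap it into place and clean up.
zfs destroy -r Media01/videos
zfs rename Media01/videos_new Media01/videos
zfs destroy -r Media01/videos@rebalance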

As for redundancy: right now I can lose 3 drives per vdev before a fault happens (up to 9 drives total; so far I have lost 1 drive). If I go to a single vdev I can lose only 3 drives, but I will be getting about 24 TB more storage, which is a bit important since I am running out of ports on my setup. I am hoping that adding 1 or 2 hot spares will help with any redundancy issues.

11 -Main case
16 -SAS multiplier 1
16 -SAS multiplier 2

16 -SAS multiplier 3 (Offline) Media 02 (11 used)
 
In my NAS I did two 6-disk raidz2 vdevs striped together, with 4TB disks. If all you are doing is sequential reads, I don't think performance should be much of a concern compared to hosting VM workloads.

If you need to grow your array, add more vdevs without worrying too much about performance. I don't think you will gain anything by moving to a single large array; you may be worse off instead.
 
We use 7 x 6-drive vdevs with 3 hot spares in our production chassis.
 
Currently setting up to do

Zpool Media01 (raidz3)
vdev 19x4TB
vdev 19x4TB (really a mix of 4TB and 3TB; I will replace all drives with 4TB as I find them)
+ Hot spare

I decided on this for 5 reasons:
1. I don't have enough 4TB drives right now for a 35-drive setup (3 drives have tested with bad sectors; why they never pinged on a scrub I don't know). I try to get them as cheaply as possible (NewEgg sales or eBay), so it takes some time to make sure I get good mixed lots.

2. Scrub times. Right now it takes 19 hours to do a full scrub with the 3-vdev setup, which I do every 14 days (see the cron sketch after this list). I want to avoid a multi-day scrub if I can.

3. Open SATA ports. I have a total of 47 slots: 16 (multiplier) + 16 (multiplier) + 15 (case).
47 = Total
-1 = OS drive, 120 GB SSD
-1 = Cache maybe? L2ARC 120GB SSD (my hope to help my low IOPS)
-4 = Active mirrored ZFS pool for encodes/other

41 slots left for the zpool:
-19 vdev 1
-19 vdev 2
-1 = Hot spare
2 slots left.

4. Power/Heat/Noise. More drives = more power = more heat = more noise.

5. Space requirements. I will get about 120 TB over my current 85 TB.
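For the every-14-days scrub mentioned in reason 2, a simple cron entry does the job; the pool name and schedule below are just an example (1st and 15th of each month at 03:00).

Code:
# /etc/cron.d/zfs-scrub -- scrub Media01 roughly every 14 days
0 3 1,15 * * root /sbin/zpool scrub Media01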
 

L2ARC will not help with your IO like you expect. How much memory do you have? The biggest contributor to I/O would be faster drives, followed by additional vdevs. If you do a lot of sync writes, an SLOG device will help with contention, which might free up some IO. The last thing is maybe L2ARC, if you have >64GB of RAM and a workload that would highly benefit from it.
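If it helps, on ZFS on Linux the current ARC behaviour can be checked before buying anything, and an SLOG would be added roughly as below; the device paths are placeholders, and an SLOG only matters if you actually do sync writes.

Code:
# See how the existing RAM ARC is doing (ZoL exposes counters here).
awk '$1 == "hits" || $1 == "misses"' /proc/spl/kstat/zfs/arcstats

# If sync writes turn out to matter, add a mirrored SLOG:
zpool add Media01 log mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B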

 
RAM is maxed out on my system at 32GB of DDR3 Registered (8x4GB), so I can't go any higher. Since this is mostly used like WORM storage, high IO is not really needed.

At most I will have 4 different connections pulling different files for playback.
 

I would advise against L2ARC. L2ARC needs memory to index it, so you use up the faster ARC cache. TBH you should have no issues with two raidz3 vdevs for media playback - 4 streams shouldn't be a problem. If you need more IO, split your raidz3 sets into smaller vdevs and stripe them all together. Something like an 11-disk raidz3 vdev is optimal.
 
I would not recommend a single-vdev pool with that many disks, not even for a media pool.

As you said, a single vdev has the IOPS of a single disk, which means around 100 IOs per second. The sequential performance of a pool, around 150 MB/s x number of data disks, does not matter much, as ZFS does not work sequentially in most cases: datablocks are always spread over the whole vdev (not in sequential order on disk).

Your RAM can speed up reading, as all small random datablocks and metadata are cached. Writes also go through a write cache (up to 4 GB of RAM) that transforms small random writes into large sequential writes. But there are cases where the caches won't help. One example is a resilver, where ZFS must read all data, and there IOPS is the limiting factor. If such a pool is nearly full, a resilver can last ages (unless you use Oracle Solaris with the sequential resilvering feature).

Another critical point is media streams. A single stream is not a problem, but with several streams, or if several users read the same stream with a delay, the limited IOPS capability may produce dropouts. The RAM-based ARC cache will not help, as it does not cache/read ahead sequential data. This is where an L2ARC is really helpful. It requires around 5% of its size in RAM for management, but L2ARC can cache/read ahead sequential data (you must enable this via set zfs:l2arc_noprefetch=0). So with video, yes, use an L2ARC of 5-10x the size of RAM. I would probably add an NVMe drive as L2ARC.

I would prefer 3 x Z2 vdevs with 11 disks each, which would give your pool around 300 IOPS.
As your number of data disks per vdev is not a power of 2, I would increase the recordsize from the default 128K to 512K or 1M to use the full capacity.
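For anyone doing this on ZFS on Linux rather than Solarish: the l2arc_noprefetch switch mentioned above is a module parameter there, and the cache device and recordsize are set per pool/dataset. A rough sketch, with placeholder device and dataset names:

Code:
# illumos/Solaris: add "set zfs:l2arc_noprefetch=0" to /etc/system (as above).
# ZFS on Linux equivalent, live and persistent:
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf

# Add the SSD as an L2ARC (cache) device; device path is a placeholder.
zpool add Media01 cache /dev/disk/by-id/ata-EXAMPLE_SSD_120GB

# Larger records for big media files; 1M needs the large_blocks feature
# flag on current OpenZFS/ZoL. Dataset name is an example.
zfs set recordsize=1M Media01/videos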
 

But L2ARC with 32GB of RAM means you are otherwise starving your pool of precious ARC, which was already very limited to begin with. I think OP's main benefit will be more vdevs, and then getting a system that supports more memory. The new Xeon E3 motherboards on socket 2011 support up to 64GB of RAM and don't totally break the bank. But otherwise I agree with the rest of what you have said; well explained.
 
But L2ARC with 32GB of ram means you are otherwise starving your pool of precious ARC.

You are correct for most use cases, but ARC does not cache/read ahead sequential video data.
This is why L2ARC helps a media server, especially one with a slow pool, more than a bit of extra RAM for ARC would.

If you for example add a 120GB L2ARC, you reduce the RAM available for ARC by around 5% of that size = 6GB, which is OK; even a 256GB L2ARC would be OK, but not a larger one.
 
System Current Specs

CPU -2x Intel Xeon L5520 -Quad Core Socket 1366
Ram -8x 4GB DDR3 Registered 1066
MB -S550BC
HBA -2x M1015 (passthrough mode) + HP port multiplier and 2x 16-port SAS multipliers

Motherboard SATA (6 ports)
1x SSD
2x DVD-RW

M1015 #1
4x 4TB drives Mirror ZFS pool (Active)
4x 4TB drives Raidz3 pool (Media01)

M1015 #2
4x 4TB drives Raidz3 pool (Media01)
Connected to HP port multiplier => SAS multipliers (Media01) => SAS multiplier (Media02), which is off most of the time and only turned on 1 day a month to move rarely needed files to it

Yes, it is a bit dated. I put it together 4 years ago for about $250, since most of the parts are from 2010. I am hoping for 2-3 more years of use out of it, since any RAM upgrade would mean having to discard the RAM I already have. Also, it will take some time for me to put the funds together for a new MB + CPU + RAM and hard drives.

From what I have read:
ZIL = faster writes (or rather an intent log, so not really faster writes as such?)
L2ARC = faster reads (for small data sets; not really going to improve streaming, now that I have read more, but it may help with the image files I keep in the same pool)

Since writes were not an issue, I was hoping for better reads. Writing to the zpool over the network I get a constant 95 MB/s; reads seem to bounce from 50-75 MB/s. Right now I don't have any real issues streaming 4 1080p HD streams at once, but that may be due to the pool being set up as 3 x 11-disk raidz3 vdevs.

The SSD is left over from an upgrade on my laptop (replaced with a 240GB); I figured this would be a better use for it than letting it collect dust. I was not aware that adding an L2ARC would reduce the RAM usable for the ARC (lack of research on my part).


Since making one large 35-disk raidz3 vdev is out, and it seems 19 is bad as well, keeping to 8 data disks per vdev seems like the next best choice, maybe going down to raidz2 + hot spare. I only have 41 SAS slots for this setup, so I can get up to 4 raidz2 vdevs in the pool + a hot spare, and a cold spare as well (a rough zpool create sketch follows after the layout below).
vdev raidz2 10 x 4TB
vdev raidz2 10 x 4TB
vdev raidz2 10 x 3-4TB (upgrade to 4TB ASAP)
vdev raidz2 10 x 2TB (upgrade ASAP maybe 6TB drives)
+Hot Spare (4TB) =41 drives
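A rough sketch of what that pool would look like at creation time; the short sdX names are placeholders (in practice /dev/disk/by-id paths are safer), and ashift=12 keeps everything aligned for 4K-sector drives.

Code:
# Placeholder device lists -- substitute the real /dev/disk/by-id paths.
VDEV1="sdb sdc sdd sde sdf sdg sdh sdi sdj sdk"
VDEV2="sdl sdm sdn sdo sdp sdq sdr sds sdt sdu"
VDEV3="sdv sdw sdx sdy sdz sdaa sdab sdac sdad sdae"
VDEV4="sdaf sdag sdah sdai sdaj sdak sdal sdam sdan sdao"

# 4 x 10-disk raidz2 vdevs striped together, plus a hot spare.
zpool create -o ashift=12 Media01 \
    raidz2 $VDEV1 \
    raidz2 $VDEV2 \
    raidz2 $VDEV3 \
    raidz2 $VDEV4 \
    spare sdap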


Most of my drives are purchased used with 1,000-2,000 hours already on them. I am using a mix of enterprise 7200 RPM drives and consumer 5400/7200 RPM drives. Since I really only like paying about $70-$80 per 4TB drive due to money issues, buying a large number of drives or another SAS multiplier is not really an option at this time.

I use badblocks, zero-filling, and surface scanning to vet out any bad drives (see the sketch below), and thanks to eBay's return policy I have no issue returning the drives that fail. In the last 4 years I have had only 2 drives suddenly die on me, a 3TB and a 4TB. I also pull any drive that has a high reallocated sector count or constant checksum errors (up to 4 drives now in 4 years).
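For reference, that vetting pass boils down to something like the following; it is destructive (the badblocks write test wipes the drive), /dev/sdX is a placeholder, and the exact SMART attribute names vary a bit by drive.

Code:
# Four-pattern destructive write test, 4K blocks, with progress output.
badblocks -wsv -b 4096 /dev/sdX

# Then check the SMART counters that usually flag a bad drive.
smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect'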
 
Ok, this just occurred to me: the 2TB drives I have may be 512-byte-sector drives and not 4K, and I know from past experience this is bad. Is there a way to force ashift 12 instead of 9? Or should I keep the 2TB drives as a separate zpool instead of a vdev of my current pool until more drives can be sourced?
 
Thanks, currently using ZoL with Ubuntu Server. But I am looking into using FreeNAS instead.

I'd look it up in the Ubuntu manual - you might use gparted instead, but otherwise it's probably very similar.
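For what it's worth, on ZFS on Linux ashift can be forced per vdev when the pool is created or when a vdev is added later; the pool and device names below are placeholders.

Code:
# Force 4K alignment at pool creation time.
zpool create -o ashift=12 tank raidz3 sda sdb sdc sdd sde

# ashift is per vdev, so it can also be forced when adding a vdev later.
zpool add -o ashift=12 tank raidz3 sdf sdg sdh sdi sdj

# Verify what the pool actually got.
zdb -C tank | grep ashift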
 
I was also looking into different OSes to run. FreeNAS is near the top of my list, but Proxmox looks nice as well since it has native ZFS support and a lot of virtual machine features built right in. What do you use? Are you happy with it?
 
Best of all, the integration of OS, ZFS and services like iSCSI/FC, NFS or SMB is in Solaris and the free Solaris forks (OmniOS, OpenIndiana, SmartOS) - the origin of ZFS. Even a minimal Solarish install includes everything you need for a storage server.
 

I use FreeNAS at home, and at work we have an iXsystems server that also runs FreeNAS.
 