ZFS N00b: raidz3 and larger qty of disks?

Getting ready to build my fileserver/OpenIndiana box.

I have a 16-bay chassis due to arrive, as well as 16 hard drives (DT01ACA300, 3TB). I currently have another 5 I could pull at some point and move to this build as well, if needed.

With this build, I am mainly looking to build a file server that may also be used as an iSCSI target for some Linux and Windows machines.

If I understand my searching correctly, with these drives I need to configure ashift=12. Also, for raidz3, I thought the number of data drives needs to be a multiple of 3. So, for example, I could have 18 data drives and 3 for parity (21 total), and ZFS should be happy? Many of the examples I saw were for smaller disk counts.

I also have some 1TB drives; these will primarily be used for the iSCSI target. I believe I have between 9 and 13 of them, maybe more if I look.

Am I on the right path here?
 
OK, I also have a 240GB SSD coming. Any input on this?
 
A power of 2, not a multiple of 3.
8+3 disks = 11 disks.
16+3 disks = 19 disks (but this might be too many)
 
Greetings

(but this might be too many)

A 19-drive RAID-Z3 vdev is too large only if high IOPS performance is a concern; otherwise there should be no problem using it.

A power of 2

Correct. The number of data drives should be a power of 2 because the recordsize is a power of 2; when you divide one into the other, the amount written to each hard drive is also a power of 2 and, more importantly, a WHOLE NUMBER of 512B/4KB sectors. For example, a 128KB recordsize across 4 data drives is 32KB per drive; other drive counts entail wasted space.
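A quick way to see this (plain shell arithmetic, nothing ZFS-specific; the drive counts are just examples): a 128KB record is 32 sectors of 4KB, and only power-of-2 data-drive counts split those 32 sectors evenly.

Code:
	# a 128KB record = 32 sectors of 4KB; check how evenly it splits across N data drives
	for n in 2 3 4 5 6 8; do
	  echo "$n data drives: $((32 / n)) whole sectors each, $((32 % n)) sectors left over"
	done

Whenever there are sectors left over, the allocation gets padded out to whole sectors per drive, which is where the wasted space comes from.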

I have posted a detailed reply in another thread here covering this topic among others. I humbly suggest you read it, or better yet the entire thread, and click through all the links provided; that should give you a good overview of ZFS and the issues that can arise when using it.

Cheers
 
Is it always a power of 2 regardless of the raidz type? The other part I am confused about is the overhead cost (lost space) due to the number of drives.

For the 3tb drives I am more interested in file storage at this time. For the 1tb drives, I would be more interested in performance.


I will check out the posts.
 
Is it always a power of 2 regardless of the raidz type?
Yes.

For the 3tb drives I am more interested in file storage at this time. For the 1tb drives, I would be more interested in performance.

You mentioned that you have a 16-bay chassis, so are you limited to 16 drives in total?

If you have two different usages with different requirements (one storage space, one performance) then you'll want to make two different pools.
For example, you could use 4 1TB drives to make a pool of two mirrors (equivalent to a RAID10) for your performance iSCSI target, and use 12 3TB drives in a single RAID-Z2 to give you a separate ~30TB storage pool.
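For reference, creating that layout from the command line would look roughly like the sketch below (the cXtYd0 device names are placeholders, not your actual drives):

Code:
	# performance pool: two mirrored pairs of 1TB drives, striped (RAID10-like)
	zpool create fast mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
	# capacity pool: twelve 3TB drives in a single RAID-Z2 vdev (~30TB usable)
	zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 \
	                         c2t6d0 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0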
 
I would do 2 raidz2 vdevs of 8 drives each.

It's not ideal but it should work nicely.

The number of data drives ideally should be a power of two.

2, 4, 8, 16, 32, 64

Parity drives don't count.

So, for example, with 4 data drives a raidz2 would ideally be 4(+2) = 6 drives total,

and a z3 would be 4(+3) = 7 total,

etc.


The downside to not following this is that you lose some space because the recordsize doesn't divide evenly into the drives' 4K block size.

For every 128K written you can potentially lose up to just under 4K.

The math works out to around 115GB lost per 4TB absolute worst case scenario.
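Rough back-of-the-envelope arithmetic, not exact ZFS allocation math, just to show where a number in that ballpark comes from:

Code:
	# up to ~4K of padding per 128K record is roughly 3% overhead
	echo "scale=3; 4 / 128 * 100" | bc      # ~3.1 (percent)
	# applied to 4 TB (~4000 GB) of data:
	echo "4000 * 4 / 128" | bc              # ~125 GB in the absolute worst case

Which lands in the same ballpark as the ~115GB figure above.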
 
Yes.



You mentioned that you have a 16-bay chassis, so are you limited to 16 drives in total?

If you have two different usages with different requirements (one storage space, one performance) then you'll want to make two different pools.
For example, you could use 4 1TB drives to make a pool of two mirrors (equivalent to a RAID10) for your performance iSCSI target, and use 12 3TB drives in a single RAID-Z2 to give you a separate ~30TB storage pool.

I have a 16-bay enclosure and can also house 7 drives in the case. Down the road I may look at moving the enclosure and server guts into the Norco, but that's for another day and another time. Ideally I want to put the 3TB drives in the enclosure. I may need to make a disk shelf to hold the others, or potentially look at another enclosure.

The math works out to around 115GB lost per 4TB absolute worst case scenario.

Ouch. My understanding is there is already some loss with 4K drives and ashift=12, so that 115GB is in addition to that?
 
It's not that bad if you think about it. You can potentially lose quite a bit more with standard Windows NTFS.
 
OK, I spoke with another member tonight and he is suggesting doing 2x9 vdevs, striping them, and using the balance for hot spares; this seems to make the most sense.
 
OK, I spoke with another member tonight and he is suggesting doing 2x9 vdevs, striping them, and using the balance for hot spares; this seems to make the most sense.

To be honest it makes very little sense. When you say 2x9, are the 9 drives in RAID-Z, Z2 or Z3? You said the case can house 7 extra drives, so that's a total of 23. 2x9=18, so you're going to have 5 hot spares? :confused:


With 23 drives you have quite a lot of different possible configurations.
To start, it's pretty standard procedure to mirror two drives for the OS, so that leaves 21.
Earlier I mentioned two pools but, to be honest, your usage probably isn't heavy enough to worry about that. Stick to one pool and use all 3TB drives.
By the way, "pool" is the ZFS term for striping together "vdevs" (single drives, mirrors or RAID-Z, Z2 or Z3 arrays).

If you wanted maximum speed you'd be looking at something like a pool of ten mirrors and a hot spare, but that is a little risky in terms of potential failures and only nets you ~10 drives-worth of capacity.
For maximum capacity you could do one big RAID-Z3, but it won't win any IOPS awards. Alright for moderate use, and nets you ~18 drives-worth of capacity.
A good compromise would be a pool of two RAID-Z2 vdevs, 10 drives each (and one hot spare). In non-ZFS speak, that's equivalent to a RAID 60. ~16 drives-worth of capacity with decent performance and fault tolerance.
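If you went with that last layout, the creation command would look something like this sketch (drive names are placeholders):

Code:
	zpool create tank \
	  raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 c3t8d0 c3t9d0 \
	  raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c4t8d0 c4t9d0 \
	  spare c4t10d0

ZFS stripes across the two RAID-Z2 vdevs automatically, and the hot spare can stand in for a failed drive in either vdev.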
 
Personally I'd put everything in the pool into just the enclosure, because then you can more easily swap out the server in front of it in the future, but I'm a lazy git by nature, so YMMV.

I'd also go with 2x8 raidz2 vdevs: you're doubling your IOPS potential for not much loss in capacity versus a single 15-disk raidz3 vdev, and increasing your actual redundancy slightly as well (technically the pool could die from losing 3 disks if they're all in one vdev, instead of it taking 4, but you could also lose 4 disks, two per vdev, without failure, which you could never do with a single raidz3 vdev), and it's an even number that fills all 16 slots in your enclosure.

Do keep your performance expectations reasonable. Even with 2 vdevs you're looking at roughly 2 disks' worth of IOPS, before considering caching. You made no reference to a SLOG, so unless you're also disabling sync, those drives have to handle ZIL traffic as well, which will significantly impact their performance. For a home or small-business setup this is still probably fine, especially if it's mostly reads and you've got enough RAM to hold most or all of your WSS (working set size: the actual amount of data on the pool you're accessing in a short timeframe, say a few minutes to a few hours). If it's small-business scale or larger and the number of clients is more than a handful, it may be IOPS-starved. Getting as much of your data as possible onto CIFS/NFS instead of iSCSI will help some, if you can't get a SLOG and/or more RAM and/or more drives to add more vdevs to the design.
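If a SLOG does become necessary later, adding one (and checking how sync writes are handled) is straightforward; 'tank' and the device name below are placeholders:

Code:
	# attach an SSD as a separate ZIL log device (SLOG)
	zpool add tank log c9t0d0
	# check how sync writes are currently handled (standard / always / disabled)
	zfs get sync tank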
 
Sorry, I got it wrong. 2x9 vdevs in raidz, with 2x hot spares, and then striped (is how it was described to me) (I could still be wrong). I have a drive for caching and can add other drives as needed. OI is already installed on a separate 2.5" drive; I have 8x1TB in there now.

I imagine once I get my hands dirty I will learn more about this. The 3TB drives are mainly for storage; the 1TB drives will be aimed more at iSCSI targets.
 
Sorry, I got it wrong. 2x9 vdevs in raidz, with 2x hot spares, and then striped (is how it was described to me) (I could still be wrong). I have a drive for caching and can add other drives as needed. OI is already installed on a separate 2.5" drive; I have 8x1TB in there now.

I imagine once I get my hands dirty I will learn more about this. The 3TB drives are mainly for storage; the 1TB drives will be aimed more at iSCSI targets.

I hope you typo'd. Please don't make a 9-disk raidz (raidz1) vdev of 3TB drives. That has data loss potential written all over it. Also, having 2 hot spares + 2 9-disk raidz vdevs makes NO sense, because you could instead do 2 x 9-disk raidz2, adding one of those hot spares to each vdev to make it raidz2. Even with the sub-optimal odd number of data drives, the small space loss and minor performance hit is preferable to a 2x9 raidz1 with 2 useless hot spares sitting there doing nothing. :)
 
That would also be preferable, though I was personally suggesting staying @ 16 and just using the enclosure. If you really want to not be able to just move the enclosure around, go with 2x20 z2, yes.
 
It sounds like you are talking about more than one initiator mounting the same iSCSI target at the same time. Is that possible? If so, how does it work?
 
It sounds like you are talking about more than one initiator mounting the same iSCSI target at the same time. Is that possible? If so, how does it work?

Multiple guests = multiple targets. Can they share the same LUN? Only if the filesystem is cluster-aware, and the applications are as well.
 
I hope you typo'd. Please don't make a 9-disk raidz (raidz1) vdev of 3TB drives. That has data loss potential written all over it. Also, having 2 hot spares + 2 9-disk raidz vdevs makes NO sense, because you could instead do 2 x 9-disk raidz2, adding one of those hot spares to each vdev to make it raidz2. Even with the sub-optimal odd number of data drives, the small space loss and minor performance hit is preferable to a 2x9 raidz1 with 2 useless hot spares sitting there doing nothing. :)

I was told this would give the best mix of IOPS and space: 2 vdevs (raidz), then stripe them. I'll have to ask him again; it was late when we were discussing it. I don't want to risk losing data. That's a lot of... text files. ;)
 
That would also be preferable, though I was personally suggesting staying @ 16 and just using the enclosure. If you really want to not be able to just move the enclosure around, go with 2x20 z2, yes.

You mean 2x10?
 
I was told this would give the best mix of IOPS and space: 2 vdevs (raidz), then stripe them. I'll have to ask him again; it was late when we were discussing it. I don't want to risk losing data. That's a lot of... text files. ;)

Yes. The more vdevs, the higher the IOPS potential. ZFS will 'stripe' across all vdevs in the pool by default. The catch is that it isn't necessarily a good idea to go to more vdevs if doing so requires you to drop the parity level. One of the fairly hard-and-fast rules I recommend is not going below raidz2 vdevs when dealing with 1+ TB drives.
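Once the pool is built, you can actually watch the writes being spread across the vdevs ('tank' is a placeholder pool name):

Code:
	# per-vdev I/O statistics, refreshed every 5 seconds
	zpool iostat -v tank 5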
 
If I understand my searching correctly, with these drives I need to configure ashift=12.


The general wisdom is to always use ashift=12 when setting up a new pool, regardless of whether the drives are 512-byte or 4K, because the ashift cannot be changed for a vdev afterwards. Going forward we are moving to "native" 4K drives, and those should be around for the foreseeable future.

Currently many (if not all) drives operate in 512-byte emulation (512e) mode: the drive works internally with 4K sectors but presents 512-byte sectors to the outside world. Since 512-byte writes fit inside a 4K sector, such a drive still works fine when treated as a 4K device.

If you have a 512 or 512e drive in your pool and you set ashift=12, it means that in the future you can replace that drive with either a 512e drive or a true 4K drive and everything will continue to work properly. If you let ZFS choose the ashift per vdev, or if you set it to 9, then you can only ever replace that device with a 512-compatible one.

The reason this is important is that you cannot expand a pool to contain more vdevs; you can only add more pools to the overall array. However, you can replace the drives within each vdev with larger drives, thus increasing the capacity of the pool. So you don't want to limit yourself to a certain kind of drive.
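For what it's worth, you can verify what ashift a pool ended up with, and on ZFS builds that expose the ashift property you can force it at pool creation time (the pool/device names are placeholders, and -o ashift support varies by platform and version):

Code:
	# show the ashift recorded for each vdev
	zdb -C tank | grep ashift
	# on builds that support it, force 4K alignment at creation
	zpool create -o ashift=12 tank raidz2 c5t2d0 c5t3d0 c5t4d0 c5t5d0 c5t6d0 c5t7d0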
 
Now, I said the "general wisdom" is to use ashift=12. Is there a reason you would want to use 512? Yes, absolutely, depending on your expected file sizes.

Here is something you can do right at home. Write a simple Perl script that loops from 1 to 1000 and, for each iteration, writes a text file named for the number, containing only that number. So it's a text file with very little data in it. If you look at the files, you'll see that those with single-digit numbers are about 3 bytes in size and those with 4 digits are about 6 bytes. So 1000 of these come out to about 4 or 5 kilobytes TOTAL of data, about 2 sectors' worth.

But when you write those individual files, each one gets its own 4K sector, because this is on an SSD with 4K sectors.

Select all those files, open Properties, and look specifically at the "size on disk". 1000 4KB sectors add up to almost 4MB of space: 4MB to store 5KB.

That's where using 512 drives with ashift=9 would be more practical.

[attached screenshot: sector_size.jpg]
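The same experiment can be done in plain shell rather than Perl; this sketch assumes GNU coreutils for du's --apparent-size flag:

Code:
	mkdir smallfiles && cd smallfiles
	# write 1000 tiny files, each containing only its own number
	for i in $(seq 1 1000); do echo "$i" > "$i.txt"; done
	du -sh --apparent-size .   # logical data: only a few KB
	du -sh .                   # space actually allocated: much more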
 
... The reason this is important is that you cannot expand a pool to contain more vdevs; you can only add more pools to the overall array. However, you can replace the drives within each vdev with larger drives, thus increasing the capacity of the pool. So you don't want to limit yourself to a certain kind of drive.

I'm pretty sure you're wrong on that one. You can add more vdevs to a pool at any time.
See: http://docs.oracle.com/cd/E19253-01/819-5461/gazgw/index.html

It's not necessarily optimal in terms of performance (because existing data won't be rebalanced across the vdevs), but it's not necessarily a bad move.
The alternative, as you say, is to replace each drive in each vdev in turn, but that's quite time-consuming and you are deliberately degrading the pool while each new drive resilvers.
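Adding a vdev to a live pool is a one-liner, and the -n dry-run flag lets you double-check the resulting layout before committing (device names are placeholders):

Code:
	# dry run: show what the pool would look like with a second RAID-Z2 vdev added
	zpool add -n tank raidz2 c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 c6t5d0
	# run it again without -n to actually add the vdev (this cannot easily be undone)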
 
Thanks, some good info here. I was trying to get my temporary solution working:
[photo of the temporary setup]


This motherboard has an Intel C600-series chipset with the C600 SAS controller onboard. I cannot get the damned SCU driver to work. I'm on an older version of OI now.

I wanted to pull the data from an array on my other machine.

[photo of the other machine]


This is full of 3TB drives. Another is on the way for more drives. However, if I don't get this C600 controller going I am hosed. I bought an adapter to pass the internal SAS to external SAS for the enclosures. :/
 
I would use raidz3 for better safety. And with 20ish disks, I would do two raidz3 vdevs of 11 disks each, which means 22 disks in total.

So if you have an additional 6 disks in the PC case, you can have 22 disks in total for two raidz3 vdevs. And I would make sure to distribute the disks like this:
vdev 1:
JBOD case holds 8 disks + 3 in the PC case

vdev 2:
JBOD case holds 8 disks + 3 in the PC case

This way, if you lose your PC somehow, you can still start up the JBOD case with all data intact, because each vdev is only missing 3 disks, which raidz3 can tolerate.



Compare to this setup:
vdev 1:
-JBOD case holds 11 disks.

vdev 2:
-JBOD case holds 5 disks and 6 disks in the PC case.

In this last setup, if you lose your PC (it gets stolen), you have lost all your ZFS data, because you cannot import the pool from the JBOD case alone: vdev 2 is missing 6 of its 11 disks, which is more than raidz3 can tolerate. So, make sure the parity disks are in the PC case (distribute them to the PC case).
 
brutalizer makes a good suggestion.
But just thought I'd point something out - there are no "data" drives (drives where only your data is stored) and "parity" drives (drives where only parity data is stored). That would be like RAID 4.
As with RAID 5, the parity blocks in RAID-Z are spread across all the drives in the RAID.
Therefore to achieve what brutalizer is suggesting, it doesn't matter which particular drives go in the PC case. Any three (or more) drives from each of the RAID-Z3s will do.
 
brutalizer makes a good suggestion.
But just thought I'd point something out - there are no "data" drives (drives where only your data is stored) and "parity" drives (drives where only parity data is stored). That would be like RAID 4.
As with RAID 5, the parity blocks in RAID-Z are spread across all the drives in the RAID.
Therefore to achieve what brutalizer is suggesting, it doesn't matter which particular drives go in the PC case. Any three (or more) drives from each of the RAID-Z3s will do.
Exactly. I was too lazy to explain this, as I thought it was common knowledge. But for the record, it is very good that you clarified this for someone who is new to ZFS. Together we give better advice than a single person. :)
 
OK. One thing I am having an issue with here...

My understanding is that I could create multiple vdevs, and then make a pool based on multiple vdevs. I could also add an SSD caching drive...

[screenshot of the napp-it pool creation screen]


I don't have "spare" or any of the SSD options.

Looking at this (the whole doc): http://www.napp-it.org/doc/downloads/napp-it.pdf

There are a lot of differences from what I've got.
 
Ok, I think I figured it out...

Code:
	NAME        STATE     READ WRITE CKSUM
	test        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    c5t2d0  ONLINE       0     0     0
	    c5t3d0  ONLINE       0     0     0
	    c5t4d0  ONLINE       0     0     0
	    c5t5d0  ONLINE       0     0     0
	    c5t6d0  ONLINE       0     0     0
	    c5t7d0  ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
	    c4t2d0  ONLINE       0     0     0
	    c4t3d0  ONLINE       0     0     0
	    c4t4d0  ONLINE       0     0     0
	    c4t5d0  ONLINE       0     0     0
	    c5t0d0  ONLINE       0     0     0
	    c5t1d0  ONLINE       0     0     0

I had to first create a pool and select the drives, then go to extend and pick the remaining drives. At that point the drop-down box had the additional "missing" items.
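If you're curious what napp-it actually ran under the hood, zpool history will list the exact commands that built and extended the pool ('test' being your pool name here):

Code:
	# show every administrative command ever run against the pool
	zpool history test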
 
Hmmm, are there any numbers floating around with specific rebuild/resilver times for the various Raidz levels? What are those bottlenecked by?
 
Hmmm, are there any numbers floating around with specific rebuild/resilver times for the various Raidz levels? What are those bottlenecked by?

Disk size and quantity per vdev, I believe, which is why they suggest a smaller number of disks per vdev and multiple vdevs per pool.
 
Hmmm, are there any numbers floating around with specific rebuild/resilver times for the various Raidz levels? What are those bottlenecked by?

Another reason raidz2/z3 is not equal to RAID 5/6 is that you do not resilver the entire drive bit for bit; you resilver just the data. Therefore, resilver times will be greatly affected by the utilized space. I know this doesn't speak exactly to your question, but it seems appropriate to mention when you are considering how long you would be running with reduced redundancy.
 
How does one get the iSCSI performance to improve? Transfers to the iSCSI target from Windows start out at 200-500MB/s and then drop to 20-80MB/s.

When using CIFS, the performance is flat at 100-125MB/s; when using iSCSI, it's all over the map.

I added the SSD anticipating it would improve the write speeds.
 
How does one get the iSCSI performance to improve? Transfers to the iSCSI target from Windows start out at 200-500MB/s and then drop to 20-80MB/s.

When using CIFS, the performance is flat at 100-125MB/s; when using iSCSI, it's all over the map.

I added the SSD anticipating it would improve the write speeds.

What OS on the server side?

Can you post a more specific benchmark result from a client that breaks it up into its parts?

Anyway, you do not actually have transfers of 200-500 MB/sec. You need to understand that iSCSI runs through the full filesystem buffer cache on the client side and has no cache-coherency obligations with respect to the server. That makes the performance very hard to compare to networked filesystems (iSCSI is not a networked filesystem).
 
Server 2012. I don't have a benchmark; this is actual data I am copying.

[screenshot: 350iscsi.png]


This is with a large batch of files.

[screenshot: dropiscsi.png]


This one is with a few larger files.

If there is something I should use to benchmark, let me know and I'll certainly run it.
 
Another reason raidz2/z3 is not equal to RAID 5/6 is that you do not resilver the entire drive bit for bit; you resilver just the data. Therefore, resilver times will be greatly affected by the utilized space. I know this doesn't speak exactly to your question, but it seems appropriate to mention when you are considering how long you would be running with reduced redundancy.

I know, and it's an important factor.

One problem with it is that I tend to fill my HD space with "junk", more or less cached data that I delete when I need the space for something useful.

Is there a way to express a priority for what you want resilvered first?
 
ZFS is a professional filesystem; such considerations are not a priority. If you lose too many drives, you lose everything anyway.

Personally I have one rackable enclosure (more on order soon) with 16x2TB drives plus 3x2TB in the server chassis, making a 19-drive RAIDZ3 vdev/pool, to be expanded with another similar vdev, then maybe a third (probably migrated to 3 or 4TB drives first).
 