RAID 60 Using ZFS

GreenLED

I have been studying my options for how I should configure my new NAS. After much "deliberation" with myself, I have come to the conclusion that I'd like to implement RAID 60 with ZFS. RAID 60 is two (or more) RAID 6 arrays striped together, hence the name. To my understanding, I would need to create my pool as follows.

Pool
  Alpha (RZ-2)
    Drive 1
    Drive 2
    Drive 3
    Drive 4
  Bravo (RZ-2)
    Drive 5
    Drive 6
    Drive 7
    Drive 8

I'm pretty sure that is the correct implementation. You have two vdevs (which by default are dynamically striped), each configured as RAIDZ2, which, from what I saw in the "Ninja" video on YouTube, is RAID 6 (or the equivalent).

I'm having a hard time finding a definitive answer on the equivalent RAID #'s when it comes to RZ-1, RZ-2 and RZ-3. If someone could clarify that, I would appreciate it.

More importantly, if I want to ADD an additional array to the pool (+1 vdev of RZ-2 / RAID 6) can that be done AFTER the fact or can you only add additional drives to the current vdevs?

Oh, one other thing, I keep seeing people referring to "don't use over x amount of drives in a vdev, it will slow to a crawl". Anything to this?
 
You have it right with regard to creating the equivalent of a RAID 60.

If you're familiar with standard RAID levels, it's easy to understand raidz1/2/3. "Raidz" indicates that it's parity RAID, and the number indicates the number of parity drives per vdev. So raidz1 = RAID 5, raidz2 = RAID 6, and raidz3 = triple parity (sometimes called RAID 7).

A key thing to keep in mind with ZFS is that you cannot add additional drives to an existing vdev. So expansion is either a one-by-one replacement (and rebuild/resilver) of each drive in a vdev with larger drives, or adding an additional vdev. Either method can be done after the initial pool setup. Existing data will also NOT rebalance over a newly added vdev, but new writes will. So if that's a concern with static data, you may need to manually copy things around to force the issue.
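To make those two expansion paths concrete, here is a rough sketch of the commands involved (pool and device names are placeholders, not anything from this thread):

Code:
# Path 1: grow the pool by adding another raidz2 vdev
zpool add tank raidz2 disk9 disk10 disk11 disk12

# Path 2: grow an existing vdev by swapping every disk for a larger one,
# one at a time, letting each resilver finish before the next replace
zpool set autoexpand=on tank
zpool replace tank disk1 bigdisk1
zpool status tank    # wait for the resilver to complete, then do the next disk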

Off the top of my head, I can't think of why raidz would slow to a crawl over X number of drives. I suppose for large enough values of X, you're splitting up the stripe into individual pieces that are smaller than each individual drive's sector size. I think it's more the increasing chance of Y+1 drive failures, where Y is the number of parity drives. Someone more well versed with ZFS than I am might be willing to chime in.

And a handy link, even though ZFS on linux doesn't seem to be the most popular around here (but this is only marginally linux specific): https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/
 
I run a RAID 60 equivalent on ZFS: 6 x 2TB disks and 6 x 3TB disks, on ZFS on Linux. 6 disks is a good number for RAIDZ2 - very little overhead and very safe.
 
I am running that at work on a server. 2 arrays of 7 x 2TB drives.

Code:
datastore4 ~ # zpool status
  pool: zfs_data_0
 state: ONLINE
  scan: scrub in progress since Sat Mar  8 04:20:10 2014
    6.59T scanned out of 6.62T at 464M/s, 0h1m to go
    0 repaired, 99.57% done
config:

        NAME             STATE     READ WRITE CKSUM
        zfs_data_0       ONLINE       0     0     0
          raidz2-0       ONLINE       0     0     0
            a0_d0-part3  ONLINE       0     0     0
            a0_d1-part3  ONLINE       0     0     0
            a0_d2-part3  ONLINE       0     0     0
            a0_d3-part3  ONLINE       0     0     0
            a0_d4-part3  ONLINE       0     0     0
            a0_d5-part3  ONLINE       0     0     0
            a0_d6-part3  ONLINE       0     0     0
          raidz2-1       ONLINE       0     0     0
            a1_d0-part3  ONLINE       0     0     0
            a1_d1-part3  ONLINE       0     0     0
            a1_d2-part3  ONLINE       0     0     0
            a1_d3-part3  ONLINE       0     0     0
            a1_d4-part3  ONLINE       0     0     0
            a1_d5-part3  ONLINE       0     0     0
            a1_d6-part3  ONLINE       0     0     0

errors: No known data errors
datastore4 ~ # equery l zfs* spl*
 * Searching for zfs* ...
[IP-] [  ] sys-fs/zfs-0.6.2-r3:0
[I-O] [  ] sys-fs/zfs-auto-snapshot-9999:0
[IP-] [  ] sys-fs/zfs-kmod-0.6.2-r3:0

 * Searching for spl* ...
[IP-] [  ] sys-kernel/spl-0.6.2-r3:0
datastore4 ~ # uname -a
Linux datastore4 3.12.13-gentoo-datastore4 #2 SMP Thu Feb 27 13:42:10 EST 2014 x86_64 Intel(R) Xeon(R) CPU E31230 @ 3.20GHz GenuineIntel GNU/Linux

Edit: Looks like the weekly scrub is just ending..

Edit2:
The main reason I went for 2x raidz2 over 1x raidz3 was mostly to minimize downtime, and I only had 16 disks free with very little chance to buy any more (budget). I had less than 1 hour of total downtime converting the ~20 disks (1TB and 500GB), which were running btrfs on top of mdadm in 2 separate raid6 arrays, over to ZFS. Most of the data moving was done live, but I rebooted a few times to swap out whole arrays. I first moved the 10-disk (500GB) array to a different server and swapped in 7 new 2TB drives to create the first raidz2. Then I moved all of the data from the array that used 1TB disks onto the raidz array, swapped the ~10 1TB drives out, and rebooted to rsync the OS again onto the 2TB disks. I believe I then just hot-added the rest of the 2TB drives and created the second raidz2 with the system online.
 
I need to address an important concern I have before I respond to these great replies.

1. Which way should I expand my storage capacity?

A. Keep adding more vdevs with RZ-2. Taking my example up top, the next one would be Charlie (RZ-2) with another 4 drives. OR
B. Expand the number of disks in each of the two original arrays (Alpha, Bravo).

Which way should I expand? I'm going to be expanding this as much as possible. I don't want to expand the pool to the point that it starts choking or tripping over its own technology and somehow crashing or slowing down. What is the proper or proven method?

2. If data is not "re-balanced" across all of the drives, how is my data safe? Is there a manual function to re-balance all data across the drives, or is this just overkill and I should be OK with newly written data not being spread across all of my drives?

I have to get all of these questions out or I will forget to ask.

3. WHAT, I say WHAT is the purpose of even having a container that holds all the vdevs (the zpool)? Why would an administrator want to create more pools, and how does this translate into a real-world situation? Do pools talk to each other? Can I use a pool to back up to another pool, i.e. clone a pool to another pool for even MORE overkill redundancy?
 
Well, expanding it the (B) way would require you to completely re-create the whole pool (both Alpha and Bravo), so you would need somewhere to store all your data in the meantime.

The 2 ways to expand a zpool (without re-creating it) are to add more vdevs or to replace all the disks in one vdev with larger disks.
 
Well, expanding it the (B) way would require you to completely re-create the whole pool (both Alpha and Bravo), so you would need somewhere to store all your data in the meantime.

The 2 ways to expand a zpool (without re-creating it) are to add more vdevs or to replace all the disks in one vdev with larger disks.

What method do storage administrators usually use? I'm guessing adding more vdevs.
 
1A) This is a possibility. Do note that you rarely see RAIDZ2 vdevs with only 4 disks. 6 is common and recommended, though.
1B) You cannot change the number of disks in a vdev once created. You need to rebuild the pool from scratch to do that.
2) Purely from a data safety point of view data re-balancing across vdevs is irrelevant, because if any of the vdevs fail, all data in the pool is lost.
3) Yes, you can incrementally copy snapshots from one pool to another over the network. It's a very neat way of doing backups or moving the datasets.
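For what it's worth, a minimal sketch of what that looks like (pool, dataset and host names here are made up for illustration):

Code:
# initial full copy of a snapshot to a pool on another box
zfs snapshot tank/data@backup-1
zfs send tank/data@backup-1 | ssh backuphost zfs receive backup/data

# later, send only what changed since the previous snapshot
zfs snapshot tank/data@backup-2
zfs send -i tank/data@backup-1 tank/data@backup-2 | ssh backuphost zfs receive -F backup/data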
 
1A) This is a possibility. Do note that you rarely see RAIDZ2 vdevs with only 4 disks. 6 is common and recommended, though.
1B) You cannot change the number of disks in a vdev once created. You need to rebuild the pool from scratch to do that.
2) Purely from a data safety point of view data re-balancing across vdevs is irrelevant, because if any of the vdevs fail, all data in the pool is lost.
3) Yes, you can incrementally copy snapshots from one pool to another over the network. It's a very neat way of doing backups or moving the datasets.

Can I add a log, cache and spare later on?

So, let's see. I'm trying to come up with a game plan here for how I'm going to scale this. The plan is 2 vdevs, one with 6 drives and the other with 6 as well. Both will be RZ-2. When I want to expand that pool, I would order 6 more drives, create a new vdev and make it RZ-2 as well. Does that sound about right, or is there some point at which I need to stop adding vdevs? Will it destroy performance at some point? I have heard that when a drive is about 80% full, things start to go south. That's just what I've heard, mind you.

Does anyone know - if a log or cache device (which will be put on an SSD) FAILS for some reason, will ZFS AUTOMATICALLY revert to just pushing the data straight to disk? This seems like a failure point that should be addressed in ZFS.
 
Yes, log, cache and spares can be added later. Yes, adding more vdevs is the easiest way of expanding pool capacity.
Adding vdevs will linearly improve IOPS performance. The general performance of ZFS pools degrades the fuller they are.
If a cache disk fails, it will simply not be used anymore. A log device is only ever read after a crash. I'm not sure how ZFS reacts if the log device produces write errors, but it will probably fault it, which means the intent log gets written to the regular pool instead - still safe, just slower.
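As a rough sketch of adding those devices later (device names are placeholders):

Code:
zpool add tank log mirror ssd1 ssd2   # SLOG; mirrored so losing one SSD doesn't hurt
zpool add tank cache ssd3             # L2ARC read cache; losing it just means slower reads
zpool add tank spare disk13           # hot spare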
 
Adding more vdevs increases performance, since it stripes the vdevs together - a big RAID 0 of all your vdevs. I've personally seen setups with 4 vdevs in them, but I don't know what a good limit is.

The thing is, since it stripes all the vdevs, the more you add to the pool, the greater the risk of losing the whole pool.
 
The thing is, since it stripes all the vdevs, the more you add to the pool, the greater the risk of losing the whole pool.

See, when I see this kind of response, it makes me nervous. How should I handle my limits? What limits should I set for myself? Because if I don't have limits, I'm just going to keep piling on the disks until the whole thing fails (which is NOT what I want to do :)). So, how should I handle this - is there a "best practice"? I want to be able to scale this so that I can combine pools later on into a transparent filesystem that just serves all my users, without them having to think about two or three different accounts because I can't handle it on my end.

Adding more vdevs increases performance.

Indeed because of the striping.

Here's a FreeBSD forum post that I don't know how to interpret.

https://forums.freebsd.org/viewtopic.php?&t=34558

Take a look at the bullet points of guidance. Does anyone agree with this assessment?
 
Code:
datastore4 ~ # zpool status
  pool: zfs_data_0
 state: ONLINE
  scan: scrub in progress since Sat Mar  8 04:20:10 2014
    6.59T scanned out of 6.62T at 464M/s, 0h1m to go
    0 repaired, 99.57% done
config:

        NAME             STATE     READ WRITE CKSUM
        zfs_data_0       ONLINE       0     0     0
          raidz2-0       ONLINE       0     0     0
            a0_d0-part3  ONLINE       0     0     0
            a0_d1-part3  ONLINE       0     0     0
            a0_d2-part3  ONLINE       0     0     0
            a0_d3-part3  ONLINE       0     0     0
            a0_d4-part3  ONLINE       0     0     0
            a0_d5-part3  ONLINE       0     0     0
            a0_d6-part3  ONLINE       0     0     0
          raidz2-1       ONLINE       0     0     0
            a1_d0-part3  ONLINE       0     0     0
            a1_d1-part3  ONLINE       0     0     0
            a1_d2-part3  ONLINE       0     0     0
            a1_d3-part3  ONLINE       0     0     0
            a1_d4-part3  ONLINE       0     0     0
            a1_d5-part3  ONLINE       0     0     0
            a1_d6-part3  ONLINE       0     0     0

errors: No known data errors
datastore4 ~ # equery l zfs* spl*
 * Searching for zfs* ...
[IP-] [  ] sys-fs/zfs-0.6.2-r3:0
[I-O] [  ] sys-fs/zfs-auto-snapshot-9999:0
[IP-] [  ] sys-fs/zfs-kmod-0.6.2-r3:0

 * Searching for spl* ...
[IP-] [  ] sys-kernel/spl-0.6.2-r3:0
datastore4 ~ # uname -a
Linux datastore4 3.12.13-gentoo-datastore4 #2 SMP Thu Feb 27 13:42:10 EST 2014 x86_64 Intel(R) Xeon(R) CPU E31230 @ 3.20GHz GenuineIntel GNU/Linux

Every time you finish a scrub, there should be a ZFS app that sends you a notification: "Congratulations! You've achieved _____ status! Keep scrubbing!" :D. You should get achievements for each scrub :).

By the way, is this running on a Linux box? -- Never mind :D.

Can you paste the cron job line? I haven't scheduled a scrub before - or should I be able to schedule this in the FreeNAS interface?
 
I see you have an answer for FreeNAS. Here is what I have on Gentoo in /etc/cron.weekly/zpool_scrub.cron:

Code:
datastore4 ~ # cat /etc/cron.weekly/zpool_scrub.cron
#!/bin/sh

echo "********************************************************************************" >> /var/log/zpool_scrub.log
echo `date '+%F'` "- Doing ZFS scrub"  >> /var/log/zpool_scrub.log
echo "********************************************************************************" >> /var/log/zpool_scrub.log
/sbin/zpool scrub zfs_data_0 >> /var/log/zpool_scrub.log
sleep 4h
/sbin/zpool status zfs_data_0 >> /var/log/zpool_scrub.log
echo >> /var/log/zpool_scrub.log

I have also enabled zfs-auto-snapshot, with daily, weekly and monthly snapshots depending on the dataset:

Code:
datastore4 ~ # cat /etc/cron.daily/zfs-auto-snapshot
#!/bin/sh
exec zfs-auto-snapshot --default-exclude --quiet --syslog --label=daily --keep=31 //

datastore4 ~ # cat /etc/cron.weekly/zfs-auto-snapshot
#!/bin/sh
exec zfs-auto-snapshot --default-exclude --quiet --syslog --label=weekly --keep=8 //

datastore4 ~ # cat /etc/cron.monthly/zfs-auto-snapshot
#!/bin/sh
exec zfs-auto-snapshot --default-exclude --quiet --syslog --label=monthly --keep=12 //

I have enabled the snapshots for my datasets using zfs set:
http://docs.oracle.com/cd/E19120-01/open.solaris/817-2271/ghzuk/index.html
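For example, turning daily snapshots on for a single dataset looks like this (using one of the datasets from the listing below):

Code:
zfs set com.sun:auto-snapshot:daily=true zfs_data_0/Testing
zfs get com.sun:auto-snapshot:daily zfs_data_0/Testing   # verify the property took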

Here is what I currently have for datastore4 at work:

Code:
datastore4 ~ # zfs get all | grep -v @ | grep com.sun:auto-snapshot
zfs_data_0                                                                com.sun:auto-snapshot:monthly  true                                 local
zfs_data_0/Testing                                                        com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/Testing                                                        com.sun:auto-snapshot:daily    true                                 local
zfs_data_0/datastore1                                                     com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/datastore1/TempSpace                                           com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/datastore1/homes_images                                        com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/datastore1/temp-data                                           com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/datastore1/user_private                                        com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/distfiles                                                      com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/imagedata                                                      com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/lxc                                                            com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/samba_test                                                     com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/software                                                       com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/swap                                                           com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/temp                                                           com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/user_private_mirror                                            com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/user_private_mirror                                            com.sun:auto-snapshot:daily    true                                 local
zfs_data_0/user_public_mirror                                             com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
zfs_data_0/user_public_mirror                                             com.sun:auto-snapshot:daily    true                                 local
zfs_data_0/winbackups                                                     com.sun:auto-snapshot:monthly  true                                 inherited from zfs_data_0
 
See, when I see this kind of response, it makes me nervous. How should I handle my limits? What limits should I set for myself? Because if I don't have limits, I'm just going to keep piling on the disks until the whole thing fails (which is NOT what I want to do :)).
Any configuration might fail; it's just a matter of probabilities (though there are huge differences). Regardless of the setup you settle on, remember that neither RAID nor ZFS replaces proper backups. You have to make backups and have a working recovery strategy.
 
Any configuration might fail; it's just a matter of probabilities (though there are huge differences). Regardless of the setup you settle on, remember that neither RAID nor ZFS replaces proper backups. You have to make backups and have a working recovery strategy.

A backup for the backup :). Oy! I AM the backup for my customers. I guess I have to back up what they back up, then. But, yes, it's all about probabilities.
 
Pool
  Alpha (RZ-2)
    Drive 1
    Drive 2
    Drive 3
    Drive 4
  Bravo (RZ-2)
    Drive 5
    Drive 6
    Drive 7
    Drive 8
This configuration has a 50% reduction in usable capacity - you're aware of this, yes? There is only (IMO) one reason to configure it as you've done, and that's if you're splitting the disks/JBODs across racks so you can create discrete failure zones and still remain online.

However, even doing this you're better off with mirrors/RAID 10, as you 'lose' the same amount of usable space (50%) but you gain twice the performance.

For raidz2, IMO, don't go below 6 disks; 7 for raidz3. Technically you can do both z2 and z3 with as little as 4 disks, but you're losing a ton of usable space and not gaining any performance like you would from mirroring.
I'm having a hard time finding a definitive answer on the equivalent RAID #'s when it comes to RZ-1, RZ-2 and RZ-3. If someone could clarify that, I would appreciate it.
Z1 = raid5 (but never use z1)
Z2 = raid6
Z3 = ...uhm, RAID 6 plus one more parity drive (triple parity; there's no standard RAID number for it)
More importantly, if I want to ADD an additional array to the pool (+1 vdev of RZ-2 / RAID 6) can that be done AFTER the fact or can you only add additional drives to the current vdevs?
With the exception of mirror vdevs, you cannot manipulate Z2/Z3 vdevs at all. Even in the event of a failure you cannot remove a drive from a Z2/Z3 vdev; you can only replace it. Once it's replaced and the resilver finishes, the system automatically removes the old drive.

Meaning, yes, you have to add more vdevs if you want to expand. Also, do note that the new vdev will not automatically speed up your existing data. You would need to migrate all existing data off the system, or just move/copy it to another dataset, which rewrites the data and balances it across the new disks.
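A rough sketch of the copy-things-around approach (dataset names made up for illustration) - it's the rewrite that spreads the blocks over the new vdev:

Code:
zfs create tank/media_rebalanced
rsync -aHX /tank/media/ /tank/media_rebalanced/   # rewritten data lands across all vdevs, including the new one
zfs destroy -r tank/media
zfs rename tank/media_rebalanced tank/media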
Oh, one other thing, I keep seeing people referring to "don't use over x amount of drives in a vdev, it will slow to a crawl". Anything to this?
Mmmm, not really, no. It's always ideal to stick to the standard counts when building your vdevs though. Those being:

mirror = minimum two disks
rz2 = minimum 4 disks, but typically 6 or 10 disks.
rz3 = minimum 4 disks, but typically 7 or 11 disks.

IMO, if you're using 1TB disks don't go above 10 disks in z2 configs. If you're using 2TB and larger disks, don't go above 6.

Z3 is a bit different IMO. It depends on exactly what you want from the array. I have a general 'spinning disk' only setup that uses 7-disk Z3 vdevs and 3TB drives. It functions well and is quite fast for what it needs to do. However, if you were building something for backup/archival and this storage is ONLY going to be accepting large sequential streams of data, then going up to 15-drive Z3 vdevs is acceptable, though 11 is kind of the sweet spot. Rebuilds of a 15-drive Z3 vdev that is 85% full will take FOREVER!

There is some 'lost' space when using odd drive counts too. You still net more usable space than you would have had you gone with smaller vdevs, mind you; you just don't get to make full use of it, is all.
 
My RAID 60 calculator says I will have ~16 TB available with 24 disks. Is that incorrect?
 
Can you sketch out a mirrored array for 12 disks and how you think it might work better? To my understanding, I will lose some failure tolerance if I go with mirroring. So, are you suggesting some sort of nested RAID? I was considering RAID 10. I need to come to a final decision here pretty soon; I'm about to go into development and I need a solid solution. I feel like going with mirroring gets me less disk space vs. RAID 60.
 
The answer depends on what you want to be resilient to, which is an answer that might be informed by your disk topography.

Here's an example based on the storage server I'm working on.

The disk topography is two ports in an HBA, each port goes out to four disks, for a total of eight disks, where the disks are directly attached (no SAS expander).

The disks in the server look like this:
Code:
1 2 3 4 (port 1)
5 6 7 8 (port 2)
Disks 1-4 are on one HBA port, disks 5-8 are on the other port.

You can design a RAID10 that is resilient to a single port failure by creating mirrors of disks 1+5, 2+6, 3+7 and 4+8, then striping those mirrors together.

In this design if a single port disappears then either disks 1-4 go away or 5-8 go away, in either case the stripes remain intact.

However, the downside to this design is that if you lose a given disk then you are at risk. Say disk 1 fails. You replace it and during the rebuild if you lose disk 5 you're cooked, you just lost your pool.

Now, take the same disk topography:

Code:
1 2 3 4 (port 1)
5 6 7 8 (port 2)
This time we create a RAID60. Disks 1, 2, 5 and 6 are in a raidz2 and are striped to 3, 4, 7 and 8 in another raidz2.

In this case you can lose a port and either 1-4 or 5-8 go away. If 1-4 go away you still have the first stripe intact on disks 5+6 and the second stripe intact on disks 7+8.

The primary advantage to this is if you lose a single disk (say, disk 1) then you can lose any other disk during the rebuild and not be cooked.

In the case of an eight drive setup the space difference between striped mirrors and striped raidz2s is nothing. That's arguably wasteful, as you get better performance from striped mirrors. But that comes at the cost of some resiliency.
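A rough sketch of those two eight-disk layouts as pool creation commands (pool name and device names are placeholders, not the actual devices in this build):

Code:
# striped mirrors (RAID 10 style): 1+5, 2+6, 3+7, 4+8
zpool create tank mirror d1 d5 mirror d2 d6 mirror d3 d7 mirror d4 d8

# two raidz2 vdevs striped together (RAID 60 style): 1,2,5,6 and 3,4,7,8
zpool create tank raidz2 d1 d2 d5 d6 raidz2 d3 d4 d7 d8

Either way you end up with four disks' worth of usable space; the difference is in performance and in which combinations of failures you can survive.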

So you have to understand the topography, what you want protection from, how much resilience you want and what you want to be resilient to. There are not necessarily right and wrong answers to these questions.

My primary concern in my own planning is rebuild times and exposure during that rebuild. 3 TB disks are frigging huge and, even assuming the disk writes at max write speed during rebuild, it will take many, many, many hours to rebuild.

There are other options still: striping together three-way mirrors sacrifices more disk space but gives the performance benefits of striped mirrors. So, in the case of a 12 disk system you could have four three-way mirrors striped together.
 
The answer depends on what you want to be resilient to, which is an answer that might be informed by your disk topography.

Here's an example based on the storage server I'm working on.

The disk topography is two ports in an HBA, each port goes out to four disks, for a total of eight disks, where the disks are directly attached (no SAS expander).

The disks in the server look like this:
Code:
1 2 3 4 (port 1)
5 6 7 8 (port 2)
Disks 1-4 are on one HBA port, disks 5-8 are on the other port.

You can design a RAID10 that is resilient to a single port failure by creating mirrors of disks 1+5, 2+6, 3+7 and 4+8, then striping those mirrors together.

In this design if a single port disappears then either disks 1-4 go away or 5-8 go away, in either case the stripes remain intact.

However, the downside to this design is that if you lose a given disk then you are at risk. Say disk 1 fails. You replace it and during the rebuild if you lose disk 5 you're cooked, you just lost your pool.

Now, take the same disk topography:

Code:
1 2 3 4 (port 1)
5 6 7 8 (port 2)
This time we create a RAID60. Disks 1, 2, 5 and 6 are in a raidz2 and are striped to 3, 4, 7 and 8 in another raidz2.

In this case you can lose a port and either 1-4 or 5-8 go away. If 1-4 go away you still have the first stripe intact on disks 5+6 and the second stripe intact on disks 7+8.

The primary advantage to this is if you lose a single disk (say, disk 1) then you can lose any other disk during the rebuild and not be cooked.

In the case of an eight drive setup the space difference between striped mirrors and striped raidz2s is nothing. That's arguably wasteful, as you get better performance from striped mirrors. But that comes at the cost of some resiliency.

So you have to understand the topography, what you want protection from, how much resilience you want and what you want to be resilient to. There are not necessarily right and wrong answers to these questions.

My primary concern in my own planning is rebuild times and exposure during that rebuild. 3 TB disks are frigging huge and, even assuming the disk writes at max write speed during rebuild, it will take many, many, many hours to rebuild.

There are other options still: striping together three-way mirrors sacrifices more disk space but gives the performance benefits of striped mirrors. So, in the case of a 12 disk system you could have four three-way mirrors striped together.

I'll tell you what, I definitely found the right forum :D.

Thank you for bringing up the idea of ports and being resilient to that type of failure. Here's the deal: I want - I NEED to design this server to be as resilient to data loss as possible. Let me qualify that with the statement that I can't just mirror everything because, obviously, cost. So, I saw RAID 60 as the next best thing. RAID 6 is great, but RAID 60 (for my situation) is better because not only do I have 2-disk failure tolerance, I also have the benefit of striping, which compensates for the slowness of RAID 6. Data integrity is paramount to this build, as I'm sure it is for everyone's build, but I am storing other users' files as well as my own.

What HBA card are you using if I may ask?

Please correct my logic here if it's not right, but from what I understand about this configuration (forgetting the port failure issue for a moment) . . .

Pool
  Alpha (RZ-2)
    Disk 1
    Disk 2
    Disk 3
    Disk 4
    Disk 5
    Disk 6
  Bravo (RZ-2)
    Disk 7
    Disk 8
    Disk 9
    Disk 10
    Disk 11
    Disk 12

I'm going to use this diagram and keep tweaking it until I get my final product. So, from what I understand here, you can lose up to TWO disks in the first vdev and still be OK. You can also lose two disks in the second set and STILL be OK. Am I right so far?

Hopefully so. If I am, then I want to tweak this setup to be resilient to the port issue (if that's possible). Not sure if an HBA will support this many devices or if I would have to have 3 HBAs and then adjust this scheme. I dunno. Want to see your response so I can come up with something finalized.
 
The stated setup is quite reasonable and would yield you a pool with the IOps performance of 2 disks and the capacity of 8 disks minus some overhead. Yes, you might lose up to 2 disks in each vdev and still be fine.

You could do with as little as a single SAS HBA (like the IBM M1015 flashed to IT mode) and connect that to the JBOD chassis. As mentioned you won't have full redundancy that way. Full redundancy would at least halve the usable capacity and require more hardware.

We are still in the dark in regards to what your actual requirements are, so if you could provide some more info on the needed capacity, expected load, availability requirements, budget etc. our suggestions could be much more focused.
 
What HBA card are you using, if I may ask?
A cross-flashed Dell Perc H310, running native LSI firmware in 'IT' mode. That particular HBA has two internal SAS ports, each of which has a cable with 4 SATA ports hanging off the end, for a total of eight disks. I believe that's called a 'forward breakout' cable.

Please correct my logic here if it's not right, but from what I understand about this configuration (forgetting the port failure issue for a moment) . . .

Pool
  Alpha (RZ-2)
    Disk 1
    Disk 2
    Disk 3
    Disk 4
    Disk 5
    Disk 6
  Bravo (RZ-2)
    Disk 7
    Disk 8
    Disk 9
    Disk 10
    Disk 11
    Disk 12

I'm going to use this diagram and keep tweaking it until I get my final product. So, from what I understand here, you can lose up to TWO disks in the first vdev and still be OK. You can also lose two disks in the second set and STILL be OK. Am I right so far?
Correct. The pool is intact as long as all the vdevs striped together are intact. Each vdev is a raidz2 so each vdev can lose up to two disks before the pool fails.

Hopefully so. If I am, then I want to tweak this setup to be resilient to the port issue (if that's possible). Not sure if an HBA will support this many devices or if I would have to have 3 HBAs and then adjust this scheme. I dunno. Want to see your response so I can come up with something finalized.
Well, that really depends on the topography.

The way I've cooked it up is the cheapest way to run an HBA - attaching a SAS HBA to SATA drives. In that model you run 4 SATA drives per SAS HBA port.

In a pure SAS environment you can use SAS expanders. In that model the HBA plugs into the SAS expander (using SAS ports on both ends) while the expander generally lives on a backplane and plugs into lots of drives. Those expanders can also have multiple SAS ports on them so you can build resilience into the SAS topography.*

* - giant disclaimer: I've never done any of that jazz myself, that's just how I understand it works.

So, taking a step back to 'what I know' let's look at the 12 disk configuration as a series of HBAs.

Scenario: Two HBAs, four ports, three disks on each port.

Code:
Disk 01 HBA 1 Port 1
Disk 02 HBA 1 Port 1
Disk 03 HBA 1 Port 1
Disk 04 HBA 1 Port 2
Disk 05 HBA 1 Port 2
Disk 06 HBA 1 Port 2
Disk 07 HBA 2 Port 1
Disk 08 HBA 2 Port 1
Disk 09 HBA 2 Port 1
Disk 10 HBA 2 Port 2
Disk 11 HBA 2 Port 2
Disk 12 HBA 2 Port 2

So, can a stripe of raidz2s fit into this configuration as desired, to survive a single port failure?

First, consider the above Alpha / Bravo configuration. It fails the test. If HBA1 port 1 fails then disks 1-3 disappear from the Alpha stripe, that vdev fails and the whole pool fails.

Let's try a small change. Put the odd numbered disks into the first raidz2. Put the even numbered disks into the second raidz2.

Pool
  Alpha (RZ-2)
    Disk 1
    Disk 3
    Disk 5
    Disk 7
    Disk 9
    Disk 11
  Bravo (RZ-2)
    Disk 2
    Disk 4
    Disk 6
    Disk 8
    Disk 10
    Disk 12

If HBA1 port 1 goes away then Alpha loses disks 1+3, Bravo loses disk 2. Pool is intact.
If HBA1 port 2 goes away then Alpha loses disk 5, Bravo loses disks 4+6. Pool is intact.
If HBA2 port 1 goes away then Alpha loses disks 7+9, Bravo loses disk 8. Pool is intact.
If HBA2 port 2 goes away then Alpha loses disk 11, Bravo loses disks 10+12. Pool is intact.
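As a rough sketch (placeholder device names; in practice you'd use the stable /dev/disk/by-id names), that odd/even layout would be created as:

Code:
zpool create tank \
    raidz2 d01 d03 d05 d07 d09 d11 \
    raidz2 d02 d04 d06 d08 d10 d12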

The physical topography is probably the most vital detail of this planning.

It's also worth keeping the physical topography in mind when you consider that a lot of the hardware you buy has this topography baked in, in particular the hard drive bits of chassis/cases.

Further considering a 12 disk system, it is definitely advantageous to have a backplane to power said disks, otherwise you have a lot of cables kicking around and blocking airflow. However, that backplane can't have a SAS expander if one wishes to use SAS to SATA breakout.
 
There is another method that would be considerably simpler to administer. Build out ServerA with 6 hard drives in your RZ-2 config. Then build ServerB with 6 hard drives in RZ-2. Then use any number of methods to replicate the contents of ServerA to ServerB. Off the top of my head, you could use rsync, ZFS send, scp or something client-side.

Why go this route? You've removed most of your single points of failure in a single stroke. You should be able to pick up some more performance from a pair of servers vs. a single monster server. Oh yeah, and scaling up is easier too. Just add ServerC, ServerD, etc. as your requirements grow.
 
If you went that route, why use something external to replicate? You could just export a zvol from servers A and B, then use those to build a pool on server C.

Servers A and B could be relatively small, with say 16GB of RAM; server C then gets 32+.
 
If you went that route, why use something external to replicate? You could just export a zvol from servers A and B, then use those to build a pool on server C.

Servers A and B could be relatively small, with say 16GB of RAM; server C then gets 32+.

Yup, that would work fine. Like I said, there are a few ways to do it. Bottom line is you can *really* simplify the server build and config. Plus, you'll save $$$ too.
 
Quite a lot of money too. Sticking with SATA disks and cheap HBAs (such as the ones mentioned in this thread, both based on the LSI 2008 chipset) brings huge cost savings over going to SAS disks, using expanders, etc.

A single HBA drives 8 disks, which is close to or beyond what you'll fit inside most desktop or small server cases. The bigger cases that go well beyond 8 disks (and why bother if you aren't going to go all the way up to two HBAs and 16 disks) tend to cost well more than two 8-disk cases, and you end up with all your eggs in one basket.
 
Honestly, the price delta between enterprise SAS and enterprise SATA is really small these days.
 
Honestly, the price delta between enterprise SAS and enterprise SATA is really small these days.

I'm sure the price difference is shrinking, but the overall build cost is what I was referring to. A smaller server requires an HBA with fewer ports, the motherboard needs less processing power and RAM, the power supplies can be smaller, the case will be smaller, etc.
 
I'm really stepping into something I haven't before. *deep breath*

A few posts back someone asked me to expand on my configuration and such and explain what I'm going for. Well, let me try to do that. I think it would be beneficial to complete the planning to this configuration.

To start off with (which you already know), I have a 12 disk configuration. I had planned on building out JBODs and just have a main server running everything.

There was also a discussion about NOT using SAS expanders with ZFS, which I wanted to try to stick with. Now, what's making this difficult for me is that I don't really understand SAS, simply because I have never used it. Is it a physical layer, application layer, protocol layer? What is it? I had looked at some HBAs online, but even those are a little confusing to me. Are HBAs classified as SATA, SAS, or RAID, or does RAID come with both SAS and SATA HBAs? The confusion is multiplied by the fact that many of the responses assume certain things (not because anyone is being rude, but because they already are "in the know"). So, before I go any further, I guess I want to lay a couple of things down so we can hammer this down properly.

When I was younger, I could assimilate information and communicate it so much more effectively. I'm not really an old fart, but these last few years have brought a fog to my mind. But, let me try to step through this and see if I can't come up with the right questions.

1. I saw a post about using an "SFF" something cable with the complementary card that goes with it. Is this card an HBA? (I'm assuming it is). Is this type of card a SAS/SATA/RAID card, or both, or neither, or a mix of both? Maybe once I hammer down what kind of card I want to use, this whole thing will be clearer to me.

2. I understand much of this stuff, so please don't brush me off as an idiot, cause I'm not, but I have to ask the stupid questions to get to the good ones. Having said that, the Dell and HP and IBM HBAs you guys mention. Are those SAS, SATA, RAID, both, neither? I understand that you flash them to remove the RAID functionality. I get that (I think). But, are these using the same type of "SFF-" cables I mentioned in my first question.

Now that I've asked some questions here, here is what I'm going to do from now on once I get an answer and have made a decision on things. I will update this thread (link) with the details of the updated configuration, so if you have a question about the configuration, you can just look there. I feel like I ask questions and am just overwhelmed with the answers and don't move forward in my thought pattern (not your fault, mine).

I'm sure you guys have all seen that Oracle "ZFS Ninja" video. I was just reminded by when the presenter said something to the effect, I blog for me, it's not for you, it's for me because I have such a bad memory. :). I am becoming that person too, ugh! I want to be young again!
 
A lot of the questions you are asking are answered very concisely on wikipedia...

Yes, but I need context. I need real-life context. I thoroughly enjoy reading wikis, but it speeds up my mental process when someone puts things into context for me. But, yes, you are correct.
 
1. I saw a post about using an "SFF" something cable with the complementary card that goes with it. Is this card an HBA? (I'm assuming it is). Is this type of card a SAS/SATA/RAID card, or both, or neither, or a mix of both? Maybe once I hammer down what kind of card I want to use, this whole thing will be clearer to me.
SFF8087 is the name of the most common internal SAS port. In my configuration my HBA has two of these ports. Each port has an SFF8087 to 4x SATA cable, called a 'forward breakout' cable.

There are different SFF port numbers / types / cables for external connections.

2. I understand much of this stuff, so please don't brush me off as an idiot, cause I'm not, but I have to ask the stupid questions to get to the good ones. Having said that, the Dell and HP and IBM HBAs you guys mention. Are those SAS, SATA, RAID, both, neither? I understand that you flash them to remove the RAID functionality. I get that (I think). But, are these using the same type of "SFF-" cables I mentioned in my first question.
Part of what makes it extra confusing is that various retail outlets (especially eBay sellers) will mix and match the terminology, and generally speaking they aren't wrong doing so, as there's a lot of compatibility between these.

Strictly speaking, the PERC H310 I'm using is a SAS HBA with RAID capabilities. But SATA drives are completely 100% compatible with that HBA, or any SAS HBA for that matter.

When using 'IR' firmware one can define RAID sets on the card itself. Or leave the disks in JBOD mode.

When using 'IT' firmware there is no longer any RAID capability at the hardware level and all disks will be in JBOD mode.

For hardware/driver reasons it used to be much better to be using an HBA in 'IT' mode but this is less important today than it used to be. You will still often see the recommendation to use 'IT' mode and many people use that config.

As for "don't use SAS expanders" the recommendation I've read is "don't use SATA disks in SAS expanders." That said, using SAS expanders with SAS disks in ZFS pools is totally kosher. The only compatibility question would be between the HBA and the SAS expander.

Edit: Followed and read the link. I see you have actually purchased said drives and they are SATA drives, so that would take SAS expanders out of the story. And you have a case which is doing the backplane power thing, that's good.

With ZFS you can mix motherboard SATA ports in, so a single HBA driving eight disks and a motherboard with 4 SATA ports is a totally valid config. Or, going back to what I said earlier, two HBAs with four ports would drive the 12 disks just fine.

Have you decided on an operating system yet? Where do you expect the OS to reside, in the pool or elsewhere?
 
No offense, but a lot of the questions you are asking are very basic. People provide help here on a voluntary and unpaid basis. I can't speak for anyone else, but I don't have the time or energy to give someone a complete ToK wrt ZFS, SAS, etc...
 
I'm really stepping into something I haven't before.
You've got a great configuration in theory, but you don't really understand it yet, and I think you need to simplify. Build 1 server with 6 disks and make a zpool. Build a second server with 6 disks and build another zpool. Send the contents of zpoolA to zpoolB. Then go get lunch, because you're done. You don't need to worry about all these details that are overwhelming you. Seriously...
 
SFF8087 is the name of the most common internal SAS port. In my configuration my HBA has two of these ports. Each port has an SFF8087 to 4x SATA cable, called a 'forward breakout' cable.

The type of cable / card I saw mentioned was an "8088". What does this numeric scheme refer to? Speed? Type of connection? Grade of connection? "most common SAS internal port" refers to what? Does SAS SIMPLY mean that it's supposed to go to a SAS drive ONLY or (as I think you describe later in this post) will this card / cable combo work 100% with my SATA drives? Sorry if I ask questions you've already answered. I'm going one by one and replying to your answers. Side note - are these SFF cables rated for SATA types? (SATA 300, 600, etc.)?

There are different SFF port numbers / types / cables for external connections.

Even though these cards are for "external" drives, I am assuming there is NO reason to NOT use them inside my case. Can I use these same cards to extend my "reach" (i.e. to put a cable to an external enclosure with more drives) or do I need a DIFFERENT type of card to do that properly?

Strictly speaking, the PERC H310

If I was to look on newegg or some other retail store, what category am I looking for to find this type of controller?

I'm using is a SAS HBA with RAID capabilities. But SATA drives are completely 100% compatible with that HBA, or any SAS HBA for that matter.

There's my answer to the question. Generally speaking, should I look for "SAS controller"? What brands do people usually rely on? I really do want to spend properly - that is, I want to spend money on something of quality, but not so expensive that it prohibits my expansion abilities.

When using 'IR' firmware one can define RAID sets on the card itself. Or leave the disks in JBOD mode.

When using 'IT' firmware there is no longer any RAID capability at the hardware level and all disks will be in JBOD mode.

Are basically all cards able to be put into both of these modes, or is it a process, or is it not possible for some cards versus others?

For hardware/driver reasons it used to be much better to be using an HBA in 'IT' mode but this is less important today than it used to be. You will still often see the recommendation to use 'IT' mode and many people use that config.

I am planning on using that configuration.

As for "don't use SAS expanders" the recommendation I've read is "don't use SATA disks in SAS expanders." That said, using SAS expanders with SAS disks in ZFS pools is totally kosher. The only compatibility question would be between the HBA and the SAS expander.

How would I know a SAS expander when I see one. I'm assuming it will NOT look like a breakout cable?

Edit: Followed and read the link. I see you have actually purchased said drives and they are SATA drives, so that would take SAS expanders out of the story. And you have a case which is doing the backplane power thing, that's good.

Why does it "take it out of the story"? Are SAS expanders not able to connect to a SATA drive?

With ZFS you can mix motherboard SATA ports in, so a single HBA driving eight disks and a motherboard with 4 SATA ports is a totally valid config. Or, going back to what I said earlier, two HBAs with four ports would drive the 12 disks just fine.

By using the max number of card slots on the motherboard for these SFF cards (I still don't know how to refer to these things!), I will maximize my storage capacity. But, after that point, I would need to add some type of expansion card that allows drives outside of the case to be part of the array. If you were building with the parts that you see so far, how would you approach expansion, assuming I will be using outside drives in separate JBOD enclosures?

Have you decided on an operating system yet? Where do you expect the OS to reside, in the pool or elsewhere?

My plan (as of yet) is/was to use a class 10, high speed media like a flash drive or something of high quality to run FreeNAS or some other type of OS that has ZFS (some linux distro) that INCLUDES some type of GUI for the administration of the basic stuff. I'm not afraid of the command line by any means, but I want to be able to look at things quickly and get people files back if they need them from the snapshots without thinking twice about it. I have been so busy trying to do my due diligence on other aspects of this project, I haven't spent the time thinking about this. Could you suggest something more elegant?

Did I mention this was one of the best responses I have gotten so far? Thank you for your reply! :D.
 
The type of cable / card I saw mentioned was an "8088". What does this numeric scheme refer to? Speed? Type of connection? Grade of connection? "most common SAS internal port" refers to what? Does SAS SIMPLY mean that it's supposed to go to a SAS drive ONLY or (as I think you describe later in this post) will this card / cable combo work 100% with my SATA drives? Sorry if I ask questions you've already answered. I'm going one by one and replying to your answers. Side note - are these SFF cables rated for SATA types? (SATA 300, 600, etc.)?
SFF8088 is the external SAS connector, in the same way that an Ethernet port is RJ45. There isn't any speed information baked into the port spec that I'm aware of.

'SFF8087 to 4xSATA forward breakout' cables are rated to SATA3 @ 6 GBit.

Even though these cards are for "external" drives, I am assuming there is NO reason to NOT use them inside my case. Can I use these same cards to extend my "reach" (i.e. to put a cable to an external enclosure with more drives) or do I need a DIFFERENT type of card to do that properly?
The 'proper' way to do it would be to buy HBAs with SFF8087 ports if you plan on using it for internal disks. You could use the external ones but:
- the cables cost a lot more, as they have lots of extra shielding so they can be run through cable management and between servers
- you'd just have to find a way to make those cables go back in and convert them to SFF8087 ports

This CAN be done but it's definitely not a sensible choice here.

If I was to look on newegg or some other retail store, what category am I looking for to find this type of controller?

There's my answer to the question. Generally speaking, should I look for "SAS controller"? What brands do people usually rely on? I really do want to spend properly - that is, I want to spend money on something of quality, but not so expensive that it prohibits my expansion abilities.
LSI 2008 is the chipset for the Dell and IBM cards mentioned so far in this thread. The same chipset is used on various models that have different internal and external port counts and/or RAID related features such as battery backup, cache, etc.

The basic LSI model with two internal SFF8087 ports is the LSI SAS 9211-8i, but with 12 disks you'd need two of them if you wanted every disk attached to an HBA.

There is also a model with four internal SFF8087 ports, the LSI SAS 9201-16i.

Are basically all cards able to be put into both of these modes, or is it a process, or is it not possible for some cards versus others?
The 'IT' vs 'IR' thing is specific to cards with LSI chipsets. And even then not all cards based on these chipsets are 'supposed' to do so. In the case of my Dell I certainly had to go 'outside the box' to put it into 'IT' mode. But if you have an LSI branded card then you certainly can do this with the blessing of the vendor.

How would I know a SAS expander when I see one. I'm assuming it will NOT look like a breakout cable?
A SAS expander would have one or more of those same SFF8087 ports on it. So you'd be plugging the HBA's 8087 port(s) into the same 8087 port(s) on the expander. Obviously this is a different cable than the 'SATA forward breakout' previously discussed.

Looking at the case in your build it is clear that there is not a SAS expander. The ports from the drives in the backplane are all there to be plugged into, all 12 of them.

Why does it "take it out of the story"? Are SAS expanders not able to connect to a SATA drive?
SAS expanders do all sorts of magic with regards to how commands are queued/distributed among disks. It is my understanding that this is an unreliable thing to put SATA disks into the mix with.

By using the max number of card slots on the motherboard for these SFF cards (I still don't know how to refer to these things!), I will maximize my storage capacity. But, after that point, I would need to add some type of expansion card that allows drives outside of the case to be part of the array. If you were building with the parts that you see so far, how would you approach expansion, assuming I will be using outside drives in separate JBOD enclosures?
These are questions I am still researching. So far it seems like the same rules apply - a JBOD that holds 4 SATA drives would go into an SFF8088 port (the external version) to an HBA. Those JBODs with 8 drives seem to have two SFF8088 ports.

My plan (as of yet) is/was to use a class 10, high speed media like a flash drive or something of high quality to run FreeNAS or some other type of OS that has ZFS (some linux distro) that INCLUDES some type of GUI for the administration of the basic stuff. I'm not afraid of the command line by any means, but I want to be able to look at things quickly and get people files back if they need them from the snapshots without thinking twice about it. I have been so busy trying to do my due diligence on other aspects of this project, I haven't spent the time thinking about this. Could you suggest something more elegant?
My friend (who I'm basing my build off of) is using FreeNAS and likes it. Myself, I'm doing it all by hand the first time through with no GUIs or whatever. Once I'm satisfied that I 'get it' I might rebuild it on something with a nice GUI. I just want to be able to fix it once it breaks, and I know I won't be able to unless I build it by hand the first time and break + fix it a lot in the process.

Did I mention this was one of the best responses I have gotten so far? Thank you for your reply! :D.
Welcome. This is all new/fresh to me so hopefully I'm getting it mostly right.
 