RAID SAN/NAS setup help. ZFS or HW? Ready to give up.

bp_968

n00b
Joined: Feb 12, 2013
Messages: 31
The Goal: To have redundant, safe, fast storage for my wife's wedding photography for clients. I would prefer to be able to connect her machine and the SAN/NAS via InfiniBand or 10GbE.

The previous/current setup: 2TB disks spread around between various workstations in the house so that after every wedding the data/photos could be dropped onto at least 2 different drives/locations. Usually within a week or two the data/photos would be backed up to LTO3 tape and Blu-ray disc. This makes file management a slight chore, and it's a little more difficult to ensure proper/safe duplication. It also offers no speed improvements.

The "idea": The original idea was to build a ESX/ZFS all in one box using a LSI sas card I already had and a HP SAS expander I got cheaply and hardware I had laying around (core 2 quad on a ASUS board). Then I read that the advantages ZFS gives are muted without proper ECC ram (pointless was the term I saw). So I started trying to find a decent xeon setup with ECC ram that wasn't to expensive or power hungry. I ended up stumbling on a dual CPU supermicro board that supported 7 PCI-E slots and 12 RAM slots which seemed like a awesome setup to me (150$ board, 50$ L5520 CPUs, and 24GB of RDIMMS on 6 sticks). So I got ESX running and nexenta but it seems nexenta isn't as friendly with the VM nics as I need it to be (its kinda slow) and worse it has a 18TB *RAW* limit meaning a 9TB limit for a RAID10 setup. Nexenta also doesn't seem to play nicely with SAS expanders with SATA disks on the other end of the expander (which IMHO limits the usefulness of an expander quite a lot). Open Indiana is next on the list to try.

But here is where I am at: I'm tired of all of it. I've ended up spending way more than I should have (my fault, obviously) and have come to the conclusion that this setup would/will be so stupidly complex that any problems will require extensive troubleshooting from me, which can be problematic if my illness is acting up. Of course, if it stops access to her files then it messes up her business and our income, so I don't want that either.

I'm back to considering a hardware RAID card so I can just plug and play on a Windows box, share out some directories and go, but after 5 hours of digging tonight I've re-discovered why I originally started looking at ZFS: TLER and other idiotic drive issues.

Help!! I'm very close to going on a reverse eBay spree, clearing the stupid rack out, going back to the single drive/multiple machines + tape + disc solution, and giving up. It stopped being fun a month or two back. :(


(Just as an FYI) The background: I have 15 years of IT experience but have been out of the field for 5 thanks to a disability. So I've used RAID, VMware and Server 03/08/08R2 etc. a fair amount in the past (my work experience stopped with ESX 3.5 and Server 2003).
 
Dunno what happened with your Nexenta there, did you pass the PCI card through?

Persevere, you will get this working well and don't even think about not carrying on; it won't be complex if things go amiss.

With regards to InfiniBand and 10GbE: if you can run an InfiniBand cable to the wife's machine then you can probably get an FC one there just as easily. 4Gb QLogic FC cards are cheap on eBay and give great performance.

My ESXi box easily pulls 400MB/s sequential, and that's over FC to an 8x1TB pool + some L2ARC.

When I did have the Nexenta all-in-one with the ESXi I was pulling about 675MB/s sequential with just 6x1TB in 3 mirrors.

Look at the OpenIndiana + napp-it setup, not all-in-one unless you need it.

If you're getting crap results, make some posts, as you shouldn't be.
 
All-In-One is of interest if you'd like to have an ESXi server with several guests and a decent SAN/NAS solution.
In the past you needed two boxes. With All-In-One you can get them both in one box.
Setup complexity is about the same, but you have high-speed connectivity in software between ESXi and the SAN without IB or 10GbE hardware.

If you only need a NAS solution, build on hardware without ESXi. If you are beyond the NexentaStor 18TB RAW limit,
you can try my napp-it appliance with web UI on OI or OmniOS (my current default OS).

About the expander and SATA:
if you have enough slots, using several HBAs is faster with fewer problems, especially with non-SAS disks
(and mostly cheaper, e.g. with IBM M1015 HBAs).

About ECC:
this is not needed for ZFS but is helpful for every server, especially ones that use the whole RAM as disk cache.
But this is not ZFS-only; every modern OS does the same.

Have you read my howtos about setup and AiO?
http://www.napp-it.org/manuals/index_en.html
 
Do you actually need an ESX all-in-one or just a NAS? What other VMs would be running on this machine? Sounds like all you need is a storage solution so just keep it simple and cut out the virtualisation.

I understand how frustrating it is to try to get something like this set up for the first time. It seems confusing at first but actually makes sense the more you research it and try things out. Just stick with it and keep things simple. For example, if you are having problems with SAS expanders then don't use them. Just use HBAs plugged straight into the drives.

How much storage space is needed?
I was under the impression that the 18TB limit on the Nexenta Community Edition is 18TB of usable space, not 18TB-worth of drives regardless of configuration? Can anyone confirm this? If that's right then, no, you wouldn't be down to 9TB of usable space after creating a bunch of mirrors. But I don't really understand why you would use mirrors in this scenario anyway - surely storing wedding photographs doesn't require that level of performance?
 
Can I ask a question?

Respectfully, it seems like the setups you're looking at are more than most SME's would have, so does your wife have some extreme performance requirements or are you coming at this from simply wanting to do something geeky?

Basically what I'm asking is are you looking for the simplest solution for the business or are you deliberately trying to do something complicated as a side-project?

Either way what's the storage requirement and budget?
 
Can I ask a question?

Respectfully, it seems like the setups you're looking at are more than most SME's would have, so does your wife have some extreme performance requirements or are you coming at this from simply wanting to do something geeky?

Basically what I'm asking is are you looking for the simplest solution for the business or are you deliberately trying to do something complicated as a side-project?

Either way what's the storage requirement and budget?

Most video editing setups I've seen do seem quite demanding simply because of the amount of data you are shifting about.
 
It sounds to me like you'd be better off with a commercial nas or two, one for "production" and one for backup.

DIY ZFS boxes are complex things. They're not for everyone. They're great if you're already familiar with unix and server hardware or if you want to geek out and learn as you go (and occasionally fail along the way) but they're not the best solution when you simply want a bunch of easy, reliable storage with minimal fuss.
 
It sounds to me like you'd be better off with a commercial nas or two, one for "production" and one for backup.

DIY ZFS boxes are complex things. They're not for everyone. They're great if you're already familiar with unix and server hardware or if you want to geek out and learn as you go (and occasionally fail along the way) but they're not the best solution when you simply want a bunch of easy, reliable storage with minimal fuss.

Absolutely not, ZFS setups are not some sort of witchcraft and are fairly easy to set up. Granted, they are often a new technology that uses a different OS than most people are using.

That's no reason to chuck in the towel and buy a black box that you know even less about and have absolutely no way of fixing should something go amiss.
 
Easy to set up is not the same as easy to maintain though. If you buy a prosumer or professional NAS, or even just use hardware RAID, you usually have the benefit of alerting for things like failed drives, and can usually replace them live if they're hot swappable.

What do you do on a ZFS box when a hard drive fails? Because my understanding is that it's not quite as simple as looking for the one with the flashing light and ripping it out and replacing it. I believe you usually have to get a little "involved" with whichever distro you're using, which IMO is something to consider, especially if the OP's wife is expected to do it if he isn't around for some reason.

Happy to be corrected if it is as simple as pulling and replacing a failed drive.
 
A failed drive is fairly easy to identify and replace, and I don't think I've come across a ZFS storage distro yet that doesn't have a built-in facility to mail you in case of problems.

I've successfully moved physical hard drives many times between different hardware AND different distributions and just imported the pool without problems.
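
The whole move is basically just an export on the old box and an import on the new one (pool name is an example):

# on the old box
zpool export tank
# on the new box: list importable pools, then import by name
zpool import
zpool import tank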
 
How do you define "fairly easy"?

I've never really played with ZFS so I'll be bp968's wife for this one, let's assume he's not available :) I wake up one morning and in the middle of the night a drive's failed - what steps are involved?
 
Here's an example on Solaris, pure command line:

zpool offline tank c1t3d0
cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
cfgadm -c configure sata1/3
cfgadm | grep sata1/3
sata1/3::dsk/c1t3d0 disk connected configured ok
zpool online tank c1t3d0
zpool replace tank c1t3d0

Most have GUIs that will automate this.

The drive, in this case c1t3d0, you would have known you needed to change from a 'zpool status tank'.

That will also usually be emailed in a failure situation.
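
To give a feel for it, a dead disk in a mirror shows up in 'zpool status' output something like this (pool and disk names are just examples, output trimmed):

  pool: tank
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  UNAVAIL      0     0     0  cannot open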
 
How do you define "fairly easy"?

I've never really played with ZFS so I'll be bp968's wife for this one, let's assume he's not available :) I wake up one morning and in the middle of the night a drive's failed - what steps are involved?

That depends not on ZFS but on the distribution or appliance software you are using.
Mostly you have a hot spare that jumps in automatically.

With a software appliance, you
- get an email saying that a disk, identified either as e.g. c0t2d0 (disk 2 on controller 0) or by a worldwide unique WWN (like a serial number), has failed
- hot-unplug the faulted disk, hot-plug in a new one
- start the web-management software, open the disk menu and start a disk replace (select faulted -> new)
(the disk replace may start automatically if you have set autoreplace=on)
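
On a plain CLI setup the equivalent knobs would be roughly (pool/disk names are only examples):

# add a hot spare to the pool
zpool add tank spare c0t8d0
# start the replace automatically when a new disk shows up in the slot
zpool set autoreplace=on tank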
 
I have to add that I've been using Linux now for well over 10 years. I've installed FreeBSD a handful of times, and until about 6 months ago had never touched any form of Solaris-based distro or ZFS.

I've used all sorts of filesystems, hardware and software RAID, etc. I find ZFS the easiest, and it looks to be the best in terms of data security of anything I have come across so far.

Even at a CLI level, granted the commands can get complex, there are only two: zpool and zfs. That's it.
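
As a rough example of what day-to-day use looks like (pool/filesystem names are made up):

# pool health and a full checksum verification pass
zpool status tank
zpool scrub tank
# filesystems, space usage and a snapshot
zfs list
zfs snapshot tank/photos@2013-05-01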
 
Here's an example on Solaris, pure command line:

zpool offline tank c1t3d0
cfgadm | grep c1t3d0
sata1/3::dsk/c1t3d0 disk connected configured ok
cfgadm -c unconfigure sata1/3
Unconfigure the device at: /devices/pci@0,0/pci1022,7458@2/pci11ab,11ab@1:3
This operation will suspend activity on the SATA device
Continue (yes/no)? yes
cfgadm | grep sata1/3
sata1/3 disk connected unconfigured ok
<Physically replace the failed disk c1t3d0>
cfgadm -c configure sata1/3
cfgadm | grep sata1/3
sata1/3::dsk/c1t3d0 disk connected configured ok
zpool online tank c1t3d0
zpool replace tank c1t3d0

Most have GUIs that will automate this.

The drive, in this case c1t3d0 you would have known you needed to change by a 'zpool status tank'

That will also usually be emailed in a failure situation.

This procedure is correct but overcomplicated.
If your hardware does not support hotplug: power off and replace the disk.
If your hardware supports hotplug: hot unplug/plug the disks
(the unplugged disk stays in state offline until reboot, but that's fine).
 
How do you define "fairly easy"?

I've never really played with ZFS so I'll be bp968's wife for this one, let's assume he's not available :) I wake up one morning and in the middle of the night a drive's failed - what steps are involved?

No step needed since your husband has set up a hot spare.
 
I'm familiar with ESX and enjoy using it, so I was hoping to keep the flexibility of it. Plus I can't possibly justify to myself having 2 quad-core HT Xeons (16 "cores") for just running the NAS/SAN. Talk about watt-waste ;)

I've mostly frustrated myself by not staying focused and jumping into too many different things at once (10GbE, fiber, FC, and IB). I'll be sticking with the things I have working at the moment (the IB) and focusing on the meat of the primary issue (the ZFS server).

What would you suggest other than mirrors (or really striped mirrors in this case)?
I am using RAID10 for redundancy. RAID5/Z1 or RAID6/Z2 are just not acceptable options in my opinion with current SATA drive sizes and error rates. Based on experience and data I've seen on the net, RAID5 is a larger risk than a single drive. That may seem wrong to some people since it "can" sustain a drive loss; the problem is the rebuild. With 6 single 2TB drives you're risking a loss of 2TB of data, or 1/6th of your data set, if you have a full drive failure. With RAID5 (or even 6) you're risking your entire data set even without a drive failure (simply dropping a drive and then failing a rebuild due to a URE could kill your entire data set, without ever having an actual drive failure).

To clarify the level of paranoia going into how I handle the data, here is how I have it planned (and this is similar to how it is now): She returns home from the job with all the CF cards (all of her cameras use mirrored storage, so her data starts life mirrored in the camera). She offloads those CF cards onto her machine's local drive. Then she, I, or the PC copy the data down to the server (which in the new build will be a RAID10 set, rather than single non-redundant drives). Once she is done editing the files (usually a week) I'll burn them to Blu-ray (35-60GB per wedding) and she will upload all the processed JPEGs to SmugMug (which retains a copy as long as we have service with them). Then, once I get 400GB worth of photos on the server, I'll shoot them off onto an LTO3 tape for archiving.

So at its height the data will be on a local workstation, on the server, on a Blu-ray, on an LTO tape, in the cloud, and (for a week or two) on CF cards. Usually within a few months the customer will also get a DVD full of the processed JPGs (and technically backup is their responsibility at that point, though I keep them all anyway). After a couple of years, or as things fill up, the oldest files will be removed from "online" storage (workstation and server disks) but left on the Blu-rays, LTOs, and "cloud", and of course the customer's copy(ies).


To be perfectly honest I can't see using parity RAID for anything at this point. If you have a media collection then it's probably not backed up (no one seems to back them up, and instead tries to use RAID as a substitute backup), so all you're doing is risking the entire collection to a single failure instead of only risking a subset. Like I said, a rebuild failure takes your whole collection with it, while a drive failure on a 2TB drive in a 20TB array only takes 10% of your movie collection with it. It might be different if you need to support huge files or databases and needed large volumes (like maybe VMDK stuff as well). But for most HTPCs I've been pushing people to buy a couple of large HDDs and use a software mirroring tool like rsync or similar to have a safe backup.
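
Something like this run nightly is all it really takes (paths are made up for the example):

# nightly one-way copy from the primary drive to the backup drive
# --delete makes the backup an exact mirror (deletions propagate too)
rsync -a --delete /mnt/media/ /mnt/backup/media/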

Do you actually need an ESX all-in-one or just a NAS? What other VMs would be running on this machine? Sounds like all you need is a storage solution so just keep it simple and cut out the virtualisation.

I understand how frustrating it is to try to get something like this set up for the first time. It seems confusing at first but actually makes sense the more you research it and try things out. Just stick with it and keep things simple. For example, if you are having problems with SAS expanders then don't use them. Just use HBAs plugged straight into the drives.

How much storage space is needed?
I was under the impression that the 18TB limit on the Nexenta Community Edition is 18TB of usable space, not 18TB-worth of drives regardless of configuration? Can anyone confirm this? If that's right then, no, you wouldn't be down to 9TB of usable space after creating a bunch of mirrors. But I don't really understand why you would use mirrors in this scenario anyway - surely storing wedding photographs doesn't require that level of performance?

Thanks to everyone who commented and offered to help!
 
Did you get the ESX passthrough and all that stuff working?

I just did nearly the same setup (except I'm using Seagate nearline SAS drives).

My setup:

OmniOS (instead of OI)

Once I installed ESXi 5.1, all I had to do was:

Create local storage on one controller for the OmniOS VM OS disk

Go to Config/Advanced for ESX and pass through the LSI 2008

Create the OmniOS VM and add the PCI Device -> LSI 2008

Once installed, added VMware Tools (so I could get access to the vmxnet3 NIC/driver)

Installed napp-it (super easy)

(I avoided Nexenta CE for this box due to the 18TB raw limitation + EULA issue (not for biz, etc.))

Regarding your drive pool Q:

I would still go for a raidz1 or raidz2 over mirrors because:

- the storage is mainly for archiving
- lower IOPS should not be a huge issue

Yes, resilvering will take a while, and yes, it's possible another drive could die during the process. However, I would have a second pool in the box to back up the primary pool, so in the event the main pool is lost, you're covered.
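
e.g. roughly something like this to keep the backup pool in sync (pool/snapshot names are just examples):

# first full copy of the primary pool to the backup pool
zfs snapshot -r tank@rep1
zfs send -R tank@rep1 | zfs receive -F backup/tank
# later runs only send the changes since the previous snapshot
zfs snapshot -r tank@rep2
zfs send -R -i tank@rep1 tank@rep2 | zfs receive -F backup/tank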

I would also have hot spares, so it just handles it automatically.

IB also seems too much of a PITA to me. 10GbE is expensive too; I would just pick up a managed switch and a 4-port Intel 1Gb NIC and team the 4 ports together (though you would also have to team the NICs on your wife's workstation).

Keep at it, this community is extremely helpful.
 
Yes, ESX passthrough was pretty easy, np.

Why use Z1 and an HS when you can use Z2? Why use Z2 and an HS when you can use Z3? And (and this is where I am at) why use parity RAID at all when RAID10 is faster, safer, and just plain better? With 8 drives in RAID10 I lose 4. With 8 drives in RAIDz2 plus an HS, or RAIDz3, I lose 3 and take a write penalty, *and* am forced to do parity calculations to resilver, or to "keep" my data. If a drive fails in the RAID10, replacing it is as simple as the array copying over data from the other mirror, which is quite fast even on today's very large drives.

I'm especially surprised to hear you suggest 2 RAID5 pools with an HS. In my case (which I assume you're using for your numbers) that would be 2 pools of 3+1 each, and then I'd have to RDM a 9th drive in there off the main board SATA ports. Let's ignore the RDM thing and say I only have 8 drives (since I really only have 8): then I have a 3+1 pool and a 2+1 pool and an HS, or a 2+1 and a 2+1 and 2 HS. Option 1 ends up maximizing drive space but could become a pain should I end up with more data than 2+1 fits. Option 2 isn't any more efficient than RAID10 (you only "get" 4 drives of space). So based on my drives' URE rate, my personal interest in redundancy over maximum space, and the fact that RAID10 also gives me more performance as well, I still believe RAID10 is the best choice.


Considering the route the data takes to get to the RAID10 array, I feel pretty safe with RAID10. If I were having to build something that protected the data in its entirety (in other words, if the workstation+cloud wasn't in the picture) I'd build two RAID10s on two different servers and mirror the data between them (which is in essence how many of the major players work in the cloud storage space). That will probably eventually happen anyway (I'd like to snag an ECC-capable Atom board and build a ZFS storage box on that at some point). Preferably with 2.5" drives. Someday. :)

I'm curious, how many people who build very large RAID5 (or even 6) arrays have ever had to rebuild one full of data? Especially the people who use RAID as a backup solution and have no backup, just their giant array. In the field (years ago) we had to show up and replace some drives on an HP array that had a failed disk at a bank. The HS had jumped in and started "fixing" the problem before we arrived. Before we arrived (a 4hr window, I believe) the resilver killed another drive and the array died. No big deal, we just got to wait around on a backup as well. These were U320 drives back "in the day", so not cheap stuff. Another time (much more recently) a friend had a drive drop on his 5-disk RAID5. It errored out during the resilver and he lost his data. His entire reason for the RAID5 was to "protect" his data, but it's my strong opinion that he only endangered it further by using parity RAID rather than backup (preferable) or mirroring.

People usually won't think past "# of drive failures it can sustain" when the actual problem goes far beyond simple hardware failure. Many different things have to come together for a parity RAID set to survive a "blip" or an actual failure. Any one of those things going wrong will kill the whole data set. Of course, if you're using nearline SAS drives (like Steve here is) then you can do almost whatever you want, since they likely have a URE rate of 1 in 10^16 vs my drives' 1 in 10^14, and then UREs probably aren't something you should worry about. But if you have to use expensive drives to make it safe, then RAID5 ends up becoming a false economy IMO.

This is a pretty good article covering these points (though his URE "exposure" numbers might be a little off, especially with a filesystem like ZFS)

http://www.standalone-sysadmin.com/blog/2012/08/i-come-not-to-praise-raid-5/

Thanks for all the tips so far!

Did you get the ESX passthrough and all that stuff working?

I just did nearly the same setup (except I'm using Seagate nearline SAS drives).

My setup:

OmniOS (instead of OI)

Once I installed ESXi 5.1, all I had to do was:

Create local storage on one controller for the OmniOS VM OS disk

Go to Config/Advanced for ESX and pass through the LSI 2008

Create the OmniOS VM and add the PCI Device -> LSI 2008

Once installed, added VMware Tools (so I could get access to the vmxnet3 NIC/driver)

Installed napp-it (super easy)

(I avoided Nexenta CE for this box due to the 18TB raw limitation + EULA issue (not for biz, etc.))

Regarding your drive pool Q:

I would still go for a raidz1 or raidz2 over mirrors because:

- the storage is mainly for archiving
- lower IOPS should not be a huge issue

Yes, resilvering will take a while, and yes, it's possible another drive could die during the process. However, I would have a second pool in the box to back up the primary pool, so in the event the main pool is lost, you're covered.

I would also have hot spares, so it just handles it automatically.

IB also seems too much of a PITA to me. 10GbE is expensive too; I would just pick up a managed switch and a 4-port Intel 1Gb NIC and team the 4 ports together (though you would also have to team the NICs on your wife's workstation).

Keep at it, this community is extremely helpful.
 
Yeah, your calcs make sense for your use case. For me it's different: the OmniOS/napp-it box is my backup "disaster recovery" system (should the primary HA cluster fail), so the only traffic is coming from the other ZFS systems via zfs send/recv.

But even on the primary, I'm still using raidz1 for the majority of disk pools because they are read-centric app servers (DB servers are another story), so 95% of the traffic is reads hitting the ARC/L2ARC and not even getting to the spindles. Plus I use 450GB 10k SAS disks, so resilver is much faster (though honestly, I haven't had one fail yet, but in other arrays this has been hours, not days).

As for a mini ZFS server, pick up an HP MicroServer N40L; mine came with ECC memory installed. You can still get one pretty cheap. Then pick up a 5.25" drive bay adapter that holds 4 2.5" drives, and that gives you 8 bays total in a nice small form factor.
 
Can I ask a question?

Respectfully, it seems like the setups you're looking at are more than most SME's would have, so does your wife have some extreme performance requirements or are you coming at this from simply wanting to do something geeky?

Basically what I'm asking is are you looking for the simplest solution for the business or are you deliberately trying to do something complicated as a side-project?

Either way what's the storage requirement and budget?

No, you're mostly right in some aspects. The redundancy and data protection requirements are probably over the top, but certainly not out of line. The high-speed 10GbE stuff is just mostly for fun. Yes, I could have (and still can) set up LAG, but Windows isn't the best at LAG (it won't use it all as one big fat pipe, for example), though maybe the Intel-specific drivers are an improvement. Honestly the IB works great, I just hate the cables. I'll probably stick with that, especially since point-to-point can be 20Gb/s with hardware I already have that I spent $20-25 on (per card).

At this point it looks like most of the complexity has been worked out. It just looks like SAS expanders are a no-no (so I'll have a nice HP one for sale soon).
 
Yeah, your calcs make sense for your use case. For me it's different: the OmniOS/napp-it box is my backup "disaster recovery" system (should the primary HA cluster fail), so the only traffic is coming from the other ZFS systems via zfs send/recv.

But even on the primary, I'm still using raidz1 for the majority of disk pools because they are read-centric app servers (DB servers are another story), so 95% of the traffic is reads hitting the ARC/L2ARC and not even getting to the spindles. Plus I use 450GB 10k SAS disks, so resilver is much faster (though honestly, I haven't had one fail yet, but in other arrays this has been hours, not days).

As for a mini ZFS server, pick up an HP MicroServer N40L; mine came with ECC memory installed. You can still get one pretty cheap. Then pick up a 5.25" drive bay adapter that holds 4 2.5" drives, and that gives you 8 bays total in a nice small form factor.

I actually have a 42U rack in a storage room in the basement (right below her office, which works out wonderfully, I might add). I got it as a freebie while I was still working in the field. I ended up ignoring the HP MicroServer because it's not rackmount. :)
 
It is actually very easy to set up a ZFS server, and very easy to administer compared to other solutions. Also, Solaris distros have built-in CIFS, which is 100% compatible with Windows. Samba is not 100% compatible with Windows, only 99% - at least this was true a couple of years ago.

You should try this guide by Gea_ on how to set up an ESXi ZFS server. Lots of people are doing that with great success:
http://hardforum.com/showthread.php?t=1573272

If you care about your data, you should use ECC RAM.

LSI cards are very good for ZFS. You should use raidz2 or raidz3, and never mind the hot spare. It is better to use raidz3 than raidz2 + hot spare, because in raidz3 the third disk is already fully operational.
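
For example, an 8-disk raidz3 pool plus a CIFS share is only a couple of commands (pool/disk names are just examples):

# 8 disks, any three of which can fail
zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0
# share a filesystem over the built-in CIFS/SMB server
zfs create -o sharesmb=on tank/photos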
 
...

Why use Z1 and an HS when you can use Z2? Why use Z2 and an HS when you can use Z3? And (and this is where I am at) why use parity RAID at all when RAID10 is faster, safer, and just plain better? With 8 drives in RAID10 I lose 4. With 8 drives in RAIDz2 plus an HS, or RAIDz3, I lose 3 and take a write penalty, *and* am forced to do parity calculations to resilver, or to "keep" my data. If a drive fails in the RAID10, replacing it is as simple as the array copying over data from the other mirror, which is quite fast even on today's very large drives.

...

It depends on luck and how your RAID-10 is configured as to whether it would be able to survive four drive failures out of a total of eight.

RAID 10 (a.k.a. RAID 1+0) is a stripe across sets of mirrors - i.e. group the drives together into mirrors, and then stripe the data across those mirrors.
The number of drive failures it can sustain depends on the number of drives in each mirror. The stripe will continue to function as long as at least one drive survives in each of the mirrors (i.e. every mirror must survive).


It's possible (for example in ZFS) to stripe together mirrors and arrays of different sizes, but typically you would have even numbers of drives in each mirror.
So in the case of 8 drives, you could either stripe across:
Two mirrors, each consisting of 4 drives, or
Four mirrors, each consisting of 2 drives.


If you went for the first option - two mirrors - you could sustain the loss of up to six of the eight drives, because you could lose 3 drives from each of the 4-drive mirrors.

Great if you're paranoid, but you would be unlikely to go for such a configuration because you only get to use the combined capacity of two out of the eight drives, which is very wasteful.


If you went for the second option - four mirrors - you could sustain the loss of up to four of the eight drives, but, since each mirror consists of only 2 drives, each failed drive would have to come from a different mirror.

In the case of four failures out of eight drives, that's very unlikely to happen. It's analogous to picking a random number between 1 and 8, four times, and getting the result: 1, 2, 3, 4.

It's far more likely that two would happen to be from the same mirror. In that situation your entire stripe has failed because one of the mirrors has failed.
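
For what it's worth, in zpool terms the two layouts would be created roughly like this (device names are placeholders):

# option 1: two 4-way mirrors (usable capacity of 2 drives)
zpool create tank mirror c0t0d0 c0t1d0 c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0 c0t6d0 c0t7d0
# option 2: four 2-way mirrors (usable capacity of 4 drives)
zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0 mirror c0t6d0 c0t7d0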


In my opinion, if you are concerned about drive failures and rebuild times, you should opt for a RAID-Z2 or Z3 configuration and/or include spares.
 
Do you have field experience with striped mirrors (RAID10) failing? Now, I'll agree that ZFS handles data differently than your typical hardware RAID, but it's a complete fallacy to believe your only data risk in redundant drive arrays comes from actual hardware disk failure, specifically in the case of parity RAID. Sure, losing exactly the wrong 4 disks in a RAID10 is pretty unlikely (like near zero), but losing those 4 disks at all out of 8 is also extremely unlikely. In fact, losing 2 of those drives is well on the far side of unlikely. The biggest factor, though, is that even if I lost two of them in the same mirror my data is still safe (since the array itself isn't the sole repository of that data). The data is on the array, AND in another spot, AND backed up.

I've never once in my IT life (15 yrs) of working in various large companies (some with 4-5 million dollars in EMC equipment) seen a RAID10 array fail from multiple drive failures. I *have* seen plenty (3) of parity RAID sets fail from failed rebuilds (yes, even with hot spares). And we go to backups at that point, or more recently the RAID set is part of a clustered RAID set and the other set of drives handles things while the new drives do their thing (in which case parity RAID is fine, since you're basically mirroring two RAID sets). Sure, all the failures I've seen were RAID5, but they were also with enterprise-level drives too.

I'm not arguing your competency (your description is right on the money), I'm just saying I believe people in general look at data security on RAID systems through far too narrow a lens and only "see" drive failure as a possible point of failure, and that simply isn't the case. Whenever you add a requirement to the system you're adding a point of failure. When I require a parity calculation to recover my data, then I am now at risk if the parity calculation can't be made (for whatever reason), so technically I can have a drive fail, then have a bad sector, and then lose the array.

Now, ZFS has strong data protection ability. RAIDz3 is not RAID5 by any stretch. But, and here is where my point comes in, if I have 8 drives and my options are Z3 or RAID10 I get 5 drives of data (10TB) or 4 drives of data (8TB) between the two options. RAID10 has no parity overhead, no write penalties, and will rebuild from failures extremely fast. Why give up the performance gain, quicker rebuild, and lower complexity for 2TB? Considering the layout I'm using I'm not short on storage (or redundancy). I might agree with you about Z2-Z3 if it was a 15-20 drive set. But I'd never build a single array from that many drives either.

It's also important to note that, unlike the typical business usage for RAID, my setup (like most home setups) isn't business critical. Sure, if it's down it's a pain, but it doesn't bring down some important business function. She still has access to her photos on her local machine (or wherever the second backup set is). We (in these forums) typically use RAID well outside its typical business design/function, which is to maintain uptime. In fact, to be truthful, it's being used for "backup" in this case (just not the sole backup, like is often the scenario). Most of the time I see people using RAID as a way to get a giant single volume and to "back up" that volume at the same time (terrible idea).

EDIT: I re-read your post and wanted to point out that I'll be using 4 mirror sets, so a pretty typical RAID10 "style" setup (since I know ZFS doesn't really use the term RAID10). I agree with you that even my paranoia doesn't go up to striping 4-way mirrored sets. At that point you're far better off with a cluster of two RAID10s between two systems, which seems to be a common type of setup in bigger companies now.

If anything it's the Blu-rays and LTOs that scare me the most. I'll be pulling the LTOs as soon as I get the RAID rolling just to double-check that they are still "good". LTO is pretty darn nice tech, but me and tapes have never had a very good relationship in my IT "life". lol

It depends on luck and how your RAID-10 is configured as to whether it would be able to survive four drive failures out of a total of eight.

RAID 10 (a.k.a. RAID 1+0) is a stripe across sets of mirrors - i.e. group the drives together into mirrors, and then stripe the data across those mirrors.
The number of drive failures it can sustain depends on the number of drives in each mirror. The stripe will continue to function as long as at least one drive survives in each of the mirrors (i.e. every mirror must survive).


It's possible (for example in ZFS) to stripe together mirrors and arrays of different sizes, but typically you would have even numbers of drives in each mirror.
So in the case of 8 drives, you could either stripe across:
Two mirrors, each consisting of 4 drives, or
Four mirrors, each consisting of 2 drives.


If you went for the first option - two mirrors - you could sustain the loss of up to six of the eight drives, because you could lose 3 drives from each of the 4-drive mirrors.

Great if you're paranoid, but you would be unlikely to go for such a configuration because you only get to use the combined capacity of two out of the eight drives, which is very wasteful.


If you went for the second option - four mirrors - you could sustain the loss of up to four of the eight drives, but, since each mirror consists of only 2 drives, each failed drive would have to come from a different mirror.

In the case of four failures out of eight drives, that's very unlikely to happen. It's analogous to picking a random number between 1 and 8, four times, and getting the result: 1, 2, 3, 4.

It's far more likely that two would happen to be from the same mirror. In that situation your entire stripe has failed because one of the mirrors has failed.


In my opinion, if you are concerned about drive failures and rebuild times, you should opt for a RAID-Z2 or Z3 configuration and/or include spares.
 
Oh, one last thing that made me lean towards RAID10 (mirrored/striped) in ZFS: growing the array seems to be *much* easier than with parity setups. Hopefully the illumos team (or whoever) ends up adding some of the nice features we are seeing in the hardware RAID world to ZFS (like growing parity RAIDs a disk at a time).
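
e.g. it should just be a matter of tacking another mirror pair onto the pool (device names made up):

# grow the pool by one more mirror vdev
zpool add tank mirror c0t8d0 c0t9d0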
 
Sorry, but I'm a bit confused by this thread and I'm not going to contribute any further.

You ask for our input, but then dispute any alternative suggestions because you've already made up your mind that RAID-10 is the best.
You're not sure how to set up a small server for your wife's photos, yet you have 15 years of experience with multi-million dollar business setups...?

You make a lot of valid points but you're coming across rather defensively and I've lost track of what you're actually asking; What do you want to get out of this thread?

I was just giving you some information and telling you what I would recommend in your scenario. Take it or leave it.
 
I should probably start a new topic but I have a question: Does ZFS have any controls/features for managing data based on age/usage? Basically I would like to have disk "costs" or something similar and have it keep the data on the fastest disks/array first and then "slide" it off to slower/cheaper disks as it gets used less. Would the L2ARC/ZIL handle something like this?

To be more specific I have 8 300GB 10,000RPM 2.5" Raptor drives I'd like to leverage like that (or just sell if i can't). Second, would you suggest mirrored/striped or Z1/2/3 for those drives (where speed is probably the number 1 priority). I figured (though it hurts) RAID10 as well for those. Or RAID0 and nightly Rsync/backup to the 8TB array.
 
I wasn't being defensive, sorry if it sounded that way. My original question was about the all-in-one setups and HW vs ZFS storage solutions, not the RAID type. It's possible I did ask about RAID types in one of my posts (I'm long-winded) though, so sorry if things got confused.

I *do* have lots of experience in the field, but it's also over 5 years old at this point (I'm disabled and no longer working in IT) and usually with single-vendor solutions and not your typical "home" setup. I also have *zero* experience with Solaris and ZFS.

My main question in the original post was a (slightly whiny) complaint/question about the reliability of all-in-one/ZFS solutions. Basically I had planned to do ZFS + expander, and some recent(ish) posts from Nexenta kind of shot that down (when used with cheap SATA drives).

I do have another question (just above your last post) that I'd love your input on. Specifically related to ZFS at this point.

Thanks again for the info/help.

Sorry, but I'm a bit confused by this thread and I'm not going to contribute any further.

You ask for our input, but then dispute any alternative suggestions because you've already made up your mind that RAID-10 is the best.
You're not sure how to set up a small server for your wife's photos, yet you have 15 years of experience with multi-million dollar business setups...?

You make a lot of valid points but you're coming across rather defensively and I've lost track of what you're actually asking; What do you want to get out of this thread?

I was just giving you some information and telling you what I would recommend in your scenario. Take it or leave it.
 
I should probably start a new topic but I have a question: Does ZFS have any controls/features for managing data based on age/usage? Basically I would like to have disk "costs" or something similar and have it keep the data on the fastest disks/array first and then "slide" it off to slower/cheaper disks as it gets used less. Would the L2ARC/ZIL handle something like this?

To be more specific I have 8 300GB 10,000RPM 2.5" Raptor drives I'd like to leverage like that (or just sell if i can't). Second, would you suggest mirrored/striped or Z1/2/3 for those drives (where speed is probably the number 1 priority). I figured (though it hurts) RAID10 as well for those. Or RAID0 and nightly Rsync/backup to the 8TB array.
Yes, this is exactly what L2ARC does. L2ARC will cache hot data automatically. If you are building a streaming media server, then L2ARC will not help. But if you are heavily accessing the same data every time, then L2ARC will certainly help.

You add the L2ARC on the fly to the ZFS pool, and you can remove the L2ARC without problems.
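
For example (device name is just an example):

# add an L2ARC (cache) device on the fly, and remove it again later if you want
zpool add tank cache c2t0d0
zpool remove tank c2t0d0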
 
Thanks for all the help, I was literally about 4 hours of work from being done. It's running perfectly now. I'm making a copy of the set-up OI + napp-it VM and then it should be good to go. Getting about 300-350MB/s over SDR InfiniBand using SMB.

I'll have to look over things tomorrow, but can the same pool/block of space be used for an SMB share *and* an iSCSI target, or do you have to define a specific amount of space for the iSCSI target? (I'd like to let the ESXi box store VMs on that array.)

I also gave the OI VM 2vCPUs and 12GB of RAM. I figure I have 24GB in the box so leaving 12GB free should be plenty. I'd increase it if people say it would help.

The 2nd 8-port SAS card should be here soon, and then both pools can come online (the 8 2TBs and the 8 300GB 10ks).
 
It's as if you guys are speaking another language :) SDR infinibadning the SMB OI VM 2vCPU SAS ESXI that that shit on the L2ARC :p

yea I have no idea what I just read.
 
It's as if you guys are speaking another language :) SDR infinibadning the SMB OI VM 2vCPU SAS ESXI that that shit on the L2ARC :p

yea I have no idea what I just read.

I kinda felt the same way a few years ago. So many new techs.

InfiniBand is a really fast networking tech; think Ethernet or Fibre Channel (sorta). The different speeds are named SDR, DDR, QDR, FDR, EDR (10, 20, 40, 56, 100Gb/s per 4x link). You can snag the SDR and DDR stuff really cheap now. Great way to get some serious bandwidth if you need/want it.
 
I'll have to look over things tommorow but can the same pool/block of space be used for a SMB share *and* an iSCSI target or do you have to define a specific amount of space for the iSCSI target? (I'd like to let the ESXi box store VMs on that array).

For thin-provisioned iSCSI targets, I would add a reservation to ensure space.
But for All-In-One solutions with a virtualized SAN, I would prefer NFS as the ESXi datastore.

It comes with two main advantages:
- auto-reconnect after reboot (ESXi waits long enough for NFS datastores to come up from a SAN VM, whereas iSCSI must be reconnected manually)
- accessibility (NFS or SMB) for easy snapshot/clone/move/backup of VMs from Windows, including Previous Versions for snapshots
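
Roughly, on the CLI side that would look like this (names and sizes are examples; the zvol would then be handed to COMSTAR/napp-it as the iSCSI LU):

# NFS-shared filesystem to use as the ESXi datastore
zfs create -o sharenfs=on tank/vmstore
# thin-provisioned iSCSI zvol with a reservation so other data cannot eat its space
zfs create -s -V 500G tank/vmluns
zfs set refreservation=200G tank/vmluns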
 
Stop the madness!

Outsource storage and be done with it. Amazon S3 does it cheaper and more reliably than you will ever be able to do it.

It's all fun and games to monkey around with lab boxes around the home, but when production storage needs to be backed up by semi-pros (not corporate users that can run their own storage) the solution is to outsource it to the experts and save yourself the cost, headache, and crying yourself to sleep over storing this amount of important data yourself.

Seriously, Amazon S3 storage is so dirt cheap that you have to be crazy to not use it for this particular purpose.
 
Stop the madness!

Outsource storage and be done with it. Amazon S3 does it cheaper and more reliably than you will ever be able to do it.

It's all fun and games to monkey around with lab boxes around the home, but when production storage needs to be backed up by semi-pros (not corporate users that can run their own storage) the solution is to outsource it to the experts and save yourself the cost, headache, and crying yourself to sleep over storing this amount of important data yourself.

Seriously, Amazon S3 storage is so dirt cheap that you have to be crazy to not use it for this particular purpose.

Sooo... You want me to spend $960 a year, every year (in other words, growing by $800-900 per year), to archive stable (unchanging) files I can quad-archive (4x copies, 3x locations) myself for a fraction of that? Let's look at it this way:

For 4x copies I'm spending no more than $300 a TB, but we will call it $500 per TB. Let's say those copies have a 48-month shelf life before all 4 copies need replacing (way off the mark, but we are being harsh here). Then that's $10.41 per month per TB, or about 1 cent per month per GB. At home I have a 10Gb/s network, so I can back up those files at 80-400MB/s depending on where they are going. Best case, I can send to S3 at around 300KB/s.

So the breakdown:

S3: $960 per year, per TB. 30 days of constant uploading per TB (on a 30Mb/3Mb connection).
Mine: $125 per year, per TB. Uploads 1TB of data in about 2 hours.

(I added power costs in there, but for the curious: running 350W worth of server(s)/switch(es), which is what mine is at with the 20-drive server, 1 PowerConnect 2716, 1 DXS-3227 24-port GbE switch, and 1 Voltaire 9024 10Gb/s InfiniBand switch, is $20 a month at 24/7 on-time.)

Data growth is about 1TB a year. S3 would be $960 for the first year, $1,800 or so for the second, $2,600 for the third, $3,400 for the fourth, making it $9,000-10,000 worth of storage over a 4-year period.

For $10,000 I can buy an EMC VNXe with a 3-year parts replacement warranty (10-12TB in size, redundant), or for about half of that you could buy a Dell PowerVault MD3000 SAN (again with a 3-year warranty) with 15 1TB SATA drives ($6,000), or $5,000 will buy you a DAS with 15 2TB SAS drives (nice and reliable). You could easily hire out a local IT place for a pretty cheap price to maintain the stuff.

Or if all you're doing is archiving (like I am), you could do it with a NetApp or Drobo NAS box really cheaply as well. Buy 3 of them and shoot the files off to each one (sort of like what I'm doing) for a nice triple backup plus redundancy on each unit.

Every single solution I mentioned is massively cheaper per GB than S3. You *REALLY* have to want to not manage that data to pay $900+ a year per TB.

Of course, as a disabled IT guy my time is worth far less per hour (or GB) than yours might be. If you're a corporate law firm charging $500 an hour and storing a couple of TB or less then you're right, S3 is a no-brainer, unless you're big enough to need a PT/FT IT guy anyway.

Honestly, *more* firms like that should outsource (or hire) IT help. I'm always shocked by how many small businesses that are really busy still have the boss/owner trying to play IT guy when he could source it out for $50 an hour, 10-20 hours a week (or less), and give himself another 8+ hours a week of his life back.
 
To be more specific I have 8 300GB 10,000RPM 2.5" Raptor drives I'd like to leverage like that (or just sell if i can't). Second, would you suggest mirrored/striped or Z1/2/3 for those drives (where speed is probably the number 1 priority). I figured (though it hurts) RAID10 as well for those. Or RAID0 and nightly Rsync/backup to the 8TB array.

If you can sell these drives and use the money to buy SSD you would get much better performance and lower power consumption. Of course if you need a LOT of fast storage it might get very expensive with SSDs, but you previously mentioned "35-60GB per wedding", so I assume that's more or less the amount of fast storage you need.

Just my 2 cents...
 
If you can sell these drives and use the money to buy SSD you would get much better performance and lower power consumption. Of course if you need a LOT of fast storage it might get very expensive with SSDs, but you previously mentioned "35-60GB per wedding", so I assume that's more or less the amount of fast storage you need.

Just my 2 cents...

Those drives are being used for a RAID10 array running VMs on ESX.

SSDs are good, but they are also far more expensive per GB for redundant storage. 1TB of mirrored SSD storage would cost nearly $1000. It's getting really close to being worth it, just not yet.

For now I'm using an SSD in her workstation for the fast storage. If we grow and add employees and need a more centralized high-speed disk service then I'll look at adding/creating an SSD pool. I don't see it happening anytime soon.

Thanks for the suggestions! You're dead on with the SSD thing; I hope to get to build some stupidly fast SSD array at some point.
 