Mirroring isn't good enough or maybe I'm just paranoid

Beta4Me

I'm concerned about using a set of striped mirrors with ZFS for my data. I will likely have 33 drives to use.

I could set this up as a stripe of 16 mirrors with a hot spare. Sounds good, but if I lose a single drive, that mirror is compromised and has no redundancy during the rebuild. This bothers me because I'm only guaranteed to survive the loss of ANY 1 drive, though I could lose up to 16 if the stars and planets align.

If I were to have 8 RAID-Z2 vdevs of 4 drives with a hot spare, I could lose ANY 2 drives without data loss, and ANY single drive loss would still leave redundancy during the rebuild (again, I'd be able to lose up to 16 if everything worked out perfectly).

Is this a good idea or am I just paranoid?
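To put rough numbers on it, here's a quick Python sketch (assuming 2TB drives, and ignoring ZFS overhead and the hot spare's capacity) of what each layout gives; just back-of-the-envelope arithmetic, not gospel:

Code:
# Back-of-the-envelope only: 2TB drives assumed, ZFS/metadata overhead
# and the hot spare's capacity ignored.

DRIVE_TB = 2

def summarize(name, vdevs, drives_per_vdev, redundancy_per_vdev, spares):
    data_drives = vdevs * (drives_per_vdev - redundancy_per_vdev)
    total = vdevs * drives_per_vdev + spares
    print(f"{name} ({total} drives):")
    print(f"  usable space    ~{data_drives * DRIVE_TB} TB raw")
    # Guaranteed survival: worst case, every failure lands in one vdev.
    print(f"  survives ANY    {redundancy_per_vdev} drive failure(s)")
    # Best case: failures spread evenly across the vdevs.
    print(f"  survives up to  {vdevs * redundancy_per_vdev} drive failures")

summarize("16 x 2-way mirror + 1 spare", vdevs=16, drives_per_vdev=2,
          redundancy_per_vdev=1, spares=1)
summarize("8 x 4-drive RAID-Z2 + 1 spare", vdevs=8, drives_per_vdev=4,
          redundancy_per_vdev=2, spares=1)

Both come out at the same usable space; the difference is purely the guaranteed (worst-case) failure count.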
 
What is your data worth? The answer to that question is what determines if you are paranoid or not.

Some of my data is worthless and I have no backup for it.

Some of my data is very valuable and I have multiple copies of it. Including 2 replacement computers that I can slide into place in minutes.
 
What is your data worth? The answer to that question is what determines if you are paranoid or not.

Some of my data is worthless and I have no backup for it.

Some of my data is very valuable and I have multiple copies of it. Including 2 replacement computers that I can slide into place in minutes.

This is how I approach it at home. Most of my HTPC data gets no redundancy or backups since it can be recreated. However, my important documents and source code get backed up nightly to alternating hard drives and mirrored off-site (to work), and some projects go to the cloud as well.

At work all important data resides on raid6 arrays. The most important data gets mirrored nightly to a second or third raid6. On top of that everything gets backed up to a tape archive at least 2 times.
 
Well, if it is that important you wouldn't be putting it all into one system... what if the PSU goes out one day and takes the RAID controller and some drives with it? Are you using a proper server-grade PSU, or a redundant one?

what if the OS takes a dump.....
 
A stripe of 16 mirrors doesn't make much sense unless the only concern is performance with a little redundancy. And you would need a heck of a network to use that performance. If you lose 1 vdev, you're screwed.

Anyway, RAID1/5/6, RAIDZ1/2/3, that's not backup.
 
Thanks for the replies.

I'll add a few more details as the discussion has gone a little off-course.

The plan is...
The drives are in a 45-drive chassis (JBOD) with dual expanders for both backplanes. Each pair of expanders is hooked up to a head (i.e. dual heads), and the heads are set up as primary and secondary VMs in lockstep with VMware FT for immediate failover.

All the data will be backed up to another ZFS server full of big slow cheap drives.

All I'm looking to do is decide what arrangement to put the 33 drives in (it will grow to 45 later). I definitely want at least 1 hot spare and the ability to survive ANY 2 drive failures; hence I'm pondering 4-drive RAID-Z2 vdevs.
 
8 4-drive raidz2 with 1 spare sounds fine. How much do you mind wasting disk space for redundancy? If not at all, consider 11 3-drive mirrors striped together. Gives you the same redundancy for each vdev, plus superior read performance.
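Rough space comparison for those two options, assuming 2TB drives and ignoring overhead (ballpark only):

Code:
# Both layouts survive ANY 2 drive failures within a vdev; the trade is
# usable space vs number of vdevs in the stripe. 2TB drives assumed.

DRIVE_TB = 2
RAW_TB = 33 * DRIVE_TB

for name, vdevs, data_drives_per_vdev in [
    ("8 x 4-drive RAID-Z2 + 1 spare", 8, 2),
    ("11 x 3-way mirror (no spare)", 11, 1),
]:
    usable = vdevs * data_drives_per_vdev * DRIVE_TB
    print(f"{name}: ~{usable} TB usable of {RAW_TB} TB raw, "
          f"{vdevs} vdevs striped")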
 
Thanks for the replies.

I'll add a few more details as the discussion has gone a little off-course.

The plan is...
The drives are in a 45-drive chassis (JBOD) with dual expanders for both backplanes. Each pair of expanders is hooked up to a head (i.e. dual heads), and the heads are set up as primary and secondary VMs in lockstep with VMware FT for immediate failover.

All the data will be backed up to another ZFS server full of big slow cheap drives.

All I'm looking to do is decide what arrangement to put the 33 drives in (it will grow to 45 later). I definitely want at least 1 hot spare and the ability to survive ANY 2 drive failures; hence I'm pondering 4-drive RAID-Z2 vdevs.

You did not say how valuable the data is, how much data you have, or how much demand is on the system.

---

You are going to back this up onto big old cheap drives? That is one of your problems. You want good drives for your backup.
 
You did not say how valuable the data is, how much data you have, or how much demand is on the system.

---

You are going to back this up onto big old cheap drives? That is one of your problems. You want good drives for your backup.

The amount of data will initially be about 5TB and is expected to grow by a minimum of 200GB/month.

The data comprises active/running important VMs, testing/dev VMs, general business data and non-critical media (music & videos).
The most important data will be backed up online and the VMs will go to removable disc. Therefore, the only data backed up solely to the ZFS backup server will be non-critical data and media, so big and cheap is best; it keeps that server affordable rather than not purchased at all.

I'm expecting to have an L2ARC of 10+ SSDs and a ZIL of 2 pairs of heavily over-provisioned SSDs (all Intel 320 120GB).

The load on the server will be a minimum of 6 VMs including SBS 2010, 3 x WS 2008 R2 (RDS, Lync, vCenter) & WHS + whatever testing/dev VMs are running. Also, there will be media streaming and user data consumption.

Hope that helps :)
 
Personally I would make separate pools, and tier your storage needs across those pools. Striped mirrors for VM IOPS with SSD cache and log on that pool, then maybe 1-2 Z1 or Z2 pools for documents and music.

Going one point further, I would put production VMs on a completely separate box, isolated from development VMs and bulk user storage.

Are you going to have the RAM needed to track the 1.2TB of SSD cache?

What kind of load is expected (how many users), and what are the current setups for VM and random data storage?
 
Not sure what you mean by 'needed to track the SSD cache'? The ARC doesn't *need* to be any particular size, nor (AFAIK) does its size correlate to the size of the L2ARC. Maybe I'm being overly pedantic, but people shouldn't think they *need* X amount of RAM if they are going to have Y amount of L2ARC.
 
I'm expecting to have an L2ARC of 10+ SSDs and a ZIL of 2 pairs of heavily over-provisioned SSDs (all Intel 320 120GB).

I concur with adi, best to make separate pools for different types of demands. If you have 10+ SSDs for use as L2ARC why not instead use them to create a small super fast pool which you can store your VMs and fast storage data on. Then keep the big slow disks in another pool with 1-2 SSDs for L2ARC.
 
I would agree that it might make sense to create different pools depending on the expected performance characteristics (e.g. some pool layouts are faster reading vs writing, etc...)
 
A few snippets from the first pages of Google; still trying to find official documentation on ZIL/L2ARC RAM usage.

ZIL RAM:
The maximum size of a log device should be approximately 1/2 the size of physical memory because that is the maximum amount of potential in-play data that can be stored. For example, if a system has 16 GB of physical memory, consider a maximum log device size of 8 GB.

L2ARC
Note that for L2ARC to get utilized, you need the corresponding amount of DRAM. The ratio depends on ZFS recordsize. That is, if you are using a fixed recordsize (which is always recommended for databases), the back-of-the-envelope formula is:

L2ARC size / (DRAM size * ZFS recordsize in KB) ≈ 5

For instance, at 8KB recordsize and 4GB DRAM the appliance could efficiently utilize up to 160GB of L2ARC.

Another example: 128KB recordsize (the ZFS default) and 1GB DRAM yields 640GB.

L2ARC 2
Just the L2ARC directory can consume 24GB of ARC for 8KB records.

L2ARC 3

For every 100GB of L2ARC you have on your system, ZFS will use ~2GB of main memory to map the cache.

This is something that we’re aware of but didn’t really touch on in the article. Our next build will most likely include a significantly larger system board configuration to allow for dual processors and a lot more RAM to accommodate the L2ARC a little bit better.
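Plugging this thread's numbers into those two rules of thumb (10 x 120GB SSDs of L2ARC, an assumed 8K recordsize for the VM datasets, and a few possible RAM sizes), a rough Python sketch; treat both results as order-of-magnitude estimates only:

Code:
# Rough check of the two rules of thumb quoted above, applied to ~1.2TB
# of L2ARC (10 x 120GB SSDs) at an assumed 8K recordsize.

l2arc_gb = 10 * 120      # ~1.2TB of L2ARC
recordsize_kb = 8        # assumed recordsize for VM-style workloads

# Rule 2: ~2GB of ARC headers per 100GB of L2ARC, regardless of RAM size.
header_ram_gb = 2 * l2arc_gb / 100
print(f"~{header_ram_gb:.0f} GB of RAM consumed just to map {l2arc_gb} GB of L2ARC")

# Rule 1: L2ARC size / (DRAM size * recordsize in KB) ~= 5
for ram_gb in (4, 16, 32):
    usable_gb = min(5 * ram_gb * recordsize_kb, l2arc_gb)
    print(f"{ram_gb:>2} GB RAM -> ~{usable_gb} GB of the {l2arc_gb} GB "
          f"L2ARC efficiently usable")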
 
All true enough, but again, the ARC consumes as much RAM as it can get away with - if there is 'not enough RAM', the caching will be less effective - nothing bad will happen, correctness-wise.
 
All true enough, but again, the ARC consumes as much RAM as it can get away with - if there is 'not enough RAM', the caching will be less effective - nothing bad will happen, correctness-wise.

Not likely with the cost of this build, but if he has only 4GB of RAM and an 8K recordsize, the system would only be able to use 160GB of the 1.2TB of SSD that he provides.

I see where you're coming from: with this setup nothing will be broken and it will just fall back to disk I/O, but to me there is a fundamental difference between 'not broken' and overpriced, inefficient crap that could have been prevented with research.

So yes, you can use 1.2TB of SSD cache, and you can use it even better if you have enough RAM to utilize/track all 1.2TB.
 
True enough. I think we were looking at this from different perspectives, but are fundamentally in agreement.
 
The amount of data will initially be about 5TB and is expected to grow by a minimum of 200GB/month.

The data comprises active/running important VMs, testing/dev VMs, general business data and non-critical media (music & videos).
The most important data will be backed up online and the VMs will go to removable disc. Therefore, the only data backed up solely to the ZFS backup server will be non-critical data and media, so big and cheap is best; it keeps that server affordable rather than not purchased at all.

I'm expecting to have an L2ARC of 10+ SSDs and a ZIL of 2 pairs of heavily over-provisioned SSDs (all Intel 320 120GB).

The load on the server will be a minimum of 6 VMs including SBS 2010, 3 x WS 2008 R2 (RDS, Lync, vCenter) & WHS + whatever testing/dev VMs are running. Also, there will be media streaming and user data consumption.

Hope that helps :)

5TB of data spread over 33 hard drives seems wrong. Storing media on a business computer seems wrong. Doing testing on a machine storing valuable data seems wrong.

As I don't have enough time to give you assistance, I will walk away now.
 
It isn't 33 drives of actual data though. Why is media on a business computer wrong? Doesn't it depend on what the business is? Not a very useful comment, IMO.
 
It isn't 33 drives of actual data though. Why is media on a business computer wrong? Doesn't it depend on what the business is? Not a very useful comment, IMO.

I do not have the time to make it helpful.

Perhaps you do.
 
If you are paranoid, go for triple mirrors..

10 x 3 Drive mirrors = 33 Drives nicely

2 drives die in ANY mirror set and you are still OK

Really paranoid ??

Offline one of the mirrored drives in each set..... pull the drives out and run the system like that..... that way you also have an OFF Line and OFF Site copy of the data.

Plonk the OFF Line drives in once a week and resilver..... once resilver is done OFF Line them and remove them again.

;)
 
RAID-Z3 would mean 7 drive vdevs which starts to significantly reduce the number of vdevs involved in the overall stripe. Over 45 drives it would mean a stripe across 6 vdevs plus 3 hotspares.

3 x 11-drive vdevs.
 
I do not have the time to make it helpful.

Perhaps you do.
Well then why would you even bother posting? Just to be annoying?


If you are paranoid, go for triple mirrors..

10 x 3 Drive mirrors = 33 Drives nicely

2 drives die in ANY mirror set and you are still OK

Really paranoid ??

Offline one of the mirrored drives in each set..... pull the drives out and run the system like that..... that way you also have an OFF Line and OFF Site copy of the data.

Plonk the OFF Line drives in once a week and resilver..... once resilver is done OFF Line them and remove them again.

;)
IIRC primary school maths 10 x 3 = 30 ;)
But, yes, I would do this if I could afford to have that much redundancy but I'm not willing to have less than 50% of my drives being data...just too much $.


3 x 11-drive vdevs.
That would mean 8 (data) + 3 (parity) per vdev, which runs completely counter to what I was saying before about not wanting to reduce the number of vdevs being striped across.


How big are the drives?
2TB. I think I said that before? SSDs are 120GB.


Personally I would make separate pools, and tier your storage needs across those pools. Striped mirrors for VM IOPS with SSD cache and log on that pool, then maybe 1-2 Z1 or Z2 pools for documents and music.

Going one point further, I would put production VMs on a completely separate box, isolated from development VMs and bulk user storage.

Are you going to have the RAM needed to track the 1.2TB of SSD cache?

What kind of load is expected (how many users), and what are the current setups for VM and random data storage?
The thing is, creating a separate pool for VMs would mean wasting a massive amount of storage space, as I've got 2TB drives and I'll probably have MAX 2.5TB of VMs. Or I'd have to buy more fast & small drives to set up that pool.
Alternatively, I could do what I planned and just have one pool with a crapload of L2ARC which caches everything (i.e. the VMs), and I don't have to worry about redundancy for it as it's a second copy (the original is on the disks).


I concur with adi, best to make separate pools for different types of demands. If you have 10+ SSDs for use as L2ARC why not instead use them to create a small super fast pool which you can store your VMs and fast storage data on. Then keep the big slow disks in another pool with 1-2 SSDs for L2ARC.
This would be good, but then I'd have to set up redundancy for the SSD array, which would cost a fortune, or else have synchronous replication to the data array, etc. It'd just be a pain and/or expensive.

Not likely with the cost of this build, but if he has only 4GB of RAM and an 8K recordsize, the system would only be able to use 160GB of the 1.2TB of SSD that he provides.

I see where you're coming from: with this setup nothing will be broken and it will just fall back to disk I/O, but to me there is a fundamental difference between 'not broken' and overpriced, inefficient crap that could have been prevented with research.

So yes, you can use 1.2TB of SSD cache, and you can use it even better if you have enough RAM to utilize/track all 1.2TB.
Initially, there'll only be 1 server and then a 2nd when funds allow. These will run storage and all the other VMs. Later, when more money is available, additional hosts will be added to run the VMs with the goal that eventually the original 2 servers will only run storage (and maybe vCenter).

The expected specs are:
SuperMicro 846A-R900B
SuperMicro X9SCM-F
Intel Xeon E3-1230 Quad-Core 3.20GHz CPU
4 x Kingston 8GB DDR3-1333 UnBuff ECC RAM
2 x Intel 320 SSD 40GB
Intel 10GbE Dual Port Server Adapter AT2
 
So, basically you want the following:

Lots of vdevs for speed
High availability in the form of parity/mirrors
Cheap

I don't know if you've ever heard of this saying before, but generally you can have 2 of the following 3 things in the world of computers:
Cheap price
Quality
Speed

What you're asking for is all of them. It's not doable. Either go with
3 RAIDZ3 vdevs for relatively good speed, lots of storage and high availability
11 mirrored vdevs for great speed, relatively few data spindles and very high availability
or anywhere in between.

And remember, the speed of your spindles really doesn't matter much with that huge amount of L2ARC, but also remember that a very large amount of your RAM will be used to index the L2ARC contents (as was mentioned earlier).
 
So, basically you want the following:

Lots of vdevs for speed
High availability in the form of parity/mirrors
Cheap

I don't know if you've ever heard of this saying before, but generally you can have 2 of the following 3 things in the world of computers:
Cheap price
Quality
Speed

What you're asking for is all of them. It's not doable. Either go with
3 RAIDZ3 vdevs for relatively good speed, lots of storage and high availability
11 mirrored vdevs for great speed, relatively few data spindles and very high availability
or anywhere in between.

And remember, the speed of your spindles really doesn't matter much with that huge amount of L2ARC, but also remember that a very large amount of your RAM will be used to index the L2ARC contents (as was mentioned earlier).
I'm not looking for cheap, I'm just not looking to spend an insane amount of dollars for overkill or be wasteful. I can't afford to have 2/3 of my disks set aside for parity.

I'm probably going to go with what I originally suggested, which is 4-drive RAID-Z2 vdevs (2 data + 2 parity). It allows me to still have lots of vdevs in the pool to stripe across, which is relevant for all non-L2ARC'd data. It also means redundancy still exists in a vdev even after a drive failure (and during the resilver), which is really important to me (my paranoia ;))

Initially, I don't mind if the L2ARC is underutilised due to lack of available RAM, because it won't be too long till there are more servers to push the other VMs onto and distribute the load.

I think this all works fairly well? :)
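For reference, here's the rough rule of thumb I'm weighing on the striping side (not a benchmark: each RAID-Z vdev gives roughly one disk's worth of random IOPS, while a 2-way mirror can read from both sides; the ~100 IOPS per 7200rpm disk figure is just an assumption):

Code:
# Rule of thumb only, not a benchmark. ~100 random IOPS per 7200rpm
# disk is assumed; streaming throughput is a different story.

PER_DISK_IOPS = 100

for name, vdevs, read_ways, write_ways in [
    ("8 x 4-drive RAID-Z2", 8, 1, 1),
    ("16 x 2-way mirror", 16, 2, 1),
]:
    print(f"{name}: ~{vdevs * read_ways * PER_DISK_IOPS} random read IOPS, "
          f"~{vdevs * write_ways * PER_DISK_IOPS} random write IOPS")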
 
Well then why would you even bother posting? Just to be annoying?

No. It turned out that I seem to have underestimated how much time getting through to you would take.

But, yes, I would do this if I could afford to have that much redundancy but I'm not willing to have less than 50% of my drives being data...just too much $.

Now you have set a value on your data. It appears your data is not worth much to you. Certainly not enough for me to spend time on it.
 
In the same reply you say that you don't want "overkill" or to be "wasteful", and "I don't mind if the L2ARC is underutilised".

*confused* :)

You should spend a few minutes reading this: http://www.servethehome.com/excess-capacity-whs-vail-aurora-hot-spares-raid-time-recover-mttr-guide/

It's obviously not 100% accurate, but it gives a good idea of how well protected the different raid levels are against harddrive failure. The guy basically does some math on theoretical failure rates of raid arrays based on 20 drive arrays.

To sum it up:
In order to kill a raid6/raidz2 array from harddrive failure (not talking about lightning strikes, user errors or anything like that) you need either
a) an insane amount of harddrives in one vdev
b) very slow harddrives (ruins rebuild times)
c) a very very bad batch of harddrives
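To illustrate why, a very crude sketch of the arithmetic (independent drive failures only, made-up MTBF and resilver-time figures, no modelling of unrecoverable read errors or bad batches):

Code:
import math

# Crude model: independent failures only, with an assumed 1,000,000h
# MTBF and a 24h resilver window. Given one dead drive in a vdev, what
# is the chance the vdev loses its remaining redundancy before the
# resilver completes?

MTBF_H = 1_000_000
RESILVER_H = 24
p = 1 - math.exp(-RESILVER_H / MTBF_H)   # one given drive dies in the window

# Degraded 2-way mirror: one survivor left, losing it loses the vdev.
p_mirror = p

# Degraded 4-drive RAID-Z2: three survivors, any two of them must die.
p_raidz2 = sum(math.comb(3, k) * p**k * (1 - p)**(3 - k) for k in (2, 3))

print(f"p(a given drive dies during resilver)  ~ {p:.1e}")
print(f"p(lose a degraded 2-way mirror vdev)   ~ {p_mirror:.1e}")
print(f"p(lose a degraded 4-drive RAID-Z2)     ~ {p_raidz2:.1e}")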
 
No. It turned out that I seem to have underestimated how much time getting through to you would take.



Now you have set a value on your data. It appears your data is not worth much to you. Certainly not enough for me to spend time on it.
Well I certainly underestimated how much of a dick you are. Sorry about that, won't happen again.


In the same reply you say that you don't want "overkill" or to be "wasteful", and "I don't mind if the L2ARC is underutilised".

*confused* :)

You should spend a few minutes reading this: http://www.servethehome.com/excess-capacity-whs-vail-aurora-hot-spares-raid-time-recover-mttr-guide/

It's obviously not 100% accurate, but it gives a good idea of how well protected the different raid levels are against harddrive failure. The guy basically does some math on theoretical failure rates of raid arrays based on 20 drive arrays.

To sum it up:
In order to kill a raid6/raidz2 array from harddrive failure (not talking about lightning strikes, user errors or anything like that) you need either
a) an insane amount of harddrives in one vdev
b) very slow harddrives (ruins rebuild times)
c) a very very bad batch of harddrives
You missed a keyword: "initially". I don't mind if there is some underutilization of my L2ARC in the beginning as once I shift VMs to new servers there will be more free RAM and everything will be peachy :)
Thanks for the link, I'll check it out.
 
Well I certainly underestimated how much of a dick you are. Sorry about that, won't happen again.



You missed a keyword: "initially". I don't mind if there is some underutilization of my L2ARC in the beginning as once I shift VMs to new servers there will be more free RAM and everything will be peachy :)
Thanks for the link, I'll check it out.

Talking about moving VMs to new servers to free up RAM for ZFS makes this sound like it is an all-in-one project?

Or maybe I'm just not understanding where moving a VM will have an impact on your storage server RAM.
 
No. It turned out that I seem to have underestimated how much time getting through to you would take.

Now you have set a value on your data. It appears your data is not worth much to you. Certainly not enough for me to spend time on it.

Honestly, if you think the OP is a dolt, why do you continue to discuss the issue with him? Despite the emo posting about 'walking away'?
 
I'm not looking for cheap, I'm just not looking to spend an insane amount of dollars for overkill or be wasteful. I can't afford to have 2/3 of my disks set aside for parity.

I'm probably going to go with what I originally suggested, which is 4-drive RAID-Z2 vdevs (2 data + 2 parity). It allows me to still have lots of vdevs in the pool to stripe across, which is relevant for all non-L2ARC'd data. It also means redundancy still exists in a vdev even after a drive failure (and during the resilver), which is really important to me (my paranoia ;))

Initially, I don't mind if the L2ARC is underutilised due to lack of available RAM, because it won't be too long till there are more servers to push the other VMs onto and distribute the load.

I think this all works fairly well? :)

Your 4-drive RAID-Z2 then sounds like the best bet. If you're willing to accept 50% of capacity for parity, then 8 x 4-drive RAID-Z2 is the best compromise. You gain a guaranteed 2 drives' worth of parity per vdev, compared to single-drive parity for the stripe of 16 mirrors, and you trade half your 16 vdevs for striping. All this while keeping usable space the same, since you lose 50% of raw capacity either way. And you keep your one drive as a hot spare too.
 
Talking about moving VMs to new servers to free up RAM for ZFS makes this sound like it is an all-in-one project?

Or maybe I'm just not understanding where moving a VM will have an impact on your storage server RAM.

Due to cost restrictions we'll be starting with 1 or 2 servers and then adding more. The intention is to eventually have a pair of servers which will act as head nodes purely for the storage system, but this will only happen once there are enough other servers to handle the load of the VMs. So, yes, initially this might be an all-in-one or an all-in-two.
 