Reconsidering Mirroring

GreenLED
Limp Gawd | Joined: Mar 1, 2014 | Messages: 141
As my hardware slowly starts to arrive, I'm debating whether I should consider using mirroring versus raidz2. Here is the proposed change.

From . . .

Pool
  Alpha (RZ-2)
    Device 1
    Device 2
    Device 3
    Device 4
    Device 5
    Device 6
  Bravo (RZ-2)
    Device 7
    Device 8
    Device 9
    Device 10
    Device 11
    Device 12

To . . .

Pool
  Alpha (Mirror)
    Device 1
    Device 2
    Device 3
  Bravo (Mirror)
    Device 4
    Device 5
    Device 6
  Charlie (Mirror)
    Device 7
    Device 8
    Device 9
  Delta (Mirror)
    Device 10
    Device 11
    Device 12

If I am assessing this correctly ...

By having 3 disks on each vdev, I need to lose ALL three for the pool to fail. The likelihood of that happening would really be very low, but it could happen. It's more likely that 2 drives from two different vdevs would fail. In addition to this, I get the added benefit of a great increase in performance. So, to my way of thinking, this setup can lose 4 drives (one from each vdev) and still be operational. Isn't this correct?
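If I understand the zpool syntax right, the two layouts would be built something like this (device names are just placeholders):

Code:
# current plan: two 6-disk RAIDZ2 vdevs striped in one pool
zpool create tank \
    raidz2 disk1 disk2 disk3 disk4 disk5 disk6 \
    raidz2 disk7 disk8 disk9 disk10 disk11 disk12

# proposed: four 3-way mirror vdevs striped in one pool
zpool create tank \
    mirror disk1 disk2 disk3 \
    mirror disk4 disk5 disk6 \
    mirror disk7 disk8 disk9 \
    mirror disk10 disk11 disk12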

Obviously, I lose capacity, but data integrity is important in this build and it will be a while before any user fills up even one TB. My only concern with this setup - and it is a concern - is adding more vdevs, which will keep adding more drives to the stripe. Should I reverse the scheme? Instead of striping mirrors, mirror stripes? Can you even nest things like that in ZFS?

Drives arrived today, I'll have pictures up shortly.
 
Striping mirrors (1+0) is a lot safer. If you mirror two stripes of 3 drives (0+1) and lose just one drive from each set, your data is gone. With the same drives in striped mirrors you can lose two drives from both sets, since all drives in a mirror hold the same data.

However, the failure tolerance of your pool would still be just two drives in the worst-case scenario. So basically you would be halving your effective storage for almost no gain (shorter rebuild times, more IOPS). A lot depends on what you plan to use your server for.

Besides, if you are really interested in your uptime, there are a lot more risks than just failing drives. Your power supply could die, for instance. Power outages and brownouts are also a risk.
 
I would just build 2 independent servers and use raidz3 for both. Use zfs send / receive to send the updates from the first server to the second. No matter what you do with a single server there will always be a chance of losing the entire dataset. I mean, even though you have 3 copies of everything, a power supply could fail, making all drives die.
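Something along these lines (pool and snapshot names are just examples):

Code:
# initial full copy to the second server
zfs snapshot -r tank@base
zfs send -R tank@base | ssh server2 zfs recv -F backup/tank

# afterwards, send only the incremental changes
zfs snapshot -r tank@daily1
zfs send -R -i tank@base tank@daily1 | ssh server2 zfs recv backup/tank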
 
Obviously, I lose capacity, but data integrity is important in this build and it will be a while before any user fills up even one TB. My only concern with this setup - and it is a concern - is adding more vdevs, which will keep adding more drives to the stripe. Should I reverse the scheme? Instead of striping mirrors, mirror stripes? Can you even nest things like that in ZFS?

Drives arrived today, I'll have pictures up shortly.

Afaik nesting isn't really supported, only striping of vdevs. If you think you'd have too many vdevs in a zpool, why not consider moving them to a second zpool? If your first zpool manages its load fine, that is.
Also, when adding vdevs there's this to consider: ZFS doesn't rebalance existing storage.
Meaning that if you have data on 2 vdevs at 99% storage space and add a third, you'll only get the write/read speed of the newly added vdev: existing storage stays on its vdevs, and afaik there's no way to rebalance it manually except by emptying the array and recopying.
New writes to the enlarged pool take all vdevs into account, though.
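Adding a vdev itself is a one-liner, for what it's worth (names are examples):

Code:
# stripes a new mirror vdev into the existing pool; only new writes land on it
zpool add tank mirror diskX diskY diskZ

If you really need to rebalance, the usual workaround is to zfs send the datasets to another pool and back, or copy them into a fresh dataset within the pool.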

I would just build 2 independent servers and use raidz3 for both. Use zfs send / receive to send the updates from the first server to the second. No matter what you do with a single server there will always be a chance of losing the entire dataset. I mean, even though you have 3 copies of everything, a power supply could fail, making all drives die.

Agreed... i remember all too well that 5 years ago my Areca RAID card went haywire and wrote corrupted data to my RAID 6 array... since then i'm religious about my backups.
 
Obviously, I lose capacity, but data integrity is important in this build and it will be a while before any user fills up even one TB. My only concern with this setup - and it is a concern - is adding more vdevs, which will keep adding more drives to the stripe. Should I reverse the scheme? Instead of striping mirrors, mirror stripes? Can you even nest things like that in ZFS?

Drives arrived today, I'll have pictures up shortly.

You can extend your pool with more vdevs. It is suggested, but not a must, that you extend with the same type of vdev. So if you start with a 6-disk Raid-Z2, you should extend the pool with 6-disk Raid-Z2 vdevs. If you start with 3-way mirrors, extend with more 3-way mirrors.

ZFS always stripes over vdevs. If you add vdevs, data is initially unbalanced, but modified data gets striped over all vdevs over time.

Raid-Z2 vdevs and 3-way mirrors both allow any 2 disks to fail in a vdev. 3-way mirrors allow more disks overall to fail, but the basic reason to choose either is mostly I/O performance vs capacity. With Raid-Z2 you have more capacity with very good sequential performance. I/O performance scales with the number of vdevs (like two disks in your case).

With mirrors you have less capacity but read I/O comparable to 12 disks and write I/O like 4 disks. If you use your storage for a database or ESXi, mirrors are better. If you use it as a filer or for backup, I would go Raid-Z2.

Don't forget to do backups on a second removable pool, or better a second server in a different physical location/room/building, for disaster recovery (fire, overvoltage, stolen server, etc.).
 
I feel like I'm spinning my wheels. I want to make sure I'm understanding this correctly. Here is a diagram I drew up really quickly to work off of. The text one is a bit harder to follow, I think.

[Attached image: "raid stripe.jpg" - a diagram of four 3-disk groups (vdevs) striped into one pool]


In this example, you have 4 groups (vdevs), each with 3 disks, making a total of 12 disks. This (as I understand it) is striping because I'm writing block A on the first group, block B on the second, and so on and so forth. In this arrangement I should be able to lose disks 2 & 3, disks 5 & 6, disks 8 & 9, disks 11 & 12 and STILL have a complete working set of data, because disks 1, 4, 7 and 10 still contain everything the others did. Am I insane or is this right? Isn't this RAID 1+0?
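For what it's worth, I'd expect zpool status to show that layout roughly like this (if I'm reading the docs right):

Code:
  pool: tank
 state: ONLINE
config:
        NAME        STATE
        tank        ONLINE
          mirror-0  ONLINE
            disk1   ONLINE
            disk2   ONLINE
            disk3   ONLINE
          mirror-1  ONLINE
            ...
          mirror-3  ONLINE
            ...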

I apologize for the huge image. Best I can come up with on short notice.
 
In this arrangement I should be able to lose disks 2 & 3, disks 5 & 6, disks 8 & 9, disks 11 & 12 and STILL have a complete working set

Yes. However, as I said, a bad power supply could easily wipe out all 12 disks. I do not think you are really gaining much by this arrangement for your use case.
 
Yeah, you are over-thinking it. You need a backup no matter what ZFS configuration you choose.

A backup is a second independent copy of the data. No amount of mirroring or parity is equal to a second copy.

The level of ZFS redundancy needs to be chosen based on performance needs and also the likelihood of downtime (more parity, more mirroring if you want less chance of downtime). Downtime meaning destroying and restoring the array from backup because you lost more than your mirroring/parity covered.
 
A backup is a second independent copy of the data.

Preferably offline most of the time and/or in a different physical location to reduce the chance of a fire or other natural disaster taking out the backup as well as the original.
 
My specialty is over-thinking :D.

The Internet is really lacking a GOOD calculator for RAID. None of the ones I have found allow you to build out a complicated setup and find out capacity and fault tolerance. They mostly allow you to calculate non-nested arrays.

So, here's a list of requirements. Maybe we can put this thing to rest.

  • Low downtime
  • Write performance
  • Simple scalability

Most of the arrays I've seen (all of them, really) always show read performance higher than write, which makes sense. For the requirements I have, I think 2 RZ-2 vdevs is the best choice. My only concern is expanding the array. That's it. I don't want to stripe too many arrays together. If I can address that issue in some way, or by using a different pool structure (i.e. a different arrangement other than RAID 60), that's fine. I understand (only too well) the importance of having a backup. I have clients with critical information that needs to be protected.
 
I see your point. Going back to the drawing board :).

why? your assumption is you want the best availability while everything else is presumed working.

if the PSU blows, sure, that sucks, but replace it and bring the system back up. further, if you're this concerned about your data, why use a box with only a single power supply?
 
again, why not simply go multiple pools once your r/w and IO specs are satisfied on one pool?
 
why? your assumption is you want the best availability while everything else is presumed working.

if the PSU blows, sure, that sucks, but replace it and bring the system back up. further, if you're this concerned about your data, why use a box with only a single power supply?

I'm not really concerned about power loss. Someone else brought it up and it got me thinking. I have a UPS, so at least that covers power loss, letting the system shut down gracefully.

I'm fairly certain RAID 60 will be the final decision for me. The only thing keeping me back from implementing this is expansion. There is a limit to how many vdevs I can string together before the probability of failure is high enough to make me feel too uncomfortable. Most people suggest 6 disks per RZ-2 array. Well, once I get to about 4 vdevs, I will be striping 4 RZ-2 arrays. Does that make anyone else here uncomfortable? How am I supposed to manage the scalability? This WILL become a problem for me sooner or later, especially since I will be dealing with video editing people. So, any ideas how to handle the expansion part? Create a separate pool? I have nothing against that. I'm just not sure how to scale this.
 
well, it's how i plan to scale. max 2 raidz2 vdevs per pool, then adding pools at leisure until i hit probably 128GB RAM / 128TB space. that's the long-term plan, kinda. i wouldn't feel at ease striping too many vdevs either.
 
well, it's how i plan to scale. max 2 raidz2 vdevs per pool, then adding pools at leisure until i hit probably 128GB RAM / 128TB space. that's the long-term plan, kinda. i wouldn't feel at ease striping too many vdevs either.

Thank you for posting that. Gave me some idea of what others are doing. So, if you have different pools, how do you manage volumes? I'm not sure if you have the same scheme as I do (i.e. outside users uploading their files).
 
I'm not really concerned about power loss.
you should be. instant and sudden loss of power is death to a storage array. well, it can be. drives love to just spin; they'll do this sometimes for many years past their warranty. they will continue doing this as long as they have a steady stream of good clean power. once that flow abruptly stops, all manner of bad things can happen to those drives.
I'm fairly certain RAID 60 will be the final decision for me. The only thing keeping me back from implementing this is expansion. There is a limit to how many vdevs I can string together before the probability of failure is high enough to make me feel too uncomfortable. Most people suggest 6 disks per RZ-2 array. Well, once I get to about 4 vdevs, I will be striping 4 RZ-2 arrays. Does that make anyone else here uncomfortable? How am I supposed to manage the scalability? This WILL become a problem for me sooner or later, especially since I will be dealing with video editing people. So, any ideas how to handle the expansion part? Create a separate pool? I have nothing against that. I'm just not sure how to scale this.
my largest Z2 pools have 20 six-drive vdevs. now these are all 1TB drives; if they were 3TB drives it might concern me a bit, however i'm at around a 1% failure rate on the roughly 1200 drives I have in production (Seagate Constellation ES.2 and ES.3). that said, my archive-type storage is all 7-disk Z3 arrays, 17 vdevs in each of those pools.

IDK what drives you're using, but if they're enterprise class, even enterprise SATA, you're over-thinking it. multiple drive failures are not that common, in my experience.
 
you should be. instant and sudden loss of power is death to a storage array. well, it can be. drives love to just spin; they'll do this sometimes for many years past their warranty. they will continue doing this as long as they have a steady stream of good clean power. once that flow abruptly stops, all manner of bad things can happen to those drives.

Let me rephrase this. I AM concerned about power loss. I have repaired many customer drives that have lost power, and I understand the huge "death" factor when losing power. But I did mention I had a UPS. Doesn't that count for anything? If only to give the array a chance to shut down gracefully?
 
A UPS can still fail; not likely, presuming you or someone regularly services the batteries.

my point though was you're worried about a PSU failure ... so use a server that has two power supplies.
 
my largest Z2 pools have 20 six-drive vdevs. now these are all 1TB drives; if they were 3TB drives it might concern me a bit, however i'm at around a 1% failure rate on the roughly 1200 drives I have in production (Seagate Constellation ES.2 and ES.3). that said, my archive-type storage is all 7-disk Z3 arrays, 17 vdevs in each of those pools.

Fantastic! So, let's see, 20 six-drive vdevs. So one vdev carries about ~4-something TB each, right? So you have 120 drives total in that one pool, all striped together using Z2. Wow. At least I heard it from someone. Makes me think differently about this. Question: what is your topology like? Do you know how many drives per HBA? I'm using 2TB drives, would you be OK with that?


IDK what drives you're using, but if they're enterprise class, even enterprise SATA, you're over-thinking it. multiple drive failures are not that common, in my experience.

You can look at all the parts I'm using here. (update coming soon as well)

==================== Notepad ====================

Just posting this for myself or else I will lose the link :).

ZFS Raidz Performance, Capacity and Integrity Comparison @ Calomel.org
 
Let me rephrase this. I AM concerned about power loss. I have repaired many customer drives that have lost power, and I understand the huge "death" factor when losing power. But I did mention I had a UPS. Doesn't that count for anything? If only to give the array a chance to shut down gracefully?

A copy-on-write filesystem like ZFS is always consistent. A sudden power failure is no problem. A file that is copied during a power loss may be damaged, but not the pool/filesystem itself. You can use a secure sync-write setting and a dedicated ZIL to reduce sudden-offline problems, where all data from the RAM write cache are always written to disk without too much speed degradation (e.g. for database or ESXi use cases). For use cases like backup or video editing this is not relevant.
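For example (filesystem and device names are placeholders):

Code:
# force synchronous write semantics on a critical filesystem
zfs set sync=always tank/vmstore

# add a mirrored dedicated log device (fast, powerloss-safe SSD) for the ZIL
zpool add tank log mirror ssd1 ssd2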

A ZFS pool built from many vdevs is also not a problem, unlike hardware raid-6 arrays. If a disk fails, the ZFS pool is degraded and you should replace the disk (best is to have a pool-wide hotspare). During a rebuild, performance may be a little lower, just like with scrubbing.
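A hotspare plus automatic replacement looks like this (disk names are placeholders):

Code:
# add a pool-wide hot spare and let ZFS swap it in on failure
zpool add tank spare diskS
zpool set autoreplace=on tank

# or replace a failed disk manually
zpool replace tank disk7 diskN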

If more disks fail than the redundancy level allows (two disks in a vdev with Z2), then the pool goes offline. If enough disks come back, the pool is online again, without the raid check that a hardware raid would need in a similar case.

Petabyte pools with hundreds of disks in dozens of vdevs have been done and are within the specification of ZFS. Every new vdev increases capacity and performance, so many vdevs are a good idea.
 
Fantastic! So, let's see, 20 six-drive vdevs. So one vdev carries about ~4-something TB each, right? So you have 120 drives total in that one pool, all striped together using Z2. Wow. At least I heard it from someone. Makes me think differently about this. Question: what is your topology like? Do you know how many drives per HBA? I'm using 2TB drives, would you be OK with that?
2TB drives, yeah, that's fine. the rebuild math around the 85% full mark is still acceptable.

Topology is like this. On the LSI 9206-16e there are 4 SFF-8644 ports on the back. these ports physically tie back to two LSI 2308 chips, so each of these cards is in fact two HBAs on a single card, right. So to avoid a single chip failure I ran the cables basically straight through 1-4 and then skipped a JBOD for the second set of ports, which is 2, 3, 4, 1. Each JBOD then has 48Gbps of dual-path bandwidth (and is talking to two different physical HBAs, which in turn are using 2 different IRQs and different PCIe lanes) to the 60 drives in each JBOD, for a system grand total of 192Gbps worth of SAS throughput.

HBA 1:
  Port 1 -> JBOD1 C1/P1
  Port 2 -> JBOD2 C1/P1
  Port 3 -> JBOD2 C1/P1
  Port 4 -> JBOD3 C2/P2

HBA 2:
  Port 1 -> JBOD3 C1/P1
  Port 2 -> JBOD4 C2/P1
  Port 3 -> JBOD4 C1/P1
  Port 4 -> JBOD1 C2/P2

I've seen these pools write at around 4GB/s, and i have two pools per cluster, so i can sustain about 8GB/s of pure sequential write. With a 50W/50R 100% random test pattern using a 128K block size i managed to sustain about 3GB/s, but was hitting latency barriers and didn't have enough test hosts to really push further. honestly i'm not 100% sure how fast these designs can go; we didn't have as much time to test as i would have preferred.
 
A copy-on-write filesystem like ZFS is always consistent. A sudden power failure is no problem. A file that is copied during a power loss may be damaged, but not the pool/filesystem itself.
Actually, this is one of ZFS's strengths, as I have understood it. No need to run chkdsk or fsck for hours after a power loss.

I have myself pulled the plug several times on my zpool, even during a scrub. At first I was very hesitant, but in the end I thought "hey, this is ZFS, it is designed to withstand this" so I just pulled the plug. This was many years ago, when ZFS was new and immature. It has only gotten safer since then.

Has anyone heard of ZFS data loss when power was lost? Please share such stories.
 
A copy-on-write filesystem like ZFS is always consistent. A sudden power failure is no problem.
caveat: as long as you have battery or super-cap backed SLOGs. anything in a transaction group that hasn't been committed is gone otherwise.

now, ZFS itself won't care about this. your pool/data will be fine. however, your guests ... if data they thought was committed to disk wasn't, lots of bad things can occur.

I know you know this Gea, but to the others in this thread, it's something to consider when building out arrays to solve for different things.
 
Has anyone heard of ZFS data loss when power was lost? Please share such stories.
while not exactly what you're describing ...

In one of my servers I had one of the two system/OS mirror disks shit the bed, which in turn caused the SMF database to corrupt, and one of my filer heads basically died.

was on the phone with tier3 nexenta support and i believe the response was "huh .... that .... how ... hmm that shouldn't have happened".

to which i replied ... yeah no shit.

still uncertain why exactly that occurred, but it did.
 
It could happen.

ZFS sends cache flush commands as needed. However, not all hard drives obey the command. Some will respond with "flush complete" even though they have only written the data to the cache and not to the platters. As far as ZFS is concerned, the data is on the platters (the disk told it so), so it continues on with the next transaction.

I've also heard stories on forums of people claiming they lost their pools because, after a power loss, ZFS told them the pool metadata was corrupted and it could not import the pool.

I'm sure it's possible. But I'm willing to bet ZFS is the least likely filesystem where you will see it happen.
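For what it's worth, newer ZFS versions can attempt to rewind to an older transaction group at import, which is exactly the escape hatch for that scenario:

Code:
# dry run: check whether discarding the last transactions would allow import
zpool import -nF tank

# actually attempt the rewind recovery
zpool import -F tank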
 
I'm really NOT trying to be "simplistic" about this in ANY way, shape or form. But the power failure problems you are discussing - are all of these the result of backup power not working correctly? I have had my share of intermittent workstation/server problems over the past 10+ years, so I understand that caveats are 99% the order of the day. But, I say again, would a UPS absolve you of these issues?
 
no.

UPS solves for loss of current at the 'wall socket', so to speak. in my world, i don't use UPS for anything because I have an entire room full of batteries in front of the (redundant) power switch that allows us to bounce between (redundant 1N1) street and (redundant 2N1) generator.

now, i know of a few people who still use in-rack UPS systems in data centers to protect their SAN/NAS boxes. the reason for this is they want to make 100% sure that in the absolute worst-case scenario the last thing to lose power will be the storage. In the case of ZFS, after all your compute nodes die (in this worst-case scenario) your filers will no longer be receiving data, so there will be no pending write transactions and everything on disk will be assumed good.

only thing to worry about here is drive failures when the power returns.

that is an extremely edge case scenario though and you may as well save the money and script a kill switch instead. if you're going down anyway i mean ...
 
I use some rack UPSes for my servers (ESXi + SAN), but only in cases where I have a twin power supply, where I can connect one to the UPS and plug the other directly into AC.
The reason is that I have had power-downs due to a failed UPS more often than due to failed AC. This may be different if your local power supplier is not as stable; there a UPS + a single PSU can help.

Additionally, a power loss is only one sudden failure cause. There are other reasons for a hanging or blocking OS.

The risk with ZFS:
ZFS caches about 5s of writes in RAM and writes them as a single large sequential write every 5s, while it confirms the write immediately to a client. This can result in undetected data loss with small files or transactions, in contrast to large files, where you expect a data loss and get warned about write problems on a SAN power loss.

You can switch your filesystem to sync write, where every single write is logged to disk in a special ZIL and committed before the next one occurs. This results in an "if it's committed, it's really on disk" guarantee. This is done in addition to the 5s write cache. If you add a very fast dedicated log device (ZIL) that is capable of holding 10s of network traffic, you can have secure, powerloss-safe writes and the performance boost of the "convert many small random writes to a single large sequential write via the ZFS RAM write cache".
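(The 5s interval is the txg timeout and is tunable, e.g. via /etc/system on an illumos-based system, though the default is fine for most workloads:)

Code:
# /etc/system on illumos/OmniOS - txg sync interval in seconds (default 5)
set zfs:zfs_txg_timeout = 5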
 
The risk with ZFS:
ZFS caches about 5s of writes in RAM and writes them as a single large sequential write every 5s, while it confirms the write immediately to a client.

just FYI, and i presume you're aware of this gea but if not and for everyone else ...

this transactional behavior of ZFS has recently been changed, i believe by the guys at Delphix. this change has been added to OpenZFS and should be part of OmniOS now, and I know for a fact it is part of the soon-to-be-released Nexenta 4.0.x branch.

could swear i had a link that describes this ... let me see if i can find it. ah, yes, here it is

http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/
 
How does one do that math, roughly?
exactly? fuck if i know. I think i used someone else's math i found somewhere and then just WAGed it to fit my setup.

my method, not scientific at all but works :).

basically, think about the worst-case scenario. if all your drives are 80% full, how long does the rebuild take? answer (rhetorical): it doesn't matter; however, if your drives are two, three, or up to 6 times larger, then the rebuild will in turn take that much longer.

will you feel comfortable during that rebuild time? you have to answer that for yourself. me, i don't do Z2 vdevs on drives larger than 1TB anymore. I don't do 2-drive mirrors on drives larger than 1TB either. the 80+% rebuild times on these are a real sphincter check.
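a rough back-of-envelope version of that math, assuming ~100MB/s sustained per disk (optimistic under load):

Code:
data to resilver  = drive size x fill level = 2TB x 0.80 = 1.6TB
best-case rebuild = 1.6TB / 100MB/s ~= 16,000s ~= 4.5 hours
real-world load easily doubles or triples that; a 4TB drive doubles it again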
Is that an all-SAS topology? No SATA drives?
yes. it's like a $10 difference between enterprise SATA and enterprise SAS when buying in bulk.

i will get back to some specific focused testing on various SATA drives. the driver here is SSDs ... there are a lot more options with SATA, and the price delta is attractive enough to spend the time working out any issues.

for now, i'ma stick with what i know works and that is SAS.

actually, if i were building single-node filers i would save the money. currently all my stuff is HA, and for that the native dual-path SAS is superior in every way.
 
will you feel comfortable during that rebuild time? you have to answer that for yourself. me, i don't do Z2 vdevs on drives larger than 1TB anymore. I don't do 2-drive mirrors on drives larger than 1TB either. the 80+% rebuild times on these are a real sphincter check.
That's how I've been looking at it - test it, then decide if that rebuild time fits with my requirements. Was just wondering if there was some math I could use to see if the rebuild time I find is sane.

actually, if i were building single-node filers i would save the money. currently all my stuff is HA, and for that the native dual-path SAS is superior in every way.
Another quick question - when you do dual-path for redundancy, do you get the extra bandwidth (if your disks can actually max out the available bandwidth, obviously)?

Thanks for taking the time to reply.
 
no, well, for everything i personally have tested, no. to my knowledge there aren't any drives (you can buy) that can use both ports/paths to send/receive data at the same time. i think in talks i had in the past with STEC, they had some firmware and had done testing to do just this, but i don't think it ever went public.

i could be wrong; i haven't tested every SAS drive.

i use SAS so both filers can talk to each drive via different paths. zfs/nexenta HA (rsf1) needs either SATA with interposers or native SAS to function well.

or ... say you took two standalone 45-drive servers, filled them with SATA drives, and connected each of these backend systems to front-end filers via InfiniBand or say 16Gbps fibre channel. from here, export a single zvol from node 1 and node 2 to the front-end servers. use these zvols to create a pool.

now ... here you can in theory get away with striping on the front-end servers, however some failure testing needs to occur. in theory this will work though, as your backend systems already have the raid setup right .... at any rate, worst case you could just mirror these zvols.
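rough sketch of the zvol part, names made up obviously:

Code:
# on each backend box: carve a zvol out of the local pool
zfs create -V 20T backend/export1
# present it to the front end over FC/iSCSI/SRP, then on the front-end filer:
zpool create frontpool mirror lun_from_node1 lun_from_node2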

now your front-end servers can be configured for HA using these backend zvols, which are in turn backed by SATA drives. this works around the SATA HA limitations but has a number of design problems as you scale.

let's say you had 4 of those backend systems, each with ... 32-64GB of RAM. lay out the vdevs as 4 x 11-disk rZ3 vdevs, leaving one slot open for a hotspare. the front-end boxes have 128GB RAM, 1 small JBOD for SLOG and/or L2ARC drives, an InfiniBand or FC HBA (or two), 1 dual-port HBA for the cache JBOD, and 2-4 ports of 10gig ethernet or FC or whatever you require to serve the data.

you're likely going to benefit a great deal from the cascading ARC on reads. none of those backend boxes are fast from a random-IO standpoint, but they kick much ass moving sequential data.

on the front end do probably 4-8 mirrored SLOG vdevs.

sry, i digressed a bit there. I've been wanting to test the above setup for a while now, but no SSD vendor wants to give me enough drives to properly test it ... bastards want to get paid :(.
 