ZFS: RAID-Z3 (raidz3) Recommended Drive Configuration

RavenShadow · Jul 8, 2011

sub.mesa has posted several times on his theories of record size vs stripe size and their relationship to performance of ZFS RAID pools. I'm wondering if anyone has taken his ideas into account when designing their RAID-Z3 configurations:

http://hardforum.com/showthread.php?p=1036154326&highlight=128+KiB#post1036154326

So which combinations would be good with 4K drives?
3-disk RAID-Z = 128 / 2 = 64KiB
4-disk RAID-Z2 = 128 / 2 = 64KiB
5-disk RAID-Z = 128 / 4 = 32KiB
6-disk RAID-Z2 = 128 / 4 = 32KiB
9-disk RAID-Z = 128 / 8 = 16KiB
10-disk RAID-Z2 = 128 / 8 = 16KiB

http://hardforum.com/showpost.php?p=1036395973&postcount=63

The problem I see is when the 128KiB gets spread over too many disks:
128KiB for 10-disk RAID-Z2 = 128KiB / 8 = 16KiB. Ideally you want drives to be handling chunks of 32KiB - 128KiB for optimal performance.

With the above stated, what are the ideal sizes for a RAID-Z3 pool?

Code:

128KiB / (nr_of_drives - parity_drives) = maximum (default) variable stripe size

4-disk RAID-Z3 = 128 / 1 = 128
5-disk RAID-Z3 = 128 / 2 = 64
7-disk RAID-Z3 = 128 / 4 = 32
11-disk RAID-Z3 = 128 / 8 = 16

4- and 5-disk RAID-Z3 pools seem small. 7 seems borderline (not much different from just going with RAID 1) as far as usable space vs. number of drives goes. However, an 11-disk RAID-Z3 has 16KiB writes, which is smaller than the 32KiB - 128KiB ideal indicated by sub.mesa.

The Solaris Internals Wiki recommends that RAIDZ3 configurations should start with at least 8 drives (5+3): http://www.solarisinternals.com/wik...onfiguration_Requirements_and_Recommendations

What have others done with their RAID-Z3 pools? What are recommended configurations and what type of performance impact would there be if I were to just go with an 8 drive RAID-Z3:
8-disk RAID-Z3 = 128 / 5 = 25.6
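
The arithmetic above can be checked with a short Python sketch. It assumes the default 128KiB recordsize and the simple model used in this thread (one record divided evenly across the data disks); the function name `chunk_kib` is made up for illustration and is not a ZFS API.

```python
RECORD_KIB = 128  # default ZFS recordsize assumed throughout this thread


def chunk_kib(total_disks, parity_disks, record_kib=RECORD_KIB):
    """KiB of one full record that lands on each data disk (simple model)."""
    data_disks = total_disks - parity_disks
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    return record_kib / data_disks


# RAID-Z3 drive counts discussed in the post (3 parity disks each)
for n in (4, 5, 7, 8, 11):
    print(f"{n}-disk RAID-Z3: {RECORD_KIB} / {n - 3} = {chunk_kib(n, 3):g} KiB")
```

Only the 8-disk case gives a per-disk share that is not a whole number of KiB, which is the reason for the question.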

danswartz · Jul 8, 2011

Do you really need raidz3? Especially with that few drives? If you are willing to lose almost half your space to overhead, you'd be better off going with a 4x2 raid10.

danswartz · Jul 8, 2011

I had someone PM me pointing out that raid10 is not 100% safe from two drives failing. This is true, but there is a non-obvious point here: the more vdevs are striped together, the lower the probability of a fatal second error. This makes sense, if you think about it; once a drive has failed, the only subsequent failure that is fatal is the other drive in that mirror. So, for an 8-drive raid10 pool, the odds of the 2nd failed drive being in the same mirror are only 1/7, about 14%.
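
As a rough illustration of the 1/7 figure, here is a minimal Python sketch. It assumes 2-way mirrors and that the second failure is equally likely to hit any of the surviving drives; `same_mirror_odds` is a hypothetical helper name.

```python
from fractions import Fraction


def same_mirror_odds(total_drives):
    """Chance that a second, randomly chosen failed drive is the partner
    of the first failed drive in a pool of 2-way mirrors."""
    if total_drives < 2 or total_drives % 2:
        raise ValueError("expected an even number of drives (2-way mirrors)")
    # One fatal candidate (the partner) among the surviving drives.
    return Fraction(1, total_drives - 1)


for n in (4, 8, 16):
    odds = same_mirror_odds(n)
    print(f"{n} drives: {odds} = {float(odds):.1%}")
```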

RavenShadow · Jul 8, 2011

Thanks danswartz.

The personal problem I have with RAID10 is that I can't lose any three drives and still have my data be intact. I'm a little paranoid and would really like to have the ability to lose any three drives.

Also, as far as the probability of a dual failure in the same mirror goes, I agree that it is 14% in an 8-drive pool; however, I also think that a second drive failure is more likely to happen during the rebuilding of a mirror.

I was actually thinking of going with 11 drives in RAIDZ-3 once I get a RAID card that works with Solaris. However, reading sub.mesa's post indicating that 16KiB writes are smaller than his ideal of 32KiB-128KiB I felt hesitant taking that course without making sure the performance difference is negligible/seeing what other people have done.

In your opinion when does RAIDZ-3 really make sense?

danswartz · Jul 8, 2011

Really makes no sense unless you are talking about mission critical stuff.

ripken204 · Jul 8, 2011

danswartz said:
Really makes no sense unless you are talking about mission critical stuff.

and mission critical stuff will be backed up on another server anyways
so until we get really large drives, raid-z3 is unnecessary

john4200 · Jul 8, 2011

danswartz said:
I had someone PM me pointing out that raid10 is not 100% safe from two drives failing. This is true, but there is a non-obvious point here: the more vdevs are striped together, the lower the probability of a fatal second error. This makes sense, if you think about it; once a drive has failed, the only subsequent failure that is fatal is the other drive in that mirror. So, for an 8-drive raid10 pool, the odds of the 2nd failed drive being in the same mirror are only 1/7, about 14%.

RAID-Z3 is better than RAID 10 in most cases that do not require high performance for small, random writes. Certainly for a media server, the performance of RAID-10 is not required.

First, with 10 drives of capacity C, you get only 5C available capacity with RAID 10. You get 7C with RAID-Z3. That is 40% more space.

Then there is the probability of data loss.

Assume that the probability of a single drive failing during a rebuild (i.e., after you have already lost one drive) is F. Note that F is probably higher than the pro-rated annual failure rate over the rebuild time, since all of the data on at least one of the drives must be read during the rebuild, which creates a greater stress on the drive(s) than is seen by the average drive in AFR studies.

So, the probability of data loss with RAID 10, after one drive has already failed, is F.

After a single drive failure with RAID-Z3, you must lose at least three more drives for data loss to occur. To compute this, we need the general formula for probability of failure of exactly X drives out of K drives. It is

C[K, X] F^X (1 - F)^(K - X)

where C[K, X] is the combination formula, the number of ways of choosing X out of K:

C[K, X] = K! / ( X! (K - X)! )

The probability of losing at least 3 drives out of an N-drive RAID-Z3 that has had a single drive failure is (1 - P3), where P3 is the probability of losing 0, 1 or 2 drives.

P3 = (1 - F)^(N - 1) + (N - 1) F (1 - F)^(N - 2) + (N - 1)(N - 2) F^2 (1 - F)^(N-3) / 2

As long as we have that formula, it is easy to compute the corresponding P for RAID-Z1 and RAID-Z2:

P2 = (1 - F)^(N - 1) + (N - 1) F (1 - F)^(N - 2)

P1 = (1 - F)^(N - 1)

Let's take an example. We will assume F = 5%. A 16-drive RAID-10 will have a capacity of 8C, the same as an 11-drive RAID-Z3, a 10-drive RAID-Z2, or a 9-drive RAID-Z1. So, for a capacity of 8C, the probability of data loss during a rebuild after a single drive failure is:

RAID-10:
F = 5%

RAID-Z1:
1 - (1 - F)^(9 - 1) = 33.7%

RAID-Z2:
1 - (1 - F)^(10 - 1) - (10 - 1) F (1 - F)^(10 - 2) = 7.1%

RAID-Z3:
1 - (1 - F)^(11 - 1) - (11 - 1) F (1 - F)^(11 - 2) - (11 - 1)(11 - 2) F^2 (1 - F)^(11 - 3) / 2
= 1.15%

So, for a capacity of 8C, you need only 11 drives with RAID-Z3, as compared to 16 drives with RAID-10, and your probability of data loss during a rebuild after a single drive failure is only about 1.15% with RAID-Z3, as compared to 5% with RAID-10.

If F were only 2%, then the corresponding data loss probabilities for RAID-10, -Z1, -Z2, -Z3 are: 2%, 14.9%, 1.3%, and 0.086%. So that is less than 0.09% for RAID-Z3, as compared to 2% for RAID-10: a 23 times higher chance of data loss with RAID-10 as compared to RAID-Z3!

Bottom line is that RAID-Z3 is greatly superior to RAID-10, except in rare applications where the higher performance of RAID-10 is required.
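
The figures above can be reproduced with a short Python sketch of the same binomial model. The assumptions are those stated in the post: one drive has already failed, and each remaining drive fails independently with probability F during the rebuild. The function names are illustrative only.

```python
from math import comb


def loss_probability(total_drives, parity, f):
    """Probability of data loss during a rebuild, given that one drive has
    already failed and each remaining drive fails independently with
    probability f."""
    remaining = total_drives - 1
    # The array survives if at most (parity - 1) further drives fail.
    survive = sum(
        comb(remaining, x) * f**x * (1 - f) ** (remaining - x)
        for x in range(parity)
    )
    return 1 - survive


def report(f):
    print(f"F = {f:.0%}")
    # RAID-10 with 2-way mirrors: only the failed drive's partner matters.
    print(f"  RAID-10 (16 drives): {f:.3%}")
    for name, drives, parity in (
        ("RAID-Z1", 9, 1),
        ("RAID-Z2", 10, 2),
        ("RAID-Z3", 11, 3),
    ):
        print(f"  {name} ({drives} drives): {loss_probability(drives, parity, f):.3%}")


report(0.05)
report(0.02)
```

All four layouts in the report have the same 8C usable capacity, so the comparison is like for like.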

danswartz · Jul 8, 2011

Raid10 will outperform raidz3 for random reads too. I opted to go that route given that my raid pool is serving NFS to an ESXi hypervisor with a half-dozen VMs. A nit about rebuild time: with ZFS, a resilver does not read all data on the drives, just what is in use. Unless you are close to maxed out (which is not recommended for a number of reasons), you will likely have to read a lot less than a full drive.

john4200 · Jul 8, 2011

danswartz said:
A nit about rebuild time: with ZFS, a resilver does not read all data on the drives, just what is in use. Unless you are close to maxed out (which is not recommended for a number of reasons), you will likely have to read a lot less than a full drive.

I did not claim that all sectors of the drive must be read, and besides, that is irrelevant to a contrast of RAID-10 and RAID-Z3, since it applies to both.

Unless you meant to claim that my estimate of F is too high, because less than a full drive would be read? The lower F is, the GREATER the advantage of RAID-Z3 over RAID-10 for minimizing chance of data loss. If F=1%, then the probability of data loss with RAID-Z3 is only 0.011%. That is an 87 times higher chance of data loss with RAID-10 as compared to RAID-Z3!

danswartz · Jul 8, 2011

Sorry, I misread your post. You said "all of the data on at least one of the drives". Data != Blocks. Not arguing with your math, but IMO, the number of times someone really needs raidz3 is fairly small. According to the ZFS best practices guide, once you go past 8-9 drives, you should start concatenating vdevs, so if you had 16 total drives, create two raidz3 vdevs of 5+3 and concatenate them.

danswartz · Jul 8, 2011

BTW, thanks for crunching the numbers and posting them

john4200 · Jul 8, 2011

danswartz said:
Sorry, I misread your post. You said "all of the data on at least one of the drives". Data != Blocks. Not arguing with your math, but IMO, the number of times someone really needs raidz3 is fairly small. According to the ZFS best practices guide, once you go past 8-9 drives, you should start concatenating vdevs, so if you had 16 total drives, create two raidz3 vdevs of 5+3 and concatenate them.

You keep stating or implying that RAID-Z3 should only be used if someone "really needs" it. Which implies that there is some hidden cost to RAID-Z3, that outweighs the benefits I have already listed. What is this hidden cost that you are assuming?

Also, I have never run any benchmarks on ZFS RAIDs. I am more familiar with linux mdadm software RAID and with hardware RAID.

So, I am wondering why you seem to be assuming that ZFS performs so poorly on small random reads with RAID-Z3 (or -Z2 or -Z1).

On other RAIDs, if you have small, random reads (where small means significantly smaller than the chunk size), then distributed-parity striped-RAID is not significantly slower than a comparable capacity RAID-10. This is because each individual random read is likely to be completely contained on a single drive, and the RAID controller (or software logic) simply distributes the reads among the drives, reading in parallel, thus obtaining a throughput close to that of RAID-10. Is ZFS unable to do this?

Obviously, such a technique does not work as well for small, random writes since the parity must be written, and the parity depends on all the drives in the stripe. That is where RAID-10, which has no parity, has a significant performance advantage over distributed-parity striped-RAIDs.

Also, my guess is that the OP is not building a database server. More likely a media server. So, small, random IO is less significant than large sequential IO. If we assume we are limited to, say 16 drives total in any case, then the performance of the RAID-10 for large sequential IO will be similar to an 8-drive RAID-0. And a 16-drive RAID-Z3 will have large sequential performance similar to a 13-drive RAID-0. So the RAID-Z3 should be faster than the RAID-10 for large sequential reads and writes. Unless ZFS is just very inefficient. Certainly a 16-drive RAID-6 would be faster than a 16-drive RAID-10 for large sequential IO on decent RAID systems.

_Gea · Jul 8, 2011

I would say it's a question of optimization goals.

If I need performance, I prefer a pool built from 2- or 3-way mirrors (like ESXi datastores).
My backup pools and my ordinary (0-8-15) SMB filer pools are all built from a single RAID-Z3 vdev
with 14 disks + hot spare (I have 16 slots on most of my systems and always keep one slot free;
the data are critical project data).

Below, say, 8-10 total disks I would prefer RAID-Z2 for efficiency.
I would always avoid RAID-Z1 with any critical data.

danswartz · Jul 8, 2011

It isn't that it performs poorly, just not as well as raid10. For any raidz*, every drive needs to be read to generate the data block. With raid10, concurrent reads can be alternated between the halves of mirrors. And the cost is not hidden, it's dedicating a drive to overhead instead of real data. Some people with budget constraints care about that. You did hit on one issue: large sequential I/O will perform somewhat better with raidz* than raid10 because more spindles are participating. As far as your question "is ZFS unable to do this", it isn't a matter of unable, it's how it was designed. All drives, AFAIK, participate in a stripe write/read. The issue of small, random reads where raid10 wins is in a database model (or other situation where multiple clients are requesting data); in that scenario, two reads can be satisfied at the same time by reading from both sides of the mirrors. It all depends on your job load. Keep in mind, I'm not an expert on this - if you pull down and read the zfs best practices doc (and similar ones), these tradeoffs are all discussed at length.

danswartz · Jul 8, 2011

This is a good thread:

http://opensolaris.org/jive/message.jspa?messageID=81762

john4200 · Jul 8, 2011

danswartz said:
For any raidz*, every drive needs to be read to generate the data block.

Really? That means ZFS has terrible performance on small random reads for distributed-parity striped-RAID, as compared to decent RAID systems like linux mdadm software RAID, or any Areca hardware RAID.

Why is ZFS performing so poorly in this situation? Is it because it has to verify the checksum on the data that was read, and it goes and reads the checksums from multiple drives? I would think that could be cached to speed up the process. ZFS is not actually verifying the parity data on every read, is it? I thought that is what the checksums were for.

danswartz · Jul 8, 2011

Keep in mind I'm not an expert here. That said, my understanding is that ZFS was designed for reliability - the end-to-end checksums guarantee you will never get corrupt data due to a drive failing to detect a URE. There are all kinds of tradeoffs that can be made. For example: the small random read issue is not going to matter to some job mixes, and will to others. For those, drives are considered to be fairly cheap, and raid10 wins there. Even if you want to do raidz*, if you have a large number of drives, the better way to do it is not, say, a 16-drive raidz3, but two 8-drive raidz3 vdevs striped together. That mitigates the issue you raised. Yes, I know you lose some usable storage that way - tradeoffs...

danswartz · Jul 8, 2011

One other thing to consider (I forgot to mention it before). With ZFS you don't have small, random writes. ZFS is a transactional, copy on write filesystem, so the FS will buffer up writes to various places, and then stream them out to consecutive locations on the disk. This is one reason, AFAIK, that ZFS is considered to be memory hungry.

_Gea · Jul 8, 2011

Mostly it's like jumping out of a plane.
The fastest way is to jump out naked.

I prefer to have at least two parachutes with me and pay the price of a longer journey.
It's the same with ZFS. It's not the fastest at all; it's currently the most secure on earth.

And despite its features, it's mostly comparably fast.

That's what I have learned in 30 years of IT:
You need 30% more performance to say "Yes, I can feel it". Otherwise forget the difference.
Unless you really need the last 2%, there's no need to be the fastest.
With ZFS features there's no chance to be the fastest, and it does not matter at all.

It's like a Rolls-Royce ad: enough

RavenShadow · Jul 8, 2011

john4200 said:
...
Also, my guess is that the OP is not building a database server. More likely a media server. ...

john4200 is right; I was planning on having an 11 drive RAIDZ-3 setup as my repository/streaming server of HD home videos. Probably no more than two clients ever accessing it at a time.

I was all set to go with an 11 drive RAIDZ-3 until I read sub.mesa's opinion stating that 16KiB writes (128KiB / 8) were smaller than his recommended 32KiB - 128KiB writes. Reading that made me curious to see what other people have done with their RAIDZ-3 implementations since the SolarisInternals wiki recommends 8 or more drives for RAIDZ-3 and to have a write size ≥ 32KiB would require a 4, 5, or 7 drive RAIDZ-3 setup which seems too small to me.

What type of performance hit would I be experiencing if I went with an 8 drive RAIDZ-3 over an 11 drive RAIDZ-3? Should I plan on increasing the size of my RAIDZ-3 past 11 drives?

Thanks for the discussion of going with RAID-10, I hadn't really given it too much thought because I felt the probability of a second drive failure during the rebuilding of a mirror was too great of a chance to take.

danswartz · Jul 8, 2011

It's hard to really say. One possibility: with 13 drives, make two 6-disk RAIDZ2 vdevs with a hot spare in the pool. You can lose any two drives from either vdev and still be okay. If you don't have the odd drive, skip the hot spare. You then stripe them. Like this: zpool create tank raidz2 a1 a2 a3 a4 a5 a6 raidz2 b1 b2 b3 b4 b5 b6. This will improve your sequential I/O, although honestly, with 11 drives, I'd not worry about the performance sub.mesa was talking about. It isn't a significant hit, AFAIK, certainly not for your application.

MasterCATZ · Apr 7, 2013

RavenShadow said:
The Solaris Internals Wiki recommends that RAIDZ3 configurations should start with at least 8 drives (5+3): http://www.solarisinternals.com/wik...onfiguration_Requirements_and_Recommendations

What have others done with their RAID-Z3 pools? What are recommended configurations and what type of performance impact would there be if I were to just go with an 8 drive RAID-Z3:
8-disk RAID-Z3 = 128 / 5 = 25.6

It does not say that. It says:

http://webcache.googleusercontent.c...s.com/wiki/index.php/ZFS_Best_Practices_Guide

RAIDZ Configuration Requirements and Recommendations

A RAIDZ configuration with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised.

Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 6 disks (4+2)
Start a triple-parity RAIDZ (raidz3) configuration at 9 disks (6+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 6
The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.

but it's really bugging me that there is no good combination to use with 16 disks using
2x 8-port controllers in a 16-bay rack case

the closest I can find is raidz1 3 x 5 disks with 1x spare
(was using raidz1 4x4 and suffered major performance issues during resilvering)

but having 2 drives fail within a week of each other is making me want raidz2/3

my next case will be 24 drives and I will do raidz2 4 x 6 disks
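
For comparing such layouts, here is a small Python sketch using the guide's approximation that each raidz vdev holds about (N-P) disks of data. The layouts are ones mentioned in this thread; `usable_disks` is a made-up helper, and the numbers ignore ZFS metadata and allocation overhead.

```python
def usable_disks(vdevs, disks_per_vdev, parity):
    """Approximate usable capacity in units of one disk: (N - P) per vdev."""
    return vdevs * (disks_per_vdev - parity)


layouts = [
    # (description, vdevs, disks per vdev, parity, hot spares)
    ("4 x 4-disk raidz1", 4, 4, 1, 0),
    ("3 x 5-disk raidz1 + 1 spare", 3, 5, 1, 1),
    ("2 x 8-disk raidz2", 2, 8, 2, 0),
    ("1 x 16-disk raidz3", 1, 16, 3, 0),
    ("4 x 6-disk raidz2 (24 bays)", 4, 6, 2, 0),
]

for name, vdevs, per_vdev, parity, spares in layouts:
    slots = vdevs * per_vdev + spares
    usable = usable_disks(vdevs, per_vdev, parity)
    print(f"{name}: {slots} slots, ~{usable} disks usable, "
          f"survives {parity} failure(s) per vdev")
```

In a 16-bay case the first three layouts all come out at about 12 disks of usable space, so the choice is mostly about how many failures each vdev can absorb.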

staticlag · Apr 7, 2013

MasterCATZ said:
It does not say that. It says:

http://webcache.googleusercontent.c...s.com/wiki/index.php/ZFS_Best_Practices_Guide

RAIDZ Configuration Requirements and Recommendations

A RAIDZ configuration with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised.

Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 6 disks (4+2)
Start a triple-parity RAIDZ (raidz3) configuration at 9 disks (6+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equals 2, 4, or 6
The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.

but it's really bugging me that there is no good combination to use with 16 disks using
2x 8-port controllers in a 16-bay rack case

the closest I can find is raidz1 3 x 5 disks with 1x spare
(was using raidz1 4x4 and suffered major performance issues during resilvering)

but having 2 drives fail within a week of each other is making me want raidz2/3

my next case will be 24 drives and I will do raidz2 4 x 6 disks

All that stuff written is just advice. You gotta personally choose the best structure for your needs. There's no one silver bullet that can optimize space, energy usage, integrity and overall cost.

For instance, If I was going to do a movie server I might do just one large RAIDz3 to maximize space.

However if I was going to serve important data that would see high usage I might do striped mirrors.

Or if I wanted to compromise (available space vs integrity) I would do 2 - 8 disk pools of RAIDz2.

Zarathustra[H] · Apr 30, 2013

My take is that worrying about drive performance on a home ZFS NAS server is a silly thing to do.

Unless you go with 10gig ethernet, or start trunking gigabit ethernet like crazy, your ethernet is likely going to wind up being your limiting factor, even in very simple few drive setups with slow drives.

My FreeNAS VM is a single RAIDZ2 array with 6 x 2TB WD Greens. Local benchmarks measure ~480MB/s.

Gigabit ethernet - while theoretically 125MB/s, really only gives you some 85MB/s over SMB (a little more with NFS).

I'd be more concerned about having adequate RAM and a powerful enough CPU to handle the 3x parity calculations in real time. One of these two is more likely to be your bottleneck (if the ethernet isn't).
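
To put the network argument in numbers, here is a back-of-the-envelope Python sketch. It only converts line rate to decimal MB/s (8 bits per byte, no protocol overhead) and compares it with the benchmark figures quoted in this post; the constant and function names are illustrative.

```python
def line_rate_mb_s(gigabits_per_second):
    """Raw line rate in decimal megabytes per second (8 bits per byte)."""
    return gigabits_per_second * 1000 / 8


LOCAL_POOL_MB_S = 480   # local benchmark figure quoted in the post
SMB_OBSERVED_MB_S = 85  # SMB throughput over gigabit quoted in the post

for gbit in (1, 10):
    wire = line_rate_mb_s(gbit)
    bottleneck = "network" if wire < LOCAL_POOL_MB_S else "pool"
    print(f"{gbit} Gbit/s = {wire:.0f} MB/s raw; "
          f"bottleneck against a {LOCAL_POOL_MB_S} MB/s pool: {bottleneck}")

print(f"SMB efficiency on gigabit: {SMB_OBSERVED_MB_S / line_rate_mb_s(1):.0%}")
```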

Aesma · Apr 30, 2013

My plan with a 16 drives enclosure is a single RAIDZ3, then later more enclosures, each in RAIDZ3. Doing some tests with just 5 drives I realized the performance is enough for my needs, I hadn't understood before that the "performance of one drive per RAIDZ vdev" is only true for IOPS/random ops, for sequential reads/writes it's way faster, and I don't need great random performance. Rather than having a pool consisting of many vdevs which can only sustain 2 drives failures, I'd rather save space and sustain 3 drives failures.

garg_art2002 · Feb 24, 2014

Folks: I am not very experienced with Linux and ZFS. I was tired of my software raid6 on six drives of 3TB taking days to resync after power failure longer than my UPS could handle. In most cases, the power would fail while in resync itself.

The storage I am building is for home media content and will not be heavily used. Worst use case is DLNA for one movie across the hall on wifi.

I created a RAIDZ3 because I just wanted to be sure for protections and storage capacity was not an issue. I have created RAIDZ3 on six drives.

All the notes above are suggesting that for RAIDZ3, I should either have 5 drives or 7.

What is the tradeoff when RAIDZ3 is run on six drives?

Help appreciated.

brutalizer · Feb 24, 2014

You lose a lot of space. You only have three disks' worth of storage, as raidz3 eats three disks for parity.

Zarathustra[H] · Feb 24, 2014

garg_art2002 said:
What is the tradeoff when RAIDZ3 is run on six drives?

Help appreciated.

A lot of this is based on an old theory by sub.mesa on these forums regarding how 4k sector size drives that emulate older 512 byte sector size drives would function under ZFS.

I don't even know if this is even an issue anymore. Back four years ago when these discussions started there was a lot of talk of ZFS needing to include fixes to alleviate this problem going forward, but who knows if it ever happened.

The theory is as follows (in layman's terms):

In order to be compatible with older OSes, 4k sector drives (needed for the modern large drive sizes) fool operating systems into thinking they are 512-byte sector size drives.

Whenever an OS makes a request that is not aligned with 4k, the drive is forced into 512byte emulation mode, which significantly slows it down.

Since ZFS uses 128KiB chunks, he came up with the following formula:

128KiB / (number of drives - parity drives).

If the resulting number (in KiB) is divisible by 4, then you shouldn't have a problem. If it is NOT divisible by 4, then - if this has not yet been fixed in ZFS - you might.

If we follow these calculations, optimal configurations would be as follows:

RaidZ: 3, 5 or 9 drives. (17 drives also winds up being divisible by 4, but this is above 12 recommended as max by ZFS documentation)

RaidZ2: 4, 6 or 10 drives. (and 18, which is above 12, as above and not recommended)

Raidz3: 7 and 11 drives (and 19 which is above 12, as above and not recommended)
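
The rule can be enumerated with a few lines of Python. This is only a sketch of the thread's simple model (128KiB divided by the number of data disks must be a multiple of 4KiB); it assumes at least two data disks per vdev and stops at 19 drives, and the function name is made up. It also reports a 5-drive RAID-Z3, which satisfies the same rule even though it is left out of the list above.

```python
RECORD_BYTES = 128 * 1024
SECTOR_BYTES = 4096
MAX_DRIVES = 19  # upper end of the drive counts listed in the post


def aligned_drive_counts(parity, min_data_disks=2):
    """Drive counts whose per-disk share of a 128KiB record is a whole
    multiple of 4KiB under the thread's 128 / data-disks model."""
    counts = []
    for total in range(parity + min_data_disks, MAX_DRIVES + 1):
        data_disks = total - parity
        # Integer test: the record splits into whole 4KiB sectors per data disk.
        if RECORD_BYTES % (data_disks * SECTOR_BYTES) == 0:
            counts.append(total)
    return counts


for parity, name in ((1, "RAID-Z"), (2, "RAID-Z2"), (3, "RAID-Z3")):
    print(name, aligned_drive_counts(parity))
```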

I am not sure if this is still the case, or historical data only. I have tried PMing sub.mesa but have not received a response.

Back in 2010 there was talk about issuing some sort of geom_nop command to the drives forcing them to report their true 4k sector size to the OS and making this problem mostly go away. The problem at the time was that geom_nop was not persistent through reboots.

Personally I have been running RAIDz2 with 6 drives for years. I accidentally chose the right size when I set it up, before I read all of this. I have done a little research into this, but haven't found anything regarding whether or not these problems are still real, or if something has been fixed since 2010.

Essentially, this is a fault that occurs due to drive makers essentially shipping drives with a 4k/512 byte hack, not because of any inherent flaw in ZFS. Maybe, since then, ZFS has included a workaround to force the drives into 4k mode, I am not sure. A lot can happen in 4 years.

rsq · Feb 26, 2014

I have also followed the number of drives thing in the old days, and I decided to test all the configurations.

On my hardware (10 2TB samsung HD204's at that time) I saw some differences with raw dd tests in sequential reads, but nothing that made me worry about it.

Anyway, 11 disks in raidz(1/2/3) will easily saturate a 1Gbps link regardless of the block sizes or number of drives in the stripe. Unless you are on the local machine, or rocking 10Gbps Ethernet or InfiniBand, this absolutely does not matter.

If you worry about performance, there is a much better thing you can do: Go for a good intel server network card. Do not use the onboard NIC, especially when using consumer grade hardware.

Zarathustra[H] · Feb 26, 2014

rsq said:
If you worry about performance, there is a much better thing you can do: Go for a good intel server network card. Do not use the onboard NIC, especially when using consumer grade hardware.

Second this, this is more or less an absolute necessity.

Personally I have a Dual Port Intel PRO/1000 PT NIC in mine, and I have bonded the two ports using 802.3ad link aggregation to improve performance with multiple requests.

garg_art2002 · Feb 27, 2014

Thanks Zarathustra. Appreciate your insights.

Zarathustra[H] said:
A lot of this is based on an old theory by sub.mesa on these forums regarding how 4k sector size drives that emulate older 512 byte sector size drives would function under ZFS

I have a RAIDZ3 under Ubuntu 12.04.4 that appears to be working fine. Here is the output after loading a ton of data from an old server marked for salvage.

Code:

# root@ex58 [11:52:24] ~> zpool history
History for 'tank':
2014-02-15.10:14:11 zpool create -m none -o ashift=12 -f tank raidz3 /dev/disk/by-id/wwn-0x5000c5004e4001fe-part3 /dev/disk/by-id/wwn-0x5000c5004e4804a9-part3 /dev/disk/by-id/wwn-0x5000c5005377d7ea-part3 /dev/disk/by-id/wwn-0x5000c5004e468e41-part3 /dev/disk/by-id/wwn-0x5000c5004e22a08e-part3 /dev/disk/by-id/wwn-0x5000c5004e22f517-part3
2014-02-15.11:51:27 zfs set dedup=off tank
2014-02-15.11:51:54 zfs set atime=off tank
2014-02-15.11:53:41 zfs set compression=on tank
2014-02-15.11:56:59 zpool scrub tank
2014-02-15.12:11:20 zfs create tank/myraidz3
2014-02-15.12:12:10 zfs set mountpoint=/mnt/myraidz3 tank/myraidz3
2014-02-17.17:51:52 zfs set sharesmb=on tank/myraidz3
2014-02-17.17:52:01 zfs set sharenfs=on tank/myraidz3
2014-02-17.18:05:25 zfs set sharesmb=off tank/myraidz3
2014-02-20.10:34:46 zpool scrub tank
2014-02-20.10:38:37 zpool scrub tank
2014-02-20.11:50:05 zpool scrub tank
2014-02-20.17:34:45 zpool clear tank
2014-02-21.08:43:09 zfs set mountpoint=/mnt/tank tank
2014-02-21.09:16:07 zfs create tank/myzfs
2014-02-21.09:16:12 zfs create tank/myzfs/agarg
2014-02-21.09:16:17 zfs create tank/myzfs/rgarg
2014-02-21.09:16:23 zfs create tank/myzfs/tgarg
2014-02-21.09:16:27 zfs create tank/myzfs/media
2014-02-21.09:16:32 zfs create tank/myzfs/software
2014-02-21.09:17:43 zfs set compression=lz4 tank/myzfs
2014-02-21.09:17:51 zfs set compression=lz4 tank/myzfs/media
2014-02-21.09:17:56 zfs set compression=lz4 tank/myzfs/agarg
2014-02-21.09:18:00 zfs set compression=lz4 tank/myzfs/rgarg
2014-02-21.09:18:05 zfs set compression=lz4 tank/myzfs/tgarg
2014-02-21.09:18:14 zfs set compression=lz4 tank/myzfs/software
2014-02-21.09:51:18 zfs set dedup=on tank/myzfs/software
2014-02-21.10:01:03 zfs set dedup=off tank/myzfs/software
2014-02-25.21:15:55 zpool scrub tank

# root@ex58 [11:53:34] ~>

# root@ex58 [11:51:45] ~> zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 8h39m with 0 errors on Wed Feb 26 05:55:22 2014
config:

        NAME                              STATE     READ WRITE CKSUM
        tank                              ONLINE       0     0     0
          raidz3-0                        ONLINE       0     0     0
            wwn-0x5000c5004e4001fe-part3  ONLINE       0     0     0
            wwn-0x5000c5004e4804a9-part3  ONLINE       0     0     0
            wwn-0x5000c5005377d7ea-part3  ONLINE       0     0     0
            wwn-0x5000c5004e468e41-part3  ONLINE       0     0     0
            wwn-0x5000c5004e22a08e-part3  ONLINE       0     0     0
            wwn-0x5000c5004e22f517-part3  ONLINE       0     0     0

errors: No known data errors
# root@ex58 [11:52:11] ~>

# root@ex58 [11:51:26] ~> zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
tank                 6.57T   953G   270K  /mnt/tank
tank/myraidz3        3.46T   953G  3.46T  /mnt/myraidz3
tank/myzfs           3.11T   953G   330K  /mnt/tank/myzfs
tank/myzfs/agarg      330K   953G   330K  /mnt/tank/myzfs/agarg
tank/myzfs/media     3.06T   953G  3.06T  /mnt/tank/myzfs/media
tank/myzfs/rgarg     2.04M   953G  2.04M  /mnt/tank/myzfs/rgarg
tank/myzfs/software  50.1G   953G  50.1G  /mnt/tank/myzfs/software
tank/myzfs/tgarg      255K   953G   255K  /mnt/tank/myzfs/tgarg
# root@ex58 [11:51:36] ~>

Will post more observations after some usage. Thanks again.

Zarathustra[H] · Feb 27, 2014

rsq said:
Anyway, 11 disks in raidz(1/2/3) will easily saturate a 1Gbps link regardless of the block sizes or number of drives in the stripe. Unless you are on the local machine, or rocking 10Gbps Ethernet or InfiniBand, this absolutely does not matter.

Yeah,

I get the impression it matters much more for seek times than for single sequential reads and writes.

So - I think what he is saying is - if you were running VMs off of images stored on the server, or if you had large heavily used databases accessed from it, this would have a large impact.

Even if you had a file server heavily used by many users at the same time, this may be important.

For a home-type server though, I don't think most people would even notice the difference. Though I could be wrong.

Aesma · May 5, 2014

I agree the problem is not performance. However, with RAIDZ3, if you don't respect the rules, you lose a lot of space. With ZFS you already lose some space to overhead, with RAIDZ3 you lose 3 drives' capacity on top of that, and without the right number of drives you lose 5 or 10% more. That was unacceptable for me (storage).
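
A rough way to estimate that extra loss is sketched below in Python. It follows my understanding of how raidz sizes an allocation (data sectors, plus parity sectors for every stripe row, rounded up to a multiple of parity + 1) and assumes 4KiB sectors (ashift=12) and full 128KiB records. The function names are made up, and real pools will differ because of compression, small blocks and metadata.

```python
def raidz_sectors(record_bytes, total_disks, parity, ashift=12):
    """Return (data sectors, allocated sectors) for one record on a raidz
    vdev. Assumed allocation rule, see the lead-in above."""
    sector = 1 << ashift
    data = -(-record_bytes // sector)            # ceil: data sectors
    rows = -(-data // (total_disks - parity))    # ceil: stripe rows
    total = data + parity * rows                 # parity sectors per row
    pad_to = parity + 1
    allocated = -(-total // pad_to) * pad_to     # round up to multiple of parity+1
    return data, allocated


def efficiency(total_disks, parity, record_bytes=128 * 1024):
    data, allocated = raidz_sectors(record_bytes, total_disks, parity)
    return data / allocated


for n in (6, 7, 8, 11):
    ideal = (n - 3) / n
    print(f"{n}-disk raidz3: ideal {ideal:.1%}, "
          f"with 128KiB records {efficiency(n, 3):.1%}")
```

Under these assumptions the 7- and 11-disk layouts reach their ideal ratio, while the 6- and 8-disk layouts fall short of it.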

stormy1 · May 6, 2014

If you're really worried about your data, spend more time worrying about backup than ZFS settings!

The really paranoid use RAID 61 and 2 duplicate servers backed up to 2 more servers in real time across the country, then backed up to 4 sets of tapes, 2 in each location. And yes, I know a company that runs that configuration.

brutalizer · May 6, 2014

stormy1 said:
If you're really worried about your data, spend more time worrying about backup than ZFS settings!

The really paranoid use RAID 61 and 2 duplicate servers backed up to 2 more servers in real time across the country, then backed up to 4 sets of tapes, 2 in each location. And yes, I know a company that runs that configuration.

And they do checksums regularly on every file? What happens if some files get bit rotted and they need to figure out which file is correct and functioning? Do they do manual checksumming of all files? Or automatic? If automatic, which solution do they use?

levak · May 6, 2014

Doesn't ZFS do checksumming? On every read and when running a scrub...

Matej

danswartz · May 6, 2014

brutalizer · May 6, 2014

But does that company use ZFS or storage without checksumming, such as plain hardware raid? I mean, if they are really paranoid about their data, they should checksum every file and validate checksums every week/month or so?

stormy1 · May 6, 2014

brutalizer said:
And they do checksums regularly on every file? What happens if some files get bit rotted and they need to figure out which file is correct and functioning? Do they do manual checksumming of all files? Or automatic? If automatic, which solution do they use?

Not sure, I just set up the local servers then they installed the software on top of it to do the mirroring and their software remotely.
It runs on 2008 server enterprise is all I know.

brutalizer · May 6, 2014

stormy1 said:
Not sure, I just set up the local servers then they installed the software on top of it to do the mirroring and their software remotely.
It runs on 2008 server enterprise is all I know.

That sounds very bad if they are paranoid about their data. I suggest you ask them to read the research on data corruption; here are some papers and studies by CERN, Amazon, NetApp, etc.:
http://en.wikipedia.org/wiki/ZFS#Data_integrity
http://en.wikipedia.org/wiki/Hard_disk_error_rates_and_handling#ERRORRATESHANDLING
http://en.wikipedia.org/wiki/Silent_data_corruption#SILENT
