NAS4Free RED vs Black

Chandler

I know this is yet another battle-of-the-classes thread, but I haven't found an answer to a specific question.

--

I want to have a 16-drive NAS4Free setup (not necessarily all the same drives; I mention 16 drives because I have 5x 2TB drives and 3x 500GB drives in it right now and I am worried about vibrations).

I am using ZFS pools.
RED is better if I was going to use hardware RAID in a small enclosure, as far as reliability goes.
What about using the drives in ZFS? Will the Black ultimately be just as reliable AND perform better?

I would be getting WD2001FASS drives (they are new, on sale) at $109 each
or WD RED WD20EFRX for $99 a drive.

For only ten dollars more I don't mind paying for the longer warranty and better performance - but what about TLER or other RAID nonsense? I know that TLER is only usable for hardware RAID... REDs lack the vibration sensors and are advertised for small systems, not a rack-mounted server. I will definitely shell out $10.00 a drive if the firmware doesn't make a difference in ZFS
 
TLER is 'usable' in anything. TLER is just there. TLER means the drive will not take longer than X seconds to try to recover from a problem, generally 6-8 seconds or so. This is useful even on ZFS. When a drive without TLER goes into one of those 30-45+ second recovery actions, your pool's actual I/O will drop to basically zero -- it is hung, while the drive is stuck.

The system will eventually either time the drive out and kick it from the pool, or I/O will recover once the drive comes back. Often it is the former, and then the problem is that if you don't have your system set up in such a way that you are immediately alerted to this and can log in and re-add the drive to the pool in short order, it will require a full resilver instead of a quick resilver just to get it back in, and while it's out of the pool, your performance and your redundancy are lowered.

(for the record I'm not a big fan of WD, but that's anecdotal, and that aside...) In the case of the two drives you've linked, however, despite the lack of TLER I'd probably go with the 'black' drive, because it rotates at 7200 RPM while the 'red' goes at some variable nonsense, probably in the 5000s. This represents a serious difference in IOPS potential for whatever kind of zpool you end up building. I'd just be sure to set up monitoring for the pool, so I know when a disk gets kicked out.
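For what it's worth, that monitoring can be as basic as a cron job that runs 'zpool status -x' and mails you when anything is unhealthy. Here's a minimal sketch of the idea (Python, assuming the zpool utility is in PATH and a local MTA is listening on localhost; the addresses are placeholders):

Code:
#!/usr/bin/env python3
# Minimal zpool health check, meant to be run from cron every few minutes.
# Assumes the `zpool` utility is in PATH and a local MTA accepts mail.
import subprocess
import smtplib
from email.message import EmailMessage

ALERT_FROM = "nas@example.com"   # placeholder sender
ALERT_TO = "you@example.com"     # placeholder recipient

def pool_status():
    # `zpool status -x` prints "all pools are healthy" when nothing is wrong,
    # otherwise it prints the status of any degraded/faulted pools.
    out = subprocess.run(["zpool", "status", "-x"], capture_output=True, text=True)
    return out.stdout.strip()

def send_alert(body):
    msg = EmailMessage()
    msg["Subject"] = "ZFS pool problem"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    status = pool_status()
    if status and "all pools are healthy" not in status:
        send_alert(status)

Run that every five minutes from cron and you'll at least hear about a kicked disk quickly, which is the whole point above.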
 


Thanks for the input -
So TLER is important to look at as well...
If RE4 drives were cheaper I would get them.

I am using the bulk of the storage for media and as a backup server, however I would like to be able to run a few VMs on a pool as well, so I/O does matter to me a little bit - I run my VMs off of four SSDs at the moment and would like some data drives for storage in the VMs.

I think either will ultimately be fine for it, but I don't want to be streaming a 1080p movie to my media center while an SQL Server is at high load and have the whole thing bog down because of my drive choice.

The REDs spin at 5.4k - and there is not much difference in sequential read/write speeds.
What hard drive manufacturer would you recommend for this if you do not like WD? I only look at WD because I have always used their Black drives in my builds.

I am wary of Seagate and Hitachi because of past reliability issues I have had with the drives. I definitely want the best bang for buck though.
 
I come from an 'enterprise'-ish world. So, generally, I recommend Seagate or Hitachi. :) -- My time on this forum seems to indicate a significant feeling of distrust for Seagate, which is very interesting to me, and IMHO must represent some difference between their retail and enterprise edition versions. The enterprise edition versions I deal with, by the 1000's, definitely do not have a bad track record (whereas by comparison, the enterprise WD drives do).

Honestly go with what you trust - there's enough evidence to suggest every brand makes a drive that /might/ work for years, and every brand makes a drive that might not. Luck of the draw, perhaps. Focusing on functionality, yes, TLER matters -- but at the same time, especially in a home environment (even with a few VM's), I'd rather have 7200 RPM spin speed w/o TLER than a 5400 RPM with TLER. Just my $0.02.
 
My own personal experience with the RED drives is that they didn't like being in ZFS, but I'm not sure if I just had a bad batch (6 drives) or if it's general.
 
I've read around a lot and seen lots of people using red drives without issue. So I'm planning to use them in my build (20 drives) as well.

Other than them I'd be looking at Seagate NAS drives or going up to the enterprise drives. Previously I would have gone for Hitachi or Samsung, but neither of them are really available anymore.
 
Origin_Uknown - What kind of issues did you have with ZFS and the RED Drives? If anything it sounds like they would be more suitable for ZFS because of the TLER capability...

I currently have 2x WD2002FAEX, 1x WD2001FASS (same drive but a SATA 2 controller, NBD IMO) and 2x WD20EFRX in my NAS4Free setup, all in one big pool, and have not had any issues so far.

Since I purchased two Red drives, and had drives RMA'd and got the new FAEXs, I am thinking about purchasing one more FAEX or FASS - whichever I find cheaper, as long as it is new - then purchasing 2 more REDs and creating two separate ZFS RAIDZ1 pools - one Red and one Black. Ideally I would like to have a "RAID 10" type setup to get more performance, but I think it will cost me too much for my use right now. Maybe I'll get a total of 6 Black drives and use RAIDZ2 with an SSD for a cache and throw my extra datastore space on that array, while using the Red drives for my home media.

2x Red = $200.00 USD
2x Black = $300.00 USD

-- I hear RAIDZ1 performs better with 5 drives versus 4 though, and that with RAIDZ2 you should use 6 drives... Can anyone chime in on that?
 
-- I hear RAIDZ1 performs better with 5 drives versus 4 though, and that with RAIDZ2 you should use 6 drives... Can anyone chime in on that?

I made the following post on another forum:



The performance impact is small and seems to affect reads only. Overall performance would probably be slightly higher since there are more disks...
http://hardforum.com/showpost.php?p=1036395804&postcount=61

other sources:
https://blogs.oracle.com/roch/entry/when_to_and_not_to


The larger issue is wasted space. Over 1 TB for the RAIDZ2 @ 8 Drives.
Read here:
http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html
6: 0.2602885812520981
7: 1.1858475767076015
8: 1.149622268974781

You'll "waste" 1.149 - 0.26 TB of space (~1TB) by choosing 8 disks over 6. This is with 3TB disks.

The above #'s are with ashift = 12 (4k sectors)
with 512b sectors (ashift = 9) the wasted space with 6 drives is almost exactly the same (no penalty for 4k drives) but there is a penalty at 10 drives.

The most interesting comparison comes at 17-18 disks:
17: 3.5806130981072783
18: 0.7912140190601349

You'll gain almost 6TB of space by going from 17 to 18 drives because 17 drives is so wasteful!
 
I found similar research for 3TB disks when googling configurations... I aim to use 2TB disks since I already have those, and space isn't (currently) a huge concern - I will be storing 720p and 1080p w/ DTS on the RED drives I suppose, and I expect it to grow over time.

I looked at the graphs put together by MrLie - it looks like a 6-disk RAIDZ2 config would be close to a 5-disk RAIDZ1 on the read side; the write side, no matter how you look at it, will suffer a bit.

I am thinking I will just add some SSDs in there to help out the write performance. I think the two-drive redundancy is needed when I look at having five drives in each array, with each array having at least a pair of drives that came from the same batch...

That is probably what I will do. I will just cross my fingers on the RED drives as far as my vibration concerns go, I suppose. I can always complain in four years like everyone else may.

Thanks a lot!
 
I just bought 6 4TB REDS to build a NAS with a Fractal 304. Tomorrow begins the build, maybe tonight... :)
 
(quoting the earlier post on RAIDZ2 wasted space, including the 17 vs 18 disk comparison)

I wouldn't do a RAIDZ2 with that many drives though. That's why I made a 19-drive RAIDZ3!
 
Because of this:

RAID-Z1 = 2^n + 1 disks, i.e. 3, 5, 9
RAID-Z2 = 2^n + 2 disks, i.e. 4, 6, 10
RAID-Z3 = 2^n + 3 disks, i.e. 5, 7, 11
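That rule is just arithmetic, so it's easy to tabulate a few more widths if anyone wants them (plain Python, nothing ZFS-specific):

Code:
# Tabulate the "2^n data disks + parity" widths from the rule quoted above.
# n is the power-of-two exponent for the number of data disks.
for parity in (1, 2, 3):
    widths = [2 ** n + parity for n in range(1, 5)]
    print(f"RAID-Z{parity}: {widths}")
# Prints:
# RAID-Z1: [3, 5, 9, 17]
# RAID-Z2: [4, 6, 10, 18]
# RAID-Z3: [5, 7, 11, 19]

Which also shows where the 17-disk RAIDZ1, 18-disk RAIDZ2 and 19-disk RAIDZ3 figures elsewhere in this thread come from.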

I am going to purchase two 2TB WD RED drives to accompany the two I already own. I have a 2TB Seagate (ST2000DL003, 5900RPM) that has not been recognized by NAS4Free for whatever reason, but I ran some tests on the drive and found it good, so I will try harder to make it work - I do not believe I will have any 4K alignment issues with the drive.

I will use RAIDZ1 - that will be for media, and if I need any bulk storage for any of my VMs I can just attach something in that pool.

For my extra datastore pool I am going to use 3x 2TB WD Black drives that I already have - I also have a couple of 1TB Black drives I could make into a separate vdev and add to the pool, however I am wondering if that will hurt overall performance.

I want to invest in a ZIL and L2ARC drive - I will probably go Samsung 840 for L2ARC.

The ZIL I would think would be most important to me but will be most expensive. I was thinking about the S3700 but I don't think I *need* the ZIL at this time so it may not be worth the money. I am thinking about a cheaper drive to play with performance or maybe nothing at all.
 
Greetings

I am using ZFS pools.

What kind of pools? Would they be Raid-Z/Z2/Z3 arrays to maximise space but with only the IOPS of a single disk, or mirrored pools in a raid 10 type setup to maximise IOPS at the expense of space, or would you have a mix of both to suit the different applications you have?

I know that TLER is only usable for hardware RAID...

TLER is necessary for hardware raid and not needed elsewhere, as ZFS can even work with green drives which spin down due to APM.

I will definitely shell out $10.00 a drive if the firmware doesn't make a difference in ZFS

$10.00 difference per drive cost is not that bad but if it was say $30-$50 per drive and I was going to get a dozen or so then I would baulk at spending a couple of hundred bucks for no real tangible benefit using ZFS.

The system will eventually either time the drive out and kick it from the pool, or I/O will recover once the drive comes back. Often it is the former, and then the problem is that if you don't have your system set up in such a way that you are immediately alerted to this and can log in and re-add the drive to the pool in short order, it will require a full resilver instead of a quick resilver just to get it back in, and while it's out of the pool, your performance and your redundancy are lowered.

Pardon? It is my understanding that because ZFS time/date stamps everything it can tell how far out of date the re-added drive is and it just adds the missing information to bring it up to date. I had a situation where I accidentally dislodged a SATA cable on one of my drives in a 10 drive Raid-Z2 array and I added about 4GB to the array before I noticed the cable was loose, I then shut down the system and reconnected the cable and started the system up again. ZFS noticed it was out of date and automatically wrote about 440MB to the drive in several tens of seconds and thereafter it was up to date. I am not aware that there is some sort of time limit for this which is what I understand you are implying.

I think either will ultimately be fine for it, but I don't want to be streaming a 1080p movie to my media center while an SQL Server is at high load and have the whole thing bog down because of my drive choice.

It is my understanding that if ZFS notices that it is doing sequential reads from a file like the 1080p movie you mentioned then it does extra read aheads and stores this data in the cache so if this is the case I don't think this will be a problem, neither do I know if the parameter for this is tunable or not.

What hard drive manufacturer would you recommend for this if you do not like WD? I am wary of Seagate and Hitachi because of past reliability issues I have had with the drives. I definitely want the best bang for buck though.

Can't really help but I suggest you read this article here given they are using about 2000 drives of various types and they do appear to think highly of the black drives.

Ideally I would like to have a "RAID 10" type setup to get more performance, but I think it will cost me too much for my use right now. Maybe I'll get a total of 6 Black drives and use RAIDZ2 with an SSD for a cache and throw my extra datastore space on that array, while using the Red drives for my home media.

If you're considering whether to decide between a "Raid10" vs "Raid-Z?" setup on an IOPS basis, then this is going to be completely overshadowed if, say, you're going to be running a database; you will have to give this topic some serious thought, as ZFS is COW (copy on write) and fragmentation is going to be a huge issue. How bad can this be? Well, how does a 20x-30x slowdown concern you (see figure 3)?

-- I hear RAIDZ1 performs better with 5 drives versus 4 though, and that with RAIDZ2 you should use 6 drives... Can anyone chime in on that?

The number of data drives should be a power of 2: the recordsize is a power of 2, hence when you divide one into the other, the data written to each hard drive is a power of 2 also, and more importantly is a WHOLE NUMBER of 512B/4KB sectors that have to be written, e.g. 128KB recordsize / 4 data drives = 32KB per drive; other drive numbers entail wasted space.
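To spell out that arithmetic, a quick sketch (plain Python, assuming 4KB sectors, i.e. ashift=12):

Code:
# Does recordsize / data_disks come out to a whole number of 4 KiB sectors?
# Pure arithmetic illustrating the paragraph above.
SECTOR = 4096  # ashift = 12

def sectors_per_disk(recordsize_bytes, data_disks):
    data_sectors = recordsize_bytes // SECTOR      # 128 KiB -> 32 sectors
    return data_sectors / data_disks

for data_disks in (3, 4, 5, 6, 8):
    spd = sectors_per_disk(128 * 1024, data_disks)
    note = "whole sectors, no waste" if spd.is_integer() else "fractional -> padding"
    print(f"128 KiB across {data_disks} data disks: {spd:.2f} sectors/disk ({note})")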

I made the following post on another forum:

The performance impact is small and seems to affect reads only. Overall performance would probably be slightly higher since there are more disks...
http://hardforum.com/showpost.php?p=1036395804&postcount=61

Interesting. I was looking at the first graph, which was sequential read performance; when I attached my Raid-Z2 array consisting of 10 3TB Toshiba DT01ACA300's to my X79S-UP5 and i7-3820 I was getting scrub speeds of 985MB/s, although most if not all of it was archived data laid down in a WORM fashion and hence mostly contiguous. Perhaps your CPU is a bit under-powered, accounting for your lower figures for Raid-Z2 read speeds?

The larger issue is wasted space. Over 1 TB for the RAIDZ2 @ 8 Drives.
Read here:
http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html
6: 0.2602885812520981
7: 1.1858475767076015
8: 1.149622268974781

You'll "waste" 1.149 - 0.26 TB of space (~1TB) by choosing 8 disks over 6. This is with 3TB disks.

The above #'s are with ashift = 12 (4k sectors)
with 512b sectors (ashift = 9) the wasted space with 6 drives is almost exactly the same (no penalty for 4k drives) but there is a penalty at 10 drives.

The most interesting comparison comes at 17-18 disks:
17: 3.5806130981072783
18: 0.7912140190601349

You'll gain almost 6TB of space by going from 17 to 18 drives because 17 drives is so wasteful!

That opendevs link certainly demonstrates the usefulness of having the data drives as being a power of 2, there is a comment reply there which states

*********************************************************************************
Free space calculation is done with the assumption of 128k block size.
Each block is completely independent so sector aligned and no parity
shared between blocks. This creates overhead unless the number of disks
minus raidz level is a power of two. Above that is allocation overhead
where each block (together with parity) is padded to occupy the multiple
of raidz level plus 1 (sectors). Zero overhead from both happens at
raidz1 with 2, 3, 5, 9 and 17 disks and raidz2 with 3, 6 or 18 disks.


High overhead with 10 disks is because of allocation overhead.
128k / 4k = 32 sectors,
32 sectors / 8 data disks = 4 sectors per disk,
4 sectors per disk * (8 data disks + 2 parity disks) = 40 sectors.
40 is not a multiple of 3 so 2 sector padding is added. (5% overhead)
*********************************************************************************

However, I chose a larger recordsize of 512KB firstly because I am not running an OLTP database needing smaller recordsizes and secondly unlike say NTFS where wasted space equals on average clustersize(recordsize) / 2 this does not apply in ZFS because if the data to be written is a lot smaller then ZFS can use variable height stripes.

Re-doing the numbers on the basis of using a 512KB recordsize.....

512k / 4k = 128 sectors,
128 sectors / 8 data disks = 16 sectors per disk,
16 sectors per disk * (8 data disks + 2 parity disks) = 160 sectors.
160 is not a multiple of 3 so 2 sector padding is added. (1.25% overhead)

Compared to 5% previously this is a fairly large improvement, although 1.25% is still higher than 0% for other data drive numbers like 4 or 16 it is clear that 10 drive Raid-Z2's are still a viable proposition.
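If anyone wants to redo these numbers for other widths or recordsizes, the padding arithmetic from that quoted comment is easy to script. A rough sketch (Python; it follows the rounding described above, which as far as I can tell matches the asize calculation in vdev_raidz.c, but it only models this per-block padding, so don't expect it to reproduce the opendevs figures exactly):

Code:
# Per-block allocation for RAID-Z, following the quoted comment's arithmetic:
# data sectors, plus parity sectors for each stripe row, with the total padded
# up to a multiple of (nparity + 1). Overhead is measured against an "ideal"
# layout with no rounding at all. Approximation only.
import math

def allocated_sectors(recordsize, disks, nparity, ashift=12):
    sector = 1 << ashift
    data = math.ceil(recordsize / sector)          # e.g. 128 KiB / 4 KiB = 32
    rows = math.ceil(data / (disks - nparity))     # stripe rows needed
    total = data + rows * nparity                  # add parity sectors
    return math.ceil(total / (nparity + 1)) * (nparity + 1)  # pad to multiple of p+1

def overhead_pct(recordsize, disks, nparity, ashift=12):
    data = math.ceil(recordsize / (1 << ashift))
    ideal = data * disks / (disks - nparity)       # data plus a "perfect" parity share
    alloc = allocated_sectors(recordsize, disks, nparity, ashift)
    return (alloc - ideal) / ideal * 100

for disks in (6, 8, 10, 18):
    k128 = overhead_pct(128 * 1024, disks, nparity=2)
    k512 = overhead_pct(512 * 1024, disks, nparity=2)
    print(f"RAIDZ2, {disks:2d} disks: {k128:5.2f}% @ 128K   {k512:5.2f}% @ 512K")

For the 10-disk case that reproduces the 5% at 128K and the 1.25% at 512K worked out above, and it shows 6 and 18 disks coming out clean.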

I want to invest in a ZIL and L2ARC drive - I will probably go Samsung 840 for L2ARC.

The ZIL I would think would be most important to me but will be most expensive. I was thinking about the S3700 but I don't think I *need* the ZIL at this time so it may not be worth the money. I am thinking about a cheaper drive to play with performance or maybe nothing at all.

I've got a question about ZIL's as I have never actually used one (or needed one for a home NAS) but say you are using it with the database you mentioned, would I be correct in assuming its main advantage would be dealing with absorbing a large amount of write traffic primarily of a bursty nature that the drives cannot instantaneously handle?

I'm just curious because say your database gets heavily fragmented and you do get a huge decrease in read speeds (as discussed above) then I'm assuming that this will also have to impact the write speeds of the drives and hence data will build up in the ZIL eventually filling it up and just turning it into a glorified FIFO buffer? or am I going wrong somewhere here in my line of thinking?


And here's a quote for the day that I feel seems somewhat appropriate at this point; Matthew 15:14 And if the blinde lead the blinde, both shall fall into the ditch.

Anyway that's my 2 cents worth

Cheers
 
Greetings
That opendevs link certainly demonstrates the usefulness of having the data drives as being a power of 2, there is a comment reply there which states

*********************************************************************************
Free space calculation is done with the assumption of 128k block size.
Each block is completely independent so sector aligned and no parity
shared between blocks. This creates overhead unless the number of disks
minus raidz level is a power of two. Above that is allocation overhead
where each block (together with parity) is padded to occupy the multiple
of raidz level plus 1 (sectors). Zero overhead from both happens at
raidz1 with 2, 3, 5, 9 and 17 disks and raidz2 with 3, 6 or 18 disks.


High overhead with 10 disks is because of allocation overhead.
128k / 4k = 32 sectors,
32 sectors / 8 data disks = 4 sectors per disk,
4 sectors per disk * (8 data disks + 2 parity disks) = 40 sectors.
40 is not a multiple of 3 so 2 sector padding is added. (5% overhead)
*********************************************************************************

However, I chose a larger recordsize of 512KB firstly because I am not running an OLTP database needing smaller recordsizes and secondly unlike say NTFS where wasted space equals on average clustersize(recordsize) / 2 this does not apply in ZFS because if the data to be written is a lot smaller then ZFS can use variable height stripes.

Re-doing the numbers on the basis of using a 512KB recordsize.....

512k / 4k = 128 sectors,
128 sectors / 8 data disks = 16 sectors per disk,
16 sectors per disk * (8 data disks + 2 parity disks) = 160 sectors.
160 is not a multiple of 3 so 2 sector padding is added. (1.25% overhead)

Compared to 5% previously this is a fairly large improvement, although 1.25% is still higher than 0% for other data drive numbers like 4 or 16 it is clear that 10 drive Raid-Z2's are still a viable proposition.

The power of 2 requirement is easy to understand. But this is the part I'm still confused with, even after I looked at the vdev_raidz.c code. Do you have any clear explanation on the reason?
 
Greetings

I have an explanation, but it is anything but clear. I think it has to do with the internals of how ZFS stores data; first have a read of this and this.

AFAIK what I understand is

(1) all the posts on the net related to this topic mean that, in addition to the number of data drives being a power of two, it is also desirable for the total number of drives to be divisible by the Raid-Z level + 1, so in the case of Raid-Z2 this works out to be 3 (= 2 + 1); furthermore....

(2) the minimum number of sectors written out for say a 1 byte file in Raid-Z2 is 3 being 1 data and 2 parity e.g (D,P,Q), this means in

(a) a 6 drive Raid-Z2 you can write D1,P1,Q1,D2,P2,Q2 exactly with no wasted space
(b) a 18 drive Raid-Z2 you can write D1,P1,Q1....D6,P6,Q6 exactly with no wasted space
(c) a 10 drive Raid-Z2 you can write D1,P1,Q1,D2,P2,Q2,D3,P3,Q3 with the tenth block empty and wasted.

(3) the tenth block can't be used by ZFS, because if it did use it, it would also have to put a P and Q sector somewhere on that stripe, and there's no room, so it can't. Also, although it's "free" it can't be reported as "free", so it's padded out and the space is reported as allocated; otherwise disk utilities would be reporting all this "free" space for storage that doesn't actually exist, which accounts for the high overhead. Furthermore, I believe that ZFS keeps metadata in its own stripes and never mixes it with data stripes, so metadata can't use the tenth block either.

(4) In addition, say you have a recordsize of 1MB and you write out that 1MB starting at a stripe boundary; with 512B sectors on a 10 drive Raid-Z2 this will be a stripe 256 sectors high, so no problems there. Another complication: say you already have a partial stripe with D1,P1,Q1,D2,P2,Q2 laid down, and you then write the 1MB starting from the 7th drive position; you will put down sector 1, sector 2, P and Q and so on, for a total stripe height of 257 sectors, and in addition there will be an extra 2 parity sectors written (257 in total), so the last stripe will be the last 6 sectors of the 1MB recordsize, a P and a Q, and now there will be 2 free sectors wasted which again will be unusable.

That's what I think I understand this situation to be, anyway Caveat Emptor regarding what I have posted now but I think I'm on the right track, perhaps with this magical "power of 3" I should now go and watch an episode or two of Charmed. ZFS internals of this nature are discussed in a bit more detail over at the FreeBSD forums so you can probably find clearer explanations over there.

Hope this helps you somewhat

Cheers
 
I am fairly confident you hit the nail on the head as far as explaining to yourself what exactly a ZIL is - however I do not know how to answer the question about reduced read speeds and to prevent either of us from falling into a ditch I am going to refrain from googling it :D
 
TLER is necessary for hardware raid and not needed elsewhere, as ZFS can even work with green drives which spin down due to APM.

ZFS can survive w/o TLER, but as I stated earlier, you're going to hate it when a drive goes into deep recovery and your pool either hangs until it recovers or drops the disk from the pool (which depending on environment could take nearly as long to actually do as the deep recovery action takes anyway, amusingly).

Pardon? It is my understanding that because ZFS time/date stamps everything it can tell how far out of date the re-added drive is and it just adds the missing information to bring it up to date. I had a situation where I accidentally dislodged a SATA cable on one of my drives in a 10 drive Raid-Z2 array and I added about 4GB to the array before I noticed the cable was loose, I then shut down the system and reconnected the cable and started the system up again. ZFS noticed it was out of date and automatically wrote about 440MB to the drive in several tens of seconds and thereafter it was up to date. I am not aware that there is some sort of time limit for this which is what I understand you are implying.

There is a 'time limit', in the form of the transaction commits. The 'time/date stamp of everything' isn't strictly accurate, what matters for a 'quick resilver' is if the uberblock that was in use when the drive disappeared is still in the ... let's call it a 'wheel' ...when the drive comes back. The amount of data written in the interim is only secondarily related (in that it can have some impact on how fast you're churning through txg_id's). Every time a txg happens, the uberblock is the last thing updated -- the last X (usually 128, but my brain is telling me that might not be accurate anymore, I'll need to go see why I'm thinking that) uberblocks beyond the last one are 'stored' in this 'wheel' (I'm probably using a terrible analogy here), and if a disk disappears at say txg_id 4000 and comes back at txg_id 4009 then the interim data can be resilvered onto it 'quickly', as ZFS can see the older uberblock in the wheel, and quickly walk the differences that happened in txg_id's 4001-4009 and plop them down on the offending disk.

If, however, it disappears at txg_id 4000 and comes back at 4400 it's been > 128 and it is no longer possible for ZFS to look at all the changes that happened in the intervening uberblocks because it only has the last 128 or so stored entirely, to actually do that sort of comparison on. Thus, it kicks off a 'full' resilver, not a quick one.
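Put another way, the decision boils down to "is the uberblock the disk last saw still on the ring?". A toy illustration of that logic (Python, purely conceptual, assuming the ring really does hold roughly the last 128 uberblocks, which as noted above may not be exact):

Code:
# Toy model of the quick-vs-full resilver decision described above.
# Not actual ZFS code; just the txg-window logic in miniature.
UBERBLOCK_RING = 128   # assumed ring size, see the caveat above

def resilver_kind(txg_when_disk_left, current_txg):
    missed = current_txg - txg_when_disk_left
    if missed <= UBERBLOCK_RING:
        # The uberblock in use when the disk vanished is still retained, so only
        # the transaction groups it missed need to be walked.
        return f"quick resilver ({missed} txgs to catch up)"
    return "full resilver (the old uberblock has rotated off the ring)"

print(resilver_kind(4000, 4009))   # quick
print(resilver_kind(4000, 4400))   # full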

It is my understanding that if ZFS notices that it is doing sequential reads from a file like the 1080p movie you mentioned then it does extra read aheads and stores this data in the cache so if this is the case I don't think this will be a problem, neither do I know if the parameter for this is tunable or not.

There's vdev read-ahead prefetch, which should (correctly) be disabled on every distro these days and shouldn't be used, and there's file level prefetch, which is generally only 'tuned' in terms of turning it on or off globally with the tuneable 'zfs_prefetch_disable'. Sometimes it helps performance to disable it, but in general I'd recommend leaving it on if in doubt.

I've got a question about ZIL's as I have never actually used one (or needed one for a home NAS) but say you are using it with the database you mentioned, would I be correct in assuming its main advantage would be dealing with absorbing a large amount of write traffic primarily of a bursty nature that the drives cannot instantaneously handle?

I'm just curious because say your database gets heavily fragmented and you do get a huge decrease in read speeds (as discussed above) then I'm assuming that this will also have to impact the write speeds of the drives and hence data will build up in the ZIL eventually filling it up and just turning it into a glorified FIFO buffer? or am I going wrong somewhere here in my line of thinking?

Well, let me start by saying you do have a ZIL, very likely. The ZIL, or ZFS Intent Log, is a mechanism that is by default on for any synchronous write (as determined by the incoming data, or by the 'sync' setting on the target dataset, typically). If you do not have a log (or sometimes 'slog') vdev in your pool set aside for it to specifically use, it simply uses some of the space on the data disks in your pool, thus effectively 'double dipping' on writes, and in the worst possible way (as ZIL writes are followed up by a cache flush command, which is not a workload spinning disks are particularly happy with).

As for impact on a database - it has a significant one, assuming you have a dedicated SSD or battery-backed RAM-based slog, in that it causes the writes to be acknowledged quicker, improving db write performance for starters. It also takes in all those likely very random and small write IOPS and, with that dedicated slog vdev in place, offloads that work from the spinning drives. That way the only write that actually hits spinning media is the much nicer, sequential write done during the txg commit.

At the end of the day it should have some positive impact on the potential IOPS and write performance of the client over what the spinning disks could have done alone (even without a double-dipping ZIL hit), but ultimately it is still beholden to the write performance of the underlying data drives (the data massaging done as part of the txg commit may improve what the data vdevs can take down versus what was coming in, but it isn't magic -- the ratio isn't even likely to be all that high, so while it has some positive effect on write performance, it won't perform miracles for you).
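To make "synchronous write" concrete: it's the kind where the application waits for the data to be on stable storage before carrying on, and on ZFS that wait is exactly what the ZIL (and a slog, if you add one) absorbs. A small sketch of the difference (Python; /tank/scratch is a made-up dataset path, point it at something real):

Code:
# Compare a buffered write against one forced to stable storage with fsync().
# On a ZFS dataset with sync=standard, the fsync() is what triggers a ZIL
# commit -- and that commit is what lands on a dedicated slog if one exists.
# The paths below are made-up examples.
import os
import time

def timed_write(path, data, force_sync):
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        if force_sync:
            os.fsync(fd)    # wait until the data is on stable storage
    finally:
        os.close(fd)
    return time.perf_counter() - start

payload = b"x" * 8192
print("buffered:", timed_write("/tank/scratch/async_test", payload, force_sync=False))
print("fsync'd: ", timed_write("/tank/scratch/sync_test", payload, force_sync=True))

Databases and NFS issue the fsync'd kind constantly, which is why a slog pays off there and barely matters for bulk media copies.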

And here's a quote for the day that I feel seems somewhat appropriate at this point; Matthew 15:14 And if the blinde lead the blinde, both shall fall into the ditch.

Need a walking stick? :) http://src.illumos.org/source/
 
(quoting the earlier post on RAIDZ2 wasted space and the opendevs.org figures: http://www.opendevs.org/ritk/zfs-4k-aligned-space-overhead.html)
The opendevs.org link is very interesting. Is there any similar list for a RAIDZ3 configuration? I am considering a 16-disk RAIDZ3 and wonder how much space is wasted; maybe I should aim for 13 disks instead? I need such a list. Does anyone know where to find it? Or can I calculate the list?
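You can calculate it, at least approximately: the same per-block padding arithmetic posted earlier in the thread works for RAIDZ3 if you set the parity to 3. A rough sketch (Python, 128 KiB records, ashift=12; treat the output as an approximation of the padding overhead only, not a reproduction of the opendevs figures):

Code:
# Rough RAIDZ3 padding-overhead list using the per-block arithmetic from the
# earlier sketch in this thread. Approximation only.
import math

def overhead_pct(disks, nparity=3, recordsize=128 * 1024, ashift=12):
    data = math.ceil(recordsize / (1 << ashift))
    total = data + nparity * math.ceil(data / (disks - nparity))
    total = math.ceil(total / (nparity + 1)) * (nparity + 1)   # pad to multiple of p+1
    ideal = data * disks / (disks - nparity)
    return (total - ideal) / ideal * 100

for disks in range(11, 20):   # widths around the 13 and 16 disks mentioned above
    print(f"RAIDZ3, {disks} disks: ~{overhead_pct(disks):.1f}% padding overhead")

By that estimate 11 disks is the cleanest width in that range and 13 comes out noticeably better than 16, though the picture shifts with larger recordsizes, so it's worth running with your own numbers.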
 