ZFS NAS build, how to arrange 8x1.5TB drives?

kxy

I am currently building a NAS to house my ever-growing media collection. I have 8x 1.5TB WD EARS to start with. I have no idea if I should build a single vdev of 8 drives in RAID-Z2, or possibly 2x 4-drive vdevs in RAID-Z. I am probably 6 months away from adding another 8 drives, at which time the 2TB drives will probably be at the price point I want.
This is mostly for media storage and serving. From what I understand, with a single RAID-Z2 vdev I get two-disk redundancy, but it won't be as fast as two smaller vdevs striped. With the smaller vdevs I only get single-disk redundancy per vdev, yet I still lose 2 disks to parity data.

Or I can just run an 8-disk RAID-Z single vdev. This would maximise my usable space, is not the fastest, and has single-disk redundancy. Considering this is only media backup/serving, speed isn't really an issue; I'm sure I can saturate a gigabit network no matter which way I go.

Another thought: I actually have another 1.5TB drive that's full of data, so once the pool is up I could copy its contents over and have a spare drive. Could I then add this to the array as a hot spare? So 8-disk RAID-Z + 1 hot spare. This would minimise the at-risk time in the event of a drive failure, because if a drive fails when I'm cash strapped, it may be 2-3 weeks before I get a replacement.

Would appreciate any thoughts on the matter.
 
Hey there!

I'm not sure if you know, but EARS disks use 4K sectors with 512-byte emulation; so-called Advanced Format, as WD likes to call it. This can hurt your speeds considerably when using RAID-Z, much less so for mirroring or striping.

A 4-disk RAID-Z with EARS would be quite slow, but a 5-disk RAID-Z with EARS should (in theory) be as fast as normal drives. This is a bit complicated, but the formula is:
128KiB / (number of drives - parity drives) = .... KiB <-- this number needs to be a multiple of 4KiB.

So with 5 disks in RAID-Z you get 128 / 4 = 32KiB, which is perfect. But with 4 disks you get 128 / 3 = ~43KiB, which is awful for these 4K-sector disks.

These problems disappear when using mirroring or striping, or when choosing normal 512-byte sector disks like the WD EADS and Samsung F2EG/F3. Generally I prefer 5400rpm disks: at 100MB/s+ they are not slow at all, but they generate half the heat and use half the power.

So which combinations would be good with 4K drives?
3-disk RAID-Z = 128 / 2 = 64KiB
4-disk RAID-Z2 = 128 / 2 = 64KiB
5-disk RAID-Z = 128 / 4 = 32KiB
6-disk RAID-Z2 = 128 / 4 = 32KiB
9-disk RAID-Z = 128 / 8 = 16KiB
10-disk RAID-Z2 = 128 / 8 = 16KiB

You shouldn't attempt more disks in a single vdev, I think.
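If you want to check other widths yourself, a quick shell loop like this (assuming the default 128KiB recordsize; just a sketch) reproduces the list above:

for n in 3 4 5 6 8 9 10; do
  for p in 1 2; do
    d=$((n - p))
    [ "$d" -ge 2 ] || continue   # skip degenerate single-data-disk layouts
    # keep widths where 128KiB splits evenly into 4KiB-aligned chunks per data disk
    if [ $((131072 % d)) -eq 0 ] && [ $(( (131072 / d) % 4096 )) -eq 0 ]; then
      echo "raidz$p with $n disks: $((131072 / d / 1024)) KiB per data disk"
    fi
  done
done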

Also know that ZFS scales random I/O per vdev. A single RAID-Z vdev has roughly the random I/O capability of a single disk, which is not that bad; RAID5 is usually poor at random writes, so this solution should be much better. But it does mean you would want multiple vdevs, so groups of 5 would be preferable to groups of 9 or 10. I also think RAID-Z with 9 disks is not secure enough.

You *DO* need to make sure that you're using a controller without RAID functionality; otherwise the controller may drop your disk on its own, preventing ZFS from accessing it. That's not what you want; you want a plain SATA HBA without RAID.
 
I've seen you assert this before, but I have not seen any benchmarks. Have you measured it?

I understand the theory of what you are saying, but I suspect that for any reasonable chunk size (128KB or higher, should probably be using 512KB or larger in this case), the difference is not worth worrying about. A few percent at worst, I would think.
 
So sub, do you suggest he create a 5-disk raidz array and a 3-disk raidz array?
zpool create tank raidz1 /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4
zpool add tank raidz1 /dev/ada5 /dev/ada6 /dev/ada7
This creates two vdevs, one of 5 disks and one of 3 disks, each with 1 parity disk (RAID5-style), but the pool appears as one volume. That's 8 x 1.5TB ~ 12TB raw; with two disks lost to parity you end up with roughly 9TB usable.

Then for the spare:
zpool add tank spare /dev/ada8

I personally wouldn't suggest a 9-disk array with a single parity disk. But you can see how you can mix and match your 9 disks into vdevs that suit you.
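Afterwards, something like the following (same hypothetical pool name) lets you confirm that both vdevs, the spare and the expected capacity are there:

zpool status tank
zpool list tank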
 
Well, I still have a lot of questions myself. I got the basics of this theory from the FreeBSD/OpenSolaris mailing lists, regarding problems with 4K sector sizes and RAID-Z performance in particular. Many threads about this exist.

RAID-Z is somewhat odd; it is really more like RAID3 than RAID5. To avoid confusion, let me explain how I understand this to work:

Traditional RAID
In traditional RAID we know the stripe size, normally 128KiB. Depending on the stripe width (the number of actual striped data disks), the 'full stripe block' would be <data_disks> * <stripesize>. In RAID5 the size of this full stripe block is very important:

1) If we write exactly the amount of data in this full stripe block, the RAID5 engine can do so at very high speeds, theoretically the same as RAID0 minus the parity disks.

2) If we write any other amount that is not a multiple of the full stripe block, then we have to do a slow read+XOR+write procedure.

Traditional RAID5 engines with write-back essentially build up a queue (buffer) of I/O requests and scan for full stripe blocks which can be written efficiently, and use the slower read+XOR+write for any smaller or leftover I/O.

RAID-Z
RAID-Z is vastly different. It does ALL I/O in ONE phase; no read+XOR+write will ever happen. How? It varies the stripe size so that each write request fits in a full stripe block. The 'recordsize' in ZFS is like this full stripe block. As far as I know, you cannot set it higher than 128KiB, which is a shame really.

So what happens? For sequential I/O the request sizes will be 128KiB (the maximum), and thus 128KiB gets written to the vdev. That 128KiB is then spread over all the disks. 128 / 3 for a 4-disk RAID-Z produces an odd value, 42.5/43.0KiB. Both are misaligned at the end offset on 4K-sector disks, requiring THEM to do a read-whole-sector + recalculate-ECC + write-whole-sector. This behaviour is devastating to performance on 4K-sector drives with 512-byte emulation; virtually every write request issued to the vdev will cause it to perform 512-byte sector emulation.

Performance numbers? Well, I talked to a lot of people via email/PM about these issues. I don't have any 4K-sector drives myself yet. I did perform geom_nop testing with someone, which increased sequential write performance considerably. geom_nop allows the HDD to be presented as a 4096-byte sector HDD, as if the drive wasn't lying about its true sector size. ZFS detects this, and then it's even theoretically impossible to write anything other than multiples of the 4096-byte sector size. So the EARS HDDs never have to do any emulation; only ZFS has to do some extra work, but it is much more efficient. The problem is that geom_nop is gone after a reboot; it's not persistent. It's basically a debugging GEOM module. But I hope to find a way of attaching these modules on every boot for 4K disks with my Mesa project. That might solve a lot or all of these issues.

Some people are in the process of building a NAS with FreeBSD+ZFS and 4K-sector drives; once they are up and running I hope to do some more testing with them. Some people have offered before, but went ahead and set up a mirror instead, which gave them very good speeds. I can't blame them really; it's a bit shocking to get 11MB/s writes on brand new hardware. :D
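To give an idea, the geom_nop trick during such a test looks something like this (device names are just examples, and the .nop providers vanish at reboot):

gnop create -S 4096 /dev/ada1      # presents /dev/ada1.nop with a 4096-byte sector size
gnop create -S 4096 /dev/ada2
gnop create -S 4096 /dev/ada3
zpool create testpool raidz ada1.nop ada2.nop ada3.nop   # ZFS now sees 4K sectors and aligns to them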
 
So what happens? For sequential I/O the request sizes will be 128KiB (the maximum), and thus 128KiB gets written to the vdev. That 128KiB is then spread over all the disks. 128 / 3 for a 4-disk RAID-Z produces an odd value, 42.5/43.0KiB. Both are misaligned at the end offset on 4K-sector disks, requiring THEM to do a read-whole-sector + recalculate-ECC + write-whole-sector. This behaviour is devastating to performance on 4K-sector drives with 512-byte emulation; virtually every write request issued to the vdev will cause it to perform 512-byte sector emulation.

See, that does not sound as bad to me as you make it out. As you say, each disk will have 42.5+ KB written to it. That makes 10 4KB groups, plus one group less than 4KB. So less than 10% of the write will need to be a read/modify/write of the 4KB sectors. Let's estimate that the read/modify/write takes 3 times as long as a full 4KB write (I think that is pessimistic, it should be closer to 2 times). Let's call the time to write a 512B sector on a 512B sector drive T. So, instead of taking a time of about 85T on a 512B sector disk, it would take 80 + 8*3 = 104T on a 4KB sector disk. So about 22% longer.

But maybe ZFS has a really bad implementation? I would be interested to see benchmarks. Are you sure ZFS cannot use a larger stripe width than 128KB? That is way too small for a media server.
 
Well, I'm committed to the EARS path; they are the best price at my local store. I currently own 8, plus there's one in another PC that I can utilise. The cards I'm planning on using are
http://cgi.ebay.com.au/LOT-2-IBM-Se...939?pt=LH_DefaultDomain_0&hash=item4cf17ccebb

IBM LOT OF 2 IBM ServerRaid Br10i FRU, which are based on the LSI chipset. They are mini-SAS cards and I will be using SAS-to-SATA breakout cables. My current case (CM Stacker) is set up to house 16 drives, and I plan on upgrading to a Norco 4228 as the need arises. These cards are easily set to non-RAID. It sounds like my best plan is to buy another 2 drives, run 2x 5-disk RAID-Z vdevs, and work my way up to a third.
I have been following the progress of your web UI closely and am very excited to see it come to fruition.

I have all my hardware and am only waiting on the RAID cards and 2 more breakout cables. Would there be any problem using 1 card for 8 drives and 2 drives through the onboard SATA controller, and then, when I'm ready to add another 5-disk vdev, installing the second LSI and migrating those disks from onboard to the controller? I will end up with 3 vdevs of 5 disks, plus 1 drive as a hot spare. I have 6 SATA ports onboard, so I have many options for expanding, and there's always the possibility of a SAS expander into an attached Norco.

Fun times ahead.
I'm starting off with 4GB of RAM, but can easily expand that up to 16GB, though I doubt I will need that much.
 
"Your submission could not be processed because the token has expired." and so i lost my long reply; very frustrating to be honest. And since this board doesn't send correct HTTP caching headers the back button would destroy any text in the textbox as well. Bah!

Anyway, don't really feel like re-writing my reply. But at least here's the data for my calculations:

Writing a 4K sector takes 0.00003906 seconds.
Writing an emulated sector takes AT LEAST 0.0111 sec (one rotation at 5400rpm, i.e. 5400/60 = 90 rotations per second).
Thus, writing an emulated sector takes about 285 times longer than writing a normal sector.

T = time to write a 4K sector. So your example would be 10 units of 'T', or 10T, plus the emulated sector at 285T: 10T + 285T = 295T. As you can see, this ought to reduce speeds considerably, and I think it is very conservative since I only count one rotational delay for the emulated sector, not any other work. But that delay will likely take the longest.
 
Writing a 4K sector takes 0.00003906 seconds.
Writing an emulated sector takes AT LEAST 0.0111 sec (one rotation at 5400rpm, i.e. 5400/60 = 90 rotations per second).
Thus, writing an emulated sector takes about 285 times longer than writing a normal sector.

Where is this timing from?
 
I assume 100MiB/s and 5400rpm.

Calculated latency per 4K write:
100MiB = 1 sec
1MiB = 0.01 sec
4KiB = 0.00003906 sec

Calculated latency for an emulated 4K write (I count one rotational delay; very conservative):
5400 rotations = 60 secs
90 rotations = 1 sec
1 rotation = 0.01111 sec

0.01111111 / 0.00003906 = 284.46 ~= 285x the time required for a normal 4K sector.

Likely it's even more, due to the read+write happening too; I just counted the one rotational delay, which is what would contribute most to this slowness. One such delay for every I/O request pretty much means it's doing random writes.
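If you want to verify the arithmetic, this reproduces the figures above (plain shell plus bc; same 100MiB/s and 5400rpm assumptions):

echo "scale=8; 4096 / (100 * 1024 * 1024)" | bc                   # ~0.00003906 s per 4KiB at 100MiB/s
echo "scale=8; 60 / 5400" | bc                                    # ~0.01111 s per rotation at 5400rpm
echo "scale=8; (60 / 5400) / (4096 / (100 * 1024 * 1024))" | bc   # ~284.4, i.e. roughly 285x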
 
And ZFS will not write the next data until the current write is completed? Because otherwise, it seems the drive should be able to cache the writes and do it much more efficiently than just doing nothing and waiting for another revolution.

If that is the reason, it seems ZFS just has an implementation that does not work well with 4KB sector drives, since it is limited to such small chunk sizes and does not let the drives utilize their cache fully. That puts ZFS in a bad position for the next few years, unless they jump whatever fix they make ahead into the stable version.
 
Well, due to 128 / 3 not giving a sector-aligned number, some disks write more than others (i.e. disk1 42.5KiB and disk2 43.0KiB). As I understand it, ZFS pads those odd values to prevent fragmentation, and this 'padding' means the rest of the partially written emulated 4K sector is not covered by any subsequent write. I cannot confirm this yet, though. But if it's true, it explains the performance characteristics of ZFS: low performance with 4K plus emulation, high performance with native 4K.

So because there is a gap, the second I/O request won't cover the partially written sector, and the trick you mentioned (buffering, basically) wouldn't work. Some people have reported success with disks using GELI or GNOP to increase the logical sector size to 4096 bytes. In other words, the performance issues disappear when ZFS knows it's a 4K-sector drive. Then it won't ever write smaller chunks; it just may have to write a tiny bit more data than it originally planned. That's not so bad.

I don't really understand how you could be blaming ZFS for this. Intentionally lying is dirty, 'hacked' technology to make things work; not clean technology, if you know what I mean. If a properly reported sector size fixes most of the slow writes, then I can't say ZFS is broken. Instead, the disks are broken because they lie about a crucial fact.

I may do more tests when I get my new Samsung F4s in a few months.
 
I don't really understand how you could be blaming ZFS for this. Intentionally lying is dirty, 'hacked' technology to make things work; not clean technology, if you know what I mean. If a properly reported sector size fixes most of the slow writes, then I can't say ZFS is broken. Instead, the disks are broken because they lie about a crucial fact.

All I said was that it looks like ZFS will have performance issues over the next few years, unless the programmers implement a fix and jump it forward to the stable version, since almost all HDDs are switching to 4KB sectors with 512B emulation for the next few years. That's just the way it is. Sub.mesa, blaming someone is not going to make a difference.
 
Perhaps, perhaps not. I do want to make the geom_nop sector size trick available in my Mesa project, meaning it should be fixed out of the box there. FreeBSD might also include such patches; I did read about geom_disk detecting the true sector size for these drives (though I'm not sure how) and applying that instead.

So I guess some people who worry about this actually are going to make a difference. And the future of ZFS is not as hopeless as you appear to think.
 
Now you've lost me. If there is already a fix, and it is as simple as setting a variable to 4KB instead of 512B, then why are you making such a big deal about 4KB/512B emulation drives?
 
The GNOP workaround disappears when you reboot. (My Mesa project could apply it on every boot.)
The GELI 'fixed-by-accident' workaround may introduce encryption overhead which you do not want. Funny to see encryption make writes go much faster; but it's a by-product of the increased sector size.
The real proper fix is FreeBSD properly detecting these as 4096-byte sector disks. When that happens, all major problems are over. I think these patches are in HEAD (9-CURRENT) already, but I have no way to check whether they work. Hopefully I can do some more tests when people have finished their builds with 4K-sector drives.

As far as I know, no OS currently detects an EARS as 4K, at least not out of the box.
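For completeness, the GELI route looks something like this (a sketch only; it is persistent across reboots, but every block then goes through encryption):

geli init -s 4096 /dev/ada1      # one-time setup; -s forces the provider's sector size to 4096 bytes
geli attach /dev/ada1            # creates /dev/ada1.eli, which ZFS then sees as a 4K-sector device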
 
The GNOP workaround disappears when you reboot. (my mesa project could apply it on every boot)

That seems simple enough. I don't know what the FreeBSD equivalent is, but on linux it would be as easy as putting something into /etc/rc.local, which would be run at the end of every boot.

So the ZFS 4KB/512B emulation issues were way overblown?
 
Well, it should all work out of the box, so currently it is not overblown at all. Users want huge 4K-sector disks like the EARS / Samsung F4 to work well in RAID-Z configurations.

But yeah, it's not like ZFS is fundamentally broken on 4K-sector disks. It does appear, though, that due to the way ZFS implements RAID-Z and variable stripe sizes, any mismatched sector size is heavily penalized, virtually with every write request.

If I can make geom_nop attach to any EARS / Samsung F4 or other disk known to have 4K sectors, then it should just work properly with good write speeds. But FreeBSD doesn't like hacks and wants to implement the technology where it belongs: in geom_disk, as some mailing lists have discussed. I'm not sure how they want to distinguish a 4K-sector disk from a normal disk, though.

Also, there's a risk when changing sector sizes: if you transform a disk to a 4K sector size, all the absolute LBA addresses change as well. LBA 4 is now at a 16KiB offset instead of 2KiB. I'm not sure, but I could imagine this causing problems if you first use a drive with geom_nop and later expose it to the system without it; then you would have a ZFS filesystem that is tuned to 4K sectors but suddenly sees a drive with 512-byte sectors, and all the on-disk stored locations would point to wrong LBAs.

So there's lots of testing I want to do. Hopefully I'll learn more as people finish their builds and allow me to test some things on their 4K-sector drives. It's certainly an interesting issue and I do not yet understand all parts of it.

But yeah, the 'actual' fix may just be two lines of source code; who knows. ;-)
 
No need to detect 4KB sector disks. Just do whatever is necessary to make ZFS always write in minimum 4KB chunks. All other modern operating systems already do that. That is the way HDDs are moving (and SSDs have always preferred 4KB chunks), so might as well make the change now.

And what do you mean, the fix is "source code"? I thought you said it was just changing a setting? So, which is it? Does it require changing ZFS source code and recompiling? Or is it just changing a setting in an /etc file or whatever?
 
Just do whatever is necessary to make ZFS always write in minimum 4KB chunks.
A sector size of 4KiB would do just that. The problem is that these disks get detected as having 512 byte sectors.

That is the way HDDs are moving (and SSDs have always preferred 4KB chunks), so might as well make the change now.
With SSDs moving to 8K pages, a hardcoded 4K alignment may soon be obsolete. Filesystems shouldn't need to guess at the limitations of the storage device; the storage device should make them known by communicating the sector size that is optimal for it. Hardcoding is handicraft work; not the UNIX mentality.

Since the sector size has traditionally been 512 bytes, some OSes (Windows XP) do not support any HDD with a different sector size. So if HDD vendors had just released native 4K-sector disks, incompatible with XP but working on all modern OSes without performance issues, these problems would be gone too. But since XP users are still a decent share of their customer base, I believe they wanted to remain compatible with XP. It's just a shame they didn't include a jumper to set the sector size to 4K; but shifting the sector size could corrupt data, as all LBA addressing changes, so I guess they found that too risky and released a disk that sort of works for the mainstream but has severe issues for a smaller group. Sounds like ugly handicraft work to me. ;-)

The sector size is not a 'setting' you can control directly in FreeBSD. The geom_disk GEOM module that sits at the root of the GEOM chain (it communicates directly with the disk) sets the sector size when attaching the device, to whatever the HDD reports; that is 512 bytes. A lie; so now what?
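(You can see exactly what the kernel was told with diskinfo; on an EARS the sectorsize field reads 512 even though the physical sectors are 4K. The device name is just an example.)

diskinfo -v /dev/ada0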

As I said earlier:
The real proper fix is FreeBSD properly detecting these as 4096-byte sector disks. When that happens, all major problems are over. I think these patches are in HEAD (9-CURRENT) already, but I have no way to check whether they work. Hopefully I can do some more tests when people have finished their builds with 4K-sector drives.

Thus, the real fix would be for FreeBSD to alter the source code of the geom_disk module to properly detect the sector size even for disks which lie about it. If that happens, all issues should be gone both now and in the future, as long as it properly detects those disks. For example, 8K could be used for SSDs with 8K pages, since a smaller sector size might not make much sense for them.

But I'm a bit skeptical, as I don't understand how FreeBSD could determine which disks have 4K physical sectors when they report 512B sectors. It could maintain a list of known disks, but that's not generally how the FreeBSD people like to do things. I may send an email to some FreeBSD devs about this, but I would like access to a suitable test system first.

Workarounds
Until such a 'permanent fix' is in place, the workarounds are to use either geom_nop (volatile) or geom_eli (persistent, but adds encryption overhead) on the disks, to force their sector size to 4K. How does this work? Well, in GEOM everything is stacked, like so:

Physical disk -> ahci driver -> geom_disk -> geom_gpt -> geom_nop -> geom_label -> ZFS

Let me explain this example:

  • The ahci driver is device-specific and interfaces with GEOM. The geom_disk module responds to attached devices and sits at the root of each physical disk.
  • If you have a GPT partition scheme on the disks, geom_gpt interprets it and exposes only the partitioned space to the next module.
  • The next module is geom_nop, which we configure to change the sector size to 4096 bytes. Everything earlier in the chain has 512-byte sectors; everything later sees a sector size of 4096 bytes.
  • There is only one later module, geom_label, which identifies our disk with a friendly name. This (/dev/label/disk1) is ultimately what is fed to ZFS.
When ZFS issues reads or writes, these travel all the way through the GEOM chain.

In this sense, you can change the sector size. But after a reboot there will no longer be a geom_nop module, because it stores no metadata anywhere; it's volatile. So ZFS then sees the device as having 512-byte sectors again, and I'm not sure whether that is safe. It might reject the disk because the data at a certain LBA is not what it should be, or because that LBA address doesn't exist (with a larger sector size the maximum LBA number gets lower). Again, I haven't tested any of this. When I do, I may have more definitive insight.
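Put together, the workaround I have in mind would look roughly like this (names are examples; note again that the gnop step is volatile, which is exactly the open question):

gnop create -S 4096 /dev/label/disk1       # repeat for each member disk; creates /dev/label/disk1.nop
zpool create tank raidz label/disk1.nop label/disk2.nop label/disk3.nop
zpool export tank                          # export before removing the .nop providers
gnop destroy /dev/label/disk1.nop          # or simply reboot; gnop keeps no on-disk metadata
zpool import tank                          # the pool comes back on the plain 512-byte providers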

But these are the sorts of things you get when you don't design technology cleanly. Much like storing dates as '90' instead of '1990'; that's just STUPID. But they will never learn, so we keep seeing such 'technology quirks'.
 
You are making it much more complicated than necessary. All other modern OS's and filesystems already write with a minimum block size of 4KiB. It does not matter what the sector size is. Just be sure to keep the block size at least 4KiB (and 4KiB alignment, of course). Then the 512B emulation will work efficiently. And true 512B drives will obviously work. That is why you do not see complaints from Windows 7, Vista, Linux, or Mac users about severe slowdowns with 4KiB/512B emulation drives (except if they misaligned their partitions).

It seems ZFS (and perhaps XP) are the only ones that have performance problems.

I still do not see why it is so difficult to force ZFS to use a minimum 4KiB block size. Surely the ZFS block size is not hard-coded at whatever sector size the drive returns? Isn't there an override?

By the way, the drives are reporting the sector-size correctly, since they function, from the outside, as 512B sector drives. The LBAs each address a 512B sector. The problem seems to come, as near as I can tell from your description, from ZFS lacking an easy way to specify block size separately from sector size. It is the block size that needs to be kept to 4KiB minimum.

From this post it looks like the ashift parameter may set both block alignment and block size:

http://arstechnica.com/civis/viewtopic.php?p=20734717#p20734717

That seems easy enough if it works. Bob did not notice much of a performance difference, but I do not think he was using RAID-Z.
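From what I can tell, you can check what a pool ended up with after creation; something along these lines (the exact invocation is from memory, so treat it as a sketch) should show the ashift, where 9 means 512-byte sectors were assumed and 12 means 4KiB:

zdb tank | grep ashift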
 
In ZFS RAID-Z the block size is a multiple of the sector size. So with a 4K sector size, this problem is fixed properly.

Because RAID-Z implements a storage method not implemented by anything else, it can use variable stripe sizes. Not only can, but must. These variable stripe sizes are multiples of the sector size. Thus we can say that ZFS's RAID-Z is indeed much more sensitive to the correct sector size than traditional RAID. That doesn't mean ZFS is bad; it means ZFS is just a lot different, and more sensitive to some things, like the sector size.

But if you do use the correct sector size, you have a lot of advantages in random writes of various sizes. You won't ever have to do a read+XOR+write cycle; you get the random write IOps of a single disk, which is pretty good for 'RAID5'; traditional RAID5 has much worse figures.

So if configured properly, RAID-Z can be pretty fast. Even with all the reliability features that ZFS employs, which mean it has to work harder in some respects, a lot of the smart features also bring performance benefits. The result should be something that is 'quite fast' depending on the hardware you give it, and that simply scales as you add more or faster hardware.

And let me repeat that filesystems with hardcoded 4K alignment would have problems as soon as SSDs with 8K pages come out. Lots of remapping would be required by the SSDs; not really clean, is it? A sector size of 8K would solve that.
 
Sub, what sort of access do you need for testing? I could set this PC up with 4-6 of the 1.5TB EARS and point a DNS name at it, and you could do testing on it. I'm not in a hurry to fill it with data, and if it helps get the issue sorted out natively then it's in everyone's interest. What sort of bandwidth is required for testing? I can provide a 1Mbit upload and 18Mbit down. But it's just a matter of loading scripts and running them locally, right?
 
Yes please! 6 disks would be my preference; then I can test:

3-disk RAID-Z (128 / 2 = good)
4-disk RAID-Z (128 / 3 = bad)
4-disk RAID-Z2 (128 / 2 = good)
5-disk RAID-Z (128 / 4 = good)
6-disk RAID-Z2 (128 / 4 = good)

I can also test with the geom_nop utility; this requires no reboots. I would need SSH access to a user who is a member of the 'wheel' group; then I would 'su' to root and do the tests. You can watch with the 'watch' utility. I'll give you the details later.

You could use the Mesa LiveCD so you wouldn't need a separate system disk per se. But you should only give internet access to port 22 (SSH); don't connect it directly to the internet otherwise. You could of course also use a separate system disk with FreeBSD installed. I don't need any special configuration or anything, but use a 64-bit distribution if possible and nothing earlier than FreeBSD 8.1-RELEASE. I may also want to test under 9-CURRENT; I can prepare a custom .iso for that.

Bandwidth would be very low; SSH is just small text packets and it's encrypted. So all you need is local hardware bandwidth, not network bandwidth.

It would be great if you can volunteer your system for some tests. Having comparable numbers should clear up this issue, I hope.
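For the record, the sequential-write test I have in mind is nothing fancier than something like this, run against each pool layout (size and path are just examples; compression is off by default, so the zeroes are not compressed away):

dd if=/dev/zero of=/testpool/zerofile bs=1m count=8000    # FreeBSD dd: bs=1m means 1MiB blocks
rm /testpool/zerofile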
 
OK, I will build the machine when I get home from work in 12 hours, then maybe do some IM/Skype or something to help me get it set up so you can access it tomorrow or so. I would love to be whatever help I can to your project.
 
And let me repeat that filesystems with hardcoded 4K alignment would have problems as soon as SSDs with 8K pages come out. Lots of remapping would be required by the SSDs; not really clean, is it? A sector size of 8K would solve that.

Not necessarily. The Intel SSDs do okay with 512B logical sectors and 4 KiB physical pages. And that sort of write combining would have much less of an efficiency hit when doing 4 KiB / 8 KiB combining than 512B / 4 KiB.

And there is no way that 8 KiB logical sector size will be supported by block devices. They are moving to 4 KiB logical sectors (but that is still a few years off), and that will probably be the standard for many years. If it increases again sometime in the far future, it would be to a lot more than 8 KiB.

So, as I wrote before, the issue is not logical sector size, but rather filesystems that can be given a minimum block size (cluster size, chunk size, whatever you want to call it) and all writes would then be aligned and at least that block size, regardless of logical sector size for the device. Conveniently, most modern OS's and filesystems already use a minimum block size of 4KiB.

ZFS is not as special as you say. There is no good reason why it could not support a minimum block size which is different from the logical sector size. Other OS's and filesystems already do that.
 
Not necessarily. The Intel SSDs do okay with 512B logical sectors and 4 KiB physical pages. And that sort of write combining would have much less of an efficiency hit when doing 4 KiB / 8 KiB combining than 512B / 4 KiB.
Only Intel, yes; the SandForce/Micron SSDs' great performance crashes when you do not align to 4K. As I understand it, this is because Intel remaps any misaligned LBA, i.e. if the start offset is not aligned to 4K it remaps the whole I/O so that it does begin at a page. This remapping would cause a lot of 'dynamic' data versus 'static' data and cause erase-block fragmentation on Intel SSDs. Without TRIM, this would cause degradation even quicker. With TRIM I think this isn't really an issue.

So yeah, great feature of the Intel firmware. But honestly, when 8K pages come out I think it would be neater if they could queue 4K random writes and put two of those writes (which are NOT contiguous) in one 8K page. That kind of remapping is not yet employed by any controller, I think.

And there is no way that 8 KiB logical sector size will be supported by block devices.
If this FreeBSD patch works, it might support those SSDs with 8K sectors natively, with potentially some performance benefits when used in combination with ZFS RAID-Z, depending on the SSD's penalty for 'emulating' smaller sectors.

So, as I wrote before, the issue is not logical sector size, but rather filesystems that can be given a minimum block size (cluster size, chunk size, whatever you want to call it) and all writes would then be aligned and at least that block size, regardless of logical sector size for the device. Conveniently, most modern OS's and filesystems already use a minimum block size of 4KiB.
Well, I disagree.

Each storage device, current and future, has physical limitations that require you to store information in certain chunks. It can work around these limitations with firmware tricks. But the best potential for performance gains comes when the filesystem and/or RAID engine has knowledge of these limitations, giving it the opportunity to adapt the I/O requests it issues in the first place and basically do things smarter than the SSD, which only knows about bits in logical LBA space. For systems that do not support this, you can use firmware emulation to stay compatible, but you should not withhold the true values from implementations that are smart enough to benefit from them. Since you can do both, it would work like this: newer systems use the optimal values and get some performance boost, older systems use the legacy emulated values and suffer a bit in performance. That last part already works; the former does not yet.

A lot of things in computer technology don't work well because they are not future-proof. They make assumptions; they use workarounds, hacks or other ugly technological constructs. I believe the best way to avoid these ugly constructs is to communicate the limitations and optimal operating parameters clearly, in a uniform and standardized way, so that we can build a clean foundation that can withstand the test of time.

For example, I could imagine 'sector size' becoming a deprecated value only used by older/legacy OSes in the future, replaced with something like:

minimum-block-size = 512 bytes (emulation will be done by the SSD firmware)
optimal-low-block-size = 8192 bytes (8K pages)
optimal-high-block-size = 131072 bytes (erase block size 128KiB)

If unused, the normal sector size of 512 bytes applies; modern NTFS (Vista and up) will assume 4K alignment and use that. Fine! If these new specs were utilized, the filesystem or RAID would have a lot more knowledge with which to adapt its I/O strategy. It's totally unclear how much this would pay off in performance, but in theory I believe it is the best way to design this kind of technology. The foundation you create today is the solution for tomorrow.

Amen. :D
 
As I understand it, this is because Intel remaps any misaligned LBA, i.e. if the start offset is not aligned to 4K it remaps the whole I/O so that it does begin at a page. This remapping would cause a lot of 'dynamic' data versus 'static' data and cause erase-block fragmentation on Intel SSDs. Without TRIM, this would cause degradation even quicker. With TRIM I think this isn't really an issue.

So yeah, great feature of the Intel firmware. But honestly, when 8K pages come out I think it would be neater if they could queue 4K random writes and put two of those writes (which are NOT contiguous) in one 8K page. That kind of remapping is not yet employed by any controller, I think.

No, that is obviously not the way it works. All block devices (SSDs, HDDs, advance format 4KiB HDDs) are addressed with 512B logical sectors. But the flash in most SSDs needs to be addressed in 4KiB pages. It is obviously impossible to both support 512B LBAs and align all possible I/O requests at 4KiB pages. So all SSDs collect the writes to some extent. The difference with Intel is that it aggressively parallelizes even small (512B) writes across its channels, writing them to different flash pages, whereas most of the other SSDs tend to collect those small writes into larger chunks and write them to the same page. The resulting fragmentation caused the G1 Intels to have a problem with permanent slowdowns after a lot of small random writes. But the G2 firmware improved the garbage collection algorithm and ameliorated that problem.

And on the contrary, SSDs do collect writes, even those with non-contiguous LBAs. If a sequence of writes come in that can be collected into a larger group, it would be crazy to write them one at a time (unless they are parallelized to different channels for better performance). And since the SSDs already maintain an LBA -> physical page mapping table, it is not difficult to do such collection.

As for the rest of the post, that is all a dream. Current block storage devices are all addressed with logical 512B sectors, regardless of the physical block size. In the next few years, they will be switching over to logical 4KiB sector addressing. The chance that there will be logical 8KiB sector addressing on any mainstream device is negligible.
 
Yes please! 6 disks would be my preference, then i can test:

3-disk RAID-Z (128 / 2 = good)
4-disk RAID-Z (128 / 3 = bad)
4-disk RAID-Z2 (128 / 2 = good)
5-disk RAID-Z (128 / 4 = good)
6-disk RAID-Z2 (128 / 4 = good)

I can also test with the geom_nop utility; this requires no reboots. I would need SSH access to a user who is a member of the 'wheel' group; then I would 'su' to root and do the tests. You can watch with the 'watch' utility. I'll give you the details later.

You could use the Mesa LiveCD so you wouldn't need a separate system disk per se. But you should only give internet access to port 22 (SSH); don't connect it directly to the internet otherwise. You could of course also use a separate system disk with FreeBSD installed. I don't need any special configuration or anything, but use a 64-bit distribution if possible and nothing earlier than FreeBSD 8.1-RELEASE. I may also want to test under 9-CURRENT; I can prepare a custom .iso for that.

Bandwidth would be very low; SSH is just small text packets and it's encrypted. So all you need is local hardware bandwidth, not network bandwidth.

It would be great if you can volunteer your system for some tests. Having comparable numbers should clear up this issue, I hope.

If it helps, I do have an X7SPA-HF-based play box sitting around that is only used for testing: 6x 2TB EARS plus a 16GB USB stick. I can open up a port to give access to the remote management console, which will let you do just about anything.
 
Only Intel, yes; the SandForce/Micron SSDs' great performance crashes when you do not align to 4K. As I understand it, this is because Intel remaps any misaligned LBA, i.e. if the start offset is not aligned to 4K it remaps the whole I/O so that it does begin at a page. This remapping would cause a lot of 'dynamic' data versus 'static' data and cause erase-block fragmentation on Intel SSDs. Without TRIM, this would cause degradation even quicker. With TRIM I think this isn't really an issue.

Mapping granularity causes all types of issues with SSDs, the least of which is write amplification. This is a general issue for all SSDs: doing non-aligned / sub-4K random I/O will tank performance fast (it will still be faster than an HDD, but way slower than aligned 4K I/O).

So yeah, great feature of the Intel firmware. But honestly, when 8K pages come out I think it would be neater if they could queue 4K random writes and put two of those writes (which are NOT contiguous) in one 8K page. That kind of remapping is not yet employed by any controller, I think.

Actual page sizes are quite a bit larger than 8K. Really, the way these things work is an indirect addressing table from logical address to physical address. Physical realities limit the real-world granularity of the mapping tables to a decent extent (a lot of it has to do with being able to store the mapping tables on power failure). Assuming proper mapping granularity, 4MB of random 4K writes will look like a 4MB sequential write from the perspective of the flash itself; the issue is that outside of the higher-end devices, they don't have small enough mapping granularity. If you are willing to waste a lot of space, you could do a self-consistent design that uses "mirrored" flash to store the logical LBA plus a large timestamp for each write, but that puts you at a minimum of 2x flash cost without taking a performance hit, while allowing everything to look like sequential I/O. Flash SSD performance in some regards is really a question of how much "dead" space you are willing to tolerate. The ideal case is of course mapping each LBA to a separate erase block, but that is highly inefficient.


Each storage device, current and in the future, has physical limitations that require you to store information in certain chunks. It can workaround these limitations by using firmware tricks. But the best potential for performance gains would be if the filesystem and/or RAID engine has knowledge about these limitations, to allow them the opportunity to adapt the I/O requests issued in the first place and basically do things smarter than the SSD, which only knows about bits on logical LBA space. For systems who do not support this, you can use firmware emulation to still stay compatible, but you should not withhold the true values for implementations that are smart enough to benefit from it. Since you can do both, it would be something like: newer systems use the optimal values and get some performance boost, older systems use the legacy emulated values and suffer a bit in performance. That last part already works; but the former does not yet.

The issue as I understand it is that the SAS/SATA standards contain specific fields that DO convey information such as physical sector size. AFAIK, Solaris et al. do support these fields. The issue specific to the Advanced Format drives currently available is that they falsely report a 512B physical sector size (there is a separate field for logical sector size). The supposed reason they falsely report the physical sector size is that reporting anything other than 512B apparently causes some older OSes to crash!
 
And on the contrary, SSDs do collect writes, even those with non-contiguous LBAs. If a sequence of writes come in that can be collected into a larger group, it would be crazy to write them one at a time (unless they are parallelized to different channels for better performance). And since the SSDs already maintain an LBA -> physical page mapping table, it is not difficult to do such collection.

The ability to coalesce writes is limited by the mapping granularity/range of the design. For some real physical reasons, the granularity/range has real limitations which can be shown via random I/O with varying span sizes. This AFAIK is true with all flash based SSDs currently on the market.
 
Actual page sizes are quite a bit larger than 8K.

I think you may be confusing page size with erase block size. In SSDs, page size generally refers to the smallest unit that is read or written to flash, and it is currently 4KiB for Intel MLC SSDs. The block size (or erase block) is the smallest unit that can be erased. That is currently 512 KiB for Intel MLC SSDs.

But now we have completely left the subject of this thread.
 
Didn't reply to this yet:
Well, I'm committed to the EARS path; they are the best price at my local store. I currently own 8, plus there's one in another PC that I can utilise.
You may want to buy more or fewer disks based on the outcome of the tests; if my trick works then you would want to stick to groups of 5 or 9 disks in RAID-Z, or groups of 6 or 10 disks in RAID-Z2. You can combine those in one pool.

I have all my hardware and am only waiting on the RAID cards and 2 more breakout cables. Would there be any problem using 1 card for 8 drives and 2 drives through the onboard SATA controller, and then, when I'm ready to add another 5-disk vdev, installing the second LSI and migrating those disks from onboard to the controller? I will end up with 3 vdevs of 5 disks, plus 1 drive as a hot spare. I have 6 SATA ports onboard, so I have many options for expanding.
Yes, you can mix onboard ports and ports on your controller; ZFS won't mind.

I'm starting off with 4GB of RAM, but can easily expand that up to 16GB, though I doubt I will need that much.
You may want to expand to 8GB in the future, to benefit from prefetching, which is disabled by default with <=4GiB RAM. You can enable it manually, but with 'only' 4GiB it might starve the ARC and thus lower performance instead. With 8GB the prefetching should work very well.
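(On FreeBSD, enabling it manually means setting a loader tunable; something like the line below in /boot/loader.conf, with the caution mentioned above.)

vfs.zfs.prefetch_disable=0    # 0 = prefetching on, 1 = off; ZFS disables it by itself on low-RAM systems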

4GiB is a good place to start, but ZFS can benefit from having more; it is quite memory-intensive. Having the ability to expand memory is therefore great if you want some extra juice in the future.

Hope we can do tests soon to confirm my theory! If it works, you can use your 4K sector drives with good performance in RAID-Z.

@aaronspink: testing on multiple systems would have my preference, so I can confirm results on both completely independent systems. So if you could offer your system for testing, that would be great! I would need SSH access to a normal user and root access using su. The OS needs to be FreeBSD 8.1 or later, but you can use my Mesa LiveCD as an easy pre-installed environment. All you would need to do is boot it and create a new user to allow SSH logins. Please do shield any other ports from the internet!

Perhaps I should create a separate thread for the tests? Ah yes, created a new thread here.
 
BTW, this is a good time for me to plug my favorite 2TB drive for RAID: the venerable Hitachi 7K3200. It's older, but very reliable, built like a tank, and doesn't do the fake 4K-sector kabuki dance that screws up ZFS and makes RAID a bigger hassle to configure.

Eventually we'll have real 4K drives that don't screw around with emulation, but for now I would avoid such drives in RAID. What's worse, if you use a non-4K drive from WD like the WD20EADS and you have to RMA it, chances are good you'll get a 4K EARS drive in exchange, which is a real drag.
 
Well, the point is that more and more drives are going 4K, and all the current ones use emulation for now. So investigating how to get good performance out of these drives in RAID-Z is worthwhile.

And you're probably right; if you buy an EADS now, you may get a replacement EARS tomorrow. All the more reason to get this issue sorted out. :)
 
BTW, this is a good time for me to plug my favorite 2 TB drive for RAID - the venerable Hitachi 7K3200.

Isn't that a 320GB Travelstar 2.5" HDD?

I guess you meant 7K2000?
 
It's been stated that 4K-sector drives that lie about being 512B hurt write performance with ZFS, but what about read performance?
 
It's been stated that 4K-sector drives that lie about being 512B hurt write performance with ZFS, but what about read performance?

Yes, it hurts that too, but not as badly: when the data being read is not 4K-aligned you end up doing an extra sector read. Generally it *should* be minor, as the reads should be sequential. If not, it can really tank performance.
 
I have ruled out one thing: accessing a ZFS filesystem that was created with a 4K sector size via geom_nop, after a reboot, exposes it to the 512-byte sector size again, since geom_nop is not retained across reboots.

I was initially worried that exposing ZFS to the 512-byte sectors again would throw it off. But nothing happened: ZFS happily accepted the 512-byte sector versions of the disks without problems. So at least this trick doesn't, or shouldn't, pose any danger to the pool.

Currently testing EARS in RAID-Z on Wingfat's system.
 
sub.mesa, what about geom_eli? Also, what about the other way around: a ZFS filesystem created using 512B and then changed to 4K?

My ZFS server has 2x WD20EADS and 2x WD20EARS, and I'm experiencing terrible multitasking performance. Even MP3 streams will cut out for a moment each time zpool iostat -v 1 indicates a major write, for example. Unacceptable.
 