ZFS NAS build, how to arrange 8x1.5TB drives?

You would have to tell me more about your ZFS server. What OS? Is it 64-bit? How much memory? How much kernel memory can ZFS use?

As I've noticed in some cases, systems with 4GB of RAM assign only 1GB or so to ZFS by default, which is too little, and ZFS gets memory starved. But I've yet to do more thorough research on ZFS performance in general; for now I'm focusing on issues that persist with 4K sector drives.

So your low performance may not be due only to the 4K sectors; first make sure ZFS has enough memory. 3GB dedicated to ZFS is a reasonable start, but that won't happen by default with 4GiB of RAM on FreeBSD; you would have to tune /boot/loader.conf, by adding:
vm.kmem_size="3g"
vm.kmem_size_max="3g"
vfs.zfs.arc_max="2g"

Also see: http://wiki.freebsd.org/ZFSTuningGuide
 
I have FreeBSD x64 8.1-RELEASE installed on the following hardware:
PC Power and Cooling PPCS370X 370W
AMD Athlon 64 X2 4000+
Gigabyte GA-MA69GM-S2H w/ BIOS version F6
Corsair XMS2 2x1GB DDR2-800
Western Digital AV-type 80GB IDE OS drive
2x Western Digital WD20EADS 2TB
2x Western Digital WD20EARS 2TB

I temporarily upgraded the server to 6GB RAM and set loader.conf to have

ahci_load="YES"
vm.kmem_size="5g"
vm.kmem_size_max="5g"
vfs.zfs.arc_max="4g"

Rebooted and tried my test:
Play a 1080p mkv (stored on server)
Boot into a VMWare Windows XP VM (stored on server)
Observe if the 1080p video stutters

It did.

This test scenario was not uncommon back on my ext4 & Ubuntu based server, and I never experienced this lack of performance there.

4GB RAM got used up during the test... I can see what you mean about ZFS being memory intensive.

Is this an issue caused by the EARS drives or is it something else?

As the server is now, I have to hold off on doing multiple things at once, and that's just unacceptable. Please help! :(

P.S. My fastest transfer speeds have been just over 30MB/s over gigabit ethernet, compared to the 60MB/s+ I got with my RAID-less Ubuntu & Atom based server.
 

FreeBSD uses Samba for CIFS networking. My experience is that it's hard to get Samba above a 30-40 MB/s transfer rate on individual flows, but others here have seen 100+ MB/s rates. No one seems to understand why one Samba install runs fast and another slow.

OpenSolaris uses a kernel-based CIFS implementation, which seems to be consistently much faster.
 
I have FreeBSD x64 8.1-RELEASE installed on the following hardware:
I temporarily upgraded the server to 6GB RAM and set loader.conf to have

ahci_load="YES"
vm.kmem_size="5g"
vm.kmem_size_max="5g"
vfs.zfs.arc_max="4g"
arc_max should be lower; try 2g instead. You may also wish to tune the transaction groups. The loader.conf on my ZFSguru livecd/usb image has lots of tuning options; I can send you a copy by PM if you like.
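
To give an idea of the direction I mean (just a sketch, not the full ZFSguru loader.conf; the txg value is only a starting point to experiment with):

Code:
vm.kmem_size="5g"
vm.kmem_size_max="5g"
vfs.zfs.arc_max="2g"
# cap how much dirty data a single transaction group may collect (bytes)
# experiment with this value; it is not a universal recommendation
vfs.zfs.txg.write_limit_override="268435456"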

Rebooted and tried my test:
Play a 1080p mkv (stored on server)
Boot into a VMWare Windows XP VM (stored on server)
Observe if the 1080p video stutters

It did.
This is most likely due to transaction groups becoming too big for your hardware to handle in time, and thus long sync times. There are several ways to solve this, but first you should address the raw local disk sequential speeds. Once those work properly you can tune the transaction groups for a smoother writing experience.
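
To check the raw sequential speeds you can read straight from the devices, bypassing ZFS entirely (read-only, so harmless; ada1 and ada2 are just example device names, use ad* instead if you are not on the AHCI driver):

Code:
dd if=/dev/ada1 of=/dev/null bs=1m count=4000
dd if=/dev/ada2 of=/dev/null bs=1m count=4000
# repeat for each data disk and compare the MB/s figures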

Also, Windows XP can't do asynchronous I/O and speaks an older SMB protocol version; that may be an issue too. You might also want to look at your video player and increase its buffering time. Buffering 2 seconds is much better than the 100 milliseconds some players appear to use.

This test scenario was not uncommon back on my ext4 & Ubuntu based server, and I never experienced this lack of performance there.
Keep in mind that ext4 aims at performance at the cost of data integrity, while ZFS aims much more at data integrity; it works much like a transactional database. All your I/O is written in chunks which form a transaction group. Some YouTube videos exist that explain this better; you may want to review them.

ext4 basically does 'async' or asynchronous I/O; that's why it is so fast and has so little overhead. If you want to compare ext4 with ZFS on equal terms, you would want to turn these off:

Code:
# disable BIO flushes
# disables metadata sync mode and uses async I/O without flushes
# ONLY USE FOR PERFORMANCE TESTING
#vfs.zfs.cache_flush_disable="1"

# disable ZIL (ZFS Intent Log)
# warning: disabling can sometimes improve performance, but you can lose data
# that was recently written if a crash or power interruption occurs.
# ONLY USE FOR PERFORMANCE TESTING
#vfs.zfs.zil_disable="1"

4GB RAM got used up during the test... I can see what you mean about ZFS being memory intensive.
If you had 64GB RAM, ZFS would use that too. UFS and the ext filesystems are no different, but the usage is reported differently in top. UFS caching shows up under 'Inact' in top's memory line; the ZFS ARC shows up under 'Wired'. Not sure how Linux reports it.
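
If you want to see exactly how much the ARC is using right now, these sysctls should show it on FreeBSD (names from memory, so double-check on your system):

Code:
# current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.size
# configured ceiling
sysctl vfs.zfs.arc_max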

Is this an issue caused by the EARS drives or is it something else?
Did you try making two pools, one with the EADS drives and one with the EARS drives? Then you could test whether the 4K sector issue really is the root of your problems (it may not be). Try some simple local dd benchmarks before you test network performance:

# write 10GiB; make sure compression is disabled on the /pool/ filesystem
dd if=/dev/zero of=/pool/zerofile.000 bs=1m count=10000
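
If you decide to split the drives by type, a rough sketch would look like this; pool and device names are examples only, and zpool create wipes whatever is on those disks, so only do this while reorganising:

Code:
zpool create eadstest /dev/ada1 /dev/ada2
zpool create earstest /dev/ada3 /dev/ada4
# same 10GiB write test on each pool, then a read-back
dd if=/dev/zero of=/eadstest/zerofile.000 bs=1m count=10000
dd if=/dev/zero of=/earstest/zerofile.000 bs=1m count=10000
dd if=/eadstest/zerofile.000 of=/dev/null bs=1m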

As the server is now, I have to hold off doing multiple things at once and that's just unacceptable. Please help! :(
You should isolate your problem first! Is it network related, disk related, or ZFS related? It could be one of those, or several problems at once, which makes diagnosing more difficult.
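
For the network leg, a raw TCP test that leaves out both Samba and the disks is useful; iperf from ports would do (the IP is an example):

Code:
# on the server
iperf -s
# on your desktop
iperf -c 192.168.1.100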

Try the above suggestions and we'll get to the root of your problems and to smooth streaming performance! If possible, keep the 6GiB RAM in the system for now; that lets you run tests where RAM starvation shouldn't be an issue (do lower the ARC though, as discussed above).
 
FreeBSD uses Samba for CIFS networking. My experience is that it's hard to get Samba above a 30-40 MB/s transfer rate on individual flows, but others here have seen 100+ MB/s rates. No one seems to understand why one Samba install runs fast and another slow.

OpenSolaris uses a kernel-based CIFS implementation, which seems to be consistently much faster.
I do dislike Samba. I once even called it the worst open source project ever. Not because it is unpopular, but because it is one of the oldest open source projects implementing core protocol functionality used by a wide group of people, and thus it has all the potential of being a sleek, bug-free, working-out-of-the-box piece of software. Instead, Samba has become bloated and has had many bugs in stable releases. It also didn't perform well in all respects under anything other than Linux.

But I have to say, with Samba 3.4 it does appear to run faster on FreeBSD now, especially when combined with Vista/Win7, which support asynchronous I/O. I do get 100MB/s writes now, though reads are still a tad lower. In FreeNAS some problems were related to FreeBSD 7.x not auto-tuning the TCP send/receive buffer space; FreeBSD 8.x does do this, so that problem is gone.
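
The smb.conf knobs I would look at are roughly these (a sketch; it assumes your Samba port was built with AIO support, and the byte thresholds are just values to experiment with):

Code:
[global]
   # hand reads/writes larger than this many bytes to async I/O
   aio read size = 16384
   aio write size = 16384
   socket options = TCP_NODELAY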

On the other hand, I have also noticed some people having trouble with the OpenSolaris kernel CIFS driver; some even opted to move to FreeBSD because it has Samba and OpenSolaris does not. The kernel CIFS driver also appears to be broken in the svn_134 release. In other words, perhaps it's just that the grass is greener on the other side. ;)

Feel free to test: you can boot my livecd and just go to \\<IP> in Windows. It has a tmpfs share, which is memory-backed, so disk performance is excluded. If that still gives you low performance, it is likely due to Samba / the Windows CIFS protocol. I got these speeds with CrystalDiskMark on Win7:

Code:
-----------------------------------------------------------------------
CrystalDiskMark 3.0 x64 (C) 2007-2010 hiyohiyo
                           Crystal Dew World : http://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 byte/s [SATA/300 = 300,000,000 byte/s]

           Sequential Read :    74.930 MB/s
          Sequential Write :   111.479 MB/s
         Random Read 512KB :    71.680 MB/s
        Random Write 512KB :   107.921 MB/s
    Random Read 4KB (QD=1) :    20.388 MB/s [  4977.5 IOPS]
   Random Write 4KB (QD=1) :    18.262 MB/s [  4458.5 IOPS]
   Random Read 4KB (QD=32) :    58.794 MB/s [ 14354.1 IOPS]
  Random Write 4KB (QD=32) :    79.731 MB/s [ 19465.6 IOPS]

  Test : 1000 MB [Z: 14.9% (515.1/3468.1 MB)] (x5)
  Date : 2010/09/25 18:34:43

These scores look awesome to me, except for the lower read performance. That is something I have noticed for years with Samba: writing is usually faster than reading. But you can see the write speeds are quite high; 111MB/s is pretty much full gigabit throughput once TCP/IP overhead is excluded. Also notice how the random read/write IOps benefit greatly from a higher queue depth, something that only works with async-I/O-capable clients and server.

Note that CPU usage was very high on the Win7 client (3.6GHz Phenom II), and Windows runs many of its I/O and network backends single-threaded, so 50% CPU usage means it is CPU bottlenecked. But you can see writing works a lot better than reading. This is with an untuned smb.conf and a normal onboard gigabit network (alright, the server runs an Intel NIC, but I also tested the onboard NIC on the server and it made little difference), and without jumbo frames. But since the client is Win7 and the server runs Samba with async I/O, asynchronous I/O is used, which should greatly benefit performance.
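
If you would rather reproduce the memory-backed share test on your own server instead of my livecd, a rough sketch (share name and size are arbitrary):

Code:
# create a 2GB memory-backed filesystem
mkdir -p /mnt/ramtest
mdmfs -s 2g md /mnt/ramtest
# then add a share for it to smb.conf and restart Samba:
# [ramtest]
#    path = /mnt/ramtest
#    read only = no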
 
Yes please! 6 disks would be my preference, then i can test:

3-disk RAID-Z (128 / 2 = good)
4-disk RAID-Z (128 / 3 = bad)
4-disk RAID-Z2 (128 / 2 = good)
5-disk RAID-Z (128 / 4 = good)
6-disk RAID-Z2 (128 / 4 = good)

I quoted sub.mesa, but my question is for the thread. Going by this quote, does RAID-Z use distributed parity like RAID 5, or does it store the parity on a single disk like RAID 4?

With RAID 5, which uses distributed parity, does the above quote apply? Or does parity data affect the "evenness" of the data distribution when writing to the disks? For example if you have 64KB of data to write to a 5-disk RAID 5, is that (64KB/4) + (parity/5) or (64KB + parity)/5?
 
RAID-Z is NOT implementable by ANY RAID engine; it requires the filesystem and the RAID engine to be one package, or at least to share logic. RAID-Z uses variable stripe sizes to accommodate the file it is writing; this prevents read-modify-write cycles.

RAID-Z is not RAID-4; it uses distributed parity, so all disks contain both parity and data. A better comparison is actually RAID-3, since RAID3 also has no read-modify-write cycles; it achieves this by increasing the effective sector size. ZFS works similarly by spreading its 128KiB recordsize over all available data disks. So with 4 disks in RAID-Z you have 3 data disks, and 128KiB gets spread over 3 disks, which leads to strange fractions not aligned to 4KiB boundaries.
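
A quick back-of-the-envelope check of how a 128KiB record divides over the data disks (plain sh integer math, just to illustrate the alignment point):

Code:
# bytes per data disk for a 128KiB (131072-byte) record
for n in 2 3 4 5; do
  chunk=$((131072 / n))
  echo "$n data disks: $chunk bytes each, 4KiB multiple: $((chunk % 4096 == 0))"
done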

So RAID-Z is quite unique. Its biggest advantages are a smaller write-hole window of opportunity and higher random write performance than either RAID4 or RAID5. Random writes are the mortal enemy of RAID5, so RAID-Z tries to be very balanced and shouldn't be particularly slow for any I/O workload; though of course random writing is still slower than random reading.
 
So with 4 disks in RAID-Z you have 3 data disks, and 128KiB gets spread over 3 disks, which leads to strange fractions not aligned to 4KiB boundaries.

This is what I was asking about. Does the same issue apply to RAID 5? If you have 4 disks in RAID 5 and a 128KB stripe, and you get misalignment on the data, is misalignment avoided on the parity, because the number of disks is even?
 
RAID5 puts 128KiB stripes on a single disk; RAID-Z actually does the reverse: it spreads a 128KiB record over all available disks. If the record is smaller (random I/O), then the 'stripes' on the individual disks get smaller as well. Consider this:

RAID5
disk1: 64KiB stripe
disk2: 64KiB stripe
disk3: 64KiB parity

All the stripes combined are called a full stripe block or aggregated stripe block. This excludes any parity (1 virtual parity disk with RAID5; 2 with RAID6). So the formula is:

full stripe block = (nr_total_disks - nr_parity_disks) * <stripesize> = (3 -1) * 64KiB = 128KiB

So basically this is a RAID0 of two disks with a 64KiB stripesize plus a virtual parity disk. On the first 'full stripe block' disk 1 holds the parity, on the second full stripe block disk 2, and so on, so the parity gets spread evenly over the disks. The same goes for RAID-Z. With RAID3 the parity does not rotate, but that is not a problem because of how RAID3 works. With RAID4 it does hurt performance, and since RAID5 is superior to RAID4 with no disadvantages, RAID4 is obsolete; RAID3 is not.

Now consider how RAID-Z would stripe the data:

The max recordsize is always 128KiB; this is what RAID would call the 'full stripe block'. Consider the same setup:

RAIDZ (sequential write; max recordsize=128KiB)
disk1: 64KiB stripe
disk2: 64KiB stripe
disk3: 64KiB parity

Not much different, is it? Now add one disk, making the total disk count even (which is bad):

RAIDZ (sequential write; max recordsize=128KiB; even disk count)
disk1: 43KiB stripe
disk2: 43KiB stripe
disk3: 43KiB stripe
disk4: 43KiB parity

Because RAID-Z is now doing 128KiB / (4 - 1) = ~43KiB per disk, you get the strange fractions I mentioned earlier. If you have 4K sector disks with 512-byte emulation, like all the current ones, then you are in trouble now! I'm still analysing this behaviour. Now add another disk so the disk count is uneven again, which is good:

RAIDZ (sequential write; max recordsize=128KiB)
disk1: 32KiB
disk2: 32KiB
disk3: 32KiB
disk4: 32KiB
disk5: 32KiB parity

4*32KiB = 128KiB. The stripes per disk have become smaller, since the maximum recordsize stays at 128KiB and cannot grow larger. I consider this a weak point of ZFS, but let's focus on how it works first. Say you do random writes of only 16KiB; now look at the stripe configuration on the disks:

RAIDZ (write 16KiB file like random writes)
disk1: 4KiB
disk2: 4KiB
disk3: 4KiB
disk4: 4KiB
disk5: 4KiB parity

4*4KiB = 16KiB; so here the stripe size on the disks shrinks to accommodate the file in question. This is the key difference between conventional RAID5 and RAID-Z: it adapts the stripe size to the workload, which is very cool. Only a combined filesystem and RAID engine could do this. That also means it is theoretically impossible for hardware RAID to implement RAID-Z or to read properly from a RAID-Z container (presenting it as one big virtual LBA). For that to work, you would need to read ZFS-specific metadata; thus RAID-Z is not really RAID at all; there is no fixed on-disk specification of how the data is distributed. Rather, ZFS is a filesystem that uses multiple disks directly, while traditional filesystems always get put on a single volume and you need something like RAID or LVM to make that bigger than a single disk.
 
So going by this example, would it be fair to say that having an even number of disks works out better for RAID 5 than it does for RAID-Z? Also, if you write 128KiB of data to a parity RAID array, does that require 128KiB of parity data to be generated also?
 
Thanks for helping, sub.mesa.

With the following settings (no transaction group tuning):
ahci_load="YES"
vm.kmem_size="5g"
vm.kmem_size_max="5g"
vfs.zfs.arc_max="2g"
I got this on a volume with compression disabled:
Code:
[root@brisbane-1 /home/vcn64ultra]# dd if=/dev/zero of=/mnt/downloads/zerofile.000 bs=1m count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 30.957154 secs (338718475 bytes/sec)

I don't think my issue has to do with transaction groups, because read-only performance also drops to unusable levels with multiple reads happening at once. Even the 1080p video + VM booting counts as a multi-read test, since I only ever see maybe a single 3MB write operation occur while watching zpool iostat -v 1.

I can't stream a 1080p mkv while doing two (one is fine) simultaneous file copies from the server to my desktop (in sig). I also don't know how differently ExactFile reads in data, but with a single file being SHA512-checksummed alongside the 1080p mkv stream, the video still started stuttering. There is just terrible multitasking performance on the server; I can't imagine how bad it would be if this thing was serving more than just myself.
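
For what it's worth, something like this would reproduce the multi-reader case locally, with the network out of the picture (file names are just examples):

Code:
# two big sequential readers at once, straight off the pool
dd if=/mnt/downloads/movie1.mkv of=/dev/null bs=1m &
dd if=/mnt/downloads/movie2.mkv of=/dev/null bs=1m &
# watch per-disk throughput while they run
zpool iostat -v 1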
 
I want to try a small transaction group size, something that would only take a fraction of a second to write to disk(s). So something like:
vfs.zfs.txg.write_limit_override="33554432"

Which should mean a ~34MB max transaction group size, correct? But when I do that the write speed ends up being UBER slow, something like 16KB/s... What am I doing wrong? :confused:

Thanks!
 