Testing ZFS RAID-Z performance with 4K sector drives

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
This is a spin-off thread from a discussion in this thread.

I would like to test performance on FreeBSD+ZFS with multiple 4K sector hard drives in various RAID-Z vdev configurations. For this i would need access to other people's systems, and would ask anyone to consider offering their system for testing.

What do you need?
1. Server-PC which has internet access (port 22; ssh)
2. 64-bit FreeBSD OS (i recommend using the Mesa LiveCD)
3. 64-bit CPU and several 4K sector HDDs connected to non-RAID controller supported by FreeBSD.
4. At least 2GiB memory, more would be nice but not required.

How does the testing work?
1. We talk via PM/Email/IRC to discuss all details
2. I connect to port 22 to your server
3. You can watch with 'watch' utility, while i setup your disks and perform testing
4. Testing may take some time!
5. Any (valid) performance results i will post in this thread for discussion and feedback.

Hope this is a nice idea and thanks to anyone who already shared interest in helping with this. I think a separate thread is best so we can talk freely and in-depth about the subject.

Cheers! :)
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
The theory behind RAID-Z performance and 4K sector drives:

RAID-Z is somewhat odd; it is more like RAID3 than RAID5 really. To avoid confusion, let me explain how i understand this to work:

Traditional RAID
In traditional RAIDs we know stripesize; normally 128KiB. Depending on the stripe width (number of actual striped data disks) the 'full stripe block' would be <data_disks> * <stripesize> = full stripe block. In RAID5 the value of this full stripe block is very important:

1) if we write exactly the amount of data of this full stripe block, the RAID5 engine can do this at very high speeds, theoretically the same as RAID0 minus the parity disks.

2) if we write any other amount that is not a multiple of the full stripe block, then we have to do a read+xor+write procedure, which is very slow.

Traditional RAID5 engines with write-back essentially build up a queue (buffer) of I/O requests and scan for full stripe blocks which can be written efficiently; and will use slower read+xor+write for any smaller or leftover I/O.
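
The full stripe arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the 128KiB stripesize is the "normal" value mentioned in the post, and the write is assumed to start exactly on a stripe boundary.

```python
KIB = 1024

def full_stripe_bytes(data_disks, stripesize=128 * KIB):
    """Full stripe block = <data_disks> * <stripesize>."""
    return data_disks * stripesize

def split_write(write_bytes, data_disks, stripesize=128 * KIB):
    """Split a write into the part that fills whole stripes and the leftover.

    Returns (full_stripe_part, leftover). The leftover is the part that
    would need the slow read+xor+write path.
    """
    full = full_stripe_bytes(data_disks, stripesize)
    leftover = write_bytes % full
    return write_bytes - leftover, leftover

# 4-disk RAID5 = 3 data disks + 1 parity disk
print(full_stripe_bytes(3) // KIB)   # full stripe block in KiB
print(split_write(1024 * KIB, 3))    # a 1MiB write: (fast part, slow leftover)
```

With 3 data disks the full stripe block is 384KiB, so a 1MiB write holds two full stripes and leaves 256KiB for the slow path.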

RAID-Z
RAID-Z is vastly different. It will do ALL I/O in ONE phase; thus no read+xor+write will ever happen. How does it do this? It changes the stripe size so that each write request will fit in a full stripe block. The 'recordsize' in ZFS is like this full stripe block. As far as i know, you cannot set it higher than 128KiB, which is a shame really.

So what happens? For sequential I/O the request sizes will be 128KiB (maximum) and thus 128KiB will be written to the vdev. The 128KiB then gets spread over all the data disks. 128 / 3 for a 4-disk RAID-Z produces an odd value: each disk gets 42.5KiB or 43.0KiB. Both are misaligned at the end offset on 4K sector disks, requiring THEM to do a read whole sector + calc new ECC + write whole sector. This behavior is devastating for performance on 4K sector drives with 512-byte emulation: every single write request issued to the vdev forces the drive through its slow 512-byte emulation path.
(i'll rewrite this section sometime later)
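
The alignment arithmetic in this post can be checked with a small Python sketch. It only mirrors the simplified reasoning above (one 128KiB record spread over the data disks in whole 512-byte sectors); it is not a model of the real ZFS allocator, and the function names are invented for the example.

```python
RECORD = 128 * 1024   # maximum recordsize mentioned in the post
SECTOR_512 = 512      # sector size the drive reports (emulated)
SECTOR_4K = 4096      # physical sector size of the drive

def per_disk_sizes(data_disks, record=RECORD):
    """Spread one record over the data disks in whole 512-byte sectors.

    Returns the sorted list of distinct per-disk sizes in bytes.
    """
    total_sectors = record // SECTOR_512
    base, extra = divmod(total_sectors, data_disks)
    sizes = {base * SECTOR_512}
    if extra:
        # some disks have to take one extra sector
        sizes.add((base + 1) * SECTOR_512)
    return sorted(sizes)

def is_4k_aligned(data_disks, record=RECORD):
    """True if every per-disk share is a multiple of 4KiB."""
    return all(size % SECTOR_4K == 0 for size in per_disk_sizes(data_disks, record))

for disks in (3, 4, 5, 9):   # total disks in a single-parity RAID-Z
    data = disks - 1
    sizes_kib = [size / 1024 for size in per_disk_sizes(data)]
    state = "aligned" if is_4k_aligned(data) else "misaligned"
    print(f"{disks}-disk RAID-Z: {sizes_kib} KiB per disk, {state}")
```

Under these assumptions only the 4-disk case (3 data disks) ends up with shares of 42.5KiB and 43.0KiB; 3, 5 and 9 disks split into 64, 32 and 16KiB.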
 

dminja

n00b
Joined
Aug 21, 2010
Messages
24
I'll let you toy with my new build. It's in the mail.

Hard drive wise I purchased 3 SAMSUNG Spinpoint F4 HD204UI. Since I don't have many drives you won't be able to test many configurations, but at least it will be at 4K with the 3x 666GB platters.

My plans are for a raid-z1.
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Would love to test your F4s! These disks look quite good and should be very fast.

3-disk RAID-Z should do well out of the box. Let me know when you're ready to set things up. :)
 

dminja

n00b
Joined
Aug 21, 2010
Messages
24
Well I did buy from newegg and we know how well they package hard drives :(

Assuming the hardware isn't defective, early next week.
 

wingfat

Weaksauce
Joined
May 13, 2010
Messages
115
sub....
OK... you can have access to my smaller backup system currently running your 1.5 version.

Supermicro Atom based server board MBD-X7SPA-H-O
Supermicro AOC USAS-L8i
4 x 1TB Samsung HD103UJ
4 x 2TB WD20EARS Advanced format Green Drives
1 x 300GB Raptor (for Cache)
1 x 500 Seagate (for OS)

System config currently is 4 X1TB raidz "tank"
then added 4 X2TB raidz "tank" ---used expansion as each virtual device contains the same drives

system is on 16gb flash usb drive

email sent
 

kxy

n00b
Joined
Aug 28, 2010
Messages
19
Just got my Br10ix2 in the mail, assembling the machine now with 8x1.5TB EARS. Seems my brackets don't fit well, will have to order some from somewhere i think
 

kxy

n00b
Joined
Aug 28, 2010
Messages
19
Bah, seems I have bought the wrong cables, I bought Reverse breakout cables, they have the wrong size mini-sas connector on them. apparently I need forward breakout cables to go from mini-sas HBA to sata drives.
 

young_einstein

Weaksauce
Joined
Jun 20, 2010
Messages
102
Very interested in this thread.

Especially since I'm halfway through a new ZFS build, and I want to know whether the new Samsung F4 2TB disks [which are 4K] are going to cause me any problems.
 

axan

[H]ard|Gawd
Joined
Nov 5, 2005
Messages
1,935
Looking forward to seeing the results, I'm planning my new storage server and I think I'll jump on the zfs bandwagon.
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
I've given wingfat a 0.1.6 prerelease version; after he plays with it i hope to continue testing with more RAM available to ZFS. Then we can do real tests!
 

killagorilla187

Limp Gawd
Joined
Jul 11, 2008
Messages
224
I ordered a complete system today with 6GiB of ECC DDR3
and 6 Sammy F4 2TB drives.

As soon as it comes in I will be willing to give you ssh access to perform w/e tests you need sub.mesa.
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Did some preliminary testing on killagorilla187's system. He has 6 Samsung F4EG 5400rpm 2TB drives and i think 8GB RAM? Since i see more than 6GB.

I found out the disks don't like NCQ a lot when writing; possibly a side effect of the 512-byte sector emulation. So i tweaked loader.conf to use more kernel memory, but more importantly, i set the vdev pending requests min/max both to 1, meaning NCQ isn't used at all. And look at the scores i got... Keep in mind though; these are preliminary tests, so nothing conclusive yet!

without GNOP:
20971520000 bytes transferred in 51.919631 secs (403922747 bytes/sec)

with GNOP:
20971520000 bytes transferred in 49.088218 secs (427221049 bytes/sec)

Not too shabby i think. Though higher would certainly be possible; very little tweaking so far. 400MB/s is well above what gigabit ethernet can carry. But speeds can degrade when the filesystem gets more full and fragmentation starts kicking in; so you want to have plenty of headroom.
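
To put the two dd lines in more familiar units, here is a short Python sketch that recomputes the rates from the byte counts and timings quoted above. The 125 MB/s figure is the raw gigabit line rate before any protocol overhead, which is an assumption of this example rather than something measured in the thread.

```python
def dd_rate(total_bytes, seconds):
    """Bytes per second, as dd reports it."""
    return total_bytes / seconds

# (bytes transferred, seconds) copied from the dd output in this post
runs = {
    "without GNOP": (20971520000, 51.919631),
    "with GNOP": (20971520000, 49.088218),
}

# raw gigabit line rate: 10^9 bits/s = 125 MB/s (protocol overhead ignored)
GIGABIT_MB_S = 1000**3 / 8 / 1e6

for label, (nbytes, secs) in runs.items():
    rate = dd_rate(nbytes, secs)
    print(f"{label}: {rate / 1e6:.1f} MB/s = {rate / 2**20:.1f} MiB/s, "
          f"{rate / 1e6 / GIGABIT_MB_S:.1f}x raw gigabit")
```

The byte count is exactly 10000 blocks of 2MiB, i.e. a test file of roughly 19.5GiB.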

From this very synthetic (sequential write) benchmark you can see the 4K sector issue did not affect speeds much at all. I still have to do a lot of testing before i can draw any conclusions, though.

But wanted to share this, so far. I must say it does look like the issues with 4K sector sizes are overblown; but that image can change when we look at random writes.

Cheers!
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Some read scores as well:

without GNOP: 452MB/s
with GNOP: 540MB/s

For those who don't know 'gnop'; it is used to transform the sector size from 512 bytes to 4096 bytes (4K).
And i forgot to mention but this is RAID-Z configuration (with 6x Samsung F4 5400rpm).
 
Joined
Sep 25, 2010
Messages
29
I wonder if these performance benefits would exist even when using gnop on top of regular 512 byte disks... ? How about 8192 instead of 4096 bytes?
 

killagorilla187

Limp Gawd
Joined
Jul 11, 2008
Messages
224
sub.mesa said: "i think 8GB RAM? Since i see more than 6GB."

You are correct.

It looks like these drives might not be getting returned :)

I'm learning a lot from you sub.mesa! I love that you are tuning my server.
What command are you using for read tests?

I'm assuming your initial benches are:
Code:
# dd if=/dev/zero of=zerofile.000 bs=2M count=10000
# dd if=zerofile.000 of=/dev/null bs=2M

And anytime you change /boot/loader.conf, you need to reboot, correct? Or is there a way to apply kernel changes on the fly?
 

Akirasoft

n00b
Joined
Jul 26, 2004
Messages
27
I'll be following this thread as I've been really interested in ZFS and it looks like 4K drives are the future in the larger capacities.

Any plans to use a real benchmarking tool beyond just dd?
 

palmboy5

Limp Gawd
Joined
Oct 19, 2006
Messages
315
sub.mesa, has your "untested theory" regarding
3-disk RAID-Z = 128KiB / 2 = 64KiB = good
4-disk RAID-Z = 128KiB / 3 = ~43KiB = BAD!
4-disk RAID-Z = 96KiB / 3 = 32KiB = good
5-disk RAID-Z = 128KiB / 4 = 32KiB = good
9-disk RAID-Z = 128KiB / 8 = 16KiB = good
been tested yet?

Also, what is that odd ball 96KiB one and can that be applied to an existing array?
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Haven't gotten to that point yet, but i can confirm that using 96K recordsize will NOT work. It has to be a power of 2, not just a multiple. So 128KiB, 64KiB and 32KiB are ok, but 96K is not.

With Christopher's system i noticed much better RAIDZ performance with 3 disks than with 4 disks, though. But too early to say anything as of yet.

Recordsize can be changed on the fly; the new value is used only for new writes. 96K would have been a benefit on 4-disk RAID-Z (i.e. an uneven number of data disks).
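
The power-of-2 rule is easy to check in code. A minimal Python sketch, assuming a lower bound of 512 bytes (the post itself only names the 128KiB maximum); the helper names are made up for this example.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (exactly one bit set)."""
    return n > 0 and n & (n - 1) == 0

def valid_recordsizes(lowest=512, highest=128 * 1024):
    """List the power-of-2 sizes between the two bounds (bounds assumed)."""
    sizes = []
    size = lowest
    while size <= highest:
        sizes.append(size)
        size *= 2
    return sizes

for kib in (32, 64, 96, 128):
    verdict = "ok" if is_power_of_two(kib * 1024) else "not a power of 2"
    print(f"{kib}K: {verdict}")
```

96K is a multiple of 32K but has two bits set (64K + 32K), which is why it fails the check.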

@Akirasoft: no need really; if sequential I/O and random I/O works well; you should have good performance. If these don't work well, you will have bad performance. Once low-level disk I/O works properly you can move on to testing over the network and trying to mimic your own actual usage patterns more closely. But this sector size thing is local disk I/O related; so no direct relation with network performance.
 

Akirasoft

n00b
Joined
Jul 26, 2004
Messages
27
from the 10,000 ft view, I agree with you wholeheartedly. The devil can be in the details and IMO dd alone is a pretty poor benchmark tool. Great for showing an IMMEDIATELY apparent problem but I'd love to see something done with Iometer, filebench or even Bonnie++. In fact, I will be looking to do so around the holidays when I build a new fileserver and would welcome the opportunity to work with you on it.

don't get me wrong here, I love what you are doing here and I've found every one of your posts informative and the info you have up on your site is amazing so please do not take offense :)
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
No offense taken, feel free to share your considerations.

However, Bonnie++ is an invalid benchmark for write-back systems like ZFS; it does not do the warmup and cooldown periods that write-back mechanisms need. The result is that when the read test starts, data from the write test is still being processed. There was a patched bonnie somewhere, but generally avoid this benchmark for complex storage like ZFS. For a simple HDD it is fine though.

I did a lot of testing on RAIDs generally (much less so ZFS) and i think low-level tests are best to begin with. If you don't get your rated/predicted performance there, then this will show in 'real application performance' as well.

Sequential writing on ZFS is also MUCH different than on any other filesystem. ZFS is a copy-on-write transactional filesystem; much like a transactional database it processes writes in transactions and either commits to that transaction or rolls it back to a consistent state. These transactions can be tuned with the .txg variables in loader.conf; something i have not done so far. The moral of this story: even a sequential write is quite a task for ZFS. If you get good performance on sequential writes, that generally means ZFS is doing buffering well and has a balanced configuration for the hardware it is running on.

dd is a universal tool, it moves data from point A to point B. That can be directly on the device (dd if=/dev/ada4) but can also work on filesystems. dd on raw devices (especially if those are RAID arrays) isn't very useful, as you would miss the queue depth that filesystems apply. But when using dd to write or read from filesystems, all the optimizations apply that normally would be used when reading or writing that file; the filesystem takes care of those.
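
For readers who want to see what such a filesystem-level sequential test boils down to, here is a rough Python sketch of the same idea as dd on a file: write zero-filled blocks, force them to disk, then read the file back. It is not the tool used in this thread, the function names are invented, and a read of a small file will mostly come from cache, so real tests need a file much larger than RAM.

```python
import os
import tempfile
import time

def sequential_write(path, block_size, count):
    """Write `count` zero-filled blocks, fsync, return bytes per second."""
    block = bytes(block_size)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())   # make sure buffered data has reached the disk
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed

def sequential_read(path, block_size):
    """Read the whole file in `block_size` chunks, return bytes per second."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed

if __name__ == "__main__":
    # tiny demo size; far too small for a meaningful benchmark
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "zerofile.000")
        w = sequential_write(target, 2 * 2**20, 8)
        r = sequential_read(target, 2 * 2**20)
        print(f"write {w / 2**20:.0f} MiB/s, read {r / 2**20:.0f} MiB/s")
```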

But i could come up with some tests that involve managing very large amounts of files; which would look very much like a random I/O benchmark. Simply extracting an archive with huge amounts of small files could be a good test.

I would argue though, that NAS filesystems like ZFS are most likely utilized for storing large files securely, and thus sequential performance is very important and for some people the only important performance aspect. If you do things like iSCSI and use it for OS storage (as 'system disk' basically) like i do, then yes IOps performance becomes very important. But i run on SSDs instead as HDDs really are bad at IOps. The L2ARC feature to aid HDD filesystems is also a great way of adding random I/O performance, but only to those filesystems which need it. So for example you would activate the L2ARC for your Virtual Machines which do a fair bit of random I/O, but not for your Movie database, which only gets read and written sequentially without any file modifications. Thus an L2ARC cache device would be pointless for that type of file.

I love ZFS; its features are great and unmatched and it is in a usable state. However, i continue to be amazed by the wildly different performance characteristics people are getting with ZFS. On my system it runs very well even with modest tuning; but it appears some disks really don't like to do NCQ - something i had not noticed on my own HDDs until now. It would be great if ZFS would be able to tune itself; basically adapting to the hardware and I/O workload it is given. That would make ZFS even more awesome. For now, a little tuning can be exciting just like people like to overclock; trying to get your hardware to perform optimally. But i would love that not to be necessary anymore in the future.
 

Akirasoft

n00b
Joined
Jul 26, 2004
Messages
27
agreed, and I do like the notion of using dd to provide a quick test if everything is "working". I've frequently used it as a troubleshooting tool (found my way around the nasty Intel 945 SATA chipset in linux issues with it).

I am definitely reading that ZFS is "different", it was almost designed from the ground up to store media files which makes it perfect for this kind of thing.

and just for a little background, one of the reasons I tend to stray away from the LOWEST level benchmarking tools is due to my background as a performance engineer. I've been doing performance testing and engineering for quite some time now and typically the lowest level tests (while great for building a foundation to work with,) can not adequately expose situations that are encountered in the real world. I'm doing a lot of private cloud work right now (hadoop/hdfs and hbase stuff) and no low level tools even begin to properly show situations actual software exposes.

I will admit I'm a bit out of my element on this stuff as I'm primarily an app/db tier guy (j2ee mostly) and typically just do what I can assuming the DB will be slow.
 
Joined
Oct 6, 2010
Messages
3
Are your tests only with raidz? no raidz2 tests?
I have been thinking about a setup of 8xSamsung Eco F4 in raidz2.
But raidz2 on the 4K Samsung won't align with 8 disks, right?
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
To test performance, i began working on a benchmark script that tests your disks in several ZFS configurations (striping, mirroring, raidz, raidz2).

On its first test run, i got these results from Lipe's system, which is a quadcore Xeon 3440 2.53GHz with 4GiB RAM and 6x WD 2TB EARS (4-platter) disks:

Code:
ZFSGURU-benchmark, version 1
Test size: 20 gigabytes (GiB)
Test rounds: 3
Cooldown period: 2 seconds
Number of disks: 6 disks
disk 1: /dev/label/disk1
disk 2: /dev/label/disk2
disk 3: /dev/label/disk3
disk 4: /dev/label/disk4
disk 5: /dev/label/disk5
disk 6: /dev/label/disk6

Stopping background processes like sendmail, moused, syslogd and cron
Now testing stripe configuration with 1 disks......@......@......@
READ:   100 MiB/sec     101 MiB/sec     101 MiB/sec     = 101 MiB/sec avg
WRITE:  97 MiB/sec      97 MiB/sec      95 MiB/sec      = 96 MiB/sec avg

Now testing stripe configuration with 2 disks......@......@......@
READ:   203 MiB/sec     203 MiB/sec     203 MiB/sec     = 203 MiB/sec avg
WRITE:  186 MiB/sec     186 MiB/sec     185 MiB/sec     = 186 MiB/sec avg

Now testing stripe configuration with 3 disks......@......@......@
READ:   293 MiB/sec     295 MiB/sec     295 MiB/sec     = 294 MiB/sec avg
WRITE:  261 MiB/sec     259 MiB/sec     262 MiB/sec     = 261 MiB/sec avg

Now testing stripe configuration with 4 disks......@......@......@
READ:   395 MiB/sec     392 MiB/sec     392 MiB/sec     = 393 MiB/sec avg
WRITE:  331 MiB/sec     332 MiB/sec     332 MiB/sec     = 331 MiB/sec avg

Now testing stripe configuration with 5 disks......@......@......@
READ:   486 MiB/sec     484 MiB/sec     486 MiB/sec     = 485 MiB/sec avg
WRITE:  378 MiB/sec     395 MiB/sec     390 MiB/sec     = 388 MiB/sec avg

Now testing stripe configuration with 6 disks......@......@......@
READ:   578 MiB/sec     581 MiB/sec     583 MiB/sec     = 581 MiB/sec avg
WRITE:  429 MiB/sec     426 MiB/sec     421 MiB/sec     = 425 MiB/sec avg

Now testing mirror configuration with 2 disks......@......@......@
READ:   141 MiB/sec     142 MiB/sec     141 MiB/sec     = 141 MiB/sec avg
WRITE:  95 MiB/sec      93 MiB/sec      97 MiB/sec      = 95 MiB/sec avg

Now testing mirror configuration with 3 disks......@......@......@
READ:   218 MiB/sec     214 MiB/sec     217 MiB/sec     = 216 MiB/sec avg
WRITE:  94 MiB/sec      94 MiB/sec      92 MiB/sec      = 93 MiB/sec avg

Now testing mirror configuration with 4 disks......@......@......@
READ:   278 MiB/sec     279 MiB/sec     279 MiB/sec     = 279 MiB/sec avg
WRITE:  93 MiB/sec      93 MiB/sec      92 MiB/sec      = 93 MiB/sec avg

Now testing mirror configuration with 5 disks......@......@......@
READ:   332 MiB/sec     332 MiB/sec     332 MiB/sec     = 332 MiB/sec avg
WRITE:  89 MiB/sec      93 MiB/sec      93 MiB/sec      = 91 MiB/sec avg

Now testing mirror configuration with 6 disks......@......@......@
READ:   416 MiB/sec     418 MiB/sec     418 MiB/sec     = 417 MiB/sec avg
WRITE:  87 MiB/sec      84 MiB/sec      88 MiB/sec      = 86 MiB/sec avg

Now testing raidz1 configuration with 2 disks......@......@......@
READ:   101 MiB/sec     101 MiB/sec     101 MiB/sec     = 101 MiB/sec avg
WRITE:  92 MiB/sec      92 MiB/sec      92 MiB/sec      = 92 MiB/sec avg

Now testing raidz1 configuration with 3 disks......@......@......@
READ:   193 MiB/sec     194 MiB/sec     194 MiB/sec     = 194 MiB/sec avg
WRITE:  153 MiB/sec     153 MiB/sec     151 MiB/sec     = 152 MiB/sec avg

Now testing raidz1 configuration with 4 disks......@......@......@
READ:   220 MiB/sec     219 MiB/sec     222 MiB/sec     = 220 MiB/sec avg
WRITE:  146 MiB/sec     144 MiB/sec     148 MiB/sec     = 146 MiB/sec avg

Now testing raidz1 configuration with 5 disks......@......@......@
READ:   358 MiB/sec     356 MiB/sec     357 MiB/sec     = 357 MiB/sec avg
WRITE:  247 MiB/sec     243 MiB/sec     246 MiB/sec     = 245 MiB/sec avg

Now testing raidz1 configuration with 6 disks......@......@......@
READ:   338 MiB/sec     340 MiB/sec     332 MiB/sec     = 336 MiB/sec avg
WRITE:  230 MiB/sec     235 MiB/sec     229 MiB/sec     = 231 MiB/sec avg

Now testing raidz2 configuration with 3 disks......@......@......@
READ:   107 MiB/sec     106 MiB/sec     106 MiB/sec     = 106 MiB/sec avg
WRITE:  78 MiB/sec      81 MiB/sec      78 MiB/sec      = 79 MiB/sec avg

Now testing raidz2 configuration with 4 disks......@......@......@
READ:   161 MiB/sec     157 MiB/sec     158 MiB/sec     = 159 MiB/sec avg
WRITE:  50 MiB/sec      49 MiB/sec      50 MiB/sec      = 50 MiB/sec avg

Now testing raidz2 configuration with 5 disks......@......@......@
READ:   235 MiB/sec     237 MiB/sec     237 MiB/sec     = 236 MiB/sec avg
WRITE:  67 MiB/sec      66 MiB/sec      67 MiB/sec      = 67 MiB/sec avg

Now testing raidz2 configuration with 6 disks......@......@......@
READ:   402 MiB/sec     400 MiB/sec     402 MiB/sec     = 401 MiB/sec avg
WRITE:  232 MiB/sec     234 MiB/sec     234 MiB/sec     = 234 MiB/sec avg
You can see how RAID-Z performance drops from going from 5 to 6 disks, and how RAID-Z2 performance increased considerably when going to 6 disks. This would be consistent with my theory, but more testing is needed before i consider it 'verified'.
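
One way to read the table is to divide each write average by the number of data disks and compare it with the 128KiB alignment arithmetic. The Python sketch below does that with the averages copied from the output above; the per-data-disk normalisation and the helper names are choices made for this example, not part of the benchmark script.

```python
# average sequential write (MiB/sec) per total disk count, from the table above
WRITE_AVG = {
    "raidz1": {2: 92, 3: 152, 4: 146, 5: 245, 6: 231},
    "raidz2": {3: 79, 4: 50, 5: 67, 6: 234},
}
PARITY = {"raidz1": 1, "raidz2": 2}
RECORD = 128 * 1024

def data_disks(level, disks):
    return disks - PARITY[level]

def per_data_disk(level, disks):
    """Write throughput divided by the number of data disks."""
    return WRITE_AVG[level][disks] / data_disks(level, disks)

def splits_into_4k(data):
    """True if a 128KiB record divides into whole 4KiB pieces per data disk."""
    return RECORD % (data * 4096) == 0

for level, results in WRITE_AVG.items():
    for disks in sorted(results):
        data = data_disks(level, disks)
        mark = "aligned" if splits_into_4k(data) else "misaligned"
        print(f"{level} {disks} disks: {per_data_disk(level, disks):.1f} "
              f"MiB/s per data disk ({mark})")
```

For raidz1 the misaligned 4- and 6-disk sets come out lowest per data disk. The 4-disk raidz2 is aligned by this arithmetic yet shows only 25 MiB/s per data disk, so alignment alone does not explain every number in the table.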

This benchmark script should ease performance testing quite a bit; as it does all the testing without you needing to do stuff all the time. And it's pretty easy to use:

./gurubench.php /dev/label/disk{1..6}

As an example if your disks are called /dev/label/disk1 through /dev/label/disk6. Right now it only does sequential testing; but i want to extend it with raidtest so it can do random IOps testing as well.

Once i've got the script finished i'll release it and integrate it into the ZFSguru project as well. I would love it to display nice graphs; i'm researching that possibility right now.

Then everybody could do these tests; they take a long time, like 24 hours if you've got 6 or 8 disks, but the tests are consistent and done properly, without requiring you to be present all the time.

Cheers!
 

axan

[H]ard|Gawd
Joined
Nov 5, 2005
Messages
1,935
Good idea on the automated benchmark script should make things easy and consistent. You could use Rrdtool to add some nice graphs to it.

Btw there is definitely something odd about those 4K hdds, the performance in raidz2 is really whacky until it hits 6 disks.
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Indeed, but i still have to repeat this test with real 4K sectors using GNOP on all disks and with different tuning parameters as well. I've also enabled prefetching though the system has only 4GiB; which normally means the prefetching gets disabled by default. So this is not the 'final' score for his system; just the first time i ran this benchmark script and let it finish. ;-)

But you're right, the RAIDZ2 performance looks appalling until it hits 6 disks. That needs more testing, and i have to add that i did very little RAIDZ2 testing in general; though on my own system the write score didn't crash as dramatically as the scores posted above.

Since i got offered more test systems from you guys, i can compare things and rule out any issue with a particular system.

Also, keep in mind the EARS used in the above benchmarks are slightly older 4-platter disks i believe; not the newest 3-platter with 666GB platters and enhanced firmware. This enhanced firmware would give much better performance for smaller transactions, according to some online review which looks plausible. The F4 disks may be a bit faster, though i have no data to compare F4 with the newer EARS that also have 666GB platters.

I think i may need a few days for the benchmark script and graph plotting; after that i would like to test all your systems again with the script and get a lot of data we can use for comparison in this thread. And if we have that data processed into images, that would be much easier to comprehend and interpret.
 

john4200

[H]ard|Gawd
Joined
Oct 16, 2009
Messages
1,537
By your theory, shouldn't a 4-disk raidz2 be pretty good (128KB/2), and a 5-disk raidz2 be bad (128KB/3)? But the benchmark shows the 5-disk raidz2 outperforming the 4-disk raidz2 by a considerable margin (although the write speed is terrible on both).
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Well be careful about making conclusions just yet; the fact that the RAID-Z2 performance crashes below 6 disks may have a different cause that i'm not yet familiar with. Testing on other people's system will help answer this.

When looking at RAID-Z we can see both the 5-disk doing better than 6-disk and 3-disk doing better than 4-disk, at least for writing which is the key thing here. So the RAID-Z results gathered so far appear to follow the theory. RAID-Z2 performs poorly though; i would love to test this on other systems as well so we can get a clear picture.

But these aren't very big differences; i don't see a reason why you couldn't saturate gigabit with 4K disks; if you have problems with that it's either a network issue or you need to tune FreeBSD/ZFS. I did simple tuning for these benchmark results:
- kernel memory 3500MB out of 4GB
- ARC_min=2GB
- ARC_max=3GB
- min_pending=1
- max_pending=1

For each changed setting, i would need to re-run the benchmark script to get valid results; if you change multiple settings at once you can't draw many conclusions, as you can't tell which of the variables affected performance.
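
The "one setting per run" rule can be written down as a tiny test plan generator. This Python sketch is purely illustrative: the baseline uses the values listed above, while the alternative values and the function name are hypothetical and not settings that were tested in this thread.

```python
def one_at_a_time(baseline, alternatives):
    """Yield (changed_key, config) pairs.

    The first pair is the unchanged baseline; every following config
    differs from the baseline in exactly one setting.
    """
    yield None, dict(baseline)
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue   # same as baseline, nothing new to measure
            config = dict(baseline)
            config[key] = value
            yield key, config

# baseline taken from the tuning list in this post
baseline = {"min_pending": 1, "max_pending": 1, "ARC_min": "2GB", "ARC_max": "3GB"}
# hypothetical alternatives, only here to show the mechanism
alternatives = {"max_pending": [1, 4, 32], "ARC_max": ["3GB", "2GB"]}

for changed, config in one_at_a_time(baseline, alternatives):
    print(changed or "baseline", config)
```

Each config would then get its own full benchmark run, so any difference in the scores can be traced to the single setting that changed.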

But i would say i'm on the right track for testing this theory; the automated tests and plotted images would be very helpful in getting a good picture about ZFS performance in various configurations.
 

mikesm

Limp Gawd
Joined
Mar 2, 2005
Messages
178
All this discussion makes me even more convinced the right thing to do is stay with the 2TB Hitachis until real 4K drives are available. I wish drive vendors would include a jumper that switches between 512-byte emulation and straight 4K sector size. But I suspect that when the true 4K drives are released we will have to pay more to bypass the emulation, as they will be for enterprise configurations.

Thx
Mike
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
I had been working on my graph php script to produce images using GD-lib. It's not finished yet, but this will allow anyone to do the same tests i did, producing nice graphics like these from Lipe's system:

[Attached graphs: Bench-LIPE-1.png, Bench-LIPE-2.png]


The idea is to integrate this into the web-interface; so all you need to do is start the benchmark and wait for the images to appear on the screen; which can take a long time but this is very easy testing. It also makes interpreting the data much easier.

These test results are the same as posted above; i will run more tests with different tuning. Feel free to comment on the graph itself too; anything else i should put in that graph? colours ok?
 

killagorilla187

Limp Gawd
Joined
Jul 11, 2008
Messages
224
Well, after a month of testing, I've decided to return my Samsung F4's and get the Hitachi's. I don't feel like tweaking them to make them work correctly and I will be adding other drives to the mix that should use NCQ. Although they performed pretty well with a 6 disk raidz, I've decided they just aren't worth the extra work to get optimal performance, and you do lose NCQ across all future drives.
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
In that case, killagorilla, it would be interesting to compare the performance figures. If you have little time, you can run a very small benchmark that runs only for a few hours (sequential only, 250MB testsize, 2 rounds, 2sec cooldown). Then when you get your new Hitachi drives, perform the benchmark using the same settings, and see how they compare.

If you've had enough of benchmarking and just want to get this into production, then i understand.
 

killagorilla187

Limp Gawd
Joined
Jul 11, 2008
Messages
224
Unfortunately, sub.mesa, I have already boxed them up. Your LiveCD crashes my system, so I didn't run any of your benchmarks. I do have other benchmarks recorded, and I will be comparing and posting them. All the benches I have are for sequential reads and writes, though, so they probably won't be as helpful as yours, but they should give a good idea of performance.
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Alright. Well, it would still be great if you could post numbers for your newer drives, especially in RAID-Z / RAID-Z2. They would indeed not be directly comparable, but nonetheless your experiences could benefit others.

Hope the new Hitachis work well for you.
 

black6spdz

n00b
Joined
Sep 25, 2010
Messages
54
killagorilla187, you've got me a little worried now, since I just ordered 5 of the 2TB F4s. What were their shortcomings? I plan to run a RAID-Z1 config, and as long as I can saturate a single GigE interface I will be happy.
 

killagorilla187

Limp Gawd
Joined
Jul 11, 2008
Messages
224
Black6spdz, you will easily be able to saturate a Gigabit line. Don't worry about that. I'll post my benchmarks when I get home tomorrow and then post the Hitachi benches on Monday, as long as there are no DOAs.
 

killagorilla187

Limp Gawd
Joined
Jul 11, 2008
Messages
224
A couple of benchmarks to compare the Samsung F4s with the Hitachis.
Loader.conf settings:
Code:
[root@ZFS_Server /]# cat /boot/loader.conf
vm.kmem_size="7g"
vm.kmem_size_max="8g"
vfs.zfs.arc_min="2g"
vfs.zfs.arc_max="3g"
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"

Samsung F4 5400rpm 2TB
ZFS WITHOUT GNOP:
Code:
raidz2:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 74.675596 secs (280834987 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 45.003330 secs (465999294 bytes/sec)

raidz:
[root@ZFS_Server /]# zpool status media
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 63.724312 secs (329097629 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 48.710287 secs (430535752 bytes/sec)

no redundancy (raid0):
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da3       ONLINE       0     0     0
          da4       ONLINE       0     0     0
          da5       ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 40.556101 secs (517099020 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 29.596897 secs (708571580 bytes/sec)

two raidz:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 63.419061 secs (330681654 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 45.607933 secs (459821759 bytes/sec)

raidz with 5 drives instead of 6:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 55.305749 secs (379192406 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 45.440789 secs (461513113 bytes/sec)
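
For readers who prefer MB/s to raw bytes/sec, here is a minimal Python sketch that parses a dd summary line of the form shown above and recomputes the rate. It assumes the FreeBSD dd output format seen in these transcripts; the function name `parse_dd_summary` is made up for this example. Note that each test writes 10000 x 2 MiB = 20,971,520,000 bytes, which is far larger than the 3g `vfs.zfs.arc_max` from loader.conf.

```python
import re

# Matches the summary line printed by FreeBSD dd, as seen in the transcripts above.
DD_SUMMARY = re.compile(
    r"(\d+) bytes transferred in ([\d.]+) secs \((\d+) bytes/sec\)"
)


def parse_dd_summary(line):
    """Return (bytes, seconds, MB/s, MiB/s) for one dd summary line."""
    match = DD_SUMMARY.search(line)
    if match is None:
        raise ValueError("not a dd summary line: %r" % line)
    total_bytes = int(match.group(1))
    seconds = float(match.group(2))
    # Recompute the rate from bytes and seconds instead of trusting the printed one.
    rate = total_bytes / seconds
    return total_bytes, seconds, rate / 1e6, rate / 2**20


# The 6-disk raidz2 results without GNOP, copied from the transcript above.
write_line = "20971520000 bytes transferred in 74.675596 secs (280834987 bytes/sec)"
read_line = "20971520000 bytes transferred in 45.003330 secs (465999294 bytes/sec)"

for label, line in (("write", write_line), ("read", read_line)):
    total, secs, mb_s, mib_s = parse_dd_summary(line)
    print("%-5s %.1f MB/s (%.1f MiB/s)" % (label, mb_s, mib_s))
```

MB/s here means 10^6 bytes per second and MiB/s means 2^20 bytes per second; the two differ by about 5%, so it is worth stating which one is meant when comparing numbers.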

Samsung F4 5400rpm 2TB
ZFS with GNOP 4k sectors:

Code:
raidz2:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        media        ONLINE       0     0     0
          raidz2     ONLINE       0     0     0
            da0.nop  ONLINE       0     0     0
            da1.nop  ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
            da3.nop  ONLINE       0     0     0
            da4.nop  ONLINE       0     0     0
            da5.nop  ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 66.255921 secs (316522957 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 47.856496 secs (438216790 bytes/sec)


raidz:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        media        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da0.nop  ONLINE       0     0     0
            da1.nop  ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
            da3.nop  ONLINE       0     0     0
            da4.nop  ONLINE       0     0     0
            da5.nop  ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 52.692128 secs (398001007 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 42.707100 secs (491054647 bytes/sec)

no redundancy (raid0):
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          da0.nop   ONLINE       0     0     0
          da1.nop   ONLINE       0     0     0
          da2.nop   ONLINE       0     0     0
          da3.nop   ONLINE       0     0     0
          da4.nop   ONLINE       0     0     0
          da5.nop   ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 38.766298 secs (540972986 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 31.030934 secs (675826261 bytes/sec)

two raidz:
[root@ZFS_Server /]# zpool status media
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        media        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da0.nop  ONLINE       0     0     0
            da1.nop  ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da3.nop  ONLINE       0     0     0
            da4.nop  ONLINE       0     0     0
            da5.nop  ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 67.693783 secs (309799794 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 49.204880 secs (426208132 bytes/sec)

raidz with 5 drives instead of 6:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        media        ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            da0.nop  ONLINE       0     0     0
            da1.nop  ONLINE       0     0     0
            da2.nop  ONLINE       0     0     0
            da3.nop  ONLINE       0     0     0
            da4.nop  ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 50.730330 secs (413392146 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 45.638181 secs (459517000 bytes/sec)
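
To relate the 5-drive raidz numbers to the Gigabit question asked earlier in the thread, here is a rough Python sketch. It compares the dd rates above with the raw Gigabit Ethernet line rate. This is an upper bound only: protocol overhead and the file sharing layer (Samba, NFS and so on) are not modelled, and dd from /dev/zero is a purely sequential local test.

```python
# Gigabit Ethernet signals at 10**9 bits per second. Framing and protocol
# overhead lower the usable figure, so this is an upper bound on the network side.
GIGE_BYTES_PER_SEC = 10**9 / 8  # 125,000,000 bytes/sec

# dd rates in bytes/sec for the 5-drive raidz pools of Samsung F4s,
# copied from the transcripts above.
FIVE_DRIVE_RAIDZ = {
    "write, no GNOP": 379192406,
    "read, no GNOP": 461513113,
    "write, GNOP 4K": 413392146,
    "read, GNOP 4K": 459517000,
}


def headroom(rate):
    """Return how many times the local rate exceeds the raw GigE line rate."""
    return rate / GIGE_BYTES_PER_SEC


for label, rate in FIVE_DRIVE_RAIDZ.items():
    print("%-15s %6.1f MB/s = %.1fx the GigE line rate"
          % (label, rate / 1e6, headroom(rate)))
```

All four sequential figures are more than three times the raw line rate, which supports the claim that a single Gigabit link would be the bottleneck for large sequential transfers. It says nothing about random I/O or small files.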

Hitachi 2TB 7200rpm:
Code:
raidz2:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 62.793066 secs (333978277 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 45.092142 secs (465081478 bytes/sec)

raidz:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 47.178551 secs (444513864 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 41.838399 secs (501250539 bytes/sec)

no redundancy (raid0):
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          da0       ONLINE       0     0     0
          da1       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da3       ONLINE       0     0     0
          da4       ONLINE       0     0     0
          da5       ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 31.458234 secs (666646448 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 27.880972 secs (752180378 bytes/sec)

two raidz:
[root@ZFS_Server /]# zpool status
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 63.795675 secs (328729494 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 42.682719 secs (491335144 bytes/sec)

raidz with 5 drives instead of 6:
[root@ZFS_Server /]# zpool status media
  pool: media
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0

errors: No known data errors
[root@ZFS_Server /]# dd if=/dev/zero of=/media/zerofile.000 bs=2M count=10000
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 54.700831 secs (383385767 bytes/sec)
[root@ZFS_Server /]# dd if=/media/zerofile.000 of=/dev/null bs=2M
10000+0 records in
10000+0 records out
20971520000 bytes transferred in 45.046736 secs (465550268 bytes/sec)
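
To make the three result sets easier to compare, here is a small Python sketch that tabulates the sequential write rates from the dd summaries above and shows the change from using GNOP on the Samsung F4s. The numbers are copied from the transcripts. Each is a single run, so small differences may well be noise.

```python
# Sequential write rates in bytes/sec, copied from the dd summaries above.
# Tuple order: Samsung F4 without GNOP, Samsung F4 with GNOP, Hitachi.
WRITE_RATES = {
    "raidz2 (6 disks)": (280834987, 316522957, 333978277),
    "raidz (6 disks)": (329097629, 398001007, 444513864),
    "stripe (6 disks)": (517099020, 540972986, 666646448),
    "2 x raidz (3+3)": (330681654, 309799794, 328729494),
    "raidz (5 disks)": (379192406, 413392146, 383385767),
}


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


print("%-18s %8s %8s %8s %8s" % ("layout", "F4", "F4+GNOP", "change", "Hitachi"))
for layout, (f4, f4_gnop, hitachi) in WRITE_RATES.items():
    print("%-18s %8.1f %8.1f %+7.1f%% %8.1f"
          % (layout, f4 / 1e6, f4_gnop / 1e6,
             percent_change(f4, f4_gnop), hitachi / 1e6))
```

Rates are printed in MB/s (10^6 bytes per second). In these runs GNOP improved F4 write speed in every layout except the two 3-disk raidz vdevs, where it was slightly slower.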

Off topic: what vfs.zfs.vdev.min_pending/vfs.zfs.vdev.max_pending settings should I use for these Hitachis?
 