ZFS raidz2 performance

Joined
Sep 16, 2002
Messages
634
So I've built a zfs raidz2 storage pool out of four 2TB WD EARS (green power) drives. I'm not expecting super performance out of these drives. I'm mostly just double-checking to make sure my array is running as ideally as it can.

Informal benchmarks with dd(1) show about 46 megabytes/sec write performance. I haven't done any system performance tuning yet.

Code:
#  dd if=/dev/zero of=test.dat bs=1M count=10000 
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 226.371684 secs (46320988 bytes/sec)
The system is running FreeBSD 8.1 amd64 with 8GB of registered ECC RAM. The CPU is an Intel Xeon X3430 ("Lynnfield") running at 2.4GHz. The drives are attached to the Intel ICH10 SATA controller running in AHCI mode. FreeBSD is using the ahci(4) driver. I'm using the raw drives with no fdisk slices on them.

Does this seem like reasonable performance?
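
For reference, here is a small sketch of turning the dd summary into a throughput figure. The helper name dd_rate is made up for illustration (it is not part of dd), and it only needs a POSIX shell with awk:

```sh
#!/bin/sh
# dd_rate BYTES SECONDS
# Hypothetical helper: prints throughput in decimal MB/s and binary MiB/s.
dd_rate() {
    awk -v b="$1" -v s="$2" 'BEGIN {
        printf "%.1f MB/s, %.1f MiB/s\n", b / s / 1000000, b / s / 1048576
    }'
}

# Figures taken from the dd run above.
dd_rate 10485760000 226.371684   # 46.3 MB/s, 44.2 MiB/s
```

Writing zeros from /dev/zero only measures sequential writes, so this number says nothing about random I/O.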
 

sub.mesa

2[H]4U
Joined
Feb 16, 2010
Messages
2,508
Try these things:

set your kmem (vm.kmem_size) to physical memory - 1GB (so for 8GB RAM you pick 7GiB kernel memory)
set your arc max (vfs.zfs.arc_max) to 75% of the kmem
set min and max pending requests per vdev (vfs.zfs.vdev.min_pending and vfs.zfs.vdev.max_pending) to 1
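
The three rules above can be sketched as a bit of shell arithmetic. This is only an illustration of the rule of thumb, not an official tuning tool; the function name zfs_tunables is hypothetical, and it assumes RAM is given in whole GiB:

```sh
#!/bin/sh
# zfs_tunables RAM_GIB
# Hypothetical helper: prints loader.conf lines following the rule of thumb
# kmem = RAM - 1GB, arc_max = 75% of kmem, min/max pending = 1.
zfs_tunables() {
    ram_gib=$1
    kmem_gib=$(( ram_gib - 1 ))
    arc_max_mib=$(( kmem_gib * 1024 * 75 / 100 ))
    printf 'vm.kmem_size="%sg"\n' "$kmem_gib"
    printf 'vm.kmem_size_max="%sg"\n' "$kmem_gib"
    printf 'vfs.zfs.arc_max="%sM"\n' "$arc_max_mib"
    printf 'vfs.zfs.vdev.min_pending="1"\n'
    printf 'vfs.zfs.vdev.max_pending="1"\n'
}

# For an 8GB machine this gives a 7g kmem and a 5376M arc_max.
zfs_tunables 8
```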

Then redo your benchmark; you did use the correct dd command.

Can I see your zpool status output?

Cheers.
 

sub.mesa

Here's my loader.conf, which you can use as a skeleton:

Code:
#
## ZFSguru advanced tuning
## note that editing loader.conf only works on USB or ZFS-on-root installs
#

## KERNEL MEMORY
#
# Do not exceed your RAM size! Note available RAM is lowered by 500MB when
# using the USB or LiveCD distribution. You can increase kernel memory up to
# 1GB less than your physical RAM.
#
# if you exceed the limits of kernel memory, you may get a panic like this:
# panic: kmem_malloc(131072): kmem_map too small: <totalkmem> total allocated
#
#vm.kmem_size="3g"
#vm.kmem_size_max="3g"


## ZFS tunables

# ARC limits
# tune these in accordance with the vm.kmem_size setting; you can increase
# arc_max up to the value of vm.kmem_size minus 1GB.
#
# note: kernel memory size needs to be larger than maximum ARC size (arc_max)
# note: zfs uses more memory than just the ARC; don't make the ARC too big
#
#vfs.zfs.arc_min="512m"
#vfs.zfs.arc_max="1g"

# ARC metadata limits
# increase to cache more metadata (recommended if you have enough RAM)
#vfs.zfs.arc_meta_limit="128m"

# ZFS prefetch disable
# ZFS disables prefetching by default if you have <= 4GiB RAM or are running
# 32-bit; setting this value to 0 forces prefetching to be used anyway
#vfs.zfs.prefetch_disable="0"

# ZFS transaction groups (txg)
# ZFS is not unlike a transactional database in the sense that it processes
# your data in transaction groups. The bigger the txg is, the more data it
# can hold but also the longer flushing transaction groups will take.
# experiment with these values to improve temporary 'lags' or 'hangs'
# override maximum txg size in bytes
#vfs.zfs.txg.write_limit_override="0"
# target number of seconds a txg will be synced (tune this!)
#vfs.zfs.txg.synctime="5"
# maximum number of seconds a txg will be synced (tune this!)
#vfs.zfs.txg.timeout="30"

# vdev cache settings
# should be safe to tune; but be careful about your memory limits
#vfs.zfs.vdev.cache.bshift="16"
#vfs.zfs.vdev.cache.size="10m"
#vfs.zfs.vdev.cache.max="16384"

# vdev pending requests
# this manages the minimum/maximum of outstanding I/Os on the vdevs
# this should be safe to tune; best setting depends on your disks
# ssds may prefer higher settings
#vfs.zfs.vdev.min_pending="4"
#vfs.zfs.vdev.max_pending="32"

# other vdev settings
# I/O requests are aggregated up to this size
#vfs.zfs.vdev.aggregation_limit="131072"
# exponential I/O issue ramp-up rate
#vfs.zfs.vdev.ramp_rate="2"
# used for calculating I/O request deadline
#vfs.zfs.vdev.time_shift="6"

# disable BIO flushes
# disables metadata sync mode and uses async I/O without flushes
# ONLY USE FOR PERFORMANCE TESTING
#vfs.zfs.cache_flush_disable="1"

# disable ZIL (ZFS Intent Log)
# warning: disabling can sometimes improve performance, but you can lose data
# that was recently written if a crash or power interruption occurs.
# ONLY USE FOR PERFORMANCE TESTING
#vfs.zfs.zil_disable="1"

## other tuning
kern.maxfiles="950000"

## mandatory kernel modules (REQUIRED)
zfs_load="YES"
#geom_uzip_load="YES"
#tmpfs_load="YES"

## recommended kernel modules
ahci_load="YES"
siis_load="YES"

## optional kernel modules
#geom_md_load="YES"
#nullfs_load="YES"
#unionfs_load="YES"

# end #
 
Joined
Sep 16, 2002
Messages
634
Thanks, sub.mesa, for your tuning hints; I'm going to implement those in just a minute. It requires rebooting the box, and it's also running pf(4) as my internet router.

Code:
 zpool status
  pool: titan
 state: ONLINE
 scrub: scrub completed after 1h6m with 0 errors on Sat Oct  2 14:03:57 2010
config:

        NAME                STATE     READ WRITE CKSUM
        titan               ONLINE       0     0     0
          raidz2            ONLINE       0     0     0
            label/zfsdisk1  ONLINE       0     0     0
            label/zfsdisk2  ONLINE       0     0     0
            label/zfsdisk3  ONLINE       0     0     0
            label/zfsdisk4  ONLINE       0     0     0

errors: No known data errors
No, the zpool scrub was not running when I did my initial benchmarks. I just ran it after extracting a backup to make sure the drives were running correctly.

The disks are using glabel labels, which were created as follows:

Code:
glabel label zfsdisk1 /dev/ada1
glabel label zfsdisk2 /dev/ada3
glabel label zfsdisk3 /dev/ada4
glabel label zfsdisk4 /dev/ada5
Also sub.mesa, I saw your Testing ZFS with 4K drives thread, and would be happy to oblige. The system has a public IP address and is available via ssh.
 

sub.mesa

Alright, looks good; I think with the basic tuning it would run a lot better already.

I can do the tuning for you if you like and do some benchmarking as well. If you want to go ahead with that, you should create a normal user and make it part of the 'wheel' group so I can use 'su' to become root. You should give the normal user a strong password with random chars, to protect against brute force attempts made via the internet.

pw useradd -G wheel sub
passwd sub

That's all you should need to do; then send me a PM with all the data, including your IP. Please also tell me what I should not touch; i.e. any disk that you want me to leave alone or other things you have already taken into production.

Thanks for the opportunity!
 
Joined
Sep 16, 2002
Messages
634
I added the following to /boot/loader.conf:

Code:
vm.kmem_size="7g"
vm.kmem_size_max="7g"
vfs.zfs.arc_min="512m"
vfs.zfs.arc_max="5376M"
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"
Performance has improved by about 28%, that's a huge increase!

Code:
#  dd if=/dev/zero of=test.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 177.169443 secs (59184924 bytes/sec)
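
As a sanity check on the percentage, the change between two dd rates is just (new - old) / old. A minimal sketch, with pct_change as a made-up helper name:

```sh
#!/bin/sh
# pct_change OLD_BYTES_PER_SEC NEW_BYTES_PER_SEC
# Hypothetical helper: prints the relative change in percent, one decimal.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN {
        printf "%.1f\n", (new - old) / old * 100
    }'
}

# Untuned run versus the run after the loader.conf changes.
pct_change 46320988 59184924   # 27.8
```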
 

sub.mesa

Still, 60MB/s is way too low a score. Did you check whether the loader.conf values were actually used? For example:

sysctl vfs.zfs | grep arc_

sysctl vm | grep kmem_size

Are you using any UFS partitions? If you run 'top', do you see a high memory allocation to 'Inact'? If so, UFS is eating your memory. It uses all your free memory and starves ZFS; this is still a problem and is getting fixed (at least on -CURRENT).
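
One thing that can make the sysctl check confusing: loader.conf takes suffixed sizes like "7g", while sysctl reports these values in plain bytes. A small sketch for converting one into the other so they can be compared; to_bytes is a hypothetical helper that assumes binary multiples (k = 1024, and so on):

```sh
#!/bin/sh
# to_bytes SIZE
# Hypothetical helper: converts a size such as 512m, 5376M or 7g to bytes,
# assuming binary multiples. Values without a suffix are passed through.
to_bytes() {
    num=${1%[kKmMgG]}
    case $1 in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

to_bytes 7g      # 7516192768
to_bytes 5376M   # 5637144576

# On the FreeBSD box itself, a comparison could then look like:
#   [ "$(sysctl -n vfs.zfs.arc_max)" = "$(to_bytes 5376M)" ] && echo "arc_max applied"
```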
 
Joined
Sep 16, 2002
Messages
634
PM sent. Thanks!
 

sub.mesa

Alright, I'm in!

I can confirm your tests; got 60MB/s write and 200MB/s read.

Changed some boot settings; could you reboot the server?

Also, is it running a quad-core or a dual-core with hyperthreading? In the latter case, could you disable hyperthreading in the BIOS?
 
Joined
Sep 16, 2002
Messages
634
It's a quad core (no hyperthreading). I'll reboot the system in just a moment. Thanks for your help!
 
Joined
Sep 16, 2002
Messages
634
Okay, rebooted for your min/max pending changes.

Code:
dd if=/dev/zero of=test.dat bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 164.695637 secs (63667503 bytes/sec)
If you ever find yourself in the San Francisco Bay Area, and you want to, I owe you at least one Beverage Of Choice under the PHK BEER-WARE license sub.mesa.
 

sub.mesa

Hmm, I don't understand; you have all the hardware to make this thing go fast but you get low performance somehow. With another system I got scores of 611MB/s write and 730MB/s read for 6x Samsung F4EG 5400rpm 2TB 4K-sector disks.

Tested your devices and SMART data; looks fine. Devices read approx 120MB/s.

Oh, one more question; I noticed you use a custom kernel; any big changes there? I'm just grasping for reasons why you get such low performance. But then, I would need to test more thoroughly to discover the reason; those tests are destructive in nature so I held off on those. ;-)
 
Joined
Sep 16, 2002
Messages
634
The custom kernel pretty much only removes stuff that isn't in the system, switches to ahci(4) instead of ata(4), adds altq(4) for pf, and adds ipmi/smbus.

The kernel config is /usr/src/sys/amd64/conf/TITAN if you're curious.

As far as destructive tests, the files on the zfs pool are all backed up in /backups/manual/titan.tar.gz so it can be recreated. As long as you don't blow away ada0 or ada2 you can run any destructive tests you want. Let me shut down samba so it won't get cross...

Go for it.
 

sub.mesa

Alright. ;-)

Might continue on this tomorrow if that's okay with you. Also testing killagorilla's system now.

ZFS is pesky with performance sometimes, until you find the right balance in which things run smoothly. Most likely auto-tuning will be enhanced to find this balance automatically in the future. But plenty of things can ruin performance, such as 'duds': disks that have much lower performance and may drag the entire array down.

I hope we can find what is preventing ZFS from shining on your system, because it sure must be capable of it!
 

sub.mesa

Ah, I got your setup mixed up and believed it was raidz1. So the maximum score you can get, counting only the data disks at about 110MB/s each, would be:

raid0: 4 * 110 = 440MB/s
raidz1: 3 * 110 = 330MB/s
raidz2: 2 * 110 = 220MB/s
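
The same estimate as a quick sketch: data disks (total minus parity) times the per-disk sequential rate. This is only a rough ceiling that ignores all overhead, and max_seq_mbps is a made-up helper name:

```sh
#!/bin/sh
# max_seq_mbps DISKS PARITY_DISKS PER_DISK_MBPS
# Hypothetical helper: rough upper bound on sequential throughput in MB/s.
max_seq_mbps() {
    echo $(( ($1 - $2) * $3 ))
}

max_seq_mbps 4 0 110   # raid0:  440
max_seq_mbps 4 1 110   # raidz1: 330
max_seq_mbps 4 2 110   # raidz2: 220
```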

Right now you get 60MB/s write in raidz2 and 210MB/s in raidz1; huge difference!
Now going to try GNOP on your disks.
 

sub.mesa

20971520000 bytes transferred in 87.785515 secs (238894993 bytes/sec)

239MB/s RAID-Z1 write, but only with THREE disks; your score actually is lower with a fourth disk.

I would need to test this out more thoroughly, but 240MB/s for a three-disk RAID-Z is really good. I can think of tons of reasons for this, but I'd better do that tomorrow. ;-)
 
Joined
Sep 16, 2002
Messages
634
Thanks so much! Feel free to take your time with this, it took me a week just to get off my duffers and install the 4 disks in the system. I'd rather we make zfs better & faster.

This server system is primarily for tinkering & learning anyway, so taking several days/weeks/months to understand performance is a big plus.
 
Joined
Sep 16, 2002
Messages
634
Hey sub.mesa,

Do you have any theories on why 4x raidz performs so poorly on this system? If one of the drives is bad I'd like to get it replaced.
 

sub.mesa

Hey Christopher,

Sorry, I've had a lot of systems in testing right now. I have been testing untuned performance a lot, and am now moving to full tuning to see how high a score I can get. I would love to resume testing on your system as well.

Not sure why the RAID-Z2 performance was so much lower. It should only be a tad lower for sequential transfers. But benchmarking ZFS is tricky; the results are not consistent at all and proper benchmarking takes A LOT of time. So I will do more comprehensive testing on my own system, I think, and do basic testing on your systems. Testing with too many variables takes too much time, and I noticed tuning affects your systems differently than mine. I still have a lot to learn about ZFS performance, so this is a learning experience for me as well.

I think I did test all drives individually; that is normally the first thing I do. But I am mixing up your systems now; I have a .txt file with all the different systems in it and should redo some tests. This could take a while still.

I understand you guys want to take your systems into production. Would one week more testing be possible?
 
Joined
Sep 16, 2002
Messages
634
Hey sub.mesa,

No problem! I just wanted to make sure you weren't suspicious of a failing drive before the exchange window at the local Fry's expired. Since it seems like all four drives are generally healthy, I'll let it be.

Two notes for you:

1. I zero filled and media scanned all four of the 2TB drives last night to make sure they were behaving correctly. If you had any configuration on them I've probably erased it. Feel free to (re)setup anything you want with those four drives.

2. I swapped the SATA ports the drives were connected to for consistency. The SATA diagram in the manual lied. The drives are now configured as follows:

Code:
ada0: <WDC WD1001FALS-00J7B0 05.00K05> ATA-8 SATA 2.x device <-- Don't touch please (Boot UFS drive, and /home)
ada1: <WDC WD10EACS-00D6B0 01.01A01> ATA-8 SATA 2.x device <-- Don't touch please (UFS storage for backups / offsite backup staging )
ada2: <WDC WD20EARS-00MVWB0 51.0AB51> ATA-8 SATA 2.x device <-- Have at it
ada3: <WDC WD20EARS-00MVWB0 51.0AB51> ATA-8 SATA 2.x device <-- Have at it
ada4: <WDC WD20EARS-00MVWB0 51.0AB51> ATA-8 SATA 2.x device <-- Have at it
ada5: <WDC WD20EARS-00MVWB0 51.0AB51> ATA-8 SATA 2.x device <-- Have at it
Cliff notes: Try not to destroy ada0 or ada1, ada{2,3,4,5} are empty and can be used for any testing you wish to complete.

Thanks again Sub.mesa!
 
Joined
Sep 16, 2002
Messages
634
So I wanted to keep you updated on what I've discovered so far. Things have finally slowed down at work so I can invest some time on this project.

I've recreated a raidz2 array with the four disks using GPT partitions on the drives, starting at a 1 megabyte offset. The partitions were created as follows:

Code:
gpart create -s GPT ada2
gpart add -b 2048 -s 3907027087 -t freebsd-zfs -l zfsdisk1 -i 1 ada2
glabel label -v zfsdisk1 ada2p1
gnop create -S 4096 /dev/label/zfsdisk1
gpart create -s GPT ada3
gpart add -b 2048 -s 3907027087 -t freebsd-zfs -l zfsdisk2 -i 1 ada3
glabel label -v zfsdisk2 ada3p1
gnop create -S 4096 /dev/label/zfsdisk2
gpart create -s GPT ada4
gpart add -b 2048 -s 3907027087 -t freebsd-zfs -l zfsdisk3 -i 1 ada4
glabel label -v zfsdisk3 ada4p1
gnop create -S 4096 /dev/label/zfsdisk3
gpart create -s GPT ada5
gpart add -b 2048 -s 3907027087 -t freebsd-zfs -l zfsdisk4 -i 1 ada5
glabel label -v zfsdisk4 ada5p1
gnop create -S 4096 /dev/label/zfsdisk4
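
The reason for starting at sector 2048 is alignment: assuming the drives report 512-byte logical sectors (which is what the -b value counts in), 2048 sectors is exactly 1 MiB, a multiple of the 4096-byte physical sector size. A small sketch to check a start sector; is_4k_aligned is a hypothetical helper, and the 512-byte sector size is an assumption hard-coded into it:

```sh
#!/bin/sh
# is_4k_aligned START_SECTOR
# Hypothetical helper: assumes 512-byte logical sectors and reports whether
# the resulting byte offset falls on a 4096-byte boundary.
is_4k_aligned() {
    offset_bytes=$(( $1 * 512 ))
    if [ $(( offset_bytes % 4096 )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

is_4k_aligned 2048   # aligned    (1 MiB offset used above)
is_4k_aligned 63     # misaligned (traditional fdisk start sector)
is_4k_aligned 34     # misaligned (first usable sector after a standard GPT)
```

The gnop -S 4096 step is a separate matter: it presents the provider with 4096-byte sectors so that the pool is created with 4K-sized blocks in mind.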
Next, the raidz2 was created:

Code:
zpool create titan raidz2 /dev/label/zfsdisk{1,2,3,4}.nop

zpool status
  pool: titan
 state: ONLINE
 scrub: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        titan                   ONLINE       0     0     0
          raidz2                ONLINE       0     0     0
            label/zfsdisk1.nop  ONLINE       0     0     0
            label/zfsdisk2.nop  ONLINE       0     0     0
            label/zfsdisk3.nop  ONLINE       0     0     0
            label/zfsdisk4.nop  ONLINE       0     0     0

errors: No known data errors
With that, I'm able to get much better write performance, well enough to sustain writes from gigabit ethernet, so I'm happy.

Code:
dd if=/dev/zero of=/titan/zerofile.00 bs=4M count=10240

10240+0 records in
10240+0 records out
42949672960 bytes transferred in 309.480930 secs (138779708 bytes/sec)
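
To put the gigabit claim in numbers: 1 Gbit/s is 125,000,000 bytes/sec before any protocol overhead, so anything above that can keep up with the wire. A minimal sketch, with exceeds_gige as a made-up helper name:

```sh
#!/bin/sh
# exceeds_gige BYTES_PER_SEC
# Hypothetical helper: compares a dd rate against the raw gigabit Ethernet
# line rate of 1,000,000,000 bits/s = 125,000,000 bytes/s.
exceeds_gige() {
    if [ "$1" -gt 125000000 ]; then
        echo yes
    else
        echo no
    fi
}

exceeds_gige 138779708   # yes (the aligned raidz2 run above)
exceeds_gige 63667503    # no  (the earlier raidz2 run)
```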

Sub.mesa, do you want to do any more testing on this system? Thanks!
 