nuc's Ubuntu and ZFS massive performance testing thread

nuclearsnake

Limp Gawd
Joined: Mar 8, 2003
Messages: 445
So a while back I started a project to migrate away from XFS to ZFS on our backup server [Link here]

The machine specs are now

SUPERMICRO X8DTN+
2 x Xeon 2.26Ghz
2 x 80 GB WD
2 x Corsair Force 3 120GB SSD
8 x 2 TB Western Digital WD2002FYPS SATA 5400RPM
8 x 2 TB Western Digital WD2003FYYS SATA 7200RPM
1 x 3ware 9650SE-16ML 16 Port SATA Raid
Ubuntu 10.04.3 LTS 64-bit
12GB RAM

I took the advice of the forums and stayed away from zfs-fuse, deciding to use the Ubuntu package by Darik Horn.
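For anyone wanting to follow along, installing it went roughly like this (I'm quoting the PPA and package names from memory, so double-check them before using):

Code:
# Darik Horn's zfs-native PPA (names from memory - verify first)
sudo add-apt-repository ppa:zfs-native/stable
sudo apt-get update
sudo apt-get install ubuntu-zfs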

I wanted to try out sub.mesa's zfs build, but this backup server also runs a number of jobs so I couldn't move away from Ubuntu just yet.
The next build will split the roles up, giving us a backup target (sub.mesa) and a server to push things around.


That said, here's our layout. I have two pools:
- 1 pool on a hardware RAID5 array from the controller for the 8 5400RPM drives (/backup/base0)
- 1 RAIDZ for the 8 7200RPM drives (/backup/base1)


I put base0 together as quickly as possible since we needed somewhere to store our backups before sending them to tape, and the boss said to build a hardware RAID5 array with ZFS on top. We didn't know enough back then, but with base1 I had more time to test out different settings for best performance (the commands for these are sketched just below the list). On the ZFS side, there was:
Dedup on/off
Compression on/off
ZIL on an SSD or not
L2ARC on an SSD or not
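For reference, toggling those comes down to commands like these (device names are placeholders for the SSD partitions I used):

Code:
# dedup and compression are just properties on the pool or dataset
zfs set dedup=on base1
zfs set compression=on base1
# dedicated ZIL (log) and L2ARC (cache) devices on the SSD
zpool add base1 log /dev/sdb1
zpool add base1 cache /dev/sdb2
# and to pull them back out between runs
zpool remove base1 /dev/sdb1
zpool remove base1 /dev/sdb2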

On the Controller, I set all the drives to single disk mode, and enabled and disabled
Read Cache
Write Cache
Write Journaling
Queuing
Link Speed (1.5 vs 3.0 Gb/s)

And finally, RAID types (pool creation sketched below):
RAIDZ with 8 disks
RAIDZ with 4 RAID0 arrays on the controller
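Concretely, the two layouts came down to something like this (sdX names are placeholders; in the second case the 3ware controller exports four 2-disk RAID0 units as single devices):

Code:
# layout 1: one raidz vdev across all 8 raw disks
zpool create base1 raidz sde sdf sdg sdh sdi sdj sdk sdl
# layout 2: one raidz vdev across the 4 RAID0 units presented by the controller
zpool create base1 raidz sde sdf sdg sdh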

Total of 122 different bonnie++ runs to try to figure out what would be best for my system;
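Each run was a plain bonnie++ pass against the pool's mountpoint, along these lines (the size and label here are just examples; bonnie++ wants the file size to be at least double the RAM):

Code:
# 24GB working set (2x the 12GB of RAM), skip the small-file tests, label the run
bonnie++ -d /backup/base1 -s 24576 -n 0 -u root -m base1_dd0cp1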
[Image: table of bonnie++ benchmark results]


Now, that was all theoretical benchmarking. Things are not so simple in the real world, as I'm slowly learning.

My system setup mounts both base0 and base1 under /backup.
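For reference, that's just the mountpoint property on each pool (it can also be set with -m at zpool create time):

Code:
zfs set mountpoint=/backup/base0 base0
zfs set mountpoint=/backup/base1 base1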

Code:
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
base0            11.6T   838G  11.6T  /backup/base0
base1             454G  11.6T   354G  /backup/base1
base1/archive     229K  11.6T   229K  /backup/base1/archive
base1/filelevel   100G  11.6T   100G  /backup/base1/filelevel
base1/vmlevel     229K  11.6T   229K  /backup/base1/vmlevel

Code:
# zpool status
  pool: base0
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        base0       ONLINE       0     0     0
          sdd       ONLINE       0     0     0


  pool: base1
 state: ONLINE
 scan: none requested
config:

        NAME                                            STATE     READ WRITE CKSUM
        base1                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            scsi-3600050e0bd269d00722d0000fe750000      ONLINE       0     0     0
            scsi-3600050e0bd26b60060ff00008aa70000      ONLINE       0     0     0
            scsi-3600050e0bd26ca00673f0000c84b0000      ONLINE       0     0     0
            scsi-3600050e0bd281900d841000032ef0000      ONLINE       0     0     0
            scsi-3600050e0bd282800bcc1000079570000      ONLINE       0     0     0
            scsi-3600050e0bd283700c197000017b90000      ONLINE       0     0     0
            scsi-3600050e0bd284b00a47300006cf10000      ONLINE       0     0     0
            scsi-3600050e0bd285a0021cd0000af4f0000      ONLINE       0     0     0
        cache
          ata-Corsair_Force_3_SSD_11356504000006820497  ONLINE       0     0     0

errors: No known data errors

Code:
# zfs get dedup,compression
NAME             PROPERTY     VALUE          SOURCE
base0            dedup        off            default
base0            compression  off            default
base1            dedup        off            default
base1            compression  off            default
base1/archive    dedup        off            default
base1/archive    compression  off            default
base1/filelevel  dedup        off            default
base1/filelevel  compression  on             local
base1/vmlevel    dedup        off            default
base1/vmlevel    compression  off            default




I have a couple of questions. For one, does anyone want my CSV bonnie++ results? :eek:

Also, I'm at a loss as to why all my 'per char' results are so low... can't seem to figure that out.

Lastly, I set the wrong dataset for a backup (put it into /backup/base1/ instead of /backup/base1/filelevel/) and now I'm trying to

Code:
mv /backup/base1/server001/daily.0/ ../filelevel/server001/
and it's taking forever with almost nothing happening. The data is an rsync from a number of Linux servers, totaling ~355GB. It only took 5 hours to get that data over the network, but it's been moving to the correct dataset for 6 hours and is only at 100GB transferred;

Code:
iostat -x 1

                             extended device statistics
device mgr/s mgw/s    r/s    w/s    kr/s    kw/s   size queue   wait svc_t  %b
sda        0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdc        0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdb        0     3    2.0    2.0     2.9    96.2   25.4   0.1   17.5  17.5   7
md2        0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
md1        0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
md0        0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sdd        0     0    0.0    0.0     0.0     0.0    0.0   0.0    0.0   0.0   0
sde        8     0   50.8    0.0  1555.1     0.0   30.6   0.8   14.8  11.0  56
sdf        9     0   44.0    0.0  1543.3     0.0   35.1   0.6   14.0   9.6  42
sdg        9     0   47.9    0.0  1562.9     0.0   32.7   0.7   14.5  10.8  52
sdh        9     0   51.8    0.0  1476.9     0.0   28.5   0.7   13.2   9.4  49
sdi       11     0   46.9    0.0  1582.4     0.0   33.8   0.7   15.2  10.6  50
sdk        9     0   46.9    0.0  1562.9     0.0   33.3   0.7   15.4  11.0  52
sdj        5     0   50.8    0.0  1476.9     0.0   29.1   0.7   14.4   8.5  43
sdl        8     0   49.8    0.0  1527.7     0.0   30.7   0.8   15.7  11.2  56


Code:
# zpool iostat -v
                                           capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
base0                                   11.6T  1.02T     64     61  4.42M  5.46M
  sdd                                   11.6T  1.02T     64     61  4.42M  5.46M
--------------------------------------  -----  -----  -----  -----  -----  -----
base1                                    549G  14.0T     60    149  6.13M  8.26M
  raidz1                                 549G  14.0T     60    149  6.13M  8.26M
    scsi-3600050e0bd269d00722d0000fe750000      -      -     14     16   893K  1.30M
    scsi-3600050e0bd26b60060ff00008aa70000      -      -     14     16   870K  1.28M
    scsi-3600050e0bd26ca00673f0000c84b0000      -      -     14     16   891K  1.30M
    scsi-3600050e0bd281900d841000032ef0000      -      -     14     16   871K  1.29M
    scsi-3600050e0bd282800bcc1000079570000      -      -     14     16   893K  1.30M
    scsi-3600050e0bd283700c197000017b90000      -      -     13     16   869K  1.28M
    scsi-3600050e0bd284b00a47300006cf10000      -      -     14     16   890K  1.30M
    scsi-3600050e0bd285a0021cd0000af4f0000      -      -     14     16   871K  1.29M
cache                                       -      -      -      -      -      -
  ata-Corsair_Force_3_SSD_11356504000006820497   106G  5.89G      6     11  67.1K  1.36M
--------------------------------------  -----  -----  -----  -----  -----  -----

Thanks for looking!
 
I think the moves are going to a separate dataset, so it needs to copy and delete. That said, I don't think you will get a lot of love here if you are doing zfs on linux, if only because few folks are using that...
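If nothing else, swapping the mv for rsync at least shows progress and lets you resume if it dies partway (paths here follow your layout; double-check the flags on your rsync version):

Code:
# copy with progress, deleting source files as each one finishes
rsync -a --progress --remove-source-files /backup/base1/server001/daily.0/ /backup/base1/filelevel/server001/daily.0/
# --remove-source-files leaves the (now empty) source directories behind
find /backup/base1/server001/daily.0/ -depth -type d -empty -delete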
 
It's not lack of love, it's the fact that there is a very small group of people that use ZFS on linux. It would be nice if there were more, with more documentation for drawbacks and issues, but unfortunately you will be mostly on your own for troubleshooting it.

For Bonnie, the main columns to worry about are Seq read and write blocks, and Random Seeks (~IOPS). I have never bothered with chr columns.

Honestly for linux I would stick with straight hardware raid, or LVM/MD. You are probably the only person in this forum using any ZFS on linux, in production on top of that, so troubleshooting why you are getting 15MB/s on a copy is going to be random guesses.
 
Uh oh, I hope you are not using ZFS on Linux? It would be OK for testing purposes, but not for trusting your data. When you lose all your data, don't blame ZFS.

Use Linux with XFS + hardware raid instead, that would be much safer than ZFS on Linux. Brrr...

Or go to an OS that has ZFS support: FreeBSD or Solaris or OpenIndiana (the fork of OpenSolaris, totally community driven) or Nexenta if you need a support contract. If you need to run Linux, you can use VirtualBox on top of FreeBSD / Solaris.
 
It's not lack of love, it's the fact that there is a very small group of people that use ZFS on linux. It would be nice if there were more, with more documentation for drawbacks and issues, but unfortunately you will be mostly on your own for troubleshooting it.

Sorry for being unclear. "Not feeling the love" is often used when you get no answers...
 
I certainly wouldn't discourage ZFS on Linux. The current hot ZFS systems don't really have good solutions for everything. I think it would be great to see it on Linux, and if it were, more people would use it.

I love FreeBSD and would like to set up a server, but there are some limitations that Linux would solve for me.

But, you probably will be on your own making it work! :)
 
@OP

That said, here's our layout. I have two pools:
- 1 pool on a hardware RAID5 array from the controller for the 8 5400RPM drives (/backup/base0)

Running ZFS on top of a hardware-based RAID array practically negates all of the advantages of ZFS and basically makes it no better than any other filesystem out there. Or are you running the hardware controller in JBOD mode?
 
Sorry for being unclear. "Not feeling the love" is often used when you get no answers...

I understand that, I was more making a jab at the rabid ZFS fanboys going crazy over a "non-pure" system :D
 
ZFS on Linux should work fine as long as you don't use compression or deduplication.
 
ZFS on Linux should work fine as long as you don't use compression or deduplication.
Says who? Links please? If I google for problems with ZFS on Linux, will there be stories, or no stories on this?

I am just saying that it happens that people use ZFS and lose data, and then they blame ZFS. But it turned out that those people, for instance, used hardware RAID with ZFS - which is a major no-no. And made lots of other mistakes.

If you don't use ZFS as it was meant to be used, then you CAN lose data even with ZFS. But just don't blame ZFS. Blame yourself. That is all I am saying: don't blame ZFS, but yourself.
 
@OP
Running ZFS on top of a hardware-based RAID array practically negates all of the advantages of ZFS and basically makes it no better than any other filesystem out there. Or are you running the hardware controller in JBOD mode?


Yeah, I know about this. Was not my call and let's just say I had other battles to win that week.

Aaaannyway, the new ZFS array is set up correctly, with the disks in JBOD mode.



ZFS on Linux should work fine as long as you don't use compression or deduplication.
The RAIDZ testing showed the best write/rewrite speeds (not accounting for CPU usage) were with dd=0 (dedup off) cp=1 (compression on), followed by dd0cp0, then dd1cp1, and the worst write/rewrite was dd1cp0.

For the most part, reads were best with dd1cp1, then dd0cp1, with dd1cp0 and dd0cp0 tied
 
Really not recommended to use dedup at all unless you have a specific use case - there are far too many bad things that can happen.
 
ZFS dedup is quite buggy right now. First of all, you need more than 1GB RAM for each TB of disk space not to degrade performance. People have said that when they did a snapshot using dedup, it sometimes took days to delete the snapshot.

Avoid dedup until everybody uses it. All who try it now, always rebuild their array after a while and turn it off. Disk space is cheap.
 
ZFS dedup is quite buggy right now. First of all, you need more than 1GB RAM for each TB of disk space not to degrade performance. People have said that when they did a snapshot using dedup, it sometimes took days to delete the snapshot.

Avoid dedup until everybody uses it. All who try it now, always rebuild their array after a while and turn it off. Disk space is cheap.

I've noticed that dedup puts a lot of load on the system during my testing, and you guys are not the first to advise staying away from it until it matures.

There are plans to get the server to 32+ GB of RAM, and when dedup stabilizes, I'd love to use it for our VM-level backups. Rsync+rsnapshot are great for file level, but when a 60GB VMDK file changes even in the slightest, requiring a full 60GB backup, it's a lot of data to store. I'm hoping that one day the block-level dedup will help out.
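When that time comes, I'll probably dry-run it with zdb first; it can simulate dedup over the existing data and report the expected ratio and table size without actually turning anything on (pool name just as an example):

Code:
# simulate dedup across the pool's current data; prints a DDT histogram
# and the dedup ratio it would achieve, without changing the pool
zdb -S base1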
 
ZFS dedup is quite buggy right now. First of all, you need more than 1GB RAM for each TB of disk space not to degrade performance. People have said that when they did a snapshot using dedup, it sometimes took days to delete the snapshot.
Not sure why you consider that buggy... those people were just trying to run huge pools with far too little CPU/memory. Oracle does support dedupe on their Sun storage rigs, so it can't be *that* bad.
 
Depends on your outlook I guess. I would consider it buggy that it can take literally days to delete a snapshot, but YMMV. A better term might have been 'not recommended' :)
 
I looked into ZFS on Linux for a bit; data integrity is not an issue... I think you can search the mailing lists, it's not a problem anyone runs into. Performance, however, blows. It's not just how you configure your array; no matter what, it is going to suck. (Oh, and saying data integrity is not an issue assumes you're not messing with dedup or compression or anything fancy; not sure how well that is tested.)

If you are set on using Linux, I've been using btrfs on md RAID arrays. I use RAID6 arrays with md, and then use btrfs on top for data checksumming and snapshotting. XFS and LVM are crap these days; it's not 1990. ZFS on FreeBSD/Solaris, and if you need Linux put up with btrfs, IMO.
 
AFAIK the people observing ridiculously poor snapshot deletion performance have all been in situations where it's very likely that their DDT (DeDupe Table) has run out of space in RAM and is therefore on disk.

Deleting a snapshot on a pool with dedupe enabled will require lots of random reads in the DDT, and if these reads have to go to disk this will take a while. Especially if the disks are already under load, like they would be in most production environments.

By default only 25% (IIRC) of RAM is available for the DDT, so unless you tune this parameter or have a huge amount of memory, some of it will go to disk. L2ARC will be very helpful in this situation, as the entire DDT should be able to fit in RAM+L2ARC, unless you have a very weird storage setup with huge pools and no RAM / a small L2ARC.
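For what it's worth, on a pool that already has dedupe enabled you can see how big the DDT actually is and size your RAM from that (pool name is just an example):

Code:
# prints the DDT entry count plus the on-disk and in-core size per entry;
# entries x in-core size gives a rough figure for how much RAM the table wants
zdb -D tank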
 
That is my understanding as well. The problem from what I've heard is that it is easy to not know this is going on until you get screwed.
 
Rsync+rsnapshot are great for file level, but when a 60GB VMDK file changes even in the slightest, requiring a full 60GB backup, it's a lot of data to store. I'm hoping that one day the block-level dedup will help out.
In that case, why don't you use snapshots? They work a lot like dedup for this case, but you have to do it manually.

Say you do a snapshot of the VM, then it changes slightly. Do a new snapshot, and only the changes are saved. It works a lot like dedup, but you have to do the snapshots manually.
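Roughly like this (dataset name is just an example, and it only pays off if the vmdk is rewritten in place rather than replaced with a whole new copy):

Code:
zfs snapshot base1/vmlevel@monday
# ...the vmdk gets modified in place...
zfs snapshot base1/vmlevel@tuesday
# only the blocks that changed between the two snapshots take up extra space
zfs list -t snapshot -r base1/vmlevel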
 
AFAIK the people observing ridiculously poor snapshot deletion performance have all been in situations where it's very likely that their DDT (DeDupe Table) has run out of space in RAM and is therefore on disk.

Deleting a snapshot on a pool with dedupe enabled will require lots of random reads in the DDT, and if these reads have to go to disk this will take a while. Especially if the disks are already under load, like they would be in most production environments.

By default only 25% (IIRC) of RAM is available for the DDT, so unless you tune this parameter or have a huge amount of memory, some of it will go to disk. L2ARC will be very helpful in this situation, as the entire DDT should be able to fit in RAM+L2ARC, unless you have a very weird storage setup with huge pools and no RAM / a small L2ARC.
That's my understanding as well, and one of the reasons I picked up the SSDs listed in the first post.
Performance is currently on the poor-ish side on my new array, but substantially better than what we were getting on the old array, and even that was an improvement over XFS crashing on a weekly basis.
I've been reading the zfs-discuss (ZFS on Linux) mailing list, and stability is at a point where I'm comfortable.


In that case, why dont you use snapshots? It works exactly like dedup, but you have to do it manually.

Say you do a snapshot of the VM, then it changes slightest. Do a new snapshot, and only the changes are saved. It works exactly like dedup, but you have to manually do snapshots.
We're currently doing an automatic VMware snapshot, initiated from the backup server, on all our ESXi boxes, and then pushing the VM files via rsync/rsnapshot to the ZFS pool on the backup server. The snapshots are only used to release the file lock on the vmdk's for copying.

Once the vmdks are on our backup server, rsnapshot takes care of rotating the files, but we're still left with the problem of having 5 daily full vmdk files being pulled from the ESXi machines to the backup server. That's why I'd like to (one day) implement dedup on the ZFS array, in order to lower the amount of space used to store these vmdks at the block level... ...one day...
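If we do end up going the snapshot route suggested above instead, I imagine it would look roughly like this (host, datastore, and dataset names here are made up, and it relies on rsync rewriting each vmdk in place with --inplace so unchanged blocks stay shared with older snapshots):

Code:
# snapshot yesterday's state, then let rsync update only the changed regions
# of each vmdk in place; blocks that didn't change stay shared with the snapshot
zfs snapshot base1/vmlevel@$(date +%F)
rsync -a --inplace esxi01:/vmfs/volumes/datastore1/vm001/ /backup/base1/vmlevel/vm001/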
 