XFS vs ZFS

Alphabetical order aside, I'm just a little curious why the overall consensus seems to be that ZFS is king. My view is purely theoretical, as I've never used ZFS in practice and my XFS experience is minimal, so I'm looking to hear some of the debate over using one versus the other.

I understand that ZFS is widely regarded as the most RAID-friendly filesystem, and I don't argue this. I feel that XFS gets a bit of the short end of the stick in this part of the discussion, though. While XFS doesn't support dynamic stripe re-sizing or other fanciful features, it does let you specify the stripe unit and stripe width at creation time. Perhaps the most important difference between the two is that XFS doesn't support checksums. Also missing from XFS are encryption and volume shrinking (both ZFS and XFS support online growing). In terms of limits on path, file, and volume sizes, ZFS is the clear winner in what can really only be a moot point: for the vast majority (see: all) of us using <8EiB systems, the two are identical. XFS also lacks de-duplication, copy-on-write and an integrated logical volume manager. [data here]
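Digging into that stripe support a bit: aligning XFS to a RAID layout is just a couple of mkfs options. A rough sketch, assuming an eight-data-disk array with a 64KiB chunk size; the device and geometry are only placeholders:

Code:
# su = per-disk chunk (stripe unit), sw = number of data disks in the stripe
# /dev/sdb1 and the numbers here are only examples for illustration
% mkfs.xfs -d su=64k,sw=8 /dev/sdb1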

Phew, it does seem like ZFS is the clear winner so far, but let's now examine performance and OS support. XFS beats ZFS in read/write/mixed MB/s throughput benchmarks as well as random IOPS. Large-file creation is faster in XFS too. Random reads and writes perform better on XFS, especially writes. Standard deviations for random reads/writes are close, but ZFS does win that category. When it comes to multithreaded read/write/mixed IOPS, XFS wins by a large margin. With huge-file multithreaded mixed IOPS the results are closer, but XFS still takes the cake. You'll find untarring to be slightly quicker in XFS, but when it comes to tarring, ZFS wins by an enormous margin. Clearly, in overall performance, XFS wins. [data here]

So what about OS support? Well, it's pretty common knowledge that XFS is supported by Linux kernels right out of the box (though it's frowned upon as a root filesystem) and that ZFS is happily integrated into FreeBSD and its derivatives. It's a matter of personal use and opinion, but I find Linux to be more adaptable to a wider variety of tasks, like use in a home theater box or even as a user-friendly desktop OS. FreeBSD is generally considered to be more stable, but I'm not sure the difference is worth major points. FreeBSD does offer partial support for XFS and, using FUSE, you can even get ZFS working on Linux (though neither of these would be the preferred implementation).

So, what does it really come down to? I think ZFS and XFS are both excellent filesystems, and it's truly a shame that ext4 has no real place contending with either of them. BTRFS is the holy grail of filesystems, many will tell you, but as of today there's no stable release, so it ultimately can't be considered a contender either. In this writer's opinion, if you're looking for a personal file-storage filesystem and trust your RAID card, XFS on Linux is the clear solution. If you're paranoid about keeping your data as safe as possible and don't mind a performance hit, or using a less versatile OS like FreeBSD, then ZFS is the one for you.

I hope that others will join in this discussion as I'm very interested to hear about why ZFS seems to be such a popular choice here at [H]. If I've made any errors or improper deductions, please feel free to make a note of it or PM me and I'll edit this post.
 
This is a bit of apples and oranges. You need to think of ZFS on Solaris vs XFS on top of LVM2 on top of md on top of Linux. ZFS combines the functions of filesystem, logical volume manager, and software RAID, which are handled by independent subsystems under Linux.
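To make that concrete, here is roughly what each stack looks like to set up; the pool name, device names and layout below are placeholders, not a recommendation:

Code:
# ZFS: RAID, volume management and filesystem in a single step (RAID-Z ~ raid5)
% zpool create tank raidz da0 da1 da2 da3

# Linux equivalent: three separate layers stacked on each other
% mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
% pvcreate /dev/md0
% vgcreate vg0 /dev/md0
% lvcreate -l 100%FREE -n data vg0
% mkfs.xfs /dev/vg0/data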
 
The whole big point of ZFS is that it provides a very reliable RAID-like layer which, unlike many existing designs, avoids many pitfalls and potential problems. The use of variable stripe width in RAID-Z configurations means you never have to perform a read-modify-write cycle, and thus you avoid the whole raid5 write-hole problem altogether. It also increases random write speed to slightly above single-disk performance, which is really good for a raid5. Same for RAID-Z2, aka raid6.

Due to the transactional nature of ZFS, emphasis is given to protecting your files from harm and never allowing the ZFS filesystem itself to end up in an unrecoverable/inconsistent state. This taxes performance somewhat. But fortunately, ZFS can really increase the speed of mechanical hard drives by letting them perform only sequential I/O. How? Random reads you solve with L2ARC (SSD cache devices) and random writes you solve with a SLOG (ZIL on SSD), allowing the disks to do only sequential I/O on your dataset. This achieves the best of both worlds and is a very cunning feature.

The problem is that only recent ZFS versions allow recovering from a lost ZIL/SLOG device; otherwise you would lose the entire pool! Also, SSDs are prone to corruption, so supercapacitor protection is mandatory for data consistency. The upcoming Intel G3 should be excellent for use with both L2ARC and SLOG, and you can use one SSD for both functions. Or multiple SSDs in a RAID0-like configuration, like I do.
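For reference, adding these devices to an existing pool is a one-liner each; mirroring the SLOG is a sensible hedge against exactly the lost-log scenario above. Pool name and device labels are placeholders:

Code:
# L2ARC read cache: losing it is harmless, so a single device is fine
% zpool add tank cache gpt/l2arc0

# SLOG (dedicated ZIL): mirror it so a single dead SSD can't take out the log
% zpool add tank log mirror gpt/slog0 gpt/slog1

# check where the cache and log devices ended up
% zpool status tank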

Comparing ZFS directly to XFS is indeed apples vs oranges. You should compare XFS against ext4, UFS, NTFS and similar simple filesystems that only manage a single block device. ZFS is a completely different beast, and perhaps the first of its kind to integrate several functions in one package. Not just that, but it actually makes use of the emergent properties of such a 'fusion' between RAID and filesystem, things that are not possible if the two were separate packages.
 
But fortunately, ZFS can really increase the speed of mechanical hard drives by letting them perform only sequential I/O: random reads you solve with L2ARC (SSD cache devices) and random writes you solve with a SLOG (ZIL on SSD)...

This is immensely interesting and I'm embarrassed to say I wasn't aware of this functionality. I'll be investigating this immediately.
 
The upcoming Intel G3 should be excellent for use with both L2ARC and SLOG, and you can use one SSD for both functions. Or multiple SSDs in a RAID0-like configuration, like I do.

Regarding having the L2ARC and SLOG on a single SSD, would it be fine to run the OS on the same SSD as well?

And instead of an SSD, I assume that a dedicated HDD for the ZIL and cache would also be better than placing those on the drives in the pool?
 
So, what does it really come down to? I think ZFS and XFS are both excellent filesystems, and it's truly a shame that ext4 has no real place contending with either of them. BTRFS is the holy grail of filesystems, many will tell you, but as of today there's no stable release, so it ultimately can't be considered a contender either.

Yeah, I remember when SGI released XFS for IRIX - nine million terabytes maximum volume size was stratospheric back then. :D

I think the problems with ext4 are twofold. Firstly, according to what I have read, ext4 was meant to be a stopgap - an enhancement of ext3 while waiting for btrfs to mature (which hasn't happened yet). Secondly, as a side effect of this, the e2fsprogs tools are limited: they can't see past 16TiB volumes with ext4, even though the filesystem itself is capable of addressing volumes of up to 1EiB in size.

This is a bit of apples and oranges. You need to think of ZFS on Solaris vs XFS on top of LVM2 on top of md on top of Linux. ZFS combines the functions of filesystem, logical volume manager, and software RAID, which are handled by independent subsystems under Linux.

Agree with this. In fact, I've often wondered why LVM2 and MD RAID were never combined; to me it would make sense to do so, and unlike btrfs, both LVM2 and MD RAID are mature, and have been for a very long time.

The problem is that only recent ZFS versions allow recovering from a lost ZIL/SLOG device; otherwise you would lose the entire pool! Also SSDs are prone to dropping dead...

Fixed. :p One thing that ZFS does do is blow wide open the performance argument of software RAID versus true hardware RAID. This argument has bounced back and forth as general-purpose CPUs and dedicated I/O processors have leapfrogged each other in storage performance, and over the last few years I would argue that software RAID has been winning, thanks to the strides made by Intel and AMD.

However, the checksumming operations of ZFS are quite CPU intensive, and the load only increases as the number of files grows. I've seen this confirmed on ZFSBuild. I see this playing out in one of two ways. Either

a) adoption of ZFS will be stymied by its performance requirements, especially in non-SAN environments where the database server and its storage are in a DAS configuration, or

b) more shops will adopt a SAN architecture on commodity hardware. Even though it means buying at least one extra box (and a dedicated GigE+ network), that box is relatively cheap and the data integrity afforded by ZFS plus the installed ECC RAM is worth more than a single box with a RAID card.

Either way, the next 5-10 years in the storage market are going to be fascinating.
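On the checksumming cost: it is at least tunable per dataset, so you can trade CPU for integrity where it matters. A quick sketch; the pool and dataset names are only examples:

Code:
# fletcher4 is the cheap default; sha256 is stronger but costs more CPU
% zfs get checksum tank
% zfs set checksum=sha256 tank/important

# a scrub walks the whole pool and verifies every block against its checksum
% zpool scrub tank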
 
@ripken204
You could run the OS on the same SSD; I don't know if this will be supported via the ZFSguru GUI, if that is what you meant. At least not very soon.

Do note, however, that for a server the system disk doesn't need to be fast at all. Unlike a desktop OS, which loads apps and has things like a Firefox profile that constantly gets read and updated, a server OS is different: all running processes sit in RAM, and disk I/O is limited to updating log files, which is just a few small random writes now and then; nothing to worry about.

So if it were me, I would install to your pool directly, making each disk bootable. If something goes wrong with your system disk (ZFSguru or otherwise), you can always boot a livecd and import the pool again, so you don't really need a separate system disk. Still, having the system disk on the pool means it enjoys the redundancy of that pool, and it saves space on your SSDs so you can use more for L2ARC or SLOG.

Generally, you would want a very fast but not particularly high-capacity SSD, so striping multiple small SSDs would yield the most performance per dollar/euro invested. Smaller SSDs are often very low on write performance, but there are exceptions. You would need a supercapacitor if you want to use the SLOG/ZIL feature, which is why I recommend waiting to buy an SSD until the Intel G3 and similar third-generation SSDs with a supercapacitor are available; it prevents possibly serious corruption on power loss, to which modern SSDs with their complex controller designs are more sensitive than older 'simple' NAND controllers without write remapping.

Using an HDD for SLOG may not give you much performance benefit at all; using a 10k rpm disk as L2ARC could work, but it wouldn't be that fast, and L2ARC consumes extra RAM as well. Cheap USB pendrives as L2ARC could work better than 10,000rpm drives. So I don't see much benefit here and would stick with the 'real' SSDs coming in just a few months.
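On the one-SSD-for-both question: under FreeBSD you can simply partition the SSD and hand one slice to the log and another to the cache. A rough sketch, assuming the SSD shows up as ada4 (labels and sizes are arbitrary):

Code:
% gpart create -s gpt ada4
% gpart add -t freebsd-zfs -s 8G -l slog0 ada4      # small slice for the SLOG
% gpart add -t freebsd-zfs -l l2arc0 ada4           # rest of the SSD for L2ARC

% zpool add tank log gpt/slog0
% zpool add tank cache gpt/l2arc0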
 
XFS is meh... I have had lots of issues with bugs in the past that caused major problems (I hear it's better now). My biggest problem with XFS is that it seems too easy to corrupt the filesystem.

I personally run JFS for my 36TB filesystem as it's just about as fast as XFS, doesn't fragment as badly, and I have never lost data on it the way I did with XFS, even through power cycles.

When you have a >16TiB filesystem, your only real choices are XFS, JFS, ZFS, and BTRFS. ZFS was pretty much userland (too slow for me), and btrfs, which is supposed to be the Linux equivalent, is simply not stable yet. I have had horrible problems with XFS, so the choice for me was pretty obvious...
 
Using an HDD for SLOG may not give you much performance benefit at all...

Thanks for the info.
Basically I just wanted to know if it would always be better to put the SLOG/ZIL on another HDD/SSD. I need an OS drive anyway, so if it's beneficial to put it on that, then I might as well. If it doesn't make much of a difference then I won't bother with it.
 
When you have a >16TiB filesystem, your only real choices are XFS, JFS, ZFS, and BTRFS.

You're forgetting HAMMER! http://www.dragonflybsd.org/hammer/
 
IIRC, JFS is basically dead now.

I honestly don't care if a FS is not being very actively developed these days. It's an extremely stable filesystem and has done much better for me than XFS on fragmentation and not losing my data. Also, its performance is around what XFS does (near raw I/O speeds), but probably a little bit less.

Also, fsck on my 36TB array takes around 20 minutes, whereas the XFS fsck is *much* slower and, from what I've heard, needs about 1GB of RAM per 1TB of volume.
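For anyone who wants to compare on their own array, these are the offline check commands in question; the volume path is a placeholder, and xfs_repair -n is a read-only dry run:

Code:
# JFS offline check (filesystem unmounted)
% fsck.jfs /dev/vg0/media

# XFS check/repair; -n only reports problems without changing anything
% xfs_repair -n /dev/vg0/media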
 
Well, your choice of course. I have to say, 36TB is an unusually large data array unless you are a major provider like Google. :) The usual comparison I hear is between ext3 and XFS; I don't hear too much about JFS on most forums. My servers are UPS-protected, so I am more interested in performance that doesn't suck (which ext3 does, particularly for deleting files...)
 
Have a look here. :)

Maybe I should have bolded my 'was' when I said it was pretty much userland.

There is also a lot more overhead with ZFS (in disk space), and I don't feel it's quite ready yet for primetime on Linux. I use ZFS on OpenSolaris on my backup machine, though.

JFS is very fast for file deletion. To me it really came down to what I can trust my data to. JFS was developed by IBM and is rock-stable IMHO. There is also still development going on: until recently the userland utilities (mkfs) did not support creating a FS larger than 32TiB, but that was fixed after I emailed the JFS discussion list about it. Pretty much all my large filesystems use JFS now (ranging from 9 to 36TB), ever since I was bitten by running XFS.
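For reference, once the userland tools were fixed, putting a large JFS filesystem on an existing logical volume is as simple as this; the volume and mount point are placeholders:

Code:
# assumption: /dev/vg0/archive is an already-created LV larger than 32TiB
% mkfs.jfs /dev/vg0/archive
% mount -t jfs /dev/vg0/archive /archive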
 
Yeah, JFS is very nice. :) If I remember rightly, my old workstation's 5TB RAID & LVM array was formatted for JFS. That machine died two years ago and I haven't had a chance to pull the data off it, but I've no doubt it's all intact. :D

One thing that really irks me is the state of e2fsprogs. I mean, seriously, you develop a filesystem that goes to 1EiB, and you can't even do an fsck? How could the filesystem even be tested properly if it couldn't be checked beyond 16TiB? I mean, seriously?
 
The above is about 3%. If you are going to compare filesystems, it would be helpful to post the numbers for JFS too, so readers don't have to track them down.
 
Also, to keep this apples vs apples, keep in mind that ZFS isn't just a filesystem. The comparison would be to mdraid + LVM + JFS (not saying the amount isn't still more, though...)
 
Am I misconstruing something here? Everyone is saying md + lvm + FS, but with hardware RAID aren't you bypassing any use of md? I thought md was only used for managing software RAID.
 
It is. You don't want to use HW RAID with ZFS, since ZFS wants to do its own thing with the drives.
 
Hardware RAID/LVM would barely take anything from the disk space. With JFS, only about 1/100th as much disk space is lost when creating the FS as compared with ZFS. XFS is also pretty comparable to JFS. A couple of percent doesn't sound like much, but when you are dealing with a 20-disk array, just 5% means you are losing a whole disk's worth of space.
 
Well, YMMV, of course. ZFS has a different philosophical approach than older filesystems (for what it's worth, btrfs is more akin to ZFS than to previous Linux filesystems). I will point out that a far worse "waste" of disk space with a 20-disk array is having to break it up into groups if you are going to do something like raid10 or raid50. Unless you are seriously proposing putting all 20 disks in a single raid5, in which case there is really nothing to say...
 
ext4 is also guilty of "wasting" disk space. I noticed this when I formatted a 250GB hard drive: I expected to see 230GiB of usable space and instead saw 218GiB. It's something to do with reserved blocks or some such...

I suspect this will become more common as filesystems make more of an issue out of data integrity and resilience, hence journals and checksumming.
 
It is. You don't want to use HW RAID with ZFS, since ZFS wants to do its own thing with the drives.

So why are people comparing ZFS with md+lvm+FS? In reality, if you could use hardware to control your RAID (outside ZFS) you would, so really we should be comparing oranges and tangerines here: ZFS and HW+LVM+XFS, for example.
 
So why are people comparing ZFS with md+lvm+FS? In reality, if you could use hardware to control your RAID (outside ZFS) you would, so really we should be comparing oranges and tangerines here: ZFS and HW+LVM+XFS, for example.

Huh? You are suggesting comparing things that are LESS similar.

ZFS <-> md+lvm+fs = comparable.. hence why people COMPARE them.

Otherwise you would compare hw+zfs <-> hw+lvm+fs.. (and ZFS would still win)
 
We are not communicating :( What I am saying: if you are using ZFS you do NOT want to use a hardware raid controller (at least as a raid controller, maybe to present a JBOD to ZFS). So if you don't do HW RAID, then md+lvm+FS == ZFS. I'm not sure what is not clear about that?
 
ext4 is also guilty of "wasting" disk space. I noticed this when I formatted a 250GB hard drive: I expected to see 230GiB of usable space and instead saw 218GiB. It's something to do with reserved blocks or some such...

Code:
% man mkfs.ext4
....
-m reserved-blocks-percentage

       Specify the percentage of the filesystem blocks reserved for the
       super-user.  This avoids fragmentation, and allows root-owned
       daemons, such as syslogd(8), to continue to function correctly
       after non-privileged processes are prevented from writing to the
       filesystem.  The default percentage is 5%.
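So by default 5% of the blocks are held back for root. On a pure data volume you can shrink that reserve; the device name below is just an example:

Code:
# lower the root reserve to 1% on an existing ext4 filesystem
% tune2fs -m 1 /dev/sdb1

# or set it at mkfs time
% mkfs.ext4 -m 1 /dev/sdb1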
 
We are not communicating :( What I am saying: if you are using ZFS you do NOT want to use a hardware raid controller (at least as a raid controller, maybe to present a JBOD to ZFS). So if you don't do HW RAID, then md+lvm+FS == ZFS. I'm not sure what is not clear about that?

No, we all understand each other quite well, except we disagree on how to compare them. To me, you would test these two against each other not by trying to match them but by catering to each one's specific strengths. Therefore you'd use ZFS vs. hw+lvm+FS; why compare ZFS to md+lvm+FS when you wouldn't optimally set it up that way with a filesystem like ext or XFS?
 
Therefore you'd use ZFS vs. hw+lvm+FS; why compare ZFS to md+lvm+FS when you wouldn't optimally set it up that way with a filesystem like ext or XFS?

Huh? What is wrong with mdadm? I'd choose that over hardware RAID in many cases, since mdadm is more flexible.
 
No, my comparison was of one system (ZFS) with something else built of three different components, md+lvm+FS. I think the original disconnect was because I wasn't advocating a setup for benchmarking, but pointing out to another poster why one can't just compare ZFS and JFS (or whatever filesystem).
 
Huh? What is wrong with mdadm? I'd choose that over hardware RAID in many cases, since mdadm is more flexible.

Agreed. Additionally, from a setup and performance point of view, md-RAID + LVM + FS is closer in configuration to ZFS than a stack involving hardware RAID, because both stacks would use the exact same hardware, whereas hardware RAID runs on its own... hardware...
 
Huh? What is wrong with mdadm? I'd choose that over hardware RAID in many cases, since mdadm is more flexible.

The main problem is the lack of integration with the volume manager that you'd find on other UNIX LVMs. In most cases, you'd simply specify the RAID level at logical volume creation time. The physical volumes are just that, physical volumes (hdiskXX for AIX, cXtXdX for Solaris ZFS or VxFS, etc). mdadm requires that you first create the RAID device and then have LVM deal with the "physical" volume you created with mdadm.
 
Yeah, I never understood why md and LVM were never integrated. It really would make sense - call it LV-RAID. :)
 
I assume because they were written by different people for different reasons at different times?
 
The main problem is the lack of integration with the volume manager that you'd find on other UNIX LVMs. In most cases, you'd simply specify the RAID level at logical volume creation time. The physical volumes are just that, physical volumes (hdiskXX for AIX, cXtXdX for Solaris ZFS or VxFS, etc). mdadm requires that you first create the RAID device and then have LVM deal with the "physical" volume you created with mdadm.

That is hardly a reason to switch to hardware RAID. It is a minor issue, especially since LVM is not needed in many cases.
 
I assume because they were written by different people for different reasons at different times?

You could say that of md-RAID and btrfs; however, the RAID code in btrfs came directly from md-RAID. So... what's your point?

That is hardly a reason to switch to hardware RAID. It is a minor issue, especially since LVM is not needed in many cases.

True, but volume management is a natural extension to RAID, and it would simplify a lot of the management tasks. If I were a good enough C coder, I'd attempt it myself, but I'm not. :( Even hardware cards, like the Arecas, have a form of volume management.
 