ZFS Deduplication in Open Indiana

TechTrend

I have a production setup with OpenIndiana 151a, napp-it 0.6r, an E3-1230 Xeon quad-core CPU, 16 GB RAM, five GigE interfaces (4 for iSCSI), one 500 GB SATA2 boot disk, one 64 GB SSD for read cache, one 64 GB SSD for logs, and six 2 TB SATA2 drives set up as three 2-disk mirrored vdevs. This is a storage server for a small vSphere Essentials cluster with 3 hosts and about 30 VMs. I/O performance is very good, and so far about 25% of the storage is in use. Since many VMs have common operating systems and applications, I was considering deduplication to allow further growth without adding hardware (a new enclosure would be required for additional drives). A project to virtualize desktops on that network is also being considered, which would make deduplication even more useful (desktops have more common data than servers). Yet several forum posts suggest not using deduplication with ZFS, e.g.

http://hardforum.com/showpost.php?p=1038065317&postcount=17
...ZFS dedup is broken. Never use it until it is fixed....

http://hardforum.com/showpost.php?p=1038346110&postcount=2589
...Avoid deduplication - only usefull on very special use-cases, even with 16 GB RAM...

Are there successful production deployments of OpenIndiana using deduplication? Any reports of major problems?

Alternatively, would the ZFS deduplication in NexentaStor Community Edition be a better choice?

Thanks.
 
It isn't OpenIndiana, AFAIK, that is the problem; dedup just isn't stable enough in that codebase.
 
It isn't OpenIndiana, AFAIK, that is the problem; dedup just isn't stable enough in that codebase.
Thanks for your response. Would it be more stable in NexentaStor 3.x, or is their ZFS codebase the same as in OpenIndiana 151a?
 
I am almost 100% positive it is the ZFS codebase, so switching the OS won't make a difference.
 
I had dedup running for a short while on my home lab: quad-core Xeon, 24 GB RAM, SSDs for cache and log. It was used for Hyper-V storage (Solaris 11 Express).

It worked great and had some great dedup ratios. The main issue is that if you take a snapshot and later try to delete it, the deletion can kill performance on the box for 12-72 hours, compared to instant snapshot management without dedup.

Just using dedup worked fine for me, and there are tools to monitor your dedup table size, but once you start doing other ZFS operations on top of it, you will start to run into issues.
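
For example, on a pool that already has dedup enabled (pool name "tank" is just a placeholder, and I'm writing these from memory, so check the man pages):

zpool list tank
zdb -DD tank

zpool list shows the current dedup ratio in its DEDUP column, and zdb -DD prints the DDT summary and histogram, including the number of entries and the in-core size per entry. Multiplying entries by in-core bytes per entry gives a rough idea of how much RAM the table is eating.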
 
Dedup isn't broken; it just has some significant caveats. These caveats are *especially* applicable to people running home-server-type setups (lower memory, no L2ARC, data that isn't a great candidate for dedupe, etc.), so you tend to see statements that would be hyperbolic in other contexts.

What all this comes down to is that ZFS does real-time dedupe, and so needs to keep the entire dedup table in memory. Depending on the specifics of your ZFS install, this means roughly 1-2 GB of memory per TB of unique data (other factors are involved; that's just a guideline).

If this dedupe table can't fit into memory you have *massive* performance implications (it has to swap out to disk). Having an L2ARC is a reasonable precaution, as it provides a secondary level for the dedupe table to swap to that is typically faster than disk - but remember that L2ARC also consumes system memory, so that will cause dedup to "swap out" to L2ARC that much sooner. This is where you see the snapshot deletion issues, pool deletion issues, etc. - if they are taking a very long time, it's typically because you have run out of memory for your dedupe table.

And another caveat - this metadata is capped by default at 1/4 of your RAM - so for best performance you need at least 4x your expected dedupe table size in RAM.
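
To put rough numbers on that (ballpark only - the actual DDT entry size and your average block size both vary): each dedup table entry takes on the order of a few hundred bytes in core (~320 bytes is the commonly quoted figure). 1 TB of unique data at a 128 KB average block size is about 8 million blocks, or roughly 2.5 GB of dedup table; at a 64 KB average it's closer to 5 GB. On a 16 GB box with the default 1/4 metadata cap, that leaves about 4 GB for the DDT plus all the other metadata, so a couple of TB of unique data is already pushing it.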

Here's a good article on it with more detailed information:
http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe

VMs are probably one of the best candidates for dedupe, though - especially if they are all the same OS/version, etc. If you already have the VMs on a ZFS volume, you can get a dedupe estimate using zdb -S (I think - off the top of my head). This should also give you a way to estimate your dedupe memory requirements (the blog post above goes over that).
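
Something along these lines (pool name "tank" is a placeholder; zdb -S only simulates dedup and doesn't change the pool, but it can take a while and hammer the disks on a large pool):

zdb -S tank

The output is a simulated DDT histogram ending with an overall "dedup = ..." ratio line. Take the total number of allocated blocks it reports and multiply by the per-entry size (a few hundred bytes) to estimate how much RAM the real dedup table would need.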

So basically dedup works fine, but it is only optimal in a very specific context - and that context is not a typical one for home server users. With 30 VMs you sound like a good candidate, but I would be concerned that 16 GB of memory isn't sufficient. Really, that depends on how much unique data you have in those 30 VMs. If it's only a couple of TB you will probably be fine; if it's more than that, you really want a server with more memory - which means either upgrading to 8 GB DIMMs (if 32 GB will be sufficient), moving to a dual-processor platform, or the new LGA2011 platform (quad channel, 8 RAM slots) when it becomes generally available. More RAM is really *never* a waste with ZFS.

That said, you also have to balance the cost against just adding more spindles (and the IOPS gain you get there, since you can likely already max out 1 GigE) - especially if you are hosting VMs, the extra spindles might be the bigger gain.
 
Chris, I agree. I wasn't meaning to imply dedup is broken. But when you can hang your box for 2-3 days with no warning or indication of a problem, it's a usability thing. Most people have found that compression (particularly the lightweight one) gives them more bang for the buck. That said, I agree with the other post that this is not distro-related.
 
No worries - and yeah, the is/isn't-broken distinction almost comes down to a semantics argument. I would say it's "functionally broken" when your dedupe table has to swap to disk. It's kinda like a car with a 5 mph speed limiter - sure it works, but not really :).
But if you are planning on using it, then it definitely *can* be used with success - it's just something you have to plan for from the start, not something you can typically just turn on (and because it looks like you can just turn it on, people tend to functionally break their systems with it).

And also yep - the lightweight compression is pretty much free with current processors. I generally have it turned on by default for everything except bulk media (I actually turned it on for my camera raw backups as there was some compression to be gained there - that's probably dependent on your camera's raw format though).
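
If anyone wants the knob, it's just a dataset property (dataset name is a placeholder, and it only affects newly written data):

zfs set compression=on tank/vmstore
zfs get compressratio tank/vmstore

With compression=on you get lzjb, the lightweight algorithm mentioned above; gzip levels are also available but cost noticeably more CPU.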
 
Dedupe was finished right as the Solaris developers were scattered to the wind. So unless I've missed something, there is no high-quality or official testing and knowledge base for it, and there won't be until Oracle writes it and you steal it.
 
Dedupe was finished right as the Solaris developers were scattered to the wind. So unless I've missed something, there is no high-quality or official testing and knowledge base for it, and there won't be until Oracle writes it and you steal it.

I share the opinion of Bryan Cantrill from Joyent, who mentioned:

... Within 90 days, the entire DTrace team left Oracle, all primary inventors of ZFS had left Oracle, and the primary engineers for both zones and networking had left Oracle...
...nearly all of these engineers went to companies betting on illumos...

The only invention from Oracle since then is encryption - which was nearly ready in OpenSolaris already.
The only real invention since came from Joyent, who added KVM to illumos
(not available in Solaris).

http://www.slideshare.net/bcantrill/fork-yeah-the-rise-and-development-of-illumos

So time will tell.
 
Dedup is not to be used; it is not really usable. You need huge amounts of RAM, and it is much easier and cheaper to buy more disks instead - disks are cheap. If the computer becomes unusable for days when deleting a deduped snapshot, you should avoid it. Buy more disks instead and you avoid all of the problems. Compression, on the other hand, is fine and works well.
 
I share the opinion of Bryan Cantrill from Joyent, who mentioned:

... Within 90 days, the entire DTrace team left Oracle, all primary inventors of ZFS had left Oracle, and the primary engineers for both zones and networking had left Oracle...
...nearly all of these engineers went to companies betting on illumos...

The only invention from Oracle since then is encryption - which was nearly ready in OpenSolaris already.
The only real invention since came from Joyent, who added KVM to illumos
(not available in Solaris).

http://www.slideshare.net/bcantrill/fork-yeah-the-rise-and-development-of-illumos

So time will tell.

I do concur with this. Bryan's presentation: http://www.youtube.com/watch?v=-zRN7XLCRhc



Dedup is not to be used; it is not really usable. You need huge amounts of RAM, and it is much easier and cheaper to buy more disks instead - disks are cheap. If the computer becomes unusable for days when deleting a deduped snapshot, you should avoid it. Buy more disks instead and you avoid all of the problems. Compression, on the other hand, is fine and works well.

I also concur with this. Stay on the beaten path; don't beat your own path - you might get beat.
 
Thanks for all the replies and the great feedback. Given that the cluster is using only 25% of the available storage and deduplication has significant caveats with limited RAM, I'll skip deduplication for now. When storage requirements increase, I'll decide whether to add RAM and enable deduplication or change the enclosure and add more drives.

Bryan Cantrill's slides and video have great insights on the history and future of OpenSolaris. Joyent is actively enhancing the former OpenSolaris codebase in its SmartOS. The few postings on this forum about SmartOS seem to imply it is still not widely deployed for SANs, yet it appears to be a promising alternative to OpenIndiana and Nexenta. Not having a GUI in the base system is a good strategy for a server OS; the memory and CPU consumed by an X11 server and a GNOME desktop are better spent on network services. My OpenIndiana installs are mostly text-based and are managed via a web interface using _Gea's napp-it or via ssh using Solaris commands.
 
Check out DragonFlyBSD's HammerFS if you want to take advantage of deduplication. Its implementation is a lot more efficient than the ZFS implementation.
 
I share the opinion of Bryan Cantrill from Joyent, who mentioned:

... Within 90 days, the entire DTrace team left Oracle, all primary inventors of ZFS had left Oracle, and the primary engineers for both zones and networking had left Oracle...
...nearly all of these engineers went to companies betting on illumos...

The only invention from Oracle since then is encryption - which was nearly ready in OpenSolaris already.
The only real invention since came from Joyent, who added KVM to illumos
(not available in Solaris).

http://www.slideshare.net/bcantrill/fork-yeah-the-rise-and-development-of-illumos

So time will tell.

I was pretty much set on building a NAS around Solaris 11, but after reading that and seeing Bryan's presentation and investigating a bit more, I'll be going with one of the illumos-based distros.
 
In the future I will turn to Illumos-based distros. But right now Solaris 11 has a small ZFS advantage: Illumos is at build 151, and Solaris 11 is at build... 174, I think. Each build number is two weeks of work.

Illumos is working on how to cope with different ZFS versions. Say Solaris has version 174 and Illumos has version 174, but it is different code. Then ZFS will tag the zpool with an ID so it can know which implementation the zpool comes from - Oracle Solaris or Illumos - and keep up compatibility.

So you will be able to migrate your zpool between different ZFS versions.
 
In the future I will turn to Illumos-based distros. But right now Solaris 11 has a small ZFS advantage: Illumos is at build 151, and Solaris 11 is at build... 174, I think. Each build number is two weeks of work.

Illumos is working on how to cope with different ZFS versions. Say Solaris has version 174 and Illumos has version 174, but it is different code. Then ZFS will tag the zpool with an ID so it can know which implementation the zpool comes from - Oracle Solaris or Illumos - and keep up compatibility.

So you will be able to migrate your zpool between different ZFS versions.

Personally, I would still start with an illumos build if I were starting something in the next few months (if I had to pick one right now, that would indeed be tough). After watching that video, it seems like illumos is the clear future of ZFS when it comes to future features (and when you consider the talent they have on board).

Although, true, OI/Nexenta only support up to ZFS pool version 28, I believe (not sure where illumos goes with this); the things Oracle has added since are pretty minor and not really features that I would require.

It also makes me wonder who will be left at Oracle to be the driving force continuing the improvement of their version of ZFS going into the future.
 
Keep in mind that Nexenta, at least, will be switching to an illumos codebase for their upcoming release.
 
In the future I will turn to Illumos-based distros. But right now Solaris 11 has a small ZFS advantage: Illumos is at build 151, and Solaris 11 is at build... 174, I think. Each build number is two weeks of work.

Illumos is working on how to cope with different ZFS versions. Say Solaris has version 174 and Illumos has version 174, but it is different code. Then ZFS will tag the zpool with an ID so it can know which implementation the zpool comes from - Oracle Solaris or Illumos - and keep up compatibility.

So you will be able to migrate your zpool between different ZFS versions.

Current state:
ZFS pool v28 is open source.
ZFS pools > v28 from Oracle are not open and are totally incompatible with everything else.

There are only 3 options:
1. Oracle opens Solaris again (I doubt it)
2. Illumos develops a compatible zpool > v28 (comparable situation: NTFS)
3. Illumos also adds new features of its own; these pools are incompatible with Solaris.
If you need interchangeable pools, you must stay at v28 forever.

Currently all three options are possible.
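
If you need to keep a pool interchangeable, you can check and pin the version explicitly (pool name "tank" and the disk names are only placeholders):

zpool get version tank
zpool upgrade -v
zpool create -o version=28 tank mirror c1t0d0 c1t1d0

The first shows the pool's current on-disk version, the second lists the versions the running system supports, and the last creates a pool pinned at v28 on a newer system. Just do not run "zpool upgrade" on a pool you still want to import on an older or different implementation.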
 
Most of the developers who were the driving force and the key innovators around ZFS live in the illumos community now.
 
This post to the zfs-discuss list has additional insights on the caveats of ZFS deduplication.

http://mail.opensolaris.org/pipermail/zfs-discuss/2011-July/049209.html
Summary: Dedup memory and performance (again, again)

Deduplication logic on commercial SANs appears to be more memory-efficient. The EMC VNXe 3100 claims high deduplication ratios and good performance using a Nehalem processor with 4 GB of RAM.

Not surprising, given how 'early days' ZFS dedup was when development on it was basically halted.
 