ZFS Deduplication requirements and performance

Jay69

I have searched through the forum and there seem to be various formulas thrown up for the amount of RAM required for deduplication. Is the RAM requirement based on the actual size (blocks) of the pool or the amount of data (blocks) used in that pool? I.e. if my pool is 2 TB and 50% utilized, is it based on 2 TB or 1 TB (I know it's really block size and number of blocks)?

Given the example of a 2 TB pool, VMware use case, what would be a gauge of the requirements?
 
It's pretty basic.

Number of blocks used * 320 bytes = size of the DDT for dedup.

The DDT can't use more than 1/4 of the ARC RAM, or if you shove it out onto L2ARC you save some RAM, but I never cared to calculate that.

The main variable is what your block size is. Using zvols, it defaults to an 8K block size normally. If you use a 128K average block size, it's going to require a lot less RAM.

Using a 128K block size kills random I/O performance, though.
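
In rough Python, the arithmetic looks like this (my own sketch; the ~320 bytes per DDT entry is only an approximation and varies by platform):

def ddt_size_bytes(data_bytes, avg_block_bytes, bytes_per_entry=320):
    # One DDT entry per unique block actually written to the pool.
    blocks = data_bytes // avg_block_bytes
    return blocks * bytes_per_entry

TiB = 1024**4
GiB = 1024**3
# 1 TiB of data at the usual 8K zvol block size vs. 128K records
print(ddt_size_bytes(1 * TiB, 8 * 1024) / GiB)     # ~40 GiB
print(ddt_size_bytes(1 * TiB, 128 * 1024) / GiB)   # ~2.5 GiB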


No idea what you mean by the actual size (blocks) of the pool versus the amount of data (blocks) used in the pool. Aren't they one and the same? A block is a spot in the zpool that has data in it; its size depends on how much data was written at the time it was saved.

If your pool only has 1 TB of data, you only have a dedup table that manages that 1 TB of data; you don't track empty pool space, because there is nothing to track yet.
 
patrickdk covered it pretty well. I'd also add that 128K and dedup rarely provide a decent dedup ratio -- if you think about it, we dedup on a block by block basis, and with 128K in each block, the chances it will be identical to another block are significantly lower than when the dataset is using a smaller blocksize, especially something more like 8 or 4K. The obvious downside of this, as he mentioned, is that the lower the blocksize, the worse the RAM hit.

When people say the RAM requirement is based on the block size, they mean block size, not 'used space in the pool', which I think is your point of confusion. As simple sizing arithmetic it is ~320 bytes per block of data in the DDT (YMMV), so depending on the average block size, your RAM requirement for the same 'amount of data' can vary wildly. Example:

- 1 TB of deduplicated data @ 128 KB average block size would need: ~ 2.5 GB of RAM for DDT
- 1 TB of deduplicated data @ 4 KB average block size would need: ~ 80 GB of RAM for DDT

The reason there's a 32x increase in RAM requirement for the 4 KB average block size is simple -- 4 KB goes into 128 KB 32 times. On the flip side, the chance of a decent ratio of deduplicated data is way higher on the 4 KB average block size (statistically, it's 32x more likely, would I think be the simple reasoning), especially on just any old random set of data. It is worth noting that the higher your dedup ratio is, the LOWER the amount of RAM you will require per-block in the DDT. It takes significantly more RAM to address 1 TB of data at a 1:1 dedup ratio than it does to address 1 TB of data at a 10:1 dedup ratio, for example, because you just dropped the number of unique entries by 10x, and the extra tracking information on the remaining entries to cover the other 9 copies is less than a whole new unique entry would have been.
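
To put rough numbers behind that, here's the same arithmetic with a dedup ratio thrown in (my own sketch, using the ~320 bytes per entry figure and ignoring the smaller per-reference bookkeeping mentioned above):

def ddt_ram_bytes(logical_bytes, avg_block_bytes, dedup_ratio=1.0, bytes_per_entry=320):
    # Only unique blocks need a full DDT entry; a higher dedup ratio
    # means fewer unique blocks for the same amount of logical data.
    unique_blocks = (logical_bytes / avg_block_bytes) / dedup_ratio
    return unique_blocks * bytes_per_entry

TiB = 1024**4
GiB = 1024**3
print(ddt_ram_bytes(1 * TiB, 128 * 1024) / GiB)       # ~2.5 GiB
print(ddt_ram_bytes(1 * TiB, 4 * 1024) / GiB)         # ~80 GiB
print(ddt_ram_bytes(1 * TiB, 4 * 1024, 10.0) / GiB)   # ~8 GiB at a 10:1 ratio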

My general rule of thumb for sizing RAM if you're unsure is (a quick sketch of the arithmetic follows the list):

- 8 GB minimum
- add 1 GB for every 1 TB of raw space in the pool (data drives)
- then add another 1 GB of RAM for every TB of usable space in the pool
- round up to the nearest motherboard RAM allotment (so if the above math leads you to 17 GB, you don't put 16 GB of RAM in the system, you put 24 or 32 GB in)
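
A quick sketch of that rule (my own illustration; the list of 'motherboard RAM allotments' is just an assumed set of common totals, adjust it for your board):

def rule_of_thumb_ram_gb(raw_tb, usable_tb):
    # 8 GB base + 1 GB per TB raw + 1 GB per TB usable, rounded up
    # to the next RAM total the board can plausibly be populated with.
    estimate = 8 + raw_tb + usable_tb
    common_totals = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256]
    return next(t for t in common_totals if t >= estimate)

# Hypothetical example: 24 TB raw in mirrors, ~12 TB usable
print(rule_of_thumb_ram_gb(24, 12))   # 8 + 24 + 12 = 44 -> 48 GB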

That'll USUALLY cover you, though as with any generic rule, it won't cover all cases and starts to fall apart on very large or very small systems. It is also worth noting that ZFS considers the DDT (DeDupe Table) to be metadata. This means it is affected by the zfs_arc_meta_limit tunable.

The zfs_arc_meta_limit is, by default, set to 25% of maximum ARC size (and maximum ARC size is, by default, typically either all of your RAM minus 1 GB or 3/4 of your total RAM, whichever is larger). This can mean that you have a large DDT that COULD fit in RAM, but the default options for ZFS leave a significant chunk of it not loadable: the DDT may be 50 GB in size and you have 96 GB in the system, but by default ZFS can only use 23.75 GB of that for metadata. You'll have to manually raise the zfs_arc_meta_limit tunable to let the ARC hold more of the DDT.
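
A small sketch of those defaults (my own illustration of the arithmetic described above; check the actual tunables on your platform before relying on it):

GiB = 1024**3

def default_arc_meta_limit_bytes(total_ram_bytes):
    # Default ARC max: the larger of (RAM - 1 GB) and 3/4 of RAM;
    # the metadata limit then defaults to a quarter of that.
    arc_max = max(total_ram_bytes - 1 * GiB, total_ram_bytes * 3 // 4)
    return arc_max // 4

print(default_arc_meta_limit_bytes(96 * GiB) / GiB)   # 23.75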

It is also worth noting here since I've seen a few people think it -- the DDT is not an 'all or nothing' sort of thing. It isn't a matter of fitting it all into RAM or none of it into RAM, it will load as much of it into RAM as is requested and it has space to hold. It is, however, very common that if the vast majority of the DDT is not in RAM, system performance suffers greatly for it. So shoot for 100%, and if you make it to within 10% of that you'll probably be OK.

Further, ZFS doesn't currently (AFAIK) put any special emphasis on holding onto the DDT over any other metadata. You can end up with contention between regular metadata and DDT metadata if the two are being accessed equally. This also might not be a problem: if there's a piece of metadata that gets requested 300x a second and a DDT entry requested 5x an hour, I'd much rather have that metadata block in there than that DDT entry. :)

However, I know that at least within Nexenta, various improvements to how this functions and in ways to better manage and direct the DDT have been hashed over. Not being in engineering, I'm afraid I can't discuss specifics, but it is my understanding there is work in progress. No ETA!
 
@All,

Thanks. That's the clearest explanation I've seen in a while.

Assuming that RAM is sufficient to hold the DDT metadata, would my main gain be the HDD space saving, or would there be a significant performance improvement as well? I guess my point is that as much as the space is important, I think the performance is equally or more important. Assume I get a reasonable hit ratio, since the VMs are the same OS and running the same applications / data etc. Or would I be better off looking elsewhere if I'm looking for a performance increase?
 
It hardly seems worthwhile for a 2 TB pool - that's small enough that you could easily put together a pool comprised of, say, 256 GB Crucial M4 SSD mirrors (~$6,000 for 16 drives gives you 2 TB usable with 8 mirrors). At that point the performance you get is going to tremendously outstrip just about anything else you could possibly do.

If that was just an example, and your pool is really, say, 20 TB, then for performance I would look at an SSD L2ARC and SSD log devices first (assuming these are sync writes over, say, NFS).

But if you are really running a 2 TB pool and performance is an issue, then for $5-6K you could essentially make performance a non-issue (well, you would be limited by whatever your interface is - 1GbE/10GbE/FC/etc).
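
If you want to sanity-check that sizing, here is the arithmetic as a quick sketch (the ~$375 per drive is just my illustrative number to land on the ~$6,000 figure above):

def two_way_mirror_pool(drive_count, drive_gb, drive_cost_usd):
    # 2-way mirrors: usable capacity is half the raw capacity.
    usable_tb = (drive_count // 2) * drive_gb / 1000
    return usable_tb, drive_count * drive_cost_usd

print(two_way_mirror_pool(16, 256, 375))   # (2.048 TB usable, $6,000 in drives)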
 
It's more of a 'here, pick a number and scale that' example... and SSD mirrors for a pool are clearly unscalable... Beyond that, I'm specifically asking about deduplication and its impact on performance, NOT L2ARC and SLOG and their impact on performance, in this thread. Much as I appreciate your input, it would be great to stick to the topic at hand.

 
It's more of a 'here, pick a number and scale that' example... and SSD mirrors for a pool are clearly unscalable...

I wouldn't necessarily agree with that - for typical VM hosting I find that SSDs actually scale reasonably well - and excellently when you take performance into consideration. Bulk VM storage requirements aren't really that great - and as long as you keep your bulk data out of the VM image (which is the case in most situations) there are no problems.

As you scale out, the relative budget scales as well - I suspect that if you are in a situation where you really need 20 TB of VM instances, then the $50K for SSDs is going to be well within the budget. Consider what a 20 TB usable SAN that delivered the IOPS/bandwidth 160 SSDs offer would cost.


Beyond that, I'm specifically asking about deduplication and its impact on performance, NOT L2ARC and SLOG and their impact on performance, in this thread. Much as I appreciate your input, it would be great to stick to the topic at hand.

Deduplication is not going to give you any significant performance benefit, and may well cause a performance reduction, as it reduces the amount of data the ARC can cache. Sure, if you have a corner case where you have 20 VMs that are identical and their working set fits within your DDT/ARC then maybe - but that's going to be a very specific and bizarre case.

The point being, you asked about performance, and running ZFS-style deduplication is probably one of the worst things you can do to improve performance.
 
AFAIK

There is nothing wrong with...

If all your VM images are pretty much the same, and you are looking to save space:
Turn on deduplication...
Make your VMs
Once done, turn off deduplication to get your speed back

As I believe once ZFS has deduplicated something... it doesn't all go pop and expand back out when you turn it off.

So if you can put up with the slow speed to begin with, to get up and going... go for it.
 
A kind of dedup can be achieved by using snapshots.

Create a master VM, configured and tested. Then you snapshot the master: it becomes a template. Now you clone the VM, and each clone gets its own filesystem. Thus, all changes are written to the clone VM, but unchanged data is still read from the master. In this way you can have a few master VMs and deploy many clones, each clone only saving its own changes in its own filesystem.

This IS dedup, but you have to do it manually; it is not done automatically by the system. There are many people doing this on a regular basis using zfs. I am doing this in VirtualBox: I have one Windows master VM, my girlfriend uses a clone, and I use another clone. No one touches the master VM.
 

To be clear, in my VirtualBox setup above I am using functionality from VirtualBox to do this; I am not using zfs snapshot functionality.

The difference is that zfs snapshots are more general. VirtualBox has this functionality built in, but with zfs snapshots I can "dedup" any data; the application does not need to have this functionality built in.

For instance, using zfs snapshots, I can record music in raw format using Cubase. The Cubase software does not have snapshots the way VirtualBox does, but using zfs allows me to dedup Cubase recordings. Say I have a 100 GB raw media file "file.raw". Say I want to change 10 MB in the middle and save the file again. Without snapshots I would end up with 2 x 100 GB of data on my disk (the old version and the new). But by snapshotting the filesystem and overwriting the file in place, only the changed 10 MB is written again, and I have 100 GB + 10 MB of data on my zfs disk.

Thus, a kind of dedup can always be achieved on zfs. But it is static, and you have to do it manually.
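
A toy sketch of that space accounting (my own illustration, not real zfs accounting; it just contrasts full copies with copy-on-write deltas, and the 20 GB / 2 GB figures are hypothetical):

def full_copies_gb(base_gb, copies):
    # Every copy stores the full image again.
    return base_gb * copies

def cow_clones_gb(base_gb, copies, delta_gb_per_copy):
    # One shared base plus only the changed blocks per clone.
    return base_gb + copies * delta_gb_per_copy

# Hypothetical: 10 VMs from a 20 GB master image, each diverging by ~2 GB
print(full_copies_gb(20, 10))      # 200 GB as full copies
print(cow_clones_gb(20, 10, 2))    # 40 GB as clones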
 
Sigh... does anybody actually read the question before answering, or have they already decided on the answer and then rephrased it just to meet the bare minimum requirement of pretending to answer the question asked?

If budget weren't an issue, I would have just got DRAM SSDs and the 1 billion IOPS that I would enjoy... Do you actually have the 160 SSDs / 20 TB and run it for any decent amount of time? Seriously, I'm not looking for someone who has no idea of my VM requirements to make recommendations on my SAN. I really appreciate the comments, but I get freaked out by recommendations on what scales well for me before I even provide any information about my requirements or setup.

 
My understanding is that deduplication cannot be turned off and on like that... not sure if I'm right on this...

My read on this was that deduplication doesn't slow things down that much given enough RAM. Essentially you are cutting down the amount of data written to disk...

 
In my case, snapshots won't be feasible. In addition, in my case the clone VMs grew much larger than the master VM.
I guess the bottom line is that I probably need to test it out myself!!

 
Do you actually have the 160 SSDs / 20 TB and run it for any decent amount of time? Seriously, I'm not looking for someone who has no idea of my VM requirements to make recommendations on my SAN. I really appreciate the comments, but I get freaked out by recommendations on what scales well for me before I even provide any information about my requirements or setup.


No, not 20 TB, but I do have 12 256 GB Micron SSDs running a 1.5 TB pool (all 2-way mirrors) as a VM host datastore.

Your original post said "Given the example of a 2 TB pool, VMware use case". If that *isn't* representative of your requirements/setup, why did you bring it up?

But anyway, good luck and have fun! I'll bow out of your discussion now. I think you should definitely run deduplication enabled in a production environment. Make sure to turn it on for as many filesystems as you can so you get maximum coverage on your DDT table.
 

And I assume that you have run deduplication with my exact mix of VMs? Why bother discussing it if we are all going to be running SSDs? 12 SSDs is not exactly equal to 160 SSDs. I would consider running 12 SSDs as a read cache, and in fact that's what I plan to do, but my specific question was more about deduplication, and obviously I would love to hear from those who have tried it rather than those who have imagined it...
 
OK, thanks to all who replied. I've been testing with a small box: 6 disks in 3 mirror vdevs. Right now, I'm hitting deduplication ratios of 22x to 24x. Yes, this is pretty much a test environment. But on to my next point: a 40 GB file takes approx 10 mins to finish copying.
Will test and see if performance is the same without deduplication..
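
For reference, the rough throughput that copy time implies (my own arithmetic, assuming decimal gigabytes and that the full 10 minutes is copy time):

file_gb = 40
minutes = 10
print(file_gb * 1000 / (minutes * 60))   # ~66.7 MB/s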
 