ZFS dedupe is fixed soon

brutalizer

[H]ard|Gawd
Joined: Oct 23, 2010
Messages: 1,602
ZFS dedupe has always had problems and was not recommended for production. For instance, it uses gobs of RAM: roughly 1GB of RAM for every 1TB of disk space. This is where the misconception comes from that ZFS requires 1GB RAM for every 1TB of disk space; that guideline only applies if you want to use dedupe. Also, deleting data can be slow with dedupe enabled.

This summer, Oracle bought GreenBytes. GreenBytes rewrote the ZFS dedupe engine and claims best-in-class performance: they can dedupe with near-zero latency, and they can dedupe 5,000 fat VDI clones from 210TB of disk space down to 4TB. Solaris can now reportedly boot 6,000 VMs in 5-6 minutes or so. The GreenBytes dedupe engine will be incorporated in the next upgrade of Oracle Solaris, coming this year. So, no more problems with ZFS dedupe; you can use it to its full extent. Compression also helps alongside dedupe, so you should enable compression (which is very fast), and since the GreenBytes dedupe is also very fast you get the benefit of both. I think I shall dedupe and compress all my data because it is so fast. In fact, compression can even increase performance, because it is faster to load 1,000 bytes and decompress them into 2,000 bytes than it is to read 2,000 bytes.
http://www.theregister.co.uk/2013/08/27/greenbytes_latency_smash_with_flash_cache/
http://www.theregister.co.uk/2012/10/12/greenbytes_chairman/
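For reference, the back-of-envelope arithmetic behind those figures (my own check, assuming roughly a 40GB image plus 2GB of swap per clone, the sizes mentioned later in this thread):

clones = 5000
logical_per_vm_gb = 40 + 2                    # image + swap per VDI clone (assumed)
logical_tb = clones * logical_per_vm_gb / 1000.0
physical_tb = 4.0                             # claimed on-disk footprint

print("logical : %.0f TB" % logical_tb)       # ~210 TB
print("physical: %.0f TB" % physical_tb)      # 4 TB
print("ratio   : %.1f:1" % (logical_tb / physical_tb))   # ~52.5:1, i.e. the "50:1" claim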

Also, current Oracle Solaris 11.2 resilvers at full platter speed. Before, resilvering a disk in a badly fragmented zpool could take a very long time. Now ZFS first examines all the data to be resilvered, sorts it and logs it; only then does the actual resilvering take place, at full disk speed, i.e. 100-150MB/sec.
http://milek.blogspot.se/2014/12/zfs-raid-z-resilvering.html
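Conceptually the change is just "sort the repair work by on-disk offset before doing it". A minimal sketch of that idea (my own illustration of the concept, not ZFS code; the offsets in the example are made up):

def plan_resilver(damaged_blocks):
    """damaged_blocks: (offset, length) pairs found by the metadata scan.
    Sorting by offset turns a random-seek workload into a mostly
    sequential one, which is why the repair pass can run at platter
    speed (~100-150 MB/s)."""
    return sorted(damaged_blocks)

# A fragmented pool hands us blocks in logical (effectively random) order:
scattered = [(7340032, 131072), (1048576, 131072), (524288, 131072)]
for offset, length in plan_resilver(scattered):
    # read `length` bytes at `offset` from the surviving disks and write
    # them to the replacement disk; offsets now only move forward
    print("repair %d bytes at offset %d" % (length, offset))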

Solaris 11.2 also has a persistent L2ARC: if you reboot, the contents of the L2ARC are saved and reloaded, so you don't have to wait until the L2ARC is fully populated again.
 
I think when you say something is fixed in ZFS, you really need to specify whether it's fixed in the FreeBSD implementation, the Linux implementation, the Oracle implementation or some other implementation.
 
There are only two implementations: OpenZFS and the Oracle one.

FreeBSD, Linux and illumos all base theirs on OpenZFS and are normally pretty well in sync, though FreeBSD can lag a little bit behind due to its release schedule.
 
I believe brutalizer has overestimated the benefits of the GreenBytes acquisition as "fixing dedupe in ZFS". The I/O offload engine they acquired will make evolutionary improvements over time, and only in Oracle's commercial product for now.
 
I believe brutalizer has overestimated the benefits of the GreenBytes acquisition as "fixing dedupe in ZFS".
How do you mean? ZFS dedupe is not that good today and is not recommended. GreenBytes, however, has a pristine dedupe engine; have you read the articles? Deduping with near-zero latency, 210TB down to 4TB (a 50:1 ratio) in practice and not just in theory, etc. Are those results not remarkable? The current, poor ZFS dedupe engine gives ratios of maybe 3:1 or so. What more do you require of a ZFS dedupe engine before you will say it is fixed? A 50:1 ratio at blindingly fast speed (near-zero latency) is not enough? You want 100:1 before you can agree that the ZFS dedupe engine is fixed?

BTW, I don't think this dedupe engine will ever be ported to OpenZFS, because GreenBytes and Oracle Solaris are closed source. Maybe I should have specified that in the headline: "ZFS dedupe engine is fixed soon - not OpenZFS".

But who does not want more storage thanks to deduplication? And you can add compression on top for even higher ratios, maybe another 2x or so. So in general situations, dedupe plus compression might give 10x or so in practical, real-life scenarios. Virtual machines are easier to compress because they share a lot of identical bytes, so for VMs you might get 50 x 2 = 100x in real life. In general situations, where you don't have many identical bytes, maybe 10x? Imagine that: your storage has suddenly grown 10x. An 8TB media server has suddenly become 80TB of storage. But the most important thing is that the GreenBytes dedupe engine works excellently and is used in production, unlike today's problematic ZFS dedupe engine. Who would not say that "the ZFS dedupe engine is fixed"? Who is not enthusiastic? :)

Caveat: maybe the GreenBytes dedupe engine requires a lot of RAM too? Hmmm... We'll see the details when the next Solaris is released.
 
If OpenZFS had block pointer rewrite functionality, then it would be easy to implement an offline dedup engine like the HAMMER filesystem has. It wouldn't require much RAM either.

As writing block pointer rewrite is pretty hard, there needs to be a company willing to invest in its development.
 
Not sure where these real life 100x dedup numbers come from.

For myself, with around 50 identical Linux VMs I get almost a 1.4x dedup ratio and a 4x compression ratio (at the 512-byte block level).

For the Windows environment, with 600 Win7 and 300 Win2008R2 VMware-cloned machines, I only get a 2.2x dedup and a 2.4x compression ratio.

There is just too much unique data per VM to make the dedup ratios go up, unless you *never* put logs or data onto the VM to do any actual work. For the Win7 VMs we are using Horizon View, so that dedups itself there, with about 30 replica masters from the same clone master; those would get high dedup values, but once the other VMs are added to the total, the overall dedup drops to basically useless.
 
If OpenZFS had block pointer rewrite functionality, then it would be easy to implement an offline dedup engine like the HAMMER filesystem has. It wouldn't require much RAM either.

As writing block pointer rewrite is pretty hard, there needs to be a company willing to invest in its development.
But I think the vast majority prefers online dedupe. Who wants to unmount(?) the zpool and wait while it dedupes, and then mount it again? Another problem with offline dedupe is that servers in production can never use it, because you don't want to take the entire server offline. It would be similar to an ext3 fsck, which requires you to take the RAID offline before looking for errors. ZFS scrub is online, and I think dedupe should be online too.

Jeff Bonwick, one of the fathers of ZFS (the other being Matt Ahrens, who now works at Delphix on OpenZFS), said they had bp rewrite working offline. Very scary, with lots of race conditions, he said; very difficult programming. For my personal home-user purposes, offline bp rewrite would be fine. If I need to remove a disk, I can take my zpool offline.
 
Not sure where these real life 100x dedup numbers come from.

For myself, with around 50 identical Linux VMs I get almost a 1.4x dedup ratio and a 4x compression ratio (at the 512-byte block level).

For the Windows environment, with 600 Win7 and 300 Win2008R2 VMware-cloned machines, I only get a 2.2x dedup and a 2.4x compression ratio.

There is just too much unique data per VM to make the dedup ratios go up, unless you *never* put logs or data onto the VM to do any actual work. For the Win7 VMs we are using Horizon View, so that dedups itself there, with about 30 replica masters from the same clone master; those would get high dedup values, but once the other VMs are added to the total, the overall dedup drops to basically useless.
I posted a link where GreenBytes shows a 50:1 saving in real life with dedupe, i.e. 210 TB of disk space down to 4 TB, deduping 5,000 VMs, each VM using 40GB of disk and 2GB of swap space, with no compression applied. Compression on top should boost it further, maybe another 2x (my guess). So in total, in real-life scenarios of identical VMs, you could get 50 x 2 = 100x. But those VMs are identical, so you can achieve good dedupe ratios. Of course, as people start to save different files on their VMs, the dedupe ratio will decrease. But what do they store? Most likely MS Office docs, or source code, or whatever they work on. That data is redundant and can be compressed heavily, i.e. also deduped.

Anyway, it is good that you can shrink the bulk of 5,000 VMs down to 4TB; you could probably cache it all in SSDs or PCIe RAM disks. The big savings come from deduping the VMs themselves. The user data does not take up that much space, maybe a few TB in total. How much space does a user need? I prefer to compress the VMs heavily.

For general-purpose storage, where you store lots of different files, maybe you would get... 10x? My guess. Say you have a media server with lots of movies. I bet the movie data is quite similar, so you could apply dedupe to great success. Likewise with MP3s, MS Office documents or source code.
 
For general-purpose storage, where you store lots of different files, maybe you would get... 10x? My guess. Say you have a media server with lots of movies. I bet the movie data is quite similar, so you could apply dedupe to great success. Likewise with MP3s, MS Office documents or source code.

I don't see how you would get ANY savings trying to deduplicate media like movies, TV shows, and music.


ZFS deduplication doesn't even find savings in TV show intros, which to the viewer appear identical. So don't expect movies that look completely different to somehow deduplicate any data.

http://manuelgrabowski.de/2014/09/29/zfs-deduplication-tvshow-intros/

Even in 10TB of H.264 video you would be extremely lucky to find even 1MB of identical blocks.

For large data like media ZFS is going to use 128K blocks (at a minimum) of which there are 1,048,574,976,000 possible permutations of a 128K block. In other words, it takes 134.2 petabytes to store every possible 128K block.


Also, it's incorrect to simply say that ZFS dedupe (as it stands currently) takes 1GB of memory per 1TB.

On ZFS, file data is stored in trees of blocks, where the leaves store X bytes and X is the value of recordsize at file creation. When a write is done to a file in a dataset with dedup=on, a lookup is done in the data deduplication table (DDT). If the lookup finds an entry, it increments a reference count; if it does not, it creates one. This involves 3 random seeks, and consequently your total write throughput becomes roughly [average record size * IOPS / 3]. If you have enough memory that the entire DDT stays cached, you avoid this limit.

The amount of memory needed to store a DDT is a function of the average record size. If your dataset has the default recordsize=128K (e.g. you are storing many >1MB files), then you can multiply your system memory by 153.6 to determine the total amount of unique data that you can store before hitting the limit I described. If you are storing many small <128KB files, then you need to calculate the average file size and then calculate [average file size * system memory * 12 / 10355] to obtain the total amount of unique data. This assumes the module settings for zfs_arc_max and zfs_arc_meta_max were left at their ZoL defaults. It is important to keep in mind that the unique data is different from total data, especially in the case of duplicate files. It also excludes metadata required for maintaining indirect block trees, dnodes (i.e. inodes), directories and the DDT itself.

The total amount of data that you can store on a pool before hitting the limit is [unique data * deduplication multiplier], where unique data is what I described how to calculate and the deduplication multiplier is a number that is either 1 or greater. A measure of the deduplication multiplier is provided by zpool list as DEDUP, so you should be able to see that yourself. If your pool has data that was written without dedup=on, then any duplicates in that data will be counted as unique data for the purposes of that calculation. To provide a simple example of the deduplication multiplier, imagine a pool with only two files that store the same data. The deduplication multiplier for that pool would be 2, provided that both were written with dedup=on set on their dataset. If one or both were written with dedup=off, then the deduplication multiplier would be 1. You can use zdb to calculate the theoretical deduplication statistics for an entire pool by running zdb -D $POOLNAME. Note that this will require significant memory because it constructs a full DDT in userland memory.
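To put the formulas above in one place, here is a rough calculator (my own sketch; the 153.6 and 12/10355 factors are lifted straight from the post and assume the ZoL default zfs_arc_max / zfs_arc_meta_max settings it mentions):

RECORDSIZE_DEFAULT = 128 * 1024   # bytes

def max_unique_data(system_ram_bytes, avg_record_bytes=RECORDSIZE_DEFAULT):
    """Approximate unique data (bytes) whose DDT still fits in RAM."""
    if avg_record_bytes >= RECORDSIZE_DEFAULT:
        return system_ram_bytes * 153.6
    # small-file case from the post: average file size * system memory * 12 / 10355
    return avg_record_bytes * system_ram_bytes * 12 / 10355

def ddt_miss_throughput(avg_record_bytes, disk_iops):
    """Write throughput (bytes/s) once every lookup costs ~3 random seeks."""
    return avg_record_bytes * disk_iops / 3

ram = 16 * 2**30   # e.g. a 16 GiB box (my assumption)
print("unique data before the DDT spills out of RAM: %.1f TiB"
      % (max_unique_data(ram) / 2**40))                         # ~2.4 TiB
print("worst-case dedup write speed at 150 IOPS: %.1f MiB/s"
      % (ddt_miss_throughput(RECORDSIZE_DEFAULT, 150) / 2**20))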
 
I don't see how you would get ANY savings trying to deduplicate media like movies, TV shows, and music.

ZFS deduplication doesn't even find savings in TV show intros, which to the viewer appear identical. So don't expect movies that look completely different to somehow deduplicate any data.

http://manuelgrabowski.de/2014/09/29/zfs-deduplication-tvshow-intros/
There is a potential flaw in your reasoning here. You show a link that says ZFS dedupe does not give good disk savings on TV shows. Well, all ZFS dedupe users say they typically get 2-3x savings, if you read the dedupe posts. See "patrickd"'s post above here, for instance.

But GreenBytes reports 50x savings in real-life dedupe scenarios. I have never seen any ZFS dedupe user report 50x savings, only 2-3x. This means that GreenBytes probably has a superior ZFS dedupe engine, one that can obviously achieve 50x savings. So, could it be that GreenBytes also achieves 50x savings on TV media? Maybe? It is not out of the question, contrary to what you claim. You are comparing the poor current ZFS dedupe engine to GreenBytes' superior proprietary dedupe engine and trying to draw conclusions; you cannot compare apples to oranges when drawing conclusions.

So it seems that GreenBytes has a superior dedupe engine, and all the information we have about the current ZFS dedupe engine (1GB of RAM for every 1TB of disk space) is no longer valid. Please don't mix them up and apply the know-how of the old dedupe tech to the new GreenBytes tech. Forget everything about ZFS dedupe (numbers, stats, etc.) and relearn anew. I will.
 
I did actually try to determine the space savings of my 12 GB media library with zdb. The deduplication space saving was less than 1 percent. If you approach even 2x with dedup, it is because of large duplicate files, which can be taken care of by much simpler means; SnapRAID does it for you as a bonus. Large padding blocks can be deduplicated well, but lz4 already takes care of those. So for the home user, who has most of his space consumed by media files, deduplication makes no sense. For VM backing storage it has its uses.
 
But GreenBytes reports 50x savings in real-life dedupe scenarios. I have never seen any ZFS dedupe user report 50x savings, only 2-3x. This means that GreenBytes probably has a superior ZFS dedupe engine, one that can obviously achieve 50x savings. So, could it be that GreenBytes also achieves 50x savings on TV media?

Your reasoning is flawed. This is part research, part advertising and needs to be treated as such. 50x savings on media files... no, never. Unless you're storing 50 copies of the same exact file.
 
Not sure where these real life 100x dedup numbers come from.

For myself, with around 50 identical Linux VMs I get almost a 1.4x dedup ratio and a 4x compression ratio (at the 512-byte block level).

For the Windows environment, with 600 Win7 and 300 Win2008R2 VMware-cloned machines, I only get a 2.2x dedup and a 2.4x compression ratio.

There is just too much unique data per VM to make the dedup ratios go up, unless you *never* put logs or data onto the VM to do any actual work. For the Win7 VMs we are using Horizon View, so that dedups itself there, with about 30 replica masters from the same clone master; those would get high dedup values, but once the other VMs are added to the total, the overall dedup drops to basically useless.

Most of that was done for memory security by Microsoft. Try it with Server 2003 VMs, lol.
 
For the Windows environment, with 600 Win7 and 300 Win2008R2 VMware-cloned machines, I only get a 2.2x dedup and a 2.4x compression ratio.

These are two conflicting mechanics: dedup works best with small block sizes, while compression works better with larger blocks.
To make effective use of dedup, the NTFS cluster size (4K by default) should be at least as large as the ZFS volume block size (8K by default).
 
Heh? Since when can you compress Word documents? They have already been compressed for a long time now.

All Microsoft Office versions since 2007 compress their documents: docx, xlsx, ...

This is cutting into my nice 2-5x compression ratios.

And yes, I'm saying that even with IDENTICAL VMs I don't get over a 2.4x dedup ratio, due to the unique information in each VM: each VM must join AD, each VM has a different user profile installed, ...

And our base VM image is not 40 gigs but much smaller, and the unique data is normally at least a few gigs per VM.

It might just be my use case, but I can't imagine my use case is *that* different from most people's real life. I have met some people getting up into 10x dedup ratios, but their workload is *very* unique. I have never heard of anyone getting higher than that, except when they are doing it wrong and could easily be using ZFS clones instead of dedup.
 
There is a potential flaw in your reasoning here. You show a link that says ZFS dedupe does not give good disk savings on TV shows. Well, all ZFS dedupe users say they typically get 2-3x savings, if you read the dedupe posts. See "patrickd"'s post above here, for instance.

But GreenBytes reports 50x savings in real-life dedupe scenarios. I have never seen any ZFS dedupe user report 50x savings, only 2-3x. This means that GreenBytes probably has a superior ZFS dedupe engine, one that can obviously achieve 50x savings. So, could it be that GreenBytes also achieves 50x savings on TV media? Maybe? It is not out of the question, contrary to what you claim. You are comparing the poor current ZFS dedupe engine to GreenBytes' superior proprietary dedupe engine and trying to draw conclusions; you cannot compare apples to oranges when drawing conclusions.

So it seems that GreenBytes has a superior dedupe engine, and all the information we have about the current ZFS dedupe engine (1GB of RAM for every 1TB of disk space) is no longer valid. Please don't mix them up and apply the know-how of the old dedupe tech to the new GreenBytes tech. Forget everything about ZFS dedupe (numbers, stats, etc.) and relearn anew. I will.

The issue is that for media files there are simply no blocks that can possibly be deduplicated. No matter how good your dedupe engine is, if there are no identical blocks then nothing can dedupe.

The only dedupe that you are going to get on media files is metadata. The problem is, the metadata on media files is such a tiny, tiny, tiny fraction of the file, since media like HD video is so huge.

Word documents, for example, have LOTS of data that can be deduplicated, since they contain lots of metadata that defines their structure.

Create a Word document and put 10,000 characters into it. The size of the file without compression is 33KB, but we only put 10KB of text into it. The other 23KB (70%) of the file is XML metadata-type stuff. This document is not even using any fancy formatting, which would only increase the amount of metadata.

Create another file and put a different 10,000 characters in it.

Now binary diff these 2 files. You can see that 70% of the data is the same between them. That would be a 70% deduplication saving if you put these files on ZFS with dedupe on.

Now take a 10GB HD movie. Run it through the same encoder a second time to re-encode it with the same settings so you have approximately the same size output file. Do a binary diff and see how much of the file is the same. It's actually only a few hundred KB of this 10GB file that remains the same, even though the file looks to us to be the same.

You can do the same with an MP3.

There is simply no meaningful amount of data to dedupe in media like video and audio. Even in video and audio that appear identical, there are no matching blocks of data except the codec/container metadata and other embedded data like tags and subtitles. Or perhaps identical embedded album art images in all your music files.

When they claim such high dedupe ratios, it's because they are storing the kinds of things a business would normally store: data with a very high metadata-to-content ratio (like documents), and things that deduplicate really well, like software, duplicate operating systems and virtual disks that started out as clones of an existing disk.

You want other examples?

You could take a look at my CrashPlan archive. I have uploaded all 18TB of my media to them with deduplication set to Full (which guarantees no two identical blocks will ever be uploaded or stored twice). Out of 18TB I don't even have 1GB of data deduplicated, and that is with a perfect deduplication algorithm (it checks every block against the big hash table of all other blocks so it won't miss anything; yes, it's very slow and very resource-heavy).

If you understand at all how deduplication works and what the data behind media files like H.264 HD video or MP3 audio looks like, you will easily realize that there are no meaningful savings to be had.

If you still don't believe me you can always run a binary block compare on all your media files and you will see just how few duplicate blocks there really are in that type of data. If there aren't duplicate blocks in your data then you simply have nothing to deduplicate.
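If anyone wants to run that block compare, a minimal version could look like this (my own sketch; the 128 KiB block size mirrors the default ZFS recordsize, and you point it at whatever files you like):

import hashlib, sys

BLOCK = 128 * 1024   # matches the default ZFS recordsize

def count_duplicate_blocks(paths):
    seen, total, dups = set(), 0, 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK)
                if not block:
                    break
                total += 1
                digest = hashlib.sha256(block).digest()
                if digest in seen:
                    dups += 1           # a block already stored once before
                else:
                    seen.add(digest)
    return total, dups

if __name__ == "__main__":
    total, dups = count_duplicate_blocks(sys.argv[1:])
    print("%d of %d blocks are duplicates" % (dups, total))

On compressed media you should see the duplicate count stay at or near zero, which is the whole point being made above.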
 
The issue is that for media files there are simply no blocks that can possibly be deduplicated. No matter how good your dedupe engine is, if there are no identical blocks then nothing can dedupe.

The only dedupe that you are going to get on media files is metadata. The problem is, the metadata on media files is such a tiny, tiny, tiny fraction of the file, since media like HD video is so huge.

Word documents, for example, have LOTS of data that can be deduplicated, since they contain lots of metadata that defines their structure.

Create a Word document and put 10,000 characters into it. The size of the file without compression is 33KB, but we only put 10KB of text into it. The other 23KB (70%) of the file is XML metadata-type stuff. This document is not even using any fancy formatting, which would only increase the amount of metadata.

Create another file and put a different 10,000 characters in it.

Now binary diff these 2 files. You can see that 70% of the data is the same between them. That would be a 70% deduplication saving if you put these files on ZFS with dedupe on.

Now take a 10GB HD movie. Run it through the same encoder a second time to re-encode it with the same settings so you have approximately the same size output file. Do a binary diff and see how much of the file is the same. It's actually only a few hundred KB of this 10GB file that remains the same, even though the file looks to us to be the same.

You can do the same with an MP3.

There is simply no meaningful amount of data to dedupe in media like video and audio. Even in video and audio that appear identical, there are no matching blocks of data except the codec/container metadata and other embedded data like tags and subtitles. Or perhaps identical embedded album art images in all your music files.

When they claim such high dedupe ratios, it's because they are storing the kinds of things a business would normally store: data with a very high metadata-to-content ratio (like documents), and things that deduplicate really well, like software, duplicate operating systems and virtual disks that started out as clones of an existing disk.

You want other examples?

You could take a look at my CrashPlan archive. I have uploaded all 18TB of my media to them with deduplication set to Full (which guarantees no two identical blocks will ever be uploaded or stored twice). Out of 18TB I don't even have 1GB of data deduplicated, and that is with a perfect deduplication algorithm (it checks every block against the big hash table of all other blocks so it won't miss anything; yes, it's very slow and very resource-heavy).

If you understand at all how deduplication works and what the data behind media files like H.264 HD video or MP3 audio looks like, you will easily realize that there are no meaningful savings to be had.

If you still don't believe me you can always run a binary block compare on all your media files and you will see just how few duplicate blocks there really are in that type of data. If there aren't duplicate blocks in your data then you simply have nothing to deduplicate.
You have a point.

Anyway, even if the GreenBytes dedupe does not handle media well, at least it seems to work well in general. So it is still a big win, as the current dedupe engine is less than optimal.

I suggest we settle this discussion later, when the GreenBytes dedupe is out; then we can see how good it is. Obviously they reach 50x savings on VMs in real life, whereas the current dedupe engine reaches 2-3x in the same scenario, so GreenBytes is clearly doing something different and much better. We can conclude the discussion then: we try VMs, we try media, and we compare the dedupe engines.
 
Obviously they reach 50x savings on VMs in real life, whereas the current dedupe engine reaches 2-3x in the same scenario.
That could very well be a tailored case. Tbh I find it hard to believe they made significant improvements on the dedup ratio front (note they did not provide vanilla ZFS numbers for the same dataset). Dedup performance improvements, on the other hand, sound perfectly reasonable.
 
What do you mean by "dedup ratio front"?

VMs might be a tailored case, but I don't want to rule out superior dedupe yet. Sure, if you do a binary diff between two media files you might get very large differences, so only a tiny bit of data is identical and hence not suitable for dedup. But that does not necessarily prove anything. Why? Let us reason a bit.

Well, assume that there is always a bit that differs within the dedupe window. Say bits 0-100 are identical in both files except that they differ in one bit, so that window is not dedupable. And bits 101-200 differ in one bit, etc. This scenario makes dedup useless. But can it happen that bits 0-100 in one file are identical to bits 10,300-10,400 in the other file? Sure, bits 0-100 differ between the files, but maybe that bit pattern occurs later in another place? Granted, that is not likely, but I guess that media files have the same structure throughout, so you see the same bit patterns over and over again. Maybe there are only 100 bit patterns used over and over again? Or 10,000? Say the dedupe window is small, say 10 bytes, and you encode media with the same structure over and over again; then you should find multiple occurrences. I imagine that MP3s and docs have different structures, so they do not have much in common.

But media files, with a small window, should have much in common. That is the reason I do not rule out successful dedup of media by GreenBytes yet. I would not be surprised if they dedupe media well, besides VMs. So we need to wait, and then benchmark media vs VMs, so we can finally settle this discussion.
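For what it's worth, the "small window" idea is easy to measure on real files. A throwaway sketch (entirely my own; WINDOW is an arbitrary choice and has nothing to do with how GreenBytes actually works):

from collections import Counter
import sys

WINDOW = 16   # bytes per chunk; shrink it and repeats go up, but so does bookkeeping overhead

def chunk_repeats(path):
    counts = Counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(WINDOW)
            if len(chunk) < WINDOW:
                break
            counts[chunk] += 1
    total = sum(counts.values())
    unique = len(counts)
    return total, unique

if __name__ == "__main__":
    total, unique = chunk_repeats(sys.argv[1])
    print("%d chunks, %d unique (%.2fx potential saving)"
          % (total, unique, total / float(unique)))

If my guess were right, a compressed movie would show far fewer unique chunks than total chunks; if the skeptics are right, the two numbers will come out nearly equal.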
 
What do you mean by "dedup ratio front"?

VMs might be a tailored case, but I don't want to rule out superior dedupe yet. Sure, if you do a binary diff between two media files you might get very large differences, so only a tiny bit of data is identical and hence not suitable for dedup. But that does not necessarily prove anything. Why? Let us reason a bit.

Well, assume that there is always a bit that differs within the dedupe window. Say bits 0-100 are identical in both files except that they differ in one bit, so that window is not dedupable. And bits 101-200 differ in one bit, etc. This scenario makes dedup useless. But can it happen that bits 0-100 in one file are identical to bits 10,300-10,400 in the other file? Sure, bits 0-100 differ between the files, but maybe that bit pattern occurs later in another place? Granted, that is not likely, but I guess that media files have the same structure throughout, so you see the same bit patterns over and over again. Maybe there are only 100 bit patterns used over and over again? Or 10,000? Say the dedupe window is small, say 10 bytes, and you encode media with the same structure over and over again; then you should find multiple occurrences. I imagine that MP3s and docs have different structures, so they do not have much in common.

But media files, with a small window, should have much in common. That is the reason I do not rule out successful dedup of media by GreenBytes yet. I would not be surprised if they dedupe media well, besides VMs. So we need to wait, and then benchmark media vs VMs, so we can finally settle this discussion.

This is all just wrong wrong wrong. You keep saying "maybe" and trying to come up with hypotheticals which don't exist. It's blatantly clear you have no clue what you are talking about and have no clue what data a media file is actually composed of.

Media files don't have "structure" or anything like that in common. They are 99.99% random binary blob data. The other 0.01% is the structured metadata and container data.

"but maybe that bit pattern occurs later in another place?"

No, it won't; I showed you the math behind this before. Dedupe works at the block level, which is 128K at a minimum for media files. There are 1,048,574,976,000 possible 128K blocks. So even in 100TB!!! of media (100TB of, for all practical purposes, random 128K blocks), the chance of even a single 128K block occurring twice is less than 0.1%.

"Maybe there are only 100 bit patterns used over and over again? Or 10,000?"

No, there are not in any media. There are 1,048,574,976,000 128K bit patterns used over and over.

Only lossless media has any possible structure and common bits. BMP, PNG, WAV, lossless video (which nobody uses and is absolutely gigantic, think 1TB+ for a single movie) could benefit somewhat from dedupe. JPEG, MP3, H.264 etc have no structure whatsoever to the main data part of them and cannot benefit.

Compression is already very efficient and eliminates most of the common duplicate bit patterns. Deduplication only works on uncompressed data or identically compressed data.
 
@SirMaster

I agree you have a point. But I am not fully convinced yet, as I told you. The reason is that ALL ZFS users report 2-3x dedup ratios, whereas GreenBytes reports 50x. No ZFS user has ever reported 50x savings. Ever. Clearly there is some heavy magic sauce involved. Their magic obviously works on VMs; we must all agree on that.

Will the magic work on media too? I am not willing to rule that out yet, because I know nothing about their heavily patented tech. The only way to settle this matter is to wait and see, instead of posting numerous "you are wrong!" replies. Then we will see which of us is "wrong wrong wrong". It might be me. Though it would be great fun if it turned out to be you who has no clue how the GreenBytes patents work. :) But I agree you have a point, yes. You might be correct in your posts. I think your informative posts are more plausible than my guesses.
 
Go study Computer Science then. It's not a matter of convincing, it's information theory. The whole point of media formats like MPEG is to remove information. Information gets removed, entropy goes up. More entropy = more randomness = less repeated data. What does dedup need? Repeated data.

So instead of defending some obscure marketing bullshit, inform yourself.
 
How did you compute this?

I agree with his other arguments, but this number seems wrong.
There are 256^(128*1024) possible permutations of a 128K block. That is roughly 10^315000, a number with about 315,000 digits.

Note that the universe is estimated to comprise 10^80 protons.
 
How did you compute this?

Completely wrong is my guess. You can't even count all permutations of 128 _bits_, let alone store them, so storing all permutations of 128 _kilobytes_ in a measly 134PB is most definitely wrong. Which, BTW, only strengthens the point against these dedup claims, but it is still wrong.
 
How did you compute this?

Number of permutations.

n! / (n - r)!

Where n is the number of items and r is the number of possibilities for each item

n = 1024000 (number of bits in 128 kilobytes)
r = 2 (possible bits)


I guess I'm thinking about it wrong and using the wrong formula (it's been 5 years since statistics class and I'll be honest I don't use it much)

But yes, then there are even more possible 128k blocks which does strengthen the idea that compressed media won't dedupe.

Thanks for correcting me :)
 
What the...

128 kilobytes is 1048576 bits, so it's still wrong.

And finally, r^n is the number of all permutations. Back to school I say.
 
What the...

128 kilobytes is 1048576 bits, so it's still wrong.

128 kibibytes is 1048576 bits, 128 kilobytes is 1024000 bits.

Although I realize that filesystems are going to be using 128 KiB blocks and not 128 KB blocks so yes 1048576 should have been used.

Thanks.
 
Number of permutations.

n! / (n - r)!

This formula describes permutations without repetition.
We can agree that the same byte can be used multiple times in a 128 KiB block.
So r^n would be the formula to use.
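For anyone who wants to sanity-check how large that number is, a quick check of my own (using log10, since the value itself is far too big to print):

import math

n_bytes = 128 * 1024                  # 131072 bytes in a 128 KiB block
digits = n_bytes * math.log10(256)    # log10(256^131072) = log10(2^1048576)
print("256^%d has about %d decimal digits" % (n_bytes, int(digits) + 1))
# -> about 315,653 digits, in line with the "roughly 10^315000" figure quoted earlier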
 
No, 128 kibibytes is 1048576 bits, 128 kilobytes is 1024000 bits.

Computers are base2 so if you talk about bits in information theory, you are in base2. Seriously dude, go to school.

ZFS uses "1048576 bits" blocks, not "1024000 wtfever" blocks.
 
Computers are base2 so if you talk about bits in information theory, you are in base2. Seriously dude, go to school.

While the math is wrong, the validity of the argument remains.
Highly compressed video really does behave like noise; it is impossible to deduplicate.
Even lossless content behaves much like noise once a good lossless compressor has been applied.

We should not forget that lossless compression is already deduplication at a much finer granularity (the bit level).
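That last point is easy to demonstrate (a quick illustration of my own, with zlib standing in for whatever compressor or codec is actually used):

import zlib

def repeat_ratio(data, window=8):
    """How often fixed-size chunks repeat; >1 means there is something to dedup."""
    chunks = [data[i:i + window] for i in range(0, len(data) - window, window)]
    return len(chunks) / float(len(set(chunks)))

raw = b"the quick brown fox jumps over the lazy dog " * 5000
packed = zlib.compress(raw, 9)

print("raw    : %d bytes, repeat factor %.1f" % (len(raw), repeat_ratio(raw)))
print("packed : %d bytes, repeat factor %.2f" % (len(packed), repeat_ratio(packed)))
# The raw text repeats heavily; the compressed stream is unique almost
# everywhere, which is exactly why already-compressed data will not dedup.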
 
Computers are base2 so if you talk about bits in information theory, you are in base2. Seriously dude, go to school.

ZFS uses "1048576 bits" blocks, not "1024000 wtfever" blocks.

Calm down...

Yes, I did mention that the filesystems would use KiB, not KB, and that 1048576 would be the correct number of bits to use in the permutation calculation.
 