ZFS dedupe is fixed soon

Hahaha!!! :)

Here is someone who writes about 30-50% dedupe savings on photos, music, and videos, and 50-60% dedupe savings overall for all types of files. I don't know how credible this link is, though; I prefer research papers, not some random blog.
https://technet.microsoft.com/en-us/library/hh831700.aspx

But as I said, SirMaster has a point, and after reading his explanation I doubt that GreenBytes is able to compress movies well. But no one knows, maybe they can? There are people talking about 50% dedupe savings, see the link. :) How about we wait a couple of months and then continue?

Correction.

Code:
Content                                    Typical Space Savings
Documents, photos, music, videos           30-50%

Documents make all of the difference. They are highly compressible, 70-80%+ depending on the file formats and content. Keep in mind Microsoft is also talking about the corporate server world, where multiple users may have the exact same file. For example, a video demonstrating a product the company offers.

For recorded TV shows and movies where you are not making multiple direct copies, deduplication will not work well. I wager deduplication across TV episodes #1, 2, ..., 20 will not work well at all.
 
The idea behind dedup is not TV shows but mail servers with hundreds or thousands of copies of mails, or VMs in a pool or deployment situation with lots of copies.

Only in these situations does deduplication, especially the ZFS realtime variety, make sense (dedup ratio > 2).
For 10-50% more space it is not worth the effort (offline) or the needed RAM (online).
 
@smangular,
Thanks for your help in correcting the information. However, that link talks about 50-60% dedupe savings in total, an average for all files (including movies).

Anyway, I doubt that link, because SirMaster has given a plausible explanation of why you cannot dedupe media well. But I have not ruled out that GreenBytes has some trick up their sleeve. And no, TCM2's explanation is not an explanation, it is just some incomprehensible mumbling.
 
I'm not really sure why dedup and compression are different beasts.

Both are the same.

Compression in ZFS takes a block of data and removes duplicate info by putting a reference to the original location in its place.

Dedup also works on blocks of data, but it removes duplicate blocks and places a reference to them in the block metadata.

The only difference between compression and block-level dedup is where the reference pointer goes: inline or out-of-band.

That Microsoft link is not a blog, but Microsoft's documentation of their product, expected sales literature included; how else would they sell the new features of Server 2012?

For mail servers it can sometimes get some gain, but normal mail formats will usually offset the duplicate blocks in the email, unless you store email headers separately from the mail body.
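
To make the inline vs. out-of-band distinction concrete, here is a toy Python sketch of the idea (not how ZFS actually implements either feature): zlib stands in for within-block compression, and a plain dict keyed by block hashes stands in for an out-of-band dedup table.

Code:
import hashlib
import zlib

BLOCK_SIZE = 128 * 1024  # assumed recordsize, just for the example

def compress_block(block: bytes) -> bytes:
    # "Inline" savings: repeats inside this one block become back-references
    # stored inside the compressed block itself.
    return zlib.compress(block)

class DedupTable:
    """Out-of-band savings: identical blocks anywhere are stored once;
    later writes of the same block only record a reference (the hash)."""
    def __init__(self):
        self.blocks = {}   # hash -> datablock actually stored
        self.refs = []     # one entry per write: which hash the data lives under

    def write(self, block: bytes) -> None:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = block      # first copy: store the data
        self.refs.append(digest)             # every copy: store a reference

# Two "files" sharing a block: per-block compression cannot see the
# cross-file repeat, the dedup table can.
shared = b"A" * BLOCK_SIZE
table = DedupTable()
for block in (shared, b"B" * BLOCK_SIZE, shared):
    table.write(block)

print("inline zlib size of one block:", len(compress_block(shared)))
print(len(table.refs), "writes,", len(table.blocks), "unique blocks stored")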
 
I'm not really sure why dedup and compression are different beasts.
Because you never see anyone dedupe with compression algorithms. There are certain methods for dedupe and other methods for compression; if they were the same, they would not use different methods. And besides, there are numerous links that say they are different, and it is easy to get confused.



I have been thinking a bit about this. There are some misunderstandings here, that is clear: I have to repeat myself many times, etc. I post links, but they all get rejected, etc. I think the problem is that a mathematically trained person cannot rule anything out unless it is proven. There are lots of examples of people believing something is true until someone proves the opposite. If a mathematician is 99.99% convinced something is true, it is not decided yet; the remaining 0.01% chance means he cannot rule out the opposite. Because sometimes common belief is wrong.

I know mathematical reasoning confuses people, so I tend not to talk mathematically with people. When someone says "it is impossible!", I don't go on with "well, you know, when you say impossible, you mean 99.99999999999999999%, which is not certainty, so it is possible, so you are wrong". I just say "sure, you are right, it is impossible".

But when I talk to mathematicians, I immediately object: "hey, what do you mean it is impossible, have you seen a proof? Can you explain your claim further?". And that is what I am doing here. Since you are technically inclined here, I thought you also had some basic understanding of mathematical reasoning (as people were mentioning comp-sci concepts such as entropy) and of what you can know for sure or not, what is true or false, etc. I thought you would understand what I mean ("I do not rule it out, because it is not proven yet"). But that assumption was wrong of me. I learned a lesson: just because you are talking to technical people does not mean they are mathematically trained. So talk to them as ordinary people.

Mathematicians use a different kind of reasoning, which makes them hard to understand. If you are mathematically trained, you will know what I mean ("shorter proofs are more beautiful than longer ones", etc.). If you don't have the training, you will never understand what I mean. Apparently, there are a lot of people here who don't understand me (read this long thread). Which is my fault, so I just switch back to common-people talking mode so everyone understands me, as I speak the same language as the rest here. So, this means I am wrong and you all are correct: "Sure, it is impossible to dedupe media, it can be ruled out for sure, it will never happen even if we do research exclusively on this for 10000000 years and math advances together with quantum computers, spintronic transistors, etc. Even alien technology will never be able to dedupe media across several files. This. Is. Impossible. Forever is a long time, but not to common people". And everybody is happy! :)

And if someone makes a huge advancement in deduplication, with say, quantum computers, or a new algorithm, or whatever - I will only mention this to mathematically trained people "look, this only again shows you can never be sure unless something is proven". Because other people won't understand what I mean, it will be pointless to discuss scientific stuff here, that is clear (read this thread).

- What? What "never be sure on anything unless you prove it"? What do you mean? I knew all the time that you can dedupe media! It was obvious all the time, you should just have asked me. According to Entropy and Shannon it is possible, just read wikipedia and then you just have to fill in the small details. What, did you ever believe the opposite? LOL!!!
- Oh, never mind, forget it. You are right, I was stupid to say that dedupe could not handle media. Of course it can, we all see it now, with the new quantum computers and recent advancements in quantum complexity theory. I am sorry I doubted. You are correct.

End of discussion. I was wrong, and you are right, media dedupe will never, ever, going to happen. Not with quantum computers, not with alien tech, not with anything. We all agree on this.
 
I'm not really sure why dedup and compression are different beasts.

Both are the same.

Compression in ZFS takes a block of data and removes duplicate info by putting a reference to the original location in its place.

Dedup also works on blocks of data, but it removes duplicate blocks and places a reference to them in the block metadata.

The only difference between compression and block-level dedup is where the reference pointer goes: inline or out-of-band.

That Microsoft link is not a blog, but Microsoft's documentation of their product, expected sales literature included; how else would they sell the new features of Server 2012?

For mail servers it can sometimes get some gain, but normal mail formats will usually offset the duplicate blocks in the email, unless you store email headers separately from the mail body.

As I understand it:
We do use dedup (compression) at the individual file level.
There is a bit of a swing in filesystem terminology. Yes, dedup also exists at the filesystem level (which could hold a whole VM); the algorithm can dedup at this lower level. But with one caveat: when the filesystem (or VM) mostly contains already compressed (deduped) files, the ratio will NOT be good.

A simple way to picture it (not quite correct, since a filesystem is very complex in structure) is: * create 50 big files, compress each individually, and then compress all 50 compressed files together into one file;
* create 50 big files and compress them all together into one file.

Email would be a big win when doing dedup at the filesystem level.

Or please correct me :D based on computer science... algorithms/technology.
 
.....
End of discussion. I was wrong, and you are right, media dedupe is never, ever going to happen. Not with quantum computers, not with alien tech, not with anything. We all agree on this.

Okay.....

I disagree that media dedupe will never happen; it will just take a long time to achieve.
With today's technology, there is no algorithm, storage technology, or filesystem technology that can reach a good ratio.

Please remember, computer science is evolving fast now.
We will achieve dedup of filesystems or media with a good ratio once technology can provide the baseline.

This is the reason why, once you take the plunge into the software/computer industry, the only way to survive is to learn, learn, and adapt. This is just my experience.

Just my thought. Nothing is perfect and we are evolving for good or bad :p. Please pick one ....
 

End of discussion. I was wrong, and you are right, media dedupe is never, ever going to happen. Not with quantum computers, not with alien tech, not with anything. We all agree on this.


Not with current media codecs (MP3, H.264, etc.), which is what I assumed people would be storing in a real media collection today.

The data in these real existing codecs cannot ever be deduplicated with any technology, real or future hypothetical. Math just does not allow it.

Now with some new video codec that stores data differently it could be possible yes, but no such codec exists today.


You were saying that compression works only on a file while deduplication works on the whole data set. This is simply not true. Modern compression uses what's called Solid Compression, and it works on entire datasets.

http://en.wikipedia.org/wiki/Solid_compression

If you take a 1TB media library and compress it with solid compression into a single archive, how much do we save with the world's best state-of-the-art compression? It's less than 1% (try it for yourself).
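
If you want to run that experiment yourself, here is a rough Python sketch (a solid .tar.xz archive standing in for state-of-the-art solid compression; the media_dir path is just a placeholder):

Code:
import os
import tarfile

media_dir = "/path/to/media"  # placeholder: point this at your own library

def dir_size(path: str) -> int:
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Solid compression: one xz stream over the whole concatenated tar, so matches
# between different files can be exploited (within the compressor's window).
with tarfile.open("library.tar.xz", "w:xz") as archive:
    archive.add(media_dir)

original = dir_size(media_dir)
compressed = os.path.getsize("library.tar.xz")
print(f"original {original} bytes, archive {compressed} bytes, "
      f"saved {100 * (1 - compressed / original):.2f}%")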

Current compression algorithms are proven to eliminate nearly all duplicate data. So if they are unable to shrink the media by even 1%, it's clear that there is little to no duplicate data left. How could you ever believe that an algorithm could be developed that would somehow find half the data in these files to be duplicate (when the best compression cannot find any of it), so that it could achieve the 2-3x or 10x ratio you were speculating about?

You seemed to use the logic that if ZFS currently can only do 2-3x and GreenBytes can do 50x, then that would somehow apply to all data types, as if ZFS can do 1% on media, so GreenBytes could do 15-25%. But it simply isn't true; it's apples to oranges. All data types are not the same at all, not even close. You can see differences like this by looking at how something like a .doc file or a .vmdk file compresses when added to a RAR. It can compress a LOT. But stick an MP3 or H.264 video inside a RAR and there is no compression. Obviously the data is very different.

Compression and dedupe do the same things with the same techniques. Dedupe, though, has advantages like not having "decompression overhead" to access the data, which is why it's a distinct and still useful thing. Dedupe is still useful even for filesystems that use compression, like ZFS, because filesystem compression is block-level compression, not solid compression.

But seeing how even the best solid compression can't find any duplication in a collection of media is very strong evidence that a new dedupe algorithm won't be able to either. If GreenBytes really did do this, it would vastly change compression too.
 
End of discussion.
Wait, I think you're right. Your numerous references and beautiful mathematics concepts scientifically prove that it can't be impossible. So by applying hardcore maths, we can even conclude that by running dedupe recursively we can achieve ratios of 2500 or more on any content. And that's just the beginning. By leveraging diehard quantum bit-wise algorithms (there's tons of papers on those!), we could get astronomical compression. Storage vendors are so screwed! Sadly most people here lack the training to see the light.
 
so only for the non opensource zfs

Yeah, my thoughts exactly.

The ZFS used in just about everything that isn't official Oracle Solaris (even OpenIndiana) is a forked version of the pre Oracle ZFS that was open source.

Essentially this is useless for all but a tiny tiny tiny minority of ZFS users.
 
I know mathematical reasoning confuses people, so I tend not to talk mathematically with people. When someone says "it is impossible!", I don't go on with "well, you know, when you say impossible, you mean 99.99999999999999999%, which is not certainty, so it is possible, so you are wrong". I just say "sure, you are right, it is impossible".

[reaction image]


:p
 
Hahaha!!! :)

Here is someone who writes about 30-50% dedupe savings on photos, music, and videos, and 50-60% dedupe savings overall for all types of files. I don't know how credible this link is, though; I prefer research papers, not some random blog.
https://technet.microsoft.com/en-us/library/hh831700.aspx



Since you like quotes and references so much more than logical reasoning, there is a link on this site to a post "Real world data deduplication savings under Windows 2012" where you can read:
On the other side, as you could expect, the optimization gain on .avi files is near zero. Same for mp3 files. That's because the applications that write these kinds of files already eliminate redundant information, and therefore identical blocks are highly unlikely. The theory says that I should get no better results with pictures (.jpg, .jpeg) and other kinds of compressed music files. Nonetheless, having Windows 2012 deduplicate your media library or picture library will allow you to have duplicate pictures, films, or other kinds of files on the same volume without necessarily wasting more space.
 
I've always thought of deduplication to be of marginal value on a home system.

The value is - IMHO - in large organizations where hundreds or even thousands of people may have the same pdf report or slide show in their networked private drive.

Even then, traditionally with all the performance drawbacks, it's probably cheaper just to buy another 2TB drive.
 
Wait, I think you're right. Your numerous references and beautiful mathematics concepts scientifically prove that it can't be impossible. So by applying hardcore maths, we can even conclude that by running dedupe recursively we can achieve ratios of 2500 or more on any content.
You are absolutely correct. Never mind the small details of exactly how to "apply hard core maths" to achieve 2500x. It is obvious how to do that. The truth is out there, somewhere.
 
The solid compression you talk about is concatenating several files, treating them all as one file, and then applying a standard compression method to that single large file. That is not dedup.

The data in these real existing codecs cannot ever be deduplicated with any technology, real or future hypothetical. Math just does not allow it.

But in seeing how even with the best solid compression we can't find any duplication in a collection of media, it's very strong evidence that a new deduple algorithm wont be able to either.
Yeah, I think I understand what you mean. I won't ask you to clarify "it is impossible, math does not allow this (where is the proof?)" versus "there is strong evidence it is true". To a mathematician that is... just confusing. First saying "it is impossible", and then at the same time saying "it might be possible"; he would be gravely confused. But I think I understand what you mean. You mean it is impossible, even with quantum computers and quantum algorithms (researchers can now factorize numbers in polynomial time and solve the discrete logarithm problem in polynomial time on quantum computers, so RSA / Diffie-Hellman / etc. is easy peasy to break on quantum computers. Which is why the NSA is trying to build one, because quantum computers allow for things that are totally unheard of; they play in a different league compared to ordinary computers).

If there is a proof that dedupe cannot handle media well, then it is a fact. Forever and ever. It will never change. For instance, the Pythagorean theorem was true 2000 years ago, it is true today, and it will be true in the future. It is true forever. No one will ever say "hey, look, the Pythagorean theorem is false in this special case!!!", because it has been proven to be true (under some circumstances). On the other hand, look at physics: today we say that Einstein is correct, but maybe within 1000 years we will know that Einstein was not correct, that he got most of it wrong? Maybe superstring theory is the real truth? Or maybe 10000000 years later we find that superstring theory was also wrong. How do we know which is true? We don't. Likewise in chemistry (in 10000 years, much of what we know in chemistry may be obsolete), biology, medicine, economics, history, etc. Back in Aristotle's time, they believed the brain's purpose was to produce snot. In the 1970s, doctors thought lobotomy was good; didn't the inventor even get a Nobel Prize for it? In the 1980s, if you had migraines, doctors removed all your teeth to cure them. Science always changes; what is true today is not true tomorrow.

The only things that NEVER change are mathematical proofs. They are true forever. And somehow, you SirMaster know there exists a mathematical proof that says that dedupe can never handle media well, which means it is True. Quantum computers, alien tech, mathematical breakthroughs, etc. in 100000 years won't help. If it is proven, then it is proven. The almighty Protheans from Mass Effect, the ancient race with technology like magic, will never be able to dedupe media, because there exists a proof. Galactus from the Marvel Universe cannot do that. Not even God the Father himself. If it is proven, it is proven. Nothing can change that; it is written in stone. And we know there exists such a proof, because you SirMaster say so. It does not matter that you seem a bit uncertain about whether there is only strong evidence supporting your claim, or whether it is actually impossible.

You seemed to use the logic that if ZFS at current can only do 2-3x, and greenbyte can do 50x, then that it somehow would apply to all data types.
No, not at all. I said that if GreenBytes can do magic with their dedupe tech, then maybe, MAYBE, they can also handle media. I do not rule that out, there is a small chance. But thanks to you guys, I now know better. It is impossible to dedupe media, it is never going to happen. Because you possess the TRUTH, just as the Pythagorean theorem will never be false.
 
okay.....

I disagree that media dedupe will never happen; it will just take a long time to achieve.
With today's technology, there is no algorithm, storage technology, or filesystem technology that can reach a good ratio.

Please remember, computer science is evolving fast now.
We will achieve dedup of filesystems or media with a good ratio once technology can provide the baseline.

This is the reason why, once you take the plunge into the software/computer industry, the only way to survive is to learn, learn, and adapt. This is just my experience.

Just my thought. Nothing is perfect and we are evolving for good or bad :p. Please pick one ....
The point is: compression removes redundancy within a file, but there might be redundancy between files. Just because 01011010101 is close to a random bit pattern within a file, so that bit pattern cannot be compressed any further, does not mean that the same bit pattern cannot occur in other files, allowing for dedupe. Compression never looks at many files, so it can never remove redundancy between files; after compression, there might still be redundancy between files. It only looks at one file and removes as much redundancy as it can from that single data set. The more data you look at (many files instead of one), the better savings you can achieve by optimizing more.

"Oh, I see this bit pattern never repeats itself in this file so I cannot compress this bit pattern, but fortunately it occurs in the other file, so I can dedupe this bit pattern." Compression never does such a thing. Compression is like concatenating all files f1, f2, f3, ... into one large file and then applying compression that looks only at f1; then you apply compression only to f2 without using the bit patterns in f1, and so on. But dedupe will look at all files at the same time: it will scan f1, f2, f3, f4, etc. simultaneously. Therefore dedupe might find even more redundancy to remove between files than compression, which only looks at the data locally.



It is like:
- I have looked at each business procedure, and optimized every single one. They can not be optimized anymore, they are 100% efficient and slim

- Great! But.... maybe you can optimize between all the procedures? I understand that each process is optimal now, but if you see that, for instance, several processes use step A (which is optimized), maybe you somehow optimize even more? If each process use step A, can you think of another optimization? And then look at step B, etc etc?



It is quite common that when you look at a problem globally, you will see even more structure that you can use. Like, if you try to break DES cipher, and you only get one crypto message, it will not help you, because it looks totally random. But if you look at many DES messages at the same time, then maybe you will find some additional structure.

I am not saying it is 100% sure you will benefit by looking at more data, but it maybe MAYBE can help you. I do not rule that out. Of course, here in this forum, I rule it out.



EDIT: In programming, if you have a function func1() that is optimized, and then you find another function func2() that is also optimized, but they share some structure, you can often refactor and create a new function func3(), so you can delete those two functions which cannot be optimized further on their own. You now have only the one new function func3(). But if you only looked at func1() you could not optimize it further; likewise, if you only looked at func2() you could not optimize it further. Only when you look at them at the same time can you find the common structure they share. This is well known to programmers (who I KNOW reason a bit like mathematicians, so any developer who is not a beginner would totally understand what I mean without me explaining for hours and hours).
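
For instance, a toy Python illustration of that refactoring idea (the functions are hypothetical, just to show shared structure being factored out):

Code:
# Before: each function is locally "optimal", but the two share structure.
def report_disk_usage(pools):
    total = sum(p["used"] for p in pools)
    return f"disk: {total} GB used across {len(pools)} pools"

def report_ram_usage(hosts):
    total = sum(h["used"] for h in hosts)
    return f"ram: {total} GB used across {len(hosts)} hosts"

# After: the shared structure is factored out into one function. The
# duplication only becomes visible when you look at both functions together.
def report_usage(kind, unit, items):
    total = sum(i["used"] for i in items)
    return f"{kind}: {total} GB used across {len(items)} {unit}"

print(report_usage("disk", "pools", [{"used": 10}, {"used": 20}]))
print(report_usage("ram", "hosts", [{"used": 4}, {"used": 8}]))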
 
The point is: compression removes redundancy within a file, but there might be redundancy between files. Just because 01011010101 is close to a random bit pattern within a file, so that bit pattern cannot be compressed any further, does not mean that the same bit pattern cannot occur in other files, allowing for dedupe. Compression never looks at many files, so it can never remove redundancy between files; after compression, there might still be redundancy between files. It only looks at one file and removes as much redundancy as it can from that single data set. The more data you look at (many files instead of one), the better savings you can achieve by optimizing more.

"Oh, I see this bit pattern never repeats itself in this file so I cannot compress this bit pattern, but fortunately it occurs in the other file, so I can dedupe this bit pattern." Compression never does such a thing. Compression is like concatenating all files f1, f2, f3, ... into one large file and then applying compression that looks only at f1; then you apply compression only to f2 without using the bit patterns in f1, and so on. But dedupe will look at all files at the same time: it will scan f1, f2, f3, f4, etc. simultaneously. Therefore dedupe might find even more redundancy to remove between files than compression, which only looks at the data locally.

It is like:
- I have looked at each business procedure, and optimized every single one. They can not be optimized anymore, they are 100% efficient and slim

- Great! But.... maybe you can optimize between all the procedures? I understand that each process is optimal now, but if you see that, for instance, each process use step A (which is optimized), maybe you somehow optimize even more? If each process use step A, can you think of another optimization? And then look at step B, etc etc?

It is quite common that when you look at a problem globally, you will see even more structure that you can use. Like, if you try to break DES cipher, and you only get one crypto message, it will not help you, because it looks totally random. But if you look at many DES messages at the same time, then maybe you will find some additional structure.

I am not saying it is 100% sure you will benefit by looking at more data, but it maybe MAYBE can help you. I do not rule that out. Of course, here in this forum, I rule it out.



EDIT: In programming, if you have a function func1() that is optimized, and then you find another function func2() that is also optimized, but they share some structure, you can often refactor and create a new function, so you can delete these two functions which happen to be optimal on their own. You now have only one function. But if you only looked at func1() you could not optimize it further; likewise, if you only looked at func2() you could not optimize it further. Only when you look at them at the same time can you find the common structure they share. This is well known to programmers (who I KNOW reason a bit like mathematicians, so they would totally understand what I mean without me explaining for hours and hours).

You are off topic: dedup/compression (file) or dedup (filesystem)?

You can assume that, but can you show one algorithm that supports your argument?
Dedup does not work on single 0 or 1 bits, i.e. it is not atomic.

In programming (your EDIT section): no!.. you can delete those functions and make/refactor a new function.

"Common structure" in programming is vague, since we are mostly in an OO world.

Maths does not always go along with computer science.
One simple example: how do you define fuzzy logic in maths?
A complicated example: how do you define artificial human behaviour in maths?

As of today, in software/hardware, none of your assumptions works out in real life.
There is always a trade-off when current technology/science does not support the baseline.

Please read a compiler course book; it is mostly about trade-offs.

Please answer me:
are you talking about dedup at the file level or at the filesystem (or VM) level? Those are two different topics.
 
The solid compression you talk about is concatenating several files, treating them all as one file, and then applying a standard compression method to that single large file. That is not dedup.

Yeah it is. You are also forgetting that after the concatenation the compression algorithm splits the data into blocks. Then the compression algorithm starts eliminating duplicate blocks of data.

How is a list of blocks from a big string of bytes from a big concatenated file any different than the list of blocks from a big string of bytes from a filesystem list?

Is a filesystem not simply a concatenated list of data blocks? Except with some holes that are obviously ignored?

The point is: compression removes redundancy within a file, but there might be redundancy between files. Just because 01011010101 is close to a random bit pattern within a file, so that bit pattern cannot be compressed any further, does not mean that the same bit pattern cannot occur in other files, allowing for dedupe. Compression never looks at many files, so it can never remove redundancy between files; after compression, there might still be redundancy between files. It only looks at one file and removes as much redundancy as it can from that single data set. The more data you look at (many files instead of one), the better savings you can achieve by optimizing more.

"Oh, I see this bit pattern never repeats itself in this file so I cannot compress this bit pattern, but fortunately it occurs in the other file, so I can dedupe this bit pattern." Compression never does such a thing. Compression is like concatenating all files f1, f2, f3, ... into one large file and then applying compression that looks only at f1; then you apply compression only to f2 without using the bit patterns in f1, and so on. But dedupe will look at all files at the same time: it will scan f1, f2, f3, f4, etc. simultaneously. Therefore dedupe might find even more redundancy to remove between files than compression, which only looks at the data locally.

This is where you are completely mistaken and possibly where all your misconception comes from. Compression DOES do this. 7zip, rar, tar.gz etc all do solid compression. They all look across all bits from all files, not just 1 file.

If there is a similar bit string in file 1 and file 1000 the compression will eliminate it.

If I take an mp3 file, and make a copy of it and then just change some bits of data like the ID3 tag, and then put both into a compressed archive, the archive is only about the size of 1 file.

[screenshot: two copies of the same mp3 inside a .rar archive]


Here you can see 2 copies of the same mp3 in a compressed archive. One takes 2.76 MB and the other takes only 2.5 KB. The size of this .rar file is 2.76 MB, but the unpacked size is 5.52 MB.

If I take a 10 MB text file, copy its contents into another text file and replace the second half of the text with new text to make a half-different document, and then compress them together, I get a file that is 15 MB in size, because 5 MB of the second file was compressed away since it matched bits in file 1.

How is this different from deduplication?
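
A rough way to reproduce that kind of experiment in Python (lzma standing in for the solid compressor, and random bytes standing in for an already-compressed mp3; the 16-byte edit plays the role of a changed ID3 tag):

Code:
import lzma
import os

# Stand-in for an already-compressed file: random bytes barely compress.
original = os.urandom(3 * 1024 * 1024)

# "Copy" with a few bytes changed, like editing an ID3 tag.
modified = bytearray(original)
modified[:16] = b"different header"
modified = bytes(modified)

alone = len(lzma.compress(original))
together = len(lzma.compress(original + modified))  # one solid stream over both copies

print(f"one copy compressed:  {alone} bytes")
print(f"both copies together: {together} bytes")
# The second copy adds almost nothing, because the solid stream finds the long
# match against the first copy, much like block-level dedup would.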
 
Yeah it is. You are also forgetting that after the concatenation the compression algorithm splits the data into blocks. Then the compression algorithm starts eliminating duplicate blocks of data.

How is a list of blocks from a big string of bytes from a big concatenated file any different than the list of blocks from a big string of bytes from a filesystem list?

Is a filesystem not simply a concatenated list of data blocks? Except with some holes that are obviously ignored?



This is where you are completely mistaken and possibly where all your misconception comes from. Compression DOES do this. 7zip, rar, tar.gz etc all do solid compression. They all look across all bits from all files, not just 1 file.

If there is a similar bit string in file 1 and file 1000 the compression will eliminate it.

If I take an mp3 file, and make a copy of it and then just change some bits of data like the ID3 tag, and then put both into a compressed archive, the archive is only about the size of 1 file.

[screenshot: two copies of the same mp3 inside a .rar archive]


Here you can see 2 copies of the same mp3 in a compressed archive. One takes 2.76 MB and the other takes only 2.5 KB. The size of this .rar file is 2.76 MB, but the unpacked size is 5.52 MB.

If I take a 10 MB text file, copy its contents into another text file and replace the second half of the text with new text to make a half-different document, and then compress them together, I get a file that is 15 MB in size, because 5 MB of the second file was compressed away since it matched bits in file 1.

How is this different from deduplication?

Part of this misconception arises because the term compression is used to describe different things. Part of it is this type of bit-level deduplication, and part of it should really be called encoding, which deals with separating data that can be perceived (be it image, video or audio) from data that cannot, and eliminating the data that can't.

Usually when we refer to deduplication it is something done at the block level, but the concept is similar to what archivers have done for years, just optimized to run live, and thus faster.

I would imagine, too, that since a block is much larger than a bit, block-level deduplication would be significantly less effective.
 
How is this different from deduplication?

Compression works within a file or archive (which is a file as well).
If you duplicate your rar file, it will need double the space.
If you add compression at the filesystem level, it is nearly the same: the data is already compressed, and a second compression run will not result in a smaller file.

ZFS dedup works differently, for example.
If you make 1000 copies of a compressed video file, no matter on which ZFS filesystems (with dedup enabled on them), it will store only 1 file and 999 references, as dedup works pool-wide.

If you now modify a single datablock in your video file and again make 1000 copies, it will store only this single datablock again, with the rest as references.

If you save 1000 files where 50% of the content is identical (e.g. different states of video editing, or nearly identical VMs), the identical datablocks are stored once, with references for the duplicates.

ZFS dedup does not try to compress a datablock. It calculates a hash for every datablock in realtime during a write. For each datablock, it stores the data only if the hash is new; otherwise it stores a reference only.
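
As a toy model of that write path (just a sketch with an in-memory table, nothing like the real on-disk DDT; recordsize and file sizes are made up):

Code:
import hashlib
import os

BLOCK = 128 * 1024  # assumed recordsize for the example

class ToyDedupPool:
    def __init__(self):
        self.stored = {}           # hash -> datablock actually written
        self.reference_writes = 0  # writes satisfied by a reference only

    def write_file(self, data: bytes) -> None:
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            h = hashlib.sha256(block).digest()
            if h in self.stored:
                self.reference_writes += 1   # duplicate block: reference only
            else:
                self.stored[h] = block       # new hash: store the datablock

pool = ToyDedupPool()
video = os.urandom(10 * BLOCK)               # pretend compressed video file
for _ in range(1000):
    pool.write_file(video)                   # 1000 copies, blocks stored once

modified = bytearray(video)
modified[0] ^= 0xFF                          # change a single datablock
pool.write_file(bytes(modified))

print("unique datablocks stored:", len(pool.stored))         # 10 + 1
print("writes stored as references:", pool.reference_writes)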

read
http://constantin.glez.de/blog/2011/07/zfs-dedupe-or-not-dedupe
http://constantin.glez.de/blog/2010/03/opensolaris-zfs-deduplication-everything-you-need-know
 
This is where you are completely mistaken and possibly where all your misconception comes from. Compression DOES do this. 7zip, rar, tar.gz etc all do solid compression. They all look across all bits from all files, not just 1 file.

If there is a similar bit string in file 1 and file 1000 the compression will eliminate it.
No, you are wrong on this; the comparison between that kind of compression and dedupe does not hold here.

Dedup, as we are talking about it, is ONLINE. That means you have one file, and a month later another file gets deduped against it. The compression tools you talk of, 7z, rar, etc., will only compress one set of files at time T1. When you save another file a month later, at time T2, 7z will not automatically start itself, compress the new file and remove the redundancy across both files; 7z has no way of knowing that you saved a new file. So 7z will not be able to remove redundancy in the new file by comparing it to the old one, which means there might be redundancy across both files, the old and the new.

Just as I have described all along.
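
A tiny sketch of that timing point (toy Python, not any real tool): two independent compression runs at T1 and T2 cannot share anything, while a dedup index that persists between the writes stores the second copy as a reference.

Code:
import hashlib
import zlib

payload = b"the same block of data saved twice, a month apart " * 1000

# T1 and T2: two independent compression runs; neither can reference the other.
size_t1 = len(zlib.compress(payload))
size_t2 = len(zlib.compress(payload))
print("two separate archives:", size_t1 + size_t2, "bytes")

# A dedup index that persists between the writes stores the data only once.
index = {}
for when in ("T1", "T2"):
    h = hashlib.sha256(payload).digest()
    index.setdefault(h, payload)   # the second write becomes a reference
print("dedup store:", sum(len(v) for v in index.values()), "bytes plus 2 references")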
 
You are off topic: dedup/compression (file) or dedup (filesystem)?

You can assume that, but can you show one algorithm that supports your argument?
Dedup does not work on single 0 or 1 bits, i.e. it is not atomic.

In programming (your EDIT section): no!.. you can delete those functions and make/refactor a new function.

"Common structure" in programming is vague, since we are mostly in an OO world.

Maths does not always go along with computer science.
One simple example: how do you define fuzzy logic in maths?
A complicated example: how do you define artificial human behaviour in maths?

As of today, in software/hardware, none of your assumptions works out in real life.
There is always a trade-off when current technology/science does not support the baseline.

Please read a compiler course book; it is mostly about trade-offs.

Please answer me:
are you talking about dedup at the file level or at the filesystem (or VM) level? Those are two different topics.
Dedup, as I am talking about it, works across blocks. Files are made of blocks, so dedup can work across VMs or media or whatever is stored on any disk. Compression does not work across disks; dedup does, which means dedup can find redundancy between several disks, something compression can never do, since it only looks locally at part of a file, never among many files across many disks. So I can dedupe files saved months apart, which compression never does, so it can never remove redundancy between future files and old files.

BTW, I don't understand what compiler theory has to do with this (I have studied such stuff). And BTW, you can define fuzzy logic within math; several different ways exist. Artificial behaviour has nothing to do with math: find some equations that describe artificial intelligence, and then you can apply math to those equations and find some facts that never change.
 
If you duplicate your rar file, it will need double the space.

Have you read this whole thread? I feel you are missing the point of the whole discussion.

The discussion is about if compression algorithms do what dedupe does.

If I put both copies of that rar file into another rar file then it will not need double the space. So it is clear that the compression algorithm is doing deduplication-like work.

Again, the overall argument is that deduplication will save lots of space on compressed media.

I am NOT comparing filesystem-level compression to deduplication. If you read my earlier comments I already talked about how deduplication is still useful on filesystem-level compression because file-system level compression is not solid compression and thus there is still space to save.


I was merely using the fact that if you for example take 1TB or so of hundreds of media files and put them into a single solid compression archive (which does do deduplication-like things as we can see from experience) you do not save any space (as you can see from experience).

Thus if the best solid compression algorithms in the world currently cannot find any duplicate data to eliminate, how can we possibly expect some new deduplication algorithm to find any as well?

He is claiming that there *may* be a way for GreenBytes' algorithm to achieve 2-3x or even 10x deduplication on media files. For that to be true, it would have to mean that the world's best compression algorithms, with decades upon decades of research and development behind them, somehow managed to miss immense amounts of duplicate data in the files, on the order of half the files' size or more.

I simply find it unfathomably unlikely that decades and decades of compression technology have somehow missed 200% worth of compressible duplicate data in data like media libraries.
 
No, you are wrong on this; the comparison between that kind of compression and dedupe does not hold here.

Dedup, as we are talking about it, is ONLINE. That means you have one file, and a month later another file gets deduped against it. The compression tools you talk of, 7z, rar, etc., will only compress one set of files at time T1. When you save another file a month later, at time T2, 7z will not automatically start itself, compress the new file and remove the redundancy across both files; 7z has no way of knowing that you saved a new file. So 7z will not be able to remove redundancy in the new file by comparing it to the old one, which means there might be redundancy across both files, the old and the new.

Just as I have described all along.

Well, the comparison is more apt, if you picture him having just ONE 7z archive on his drive, and every time he copies a file to the drive he drops it into the open 7z window :p

Not very sleek though :p
 
No, you are wrong on this; the comparison between that kind of compression and dedupe does not hold here.

Dedup, as we are talking about it, is ONLINE. That means you have one file, and a month later another file gets deduped against it. The compression tools you talk of, 7z, rar, etc., will only compress one set of files at time T1. When you save another file a month later, at time T2, 7z will not automatically start itself, compress the new file and remove the redundancy across both files; 7z has no way of knowing that you saved a new file. So 7z will not be able to remove redundancy in the new file by comparing it to the old one, which means there might be redundancy across both files, the old and the new.

Just as I have described all along.

Not if you are saving the new file into the solid archive... If you always drag/save the new or changed file into the solid archive, it will absolutely compare it with ALL the other bits present in the archive. This is a very big part of current compression algorithms.

Yes, it's a very intensive process, but so is ZFS dedupe, if you haven't noticed... There is a reason it takes so much RAM. It has to keep a checksum of each block in memory or else it would be far too slow to be of use. If my compression program also keeps checksums of all the blocks in the archive, it can add the file to the archive (writing only the bits that don't already exist anywhere in the archive) just about as quickly as dedupe, with similar RAM requirements.
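
To get a feel for that RAM cost, here is a back-of-the-envelope calculation (assuming the commonly quoted rough figure of about 320 bytes per ZFS dedup-table entry; the pool size and recordsize are just example numbers):

Code:
# Rough ZFS dedup table (DDT) RAM estimate.
POOL_BYTES = 1 * 1024**4        # assume 1 TiB of unique data
RECORDSIZE = 128 * 1024         # assume the default 128K records
DDT_ENTRY_BYTES = 320           # commonly quoted rough size per DDT entry

blocks = POOL_BYTES // RECORDSIZE
ddt_ram = blocks * DDT_ENTRY_BYTES
print(f"{blocks:,} blocks -> about {ddt_ram / 1024**3:.1f} GiB of dedup table")
# Smaller records (e.g. 8K for VM images) multiply the block count, and the RAM, by 16x.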

You still haven't answered, though: if a solid archive of my sample 1TB of media files cannot find any bits to save, how will deduplication find any? And not just any, but a gigantic amount of them, to achieve these 200-300% savings. Where are all these duplicate bits that rar and 7z solid compression are missing, and why have they gone unnoticed through decades of compression research?
 
Not if you are saving the new file into the solid archive... If you always drag/save the new or changed file into the solid archive, it will absolutely compare it with ALL the other bits present in the archive. This is a very big part of current compression algorithms.
True, but no compression software is expected to be used like this. No one has a huge 100 TB archive on their disks into which they drag and drop every file. Nobody does it. And then you are using it for dedupe, not compression. So you are cheating. ;)


You still haven't answered, though: if a solid archive of my sample 1TB of media files cannot find any bits to save, how will deduplication find any? And not just any, but a gigantic amount of them, to achieve these 200-300% savings. Where are all these duplicate bits that rar and 7z solid compression are missing, and why have they gone unnoticed through decades of compression research?
I have answered you several times on this. After your sound and clear explanation, I no longer believe GreenBytes can dedupe media well. You convinced me; I'm with you on this, as I have said all along. However, I also said that it might be the case, maybe MAYBE, that it is possible to dedupe media well, that there is a small chance. I should not have said that.

But unless I see a proof, anything can happen; for instance, quantum computers might also be able to dedupe media. They can do remarkable things that no classical computer can do. That is proven, so we know they are much more powerful than ordinary computers on some workloads. BTW, it is believed they cannot handle NP-complete problems efficiently; there they seem just as limited as ordinary computers.

Unless there is a proof, we don't know for sure. Maybe quantum computers can do that? Or a mathematical breakthrough? Or a new algorithm? Or whatever else? But if there is a proof that it is impossible, then we can be sure, and nothing can change that. No quantum computers, no aliens, nothing, even if we do 100000000 years of research on this. What is proven is proven. Nobody and nothing can change that.

And... to a mathematician it might seem a very strong claim when someone points to Wikipedia and says "it is obvious, have you heard about entropy? On Wikipedia is your proof that it is impossible and can never ever happen! You just fill in the small details". To that, mathematicians say "Uh? Come again? I don't understand, can you connect the dots?"


EDIT: I read a paper on quantum algorithms some years ago. I vaguely remember that quantum computers can do brutal things with databases; does anyone know the details? It was something like, instead of scanning the entire database table, they know immediately where the data is. So they are wickedly fast at database operations. No classical computer can do this. Maybe things like this can help dedupe?
 