ZFS dedupe is fixed soon

Calm down...

Yes, I did mention that the filesystems would use KiB, not KB, and that 1048576 would be the correct number of bits to use in the permutation calculation.

Yeah, sorry. It's when the edits and the replies overlap.
 
Even when it is lossless, good compression output is similar to noise as well.

Well I was thinking RAW lossless like BMP, WAV, and whatever the equivalent RAW video codec you are using. These should be able to dedupe, but compression is the superior solution in these cases.
 
"lossless" is a property of compression, not of a RAW format. So saying "RAW lossless like BMP" makes no sense. :)
 
Yeah, sorry. It's when the edits and the replies overlap.

Sorry, yes I did edit that, but in my defense I did so before your other reply and I thought it was early enough.

I'm guilty of not refreshing and checking replies before I reply too, heh.

I am also guilty of editing my comments a lot. Not to correct things to make me "look right", because that's silly. But if I have a realization, or think of things that I forgot to say, or better ways to say them, I do go back and edit my comments.

Like for this one I was like, K sure, that's kilobytes so it's 1000. Then later: "I'm an idiot, we are talking about a filesystem here and that's absolutely going to use KiB".
 
"lossless" is a property of compression, not of a RAW format. So saying "RAW lossless like BMP" makes no sense. :)

I was thinking about the non-computing definition then. Like WAV is storing a lossless copy of the audio I just recorded.

I guess it would have been better to say RAW all along.
 
There are only two implementations: the OpenZFS one and the Oracle one.

FreeBSD, Linux, and illumos all base theirs on OpenZFS, and they are normally pretty in sync, though FreeBSD can lag a little bit due to their release schedule.

That is possible, but I'm sure they all have their own customizations and/or patches/commit levels unless they are working off the same commit tree.
 
To me, RAW is yet another thing. RAW is data without a header or container that specifies the order or the general "meaning" of the bits. Depending on how you order the bits, there can be different representations of "RAW": http://en.wikipedia.org/wiki/Raw_audio_format

Putting data inside a container format or strapping a header to it doesn't say anything about whether it's compressed or not, so put raw PCM audio data into a WAV and you have an uncompressed WAV with PCM audio.

I even make a distinction between compression and encoding. Compression is putting a WAV through WinRAR, for example. This is of course lossless.

Encoding (lossily) is making an MP3 out of a WAV. Encoding losslessly is making a FLAC out of a WAV.

In both cases (encoding and compressing) redundancy is removed, so the final data gets closer to random data.
 
That is possible, but I'm sure they all have their own customizations and/or patches/commit levels unless they are working off the same commit tree.

They have separate code, but they all pull from the Illumos upstream for patches as often as the volunteers have time to manage. They also push improvements developed for their own platform if they could benefit the others.
 
They have separate code, but they all pull from the Illumos upstream for patches as often as the volunteers have time to manage. They also push improvements developed for their own platform if they could benefit the others.

If they all pull patches at their own pace, at different times, saying "dedupe is fixed" doesn't necessarily mean it's fixed in whichever version the reader is using.

It would probably be more accurate to say "dedupe is fixed in Illumos upstream & should be making it to your version eventually..."
 
Robstar, did you read this thread at all?

dedup is NOT fixed in illumos or *anything yet*

oracle solaris *replaced* dedup with greenbytes dedup, and we will have to see how well it works.

When we were looking at greenbytes, it was dumped, as the claims were pure marketing, and we did not opt to use them as a storage vendor.
 
I was going to post this yesterday, before the thread headed into numbers...

ZFS dedupe currently requires complete block matches, which means it can only get better ;) Greenbytes can probably identify matches at random offsets within a block. This can greatly increase the dedupe rate (although, not for video).

Finding matches at unknown offsets is a key part of data compression. Greenbytes has probably applied the existing ZFS compression (or their own new version) into a new form that can be fed into the dedupe engine, minimizing the CPU hit and allowing a big increase in dedupe ratio.
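
A crude way to picture "matches at unknown offsets" (purely illustrative, and nothing to do with whatever Greenbytes actually patented): index every small window of one block, then look up the windows of another block in that index. A Python sketch, with the window size and sample data made up for the example:

Code:
import hashlib

WINDOW = 64  # match granularity; a real engine would use a rolling hash, not a full hash at every offset

def window_index(block):
    # Map the hash of every WINDOW-byte slice, at every byte offset, to that offset (naive and slow).
    return {hashlib.sha1(block[i:i + WINDOW]).digest(): i
            for i in range(len(block) - WINDOW + 1)}

def find_shared(block_a, block_b):
    # Yield (offset_in_a, offset_in_b) pairs where a WINDOW-byte run is byte-identical.
    index = window_index(block_a)
    for j in range(len(block_b) - WINDOW + 1):
        digest = hashlib.sha1(block_b[j:j + WINDOW]).digest()
        if digest in index:
            yield index[digest], j

# The same 200-byte payload embedded at different offsets inside two otherwise different blocks:
payload = bytes(range(200))
block_a = b"\xaa" * 37 + payload + b"\xaa" * 63
block_b = b"\xbb" * 91 + payload + b"\xbb" * 9
print(next(find_shared(block_a, block_b)))  # a shared run is found even though its offsets differ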
 
I have done offline dedup tests using 512-byte blocks, pattern matching, and offset matching. I can't get a >2.2x dedup rate with all of the above, on my workloads at least.

Yes, they do help, but in all the use cases I have tested, 64k block dedup was *good enough* at around 1.8x; 4k got up to 2.0x, and 512b got 2.1x. Adding pattern matching (almost 0 hits) and offset matching got me an extra 0.1x, to 2.2x.
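
For anyone who wants to try this at home, a rough Python sketch of the kind of offline fixed-block estimate I mean (not my actual tool; the block sizes and the file argument are just placeholders): hash every block of a file and compare total vs. unique block counts.

Code:
import hashlib
import sys

def dedup_ratio(path, block_size):
    # Fixed-block estimate: total blocks divided by unique blocks.
    total = 0
    unique = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += 1
            unique.add(hashlib.sha256(block).digest())
    return total / len(unique) if unique else 1.0

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. a VM image or any large file you want to test
    for bs in (512, 4096, 65536):  # 512b, 4k and 64k, the sizes mentioned above
        print(f"{bs:>6}-byte blocks: {dedup_ratio(path, bs):.2f}x")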
 
Go study Computer Science then. It's not a matter of convincing, it's information theory. The whole point of media formats like MPEG is to remove information. Information gets removed, entropy goes up. More entropy = more randomness = less repeated data. What does dedup need? Repeated data.

So instead of defending some obscure marketing bullshit, inform yourself.
I am sure I've studied more comp sci than you. I have a double degree, one in math and one in theoretical comp sci, studied under one of the best and most famous mathematicians in the field. If you know anything about the field, you have heard of him. You have maybe heard of Tarjan, Karp, Sudhan, etc. and the other guys. If not, you don't know much about comp sci, which makes you an uneducated "something".

Anyway, it is not some "marketing bullshit"; greenbyte does get a 50x dedupe ratio. If not, Oracle would not have bought greenbyte. And what I am saying is that I don't know how greenbyte's patents work, and neither do you. Clearly they report far better results than normal, so I do not rule out the possibility that they may have a smaller dedup window than 128kb or whatever. You know nothing about their tech, so it is quite stupid to believe their tech behaves like the current dedupe engine. Clearly their tech differs a lot. With that said, I do agree that it is probable that greenbyte cannot dedupe media well, but we cannot be sure. For instance, who says they use 128kb? No one. So why do you keep repeating that number? Again, you cannot compare apples to oranges. Back to school, and learn basic logic. It is bad logic to assume greenbyte has identical tech, with the same drawbacks, as the old dedupe engine. Who knows what they did? Maybe they solved P=NP? Not probable, but when you say they cannot dedupe media well, you need to prove it, or you cannot rule it out. And you have not presented any proofs. Mathematicians do not rule anything out unless they can prove it. Clearly, you don't know much about math.

For instance, I am not convinced that all bit patterns are equally likely. Maybe some bit patterns are more common than others in media, and other bit patterns more common in zip files, etc.? For instance, in cryptography, some bit patterns are slightly more probable than others, so we have different attacks now: differential cryptanalysis of DES, Hastad's attack on RSA, etc. It is very difficult to make all bit patterns equally likely; there are no perfect one-way functions, the holy grail of cryptography.
 
I have done offline dedup tests using 512-byte blocks, pattern matching, and offset matching. I can't get a >2.2x dedup rate with all of the above, on my workloads at least.

Yes, they do help, but in all the use cases I have tested, 64k block dedup was *good enough* at around 1.8x; 4k got up to 2.0x, and 512b got 2.1x. Adding pattern matching (almost 0 hits) and offset matching got me an extra 0.1x, to 2.2x.
Are you talking about VMs now? What numbers do you get on VMs? You do not get 50x, as greenbyte does?
 
I was going to post this yesterday, before the thread headed into numbers...

ZFS dedupe currently requires complete block matches, which means it can only get better ;) Greenbytes can probably identify matches at random offsets within a block. This can greatly increase the dedupe rate (although, not for video).

Finding matches at unknown offsets is a key part of data compression. Greenbytes has probably applied the existing ZFS compression (or their own new version) into a new form that can be fed into the dedupe engine, minimizing the CPU hit and allowing a big increase in dedupe ratio.
Yes, but we can only guess until we can try out their engine. My point is still valid: we know nothing about what they do. I agree it is not probable that they can compress media, but we don't know what they are capable of.
 
Yes, as I posted elsewhere in this thread, around 500 or so Win7 VMs and 300 Win2008R2 VMs. All from an identical base clone image.
 
Anyway, it is not some "marketing bullshit"; greenbyte does get a 50x dedupe ratio

Yeah, on 5000 nearly identical VM images, whoop-dee-doo.

All your dick-waving would impress me if you hadn't started with the media files, at which point you just proved you're full of shit. Nice try, though.

Oh, and this:

For instance, I am not convinced that all bit patterns are equally likely. Maybe some bit patterns are more common than others in media, and other bit patterns more common in zip files, etc.? For instance, in cryptography, some bit patterns are slightly more probable than others, so we have different attacks now: differential cryptanalysis of DES, Hastad's attack on RSA, etc. It is very difficult to make all bit patterns equally likely; there are no perfect one-way functions, the holy grail of cryptography.

Just. Stop.

Edit: Here, let me quote it:

So for VMs you maybe get 50x2 = 100x compression in real life. But in general situations where you don't have many identical bytes, maybe 10x? Imagine that, your storage has suddenly increased 10x. An 8 TB storage media server has suddenly become 80TB storage.
For general purpose storage, where you store a lot of different files, maybe you would get... 10x? My guess. Say you have a media server with a lot of movies. I bet the movie data is quite similar, so you can apply dedupe to great success. Likewise with MP3. Or MS Office documents. Or source code.

You just devalue your education with this bullshit and I feel sorry for the people whose names you smear.

Edit2:

Not probable, but when you say they cannot dedupe media well, you need to prove it, or you cannot rule it out. And you have not presented any proofs. Mathematicians do not rule anything out unless they can prove it. Clearly, you don't know much about math.
I was saying that they were making claims that go against all established information theory, but actually, only you are making the claim that their 50x result can somehow apply to media files or any data, for that matter.

So the burden of proof is on you, if you make claims that contradict established science.
 
Yes, as I posted elsewhere in this thread, around 500 or so Win7 VMs and 300 Win2008R2 VMs. All from an identical base clone image.
OK, so you do identical VMs, just like greenbyte. And still you get a 2-3x dedupe ratio, whereas greenbyte gets 50x. The big question is, how do they do it??

Anyone with the slightest intelligence should ask that question. If they can do that remarkable feat, then what more can they do?? No one asks that question except me? If someone solves the Riemann hypothesis, how can you be so sure he did not solve P=NP too?? How can you claim he did not solve it?? (Btw, Knuth believes that P=NP). No one thinks of the implications? Did IQ drop while I was gone?

I rule it out. They did not solve P=NP and they can not dedupe media well.
Great, since you obviously know what greenbyte can or cannot do, please enlighten us about why they cannot dedupe media. Of course greenbyte does get 50x dedupe; they would never lie about that, and Oracle would have found that out during due diligence.

There is software that can inspect data and tell what kind of data it is; obviously, media data has some type of common structure that MP3 does not have. Not all bit patterns are equally likely in media files; they have a kind of structure. Maybe you can dedupe the common structure?

I shot soda out my nose when I read that crack.
:) :) :) I do agree it is not likely they can dedupe media, but to be so sure about something no one knows, without evidence, is just stupid. It is even dumber to explain why the current dedupe cannot do it and then draw the conclusion that greenbyte also cannot do it. We know nothing about greenbyte; they might not even use 128kb.
 
Yeah, on 5000 nearly identical VM images, whoop-dee-doo.
If this is so easy, why can't the current ZFS do that?? Answer me.

Or... Are you implying that this is impossible, that greenbyte fooled the Oracle dedupe engineers?

All your dick-waving would impress me
Why would anyone be impressed by two Master's? My point is, you don't have to be a bass sole and assume people are uneducated or know nothing in this fine forum. I have gotten much help from the guys here. The level here is very high and you should respect people for giving us their valuable time and helping us dig further into this. For instance, SirMaster has provided interesting links and info.

I was saying that they were making claims that go against all established information theory, but actually, only you are making the claim that their 50x result can somehow apply to media files or any data, for that matter.

So the burden of proof is on you, if you make claims that contradict established science.
Your logic is flawed. I am asking a question about greenbyte deduping media well; you and others here deny it because you know something the rest of us don't. So it is you who needs to prove your claim. I am NOT making claims; you are.

Btw, do you believe greenbyte lies about 50x dedupe on VMs?
 
You are not asking a question, you are believing they are able to dedup media files, based on no evidence other than the fact that they dedup identical data better than the current ZFS does, which in itself doesn't contradict the established theories. Your claim does, however, because media files don't consist of identical data anywhere in themselves, because otherwise, the MPEG guys would have done a subpar job so far.

My challenge to your claim is not a claim that needs to be proven. If you can't even do basic science, your ignorance is even bigger than I thought and you should seriously just stop posting.
 
You are not asking a question, you are believing they are able to dedup media files, based on no evidence other than the fact that they dedup identical data better than the current ZFS does, which in itself doesn't contradict the established theories. Your claim does, because media files don't consist of identical data anywhere in themselves, because otherwise, the MPEG guys would have done a subpar job so far.

My challenge to your claim is not a claim that needs to be proven. If you can't even do basic science, your ignorance is even bigger than I thought and you should seriously just stop posting.
OK, this explains your problems. Your logic is truly flawed.

If someone asks something:
-I am not sure about that; maybe X might be able to do Y, don't you agree?
-No, you are wrong, they cannot do Y.
-Oh, are you sure? Can you explain how you know your claim for certain?
-No, I am not making claims, even though you are wrong. You explain your claim!
-What claim? I asked a question, which you told me was wrong. You are making the claim by telling me I am wrong.
-No, I am not making a claim here.

Question: who is making the claim here? The questioner, or the other guy?
 
But, Greenbyte reports 50x savings in real-life dedupe scenarios. I have never seen any ZFS dedupe user report 50x savings, only 2-3x. This means that Greenbyte probably has a superior ZFS dedupe engine that obviously can achieve 50x savings. So, can it be that Greenbyte also achieves 50x savings on TV media? Maybe? It is not out of the question, which you claim.

You _claim_ it's not out of the question that they achieve 50x savings on TV media when, based on all current knowledge of information theory[1], it most definitely _is_ out of the question.

You just go on and on about semantics and trying to turn your claims into questions to evade the burden of proof.

They can't dedup media files even 2x, let alone 50x. This is a claim backed by current computer science.

What you exhibit here is magical belief and I feel sorry that education has been wasted on you.

Edit:

[1] According to http://www.data-compression.com/theory.html

In his 1948 paper, "A Mathematical Theory of Communication," Claude E. Shannon formulated the theory of data compression. Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source - more specifically, the statistical nature of the source. It is possible to compress the source, in a lossless manner, with compression rate close to H. It is mathematically impossible to do better than H.
(emphasis mine)

About H:

http://www.data-compression.com/theory.html#entropy

The entropy rate of a source is a number which depends only on the statistical nature of the source.

Its unit is bits/character.

Now, as an exercise for you, run any media file you can find through a program that gives you its entropy rate, e.g. http://en.wikipedia.org/wiki/Diehard_tests and come back with your findings and whether you still believe that this can magically be ignored. To apply this to dedup as well - since dedup is just a form of compression - just concatenate several media files which you think should dedup well.
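
If Diehard is too much effort, even a naive zero-order byte entropy estimate (a much weaker stand-in for the true entropy rate H, but enough for this argument) will show you how close compressed media sits to the 8 bits/byte ceiling. A quick Python sketch; feed it whatever files you like:

Code:
import math
import sys
from collections import Counter

def byte_entropy(path):
    # Zero-order entropy in bits per byte; 8.0 means indistinguishable from random at this level.
    counts = Counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            counts.update(chunk)
            total += len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

if __name__ == "__main__":
    for path in sys.argv[1:]:  # e.g. an .mkv, an .mp3, and a .wav for comparison
        print(f"{path}: {byte_entropy(path):.3f} bits/byte")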
 
I was going to post this yesterday, before the thread headed into numbers...

ZFS dedupe currently requires complete block matches, which means it can only get better ;) Greenbytes can probably identify matches at random offsets within a block. This can greatly increase the dedupe rate (although, not for video).

Finding matches at unknown offsets is a key part of data compression. Greenbytes has probably applied the existing ZFS compression (or their own new version) into a new form that can be fed into the dedupe engine, minimizing the CPU hit and allowing a big increase in dedupe ratio.

I agree that not limiting itself to blocks should improve deduplication rates somewhat; however, it should use even more resources, not less. Ridiculous amounts of resources, in fact.
 
Which is obvious. Do the thought experiment of what would happen if you used blocks of only 1 bit. I mean, those dedup perfectly, right? Having only 2 possible values, all bits are a dupe of either 0 or 1.

The ratios you could get! :)
 
For TV recordings the best you can do is remove the commercials. Then change the format and often quality to meet your needs.

One additional idea would be to remove all of the identical intro clips at the start of series.
 
You _claim_ it's not out of the question that they achieve 50x savings on TV media
Yes, that is true. Because they do 50x on VMs. They can obviously achieve 50x dedupe ratios on certain types of files. Why can they not do the same on media? I am asking you. You seem to know their patents and can explain it to us.
Question A) What makes greenbyte tech unsuitable for media, but very suitable for VMs?

If I am asking "Maybe Elvis Presley is alive?" - this does NOT mean I claim that he is alive and need to prove it. If you claim he is not alive, then I want an explanation of how you know that, of why he is not alive. A question is not a claim. I repeat: if someone asks you something, they are not claiming something that they can prove. They are unsure and want to learn more on the matter; if you claim to know more, then you must give an explanation/proof sketch.

Question B) By the way, do you believe Greenbytes lies when they claim 50x dedupe on VMs? Is it marketing FUD?


based on all current knowledge of information theory[1], it most definitely _is_ out of the question.
You need to explain this claim. If you cannot prove it, at least give a sketch of the proof. You cannot say "according to Einstein, you are wrong" - you need to explain WHAT is wrong. Pinpoint it. If you can prove your claim, I am of course willing to say that you are correct and I am wrong. The only thing that convinces a mathematician is a clear explanation referring to different theorems and principles. If you can do that, then you are correct, and I am wrong. If you cannot do that, then you cannot claim anything.

Question C) How is it out of the question? What principles? (You need to prove that media files cannot be deduped. Maybe you could do it like this: start by explaining the WMA format, calculate the entropy of a typical WMA file, and explain why all bit patterns are equally likely, so WMA cannot be deduped. Maybe you should look into cryptography principles to show there are no more probable bit patterns in WMA than other bit patterns in WMA. You need to present bounds on the probability that bit patterns are unique; probably Chebyshev bounds might come in handy, judging from all the algorithmic research papers. And then you can examine the format of AVI, DivX, etc. and do the same calculation. You can show something like: as n -> infinity, there is a probability with upper bound bla bla that WMA bit patterns are unique. Thus we see that the probability decreases really fast, so typical WMA bit patterns are not likely to be identical. Or maybe you can try to generalize and reason about the general bit patterns of typical media codecs and analyze that. I don't know if all codecs have some structure in common you can reason about, but you seem to know this. Or, feel free to reason in any other way that proves that general media files are not dedupable. Maybe there are research papers on this? Or we can just wait until the Greenbyte dedupe engine arrives in Solaris, and then we try it, and then we can see who is correct. But of course, just because Greenbyte cannot dedupe media well, it does not mean that someone else cannot dedupe it well, unless you have some proof about why no one can dedupe media well.)


They can't dedup media files even 2x, let alone 50x. This is a claim backed by current computer science.
Is that so? What principles or theorems do you rely on? You cannot just hand-wave and say "backed by comp sci". In WHAT way is it backed? What theorems/principles support your claim? No one can take you seriously when you refer to something you do not understand how to apply, especially when you mock others about something you don't understand.

Question D) Have you seen this funny cartoon? What does it say on the blackboard? EXACTLY your reasoning. As the professor says, you need to be more explicit and stop hand-waving. It is clear that you are not mathematically schooled.
http://cafehayek.com/wp-content/uploads/2014/03/miracle_cartoon.jpg
 
Everything you asked was addressed in my last post. If you're too stupid to think, I can't help you.

If you don't understand current CompSci, I certainly won't teach it to you in an online forum.

There's no point in showing a steam engine to a monkey and trying to explain how it works, if it just smears it with feces.
 
Why can they not do the same on media?

The answer to this is so simple that I do not understand why it needs to be repeated so many times.

The definition of deduplication in computer science is to eliminate duplicate bits of data.

The reason it cannot be applied to compressed media is for the simple and fundamental reason that there are NO duplicate bits in compressed media files.

That's all there is. The end. Compressed media (because compression already eliminated all the duplicate bits) is missing the fundamental type of data that is absolutely required for any deduplication, no matter how good or bad, to take place at all.

Whether we multiply the 0 amount of duplicate data present in media files by 3x or by 50x, the resulting deduplication ratio is still 0.
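
If you want to see the arithmetic without waiting for anyone's engine, here is a toy Python sketch. Random bytes stand in for compressed media (both have essentially no repeated blocks), and the 128k block size and file counts are made up purely for the example:

Code:
import hashlib
import os

BLOCK = 128 * 1024  # assuming ZFS-style 128k records just for the illustration

def ratio(files):
    # Dedupe ratio over a list of in-memory "files": total blocks / unique blocks.
    hashes = [hashlib.sha256(data[i:i + BLOCK]).digest()
              for data in files for i in range(0, len(data), BLOCK)]
    return len(hashes) / len(set(hashes))

base = os.urandom(4 * BLOCK)  # random bytes stand in for compressed media: no repeated blocks
print("50 identical copies:", round(ratio([base] * 50), 1), "x")  # ~50x, the cloned-VM case
print("50 different files :", round(ratio([os.urandom(4 * BLOCK) for _ in range(50)]), 1), "x")  # ~1.0x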
 
But what if I save 50 copies of the same media file? Then dedup will save me, and I get great dedup results on multimedia.

But who does this in a real use case? There might be 2-3 copies when someone is working on producing a video, but not 50 identical copies.

And most of the time, those files are kept as lossless copies until the final compression is applied.

Anyone can make useless stats even more useless.
 
The answer to this is so simple that I do not understand why it needs to be repeated so many times.

The definition of deduplication in computer science is to eliminate duplicate bits of data.

The reason it cannot be applied to compressed media is for the simple and fundamental reason that there are NO duplicate bits in compressed media files.
Yes, of course I understand this. This is not what I have been asking all this time. The question is, are you really sure that all codecs remove 100% of all redundancy? I am not. This is the GOAL of all codecs, but do they really achieve this in practice? Are you sure? I am not. How many times must I repeat this? Am I unclear, or what?

Again, let me try to explain this in more detail; I thought I was clear, but apparently not. I'll try again.

Look at different cryptographic protocols: their GOAL is to achieve total randomness among all bit patterns. So all bit patterns should be equally likely, i.e. true randomness, meaning you cannot cryptanalyse them. But we all know that is not doable; they ARE attackable. Also, many cryptographic protocols rely on NP-hard problems in one way or the other. And we are not even sure about P=NP. Donald Knuth actually believes that P=NP; he does not believe P != NP. Maybe you have heard about Donald Knuth?

So let's look at DES. Ideally, each bit pattern should be indistinguishable from pure randomness. But DES does not succeed in this; in fact, no cryptographic protocol succeeds in this. So there IS structure in DES, and structure in ALL protocols. So you CAN attack ALL cryptographic protocols, in one way or the other, because they are not truly random, even if mathematicians try very hard to make them look like pure randomness.

Knowing this, how the heck can ANYONE who claims to be versed in comp sci say that media codecs output true randomness??? I mean, cryptographers try really, really, really hard to output true randomness, but do not succeed. And then, by chance, some media codecs output true randomness, purely by coincidence, or by design? Why do they succeed, when cryptographers do not?

So, you need to show that media codecs output true randomness (good luck with that). You can begin simply, say, by demonstrating that WMA or AVI or some other format only outputs true randomness. That there are NO structures in WMA. And how do you do that? Maybe by probabilistic arguments, etc., as I outlined above. That is only a suggestion; you can probably use techniques from cryptography.

But the big question is, can ANYONE prove that the DES / RSA / Diffie-Hellman / El-Gamal / Schnorr etc. protocols are safe - i.e. that they are TRULY random? No. No one can prove this. We don't even know how to begin, because this is an open research problem. So how the HECK does anyone here claim that media codecs are truly random??? NO ONE HAS EVER PROVED THIS REGARDING CRYPTOGRAPHIC PROTOCOLS! And if someone here does have the answer for how to prove this for media codecs, they have solved a big question in cryptography. We can scrap RSA / Diffie-Hellman / etc. and use this truly random codec instead, which is un-attackable and has no structure at all. I bet you will receive a lot of awards in math, and will be recruited by the NSA, etc.

To me this line of thought is obvious, and I thought everyone here knew this. Maybe I should have explained my question in more detail immediately, so we would not have this pseudo-science debate with people claiming to be versed in comp sci and being impressed by two Master's.
 
Everything you asked was addressed in my last post. If you're too stupid to think, I can't help you.

If you don't understand current CompSci, I certainly won't teach it to you in an online forum.

There's no point in showing a steam engine to a monkey and trying to explain how it works, if it just smears it with feces.
Look, if someone asks:
-I wonder if Putin planned the Ukraine incident all along?
-No, he did not. I know that.
-Oh? How can you know that? Can you explain to us why you claim that?
-I am not making a claim, it is you making a claim!
-Eh? I am asking; I do not rule out that Putin planned it all along. I am not making a claim.
-Yes you do, you are making a claim! You are not ruling out that Putin planned it! So you must prove that Putin planned it all along.
-Prove what claim? I am ASKING something, meaning I do not know. You say you are certain that Putin did NOT plan this. It is you who makes the claim and has the burden of proof. Are you claiming you don't understand this?

If someone publishes a research paper asking a question, where he is speculating about something, he is NOT making a claim. If someone says the question has a negative answer, then he makes a claim and needs to prove it, or support it with some reasoning. It does not do to say "according to math, this paper is wrong". This is basic logic and proof theory. If you don't understand this, I claim that you are not good at proving any theorems. How can I claim this? Well, you display a lack of understanding of logic, and proving theorems requires logic. And you clearly don't understand logic. Ergo, you are not successful at proving theorems.

BTW, you talked about Greenbyte's "obscure marketing bullshit" when they claim a 50x dedupe ratio on VMs. Do you mean that Greenbyte lies about this? That it is just marketing bullshit, with no bearing on reality? It's a big marketing lie?
 
Robstar, did you read this thread at all?

dedup is NOT fixed in illumos or *anything yet*

oracle solaris *replaced* dedup with greenbytes dedup, and we will have to see how well it works.

When we were looking at greenbytes, it was dumped, as the claims were pure marketing, and we did not opt to use them as a storage vendor.

My responses are to the quoted posts, NOT the rest of the thread.
 
I was somehow under the impression that Greenbyte had made claims regarding media files, until I realized that this BS was coming solely from you. So disregard the "marketing bullshit".

So for VMs you maybe get 50x2 = 100x compression in real life. But in general situations where you don't have many identical bytes, maybe 10x? Imagine that, your storage has suddenly increased 10x. An 8 TB storage media server has suddenly become 80TB storage.
For general purpose storage, where you store a lot of different files, maybe you would get... 10x? My guess. Say you have a media server with a lot of movies. I bet the movie data is quite similar, so you can apply dedupe to great success. Likewise with MP3. Or MS Office documents. Or source code.

It violates current CompSci because media files, while not 100% random, are still pretty much incompressible. How the hell could they be 100% random anyway and what does it matter? Structured data can still be incompressible. No one said media files were random data; they are _like_ random data.

You have been told this multiple times now. I was even bored enough to give you a link about the theory and why it can't work. Yet you keep coming back with pages and pages of totally unrelated bullshit and how we have to present even more proof until you finally grasp it? No thanks.

Like I said, it's futile trying to explain a steam engine to a monkey. Some people get it and some don't. You can't educate everyone if the mental capacity simply isn't there. This might frustrate you, but it's another hard fact you'll have to live with.

PS: Did you even run some media files through the Diehard test suite? If yes, what did you find? If not, why not?
 
The question is, are you really sure that all codecs remove 100% of all redundancy? I am not. This is the GOAL of all codecs, but do they really achieve this in practice? Are you sure? I am not. How many times must I repeat this? Am I unclear, or what?

Yes, I am really 100% sure that they do. And by "remove all" I mean they remove 99+% of it. There is so little left that even if the best deduplication removes everything that remains, you are only going to be saving a couple GB on a multi-TB library, so much less than 1%.

It doesn't matter what you believe. The fact is they do and you don't have to wait for greenbytes to see that they have no duplicate data left.

You can simply read the computer science papers on modern media compression codecs. There are hundreds of them and they prove through math and through analysis that the duplicate data is eliminated.

I cannot count how many times you have simply said "go read the papers on ZFS and data corruption" as the entire basis and proof of your argument and claims about how good ZFS is at preventing data corruption.

So now it is only fair, and it's my turn to simply say "go read the H.264 and MP3 papers", and you will see for yourself how their data has no structure and how they leave no duplicate data.
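
Or skip the papers and just try it yourself: run an already-compressed file through a general-purpose compressor and the output barely shrinks, which is the practical symptom of there being no redundancy left for dedupe to work with. A quick Python sketch (the example file names are made up):

Code:
import sys
import zlib

# Ratio of original size to zlib-compressed size; ~1.0x means there is nothing left to squeeze.
for path in sys.argv[1:]:  # e.g. song.mp3 movie.mkv photo.jpg
    with open(path, "rb") as f:
        data = f.read()
    packed = zlib.compress(data, 9)
    print(f"{path}: {len(data) / len(packed):.3f}x")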
 
If this guy actually has a CS degree, I'm embarrassed for whichever institution it was (maybe Dunkin Donuts offers CS degrees with a bacon cheddar wrap and a medium coffee?)
 
I read all four pages... I feel like I deserve something... a cookie maybe?



And that's really a claim.
 
I was somehow under the impression that Greenbyte had made claims regarding media files, until I realized that this BS was coming solely from you. So disregard the "marketing bullshit".
OK, so you were wrong about this too. Good that you admit it.

It violates current CompSci because media files, while not 100% random, are still pretty much incompressible. How the hell could they be 100% random anyway and what does it matter?
"What does it matter"?? Be-jesus. You make references to information theory and entropy. Have you studied algorithmic information theory? No you havent, thats for sure. Anyway, entropy, which you make heavy references to, is a measure of randomness. The higher the entropy, the more randomness. You said it yourself, "entropy goes up" bla bla, which means it becomes more and more random. And when you speak of randomness (yes you have done it, because you said we should learn about entropy) I talk about randomness too, which you just dismiss. Have you understood what you are saying? Have you read your links?

You have been told this multiple times now. I was even bored enough to give you a link about the theory and why it can't work.
No, you didn't. You gave a link to something you don't understand, and you claim the answer is in the link. Well, comp sci is not like that. You need to explain WHY your claim is true. You cannot just say "according to this link". You need to connect all the dots. THAT is the scientific method.

PS: Did you even run some media files through the Diehard test suite? If yes, what did you find? If not, why not?
No, I haven't. And as I explained to you earlier, this does not prove anything.
 