Calm down...
Yes, I did mention that the filesystems would use KiB, not KB, and that 1048576 would be the correct number of bits to use in the permutation calculation.
Even when compression is lossless, the output of a good lossless compressor looks similar to noise as well.
Yeah, sorry. It's when the edits and the replies overlap.
"lossless" is a property of compression, not of a RAW format. So saying "RAW lossless like BMP" makes no sense.
There are only two implementations: the OpenZFS one and the Oracle one.
FreeBSD, Linux, and illumos all base theirs on OpenZFS, and they are normally pretty well in sync, though FreeBSD can lag a little due to its release schedule.
That is possible, but I'm sure they all have their own customizations and/or patch/commit levels unless they are working off the same commit tree.
They have separate code, but they all pull from the Illumos upstream for patches as often as the volunteers have time to manage. They also push improvements developed for their own platform if they could benefit the others.
I am sure I've studied more comp sci than you; I have a double degree, one in math and one in theoretical comp sci, studied under one of the best and most famous mathematicians in the field. If you know anything about the field, you have heard of him. You have maybe heard of Tarjan, Karp, Sudan, etc. and the other guys. If not, you don't know much about comp sci, which makes you an uneducated "something".

Go study Computer Science then. It's not a matter of convincing, it's information theory. The whole point of media formats like MPEG is to remove information. Information gets removed, entropy goes up. More entropy = more randomness = less repeated data. What does dedup need? Repeated data.
So instead of defending some obscure marketing bullshit, inform yourself.
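For what it's worth, the entropy claim is easy to check on real files. Here is a minimal sketch (plain Python, standard library only; the file paths in the comments are hypothetical) that measures zeroth-order byte entropy in bits per byte: compressed media typically lands near the 8.0 maximum, while text or raw disk images land much lower.

```python
import math
from collections import Counter

def byte_entropy(path: str) -> float:
    """Zeroth-order Shannon entropy of a file, in bits per byte (max 8.0)."""
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Paths are hypothetical:
# print(byte_entropy("movie.mp4"))   # compressed media: ~7.99 bits/byte
# print(byte_entropy("report.txt"))  # plain text: often 4-5 bits/byte
```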
Are you talking about VMs now? What numbers do you get on VMs? You do not get 50x, as Greenbytes gets?

I have done offline dedup tests, using 512-byte blocks, pattern matching, and offset matching. I can't get a >2.2x dedup ratio with all of the above, on my workloads at least.
Yes, they do help, but in all the use cases I have tested, 64k-block dedup was *good enough* at around 1.8x; 4k got up to 2.0x, and 512b got 2.1x. Adding pattern matching (almost 0 hits) and offset matching got me an extra 0.1x, to 2.2x.
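For reference, an offline test of this kind fits in a few lines. A minimal sketch (fixed-size aligned blocks hashed with SHA-256, no pattern or offset matching; the block size and file path are placeholders):

```python
import hashlib

def dedup_ratio(path: str, block_size: int = 512) -> float:
    """Ratio of total blocks to unique blocks, for fixed-size aligned blocks."""
    seen = set()
    total = 0
    with open(path, "rb") as f:
        while block := f.read(block_size):
            total += 1
            seen.add(hashlib.sha256(block).digest())
    return total / len(seen) if seen else 1.0

# Path is hypothetical:
# print(dedup_ratio("vm-image.raw", 512))  # e.g. ~2.1x on workloads like the above
```

Smaller blocks find more duplicates but cost proportionally more metadata, which is consistent with the 1.8x/2.0x/2.1x spread reported above.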
Yes, but we can only guess until we can try out their engine. But my point is still valid: we know nothing about what they do. I agree it is not probable that they can compress media, but we don't know what they are capable of.
Who knows what they did? Maybe they solved P=NP?
Not probable, but when you say they cannot dedupe media well, you need to prove it, or you cannot rule it out.
Anyway, it is not some "marketing bullshit"; Greenbytes does get a 50x dedupe ratio.
For instance, I am not convinced that all bit patterns are equally likely. Maybe some bit patterns are more common than others in media, and other bit patterns more common in zip files, etc.? For instance, in cryptography, some bit patterns are slightly more probable than others, which is why we now have different attacks: differential cryptanalysis of DES, Hastad's attack on RSA, etc. It is very difficult to make all bit patterns equally likely; there are no perfect one-way functions, the holy grail of cryptography.
So for VMs you maybe get 50x * 2x = 100x data reduction in real life. But in general situations where you don't have many identical bytes, maybe 10x? Imagine that: your storage has suddenly grown 10x. An 8 TB media server has suddenly become 80 TB of storage.
For general-purpose storage, where you store lots of different files, maybe you would get... 10x? My guess. Say you have a media server with lots of movies. I bet the movie data is quite similar, so you can apply dedupe to great success. Likewise with MP3s. Or MS Office documents. Or source code.
I was saying that they were making claims that go against all established information theory, but actually, only you are making the claim that their 50x result can somehow apply to media files, or any data for that matter.

Not probable, but when you say they cannot dedupe media well, you need to prove it, or you cannot rule it out. And you have not presented any proofs. Mathematicians do not rule anything out unless they can prove it. Clearly, you don't know much about math.
Ok, so you do identical VMs, just like Greenbytes, and still you get a 2-3x dedupe ratio, whereas Greenbytes gets 50x. The big question is: how do they do it??

Yes, as I posted elsewhere in this thread: around 500 or so Win7 VMs and 300 Win2008R2 VMs, all from an identical base clone image.
Great, since you obviously know what Greenbytes can or cannot do, please enlighten us about why they cannot dedupe media. Of course Greenbytes does get 50x dedupe; they would never lie about that, and Oracle would have found it out during due diligence.

I rule it out. They did not solve P=NP, and they cannot dedupe media well.
I do agree it is not likely they can dedupe media, but to be so sure about something no one knows, without evidence, is just stupid. It is even dumber to explain why current dedupe cannot do it and draw the conclusion that Greenbytes also cannot do it. We know nothing about Greenbytes; they might not even use 128 KB blocks.

I shot soda out my nose when I read that crack.
If this is so easy, why can't current ZFS do that?? Answer me.

Yeah, on 5000 nearly identical VM images, whoop-dee-doo.
Why would anyone be impressed with two master's degrees? My point is, you don't have to be a bass sole and assume people are uneducated or know nothing in this fine forum. I have gotten much help from these guys here. The level here is very high, and you should respect people for giving us their valuable time and helping us dig further into this. For instance, SirMaster has provided interesting links and info.

All your dick-waving would impress me
Your logic is flawed. I am asking a question about Greenbytes deduping media well; you and others here deny it because you know something the rest of us don't. So it is you who needs to prove your claim. I am NOT making claims; you are.
So the burden of proof is on you, if you make claims that contradict established science.
OK, this explains your problems. Your logic is truly flawed.

You are not asking a question; you are believing they are able to dedup media files, based on no evidence other than the fact that they dedup identical data better than current ZFS does, which in itself doesn't contradict the established theories. Your claim does, because media files don't contain identical data anywhere in themselves; otherwise, the MPEG guys would have done a subpar job so far.
My challenge to your claim is not a claim that needs to be proven. If you can't even do basic science, your ignorance is even bigger than I thought and you should seriously just stop posting.
But Greenbytes reports 50x savings in real-life dedupe scenarios. I have never seen any ZFS dedupe user report 50x savings, only 2-3x. This means that Greenbytes probably has a superior ZFS dedupe engine, one that obviously can achieve 50x savings. So, can it be that Greenbytes also achieves 50x savings on TV media? Maybe? It is not out of the question, contrary to what you claim.
(emphasis mine)

In his 1948 paper, "A Mathematical Theory of Communication," Claude E. Shannon formulated the theory of data compression. Shannon established that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. The exact value of H depends on the information source (more specifically, on the statistical nature of the source). It is possible to compress the source, in a lossless manner, with compression rate close to H. It is mathematically impossible to do better than H.
The entropy rate of a source is a number which depends only on the statistical nature of the source.
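For reference, the H in the quote above is the familiar Shannon entropy in the memoryless case, generalized to an entropy rate for sources with memory. A sketch of the two definitions:

```latex
% Shannon entropy of a memoryless source X (bits per symbol):
H(X) = -\sum_{x} p(x) \log_2 p(x)

% Entropy rate of a stationary source, taking longer and longer blocks:
H = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)
```

A compressed media file already sits close to its entropy rate, which is exactly why neither further compression nor dedup finds much left to remove.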
I was going to post this yesterday, before the thread headed into numbers...
ZFS dedupe currently requires complete block matches, which means it can only get better. Greenbytes can probably identify matches at random offsets within a block. This can greatly increase the dedupe rate (although not for video).
Finding matches at unknown offsets is a key part of data compression. Greenbytes has probably applied the existing ZFS compression (or their own new version) into a new form that can be fed into the dedupe engine, minimizing the CPU hit and allowing a big increase in dedupe ratio.
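To make "matches at unknown offsets" concrete: the standard tool is a rolling hash, as used by rsync and by content-defined chunking. Below is a minimal Rabin-Karp-style sketch; it illustrates the general technique only, not whatever Greenbytes actually ships.

```python
def rolling_hash_matches(haystack: bytes, needle: bytes, base: int = 257,
                         mod: int = (1 << 61) - 1) -> list:
    """Find every offset of `needle` in `haystack` with O(1) work per byte."""
    n = len(needle)
    if n == 0 or n > len(haystack):
        return []
    target = window = 0
    for i in range(n):  # hash the needle and the first window
        target = (target * base + needle[i]) % mod
        window = (window * base + haystack[i]) % mod
    top = pow(base, n - 1, mod)  # weight of the byte leaving the window
    matches = []
    for i in range(len(haystack) - n + 1):
        if window == target and haystack[i:i + n] == needle:
            matches.append(i)  # compare bytes to rule out hash collisions
        if i + n < len(haystack):
            # Slide one byte: drop haystack[i], pull in haystack[i + n].
            window = ((window - haystack[i] * top) * base + haystack[i + n]) % mod
    return matches

# print(rolling_hash_matches(b"abcabcabc", b"cab"))  # -> [2, 5]
```

The point is the cost model: updating the hash when the window slides by one byte is constant time, so scanning every offset is cheap enough to be practical.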
Yes, that is true. Because they do 50x on VMs, they can obviously achieve 50x dedupe ratios on certain types of files. Why can they not do the same on media? I am asking you. You seem to know their patents and can explain them to us.

You _claim_ it's not out of the question that they achieve 50x savings on TV media; based on all current knowledge of information theory[1], it most definitely _is_ out of the question.

You need to explain this claim. If you cannot prove it, at least give a sketch of the proof. You cannot just say "according to Einstein, you are wrong" - you need to explain WHAT is wrong. Pinpoint it. If you can prove your claim, I am of course willing to say that you are correct and I am wrong. The only thing that convinces a mathematician is a clear explanation referring to specific theorems and principles. If you can do that, then you are correct and I am wrong. If you cannot do that, then you cannot claim anything.
Is that so? What principles or theorems do you rely on? You cannot just hand-wave and say "backed by comp sci". In WHAT way is it backed? What theorems/principles support your claim? No one can take you seriously when you refer to something you do not understand how to apply, especially when you mock others about something you don't understand.

They can't dedup media files even 2x, let alone 50x. This is a claim backed by current computer science.
Why can not they do the same on media?
Yes, of course I understand this. This is not what I have been asking all along. The question is: are you really sure that all codecs remove 100% of all redundancy? I am not. This is the GOAL of all codecs, but do they really achieve it in practice? Are you sure? I am not. How many times must I repeat this? Am I unclear, or what?

The answer to this is so simple that I do not understand why it needs to be repeated so many times.
The definition of deduplication in computer science is to eliminate duplicate bits of data.
It cannot be applied to compressed media for the simple and fundamental reason that there are NO duplicate bits in compressed media files.
Look, if someone asks:

Everything you asked was addressed in my last post. If you're too stupid to think, I can't help you.
If you don't understand current CompSci, I certainly won't teach it to you in an online forum.
There's no point in showing a steam engine to a monkey and trying to explain how it works, if it just smears it with feces.
Robstar, did you read this thread at all?
Dedup is NOT fixed in illumos or *anything* yet.
Oracle Solaris *replaced* its dedup with the Greenbytes dedup, and we will have to see how well it works.
When we were looking at Greenbytes, they were dropped, as their claims were pure marketing, and we did not opt to use them as a storage vendor.
Ok, so you were wrong about this too. Fine that you admit it.

I was somehow under the impression that Greenbytes had made claims regarding media files, until I realized that this BS was coming solely from you. So disregard the "marketing bullshit".
"What does it matter"?? Be-jesus. You make references to information theory and entropy. Have you studied algorithmic information theory? No you havent, thats for sure. Anyway, entropy, which you make heavy references to, is a measure of randomness. The higher the entropy, the more randomness. You said it yourself, "entropy goes up" bla bla, which means it becomes more and more random. And when you speak of randomness (yes you have done it, because you said we should learn about entropy) I talk about randomness too, which you just dismiss. Have you understood what you are saying? Have you read your links?It violates current CompSci because media files, while not 100% random, are still pretty much incompressible. How the hell could they be 100% random anyway and what does it matter?
No, you didn't. You gave a link to something you don't understand, and you claim the answer is in the link. Well, comp sci is not like that. You need to explain WHY your claim is true. You cannot just say "according to this link". You need to connect all the dots. THAT is the scientific method.

You have been told this multiple times now. I was even bored enough to give you a link about the theory and why it can't work.
No, I haven't. And as I explained to you earlier, this would not prove anything.

PS: Did you even run some media files through the Diehard test suite? If yes, what did you find? If not, why not?
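For anyone who wants a quick stand-in without installing the full Diehard suite, here is a minimal sketch of a chi-squared test of byte uniformity (using scipy.stats.chisquare; the file path is a placeholder). Compressed media should come out statistically close to uniform, while structured files fail the test badly. Note this is far weaker than Diehard, which tests much more than the byte histogram.

```python
from scipy.stats import chisquare  # pip install scipy

def byte_uniformity(path: str):
    """Chi-squared test of a file's byte histogram against a uniform distribution."""
    counts = [0] * 256
    with open(path, "rb") as f:
        for b in f.read():
            counts[b] += 1
    # Null hypothesis: all 256 byte values are equally likely.
    return chisquare(counts)

# Path is hypothetical:
# stat, p = byte_uniformity("movie.mp4")
# print(stat, p)  # a tiny p-value means the bytes are definitely not uniform
```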