38 views

Skip to first unread message

Feb 15, 2010, 1:26:04 AM2/15/10

to

Dear all,

I am searching for results about statistical characteristics of hash

functions (I will be more precise in a moment) and I was wondering if

you have some reference to suggest. Even few googling keywords would

be fine; I tried with some obvious choices, but the resulting signal/

noise ratio is very low (e.g., the first page of results of "MD5

statistical" has links to statistical data "signed" with MD5). Note

that since I would not use the hash function against a "malicious"

adversary, but just as a double check of the correctness of a result

(let me skip all the details here, it would be too long), even "old"

and "fragile" functions could be fine, as long as some results are

known.

Let me state more precisely what I mean with "statistical

characteristics". Some questions I am interested to find an answer

are

1) Is the hash function surjective? If it is not, which fraction of

the codomine is covered? [I expect a good hash function to be

surjective or "almost" surjective, but a quantitative result would be

useful]

2) Suppose the message length fixed. How the anti-image of a hash

value is "distributed" over the domain? I expect it to be "spread

out" uniformly all over the domain; for example I expect that if S is

a subset of the domain with |S| elements and h = H(x) is the b-bit

hash of x, then the intersection of S with H^-1(h) has

"approximately" (in some sense to be specified) |S|/2^b elements.

3) What is the minimum Hamming distance between two elements of

H^-1(h)?

I agree that these questions could look a bit vague, but any result

along these lines could be of interest to me.

Thank you for any help

Feb 17, 2010, 12:49:08 AM2/17/10

to

On 15 Feb 2010 06:26:04 GMT, mockturtle <frame...@gmail.com> wrote:

[snip]

> Let me state more precisely what I mean with "statistical

> characteristics". Some questions I am interested to find an answer

> are

>

> 1) Is the hash function surjective? If it is not, which fraction of

> the codomine is covered? [I expect a good hash function to be

> surjective or "almost" surjective, but a quantitative result would be

> useful]

>

> 2) Suppose the message length fixed. How the anti-image of a hash

> value is "distributed" over the domain? I expect it to be "spread

> out" uniformly all over the domain; for example I expect that if S is

> a subset of the domain with |S| elements and h = H(x) is the b-bit

> hash of x, then the intersection of S with H^-1(h) has

> "approximately" (in some sense to be specified) |S|/2^b elements.

>

> 3) What is the minimum Hamming distance between two elements of

> H^-1(h)?

>

> I agree that these questions could look a bit vague, but any result

> along these lines could be of interest to me.

>

> Thank you for any help

If you're talking about cryptographic hash functions, such as MD5

or the SHA family, the best basis from which to address these

questions is the fact that these hash functions aspire to mimick

a "random oracle", and (as a practical matter) come very close to

succeeding. When you ask a random oracle to produce a 160-bit hash

of some particular input, she looks in her Catalog of Previously

Hashed Inputs. If your input value is there, she returns whatever

she returned the last time that input value occurred; otherwise,

she flips a coin 160 times to construct a new value, returns that

value, and enters it into her Catalog of Previously Hashed Inputs.

The fact that some of these hash functions have been broken

has essentially nothing to do with any general statistical

properties. Naturally, it means that you *can* construct

statistical tests that can distinguish (say) MD5 from a random

oracle, but much intelligence and effort is required to construct

such a test; you wouldn't stumble upon one by accident. (OK, if

somebody's going to get picky, we'll have to explicity exclude

tests like "compare the oracle's output with MD5"...)

Accordingly, I recommend that you answer your statistical

questions based on the random oracle model. Addressing the

three questions you listed above would be a straightforward

exercise in statistics. If the resulting predictions fail,

you have a publishable result that would generate excitement

in the cryptology community.

--

To email me, substitute nowhere->spamcop, invalid->net.

Feb 17, 2010, 12:49:31 AM2/17/10

to

According to mockturtle <frame...@gmail.com>:

> I am searching for results about statistical characteristics of hash

> functions

A cryptographically secure hash function is expected to be

indistinguishable from a random oracle. A random oracle is modelled as a

black box with memory: for any given input message, either it was

previously invoked with the very same input message, in which case it

yields the same result than previously, or this is a new message, and

the oracle then selects the output randomly in its output domain, with

uniform probability.

If the function can be distinguished from a random oracle then the

hash function is deemed "weak" and cryptographers refuse to use it.

Corollarily, any "strong" hash function is a hash function for which

no such distinguisher is currently known. This yields some trivial

answers to your questions:

> 1) Is the hash function surjective? If it is not, which fraction of

> the codomine is covered?

A hash function is supposed to be surjective. But it is also expected

that surjectivity cannot be easily proven; such a proof would require

some knowledge on the hash function structure, something quite close to

a characteristic distinguishing the hash function from a random oracle.

> 2) Suppose the message length fixed. How the anti-image of a hash

> value is "distributed" over the domain?

Then again, distribution should be uniform. If the messages have length

n bits and the output has length b bits, then it is expected that an

average of 2^(n-b) input messages yield any given b-bit output.

> 3) What is the minimum Hamming distance between two elements of

> H^-1(h)?

The minimum distance is 1. The only restriction on the random oracle is

to yield two distinct outputs for two distinct inputs. Since hash

fucntions accepts inputs much longer than their output (e.g. SHA-256

accepts input strings of length 0 to 18446744073709551615 bits, while

outputting 256-bit values), then it is highly improbable that there

exists even a single hash output for which there would not be two

preimages differing by a single bit.

Then again, you would be hard pressed to prove it, and if you could

prove it for a given hash function, then that function would be looked

at with some level of defiance. Strong cryptographic hash functions

should not allow such proofs to be achievable.

--Thomas Pornin

Feb 17, 2010, 1:00:22 AM2/17/10

to

Peter Pearson <ppea...@nowhere.invalid> writes:

> If you're talking about cryptographic hash functions, such as MD5

> or the SHA family, the best basis from which to address these

> questions is the fact that these hash functions aspire to mimick

> a "random oracle",

That's for fixed length inputs of course. Otherwise there is a

well-known length extension property.

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu