Talk:Hash function

From Wikipedia, the free encyclopedia

Contents

[edit] robust noisey room hashing

"Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room." - very interesting, can we have a source - william sharkey

[edit] MD5 and SHA-1 weaknesses

This article should document the issues, raised last summer and sharpened last month, with MD5 and some other widely used hash functions. I'm putting a note that the algorithm has issues on this page and the MD5 page. If anyone wants to document this properly, look at:

(I won't write this up properly: not my field and I have very limited interest in the topic) ---- Charles Stewart 15:20, 14 Mar 2005 (UTC)

I confused this page with Cryptographic hash function: sorry, no issue ---- Charles Stewart 15:26, 14 Mar 2005 (UTC)

[edit] Who is Jenkins?

Quote: "See the external link "Hash Functions for Hash Table Lookup" below for a comparison of speed and quality of various hash functions, including Pearson Hashing and some fast, high quality hash functions by Jenkins himself." It would be informative to know who on earth this Jenkins is, before he is alluded to as 'the big man himself'. :) --Egregius 23:11, 29 Mar 2005 (UTC)

He published a hash algorithm in Dr. Dobbs. I'm not aware of any thing on the topic that he has published in any more formal on the topic, nor have I found any articles that examine his hash formally. I would have expected that if it had been formally evaluated it would have been mentioned on his web page. Nahaj 17:48, 1 July 2006 (UTC)

[edit] Algorithms for Audio Identification

Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences.

This sentence is way too vague. Could someone please elucidate which algorithms and give some perspective on how they work? I assume that one can bring up a more broad topic that is "algorithms for analog signals identification", too, but I really have no knowledge about the topic. Poli 10:38, 12 Jun 2005 (UTC)

[edit] What's the difference?

If, given a message x, it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. On the other hand, a strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).

This seems awkward and opaque to me...

The only semantic difference I can see in this text seems to suggest that for "strongly" collision free hash functions, even if message x = y, H(x) != H(y)! Which doesn't sound much like a function to me...

Looking around, I found (at http://www.cs.berkeley.edu/~jordan/courses/174-spring02/notes/note14.pdf) the following:

2. A hash function h(x) is said to be weakly collision-free if given a message x1 it is hard to find another message x2 such that h(x1) = h(x2). 3. A hash function h(x) is said to be strongly collision-free if it is hard to find any pair of messages x1, x2 such that h(x1) = h(x2).

This is still not too clear to me.

Weakly collision free seems to me to mean: Given any specific message x, it is computationally infeasible to find a message y such that H(x) = H(y), but if you wish to, you can craft two messages together (x and y), such that H(x) = H(y).

That is, if you're NOT a hacker trying to create a forged message (which is hard), but rather a social engineer trying to create TWO messages which can be interchanged to create plausible deniability (which is easier), you can do it with a Weakly collision free hashing function.

Strongly collision free seems to me to mean: Even though in a large domain of messages there will clearly be a large number of collisions, It is computationally infeasible to find ANY two messages x and y such that x != y and H(x) = H(y), even if you are trying to craft the two together.

Sigh... I'm still not sure how to phrase this is a way which is clear to the casual reader, even assuming that I have correctly interpreted this. Anyone else care to give it a try?

You interpet correctly. But the text makes sense to me as-is.
Maybe this is a clearer non-technical rendering: "Weak means it is hard to find a collision WITH A SPECIFIC message. Strong means it is hard to find ANY MESSAGES that collide." But I agree with the anonymous person above... it makes sense to me as is. [But it is a bit of a technical statement for the math phobic.] Nahaj 17:54, 1 July 2006 (UTC)

[edit] Overhaul of hash function and merging of hash algorithm

Hi Pfunk42! I just saw your major overhaul of hash function and "merging" of hash algorithm. Very nice work! You beat me to it. I put up those merging notices but never got around to note on the talk pages that I was going to do that work in some days. But I am happy you did it since it saved me the work and you did it so well. I added a picture instead, hope you like it. --David Göthberg 02:31, 5 September 2005 (UTC)

[edit] Surjective?

The article says that hash functions are surjective. Is that true? They're certainly not injective, but my understanding is that "surjective" means that "All elements of the image domain are actually 'used' by the mapping," to quote a posting by Peter Simons from the monoton-devel mailing list. [1]

So a hash function could be surjective, but I don't think it has to be.

But this is so far from my areas of expertise that I don't want to just change it; I may be way offbase. --Elysdir 00:10, 17 December 2005 (UTC)

Basically, you want the outputs of your hash function to be uniformly distributed over the codomain of the function. If the function is not a surjection, there exist values in the codomain that are not in the range. If this is true, then the output is not, strictly speaking, uniformly distributed. However, if the size of the codomain is greater than the square root of the size of the domain, then many quite good hash functions will not be surjective (see birthday paradox). Anyway, it's complicated, but not really an issue anymore since i took out the reference to "surjective". --pfunk42 05:56, 2 January 2006 (UTC)

[edit] Stock function?

What's a stock hash function? Isn't the case when an algorithm become published it becomes stock? What's significant about a stock function the way it is being described in this article? Just a little bit confused. --MegaHasher

Some hash functions are special-purpose. For example, they might only give nice, uniform output for the input they were designed to take. Or they might accept only input of some fixed size. These are not stock because they aren't widely reusable. Also, classes of hash functions or hash function designs are not themselves hash functions. For example, Pearson Hashing is not a stock hash function. Basically, what I'm familiar with being called a "stock hash function" is one which, from a mathematical perspective, is almost universally applicable and good. I say "from a mathematical perspective" because being a stock hash function doesn't mean it's implementation is universally fast. --pfunk42 05:56, 2 January 2006 (UTC)

[edit] Randomization Function

Ok. maybe i should have discussed it before throwing it in there, but this is widely used term for a hash function that is one-to-one. a google search for "randomization function" reveals that most of the results use the term as i describe.

True, a bijective function whose domain equals its range represents a permutation, but calling any one-to-one hash function a "permutation" is highly misleading.

It might be worth creating a separate "Randomization Function" page (with redirect from "Mix Function"), but i'm not sure. Comments welcome.--pfunk42 06:12, 2 January 2006 (UTC)

IMHO, Wikipedia already has too many semi-related pages about random functions. In my mind, permutations used for hashing has less strength than permutations used for pseudo randomness. Even with lots of rounds these permutations probably would not be too good for pseudo randomness, because the period is too short when there are already fast RNG generators with very long periods. --MegaHasher 06:33, 2 January 2006 (UTC)

Many uses of randomization functions, especially those which use them under that name, have no "period" requirement/concern. In fact, periodicity concerns might inappropriately narrow the space of possible functions. Other uses (such as pseudorandom number generation) might call for functions satisfying the requirements of a "randomization function" as well as other, stronger requirements. It would seem to me appropriate to disambiguate such classes of requirements in an article. Here's one thing i believe is a useful observation: a random function is an idealization of a hash function; a random permutation is an idealization of a randomization function. --pfunk42 13:08, 2 January 2006 (UTC)

[edit] Intro Edit

I admit perhaps the original intro is too technical. Hopefully my edit is more to the level of a beginner programmer. --MegaHasher 03:44, 25 January 2006 (UTC)

The example of catching a change in an article suffers from a defect that the check is only probabilistic. The materials in the introduction should be rock solid.

I appreciate your effort, but even so the opening sentence basically says "a hash function is something used to create a hash value". I got to this page by searching for "hash value", which redirected me here. In other words, I still don't know what "hash" means, except with respect to meat dishes. I'm going attempt to redress that. ShawnVW 18:52, 2 March 2006 (UTC)

ShawnVW: I like your new intro. --David Göthberg 01:04, 3 March 2006 (UTC)

[edit] Hash table section

Two issues are in this section:

  • Cryptographic hash functions are too slow for hash table usages. Cryptographic hash functions are used to generate security hashes.
  • "too slow" is a brash generalization. some hash tables have records that live on disk, in which case you have virtually all the hashing time in the world. besides, i've seen many-a-hash table using MD5 as the hash function, which is (IIRC) only about 2-3 times slower than Jenkins. (Granted, i quickly pointed out they should use Jenkins instead... :) --pfunk42 21:25, 2 February 2006 (UTC)
  • MD5 is a CRYPTOGRAPHIC hash function, and is designed for a much different use than Jenkin's hash. I'm surprised to see it does that well. But I'd suggest that 2 or 3 times as slow is impractical for many uses. I'v implemented many cryptographic hash functions (1 NIST validated), and I don't use them for anything other than cryptographic work... any more than I would use hash table algorithms for cryptographic use. Cryptographic hash functions are WAY too slow for any hash table use I do. Nahaj 17:32, 1 July 2006 (UTC)
  • A hash function that maps successive keys to successive output values probably has severe funneling, and is not a good hash function to begin with. --MegaHasher 06:31, 1 February 2006 (UTC)
  • another brash generalization. there are cases in which regularity in the input can be exploited to beat "random" hash functions. this is pretty well known thanks to work by Knuth, et al. but the downside is that the *wrong* set of inputs can be very bad... as i discuss in the article! :) also, some very small hash tables (for example, in implementation of "dictionary"-heavy languages like python, smalltalk, etc.) use hash functions that are pretty bad in terms of collisions but work quite well because low load factor + linear probing (very fast search) + small size + frequent access (likely in CPU cache) mean the hash function must be obscenely fast. --pfunk42 21:25, 2 February 2006 (UTC)
  • (Knuth worked with non-cryptographic hash functions, and the rest of Pfunk42's reply sounds like uses of non-cryptographic hashs. MegaHasher's comment sounds like a sound objection to such a hash for cryptographic work.) Possibly we are talking apples and oranges here? Nahaj 17:32, 1 July 2006 (UTC)

[edit] Reverse hash.

Say I have a hash (I know the hashing function and it has no salt), and I know my cleartext is between 12 and 18 bytes long and contains only ascii a-z \. 0-9 Would it be possible to write a function to generate all the possible cleartexts that fall under this particular category with this hash? Doodle77 03:24, 28 March 2006 (UTC)

In the case of non-crypto hash, yes. -MegaHasher 23:50, 5 April 2006 (UTC)
Actually, Doodle77 hasn't really asked the right question, and MegaHasher's answer is potentially misleading (though technically correct). Regardless of whether a hash function is cryptographically strong, one can write a function to generate all possible cleartexts within a finite range/set that generate a certain hash. The solution is as simple as iterating over the possibilities, hashing them, and comparing the result to the desired hash. This answers Doodle77's question the way he/she asked it; it was asked as a question of computability. Doodle77 was probably interested in whether the solution was efficiently computable, or practical, which is a question of computational complexity. With a space of inputs as large as you describe, efficiently generating matching cleartexts depends on the ability to "reverse" the hash function, which is (presumably; see One-way function) much more difficult for cryptographic hash functions. --pfunk42 12:36, 23 April 2006 (UTC)

[edit] Intro

It seems the intro has become longer than the main text. Clean-up is again needed. -MegaHasher 23:50, 5 April 2006 (UTC)

The Intro is incorrect. A well designed hash function is not necessarily one-way nor unforgeable. A well designed hash function only needs to distribute input keys well across all buckets. You might be able to prove that a hash function that does a good job of spreading values across buckets is also one-way or non-forgeable, but those are not initial design constraints. There are "niche" applicaitons in which one-way or non-forgeable attributes are important, but it is not a design constraint for hash functions in general. Chuck Simmons 20:06, 14 April 2006 (UTC)

[edit] Audio Identification NPOV

"Contrary to an assumption often voiced by people who don't follow the industry carefully,"

That seems a bit too opinionated to be in accord with WP:NPOV and WP:AWW, all the more so that it's uncited. I made the statement a bit more neutral and verifiable, but I still ask that a source be given for this claim, or otherwise it should be removed entirely. The claim regarding the noisy room is also uncited, so I'm requesting a source for that as well. --Ultimus 17:07, 1 July 2006 (UTC)