Talk:Hash tree

From Wikipedia, the free encyclopedia

[edit] Merge of Tiger-Tree Hash into Hash tree

Hi! I am thinking of merging the Tiger-Tree Hash article into a subsection of Hash tree. Please respond here if you have any comments. --David Göthberg 02:03, 26 August 2005 (UTC)

Sounds appropriate because a Tiger-Tree hash is just a hash tree hash with fixed choices for the block size and hash function (if I understand correctly?). — Matt Crypto 01:04, 28 August 2005 (UTC)
Yes, you understand correctly. A Tiger tree hash is just a hash tree with fixed choices for the block size (1024 bytes), hash function (Tiger) and type of tree (binary). He, I just realised I should perhaps mention in the article that a hash tree can be of other types then binary. Each node can have much more child nodes then two. (Although binary is the most popular since at least in theory it is the most flexible.) For instance in my own system I am thinking of using lots of children for each node in such a way that the tree only has two levels. Like this: Data blocks, first level hashes, second level hashes, top hash. Sort of just a hash list with an extra level, easier to implement but still means you can download a "branch" at a time. (That is, download the first level hashes, check the integrity of them, then download a chunk of second level hashes that fits under one single hash in the first level and check the integrity. Then start downloading blocks that corresponds to that chunk of second level hashes.) --David Göthberg 11:45, 29 August 2005 (UTC)
I think a merge would be ok, as TTH is rather short and is the only notable implentation of a hash I know. Just be careful to maintain a boundary between concept and implementation and this should be no problem. --R.Koot 19:46, 28 August 2005 (UTC)
Yep, done so. --David Göthberg 11:45, 29 August 2005 (UTC)

Good job! --Adam David Newman 1:57, 19 October 2005 (UTC-6)

[edit] Clarification needed

Note, this problem has a solution that is in the official papers and described in the last comment.

Similar to the Merkle-Damgard chain-hash, tree-hash requires something like Initial Values (IVs) to avoid collision which simply extend the tree. Using a tree hash on the picture it appears that

 x = Hash 0-0 || Hash 0-1 || Hash 1-0 || Hash 1-1 

and

 y = Hash 0 || Hash 1

collide:

Tree-Hash(x)=Tree-Hash(y)

--24.60.175.200 06:40, 15 November 2005

No, you are wrong, sort of. Tree-Hash(x) != Tree-Hash(y) except for one very special case. I'll write an explanation when I thought about a good way to explain it to you. --David Göthberg 02:39, 16 November 2005 (UTC)

If the data blocks incorporate the IV (or are hashed somehow differently into the tree hash leaves, but this needs more care) then my point is taken care of. Otherwise, the construction implies that on the Figure in the article, TopHash=hash(Hash0|Hash1)=hash(hash(Hash00|Hash01)|hash(Hash10|Hash11)).

It seems natural to interpret this as tree-hash(Hash00|Hash01|Hash10|Hash11)=tree-hash(Hash0|Hash1)=TopHash, which gives a collision. My comment is simply to make explicit whatever mechanisms/assumptions that exclude this interpretation. For example, careful use of IVs.

Btw, it might be a good idea to note that hash-chains can be viewed as a special case of the tree-hash (just a very unbalanced tree). Also, authenticated dictionary data structures [Naor Nissim, and subsequent work] constructed using Merkle's tree hashes are probably a good link to include. --24.60.175.200 19:21, 18 November 2005

First of all, you should really learn to log in and sign your messages. It is really annoying to discuss with some one that does not sign his/her messages. It makes it unclear who says what and when it was said. For now I took the liberty of adding "fake" signatures to your messages to make them readable.
Secondly, you are still wrong. Your case is only valid if the data block size is exactly twice the size of the hashsum output if a binary tree is used. (So if using SHA1 which has 20 byte hashsum size that would imply using a data block size of 40 bytes.) Using only 40 byte data block sizes would be very inefficient and as you pointed out very insecure.
Most hash tree implementations use 1024 bytes or more for each data block they hash. That would mean that if you fed the "file" x from your example above to a hash tree function it would hash all of x once as a single data block to get the top hash since it is smaller then 1024 bytes. And if you fed the "file" y from your example above that too would be hashed as a single data block, and thus get another top hash then x.
However what an attacker can do in this case is to trick the receiver into believing that the actual total size of the file is 40 bytes and then feed your y as the "file data". But there is no way an attacker can trick a receiver into believing that x is the file data.
One way of avoiding that 40 byte fake file size attack and even avoid problems if you actually use the data block size of 40 bytes in your hash tree implementation is to instead create the top hash something like this:
(Later edit: Solution here does not give complete protection. See last comment for the proper solution.)
top hash = hash( hash 0 | hash 1 | actual total file size )
There is no need to use any IVs or file size values etc in the other levels of the hash tree. A more common way is to put the top hash in a metadata file with the file name, file size etc and then instead publish the hash of the metadata file.
Your Naor Nissim link seems to link to a paper that contains a description of a hash tree structure similar but not the same to the kind of hash trees this article is talking about. (This article is about Merkle trees.) But if you like to then add that link, but state that it only describes a similar but not same approach.
Regarding hash chains: I could not find any article in Wikipedia about hash chains and I don't remember that I have heard about them before. But if I guess correctly what they are then yes, one could see a hash chain as a very weird unbalanced hash tree. But for now I would not add that to the article for several reasons.
Note that I have tried to keep the article short and easy to understand instead of complete. Since complete in this case would mean at least 10 pages of text and at least 4-10 pictures. Instead I resorted to using the sentence "There are several additional tricks, benefits and details regarding hash trees" and then linking to the THEX paper and RSA articles that tell more (but even those are incomplete).
--David Göthberg 12:52, 20 November 2005 (UTC)
Hi again 24.60.175.200. I just wanted to thank you for pointing out that type of attack. Even though your understanding of it was not complete it lead me to understand the two kinds of "40-byte attacks". I was not aware of it and have not seen any such thing mentioned in any papers on hash trees or hash lists. I have realised now that the fix I wrote about above does not give complete protection. I have to think more about this and discuss it with some other crypto guys.
--David Göthberg 21:06, 8 December 2005 (UTC)

Update: This problem was discovered a couple of years ago and resolved in Justin Chapweske and Gordon Mohr's THEX Protocol. This specification for the contruction of tree hashes has likely received the most scrutiny of any tree hash implementation and is thus recommended for implementation. The appropriate solution to the above problem is to use separate hash functions for interior and leave nodes of the hash tree, which THEX implements by prefixing either 0x00 or 0x01 to the data to be hashed. THEX is implemented in Swarmcast and Gnutella.

24.118.51.232 22:04, 8 December 2005 (UTC)

Ah yes, that seems to be a good solution. --David Göthberg 14:03, 24 February 2006 (UTC)

[edit] Some clarifications

Several things seem to have confused editors and some readers. Since I don't know how to make them more clear in the article without making the text to heavy I will explain them here so future editors at least have access to the info.

  • Hash trees doesn't have to be binary. That is, they can have more than two child nodes/leaves under each node. But binary hash trees seems to be the most used nowadays. One can just as well for instance have a tree with 100 child nodes/leaves under each node. Then if you use 100 bytes per hashed data block and have a file of say 1 million bytes that means two levels of hashes below the top hash. Like this: Top hash < 100 hashes < 100*100 hashes < 100*100 hashes * 100 bytes per block = 1 million bytes.
  • The sentences about Lamport signatures and quantum computers seem to have caused a lot of confusion and edits. I have kept them short in the article since that is currently not an important issue and should probably instead be explained more in detail in the Lamport signature article. Lamport signatures are in them self already secure against quantum computers since they do not use the kind of maths that other public-key systems use, instead they only use hashes. Good strong hashes are believed to not be affected by quantum computers. You just need to use slightly bigger hashes than we use today. The bad part is that a Lamport signature can only be used once. So your published "public-key" can only be used one time. So if you want to sign many messages you would have to publish a long list of one time public Lamport keys. That would be awkward. But if you hash all those Lamport public keys with a hash tree you can then instead publish just the top hash! And when signing a message all you have to store in the message is the signature and the hash tree "branch" of hash nodes that makes it possible to verify the hashes up to the top hash. So hash trees doesn't make Lamport signatures more secure or change them, hash trees "just" makes it possible to easily handle MANY Lamport signatures.
  • Some editors have added "citation needed" tags on the page or asked about the source of some of the statements. Well, all the sources needed are already right there in the external links section. Those links go to good sources that explains pretty much all the details about hash trees and the RSA labs article also explain why Lamport signatures are secure against quantum computers. (The article is written and published by RSA labs, it is not a Wikipedia article.) And I think RSA labs is one of the more trusted sources we have when it comes to crypto stuff. Many other sources I have seen also claims that quantum computers do not seriously affect hashes and stream/block ciphers. Just that public-key systems like RSA, DH, ECC etc are affected.
  • Note that this article doesn't tell the full story about hash trees. There are many more things one need to understand to implement and use hash trees in a cryptographically secure way. But that would make this article huge. Currently the article is just an introduction to hash trees. Again the external links section points to documents telling most of the details and that is why I put a sentence in the text that mentions that: "There are several additional tricks, benefits and details regarding hash trees. See the external links below for more in-depth information."
  • Tiger tree hashes are just a specific kind of hash tree. And currently the most used ones. There is nothing special about them using Tiger hashes. Tiger is just another hash function and Tiger in itself has nothing special to do with hash trees. It just so happens that when hash trees started to be used in p2p networks the p2p programmers did choose Tiger as the hash. They could just as well have chosen any other cryptographically secure hash.

--David Göthberg 06:23, 18 October 2006 (UTC)