Trie

From Wikipedia, the free encyclopedia

A trie for keys "to", "tea", "ten", "i", "in", and "inn".

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of any one node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that happen to correspond to keys of interest.
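
The structure described above can be sketched minimally in Python, assuming a dictionary of children per node; the class and the insert helper are illustrative, not part of any standard API, and the word/value pairs follow the example figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.value = None   # set only on nodes that correspond to stored keys

def insert(root, key, value):
    node = root
    for ch in key:          # the key is spelled out along the path from the root
        node = node.children.setdefault(ch, TrieNode())
    node.value = value      # the node's position, not its contents, encodes the key

root = TrieNode()
for word, v in [("to", 7), ("tea", 3), ("ten", 12), ("i", 11), ("in", 5), ("inn", 9)]:
    insert(root, word, v)
```

Note that no node stores its key: "inn" is recovered only by reading the characters on the path root, "i", "n", "n".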

The term trie comes from "retrieval". Due to this etymology it is pronounced "tree", although some encourage the use of "try" in order to distinguish it from the more general tree.

In the example shown, keys are listed in the nodes and values below them. Each complete English word has an integer value associated with it. A trie can be seen as a deterministic finite automaton, although the symbol on each edge is often implicit in the order of the branches.

It is not necessary for keys to be explicitly stored in nodes. (In the figure, words are shown only to illustrate how the trie works.)

Though character strings are the most common keys, tries need not be keyed by them. The same algorithms can easily be adapted to ordered sequences of any kind of element, e.g., permutations of a list of digits or permutations of a list of shapes.

Advantages and drawbacks, relative to binary search tree

The following are the main advantages of tries over binary search trees (BSTs):

  • Looking up keys is faster. Looking up a key of length m takes worst-case O(m) time. A BST performs O(log n) key comparisons, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced; since comparing two string keys can itself take O(m) time, a BST lookup can cost O(m log n) overall. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.
  • Tries can require less space when they contain a large number of short strings, because the keys are not stored explicitly and nodes are shared between keys.
  • Tries help with longest-prefix matching, where we wish to find the key sharing the longest possible prefix with a given key efficiently. They also allow one to associate a value with an entire group of keys that have a common prefix.
  • There is no need to keep a trie balanced, which for BSTs typically involves a great deal of complexity (see self-balancing binary search tree).

The main disadvantages of tries are the following:

  • Tries can provide an ordering of the keys, but only lexicographic ordering, whereas a binary search tree can be ordered by any comparison function on the keys.
  • Tries can be considerably larger in certain circumstances, such as a trie containing a small number of very long strings (Patricia tries help to deal with this).
  • Trie algorithms are more complex than simple BSTs.
  • It is not always easy to represent data as strings, e.g. complex objects or floating-point numbers.

Although it seems restrictive to say a trie's key type must be a string, many common data types can be seen as strings; for example, an integer can be seen as a string of bits. Integers with common bit prefixes occur as map keys in many applications such as routing tables and address translation tables.
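
For instance, one way to treat an integer as a trie key is to spell out its fixed-width binary representation, most significant bit first; int_to_bits below is an illustrative helper, not a standard function.

```python
def int_to_bits(n, width=8):
    # fixed-width binary string, most significant bit first
    return format(n, "0{}b".format(width))

# 192 and 193 share a 7-bit prefix, so their paths in a bitwise trie
# diverge only at the final level:
print(int_to_bits(192))  # 11000000
print(int_to_bits(193))  # 11000001
```

This is exactly the situation in routing tables, where many addresses share long bit prefixes.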

Tries are most useful when the keys are of varying lengths and we expect some key lookups to fail because the key is not present. If we have fixed-length keys and expect all lookups to succeed, then we can improve key lookup by merging every node that has a single child (such as "i" and "in" above) with that child, producing a Patricia trie. This saves space in maps where long paths down the trie have no branching, for example in maps where many keys share a long common prefix or where keys have long unique suffixes.
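
The merging step can be illustrated with a sketch (not a full Patricia trie implementation). It assumes a nested-dict representation in which each entry maps an edge label to a pair (children, is_terminal), and collapses every chain of non-terminal single-child nodes into one edge whose label is the concatenated characters.

```python
def compress(children):
    # children: dict mapping edge label -> (child_dict, is_terminal)
    out = {}
    for label, (sub, terminal) in children.items():
        # follow chains of non-terminal single-child nodes,
        # concatenating the labels along the way
        while not terminal and len(sub) == 1:
            (next_label, (next_sub, next_terminal)), = sub.items()
            label += next_label
            sub, terminal = next_sub, next_terminal
        out[label] = (compress(sub), terminal)
    return out

# The chain t -> e -> a (only "tea" stored) collapses to a single
# edge labelled "tea"; the chain i -> n -> n does not collapse,
# because "i" and "in" are themselves stored keys.
```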

Clarification about performance

It is acceptable to consider trie search time O(1) if the length of a key has a constant upper bound. Given N distinct keys, the length of the longest key is at least log_A N, where A is the size of the alphabet. If all of the keys in the trie have this lower-bound length and the search string is at least as long, then trie search time is O(log N) for N distinct keys, which would appear to be the same as that of a binary search. (Whether or not all of the keys in the trie have the same length, trie search time is proportional to the length of the search string, which can be shorter than the keys that are actually in the trie.)

This observation does not take away the benefits of the trie, however, because a trie's real advantage is that it makes each comparison operation cheaper. In a binary search we perform string comparisons, each of which is O(m) in the worst case, where m is the length of the shorter of the two strings. A binary search makes O(log N) comparisons, in either a sorted array or a binary search tree, resulting in O(m log N) time if all of the keys are the same length. If the length of each key is log_A N, then substituting log_A N for m gives O(log^2 N) binary search time. (It is notationally inconsistent to make every key O(log N) in length for the trie and then treat the same keys as O(1) in length for the binary search.) In a trie, on the other hand, we compare single characters in constant time. This is not merely a theoretical difference: as we descend close to the leaves of a binary search tree, the strings we compare often have long common prefixes, making string comparisons slow.

A similar argument applies to radix sort, which can sort bitstrings of length m in O(mn) time; because sorting is applied more often to small values than large values (and specifically in the case of radix sort, fixed-size integers which effectively make m constant), the factor of m is often neglected.
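
The m-passes structure of radix sort can be sketched as LSD radix sort on fixed-length bitstrings: one stable partitioning pass per bit position, least significant first, for O(mn) total work. The function name is illustrative.

```python
def radix_sort_bits(strings, m):
    # strings: bitstrings, all of length m
    for pos in range(m - 1, -1, -1):     # least-significant position first
        zeros = [s for s in strings if s[pos] == "0"]
        ones = [s for s in strings if s[pos] == "1"]
        strings = zeros + ones           # stable: ties keep their prior order
    return strings

print(radix_sort_bits(["110", "001", "100", "011"], 3))
# ['001', '011', '100', '110']
```

Stability of each pass is what makes the earlier (less significant) passes survive the later ones.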

Applications

As replacement of other data structures

As mentioned, a trie has a number of advantages over binary search trees. A trie can also be used to replace a hash table, over which it has the following advantages:

  • Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of the key), compared to an imperfect hash table. An imperfect hash table can have key collisions, where the hash function maps different keys to the same position in the table. The worst-case lookup speed in an imperfect hash table is O(N) time.
  • There are no collisions of different keys in a trie.
  • Buckets in a trie, analogous to the hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
  • There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
  • A trie can provide an alphabetical ordering of the entries by key.

Tries do have some drawbacks as well:

  • Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random access time is high compared to main memory.
  • It is not easy to represent all keys as strings.
  • Tries are frequently less space-efficient than hash tables.
  • Unlike hash tables, tries are generally not already available in programming language toolkits.

Dictionary representation

A common application of a trie is storing a dictionary, such as one found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e. storage of information auxiliary to each word is not required), a minimal acyclic deterministic finite automaton would use less space than a trie.
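
Predictive-text lookup on such a dictionary amounts to walking down to the node for the typed prefix and enumerating the subtree below it. A sketch, assuming a dict-based trie with a "$" end-of-word sentinel; build and complete are illustrative names, not from any particular library.

```python
def build(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True          # sentinel marking the end of a complete word
    return root

def complete(root, prefix):
    node = root
    for ch in prefix:             # walk down to the node for the prefix
        if ch not in node:
            return []             # no stored word starts with this prefix
        node = node[ch]
    results, stack = [], [(node, prefix)]
    while stack:                  # depth-first walk of the subtree below it
        node, word = stack.pop()
        if "$" in node:
            results.append(word)
        for ch, child in node.items():
            if ch != "$":
                stack.append((child, word + ch))
    return sorted(results)

trie = build(["to", "tea", "ten", "i", "in", "inn"])
# complete(trie, "te") returns ["tea", "ten"]
```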

Tries are also well suited for implementing approximate matching algorithms, including those used in spell checking software.

Algorithms

The following Python code determines whether a given string is in a trie. Here children is a mapping from characters to child nodes, and a "terminal" node is one that marks the end of a valid word.

def find(node, key):
    if key == "":                      # base case: the key has been consumed
        return node.is_terminal        # true only if a stored word ends here
    c = key[0]                         # this works because key is not empty
    tail = key[1:]                     # key minus the first character
    child = node.children.get(c)
    if child is None:                  # unable to recurse, although key is non-empty
        return False
    return find(child, tail)
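
A companion sketch to the lookup routine above: the same node shape (a children mapping plus a terminal flag) with insertion added, so the example is self-contained and runnable. The class and function names are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}        # character -> Node
        self.is_terminal = False  # does a stored word end here?

def insert(root, key):
    node = root
    for c in key:
        node = node.children.setdefault(c, Node())
    node.is_terminal = True

def find(node, key):              # iterative version of the lookup above
    for c in key:
        node = node.children.get(c)
        if node is None:
            return False
    return node.is_terminal

root = Node()
for w in ["to", "tea", "ten", "i", "in", "inn"]:
    insert(root, w)
# find(root, "ten") returns True; find(root, "te") returns False,
# because "te" is only a prefix, not a stored word
```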

Sorting

Lexicographic sorting of a set of keys can be accomplished with a simple trie-based algorithm as follows: insert all keys into a trie, then output all keys by means of a pre-order traversal (visiting children in alphabetical order), which produces output in lexicographically increasing order.

This algorithm is a form of radix sort.
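
This trie-based sort can be sketched as follows: insert every key into a dict-based trie (with an assumed "$" end-of-key sentinel), then emit keys via a pre-order traversal that visits children in alphabetical order. The function name is illustrative.

```python
def trie_sort(keys):
    root = {}
    for k in keys:                       # phase 1: insert all keys
        node = root
        for ch in k:
            node = node.setdefault(ch, {})
        node["$"] = True                 # end-of-key marker
    out = []
    def walk(node, prefix):              # phase 2: pre-order traversal
        if "$" in node:                  # a shorter key precedes its extensions
            out.append(prefix)
        for ch in sorted(c for c in node if c != "$"):
            walk(node[ch], prefix + ch)
    walk(root, "")
    return out

print(trie_sort(["ten", "to", "i", "inn", "tea", "in"]))
# ['i', 'in', 'inn', 'tea', 'ten', 'to']
```

Visiting the "$" marker before any children is what places a key before all keys that extend it, matching lexicographic order.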

A parallel algorithm for sorting N keys based on tries runs in O(1) time if there are N processors and the lengths of the keys have a constant upper bound. However, keys may collide by sharing common prefixes or by being identical to one another, reducing or eliminating the speed advantage of having multiple processors operating in parallel.

Full text search

A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
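
The idea can be illustrated with a naive suffix trie: insert every suffix of the text, after which "is s a substring?" becomes a simple prefix walk. (Real suffix trees are built in linear time and space; this quadratic sketch only conveys the principle, and the function names are illustrative.)

```python
def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):       # insert every suffix text[i:]
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(root, s):
    # s occurs in the text iff s is a prefix of some suffix,
    # i.e. iff the walk for s never falls off the trie
    node = root
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
# contains(trie, "nan") returns True; contains(trie, "nab") returns False
```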
