HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.[1] Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical error rate of 2%, using 1.5 kB of memory.[1] HyperLogLog is an extension of the earlier LogLog algorithm,[2] itself deriving from the 1984 Flajolet–Martin algorithm.[3]

Algorithm

The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2n.[1]

In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly distributed set can then be estimated using the algorithm above.

The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.[4]

Operations

The HyperLogLog has three main operations add to add a new element to the set, count to obtain the cardinality of the set and merge to obtain the union of two sets. Some derived operations can be computed using the Inclusion–exclusion principle like the cardinality of the intersection or the cardinality of the difference between two HyperLogLogs combining the merge and count operations.

The data of the HyperLogLog is stored in an array M of counters called registers with size m that are set to 0 in their initial state.

Add

The add operation consists in computing the hash with a hash function h of the input data v getting the first b bits (where b is and adding 1 to them to obtain the address of the register to modify. With the remaining bits compute which returns the position of the leftmost 1. The new value of the register will be the maximum between the current value of the register and .

Count

The count algorithm consists in computing the harmonic mean of the m registers.

The intuition is that being n the unknown cardinality of M. Each subset will have elements. Then should be close to the harmonic mean of 2 to these quantities is which should be near . Thus, should be n approximately.

Finally, the constant is introduced to correct a systematic multiplicative bias present in due to hash collisions.

This technique, though, is biased for small cardinalities below a threshold of . The original paper proposes using a different algorithm for small cardinalities: Linear Counting [5]

Merge

The merge operation for two HLLs () consists in obtain the maximum for each pair of registers

Complexity

To analyze the complexity it is used the data streaming model [6] which analyses the necessary space to get a approximation with a fixed success probability . The relative error of HLL is and it needs space. Being n the set cardinality and m the number of registers (usually less than one byte size).

The add operation depends on the size of the output of the hash function. As this size is fixed we can consider the add operation to be in time.

The count and merge operations depend on the number of registers m and have a theoretical cost of . In some implementations (Redis) the number of registers is fixed and the cost is considered to be in the documentation.

HLL++

The HLL++ is a practical improvement of the HyperLogLog with some modifications to make it useful for real environments [6]. Among other improvements it proposes:

References

  1. 1 2 3 Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" (PDF). Discrete Mathematics and Theoretical Computer Science proceedings. Nancy, France. AH: 127–146. CiteSeerX 10.1.1.76.4286Freely accessible. Retrieved 2016-12-11.
  2. Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities." (PDF). In G. Di Battista and U. Zwick. Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03). 2832. Springer. pp. 605–617.
  3. Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). Journal of Computer and System Sciences. 31 (2): 182209. doi:10.1016/0022-0000(85)90041-8.
  4. S Heule, M Nunkesser, A Hall (2013). "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm" (PDF). sec 4.
  5. Whang, Kyu-Young; Vander-Zanden, Brad T; Taylor, Howard M (1990). "A linear-time probabilistic counting algorithm for database applications". ACM Transactions on Database Systems (TODS). 15 (2): 208229.
  6. 1 2 "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm". research.google.com. Retrieved 2014-04-19.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.