HyperLogLog
HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset[1] (the cardinality).
Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of with a typical accuracy of 2%, using 1.5kB of memory.[1] HyperLogLog is an extension of the earlier LogLog algorithm.[2]
Algorithm
The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly-distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set. If the maximum number of leading zeros observed is , an estimate for the number of distinct elements in the set is .[1]
In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset, to obtain a multiset of uniformly-distributed random numbers with the same cardinality as the original multiset. The cardinality of this randomly-distributed set can then be estimated using the algorithm above.
The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance. In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum the number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.
References
- ↑ 1.0 1.1 1.2 Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
- ↑ Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities.". In G. Di Battista and U. Zwick. Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03) 2832. Springer. pp. 605–617.
- "Probabilistic Data Structures for Web Analytics and Data Mining | Highly Scalable Blog". highlyscalable.wordpress.com. Retrieved 2014-04-19.
- "com.twitter.algebird.HyperLogLogMonoid". twitter.github.io. Retrieved 2014-04-19.
- "twitter/summingbird · GitHub". github.com. Retrieved 2014-04-19.
- "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm". research.google.com. Retrieved 2014-04-19.
- http://pnwscala.org/talks/SamRitchie-PNWScalaTalk.pdf