Talk:Bloom filter

From Wikipedia, the free encyclopedia

1 A Graph
2 The example
3 Operation Names
4 Discoverer?
5 Removing from a Bloom filter
6 Adding and removing in a single-bit Bloom filter
7 Adding and removing in a counting Bloom filter
8 Bloomier filter insertion
9 You should break this into multiple pages
10 Generalized
11 Proof of false positive probability is wrong
12 What are they used for?
13 meaning of bloom filter
14 Purpose of k hash functions?

[edit] A Graph

The false positive probability

p

as a function of number of elements

n

in the filter and the filter size

2 m

. An optimal number of hash functions k= m/n ln2 has been assumed.

Maybe the following graph could be added to the "Probability of false positives" section. --Jerz4835 (talk) 12:29, 12 March 2008 (UTC)

[edit] The example

False positives are silly in a spelling checker. I think an example more like [1] should be used, with the secrecy thing mentioned elsewhere. But my english very bad, so I leave in capable hands of brother and sister. —Preceding unsigned comment added by 220.233.116.128 (talk) 16:19, 9 January 2008 (UTC)

Completely agree, if you read the article your mind becomes "stuck" on the example; it is so silly you wonder if you misunderstood the concept. 80.79.35.68 (talk) 17:16, 11 January 2008 (UTC)Lambert

[edit] Operation Names

I propose changing insert to add and test membership to query. I think add better matches what actually happens and query is just more concise. --Pfunk42 22:42, 1 Jun 2005 (UTC)

[edit] Discoverer?

I propose changing discoverer in "...named after its discoverer Burton H. Bloom," to inventor. Algorithms like this aren't 'discovered' so much as invented.

I tried "conceived by" whoever in year whatever, a pattern I've seen elsewhere. Deco 01:21, 5 Jun 2005 (UTC)

[edit] Removing from a Bloom filter

It feels like you can't. This should probably be emphasized in the article, as this is a distinction from similar data structures. It's mentioned in the intro, but then never discussed...I don't have the technical expertise to say definitively how to handle this, so I can't write it up myself. -- Metahacker 15:32, 21 Jun 2005 (UTC)

Good idea. I added a short explanation. You think it'll do? Deco 19:46, 21 Jun 2005 (UTC)

While strictly speaking practical removal may not be possible in a simple Bloom filter, two facts presented in the article might use some clarification as to why removal is logically impossible. First, the universe can be represented by an "full" array, and second, Bloom arrays can be ANDed. In cases with a finite data universe the two of these seems to idicate that I could build the Bloom array of everything but the element I want to delete then AND that with the Bloom array I have. I can't say that I see this never introducing false negatives, but it does seem worthy of mention one way or the other by someone more apt in these things than myself. JonShops 03:06, 15 April 2007 (UTC)

The problem with that idea is that the Bloom filter corresponding to "universe minus my element" will almost certainly be identical to the Bloom filter corresponding to "entire universe". You'll never introduce a false negative, but you will usually introduce false positives — that is, if you take the Bloom filter for (A,B,C) and try to subtract out (B) using your method, you'll probably end up with the Bloom filter for (A,B,C). The "removal" didn't actually do anything! Hope this helps; also, check out the other variations mentioned on this Talk page. Some of them support removals, with various restrictions. --Quuxplusone 03:26, 15 April 2007 (UTC)

[edit] Adding and removing in a single-bit Bloom filter

You could implement a voting algorithm in a single bit Bloom filter to allow both adds and deletes, at the cost of admitting false negatives, with an add setting k hash-bits, and a delete clearing k hash bits. The difference from an ordinary Bloom filter would be that the decision threshold would be less than requiring all k bits to be set for a rejection: for example, say, k/2, for say k=20. In this way, although both false-positive and false-negative results would be possible, the threshold factor could be tuned to give low error rates for both types of error.

Has anyone done this? Or are multi-bit Bloom filters simply superior? Inquiring minds weant to know. -- The Anome 13:54, September 9, 2005 (UTC)

Good question. I can't imagine why nobody's thought about this already, it's pretty obvious considering the implementation of your average BPP algorithm. I shall have to ask some people about it. Dcoetzee 08:03, 28 May 2007 (UTC)

[edit] Adding and removing in a counting Bloom filter

Riffing on The Anome's suggestion above: You could also implement a "counting" Bloom filter, by using b bits per hash instead of only one. For example, using b=2, you'd replace Bloom's array of m bits with an array of m 2-bit counters. Adding an element would increment all k relevant counters; if any counter was already at 3, it would "stick" there instead of wrapping around. Therefore, a count of 0 represents "absence," and a count of 1 or 2 indicates 1 or 2 entries sharing that counter. A count of 3 indicates at least three entries sharing that counter.

To remove an element, you'd decrement all k relevant counters, except for those counters that contained 3 — such a counter you'd leave at 3, since to decrement it would falsely imply that it was shared by exactly two entries.

This kind of "counting" Bloom filter would have all the problems with false positives of a normal Bloom filter, but it would allow a certain number of removals to proceed correctly. Add enough elements, though, and removals would stop working — you'd be unable to remove any element all of whose counters had maxed out.

I suspect that the added complexity of implementation wouldn't make a "counting" Bloom filter worth the trouble, but it's worth a mention on this Talk page, at least. :) --Quuxplusone, 19:06, 18 March 2006 (UTC)

[edit] Bloomier filter insertion

The bloomier filter insertion description is wrong:

"Now that we have the structure and a search algorithm, we also need to know how to insert new key/value pairs. The program must not attempt to insert the same key with both values. If the value is 0, insert the key into A0 and then test if the key is in B0. If so, this is a false positive for B0, and the key must also be inserted into A1 recursively in the same manner. If we reach the last level, we simply insert it. When the value is 1, the operation is similar but with A and B reversed.".

So any key in A0 that is a false positive in B0 must be present in A1 (and so on). But a key inserted in A0 might BECOME a false positive key in B0 after other insertions in B0. Imagine we insert a key in A0 which is "almost present" in B0 - only one of the k bits is not set; the key is not (yet) a false positive in B0. But after this, maybe we insert a key in B0 that also sets that kth bit to 1; now the initial key is in A0, it is a false positive in B0, and it is not in A1 - violating the data structure invariants. --Raduberinde 08:09, 28 July 2006 (UTC)

You're right. My mistake. I will reconsider the original paper. Dcoetzee 04:38, 25 May 2007 (UTC)

I think the paragraph on insertion in bloomier filter should be removed. The described bloomier filter is static, it doesnt permit insertion or deletion. The bloomier filter is constructed globally. I have added a section on dynamic bloomier filters that support insertions and deletions.

Okay, dynamic is certainly better - and surprisingly simple. We ought to still cite the original paper, but we can discuss the more recent result in detail. Dcoetzee 17:40, 24 July 2007 (UTC)

[edit] You should break this into multiple pages

A bloom filter is a bloom filter - a counting bloom filter is a different beast with enough different permutations to discuss for the rest of eternity.

Break it into two - at least

[edit] Generalized

Don't mean to be a wank, but has the "Generalized Bloom Filter" thingy been accepted into a peer-reviewed publication?

Regardless, the author should stick in info on how much bigger than a regular BF a GBF that uses ones and zeros needs to be to have the same false positive rate. Oh, I see. You forgot to mention that because it's A LOT (many times) bigger. ;) --pfunk42 04:35, 28 May 2007 (UTC)

[edit] Proof of false positive probability is wrong

The events of each particular bit being 1 are not independent. Intuitively, if you know the first bit mapped to 1, it suggests the array is more dense. That is, the probability the second bit maps to 1 given that the first did is greater than the probability the second bit maps to 1 given the first mapped to 0 (or say, given no other information). 128.2.16.192 22:49, 27 September 2007 (UTC)

First of all, i would call it an "analysis", not a "proof". And, in fact, this is an average case analysis, because it doesn't take into account the fact that the number of bits set in the Bloom filter from n additions follows a probability distribution, and with random hash functions, the number of set bits determines the false positive probability. Thus, adding n things to a Bloom filter doesn't always result in a certain false positive probability--you might get lucky or unlucky. But the variance is small. There are possibly some other suble, mostly insignificant inaccuracies in that formula. See this paper for what seems to be a more detailed analysis, but it has not, to my knowledge been accepted under peer review. --pfunk42 05:32, 30 September 2007 (UTC)

[edit] What are they used for?

What are bloom filters used for? Any applications people can think of should be put in the intro. Fresheneesz 02:09, 16 October 2007 (UTC)

No. Any applications people can source should be put into later parts of the article, and (per WP:LEDE) mentioned briefly in the intro. —David Eppstein 02:47, 16 October 2007 (UTC)

[edit] meaning of bloom filter

if it gives "false positives" and "correct negatives" then it must be used to find whether element is *not* in a set (and so one would be 100% sure). maybe this article could be change according to it. 84.16.123.194 (talk) 15:47, 28 January 2008 (UTC)

[edit] Purpose of k hash functions?

Why are there k hash functions? Why would you need more than one? —Preceding unsigned comment added by 66.11.82.73 (talk) 15:07, 15 April 2008 (UTC)