Talk:Bayesian spam filtering

From Wikipedia, the free encyclopedia

This entry is all wrong! The text simply describes the Naive_Bayes_classifier applied to text classification (such an example already exists in the mentioned article). Bayesian filtering is an estimation process, generally called "Sequential Bayesian Filtering", that estimates the state of a system through observations. It relates to particle filters, Kalman filters and hidden Markov models. I strongly suggest reformatting the content (I do not have the free time to do it myself). Could someone please do a quick Google search and rewrite this area? As a starting point I suggest "A survey of probabilistic models, using the Bayesian Programming methodology as a unifying framework".


I have rewritten this article to bear some relationship to reality. See the critiques above by some other clueful fellow. The main problem is that the previous version failed to recognize that what they called "Bayesian Filtering"

  • made no reference to what most of the world, aside from semi-literate Slashdot nerds, calls Bayesian filtering
  • was not Bayesian except in the vaguest sense, in which all inference could be considered Bayesian
  • was not really filtering except under a *very* loose definition of the word (in the same sense that classification is regression to {0,1})
  • described what is more commonly called classification, not filtering
  • was invented in the 60s
  • and not by Paul Graham

In conclusion, Paul Graham is a smart, entertaining guy, but he has a PhD in computer science and failed to acknowledge the *many* articles on spam classification using naive Bayes (not to mention many more effective methods) which predate him by *years*. The fact that his article received so much publicity is an interesting comment on the gap between academic thought and the general public. But it is the duty of an encyclopedia to bridge that gap, not widen it.

This entry could still use a lot of good work. Someone should incorporate some information on the actual methods of Bayesian filtering. I am not that someone.


Do we really need the advert for PopFile?

POPFile isn't the archetypical Bayesian spam filter. Maybe a specific spam filtering program (such as SpamBayes or K9) should be mentioned as a filter, and POPFile cited as an example of the general classification also mentioned in the article. Drangon

I am a POPFile user so I may be biased, but I think it is a good idea to include examples of Bayesian filters. The POPFile link was just removed with the comment "Don't need the advert at the end". I agree the text was pretty strong in saying it is one of the most popular, but providing the reader with some links is a good idea. Currently the text only mentions Thunderbird as an option. Programs such as K9, SpamBayes, and POPFile let users continue to use their current mail client. Thunderbird is a good program but it is not right for everyone. And following what Drangon said, POPFile is a good example of how Bayesian filtering can be used for more than just spam filtering. JoeChongq 04:45, 16 Dec 2004 (UTC)

Proposal for Move

I suggest that this page be renamed "Bayesian spam filtering" and moved. As people have pointed out, this article is not about Bayesian filters in general, but rather about a specific application in spam filtering. It does have some good info and links for spam filtering, which people are generally interested in, but as an article on "Bayesian Filtering" it could be much broader, more technical, and better.

  • I totally agree, and have now renamed the article to "Bayesian spam filtering". --Fredrik Orderud 13:04, 2 January 2006 (UTC)

I am really new here and not yet prepared to be bold. The page "Bayesian spam filtering" should address an audience that is short on the prerequisite skills needed to learn from the other pages on, say, Boosting. Editors skilled in mathematics and related disciplines should therefore, in my view, keep it simple. It should provide specific information on Bayesian filtering as used in the everyday filters that are called Bayesian, even if to an expert they are naive Bayes et al.

I think the article lacks in that readers should be made aware that

  • many 'features', not just the email's contents (words), may be useful input to a Bayesian algorithm.
  • 'Features' should link to a page that describes the concept in a more generalised manner.
  • As the page is mostly about spam filtering using a Bayesian filter, then to be balanced it should raise the possibility of using other algorithms for spam filtering, e.g. SVM.
  • The top of the page should also link directly to Bayesian inference (I think); the sentence could, for instance, indicate that this is an application of the Bayesian inference technique to a particular problem. The link to Bayesian statistical methods looks too much like it's just a biography, or addresses the dry metaphysical 'what does probability mean' debate. When I scanned around trying to find out how to get to the real stuff, I thought it was a dead end in the historical section of Wikipedia.

ZuluWarrior 05:42, 28 February 2006 (UTC)
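
To make the point about 'features' concrete, here is a rough Python sketch of turning a message into features that go beyond body words. It is only an illustration: the function name, the feature labels ("meta:has_url" and so on) and the choice of signals are all made up for this example and are not taken from any particular filter.

```python
import re


def extract_features(subject, body, sender):
    """Return a set of feature strings for one message.

    All feature names here are hypothetical; a real filter would choose its own.
    """
    features = set()

    # Ordinary word features, tagged by where the word appeared.
    for word in re.findall(r"[a-z']+", body.lower()):
        features.add("body:" + word)
    for word in re.findall(r"[a-z']+", subject.lower()):
        features.add("subject:" + word)

    # Non-word features: properties of the message rather than its vocabulary.
    if subject.isupper():
        features.add("meta:subject_all_caps")
    if re.search(r"https?://", body):
        features.add("meta:has_url")

    # The sender's domain (text after the last '@') as a single feature.
    domain = sender.rsplit("@", 1)[-1].lower()
    features.add("sender_domain:" + domain)
    return features


example = extract_features("BUY NOW", "Cheap pills at http://example.com", "x@Example.ORG")
```

Each feature string can then be counted per class exactly as a word would be, so the classifier itself does not need to change.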

An article needs to be balanced only in reference to its scope, and this here is about Bayesian methods only. The current place for comparing spam filtering techniques is Stopping_e-mail_abuse#Examination_of_anti-spam_methods, for Bayes see Stopping_e-mail_abuse#Statistical_filtering.--84.188.179.95 10:05, 27 June 2006 (UTC)

log filter

To understand how this works for filtering a log, I need a code example. Does anyone have one?
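
A minimal sketch in Python, assuming each log line is treated like a short message and sorted into one of two classes. The class names "noise" and "keep", the tokenizer, the add-one smoothing and the sample log lines are all assumptions made for illustration, not anything taken from the article.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase a log line and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", line.lower())


class NaiveBayesLogFilter:
    """Two-class naive Bayes over log lines (labels 'noise' and 'keep' are hypothetical)."""

    LABELS = ("noise", "keep")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.line_counts = Counter()

    def train(self, line, label):
        """Record one labelled example line."""
        self.line_counts[label] += 1
        self.word_counts[label].update(tokenize(line))

    def log_score(self, line, label):
        """log P(label) + sum of log P(word | label), with add-one smoothing.

        Assumes at least one training line has been seen for every label.
        """
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_words = sum(self.word_counts[label].values())
        total_lines = sum(self.line_counts.values())
        score = math.log(self.line_counts[label] / total_lines)
        for word in tokenize(line):
            count = self.word_counts[label][word] + 1  # smoothing avoids log(0)
            score += math.log(count / (total_words + len(vocab)))
        return score

    def classify(self, line):
        """Return the label with the higher score."""
        return max(self.LABELS, key=lambda label: self.log_score(line, label))


log_filter = NaiveBayesLogFilter()
log_filter.train("GET /index.html 200 OK", "noise")
log_filter.train("GET /favicon.ico 200 OK", "noise")
log_filter.train("ERROR disk failure on sda", "keep")
log_filter.train("ERROR connection refused by database", "keep")
print(log_filter.classify("ERROR database disk full"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied. With only four training lines this is a toy; a usable filter would need far more labelled lines.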

Pronunciation?

Anybody know how this is typically pronounced? Is it BAY-zee-an, or bay-EEZ-yun or by-EEZ-yun, or what?

I work with programmers from the U.S. (both coasts), France, India, and China. They all pronounce it bay-EE-sian. Don't get me started on the French pronouncing SQL as "squirrel". Kainaw 18:05, 21 Apr 2005 (UTC)

This is late to the party, but all the statistics types that I know pronounce it "BAY-zee-an" or "BAY-zhun", as it's named after "Bayes", pronounced "bays", not "BAY-ess", "BUY-ez", or other alternatives. 136.251.12.63 18:28, 13 January 2006 (UTC)

I agree with the last comment. (Keep in mind that I don't have any experience, but the last presents an important point. Plus, it makes the most sense.) Posted by SimpleBeep

Correct, it is pronounced "Bays-e-en", after Rev. Thomas Bayes, who wrote the paper that turned probability science upside down in 1763. I used to pronounce it wrong as well and did nearly two hours of research on this. - Onexdata 11/20/2006

Formula

Is the formula correct? It appears to be ((words_in_spam/spam_count)*(spam_count/total_count))/(words_in_total/total_count). If that is so, then this reduces to words_in_spam/words_in_total. Bayes would have noticed that easily, so there must be a reason for a more complex formula. If so, it is not mentioned in the article. Kainaw 18:05, 21 Apr 2005 (UTC)

To be Bayesian, the formula ought to be read as follows: the probability of a message being spam given certain words equals the prior probability of a message being spam, times the probability of those words given that the message is spam, all divided by the probability of those words. There is no simple reduction. Remember that the | stands for "given". **This may or may not relate to actual Bayesian filtering, but this is how typical Bayesian formulations in statistics and induction work**
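
For reference, here is that reading written out in symbols, with S standing for "the message is spam" and W for "the message contains the given words" (both symbols, and the counts n, are introduced here only for illustration). If every probability is estimated as a simple count ratio from the same set of messages, the single-word case does cancel down to the ratio Kainaw computed; the longer form starts to matter when the prior P(S) is chosen separately or when several words are combined.

```latex
P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W)}
\qquad\text{and, with count estimates,}\qquad
\frac{\dfrac{n_{WS}}{n_S}\cdot\dfrac{n_S}{n}}{\dfrac{n_W}{n}} = \frac{n_{WS}}{n_W}
```

Here n is the total number of messages, n_S the number of spam messages, n_W the number of messages containing W, and n_{WS} the number of spam messages containing W.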

The "Bayesian" here is because spam filters are typically Naive Bayes classifiers, as noted below. As such, the formula should indicate that we make the naive, incorrect, but often useful assumption that the words all occur independently, given that the message is spam (or not). That is, the final formula that spam filters use has a numerator something like this:

P(word1 | spam) P(word2 | spam) ... P(spam)

If we don't make this assumption, we can't use individual word statistics; we would need statistics for the entire set of words making up the message. johndburger 01:59, 24 April 2006 (UTC)
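
As a rough Python sketch of that numerator in use: multiply the per-word conditional probabilities and the prior for each class, then normalise over the two classes. The function name, the "ham" label for non-spam, the example probabilities and the choice to skip unknown words are all assumptions made here for illustration.

```python
import math


def spam_posterior(words, p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Return P(spam | words) under the naive independence assumption.

    The two tables map a word to an already-estimated conditional probability,
    which must be strictly positive. Words missing from either table are
    skipped (an illustrative choice, not a standard rule).
    """
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for word in words:
        if word in p_word_given_spam and word in p_word_given_ham:
            log_spam += math.log(p_word_given_spam[word])
            log_ham += math.log(p_word_given_ham[word])

    # Normalise over both classes; subtracting the max keeps exp() stable.
    largest = max(log_spam, log_ham)
    numerator_spam = math.exp(log_spam - largest)
    numerator_ham = math.exp(log_ham - largest)
    return numerator_spam / (numerator_spam + numerator_ham)


# Hypothetical per-word probabilities, chosen only for the example.
given_spam = {"viagra": 0.2, "meeting": 0.01}
given_ham = {"viagra": 0.001, "meeting": 0.1}
both_words = spam_posterior(["viagra", "meeting"], given_spam, given_ham)
```

The denominator is never estimated directly: it is the sum of the spam and non-spam numerators, which is why only per-word statistics are needed.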

Revert

BoredAndriod's revision:

Bayesian filtering is the process of using Bayesian statistics to attempt to remove noise from a corrupted signal.

Bayesian statistics is a paradigm of statistics named for the Rev. Thomas Bayes. It treats probabilities as estimates of uncertainty and derives methods of statistical inference from decision theory. Bayesian statistics is a reaction to frequentist statistics, which interprets probabilities as the limiting frequency of events upon infinite repetition of some experiment.

The term Bayesian filtering has lately been used to refer to the naive Bayes algorithm which was invented in the 1960s, but recently made popular by the web posting A Plan for Spam by Paul Graham. Presumably Graham chose to refer to the naive Bayes classifier as "Bayesian Filtering" due to its use of Bayes' theorem. This is somewhat confusing, however, as neither Bayes' theorem nor the naive Bayes classifier is necessarily Bayesian. Both are direct applications of probability theory with no interpretation of what the probabilities mean.

I am reverting the article because a lot of material was taken out, unnecessarily it would seem, and without any explanation or discussion; in short, without a word. This is very bad form, but we can see which changes from this revision we want to add back in. --ShaunMacPherson 13:24, 5 Jun 2005 (UTC)

I began to address the above problem (that the content of this page should be completely different) at User:Samohyl_Jan/Bayesian_filtering. Anyone is invited to help. Samohyl Jan 08:36, 15 November 2005 (UTC)

Useful None the less

My brief reading of the article explained why most of my 'vi-agra' spam contains a 'story' of approximately 300 words. It was therefore useful. I suggest keeping it 'simple, stupid' but correcting any errors in references and math. The in-depth statistics and programming can be handled via links as necessary.

This article needs more technical content. It should be clear enough that one could implement a mostly working Bayesian spam filter based on the information in this article and no other sources.

External Links

I agree that the SpamBully pointer should go, but I put back the SpamBayes pointer. It's not a product, it's an open source project. johndburger 15:04, 4 May 2006 (UTC)

The spam reduction tools are in the article Stopping e-mail abuse. Besides, there already is an article about SpamBayes. The external link is not needed. --Sbluen 22:54, 4 May 2006 (UTC)

Okay, I buy that. johndburger 22:56, 4 May 2006 (UTC)

Advantages and Disadvantages

There is a lengthy section on advantages. Are there any disadvantages? Shouldn't there be a section on the disadvantages? The only one I can think of is that the process requires "training" the filter for it to work well, right? ~a (user • talk • contribs) 16:35, 16 August 2006 (UTC)