CRM114
From Wikipedia, the free encyclopedia
CRM114 is a program that classifies data statistically and is used especially for filtering email spam. Where other statistical Bayesian filters rely on the frequency of single words in email, CRM114 achieves a higher rate of spam recognition by generating hits from phrases of up to five words in length. These phrases are used to form a Markov model of the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available.
The author claims recognition rates as high as 99.87%. However, these results were not reproduced in independent tests by Holden (although Holden writes that there may have been "some sort of installation or usage error on my part", and recommends trying CRM114) or at TREC 2005. These tests are now outdated: CRM114 has since gained several new training methods (including Double Sided Thick Threshold Training with Testing Refutation) that yield considerable accuracy improvements, so independent retests are greatly needed.
CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant of k-nearest neighbor (KNN) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, and other more experimental classifiers.
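The phrase-based features described in this section can be illustrated with a minimal Python sketch. It is not CRM114's actual feature generator, whose details are not given here; it only assumes that every contiguous run of one to five words counts as a feature. The names `phrase_features` and `MAX_PHRASE_LEN` are hypothetical.

```python
import re
from collections import Counter

MAX_PHRASE_LEN = 5  # assumption: mirrors "phrases up to five words in length"


def phrase_features(text, max_len=MAX_PHRASE_LEN):
    """Count every contiguous phrase of 1..max_len words in text."""
    words = re.findall(r"\S+", text.lower())
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break  # phrase would run past the end of the text
            counts[" ".join(words[start:start + length])] += 1
    return counts


features = phrase_features("Buy cheap pills online now today")
```

A single-word filter would see six features in this message; the phrase version also records multi-word features such as "cheap pills online", which carry the context that single words lack.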
As pattern recognition software, CRM114 is a good example of machine learning accomplished with a reasonably simple algorithm. Source code in C is available through the external link.
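To give a sense of how simple such a learning rule can be, the following Python sketch implements the textbook form of the Winnow algorithm, one of the alternative classifiers named above. It is an illustration under stated assumptions, not CRM114's own implementation, which may differ in thresholds and update rules. The function names, the promotion factor `alpha=2.0`, and the toy data are all hypothetical.

```python
def winnow_train(examples, n_features, alpha=2.0, epochs=10):
    """Train textbook Winnow.

    examples: list of (set of active feature indices, label), label in {0, 1}.
    """
    weights = [1.0] * n_features
    threshold = float(n_features)
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            predicted = 1 if sum(weights[i] for i in active) > threshold else 0
            if predicted == label:
                continue  # Winnow only updates on mistakes
            mistakes += 1
            # Promote active weights on a missed positive, demote on a false alarm.
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break
    return weights, threshold


def winnow_predict(weights, threshold, active):
    """Return 1 if the summed weight of the active features exceeds the threshold."""
    return 1 if sum(weights[i] for i in active) > threshold else 0


# Toy data: the label is 1 exactly when feature 0 is present.
examples = [({0, 1}, 1), ({1, 2}, 0), ({0, 3}, 1), ({2, 3}, 0)]
weights, threshold = winnow_train(examples, n_features=4)
```

The update is multiplicative and touches only the features present in a misclassified example, which is what keeps the rule cheap when the feature space is large.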
At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text: even a simple recursive definition of the factorial takes almost ten lines and can look confusing to the uninitiated. Part of the reason is that the CRM114 language syntax is not positional but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on strings matching exactly in order to function correctly.
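The idea of approximate matching can be sketched in Python with a small dynamic-programming search for the best fuzzy occurrence of a pattern inside a text. This is neither TRE's API nor its algorithm, and it handles only literal patterns rather than regular expressions; it assumes plain Levenshtein edit distance as the error measure. The function names `min_edit_distance_in` and `approx_contains` are hypothetical.

```python
def min_edit_distance_in(pattern, text):
    """Smallest Levenshtein distance between pattern and any substring of text."""
    # prev[j]: best cost of matching the first j pattern characters
    # against some substring ending at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        curr = [0]  # a match may start at any text position at no cost
        for j, pch in enumerate(pattern, start=1):
            curr.append(min(
                prev[j - 1] + (pch != ch),  # match or substitution
                prev[j] + 1,                # extra character in the text
                curr[j - 1] + 1,            # pattern character missing from the text
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


def approx_contains(pattern, text, max_errors=1):
    """True if pattern occurs in text with at most max_errors edits."""
    return min_edit_distance_in(pattern, text) <= max_errors


found = approx_contains("viagra", "buy v1agra now", max_errors=1)
```

An exact substring search would miss the obfuscated spelling in this example, while a search that tolerates one edit still finds it.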
Trivia
The term CRM114 was first applied to the radio discriminator aboard a B-52 in Stanley Kubrick's Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb.
After Dr. Strangelove, the Kubrick rubric CRM114 appears in three of his subsequent movies. The registration/serial number of the spacecraft Discovery in 2001: A Space Odyssey is CRM 114, and in Eyes Wide Shut the mortuary is located on Level/Wing C, Room 114. In A Clockwork Orange, Kubrick uses the near-homophone "Serum 114", a drug injected into Alex to help his reformation.
Other films continue Kubrick's CRM114 tradition. An amplifier in Dr. Emmett Brown's laboratory in Back to the Future is labeled CRM-114, and the remake of Fun with Dick and Jane includes a financial transaction form numbered CRM-114.