Complement set email filtering
From Wikipedia, the free encyclopedia
Complement Set Filtering (CSF) is a method for filtering unsolicited bulk email (UBE or spam). The technique uses at least two email accounts: a primary account, which receives both spam and non-spam, and one or more secondary accounts, which receive only spam. CSF identifies the email messages contained in both the primary and secondary email sets (email accounts) and removes them from the primary set, leaving the set-theoretic difference (the relative complement of the secondary set in the primary set).
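In set terms, the idea can be sketched with Python's built-in set type. This is only an illustration: the message bodies and the variable names (primary, secondary) are made up, and exact string equality stands in for whatever comparison a real filter would use.

```python
# Hypothetical message bodies; a real filter would read these from the mailboxes.
primary = {
    "Quarterly report attached",
    "Cheap meds, click here",
    "Lunch on Friday?",
}
secondary = {
    "Cheap meds, click here",
    "You won a prize",
}

# Messages present in both accounts are treated as spam.
spam_in_primary = primary & secondary

# What remains in the primary account is the set difference
# (the relative complement of secondary in primary).
filtered_inbox = primary - secondary
```

Here spam_in_primary holds the one message that reached both accounts, and filtered_inbox holds the two messages that reached only the primary account.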
Implementation
CSF is implemented by comparing message content in a UBE account (a separate mailbox or alias) with message content in the primary account. By definition, messages in the UBE account are spam, so messages in the primary account that are substantially similar to messages in the UBE account are also treated as spam. When the same message is found in both the primary account and the UBE account, it is deleted from the primary account.
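The comparison step might be sketched as follows. The article does not say how "substantially similar" is measured, so this example assumes a character-level similarity ratio from Python's standard difflib module and an arbitrary threshold of 0.9; the function names normalize, is_similar and filter_primary are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(body: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", body).strip().lower()


def is_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Return True if two bodies are near-duplicates (assumed measure: difflib ratio)."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio() >= threshold


def filter_primary(primary: list[str], ube: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only primary messages that resemble no message in the UBE account."""
    return [
        msg for msg in primary
        if not any(is_similar(msg, spam, threshold) for spam in ube)
    ]


# Hypothetical mailbox contents.
primary_box = [
    "Hi John, the meeting moved to 3pm.",
    "Dear johnm, buy cheap watches now at example.test today!",
]
ube_box = [
    "Dear john, buy cheap watches now at example.test today!",
]
kept = filter_primary(primary_box, ube_box)
```

In this sketch the watch advertisement differs from its UBE copy by a single character, so it is dropped, while the meeting note is kept. The pairwise comparison is quadratic in the number of messages; a production filter would more likely hash or fingerprint message bodies.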
The UBE account is established by creating a mailbox (or alias) that combines a common first name (to help spammers guess the address) with the domain of the primary account, then exposing the UBE account to the internet. For example, if the primary mailbox is johnm@domain.com, the UBE account might be john@domain.com (see diagram below). After the UBE mailbox is set up, the email address is given to spammers by posting it to message boards, portal groups, WHOIS listings, e-commerce sites and Usenet.
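As a small illustration of the naming scheme, candidate UBE addresses could be derived from the primary address like this. The helper name decoy_addresses and the list of first names are assumptions made for the example, not part of CSF itself.

```python
def decoy_addresses(primary: str, first_names: list[str]) -> list[str]:
    """Build candidate UBE addresses on the primary account's domain."""
    local, _, domain = primary.rpartition("@")
    # Skip any name that would collide with the primary mailbox itself.
    return [f"{name}@{domain}" for name in first_names if name != local]


# Hypothetical list of common first names.
candidates = decoy_addresses("johnm@domain.com", ["john", "mary", "johnm"])
```

With the inputs shown, candidates contains john@domain.com and mary@domain.com.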
CSF works especially well in corporate environments, where the domain is targeted by spammers and UBE tends to be very similar from mailbox to mailbox. Also, because CSF does not depend on characteristics of past UBE to identify current UBE, it is particularly well suited to identifying UBE with new subject matter.
Advantages of CSF
Many spam-filtering techniques search for patterns and known spam subject matter in the headers and bodies of messages. Others use probabilities (Bayesian statistical methods, for example) to identify unwanted messages. CSF is effective as a standalone filter or can be combined with other techniques.
CSF has at least three advantages over Bayesian and pattern analysis algorithms. First, CSF does not depend on content analysis other than what is required to find similarities between messages in the primary and UBE accounts. Second, CSF does not use scoring (word ranking), which can be circumvented by message obfuscation (V!agra instead of Viagra, for example). Third, CSF takes advantage of the fact that most UBE contains identical message content, particularly messages targeted at specific corporate domains.
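The second point can be illustrated with a toy comparison. Both filters below are deliberately simplified assumptions: keyword_filter_flags stands in for a naive word-based filter with a hypothetical one-word blocklist, and csf_flags uses exact matching against the UBE account's contents.

```python
BLOCKLIST = {"viagra"}  # hypothetical keyword list for a naive word-based filter


def keyword_filter_flags(body: str) -> bool:
    """Flag a message if any word, stripped of edge punctuation, is blocklisted."""
    return any(word.strip(".,!?") in BLOCKLIST for word in body.lower().split())


def csf_flags(body: str, ube_bodies: set[str]) -> bool:
    """Flag a message if the same body also arrived in the UBE account."""
    return body in ube_bodies


obfuscated = "Buy V!agra today"
ube_account = {"Buy V!agra today"}  # the same spam run also hit the UBE address

missed_by_keywords = not keyword_filter_flags(obfuscated)
caught_by_csf = csf_flags(obfuscated, ube_account)
```

The obfuscated spelling slips past the keyword check, but the message is still flagged because an identical copy reached the UBE account. The sketch also shows the limit of the approach: CSF catches the message only if a matching copy actually arrives at the UBE address.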