User:Bjelleklang/Internet spam
From Wikipedia, the free encyclopedia
Internet spam is a term used to describe various forms of spam occurring on the internet. It takes several forms, the most common of which are related to message boards, blogs and e-mail.
History
Usenet
Spamming of Usenet newsgroups pre-dates most forms of electronic spam, with the first widely recognized Usenet spam posted on January 18, 1994 by Clarence L. Thomas IV, a sysadmin at Andrews University.[1][2] Entitled "Global Alert for All: Jesus is Coming Soon",[3] it was a fundamentalist religious tract claiming that "this world's history is coming to a climax." The newsgroup posting bot Serdar Argic also appeared in early 1994, posting tens of thousands of messages to various newsgroups, consisting of identical copies of a political screed relating to the Armenian Genocide.
The first commercial Usenet spam,[4][2] and the one which is often (mistakenly) claimed to be the first Usenet spam of any sort, was an advertisement for legal services entitled "Green Card Lottery - Final One?".[5] It was posted in April 1994 by Arizona lawyers Laurence Canter and Martha Siegel, and hawked legal representation for United States immigrants seeking papers ("green cards").
Usenet convention defines spamming as excessive multiple posting, that is, the repeated posting of a message (or substantially similar messages). During the early 1990s there was substantial controversy among Usenet system administrators (news admins) over the use of cancel messages to control spam. A cancel message is a directive to news servers to delete a posting, causing it to be inaccessible to those who might read it. Some regarded this as a bad precedent, leaning towards censorship, while others considered it a proper use of the available tools to control the growing spam problem.
A culture of neutrality towards content precluded defining spam on the basis of advertisement or commercial solicitations. The word "spam" was usually taken to mean excessive multiple posting (EMP), and other neologisms were coined for other abuses — such as "velveeta" (from the processed cheese product) for excessive cross-posting.[6] A subset of spam was deemed cancellable spam, for which it is considered justified to issue third-party cancel messages.[7]
In the late 1990s, spam became used as a means of vandalising newsgroups, with malicious users committing acts of sporgery to make targeted newsgroups all but unreadable without heavily filtering. A prominent example occurred in alt.religion.scientology. Another known example is the Meow Wars.
The prevalence of Usenet spam led to the development of the Breidbart Index as an objective measure of a message's "spamminess". The use of the BI and spam-detection software has led to Usenet being policed by anti-spam volunteers, who purge newsgroups of spam by sending cancels and filtering it out on the way into servers. This very active form of policing has meant that Usenet is a far less attractive target to spammers than it used to be, and most of the industrial-scale spammers have now moved into e-mail spam instead.
As the internet has evolved and grown beyond Usenet, this form of spam has similarly evolved and been adapted to new forms of communication. Online forums are now often targeted, along with blogs and other similar software where users can post replies or comments, often without having to register an account first. As a consequence, many web sites now require not only registration but also manual confirmation of an e-mail address. This is intended to prevent spambots from submitting spam, as these often supply fake e-mail addresses.
E-mail spam
From the beginning of the internet, the sending of junk e-mail has been prohibited,[8] enforced by ISPs' Terms of Service/Acceptable Use Policies (ToS/AUP) and by peer pressure. Even with a thousand users, junk e-mail advertising is not tenable, and with a million users it is not only impractical[9] but very expensive as well, costing US businesses on the order of $10 billion per year in 2003. As the internet has grown, ISPs have no longer been able to effectively police the sending of UBE and have turned to government for relief,[10] which has largely failed to materialize, particularly in the US, where a weak federal law, the CAN-SPAM Act of 2003, preempted tougher state laws. Some countries have passed laws against spam, especially in Europe and Australia.
As the recipient directly bears the cost of delivery, storage, and processing, one could regard spam as the electronic equivalent of "postage-due" junk mail. Given the low cost of sending unsolicited e-mail and the potential profit entailed, some argue that only enforcing an opt-in law will stop junk e-mail. "Today, much of the spam volume is sent by career criminals and malicious hackers who won't stop until they're all rounded up and put in jail."[11]
Spam is rarely sent by well-known companies, and when it is, it is quickly regretted; spam from such sources is sometimes called mainsleaze.[12] A widely known instance of spamming by a large corporation was Kraft Foods' marketing of its Gevalia coffee brand.[13]
Advance fee fraud spam such as the Nigerian "419" scam may be sent by a single individual from a cyber cafe in a developing country. Organized "spam gangs" operating from Russia or eastern Europe share many features in common with other forms of organized crime, including turf battles and revenge killings.[14] As much as 80% of spam received by internet users in North America and Europe can be traced to fewer than 600 spammers.[15]
Spammers may engage in deliberate fraud to send out their messages. Spammers often use false names, addresses, phone numbers, and other contact information to set up "disposable" accounts at various Internet service providers. They also often use falsified or stolen credit card numbers to pay for these accounts. This allows them to move quickly from one account to the next as the host ISPs discover and shut down each one.
Spam is also a medium for fraudsters to scam users to enter personal information on fake Web sites using e-mail forged to look like it is from a bank or other organization such as Paypal. This is known as phishing.
Spambots
Spambots are automated programs designed to register on forums, disseminate spam, and leave. They usually supply a fake name and an address at a free e-mail provider, and sometimes mask their true IP address. Spammers can set the message that the spambot will post. Most spambots target one specific forum software or hosting company. Spambots are easy to identify by the nature of the message they leave, or by the links in the signature. A typical post contains no topical content, but is accompanied by spam links either in the post itself or in the user's signature. Some spambots never post, relying instead on the links in their signature to increase their search engine visibility. Looking up a spambot's user name with a search engine will often reveal thousands of registrations in unrelated forums.
An example of a spambot that has gained some notoriety since November 2006 is XRumer. XRumer attempts to bypass anti-spamming mechanisms put in place by forum administrators, with some success. It uses a database of known HTTP proxies to mask the IP address of the poster, making it difficult for administrators to use a naive IP-banning mechanism.
Effects of spam
Spam prevention and deletion measurably increase the workload of forum administrators and moderators. The time and resources spent keeping a forum spam-free add significantly to the labour cost and the skill required to run a public forum. Marginally profitable or smaller forums may be permanently closed by their administrators, and forums that do not require registration are becoming rare.
Spam prevention
There are several automated ways of preventing spam, none of which is guaranteed to work. Flood control is one common method: it forces users to wait for a short interval between posts to a forum or blog, preventing spambots from flooding them with repeated spam messages. Another method is CAPTCHA (visual confirmation): some forums employ CAPTCHA routines on their registration pages to prevent spambots from carrying out automated registrations. Simple CAPTCHA systems that display plain alphanumeric characters have proven vulnerable to optical character recognition software, but those that scramble the characters appear to be far more effective. Other measures include:
- Posting limits: Limit posting to registered users and/or require that the user pass a CAPTCHA test before posting.
- Registration restrictions: Applying careful restrictions can significantly reduce bogus and spambot registrations. One approach is to deny registration from certain domain extensions that are a major source of spambots, such as .ru, .br or .biz, or from free e-mail providers such as "gawab.com". Another, more labor-intensive approach is manual examination of new registrants, looking at several indicators. First, spambots often delay e-mail confirmation by several hours, while humans tend to confirm promptly. Second, spambots tend to create user names that are unique and unlikely to already be in use in the forum, preferring "John84731" or "JohnbassKeepsie" to the much more common "John". Third, a search engine query on the spambot's login name often turns up hundreds, if not thousands, of profiles using it, sometimes with the diagnostic spam post or a "banned" label.
- Changing technical details of the forum software to confuse bots - for example, changing "agreed=true" to "mode=agreed" in the registration page of phpBB.
- Block posts or registrations that contain certain blacklisted words.
- Be wary of IPs used by untrusted posters (anonymous posts or newly registered users). A useful technique for proactively detecting well-known spammer proxies is to query a search engine for the IP: it will show up on pages that specialize in listing proxies.
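The flood-control method described above can be sketched in a few lines. This is a minimal illustration, not production code: the function name, the 30-second interval, and the in-memory dictionary are all assumptions; a real forum would persist this state and key it by IP as well as user name.

```python
import time

# Minimum number of seconds a user must wait between posts (hypothetical value).
FLOOD_INTERVAL = 30

# Maps user name (or IP address) to the timestamp of their last accepted post.
last_post_time = {}

def allow_post(user, now=None):
    """Return True if the user has waited long enough since their last post."""
    now = time.time() if now is None else now
    last = last_post_time.get(user)
    if last is not None and now - last < FLOOD_INTERVAL:
        return False  # too soon: likely a flooding spambot
    last_post_time[user] = now
    return True
```

A spambot submitting many posts in rapid succession would have everything after its first post rejected, while a human poster is barely inconvenienced.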
See also
External links
- How to Block phpbb Spam. Plus email/IP address Blacklists (The phpbb spambot honeypot project)
- A list of open proxy and bot IPs. Ban IPs on this list to prevent comment spam.
- An on-line widget you can use to count Google hits on an IP. This allows you to automate your banning responses to a certain extent.
- On-line database of known forum spammers. It can be used to update ban lists.
Possible solutions
Blocking by keyword
This is the simplest form of blocking and yields very good results: because comment spam is aimed at search-engine bots, it must remain readable by simple software, and so tends to follow predictable patterns. A lot of spam can be blocked by banning names of popular pharmaceuticals and casino games.
The main problem with this approach is that spammers constantly find new ways to spell or hawk their goods, so the blacklist requires constant updating. For example, blocking "viagra" cuts down spam considerably, but spammers respond by spamming "vi@gra", "v1agr@" or "vigra". There is also a vast number of goods that spammers try to sell, making such a list difficult to keep up to date.
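A keyword blocker that copes with simple character substitutions might look like the following sketch. The blacklist, the substitution table, and the function name are illustrative assumptions; note that genuinely new misspellings (such as "vigra") still slip through, which is exactly the maintenance problem described above.

```python
import re

# Hypothetical blacklist; real lists need constant updating.
BLACKLIST = {"viagra", "casino"}

# Map common obfuscation characters back to letters before matching.
LEET = str.maketrans({"@": "a", "1": "i", "0": "o", "$": "s", "3": "e"})

def is_spam(text):
    """Return True if any normalized word matches a blacklisted term."""
    normalized = text.lower().translate(LEET)
    # Strip punctuation inserted to break up words, e.g. "v-i-a-g-r-a".
    normalized = re.sub(r"[^a-z\s]", "", normalized)
    return any(word in BLACKLIST for word in normalized.split())
```

With this normalization step, "v1agr@" and "vi@gra" both reduce to "viagra" and are caught by a single blacklist entry.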
rel="nofollow"
In early 2005, Google announced that hyperlinks with the rel="nofollow" attribute[16] would not influence the link target's ranking in the search engine's index. The Yahoo and MSN search engines also respect this tag.[17]
nofollow is a misnomer in this case, since it actually tells a search engine "Don't score this link" rather than "Don't follow this link." This differs from the meaning of nofollow as used within a robots meta tag, which does tell a search engine: "Do not follow any of the hyperlinks in the body of this document."
Using rel="nofollow" is a much easier solution that makes the improvised techniques above largely irrelevant. Most weblog software now marks reader-submitted links this way by default (often with no option to disable it without code modification). More sophisticated server software could omit the nofollow for links submitted by trusted users, such as those registered for a long time, on a whitelist, or with high karma. Some server software adds rel="nofollow" to pages that have been recently edited but omits it from stable pages, on the theory that stable pages will have had offending links removed by human editors.
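The way weblog software marks reader-submitted links can be sketched as below. This uses a regex for brevity and is purely illustrative; real software should use a proper HTML parser, and the function name is an assumption.

```python
import re

def add_nofollow(html):
    """Insert rel="nofollow" into every <a> tag that lacks a rel attribute."""
    def fix(match):
        tag = match.group(0)
        if "rel=" in tag:
            return tag  # leave existing rel attributes alone
        return tag[:-1] + ' rel="nofollow">'
    return re.sub(r"<a\b[^>]*>", fix, html)
```

Software implementing the trusted-user refinement would simply skip this rewriting step for comments posted by whitelisted or high-karma accounts.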
Some weblog authors object to the use of rel="nofollow", arguing, for example,[18] that:
- Link spammers will continue to spam everyone to reach the sites that do not use rel="nofollow".
- Link spammers will continue to place links for clicking (by surfers), even if those links are ignored by search engines.
- Google is advocating the use of rel="nofollow" in order to reduce the effect of heavy inter-blog linking on page ranking.
- Google is advocating the use of rel="nofollow" only to minimize its own filtering efforts, and to deflect the objection that the attribute would better have been called rel="nopagerank".
Jeremy Zawodny has stated on his blog[19] that:
Worse, nofollow has another, more pernicious effect, which is that it reduces the value of legitimate comments.
Other websites with high user participation, like Slashdot, use improvised nofollow implementations, adding rel="nofollow" only for potentially misbehaving users. Potential spammers posing as users can be identified through heuristics such as the age of the registered account, among other factors. Slashdot also uses the poster's karma as a determinant in attaching a nofollow tag to user-submitted links.
rel="nofollow" has come to be regarded as a microformat.
Validation (reverse Turing test)
One method of blocking automated spam comments is to require validation before the contents of the reply form are published. The goal is to verify that the form is being submitted by a real human being and not by a spam tool; this has therefore been described as a reverse Turing test. The test should be of such a nature that a human being can easily pass it, whereas an automated tool would most likely fail.
Many forms on websites take advantage of the CAPTCHA technique, displaying a combination of numbers and letters embedded in an image, which must be entered literally into the reply form to pass the test. In order to keep out spam tools with built-in text recognition, the characters in the images are customarily misaligned, distorted and noisy. A drawback of many older CAPTCHAs is that the expected responses are often case-sensitive, while the corresponding images often don't allow a distinction between capital and small letters. This should be taken into account when devising CAPTCHA challenges.
A simple alternative to CAPTCHAs is validation in the form of a password question, providing a hint to human visitors that the password is the answer to a simple question like "The Earth revolves around the... [Sun]".
One drawback to consider is that any validation required in the form of an additional form field may become a nuisance, especially to regular posters. Bloggers and guestbook owners may notice a significant decrease in the number of comments once such validation is in place.
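The password-question variant is simple enough to sketch directly. The question set and function name are hypothetical; a real site would rotate through many questions and pick one at random when rendering the form.

```python
# Hypothetical question/answer pairs; a real site would rotate many of these.
QUESTIONS = {
    "The Earth revolves around the...": "sun",
    "Two plus three equals...": "five",
}

def check_answer(question, answer):
    """Accept the submission only if the visitor answered the question shown."""
    expected = QUESTIONS.get(question)
    # Be lenient about case and whitespace so humans aren't rejected needlessly.
    return expected is not None and answer.strip().lower() == expected
```

Because the check is case- and whitespace-insensitive, it avoids the case-sensitivity pitfall noted for older CAPTCHAs above.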
Disallowing links in posts
There is negligible gain from spam that does not contain links, so currently virtually all spam posts contain an (often excessive) number of links. It is therefore safe to require passing a Turing test only if a post contains links, and to let all other posts through. While this is highly effective, spammers do frequently send gibberish posts (such as "ajliabisadf ljibia aeriqoj") to test the spam filter. These gibberish posts will not be labeled as spam; they do the spammer no good, but they still clog up comments sections.
Garbage submissions may also result from primitive ("level 0") spambots, which don't parse the attacked HTML form's fields first but instead send generic POST requests at pages. A "content" or "forum_post" POST variable may then be set and received by the blog or forum software, while the field actually carrying the spam URL uses a name the software does not accept, so no spam link is saved.
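The links-only policy boils down to a single check at submission time. This is a rough sketch: the URL pattern is a deliberately simple heuristic, and the function name is an assumption.

```python
import re

# Matches http/https URLs and bare "www." links (a rough heuristic).
LINK_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)

def needs_captcha(post_text):
    """Only posts containing links need to pass the reverse Turing test."""
    return bool(LINK_PATTERN.search(post_text))
```

Note that a gibberish probe post with no links sails through, exactly as the section describes, while anything carrying a URL triggers the validation step.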
Redirects
Instead of displaying a direct hyperlink submitted by a visitor, a web application could display a link to a script on its own website that redirects to the correct URL. This will not prevent all spam, since spammers do not always check for link redirection, but it effectively prevents them from increasing their PageRank, just as rel="nofollow" does. An added benefit is that the redirection script can count how many people visit external URLs, although it will increase the load on the site.
Redirects should be server-side to avoid accessibility issues related to client-side redirects. On Apache, this can be done via the .htaccess file.
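A server-side redirection script can be sketched as a minimal WSGI application; the endpoint shape (a `url` query parameter) and the validation rule are illustrative assumptions, not a standard interface.

```python
from urllib.parse import parse_qs

def redirect_app(environ, start_response):
    """Minimal WSGI app: a request like /goto?url=... answers with HTTP 302."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    target = params.get("url", [""])[0]
    # Refuse anything that is not a plain web URL (e.g. javascript: links).
    if not target.startswith(("http://", "https://")):
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"invalid target"]
    # A click counter could be incremented here before redirecting.
    start_response("302 Found", [("Location", target)])
    return [b""]
```

Because the browser follows the 302 server-side response directly, no client-side script is involved, which is what makes this approach accessible.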
Another way of preventing PageRank leakage is to make use of public redirection services such as TinyURL or My-Own.Net. For example,
<a href="http://my-own.net/alias_of_target" rel="nofollow" >Link</a>
where 'alias_of_target' is the alias of the target address.
Services such as POW7.com offer a public redirection without the need to configure an alias. An example of a link to http://wikipedia.org/ on POW7 would be:
<a href="http://pow7.com/pr/http://wikipedia.org/">http://wikipedia.org/</a>
Again, the issue with this method is that while it removes the benefit the spammer is seeking, the users of this method are still left with a very high volume of spam that they must clean up or leave behind.
Distributed approaches
Distributed approaches are a very new way of addressing link spam. One of the shortcomings of link-spam filters is that most sites receive only one link from each domain running a spam campaign. If the spammer varies IP addresses, there is little to no distinguishable pattern left on the vandalized site. The pattern, however, is left across the thousands of sites that were hit quickly with the same links.
A distributed approach, like the free LinkSleeve,[20] uses XML-RPC to communicate between the various server applications (such as blogs, guestbooks, forums, and wikis) and the filter server, in this case LinkSleeve. The posted data is stripped of URLs, and each URL is checked against recently submitted URLs across the web. If a threshold is exceeded, a "reject" response is returned and the comment, message, or posting is discarded; otherwise, an "accept" message is sent.
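The client side of this round trip can be sketched as follows. The filter callback is a stand-in for the actual remote call (for example via Python's `xmlrpc.client` ServerProxy), since the real service's method names and endpoint are not specified here; everything in the sketch is an assumption.

```python
import re

def check_post(text, filter_call):
    """Extract URLs from a post and ask a remote filter whether to accept it.

    filter_call is any callable taking a list of URLs and returning
    "accept" or "reject" -- in practice an XML-RPC method on the filter server.
    """
    urls = re.findall(r"https?://\S+", text)
    if not urls:
        return "accept"  # nothing to check against the shared database
    return filter_call(urls)
```

The value of the scheme comes from aggregation: a single blog sees one link, but the filter server sees the same URL arriving from thousands of sites within minutes and can push its score over the rejection threshold.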
A more robust distributed approach is Akismet, which is similar to LinkSleeve but uses API keys to assign trust to nodes and has wider distribution as a result of being bundled with the 2.0 release of WordPress.[21] Its operators claim over 140,000 blogs contributing to the system. Akismet libraries have been implemented for Java, Python, Ruby, and PHP, but its adoption may be hindered by the requirement of an API key and its restrictions on commercial use. No such restrictions are in place for LinkSleeve.
Project Honey Pot has also begun tracking comment spammers. The Project uses its vast network of thousands of traps installed in over one hundred countries around the world in order to watch what comment spamming web robots are posting to blogs and forums. Data is then published on the top countries for comment spamming, as well as the top keywords and URLs being promoted. The Project's data is then made available to block known comment spammers through http:BL. Various plugins have been developed to take advantage of the http:BL API.
Application-specific anti-spam methods
Particularly popular software products such as Movable Type and MediaWiki have developed their own custom anti-spam measures, as spammers focus more attention on targeting those platforms. Whitelists and blacklists that prevent certain IPs from posting, or that prevent people from posting content that matches certain filters, are common defenses. More advanced access control lists require various forms of validation before users can contribute anything like linkspam.
The goal in every case is to allow good users to continue to add links to their comments, as that is considered by some to be a valuable aspect of any comments section.
RSS feed monitoring
Some wikis allow you to access an RSS feed of recent changes or comments. Adding that feed to a news reader with a saved search for common spam terms (usually "viagra" and other drug names) lets you quickly identify and remove the offending spam.
Response tokens
Another filter available to webmasters is to add a hidden session token or hash to their comment form. When a comment is submitted, data stored with the posting, such as the IP address and the time of posting, can be compared to the data stored in the session token or hash generated when the user loaded the comment form. Postings that use different IP addresses for loading and submitting the form, or that took unusually short or long periods of time to compose, can be filtered out. This method is particularly effective against spammers who spoof their IP address in an attempt to conceal their identities.
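One way to implement such a token is an HMAC over the visitor's IP and the time the form was served, using Python's standard hmac module. The secret, thresholds, and function names below are illustrative assumptions.

```python
import hashlib
import hmac
import time

SECRET = b"server-side secret key"  # hypothetical; never exposed to the client

def make_token(ip, now=None):
    """Token embedded as a hidden form field when the comment form is served."""
    issued = int(time.time() if now is None else now)
    sig = hmac.new(SECRET, f"{ip}:{issued}".encode(), hashlib.sha256).hexdigest()
    return f"{issued}:{sig}"

def check_token(token, ip, now=None, min_age=5, max_age=3600):
    """Reject posts from a different IP, or composed implausibly fast or slow."""
    now = int(time.time() if now is None else now)
    try:
        issued_str, sig = token.split(":", 1)
        issued = int(issued_str)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(SECRET, f"{ip}:{issued}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # different IP than the one the form was served to, or forged
    return min_age <= now - issued <= max_age
```

Because the signature binds the IP and timestamp together, a spambot cannot forge a plausible token for a spoofed address without the server's secret.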
Ajax
Some blog software, such as Typo, allows the blog administrator to accept only comments submitted via Ajax XMLHttpRequests and to discard regular form POST requests. This causes accessibility problems typical of Ajax-only applications. Although this technique has prevented spam so far, it is a form of security through obscurity and will probably be defeated if it becomes popular enough.
Switching off comments
Some bloggers have chosen to turn off comments because of the volume of spam.
Buying blog comments
A website has appeared where spammers can purchase blog comments from legitimate writers: people write the comments and use the spammer's username for the anchor text and the URL of the spam site. The main site is Buy Blog Comments, but others have been popping up elsewhere.
See also
References
- ^ Templeton, Brad. Origin of the term "spam" to mean net abuse. Retrieved on 2006-07-11.
- ^ a b 20 Year Archive on Google Groups. Google (2003). Retrieved on 2006-07-11.
- ^ http://groups.google.com/groups?selm=9401191510.AA18576%40jse.stat.ncsu.edu
- ^ History of Spam. Mailmsg.com. Retrieved on 2006-07-11.
- ^ http://groups.google.com/groups?selm=2odj9q%2425q%40herald.indirect.com
- ^ http://www.catb.org/~esr/jargon/html/V/velveeta.html
- ^ http://www.faqs.org/faqs/usenet/spam-faq/
- ^ Gary Thuerk, who sent the first e-mail ad, in 1978, to 600 people, was reprimanded and told not to do it again. Opening Pandora's In-Box
- ^ Why is spam bad?
- ^ Spam's Cost To Business Escalates
- ^ CAUCE accessed July 13, 2007
- ^ mainsleaze. Jargon File. Retrieved on 2007-06-04.
- ^ Trouble Brewing for Kraft and Gevalia Over Coffee Spam. PR Newswire (2005-04-18). Retrieved on 2007-06-02.
- ^ Brett Forrest. "The Sleazy Life and Nasty Death of Russia’s Spam King", Issue 14.08, Wired Magazine, August 2006. Retrieved on 2007-01-05.
- ^ Register of Known Spam Operations (ROKSO)
- ^ http://www.w3.org/TR/REC-html40/struct/links.html#adef-rel
- ^ http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html
- ^ http://www.ioerror.us/2005/05/23/nofollow-revisited/
- ^ http://jeremy.zawodny.com/blog/archives/006800.html
- ^ http://www.linksleeve.org
- ^ http://wordpress.org/development/2005/12/wp2/
External links
- Anti-spam Features of MediaWiki
- Six Apart Comment Spam Guide, fairly broad overview from Movable Type's authors.
- Gilad Mishne, David Carmel and Ronny Lempel: Blocking Blog Spam with Language Model Disagreement, PDF. From the First International Workshop on Adversarial Information Retrieval (AIRWeb'05) Chiba, Japan, 2005.