Scraper site

From Wikipedia, the free encyclopedia

A scraper site is a website that copies all of its content from other websites using web scraping. In essence, no part of a scraper site is original. A search engine is not an example of a scraper site: sites such as Yahoo and Google gather content from other websites and index it so that users can search the index for keywords. Search engines then display snippets of the original site content in response to a search.
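The distinction can be illustrated with a minimal sketch of keyword indexing. It is not how any real search engine is implemented: the page texts, the placeholder URLs, and the function names (tokenize, build_index, search) are all assumptions made for illustration. The point it shows is that the index maps keywords back to the original source, and only a short snippet is displayed.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for crawled sites; the URLs are placeholders.
PAGES = {
    "http://example.com/a": "Web scraping copies content from other websites.",
    "http://example.org/b": "A search engine indexes content so users can search it.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, pages, keyword, width=30):
    """Return (url, snippet) pairs for pages containing the keyword.

    The snippet is a short window of the original text around the first
    match; the result always points back to the source URL.
    """
    key = keyword.lower()
    results = []
    for url in sorted(index.get(key, ())):
        text = pages[url]
        pos = text.lower().find(key)
        start = max(0, pos - width)
        end = min(len(text), pos + len(key) + width)
        results.append((url, text[start:end]))
    return results

INDEX = build_index(PAGES)

for url, snippet in search(INDEX, PAGES, "content"):
    print(url, "->", snippet)
```

A scraper site, by contrast, would republish the full text as if it were its own page rather than returning a pointer to the source.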

In the last few years, largely owing to the advent of the Google AdSense web advertising program, scraper sites have proliferated rapidly as a means of spamming search engines. Open content sites, including Wikipedia, are a common source of material for scraper sites.

Made for AdSense

Some scraper sites are created to earn money through advertising programs such as Google AdSense. In such cases, they are called Made for AdSense (MFA) sites. The term is also used derogatorily for websites that have no redeeming value and exist solely to draw visitors who will click on advertisements.

The problem with Made for AdSense sites is that they are considered to be spamming search engines and diluting the search results, leaving users with less than satisfactory results. The scraped content is considered redundant to what the search engine would have shown under normal circumstances, had no MFA website appeared in the listings.

Various search engines are eliminating these types of websites from their listings; such sites sometimes show up as supplemental results instead of being displayed in the initial search results.

However, Google offers a domain parking service tailored for this kind of site.[1] These supposedly parked domains often run Google AdWords campaigns to attract more visitors, in the hope that those visitors will click on AdSense ads and generate a greater return than the original cost of the AdWords click. For many operators this has been a successful business plan, and one that Google has failed to combat, likely because it accounts for a good amount of Google's revenue.
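The arithmetic behind this kind of arbitrage can be sketched as follows. All figures are hypothetical and chosen only to make the calculation easy to follow; they are not taken from any source, and the function names are assumptions. Amounts are in cents.

```python
def arbitrage_profit(visitors, cost_per_visitor, ad_click_rate, revenue_per_click):
    """Profit (in cents) from buying visitors and earning on the ads they click.

    visitors          -- number of paid clicks bought to bring visitors in
    cost_per_visitor  -- price paid per incoming click, in cents
    ad_click_rate     -- fraction of visitors who click an ad on the site
    revenue_per_click -- amount earned per ad click, in cents
    """
    spend = visitors * cost_per_visitor
    income = visitors * ad_click_rate * revenue_per_click
    return income - spend

def break_even_click_rate(cost_per_visitor, revenue_per_click):
    """Fraction of visitors who must click an ad for income to equal spend."""
    return cost_per_visitor / revenue_per_click

# Hypothetical figures: 1000 visitors bought at 5 cents each,
# a quarter of them click an ad that pays 30 cents.
profit = arbitrage_profit(1000, 5, 0.25, 30)
print(f"profit: {profit / 100:.2f} dollars")
print(f"break-even click rate: {break_even_click_rate(5, 30):.1%}")
```

Under these assumed numbers the plan is profitable only while the share of visitors who click an ad stays above the break-even rate, which is why such sites are designed so that the advertisements are the most prominent content on the page.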

Legality

Because scraper sites take content from other sites without the permission of the original creators, they frequently violate copyright law. It is illegal to republish copyrighted material without the copyright holder's permission. This applies regardless of whether the material was originally published on a blog, a mailing list, or any other less-formal medium, just as much as if it were commercially published.

Even taking content from an open content site can be a copyright violation if it is done in a way that does not respect the license. For instance, the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike (CC BY-SA) license require that a republisher inform readers of the license conditions and give credit to the original author. Most scraper sites that copy GFDL- or CC BY-SA-licensed content do not do this, and are therefore infringing copyright.

Techniques

Many scrapers pull snippets and text from websites that rank highly for the keywords they have targeted. In this way they hope to rank highly themselves in the search engine results pages (SERPs). RSS feeds are vulnerable to scrapers because they publish content in a structured, machine-readable form.
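The exposure of RSS feeds follows from their format: each entry is already separated into title, link, and body text, so no page layout has to be interpreted. The following minimal sketch, using only the Python standard library, reads a small inline feed. The feed contents, the example.com URLs, and the function name read_items are placeholders invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 feed; every title, link, and text is a placeholder.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>Placeholder feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Full text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Full text of the second post.</description>
    </item>
  </channel>
</rss>"""

def read_items(feed_xml):
    """Return a list of dicts with the title, link, and description of each item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

for entry in read_items(FEED):
    print(entry["title"], "|", entry["link"])
```

Because a feed that carries the full text of each entry hands over the complete article in this form, the same properties that make feeds convenient for legitimate readers make them easy to copy.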

Some scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Often a visitor will click on an advertisement because it is the only comprehensible text on the page. Operators of these scraper sites gain financially from these clicks. Ad networks such as Google AdSense claim to be constantly working to remove these sites from their programs, although this is actively disputed, since the networks benefit directly from the clicks generated at these kinds of sites. From the advertisers' point of view, the networks do not seem to be making enough effort to stop the problem.

Scraper sites tend to be associated with link farms and are sometimes perceived as the same thing.
