Web scraping

From Wikipedia, the free encyclopedia

Web scraping is the use of a web crawler to copy content from one or more existing websites, for example in order to generate a scraper site. The result can range from fair-use snippets of text to plagiarised content pages created solely to earn revenue through advertising. The typical scraper website is monetised using Google AdSense, hence the term "Made for AdSense" (MFA) website.

Web scraping is also used increasingly by retail shopping sites to gather information about competing products across the Internet[citation needed].

Web scraping differs from screen scraping in that a website is not really a visual screen but live HTML/JavaScript-based content with a graphical interface in front of it. A web scraper therefore does not work at the visual interface, as a screen scraper does, but on the underlying object structure (the Document Object Model) of the HTML and JavaScript.
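Working on the underlying object structure rather than the rendered screen can be sketched as follows. This is a minimal illustration using Python's standard-library HTML parser; the page content is a made-up fragment, not any real site.

```python
from html.parser import HTMLParser

class HeadingScraper(HTMLParser):
    """Collect the text of every <h2> heading by walking the tag structure,
    never the rendered pixels."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

# Hypothetical page fragment for illustration.
page = "<html><body><h2>Legal issues</h2><p>Some text.</p><h2>See also</h2></body></html>"
scraper = HeadingScraper()
scraper.feed(page)
# scraper.headings now holds ["Legal issues", "See also"]
```

In practice, scrapers often build a full DOM tree (e.g. with a dedicated HTML library) rather than streaming events, but the principle is the same: the program addresses elements by their structural position, not their on-screen appearance.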

Web scraping also differs from screen scraping in that screen scraping typically reads the same dynamic screen "page" many times, whereas web scraping visits each of many different static web pages only once. Recursive web scraping, which follows links to other pages across many websites, is called "web harvesting". Web harvesting is performed by software called a bot, "webbot", "crawler", "harvester" or "spider", with similar arachnological analogies used for other aspects of their behaviour.

Rightly or wrongly, web harvesters are typically demonised as existing for malicious purposes, while "webbots" are typecast as benevolent. In Australia, the Spam Act 2003 outlaws some forms of web harvesting (see Spam Act (Overview for Businesses), page 3, and Spam Act (Guide for Businesses), page 10).


Legal issues

Scraping is against the Terms of Use of many commercial websites, and can lead to legal liability for those involved in authoring, distributing, and even using software which does so. The Digital Millennium Copyright Act in the USA and the European Union Copyright Directive specifically address "circumvention of copyright protection schemes", which affects any scraping of copyrighted material, whether for commercial gain or not, particularly when that material is then redistributed.

Commercial sites often aggressively protect their intellectual property, and many have little tolerance for scraping. As legal force tends to be exerted out of the public eye, and also outside of any official lawsuit, it is not always apparent how vigorously commercial websites will act to protect their intellectual property. Those considering screen scraping a commercial site should study its Terms of Use, and also consider the consequences should the site become aware that the scraping is occurring.

"Scrapers," also known as "Robots," have suffered some recent defeats in U.S. courts. Specifically, users of scrapers have been found liable for committing the tort of trespass to chattels. Under these circumstances, the computer system itself and its capacity are considered personal property upon which the user of scrapers is trespassing. To succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system; and that defendant's unauthorized use caused damage to the plaintiff. Although a single user's scraper may not cause "damage" to the plaintiff's computer system, the courts have held single users liable for this cause of action under the reasoning that the aggregate use of many users' scrapers would quite likely significantly impair the computer system. This is so because scrapers might perform thousands of web server accesses per minute, thus consuming large portions of a computer system's capacity.

Technical measures to stop bots

A web master can use various measures to stop or slow a bot. Some techniques include:

  • Blocking an IP address. This also blocks all browsing from that address.
  • Adding entries to robots.txt. Well-behaved bots, such as Googlebot, adhere to these rules and can be stopped this way.
  • Blocking bots that declare their identity. Well-behaved bots do so (for example, 'googlebot'); unfortunately, malicious bots may declare themselves to be a normal browser.
  • Monitoring for excess traffic and blocking the offending addresses.
  • Using tools such as CAPTCHA to verify that a real person is accessing the site.
  • Using carefully crafted JavaScript, which many bots do not execute.
  • Locating bots with a honeypot or other means of identifying the IP addresses of automated crawlers.
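From the bot's side, honouring robots.txt is what distinguishes a well-behaved crawler. Python's standard library includes a parser for the format; the sketch below feeds it a hypothetical robots.txt directly rather than fetching one from a live site.

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved bot checks before fetching each URL.
print(rp.can_fetch("googlebot", "http://example.com/private/data"))  # False
print(rp.can_fetch("googlebot", "http://example.com/public/page"))   # True
```

Note that robots.txt is purely advisory: it only stops crawlers that choose to consult it, which is why the measures above combine it with blocking, traffic monitoring, and CAPTCHAs.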


Notes and references

See also