Web scraping

From Wikipedia, the free encyclopedia

Web scraping is the use of a web crawler to copy content from one or more existing websites, often in order to generate a scraper site. The result can range from fair-use snippets of text to plagiarised content pages created solely to earn revenue through advertising. The typical scraper website is monetized using Google AdSense, hence the term "Made For AdSense" or MFA website.

Web scraping is also increasingly used by retail shopping sites to gather more information about a target product from around the Internet[citation needed].

Web scraping is different from screen scraping[citation needed] in the sense that a website is not really a screen, but a live HTML/JavaScript-based application with a graphical interface in front of it. Therefore, web scraping does not involve working at the visual interface as screen scraping does, but rather working on the underlying object structure (the Document Object Model) of the HTML and JavaScript.
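
To illustrate working on the document structure rather than on rendered pixels, here is a minimal sketch using Python's standard-library html.parser. The sample markup, the class names "name" and "price", and the helper extract_by_class are all hypothetical and invented for this example; a real page would have its own structure, and a full DOM library would normally be more robust than this event-based parser.

```python
from html.parser import HTMLParser

# Hypothetical page markup, invented purely for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="product">
    <span class="name">Example Widget</span>
    <span class="price">19.99</span>
  </div>
</body></html>
"""

# Elements that never have a closing tag; ignored when tracking nesting.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class attribute."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth > 0:
            self._depth += 1          # nested element inside a match
        elif self.wanted_class in classes:
            self._depth = 1           # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.results[-1] += data


def extract_by_class(html, wanted_class):
    """Return the stripped text of each element whose class list contains wanted_class."""
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


names = extract_by_class(SAMPLE_HTML, "name")
prices = extract_by_class(SAMPLE_HTML, "price")
```

The extractor never looks at how the page is drawn; it selects data by element and attribute, which is the distinction from screen scraping drawn above.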

Web scraping also differs from screen scraping in that screen scraping typically reads the same dynamic screen "page" many times, whereas web scraping typically reads each web page only once, across many different static web pages. Recursive web scraping, which follows links to other pages across many websites, is called "web harvesting". Web harvesting is necessarily performed by a software robot or bot, often called a "webbot", "crawler", "harvester" or "spider", with similar arachnological analogies used to refer to other creepy-crawly aspects of their functions.
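
The link-following behaviour described above can be sketched as follows. This is an illustrative outline only: it uses an explicit breadth-first queue rather than literal recursion, and the harvest function, its fetch and max_pages parameters, and the example.test URLs are assumptions made for the example. An in-memory dictionary stands in for the network so that the sketch runs without fetching anything.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def harvest(start_url, fetch, max_pages=100):
    """Visit pages reachable from start_url, each at most once.

    fetch is any callable that maps a URL to its HTML text, or None if the
    page is unavailable (a hypothetical interface chosen for this sketch).
    """
    visited = []               # pages successfully read, in visit order
    seen = {start_url}         # every URL ever queued, to avoid repeats
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue           # skip dead links
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical three-page site standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

pages = harvest("http://example.test/", FAKE_SITE.get)
```

The seen set is what makes each page be read only once even when several pages link to it, and max_pages bounds the crawl. A harvester run against real sites would also need to respect rate limits and site policies, which this sketch omits.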

Rightly or wrongly, web harvesters are typically demonised as existing for malicious purposes, while "webbots" are typecast as having benevolent purposes. In Australia, the Spam Act 2003 outlaws some forms of web harvesting.[citation needed]

"Scrapers", also known as "robots", have suffered some recent defeats in U.S. courts. Specifically, users of scrapers have been found liable for committing the tort of trespass to chattels. Under these circumstances, the computer system itself and its capacity are considered personal property upon which the user of a scraper is trespassing. To succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system, and that the defendant's unauthorized use caused damage to the plaintiff. Although a single user's scraper may not cause "damage" to the plaintiff's computer system, the courts have held single users liable for this cause of action under the reasoning that the aggregate use of many users' scrapers would quite likely significantly impair the computer system. The reasoning is that scrapers perform thousands of instructions per minute, thus consuming large portions of a computer system's capacity.