Talk:Screen scraping

Implementations

Pappa, I noticed that you wrote about some Perl modules. Are these modules for screen scraping? If so, we could include the better ones in the article. JesseHogan 01:44, 9 Dec 2004 (UTC)

Web scraping

I dispute the notion that screen scraping is relegated to just reading HTML. I've done work on the BlackBerry (J2ME) that required screenscraping solutions... DoomBringer 02:21, 28 May 2005 (UTC)

Doesn't the article's first paragraph make that clear? JesseHogan 07:04, 30 May 2005 (UTC)
It just seemed to me that the article (wrongly) gives the impression that screen scraping is relegated to just HTML parsing. DoomBringer 01:15, 1 Jun 2005 (UTC)
That's just because the editors of most of the article were probably most familiar with the HTML aspect. And it is an important aspect of modern screen scraping. The article isn't wrong, it just needs more information concerning non-HTML screen scraping techniques. If you can write about these then please do. JesseHogan 19:53, 1 Jun 2005 (UTC)
I just did some major editing, intended to expand on the earlier history of screen scraping with terminals (WRT DoomBringer's comments), and to expand on what screen scraping means in general (WRT Nick Douglas's comment). Web scraping examples are readily available in the links in the reference section. Examples of "classic" screen scraping are harder; they would have to be historical anecdotes, I think. Comments here, improvements on the article, are, as always, welcome. --DragonHawk 23:33, 20 November 2005 (UTC)

Examples

As a layman, I'm still confused. Are there examples that could be linked? -- Nick Douglas 05:24, 18 September 2005 (UTC)

Links to external examples mostly removed (see #External links). In order to make up for the removal, I've written some generic example descriptions. I hope said examples help clarify what scraping is about, without turning this page into a link farm. Comments, clarifications, etc., are welcome. --DragonHawk 13:42, 6 January 2006 (UTC)

I too see Screen Scraping and Web Scraping as separate topics. I've just completed two applications that use Screen Scraping techniques to interface with a legacy system that was not web-based. —The preceding unsigned comment was added by 64.90.21.3 (talk • contribs) 15:55, 28 December 2006.

External links

The article was starting to collect link spam. There were several links to implementations which didn't really add information about scraping (they were just another implementation), or were outright commercial products. In order to avoid POV problems regarding external links, I have removed all external links to implementations which do not also include substantial information on how scraping works in general and how the implementation works in particular. I have also placed HTML comments in the article about this. Others can, of course, add what they want, but I've requested that people here explain their reasoning on why a link should be included. --DragonHawk 13:42, 6 January 2006 (UTC)

Thanks for your contributions, Dragon. I was wondering though, is it necessarily bad to have external links to commercial sites? In some circumstances I could see how a user could benefit from a list of commercial (or free) products related to this article (or any other). I don't think these links are too much of a nuisance for non-interested people. JesseHogan 23:16, 6 January 2006 (UTC)
It'd be nice to have a way to find screen scrapers for those who want them; perhaps link to a page which lists screen scraping products (preferably one which lists whether the product is free)? --AySz88^-^ 02:28, 7 January 2006 (UTC)
While I agree that a simple list of web scrapers might well be useful, it is explicitly outside the mission of Wikipedia. Wikipedia is not a web directory. See also External link guide. If someone needs a web scraper, or wants one, Google, Yahoo, and other searchers and directories already exist and can do a better job than Wikipedia. Now, if a website has lots of content explaining the "how and why" behind an implementation, that's useful content *about* scraping. But if it's just a link to Yet Another Web Scraper, what does that add to the article? --DragonHawk 00:51, 24 January 2006 (UTC)

There was an external link that had been added to a purported example with code. But the example showed no code, just a harvested page. If you want to re-instate that link, make sure it shows the scraping code, as the link suggested. Also, log in to show your name and provide a way for this feedback to be given. peterl 11:03, 27 February 2007 (UTC)

Box-A-Web

On 20:43, 21 January 2006, an anonymous user added a link to Box-A-Web. No edit summary was given, but the contributor included the link description "Not an article on how to do it in Ruby, but rather a technology demonstrator for drag and drop web scraping using Ruby on Rails Framework".

Investigation: I visited the website in question to check it out. Adverts down the left side. Account required to use. Free registration (no fees). Guest accounts published. Tutorial explains how to use it, but little about how it works -- does make an analogy of XML and RSS to HTML and this tool. Text on tutorial page "as the service is free (currently !)" implies it may or will become commercial in the future.

Conclusion: Reverted. Contains no information on web scraping. Adds nothing to the substance of the article. Anonymous contribution makes discussion with contributor impossible.

--DragonHawk 01:00, 24 January 2006 (UTC)

Hi. Original contributor here. I am also the author of the site. Acknowledged, there is not a lot of information about how it works, but my original point about including the link to the page is to show that it is possible to do web scraping visually using web technology, compared to other methods, which require either a rich client or a command line script with lots of configuration parameters. Not sure if there is an easier way to demonstrate the point. There is no plan to ever charge for the service, hence the ads on the left hand side. Drop me a mail using the webmaster address and I will reply on a private channel if you need any additional information. —The preceding unsigned comment was added by 205.228.74.11 (talk • contribs).
It's good to know that you were working in good faith. However, please understand that Wikipedia is not a web directory, and that adding links to one's own site is strictly against the external link guide. This is important, because popular topics like screen scraping will otherwise eventually consist mainly of huge lists of links. Feel free to add your site to open directories such as DMoz, where it is perfectly appropriate. Thanks! --DragonHawk 02:22, 20 June 2006 (UTC)

Merge web scraping into screen scraping

I just discovered that Web scraping has its own article, separate from Screen scraping. I propose merging the content from the Web scraping article into the Web scraping section of the Screen scraping article. The term "web scraping" is derived from "screen scraping", and the two are closely related in operation, so it makes sense to me. --DragonHawk 13:57, 27 June 2006 (UTC)


Web Scraping and Screen Scraping are not the same thing

Please see my clarification in the Web scraping content. —The preceding unsigned comment was added by Stefanandr (talk • contribs) 16:56, 30 June 2006.

Bunyip responds... This subject area is actually bigger than "Ben Hur". In a nutshell...

"Screen Scraping" is a form of "Harvesting", a term which is not defined in Wikipedia in the computer sense.

We need to start with a description of "Harvesting" and/or "Web Harvesting":

"Web Harvesting" is any software technique in which a software "robot" ("webbot", "crawler", etc.) "trawls" (i.e. recursively downloads a page and all the pages linked from it, to a nominated depth) any number of possibly targeted web sites for a variety of reasons, whether legitimate or not. "Web Harvesting" can be done to index web pages for search engines, to hunt for email addresses, phone/account numbers or passwords, to collect metadata, or to build an HTTP-based archive (e.g. http://www.archive.org).
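To make the "trawl to a nominated depth" idea above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `PAGES` dictionary is a made-up in-memory stand-in for real web pages (a real robot would fetch over HTTP), and `harvest` and `LinkCollector` are hypothetical names, not taken from any existing tool. It walks the links breadth-first so that each page is first reached at its shallowest depth.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real harvester would download these.
PAGES = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/d">D</a>',
    "/c": '<a href="/a">back to A</a>',
    "/d": "no links here",
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(start, max_depth, fetch=PAGES.get):
    """Return the URLs visited from `start`, following links up to `max_depth` hops."""
    seen = {start}
    visited = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown page: skip it
            continue
        visited.append(url)
        if depth == max_depth:  # nominated depth reached: do not follow links
            continue
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Calling `harvest("/a", 1)` visits the start page and the pages it links to directly; raising `max_depth` widens the trawl. The `seen` set is what keeps the loop from `/c` back to `/a` from running forever.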

We can then describe Screen Scraping somewhat thusly:

When a human downloads a web page, it is called "browsing". When a computer program records an electronic copy of the textual data on a computer screen, it is called "screen scraping". A "screen scrape" is an electronic copy of the text that a human would have seen on the screen at the time, usually retaining top-bottom, left-right sequence, but it is not an image of the screen. "Screen scraping" includes only expressly textual information, and excludes text appearing in image data. The computer program that performs the "screen scrape" is called a "robot". "Screen Scraping" can be used on web sites to collect the HTML text of the web page. "Screen Scraping" is still very common in high security mainframe-internet interfaces as a robust and impenetrable (albeit crude) way of sending data from a secure server directly to public and insecure clients. Because the data from the server is static and mostly one way, this prevents opportunities for injected code, buffer overflow conditions, or hacking attempts from rogue clients. "Screen Scraping" typically occurs multiple times on the same communication interface.
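As a rough illustration of the description above (text only, read top to bottom and left to right), here is a small Python sketch of scraping a terminal-style screen. Everything in it is an assumption made for the example: the `SCREEN` contents, the field positions, and the function names `scrape_text` and `scrape_field` are invented, and a real robot would obtain the screen buffer from a terminal emulator rather than from a hard-coded list.

```python
# Hypothetical 4-row screen buffer, one fixed-width string per screen row.
SCREEN = [
    "ACCOUNT INQUIRY          PAGE 1 ",
    "NAME:    J SMITH                ",
    "BALANCE: 1042.50                ",
    "                                ",
]


def scrape_text(screen):
    """Return the visible text in top-bottom, left-right order, without padding."""
    lines = [row.rstrip() for row in screen]
    while lines and not lines[-1]:  # drop blank rows at the bottom of the screen
        lines.pop()
    return "\n".join(lines)


def scrape_field(screen, row, col, width):
    """Read one field from a fixed (row, column) position on the screen."""
    return screen[row][col:col + width].strip()


name = scrape_field(SCREEN, row=1, col=9, width=10)
balance = float(scrape_field(SCREEN, row=2, col=9, width=10))
```

Reading fields by fixed position is what makes this approach crude: if the host application moves a field by one column, the scrape silently returns the wrong text.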

We can now describe the association between the two as follows:

"Web Scraping" differs from "Screen Scraping" in that the former occurs only once per web page over many different web pages. Recursively "web scraping" by following links to other pages over many web sites is "web harvesting". "web harvesting" is necessarily performed by "robots", often called "webbots", "crawlers", "harvesters" or "spiders" with similar arachnological analogies used to refer to other creepy-crawly aspects of their functions. Rightly or wrongly "web harvesters" are typically demonised as being for malicious purposes, while "webbots" are typecast as having benevolent purposes. In Australia, The Spam Act 2003 outlaws some forms of "web harvesting".

--Abunyip 15:59, 5 September 2006 (UTC)

Blocking information wrong

The page currently says: "These include the blocking of individual and ranges of IP addresses, which stops the majority of "cookie cutter" screen scraping applications." Added section to web scraping on stopping bots. peterl 04:41, 12 February 2007 (UTC)
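For readers unsure what "blocking of individual and ranges of IP addresses" amounts to in practice, here is a minimal Python sketch using the standard `ipaddress` module. The addresses are documentation-only example ranges, and `BLOCKED` and `is_blocked` are hypothetical names chosen for illustration; real sites usually do this in a firewall or web server configuration rather than in application code.

```python
import ipaddress

# Hypothetical block list: one single address (/32) and one whole range (/24).
BLOCKED = [
    ipaddress.ip_network("203.0.113.7/32"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(addr, blocked=BLOCKED):
    """Return True if the client address falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in blocked)
```

A scraper that moves to an address outside the listed ranges passes this check, which is one reason address blocking alone tends to stop only the simpler scraping applications.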

Legal Issues

This page should point out that screen scraping is against the Terms of Use of many -- perhaps most -- commercial websites, which leads to legal liability for the scraper. Indeed, the Digital Millennium Copyright Act in the USA and European Union Copyright Directive specifically address "Circumvention of Copyright Protection Schemes", which would impact anyone scraping commercial sites -- whether for commercial gain or not -- especially when the scraped data is then redistributed.

Commercial sites will aggressively protect their intellectual property, and often have little tolerance for screen scraping, especially where it impacts their commerce. As most legal force is exerted out of the public eye (and also outside of any official lawsuit) it may not be readily apparent just how vigorously commercial websites can act to protect their IP. Those considering screen scraping a commercial site should study its Terms of Use, and also consider the consequences should the site become aware that the scraping is occurring.

I propose language similar to the above, adapted for entry use. Comments? Dracogen 16:55, 21 March 2007 (UTC)

I'm too busy right now to comment more, but check out web scraping if you haven't already. —DragonHawk (talk) 17:03, 21 March 2007 (UTC)