Talk:Web scraping

From Wikipedia, the free encyclopedia

monetized

I can only guess at what this word means. It's not in Wiktionary. Nick Levine 11:28, 3 March 2006 (UTC)

Look it up on dictionary.com. There's an entry. Phoenixrod 17:38, 3 April 2006 (UTC)

web scraping only for generation of new web pages?

Is it not web scraping if the data extracted from the web page does not end up in another web page? What if it's just stored in a file or database? —Fleminra 00:07, 19 April 2006 (UTC)

Yes, that definition strikes me as strange. It's certainly not how I use the phrase. I would have thought the definition was something like "Using custom-built software to access a site via HTTP, and parse the retrieved data in order to extract embedded information – not for rendering it like a normal user agent." Part of the definition would be, I think, that the software knows something specific about the data, and doesn't work against any random site. But I have no references, just a general feeling. JöG 21:57, 5 February 2007 (UTC)

I also note that the Screen scraping article doesn't restrict the term "web scraping" like this article does. JöG 22:02, 5 February 2007 (UTC)
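
To make the two points in this thread concrete, here is a minimal sketch (not taken from the article or from any cited source) of a scraper that fits the definition proposed above: it knows something specific about one page's markup, and the extracted data ends up in a database rather than in another web page. The sample markup, the class names `name` and `price`, the `PriceParser` class, and the `prices` table are all hypothetical, and a literal string stands in for the HTTP fetch so the example runs offline.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page markup. A real scraper would retrieve this over HTTP
# (for example with urllib.request.urlopen) instead of using a literal.
SAMPLE_HTML = """
<html><body>
  <ul id="prices">
    <li class="item"><span class="name">Widget</span> <span class="price">4.50</span></li>
    <li class="item"><span class="name">Gadget</span> <span class="price">12.00</span></li>
  </ul>
</body></html>
"""


class PriceParser(HTMLParser):
    """Site-specific parser: it only understands the markup assumed above."""

    def __init__(self):
        super().__init__()
        self.rows = []       # extracted (name, price) tuples
        self._field = None   # "name" or "price" while inside such a span
        self._current = {}   # fields collected for the current <li>

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Accumulate text only while inside a span we care about.
        if self._field:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            # Keep the row only if both expected fields were found.
            if {"name", "price"} <= self._current.keys():
                name = self._current["name"].strip()
                price = float(self._current["price"].strip())
                self.rows.append((name, price))
            self._current = {}


def scrape_to_db(html, conn):
    """Parse the page and store the extracted rows in a SQLite table."""
    parser = PriceParser()
    parser.feed(html)
    conn.execute("CREATE TABLE IF NOT EXISTS prices (name TEXT, price REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?)", parser.rows)
    conn.commit()
    return parser.rows


conn = sqlite3.connect(":memory:")
rows = scrape_to_db(SAMPLE_HTML, conn)
```

Pointed at a page with different markup, this parser would extract nothing, which illustrates the "doesn't work against any random site" part of the proposed definition.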

merge to screen scraping

See Talk:Screen_scraping#Merge_web_scraping_into_screen_scraping for discussion

Self-reference

ILike2BeAnonymous seems to want to add the comment "Ironically, one of the most heavily 'scraped' sites is this one, Wikipedia.". I've reverted twice now; I won't revert again without something further happening; I hope ILike2BeAnonymous will come here for discussion. Per WP:SELF, "To ease reusability, never allow the text of an article to assume that the reader is viewing it at Wikipedia, and try to avoid even assuming that the reader is viewing the article at a website." So the statement being added clearly goes against that style guide. It's also not "ironic" that Wikipedia gets web scraped; it's by design. This might be considered original research -- I suspect this is ILike2BeAnonymous's personal opinion. I also don't see what it adds to the article from an encyclopedic standpoint. In short, I don't think it belongs here. ILike2BeAnonymous: What's your rationale for adding it? Others: What do you think? --DragonHawk 22:42, 8 August 2006 (UTC)

Well, all I can say is that it is both ironic and by design, as you say (and, if you don't know it, a hot issue that's only going to get hotter as the practice increases, with moral implications that rebound on Wikipedia). So far as the "reusability" issue goes (basically a technical one), that could easily be remedied by making the statement refer explicitly to Wikipedia. +ILike2BeAnonymous 23:09, 8 August 2006 (UTC)
Wikipedia has always been free content. The intent, right from the inception, was that the content would be available to and contributed by the community at large. See Wikipedia#History and History of Wikipedia. The GFDL explicitly permits commercial redistribution. So anyone who "web scrapes" Wikipedia is doing exactly what the GFDL exists to enable: Reusing free content. Given that irony is defined as "[meaning] the opposite of" or "being both coincidental and contradictory", I don't see how one can call that ironic. Can you cite a source on the irony claim? I do agree that suitable rewording may fix the self-reference issue, but that still leaves the question of "Why should this be in the article at all?". Regarding your statement that this is "a hot issue"; are you referring to something about Wikipedia in particular, or the practice of web scraping of sites which are not free content? --DragonHawk 00:10, 9 August 2006 (UTC)

Legal Issues Expanded

The Legal Issues section of the entry seems to implicitly assume that screen scraping is generally harmless, and done primarily for personal consumption.

Perhaps this entry should point out that screen scraping is against the Terms of Use of many -- perhaps most -- commercial websites, which leads to legal liability for the scraper. Indeed, the Digital Millennium Copyright Act in the USA and European Union Copyright Directive specifically address "Circumvention of Copyright Protection Schemes", which would impact anyone scraping commercial sites -- whether for commercial gain or not -- especially when the scraped data is then redistributed.

Commercial sites will aggressively protect their intellectual property, and often have little tolerance for screen scraping, especially where it impacts their commerce. As most legal force is exerted out of the public eye (and also outside of any official lawsuit) it may not be readily apparent just how vigorously commercial websites can act to protect their IP. Those considering screen scraping a commercial site should study its Terms of Use, and also consider the consequences should the site become aware that the scraping is occurring.

It may make sense to update the Legal Issues with some of these sentiments. Thoughts?

Dracogen 16:55, 21 March 2007 (UTC)

No objections in over a week, so I went ahead and added two paragraphs along these lines to the Legal Issues section.

Dracogen 16:12, 30 March 2007 (UTC)