Talk:Web crawler

From Wikipedia, the free encyclopedia

[edit] More details needed for personal project

Well, this covers the basics of a web crawler. I have to design a web crawler that will work in a client/server architecture. I have to make it using Java. Actually, I am confused about how I will implement the client/server architecture. What I have in mind is that I will create a lightweight component using Swing for client interaction and an EJB that will get the instructions from the client to start crawling. The server will then have another GUI that will monitor the web crawler and administer it.

Does anyone have a simpler or alternative way of doing this?

It is actually not that difficult to build a web crawler; off-the-shelf components are available in languages such as Java, Python and Perl. If you need to build one in Python (I am talking about a simple crawler) you can use the urllib library, and in Perl, LWP. For more information, search for these terms on the web. You may also want to look at libcurl or curl, which provide a very good starting point for C/C++-based crawlers. A lot of academic websites also provide crawlers, but make sure you obtain the documentation for these too.--IMpbt 20:25, 20 May 2005 (UTC)
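As an illustration of the urllib approach mentioned above, here is a minimal sketch of the download-and-extract-links step (Python 3 module names assumed; LWP plays the same role in Perl):

    import urllib.request
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collect the href attribute of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch_links(url):
        # Download one page and return the URLs it links to.
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        parser = LinkParser()
        parser.feed(html)
        return parser.links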
For downloading a single Web site or a small bunch of Web sites, you can use almost any tool (for instance, just "WGET"); however, for downloading millions of pages from thousands of Web sites the problem gets much more complicated. Run some test crawls with 2-3 candidates and see which crawler suits your needs best. --ChaTo 08:50, 18 July 2005 (UTC)
The biggest problem I hit writing one of these was keeping track of URLs. I'd keep a list of URLs to be processed, and a list of URLs that had been processed. Thus, before I added one to the list of URLs to be processed, I'd check that it was in neither list. My spider is written in Python, and on my first attempt I simply used a list and listname.__contains__. This got slow. Eventually I wrote a binary search tree for this. This was very fast, but at around several hundred thousand URLs processed (and quite a few more to be processed), it went through all the RAM (the machine I dedicated can only hold 192 MB). The solution I finally settled upon was a hash table. It only goes through a few MB of RAM, yet can process tens of thousands of operations per second on my slow machine. I guess in summary, if you hit a roadblock with your URL list, hash tables work well. This has to be the biggest thing I've tackled involving my spider. Also, because most of your table is stored on disk, you can store as much additional info about each URL as you want without a big hit.
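To illustrate the hash-table idea above, a minimal sketch in Python: the built-in set is hash-based and lives in RAM, while the standard dbm module gives a disk-backed hash table along the lines described here (the "queued" value is just a placeholder for whatever per-URL info you want to keep):

    import dbm
    from collections import deque

    seen = set()          # in-RAM hash table of URLs already seen
    frontier = deque()    # URLs still to be processed

    def enqueue(url):
        # Add a URL only if it is in neither list.
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    def enqueue_on_disk(db, url):
        # Disk-backed variant: 'db' is a dbm handle, so most of the
        # table stays on disk and RAM usage stays small.
        if url not in db:
            db[url] = "queued"    # extra info about the URL can go here

    # usage: with dbm.open("seen_urls", "c") as db: enqueue_on_disk(db, "http://example.com/")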
Use a database with a primary key for URLs; RAM-based objects aren't really optimized for this in the way an RDBMS is, as you've learned.
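A sketch of that suggestion using Python's built-in sqlite3 (the table and column names here are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("crawl.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS urls (
                        url TEXT PRIMARY KEY,    -- uniqueness enforced by the index
                        fetched INTEGER DEFAULT 0
                    )""")

    def enqueue(url):
        # INSERT OR IGNORE leaves the row alone if the URL is already known.
        conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        conn.commit()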
As for client/server, I just used xmlrpclib. There is probably something similar in Java.
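A rough sketch of that xmlrpclib idea, using the Python 3 module names (xmlrpc.server / xmlrpc.client); the start_crawl method is a made-up example of a control call the client GUI might issue:

    # Server side: the crawler process exposes a small control interface.
    from xmlrpc.server import SimpleXMLRPCServer

    def start_crawl(seed_url):
        # ... hand the seed URL to the crawler's frontier ...
        return "crawl started at %s" % seed_url

    server = SimpleXMLRPCServer(("localhost", 8000))
    server.register_function(start_crawl)
    # server.serve_forever()

    # Client side: the GUI (Swing, in the Java case) calls the server remotely.
    import xmlrpc.client
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    # print(proxy.start_crawl("http://example.com/"))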

As far as I'm concerned, how does a Web crawler automatically collect as many URLs as possible?

[edit] Transwikied Content

The following came from b:Web crawler. The sole contributor was 129.186.93.50


Web Crawler: a program that downloads pages from the Internet by following links.

Examples: Googlebot, Yahoo's crawler, ...

In general, all the search engines have a web crawler that collects the pages from the web for them. This is done by starting with a page, then downloading the pages that it points to, then downloading the pages that those pages point to, and so on and so forth. The names of the already downloaded pages are kept in a database in order to avoid redownloading them.

The reach (the pages from the web that are downloaded) of this whole technology is dependent upon the initial pages where the downloading starts. Basically, the downloaded pages are all the pages reachable from those initial pages (unless additional constraints are specified). The current eight billion pages that Google crawls are estimated to be only 30% of the web for this reason.
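In outline, the process described above is just a loop over a frontier of URLs. A minimal Python sketch (fetch_links stands in for the download-and-parse step, and an in-memory set stands in for the database of already downloaded pages):

    from collections import deque

    def crawl(seed_urls, fetch_links, max_pages=1000):
        # Breadth-first crawl: 'fetch_links' is any function that downloads
        # a page and returns the URLs it points to.
        seen = set(seed_urls)          # pages already known, to avoid redownloading
        frontier = deque(seed_urls)    # pages still to download
        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)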

[edit] Article History

21:55, 18 May 2005 Popski (VfD)
20:38, 15 May 2005 129.186.93.50
20:29, 15 May 2005 129.186.93.50

[edit] Anti-merge

I disapprove of merging this article, as not all web crawlers are search bots; for example, maintenance bots and spam bots! The Neokid 09:55, 28 January 2006 (UTC)

I think crawlers include: search engine bots (for indexing), maintenance bots (for checking links, validating pages) and spam bots (for harvesting e-mail addresses) ChaTo 10:29, 28 January 2006 (UTC)

[edit] Merge with spidering

The new article on Spidering should definitely be moved into this article. Fmccown 18:47, 8 May 2006 (UTC)

Absolutely not! Spidering and Web crawling are exactly opposite terms.

   Spidering = the network of web pages and their inter-connections to each other.
   Web crawling = the art of finding specific information from that web or the Internet.

I guess this is the most comprehensive that I can say! Any comment/suggestion is welcome.

  Raza Kashif (l1f05mscs1025@ucp.edu.pk)

203.161.72.40 11:57, 21 May 2007 (UTC)

[edit] Verifiability

"Some spiders have been known to cause viruses." No citation, examples, or explanation for how this is possible. I'm removing this sentance, as I don't believe it is true. Requesting a document by URL can't give the server a virus! ( Of course, if somebody knows something I don't, please restore the sentance, and cite your sources! )

--Sorry, no source for this, but I have heard of several cases where a spider has literally flooded a server with requests, resulting in the server going down temporarily. It's not a virus in any way, but it is certainly possible to overwhelm a server with thousands of requests. A simple way to prevent this would be a cap on the number of requests a crawler sends to a given server per minute.
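A sketch of such a cap in Python (the limit of 60 requests per minute is an arbitrary example, and 'host' is whatever key identifies the target server):

    import time
    from collections import defaultdict

    MAX_REQUESTS_PER_MINUTE = 60           # arbitrary example limit
    recent_requests = defaultdict(list)    # host -> timestamps of recent requests

    def allowed(host):
        # Return True if another request to 'host' stays under the cap.
        now = time.time()
        recent = [t for t in recent_requests[host] if now - t < 60]
        recent_requests[host] = recent
        if len(recent) < MAX_REQUESTS_PER_MINUTE:
            recent.append(now)
            return True
        return False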

[edit] Question

How come WebBase is considered an open-source crawler while its source is unknown?!

[edit] Vandalism

I've never come across a vandalized page before and was not quite sure what to do about it. I removed some of the material on the vandalized page, but did not revert the content. If someone with more experience could assist, I would be grateful. AarrowOM 16:16, 20 February 2007 (UTC) AarrowOM 11:15 EST, 20 February 2007

[edit] Seemingly contradictory section

I added {{confusing}} to the section Crawling policies because it essentially seems to say both that the nature of the Web makes crawling very easy, and that the nature of the Web makes crawling difficult. Can someone rewrite it in a way that clarifies things, or is the problem with how I am reading it? ~ Lenoxus " * " 13:06, 24 March 2007 (UTC)

I couldn't find where it says that crawling is easy. Here is the (apparent) contradiction: building a simple crawler should in theory be very straightforward, because it's basically download, parse, download, parse, ... So, if you want to download a web site or a small set of, say, a few thousand pages, it's very easy to write a program to do so. But in practice it's very hard, because of the issues listed in the article. If you want, you can add an explanation like this to the article, but ... where? - ChaTo 15:26, 12 April 2007 (UTC)
Oh, wow, I'd totally forgotten about this. Well, it seems perfectly good now. :) ~ Lenoxus " * " 15:01, 4 May 2008 (UTC)

[edit] Bad PDF links

There are a lot of bad links to the PDFs at the bottom of the page. I am not an experienced wiki editor, but they should have their hyperlinks removed or fixed or something. 70.142.217.250 13:33, 15 July 2007 (UTC)

[edit] "Web" as a proper noun

As a matter of style, I believe that "Web" should be capitalized when used as a proper noun -- for example as in "World Wide Web" (meaning the singular largest connected graph of HTML documents available by HTTP), or "the Web" (short for the above) -- but not when used in a compound noun such as "web crawler", "web page", "web server", where it acts like an adjective meaning something more like "HTML/HTTP". -- 86.138.3.231 11:58, 13 October 2007 (UTC)

[edit] SEOENGBot

SEOENGBot, originally created for the purpose of providing a focused crawler for the SEOENG engine on a per-website basis (2004-2007), was later retrofitted as a general-purpose, highly distributed crawler which is responsible for crawling millions of webpages, while archiving both webpages and links. The archived data is injected into the SEOENG engine for its own commercial use. SEOENGBot, as well as SEOENG, remains a highly guarded system and its source and location are not currently published. Seoeng (talk) 04:45, 10 May 2008 (UTC)