Nutch

From Wikipedia, the free encyclopedia
Apache Nutch

Developer(s) Apache Software Foundation
Stable release 1.5.1 and 2.1 / October 5, 2012 (2012-10-05)
Development status Active
Written in Java
Operating system Cross-platform
Type Search Engine
License Apache License 2.0
Website nutch.apache.org

Nutch is an effort to build an open-source web search engine, using Lucene (a Java search library) for its search and indexing component.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
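The media-type plug-in idea described above can be sketched as a registry that dispatches fetched content to a parser based on its MIME type. This is an illustrative sketch only; the names `Parser` and `ParserRegistry` here are hypothetical and do not reflect Nutch's actual plug-in API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of plug-in dispatch by media type.
// The interface and class names are illustrative, not Nutch's real API.
public class ParserRegistry {
    // A minimal parser plug-in contract: turn raw fetched bytes into text.
    interface Parser {
        String parse(byte[] content);
    }

    private final Map<String, Parser> parsersByMimeType = new HashMap<>();

    // Plug-ins register themselves for the media types they handle.
    public void register(String mimeType, Parser parser) {
        parsersByMimeType.put(mimeType, parser);
    }

    // The crawler looks up a parser by the fetched document's MIME type.
    public String parse(String mimeType, byte[] content) {
        Parser parser = parsersByMimeType.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser plug-in for " + mimeType);
        }
        return parser.parse(content);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // Register a trivial plain-text "plug-in" as a lambda.
        registry.register("text/plain", content -> new String(content));
        System.out.println(registry.parse("text/plain", "hello".getBytes()));
    }
}
```

The point of the indirection is that new media types (PDF, HTML, images) can be supported by registering new parsers without touching the crawler core.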

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June, 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.

In January, 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April, 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[1]

Advantages

Advantages of Nutch over a simple fetcher include:[2]

  • A highly scalable and relatively feature-rich crawler
  • Politeness features that obey robots.txt rules
  • Robustness and scalability: Nutch can run on a cluster of up to 100 machines
  • Quality: crawling can be biased to fetch "important" pages first
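The politeness item above amounts to honoring a site's Robots Exclusion rules before fetching. A simplified sketch, assuming only `Disallow:` prefix rules (real crawlers, Nutch included, also handle user-agent groups, `Allow:` rules, and wildcards):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of robots.txt politeness: collect Disallow rules
// from a robots.txt body and check whether a path may be fetched.
public class RobotsCheck {
    private final List<String> disallowed = new ArrayList<>();

    // Parse only "Disallow:" lines; everything else is ignored in this sketch.
    public RobotsCheck(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    disallowed.add(path);
                }
            }
        }
    }

    // A path is fetchable unless it falls under a disallowed prefix.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsCheck robots = new RobotsCheck("User-agent: *\nDisallow: /private/\n");
        System.out.println(robots.isAllowed("/index.html"));     // true
        System.out.println(robots.isAllowed("/private/a.html")); // false
    }
}
```

In a polite crawler this check runs before every fetch, typically alongside per-host rate limiting.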

Scalability

IBM Research studied the performance[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[4] They found that a scale-out system such as Nutch/Lucene could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer, such as a POWER5-based system.

The ClueWeb09 dataset (used in, for example, TREC) was gathered using Nutch, at an average speed of 755.31 documents per second.[5]

Related projects

  • Hadoop - a Java framework that supports distributed applications running on large clusters

Search engines built with Nutch

See also

  • Faceted Search
  • Information Extraction
  • Enterprise Search

References

Bibliography

External links

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.