Nutch

Apache Nutch

Developer(s)	Apache Software Foundation
Stable release	1.9 and 2.2.1 / August 16, 2014 (2014-08-16)
Development status	Active
Written in	Java
Operating system	Cross-platform
Type	Search Engine
License	Apache License 2.0
Website	nutch.apache.org

Nutch is an effort to build an open source web search engine. It is written in Java and uses Lucene for its search and index component.

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
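The plug-in idea for media-type parsing can be illustrated with a minimal sketch. The names `ContentParser`, `ParserRegistry` and `PluginDemo` are hypothetical and chosen only for this example; they are not Nutch's actual extension-point API, which is not described here.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical extension point: one implementation per supported media type.
interface ContentParser {
    String parse(byte[] content);
}

// Hypothetical registry that maps a media type to the plug-in handling it.
class ParserRegistry {
    private final Map<String, ContentParser> parsers = new HashMap<>();

    // Media types are compared case-insensitively, so keys are normalized.
    void register(String mimeType, ContentParser parser) {
        parsers.put(mimeType.toLowerCase(Locale.ROOT), parser);
    }

    // Returns an empty Optional when no plug-in claims the media type.
    Optional<ContentParser> lookup(String mimeType) {
        return Optional.ofNullable(parsers.get(mimeType.toLowerCase(Locale.ROOT)));
    }
}

class PluginDemo {
    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // A trivial plain-text "plug-in": decode the bytes as UTF-8.
        registry.register("text/plain", c -> new String(c, StandardCharsets.UTF_8));

        byte[] page = "hello".getBytes(StandardCharsets.UTF_8);
        String text = registry.lookup("text/plain")
                .map(p -> p.parse(page))
                .orElse("(no parser registered)");
        System.out.println(text);
    }
}
```

In this sketch the core code depends only on the interface, so support for a new media type can be added by registering another implementation without changing the caller.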

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.^[1]

Releases

Apache Nutch Versions
Version	Release Date	Description
1.1	2010-06-06	This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes and speedups (e.g., to Fetcher2) are also included.
1.2	2010-10-24	This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including timing information for all Tool classes and parser timeouts), and bug fixes (fixing an NPE in distributed search, fixing XML formatting issues in Document fields).
1.3	2011-06-07	This release includes several improvements: improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball (only about 2 MB).
1.4	2011-11-26	This release includes several improvements, including allowing parsers to declare support for multiple MIME types, configurable Fetcher queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.
1.5	2012-06-07	This release includes several improvements, including upgrades of major components such as Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering and parsing, to name a few.
2.0	2012-07-07	This release offers users an edition focused on large-scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in-memory data store and various high-profile SQL stores.
1.5.1	2012-07-10	This release is a maintenance release of the popular 1.5.x mainstream version of Nutch, which has been widely adopted within the community.
2.1	2012-10-05	This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive, which is growing in popularity amongst the community. As well as addressing about 20 bugs, this release offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the option to build indexes in Elasticsearch.
1.6	2012-12-06	This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME type, and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.
2.2	2013-06-08	This release includes over 30 bug fixes and over 25 improvements, and is the third release of the increasingly popular Nutch 2.x series. It features the inclusion of Crawler-Commons, which Nutch now uses for improved robots.txt parsing, and library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8.
1.7	2013-06-24	This release includes over 20 bug fixes and as many improvements, most notably a new pluggable indexing architecture which currently supports Apache Solr and Elasticsearch. Following the recent Nutch 2.2 release, parsing of robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.
2.2.1	2013-07-02	This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3. It is predominantly a bug fix release for NUTCH-1591 (incorrect conversion of ByteBuffer to String).
1.8	2014-03-17	This release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, and provides over 30 bug fixes as well as 18 improvements.

Advantages

Advantages of Nutch over a simple fetcher include:^[2]

highly scalable and relatively feature-rich crawler
politeness features, such as obeying robots.txt rules
robust and scalable - Nutch can run on a cluster of up to 100 machines
quality - crawling can be biased to fetch "important" pages first
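Two of the points above, obeying robots.txt and fetching "important" pages first, can be sketched together in a few lines of Java. This is an illustrative sketch under simplifying assumptions, not Nutch's implementation: `RobotsRules`, `ScoredUrl`, `Frontier` and `FrontierDemo` are hypothetical names, the parser honours only `Disallow:` prefixes and ignores `User-agent` groups, `Allow` lines and wildcards, and the importance score is assumed to be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper holding the Disallow path prefixes from a robots.txt file.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    // Simplified: reads only "Disallow:" lines, regardless of User-agent group.
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = trimmed.substring(9).trim();
                if (!path.isEmpty()) {
                    rules.disallowed.add(path);
                }
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with a disallowed prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical frontier entry: a path plus a caller-supplied importance score.
class ScoredUrl {
    final String path;
    final double score;

    ScoredUrl(String path, double score) {
        this.path = path;
        this.score = score;
    }
}

// Hypothetical crawl frontier: highest score is fetched first.
class Frontier {
    private final RobotsRules rules;
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>(
            Comparator.comparingDouble((ScoredUrl u) -> u.score).reversed());

    Frontier(RobotsRules rules) {
        this.rules = rules;
    }

    // Politeness: paths excluded by robots.txt are never queued.
    boolean add(String path, double score) {
        if (!rules.isAllowed(path)) {
            return false;
        }
        queue.add(new ScoredUrl(path, score));
        return true;
    }

    // Returns the most important remaining path, or null when the queue is empty.
    String next() {
        ScoredUrl u = queue.poll();
        return u == null ? null : u.path;
    }
}

class FrontierDemo {
    public static void main(String[] args) {
        RobotsRules rules = RobotsRules.parse("User-agent: *\nDisallow: /private/\n");
        Frontier frontier = new Frontier(rules);
        frontier.add("/about", 0.2);
        frontier.add("/index", 0.9);
        frontier.add("/private/data", 1.0); // rejected by the Disallow rule
        for (String p = frontier.next(); p != null; p = frontier.next()) {
            System.out.println(p);
        }
    }
}
```

A production crawler would also need per-host request delays and a full robots.txt parser; the release notes in this article mention that Nutch delegates robots.txt parsing to Crawler-Commons from versions 2.2 and 1.7 onward.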

Scalability

IBM Research studied the performance^[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.^[4] Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.^[5]

Related projects

Hadoop - Java framework that supports distributed applications running on large clusters

Search engines built with Nutch

Creative Commons Search - launched 2004, Nutch implementation replaced 2006^[6]^[7]^[8]
DiscoverEd - Open educational resources search prototype developed by Creative Commons
Krugle - uses Nutch to crawl web pages for code, archives and technically interesting content
mozDex (inactive)
Wikia Search - launched 2008, closed down 2009^[9]^[10]

References

1. Nutch News
2. Using Nutch with Solr
3. Scalability of the Nutch search engine
4. Base Operating System Provisioning and Bringup for a Commercial Supercomputer
5. The Sapphire Web Crawler - Crawl Statistics. Boston.lti.cs.cmu.edu (2008-10-01). Retrieved on 2013-07-21.
6. "Our Updated Search". Creative Commons. 2004-09-03.
7. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22.
8. "New CC search UI". Creative Commons. 2006-08-02.
9. Where can I get the source code for Wikia Search?
10. Update on Wikia – doing more of what’s working

Bibliography

Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. p. 350. ISBN 978-1-59059-687-6.

External links

Official website
Official wiki
Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2
An article about Nutch (2003) - Search Engine Watch
Another article about Nutch (2003) - Tech News World
