Carrot2

Carrot²

Web search results clustered using Carrot²'s Lingo algorithm.
Developer(s): Carrot Search
Stable release: 3.11.0 / October 19, 2015
Development status: Active
Written in: Java
Operating system: Cross-platform
Type: Text mining and cluster analysis
License: BSD license
Website: carrot2.org

Carrot²[1] is an open source search results clustering engine.[2] It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Apart from two specialized search results clustering algorithms, Carrot² offers ready-to-use components for fetching search results from various sources. Carrot² is written in Java and distributed under the BSD license.

History

The initial version of Carrot² was implemented in 2001 by Dawid Weiss as part of his MSc thesis to validate the applicability of the STC clustering algorithm to clustering search results in Polish.[3] In 2003, a number of other search results clustering algorithms were added, including Lingo,[4] a novel text clustering algorithm designed specifically for clustering search results. Although the source code of Carrot² had been available since 2002, version 1.0 was not officially released until 2006. In the same year, version 2.0 was released with an improved user interface and an extended tool set. In 2009, version 3.0 brought significant improvements in clustering quality, a simplified API, and a new GUI application, based on the Eclipse Rich Client Platform, for tuning clustering.

Carrot² releases
Release Release Date Major changes and new features[5]
3.11.0 October 2015 Upgrade of Apache Lucene, bug fixes and a rollup of changes from 3.10.x minors.
3.10.4 October 2015 Upgrade of Morfologik library.
3.10.3 August 2015 Repackaged Google Guava to avoid conflicts in Solr.
3.10.2 July 2015 Minor fixes to the Workbench (Arabic cluster display).
3.10.1 May 2015 Aduna visualization dropped from MacOS distribution. Minor fixes to the Workbench.
3.10.0 May 2015 Visualization updates. Bug fixes. Library dependency updates.
3.9.4 November 2014 FoamTree update. New attributes for multilingual clustering. Visualization fixes.
3.9.3 July 2014 FoamTree update. Infrastructure fixes and tweaks (jflex, sonatype repository URLs).
3.9.2 April 2014 Bug fix to FoamTree HTML5.
3.9.1 April 2014 Bug fixes, upgrades of HTML5 visualizations.
3.9.0 February 2014 HTML5 visualizations replacing flash, library dependencies update, bugfixes.
3.8.1 October 2013 Bug fixes, minor tweaks to functionality.
3.8.0 July 2013 Bug fixes, library dependency updates.
3.7.1 May 2013 Minor bug fixes (3.7.0 maintenance release).
3.7.0 April 2013 Infrastructure changes to the core (string IDs), better Solr integration XSLT, Workbench tweaks for larger inputs, updated dependencies.
3.6.3 April 2013 Minor bug fixes and improvements: customization of Solr adapter XSLT, Workbench tweaks for larger inputs, updated dependencies.
3.6.2 November 2012 Minor bug fixes and improvements.
3.6.1 August 2012 Minor bug fixes.
3.6.0 June 2012 Infrastructural changes, refactorings and bug fixes.
3.5.3 December 2011 Infrastructure updates resulting from migration to GitHub. Workbench update to SWT 3.7.1.
3.5.2 September 2011 Ajax support in Document Clustering Server, Bing document source improved, Workbench improvements, bug fixes.
3.5.1 June 2011 Bug fixes, visualization integration improvements, support for Yahoo BOSS API removed.
3.5.0 May 2011 FoamTree visualization, bisecting k-means clustering, resource management improvements
3.4.3 March 2011 Distribution to Maven central repository
3.4.2 October 2010 Bug fixes
3.4.1 September 2010 Solr 1.4.x compatibility package, bug fixes
3.4.0 August 2010 .NET API for calling Carrot² clustering
3.3.0 April 2010 Significant scalability improvements in the STC clustering algorithm
3.2.0 March 2010 Experimental support for clustering Arabic and Korean content, command line application for clustering in batch mode, LGPL-licensed dependencies removed
3.1.0 September 2009 Experimental support for clustering Chinese content, search results clustering plugin for Apache Solr
3.0.1 March 2009 Document Clustering Workbench available for Mac OS X
3.0.0 January 2009 Document Clustering Workbench added for easy experimenting with Carrot² clustering, radically simplified Java API, search results clustering web application re-implemented, user manual[6] available
2.1.0 August 2007 Document Clustering Server added for exposing clustering as a REST service
2.0.0 September 2006 New user interface of the search results clustering web application
1.0.0 January 2006 First official release, binaries available on SourceForge
0.0.0 since 2002 Incubation releases, source code available on SourceForge

Architecture and components

The architecture of Carrot² is based on processing components arranged into pipelines. The two major groups of processing components in Carrot² are document sources and clustering algorithms.
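
The pipeline idea can be illustrated with a minimal, self-contained Java sketch. It is not Carrot² code: the names `PipelineSketch`, `Context`, `ProcessingComponent`, `InMemorySource` and `FirstWordClustering` are hypothetical and introduced only for illustration, and the "clustering" step is a deliberately trivial stand-in for a real algorithm.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a component pipeline; none of these types are Carrot² API. */
public class PipelineSketch {

    /** Shared state handed from one component to the next. */
    static class Context {
        final String query;
        final List<String> documents = new ArrayList<>();
        final Map<String, List<String>> clusters = new LinkedHashMap<>();

        Context(String query) {
            this.query = query;
        }
    }

    /** A processing component reads and updates the shared context. */
    interface ProcessingComponent {
        void process(Context context);
    }

    /** Document source: keeps the documents of a fixed list that contain the query. */
    static class InMemorySource implements ProcessingComponent {
        private final List<String> all;

        InMemorySource(List<String> all) {
            this.all = all;
        }

        @Override
        public void process(Context context) {
            for (String doc : all) {
                if (doc.toLowerCase().contains(context.query.toLowerCase())) {
                    context.documents.add(doc);
                }
            }
        }
    }

    /** Toy clustering step: groups documents by their lower-cased first word. */
    static class FirstWordClustering implements ProcessingComponent {
        @Override
        public void process(Context context) {
            for (String doc : context.documents) {
                String label = doc.split("\\s+")[0].toLowerCase();
                context.clusters.computeIfAbsent(label, k -> new ArrayList<>()).add(doc);
            }
        }
    }

    /** Runs the components in order over a fresh context. */
    static Context run(String query, List<ProcessingComponent> pipeline) {
        Context context = new Context(query);
        for (ProcessingComponent component : pipeline) {
            component.process(context);
        }
        return context;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
            "Java clustering library", "Java search engine",
            "Python search client", "Cooking recipes");
        Context result = run("search",
            List.of(new InMemorySource(corpus), new FirstWordClustering()));
        result.clusters.forEach((label, docs) -> System.out.println(label + " -> " + docs));
    }
}
```

Because every stage implements the same interface, a different source or algorithm can be swapped in by changing only the list passed to `run`.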

Document sources

Document sources provide data for further processing. Typically, they fetch search results from an external search engine or a Lucene / Solr index, or load text files from a local disk.

Currently, Carrot² has built-in support for the following document sources:

Other document sources can be integrated based on the code examples provided with Carrot² distribution.
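
As a rough illustration of what a custom document source might look like, the following self-contained Java sketch loads `.txt` files from a directory. The `DocumentSource` interface and the `Document` and `TextFileSource` classes are assumptions made for this example and do not reproduce the interfaces shipped with Carrot².

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical document source that reads text files; not the Carrot² API. */
public class FileSourceSketch {

    /** Minimal document: a title and the text content. */
    static class Document {
        final String title;
        final String content;

        Document(String title, String content) {
            this.title = title;
            this.content = content;
        }
    }

    /** Assumed contract of a document source: produce documents on demand. */
    interface DocumentSource {
        List<Document> fetch();
    }

    /** Loads every *.txt file of one directory, sorted by path, as a document. */
    static class TextFileSource implements DocumentSource {
        private final Path directory;

        TextFileSource(Path directory) {
            this.directory = directory;
        }

        @Override
        public List<Document> fetch() {
            try {
                List<Path> files;
                // Files.list returns a stream that must be closed after use.
                try (Stream<Path> stream = Files.list(directory)) {
                    files = stream
                        .filter(p -> p.getFileName().toString().endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
                }
                List<Document> documents = new ArrayList<>();
                for (Path file : files) {
                    documents.add(new Document(file.getFileName().toString(), Files.readString(file)));
                }
                return documents;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("documents");
        Files.writeString(dir.resolve("one.txt"), "Search results clustering");
        Files.writeString(dir.resolve("two.txt"), "Suffix tree clustering");
        for (Document d : new TextFileSource(dir).fetch()) {
            System.out.println(d.title + ": " + d.content);
        }
    }
}
```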

Clustering algorithms

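Carrot² offers two specialized document clustering algorithms[7] that place emphasis on the quality of cluster labels: Lingo,[4] a clustering algorithm based on singular value decomposition, and STC (Suffix Tree Clustering).[8]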

Other algorithms can be easily added to Carrot².
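
To give a feel for label-oriented clustering, the sketch below groups documents by the two-word phrases they share and uses each shared phrase as the cluster label. This is a much-simplified illustration written for this article, not an implementation of Lingo or STC; the class and method names (`PhraseGroupingSketch`, `bigrams`, `cluster`) and the `minDocs` threshold are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy phrase-based grouping; a simplification, not Lingo or STC. */
public class PhraseGroupingSketch {

    /** Returns the distinct lower-cased two-word phrases of a text. */
    static Set<String> bigrams(String text) {
        // Split on anything that is not a letter or a digit.
        String[] words = text.toLowerCase().split("[^\\p{L}\\p{Nd}]+");
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i + 1 < words.length; i++) {
            // A leading separator yields one empty first token; skip it.
            if (!words[i].isEmpty()) {
                result.add(words[i] + " " + words[i + 1]);
            }
        }
        return result;
    }

    /** Maps each phrase found in at least minDocs documents to the indices of those documents. */
    static Map<String, List<Integer>> cluster(List<String> docs, int minDocs) {
        Map<String, List<Integer>> byPhrase = new LinkedHashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            for (String phrase : bigrams(docs.get(i))) {
                byPhrase.computeIfAbsent(phrase, k -> new ArrayList<>()).add(i);
            }
        }
        // Keep only phrases that actually group several documents.
        byPhrase.values().removeIf(ids -> ids.size() < minDocs);
        return byPhrase;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "Apache Solr search server",
            "Clustering plugin for Apache Solr",
            "Suffix tree clustering",
            "Building a suffix tree");
        // prints {apache solr=[0, 1], suffix tree=[2, 3]}
        System.out.println(cluster(docs, 2));
    }
}
```

In this sketch a document may appear under several labels, since it is added to the group of every phrase it shares with other documents.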

APIs

Carrot² clustering can be called through a number of APIs.

Java API

Because Carrot² is implemented in Java, it can be integrated with Java software through its native Java API.[9]

C# / .NET API

Carrot² provides a native C# API for calling clustering from C# / .NET software without installing a Java runtime. The Carrot² C# API requires .NET Framework version 3.5 or later.

Other platforms

Other platforms can call Carrot² clustering through the REST service exposed by the Document Clustering Server. Example integration code is provided for PHP5, C#, Ruby and cURL.
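
A client on the Java platform could prepare such a REST call roughly as follows. This is only a sketch: the endpoint URL `http://localhost:8080/dcs/rest` and the parameter names `query` and `format` are placeholders assumed for illustration, so the actual values should be taken from the Document Clustering Server documentation. The request is built but not sent, which keeps the example runnable without a server.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Builds a form-encoded POST request; endpoint and parameter names are assumptions. */
public class RestRequestSketch {

    /** Encodes parameters as application/x-www-form-urlencoded text. */
    static String formEncode(Map<String, String> params) {
        return params.entrySet().stream()
            .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    /** Creates the POST request without sending it. */
    static HttpRequest buildRequest(String endpoint, Map<String, String> params) {
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formEncode(params)))
            .build();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query", "data mining");   // placeholder parameter name
        params.put("format", "JSON");         // placeholder parameter name
        HttpRequest request = buildRequest("http://localhost:8080/dcs/rest", params);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formEncode(params));
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```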

Tools

Carrot2 Document Clustering Workbench.

Carrot² offers a number of supporting tools that can be used to quickly set up clustering on custom data, further tune clustering results, and expose Carrot² clustering as a remote service:

Spin-offs

Carrot Search

Carrot Search,[10] a commercial spin-off of the Carrot² project, continues the development of Carrot² and offers a real-time text clustering algorithm[11] compliant with the Carrot² framework, as well as text mining consulting services based on open source and proprietary software.

Carrot Search Labs

Carrot² gave rise to a number of independent open source projects released under the umbrella of Carrot Search Labs.[12] Currently, the following projects are available:

References

  1. Carrot² project website
  2. Carrot² search results clustering demo
  3. Dawid Weiss: A Clustering Interface for Web Search Results in Polish and English. MSc thesis. Poznan University of Technology, Poznań, Poland, 2001.
  4. Stanisław Osiński, Dawid Weiss: A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, May/June, 3 (vol. 20), 2005, pp. 48–54.
  5. Carrot² release notes
  6. Carrot² user and developer Manual
  7. Carrot² clustering algorithms
  8. Oren Zamir, Oren Etzioni: Web Document Clustering: A Feasibility Demonstration. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (1998), pp. 46–54.
  9. Carrot² Java API JavaDoc
  10. Carrot Search
  11. Lingo3G document clustering algorithm
  12. Carrot Search Labs website
This article is issued from Wikipedia - version of the Monday, October 19, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.