Document clustering

From Wikipedia, the free encyclopedia

Document clustering (also referred to as text clustering) is closely related to the concept of data clustering. It is a more specific technique used for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. For example, a web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse the results or identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is done by enterprise search engines such as Northern Light and Vivisimo.
Example:
FirstGov.gov, the official Web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits the query “immigration”, they will see, next to their list of results, categories such as “Immigration Reform”, “Citizenship and Immigration Services”, “Employment”, “Department of Homeland Security”, and more.
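
The grouping of retrieved documents described above can be illustrated with a minimal sketch. It is not the method used by any of the search engines named here, whose algorithms are not described in this article. It assumes a simple approach: documents are turned into TF-IDF weighted word vectors, compared by cosine similarity, and assigned by single-pass "leader" clustering to the first sufficiently similar group. The sample documents, the function names, and the threshold of 0.3 are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed inverse document frequency: rarer terms weigh more.
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_documents(docs, threshold=0.3):
    """Single-pass leader clustering; returns lists of document indices.

    Each document joins the cluster whose first member (the leader) is most
    similar to it, provided that similarity reaches the threshold.
    Otherwise the document starts a new cluster.
    """
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, vectors[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


# Hypothetical search results for a broad query.
docs = [
    "immigration reform bill debate",
    "immigration reform proposal vote",
    "employment visa application form",
    "employment visa renewal form",
]
clusters = cluster_documents(docs)
print(clusters)  # prints [[0, 1], [2, 3]]
```

Leader clustering is used here only because it is short and deterministic. Its result depends on the order of the documents and on the threshold, so a production system would more likely use an approach such as k-means or hierarchical clustering, and would also need a separate step to produce readable category labels.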


Further reading

Publications:

  • Nicholas O. Andrews and Edward A. Fox, Recent Developments in Document Clustering, October 16, 2007 [1]
