Document classification

Document classification (or categorization) is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Document classification tasks fall into two kinds: supervised document classification, in which some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, in which the classification must be done entirely without reference to external information.
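
The distinction between the two kinds can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the word-overlap similarity, and the tiny example texts are hypothetical and do not describe any particular system. The supervised function relies on externally supplied labels, while the unsupervised function only groups documents by shared vocabulary.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace."""
    return text.lower().split()


def overlap(a, b):
    """Count tokens shared by two token lists (with multiplicity)."""
    return sum((Counter(a) & Counter(b)).values())


def classify_supervised(document, labeled_examples):
    """Return the label of the labeled example that shares the most words.

    The labels are the external information: someone supplied them.
    """
    tokens = tokenize(document)
    best = max(labeled_examples, key=lambda ex: overlap(tokens, tokenize(ex[0])))
    return best[1]


def group_unsupervised(documents):
    """Greedily group documents that share at least one word.

    No labels are used; groups emerge from the documents alone.
    """
    groups = []
    for doc in documents:
        words = set(tokenize(doc))
        for group in groups:
            if words & group["vocab"]:
                group["docs"].append(doc)
                group["vocab"] |= words
                break
        else:
            # No existing group shares a word, so start a new one.
            groups.append({"docs": [doc], "vocab": set(words)})
    return [g["docs"] for g in groups]


# Hypothetical labeled examples for the supervised case.
labeled = [
    ("stock markets fell sharply", "finance"),
    ("the team won the match", "sports"),
]
print(classify_supervised("markets rallied after the stock split", labeled))

# Unlabeled documents for the unsupervised case.
print(group_unsupervised([
    "stock markets fell",
    "team won match",
    "markets rallied",
    "match postponed",
]))
```

Real systems would use richer features and a proper learning or clustering algorithm; the sketch only shows where the external information enters in the supervised case and that it is absent in the unsupervised one.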

Techniques

Document classification techniques include approaches based on natural language processing.

Applications

A notable recent use of document classification techniques is spam filtering, which tries to distinguish e-mail spam from legitimate messages.
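
As a rough illustration of spam filtering treated as supervised document classification, the following sketch uses a Naive Bayes classifier with add-one smoothing. Naive Bayes is assumed here as one common choice, not as the method of any specific filter; the class name `NaiveBayesSpamFilter`, the "spam"/"ham" labels, and the training messages are all hypothetical.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy two-class Naive Bayes text classifier (illustrative only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one labeled message; label must be 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        """Return the label with the higher log-probability score.

        Assumes at least one training message exists for each label.
        """
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add per-word log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training messages.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")

print(spam_filter.predict("claim your free money"))
print(spam_filter.predict("agenda for lunch"))
```

A practical filter would train on far more messages and typically use better tokenization; the sketch only shows how labeled examples drive the decision.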

External links

Software:

  • LingPipe - Java natural language processing software, including a rich classification runtime and evaluation framework with classifiers based on character- and token-based language models (including Naive Bayes).
  • TIS eFLOW platform - a modular solution that offers advanced data capture and document classification capabilities.
  • YALE (Yet Another Learning Environment) - freely available, integrated open-source environment for knowledge discovery, data mining, machine learning, and visualization (e.g. of text clusterings). Its WordVectorTool plugin supports text mining tasks such as text classification, text clustering, and document feature set construction and transformation.
  • Bow - freely available open-source toolkit for statistical language modeling, text retrieval, classification, and clustering.
  • XmlMiner - data and text mining toolkit targeted at XML data.