Document classification
From Wikipedia, the free encyclopedia
Document classification (also called document categorization) is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Document classification tasks fall into two kinds: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information.
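The difference between the two settings can be illustrated with a minimal sketch. The function names, the word-overlap (Jaccard) similarity measure, and the 0.2 threshold below are illustrative assumptions, not part of any standard method. The supervised function relies on labels supplied from outside, while the unsupervised one groups documents using only their contents.

```python
def tokens(text):
    """Lowercase a document and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two word sets: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_supervised(labeled_docs, new_doc):
    """Supervised: copy the label of the most similar labeled document.

    labeled_docs is a list of (text, label) pairs; the labels are the
    external information (for example, assigned by a human).
    """
    new_tokens = tokens(new_doc)
    _, best_label = max(
        labeled_docs, key=lambda pair: jaccard(tokens(pair[0]), new_tokens)
    )
    return best_label


def cluster_unsupervised(docs, threshold=0.2):
    """Unsupervised: group documents by similarity alone, with no labels.

    Each document joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster. The threshold is arbitrary.
    """
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if jaccard(tokens(cluster[0]), tokens(doc)) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


if __name__ == "__main__":
    labeled = [
        ("the team won the football match", "sports"),
        ("the bank raised interest rates", "finance"),
    ]
    print(classify_supervised(labeled, "interest rates at the bank fell"))
    print(cluster_unsupervised([text for text, _ in labeled]))
```

Note that the unsupervised function returns unnamed groups: without external information it can separate documents but cannot say what each group is called.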
Techniques
Document classification techniques include:
- naive Bayes classifier
- tf-idf
- latent semantic indexing
- support vector machines
- artificial neural networks
- k-nearest neighbors (kNN)
- concept mining
and approaches based on natural language processing.
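As a rough illustration of how two of the listed techniques can be combined, the sketch below weights terms with tf-idf and classifies a query by majority vote among its k nearest training documents under cosine similarity. It uses only the Python standard library. The function names, the whitespace tokenizer, and the particular smoothed idf formula are assumptions chosen for brevity; other tf-idf variants are equally valid.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase whitespace-separated terms."""
    return text.lower().split()


def fit_idf(corpus_tokens):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc_tokens, idf):
    """Weight each known term by its count in the document times its idf."""
    counts = Counter(doc_tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(train, query, k=3):
    """Return the majority label among the k training docs nearest to query.

    train is a list of (text, label) pairs.
    """
    train_tokens = [tokenize(text) for text, _ in train]
    idf = fit_idf(train_tokens)
    vectors = [tfidf_vector(toks, idf) for toks in train_tokens]
    query_vec = tfidf_vector(tokenize(query), idf)
    labels = [label for _, label in train]
    ranked = sorted(
        zip(vectors, labels),
        key=lambda pair: cosine(pair[0], query_vec),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


TRAIN = [
    ("stock markets fell as interest rates rose", "finance"),
    ("the central bank cut interest rates", "finance"),
    ("investors bought bank shares", "finance"),
    ("the striker scored a late goal", "sports"),
    ("the team won the cup final", "sports"),
    ("the coach praised the goal keeper", "sports"),
]

if __name__ == "__main__":
    print(knn_classify(TRAIN, "bank interest rates"))
    print(knn_classify(TRAIN, "late goal by the team"))
```

Here tf-idf only produces the document representation; the classification decision itself comes from the kNN vote. Terms in the query that never occur in the training set are ignored.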
Applications
A notable application of document classification techniques is spam filtering, which tries to distinguish spam email from legitimate messages.
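A spam filter of this kind can be sketched with a multinomial naive Bayes classifier. The class name, the labels "spam" and "ham", and the toy training messages below are hypothetical, and the sketch assumes at least one training message per label. Real filters use far larger training sets and more careful tokenization.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        """Record the words of one message under its known label."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in text.lower().split():
            if word not in self.vocab:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


def build_example_filter():
    """Train a filter on a few made-up messages (illustrative data only)."""
    spam_filter = NaiveBayesSpamFilter()
    spam_filter.train("win free money now", "spam")
    spam_filter.train("free prize claim now", "spam")
    spam_filter.train("meeting agenda for monday", "ham")
    spam_filter.train("lunch with the project team", "ham")
    return spam_filter


if __name__ == "__main__":
    spam_filter = build_example_filter()
    print(spam_filter.classify("claim your free money"))
    print(spam_filter.classify("project meeting on monday"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word seen under only one label from driving the other label's probability to zero.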
See also
- classification
- supervised learning, unsupervised learning
- document retrieval
- information retrieval
- machine learning
- text mining, web mining
- concept mining
External links
Publications:
- Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
- Introduction to document classification
- Bibliography on Automated Text Categorization
Resources:
- Data Mining Tutorials, Resources Eruditionhome
Software:
- LingPipe - Java natural language processing software including a rich classification runtime and evaluation framework, with classifiers based on character and token language models (including naive Bayes).
- TIS eFLOW platform - a modular solution that offers advanced data capture and document classification capabilities.
- YALE (Yet Another Learning Environment) - freely available, integrated open-source software environment for knowledge discovery, data mining, machine learning, and visualization (e.g. of text clusterings). It features a WordVectorTool plugin for text mining tasks such as text classification, text clustering, and document feature set construction and transformation.
- Bow - freely available open-source toolkit for statistical language modeling, text retrieval, classification, and clustering.
- XmlMiner - data and text mining toolkit targeted at XML data.