Document classification

From Wikipedia, the free encyclopedia

Document classification (or document categorization) is a problem in information science. The task is to assign an electronic document to one or more categories based on its contents. Document classification tasks fall into two main kinds: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification of documents, and unsupervised document classification, where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where only some of the documents are labeled by the external mechanism.

Contents

1 Techniques
2 Applications
3 See also
4 Further reading

Techniques

Document classification techniques include approaches based on natural language processing.
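As an illustration of the supervised setting, the following is a minimal sketch of a nearest-centroid classifier over bag-of-words counts. This technique is chosen here only for illustration and is not named in the article; the function names (tokenize, cosine, train_centroids, classify), the two categories and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_docs):
    """Sum the word counts of all training documents in each category."""
    centroids = defaultdict(Counter)
    for text, category in labeled_docs:
        centroids[category].update(tokenize(text))
    return dict(centroids)


def classify(text, centroids):
    """Return the category whose summed word counts best match the document."""
    vector = Counter(tokenize(text))
    return max(centroids, key=lambda cat: cosine(vector, centroids[cat]))


# Hypothetical labeled examples standing in for the "external mechanism".
training = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the parliament passed a new law", "politics"),
    ("the minister announced the election", "politics"),
]
centroids = train_centroids(training)
print(classify("a goal in the final match", centroids))
```

The labels in the training list play the role of the human feedback described above. An unsupervised method would have to group the same documents without them.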

Applications

Classification techniques have been applied to spam filtering, a process that tries to distinguish spam email messages from legitimate ones.
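A spam filter of this kind can be sketched as a two-class supervised classifier. The example below assumes a multinomial naive Bayes model with add-one smoothing, which is one common choice and is not prescribed by the article. The class name NaiveBayesSpamFilter, the labels "spam" and "ham", and the training messages are hypothetical.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Assumes at least one training message has been seen for each label.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one labeled message."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def log_score(self, text, label):
        """Log prior of the label plus the log likelihood of each word."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for word in text.lower().split():
            count = self.word_counts[label][word]
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((count + 1) / (total_words + len(vocab)))
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(("spam", "ham"), key=lambda lb: self.log_score(text, lb))


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")
print(spam_filter.classify("claim your free money"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.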

See also

Further reading

Publications:

Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
Introduction to document classification
Bibliography on Automated Text Categorization
Bibliography on Query Classification

Categories: Information science | Natural language processing | Knowledge representation | Data mining | Machine learning

This page was last modified 16:56, 11 June 2008 by Wikipedia user BOTarate. Based on work by Wikipedia user(s) TXiKi, Junling, Slambo, Barticus88, Ronz, Gabr, TXiKiBoT, Freedomlinux, HebrewHammerTime, Kku, Matematico, Beetstra, Jfroelich, Scientio, Ralf Klinkenberg, MIT Trekkie, Joerg Kurt Wegner, .:Ajvol:., YurikBot, Shreddy, Silvonen, Steinsky, The Anome and MarkSweep and Anonymous user(s) of Wikipedia.
All text is available under the terms of the GNU Free Documentation License. (See Copyrights for details.)