Text corpus

From Wikipedia, the free encyclopedia

In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific universe of texts.
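As a rough illustration of statistical analysis and occurrence checking, the following Python sketch counts word frequencies over a tiny corpus and reports which texts contain a given word. The three sentences, the simple regular-expression tokenizer, and the function names are assumptions made for this example, not part of any particular corpus toolkit.

```python
import re
from collections import Counter

# Hypothetical miniature corpus; a real corpus would hold far more text.
CORPUS = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A cat is not a dog.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def occurrences(texts, word):
    """Return the indices of the texts that contain the given word."""
    return [i for i, text in enumerate(texts) if word in tokenize(text)]


freqs = word_frequencies(CORPUS)
print(freqs["the"], freqs["cat"])   # frequency of two individual words
print(occurrences(CORPUS, "dog"))   # which texts mention "dog"
```

A tokenizer this simple ignores digits, hyphenation and non-ASCII letters; real corpus work would normally need a more careful one.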

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora.
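One possible way to picture an aligned parallel corpus is as a list of sentence pairs, as in the Python sketch below. It assumes a strict one-to-one sentence alignment, which is a simplification, and the English and French sentences, the AlignedPair class, and the align and lookup helpers are all hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One sentence in the source language with its aligned translation."""
    source: str
    target: str


def align(source_sentences, target_sentences):
    """Pair sentences by position; assumes a one-to-one alignment."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("both sides must have the same number of sentences")
    return [AlignedPair(s, t) for s, t in zip(source_sentences, target_sentences)]


def lookup(pairs, word):
    """Return the pairs whose source sentence contains the given word."""
    word = word.lower()
    return [p for p in pairs if word in re.findall(r"\w+", p.source.lower())]


# Hypothetical two-sentence English-French sample.
ENGLISH = ["The cat sleeps.", "I read a book."]
FRENCH = ["Le chat dort.", "Je lis un livre."]

pairs = align(ENGLISH, FRENCH)
for pair in lookup(pairs, "cat"):
    print(pair.source, "|", pair.target)
```

Looking up a word on one side and reading off the other side is the kind of side-by-side comparison that the alignment makes possible.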

To make corpora more useful for linguistic research, they are often subjected to a process known as annotation.

An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
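The two kinds of annotation described above can be sketched in Python as a token record that carries a part-of-speech tag and a lemma alongside the word form. The sentence, the tag names (DET, NOUN, VERB), the word/TAG output convention, and the helper names are assumptions for this example; actual corpora use their own tag sets and file formats.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """A word form together with its annotations."""
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag
    lemma: str  # base (dictionary) form


# Hypothetical hand-annotated sentence.
SENTENCE = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("chased", "VERB", "chase"),
    Token("mice", "NOUN", "mouse"),
]


def to_tagged_text(tokens):
    """Render the tokens in a simple word/TAG notation."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)


def lemmas_with_pos(tokens, pos):
    """Collect the lemmas of all tokens carrying the given tag."""
    return [t.lemma for t in tokens if t.pos == pos]


print(to_tagged_text(SENTENCE))
print(lemmas_with_pos(SENTENCE, "NOUN"))
```

With the lemma stored on each token, a search for "mouse" can also find the inflected form "mice", which a plain string search would miss.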

Corpora are the main knowledge base in corpus linguistics. The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes.
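To give a sense of how a tagged corpus can feed a hidden Markov model for POS-tagging, the Python sketch below counts tag transitions and word emissions in a few hand-tagged sentences and then uses the Viterbi algorithm to tag a new sentence. The training sentences, the add-one smoothing, and all function names are assumptions chosen for this illustration; production taggers are trained on far larger corpora and typically work with log probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences: lists of (word, tag) pairs.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

START = "<s>"  # pseudo-tag marking the beginning of a sentence


def train(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sentence in sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            previous = tag
    return transitions, emissions


def prob(counter, key, n_outcomes, alpha=1.0):
    """Relative frequency with add-alpha smoothing, so unseen events are not zero."""
    return (counter[key] + alpha) / (sum(counter.values()) + alpha * n_outcomes)


def viterbi(words, transitions, emissions):
    """Return the most probable tag sequence for the words under the model."""
    tags = sorted(emissions)
    vocab = {w for counts in emissions.values() for w in counts}
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unknown words
    # best maps each tag to (probability of the best path ending there, that path).
    best = {START: (1.0, [])}
    for word in words:
        step = {}
        for tag in tags:
            emit = prob(emissions[tag], word, n_words)
            candidates = [
                (p * prob(transitions[prev], tag, n_tags) * emit, path + [tag])
                for prev, (p, path) in best.items()
            ]
            step[tag] = max(candidates, key=lambda c: c[0])
        best = step
    return max(best.values(), key=lambda c: c[0])[1]


transitions, emissions = train(TAGGED)
print(viterbi(["the", "dog", "sleeps"], transitions, emissions))
```

Multiplying raw probabilities is adequate for short sentences like these; for longer input, summing log probabilities avoids numerical underflow.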

Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship.

Some notable text corpora

English language:

American National Corpus
Bank of English
British National Corpus
Brown Corpus
Helsinki Corpus
Longman-Lancaster Corpus
North American News Text corpus
Oxford English Corpus
Scottish Corpus of Texts & Speech

Historical languages:

Electronic Text Corpus of Sumerian Literature
Neo-Assyrian Text Corpus Project

Other languages:

Leipzig Corpus of 15 languages with collocation statistics
Red iberoamericana de terminología
Red panlatina de terminología
Croatian National Corpus
Czech National Corpus
Slovak National Corpus
Hungarian National Corpus
The IPI PAN Corpus of Polish
Corpus of Slovenian Language
Bank of Swedish
Spoken Dutch Corpus
Balanced Corpus of Modern Chinese
Persian Today Corpus
METU Turkish Corpus
Hellenic National Corpus
Portuguese Corpora by Linguateca

Bilingual corpora:

Evrokorpus English-Slovene parallel corpus

See also

concordance
corpus linguistics
Linguistic Data Consortium
parallel text alignment
Search engines: they access the "web corpus".
translation memory
treebank
natural language processing

External links

ACL SIGLEX Resource Links: Text Corpora
Scottish Corpus of Texts & Speech: Multimedia corpus of Scots and Scottish English
WebCorp: The Web as a corpus
The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses
Developing Linguistic Corpora: a Guide to Good Practice
TechTC - Technion Repository of Text Categorization Datasets
GENIA corpus for molecular biology: http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/
Biomedical corpora site
DOC Cop: Submit your corpus for plagiarism detection processing

Categories: Discourse analysis | Corpus linguistics | Computational linguistics | Data mining

This page was last modified 02:51, 6 December 2006 by Anonymous user(s) of Wikipedia. Based on work by Wikipedia user(s) Bota47, Stephen Hodge, Pax:Vobiscum, Khalid hassani, JAnDbot, Bluebot, Kevin.cohen, Paulusmaria, TXiKi, Dissident, YurikBot, EnisSoz, Tobias Bergemann, Sebesta, Jonsafari, Hamidhassani1, Kku, Gabr, KnightRider, Dav 59, Ycl6, Dvdsn, Mmcannis, Ramzes, Markus Kuhn, Kwamikagami, PeepP, The Anome, Nohat, Emvee, Ish ishwar, SimonP, CanisRufus, Jumbuck, Burschik, Michael Hardy, Carbuncle, Allolex, MichaelTinkler, Gianfranco and Boleslav Bobcik.
All text is available under the terms of the GNU Free Documentation License. (See Copyrights for details.)
Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc.