Treebank

From Wikipedia, the free encyclopedia

A treebank is a text corpus in which each sentence has been annotated with syntactic structure. Syntactic structure is commonly represented as a tree structure, hence the name treebank. Treebanks can be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for training or testing parsers.

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.

Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct.

Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows Head-driven Phrase Structure Grammar, HPSG), but most try to be less theory-specific. Even so, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank) and those that annotate dependency structure (for example the Prague Dependency Treebank).
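To illustrate the dependency style, the following minimal Python sketch stores the sentence "John loves Mary" as a list of tokens, each pointing to its head. The `Token` class, the `dependents` helper and the relation names "subject", "object" and "root" are assumptions made for this illustration; they are not taken from the Prague Dependency Treebank or any other annotation scheme.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int  # 1-based position in the sentence
    form: str   # the word as it appears in the text
    head: int   # index of the governing token; 0 marks the root
    label: str  # relation name (illustrative, not from a real treebank)


# "John loves Mary": the verb is the root and governs both nouns.
sentence = [
    Token(1, "John", 2, "subject"),
    Token(2, "loves", 0, "root"),
    Token(3, "Mary", 2, "object"),
]


def dependents(tokens, head_index):
    """Return the word forms directly governed by the token at head_index."""
    return [t.form for t in tokens if t.head == head_index]


root = next(t for t in sentence if t.head == 0)
print(root.form, dependents(sentence, root.index))
```

Unlike a phrase structure annotation, this representation has no intermediate nodes such as NP or VP: every node is a word, and the structure consists only of head-dependent links.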

The syntactic structure in a treebank can be represented in many different ways, for example using simple labelled brackets in a text file, like this (following the Penn Treebank):

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))

or a treebank-specific XML scheme.
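Labelled brackets of this kind are simple to read programmatically. The Python sketch below parses the example sentence above into nested `(label, children)` tuples and recovers its words. The function names `tokenize`, `parse` and `leaves` are hypothetical, and the code assumes well-formed input; it does not handle the traces, function tags or other conventions of the actual Penn Treebank files.

```python
import re


def tokenize(text):
    """Split bracketed text into '(', ')' and label/word tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens, pos=0):
    """Parse one constituent starting at tokens[pos].

    Returns (node, next_pos), where node is a (label, children) tuple
    and each child is either another node or a word string.
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1  # skip the closing ')'


def leaves(node):
    """Collect the words of a parsed tree from left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    return [word for child in children for word in leaves(child)]


SENTENCE = """(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))"""

tree, _ = parse(tokenize(SENTENCE))
print(tree[0])       # S
print(leaves(tree))  # ['John', 'loves', 'Mary', '.']
```

Nested tuples keep the sketch free of dependencies; a real application would more likely use an existing treebank reader.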

List of treebanks sorted by language

Arabic: Penn Arabic Treebank, Prague Arabic Dependency Treebank (PADT)
Basque: Eus3LB, see also Annotation guide for Eus3LB and the group's home page
Bulgarian: BulTreeBank (HPSG-based Syntactic Treebank)
Catalan: Cat3LB
Chinese: Penn Chinese Treebank, Sinica Treebank by CKIP, a tentative Chinese Dependency Treebank
Czech: Prague Dependency Treebank
Danish: Danish Dependency Treebank, Arboretum: A syntactic tree corpus of Danish
Dutch: CGN, Alpino
English:
- Penn Treebank;
- English Dependency Treebank?;
- BLLIP WSJ corpus;
- International Corpus of English (ICE);
- Lancaster Parsed Corpus;
- Susanne Corpus, Christine Corpus, Lucy Corpus;
- Verbmobil treebanks;
- LinGO Redwoods;
- Multi-Treebank;
- The PARC 700 Dependency Bank;
- CHILDES Brown Eve corpus with dependency annotation, see Sagae, K., MacWhinney, B., and Lavie, A. (2004) Adding syntactic annotations to transcripts of parent-child dialogs. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). Lisbon, Portugal.
Estonian: Syntactically analyzed and disambiguated text corpus, see also Arborest
French: Paris 7, L'Arboratoire
German: NEGRA, TIGER, The Tuebingen Treebank of Spoken German (TueBa-D/S), The Tuebingen Treebank of Written German (TueBa-D/Z)
Greek, Modern: Greek Dependency Treebank
Greek, Ancient: PROIEL Corpus
Hebrew: Hebrew Treebank
Hindi: AnnCorra
Hungarian: Hungarian treebank
Italian: TUT - Turin University Treebank, VIT - Venice Italian Treebank, ISST - Italian Syntactic-Semantic Treebank
Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
Korean: Korean Treebank
Latin:
- Latin Dependency Treebank;
- Index Thomisticus Treebank;
- PROIEL Corpus.
Norwegian: TREPIL Norwegian treebank
Polish: A Treebank / Test Suite for Polish (HPSG treebank)
Portuguese: Projecto Floresta Sintá(c)tica
Russian: Dependency Treebank for Russian, see also another paper
Slovene: Slovene Dependency Treebank
Spanish: Cast3LB, UAM Treebank of Spanish
Swedish: Talbanken05, Swedish Treebank
Thai: NAiST Thai Treebank
Turkish: METU-Sabanci Treebank

Categories: Corpus linguistics | Computational linguistics