Treebank
From Wikipedia, the free encyclopedia
A treebank is a text corpus in which each sentence has been annotated with syntactic structure. Syntactic structure is commonly represented as a tree structure, hence the name treebank. Treebanks can be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for training or testing parsers.
Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.
Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct.
Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows Head-driven Phrase Structure Grammar, HPSG), but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank) and those that annotate dependency structure (for example the Prague Dependency Treebank).
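The difference between the two groups can be illustrated with a small sketch in Python. The data layout, the relation names (`subject`, `object`, `root`) and the helper functions `leaves` and `dependents` are assumptions made for illustration only; they are not the annotation scheme of the Penn Treebank or the Prague Dependency Treebank.

```python
# Phrase structure: nested constituents, each a (label, children) pair.
# Words appear only at the leaves; inner nodes are phrases such as NP or VP.
phrase_structure = (
    "S",
    [
        ("NP", [("NNP", ["John"])]),
        ("VP", [("VBZ", ["loves"]), ("NP", [("NNP", ["Mary"])])]),
    ],
)

# Dependency structure: one (word, head_index, relation) triple per word.
# head_index is the 1-based position of the governing word; 0 marks the root.
# The relation names are illustrative, not taken from any particular treebank.
dependency_structure = [
    ("John", 2, "subject"),
    ("loves", 0, "root"),
    ("Mary", 2, "object"),
]


def leaves(node):
    """Yield the words of a phrase-structure tree from left to right."""
    _label, children = node
    for child in children:
        if isinstance(child, str):
            yield child
        else:
            yield from leaves(child)


def dependents(deps, head_index):
    """Return the words directly governed by the word at head_index."""
    return [word for word, head, _relation in deps if head == head_index]


print(list(leaves(phrase_structure)))
print(dependents(dependency_structure, 2))
```

In the first representation the verb and its object are grouped under a VP node, whereas in the second every word points directly at another word and no phrase nodes exist.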
The syntactic structure in a treebank can be represented in many different ways, for example using simple labelled brackets in a text file, like this (following the Penn Treebank):
(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))
or a treebank-specific XML scheme.
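As a rough illustration of how the labelled-bracket notation can be processed, the following Python sketch reads one bracketed sentence into nested tuples and lists its words with their part-of-speech tags. The function names `parse_brackets` and `tagged_words` are hypothetical, and the sketch assumes well-formed input in which neither words nor labels contain parentheses or whitespace; it is not the reader of any particular treebank toolkit.

```python
def parse_brackets(text):
    """Parse one labelled-bracket tree into nested (label, children) tuples."""
    # Pad the brackets with spaces so that a plain split() tokenizes the input.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1  # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(node):
    """Yield (word, tag) pairs from the tree, left to right."""
    label, children = node
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


tree = parse_brackets("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))")
print(list(tagged_words(tree)))
# prints [('John', 'NNP'), ('loves', 'VBZ'), ('Mary', 'NNP'), ('.', '.')]
```

Nested tuples are used here only to keep the sketch short; a tree class with parent links would be a reasonable alternative when upward navigation is needed.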
List of treebanks sorted by language
- Arabic: Penn Arabic Treebank, Prague Arabic Dependency Treebank (PADT)
- Basque: Eus3LB, see also Annotation guide for Eus3LB and the group's home page
- Bulgarian: BulTreeBank (HPSG-based Syntactic Treebank)
- Catalan: Cat3LB
- Chinese: Penn Chinese Treebank, Sinica Treebank by CKIP, a tentative Chinese Dependency Treebank
- Czech: Prague Dependency Treebank
- Danish: Danish Dependency Treebank, Arboretum: A syntactic tree corpus of Danish
- Dutch: CGN, Alpino
- English:
  - Penn Treebank;
  - English Dependency Treebank?;
  - BLLIP WSJ corpus;
  - International Corpus of English (ICE);
  - Lancaster Parsed Corpus;
  - Susanne Corpus, Christine Corpus, Lucy Corpus;
  - Verbmobil treebanks;
  - LinGO Redwoods;
  - Multi-Treebank;
  - The PARC 700 Dependency Bank;
  - CHILDES Brown Eve corpus with dependency annotation, see Sagae, K., MacWhinney, B., and Lavie, A. (2004). Adding syntactic annotations to transcripts of parent-child dialogs. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.
- Estonian: Syntactically analyzed and disambiguated text corpus, see also Arborest
- French: Paris 7, L'Arboratoire
- German: NEGRA, TIGER, The Tuebingen Treebank of Spoken German (TueBa-D/S), The Tuebingen Treebank of Written German (TueBa-D/Z)
- Greek, Modern: Greek Dependency Treebank
- Greek, Ancient: PROIEL Corpus
- Hebrew: Hebrew Treebank
- Hindi: AnnCorra
- Hungarian: Hungarian treebank
- Italian: TUT - Turin University Treebank, VIT - Venice Italian Treebank, ISST - Italian Syntactic-Semantic Treebank
- Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
- Korean: Korean Treebank
- Latin:
- Norwegian: TREPIL Norwegian treebank
- Polish: A Treebank / Test Suite for Polish (HPSG treebank)
- Portuguese: Projecto Floresta Sintá(c)tica
- Russian: Dependency Treebank for Russian, see also another paper
- Slovene: Slovene Dependency Treebank
- Spanish: Cast3LB, UAM Treebank of Spanish
- Swedish: Talbanken05, Swedish Treebank
- Thai: NAiST Thai Treebank
- Turkish: METU-Sabanci Treebank