Treebank

From Wikipedia, the free encyclopedia

A treebank is a text corpus in which each sentence has been annotated with its syntactic structure. This structure is commonly represented as a tree, hence the name treebank. Treebanks are used in corpus linguistics to study syntactic phenomena and in computational linguistics to train and test parsers.

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.

Treebanks can be created entirely by hand, with linguists annotating each sentence with its syntactic structure, or semi-automatically, with a parser assigning a preliminary structure that linguists then check and, if necessary, correct.

Some treebanks follow a specific linguistic theory in their syntactic annotation (for example, the BulTreeBank follows Head-driven Phrase Structure Grammar, HPSG), but most try to be less theory-specific. Even so, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank) and those that annotate dependency structure (for example the Prague Dependency Treebank).
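
The difference between the two groups can be illustrated with a minimal Python sketch that encodes the sentence "John loves Mary" both ways. The data layout, the helper names (phrase_labels, dependents_of) and the relation names (root, subject, object) are assumptions made for this illustration; they are not the annotation scheme of the Penn Treebank, the Prague Dependency Treebank, or any other treebank, and part-of-speech tags are left out for brevity.

```python
# Phrase structure: a node is (label, children); a leaf is a plain word string.
phrase_structure = (
    "S",
    [
        ("NP", ["John"]),
        ("VP", ["loves", ("NP", ["Mary"])]),
    ],
)

# Dependency structure: one (word, head, relation) triple per word.
# The relation names are illustrative only, not taken from any treebank.
dependencies = [
    ("loves", None, "root"),
    ("John", "loves", "subject"),
    ("Mary", "loves", "object"),
]


def phrase_labels(node):
    """Collect the phrase labels of a phrase-structure tree, top-down."""
    if isinstance(node, str):
        return []  # a bare word carries no phrase label
    label, children = node
    labels = [label]
    for child in children:
        labels.extend(phrase_labels(child))
    return labels


def dependents_of(head, deps):
    """Return the words whose head is the given word."""
    return [word for word, h, _ in deps if h == head]


print(phrase_labels(phrase_structure))        # ['S', 'NP', 'VP', 'NP']
print(dependents_of("loves", dependencies))   # ['John', 'Mary']
```

In the phrase-structure encoding, words are grouped into nested, labelled constituents; in the dependency encoding there are no phrase nodes, only labelled links between words.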

The syntactic structure in a treebank can be represented in many different ways, for example using simple labelled brackets in a text file, like this (following the Penn Treebank):

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))

or a treebank-specific XML scheme.
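
As an illustration of how the labelled-bracket notation can be processed, the following minimal Python sketch reads the example above into nested tuples and recovers the words of the sentence. The function names (tokenize, parse, leaves) and the tuple representation are assumptions made for this sketch, not part of any treebank's tooling, and the parser assumes well-formed input in which every opening bracket is followed by a label.

```python
import re


def tokenize(text):
    """Split into '(', ')' and runs of characters that are neither space nor bracket."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens, pos=0):
    """Parse one bracketed node starting at tokens[pos].

    Returns ((label, children), next_pos); a child is either a nested
    node or a plain word string.
    """
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at token {pos}, found {tokens[pos]!r}")
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child = tokens[pos]
            pos += 1
        children.append(child)
    return (label, children), pos + 1  # skip the closing ')'


def leaves(node):
    """Return the words of the tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


EXAMPLE = """
(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
"""

tree, _ = parse(tokenize(EXAMPLE))
print(tree[0])       # S
print(leaves(tree))  # ['John', 'loves', 'Mary', '.']
```

Nested tuples keep the sketch free of dependencies; a real application would more likely rely on an existing treebank reader than on a hand-written parser.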

List of treebanks sorted by language