International Corpus of English
The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. The varieties currently represented include those of Great Britain, the USA, New Zealand, Hong Kong, Singapore, East Africa, India, and the Philippines. Work is also under way on corpora for Maltese, Nigerian, and Pakistani English.
Each corpus contains one million words in 500 texts of about 2,000 words each, following the methodology used for the Brown Corpus. 60% of the texts are spoken and 40% are written. The texts in each corpus date from 1990 or later.
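The arithmetic behind this design can be checked with a minimal sketch. The function name `corpus_design` and the dictionary keys are illustrative assumptions rather than part of any ICE tooling, and the figures are the nominal ones stated above (individual texts are only approximately 2,000 words long).

```python
# Nominal design of one ICE corpus, using the figures given in the text.
# All names here are hypothetical and chosen only for illustration.
TEXTS_PER_CORPUS = 500
WORDS_PER_TEXT = 2000      # nominal length; real texts are only roughly this long
SPOKEN_PERCENT = 60        # the remaining 40% of texts are written


def corpus_design(texts=TEXTS_PER_CORPUS,
                  words_per_text=WORDS_PER_TEXT,
                  spoken_percent=SPOKEN_PERCENT):
    """Return nominal text and word counts for one corpus."""
    # Integer arithmetic avoids floating-point rounding in the split.
    spoken_texts = texts * spoken_percent // 100
    written_texts = texts - spoken_texts
    return {
        "total_words": texts * words_per_text,
        "spoken_texts": spoken_texts,
        "written_texts": written_texts,
        "spoken_words": spoken_texts * words_per_text,
        "written_words": written_texts * words_per_text,
    }


if __name__ == "__main__":
    for key, value in corpus_design().items():
        print(f"{key}: {value}")
```

Under these nominal figures, a corpus has 300 spoken and 200 written texts, which corresponds to roughly 600,000 spoken and 400,000 written words.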
To ensure compatibility between the individual corpora in ICE, each team follows a common corpus design and a common scheme for grammatical annotation.[1] This makes contrastive studies of the represented language varieties possible. However, because the corpora were created and are maintained by different teams, they differ somewhat in how they are put together.
In 1998, the British sub-corpus, ICE-GB, was completed, including part-of-speech tagging and parsing of the entire corpus. It can be analyzed morphosyntactically with the ICE Corpus Utility Program (ICECUP) software package.[2]
References
- [1] The International Corpus of English Homepage.
- [2] Mukherjee, Joybrato (2002). Korpuslinguistik und Englischunterricht. Eine Einführung. Frankfurt/Main: Lang, pp. 34–5.