Word sketch

Word sketch of verb "read" in the British National Corpus in Sketch Engine

A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff[1] and are used in the Sketch Engine[2] corpus management system. They extend the general concept of collocation used in corpus linguistics by grouping collocations according to the grammatical relation they enter into (e.g. subject, object, modifier). The collocation candidates in a word sketch are sorted either by frequency or by a lexicographic association score such as Dice, T-score or MI-score.
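The sorting step can be illustrated with a minimal Python sketch. It uses the common textbook definitions of the Dice coefficient, T-score and MI-score; the exact variants computed by any particular tool may differ, and the function names, the data layout and all counts below are hypothetical, chosen only for illustration.

```python
import math

# Textbook definitions (an assumption; tools may use variants of these).
# f_xy: co-occurrence count of headword and collocate in one relation
# f_x:  frequency of the headword, f_y: frequency of the collocate
# n:    corpus size

def dice(f_xy, f_x, f_y):
    return 2 * f_xy / (f_x + f_y)

def t_score(f_xy, f_x, f_y, n):
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def mi_score(f_xy, f_x, f_y, n):
    return math.log2(f_xy * n / (f_x * f_y))

def rank_collocates(candidates, f_head, n, score="frequency"):
    """Sort collocates of one grammatical relation, best first.

    candidates maps collocate -> (f_xy, f_collocate).
    """
    def key(item):
        _, (f_xy, f_coll) = item
        if score == "frequency":
            return f_xy
        if score == "dice":
            return dice(f_xy, f_head, f_coll)
        if score == "t":
            return t_score(f_xy, f_head, f_coll, n)
        if score == "mi":
            return mi_score(f_xy, f_head, f_coll, n)
        raise ValueError(f"unknown score: {score}")

    ranked = sorted(candidates.items(), key=key, reverse=True)
    return [collocate for collocate, _ in ranked]

# Hypothetical counts for objects of the verb "read".
objects_of_read = {"book": (120, 900), "the": (300, 60000), "palm": (15, 40)}
by_frequency = rank_collocates(objects_of_read, f_head=2000, n=1_000_000)
by_dice = rank_collocates(objects_of_read, f_head=2000, n=1_000_000, score="dice")
```

With these invented counts, sorting by raw frequency puts the very common word "the" first, whereas an association score such as Dice favours collocates that are typical of the headword.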

Since their introduction, word sketches have been used by lexicographers at major publishing houses to develop modern corpus-based dictionaries, including the Oxford English Dictionary[3] and the Macmillan English Dictionary.[1] They have been created for dozens of languages, including English,[1] Chinese,[4] Slovene,[5] Japanese,[6] Dutch,[7] Romanian,[8] Russian,[9] Czech,[10] Polish,[11] Vietnamese,[12] Turkish,[13] Portuguese,[14] Hindi,[15] Spanish[16] and others.[17]

Formal account

A word sketch triple consists of a headword, a grammatical relation and a collocation (e.g. man, modifier, young). With respect to an underlying text corpus, a word sketch quintuple consists of a headword, a grammatical relation, a collocation, the position of the headword in the corpus and the position of the collocation in the corpus (e.g. man, modifier, young, 104, 103). A word sketch database is a set of such triples or quintuples. It may be generated either by querying a corpus using a corpus query language[18] or by parsing the corpus with a natural language parser.[19]
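As an illustration of these definitions, the following Python sketch models triples and quintuples as named tuples and derives a word sketch for one headword from a small database of quintuples. The type and function names, the relation label "subject_of" and every corpus position other than the (104, 103) example are assumptions made for this sketch; it does not reproduce any particular system's implementation.

```python
from collections import Counter, defaultdict
from typing import NamedTuple

class Triple(NamedTuple):
    headword: str
    relation: str
    collocation: str

class Quintuple(NamedTuple):
    headword: str
    relation: str
    collocation: str
    head_pos: int   # position of the headword in the corpus
    coll_pos: int   # position of the collocation in the corpus

def to_triple(q):
    """Drop the corpus positions from a quintuple."""
    return Triple(q.headword, q.relation, q.collocation)

def word_sketch(database, headword):
    """Group a headword's collocations by grammatical relation.

    Returns relation -> list of (collocation, frequency), sorted by
    descending frequency, then alphabetically to break ties.
    """
    counts = Counter(to_triple(q) for q in database if q.headword == headword)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0].collocation))
    sketch = defaultdict(list)
    for triple, freq in ordered:
        sketch[triple.relation].append((triple.collocation, freq))
    return dict(sketch)

# A tiny hypothetical database; only the first entry comes from the text.
database = {
    Quintuple("man", "modifier", "young", 104, 103),
    Quintuple("man", "modifier", "young", 2051, 2050),
    Quintuple("man", "modifier", "old", 87, 86),
    Quintuple("man", "subject_of", "walk", 104, 105),
    Quintuple("dog", "modifier", "young", 500, 499),
}

sketch_of_man = word_sketch(database, "man")
```

Keeping the positions in the quintuples makes it possible to go back from any collocation in the sketch to the corpus lines that support it, while the triples alone are sufficient for counting.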

References

  1. Kilgarriff, Adam; Rychlý, Pavel; Smrž, Pavel; Tugwell, David (2004) The Sketch Engine. Information Technology, 2004
  2. Kilgarriff, Adam; Baisa, Vít; Bušta, Jan; Jakubíček, Miloš; Kovář, Vojtěch; Michelfeit, Jan; Rychlý, Pavel; Suchomel, Vít (2014) The Sketch Engine: Ten Years On. In Lexicography, pp. 7–36, Springer Berlin Heidelberg
  3. Jonathan Culpeper (2009) The metalanguage of impoliteness: Using Sketch Engine to explore the Oxford English Corpus. In Contemporary Corpus Linguistics
  4. Chu-Ren Huang, Adam Kilgarriff, Yiching Wu, Chih-Ming Chiu, Simon Smith, Pavel Rychlý, Ming-Hong Bai and Keh-Jiann Chen (2005) Chinese Sketch Engine and the Extraction of Grammatical Collocations. In Fourth SIGHAN Workshop on Chinese Language Processing, Korea, pp. 48–55
  5. Simon Krek and Adam Kilgarriff (2006). Slovene Word Sketches. In Proceedings 5th Slovenian Languages Technology Conference, Slovenia
  6. Irena Srdanović, Tomaž Erjavec and Adam Kilgarriff (2008) A web corpus and word sketches for Japanese. In 『自然言語処理』(Journal of Natural Language Processing) 15/2, pp. 137–159
  7. Carole Tiberius and Adam Kilgarriff (2009) The Sketch Engine for Dutch with the ANW corpus. In Fons Verborum, Festschrift for Fons Moerdijk. Instituut voor Nederlandse Lexicologie, the Netherlands, pp. 237–255
  8. Monica Macoveiciuc and Adam Kilgarriff (2010) The RoWaC Corpus and Romanian Word Sketches. In Multilinguality and Interoperability in Language Processing with Emphasis on Romanian, Romanian Academy of Sciences.
  9. Maria Khokhlova and Victor Zakharov (2010) Studying Word Sketches for Russian. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
  10. Karel Pala and Pavel Rychlý (2010) A Case Study in Word Sketches - Czech Verb vidět. In A Way with Words: Recent Advances in Lexical Theory and Analysis. A Festschrift for Patrick Hanks.
  11. Adam Radziszewski, Adam Kilgarriff and Robert Lew (2011) Polish Word Sketches. In Proceedings of the 5th Language & Technology Conference (LTC)
  12. Adam Kilgarriff and Phuong Le-Hong (2012) Vietnamese Word Sketches. In Workshop on Vietnamese Language and Speech Processing (IEEE-RIVF 9)
  13. Bharat Ram Ambati, Siva Reddy and Adam Kilgarriff (2012) Word Sketches for Turkish. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
  14. Adam Kilgarriff, Miloš Jakubíček, Jan Pomikálek, Tony Berber Sardinha and Pete Whitelock (2014) PtTenTen: A corpus for Portuguese lexicography. In Working with Portuguese Corpora, Bloomsbury Publishing
  15. Anil Krishna Eragani, Varun Kuchibhotla, Dipti Sharma, Siva Reddy and Adam Kilgarriff (2014) Hindi Word Sketches. In Proceedings of the Conference on Natural Language Processing (ICON-11)
  16. Adam Kilgarriff and Irene Renau (2013) esTenTen, a vast web corpus of Peninsular and American Spanish. In Procedia - Social and Behavioral Sciences
  17. https://www.sketchengine.co.uk/documentation/wiki/SkE/Biblio
  18. Miloš Jakubíček, Adam Kilgarriff, Diana McCarthy and Pavel Rychlý (2010) Fast syntactic searching in very large corpora for many languages. In Proceedings of Workshop on Advanced Corpus Solutions, PACLIC 24, Japan.
  19. Aleš Horák, Pavel Rychlý, Adam Kilgarriff (2009) Czech word sketch relations with full syntax parser. In After Half a Century of Slavonic Natural Language Processing.
This article is issued from Wikipedia - version of Saturday, January 16, 2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply to the media files.