Tatoeba

Tatoeba.org
URL: http://tatoeba.org/
Commercial: No
Type of site: Open collaborative multilingual sentence dictionary
Registration: Optional
Available language(s): 17 languages; content in 93 languages
Content license: Creative Commons Attribution 2.0
Owner: Trang Ho, Allan Simon
Created by: Trang Ho, Allan Simon
Launched: 2006
Current status: Online; beta

Tatoeba.org is a free online database of example sentences geared towards foreign language learners. Its name comes from the Japanese term tatoeba (例えば), meaning "for example". Unlike other online dictionaries, which focus on words, Tatoeba focuses on complete sentences, their grammatical properties, and their translations into other languages. Registration is optional and open to the public, regardless of linguistics background or second language proficiency. Tatoeba was founded by Trang Ho in 2006 and was initially hosted on SourceForge under the project name "multilangdict".[1] She maintains and administers the project with Allan Simon, who joined in 2009.[2] Tatoeba is hosted and supported by the Free Software Foundation France.[3]

Content

As of August 2011, Tatoeba's corpus has 1,000,000 sentences in 93 languages. The number of sentences in each language is listed on Tatoeba's language statistics page. The interface is available in 15 different languages. Procedures exist for contributors who want to help add new interface and content languages.

Tatoeba is also the current home of the Tanaka Corpus, a public-domain collection of about 150,000 English-Japanese sentence pairs compiled by Hyogo University professor Yasuhito Tanaka and first released in 2001. The corpus is undergoing its latest revisions at Tatoeba.[4][5]

Interface

Users, including unregistered ones, can search for words in any language to retrieve a list of sentences that use the word. Each sentence in the Tatoeba database is displayed next to its translations in other languages; direct and indirect translations are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity; each also has its own comment thread for feedback, corrections, and cultural notes from other users. Almost 13,000 sentences in 8 languages currently have audio readings. Sentences can also be browsed by language, tag, or audio.

Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. Translations are linked to the original sentence automatically. Users can freely edit their own sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Trusted users, a rank above new users, can tag, untag, link, and unlink sentences.

Database structure

Tatoeba's basic data structure is a series of nodes and links. Each sentence is a node; each link bridges two or more sentences with the same meaning.[6]
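
The node-and-link structure can be illustrated with a minimal sketch. This is not Tatoeba's actual implementation: the class name SentenceGraph, its methods, and the sample sentences are hypothetical. It assumes that links are pairwise and that an "indirect" translation is a sentence reachable through exactly one intermediate sentence.

```python
from collections import defaultdict


class SentenceGraph:
    """Hypothetical model: sentences are nodes, translation links are edges."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language code, text)
        self.links = defaultdict(set)  # sentence id -> ids of linked sentences

    def add_sentence(self, sid, lang, text):
        self.sentences[sid] = (lang, text)

    def link(self, a, b):
        # Assumption: a link is symmetric and joins exactly two sentences.
        self.links[a].add(b)
        self.links[b].add(a)

    def direct(self, sid):
        # Sentences linked straight to this one.
        return set(self.links.get(sid, set()))

    def indirect(self, sid):
        # Assumption: "indirect" means translations of direct translations.
        direct = self.direct(sid)
        two_hops = set()
        for neighbour in direct:
            two_hops |= self.links.get(neighbour, set())
        return two_hops - direct - {sid}


g = SentenceGraph()
g.add_sentence(1, "eng", "For example.")
g.add_sentence(2, "jpn", "例えば。")
g.add_sentence(3, "fra", "Par exemple.")
g.link(1, 2)  # English <-> Japanese
g.link(2, 3)  # Japanese <-> French

print(g.direct(1))    # the Japanese sentence
print(g.indirect(1))  # the French sentence, reached via Japanese
```

In this sketch the French sentence is never linked to the English one, so it appears only as an indirect translation of it.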

License

The entire Tatoeba database is published under a Creative Commons Attribution 2.0 license,[7] freeing it for academic and other use.

Acclaim

Tatoeba received a grant from Mozilla Drumbeat in December 2010.[8][9]

Usage

Parallel text corpora such as Tatoeba are used for a variety of natural language processing tasks, such as machine translation. The Tatoeba data has been used for treebanking Japanese[10] and for statistical machine translation,[11] as well as in the WWWJDIC Japanese-English dictionary.
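
As a rough illustration of how a linked sentence database can serve as a parallel corpus, the sketch below extracts aligned sentence pairs for one language pair. The function parallel_pairs, the in-memory data layout, and the sample sentences are assumptions made for this example; they do not describe Tatoeba's export format or the methods used in the cited studies.

```python
def parallel_pairs(sentences, links, src_lang, tgt_lang):
    """Return (source text, target text) tuples for directly linked sentences.

    sentences: dict mapping sentence id -> (language code, text)
    links: iterable of (id, id) tuples, treated as undirected
    """
    pairs = []
    for a, b in links:
        # Check both orientations, since a link has no inherent direction.
        for x, y in ((a, b), (b, a)):
            if sentences[x][0] == src_lang and sentences[y][0] == tgt_lang:
                pairs.append((sentences[x][1], sentences[y][1]))
    return pairs


# Hypothetical sample data.
sentences = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("eng", "Thank you."),
    5: ("jpn", "ありがとう。"),
}
links = [(1, 2), (2, 3), (5, 4)]

eng_jpn = parallel_pairs(sentences, links, "eng", "jpn")
for source, target in eng_jpn:
    print(source, "|", target)
```

Pairs of this kind are the usual input to statistical machine translation training, although real pipelines add filtering and tokenization steps that are omitted here.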

Offline edition

Selected content from Tatoeba (83,932 phrases in Esperanto, along with all their translations into other languages) appeared in the third edition of the multilingual DVD Esperanto Elektronike ("Electronic Esperanto"), published in 6,000 copies by E@I in July 2011.

References

  1. ^ "Trang's dictionary project". sourceforge.net. http://sourceforge.net/projects/multilangdict/. 
  2. ^ "Tatoeba.org, base de données de phrases d'exemple" (in French). linuxfr.org. July 17, 2010. http://linuxfr.org/news/tatoebaorg-base-de-donn%C3%A9es-de-phrases-dexemple. Retrieved March 20, 2011. 
  3. ^ "Tatoeba, un dictionnaire de langues pour phrases d'exemples [Tatoeba, a dictionary of example sentences in several languages]" (in French). fsffrance.org. Paris: FSF France. February 24, 2011. http://fsffrance.org/news/article2011-02-24.fr.html. Retrieved March 20, 2011. 
  4. ^ "Tanaka Corpus". EDRDG Wiki. Electronic Dictionary Research and Development Group. February 3, 2011. http://www.edrdg.org/wiki/index.php/Tanaka_Corpus. Retrieved March 20, 2011. 
  5. ^ Breen, Jim (March 2, 2011). "WWWJDIC - Information". WWWJDIC. Monash University. http://www.csse.monash.edu.au/~jwb/wwwjdicinf.html#examp_tags. Retrieved March 20, 2011. 
  6. ^ Ho, Trang (February 23, 2010). "How to be a good contributor in Tatoeba". Tatoeba Project Blog. http://blog.tatoeba.org/2010/02/how-to-be-good-contributor-in-tatoeba.html. Retrieved March 20, 2011. 
  7. ^ "Terms of use". Tatoeba.org. http://tatoeba.org/eng/terms-of-use. Retrieved March 20, 2011. 
  8. ^ Ho, Trang (January 17, 2011). "Grant from Mozilla Drumbeat". Tatoeba Project Blog. http://blog.tatoeba.org/2011/01/grant-from-mozilla-drumbeat.html. Retrieved March 20, 2011. 
  9. ^ Moltke, Henrik (December 30, 2010). "Best Drumbeat Projects: Tatoeba – a free and open database of sentences". Yoyodyne.cc. http://yoyodyne.cc/tatoeba/. Retrieved March 20, 2011. "...the Mozilla Foundation wants to encourage and help the Tatoeba project by giving it a USD 2.5K Mozilla Drumbeat Grant." 
  10. ^ Francis Bond, 栗林 孝行 [Takayuki Kuribayashi], 橋本 力 [Hashimoto Chikara] (2008) HPSGに基づくフリーな日本語ツリーバンクの構築 [A free Japanese Treebank based on HPSG]. In 14th Annual Meeting of The Association for Natural Language Processing, Tokyo.
  11. ^ Eric Nichols, Francis Bond, Darren Scott Appling and Yuji Matsumoto (2010) Paraphrasing Training Data for Statistical Machine Translation. Journal of Natural Language Processing, 17(3), pages 101-122.
