Machine translation quality metrics
It is possible to define objective criteria for evaluating the quality of machine translation output, for example:
1. Well-formed output. Is the output grammatical in the target language? Using an interlingua should help in this regard, because with a fixed interlingua one can write a grammatical mapping from the interlingua to the target language. Consider the following Arabic-language input and the English translation produced by the Google translator as of 27 December 2006. The output does not parse under any reasonable English grammar:
وعن حوادث التدافع عند شعيرة رمي الجمرات -التي كثيرا ما يسقط فيها العديد من الضحايا- أشار الأمير نايف إلى إدخال "تحسينات كثيرة في جسر الجمرات ستمنع بإذن الله حدوث أي تزاحم". ==> And incidents at the push Carbuncles-throwing ritual, which often fall where many of the victims - Prince Nayef pointed to the introduction of "many improvements in bridge Carbuncles God would stop the occurrence of any competing."
2. Semantics preservation. Do repeated re-translations preserve the semantics of the original sentence? For example, consider the following English input passed multiple times into and out of French using the Google translator as of 27 December 2006:
Better a day earlier than a day late. ==> Améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late. ==> Pour améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late.
A similar objective criterion is:
3. Stationarity or canonical form. Do repeated translations converge on a single expression in both languages? In the example above, the translation does become stationary, although the original meaning is lost. See Round-trip translation for further discussion, and the sketch after this list for a simple automated check. This metric has been criticized as correlating poorly with Bilingual Evaluation Understudy (BLEU) scores.[1]
4. Adaptability to colloquialisms, argot, or slang. French has many rules for creating words in the speech and writing of popular culture. Two such rules are: (a) the reverse spelling of words, such as femme to meuf (this is called verlan); (b) the attachment of the suffix -ard to a noun or verb to form a new noun. For example, the noun faluche means "student hat". The word faluchard, formed from faluche, can colloquially mean, depending on context, "a group of students", "a gathering of students", or "behaviour typical of a student". The Google translator as of 28 December 2006 does not handle words constructed by, for example, rule (b), as shown here:
Il y a une chorale falucharde mercredi, venez nombreux, les faluchards chantent des paillardes! ==> There is a choral society falucharde Wednesday, come many, the faluchards sing loose-living women!
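Criteria 2 and 3 lend themselves to a simple automated check. The following Python sketch is only illustrative: the translate argument stands in for whatever machine translation system is under test, and its signature translate(text, source_lang, target_lang) is an assumption made for the sketch, not a real library API. The loop repeats a round trip until the source-language text stops changing or a maximum number of rounds is reached.

# Sketch of an automated round-trip check for criteria 2 and 3.
# 'translate' is supplied by the caller as a function
# translate(text, source_lang, target_lang) -> str; this signature is an
# assumption for the sketch, not a real library API.

def round_trip_history(sentence, translate, src="en", tgt="fr", max_rounds=10):
    """Repeatedly translate src -> tgt -> src until the src-language text
    stops changing (criterion 3, stationarity) or max_rounds is exhausted.
    Whether history[-1] still means the same as history[0] is criterion 2
    (semantics preservation) and needs a separate human or automatic judgment."""
    history = [sentence]
    current = sentence
    for _ in range(max_rounds):
        forward = translate(current, src, tgt)
        back = translate(forward, tgt, src)
        history.append(back)
        if back == current:  # reached a stationary (canonical) form
            break
        current = back
    return history

Applied to the English/French example above, such a check would report that the text becomes stationary at "To improve one day earlier than a day late." after one round, that is, the translation satisfies criterion 3 while failing criterion 2.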
The United States National Institute of Standards and Technology conducts annual evaluations of machine translation systems based on the BLEU-4 criterion (Papineni et al. 2002). A combined framework called IQmt, which incorporates BLEU and the additional metrics NIST, GTM, ROUGE, and METEOR, has been implemented by Gimenez and Amigo (2005).
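As a rough illustration of what the BLEU criterion measures, the following Python sketch computes a simplified sentence-level BLEU-4 score for a single candidate translation against a single reference, combining modified n-gram precision with a brevity penalty. Real evaluations such as NIST's aggregate counts over a whole corpus and typically use several reference translations, so this is a sketch of the idea rather than the official scoring procedure.

# Simplified, illustrative sentence-level BLEU-4 (one candidate, one reference).
# Corpus-level evaluation, multiple references, and proper smoothing are
# omitted for brevity.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # modified precision: clip each candidate n-gram count by its reference count
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # tiny floor avoids log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    # brevity penalty: penalise candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

# Scoring the round-tripped output above against the original sentence:
print(bleu4("To improve one day earlier than a day late",
            "Better a day earlier than a day late"))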
Notes
1. Somers, H. (2005). "Round-trip Translation: What Is It Good For?"
References
- Gimenez, Jesus and Enrique Amigo. (2005). IQmt: A Framework for Machine Translation Evaluation.
- Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the ACL, July 2002, pp. 311–318.