Asia Online
From Wikipedia, the free encyclopedia
Asia Online is a Thailand-based company undertaking what it calls the world's largest literacy project: translating vast quantities of the world's English-language knowledge into Asian languages. It does this using statistical machine translation (SMT) technologies developed and enhanced in Thailand, with a specific focus on Asian languages.
It was founded in 2006 by Philipp Koehn of the University of Edinburgh; Gregory Binger, a leading technologist and IT/IP lawyer; and former Gartner senior analysts Bob Hayward and Dion Wiggins.
Asia Online’s statistically based translation software is an instance of a recent advance in automated translation. Earlier machine translation technology relied on collections of linguistic rules to analyze the source sentence and then map its syntactic and semantic structure into the target language. Asia Online instead uses statistical techniques from cryptography, applying machine learning algorithms that automatically acquire statistical models from existing parallel collections of human translations.
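As a rough illustration of how a statistical model can be acquired from parallel human translations, the sketch below trains a tiny word-translation table with expectation-maximization in the style of IBM Model 1, a textbook SMT technique. It is not Asia Online's system: the function names, the three-sentence English-German corpus, and the iteration count are all assumptions chosen for the example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t(f | e), the probability that source word e translates to f.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Textbook IBM Model 1 EM, shown for illustration only.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference at all

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for f in tgt:
                # E-step: share one unit of credit for f among the source words.
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e] for (f, e) in count})
    return t


def best_translation(model, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in model.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus (English -> German).
corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
model = train_ibm_model1(corpus)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(model, word))
```

No dictionary or grammar rule is supplied; the word correspondences emerge only from which words co-occur across the aligned sentence pairs, which is the contrast with rule-based systems drawn above.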
Until early 2008, Google, Microsoft and Language Weaver had publicly available SMT systems. Asia Online claims that the existing processes and techniques of SMT have flaws, and it has worked to resolve them. It claims three key differences from traditional SMT approaches:
- Clean data - The traditional approach leveraged content found on the web in corporate sites, news articles and other similar sources where the same content was available in multiple languages. The quality of the data was very low. Asia Online has focussed machine and human resources in this area to ensure that the data is as clean and as accurate as possible. Data is sourced from high quality translations provided by book publishers and translation companies and is aligned at the segment level (usually sentences) and converted into a consistent format in order to be processed by the learning software. This step includes:
  - Extracting segments from files and documents if they are not in a TMX format.
  - Aligning segments (if necessary) once they have been extracted. While this is automated by machines, humans are also used to validate the accuracy.
  - Converting data to a base UTF-8 encoding for training the SMT system.
  - Extracting small subsets from the data to guide training.
  - Reviewing, cleaning and analyzing the data to ensure optimal training impact.
- Multiple Domains - Extensive efforts have been put into a system that allows for training in many domains. This is done by extending a base set of information with multiple additional learning sources.
- Real Time Corrections
- Languages Available - Asia Online currently has 110 European language pairs available in a baseline form. These systems are currently used to build customized translation systems for corporate and language service provider (LSP) customers who add their bilingual parallel corpus to the existing data to create higher quality translation systems. These available languages include English, French, Italian, German, Spanish, Portuguese, Dutch, Swedish, Danish, Greek, and Finnish.
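The data preparation steps listed under "Clean data" can be sketched roughly as follows, assuming the input is already a sentence-aligned TMX file (so the alignment step is skipped). The function names, the sample English-Indonesian data, and the cleaning rules (whitespace and Unicode normalization, dropping empty and duplicate segments) are illustrative assumptions, not Asia Online's actual pipeline.

```python
import random
import unicodedata
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample data standing in for a publisher-supplied TMX file.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" segtype="sentence"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good  morning.</seg></tuv>
      <tuv xml:lang="id"><seg>Selamat pagi.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="id"><seg>Terima kasih.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="id"><seg>Selamat pagi.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Untranslated.</seg></tuv>
      <tuv xml:lang="id"><seg></seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_segments(tmx_text, src_lang, tgt_lang):
    """Pull (source, target) segment pairs out of TMX translation units."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def clean(pairs):
    """Normalize text and drop empty or duplicate pairs, keeping input order."""
    def normalize(text):
        return " ".join(unicodedata.normalize("NFC", text).split())

    seen, cleaned = set(), []
    for src, tgt in pairs:
        pair = (normalize(src), normalize(tgt))
        if pair[0] and pair[1] and pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    return cleaned


def split_held_out(pairs, held_out_size, seed=0):
    """Set aside a small random subset; return (training, held_out)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[held_out_size:], shuffled[:held_out_size]


def to_utf8(pairs):
    """Serialize pairs as tab-separated lines encoded in UTF-8."""
    return "".join(f"{src}\t{tgt}\n" for src, tgt in pairs).encode("utf-8")


cleaned = clean(extract_segments(SAMPLE_TMX, "en", "id"))
train, held_out = split_held_out(cleaned, held_out_size=1)
print(to_utf8(cleaned).decode("utf-8"))
```

In practice the human validation described above would sit between extraction and cleaning; the sketch only covers the parts that are straightforward to automate.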
Asia Online is also building SMT systems for English to Thai, Indonesian, Hindi, Malay, Vietnamese, Tagalog, Traditional and Simplified Chinese, Japanese and Korean. The company plans to have 440 different language pairs available by the end of 2008.
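The figure of 110 baseline pairs matches what one gets by counting every ordered (source, target) combination of the eleven European languages listed. That counting convention is an assumption here, not something the company states, and the helper name below is hypothetical.

```python
from itertools import permutations

EUROPEAN_LANGUAGES = [
    "English", "French", "Italian", "German", "Spanish", "Portuguese",
    "Dutch", "Swedish", "Danish", "Greek", "Finnish",
]


def directed_pairs(languages):
    """List every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


pairs = directed_pairs(EUROPEAN_LANGUAGES)
print(len(pairs))  # 11 * 10 = 110
```

Under this convention, English to French and French to English count as two separate pairs, since an SMT system is typically trained per translation direction.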