Moses for Mere Mortals
Moses for Mere Mortals (MMM)[1] is a free, open-source set of scripts that automates the installation and operation of the Moses Open Source Translation System, a statistical machine translation system.
MMM builds a translation chain prototype with Moses + IRSTLM + RandLM + MGIZA.[2][3]
The first version of Moses for Mere Mortals (MMM) was published in November 2009, and it has since been updated and tested on Ubuntu Linux distributions. MMM is available on GitHub.[4]
Overview
Its main aims are to:
- help build a prototype of a translation chain for the real world;
- guide the first steps of users that are just beginning to use Moses;
- enable a simple and quick evaluation of Moses;
- enable users to produce their own translations without having to trust third (translating) parties;
- integrate machine translation and translation memories.
Even though the main focus is on Linux, two add-ins (collectively called “Windows add-ins”) help bridge the gap from MS Windows to Linux and back again.
General features
Moses allows the training of corpora where every word is presented together with, for instance, its respective lemma and/or part of speech tag (“factored training”). The scripts do not cover this type of training.
MMM consists of seven scripts for Linux, thoroughly tested with Ubuntu (12.04 and 14.04, 64-bit):
- Install: To install in Ubuntu the packages on which both Moses and Moses for Mere Mortals depend.
- Create: To compile Moses and the other required packages with a single command.
- Make-test-files: To extract from the original corpus a corpus for training, files for tuning and files for testing the results of the training.
- Train: To train the required language pairs; Moses is language-independent and works with any language or alphabet.
- Translate: To produce machine translations of new documents.
- Score: To automatically score Moses translations against a human translation taken as a gold standard (BLEU and NIST metrics), giving an idea of the level of performance.
- Transfer training-to-another-location: To transfer engines/trainings to other folders in the same computer or to a different computer.
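To illustrate what a BLEU score from the Score step measures, the following is a minimal sketch of a simplified sentence-level BLEU. It is not the scorer used by MMM or Moses: it assumes whitespace tokenisation, a single reference and no smoothing, and the function names `ngram_counts` and `simple_bleu` are hypothetical. BLEU as usually reported aggregates n-gram counts over a whole test set rather than a single sentence.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU of one candidate against one reference (0.0 to 1.0).

    Assumptions: whitespace tokenisation, uniform n-gram weights, no smoothing.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if matches == 0:
            # Without smoothing, a single zero precision gives a zero score.
            return 0.0
        log_precision_sum += math.log(matches / sum(cand_counts.values())) / max_n

    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum)
```

A machine translation identical to the human reference scores 1.0 under this sketch, and one sharing no words with it scores 0.0; real scorers add proper tokenisation and corpus-level aggregation.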
MMM comes with a 200,000-segment demonstration corpus. This is too small to do justice to the quality that can be achieved with Moses, but it gives a realistic view of the relative duration of the steps involved and is useful for testing whether the installation was done correctly. Good results generally require a corpus of several million segments. Each original corpus consists of two strictly aligned UTF-8 files, one in the source language and the other in the target language. No grammar knowledge is required, though some language pairs give better results than others (in general, morphologically rich languages give poorer results).
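Because the two corpus files must be strictly aligned, a quick check before training can catch an obvious mismatch early. The sketch below is an illustration only and is not part of MMM; the function names `count_segments` and `check_aligned` are hypothetical, and it assumes one segment per line. Equal line counts are necessary for alignment but do not prove that each pair of lines is actually a translation of one another.

```python
def count_segments(path):
    """Count the lines of a file, decoding it strictly as UTF-8."""
    # newline="\n" keeps a stray carriage return from being read as a line break.
    with open(path, encoding="utf-8", newline="\n") as handle:
        return sum(1 for _ in handle)


def check_aligned(source_path, target_path):
    """Return the number of segments if both files are valid UTF-8 and have
    the same number of lines; raise ValueError otherwise."""
    try:
        n_source = count_segments(source_path)
        n_target = count_segments(target_path)
    except UnicodeDecodeError as exc:
        raise ValueError("a corpus file is not valid UTF-8") from exc
    if n_source != n_target:
        raise ValueError(
            f"segment counts differ: {n_source} source vs {n_target} target"
        )
    return n_source
```

For example, `check_aligned("corpus.en", "corpus.pt")` (hypothetical file names) would return the segment count for a well-formed pair of files and raise an error for a pair with different lengths.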
Add-ins
MMM also contains (for Windows and Linux):
- Extract_TMX_Corpus: An application for the conversion of one or more files in TMX format into two parallel and perfectly aligned files (in the source and target languages) needed for the training of a language pair.
- Moses2TMX: An application that aligns originals with their Moses translations and packages each file as a TMX file with specific attributes, so that Moses translations can be used with translation memory tools and are identified as MT (with a penalty relative to human memories) and as having been translated by Moses.
MMM also contains the file Nonbreaking_prefix.pt, a list of abbreviations specific to the Portuguese language based on the English and German versions already available with the Moses package.
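The kind of conversion performed by Extract_TMX_Corpus, from TMX translation units to two aligned lists of segments, can be sketched as follows. This is not the add-in's actual code: the function name `tmx_to_parallel` is hypothetical, language codes are matched exactly (ignoring case), and units lacking either language are simply skipped.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_parallel(tmx_text, source_lang, target_lang):
    """Return (source_segments, target_segments) as two aligned lists."""
    root = ET.fromstring(tmx_text)
    source_segments, target_segments = [], []
    for unit in root.iter("tu"):
        by_lang = {}
        for variant in unit.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (variant.get(XML_LANG) or variant.get("lang") or "").lower()
            seg = variant.find("seg")
            if seg is None:
                continue
            # itertext() gathers all text inside <seg>, including text inside
            # inline elements; split/join collapses whitespace and line breaks
            # so that each segment occupies exactly one output line.
            by_lang[lang] = " ".join("".join(seg.itertext()).split())
        source = by_lang.get(source_lang.lower())
        target = by_lang.get(target_lang.lower())
        # Keep a unit only when both sides are present and non-empty,
        # so the two lists stay aligned index by index.
        if source and target:
            source_segments.append(source)
            target_segments.append(target)
    return source_segments, target_segments
```

Writing each list to its own UTF-8 file, one segment per line, would give the pair of parallel files that training expects. A fuller tool would also need to handle regional variants such as "en-US" versus "en".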
Some MMM original features
Moses for Mere Mortals also has some original features:
- Removes control characters from the input files (these can crash a training);
- Extracts from the corpus files 2 training files, 2 tuning files and 2 testing files (one in the source and one in the target language) with randomly selected, non-consecutive segments that are erased from the corpus files;
- A new training does not interfere with the files of a previous training;
- A new training reuses as much as possible the files created in previous trainings (thus saving time);
- Stops with an informative message if any of the phases of training (language model building, recaser training, corpus training, memory-mapping, tuning or training test) doesn’t produce the expected results;
- Can limit the duration of tuning to a specified number of iterations;
- Can generate, in a single step, the BLEU and NIST scores for one translation or a set of translations present in a directory (either for each whole document or for each segment of each document);
- Allows the transfer of corpus trainings to another computer or to another installation in the same computer;
- Allows the mkcls, GIZA and MGIZA parameters to be controlled through parameters in the train script;
- Allows selected parameters in the Moses scripts and the Moses decoder to be controlled through the train and translate scripts.
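Two of the features listed above, control-character removal and the extraction of randomly selected, non-consecutive tuning and test segments, can be illustrated with a short sketch. It is not taken from the MMM scripts: the names `strip_control_chars` and `split_corpus` are hypothetical, the exact set of characters MMM removes is an assumption, and "non-consecutive" is read here as "no two held-out segments are adjacent in the corpus".

```python
import random
import re

# Assumed set: C0 control characters other than tab, line feed and
# carriage return, plus DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(line):
    """Remove control characters from one segment."""
    return CONTROL_CHARS.sub("", line)


def split_corpus(pairs, n_tune, n_test, seed=0):
    """Split aligned (source, target) pairs into (train, tune, test).

    Held-out segments are chosen at random so that no two of them are
    adjacent, and they are removed from the training portion.
    """
    total = len(pairs)
    held = n_tune + n_test
    if held > total - held + 1:
        raise ValueError("corpus too small for non-adjacent held-out segments")

    rng = random.Random(seed)
    # Pick `held` distinct values from a shortened range, then spread them
    # out by adding 0, 1, 2, ... so neighbouring picks differ by at least 2.
    base = sorted(rng.sample(range(total - held + 1), held))
    indices = [value + offset for offset, value in enumerate(base)]
    rng.shuffle(indices)
    tune_idx = set(indices[:n_tune])
    test_idx = set(indices[n_tune:])

    train, tune, test = [], [], []
    for i, pair in enumerate(pairs):
        if i in tune_idx:
            tune.append(pair)
        elif i in test_idx:
            test.append(pair)
        else:
            train.append(pair)
    return train, tune, test
```

Splitting pairs rather than two separate lists keeps the source and target sides aligned by construction, and the fixed seed makes the split reproducible.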
References
- "moses-for-mere-mortals". Retrieved 2014-11-28.
- "Welcome to Moses!". Retrieved 2012-01-29.
- "mosesdecoder". Retrieved 2012-01-29.
- "moses-for-mere-mortals". GitHub. Retrieved 2014-11-28.