Automatic acquisition of lexicon

Automatic acquisition of a lexicon is a computational process for developing a complex morphological lexicon of a language. Such a lexicon is essential for natural language processing (NLP) and is a prerequisite for any wide-coverage parser.^[1] The process has two main inputs: a raw corpus and a morphological description of the language. The aim is to produce lemmas that account for all the words occurring in the corpus. To obtain a high-quality lexicon, the generated lemmas must be validated manually and the whole process iterated several times. The process focuses on the open word classes (e.g. nouns, adjectives, verbs); closed classes (e.g. prepositions, pronouns, numerals) are excluded. The method is applicable to languages with rich morphology, such as Slovak, Russian or Croatian.

Slovak is an inflectional language, and when the method is applied to it, the acquisition covers derivational morphology as well as inflectional morphology. As a result, users can find information about derivational relations (e.g. adjectivizations, prefixes) in the lexicon. For example, the Slovak word korpusový is an adjectivization of korpus (English: corpus).

Three-step loop

According to Benoît Sagot,^[1] the acquisition of lemmas involves three stages:

1. Generation and inflection
2. Ranking
3. Manual validation

The more iterations are performed, the more accurate the resulting lexicon becomes. Each iteration relies on the information supplied by the manual validator.

Generation and inflection

First, all words belonging to the closed word classes (pronouns, prepositions, numerals) are manually excluded from the given corpus. The number of their occurrences in the corpus is provided. Automatic generation then follows: hypothetical lemmas are created according to the morphological description of the language. The generated lemmas are then inflected, so that all of their inflected forms are built. Each resulting form is associated with its lemma and a morphological tag.
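The generation and inflection step can be illustrated with a minimal sketch. It assumes a toy morphological description in which each paradigm is a list of endings with tags. The paradigm names, endings, tags and function names below are hypothetical and are not taken from the cited work, nor are they a real description of Slovak.

```python
# Toy morphological description (an assumption made for illustration only).
PARADIGMS = {
    "noun_masc": {
        "lemma_suffix": "",
        "endings": [("", "sg.nom"), ("u", "sg.gen"), ("y", "pl.nom")],
    },
    "adj": {
        "lemma_suffix": "ý",
        "endings": [("ý", "m.sg.nom"), ("á", "f.sg.nom"), ("é", "n.sg.nom")],
    },
}


def generate_lemmas(form):
    """Return every (lemma, paradigm) pair that could have produced `form`."""
    candidates = set()
    for name, paradigm in PARADIGMS.items():
        for ending, _tag in paradigm["endings"]:
            # A form must keep a non-empty stem once the ending is removed.
            if form.endswith(ending) and len(form) > len(ending):
                stem = form[: len(form) - len(ending)]
                candidates.add((stem + paradigm["lemma_suffix"], name))
    return candidates


def inflect(lemma, paradigm_name):
    """Build all inflected forms of a hypothetical lemma, each with its tag."""
    paradigm = PARADIGMS[paradigm_name]
    suffix = paradigm["lemma_suffix"]
    stem = lemma[: len(lemma) - len(suffix)]
    return [(stem + ending, tag) for ending, tag in paradigm["endings"]]


def build_hypotheses(corpus_forms):
    """Map each hypothetical (lemma, paradigm) to its list of (form, tag)."""
    table = {}
    for form in corpus_forms:
        for lemma, name in generate_lemmas(form):
            table[(lemma, name)] = inflect(lemma, name)
    return table
```

With this toy description, the form korpusu yields two hypothetical noun lemmas, korpus and korpusu. Only the first is correct, which is why a ranking step is needed afterwards.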

Ranking

A probabilistic model, computed by a fixed-point algorithm, is used to rank the hypothetical lemmas generated in the first step. The best-ranked lemmas are expected, ideally, to all be correct, whereas the lowest-ranked lemmas tend to be incorrect.
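The source does not give the details of the probabilistic model, so the following is only a simplified stand-in that shows the general shape of a fixed-point ranking. It assumes that each attested form shares its occurrence count among the lemmas that could explain it, in proportion to their current scores, and that a lemma is penalized for the number of forms it would generate. The function name, parameters and scoring rule are all assumptions.

```python
def rank_lemmas(form_counts, lemma_forms, max_iterations=100, tolerance=1e-9):
    """Rank hypothetical lemmas by iterating a score update to a fixed point.

    form_counts: attested form -> number of occurrences in the corpus
    lemma_forms: hypothetical lemma -> set of forms it would generate
    Returns a list of (lemma, score) pairs, best first.
    """
    if not lemma_forms:
        return []
    scores = {lemma: 1.0 / len(lemma_forms) for lemma in lemma_forms}
    for _ in range(max_iterations):
        updated = {lemma: 0.0 for lemma in lemma_forms}
        for form, count in form_counts.items():
            explainers = [l for l, forms in lemma_forms.items() if form in forms]
            total = sum(scores[l] for l in explainers)
            if total == 0:
                continue  # no hypothetical lemma explains this form
            for l in explainers:
                # Share the occurrences in proportion to the current scores.
                updated[l] += count * scores[l] / total
        for l in updated:
            # Penalize lemmas that predict many forms (assumed heuristic).
            updated[l] /= max(len(lemma_forms[l]), 1)
        norm = sum(updated.values())
        if norm == 0:
            break
        updated = {l: s / norm for l, s in updated.items()}
        change = max(abs(updated[l] - scores[l]) for l in scores)
        scores = updated
        if change < tolerance:
            break  # the scores no longer move: a fixed point is reached
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts and candidate lemmas, for illustration only.
ranking = rank_lemmas(
    {"korpus": 10, "korpusu": 4, "korpusy": 3},
    {
        "korpus": {"korpus", "korpusu", "korpusy"},
        "korpusu": {"korpusu", "korpusuu", "korpusuy"},
    },
)
```

In this example the lemma korpus explains all three attested forms, while the spurious lemma korpusu explains only one of its three predicted forms, so korpus is ranked first.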

Manual validation

The correctness of the best-ranked lemmas from the previous step is checked by a manual validator, who should be a native speaker. At this stage, lemmas are divided into three categories:

- valid lemmas, which are appended to the lexicon
- erroneous lemmas generated by valid forms (these forms are later associated with other lemmas)
- erroneous lemmas generated by invalid forms (these need to be excluded)
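The three categories can be modelled as verdicts that decide what happens to each lemma and its attested forms before the next iteration. This is a sketch under assumptions: the names `Verdict` and `apply_verdicts` are hypothetical, and the third category is read here as excluding the invalid forms from further processing.

```python
from enum import Enum


class Verdict(Enum):
    VALID = "valid lemma"
    BAD_LEMMA_VALID_FORMS = "erroneous lemma generated by valid forms"
    BAD_LEMMA_INVALID_FORMS = "erroneous lemma generated by invalid forms"


def apply_verdicts(verdicts, attested_forms):
    """Split validated lemmas into lexicon entries, forms to retry, and
    forms to exclude.

    verdicts: lemma -> Verdict given by the manual validator
    attested_forms: lemma -> set of corpus forms that produced the lemma
    """
    lexicon, retry_forms, excluded_forms = set(), set(), set()
    for lemma, verdict in verdicts.items():
        forms = attested_forms.get(lemma, set())
        if verdict is Verdict.VALID:
            lexicon.add(lemma)
        elif verdict is Verdict.BAD_LEMMA_VALID_FORMS:
            # The forms are real words; they still need another lemma.
            retry_forms |= forms
        else:
            # The forms themselves are invalid, so they are dropped.
            excluded_forms |= forms
    return lexicon, retry_forms, excluded_forms
```

The forms in the second set would be fed back into the next iteration of the loop, while those in the third set would be removed from the input.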

Future development

Compared with purely manual development of lexicons, automatic acquisition appears promising for future development because it requires only a short validation time and a relatively small amount of human labor.

References

1. Sagot, Benoît. Automatic acquisition of a Slovak Lexicon from a Raw Corpus.

External links

Benoît Sagot's publications