Language guessing

Language identification, or language guessing, is the process of automatically determining the natural language in which a document or piece of text is written.

Language identification is a subtask of natural language processing (NLP), and any algorithm for the more general task of text classification can be applied to it. One of the most popular approaches, however, is a specialized method devised by Cavnar and Trenkle, based on character-level n-gram statistics.^[1] An older method by Grefenstette was based on the prevalence of certain function words (e.g., *the* in English).
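The character n-gram approach can be illustrated with a minimal Python sketch. It is written in the spirit of Cavnar and Trenkle's rank-order ("out-of-place") comparison of n-gram frequency profiles, not as a reproduction of their implementation. The function names, the underscore padding, the tie-breaking rule, and the two tiny training sentences are all assumptions made for illustration. Real profiles would be built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n),
    ordered from most to least frequent."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries (an assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so results are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles. An n-gram missing
    from the language profile incurs a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Toy training data, far too small for real use (illustrative only).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "die katze sitzt auf der matte mit den anderen tieren",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the cat and the dog", PROFILES))
```

With profiles this small, guesses on unseen text are unreliable. The sketch is meant only to show the mechanics: build a ranked n-gram profile per language, build one for the input, and pick the language with the smallest rank distance.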

References

^ William B. Cavnar and John M. Trenkle (1994), N-gram-based text categorization, SDAIR.