Text normalization

From Wikipedia, the free encyclopedia

Text normalization is the process of transforming text into a single, consistent form that it may not have had before. It is often performed before a text is processed further, for example when generating synthesized speech, translating between languages automatically, storing text in a database, or comparing strings.

Examples of text normalization:

  • Unicode normalization (see the Python sketch following this list)
    • Unicode NFD (Normalization Form Canonical Decomposition), in which characters are canonically decomposed, so that a base character and its combining accents become separate code points.
    • Unicode NFC (Normalization Form Canonical Composition), in which base characters and combining accents are canonically composed. This is the result of composing an NFD sequence.
    • Unicode NFKD (Normalization Form Compatibility Decomposition), a decomposed form similar to NFD, except that certain lookalike characters (such as half-width and full-width variants of Kana characters) are mapped together. Stated another way, characters that were included in the character set for compatibility reasons are replaced by their compatible equivalents.
    • Unicode NFKC (Normalization Form Compatibility Composition), which results from composing an NFKD sequence, replacing all decomposed sequences by primary composite characters where possible.
  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations
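
The difference between the canonical and compatibility forms can be illustrated with Python's standard unicodedata module. The following is a minimal sketch; the example strings are chosen only for illustration.

    import unicodedata

    # "é" can be encoded as one precomposed code point (U+00E9) or as
    # "e" followed by a combining acute accent (U+0301).
    composed = "\u00e9"      # é as a single code point
    decomposed = "e\u0301"   # é as base letter + combining accent

    print(composed == decomposed)   # False: different code point sequences

    # NFC composes and NFD decomposes; after normalizing both strings to
    # the same form, they compare equal.
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True

    # The compatibility forms additionally fold lookalike variants, e.g.
    # the full-width Latin letter "Ａ" (U+FF21) to the ordinary "A".
    print(unicodedata.normalize("NFKC", "\uff21"))  # A
    print(unicodedata.normalize("NFC", "\uff21"))   # Ａ (canonical forms keep it)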

While normalization may be performed by hand, as it usually is for ad hoc and personal documents, many programming languages provide mechanisms that support text normalization, as the sketch below shows.
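
As one sketch of how such mechanisms can be combined, the following Python function lower-cases text, strips ASCII punctuation, and removes diacritics by decomposing characters with NFD and then dropping the combining marks. The function name and the example input are hypothetical, and this is an illustrative pipeline rather than a standard API.

    import string
    import unicodedata

    def normalize_text(text):
        """Illustrative pipeline: lower-case the text, strip ASCII
        punctuation, and remove combining accent marks."""
        text = text.lower()
        # str.translate with a deletion table removes ASCII punctuation;
        # note that string.punctuation does not cover non-ASCII punctuation.
        text = text.translate(str.maketrans("", "", string.punctuation))
        # Decompose accented characters (NFD), then drop the combining marks.
        text = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in text if not unicodedata.combining(ch))

    print(normalize_text("Crème Brûlée, please!"))  # creme brulee please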
