Text normalization
Text normalization is the process of transforming text into a single, consistent form that it may not have had before. It is often performed before a text is processed further, for example when generating synthesized speech, translating between languages automatically, storing text in a database, or comparing strings.
Examples of text normalization include:
- Unicode normalization (a Python sketch follows this list):
  - Unicode NFD (Normalization Form Canonical Decomposition), in which a precomposed character is canonically decomposed into its base character and combining accents, usually as separate code points.
  - Unicode NFC (Normalization Form Canonical Composition), in which the base character and combining accents are canonically composed; this is the result of composing an NFD sequence.
  - Unicode NFKD (Normalization Form Compatibility Decomposition), a decomposed form similar to NFD, except that certain look-alike characters (such as the half-width and full-width variants of kana characters) are mapped together. Stated another way, characters included in the character set for compatibility reasons are replaced by their compatible equivalents.
  - Unicode NFKC (Normalization Form Compatibility Composition), the result of composing an NFKD sequence, replacing decomposed sequences with primary composite characters where possible.
- converting all letters to lower or upper case
- removing punctuation
- stripping accent marks and other diacritics from letters
- expanding abbreviations
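For illustration, the four Unicode normalization forms can be applied with Python's standard unicodedata module; a minimal sketch:

```python
import unicodedata

s = "\u00e9"  # 'é' as a single precomposed code point (U+00E9)

# NFD decomposes the precomposed character into a base letter
# plus a combining acute accent: U+0065, U+0301.
nfd = unicodedata.normalize("NFD", s)
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x301']

# NFC recomposes the NFD sequence back into the single code point.
nfc = unicodedata.normalize("NFC", nfd)
print([hex(ord(c)) for c in nfc])   # ['0xe9']

# NFKC additionally folds compatibility characters: the half-width
# katakana 'ｶ' (U+FF76) becomes 'カ' (U+30AB).
print(unicodedata.normalize("NFKC", "\uff76") == "\u30ab")  # True
```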
While normalization may be performed manually, and usually is for ad hoc and personal documents, many programming languages provide built-in mechanisms for text normalization.
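As a sketch of the simpler transformations listed above, the following Python function lowercases text, expands abbreviations from a small table (the ABBREVIATIONS dictionary here is purely illustrative, not a standard resource), strips diacritics via NFD decomposition, and removes ASCII punctuation:

```python
import string
import unicodedata

# Hypothetical abbreviation table; a real system would use a
# much larger, domain-specific dictionary.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}

def normalize(text: str) -> str:
    # Convert all letters to lower case.
    text = text.lower()
    # Expand abbreviations (naive whole-word lookup).
    text = " ".join(ABBREVIATIONS.get(w, w) for w in text.split())
    # Strip accents: decompose to NFD, then drop combining marks.
    text = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(c)
    )
    # Remove ASCII punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalize("Dr. Müller, café owner"))
# -> "doctor muller cafe owner"
```

Note that the order of the steps matters: abbreviations are expanded before punctuation is removed, because the lookup keys themselves contain periods.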