Text normalization

From Wikipedia, the free encyclopedia

Text normalization is the process of transforming text into a single, consistent form that it may not have had before. It is often performed before a text is processed further, for example when generating synthesized speech, translating between languages automatically, storing text in a database, or comparing strings.

Examples of text normalization:

  • Unicode normalization (see the Python sketch following this list)
    • Unicode NFD (Normalization Form Canonical Decomposition), in which characters are canonically decomposed, so that a base character and its combining accents become separate code points.
    • Unicode NFC (Normalization Form Canonical Composition), in which base characters and combining accents are canonically composed. This is the result of composing an NFD sequence.
    • Unicode NFKD (Normalization Form Compatibility Decomposition), a decomposed form similar to NFD, except that certain lookalike characters (such as half-width and full-width variants of Kana characters) are mapped together. Stated another way, characters that were included in the character set for compatibility reasons are replaced by their compatible equivalents.
    • Unicode NFKC (Normalization Form Compatibility Composition), which results from composing an NFKD sequence, replacing all decomposed sequences by primary composite characters where possible.
  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations
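
The difference between the canonical and compatibility forms can be illustrated with Python's standard unicodedata module. The following is a minimal sketch; the example strings are chosen only for illustration.

    import unicodedata

    # "é" can be encoded as one precomposed code point (U+00E9) or as
    # "e" followed by a combining acute accent (U+0301).
    composed = "\u00e9"      # é as a single code point
    decomposed = "e\u0301"   # é as base letter + combining accent

    print(composed == decomposed)   # False: different code point sequences

    # NFC composes and NFD decomposes; after normalizing both strings to
    # the same form, they compare equal.
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True

    # The compatibility forms additionally fold lookalike variants, e.g.
    # the full-width Latin letter "Ａ" (U+FF21) to the ordinary "A".
    print(unicodedata.normalize("NFKC", "\uff21"))  # A
    print(unicodedata.normalize("NFC", "\uff21"))   # Ａ (canonical forms keep it)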

While normalization may be performed by hand, as it usually is for ad hoc and personal documents, many programming languages provide mechanisms that support text normalization, as the sketch below shows.
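
As one sketch of how such mechanisms can be combined, the following Python function lower-cases text, strips ASCII punctuation, and removes diacritics by decomposing characters with NFD and then dropping the combining marks. The function name and the example input are hypothetical, and this is an illustrative pipeline rather than a standard API.

    import string
    import unicodedata

    def normalize_text(text):
        """Illustrative pipeline: lower-case the text, strip ASCII
        punctuation, and remove combining accent marks."""
        text = text.lower()
        # str.translate with a deletion table removes ASCII punctuation;
        # note that string.punctuation does not cover non-ASCII punctuation.
        text = text.translate(str.maketrans("", "", string.punctuation))
        # Decompose accented characters (NFD), then drop the combining marks.
        text = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in text if not unicodedata.combining(ch))

    print(normalize_text("Crème Brûlée, please!"))  # creme brulee please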
