Unicode equivalence

From Wikipedia, the free encyclopedia

Unicode contains numerous characters included to maintain compatibility with existing standards, some of which are functionally equivalent to other characters or sequences of characters. Unicode therefore defines certain characters and sequences as equivalent. For example, the letter n followed by the combining tilde (U+0303) is equivalent to the single precomposed character ñ (U+00F1). Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence.

1 Canonical Equivalence
2 Compatibility Equivalence
3 Visual ambiguity
4 See also
5 References

Canonical Equivalence

Canonical equivalence is the narrower form of equivalence: it relates characters and sequences that are both visually and functionally equivalent. For example, precomposed letters with diacritics are canonically equivalent to their decomposed form, a base letter followed by combining diacritical marks. In other words, the precomposed character ‘ü’ (U+00FC) is canonically equivalent to the sequence of ‘u’ (U+0075) followed by the combining diaeresis (U+0308). Similarly, Unicode treats several Greek diacritics and punctuation characters as canonically equivalent to other characters that have the same appearance; for instance, the Greek question mark (U+037E) is canonically equivalent to the semicolon (U+003B).
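The canonical equivalence described above can be checked with Python's standard `unicodedata` module. This is a minimal sketch: the helper name `canonically_equivalent` is an assumption made for illustration, and comparing the NFD forms of two strings is one common way to test the relation, not the only one. The Greek question mark and semicolon pair is included as one case of Greek punctuation that shares its appearance with another character.

```python
import unicodedata

precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw code point sequences differ, so a plain comparison fails.
print(precomposed == decomposed)                          # False
# After canonical normalization the two spellings compare equal.
print(canonically_equivalent(precomposed, decomposed))    # True
# GREEK QUESTION MARK (U+037E) versus SEMICOLON (U+003B).
print(canonically_equivalent("\u037e", ";"))              # True
```

Comparing the NFC forms instead would give the same answers; the choice between the two forms matters mainly for how text is stored, not for whether two strings are canonically equivalent.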

Compatibility Equivalence

Compatibility equivalence is broader than canonical equivalence. Anything that is canonically equivalent is also compatibility equivalent, but the converse is not necessarily true. Characters that are compatibility equivalent without being canonically equivalent are treated as the same in plain text even though they are visually distinct and therefore potentially semantically distinct. For example, superscript and subscript numerals are compatibility equivalent to their ordinary decimal digit counterparts. The superscript and subscript forms, through their visually distinct presentation, typically convey a distinct meaning as well. This distinction, however, can be handled in a more open-ended way by rich text protocols outside Unicode: although the character set includes subscript digits 0 through 9, most other characters can be made subscript only through rich text. Unicode therefore considers such visual and semantic variations a task for rich text rather than plain text.

Full-width and half-width katakana characters are also compatibility equivalent, as are ligatures and their component letter sequences. In these latter cases there is usually only a visual distinction, not a semantic one. An author does not typically intend ligatures or vertical text to mean one thing and non-ligatures or horizontal text to mean something entirely different; these are strictly typographic design choices.
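The difference between the two kinds of equivalence can be seen by normalizing a few compatibility characters. The sketch below uses Python's standard `unicodedata` module; the `samples` dictionary and its labels are illustrative choices, not part of any standard. NFC applies canonical equivalence only, while NFKC also applies compatibility equivalence.

```python
import unicodedata

# Illustrative sample characters; labels are informal descriptions.
samples = {
    "superscript two": "\u00b2",         # ²
    "subscript two": "\u2082",           # ₂
    "fi ligature": "\ufb01",             # ﬁ
    "half-width katakana ka": "\uff76",  # ｶ
}

for label, ch in samples.items():
    nfc = unicodedata.normalize("NFC", ch)    # canonical only
    nfkc = unicodedata.normalize("NFKC", ch)  # canonical + compatibility
    print(f"{label}: NFC changed={nfc != ch}, NFKC result={nfkc!r}")
```

None of these characters changes under NFC, because none has a canonical equivalent. Under NFKC the digits become "2", the ligature becomes "fi", and the half-width katakana becomes the ordinary katakana カ (U+30AB). The digit case is lossy in the sense the text describes: "x²" becomes "x2", and the superscript meaning would have to be carried by rich text instead.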

Visual ambiguity

The presence of equivalent characters, whether canonical or compatibility-only, can lead to visual ambiguity and confusion for users of text processing software. For example, software should typically render canonically equivalent characters so that they are indistinguishable from one another. If a user searches for one character, it may not be clear why the software does not highlight an identical-looking character. For characters that are only compatibility equivalent, visual ambiguity can arise when, for example, a superscript digit character appears alongside a standard digit with rich text superscript formatting. To handle such situations, Unicode recommends text processing algorithms such as normalization, which treat these characters and character sequences as identical in certain circumstances.
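The search problem described above, and the way normalization addresses it, can be sketched as follows. The function `find_normalized` is a hypothetical helper written for this illustration, and the sample strings are assumptions; real search software would also need to map match positions back to the original text, which this sketch does not attempt.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Hypothetical helper: substring test after normalizing both strings."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


text = "Zu\u0308rich"   # 'u' followed by COMBINING DIAERESIS (decomposed)
query = "Z\u00fcrich"   # precomposed 'ü'

# A naive search misses the match although both strings render identically.
print(query in text)                                    # False
# Canonical normalization makes the search succeed.
print(find_normalized(text, query))                     # True

# A superscript digit is only compatibility equivalent to a plain digit,
# so it matches under NFKC but not under NFC.
print(find_normalized("x\u00b2", "x2"))                 # False
print(find_normalized("x\u00b2", "x2", form="NFKC"))    # True
```

Whether to use a canonical form (NFC, NFD) or a compatibility form (NFKC, NFKD) for searching is a design decision: the compatibility forms find more matches but discard distinctions, such as superscripts, that may carry meaning.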