Mapping of Unicode graphic characters

From Wikipedia, the free encyclopedia

By far the most common Unicode characters are graphic characters. Every graphic character has some visual representation, or glyph, associated with it. While Unicode does not specify the concrete glyphs for these characters, it does specify recommended or prototypical glyphs. The actual glyph used by text display software depends on the font files used and on whether those fonts support contextual and non-contextual glyph variations.

Script-specific characters

Main article: Writing system

In Unicode, a script is an abstract, coherent and unified writing system supporting one or more concrete writing systems, which in turn support the written forms of one or more languages. Some scripts support one and only one language, for example, Armenian. Other scripts, like Latin, support many different writing systems: English, French, German, Italian, and Latin, to name just a few. Some languages have also used more than one writing system. Turkish, for example, used the Arabic script before the 20th century and transitioned to the Latin script in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system.

When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”) while English has no such character. Nor does English make use of the diacritic combining ring above (U+030A) for any character. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are said to use the same Latin script. The Unicode abstraction of scripts is therefore a basic organizing technique. The differences between alphabets or writing systems remain, and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
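The relationship between ‘å’ and the combining mark can be illustrated with a minimal sketch using Python's standard `unicodedata` module. The variable names are chosen here for illustration only; the sketch assumes nothing beyond the standard library.

```python
import unicodedata

# Two encodings of the same Swedish letter (names are illustrative).
precomposed = "\u00e5"   # single code point for 'å'
decomposed = "a\u030a"   # 'a' followed by the combining mark

print(unicodedata.name(precomposed))    # LATIN SMALL LETTER A WITH RING ABOVE
print(unicodedata.name(decomposed[1]))  # COMBINING RING ABOVE

# The raw strings differ, but normalization maps one form to the other.
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Comparing text after normalizing both sides to the same form (NFC or NFD) is one common way to treat the two spellings as equal.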

While all characters have the property of belonging to a script, many characters, such as symbols, indicate “common” or “inherited” for their script property. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, the individual scripts often have their own punctuation and diacritics. So many scripts include not only letters, but also diacritic and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode already includes over 60 scripts supporting hundreds or even thousands of languages throughout the world. Unicode is actively working on many more, as indicated by its roadmap.

Unihan characters

Main article: Unihan

Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. The Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), and Korean (where they are called hanja). Modern Korean, Chinese and Japanese typefaces may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these different glyphs were treated as the same character. This unification is referred to as "Han unification", with the resulting character repertoire sometimes referred to as Unihan.

Besides the Unihan ideographs, Han unification also provides Han unified punctuation, symbols, numerals, ideograph stroke characters and ideographic description characters.

Phonetic characters

Unicode includes letters and marks from the International Phonetic Alphabet (IPA) and those supporting other phonetic writing systems as well.

Numerals

Main article: Unicode numerals

Numerals (often called numbers in Unicode) are characters that denote a number. The same Arabic-Indic numerals are used widely in various writing systems throughout the world and all share the same semantics for denoting numbers. However, the glyphs representing these numerals differ widely from one writing system to another. To support these glyph differences, Unicode includes duplicate encodings of these numerals within many of the script blocks. These digits are repeated in 22 separate blocks, twice in Arabic. Five additional sets of the ten decimal digits repeat again as rich text forms in the Mathematical Alphanumeric Symbols block within the Supplementary Multilingual Plane (i.e., requiring 4 bytes to store each character in UTF-8 or UTF-16).
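As a rough illustration of digits that share semantics but not code points, the following sketch uses Python's standard `unicodedata` module. The `samples` table and the helper `decimal_value` are hypothetical names introduced here; the four digit sets shown are only a small selection.

```python
import unicodedata

# The number 990 written with digits from four different sets
# (a small illustrative selection, not an exhaustive list).
samples = {
    "ASCII": "990",
    "Arabic-Indic": "\u0669\u0669\u0660",
    "Devanagari": "\u096f\u096f\u0966",
    "Mathematical bold": "\U0001d7d7\U0001d7d7\U0001d7ce",
}

def decimal_value(text: str) -> int:
    """Interpret a string of Unicode decimal digits positionally."""
    value = 0
    for ch in text:
        # unicodedata.decimal returns the 0-9 value of any decimal digit.
        value = value * 10 + unicodedata.decimal(ch)
    return value

for label, text in samples.items():
    print(label, decimal_value(text))  # every line ends in 990

# A digit in the Supplementary Multilingual Plane needs 4 bytes in
# UTF-8 and in UTF-16; an ASCII digit needs 1 byte in UTF-8.
bold_nine = "\U0001d7d7"
print(len(bold_nine.encode("utf-8")))      # 4
print(len(bold_nine.encode("utf-16-le")))  # 4
print(len("9".encode("utf-8")))            # 1
```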

Unicode also includes several less common numerals: Roman numerals, counting rod numerals, Cuneiform numerals and ancient Greek numerals.

Numerals invariably involve composition of glyphs, as a limited number of characters are composed to make other numerals. For example, the sequence 9 - 9 - 0 in Arabic-Indic numerals composes the numeral for nine hundred and ninety (990). In Roman numerals, the same number is expressed by the composed numeral Ⅹↀ or ⅩⅯ. Each of these is a distinct numeral representing the same abstract number. The semantics of the numerals differ, in particular in their composition. The Arabic-Indic decimal digits are positional-value compositions, while the Roman numerals are sign-value compositions that are additive or subtractive depending on the order of the signs.
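The sign-value rule can be sketched as follows. This is a minimal illustration, not a full Roman numeral parser: the function name `sign_value` is hypothetical, it assumes the input uses the dedicated Roman numeral code points (which carry a numeric value in the Unicode character database, unlike the ASCII letters X and M), and it does not validate that the numeral is well formed.

```python
import unicodedata

def sign_value(numeral: str) -> int:
    """Evaluate a numeral whose signs are added or subtracted by order."""
    values = [int(unicodedata.numeric(ch)) for ch in numeral]
    total = 0
    for i, v in enumerate(values):
        # A smaller sign placed before a larger one is subtracted;
        # otherwise the sign's value is added.
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total

print(sign_value("\u2169\u216f"))  # ten before one thousand -> 990
print(sign_value("\u2169\u2180"))  # same, with the alternate sign for 1000 -> 990
print(sign_value("\u216f\u2169"))  # one thousand then ten -> 1010
```

In a positional system, by contrast, the same digit contributes a different amount depending on its place, which is why reversing "990" changes the value in a different way than reversing the Roman numeral does.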

Punctuation and diacritics


Unicode includes several blocks for unified diacritics and other combining marks and also blocks for unified punctuation. However, when a mark or punctuation character is intended primarily for use within a particular script, the character is assigned to that particular script’s blocks. Therefore authors will find these types of characters throughout the Unicode character database. Unicode categorizes them as:

  • Punctuation
      • connector (Pc)
      • dash (Pd)
      • open (Ps)
      • close (Pe)
      • initial (Pi)
      • final (Pf)
      • other (Po)
  • Mark
      • non-spacing (Mn)
      • spacing-combining (Mc)
      • enclosing (Me)
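These two-letter codes are the General Category values stored in the Unicode character database, and they can be looked up with `unicodedata.category` in Python's standard library. The sketch below is illustrative only; the `examples` table is a hypothetical selection of one character per category.

```python
import unicodedata

# One sample character per punctuation and mark category
# (an illustrative selection chosen for this sketch).
examples = {
    "_": "Pc",       # LOW LINE
    "-": "Pd",       # HYPHEN-MINUS
    "(": "Ps",       # LEFT PARENTHESIS
    ")": "Pe",       # RIGHT PARENTHESIS
    "\u00ab": "Pi",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00bb": "Pf",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "!": "Po",       # EXCLAMATION MARK
    "\u030a": "Mn",  # COMBINING RING ABOVE
    "\u0903": "Mc",  # DEVANAGARI SIGN VISARGA
    "\u20dd": "Me",  # COMBINING ENCLOSING CIRCLE
}

for ch, expected in examples.items():
    actual = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {actual}")
    assert actual == expected
```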

Symbols


Unicode has dozens of blocks dedicated to symbols that are useful regardless of one’s writing system. Other script-specific symbols are often included within a particular script’s blocks. Symbols are categorized as:

  • math (Sm)
  • currency (Sc)
  • modifier (Sk)
  • other (So)
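As a small sketch of how the symbol categories might be used in practice, the following Python function tallies the symbols in a string by category. The function name `symbol_counts` and the sample text are hypothetical; only `unicodedata.category` from the standard library is assumed.

```python
import unicodedata
from collections import Counter

def symbol_counts(text: str) -> dict:
    """Count characters whose General Category is a symbol (S*)."""
    counts = Counter(
        unicodedata.category(ch)
        for ch in text
        if unicodedata.category(ch).startswith("S")
    )
    return dict(counts)

sample = "3 + 4 = 7, price: $5 \u00a9 ^"
print(symbol_counts(sample))
# '+' and '=' are Sm, '$' is Sc, the copyright sign is So, '^' is Sk
```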

Music notation

Unicode devotes a block of 256 code points to musical symbols. Since Unicode encodes text as a linear sequence of characters, these characters do not encode pitch or other parts of Western music expressed in the vertical dimension. The music symbols are therefore better suited to discussing the symbols themselves, or to discussing rhythm, within the prose of a document. To encode more complex musical information some other data format is necessary, such as MusicXML or MIDI.