Mapping of Unicode characters

From Wikipedia, the free encyclopedia

Unicode reserves 1,114,112 (= 2²⁰ + 2¹⁶ or 17 × 2¹⁶, hexadecimal 110000) code points.

As of Unicode 5.0.0, 101,137 (9.1%) of these code points are assigned, with another 137,468 (12.3%) reserved for private use and 66 designated as noncharacters, leaving 875,441 (78.6%) unassigned. The assigned code points break down as follows:

98,884 graphic characters

140 formatting characters

65 control characters

2,048 surrogate code points

The first 256 codes correspond with those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. As a result, the first 128 characters are also identical to ASCII.
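The correspondence can be checked directly. The following is a minimal Python sketch, not part of the standard; the variable names are illustrative, and it relies only on Python's built-in `latin-1` and `ascii` codecs.

```python
# Decode every possible byte value as ISO 8859-1 and compare each resulting
# character's code point with the original byte value.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")
latin1_matches = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The first 128 values decode to the same characters under ASCII.
ascii_matches = bytes(range(128)).decode("ascii") == decoded[:128]

print(latin1_matches, ascii_matches)  # True True
```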

The Unicode code space for characters is divided into 17 planes, each with 65,536 (= 2¹⁶) code points, although currently only a few planes are used:

Plane 0 (0000–FFFF): Basic Multilingual Plane (BMP)
Plane 1 (10000–1FFFF): Supplementary Multilingual Plane (SMP)
Plane 2 (20000–2FFFF): Supplementary Ideographic Plane (SIP)
Planes 3 to 13 (30000–DFFFF): unassigned
Plane 14 (E0000–EFFFF): Supplementary Special-purpose Plane (SSP)
Plane 15 (F0000–FFFFF): reserved for the Private Use Area (PUA)
Plane 16 (100000–10FFFF): reserved for the Private Use Area (PUA)
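Because each plane holds exactly 2¹⁶ code points, the plane number is simply the code point shifted right by 16 bits. The sketch below illustrates this; the names `plane_of`, `plane_name` and `PLANE_NAMES` are hypothetical helpers introduced here, and the name table mirrors the list above.

```python
PLANE_SIZE = 1 << 16                            # 65,536 code points per plane
PLANE_COUNT = 17
MAX_CODE_POINT = PLANE_COUNT * PLANE_SIZE - 1   # 0x10FFFF

# Plane names as given in the list above; other planes are unassigned.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Private Use Area (PUA)",
    16: "Private Use Area (PUA)",
}

def plane_of(code_point):
    """Return the plane number (0 to 16) containing code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16

def plane_name(code_point):
    """Return the plane's name, or 'unassigned' for planes 3 to 13."""
    return PLANE_NAMES.get(plane_of(code_point), "unassigned")

print(plane_of(0x1D11E), plane_name(0x1D11E))
```

For U+1D11E (a musical symbol) this reports plane 1, the SMP.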

The cap of 17 planes (the BMP plus 2²⁰ supplementary code points) exists in order to maintain compatibility with the UTF-16 encoding, which can address only that range: a surrogate pair carries 20 bits, enough for Planes 1 to 16. Currently, about ten percent of the Unicode code space is used. Furthermore, ranges of characters have been tentatively blocked out for every known unencoded script (see [1]), and while Unicode may need another plane for ideographic characters, there are ten planes available if previously unknown scripts with tens of thousands of characters are discovered. The limit is therefore unlikely to be reached in the near future.
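The link between UTF-16 and the size of the code space can be made concrete. UTF-16 represents each code point above FFFF as a pair of 16-bit surrogates, each contributing 10 bits, so a pair can distinguish 2²⁰ supplementary code points. The following Python sketch shows the arithmetic; the function names are illustrative and not taken from any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above FFFF use surrogate pairs")
    offset = code_point - 0x10000       # 0 .. 2**20 - 1, a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E
```

The highest pair, DBFF DFFF, decodes to 10FFFF, which is why the code space ends there.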

1 Basic Multilingual Plane
- 1.1 Future additions
2 Supplementary Multilingual Plane
3 Private Use Area
4 Other planes
5 Mapping tables

Basic Multilingual Plane

The first plane (plane 0), the Basic Multilingual Plane (BMP), is where most characters have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters. Most of the allocated code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.

Roadmap of the Unicode Basic Multilingual Plane. Each numbered box represents 256 code points.

The graphic on the right is a visual roadmap to the Basic Multilingual Plane. The colours in use are:

Black = Latin scripts and symbols
Light Blue = Linguistic scripts
Blue = Other European scripts
Orange = Middle Eastern and SW Asian scripts
Light Orange = African scripts
Green = South Asian scripts
Purple = Southeast Asian scripts
Red = East Asian scripts
Light Red = Unified CJK Han
Yellow = Aboriginal scripts
Magenta = Symbols
Dark Grey = Diacritics
Light Grey = UTF-16 surrogates and private use
Cyan = Miscellaneous characters
White = Unused

As of Unicode 5.0, the BMP includes the following blocks:

Basic Latin (0000–007F)
Latin-1 Supplement (0080–00FF)
Latin Extended-A (0100–017F)
Latin Extended-B (0180–024F)
IPA Extensions (0250–02AF)
Spacing Modifier Letters (02B0–02FF)
Combining Diacritical Marks (0300–036F)
Greek and Coptic (0370–03FF)
Cyrillic (0400–04FF)
Cyrillic Supplement (0500–052F)
Armenian (0530–058F)
Hebrew (0590–05FF)
Arabic (0600–06FF)
Syriac (0700–074F)
Arabic Supplement (0750–077F)
Thaana (0780–07BF)
N'Ko (Mandekan) (07C0–07FF)
Indic scripts:
- Devanagari (0900–097F)
- Bengali (0980–09FF)
- Gurmukhi (0A00–0A7F)
- Gujarati (0A80–0AFF)
- Oriya (0B00–0B7F)
- Tamil (0B80–0BFF)
- Telugu (0C00–0C7F)
- Kannada (0C80–0CFF)
- Malayalam (0D00–0D7F)
- Sinhala (0D80–0DFF)
Thai (0E00–0E7F)
Lao (0E80–0EFF)
Tibetan (0F00–0FFF)
Burmese (1000–109F)
Georgian (10A0–10FF)
Hangul Jamo (1100–11FF)
Ethiopic (1200–137F)
Ethiopic Supplement (1380–139F)
Cherokee (13A0–13FF)
Unified Canadian Aboriginal Syllabics (1400–167F)
Ogham (1680–169F)
Runic (16A0–16FF)
Philippine scripts:
- Tagalog (1700–171F)
- Hanunóo (1720–173F)
- Buhid (1740–175F)
- Tagbanwa (1760–177F)
Khmer (1780–17FF)
Mongolian (1800–18AF)
Limbu (1900–194F)
Tai Le (1950–197F)
New Tai Lue (1980–19DF)
Khmer Symbols (19E0–19FF)
Buginese (1A00–1A1F)
Balinese (1B00–1B7F)
Lepcha (Rong) (1C00–1C4F)
Phonetic Extensions (1D00–1D7F)
Phonetic Extensions Supplement (1D80–1DBF)
Combining Diacritical Marks Supplement (1DC0–1DFF)
Latin Extended Additional (1E00–1EFF)
Greek Extended (1F00–1FFF)
Symbols:
- General Punctuation (2000–206F)
- Superscripts and Subscripts (2070–209F)
- Currency Symbols (20A0–20CF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Letterlike Symbols (2100–214F)
- Number Forms (2150–218F)
- Arrows (2190–21FF)
- Mathematical Operators (2200–22FF)
- Miscellaneous Technical (2300–23FF)
- Control Pictures (2400–243F)
- Optical Character Recognition (2440–245F)
- Enclosed Alphanumerics (2460–24FF)
- Box Drawing (2500–257F)
- Block Elements (2580–259F)
- Geometric Shapes (25A0–25FF)
- Miscellaneous Symbols (2600–26FF)
- Dingbats (2700–27BF)
- Miscellaneous Mathematical Symbols-A (27C0–27EF)
- Supplemental Arrows-A (27F0–27FF)
- Braille Patterns (2800–28FF)
- Supplemental Arrows-B (2900–297F)
- Miscellaneous Mathematical Symbols-B (2980–29FF)
- Supplemental Mathematical Operators (2A00–2AFF)
- Miscellaneous Symbols and Arrows (2B00–2BFF)
Glagolitic (2C00–2C5F)
Latin Extended-C (2C60–2C7F)
Coptic (2C80–2CFF)
Georgian Supplement (2D00–2D2F)
Tifinagh (2D30–2D7F)
Ethiopic Extended (2D80–2DDF)
Supplemental Punctuation (2E00–2E7F)
CJK Radicals Supplement (2E80–2EFF)
Kangxi Radicals (2F00–2FDF)
Ideographic Description Characters (2FF0–2FFF)
CJK Symbols and Punctuation (3000–303F)
Hiragana (3040–309F)
Katakana (30A0–30FF)
Bopomofo (3100–312F)
Hangul Compatibility Jamo (3130–318F)
Kanbun (3190–319F)
Bopomofo Extended (31A0–31BF)
CJK Strokes (31C0–31EF)
Katakana Phonetic Extensions (31F0–31FF)
Enclosed CJK Letters and Months (3200–32FF)
CJK Compatibility (3300–33FF)
CJK Unified Ideographs Extension A (3400–4DBF)
Yijing Hexagram Symbols (4DC0–4DFF)
CJK Unified Ideographs (4E00–9FFF)
Yi Syllables (A000–A48F)
Yi Radicals (A490–A4CF)
Modifier Tone Letters (A700–A71F)
Latin Extended-D (A720–A7FF)
Syloti Nagri (A800–A82F)
Phags-pa (A840–A87F)
Hangul Syllables (AC00–D7AF)
High Surrogates (D800–DB7F)
High Private Use Surrogates (DB80–DBFF)
Low Surrogates (DC00–DFFF)
Private Use Area (E000–F8FF)
CJK Compatibility Ideographs (F900–FAFF)
Alphabetic Presentation Forms (FB00–FB4F)
Arabic Presentation Forms-A (FB50–FDFF)
Variation Selectors (FE00–FE0F)
Vertical Forms (FE10–FE1F)
Combining Half Marks (FE20–FE2F)
CJK Compatibility Forms (FE30–FE4F)
Small Form Variants (FE50–FE6F)
Arabic Presentation Forms-B (FE70–FEFF)
Halfwidth and Fullwidth Forms (FF00–FFEF)
Specials (FFF0–FFFF)
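A list of block ranges like the one above lends itself to a simple lookup by binary search. The sketch below uses only a small sample of the listed blocks, so code points in blocks outside the sample return `None`; the names `BLOCKS` and `block_of` are hypothetical.

```python
from bisect import bisect_right

# A small sample of the blocks listed above, sorted by first code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xE000, 0xF8FF, "Private Use Area"),
]
STARTS = [first for first, _, _ in BLOCKS]

def block_of(code_point):
    """Return the sample block containing code_point, or None."""
    i = bisect_right(STARTS, code_point) - 1   # last block starting at or before it
    if i >= 0:
        first, last, name = BLOCKS[i]
        if first <= code_point <= last:
            return name
    return None

print(block_of(0x3042))  # Hiragana
```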

Future additions

Several scripts are expected to be included in the BMP in the next revision of Unicode. These scripts, and their proposed code point ranges, are the following:

Cham (18B0–18FF)
Lanna (Old Tai Lue) (1A80–1AEF)
Santali (Ol Cemet' / Ol Chiki) (2DE0–2DFF)
Vai (A500–A61F)
Saurashtra (AB00–AB5F)

Several other scripts are proposed for inclusion in the BMP, including:

Avestan (0800–083F)
Pahlavi (0840–087F)
Batak (1A20–1A5F)
Meitei Mayek/Meitei (1C80–1CDF)
Varang Kshiti (AA00–AA3F)
Sorang Sompeng (AA40–AA6F)

Supplementary Multilingual Plane

Plane 1, the Supplementary Multilingual Plane (SMP), is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols.

As of Unicode 5.0, Plane One includes the following scripts:

Linear B Syllabary (10000–1007F)
Linear B Ideograms (10080–100FF)
Aegean Numbers (10100–1013F)
Ancient Greek Numbers (10140–1018F)
Old Italic (10300–1032F)
Gothic (10330–1034F)
Ugaritic (10380–1039F)
Old Persian (103A0–103DF)
Deseret (10400–1044F)
Shavian (10450–1047F)
Osmanya (10480–104AF)
Cypriot Syllabary (10800–1083F)
Phoenician (10900–1091F)
Kharoshthi (10A00–10A5F)
Sumero-Akkadian Cuneiform (12000–1236E and 12400–12473)
Byzantine Musical Symbols (1D000–1D0FF)
Musical Symbols (1D100–1D1FF)
Ancient Greek Musical Notation (1D200–1D24F)
Tai Xuan Jing Symbols (1D300–1D35F)
Mathematical Alphanumeric Symbols (1D400–1D7FF)

Many other scripts are proposed for inclusion in Plane One, including:

Old Permic
Meroitic
Manichaean
Balti
Aramaic
South Arabian
Brahmi
Soyombo
Indus script
Tengwar
Cirth
Blissymbols
Basic Egyptian Hieroglyphics
Rod Numerals

Private Use Area

A Private Use Area (PUA) is one of several ranges which are reserved for private use. For this range, the Unicode standard does not specify any characters.

The Basic Multilingual Plane includes a PUA in the range from U+E000 to U+F8FF (57344–63743). Plane Fifteen (U+F0000 to U+FFFFF) and Plane Sixteen (U+100000 to U+10FFFF) are completely reserved for private use as well.

The PUA is a concept inherited from certain Asian encoding systems, which had private use areas to encode Japanese gaiji (rare personal name characters) in application-specific ways. Similarly, the ConScript Unicode Registry aims to coordinate the mapping into the PUAs of scripts that are not yet encoded in Unicode or have been rejected by it. The Medieval Unicode Font Initiative uses the PUA to encode various ligatures, precomposed characters, and symbols found in medieval texts.

One example is Apple Computer's use of U+F8FF for the Apple logo.
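The three private-use ranges described in this section can be tested with a few comparisons. The Python sketch below is illustrative only, and its names are hypothetical. It also totals the ranges: the raw sum is 137,472, and subtracting the last two code points of Planes 15 and 16, which Unicode designates as noncharacters, gives 137,468. Treating that subtraction as the origin of the private-use count quoted at the top of the article is an assumption.

```python
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),       # PUA in the Basic Multilingual Plane
    (0xF0000, 0xFFFFF),     # Plane 15
    (0x100000, 0x10FFFF),   # Plane 16
]

# Noncharacters at the end of Planes 15 and 16 (not usable as private use).
NONCHARACTERS_IN_RANGES = {0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF}

def in_private_use_range(code_point):
    """True if code_point falls inside one of the three ranges above."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)

raw_total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
usable_total = raw_total - len(NONCHARACTERS_IN_RANGES)

print(in_private_use_range(0xF8FF), raw_total, usable_total)
# True 137472 137468
```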

Other planes

Plane 2, the Supplementary Ideographic Plane (SIP), is used for about 40,000 rare Chinese characters that are mostly historic, although there are some modern ones. Plane 14 (E in hexadecimal), the Supplementary Special-purpose Plane (SSP), currently contains some non-recommended language tag characters and some variation selectors.

Mapping tables

Unicode mapping tables (each table covers 4,096 code points)
BMP: 0000–0FFF, 1000–1FFF, 2000–2FFF, 3000–3FFF, 4000–4FFF, 5000–5FFF, 6000–6FFF, 7000–7FFF, 8000–8FFF, 9000–9FFF, A000–AFFF, B000–BFFF, C000–CFFF, D000–DFFF, E000–EFFF, F000–FFFF
SMP: 10000–10FFF, 12000–12FFF, 1D000–1DFFF
SIP: 20000–20FFF, 21000–21FFF, 22000–22FFF, 23000–23FFF, 24000–24FFF, 25000–25FFF, 26000–26FFF, 27000–27FFF, 28000–28FFF, 29000–29FFF, 2A000–2AFFF, 2F000–2FFFF
SSP: E0000–E0FFF
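Each mapping table listed above starts at a multiple of 1000 hexadecimal, so the table that would contain a given code point can be found by clearing its low 12 bits. The Python sketch below shows this; `table_range` and `table_label` are hypothetical names. It only computes the range and does not check whether a table for that range is actually listed.

```python
def table_range(code_point):
    """Return (first, last) of the 4,096-code-point range containing code_point."""
    first = code_point & ~0xFFF     # clear the low 12 bits
    return first, first | 0xFFF

def table_label(code_point):
    """Format the range in the style used by the list of mapping tables."""
    first, last = table_range(code_point)
    return f"{first:04X}–{last:04X}"

print(table_label(0x1D11E))  # 1D000–1DFFF
```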