Plane (Unicode)
In the Unicode standard, a plane is a continuous group of 65,536 (= 216) code points. There are 17 planes, identified by the numbers 0 to 16decimal, which corresponds with the possible values 00–10hexadecimal of the first two positions in six position format (hhhhhh). The planes above plane 0 (the Basic Multilingual Plane), that is, planes 1–16, are called “supplementary planes”,[1] or humorously known as “astral planes”. As of Unicode version 7.0, six of the planes have assigned code points (characters), and four are named.
The 17 planes can accommodate 1,114,112 code points, a limit which is unlikely to be reached in the foreseeable future even if previously unknown scripts with tens of thousands of characters are discovered. The odd-looking code points limit (which is not a power of 2) is due to the design of UTF-16. In UTF-16 a "surrogate pair" of two 16-bit words is used to encode 220 code points in the planes 1 to 16, in addition to the use of single code units to encode the 216 points of plane 0.[2] This limit is not shared by UTF-8, which was designed with a limit of 231 code points (32768 planes), and can encode 221 code points (32 planes) even if limited to 4 bytes.[3] However, the Unicode Consortium has stated that the current limit will never be changed.[4]
Planes are further subdivided into Unicode blocks, which unlike planes, do not have a fixed size. The 252 blocks defined in Unicode 7.0 cover 23 percent of the possible code point space, and range in size from a minimum of 16 code points (eleven blocks) to a maximum of 65,536 code points (Supplementary Private Use Area-A and -B, which constitute the entirety of planes 15 and 16). For future usage, ranges of characters have been tentatively mapped out for every known current and ancient writing system.[5]
Overview
Unicode planes and used code point ranges | ||||||||
---|---|---|---|---|---|---|---|---|
Basic | Supplementary | |||||||
Plane 0 | Plane 1 | Plane 2 | Planes 3–13 | Plane 14 | Planes 15–16 | |||
0000–FFFF | 10000–1FFFF | 20000–2FFFF | 30000–DFFFF | E0000–EFFFF | F0000–10FFFF | |||
Basic Multilingual Plane | Supplementary Multilingual Plane | Supplementary Ideographic Plane | unassigned | Supplementary Special-purpose Plane | Supplementary Private Use Area | |||
BMP | SMP | SIP | — | SSP | S PUA A/B | |||
0000–0FFF |
8000–8FFF |
10000–10FFF |
20000–20FFF |
28000–28FFF |
15: PUA-A |
Plane | Allocated code points[note 1] | Assigned characters[note 2] |
---|---|---|
0 BMP | 65,312 | 55,056 |
1 SMP | 11,936 | 10,004 |
2 SIP | 47,648 | 47,624 |
14 SSP | 368 | 337 |
15 PUA-A | 65,536 | |
16 PUA-B | 65,536 | |
Totals | 256,336 | 113,021 |
- ↑ Code points which have been allocated to a Unicode block.
- ↑ The total number of graphic, format and control characters (i.e., excluding private-use characters, noncharacters and surrogate code points).
Basic Multilingual Plane
![](../I/m/Roadmap_to_Unicode_BMP.svg.png)
(As of Unicode Standard version 6.0)
The first plane, plane 0, the Basic Multilingual Plane (BMP) contains characters for almost all modern languages, and a large number of symbols. A primary objective for the BMP is to support the unification of prior character sets as well as characters for writing. Most of the assigned code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters.
The High Surrogates (U+D800–U+DBFF) and Low Surrogate (U+DC00–U+DFFF) codes are reserved for encoding non-BMP characters in UTF-16 by using a pair of 16-bit codes: one High Surrogate and one Low Surrogate. A single surrogate code point will never be assigned a character.
65,312 of the 65,536 code points in this plane have been allocated to a Unicode block, leaving just 224 non-allocated code points (fourteen 16-character segments).
As of Unicode 7.0, the BMP comprises the following 159 blocks:
- C0 Controls and Basic Latin (Basic Latin) (0000–007F)
- C1 Controls and Latin-1 Supplement (0080–00FF)
- Latin Extended-A (0100–017F)
- Latin Extended-B (0180–024F)
- Linguistic (phonetic) scripts
- IPA Extensions (0250–02AF)
- Spacing Modifier Letters (02B0–02FF)
- Combining Diacritical Marks (0300–036F)
- Greek and Coptic (0370–03FF)
- Cyrillic (0400–04FF)
- Cyrillic Supplement (0500–052F)
- Armenian (0530–058F)
- Hebrew (0590–05FF)
- Arabic (0600–06FF)
- Syriac (0700–074F)
- Arabic Supplement (0750–077F)
- Thaana (0780–07BF)
- N'Ko (07C0–07FF)
- Samaritan (0800–083F)
- Mandaic (0840–085F)
- Arabic Extended-A (08A0–08FF)
- Indic scripts:
- Thai (0E00–0E7F)
- Lao (0E80–0EFF)
- Tibetan (0F00–0FFF)
- Myanmar (1000–109F)
- Georgian (10A0–10FF)
- Hangul Jamo (1100–11FF)
- Ethiopic (1200–137F)
- Ethiopic Supplement (1380–139F)
- Cherokee (13A0–13FF)
- Unified Canadian Aboriginal Syllabics (1400–167F)
- Ogham (1680–169F)
- Runic (16A0–16FF)
- Philippine scripts:
- Tagalog (1700–171F)
- Hanunoo (1720–173F)
- Buhid (1740–175F)
- Tagbanwa (1760–177F)
- Khmer (1780–17FF)
- Mongolian (1800–18AF)
- Unified Canadian Aboriginal Syllabics Extended (18B0–18FF)
- Limbu (1900–194F)
- Tai Le (1950–197F)
- Tai Lue (1980–19DF)
- Khmer Symbols (19E0–19FF)
- Buginese (1A00–1A1F)
- Tai Tham (1A20–1AAF)
- Combining Diacritical Marks Extended (1AB0-1AFF)
- Balinese (1B00–1B7F)
- Sundanese (1B80–1BBF)
- Batak (1BC0–1BFF)
- Lepcha (1C00–1C4F)
- Ol Chiki (1C50–1C7F)
- Sundanese Supplement (1CC0–1CCF)
- Vedic Extensions (1CD0–1CFF)
- Phonetic Extensions (1D00–1D7F)
- Phonetic Extensions Supplement (1D80–1DBF)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Latin extended additional (1E00–1EFF)
- Greek Extended (1F00–1FFF)
- Symbols:
- General Punctuation (2000–206F)
- Superscripts and Subscripts (2070–209F)
- Currency Symbols (20A0–20CF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Letterlike Symbols (2100–214F)
- Number Forms (2150–218F)
- Arrows (2190–21FF)
- Mathematical Operators (2200–22FF)
- Miscellaneous Technical (2300–23FF)
- Control Pictures (2400–243F)
- Optical Character Recognition (2440–245F)
- Enclosed Alphanumerics (2460–24FF)
- Box Drawing (2500–257F)
- Block Elements (2580–259F)
- Geometric Shapes (25A0–25FF)
- Miscellaneous Symbols (2600–26FF)
- Dingbats (2700–27BF)
- Miscellaneous Mathematical Symbols-A (27C0–27EF)
- Supplemental Arrows-A (27F0–27FF)
- Braille Patterns (2800–28FF)
- Supplemental Arrows-B (2900–297F)
- Miscellaneous Mathematical Symbols-B (2980–29FF)
- Supplemental Mathematical Operators (2A00–2AFF)
- Miscellaneous Symbols and Arrows (2B00–2BFF)
- Glagolitic (2C00–2C5F)
- Latin Extended-C (2C60–2C7F)
- Coptic (2C80–2CFF)
- Georgian Supplement (2D00–2D2F)
- Tifinagh (2D30–2D7F)
- Ethiopic Extended (2D80–2DDF)
- Cyrillic Extended-A (2DE0–2DFF)
- Supplemental Punctuation (2E00–2E7F)
- East Asian scripts and symbols:
- CJK Radicals Supplement (2E80–2EFF)
- Kangxi Radicals (2F00–2FDF)
- Ideographic Description Characters (2FF0–2FFF)
- CJK Symbols and Punctuation (3000–303F)
- Hiragana (3040–309F)
- Katakana (30A0–30FF)
- Bopomofo (3100–312F)
- Hangul Compatibility Jamo (3130–318F)
- Kanbun (3190–319F)
- Bopomofo Extended (31A0–31BF)
- CJK Strokes (31C0–31EF)
- Katakana Phonetic Extensions (31F0–31FF)
- Enclosed CJK Letters and Months (3200–32FF)
- CJK Compatibility (3300–33FF)
- CJK Unified Ideographs Extension A (3400–4DBF)
- Yijing Hexagram Symbols (4DC0–4DFF)
- CJK Unified Ideographs (4E00–9FFF)
- Yi Syllables (A000–A48F)
- Yi Radicals (A490–A4CF)
- Lisu (A4D0–A4FF)
- Vai (A500–A63F)
- Cyrillic Extended-B (A640–A69F)
- Bamum (A6A0–A6FF)
- Modifier Tone Letters (A700–A71F)
- Latin Extended-D (A720–A7FF)
- Syloti Nagri (A800–A82F)
- Common Indic Number Forms (A830–A83F)
- Phags-pa (A840–A87F)
- Saurashtra (A880–A8DF)
- Devanagari Extended (A8E0–A8FF)
- Kayah Li (A900–A92F)
- Rejang (A930–A95F)
- Hangul Jamo Extended-A (A960–A97F)
- Javanese (A980–A9DF)
- Myanmar Extended-B (A9E0-A9FF)
- Cham (AA00–AA5F)
- Myanmar Extended-A (AA60–AA7F)
- Tai Viet (AA80–AADF)
- Meetei Mayek Extensions (AAE0–AAFF)
- Ethiopic Extended-A (AB00–AB2F)
- Latin Extended-E (AB30-AB6F)
- Meetei Mayek (ABC0–ABFF)
- Hangul Syllables (AC00–D7AF)
- Hangul Jamo Extended-B (D7B0–D7FF)
- Surrogates:
- High Surrogates (D800–DB7F)
- High Private Use Surrogates (DB80–DBFF)
- Low Surrogates (DC00–DFFF)
- Private Use Area (E000–F8FF)
- CJK Compatibility Ideographs (F900–FAFF)
- Alphabetic Presentation Forms (FB00–FB4F)
- Arabic Presentation Forms-A (FB50–FDFF)
- Variation Selectors (FE00–FE0F)
- Vertical Forms (FE10–FE1F)
- Combining Half Marks (FE20–FE2F)
- CJK Compatibility Forms (FE30–FE4F)
- Small Form Variants (FE50–FE6F)
- Arabic Presentation Forms-B (FE70–FEFF)
- Halfwidth and Fullwidth Forms (FF00–FFEF)
- Specials (FFF0–FFFF)
Supplementary Multilingual Plane
![](../I/m/Roadmap_to_the_Unicode_SMP.svg.png)
(As of Unicode Standard version 6.2)
Plane 1, the Supplementary Multilingual Plane (SMP), contains historic scripts such as Linear B, Egyptian hieroglyphs, and cuneiform scripts; historic and modern musical notation; mathematical alphanumerics; Emoji and other pictographic sets; reform orthographies like Shavian and Deseret; and game symbols for playing cards, Mah Jongg, and dominoes.
As of Unicode 7.0, the SMP comprises the following 85 blocks:
- Linear B Syllabary (10000–1007F)
- Linear B Ideograms (10080–100FF)
- Aegean Numbers (10100–1013F)
- Ancient Greek Numbers (10140–1018F)
- Ancient Symbols (10190–101CF)
- Phaistos Disc (101D0–101FF)
- Lycian (10280–1029F)
- Carian (102A0–102DF)
- Coptic Epact Numbers (102E0-102FF)
- Old Italic (10300–1032F)
- Gothic (10330–1034F)
- Old Permic (10350-1037F)
- Ugaritic (10380–1039F)
- Old Persian (103A0–103DF)
- Deseret (10400–1044F)
- Shavian (10450–1047F)
- Osmanya (10480–104AF)
- Elbasan (10500-1052F)
- Caucasian Albanian (10530-1056F)
- Linear A (10600-1077F)
- Cypriot Syllabary (10800–1083F)
- Imperial Aramaic (10840–1085F)
- Palmyrene (10860-1087F)
- Nabataean (10880-108AF)
- Phoenician (10900–1091F)
- Lydian (10920–1093F)
- Meroitic Hieroglyphs (10980–1099F)
- Meroitic Cursive (109A0–109FF)
- Kharoshthi (10A00–10A5F)
- Old South Arabian (10A60–10A7F)
- Old North Arabian (10A80-10A9F)
- Manichaean (10AC0-10AFF)
- Avestan (10B00–10B3F)
- Inscriptional Parthian (10B40–10B5F)
- Inscriptional Pahlavi (10B60–10B7F)
- Psalter Pahlavi (10B80-10BAF)
- Old Turkic (10C00–10C4F)
- Rumi Numeral Symbols (10E60–10E7F)
- Brahmi (11000–1107F)
- Kaithi (11080–110CF)
- Sora Sompeng (110D0–110FF)
- Chakma (11100–1114F)
- Mahajani (11150-1117F)
- Sharada (11180–111DF)
- Sinhala Archaic Numbers (111E0-111FF)
- Khojki (11200-1124F)
- Khudawadi (112B0-112FF)
- Grantha (11300-1137F)
- Tirhuta (11480-114DF)
- Siddham (11580-115FF)
- Modi (11600-1165F)
- Takri (11680–116CF)
- Warang Citi (118A0-118FF)
- Pau Cin Hau (11AC0-11AFF)
- Cuneiform (12000–123FF)
- Cuneiform Numbers and Punctuation (12400–1247F)
- Egyptian Hieroglyphs (13000–1342F)
- Bamum Supplement (16800–16A3F)
- Mro (16A40-16A6F)
- Bassa Vah (16AD0-16AFF)
- Pahawh Hmong (16B00-16B8F)
- Miao (16F00–16F9F)
- Kana Supplement (1B000–1B0FF)
- Duployan (1BC00-1BC9F)
- Shorthand Format Controls (1BCA0-1BCAF)
- Byzantine Musical Symbols (1D000–1D0FF)
- Musical Symbols (1D100–1D1FF)
- Ancient Greek Musical Notation (1D200–1D24F)
- Tai Xuan Jing Symbols (1D300–1D35F)
- Counting Rod Numerals (1D360–1D37F)
- Mathematical Alphanumeric Symbols (1D400–1D7FF)
- Mende Kikakui (1E800-1E8DF)
- Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF)
- Mahjong Tiles (1F000–1F02F)
- Domino Tiles (1F030–1F09F)
- Playing Cards (1F0A0–1F0FF)
- Enclosed Alphanumeric Supplement (1F100–1F1FF)
- Enclosed Ideographic Supplement (1F200–1F2FF)
- Miscellaneous Symbols And Pictographs (1F300–1F5FF)
- Emoticons (1F600–1F64F)
- Ornamental Dingbats (1F650-1F67F)
- Transport And Map Symbols (1F680–1F6FF)
- Alchemical Symbols (1F700–1F77F)
- Geometric Shapes Extended (1F780-1F7FF)
- Supplemental Arrows-C (1F800-1F8FF)
Supplementary Ideographic Plane
Plane 2, the Supplementary Ideographic Plane (SIP), is used for CJK Ideographs, mostly CJK Unified Ideographs, that were not included in earlier character encoding standards.
As of Unicode 6.1, the SIP comprises the following four blocks:
- CJK Unified Ideographs Extension B (20000–2A6DF)
- CJK Unified Ideographs Extension C (2A700–2B73F)
- CJK Unified Ideographs Extension D (2B740–2B81F)
- CJK Compatibility Ideographs Supplement (2F800–2FA1F); not Unified
Unassigned planes
Planes 3 to 13: No characters have yet been assigned to Planes 3 through 13. Plane 3 is tentatively named the Tertiary Ideographic Plane, but as of version 7.0 there are no characters assigned to it.[6] It is reserved for Oracle Bone script, Bronze Script, Small Seal Script, additional CJK unified ideographs, and other historic ideographic scripts.[7]
It is not anticipated that all these planes will be used in the foreseeable future, given the total sizes of the known writing systems left to be encoded. The number of possible symbol characters that could arise outside of the context of writing systems is potentially huge. At the moment, these 11 planes out of 17 are unused.
Supplementary Special-purpose Plane
Plane 14 (E in hexadecimal), the Supplementary Special-purpose Plane (SSP), currently contains non-graphical characters. The first block is for deprecated language tag characters for use when language cannot be indicated through other protocols (such as the xml:lang attribute in XML). The other block contains glyph variation selectors to indicate an alternate glyph for a character that cannot be determined by context.
As of Unicode 6.1, the SSP comprises the following two blocks:
- Tags (E0000–E007F)
- Variation Selectors Supplement (E0100–E01EF)
Private Use Area planes
The two planes 15 and 16, called Supplementary Private Use Area-A and -B are available for character assignment by parties outside the ISO and the Unicode Consortium. They are used by fonts internally to refer to auxiliary glyphs, for example, ligatures and building blocks for other glyphs. Such characters will have limited interoperability. Software and fonts that support Unicode will not necessarily support character assignments by other parties.
References
- ↑ Unicode Consortium Glossary—Supplementary Planes
- ↑ The four bits wwww in the high surrogate represents the (Unicode plane − 1). Unicode plane = wwww + 1. The highest value wwww can represent is 1111binary = Fhex = 15decimal. Hence plane (15 + 1)=16 is the highest plane a surrogate pair can represent. Hence 10 FFFFhex is the highest code point a surrogate pair can represent. See Table 3.5 "UTF-16 Bit Distribution" in the Unicode Standard http://www.unicode.org/versions/Unicode6.0.0/UnicodeStandard-6.0.pdf
- ↑ See Table 3.6 "UTF-8 Bit Distribution" in the Unicode Standard http://www.unicode.org/versions/Unicode6.0.0/UnicodeStandard-6.0.pdf
- ↑ "Unicode Character Encoding Stability Policy". Retrieved 27 September 2014.
- ↑ Unicode roadmaps
- ↑ "Unicode Data". Retrieved 11 July 2014.
- ↑ TIP Roadmap