Mapping of Unicode characters

From Wikipedia, the free encyclopedia

Unicode’s Universal Character Set (UCS) has room for just over 1.1 million code points: 1,114,112 in total, which is 2²⁰ + 2¹⁶, or 17 × 2¹⁶ (hexadecimal 110000).
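The arithmetic behind that total can be checked directly. The following is a minimal sketch; the constant names are illustrative, and `sys.maxunicode` is simply Python's own record of the highest code point it supports.

```python
import sys

PLANE_SIZE = 2 ** 16      # code points per plane
PLANE_COUNT = 17          # planes 0 through 16

total_code_points = PLANE_COUNT * PLANE_SIZE

# The three ways of writing the total agree.
assert total_code_points == 2 ** 20 + 2 ** 16 == 0x110000 == 1_114_112

# Code points are numbered from 0, so the highest one is one below the total.
highest_code_point = total_code_points - 1
assert highest_code_point == sys.maxunicode == 0x10FFFF

print(f"{total_code_points:,} code points, highest is U+{highest_code_point:X}")
```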

As of Unicode 5.0.0, 102,012 (9.2%) of these code points are assigned, another 137,468 (12.3%) are reserved for private use, 2,048 are surrogates, and 66 are designated noncharacters, leaving 872,518 (78.3%) unassigned. The assigned code points break down as follows:

2,684 reserved for future designation within a particular block
98,893 graphical characters
435 special-purpose characters for control, formatting, and glyph/character variation selection

(See the summary table for a more detailed breakdown.)
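As a quick consistency check on the figures quoted above, the three components can be summed and the percentages recomputed. This is only a sketch over the numbers given in the text; the dictionary keys and the helper `percent_of_code_space` are illustrative names.

```python
TOTAL_CODE_POINTS = 0x110000  # 1,114,112

# Components of the "assigned" figure, as quoted for Unicode 5.0.0.
assigned_breakdown = {
    "reserved within blocks": 2_684,
    "graphical characters": 98_893,
    "special-purpose characters": 435,
}
PRIVATE_USE = 137_468


def percent_of_code_space(count: int) -> float:
    """Share of the whole code space, rounded to one decimal place."""
    return round(100 * count / TOTAL_CODE_POINTS, 1)


assigned = sum(assigned_breakdown.values())

print(assigned, percent_of_code_space(assigned))        # 102012 9.2
print(PRIVATE_USE, percent_of_code_space(PRIVATE_USE))  # 137468 12.3
```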

Unicode characters can be categorized in many ways. Every character is assigned a script, although many are assigned to the Common or Inherited scripts, in which case they take their script from an adjacent character. In Unicode, a script is a coherent writing system that includes letters and may also include script-specific punctuation, diacritics and other marks, numerals, and symbols. A single script supports one or more languages.

Characters are assigned in blocks. These blocks usually span some multiple of eight code points: many, for example, contain 128 or 256 code points. Every character is also assigned a general category and subcategory. The general categories are letter, mark, number, punctuation, symbol, separator, and other (control, formatting, and similar non-graphical code points).
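The general category of a character can be inspected with Python's standard `unicodedata` module, which returns the two-letter codes used by the Unicode Character Database (the first letter is the general category, the second the subcategory). The sketch below is illustrative: the `MAJOR_CLASSES` labels and the `describe` helper are names chosen here, not part of any standard API.

```python
import unicodedata

# First letter of the two-letter code -> general category.
# The labels are shorthand chosen for this example.
MAJOR_CLASSES = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "S": "symbol",
    "Z": "separator",
    "C": "other",
}


def describe(ch: str) -> tuple[str, str]:
    """Return (two-letter category code, general category label) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]


for ch in ["A", "5", "\u0301", "!", "+", " ", "\n"]:
    code, label = describe(ch)
    print(f"U+{ord(ch):04X} {code} {label}")
```

Here `"\u0301"` is the combining acute accent (a mark), and the newline falls under "other" because control characters have the code `Cc`.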

Blocks are in turn allocated within planes of 65,536 (2¹⁶) code points each. Most characters are currently assigned to the first plane, the Basic Multilingual Plane (BMP). This eases the transition for legacy software, since any code point in the BMP can be addressed with just two octets. Characters outside the first plane usually have very specialized or rare uses.
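Because each plane holds 2¹⁶ code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch follows; `plane_of` and `fits_in_two_octets` are hypothetical helper names.

```python
def plane_of(code_point: int) -> int:
    """Return the plane (0 through 16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


def fits_in_two_octets(code_point: int) -> bool:
    """True when the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


for cp in (0x0041, 0xFFFD, 0x1D11E, 0x20000, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```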

The first 256 code points correspond to those of ISO 8859-1, the most popular 8-bit character encoding in the Western world. Because ISO 8859-1 itself extends ASCII, the first 128 code points are also identical to ASCII. Although Unicode names these two blocks after the Latin script (Basic Latin and Latin-1 Supplement), they contain many characters that are commonly useful outside of the Latin script.
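This correspondence can be observed with Python's built-in `latin-1` and `ascii` codecs: decoding a byte yields the character whose code point equals the byte value. The snippet is a small sketch, with variable names chosen for illustration.

```python
# Every possible byte value, decoded as ISO 8859-1 (Python codec name "latin-1").
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte maps to the code point with the same numeric value.
assert [ord(ch) for ch in decoded] == list(range(256))

# For the lower half, ASCII and ISO 8859-1 give the same result.
ascii_bytes = bytes(range(128))
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("latin-1")

# Example: byte 0xE9 is U+00E9, LATIN SMALL LETTER E WITH ACUTE.
e_acute = b"\xe9".decode("latin-1")
print(f"U+{ord(e_acute):04X}")
```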

Planes

Main article: Mapping of Unicode character planes

Graphical characters

Main article: Mapping of Unicode graphic characters

Compatibility characters

Main article: Unicode compatibility characters

Non-graphical characters

Main article: Unicode control characters

Other special-purpose characters

Several characters fall between the non-graphical control and formatting characters and full-fledged graphical characters.

Joiners and Non-joiners

This group comprises the Word Joiner (U+2060), Zero-width Joiner (U+200D), Zero-width Non-joiner (U+200C), Zero-width Space (U+200B), and Combining Grapheme Joiner (U+034F).

Invisible Separator

Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted such as in a two-dimensional index like i⁣j.

Invisible Times and Function Application

Invisible Times (U+2062) and Function Application (U+2061) are useful in mathematical text where the multiplication of terms or the application of a function is implied without any glyph indicating the operation.
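Although the joiners and invisible operators above have no visible glyph, they are ordinary code points that occupy a position in a string. The sketch below uses Python's standard `unicodedata` module to list their formal names and category codes; the list `SPECIAL_PURPOSE` and the sample string are illustrative.

```python
import unicodedata

# Code points named in the joiner and invisible-operator sections above.
SPECIAL_PURPOSE = [0x2060, 0x200D, 0x200C, 0x200B, 0x034F, 0x2063, 0x2062, 0x2061]

for cp in SPECIAL_PURPOSE:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")

# "a" INVISIBLE TIMES "b": renders like "ab" but contains three code points.
product = "a\u2062b"
print(len(product))  # 3
```

Most of these report the format category `Cf`; the Combining Grapheme Joiner is instead classed as a nonspacing mark (`Mn`).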

Spaces

The space character (U+0020) typically input by the space bar on a keyboard serves semantically as a word separator in many languages. For legacy reasons, the UCS also includes spaces of varying sizes that are compatibility equivalents for the space character. These spaces include:

Space (U+0020)
En Quad (U+2000)
Em Quad (U+2001)
En Space (U+2002)
Em Space (U+2003)
Three-Per-Em Space (U+2004)
Four-Per-Em Space (U+2005)
Six-Per-Em Space (U+2006)
Figure Space (U+2007)
Punctuation Space (U+2008)
Thin Space (U+2009)
Hair Space (U+200A)
Medium Mathematical Space (U+205F)

Apart from the original ASCII space, these spaces are all compatibility characters. In this context, that means they add no semantic content to the text and instead provide styling control. Within Unicode, such non-semantic styling control is often referred to as rich text and falls outside the scope of Unicode’s goals. Rather than using different spaces in different contexts, this styling could instead be handled by intelligent text layout software.
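One way to see the compatibility relationship is compatibility normalization, which replaces a character with its compatibility equivalent. The sketch below uses NFKC from Python's standard `unicodedata` module to fold each sized space listed above into U+0020; the names `SIZED_SPACES` and `fold_spaces` are illustrative.

```python
import unicodedata

# The sized spaces listed above, excluding U+0020 itself.
SIZED_SPACES = list(range(0x2000, 0x200B)) + [0x205F]


def fold_spaces(text: str) -> str:
    """Apply NFKC normalization, which maps compatibility characters
    (including the sized spaces) to their plain equivalents."""
    return unicodedata.normalize("NFKC", text)


for cp in SIZED_SPACES:
    assert fold_spaces(chr(cp)) == " "

# An em space and a thin space both become ordinary spaces.
print(fold_spaces("wide\u2003gap and thin\u2009gap") == "wide gap and thin gap")
```

Note that NFKC also rewrites other compatibility characters, so it is a blunt tool if only spaces are to be changed.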

Line-break control characters

Several characters are designed to help control line breaks, either by discouraging them (no-break characters) or by suggesting them, as the soft hyphen (U+00AD, sometimes called the "shy hyphen") does. Although designed for styling, such characters are probably indispensable for the intricate kinds of line breaking they make possible.

Soft Hyphen (U+00AD)
Non-breaking Hyphen (U+2011)
No-break Space (U+00A0)
Narrow No-break Space (U+202F)
Zero-width space (U+200B)

Whitespace characters

Whitespace characters do not form a separate group of characters; instead, Unicode provides a list of characters that it deems whitespace, for interoperability. Software implementations and other standards may use the term for a slightly different set. The notion matters mostly in programming environments, where whitespace often has no syntactic meaning and is skipped by interpreters and compilers. Unicode designates the legacy control characters U+0009 through U+000D and U+0085 as whitespace, along with the line separator (U+2028) and paragraph separator (U+2029) that Unicode itself introduced. The space character (U+0020) is also designated whitespace, as are the no-break spaces and the sized spaces listed under Spaces above (U+2000 through U+200A and U+205F); the zero-width space (U+200B) is not.
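Python's `str.isspace` and `str.split` illustrate how one implementation treats these characters. Python applies its own definition of whitespace, which is an example of the "slightly different set" mentioned above rather than Unicode's list itself; the list name below is illustrative.

```python
# Controls, separators and the space character named in the paragraph above.
NAMED_WHITESPACE = [0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
                    0x0085, 0x2028, 0x2029, 0x0020]

# Python treats every one of them as whitespace.
assert all(chr(cp).isspace() for cp in NAMED_WHITESPACE)

# The zero-width space is a format character, not whitespace, in Python.
assert not "\u200b".isspace()

# str.split() with no argument splits on any run of whitespace.
fields = "alpha\tbeta\u2028gamma\u0085delta epsilon".split()
print(fields)  # ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
```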

Private use characters

The UCS includes over 100,000 code points for private use. Individuals, organizations and software vendors outside ISO and the Unicode Consortium can assign characters with specific properties to these code points. A Private Use Area (PUA) is any one of the several ranges reserved for this purpose; the Unicode standard does not specify any characters within them.

The Basic Multilingual Plane includes a PUA in the range from U+E000 to U+F8FF (57344–63743). Plane Fifteen (U+F0000 to U+FFFFD) and Plane Sixteen (U+100000 to U+10FFFD) are completely reserved for private use as well.
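The three ranges above are easy to encode as a lookup, and summing their sizes reproduces the private-use total of 137,468 quoted earlier. This is a minimal sketch; `PRIVATE_USE_RANGES` and `is_private_use` are hypothetical names, and the cross-check relies on the standard `unicodedata` category code `Co` (private use).

```python
import unicodedata

# Inclusive (first, last) code point of each private-use range given above.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),       # PUA in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),     # Plane Fifteen
    (0x100000, 0x10FFFD),   # Plane Sixteen
]


def is_private_use(code_point: int) -> bool:
    """True when the code point falls in one of the private-use ranges."""
    return any(first <= code_point <= last for first, last in PRIVATE_USE_RANGES)


private_use_total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
print(private_use_total)  # 137468

# Cross-check a sample against the Unicode Character Database.
assert is_private_use(0xF8FF) and unicodedata.category("\uf8ff") == "Co"
```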

The PUA concept was inherited from certain Asian encoding systems, which had private use areas to encode Japanese gaiji (rare personal-name characters) in application-specific ways. Similarly, the ConScript Unicode Registry (unofficial and not related to the Unicode Consortium) aims to coordinate the mapping into the PUAs of scripts not yet encoded in, or rejected by, Unicode. The Medieval Unicode Font Initiative uses the PUA to encode various ligatures, precomposed characters, and symbols found in medieval texts.

One example of PUA usage is Apple's assignment of U+F8FF to the Apple logo.

Special code points

At the simplest level, each character in the UCS pairs a code point with a particular semantic function. For graphical characters, that function is often implied by the character's name and by the script or block to which it belongs. A graphical character may also have a recommended glyph that helps define its meaning. Han characters, used in China, Japan, Korea, Vietnam and their respective diasporas, carry many further properties that help define a character's semantic role.

However, the UCS and Unicode designate other code points for other purposes. Those code points may have no or few character properties associated with them.

Surrogates

The 2,048 surrogates are not characters, but are reserved for use in UTF-16 to specify code points outside the Basic Multilingual Plane. They are divided into "high surrogates" (D800–DBFF) and "low surrogates" (DC00–DFFF). In UTF-16, they must always appear in pairs, as a high surrogate followed by a low surrogate, thus using 32 bits to denote one code point.

A surrogate pair denotes the code point

10000₁₆ + (H − D800₁₆) × 400₁₆ + (L − DC00₁₆)

where H and L are the numeric values of the high and low surrogates respectively.
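The formula translates directly into code, and inverting it gives the surrogate pair for a supplementary code point. The following is a sketch with illustrative function names, not a replacement for a real UTF-16 codec; the final lines cross-check one value against Python's built-in `utf-16-be` encoder.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#x}")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {code_point:#x}")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# U+1D11E (MUSICAL SYMBOL G CLEF) is D834 DD1E in UTF-16.
high, low = encode_surrogate_pair(0x1D11E)
assert (high, low) == (0xD834, 0xDD1E)
assert decode_surrogate_pair(high, low) == 0x1D11E
assert "\U0001D11E".encode("utf-16-be") == bytes([0xD8, 0x34, 0xDD, 0x1E])
```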

Since high surrogate values in the range DB80 to DBFF always produce values in the Private Use planes, the high surrogate range can be further divided into (normal) high surrogates (D800–DB7F) and "high private use surrogates" (DB80–DBFF).
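The split between normal and private-use high surrogates follows from the arithmetic: each high surrogate selects a run of 400₁₆ (1,024) consecutive code points. A short sketch, with the helper name `code_points_for_high` chosen for illustration:

```python
def code_points_for_high(high: int) -> range:
    """Code points reachable with the given high surrogate and any low surrogate."""
    first = 0x10000 + (high - 0xD800) * 0x400
    return range(first, first + 0x400)


# The last normal high surrogate ends exactly where Plane Fourteen ends.
assert code_points_for_high(0xDB7F)[-1] == 0xEFFFF

# The first high private use surrogate starts Plane Fifteen.
assert code_points_for_high(0xDB80)[0] == 0xF0000

# The last high surrogate reaches the end of Plane Sixteen.
assert code_points_for_high(0xDBFF)[-1] == 0x10FFFF
```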

Noncharacters

Unicode reserves sixty-six code points as noncharacters. These code points are guaranteed never to have a character assigned to them, so software implementations are free to use them internally. However, noncharacters should never appear in text interchanged between implementations. One inherently useful noncharacter is U+FFFE, whose two bytes are those of the byte order mark (U+FEFF) in reverse order. If a stream of text contains this noncharacter, that is a good indication that the text has been read with the wrong endianness.
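The sketch below illustrates both points. The text above does not list the sixty-six code points; the ranges used here (U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes) are taken from the Unicode Standard and should be treated as an assumption of this example, as should the helper name `is_noncharacter`.

```python
def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of any plane."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)  # 66

# A big-endian byte order mark read as little-endian shows up as U+FFFE.
bom_big_endian = "\ufeff".encode("utf-16-be")    # bytes FE FF
misread = bom_big_endian.decode("utf-16-le")
assert misread == "\ufffe"
assert is_noncharacter(ord(misread))
```

A decoder that finds U+FFFE at the start of a stream can therefore swap its assumed byte order and try again.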

Summary table of UCS character assignments

Main article: Summary of Unicode character assignments

See also

Tables

Unicode mapping tables
BMP		SMP	SIP		SSP
0000–0FFF	8000–8FFF	10000–10FFF	20000–20FFF	28000–28FFF	E0000–E0FFF
1000–1FFF	9000–9FFF		21000–21FFF	29000–29FFF
2000–2FFF	A000–AFFF	12000–12FFF	22000–22FFF	2A000–2AFFF
3000–3FFF	B000–BFFF		23000–23FFF
4000–4FFF	C000–CFFF	1D000–1DFFF	24000–24FFF	2F000–2FFFF
5000–5FFF	D000–DFFF		25000–25FFF
6000–6FFF	E000–EFFF		26000–26FFF
7000–7FFF	F000–FFFF		27000–27FFF

External links

Unicode Consortium
decodeunicode Unicode Wiki with all 98,884 graphic characters as gifs, full text search
ConScript Unicode Registry

References

The Unicode Standard 5.0