Unicode character property
Unicode assigns character properties to each code point.[1] These properties can be used to handle "characters" (code points) in processes such as line breaking, determining script direction (for example right-to-left), or applying controls. Somewhat inconsistently, some "character properties" are also defined for code points that have no character assigned, and for code points that are labeled "<not a character>". The character properties are described in Unicode Standard Annex #44.[2]
Properties have levels of forcefulness: normative, informative, contributory, or provisional. For practical reasons, a character property can be assigned by specifying a contiguous range of code points that have the same property.
Character property
Name
Unicode characters are assigned a unique Name (na).[1] The name, in English, is composed of uppercase letters A-Z, digits 0-9, - (hyphen-minus) and <space>. Some sequences are excluded: names beginning with a space or hyphen, names ending with a space or hyphen, repeated spaces or hyphens, and a space after a hyphen are not allowed. The name is guaranteed to be unique within Unicode, and can be used to identify a code point and its character. Ideographic characters, of which there are tens of thousands, are named in the pattern "CJK UNIFIED IDEOGRAPH-hhhh". For example, U+4E00 一 CJK UNIFIED IDEOGRAPH-4E00. Formatting characters are named too: U+00A0 NO-BREAK SPACE.
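The syntax rules above can be checked mechanically. The following is a minimal Python sketch of only the rules stated in this section; the function name `is_valid_name_syntax` is illustrative and not part of any library, and the check says nothing about whether a name is actually assigned or unique.

```python
import re

# Allowed repertoire: uppercase A-Z, digits 0-9, hyphen-minus and space.
_ALLOWED = re.compile(r"[A-Z0-9 -]+")


def is_valid_name_syntax(name: str) -> bool:
    """Return True if `name` obeys the syntax rules listed above."""
    if not _ALLOWED.fullmatch(name):
        return False  # also rejects the empty string
    # No leading or trailing space or hyphen.
    if name[0] in " -" or name[-1] in " -":
        return False
    # No repeated spaces, no repeated hyphens, no space after a hyphen.
    return not any(bad in name for bad in ("  ", "--", "- "))


print(is_valid_name_syntax("CJK UNIFIED IDEOGRAPH-4E00"))  # True
print(is_valid_name_syntax("NO- BREAK SPACE"))             # False
```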
Starting from Unicode version 2.0, the published name for a code point will never change. In the event of a misspelling in a publication, a correct name will later be assigned to the code point as a Character Name Alias. Within the whole range of names, an alias is unique too.
Apart from these normative names, informal names can be assigned. These are usually other commonly used names for a character, used for illustration, but these informal names are not guaranteed to be unique.
These code points do not have a Name (na=""): Controls (General Category: Cc), Private use (Co), Surrogate (Cs), Non-characters (Cn) and Reserved (Cn). They may be referenced, informally, by a generic or specific meta-name, called "Code Point Labels": <control>, <control-0088>, <reserved>, <noncharacter-hhhh>, <private-use-hhhh>, <surrogate>. Since these labels contain <>-brackets, they can never appear as a Name, which prevents confusion.
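As an illustration, Python's standard `unicodedata` module exposes the Name property, and a code point label can be derived for code points without a Name. This is a sketch only: the helper `name_or_label` is a hypothetical name, it always emits the specific (numbered) form of the label, and the result for reserved code points depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

_LABEL_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def name_or_label(ch: str) -> str:
    """Return the Name of `ch`, or a code point label if it has no Name."""
    name = unicodedata.name(ch, "")
    if name:
        return name
    cp = ord(ch)
    category = unicodedata.category(ch)
    if category == "Cn":
        # Cn covers both noncharacters and reserved code points.
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        kind = "noncharacter" if is_nonchar else "reserved"
    elif category in _LABEL_BY_CATEGORY:
        kind = _LABEL_BY_CATEGORY[category]
    else:
        raise ValueError(f"no name known for U+{cp:04X}")
    return f"<{kind}-{cp:04X}>"


print(name_or_label("\u4e00"))  # CJK UNIFIED IDEOGRAPH-4E00
print(name_or_label("\u00a0"))  # NO-BREAK SPACE
print(name_or_label("\u0088"))  # <control-0088>
print(unicodedata.lookup("NO-BREAK SPACE") == "\u00a0")  # True
```

Because a Name is unique, `unicodedata.lookup` can map it back to exactly one character; labels cannot be looked up this way.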
Version 1.0 names
In version 2.0 of Unicode, many names were changed. From then on the rule "a name will never change" came into effect, including the strict (normative) use of alias names. Disused version 1.0-names were moved to the property Alias, to provide some backward compatibility.
General Category
Each code point is assigned a value for General Category. This is one of the character properties that are also defined for unassigned code points, and for code points that are defined as "not a character".
General Category (Unicode Character Property)[lower-alpha 1] | |||||
---|---|---|---|---|---|
Value | Category Major, minor | Basic type[lower-alpha 2] | Character assigned[lower-alpha 2] | Fixed[lower-alpha 3] | Remarks |
Letter | |||||
Lu | Letter, uppercase | Graphic | Character | | |
Ll | Letter, lowercase | Graphic | Character | | |
Lt | Letter, titlecase | Graphic | Character | | |
Lm | Letter, modifier | Graphic | Character | | |
Lo | Letter, other | Graphic | Character | | |
Mark | |||||
Mn | Mark, nonspacing | Graphic | Character | | |
Mc | Mark, spacing combining | Graphic | Character | | |
Me | Mark, enclosing | Graphic | Character | | |
Number | |||||
Nd | Number, decimal digit | Graphic | Character | | All these, and only these, have Numeric Type = De[lower-alpha 3] |
Nl | Number, letter | Graphic | Character | | |
No | Number, other | Graphic | Character | | |
Punctuation | |||||
Pc | Punctuation, connector | Graphic | Character | | |
Pd | Punctuation, dash | Graphic | Character | | |
Ps | Punctuation, open | Graphic | Character | | |
Pe | Punctuation, close | Graphic | Character | | |
Pi | Punctuation, initial quote | Graphic | Character | | May behave like Ps or Pe depending on usage |
Pf | Punctuation, final quote | Graphic | Character | | May behave like Ps or Pe depending on usage |
Po | Punctuation, other | Graphic | Character | | |
Symbol | |||||
Sm | Symbol, math | Graphic | Character | | |
Sc | Symbol, currency | Graphic | Character | | |
Sk | Symbol, modifier | Graphic | Character | | |
So | Symbol, other | Graphic | Character | | |
Separator | |||||
Zs | Separator, space | Graphic | Character | | |
Zl | Separator, line | Format | Character | | Only U+2028 line separator (LSEP) |
Zp | Separator, paragraph | Format | Character | | Only U+2029 paragraph separator (PSEP) |
Other | |||||
Cc | Other, control | Control | Character | Fixed 65 | No name[lower-alpha 4], <control> |
Cf | Other, format | Format | Character | | |
Cs | Other, surrogate | Surrogate | Not (but abstract) | Fixed 2048 | No name[lower-alpha 4], <surrogate> |
Co | Other, private use | Private-use | Not (but abstract) | Fixed 6400 in BMP, 131,068 in Planes 15–16 | No name[lower-alpha 4], <private-use> |
Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name[lower-alpha 4], <noncharacter> |
Cn | Other, not assigned | Reserved | Not | Not fixed | No name[lower-alpha 4], <reserved> |
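The General Category values in the table can be queried with Python's standard `unicodedata.category` function. The sketch below also tallies the whole code space to confirm the "Fixed" counts for controls, surrogates and private-use code points; counts for other categories depend on the Unicode version bundled with the interpreter, so they are not shown.

```python
import unicodedata
from collections import Counter

# General Category of a few sample code points.
for ch in ("A", "a", "\u01c5", "5", "(", "\u2028", "\t", "\ue000"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))

# Tally all 0x110000 code points by General Category.
counts = Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))

print(counts["Cc"])  # 65
print(counts["Cs"])  # 2048
print(counts["Co"])  # 137468, i.e. 6400 + 131068
```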
Punctuation
Characters have separate properties to denote that they are punctuation characters. These properties all have Yes/No values: Dash, Diacritic, Quotation_Mark, Space, Terminal_Punctuation, Whitespace.
Whitespace
Whitespace is a commonly used concept for a typographic effect. Basically it covers invisible characters that have a spacing effect in rendered text. It includes spaces, tabs, and new line formatting controls. In Unicode, such a character has the property set "WSpace=yes". In version 6.3, there are 25 whitespace characters.
Whitespace[a] (Unicode character property WSpace=Y) | ||||
---|---|---|---|---|
Code point | Name | Script | General category | Remark |
U+0009 | | Common | Other, control | HT, Horizontal Tab |
U+000A | | Common | Other, control | LF, Line feed |
U+000B | | Common | Other, control | VT, Vertical Tab |
U+000C | | Common | Other, control | FF, Form feed |
U+000D | | Common | Other, control | CR, Carriage return |
U+0020 | space | Common | Separator, space | |
U+0085 | | Common | Other, control | NEL, Next line |
U+00A0 | no-break space | Common | Separator, space | |
U+1680 | ogham space mark | Ogham | Separator, space | |
U+2000 | en quad | Common | Separator, space | |
U+2001 | em quad | Common | Separator, space | |
U+2002 | en space | Common | Separator, space | |
U+2003 | em space | Common | Separator, space | |
U+2004 | three-per-em space | Common | Separator, space | |
U+2005 | four-per-em space | Common | Separator, space | |
U+2006 | six-per-em space | Common | Separator, space | |
U+2007 | figure space | Common | Separator, space | |
U+2008 | punctuation space | Common | Separator, space | |
U+2009 | thin space | Common | Separator, space | |
U+200A | hair space | Common | Separator, space | |
U+2028 | line separator | Common | Separator, line | |
U+2029 | paragraph separator | Common | Separator, paragraph | |
U+202F | narrow no-break space | Common | Separator, space | |
U+205F | medium mathematical space | Common | Separator, space | |
U+3000 | ideographic space | Common | Separator, space | |
a. ^ Unicode 6.3 property list |
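Python's standard library does not expose the whitespace property directly, so a test for it can be sketched by writing out the 25 code points from the table. The names `WHITE_SPACE` and `is_white_space` are illustrative. Note that the built-in `str.isspace` uses a different definition and also accepts some control characters, such as U+001C, that are not in the table.

```python
# The 25 code points from the table above (Unicode 6.3).
WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_white_space(ch: str) -> bool:
    """True if `ch` is one of the whitespace characters listed in the table."""
    return ord(ch) in WHITE_SPACE


print(len(WHITE_SPACE))          # 25
print(is_white_space("\u00a0"))  # True  (no-break space)
print(is_white_space("\u200b"))  # False (zero width space is not in the table)
print("\x1c".isspace(), is_white_space("\x1c"))  # True False
```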
Other general characteristics
Ideographic, alphabetic, noncharacter.
Display-related properties
Shaping, width.
Bidirectional writing
Four character properties pertain to bidirectional writing: Bidirectional Character Type (formally Bidi_Class), Bidi_Control, Bidi_Mirrored and Bidi_Mirroring_Glyph.
One of Unicode's major features is support for bidirectional (Bidi) text display, right-to-left (R-to-L) and left-to-right (L-to-R). The Unicode Bidirectional Algorithm, UAX #9,[7] describes the process of presenting text with alternating script directions. For example, it enables a Hebrew quote in an English text. The Bidi_Character_Type marks a character's behaviour in directional writing. To override a direction, Unicode defines special Bidi_Control formatting characters, such as LRM, LRE, LRO, RLM, RLE, RLO and PDF. These characters can enforce a direction, and by definition only affect bidirectional writing.
Each code point has a property called Bidirectional Character Type, formally Bidi_Class. It defines the code point's behaviour in bidirectional text as interpreted by the algorithm. There are 23 possible types.
Bidirectional character type (Unicode character property Bidi_Class)[1] | |||||
---|---|---|---|---|---|
Type[2] | Description | Strong/Weak/Neutral effect, or Explicit | Directionality | General scope | Bidi_Control character[3] |
L | Left-to-Right | Strong | L-to-R | Most alphabetic and syllabic characters, Han ideographs, non-European or non-Arabic digits, LRM character, ... | U+200E left-to-right mark (LRM) |
R | Right-to-Left | Strong | R-to-L | Hebrew alphabet and related punctuation, RLM character | U+200F right-to-left mark (RLM) |
AL | Right-to-Left Arabic | Strong | R-to-L | Arabic, Thaana and Syriac alphabets, and most punctuation specific to those scripts | U+061C arabic letter mark (ALM) |
EN | European Number | Weak | | European digits, Eastern Arabic-Indic digits, ... | |
ES | European Separator | Weak | | plus sign, minus sign, ... | |
ET | European Number Terminator | Weak | | degree sign, currency symbols, ... | |
AN | Arabic Number | Weak | | Arabic-Indic digits, Arabic decimal and thousands separators, ... | |
CS | Common Number Separator | Weak | | colon, comma, full stop, no-break space, ... | |
NSM | Nonspacing Mark | Weak | | Characters in General Categories Mark, nonspacing and Mark, enclosing (Mn, Me) | |
BN | Boundary Neutral | Weak | | Default ignorables, non-characters, control characters other than those explicitly given other types | |
B | Paragraph Separator | Neutral | | paragraph separator, appropriate Newline Functions, higher-level protocol paragraph determination | |
S | Segment Separator | Neutral | | Tab | |
WS | Whitespace | Neutral | | space, figure space, line separator, form feed, General Punctuation block spaces | This set is smaller than Unicode whitespace list |
ON | Other Neutrals | Neutral | | All other characters, including object replacement character | |
LRE | Left-to-Right Embedding | Explicit | L-to-R | LRE character only | U+202A left-to-right embedding (LRE) |
LRO | Left-to-Right Override | Explicit | L-to-R | LRO character only | U+202D left-to-right override (LRO) |
RLE | Right-to-Left Embedding | Explicit | R-to-L | RLE character only | U+202B right-to-left embedding (RLE) |
RLO | Right-to-Left Override | Explicit | R-to-L | RLO character only | U+202E right-to-left override (RLO) |
PDF | Pop Directional Format | Explicit | | PDF character only | U+202C pop directional formatting (PDF) |
LRI | Left-to-Right Isolate | Explicit | L-to-R | LRI character only | U+2066 left-to-right isolate (LRI) |
RLI | Right-to-Left Isolate | Explicit | R-to-L | RLI character only | U+2067 right-to-left isolate (RLI) |
FSI | First Strong Isolate | Explicit | | FSI character only | U+2068 first strong isolate (FSI) |
PDI | Pop Directional Isolate | Explicit | | PDI character only | U+2069 pop directional isolate (PDI) |
In normal situations, the algorithm can determine the direction of a text by this character property. To control more complex Bidi situations, e.g. when an English text has a Hebrew quote, extra options are added to Unicode. Twelve characters have the property Bidi_Control=Yes: ALM, LRM, RLM, LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI and PDI, as named in the table. These are invisible formatting control characters, only used by the algorithm and with no effect outside of bidirectional formatting.[7] Despite the name, they are formatting characters, not control characters, and have General Category "Other, format (Cf)" in the Unicode definition.
Basically, the algorithm determines sequences of characters with the same strong direction type (R-to-L or L-to-R), taking into account any overruling by the special Bidi controls. Number strings (Weak types) are assigned a direction according to their strong environment, as are Neutral characters. Finally, the characters are displayed according to each string's direction.
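The Bidi_Class of a character can be read with Python's standard `unicodedata.bidirectional` function. The sketch below uses it for one small piece of the process, guessing a base direction from the first strong character. It is a deliberate simplification and not the full algorithm: it ignores explicit embeddings, overrides and isolates, and the names `STRONG` and `base_direction` are illustrative.

```python
import unicodedata

# Strong types and the direction they imply.
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def base_direction(text: str, default: str = "ltr") -> str:
    """Direction of the first strong character, or `default` if there is none."""
    for ch in text:
        direction = STRONG.get(unicodedata.bidirectional(ch))
        if direction:
            return direction
    return default


for ch in ("A", "\u05d0", "\u0627", "1", "\u0661", " ", "\u200f", "\u202e"):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
# Bidi_Class values printed, in order: L, R, AL, EN, AN, WS, R, RLO

print(base_direction("123 abc"))                   # ltr
print(base_direction("\u05e9\u05dc\u05d5\u05dd"))  # rtl
```

Digits and spaces are skipped because they are Weak or Neutral; only L, R and AL decide the result.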
Two other character properties are relevant to bidirectional text: Bidi_Mirrored=Yes indicates that the glyph should be mirrored when written R-to-L. The property Bidi_Mirroring_Glyph=U+hhhh can then point to the mirrored character. For example, brackets "()" are mirrored this way. Shaping cursive scripts such as Arabic, and mirroring glyphs that have a direction, are not part of the algorithm.
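In Python, `unicodedata.mirrored` reports Bidi_Mirrored, but the standard library does not include the Bidi_Mirroring_Glyph mapping. The sketch below therefore uses a tiny hand-written excerpt (`MIRROR_PAIRS`, an assumption for illustration); the full mapping is published in the Unicode data file BidiMirroring.txt.

```python
import unicodedata

# Hand-written excerpt of mirror pairs, for illustration only.
MIRROR_PAIRS = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def mirror_for_rtl(ch: str) -> str:
    """Return the character whose glyph mirrors `ch`, if one is known here."""
    if unicodedata.mirrored(ch):  # Bidi_Mirrored=Yes
        return MIRROR_PAIRS.get(ch, ch)
    return ch


print(unicodedata.mirrored("("), unicodedata.mirrored("A"))  # 1 0
print(mirror_for_rtl("("))  # )
```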
Casing
The Case value is Normative in Unicode. It pertains to those scripts that have uppercase (also known as capital or majuscule) and lowercase (also known as small or minuscule) letters. Case difference occurs in the scripts Latin, Greek, Coptic, Cyrillic, Glagolitic, Armenian, Deseret, and archaic Georgian.
(upper, lower, title, folding—both simple and full)
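The case operations listed above can be tried with Python's built-in string methods, which apply full case mappings. This is only an illustration of the concepts; simple (one-to-one) mappings are not exposed by these methods, and `caseless_equal` is an illustrative helper name.

```python
# Full case mappings may change the length of a string.
print("Straße".upper())     # STRASSE
print("Straße".casefold())  # strasse

# Titlecase is a third case, shown here with the digraph U+01C4/U+01C5/U+01C6.
dz_upper, dz_title, dz_lower = "\u01c4", "\u01c5", "\u01c6"
print(dz_lower.upper() == dz_upper)  # True
print(dz_lower.title() == dz_title)  # True
print(dz_title.lower() == dz_lower)  # True


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after full case folding."""
    return a.casefold() == b.casefold()


print(caseless_equal("STRASSE", "straße"))  # True
```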
Numeric values and types
Decimal
Characters are classified with a Numeric type.[1] Numeric characters include fractions, subscripts, superscripts, Roman numerals, currency numerators, encircled numbers, and script-specific digits. Each of these has a numeric value, which can be a decimal number (including zero and negatives) or a vulgar fraction. If there is no such value, as with most characters, the numeric type is "None".
The characters that do have a numeric value are separated into three groups: Decimal (De), Digit (Di) and Numeric (Nu, i.e. all others). "Decimal" means the character is a straight decimal digit. Only characters that are part of a contiguous encoded range 0..9 have numeric type Decimal. Other digits, like superscripts, have numeric type Digit. All other numeric characters, like fractions and Roman numerals, end up with the type "Numeric". The intended effect is that even a simple parser can use the decimal numeric values without being distracted by, say, a numeric superscript or a fraction. Some 41 CJK ideographs that represent a number, including those used for accounting, are typed Numeric.
On the other hand, characters that could have a numeric value as a second meaning are still marked Numeric type "None", and have no numeric value (""). E.g. Latin letters can be used in paragraph numbering like (II.A.1.b), but the letters "I", "A" and "b" are not numeric (type "None") and have no numeric value.
Numeric Type[a] (Unicode character property) | ||||
---|---|---|---|---|
Numeric type | Code | Has Numeric Value | Example | Remarks |
Not numeric | None | No | | Numeric Value="NaN" |
Decimal | De | Yes | | Straight digit (decimal-radix). Corresponds both ways with General Category=Nd[b] |
Digit | Di | Yes | | Decimal, but in typographic context |
Numeric | Nu | Yes | | Numeric value, but not decimal-radix |
a. ^ Unicode 6.0, Chapter 4.6 | ||||
b. ^ Property Value Stability, in Stability policy. |
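The three numeric types can be told apart in Python by checking which of the standard `unicodedata` lookups succeeds for a character. The helper `numeric_type` below is a sketch with an illustrative name; the order of the checks matters, because decimal digits also answer the digit and numeric lookups.

```python
import unicodedata


def numeric_type(ch: str) -> str:
    """Classify `ch` as 'De', 'Di', 'Nu' or 'None'."""
    if unicodedata.decimal(ch, None) is not None:
        return "De"
    if unicodedata.digit(ch, None) is not None:
        return "Di"
    if unicodedata.numeric(ch, None) is not None:
        return "Nu"
    return "None"


print(numeric_type("5"), unicodedata.decimal("5"))            # De 5
print(numeric_type("\u00b2"), unicodedata.digit("\u00b2"))    # Di 2   (superscript two)
print(numeric_type("\u00bd"), unicodedata.numeric("\u00bd"))  # Nu 0.5 (vulgar fraction one half)
print(numeric_type("\u216b"), unicodedata.numeric("\u216b"))  # Nu 12.0 (Roman numeral twelve)
print(numeric_type("A"))                                      # None
```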
Hexadecimal digits
Hexadecimal characters are those in the series with hexadecimal values 0...9ABCDEF (sixteen characters, decimal value 0-15). The character property Hex_Digit is set to Yes when a character is in such a series. The series are:
Characters in Unicode marked Hex_Digit=Yes | |||
---|---|---|---|
0123456789ABCDEF | Basic Latin, capitals | Also ASCII_Hex_Digit=Yes | |
0123456789abcdef | Basic Latin, small letters | Also ASCII_Hex_Digit=Yes | |
０１２３４５６７８９ＡＢＣＤＥＦ | Fullwidth forms, capitals | |
０１２３４５６７８９ａｂｃｄｅｆ | Fullwidth forms, small letters | |
Leaving out the repetition of the decimals 0-9 (twice), 44 characters are marked as such. The property ASCII_Hex_Digit marks only those hexadecimal characters that are in ASCII, i.e. the top two rows of the table.
So Unicode has no separate characters for hexadecimal values. A consequence is that, when regular characters are used, it is impossible to determine whether a hexadecimal value is intended, or even whether a value is intended at all. That should be determined at a higher level, e.g. by prepending "0x" to a hexadecimal number or by context. The only feature is that Unicode can note whether a sequence can or cannot be a hexadecimal value.
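Python's standard library has no Hex_Digit lookup, so the 44 characters can be built by hand, assuming the fullwidth forms sit at a fixed offset of 0xFEE0 from their ASCII counterparts. The names below are illustrative. As described above, such a test can only say whether a string could be hexadecimal, not whether it is meant to be.

```python
import string

ASCII_HEX_DIGIT = frozenset(string.hexdigits)  # 22 characters

# Fullwidth digits and letters are offset by 0xFEE0 from ASCII.
FULLWIDTH_HEX_DIGIT = frozenset(chr(ord(c) + 0xFEE0) for c in ASCII_HEX_DIGIT)

HEX_DIGIT = ASCII_HEX_DIGIT | FULLWIDTH_HEX_DIGIT  # 44 characters


def could_be_hex(text: str) -> bool:
    """True if every character has Hex_Digit=Yes; says nothing about intent."""
    return bool(text) and all(c in HEX_DIGIT for c in text)


print(len(HEX_DIGIT))        # 44
print(could_be_hex("FACE"))  # True, although it may simply be a word
print(could_be_hex("0x1F"))  # False, "x" is not a hexadecimal digit
```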
Block
A block is a named, contiguous range of code points. It is identified by its first and last code point. It may contain code points that are reserved, unassigned, etc. Each assigned character has a single "block name" value, taken from the 220 block names listed in the table below. Unassigned code points outside any existing block have the default value "No_Block".
Unicode blocks and contained scripts | ||||
---|---|---|---|---|
Block range | Block name | Code points[lower-alpha 1] | Plane | Scripts[lower-alpha 2][lower-alpha 3][lower-alpha 4][lower-alpha 5][lower-alpha 6] |
A&000000U+0000..U+007F | Basic Latin[lower-alpha 7] | 128 | 000 BMP | Latin, Common |
A&000080U+0080..U+00FF | Latin-1 Supplement[lower-alpha 8] | 128 | 000 BMP | Latin, Common |
A&000100U+0100..U+017F | Latin Extended-A | 128 | 00000 BMP | Latin |
A&000180U+0180..U+024F | Latin Extended-B | 208 | 000 BMP | Latin |
A&000250U+0250..U+02AF | IPA Extensions | 96 | 000 BMP | Latin |
A&0002B0U+02B0..U+02FF | Spacing Modifier Letters | 80 | 000 BMP | Latin, Common |
A&000300U+0300..U+036F | Combining Diacritical Marks | 112 | 000 BMP | Inherited |
A&000370U+0370..U+03FF | Greek and Coptic | 144 | 000 BMP | Greek, Coptic, Common |
A&000400U+0400..U+04FF | Cyrillic | 256 | 000 BMP | Cyrillic, Inherited |
A&000500U+0500..U+052F | Cyrillic Supplement | 48 | 000 BMP | Cyrillic |
A&000530U+0530..U+058F | Armenian | 96 | 000 BMP | Armenian, Common |
A&000590U+0590..U+05FF | Hebrew | 112 | 000 BMP | Hebrew |
A&000600U+0600..U+06FF | Arabic | 256 | 000 BMP | Arabic, Common, Inherited |
A&000700U+0700..U+074F | Syriac | 80 | 000 BMP | Syriac |
A&000750U+0750..U+077F | Arabic Supplement | 48 | 000 BMP | Arabic |
A&000780U+0780..U+07BF | Thaana | 64 | 000 BMP | Thaana |
A&0007C0U+07C0..U+07FF | NKo | 64 | 000 BMP | Nko |
A&000800U+0800..U+083F | Samaritan | 64 | 000 BMP | Samaritan |
A&000840U+0840..U+085F | Mandaic | 32 | 000 BMP | Mandaic |
A&0008A0U+08A0..U+08FF | Arabic Extended-A | 96 | 000 BMP | Arabic |
A&000900U+0900..U+097F | Devanagari | 128 | 000 BMP | Devanagari, Common, Inherited |
A&000980U+0980..U+09FF | Bengali | 128 | 000 BMP | Bengali |
A&000A00U+0A00..U+0A7F | Gurmukhi | 128 | 000 BMP | Gurmukhi |
A&000A80U+0A80..U+0AFF | Gujarati | 128 | 000 BMP | Gujarati |
A&000B00U+0B00..U+0B7F | Oriya | 128 | 000 BMP | Oriya |
A&000B80U+0B80..U+0BFF | Tamil | 128 | 000 BMP | Tamil |
A&000C00U+0C00..U+0C7F | Telugu | 128 | 000 BMP | Telugu |
A&000C80U+0C80..U+0CFF | Kannada | 128 | 000 BMP | Kannada |
A&000D00U+0D00..U+0D7F | Malayalam | 128 | 000 BMP | Malayalam |
A&000D80U+0D80..U+0DFF | Sinhala | 128 | 000 BMP | Sinhala |
A&000E00U+0E00..U+0E7F | Thai | 128 | 000 BMP | Thai, Common |
A&000E80U+0E80..U+0EFF | Lao | 128 | 000 BMP | Lao |
A&000F00U+0F00..U+0FFF | Tibetan | 256 | 000 BMP | Tibetan, Common |
A&001000U+1000..U+109F | Myanmar | 160 | 000 BMP | Myanmar |
A&0010A0U+10A0..U+10FF | Georgian | 96 | 000 BMP | Georgian, Common |
A&001100U+1100..U+11FF | Hangul Jamo | 256 | 000 BMP | Hangul |
A&001200U+1200..U+137F | Ethiopic | 384 | 000 BMP | Ethiopic |
A&001380U+1380..U+139F | Ethiopic Supplement | 32 | 000 BMP | Ethiopic |
A&0013A0U+13A0..U+13FF | Cherokee | 96 | 000 BMP | Cherokee |
A&001400U+1400..U+167F | Unified Canadian Aboriginal Syllabics | 640 | 000 BMP | Canadian Aboriginal |
A&001680U+1680..U+169F | Ogham | 32 | 000 BMP | Ogham |
A&0016A0U+16A0..U+16FF | Runic | 96 | 000 BMP | Runic, Common |
A&001700U+1700..U+171F | Tagalog | 32 | 000 BMP | Tagalog |
A&001720U+1720..U+173F | Hanunoo | 32 | 000 BMP | Hanunoo, Common |
A&001740U+1740..U+175F | Buhid | 32 | 000 BMP | Buhid |
A&001760U+1760..U+177F | Tagbanwa | 32 | 000 BMP | Tagbanwa |
A&001780U+1780..U+17FF | Khmer | 128 | 000 BMP | Khmer |
A&001800U+1800..U+18AF | Mongolian | 176 | 000 BMP | Mongolian, Common |
A&0018B0U+18B0..U+18FF | Unified Canadian Aboriginal Syllabics Extended | 80 | 000 BMP | Canadian Aboriginal |
A&001900U+1900..U+194F | Limbu | 80 | 000 BMP | Limbu |
A&001950U+1950..U+197F | Tai Le | 48 | 000 BMP | Tai Le |
A&001980U+1980..U+19DF | New Tai Lue | 96 | 000 BMP | New Tai Lue |
A&0019E0U+19E0..U+19FF | Khmer Symbols | 32 | 000 BMP | Khmer |
A&001A00U+1A00..U+1A1F | Buginese | 32 | 000 BMP | Buginese |
A&001A20U+1A20..U+1AAF | Tai Tham | 144 | 000 BMP | Tai Tham |
A&001B00U+1B00..U+1B7F | Balinese | 128 | 000 BMP | Balinese |
A&001B80U+1B80..U+1BBF | Sundanese | 64 | 000 BMP | Sundanese |
A&001BC0U+1BC0..U+1BFF | Batak | 64 | 000 BMP | Batak |
A&001C00U+1C00..U+1C4F | Lepcha | 80 | 000 BMP | Lepcha |
A&001C50U+1C50..U+1C7F | Ol Chiki | 48 | 000 BMP | Ol Chiki |
A&001CC0U+1CC0..U+1CCF | Sundanese Supplement | 16 | 000 BMP | Sundanese |
A&001CD0U+1CD0..U+1CFF | Vedic Extensions | 48 | 000 BMP | Common, Inherited |
A&001D00U+1D00..U+1D7F | Phonetic Extensions | 128 | 000 BMP | Cyrillic, Greek, Latin |
A&001D80U+1D80..U+1DBF | Phonetic Extensions Supplement | 64 | 000 BMP | Latin, Greek |
A&001DC0U+1DC0..U+1DFF | Combining Diacritical Marks Supplement | 64 | 000 BMP | Inherited |
A&001E00U+1E00..U+1EFF | Latin Extended Additional | 256 | 000 BMP | Latin |
A&001F00U+1F00..U+1FFF | Greek Extended | 256 | 000 BMP | Greek |
A&002000U+2000..U+206F | General Punctuation | 112 | 000 BMP | Common, Inherited |
A&002070U+2070..U+209F | Superscripts and Subscripts | 48 | 000 BMP | Latin, Common |
A&0020A0U+20A0..U+20CF | Currency Symbols | 48 | 000 BMP | Common |
A&0020D0U+20D0..U+20FF | Combining Diacritical Marks for Symbols | 48 | 000 BMP | Inherited |
A&002100U+2100..U+214F | Letterlike Symbols | 80 | 000 BMP | Latin, Greek, Common |
A&002150U+2150..U+218F | Number Forms | 64 | 000 BMP | Latin, Common |
A&002190U+2190..U+21FF | Arrows | 112 | 000 BMP | Common |
A&002200U+2200..U+22FF | Mathematical Operators | 256 | 000 BMP | Common |
A&002300U+2300..U+23FF | Miscellaneous Technical | 256 | 000 BMP | Common |
A&002400U+2400..U+243F | Control Pictures | 64 | 000 BMP | Common |
A&002440U+2440..U+245F | Optical Character Recognition | 32 | 000 BMP | Common |
A&002460U+2460..U+24FF | Enclosed Alphanumerics | 160 | 000 BMP | Common |
A&002500U+2500..U+257F | Box Drawing | 128 | 000 BMP | Common |
A&002580U+2580..U+259F | Block Elements | 32 | 000 BMP | Common |
A&0025A0U+25A0..U+25FF | Geometric Shapes | 96 | 000 BMP | Common |
A&002600U+2600..U+26FF | Miscellaneous Symbols | 256 | 000 BMP | Common |
A&002700U+2700..U+27BF | Dingbats | 192 | 000 BMP | Common |
A&0027C0U+27C0..U+27EF | Miscellaneous Mathematical Symbols-A | 48 | 000 BMP | Common |
A&0027F0U+27F0..U+27FF | Supplemental Arrows-A | 16 | 000 BMP | Common |
A&002800U+2800..U+28FF | Braille Patterns | 256 | 000 BMP | Braille |
A&002900U+2900..U+297F | Supplemental Arrows-B | 128 | 000 BMP | Common |
A&002980U+2980..U+29FF | Miscellaneous Mathematical Symbols-B | 128 | 000 BMP | Common |
A&002A00U+2A00..U+2AFF | Supplemental Mathematical Operators | 256 | 000 BMP | Common |
A&002B00U+2B00..U+2BFF | Miscellaneous Symbols and Arrows | 256 | 000 BMP | Common |
A&002C00U+2C00..U+2C5F | Glagolitic | 96 | 000 BMP | Glagolitic |
A&002C60U+2C60..U+2C7F | Latin Extended-C | 32 | 000 BMP | Latin |
A&002C80U+2C80..U+2CFF | Coptic | 128 | 000 BMP | Coptic |
A&002D00U+2D00..U+2D2F | Georgian Supplement | 48 | 000 BMP | Georgian |
A&002D30U+2D30..U+2D7F | Tifinagh | 80 | 000 BMP | Tifinagh |
A&002D80U+2D80..U+2DDF | Ethiopic Extended | 96 | 000 BMP | Ethiopic |
A&002DE0U+2DE0..U+2DFF | Cyrillic Extended-A | 32 | 000 BMP | Cyrillic |
A&002E00U+2E00..U+2E7F | Supplemental Punctuation | 128 | 000 BMP | Common |
A&002E80U+2E80..U+2EFF | CJK Radicals Supplement | 128 | 000 BMP | Han |
A&002F00U+2F00..U+2FDF | Kangxi Radicals | 224 | 000 BMP | Han |
A&002FF0U+2FF0..U+2FFF | Ideographic Description Characters | 16 | 000 BMP | Common |
A&003000U+3000..U+303F | CJK Symbols and Punctuation | 64 | 000 BMP | Han, Hangul, Common, Inherited |
A&003040U+3040..U+309F | Hiragana | 96 | 000 BMP | Hiragana, Common, Inherited |
A&0030A0U+30A0..U+30FF | Katakana | 96 | 000 BMP | Katakana, Common |
A&003100U+3100..U+312F | Bopomofo | 48 | 000 BMP | Bopomofo |
A&003130U+3130..U+318F | Hangul Compatibility Jamo | 96 | 000 BMP | Hangul |
A&003190U+3190..U+319F | Kanbun | 16 | 000 BMP | Common |
A&0031A0U+31A0..U+31BF | Bopomofo Extended | 32 | 000 BMP | Bopomofo |
A&0031C0U+31C0..U+31EF | CJK Strokes | 48 | 000 BMP | Common |
A&0031F0U+31F0..U+31FF | Katakana Phonetic Extensions | 16 | 000 BMP | Katakana |
A&003200U+3200..U+32FF | Enclosed CJK Letters and Months | 256 | 000 BMP | Katakana, Hangul, Common |
A&003300U+3300..U+33FF | CJK Compatibility | 256 | 000 BMP | Katakana, Common |
A&003400U+3400..U+4DBF | CJK Unified Ideographs Extension A | 6592 | 000 BMP | Han |
A&004DC0U+4DC0..U+4DFF | Yijing Hexagram Symbols | 64 | 000 BMP | Common |
A&004E00U+4E00..U+9FFF | CJK Unified Ideographs | 20992 | 000 BMP | Han |
A&00A000U+A000..U+A48F | Yi Syllables | 1168 | 000 BMP | Yi |
A&00A490U+A490..U+A4CF | Yi Radicals | 64 | 000 BMP | Yi |
A&00A4D0U+A4D0..U+A4FF | Lisu | 48 | 000 BMP | Lisu |
A&00A500U+A500..U+A63F | Vai | 320 | 000 BMP | Vai |
A&00A640U+A640..U+A69F | Cyrillic Extended-B | 96 | 000 BMP | Cyrillic |
A&00A6A0U+A6A0..U+A6FF | Bamum | 96 | 000 BMP | Bamum |
A&00A700U+A700..U+A71F | Modifier Tone Letters | 32 | 000 BMP | Common |
A&00A720U+A720..U+A7FF | Latin Extended-D | 224 | 000 BMP | Latin, Common |
A&00A800U+A800..U+A82F | Syloti Nagri | 48 | 000 BMP | Syloti Nagri |
A&00A830U+A830..U+A83F | Common Indic Number Forms | 16 | 000 BMP | Common |
A&00A840U+A840..U+A87F | Phags-pa | 64 | 000 BMP | Phags Pa |
A&00A880U+A880..U+A8DF | Saurashtra | 96 | 000 BMP | Saurashtra |
A&00A8E0U+A8E0..U+A8FF | Devanagari Extended | 32 | 000 BMP | Devanagari |
A&00A900U+A900..U+A92F | Kayah Li | 48 | 000 BMP | Kayah Li |
A&00A930U+A930..U+A95F | Rejang | 48 | 000 BMP | Rejang |
A&00A960U+A960..U+A97F | Hangul Jamo Extended-A | 32 | 000 BMP | Hangul |
A&00A980U+A980..U+A9DF | Javanese | 96 | 000 BMP | Javanese |
A&00AA00U+AA00..U+AA5F | Cham | 96 | 000 BMP | Cham |
A&00AA60U+AA60..U+AA7F | Myanmar Extended-A | 32 | 000 BMP | Myanmar |
A&00AA80U+AA80..U+AADF | Tai Viet | 96 | 000 BMP | Tai Viet |
A&00AAE0U+AAE0..U+AAFF | Meetei Mayek Extensions | 32 | 000 BMP | Meetei Mayek |
A&00AB00U+AB00..U+AB2F | Ethiopic Extended-A | 48 | 000 BMP | Ethiopic |
A&00ABC0U+ABC0..U+ABFF | Meetei Mayek | 64 | 000 BMP | Meetei Mayek |
A&00AC00U+AC00..U+D7AF | Hangul Syllables | 11184 | 000 BMP | Hangul |
A&00D7B0U+D7B0..U+D7FF | Hangul Jamo Extended-B | 80 | 000 BMP | Hangul |
A&00D800U+D800..U+DB7F | High Surrogates | 896 | 000 BMP | |
A&00DB80U+DB80..U+DBFF | High Private Use Surrogates | 128 | 000 BMP | |
A&00DC00U+DC00..U+DFFF | Low Surrogates | 1024 | 000 BMP | |
A&00E000U+E000..U+F8FF | Private Use Area | 6400 | 000 BMP | |
A&00F900U+F900..U+FAFF | CJK Compatibility Ideographs | 512 | 000 BMP | Han |
A&00FB00U+FB00..U+FB4F | Alphabetic Presentation Forms | 80 | 000 BMP | Latin, Hebrew, Armenian |
A&00FB50U+FB50..U+FDFF | Arabic Presentation Forms-A | 688 | 000 BMP | Arabic, Common |
A&00FE00U+FE00..U+FE0F | Variation Selectors | 16 | 000 BMP | Inherited |
A&00FE10U+FE10..U+FE1F | Vertical Forms | 16 | 000 BMP | Common |
A&00FE20U+FE20..U+FE2F | Combining Half Marks | 16 | 000 BMP | Inherited |
A&00FE30U+FE30..U+FE4F | CJK Compatibility Forms | 32 | 000 BMP | Common |
A&00FE50U+FE50..U+FE6F | Small Form Variants | 32 | 000 BMP | Common |
A&00FE70U+FE70..U+FEFF | Arabic Presentation Forms-B | 144 | 000 BMP | Arabic, Common |
A&00FF00U+FF00..U+FFEF | Halfwidth and fullwidth forms | 240 | 000 BMP | Latin, Katakana, Hangul, Common |
A&00FFF0U+FFF0..U+FFFF | Specials | 16 | 000 BMP | Common |
A&010000U+10000..U+1007F | Linear B Syllabary | 128 | 011 SMP | Linear B |
A&010080U+10080..U+100FF | Linear B Ideograms | 128 | 011 SMP | Linear B |
A&010100U+10100..U+1013F | Aegean Numbers | 64 | 011 SMP | Common |
A&010140U+10140..U+1018F | Ancient Greek Numbers | 80 | 011 SMP | Greek |
A&010190U+10190..U+101CF | Ancient Symbols | 64 | 011 SMP | Common |
A&0101D0U+101D0..U+101FF | Phaistos Disc | 48 | 011 SMP | Common, Inherited |
A&010280U+10280..U+1029F | Lycian | 32 | 011 SMP | Lycian |
A&0102A0U+102A0..U+102DF | Carian | 64 | 011 SMP | Carian |
A&010300U+10300..U+1032F | Old Italic | 48 | 011 SMP | Old Italic |
A&010330U+10330..U+1034F | Gothic | 32 | 011 SMP | Gothic |
A&010380U+10380..U+1039F | Ugaritic | 32 | 011 SMP | Ugaritic |
A&0103A0U+103A0..U+103DF | Old Persian | 64 | 011 SMP | Old Persian |
A&010400U+10400..U+1044F | Deseret | 80 | 011 SMP | Deseret |
A&010450U+10450..U+1047F | Shavian | 48 | 011 SMP | Shavian |
A&010480U+10480..U+104AF | Osmanya | 48 | 011 SMP | Osmanya |
A&010800U+10800..U+1083F | Cypriot Syllabary | 64 | 011 SMP | Cypriot |
A&010840U+10840..U+1085F | Imperial Aramaic | 32 | 011 SMP | Imperial Aramaic |
A&010900U+10900..U+1091F | Phoenician | 32 | 011 SMP | Phoenician |
A&010920U+10920..U+1093F | Lydian | 32 | 011 SMP | Lydian |
A&010980U+10980..U+1099F | Meroitic Hieroglyphs | 32 | 011 SMP | Meroitic |
A&0109A0U+109A0..U+109FF | Meoritic Cursive | 96 | 011 SMP | Meroitic |
A&010A00U+10A00..U+10A5F | Kharoshthi | 96 | 011 SMP | Kharoshthi |
A&010A60U+10A60..U+10A7F | Old South Arabian | 32 | 011 SMP | Old South Arabian |
A&010B00U+10B00..U+10B3F | Avestan | 64 | 011 SMP | Avestan |
A&010B40U+10B40..U+10B5F | Inscriptional Parthian | 32 | 011 SMP | Inscriptional Parthian |
A&010B60U+10B60..U+10B7F | Inscriptional Pahlavi | 32 | 011 SMP | Inscriptional Pahlavi |
A&010C00U+10C00..U+10C4F | Old Turkic | 80 | 011 SMP | Old Turkic |
A&010E60U+10E60..U+10E7F | Rumi Numeral Symbols | 32 | 011 SMP | Arabic |
A&011000U+11000..U+1107F | Brahmi | 128 | 011 SMP | Brahmi |
U+11080..U+110CF | Kaithi | 80 | 1 SMP | Kaithi |
U+110D0..U+110FF | Sora Sompeng | 48 | 1 SMP | Sora Sompeng |
U+11100..U+1114F | Chakma | 80 | 1 SMP | Chakma |
U+11180..U+111DF | Sharada | 96 | 1 SMP | Sharada |
U+11680..U+116CF | Takri | 80 | 1 SMP | Takri |
U+12000..U+123FF | Cuneiform | 1024 | 1 SMP | Cuneiform |
U+12400..U+1247F | Cuneiform Numbers and Punctuation | 128 | 1 SMP | Cuneiform |
U+13000..U+1342F | Egyptian Hieroglyphs | 1072 | 1 SMP | Egyptian Hieroglyphs |
U+16800..U+16A3F | Bamum Supplement | 576 | 1 SMP | Bamum |
U+16F00..U+16F9F | Miao | 160 | 1 SMP | Miao |
U+1B000..U+1B0FF | Kana Supplement | 256 | 1 SMP | Katakana, Hiragana |
U+1D000..U+1D0FF | Byzantine Musical Symbols | 256 | 1 SMP | Common |
U+1D100..U+1D1FF | Musical Symbols | 256 | 1 SMP | Common, Inherited |
U+1D200..U+1D24F | Ancient Greek Musical Notation | 80 | 1 SMP | Greek |
U+1D300..U+1D35F | Tai Xuan Jing Symbols | 96 | 1 SMP | Common |
U+1D360..U+1D37F | Counting Rod Numerals | 32 | 1 SMP | Common |
U+1D400..U+1D7FF | Mathematical Alphanumeric Symbols | 1024 | 1 SMP | Common |
U+1EE00..U+1EEFF | Arabic Mathematical Alphabetic Symbols | 256 | 1 SMP | Arabic |
U+1F000..U+1F02F | Mahjong Tiles | 48 | 1 SMP | Common |
U+1F030..U+1F09F | Domino Tiles | 112 | 1 SMP | Common |
U+1F0A0..U+1F0FF | Playing Cards | 96 | 1 SMP | Common |
U+1F100..U+1F1FF | Enclosed Alphanumeric Supplement | 256 | 1 SMP | Common |
U+1F200..U+1F2FF | Enclosed Ideographic Supplement | 256 | 1 SMP | Hiragana, Common |
U+1F300..U+1F5FF | Miscellaneous Symbols and Pictographs | 768 | 1 SMP | Common |
U+1F600..U+1F64F | Emoticons | 80 | 1 SMP | Common |
U+1F680..U+1F6FF | Transport and Map Symbols | 128 | 1 SMP | Common |
U+1F700..U+1F77F | Alchemical Symbols | 128 | 1 SMP | Common |
U+20000..U+2A6DF | CJK Unified Ideographs Extension B | 42720 | 2 SIP | Han |
U+2A700..U+2B73F | CJK Unified Ideographs Extension C | 4160 | 2 SIP | Han |
U+2B740..U+2B81F | CJK Unified Ideographs Extension D | 224 | 2 SIP | Han |
U+2F800..U+2FA1F | CJK Compatibility Ideographs Supplement | 544 | 2 SIP | Han |
U+E0000..U+E007F | Tags | 128 | 14 SSP | Common |
U+E0100..U+E01EF | Variation Selectors Supplement | 240 | 14 SSP | Inherited |
U+F0000..U+FFFFF | Supplementary Private Use Area-A | 65536 | 15 PUA | |
U+100000..U+10FFFF | Supplementary Private Use Area-B | 65536 | 16 PUA | |
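The code point counts in the table follow directly from the block ranges (last minus first, plus one), and a block can be found for any code point by searching the sorted range starts. The following is a minimal sketch, assuming a hand-copied subset of the rows above; the names BLOCKS, block_size and block_of are illustrative and not part of any Unicode API.

```python
from bisect import bisect_right

# A few rows copied from the table above: (first, last, block name).
# This is only a subset; a real lookup would load the full Blocks data file.
BLOCKS = [
    (0x11080, 0x110CF, "Kaithi"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0xE0100, 0xE01EF, "Variation Selectors Supplement"),
]
STARTS = [first for first, _, _ in BLOCKS]  # sorted ascending


def block_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


def block_of(code_point):
    """Return the block name from BLOCKS, or None if no listed block contains it."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        first, last, name = BLOCKS[i]
        if code_point <= last:
            return name
    return None


print(block_size(0x11080, 0x110CF))   # Kaithi
print(block_of(0x1F600))
```

Because blocks never overlap, a binary search on the start of each range is enough; the final comparison against the end of the range handles the gaps between blocks.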
Script
Each assigned character has a single value for its "Script" property, signifying the script to which it belongs.[16] The value is a four-letter code in the range Aaaa-Zzzz, as defined in ISO 15924, which is mapped to a writing system. Apart from describing the background and usage of a script, Unicode makes no connection between a script and the languages that use that script. So "Hebrew" refers to the Hebrew script, not to the Hebrew language.
The special code Zyyy for "Common" allows a single value for a character that is used in multiple scripts, such as punctuation and most symbols. The code Zinh "Inherited script", used for combining characters and certain other special-purpose code points, indicates that a character "inherits" its script identity from the character with which it is combined. (Unicode formerly used the private code Qaai for this purpose.) The code Zzzz "Unknown" is the default value, used for code points that do not belong to any script: unassigned, private use, noncharacter and surrogate code points. Characters of a single script can be scattered over multiple blocks, as Latin characters are. The reverse also holds: multiple scripts can be present in a single block, even when the block name suggests otherwise. For example, the block Letterlike Symbols contains characters from the Latin, Greek and Common scripts.
Unicode does not use the ISO 15924 codes "Zmth" (Mathematical notation) and "Zsym" (Symbols); symbols that are not specific to one script have the value Common instead. Where the Script is shown as "" (blank), as for the private use blocks in the table above, no script is assigned to the code points. This applies to code points that are not typographic characters, such as surrogates and private use code points.
If there is a specific script alias name in ISO 15924, it is used in the character name: U+0041 A latin capital letter a, and U+05D0 א hebrew letter alef.
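The two names quoted above can be checked with Python's standard unicodedata module. That module does not expose the Script property itself, so the sketch below only reads the first word of the character name; this is a rough heuristic for illustration (the helper name_prefix is hypothetical) and it is not a substitute for the Script data file.

```python
import unicodedata


def name_prefix(ch):
    """First word of the character's Name, or None if it has no Name.

    Heuristic only: the Script property is defined by the Unicode
    Script data file, not by the character name.
    """
    name = unicodedata.name(ch, None)
    return name.split()[0] if name else None


print(unicodedata.name("\u0041"))  # name of U+0041
print(unicodedata.name("\u05D0"))  # name of U+05D0
print(name_prefix("\u05D0"))
```

Control characters have no Name, so name_prefix returns None for them rather than raising an error.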
Normalization properties
Decompositions, decomposition type, canonical combining class, composition exclusions, and more.
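Several of these properties can be inspected with Python's standard unicodedata module, as in the following minimal sketch. The sample characters are chosen for illustration, and the values returned depend on the version of the Unicode Character Database bundled with the interpreter.

```python
import unicodedata

E_ACUTE = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
NBSP = "\u00a0"             # NO-BREAK SPACE
DEVANAGARI_QA = "\u0958"    # listed in the composition exclusions

# Decomposition mapping: a canonical mapping has no tag,
# a compatibility mapping starts with a decomposition type such as <noBreak>.
print(unicodedata.decomposition(E_ACUTE))
print(unicodedata.decomposition(NBSP))

# Canonical combining class: 0 for base characters, non-zero for most marks.
print(unicodedata.combining("e"), unicodedata.combining(COMBINING_ACUTE))

# NFD applies canonical decomposition; NFC recomposes afterwards.
decomposed = unicodedata.normalize("NFD", E_ACUTE)
recomposed = unicodedata.normalize("NFC", decomposed)

# A character in the composition exclusions stays decomposed under NFC.
excluded = unicodedata.normalize("NFC", DEVANAGARI_QA)
print([f"U+{ord(c):04X}" for c in excluded])
```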
Age
Age is the version of the Standard in which the code point was first designated. The version number is shortened to the form major.minor, although more detailed version numbers are in use: versions 4.0.0 and 4.0.1 both have 4.0 as Age. Given the releases, Age can take one of the values 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0, 4.1, 5.0, 5.1, 5.2, 6.0 and 6.1.[17][18] Code points that are not assigned have Age=Unassigned.
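The shortening from a full version number to an Age value can be sketched as follows. The function names are illustrative, and the list of Age values is copied from the paragraph above rather than read from the Unicode data files.

```python
# Age values listed above, in release order.
AGES = ["1.0", "1.1", "2.0", "2.1", "3.0", "3.1", "3.2",
        "4.0", "4.1", "5.0", "5.1", "5.2", "6.0", "6.1"]


def age_from_version(version):
    """Shorten a version such as '4.0.1' to its major.minor Age value."""
    major, minor, *_ = version.split(".")
    return f"{major}.{minor}"


def is_available(age, version):
    """True if a code point with the given Age exists in the given version."""
    if age == "Unassigned":
        return False
    # Compare by position in the release order.
    return AGES.index(age) <= AGES.index(age_from_version(version))


print(age_from_version("4.0.0"), age_from_version("4.0.1"))
print(is_available("5.2", "6.1.0"))
```

Comparing positions in the release list, rather than comparing the strings directly, keeps the ordering correct when minor numbers differ in length.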
Deprecated
Once a character has been defined, it will not be withdrawn, and its defining properties (code point, name) will not change. It can, however, be declared deprecated: a coded character whose use is strongly discouraged.[19] As of version 6.1, 111 characters are deprecated. A deprecation is noted in the code chart, and usually an alternative is available.
Boundaries
(grapheme cluster, word, line, and sentence)
References
- ↑ 1.0 1.1 1.2 Unicode 6.0 chapter 4
- ↑ "Unicode Standard Annex #44: Unicode Character Database". The Unicode Standard. 2012-01-23. Version 6.1.0
- ↑ Unicode 6.0, Chapter 4, table 4-9
- ↑ Unicode 6.0, Chapter 2, table 2-3: Types of code points
- ↑ Stability policy: Property Value Stability and table. Stability policy: Some gc groups will never change. gc=Nd corresponds with Numeric Type=De (decimal).
- ↑ Unicode 6.0, Chapter 4, table 4-12 Name=""; a Code Point Label may be used to identify a nameless code point. E.g. <control-hhhh>, <control-0088>. The Name remains blank, which can prevent inadvertently replacing, in documentation, a Control Name with a true Control code. Unicode also uses <not a character> for <noncharacter>.
- ↑ 7.0 7.1 UAX 9, Standard Annex "Unicode Bidirectional Algorithm"
- ↑ Unicode Blocks data file. As of Unicode version 6.3
- ↑ UAX 24: Unicode Script Property (4alpha code)
- ↑ UAX 24: Script data file
- ↑ Including unassigned code points: non-character, reserved
- ↑ The script has one or multiple characters in the block, as defined by the Script Property. This is independent of the block name
- ↑ "Common" (Zyyy) and "Inherited" (Zinh or Qaai) refer to Scripts in ISO 15924
- ↑ Called "C0 Controls and Basic Latin" in ISO/IEC 10646
- ↑ Called "C1 Controls and Latin-1 Supplement" in ISO/IEC 10646
- ↑ Unicode Standard Annex #24: Unicode Script Property
- ↑ Pre version 4
- ↑ Versions 4.0 and later
- ↑ "3.4 Characters and Encoding: rule D13".