CJK Unified Ideographs
![](../I/m/CJKV_variant_glyphs.png)
The Chinese, Japanese and Korean (CJK) scripts share a common background. In a process called Han unification, the characters shared by these scripts were identified and named "CJK Unified Ideographs". Unicode defines a total of 74,617 CJK Unified Ideographs.[1]
The terms "ideograph" and "ideogram" can be misleading, since the Chinese script is not strictly a picture writing system.
Historically, Vietnam used Chinese ideographs too, so sometimes the abbreviation "CJKV" is used. This system was replaced by the Latin-based Vietnamese alphabet in the 1920s.
CJK Unified Ideographs blocks
CJK Unified Ideographs
The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,941 basic Chinese characters in the range U+4E00 through U+9FCC. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters were also used in Vietnam's Nôm script (now obsolete). The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical sequence.
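The counts quoted above follow directly from the inclusive code point ranges. A minimal Python sketch of that arithmetic is given here; the constant and function names are illustrative and not part of any standard API.

```python
# Bounds taken from the paragraph above (all inclusive).
BASIC_FIRST = 0x4E00          # first code point of the block
BASIC_LAST_ASSIGNED = 0x9FCC  # last assigned code point per the text
BLOCK_END = 0x9FFF            # last code point of the block itself
KANGXI_ORDERED = 20902        # characters arranged in radical order

def inclusive_count(first: int, last: int) -> int:
    """Number of code points in the inclusive range [first, last]."""
    return last - first + 1

assigned = inclusive_count(BASIC_FIRST, BASIC_LAST_ASSIGNED)
unassigned = BLOCK_END - BASIC_LAST_ASSIGNED
# Last code point of the radical-ordered portion of the block.
kangxi_last = BASIC_FIRST + KANGXI_ORDERED - 1

print(assigned)               # 20941
print(unassigned)             # 51
print(f"U+{kangxi_last:04X}") # U+9FA5
```

The radical-ordered portion therefore ends at U+9FA5, and later additions begin at U+9FA6.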
The block is the result of Han unification,[2] which was somewhat controversial in the Far East.[3] Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of a selected glyph could depend on the particular font being used. However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding.[4]
Using variation selectors, it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set proposal, which actually calls for 14,658 ideographic variation sequences, is an extreme example of the use of variation selectors.[5]
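As a rough illustration, an ideographic variation sequence is a base ideograph followed by one selector from the Variation Selectors Supplement (VS17 to VS256, U+E0100 to U+E01EF). The sketch below only builds and inspects such a sequence; the helper names are hypothetical, and whether a particular sequence is registered, or rendered differently by a given font, is outside its scope.

```python
# Variation Selectors Supplement: VS17..VS256 occupy U+E0100..U+E01EF.
VS17 = 0xE0100
VS256 = 0xE01EF

def make_ivs(base: str, selector_number: int) -> str:
    """Return base followed by VS<selector_number> (17..256)."""
    if len(base) != 1:
        raise ValueError("base must be a single code point")
    if not 17 <= selector_number <= 256:
        raise ValueError("ideographic variation selectors are VS17..VS256")
    return base + chr(VS17 + (selector_number - 17))

def is_ivs_selector(ch: str) -> bool:
    """True if ch lies in the Variation Selectors Supplement block."""
    return VS17 <= ord(ch) <= VS256

# U+845B is used purely as an example base character.
seq = make_ivs("\u845B", 17)
print([f"U+{ord(c):04X}" for c in seq])  # ['U+845B', 'U+E0100']
```

Software that does not support variation sequences typically ignores the selector and shows the default glyph for the base character.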
Charts
4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.
Sources
The code points in this block were assigned under the source separation rule.
- China
Code | Standard | Character count | note |
---|---|---|---|
G0 | GB 2312-80 | 6763 | |
G1 | GB 12345-90 | 2352 | |
G3 | GB 7589-87 unsimplified form | 7237 | |
G5 | GB 7590-87 unsimplified form | 7039 | |
G7 | Modern Chinese general character chart | 42 | |
G8 | GB 8565-89 | 290 | |
- Taiwan
Code | Standard | Character count | note |
---|---|---|---|
T1 | CNS 11643-1986 plane 1 | 5401+9 | |
T2 | CNS 11643-1986 plane 2 | 7650 | |
TE | CNS 11643-1986 plane 14 | 6319+239+10 | 239 from CCCII, 10 from XCCS |
- Japan
Code | Standard | Character count | note |
---|---|---|---|
J0 | JIS X 0208-90 | 6335+1 | |
J1 | JIS X 0212-90 | 5801 | |
- South Korea
Code | Standard | Character count | note |
---|---|---|---|
K0 | KS C 5601-87 (now KS X 1001:2004) | 4888 | includes 268 duplicates |
K1 | KS C 5657-91 (now KS X 1002:2004) | 2856 | |
K2 | PKS C 5700-1:1994 (now KS X 1027-1:2011) | 7911 | |
K4 | PKS 5700-3:1998 (now KS X 1027-3:2011) | 4 | |
- Others
- ANSI Z39.64-1989
- Big5
- CCCII plane 1
- GB 12052-89
- JEF
- Chinese telegraph code
- Taiwan telegraph code
- Xerox Chinese
In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to the code points U+9FA6 through U+9FBB.
CJK Unified Ideographs Extension A
The block named CJK Unified Ideographs Extension A (3400–4DBF) contains 6,582 additional characters in the range U+3400 through U+4DB5 that were added in Unicode 3.0 (1999).
Sources
- China
Code | Standard |
---|---|
GE | GB 16500-95 |
GS | Singapore CJK ideographs |
- Taiwan
Code | Standard | note |
---|---|---|
T3 | CNS 11643-1992 plane 3 | |
T4 | CNS 11643-1992 plane 4 | |
T5 | CNS 11643-1992 plane 5 | |
T6 | CNS 11643-1992 plane 6 | |
T7 | CNS 11643-1992 plane 7 | |
TF | CNS 11643-1992 plane 15 | |
- Japan
Code | Standard | note |
---|---|---|
JA | Unified Japanese IT Vendors Contemporary Ideographs, 1993 | |
- South Korea
Code | Standard | note |
---|---|---|
K3 | PKS C 5700-2:1994 (now KS X 1027-2:2011) | |
K4 | PKS 5700-3:1998 (now KS X 1027-3:2011) | |
- Vietnam
Code | Standard | note |
---|---|---|
V0 | TCVN 5773:1993 | |
V1 | TCVN 6056:1995 | |
CJK Unified Ideographs Extension B
The block named CJK Unified Ideographs Extension B (20000–2A6DF) contains 42,711 characters in the range U+20000 through U+2A6D6 that were added in Unicode 3.1 (2001). These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Nôm characters that were formerly used to write Vietnamese.
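Because Extension B lies above U+FFFF, its characters need two 16-bit code units (a surrogate pair) in UTF-16 and four bytes in UTF-8. The following Python sketch shows this for the first code point of the block; the helper name is an assumption made for illustration.

```python
def utf16_code_units(ch: str) -> list:
    """Return the UTF-16 code units of one character as integers."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

first_ext_b = chr(0x20000)  # first code point of Extension B
basic_char = "\u4e00"       # first code point of the basic block

print([hex(u) for u in utf16_code_units(first_ext_b)])  # ['0xd840', '0xdc00']
print([hex(u) for u in utf16_code_units(basic_char)])   # ['0x4e00']
print(len(first_ext_b.encode("utf-8")))                 # 4
print(len(basic_char.encode("utf-8")))                  # 3
```

Code that indexes text by UTF-16 code unit can therefore split an Extension B character in half unless it handles surrogate pairs.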
Charts
20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.
Sources
- Kangxi dictionary
- Hanyu Da Zidian
- Ciyuan
- Cihai
- Hanyu Da Cidian
- Encyclopedia of China
- Beijing University Founder DTP
- Siku Quanshu
- HKSCS
- JIS X 0213 planes 1 and 2, also known as levels 3 and 4
- PKS 5700-3:1998 (now KS X 1027-3:2011), Korean IRG Hanja Character Set 5th Edition: 2001 (now KS X 1027-4:2011)
- KPS 9566-97, KPS 10721-2000
- CNS 11643 planes 4-7, 15
- TCVN, VHN 01:1998, VHN 02:1998
CJK Unified Ideographs Extension C
The block named CJK Unified Ideographs Extension C (2A700–2B73F) contains 4,149 characters in the range U+2A700 through U+2B734 that were added in Unicode 5.2 (2009).
Sources
- China
- Encyclopedia of China
- Beijing University Founder DTP
- Hanyu Da Zidian
- Hanyu Da Cidian
- Old hanyu word dictionary
- Commercial Press Ideographs
- Xiandai Hanyu Cidian
- Cihai
- Kangxi dictionary
- Chinese Academy of Surveying & Mapping
- Modern Chinese Dialect Encyclopedia
- Yinzhou jinwen jicheng yinde (殷周金文集成引得)
- Japan
- Japanese KOKUJI Collection
- South Korea
- Korean IRG Hanja Character Set 5th Edition: 2001
- North Korea
- KPS 10721:2003
- Vietnam
- Nguyễn Quang Hồng, Từ điển chữ Nôm [Dictionary of Nom], 2006.
- Hoàng Triều Ân, Từ điển chữ Nôm Tày [Dictionary of Nom used by the Tay People], 2003.
- Vũ Văn Kính, Bảng tra chữ Nôm miền Nam [Table of Nom Characters in the South], 1994.
- Other
- Unicode UTC
- DeFrancis, John, et al., ABC Chinese-English Dictionary, 2nd edition. (1998) Honolulu: University of Hawaii Press
- The Church of Jesus Christ of Latter-day Saints Hong Kong division
- Mathews, Robert H., Mathews' Chinese-English Dictionary, (1975) Cambridge; Harvard University Press
- Guangyun
- Zheng Zhuoxin (郑作新), et al., 中国鸟类系统检索 [Chinese bird system index], (2000), Beijing, (www.sciencep.com)
- Shuowen Jiezi, Duan Yucai, Annotated
CJK Unified Ideographs Extension D
The block named CJK Unified Ideographs Extension D (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).
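The block ranges given so far can be collected into a small lookup. This is a sketch using only the ranges stated in this article; the table and function names are hypothetical, and the check is by block, so it also accepts the unassigned code points at the end of each block.

```python
from typing import Optional

# Inclusive block ranges as listed in this article.
UNIFIED_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
    ("CJK Unified Ideographs Extension C", 0x2A700, 0x2B73F),
    ("CJK Unified Ideographs Extension D", 0x2B740, 0x2B81F),
]

def unified_block_of(ch: str) -> Optional[str]:
    """Return the name of the unified block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in UNIFIED_BLOCKS:
        if first <= cp <= last:
            return name
    return None

print(unified_block_of("\u4e2d"))     # CJK Unified Ideographs
print(unified_block_of(chr(0x3400)))  # CJK Unified Ideographs Extension A
print(unified_block_of(chr(0x2B740))) # CJK Unified Ideographs Extension D
print(unified_block_of("A"))          # None
```

The lookup does not cover the twelve unified ideographs that sit in the CJK Compatibility Ideographs block.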
CJK Unified Ideographs Extension E (projected)
The CJK Unified Ideographs Extension E block was earlier provisionally named Extension D.
CJK-E was originally intended to include more than 16,000 characters not present in CJK-C. However, in May 2007 the Republic of China (Taiwan) withdrew 6,545 personal-name characters deemed no longer in use,[6] in May 2013 China withdrew 6 characters,[7] and many others were later withdrawn or moved to the projected CJK-F.[8] The current version therefore includes 5,762 new Han characters, which are to be added in Unicode 8.0.[9]
CJK Unified Ideographs Extension F (projected)
The IRG agreed on the proposal for a CJK Unified Ideographs Extension F at the 38th IRG meeting in June 2012,[10] and work on CJK-F is currently in progress.
CJK Compatibility Ideographs
There are four Unicode blocks whose names include the phrase "CJK Compatibility":
- CJK Compatibility (3300–33FF)
- CJK Compatibility Forms (FE30–FE4F)
- CJK Compatibility Ideographs (F900–FAFF)
- CJK Compatibility Ideographs Supplement (2F800–2FA1F)
The CJK Compatibility Ideographs block contains twelve characters that are classified as CJK Unified Ideographs. None of the other characters in these blocks relate to CJK unification. See "Unified ideographs outside of the blocks" below.
Known issues
Disunification of U+4039
The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings.
The proposal to disunify U+4039[11] was accepted, and the new character was encoded at U+9FC3 in Unicode 5.1.
Unified ideographs outside of the blocks
The CJK Compatibility Ideographs block (F900-FAFF) is not part of the "unified ideographs" list, but includes twelve characters that are in fact classified and named as unified ideographs: FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28 and FA29.
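One practical consequence can be checked with Python's standard `unicodedata` module: ordinary compatibility ideographs carry a canonical decomposition to a unified ideograph, while the twelve characters listed above carry none. The sketch below assumes the Unicode data bundled with the Python interpreter; the list and function names are illustrative.

```python
import unicodedata

# The twelve code points listed above.
UNIFIED_IN_COMPAT_BLOCK = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]

def has_decomposition(cp: int) -> bool:
    """True if the character has a decomposition mapping in the database."""
    return unicodedata.decomposition(chr(cp)) != ""

# U+F900 is an ordinary compatibility ideograph, used here for contrast.
print(has_decomposition(0xF900))
print(any(has_decomposition(cp) for cp in UNIFIED_IN_COMPAT_BLOCK))

# Normalization replaces U+F900 but leaves the twelve untouched.
print(unicodedata.normalize("NFC", "\uF900") == "\uF900")
print(unicodedata.normalize("NFC", "\uFA0E") == "\uFA0E")
```

This is why normalizing text can silently replace most characters of the block but not these twelve.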
Unifiable variants and exact duplicates in Extension B
In CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded.[12] In addition to the deliberate encoding of close glyph variants, six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a de facto disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:[13]
- U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
- U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
- U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the China-, Taiwan- and Japan-source glyphs for U+8641
- U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
- U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
- U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
- U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
- U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23.")
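For applications such as search, the pairs above can be folded together before comparison. The following is only a sketch: the mapping is keyed by the second code point of each pair (always the Extension B character), and treating the first code point as the preferred form is an illustrative choice, not a normalization defined by Unicode.

```python
# Second code point of each pair above -> first code point of that pair.
# The direction of the mapping is an assumption made for this example.
DUPLICATE_TO_COUNTERPART = {
    0x20457: 0x34A8,
    0x2420E: 0x3DB7,
    0x27144: 0x8641,
    0x23515: 0x204F2,
    0x249E9: 0x249BC,
    0x2A415: 0x24BD2,
    0x26866: 0x26842,
    0x27EAF: 0xFA23,
}

def fold_duplicates(text: str) -> str:
    """Replace each listed duplicate with its counterpart; leave the rest."""
    return text.translate(DUPLICATE_TO_COUNTERPART)

sample = "x" + chr(0x2420E) + "y"
folded = fold_duplicates(sample)
print([f"U+{ord(c):04X}" for c in folded])  # ['U+0078', 'U+3DB7', 'U+0079']
```

For the two semi-duplicates (U+20457 and U+27144), folding discards a distinction that some sources treat as significant, so it is only suitable for loose matching.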
Other CJK Ideographs in Unicode, not Unified
Apart from the five blocks of "Unified Ideographs", Unicode has about a dozen more blocks with non-unified CJK characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have (decomposable) counterparts in other blocks, their usage can differ.
Four blocks of compatibility characters (one of which contains twelve characters classified as unified ideographs) are included for compatibility with legacy text handling systems and other legacy character sets. They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Their use is therefore discouraged.
Usually, compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. However, the number of CJK ideographs in non-Unicode standards is too large to fit into Unicode's CJK Compatibility Ideographs blocks. Instead, code points there are assigned to characters that the Unicode Consortium has approved but that have not yet been assigned code points within the CJK Unified Ideographs blocks.
Unicode version history
CJK Unified Ideographs additions per Unicode version
Unicode version | Addition | Plane | Characters added | Total characters |
---|---|---|---|---|
1.0 (1991) | CJK Unified Ideographs | Basic Multilingual Plane (BMP) | 20,902 | 20,914 |
1.0 (1991) | CJK Compatibility Ideographs | BMP | 12 | 20,914 |
3.0 (1999) | CJK Unified Ideographs Extension A | BMP | 6,582 | 27,496 |
3.1 (2001) | CJK Unified Ideographs Extension B | Supplementary Ideographic Plane (SIP) | 42,711 | 70,207 |
4.1 (2005) | CJK Unified Ideographs: Ideographs from HKSCS-2004 and GB 18030-2000 not in ISO 10646 | BMP | 22 | 70,229 |
5.1 (2008) | CJK Unified Ideographs: Ideographs from Adobe Japan and disunification of U+4039 | BMP | 8 | 70,237 |
5.2 (2009) | CJK Unified Ideographs Extension C | SIP | 4,149 | 74,394 |
5.2 (2009) | 8 other characters from ARIB #47, #95, #93 and HKSCS | BMP | 8 | 74,394 |
6.0 (2010) | CJK Unified Ideographs Extension D | SIP | 222 | 74,616 |
6.1 (2012) | 1 character corresponding to Adobe-Japan1-6 CID+20156 | BMP | 1 | 74,617 |
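The running totals in the table can be cross-checked with a few lines of Python. The sketch below simply re-adds the per-version figures from the table; the variable names are illustrative.

```python
# Characters added per Unicode version, copied from the table.
# Versions 1.0 and 5.2 each combine two table rows.
additions = [
    ("1.0", 20902 + 12),
    ("3.0", 6582),
    ("3.1", 42711),
    ("4.1", 22),
    ("5.1", 8),
    ("5.2", 4149 + 8),
    ("6.0", 222),
    ("6.1", 1),
]

# Totals as stated in the last column of the table.
stated_totals = {
    "1.0": 20914, "3.0": 27496, "3.1": 70207, "4.1": 70229,
    "5.1": 70237, "5.2": 74394, "6.0": 74616, "6.1": 74617,
}

running = 0
computed_totals = {}
for version, added in additions:
    running += added
    computed_totals[version] = running

print(computed_totals == stated_totals)  # True
print(running)                           # 74617
```

The final figure matches the total of 74,617 unified ideographs given in the introduction.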
Notes
- [1] Unicode 6.1 Character Database – property list file
- [2] The Unicode Standard 4.0, Appendix A - Han Unification History
- [3] Suzanne Topping, "The secret life of Unicode"
- [4] "Chapter 11 - East Asian scripts", The Unicode Standard 4.0.
- [5] PRI 108: Combined registration of the Adobe Japan1 collection and of sequences in that collection
- [6] IRG N 1306: Request to Withdraw 6545 T-Source from CJK D candidate
- [7] ISO/IEC JTC1/SC2/WG2 N4439
- [8] http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg38/IRGN1872_US_ExtF.pdf
- [9] http://www.unicode.org/charts/PDF/Unicode-8.0/
- [10] http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg38/IRGN1870Resolutions.doc
- [11] Andrew West and John Jenkins, proposal of disunification of U+4039
- [12] Unifiable glyph variants
- [13] Cook, Richard (6 October 2003). "Defect Report on Duplicate Encoded CJK Forms" (PDF). ISO/IEC JTC1/SC2/WG2. Retrieved 2012-03-28.
See also
- Han Unification
- List of Unicode characters
- List of CJK fonts
- Ideographic Rapporteur Group
External links
- Unicode Consortium U+4E00... (PDF)
- Information on a number of the 98,884 characters in Unicode 5.0 from the decodeUnicode Wiki project at the University of Applied Sciences in Mainz, Germany