CJK Unified Ideographs

CJKV ideograph 次 in traditional and simplified Chinese, Korean, Vietnamese and Japanese.

The Chinese, Japanese and Korean (CJK) scripts share a common background. In the process called Han unification the common (shared) characters were identified, and named "CJK Unified Ideographs". Unicode defines a total of 74,617 CJK Unified Ideographs.^[1]

The terms "ideograph" and "ideogram" may be misleading, since the Chinese script is not strictly a picture writing system.

Historically, Vietnam used Chinese ideographs too, so sometimes the abbreviation "CJKV" is used. This system was replaced by the Latin-based Vietnamese alphabet in the 1920s.

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,941 basic Chinese characters in the range U+4E00 through U+9FCC. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters were also used in Vietnam's Nôm script (now obsolete). The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical sequence.
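The character count follows directly from the assigned range, and membership in the block is a simple range test. A minimal Python sketch, in which the constant and function names are illustrative and not taken from any Unicode library:

```python
# Assigned range of the basic block as stated above (U+4E00 through U+9FCC).
# The block itself reserves U+4E00 through U+9FFF.
BASIC_FIRST = 0x4E00
BASIC_LAST_ASSIGNED = 0x9FCC
BASIC_BLOCK_END = 0x9FFF


def assigned_count(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


def in_basic_block(ch: str) -> bool:
    """True if the single character ch lies in the block range 4E00-9FFF."""
    return BASIC_FIRST <= ord(ch) <= BASIC_BLOCK_END


print(assigned_count(BASIC_FIRST, BASIC_LAST_ASSIGNED))  # 20941
print(in_basic_block("次"), in_basic_block("A"))          # True False
```

The range test covers the whole block, so it also accepts the reserved code points after U+9FCC that have no character assigned.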

The block is the result of Han unification,^[2] which was somewhat controversial in the Far East.^[3] Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of a selected glyph could depend on the particular font being used. However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding.^[4]

Using variation selectors, it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set proposal, which actually calls for 14,658 ideographic variation sequences, is an extreme example of the use of variation selectors.^[5]
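An ideographic variation sequence is a base ideograph followed by one variation selector from the Variation Selectors Supplement (U+E0100 through U+E01EF). The sketch below builds and recognises such a sequence in Python. The helper names are hypothetical, and the pairing of U+845B with U+E0100 is used only as an illustration: whether a given sequence is registered should be checked against the Ideographic Variation Database.

```python
VS_SUPPLEMENT_FIRST = 0xE0100  # VARIATION SELECTOR-17
VS_SUPPLEMENT_LAST = 0xE01EF   # VARIATION SELECTOR-256


def make_ivs(base: str, selector_index: int) -> str:
    """Append the selector_index-th supplementary variation selector (0-based)."""
    cp = VS_SUPPLEMENT_FIRST + selector_index
    if not VS_SUPPLEMENT_FIRST <= cp <= VS_SUPPLEMENT_LAST:
        raise ValueError("selector index out of range")
    return base + chr(cp)


def is_ivs(seq: str) -> bool:
    """True if seq is one character followed by one supplementary selector.

    The base character is not checked; a stricter test would also require
    it to be an ideograph.
    """
    return (
        len(seq) == 2
        and VS_SUPPLEMENT_FIRST <= ord(seq[1]) <= VS_SUPPLEMENT_LAST
    )


seq = make_ivs("\u845b", 0)
print([f"U+{ord(c):04X}" for c in seq])  # ['U+845B', 'U+E0100']
print(is_ivs(seq))                        # True
```

A font that does not support the sequence normally ignores the selector and shows its default glyph for the base character.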

Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.

Sources

The code points in this block are assigned under the source separation rule.

China

Code	Standard	Character count
G0	GB 2312-80	6763
G1	GB 12345-90	2352
G3	GB 7589-87 unsimplified form	7237
G5	GB 7590-87 unsimplified form	7039
G7	Modern Chinese general character chart	42
G8	GB 8565-89	290

Taiwan

Code	Standard	Character count	note
T1	CNS 11643-1986 plane 1	5401+9
T2	CNS 11643-1986 plane 2	7650
TE	CNS 11643-1986 plane 14	6319+239+10	239 from CCCII, 10 from XCCS

Japan

Code	Standard	Character count	note
J0	JIS X 0208-90	6335+1
J1	JIS X 0212-90	5801

South Korea

Code	Standard	Character count	note
K0	KS C 5601-87 (now KS X 1001:2004)	4888	includes 268 duplicates
K1	KS C 5657-91 (now KS X 1002:2004)	2856
K2	PKS C 5700-1:1994 (now KS X 1027-1:2011)	7911
K4	PKS 5700-3:1998 (now KS X 1027-3:2011)	4

Others

ANSI Z39.64-1989
Big5
CCCII plane 1
GB 12052-89
JEF
Chinese telegraph code
Taiwan telegraph code
Xerox Chinese

In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to the code points U+9FA6 through U+9FBB.

CJK Unified Ideographs Extension A

The block named CJK Unified Ideographs Extension A (3400–4DBF) contains 6,582 additional characters in the range U+3400 through U+4DB5 that were added in Unicode 3.0 (1999).

Charts

3400-4DBF.

Sources

China

Code	Standard
GE	GB 16500-95
GS	Singapore CJK ideographs

Taiwan

Code	Standard	note
T3	CNS 11643-1992 plane 3
T4	CNS 11643-1992 plane 4
T5	CNS 11643-1992 plane 5
T6	CNS 11643-1992 plane 6
T7	CNS 11643-1992 plane 7
TF	CNS 11643-1992 plane 15

Japan

Code	Standard	note
JA	Unified Japanese IT Vendors Contemporary Ideographs, 1993

South Korea

Code	Standard	note
K3	PKS C 5700-2:1994 (now KS X 1027-2:2011)
K4	PKS 5700-3:1998 (now KS X 1027-3:2011)

Vietnam

Code	Standard	note
V0	TCVN 5773:1993
V1	TCVN 6056:1995

CJK Unified Ideographs Extension B

The block named CJK Unified Ideographs Extension B (20000–2A6DF) contains 42,711 characters in the range U+20000 through U+2A6D6 that were added in Unicode 3.1 (2001). These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Nôm characters that were formerly used to write Vietnamese.

Charts

20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.

Sources

Kangxi dictionary
Hanyu Da Zidian
Ciyuan
Cihai
Hanyu Da Cidian
Encyclopedia of China
Beijing University Founder DTP
Siku Quanshu
HKSCS
JIS X 0213 planes 1 and 2, also known as levels 3 and 4
PKS 5700-3:1998 (now KS X 1027-3:2011), Korean IRG Hanja Character Set 5th Edition: 2001 (now KS X 1027-4:2011)
KPS 9566-97, KPS 10721-2000
CNS 11643 planes 4-7, 15
TCVN, VHN 01:1998, VHN 02:1998

CJK Unified Ideographs Extension C

The block named CJK Unified Ideographs Extension C (2A700–2B73F) contains 4,149 characters in the range U+2A700 through U+2B734 that were added in Unicode 5.2 (2009).

Charts

2A700-2B73F.

Sources

China

Encyclopedia of China
Beijing University Founder DTP
Hanyu Da Zidian
Hanyu Da Cidian
Old hanyu word dictionary
Commercial Press Ideographs
Xiandai Hanyu Cidian
Cihai
Kangxi dictionary
Chinese Academy of Surveying & Mapping
Modern Chinese Dialect Encyclopedia
Yinzhou jinwen jicheng yinde (殷周金文集成引得)

Japan

Japanese KOKUJI Collection

South Korea

Korean IRG Hanja Character Set 5th Edition: 2001

North Korea

KPS 10721:2003

Vietnam

Nguyễn Quang Hồng, Từ điển chữ Nôm [Dictionary of Nom], 2006.
Hoàng Triều Ân, Từ điển chữ Nôm Tày [Dictionary of Nom used by the Tay People], 2003.
Vũ Văn Kính, Bảng tra chữ Nôm miền Nam [Table of Nom Characters in the South], 1994.

Other

Unicode UTC
DeFrancis, John, et al., ABC Chinese-English Dictionary, 2nd edition. (1998) Honolulu: University of Hawaii Press
The Church of Jesus Christ of Latter-day Saints Hong Kong division
Mathews, Robert H., Mathews' Chinese-English Dictionary, (1975) Cambridge; Harvard University Press
Guangyun
Zheng Zhuoxin (郑作新), et al., 中国鸟类系统检索 [Chinese bird system index], (2000), Beijing, (www.sciencep.com)
Shuowen Jiezi, Duan Yucai, Annotated

CJK Unified Ideographs Extension D

The block named CJK Unified Ideographs Extension D (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).

Charts

2B740-2B81F.

CJK Unified Ideographs Extension E (projected)

The CJK Unified Ideographs Extension E block was earlier provisionally named Extension D.

CJK-E was originally intended to include more than 16,000 characters not present in CJK-C. However, in May 2007 the Republic of China (Taiwan) withdrew 6,545 characters used in personal names that were deemed no longer in use,^[6] in May 2013 China withdrew 6 characters,^[7] and many others were later withdrawn or moved to the projected CJK-F,^[8] so the current version includes 5,762 new characters. CJK-E, with these 5,762 Han characters, is scheduled to be added in Unicode 8.0.^[9]

CJK Unified Ideographs Extension F (projected)

The IRG agreed on the proposal for a CJK Unified Ideographs Extension F at the 38th IRG meeting in June 2012,^[10] and work on CJK-F is in progress.

CJK Compatibility Ideographs

There are four Unicode blocks whose names include the phrase "CJK Compatibility":

CJK Compatibility (3300–33FF)
CJK Compatibility Forms (FE30–FE4F)
CJK Compatibility Ideographs (F900–FAFF)
CJK Compatibility Ideographs Supplement (2F800–2FA1F)

The CJK Compatibility Ideographs block contains twelve characters that are CJK Unified Ideographs. None of the other characters in these four blocks relates to CJK unification. See Unified ideographs outside of the blocks below.

Known issues

Disunification of U+4039

The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings.

The proposal to disunify U+4039^[11] was accepted, and the new character was encoded at U+9FC3 in Unicode 5.1.

Unified ideographs outside of the blocks

The CJK Compatibility Ideographs block (F900-FAFF) is not part of the "unified ideographs" list, but includes twelve characters that are in fact classified and named as unified ideographs: FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28 and FA29.
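Taken together, the assigned ranges of the unified blocks and these twelve exceptions determine whether a code point is a unified ideograph. A minimal Python sketch using the figures in this article (as of Unicode 6.1), with illustrative table and function names:

```python
# Assigned ranges of the five unified blocks, as given in this article.
UNIFIED_RANGES = [
    (0x4E00, 0x9FCC),    # CJK Unified Ideographs
    (0x3400, 0x4DB5),    # Extension A
    (0x20000, 0x2A6D6),  # Extension B
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
]

# The twelve unified ideographs inside the CJK Compatibility Ideographs block.
UNIFIED_IN_COMPAT = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}


def is_unified_ideograph(cp: int) -> bool:
    """True if cp is a unified ideograph according to the tables above."""
    if cp in UNIFIED_IN_COMPAT:
        return True
    return any(first <= cp <= last for first, last in UNIFIED_RANGES)


total = sum(last - first + 1 for first, last in UNIFIED_RANGES)
total += len(UNIFIED_IN_COMPAT)
print(total)                         # 74617
print(is_unified_ideograph(0xFA23))  # True
print(is_unified_ideograph(0xF900))  # False
```

The ranges are hard-coded for one Unicode version. Later versions extend them, so code that must stay current would normally read the ranges from the Unicode Character Database instead.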

Unifiable variants and exact duplicates in Extension B

In CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded.^[12] In addition to the deliberate encoding of close glyph variants, six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a de facto disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:^[13]

U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the China-, Taiwan- and Japan-source glyphs for U+8641
U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23.")
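The remark about U+FA23 in the last entry can be observed with Python's standard unicodedata module. In this sketch U+F900 is chosen here as an example of an ordinary compatibility ideograph for contrast. It is not taken from the list above, and the results depend on the Unicode Character Database bundled with the interpreter.

```python
import unicodedata

compat = "\uF900"   # an ordinary compatibility ideograph
unified = "\uFA23"  # one of the twelve unified ideographs in the same block

# A compatibility ideograph carries a canonical decomposition to its unified
# counterpart, so normalization replaces it.
print(unicodedata.decomposition(compat))                 # 8C48
print(unicodedata.normalize("NFC", compat) == "\u8C48")  # True

# U+FA23 has no decomposition and survives normalization unchanged,
# even though its name says "COMPATIBILITY".
print(unicodedata.name(unified))
print(unicodedata.decomposition(unified) == "")          # True
print(unicodedata.normalize("NFC", unified) == unified)  # True
```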

Other CJK Ideographs in Unicode, not Unified

Apart from the five blocks of "Unified Ideographs", Unicode has about a dozen more blocks containing CJK characters that are not unified. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have (decomposable) counterparts in other blocks, their usage can differ.

Four blocks of compatibility characters (one of which is labelled "Unified Ideographs") are included for compatibility with legacy text handling systems and other legacy character sets. They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Their use is therefore discouraged.

Usually, compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. However, the number of CJK ideographs in non-Unicode standards is too large to fit into Unicode's CJK Compatibility Ideographs blocks. Instead, code points in these blocks are assigned to characters that the Unicode Consortium has approved but that have not yet been assigned code points within the CJK Unified Ideographs blocks.

Unicode version history

CJK Unified Ideographs additions per Unicode version
Unicode version	Addition	Plane	Characters added	Total Characters
1.0 (1991)	CJK Unified Ideographs	Basic Multilingual Plane (BMP)	20,902	20,914
1.0 (1991)	CJK Compatibility Ideographs	BMP	12	20,914
3.0 (1999)	CJK Unified Ideographs Extension A	BMP	6,582	27,496
3.1 (2001)	CJK Unified Ideographs Extension B	Supplementary Ideographic Plane (SIP)	42,711	70,207
4.1 (2005)	CJK Unified Ideographs: Ideographs from HKSCS-2004 and GB 18030-2000 not in ISO 10646	BMP	22	70,229
5.1 (2008)	CJK Unified Ideographs: Ideographs from Adobe Japan and disunification of U+4039	BMP	8	70,237
5.2 (2009)	CJK Unified Ideographs Extension C	SIP	4,149	74,394
5.2 (2009)	8 other characters from ARIB #47, #95, #93 and HKSCS	BMP	8	74,394
6.0 (2010)	CJK Unified Ideographs Extension D	SIP	222	74,616
6.1 (2012)	1 character corresponding to Adobe-Japan1-6 CID+20156	BMP	1	74,617
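The running totals in the table can be checked by accumulating the per-version additions. A short Python sketch, with the figures copied from the table above and rows for the same version added together:

```python
from itertools import accumulate

# (Unicode version, characters added in that version), from the table above.
ADDITIONS = [
    ("1.0", 20902 + 12),
    ("3.0", 6582),
    ("3.1", 42711),
    ("4.1", 22),
    ("5.1", 8),
    ("5.2", 4149 + 8),
    ("6.0", 222),
    ("6.1", 1),
]

# Cumulative total after each version.
running = list(accumulate(count for _, count in ADDITIONS))
for (version, _), total in zip(ADDITIONS, running):
    print(f"{version}: {total:,}")
```

The last line printed is `6.1: 74,617`, which agrees with the total number of CJK Unified Ideographs given in the introduction.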

Notes

↑ Unicode 6.1 Character Database – property list file
↑ The Unicode standard 4.0, Appendix A - Han Unification History
↑ Suzanne Topping, "The secret life of Unicode"
↑ "Chapter 11 - East Asian scripts", The Unicode standard, 4.0.
↑ PRI 108: Combined registration of the Adobe Japan1 collection and of sequences in that collection
↑ IRG N 1306: Request to Withdraw 6545 T-Source from CJK D candidate
↑ ISO/IEC JTC1/SC2/WG2 N4439
↑ http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg38/IRGN1872_US_ExtF.pdf
↑ http://www.unicode.org/charts/PDF/Unicode-8.0/
↑ http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg38/IRGN1870Resolutions.doc
↑ Andrew West and John Jenkins, proposal of disunification of U+4039
↑ unifiable glyph variants
↑ Cook, Richard (6 October 2003). "Defect Report on Duplicate Encoded CJK Forms" (PDF). ISO/IEC JTC1/SC2/WG2. Retrieved 2012-03-28.

External links

Unicode Consortium U+4E00... (PDF)
Information on a number of the 98,884 characters in Unicode 5.0 from the decodeUnicode Wiki project at the University of Applied Sciences in Mainz, Germany

CJK ideographs in Unicode^[a]

Block name	Code points	Used	Chart range	Chart part	Plane	Han unification	Scripts contained in block
CJK Unified Ideographs	20992	20941	4E00–62FF	1/4	0 BMP	Unified	Han
″	″	″	6300–77FF	2/4	0 BMP	Unified	Han
″	″	″	7800–8CFF	3/4	0 BMP	Unified	Han
″	″	″	8D00–9FFF	4/4	0 BMP	Unified	Han
CJK Unified Ideographs Extension A	6592	6582	3400–4DBF		0 BMP	Unified	Han
CJK Unified Ideographs Extension B	42720	42711	20000–215FF	1/7	2 SIP	Unified	Han
″	″	″	21600–230FF	2/7	2 SIP	Unified	Han
″	″	″	23100–245FF	3/7	2 SIP	Unified	Han
″	″	″	24600–260FF	4/7	2 SIP	Unified	Han
″	″	″	26100–275FF	5/7	2 SIP	Unified	Han
″	″	″	27600–290FF	6/7	2 SIP	Unified	Han
″	″	″	29100–2A6DF	7/7	2 SIP	Unified	Han
CJK Unified Ideographs Extension C	4160	4149	2A700–2B73F		2 SIP	Unified	Han
CJK Unified Ideographs Extension D	224	222	2B740–2B81F		2 SIP	Unified	Han
CJK Radicals Supplement	128	115	2E80–2EFF		0 BMP	Not unified	Han, Common
Kangxi Radicals	224	214	2F00–2FDF		0 BMP	Not unified	Han
Ideographic Description Characters	16	12	2FF0–2FFF		0 BMP	Not unified	Common
CJK Symbols and Punctuation	64	64	3000–303F		0 BMP	Not unified	Han, Common, Inherited
CJK Strokes	48	36	31C0–31EF		0 BMP	Not unified	Common
Enclosed CJK Letters and Months	256	254	3200–32FF		0 BMP	Not unified	Katakana, Hangul, Common
CJK Compatibility	256	256	3300–33FF		0 BMP	Not unified	Katakana, Common
CJK Compatibility Ideographs	512	472	F900–FAFF		0 BMP	12 are unified	Han
CJK Compatibility Forms	32	32	FE30–FE4F		0 BMP	Not unified	Common
CJK Compatibility Ideographs Supplement	544	542	2F800–2FA1F		2 SIP	Not unified	Han

Totals	76768	76602				74617

^ As of version 6.2

