GB 2312

From Wikipedia, the free encyclopedia

GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters. GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese.

GB2312 (1980) has been superseded by GBK and GB18030, which include additional characters, but GB2312 is nonetheless still in widespread use.

While GB2312 covers 99.75% of the characters used for Chinese input, historical texts and many names remain out of scope. GB2312 includes 6,763 Chinese characters (on two levels: the first is arranged by reading, the second by radical then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks.

There is an analogous character set known as GB/T 12345, closely related to GB2312, but with traditional character forms replacing simplified forms. GB-encoded fonts often come in pairs, one with the GB 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.

Characters

Characters in GB2312 are arranged in a 94×94 grid (as in ISO 2022), and the two-byte code point of each character is expressed in kuten (or quwei) form, which specifies a row (ku or qu) and the position of the character within that row (ten or wei).

The rows (numbered from 1 to 94) contain characters as follows:

01-09: punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, Bopomofo
16-55: the first level of Chinese characters, arranged according to Pinyin (3,755 characters)
56-87: the second level of Chinese characters, arranged according to radical and strokes (3,008 characters)
88-89: further Chinese characters (103 characters), defined only for GB/T 12345, not GB 2312

The rows 10-15 and 90-94 are unassigned.
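The row layout above can be expressed as a small lookup. This is only an illustrative sketch: the function name classify_row and the label strings are made up for this example and are not part of the standard.

```python
def classify_row(row: int) -> str:
    """Return a rough description of what a GB2312 row (1-94) holds."""
    if not 1 <= row <= 94:
        raise ValueError(f"row must be between 1 and 94, got {row}")
    if row <= 9:
        return "symbols and non-Chinese scripts"
    if 16 <= row <= 55:
        return "level 1 Chinese characters (by Pinyin)"
    if 56 <= row <= 87:
        return "level 2 Chinese characters (by radical and strokes)"
    if 88 <= row <= 89:
        return "GB/T 12345 only"
    # Rows 10-15 and 90-94 carry no assignments.
    return "unassigned"


print(classify_row(45))
print(classify_row(12))
```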

Encodings of GB2312

EUC-CN

EUC-CN is often used as the character encoding (that is, for external storage) in programs that deal with GB2312, because it maintains compatibility with ASCII. ASCII characters are stored as single bytes, and two bytes are used to represent every character not found in ASCII. The first byte ranges from 0xA1 to 0xF7 (161-247), and the second byte ranges from 0xA1 to 0xFE (161-254).
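As a rough illustration of these byte ranges, the sketch below splits an EUC-CN byte string into single-byte ASCII units and two-byte GB2312 units. The helper name split_euc_cn is hypothetical, and the check covers only the ranges quoted above; it does not verify that a given pair is actually assigned a character.

```python
from typing import List


def split_euc_cn(data: bytes) -> List[bytes]:
    """Split EUC-CN data into 1-byte ASCII units and 2-byte GB2312 units."""
    units = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            # Plain ASCII byte, stored unchanged.
            units.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= first <= 0xF7 and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Lead byte followed by a trail byte in the permitted range.
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid EUC-CN sequence at offset {i}")
    return units


print(split_euc_cn("a外b".encode("gb2312")))
```

The example relies on Python's built-in "gb2312" codec, which produces EUC-CN bytes.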

Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is more storage-efficient for Chinese text: each Chinese character takes two bytes rather than the three that UTF-8 needs, because no bits are reserved to indicate three- or four-byte sequences and none are reserved for detecting trailing bytes.
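The size difference is easy to check with Python's built-in codecs. The helper encoded_sizes is a name chosen for this sketch; it simply returns the byte length of a string in each encoding.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (EUC-CN length, UTF-8 length) in bytes for the given text."""
    return len(text.encode("gb2312")), len(text.encode("utf-8"))


# Two Chinese characters: two bytes each in EUC-CN, three each in UTF-8.
print(encoded_sizes("外国"))  # (4, 6)
# Pure ASCII text is the same size in both encodings.
print(encoded_sizes("abc"))   # (3, 3)
```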

To map a code point to bytes, add 160 (0xA0) to the row number (the thousands and hundreds digits of the four-digit code point) to form the high byte, and add 160 (0xA0) to the position within the row (the tens and ones digits) to form the low byte.

For example, for the GB2312 code point 4566 ("外", which means foreign), the high byte comes from the row, 45, and the low byte comes from the position, 66. For the high byte, add 45 to 160, giving 205 or 0xCD. For the low byte, add 66 to 160, giving 226 or 0xE2. The full encoding is therefore 0xCDE2.
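The same arithmetic can be sketched in a few lines of Python. The function names kuten_to_euc_cn and euc_cn_to_kuten are illustrative only; the result is cross-checked against Python's built-in "gb2312" codec.

```python
def kuten_to_euc_cn(code_point: int) -> bytes:
    """Convert a four-digit row/position code point (e.g. 4566) to EUC-CN bytes."""
    row, position = divmod(code_point, 100)
    if not (1 <= row <= 94 and 1 <= position <= 94):
        raise ValueError(f"not a valid 94x94 grid position: {code_point}")
    # Add 0xA0 (160) to each half to get the high and low bytes.
    return bytes([row + 0xA0, position + 0xA0])


def euc_cn_to_kuten(pair: bytes) -> int:
    """Inverse mapping: two EUC-CN bytes back to the four-digit code point."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    return (pair[0] - 0xA0) * 100 + (pair[1] - 0xA0)


encoded = kuten_to_euc_cn(4566)
print(encoded.hex())             # cde2
print(encoded.decode("gb2312"))  # 外
print(euc_cn_to_kuten(encoded))  # 4566
```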

HZ

HZ is another encoding of GB2312 that is used mostly for Usenet postings.
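HZ suits such channels because its output is entirely 7-bit: GB2312 text is bracketed by the escape sequences ~{ and ~}, and each two-byte character is written with the high bit of both bytes cleared. The sketch below uses Python's built-in "hz" codec to show this; the variable names are arbitrary.

```python
sample = "外"

hz_bytes = sample.encode("hz")       # 7-bit HZ form
euc_bytes = sample.encode("gb2312")  # 8-bit EUC-CN form

# Every HZ byte fits in 7 bits, so it survives ASCII-only transports.
print(all(b < 0x80 for b in hz_bytes))

# Clearing the high bit of the EUC-CN bytes gives the HZ payload bytes.
payload = bytes(b & 0x7F for b in euc_bytes)
print(payload in hz_bytes)

# Decoding restores the original text.
print(hz_bytes.decode("hz") == sample)
```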


This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.
