Han unification
From Wikipedia, the free encyclopedia
Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. Chinese characters are common to Chinese (where they are called hanzi), Japanese (kanji), and Korean (hanja). Modern Chinese, Japanese and Korean typefaces may represent a given Han character with somewhat different glyphs. In the formulation of Unicode, however, these different glyphs were treated as the same character. This process is referred to as "Han unification", and the resulting character repertoire is sometimes called Unihan.
Unihan can also refer to the Unihan Database web site maintained by the Unicode Consortium, which provides information about all of the unified Han characters encoded in the Unicode standard. This includes representative glyphs, mappings to various national standards, dictionary numbers, and definitions for compound words drawn from the free Japanese EDICT and Chinese CEDICT dictionary projects (the definitions are provided for convenience and are not a formal part of the Unicode standard).
Standard
Rules for Han Unification are given in the East Asian Scripts chapter of the various versions of the Unicode Standard (Chapter 11 in Unicode 4.0) [1]. The Ideographic Rapporteur Group (IRG) [2], made up of experts from the Chinese-speaking countries, North and South Korea, Japan, Vietnam, and other countries, is responsible for the process.
Details
The article "The secret life of Unicode" on IBM developerWorks explains the issue in a way that also illustrates some of the confusion:
- The problem stems from the fact that Unicode encodes characters rather than "glyphs," which are the visual representations of the characters. There are four basic traditions for East Asian character shapes: traditional Chinese, simplified Chinese, Japanese, and Korean. While the Han root character may be the same for CJK languages, the glyphs in common use for the same characters may not be, and new characters were invented in each country.
- For example, the traditional Chinese glyph for "grass" uses four strokes for the "grass" radical, whereas the simplified Chinese, Japanese, and Korean glyphs use three. But there is only one Unicode point for the grass character (草, U+8349) regardless of writing system. Another example is the ideograph for "one" (壹, 壱, or 一), which is different in Chinese, Japanese, and Korean. Many people think that the three versions should be encoded differently.
In fact, the three ideographs for "one" are encoded separately in Unicode. They are not national variants. The first and second are used on financial instruments to prevent forgery, while the third is the common form in all three countries.
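The distinction can be checked against the code points themselves. The following is a minimal Python sketch using only the built-in `ord`; the variable names are illustrative and do not come from the quoted article.

```python
# The three forms of "one" quoted above are separate Unicode characters,
# whereas "grass" has a single code point shared by every regional glyph style.
ones = ["壹", "壱", "一"]
one_code_points = [f"U+{ord(ch):04X}" for ch in ones]
grass_code_point = f"U+{ord('草'):04X}"

print(one_code_points)   # three distinct code points
print(grass_code_point)  # the single code point for "grass"
```

Because the three forms of "one" have distinct code points, no font choice can turn one into another, while the stroke count of the grass radical depends entirely on the font used to draw U+8349.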
A slight difference in rendering might be considered a serious problem if it changes the meaning or reflects the wrong cultural tradition. Beyond the simple nuisance of Japanese text looking like Chinese, names might be displayed with a different glyph: the same character in the sense of encoding, but a different character in the view of the users. This rendering problem is often cited to criticize Westerners as unaware of subtle distinctions, even though unification is being carried out by East Asian experts. The display error occurs only when plain text is rendered in a single font, not when language-specific text and names are rendered in language-appropriate fonts.
The problem of one character representing semantically different concepts is also present in the Latin part of Unicode. The Unicode character for an apostrophe is the same as the character for a right single quote: ’.
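This can be seen in the character's formal name. The sketch below uses Python's standard `unicodedata` module to look up the names of U+2019 and, for comparison, the ASCII character U+0027; the variable names are illustrative.

```python
import unicodedata

# U+2019 serves both as the closing single quote and as the typographic apostrophe.
curly = "\u2019"
straight = "\u0027"

curly_name = unicodedata.name(curly)
straight_name = unicodedata.name(straight)

print(f"U+{ord(curly):04X}", curly_name)
print(f"U+{ord(straight):04X}", straight_name)
```

U+2019 is named only for its quotation-mark role, so software cannot tell from the code point alone which of the two meanings a writer intended.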
The process of Han Unification was controversial, with most of the opposition coming from Japan. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are one of the most important features of these languages, and renders serious literature and academic research in these languages impossible. Proponents of Han unification point out that the unification process is in the hands of specialists from China, Korea, and Japan, and that the objections to unification of specific characters are made without regard to their histories. Characters which some Japanese today consider completely distinct were historically the same, and were taught as the same in Japanese schools until the 1950s. As for historical research, Unicode now encodes far more characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.
Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.
Controversy
Some of the controversy comes from the fact that the decision to perform Han unification was made by the initial Unicode Consortium, which at the time was a consortium of North American companies and organizations (most of them in California) [3] and included no East Asian government representatives. The initial design goal was to create a 16-bit standard, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications [4]. This 16-bit requirement later had to be abandoned. The controversy later extended to the internationally representative ISO: the initial CJK-JRG group favored a proposal (DIS 10646) for a non-unified character set, "which was thrown out in favor of unification with the Unicode Consortium's unified character set by the votes of American and European ISO members" (even though the Japanese position was unclear) [5]. Endorsing the Unicode Han unification was a necessary step in the heated ISO 10646/Unicode merger.
Much of the controversy surrounding Han unification rests on the distinction between characters and glyphs, as defined in Unicode, and the related but distinct idea of graphemes. Unicode defines abstract characters, as opposed to glyphs, which are particular visual representations of a character in a font, and graphemes, which are the basic units of writing in a particular language. One character may be represented by many distinct glyphs: a "g" or an "a", for example, may each be drawn with one loop or two. In Dutch, "ij" is a single letter (ĳ), and thus arguably a grapheme (a digraph); for example, the first letter in "IJsselmeer" is capitalized as a unit. The same holds for "ch" in some Spanish-speaking countries and "lj" in Croatian. Graphemes present in national character code standards have been added to Unicode, as required by Unicode's Source Separation rule, even where they can be composed of characters already available.
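The precomposed digraph letters can be inspected directly. The following sketch, using Python's standard `unicodedata` module, shows that U+0132, U+0133 and U+01C9 exist as single characters and that compatibility normalization (NFKC) maps each of them to a pair of ordinary letters. The dictionary keys are illustrative labels only.

```python
import unicodedata

# Precomposed digraph letters inherited from older national standards.
digraphs = {
    "Dutch ij": "\u0133",
    "Dutch IJ": "\u0132",
    "Croatian lj": "\u01C9",
}

# NFKC replaces each compatibility character with its ordinary-letter equivalent.
folded = {name: unicodedata.normalize("NFKC", ch) for name, ch in digraphs.items()}

for name, ch in digraphs.items():
    print(name, f"U+{ord(ch):04X}", "->", folded[name], f"({len(folded[name])} characters)")
```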
Unicode publishes charts with pictures for each character, but these are illustrations only and do not mandate the character's shape. References like [6] below seem to assume that what the Unicode standard pictures is how each character must be displayed, and protest when it doesn't match the local appearance of the character. The way things are supposed to work is that a Japanese user will have a font with Japanese-style characters, a Chinese user will have a font with Chinese-style characters, etc., and everyone will see the "right" characters for them. Problems are introduced when several languages must be represented in the same text document, and users expect different fonts for the different languages. This falls outside the scope of the Unicode standard, and is intended to be handled with higher-level markup defining the language used for each string of characters; the fact that software support for this has tended to be cumbersome and often inadequate has contributed toward the misunderstanding of the effects of unification.
Most of the opposition to Han unification appears to come from Japan, because of increased sensitivity to the distinctions between Chinese and Japanese styles of characters. There has been very little opposition from Chinese speakers, in part because Unicode did not unify simplified characters with their traditional forms. Although the Taiwan Big5 character set does not include Simplified characters, the PRC has character set standards with and without them. Unicode is seen as neutral with regard to the politically charged issue of Simplified versus Traditional characters, encoding Simplified and Traditional Chinese glyphs separately (e.g. the ideograph for "discard" is 丟 U+4E1F for Traditional Chinese Big5 #A5E1 and 丢 U+4E22 for Simplified Chinese GB #2210). Traditional and Simplified characters must be encoded separately according to Unicode Han Unification rules because they are distinguished in pre-existing PRC character sets, not merely because they have different shapes. The mapping between Traditional and Simplified characters is also not one-to-one, which likewise prevents unification.
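The "discard" example can be verified with a short Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative. The check on normalization shows that Unicode itself never folds one form into the other.

```python
import unicodedata

traditional = "\u4E1F"  # traditional form of "discard"
simplified = "\u4E22"   # simplified form of "discard"

for ch in (traditional, simplified):
    print(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))

# Neither form is a compatibility variant of the other, so even the most
# aggressive normalization form leaves both characters unchanged.
unchanged = all(
    unicodedata.normalize("NFKC", ch) == ch for ch in (traditional, simplified)
)
print(unchanged)
```

Converting between the two forms therefore requires a separate mapping table maintained outside the normalization machinery.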
Specialist character sets developed to address, or regarded by some as not suffering from, these perceived deficiencies include:
- ISO/IEC 2022 (based on escape sequences that switch between Chinese, Japanese and Korean character sets, hence without unification)
- CNS character set
- CCCII character set
- TRON
- UTF-2000
However, none of these alternative standards has been as widely adopted as Unicode, which is now the base character set for many new standards and protocols. It is built into the architecture of operating systems (Microsoft Windows, Apple Mac OS X, and many versions of Unix), programming languages (Perl, Python, C#, Java, Common LISP, APL), libraries (IBM International Components for Unicode (ICU), along with the Pango, Graphite, Scribe, Uniscribe, and ATSUI rendering engines), font formats (TrueType and OpenType), and so on.
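To illustrate the switching approach of ISO/IEC 2022 mentioned in the list above, the following sketch uses Python's standard `iso2022_jp` codec, which is one member of that family and is chosen here only as an example. The sample string and variable names are assumptions made for illustration.

```python
# Encode a Latin letter followed by a Han character (U+6F22) with an
# ISO/IEC 2022-based encoding and look at the resulting bytes.
text = "A\u6F22"
encoded = text.encode("iso2022_jp")
print(encoded)

# Each ESC byte (0x1B) starts a sequence that designates a character set:
# one switches to the Japanese set before the Han character, and one
# switches back to ASCII at the end.
ESC = b"\x1b"
switch_count = encoded.count(ESC)
print(switch_count)

# Decoding restores the original text.
round_trip = encoded.decode("iso2022_jp")
```

In such a scheme the meaning of a byte pair depends on the most recent switch, so the national origin of each character stays explicit in the data stream, at the cost of stateful decoding.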
Check your browser
The following table contains the same set of graphemes in all five rows, but each row is marked (via the HTML lang attribute) as being in a different language: Chinese (three varieties: unmarked "Chinese", simplified characters, and traditional characters), Japanese, or Korean. Ideally, your browser should select for each grapheme a glyph (from a font) suited to the language of its row. See how well it works for you.
Chinese (generic) | 与 | 今 | 令 | 免 | 入 | 全 | 具 | 刃 | 化 | 區 | 外 | 情 | 才 | 次 | 海 | 漢 | 画 | 直 | 真 | 空 | 紀 | 草 | 角 | 請 | 道 | 餓 | 骨 |
Chinese (Simplified) | 与 | 今 | 令 | 免 | 入 | 全 | 具 | 刃 | 化 | 區 | 外 | 情 | 才 | 次 | 海 | 漢 | 画 | 直 | 真 | 空 | 紀 | 草 | 角 | 請 | 道 | 餓 | 骨 |
Chinese (Traditional) | 与 | 今 | 令 | 免 | 入 | 全 | 具 | 刃 | 化 | 區 | 外 | 情 | 才 | 次 | 海 | 漢 | 画 | 直 | 真 | 空 | 紀 | 草 | 角 | 請 | 道 | 餓 | 骨 |
Japanese | 与 | 今 | 令 | 免 | 入 | 全 | 具 | 刃 | 化 | 區 | 外 | 情 | 才 | 次 | 海 | 漢 | 画 | 直 | 真 | 空 | 紀 | 草 | 角 | 請 | 道 | 餓 | 骨 |
Korean | 与 | 今 | 令 | 免 | 入 | 全 | 具 | 刃 | 化 | 區 | 外 | 情 | 才 | 次 | 海 | 漢 | 画 | 直 | 真 | 空 | 紀 | 草 | 角 | 請 | 道 | 餓 | 骨 |
Unicode code point | U+4E0E | U+4ECA | U+4EE4 | U+514D | U+5165 | U+5168 | U+5177 | U+5203 | U+5316 | U+5340 | U+5916 | U+60C5 | U+624D | U+6B21 | U+6D77 | U+6F22 | U+753B | U+76F4 | U+771F | U+7A7A | U+7D00 | U+8349 | U+89D2 | U+8ACB | U+9053 | U+9913 | U+9AA8 |
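Markup of the kind used for the table above can be generated mechanically. The following Python sketch builds one HTML table row per language, placing a `lang` attribute on every cell. The language tags chosen for each row (`zh`, `zh-Hans`, `zh-Hant`, `ja`, `ko`) are an assumption about how such a table would be tagged, and the function and variable names are hypothetical.

```python
from html import escape

# Assumed mapping from row label to language tag.
ROW_TAGS = {
    "Chinese (generic)": "zh",
    "Chinese (Simplified)": "zh-Hans",
    "Chinese (Traditional)": "zh-Hant",
    "Japanese": "ja",
    "Korean": "ko",
}

def lang_row(label, tag, chars):
    """Return one HTML table row whose cells carry the given lang tag."""
    cells = "".join(f'<td lang="{escape(tag)}">{escape(ch)}</td>' for ch in chars)
    return f"<tr><th>{escape(label)}</th>{cells}</tr>"

sample = "\u8349\u9AA8"  # U+8349 and U+9AA8, two characters from the table
rows = [lang_row(label, tag, sample) for label, tag in ROW_TAGS.items()]
print("\n".join(rows))
```

Every row contains exactly the same code points; only the `lang` value differs, and it is this hint that lets a browser pick a regionally appropriate font.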
The following table contains identical graphemes with multiple glyphs encoded in Unicode:
Chinese (generic) | 高 | 髙 | 紅 | 红 | 丟 | 丢 | 乗 | 乘 | 侣 | 侶 | 兌 | 兑 | 內 | 内 | 產 | 産 | 稅 | 税 | 亀 | 龜 | 龟 | 別 | 别 | 両 | 两 | 兩 |
Chinese (Simplified) | 高 | 髙 | 紅 | 红 | 丟 | 丢 | 乗 | 乘 | 侣 | 侶 | 兌 | 兑 | 內 | 内 | 產 | 産 | 稅 | 税 | 亀 | 龜 | 龟 | 別 | 别 | 両 | 两 | 兩 |
Chinese (Traditional) | 高 | 髙 | 紅 | 红 | 丟 | 丢 | 乗 | 乘 | 侣 | 侶 | 兌 | 兑 | 內 | 内 | 產 | 産 | 稅 | 税 | 亀 | 龜 | 龟 | 別 | 别 | 両 | 两 | 兩 |
Japanese | 高 | 髙 | 紅 | 红 | 丟 | 丢 | 乗 | 乘 | 侣 | 侶 | 兌 | 兑 | 內 | 内 | 產 | 産 | 稅 | 税 | 亀 | 龜 | 龟 | 別 | 别 | 両 | 两 | 兩 |
Korean | 高 | 髙 | 紅 | 红 | 丟 | 丢 | 乗 | 乘 | 侣 | 侶 | 兌 | 兑 | 內 | 内 | 產 | 産 | 稅 | 税 | 亀 | 龜 | 龟 | 別 | 别 | 両 | 两 | 兩 |
Unicode code point | U+9AD8 | U+9AD9 | U+7D05 | U+7EA2 | U+4E1F | U+4E22 | U+4E57 | U+4E58 | U+4FA3 | U+4FB6 | U+514C | U+5151 | U+5167 | U+5185 | U+7522 | U+7523 | U+7A05 | U+7A0E | U+4E80 | U+9F9C | U+9F9F | U+5225 | U+522B | U+4E21 | U+4E24 | U+5169 |
Unicode ranges
The "CJK Unified Ideographs" range has 20,992 code points. Together with the Extension A and Extension B ranges listed below, the total number of code points reserved for CJK ideographs (as of Unicode 4.1) is 70,304.
- CJK Unified Ideographs (4E00–9FFF) (chart)
- CJK Unified Ideographs Extension A (3400–4DBF) (chart)
- CJK Unified Ideographs Extension B (20000–2A6DF)
- Kangxi Radicals (2F00–2FDF)
- CJK Radicals Supplement (2E80–2EFF)
- CJK Symbols and Punctuation (3000–303F) (chart)
- CJK Strokes (31C0–31EF)
- Enclosed CJK Letters and Months (3200–32FF) (chart)
- CJK Compatibility (3300–33FF) (chart)
- CJK Compatibility Ideographs (F900–FAFF) (chart)
- CJK Compatibility Ideographs (2F800–2FA1F)
- CJK Compatibility Forms (FE30–FE4F) (chart)
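The block sizes follow directly from the ranges listed above. The following Python sketch computes the extent of the three unified ideograph blocks from their inclusive endpoints. Note that it counts reserved positions in each block, not the number of characters actually assigned; the dictionary and function names are illustrative.

```python
# Inclusive code point ranges, copied from the list above.
IDEOGRAPH_BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "CJK Unified Ideographs Extension B": (0x20000, 0x2A6DF),
}

def block_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1

sizes = {name: block_size(*bounds) for name, bounds in IDEOGRAPH_BLOCKS.items()}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name}: {size:,}")
print(f"Total: {total:,}")  # Total: 70,304
```

The total exceeds 65,536, which is why the ideographs could not all fit within the 16-bit design originally envisaged.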
External links
- Unihan Database
- Unicode standard
- Han Unification in Unicode by Otfried Cheong
- Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations
- Why Unicode Will Work On The Internet
- Per-character summary of differences in characters
- The secret life of Unicode
- GB18030 Support Package for Windows 2000/XP, including Chinese, Tibetan, Yi, Mongolian and Thai fonts by Microsoft
- Proposal to encode additional grass radicals in the UCS – A humorous proposal to encode all possible variants of the grass radical, made as an April Fool's Day joke
- Unicode Technical Note 26: On the Encoding of Latin, Greek, Cyrillic, and Han
- "Unicode Revisited" – the strong point of view of some people working on the competing TRON proposal
- "Unicode in Japan, guide to a technical and psychological struggle" – A more balanced take on the arguments for and against Unicode for Japanese.