Talk:Han unification


This article is part of WikiProject Hong Kong, a project to coordinate efforts in improving all Hong Kong-related articles. If you would like to help improve this and other Hong Kong-related articles, you are invited to join this project!
This article has not yet received a rating on the Project's quality scale.

This article is part of WikiProject China, a project to improve all China-related articles. If you would like to help improve this and other China-related articles, please join the project. All interested editors are welcome.
This article has not yet received a rating on the quality scale.


Quick Fix, unsure of edit documentation

There is a comment in the page referring to the disambiguation of the apostrophe and the single-quote character in the Latin charset. That was resolved in Unicode 2.1: http://www.cs.tut.fi/~jkorpela/latin1/3.html#27 Suggestion: either remove the paragraph or identify another instance (and then delete this section of the discussion).
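
For reference, a minimal Python listing of the code points presumably at issue (an assumption - the original paragraph is not quoted here):

    import unicodedata

    # Code points commonly conflated before Unicode 2.1 clarified their roles
    # (assumed here for illustration; see the link above).
    for ch in ("\u0027", "\u2019", "\u02bc"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0027  APOSTROPHE
    # U+2019  RIGHT SINGLE QUOTATION MARK
    # U+02BC  MODIFIER LETTER APOSTROPHE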


Check your browser

The "check your browser" section is a good idea, and it works fine on my browser, Mozilla 1.6, except that the two lines remain size 16, and the two lines of larger text overlap. Is there a way to increase the space between the two lines? I don't understand the markup method that is used to make that table. Gus 17:47, 2004 Apr 5 (UTC)

It's a bug in your browser. It is not necessary to specify the row height, because good browsers set it automatically so that all text is visible and there is no overlap. The table markup may be found here: [1] and here: [2]. — Monedula 21:59, 5 Apr 2004 (UTC)

It would be very handy to also include an image of the correctly rendered different versions so we have something to compare our browsers/fonts/OSes/rendering systems against.

What I really don't understand is what the output is supposed to look like. The first column has ma (horse) as a simplified character. The author wasn't expecting it to magically transmute into the "traditional" glyph in a Taiwanese font, was he? Why bother with what the simplified glyph looks like in a traditional font, or vice versa? It's like testing how your browser handles Cyrillic with a language tag of "English" — it makes no sense at all. — ajo, 21 Dec 2004
It's not ma (马), it's yu (与). No horse at all. — Monedula 11:40, 21 Dec 2004 (UTC)
Whoops. Isn't that still a simplified character? — ajo, 22 Dec 2004
Yes. The "Taiwanese font" thing is a red herring, though; the comparison is intended to be between Simplified Chinese (horizontal line at the bottom left, much like the simplified "ma") and Japanese (horizontal line longer, crossing the hooked line at the bottom right). -- pne 15:10, 26 Jan 2005 (UTC)

Also does anybody know if there exists a list of all such disputed characters? I think it would be a good idea to include them here too. Unless of course there are thousands, in which case a link would do. — Hippietrail 02:59, 2 Jun 2004 (UTC)

I don't know of such a list; I just added ones where I knew or thought that Chinese and Japanese differed in preferred style. I imagine some users will have created partial lists, but I'm not sure whether an "official" list exists. (There are undoubtedly lists of where Chinese, Japanese, and/or Korean source characters were merged, but I imagine they give no information about whether the forms were identical, similar, or "different but not different enough to consider them separate characters"—which is what we're after here, I believe.) -- pne 10:00, 2 Jun 2004 (UTC)
The link [3] given in the article has a partial list. There do seem to be quite a few so they probably shouldn't be exhaustively listed here. DopefishJustin (・∀・) 02:35, 5 Jun 2004 (UTC)

I'm not really sure why, but the 10th character in the Japanese kanji row appears as a rarely used variant for me. The one in the generic Chinese row is more commonly used. Revth 03:23, 28 Jun 2004 (UTC)

Another observation: My browser (Firefox 1.5.0.1) renders the lines with identical fonts. However, when I move the table to an HTML document (not XHTML), differences appear. Funny... 84.189.186.5 01:22, 15 March 2006 (UTC)

POV phrase

"This might make some sense because the Unicode is a vast improvement over the chaotic system of Chinese encoding, which contains two main systems (GB 2312 and Big5) and numerous variants and extensions."

I don't like this sentence much. Not only is "vast improvement" POV-tastic, but the Japanese encoding situation is hardly less chaotic. You've got Shift-JIS, EUC-JP, and ISO-2022-JP, not to mention various other JIS variants. I'll probably remove this altogether unless someone has a better idea. DopefishJustin (・∀・) 03:38, 5 Jun 2004 (UTC)
Please note that while Shift-JIS, EUC-JP and ISO-2022-JP are different encodings, they encode identical character *sets* (JIS X 0201+0208), with some minor exceptions. So it can hardly be said to be "chaotic". The Chinese situation is different in that GB2312 and Big5 developed entirely independently, and are not only different encodings, but also different *sets*. 221.254.245.147 08:48, 14 April 2006 (UTC)
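
A quick Python sketch of that point - the same JIS X 0208 character survives a round trip through all three encodings, because they are different byte encodings of one character set:

    # One JIS X 0208 character, three byte encodings of the same set.
    ch = "漢"  # U+6F22
    for codec in ("shift_jis", "euc_jp", "iso2022_jp"):
        data = ch.encode(codec)
        assert data.decode(codec) == ch  # round-trips cleanly in each encoding
        print(codec, data)
    # shift_jis   b'\x8a\xbf'
    # euc_jp      b'\xb4\xc1'
    # iso2022_jp  b'\x1b$B4A\x1b(B'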

Graphemes vs. glyphs

Thanks DopefishJustin for that paragraph - that is my understanding, too (though I'm not sure support for this view is widespread enough).

Similar issues crop up with things such as LATIN LETTER S WITH CEDILLA, which (by my understanding) should display as s-with-cedilla in Turkish texts and s-with-comma-below in Romanian texts since those glyphs were unified as one grapheme. However, a subsequent revision of Unicode included an explicit s-with-comma-below, so now there are two different characters vying for the role of the Romanian letter (similarly with t-with-diacritic), and there's controversy regarding which one of them is "correct". A sticky situation. -- pne 13:13, 6 Jun 2004 (UTC)

It's just as you say with s-cedilla/s-comma and t-cedilla/t-comma. Due to the fact that encoding converters already mapped these characters used in Romanian to the cedilla variants, the specifically comma variants are rarely used. I'd bet that search engines and search & replace and spell-checking functions in most applications are also confused by the issue or ignore the new characters. Google didn't do well last time I checked. — Hippietrail 08:20, 8 Jun 2004 (UTC)
It's not really true that s with cedilla should be rendered as s with comma below in Romanian texts. I've got a Romanian book to teach children how to write, and I would think there of all places, they would go out of their way to get it right if there was a difference. But in fact, the book switches back and forth between s with comma and s with cedilla. The distinction, whether real or not, was pushed by the Romanian Standardization body, which is the only reason the comma-below variants were included. --Prosfilaes 05:34, 17 Nov 2004 (UTC)
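
Whatever the typographically correct answer, the software-side confusion mentioned above is easy to reproduce: the cedilla and comma-below letters are distinct code points, and no Unicode normalization form folds one into the other, so naive searching treats them as unrelated. A minimal Python sketch:

    import unicodedata

    s_cedilla = "\u015f"  # ş LATIN SMALL LETTER S WITH CEDILLA
    s_comma = "\u0219"    # ș LATIN SMALL LETTER S WITH COMMA BELOW (added in Unicode 3.0)

    print(s_cedilla == s_comma)  # False: two code points for "the same" Romanian letter
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # The decompositions differ too (COMBINING CEDILLA vs. COMBINING COMMA
        # BELOW), so normalization does not unify them.
        assert unicodedata.normalize(form, s_cedilla) != unicodedata.normalize(form, s_comma)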

Request for Explanation of Variants

I've just learned of the existence of "Z-variants" in Unicode. It's very difficult to find good explanations of what exactly they are, and whether the term is another name for a "glyph variant", a type of glyph variant, or an issue entirely separate from glyph variants. Is it always a different issue from simplified vs. traditional, or is the line blurred?

From what I can find, Unicode wanted to be able to convert to and from the major existing CJK character sets unambiguously. Some of these sets have two codepoints for what Unicode considers to be the same character. In these cases, Unicode also provides two codepoints. One is considered the usual codepoint to use, the other is considered a Z-variant.

I bet it's more complicated than that. I have requested an article Z-variant to make it clear. — Hippietrail 01:09, 27 Jun 2004 (UTC)

I haven't heard that term before. Unicode has what are called CJK Compatibility characters, which are in separate blocks and are only supposed to be used for round-trip encoding conversion purposes. Unicode 4.0 also introduces variant selector characters, which go after the normal character to provide a standard way of indicating that a particular numbered variant is needed, although few variants have been defined so far and I don't think anything supports it yet. DopefishJustin (・∀・) 17:46, Jun 27, 2004 (UTC)
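
Both mechanisms mentioned here can be poked at from Python: a CJK Compatibility Ideograph carries a singleton canonical decomposition back to the unified character (which is why it survives only round-trip conversion, not normalization), and a variation selector simply follows its base character in plain text. The specific variation sequence below is illustrative; whether a variant is actually registered for that base character is an assumption:

    import unicodedata

    # A compatibility ideograph folds back to its unified counterpart
    # under any normalization form.
    compat = "\uf907"  # CJK COMPATIBILITY IDEOGRAPH-F907, a "turtle" variant
    unified = unicodedata.normalize("NFC", compat)
    print(f"U+{ord(compat):04X} -> U+{ord(unified):04X}")  # U+F907 -> U+9F9C (龜)

    # A variation selector goes after the normal character; U+E0100
    # (VARIATION SELECTOR-17) begins the range used for ideographic variants.
    ivs = "\u845b\U000E0100"  # hypothetical variation sequence for 葛
    print(len(ivs))  # 2 code points, rendered as one glyph by supporting fonts
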
Here's one extract from a document about Internationalized Domain Names: [4]

Unicode encodes simplified and traditional glyphs separately

This is in part because Unicode encodes simplified and traditional glyphs separately (e.g. the ideograph for "dragon" is 龍 U+9F8D for Traditional Chinese and 龙 U+9F99 for Simplified Chinese), because, in the general case, the glyph forms are distinct.

I would very much like to see a clearer statement on why Chinese was encoded into two entirely separate glyph sets, whereas Japanese got Unified. The argument used for justifying Unification is the Japanese "misunderstanding" of Unified glyphs, which supposedly do not require separate codepoints - even though that's exactly what was done in splitting Traditional and Simplified Chinese. It's pretty obvious that there must have been a not inconsiderable amount of politicking going on in the Unification process to end up with what we have now, in which case that should be noted rather than just dismissing the Japanese opposition to Unification as a "misunderstanding" on their part. 221.254.245.147 09:01, 14 April 2006 (UTC)

Does anyone have a number for the cases where a separate simplified form is encoded? --Pjacobi 10:01, 17 Nov 2004 (UTC)

I don't have a number, but this Wikipedia page has a Unicode traditional-to-simplified Chinese mapping (in Chinese): http://zh.wikipedia.org/wiki/Wikipedia:%E4%B8%AD%E6%96%87%E7%B9%81%E7%AE%80%E4%BD%93%E5%AF%B9%E7%85%A7%E8%A1%A8

While Unicode unifies Japanese and Chinese characters, Simplified and Traditional Chinese are mostly kept separate, with thousands(?) of duplicates.

If 紅 U+7D05 and 红 U+7EA2 count as two characters (graphemes?), then the Japanese version of 草 U+8349 should be encoded differently.

纟 is just the "cursive" version of 糸, just a matter of calligraphy style. But in Unicode, all characters with 糸 on the side also have a 纟 version. With that as precedent, characters with 䒑, 艹, 艹 or 艹 should be encoded differently too.

A lot of what was and wasn't encoded in the BMP is due to what was encoded by the source standards and was really out of Unicode's hands. It was probably encoded that way for compatibility only. Don't try to use it as precedent. --Prosfilaes 06:58, 6 Jan 2005 (UTC)
So you agree there is a problem, right? Let's do the finger-pointing later. --Thu Jan 6 17:48:49 UTC 2005
There's a problem, because Unicode has to work in real life. If Unicode could have thrown out the old standards, it wouldn't have this problem. The point was, you can't use those as examples for what should be done, because they weren't supposed to be there in the first place. --Prosfilaes 03:54, 7 Jan 2005 (UTC)
The "lang" HTML attribute doesn't work in general like in plain text file right? The page claim "Note that most of the opposition to Han unification appears to be Japanese, because of increased sensitivity to the distinctions between Chinese and Japanese styles of letters. There has been very little opposition from Chinese speakers." The point is, Japanese got unified into Chinese, but unicode did NOT apply the grapheme philsophy to chinese at all. I can see why most Japanese object. Historical reason is not important here. For Chinese, Unicode is consider politically neutral because it's neither big5 nor gb not because of unihan. For that matter, OSs and programming languages support Unicode have notthing to do with unihan either. In any case, Chinese web sites use big5/gb not unicode. --Sat Jan 8 10:52:05 EST 2005
The point of plain text files is not beauty; it's pure information with no font information. Virtually every type of marked-up text includes some sort of language tagging. Unicode did apply the grapheme philosophy to Chinese; look at <http://www.unicode.org/versions/Unicode4.0.0/ch11.pdf>. Just because you don't agree with all the points doesn't mean they didn't apply the policy. Historical reasons of course matter; you have to take everything in context. You can't declare Unicode broken because a character set that would have been useless in real life might have done things in a more ideologically pure way. --Prosfilaes 20:51, 8 Jan 2005 (UTC)

Here are just a few more samples (for the "Check your browser" section?):

Each row showed the same characters under a different language tag: Chinese (generic), Chinese (Simplified), Chinese (Traditional), Japanese, Korean. The code-point groups, with the characters restored:

  • U+9AD8 高, U+9AD9 髙
  • U+7D05 紅, U+7EA2 红
  • U+4E1F 丟, U+4E22 丢
  • U+4E57 乗, U+4E58 乘
  • U+4FA3 侣, U+4FB6 侶
  • U+514C 兌, U+5151 兑
  • U+5167 內, U+5185 内
  • U+7522 產, U+7523 産
  • U+7A05 稅, U+7A0E 税
  • U+2FD4 ⿔, U+4E80 亀, U+9F9C 龜, U+9F9F 龟, U+F907 龜, U+F908 龜
  • U+5225 別, U+522B 别
  • U+4E21 両, U+4E24 两, U+5169 兩, U+F978 兩
To be honest, I see zero difference between the rows other than what can be accounted for by typographic, stylistic variations. --Wing 14:20, 12 Feb 2005 (UTC)
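
For anyone who wants to reproduce the test outside Wikipedia: the rows contain identical code points and differ only in their language tags, so a minimal standalone test page can be generated like this (a sketch; the file name and sample characters are arbitrary choices):

    # Write a minimal HTML page showing the same code points under different
    # language tags; a browser honoring lang-based font selection will
    # render the rows differently.
    samples = "与 直 骨 海 空 画"  # arbitrary unified ideographs
    langs = ["zh-Hans", "zh-Hant", "ja", "ko"]
    rows = "\n".join(f'<p lang="{lang}">{lang}: {samples}</p>' for lang in langs)
    with open("unihan_test.html", "w", encoding="utf-8") as f:
        f.write(f"<!DOCTYPE html>\n<meta charset='utf-8'>\n{rows}\n")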

Grapheme philosophy

I edited the page to remove the emphasis on the "controversial" philosophy of graphemes. As far as I know, every character standard defines the characters and not the precise shapes; that is, there can be more than one font. Simplified characters aren't a violation of the grapheme philosophy; page 11 of the Unicode standard says "These are both part of the IRG G-source, with traditional forms and simplified forms separated where they differ." Unicode recognizes the concept of limits on how much a character may vary and tries to use that to separate characters into graphemes. --Prosfilaes 03:54, 11 Dec 2004 (UTC)


The problem is Unicode did NOT apply the grapheme philosophy to Chinese at all. For example, 內 U+5167 is Big5 #A4BA and 内 U+5185 is GB #3658; 丟 U+4E1F is Big5 #A5E1 and 丢 U+4E22 is GB #2210.

In fact, if multiple font sets are part of the Unicode design, there should be a well-defined method within Unicode to specify a country/language code without using the HTML/XML "lang" attribute. -- Somebody

I'm not sure what you're trying to say here. Is your claim that those characters are not distinct in any character sets Unicode is round-trip compatible with, and that there are many more such characters? As such, the Big5 and GB code points you list seem irrelevant. BTW, there is a way of specifying language without using external tags, though its use is discouraged in preference to proper tagging (Unicode is supposed to be a plain text character set, not a rich text format). Considering that this is a FAQ and you seem ignorant of it, your assertion that the makers of the Unicode standard systematically ignored their own rules in classifying Chinese characters is rather suspect. 130.233.22.111 18:13, 17 May 2006 (UTC)
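
For what it's worth, the source separation behind the code points quoted earlier (內 Big5 #A4BA, 内 GB #3658, ...) can be checked against Python's bundled codecs; encoding a character into the other standard's set fails, which is the round-trip constraint in action:

    # Each character encodes in its own source standard...
    print("內".encode("big5"))    # b'\xa4\xba' (Big5 #A4BA, as quoted above)
    print("内".encode("gb2312"))  # b'\xc4\xda'
    print("丟".encode("big5"))    # b'\xa5\xe1'
    print("丢".encode("gb2312"))  # b'\xb6\xaa'

    # ...but not in the other one, so round-trip compatibility forces
    # separate Unicode code points:
    try:
        "內".encode("gb2312")
    except UnicodeEncodeError as e:
        print("not in GB 2312:", e.reason)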

  • The grapheme policy is not controversial, but saying things like "the Japanese kanji for horse is just the Chinese hanzi for horse, but written in a different font" is controversial. There is a whole continuum of possible changes to hanzi, so wherever you draw a limit, it will be arbitrary, and hence can never be fully satisfactory. So the opposite question is: why do we *need* Han unification? The initial motivation is clear: the people involved in Unicode (see the first version, 1988) wanted a 2-byte coding, hence Han unification was critical (and maybe necessary), and it probably became the pet project of the initial Unicode designers. But that was lost, and worse, with the famous UTF-8 encoding most Japanese kanji need 3 bytes as far as I understand (see the sketch below), so that reason is dead and buried.
  • What is controversial, in my opinion, is that Han unification using the grapheme policy was decided entirely by the initial Unicode team, which consisted mostly of American companies and organizations, well before Asian government representatives participated. But at the end of the day, only CJKV governments could legitimately make decisions, so the better way would have been to grant different code ranges to each CJKV government, and later, as an optional bonus, perform Han unification identification for applications that need it.
  • What is very controversial is the *consequences* of the grapheme policy. If the same character is visually different in Chinese and Japanese, so as to be rejected by some users of one of those languages, then you cannot easily mix them. Hence you can't use Chinese and Japanese file names with UTF-8 and then properly display the list of files. That's *BROKEN*. That's the exact reason why Unicode allocates different codes for different characters: you are no more supposed to mix Chinese and Japanese than to mix Thai and Greek, so why not just reuse the Greek codes for Thai and leave external markup to decide whether you are using Thai or Greek... i.e., why not code pages, then? Unicode tries to evade the issue by saying text isn't supposed to include language information, but the fact is, it should include enough information to represent it properly without ambiguity, and if that information is language, then it must be there; otherwise there are very annoying consequences. I don't understand how people can be happy with forcing those nasty consequences down the world's throat just because Han unification happened to be the pet project of some American software engineers based on bogus 2-byte design constraints, while there is a simple alternative: no unification by default.

--213.41.133.220 03:52, 4 August 2006 (UTC)
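
On the 2-byte point in the first bullet: the original goal was a fixed 16-bit encoding, and under UTF-8 every code point above U+07FF - which includes the whole CJK Unified Ideographs block starting at U+4E00 - does take three bytes. A one-line check:

    # CJK Unified Ideographs (U+4E00 and up) all take 3 bytes in UTF-8.
    for ch in "馬与草":
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} bytes")  # 3 each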

Thai and Greek is a bit of a strawman, since there's no connection. Unicode engineers have pointed out that Chinese quotes in Japanese texts are often in the same fonts as the surrounding texts, and Han unification follows the same principles as not including Fraktur as different from Latin.--Prosfilaes 05:46, 4 August 2006 (UTC)
  • Yes, but Chιnese, Japanese or even sometιmes Greek quotes ιn Eυropean texts are usιng the same font as surroundιng text for theιr roman transcrιptιons (be ιt Pinyin, Romaji, ...); that does not mean that the resultιng transcrιptιon ιs a 100%-acceptable way to wrιte normal texts ιn the original languages. I've never seen a Japanese booκ wrιtten in Romaji for ιnstance. So this argυment does not address the issυe that the υnified character does not have a representation that sυits every langυage that υses it.
  • The point about Fraktur is an excellent example, Antiqua is nowadays an accepted way to render text written in Fraktur and besides many books were printed indifferently with Antiqua/Fraktur fonts, because the underlying language, grammar, syntax, words and letters are identical since they are those of the German language, so it is really just a font. But Chinese is not written indifferently with Chinese/Japanese fonts (because there are many missing Hanzi characters). Japanese is not written indifferently with Japanese/Chinese fonts (because Chinese has no katakana, hiragana for starters).
  • So my point is not a strawman: whether there is a connection between some kanji and hanzi is irrelevant (except as an interesting academic exercise); what is relevant is whether there is a representation which can be 100% fully accepted by all users of the character. If there is no such representation, then the *consequences* of artificial unification (having to put stupid language markers in some way, somewhere) would be exactly the same as the consequences of artificially unifying Thai and Greek characters (by simply reusing the code range). Annoying consequences.
  • Also, you correctly underline that the Unicode group was consistent in following its principles. That's true, but then the Unicode principles are broken. Indeed, the design blunder was to rigidly stick to academic principles when the only major principle that should be kept is to make Unicode easy to use (i.e. no annoying consequences). There was a quite practical alternative to Han unification: no Han unification (and, if necessary, a sub-group working on identifying connected characters).
--213.41.133.220 14:41, 4 August 2006 (UTC)
  • OK, I have reviewed a little more of the points of view on the issue - and in fact it turns out that it is more complex, but much of it is not covered here; a starting point is [5] --213.41.133.220 03:42, 10 August 2006 (UTC)

about the character 'grass'

The character given in the text, 草, is not the one with three or four strokes that the IBM article was referring to. As a single character meaning grass, it should look like 艸, a variant of 草 that has been abandoned in simplified Chinese and is rarely used in print in traditional Chinese (I think). Though it is rarely seen as a character, it is a very common radical indicating 'plant', and most of the time appears on top as 艹 (three strokes). In traditional Chinese, all those characters that have this radical should have the horizontal stroke broken right in the middle, making it a four-stroke radical. But I have never seen it on a computer for some reason... --Liuyao 08:19, 11 Dec 2004 (UTC)

艹 and/or 䒑 are the 3-stroke radical "grass"; 艹 and/or 艹 are the 4-stroke versions.

艸 = 艹; both are radicals, not characters. 草 is a character listed under 艸. Kowloonese 02:23, Jan 27, 2005 (UTC)
Not exactly; 艸 is both a character and a radical, and is the original (ideographic) form of 草 (which is a hybrid ideographic/phonetic character) but is rarely, if at all, used nowadays. I have an old dictionary which consistently uses 艸 instead of 草 (plus other archaisms, for example: 葉 for 頁 which is now used afaik only by librarians; or 括弧 for 括號 which is now used afaik only in colloquial Cantonese).
But then 艸 of course has 6 strokes, not 3 or 4... --Wing 18:14, 12 Feb 2005 (UTC)
I stand corrected. 艸 is a character and a radical, and it has 6 strokes. 艹 is counted as 4 strokes. For example, 艸 and 艾 are both listed as 6 strokes. 草 is 10 strokes, though you first have to look up the 艸 radical, listed as 6 strokes, and then add the additional 6 strokes. But 6+6=10 in this case because 艸 loses 2 strokes when it becomes the radical 艹. Kowloonese 01:41, Feb 18, 2005 (UTC)

Vietnamese

What about Vietnamese? Shouldn't there also be some information given on the Chữ nôm script? If there's anyone knowledgeable about this, he or she might step forward and add something.

The Vietnamese variant of the Chinese script seems to be completely obsolete now, so I guess it wasn't considered worth mentioning. 惑乱 分からん 01:24, 30 April 2006 (UTC)

why

This article doesn't give any information on the rationale for this decision, when other visually identical and semantically very similar characters (e.g. some Greek and Latin ones) were not unified. Plugwash 11:04, 21 April 2006 (UTC)

See [6].--Prosfilaes 21:04, 25 April 2006 (UTC)
Allocation of memory size seems to be an important part of it. 惑乱 分からん 01:25, 30 April 2006 (UTC)
By the way, Greek and Latin characters are given separate code points in Japanese (JIS) character sets, including characters such as 'A', and distinguishing between them is thus required for round-trip compatibility with JIS. Amusing, no? :-) 130.233.22.111 18:53, 17 May 2006 (UTC)
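
This JIS duplication is why the fullwidth forms exist in Unicode at all: they are kept distinct for round-trip compatibility and only fold together under compatibility normalization. A quick Python illustration:

    import unicodedata

    fullwidth_a = "\uff21"  # Ａ FULLWIDTH LATIN CAPITAL LETTER A, from the JIS sets
    print(fullwidth_a == "A")                          # False: separate code point
    print(unicodedata.normalize("NFKC", fullwidth_a))  # 'A' (compatibility mapping)
    print(fullwidth_a.encode("shift_jis"))             # b'\x82`', the two-byte JIS form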

biased page

Why does this whole article read like an apology for Han unification? The fact of the matter is, a double standard is being applied. Chinese speakers don't care because their style of characters is the one being used, not the Japanese one. Why didn't they take the same approach with Japanese as they did with traditional and simplified Chinese? In most cases (like 99.9%) in traditional and simplified Chinese, the characters and combinations of characters even mean the same thing, but in Japanese, the meanings diverged a lot.

Also, what is the point of saying East Asians are the ones doing Han unification? Of course there aren't Western experts capable of doing this kind of research. All you're saying is you paid people enough money that they'd do this for you. How does this mean that East Asians accept what you're doing? --203.70.98.28 17:28, 3 June 2006 (UTC)

The meanings of Latin characters have also diverged a lot. The meaning is irrelevant. Traditional and Simplified Chinese aren't disunified; only those characters that are fundamentally different are disunified.
Unicode isn't paying anyone to do this; most experts are paid by their national governments if at all. It's hardly true that East Asians believe such and such; many Chinese have no problem with it, and I've met a Korean online that was annoyed at how disunified Unicode Han was. It's far from black and white.--Prosfilaes 17:41, 3 June 2006 (UTC)


Reassessing the content of the page

After digging through miscellaneous sources of information, it seems to me that there are many things that could be improved. In particular, in a few places only the less sophisticated criticisms of Unicode are addressed.

Now a few topics that could be improved:

  • The Han-unification rules.
  • The West-vs-East controversy, such as: "This rendering problem is often employed to criticize Westerners for not being aware of subtle distinctions, even though Unification is being carried out by Easterners". It seems to me that this controversy (including some Japanese criticisms) must go further than the simple pleasure of criticizing Westerners:
    • First, a clarification: it's unclear to me to what extent Unification has been carried out by Easterners - I already corrected the text to the effect that the decision to use Unification was taken by the Unicode consortium (California-based American companies) and then got endorsed by ISO, apparently as collateral damage of the push to merge Unicode and ISO; as you can guess from [7], Han unification did not seem to be a major topic of the merger - the only time when it could have been removed. Second, even though it is written in several places that unification is carried out by Easterners, the history of Unicode shows clearly that the first Unicode unification database was created by those companies [8]. Was it thrown away and re-created from scratch by the IRG? If not, it seems to me only half-true to claim "it is carried out by Easterners". See also: [9]
    • Second, if you look at history, the CCCII character set actually predates Unicode. The message [10] includes a passing mention that it was later used as the basis for the American EACC by the RLG [11], but with the axiom "one glyph = one code". Later, RLG was involved in the Unicode/Unihan unification. Why is this relevant, you may ask... well, the point is that CCCII includes a feature (namely the character/variant/glyph property, see below) which is sometimes advocated as superior [12]. This puts a new light on the West-vs-East controversy, since you can argue that some Westerners took an (Asian) Taiwanese standard, CCCII, with good features, modified it, removed some good features, created a new standard without feedback from the initial Asian experts (Unicode pre-1990), strong-armed international organizations (precisely ISO) into accepting it (with the merger), and later delegated the task of figuring out the details to a subgroup (of ISO: the IRG). Hence the legend that Westerners didn't understand Asian languages well and messed up.
  • Non-unification: the article did not mention non-unification as a possibility. In fact it is perfectly possible to perform no unification: this is the approach of ISO 2022, where character sets are not unified [13]. Escape sequences are used to switch between character sets (for instance, switch to this Japanese standard, or this Chinese standard). There are drawbacks but also advantages. Hence ISO 2022, even though it is not popular, deserves some mention (I added it). After all, like all design choices, the decision for Han unification can best be appreciated not only by its own advantages/drawbacks but also by those of the alternatives.
  • There are several issues with the principle of differentiating the concept of character from the concept of glyph (this is for the Controversy section):
    • As mentioned in the article, there is some misunderstanding because the Unicode publication was only showing one glyph version of each Chinese character, which caused people to assume that those were the mandated representations. Unicode aims at giving a code to a character only, and the idea is that a Japanese font would be used for Japanese text, a Chinese font for Chinese text, etc.
    • This is reinforced by the fact that Japanese text and Chinese text, for instance, use different styles (different fonts) even for identical characters. Hence, to properly present a text to a Japanese reader, it is not only necessary to properly display the characters which look different; all the other characters should also be displayed with a Japanese font. This would call for not only dis-unifying the Japanese versions of the variants, but in fact total dis-unification of Japanese/Korean/Chinese (with software problems: what happens if, in a WYSIWYG editor, you select Chinese text and change the font to some Japanese font?). On the other hand, according to [14], there are some fonts universally acceptable in CJK countries. Then you would want to have the Japanese version of some character and the Chinese version of the same character in the universal font, for which Unicode isn't enough.
    • This "universal CJK font" problems, hilights the problem of the "character/glyph" principle: in an universal font, you would want two versions of the characters. And indeed, in view of the older standard CCCII (using it even before Unicode was on the drawing board), a better principle could have been "'One Character has one or more Variants, which have many Glyphs'" as indicated in [15] with the comment "Unfortunately, many of those arguing in favor of Unicode have, historically, been non-users of Han characters who have tended to just assume that differences between character variants are generally insignificant display-level details, like differences between fonts". Then then the Japanese-version, the Chinese simplified-version, the Chinese-traditional version, the Korean version, could be variants of the same characters (when different). Each variant would have exactly one different glyph in an universal CJK font. A Japanese font could include the same glyph for all the variants. It is not clear, however, if one can draw definite boundaries.
    • The consequences of Han unification are also an issue: it is not possible to have "plain text". "Plain text" appears in places such as filesystems (directory and file names) or Web forms (even though there is some locale information exchanged at the HTTP protocol level). It creates additional problems for information exchange (for instance, cut and paste between applications), as the language (font?) would have to be exchanged along with the Unicode string data. It is not possible to create a Unicode font (as mentioned previously).
Sure it is. You just have to accept either a Chinese font or a Japanese font, and as you agreed above Japanese users tend to write Chinese in Japanese texts using Japanese fonts. This isn't true for Fraktur; English or Esperanto or French in German or Norwegian Fraktur texts were always typeset in Roman, never in Fraktur.--Prosfilaes 06:28, 10 August 2006 (UTC)
The point is "plain text" doesn't have information to select the variant, how to write "黛" for instance. Of course, you're right, everything can be displayed in a way of another. --213.41.133.220 23:41, 13 August 2006 (UTC)
    • The answer to the Unicode font argument is that Unicode should not be used, but instead systems like OpenType. On the other hand, that does not prevent people from creating Unicode fonts: witness [16]; if they are out in the wild, people will have to cope with them.
OpenType instead of Unicode????? What is that supposed to mean? Mlewan 11:07, 10 August 2006 (UTC)
Actually not "instead of", but "in complement of" it, additional mechanisms - such as those present in OpenType - allow the choice of a proper variant based on some context (language, ...): [17]. --213.41.133.220 23:41, 13 August 2006 (UTC)
    • Inconsistency: as said, Chinese simplified and traditional characters are not unified when different, even though they are clear-cut candidates for unification. They escape unification because of the round-trip conversion principles. But still, this inconsistency makes the whole Han unification incoherent - if a Japanese character had a variant identical to a Chinese simplified character, then the Japanese variant would have a code allocated for it; otherwise not. This makes differences between characters inside the same country more important than differences between characters of different countries. In addition, judging from [18] there are not so many Japanese variants, so why the Japanese variants do not have separate codes is unclear.
That's not at all true. Simplified and traditional characters aren't unified, because they're different.--Prosfilaes 06:28, 10 August 2006 (UTC)
OK, I stand corrected: only "identical" characters are unified (and normally not those with one or a few different strokes). Also, I checked the [19] samples against several dozen Chinese (Simplified/Traditional) and Japanese fonts. One conclusion is that some of the criticisms on the page are in fact due to the sometimes poor quality of the font used by the Unicode site for some characters, since the other Chinese and Japanese glyphs for the character look pretty similar otherwise. A second conclusion is that there is indeed some amount of variation among Chinese fonts (Ming, Song, Hei, Kai, ...), but little among the Japanese fonts (Mincho, Gothic): hence it makes complete sense to unify the character for Chinese (since it is exactly the same character from the same language in a different font). Then, quite often, the Japanese variant looks very similar to one of the Chinese font variants (one exception I found: "黛"), so I guess the logical conclusion is to unify also, as Japanese and Chinese characters are in the same code space. This means that unification in one language can have implications for the unification of another language OR result in inconsistent mapping to duplicate characters. -- 213.41.133.220 23:41, 13 August 2006 (UTC)
    • In fact, judging from [20], some dis-unification is actually being carried out, to handle names properly at least.
  • Useful references: [21] "Michael Kaplan's" Blog, Unicode topics

--213.41.133.220 06:13, 10 August 2006 (UTC)
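
On the OpenType mechanism discussed above ([17]): language-dependent glyph selection lives in a font's GSUB table as per-script language systems, used by the 'locl' feature. A sketch of how to inspect this with the fontTools library - the font path is a placeholder, and pan-CJK coverage of any given font is an assumption:

    from fontTools.ttLib import TTFont  # pip install fonttools

    # Placeholder path: substitute any OpenType font with CJK coverage.
    font = TTFont("PanCJK-Regular.otf")
    gsub = font["GSUB"].table  # raises KeyError if the font has no GSUB table
    for script in gsub.ScriptList.ScriptRecord:
        # Each script may declare language systems (e.g. JAN, ZHS, ZHT, KOR)
        # under which 'locl' lookups substitute different glyphs for the
        # same Unicode code point.
        langs = [ls.LangSysTag for ls in script.Script.LangSysRecord]
        print(script.ScriptTag, langs)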

on the check your browser section again

I'm wondering if this is all that useful. Does any browser change the glyphs depending on the specified language in the text? Perhaps a smaller table with fewer columns would work better. Then we could use images to illustrate the way a particular Unicode code point would have different glyphs for each of the languages/locales. I don't have the language background to produce this, but I might be able to find some references that do. Again, I think just enough columns to cover the normal width of the article (maybe 5 or 6). Then use image files instead of text characters to show how the different characters are rendered in each language. Then there could be a follow-up table that just had the text. But as I said, I'm not aware of any browser that makes these sorts of glyph changes based on the language tags. —The preceding unsigned comment was added by Indexheavy (talk · contribs) 07:19, 5 May 2007 (UTC).

Firefox changes the display depending on language. I think there are more browsers where it works. You are right that the table is growing a lot. One solution might be to
a) create a separate page. "List of Language dependent Unihan characters" or something like that.
b) rotate the table so each language gets a column instead of a row.
Using images may be tricky as the look depends on which fonts you have installed. To keep a consistent look, we would have to decide which font to use, and there is not one single Asiatic font which is installed by default on all operating systems.
However, we could have one or two examples with images for users who do not have browsers where the language tag works. Mlewan 08:45, 5 May 2007 (UTC)
I think transposing the table is a good idea, so that the languages become columns and the separate characters become the rows. I'll probably go ahead and do that if there aren't any objections. However, what I think the article needs more than a 'check your browser' section is a demonstration of how the glyphs differ in the different languages. With an image we have control over that sort of glyph display. With text, it's anybody's guess what a reader sees when they load this article. Just looking at the first code point, U+4E0E (which according to the Unihan database means "and; with; to; for; give, grant"), I see very little variation in the glyph in Firefox on Mac (perhaps Firefox on Mac doesn't select the correct font the way it does on Windows). So neither this glyph nor any of the others has any significant difference from one row to the next. The glyph variation as I change fonts on my system is much greater than the differences displayed in Firefox. I would imagine that there are some good examples of ideographs that would appear different in Japanese, Traditional Chinese and Vietnamese (and Simplified Chinese if the character is also unified). Characters like that would serve as much better examples for illustrating the need for language-sensitive glyph substitution. And that would require using images of the glyphs to be sure the illustration was getting across to every reader (not just the few readers who visit with Firefox on Windows, if even them). Part of the problem is that this may require someone who knows Japanese, Chinese and Vietnamese to be sure the glyph is displayed properly for each. Indexheavy 18:01, 5 May 2007 (UTC)
I am using Firefox on Mac, and I see "big" differences in the display of 与. It is possible that one has to know what to look for, but the horizontal line crosses the curved one in Japanese (and Korean), but it does not in the Chinese versions. On a Mac you can verify that the difference is consistent looking up the character in the Character Palette. Expand the "Font Variation" and see which fonts display which behaviour.
The character actually does not exist in Korean, and that may be confusing for a reader, but it is nevertheless good to know how it is displayed, if one types a Chinese word in the middle of a Korean text.
Most of the characters in the table show significant differences for someone who knows what to look for. However, if we rotate the table, we should probably also make room for an "explanation" column, where one describes in words or with images what to look for in each character, so even people without knowledge of the CJK languages know what the difference is.
(Vietnamese has not used Chinese characters for about 100 years. You probably mean Korean.) Mlewan 19:14, 5 May 2007 (UTC)

Yes, you're right that I'm not familiar enough with the writing to notice the differences at first glance. I was looking more towards the middle of the table, and though I see some subtle differences there too now, they still aren't as significant as the differences I see as I change through the fonts on my system. Also, I accidentally left off Korean, but I was referring to Vietnamese Chu nho, which I don't believe is completely obsolete. In Unicode documentation these days I often see the abbreviation CJKV instead of just CJK, to extend to Vietnamese. Anyway, I think fewer examples with more exposition would be very helpful in getting readers to understand the importance of font support for the unified characters. However, I still doubt the importance of the 'check your browser' section. Now that I've tested Firefox on Windows, I see there are no differences at all (unless one also needs additional font support on the Windows install; I believe I'm testing it on a fairly basic installation). Also, I'm not so sure that Firefox is doing the appropriate thing here anyway. Unicode recommends following current practice, where the glyphs used for ideographs are drawn from either the document's language or the reader's language. In that case a browser shouldn't switch fonts or glyphs due to changes in the language tag, but instead use the same glyphs throughout the document unless the author changes fonts explicitly. For documents that have no declared language, the browser should use the environment language. So in summary, there's only one browser that handles these characters in the way the 'check your browser' section is checking for, and it's not at all clear that Firefox is doing the right thing here.

I think this U+4E0E would serve as a decent example. I uploaded a glyph for this character that looks more significantly different than the ones I'm seeing in the table in Firefox. My thinking is that if we could match this glyph and the others to the various writing systems for this character (and maybe a few other characters with greater or lesser significant differences), it would provide a reference for readers to understand the issue. I don't have the link handy right now, but I know I've run across an example at the Unicode site, though it only relates to variations within Japanese alone (not across the writing systems). It might be a good idea to produce examples of both (intra-writing-system and inter-writing-system unification glyph differences). Indexheavy 20:31, 5 May 2007 (UTC)

The table is extremely useful. I have used it dozens of times in professional circumstances, and I have distributed the link to several of my colleagues. When you work with CJK web sites, you need to know which strings are sensitive to language issues. It is not enough to have one example. You need at least one example for each kind of difference, and the more examples you have, the better. We may not need a complete table within this article, but we need all examples somewhere. I would create a new page for it, if I had the time.
Provided you have the right fonts installed, at least both Firefox and IE handle this fine on Windows. I think Opera also does it. Safari 2.0 does not.
The whole point of Unihan is that you change font according to language. If you look in any bilingual Chinese/Japanese book, like language courses, you will notice that they always have different fonts for the two languages. Firefox and IE do this according to the rules. Mlewan 21:19, 5 May 2007 (UTC)
I think you're misunderstanding much of what I'm trying to say. First, I wasn't suggesting that we reduce the table to only one example. However, having several examples with some explanatory notes would be much more useful than all the examples we have with no explanation of the differences. Secondly, I understand the point of Unihan is to change fonts (changing glyphs, to be more precise, even if they come from the same font). However, the issue is how one changes fonts. Unicode does not suggest changing fonts according to language tags; rather, it suggests changing fonts according to the reader's language environment (for plain text). For rich text, the fonts should be those specified by authors (or by a reader, in the case of user stylesheets, for example). Changing fonts when encountering different language tags is not recommended by Unicode, and I think it's bad practice in general (Firefox is the only browser I can see changing fonts, but it doesn't seem to necessarily be in relation to producing the appropriate Unihan glyph).
Perhaps the solution is to leave the "check your browser" table as is, but add another smaller table before it that properly illustrates some differences between CJKV glyphs for each of a few Unihan characters. Would you say that the characters included in the "check your browser" table are good potential candidates to illustrate those differences? One other question I had for you: above, you refer to "at least one example for each kind of difference". Could you say a little more about these different kinds? That too would be helpful exposition for readers of the article (as well as for me). Thanks for engaging in this discussion. Indexheavy 03:42, 6 May 2007 (UTC)
OK, you definitely seem to have understood the idea of font switching. I misunderstood you before. The only question is whether one should count the html-language tags as "rich text"-tags or not. All major browsers do (provided CJK is installed), and I think that settles it. It is also the best way to be sure to display for example 具 in a Japanese way in HTML. The alternative would be to specify fonts, but you do not know which fonts the reader has installed.
I was deliberately vague with "one example for each kind", and you spotted it. The problem is that there are so many kinds. When it comes to 具, it is a matter of connecting all parts of the character. When it comes to 今, it is a matter of an angle of a stroke. And so on. Mlewan 07:42, 6 May 2007 (UTC)
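
To make the "rich text" alternative in this exchange concrete: instead of relying on lang-tag heuristics, an author can pin fonts per language with CSS :lang() rules. A generated test file; the font names are common system fonts, but their availability on any given machine is an assumption:

    # Author-pinned fonts per language via CSS :lang() rules.
    css = """
    :lang(ja) { font-family: "MS Mincho", serif; }      /* assumed Japanese font */
    :lang(zh-Hans) { font-family: "SimSun", serif; }    /* assumed Chinese font */
    """
    body = "\n".join(
        f'<p lang="{lang}">{lang}: 具 今 直</p>' for lang in ("ja", "zh-Hans")
    )
    with open("lang_css_test.html", "w", encoding="utf-8") as f:
        f.write(f"<!DOCTYPE html>\n<meta charset='utf-8'>\n<style>{css}</style>\n{body}\n")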

Made transpose changes to the table

I transposed the two tables. Please check to see that I didn't mess anything up. Also, I didn't understand what the difference was between the two tables. Should they be combined into one? In the second table the code points were grouped into pairs (and sometimes threes). I didn't reproduce that in any visible way, though there is an extra row in the wiki table syntax that doesn't get displayed (it's in the source for easy editing purposes). Finally, I think there's plenty of room for an additional column or two, for example if we need to add a glyph for Vietnamese (Chu nho) or an annotation - annotations that draw attention to the important differences in the glyphs, or that merely include the kDefinition property from the Unihan database. Again, I think a much smaller table with some particularly good examples rendered as images would help illustrate this topic without a reader needing Asian fonts installed (my test of a basic Windows install shows there are no differences in any browser: IE, Firefox and Opera). Indexheavy 05:32, 6 May 2007 (UTC)

Good work! Thanks a lot on behalf of all the readers!
However, yes, there were some things you missed. The pairs in the second table are there to illustrate the alternative solution to the problem. The only way to display the two variants of 入 is to change the font (or language tag). However, for 內, there is the alternate character 内. This is somewhat silly, as the difference is (almost) equivalent. And then there are really confusing cases, like 兌, which can be displayed as 兑 either by using another character or by using a language tag. So each row in a pair displays the "same" character, but with different Unicode code points.
When it comes to Vietnamese, I think it should be left out from the page entirely. We do not know of any single character where the Vietnamese language tag changes its display. I am not even sure that there are any dedicated fonts for Vietnamese forms of Chinese characters. Once such a font is identified and the problem is pointed out there, it could be added to the table.
I agree with a smaller table in addition, to further explain the principles. I made some more modifications, changed title and added some explanation. Mlewan 07:42, 6 May 2007 (UTC)

Generic Chinese?

What is "generic Chinese"? I understand that other columns are different national standards, but what is "generic Chinese" supposed to be? --Voidvector 01:35, 28 June 2007 (UTC)

Wrong statement in the article about "内"

The article says that "入" (enter) has different variants which are not encoded in Unicode. Then it goes on to say something like "内" (simplified Chinese) and "內" (traditional Chinese) are derivatives of the aforementioned character. As far as I know, this is wrong because:

入 (enter) + 冂 (a radical) = 內 (inside, traditional Chinese) 
人 (person) + 冂 (a radical) = 内 (inside, simplified Chinese)

--Voidvector 02:22, 28 June 2007 (UTC)
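
The decomposition above can even be written in plain text with Unicode's Ideographic Description Characters, which describe composition without unifying anything. A small sketch (IDS strings describe shape only and are not canonically equivalent to the composed characters):

    # Ideographic Description Sequences: U+2FF5 (⿵) means "surround from above".
    ids_traditional = "\u2ff5\u5182\u5165"  # ⿵冂入 describes 內 (U+5167)
    ids_simplified = "\u2ff5\u5182\u4eba"   # ⿵冂人 describes 内 (U+5185)
    for ids, composed in ((ids_traditional, "內"), (ids_simplified, "内")):
        print(ids, "->", composed, f"U+{ord(composed):04X}")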

Make those characters pictures

Hey, this page is interesting, but the character comparison tables are too dependent on the reader's fonts. For example, on Ubuntu 7 Linux with Firefox, many characters are the same. The other problem is that the browser will use two fonts with different designs (say a thin "pencil-like" one and a square "print-like" one) to render the subtle differences between the C, J or K versions, so it is not clear at all. Having someone with a good full-Unicode font make a screenshot and upload it to the article would certainly help a lot. Just a suggestion. —Preceding unsigned comment added by 220.231.37.106 (talk) 04:36, 4 June 2008 (UTC)