Talk:GB 18030
From Wikipedia, the free encyclopedia
Now we have an understandable explanation in English from Sun. Anyone fancy expanding this article? Plugwash 17:43, 22 January 2006 (UTC)
One thing to beware of in using the Sun article: it has an error where it mentions Unicode 2.1 instead of 1.1 as the basis for GBK. I checked elsewhere at length, and the Wikipedia GBK page is actually correct. BTW, I did the recent updates to GBK, GB2312 and GB18030. -- Richard Donkin 06:39, 27 January 2006 (UTC)
GB 18030-2005?
According to http://www.sac.gov.cn/, the 2005 version of this standard was released on Nov. 8, 2005, effective May 1, 2006. --Skyfiler 22:30, 23 February 2006 (UTC)
- Got any more detail? Is it just official incorporation of stuff introduced by Unicode since the original standard? Plugwash 17:18, 2 February 2006 (UTC)
- No more technical details on the website, and a search yields no results :( --Skyfiler 22:30, 23 February 2006 (UTC)
Unicode code points
>which is easily sufficient to cover Unicode's 1,114,112 (17*65536) code points.
This counted the 2,048 surrogate code points, which don't need to be encoded; e.g. in UTF-16 you can't encode U+D800, as it isn't a valid real code point.
Where does this figure come from? I understand that UTF-16 can encode all code points, but it only covers 1,112,064:
BMP is 0x10000 - (0xE000 - 0xD800) = 0xF800 (63,488), i.e. not counting the surrogate code points.
The rest have 20 bits in UTF-16 four-byte sequences (0x10000 is subtracted) = 2^20 = 1,048,576 (0x100000).
0xF800 + 0x100000 = 0x10F800 (1,112,064).
-
- Sorry, I counted the overall range of code points and didn't account for the ones that are permanently reserved for uses other than encoding characters (e.g. the surrogates you mentioned). Plugwash 17:26, 14 July 2006 (UTC)
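For anyone who wants to check the arithmetic in this thread, here is a minimal Python sketch. The constant names and the helper `can_encode_utf16` are made up for illustration; the only thing taken from the discussion is the set of figures being compared.

```python
# Hypothetical names; only the figures come from the thread above.
UNICODE_MAX = 0x10FFFF                 # last code point in plane 16
SURROGATES = range(0xD800, 0xE000)     # U+D800..U+DFFF, reserved for UTF-16

total_code_points = UNICODE_MAX + 1    # 17 planes * 65536
surrogate_count = len(SURROGATES)
encodable = total_code_points - surrogate_count

# The same total, split the way the thread splits it.
bmp_non_surrogate = 0x10000 - surrogate_count   # BMP minus the surrogates
supplementary = 2 ** 20                         # 20 payload bits per surrogate pair


def can_encode_utf16(cp: int) -> bool:
    """Return True if Python's UTF-16 encoder accepts this code point."""
    try:
        chr(cp).encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


print(total_code_points, surrogate_count, encodable)
print(hex(bmp_non_surrogate), hex(supplementary), hex(bmp_non_surrogate + supplementary))
print(can_encode_utf16(0xD800), can_encode_utf16(0x1F600))
```

The figure in the article text corresponds to `total_code_points`, while the figure worked out in the thread corresponds to `encodable`.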
The need for a new mapping table
It appears that the mapping table, while probably based on more or less official mappings, refers to an older version of Unicode, and, taking the section further up on the page into account, perhaps also of GB18030, and lacks several mappings that have now become available. Notably, it maps several characters present in GB18030 to characters in Unicode's Private Use Area, although according to Kenneth Whistler of Unicode, Inc, these characters were already mapped in Unicode 4.1. The same appears true of the mapping table included with my copy of Ubuntu Linux, which may or may not be the same table.
It would appear that no up-to-date table is available in the public domain, however, so this may be the most up-to-date table that's available. At any rate, I think we should keep our eyes open in case a more recent table surfaces. Rōnin 20:07, 21 February 2007 (UTC)
- All the sources I can find written in English online seem to say that GB18030 is supposed to be a 1:1 mapping of all parts of Unicode, including the Private Use Area and all code points that are currently unassigned. Do you have an authoritative source that says (either directly or through tables that can be read in conjunction with that big XML file) that GB18030 values that map to private-use Unicode code points are used in the GB18030 standard for something other than private use? (Note: it seems that most but not all of the BMP Private Use Area is mapped either to codes that our GBK article calls out as private use or to 4-byte codes.) Plugwash 04:00, 16 March 2007 (UTC)
-
- I have a few, actually. http://www.unicode.org/faq/han_cjk.html#23 says this:
-
- A. That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB18030, the missing characters were added as of Unicode 4.1, so of course, they are in Unicode 5.0 and later versions.
-
- You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters. These can be found in the ranges: U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB (for various CJK characters and components).
-
- I am aware that a new version of GB 18030 has been released which shows reference glyphs for a wider range of characters (including supplementary ones, I believe) and updates the mappings from Unicode 3.0 to Unicode 4.0 or higher - which changes some of them from PUA code points to assigned code points.
-
- Some assigned characters are mapped from 2-byte parts of GBK and GB 18030 to the Private-Use Area in the BMP (U+E000..U+F8FF). A small portion of these mappings have changed between GBK and GB 18030, and GB 18030 maps them instead to Unicode characters that were introduced in Unicode 3.0.
-
- It seems to me that the table in the article maps a lot of codes into the range E000-F8FF, which could mean that it actually maps a lot of now-assigned code points to the Private Use Area. Though, as seen in the post from the mailing list quoted above, the people at the ICU project are aware of it.
-
- Rōnin 04:44, 17 March 2007 (UTC)
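One rough way to see how many two-byte codes a given table still sends to the BMP Private Use Area is to scan them all. The sketch below does this against Python's built-in `gb18030` codec, which is an assumption standing in for whichever table is under discussion; its mappings may differ from the table in the article and between Python versions, so no counts are quoted here. The names `BMP_PUA`, `ADDED_IN_4_1`, `is_bmp_pua` and `two_byte_pua_mappings` are hypothetical.

```python
# Ranges quoted in the discussion above.
BMP_PUA = range(0xE000, 0xF8FF + 1)            # BMP Private Use Area
ADDED_IN_4_1 = [
    range(0x31C0, 0x31CF + 1),                 # CJK strokes
    range(0x9FA6, 0x9FBB + 1),                 # CJK characters and components
]


def is_bmp_pua(cp: int) -> bool:
    """Return True if the code point lies in U+E000..U+F8FF."""
    return cp in BMP_PUA


def two_byte_pua_mappings(codec: str = "gb18030") -> dict:
    """Decode every two-byte sequence and keep those that land in the BMP PUA.

    Assumes the usual two-byte layout: lead byte 0x81..0xFE,
    trail byte 0x40..0x7E or 0x80..0xFE.
    """
    trails = list(range(0x40, 0x7F)) + list(range(0x80, 0xFF))
    hits = {}
    for lead in range(0x81, 0xFF):
        for trail in trails:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue                       # not assigned in this codec
            if len(text) == 1 and is_bmp_pua(ord(text)):
                hits[raw] = ord(text)
    return hits


if __name__ == "__main__":
    hits = two_byte_pua_mappings()
    print(f"{len(hits)} two-byte codes decode to the BMP PUA in this codec")
    for raw, cp in list(hits.items())[:5]:
        print(raw.hex().upper(), f"U+{cp:04X}")
```

If a newer table surfaces, running the same scan against it and comparing the two results would show which codes moved from the Private Use Area to assigned code points such as those in `ADDED_IN_4_1`.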