Talk:UTF-32/UCS-4

From Wikipedia, the free encyclopedia

Why 4 byte? Why not 3?

Why is there no 3-byte encoding? 2^24 is 16,777,216, much more than is needed to represent the 1,114,112 character codes of Unicode. Is this because of word boundaries? I understand there are tradeoffs, but wouldn't someone somewhere have use for a simple-to-process encoding that didn't waste a whole byte for each character? --Apantomimehorse 06:47, 9 September 2006 (UTC)

Truth is, if they had planned things properly from the beginning I doubt this encoding would exist. If you are going to the trouble of supporting supplementary characters you will probably want other advanced text features too, which will nullify most of the advantages of a fixed-width encoding.
In any case you'd be pretty mad to use UTF-32 or UTF-24 for storage or transfer purposes, and if you want to use a 3-byte encoding internally in your app or app framework there's nothing to stop you (though I strongly suspect it will perform far worse than either a well-written UTF-16 or UTF-32 system). Plugwash 00:39, 11 September 2006 (UTC)
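
To make the arithmetic in this thread concrete, here is a minimal Python sketch of a fixed 3-byte encoding. "UTF-24" is not a standardized encoding form; the function names `encode_utf24` and `decode_utf24` and the big-endian byte order are assumptions made purely for illustration, and the sketch says nothing about how such a scheme would perform.

```python
# Hypothetical fixed-width 3-byte encoding ("UTF-24" is NOT a standard;
# names and big-endian byte order are assumptions for illustration).

UNICODE_CODE_POINTS = 0x10FFFF + 1   # 1,114,112 code points in Unicode
THREE_BYTE_VALUES = 2 ** 24          # 16,777,216 values fit in 3 bytes


def encode_utf24(text: str) -> bytes:
    """Store every code point in exactly three big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def decode_utf24(data: bytes) -> str:
    """Inverse of encode_utf24; rejects input that is not a multiple of 3."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3 bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )
```

Three bytes are enough for every code point, so a round trip such as `decode_utf24(encode_utf24(s)) == s` holds for any string `s`; the open question raised above is alignment and speed, not range.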

NPOV?

Is it just the way I'm reading this article, or does it stink of a total lack of NPOV? It almost reads like a case for everybody forgetting about UTF-32. UTF-32 space inefficient? Not if you're Japanese. The whole reason character handling is in the state it's in is that people didn't care about the needs of other people. It was pretty clear a long time ago that a solution to i18n was needed, and that it would have to be something remarkably like UTF-32.

"Also whilst a fixed number of bytes per code point may seem convenient at first it isn't really that much use. It makes truncation slightly easier but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases since even with a “fixed width” font there may be more than one code point per character position (combining marks) or indeed more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as being the same as one unit for editing."

Well, no. If you're talking about drawing glyphs, sure, but it has absolutely no pros or cons compared to other charsets in that context. It makes i18n string handling easier by an order of magnitude, though. Put simply, all you do is divide everything by four. Try counting the length of a string in UTF-8 or UTF-16. It's just about impossible to do in a stable way. Look at the whole "Bush hid the facts" bug in Notepad: the *perfect* example of an issue that would never have occurred with UTF-32. http://www.evilshroud.com/bushhidthefacts/ --Streaky 03:35, 30 November 2006 (UTC)
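
For illustration only, the "divide everything by four" point can be sketched in Python next to the equivalent counts for UTF-8 and UTF-16. The function names are hypothetical, and all three functions assume well-formed input with no byte order mark (the UTF-16 one assumes little-endian); they count code points, not displayed characters.

```python
def count_code_points_utf32(data: bytes) -> int:
    """UTF-32: every code point is one 4-byte unit, so just divide."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    return len(data) // 4


def count_code_points_utf8(data: bytes) -> int:
    """UTF-8: count every byte that is not a continuation byte (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def count_code_points_utf16le(data: bytes) -> int:
    """UTF-16LE: count every 16-bit unit that is not a trail surrogate."""
    count = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


# Sample mixing ASCII, three CJK ideographs and one supplementary character.
sample = "Bush hid the facts \u65e5\u672c\u8a9e \U0001F600"
```

All three functions return the same count for the encoded `sample`; the UTF-32 version needs no scan of the data, while the other two need one pass over it.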

Inefficient is definitely true: in the best case it's no better than either UTF-8 or UTF-16, and in the common cases (yes, that includes Chinese and Japanese) it is far worse.
What *IS* the code point count useful for? Most of the time what matters is either size in memory, grapheme cluster count or console position count.
As for the <name> hid the facts "bug" you mentioned, it doesn't look like a charset issue to me (and is almost certainly not related to either UTF-8 or UTF-16). To me it looks like a deliberate easter egg but unless someone can translate Plugwash 12:52, 30 November 2006 (UTC)
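
The size claims in this exchange can be checked with a short Python sketch. The helper name `encoded_sizes` is hypothetical, the "-le" codec variants are used so that no byte order mark is counted, and the three sample strings are tiny illustrations rather than representative text.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text in each UTF form."""
    return {
        encoding: len(text.encode(encoding))
        for encoding in ("utf-8", "utf-16-le", "utf-32-le")
    }


ascii_sizes = encoded_sizes("hello")                    # 5 ASCII letters
japanese_sizes = encoded_sizes("\u65e5\u672c\u8a9e")    # 3 BMP ideographs
supplementary_sizes = encoded_sizes("\U0001F600")       # 1 non-BMP code point
```

For these samples the ASCII string takes 5, 10 and 20 bytes, the three ideographs take 9, 6 and 12 bytes, and the single supplementary character takes 4 bytes in all three forms, which is the "best case" where UTF-32 merely ties.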