Talk:UTF-8

From Wikipedia, the free encyclopedia

Contents

[edit] stroke vs caps

Hi, I am trying to translate this page in Italian for it.wikipedia, and I was wondering if someone could elaborate a bit about the part on stroke versus capital letters. --Agnul 21:42, 2 Dec 2004 (UTC)

[edit] slightly ambiuous text

UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8 does this mean that the offset is subtracted in utf-8 or does it mean that it is subtracted in utf-16 and not utf-8? —The preceding unsigned comment was added by Plugwash (talkcontribs) 2004-12-04 22:38:47 (UTC).

[edit] Reason for RFC3629 restricting sequence length

I've tried to find a good reason why RFC3629 restricts the allowed encoding area from 0x0-0x7FFFFFFF down to 0x0-0x10FFFF, but haven't found anything. It would be nice to have an explanation about why this was done. Was it only to be UTF-16 compatible? -- sunny256 2004-12-06 04:07Z

i haven't seen any specific reference to why in specifications but i think what you mention is the most likely reason. I do wonder if this restriction will last though or if they will have to find a way to add more code points at a later date (they originally thought that 65535 code points would be plenty ;) ). Plugwash 10:31, 6 Dec 2004 (UTC)
The reason is indeed UTF-16 compatibility. Originally Unicode was meant to be fixed width. After the Unicode/ISO10646 harmonization, Unicode was meant to encode the basic multilingual plane only, and was equivalent to ISO's UCS-2. The ISO10646 repertoire was larger from the beginning, but no way of encoding it in 16 bit units was provided. After UTF-16 was defined and standardized, future character assignments were restricted to the first 16+ planes of the 31-bit ISO repertoire, which are reachable with UTF-16. Consequently security reasons prompted the abandonment of non-minimum length UTF-8 encodings, and around the same time valid codes were restricted to at most four bytes, instead of the original six capable of encoding the whole theoretical ISO repertoire. Decoy 14:05, 11 Apr 2005 (UTC)
The precise references for the above are [1] and [2]. The latter indicates that WG2 on the ISO side has formally recognized the restriction. Furthermore, [3] and [4] indicate that the US has formally requested WG2 to commit to never allocating anything beyond the UTF-16 barrier in ISO10646, and that the restriction will indeed be included in a future addendum to the ISO standard. Decoy 16:51, 19 Apr 2005 (UTC)

[edit] decimal VS hex

the article currently uses a mixture of decimal and hex with no clear labeling as to which is which. Unicode specs seem to prefer hex. HTML entities and the windows entry method default to decimal. ideas? Plugwash 02:19, 26 Jan 2005 (UTC)

Unicode characters are usually given in hexadecimal (e.g., U+03BC). But numeric character references in HTML are better in decimal (e.g., μ rather than μ), because this works on a wider range of browsers. So I think we should stick to hexadecimal except for HTML numeric character references. --Zundark 09:15, 26 Jan 2005 (UTC)
I agree, and would add that when referring to UTF-8 sequence values / bit patterns / fixed-bit-width numeric values, it is more common to use hexadecimal values, either preceded by "0x" (C notation) or followed by " hex" or " (hex)". I prefer the C notation. I doubt there are very many people in the audience for this article who are unfamiliar with it. For example, bit pattern 11111111 is written as 0xFF. -- mjb 18:25, 26 Jan 2005 (UTC)
Could someone add a paragraph to the main text about decimal or hex numeric character references to explain as why they are necessary?? —The preceding unsigned comment was added by 203.218.79.170 (talk • contribs) 2005-03-25 01:19:52 (UTC).

[edit] Appropriate image?

Is Image:Lack of support for Korean script.png an appropriate image for the "invalid input" section? I am wondering if those squares with numbers are specifically UTF-8? --Oldak Quill 06:26, 23 July 2005 (UTC)

No, the hexadecimal numbers in the boxes are Unicode code points for which the computer has no glyph to display. They don't indicate the encoding of the page. The encoding could be UTF-16, UTF-8, or a legacy encoding. Indefatigable 14:47, 23 July 2005 (UTC)

[edit] Comment about UTF-8

The article reads: "It uses groups of bytes to represent the Unicode standard for the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit Electronic Mail systems."

"UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The day after, Pike and Thompson implemented it and updated their Plan 9 operating system to use it throughout."


RANTMODE: ON

I would like to add that UTF-8 is causing havoc and confusion to computer users that are used to the international characters in at least the ISO-8859-1 charset. Getting national chars through is a hit or miss thing these days. Some people can read UTF-8, but not write it. Some people write it but can't read it, some people can do neither, and some people do both.

UTF-8, in my opinion, sucks. I've used months to finally figure out how to get rid of the abomination. I'm finally back to ISO-8859-1 on all my own systems. I *hate* modern OS-distributions defaulting to UTF-8, and I've got a dream about meeting the bastard that invented the abomination to tell him what I really think of it.

In my opinion UTF-8 is wonderful encoding and I hope that it will be widely accepted as soon as possible. For example, the Czech language may use the CP852 encoding, the ISO8859-2 encoding, the Kamenicky encoding, Windows-1250 encoding, theoretically even KOI-8CS. Every stupid producer of a character LCD display creates his own "encoding". It doesn't work in DVD players, it creates problems on web pages, in computers, you cannot even imagine how it sucks. UTF-8 could become THE STANDARD we are looking for, because it is definitely the best standardized Unicode encoding and it is widely accepted (even Notepad in WinXP recognizes it, so we can say that it has become a de-facto standard). What really makes one angry is the "Java UTF-?", "MAC UTF-?", byte-order-dependent UTF-16 "encoding" and so on. Everybody tries to make his "proprietary extensions" and the main reason is to keep incompatibility and to kill the interoperability of software - because some people think it to be "good for bussiness". So PLEASE, ACCEPT THE UTF-8, it really is the best character encoding nowadays. (2006-08-17 20:38)

RANTMODE: OFF (unsigned comment by 80.111.231.66)

working with systems that use different encodings and provide no good way to specify which one is in use is going to be painfull PERIOD. The old system of different charsets for different languages can be just as horrible. Plugwash 16:37, 13 August 2005 (UTC)
The point is that most people don't need to work with systems with different language encodings. While almost everyone is affected by the conflicts between UTF-8 and their local charset. When you're not used to that kind of bullshit, it gets hugely annoying. Personally I've just spent the day de-utf8ying my systems. (80.111.231.66)
Point taken, but you should know that we feel the same way when dealing with your "code pages" and other limited charsets. Some time in the future, we might say the same thing about UTF-8, though at least they've left us two extra bytes to add characters to. Rōnin 22:48, 15 October 2005 (UTC)


Wikipedia is utf-8. Anyone who thinks that Wikipedia would have been as easy to manage without one unified character set is smoking something. 'nuff said. --Alvestrand 00:15, 18 August 2006 (UTC)

[edit] The comparison table

Is the table wrong for the row showing "000800–00FFFF"? I don't know much about UTF-8 but to me it should read "1110xxxx 10xxxxxx 10xxxxxx" not "1110xxxx xxxxxx xxxxxx" --220.240.5.113 04:30, 14 September 2005 (UTC)

  • Fixed. It got lost in some recent table reformatting. Bovlb 05:44, 14 September 2005 (UTC)

[edit] bom

the following was recently added to the article i reverted and was re-reverted ;).

  • UTF-16 needs a Byte Order Mark (U+FEFF) in the beginning of the stream to identify the byte order. This is not necessary in UTF-8, as the sequences always start with the most significant byte on every platform.
UTF-16 only needs a bom if you don't know the endianess ahead of time because the protocol was badly designed not to specify it, but such apps that don't have an out of band way of specifying charset already probbablly aren't going to be very suitable for use with utf-16 anyway. Apart from plain text files i don't see many situations where a bom would be needed and it should be noted that with plain utf-8 text at least ms uses a bom anyway. In sumarry while i can see where you are coming from i'm not sure how relavent this is given whats already been said. Plugwash 22:39, 6 October 2005 (UTC)
It was recently pointed out to me that using a UTF-8 BOM is the only way to force web browsers to interpret a document as UTF-8 when they've been set to ignore declared encodings. I haven't confirmed this, myself, but it sounds plausible. — mjb 00:15, 19 October 2005 (UTC)

[edit] xyz

I see that there are now not only x'es, but also y's and z's in the table. While it was difficult to understand before, it's now starting to be more confusing than informative. Could we possibly revert to only using x? Rōnin 19:38, 28 November 2005 (UTC)


I can see xyz in the table, but description in the table looks like "three x, eight x" for "00000zzz zxxxxxxx" UTF-16 entry, and in other places of the table - just x, no y nor z. —The preceding unsigned comment was added by 81.210.107.159 (talk • contribs) 2006-04-18 21:46:20 (UTC).

I for one am more confused after reading the table than before reading it. I used to think I had at least a vague idea of how it worked, but I'm back to square zero now. If things could be rephrased it might be easier to understand. -- magetoo 19:40, 21 May 2006 (UTC)

I replaced the z's in the first three rows of the table with x's, since the "z>0000" note was redundant with the hexadecimal ranges column. I couldn't decipher the fourth row of the table, so I left it alone. IMHO the table would make a lot more sense if the UTF-16 column was removed. -- surturz 06:15, 14 June 2006 (UTC)

I've removed the UTF-16 column and y's and z's. It now uses only x (and 0) for a bit marker. I think it is much clearer now, I hope everyone agrees. Surturz 06:43, 14 June 2006 (UTC)

Someone has put x's y's and z's back in the table, but in a much better manner. It looks good now. --Surturz 21:35, 7 October 2006 (UTC)

[edit] "This is an example of the NFD variant of text normalization"

did you actually compare the tables to check this was the case before adding the comment? —The preceding unsigned comment was added by Plugwash (talkcontribs).

no - the week before, I talked to the president of the Unicode Consortium, Mark Davis, and asked him whether there were any products that used NFD in practice. He said that MacOS X was the chief example of use of NFD in real products. --Alvestrand 21:08, 19 April 2006 (UTC)


[edit] Ideograms vs. logograms

I replaced the three occurences of 'ideogram' to their correct name, 'logogram', since clearly (and in one case definitely) CJK characters were ment, which are logographic, not ideographic. JAL - 13:48, 2006-06-06 (CET)

[edit] External links

The second UTF-8 test page is more comprehensive than the first, which is now unmaintained. Should we just remove the first?

[edit] conversion

is there any conversion tool from and to UTF-8 (e.g. U+91D1 <-> UTF-8 E98791)? Doing the binary acrobatics in my head takes too long. dab () 12:36, 24 July 2006 (UTC)

i find the bin to hex and hex to bin much easier, but that might be because i worked with chips. matter of practice. may be a small paper showing at least first 16 hex and their bin equiv, or, a txt file with link in desktop or quicklink, is another alternative. but a tool for that should perform better. ~Tarikash 05:41, 25 July 2006 (UTC).

um, I am familiar with hex<->bin, but honestly, how long does it take you to figure out 91D1->E98791? I suppose with some practice, it can be done in 20 seconds? 10 seconds if you're good? Now imagine you need to do dozens of these conversions? Will you do them all in your head, with a multiple-GHz machine sitting right in front of you? I'm all for mental agility, but that sounds silly. dab () 18:09, 29 July 2006 (UTC)

considering that utf-8 was designed to be trivial for computers to encode and decode it really shouldn't take any programmer that understands bit-ops more than a few minuites to code something like this from scratch. Even less to code it around an existing conversion library they were familiar with.
why exactly are you doing dozens of theese conversions in the first place? at that kind of numbers i'd be thinking of automating the whole task not using an app i had to manually paste numbers into. Plugwash 22:44, 29 July 2006 (UTC)

[edit] Clarify intro?

I had to read all the way down into the table in order to be sure we were talking about a binary format rather than a text format here. Is there a way of making the first two sentences clearer? By "binary format" I mean that characters apparently use the full range of ASCII characters, rather than just the 36 alphanumeric ones. Stevage 06:34, 16 August 2006 (UTC)

odd definition of text format you have there (btw ascii has 96 printable characters). Yes it does use byte values that aren't valid ascii to represent non ascii characters just like virtually every text encoding used today (cp850,windows-1252,ISO-8859-1 etc) does. Also note that UTF-8 is not a file format in its own right, you can have a dos style UTF-8 text file or a unix style UTF-8 text file or a file that uses UTF-8 for strings but a non text based structure. Plugwash 17:02, 16 August 2006 (UTC)
Ok, having read character encoding I understand better. Not sure how the intro to this article could be improved. Stevage 19:59, 16 August 2006 (UTC)
Yeah, the bit "the initial encoding of byte codes and character assignments for UTF-8 is coincident with ASCII (requiring little or no change for software that handles ASCII but preserves other values)." is either confusing, not English, or not correct, I'm not sure which. Is this better? : "a byte in the ASCII range (i.e. 0-127, AKA 0x00 to 0x7F) represents itself in UTF-8. (This partial backwards compatibility means that little or no change is required for software that handles ASCII but preserves other values to also support UTF-8). UTF-8 strings." It needs to be explained how it differs from punycode. --Elvey 16:36, 29 November 2006 (UTC)