Talk:Unicode


This is the talk page for discussing improvements to the Unicode article.


IPA and Unicode

Hi, Unicode 5.0 is now out. Does anyone know if the IPA character for the labiodental flap has been incorporated into the latest version of the Unicode standard? Thank you. --Kjoonlee 18:10, 6 December 2006 (UTC)

It hasn't been incorporated yet. SIL Corporate PUA Assignments says it's to be included in a later version, and Proposed New Characters: Pipeline Table mentions it's still in the pipeline. --Kjoonlee 16:25, 7 December 2006 (UTC)
It's in Unicode 5.1.0 now. --Kjoonlee 09:23, 9 April 2008 (UTC)

Email and Japanese

I don't know if this is a Unicode thing, but when somebody sends me Japanese characters I get stuff I can't read. If I forward that email to a system that uses that character set (e.g. a Japanese-version mobile phone), it still displays correctly. It would be good to add a link on the Unicode page to programs that convert these characters back to Japanese, and to a web-based solution too. 124.102.32.2 04:58, 27 January 2007 (UTC)

It's probably written in a standard your system isn't set up to read properly. 惑乱 分からん 16:10, 19 February 2007 (UTC)
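A rough illustration of the effect described above, assuming the message was sent in ISO-2022-JP (a common encoding for Japanese email; the actual encoding of the poster's mail is unknown, and the sample text is made up):

```python
# Hypothetical sample text; the contents of the original message are unknown.
text = "日本語のメール"

# The sender's system stores the text as bytes in a Japanese encoding.
raw = text.encode("iso2022_jp")

# A receiver that assumes a different encoding shows unreadable characters.
garbled = raw.decode("latin-1")

# The bytes themselves are intact, so a system that decodes them with the
# right encoding (for example after forwarding) recovers the original text.
recovered = raw.decode("iso2022_jp")

print(repr(garbled))
print(recovered)
```

The point of the sketch is that nothing is lost in transit: only the choice of decoder differs between the unreadable and the readable result.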

Normalization?

The article mentions normalization, but it doesn't explain what normalization is in this context.—Preceding unsigned comment added by 217.85.157.177 (talk)

Ah, er, ... no, it doesn't, does it? And it should, shouldn't it?
I'll create a section on normalization in a few days, unless someone beats me to it. Cheers, CWC(talk) 17:09, 16 March 2007 (UTC)
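For anyone reading this thread in the meantime, here is a minimal sketch of what normalization means in this context, using Python's standard `unicodedata` module. The helper name `same_text` is made up for illustration; the example only covers the canonical forms NFC and NFD.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both strings display as "é", yet they differ code point by code point.
raw_equal = composed == decomposed

# NFC composes sequences where possible; NFD decomposes them.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

def same_text(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(raw_equal, same_text(composed, decomposed))
```

Normalizing both sides to the same form before comparing is what lets software treat the two spellings of "é" as the same text.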


Weasel words in Issues section?

Do some of the descriptions in the Issues section sound like weasel words to anyone else? Specifically, I mean the phrases like "Some Japanese computer programmers object to Unicode" and (especially) "Some have decried Unicode as a plot against Asian cultures perpetrated by Westerners..." I had added the weasel tag but it was quickly removed by someone and I was cited for vandalism - I swear, I'm not trying to mess around with anything. But that section definitely has quite a bit of "some X say" and "it is claimed that," etc.

I don't actually know anything about the debates surrounding those issues themselves (I was just browsing to learn about Unicode), so I don't know how those phrases should be corrected, but my impression is that one can add a tag there to flag the section for other people who might know better how to clarify it?

Yishan 03:07, 20 March 2007 (UTC)

I've just edited that section to add some references, which might be helpful. There was some opposition to Unicode 5-10 years ago, mostly from Japan, but not much was written about it in English. Those statements ("Some Japanese computer programmers object to Unicode", "Some have decried Unicode as a plot against Asian cultures perpetrated by Westerners...") are a bit 'weaselly', but they're also perfectly accurate as far as I know, and they're probably the best we can do with English-language sources. I hope this helps, CWC 09:37, 20 March 2007 (UTC)
Were not some of the Tron links very explicitly against Unicode from a Japanese perspective? If someone wants to unweasel the text, I think some of the links at Han unification may help. Mlewan 11:50, 20 March 2007 (UTC)

Suggest merge

I suggest merging Unicode roadmap into this article. Does anyone oppose? -- Hello World! 09:47, 22 April 2007 (UTC)

  • Looks like a good move to me. CWC 18:03, 22 April 2007 (UTC)

Missing history

Unicode began in opposition to ISO 10646; later the two parties finally reached a consensus and merged into one. Does anyone know about that history? -- HenryLi (Talk) 18:55, 14 June 2007 (UTC)

There's a little at ISO 10646#History of ISO 10646. EdC (talk) 01:45, 27 December 2007 (UTC)
That section is short but very informative. Thanks, EdC. CWC 09:38, 28 December 2007 (UTC)

Clarification Needed: Code Value vs. Code Point

From the page on Aug 27th, 2007:

In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point [...]. In the other cases, each code point may be represented by a variable number of code values.

Could someone please clarify the distinction between a "code value" and a "code point"? Searching in the text does not clarify the difference. —Preceding unsigned comment added by 68.118.248.80 (talk) 05:54, August 27, 2007 (UTC)

I've added the following:
An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code values.
EdC 12:46, 27 August 2007 (UTC)
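A small sketch may make the distinction concrete. It takes one code point, U+1D11E (MUSICAL SYMBOL G CLEF), and lists the code values each encoding form uses for it. The helper `code_values` is made up for illustration, and the big-endian codec variants are chosen only so that no byte order mark is emitted.

```python
def code_values(ch, encoding, unit_bytes):
    """Hypothetical helper: split the encoded bytes into fixed-size code values."""
    data = ch.encode(encoding)
    return [data[i:i + unit_bytes].hex() for i in range(0, len(data), unit_bytes)]

clef = "\U0001D11E"  # a single code point outside the Basic Multilingual Plane

utf8 = code_values(clef, "utf-8", 1)       # 8-bit code values
utf16 = code_values(clef, "utf-16-be", 2)  # 16-bit code values
utf32 = code_values(clef, "utf-32-be", 4)  # 32-bit code values

print(len(utf8), utf8)    # four 8-bit values for one code point
print(len(utf16), utf16)  # two 16-bit values (a surrogate pair)
print(len(utf32), utf32)  # one 32-bit value holding the code point itself
```

The code point stays the same throughout; only the number and size of the code values that represent it change with the encoding.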

One-to-one

According to one-to-one, one-to-one means injective, which can be done between sets of different sizes. It is a bit ambiguous; I learned it as bijective. It shouldn't just be deleted, but I don't know that injective is clear enough to enough of our readers to be the right word to use here.--Prosfilaes 21:14, 30 August 2007 (UTC)
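To illustrate the two readings of "one-to-one", here is a minimal sketch with a made-up mapping between sets of different sizes. The function names and the sample data are assumptions for illustration only.

```python
def is_injective(mapping):
    """True if no two keys share the same value."""
    values = list(mapping.values())
    return len(set(values)) == len(values)

def is_bijective(mapping, codomain):
    """True if the mapping is injective and also covers every element of codomain."""
    return is_injective(mapping) and set(mapping.values()) == set(codomain)

# Hypothetical example: two items mapped into a set of three.
mapping = {"a": 1, "b": 2}
codomain = {1, 2, 3}

print(is_injective(mapping))            # distinct inputs give distinct outputs
print(is_bijective(mapping, codomain))  # but 3 is never reached
```

The mapping is one-to-one in the injective sense without being a bijection, which is the ambiguity raised above.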

Question on carriage returns / line feeds

Would some knowledgeable person be willing to comment on whether Unicode resolves the carriage return problem that exists between computer platforms (Apple, PC, Unix)? It was my understanding that Unicode allows a text file to be properly read on all platforms, but I don't know how this works. If Unicode does not resolve this problem, it would be helpful to state this explicitly and refer the reader to another article that addresses this issue. Regards, WWriter (talk) 17:55, 11 February 2008 (UTC)

A text file can be read on any platform, if the user has a program that handles any kind of carriage return. But no, the different standards still prevail. Besides, there is a big difference between different text files. A program that reads UTF-8 may not be able to read UTF-16, and so on. However, that confusion is not so much cross-platform as within each platform.
Is it really necessary to mention that in this article? There is some info at Newline#Unicode. One could have a reference to it, I guess, but from which section? Mlewan (talk) 19:37, 11 February 2008 (UTC)

Unicode does not resolve this issue any more than any other ASCII-based character set does. --JWB (talk) 21:50, 11 February 2008 (UTC)

Unicode would have made the CR/NL problem even more complicated by adding U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR; however, nobody is using these two characters. — Monedula (talk) 16:09, 12 February 2008 (UTC)

It has been a while since this question was posed, but I think the responses do not really answer the question directly. First, it is important to understand that before Unicode, different OSs and platforms often relied on their own character encodings (apparently finding the ISO standard encodings/character sets inadequate). So independent of the new line issue, Unicode does (to the extent that it is being adopted) supersede all of these other character sets / encodings and therefore provides a solution for allowing text files to be read on any platform.
In terms of the new line, Unicode did introduce U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR, as Monedula mentioned. This was an attempt to provide a Unicode solution for semantically encoding paragraphs and lines, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform-dependent solutions. As Monedula also pointed out, though, few if any Unicode implementations have adopted these line and paragraph separators as the sole canonical line-ending characters. A common approach to solving this issue is new line normalization. This is done in the Cocoa text system in Mac OS X and also in the W3C XML and HTML recommendations. In this approach every possible new line character is internally converted to a common new line (which one doesn't really matter, since it's an internal operation just for rendering). In other words, regardless of how the line ending is encoded in the text, the text system can treat it as a new line. I hope that clarifies things a bit. Indexheavy (talk) 21:38, 20 April 2008 (UTC)
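The new line normalization approach can be sketched roughly as follows. This is not how Cocoa or any XML parser actually implements it; the function name, the choice of LF as the internal form, and the exact set of characters treated as line endings (CRLF, CR, LF, NEL, U+2028, U+2029) are assumptions for illustration.

```python
import re

# Assumed set of line endings. CRLF is listed first so that the pair is
# replaced as a unit rather than as two separate line breaks.
_NEWLINES = re.compile("\r\n|[\r\n\x85\u2028\u2029]")

def normalize_newlines(text):
    """Hypothetical helper: convert every recognized line ending to LF."""
    return _NEWLINES.sub("\n", text)

mixed = "dos\r\nold mac\runix\nline sep\u2028para sep\u2029end"
print(normalize_newlines(mixed).split("\n"))
```

Note that folding U+2029 into a plain line break discards the line/paragraph distinction, so this only suits the internal, rendering-only use described above.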

Clarification section

Somewhere, very early in the article, the following terms must be explained:

  • Rows,
  • Blocks,
  • Planes;

and their interrelationships. I presume glyph and character come that early, but I've not looked. Said: Rursus 09:07, 14 May 2008 (UTC)

I agree. The article is not very clear on many key concepts, and needs some attention. I have added a new section on Architecture and Terminology near the beginning which addresses this issue--I hope that it is not too much detail. BabelStone (talk) 22:27, 21 May 2008 (UTC)
Thanks. I think that's a big improvement. I've long felt this article presented a largely skewed view of Unicode: focussing too much on the Unicode transformation formats and other peripheral issues and not enough on the central topic of the assignment of characters to code points, collation and other algorithms, etc. Your new section goes a long way toward improving that. Indexheavy (talk) 23:25, 21 May 2008 (UTC)
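As a rough aid for the three terms listed above, here is a sketch of how they relate arithmetically. It assumes the usual layout in which a plane holds 65,536 code points and a row is a run of 256 code points within a plane. Blocks are named ranges that do not follow from arithmetic, so the tiny `SAMPLE_BLOCKS` table is a hand-picked illustrative subset and the function names are made up.

```python
def plane(cp):
    """Plane number: each plane covers 0x10000 code points."""
    return cp >> 16

def row(cp):
    """Row within the plane: each row covers 0x100 code points."""
    return (cp >> 8) & 0xFF

def cell(cp):
    """Position within the row."""
    return cp & 0xFF

# Illustrative subset only; the real standard defines many more blocks.
SAMPLE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def block_of(cp):
    """Return the block name from SAMPLE_BLOCKS, or None if not listed."""
    for start, end, name in SAMPLE_BLOCKS:
        if start <= cp <= end:
            return name
    return None

cp = ord("Ж")  # U+0416
print(plane(cp), hex(row(cp)), hex(cell(cp)), block_of(cp))
```

A block may span several rows (CJK Unified Ideographs does) or only part of one (Basic Latin is half of row 0), which is why blocks need a lookup table rather than a formula.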

Image of the book restored

I undid an edit with the edit message "remove fair use image from unicode. That is an image of a *book* which while its about unicode, does not partain to the article and adds nothing to the article." This book isn't about Unicode, this book is Unicode; the formal definition of "the Unicode Standard version 5.0" is "what's published in this book". --Alvestrand (talk) 05:46, 19 May 2008 (UTC)

Ah drat, did not see this comment, see mine below ;). —— nixeagle 20:06, 21 May 2008 (UTC)

Fair use image

Noted a few things in my comment on removal of the image. That image fails to provide a fair use rationale as required by our non-free image policy. As part of that, it needs to be discussed in the article, not used as a pretty picture. If the book is actually discussed, that at least gives a case for leaving the image in. This is Unicode: http://www.unicode.org/versions/Unicode5.1.0/. The book is actually out of date by a version. Do we still wish to include non-free content? Better images to include would be examples of what Unicode looks like. Those can be free. -- nixeagle 20:06, 21 May 2008 (UTC)

The book is spoken about in the article as it IS the latest published version of the Unicode Standard. It is not out of date by a version, as the book is only published when there is a major update to the standard (i.e. version 2.0, 3.0, 4.0, 5.0). Unicode 5.1 is a minor update, and so there is no new book associated with it. If you go to http://www.unicode.org/versions/Unicode5.1.0/ you will see that the PDF files linked to on the left are those for version 5.0, i.e. the 5.0 book. Quoting from that page: "Version 5.1.0 of the Unicode Standard consists of the book publication (The Unicode Standard, Version 5.0), as amended by this specification, together with the 5.1.0 Unicode Standard Annexes and the 5.1.0 Unicode Character Database (UCD)." The image should be reinstated. BabelStone (talk) 21:51, 21 May 2008 (UTC)
The lack of a non-free-use rationale was easily rectified; I'm not surprised that Everson didn't have the patience to figure out how to navigate the labyrinth of Wikipedia's required justifications. FWIW, the Unicode standard is NOT based on RFCs. RFCs are IETF publications; the Unicode standard is created by the Unicode Consortium. --Alvestrand (talk) 13:32, 22 May 2008 (UTC)