Talk:Mapping of Unicode characters

This article falls within the scope of WikiProject Writing systems, a WikiProject interested in improving the encyclopaedic coverage and content of articles relating to writing systems on Wikipedia. If you would like to help out, you are welcome to drop by the project page and/or leave a query at the project’s talk page.
This article has not yet been assigned a rating on the Project’s quality scale.
This article has not yet been assigned a rating on the Project’s importance scale.

I added the summary/categorized table of the UCS as I said I would on the UCS discussion page. I think if anyone feels the table should be narrowed, the decimal start and end could be omitted without much loss of readability. As I said in edit summaries and over at the UCS discussion page, I'd like to also add links from within the table to sections of the Mapping of Unicode characters article and other articles too. I think each of the broad categories (lettered A through N) should be discussed in this article. Then links from each script-block could go to the article on the script or to an article on the script in Unicode/UCS.

So I plan to add the following sections to this article:

  • Scripts (Modern and Ancient)
  • Phonetics
  • Unified Diacritics
  • Unified Punctuation
  • Symbols
  • Numerals
  • Musical Notation
  • CJK and Unihan
  • Compatibility characters (legacy and others) and normalization
  • Control characters, format characters and variation selectors
  • Surrogates
  • Private Use Code Points

Anyone else is welcome to jump in on these tasks. --Indexheavy 01:16, 25 April 2007 (UTC)

On the charge of editorializing

I can understand how you might read it that way in isolation. I'm not trying to editorialize so much as help make the distinction between semantic characters and glyphs clearer. Many people cite it as a mantra, but don't necessarily understand it (now I'm editorializing). The point of this section (and of what I plan to add to the linked main article) is to show that UCS lists the characters according to their glyph names. Meanwhile Unicode adds alias names that try to get at the phoneme semantics. Right now it's a hybrid that helps serve as an excellent example of this distinction so often cited (numerals too, though less so). I hope that makes it clearer what I'm trying to do there. In the past many of these articles have simply been long lists of Unicode characters (at one point there was a single article devoted to every character). I didn't find that very encyclopedic. I think here at Wikipedia we serve readers better by expositing and providing examples and fleshing out these categories and expositing on some of the idiosyncratic characters (like the phoneme characters). --Indexheavy 02:55, 30 April 2007 (UTC)
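To make the naming point concrete, here is a minimal sketch, assuming Python's standard `unicodedata` module (the names it reports depend on the Unicode version bundled with the interpreter). The four code points are my own picks for illustration and are not taken from the article's table. Two of the names describe a glyph shape and two describe a sound.

```python
import unicodedata

# Illustrative code points only (an assumption of this sketch, not from the article).
SAMPLES = ["\u0251", "\u0279", "\u0294", "\u01C0"]

def character_names(chars):
    """Map each character to its formal Unicode character name."""
    return {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in chars}

names = character_names(SAMPLES)
for code_point, name in names.items():
    print(code_point, name)
```

ALPHA and TURNED R name the letter shape, while GLOTTAL STOP and DENTAL CLICK name the sound, which is the kind of hybrid naming discussed above.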

BTW, perhaps I'm not understanding correctly what you thought was editorializing. Please respond here to clarify. Indexheavy 02:56, 30 April 2007 (UTC)

The problem with the Unicode consortium is that they seem to think their character names are self-explanatory. Except in some cases where they for some reason or other feel disposed to add a gloss. This is a problem (the recent addition of the cuneiform range really drove this home to the point of ridicule), and should be duly discussed, citing notable sources. But so far it is your choice to give such weight to the character name. A name is just that: a unique tag for a codepoint. The actual reason for encoding a character is buried in proposals somewhere. Thus, for a sourced discussion of why a character was encoded and not another, you have to dig up these proposals, study them, and quote from them. Just drawing your own conclusions from the names in the character charts is not helpful and violates WP:OR and specifically WP:SYN. dab (𒁳) 11:43, 2 May 2007 (UTC)

For some reason I missed your comment here until now. I wasn't ignoring you, I just didn't see it. I'm sure there are all sorts of interesting stories, behind-the-scenes disputes and whatnot surrounding Unicode and the UCS. I'm not trying to write about that (nor do I have any expertise or sources on it). I'm trying to write from the Unicode Standard and the other publications of the Unicode consortium on their rendition of the "mapping of Unicode characters". You're accusing me of violating WP:OR, yet I say again, I'm the only one who has added a reference to this article. I understand I could use some more specific references, but it's quite disingenuous to accuse me of OR when not a single reference existed for this article until I began my edits. Secondly, on the charge of violating WP:SYN, I'm drawing only from the Unicode Standard (which is what I'm most familiar with) and not synthesizing from multiple sources as the policy outlines. I'm also not trying to advance a position. Perhaps if you told me what position you fear I'm advancing we could clear the air and I could try to avoid that misperception as I draft and redraft my material. Indexheavy 09:59, 9 May 2007 (UTC)

Indexheavy

Indexheavy, before you continue "overhauling" this article, may I ask you to cite your sources. Your "semantic phonemes" and "semantic characters" etc., while well-meant, simply add to the confusion (as I argued here). You want to "help make the distinction between semantic characters and glyphs clearer". I appreciate the thought, but at present you are not exactly helping. First of all, show that your usage of "semantic character" (as opposed to simple "character") is in any way endorsed by Unicode. Unless you do that, I'm afraid we'll have to deep revert to April 25. Thanks. dab (𒁳) 11:37, 2 May 2007 (UTC)

Just to help you understand where my terminology comes from, here's a useful quote from The Unicode Standard 5.0 (p15): "The Unicode Standard draws a distinction between characters, which are the smallest components of written language that have semantic value, and glyphs, which represent the shapes that characters can have when they are rendered or displayed". For Unicode (especially in contrast to ISO and the UCS without Unicode), many of the compatibility characters (like the Arabic initial, isolated, medial and final forms) are redundant. They are character encoding forms and not simply the character as "the smallest components of written language that have semantic value" but rather characters that encode a specific abstract glyph for another character. Anything could be encoded as a character. For example, one could designate that code point U+E0FFA will be the letter 'g' from Linotype's Times Roman font version 2.3 released in 1992 (the dingbat characters are a similar example acknowledged by the Unicode Standard). However, these are not examples of semantic characters, but rather characters that encode glyphs. Unicode’s approach in contrast involves moving the handling of these forms/variants to smart font technology and smart text rendering. These are distinctions made in the Unicode Standard: distinctions I'm trying to explain to a general reader in an encyclopedic manner. I feel a bit like I'm taking shots in the dark here. I'm having a hard time understanding how you read the Unicode Standard. But I'm trying to find ways to begin the conversation. Please let me know how you might rephrase some of my prose. In doing that we might start to understand the different readings. Indexheavy 11:17, 9 May 2007 (UTC)
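The redundancy of the Arabic presentation forms can be checked against the Unicode Character Database. A minimal sketch, assuming Python's standard `unicodedata` module; the choice of BEH as the example letter is mine and is not taken from the Standard's text quoted above. The isolated form carries a compatibility decomposition to the plain letter, so compatibility normalization (NFKC) folds it away while canonical normalization (NFC) leaves it alone.

```python
import unicodedata

# Example letter chosen for illustration (an assumption of this sketch).
beh_isolated = "\uFE8F"  # ARABIC LETTER BEH ISOLATED FORM, a compatibility character
beh = "\u0628"           # ARABIC LETTER BEH, the ordinary character

# The database records which glyph variant the compatibility character stands for.
decomposition = unicodedata.decomposition(beh_isolated)

# NFKC replaces the presentation form with the plain letter; NFC does not.
folded = unicodedata.normalize("NFKC", beh_isolated)
canonical = unicodedata.normalize("NFC", beh_isolated)

print(decomposition)
print(folded == beh, canonical == beh_isolated)
```

After NFKC, choosing the initial, medial, final or isolated shape is left to the font and the text renderer, which is the division of labour described in the comment above.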

Longest page on the English Wikipedia

According to this: special:longpages, this page is the longest page on the English Wikipedia, at 688,000 bytes. Either this number is bogus, or this page will take a very long time to load on a low-speed link.

It appears that the HTML table was generated by a word processor. Please consider using a Wiki table or at least a better HTML editor. Thanks. -Arch dude 23:26, 3 May 2007 (UTC)

Please lend a hand in improving the table. The conversion to a wikitable might help, but it's largely a false efficiency. The wikitable still needs to be converted to an HTML table when it's delivered, so everything gained in the compact wikitable syntax is lost upon delivery (keep in mind the size of the article in that list is the storage size, not necessarily the delivery size; when the table's delivered the "_" and "|" characters are replaced with complete "<tr></tr>" and "<td></td>" syntax). The wikitable gains other efficiencies by simply disallowing much of the HTML table semantics. The table was largely generated by hand (not by a word processor). Unfortunately, many of the styles had to be added in-line because Wikipedia doesn't support embedded or linked stylesheets for table styling (which would make the total size considerably smaller). If anyone wants to reduce the size of the table, the styles could probably be handled in some other way (I'm not familiar enough with wiki styling conventions). Also it could be reduced by removing the tooltips, but I think they're quite helpful. Finally, it might make sense to move many of the table details off to separate articles once they're created. Then simply a summary table of the individual tables could appear on this page. So in summary:
  • converting to a Wikitable (not much gained)
  • Changing the styling (borders, cell horizontal and vertical alignments) to another syntax
  • removing or shortening tooltips (these are repeated for every cell with a lengthy phrase)
  • breaking the table out into the separate related articles (I'll probably do this once I stabilize the table and finish the detailed articles).
My goal here was to create a drill-down type group of articles, where one could start at this article and see how the various Unicode Planes and Blocks were grouped together and then follow through to see more detail on each block/script/character general category. So moving the tables to other articles would be consistent with that drill-down approach. Indexheavy 02:26, 4 May 2007 (UTC)

I shortened the titles (tooltips) considerably. I also removed most of the inline styles on the table cells. It still doesn't quite look the way I want it to, but it's readable and it looks decent (oh if only the wikimedia software developers would enter the 21st century). The class and title attributes could probably be eliminated completely if we need to make it smaller. However, the steps I already took get us out of the top 15 articles so maybe we're off the radar now. I do think that breaking the detailed tables off into separate articles makes a lot of sense, so this article could be reduced substantially that way (in time anyway). 04:10, 4 May 2007 (UTC)

Thanks for considering all of the options. You are clearly on top of the situation. If you intend to subdivide the article eventually, may I recommend that you avoid all of the intermediate steps? There are no rules, and I'm just another editor with an opinion, but as you point out, many of the gains are either trivial or bogus. The big win occurs when you split the article. I therefore propose that we live with it as it is until you are prepared to split it. I am not competent to help much. Best of luck on this, and keep up the good work! -Arch dude 13:23, 4 May 2007 (UTC)

Phonetic characters

The section on phonetic characters makes very idiosyncratic use of the word "phoneme". I can kind of guess what is meant, but I think that phone would be more appropriate. A phoneme is a language-particular unit, which is defined by its opposition to other phonemes. As such, a phoneme is a logical unit, which has no direct relation to the physical world. One can basically represent a phoneme by any string one likes best (although for mnemonic reasons, certain strings are of course better than others). A phone on the other hand is observable in the physical world and does have acoustic properties. IPA characters are used to refer to phones. Their representation is not arbitrary. This is probably what is intended by "common phoneme semantics", and I suggest that this be renamed to "underlying acoustic properties" or something like that. Jasy jatere 08:44, 10 May 2007 (UTC)

I see nothing wrong with the change you propose making: though I'm having trouble seeing how it fits with the phoneme article you link to. For example, would you say that a "bilabial plosive" was a phoneme or a phone? It is that type of semantics the passage refers to.
There's a second distinction you seem to be making that I'm not clear on too: that between the "strings" used to represent a phoneme, and the "characters" used to refer to phones. Could you provide some examples of what you mean there? Just to provide some clarification from the computing end, in relation to Unicode a string is an ordered collection (an array or list) of characters (or graphemes). On the other hand characters are the "smallest components of written language". So perhaps you were using strings and characters somewhat interchangeably, but I thought maybe there was another distinction there that I wasn't comprehending. Indexheavy 19:42, 10 May 2007 (UTC)
There is one other distinction that I should add to help facilitate communication across these disciplinary boundaries. In relation to the definitions of string and character that I describe above and the distinctions you seem to be making, a glyph (or glyphs) is (are) the visual representation of a character (or character combination). Basically it is the picture or image that text software uses to visually display the character(s). So, for example, the same glyph as that used for Latin small letter 'b' (i.e., a picture of a small Latin letter 'b') might be used as a glyph for the character 'bilabial voiced plosive' (hypothetically speaking, since there is not a character named 'bilabial voiced plosive' though there are similarly named characters). In this case a bilabial plosive would be a single character. In contrast, a character set could encode a character 'bilabial', a character 'voiced' and a character 'plosive'. In this case, the character combination 'bilabial' + 'plosive' + 'voiced' might be represented by a glyph that was identical (or similar) to the Latin small letter 'b'. In this case three characters map to a single glyph. Another approach a character set might take would be to not encode characters for a phonetic writing system at all. Instead, the phonetic writing system would pick and choose characters from other writing systems within the character set based on the glyphs typically used for those characters and make use of those for the phonetic writing system. To me this would be the equivalent of not encoding a Greek writing system and instead writing the Greek language by borrowing characters with similar looking glyphs from Latin, Cyrillic and mathematical symbols. In many ways we have this very same situation with phonetic writing systems. Indexheavy 20:02, 10 May 2007 (UTC)
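The string, character and glyph distinction can be illustrated with a real (non-hypothetical) case. This is only a sketch, assuming Python's standard `unicodedata` module; the example uses 'e' with a combining acute accent rather than any phonetic character, and how many glyphs actually appear on screen depends on the font and renderer.

```python
import unicodedata

# A string is an ordered sequence of characters (code points).
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed character

# Two characters in one string, one character in the other; a renderer would
# normally draw both strings with the same single accented-e glyph.
print([unicodedata.name(ch) for ch in decomposed])
print([unicodedata.name(ch) for ch in composed])
print(len(decomposed), len(composed))
```

This is the same many-characters-to-one-glyph relationship as the hypothetical 'bilabial' + 'plosive' + 'voiced' combination, only with characters that really exist.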

2^20 + 2^16 ?

The article's first sentence states the total number of code points:

1,114,112 = 2^20 + 2^16 or 17 × 2^16

The second explanation is easy to visualize: 2^16 thingies per plane times 17 planes. But I find the first one confusing. I know it's correct (2^20 = 2^4 × 2^16 = 16 × 2^16), but why obscure the number this way? Where does that variant come from? --193.99.145.162 17:04, 27 June 2007 (UTC)
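One plausible reading of the first form, offered here as an assumption since the article does not say where it comes from, is the UTF-16 view: 2^16 code points in the Basic Multilingual Plane plus the 2^20 code points reachable through surrogate pairs (1,024 high surrogates times 1,024 low surrogates). A short sketch of the arithmetic, with variable names invented for illustration:

```python
# Basic Multilingual Plane: code points U+0000..U+FFFF.
bmp = 2**16

# UTF-16 surrogate ranges; each pair addresses one supplementary code point.
high_surrogates = 0xDBFF - 0xD800 + 1  # 1024
low_surrogates = 0xDFFF - 0xDC00 + 1   # 1024
supplementary = high_surrogates * low_surrogates  # 2**20

total = bmp + supplementary
print(total, total == 17 * 2**16, total == 0x10FFFF + 1)
```

Both forms give the same total; the 17-plane form is just the 16 supplementary planes (2^20 = 16 × 2^16) counted together with the one base plane.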

Lepcha?

The article claims that the Lepcha script (1C00-1C4F) is part of Unicode 5.0. It isn't.


Splitting

I'm proposing that this article be split into 5 sub articles, along the first 5 main entries on the TOC. Mbisanz (talk) 23:33, 22 November 2007 (UTC)

OK, it's been about 12 days and no comments, so I'm gonna begin to split it later tonight. Remember, it can always be rolled back if it turns out this is counter to consensus. Mbisanz (talk) 03:55, 5 December 2007 (UTC)
I just noticed that the article split resulted in several "Big - see Large / Large - see Big" type situations. The main article refers to the five sub-articles as "main article" and three of them return the favor (the other two seem to have been orphaned). I do not have the expertise to tell exactly what happened, but I read through WP:SS and I am pretty sure this is SNAFU. Cawifre (talk) 21:00, 26 May 2008 (UTC)