User:Shlomital
Shlomi Tal is a student of Arabic and Islam, and of linguistics, at Bar-Ilan University. He is also a computer cognoscente with an inordinate interest in file formats and multimedia applications, and in the interface between the real (analogue) world and the digital one in general, but with zero knowledge of, or interest in, programming.
What’s the deal?
For anyone who should ask, “didn’t you put up a notice about leaving Wikipedia for good, and subsequently request that your user page be deleted?”, the thing is this: I’m currently suffering from academic burnout, the weariness with verbal activity that sets in after hours of such activity at the university every day. As a programmer friend of mine put it quite a few years ago, when I talked to him about computer things outside his work hours: your line of work or study may be something you’re extremely enthusiastic about during its dedicated hours, but outside that time frame you’re filled with distaste at the mere thought of it. I didn’t understand it back then, but I do now. So I don’t feel I can spend time writing new articles, editing existing ones or, least of all, engaging in academic disputes (sourcing, logical fallacies and all that stuff) with people on the talk pages. I’m out of that for I don’t know how long.
Still, there’s more to contributing to Wikipedia than writing things. You’ll find new, non-verbal contributions of mine on the Wikimedia Commons (or on Wikipedia itself in the case of fair-use materials), with edits to existing pages to link to them where useful.
I apologise for the all too hasty packing up. There were two useful tutorials I’d written which were deleted at my request, so I’m restoring them here, below my Wikipedia Rejoinder. Please use my talk page only for requests for, and comments on, media.
Wikipedia Rejoinder
“Everything you read on Wikipedia should be taken with a grain of salt.”
There, fixed that for you.
Hebrew Computing Primer
I consider myself an expert when it comes to Hebrew computing. I have never programmed Hebrew support systems, nor am I old enough to have worked with Old Code, yet I’ve encountered all the variants of Hebrew encodings and storage methods in the two decades or so I’ve been using computers, and I know the theory behind them and how to deal with them in practice. This is a primer for anyone who’s interested in the details.
Specialties
Hebrew computing has two specialties: additional characters (which any character set beyond ASCII has) and bidirectionality. For any undertaking of Hebrew support, the additional characters must be provided by an encoding, and the writing mode must be accommodated by an I/O system. I will now elaborate on the peculiarities of the writing mode and of storage, as they are the more challenging properties.
Writing Hebrew
For English the writing mode is left to right, top to bottom: the pen starts at top left and moves right constantly until the line ends, and then it moves down one line and back to the far left. For Hebrew the writing mode is in general right to left, top to bottom: the pen starts at top right and moves left constantly until the line ends, and then it moves down one line and back to the far right. The “in general” is crucial: any use of digits (0–9) or Western-script text (usually English nowadays) induces a left-to-right direction on the same line: the pen starts at the right and moves left, but in the middle of the line it switches to moving left to right, and after that it goes back to moving right to left. This feature, bidirectionality, complicates the writing of Hebrew (and Arabic) and needs special handling for both input and output.
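As a quick illustration (my own, not part of any of the systems described here): the Unicode character database records a directionality class for every character, and that is what a modern display system consults when reordering a mixed line. In Python:

import unicodedata

# Directionality classes from the Unicode character database:
# 'R' = right-to-left, 'L' = left-to-right, 'EN' = European number.
for ch in ("א", "a", "1"):
    print(ch, unicodedata.bidirectional(ch))
# prints: א R, then a L, then 1 EN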
For input, Hebrew typewriters forced the typist to key in digits and English text in reverse order. When computers came, there was room for a little more sophistication. Early on in Hebrew computing, the typist could type Hebrew text normally in right-to-left typing mode, and when he or she had to type digits or English, the characters appeared in what is called “push mode”: the cursor stays in the same position, and the characters flow out of it until reversion to Hebrew. This hasn’t changed over the ages: the most modern systems, like Windows XP, GNOME and Mac OS X, still offer the same choice of a general writing mode, with push mode for characters of the other directionality.
Storing Hebrew
Storage, however, is a different matter. After typing and saving the file, there had to be a way of storing the Hebrew text. Here there is a sharp demarcation line between old systems and new ones: old systems stored Hebrew text in visual order, while new ones store it in logical order.
Visual order is the order the eyes see, relative to a general directionality. If the general directionality of the text is left to right, then visually ordered Hebrew is stored as the eyes read it left to right (the opposite of normal typing and reading order). For example, in the word שלום the eye sees ם (final mem!) first, so in visual Hebrew this ם will be stored first, and the ש last. This makes text operations (sorting, editing and linebreaking, to name a few) quite difficult; there is only one reason why this unnatural storage mode was adopted at all, and I will explain it later on.
Logical order, in contrast, is the order that the fingers type and the eyes read naturally. No matter the general directionality of the text, in the word שלום the letter ש is always typed and read first. In logical Hebrew, therefore, the letter ש will also be stored first. This is the natural order, making computer handling of Hebrew text as easy and feature-rich as that of English text. However, it does have one snag: for proper display of the Hebrew text, the display system must be programmed to output the Hebrew letters right to left.
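To make the contrast concrete, here is a minimal sketch in Python (my own illustration). For a run of pure Hebrew, converting logical order to visual order is a simple reversal; mixed text with digits or Latin script needs the full Unicode bidirectional algorithm and is not this simple:

word = "שלום"           # stored logically: ש is the first element
print(word[0])          # prints ש, the letter typed and read first
visual = word[::-1]     # naive logical-to-visual conversion for a pure RTL run
print(visual[0])        # prints ם, the letter the eye meets first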
Now that is the reason why visual Hebrew can be seen in all early Hebrew computing systems. For visual Hebrew to work, an encoding and an input mechanism are needed, but the display mechanism can be left unchanged from that which suffices for English text. Visual Hebrew, in essence, can be called a kludge, a practice of retrofitting: the easy addition of Hebrew support onto systems that were never designed for Hebrew. Logical order, on the other hand, is a ground-up affair, carried out for new systems, where resources are more plentiful. So on old Hebrew computers (IBM mainframes, old Unix workstations) and platforms (the Hebrew Internet until the advent of Internet Explorer 5 in 2000) we find visual Hebrew, while in the newer settings (Microsoft Windows 3.0 and onwards, KDE and GNOME, the Hebrew Internet nowadays) logical Hebrew is the norm.
Hebrew storage is a detail as relevant as the encoding. For Cyrillic text, for instance, it would be enough to state the encoding (ISO 8859-5, KOI-8, Windows-1251 etc), but for Hebrew text one has to know its storage method as well as its encoding. Two of the Hebrew encodings come in either storage method.
Encoding
The Hebrew block in Unicode, hexadecimal 0590–05FF, is a small one. The repertoire of Hebrew is not demanding; for day to day use, the 27 letter-forms (א to ת including final forms) are sufficient. Vowel-points (nikkud) are nice to have but not necessary. Cantillation marks for the Hebrew Bible can be left to encyclopaedic character sets like Unicode. The first Hebrew encodings, therefore, were concerned with the mapping of 27 Hebrew characters.
The first attempt at a Hebrew encoding is called Old Code. Old Code is not a single encoding, nor is New Code. Old Code can be a 6-bit BCD encoding used on an IBM mainframe, or it can be 7-bit Hebrew (an ISO 646 variant), and New Code can be DOS Hebrew (CP862), ISO 8859-8, Windows-1255 or Unicode Hebrew. What is meant by Old Code is this: the Hebrew alphabet is mapped in place of the Latin letters (usually the lowercase ones) and one additional symbol. New Code means that the Hebrew alphabet has separate positions of its own, not trampling on the Latin-script repertoire. In the most common form of Old Code, 7-bit Hebrew (Israeli standard SI 960), the Hebrew alphabet occupied the positions 60–7A, on top of the lowercase Latin letters, with א on the grave accent. On IBM mainframes the arrangement would be the same: ב to ת on top of Latin (lowercase if both cases existed), א on some other character, usually & or @.
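A sketch of that mapping in Python (my own illustration of SI 960 as just described; it only translates the characters and says nothing about visual versus logical order):

# SI 960: א–ת occupy 0x60–0x7A, i.e. the grave accent and a–z.
# Unicode puts the same 27 letters at U+05D0–U+05EA.
def decode_si960(data: bytes) -> str:
    return "".join(
        chr(0x05D0 + b - 0x60) if 0x60 <= b <= 0x7A else chr(b)
        for b in data
    )

print(decode_si960(b"ylem"))  # prints שלום; in English mode the same bytes read “ylem”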
Old Code Hebrew required two display modes for the terminal: English mode and Hebrew mode. The switch determined what the lowercase Latin positions represented on the display. Hebrew text in English mode would be a jumble of lowercase Latin letters, and lowercase English in Hebrew mode would be a jumble of Hebrew letters, with only the human operator to tell which was meant. That inconvenience was the chief reason for transitioning to the New Code forms of Hebrew.
The ISO 8859 standards had a simple policy of transitioning from 7-bit to 8-bit character sets: just take the non-ASCII characters and set the high bit on them. So ISO 8859-8, the Hebrew part of the standard, had the Hebrew alphabet in positions E0–FA. Microsoft, luckily, followed the lead in its Windows-1255 codepage and encoded the Hebrew letters in the same positions, adding vowel-points in lower positions (I say “luckily” because the Windows codepages aren’t always compatible with the ISO standards—compare ISO 8859-6 and Windows-1256 for Arabic). Hebrew on the Macintosh too follows the ISO ordering of the letters, and adds vowel-points in lower positions, but different from those of Windows-1255. Unix adopted the ISO standard for Hebrew. The only odd man out for New Code Hebrew was DOS, using CP862, where the Hebrew letters (like the ISO standard, without vowel-points) were mapped to positions 80–9A.
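These position differences are easy to see from the codecs bundled with Python (again my own illustration):

# The same 27 letters sit at different byte positions in each New Code set:
print(bytes(range(0xE0, 0xFB)).decode("iso8859_8"))  # א–ת at E0–FA in ISO 8859-8
print(bytes(range(0xE0, 0xFB)).decode("cp1255"))     # same positions in Windows-1255
print(bytes(range(0x80, 0x9B)).decode("cp862"))      # same letters at 80–9A in DOS CP862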
Finally, the last word on Hebrew encoding is, of course, the Unicode standard. Positions 05D0–05EA have the letters, and lower positions have vowel-points and cantillation marks, enough for digitising any kind of Hebrew text.
History
I said both encoding and storage method had to be taken into account when speaking about Hebrew computing. Here now is the history of Hebrew computing:
- Old Code Hebrew in visual order for mainframes and early workstations. Still in use in legacy environments, much as EBCDIC is, and can also be seen on Israeli teletext (there’s a button on the remote control that switches between the two display modes, and if you switch it to English mode you can see a mess of lowercase Latin letters).
- CP862 Hebrew in visual order for DOS. Visual order because DOS had no bidirectional display mechanism. However, a few applications were programmed with their own bidirectional display mechanism, and stored Hebrew in logical order. So CP862 Hebrew can be either visual or logical. Those who have ever tried to pass between the QText and EinsteinWriter word processors (the former visual, the latter logical) know what a headache it could be.
- ISO 8859-8 Hebrew in visual order for Unix workstations and for the early Hebrew Internet. Various attempts to fit bidirectional support for Hebrew on Unix were carried out, most notably El-Mar’s system for Motif, but, like so many issues in the Unix world, integrated support had to wait until the free software world (GNU/Linux) supplied the goods. The Hebrew Internet was in visual order until Microsoft launched Internet Explorer 5, which featured bidirectional support without regard to the operating system (earlier versions of IE supported Hebrew only on Hebrew Microsoft operating systems). Mozilla came soon afterwards, and nowadays visual Hebrew on the Internet is dead.
- ISO 8859-8 Hebrew in logical order (ISO-8859-8-I at IANA) for Unix and the Internet. It never gained as much acceptance as Windows-1255, as it had no provision for vowel-points, and Unicode has finally made it obsolete.
- Windows CP1255 Hebrew in logical order for Microsoft Windows operating systems from 3.0 onwards. This brought logical Hebrew to the mainstream, and despite a few setbacks resulting from changes in the bidirectional algorithm (such as the display of the hyphen-minus), Israeli computer users learned not to settle for less. In the early days of the Internet, the fact that the OS used logical Hebrew while the Web used visual Hebrew was a source of much frustration, as anyone who pasted Hebrew text from a website into Notepad or Word back then knows.
- Unicode Hebrew in logical order as the encoding and storage method to end all encodings and storage methods. All operating systems and web browsers support it today. Windows-1255 still lingers on, but it will be the last to die out; all the others are already obsolete.
For Wikipedia
The above is a summary of my knowledge of Hebrew computing. It is too specialised, and perhaps too long, for a Wikipedia article of its own, or so I feel, but as is the norm on Wikipedia, anyone is free to take it, or bits and pieces of it, and incorporate it into an existing article. Or just save it for reference. Like all my writings on Wikipedia, the GFDL applies.
Adding iTXt chunks with pngcrush
Here is how to add iTXt (international text, in Unicode UTF-8) chunks to PNG files using pngcrush. Why is this tutorial necessary? Because the current documentation is misleading (I’ve contacted the author about it and he says he’ll fix it in the next version), and it took me a long time, with a lot of trial and error, until I finally found out how to do it. This tutorial is meant to spare anyone else the trouble.
Writing a tEXt chunk (limited to ISO Latin-1) is just as the documentation says:
-text b[efore_IDAT]|a[fter_IDAT] "keyword" "text"
The documentation would have you believe that writing an iTXt chunk is the same:
-itxt b[efore_IDAT]|a[fter_IDAT] "keyword" "text"
But that is not the case. To write an iTXt chunk, pngcrush requires four parameters, not two. The correct help would look like this:
-itxt b[efore_IDAT]|a[fter_IDAT] "keyword" "language code" "translated keyword" "text"
Only the keyword is required, as in tEXt, but four values must be passed to pngcrush, otherwise the program terminates without doing anything (which is what led me to believe the -itxt option was nonfunctional). In other words, the -itxt flag must be followed by either b or a, and then by four pairs of quotation marks, the first of which must contain characters; the others are allowed to be empty.
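For example, a full invocation might look like this (the filenames and the quoted values are placeholders of my own choosing, with the language code and translated keyword left empty):

pngcrush -itxt b "Copyright" "" "" "(c) 2006 A. Nonymous" in.png out.png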
After discovering that the -itxt option worked after all, the next task for me was to find out how to write the UTF-8 strings properly from the Win32 console window. This probably isn’t an issue for Linux users, so the section ends here if you are one.
I tried different modes for the console window: cmd starts it in codepage mode, while cmd /u starts it in Unicode (more accurately, UTF-16LE) mode. When opened to display in a raster font, only one MS-DOS codepage is supported (CP437 on an American WinXP, CP850 on a West European one, etc), while the TrueType font (Lucida Console) can display a greater character set, such as Greek and Cyrillic (but not Hebrew or Arabic). Once launched, chcp 65001 switches the console codepage to UTF-8, and others are also available (437, 850 and the rest, as well as 1252 for Windows West-European, and the other Microsoft Windows codepages). I had plenty of variations to tinker with.
The first input method I tried was to type the text in the command line window. The result, upon examination of the output file with a hexadecimal editor, was question marks where the Unicode text should have been. The second method I tried was writing the text in a text editor and pasting it from there into the command line window. Again, the result was question marks no matter what I tried. Finally, I decided to try batch files. I wrote a batch file with the entire pngcrush command, saved it as UTF-8, removed the Byte Order Mark and then ran it.
The first result wasn’t what I wanted, but it was different from all the previous errors: not question marks but characters in a wrong encoding. That was under CP437. I switched to 65001 (UTF-8), but the batch file then wouldn’t execute at all. Lastly, I did chcp 1252 (Windows West-European), and it worked. All characters were as they should be, and reading the file with a viewer in Unicode mode confirmed this.
So the steps to writing an international text metadata chunk with pngcrush are these (a sketch of such a batch file follows the list):
1. Write the entire pngcrush command, with all its options and parameters, into a batch file (extension *.bat) and save it as UTF-8. Remove the Byte Order Mark, if any.
2. Launch the console window in any mode (i.e. with or without /u), in any font (raster or TTF). Only the codepage matters.
3. Switch the codepage to Windows West-European by typing chcp 1252.
4. Run the batch file.
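For illustration, the batch file might contain nothing more than the pngcrush command itself (all names and texts here are placeholders of my own):

pngcrush -itxt b "Title" "he" "" "שלום עולם" in.png out.png

Save that single line as, say, itxt.bat in UTF-8 without a Byte Order Mark, type chcp 1252 in the console, and then run itxt.bat.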
The result is a totally correct PNG file with perfect iTXt chunks. They may not be very useful now, but they will be in the not so far future. A graphical interface for manipulating PNG iTXt chunks would be much preferable to many, of course, but even this somewhat tedious business of calling pngcrush from a batch file in a command line window is far better than the hackish and error-prone way of writing the chunks, after their calculated lengths, into the PNG file with a hex editor and then fixing the CRCs. Writing a PNG file involves a lot of procedure, and only computers can be relied upon to follow procedure without error.
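To show just how much bookkeeping that hex-editor route entails, here is a sketch in Python of building an iTXt chunk from scratch, following the chunk layout in the PNG specification (the function and its names are my own):

import struct, zlib

def itxt_chunk(keyword, text, lang="", translated=""):
    # iTXt data: keyword, NUL, compression flag (0 = uncompressed),
    # compression method (0), language tag, NUL, translated keyword,
    # NUL, then the UTF-8 text.
    data = (keyword.encode("latin-1") + b"\x00" + b"\x00\x00" +
            lang.encode("ascii") + b"\x00" +
            translated.encode("utf-8") + b"\x00" +
            text.encode("utf-8"))
    # Every PNG chunk: 4-byte big-endian data length, 4-byte type,
    # the data itself, and a CRC-32 over the type and data.
    return (struct.pack(">I", len(data)) + b"iTXt" + data +
            struct.pack(">I", zlib.crc32(b"iTXt" + data) & 0xFFFFFFFF))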
(Note, added 24 Jun 2006: you can also use Jason Summers’s TweakPNG to write out, and then import, any PNG chunk you want, including iTXt. However, although this option takes care of the chunk length field and the CRC, you still have to do the internals of the chunk, such as the null separators, by hand with a hex editor, so it isn’t ideal. TweakPNG as of version 1.2.1 can’t edit iTXt chunks directly.)
Note to self
IBM16 is IRGB, Win16 is IBGR, ZX Spectrum is IGRB.
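A minimal sketch of what those letter strings mean, assuming (my reading of my own terse note) that each names the bit order of a 4-bit palette index from most significant bit to least:

def palette_bits(value, order):
    # order is a string like "IRGB", most significant bit first.
    return {name: (value >> (3 - i)) & 1 for i, name in enumerate(order)}

print(palette_bits(0b1010, "IRGB"))  # IBM16 reading: I=1, R=0, G=1, B=0
print(palette_bits(0b1010, "IBGR"))  # Win16 reading: I=1, B=0, G=1, R=0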
PNG Tutorial
For a concise PNG tutorial go to this subpage.