Unicode Specials

From Wikipedia, the free encyclopedia

Specials is the name of a short Unicode block allocated at the very end of the Basic Multilingual Plane, at U+FFF0–FFFF. Of these 16 codepoints, 5 are assigned as of Unicode 5.0:

U+FFF9 "INTERLINEAR ANNOTATION ANCHOR", marks start of annotated text

U+FFFA "INTERLINEAR ANNOTATION SEPARATOR", marks start of annotating text

U+FFFB "INTERLINEAR ANNOTATION TERMINATOR", marks end of annotating text

U+FFFC "OBJECT REPLACEMENT CHARACTER", placeholder in the text for another unspecified object

U+FFFD "REPLACEMENT CHARACTER", used to replace an unknown or unprintable character

U+FFFE and U+FFFF are not unassigned in the usual sense, but are 'guaranteed not to be a Unicode character at all'. They can be used to guess a text's encoding scheme, since any text containing them is by definition not correctly encoded Unicode text. U+FEFF is Unicode's byte-order mark, named "ZERO WIDTH NO-BREAK SPACE" (so that its presence in a text goes unnoticed). If this character is read in the wrong byte order, it appears as 0xFFFE, which is not a legal Unicode character.
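The byte-order check can be sketched in Python as follows. This is a minimal illustration, not a complete encoding detector. The function name `guess_utf16_byte_order` is hypothetical, and the sketch assumes the data is UTF-16 and begins with a byte-order mark.

```python
import codecs

def guess_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from a leading byte-order mark.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"

# U+FEFF serialized big-endian, then misread as a little-endian 16-bit unit.
bom_be = "\ufeff".encode("utf-16-be")          # b'\xfe\xff'
misread = int.from_bytes(bom_be, "little")
print(hex(misread))                            # 0xfffe
```

Because 0xFFFE can never be a character, a reader that sees it at the start of a stream can conclude that the byte order is the opposite of the one it assumed.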

Replacement character

The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials block. It is used to indicate problems when a system such as a text parser is not able to decode a stream of data to a correct symbol.

Consider a text file created with Notepad in Microsoft Windows and saved with Windows-1252 encoding (a code page that Microsoft usually calls ANSI). This file contains the German word für. Its three letters correspond to the byte values 0x66 0xFC 0x72.

This file is now opened in a Linux environment. Many Linux text editors now default to UTF-8. As the first byte (0x66) is within the range 0x00–0x7F, UTF-8 correctly interprets it as an f. The second byte (0xFC) is binary 1111 1100, which cannot begin a valid UTF-8 sequence. A text editor could therefore insert the replacement character at this point to warn the user that something went wrong. The last byte (0x72) is again within the range 0x00–0x7F and can be decoded correctly. The whole string now looks like this: f�r.
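The decoding step can be reproduced in Python. This is a minimal sketch that uses Python's built-in UTF-8 decoder as a stand-in for the text editor. A real editor may handle invalid bytes differently.

```python
# The three bytes of "für" as saved in Windows-1252.
data = bytes([0x66, 0xFC, 0x72])

# Strict decoding rejects the invalid byte 0xFC outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD for the byte it cannot interpret.
shown = data.decode("utf-8", errors="replace")

print(strict_ok)                      # False
print([hex(ord(c)) for c in shown])   # ['0x66', '0xfffd', '0x72']
```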

If this file is now saved as UTF-8, its bytes will be 0x66 0xEF 0xBF 0xBD 0x72. The "new" data, 0xEF 0xBF 0xBD, is the correct UTF-8 encoding of Unicode code point U+FFFD. The original 0xFC (ü) has therefore been replaced with 0xEF 0xBF 0xBD (�).

Back in the Windows environment, this modified text file is opened with Notepad using Windows-1252 encoding. As 0xEF is ï, 0xBF is ¿ and 0xBD is ½ in Windows-1252, the whole text file will be displayed like this: fï¿½r.
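The save and reopen steps can be sketched in Python as well. The sketch starts from the already damaged string f�r and uses Python's `cp1252` codec as a stand-in for Notepad's Windows-1252 handling.

```python
# The string as shown by the UTF-8 editor: f, U+FFFD, r.
damaged = "f\ufffdr"

# Saving as UTF-8 writes U+FFFD as the three bytes EF BF BD.
resaved = damaged.encode("utf-8")
print(resaved.hex(" "))               # 66 ef bf bd 72

# Reopening as Windows-1252 turns each of those bytes into its own character.
reopened = resaved.decode("cp1252")
print(reopened)                       # fï¿½r
```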

Once data has been transformed as in the example above (with different symbols replaced by the same replacement character), the original data cannot be recovered automatically. The only remedy is to find each replacement character manually and restore the correct character from context.
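The loss of information can be made concrete with a short Python sketch. The word list is an illustrative assumption: für is the example from the text, while fár and fér are placeholder strings chosen only because they differ in the middle letter.

```python
# Three different Windows-1252 strings that differ only in the middle letter.
words = ["für", "fár", "fér"]

# Misreading each one as UTF-8 replaces the middle byte with U+FFFD.
damaged = {w: w.encode("cp1252").decode("utf-8", errors="replace") for w in words}

# All three collapse to the same string, so the original cannot be inferred.
distinct_results = set(damaged.values())
print(len(distinct_results))          # 1
```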

Some websites incorrectly declare their encoding as UTF-8 when the content is actually encoded in, for example, Windows-1252. In some web browsers (such as Firefox), this results in all umlauts, the letter ß and some other characters in the upper range of Windows-1252 (those with the most significant bit set to 1) being displayed as � instead. Other web browsers, such as newer versions of Internet Explorer, try their best to figure out which code page was meant to be used.
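One simple way to guess the intended code page is to try the declared encoding strictly and fall back to Windows-1252 on failure. The sketch below illustrates this heuristic only. It is not the algorithm used by any particular browser, and the function name `decode_with_fallback` and its parameters are hypothetical.

```python
def decode_with_fallback(data: bytes,
                         declared: str = "utf-8",
                         fallback: str = "cp1252") -> str:
    """Decode with the declared encoding, falling back if the bytes are invalid.

    Hypothetical helper: a simple heuristic, not a browser's real detection.
    """
    try:
        return data.decode(declared)
    except UnicodeDecodeError:
        # A few byte values are undefined in Windows-1252, so replace those.
        return data.decode(fallback, errors="replace")

# A page declared as UTF-8 but actually sent as Windows-1252.
mislabeled = "Straße".encode("cp1252")
print(decode_with_fallback(mislabeled))   # Straße
```

The heuristic is not foolproof: Windows-1252 byte sequences that happen to form valid UTF-8 are decoded as UTF-8 and never reach the fallback.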