Mojibake

From Wikipedia, the free encyclopedia

The Japanese Wikipedia article for mojibake with improper encoding.

Mojibake is the phenomenon of incorrect, unreadable characters (garbage characters) shown when computer software fails to render a text correctly according to its associated character encoding. It is a loanword from Japanese.

1 Etymology
2 Causes
3 Problems in other languages
4 See also

Etymology

The Japanese word 文字化け (mojibake) is composed of 文字 (moji), which means "letter" or "character", and 化け (bake), from the verb 化ける (bakeru), which means "to appear in disguise", "to take the form of", or "to change for the worse". Literally, it means "character changing".

Causes

Mojibake is often caused by the forced display of writing systems or character encodings that are "foreign" to the user's computer system: if a computer lacks the software required to process a foreign language's characters, it attempts to process them in its default encoding, which usually results in gibberish. Messages transferred between different encodings of the same language can also suffer from mojibake. Japanese-language users, with several different encodings historically in use, encountered this problem relatively often. An improperly configured or badly written web browser may fail to distinguish a page encoded in EUC-JP from one encoded in Shift-JIS if the encoding is not declared explicitly, either in the HTTP headers sent along with the document or in the HTML document's meta tags, which substitute for missing HTTP headers when the server cannot be configured to send the proper ones. Checking candidate decodings against a well-defined dictionary can usually avoid this problem.
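The order of precedence described above (HTTP header first, then meta tag, then a default) can be sketched in Python. This is a minimal illustration, not how any particular browser works: the function name `declared_charset`, the 1024-byte scan window, and the `windows-1252` fallback are all assumptions made for the example.

```python
import re

# Matches "charset=..." in a Content-Type header value.
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([\w.:-]+)', re.IGNORECASE)
# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def declared_charset(content_type_header, html_bytes, default="windows-1252"):
    """Return the declared encoding name, or `default` if none is declared.

    Hypothetical helper: the HTTP header wins, the meta tag is the fallback.
    """
    if content_type_header:
        match = HEADER_CHARSET.search(content_type_header)
        if match:
            return match.group(1).lower()
    # Only scan the start of the document (assumed window of 1024 bytes).
    head = html_bytes[:1024].decode("ascii", errors="replace")
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).lower()
    return default


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"></head>'
print(declared_charset("text/html; charset=EUC-JP", page))  # header takes precedence
print(declared_charset("text/html", page))                  # falls back to the meta tag
print(declared_charset(None, b"<html></html>"))             # nothing declared: default
```

When neither declaration is present, the final branch is where mojibake typically originates: the bytes are decoded with a default that may have nothing to do with the encoding the author used.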

As an example, the intended word "文字化け", encoded in Shift-JIS, might be incorrectly displayed as "•¶Žš‰»‚¯" in software that is not correctly configured to handle Japanese and instead interprets the bytes as Windows-1252.
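The garbled string quoted above can be reproduced in Python by encoding the word as Shift-JIS and decoding the resulting bytes as Windows-1252. The helper name `garble` is hypothetical; `shift_jis` and `cp1252` are the names of Python's standard codecs.

```python
def garble(text, actual_encoding, assumed_encoding):
    """Encode `text` one way and decode the bytes another way."""
    return text.encode(actual_encoding).decode(assumed_encoding)


original = "文字化け"
garbled = garble(original, "shift_jis", "cp1252")
print(garbled)  # •¶Žš‰»‚¯

# Because every byte happened to map to a Windows-1252 character,
# the damage is reversible by undoing the two steps in reverse order.
repaired = garbled.encode("cp1252").decode("shift_jis")
print(repaired == original)  # True
```

The repair step only works when the wrong decoding lost no information. If the assumed encoding has no character for some byte, or if the text has been re-saved several times, the original may not be recoverable.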

In the mid-1990s, as this problem became common, several websites featured mojibake not as a problem to be tackled but simply for amusement. Words and even sentences were "deciphered" with meanings made up to deliver funny messages.

Problems in other languages

In Chinese, this phenomenon is called luanma (Simplified Chinese: 乱码; Traditional Chinese: 亂碼; pinyin: luànmǎ), literally "chaotic codes".

In Hebrew it is usually called sinit (סינית), meaning "Chinese".

Users of Central and Eastern European languages can also be affected. During the mid- to late 1980s, because most computers were not connected to any network, different character encodings arose for every language with diacritical characters.

Handwritten krakozyabry corrected by a postal employee.

In Russian, mojibake is called krakozyabry (кракозя́бры). During the 1990s, several different encodings for the Cyrillic alphabet (Unix KOI8-R, Windows CP-1251, DOS 866, standard ISO 8859-5, and several others) competed. Badly configured servers and lack of compatibility made garbled text a common and frustrating experience. Many e-mail servers stripped the 8th bit from the characters as permitted by earlier standards (which renders UTF-8 unreadable, as well as all of the above). For this reason many Cyrillic users resorted to Volapuk encoding. An even more frustrating problem emerged in the early 2000s, when the popular e-mail client Microsoft Outlook started to replace correctly entered Cyrillic characters with question marks when replying to or forwarding messages created in competing encodings.
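The effect of the competing Cyrillic encodings can be sketched in Python, whose standard codecs cover the four encodings named above (`koi8_r`, `cp1251`, `cp866`, `iso8859_5`). The helper names `misread` and `strip_eighth_bit` are hypothetical, and the bit-stripping step only models what a 7-bit mail server would do to each byte.

```python
WORD = "кракозябры"
CYRILLIC_CODECS = ["koi8_r", "cp1251", "cp866", "iso8859_5"]


def misread(text, written_as, read_as):
    """Encode with one Cyrillic encoding and decode with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


def strip_eighth_bit(data):
    """Clear the high bit of every byte, as a 7-bit transport would."""
    return bytes(b & 0x7F for b in data)


# Every pair where the two encodings differ yields garbled text.
for written_as in CYRILLIC_CODECS:
    for read_as in CYRILLIC_CODECS:
        print(f"{written_as:>10} read as {read_as:<10} {misread(WORD, written_as, read_as)}")

# After stripping, UTF-8 bytes are plain ASCII and no longer decode to Cyrillic.
stripped = strip_eighth_bit(WORD.encode("utf-8"))
print(stripped.decode("ascii"))
```

Only the diagonal of the printed table, where the writer and the reader agree on the encoding, shows the original word.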

In Bulgarian, mojibake is often called maymunitsa (маймуница), meaning "monkey's alphabet".

In Poland, every company selling early DOS computers created its own encoding and simply reprogrammed the EPROMs of the video cards (typically CGA, EGA or Hercules) with the corresponding character shapes. Additionally, users of then-popular home computers (such as the Amiga and Atari ST) invented their own encodings, incompatible with international standards (ISO 8859-2), vendor standards (IBM CP852, Windows CP1250) and locally agreed-upon PC/MS DOS standards (Mazovia). The situation began to improve when, after pressure from academic and user groups, ISO 8859-2 succeeded as the "Internet standard", with limited support from the dominant vendor's software. Because of the numerous problems caused by the variety of encodings, even today some users refer to Polish diacritical characters as krzaki ("bushes").