UTF-EBCDIC

UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes can process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. UTF-EBCDIC is specified in Unicode Technical Report #16.

To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte, so that they can later be mapped to the corresponding EBCDIC control codes. To achieve this, 101XXXXX is used instead of 10XXXXXX as the format for the trailing bytes of a multi-byte sequence. Because each trailing byte holds only 5 bits rather than 6, UTF-EBCDIC generally produces larger output than UTF-8 for the same input data.
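The first step can be illustrated with a minimal Python sketch. The function name utf8_mod_encode is hypothetical, and the lead-byte markers and range limits are assumptions: they mirror UTF-8's lead bytes (110, 1110, 11110, 111110) combined with the 5-bit 101XXXXX trailing bytes described above, and should be checked against Unicode Technical Report #16 before being relied on.

```python
def utf8_mod_encode(code_point: int) -> bytes:
    """Encode one code point into the UTF-8-Mod intermediate form (sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x9F:
        # ASCII and the C1 control codes each occupy a single byte.
        return bytes([code_point])
    # Assumed layout: (largest code point, trailing byte count, lead marker).
    for limit, trailing, lead in (
        (0x3FF, 1, 0xC0),      # 110YYYYY 101XXXXX
        (0x3FFF, 2, 0xE0),     # 1110ZZZZ + 2 trailing bytes
        (0x3FFFF, 3, 0xF0),    # 11110WWW + 3 trailing bytes
        (0x3FFFFF, 4, 0xF8),   # 111110VV + 4 trailing bytes
    ):
        if code_point <= limit:
            break
    out = []
    for _ in range(trailing):
        # Each trailing byte is 101XXXXX and carries the next 5 low bits.
        out.append(0xA0 | (code_point & 0x1F))
        code_point >>= 5
    # The remaining high bits go into the lead byte.
    out.append(lead | code_point)
    return bytes(reversed(out))


# U+0400 needs three bytes here but only two in UTF-8.
print(utf8_mod_encode(0x0400).hex(), chr(0x0400).encode("utf-8").hex())
```

Under these assumptions, U+0085 (a C1 control code) stays a single byte, while a code point such as U+0400 grows from two bytes in UTF-8 to three, which shows the size penalty of the 5-bit trailing bytes.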

This transformation leaves the data in an ASCII-based format, so a reversible byte-to-byte transform is then applied to it, using a lookup table, to bring the result as close to ordinary EBCDIC code pages as feasible. Both steps can easily be reversed to recover the Unicode code points.
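The reversibility of the second step follows from the lookup table being a one-to-one mapping over all 256 byte values. The sketch below shows only that mechanism: FORWARD is a placeholder permutation invented for illustration, not the table from Unicode Technical Report #16, and the function names are hypothetical.

```python
# Placeholder permutation of the 256 byte values.
# NOT the real UTF-EBCDIC table; the actual mapping is given in UTR #16.
FORWARD = bytes((b * 7 + 3) % 256 for b in range(256))

# Build the inverse table: if FORWARD maps src to dst, INVERSE maps dst to src.
_inverse = bytearray(256)
for src, dst in enumerate(FORWARD):
    _inverse[dst] = src
INVERSE = bytes(_inverse)


def apply_table(data: bytes) -> bytes:
    """Map every byte through the forward lookup table."""
    return data.translate(FORWARD)


def undo_table(data: bytes) -> bytes:
    """Map every byte back through the inverse lookup table."""
    return data.translate(INVERSE)


sample = bytes([0x41, 0x85, 0xC5, 0xA0])
assert undo_table(apply_table(sample)) == sample
```

Because the table is a permutation, the transform changes only byte values, never the number of bytes, so the multi-byte structure produced by the first step survives intact.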

This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.