ISO/IEC 2022

ISO 2022, more formally ISO/IEC 2022, is an ISO standard (equivalent to the ECMA standard ECMA-35) specifying

  • a technique for including multiple character sets in a single character encoding, and
  • a technique for representing character sets which cannot be represented in 7 bits.

Unlike the ISO 8859 character encodings, which use 8 bits for every character, encodings based on ISO 2022 are variable-width, typically using either 8 or 16 bits per character. Several character encodings use ISO 2022 mechanisms. For example, ISO-2022-JP is a widely used character encoding for the Japanese language.

Introduction

Many languages not written in the Latin alphabet, such as Greek, Russian, Arabic, and Hebrew, have historically been represented on computers with 8-bit extended ASCII encodings, including the ISO 8859 family of character sets. The written East Asian languages, specifically Chinese, Japanese, and Korean, use far more characters than can be represented in a single 8-bit byte, and were first represented on computers with language-specific double-byte encodings.

ISO 2022 was developed as a technique to address both of these problems: representing characters from multiple character sets within a single character encoding, and representing large character sets.

Being based on ISO 646, ISO 2022 shares many of ISO 646's properties. For example, in its 7-bit form the most significant bit of each byte carries no meaning; this allows ISO 2022 (like ISO 646) to be transmitted easily through 7-bit communication channels. (The code structure of ISO 2022 also forms the basis of the EUC encodings.)

To represent multiple character sets, the ISO 2022 character encodings include escape sequences that indicate the character set of the characters that follow. The escape sequences are registered with ISO and are often three characters long, starting with the ASCII ESCAPE character (hexadecimal 1B, octal 33). These encodings must be processed sequentially in the forward direction, because the correct interpretation of any byte depends on the most recently encountered escape sequence.
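The sequential, stateful processing described above can be sketched in Python. This is a minimal illustration rather than a complete decoder: it recognises only the four three-byte escape sequences that this article lists for ISO-2022-JP, and the names `DESIGNATIONS` and `split_by_charset` are hypothetical, chosen for this example.

```python
ESC = 0x1B

# Escape sequence -> (character set name, bytes per character).
# Only the four ISO-2022-JP sequences listed in this article are handled.
DESIGNATIONS = {
    b"\x1b(B": ("ASCII", 1),
    b"\x1b(J": ("JIS X 0201 Roman", 1),
    b"\x1b$@": ("JIS X 0208-1978", 2),
    b"\x1b$B": ("JIS X 0208-1983", 2),
}


def split_by_charset(data: bytes):
    """Split a byte stream into (charset name, raw bytes) runs.

    The stream is assumed to start in ASCII. The active character set
    changes only when an escape sequence is encountered, so the data
    has to be scanned from the beginning.
    """
    name, width = "ASCII", 1
    segments = []
    start = 0
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError(f"unsupported escape sequence: {seq!r}")
            if i > start:
                segments.append((name, data[start:i]))
            name, width = DESIGNATIONS[seq]
            i += 3
            start = i
        else:
            # Skip one whole character in the currently active set.
            i += width
    if start < len(data):
        segments.append((name, data[start:]))
    return segments


sample = b"ABC\x1b$BF|K\\8l\x1b(Bxyz"
for charset, raw in split_by_charset(sample):
    print(charset, raw)
```

Note that the bytes `F|K\8l` in the sample are printable ASCII values; only the preceding escape sequence tells the reader that they are to be interpreted as three two-byte characters.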

To represent large character sets, ISO 2022 builds on ISO 646's property that one byte can define 94 graphic (printable) characters, in addition to space and 33 control characters. Using two bytes, it is thus possible to represent up to 8836 (94×94) characters; using three bytes, up to 830584 (94×94×94) characters. For the two-byte character sets, the code point of each character is normally specified in so-called kuten form (sometimes called quwei, especially when dealing with GB2312 and related standards), which specifies a zone (ku or qu) and the point (ten) or position (wei) of that character within the zone.
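The arithmetic above, and the relationship between kuten form and the transmitted bytes, can be illustrated with a short Python sketch. It assumes the usual convention that zone and point are each numbered 1 to 94 and are mapped onto the byte values 0x21 to 0x7E by adding 0x20; the function names are hypothetical.

```python
# Sizes of 1-, 2- and 3-byte sets built from 94 graphic positions per byte.
SET_SIZES = [94 ** n for n in (1, 2, 3)]  # [94, 8836, 830584]


def kuten_to_bytes(ku: int, ten: int) -> bytes:
    """Convert a zone (ku) and point (ten), each 1..94, to two 7-bit bytes."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in the range 1..94")
    return bytes([0x20 + ku, 0x20 + ten])


def bytes_to_kuten(pair: bytes) -> tuple:
    """Inverse of kuten_to_bytes for a two-byte sequence."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in the range 0x21..0x7E")
    return pair[0] - 0x20, pair[1] - 0x20


# Zone 38, point 92 of JIS X 0208 maps to the bytes 0x46 0x7C.
print(kuten_to_bytes(38, 92))
print(bytes_to_kuten(b"F|"))
```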

The escape sequences therefore declare not only which character set is in use but also, implicitly through the known properties of that set, whether a 94-, 8836-, or 830584-character (or some other sized) encoding is being dealt with.

In practice, the escape sequences declaring the national character sets may be absent if context or convention dictates that a certain national character set is to be used. For example, RFC 1922, which defines ISO-2022-CN, uses the ASCII shift characters SHIFT OUT and SHIFT IN to switch between ASCII and a character set that has already been declared, so that an escape sequence is not needed at every switch.
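Switching with shift characters can be sketched as follows. This is an illustrative Python fragment, not an implementation of RFC 1922: it assumes that any escape sequence declaring the two-byte set has already been consumed, and it only separates the runs delimited by SHIFT OUT (0x0E) and SHIFT IN (0x0F). The function name and the run labels are hypothetical.

```python
SO = 0x0E  # SHIFT OUT: following bytes belong to the shifted-in set
SI = 0x0F  # SHIFT IN: following bytes are ASCII again


def split_shift_runs(data: bytes):
    """Split data into ("ascii", bytes) and ("shifted", bytes) runs."""
    runs = []
    shifted = False
    current = bytearray()
    for b in data:
        if b in (SO, SI):
            if current:
                runs.append(("shifted" if shifted else "ascii", bytes(current)))
                current = bytearray()
            shifted = (b == SO)
        else:
            current.append(b)
    if current:
        runs.append(("shifted" if shifted else "ascii", bytes(current)))
    return runs


# "VPND" are the 7-bit GB 2312 codes of two Chinese characters.
print(split_shift_runs(b"AB\x0eVPND\x0fCD"))
```

As with escape sequences, the shift state persists until the next shift character, so the data still has to be read from the start.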

Although encodings based on ISO 2022, particularly ISO-2022-JP, are still in common use, most modern e-mail applications are moving to the simpler Unicode character encodings such as UTF-8.

ISO 2022 Character Sets

Character encodings using ISO 2022 mechanisms include:

  • ISO-2022-JP - widely used encoding for Japanese. Starts in ASCII and includes the following escape sequences
    • ESC ( B to switch to ASCII (1 byte per character)
    • ESC ( J to switch to JIS X 0201-1976 (ISO 646:JP) Roman set (1 byte per character)
    • ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
    • ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)
  • ISO-2022-JP-1 - Same as ISO-2022-JP with one additional escape sequence
    • ESC $ ( D to switch to JIS X 0212-1990 (2 bytes per character)
  • ISO-2022-JP-2 - Multilingual extension of ISO-2022-JP. Same as ISO-2022-JP-1 with the following additional escape sequences
    • ESC $ A to switch to GB 2312-1980 (2 bytes per character)
    • ESC $ ( C to switch to KS X 1001-1992 (2 bytes per character)
    • ESC . A to switch to ISO 8859-1 high part, Extended Latin 1 set (1 byte per character)
    • ESC . F to switch to ISO 8859-7 high part, Basic Greek set (1 byte per character)
  • ISO-2022-JP-3 - Same as ISO-2022-JP with three additional escape sequences
  • ISO-2022-JP-2004 - Same as ISO-2022-JP-3 with one additional escape sequence
  • ISO-2022-KR - Korean
    • ESC $ ) C to switch to KS X 1001-1992 (2 bytes per character)
  • ISO-2022-CN - Chinese
    • ESC $ ) A to switch to GB 2312-1980 (2 bytes per character)
    • ESC $ ) G to switch to CNS 11643-1992 Plane 1 (2 bytes per character)
    • ESC $ * H to switch to CNS 11643-1992 Plane 2 (2 bytes per character)
  • ISO-2022-CN-EXT - Same as ISO-2022-CN with six additional escape sequences
    • ESC $ ) E to switch to ISO-IR-165 (2 bytes per character)
    • ESC $ + I to switch to CNS 11643-1992 Plane 3 (2 bytes per character)
    • ESC $ + J to switch to CNS 11643-1992 Plane 4 (2 bytes per character)
    • ESC $ + K to switch to CNS 11643-1992 Plane 5 (2 bytes per character)
    • ESC $ + L to switch to CNS 11643-1992 Plane 6 (2 bytes per character)
    • ESC $ + M to switch to CNS 11643-1992 Plane 7 (2 bytes per character)

References

  • Lunde, Ken. CJKV Information Processing. Cambridge, Massachusetts: O'Reilly & Associates, 1998. ISBN 1-56592-224-7.

External links

RFCs
  • RFC 1468: description of ISO-2022-JP
  • RFC 2237: description of ISO-2022-JP-1
  • RFC 1554: description of ISO-2022-JP-2
  • RFC 1922: description of ISO-2022-CN and ISO-2022-CN-EXT
  • RFC 1557: description of ISO-2022-KR