UTF-7


UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode text using a stream of ASCII characters, for example in Internet e-mail messages.

The basic Internet e-mail standard, SMTP, specifies that the transmission format is US-ASCII and does not allow byte values above the ASCII range. MIME provides a way to specify the character set, allowing the use of other character sets, including UTF-8 and UTF-16. However, the underlying transmission infrastructure is still not guaranteed to be 8-bit clean, so a content transfer encoding must still be used with them. Base64 has the drawback of making even US-ASCII characters unreadable, while UTF-8 combined with quoted-printable is very inefficient, requiring 6–9 bytes for non-ASCII characters from the BMP and 12 bytes for characters outside the BMP.
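As a quick illustration of those byte counts, the following Python snippet (using the standard quopri module; the characters chosen here are arbitrary examples) shows how quoted-printable expands the UTF-8 form of a character:

    import quopri

    # Quoted-printable turns each non-ASCII byte of the UTF-8 form into three bytes ("=XX"),
    # so 2-, 3- and 4-byte UTF-8 characters cost 6, 9 and 12 bytes respectively.
    for ch in ("£", "€", "😀"):
        utf8 = ch.encode("utf-8")
        qp = quopri.encodestring(utf8)
        print(ch, len(utf8), "UTF-8 bytes ->", qp, len(qp), "bytes in quoted-printable")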

Provided that certain rules are followed during encoding, UTF-7 can be sent in e-mail without using a separate MIME content transfer encoding, but it must still be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words, which identify the character set. Since encoded words force the use of either quoted-printable or base64, UTF-7 was designed to avoid using the = sign as an escape character, so as to avoid double escaping when it is combined with quoted-printable.

UTF-7 is generally not used as a native representation within applications, as it is very awkward to process. 8BITMIME has also been introduced, which reduces the need to encode messages in a 7-bit format. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the Internet Mail Consortium recommends against its use.

A modified form of UTF-7 is currently used in the IMAP e-mail retrieval protocol for mailbox names. See section 5.1.3 of RFC 3501 for details.
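The modified form differs from UTF-7 proper in a few ways described in RFC 3501: & rather than + introduces a base64 block, a literal & is written as &-, and , replaces / in the base64 alphabet. A rough sketch of an encoder for mailbox names follows (imap_utf7_encode is a hypothetical helper name, not a standard library function):

    import base64

    def imap_utf7_encode(name: str) -> bytes:
        """Encode a mailbox name in IMAP's modified UTF-7 (RFC 3501, section 5.1.3)."""
        out = bytearray()
        pending = []  # characters waiting to be base64-encoded

        def flush():
            if pending:
                b64 = base64.b64encode("".join(pending).encode("utf-16-be")).rstrip(b"=")
                out.extend(b"&" + b64.replace(b"/", b",") + b"-")
                pending.clear()

        for ch in name:
            if ch == "&":
                flush()
                out.extend(b"&-")          # literal & is written as &-
            elif 0x20 <= ord(ch) <= 0x7E:
                flush()
                out.append(ord(ch))        # printable ASCII represents itself
            else:
                pending.append(ch)         # everything else goes into a base64 run
        flush()
        return bytes(out)

    print(imap_utf7_encode("£1"))  # b'&AKM-1'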


Description

UTF-7 was first proposed as an experimental protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode. That RFC has been made obsolete by RFC 2152, an informational RFC which never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. UTF-7 is also not part of the Unicode Standard: the Unicode Standard 5.0 lists only UTF-8, UTF-16 and UTF-32. There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.

Some characters can be represented directly as single ASCII bytes. The first group, known as "direct characters", contains all 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are considered very safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces size and improves human readability, but it also increases the chance of breakage by things such as badly designed mail gateways, and it may require extra escaping when used in encoded words for header fields.
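For reference, the two groups can be written out as Python sets (a small sketch; the names DIRECT and OPTIONAL_DIRECT are chosen here only for illustration):

    import string

    # "Direct characters": the 62 alphanumerics plus 9 symbols, safe to emit literally.
    DIRECT = set(string.ascii_letters + string.digits + "'(),-./:?")

    # "Optional direct characters": the remaining printable ASCII (U+0020-U+007E),
    # excluding ~, \, + and space, which are treated specially.
    OPTIONAL_DIRECT = {chr(c) for c in range(0x20, 0x7F)} - DIRECT - set("~\\+ ")

    print("".join(sorted(OPTIONAL_DIRECT)))  # !"#$%&*;<=>@[]^_`{|}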

Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail.

Other characters must be encoded in UTF-16 (hence code points U+10000 and above are encoded as surrogate pairs) and then in modified base64. The start of such a block of modified-base64-encoded UTF-16 is indicated by a + sign, and the end by any character not in the modified base64 set. As a special case, if the character following the modified base64 is a - (ASCII hyphen-minus), it is consumed by the decoder. As another special case, a literal + character may be encoded as +- (it may also be encoded in modified base64).
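Python's standard library includes a utf-7 codec, which makes these rules easy to observe (a quick illustration, not part of the specification itself):

    # The built-in codec shifts into modified base64 with '+', terminates the run
    # with '-' where needed, and escapes a literal '+' as '+-'.
    print("£1".encode("utf-7"))           # b'+AKM-1'
    print("1 + 1 = 2".encode("utf-7"))    # b'1 +- 1 = 2'
    print(b"+ACEAIQAh-".decode("utf-7"))  # !!!  (one base64 run can hold several characters)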

Examples

  • "Hello, World!" is encoded as "Hello, World!"
  • "1 + 1 = 2" is encoded as "1 +- 1 = 2"
  • "£1" is encoded as "+AKM-1". The Unicode code point for the pound sign is U+00A3 (0x00A3 in UTF-16), which converts into modified base64 as shown in the table below. There are two bits left over, which are padded with zeros.
Hex digit        0     0     A     3
Bit pattern      0000  0000  1010  0011  00 (padding)
6-bit groups     000000    001010    001100
Index            0         10        12
Base64-encoded   A         K         M
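The same conversion can be checked in a few lines of Python (B64 below is simply the standard base64 alphabet):

    # Reproducing the table above: U+00A3 -> 'AKM'.
    B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    bits = format(0x00A3, "016b") + "00"               # 16 bits plus two padding zeros
    groups = [bits[i:i + 6] for i in range(0, 18, 6)]  # ['000000', '001010', '001100']
    print([int(g, 2) for g in groups])                 # [0, 10, 12]
    print("".join(B64[int(g, 2)] for g in groups))     # AKM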

Algorithm for manually encoding and decoding UTF-7

Encoding

First an encoder must decide which characters to represent directly in ASCII form and which to place in blocks of Unicode characters. A simple encoder may directly encode all characters that it considers safe for direct encoding. However, the cost of ending a Unicode block to emit a single character directly and then starting another block is 3 to 3⅔ bytes, which is more than the 2⅔ bytes needed to represent the character as part of the Unicode sequence.

Once the Unicode sequences have been decided on, they must be encoded using the following procedure and then surrounded by the appropriate delimiters (a short code sketch implementing these steps follows the list).

We will use the character sequence £† (U+00A3, U+2020) as an example:

  1. Express each character's Unicode (UTF-16) code units in binary:
    0x00A3 → 0000 0000 1010 0011
    0x2020 → 0010 0000 0010 0000
  2. Concatenate the binary sequences
    0000 0000 1010 0011 and 0010 0000 0010 0000 → 0000 0000 1010 0011 0010 0000 0010 0000
  3. Regroup the binary into blocks of six bits, starting from the left:
    0000 0000 1010 0011 0010 0000 0010 0000 → 000000 001010 001100 100000 001000 00
  4. If the last group has fewer than six bits, add trailing zeros:
    000000 001010 001100 100000 001000 00 → 000000 001010 001100 100000 001000 000000
  5. Replace each group of six bits with a respective Base64 code:
    000000 001010 001100 100000 001000 000000 → AKMgIA
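A minimal Python sketch of steps 1–5 (the name encode_utf7_block is hypothetical, and the function produces only the base64 payload, without the surrounding + and - delimiters):

    B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def encode_utf7_block(text: str) -> str:
        """Steps 1-5 above: UTF-16 bits -> six-bit groups -> modified base64 (no padding)."""
        # Steps 1-2: express each character as UTF-16 (big-endian) and concatenate the bits.
        bits = "".join(format(b, "08b") for b in text.encode("utf-16-be"))
        # Step 4: pad the final group with zeros so the length is a multiple of six.
        bits += "0" * (-len(bits) % 6)
        # Steps 3 and 5: take six bits at a time and map each group to a base64 character.
        return "".join(B64_ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))

    print(encode_utf7_block("£†"))               # AKMgIA
    print("+" + encode_utf7_block("£†") + "-")   # +AKMgIA-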

Decoding

First the message must be separated into plain ASCII text and Unicode blocks as described in the Description section. Once this is done, each Unicode block must be decoded with the following procedure, using the result of the encoding example above as our example (a matching code sketch follows the list):

  1. Express each Base64 code as the bit sequence it represents:
    AKMgIA → 000000 001010 001100 100000 001000 000000
  2. Regroup the binary into groups of sixteen bits, starting from the left:
    000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000
  3. If there is an incomplete group at the end, discard it (If the incomplete group contains more than four bits or contains any ones, the code is invalid):
    0000000010100011 0010000000100000
  4. Each group of 16 bits is a character's Unicode (UTF-16) code unit and can be expressed in other forms:
    0000 0000 1010 0011 ≡ 0x00A3 ≡ 163 (decimal)
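A matching Python sketch of the decoding steps (decode_utf7_block is again a hypothetical helper that takes only the base64 payload found between + and -):

    B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

    def decode_utf7_block(payload: str) -> str:
        """Steps 1-4 above: base64 codes -> bit string -> 16-bit UTF-16 code units."""
        # Step 1: expand each base64 character into the six bits it represents.
        bits = "".join(format(B64_ALPHABET.index(c), "06b") for c in payload)
        # Step 3: a valid block leaves at most four leftover bits, all of them zero.
        leftover = bits[len(bits) - len(bits) % 16:]
        if len(leftover) > 4 or "1" in leftover:
            raise ValueError("invalid UTF-7 block")
        # Steps 2 and 4: read sixteen bits at a time as UTF-16 (big-endian) code units.
        units = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits) - len(leftover), 8))
        return units.decode("utf-16-be")

    print(decode_utf7_block("AKMgIA"))  # £†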

Security

UTF-7 allows multiple representations of the same source string, because an encoder may shift in and out of the base64 mode at will and may base64-encode characters that could also be sent directly; this can defeat filters and validators that inspect only one representation. Modern mail systems and other transports can handle UTF-8, so the use of UTF-7 is no longer required as it was historically. Modern applications should consider supporting more secure encodings instead.
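For instance, with Python's built-in codec, several distinct byte sequences decode to the same text:

    # Three different UTF-7 byte strings that all decode to the same characters.
    for raw in (b"<script>", b"+ADw-script+AD4-", b"+ADwAcwBjAHIAaQBwAHQAPg-"):
        print(raw, "->", raw.decode("utf-7"))
    # All three print '<script>', which is why filters must normalize before checking.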

Not yet developed: UTF-6 and UTF-5

Some proposals have been made for UTF-6 and UTF-5 encodings for radio telegraphy environments,[1][2] but no such UTF standard has been formalized as of 2006.

  • These proposals are not related to Punycode.

References

  1. ^ Seng, James (2000-01-28). UTF-5, a transformation format of Unicode and ISO 10646. IETF Internet-Draft. Retrieved 2007-08-23.
  2. ^ Welter, Mark; Spolarich, Brian W.; WALID, Inc. (2000-11-16). UTF-6 - Yet Another ASCII-Compatible Encoding for IDN. IETF Internet-Draft. The Internet Society. Retrieved 2007-08-28.
