Talk:UTF-7

From Wikipedia, the free encyclopedia

Technical question

What exactly is encoded as Base64 between the + and the -? Is it UTF-16? -- Timwi 02:37, 21 Dec 2003 (UTC)

Yeah: "Unicode is encoded using Modified Base64 by first converting Unicode 16-bit quantities to an octet stream (with the most significant octet first). Text with an odd number of octets is ill-formed." (from the RFC) CGS 01:04, 23 Dec 2003 (UTC).
Thanks. Fixed the article. -- Timwi 04:16, 23 Dec 2003 (UTC)
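
The RFC passage quoted above can be checked with a short Python sketch. The helper name `utf7_shifted_payload` is made up for illustration, and the sketch assumes that "Modified Base64" here just means ordinary Base64 with the trailing "=" padding dropped; it covers only the shifted part between the + and the -, not the rules for which characters may be written directly.

```python
import base64


def utf7_shifted_payload(text: str) -> str:
    """Return the Modified Base64 form of the text's UTF-16 octets.

    Hypothetical helper for illustration: 16-bit quantities are written
    most significant octet first (UTF-16BE), then Base64-encoded, and the
    '=' padding is stripped.
    """
    octets = text.encode("utf-16-be")
    return base64.b64encode(octets).decode("ascii").rstrip("=")


payload = utf7_shifted_payload("£")  # U+00A3 -> octets 00 A3
encoded = "+" + payload + "-"
print(encoded)  # +AKM-

# Python's built-in utf-7 codec decodes the hand-built sequence back.
print(encoded.encode("ascii").decode("utf-7"))  # £
```

Characters outside the BMP go through the same path as UTF-16 surrogate pairs, since `utf-16-be` already produces them.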

Thanks

Thanks for spotting that "i" :-) -- Timwi 02:00, 15 Feb 2004 (UTC)

Deprecation

Does anyone have any references for when/why this was deprecated? From my understanding, it will generally match or beat all three of the other formats practical for Unicode e-mail (UTF-8 with quoted-printable, UTF-8 with Base64, and UTF-16 with Base64). Plugwash 23:55, 17 July 2005 (UTC)

The IMC's guidelines for i18n of Internet e-mail (here), published Aug 1998, say use of UTF-7 in Internet e-mail is strongly discouraged. The Unicode 4.0 spec mentions only UTF-8, UTF-16 and UTF-32 (conspicuously omitting UTF-7). This page mentions some drawbacks, but I don't know which (if any) of these were behind abandoning UTF-7. -- Rick Block (talk) 02:36, July 18, 2005 (UTC)

Neither of those sites actually uses the word "deprecated", and the Internet Mail Consortium's site really seems to miss the point. Sure, UTF-8 CAN be handled by MIME; it's just that UTF-8 + quoted-printable is terrible (6 bytes minimum for anything non-ASCII!) and UTF-8 + Base64 isn't exactly hugely efficient either.
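
For anyone who wants to test the size claims in this thread on their own text, here is a rough Python sketch. The function name `encoded_sizes` and the sample string are made up for illustration. The UTF-7 figure reflects the choices of Python's own `utf-7` encoder, and another encoder could legitimately produce a different length. Long inputs will also pick up soft line breaks in the quoted-printable figure.

```python
import base64
import quopri


def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` under four schemes.

    Hypothetical helper for illustration; it measures body bytes only and
    ignores MIME headers and Base64 line wrapping.
    """
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    return {
        "UTF-7": len(text.encode("utf-7")),
        "UTF-8 + quoted-printable": len(quopri.encodestring(utf8)),
        "UTF-8 + Base64": len(base64.b64encode(utf8)),
        "UTF-16 + Base64": len(base64.b64encode(utf16)),
    }


sample = "Grüße aus Köln"  # arbitrary sample text, mostly ASCII
for scheme, size in encoded_sizes(sample).items():
    print(f"{scheme}: {size} bytes")
```

The "6 bytes minimum" figure is the smallest non-ASCII case: a two-byte UTF-8 sequence, with each byte written as a three-character `=XX` escape.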

Transfer encoding syntax

A couple months ago, an anonymous contributor (83.248.26.202) added this to the intro:

Despite the name, UTF-7 is not a UTF. It is rather a transfer encoding syntax (TES), as is Punycode for internationalized domain names.

Plugwash recently removed this and asked, in an HTML comment, what the difference is between a TES and a UTF.

I don't know exactly what the difference is, but I can say that when I was cleaning up the encoding-related categories, I ran across some examples of what I was tempted to call character meta-encodings that had been misfiled as character sets:

  • Encodings for 7-bit transport of 8-bit data; these were originally intended to transport encoded text, but they're actually for any binary data. Examples include Quoted-printable, Base64, Radix-64, ASCII armor, Ascii85, Uuencode, and YEnc. If you use one of these encodings in a MIME message body, you use MIME's Content Transfer Encoding mechanism to signal that you used it.
  • Encoded-word, which is an encoding scheme for representing non-ASCII text in a MIME message header value. This is basically just mapping arbitrary UCS characters to sequences of ASCII-range Unicode characters.

I haven't studied UTF-7 at all, really (yet), but if it doesn't map arbitrary UCS characters to code values or byte sequences, then it's more like the examples above and less like the other UTFs. — mjb 00:56, 13 August 2005 (UTC)
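
The distinction drawn in the two bullets above can be sketched with Python's standard library. This is an illustration only, and the sample bytes and header text are arbitrary. Base64 and quoted-printable round-trip any octets without knowing what they mean, whereas an encoded-word starts from text, names a charset, and yields an ASCII-only header value.

```python
import base64
import quopri
from email.header import Header, decode_header, make_header

# Transfer encodings operate on arbitrary octets, not on characters.
raw = bytes([0x00, 0xFF, 0x10, 0x80])  # binary data that is not text at all
b64_round_trip = base64.b64decode(base64.b64encode(raw))
qp_round_trip = quopri.decodestring(quopri.encodestring(raw))
print(b64_round_trip == raw, qp_round_trip == raw)  # True True

# An encoded-word starts from text plus a named charset.
subject = "Grüße"
encoded_word = Header(subject, "utf-8").encode()
print(encoded_word)  # an ASCII-only "=?utf-8?...?=" token
print(str(make_header(decode_header(encoded_word))))  # Grüße
```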

UTF-7 does map a sequence of code points to a sequence of bytes like the other UTFs, but unlike them there is more than one valid way to represent a piece of text in UTF-7, and its output is designed to be used directly in Internet mail. Essentially, it was a case of recognising that UTF-8 + quoted-printable is an insanely inefficient encoding and doing better by designing a single process for the entire task.
From a registration and mail header point of view, UTF-7 is considered to be a character set (e.g. it's listed at http://www.iana.org/assignments/character-sets). Plugwash 01:19, 13 August 2005 (UTC)
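
The point about multiple valid representations is easy to demonstrate with Python's built-in `utf-7` codec. The byte strings below were worked out by hand for illustration, and the sketch shows only that this particular decoder accepts all of them.

```python
# Several different UTF-7 byte sequences that decode to the same text.
variants_for_pound_one = [
    b"+AKM-1",    # shift only for the pound sign, then a direct "1"
    b"+AKMAMQ-",  # keep both characters inside one Base64 run
]
variants_for_a = [
    b"A",      # written directly
    b"+AEE-",  # the same letter, needlessly shifted into Base64
]

for data in variants_for_pound_one + variants_for_a:
    print(data, "->", data.decode("utf-7"))

decoded_pound_one = {data.decode("utf-7") for data in variants_for_pound_one}
decoded_a = {data.decode("utf-7") for data in variants_for_a}
print(decoded_pound_one, decoded_a)  # {'£1'} {'A'}
```

So byte-for-byte comparison of UTF-7 data is unreliable unless both sides use the same encoder, or the data is decoded first.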