Punycode

From Wikipedia, the free encyclopedia

Punycode, defined in RFC 3492, is an instance of the general "Bootstring" algorithm: it encodes Unicode strings into the limited character set permitted in host names (ASCII letters, digits, and hyphens). The encoding is used as part of IDNA, a system that enables internationalized domain names in any script supported by Unicode, and in which the burden of translation lies entirely with the user application (for example, a web browser).

The encoding is applied separately to each component (label) of a domain name that cannot be represented in ASCII alone, and the reserved prefix 'xn--' is added to the resulting Punycode string. For example, bücher becomes bcher-kva in Punycode, so the domain name bücher.ch is represented as xn--bcher-kva.ch in IDNA.
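The per-label treatment can be sketched with the "punycode" codec that ships with Python. This is only an illustration: the helper names to_ace_label and to_ace_domain are made up for this example, and the sketch skips the Nameprep normalization that a real IDNA implementation performs first.

```python
# Minimal sketch of per-label encoding, assuming the input is already
# normalized. Helper names are hypothetical, not part of any standard API.

def to_ace_label(label: str) -> str:
    """Encode one label; pure-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    # The built-in "punycode" codec produces the string without the prefix.
    return "xn--" + label.encode("punycode").decode("ascii")


def to_ace_domain(domain: str) -> str:
    """Encode each dot-separated label independently."""
    return ".".join(to_ace_label(part) for part in domain.split("."))


print(to_ace_label("bücher"))       # xn--bcher-kva
print(to_ace_domain("bücher.ch"))   # xn--bcher-kva.ch
```

Python also bundles an "idna" codec that applies Nameprep and the prefix in one step; the manual version above is meant only to make the two stages visible.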

Encoding of non-ASCII character insertions as code numbers

Non-ASCII ("special") characters are removed from the string, and a sequence of codes is appended to the end, one code for each insertion of a special character. The insertions are ordered primarily by Unicode value and secondarily by position in the string. The code for each insertion is the number of possible insertions that come before the actual one at that stage (that is, disregarding characters that will be inserted later), where the possibilities are likewise ordered primarily by Unicode value and secondarily by position. The first possibility, denoted by the code "a", is the insertion of character 128 at the beginning of the string or, if a special character has already been inserted, the insertion of the same character again immediately after the previous one.

The described coding is a form of delta encoding. Special characters in a word usually come from the same language and therefore often have nearby Unicode values, so counting from the previous insertion tends to keep the numbers small. When a character occurs more than once, it also helps that positions are counted from the previous position.

In the case of "bücher", the code "kva" is used for inserting "ü" (character 252) in "bcher". The possibilities that come before the actual insertion of "ü" after the "b" are the characters 128–251, each in six possible positions, plus "ü" in front of the "b". The code number is therefore 124 × 6 + 1 = 745.
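The counting can be reproduced with a short sketch that follows the description above: it walks through the special characters in order of Unicode value and counts the skipped possibilities. The function name insertion_codes is hypothetical, and the sketch stops at the code numbers; it does not turn them into ASCII digits.

```python
def insertion_codes(text: str) -> list:
    """Return the code number of each non-ASCII insertion, in encoding order."""
    # Characters already present in the stripped string (e.g. "bcher").
    handled = sum(1 for ch in text if ord(ch) < 128)
    n = 128      # smallest code point not yet considered
    delta = 0    # possibilities skipped since the previous insertion
    codes = []
    while handled < len(text):
        # Next special character to insert: the smallest code point >= n.
        m = min(ord(ch) for ch in text if ord(ch) >= n)
        # Every code point from n to m - 1 could go in handled + 1 positions.
        delta += (m - n) * (handled + 1)
        n = m
        for ch in text:
            if ord(ch) < n:
                delta += 1           # one more position passed over
            elif ord(ch) == n:
                codes.append(delta)  # the actual insertion
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return codes


print(insertion_codes("bücher"))   # [745]
```

For "bücher" the loop adds 124 × 6 for the code points 128–251 and then 1 for the position in front of the "b", which gives the 745 derived above.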

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks itself as the most significant digit, and hence as the end of the number. To increase efficiency, the threshold value depends on the position in the number and also on previous insertions. Correspondingly, the weights of the digits (like the weight of 100 carried by the third digit from the right in ordinary decimal numbers) vary.

In this case a "number system" with 36 "digits" is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to "10 21 0". The second digit has a weight of 35 instead of 36 because, in a number of more than one digit, the first (least significant) digit is in the range b-9 and so takes only 35 values: an "a" in that position would mark the end of the number. The final "a" (0) lies below its threshold, so it ends the number and contributes nothing. Therefore "kva" represents the number 10 + 35 × 21 = 745.
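The arithmetic for "kva" can be checked with a small sketch of the variable-length integer scheme. It assumes the parameter values given in RFC 3492 (base 36, thresholds clamped between 1 and 26, initial bias 72) and keeps the bias fixed, so it is only valid for the first number in a string; the real algorithm adapts the bias after every insertion. The function names are hypothetical.

```python
BASE, TMIN, TMAX = 36, 1, 26
INITIAL_BIAS = 72  # starting bias from RFC 3492; never adapted in this sketch
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def threshold(k: int, bias: int) -> int:
    """Threshold for the digit at position k (k = 36, 72, 108, ...)."""
    return max(TMIN, min(TMAX, k - bias))


def decode_number(text: str, bias: int = INITIAL_BIAS) -> tuple:
    """Decode one integer from the start of text: (value, digits consumed)."""
    value, weight = 0, 1
    for index, ch in enumerate(text.lower()):
        digit = DIGITS.index(ch)
        value += digit * weight
        t = threshold(BASE * (index + 1), bias)
        if digit < t:              # below the threshold: last digit
            return value, index + 1
        weight *= BASE - t         # the next digit weighs (36 - t) times more
    raise ValueError("input ended in the middle of a number")


def encode_number(value: int, bias: int = INITIAL_BIAS) -> str:
    """Encode one integer as little-endian base-36 digits."""
    out, k = [], BASE
    while True:
        t = threshold(k, bias)
        if value < t:              # fits below the threshold: final digit
            out.append(DIGITS[value])
            return "".join(out)
        out.append(DIGITS[t + (value - t) % (BASE - t)])
        value = (value - t) // (BASE - t)
        k += BASE


print(decode_number("kva"))   # (745, 3)
print(encode_number(745))     # kva
```

With the initial bias, the first two thresholds are 1 and the third is 26, which is why the first two digits each carry a factor of 35 and the trailing "a" terminates the number.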

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.
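These follow-on codes can be checked against the "punycode" codec bundled with Python, which is assumed here to follow RFC 3492. The variable names are chosen only for this illustration.

```python
# Encode each variant with Python's built-in "punycode" codec and show
# how the code appended after "kva" changes with the second insertion.
examples = ["büücher", "bücüher", "bücherü", "ýbücher"]

codes = {word: word.encode("punycode").decode("ascii") for word in examples}

for word, code in codes.items():
    print(word, "->", code)
```

The final letter advances by one for each position the second "ü" moves to the right, and then moves on to the next code point ("ý", character 253) at the start of the string.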

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent encoded values from representing inadmissible Unicode values. Such values should, however, be checked for and detected during decoding.

Compare the ASCII, Punycode-encoded URL http://xn--tdali-d8a8w.lv/ with its Unicode counterpart, which contains Latvian letters with the appropriate diacritics: http://tūdaliņ.lv.
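The reverse direction can be sketched in the same way, again with Python's bundled "punycode" codec. The helper names from_ace_label and from_ace_domain are hypothetical, and the sketch omits the checks that a full IDNA implementation performs on the decoded result.

```python
PREFIX = "xn--"


def from_ace_label(label: str) -> str:
    """Decode one label; labels without the prefix are returned unchanged."""
    if not label.lower().startswith(PREFIX):
        return label
    return label[len(PREFIX):].encode("ascii").decode("punycode")


def from_ace_domain(domain: str) -> str:
    """Decode each dot-separated label independently."""
    return ".".join(from_ace_label(part) for part in domain.split("."))


print(from_ace_domain("xn--tdali-d8a8w.lv"))   # tūdaliņ.lv
```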

Punycode is designed to work across all script systems and to be self-optimizing, adapting to the character ranges within the string as it operates. It is optimized for strings composed of zero or more ASCII characters plus characters from only one other script system, but it copes with any arbitrary Unicode string. For DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being Punycode-encoded. The DNS protocol also sets limits on the acceptable length of the output Punycode string.