UTF-32/UCS-4

UTF-32 and UCS-4 are alternative names for a method of encoding Unicode characters that uses exactly 32 bits for each Unicode code point. It can be regarded as the simplest encoding form, since all other Unicode Transformation Formats encode code points with a variable number of bytes.
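As a minimal sketch of the fixed-width idea, the Python snippet below writes each code point as one 32-bit unit. The helper name `encode_utf32_be` is hypothetical, and big-endian byte order with no byte order mark is an assumption made for illustration; the result is checked against Python's built-in `utf-32-be` codec.

```python
def encode_utf32_be(text: str) -> bytes:
    """Encode each code point as one big-endian 32-bit unit (no byte order mark)."""
    # Hypothetical helper for illustration; byte order is an assumption.
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


# U+0041 (Latin A), U+20AC (euro sign), U+1D11E (musical symbol G clef, non-BMP)
sample = "A\u20ac\U0001d11e"
encoded = encode_utf32_be(sample)

# Every code point occupies exactly 4 bytes, BMP or not.
assert len(encoded) == 4 * len(sample)
# The built-in codec produces the same bytes.
assert encoded == sample.encode("utf-32-be")

print(encoded.hex(" ", 4))  # 00000041 000020ac 0001d11e
```

Because every unit has the same size, the code point at index n always starts at byte offset 4n.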

However, because UTF-32 uses 4 bytes for every code point, it is very space-inefficient. Non-BMP characters are so rare in most text that they can be ignored for sizing purposes, so the relevant comparison is with the 2 bytes per character of UTF-16 and the 1 to 3 bytes per character of UTF-8. UTF-32 is therefore generally twice the size of UTF-16 and can be as much as 4 times the size of UTF-8.
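The size difference can be measured directly. The sketch below uses a hypothetical helper, `encoded_sizes`, with two arbitrarily chosen BMP-only sample strings; the little-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` in three Unicode encodings."""
    # The "-le" codecs are used so that no byte order mark inflates the counts.
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_text = "hello"             # 5 ASCII characters
cjk_text = "\u65e5\u672c\u8a9e"  # 3 CJK ideographs, all in the BMP

print(encoded_sizes(ascii_text))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes(cjk_text))    # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
```

For ASCII text, UTF-32 is four times the size of UTF-8; for BMP text in general it is twice the size of UTF-16.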

Also, while a fixed number of bytes per code point may seem convenient at first, it is of limited practical use. It makes truncation slightly easier, but not significantly so compared with UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases, since even with a “fixed-width” font there may be more than one code point per character position (combining marks) or more than one character position per code point (for example, CJK ideographs). Combining marks also mean that editors cannot treat one code point as one unit for editing.
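Both cases can be illustrated with Python's standard `unicodedata` module. This is only a sketch with two hand-picked sample strings, not a general width algorithm.

```python
import unicodedata

# Case 1: two code points that display in a single character position.
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(decomposed))                      # 2 code points
print(len(decomposed.encode("utf-32-le")))  # 8 bytes in UTF-32
# A nonzero canonical combining class marks U+0301 as a combining mark.
print(unicodedata.combining(decomposed[1]))  # 230

# Case 2: one code point that occupies two positions in a fixed-width font.
ideograph = "\u4e2d"  # a CJK ideograph
print(len(ideograph))                           # 1 code point
print(unicodedata.east_asian_width(ideograph))  # W (wide)
```

In neither case does dividing the UTF-32 byte length by 4 give the number of character positions on screen.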

For these reasons UTF-32 is little used in practice, with UTF-8 and UTF-16 being the normal ways of encoding Unicode text.

History

The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit code value in the code space of integers between 0 and hexadecimal 7FFFFFFF.

UCS-4 is sufficient to represent all of the Unicode code space, which has 1,114,112 (= 2^20 + 2^16) code points and therefore requires code values only up to hexadecimal 10FFFF. Some people consider it wasteful to reserve such a large code space for mapping a relatively small set of code points, so a new encoding form, UTF-32, was proposed. UTF-32 is a subset of UCS-4 that uses 32-bit code values only in the 0 to 10FFFF code space.
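The arithmetic and the subset relationship can be checked with a short sketch. The constant and function names are hypothetical, and the functions test only the numeric ranges described above, not any further validity rules.

```python
UNICODE_CODE_POINTS = 2**20 + 2**16  # size of the Unicode code space
MAX_UTF32 = 0x10FFFF                 # highest code value UTF-32 uses
MAX_UCS4 = 0x7FFFFFFF                # highest code value in the original 31-bit UCS-4

# Code values 0 through 0x10FFFF inclusive give exactly 1,114,112 code points.
assert UNICODE_CODE_POINTS == MAX_UTF32 + 1 == 1114112


def in_utf32_range(value: int) -> bool:
    """Range check only: is `value` inside the 0 to 10FFFF code space?"""
    return 0 <= value <= MAX_UTF32


def in_original_ucs4_range(value: int) -> bool:
    """Range check only: is `value` inside the original 0 to 7FFFFFFF code space?"""
    return 0 <= value <= MAX_UCS4


print(in_utf32_range(0x10FFFF), in_original_ucs4_range(0x10FFFF))  # True True
print(in_utf32_range(0x110000), in_original_ucs4_range(0x110000))  # False True
```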

UTF-32 was originally a subset of the UCS-4 standard, but the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, and has removed the former provisions for private-use code positions in groups 60 to 7F and in planes E0 to FF.

Accordingly, UCS-4 and UTF-32 can now be taken to be identical, except that the UTF-32 standard has additional Unicode semantics that must be observed.

See also

External links