Unicode in Microsoft Windows

Microsoft was an early and consistent adopter of Unicode in its products. Windows NT was the first operating system to use "wide characters" in system calls. It initially used the UCS-2 encoding scheme and was upgraded to UTF-16 starting with Windows 2000, which allows characters in the supplementary planes to be represented with surrogate pairs.
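The step from UCS-2 to UTF-16 can be illustrated with a short sketch. Python is used here only for illustration, and the helper name to_surrogate_pair is hypothetical. It splits a code point above U+FFFF, which UCS-2 cannot express, into the two 16-bit surrogates that UTF-16 uses instead.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
high, low = to_surrogate_pair(0x1D11E)

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
encoded = "\U0001D11E".encode("utf-16-le")
assert encoded == high.to_bytes(2, "little") + low.to_bytes(2, "little")
```

Each surrogate is an ordinary 16-bit code unit, which is why existing 'W' interfaces could carry UTF-16 without changing their signatures.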

In various Windows families

Windows NT-based systems

Modern Windows versions such as Windows XP and Windows Server 2003, and before them Windows NT (3.x, 4.0) and Windows 2000, ship with system libraries that support two types of string encoding: UTF-16 (often called "Unicode" in Windows documentation) and an 8-bit encoding called the "code page" (often incorrectly referred to as the "ANSI code page"). Functions that take 16-bit strings have names suffixed with -W (from "wide"), for example lstrlenW(). Code page oriented functions use the suffix -A (for "ANSI"), for example lstrlenA(). This split was necessary because many languages, including C, do not provide a clean way to pass both 8-bit and 16-bit strings to the same API or to put them in the same structure. Windows also provides the 'M' API, which in some locales provides multi-byte encodings but in most locales is the same as 'A'. Most such 'A' and 'M' functions are implemented as wrappers that translate the code page string to UTF-16 and call the corresponding 'W' function.
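The wrapper pattern described above can be sketched as follows. This is not Windows source code: the names char_upper_w and char_upper_a are hypothetical stand-ins for a 'W'/'A' pair, Python's str stands in for a UTF-16 string, and Windows-1252 is assumed as the active code page.

```python
ACTIVE_CODE_PAGE = "cp1252"  # assumption: a Western European system code page


def char_upper_w(text: str) -> str:
    """Stand-in for a 'W' function: the real work happens on Unicode text."""
    return text.upper()


def char_upper_a(data: bytes) -> bytes:
    """Stand-in for an 'A' function: convert, delegate to 'W', convert back."""
    wide = data.decode(ACTIVE_CODE_PAGE)   # code page -> Unicode
    result = char_upper_w(wide)            # call the 'W' implementation
    # Characters with no equivalent in the code page are replaced with '?'.
    return result.encode(ACTIVE_CODE_PAGE, errors="replace")


# 0xE9 is 'é' in Windows-1252; its upper-case form 'É' is 0xC9.
print(char_upper_a(b"caf\xe9"))
```

The conversion back to the code page is lossy whenever the result contains characters the code page cannot represent, which is one reason the 'W' variants are generally preferred.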

The IsTextUnicode function applies a heuristic algorithm to a byte string to detect whether it represents UTF-16 text. For very short texts this function, which is used by some applications such as Notepad, often gives incorrect results. This gave rise to legends about the existence of "Easter eggs" such as "Bush hid the facts".
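The ambiguity behind this misdetection can be reproduced without calling IsTextUnicode. The sketch below does not reimplement the Windows heuristic. It only shows why the guess is plausible: when the 18 ASCII bytes of the sentence are read as nine little-endian UTF-16 code units, every one of them falls in the CJK Unified Ideographs block.

```python
data = b"Bush hid the facts"  # 18 bytes of plain ASCII

# Reinterpret each pair of bytes as one little-endian UTF-16 code unit.
misread = data.decode("utf-16-le")

# Check whether every resulting character lies in the CJK Unified
# Ideographs block (U+4E00 to U+9FFF).
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)
print(len(misread), all_cjk)
```

With so few bytes to examine, a heuristic has little evidence to distinguish the two readings.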

Windows CE

In Windows CE, UTF-16 was used almost exclusively, and the 'A' API was mostly missing.

Windows 9x

In 2001, Microsoft released a special supplement for its older Windows 9x systems. It includes a dynamic link library, unicows.dll (only 240 KB), containing the 16-bit versions (those with the letter W at the end of their names) of all the basic functions of the Windows API.

UTF-8

Although the locale can be set so that the 'M' functions handle some multi-byte encodings, it cannot be set so that they handle UTF-8 (attempts to select code page 65001, the identifier passed to MultiByteToWideChar for UTF-8, as the locale's code page are ignored). Because many libraries, including the standard C and C++ libraries, only allow access to files through the 'M' API, it is not possible to open all Unicode-named files with them. As a result, software that uses only a portable API cannot fully support Unicode file names on Windows.

There are proposals to add an API to portable libraries such as Boost to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.[1]
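The idea behind such a conversion layer can be sketched as follows. The helper to_native_filename and its platform parameter are hypothetical and do not correspond to any actual Boost interface. The sketch assumes that callers always supply file names as UTF-8 bytes.

```python
import os


def to_native_filename(name_utf8: bytes, platform: str = os.name) -> bytes:
    """Hypothetical helper: map a UTF-8 file name to the platform's native form.

    On Unix-like systems the bytes are returned unchanged. On Windows ("nt")
    they are converted to UTF-16 code units for use with a 'W' function.
    """
    if platform == "nt":
        text = name_utf8.decode("utf-8")   # raises on malformed UTF-8
        return text.encode("utf-16-le")    # UTF-16 without a byte order mark
    return name_utf8                       # pass through untouched


name = "r\u00e9sum\u00e9.txt".encode("utf-8")
print(to_native_filename(name, platform="posix"))
print(to_native_filename(name, platform="nt"))
```

Passing the bytes through unchanged on Unix reflects that file names there are byte strings with no mandated encoding.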

Many applications inherently have to support UTF-8 because it is the most widely used Unicode encoding scheme in network protocols, including the Internet Protocol Suite. An application that has to pass UTF-8 to or from a 'W' Windows API should call the functions MultiByteToWideChar and WideCharToMultiByte.[2] To get predictable handling of errors and surrogate halves, however, it is more common for software to implement its own versions of these functions.
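What "predictable handling" can mean is sketched below with Python's standard codecs rather than the Windows functions. The names utf8_to_utf16 and utf16_to_utf8 are hypothetical, and the strict policy shown (reject malformed input instead of substituting a replacement character) is only one possible design choice.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Convert UTF-8 bytes to little-endian UTF-16, rejecting malformed input."""
    return data.decode("utf-8", errors="strict").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """Convert little-endian UTF-16 to UTF-8, rejecting unpaired surrogates."""
    return data.decode("utf-16-le", errors="strict").encode("utf-8")


# A well-formed round trip, including a character outside the BMP.
original = "na\u00efve \U0001F600".encode("utf-8")
assert utf16_to_utf8(utf8_to_utf16(original)) == original

# A lone high surrogate (0xD83D) is reported as an error instead of being
# silently replaced or passed through.
try:
    utf16_to_utf8(b"\x3d\xd8")
    lone_surrogate_rejected = False
except UnicodeDecodeError:
    lone_surrogate_rejected = True
```

An application that must tolerate ill-formed file names might instead choose to preserve surrogate halves, which is the kind of policy decision that leads software to carry its own converters.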

References

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.