Introduction to Internationalization Issues in the Win32 API

Owner: ???

Effort: ???

Dependencies: ???

Abstract: This page provides an overview of the aspects of the Win32 internationalization API that are relevant to XEmacs, including the basic distinction between multibyte and Unicode encodings. Also included are pointers to how XEmacs should make use of this API.

The Win32 API is quite well-designed in its handling of strings encoded for various character sets. The API is geared around the idea that two different methods of encoding strings should be supported. These methods are called multibyte and Unicode, respectively. The multibyte encoding is compatible with ASCII strings and is a more efficient representation when dealing with strings containing primarily ASCII characters, but it has a number of serious deficiencies and limitations: it is very difficult and error-prone to work with strings in this encoding, and any particular string in a multibyte encoding can only contain characters from a very limited number of character sets. The Unicode encoding rectifies all of these deficiencies, but it is not compatible with ASCII strings (in other words, an existing program will not be able to handle the encoded strings unless it is explicitly modified to do so), and it takes up twice as much memory as a multibyte encoding when encoding a purely ASCII string.

Multibyte encodings use a variable number of bytes (either one or two) to represent characters. ASCII characters are represented by a single byte with its high bit clear, and non-ASCII characters are represented by one or two bytes, the first of which always has its high bit set. (The second byte, when it exists, may or may not have its high bit set.) There is no single multibyte encoding. Instead, there is generally one encoding per non-ASCII character set. Such an encoding is capable of representing (besides ASCII characters, of course) only characters from one (or possibly two) particular character sets.

Multibyte encoding makes processing of strings very difficult. For example, given a pointer to the beginning of a character within a string, finding the pointer to the beginning of the previous character may require backing up all the way to the beginning of the string and then moving forward. Similarly, an operation such as separating out the components of a path by searching for backslashes will fail if it is implemented in the simplest (but not multibyte-aware) fashion, because it may find what appears to be a backslash but is actually the second byte of a two-byte character. Finally, the limited number of character sets that any particular multibyte encoding can represent means that loss of data is likely if a string is converted from the XEmacs internal format into a multibyte format.
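The backslash problem can be made concrete with a small sketch. It assumes Shift-JIS (Windows code page 932) as the multibyte encoding, where the katakana character "so" is encoded as the two bytes 0x83 0x5C and 0x5C is also the ASCII backslash. The function names and the hand-written lead-byte test are illustrative only; on Windows the corresponding facilities are IsDBCSLeadByte() and CharNext().

```c
#include <stddef.h>

/* Lead-byte test for Shift-JIS (an assumption of this sketch): bytes in
   0x81-0x9F and 0xE0-0xFC start a two-byte character. */
int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive search: treats every 0x5C byte as a path separator, even when
   it is the second byte of a two-byte character. */
const char *find_backslash_naive(const char *s)
{
    for (; *s; s++)
        if (*s == '\\')
            return s;
    return NULL;
}

/* Multibyte-aware search: steps over both bytes of a two-byte character
   so that a trail byte is never examined on its own. */
const char *find_backslash_sjis(const char *s)
{
    while (*s) {
        if (sjis_is_lead_byte((unsigned char)*s) && s[1] != '\0') {
            s += 2;
            continue;
        }
        if (*s == '\\')
            return s;
        s++;
    }
    return NULL;
}
```

Given the byte string 0x83 0x5C 'a' '\' 'b', the naive search stops at offset 1, in the middle of the first character, while the aware search stops at offset 3, the real separator.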

For these reasons, the C code in XEmacs should never do any sort of work with multibyte encoded strings (or with strings in any external encoding, for that matter). Strings should always be maintained in the internal encoding, which is predictable, and converted to an external encoding only at the point where the string leaves the XEmacs C code and enters a system library function. Similarly, when a string is returned from a system library function, it should be immediately converted into the internal encoding before any operations are done on it.

Unicode, unlike multibyte encodings, is a fixed-width encoding where every character is represented using 16 bits. It is also capable of encoding all the characters from all the character sets in common use in the world. The predictability and completeness of the Unicode encoding make it a very good encoding for strings that may contain characters from many character sets mixed up with each other. At the same time, of course, it is incompatible with routines that expect ASCII characters and also incompatible with general string manipulation routines, which will encounter a great number of what would appear to be embedded nulls in the string. It also takes twice as much room to encode strings containing primarily ASCII characters. This is why XEmacs does not use Unicode or a similar encoding internally for buffers.
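The "embedded nulls" problem is easy to demonstrate. The sketch below encodes an ASCII-only string as 16-bit little-endian units (the byte order Windows uses on x86) and then shows what a byte-oriented routine such as strlen() sees. The function names are hypothetical and the encoder handles ASCII input only.

```c
#include <stddef.h>
#include <string.h>

/* Encode an ASCII-only string as 16-bit little-endian units: low byte
   first, high byte zero. 'out' must hold 2 * (strlen(ascii) + 1) bytes.
   Returns the number of bytes written, excluding the 2-byte terminator. */
size_t ascii_to_utf16le(const char *ascii, unsigned char *out)
{
    size_t n = 0;
    for (; *ascii; ascii++) {
        out[n++] = (unsigned char)*ascii;
        out[n++] = 0;
    }
    out[n] = 0;
    out[n + 1] = 0;
    return n;
}

/* What a byte-oriented routine believes the length to be: it stops at
   the first zero byte, which is the high byte of the first character. */
size_t apparent_length(const unsigned char *utf16)
{
    return strlen((const char *)utf16);
}
```

For the six-character string "XEmacs" the encoder produces 12 bytes, twice the ASCII size, and apparent_length() reports 1.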

The Win32 API cleverly deals with the issue of 8 bit vs. 16 bit characters by declaring a type called TCHAR which specifies a generic character, either 8 bits or 16 bits. Generally TCHAR is defined to be the same as the simple C type char, unless the preprocessor constant UNICODE is defined, in which case TCHAR is defined to be WCHAR, which is a 16 bit type. Nearly all functions in the Win32 API that take strings are defined to take strings that are actually arrays of TCHARs. There is a type LPTSTR which is defined to be a string of TCHARs and another type LPCTSTR which is a const string of TCHARs.

The theory is that any program that uses TCHARs exclusively to represent characters and does not make assumptions about the size of a TCHAR or the way that the characters are encoded should work transparently regardless of whether the UNICODE preprocessor constant is defined, which is to say, regardless of whether 8 bit multibyte or 16 bit Unicode characters are being used.

The way that this is actually implemented is that every Win32 API function that takes a string as an argument actually maps to one of two functions which are suffixed with an A (which stands for ANSI, and means multibyte strings) or W (which stands for wide, and means Unicode strings). The mapping is, of course, controlled by the same UNICODE preprocessor constant. Generally all structures containing strings in them actually map to one of two different kinds of structures, with either an A or a W suffix after the structure name.
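The following is a simplified, portable imitation of that mechanism, not the actual contents of the Windows headers. XTCHAR, XTEXT, x_string_length and the class name "XEmacsFrame" are hypothetical stand-ins for TCHAR, TEXT, an A/W function pair and a real window class name. Note also that wchar_t is 16 bits on Windows but may be wider on other platforms.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* One preprocessor constant selects the character type, the literal
   prefix, and which of the two suffixed functions the generic name
   refers to. */
#ifdef UNICODE
typedef wchar_t XTCHAR;
#define XTEXT(s) L##s
#define x_string_length x_string_lengthW
#else
typedef char XTCHAR;
#define XTEXT(s) s
#define x_string_length x_string_lengthA
#endif

/* The "A" variant works on multibyte (char) strings. */
size_t x_string_lengthA(const char *s)
{
    return strlen(s);
}

/* The "W" variant works on wide-character strings. */
size_t x_string_lengthW(const wchar_t *s)
{
    return wcslen(s);
}

/* Client code written only in terms of the generic names compiles
   unchanged with or without -DUNICODE. */
size_t class_name_length(void)
{
    const XTCHAR *name = XTEXT("XEmacsFrame");
    return x_string_length(name);
}
```

Compiled either way, class_name_length() returns the same character count; only the underlying storage and the function actually called differ.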

Unfortunately, not all implementations of the Win32 API provide all of the functionality described above. In particular, Windows 95 does not implement very much Unicode functionality. It does implement functions to convert multibyte-encoded strings to and from Unicode strings, and provides Unicode versions of certain low-level functions like ExtTextOut(). All of the rest of the Unicode versions of API functions are just stubs that return an error. Conversely, all versions of Windows NT completely implement all the Unicode functionality, but some versions (especially versions before Windows NT 4.0) don't implement much of the multibyte functionality. For this reason, as well as for general code cleanliness, XEmacs needs to be written in such a way that it works with or without the UNICODE preprocessor constant being defined.

Getting XEmacs to run when all strings are Unicode primarily involves removing any assumptions made about the size of characters. Remember what I said earlier: conversion between internally and externally encoded strings should occur at the point of entry into or exit from a library function. With this in mind, an externally encoded string in XEmacs can be treated simply as an arbitrary sequence of bytes of some length, which has no particular relationship to the length of the string in the internal encoding.
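One way to picture "an arbitrary sequence of bytes of some length" is a buffer that always carries its byte count and is never measured with strlen(). The ext_buffer type and its functions below are hypothetical and are not part of XEmacs; they only sketch the idea.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for externally encoded data: opaque bytes plus
   an explicit length, with no assumption about character size. */
typedef struct {
    unsigned char *data;
    size_t nbytes;
} ext_buffer;

/* Copy exactly 'nbytes' bytes, embedded zero bytes included.
   Returns 0 on success, -1 if memory could not be allocated. */
int ext_buffer_copy(ext_buffer *dst, const void *src, size_t nbytes)
{
    dst->data = malloc(nbytes ? nbytes : 1);
    if (dst->data == NULL) {
        dst->nbytes = 0;
        return -1;
    }
    memcpy(dst->data, src, nbytes);
    dst->nbytes = nbytes;
    return 0;
}

/* Release the storage and reset the buffer to an empty state. */
void ext_buffer_free(ext_buffer *b)
{
    free(b->data);
    b->data = NULL;
    b->nbytes = 0;
}
```

Because the length travels with the data, the same code handles a multibyte string and a 16-bit Unicode string without knowing which one it holds.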

To facilitate this, the enum external_data_format, which is declared in lisp.h, is expanded to contain three new formats: FORMAT_LOCALE, FORMAT_UNICODE and FORMAT_TSTR. FORMAT_LOCALE always causes encoding into a multibyte string consistent with the encoding of the current locale. The functions to handle locales are different under Unix and Windows, and locales are a process property under Unix and a thread property under Windows, but the concepts are basically the same. FORMAT_UNICODE of course causes encoding into Unicode, and FORMAT_TSTR logically maps to either FORMAT_LOCALE or FORMAT_UNICODE depending on the UNICODE preprocessor constant.
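The mapping of FORMAT_TSTR could be sketched roughly as follows. This is not the actual declaration in lisp.h: the enum here lists only the three new formats and omits whatever members already exist, and resolve_external_format() is a hypothetical helper introduced for illustration.

```c
/* Sketch only: the three new formats named in the text. The real enum
   in lisp.h has other members that are not shown here. */
enum external_data_format {
    FORMAT_LOCALE,   /* multibyte, per the current locale */
    FORMAT_UNICODE,  /* 16-bit Unicode */
    FORMAT_TSTR      /* whichever of the above matches TCHAR */
};

/* Hypothetical helper: reduce FORMAT_TSTR to a concrete format at
   compile time, following the same UNICODE constant that selects
   between the A and W versions of the Win32 API. */
enum external_data_format
resolve_external_format(enum external_data_format fmt)
{
    if (fmt != FORMAT_TSTR)
        return fmt;
#ifdef UNICODE
    return FORMAT_UNICODE;
#else
    return FORMAT_LOCALE;
#endif
}
```

With this arrangement a conversion routine only ever has to deal with the two concrete formats, and FORMAT_TSTR stays a convenience for callers of TCHAR-based functions.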

Under Unix, the behavior of FORMAT_TSTR is undefined and this particular format should not be used. Under Windows, however, FORMAT_TSTR should be used for pretty much all of the Win32 API calls. The other two formats should only be used in particular APIs that specifically call for a multibyte or Unicode encoded string regardless of the UNICODE preprocessor constant. String constants that are to be passed directly to Win32 API functions, such as the names of window classes, need to be wrapped in their definition with a call to the macro TEXT. This awfully named macro, which comes out of the Win32 API, appropriately makes a string of either regular or wide chars, which is to say the string may be prefixed with an L (causing it to be a wide string) depending on the UNICODE preprocessor constant.

By the way, if you're wondering what happened to FORMAT_OS, I think that this format should go away entirely because it is too vague and should be replaced by more specific formats as they are defined.

Ben Wing
