Programming with wide characters


Author: Leslie P. Polzer

The ISO C90 standard introduced a wide character type named wchar_t, giving the C language an official standard for wide characters. Its usage, however, is not well understood among C programmers, and debugging wide characters with the GNU Debugger is a challenge few can get to work. As a result, many programmers fall back to ASCII character arrays, which is a poor choice now that localized code matters more and more.

To use wchar_t, include the header <wchar.h>. Declare compile-time initialized characters and strings with the prefix L, for example:

wchar_t widechar = L'A';
const wchar_t *widestring = L"Hello, world!";

Only then will the C compiler create proper wide characters. On my system these are four bytes each, as opposed to a char, which is only one byte.
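To see the size difference on your own system, a minimal sketch like the following may help. The two function names are made up for illustration; the only facts relied on are that sizeof applied to a string literal yields the size of the whole array, terminator included.

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by the narrow literal "Hello", terminator included. */
size_t narrow_literal_bytes(void)
{
    return sizeof("Hello");
}

/* Bytes occupied by the wide literal L"Hello", terminator included.
   Each element is a wchar_t, so the result scales with sizeof(wchar_t). */
size_t wide_literal_bytes(void)
{
    return sizeof(L"Hello");
}
```

The first function returns 6 everywhere. The second returns 6 * sizeof(wchar_t), which is 24 on a system where wchar_t is four bytes wide.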

Discerning units

Programming with obsolete character pointers is so easy: every character is one byte wide on every platform, there are only a handful of well-known control characters, and if a character is printable it takes up exactly one screen column.

Using wide characters requires you to be more careful about these units. For example, a single character can take up more than one column of screen space, but the field width and precision in a printf format string are counted in bytes.

One of the most important applications of this knowledge is the proper usage of the functions wcslen and wcswidth:

size_t wcslen (const wchar_t *s);
int wcswidth (const wchar_t *s, size_t n);

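Use wcslen when allocating memory for wide characters: it returns a count of wide characters, so multiply it by sizeof(wchar_t) and add room for the terminating null. Use wcswidth, which returns the number of screen columns, to align text. Note that wcswidth comes from POSIX rather than ISO C, and that it returns -1 if the string contains a non-printable character.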
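As a rough illustration of the allocation side, here is a sketch of a helper that copies a wide string onto the heap. The name wide_dup is hypothetical and not part of any standard header; it only shows where wcslen and sizeof(wchar_t) enter the size calculation.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if
   the allocation fails. The caller is responsible for calling free(). */
wchar_t *wide_dup(const wchar_t *s)
{
    size_t len = wcslen(s);  /* counts wide characters, not bytes */

    /* One extra element for the terminating L'\0'; sizeof *copy is
       sizeof(wchar_t), which converts the element count into bytes. */
    wchar_t *copy = malloc((len + 1) * sizeof *copy);
    if (copy != NULL)
        wmemcpy(copy, s, len + 1);  /* copies the terminator as well */
    return copy;
}
```

Forgetting the multiplication is the typical mistake here: malloc(wcslen(s) + 1) would reserve only about a quarter of the needed space on a system with four-byte wide characters.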

Printing

The printf format string conversion specifier for wide character arrays is %ls. Do not use %S; it is only defined in the Single UNIX Specification Version 2, not in any ISO C standard.

You can use printf to output wide character strings, but wprintf is more appropriate because it handles wide characters natively. For example, the field width and precision of a %ls conversion are counted in wide characters with wprintf, as opposed to bytes with printf. The other functions from the printf family also have wide character equivalents.
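The difference in units can be sketched with the buffer-writing members of the two families, snprintf and swprintf, which avoids touching the orientation of any stream. The helper names below are assumptions made for this example. With plain ASCII input both helpers give the same result; the byte and character counts only diverge once a character needs more than one byte in the current locale.

```c
#include <stdio.h>
#include <wchar.h>

/* Byte-oriented family: the precision limits the number of BYTES
   written after each wide character is converted to multi-byte form. */
int narrow_prefix(char *out, size_t size, const wchar_t *ws, int max_bytes)
{
    return snprintf(out, size, "%.*ls", max_bytes, ws);
}

/* Wide-oriented family: the precision limits the number of WIDE
   CHARACTERS written. count is the capacity of out in wide characters. */
int wide_prefix(wchar_t *out, size_t count, const wchar_t *ws, int max_chars)
{
    return swprintf(out, count, L"%.*ls", max_chars, ws);
}
```

Note that swprintf, unlike the old sprintf, always takes the capacity of the destination buffer as its second argument.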

When using wprintf or fwprintf, the output stream must be in wide character mode. To switch an output stream, use fwide. For example, to switch stdout to wide character mode:

if (fwide(stdout, 0) == 0) /* 0 queries the current mode */
{
    /* stdout has no specific char mode yet, attempt to set to wide */
    if (fwide(stdout, 1) <= 0) /* a value greater than zero requests wide character mode */
        printf("could not switch to wide char mode!\n");
    else
        wprintf(L"switched to wide char mode.\n");
}

Once a mode is set, it cannot be changed except by calling freopen on the stream, so be sure to set the orientation — especially for stdout — early.
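To check that an orientation really sticks, a small sketch along these lines can be used. Both helper names are hypothetical; the only library call involved is fwide, whose return value is positive for a wide-oriented stream, negative for a byte-oriented one, and zero for a stream with no orientation yet.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reports the orientation of fp as
   'w' (wide), 'b' (byte) or 'n' (none yet) without changing it. */
char orientation_of(FILE *fp)
{
    int mode = fwide(fp, 0);  /* 0 only queries */
    if (mode > 0)
        return 'w';
    if (mode < 0)
        return 'b';
    return 'n';
}

/* Hypothetical helper: requests wide orientation and returns 1 if the
   stream is wide-oriented afterwards, 0 otherwise. The request has no
   effect on a stream that is already byte-oriented. */
int make_wide(FILE *fp)
{
    return fwide(fp, 1) > 0;
}
```

On a freshly opened stream, orientation_of returns 'n'. After a successful make_wide it returns 'w', and a later fwide(fp, -1) leaves that answer unchanged.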

Applying string functions

Every C programmer has worked with the functions defined in string.h. Never use these functions on wide character strings!

Nearly every one of them has a wide character equivalent: just replace the prefix str with wcs. You can find all of these functions and more (e.g. counterparts to the mem* functions, which use the prefix wmem) in the man page wchar(0).
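As a short sketch of the renaming rule, the following helper concatenates two wide strings using wcslen, wcscpy and wcscat in exactly the places where narrow code would use strlen, strcpy and strcat. The helper itself, wide_join, is a made-up name for this example.

```c
#include <wchar.h>

/* Hypothetical helper: writes a followed by b into dst.
   capacity is the size of dst in wide characters, not bytes.
   Returns 0 on success, -1 if the result would not fit. */
int wide_join(wchar_t *dst, size_t capacity,
              const wchar_t *a, const wchar_t *b)
{
    /* wcslen replaces strlen; +1 accounts for the terminating L'\0' */
    if (wcslen(a) + wcslen(b) + 1 > capacity)
        return -1;

    wcscpy(dst, a);  /* replaces strcpy */
    wcscat(dst, b);  /* replaces strcat */
    return 0;
}
```

Comparing the result works the same way: wcscmp takes the place of strcmp.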

Wide characters and multi-byte strings

Multi-byte strings are the classic way to encode alphabets that do not fit into the classic 256-character map, such as Chinese. There are two functions for converting multi-byte strings to wide character strings and vice versa:

#include <stdlib.h>

size_t mbstowcs(wchar_t *dest, const char *src, size_t n);
size_t wcstombs(char *dest, const wchar_t *src, size_t n);

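Both functions return (size_t)-1 if they encounter a sequence or character that cannot be converted. Otherwise, mbstowcs returns the number of wide characters written to dest and wcstombs returns the number of bytes written, in both cases not counting the terminating null. Note that the conversion depends on the LC_CTYPE category of the current locale, which a program typically picks up from the environment by calling setlocale(LC_CTYPE, ""); this is because byte values greater than 127 represent different characters depending on the locale settings.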
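A conversion into freshly allocated memory might be sketched as follows. The name mb_to_wide is hypothetical, and the sketch assumes the POSIX behavior that calling mbstowcs with a NULL destination returns the number of wide characters the result would need. It also assumes the caller has already selected a locale with setlocale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts the multi-byte string src into a newly
   allocated wide string according to the current LC_CTYPE locale.
   Returns NULL on an invalid sequence or allocation failure.
   The caller is responsible for calling free(). */
wchar_t *mb_to_wide(const char *src)
{
    /* Assumption (POSIX): a NULL destination only counts the wide
       characters needed, excluding the terminator. */
    size_t needed = mbstowcs(NULL, src, 0);
    if (needed == (size_t)-1)
        return NULL;

    wchar_t *dest = malloc((needed + 1) * sizeof *dest);
    if (dest == NULL)
        return NULL;

    /* Second pass performs the conversion; needed + 1 leaves room
       for the terminating L'\0'. */
    if (mbstowcs(dest, src, needed + 1) == (size_t)-1) {
        free(dest);
        return NULL;
    }
    return dest;
}
```

The two-pass approach costs an extra scan of the input but avoids guessing a buffer size, which matters because the number of wide characters is not known from the byte length alone.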

GNU Gettext uses multi-byte strings only, and isn’t likely to change in the near future. This is because wide characters incur a lot of space overhead. Consequently, you will have to apply mbstowcs to the strings gettext() supplies.

Displaying ASCII wide characters in GDB

The GNU Debugger does not know how to interpret a wchar_t pointer. This is because the GDB developers are unsure how to correctly handle different character sets being used by the system, GDB and the program being debugged. Another issue is the printability of characters: whether a character is printable depends on the font used and its semantics (for example, some characters are control characters).

While these two issues are certainly genuine, the GDB developers failed to add basic support to GDB for printing simple ASCII characters, a situation which calls for remediation.

I wrote a script to make this possible by defining a new command wchar_print (it should suffice to type wc). You are welcome to download it, and can employ it in a number of ways:

  • Put its contents in your .gdbinit file
  • Include it in your .gdbinit file with the source command
  • Make GDB read it at startup with gdb -x wchar.gdb
  • Make GDB read it in a session: (gdb) source wchar.gdb

You can call the function with a pointer to wchar_t:

(gdb) wchar_print widestring
"Hello, world!"

Note, though, that wchar_print will produce incorrect results for wide characters with a value greater than 127.

I would like to thank Steve Graegert for helpful clarifications regarding wide character usage.
