Using Unicode in Linux

7470

Author: Michał Kosmulski

Last time we talked about Unicode and its benefits. If you’ve decided you want to reap those benefits yourself, here’s how to convert a Linux system from another encoding system to Unicode.

First of all, check whether you’re already using a Unicode locale. The command locale prints out the values of environmental variables that influence the locale settings. A complete description of their meanings is available in locale man pages. Usually, locale names consist of a lowercase language code followed by an underscore and an uppercase country code (e.g. en_US for U.S. English). Unicode locale names that use UTF-8 encoding additionally end with “.UTF-8.” If such names are present in the output of locale, you are already using a Unicode locale.

If you do need to make the conversion, back up all your important data first, as you’ll be converting your disk’s filesystems. Note that backups made prior to converting the filesystem and thereafter are somewhat incompatible. As we noted last time, the operating system and many utilities do not realize what characters the bytes in file names represent. Among the utilities with this problem is the tar program, which is a popular backup tool. If you are using an en_US locale now and some filename contains the character “ä” (German A umlaut), it is represented as a single byte: hex 0xE4. After you move to UTF-8, it will be represented as two bytes: 0xC3 0xA4. However, neither the filesystem nor tar know that these two different byte sequences can represent the same character. If you restore your old file from backup after moving to a UTF-8 locale, the old one-byte sequence will be used in the restored file’s name, making it different from the new version’s name. Under a UTF-8 locale, this single byte won’t be considered “ä” but rather an invalid UTF-8 sequence, and will be displayed as a placeholder or as an octal representation of the erroneous byte only. So if you restore data from older backups or archives after you move to UTF-8, you may need to run a filename conversion similar to the one we’ll describe below on the extracted files in order to get the filenames correct.

To be able to use UTF-8 locales without having to invest too much work in it, you will need glibc (GNU C library) version 2.2 or newer (any reasonably modern distro should have it). You can check your version by running /lib/libc.so.6.

The following paragraphs give a step-by-step description of how to perform the conversion. Most operations described below must be performed as root.

Setting the locale

Certain environment variables tell applications which locale is to be used. Commonly used variables are:

  • LC_ALL — When set, the value of this variable overrides the values of all other LC_* variables.
  • LC_* — These variables control different aspects of the locale. For example, LC_CTYPE controls the way upper- to lowercase conversion takes place, while LC_TIME controls the date and time format. LC_MESSAGES defines the language for application messages. Details can be found in the man page for locale(7).
  • LANG — If LC_ALL is not set, then locale parameters whose corresponding LC_* variables are not set default to the value of LANG.

Before modifying your locale, remember or save to a file the output of locale, which shows your current locale. Also, note down the output of locale -k LC_CTYPE | fgrep charmap (your current character encoding), as you will need this information later on.

In order to tell applications to use UTF-8 encoding, and assuming U.S. English is your preferred language, you could use the following command:

export LC_ALL=en_US.UTF-8

Applications started afterwards from the same terminal window should be aware of UTF-8. To check if that’s the case, you could for example use the command wc. wc -c will tell you the number of bytes and wc -m the number of characters in a file or in data read from standard input (end typing with Enter and Ctrl-D). In a UTF-8 locale, if text contains non-ASCII characters, the number of bytes will be greater than the number of characters. For example:

user@host:~$ wc -c
Bär
5
user@host:~$ wc -m
Bär
4

This three-character word is encoded using four bytes in UTF-8 (the extra character or byte is the end-of-line marker).

If the test failed (i.e. wc returns the same number in both cases), your system probably came without UTF-8 locale definitions, and you will have to use localedef to generate them. For example, if en_US.UTF-8 is missing, you can generate it from en_US using:

localedef -i en_US -f UTF-8 en_US.UTF-8

Since values of environment variables last only as long as your session, you have to put your export commands in /etc/profile so that they are run for each user the next time he or she logs in. If you perform your work from inside KDE, you will have to log out and back in so that environmental variables can be re-read in order for changes to take effect. GNOME seems to always use UTF-8 internally, even if the locale is not UTF-8-based. No matter which desktop environment you are using, it may be necessary to log out and, if you are using a login manager (e.g. KDM or GDM), restart the X Window System by pressing Ctrl-Alt-Backspace so that /etc/profile is re-read and all applications come to know about the new locale.

Converting filesystems

The next step is to convert your filesystems. This is the only risky part of the transition, so do make a backup of all important data from your disks if you haven’t done so yet.

As noted above, the Linux kernel doesn’t care about character encodings. For common Linux filesystems (ext2, ext3, ReiserFS, and other filesystems typical for Unices), information that a particular filesystem uses one encoding or another is not stored as a part of that filesystem. Only locale-controlling environment variables tell software that particular bytes should be displayed as one or another character. Filesystems found on Microsoft Windows machines (NTFS and FAT) are different in that they store filenames on disk in some particular encoding. The kernel must translate this encoding to the system encoding, which will be UTF-8 in our case.

If you have Windows partitions on your system, you will have to take care that they are mounted with correct options. For FAT and ISO9660 (used by CD-ROMs) partitions, option utf8 makes the system translate the filesystem’s character encoding to UTF-8. For NTFS, nls=utf8 is the recommended option (utf8 should also work). Add these mount options to filesystems of these types in your /etc/fstab to make them mount with the correct options. A fragment of /etc/fstab might then look like this (other options may vary depending upon your setup):

/dev/hda2        /mnt/c           ntfs        defaults,ro,nls=utf8                        1 0
/dev/hda3        /mnt/d           vfat        defaults,quiet,utf8                         1 0
/dev/cdrom       /mnt/cdrom       iso9660     defaults,noauto,users,ro,utf8               0 0
# If using supermount, add "utf8" to the options _after_ two dashes, e.g.
#none             /mnt/cdrom       supermount  fs=iso9660,dev=/dev/cdrom,--,auto,ro,utf8   0 0
/dev/fd0         /mnt/floppy      auto        defaults,noauto,users,rw,quiet,utf8         0 0

After you modify /etc/fstab, you should remount the filesystems in question by issuing a mount -o remount /mnt/mount-point command for each of them. Non-ASCII characters in filenames on those filesystems should now be displayed correctly again. Note that this requires the kernel to be capable of converting between character sets, so support for UTF-8 must be compiled in or available as a module. This option is available in “File systems”->”Native Language Support”->”NLS UTF-8″ in the kernel configuration program. Depending upon which encoding your Windows partitions use, you may also need to compile in support for that encoding, too. Check this page for a list of codepages used by various language versions of FAT. NTFS always uses Unicode internally and doesn’t need any kernel NLS options except for UTF-8 support.

Native Linux filesystems do not store information about the character encoding used, so you must physically change the names of all files to the new encoding, as opposed to the simple remounting of FAT and NTFS volumes. In theory, all you need to do is execute a command like:

mv original-filename filename-in-UTF-8-encoding

for each file. In practice, things tend to be a little more complicated. First of all, you may already have UTF-8-encoded filenames on your disk without knowing it. For example, some GNOME applications tend to create UTF-8 filenames regardless of the locale used, and kbd (a set of utilities for handling console fonts) comes with a sample file called ♪♬ (two music notes) in the documentation. During conversion, these files need to be identified and their names left unchanged.

Another issue to look out for is directories. Since both the directory name and names of files contained within may need changing to their UTF-8 equivalents, you can’t simply create a list of all files and directories and then perform mv old-name new-name for each of them. If you did and first renamed a directory, then the path referring to the files in that directory would no longer be valid. So, order is important. In the directory tree, the leaves (i.e. files) should be renamed first, then the lowest-level directories, then their parent directories, and so on.

Below is a script that tries to perform the necessary convertion automatically. Note that using it may be dangerous — back up important data first! While it should work in most cases, this script isn’t bulletproof. In order to keep things simple, it doesn’t handle some special cases such as spaces in mount paths (which are really rare) and read-only filesystems (it is not obvious what should be done with those; if you do intend to convert a read-only mounted hard disk partition, remount it manually as read-write with mount -o remount,rw /some/mount/path before running the script). Depending on filesystem size and the number of files that need conversion, this script may take a long time to complete, especially since for simplicity’s sake it is far from optimal (all this could probably be done in Perl in a much more compact way).

Remember to modify orgcharset in the script below to the name of your old character set found out in one of previous steps using locale -k LC_CTYPE.

#!/bin/sh

fstab=/etc/fstab
orgcharset=INVALID_CHARSET_NAME

export LC_ALL=POSIX

# Find filesystems suitable for conversion
filesystems=`awk '!/vfat|ntfs|iso9960|udf|auto|autofs|swap|subfs|sysfs|proc|devpts|nfs|smbfs|^#/{print $2}' "$fstab"`
# Locate files whose names need to be converted and sort the list
find $filesystems -xdev | {
	while read; do
		# Check if the filename needs conversion (i.e. is not a correct UTF-8 string)
		if ! echo `basename "$REPLY"` | iconv -f UTF-8 -t UTF-8 &>/dev/null; then
			echo "$REPLY"
		fi
	done
} | sort -r | {
	# Rename files
	while read; do
		dirname=`dirname "$REPLY"`
		orgfname=`basename "$REPLY"`
		newfname=`echo "$orgfname" | iconv -f "$orgcharset" -t UTF-8`
		if [ $? -ne 0 ]; then
			echo "Error: iconv failed for $REPLY. Skipping." >&2
			continue
		fi
		mv "$REPLY" "$dirname"/"$newfname"
	done
}

Converting text files

It is convenient for user text files to use the system’s default encoding, so after moving to UTF-8 you may want to convert your text files too. Converting configuration files isn’t necessary, as programs that can handle non-ASCII data in their config almost always use UTF-8 for storing it already. You can convert a single text file with iconv:

iconv -f old-encoding -t UTF-8 filename > temp.tmp && mv temp.tmp filename

Again, make sure this actually works before playing with important data.

Getting fonts with Unicode support

Unicode fonts for the text console are usually shipped with major Linux distributions. To enable UTF-8 on the console, run unicode_start (unicode_stop to return to previous one-byte encoding mode).

In order to be able to actually see Unicode characters displayed by X applications, you need to download and install Unicode fonts. Bitstream Vera is a TrueType font available under an open source license that now comes with many Linux distributions. Unfortunately, it contains few characters. An extended version, with support for most Latin accented characters, is called Hunky Font. A family of Unicode fonts called FreeFont is available under the GPL. There are also a number of free-as-in-beer fonts on the Web, including Microsoft Core Fonts (a package containing among others the popular typefaces Arial and Times New Roman), Bitstream Cyberbit (only Roman style is available, but it has a very good Unicode coverage), Gentium, and many others. Of course, there are also lots of commercial fonts that can be used with X.

Summary

Using UTF-8 has many advantages over using a single-byte locale. A minor one is the ability to use any character in file names and on the command line. The main advantage of Unicode, however, is that it allows easier data exchange and better interoperability than any other character set. UTF-8 is meant to replace ASCII in the future, so at some point “text file” is going to mean “UTF-8 file” just as it means “ASCII file” now.

Links

Unicode Consortium
Unicode page in Wikipedia
UTF-8 page in Wikipedia
man page for UTF-8 (or run man 7 utf8)
man page for Unicode (or run man 7 unicode)
UTF-8 and Unicode FAQ
Unicode HOWTO (somewhat outdated)

A continually updated version of this article can be found at the author’s Web site.

Michał Kosmulski is a student at Warsaw University
and Warsaw University of Technology.