PROGRAMMING Bilingual Programming
Bilingual Programming

Second language

Linux is international. It was started by a programmer from Finland who speaks Swedish, aided by a Welsh-speaking lieutenant, and supplemented with a kernel maintainer from Brazil. So why is all our software written in English? This month: multi-lingual development and the gettext package.

BY STEVEN GOODWIN
The English language holds the same power in today's society that Latin did many hundreds of years ago. It's not the most expressive language, nor

Turning Japanese

GNU/Linux uses a technique known as locales to determine many things: the appropriate translations for text, the

equivalent by using the code in box: In PHP.

    #include
66 June 2004 www.linux-magazine.com Bilingual Programming PROGRAMMING
...indicate which lines of text will need translating. We can do this by calling a special function (called, not surprisingly, gettext) that will consult the dictionary and convert our string to something suitably foreign.

    printf(gettext("Hello World!\n"));

This function can be found in the libintl header file, so we must,

    #include

translated. We shall shortly see a tool that makes use of these markers itself to help build the dictionary of translations. If we were to build the dictionary manually (but why would we?!), the gettext_noop marker would be unnecessary.

Some programmers prefer to replace this nine character marker with a single character macro, such as the underscore. This is because the word gettext (and both brackets) can cause many lines to break the 80 character limit. This is simply,

Box: Locale Categories

A category defines a set of data, and every supported language has its own set of data. The category might define the way to impart particular information: numbers over 1000 might be separated by commas or dots, for example, or the date might be written day-month-year or month-day-year. This information is not related to the language as such, which is why the term 'locale' is used, constituting both language and cultural specifics. A directory is created for each category. There are standard functions to format
    $ xgettext -C -d lm <(gcc -E helloworld.c)

In this example we specify the -C flag to indicate that the piped result is a C source file. Users of automake will have an easier life, since the Makefile will generate these files automatically.

You will also note that the file contains comments using the familiar hash symbol. These comments come in four flavors, and are determined by the character immediately following the hash, as seen in Table 1.

The xgettext program can also add comments into the PO file when, for example, it believes the strings may be used for special formatting. The PO file also contains a header to indicate the revision date of the file, and the translator that last edited it.

Having now gotten this template file, we need to create a catalog for a foreign language. Like French.

Tour De France

We start by making a simple copy of the template file, and adding the appropriate French words to each msgstr.

    msgid "Hello World!\n"
    msgstr "Bonjour, le monde"

which highlights the deliberate mistake above. Did you spot it? See Listing 1.

The first warning simply reminds us that we haven't changed the header information yet. We can fix that by amending the line to use the appropriate characterization.

    "Content-Type: text/plain; charset=ISO-8859-1\n"

To determine an appropriate code you can refer to the box: ISO 8859, or [1] for a more detailed analysis. This information is of more use to translators than programmers. As is the extensive functionality provided by [2].

The error itself is easily fixed, and in larger programs, more difficult to spot by humans. It can also check the strings for the correct number (and type) of arguments using the -c option. We're now ready to test it!

Norwegian Wood

In order to convince our program to use an appropriate language dictionary, we need to add a couple of further lines of code to indicate that we're happy about using a locale. These are straightforward, and common to all such programs.

    #include

locale directory, be careful not to change directory, as this path would then become unreachable.

While in our local root directory, we must create a locale directory, and copy our lm.mo to the appropriate place in the tree. That place being,

    $ mkdir -p locale/fr/LC_MESSAGES
    $ cp lm.mo locale/fr/LC_MESSAGES

Since the package is called 'lm.mo' in every language, we use the directory name to distinguish between a French lm.mo and a German lm.mo. This name is determined by the conventional language codes, as detailed at [3]. The directory named LC_MESSAGES is needed because of the wide variety of different locale information that might be present. There can also be directories to indicate the format of the date, and how to represent numbers. See box: Locale Categories for a full list.

Now you can run your program (without having to recompile), using a French locale, and witness the result.

    $ LANG=fr_FR ./hello
    Bonjour, le monde

For a more permanent change of locale,
their appropriate ISO-8859 sets). Generating a French locale can be done easily with,

    $ su
    # you must be root to do this
    Password:
    # echo "fr_FR ISO-8859-1" >> /etc/locale.gen
    # locale-gen
    Generating locales...
    fr_FR.ISO-8859-1... done
    Generation complete.

Debian users can also use dpkg-reconfigure locales.

You can test this using your own program, or (if you think the bug belongs to Hello World!) one of the multi-lingual GNU tools, such as rm.

    $ LANG=fr_FR rm this_wont_exist
    rm: Ne peut enlever `this_wont_exist': Aucun fichier ou répertoire de ce type

To make your dictionary available to others, you should install it into the global repository of .mo files at /usr/share/locale/ (or the location specified by the environment variable, TEXTDOMAINDIR). This directory uses the same hierarchy given above. Installing your text here (which also requires superuser privileges) means your code no longer needs to specify a directory to the bindtextdomain function, and you can replace the directory name with NULL.

Having now understood the technical process behind multi-lingual software, let us review some of the finer details we need to consider when programming.

Spanish Eyes

Most developers have a method for dealing with strings, like their favorite string library, for example. They also have their own methods for building strings dynamically, either to add plurals, or build large sentences from component parts (like the verbal Lego of automated train announcements). We shall now cover a number of these methods, highlighting the problems (and solutions) involved.

    printf("Deleting %d file%s", iNum, iNum==1?"":"s");

Above is a common example to create a plural. The case of 'one file' requires a singular noun, whereas everything else uses the plural, files. That's in English! Not all languages follow this pattern. The case of 'zero files' might not be plural (as in French), or there could be separate words for zero, one and two (such as those in the Baltic family). To compensate for this, a separate function, ngettext, is available which takes two string IDs (one for singular, and one for plural) and a number. The number is then used to determine which version of that string should be used in translation.

    printf( ngettext("Deleting %d file", "Deleting %d files", iNum), iNum);

Upon seeing the ngettext marker, the xgettext program will generate two string IDs in the .PO file, ready for the translator, along with a special c-format comment, which we'll come to shortly.

    #: helloworld.c:32
    #, c-format
    msgid "Deleting %d file"
    msgid_plural "Deleting %d files"
    msgstr[0] ""
    msgstr[1] ""

Not all the problems are solved by ngettext though. At some point you will come across problems that occur when we use two or more arguments in a printf, because the word order is imperative. Even in a simple (English) program, a mismatched %d and %s can cause printf to core dump. After translating a simple phrase, such as "There are %d files named %s", it is not unreasonable for the resultant text to appear as "With the name %s, there are %d files". What's more, since we (as programmers) do not know about every other possible translation, it is not something we can prevent. More subtle problems can occur with phrases like "Copying file from %s to %s".

There are two methods of resolving the word order problem. The first requires that the translator modify the wording so that the arguments always appear in the right order. The msgfmt command can then be called using the -c option, so that it will perform checks on the .PO file. This option actually performs three separate checks. They are, format (the one we need in this instance), header (the presence and contents of the header) and domain (checking for problems with the domain directives).

The second solution places the onus on the programmer, and is preferred. In this case, the format string must be amended to describe the order of the parameters. So, using our copy file example above, this would give us,

    printf( gettext ("Copying file from %1$s to %2$s"), pSrc, pDest);

The special format specifiers, %1$s and %2$s, are handled by the printf code in glibc. Non-GNU variants may not be so feature-full.

Having highlighted the word order problem, you should now be aware that constructing strings at run-time is a bad idea. The solutions we have available to us can only work when the entire string is given to the translator. Splitting text up into sections and using strcat (or similar) should be avoided at all costs, since the translator has no understanding of the ordering (or the ability to change it), or the meaning of the sentence. Each string contained in the catalog must make sense when presented on its own.

Box: ISO 8859

    ISO           Characterization
    ISO 8859-1    Western, or west European
    ISO 8859-2    Central European, or east European
    ISO 8859-3    South European, or Maltese (and Esperanto)
    ISO 8859-4    North European
    ISO 8859-5    Eastern European, Cyrillic alphabets like Russian
    ISO 8859-6    Arabic
    ISO 8859-7    Greek
    ISO 8859-8    Hebrew
    ISO 8859-9    Turkish
    ISO 8859-10   Nordic (Sámi, Inuit, Icelandic)
    ISO 8859-11   Thai
    ISO 8859-12   (was Celtic, but withdrawn)
    ISO 8859-13   The Baltic Rim
    ISO 8859-14   Celtic
    ISO 8859-15   Euro
    ISO 8859-16   South eastern European (incorporates euro symbol)
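To see ngettext and the numbered format specifiers working side by side, here is a minimal sketch in C. It assumes glibc (for libintl and the %1$s style of specifier) and reuses the lm domain and a relative locale directory as described in the text. The helper names init_translations, format_delete_message and format_copy_message are our own inventions for illustration, not part of gettext.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: select the "lm" catalog.
   The relative "locale" directory is an assumption taken from the text. */
void init_translations(void)
{
    setlocale(LC_ALL, "");          /* honour LANG and LC_* from the environment */
    bindtextdomain("lm", "locale"); /* search ./locale/<lang>/LC_MESSAGES/lm.mo */
    textdomain("lm");               /* make "lm" the default domain for gettext */
}

/* Plural-aware message: ngettext picks the singular or plural string
   (or one of the translator's msgstr[n] entries) from the count. */
int format_delete_message(char *buf, size_t size, int count)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size,
                    ngettext("Deleting %d file", "Deleting %d files",
                             (unsigned long)count),
                    count);
}

/* Word-order-safe message: the numbered specifiers let a translator
   swap the two arguments without any change to this code. */
int format_copy_message(char *buf, size_t size,
                        const char *src, const char *dest)
{
    assert(buf != NULL && size > 0 && src != NULL && dest != NULL);
    return snprintf(buf, size,
                    gettext("Copying file from %1$s to %2$s"),
                    src, dest);
}
```

Call init_translations() once at the start of main. With no catalog installed, gettext and ngettext simply hand back the English strings, so the program still behaves sensibly before any translator has seen it.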
    /* Don't code like this!! */
    /* szMessage is an assumed destination buffer; the original omitted it */
    strcpy(szMessage, "Copying file from ");
    strcat(szMessage, pSrc);
    strcat(szMessage, " to ");
    strcat(szMessage, pDest);

In some applications, the most difficult word to translate is 'the'! English has only one word for the definite article, 'the'. French, German, Spanish, and many others don't. Depending on the language, they may have special versions for masculine, feminine, neuter and plural. The same is true of the indefinite article, 'a'. Normally, these words will be included as part of the standard translation. By now you should have learnt that building strings dynamically is not a good idea. In some cases it can be very tempting to cut down on the quantity of translations required as in Listing 2.

We should modify this so that the strings read 'a directory' and 'a file', so the translated versions will work regardless of gender. However, you might argue, if we also had a portion of the program that produced a short version of the file listing, we would be doubling the work for the translator! For instance, in Listing 3.

That's true. We are doubling the work! However, this extra work is minimal. Especially compared to the programmer hassle that might otherwise be involved, or the cringe-inducing gender misuse when the wrong version of 'the' is prepended to the words.

China Girl

The last implementation problem we shall mention involves aesthetics. This refers to the screen layout, the menus of a GUI, and the use of tab stops. Although your program may look nicely formatted in English, as soon as any of the words change, your pre-determined layout will break. German words, for example, are on average 50% longer than their English equivalents. You have two choices. Either ignore word length, or code around it.

Most (if not all) command line utilities are unconcerned with special formatting. The information is functional and uniform, making it suitable for parsing by scripts. GUI software may explicitly place text in two columns, at X1 and X2, in order to appeal to the end user. There's nothing wrong with wanting to appeal to the end user! Unfortunately, when running under a different locale, the text in the left column may overrun the text in the right.

To avoid this problem you will need to write some more code. This might involve adjusting the position of the right hand column, perhaps by calculating the longest piece of text in the left, or you might need to word-wrap everything. It might involve scrolling the text within the visible window (like XMMS). It might simply chop all characters that overrun, and ask the translator for shorter versions. The solution you employ will vary according to the amount of work you, and your translators, are willing to do. Only applications that sell on their presentation abilities (like games) should consider this a necessity.

Vienna

As software develops, more and more strings will be added to the program. Re-translating the whole program every time is obviously wasted effort. So instead, we should use the msgmerge tool. This takes the original language template (the .PO file, that's often renamed to .POT) without any translations, and the newest language-specific catalog to build a new .PO. This new file contains all the original translations, combined with the new, as yet untranslated, strings.

    $ msgmerge old_po_file.pot current_language_po.po > new_language_po.po

Metropolis

With the gettext package, we can create truly multi-lingual software, even if we can't speak any of the languages in question. Using separate language catalogs allows the translation work to be distributed amongst those who can speak different tongues, without having to recompile the code. This makes it a fully data-driven, distributed, piece of development work.

So with that thought I bid you all a fond farewell. Au revoir. Auf Wiedersehen. Adiós and Arrivederci! ■

Box: Unicode

All the examples in this article use ASCII characters. This covers most western languages, but neglects those character sets requiring two bytes, such as Chinese. In order to support them fully, we need to work in Unicode. This involves a much larger quantity of work, as the basic char type can not be used, and is instead replaced by wchar_t. Also, many of the well-known functions (like sprintf) need to be adapted to use their equivalent wide versions, like swprintf.

INFO

[1] ISO8859 Alphabet Soup: http://wwwwbs.cs.tu-belin.de/user/czyborra/charsets/
[2] Data on languages: http://www.eki.ee/letter/
[3] Language codes: http://www.loc.gov/standards/iso639-2/langcodes.html
[4] FAQ for GNU gettext: http://www.haible.de/bruno/gettext-FAQ.html#integrating_noop

Listing 2: Smaller translations

    if ( mygetfiletype(szFilename) == DIRECTORY)
        pFiletype = gettext ("directory");
    if ( mygetfiletype(szFilename) == FILE)
        pFiletype = gettext ("file");
    printf (gettext ("%1$s is a %2$s"), szFilename, pFiletype);

Listing 3: Doubling the work

    if ( mygetfiletype(szFilename) == DIRECTORY)
        pFiletype = gettext ("directory"); /* same strings as before - does this mean less work? */
    if ( mygetfiletype(szFilename) == FILE)
        pFiletype = gettext ("file");
    printf("%s : %s", szFilename, pFiletype); /* no translation required here */
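The change suggested for Listing 2 (strings that read 'a directory' and 'a file') might be sketched as follows. This is only an illustration under a few assumptions: the EntryType enum stands in for whatever mygetfiletype returns, the function names are hypothetical, and the numbered specifiers again rely on glibc's printf family.

```c
#include <assert.h>
#include <libintl.h>
#include <stdio.h>

/* Hypothetical stand-in for the DIRECTORY / FILE constants of Listing 2. */
typedef enum { ENTRY_FILE, ENTRY_DIRECTORY } EntryType;

/* Each msgid carries its own article, so the translator can choose the
   correct gender rather than inheriting a fixed "a" from the sentence. */
const char *entry_type_phrase(EntryType type)
{
    return type == ENTRY_DIRECTORY ? gettext("a directory")
                                   : gettext("a file");
}

/* The sentence keeps numbered specifiers so its word order can change. */
int describe_entry(char *buf, size_t size, const char *name, EntryType type)
{
    assert(buf != NULL && size > 0 && name != NULL);
    return snprintf(buf, size, gettext("%1$s is %2$s"),
                    name, entry_type_phrase(type));
}
```

The catalog gains no extra entries compared with Listing 2, but each entry now makes sense on its own. A stricter alternative would be one complete sentence per file type, at the cost of more strings to translate.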