PROGRAMMING Bilingual Programming

Bilingual Programming: Second language

Linux is international. It was started by a programmer from Finland who speaks Swedish, aided by a Welsh-speaking lieutenant, and supplemented by a kernel maintainer from Brazil. So why is all our software written in English? This month: multi-lingual development and the gettext package.

BY STEVEN GOODWIN

June 2004, www.linux-magazine.com

The English language holds the same power in today's society that Latin did many hundreds of years ago. It's not the most expressive language, nor is it the most popular. It certainly isn't the easiest to learn. It is, however, the most widespread. With the remnants of the old British Empire still present, and the continued growth of America, people are required to use English in order to compete on the world stage.

Computers and the Internet have increased this linguistic stranglehold. More web pages exist in English than in any other language. Most programming languages use English words like if and while, regardless of the designer's nationality. Most software uses prompts and error messages that are written in English.

However, with Linux taking control of many different systems across the globe, it would appear xenophobic of us to continue developing 'English-only' software. Adding the ability to change the language (or locale) of your software [...] general. Even if you cannot translate the text yourself, you can make it easier for someone else to do so by following the guidelines in this article.

Turning Japanese

GNU/Linux uses a technique known as locales to determine many things: the appropriate translations for text, the character set required to represent the alphabet, and cultural specifics like the expression of numbers, or the date. Each area is considered in the box: Locale Categories, although the focus of this article will be on text translation.

Box: Locale Categories

A category defines a set of data, and every supported language has its own set of data. The category might define the way to impart particular information: numbers over 1000 might be separated by commas or dots, for example, or the date might be written day-month-year or month-day-year. This information is not related to the language as such, which is why the term 'locale' is used, constituting both language and cultural specifics. A directory is created for each category.

There are standard functions to format these locale strings. For example, strfmon and strftime format the text for money and time data, respectively.

    Category      Meaning
    LC_COLLATE    Order of string collation
    LC_CTYPE      How to define characters. Echoes of ctype.h, as this also performs upper/lower case conversion
    LC_MESSAGES   The translated text. The focus of this article
    LC_MONETARY   Format and symbols for money
    LC_NUMERIC    Format and symbols for numbers
    LC_TIME       Format and symbols for time and date

So let us start with the simplest program we know, Hello World. We shall be coding in C, although the same techniques can be applied regardless of language. You'll be able to test the PHP equivalent by using the code in the box: In PHP.

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        printf("Hello World!\n");
        return 0;
    }

Box: In PHP

Writing multi-lingual software in PHP is no different from using C. The functions even have the same name! However, when run as part of a web page, it might be more suitable to specify the locale explicitly. The locale might come from a session variable, or from a cookie on the user's machine.

The effect of setlocale can also be achieved by using the putenv function.

    putenv ("LANG=fr");

It's fairly obvious to us where the translated string will need to go. At compile time, however, we do not know what the replacement string will be, or what languages it will need to be in. This prevents us from including any translation data directly in the program. Instead, we must build up catalogs of each word and phrase used by our program, and employ the gettext package to act like a dictionary. This will replace our (English) words with the correct foreign version at run time. What is 'correct' will be determined by the user's specific locale. A catalog is needed for each language we need to support.

Marking the source is a simple process. We, as programmers, must work through each line of the code and indicate which lines of text will need translating. We can do this by calling a special function (called, not surprisingly, gettext) that will consult the dictionary and convert our string to something suitably foreign.

    printf(gettext("Hello World!\n"));

This function is declared in the libintl header file, so we must add,

    #include <libintl.h>

Compiling under GNU/Linux requires no extra link libraries for the code to work. The word GNU is essential here, because the internationalization features are included directly in glibc. Users of other UNIX-like systems may not be so lucky. However, without a language catalog, no translations will be made. That doesn't matter at the moment, since the English text will be output in all cases where a translation cannot be found.

C programmers will also note that this method is not all-encompassing, because there is more than one way to declare a string. So far, we've only learnt one way to mark strings for translation. We will need another method to cope with those cases where a function call to gettext would not compile, such as the initializer of a global or static variable, which must be a compile-time constant. For example,

    char *pHello = "Hello, World!\n";

To circumvent this problem, we need to create a macro that includes a marker, but has no adverse effect on the syntax.

    #define gettext_noop(String) String
    ...
    char *pHello = gettext_noop("Hello, World!\n");

We then need to invoke the translation module in the usual way, before we output the string. Like so,

    printf (gettext (pHello) );

These markers not only perform the translation when the program is running, but also indicate to us what text needs to be translated. We shall shortly see a tool that makes use of these markers itself to help build the dictionary of translations. If we were to build the dictionary manually (but why would we?!), the gettext_noop marker would be unnecessary.

Some programmers prefer to replace this nine-character marker with a single-character macro, such as the underscore. This is because the word gettext (and both brackets) can cause many lines to break the 80 character limit. This is simply,

    #define _(str) gettext (str)
    #define N_(str) gettext_noop (str)

The GNU standard prefers a space between function name and bracket, but this is often omitted.

We can now move on and build our foreign language dictionary.

Vienna Calling

Building a file that contains all the strings in a program is not as time-consuming as you might think. Naturally, it is a very common task, and can be achieved by using a tool named xgettext. This is one of the few instances where the 'x' does not stand for an X Window program. Instead, it is short for 'extract'. This program will search the source file for any string used in conjunction with the function call gettext (or gettext_noop) and place the text into a catalog file (with the suffix .po) ready to be translated. The program understands enough about C, and about other languages (see box: xgettext: Supported Languages), to recognize the syntax of a function call, and differentiate it from variables and comments.

    $ xgettext -d lm helloworld.c
    $ tail -n 3 lm.po
    #: helloworld.c:5
    msgid "Hello World!\n"
    msgstr ""

Box: xgettext: Supported Languages

    C, C++, ObjectiveC
    awk
    PO
    YCP
    Python
    Tcl
    Lisp, EmacsLisp
    RST
    librep
    Glade
    Java

As you can see, each piece of text has a marker ID and an equivalent string, ready for translating. This string can only hold a translation for one specific language, so this file becomes a template. Each translator takes a copy of it, and translates the text within it to his or her native tongue. Sometimes, this .po file is renamed to .pot to differentiate between the template and the language-specific catalog files.

Note that xgettext will search for the function name gettext. It does not understand enough of the C syntax (or that of any other language) to follow techniques like #define _(str), given above. This doesn't preclude the use of such tricks, however. There are two popular solutions. One is to specify the underscore as an additional keyword that will act in the same manner as if it were gettext.

    $ xgettext -d lm -k_ helloworld.c

Alternatively, you could pre-process your C file (causing the macro to be expanded) before running xgettext.


The pre-processing approach can be run as follows:

    $ xgettext -C -d lm <(gcc -E helloworld.c)

In this example we specify the -C flag (shorthand for --language=C++) so that xgettext knows how to parse the piped input, which has no file extension to identify it. Users of build systems whose Makefile generates these files automatically will have an easier life.

You will also note that the file contains comments using the familiar hash symbol. These comments come in four flavors, and are determined by the character immediately following the hash, as seen in Table 1.

Table 1: Hash symbols

    Character       Comment type   Notes
    . (period)      Automatic      Should not be touched
    : (colon)       Reference      The file and line number of the string
    , (comma)       Flag           To indicate the translation is 'fuzzy', for example
    (whitespace)    Translator     As entered by a human

The xgettext program can also add comments into the .po file when, for example, it believes the strings may be used for special formatting. The .po file also contains a header to indicate the revision date of the file, and the translator that last edited it.

Having now gotten this template file, we need to create a catalog for a foreign language. Like French.

Tour De France

We start by making a simple copy of the template file, and adding the appropriate French words to each msgstr.

    msgid "Hello World!\n"
    msgstr "Bonjour, le monde"

We can add the string(s) either by modifying the file directly, or by using one of the many tools available. Translators using the Emacs editor have an advantage here, since they may use PO mode. For those who favor a GUI, the program poEdit can also be used.

To be used by our Hello World program, this text file needs to be converted into a machine-friendly, binary format. The program that does this is called msgfmt and creates a file (ending in .mo instead of .po) that is better suited to accessing arbitrary strings. It is not only trivial to use, but includes error checking which highlights the deliberate mistake above. Did you spot it? See Listing 1.

Listing 1: Finding an error

    $ msgfmt lm.po
    msgfmt: lm.po: warning: Charset "CHARSET" is not a portable encoding name.
    Message conversion to user's charset might not work.
    lm.po:19: `msgid' and `msgstr' entries do not both end with '\n'
    msgfmt: found 1 fatal error

The first warning simply reminds us that we haven't changed the header information yet. We can fix that by amending the line to use the appropriate characterization.

    "Content-Type: text/plain; charset=ISO-8859-1\n"

To determine an appropriate code you can refer to the box: ISO 8859, or [1] for a more detailed analysis. This information is of more use to translators than programmers, as is the extensive functionality provided by [2].

Box: ISO 8859

    ISO            Characterization
    ISO 8859-1     Western, or west European
    ISO 8859-2     Central European, or east European
    ISO 8859-3     South European, or Maltese (and Esperanto)
    ISO 8859-4     North European
    ISO 8859-5     Eastern European, Cyrillic alphabets like Russian
    ISO 8859-6     Arabic
    ISO 8859-7     Greek
    ISO 8859-8     Hebrew
    ISO 8859-9     Turkish
    ISO 8859-10    Nordic (Sámi, Inuit, Icelandic)
    ISO 8859-11    Thai
    ISO 8859-12    (was Celtic, but withdrawn)
    ISO 8859-13    The Baltic Rim
    ISO 8859-14    Celtic
    ISO 8859-15    Euro
    ISO 8859-16    South eastern European (incorporates euro symbol)

The error itself (the msgstr is missing its trailing \n) is easily fixed, although in larger programs such mistakes are more difficult for humans to spot. msgfmt can also check the strings for the correct number (and type) of arguments using the -c option. Once the error is fixed, msgfmt -o lm.mo lm.po writes the catalog to lm.mo; without -o, msgfmt names its output messages.mo. We're now ready to test it!

Norwegian Wood

In order to convince our program to use an appropriate language dictionary, we need to add a couple of further lines of code to indicate that we're happy about using a locale. These are straightforward, and common to all such programs.

    #include <locale.h>
    ...
    char *pPackage = "lm";
    char *pDirectory = "locale";
    ...
    setlocale (LC_ALL, "");
    bindtextdomain (pPackage, pDirectory);
    textdomain (pPackage);

The bindtextdomain function indicates the local root directory of our translated catalog files, while textdomain requires us to specify the name of our package, or program. Ours is called 'lm', since we've created an lm.mo catalog. Note that if you specify a relative path for the locale directory, be careful not to change directory, as this path would then become unreachable.

While in our local root directory, we must create a locale directory, and copy our lm.mo to the appropriate place in the tree. That place being,

    $ mkdir -p locale/fr/LC_MESSAGES
    $ cp lm.mo locale/fr/LC_MESSAGES

Since the catalog is called lm.mo in every language, we use the directory name to distinguish between a French lm.mo and a German lm.mo. This name is determined by the conventional language codes, as detailed at [3]. The directory named LC_MESSAGES is needed because of the wide variety of different locale information that might be present. There can also be directories to indicate the format of the date, and how to represent numbers. See box: Locale Categories for a full list.

Now you can run your program (without having to recompile), using a French locale, and witness the result.

    $ LANG=fr_FR ./hello
    Bonjour, le monde

For a more permanent change of locale, you must export the LANG environment variable in the usual way. For example,

    $ export LANG=fr_FR
    $ ./hello
    Bonjour, le monde

If you're on an exclusively English system this may not work, because there is no French locale on your system (other potential problems are covered in [4]). The /etc/locale.gen file will indicate which locales have been generated for your machine, whereas the file /usr/share/i18n/SUPPORTED will indicate which ones can be installed (along with their appropriate ISO-8859 sets). Generating a French locale can be done easily with,

    $ su
    # you must be root to do this
    Password:
    # echo "fr_FR ISO-8859-1" >> /etc/locale.gen
    # locale-gen
    Generating locales...
    fr_FR.ISO-8859-1... done
    Generation complete.

Debian users can also use dpkg-reconfigure locales.

You can test this using your own program, or (if you think the bug belongs to Hello World!) one of the multi-lingual GNU tools, such as rm.

    $ LANG=fr_FR rm this_wont_exist
    rm: Ne peut enlever `this_wont_exist': Aucun fichier ou répertoire de ce type

To make your dictionary available to others, you should install it into the global repository of .mo files at /usr/share/locale/ (or the location specified by TEXTDOMAINDIR). This directory uses the same hierarchy given above. Installing your text here (which also requires superuser privileges) means your code no longer needs to specify a directory to the bindtextdomain function, and you can replace the directory name with NULL, which leaves the default directory in place.

Having now understood the technical process behind multi-lingual software, let us review some of the finer details we need to consider when programming.

Spanish Eyes

Most developers have a method for dealing with strings, such as a favorite string library. They also have their own methods for building strings dynamically, either to add plurals, or to build large sentences from component parts (like the verbal Lego of automated train announcements). We shall now cover a number of these methods, highlighting the problems (and solutions) involved.

    printf("Deleting %d file%s", iNum, iNum==1?"":"s");

Above is a common way to create a plural. The case of 'one file' requires a singular noun, whereas everything else uses the plural, files. That's in English! Not all languages follow this pattern. The case of 'zero files' might not be plural (as in French), or there could be separate words for zero, one and two (such as those in the Baltic family). To compensate for this, a separate function, ngettext, is available, which takes two string IDs (one for singular, and one for plural) and a number. The number is then used to determine which version of that string should be used in translation.

    printf( ngettext("Deleting %d file", "Deleting %d files", iNum), iNum);

Upon seeing the ngettext marker, the xgettext program will generate two string IDs in the .po file, ready for the translator, along with a special c-format comment, which we'll come to shortly.

    #: helloworld.c:32
    #, c-format
    msgid "Deleting %d file"
    msgid_plural "Deleting %d files"
    msgstr[0] ""
    msgstr[1] ""

Not all the problems are solved by ngettext though. At some point you will come across problems that occur when we use two or more arguments in a printf, because the word order is imperative. Even in a simple (English) program, a mismatched %d and %s can cause printf to core dump. After translating a simple phrase, such as "There are %d files named %s", it is not unreasonable for the resultant text to appear as "With the name %s, there are %d files". What's more, since we (as programmers) do not know about every other possible translation, it is not something we can prevent. More subtle problems can occur with phrases like "Copying file from %s to %s".

There are two methods of resolving the word order problem. The first requires that the translator modify the wording so that the arguments always appear in the right order. The msgfmt command can then be called using the -c option, so that it will perform checks on the .po file. This option actually performs three separate checks. They are: format (the one we need in this instance), header (the presence and contents of the header) and domain (checking for problems with the domain directives).

The second solution places the onus on the programmer, and is preferred. In this case, the format string must be amended to describe the order of the parameters. So, using our copy file example above, this would give us,

    printf( gettext ("Copying file from %1$s to %2$s"), pSrc, pDest);

The special format specifiers, %1$s and %2$s, are handled by the printf code in glibc. Non-GNU variants may not be so feature-full.

Having highlighted the word order problem, you should now be aware that constructing strings at run-time is a bad idea. The solutions we have available to us can only work when the entire string is given to the translator. Splitting text up into sections and using strcat (or similar) should be avoided at all costs, since the translator has no understanding of the ordering (or the ability to change it), or the meaning of the sentence. Each string contained in the catalog must make sense when presented on its own.


The following fragment shows the kind of run-time string building to avoid:

    /* Don't code like this!! */
    /* szMessage: a char buffer assumed large enough for the result */
    strcpy(szMessage, "Copying file from ");
    strcat(szMessage, pSrc);
    strcat(szMessage, " to ");
    strcat(szMessage, pDest);

In some applications, the most difficult word to translate is 'the'! English has only one word for the definite article, 'the'. French, German, Spanish, and many others don't. Depending on the language, they may have special versions for masculine, feminine, neuter and plural. The same is true of the indefinite article, 'a'. Normally, these words will be included as part of the standard translation. By now you should have learnt that building strings dynamically is not a good idea. In some cases it can be very tempting to cut down on the quantity of translations required, as in Listing 2.

Listing 2: Smaller translations

    if ( mygetfiletype(szFilename) == DIRECTORY)
        pFiletype = gettext ("directory");
    if ( mygetfiletype(szFilename) == FILE)
        pFiletype = gettext ("file");
    printf (gettext ("%1$s is a %2$s"), szFilename, pFiletype);

We should modify this so that the strings read 'a directory' and 'a file', so the translated versions will work regardless of gender. However, you might argue, if we also had a portion of the program that produced a short version of the file listing, we would be doubling the work for the translator! For instance, in Listing 3.

Listing 3: Doubling the work

    if ( mygetfiletype(szFilename) == DIRECTORY)
        pFiletype = gettext ("directory"); /* same strings as before - does this mean less work? */
    if ( mygetfiletype(szFilename) == FILE)
        pFiletype = gettext ("file");
    printf("%s : %s", szFilename, pFiletype); /* no translation required here */

That's true. We are doubling the work! However, this extra work is minimal, especially compared to the programmer hassle that might otherwise be involved, or the cringe-inducing gender misuse when the wrong version of 'the' is prepended to the words.

China Girl

The last implementation problem we shall mention involves aesthetics. This refers to the screen layout, the menus of a GUI, and the use of tab stops. Although your program may look nicely formatted in English, as soon as any of the words change, your pre-determined layout will break. German words, for example, are on average 50% longer than their English equivalents. You have two choices. Either ignore word length, or code around it.

Most (if not all) command line utilities are unconcerned with special formatting. The information is functional and uniform, making it suitable for parsing by scripts. GUI software may explicitly place text in two columns, at X1 and X2, in order to appeal to the end user. There's nothing wrong with wanting to appeal to the end user! Unfortunately, when running under a different locale, the text in the left column may overrun the text in the right.

To avoid this problem you will need to write some more code. This might involve adjusting the position of the right hand column, perhaps by calculating the longest piece of text in the left, or you might need to word-wrap everything. It might involve scrolling the text within the visible window (like XMMS). It might simply chop all characters that overrun, and ask the translator for shorter versions. The solution you employ will vary according to the amount of work you, and your translators, are willing to do. Only applications that sell on their presentation abilities (like games) should consider this a necessity.

Vienna

As software develops, more and more strings will be added to the program. Re-translating the whole program every time is obviously wasted effort. So instead, we should use the msgmerge tool. This takes the existing language-specific catalog, with its translations, and the newly extracted template (the .po file that's often renamed to .pot), and builds a new .po. This new file contains all the original translations, combined with the new, as yet untranslated, strings.

    $ msgmerge current_language_po.po new_template.pot > new_language_po.po

Metropolis

With the gettext package, we can create truly multi-lingual software, even if we can't speak any of the languages in question. Using separate language catalogs allows the translation work to be distributed amongst those who can speak different tongues, without having to recompile the code. This makes it a fully data-driven, distributed piece of development work.

So with that thought I bid you all a fond farewell. Au revoir. Auf Wiedersehen. Adiós and Arrivederci! ■

INFO

[1] ISO 8859 Alphabet Soup: http://wwwwbs.cs.tu-berlin.de/user/czyborra/charsets/
[2] Data on languages: http://www.eki.ee/letter/
[3] Language codes: http://www.loc.gov/standards/iso639-2/langcodes.html
[4] FAQ for GNU gettext: http://www.haible.de/bruno/gettext-FAQ.html#integrating_noop

Box: Unicode

All the examples in this article use ASCII characters. This covers most western languages, but neglects those character sets requiring two bytes, such as Chinese. In order to support them fully, we need to work in Unicode. This involves a much larger quantity of work, as the basic char type cannot be used, and is instead replaced by wchar_t. Also, many of the well-known functions (like sprintf) need to be adapted to use their equivalent wide versions, like swprintf.
