USENIX Association

Proceedings of the FREENIX Track: 2003 USENIX Annual Technical Conference

San Antonio, Texas, USA June 9-14, 2003

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2003 by The USENIX Association. All Rights Reserved. For more information about the USENIX Association: Phone: 1 510 528 8649; FAX: 1 510 548 5738; Email: [email protected]; WWW: http://www.usenix.org. Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

GNU Mailman, Internationalized

Barry A. Warsaw
Zope Corporation
[email protected] [email protected]
http://www.zope.com http://barry.warsaw.us

Abstract

GNU Mailman is a mailing list management system that has been in production use since 1998. In December 2002, version 2.1 was released containing many new features. This paper describes one of the most important: Mailman 2.1’s internationalization support. Presented here are the tools that were built and the approaches Mailman took to marking and translating text, as well as a review of some of the benefits and pitfalls of Mailman’s solution. Also presented are some future directions for internationalized Mailman, as well as for other complex Python applications such as Zope.

1 Introduction

GNU Mailman was invented by John Viega sometime before 1997, or so is indicated by the earliest known archived message on the subject. The earliest hit for “viega mailman” in Google groups is about the Dave Matthews band mailing list that John was running [Viega97].

At the time, python.org [Python.Org] was running a hacked version of Majordomo for all its special interest group (SIG) mailing lists, but this had two problems: first, the site was becoming unmaintainable as the administrators tried to customize new features into Majordomo, and second, it just wouldn’t do to run Python’s mailing lists on a Perl-based list server.

When Mailman was first released, python.org quickly adopted it and has been using it ever since. Mailman 2.0 marked a milestone in its development, as version 2.0.13 is quite stable and deployed at thousands of sites. It runs everything from small special interest group lists to huge announcement lists, at sites ranging from the commercial (RedHat, SourceForge, Apple, Dell, SAP, and Zope Corporation), to the hacker community (XEmacs, Samba, Gnome, KDE, and of course Python), to numerous educational organizations and non-profits. There are lots of hosting facilities providing Mailman services, and increasingly, quite a few international organizations.

One of the reasons for the increased interest from the non-English speaking world is that Mailman 2.1, which had been in development for about two years, is fully internationalized. Internationalization is the process of preparing an application for use in multiple locales. Localization is the process of specializing the application for a specific locale. For example, during internationalization, all end-user displayable text in the Mailman 2.1 source code was specially marked as requiring translation. Mailman 2.1.1 (the latest available patch release at the time of writing) has been localized to almost 20 natural languages.

While Mailman 2.1 may appear to be only a minor revision over 2.0.13, it really represents quite an extensive rewrite. It could easily have been argued that this version should be called Mailman 3.0. Before describing the details of the internationalization work, a brief overview of Mailman is provided, including a quick tour of some of the other important features in the 2.1 release.

2 What is GNU Mailman?

“GNU Mailman” (informally referred to as just “Mailman”) is a system for managing electronic mailing lists. It is implemented primarily in Python, an object-oriented, very high-level, open source programming language. Mailing lists are administered by a list owner, and users can interact with the list (including subscribing and unsubscribing) through the web and through email. Site administrators can also interact with Mailman via a suite of command line scripts, or even via the interactive Python prompt. Mailman is the official mailing list manager of the GNU project and is available under the terms of the GNU General Public License [GPL].

Mailman strives for standards compliance, and as such is interoperable with a wide range of web servers and browsers, and mail servers and clients. Of the web servers, it requires the ability to execute CGI scripts, and of mail servers it requires the ability to filter messages through programs. Apache is probably the most widely used web server for Mailman, and any of the Big 4 mail servers (Sendmail, Postfix, qmail, and Exim) will work just fine. The HTML that Mailman outputs is extremely pedestrian, so just about any web browser should work with it, as long as it supports cookies. Mailman should work with any MIME-compliant mail reader. Mailman works on any Unix-like operating system, such as GNU/Linux.

Mailman supports a wide range of features, such as:

• User selectable delivery modes. Members can elect to receive messages immediately, or in batches called digests. Two forms of digests are supported: RFC 1153 style plain text digests [RFC1153], and MIME multipart/digest style digests. Non-digest deliveries can be personalized specifically for the recipient of the message. This means that various aspects of the message, e.g. the header or footer, or the To field, can contain information specific to the member receiving the message.

• Extensive privacy options which allow a list administrator to select policies for subscribing and unsubscribing (open, confirmation required, or approval required), policies for posting to the list (open, moderated, members only, approved posters only), and some limited spam defenses.

• Automatic bounce processing. Bouncing addresses are the bane of any mailing list, and Mailman provides two mechanisms for automatic bounce detection: regular expression based bounce matching and Variable Envelope Return Paths [VERP].

  RFC 3464 [RFC3464] and the older RFC it replaces [RFC1894] describe a standard format for bounce notifications. However, many mail systems ignore or incorrectly implement this standard. For recognizing bounce messages, Mailman has an extensive set of regular expression based matches used to dig the bouncing address out of the notice. For fool-proof bounce detection, Mailman also supports VERP, a technique where the intended recipient’s address as it appears on the mailing list is encoded into the envelope sender of the message. Because remote mail servers are required to send bounces to the envelope sender, Mailman can unambiguously decode the intended recipient’s address and register an accurate bounce. Note that technically, VERP must be implemented in the mail server, but Mailman’s use of the technique is close enough to warrant the label.

• Archiving. Mailman comes bundled with an archiver called Pipermail. Pipermail’s chief advantages are that it comes bundled, that it is implemented in Python, and that in Mailman 2.1 it is internationalized, allowing the display of messages in alternative languages and character encodings. Its primary disadvantages are that it doesn’t support searching and isn’t very customizable. Mailman is easily integrated with external archivers.

• A mail to news gateway. Mailman can be configured to gateway lists to and from Usenet newsgroups. For example, the comp.lang.python newsgroup is gatewayed to the [email protected] mailing list. Even moderated lists, such as comp.lang.python.announce, can be gatewayed, with Mailman serving as the moderation tool.

• Auto-responder, content filtering, and topics. The auto-responder can be set up to send a canned message whenever someone posts to the list, or emails the list owner or -request robot. Content filtering allows the list owner to explicitly filter or pass specific MIME (Multipurpose Internet Mail Extensions [RFC2045]) content types. Topics allow the list owner to assign incoming messages to any of a configurable number of groups, and members can “subscribe” to a specific topic, receiving only the subset of list traffic that matches the desired topics.

• Virtual domains. Mailman can be used on a mail server that supports multiple virtual domains. For example, the python.org and zope.org mail domains are run on the same machine, from the same Mailman installation. The one limitation in Mailman 2.1 is that a mailing list with the same name may not appear in more than one domain. This restriction will be lifted in future versions.

Mailman also provides each list with its own home page (called a “listinfo” page) which can be customized through the web. Mailing lists can be automatically created and deleted through the web (with proper support from the mail server). Mailman also provides web-based approval of moderated messages and subscriptions. There are a host of other smaller new features in Mailman 2.1 which won’t be described in this paper.

3 Internationalization Issues

The new features in Mailman 2.1 are extensive, but the most visible addition is the support for multiple natural languages. This means that all the administrative and publicly visible web pages, all the email notifications, and even the built-in archiver can be configured to produce text in any of nearly 20 natural languages out of the box. A large part of the re-architecting of Mailman for 2.1 has been to provide a framework for easily adding new natural languages as they become available from volunteer translation teams.

3.1 Message IDs

Not every string in an application needs to be translated. For example, some strings are used as keys in dictionaries, or represent mail headers, or contain HTML tags. To make the proper distinction we refer to strings that are intended for human readability as “text” or “messages”. One of the most labor intensive parts of internationalizing an existing code base such as Mailman’s is to go through every string in the software and distinguish messages from ordinary strings. In addition to the non-translatable strings described above, the decision was made to not translate log messages, since these are not intended for the end-user, and translating them would make debugging in a global community more difficult.

Each message that is to be translated needs to have four pieces of information at runtime in order to calculate the translated text: the application domain, the message id, the default text, and the target locale. Because Mailman is a fairly self-contained application, there is only one static domain, the “mailman” domain, which never changes during the life of the program’s execution.

The message id and default text are two related, but distinct concepts. The message id uniquely identifies the textual message to be displayed to the user. The message id names the message but it may not necessarily be the message. It is the message id which is the primary key into a translation catalog dictionary.

The default text is the text to use as the translation of the message id when the id is not found in the translation catalog. Because coordinating 20 different language teams is a project management challenge, it is common for some language catalogs to lag behind the source code development. Mailman releases are rarely delayed so that language teams can catch up (although advance notice of impending releases is usually given). It is often the case, therefore, that a particular message id won’t be found in a specific language catalog. The default text is the fallback to use in this case.

As an example, suppose a web form had a Delete button. The message id for the button might be something like “form27-delete-button”, while the default text might be “Delete”.

Message ids may be explicit or implicit. In the above example “form27-delete-button” is an explicit message id. While it uniquely identifies the message to be used, it does not contain any text that will be displayed to the user. The advantage of explicit message ids is that they are immune to minor typos or formatting changes (e.g. whitespace or punctuation additions or deletions). The disadvantages of explicit message ids are two-fold: they require an extra catalog mapping message ids to the default language (e.g. English in Mailman’s case), and they make the source code less readable. The latter is the more serious consequence; since nearly all human readable text in Mailman exists in Python source code, using explicit message ids would make the code nearly unreadable. A developer would have to consult the English catalog several times for some lines of code.

The alternative approach is to use implicit message ids, where the message id serves a dual purpose as the default text. Thus the human readable text that appears in the Python source code is first used as the message id, and if that fails to find a translation, it is used as the default text. While this has the advantage of making the source code more readable and easier to develop, it has several disadvantages. First, a message such as “Delete”, which has one spelling in English, may be translated to one of several different words in another language, depending on the context. This poses a problem for the translator because the message id “Delete” may appear a dozen times in the application, but may require several different words in the target language. Also, minor changes in formatting or punctuation change the message id, which requires a re-translation (this may be considered an advantage because changes in punctuation can cause semantic differences, requiring a re-translation anyway).

There is no perfect solution, but Mailman has decided to use implicit message ids because of the source code readability advantages. This occasionally requires negotiation between the application developers and the translation teams to choose appropriate and distinguishable message ids, and imposes a sort of inertia against changing existing text in the source code. One way to alleviate these problems in future releases would be to use a mix of implicit and explicit message ids, where implicit ids are used predominantly, but in rare cases explicit ids (along with a partial English catalog) are used to resolve ambiguities.

3.2 The Locale

Internationalizing a web-based application is much more complicated than internationalizing a command line program such as ‘ls’ because the natural language context (i.e. the “locale”) is determined by the web request, or the email message being processed, instead of by the user’s shell. In Mailman, the locale is dynamic and fluid; there may in fact be several locales needed to process any particular email message. Most of the existing techniques for internationalizing programs assume a static locale and a single domain. Mailman inherits the single domain tradition of these tools, but it uses dynamic techniques to calculate the translation locale.

We use the terms “locale” and “language” interchangeably below, although this is not completely accurate. A locale describes much more than the language used; it also defines the character encoding, as well as the formatting of dates, numbers, currency, etc. However, since Mailman does not currently support the localization of data such as dates, the language selection is the most important aspect of the active locale. Typically (although not exclusively) a single character encoding is used for a single language.

At any given point in the processing, the following locales may exist.

• The source code language. Mailman is developed in English, so by default, the English text is always available. Translations can be, and often are, incomplete, and the English message is the global fallback.

• The site default language. Of the nearly 20 languages that Mailman supports out of the box, the site administrator can choose one of those languages as the “site default language”. When no other locale is known, the site default will be used.

• The list default language. Every mailing list has a default language as well as a set of alternatively supported languages. The list default language is used for all the administrative pages. It is also the language used when the list context is known and there is no overriding context.

• The page default language. The list administrator can also choose to let the list support any other language allowed by the site administrator. A user browsing the list overview page can choose to view that page in any of those languages, by selecting that language in a pop up menu on the list’s public web pages.

• The user’s preferred language. The user can also choose one of the list supported languages to be their preferred language, by making this choice in their preferences page. All email notices that Mailman sends out to the user, and any web page that the user views when logged in, are displayed in the user’s preferred language.

To support these multiple language contexts, Mailman uses an object-oriented approach where the locale is represented by an instance of a class. While this may seem natural, it is actually an elaboration of the global translation contexts in classic internationalized programs. This will be described in more detail later.

3.3 Character Encodings

Above and beyond the natural language issues, character encoding issues are probably the most vexing for the Mailman developers. “Character encoding” is usually referred to as the character set or charset, after the email header parameter described in RFC 2045.

A naive view would create a one-to-one correspondence between language and charset. For example, you might say that all Spanish text should be rendered in the iso-8859-1 (Latin-1) character set [ISOSoup]. However, even this simple example isn’t accurate because the Euro sign is available only in iso-8859-15.

The problem is exacerbated by some Asian languages. Japanese for example may appear in any of euc-jp, iso-2022-jp, or shift-jis, and the encoding may differ depending on whether the text appears in a web browser or in an email message. In fact, Mailman 2.1’s naive approach causes some problems for Japanese users, especially when an email message is displayed as a web page in the archiver. This will be fixed in a future release.

Usually, English text uses the us-ascii character set, but for maximum interoperability, a list conducted in English may still want to be aware of Latin-1 characters. Mailman has to be careful when combining characters in different charsets, especially those for which us-ascii is not a subset.

For example, say a Spanish list received a message in Turkish, which uses Latin-5 (a.k.a. iso-8859-9). When that message is archived, different parts of the HTML page for the message will be in iso-8859-1 and other parts will be in iso-8859-9. But since HTML is inadequate at allowing multiple charsets in a single web page, the characters in one or the other of those charsets must be converted to HTML entities, using their Unicode equivalent.

Multiple character set issues can also arise in the processing of email messages. Say for example that a message to a German list arrives in Japanese. Mailman has a feature called “headers and footers” which allows the list administrator to add some canned text to the start and end of a message (e.g. “To unsubscribe, click here”). Previous versions of Mailman would simply paste the header and/or footer around the original message body. This was broken for several reasons. The most obvious one is that if the message is really a Base64 encoded image, adding some spurious ASCII text around the original body would break the decoding. But if the message contained text in a different character set than the header or footer text, concatenation may render the original body unreadable. The solution requires careful examination of the original message, and in the extreme, ripping apart and reconstituting the structure of the original message, so that the headers and footers will always be added in a MIME-safe way.

Internationalization standards for email and HTML are defined in a series of RFCs, and these must be adhered to. For example, the most fundamental email RFC is 2822 [RFC2822] (which recently superseded RFC 822). This RFC describes the structure of an email message, but it is naive in its ASCII bias. RFCs 2045 through 2047 were added to address the use of multilingual character sets in email messages. RFC 2047 [RFC2047] was added to describe how non-ASCII characters are to be encoded in Subject fields and in other email headers. Mailman must be able to both interpret email messages with RFC 2047 encoded headers, and produce properly formatted ones when necessary. The challenge is to parse well intentioned, but erroneously encoded headers (to give the benefit of the doubt). These types of errors are all too common in email messages found in the wild and Mailman must be made robust against these types of poorly formed messages.

Prodded by these various issues, a comprehensive email package [Email] was developed and added to Python 2.2. The email package is compliant with all the relevant MIME RFCs, as well as other mail related standards.

3.4 Message Catalogs

GNU gettext [Gettext] is a widespread formal model for supporting multiple languages in traditional applications. Gettext encourages the use of implicit message ids. This leads to a rhythm whereby the C programmer marks translatable text in the source code by wrapping it in a function call. The function is usually _(), called “the underscore function”, and it has both a run-time behavior and an off-line purpose. At run-time, the underscore function performs the lookup of the message id in a global language catalog. There is also an off-line tool which searches all the source code for marked strings, extracting them and placing them in a message catalog template, called a .pot file.

GNU gettext contains both a C library and a suite of tools provided by The Translation Project [TranslationProject] to manage internationalized programs. The message extraction tool is called xgettext. While newer versions of xgettext understand Python source code to some degree, a pure-Python version of the program called pygettext was developed and is distributed with Python. pygettext has some additional benefits, including the ability to extract Python docstrings, which may not be marked with the underscore function.

Mailman has adopted the gettext model of marking and translating source strings, and to that end, a GNU gettext-like standard module was implemented for Python [GettextModule]. While the gettext module implements the same global translation model as the C library, two elaborations were necessary for a more Pythonic interface.

First, for long running daemon processes such as Mailman 2.1’s mail processor, multiple language contexts are required, so the global state implied by gettext isn’t always appropriate. Here’s an example to illustrate why.

44 FREENIX Track: 2003 USENIX Annual Technical Conference USENIX Association When a new member subscribes to a mail- This is a critically important feature for in- ing list, two notification messages can be sent. ternationalized programs because some lan- One is a welcome message sent to the mem- guages may require a different order of the ber, and the other a new member notification substitutions to be grammatically correct. sent to the list administrator. If the list’s While stock Python supports this require- preferred language is Spanish, but the user ment, its implementation leads to overly ver- prefers German, these two notifications will bose code. In the above example, we’ve writ- be sent out in two different languages. Since a ten the words “listname” and “member” four single process crafts and sends both notifica- times each. Now imagine that level of ver- tions, simply using () wrapping doesn’t give bosity duplicated a hundred times per source enough information. Which language should file. “Tedious” comes to mind! the underscore function translate its message id to? Mailman solves this by providing its own underscore function, which wraps the gettext Python solves this problem by providing standard function, but provides a little bit of an object-oriented API in additional to get- useful magic by looking up substitution vari- text’s traditional functional API. Using the ables in the local and global namespace of object interface, a program can create in- the caller. Using Mailman’s special under- stances which represent the translation con- score function, the above code can then be text; in other words, a single target language rewritten as: catalog is fully encapsulated in an object. For convenience, this object can be stored in some global context, and in the Mailman source, listname = get_listname() this global object can be saved and restored member = get_username() as necessary. 
Here is a simplified Python ex- print _(’%(member)s has been ’ ’subscribed to %(listname)s’) ample:

# The list’s preferred language is in While the average Perl programmer might # effect right now ask what all the fuss is about, the Python saved = i18n.get_translation() programmer will notice something interest- try: ing: there’s no interpolation dictionary and i18n.set_language( no modulus operator. The dictionary is users_preferred_language) created from the namespaces of the caller send_user_notification() of the underscore function, which contains finally: the “listname” and “member” local variables. i18n.set_translation(saved) The trick is that the underscore function send_admin_notification() uses a little known Python function called sys. getframe() to capture the global and The second problem might be termed syn- local namespaces of the caller of underscore. tactic sugar or simple convenience, but it It then puts these in an interpolation dictio- turns out to be extremely important in a nary, with local variables overriding global Python program filled with translatable text. variables, and then applies the modulo op- Python strings support variable substitution erator to the translated string, using this dic- (also called “interpolation”), whereby a dic- tionary. tionary can be used to supply the substitu- tions. For example: Marked translatable texts are used all over Mailman, and we run pygettext over all the source code to produce a gettext compatible listname = get_listname() mailman.pot catalog file. To translate this member = get_username() d = {’listname’: listname, to a new language, the translation team ’member’: member, would start by copying mailman.pot to } messages/xx/LC MESSAGES/mailman.po print _(’%(member)s has been ’ where “xx” is the language code for the new ’subscribed to %(listname)s’) % d language. From here, standard tools such

USENIX Association FREENIX Track: 2003 USENIX Annual Technical Conference 45 as po-mode for Emacs or KDE’s kbabel can The first location to yield the desired tem- be used to provide translations for all the plate wins. Thus, as with the gettext cata- source message ids. Then, standard gettext logs, English is always an available fall back. tools can be used to generate a mailman.mo binary file, which Python’s gettext module Templates, like marked translatable source can read. In this way, internationalized code text, support variable substitutions, us- Python programs can leverage most of the ing the same syntax. With templates, an ex- tools translation teams normally use for plicit substitution dictionary is always pro- C programs. Translators don’t have to vided, and the interpolation is performed af- learn new tools just to translation Python ter the template is located. programs. While the template system works well enough, its coarseness is a serious drawback. 3.5 Templates For example, say a new feature required the addition of an HTML button on one of the templates. While this is trivial to do for the While gettext style message text in Python English template, changing the English tem- source code are essential for an international- plate means all the other templates are out- ized Mailman, they aren’t always appropri- of-date. The translation teams must follow ate. For example, Mailman has always used up with new versions of the modified tem- templates as a way of conveniently represent- plates, or other languages will lag behind the ing full web pages or parts of email messages. English version. These templates provide an easy way for site administrators to customize the look of the One solution for templates might be some- Mailman web pages, or the text sent out un- thing like Zope Page Templates (ZPT) der various circumstances. [Pelletier], and specifically, internationalized ZPT [I18NZPT]. 
Internationalized ZPT com- In an internationalized Mailman, the tem- bines the best of gettext and templates by plates serve another purpose: they serve as allowing the template author to design the a mechanism for providing language specific template, marking sections of the template versions of the templates. Mailman uses al- as translatable text. Another extraction tool most 50 templates for various purposes, and can then run over the ZPT file and add the of course provides the English versions of the translatable messages to the overall catalog. templates as a default. Each supported lan- This has the huge advantage that structural guage provides its own version of the tem- changes to a template don’t require the trans- plates, and Mailman has a defined search lation teams to do any work. Changes to con- order for template lookup. For example, tent messages in the template simply mean if Mailman were to display the public list that one phrase may be out of date, but the overview page for a mailing list, it would whole template won’t be invalidated. search for the listinfo.html page, in the fol- lowing locations (relative to the installation directory): 4 Unicode • The list-specific language directory lists/listname/language/template Python has two types of string objects, • The virtual domain-specific traditional 8-bit byte data strings and Uni- language directory tem- code character strings. Python also has lit- plates/list.host name/language/template eral forms for each string type; quoted text • The site-wide language directory tem- are defined to be 8-bit strings unless the lead- plates/site/language/template ing quote is prefixed with a “u”, in which case it is a Unicode string. Because strings • The global default language directory can come into Mailman in a variety of ways templates/language/template (e.g. through the web, an email message, or a

message catalog), the code must be prepared to handle encoded 8-bit strings and Unicode strings. Encoded 8-bit strings must be converted to Unicode via the unicode() built-in function in order to properly combine strings using concatenation or interpolation. In addition, Unicode strings must be re-encoded when printing them to certain streams, such as the log files or standard output, but these encoding operations must watch out for unsupported characters. For example, if a Unicode string containing Latin-1 characters is printed to an ASCII-only terminal, an exception can be raised due to the non-ASCII characters in the string.

There is no doubt that character conversion issues have been the thorniest and most common bugs reported on Mailman 2.1 to date. While many issues have been fixed, the most important lesson learned is that Mailman should convert all text (not necessarily all strings!) to Unicode at the earliest possible time, ideally when the text enters the system. Mailman should use Unicode strings everywhere internally, converting to encoded 8-bit strings only where needed, and only at the last possible moment. Analysis will still be needed to decide how to handle conversion errors, such as those described above. In Python, the conversion function can be given an additional argument which specifies how strict the conversion should be, e.g. raise an exception if illegal characters are found, throw the illegal characters away, or substitute a question mark for any illegal characters. The exact choice of the strictness flag will depend on the context in which the conversion is occurring.

5 Other Issues

There are some operational issues that need to be addressed for an internationalized application such as Mailman. Care must be taken when marking the source code for translation so that the text is split in a grammatically clear way. For example, whenever possible full sentences should be used, since translating sentence fragments may not be possible in all languages. Also, plural forms and genders pose particularly thorny problems. Python 2.3’s gettext module supports plural forms, but only alpha releases of Python 2.3 have been made available as of this writing. English doesn’t have gendered nouns, and sometimes English source strings need to be rewritten to accommodate translators.

Python supports a number of specific character encoding “codecs” in the standard distribution. While Python has built-in support for most Western codecs, Asian codecs in particular are not supported. Fortunately, Japanese, Korean, and Chinese codecs are available as third party distributions.

Internationalization is a lot more than simply translating strings; many other values, from currencies to dates, must also be localized if they are to be displayed correctly for a particular language or country. Long term goals include wrapping IBM’s ICU library [ICU] in Python.

While internationalization imposes some performance overhead, the effect is negligible. In an application such as Mailman, the performance of the mail server that Mailman feeds messages to, the network bandwidth, and the performance of the operating system and file system have a far greater influence on the performance of the system than does the Mailman software. Internationalization has imposed no perceived performance penalty.

Internationalization has increased the size of the software distribution, since by default the download contains the message catalogs for all supported languages. The current catalog contains over 1200 message ids and is approximately 228 KB in size. The translated and compiled catalog files are from 80 to 300 KB in size depending on the completeness of the translation. In all, the message catalogs themselves add approximately 16 MB to the uncompressed program source code. The templates add approximately another 3 MB. For this reason, future releases of Mailman may provide an English-only distribution, with separately downloadable language packs.
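
The conversion policy recommended in the Unicode section (decode text as it enters the system, work with Unicode internally, and encode only on output with an explicit strictness choice) can be sketched roughly as follows. This is a minimal illustration written for current Python 3, where the paper's 8-bit strings correspond to bytes and its Unicode strings to str, so the decode() and encode() methods stand in for the unicode() built-in. The helper names to_text and to_output are hypothetical and are not taken from Mailman.

```python
# Hypothetical helpers; names and defaults are illustrative, not Mailman's API.

def to_text(data, charset="us-ascii", errors="replace"):
    """Decode incoming bytes to text as early as possible.

    errors is the codec strictness flag: "strict" raises an exception,
    "ignore" drops undecodable bytes, "replace" substitutes U+FFFD.
    """
    if isinstance(data, str):
        return data  # already text; nothing to convert
    return data.decode(charset, errors)


def to_output(text, charset="us-ascii", errors="replace"):
    """Encode text to bytes at the last possible moment.

    With errors="replace", characters the charset cannot represent
    are written as a question mark instead of raising an exception.
    """
    return text.encode(charset, errors)


# Text arrives as Latin-1 bytes, e.g. from a web form or a message body.
raw = b"Fran\xe7ois"
name = to_text(raw, "iso-8859-1", "strict")

# Internally everything is text, so interpolation is safe.
greeting = "Welcome, %(name)s" % {"name": name}

# Writing to an ASCII-only stream: choose the strictness per context.
ascii_bytes = to_output(greeting, "us-ascii", "replace")
```

One plausible division of labor, offered only as an assumption, is to use "strict" where bad input should be rejected outright and "replace" for log files and terminals, where a question mark is preferable to a crash.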

6 Examples

Here is some sample Python code (reformatted for this paper) taken from Mailman 2.1 which shows marked messages:

    _ = i18n._
    label = _(categories[category])
    realname = mlist.real_name
    doc.SetTitle(
        _('%(realname)s Administration '
          '(%(label)s)'))
    doc.AddItem(Center(Header(2, _(
        '%(realname)s mailing list '
        'administration<br>%(label)s '
        'Section'))))

Notice first that the local variables “label” and “realname” are referenced in the default text, and that their values come from the magic interpolation described above. In a translated message the order of the substitutions may change, so this ensures that the substitutions will occur in the grammatically correct location in the translated message. Also notice that there are actually three uses of the underscore function. In the second and third, the only function arguments are strings (in Python, adjacent strings are concatenated by the lexer). The pygettext rule for message extraction is a single string inside the underscore function, so both these texts will be extracted into the message catalog.

The first use of the underscore function is interesting in that it is getting a value out of a dictionary lookup. This is an example of a deferred translation, where the underscore function is only used for its run-time behavior. The text returned by the dictionary lookup is translated in a normal fashion, but the program source code categories[category] isn’t extracted into the catalog because it isn’t a string.

At the place where the categories dictionary is defined, the strings are also wrapped in an underscore function for pygettext extraction, but they aren’t translated at that place in the program. We do this in a number of situations, such as when dictionaries are defined in module global scope. In that case, you would see something like:

    # Identity marker: pygettext extracts the strings, nothing is translated here
    def _(s): return s

    categories = {
        'cat1': _('Privacy'),
        'cat2': _('Autoreplies'),
        'cat3': _('Topics'),
        }

Here, we’re marking three strings for extraction, but we aren’t translating them at the point of definition, because the target locale isn’t known at this time.

Due to space limitations, sample web pages can’t be included; however, a public internationalized list can be viewed at http://mail.python.org/mailman/listinfo/playground. You can view this listinfo page in any of the languages supported by Mailman.

7 Future and Related Work

Internationalized Mailman servers are in use around the world, and many of the earliest related bugs have been satisfactorily fixed. However, the basic architecture used by Mailman may undergo additional refinement in future releases. In particular, Mailman will be rewritten to use Unicode internally for all human readable text. The templates used in Mailman will likely be redesigned to use something like Zope’s ZPT, which allows finer-grained control over the evolution of the templates.

The lessons learned during the Mailman internationalization effort have been carried forward to the Zope 3 internationalization effort [Zope3]. Zope is a web application server and framework, also written in Python. Zope’s internationalization efforts are made more complicated by the fact that it is a framework supporting multiple applications rather than a single application. This means that while Mailman needs only a single application domain, Zope may have multiple simultaneous domains, even in a single translation context. Zope therefore needs a way to record the domain that a particular message
has come from so that it can be looked up in the proper catalog at output time. The solution has been to create a MessageID object, which is a subclass of string that contains the domain as an instance variable.

Also, in Zope few translatable messages are found in Python source code. The predominant carrier of human readable text is the ZPT. Thus the mechanisms described above for simple interpolation and global translation contexts aren’t appropriate for Zope. While Zope’s internationalization efforts are built on the Python tools developed during Mailman’s internationalization, they will ultimately improve or expand on these tools.

As part of the Zope 3 internationalization effort, a Translation Web Service (TWS) has been proposed [ZopeTWS]. While largely only science fiction at the time of this writing, the TWS is a vision of how to coordinate the project management issues related to internationalization. One of the most difficult ongoing problems for an internationalized project is coordinating the output of the software developers with the translation efforts of the language teams. While the Translation Project attempts to automate and coordinate much of this process, the TWS plans to take this aspect a step further by providing a global web service for truly collaborative translations. With the TWS, a project manager could upload message templates for particular domains, and language teams would translate messages at their own pace. When a new version of the software is prepared for release, the project manager could then download a snapshot of the current state of the various translations for the project. The key advance with the TWS is that once the infrastructure is in place, the software developers are no longer bottlenecks in the translation effort, and coordination among translation team members is automatic.

8 Acknowledgments

The author would like to thank Zope Corporation (http://www.zope.com) for their support of this work.

Mailman was originally invented by John Viega, and at various times has been shepherded, maintained, and developed by Thomas Wouters, Ken Manheimer, Harald Meland, and Scott Cotton. The author is the current project leader.

Juan Carlos Rey Anaya and Victoriano Giralt produced the first working prototypes of an internationalized Mailman, and worked with the author to design the architecture for supporting internationalization. Others who provided invaluable contributions for the internationalization effort include Ben Gertzfield, Martin von Loewis, Simone Piunno, Daniel Buchmann, Tokio Kikuchi, and Ousmane Wilane. The ACKNOWLEDGEMENTS file that comes with the Mailman source distribution contains a detailed list of contributors.

9 Availability

GNU Mailman is free software, covered by the GNU General Public License. It is available for download from http://sf.net/projects/mailman

More information on Mailman can be found at its home page http://www.list.org

Mirrors of the Mailman site are at http://www.gnu.org/software/mailman and http://mailman.sf.net

References

[Viega97] http://groups.google.com/groups?q=viega+mailman&hl=en&lr=&ie=UTF-8&scoring=d&start=60&sa=N&filter=0

[Python.Org] Python Home Page, http://www.python.org

[GPL] Free Software Foundation, GNU General Public License, http://www.gnu.org/licenses/gpl.txt

[RFC1153] F. Wancho, Digest Message Format, http://www.faqs.org/rfcs/rfc1153.html

[VERP] D. J. Bernstein, Variable Envelope Return Paths, http://cr.yp.to/proto/verp.txt

[RFC3464] K. Moore and G. Vaudreuil, An Extensible Message Format for Delivery Status Notifications, http://www.faqs.org/rfcs/rfc3464.html

[RFC1894] K. Moore and G. Vaudreuil, An Extensible Message Format for Delivery Status Notifications, http://www.faqs.org/rfcs/rfc1894.html

[RFC2045] N. Freed and N. Borenstein, Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, http://www.faqs.org/rfcs/rfc2045.html

[ISOSoup] The ISO 8859 Alphabet Soup, http://czyborra.com/charsets/iso8859.html

[Gettext] GNU gettext, http://www.gnu.org/software/gettext/gettext.html

[TranslationProject] The Translation Project Site, http://www.iro.umontreal.ca/contrib/po/HTML/index.html

[GettextModule] Multilingual internationalization services, http://www.python.org/doc/current/lib/module-gettext.html

[RFC2822] P. Resnick, Internet Message Format, http://www.faqs.org/rfcs/rfc2822.html

[RFC2047] K. Moore, MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text, http://www.faqs.org/rfcs/rfc2047.html

[Email] email – an email and MIME handling package, http://www.python.org/doc/current/lib/module-email.html

[Pelletier] M. Pelletier and A. Latteier, The Zope book, Chapter 5, Using Zope Page Templates, ISBN 0735711372.

[I18NZPT] http://dev.zope.org/Wikis/DevSite/Projects/ComponentArchitecture/ZPTInternationalizationSupport

[ICU] International Components for Unicode, http://www-124.ibm.com/icu/

[Zope3] Welcome to the Zope 3 project, http://dev.zope.org/Wikis/DevSite/Projects/ComponentArchitecture/

[ZopeTWS] Translation Web Service, http://dev.zope.org/Wikis/DevSite/Projects/ComponentArchitecture/TranslationWebService
