Managing order relations in MlBibTEX∗

Jean-Michel Hufflen
LIFC (EA CNRS 4157)
University of Franche-Comté
16, route de Gray
25030 BESANÇON CEDEX
France
hufflen (at) lifc dot univ-fcomte dot fr
http://lifc.univ-fcomte.fr/~hufflen

Abstract
Lexicographical order relations used within dictionaries are language-dependent. First, we describe the problems inherent in automatic generation of multilingual bibliographies. Second, we explain how these problems are handled within MlBibTEX. To add or update an order relation for a particular natural language, we have to program in Scheme, but we show that MlBibTEX’s environment eases this task as far as possible.

Keywords Lexicographical order relations, dictionaries, bibliographies, collation algorithm, Unicode, MlBibTEX, Scheme.

Streszczenie
Porządek leksykograficzny w słownikach jest zależny od języka. Najpierw omówimy problemy powstające przy automatycznym generowaniu bibliografii wielojęzycznych. Następnie wyjaśnimy, jak są one traktowane w MlBibTEX-u. Dodanie lub zaktualizowanie zasad sortowania dla konkretnego języka naturalnego umożliwia program napisany w języku Scheme. Pokażemy, jak bardzo otoczenie MlBibTEX-owe ułatwia to zadanie.

Słowa kluczowe Zasady sortowania leksykograficznego, słowniki, bibliografie, algorytmy sortowania leksykograficznego, Unikod, MlBibTEX, Scheme.

0 Introduction

Looking for a word in a dictionary or for a name in a phone book is a common task. We get used to the lexicographic order over a long time. More precisely, we get used to our own lexicographic order, because it belongs to our cultural background: it depends on the language. This problem is particularly acute when we manage multilingual bibliographies, as in our program MlBibTEX (for ‘MultiLingual BibTEX’). Let us recall that this program aims to be a ‘better’ BibTEX [15], the bibliography processor usually associated with the LATEX word processor [12]. When it builds a ‘References’ section for a LATEX document, BibTEX uses a bibliography style to determine the layout. Some bibliography styles are unsorted, that is, the order of bibliographical items within the bibliography is the order of first citations of these items throughout the document. However, most of BibTEX’s styles sort these items w.r.t. the alphabetical order of authors’ names. But the bst language of bibliography styles [14] only provides a SORT function [13, Table 13.7] suitable for the English language: the commands for accents and other diacritical signs are ignored by this sort operation.

The purpose of this article is to show how this problem is solved in MlBibTEX’s first public release. In practice, this version deals only with European languages using the Latin alphabet. The MlBibTEX program is written using the Scheme programming language [10]. A standardised library providing support for Unicode [22] has been designed [18, §§ 1.1 & 1.2], but we cannot say that the present version of Scheme is Unicode-compliant, even if some implementations include partial support.¹ So some parts of our present implementation of order relations are temporary, but we think that this implementation

∗ Title in Polish: Zarządzanie zasadami sortowania leksykograficznego w MlBIBTEX-u.
¹ At the time of finishing the revised version of this article, the proposal for Scheme’s next standard has just been ratified and is now the ‘official’ sixth version of this language [19, 18]. See http://www.r6rs.org for more details.

TUGboat, Volume 29, No. 1 — XVII European TEX Conference, 2007 101 Jean-Michel Hufflen

• The Czech alphabet is:
  a < b < c < č < d < ... < h < ch < i < ... < r < ř < s < š < t < ... < z < ž.
• In Danish, accented letters are grouped at the end of the alphabet:
  a < ... < z < æ < ø < å.
• The Estonian language does not use the same order for unaccented letters as the usual Latin order; in addition, accented letters are either inserted into the alphabet or alphabeticised like the corresponding unaccented letter:
  a < ... < s ∼ š < z ∼ ž < t < ... < w < õ < ä < ö < ü < x < y.
• Here are the accented letters in the French language:
  à ∼ â, ç, è ∼ é ∼ ê ∼ ë, î ∼ ï, ô, ù ∼ û ∼ ü, ÿ.
  When two words differ by an accent, the unaccented letter takes precedence, but w.r.t. a right-to-left order:ᵃ
  cote < côte < coté < côté.
  Two ligatures are used: ‘æ’ (resp. ‘œ’), alphabeticised like ‘ae’ (resp. ‘oe’).
• There are three accented letters in German (‘ä’, ‘ö’, ‘ü’) and three lexicographic orders:
  – DINᵇ-1: a ∼ ä, o ∼ ö, u ∼ ü;
  – DIN-2: ae ∼ ä, oe ∼ ö, ue ∼ ü;
  – Austrian: a < ä < ... < o < ö < ... < u < ü < v < ... < z.
• The Hungarian alphabet is:
  a ∼ á < b < c < cs < d < dz < dzs < e ∼ é < f < g < gy < h < i ∼ í < j < k < l < ly < m < n < ny < o ∼ ó < ö ∼ ő < p < ... < s < sz < t < ty < u ∼ ú < ü ∼ ű < v < ... < z < zs.
  Some double digraphs must be restored before sorting:
  ccs → cs+cs, ddz → dz+dz, ggy → gy+gy, lly → ly+ly, nny → ny+ny, ssz → sz+sz, tty → ty+ty.
  The same holds for the double trigraph: ddzs → dzs+dzs.
• The Polish alphabet is:
  a < ą < b < c < ć < d < e < ę < ... < l < ł < m < n < ń < o < ó < p < ... < s < ś < t < ... < z < ź < ż.
• The Romanian alphabet is:
  a < ă < â < b < ... < i < î < j < ... < s < ş < t < ţ < u < ... < z.
• The Slovak alphabet is:
  a < á < ä < b < c < č < d < ď < dz < dž < e < é < f < g < h < ch < i < í < j < k < l < ĺ < ľ < m < n < ň < o < ó < ô < p < q < r < ŕ < s < š < t < ť < u < ú < ... < y < ý < z < ž.
• The Spanish alphabet was:
  a < b < c < ch < d < ... < l < ll < m < n < ñ < o < ... < z
  until 1994. Now the digraphs ‘ch’ and ‘ll’ are no longer viewed as single letters in modern dictionaries, and the words using ‘ñ’ are interleaved with words using ‘n’.
• In Swedish, accented letters are grouped at the end of the alphabet:
  a < ... < z < å < ä < ö.

ᵃ Using a left-to-right order for this step is a common mistake, even for French people. But the accurate order is right-to-left, as specified in [7].
ᵇ Deutsches Institut für Normung (German Institute for Standardisation).

Figure 1: Some order relations used in European languages.

could be easily updated for future versions dealing with Unicode.

In the first section, we show how diverse the lexicographic orders used throughout European countries are. This allows readers to estimate this diversity and to realise the complexity of this task. We also explain why this problem is made more difficult when we consider it within the framework of bibliographies. Then we show how order relations operate in MlBibTEX and how they are built. We also give some details about the common and different points between xindy [13, § 11.3] and MlBibTEX, both being programs using multilingual order relations. Reading this article does not require advanced knowledge of Scheme;² in fact, we think that a non-programmer should be able to specify a new order relation. We give more technical details in an annex, for users who would like to experiment more themselves. In particular, we explain how to deal with languages using the Latin 2 encoding, despite our implementation being based on Latin 1.

1 European languages and lexicographic orders

Figure 1 gives an idea of the diversity of order relations used throughout some European countries. In this figure, ‘a < b’ denotes that the words beginning with a are ‘less than’ the words beginning with b, whereas ‘a ∼ b’ expresses that the letters a and b are interleaved, except that a takes precedence over b if two words differ only by these two letters. Roughly speaking, there are two families of languages in the realm of associated lexicographic orders. Accented letters are sometimes fully viewed as ‘real’ letters, distinct from unaccented ones: examples are given by some Slavonic languages. In other languages, accented letters are sorted as if there were no accent. The precedence of an unaccented letter over an accented one is determined in various ways: it follows a left-to-right order in Irish, Italian, and Portuguese, a right-to-left order in French. The Estonian language ‘mixes’ the two approaches: some accented letters (‘õ’, ‘ä’) are alphabeticised, some (‘š’, ‘ž’) are interleaved. Last, some letter groups may be viewed as a single letter and alphabeticised as another letter. For example, the Hungarian words beginning with ‘cs’ are alphabeticised separately from the words beginning with ‘c’. In fact, the ‘c-’ entry in a Hungarian dictionary contains words beginning with ‘c’ and not with ‘cs’. The ‘c-’ entry is followed by the ‘cs-’ entry, before the ‘d-’ entry.

Anyway, it is apparent that there cannot be a universal order encompassing all lexicographic orders. Besides, these orders aim to classify words of a dictionary, that is, common words belonging to a language, even if some dictionaries may include some proper names. When bibliographies are generated, order relations are used to sort bibliographical items, most often w.r.t. authors’ names. These names may be ‘foreign’ proper names if we consider the language used for the bibliography. So names can include characters outside of this language’s alphabet. As a consequence, an order relation for sorting a bibliography should be able to deal with any letter, since any letter may appear in foreign names. A good choice is to associate such a foreign letter with a letter belonging to the ‘basic’ Latin alphabet, so this foreign letter is interleaved with the basic letter, which takes precedence over the foreign letter if two words differ only by these two letters. If we consider the English language, this means that accented letters are interleaved with unaccented letters, but unaccented letters take precedence. Most implementations of order relations proceed in this way.

Unicode provides a default algorithm to sort all its characters. This algorithm is based on a sort key table, DUCET³ [23]. It is also based on a decomposition property for composite characters. For example, the ‘ô’ letter, whose name and code point (given using hexadecimal numbers) are:

  LATIN SMALL LETTER O WITH CIRCUMFLEX, U+00F4

can be decomposed into these ‘simpler’ characters:

  LATIN SMALL LETTER O, U+006F
  COMBINING CIRCUMFLEX ACCENT, U+0302

The sort algorithm requires several passes. To describe it roughly, weight information, given by sort keys, is associated with each string. Then this information is re-arranged according to sort levels, w.r.t. letters, w.r.t. accents, etc. Finally, a binary comparison between bytes is done, level by level, until the two strings can be distinguished. This algorithm can be refined for a particular language, by using a specialised sort key table, possibly including sort keys for accented letters and digraphs viewed as single letters. This modus operandi would be difficult to put into action within MlBibTEX. First, we do not have complete support for Unicode:⁴ for example, we cannot directly deal with characters such as the ‘combining circumflex accent’, not included in the Latin-1 encoding. But we keep the idea about decomposition, replacing the combining characters by ASCII⁵ characters. For example, the ‘combining circumflex accent’ will be replaced by the ‘^’ character. To sum up, our order relations are based on a 3-step algorithm:

• replace composite characters (‘foreign’ letters or composite characters not viewed as single letters) when extracting successive letter groups and compare the two results,
• refine the sort with accent information when accented letters are interleaved with others,
• test the case: when two words differ only in case, an uppercase letter takes precedence over a lowercase one, according to a left-to-right order.

2 Generating order relations

Let us recall that MlBibTEX can apply BibTEX’s bibliography styles using a compatibility mode [6], but in order to take advantage of MlBibTEX’s multilingual features as far as possible, it is better to use the nbst⁶ language [4], close to XSLT⁷ [24], the language of transformations used for XML⁸ documents. Let us recall that parsing a bibliography data base (.bib) results in the representation of an XML tree in Scheme [11]; this nbst language includes an element for sorting selected subtrees of an XML document [4, App. A], this element being analogous to XSLT’s xsl:sort element [24, § 10]. For example, the following two elements

² Readers can refer to [20] for an introductory book about Scheme.
³ Default Unicode Collation Element Table.
⁴ See the annex.
⁵ American Standard Code for Information Interchange.
⁶ New Bibliography STyles.
⁷ eXtensible Stylesheet Language Transformations.
⁸ eXtensible Markup Language.

can be used to sort bibliographical items by the first author’s last name, and then the items left unsorted by this first step are sorted by the first author’s first name:⁹

[…]

Due to the language attribute’s value, this sort operation will use the lexicographic order for the German language. Such an order relation is to be specified in Scheme, as a 2-argument function taking two strings s0 and s1 and returning a ‘true’ value (#t) if s0 is strictly less than s1, a ‘false’ value (#f) otherwise. The best way to define such a function is […]

It should be noted that only lowercase letters have to be specified; the equivalent relations among uppercase letters will be deduced.

Let us come back to associations for additional […] decomposition property in Unicode. For example: […] character. MlBibTEX knows such decomposition information for each accented letter of Latin 1. These default associations can be overridden by alphabet-specific associations given to the function building orders. Weights are managed as follows.

• By default, the weight of each component of an alphabet (appearing within the second argument of […] unless it is an accented letter included in default associations.¹¹

Given an alphabet’s specification (the second argument of the […]

⁹ Let us notice that this illustrative example would be too restrictive for an ‘actual’ bibliography style: there may be several authors, and some authors may be denoted by an organisation name, in which case the element’s name is not […]


(define

Figure 2: Building order relations for some European languages.
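The three-step comparison summarised in Section 1 (letters first, then accents, then case) can be illustrated with a small sketch. It is written in Python rather than in MlBibTEX's Scheme, and it is not the author's implementation: the names `make_sort_key` and `make_order_relation` and the parameter `accents_right_to_left` are hypothetical, Python's `unicodedata` module stands in for MlBibTEX's Latin 1 decomposition table, and combining marks are kept as they are instead of being replaced by ASCII characters such as ‘^’. Digraphs and letters without a Unicode decomposition (for example ‘ø’) are not handled here; the latter are simply ranked after the alphabet.

```python
import unicodedata


def make_sort_key(alphabet, accents_right_to_left=False):
    """Return a function mapping a word to a three-level sort key.

    `alphabet` lists the lowercase letters viewed as 'real' letters, in
    increasing order. Any other letter is decomposed into a base letter
    plus accents, so it is interleaved with its base letter.
    """
    rank = {letter: position for position, letter in enumerate(alphabet)}

    def sort_key(word):
        letters, accents, cases = [], [], []
        for ch in unicodedata.normalize("NFC", word):
            low = ch.lower()
            if low in rank:
                # A letter of the alphabet, possibly accented (e.g. Swedish 'å').
                base, marks = low, ""
            else:
                # A 'foreign' letter: base letter followed by combining marks.
                nfd = unicodedata.normalize("NFD", low)
                base, marks = nfd[0], nfd[1:]
            letters.append((rank.get(base, len(rank)), base))  # step 1
            accents.append(marks)                 # step 2: "" sorts first
            cases.append(0 if ch.isupper() else 1)  # step 3: uppercase first
        if accents_right_to_left:
            accents.reverse()                     # French-style precedence
        return letters, accents, cases

    return sort_key


def make_order_relation(alphabet, accents_right_to_left=False):
    """Return a 2-argument predicate: True if s0 is strictly less than s1."""
    sort_key = make_sort_key(alphabet, accents_right_to_left)
    return lambda s0, s1: sort_key(s0) < sort_key(s1)
```

With the 26-letter basic alphabet and `accents_right_to_left=True`, sorting the four French words of Figure 1 gives cote, côte, coté, côté; with the default left-to-right accent order, coté comes before côte. Passing an alphabet that ends with ‘å’, ‘ä’, ‘ö’ places Swedish words beginning with ‘å’ after those beginning with ‘z’.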


(define mk-hungarian-word-sectioner ; Building a generator of sectioning functions for Hungarian words. (

Figure 3: How to section Hungarian words.
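To make the idea of sectioning concrete, here is a hedged sketch of how a word can be cut into the longest sequences belonging to the Hungarian alphabet, with the double digraphs and the double trigraph of Figure 1 restored on the fly. It is written in Python, not Scheme, and it does not reproduce the code of Figure 3: `make_sectioner`, `HUNGARIAN_GROUPS` and `HUNGARIAN_DOUBLES` are hypothetical names, and a plain longest-match scan replaces the trie-based lexical analyser used by MlBibTEX. The sectioner is a generator, so a caller comparing two words consumes only as many sections as it needs.

```python
# Multi-character letters of the Hungarian alphabet (see Figure 1).
HUNGARIAN_GROUPS = ["cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"]

# Contracted doubles and the letter each one stands for twice (see Figure 1).
HUNGARIAN_DOUBLES = {
    "ccs": "cs", "ddz": "dz", "ddzs": "dzs", "ggy": "gy",
    "lly": "ly", "nny": "ny", "ssz": "sz", "tty": "ty",
}


def make_sectioner(groups, doubles):
    """Return a generator function cutting a word into letters of the alphabet."""
    # Longest patterns are tried first, so 'ddzs' wins over 'ddz' and 'dzs' over 'dz'.
    groups_by_length = sorted(groups, key=len, reverse=True)
    doubles_by_length = sorted(doubles, key=len, reverse=True)

    def section(word):
        word = word.lower()
        i = 0
        while i < len(word):
            for pattern in doubles_by_length:
                if word.startswith(pattern, i):
                    # Restore the double letter: e.g. 'ssz' yields 'sz', 'sz'.
                    yield doubles[pattern]
                    yield doubles[pattern]
                    i += len(pattern)
                    break
            else:
                for group in groups_by_length:
                    if word.startswith(group, i):
                        yield group
                        i += len(group)
                        break
                else:
                    # No digraph or trigraph starts here: a single character.
                    yield word[i]
                    i += 1

    return section
```

For instance, `list(make_sectioner(HUNGARIAN_GROUPS, HUNGARIAN_DOUBLES)("hosszú"))` cuts the word into h, o, sz, sz, ú, whereas calling `next` on the generator extracts only the first letter.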

MlBibTEX notices the possible presence of multi-character sequences (e.g., digraphs or trigraphs). If need be, it builds a lexical analyser able to return the longest sequence of characters belonging to this alphabet,¹² an example of use being given in Figure 3. Let us mention that these analysers extract as few sequences of characters as possible. For example, if we have to compare a word beginning with ‘a’ and a word beginning with ‘b’ in English, only the first letters, "a" and "b", are extracted because that is sufficient to determine the result.

Regarding the implementation, the encoding of the sequences of an alphabet w.r.t. an increasing order is implemented by means of hash tables,¹³ which ensures efficiency. Let us not forget that these order relations are used to sort bibliographical items, and sorting requires many calls to the function modelling an order relation.

3 MlBibTEX vs. xindy

xindy [9] and MlBibTEX do not aim to perform the same task, since xindy is an index processor. However, they have points in common: they reimplement ‘old’ programs belonging to TEX’s galaxy (makeindex [13, § 11.2] and BibTEX) with a particular focus on multilingual features, and they are both written using a Lisp¹⁴ dialect: Common Lisp [21] for xindy, Scheme for MlBibTEX. Of course, the successive steps used for putting an order relation into action, needed to arrange the successive entries of an index, also exist in xindy. But the specification of an order relation is different because it is done step by step. There are forms (define-alphabet, define-letter-group, merge-rule, sort-rule) to specify an alphabet, a letter group, and the replacement of a pattern. If a sort procedure is quite close to the standard way used in English, it is probably easier to use xindy’s forms, because only small changes have to be expressed. In MlBibTEX, we chose to develop fewer functions, which encapsulate the complete making of an order relation. This allows a global view of a new order relation and makes some coherence tests among the information about this relation easier.

4 Conclusion

The availability of these language-dependent order relations within a unique program has been planned through the use of the language attribute, as specified in the W3C¹⁵ recommendation about XSLT [24, § 10]. However, these relations have been implemented only partially in most XSLT processors. Of course, our implementation also only partially provides this service, because we are limited to European languages. But we think that the orders we define are correct w.r.t. these languages and they are actually running. Our implementation is clearly influenced by the Unicode collation algorithm. It is a first step towards general algorithms for lexicographic orders, and a first version subject to changes when we explore other languages or get criticisms from end-users. In many domains, improvement has come about because first versions existed. We think that will also be the case for our functions.

5 Acknowledgements

Many thanks to Jerzy B. Ludwichowski, who has written the Polish translation of the abstract. I

¹² Such lexical analysers are implemented by means of tries. In MlBibTEX, this structure is also used to manage the information related to language identifiers, as explained in [5].
¹³ A hash table has a set of entries, and can efficiently map an object to another object. This structure is described in [1] from a general point of view; our implementation of hash tables in MlBibTEX is inspired by [8].
¹⁴ LISt Processor.
¹⁵ World Wide Web Consortium.

also thank Gyöngyi Bujdosó, Hans Hagen, Karel Horák, Dag Langmyhr, who helped me fix some errors. Thanks to Karl Berry and Barbara Beeton, […]

Now you can type some expressions and evaluate them:

(order-relation "Frank Böhmert" "german"

In Scheme, the ‘ " ’ character being the delimiter of […]


References

[1] Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman: Compilers, Principles, Techniques and Tools. Addison-Wesley Publishing Company. 1986.
[2] Matthew Flatt: PLT MzScheme: Language Manual. Version 360. August 2004. http://download.plt-scheme.org/doc/360/pdf/mzscheme..
[3] Chris Hanson, the MIT Scheme team et al.: MIT Scheme Reference Manual, 1st edition. March 2002. Massachusetts Institute of Technology.
[4] Jean-Michel Hufflen: “MlBibTEX’s Version 1.3”. TUGboat, Vol. 24, no. 2, pp. 249–262. July 2003.
[5] Jean-Michel Hufflen: Managing Languages within MlBIBTEX. In revision. June 2005.
[6] Jean-Michel Hufflen: “BibTEX, MlBibTEX and Bibliography Styles”. Biuletyn GUST, Vol. 23, pp. 76–80. In BachoTEX 2006 conference. April 2006.
[7] ISO-IEC CD 14651: International String Ordering – Method for Comparing Character Strings and Description of a Default Tailorable Ordering. May 1996.
[8] Panu Kalliokoski: Basic Hash Tables. September 2005. http://srfi.schemers.org/srfi-69/.
[9] Roger Kehr: xindy Manual. February 1998. http://www.xindy.org/doc/manual.html.
[10] Richard Kelsey, William D. Clinger, Jonathan A. Rees, Harold Abelson, Norman I. Adams iv, David H. Bartley, Gary Brooks, R. Kent Dybvig, Daniel P. Friedman, Robert Halstead, Chris Hanson, Christopher T. Haynes, Eugene Edmund Kohlbecker, Jr, Donald Oxley, Kent M. Pitman, Guillermo J. Rozas, Guy Lewis Steele, Jr, Gerald Jay Sussman and Mitchell Wand: “Revised⁵ Report on the Algorithmic Language Scheme”. HOSC, Vol. 11, no. 1, pp. 7–105. August 1998.
[11] Oleg E. Kiselyov: XML and Scheme. September 2005. http://okmij.org/ftp/Scheme/xml.html.
[12] Leslie Lamport: LATEX: A Document Preparation System. User’s Guide and Reference Manual. Addison-Wesley Publishing Company, Reading, Massachusetts. 1994.
[13] Frank Mittelbach and Michel Goossens, with Johannes Braams, David Carlisle, Chris A. Rowley, Christine Detig and Joachim Schrod: The LATEX Companion. 2nd edition. Addison-Wesley Publishing Company, Reading, Massachusetts. August 2004.
[14] Oren Patashnik: Designing BIBTEX Styles. February 1988. Part of the BibTEX distribution.
[15] Oren Patashnik: BIBTEXing. February 1988. Part of the BibTEX distribution.
[16] Manuel Serrano: Bigloo. A Practical Scheme Compiler. User Manual for Version 2.9a. December 2006.
[17] Olin Shivers: List Library. October 1999. http://srfi.schemers.org/srfi-1/.
[18] Michael Sperber, William Clinger, R. Kent Dybvig, Matthew Flatt, Anton van Straaten, Richard Kelsey and Jonathan Rees: Revised⁶ Report on the Algorithmic Language Scheme – Standard Libraries. September 2007. http://www.r6rs.org.
[19] Michael Sperber, William Clinger, R. Kent Dybvig, Matthew Flatt, Anton van Straaten, Richard Kelsey, Jonathan Rees, Robert Bruce Findler and Jacob Matthews: Revised⁶ Report on the Algorithmic Language Scheme. September 2007. http://www.r6rs.org.
[20] George Springer and Daniel P. Friedman: Scheme and the Art of Programming. The MIT Press, McGraw-Hill Book Company. 1989.
[21] Guy Lewis Steele, Jr., Scott E. Fahlman, Richard P. Gabriel, David A. Moon, Daniel L. Weinreb, Daniel Gureasko Bobrow, Linda G. DeMichiel, Sonya E. Keene, Gregor Kiczales, Crispin Perdue, Kent M. Pitman, Richard Waters and Jon L White: Common Lisp. The Language. Second Edition. Digital Press. 1990.
[22] The Unicode Consortium: The Unicode Standard Version 5.0. Addison-Wesley. November 2006.
[23] The Unicode Consortium: Unicode Collation Algorithm. Unicode Technical Standard #10. July 2006. http://unicode.org/reports/tr10/.
[24] W3C: XSL Transformations (XSLT). Version 1.0. W3C Recommendation. Edited by James Clark. November 1999. http://www.w3.org/TR/1999/REC-xslt-19991116.
