Managing Order Relations in mlBibTeX Jean-Michel Hufflen

To cite this version:

Jean-Michel Hufflen. Managing Order Relations in mlBibTeX. TUGB, 2007, pp.101–108. ￿hal- 00644465￿

HAL Id: hal-00644465 https://hal.archives-ouvertes.fr/hal-00644465 Submitted on 24 Nov 2011

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Managing Order Relations in MlBibTEX∗

Jean-Michel HUFFLEN LIFC (EA CNRS 4157) University of Franche-Comté 16, route de Gray 25030 BESANÇON CEDEX FRANCE [email protected] http://lifc.univ-fcomte.fr/~hufflen

Abstract Lexicographical order relations used within are language-dependent. First, we describe the problems induced within automatic generation of multilin- gual bibliographies. Second, we explain how these problems are handled within MlBibTEX. To add or update an order relation for a particular natural language, we have to program in Scheme, but we show that MlBibTEX’s environment eases this task as far as possible. Keywords Lexicographical order relations, dictionaries, bibliographies, colla- tion algorithm, , MlBibTEX, Scheme.

Streszczenie Porządek leksykograficzny słownikach jest zależny od języka. Najpierw omó- wimy problemy powstające przy automatycznym generowaniu bibliografii wielo- języcznych. Następnie wyjaśnimy, jak są one traktowane w MlBibTEX-. Do- danie lub zaktualizowanie zasad sortowania dla konkretnego języka naturalnego umożliwia program napisany w języku Scheme. Pokażemy, jak bardzo otoczenie MlBibTEX-owe ułatwia to zadanie. Słowa kluczowe Zasady sortowania leksykograficznego, słowniki, bibliografie, algorytmy sortowania leksykograficznego, Unikod, MlBibTEX, Scheme.

0 Introduction of these items throughout the document. However, Looking for a word within a or for a name most of BibTEX’s styles sort these items w.r.t. the in a phone book is common task. We get used to the alphabetical order of authors’ names. But the bst for a long time. More precisely, language of bibliography styles [14] only provides a we get used to our own lexicographic order, because SORT function [13, Table 13.7] suitable for the En- it belongs to our cultural background. It depends glish language, the commands for accents and other on languages. diacritical signs being ignored by this sort operation. This problem is particularly acute when we deal The purpose of this article is to show how this with managing multilingual bibliographies, as in our problem is solved in MlBibT ’s first public ver- program MlBibT X — for ‘MultiLingual BibT X’. E E sion. In practice, this version only deals with Euro- Let us recall that this program aims to be a ‘better’ pean languages using the Latin . Besides, BibT X [15], the bibliography processor usually as- E the MlBibT X program is written using the Scheme sociated with the LAT X word processor [12]. When E E programming language [10]. Some implementations it builds a ‘References’ section for a LAT X docu- E provide partial support for Unicode [22], a proposal ment, BibT X uses a bibliography style ruling the E for a future standardised library has been estab- layout of this section. Some bibliography styles are lished [18, §§ 1.1 & 1.2], but we cannot say that the unsorted, that is, the order of bibliographical items 1 present version of Scheme is Unicode-compliant. within the bibliography is the order of first citations

∗ Title in Polish: Zarządzanie zasadami sortowania lek- 1 At the time of writing the revised version of this article, sykograficznego w MlBibTEX-u. the proposal for Scheme’s next standard [19, 18] has been

TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting 1001 Jean-Michel HUFFLEN

So some parts of our present implementation of or- separately from the words beginning with ‘’. In der relations are temporary, but we think that this fact, the ‘c-’ entry, in a Hungarian dictionary, con- implementation could be easily updated for future tains words beginning with ‘c’ and not with ‘’. Unicode-compliant versions. The ‘c-’ entry is followed by the ‘cs-’ entry, before In a first section, we show how diverse lexi- the ‘d-’ entry. cographic orders used throughout European coun- Anyway, it clearly appears that there cannot tries are. This allows readers to estimate this diver- be a universal order, encompassing all lexicographic sity and to realise the complexity of this task. We orders. Besides, these orders aim to classify words also explain why this problem is made more diffi- of a dictionary, that is, common words belonging to cult when we consider it within the framework of a language, even if some dictionaries may include bibliographies. Then we show how order relations some proper names. When bibliographies are gen- operate in MlBibTEX and how they are built. We erated, order relations are used to sort bibliograph- also give some details about the common and differ- ical items, most often w.r.t. authors’ names. These ◦ ent points betwen xındy [13, § 11.3] and MlBibTEX, names may be ‘foreign’ proper names if we consider these two programs using multilingual order rela- the language used for the bibliography. So names tions. can include characters outside of this language’s al- Reading this article does not require advanced phabet. As a consequence, an order relation for sort- knowledge of Scheme;2 in fact, we think that a non- ing a bibliography should be able to deal with any programmer should be able to specify a new order , since any letter may appear in foreign names. relation. We give more technical details in an an- A good choice is to associate such a foreign letter nex, for users that would like to do more experi- with a letter belonging to the ‘basic’ Latin alphabet, ment themselves. In particular, we explain how to so this foreign letter is interleaved with this basic let- deal with languages using the Latin 2 encoding, even ter, which takes precedence over the foreign letter if if our implementation is based on Latin 1. two words differ only by these two letters. If we con- sider the English language, this means that accented 1 European languages and lexicographic letters are interleaved with unaccented letters, but orders unaccented letters take precedence. So proceed most Figure 1 gives an idea about the diversity of order of implementation of order relations. relations used throughout some European countries. Unicode provides a default algorithm to sort all a < b its characters. This algorithm is based on a sort key In this figure, ‘ ’ denotes that the words begin- 3 ning with a are less than the words beginning with table, DUCET [23]. It is also based on a decom- , whereas ‘a ∼ b’ expresses that the letters a and b position property for composite characters. For ex- are interleaved, except that a takes precedence over ample, the ‘ô’ letter, whose name and code point — b if two words differ only by these two letters. given using hexadecimal numbers — are: Roughly speaking, there are two family of lan- latin small letter o with , guages, if we consider the associated lexicographic U+00F4 orders. In some languages, accented letters are fully can be decomposed into these ‘simpler’ characters: viewed as ‘real’ letters, distinct from unaccented latin small letter o, U+006F ones: examples are given by Slavonic languages. In combining circumflex accent, U+0302 other languages, accented letters are processed as if The sort algorithm requires several passes. To de- there is no accent. The precedence of a unaccented scribe it roughly, an information about weight, given letter over an accented one is not managed the same by sort keys, is associated with each string. Then way: it follows a left-to-right order in Irish, Ital- this information is re-arranged according to sort lev- ian, and Portuguese, a right-to-left order in French. els, w.r.t. letters, w.r.t. accents, etc. Finally, a binary The ‘mixes’ the two approaches: comparison between bytes is done, level by level, un- some accented letters — ‘õ’, ‘ä’ — are alphabeticised, til the two strings can be distinguished. This algo- some — ‘š’, ‘ž’ — are interleaved. Last, some letter rithm can be refined for a particular language, by groups may be viewed as a single letter and alpha- using a specialised sort key table, possibly including beticised as another letter. For example, the Hun- sort keys for accented letters and digraphs viewed garian words beginning with ‘cs’ are alphabeticised as single letters. submitted for ratification. See http://www.r6rs.org for more This modus operandi would be difficult to put details. into action within MlBibTEX. First, we do not have 2 Readers can refer to [20] for an introductory book about Scheme. 3 Default Unicode Element Table.

1002 TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting Managing Order Relations in MlBibTEX

• The Czech alphabet is: a < b < c < č < d < ... < h < ch < i < ... < r < ř < s < š < t < . . . z < ž. • In Danish, accented letters are grouped at the end of the alphabet: a < ... < z < æ < ø < å. • The Estonian language does not use the same order for unaccented letters than usual Latin order; in addition, accented letters are either inserted into the alphabet or alphabeticised like the corresponding unaccented letter: a < ... < s ∼ š < z ∼ ž < t < ... < w < õ < ä < ö < ü < x < y. • Here are the accented letters in the : à ∼ â, ç, è ∼ é ∼ ê ∼ ë, î ∼ ï, ô, ù ∼ û ∼ ü, ÿ. a When two words differ by an acccent, the unaccented letter takes precedence, but w.r.t. a right-to-left order: cote < côte < coté < côté. The French language also use two ligatures: ‘æ’ (resp. ‘œ’), alphabeticised like ‘ae’ (resp. ‘oe’). • There are three accented letters in German — ‘ä’, ‘ö’, ‘ü’ — and three lexicographic orders: b – DIN -1: a ∼ ä, o ∼ ö, u ∼ ü; – DIN-2: ae ∼ ä, oe ∼ ö, ue ∼ ue; – Austrian: a < ä < ... < o < ö < ... < u < ü < v < ... < z. • The Hungarian alphabet is: a ∼ á < b < c < cs < d < dz < dzs < e ∼ é < f < g < gy < h < i ∼ í < j < k < l < ly < m < n < ny < o ∼ ó < ö ∼ ő < p < ... < s < sz < t < ty < u ∼ ú < ü ∼ ű < v < ... < z < zs Some double digraphs may be restored before : ccs → cs+cs, ddz → dz+dz, ggy → gy+gy, lly → ly+ly, nny → ny+ny, ssz → sz+sz, tty → ty+ty The same for the double trigraph: ddzs → dzs+dzs. • The Polish alphabet is: a < ą < b < c < ć < d < e < ę < ... < l < ł < m < n < ń < o < ó < p < ... < s < ś < t < ... < z < ż • The Romanian alphabet is: a < ă < â < b < ... < i < î < j < . . . s < ş < t < ţ < u < ... < z. • The Slovak alphabet is: a < á < ä < b < c < č < d < ď < dz < dž < e < é < f < g < h < ch < i < í < j < k < l < ĺ < ′ l < m < n < ň < o < ó < ô < p < q < r < ŕ < s < š < t < ť < u < ú < ... < y < ý < z < ž • The Spanish alphabet was a < b < c < ch < d < ... < l < ll < m < n < ñ < o < ... < z until 1994. Now the digraphs ‘ch’ and ‘ll’ are no longer viewed as single letters in modern dictionaries, and the words using ‘’ are interleaved with words using ‘n’. • In Swedish, accented letters are grouped at the end of the alphabet: a < ... < z < å < ä < ö.

a Using a left-to-right order for this step is common mistake even for French people. But the accurate order is right-to-left, as specified in [7]. b Deutsche Institut für Normung (German Institute of normalisation).

Figure 1: Some order relations used in European languages. complete support for Unicode:4 for example, we • refine the sort about accent information when cannot directly deal with caracters such as the ‘com- accented letters are interleaved with others, bining circumflex accent’, not included in the Latin-1 • test the case. encoding. But we keep the idea about decomposi- 5 2 Generating order relations tion, replacing the combining characters by ASCII characters. For example, the ‘combining circumflex Let us recall that MlBibTEX can apply BibTEX’s accent’ will be replaced by the ‘^’ character. To bibliography styles using a compatibility mode [6], sum up, our order relations are based on a 3-step but in order to take advantage of MlBibTEX’s multi- algorithm: lingual features as far as possible, it is better to use 6 7 the nbst language [4], close to XSLT [24], the lan- • replace composite characters (‘foreign’ letters 8 or composite characters not viewed as single let- guage of transformations used for XML documents. ters) when extracting successive letter groups Let us recall that parsing a bibliography data base .bib and compare the two results, ( ) results in the representation of an XML tree in 6 New Bibliography STyles. 4 See the annex. 7 eXtensible Language Stylesheet Transformations. 5 American Standard Code for Information Interchange. 8 eXtensible Markup Language.

TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting 1003 Jean-Michel HUFFLEN

Scheme [11], this nbst language includes an element action according to a left-to-right (resp. right- for sorting selected subtrees of an XML document to-left) order. Cf. the use of these two values [4, App. A], this element being analogous to XSLT’s for position property in Unicode. For example: where “ |’| ” denotes the default weight of the “ ’ ” Due to the language attribute’s value, this sort op- character. MlBibTEX knows such decomposition in- eration will use the lexicographic order for the Ger- formation for each accented letter of Latin 1. These man language. Such an order relation is to be speci- default associations can be overrided by alphabet- fied in Scheme, as a 2-argument function taking two specific associations given to the function building s s strings 0 and 1 and returning a ‘true’ value (#t) orders. Weights are managed as follows. if s0 is strictly less than s1, a ‘false’ value (#f) oth- • By default, the weight of each component of an erwise. alphabet — appearing within the second argu- The best way to define such a function is to de- ment of

1004 TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting Managing Order Relations in MlBibTEX

(define

Figure 2: Building order relations for some European languages.

TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting 1005 Jean-Michel HUFFLEN

MlBibTEX notices the possible presence of multi- 4 Conclusion character sequences (e.g., digraphs or trigraphs). If The availability of these language-dependent order need be, it builds a lexical analyser able to return the relations within a unique program has been planned longest sequence of characters belonging to this al- 11 through the use of the language attribute, as speci- phabet, an example of use being given in Figure 3. 14 fied in the W3C recommendation about XSLT [24, Let us mention that these analysers extract as few § 10]. However, these relations have been imple- sequences of characters as possible. For example, if mented only partially in most of XSLT processors. a we have to compare a word beginning with ‘ ’ and Of course, our implementation also provides this ser- b a word beginning with ‘ ’ in English, only the first vice partially, because we are limited to European "a" "b" letters — and — are extracted because that languages. But we think that the orders we define is sufficient to conclude. are correct w.r.t. these languages and they are ac- From a point of view related to implementa- tually running. Our implementation is clearly influ- tion, the encoding of the sequences of an alphabet enced by the Unicode collation algorithm, it is a first w.r.t. an increasing order is implemented by means 12 step towards general algorithms for lexicographic or- of hash tables, what ensures efficiency. Let us not ders, it is a first version subject to changes when we forget that these order relations are used to sort bib- explore other languages or get criticisms from end- liographical items, and sorting requires many calls users. In many domains, improvement has existed to the function modelling an order relation. because first versions existed. We think that will be ◦ also the case for our functions. 3 MlBibTEX vs xındy ◦ xındy [9] and MlBibT X do not aim to perfom the 5 Acknowledgements ◦ E same task, since xındy is an index processor. How- Many thanks to Jerzy B. Ludwichowski, who has ever, both have common points: they reimplement written the Polish translation of the abstract. I also ‘old’ programs belonging to TEX’s galaxy — makein- thank Gyöngyi Bujdosó, Hans Hagen, Karel Horák, dex [13, § 11.2] and BibTEX — with particular fo- Dag Langmyhr, who helped me fix some errors. cus on multilingual features, they are both writ- ten using a Lisp13 dialect: Common Lisp [21] for A How to use MlBibTEX’s functions ◦ xındy, Scheme for MlBibTEX. Of course, the suc- A.1 Getting started cessive steps used for putting an order relation into To use the functions dealing with multilingual or- action — needed to arrange the successive entries of ◦ dering, change your current directory into the src an index — also exist in xındy. But the specification subdirectory of MlBibTEX’s main directory, launch of an order relation is different because it is done a Scheme interpreter, and proceed as follows: step by step. There are forms: (load "common.scm") ; Loading general define-alphabet define-letter-group ; definitions. merge-rule sort-rule (load "orders.scm") ; Loading all the to specify an alphabet, a letter group, and the re- ; definitions related to orders. This causes placement of a pattern. If a sort procedure is quite ; the other files needed to be loaded, too. close to the standard way used in English, it is prob- ◦ Then you can use the functions described in Fig- xındy ably easier to use ’s forms, because only small ure 2. Use a Scheme interpreter R5RS-compliant [10] changes have to be expressed. In MlBibTEX, we and able to deal with the Latin 1 encoding: bigloo 15 chose to develop less functions, but more powerful. [16], MIT Scheme[3], PLT Scheme [2] do that. This allows a global view of a new order relation Now we describe the conventions used within and makes easier some coherence tests among the strings resulting from parsing a .bib file. These con- information about this relation. ventions are supposed to be followed by the argu- ments of the functions modelling order relations, so 11 Such lexical analysers are implemented by means of you have to know them. You can directly type ac- tries. In MlBibTEX, this structure is also used to manage the information related to language identifiers, as explained cented letters belonging to the Latin 1 encoding: in [5]. 12 A hash table has a set of entries, and can efficiently "Frank Böhmert" map an object to another object. This structure is described in [1] from a general point of view, our implementation of 14 World Wide Web Consortium. 15 hash tables in MlBibTEX is inspired by [8]. In fact, these three Scheme interpreters include partial 13 LISt Processor. support of Unicode, as mentioned in the introduction.

1006 TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting Managing Order Relations in MlBibTEX

(define mk-hungarian-word-sectioner ; Building a generator of sectioning functions for Hungarian words. (

Figure 3: How to section Hungarian words.

In Scheme, the ‘ " ’ character being the delimiter of to build a generator of functions sectioning words for constant strings, it must be escaped by a ‘\’ charac- a particular language. This order-relation as follows: tween braces, as suggested by the previous exam- (c-language->order-relation ple. If it has an argument, write this argument be- "german" tween braces, as in "Rezs\\H{o} Kókai" for ‘Rezső order-relation according to the modus operandi we explain in § 2 16 function returns #t if the association succeeds, #f and try to model some ‘exotic’ order relations. otherwise (for example, if it is called with a string A.2 Testing decomposition whose value is an unknown language). To see how words are sectioned into successive let- References ters, digraphs, etc. according to a particular alpha- [1] Alfred V. Aho, Ravi Sethi and Jeffrey D. Ull- bet, then use the

TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting 1007 Jean-Michel HUFFLEN

[2] Matthew Flatt: PLT MzScheme: Lan- [14] Oren Patashnik: Designing BibTEX Styles. guage Manual. Version 360. August 2004. February 1988. Part of the BibTEX distribu- http://download.plt-scheme.org/doc/ tion. 360/pdf/mzscheme.pdf. [15] Oren Patashnik: BibTEXing. February 1988. [3] Chris Hanson, the MIT Scheme team et al.: Part of the BibTEX distribution. MIT Scheme Reference Manual, 1st edition. [16] Manuel Serrano: Bigloo. A Practical Scheme March 2002. Massachusetts Institute of Tech- Compiler. User Manual for Version 2.9a. De- nology. cember 2006. [4] Jean-Michel Hufflen: “MlBibTEX’s Version [17] Olin Shivers: List Library. October 1999. 1.3”. TUGboat, Vol. 24, no. 2, pp. 249–262. July http://srfi.schemers.org/srfi-1/. 2003. [18] Michael Sperber, William Clinger, R. Kent [5] Jean-Michel Hufflen: Managing Languages Dybvig, Matthew Flatt, Anton van within MlBibTEX. To appear. June 2005. Straaten, Richard Kelsey and Jonathan Rees: Revised5.97 Report on the Algorithmic [6] Jean-Michel Hufflen:“BibTEX, MlBibTEX Language Scheme — Standard Libraries. June and Bibliography Styles”. Biuletyn GUST, 2007. hhtp://www.r6rs.org. Vol. 23, pp. 76–80. In BachoTEX 2006 con- ference. April 2006. [19] Michael Sperber, William Clinger, R. Kent Dybvig, Matthew Flatt, Anton van [7] ISO-IEC CD 14651: International String Order- Straaten, Richard Kelsey, Jonathan ing — Method for Comparing Character Strings Rees, Robert Bruce Findler and Jacob and Description of a Default Tailorable Order- 5.97 Matthews: Revised Report on the Al- ing. May 1996. gorithmic Language Scheme. June 2007. [8] Panu Kalliokoski: Basic Hash Tables. hhtp://www.r6rs.org. September 2005. http://srfi.schemers. [20] George Springer and Daniel P. Friedman: org/srfi-69/. Scheme and the Art of Programming. The MIT ◦ [9] Roger Kehr: xındy Manual. February 1998. Press, McGraw-Hill Book Company. 1989. http://www.xindy.org/doc/manual.html. [21] Guy Lewis Steele, Jr., Scott E. Fahlman, [10] Richard Kelsey, William D. Clinger, Richard P. Gabriel, David A. Moon, Jonathan A. Rees, Harold Abelson, Nor- Daniel L. Weinreb, Daniel Gureasko Bo- man I. Adams iv, David H. Bartley, brow, Linda G. DeMichiel, Sonya E. Keene, Gary Brooks, R. Kent Dybvig, Daniel P. Gregor Kiczales, Crispin Perdue, Kent M. Friedman, Robert Halstead, Chris Han- Pitman, Richard Waters and Jon L White: son, Christopher T. Haynes, Eugene Edmund Common Lisp. The Language. Second Edition. Kohlbecker, Jr, Donald Oxley, Kent M. Digital Press. 1990. Pitman, Guillermo J. Rozas, Guy Lewis [22] The Unicode Consortium: The Unicode Steele, Jr, Gerald Jay Sussman and Mitchell Standard Version 4.0. Addison-Wesley. August Wand: “Revised5 Report on the Algorithmic 2003. Language Scheme”. HOSC, Vol. 11, no. 1, pp. 7– [23] The Unicode Consortium, http: 105. August 1998. //unicode.org/reports/tr10/: Unicode [11] Oleg E. Kiselyov: XML and Scheme. Septem- Collation Algorithm. Unicode Technical ber 2005. http://okmij.org/ftp/Scheme/ Standard #10. July 2006. xml.html. [24] W3C: XSL Transformations (XSLT). Ver- [12] Leslie Lamport: LATEX: A Document Prepa- sion 1.0. W3C Recommendation. Edited by ration System. User’s Guide and Reference James Clark. November 1999. http://www.w3. Manual. Addison-Wesley Publishing Company, org/TR/1999/REC-xslt-19991116. Reading, Massachusetts. 1994. [13] Frank Mittelbach, Michel Goossens, Jo- hannes Braams, David Carlisle, Chris A. Rowley, Christine Detig and Joachim Schrod: The LATEX Companion. 2nd edition. Addison-Wesley Publishing Company, Read- ing, Massachusetts. August 2004.

1008 TUGboat, Volume 0 (2060), No. 0 — Proceedings of the 2060 Annual Meeting