Implementing Language-Dependent Lexicographic Orders in Scheme

Jean-Michel HUFFLEN LIFC (EA CNRS 4157) — University of Franche-Comté

16, route de Gray — 25030 BESANÇON CEDEX — FRANCE 

Abstract related to Unicode, an industry standard designed to al- The lexicographical order relations used within are low text and symbols from all of the writing systems of the world to language-dependent, and we explain how we implemented such be consistently represented [2006]. Our method can be generalised orders in Scheme. We show how our sorting orders are derived to other alphabets. In addition, let us mention that if a document from the Unicode algorithm. Since the result of a Scheme cites some works written using the Latin alphabet and some writ- function can be itself a function, we use generators of sorting ten using another alphabet (Arabic, Cyrillic, . . . ), they are usually orders. Specifying a sorting order for a new natural language has itemised in two separate bibliography sections. So this limitation is been made as easy as possible and can be done by a programmer not too restrictive within MlBIBTEX’s purpose. who just has basic knowledge of Scheme. We also show how The next section of this article is devoted to some examples Scheme data structures allow our functions to be programmed in order to give some idea of this task’s complexity. We briefly efficiently. recall the principles of the Unicode collation algorithm [2006a] in Section 3 and explain how we adapted it in Scheme in Section 4. Keywords Lexicographical order relations, collation algorithm, Section 5 discusses some points, gives some limitations of our Unicode, MlBIBTEX, Scheme. implementation, and sketches possible future work.

1. Introduction 2. Lexicographic orders and natural languages Sorting words belonging to a natural language, like in a dictio- The basic lexicographic order, well-known in , can be nary, depends on this language. As part of the MlBIBT X project— E defined by: for ‘MultiLingual BIBTEX’—we developed functions implemented collation, that is, determining sorting orders of strings of charac- [ ] ≤ s1 ters. Let us recall that MlBIBT X aims to be a ‘better’ BIBT X E E [x0|s0] ≤ [x1|s1] ⇐⇒ x0

function used within these styles [Mittelbach et al., 2004, Ta- are to be ignored in a first pass. Then if two words differ only ble 13.7] ignores accents and other diacritical signs, so in practice, be the case of a letter, an uppercase letter takes precedence over it is suitable only for the English language. MlBIBTEX allows bib- the corresponding lowercase one, according to a left-to-right order. liographical items of a document written in English (resp. French, In addition, let us notice that this relation can be implemented German, . . . ) to be sorted according to the English (resp. French, efficiently for unaccented letters since the ASCII1 codes for letters German, . . . ) order. follow the English . MlBIBTEX has been developed in Scheme, we explained the rea- deal with: an uppercase letter takes precedence over the corre- sons of this choice and outlined its architecture in [Hufflen, 2005b]. sponding downcase one if two words differ only by the case of a Here we focus on the definitions of sorting orders as part of this letter, and the order is left-to-right. framework. MlBIBTEX’s first version only deals with the European As shown by some non-limitative examples in Figure 1, this languages using the Latin alphabet. Our method is derived from an problem may be more complicated for other languages. We can ob- serve some changes in the alphabetical order of unaccented letters:

in the Estonian language, ‘’ is not the last letter of the alphabet, it is ranked between ‘’ and ‘ ’. Accented letters may be treated as individual letters, like in Swedish, or interleaved with unaccented letters, like in the most common orders used in Germany. The same

for ligatures: ‘’ is viewed as a separate letter in Swedish, alpha-

beticised like ‘’ in French. If accented letters are interleaved with unaccented ones, the latter take precedence when two words differ Copyright c 2007 Jean-Michel Hufflen. Proceedings of the 2007 Workshop on Scheme and Functional Programming only by accents. In most cases, the order is left-to-right—that is true Université Laval Technical Report DIUL-RT-0701 1 American Standard Code for Information Interchange.

Scheme and Functional Programming 2007 139 • The Czech alphabet is: a < b < c < cˇ < d < ... < h < ch < i < ... < r < rˇ < s < š < t < ... z < ž. • In Danish, accented letters are grouped at the end of the alphabet: a < ... < z < æ < ø < å ∼ aa. • The Estonian language does not use the same order for unaccented letters than usual Latin order; in addition, accented letters are either inserted into the alphabet or alphabeticised like the corresponding unaccented letter: a < ... < s ∼ š < z ∼ ž < t < ... < w < õ < ä < ö < ü < x < y • Here are the accented letters in the French language: à ∼ â, ç, è ∼ é ∼ ê ∼ ë, î ∼ ï, ô, ù ∼ û ∼ ü, ÿ.

When two words differ by an acccent, the unaccented letter takes precedence, but w.r.t. a right-to-left order: cote < côte < coté < côté. The French language also use two ligatures: ‘’ (resp. ‘ ’), alphabeticised like ‘ae’ (resp. ‘oe’).

There are three accented letters in German—‘’, ‘ ’, ‘ ’—and three lexicographic orders: DINa-1: a ∼ ä, o ∼ ö, u ∼ ü; DIN-2: ae ∼ ä, oe ∼ ö, ue ∼ ue; Austrian: a < ä < ... < o < ö < ... < u < ü < v < ... < z. • The Hungarian alphabet is: a ∼ á < b < c < cs < d < dz < dzs < e ∼ é < f < g < gy < h < i ∼ í < j < k < l < ly < m < n < ny < o ∼ ó < ö ∼ o˝ < p < ... < s < sz < t < ty < u ∼ ú < ü ∼ ˝u < v < ... < z < zs • In Swedish, accented letters are grouped at the end of the alphabet: a < ... < z < å < ä < ö. ‘a

uses a right-to-left order (cf. Fig. 1). In some languages, digraphs two words differ only by these two letters. So proceed most of im-

may sort as separate letters: for example, ‘ ’ is ranked between ‘ ’ plementations of lexicographic order relations. Let us notice that a and ‘’ in Czech. The Hungarian language uses a trigraph, ‘ ’, foreign name may include additional letters whose association with

as a separate letter. In addition, there are special rules for double a basic letter may be difficult: for example, the Icelandic ‘’ letter.

digraphs in this language: for example, ‘+ ’ is written ‘ ’ in

this language, but the two successive digraphs should be restored 3. The Unicode collation algorithm

before sorting: ‘’ should be sorted as ‘ ’.

The same rule holds for the double trigraph ‘’, for ‘ + ’. Unicode provides a default algorithm [2006a] to sort all the strings

2 Other equivalences exist: in Danish, ‘’ is equivalent to ‘ ’ . build over its characters. It consists of a multilevel algorithm: each So it clearly appears that there cannot be a universal order, en- step sorts the elements left unsorted by the preceding step. Here are compassing all lexicographic orders. In addition, let us recall that these levels:

we are interested in such order relations in order to sort bibliograph-

L1 Base characters < <

ical items w.r.t. authors’ names. These names may be ‘foreign’

 L2 Accents < <

proper names if we consider the language used for the bibliogra-

 L3 Case < <

phy, that is, the language of the document. A very simple example

L4 Punctuation < <

is the use of English names within the bibliography of a document

 Ln Tie-breaker < < written in French. Such foreign names may include characters out- side the alphabet of the document’s language. As a consequence, The differences indicated by the underlined characters are swamped

an order relation for sorting the items of a bibliography should be at stronger-level steps, for example, the difference between ‘’ and  able to deal with any letter belonging to a language written using ‘’ at Level 1. In the last example, ‘ ’ represents a format charac- the Latin alphabet, since such letters may appear in foreign names. ter, which is otherwise ignorable. A good choice is to associate accented foreign letters with the cor- The first two steps are based on a decomposition property

responding unaccented letter. If we consider the English language, [2006b] for composite characters. For example, the ‘’ letter, this means that accented letters are interleaved with unaccented let- whose name and code point—given using hexadecimal numbers— are: 2 The fact that a of several letters may be equivalent to one has been pointed out in an example given in the proposal for the new standard LATIN SMALL LETTER O WITH CIRCUMFLEX, U+00F4 of Scheme [cf. Sperber et al., 2007, § 1.2]:

can be decomposed into these two ‘simpler’ characters put together

=

⇒ when a text is to be written: because in German, the uppercase form of the ‘’ letter is ‘ ’. On the LATIN SMALL LETTER O, U+006F

contrary: COMBININGCIRCUMFLEXACCENT, U+0302

= ⇒ The sort keys used for each level are extracted from a Unicode 3 As mentioned in [Flatt and Feeley, 2005], the implementation of the func- collation element table, which defaults to the DUCET , given by tions comparing strings can no longer be defined in terms of character-by- character comparisons, as they are in the present standard R5RS [1998]. 3 Default Unicode Collation Element Table.

140 Scheme and Functional Programming 2007

4 the file , available at the Web site of Unicode . These In addition, the general collation algorithm can be simplified in

keys are weight values. For example, these values for the letters ‘ ’, our case. Let us recall that we deal with natural languages written

‘’, and the combining circumflex accent, given using hexadecimal using the Latin alphabet. values, are: • Only letters and whitespace characters are of interest for us:

LATIN SMALL LETTER C ℄ the punctuation signs can be dropped out, so the last step of

LATIN SMALL LETTER O ℄ the general algorithm is not needed. According to languages,

COMBININGCIRCUMFLEXACCENT ℄ some additional characters can be recognised—for example,

The first pass uses the first column’s values as primary sort the hyphen ‘’ character—they are either ignored or ranked key, the second pass uses the second column’s values as sec- between the space character and all the letters. • ondary sort key, and so on. ‘’ values are to being ignored. As far as we know, the third step is the same for all the lan- In our example, this means that a combining circumflex accent guages we deal with: an uppercase letter takes precedence over is to be ignored by the first pass of the algorithm. In fact, this the corresponding downcase one if two words differ only by the algorithm now consists of a binary comparison between double case of a letter, and the order is left-to-right. bytes, until the two strings can be distinguished. If a right-to-left So, to derive a sorting order for strings from a generator, we order is to be used for the second step, like in the French lan- have to provide four arguments. guage, the lists of double bytes belonging to the second column

should be reversed before applying comparisons. For example, • A list whose elements are separator characters, viewed less 

these values are for the ‘ ’ than any letter. It should begin by the space character and word, ‘ for the ‘ ’ word. Do- often this list contains only this character, in which case the

ing a double-byte-to-double-byte comparison allows us to con- variable can be used. This is not universal: for clude that  < , which is the case if the default colla- example, space characters are ignored when words are sorted tion algorithm is applied. If we consider the right-to-left order in Hungarian (cf. the definition of the variable in (e.g., for French), these lists of double-bytes are to be reversed, Figure 2).

and the comparison between and • An alphabet, given w.r.t. the increasing order, as a list of strings.

 implies that > , which If the ‘classical’ alphabet is used—unaccented letters of the is correct for this language. Latin alphabet, sorted according to the usual order— just put

Weight values used as sort keys may be different from the code the ‘false’ value (cf. the definition of the variable). points ranking all the characters of Unicode and may be language- • An association list for additional of characters, each dependent. If a letter should be viewed as synonym of consecutive sequence being followed by a replacement and a weight value. letters, for example, the ‘’ ligature in English, the table gives That means that a decomposition is to be applied to these several 4-uples: sequences. • LATIN SMALL LETTER AE ℄ A function related to the sense of the second step: when the first

℄ is finished, weight values prepared for the second step appear

5 ℄ in reverse order, so put if this second step’s order is

left-to-right, put —the identity function) for a right-

‘’ and ‘ ’ being the primary sort keys for the letters ‘ ’ and

to-left order. Cf. the use of these two values for and

‘’. If a digraph should be viewed as a single letter, two consecutive characters are given a unique 4-uple of weight values. For example, .

‘ ’ in traditional Spanish: It should be noticed that only lowercase letters have to be specified, LATIN SMALL LETTER C, LATIN SMALL LETTER H; "ch" the equivalent relations among uppercase letters will be deduced.

℄ Figure 2 shows how the order relations for the European lan- guages described in Figure 1 are put into action. The result of Given a particular language, some characters are given vari-

our generator of order relations, , is a 2-

able weight values for the Unicode collation algorithm [see 2006a, × argument function. Such a function takes two strings, ×0 and 1,

§ 3.2.2]: they may be ignorable, non-ignorable, shifted, or shift-

× × and returns if 0 is strictly less than 1 according to the order trimmed. ‘Shifted’ (resp. ‘shift-trimmed’) means that the variable-

relation for the corresponding language, otherwise. Such an or- weighted characters are ranked before (resp. after) the others at the der relation is able to deal with strings containing ‘foreign’ letters, fourth step. since there are default associations for the accented letters of all the

European languages. For example, the Polish letter ‘’ is associated 4. Our implementation with the ‘’ letter by default, the weight value allowing ‘ ’ to take

precedence over ‘’ at the second step of the sorting order if need As part of MlBIBTEX, our implementation of the collation algo-

rithm aims to serve two purposes: be. Let us give two examples:



• end-users should be able to add a new order for a new language =⇒



easily, provided that they can express how this order is built in =⇒  an abstract way; In the first example, ‘’ and ‘ ’ are foreign letters for the English • resulting sorting orders for strings should be efficient, because language, and the order for the second step is left-to-right. In the they are used to sort list of bibliographical items, these lists be- second example, this order is right-to-left, as in French. ing possibly big: such an operation requires many comparisons Our generator proceeds as follows. among strings. Of course, efficient sort are known for a long time, but the more efficient the comparisons among 5 Some Schemers could observe that this function does not belong to pure strings, the more efficient the list sort. functional style, because it is potentially destructive [Shivers, 1999, see].

But it is more efficient than the function and the weight informa- 4

tion list is not shared with other lists.

Scheme and Functional Programming 2007 141

Put these three letters at the end of the standard

alphabet.

In Hungarian, a whitespace character is irrelevant when words are sorted.

Figure 2. Building order relations for some European languages.

• All the letters of the alphabet—the second argument of the to build a trie6. If a single letter—or a digraph or trigraph—

function—and all the members—its is recognised, this trie gets access to either the corresponding

third argument—supersede the default definitions. value of an association, or the value, in which case the recog- • All the separator characters and letters of the alphabet are num- nised sequence belongs to the alphabet. The other characters are bered and used as entries of a hash table, getting access to cor- ignored. responding numbers. Such hash tables have been put into action by means of the functions of SRFI 69 [see Kalliokoski, 2005]. 6 • All the separator characters, all the letters of the alphabet, all This word originates from the central letters of the word ‘retrieval’ Fred- kin [1960]. A digital is a tree for storing strings in which nodes are the associations’ keys, and all the default definitions are used organised by substrings common to two or more strings, a trie is a particu- lar case of a digital tree: there is only one node for every common prefix.

142 Scheme and Functional Programming 2007

Definition of a zero-argument function that will section the word ‘szol˝ o’˝ (‘grape’).

=⇒ The successive equivalent letters, digraphs, etc. of this word are returned in turn, with the corresponding

=⇒ weight value.

=⇒

=⇒

=⇒ The word is finished, so all the calls of this function will return the ‘false’ value, from now on.

Another example.

=⇒

=⇒

=⇒

=⇒

=⇒

=⇒ Although the double digraph is written as ‘ ’, it is replaced by two occurrences of the ‘ ’ digraph.

=⇒

=⇒

=⇒ =⇒

Figure 3. Sectioning Hungarian words.

• Our tries are implemented by balanced ternary search trees. For example, the Hungarian word ‘’ (cf. Figure 3) should be

‘Balanced’ means that for each non-empty subtree, the numbers typed by ‘ ’. As part of MlBIBTEX, this is not of elements of the left, middle, and right branches do not differ a real drawback since end-users get used to type accented letters from more than 1. To get this trie, we sort the alphabet accord- by means of LATEX commands within their bibliography database

7 ing to the encoding, so our hash table is used to retain files . In addition, it will easy to adapt our functions when Scheme information about precedence within this alphabet. On another becomes Unicode-compliant.

point, the resulting trie allows us to efficiently implement word Another limitation is given by exceptions. For example, let us

sectioning into letters, digraphs, etc. consider the following Hungarian person names: < . They follow the Hungarian rules for sorting names (cf. Figure 1).

The weight value associated with each string belonging to the

But we have < because of etymological reasons, su- alphabet is 1. So you can use weight values greater than or equal perseding the usual decomposition of words. Probably a to 2 for accented letters belonging to the language. In comparison of exceptions would be the best way to solve this problem, but we with the Unicode collation algorithm, we skip combining charac- have not implemented it yet. ters resulting from the decomposition procedure and only give their

weight value. For example, the accents allowed in French over the In MlBIBTEX, we chose to allow the introduction of a new sort-

‘’ letter are the grave and circumflex accents (‘ ’ and ‘ ’), but not ing order by means of only one definition. This allows a global view

the acute one (‘’). The allowed accents are given 2 and 3 as weight of this new order relation and makes easier some coherence tests among the information about this relation. A different approach has

values, they come before the default value for the acute accent over ◦

this letter. On the contrary, we do not specify that the ‘ ’ ligature is been followed by [Kehr, 1998], a multilingual index proces- A alphabeticised like ‘’ because it is the default definition for this sor associated with LTEX, and written using COMMON LISP [Steele character. et al., 1990]. The specification of an order relation is different be- We show how strings are sectioned in Figure 3. When an order cause it is done step by step. There are forms:

relation is applied to two strings, we build sectioner functions for

these two strings. We section a string as few times as possible and stop as soon as we can conclude. The example given is a sectioner for Hungarian words, possibly using digraphs and double digraphs to specify an alphabet, a letter group (digraph, trigraph, etc.), and (cf. § 2). This example also includes words containing accented the replacement of a pattern. If a sort procedure is quite close to the

letters interleaved with unaccented ones (‘ ’ and ‘ ’, interleaved standard way used in English, it is probably easier to use ’s with ‘’ and ‘ ’). forms, because only small changes have to be expressed. On the contrary, MlBIBTEX allows users to define a new order relation by applying only one function, encompassing all the aspects of this 5. Discussion new order relation. As we explain in [Hufflen, 2005b], we decided that MlBIBTEX Even if we have adapted the Unicode collation algorithm to should be used with several Scheme interpreters, in order to enforce our requirements for MlBIBTEX, we think that we could easily this program’s portability. There is a proposal to make Scheme implement an efficient version of the whole algorithm—not limited Unicode-compliant [Flatt and Feeley, 2005]; that is planned for to languages written using the Latin alphabet—by means of the the future standard [Sperber et al., 2007, §§ 1.1 & 1.2]; but only same structures: tries and hash tables. A possible improvement a little support for Unicode is provided now, rudimentary about possible encodings [Serrano, 2006, p. 35], MIT Scheme [Hanson 7

When a bibliography data base file is parsed by MlBIBTEX, the LATEX com-

et al., 2002, § 5.7], [Flatt, 2007, § 1.2.1]. In fact, ½

mands that result in characters belonging to the ÄaØiÒ encoding are ex- IB MlB TEX’s basic encoding is , and European characters panded, the others are left unchanged. So parsing ‘ ’ within A

outside it are obtained by means of a workaround: the LTEX com- the value associated with a BIBTEX field results in ‘ ’, whereas parsing mands to produce them [see Mittelbach et al., 2004, Table 7.33]. ‘’ results in the Scheme string ‘ ’.

Scheme and Functional Programming 2007 143 could be the extraction of sort keys from the files available at the of the 6th Workshop on Scheme and Functional Programming, Web site of Unicode: it would just require an ad hoc parser. volume 619 of Indiana University Science Depart- Finally, let us remark that we used continuation-based functions ment, pages 77–87, Tallinn, September 2005b. to put into action the sectionning of a string into letters, including Panu Kalliokoski. Basic Hash Tables. the case of digraphs. A more concise specification could have been , September 2005.

given by using a lazy functional programming language based on ◦

 ÒdÝ Roger Kehr. Ü Manual. the call by need—e.g., Haskell [Peyton Jones, 2003]—getting the , February 1998. next letter is done only if need be. Richard Kelsey, William D. Clinger, Jonathan A. Rees, Harold Abelson, Norman I. Adams iv, David H. Bartley, Gary Brooks, 6. Conclusion R. Kent Dybvig, Daniel P. Friedman, Robert Halstead, Chris The availability of sorting orders depending on natural languages Hanson, Christopher T. Haynes, Eugene Edmund Kohlbecker, Jr, 8 Donald Oxley, Kent M. Pitman, Guillermo J. Rozas, Guy Lewis is planned in XSLT [1999], the language of transformations used 5 9 Steele, Jr, Gerald Jay Sussman, and Mitchell Wand. Revised for XML documents. XSLT provides an element that can sort strings according to the rules of a natural language [see W3C, report on the algorithmic language Scheme. HOSC, 11(1):7–105, August 1998. 1999, § 10]. But in practice, most of XSLT processors implement this feature only partially, and the way to design new order rela- Leslie Lamport. LATEX: A Document Preparation System. User’s tions, if need be, is unspecified by the W3C10 recommendation as Guide and Reference Manual. Addison-Wesley Publishing well as the documentation of these processors. Designers of bibli- Company, Reading, Massachusetts, 1994. ography styles for MlBIBTEX can use order relations by means of Frank Mittelbach, Michel Goossens, Johannes Braams, David an element analogous to XSLT’s [see Hufflen, 2005a]. But as shown Carlisle, Chris A. Rowley, Christine Detig, and Joachim Schrod. in § 4, only basic knowledge of Scheme is needed for people inter- The LATEX Companion. Addison-Wesley Publishing Company, ested in enlarging MlBIBTEX by new relations. Reading, Massachusetts, 2 edition, August 2004. When MlBIBTEX’s first experimental versions were launched, Oren Patashnik. BIBTEXing. Part of the BIBTEX distribution, there was only an order relation implementing the default collation February 1988. 11 algorithm roughly . In parallel, we developed our order relation Simon Peyton Jones, editor. Haskell 98 Language and Libraries. generator. That aimed to ask some people for tests about their own The Revised Report. Cambridge University Press, April 2003. language. We do not forget that natural languages are an open

Manuel Serrano. BigÐÓ Ó. A Practical Scheme Compiler. User domain, that is, it is difficult to establish general rules that may fail Manual for Version 2.9a, December 2006. on particular cases since the features of natural languages are very diverse. So we consider our present work as a first version subject Olin Shivers. List Library. to changes when we explore other languages or get criticisms , October 1999. from end-users. But until now, feedback has been good, and as a Michael Sperber, William Clinger, R. Kent Dybvig, Matthew consequence, our order relation generator has been integrated into Flatt, Anton van Straaten, Richard Kelsey, and Jonathan Rees. 5.97 MlBIBTEX’s first public version. Revised Report on the Algorithmic Language Scheme —

Standard Libraries. , June 2007. Acknowledgements Guy Lewis Steele, Jr., Scott E. Fahlman, Richard P. Gabriel, David A. Moon, Daniel L. Weinreb, Daniel Gureasko Bobrow, Many testers of MlBIBTEX helped me fix some errors concerning Linda G. DeMichiel, Sonya E. Keene, Gregor Kiczales, Crispin various languages: thank to them. I am also grateful to the referres, Perdue, Kent M. Pitman, Richard Waters, and Jon L White. who suggested me some improvements. COMMON LISP. The Language. Second Edition. Digital Press, 1990. References Unicode Collation Algorithm. The UNICODE CONSORTIUM,

Matthew Flatt. PLT MzScheme: Language Manual. Version , July 2006a. Unicode Technical Standard #10. 370.

, May 2007. Unicode Normalization Forms. The UNICODE CONSORTIUM,

, October 2006b. Uni- Matthew Flatt and Marc Feeley. R6RS Unicode Data. code Standard Annex #15. , July 2005. Edward Fredkin. Trie memory. Communications of the ACM, 3(9): The UNICODE CONSORTIUM. The Unicode Standard Version 5.0. 490–499, September 1960. Addison-Wesley, November 2006.

Chris Hanson, the MIT Scheme team, et al. MIT Scheme Reference W3C. XSL Transformations (XSLT). Version 1.0. Manual. Massachusetts Institute of Technology, 1.96 edition, , November March 2002. 1999. W3C Recommendation. Edited by James Clark. Jean-Michel Hufflen. Bibliography styles easier with MlBIBTEX. In Proc. EuroTEX 2005, pages 179–192, Pont-à Mousson, France, March 2005a. Jean-Michel Hufflen. Implementing a bibliography processor in Scheme. In J. Michael Ashley and Michel Sperber, editors, Proc.

8 eXtensible Language Stylesheet Transformations. 9 eXtensible Markup Language. 10 World Wide Web Consortium. 11 That was the case for the version described in [Hufflen, 2005b].

144 Scheme and Functional Programming 2007