Managing Order Relations in MlBibTeX
Jean-Michel Hufflen
To cite this version: Jean-Michel Hufflen. Managing Order Relations in mlBibTeX. TUGB, 2007, pp.101–108. hal-00644465
HAL Id: hal-00644465
https://hal.archives-ouvertes.fr/hal-00644465
Submitted on 24 Nov 2011

TUGboat, Volume 0 (2060), No. 0, Proceedings of the 2060 Annual Meeting

Jean-Michel HUFFLEN
LIFC (EA CNRS 4157)
University of Franche-Comté
16, route de Gray
25030 BESANÇON CEDEX
FRANCE
[email protected]
http://lifc.univ-fcomte.fr/~hufflen

Title in Polish: Zarządzanie zasadami sortowania leksykograficznego w MlBibTeX-u.

Abstract

Lexicographical order relations used within dictionaries are language-dependent. First, we describe the problems this causes for the automatic generation of multilingual bibliographies. Second, we explain how these problems are handled within MlBibTeX. To add or update an order relation for a particular natural language, we have to program in Scheme, but we show that MlBibTeX's environment eases this task as far as possible.

Keywords

Lexicographical order relations, dictionaries, bibliographies, collation algorithm, Unicode, MlBibTeX, Scheme.

Streszczenie (translated from Polish)

Lexicographic order in dictionaries is language-dependent. First, we discuss the problems that arise in the automatic generation of multilingual bibliographies. Next, we explain how they are handled in MlBibTeX. Adding or updating the sorting rules for a particular natural language is done through a program written in Scheme. We show how much MlBibTeX's environment eases this task.

Słowa kluczowe (keywords): Lexicographic sorting rules, dictionaries, bibliographies, lexicographic sorting algorithms, Unicode, MlBibTeX, Scheme.

0 Introduction

Looking for a word in a dictionary or for a name in a phone book is a common task. We have long been used to lexicographic order. More precisely, we are used to our own lexicographic order, because it belongs to our cultural background. It depends on the language.

This problem is particularly acute when managing multilingual bibliographies, as in our program MlBibTeX (for 'MultiLingual BibTeX'). Let us recall that this program aims to be a 'better' BibTeX [15], the bibliography processor usually associated with the LaTeX word processor [12]. When it builds a 'References' section for a LaTeX document, BibTeX uses a bibliography style ruling the layout of this section. Some bibliography styles are unsorted, that is, the order of bibliographical items within the bibliography is the order of first citations of these items throughout the document. However, most of BibTeX's styles sort these items by the alphabetical order of authors' names. But the bst language of bibliography styles [14] only provides a SORT function [13, Table 13.7] suitable for the English language: the commands for accents and other diacritical signs are ignored by this sort operation.

The purpose of this article is to show how this problem is solved in MlBibTeX's first public version. In practice, this version only deals with European languages using the Latin alphabet. Besides, the MlBibTeX program is written in the Scheme programming language [10]. Some implementations provide partial support for Unicode [22], and a proposal for a future standardised library has been established [18, §§ 1.1 & 1.2], but we cannot say that the present version of Scheme is Unicode-compliant.¹ So some parts of our present implementation of order relations are temporary, but we think that this implementation could easily be updated for future Unicode-compliant versions.

¹ At the time of writing the revised version of this article, the proposal for Scheme's next standard [19, 18] has been submitted for ratification. See http://www.r6rs.org for more details.

In the first section, we show how diverse the lexicographic orders used throughout European countries are. This allows readers to estimate this diversity and to realise the complexity of the task. We also explain why the problem becomes more difficult within the framework of bibliographies. Then we show how order relations operate in MlBibTeX and how they are built. We also give some details about the common points and differences between xindy [13, § 11.3] and MlBibTeX, since both programs use multilingual order relations.

Reading this article does not require advanced knowledge of Scheme;² in fact, we think that a non-programmer should be able to specify a new order relation. We give more technical details in an annex, for users who would like to experiment further themselves. In particular, we explain how to deal with languages using the Latin 2 encoding, even though our implementation is based on Latin 1.

² Readers can refer to [20] for an introductory book about Scheme.

1 European languages and lexicographic orders

Figure 1 gives an idea of the diversity of order relations used throughout some European countries. In this figure, 'a < b' denotes that the words beginning with a are less than the words beginning with b, whereas 'a ∼ b' expresses that the letters a and b are interleaved, except that a takes precedence over b if two words differ only by these two letters.

Roughly speaking, there are two families of languages, if we consider the associated lexicographic orders. In some languages, accented letters are fully viewed as 'real' letters, distinct from unaccented ones: examples are given by Slavonic languages. In other languages, accented letters are processed as if there were no accent. The precedence of an unaccented letter over an accented one is not managed the same way everywhere: it follows a left-to-right order in Irish, Italian, and Portuguese, and a right-to-left order in French. The Estonian language 'mixes' the two approaches: some accented letters ('õ', 'ä') are alphabeticised, some ('š', 'ž') are interleaved. Last, some letter groups may be viewed as a single letter and alphabeticised as another letter. For example, the Hungarian words beginning with 'cs' are alphabeticised separately from the words beginning with 'c'. In fact, the 'c-' entry, in a Hungarian dictionary, contains words beginning with 'c' and not with 'cs'. The 'c-' entry is followed by the 'cs-' entry, before the 'd-' entry.

In any case, it clearly appears that there cannot be a universal order encompassing all lexicographic orders. Besides, these orders aim to classify the words of a dictionary, that is, common words belonging to a language, even if some dictionaries may include some proper names. When bibliographies are generated, order relations are used to sort bibliographical items, most often by authors' names. These names may be 'foreign' proper names with respect to the language used for the bibliography. So names can include characters outside of this language's alphabet. As a consequence, an order relation for sorting a bibliography should be able to deal with any letter, since any letter may appear in foreign names. A good choice is to associate such a foreign letter with a letter belonging to the 'basic' Latin alphabet, so this foreign letter is interleaved with this basic letter, which takes precedence over the foreign letter if two words differ only by these two letters. For the English language, this means that accented letters are interleaved with unaccented letters, but unaccented letters take precedence. Most implementations of order relations proceed this way.

Unicode provides a default algorithm to sort all its characters. This algorithm is based on a sort key table, DUCET³ [23]. It is also based on a decomposition property for composite characters. For example, the 'ô' letter, whose name and code point (given using hexadecimal numbers) are:

LATIN SMALL LETTER O WITH CIRCUMFLEX, U+00F4

can be decomposed into these 'simpler' characters:

LATIN SMALL LETTER O, U+006F
COMBINING CIRCUMFLEX ACCENT, U+0302

³ Default Unicode Collation Element Table.

The sort algorithm requires several passes. Roughly, weight information, given by sort keys, is associated with each string. Then this information is re-arranged according to sort levels: with respect to letters, with respect to accents, etc. Finally, a binary comparison between bytes is done, level by level, until the two strings can be distinguished. This algorithm can be refined for a particular language by using a specialised sort key table, possibly including sort keys for accented letters and for digraphs viewed as single letters.

This modus operandi would be difficult to put into action within MlBibTeX. First, we do not have

• The Czech alphabet is: a < b < c < č < d < ... < h < ch < i < ... < r < ř < s < š < t < ... < z < ž.
• In Danish, accented letters are grouped at the end of the alphabet: a < ... < z < æ < ø < å.
• The Estonian language does not use the same order for unaccented letters as the usual Latin order; in addition, accented letters are either inserted into the alphabet or alphabeticised like the corresponding unaccented letter: a < ... < s ∼ š < z ∼ ž < t < ... < w < õ < ä < ö < ü < x < y.
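To make the difference between '<' and '∼' concrete, the Estonian order above could be encoded as a two-level sort key. The following is only an illustrative Python sketch, not MlBibTeX's Scheme implementation: the names PRIMARY, INTERLEAVED and estonian_key are hypothetical, the elided ranges ('...') are assumed to follow the usual Latin order, and letters outside the table are assumed to sort after all known letters.

```python
import unicodedata

# Primary letters in the order given for Estonian in Figure 1.
# Assumption: the elided ranges ("...") follow the usual Latin order.
PRIMARY = (list("abcdefghijklmnopqrs") + ["z"] + list("tuvw")
           + ["õ", "ä", "ö", "ü", "x", "y"])
RANK = {letter: i for i, letter in enumerate(PRIMARY)}

# Interleaved letters (s ∼ š, z ∼ ž): they share the primary rank of their
# base letter and only differ at a secondary level.
INTERLEAVED = {"š": "s", "ž": "z"}


def estonian_key(word):
    """Return a (primary, secondary) key for the sketched Estonian order."""
    primary, secondary = [], []
    # Use precomposed characters so that 'š' is one character, not 's' + caron.
    for ch in unicodedata.normalize("NFC", word.lower()):
        base = INTERLEAVED.get(ch, ch)
        # Assumption: unknown letters sort after the table, by code point.
        primary.append(RANK.get(base, len(PRIMARY) + ord(base)))
        # The base letter (weight 0) wins over its interleaved variant (1)
        # when two words differ only by these two letters.
        secondary.append(0 if base == ch else 1)
    return (primary, secondary)


words = ["zebra", "tuba", "šokk", "sokk", "öö", "õun", "xero"]
print(sorted(words, key=estonian_key))
# → ['sokk', 'šokk', 'zebra', 'tuba', 'õun', 'öö', 'xero']
```

Comparing the primary lists first and the secondary lists only on ties gives the interleaving: 'šaa' sorts before 'sab', whereas 'õ', which has its own primary rank, always sorts after 'w'.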