The Lexicon Graph Model a Generic Model for Multimodal Lexicon Development
Total Page:16
File Type:pdf, Size:1020Kb
Thorsten Trippel The Lexicon Graph Model A generic model for multimodal lexicon development Universität Bielefeld Fakultät für Linguistik und Literaturwissenschaft Printed on ageing resistant paper according to ISO 9706 Gedruckt auf alterungsbeständigem Papier nach ISO 9706 Published as a PhD thesis at Bielefeld University, Germany Veröffentlicht als Dissertation an der Universität Bielefeld, Deutschland This work has been published with the same pagination up to the appendix (p. 255) in a book by the same title published by: AQ-Verlag, Saarbrücken, ISBN 978-3-922441-90-8 Additional resources can be found at http://www.lexicongraph.de Thorsten Trippel 2006 To Christel Trippel † Preface At the time of starting my studies, I never thought of writing about lexicons some day. I never imagined that someday I would look at lexicons, or analyze their structures and think of a suitable model for them. Now, a bit later, I can look back on my way towards this field of interest and see, that I really enjoyed the trip into the study of lexicons. A word of thanks Obviously a work such as this would hardly be possible without strong sup- port from many different individuals and groups. I would like to say thank you to those who have helped by giving me some form of encouragement, feedback or advice along the way. I would also like to mention some explic- itly. Firstly there are my parents Erwin and Christel, the latter not seeing the completion of this work while dwelling on this earth, but I am sure she is aware of my gratitude in the place she resides now. I am grateful to my sis- ter Birgit, who sometimes experienced my impatience when facing another deadline. A special thanks to the Wächter family, my aunt and uncle, with my cousins, who were always close and offering words of encouragement. Secondly there is my ‘university family’, which extends to the Computa- tional Linguistics and Spoken language working group at Bielefeld Univer- sity and the colleagues of the research group Text Technological Modeling of Information of the Deutsche Forschungsgemeindschaft (DFG) at different locations. It has been great fun working with you— at least most of the time viii — and I hope there is more to come. I owe special thanks to: Dafydd Gib- bon for the opportunity, and for supervising me and — besides his words of encouragement and criticism — offering a variety of new ideas, insights and perspectives; Ben Hell who has been around for almost the whole time of my research, for giving numerous pieces of advice in the computational field and even discussing trivialities such as running programs remotely; Kathrin Retzlaff for her support on the administrative and human side of the work- ing group. The people working also in the fields of multiple modalities, I have to thank, esp. Alexandra Thies, and the former members of the work- ing group Karin Looks and Ulrike Gut. Others have been working on vari- ous aspects of the corpora, and I owe them my gratitude for providing me with material: Maren Kleimann, Morten Hunke, Sophie Salffner, and San- drine Adouakou. In the computational linguistics working group, I owe spe- cial thanks to Alexander Mehler for his encouragement and interesting dis- cussions, providing me with additional ideas and and insights; Jan-Torsten Milde for the TASX-annotator and cooperation in creating the TASX format; Felix Sasaki, who not only enlightened me with language acquisition stud- ies based on a survey of his children; Andreas Witt who constantly made me think of the application of data vs. document centered data modeling. A good ‘kickoff’ was provided by my former colleagues in the final phase of the EA- GLES and Verbmobil projects, who contributed significantly to my decision to start with this project, namely Silke Kölsch, Inge Mertins-Obbelode and Harald Lüngen. Thirdly there are my friends all over the world, representing different dis- ciplines and backgrounds, who have helped me to reflect upon new ideas. I won’t name all of you here, most of you would not even like to be thanked in such a place. However, thank you. Fourthly — and I should probably stop before listing a phone book — my teachers, educators, and youth leaders who helped me develop my personality and interests. I hope you are aware of the profound impact you have had on me. Editorial feedback and comments on different issues related to the manuscript were provided by Nigel Battye, Dafydd Gibbon, Mark Math- ias, Alexander Mehler, Felix Sasaki, Alexandra Thies, and Daniela Wächter. Thank you for your time and efforts. I owe additional gratitude to my pub- lisher Erwin Stegentritt and the series editor Guido Drexel for their valuable feedback, recommendations and patience. At the end, preparing a manuscript always takes longer than expected. ix To all of you, I wish to personally express my gratitude and the way I feel. Although a lot of people contributed in one way or another, any remaining errors, flaws or gaps are solely my responsibility. This work includes research funded by the Deutsche Forschungsgemein- schaft (DFG) within the Forschergruppe Texttechnologische Informations- modellierung, as well as various other projects such as the BMB+F funded project Verbmobil in the Bielefeld lexicon working group, the Volkswagen Foundation project DOBES in the EGA sub-project, and in Probral project by the Deutscher Akademischer Austausch Dienst (DAAD). All of them pro- vided funding and research opportunities and collaboration opportunities that influenced this project. Software AG (Darmstadt), provided a licence of the Tamino XML database, that was used for some parts of this project. Technical remarks In many scientific articles and books, there is a clear distinction between lit- erature found in the bibliography, software that is mentioned — possibly with reference to the vendor — and standards. However, these resources are treated equally in this work, i.e. the bibliography is a list of resources cited, including material from the web, software and corpora, but also literature. This deci- sion was made based on the use of resources and standards which appear in the same way in the discourse as the literature, although the status may be different. For volatile resources, the date of the last check was mentioned. If the document has since changed, the newest available version was reflected in December 2006. A reference version of this book was handed in at Universität Bielefeld, Fakultät für Linguistik und Literaturwissenschaft as my dissertation. Contents Preface Part I The formal structure of the lexicon 1 Introduction ............................................. 25 1.1 The problem of the lexicon . 26 1.2 Methodology . 27 2 Describing different lexicons .............................. 29 2.1 Objective . 30 2.1.1 Lexicography for the human user . 30 2.1.2 Lexicography for system use . 31 2.1.3 Combining approaches . 32 2.1.4 Requirements for a lexicon . 33 2.1.5 Excursus: Portability of language resources . 36 2.1.6 Seven requirements for modern lexicography . 39 2.2 Lexicon Structure Description . 40 2.2.1 Microstructure . 42 2.2.2 Mesostructure . 43 2.2.3 Macrostructure . 43 2.2.4 Summary of lexicon structure descriptions . 46 2.3 Analysis of structures of existing lexicons . 46 2.3.1 Overview of lexicons . 47 2.3.2 Monolingual Dictionary: Oxford Advanced Learners Dictionary . 51 xii Contents 2.3.3 Bilingual Dictionary . 52 2.3.4 The case of a lexicon including different writing systems . 54 2.3.5 Orthography Dictionary: Duden . 56 2.3.6 Pronunciation Dictionary . 58 2.3.7 Thesaurus . 59 2.3.8 Monolingual electronic lexicon on the web . 60 2.3.9 Bilingual electronic lexicon on the web . 62 2.3.10 Terminology lexicon: Eurodicautom . 63 2.3.11 Computer readable glossary: The Babylon GLS format 66 2.3.12 Wordlists as lexicons . 68 2.3.13 Computer readable semantic lexicons: the Multilingual ISLE Lexical Entry . 70 2.3.14 Machine Readable Terminology Interchange Format (MARTIF) . 73 2.3.15 Integrated corpus and lexicon toolkit: the Field Linguist’s Toolbox . 74 2.3.16 Verbmobil lexicon . 75 2.3.17 Speech synthesis lexicons . 78 2.3.18 Lexicon for part-of-speech tagging and automatic phonemic transcription . 79 2.3.19 Feature structure lexicons . 81 2.3.20 The Generative Lexicon . 85 2.3.21 Summary of the lexicon analysis . 88 2.4 Summary . 93 3 Lexicon Graph ........................................... 95 3.1 Pending issues in lexicography . 95 3.1.1 Combining lexical resources . 96 3.1.2 Headword selection . 98 3.1.3 Single source lexicon publishing . 99 3.1.4 Inference in a lexicon: using knowledge . 100 3.1.5 Duplication of lexical information as a problem of maintenance . 100 3.1.6 Consistency of a lexicon and preserving corpus information . 101 3.1.7 Including non-textual and other special data . 102 3.2 A strategy for the definition of a generic lexicon . 102 3.2.1 Microstructure in the Lexicon Graph view . 105 3.2.1.1 Representation of lexical items . 105 Contents xiii 3.2.1.2 Representation of lexical relations . 107 3.2.1.3 Typing lexical relations . 108 3.2.1.4 Resulting structure of the microstructure representation . 109 3.2.2 The mesostructure in the Lexicon Graph view . 110 3.2.3 The macrostructure in the Lexicon Graph view . 110 3.2.4 Summary: Lexicon Graphs as a generic model for lexicons . 111 3.3 Lexicon format . 111 3.3.1 Lexicon Graph element definition . 113 3.3.2 Specifying the elements . 114 3.3.3 Typing items and knowledge . 116 3.3.4 Integration of metadata in a Lexicon Graph . 117 3.4 Representing existing lexicons . 119 3.4.1 A system driving lexicon: The Verbmobil lexicon .