The Inflectional Morphologies of the Swedish Noun, the Swedish Verb and the English Verb
Total Page:16
File Type:pdf, Size:1020Kb
Alfred Holl Sara Maroldo Reinhard Urban The Inflectional Morphologies of the Swedish Noun, the Swedish Verb and the English Verb Reverse Dictionaries Based upon Data Mining Methods Växjö University Press Series: Mathematical Modeling in Physics, Engineering and Cognitive Sciences, v. 12 This research project was sponsored by Staedtler-Stiftung, Nuremberg, Germany Introduction I.3 Preface of the Senior Author This book joins linguistic issues with IT methods, in particular the structural examination of inflectional-morphological systems with methods of automatic data analysis (data mining). With our research results, we address computer scientists and (computer) linguists, as well as language learners and teachers looking for advanced, didactic material and deepened insights into the structures of language. In order to keep the text understandable for researchers in the areas of linguistics and computer science, the specific terminologies are reduced to the inevitable degree. My research on this subject started with my PhD thesis (Holl 1988) on the verbal systems of Latin and Neo-Latin languages. This was conducted with manual analyses because, in the middle of the 1980s, the storage capacity of PCs was still pretty small. Later on, I published some papers regarding Swedish and Neo-Latin languages (Holl 2001, 2002, 2003). In the years from 2003 to 2006, two major research projects, both using modern information technology, were sponsored by the Bavarian government and by the Staedtler- Stiftung, Nuremberg, Germany: a morphological dictionary of the Russian and German verbs (Holl / Behrschmidt / Kühn 2004) and a morphological dictionary of the Ancient and New Greek verbs (Holl / Pavlidis / Urban 2005). On this basis, two further research projects were launched at the Department of Information Systems and Computer Science at the University of Applied Sciences of Nuremberg, Germany, in 2006. The first project analyzed the English verbal system. The second one examined, for the first time with the means of modern IT, a nominal system: the inflectional morphology of the Swedish noun. I, myself, was the project leader and had the responsibility for the adequate use of scientific methods in both of the two projects whose results are published in this book. The first study is the outcome of a master’s thesis at the University of Applied Sciences of Nuremberg: Sara Maroldo examined the English verbal 3 I.4 Introduction morphology on the basis of 5,700 verbs (Part III). Staedtler-Stiftung; Nuremberg, provided the financial basis for the second study: Reinhard Urban examined the Swedish nominal morphology on the basis of 25,000 nouns (Part IV). Part V completes the research results on Swedish inflectional morphology. It is a reprint of a paper on Swedish verbal morphology which I wrote already in 1999 using slightly different methods for analysis and presentation than today. An appendix with a presentation using the methods of this book is added. The original paper was published under the title “The inflectional morphology of the Swedish verb with respect to reverse order: analogy, pattern verbs and their key forms“ in the journal Arkiv för Nordisk Filolgi 116 (2001) 193-220. The linguistic material examined in this book can be found on my homepage http://www.informatik.fh-nuernberg.de/professors/holl/Personal/HollHome.htm in the form of Excel files. The introduction (Part I) and the presentation of data mining principles independent of individual languages and parts of speech (Part II) are revised English translations by Sara Maroldo from Holl / Behrschmidt / Kühn 2004 and Holl / Pavlidis / Urban 2006. They are necessary for a comprehensive understanding of the English and Swedish parts. Part II contains examples from English verb morphology, in order to demonstrate our data mining methods for readers without a good command of Swedish as well. I would like to express my acknowledgments to • Patricia Brockmann, English native speaker, University of Applied Sciences at Nuremberg for checking the English part • Göran Hallberg, editor of Arkiv för Nordisk Filologi, University of Lund, Sweden, for permission to include a reprint of my paper on Swedish verb morphology in this book 4 Introduction I.5 • Andrei Khrennikov, University of Växjö, for accepting this book as Volume 12 in his series Mathematical Modeling in Physics, Engineering and Cognitive Sciences • Staedtler-Stiftung, Nuremberg, Germany, for its generous sponsoring; otherwise this research project would not have been possible • the School of Mathematics and Computer Science (Matematiska och Systemtekniska Institutionen), University of Växjö, Sweden, for sponsoring the costs of publication. Växjö / Sweden, Nuremberg / Germany, March 2007 Alfred Holl 5 I.6 Introduction 6 Introduction I.7 Contents PREFACE OF THE SENIOR AUTHOR.................................................I.3 CONTENTS............................................................................................I.7 PART I. INTRODUCTION ....................................................................I.13 1. MOTIVATION...................................................................................I.15 2. METHOD..........................................................................................I.16 2.1 BASICS OF THE DATA ANALYSIS...................................................................I.16 2.2 PRE-PROCESSING, PROCESSING AND POST-PROCESSING OF THE DATA ANALYSIS .....................................................................................................I.20 3. STRUCTURE ...................................................................................I.21 PART II. PARTS OF DATA MINING INDEPENDENT OF INDIVIDUAL LANGUAGES ............................................................II.1 1. PRE-PROCESSING OF THE DATA ANALYSIS..............................II.3 1.1 REPRENSENTATION OF LINGUISTIC OBJECTS.................................................II.4 1.2 MODIFIED ORTHOGRAPHIC CONVENTIONS....................................................II.5 1.2.1 Treatment of the beginning of lexical bases ............................................II.5 1.2.2 Prefix treatment .......................................................................................II.5 1.3 GATHERING OF LINGUISTIC DATA .............................................................. II.12 1.3.1 Gathering of lexemes ............................................................................ II.12 1.3.2 Recording of key forms ......................................................................... II.13 1.4 DELIMITATIONS.......................................................................................... II.14 1.4.1 Delimitations in comparison to the syntax ........................................... II.14 1.4.2 Delimitations in comparison to the lexicon .......................................... II.15 7 I.8 Introduction 1.5 KEY FEATURES AND INFLECTION TYPES .................................................... II.16 1.5.1 Key features........................................................................................... II.16 1.5.2 Inflection types ...................................................................................... II.17 1.6 DERIVATION RULES AND EXCEPTIONS ....................................................... II.18 1.6.1 Derivation rules .................................................................................... II.19 1.6.2 Exceptions ............................................................................................. II.19 1.7 LEXEMES WITH TWO INFLECTION TYPES.................................................... II.20 1.7.1 Interpretation as two lexemes ............................................................... II.20 1.7.2 Interpretation as one lexeme................................................................. II.22 2. PROCESSING OF THE DATA ANALYSIS ....................................II.23 2.1 TECHNICAL REQUIREMENTS....................................................................... II.23 2.2 THE DATA MINING CONCEPT ...................................................................... II.23 2.2.1 Preparation - sorting algorithm............................................................ II.24 2.2.2 Data structure and algorithm of the data mining concept.................... II.27 3. POST-PROCESSING OF THE DATA ANALYSIS .........................II.32 3.1 TYPOGRAPHIC MARKING............................................................................ II.33 3.2 REDUCTIONS .............................................................................................. II.33 3.2.1 Formalizable reductions ....................................................................... II.33 3.2.2 Non-formalizable reductions ................................................................ II.39 3.3 USE OF THE RESULTING LEXEME REGISTER................................................ II.39 3.3.1 Assigning an arbitrary lexeme to its paradigm cluster......................... II.39 3.3.2 Gaining linguistic information from the lexeme register...................... II.41 PART III. DATA MINING OF THE INFLECTIONAL- MORPHOLOGICAL SYSTEM OF THE ENGLISH VERB..............III.1 1. PRE-PROCESSING OF THE DATA ANALYSIS.............................III.3 1.1 REPRENSENTATION OF LINGUISTIC OBJECTS................................................III.3 1.2 MODIFIED ORTHOGRAPHIC CONVENTIONS...................................................III.3 1.2.1 Treatment of the beginning of lexical bases ...........................................III.4