COLORADO TECHNICAL UNIVERSITY
ENABLING A LEGACY MORPHOLOGICAL PARSER TO USE DATR-BASED LEXICONS
VOLUME 1 of 2
A DISSERTATION SUBMITTED TO THE GRADUATE COUNCIL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF COMPUTER SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
BY
MICHAEL A. COLBURN
M.C.I.S., University of Denver, Denver, CO, 1991
M.A., University of Texas, Arlington, TX, 1981
B.A., Rockmont College, Denver, CO, 1975
DENVER, COLORADO DECEMBER 1999
ENABLING A LEGACY MORPHOLOGICAL PARSER
TO USE DATR-BASED LEXICONS
BY
MICHAEL A. COLBURN
THE DISSERTATION IS APPROVED
______Dr. Carol Keene, Committee Chair
______Dr. H. Andrew Black
______Dr. Jay Rothman
______Date Approved
ABSTRACT
This research is an investigation of the feasibility and desirability of using DATR with AMPLE, a legacy morphology exploration tool developed by the Summer Institute of
Linguistics. The research demonstrates the feasibility of using DATR with AMPLE by showing how DATR can be used to encode the lexical information required by AMPLE and by presenting an interface between AMPLE and DATR-based lexicons that does not require modification of AMPLE itself. The desirability of using DATR with AMPLE is shown by demonstrating that there are generalizations that cannot be captured using AMPLE’s lexical knowledge representation language (LKRL) that can be captured in DATR, thereby reducing redundancy of lexical information.
The research demonstrated the truth of four hypotheses concerning the feasibility and desirability of this approach. In order to prove these hypotheses, the author analyzed the morphophonology and morphotactics of a Papuan language, Ogea, and built a comprehensive AMPLE unified database lexicon that supports the parsing of every example found in the author's write-up of Ogea. Next, the author developed a DATR-based version of the lexicon, as well as an interface to generate the
AMPLE lexicon from DATR. It was found that linguistic generalizations not possible in
AMPLE's LKRL could be made in DATR. Through these generalizations, redundancy in the
Ogea AMPLE lexicon was reduced by 64% in the DATR version of the same lexicon. The validity of the interface from the DATR lexicon to the AMPLE lexicon was verified programmatically. A portion of an AMPLE root database file for a second language, Yalálag
Zapotec, was also analyzed. It was demonstrated that for that language also, generalizations not possible to make with the AMPLE LKRL could be made in DATR. Redundancy in the
Yalálag Zapotec subset AMPLE lexicon was reduced by nearly 58% in the DATR version.
ACKNOWLEDGEMENTS
We all stand on the shoulders of those who went before us. In that sense, the list of acknowledgements would be quite long. However, this research was made possible by the following people: my employer, who allowed my work schedule to be flexible; Kristin and
Gabriel, who put up with a preoccupied father for several years; my wife, Lisa, who helped with proofreading and offered moral support; Kelebai Iriwai, who over the years has tirelessly helped me to understand his language, Ogea; members of the Papua New Guinea branch of the Summer Institute of Linguistics who shared their insights into Papuan linguistics and phonology to help me in my analysis of Ogea: Cindy Farr, Dotty James, Sjakk and Jacqueline van Kleef, Denise Potts, Ger Reesink, and others; Dr. Donald Burquest, who first encouraged me to pursue doctoral work; and the members of my committee for their insightful comments and encouragement, especially Dr. H. Andrew Black, due to his expertise in linguistics and AMPLE.
Of course, this author takes full responsibility for the content presented here, including decisions made regarding the analysis of Ogea morphophonemics and morphotactics.
TABLE OF CONTENTS
ABSTRACT...... III
ACKNOWLEDGEMENTS...... IV
TABLE OF CONTENTS...... V
LIST OF FIGURES...... VIII
LIST OF TABLES ...... IX
LIST OF ABBREVIATIONS...... X
LIST OF ACRONYMS ...... XI
CHAPTER 1 INTRODUCTION...... 1
1.1 Problem Background ...... 3
1.1.1 What is Morphology? ...... 4
1.1.2 What are Morphological Parsers? ...... 5
1.1.3 What is the Purpose of Lexicons? ...... 6
1.1.4 What is AMPLE? How Does it Work?...... 8
1.1.5 What is DATR? How Might it Benefit AMPLE? ...... 9
1.2 Problem Statement...... 10
1.3 Overview of the Research ...... 11
1.3.1 Purpose of the Research...... 11
1.3.2 Significance...... 11
1.3.3 Research Hypotheses ...... 13
1.3.4 Proof Criteria...... 15
1.3.5 Methodology ...... 16
1.4 Summary ...... 17
CHAPTER 2 ANALYSIS OF THE LITERATURE...... 19
2.1 Introduction ...... 19
2.2 The Morphological Parser...... 19
2.2.1 Morphology...... 19
2.2.2 The Role of the Morphological Parser in NLP ...... 26
2.2.3 The AMPLE Morphological Parser...... 30
2.3 The Lexicon...... 47
2.3.1 The Role of the Lexicon in Linguistic Theory...... 47
2.3.2 Inheritance and the Lexicon...... 57
2.3.3 The Role of the Lexicon in NLP ...... 62
2.3.4 Lexical Knowledge Representation with DATR...... 64
2.3.5 The DATR Literature...... 79
2.3.6 Implementations of DATR...... 96
2.4 Related Research ...... 97
2.5 Summary ...... 100
CHAPTER 3 USING DATR WITH AMPLE ...... 102
3.1 Introduction ...... 102
3.2 Foundational Work ...... 102
3.3 Managing Name Space ...... 104
3.4 Encoding AMPLE Lexical Data in DATR ...... 105
3.5 Capturing Generalizations and Reducing Redundancy with DATR ...... 114
3.5.1 The Use of Abstract Lexical Entries...... 114
3.5.2 The Use of Variable-like Constructs ...... 132
3.5.3 The Use of Morphophonemic Rules...... 137
3.5.4 Reduction of Redundancy...... 147
3.5.5 Summary ...... 151
3.6 An Interface Between DATR and AMPLE ...... 151
3.6.1 Description of the Interface...... 152
3.6.2 Executing the Interface ...... 156
3.6.3 Verification of the Interface...... 157
3.7 Generating Other Lexicons from DATR ...... 158
3.8 Handling Many-to-Many and One-to-Many Relationships ...... 159
3.9 Some Issues and Questions ...... 161
3.9.1 Factors Affecting the Decrease of Redundancy...... 161
3.9.2 Ease of Use Issues ...... 161
3.9.3 General Versus Linguistically Motivated Abstractions...... 162
3.10 Summary ...... 165
CHAPTER 4 CONCLUSION...... 166
4.1 Research Results...... 166
4.2 Future Research ...... 167
4.3 Final Thoughts...... 169
BIBLIOGRAPHY ...... 170
LIST OF FIGURES
FIGURE 1. TREE REPRESENTATION OF PHRASE STRUCTURE RULES ...... 49
FIGURE 2. ATTRIBUTE VALUE MATRIX FOR ‘SHE’ ...... 51
FIGURE 3. DIRECTED ACYCLIC GRAPH FOR 'UTHER'...... 56
FIGURE 4. MONOTONIC SINGLE INHERITANCE...... 58
FIGURE 5. MONOTONIC MULTIPLE INHERITANCE...... 59
FIGURE 6. NON-MONOTONIC SINGLE INHERITANCE ...... 60
LIST OF TABLES
TABLE 1. AMPLE DICTIONARY FIELD TYPES...... 37
TABLE 2. HOW LOCAL AND GLOBAL CONTEXTS ARE UPDATED IN DATR ...... 73
TABLE 3. REDUCTION OF REDUNDANCY IN OGEA LEXICON BY USE OF DATR...... 149
TABLE 4. REDUCTION OF REDUNDANCY IN YALÁLAG ZAPOTEC LEXICON ...... 150
LIST OF ABBREVIATIONS
1 First person
2 Second person
3 Third person
B Benefactive
cat Category
cert Certainty
C Consonants
Contra Contrafactive
d Dual
Hab Habitual
imp Imperative
int Intensive
inf Infinitive
Nom Nominative
O Object
p Plural (note: a number after the p indicates the 1st or 2nd type of plural)
P Possessive
perf Perfective
pred Predictive
s Singular
S Subject
Sj Subjunctive
SR Switch Referent
syn Syntactic
Tf Future Tense
Tip Immediate Past Tense
Tp Present Tense
Trp Remote Past Tense
TO Temporal Overlap
TS Temporal Succession
vAux Verbal Auxiliary
vAuxR Verbal Auxiliary Root
V Vowels
LIST OF ACRONYMS
AVM Attribute-Value Matrix
BNF Backus-Naur Form
CARLA Computer-Aided Related Language Adaptation
DAG Directed Acyclic Graph
FSM Finite-State Machine
FT Final Test
GPSG Generalized Phrase Structure Grammar
HPSG Head-Driven Phrase Structure Grammar
IDE Integrated Development Environment
LFG Lexical Functional Grammar
LTAG Lexicalized Tree Adjoining Grammar
LKRL Lexical Knowledge Representation Language
MCC Morpheme Co-Occurrence Constraint
MEC Morpheme Environment Constraint
MT Machine Translation
NLP Natural Language Processing
NP Noun Phrase
PSG Phrase Structure Grammar
SEC String Environment Constraint
SIL Summer Institute of Linguistics
ST Successor Test
TIC Traffic Information Collator
VP Verb Phrase
CHAPTER 1
INTRODUCTION
Consider a computerized text-to-speech system that encounters the word boathouse1.
Armed only with the rule that the two-letter sequence th is a voiceless inter-dental fricative sound, the system would incorrectly pronounce boathouse as ‘bow-thous’, with ‘thous’ rhyming with the first part of the word thousand. But further armed with the knowledge that boathouse is actually composed of two components, boat + house, and that in such cases the t and the h are pronounced separately, rather than together, the system could correctly pronounce boathouse.
Consider searching the Internet for web pages containing information about turtles.
What if in response to the user’s request, the search engine only returned pages that literally contained the plural form turtles, while ignoring pages that contain the singular form, turtle?
Fortunately this is not the case. Query languages and search engines include components that are programmed to identify and strip off plural endings in order to normalize indexing and search terms, so pages with both the singular and plural forms of words will be returned independently of the actual form the user enters.
Consider a natural language processing (NLP) system designed to “understand” newspaper articles. Imagine it attempting to determine who hit whom in the sentence The boy the boys were hitting disappeared. Since both boy and boys are animate objects semantically capable of performing the action of hitting, semantics provides no clue in determining who hit whom in this sentence. However, by recognizing that boys in the boys is composed of two components, boy + s, indicating a plural noun, and by knowing that were in
were hitting requires a plural noun as subject, the NLP system could correctly identify the boys as the subject noun phrase of were hitting and avoid incorrectly identifying the boy as the subject noun phrase of were hitting.
What do these examples have in common? Though simplistic, they illustrate the usefulness of morphological parsers—software that can identify the individual components
(morphemes) that make up words.
In order to be useful, morphological parsers require access to information about the vocabulary of a natural language, typically contained in lexicons. Machine-readable lexicons play a crucial role in computational approaches to morphological parsing in particular and
NLP in general. Many modern linguistic theories make use of the unification of grammatical rules with lexical entries that contain complex features. This has led to grammatical rules that are simplified at the expense of increased complexity in the lexicon itself. To better manage this complexity, researchers have investigated the use of inheritance-based lexical knowledge representation languages (LKRLs). Inheritance is viewed as a means to improve the capturing of linguistic generalizations in the lexicon and to increase the maintainability of lexicons by eliminating or minimizing redundancy of information across lexical entries.
One outcome of such research is an LKRL called DATR [29, 30, 32]. Although
DATR is a proven language for representing lexical information for unification and feature-based approaches to grammar, there are existing morphological parsers that were developed based on previous linguistic traditions or that use ad hoc approaches. Such legacy parsers could potentially benefit from the use of an LKRL like DATR, if DATR is indeed capable of expressing the lexical information required by such parsers.
1 This example comes from [68:7]
This author’s research investigates the feasibility and desirability of using DATR
with AMPLE, a legacy morphology exploration tool developed by the Summer Institute of
Linguistics (SIL) [6, 9, 70], the world’s largest linguistic organization. This author’s research demonstrates the feasibility of using DATR with AMPLE by showing how DATR can be used to encode the lexical information required by AMPLE and by presenting an interface between AMPLE and DATR-based lexicons that does not require modification of
AMPLE itself. The desirability of using DATR with AMPLE is shown by demonstrating that there are generalizations that cannot be captured using AMPLE’s LKRL that can be captured in DATR, thereby reducing redundancy of lexical information.
In the remainder of the introduction, a background for the research problem is presented, followed by a statement of the research problem itself and the hypotheses upon which the research is based.
1.1 Problem Background
This section provides a background to the research problem by focusing on some
basic questions: What is morphology and why is morphology important? What are
morphological parsers and why are morphological parsers important? What is the purpose of
lexicons, and what are some problems with lexicons? What is AMPLE and how does it
work? What is DATR and how might it benefit AMPLE? Some knowledge of these topics
is required to understand the research problem and the research hypotheses. Relevant
literature concerning these topics is presented in Chapter Two, and extended examples of
AMPLE and DATR are provided in the appendices, using the Ogea language of Papua New
Guinea and Yalálag Zapotec, a language of Mexico.
1.1.1 What is Morphology?
Morphology is a sub-field of linguistics that studies word structure and word
formation. Although it is a sub-field, it is nonetheless a crucial one. Spencer and Zwicky
[67:1] assert that conceptually morphology is at the center of linguistics because
“…morphology is the study of word structure, and words are at the interface between
phonology, syntax, and semantics”.
One way to view words is that they are concatenated strings of morphemes. A morpheme is the minimal meaning bearing unit of human language. For example, laughing
is composed of two morphemes, laugh + -ing. Morphemes are classified as roots or affixes.
Roots bear the main meaning of a word, e.g., laugh in laughing. Affixes attach to roots and
fall into three major types: prefixes (e.g., un- in undo), suffixes (e.g., -ing in laughing), and
infixes2. Affixes can also be categorized as inflectional or derivational. Inflectional affixes
encode grammatical meaning, e.g., person, number, tense. Derivational affixes combine with
roots to form stems. They often change the grammatical class of the word. For example,
adding –al to the noun nation results in the adjective national, and adding –ize to national
changes it to nationalize, a verb.
Two general areas of morphology are morphophonemics3 and morphotactics.
Information about a language’s morphophonemics and morphotactics provides important
clues in the computational identification of morphemes.
Morphophonemics is the study of phonologically conditioned variants of morphemes
(allomorphs). For example, three allomorphs for the English plural suffix can be seen in
2 Infixes are discussed in Chapter Two.
examples (1), (2), and (3). The morpheme –s is realized as {-s} following voiceless consonants (e.g., t in bet), as {-z} following voiced consonants (e.g., d in bed), and as {-əz} following another s (e.g., following place4).
(1) bets {-s}
(2) beds {-z}
(3) places {-əz}
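Anticipating the discussion of DATR in section 1.1.5, allomorphy of this kind can be stated as a default with overrides. The following sketch is purely illustrative (the node and path names are this author's invention, not part of any published analysis), with the atom @z standing in for the schwa-initial allomorph {-əz}:

```datr
% Illustrative sketch: English plural allomorph selection keyed on
% the final segment of the root. '@z' stands in for {-@z} (schwa + z).
Plural:
    <>  == z           % default realization: voiced {-z}
    <t> == s           % voiceless segments take {-s}
    <p> == s
    <k> == s
    <s> == @z.         % sibilants take {-@z}

Noun:
    <mor plural> == "<mor root>" Plural:<"<mor final>">.

Bet:
    <>  == Noun
    <mor root>  == bet
    <mor final> == t.  % so Bet:<mor plural> evaluates to: bet s
```

Any noun whose final segment is not explicitly listed under Plural falls through to the default {-z} via the empty path, which matches as a prefix of every query.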
Morphotactics is the study of morpheme co-occurrence restrictions. These
restrictions include the order of occurrence, e.g., *nation-tion-al-ize5 vs. nation-al-iz-ation,
and what morphemes can combine with other morphemes, e.g. –al combines with a
morpheme that is a noun, and not with a verb.
Morphological analysis of words is important as an aid in identifying syntactic
structures and roles (syntactic parsing) as was seen above. The meaning and function of the
individual morphemes that make up a word contribute to the meaning of the word as a whole,
and to the sentence in which the word occurs. Thus, morphological analysis plays an
important role in determining the meaning of words and sentences.
1.1.2 What are Morphological Parsers?
A morphological parser is software that accepts as input a natural language string and
returns one or more of the following types of information about a word: the word structure
(morpheme identification), part of speech (grammatical class), the meaning of individual
morphemes in the word, the meaning of the word itself, and possibly other information.
3 According to Crystal [24], the preferred term in Europe is morphophonology. This dissertation follows the American tradition of using the term morphophonemics, even when discussing linguistic literature deriving from Europe.
4 Note that phonologically the word place ends with an s sound though it is orthographically represented as c. The e following the c is silent. When the plural suffix –s attaches to a root ending in -s, a vowel is inserted (ə), and the s becomes voiced, yielding {-əz}.
5 Traditionally in linguistics, unnatural constructions are marked with an asterisk.
Morphological parsers are important for a number of reasons. They are useful to linguists for testing linguistic theories and exploring the morphology of a specific language.
In NLP systems, morphological parsers are often a component used for information retrieval, speech recognition, text-to-speech processing, and generating input to syntactic parsers.
1.1.3 What is the Purpose of Lexicons?
Lexicons contain information about the vocabulary of a language. This information can range from a simple word list to a list of roots, affixes, and lexical rules, along with phonological, morphological, syntactic, and semantic information for each entry. It is important to realize that it is difficult, if not impossible, to list every possible surface form of each word of a natural language. There must be a means to identify and analyze variations of a word in addition to the form(s) contained in a lexicon. This problem further motivates the need for morphological parsers.
Many modern linguistic theories make use of the notion of unification. The unification operation combines two descriptions into a single description. By using unification and lexical entries that contain complex features, it is possible to develop relatively simple grammatical rules. However, this relative simplicity is at the expense of increased complexity in the lexicon itself. The trend in linguistic theories of moving increasing information into the lexicon is called lexicalization. Lexicalization has highlighted the importance of which LKRL is selected to encode lexical entries.
The choice of LKRL impacts the ability to capture linguistic generalizations and to deal with redundancy in the lexicon. It is generally desirable that entries in lexicons be encoded in such a way as to capture linguistic generalizations and to eliminate, or at least minimize, redundancy. The ability to capture linguistic generalizations and to minimize
7 redundancy is desirable from both a theoretical and practical perspective. There has been a traditional belief in the field of linguistics that analyses that capture abstractions or generalizations are preferred over analyses that do not. Generalizations are useful for establishing universals of natural languages [24:165]. On the practical side, a lexicon that contains redundant information is one that requires more labor to develop and maintain than one that either minimizes or eliminates redundancy. A change to a redundant piece of information requires the change to be made in many places rather than in just one place.
Therefore, for practical reasons it is desirable to have a lexicon that captures linguistic generalizations and minimizes redundancy of data. The degree to which a lexicon can do this depends greatly on the LKRL chosen.
One mechanism for capturing generalizations and minimizing redundancy that has received much attention by the research community is inheritance. In this approach, lexical entries form a hierarchy. Entries may be the children of other entries and inherit the properties of their parent. Consider the case of transitive verbs. Transitive verbs take two arguments (a subject noun phrase and an object noun phrase). Lexical entries for individual transitive verbs should not carry such information. If they did, the information would be redundantly carried in the lexicon. Instead, an abstract lexical entry for transitive verbs could carry the information that transitive verbs take two arguments. Individual entries (children) would then indicate that they belong to the transitive verb class (their parent). Through inheritance, all lexical entries that are children of the abstract entry for transitive verbs inherit the information contained at the parent level, the abstract lexical entry for transitive verbs.
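The transitive verb scenario just described can be sketched directly in DATR, the LKRL introduced in section 1.1.5 below. All node and path names here are illustrative only, not drawn from any published lexicon:

```datr
% Illustrative sketch: an abstract lexical entry for transitive verbs.
Verb:
    <syn cat> == verb.

TransVerb:
    <> == Verb                    % inherit everything from Verb
    <syn args> == subject object. % transitive verbs take two arguments

Hit:
    <> == TransVerb               % inherits <syn cat> and <syn args>
    <mor root> == hit.

Kick:
    <> == TransVerb
    <mor root> == kick.
```

A query such as Hit:<syn args> returns subject object even though the entry for Hit says nothing about arguments; the generalization is stated once, at the abstract parent, rather than redundantly in every child entry.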
Thus far, this introduction has provided background information concerning the
nature and role of morphology, morphological parsers, and lexicons. The introduction now
turns to a specific morphological parser, AMPLE, and a specific LKRL, DATR, and then
discusses how AMPLE could potentially benefit from the use of DATR.
1.1.4 What is AMPLE? How Does it Work?
As stated above, AMPLE is a morphological exploration tool that was developed by
the SIL. It has been used for research with literally hundreds of languages throughout the
Americas and the rest of the world. Though primarily used within the SIL, AMPLE is
known outside the SIL community. Historically, AMPLE has its roots in language specific
parsers, starting in 1979 [70:2]. AMPLE version 1.0 was released in 1988 as a language
independent parser. Although over 10 years old, AMPLE has been maintained and enhanced
over the years. Recently, version 3.1 (July, 1998) added support for digraphic characters6, and version 3.2 (October, 1998) added the ability to state reduplication patterns [58].
AMPLE uses files that contain information about a language’s roots and affixes. This information is used for tests to identify the morphemes that make up a word. Without tests,
AMPLE would return potentially incorrect combinations of morphemes. AMPLE uses two major categories of constraints to test the validity of a parse—morphophonemic constraints and morphotactic constraints.
There are three major categories of morphophonemic constraints available in
AMPLE. String environments are used to constrain the occurrence of allomorphs based on strings and string classes. Morpheme environments are used to constrain the occurrence of
6 That is, phonemic segments represented orthographically by two or more characters.
allomorphs based on morphemes and morpheme classes. Lastly, AMPLE allows users of the
parser to define their own tests (user-written tests).
There are four major types of morphotactic constraints available for tests in AMPLE.
Orderclass constraints are rules regarding the order in which morpheme classes may appear
in a word. Morpheme co-occurrence constraints are rules about whether a specific
morpheme or morpheme class can or cannot appear if another morpheme or morpheme class
is present or absent in a word. Category mapping constraints are rules based on the notion of
from-categories and to-categories (à la categorial grammar). Each affix is assigned a from-category and to-category. A parse is rejected if the from-category of an affix in question does not match the to-category of some other relevant affix. Lastly, when no other mechanism suffices, the user may define ad hoc pairs: pairs of morphemes that simply may not occur together.
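Using the nation-al-ize example from section 1.1.1, the from-category/to-category idea could be encoded in DATR along the following lines. This is an illustrative sketch only; the node and path names are hypothetical, not actual AMPLE or published DATR code:

```datr
% Illustrative sketch: from-/to-categories for two derivational suffixes.
Suffix:
    <syn fromcat> == undefined
    <syn tocat>   == undefined.

Al:                        % -al derives adjectives from nouns (nation -> national)
    <> == Suffix
    <syn fromcat> == noun
    <syn tocat>   == adjective.

Ize:                       % -ize derives verbs from adjectives (national -> nationalize)
    <> == Suffix
    <syn fromcat> == adjective
    <syn tocat>   == verb.
```

A parser applying the category mapping constraint would accept nation-al-ize because the to-category of -al (adjective) matches the from-category of -ize, while rejecting orderings in which the categories fail to chain.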
AMPLE’s LKRL uses record and field markers to encode lexical entries. These markers define a database schema for encoding lexical information. The AMPLE LKRL allows entries to be assigned to classes, but lacks an inheritance mechanism, which makes capturing generalizations difficult and increases redundancy in the lexicon.
1.1.5 What is DATR? How Might it Benefit AMPLE?
DATR is currently the most widely used LKRL in the computational linguistics and
NLP community. DATR is a declarative language that was designed to support the kind of
lexical entries used by unification approaches to linguistics. However, Evans and Gazdar
[32] claim that DATR is linguistic theory neutral and in fact can encode polytheoretic
lexicons, with specific theory selection based on parameters used at run-time. DATR can
capture both generalizations and sub-generalizations for phonological, orthographical,
morphological, syntactic, and semantic information. The capture of generalizations and reduction of redundancy is achieved in DATR through the use of default multiple inheritance.
Default, or non-monotonic, inheritance allows a child to override the properties or features inherited from a parent. This supports the capture of regularities, irregularities, and sub-regularities. Inheritance schemes that allow a child to have only one parent use simple inheritance. DATR, on the other hand, supports multiple inheritance—a child may have more than one parent. This increases the ability to capture generalizations and reduce redundancy. However, DATR does not allow a child to inherit the same properties from more than one parent. This orthogonality requirement avoids some of the pitfalls of multiple inheritance.
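The effect of default (non-monotonic) inheritance can be seen in a small sketch. As before, the node and path names are this author's illustrations rather than code from any published lexicon:

```datr
% Illustrative sketch: a regular past tense stated once at Verb,
% overridden by the irregular child Sing.
Verb:
    <mor past> == "<mor root>" ed.

Walk:
    <> == Verb            % Walk:<mor past> = walk ed (inherited default)
    <mor root> == walk.

Sing:
    <> == Verb
    <mor root> == sing
    <mor past> == sang.   % overrides the inherited default
```

The regular pattern (Walk) costs nothing beyond naming the parent, while the irregular form (Sing) overrides only the single path that deviates, which is precisely how regularities, sub-regularities, and exceptions coexist in one hierarchy.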
The lexical encoding scheme used by AMPLE allows lexical entries to be assigned to classes, but it lacks a mechanism whereby the features of a parent class may be inherited by a child class. Through default multiple inheritance, DATR provides a means not available to
AMPLE to capture generalizations and reduce or eliminate redundancy. Theoretically, then, it appears that AMPLE could benefit from the use of DATR as the LKRL to encode lexical information for AMPLE.
1.2 Problem Statement
AMPLE is based on traditional approaches to linguistics that preceded the unification-based approaches for which DATR was developed. Before this author’s research was completed, the problem, therefore, was that it was not clear that the kinds of information that AMPLE uses could indeed be encoded in DATR. Also, it was not clear that a means could be provided by which AMPLE can make use of DATR-based lexicons without
modification to AMPLE itself. These issues had not been previously investigated. The basic
problem addressed by this research, then, was that it appeared that AMPLE could benefit
from use of DATR as an LKRL, but it was not known if it was truly feasible or desirable to
do so because it had apparently never been attempted.
1.3 Overview of the Research
1.3.1 Purpose of the Research
The purpose of this research was to answer the following basic research questions: Is
it feasible to use DATR as an LKRL for AMPLE? How can DATR be used as an LKRL for
AMPLE without modification to AMPLE itself? Is it desirable to use DATR as the LKRL
for AMPLE? What are the advantages and disadvantages of using DATR as an LKRL for
AMPLE?
1.3.2 Significance
Based on a search of the literature and discussion with members of the SIL
computational linguistics community, it is this author’s belief that apart from this author’s
research, no past or present research addresses the issue of interfacing AMPLE with DATR-
based lexicons. This research contributes to the body of knowledge of computational
morphology and lexicology in a number of ways. First, the research demonstrates that
lexical data required for AMPLE-style morphological parsing can be encoded using DATR.
DATR was originally developed for the requirements of modern Phrase Structure approaches
to grammar that rely on complex features and unification. AMPLE, on the other hand, is
rooted in the American Structuralist school of linguistics and other approaches that preceded
modern Phrase Structure Grammars. Demonstrating that the data required for AMPLE can
be encoded in DATR adds support to the claim made by the developers of DATR that DATR
is theory neutral and powerful enough to encode a wide variety of types of lexical
information. Second, the research adds some support to the notion that inheritance-based
lexical organization is applicable to all natural languages, not just European ones. In a
cautionary note, Daelemans et al. [25:214] state that “…existing work on inheritance
lexicons has been almost wholly based on familiar European languages”. The natural languages used for the research are non-Indo-European, specifically Ogea, a Papuan language of Papua New Guinea, and Yalálag Zapotec, a language of Mexico. Third, there are other legacy parsers besides AMPLE that could potentially benefit from the use of
DATR-based lexicons. Documentation of the issues involved in development of an interface between AMPLE and DATR-based lexicons could provide data for the development of an object-oriented domain framework, where the domain is that of interfacing legacy parsers with DATR-based lexicons. The purpose of such a framework would be to provide a toolkit to build interfaces between DATR-based lexicons and a variety of parsers. Object-oriented frameworks are a relatively new field of study and a current topic of research in computer science and software engineering [40, 56]. Per the framework literature, frameworks should be developed based on a minimum of three actual examples. The interface developed for this research could serve as one of the three examples for future development of a framework.
This research contributes to the practice of computational morphology and lexicology by providing the AMPLE user community with a means to use DATR-based lexicons and an example of how to do so. This is important for a number of reasons. First, AMPLE is widely used within SIL, the world’s largest linguistic organization. Of the world’s languages for which linguistic descriptions exist, the majority were analyzed by linguists working with
SIL. This means that SIL is the major supplier of language data to the worldwide linguistic
community7, and in this way contributes to both theoretical and descriptive linguistics. The
development of lexicons for a language is not a trivial undertaking. By developing and
demonstrating an interface between AMPLE and DATR-based lexicons, the effort to develop
and maintain AMPLE lexicons will potentially be reduced. Also, the use of DATR-based
lexicons could potentially result in lexical analyses that are more linguistically elegant
because of DATR’s mechanisms for capturing generalizations. Second, because the SIL engages in community development through applied linguistics, by aiding the work of SIL linguists, the research could indirectly benefit applied linguistics projects in many third- world communities.
1.3.3 Research Hypotheses
It is the thesis of this research that it is both feasible and desirable to use DATR as an
LKRL for AMPLE. The goal of the research was to verify this thesis and to develop and
demonstrate a generalized means by which AMPLE can take advantage of DATR-based
lexicons without having to modify the parser itself. With this in mind, four research
hypotheses were formulated:
Feasibility
H1: All types of lexical information expressible in the AMPLE legacy LKRL are also expressible in DATR.
H2: It is possible to translate AMPLE-oriented DATR lexicons back into AMPLE legacy format for use by AMPLE.
Desirability
H3: There are generalizations not possible to capture in the AMPLE legacy LKRL that can be captured by use of DATR’s inheritance mechanism.
7 This is acknowledged by non-SIL linguists, for example Sproat [68:265].
H4: Such generalizations can be exploited by 5% or more of lexical entries in the lexicon for a specific language8.
H1 and H2 are concerned with the feasibility of using DATR with AMPLE.
Feasibility addresses the status quo in using AMPLE. Feasibility in this context means that
any lexical information that can be stated in the AMPLE legacy LKRL is also stateable in
DATR, and that it is possible to provide a mechanism for AMPLE to use a DATR-based
lexicon. H3 and H4 are concerned with the desirability of using DATR with AMPLE.
Desirability addresses potential benefits from using DATR as an LKRL for AMPLE
lexicons. Feasibility and desirability should both be present to some degree to justify the use
of DATR as an LKRL for AMPLE. Neither is sufficient in and of itself. If it is not feasible
to use DATR with AMPLE, then it would not be possible to realize the potential benefits of
DATR. Although it may be feasible to use DATR, it may be the case that the type of lexical
information required by AMPLE does not have characteristics that may be exploited by the
inheritance mechanisms of DATR. If this were the case, though it were feasible to use
DATR, it would not be desirable to use DATR with AMPLE.
Through the claims of feasibility and desirability, the hypotheses predict three things:
that it is possible to state all AMPLE lexical information9 in DATR, that a mechanism can be
provided to allow AMPLE to interface with a DATR-based lexicon, and that natural
language lexical information used by AMPLE has characteristics that can be exploited by
8 Dr. H. Andrew Black was consulted as an expert on AMPLE. Dr. Black is a linguist with the Summer Institute of Linguistics, and one of the developers of AMPLE. His opinion was that if even as few as 5% of lexical entries could take advantage of the generalization capability of DATR, this would be a significant improvement in the eyes of those who build and maintain lexicons for AMPLE.
9 As will be seen in the description of AMPLE, below, in addition to morphemes, morpheme glosses, allomorphs, etymology, etc., lexical information in AMPLE includes ad hoc morpheme pairs, compound root pairs, properties, category pairs, morpheme classes, order classes, string classes, and environmental constraints. AMPLE control file data, input text control data, orthography change data, and tests are excluded since they are not, strictly speaking, lexical information.
DATR inheritance mechanisms. It is through its simple and multiple default inheritance mechanisms that DATR facilitates a reduction of redundancy, an increase in generalizability, and an increase in maintainability. The question is not whether this is true of DATR as an
LKRL. These characteristics of DATR are supported by an extensive literature, and are therefore not restated in the hypotheses above. The real question about the desirability of using DATR with AMPLE is whether the types of lexical information used for morphophonological and morphotactic constraints in AMPLE exhibit features that can be exploited by inheritance mechanisms. If this is true, the rest follows: a reduction in redundancy, an increase in generalizability, and an increase in maintainability.
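The effect of default inheritance on redundancy can be illustrated with a toy lookup routine. Python is used here purely for illustration; the node names and the single `parent` link are this sketch's own simplifications (real DATR inherits by path extension and supports multiple inheritance):

```python
def lookup(nodes, node, path):
    """Resolve `path` at `node`, falling back to the parent node
    when the path is not locally defined. A simplified analogue of
    DATR's default inheritance."""
    while node is not None:
        attrs = nodes[node]
        if path in attrs:
            return attrs[path]
        node = attrs.get("parent")
    return None

# Hypothetical micro-lexicon: Laugh states only what is idiosyncratic,
# inheriting its category and suffix information from Verb.
LEXICON = {
    "Verb":  {"parent": None, "cat": "v", "suffix 3sg": "-s"},
    "Laugh": {"parent": "Verb", "root": "laugh"},
}
```

Here `lookup(LEXICON, "Laugh", "cat")` yields "v" without the category being repeated in the entry for Laugh; it is this kind of factoring that reduces redundancy across a lexicon.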
1.3.4 Proof Criteria
Proof criterion 1. H1 will be considered true if for each and every field type found in
AMPLE dictionary records, equivalent DATR code can be presented.
Proof criterion 2. H2 will be considered true if it can be shown that a usable AMPLE
version of a dictionary database can be generated from a DATR
version. By usable it is meant that the generated AMPLE file(s)
contain the lexical input required to correctly parse the words of a
language.
Proof criterion 3. H3 will be considered true if examples can be shown from AMPLE
lexicons for at least two languages where lexical entries contain
information that may be generalized and DATR code is presented
showing how the generalizations may be captured.
Proof criterion 4. H4 will be considered true if for some DATR lexicon, I / A * 100 ≥ 5,
where I is the count of all inheriting nodes (lexical entries that inherit
information from another node), and A is the count of all nodes (all
lexical entries).
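The quantity in proof criterion 4 is straightforward to compute once a lexicon's inheritance links are known. A minimal sketch follows (Python for illustration; the dictionary encoding of node-to-parent links is hypothetical):

```python
def inheritance_percentage(lexicon):
    """Compute I / A * 100 for proof criterion 4, where I is the
    count of nodes that inherit from another node and A is the
    count of all nodes. `lexicon` maps node name -> parent node
    name (or None for non-inheriting nodes)."""
    A = len(lexicon)
    I = sum(1 for parent in lexicon.values() if parent is not None)
    return I / A * 100

# Hypothetical four-entry lexicon: two entries inherit from Verb.
sample = {"Verb": None, "Laugh": "Verb", "Walk": "Verb", "Ouch": None}
```

For `sample`, the function returns 50.0, comfortably above the 5% threshold of H4.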
1.3.5 Methodology
Based on the proof criteria defined above, the methodology to prove the hypotheses is as follows.
1. Demonstrate that each type of lexical information that can be encoded in AMPLE’s
legacy LKRL can also be encoded in DATR (proof criterion 1). For each type of lexical
information that can be encoded in AMPLE’s legacy LKRL:
1.1 Determine the range of variation for that type of information. (This will be based
on the Backus-Naur Form (BNF) syntax provided as an appendix to the AMPLE
user’s manual).
1.2 Find or develop enough examples to cover the range of variation for that type of
information.
1.3 Develop a DATR version of each of the examples.
1.4 Present the DATR code, along with output demonstrating that the code works.
2. Demonstrate a mechanism by which it is possible to translate AMPLE-oriented DATR
lexicons back into AMPLE legacy format for use by AMPLE (proof criterion 2).
2.1 Develop or select an AMPLE legacy LKRL-based lexicon to use for the
demonstration.
2.2 If not already done, develop a DATR-based version of the lexicon.
2.3 Develop the interface mechanism.
2.4 Convert the DATR-based lexicon from step 2.2 into an AMPLE legacy LKRL-
based lexicon using the mechanism developed in step 2.3.
2.5 Compare the file formats and contents to verify that the output from step 2.4 (the
generated AMPLE lexicon) is identical to the original AMPLE lexicon developed
or selected in step 2.1.
2.6 Process the original AMPLE lexicon from step 2.1 using AMPLE. Process the
AMPLE lexicon generated from DATR in step 2.4 using AMPLE. Compare the
output from both runs. Ensure that they are identical.
3. Demonstrate that DATR’s inheritance mechanism can be exploited for AMPLE lexicons
(proof criterion 3).
3.1 Analyze existing AMPLE lexicons for examples where inheritance can be
exploited.
3.2 Develop a DATR version of the examples.
3.3 Present the DATR code, an analysis, and output demonstrating that the code works.
4. Using a DATR-based AMPLE lexicon, demonstrate that 5% or more of lexical entries make
use of generalizations through DATR’s inheritance mechanism (proof criterion 4). Do this
using the following formula: I / A * 100, where I is the count of all inheriting nodes (lexical
entries that inherit information from another node), and A is the count of all nodes (all lexical entries).
1.4 Summary
This chapter has introduced the author’s research, which investigates the feasibility
and desirability of using DATR with AMPLE, a legacy morphology exploration tool. This
author’s research demonstrates the feasibility of using DATR with AMPLE by showing how
DATR can be used to encode the lexical information required by AMPLE and by presenting
an interface between AMPLE and DATR-based lexicons that does not require modification
of AMPLE itself. The desirability of using DATR with AMPLE is shown by demonstrating
that there are generalizations that cannot be captured using AMPLE’s LKRL that can be captured in DATR, thereby reducing redundancy of lexical information.
In Chapter 2, a review and analysis of the relevant literature is presented. In Chapter
3, the author's foundational research is presented, followed by a discussion of the research in terms of the four research hypotheses stated above. Chapter 4 is the conclusion, which also presents some topics for further work and research. Five appendices are also presented.
Appendix 1 is this author's analysis of the morphophonemics and morphotactics of the Ogea language. Appendix 2 presents a listing of an AMPLE unified database file that supports the parsing of all examples of Ogea presented in Appendix 1. Appendix 3 presents the DATR version of the Ogea lexicon. Appendix 4 presents Yalálag Zapotec AMPLE and DATR lexicons, in both cases using only a subset of entries from an AMPLE lexicon supplied by
Dr. H. Andrew Black to this author. Appendix 5 provides listings of various Perl scripts developed for the research.
CHAPTER 2
ANALYSIS OF THE LITERATURE
2.1 Introduction
This chapter presents a review and analysis of the literature relevant to this author’s
research. The review begins with an exposition of the nature of morphology as a sub-field of
grammar, and the role of the morphological parser in NLP. Next, a specific morphological
parser, AMPLE, will be examined. Following a general discussion of the role of the lexicon
in linguistic theory, the use of inheritance will be explored as a mechanism to manage
complexity in the lexicon and to better capture linguistic generalizations. Next, attention will
be paid to the role of the lexicon in NLP. Following that, a specific LKRL will be examined,
namely, DATR, and its ability to support default multiple inheritance will be demonstrated.
Finally, research related to this author’s will be examined.
2.2 The Morphological Parser
2.2.1 Morphology
In linguistics, morphology10 is a sub-field of grammar that focuses on the study of the
formation or structure of words. Morphology contrasts with syntax, which focuses on the
formation or structure of units formed by combining words, e.g., phrases, clauses, and
sentences. A central concept in morphology is the notion of the morpheme, which is the
minimal meaning bearing unit of language. From this perspective, words are viewed as
composed of one or more morphemes. For example, the English word laughs consists of two
morphemes, the verb root laugh and the plural suffix -s.
Morphemes can be classified as roots or affixes. Roots are the base form of a word
and carry its central meaning. Roots that do not combine with other morphemes are free
morphemes. Affixes are bound (non-free) morphemes that combine with a root and possibly
other morphemes to form a word. Affix classes have a small number of members, and the
bulk of a language’s vocabulary is composed of roots.
Two types of affixes familiar to speakers of English are prefixes (e.g., un- in undo) and suffixes (e.g., -ing in laughing). The notion of attaching affixes before or after a root or stem is consistent with a purely concatenative view of morphology. However, the concatenative model does not work well with other types of affixes found in many languages--the so-called non-concatenative morphemes. Examples include infixes, circumfixes, root-and-pattern morphemes, and reduplication. A morphological parser that is designed to work with any of the world’s languages should be capable of handling both concatenative and non-concatenative morphemes. Because the non-concatenative morphemes are not as familiar to English speakers as are the concatenative ones, the following will provide examples of each of the non-concatenative morphemes listed above.
Infixes are affixes that are embedded in another morpheme, typically at a morpheme boundary. Infixes do not occur in Indo-European languages (except in loan words), but are common in a number of language families, especially American Indian, Asian, and Semitic languages [24:195]. An example is Tagalog (from [66:12-13], cited by [6:9]), in which the root sulat is interrupted by a focus morpheme (-um- or -in-):
10 See standard introductions to linguistics for an overview of morphology. For definitions see Crystal [24], and for an introduction to morphology from a generative grammar perspective see Katamba [51].
(4) a. sulat   ‘to write or writing (infinitive form)’
    b. sumulat ‘to write (with actor focus)’
    c. sinulat ‘to write (with object focus)’
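For forms such as these, infixation can be modeled as insertion before the first vowel of the root. A minimal sketch follows (Python for illustration; the rule is adequate for these Tagalog forms only, since real infixation sites can be more complex):

```python
def insert_infix(root, infix):
    """Place a focus infix after the root-initial consonant(s),
    i.e., immediately before the first vowel of the root."""
    vowels = "aeiou"
    i = 0
    while i < len(root) and root[i] not in vowels:
        i += 1  # skip past the initial consonant cluster
    return root[:i] + infix + root[i:]
```

Applied to the root sulat, the function yields sumulat for -um- and sinulat for -in-, matching (4b) and (4c).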
Rather than interrupting a root as do infixes, circumfixes are discontinuous
morphemes that “wrap” the root or stem of a word. Example (5) is from Dolan [27:78], cited
by Sproat [68:50].
(5) a. besar   k + besar + an
       big     bigness
    b. bangun  k + bangun + an
       arise   awakening
The circumfix k -an changes verbs or adjectives into abstract nouns.
The classic example of root-and-pattern morphemes is found in the Semitic languages. The
following discussion is based on Sproat [68:51]. In the Semitic languages, three elements are
required to form a verb stem: a root, a vowel pattern, and a template. The root typically
contains three consonants. The vowel pattern is used to indicate the voice and aspect of the
verb, and the template indicates the class of the derived verb. Sproat provides the following
example (6) from McCarthy [57:134]:
(6) Binyan11  ACT(A)    PASS(UI)  Template  Gloss
    I         kAtAb     kUtIb     CVCVC     ‘to write’
    II        kAttAb    kUttIb    CVCCVC    ‘cause to write’
    III       kAAtAb    kUUtIb    CVVCVC    ‘correspond’
    VI        tAkAAtAb  tUkUUtIb  tVCVVCVC  ‘to write to each other’
    VII       nkAAtAb   nkUUtIb   nCVVCVC   ‘subscribe’
    VIII      ktAtAb    kTUtIb    CtVCVC    ‘write’
    X         stAktAb   stUktIb   stVCCVC   ‘dictate’
In (6), each example is derived from the verb ktb, which means ‘to write’, with the root shown in italics in each of the forms derived from the basic root. The active forms are shown in column two, and the passive forms in column three. The template describes the word shape based on combinations of consonants (C) and vowels (V). Where a specific consonant
is called for, it is specified in the template (e.g., the ‘st’ required for the Binyan X template).
Note that in the case of some Binyanim (e.g. II), the template requires more consonants than are supplied by the verb root, resulting in the surface form of the root having four instead of three consonants, i.e., kAttAb. Other Binyanim have templates that require more vowels than are supplied by the vowel pattern, resulting in a surface form that has a duplicated vowel, i.e., kAAtAb in Binyan III.
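The template-filling process itself can be sketched computationally. The following illustration (Python; a simplification of left-to-right autosegmental association, with the last melody element spreading when a melody runs short) handles the active-voice forms above; Binyan II's gemination (kAttAb) and the passive U-I melody require a fuller autosegmental treatment and are not handled:

```python
def fill_template(template, root, vowels):
    """Interleave a consonantal root and a vowel melody into a CV
    template. C and V slots consume root consonants and melody
    vowels left to right; any other template symbol is copied as
    literal material (e.g., the st of Binyan X)."""
    out, ci, vi = [], 0, 0
    for slot in template:
        if slot == "C":
            out.append(root[min(ci, len(root) - 1)])
            ci += 1
        elif slot == "V":
            out.append(vowels[min(vi, len(vowels) - 1)])
            vi += 1
        else:
            out.append(slot)
    return "".join(out)
```

With root ktb and the active melody A, the templates CVCVC, CVVCVC, nCVVCVC, and stVCCVC yield kAtAb, kAAtAb, nkAAtAb, and stAktAb respectively, matching the active column of (6).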
In reduplication, either an entire morpheme or part of a morpheme is duplicated. The following examples of morpheme reduplication come from this author’s work on the Ogea language of Papua New Guinea.
(7) fai hilou ‘good man’
    man good

(8) fai hilou-hilou ‘good men’
    man good-good

(9) le -∅12-na ‘he is speaking’
    speak-Tp-S3s

(10) le -le -∅ -na ‘he is repeatedly speaking’
     speak-speak-Tp-S3s

(11) le -tu -∅ -na ‘he is speaking to him13’
     speak-3sO-Tp-S3s

(12) le -tu -tu -∅ -na ‘he is repeatedly speaking to him’
     speak-O3s-O3s-Tp-S3s

(13) tau -∅ -ni ‘I plant (a single object)’
     plant-Tp-S1s

(14) tau -tau -∅ -ni ‘I repeatedly plant (a single object)’
     plant-plant-Tp-S1s
11 Traditionally in the analysis of Semitic verbs, derivational classes have been called binyanim, with the singular being binyan. Each Roman numeral refers to a specific derivational class.
12 This is a zero morpheme. That is, there is meaning to the absence of a tense suffix in Ogea.
13 Although glossed as ‘he’, Ogea is actually neutral in pronouns, object suffixes, and subject suffixes, and does not mark gender.
(15) tata -ru-∅ -ni ‘I plant (many objects)’
     plant-pl-Tp-S1s

(16) tata -ru-tata -ru-∅ -ni ‘I repeatedly plant (many objects)’
     plant-pl-plant-pl-Tp-S1s
In Ogea, there are no affixes to mark a noun as plural. Instead, if a noun is modified by an adjective, plurality can be indicated by reduplication of the entire adjective as shown by comparison of (7) and (8)14. Reduplication is also used to indicate repetitive action. In (10), the root is reduplicated to indicate repeated action. Though this example does not show it, in such reduplication the entire root is reduplicated. If an object suffix is present, the object suffix is reduplicated instead of the verb root. In (12), the third person object suffix -tu- is reduplicated to indicate that someone is speaking to someone else repeatedly. Examples
(15) and (16) illustrate partial reduplication. In Ogea, with verbs of a certain class, the first syllable of the root may be reduplicated and a plural marker -ru- added to form a stem as in
(15). In (16), the entire stem is reduplicated to indicate repetition of action. Sproat stresses that reduplication is unusual in that unlike other types of morphemes, computational approaches to reduplication require memory of a previous string [68:60].
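Sproat's point can be made concrete: recognizing a fully reduplicated form requires comparing two copies of an unbounded string, which a plain finite-state device cannot do. A minimal recognizer sketch follows (Python for illustration; orthographic hyphens and any affixes are assumed to have been stripped already):

```python
def is_full_reduplication(word, roots):
    """True if `word` is exactly two copies of a known root, as in
    Ogea hilouhilou (hilou-hilou 'good' pl.). The equality check on
    word[:half] remembers an arbitrarily long copied substring."""
    half, rem = divmod(len(word), 2)
    return rem == 0 and word[:half] == word[half:] and word[:half] in roots
```

For example, with roots {hilou, le}, the form hilouhilou is recognized as reduplicated while the bare root hilou is not.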
In his survey of morphology, Sproat [68] also discusses subsegmental morphology
(e.g., a morpheme indicated by a subsegmental feature), zero morphology (a morpheme indicated by the absence of an explicit string), and subtractive morphology. These and other phenomena described by Sproat underscore the defectiveness of a simplistic definition of affixes in particular, and morphemes in general. Sproat suggests that an affix is “…any kind of phonological expression of a morphological category µ that involves the addition of phonological material.” [68:51]. He suggests that a morpheme should be thought of as
14 If there are no adjectives modifying a noun, plurality can be indicated by use of a plural pronoun.
“…more properly constituting an ordered pair in which the first member of the pair
represents the morphological categories which are expressed by the morpheme—i.e., its
syntactic and semantic features—and the second member represents its phonological form,
along with information on how the form attaches to its stem.” [68:65].
Another way to view affixes is to classify them as either derivational or inflectional.
Derivational affixes combine with roots to form a stem, and often result in a new word that has a grammatical class different from the original word15. For example, the suffix -ize is
added to national (an adjective) to form nationalize (a verb). Inflectional affixes attach to
stems, do not change the class of a word, and signify grammatical meanings such as person,
number, and tense. Stems consist of one or more roots (e.g., blackbird is a stem with two
roots) and optionally one or more derivational affixes.
Sometimes a morpheme will have phonologically conditioned variants. These
variants, termed allomorphs, are conditioned by phonological phenomena when morphemes
co-occur in a word. That is, the phonological properties of morphemes interact with and
influence each other. For example, as noted in chapter one, in English, the plural morpheme
{-s} has three allomorphs, {-s}16, {-z}, and {-ιz}, as seen in (17) - (19).
(17) bets
(18) beds
(19) places
The study of phonologically conditioned variance of morphemes is the subject of
morphophonemics. Morphophonemic rules may be developed to predict or explain the
allomorphs of a morpheme. For example, the difference between the {-s} and {-z}
15 Derivational morphology is also known as lexical morphology. 16 Phonological representations of morphemes will be in a non-italic font, delimited by braces, e.g., {- s}.
allomorphs of the plural morpheme in English is voicing. The choice between the voiced and
voiceless variants is determined by the voicing of the final non-sibilant consonant in the root
(e.g., in (17), the ‘t’ in bets is voiceless, and in (18), ‘d’ in beds is voiced). And, the occurrence of the {-ιz} variant is dependent on the root ending with a sibilant (e.g., ‘s’). It should be noted that whereas morphophonemics generally deals with phonological processes operating on underlying phonological forms of words, NLP usually deals with orthographic variants via spelling rules [3]. Phonological and orthographical variation are directly related, yet separate, notions. Unless noted otherwise, discussion of morphophonemics in this paper refers to orthographic representations of morphophonemic phenomena.
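The selection among the three plural allomorphs can be stated as a small decision rule. The following sketch (Python for illustration; the segment spellings are this sketch's own stand-ins for phonemic symbols) encodes the sibilance and voicing conditions just described:

```python
def plural_allomorph(final_segment):
    """Select the English plural allomorph from the final segment
    of the stem (phonemic, simplified)."""
    sibilants = {"s", "z", "sh", "zh", "ch", "j"}
    voiceless = {"p", "t", "k", "f", "th"}
    if final_segment in sibilants:
        return "-iz"   # places
    if final_segment in voiceless:
        return "-s"    # bets
    return "-z"        # beds
```

The three branches correspond to examples (17)-(19): bets ends in voiceless t, beds in voiced d, and places in the sibilant s.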
The study of morpheme co-occurrence is called morphotactics. Morphotactic rules govern whether a particular morpheme can occur in a word if another morpheme is present or absent. The notion of morphotactics is generally associated with the item-and-arrangement viewpoint of word analysis. This perspective views words as linear sequences or arrangements of morphemes [24]. The item-and-arrangement model is associated with the
American Structuralist school of linguistics. An alternative viewpoint to item-and- arrangement is item-and-process, which views word analysis from the perspective of processes that act on a word to form a new word. The item-and-process model of word analysis has been dominant in linguistic theory since the advent of transformational grammar, and can more easily account for non-concatenative morphemes than can the item- and-arrangement model. In the item-and-process approach, lexicons contain the underlying form of morphemes, and rules are applied to generate the surface form. However, there are times when the item-and-arrangement model is chosen for NLP systems for pragmatic
reasons, as will be seen below with AMPLE. In fact, Sproat [68:205] notes the
predominance of the item-and-arrangement model in NLP.
2.2.2 The Role of the Morphological Parser in NLP
Morphological parsing is an analysis of a word to identify its constituent morphemes.
A morphological parser is a computer program that performs a morphological analysis of
words. For purposes of this paper, unless noted otherwise, the phrase ‘morphological parsing’ implies parsing that is performed computationally.
For many years, computational approaches to morphology were neglected. Antworth
[3:5] provides a discussion of this fact. He points out that economically dominant languages have morphologies that are relatively simple compared to many of the world’s other languages.
Because research funds were available for these economically dominant languages, and because morphology was not a pressing issue for languages with simple morphology, research tended to focus on syntactical rather than morphological parsing. The parsing of morphology in languages such as English was usually handled by ad hoc methods. Despite the cliché, it is often true that necessity is the mother of invention. A computational approach to highly inflected non-Indo-European languages requires techniques to handle morphology. While most highly inflected languages are not economically dominant, fortunately Finnish is an exception17. Funding for research on computational approaches to
Finnish morphology has been available. Finnish work in particular played a central role in
the underlying model used in two-level parsers such as SIL’s PC-KIMMO [2, 3]. (More will be said about two-level parsers below.) Work on morphologically complex languages has
17 Finnish belongs to the Finno-Ugric language family, and is unrelated to Indo-European.
benefited approaches to computational morphology for English and other economically
dominant languages that have relatively simple morphology.
Morphological parsing plays a role in a variety of NLP tasks. For example, AMPLE
[70] can be used as:
• a means to test morphological hypotheses about a specific language,
• an exploratory device to uncover morphological phenomena in a language,
• a spell checker for highly inflected languages,
• input to machine-translation from a dialect or language into closely related
dialects or languages,
• a means to produce glossed text.
Glossed text typically consists of natural language text with the words broken into
morphemes, along with an interlinear gloss of each morpheme on a second line.
Both Weber et al. [70] and Hindle [46] mention the role of morphological parsing as a
component to syntactic parsing. Except for isolating languages18, it is difficult to imagine a
syntactic parser that would not rely on the results of morphological parsing as input to
determine the syntactic structures of sentences.
Sproat [68] discusses the applications of not just morphological parsers, but
computational morphology in general. Applications discussed include those mentioned
above, as well as the following:
• an aid to identifying word boundaries in text,
• dictionary construction,
18 Crystal [24:205] states that isolating languages are ones with single morpheme words that do not have variant forms. In such languages, syntactic roles are indicated by word order rather than by inflection.
• lemmatization (identification of the dictionary form of a word occurring in text),
• text-to-speech processing,
• speech recognition,
• automatic orthographic conversion (e.g., morpheme-based Kana-Kanji conversion
for Japanese).
Natural language understanding is an NLP task that is not yet fully realized.
When it is, it will require input from morphological parsers: natural language
cannot be understood without identifying the individual morphemes of a word and knowing their meanings.
Morphological parsing can also play a role in computational lexicography. Biber [5] describes the use of factor analysis to computationally identify word senses. This is important for lexicography (the art and science of dictionary making) and lexicology (the study of a language’s vocabulary), and could be used for information retrieval. In Biber’s approach, word co-occurrence information from large corpora is grouped via factor analysis.
The results obtained appear to correlate with word senses. Biber suggests that his approach could produce better correlations if co-occurrence data included not just the words themselves, but grammatical categories. Unless text was hand tagged with category information, implementation of Biber’s suggestion would presumably require a morphological parser, and, perhaps, a syntactic parser. Information from a syntactic parser can be useful to a morphological parser to resolve ambiguity in the case of morphemes that can potentially be assigned to more than one category.
In information retrieval, morphological parsing is used by some retrieval schemes to normalize morphological variants of retrieval terms through a technique called stemming. A
stemmed word is a single canonical form that is used to match query terms to the index terms in text [55]. The canonical form can be the root or stem of the word, with all affixation stripped away, or a standard citation form, which is the dictionary entry (lexical or lemmatized) form. Sproat [68:13] observes that stemming algorithms are often trivial in languages such as English, but more difficult in highly inflected languages. These languages often require morphological components for stemming algorithms. Various morphology-based approaches to stemming are described by Hull [47] in his case study of techniques used to evaluate stemming algorithms, and by Hull and Grefenstette [48].
Before describing AMPLE, the morphological parser of concern to this research, some discussion of mainstream approaches to morphological parsing is in order. The majority of approaches are based on two-level models and use finite-state mechanisms. As noted above, research on computational approaches to Finnish morphology has made a large impact on the field. This is due to the work of the Finnish computational linguist Kimmo
Koskenniemi [54]. The following discussion draws from Antworth [3] and Sproat [68].
Koskenniemi developed KIMMO, an approach to Finnish morphology that uses a two-level model for phonological rules. When Koskenniemi attempted to model ordered phonological rules as a single combined automaton of finite state transducers, he found that the automaton became prohibitively large. Koskenniemi’s solution was to model phonological rules as a bank of automata running in parallel [3:6]. This approach dictated that only two levels be used—an underlying level and a surface level—rather than multiple levels as is the case with traditional generative phonology. In contrast to generative phonology, the phonological rules of the two-level model are not ordered. All rules are
applied simultaneously. A further implication of this approach is that the two-level model may be used both for recognition and generation of words [3:9].
In addition to the dominance of two-level approaches, as with Koskenniemi’s approach, most morphological parsers rely on finite-state machines. This is true not only for computational morphology specifically, but for computational linguistics in general. Sproat
[68:125-143] discusses both finite-state morphotactics and finite-state phonology. He notes that concatenative morphology in which “…the allowability of a morpheme in position n depends only upon the morpheme that occurred in position n – 1….” can easily be modeled using finite-state machines [68:127]. Finite-state phonology in KIMMO and other two-level models rely on finite-state transducers (FSTs). FSTs accept pairs of symbols rather than single symbols. That is, next-state transitions are based on the current-state plus a pair of symbols, rather than current-state plus a single symbol as is the case with ordinary finite-state automata.
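The pair-symbol idea can be illustrated with a minimal transducer whose transitions are keyed on (lexical, surface) symbol pairs. The toy rule below (Python for illustration; the rule, states, and alphabet are all hypothetical) licenses a lexical N surfacing as m only immediately before p:

```python
class PairFST:
    """Minimal finite-state transducer in the two-level style:
    next-state transitions are keyed on (state, (lexical, surface))
    symbol pairs rather than single symbols."""
    def __init__(self, transitions, start, finals):
        self.transitions = transitions
        self.start = start
        self.finals = finals

    def accepts(self, pairs):
        state = self.start
        for pair in pairs:
            state = self.transitions.get((state, pair))
            if state is None:          # no transition: pairing rejected
                return False
        return state in self.finals

# Toy assimilation rule: N:m is licensed only if p:p follows.
fst = PairFST(
    transitions={
        (0, ("a", "a")): 0, (0, ("i", "i")): 0,
        (0, ("p", "p")): 0, (0, ("t", "t")): 0,
        (0, ("N", "n")): 0,   # default realization of N
        (0, ("N", "m")): 1,   # m commits us to seeing p next
        (1, ("p", "p")): 0,
    },
    start=0,
    finals={0},
)
```

The transducer accepts the lexical-to-surface pairing iNpa : impa but rejects iNta : imta, since state 1 demands a following p:p pair.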
2.2.3 The AMPLE Morphological Parser
As noted above, AMPLE is a morphological parser that was developed by the
Summer Institute of Linguistics (SIL). AMPLE consists of control files, dictionary files, text files, output files, and, of course, software programs. A text processing component of
AMPLE uses the information from control files to identify and extract words from the text files, and passes them in a normalized form to an analysis component. Dictionary files provide information about each morpheme in the language, including constraints that are used by the analysis subsystem to test hypothesized analyses. Both built-in and user-defined tests are contained in a control file. Finally, each word is written to an output file with the analysis or analyses for that word. It should be noted that the AMPLE software is language
independent, and embodies an abstraction of those characteristics common to language-specific parsers. The information about the language whose text is being processed is contained in the external files. Unlike most morphological parsers, AMPLE does not rely on finite-state machines, though some mechanisms used by AMPLE could be modeled using finite-state machines as will be discussed below.
AMPLE has evolved over the years as different individuals have improved it, taking into account experience using AMPLE with various languages around the world. Indeed,
AMPLE is one of the few morphological parsers intended for use with any natural language.
It is commonly used for a number of purposes, listed above, including Computer-Aided
Related Language Adaptation (CARLA), the machine translation of text from one dialect or language into one or more closely related dialects or languages. AMPLE is described in Simons [65], Weber et al. [70], Buseman et al.
[9], and Black and Black [6]. Possibly the only non-SIL literature that discusses AMPLE is a brief overview by Sproat [68]. The description of AMPLE presented below draws from these five references. An extended example of the use of AMPLE with the Ogea language is provided in Appendix Two.
AMPLE parses a word by systematically attempting to match sub-strings of the word with candidate prefixes, infixes, roots, and suffixes listed in dictionary files. If unconstrained, this process would be exhaustive, and would return all possible combinations of morphemes that match, including incorrect combinations. In order to avoid incorrect analyses, the combinations of morphemes to be considered by the parser are constrained by various types of tests, described below. Even with constraints, ambiguity can be encountered since human languages are naturally ambiguous, and therefore more than one legitimate
analysis may be possible for a word. Unlike some parsers, AMPLE returns all possible parses, not merely the first legitimate analysis found.
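The exhaustive matching process that AMPLE constrains can be sketched as a naive recursive segmenter (Python for illustration; this hypothetical toy handles suffix-style concatenation only and omits AMPLE's prefix, infix, and co-occurrence test machinery):

```python
def segment(word, allomorphs):
    """Return every way of dividing `word` into a sequence of listed
    allomorphs -- the unconstrained search space that a parser like
    AMPLE then prunes with morphotactic tests."""
    if not word:
        return [[]]           # one parse of the empty string: no morphemes
    parses = []
    for m in allomorphs:
        if word.startswith(m):
            for rest in segment(word[len(m):], allomorphs):
                parses.append([m] + rest)
    return parses
```

Given the allomorph list [laugh, laughs, s], the word laughs yields two candidate parses, laugh+s and the monomorphemic laughs, illustrating why constraints are needed to discard spurious analyses.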
In common with most morphological parsers, AMPLE follows the item-and-arrangement model of word analysis. An implication of this choice is that all surface forms
(allomorphs) are listed in the lexicon or dictionary. This model was chosen for pragmatic reasons, not theoretical ones [70]. Although the item-and-arrangement model does not account for all morphological phenomena, as was discussed above, and is inadequate as a theory of morphology, it is nonetheless a productive notion for the analysis of words, and the model most commonly used for morphological parsers per Sproat [68:205].
The AMPLE software consists of two modules—TEXTIN and ANALYSIS. The purpose of TEXTIN is to identify individual words in a text that is being processed. For example, the individual words in a sentence must be extracted from the sentence. AMPLE expects that texts that are used as input to the parser have been marked up using SIL’s
Standard Format Markers as tags. These markers begin with an escape sequence, which is the backslash symbol ‘\’, followed by one or more characters, without a space between the backslash and the other characters. For example, \id is used to mark the identifier for a file, e.g., the filename. Format markers can be partitioned into tags of text that should be parsed, and tags of text that should not be parsed. AMPLE uses an input text control file which specifies tags that identify text to be excluded from the parse, or alternatively, specifies tags that identify text to be included in the parse. In addition to the issue of which parts of an input file contain text to be parsed, there are the issues of orthography, capitalization, and identifying word formation characters. AMPLE provides a mechanism for the user to provide control information for each of these issues via a text input control file. TEXTIN
normalizes the individual words in a text by removing format markers, punctuation, and
white space, and by converting words to their lower-case form. TEXTIN can also apply
orthographical changes as directed by the user. Whereas the input text may have been
written in a practical orthography, for purposes of morphological analysis by AMPLE it may
be desirable to convert the text to an orthography that is more linguistic in nature. One
example given by Weber et al. [70:17] is the Latin x used in English, which is actually a sequence of a stop plus a fricative, that is, ks. In converting from a practical to linguistic-oriented orthography, the x could be converted to ks. Such conversions may be required in order to state phonological rules for allomorphs. TEXTIN preserves the original orthography, punctuation, and formatting data from the input file, and passes individual words to the ANALYSIS module for parsing. Subsequent to parsing by ANALYSIS, the information saved by TEXTIN can be used to recreate the text, incorporating the parsing analysis and morpheme glosses.
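The partitioning of a standard-format file into parsed and unparsed text can be sketched roughly as follows. The tag names, exclusion list, and word pattern here are illustrative choices for the sketch, not AMPLE's actual control-file settings:

```python
import re

EXCLUDED = {"id", "c"}  # tags whose text should not be parsed (illustrative)

def words_to_parse(text):
    """Split a standard-format text into (tag, content) fields and return
    the lower-cased words from fields that are not excluded, roughly as
    TEXTIN normalizes words before handing them to ANALYSIS."""
    words = []
    # A field starts with a backslash tag and runs to the next backslash.
    for tag, content in re.findall(r"\\(\w+)([^\\]*)", text):
        if tag in EXCLUDED:
            continue
        for w in re.findall(r"[a-z]+", content.lower()):
            words.append(w)
    return words

sample = "\\id ogea01.txt\n\\p Yafainga, tumbona."
print(words_to_parse(sample))  # → ['yafainga', 'tumbona']
```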
AMPLE makes use of up to eight types of files, four of which are control files, and four of which are dictionary files19. The analysis data file and the dictionary code table file
are mandatory control files. The analysis data file serves two main purposes. First, it defines
the categories and properties that are used for dictionary entries (a form of metadata).
Second, it defines tests to constrain the co-occurrence of morphemes20. More will be said,
below, about tests. The purpose of the dictionary code table file is to allow users to define
their own field codes to use for dictionary entries. The file contains the mapping between
these user-defined codes and the ones that AMPLE uses internally. One application of this
19 AMPLE allows the prefix, infix, suffix, and root files to be combined into a single file.
20 Whereas an individual morpheme may have its own co-occurrence constraints, those specified in the analysis data file should apply to more than one morpheme.
feature is to support use of AMPLE by non-English speakers. This allows non-English
speakers to define field codes that are meaningful in their own language, which are
automatically converted to AMPLE’s internal codes. The two optional control files are the
dictionary orthography change table file and the text input control file. The dictionary
orthography change table file controls the conversion of dictionary entries from one
orthography to another. The text input control file controls the conversion of text from one
orthography to another, and also defines special formatting information and special
characters to be added to the assumed word formation characters21. The dictionary file types
include a root dictionary, a prefix dictionary, an infix dictionary, and a suffix dictionary. The
types of affix dictionaries required depend, of course, on the language being processed,
since not all languages have, say, prefixes and infixes. Following is a description of the
record format used for roots and affixes in AMPLE. This description is useful for
understanding the discussion of how AMPLE constrains analyses. (The AMPLE
documentation provides a full BNF description.)
The metamodel or record format model for AMPLE root dictionary entries shown in
(20) is adapted from Weber et al. [70:125], and the model for AMPLE affix dictionary
entries shown in (21) is adapted from Weber et al. [70:112]. In this author’s adaptation, characteristics of the relationship between a morpheme and its fields and between an allomorph and its fields are indicated using notation borrowed from entity relationship diagramming22.
21 AMPLE assumes that both upper and lowercase a-z are word forming characters.
22 A leading zero (e.g., 0:1) means the relationship is optional. A leading one (e.g., 1:1) means the relationship is mandatory. On the right side, a one (e.g., 1:1) means there is only one occurrence of the field if it occurs. An ‘M’ (e.g., 1:M) means that there may be many occurrences of the field.
(20) root23
     |---- 0:1 etymology or gloss
     |---- 0:1 underlying form
     |---- 0:M morpheme property
     |---- 1:M category
     |---- 0:M morpheme co-occurrence constraint
     |---- 1:M allomorph (string of characters)
     |     |---- 0:M allomorph property
     |     |---- 0:M morpheme environment constraint
     |     |---- 0:M string environment constraint
     |---- 0:M comment
(21) Affix
     |---- 1:1 morph name (gloss)
     |---- 0:1 underlying form
     |---- 0:M morpheme property
     |---- 0:1 order class
     |---- 1:M category pair (from/to)
     |---- 0:M morpheme co-occurrence constraint
     |---- 0:M infix location constraint (for infixes only, and then mandatory)
     |---- 0:M string environment constraint
     |---- 1:M allomorph (string of characters)
     |     |---- 0:M allomorph property
     |     |---- 0:M morpheme environment constraint
     |     |---- 0:M string environment constraint
     |---- 0:M comment
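The cardinalities in (20) and (21) translate naturally into a record structure. A rough sketch of the root entry in (20) follows; the field names are this author's paraphrases, not AMPLE's actual field codes:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Allomorph:
    form: str                                                # the surface string
    properties: list = field(default_factory=list)           # 0:M allomorph property
    morpheme_env_constraints: list = field(default_factory=list)  # 0:M
    string_env_constraints: list = field(default_factory=list)    # 0:M

@dataclass
class RootEntry:
    categories: list                                         # 1:M (mandatory)
    allomorphs: list                                         # 1:M (mandatory)
    gloss: Optional[str] = None                              # 0:1 etymology or gloss
    underlying_form: Optional[str] = None                    # 0:1
    properties: list = field(default_factory=list)           # 0:M morpheme property
    cooccurrence_constraints: list = field(default_factory=list)  # 0:M
    comments: list = field(default_factory=list)             # 0:M

# A minimal root entry for the Ogea root glossed 'sit' (illustrative).
root = RootEntry(categories=["Verb"],
                 allomorphs=[Allomorph("yaf"), Allomorph("yafa")],
                 gloss="sit")
print(root.allomorphs[0].form)  # → yaf
```

Note that the 0:1 fields become optional attributes, the 1:M fields become mandatory constructor arguments, and the environment constraints hang off the allomorph rather than the morpheme, as (20) requires.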
Note in both (20) and (21) that morpheme co-occurrence constraints apply to the morpheme as a whole, but environment constraints apply to allomorphs of the morpheme. Table 1 lists the various types of fields used for both root and affix dictionary entries and provides a description of their contents and usage.
AMPLE provides a number of methods to constrain possible morpheme combinations. Many of these methods are based on linguistic phenomena, but in the event a linguistically based constraint has not yet been identified or does not exist, provision is made for ad hoc constraints. The purpose of constraints is to reduce ambiguity and to block incorrect analyses. There are seven types of constraints that a user can define to AMPLE.
Three of these constraints are morphophonemic in nature, and the other four are morphotactic
23 Not shown are the Elsewhere Allomorph, Feature Descriptor, and Do Not Load fields.
in nature [6]. In contrast, KIMMO only allows one kind of morphophonemic constraint, and one kind of morphotactic constraint [6]. The morphophonemic constraints allowed by
AMPLE are string environments, morpheme environments, and user-written tests. The morphotactic constraints are order classes, morpheme co-occurrences, category mappings, and ad-hoc pairs. Each of these is discussed below. In these discussions, two terms will be used that require some explanation. Conceptually, in AMPLE, a word is viewed as consisting of a linear sequence of morphemes (the concatenative model). It is possible to think of having a pointer to a specific morpheme in a word. The morpheme pointed to at any one point in time is the ‘current’ morpheme. Reference will be made to a left morpheme (a morpheme one position to the left of the current morpheme), and a right morpheme (a morpheme one position to the right of the current morpheme). In the case of a left-to-right parse, the left morpheme has been tentatively identified by the parser as occurring in the word being parsed.
Morphophonemic constraints specify environments that constrain the occurrence of allomorphs. Note that in AMPLE, all allomorphs are listed in the lexicon, and the role of phonological rules is to constrain which allomorphs are acceptable as a valid parse. This is in contrast to morphological parsers that rely on phonological rules to recognize morphological variants rather than listing each allomorph in the lexicon (the item-and-process model noted above).
Table 1. AMPLE Dictionary Field Types

Affix Morph Name
  Description: Unique identifier for a morpheme that is an affix. Feature information, e.g., person, number, gender for nouns, is typically used as the name.
  Purpose: Identifies a morpheme used in ad hoc pairs, morpheme co-occurrence constraints, morpheme environment constraints, successor tests, and final tests. Used as the identifier when writing to the output analysis file.

Allomorph
  Description: A string of characters representing a variant of a root or affix as found in input texts.
  Purpose: Allows AMPLE to recognize the allomorph.

Category
  Description: For roots, the morphological category (e.g., Noun, Adjective). For affixes, the from/to categories.
  Purpose: Used for user-defined successor and final category tests.

Infix Location
  Description: Lists the type of morpheme in which the infix may occur (prefix, root, or suffix), and string environment constraints.
  Purpose: Constrains the searches AMPLE makes to locate an infix in a morpheme.

Morpheme Co-Occurrence Constraint
  Description: Indicates whether a morpheme may or may not occur depending on the presence or absence of another morpheme.
  Purpose: Used by the built-in final test that uses morpheme co-occurrence constraints.

Morpheme Environment Constraint
  Description: Specifies a constraining environment based on morph names and morpheme classes.
  Purpose: Used by the built-in MEC final test to constrain allomorphs, and as part of MCCs.

Morpheme Property
  Description: Specifies a characteristic that conditions the occurrence of allomorphs of adjacent morphemes.
  Purpose: Used in user-written tests and in the built-in MCC and MEC final tests.

Morpheme Type
  Description: Identifies whether the dictionary entry is for a prefix, infix, suffix, or root.
  Purpose: Used to identify the record type when only a single (unified) dictionary is used.

Order Class
  Description: The relative order in which a morpheme may occur in a word. The values range from -32767 to 32767. The default value is zero. A root is implicitly assigned a zero.
  Purpose: Used for order class tests.

Root Etymology or Gloss
  Description: An etymological form.
  Purpose: When present, it is used in place of the root.

Root Morph Name
  Description: Uniquely identifies a morpheme that is a root. Typically, either a gloss or etymology is used as the name.
  Purpose: Identifies a morpheme used in ad hoc pairs, morpheme co-occurrence constraints, morpheme environment constraints, and tests. It is the default name used when writing to the output analysis file.

String Environment Constraint
  Description: Specifies a constraining environment based on strings and string classes, namely the characters that precede or follow an allomorph.
  Purpose: Used by the built-in successor test that uses string environment constraints.

Underlying Form
  Description: Specifies the underlying form (instead of the surface form) of the morpheme.
  Purpose: Used for writing to the output analysis file.
The first morphophonemic constraint to be discussed is the string environment constraint. String environment constraints block incorrect analyses by constraining a morpheme’s occurrence based on the character(s) immediately preceding or following the morpheme. These constraints are expressed as environmental rules. An example from
Ogea24 is shown in (22)-(29).
(22) tu       -∅  -na        ({tuna})
     give.O3s -Tp -S3s
     ‘he gives it to him’

(23) tu   -∅  -na            ({tũna})
     poke -Tp -S3s
     ‘he pokes it’

(24) tum  -bo -na
     poke -TO -S3s
     ‘while he poked it’

(25) tun  -de   -wa  -u
     poke -well -imp -S3s
     ‘poke it well!’

(26) tung -g  -a   -ne  -nga
     poke -TO -Trp -S3s -SR
     ‘As they poked it…’

(27) \scl B p b | bilabials
     \scl A t d | alveolars
     \scl V k g | velars
     \scl O | all other consonants25

(28) \r tu | non-nasalized u

(29) \r tuN26 | nasalized u
     \a tum / _ [B]
     \a tun / _ [A]
     \a tung27 / _ [V]
     \a tu28 / _ [O]
24 Ogea is a Papuan language spoken in Papua New Guinea. It is described in [21] and [22].
25 It is not valid in AMPLE to specify ‘all others’ in this way, but it is used for simplicity in the example.
26 N is used in linguistics to signify a nasal of an unspecified point of articulation.
27 As in English, ‘ng’ is the orthographic representation for a voiced velar nasal, ŋ.
28 tu- is the orthographic representation of the root {tũ-} in Ogea.
In example (23), above, the underlying vowel {ũ} is nasalized. Although
nasalization is phonemic29 in Ogea, it is not represented in Ogea orthography since context
allows it to be disambiguated from its non-nasalized counterpart, {u}. This is illustrated by
comparing (22) and (23), which phonologically constitute a minimal pair30, but are
orthographically identical. When an Ogea morpheme that ends with a nasalized vowel is
followed by a morpheme that begins with a stop, the underlying nasal consonant is realized
on the surface as a nasal of the same point of articulation as the following stop31. String
classes such as (27) and string environment constraints such as (28) and (29) could be used to
prevent AMPLE from considering a parse path that considered {tu-} as the root of (24),
(25), or (26), when {tuN-} is the correct root. (27) establishes classes of characters (e.g., B
for bilabial stops), which are used as shorthand in other rules. (29) states that the morpheme
{tuN-} is realized as {tum-} before bilabials, {tun-} before alveolars, {tuŋ-} before velars,
and {tũ-} elsewhere.
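The string classes in (27) and the environment rules in (29) amount to a mapping from the class of the following segment to an allomorph. A minimal sketch follows; the classes and surface forms are taken from the example above, while the function and variable names are invented for illustration:

```python
# String classes from (27): a class letter maps to its member segments.
STRING_CLASSES = {
    "B": {"p", "b"},   # bilabials
    "A": {"t", "d"},   # alveolars
    "V": {"k", "g"},   # velars
}

# Conditioned allomorphs of {tuN-} from (29), keyed by string class.
ALLOMORPHS = {"B": "tum", "A": "tun", "V": "tung"}

def allomorph_of_tuN(following_char):
    """Pick the surface form of the root {tuN-} from the segment that
    follows it, per rule (29): tum / _[B], tun / _[A], tung / _[V],
    and the orthographic form tu elsewhere."""
    for cls, members in STRING_CLASSES.items():
        if following_char in members:
            return ALLOMORPHS[cls]
    return "tu"  # the elsewhere case [O]

print(allomorph_of_tuN("b"))  # → tum   (as in tum-bo-na, (24))
print(allomorph_of_tuN("d"))  # → tun   (as in tun-de-wa-u, (25))
print(allomorph_of_tuN("g"))  # → tung  (as in tung-g-a-ne-nga, (26))
```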
Black and Black [6] discuss how string environment constraints in AMPLE can be used to handle other phenomena such as reduplication and epenthesis. Whereas string environments constrain the occurrence of allomorphs based on strings and string classes, morpheme environments constrain the occurrence of allomorphs based on morphemes and morpheme classes.
The first morphotactic constraint to be discussed is that of order class. It is commonly accepted in linguistics that morphemes do not concatenate in an unordered manner. Natural
29 A phonemic contrast between nasalized and non-nasalized vowels means that in Ogea the meaning of a word changes based on whether a vowel has nasalization.
30 Minimal pairs are two words that have different meanings and differ by only one phoneme.
languages impose rules regarding which morphemes may occur after other morphemes. In
other words, word structure consists of ordered morpheme class positions that may only be
filled by members of a particular class. For example, in the Ogea language of Papua New
Guinea, medial verbs consist of a verb root followed by up to five classes of suffixes, e.g.:
(30) le    -nigi -boro       -wa -ne  -nga
     speak -3pO  -completely -rp -3sS -SR
     ‘having said everything to them…’
3pO=third plural Object, 3sS=third singular Subject, rp=remote past, SR=Switch Reference32
In Ogea, suffix classes occur in the following order: object, aspect, tense, subject, and
switch reference. This means that a tense suffix cannot appear before, say, an object suffix.
Ordering, or position, rules such as these may be defined to AMPLE and exploited to
eliminate invalid parses. A parse can be discarded if it would result in a violation of ordering
constraints. Order class constraints are indicated by assigning a number to each class. In
AMPLE, order class values range from –32,767 to 32,767. If an order class is not explicitly
stated, AMPLE uses zero as a default value. A root is implicitly assigned zero as its order
class value. Negative numbers are used for prefixes, and positive numbers for suffixes. The
order class value for each suffix must be larger than the preceding one for the analysis to be considered valid.
Prefixes are constrained in a similar manner, but using negative numbers.
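The order class test can be sketched as a check that the stated order class values strictly increase from left to right across the word. The treatment of the zero default below is a simplifying assumption for the sketch, not AMPLE's exact behavior:

```python
def order_class_ok(order_classes):
    """Check that non-zero order class values strictly increase across the
    word (negative values for prefixes, zero for the root, positive for
    suffixes). Morphemes carrying the default value zero are skipped here,
    which is a simplification of AMPLE's actual handling."""
    ranked = [oc for oc in order_classes if oc != 0]
    return all(a < b for a, b in zip(ranked, ranked[1:]))

# Hypothetical Ogea suffix ranks: object(1) < aspect(2) < tense(3)
# < subject(4) < switch reference(5), with the root implicitly at 0.
print(order_class_ok([0, 1, 3, 4, 5]))  # root-obj-tense-subj-SR → True
print(order_class_ok([0, 3, 1]))        # tense before object → False
```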
31 Ogea syllables are open, that is, they end in a vowel and never end with a consonant, with the exception of syllables that end with a nasalized vowel, where the underlying nasal is realized on the surface string in certain environments.
32 Some linguists use the phrase ‘different subject’ rather than ‘switch reference’. However, in many languages, such as Ogea, ‘switch reference’ is preferred because if one of the individuals involved in an action in one clause goes on to do another action, that individual is technically a different subject from the first clause, but is considered to be the same referent. The medial clause will be marked for ‘same referent’, though only one of the referents goes on to perform the action of the second clause. The grammatical subject is plural for the medial clause, but singular for the subsequent clause.
Drawn from categorial grammar, category mapping constraints are constraints based
on the allocation, or partitioning, of roots and affixes to classes. For example, in some
languages, a wrong analysis can be avoided by partitioning roots into a noun class versus a
verb class, and partitioning affixes into those that bind with items of the noun class versus
those that bind to those of the verb class. The class categories that are productive for
preventing wrong morphological analyses will vary from language to language. In
AMPLE’s category mapping scheme, a from-category and a to-category are specified for
each affix. An affix can only be considered as a valid path for a parse if its from-category
matches the to-category of another affix. The other affix may be either adjacent or non-adjacent, depending on the mapping strategy being used.
In addition to partitioning, the notion of category mapping in AMPLE includes two
other concepts—obligation and propagation. Obligation states that a word must have affixes
of certain morpheme classes present in order to be considered well formed. (For example, in
Ogea, a final verb must have a tense suffix and a person suffix to be well formed.) In the
event that the current morpheme does not fulfill the category obligation of the obligating
morpheme, the obligation is passed to the next morpheme to the right or left of the current
morpheme, depending on the category mapping scheme being used. This passing of
obligation is called propagation in AMPLE.
Although AMPLE does not do so, AMPLE’s full category mapping scheme could be
implemented using finite-state machines (FSMs). The FSM nature of AMPLE’s category
mapping scheme is illustrated by Weber et al.’s example from Quechua [70:31]. They posit categories based on a verb’s valence. In linguistics, the valence of a verb specifies the number of arguments that the verb takes. Arguments can be on a syntactic level or, in the case of highly inflected languages, valence can also be on a morphological level. For example, on a syntactic level, a verb with a valence of two may have both a subject noun phrase and an object noun phrase. On a morphological level, a verb with a valence of two may have both a subject affix and an object affix. For Quechua, Weber et al. posit a category of V2 (bivalent) for a verb that may have an object suffix, and V1 (univalent) for a verb that may have a subject suffix. In their discussion of how a parse would occur using valence-based categories in Quechua, Weber et al. state that
It is significant that the object marking suffixes (-ma(:) ‘first person object’ and -shu ‘second person object’) reduce the valence from V2 to V1. This is like saying that these suffixes fulfill the obligation to have an object marker but not the obligation to have a subject marker. [70:31].
The categories V2, V1, and V0 are applied to a single verb during the parse process.
As a parse proceeds, the obligation decrements from V2 to V1, and from V1 to V0. When
V0 is reached, it indicates that the verb is complete and that no more affixes need follow, i.e., that the parse analysis is finished. It appears, therefore, that the notions of obligation and propagation, as used in AMPLE, could be viewed as parse states of an FSM. The notion that the attachment of a particular affix reduces the valence could be viewed as a change of parse state. From this perspective, obligation rules would be state-transition rules governing parse states. When a parse is in state V2, the parser would know to accept matching affixes that are object markers, or any other affix category associated with state V2. When the parse is in state V1, the parser would accept matching affixes that are subject markers, or any other affix category associated with state V1. From a state-transition perspective, propagation in AMPLE would mean a transition to the same state, rather than to the next state. Category rules in AMPLE, then, could be implemented as rules about parse
states in an FSM. In fact, FSMs are well suited for modeling the linguistic notions of obligation and category mapping.
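The valence discussion above can be sketched as a small FSM whose states are the categories V2, V1, and V0, and whose transitions are triggered by affix categories. The transition table and category names below are this author's reduction of the Quechua example, not AMPLE's internal representation:

```python
# State-transition rules: (current state, affix category) -> next state.
# Object markers such as -ma(:) and -shu reduce valence V2 -> V1;
# subject marking then completes the verb, V1 -> V0.
TRANSITIONS = {
    ("V2", "object_marker"): "V1",
    ("V1", "subject_marker"): "V0",
}

def parse_state(start, suffix_categories):
    """Run a sequence of affix categories through the FSM; return the final
    state, or None if some affix cannot apply in the current state."""
    state = start
    for cat in suffix_categories:
        state = TRANSITIONS.get((state, cat))
        if state is None:
            return None
    return state

print(parse_state("V2", ["object_marker", "subject_marker"]))  # → V0
print(parse_state("V2", ["subject_marker"]))  # → None (obligation unmet)
```

Propagation would be modeled by adding transitions that loop back to the same state; reaching V0 corresponds to a completed, well-formed verb.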
Morpheme co-occurrence constraints are the third type of morphotactic constraint that
AMPLE recognizes. Whereas morpheme environments constrain the occurrence of allomorphs, morpheme co-occurrence constraints state rules about whether a specific morpheme or morpheme class can appear or cannot appear if another morpheme or morpheme class is present or absent in the word. In fact, the recognition of a morpheme may depend on the recognition of some other morpheme as seen in (31) - (34), from Ogea.
(31) yafai ‘he sat’
     yaf -a  -i
     sit -Tp -S3s

(32) yafainga ‘after he sat, (someone else did something)’
     yaf -∅  -a   -i   -nga
     sit -TS -Trp -S3s -SR

(33) yafagainga ‘while he sat, (someone else did something)’
     yafa -g  -a   -i   -nga
     sit  -TO -Trp -S3s -SR

(34) yafagai ‘he habitually sat’
     yafa -g   -a   -i
     sit  -hab -Trp -S3s
In (32), the recognition of a zero morpheme indicating temporal succession cannot occur until the final suffix -nga (switch reference) is encountered. And in (33) and (34), the identification of -g- as a suffix indicating temporal overlap versus one indicating habitual action cannot be made until the end of the word is reached. Temporal overlap, temporal succession, and switch reference apply to medial verbs occurring in a chain of clauses.
When the referent changes, a medial verb is marked by the switch reference suffix.
Morpheme co-occurrence constraints could be used to allow AMPLE to successfully identify the morphemes in this example.
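A morpheme co-occurrence constraint of the kind needed for (32), where the zero temporal-succession morpheme requires the later suffix -nga, can be sketched as a final test over the completed analysis. The morph names follow the glosses above; the function and constraint format are this author's illustration:

```python
def mcc_final_test(morphnames, constraints):
    """Final test: each (morph, required) pair demands that if `morph`
    occurs anywhere in the analysis, `required` must also occur.
    Because it runs after the whole word is analyzed, it can enforce
    'long distance' requirements between non-adjacent morphemes."""
    present = set(morphnames)
    return all(required in present
               for morph, required in constraints if morph in present)

# TS (the zero temporal-succession morph) may occur only with SR (-nga).
CONSTRAINTS = [("TS", "SR")]

print(mcc_final_test(["sit", "TS", "Trp", "S3s", "SR"], CONSTRAINTS))  # → True, as in (32)
print(mcc_final_test(["sit", "TS", "Trp", "S3s"], CONSTRAINTS))        # → False: TS without SR
```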
In the event that a linguist cannot identify a linguistic-based constraint or test,
AMPLE provides an ad-hoc mechanism to eliminate an analysis by explicitly naming two
morphemes that cannot occur together. Such ad-hoc pairs are a last resort constraint, and
should be avoided where possible.
As has been seen, AMPLE provides a wide variety of methods to constrain analyses.
These methods are in some ways overlapping. That is, a specific linguistic phenomenon may
be handled by more than one type of AMPLE constraint. However, Weber et al. [70:25]
point out that allomorphs themselves are the strongest type of constraint. If there is no match
whatsoever between a substring of a word and the allomorphs of a morpheme, that
morpheme does not occur in the word. The authors also discuss the relationship between the
length of an allomorph and the likelihood of ambiguity. In general, the shorter the allomorph
string is, the greater the likelihood of ambiguity. Zero morphemes and single-segment morphemes are the worst cases. Zero morphemes lack even a single segment and
therefore cannot constrain analyses. On the other hand, an allomorph string may be so long
that it is unique, and no further constraint will be necessary beyond that of the string itself.
The information provided in the dictionary files provides much of the raw material, so
to speak, for AMPLE to perform tests. Further information and many of the actual tests
themselves are defined in the analysis data file. Tests include both built-in and user-defined
successor or final tests. Successor tests are applied to each morpheme at the time it is
encountered during the parse, and final tests are applied to each morpheme in the word after
the entire word has been analyzed. Final tests check “long distance” constraints—that is, constraints that apply to morphemes that are not contiguous to each other. In addition to tests, the analysis data file contains declarations and information on prefixes, infixes, roots,
and suffixes. The declarations provide metadata specifications for information stated in the
dictionary files. These declarations include allomorph properties, morpheme properties,
categories, category classes, morpheme classes, and string classes. Properties, categories,
and classes must first be declared in order to be used by dictionary entries. In earlier versions
of AMPLE, there was a limit of 255 properties (divided between allomorph and morpheme
properties) and a limit of 255 categories. Users may now increase this number through a
maximum property field in the analysis data control file. Morpheme properties hold for all
allomorphs of a morpheme, whereas allomorph properties only hold for a specific allomorph.
For each type of morpheme (prefix, infix, root, and suffix), the maximum number of
occurrences of that morpheme type in a word is declared, successor tests are defined, and any
required ad hoc pairs are stated.
The built-in tests provided by AMPLE include three successor tests (SEC_ST33,
ADHOC_ST, ROOTS_ST) and two final tests (MEC_FT, and MCC_FT). The SEC_ST is a
successor test based on string environment constraints. This test checks the string
environment of the current morpheme to validate the analysis at that point. The ADHOC_ST
is a successor test based on ad-hoc pairs, and the ROOTS_ST is a successor test based on
compound roots. The MEC_FT is a final test based on morpheme environment constraints,
and the MCC_FT is a final test based on morpheme co-occurrence constraints.
In addition to the built-in tests, AMPLE allows the user to define tests. Tests can be
defined based on categories, order classes, properties, morpheme types, morpheme names,
allomorph strings, surface strings, and adjacent words. It is possible to build complex tests
33 SEC = String Environment Constraint, MEC = Morpheme Environment Constraint, MCC = Morpheme Co-Occurrence Constraint, ST = Successor Test, and FT = Final Test.
using various logical operators. Also, it is possible to write tests that are applied to all
morphemes (either all morphemes preceding the one in question, or following it).
Users have a number of ways to control tests. Although built-in tests are
automatically applied after the application of user-defined tests, the user can explicitly
invoke the built-in tests first for efficiency reasons. Also, the user can control tests by
allocating them to a specific morpheme type, i.e., prefix, infix, root, or suffix. Thus, AMPLE
will only invoke a test if the test applies to the type of morpheme under consideration. Finally, once a test is declared, it can be reused elsewhere.
The output of AMPLE is an analysis file. This file contains a record for each word in
the input file. Each record contains one or more fields, the mandatory field being an analysis
field. The analysis field provides the name of each affix (morphname), the root category, and
the root gloss or etymology34. The analysis information is listed following the order in which
the morphemes occur in the word. The optional fields include the word’s decomposition, word and morpheme categories, properties, feature descriptors, underlying forms, the original word as found in the input text, formatting, capitalization, and any trailing nonalphabetic information (e.g., punctuation or whitespace). If the analysis is ambiguous (more than one analysis is possible), the record will contain the number of possible analyses and present each alternative analysis. If no analysis is possible, AMPLE will indicate this in the output record.
Before concluding this section, the observations of Sproat [68:202-205] regarding
AMPLE will be mentioned. Sproat provides what is perhaps the only non-SIL review of
AMPLE. Sproat notes that AMPLE is an exception to the dominance of finite-state-based approaches. At the time of writing, Sproat [68:204] states that “…AMPLE has the
34 AMPLE uses a single field to contain either the gloss or etymology.
conceptually cleanest model of infixing of any morphological analysis system of which I am
aware.” He also notes that AMPLE is one of the few morphological parsers that handle other
non-concatenative phenomena such as reduplication. However, Weber et al. [70:13] state
that in AMPLE, “non-concatenative phenomena such as ablaut, vowel harmony, tone
sandhi…” and submorphemic phenomena can only be handled indirectly via concatenative
solutions, that is, by listing allomorphs and constraining their co-occurrence.
To summarize, AMPLE is a morphological parser with great power and utility, with
morphophonemics based on traditional American Structuralist notions of morphology, and
morphotactics based on a variety of linguistic approaches, including categorial grammar. As
has been seen above, AMPLE provides numerous ways to constrain a parse analysis
through both morphophonemic and morphotactic rules. This wealth of constraint
mechanisms allows AMPLE to handle morphological phenomena from a great variety of the
world’s languages. AMPLE has proven to be an extremely useful tool within the SIL
community for fieldwork in indigenous languages around the world.
2.3 The Lexicon
2.3.1 The Role of the Lexicon in Linguistic Theory
In both early transformational grammar, and linguistic theories preceding
transformational grammar, the lexicon was viewed merely as the place to keep an alphabetic
listing of a language’s vocabulary and any irregularities. Predictable aspects of the
language’s lexemes35, such as morphological rules, were handled separately from the lexicon.
Syntactic rules were also stated outside the lexicon. During the 1980s and 1990s, there was a
35 Crystal [24] describes a lexeme as an abstract unit that underlies grammatical variants, includes idioms, and is the unit traditionally listed as a separate dictionary entry.
movement toward lexicalization—the capture of more and more linguistic information in the lexicon itself, rather than in separate rules. This occurred within both transformational and non-transformational approaches to grammar.
Both Briscoe [8] and Ooi [61] suggest that lexicalization perhaps began with a 1970 article by Chomsky. Chomsky proposed using lexical redundancy rules rather than transformation rules to handle the nominalization of verbs [20]. Researchers expanded on
Chomsky’s proposal over the decades that followed. Per Ooi, lexicalization can be seen in the work of Jackendoff [49], in Lexical Functional Grammar (LFG), in Generalized Phrase-Structure Grammar (GPSG) [42], and in Head-Driven Phrase-Structure Grammar (HPSG)
[62]. Competitors to transformational grammar, both GPSG and HPSG have been influential in computational linguistics and NLP [7:1]. GPSG and HPSG are classified as modern phrase structure grammars (PSGs) and rely on the notion of unification. Gazdar [41] traces lexicalization to two main factors: work on computational morphology in the early 1980s, especially the Finnish work, and the need for lexicons to encode information about subregularity. The ability to handle subregularity in the lexicon implied the ability to handle regularity as well, resulting in the movement of more and more rules into the lexicon.
Next, the nature of modern PSGs and the notion of unification will be elaborated for the following reasons. First, modern PSGs have played a key role in lexicalization. Second, as will be seen below, the lexical knowledge representation language, DATR, was developed in response to the lexical needs of modern PSGs, and specifically for unification-based approaches to syntax. It is useful to understand why the need has arisen for lexical knowledge representation languages like DATR.
The following description of both classical PSGs and modern PSGs36 relies on
Borsley [7]. Classical PSGs utilize grammars that consist of phrase structure rules with
simple categories. For example:
(35) S → NP VP
(36) NP → Article Noun
(37) VP → Verb NP
(38) Article → a, the
(39) Noun → child, flower
(40) Verb → saw
Using sets of phrase structure rules, such as (35)-(40), the sentence “The child saw a
flower” may be represented as a simple tree, as in Figure 1.

    S
    ├── NP
    │   ├── Article — The
    │   └── Noun — child
    └── VP
        ├── Verb — saw
        └── NP
            ├── Article — a
            └── Noun — flower

Figure 1. Tree Representation of Phrase Structure Rules

Such trees may be derived by interpreting phrase structure rules as re-write rules. As re-write rules, the left-hand element of the top-most rule (i.e., the S of rule (35)) is re-written as its right-hand constituents (i.e., the NP and VP of rule (35)), and to each constituent a new re-write rule is applied, if one exists37.
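Treating (35)-(40) as re-write rules, a derivation can be sketched as repeatedly replacing the leftmost non-terminal. The grammar below restates the rules; the derivation function and its choice mechanism are this author's illustration:

```python
# Grammar (35)-(40): each category maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Article", "Noun"]],
    "VP": [["Verb", "NP"]],
    "Article": [["the"], ["a"]],
    "Noun": [["child"], ["flower"]],
    "Verb": [["saw"]],
}

def derive(symbols, choices):
    """Rewrite the leftmost non-terminal at each step, selecting among
    alternative right-hand sides with the iterator `choices`."""
    while any(s in GRAMMAR for s in symbols):
        i = next(j for j, s in enumerate(symbols) if s in GRAMMAR)
        alts = GRAMMAR[symbols[i]]
        rhs = alts[next(choices) % len(alts)] if len(alts) > 1 else alts[0]
        symbols = symbols[:i] + rhs + symbols[i + 1:]
    return symbols

# Picks: 'the' for the first Article, 'child' for the first Noun,
# 'a' for the second Article, 'flower' for the second Noun.
picks = iter([0, 0, 1, 1])
print(" ".join(derive(["S"], picks)))  # → the child saw a flower
```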
36 Throughout the discussion of Borsley's assessment of PSGs, please keep in mind that by modern PSGs, Borsley is referring specifically to GPSG and HPSG. In this author's discussion, modern PSGs are used merely as an illustration, and no position is being taken as to the value of GPSG or HPSG over other modern grammar theories such as Government and Binding or Principles and Parameters.
37 An alternative is to view phrase structure rules as admissibility conditions. As admissibility conditions, phrase structure rules are used to determine whether a given tree is well-formed.
Unlike transformational grammar, both classical and modern PSGs are monostratal.
That is, there is a direct mapping from underlying to surface forms, without the intermediate levels allowed by transformational grammar. This means that the representation of the syntactic structure of a sentence is a single tree, rather than multiple, intermediate trees. The monostratal nature of modern PSGs makes them easier to implement computationally than transformational grammars. Also, for those versions of both classical and modern PSGs that are context-free, reference is only made to local trees. Local trees are trees (or sub-trees) with a depth of one, where every node except the root is a daughter of the root [42:45]. This means that when checking whether a tree is well-formed, phrase structure rules are applied only to local trees. A tree is well-formed if all its local trees are well-formed.

Borsley notes three differences between classical PSGs and modern PSGs. Modern PSGs use complex categories (sometimes called features), replace phrase structure rules with immediate dominance and linear precedence rules, and tie their analyses to semantics. The difference most relevant to the topic at hand is the use of complex categories.
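The local-tree check just described can be sketched in a few lines of code. The tuple encoding of trees and rules below is an illustrative assumption of this sketch, not any particular PSG implementation; the point is that well-formedness is decided by inspecting only depth-one sub-trees.

```python
# Phrase structure rules (35)-(40) as admissibility conditions: a tree is
# well-formed iff every local tree (a root plus its immediate daughters)
# is licensed by some rule.
RULES = {
    ("S", ("NP", "VP")),
    ("NP", ("Article", "Noun")),
    ("VP", ("Verb", "NP")),
    ("Article", ("the",)), ("Article", ("a",)),
    ("Noun", ("child",)), ("Noun", ("flower",)),
    ("Verb", ("saw",)),
}

def well_formed(tree):
    """tree = (label, [subtrees]); a leaf is a bare word string."""
    if isinstance(tree, str):
        return True                      # words themselves need no rule
    root, children = tree
    daughters = tuple(c if isinstance(c, str) else c[0] for c in children)
    if (root, daughters) not in RULES:   # check only this local tree...
        return False
    return all(well_formed(c) for c in children)   # ...then recurse

tree = ("S", [("NP", [("Article", ["the"]), ("Noun", ["child"])]),
              ("VP", [("Verb", ["saw"]),
                      ("NP", [("Article", ["a"]), ("Noun", ["flower"])])])])
assert well_formed(tree)
assert not well_formed(("S", [("VP", []), ("NP", [])]))
```

Because each rule application sees only a root and its daughters, the check never needs to look deeper than one level at a time, which is exactly the locality property noted above.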
The problem with the simple, atomic categories of classical PSGs is that there are types of generalizations that cannot be captured as a single statement. Instead, the number of categories and rules is multiplied, because many natural language expressions contain both similarities and dissimilarities. Simple, atomic categories only allow expressions to be assigned to the same category if they behave in exactly the same way. Borsley provides several examples of the power of complex categories, including Welsh prepositional phrases. In Welsh, prepositions are marked for person, number, and gender, and must agree with the noun phrase. Classical PSGs would require seven separate statements to capture the required rules. In contrast, modern PSGs can capture the generalization in a single rule:
(41) PP → P [number: α, person: β, gender: γ] NP [number: α, person: β, gender: γ]
The variables, α, β, and γ co-index the categories for P and NP so they agree in number, person, and gender. Categories in modern PSGs can be much more complex than the example above. In HPSG, lexical entries are very complex, as is seen in Figure 2, a lexical entry for ‘she’, taken from Pollard and Sag [62:20].
word
  PHON     < she >
  SYNSEM   synsem
    LOCAL    local
      CATEGORY   cat
        HEAD     noun [CASE nom]
        SUBCAT   < >
      CONTENT    ppro
        INDEX    [1] ref
          PER    3rd
          NUM    sing
          GEND   fem
        RESTR    { }
      CONTEXT    context
        BACKGR   { psoa
                     RELN   female
                     INST   [1]   }
Figure 2. Attribute Value Matrix for ‘she’
The lexical entry for ‘she’ in Figure 2 is stated as an attribute value matrix (AVM).
The words in capital letters are attributes (categories, features) and are followed by their value(s). Note that a value can be atomic or can be another AVM. All feature values are assigned a sort (or type). The sort label specifies the attribute value type; the sort labels are the lowercase, italicized words such as word, cat, and ref. Sorts play an important role in HPSG. Sorts are themselves arranged into a hierarchy, and sort rules are defined to state which features are allowed with which sorts. From the perspective of sorts, an atomic value is a sort for which no features are specified [7:35]. Another aspect of HPSG that is seen in
Figure 2 is structure sharing. The person, number, and gender of the feature INDEX are
shared by INST, hence the co-index ‘[1]’. Pollard and Sag [62:19] point out that in such structure sharing, both attributes point to the very same structure; it is not simply a case of two separate structures that are identical in form. The authors state that structure sharing of the type used in HPSG is what is meant by the so-called ‘unification’ approaches to linguistic frameworks, and that structure sharing (unification) is the central explanatory mechanism of HPSG. Because DATR was designed to support lexical entries within the unification grammar tradition [32], the notion of unification will next be examined.
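The distinction between token identity and mere structural equality can be made concrete in a few lines of Python. The dictionary encoding below is an illustrative assumption of this sketch (HPSG AVMs are of course not Python dicts); it shows that true structure sharing means two paths lead to one and the same object.

```python
# Structure sharing as token identity, not mere equality of form.
index = {"PER": "3rd", "NUM": "sing", "GEND": "fem"}      # the tag [1]

entry = {
    "CONTENT": {"INDEX": index},                  # both attributes point
    "BACKGR": {"RELN": "female", "INST": index},  # to the *same* structure
}

copy_ = {"PER": "3rd", "NUM": "sing", "GEND": "fem"}      # equal, distinct

# Separate-but-identical structures are ==, but only true sharing is `is`.
assert entry["CONTENT"]["INDEX"] == copy_
assert entry["CONTENT"]["INDEX"] is entry["BACKGR"]["INST"]
assert entry["CONTENT"]["INDEX"] is not copy_

# Updating the shared structure is visible from both paths at once:
index["NUM"] = "plur"
assert entry["BACKGR"]["INST"]["NUM"] == "plur"
```

The final assertion captures why sharing matters: information added at one path is automatically present at every path that points to the shared structure.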
Many of the modern tools and theories of linguistics, especially the modern PSGs, either directly utilize the notion of unification, or have approaches that may be encoded by using unification. Two standard introductions to unification-based grammars are Shieber
[64] and Kasper and Rounds [50]. The material presented here is drawn from Shieber.
Shieber presents the formalism underlying many modern theories of syntax—namely unification—along with an abstraction from such theories. That abstraction is PATR II, a formalism that is neutral regarding any specific theory of syntax. Shieber describes PATR II as “…a powerful, simple, least common denominator of the various unification-based formalisms”. PATR II is used by Shieber to illustrate the common elements of unification-based formalisms. However, PATR II has assumed a life of its own, with various implementations now available. PATR II is important to the discussion at hand for two reasons. First, SIL has recently modified PC-PATR, its implementation of PATR II, to incorporate AMPLE’s morphotactics. Second, PATR II influenced the development of
DATR38, since DATR was modeled after PATR II in its syntax and much of its semantics.
And, as will be seen below, the complexity of lexicons in unification-based approaches was
also a motivation for the development of DATR.
Shieber notes that unification-based, or, as they are sometimes called, complex-
feature-based, grammars are based on research from both computational and formal
linguistics, as well as research in knowledge representation and theorem proving39.
Linguistic theories that are unification-based include Categorial grammar, Definite Clause
Grammar, Functional Unification Grammar, GPSG, HPSG, and Lexical Functional
Grammar. The degree to which unification is used in these approaches varies.
PATR II defines two basic operations—concatenation and unification. Concatenation
is the sole string-combining operation, and unification is the sole information-combining
operation. Shieber points out that because unification is the sole information-combining
operation, statements are declarative in nature and order independent. Order independence is
a point that will be returned to in the section on inheritance, below.
A grammar in PATR II contains two basic sections—a list of phrase structure rules,
and a lexicon. PATR II phrase structure rules are context-free, and, unlike classical PSGs,
have features associated with them to constrain the rules. Lexical entries take the form of
feature structures, similar to the ones shown in (41) and in Figure 2. Shieber defines a
feature structure as a partial function that maps features to values. Unification combines the
information from two feature structures into a new feature structure. This is illustrated by
(42).
38 PATR is a formalism for syntax, whereas DATR is a formalism for lexicology.
39 Shieber states that “unification was originally discussed as a component of the resolution procedure for automatic theorem-proving” [64:91].
(42) (a) [ cat:       NP
           agreement: [ number: singular ] ]

     (b) [ cat:       NP
           agreement: [ person: third ] ]

     (c) [ cat:       NP
           agreement: [ number: singular
                        person: third ] ]
The feature structure (c) is the result of unifying (a) and (b). In unification, the
information from two feature structures is combined into a new feature structure that contains
all the information from the original two. Unification is said to fail if there is a conflict between information contained in two feature structures. For example, (43) (a) and (b), below, cannot unify because the values for number conflict:
(43) (a) [ cat:       NP
           agreement: [ number: singular ] ]

     (b) [ cat:       NP
           agreement: [ number: plural ] ]
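The behaviour shown in (42) and (43) can be sketched with a small unification function over nested dictionaries. This encoding is an illustrative assumption only; real unification-based systems operate on DAGs with structure sharing, but the merge-or-fail logic is the same.

```python
# A minimal unification sketch over nested dicts.
def unify(f, g):
    """Return the unification of two feature structures, or None on clash."""
    if isinstance(f, dict) and isinstance(g, dict):
        out = dict(f)
        for key, val in g.items():
            if key in out:
                sub = unify(out[key], val)
                if sub is None:
                    return None        # conflict somewhere below
                out[key] = sub
            else:
                out[key] = val
        return out
    return f if f == g else None       # atoms must match exactly

a = {"cat": "NP", "agreement": {"number": "singular"}}
b = {"cat": "NP", "agreement": {"person": "third"}}
# (42): compatible information is merged into one structure
assert unify(a, b) == {"cat": "NP",
                       "agreement": {"number": "singular",
                                     "person": "third"}}

# (43): conflicting values for number cause unification to fail
c = {"cat": "NP", "agreement": {"number": "plural"}}
assert unify(a, c) is None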
Unification is used as the mechanism to govern the applicability of phrase structure
rules to lexical items. This is done by attempting to unify the feature constraints associated
with a phrase structure rule with the features of lexical items.
Shieber illustrates the power of PATR II with several sample grammars. One
linguistic phenomenon illustrated is that of agreement of person and number between a verb
and the subject noun phrase:
(44) (a) S → NP VP
         < S head >         = < VP head >
         < S head subject > = < NP head >

     (b) VP → V
         < VP head > = < V head >

     (c) Uther:  [ cat:  NP
                   head: [ agreement: [ number: singular
                                        person: third ] ] ]

     (d) sleeps: [ cat:  V
                   head: [ form:    finite
                           subject: [ agreement: [ number: singular
                                                   person: third ] ] ] ]

     (e) sleep:  [ cat:  V
                   head: [ form:    finite
                           subject: [ agreement: [ number: plural ] ] ] ]
Note the feature structures associated with the lexical items (44) (d) and (e). These features state the agreement values each verb requires of its subject. Also note the agreement values for (44) (c).
When combined with the grammar rules (44) (a) and (b), the effect is agreement in person and number between the subject noun phrase and the verb of a sentence. The head features of lexical item (44) (c) and the head subject features of (44) (d) can unify with the rule constraints, which will properly yield the sentence Uther sleeps. However, the head features of (44) (c) and the head subject features of (44) (e) cannot unify, correctly blocking *Uther sleep40.
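The licensing and blocking just described can be sketched directly. In the fragment below, the dictionary encoding and the `sentence_ok` helper are illustrative assumptions of this sketch, not PATR II itself; the constraint < S head subject > = < NP head > is modelled by unifying the verb's head-subject features with the subject's head features.

```python
# Unification over nested dicts (as in the earlier examples).
def unify(f, g):
    if isinstance(f, dict) and isinstance(g, dict):
        out = dict(f)
        for k, v in g.items():
            out[k] = unify(out[k], v) if k in out else v
            if out[k] is None:
                return None
        return out
    return f if f == g else None

AGR = {"number": "singular", "person": "third"}
uther  = {"cat": "NP", "head": {"agreement": AGR}}
sleeps = {"cat": "V", "head": {"form": "finite",
                               "subject": {"agreement": dict(AGR)}}}
sleep  = {"cat": "V", "head": {"form": "finite",
                               "subject": {"agreement": {"number": "plural"}}}}

def sentence_ok(np, v):
    # < S head subject > = < NP head >: the verb's head-subject features
    # must unify with the subject NP's head features.
    return unify(v["head"]["subject"], np["head"]) is not None

assert sentence_ok(uther, sleeps)       # Uther sleeps is licensed
assert not sentence_ok(uther, sleep)    # *Uther sleep is blocked
```

The clash that blocks *Uther sleep arises exactly where (43) failed: the singular number of Uther's agreement features cannot unify with the plural number demanded by sleep.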
An aspect of feature structures that should be noted is that they can be represented as directed acyclic graphs (DAGs). For example, (44) (c) could be represented as shown in Figure 3. Note that feature names label the arcs and that the arcs point to either an atomic value or
40 Traditionally in linguistics, an ill-formed expression is marked with an asterisk preceding it.
Uther
├─ cat → NP
└─ head → agreement
           ├─ number → singular
           └─ person → third

Figure 3. Directed Acyclic Graph for 'Uther'
to another DAG. Shieber notes that by viewing feature structures as graphs, the extensive
research into graph theory can be applied to unification-based grammars, particularly in
terms of graph theory vocabulary and operations.
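The graph view can be made concrete with a small sketch. Here a feature structure is a nested dictionary whose keys play the role of arc labels, and a `follow` helper (an illustrative assumption of this sketch, not PATR II machinery) walks a path such as < head agreement number > to the value at its end.

```python
# Feature structures as graphs: arcs are labelled with feature names and
# lead either to atomic values or to further nodes.
uther = {"cat": "NP",
         "head": {"agreement": {"number": "singular",
                                "person": "third"}}}

def follow(node, path):
    """Follow a sequence of arc labels from a node, as in a PATR II
    path expression like < head agreement number >."""
    for arc in path:
        node = node[arc]
    return node

assert follow(uther, ["head", "agreement", "number"]) == "singular"
assert follow(uther, ["cat"]) == "NP"
```

Path-following of this kind is the basic graph operation underlying both unification and the path equations seen in (44).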
The examples above provide a sense of how unification-based grammars simplify
grammatical rules at the expense of increased complexity of lexical entries. In fact, the
examples above are much simpler than what is truly required to fully handle the grammar of
a natural language. Lexicalization resulted in a need to address both the management of the
size and complexity41 of lexicons and the ability to capture generalizations that hold across
the features of lexical entries. A number of mechanisms were proposed and researched to
address this need. The main two mechanisms are the use of (default and/or simple)
inheritance hierarchies and lexical redundancy rules. Use of inheritance hierarchies to
simplify management of lexicons and capture generalizations is noted by Shieber for PATR
41 It should be noted that apart from increased complexity in the lexicon due to lexicalization, the inherent complexity of many natural languages also results in complexity in the lexicon.
II [64:54] and by Pollard and Sag for HPSG [62:36]. Inheritance hierarchies will be the
subject of the next section.
This section has discussed the role of the lexicon in linguistic theory. The trend
toward lexicalization was noted—that is, the simplification of grammatical rules at the expense of more complex lexical entries. The use of complex features was illustrated from
HPSG, an example of a modern Phrase Structure Grammar that relies heavily on unification.
The notion of unification was discussed because of the role it plays in the majority of modern approaches to syntax, and because DATR was explicitly developed to encode lexical entries for unification-based grammars. A brief overview of PATR II was provided to illustrate unification and how the use of complex features in lexical entries simplifies grammar rules at the expense of increased complexity in the lexicon. A further motivation for discussing
PATR II is that DATR is patterned on the syntax of PATR and shares much of its semantics with PATR.
2.3.2 Inheritance and the Lexicon
In the interests of linguistic parsimony and sensible knowledge engineering, it is necessary for lexicalist approaches to factor away at the lexicon-encoding interface as many as possible of the commonalities between lexical items. [1].
As shown in the previous section, the trend toward lexicalization has increased the
complexity of the lexicon. This has led to research into ways to manage that complexity.
Many formalisms rely in varying degrees on inheritance as one of the mechanisms to manage
the lexicon, e.g., HPSG [7:38], GPSG [62:36], and template inheritance in PATR II [64:58].
One of the earliest, and often referenced, expositions of the use of inheritance to
organize the lexicon is Flickinger [39]. Flickinger presents a framework that includes a word
class hierarchy, an inheritance mechanism that allows information from superclasses to flow
down to subclasses, and phrase structure rules that combine with lexical information to form
phrase constituents. (The author also discusses lexical rules, another technique to manage
complexity in the lexicon).
In 1992, the journal Computational Linguistics published two special issues on
inheritance in NLP. In the first of the special issues, Daelemans et al. [25] provide an
overview of the types of inheritance, its origins in three separate fields of study, its
application to phonology, morphology, syntax, semantics, and pragmatics, and provide an
extensive bibliography. The authors also note several trends occurring in the early 1990s.
The following illustrations of the types of inheritance are from Daelemans et al.
Verb
  < category >        = verb
  < past participle > = / e d /
  ...
  ├─ Transitive Verb
  │    < transitive > = yes
  │    ...
  │    ├─ Love      < form > = / l o v e /
  │    └─ Hate      < form > = / h a t e /
  └─ Intransitive Verb
       < transitive > = false
       ...
       ├─ Elapse    < form > = / e l a p s e /
       └─ Expire    < form > = / e x p i r e /
Figure 4. Monotonic Single Inheritance
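A hierarchy like the one in Figure 4 maps naturally onto class-based inheritance in a programming language. The following Python sketch is illustrative only (the class and attribute names are assumptions, not from Daelemans et al.); it shows properties flowing from the Verb superclass down to the leaf nodes.

```python
# Monotonic single inheritance: each node has exactly one parent, and
# inherited properties accumulate without being cancelled.
class Verb:
    category = "verb"
    past_participle = "ed"        # the / e d / suffix of Figure 4

class TransitiveVerb(Verb):
    transitive = True

class IntransitiveVerb(Verb):
    transitive = False

class Love(TransitiveVerb):
    form = "love"

class Elapse(IntransitiveVerb):
    form = "elapse"

# Love states only its form; everything else flows down from its ancestors.
assert Love.category == "verb"
assert Love.past_participle == "ed"
assert Love.transitive is True
assert Elapse.transitive is False
```

Note that Python classes can in fact override inherited attributes, so this sketch models monotonic inheritance only insofar as the subclasses refrain from redefining inherited properties.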
Inheritance networks come in various forms. Variations typically depend on choices
between single versus multiple inheritance, and monotonic versus non-monotonic
inheritance. Figure 4 illustrates monotonic single inheritance. In single inheritance, a child
node inherits properties from only one parent. In this case, for example, the verb node Love has a property for