Lexical Database Enrichment Through Semi-Automated Morphological Analysis
Volume 1

THOMAS MARTIN RICHENS
Doctor of Philosophy

ASTON UNIVERSITY
January 2011

This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognise that its copyright rests with the author and that no quotation from the thesis and no information derived from it may be published without proper acknowledgement.

Summary

Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required. WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically.
Existing databases and existing algorithms can both capture regular morphological relations, but neither captures exceptions correctly and neither provides any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern the associations between suffixes and their attachment, and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, which is minimised through a variable lexical validity requirement, and to overgeneration, which is minimised by reformulating rules and restricting monosyllabic output. Rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Where multiple rules apply to an input suffix, their precedence needs to be established. The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses.
Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. Failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions.

Keywords: morphological rules; automatic affix discovery; derivational morphology; segmentation fallacy; derivational tree.

Acknowledgments

The research presented here was conducted under a full time EPSRC-funded research studentship. The project began under the supervision of Sylvia Wong, Lecturer in Computing Science and was concluded under the joint supervision of Ian Nabney, Professor of Computing Science and Ramesh Krishnamurthy, Lecturer in English. The author would also like to thank the following, all of whom have played a role in facilitating this research:

• The late Sharen Lloyd
• Steve Dalton
• Chris Buckingham
• Ken Litkowski
• Christian Boitet
• Mathieu Mangeot
• Nazaire Mbame

Tom Richens
www.rockhouse.me.uk

Contents

VOLUME 1

Glossary
1 Introduction
1.1 Definitions
1.1.1 Wordnets
1.1.2 Derivational Morphology
1.1.3 Verb Frames
1.1.4 Parts of Speech, Participles and Gerunds
1.1.5 Qualia
1.2 Motivation
1.2.1 Fighting Arbitrariness
1.2.2 Derivational Morphology for Lexical Databases
1.2.3 Project Aims
1.2.4 Fulfilment of Project Aims
1.3 Experimental Platform
1.3.1 Object-Oriented Approaches to Modelling Wordnet Data
1.3.1.1 RDF
1.3.1.2 Python
1.3.2 The WordNet Model
1.3.2.1 Choice of Java
1.3.2.2 WordNet Relations
1.3.2.3 Sentence Frames
1.3.2.4 The Lexicon
1.3.2.5 The Lemmatiser
1.3.2.6 Applications of the Model and Related Publications
1.3.2.7 Subsequent Modifications
2 Investigation into WordNet
2.1 Word Senses
2.1.1 "I don't believe in word senses"
2.1.1.1 Metaphor
2.1.1.2 Translation Equivalents
2.1.1.3 Conclusions on Word Senses
2.1.2 Granularity
2.1.2.1 Implications of WordNet Granularity for Multilingual Wordnet Development
2.1.2.2 Investigation into WordNet Granularity
2.1.2.3 Clustering of Word Senses and Synsets
2.2 Taxonomy
2.2.1 Ontology
2.2.1.1 Shortcomings of WordNet-like Ontologies
2.2.1.2 Is a Correct Ontology Possible?
2.2.1.3 Compatibility of Existing Ontologies
2.2.1.4 Conclusions on Ontology
2.2.2 Investigation into the Verb Taxonomy
2.2.2.1 Introduction
2.2.2.2 Hypernyms and Troponyms
2.2.2.2.1 Algorithm for Identifying Topological Anomalies in Hierarchical Relations
2.2.2.2.2 Cycle
2.2.2.2.3 Rings
2.2.2.2.4 Dual Inheritance Without Rings
2.2.2.2.5 Isolators
2.2.2.2.6 Roots of the Verbal Taxonomy
2.2.2.3 Antonyms
2.2.2.3.1 Multiple Antonyms
2.2.2.3.2 Antonyms Without a Common Hypernym
2.2.2.4 Conclusion
2.3 Syntax
2.3.1 WordNet Sentence Frames
2.3.1.1 Synsets with More than 2 Framesets
2.3.1.2 Synsets with 2 Framesets
2.3.1.3 Synsets with 1 Frameset
2.3.1.4 Additional Frames
2.3.2 Frame Inheritance
2.3.2.1 Valency
2.3.2.2 Theory of Frame Inheritance
2.3.2.3 Investigation into Frame Inheritance
2.3.2.3.1 Algorithm for Validating Frame Inheritance
2.3.2.3.2 Extended Definition of Valid Frame Inheritance
2.3.2.3.3 Adapted Algorithm to Incorporate Broader Definition of Valid Frame Inheritance
2.3.2.3.4 Final Evaluation of Frame Inheritance
2.4 Conclusions on WordNet
3 Investigation into Morphology
3.1 Background
3.1.1 Some Simple Stemmers
3.1.2 A State of the Art Morphological Database?
3.1.2.1 Analysis of CatVar Sample Dataset
3.1.3 Previous Work on the Morphological Enrichment of WordNet
3.1.4 Derivational Trees
3.1.5 Morphological Enrichment across Languages
3.1.6 Inference of Morphological Relations from a Dictionary
3.2 A Rule-based Approach
3.2.1 Requirements for the Morphological Enrichment of WordNet
3.2.2 Pilot Study on the Formulation and Application of Morphological Rules
3.2.2.1 Formulation of Morphological Rules from the CatVar Dataset
3.2.2.2 Application of Morphological Rules
3.2.2.2.1 Autogeneration of Suffixed Forms
3.2.2.2.2 Suffix Stripping
3.2.2.2.3 Overgeneration of Suffix Generation and Suffix Stripping Compared
3.2.2.3 Prefixations in the Random Word List
3.2.2.4 Application to the Enrichment of WordNet
3.2.2.5 Conclusions from the Pilot Study
3.2.3 Conclusions on Morphological Rules
3.3 Review of Existing Morphological Analysis Algorithms
3.3.1 From Phoneme to Morpheme
3.3.2 Word Segmentation
3.3.3 Minimum Description Length
3.3.4 Conclusions on Word Segmentation
3.4 Automatic Affix Discovery
3.4.1 Automatic Prefix Discovery
3.4.1.1 Prefix Tree Construction
3.4.1.2 Heuristics to Elucidate the Semantic Criterion
3.4.1.3 Results from Automatic Prefix Discovery
3.4.2 Automatic Suffix Discovery
3.4.2.1 Extension of the Algorithm to Suffix Discovery
3.4.2.2 Results from Automatic Suffix Discovery
3.4.3 Comparison of Results from Automatic Affix Discovery with Results from the Pilot Study on Morphological Rules
3.4.3.1 Undergeneration by Automatic Suffix Discovery
3.4.3.2 Heuristics Tested against Morphological Rules
3.4.4 Additional Heuristics
3.4.5 Conclusions on Automatic Affix Discovery
3.5 Final Considerations Prior to Morphological Analysis and Enrichment
3.5.1 Affix Stripping Precedence
3.5.2 Compound Expressions and Concatenations
3.5.3 Implications of WordNet Granularity