Extraction of Neologisms from Japanese Corpora
A thesis presented by James Breen to The School of Computing and Information Systems in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Melbourne
Melbourne, Australia
December 2017

©2017 - James Breen
All rights reserved.

Thesis advisers: Timothy Baldwin, Francis Bond
Author: James Breen

Abstract

In this thesis an exploration of the application of natural-language processing techniques to the extraction of neologisms from Japanese corpora is described. The research aim was to establish techniques which can be developed and exploited to assist significantly in neologism extraction for compiling Japanese monolingual and bilingual dictionaries. The particular challenge of the task is presented by the lack of word boundaries in Japanese text which creates a problem in the identification of unrecorded words. Three broad approaches have been explored, using a variety of language processing and artificial intelligence techniques, and drawing on large-scale Japanese corpora and reference lexicons: synthesis of possible Japanese words by mimicking Japanese morphological processes, followed by testing for the presence of candidate words in Japanese corpora; analysis of morpheme sequences in Japanese texts to determine the presence of potential new or unrecorded terms; and analysis of language patterns which are often used in Japanese in association with new and emerging terms.

The research described in this thesis has identified a number of processes which can be used to assist lexicographers in the identification of unrecorded lexical items in Japanese texts.

Contents

Title Page
Abstract
Table of Contents
Abbreviations used in this Thesis
Citations to Previously Published Work
Acknowledgments

1 Introduction
  1.1 Thesis Overview
  1.2 Background, Relevance and Importance of the Project
    1.2.1 Japanese Orthography
    1.2.2 Neologism Formation in Japanese
    1.2.3 Project Goals
  1.3 Conceptual Framework of the Study, Summary of Experimental Methods
  1.4 Lexicographic Background
  1.5 Thesis Structure and Timeline

2 Neologisms: Lexicographic Issues and Terminology
  2.1 Introduction
  2.2 Nomenclature
  2.3 What Makes Up A Lexical Item?
  2.4 The Japanese Perspective
  2.5 Single Items
  2.6 Multiword Expressions
  2.7 Potential Lexical Items
  2.8 Summary

3 Prior Work and Literature Review
  3.1 Introduction
  3.2 General Lexicography
  3.3 Japanese NLP - Morphological Analyzers and Parsers
  3.4 Alternative Segmentation Approach
  3.5 Identification of Unknown Words
  3.6 Pre-candidature Work

4 Resources
  4.1 Introduction
  4.2 Dictionaries and General Lexicons
  4.3 Text Corpora
  4.4 n-gram Corpora
  4.5 Software
    4.5.1 Morphological Analyzers
    4.5.2 Morpheme Lexicons
    4.5.3 Machine-Learning Systems

5 Lexical Item Identification From Morpheme Analysis
  5.1 Introduction
  5.2 Morphological Analysis of Japanese Text
  5.3 Prior Work
  5.4 Using Unknown Word Functions in Morphological Analyzers
  5.5 Rule-Based Lexical Item Identification
  5.6 Application of Machine Learning to Lexical Item Extraction
    5.6.1 Basic Model
    5.6.2 Labelling
    5.6.3 Feature Development
  5.7 Evaluation
    5.7.1 Overview
    5.7.2 Testing with Automatically Marked-up Texts
    5.7.3 Initial Testing
    5.7.4 Testing with Hand-Annotated Texts
    5.7.5 Precision Issues
    5.7.6 Impact of Training Texts
    5.7.7 Impact of Training Size
  5.8 Testing for Potential Lexical Items in Unseen Texts
  5.9 Summary, Discussion and Conclusions
  5.10 Postscript: Alternative Chunking Approach

6 Japanese Loanword Multi-Word Expressions: Extraction, Segmentation and Translation
  6.1 Introduction
  6.2 Orthographical Aspects of Loanwords in Japanese
  6.3 Assimilation of Loanwords in Japanese
  6.4 Loanword Multi-Word Expressions
  6.5 Prior Work
    6.5.1 Segmentation
    6.5.2 Non-English Words
    6.5.3 Pseudo-English Constructions
    6.5.4 Orthographical Variants
    6.5.5 Polysemy in Loanwords
  6.6 Extraction of Loanwords from Japanese Text
  6.7 Segmentation and MWE Translation
  6.8 Evaluation
    6.8.1 Segmentation
    6.8.2 Translation
  6.9 Summary, Discussion and Possible Improvements
    6.9.1 Online CLST Interface

7 Neologism Synthesis
  7.1 Introduction
  7.2 Prior Work
  7.3 Approaches to Neologism Synthesis
  7.4 Resources
  7.5 Evaluation of Synthesized Compounds
    7.5.1 Initial Investigations
    7.5.2 Classification of Synthesized Compounds
    7.5.3 Initial Testing
    7.5.4 Initial Testing - Results
    7.5.5 Initial Testing - Discussion
  7.6 Morpheme-based Abbreviation Construction
    7.6.1 Background
    7.6.2 Initial Testing
    7.6.3 Extended testing
    7.6.4 Discussion
  7.7 Affixation
  7.8 Generalized Compound Creation - 2-kanji Compounds
    7.8.1 Introduction
    7.8.2 Initial Tests
    7.8.3 Extended Investigations
    7.8.4 Discussion of 2-kanji Synthesis and Evaluation
  7.9 Generalized Compound Creation - 4-kanji Compounds
    7.9.1 Introduction
    7.9.2 Compound Synthesis
    7.9.3 Initial Investigation
    7.9.4 Development of Training Data
    7.9.5 Testing
    7.9.6 Discussion of 4-kanji Synthesis and Evaluation
  7.10 Summary, Discussion and Conclusions
    7.10.1 Future Work

8 Generation and Extraction of Compound Verbs
  8.1 Introduction
  8.2 Overview of Japanese Compound Verbs
  8.3 Prior Work
  8.4 General Approach and Resources
    8.4.1 Approaches
    8.4.2 Resources Used
    8.4.3 Synthesis of Compound Verbs
    8.4.4 Direct n-gram Search
  8.5 Analysis of the Potential Compound Verbs
  8.6 Productivity Measures of V1 and V2 Components
  8.7 Summary, Conclusions and Future Work

9 Neologism Identification through Language Contexts
  9.1 Introduction
  9.2 Prior Work
  9.3 Text Corpora
  9.4 Initial Exploration