Russian Natural Language Processing for Computer
Faculty of Humanities, Social Sciences, and Education

Russian natural language processing for computer-assisted language learning
Capturing the benefits of deep morphological analysis in real-life applications

Robert J. Reynolds

A dissertation presented for the degree of Philosophiae Doctor (PhD)
Faculty of Humanities, Social Sciences, and Education
UiT: The Arctic University of Norway
Norway
February 4, 2016

© 2016 by Robert Joshua Reynolds. All rights reserved.
Printed by Tromsprodukt AS, Tromsø, Norway
ISSN 0000-0000
ISBN 000-00-0000000-0

To Rachael

Contents

List of Figures
List of Tables
Abstract
Acknowledgements
Preface

1 Introduction
    1.1 Introduction
    1.2 Structure of the dissertation

I Linguistic analysis and computational linguistic methods

2 A new finite-state morphological analyzer of Russian
    2.1 Introduction
    2.2 Background of Russian part-of-speech tagging
    2.3 UDAR
        2.3.1 Lexc and Twolc
        2.3.2 Structure of nominals: lexc and twolc
        2.3.3 Structure of verbs: lexc and twolc
        2.3.4 Morphosyntactic tags
        2.3.5 Flavors of the FST
    2.4 Evaluation
        2.4.1 Coverage
        2.4.2 Speed
    2.5 Potential applications
    2.6 Conclusions and future work

3 Morphosyntactic disambiguation and dependency annotation
    3.1 Introduction
    3.2 Related work
    3.3 Ambiguity in Russian
    3.4 Analysis pipeline
        3.4.1 Morphological analyzer
        3.4.2 Disambiguation rules
    3.5 Development process
    3.6 Evaluation
        3.6.1 Corpus
        3.6.2 Qualitative evaluation
        3.6.3 Task-based evaluation
        3.6.4 Combining with a statistical tagger
    3.7 Conclusions and Outlook

II Applications of the analyzer in language learning

4 Automatic stress placement in unrestricted text
    4.1 Introduction
        4.1.1 Background and task definition
        4.1.2 Stress corpus
    4.2 Automatic stress placement
    4.3 Results
    4.4 Discussion
    4.5 Conclusions

5 Visual Input Enhancement of the Web
    5.1 Introduction
    5.2 Key topics for Russian learners
        5.2.1 Noun declension
        5.2.2 Stress
        5.2.3 Aspect
        5.2.4 Participles
    5.3 Feedback
    5.4 Conclusions and Outlook

6 Automatic classification of document readability on the basis of morphological analysis
    6.1 Introduction
    6.2 Background
        6.2.1 History of evaluating text complexity
        6.2.2 Automatic readability assessment of Russian texts
    6.3 Corpora
        6.3.1 CIE corpus
        6.3.2 news corpus
        6.3.3 LingQ corpus
        6.3.4 Red Kalinka corpus (RK)
        6.3.5 TORFL corpus
        6.3.6 Zlatoust corpus (Zlat.)
        6.3.7 Summary and the Combined corpus (Comb.)
    6.4 Features
        6.4.1 Lexical features (LEX)
        6.4.2 Morphological features (MORPH)
        6.4.3 Syntactic features (SYNT)
        6.4.4 Discourse/content features (DISC)
        6.4.5 Summary of features
    6.5 Results
        6.5.1 Corpus evaluation
        6.5.2 Binary classifiers
    6.6 Feature evaluation
        6.6.1 Feature evaluation with binary classifiers
    6.7 Conclusions and Outlook

7 Conclusions and outlook
    7.1 Summary
    7.2 Resources
        7.2.1 NLP tools
        7.2.2 Corpora
        7.2.3 Language-learning tools
    7.3 Outlook
    7.4 Conclusion

References

List of Figures

2.1 Finite-state transducer network
3.1 Example output from the morphological analyzer and constraint grammar
3.2 Constraint grammar rules relevant to Figure 3.1
3.3 Learning curve for three tagging setups: hunpos with no lexicon; hunpos with a lexicon; and hunpos with a lexicon and the Russian constraint grammar in a voting set up.
6.1 Distribution of document length in words
6.2 Learning curves of binary classifiers trained on LQsupp subcorpus

List of Tables

1 Comparison of Scholarly and ISO9 transliteration systems
2.1 Comparison of existing Russian morphological analyzers. FOSS = free and open-source software; gen. = can generate wordforms; disamb. = can disambiguate wordforms with more than one reading based on sentential context
2.2 Two nouns of the same declension class with different stem palatalization. The underlying lexc forms are in parentheses.
2.3 Upper- and lower-side correspondences for nominal palatalization
2.4 Upper- and lower-side correspondences for ‘spelling rules’
2.5 Upper- and lower-side correspondences for fleeting vowels
2.6 Upper- and lower-side correspondences for fleeting vowels in the lexeme kopejka ‘kopeck’
2.7 Upper- and lower-side correspondences for fleeting vowels in the lexeme lëd ‘ice’
2.8 Upper- and lower-side correspondences for fleeting vowels in yod stems, such as muravej ‘ant’ and kop'ë ‘spear’
2.9 Upper- and lower-side correspondences for e-inflection in i-stems: kafeterij ‘cafeteria’, povtorenie ‘repetition’, and Rossiâ ‘Russia’
2.10 Shifting stress pattern of ruka ‘hand’
2.11 Upper- and lower-side correspondences for word stress, example word sestra ‘sister’
2.12 Upper- and lower-side correspondences for genitive plural inflection -ov/-ëv/-ev
2.13 Upper- and lower-side correspondences for genitive plural zero ending
2.14 Upper- and lower-side correspondences for genitive plural zero ending for bašnâ ‘tower’, sem'â ‘family’, and sekvojâ ‘sequoia’
2.15 Upper- and lower-side correspondences for genitive plural -ej ending for kon' ‘horse’, matč ‘match’, levša ‘left-hander’, and more ‘sea’
2.16 Upper- and lower-side correspondences for comparative adjectives
2.17 Upper- and lower-side correspondences for comparatives with prefix po-
2.18 Upper- and lower-side correspondences for masculine short-form adjectives
2.19 Upper- and lower-side correspondences for verbal stem mutations
2.20 Upper- and lower-side correspondences for past passive participle stem alternations to -žd-
2.21 Upper- and lower-side correspondences for verbal stem mutations of moč' and peč'
2.22 Upper- and lower-side correspondences for û and â in verbal endings
2.23 Upper- and lower-side correspondences for u in verbal endings
2.24 Realization of imperative endings
2.25 Upper- and lower-side correspondences for imperatives with the stressed ending S
2.26 Upper- and lower-side correspondences for imperatives with the unstressed ending U: duj, vypej, and lâg
2.27 Upper- and lower-side correspondences for imperatives with the unstressed ending U: pómni, mórŝi, výbegi, otvét'
2.28 Upper- and lower-side correspondences for imperatives with the reflexive suffixes
2.29 Upper- and lower-side correspondences for fleeting vowels in verbal prefixes
2.30 Upper- and lower-side correspondences for devoicing of z in verbal prefixes
2.31 Part-of-speech tags used in UDAR
2.32 Sub-part-of-speech tags used in UDAR
2.33 Nominal tags used in UDAR
2.34 Verbal morphosyntactic tags used in UDAR
2.35 Coverage of wikipedia lexicon by UDAR and mystem3.
2.36 Speed comparison processing the OpenCorpora lexicon list (5 484 696 tokens)
3.1 Frequency of different types of morphosyntactic ambiguity in unrestricted text
3.2 The distribution of rules in reliability categories and syntactic role labeling.
3.3 Results for the test corpora
4.1 Example output of each stress placement approach, given a particular set of readings for the token kosti
4.2 Results of stress placement task evaluation
5.1 Results of the corpus study of lexical cues for aspect
6.1 Contributions of LingQ ‘expert’ Russian contributors
6.2 LingQ subcorpora distribution of documents by level
6.3 LingQ subcorpora distribution of words per document by level
6.4 Distribution of documents per level for each corpus
6.5 Average words per document for each level of each corpus
6.6 Lexical variability features (LEXV)
6.7 Lexical complexity features (LEXC)
6.8 Lexical familiarity features (LEXF)
6.9 Morphological features (MORPH)
6.10 Features calculated on the basis of sentence length (SENT)
6.11 Syntactic features (SYNT)
6.12 Discourse.