
Crowdsourcing lexical semantic judgements from bilingual users

A thesis presented by

Richard James Fothergill http://orcid.org/0000-0002-8140-2743

for the degree of Doctor of Philosophy

in November 2017

to the School of Computing and Information Systems in total fulfilment of the requirements of the degree

The University of Melbourne
Melbourne, Australia

Thesis advisor: Timothy Baldwin

Author: Richard James Fothergill

Crowdsourcing lexical semantic judgements from bilingual users

Abstract

Words can take on many meanings, and collecting and identifying example usages representative of the full variety of meanings a word can take is a bottleneck to the study of word meaning using statistical approaches. To perform supervised word sense disambiguation (WSD), or to evaluate knowledge-based methods, a corpus of texts annotated with senses from a dictionary may be constructed by paid experts. However, the cost usually prohibits more than a small sample of words and senses being represented in the corpus. Crowdsourcing methods promise to acquire data more cheaply, albeit with a greater challenge for quality control. Most crowdsourcing to date has incentivised participation in the form of a payment or by gamification of the resource construction task. However, with paid crowdsourcing the cost of human labour scales linearly with the output size, and while game playing volunteers may be free, gamification studies must compete with a multi-billion dollar games industry for players. In this thesis we develop and evaluate resources for computational semantics, working towards a crowdsourcing method that extracts information from naturally occurring human activities. A number of software products exist for glossing Japanese text with entries from a dictionary for English speaking students. However, the most popular ones have a tendency to either present an overwhelming amount of information containing every sense of every word or else hide too much information and risk removing senses with particular relevance to a specific text. By offering a glossing application with interactive features for exploring word senses, we create an opportunity to crowdsource human judgements about word senses and record human interaction with semantic NLP.

Declaration

This is to certify that:

(i) the thesis comprises only my original work towards the PhD except where indicated in the Preface;

(ii) due acknowledgement has been made in the text to all other material used;

(iii) the thesis is fewer than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices.

Signed: Date: 13/11/2017

© 2017 Richard James Fothergill

All rights reserved.

Preface

This thesis includes material adapted from several publications and additional new material. Chapter 5 of this thesis has been adapted from the following two publications:

RICHARD FOTHERGILL, and TIMOTHY BALDWIN. 2011. Fleshing it out: A supervised approach to MWE-token and MWE-type classification. In Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand.

RICHARD FOTHERGILL, and TIMOTHY BALDWIN. 2012. Combining resources for MWE-token classification. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Montreal, Canada.

I carried out the research described in these papers, wrote the first draft of each paper and was the primary author, making changes based on consultation with the other listed author.

Parts of Chapter 6 were adapted from Sections 3.2 (Statistics) and 3.3 (Distribution) of the following publication:

FRANCIS BOND, TIMOTHY BALDWIN, RICHARD FOTHERGILL, and KIYOTAKA UCHIMOTO. 2012. Japanese SemCor: A sense-tagged corpus of Japanese. In Proceedings of the 6th International Conference of the Global WordNet Association (GWC), Matsue, Japan.

I was the author of the adapted sections and carried out the work described therein independently. In particular, I was responsible for the word alignment and sense transfer between English SemCor and Japanese SemCor. The remaining sections of the paper — particularly the project to translate English SemCor into Japanese — are wholly the work of the other authors of the paper.

This work was made possible thanks to the funding provided by the Henry James Williams Scholarship administered by the University of Melbourne Scholarships Office.

Acknowledgements

I owe a great deal of thanks to many people for their roles in my life while I worked on this PhD. The list can never be complete but here I express my heartfelt gratitude to:

• My family — Mum, Dad, Joanne, Linda, Grandma and many more too numerous to list — for being there whenever I needed you but most of all for being understanding whenever I disappeared into my work. This goes for my all-but-family Lorand and Anna as well.

• Michael, Bernie, Marco and the rest of the team at Rome2rio for your support and friendship on the final stretches of this project in particular.

• Ele and Amelia likewise for being great housemates in that time; and before that Ned, Hannah and Bianca.

• Ned, Nikki, Florian and Mel for your solidarity in our long cafe-fuelled writing sessions.

• Suganya for keeping me connected to a social life when I needed it the most.

• Many friends, including but not limited to Heather, Rob, Tim, Nyssa, Iain for the energy to keep going that I took from your warm companionship.

Special thanks goes to Etsuko Toyoda for facilitating demonstration of the glossing software essential to this thesis in Japanese classes at Melbourne University, and for your enthusiasm towards integrating technology with the curriculum.

I am deeply indebted to Jim Breen: your years of work on the Japanese Multilingual Dictionary project made this thesis possible and your WWWJDIC: Online Japanese Dictionary Service was essential to my study of the Japanese language and a major inspiration for the thesis.

Most of all I would like to thank my supervisor, Tim Baldwin, for his years of dedication beyond the call of duty and for sharing his truly formidable knowledge of the field and the research process.

Contents

Title Page ...... i
Abstract ...... ii
Table of Contents ...... vi
List of Tables ...... x
List of Figures ...... xii

I Background 2

1 Introduction 3
1.1 Motivation ...... 6
1.2 Contributions ...... 10
1.3 Outline ...... 14

2 Literature Review 16
2.1 Computational lexical semantics ...... 17
2.1.1 Introducing word sense disambiguation ...... 17
2.1.2 Other approaches to computational lexical semantics ...... 24
2.1.3 Multiword expressions ...... 26
2.2 Word sense disambiguation ...... 30
2.2.1 WSD research community ...... 31
2.2.2 Classification of WSD research ...... 33
2.2.3 Evaluation of WSD ...... 36
2.2.4 Contemporary challenges for WSD research ...... 39
2.2.5 Recent word sense research ...... 42
2.2.6 A review of PageRank for word sense disambiguation ...... 44
2.3 Multiword expressions ...... 56
2.3.1 Type extraction ...... 56
2.3.2 Compositionality of multiword expression types ...... 58
2.3.3 Token identification ...... 60
2.4 Human computation ...... 67
2.4.1 Crowdsourcing lexical semantic annotations ...... 69
2.4.2 Quality control in crowdsourcing ...... 72
2.5 Summary ...... 79


3 Resources Review 80
3.1 Linguistic concepts useful for text processing ...... 80
3.1.1 Part of Speech ...... 81
3.1.2 Morphology ...... 83
3.1.3 Word order ...... 84
3.1.4 Dependency grammars and parsing ...... 89
3.2 Japanese natural language processing ...... 92
3.2.1 Available tokenisers and parsers ...... 94
3.2.2 Dependency parsing and the Japanese language ...... 96
3.3 Princeton WordNet ...... 97
3.3.1 SemCor ...... 102
3.3.2 Other wordnets ...... 104
3.3.3 Translations of SemCor ...... 106
3.4 Japanese MWE resources ...... 108
3.4.1 The OpenMWE corpus ...... 108
3.4.2 JDMWE ...... 109

II Glossing 110

4 The Wakaran glossing tool 111
4.1 Introduction to the Wakaran Glossing tool ...... 112
4.2 Related software ...... 116
4.2.1 WWWJDIC ...... 117
4.2.2 Rikai.com and Rikai-chan ...... 117
4.2.3 Furigana Inserter ...... 119
4.2.4 The Sharp Intelligent Dictionary ...... 120
4.3 User base ...... 120
4.4 Design considerations ...... 121
4.4.1 Word glossing in computer assisted language learning ...... 122
4.5 System design ...... 126
4.5.1 Data model ...... 127
4.5.2 Architecture ...... 130
4.5.3 Control flow ...... 130
4.6 Evaluation ...... 137
4.6.1 User feedback ...... 137
4.6.2 Instrumentation ...... 141
4.6.3 Usage by day ...... 142
4.6.4 Usage across a session ...... 145
4.6.5 Clustering usage patterns ...... 149
4.7 Conclusion ...... 162

III Disambiguation 165

5 Multiword Expressions 166
5.1 Flavours of token candidate disambiguation ...... 168
5.1.1 Qualities ...... 170
5.1.2 Type specialised disambiguation ...... 171
5.1.3 Type generalised disambiguation ...... 171
5.1.4 Cross-type disambiguation ...... 173
5.1.5 Additional resources ...... 175
5.1.6 Related work ...... 175
5.2 Feature extraction ...... 177
5.2.1 Preprocessing ...... 177
5.2.2 Characteristic features ...... 180
5.2.3 Context features ...... 184
5.2.4 Type features ...... 186
5.3 OpenMWE corpus experiments ...... 188
5.3.1 Results ...... 189
5.3.2 Discussion ...... 199
5.4 JDMWE experiments ...... 200
5.4.1 Results ...... 200
5.4.2 Another look at WSD features ...... 208
5.4.3 Discussion ...... 209
5.5 Conclusion ...... 210
5.5.1 Implications for Wakaran ...... 211

6 Connecting lexical resources to free text 213
6.1 Related work ...... 215
6.2 Re-tokenising text ...... 219
6.2.1 Token candidate generation ...... 220
6.2.2 Lexicon mapping ...... 225
6.2.3 Candidate token disambiguation ...... 227
6.3 Tokenisation of Japanese SemCor ...... 232
6.3.1 Lexicon mapping for Japanese ...... 234
6.3.2 Candidate token disambiguation for Japanese SemCor ...... 236
6.3.3 WSD for Japanese SemCor ...... 238
6.3.4 Results ...... 239
6.3.5 Distribution ...... 242
6.4 Retokenisation in the Wakaran Glosser ...... 242
6.4.1 Retokenisation for JMDict ...... 243
6.4.2 Eliminating over-generation and filtering ...... 244
6.5 WSD with JWordNet ...... 247
6.5.1 WSD for Japanese WordNet ...... 248
6.5.2 Evaluation with JSemcor ...... 250
6.6 Conclusion ...... 266
6.A Permutation example for overlap p-value computation ...... 268

IV Computational semantics and the Wakaran Glosser 270

7 Word Sense Disambiguation in Wakaran 271
7.1 Related work ...... 272
7.1.1 WSD enabled glossing software ...... 273
7.1.2 Crowdsourcing sense labels ...... 279
7.2 Integrating WSD with Wakaran ...... 281
7.2.1 PPR sense embedding for JMDict ...... 281
7.2.2 Disambiguating JMDict words in context ...... 283
7.3 A corpus of Wakaran submissions ...... 287
7.3.1 Exporting and annotating the corpus ...... 287
7.3.2 Near duplicate and subdocument detection ...... 291
7.3.3 Distribution of user selections vs. automatic selections ...... 302
7.4 Evaluation of word sense disambiguation algorithms ...... 304
7.5 Discussion ...... 308
7.5.1 Quality control implications ...... 309
7.6 Conclusion ...... 315

8 Conclusion 317
8.1 Contributions ...... 318
8.2 Future work ...... 321
8.3 Final remarks ...... 327

List of Tables

4.1 Survey responses on the number of times the respondent had used Wakaran ...... 138
4.2 A sequence of survey questions rating usefulness of Wakaran on a Likert 1 to 5 scale ...... 138
4.3 A sequence of survey questions regarding Wakaran allowing selection of multiple responses each ...... 140

5.1 Results of combining different features for cross-type classification. “Basic” results restrict the idiom and WSD features to those of Hashimoto and Kawahara (2009); “extended” results include our extensions ...... 190
5.2 Results for classification of MWE tokens of MWE types seen in the training corpus. As in Table 5.1, “Basic” results restrict the idiom and WSD features to those of Hashimoto and Kawahara (2009) ...... 191
5.3 Results of combining different features for type-specialised classifiers. “Basic” results restrict the idiom and WSD features to those of Hashimoto and Kawahara (2009); “extended” results include our extensions ...... 192

6.1 Corpus Size ...... 240
6.2 Part of speech distribution for tokens without word alignment ...... 241
6.3 Performance of Personalised PageRank over JWordNet under several configurations and evaluation metrics. The best performing configuration under each metric is indicated in bold. The difference between any scores joined by snaking lines was not found to be statistically significant (by a t-test or paired t-test, LKB coverage permitting). All other differences in scores are statistically significant (p < 0.05) ...... 254
6.4 Summary of tokens for which WSD^DOC_WN++ gives a better annotation than WSD^DOC_BBN ...... 258
6.5 Distinctive words in documents where WSD^DOC_BBN performs worse than WSD^DOC_WN++. In all cases words are ranked by tf-idf, with term frequencies counted across a single document or a subset of the corpus and document frequencies calculated across the whole corpus ...... 259
6.6 Superiority overlap for various configuration parameters. Errors introduced by switching from WN++ to BBN are in much less consistent locations than switching from either BBN to WN+gloss or switching from DOC to PARA ...... 265

7.1 Part of speech counts of the corpus after filtering out short contexts and the COS-WSD subcorpus...... 296


7.2 Part of speech counts at each stage of the deduplication procedure: before duplicate removal, after duplicate removal and after subdocument elimination ...... 297
7.3 Confusion table of senses selected by users while automatic WSD was not operational, and the sense that would have been selected had WSD been enabled at that time ...... 302
7.4 Confusion table of senses selected by users while automatic WSD was operational, and the sense that was automatically selected in the interface by the WSD algorithm ...... 303
7.5 Results for evaluation of all WSD configurations on the section of the Wakaran corpus for which automatic WSD was not enabled ...... 305
7.6 Results for evaluation of all WSD configurations (excluding the first sense baseline) on the section of the Wakaran corpus for which automatic WSD was not enabled ...... 306
7.7 Results for evaluation of all WSD configurations on the section of the Wakaran corpus for which automatic WSD was enabled ...... 307

List of Figures

1.1 The Wakaran Glosser in action. Text from Wikipedia (2016)...... 9

2.1 Comparison of results using an evaluation metric. Result A and result B are worlds apart: one lies near a trivial baseline while the other approaches human performance at the task. However, without reference points they can only be said to differ by around 30%...... 38

3.1 A simple grammar for a very restricted subset of the English language (Manning and Schütze 1999) ...... 87
3.2 Derivation of The children ate the cake using the grammar given in Table 3.1 (Manning and Schütze 1999) ...... 88
3.3 The parse of the sentence the children ate the cake produced by the Stanford Parser (Klein and Manning 2002), generated using http://nlp.stanford.edu:8080/parser ...... 88
3.4 A tree diagram representing the parse shown in Figure 3.3 ...... 89
3.5 Two representations of an example dependency parse for the rapidly tiring yellow cat ate the mouse. The first representation emphasises the tree structure of the parse whereas the second captures the word order ...... 90
3.6 Possible dependency parses for the sentence What did you eat? ...... 92
3.7 Possible segmentations for the sentence Example (20). Pronunciations, translations and explanatory notes are a guide, not part of the output of the respective morphological analysers ...... 93
3.8 An example KNP dependency analysis ...... 95
3.9 The synsets of the word sleep in WordNet 3.1. Retrieved from http://wordnetweb.princeton.edu/perl/webwn ...... 98
3.10 An abbreviated example ConText file from SemCor ...... 104

4.1 The Wakaran Glosser in action: Section 4.1 introduces and explains the features illustrated here. Text from Wikipedia (2016) ...... 112
4.2 This first view the user gets of a document submission contains only the original text with highlights on text where definitions are available. Text from Wikipedia (2016) ...... 113
4.3 Hovering the mouse over a highlighted token shows glosses for all compounds that token takes part in. Text from Wikipedia (2016) ...... 114


4.4 The Wakaran wordlist page containing all entries with selected or promoted translations. Text from Wikipedia (2016) ...... 116
4.5 An example of the presentation of JMDict entries retrieved by WWWJDIC (Breen 2016) for free text. Text from Wikipedia (2016) ...... 118
4.6 An example of the presentation of JMDict entries retrieved by the Firefox browser extension Rikai-chan. Text from Wikipedia (2016) ...... 119
4.7 An example of the insertion of furigana readings and presentation of JMDict entries in a pop-up retrieved by the Firefox browser extension Furigana Inserter. Text from Wikipedia (2016) ...... 119
4.8 UML class diagram of Wakaran data model. This figure shows only the entities and attributes supporting the main use case of user interaction with translational paraphrases ...... 128
4.9 UML component diagram of the system architecture of Wakaran ...... 130
4.10 UML sequence diagram of interactions between the Wakaran HTTP handlers and other components in the period of time between the submission of text and the first display of glossed text content for the user ...... 131
4.11 UML sequence diagram of interactions between the Wakaran HTTP handlers and other components in the period of time between the submission of text and the first display of glossed text content for the user ...... 132
4.12 UML sequence diagram of interactions between the Wakaran batch worker processes and other components in the period of time between the appearance of a new document on the AMQP queue and the persistence of the annotated document to the SQL database ...... 134
4.13 UML sequence diagram of interactions between the user’s web browser and the Wakaran HTTP handlers as the user selects and promotes paraphrases in dictionary entries linked to a submission ...... 136
4.14 Heatmap showing distribution of new document submissions throughout the year ...... 143
4.15 Heatmap showing distribution of new document submissions across the week during semester 1 2013 ...... 144
4.16 Plotting document position vs time for interaction logs ...... 146
4.17 Plotting position versus time for events logged on a variety of submissions ...... 147
4.18 Reading a document in two passes ...... 149
4.19 Heatmap showing the number of documents with a given count for a given feature ...... 151
4.20 Feature distribution and sample cluster members for three clusters based on event counts features. Cluster members are represented by position-time plots using the same key for event type as Figure 4.18 ...... 152
4.21 Comparing timeseries with the corresponding autocorrelation plots ...... 156
4.22 Heatmap showing the number of documents with a given count for a given autocorrelation feature ...... 159
4.23 Feature distribution and sample cluster members for three clusters based on autocorrelation features. Cluster members are represented by position-time plots using the same key for event type as Figure 4.18 ...... 160

5.1 Illustration of cross validation partitioning strategies used to evaluate different classification schemes ...... 169

5.2 English summary of the select parts of the output of KNP when run on the sentence in Example (23). Pronunciation and translation guides have been added for the reader and are not part of the output ...... 178
5.3 A sample of the WSD inspired features extracted for Example (25) ...... 187
5.4 Accuracies for type-specialised classifiers with capped training set sizes ...... 193
5.5 Cross-type classification accuracy using JDMWE type-level features and lexical type-level features in combination with various token-level features ...... 201
5.6 Looking separately at recall of idiomatic and literal instances in the corpus under various combinations of feature families. Classifiers with no features at all are the type specialised majority class baseline which is literal for some types and idiomatic for others ...... 209

6.1 Illustration of stages in RETOKENISE execution on an example from the start of sentence 22 in JSemcor document br-f03 ...... 222
6.2 Interface used by translators to develop Japanese translation of SemCor text (Bond et al. (2012), Figure 2, page 3) ...... 233
6.3 Sample KAF record for Scotty ha gakkō ni modora nakat ta., the Japanese translation of English sentence Scotty did not go back to school ...... 269

7.1 Visualisation of one of the larger clusters of contexts in a containment relationship. Token alignment was performed with a recursive longest common substring method ...... 298

Part I

Background

Chapter 1

Introduction

When we look up a word in a dictionary we often find that the dictionary author — the lexicographer — has laid out several distinct interpretations which are called senses of the word.

For instance, here are several senses of the English word device given by the online dictionary Wiktionary (Wiktionary 2017a):

device1 Any piece of equipment made for a particular purpose, especially a mechanical or electrical one.

device3 A project or scheme, often designed to deceive; a stratagem; an artifice.

device6 A motto, emblem, or other mark used to distinguish the bearer from others.

For brevity, we have extracted only the first, third and sixth senses which are the most general, and labelled them device1, device3 and device6 accordingly. Certain senses may be more pertinent than others in a given context of usage of that word. For instance, in Example (1) from the entry for Mobile Phones in the online encyclopedia Wikipedia (Wikipedia 2017) the most relevant sense is device1:

(1) The race to create truly portable telephone devices began after World War II, with developments taking place in many countries.


Whereas Example (2), taken from a description of the official flags of the Red Cross and Red Crescent movement (CRW Flags 2017) (who do humanitarian work in war zones), is closest to sense device6:

(2) The red crescent or red lion and sun, in lieu of the red cross, will be recognised for use by countries that were already using those devices when the 1949 convention was adopted.

In computer programs that process human language — natural language processing (NLP) software — it can be desirable to narrow down the meaning of a word usage by identifying the most relevant sense to the situation. The act of identifying the meaning of a word with reference to a list of senses is generally called word sense disambiguation (WSD) (Navigli 2009).

There is not always only a single relevant sense. In Example (3) from a description of rules surrounding the use of the cross and crescent emblems (ICRC blog 2017) arguments could be made for the relevance of more than one sense:

(3) Misuse of the emblem as a protective device in time of war jeopardises the entire protective system established by humanitarian law.

This admits several interpretations with varying degrees of plausibility: (a) Misuse of the emblem as a protective artifice (device3), (b) Misuse of the emblem as a protective emblem (device6) and (c) Misuse of the emblem as a protective mechanism (device1, metaphorical). Although it is common for WSD research to focus on identifying a single sense for each usage of a word, there also exist graded variants of WSD aimed at representing degrees of relevance that assign a numeric similarity or probability to every sense for a single usage (Erk and McCarthy 2009).

Other complications arise for assigning senses to words when they appear in fixed phrases. For example, Wiktionary has a separate entry for the phrase plot device (Wiktionary 2017b):

plot device1 An element introduced into a story, film, etc. to advance its plot.

Consider Example (4) of device in the introduction to Anonymous (2014):

(4) The Arabian Nights vary greatly and some include elements of haunting, which is a common plot device used in gothic fiction and horror fiction.

The expression plot device requires a separate explanation in the dictionary because it has a particular meaning and occurs in a preferred form. For instance, in Example (4) we could potentially substitute plot mechanism but definitely not the seemingly equivalent plot machine. Such distinctive phrases are called multiword expressions (MWEs) and may need to be treated as distinct linguistic units in NLP systems (Sag et al. 2002; Ramisch 2015) and indeed as distinct words in a lexicon (Bauer 1983). However, distinguishing them computationally from plain uses of their constituent words remains a challenging problem!

Bilingual dictionaries also commonly divide the meaning of a word into distinct senses, described in the language they are translated into. For instance, consider the following definitions for the Japanese word 仕掛け “shikake” given in the free Japanese English dictionary JMDict (Breen 2004):

shikake1 device; contrivance; mechanism; gadget;

shikake2 trick; trap; calculated manipulation; gambit;

shikake3 (small) scale; half finished;

shikake4 commencement; initiation;

shikake5 set up; laying out; preparation;

shikake6 challenge; attack;

We note that device is given as a translation of the first sense of shikake and that the second sense is very similar to device3, so these two words appear to have meanings in common. However, the remaining four senses of shikake are not senses of the English device at all, being concerned with smallness, infancy and aggression.

In this thesis we study computational models of relatedness between dictionary senses and word usages in text. Our work covers monolingual and cross-lingual word sense disambiguation, and identification of usages of MWEs in arbitrary digital texts. Together these topics form a major part of the field of computational lexical semantics (although they are by no means exhaustive of it). A major theme of this thesis is resource usage, particularly when it comes to large collections of text — corpus in the singular, corpora in the plural — and sources of word senses such as learner dictionaries and specialised linguistic ontologies. Over the course of the thesis we develop an intelligent dictionary application, the Wakaran Glosser (“Wakaran”), designed for students of Japanese as a second language (L2), which has interactive features for controlling the presentation of word senses. Wakaran is designed with two primary qualities in mind:

1. That usage of its interactive features should produce a data trail by which we can gain insight into human intuitions of the relevance of word senses to text.

2. That users should be motivated to use the application and its interactive features for their own information needs, without a dependency on additional costly motivation factors.

The confirmation of these two qualities constitutes the major findings of this thesis.

1.1 Motivation

A common and perhaps indispensable practice in the computational study of language is to compile a reference solution to the problem being tackled — a gold standard corpus — hand labelled by human annotators or by other means. When compared with other sub-fields of computational semantics, lexical semantics and WSD in particular carries a high cost in acquiring resources with sufficient data for statistical analysis. Classification in the field of computational semantics covers a broad spectrum of granularities of abstraction and word senses are one of the most finely divided. At one end of the spectrum lies sentiment analysis, which seeks to classify texts according to whether they convey a positive or negative attitude towards the subject matter. For instance, in user reviews for a mobile phone one might want to automatically identify the users who liked the phone and the ones who did not. The simplest form of this task has just two classification labels for sentiment: positive and negative. A multiplicity of labels are found in other tasks such as named entity recognition (NER), where individual nouns in text are identified as representing people, locations, organisations and sometimes more such as date and time. In this case the number of labels is small but non-trivial and individual words are labelled, so construction of a gold standard corpus is relatively more costly. By comparison, word sense disambiguation, being faced with one or more senses for every word in the vocabulary, uses a label inventory potentially larger than the lexicon itself. Additionally, nearly every word in a piece of text will have sense labels applicable to it. The challenge of collecting examples of wordform usage that cover a representative variety of situations in which each sense is used is particularly high (Màrquez et al. 2007; Artstein and Poesio 2008; Passonneau and Carpenter 2014; López de Lacalle and Agirre 2015a). For example, a straightforward application of supervised machine learning to the task might model the probability distribution of words co-occurring with each distinct sense of a wordform. However, building an accurate model of the distribution would require many textual examples of usages for each sense of each word in a lexicon. The cost is a bottleneck for expansion of many studies beyond a small sample of words from a lexicon and to the realisation of WSD as a component in NLP applications. Some approaches to computational semantics bypass the corpus annotation bottleneck. For instance, unsupervised machine learning techniques can be used to discover patterns in word usage and map them onto existing sense inventories. A family of methods that have access to dictionaries or ontologies — collectively: lexical knowledge bases (LKBs) — are generally called knowledge-based methods. We explore a knowledge-based method, Personalised PageRank (PPR), in this thesis. Such methods have rarely been found to exceed supervised methods in their accuracy, although they tend to have the advantage of being able to handle larger vocabularies. However, evaluation of unsupervised and knowledge-based methods is generally best performed with at least a sample of gold standard data, so the problem of knowledge acquisition cannot be avoided entirely.
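Personalised PageRank is treated in detail in Chapter 6; as a quick orientation, the following is a minimal sketch of the computation on a toy LKB graph. The function, the toy graph and the restart weights are all illustrative assumptions introduced here, not code or data from this thesis.

```python
# Minimal sketch of Personalised PageRank (PPR) over a toy LKB graph.
# The graph maps each sense node to its neighbours; the personalisation
# (restart) vector concentrates probability mass on nodes related to the
# words observed in the context. All names here are illustrative.

def personalised_pagerank(graph, restart, damping=0.85, iterations=30):
    """graph: {node: [neighbour, ...]}; restart: {node: weight} summing to 1."""
    rank = dict(restart)  # start from the personalisation vector
    for _ in range(iterations):
        new_rank = {node: (1 - damping) * restart.get(node, 0.0) for node in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                continue
            share = damping * rank.get(node, 0.0) / len(neighbours)
            for neighbour in neighbours:
                new_rank[neighbour] = new_rank.get(neighbour, 0.0) + share
        rank = new_rank
    return rank

# Toy LKB: two senses of "ash" linked to related concepts.
graph = {
    "ash#tree": ["tree", "wood"], "ash#residue": ["fire"],
    "tree": ["ash#tree"], "wood": ["ash#tree"], "fire": ["ash#residue"],
}
# A real system would seed restart mass on the senses of the context words;
# here we seed two related concept nodes directly for brevity.
restart = {"tree": 0.5, "wood": 0.5}
ranks = personalised_pagerank(graph, restart)
print(max(["ash#tree", "ash#residue"], key=ranks.get))  # -> ash#tree
```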

WSD has long been held to be AI-complete — a necessary solution in the development of a true artificial intelligence — and extrinsic evaluations are long sought after, but there are as yet few examples of WSD making a substantial difference to the performance of an application. Crowdsourcing methods promise to acquire data more cheaply than construction of semantically annotated corpora through professional annotation, albeit with a greater challenge for quality control. Most crowdsourcing efforts at gathering word sense annotations to date have incentivised participation in the form of a payment (Snow et al. 2008; Rumshisky et al. 2012; Jurgens 2013; Passonneau and Carpenter 2014) or gamification of the resource construction task (Venhuizen et al. 2013; Seemakurty et al. 2010; Jurgens and Navigli 2014). Crowdsourcing payments can be cheaper than paying professionals but data collection still scales linearly with cost. Turning the task into a game can attract free participants but the study then has to compete with a multi-billion dollar games industry for players. In this thesis we pursue a different approach: crowdsourcing human judgements on lexical semantics from naturally arising human activities.

The aim of this thesis is to address the knowledge acquisition bottleneck by designing and evaluating apparatus and methodology for collecting data that addresses core research questions from the point where language students interact with a dictionary in pursuit of information about word meanings. The fulfilment of this aim will enable collection of data at only the cost of public promotion and software delivery, which are inherent costs of any crowdsourcing method. In specifics, the aim is to build a text glossing application which enables extrinsic evaluations of computational lexical semantic theory including WSD. A gloss is an explanatory note attached to a word in a piece of text. The text glossing application implemented for this thesis, Wakaran, is illustrated in Figure 1.1. It accepts submissions of arbitrary Japanese texts and attaches Japanese-English dictionary entries — glosses — to words in the text. The glosses are accessed by hovering the mouse pointer over a word in the text, which causes a pop-up to show containing the gloss(es) for that word. In the illustrated case two glosses are shown: one word matching the single Kanji character the mouse has hovered over and one matching a longer compound the character forms a part of. The first gloss is for the compound. It contains a pronunciation guide in Japanese, しゅとけん “shutoken”, and an English translation, the capital city. The second gloss has the same information for the shorter constituent. The senses — expressed as English translations — for the full word and its constituent are shown at their lowest visibility setting; the user can increase the visibility of any translation and even choose one translation to insert between the lines of the original text. This is the mechanism by which Wakaran is designed to capture user preferences for word senses in context.

Figure 1.1: The Wakaran Glosser in action. Text from Wikipedia (2016).
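The interaction mechanism just described can be summarised as a small data model. The sketch below is a hypothetical rendering, assuming three visibility levels and per-sense promotion; none of the class or field names are taken from Wakaran's actual implementation.

```python
# Hypothetical sketch of the per-token gloss state described above.
# All names here are assumptions for illustration only.
from dataclasses import dataclass, field
from enum import Enum

class Visibility(Enum):
    MINIMAL = 0    # lowest visibility: shown only in the pop-up
    PROMOTED = 1   # user has raised the sense's prominence
    INLINE = 2     # chosen translation inserted between the lines of text

@dataclass
class Sense:
    translation: str                       # e.g. "the capital city"
    visibility: Visibility = Visibility.MINIMAL

@dataclass
class Gloss:
    surface: str                           # matched text span, e.g. a compound
    reading: str                           # pronunciation guide, e.g. "しゅとけん"
    senses: list[Sense] = field(default_factory=list)

    def promote(self, index: int) -> None:
        """Record a user judgement by raising one sense's visibility."""
        self.senses[index].visibility = Visibility.PROMOTED
```

A trail of such promotion events, one per user interaction, is the kind of data the crowdsourcing method set out in this thesis aims to collect.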

In this thesis we also explore the potential for automatic customisation of visibility using WSD and, to some extent, MWE token disambiguation as well.

This thesis forms one side of a coin: on the other lies future work which investigates the ability of automatic WSD to improve learning outcomes for users of glossing software. Learning outcomes are not studied in this thesis but the complementary aims do inform its scope and consequently some of the groundwork for future work is laid by the present investigation. As is always the case with crowdsourcing methods using untrained annotators, raw word sense judgements collected from dictionary users are not reliable enough to treat as a gold standard. Since this is a new methodology for collecting data from naturally occurring human activities rather than a controlled crowdsourcing task, we do not yet attempt to build a gold standard annotated corpus. We aim instead to build a corpus annotated with student word sense choices indicated in user interface interactions but limit this thesis to an investigation of the relationship between user choices and the output of traditional methods of automated WSD.

This thesis uses the specific language pair and translation direction Japanese-English as a case study. The results are expected to generalise to other language pairs with differences in orthography. In addition, we expect the results to generalise to more closely related language pairs, with the exception that density of data collection from dictionary consultation may decrease for closely related languages with similar word roots.

A number of text glossing applications — that is, applications that take an arbitrary text as input and annotate the words in it with dictionary entries — already exist for English speakers reading Japanese as a second language. However, the most popular ones have a tendency to either dump an overwhelming amount of information on the user giving descriptions of many word senses for every word in the text or else hide too much information and risk removing senses with specific relevance to a particular text. None offer interactive mechanisms for selectively showing and hiding information on word senses, let alone automating the process of finding the most relevant information. An opportunity exists to offer a broad selection of potentially relevant dictionary entries with optional controls for consulting and selecting specific senses of words and MWEs. A consequence of interaction with these features would be a trail of information about user judgements of word senses relevant to their understanding of the text.

1.2 Contributions

The main contribution of this thesis is its novel crowdsourcing technique for addressing the knowledge acquisition bottleneck for computational semantics. Part of this contribution is a design for glossing software, the Wakaran Glosser (“Wakaran”), which serves the joint purpose of crowdsourcing human judgements on word senses whilst also providing a platform for delivery of NLP as a first-class application feature serving user information needs. The major research questions we seek to answer with this apparatus are:

• whether students will use word sense visibility features in an interactive Japanese-English glosser to satisfy their information needs; and

• whether it is possible to draw sensible conclusions about lexical semantic models based on a record of sense choices made by students in that setting.

Components of this thesis develop theory on computational lexical semantics and contribute to the construction of annotated corpus resources useful for answering core questions in the field. The theoretical work is then applied in Wakaran’s ability to identify dictionary entries relevant to arbitrary Japanese texts. Modelling of data collected from user interaction events in Wakaran validates hypothesised usage and demonstrates the feasibility of the crowdsourcing method. Finally, an extrinsic evaluation for word sense disambiguation is presented that uses modelling of data collected specifically from user interactions with word senses in the application. Additionally, results are presented that demonstrate that integration of NLP features based on computational semantic theory will have a substantial impact on the information users of Wakaran focus their attention on, which has implications for future study of Wakaran within the field of computer assisted language learning (CALL), which has been excluded from the scope of this thesis.

The success in the identification of lexicalised multiword expression types by means of lexico-syntactic fixedness has led to a number of attempts to identify their usages by means of local morpho-syntactic features (Fazly et al. 2009; Nasr et al. 2015; Morimoto et al. 2016). We show that morpho-syntactic variations are much less informative than contextual semantics for identifying usages of a lexicalised expression, to the extent that:

1. an expression-agnostic context representation can be more informative than expression-specific syntactic rules; and

2. even expression-specific syntactic constraints encoded by a lexicographer are frequently broken by usages of the expression.

These conclusions have a direct consequence for word glossing software: the choice of whether to identify a phrase with a matching expression entry cannot be made accurately with a computationally inexpensive check against syntax constraints; to reliably make the choice a model of context semantics is needed.

A bottleneck to studying lexical semantics is the high cost of collecting data on the usage of word senses. We contribute a corpus of sense annotated Japanese text by transferring annotations from the English SemCor corpus to an existing sentence aligned Japanese translation by using word-type alignment suggestions provided by translators. We use the resulting corpus, Japanese SemCor, to test the effect that expanding a lexical knowledge base by orders of magnitude has on the results of the knowledge-based method for word sense disambiguation, Personalised PageRank. We show initially that WordNet++, which expands WordNet with additional relations found in Wikipedia, substantially increases WSD performance compared to WordNet alone. However, we find that for Japanese WSD over Japanese SemCor, the BabelNet knowledge base, which expands WordNet++ with concepts and relations from multilingual links in Wikipedia, results if anything in a decrease in performance relative to WordNet++. After a detailed error analysis we conclude that to leverage the information in BabelNet for Personalised PageRank it will be necessary to weight the relations in the network by quality and by kind.

Crowdsourcing has emerged as a popular new methodology for collecting data at scales that would be prohibitively resource intensive using professionally trained annotators. However, it introduces challenges for incentivising participation economically and for control over how participants engage with the crowdsourced task. In this thesis we describe the design of Wakaran, a dictionary-enabled web application, to fulfil an existing demand in second language learning. Wakaran captures and records human judgements about the association between dictionary entries and usages of words in text. We show through study of user interface interaction logs that Wakaran attracted and retained a base of users who engaged frequently with dictionary definitions while reading systematically through Japanese texts.

The texts that users of Wakaran interacted with were user-selected excerpts from documents studied in intermediate tertiary studies in Japanese as a second language. Wakaran was used by many students taking the same unit of study and as such its records include user interaction data for many overlapping fragments of text from the same original source document. We describe a method for:

1. identification of fragments originating from the same original text;

2. cleaning noise introduced by text being excerpted at different times and by different methods; and

3. merging data collected from multiple sets of logs into a set of word sense judgement annotations on the cleaned unified text.

We then describe a method for automatic disambiguation of the same texts via a mapping of the glossing dictionary to the lexical knowledge bases we previously evaluated against Japanese SemCor. Our results show that users strongly focus their interactions on the first sense listed in the dictionary. However, senses selected using WordNet++ as the LKB are statistically more similar to those selected by users than senses selected using WordNet alone, which agrees with the comparable results of the Japanese SemCor evaluation. Additionally, we show that if our method for automatic sense selection is used to highlight senses in the user interface, user sense preference shifts to split between the first sense and the highlighted sense. We conclude that the quality of automatic sense selection will have a strong effect on the information gained by students using Wakaran. Furthermore, our results suggest that there is a relationship between intrinsic measures of WSD performance and application-specific performance in Wakaran. This presents a hypothesis for future study: that user sense preferences in Wakaran can inform improvements in automatic WSD performance in gold standard evaluation measures and, conversely, that improvements in intrinsic WSD performance can improve extrinsic measures in the Wakaran application and increase the value that users get from it.

1.3 Outline

This dissertation is divided into four main parts: I, background material; II, the crowdsourcing application; III, computational lexical semantics; and finally IV, the synthesis of the application with lexical semantics, with conclusions. Part I comprises three chapters, including this introduction. Chapter 2 reviews related work in the field, initially situating this thesis within the broader field of computational semantics. The review then takes a more detailed look at related research into word sense disambiguation, multiword expressions and crowdsourcing. Chapter 3 gives an overview of resources used in the course of the research. Resources include dictionaries and publicly available annotated corpora, but also NLP applications that form part of the apparatus for the research presented in this thesis such as tools for identifying grammatical and morphological structure in digital text. Chapter 3 also includes a small amount of bridging linguistic theory necessary to understanding elements of the work that a reader with a general computer science background may not be familiar with.

Part II describes the crowdsourcing application. It comprises a single chapter, Chapter 4, in which we describe the design of the web application, the Wakaran Glosser, for crowdsourcing of human word sense judgements. In contrast to typical crowdsourcing methods, the application is designed to incentivise participation by filling a user need rather than by offering a reward structure. We use data collected from usage of the application to quantify its user acquisition and user behaviour, for the purposes of validating the software and confirming the feasibility of the crowdsourcing method.

Part III establishes theoretical results in lexical semantics relevant to the design of the glossing application, Wakaran, intended for use in the main crowdsourcing project. Chapter 5 describes our research into sources of information for identification of usages of idiomatic multiword expressions in arbitrary digital text. Chapter 6 is a broad investigation of considerations for computational lexical semantics. It starts with methods for high coverage identification of relevant dictionary entries for wordforms in arbitrary text. These methods have relevance to the effectiveness of Wakaran as a glossing application but are also used in Chapter 6 as part of the construction of a sense annotated corpus in Japanese using an existing sentence and word-type aligned translation of an annotated English corpus. We use the resulting corpus to evaluate the effect of recent efforts in automatic cross-lingual LKB enrichment on results for knowledge-based all-words word sense disambiguation. The selection of WSD methods evaluated in this chapter is later subjected to extrinsic evaluation in Part IV using crowdsourced data.

final conclusions for the thesis. In Chapter 7 we show how the data collected from Wakaran usage can be aggregated into a corpus of sense selections. We then show that evaluation of automatic word sense disambiguation on this corpus predicts differences found by evaluation on the corpus built from transferred annotations in Chapter 6. Finally, we study data from usage of a version of

Wakaran built with an automatic WSD assistant to demonstrate that improved automatic word sense disambiguation performance has significant scope to improve the usefulness of the crowdsourcing application to its users. The demonstrated impact of automatic WSD motivates future work study- ing Wakaran as a computer assisted language learning system with computational semantic features.

Chapter 8 draws together the results of the previous chapters to conclude that word sense disam- biguation and computational semantics in general have the potential to fulfil real world applications in Wakaran and, in so doing, to improve our understanding of word senses. We outline how future work can utilise Wakaran to assess the impact on learning outcomes such as vocabulary retention for language learners of automatic WSD. Additionally, we outline the potential for research into improving qualities of computational semantics such as real time delivery of sense judgements that get little research attention but are important to its application in Wakaran. Chapter 2

Literature Review

Lexical semantics — the meaning of words — is frequently studied today by means of reference to large corpora: collections of samples of human communication that are intended to represent the diversity and distribution of language usage. This has not always been the case: there is a tradition in the academic study of language popularised by Chomsky (1957) whereby models of human language are developed by means of introspective examination of language cognition by experts. However, with the advent of electronic computers and the subsequent encoding of large corpora, statistical methods for linguistics became not only more feasible but hugely successful (Manning and Schütze 1999). Furthermore, the availability of machine readable dictionaries opened up scalable, data driven techniques for the study of the relationship between natural language and the concepts encoded in dictionary entries (Lesk 1986).

Data driven computational techniques for modelling the relationships between words, concepts and text fall broadly into the computer science fields of machine learning and data mining (Manning and Schütze 1999; Tan et al. 2006). To achieve success with these methods, often plain natural language samples are not enough: additional expert annotations providing information targeted at the phenomena under study are required (Pustejovsky and Stubbs 2012). However, the acquisition of annotations for studying connections between dictionary entries and text can be a particularly costly and onerous task (Màrquez et al. 2007). Recent developments in data acquisition use a method termed human computation: channelling the activity of large numbers of non-experts into the creation of annotated datasets suitable for statistical analysis (von Ahn 2005). This approach holds promise for relieving the knowledge acquisition bottleneck of computational lexical semantics (Venhuizen et al. 2013). To motivate participation in human computation, research projects typically use one of monetary incentive, competitive game mechanics or a combination of both (Morschheuser et al. 2016). However, there is an existing demand from students learning a second language for applications retrieving dictionary entries from text (Whitelock and Edmonds 2000). This thesis investigates leveraging that demand to construct annotated corpora suitable for study of computational lexical semantics.

2.1 Computational lexical semantics

In this section we give a broad survey of techniques for data driven lexical semantics. The following sections will dive deeper into selected topics.

2.1.1 Introducing word sense disambiguation

There are many words in English and other languages which can take on different meanings with the same surface appearance of the word. Consider this hypothetical dictionary entry listing definitions for the word ash: (a) a kind of tree, (b) the wood of that tree, and (c) the residue of a fire. To computer software the string representation of all three meanings of ash appears the same because they are encoded the same. For instance, in the sentence a branch fell from the ash, there is no distinct rendering of ash to indicate which of the above meanings is closest to the intended usage: to the computer program its meaning is ambiguous between the three. Some kind of inference process must be used based on the content of the rest of the sentence. For many applications, especially downstream NLP applications such as machine translation, a representation of the intended meaning of a particular usage of ash may be helpful or in fact essential to properly extract information from the text. The task of identifying a particular dictionary sense relevant to a word usage is called word sense disambiguation (WSD).

Lesk (1986) introduced a data driven method for identifying the meanings of word usages, breaking from a tradition of painstakingly crafted artificial intelligence expert systems (Gale et al. 1992a). The Lesk algorithm takes unannotated text as input and looks up definitions of the words in a machine readable dictionary. Whenever the dictionary provides multiple alternate definitions for different senses of the word the algorithm must attempt to select the right one. For example, considering the definitions of ash given above, Lesk must select one definition to best describe the word as it is used in a given text: for a branch fell from the ash the best choice is sense (a); for he swept ash from the fireplace it is (c). This example epitomises the task of word sense disambiguation: the identification of the meaning of a word's usage (its token) in terms of discrete definitions (senses) for the word in-vitro (its type). The identification of a meaning for a word token must be made based on knowledge of the context of the word's usage. Sometimes a full understanding of an utterance can only be achieved with knowledge of who is speaking, who they are speaking to and events that have recently transpired. However, for machine readable documents it is often the case that the only thing known of the context of a word's usage is the text of the surrounding paragraph. In this thesis the surrounding text is the only form of context we take into account for disambiguating the meaning of a word token. The Lesk algorithm disambiguates a token by consulting the dictionary definitions of words in the surrounding text. It selects the sense of the token that overlaps most with the definitions of other words in the context. Overlap is measured by the number of words two definitions have in common. In the examples above, a definition of fireplace might include the word fire, thus overlapping with definition (c); a definition of branch may mention tree, overlapping with (a). Lesk (1986) viewed this as a way to cheaply gain sense labels for word tokens to use in downstream applications without requiring the attention of human experts to individual word types.
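As a concrete illustration of the overlap heuristic just described, the following is a minimal sketch in Python. The toy dictionary and function names are assumptions for illustration; this does not reproduce Lesk's original implementation.

```python
# Minimal sketch of the Lesk (1986) overlap heuristic: choose the sense
# whose definition shares the most words with the definitions of the
# other words in the context. The toy dictionary is illustrative only.

def tokens(text):
    return set(text.lower().split())

def lesk(target, context_words, dictionary):
    """dictionary: {word: [definition, ...]}. Returns the best sense index."""
    best_sense, best_overlap = 0, -1
    for i, definition in enumerate(dictionary[target]):
        overlap = 0
        for word in context_words:
            for other_definition in dictionary.get(word, []):
                overlap += len(tokens(definition) & tokens(other_definition))
        if overlap > best_overlap:
            best_sense, best_overlap = i, overlap
    return best_sense

dictionary = {
    "ash": ["a kind of tree", "the wood of that tree", "the residue of a fire"],
    "branch": ["a limb of a tree"],
    "fireplace": ["a hearth for containing a fire"],
}
print(lesk("ash", ["branch"], dictionary))     # -> 0 (the tree sense)
print(lesk("ash", ["fireplace"], dictionary))  # -> 2 (the residue sense)
```

A real system would filter stopwords and lemmatise before counting overlaps; the sketch omits both for brevity.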

In the years since, a number of direct derivatives of the Lesk algorithm have been proposed, including a simplified version that compares target word definitions to the context words instead of their definition text (Kilgarriff 2007), an extended version that includes definitions not just of target words and words in context but also follows relations such as synonymy to related words and includes their definitions as well (Banerjee and Pedersen 2003) and, naturally, the combination of those two modifications into a simplified extended Lesk algorithm (Ponzetto and Navigli 2010).

Distributional word sense disambiguation

Whereas Lesk (1986) and its variants used the content of dictionary definitions for sense disambiguation, another approach exists which models word senses purely on the contexts in which they are used. Bruce and Wiebe (1994) manually labelled 2369 tokens of the word interest with senses from the Longman Dictionary of Contemporary English (LDOCE) and fit a statistical model to the data. The statistical model required a representative sampling of the joint probability of all variables in the model, so the context was represented by a small number of categorical variables:

• pluralisation of interest, as a Boolean;

• part of speech (POS) tag (Noun, Verb and so on) of the adjacent word;

• word co-occurrence in the sentence, as a Boolean, for each of five informative word types.

To keep the number of features small but informative, the five words least statistically independent of the sense labels were selected from amongst the 400 words appearing most commonly in sentences with interest. Those words were: in, percent, short, pursue and rate. All other aspects of the context were discarded. This highlights the main limitation of the method: it requires a lot of annotation effort to build a disambiguation model for just one word's senses based on only a few word types in the context. Together, Lesk (1986) and Bruce and Wiebe (1994) highlight the two main strategies for solving WSD:

knowledge-based: using information encoded in lexical resources like dictionaries; and

supervised: using sense annotated corpora to learn a mathematical description of the contexts in which senses are used.

Supervised machine learning models which are more robust to sparse data make it possible to use much more detail from the context. Murata et al. (2001) extracted a wide variety of features from the textual context to use as features for Naive Bayes (Hilden 1984) and Support Vector Machine (SVM) (Boser et al. 1992) classifiers trained on a sense annotated corpus of a small selection of Japanese word types. Since standard SVM classification only discriminates two classes, they used a pairwise method to distinguish between more than two senses: the sense that is chosen over the most of its peers in pairwise comparisons is the one selected. Features made use of a part of speech (POS) tagger and a sentence structure parser to represent the following (a brief code sketch follows the list):

• uninflected form and POS of adjacent words;

• uninflected form and POS of every word type appearing in the context;

• the immediate phrase type containing the target word (such as Noun-phrase, Verb-phrase);

• the particle of the target word (Japanese particles include prepositions such as at, on, in and

so on);

• details of the word modified by the target token;

• details of words modifying the target token;

• and more besides.

Section 3.2 reviews details of Japanese POS taggers and parsers which enable these kinds of features. Lee and Ng (2002) performed a detailed comparison of such sources of features for English WSD for a variety of annotated corpus based methods including Naive Bayes and SVMs. They concluded that SVMs performed better than the other methods when provided with a wide variety of features; however, other methods could perform better when an informative subset of the features was selected. In Chapter 5 we investigate the application of the word sense disambiguation features and learning algorithms described in this section to disambiguation of ambiguous expressions with potential idiomatic interpretations.

Knowledge-based word sense disambiguation

Lesk (1986) inferred connections between word senses through shared words in their definitions, but there are lexical knowledge bases (LKBs) which explicitly encode certain relationships between words. A popular one is Princeton WordNet (“WordNet”) which encodes concepts as sets of synonymous words together with a definition explaining their shared meaning (Miller et al. 1990; Miller 1995). WordNet can be understood as an ordinary dictionary, where each word has one sense entry for each synonym set (synset) it is a member of. However, its real power rests in its network of relationships between synsets. In particular, it encodes the hypernymy or IS-A relationship. For example, the synset {big cat, cat} has, as a hypernym, the more general synset {feline, felid}, and among its hyponyms are found examples of specific kinds of cat such as the synsets {leopard, Panthera pardus} and {tiger, Panthera tigris}. Much work has gone into deriving measures of similarity between arbitrary pairs of words or concepts on the basis of paths along these relationships. Rada et al. (1989) measure the distance between two concepts in a hypernymy graph by finding the minimum number of steps needed to get from one to the other. In many cases this will mean ascending to a general concept such as animal or plant and then descending again to the other concept. For example, the distance between tiger and leopard in WordNet is 2: one step up from tiger to big cat followed by one step down again to leopard.
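The minimum-distance computation can be illustrated over a toy fragment of the hypernymy graph (hand-built here for illustration; a real implementation would traverse WordNet itself):

from collections import deque

# Undirected hypernymy edges for a toy fragment of the taxonomy.
edges = {"tiger": ["big cat"], "leopard": ["big cat"],
         "big cat": ["tiger", "leopard", "feline"], "feline": ["big cat"]}

def distance(a, b):
    # Breadth-first search for the minimum number of edges between two concepts.
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

print(distance("tiger", "leopard"))  # -> 2, as in the example above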

Rada et al. (1989) found that distances described in this way correlated with semantic distances given by human judges. However, minimum distance did not work as well (unmodified) when the underlying graph was built from relationships other than hypernymy. Refinements of and alternatives to the minimum distance metric exist. For example, Sussna (1993) uses multiple synset relation types in WordNet, giving different weights to each relation kind and additionally increasing the distance between abstract concepts and distances involving highly connected concepts. Resnik (1995), on the other hand, describes an algorithm that searches upwards through a taxonomy graph from two concept nodes to find their common subsuming nodes, but instead of using path distance uses scores precomputed for the nodes as a similarity value. All concepts in the hierarchy are assigned an information content score that decreases for more abstract concepts: the similarity between two concepts is given by the highest information content of any of their shared subsuming concepts. The information content of a concept is estimated from corpus statistics by measuring the frequency of words that could represent the concept or one of the concepts it subsumes and taking the negative log likelihood of its appearing. The information content measure had a substantially higher Pearson correlation (0.791) with a gold standard set of human judgements than a minimum distance similarity metric (0.667) over the same LKB and evaluation data. To measure how accurately a single human can predict the gold standard, Resnik (1995) performed a replication of the construction method for the word similarity corpus used in the evaluation, collecting 10 human similarity judgements per word pair. The combined replicated gold standard was highly correlated with the original human gold standard, but individually the human annotators had an average correlation with the reference of 0.885, which places an upper bound on algorithmic performance. Jiang and Conrath (1997) also use information content, but in a distance metric rather than a similarity measure. In the simplest form they take the distance between a concept and its hypernym to be the difference between their information content. This means the distance between two concepts simplifies to the sum of their information content minus twice the maximum information content of their subsuming concepts. A more complex form weights hypernymy edges on the product of three factors:

• the information content difference;

• the ratio of their depths raised to a parametrised power, $\left(\frac{d}{d+1}\right)^{\alpha}$; and

• the local network density.

Under evaluation against the same data as Resnik (1995), they find that their method yields a modest improvement to a correlation of 0.828, which is closing in on the human annotator average.
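In symbols (our own rendering of the descriptions above, where $S(c_1, c_2)$ is the set of concepts subsuming both $c_1$ and $c_2$, and $p(c)$ is the corpus-estimated probability of seeing a word that could represent $c$ or any concept it subsumes):

$$\mathrm{IC}(c) = -\log p(c)$$

$$\mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \max_{c \,\in\, S(c_1, c_2)} \mathrm{IC}(c)$$

$$\mathrm{dist}_{\mathrm{JC}}(c_1, c_2) = \mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \max_{c \,\in\, S(c_1, c_2)} \mathrm{IC}(c)$$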

The path length and information content methods discussed so far all ultimately settle on a single path when measuring semantic distance or similarity; information about the length and number of alternate paths between concepts is discarded. By contrast, Kozima and Furugori (1993) perform WSD using a technique called spreading activation: starting at a single concept and flooding activation outwards along semantic relations using mixing and damping functions.

Collins and Loftus (1975) present evidence that spreading activation is an accurate model of human cognition of semantic relatedness. They argue that spreading activation simulates the time a person will spend answering concept association questions because the person starts at a primed concept, freely associating outwards to related concepts until reaching a target. The theory has a lot of qualifying detail to accurately reproduce experimental data but we highlight three elements of it here:

1. the spreading of activation must be damped over distance;

2. direct associations may carry different weights;

3. primary relations such as hypernymy are important, but any verb may carry a relation.

Page et al. (1998) proposed and tuned an algorithm, PageRank, based on extremely similar principles for an entirely different graph: that of the world wide web’s interconnected pages. PageRank is characterised by random walks taken through the graph with uniform probability assigned to all outgoing edges from a node and a constant probability of restarting the walk at a node selected randomly from the entire network. A refinement, Personalised PageRank, restricts the random restart to a small selection of nodes and has proved effective as a network model of semantic relatedness. Agirre and Soroa (2009) show that selecting senses most strongly related to words in context by Personalised PageRank of WordNet can yield state of the art performance in WSD. For their network they use the standard WordNet relations augmented with additional relations between synsets and manually applied sense tags on the synset definitions. The edge weights in Personalised PageRank are typically dependent only on the number of outgoing edges, not on the relation type. In Section 2.2 we give a more detailed review of WSD literature with a particular focus on Personalised PageRank. In Chapter 6 we investigate in detail the effect of further augmentations of WordNet with new concepts and unweighted relations on Personalised PageRank WSD results.

2.1.2 Other approaches to computational lexical semantics

One thing we have not yet stopped to question is the validity of the sense inventory found in a specific lexicon. Does taking on the task of WSD inherently assume that the lexicon is complete and correct? Is there any basis to claim that the senses of words found in WordNet, or other machine readable dictionaries such as LDOCE, are the true senses of the words and, if not, can we truly hope to identify them in natural language? Kilgarriff (2007) describes the writing of a dictionary entry as a process of collecting examples of a word’s usage, clustering those examples into similar meanings and, finally, formulating a description of the common meaning in each cluster. This is ostensibly the job of a human lexicographer, but it may be modelled computationally: Schutze (1998) introduces an unsupervised clustering algorithm for inducing senses by grouping usages automatically. Furthermore, a recent SemEval task called for construction of descriptive entries that at least discriminate the main patterns of usage (Baisa et al. 2015). Researchers have met with some success when evaluating the ability of algorithms to reproduce sense distinctions found in existing inventories including WordNet and others (Pantel and Lin 2002; Manandhar et al. 2010). That being the case, it seems that the clusters do form a ground truth which a supervised method may learn to generalise, and sense definitions provided by the lexicographer could have sufficient descriptive power for a human annotator or knowledge-based algorithm to reconstruct the clusters from plain examples. So, with reference to a specific dictionary, word senses may exist and bear a relationship to clusters of usages. However, different lexicographers (and indeed different ordinary language users) may judge clusters of meaning on different criteria and thereby come to a different ground truth (Palmer et al. 2007a). As such, it makes sense for researchers to put aside an assumed sense inventory and investigate hypotheses that reflect on the nature of word usages and how their meanings differ from each other. Many data-driven studies approach this question in different ways.

Erk et al. (2013) developed two data sets using untrained annotators to study ordinary human intuition for distinctions in meaning:

WSsim collecting graded ratings for the similarity of word sense definitions to word usages; and

Usim collecting graded ratings for the similarity of word usages to other word usages.

They found that inter-annotator agreement on these tasks was good, meaning that the subjects did tend to agree on the similarity of meaning. Erk et al. (2013) also found that the graded sense assignments in WSsim correlated with single best sense assignments made by trained annotators.

Thus, single best sense assignments could be considered correct, but less informative than grading all senses. In Chapter 6 and Chapter 7 we implement WSD algorithms capable of giving a graded score for each sense of a word in context and evaluate them against single best sense annotated resources.

Concern that typical sense inventories may split sense distinctions too finely has led to a number of studies for grouping senses into coarser subdivisions (Palmer et al. 2007a; Navigli et al. 2007). Erk et al. (2013) examined the implications of WSsim for groupings of sense inventories and found that whilst some of the fine grained senses of WordNet had highly correlated scores there were nevertheless specific usages that could justify finer distinctions by polarising two otherwise correlated senses. Additionally, specific examples existed to undermine a grouping by giving a high score to both of two otherwise negatively correlated senses.

Usim has the interesting property that it eliminates sense enumeration altogether, reducing the problem of semantic distinction to relationships between word usages. Work has been done on an annotation task called lexical substitution which takes a similar approach: instead of labelling word usages with dictionary senses, they are labelled with other words that could be substituted in without changing the meaning of the sentence (McCarthy and Navigli 2007; Kremer et al. 2014). Erk et al. (2013) compared Usim to lexical substitution data, finding that the degree of overlap between sets of substitutions suggested by human annotators for different usages correlated with human numeric ratings of similarity. Given that word usages do seem to have a form of objective similarity, it may yet be possible to describe objective criteria by which to group usages into self-similar clusters. This task is called Word Sense Induction (or discrimination), and many methods exist for it (see, for example, the SemEval tasks of Manandhar et al. (2010), Jurgens and Klapaftis (2013) and Navigli and Vannella (2013)), but the effectiveness of the results is difficult to compare on objective criteria. For example, Palmer et al. (2007a) use two methods:

1. Standard clustering performance metrics, basing reference clusters on sense annotations from an existing lexical knowledge base.

2. Standard WSD performance metrics after a manual supervised mapping of induced senses to the best matching sense from an existing lexical knowledge base.

Either way, the results do not fully break the reliance on judgements made by the lexicographer(s) of the original lexical knowledge base. It may, however, be possible to do so: Schutze (1998) suggests that the utility of WSD for downstream tasks relies only on the ability to discriminate unrelated word usages, rather than on senses in a specific inventory. There is a lot to be learned from studies such as these that model semantics from the ground up, based only on unsupervised analysis of linguistic examples. However, in this dissertation we restrict ourselves to studying the relationship between free text and the entries in existing dictionaries.

2.1.3 Multiword expressions

A multiword expression (MWE) is an expression spanning multiple words that exhibits some kind of special or unusual behaviour beyond the normal compositional productivity of language (Sag et al. 2002). In this thesis we focus on MWEs that have been lexicalised: that is, expressions with special characteristics that warrant explanation in a dedicated entry in the lexicon (Bauer 1983). Take for example the English expression elephant in the room which means something that no-one could fail to be aware of but no-one has openly acknowledged. One could perhaps infer that an elephant in a room is a metaphor for a thing that demands attention but the matter of awkward silence on the situation makes this phrase unequivocally idiomatic. Other examples include call it a day, used to declare a halt to work, and let the cat out of the bag which means “to reveal a secret”.

Type extraction and classification

The absence of multiword expressions from a computational lexicon can have significant downstream effects on NLP applications (Sag et al. 2002). There has been considerable work on addressing this issue by mining corpus resources for MWEs. Ramisch (2015), in an extensive treatment of the topic, synthesised the various levels of assumptions made by work in the field into a standardised processing pipeline with a particular emphasis on evaluation methods. The main stages in the pipeline are:

Preprocessing by external tools for identifying word boundaries, normalised token lemmas and parts of speech, and parse trees for corpus sentences.

Indexing of sequences in the corpus of each entity type identified in preprocessing, including surface and POS n-grams and fragments of parse structure.

Extraction of candidate MWEs from the corpus by application of pattern searches to the index.

Filtering of candidates based on custom features extracted from the usages themselves.

This is followed by:

Manual validation of the candidates by a lexicographer or human computation task.

Or by:

Learning from a gold standard corpus to distinguish true MWEs from incidental candidates automatically, or evaluation of the extraction method via a gold standard corpus.

The pipeline is implemented in a software application called the mwetoolkit (http://mwetoolkit.sourceforge.net/), which should be a useful general purpose tool for a variety of MWE processing tasks.

Early methods for MWE extraction focused on lexico-syntactic inflexibility. For example, Lin (1999) identified word pairs with unusually strong selectional preferences for each other in grammatical dependency relationships, looking in particular for cases where otherwise valid lexical substitutions are not observed to occur. Bannard (2007), by contrast, identified MWEs by looking in corpora for phrases with morpho-syntactic inflexibility: specifically those that are not used in the grammatical passive form or that resist addition or subtraction of subordinate modifiers to their constituents.

It has been observed that MWEs are not all strictly idiomatic, some taking their meaning by degrees from their constituent words (Nunberg et al. 1994). On the one hand there are fully idiomatic MWEs such as kick the bucket, meaning “to die”, which is completely unrelated to the meaning of its constituents. On the other there are traffic light and street light, which are both literal but have strictly distinct interpretations. There are also light verb constructions such as take a nap, which get their literal meaning from nap, discarding the literal meaning of take. Fazly and Stevenson (2007) seek statistical measures to classify MWE types into categories including abstract combinations, light verb constructions and idiomatic combinations. They use a variety of criteria including lexico-syntactic fixedness — inflexibility to grammatical inflection and lexical substitution of constituents — and semantic compositionality. The latter is measured by modelling the distribution of contexts in which a word appears as a vector containing a count of each context word type. The cosine between the MWE’s context distribution and its constituents’ context distributions is used to represent similarity and therefore compositionality. Other data sources that have been mined for measuring compositionality of MWEs include monolingual and multilingual dictionaries (Salehi et al. 2014; Salehi and Cook 2013), and context word distribution vectors from unannotated corpora (Katz and Giesbrecht 2006) and annotated corpora built through crowdsourcing (Reddy et al. 2011).
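The cosine computation over context-count vectors can be sketched in a few lines (toy counts invented for illustration):

from collections import Counter
import math

def cosine(c1: Counter, c2: Counter) -> float:
    # Cosine similarity between two context-word count vectors.
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy counts: contexts of "kick the bucket" vs its constituent "kick".
mwe_ctx = Counter({"died": 4, "funeral": 2, "suddenly": 3})
constituent_ctx = Counter({"ball": 5, "leg": 2, "suddenly": 1})
print(cosine(mwe_ctx, constituent_ctx))  # a low value suggests non-compositionality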

In this thesis we work with such MWEs as have been defined in a dictionary and make use of type classifications where available. Specifically, in Chapter 5 we study a corpus of annotated MWE usage candidates for a set of MWEs found in Japanese idiom dictionaries and classified as ambiguously idiomatic. Chapter 6 works with lexical knowledge bases in their entirety and as such generally relies on the assumption that if an MWE has been included in the dictionary, it is there for a reason and should be preferred, although we sometimes override this with certain heuristics. The enclosing chapters discuss an application for which it is preferable to label text with both MWEs and their constituents simultaneously.

Identification of tokens

Earlier, we motivated identification and classification of MWE types by the downstream impact that naivety to MWEs has on NLP applications; it follows that we must be able to detect any usage of a MWE in text to realise the benefits of learning about its type. However, identification of tokens of a MWE comes with many challenges: not only might the constituent words’ surface forms be changed under inflection, but the constituents themselves might become separated by other words of the sentence. For example, the expression to cut corners, meaning “to finish a job cheaply by neglecting (potentially important) details”, can appear modified in both of these ways: you are cutting some corners doing your own taxes like that. As such, some means is required to identify the inflected constituents — cutting, from cut — and also to find the constituents despite their separation in the text. However, sometimes the constituents of a MWE might appear together as an ordinary literal expression: even divorced of any context the phrase kicking the buckets demands a literal interpretation, shirking the figurative sense that prefers the singular bucket. Cook et al. (2007) develop a method for identifying MWE types by their syntactic restrictions and then use the restrictions to identify the outlying minority of apparent tokens of the MWEs that violate those restrictions as literal usages.

However, sometimes the clues that a particular expression is or is not a usage of a MWE are purely contextual: after much hissing, yowling and scratching he finally let the cat out of the bag is most likely talking about a literal cat and a literal bag, despite perfectly resembling the phrase let the cat out of the bag meaning “to reveal a secret”. Katz and Giesbrecht (2006) used distributional statistics from a labelled corpus to distinguish idiomatic usages of MWEs from usages with literal interpretations. They used a method called latent semantic analysis (LSA) for compressing the context representation from a vector of word type counts into a smaller vector space by performing a singular value decomposition (SVD) of the word co-occurrence matrix (Deerwester et al. 1990; Landauer and Dumais 1997). Although the idiomatic token identification relied on an annotated corpus, the method inspired the second part of the study, which performed type identification by comparing the LSA vector of an expression’s context distribution to the composition of the LSA vectors of its constituents’ distributions when appearing alone.
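A minimal sketch of the dimensionality reduction step (toy random data standing in for real co-occurrence counts; not the original implementation):

import numpy as np

rng = np.random.default_rng(0)
# Word-by-context co-occurrence counts: word types x context features (toy data).
counts = rng.poisson(1.0, size=(1000, 200)).astype(float)
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 50                            # number of latent dimensions to keep
word_vectors = U[:, :k] * s[:k]   # each row: a word type in the reduced space
# An expression's context vector and the composition of its constituents'
# vectors can then be compared by cosine in this reduced space.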

In Section 2.3 we review more details of MWE research with a particular focus on discrimination of MWE tokens from literal usages of their constituents. We show in Chapter 5 that syntactic restrictions can only take us so far and that ultimately semantic analysis is a more powerful approach to MWE token identification. In Chapter 6 we take a step back to look at the initial identification of potential usages with inflected wordforms by exploring several forms of the prioritised tiling algorithm of Whitelock and Edmonds (2000).

2.2 Word sense disambiguation

In this section we will cover in more detail the history and state of the art in word sense disambiguation. Both have been shaped by the organising influence of a series of collaborative evaluation events — named Senseval and more recently SemEval — and conferences in the Joint Conference on Lexical and Computational Semantics (*SEM) series. Contributions to the field span a variety of different objectives — from lexical sample and all words WSD through to sense induction and grouping — and use a variety of different resources including annotated corpora, bootstrapped corpora and both professional and crowdsourced lexical knowledge bases. The existence of standard reference points for evaluation made popular by the Senseval and SemEval shared tasks did a lot to ensure comparability of WSD research, but did lead to a gravitation of research towards the assumptions about word senses inherent in the exact match criteria (Navigli 2009). In recent years the variety of tasks and evaluation criteria standardised by the SemEval and *SEM events has diversified substantially. In this thesis we focus on knowledge based methods that provide graded values measuring the strength of a word sense’s relationship with a usage. In particular we investigate lexical knowledge base resources for Personalised PageRank methods for WSD.

2.2.1 WSD research community

The Senseval — now SemEval — shared task workshops were proposed by Resnik and Yarowsky (1997) as a means to standardise WSD results and focus the community’s efforts. These events make available shared tasks including datasets and evaluation criteria. The standardised results achieved by WSD systems entered into Senseval allow for quantitative comparison of competing methods and continue to provide benchmarks for research in the years following the competition. The Senseval events also help to co-ordinate the direction of the WSD research community: Palmer et al. (2007b) reports that after Senseval-3 in 2004 it was generally agreed that performance at WSD as a standalone task had hit a plateau and that it was time to expand the field in new directions. The next event in the Senseval series was renamed SemEval to reflect that it included a much wider array of semantic tasks, such as lexical substitution.

Although it began as a triennial workshop, since 2012 SemEval has run every year, as has the equally specialised *SEM conference, which provides a dedicated venue for publication of work on computational lexical semantics. A broad range of topics is covered, including compositional distributional semantics (Gupta et al. 2015; Le and Zuidema 2015), readability (Mesgar and Strube 2015), negated phrase detection and category expression paraphrasing (Prabhakaran and Boguraev 2015; Sales et al. 2016), text segmentation on semantic boundaries (Glavaš et al. 2016) and many more. SemEval has always been about formalising task definitions to incentivise comparability of research; in recent times it has brought in a variety of tasks such as sentiment analysis (Nakov et al. 2016) and its specialisation to particular aspects (Pontiki et al. 2016), complex word identification (Paetzold and Specia 2016) and semantic textual similarity (Agirre et al. 2016).

On lexical semantics specifically, the current diversity of topics featuring at SemEval and *SEM includes many interesting developments. The “DiMSUM” task (Schneider et al. 2016) combines MWE token identification with supersense disambiguation — supersenses being high level concepts from near the top of a concept taxonomy, with a total cardinality only somewhat larger than the entity types used in named entity recognition. Although supersenses may seem like a blunt instrument compared to a standard concept inventory closer to the size of the lexicon, semantic models built for them can be more accurate, their study is less inhibited by the knowledge acquisition bottleneck, and those savings are passed on to downstream applications which do not have to understand the full sense inventory. Lopez de Lacalle and Agirre (2015b) introduce a method for a more traditional study grouping senses of words into coarser granularities for WSD. Taking the data from a sense annotation crowdsourcing project (Passonneau and Carpenter 2014), they note that quality control concerns resulted in a higher number of annotators and a higher degree of disagreement on each instance than would be produced by professional annotators. Lopez de Lacalle and Agirre (2015b) used the annotator disagreement matrix on each single instance to build a hierarchical clustering of senses for each word type, which can be used to build a sense inventory at the desired granularity.

One topic being explored at recent SemEval events is the relationship between algebraic and vector space models of meaning. Kornai et al. (2015) give an interesting comparison of algebraic and vector space models from the perspective of competence, specifically by looking at how each paradigm is able to explain how the ability to understand words is acquired. They identify key elements of what it means to be competent in lexical semantics and conclude that both models are capable — to some degree — of encompassing a model of learning those competencies. Aletras and Stevenson (2015) explore models of lexical semantics that hybridise a LKB representation of lexical meanings with an existing distributional model representing words by a vector of co-occurrence features. Their first hybrid embeds concepts from the lexical knowledge base into the distributional vector space as the centroid of the vectors representing each word linked to the concept by the lexical knowledge base. The second hybrid embeds concepts in the vector space by taking the lemma represented by each vector dimension and seeding a Personalised PageRank calculation with it; that dimension of the vector is then assigned the Personalised PageRank score of the concept being modelled. These two distinct vector models of concepts then have their dimensionality reduced using latent semantic analysis and, finally, they are mutually transformed using canonical correlation analysis (CCA). All up, that makes six vector models for concepts in the LKB. Evaluated against word (type) similarity datasets, the CCA based models improved most overall on the standard co-occurrence vector model, although in a couple of instances the latent semantic analysis models were slightly better. However, when used for WSD via a cosine similarity with the context, only the second base model performed well. This suggests that although all six models nominally represent concepts rather than word types, only the vectors produced from the single-source Personalised PageRank distinguish word senses from each other well. For another perspective, we highlight Song et al. (2016), who use an adaptation of popular neural network methods (Mikolov et al. 2013) to learn word senses from a corpus whilst jointly learning vector space embeddings for both words and the induced senses. Specifically, they iterate over the corpus a number of times, visiting each word to calculate a current best sense from the context (in a variant, new senses have a probability of being created) and then performing a neural network update for predicting the context from the sense. The final vector representations of senses and context words are their first layer activation coefficients in the neural network. Comparing to a variety of word sense induction systems on a historical SemEval task, they found that their method outperformed the others on supervised criteria (that is, relative to supervised sense boundaries) but was not clearly better on unsupervised clustering quality measures. Vector space embeddings of LKB and induced word senses are a promising area of development from both knowledge-based and distributional semantic model perspectives. For knowledge based approaches a distributional representation blurs the artificial bright lines that network models demarcate between closely related senses. On the other hand, for distributional models sense embeddings eliminate the problem of finding a single point in a vector space to embed polysemous words. Hopefully the appearance of this line of research at *SEM will lead to a targeted task formalisation featuring prominently at SemEval.

2.2.2 Classification of WSD research

Navigli (2009), in a comprehensive summary of the state of the art of WSD, provides a framework for classification of WSD research which will be useful to us in discussing the field. In fact, the terms in this classification scheme have developed over a long period of time and have appeared earlier in this review, but we give them a more formal treatment here in our description of the classification framework.

Firstly, word sense disambiguation research can be divided by the scope of the study:

All words WSD Given a sense inventory with a wide coverage – such as a comprehensive dictionary – find the meaning of all words in an arbitrary text by assigning each word a sense from the inventory.

Lexical sample WSD Given a sense inventory for a (usually small) selection of target word types and an arbitrary text, disambiguate tokens only of the target words.

From a data miner’s perspective, this might mean the difference between collecting 1000 samples:

• all sharing a single target word type; versus

• each dedicated to specific word tokens from a 1000 word document.

Since the latter provides very little information about each individual word’s meaning, lexical sample WSD research can perform much better against evaluation metrics than all words disambiguation. However, Navigli (2009) reports vocal community sentiment going back to Senseval-3 that the lexical sample task is not particularly interesting given that systems are restricted to disambiguating a small set of words and so have limited potential utility in applications.

Navigli (2009) breaks the task down even further by making a distinction between word type disambiguation and word token disambiguation. A word token is considered to be a specific word at a specific position within a body of text, so token disambiguation assigns a sense to each position in the text. A word type on the other hand is an abstract conception of a word independent of its context: word type disambiguation picks one sense of a word and applies it to the whole target text. Note that type disambiguation is equivalent to a token disambiguation where one happens to choose the same sense for all tokens of the same word type. This is more commonly called the one sense per discourse heuristic and it works well because within a consistent discourse usually only a single sense of any given word type is used (Gale et al. 1992b).

Another factor which influences the nature of a WSD study is the selection of resources used.

Supervised WSD involves the use of a corpus of examples with gold-standard sense annotations. That is, the disambiguation system has a selection of examples with known-correct sense annotations and makes use of this in deciding senses for words in unseen text.

Unsupervised WSD is the counterpoint to supervised WSD: this will typically describe research which uses a corpus of unlabelled examples to aid disambiguation; however it may also describe systems which use no corpus at all.

Knowledge-based WSD makes use of digitised knowledge about words to perform WSD. This knowledge can come from machine readable dictionaries, thesauri or deeper ontologies such as the Princeton WordNet.

Supervised disambiguation typically uses naturally occurring language data but requires additional annotation data created specifically for the research study. Compared with unsupervised data mining, it is difficult to accumulate enough correctly annotated data to perform well at supervised disambiguation of word senses. As discussed previously, this difficulty is known as the knowledge acquisition bottleneck. Although various methods have been developed to overcome this bottleneck, it remains a significant challenge for supervised WSD (Lopez de Lacalle and Agirre 2015a). On the other hand, the much larger corpora researchers can amass for unsupervised WSD are much harder to make use of because they do not contain explicit information about senses. An unsupervised corpus may approximate a sense annotated corpus by clustering examples into groups of similar usages and assigning senses to clusters. However, because words tend to strongly prefer their most frequent sense, the supervised most frequent sense (MFS) baseline tends to score very highly on evaluation metrics (compared to, for example, a random sense selection) and unsupervised WSD systems have difficulty performing better. Supervised systems, on the other hand, can approach the accuracy achieved by a human annotator working without consultation with other annotators (Navigli 2009). Since our goal in this thesis is to automate the human task of disambiguation, we will be primarily interested in knowledge based disambiguation methods, and in supervised methods where there is a possibility of generalising models beyond a lexical sample.

There also exists research into word senses which is neither supervised nor knowledge based, which focuses on identifying patterns in corpora and inferring the existence of word senses from those patterns: this is known as word sense discrimination (Schutze 1998). Although we have an interest in the nature of word senses, we will not study sense discrimination in this thesis because we are primarily interested in working with bilingual dictionaries for which senses have already been manually discriminated.

Finally, a salient characteristic of a WSD task is the nature and construction of the sense inventory used. In particular, a distinction can be drawn between fine-grained and coarse-grained sense disambiguation. A fine-grained sense inventory draws subtle distinctions between senses of each word, meaning that some senses will be closely related to each other. By comparison, a coarse grained inventory clumps together related usages into a smaller number of groups. Some implications of this are discussed in the next section.

2.2.3 Evaluation of WSD

WSD systems are typically evaluated on their ability to correctly select one sense for each target word (Resnik and Yarowsky 1999; Navigli 2009). Resnik and Yarowsky (1999) call this the exact match approach and outline several attractive alternatives Senseval might have explored. These include allowing WSD systems to produce a collection of senses with weights and scoring them accordingly, or taking into account relationships between similar senses and giving credit for answers which are almost correct. One decade on, Navigli (2009) noted that although interesting measures were investigated a little in the Senseval evaluation tasks, WSD was still primarily performed as an exact match task, and that remains the case for SemEval tasks to date (Moro and Navigli 2015), although semantic similarity measures do get attention, particularly between word types or between larger blocks of text (Jurgens et al. 2014). The most common metrics used to measure performance are given by Navigli (2009) as:

Coverage The proportion of target words for which the system produced a response.

Recall The proportion of target words for which the system produced a correct response.

Precision The proportion of responses produced by the system which were correct.

Accuracy If coverage is complete, then precision and recall are the same and simply measure the accuracy of the system.

F-score The harmonic mean of recall and precision: it can be used to compare systems which sacrifice either in order to do better at the other.
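Writing answered for the set of target words the system responded to, correct for the subset of those responses matching the gold standard, and targets for the full set of target words, these definitions amount to:

$$\mathrm{Coverage} = \frac{|\mathit{answered}|}{|\mathit{targets}|} \qquad \mathrm{Recall} = \frac{|\mathit{correct}|}{|\mathit{targets}|} \qquad \mathrm{Precision} = \frac{|\mathit{correct}|}{|\mathit{answered}|} \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$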

Although these metrics can be interpreted in isolation, a better picture is formed by comparison to reference results. Three kinds of reference are popular:

Baselines specify the performance of a trivially implemented disambiguation algorithm.

Benchmarks are competing systems which represent the state of the art. For this purpose, references to recent results from Senseval or SemEval events are often made.

Upper Bounds represent the best performance one could reasonably expect to achieve with the information given in the evaluation data.

Using baselines, benchmarks and upper bounds can lead to a more realistic interpretation of results: Figure 2.1 illustrates how using reference points can put the difference between two annotators into perspective.

[Figure 2.1: Comparison of results using an evaluation metric. Result A and result B are worlds apart: one lies near a trivial baseline while the other approaches human performance at the task. However, without reference points they can only be said to differ by around 30%.]

A common baseline is the most frequent sense (MFS) (also known as majority class) baseline: this is the performance we would achieve if we always selected the most common sense of the word. In practice the most frequent sense means the relative frequency of the dominant sense in a sense-labelled corpus, which is an approximation of the true frequency assuming the corpus is balanced relative to natural language in general or at least the target language domain. Note that this means the most frequent sense baseline is a supervised disambiguation algorithm.

The inter-annotator agreement is a commonly used upper bound. This is a measure of the proportion of sense labels on which independent human annotators agree. It is considered to represent the performance of a human on the task, and it would be unreasonable to expect an automatic algorithm to do better (Palmer et al. 2007b). Palmer et al. (2007b) reports on a number of methods for calculating inter-annotator agreement, including exact matching, overlap and the kappa statistic.

Navigli (2006) investigates the implications of the granularity of the sense inventory used for a WSD task. Figures reported from tasks in Senseval-3 give inter-annotator agreements of roughly 70% for fine grained sense inventories, but Navigli (2006) shows that grouping related senses into a coarser sense inventory significantly improved agreement between annotators. This raises the question of whether it is in fact appropriate to divide senses so finely that humans have such trouble distinguishing their usages. On the other hand, decreasing the number of senses — polysemy reduction — will naturally increase agreement, even for randomised sense labelling. However, Navigli (2006) showed, by comparison of a benchmark annotator to a random one, that use of the coarser grained sense set results in a greater increase in performance than can be explained by polysemy reduction alone.

The work of this thesis involves presentation of dictionary entries to users, so polysemy reduction would be impractical without a means of generating appropriate glosses for combined senses, which would be a significant achievement in itself. We do, however, wherever possible try to make use of evaluation schemes that move beyond the exact match criterion.

2.2.4 Contemporary challenges for WSD research

Navigli (2009) concludes his summary of the state of the art of WSD with the following future challenges:

The representation of word senses primarily focusing on the granularity of sense distinctions but also touching on how new senses are identified in changing language and on models of meaning other than enumerating senses.

The knowledge acquisition bottleneck for both supervised and knowledge-based WSD. The knowledge-based side of this focuses on building or enriching ontologies by automatically learning new relationships between words from corpora.

Domain specific WSD is the specialisation of WSD to a specific domain of discourse. Most WSD research so far has been focused on generic language or has been limited to a specific domain (such as newspaper articles) incidentally due to the nature of the corpus used.

There has been a good deal of work towards the challenges outlined by Navigli (2009) in the subsequent years. Zhang and Heflin (2010) and Erk and McCarthy (2009) both investigate grading senses as an alternative to selecting the best sense; Agirre et al. (2010) showed WSD specialised to the biomedical domain; and Ponzetto and Navigli (2010) made a high impact contribution to knowledge-based resources for computational semantics.

Erk and McCarthy (2009) investigate the use of a corpus for which human annotators have provided sense-applicability ratings (or grades) for words in context on a five point scale. They explore evaluation methods for graded sense annotations including existing metrics and new metrics based on precision and recall. These metrics are applied to automatic sense grading algorithms, traditional best-match WSD algorithms and to the original human annotators who produced the gold standard. They found that human annotators varied significantly in terms of the average rating given, which had the effect of trading precision for recall or vice-versa. Grade producing algorithms did better than best match algorithms, particularly when rating scales were normalised across available senses. As with the human annotators, a trade-off between the measures of graded precision and recall was evident.

Zhang and Heflin (2010) also pursue a sense grading approach but eschew arbitrary ratings in favour of probabilistic output. They develop a probabilistic model of senses and outline methods for estimating various parameters of it based on WordNet relations and sense frequency information. The algorithm is applicable to any application domain for which a contextual bag of words can be produced, but experiments were conducted on a specific domain: words in semantic web ontologies. Context is defined by a network-based radius for traversing relations in the web ontology. Although the aim was to produce a versatile probabilistic output, the evaluation focuses on the accuracy of the most probable sense when compared to best-match gold-standard annotations. The probabilistic model achieved 84% accuracy compared to a first sense baseline of 64.1%.

The unusual nature of the word context considered by Zhang and Heflin (2010) makes it, in a sense, a domain specific WSD application. However, a better example of WSD applied to a specific language domain is that of Agirre et al. (2010), who demonstrated disambiguation of words with senses idiosyncratic to the medical domain. As an example, Agirre et al. (2010) gives the word cold, which might refer to flu-like symptoms, body temperature or bodily sensations. A knowledge-based approach is taken: specifically, Personalised PageRank is applied to an ontological graph derived from a medical thesaurus. The resulting algorithm is applicable to all words in the thesaurus, but is evaluated on a small number of words for which gold-standard data was available. A best accuracy of around 68% is reported, which exceeded comparable benchmarks.

Chan and Ng (2006) work on adapting WSD models trained on one domain to classification on another. They adjust the outputs of a probabilistic classifier using an estimate of the sense distribution in the target domain. Their contribution is to the distribution estimation: they show that using a logistic regression to produce well calibrated sense probabilities for the target domain tokens works better than using a Naive Bayes classifier. The adjustments substantially increase the cross-domain classification accuracy.

Faralli and Navigli (2012) work on acquiring inventories of domain-specific senses. They take a higher level view of domain specific WSD, firstly identifying the relevant domain at the coarse level and then the sense of the word within that domain’s specific gloss inventory at the fine grained level. As such, their work is more general than treating domains as having additional senses or different sense priors. They extract glosses from web-pages for each domain by using heuristics to recognise formatted definitions and sourcing pages for a domain by searching for carefully selected terms unique to the domain. They construct an unsupervised semantic network by linking each extracted term to the words in its gloss and perform Personalised PageRank WSD over the resulting graph. Evaluating with a purpose built gold standard dataset, they find that precision and recall lag just behind a predominant domain classifier. They also test adjusting Personalised PageRank output with a domain relevance score, which successfully boosts the accuracy above the predominant domain classifier, although not with statistical significance.

Ponzetto and Navigli (2010) contributed a substantial development in knowledge-based approaches for WSD which finally rivalled and in some cases exceeded the performance of supervised systems. This is achieved by augmenting the semantic relations encoded in WordNet with a much richer graph of interconnections between pages of the online collaborative encyclopedia Wikipedia. Wikipedia pages cover a single nominal topic and are aligned to WordNet noun synsets by a novel automatic approach. Wherever one page links to another and both are mapped to WordNet synsets, a new edge is added to the WordNet ontology. Ponzetto and Navigli (2010) showed that even relatively simple knowledge based approaches – Extended Lesk and Degree – outperform comparable supervised systems on nouns and have comparable performance on all words. On domain specific testing corpora, these algorithms were able to significantly outperform supervised learners. Presumably this is because of the wide variety of domain specific knowledge encoded in Wikipedia.

As a final challenge for WSD, Navigli (2009) also highlights its lack of proven applications as a failing of the field and lists a number of potential applications which have been, or might be, tried. These include result disambiguation for information retrieval and extraction, machine translation, document classification, diacritic restoration and spelling correction in word processing, lexicography and the automatic construction of semantic links for the internet. One purpose of the work in this thesis is to take on this challenge by showing applications of WSD in disambiguation of word glossing for language learners. Furthermore, we will investigate ways to use user interface instrumentation and user feedback to extract knowledge resources for WSD, thereby tackling the knowledge acquisition bottleneck.

2.2.5 Recent word sense research

Many NLP tasks have seen improvements recently through the use of vector representations of word types learned by neural network models of word co-occurrence. However, when it comes to computational semantics two problems arise:

1. polysemy means that semantic relationships cannot be accurately represented by locality in the embedding space; and

2. composition of word vector representations into phrase and sentence representations remains an open problem.

Cheng and Kartsaklis (2015) tackle both of these problems simultaneously to show that sense disambiguation improves compositional models. In the basic form, sense embeddings are associated with context clusters and during training one sense is selected for each update by finding the nearest cluster centroid to the current context. A more general form selects senses using a softmax layer in the neural network itself. The compositional aspect is modelled on both word sequences, using a recurrent neural architecture, and alternatively on dependency syntax trees, using a recursive neural architecture. In the end, the disambiguated models do show significantly better results on a number of tasks than direct equivalents using ambiguous word vectors.

Bhingardive et al. (2015) developed an unsupervised method for most frequent sense detection using neural word embeddings. A vector in the embedding space was constructed for each word sense as the average embedding over a bag of words related to the sense — synset members, words in the sense gloss and words in example usages for the sense — and the sense closest to the word’s embedding is taken to be its primary sense. Using the results for most frequent sense classification, they found that they performed worse overall than a most frequent sense classifier trained on SemCor but that this was due mainly to less frequent words: on frequent words the unsupervised most frequent sense estimates were on average better than the supervised ones. Chen et al. (2015) also make use of sense glosses for the purpose of constructing sense embeddings. In this case they initialise the number of sense vectors for a word to the number of WordNet senses and initialise the vectors themselves by training embeddings that model the glosses of those senses in WordNet. The gloss-trained embeddings are then used in the initialisation of a context clustering based sense embedding training procedure. Evaluation as a component in standard semantic similarity regression tasks shows improvement of the gloss-initialised sense embeddings over existing techniques, but not under all evaluation metrics. Chen et al. (2015) hypothesise that the high level of polysemy in WordNet may be a disadvantage, given that their corpora do not attest the less common senses well.
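A sketch of that unsupervised first-sense heuristic (embed and sense_bags are hypothetical inputs standing in for a trained embedding model and WordNet-derived bags of related words):

import numpy as np

def most_frequent_sense(word, sense_bags, embed):
    """word: the target word type; sense_bags: dict mapping each sense to its
    bag of related words; embed: function from a word to its vector."""
    word_vec = embed(word)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Represent each sense by the average embedding of its bag of related
    # words, then choose the sense closest to the word's own embedding.
    return max(sense_bags,
               key=lambda s: cos(word_vec,
                                 np.mean([embed(w) for w in sense_bags[s]],
                                         axis=0)))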

Li and Jurafsky (2015) explore a “Chinese restaurant” method for building neural sense embeddings with a variable level of polysemy learned from corpus evidence. They focus in particular on the downstream effects of learning sense embeddings on the results in various NLP tasks including sentiment analysis, named entity recognition and part of speech tagging. In their analysis they take particular care to compare neural models with equal numbers of nodes in the base layer when mixing other signals in with their sense embedding dimensions. They find that sense embeddings improve the results in some tasks but not others. Understandably the sense embeddings do not cause a marked improvement in sentiment analysis, which answers a binary question about the general sentiment of a piece of text and is probably not often sensitive to ambiguity of single words. On the other hand, they do lead to an improvement in results for part of speech tagging and semantic similarity tasks. This study serves to highlight that the quality of learned sense representations often depends on the task for which they are to be used. Iacobacci et al. (2016) evaluated several strategies for use of word embeddings within the WSD task. They found that using word embeddings as features representing local context leads to substantial improvements to state of the art supervised models of WSD.

Whether for supervised learning or for evaluation of unsupervised systems, annotated corpora are essential to the study of word senses and disambiguation, and there has been interesting work recently into the annotation process itself. Sometimes in resource construction tasks professional annotators choose to assign more than one sense label to a usage: Jurgens et al. (2014) studied usage examples where professional annotators have chosen multiple senses to classify reasons for multiple assignment. The majority were cases of incomplete disambiguation due to insufficient context being included in the sample usage. However, many were found to be cases of a word performing a different semantic role for different dependants in the sentence. Jurgens et al. (2014) also examines the relationship of the multiple selected senses to each other, finding in the largest number of cases that the differences between the senses make little difference to the meaning of the sentence — a coarser, more general sense could perhaps replace them — but also finding usages where two senses could be simultaneously true and cases where they are mutually contradictory. Where multiple humans create sense annotations for a corpus, annotator disagreement is often treated as something to be resolved by negotiation or majority vote, with measurements of disagreement used as a metric for the difficulty of the WSD task. However, Passonneau and Carpenter (2014) explore accounting for disagreement by building a joint statistical model of annotators — including individual accuracy and bias — and word sense annotations. As a result, they are able to produce a confidence level for annotations, and in some cases are able to place high confidence on annotations for which there was a low level of agreement. Jurgens et al. (2014), Passonneau and Carpenter (2014) and Lopez de Lacalle and Agirre (2015b) all look at multiple annotation of word senses from a different perspective, but taken together they highlight the ongoing problem of word usages that don’t quite line up with the assumption that they should be interpreted as having a single correct sense.

2.2.6 A review of PageRank for word sense disambiguation

The crowdsourcing strategy presented in this thesis involves collecting information from a text glossing application where human interaction with word senses naturally arises. A consequence of the strategy is a lack of control over the specific documents and words which we collect information about. As such, the value of any resources or algorithms we use will be limited by their coverage or generalisability. Knowledge-based methods have a tendency to lack accuracy relative to supervised systems when it comes to WSD, but have the advantage of extending coverage to all of the words in an encoded lexicon. In this thesis we focus in particular on Personalised PageRank (PPR), which has shown strong performance as a WSD method in recent years (Agirre et al. 2014). In this section we review background material and related work on the Personalised PageRank algorithm. We then review its application to the task of WSD through the use of a lexical knowledge base. Finally, we compare the PPR approach to other LKB methods for WSD.

The PageRank algorithm

PageRank started life as a component in the Google search engine (Brin and Page 1998).

Its purpose was to improve the relevance of web pages returned in search results by using links to a page as a means of measuring the “authority” of that page. The method, detailed in Page et al. (1998), is defined in terms of assigning weight to web pages and then iteratively transferring weight along outgoing links until a steady state is found. More concretely, the algorithm terminates when, on each iteration, each page receives approximately the same weight from its in-links as it imparts to its out-links. The weight imparted is proportional to the node weight, so nodes with a high concentration of in-links end up with higher final weights overall, and this effect accumulates transitively. In order to prevent mass from concentrating in closed loops, a damping factor is applied to the transfer and a constant term is added to boost the weight of each page on each iteration.

Although defined in terms of an iterative algorithm, Page et al. (1998) also formalise PageRank in

statistical terms as the time spent on each page during a random walk. They describe the model in

terms of a web surfer moving across the web by following a randomly selected link on each page

visited. With the constant term added, the random walk includes a probability at each iteration of

ignoring the link structure of the page and navigating directly to a randomly selected page on the

web.
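To make the iterative formulation concrete, the sketch below implements the power-iteration form of PageRank over an adjacency-list graph. The damping factor alpha and the uniform (1 - alpha)/N boost correspond to the damping and constant terms described above; the function name, convergence tolerance and dangling-node handling are our own simplifications, not details fixed by Page et al. (1998).

    def pagerank(out_links, alpha=0.85, tol=1e-8, max_iter=100):
        """Power-iteration PageRank. out_links maps each node to the list
        of nodes it links to; every linked node must appear as a key."""
        nodes = list(out_links)
        n = len(nodes)
        rank = {u: 1.0 / n for u in nodes}
        for _ in range(max_iter):
            # Constant boost: the random jump taken with probability 1 - alpha.
            new = {u: (1.0 - alpha) / n for u in nodes}
            for u in nodes:
                # A dangling page spreads its weight uniformly over all pages.
                targets = out_links[u] or nodes
                share = alpha * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            if max(abs(new[u] - rank[u]) for u in nodes) < tol:
                break
            rank = new
        return new

For example, on the tiny graph {"a": ["b"], "b": ["a", "c"], "c": ["a"]} the ranking concentrates on page a, which both other pages link to.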

In the PageRank algorithm the standard choice for the constant boosting factor is to make it uniform across all pages. However, Page et al. (1998) also discuss the application of the constant term to select subsets of the pages, such as the page at the domain root, as a mechanism to mitigate abuse of the algorithm. In the random surfer conception of PageRank, this corresponds to random jumps being distributed according to the non-uniform weight assigned to pages. For example, if only domain roots receive weight, when a random surfer makes a random jump they can only land on a domain root. The idea of a non-uniform constant term has carried forward into the concept of Personalised PageRank, which, as we will see, has significant advantages over the standard PageRank when used with a semantic knowledge base for word sense disambiguation.

In one early application of a non-uniform constant term, Haveliwala (2002) produced rankings specialised to a set of predetermined topics. For each topic, pages related to that topic were selected to receive the constant mass. In the random walk interpretation this corresponds to the surfer making random jumps to the set of preselected topical pages, but still following links to anywhere on the web. It makes the algorithm give weight to links relevant to a search topic, whilst still getting information on page authority from the whole web hyperlink graph.

The standard PageRank algorithm uses a uniform distribution for contributing a page’s weight along its outgoing links. As with the random jump weight vector, non-uniform distribu- tions across each page’s outgoing link structure have been investigated. Richardson and Domingos

(2001) perform query-specific PageRank by weighting both the random jump term and the outgoing link weight distribution according to the target page’s relevance to a search query. They took the interesting approach of computing and storing a PageRank vector for each of approximately one hundred thousand search terms, to avoid costly computation at query time. Significant compression was achieved in storing the ranking for each term by only storing the PageRank score for pages containing the term. Time savings were also achieved in PageRank calculation by eliminating pages with a zero relevance score for the search term, since they can never receive mass. In practice this meant that only pages containing the current search term were included in the PageRank calculation. This is unfortunate as it means that authoritative survey and directory pages which happen not to contain a specific query term are excluded from the computation and cannot convey their authority along outgoing links.

relevance based ranking. The custom PageRank scored 20-35% better in user ratings.

Evaluating PageRank customisation in terms of user ratings has many pitfalls. For one,

the reliance on human judgements makes evaluation data costly to produce and unreliable by virtue

of its subjectivity. Secondly, PageRank itself was primarily designed to assess the level of deference

the web community affords a page, not its relevance to a search query, so for any evaluation based

on query results it must be combined with a relevance metric: the results may be significantly

influenced by the choice of metric. Al-Saffar and Heileman (2007) describe an interesting way of

evaluating PageRank customisation that evades these pitfalls. They evaluate how successfully a

ranking has been customised by measuring the overlap of its top-ranking pages with the set of top-

ranking pages from the standard PageRank. They also compare the overlap between top-ranking

pages of different customisations. If the overlap is high, they conclude that customisation has been

unsuccessful. They divide their experiment into two parts:

Topic sensitive PageRank where a substantial number of pages nominally covering a certain topic

are given non-zero weights in their constant term.

Personalised PageRank where only a few pages receive non-zero weights in their constant term.

These may be sourced, for example, from a user’s web page bookmarks.

Note that, unlike the method of Richardson and Domingos (2001), all pages participate in both topic

sensitive PageRank and Personalised PageRank. This is because the out-links retain the standard

uniform weight distribution, so influential pages with a random jump constant of zero still receive — and impart — weight through the network. Al-Saffar and Heileman (2007) found that the damping factor was a strong determinant of the overlap in top results between the standard PageRank and

custom PageRank results. For a damping factor α of 0.85, the overlap of the top-100 pages for a

Personalised PageRank with the standard PageRank was up to 36%. For topic sensitive PageRank they found a lower level of overlap, but did find a high representation of the topic seed pages in the non-overlapping portion of the top ranked page list. Topic sensitive PageRank proved to be consistent in the sense that similar sets of seed pages produced similar final rankings. Their final conclusion that Personalised PageRank does “not perform too well in terms of the ability to discover new pages not already returned by its global variant” (Al-Saffar and Heileman (2007) p5) seems overly pessimistic — over 60% of the top 100 pages in a personalised rank are novel — but the results do establish bounds on the divergence between global PageRank and its customised variants.

PageRank for WSD

Mihalcea et al. (2004) make the observation that PageRank could be combined with a lexical knowledge base to find the central influential concepts for a piece of text. Their method requires a knowledge base of concepts linked by semantic relations. Given a piece of text, they identify all concepts in the lexical knowledge base linked to words in the text. They then build a PageRank graph out of those concepts, forming edges from the direct semantic relationships between them as found in the knowledge base. The PageRank result can be used to perform all words WSD by selecting the most influential sense for each word. Mihalcea et al. (2004) evaluated their PageRank

WSD algorithm using WordNet as the knowledge base and both SemCor and Senseval-2 as the evaluation data. They found it performed better than its contemporary state of the art Lesk method and substantially better than the most frequent sense baseline. When extracting relations from the knowledge base to form a PageRank graph, Mihalcea et al. (2004) used two notable heuristics:

1. Where two related concepts in the knowledge base are also competing senses for a word in the

text, the relation between them is ignored. This is to prevent them from mutually reinforcing

each other in a closed isolated loop.

2. Synthetic edges were derived by composing the hypernym and hyponym relations to form a

co-hyponym relation between siblings such as the primary senses of dog and wolf.

The inclusion of two-step edges introduces a small amount of information from the full graph of the

knowledge base. However, ultimately this method limits itself almost completely to direct edges

between concepts referenced in the text and excludes consideration of almost all longer paths that

pass through unreferenced concepts.
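The graph construction itself is straightforward to sketch. In the fragment below, senses(word) and related(c1, c2) are assumed interfaces to the lexical knowledge base; the first heuristic above appears as the check that skips edges between competing senses of the same word, and the pagerank sketch given earlier then ranks the candidates. This illustrates the general shape of the method and is not a reimplementation of Mihalcea et al. (2004).

    def mihalcea_style_wsd(words, senses, related):
        """Build a graph over the candidate senses of the words in a text,
        with edges for direct knowledge-base relations, then rank senses
        with static PageRank."""
        candidates = {w: senses(w) for w in words}
        graph = {c: [] for cs in candidates.values() for c in cs}
        for w1, cs1 in candidates.items():
            for w2, cs2 in candidates.items():
                if w1 == w2:
                    continue  # heuristic 1: no edges between competing senses
                for c1 in cs1:
                    for c2 in cs2:
                        if related(c1, c2):
                            graph[c1].append(c2)
        ranks = pagerank(graph)
        return {w: max(cs, key=lambda c: ranks.get(c, 0.0))
                for w, cs in candidates.items() if cs}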

Subsequently, methods were developed that build a larger knowledge base subgraph for

the purposes of WSD. Agirre and Soroa (2008), for instance, find the shortest paths between all

synsets of words in the text, and build a subgraph containing all the nodes and edges in those paths.

Agirre and Soroa (2009), however, developed a method to make use of the full knowledge base graph

to do all words WSD for a piece of text. Their answer was to use a Personalised PageRank, giving

the random jump weight to only the synsets of words in the text, but leaving the full link structure

of the knowledge base intact. Naturally PageRank on a whole graph is designed to identify the

most globally influential nodes, so it pays to question whether a full graph Personalised PageRank

will customise the ranking enough to re-order senses of a word from the global ordering. However,

the results of Richardson and Domingos (2001) did show that personalisation of PageRank results

in long distance re-orderings of the ranking. In fact, there was a tendency for the weighted nodes

to rise into the top 100 overall, so there is reason to be confident that the discriminatory power of

Personalised PageRank is high. In any case, should Personalised PageRank fail to diverge far from

PageRank, the global result represents a measure of concept centrality which makes for a reasonable

baseline heuristic, since the static PageRank correlates more highly with the most frequent sense

baseline than Personalised PageRank does. To evaluate their Personalised PageRank method Agirre

and Soroa (2009) used several versions of the WordNet knowledge base together with evaluation

data from Senseval-2 and Senseval-3. When building the Personalised PageRank graph they did

not only use the explicit semantic relations in WordNet but also extended relations derived from

automatic and manual sense annotations on synset glosses. They did not compare the extended

relation sets to Personalised PageRank over just the explicit relations so it is difficult to say exactly what effect their inclusion had. However, sense annotations on synset glosses do represent a loose semantic relatedness between concepts so intuitively they should benefit the Personalised PageRank algorithm in terms of flowing weight to senses of words that are evoked by words in context. Unlike

Mihalcea et al. (2004), Agirre and Soroa (2009) do not by default include a mechanism to prevent two related senses of a word from mutually reinforcing each other to gain dominance regardless of context. Instead they propose a strategy which they term Personalised PageRank-w2w whereby only a single word is disambiguated at a time and its senses alone are not given weight in the

PageRank personalisation. In the final results, Agirre and Soroa (2009) found that their method performed below the most frequent sense baseline, which was all but universal for unsupervised methods at the time (Navigli 2009), but did perform better than previous knowledge based methods of WSD. Specifically, they found that Personalised PageRank performed as well as the subgraph

PageRank of Agirre and Soroa (2008), with the advantage of not relying on a subgraph building heuristic. Personalised PageRank-w2w generally performed better, though the differences were not statistically significant. Agirre and Soroa (2009) also showed that their method was transferable to knowledge bases in other languages and made available a software package called UKB (see http://ixa2.si.ehu.es/ukb/) for

Personalised PageRank over any WordNet based knowledge base.
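Relative to the static algorithm sketched earlier, the only change Personalised PageRank requires is the teleport distribution. The fragment below is a hedged sketch of this idea, concentrating the random jump weight on the synsets of the context words while walking the full knowledge base graph; it is an illustration, not a description of the actual UKB implementation.

    def personalised_pagerank(out_links, teleport, alpha=0.85,
                              tol=1e-8, max_iter=100):
        """teleport: node -> jump probability (summing to 1), here non-zero
        only for synsets of the words in the text being disambiguated."""
        nodes = list(out_links)
        rank = {u: teleport.get(u, 0.0) for u in nodes}
        for _ in range(max_iter):
            new = {u: (1.0 - alpha) * teleport.get(u, 0.0) for u in nodes}
            for u in nodes:
                if not out_links[u]:
                    continue  # sketch: dangling mass is simply dropped
                share = alpha * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new[v] += share
            if max(abs(new[u] - rank[u]) for u in nodes) < tol:
                break
            rank = new
        return new

    # The w2w variant repeats the computation once per target word w,
    # rebuilding the teleport vector with w's own senses excluded:
    #   context = [s for v in words if v != w for s in senses(v)]
    #   teleport = {s: 1.0 / len(context) for s in context}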

Comparison to other knowledge-based methods for WSD

Ponzetto and Navigli (2010) contributed greatly to network methods for WSD with the construction of WordNet++, which extends the WordNet concept network using relations extracted from the collaboratively edited online encyclopedia Wikipedia. Ponzetto and Navigli (2010) unified

WordNet senses with Wikipedia pages by matching web page titles to lemmas and page content to corresponding concepts. For polysemous words, a WordNet sense was selected for the Wikipedia

page using a Lesk-like word overlap method. The WordNet relations between concepts are then

augmented with additional relations based on the rich hyperlinking between Wikipedia pages and

on other page metadata. Ponzetto and Navigli (2010) ran evaluations of some simple graph based


WSD algorithms using the enriched relations of WordNet++. The methods they used were:

Simplified Extended Lesk which represents each sense of a word by the bag of words in its gloss

and the glosses of all concepts one hop away in the WordNet++ semantic graph. WSD is

performed by maximising the overlap between the concept representation and the words in

the disambiguation context.

Degree centrality (Degree) which extracts a subgraph from WordNet++ by the union of nodes up

to a fixed distance from senses of words in the text. Degree centrality then selects senses with

the maximum out-degree in the subgraph for each lemma.

For the degree centrality method, Ponzetto and Navigli (2010) found that the density of the Word-

Net++ relations was initially a liability, because of the existence of Wikipedia hyperlinks between weakly related concepts. We note that a Personalised PageRank algorithm should be robust to such

links since it assigns importance based on a probability space of all paths and so is not overly in-

fluenced by a single short path. However, in this instance Ponzetto and Navigli (2010) mitigated

the problem by filtering the relations derived from Wikipedia using a tuned cut-off parameter for a Lesk-like word overlap measure of page metadata. The final evaluation results rivalled those of supervised WSD at the time, which is a significant achievement for a knowledge-based method.
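A sketch of the Degree method under assumed interfaces (senses(word) for candidate concepts, neighbours(concept) for WordNet++ relations) may clarify its shape; the breadth-first depth bound stands in for the fixed distance mentioned above and is an illustrative choice, not the published parameter.

    from collections import deque

    def degree_wsd(words, senses, neighbours, max_depth=3):
        """Grow a subgraph around all candidate senses, then pick the
        sense of each word with the highest degree inside the subgraph."""
        seeds = {c for w in words for c in senses(w)}
        subgraph = set(seeds)
        frontier = deque((c, 0) for c in seeds)
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_depth:
                continue
            for nb in neighbours(node):
                if nb not in subgraph:
                    subgraph.add(nb)
                    frontier.append((nb, depth + 1))

        def degree(c):
            return sum(1 for nb in neighbours(c) if nb in subgraph)

        return {w: max(senses(w), key=degree) for w in words if senses(w)}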

Navigli and Ponzetto (2012) extend the methodology of WordNet++ to create a multilingual rich knowledge base called BabelNet. They leverage the fact that Wikipedia has editions in a great many languages and editors include links from pages to their equivalent in other languages.

Although the pages are not direct translations of each other, the title words may be considered to represent the same concept. In BabelNet, Wikipedia is treated as a concept graph where pages represent concepts and the page title is a word in the lexicon. Page titles in Wikipedia follow a convention

whereby an ambiguous term is followed by a disambiguating term in parentheses. For example,

Java is used for the island in Indonesia of that name whereas Java (programming language) is

used for the object-oriented programming language. In BabelNet only the part outside the parentheses is used, so words such as Java are polysemous, mapping to two separate concepts. Navigli and Ponzetto (2012) construct concepts in BabelNet by forming multilingual synsets containing

WordNet lemmas and Wikipedia page titles from different languages. WordNet lemmas are linked to Wikipedia headwords using the method developed for WordNet++ with some improvements to the page-synset similarity measure. Multilingual Wikipedia titles are joined in by means of manual interlingual hyperlinking provided by Wikipedia’s editors. Additional multilingual members for synsets were discovered by use of automatic machine translation:

• Starting with a WordNet synset, Navigli and Ponzetto (2012) find 10 example sentences in

SemCor and apply machine translation. If at least three instances translate to the same word

then that word is added to the BabelNet synset.

• Similarly, starting with a Wikipedia page, they translate example sentences in Wikipedia which

cross-link to that page and add the translated link text to the BabelNet synset for the page.

Thus BabelNet contains approximately 3 million synsets that originated as WordNet synsets, Wikipedia pages or both. The synsets contain words originating as WordNet lemmas and Wikipedia page titles in multiple languages, forming a total of around 26 million word-synset sense pairings. Relations are taken from the semantic relations of WordNet and from hyperlink structure and other metadata in Wikipedia and include around 350 thousand from WordNet itself, 617 thousand from sense labels on WordNet glosses, over 50 million from English Wikipedia and almost 20 million more from five other language editions of Wikipedia. Subsequently, BabelNet-2.0 has expanded to 50 languages and incorporates concepts and relations from additional resources (Ehrmann et al. 2014).

Navigli and Ponzetto (2012) use BabelNet-1 to perform WSD, evaluating several subgraph-based algorithms using the SemEval–2007 coarse grained word sense disambiguation task data. Their method of extracting the subgraph is to identify all senses of words in a text context and perform a graph search out to a fixed distance L from the contextual synsets. The final context

subgraph consists of all nodes which lie on a path of length at most L that connects two contextual

synsets. However, for WSD, they filter out paths that connect senses of the same word and filter out ‘weak’ relations according to a similar heuristic to that used by Ponzetto and Navigli (2010) for their degree centrality WSD implementation. They then perform WSD by applying several different network algorithms to the subgraph:

Degree centrality (degree) as in Ponzetto and Navigli (2010).

Inverse path length sum finds all paths p from a sense to other senses in the context and sums

$\sum_{p} \frac{1}{e^{\mathrm{length}(p) - 1}}$

The highest scoring sense of each word wins (a sketch of this score appears after this list).

Path probability sum is like the inverse path length sum but uses the same weights that filtered

out ‘weak’ relations to scale path lengths. The score for each path is the product of its edge

weights.

Static PageRank performed over the context subgraph is the final method used.
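As a sketch, the inverse path length sum can be computed directly from the formula above by enumerating bounded-length simple paths; here networkx's path enumeration is used for brevity, with the graph, the candidate sense and the other contextual senses as assumed inputs.

    import math
    import networkx as nx

    def inverse_path_length_score(graph, sense, other_senses, max_len):
        """Sum 1 / e^(length(p) - 1) over all simple paths from `sense`
        to the candidate senses of the other context words."""
        score = 0.0
        for target in other_senses:
            for path in nx.all_simple_paths(graph, sense, target, cutoff=max_len):
                edges = len(path) - 1  # path length in edges
                score += math.exp(-(edges - 1))
        return score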

Navigli and Ponzetto (2012) compare the results of their evaluation to the same algorithms run using

WordNet alone and to several supervised and unsupervised submissions to the SemEval–2007 coarse grained all words task. Their algorithms using BabelNet did substantially better than WordNet and

significantly better than competing systems. Ponzetto and Navigli (2010) previously showed an f1-

score of 79.4 on nouns for this task using the degree centrality method over WordNet++. Navigli and

Ponzetto (2012) show that using BabelNet improves this result to 82.5. A semisupervised variation

using a most frequent sense backoff with the path length method achieved their best result with an

f1-score of 85.4. They also showed that a static PageRank over the subgraph performed worse than

any of their other network methods at 80.7 over nouns. However, Agirre et al. (2014) subsequently

showed an f1-score of 83.6 on nouns using Personalised PageRank-w2w over WordNet alone on the same dataset. They hypothesise that applying Personalised PageRank-w2w to WordNet++ would yield even better results.

Agirre et al. (2014) also include an interesting exploration of subgraph methods and Personalised PageRank methods on their WordNet based knowledge base. Using the subgraph construction of Ponzetto and Navigli (2010) and Navigli and Ponzetto (2012), they found that Person-

alised PageRank and Personalised PageRank-w2w over a subgraph performed better on a variety of

datasets than degree centrality and a selection of other network algorithms. Interestingly they also

found that global Personalised PageRank-w2w only sometimes performed better than subgraph Per-

sonalised PageRank methods; on some datasets it performed numerically worse (though not with a

statistically significant difference). Putting aside the question of whether or not to extract a subgraph,

however, it seems that Personalised PageRank methods are superior to path methods such as degree

centrality when performed on the same graph.

Moro et al. (2014) used BabelNet to create a joint solution to the tasks of WSD and entity

linking, which they call Babelfy (http://babelfy.org). Entity linking is the task of identifying an entry in an encyclo-

pedic resource referenced by a potentially ambiguous term in text such as a person or place name.

They use a concept network derived from BabelNet1.1.1 as a basis for a contextual-subgraph disam-

biguation algorithm. The concept network is a dense graph between BabelNet synsets designed to

reduce the impact of missing relations and weak relations. It is built by performing a Personalised

PageRank for each concept individually and linking it to the concepts whose final weight exceeds a

threshold. The edge transition probabilities for the Personalised PageRank random walks are made

proportional to the number of 3-cycles the edge participates in to bias the walk to remain within

clusters of concepts. To achieve high coverage in arbitrary text, Moro et al. (2014) use an interest-

ing multiword token identification algorithm. To identify BabelNet entries in text they first tokenise

and part of speech tag, extracting 1- to 5-grams that contain a noun. They then retrieve all BabelNet

entries which contain one of the n-grams as a substring. This process allows identification of candi-

date interpretations as pairings of BabelNet concepts with text fragments. For example, BabelNet’s

Mario Gomez would be paired with the 1-gram Gomez in a text. Candidate interpretations are then

linked in a graph using the edges between concepts in the constructed concept network. This is

similar to a contextual subgraph extraction method except that concepts can appear more than once if they are linked to more than one text fragment. Finally, they use a network density algorithm to score concept relevance for each text fragment. Considering the example sentence Thomas and

Mario are strikers playing in Munich, by performing WSD and entity linking together Moro et al.

(2014) are able to correctly link the identities of the players and the senses of striker and play via their common connections through the sport soccer. Due to BabelNet’s coverage of both concepts

and named entities in many languages, this algorithm can be applied to many existing task datasets

for WSD and entity linking. Moro et al. (2014) evaluated their algorithm against the state of the

art on many tasks including SemEval-2013 multilingual WSD and SemEval-2007 fine- and coarse-

grained all words English WSD. On most tasks Babelfy performed better than, or had no statistically

significant difference from, existing state of the art systems on the same task. This included super-

vised systems and, for multilingual WSD, an implementation of Personalised PageRank running on

the complete BabelNet1.1.1 semantic graph, which is richer in concepts and relations than Word-

Net++. On the SemEval-2007 coarse grained task Babelfy achieved an f1-score on nouns of 84.6, coming between the previously mentioned results of BabelNet subgraph degree centrality with most frequent sense backoff (85.4; Ponzetto and Navigli 2010) and Personalised PageRank-w2w over WordNet (83.6; Agirre et al. 2014), though not with a statistically significant difference. On the

entity linking task Personalised PageRank-w2w performed much worse than Babelfy, even on the

full BabelNet graph.
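The triangle-biased transition probabilities used in the concept network construction described above can be sketched in a few lines: in an undirected graph, the number of 3-cycles an edge (u, v) participates in equals the number of common neighbours of u and v. The add-one smoothing that keeps zero-triangle edges traversable is our own simplification.

    def triangle_transition_probs(adj):
        """adj: node -> set of neighbours (undirected). Returns, for each
        node, transition probabilities biased towards edges that sit
        inside densely interconnected clusters of concepts."""
        probs = {}
        for u, nbrs in adj.items():
            # 3-cycles through (u, v) = common neighbours of u and v.
            weights = {v: len(nbrs & adj[v]) + 1 for v in nbrs}
            total = sum(weights.values())
            probs[u] = {v: w / total for v, w in weights.items()}
        return probs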

An error analysis by Moro et al. (2014) comparing Personalised PageRank-w2w to Ba-

belfy identified Babelfy’s emphasis on finding sense and entity annotations which support each other

as the reason for its success. We note that, by contrast, in the output of the standard Personalised PageRank and Personalised PageRank-w2w methods for WSD, the low-ranked senses have made a contribution by way of random jump weighting equal to that of the high-ranked senses. In essence, Ba-

belfy uses the Personalised PageRank precalculation to isolate the contribution each potential sense

makes to each other potential sense and then searches for a highly cohesive subset. This is suggestive

of a more general strategy: given the isolated Personalised PageRank for each concept in context,

find a subset of those concepts which produce an ‘optimal’ full-context Personalised PageRank.

The definition of ‘optimal’ could be an interesting topic for further research.

2.3 Multiword expressions

In Section 2.1.3 we introduced multiword expressions as lexicalised expressions comprising multiple shorter words. In this thesis, our treatment of MWEs is focused on identification of usages in text of MWE types found encoded in a variety of lexical resources. As such we implicitly assume that the placement of the expression in the resource, generally by a human editor, can be taken as a judgement that the expression has been lexicalised. In this section we review in more detail corpus and lexical knowledge base studies of MWEs including identification of types and tokens.

2.3.1 Type extraction

A longstanding problem being tackled in computational MWE research has been the identification of expression types from corpus evidence. If an expression has indeed been lexicalised or has some kind of idiosyncratic behaviour then it should be distinguishable in its usages from ordinary compositional language. A large variety of methods and information sources have been explored for the identification of MWE types, a few of which were already discussed in Section 2.1.3.

In this section we give a few recent examples.

Motivated by the need for construction of computational lexicons of MWEs, Tsvetkov and Wintner (2014) pull together a variety of linguistically motivated features of two word sequences to build a probabilistic graphical model of MWEs and their usage distributions. They fit the model to a reference lexicon of MWE and non-MWE bigrams built with an unsupervised method using evidence from word alignment successes and failures in a small bilingual corpus. The model is a Bayesian network of variables describing features aggregated over instances of a bigram in a monolingual corpus. Specifically, Tsvetkov and Wintner (2014) define a number of distributional lexico-syntactic and context-semantic features such as the shape of the sorted histogram of inflection patterns seen on a candidate MWE’s usages and a similar representation of the diversity of words seen in the local context. The Bayesian network is then defined with the MWE/non-MWE status label as an underlying variable for all features, but with additional hand-crafted dependencies between specific feature pairs. The results of this method were compared via 10-fold cross validation to the results of using an SVM model and to the results of using a Bayesian network with a learned rather than hand-crafted structure. The model with the manual structure performed significantly and substantially better than the other methods, managing to distinguish MWE types from non-MWE bigrams with an accuracy of around 80%. It should be noted that this accuracy is relative to positive and negative test labels generated using an unsupervised method from bilingual aligned corpora. It would be interesting to see the accuracy on a true gold standard of positive and negative examples of MWEs.

Riedl and Biemann (2015) take a purely distributional semantic approach to MWE type extraction. They start with a distributional thesaurus — a mapping from n-grams to the 200 most

similar m-grams (of mixed lengths) as measured by the distribution of adjacent words. MWE

candidates are then scored according to the proportion of similar m-grams that are single words.

The idea is that expressions representing a single concept will be more similar to single words and

that compositional expressions will be more similar to longer n-grams. A second measure is used

to penalise incomplete expressions: after building a shortlist of words found adjacent to a MWE

candidate, the proportion of tokens taken up by the most common adjacent word is computed.

If that proportion is high, it is an indication that the candidate forms part of a longer expression

comprising the additional word. These two measures are combined into a ranking function, and the

ranks are scored using average precision against a gold standard list of MWEs annotated in labelled

corpora. The method was tested using a POS based MWE candidate filter and without. With POS

filtering, the performance on a small test corpus was very high, with average precision over 90%.

The method didn’t fare as well without POS filtering or on the larger corpora, but in all cases over

50% (and in a majority over 60%) of the top 500 ranked MWE candidates were deemed correct.
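Both measures reduce to simple ratios once the inputs are available. In this sketch, thesaurus maps a candidate expression to its list of most similar entries and neighbour_counts records how often each word is seen adjacent to the candidate; both structures are assumed for illustration.

    def single_word_similarity(candidate, thesaurus):
        """Proportion of a candidate's distributionally similar entries
        that are single words: high values suggest a unitary concept."""
        similar = thesaurus[candidate]
        return sum(1 for entry in similar if " " not in entry) / len(similar)

    def incompleteness_penalty(neighbour_counts):
        """Proportion of adjacent-word tokens accounted for by the single
        most frequent neighbour: high values suggest the candidate is
        part of a longer expression."""
        total = sum(neighbour_counts.values())
        return max(neighbour_counts.values()) / total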

Lossio-Ventura et al. (2014) stack two different ranking methods for MWE term extraction, in this case for discovery of multiwords specific to the medical domain. Candidates are extracted using a POS tagger and a fixed list of POS patterns. They then parse the head MWEs in a medical domain dictionary to learn the 200 most common syntactic structures for terms in the domain.

With these they associate a probability with candidate MWEs according to how often they match the expected structures in text. They additionally use a measure called the C-value for the candidate

MWE (Frantzi and Ananiadou 1999), which is proportional to its frequency but discounted by the mean frequency of larger terms in which it is contained, and tf-idf as a measure of distinctiveness.

The first ranking is based on the product of these three measures. A re-ranking is then applied after

taking the top-k terms. The re-ranking is based on a term co-occurrence graph where edges are

made between terms when their mutual Dice co-efficient exceeds a minimum threshold. Terms are

re-ranked to higher positions if they have fewer terms within a distance of two edges in the graph

(technically, all neighbours’ neighbours are counted, including duplicates). The intuition is that

terms co-occurring with too many other terms are too general to be considered domain specific. As

such, this measure would not necessarily apply if extracting MWEs for a general purpose lexicon.

The first, more general ranking was fairly successful, with a precision over 80% on the top 100 re-

sults and an improvement of over 10 percentage points over comparable baselines. The graph based

re-ranking also proved successful, though it requires tuning of the cut-off for the Dice co-efficient and

is specialised to the domain-specific gold standard.
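Following the description above, the C-value of Frantzi and Ananiadou (1999) can be sketched as below; freq maps candidate terms to corpus frequencies, and the substring containment test is an illustrative simplification of proper term nesting.

    import math

    def c_value(term, freq):
        """Frequency scaled by the log of the term's length in words,
        discounted by the mean frequency of longer terms containing it.
        Intended for multiword terms (two or more words)."""
        containers = [t for t in freq if term in t and t != term]
        discount = (sum(freq[t] for t in containers) / len(containers)
                    if containers else 0.0)
        return math.log2(len(term.split())) * (freq[term] - discount)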

2.3.2 Compositionality of multiword expression types

Although we define a MWE as a lexicalised unit, it is not a contradiction to describe some

MWE types as having compositional semantics: an expression may be lexico-syntactically rigid

while still having a meaning that can be described in terms of its constituents’ literal (or, under some

definitions, figurative) meanings. Reddy et al. (2011) and Cordeiro et al. (2016) study the properties of crowdsourced numerical ratings of compositionality of MWEs. Both found that the mean compositionality scores given by human annotators to the selected expressions were fairly evenly distributed across the available range, indicating that MWEs do exist on a continuum of compositionality or literality. Reddy et al. (2011) explore two models of measuring compositionality from co-occurrence distributions:

• calculating the similarity between each constituent’s distribution and the expression’s dis-

tribution to represent the constituent’s literality, then composing the constituent’s literality

figures to predict the expression literality.

• composing the co-occurrence distributions of the constituents and comparing the composition

to the expression’s actual distribution.

A number of composition functions were tried, but linear composition of the co-occurrence distributions was found to best approximate the human compositionality judgements.
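The second model reduces to a few vector operations once co-occurrence vectors are available. A sketch, with vec an assumed lookup from a constituent or whole expression to its co-occurrence vector, using the additive (linear) composition that performed best:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def compositionality_score(expression, constituents, vec):
        """Similarity between a linear composition of the constituents'
        co-occurrence vectors and the expression's own vector; higher
        values indicate a more compositional expression."""
        composed = np.sum([vec(c) for c in constituents], axis=0)
        return cosine(composed, vec(expression))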

In a recent approach to distributional modelling of MWE compositionality, Yazdani et al.

(2015) explore composition functions for vector space models of lexical semantics and then measure

MWE compositionality via their conformance with the models. They use a neural word embedding representation of both words and multiwords (trained as if they were single tokens) and fit a variety of regression models to learn a composition function between the vectors of two constituents and the vector of the MWE. The composition functions include linear projection (that is, a parameter per input-output dimension pair rather than the weighted vector sum of Reddy et al. (2011)) and quadratic projection, both with and without regularisation. They also trained a neural network composition model with one hidden layer. Predicting compositionality of specific MWEs by model error, the regularised polynomial regressions performed best; however, almost all the learned models performed substantially better than a vector sum benchmark. A small increase in performance is then achieved by training models with parameters for composition and subsequent decomposition back into constituent embedding vectors. Finally, they repeated the model training with an iterative algorithm that predicts a fixed number of non-compositional MWEs after each regression and leaves them out of the next regression iteration. The idea is that non-compositional MWEs are not appropriate for use in training a model of compositionality. Again, the results improve somewhat over the base models.

A variety of methods have been developed for prediction of compositionality of MWE types. Salehi and Cook (2013) describe a method for measuring compositionality of MWEs by examining translational equivalents in the large phrase translation inventory PanLex (Baldwin et al.

2010; Kamholz et al. 2014). Their basic idea is that MWEs translated word-for-word are more

likely to be semantically compositional. They compare translation entries for MWEs as a whole

to translations of the individual words, accounting for morphological variation by using a variety

of string difference metrics rather than strict word equality. They found that difference metrics

which boost longer contiguous matches were more effective than character alignment metrics such

as Levenshtein distance. The overall extraction efficacy was competitive with and complementary

to distributional techniques.
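The core comparison can be sketched as follows, with translate an assumed PanLex-style lookup returning a translation string, and difflib's SequenceMatcher standing in for the contiguous-match-boosting string metrics they found most effective.

    from difflib import SequenceMatcher

    def word_for_word_score(mwe_words, translate):
        """Similarity between the translation of the whole MWE and the
        concatenated translations of its constituents; higher values
        suggest a more compositional expression."""
        whole = translate(" ".join(mwe_words))
        parts = " ".join(translate(w) for w in mwe_words)
        return SequenceMatcher(None, whole, parts).ratio()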

In the crowdsourcing study mentioned at the beginning of this section, Cordeiro et al.

(2016) performed a detailed analysis of inter-annotator variation. They found that the quality of

annotations is best improved by filtering out individual outlier annotations rather than excluding annotators wholly. Additionally, they found that most outlier annotations appear for MWEs in the

middle of the compositionality scale (accounting for the ends of the scale not permitting a two-tailed

distribution), suggesting that although a spectrum of compositionality exists, the MWE types that are

not wholly literal or figurative are difficult for humans to rate consistently.

2.3.3 Token identification

Having identified MWE types we can start to find instances of them in text. As an inter-

mediate step we can look for what we call an MWE token candidate. This is a sentence in which

the constituent words of the MWE appear in a syntactic arrangement consistent with its type. For

example, for the expression kick the bucket, he kicked the bucket is a candidate but the bucket kicked

him is not. It is not uncommon to simplify the identification of candidates by looking only for un-

broken sequences of words: for example, Tsvetkov and Wintner (2014) considered only instances

of contiguous word bigrams for MWE identification. However, many studies seek candidate MWE tokens with intervening non-constituents and morpho-syntactic inflections. In their search for MWE tokens with intervening words, Morimoto et al. (2016) approach the problem of candidate identification by inferring a canonical syntactic structure for each MWE from its fully contiguous instances.

The canonical structure may include non-MWE constituents when a MWE always appears with an argument. Such non-constituent branches of a parse are represented by the POS of their head, such as NN in the expression a number of NN. Morimoto et al. (2016) then consider any occurrence of the constituents of the MWE where the dependency parse yields the same relationships between constituents as the canonical structure to be a token of the MWE.

For many MWE types such a candidate is unambiguously a token. For example, the expression drive (someone) up the wall only ever means to agitate a person: in practice it does not mean to physically transport someone up a wall. A type classification process may already have told us that this is the case: that all instances of these words in this syntactic configuration have a certain unambiguous meaning. However, some MWE types can be said to be ambiguous in that their token candidates are not always instances of the MWE. For example, he kicked the bucket of water is not an example of the idiom kick the bucket meaning “to die”; it is a literal sentence which happens to share the same words and their syntactic form. In these cases a disambiguation step is required to validate the token identification.

Disambiguation for token candidates

MWE token candidate disambiguation starts with a sentence containing the words of an ambiguous MWE in their expected syntactic configuration and asks the question: is this a token of the MWE type, or are these words simply playing their ordinary literal roles in the sentence? In this section we will look at the data and methods used to solve this task.

The OpenMWE corpus is a collection of annotated token candidates of ambiguous MWE types (Hashimoto and Kawahara 2009). To our knowledge it is by far the largest freely available gold-standard corpus of ambiguous MWE types in any language, comprising 146 MWE types with

102,856 annotated MWE token candidates in total. More detail about the OpenMWE corpus appears in Section 3.4.1.

Apart from the enormous task of constructing the OpenMWE corpus, Hashimoto and

Kawahara (2009) also used it to perform some experiments with supervised MWE token classification. Noting similarities between this task and WSD, they employed the most effective WSD features and machine learning algorithm surveyed by Lee and Ng (2002). They also included linguistic features explored by Hashimoto et al. (2006) designed to capture the relative fixedness of

Japanese idioms, which we will refer to as idiom features. The machine learning algorithm used was Support Vector Machines and models were trained on the WSD features with various combinations of the idiom features. Classifiers were trained for each of the 90 MWE types which were deemed to have sufficient idiomatic and literal examples in the corpus. The model trained on WSD features was found to improve greatly on the type level first sense baseline, with some additional performance added by one of the idiom features.

Li and Sporleder (2010) conducted a thorough investigation of features used for supervised MWE token identification. Context features similar to the WSD features of Hashimoto and

Kawahara (2009) were used, as were a number of linguistically motivated features. Interestingly, classifiers were trained on candidates for a subset of the MWE types in the corpus and then tested on candidates for types that did not appear in the training set. The motivation for this was a low number of instances available for each MWE type: per-type classifiers would not have had access to enough training instances to sample the feature distribution sufficiently for the supervised method to work.

However, in Chapter 5 we examine in more depth the application of a cross-type partitioning to evaluate the ability of a MWE semantic disambiguation algorithm to operate on previously unseen expressions. Unfortunately many of the features were still too sparse to have a significant effect and only the collocational context features produced significant results. This may have been due to the relatively small size of the corpus used, which comprised around 4000 MWE tokens across 13

MWE types. In our experiments on the OpenMWE corpus in Chapter 5, the distributional context features still dominate, but the effects of idiom based features can be seen.

Diab and Bhutada (2009) described a novel supervised MWE token identification system based on a sequence labelling model. Unlike the methods discussed so far, their model also learns to identify token candidates in the text as part of the classification process. Like Li and Sporleder

(2010), the size of the corpus is small (around 2500 MWE tokens of 53 MWE types), and classifiers were trained on collections of MWE types. Anecdotally, the classifiers were able to pick some MWEs out of running text without even knowing their constituents beforehand; however, their performance at this was not tested. A major finding of Diab and Bhutada (2009) was that reducing the feature space of context features by replacing word lemmas with their named-entity category had a significant positive effect on classification performance, a finding that is consolidated by Diab and Krishna

(2009).

Fazly et al. (2009) observed that idiomatic uses of MWEs tend to occur in one of a small set of canonical morpho-syntactic forms and developed an unsupervised method for learning these canonical forms, based on a set of linguistically-motivated features which built on the work of

Cook et al. (2007). They applied the learned set of canonical forms to the task of MWE token identification with remarkable success. The identification of canonical syntactic and morphological forms for MWEs has been studied in a variety of settings (Shudo et al. 2011; Al-Sabbagh et al. 2014;

Morimoto et al. 2016). Fazly et al. (2009) showed that such information can be used to classify ambiguous MWE token candidates. Nasr et al. (2015) take the interesting approach of solving

MWE token identification as part of dependency parsing itself by introducing a dedicated dependency relation between constituents, MORPH, so named because the MWE conceptually represents a single lexical unit. The study covers two specific functional French multiwords — a conjugation and a determiner — that can be ambiguous with uses of the constituents together in their usual functions. They test a modification of the parser dictionary to mark verbs with a feature representing whether they take one of the constituents as a dependent. The test set is small, but their parsers achieve 100% recall of the MORPH dependency and 70-80% precision.

Interestingly, the syntactic argument features are not helpful, resulting in a decrease in recall. Overall the parser was reasonably successful at delineating the MWEs tested, though it is not clear that the approach would generalise to expressions that are not as intrinsically bound to a syntactic function as conjugations and determiners. In Chapter 5 we compare in detail information sources on MWE morpho-syntactic flexibility with information gained from distributional representations of context semantics.

Like Diab and Bhutada (2009), Riedl and Biemann (2015) use a sequence labelling model for joint recognition and classification of MWE tokens and named entities. They compare the use of different MWE lexicons including a variety of manually developed lists and a selection developed automatically by taking a range of cut-off ranks from two different ranking methods for MWE extraction. Each MWE token was classified into one of several different kinds including Noun-Noun compounds and idiomatic expressions. In their results they found that for the most frequent class of

MWEs, Noun-Noun compounds, the manually generated resources resulted in better identification performance. However, on the less frequent classes of MWE tested the automatically constructed lexicon resulted in the best performance on at least one of the rank cut-off levels.

Schneider and Smith (2015) introduce a task and gold standard corpus for identifying

MWE tokens in text and labelling them with supersenses: the set of top level hypernyms in the

WordNet concept hierarchy. Thus this task goes some way further than the MWE candidate disambiguation task we have discussed so far (such as that explored by Hashimoto and Kawahara (2009)) where the only distinction drawn is whether or not the MWE semantics apply. Instead, MWE tokens are both identified and have a (very general) meaning label assigned to them from the same inventory as the rest of the lexicon. The argument for doing this is that for expressions such as low brow, meaning not intellectual, an annotation of brow as a body part would be inappropriate (and likely unhelpful to most downstream applications). To construct the corpus they took a corpus previously annotated for multiword expressions and had annotators label each token (including MWE and single word tokens) with the WordNet supersenses. Schneider and Smith (2015) run supervised

MWE token identification and supersense tagging experiments on their dataset using MWE identification features from the literature and features identifying known supersenses for a token and for

fication features from the literature and features identifying known supersenses for a token and for certain neighbours in its local context. They find that the inclusion of supersense features on top of MWE identification features improves MWE token identification f1-score by a small margin and Chapter 2: Literature Review 65 supersense tagging by a substantial amount.

Gharbieh et al. (2016) use a neural word embedding representation for disambiguation of idiomatic Verb-Noun compound token candidates. They use a vector averaging composition function to represent an expression in terms of its constituents’ embeddings and a similar composition to represent the local context of each (potentially separated) constituent in a candidate usage of the expression. These two averaged vectors are then combined by subtracting the MWE type vector from the context vector to form an instance feature set used in a supervised classification algorithm.

One additional feature dimension is appended: a Boolean feature indicating whether the MWE appears in its canonical form as computed by Fazly et al. (2009). Thus the classifier has two kinds of information: a feature representing lexico-syntactic variation of the expression and a vector representing the difference between a compositional reading of the MWE and the context semantics.

The supervised method is evaluated on each MWE type by using n-fold cross validation where n is the number of instances of the MWE type in the corpus (that is, a leave-one-out evaluation). A number of different word embedding parametrisations are tested, and interestingly the one trained with the smallest context window — of just one word — led to the best performance. This is attributed to a hypothesis that the immediate lexico-syntactic relations of a MWE token candidate are more important than a model of the local distributional semantics for disambiguation. All configurations easily outperform the supervised and unsupervised benchmarks set by Fazly et al. (2009).

The method’s performance on unseen types is also tested by cross-validation with one type left out at a time. In this case the performance exceeds the majority class baseline but does not exceed the performance of the unsupervised benchmark. Finally, Gharbieh et al. (2016) also test an unsupervised algorithm in which the embedding feature set is clustered and the clusters are each assigned a label of idiomatic or literal based on the majority value of the lexico-syntactic feature for the cluster. Varying the number of clusters they find that the method is able to achieve similar results to the unsupervised benchmark for the highest number of clusters only. This makes sense since the method must assign a single label to all instances in a cluster whereas the benchmark labels each instance individually. However, Gharbieh et al. (2016) did find that a perfect labelling of clusters by an oracle annotator would achieve a much higher performance, so there is room for improvement by applying majority sense learning techniques to clusters of word embedding features for MWE token candidate disambiguation.
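The supervised feature construction reduces to a handful of vector operations. In this sketch, embed is an assumed embedding lookup and in_canonical_form is the Boolean output of a Fazly et al. (2009)-style canonical form check; the names are illustrative.

    import numpy as np

    def instance_features(constituents, context_tokens, embed, in_canonical_form):
        """Subtract the averaged MWE type vector from the averaged context
        vector and append the canonical-form indicator."""
        type_vec = np.mean([embed(t) for t in constituents], axis=0)
        context_vec = np.mean([embed(t) for t in context_tokens], axis=0)
        return np.append(context_vec - type_vec,
                         1.0 if in_canonical_form else 0.0)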

In Chapter 5 we identify and formalise a number of subcategories of token disambiguation including methods such as those used by Gharbieh et al. (2016) for testing the performance of MWE token disambiguation models on unseen types.

Relation to Word Sense Disambiguation

MWE token candidate disambiguation can be approached as if it were a word sense disambiguation (WSD) task where the MWE types correspond to word types and MWE token candidates to word tokens (Hashimoto and Kawahara 2009). In this conception, the analogues of word senses are the idiomatic and literal classes for an MWE type.

In WSD, supervised methods are by far the most successful but large amounts of data are required for each word type to be disambiguated (Navigli 2009). However, unlike in WSD, we can expect to find some linguistic commonality between the idiomatic senses of distinct MWE types.

This leads us to hope for more general cross-type classification algorithms, which might alleviate the knowledge acquisition bottleneck. In this thesis we will refer to WSD (more specifically “word expert”) style classification of ambiguous MWE tokens as type-specialised classification.

The first sense or majority class baseline, in which a word is always labelled with its predominant sense, is known to perform very well at WSD due to a Zipfian distribution of senses

(Preiss et al. 2009). We expect no different for the MWE token disambiguation task where the two-class problem is virtually guaranteed a majority class baseline accuracy of over 50%. There has been work in unsupervised first sense learning for WSD using lexical resources (McCarthy et al. 2007), but the de facto baseline in WSD is a supervised first sense baseline (Navigli 2009). We make use of a type-specialised baseline, which is a supervised majority class baseline modelled on the WSD

first sense baseline. We also introduce a corpus baseline which, calculated from the pooled idiomatic and literal counts of a collection of MWE types, is the type-specialised baseline’s cross-type analogue.
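The two baselines can be stated precisely in a few lines. In the sketch below, counts maps each MWE type to its (idiomatic, literal) instance counts; the function names and the micro-averaged accuracy are our own framing of the definitions above.

    def type_specialised_baseline(counts):
        """Accuracy of always guessing each type's own majority class."""
        correct = sum(max(idiomatic, literal)
                      for idiomatic, literal in counts.values())
        total = sum(idiomatic + literal
                    for idiomatic, literal in counts.values())
        return correct / total

    def corpus_baseline(counts):
        """Accuracy of guessing the single class that dominates the
        pooled counts of the whole collection of MWE types."""
        idiomatic = sum(i for i, _ in counts.values())
        literal = sum(l for _, l in counts.values())
        return max(idiomatic, literal) / (idiomatic + literal)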

2.4 Human computation

The main aim of this thesis is to acquire an electronic record of knowledge about word senses from the activities of users of an application. The term Human Computation was coined by von Ahn (2005) to describe a family of applications in which humans perform part of a computation that computers are as-yet unable to perform. An example is the ESP game — so named for the concept of Extra Sensory Perception — wherein two isolated users are shown the same image and invited to guess which words the other player will use to describe the image.

Points are awarded as soon as both players have entered one word in common, which incentivises quick and relevant descriptions. Although players participate for enjoyment of the game, they are simultaneously performing a difficult computation task: assigning descriptive text labels to images which are suitable for search indexing. The ESP game serves as a prime example for a large family of applications involving human computation to support research. In particular, note the following two factors:

1. Users are self-motivated to participate (for entertainment) rather than volunteering as research

subjects.

2. Human computation (image labelling) occurs as a side-effect of the users pursuing their own

goals (scoring points in the ESP game).

Many subsequent projects have pursued the same strategy of gamification for human computation. Foldit is an online multiplayer game in which players leverage human visual problem solving and their ability to intelligently adapt search strategies to find ways to fold proteins into low energy forms (Cooper et al. 2010). The game was designed to attract players with several factors:

• In-game score.

• Persistent leaderboards.

• Social interaction in forums.

• Contribution to science.

In addition, players reported being motivated by compelling gameplay. The Foldit player base have been successful at previously intractable steps in retroviral research (Khatib et al. 2011). In a more recent example, Lieberoth et al. (2014) designed the game Quantum Moves to leverage

human problem solving for computing efficient plans for atom displacement in quantum computer

construction. They took particular care in designing the game’s difficulty gradient and incentive

structure to build a dedicated and skilled user base, which was required for the computation goals

of the project to be achievable. In fact, designing a gaming incentive structure suitable for a specific

human computation task is a real challenge for the methodology (Choi et al. 2014).

Gamification is not the only means available to leverage human computation; it fits into a

larger family of strategies called crowdsourcing, which started off not strictly as a source of com-

putation but as a source of commercial labour likened to the practice of outsourcing (Howe 2006).

It has since been adopted by research communities as a new research methodology (Ranard et al.

2014). A popular platform for research has been Amazon’s Mechanical Turk service, which provides

a marketplace for crowdsourcing where workers are incentivised by payment for tasks completed

(Chen et al. 2011). Crowdsourcing has been used extensively in natural language processing re-

search in recent years. Snow et al. (2008) evaluated performance of Mechanical Turk annotators

relative to professional annotators on a variety of resource production tasks including emotion la-

belling, semantic similarity, entailment, temporal relation extraction and word sense disambigua-

tion. The results for WSD are particularly interesting: using a majority vote over 10 annotations, the Mechanical Turk annotators collectively achieved an accuracy of 99.4%, until it was determined that it was actually the gold standard annotations that were in error, which bumped the result up to 100%. Gami-

fication has been applied to resource generation for a variety of NLP tasks, including, for example,

the anaphora resolution game Phrase Detectives (Poesio et al. 2013).

2.4.1 Crowdsourcing lexical semantic annotations

Venhuizen et al. (2013) use a gamification methodology for the construction of resources for word sense disambiguation. Their application, Wordrobe, has modes of operation for a variety of annotation tasks. The WSD task presents a sample sentence with a highlighted word as the question, and lists sense glosses as choices in a multiple choice question system. Users are invited to select the sense that they think the most other users will select, and place a bet. The higher the bet the higher the reward if correct and the higher the penalty for failure. In this way, the application collects information both about user intuition for word senses and about their confidence levels. The game design is very close to a traditional word sense labelling method but with points awarded for agreement with other annotators. Annotation quality was evaluated relative to a gold standard produced by professional annotators. Taking the sense selected by the most players for each target word and leaving ties unannotated achieved an f1-score of 85.7%. Taking only senses

where players were unanimous in their choice achieved a precision of 97.5% with a recall of 34.7%.

Although these results are not as high in accuracy as those of Snow et al. (2008) they do suggest

that gamification can produce a reliable sparse gold standard for WSD. Additionally, although the

higher coverage relative majority annotations may not be sufficiently accurate to count as a gold

standard, they may still be useful for uses that involve aggregation of statistics over a corpus.

Seemakurty et al. (2010) also target gamification methods at WSD, but consider multiple choice from a closed set of dictionary glosses to be flawed as a gameplay mechanic. Specifically, they contend that having to read and understand many potentially long and subtly different dictionary senses is tedious and demotivates players. Additionally, they are concerned that offering a multiple choice answer entices players to simply guess rather than go to the trouble of comparing each sense carefully to a context to make their decision. Instead, they design the game as a lexical substitution task: players are shown a sentence and a target word, and are asked to enter potential alternatives to substitute. As with the ESP game, players are awarded points for typing the same word as an anonymous partner, which incentivises entering words that are good semantic substitutes for the original. In 54% of cases the lexical substitution selected by both players uniquely matched a single synset of the target word. The rest of the cases either matched multiple synsets (11%) or were not an exact match to any synset of the target word in the lexicon. Regardless of the direct utility to WSD, lexical substitutions themselves are a valuable resource (Kremer et al. 2014).

Parasca et al. (2016) created a novel game for eliciting definitions of words from players. They contend that representing lexical semantics by definitions has the capacity to preserve information lost in the distributional representations commonly used in lexical semantic research. The game, Word Sheriff, assigns one player the role of narrator and gives them a word to communicate to the other players. The narrator and the players then take turns, with the narrator giving a single-word clue and the players replying with one guess each. The scoring incentivises narrators to describe the target word in as few descriptive words as possible. The result does not resemble a dictionary entry per se, but is still an organically generated description of the word. Parasca et al. (2016) analysed the strategies that narrators used and highlighted some common themes that might not be well captured by distributional representations of words: hypernymy, antonymy and world knowledge.

Jurgens and Navigli (2014) aim for a more radical approach than simply gamifying traditional annotation tasks, instead designing graphical games in a traditional computer game genre and building crowdsourcing features into them. The first is a car racing game called Puzzle Racer in which users must attempt to drive through gates decorated with images according to a theme described at the start of the race. The research data produced by Puzzle Racer is a set of links from word senses to images. The incentive to drive through images that match the theme comes from booby trapped distractor images inserted into the game that are known ahead of time not to be related. The players are given no indication which images are distractors and which are candidates for a target word sense, so to perform well at the game they must select the best image to describe the sense. Distractor images, along with a designated correct alternative to go alongside them, must initially be populated from gold standard images for word senses designated by professional human annotators; however, new distractors can later be drawn from the results of gameplay. To attract a player base, Jurgens and Navigli (2014) advertised a contest with a modest monetary prize by email to university students. They received 126 entrants who completed 7199 races. Professional annotators compared the results of the gamified annotation to the results of an equivalent multiple choice question task on the crowdsourcing platform CrowdFlower, finding the top three images from each source to be equally appropriate for a substantial majority of senses. In the cases where Puzzle Racer and CrowdFlower disagreed the gamified images were sometimes preferred, but the CrowdFlower images were preferred more often. On the other hand, taking into account the prize money, the gamified annotations worked out 27% cheaper to produce.

The second game tested by Jurgens and Navigli (2014), Ka-boom!, is designed to annotate a selected word in a text sentence with one of its senses. Senses are represented in the game by images, selected from a sense tagged image collection such as is generated by gameplay in Puzzle Racer. In Ka-boom!, players are shown a sample sentence with a target word highlighted. The game is one of speed, reflexes and judgement: images are thrown across the screen and those unrelated to the meaning of the word in the sample sentence must be targeted and destroyed by the player. The game throws in images known to represent senses of the word and additional images known not to represent the word at all. The player is penalised for missing the additional known-unrelated images that have been mixed in, and the researchers infer the sense of the word from the word sense images the player does not destroy. The WSD annotation performance of players of Ka-boom! was compared to the results of the most similar gamified word sense annotation project, Wordrobe (Venhuizen et al. 2013), and outperformed it significantly in f1-score.

In Chapter 4 we outline the design of a software application, Wakaran, for crowdsourced sense annotation, which attracts users by filling a use case in second language learning, and explore the extent to which users will organically engage with its features. Then, in Chapter 7, we wrap up the thesis by exploring methods to leverage the data collected from user interactions with Wakaran for NLP research, and propose a means by which the research and the application usage can improve each other symbiotically. In this way, our methodology promises to provide ongoing value to attract participants without the need for monetary prizes or to compete for attention in gamers' leisure time with the video games industry, which had estimated yearly revenues of $67 billion as of 2012 (Marchand and Hennig-Thurau 2013). In particular, the potential for research results to feed back into and improve the incentive mechanisms of the application gives it the potential to stand out from the crowd of applications filling a similar space.

2.4.2 Quality control in crowdsourcing

One of the promises of crowdsourcing is that it can produce work at a reduced cost compared to contracting professional services, but the tradeoff is that the work may not be performed to the same standard by non-professionals. To deal with this, crowdsourcing projects have an increased dependency on quality control for the work produced. However, quality control comes with costs of its own. Iren and Bilgen (2014), as part of developing a formula for estimating the costs of ensuring quality in a variety of crowdsourcing strategies, highlight two tradeoffs with respect to cost and quality:

1. the cost savings of crowdsourcing work versus the cost of ensuring quality; and

2. the cost of ensuring quality — “conformance costs” — versus the costs of poor quality — “non-conformance costs”.

To manage these tradeoffs careful selection of quality control methods is required. It should be noted that quality is not a universal property, but rather a suitability to the requirements of a project (Kern 2014). In this section we review broad strategies for quality control in crowdsourcing and their relationship to the quality requirements of our crowdsourcing project. The crowdsourcing system produced for this thesis is the Wakaran Glosser, which is designed to be useful to students of Japanese as a second language while engaged in reading comprehension tasks in class. It is described in detail in Chapter 4. Here, we review factors in crowdsourcing that affect quality and the methods available to achieve quality, and discuss each with relevance to two goals of our research:

1. Successful collection of a record of the word translation glosses users of Wakaran have selected to assist in understanding L2 texts.

2. A proposal, on the basis of our research, for the appropriate forms of quality control for learning sense labels from selected glosses in future work.

The primary output of the crowdsourcing project in this thesis is a record of naturally arising interactions with word senses. Thus conformance with quality goals is dependent on:

1. attracting users into the crowdsourcing system;

2. the system being used for reading comprehension tasks; and

3. the system successfully making a record of senses used versus senses rejected.

The cost of conformance is the cost of the steps we take to ensure these goals are met and to verify that the goals are met. The main costs lie in the effort of developing and evaluating NLP subsystems for the glosser (Chapter 5 and Chapter 6), careful design of the glosser and detailed monitoring of the usage of the glosser (Chapter 4). We explore how those activities fit into the bigger picture of quality control in crowdsourcing in the remainder of this section. Non-conformance means:

1. not attracting users, or

2. attracting users who only look up one word at a time, or

3. attracting users who read but don’t interact with word sense interface elements, or

4. collecting only sense interactions that do not imply restricted interpretation of sense.

The outcome of non-conformance could be either of two things:

1. a negative result for the viability of the crowdsourcing method; or

2. a false negative result, in which the quality of the implementation lets down the method.

Fundamentally the cost of Item (1) is zero — a negative result is still a result — but the cost of Item (2) is the failure of the project. The broader goal is to learn how we might develop a high quality sense labelled corpus from this natural interaction, and we make recommendations in Chapter 7 based on the results of this thesis. In this review we survey general principles and methods of quality control in crowdsourcing as they pertain to both of these quality goals.

There are two main areas to look at when deciding on quality control for a crowdsourcing project: task factors and worker factors (Allahbakhsh et al. 2013). The properties of the task include the kind of work it involves but also the infrastructure built around it. In general the nature of the work may not have much room to change for quality control purposes but it has an impact on quality control decisions, particularly on leveraging the crowd to verify the output quality. Infrastructure, including task description and tooling, incurs an up-front cost expended to prevent non-conformance with quality (Iren and Bilgen 2014). Worker factors include the level of qualification of workers and their motivation for participating.

The main impact the task's work has on quality control is the nature of the outputs produced by workers. Iren and Bilgen (2014) classify tasks as either finite or non-finite in response type. For example, a finite task may involve giving a yes or no answer, or selecting from multiple possibilities, whereas writing free form text according to certain requirements may have effectively non-finite valid responses. Iren and Bilgen (2014) also talk about tasks in terms of subjective and objective correctness, where a subjective response may be a personal opinion or perspective or have culturally varying relevance. Kern (2014) takes a slightly different approach, classifying tasks into deterministic — where there is one correct answer — and non-deterministic — where parallel or repeated performances of the task may produce different but acceptable outcomes. Finite objective tasks should in general be deterministic and admit certain kinds of quality control such as gold standard validation. Other task kinds, however, may need structured crowdsourced methods for validation, with the relevance of the validation method depending very much on the task type — that is, on the kinds of responses workers give. For our crowdsourcing project the selection of relevant glosses is a finite but potentially somewhat subjective task. We expect it to trend closer to the deterministic side of the spectrum. The collection of a gold standard corpus of sense labels, however, is a more strictly deterministic task, and we make findings in Chapter 4 on how to proceed from the former to the latter.

Design of the infrastructure and delivery of our crowdsourcing task is a major focus of this thesis. Allahbakhsh et al. (2013) identify four major factors in the design of a crowdsourcing task: definition, user interface, granularity and compensation policy. Kern (2014) similarly discusses “application characteristics” specific to a work task and “platform characteristics” specific to a crowdsourcing service, but treats compensation policy as a workforce characteristic. The task definition (Allahbakhsh et al. 2013) or application characteristics (Kern 2014) include how the task flow is designed, how the task's requirements are communicated to workers and the setting of eligibility requirements for workers. Our crowdsourcing project has a novel set-up in this respect, since we do not seek to set a task ourselves but rather to capture an audience performing an existing task. In Chapter 7 we will discuss how worker eligibility could come into play in future work with our crowdsourcing application. The user interface — “ergonomics” in the terminology of Kern (2014) — is important for two reasons: firstly, ease of use affects retention of users, and secondly, it impacts the difficulty gap between performing the task as intended and giving thoughtless responses. In Chapter 4 we describe how we designed our glossing system defensively to mitigate the risk of user interface issues interfering with the goals of the project. Finally, granularity refers to whether a decomposable task is given as a monolithic whole to a single user. In the case of our project the users choose the granularity of their own tasks, because Wakaran is designed to accept user-submitted documents. In Chapter 7 we see that they make a variety of choices regarding the size of the portions of a document they submit to our system for reading.

The other major category of factors affecting quality is worker characteristics. Iren and Bilgen (2014) look at organisation-internal versus external crowds, where the difference is in how much access and influence the task setters have. The purpose of our method design is to make external crowds accessible, but for this project we have bypassed the advertising costs of acquiring an external crowd by starting with classes at our own institution, to whom we have more direct access. However, as described in Chapter 4, beyond advertising the existence of our platform itself, we otherwise conducted our study as if the glosser's users were an external crowd. This implies less control over working times and hours, but that is already an intrinsic part of our strategy of measuring naturally occurring activity. Allahbakhsh et al. (2013) discuss a worker's profile in terms of reputation and expertise. Reputation concerns a worker's history on a generic crowdsourcing platform, but could apply to our special purpose system one day depending on the nature of future projects. Expertise concerns the credentials and experience that qualify a worker for the particular task. The advertising process for our project selects for students of Japanese at an intermediate level, but for a wider distribution a quality control method will need to account for expertise. Kern (2014) also discusses the importance of worker qualification for a crowdsourcing task, including familiarity with the subject matter and familiarity with the tools through which the crowdsourcing task is being run. Finally, the compensation policy can also affect quality outcomes (Allahbakhsh et al. 2013). It has generally been found that offering a monetary reward does not necessarily increase quality, but may increase the speed at which results can be collected by virtue of increasing the pool of workers willing to participate, whereas intrinsic motivation factors are more likely to increase quality (Kern 2014). In our crowdsourcing task the motivation is entirely intrinsic, since the users are performing the task for their own needs. For future work we could seek to amplify the effect by emphasising the contribution to research that participation entails.

The project characteristics discussed so far affect decisions on what quality control methods to employ, which can be broadly broken down into design time methods and runtime methods (Allahbakhsh et al. 2013). Defensive design should be employed to reduce or mitigate any major risks from task design factors such as ergonomics and task description. In our project this means ensuring that the glosser user interface is designed to support its main use case for students and that its interactive features collect the information we need. Chapter 4 covers our system design in detail. Additionally, Chapter 5 and Chapter 6 develop and assess the reliability of NLP methods, which is an ergonomic consideration essential for trust in a glossing application (Nerbonne et al. 1998). The worker selection method is also important. Allahbakhsh et al. (2013) list three main strategies:

• Open to all

• Reputation based

• Credential based

Our project is designed to be compatible with an open-to-all strategy. That is, the system is specifically designed as an open web application so that it may attract users from anywhere in the world. A credential based approach would typically involve requiring certain qualifications or administering a qualification test (Kern 2014). In this thesis we have trialled our system by advertising it to intermediate students of Japanese as a second language, which amounts to an implicit qualification test. We discuss how a more explicit test may be administered to an open field of users in Chapter 7, where we make recommendations supporting the future of our system. Reputation based methods reflect on past worker performance in the crowdsourcing system; they do not apply at this stage of our thesis, but may come to apply as the system matures.

The most widely deployed family of runtime methods is output-based quality control (Kern 2014). In general this involves using one or more of several methods for validating the output produced by crowdsourced workers:

• giving multiple workers the same instance of a task;

• having worker output checked or rated by other workers; and

• validating worker output against a gold standard result.

The subcategories of output-based quality control methods are diverse. For example:

• The gold standard or ground truth pattern evaluates worker quality by evaluating their output on a subset of task instances with a known correct output (Iren and Bilgen 2014; Allahbakhsh et al. 2013; Kern 2014).

• The redundancy, output agreement or voting pattern involves giving multiple workers the same instance of a finite deterministic task and choosing the most popular answer (Iren and Bilgen 2014; Allahbakhsh et al. 2013; Kern 2014).

• The control group, expert review, validation and comparison patterns all broadly have one or more workers perform the task and then either a professional or a group of additional workers verify the output (Iren and Bilgen 2014; Allahbakhsh et al. 2013; Kern 2014).

• In the iteration pattern the output of a complex task is passed from each worker to the next for further refinement, until quality goals are met (Kern 2014).

There are a great many variations on these patterns, and selection of a specific output-based quality control method should be based carefully on the characteristics of the task and the output produced by workers (Kern 2014).
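To make the redundancy pattern concrete, the following minimal Python sketch (our own illustration; the function name and threshold are invented, not taken from the cited works) aggregates redundant labels for a single task instance by majority vote, leaving ties unannotated in the manner described for Wordrobe above:

    from collections import Counter

    def majority_label(responses, min_votes=3):
        """Choose the most popular answer for one task instance, returning
        None (unannotated) on a tie or when too few redundant responses
        have been collected."""
        if len(responses) < min_votes:
            return None
        top = Counter(responses).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            return None  # tie between the two most popular answers
        return top[0][0]

    # e.g. five workers label the sense of one target word
    print(majority_label(["sense2", "sense2", "sense1", "sense2", "sense3"]))
    # -> sense2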

Another category of quality control methods identified by Kern (2014) is called execution process monitoring. This entails keeping a record of the actions taken by a worker while performing a task, to track how they are working and to infer quality or the lack thereof in their output. For example, a system may take regular screenshots or snapshots of work progress to send back to task setters. Rzeszotarski and Kittur (2011) take a sophisticated approach to execution process monitoring. They instrument a web interface with logging that records events such as scrolling the page, navigating between text input components, keys pressed, mouse movements and mouse clicks to build a predictive model of task performance based on user behaviour. From the raw stream of events they produced summary features such as total time, idle time, counts of fields accessed and mouse clicks. The summary features were fed into SVM and decision tree machine learning models and had reasonable success at predicting acceptance rates for workers' output on a number of trial tasks. In Chapter 4 we describe how we implemented similar execution process monitoring in our glossing interface. We used less detailed but more application-specific events, such as interaction with glossed words and with their glosses, leaving out mouse movements and key presses. We also reduced our event logs to summary counts, but additionally investigated several other ways of turning event position time series into a summary fingerprint. Rather than perform a supervised regression, we perform unsupervised clustering of behaviours to give insight into how the glosser is used, and in particular to verify that the quality goal of capturing reading comprehension usage and gloss sense consultation is met. Our execution process monitoring records will be useful to future work with our glosser involving prediction of measurable quality goals such as the production of gold standard sense annotations.
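As a rough illustration of this kind of feature summarisation, the following Python sketch (the event kinds and idle threshold are invented for illustration; this is not Wakaran's actual logging code nor that of Rzeszotarski and Kittur (2011)) reduces a raw event stream to summary features:

    from dataclasses import dataclass

    @dataclass
    class Event:
        kind: str         # e.g. "gloss_open", "word_hover", "scroll"
        timestamp: float  # seconds since the task was opened

    def summarise(events, idle_threshold=30.0):
        """Reduce a raw event stream to summary features: total time,
        idle time and per-kind event counts."""
        features = {"n_events": len(events), "total_time": 0.0, "idle_time": 0.0}
        if not events:
            return features
        times = sorted(e.timestamp for e in events)
        features["total_time"] = times[-1] - times[0]
        features["idle_time"] = sum(b - a for a, b in zip(times, times[1:])
                                    if b - a > idle_threshold)
        for e in events:
            key = "count_" + e.kind
            features[key] = features.get(key, 0) + 1
        return features

    print(summarise([Event("scroll", 1.0), Event("gloss_open", 2.5),
                     Event("gloss_open", 95.0)]))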

In this section we have reviewed the main features of crowdsourcing projects that affect quality and the broad categories of quality control methods used, and discussed how they relate to this thesis. At the end of Chapter 7 we give a more detailed discussion of output-based quality control methods in relation to the results of our research.

2.5 Summary

This chapter has covered three main topics under the umbrella of computational lexical semantics: word sense disambiguation, multiword expression token identification, and crowdsourcing semantic annotations. For both WSD and MWE research we highlighted the difficulty of developing semantically labelled corpus resources with a distribution of meanings for a large coverage of words. In the case of WSD we narrowed the discussion to knowledge-based methods, which can develop high coverage models without using annotated text, with a particular focus on Personalised PageRank. We also reviewed evaluation methods for such models, which may use such annotated corpora as are available. Our review of MWE identification explored the complexity of identifying usages of dictionary expressions in the face of idiosyncratic but not fully fixed behaviour with respect to syntactic and morphological variations. Finally, we surveyed research projects aimed at collecting semantic annotations for text from non-experts through crowdsourcing methods. Implicit annotation through game design and explicit paid annotation were found to be the primary strategies employed in the case of word sense annotation crowdsourcing.

Chapter 3

Resources Review

This chapter supplements the review of related work (Chapter 2) with additional material describing annotated corpora, software packages, linguistic theory and machine learning algorithms used in the course of the research described in the remainder of this thesis.

In Section 3.1 we introduce common linguistic concepts such as part of speech (POS) and syntax that provide symbolic and structured representations of text useful for any computational analysis. Section 3.2 narrows the focus to the Japanese language, including challenges particular to the language and software tools for generating POS tags and parse trees from Japanese text. Section 3.3 describes one of the most widely used lexical knowledge base resources, Princeton WordNet (“WordNet”), and its equivalents in many languages including Japanese. It additionally describes the sense annotated corpus SemCor that is closely tied with WordNet and the process of translating that corpus into Japanese. Finally, Section 3.4 describes a rich corpus and an interesting dictionary useful for MWE token identification in Japanese.

3.1 Linguistic concepts useful for text processing

Language finds its way into computational systems in a number of ways: an encoded waveform audio recording; an image of someone's handwriting; and text entered by keyboard as encoded character data. Digital systems can copy and transmit language in these forms with little understanding of their content. However, we can do more if we recognise patterns in language and teach the computer to manipulate our data on a higher level. The linguistic concepts introduced in this section are powerful tools for working computationally with language.

In this section we take a broad view of the relationship between computing and human languages, particularly written text. The primary purpose is to provide an introduction to basic linguistic ideas that are useful in text processing but may not be familiar to a reader with a general computer science background. The concepts introduced in this section will be fundamental assumed knowledge for the greater part of the dissertation. Linguistic concepts such as part of speech tagging and dependency parsing will be covered, with a focus on how they are realised computationally. For a reader familiar with these topics this section may be safely skipped. A number of popular tools for natural language processing will be introduced.

3.1.1 Part of Speech

One of the most basic patterns identified in human languages is the tendency for words to fall into groups exhibiting similar behaviour, the most common groups being the nouns and the verbs. Manning and Schütze (1999) demonstrate the substitution test for whether two words fall in the same group as follows:

The {sad, intelligent, green, fat, ...} one is in the corner.

Substituting any of these words — though changing the meaning — yields a grammatical sentence. On the other hand, we cannot substitute pie, dog, house or intelligence into the same position:

(5)* The pie one is in the corner.

Similarly we cannot substitute sad or intelligent for corner but we can substitute pie or house.

(6)* The sad one is in the intelligent.

(7) The sad one is in the house.

The accumulation of evidence from the substitution test can lead us to divide words into groups called word classes, categories or parts of speech (POSs).

One thing to note, however, is that words can belong to more than one POS. In the following examples, fat is used as a noun and adjective respectively:

(8) He trimmed the fat from his steak.

(9) The fat one is in the corner.

We can perform a grammatical but nonsensical substitution and a substitution that is not allowed:

(10)# The sad one is in the intelligence.

(11)* The sad one is in the intelligent.

This can be explained by noting that intelligence, like corner, is a noun; whereas the word intelligent, being an adjective, is not a valid substitution. Note that for linguistic examples in this thesis that exhibit semantic incoherence or infelicity we use a # mark rather than the *, which is reserved for examples that would not occur in ordinary language.

Finally, we can broadly divide the word classes into two kinds, termed open and closed (Manning and Schütze 1999). The open, or lexical, word classes — including verbs, nouns and adjectives — have a large number of members and admit new words into the class over time. The closed, or functional, word classes are small in number (of words) and tend not to change. One closed class is the prepositions, which includes the words at and in.
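For readers who wish to experiment, class ambiguity of this kind can be observed directly with an off-the-shelf part of speech tagger such as the one bundled with NLTK. The sketch below is purely illustrative: tagger output can vary with the model version, and the relevant NLTK data packages must be downloaded first (names vary slightly between NLTK versions).

    import nltk  # assumes the "punkt" and "averaged_perceptron_tagger"
                 # data packages have been downloaded

    for sentence in ["The fat one is in the corner.",
                     "He trimmed the fat from his steak."]:
        print(nltk.pos_tag(nltk.word_tokenize(sentence)))
    # a well-trained tagger should label fat JJ (adjective) in the
    # first sentence and NN (noun) in the second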

3.1.2 Morphology

Say we have classified a word, run, as a member of the verb class and we wish to find uses of it in a collection of text documents. Then we may simply look for all instances where the character r is followed by u and n in succession. However, what of the words ran, runs and running? We know these are all closely related: is there some way to identify them with each other?

The relationships between these words are called morphological processes. Manning and Schütze (1999) identify three main kinds of morphological process: inflection, derivation and compounding.

Inflection refers to functional changes in a word such as pluralisation (dog → dogs) or tense (walk → walked). Delightfully, word classes — though distinguishable by the substitution test alone — also distinguish themselves with systematic inflection rules. In English inflections are typically suffixes, including -ing, -s and -ed for verbs and -s and 's for nouns. In other languages prefixes attaching to the start of the word and even infixes inserting themselves into the word are observed (Yu 2003). We call the group of all inflected forms of a verb the lexeme (Manning and Schütze 1999). The term lexeme and the related term lemma are used to distinguish two related concepts in a cognitive account of sentence construction (Kempen and Huijbers 1983; Kempen and Hoenkamp 1987). In this account, the lemma of a word is accessed first for its syntactic properties and to determine requirements for number and inflection; in a second step the lexeme is accessed to perform the required morphological and phonological rendering of the word. In NLP we commonly make use of the term lemma to describe a morphologically “basic” representative form for a word. For example we may represent the lexeme {walk, walks, walking, walked} by the lemma walk. Since dictionaries usually list morphologically unmarked headwords the lemma may also be called the dictionary form or citation form of the word. To lemmatise a morphologically marked word is to find its lemma.
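Lemmatisation is available off the shelf in NLP toolkits. By way of illustration, the following sketch uses NLTK's WordNet-based lemmatiser (assuming the WordNet data has been downloaded) to map each member of the lexeme to its lemma:

    from nltk.stem import WordNetLemmatizer  # assumes the WordNet data is downloaded

    lemmatizer = WordNetLemmatizer()
    # the lemmatiser needs to be told the word class; "v" marks verbs
    for form in ["walk", "walks", "walking", "walked"]:
        print(form, "->", lemmatizer.lemmatize(form, pos="v"))
    # every member of the lexeme maps to the lemma "walk"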

Manning and Schütze (1999) use derivation to describe morphological processes which form new words, usually of a different class, such as the suffix -ly which turns some nouns like heaven into adverbs (heavenly). They note that these processes are not as systematic as inflections, being very selective about which words allow the transformation (consider *computerly, *catly).

Finally we have compounding, in which two separate words bind closely to each other and begin to act as one, such as street+light and school+yard. The rules governing compounding are, perhaps, even more opaque than derivation. However, Manning and Schütze (1999) observe that, should we compile a list of known words and their properties — a lexicon — we will inevitably want to include some compounds as entries in their own right.

In this thesis we make extensive use of Japanese morphological processing tools, which we review in more detail in Section 3.2.

3.1.3 Word order

We have seen that word classes account for varying sentence validity under substitution, but valid sentences may also become invalid with a simple re-arrangement of words:

(12) The cat sat on the mat.

(13)* Sat cat mat the the on.

To account for this we will need a model of word order.

In this section we start with a simple account of word order called an n-gram model. This model does a poor job of accounting for sentence validity but it, or the ideas behind it, are commonly used in NLP. The section goes on to a more advanced account of word order called phrase structure grammars (also known as context free grammars).

n-gram language models

Consider an English sentence in progress, say Yesterday I went to the ..., and try to predict the word that comes next. You might think of words describing locations someone would go to: nouns like park or bank but also adjectives describing them like new or nearby. To answer the question more generally we are going to need a model of how valid sentences are constructed. If we had access to a complete canonical list of all valid English sentences then we could predict our next word based on all sentences which begin with Yesterday I went to the .... If we knew the frequency with which each sentence appeared we could even assign a probability to each following word. However, building a list of potentially infinite possible English sentences and their frequency of use is impossible, so instead we build symbolic and probabilistic models of the language.

As a simple example, let us try to make a record of all valid three word sequences and their frequencies of use. If we limit ourselves — quite reasonably — to a finite vocabulary such a model of language is enumerable. To populate the model we might, for example, collate every three word sequence observed in a very large collection of books and newspapers. If we count the number of occurrences of each unique three word sequence we can store this collection relatively compactly: each entry is three words and an integer. Omitting the counts, we end up with a collection that looks something like this:

(instead of making), (of making a), (making a record), (a record of ), (record of all), (of all possible), (all possible sentences), ...

Now we can estimate the possible or probable next words in our sentence by looking for all three word sequences in the collection of the form to the .... The relative counts of each match allow us to assign a probability to each potential next word. This approach to predicting valid sentences surely lacks accuracy but it is certainly computable. In fact this technique, called an n-gram language model, is a common technique that has been considerably successful in a wide variety of natural language processing tasks such as speech recognition (Manning and Schütze 1999). The n is the length of the word sequences used, which in the case of our example is three. Manning and Schütze (1999) observe that 2- and 3-grams are commonly called bigrams and trigrams respectively, a mixing of Latin prefix with Greek suffix that they simultaneously note and condone.
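A minimal trigram model of the kind just described can be sketched in a few lines of Python (a toy illustration with an invented miniature corpus, not a component of this thesis):

    from collections import Counter, defaultdict

    def train_trigrams(tokens):
        """Count every three word sequence observed in a token stream."""
        counts = defaultdict(Counter)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            counts[(w1, w2)][w3] += 1
        return counts

    def next_word_probs(counts, w1, w2):
        """Relative counts of continuations of the two word context (w1, w2)."""
        following = counts[(w1, w2)]
        total = sum(following.values())
        return {w: c / total for w, c in following.items()}

    toy_corpus = "yesterday i went to the park and today i went to the bank".split()
    model = train_trigrams(toy_corpus)
    print(next_word_probs(model, "to", "the"))  # {'park': 0.5, 'bank': 0.5}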

An n-gram model is called generative in that it can be used to generate sentences from scratch. One starts by picking a likely n-gram, then uses the model to pick the (i+1)th word, the (i+2)th word and so on, always using the final n − 1 words of the sentence so far to estimate the next. The following text, generated from a three-gram model of the complete works of William Shakespeare, appears in Jurafsky and Martin (2000):

Sweet prince, Falstaff shall die. Harry of Monmouth’s grave. This shall forbid it should be branded, if renown made it empty. What is’t that cried? Indeed the duke; and had a very good friend. Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done. The sweet! How many then shall posthumus end his miseries.

To the author’s native speaker intuitions this text makes very little sense but reads with a reasonable fluency and is strongly reminiscent of the writing style of Shakespeare.
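The generation procedure can likewise be sketched in Python. The block below is self-contained and purely illustrative, retraining the toy trigram counts on a made-up text before sampling from them:

    import random
    from collections import Counter, defaultdict

    toy_text = ("the cat sat on the mat . the cat ate . "
                "the dog sat on the cat .").split()
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(toy_text, toy_text[1:], toy_text[2:]):
        counts[(w1, w2)][w3] += 1

    def generate(w1, w2, length=10):
        """Generate by repeatedly sampling a next word given the final
        n - 1 = 2 words of the sentence so far."""
        out = [w1, w2]
        for _ in range(length):
            following = counts.get((out[-2], out[-1]))
            if not following:
                break  # no observed continuation of this bigram
            words, weights = zip(*following.items())
            out.append(random.choices(words, weights=weights)[0])
        return " ".join(out)

    print(generate("the", "cat"))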

Syntax

In Section 3.1.1 we saw the substitution test for single words. If a family of words could be substituted for each other without changing the validity of a sentence then it was counted as evidence that the words fall into the same category. However, we can expand this idea to cover more than a single word. Phrases within a sentence can be assigned categories as well.

For example, consider the following sentences:

(14) The cat sat on the mat.

(15) The yellow flea bitten cat sat on the mat.

(16) I saw a cat.

(17) I saw a yellow flea bitten cat.

(18) Peter submitted his documents to the court.

(19) Peter submitted his yellow flea bitten cat to the court.

As we go on we find an abundance of sentences in which yellow flea bitten cat can be substituted for cat — or indeed any noun — without affecting the validity of the sentence. We use this to argue for a new category: the noun phrase. Similar arguments can be made for the existence of verb phrases and a great many more phrasal categories. The existence of these substitutable phrasal categories motivates the idea of phrase structure grammars.

Phrasal categories:
S → NP VP
NP → AT NNS | AT NN | NP PP
VP → VP PP | VBD | VBD NP
PP → IN NP

Lexical categories:
AT → the
NNS → children | students | mountains
VBD → slept | ate | saw
IN → in | of
NN → cake

Figure 3.1: A simple grammar for a very restricted subset of the English language (Manning and Schütze 1999).

The key observation is that a phrase, say a noun phrase, is composed of a sequence of shorter phrases of differing categories and that noun phrases in general only decompose into other categories in a limited number of ways. For example, the verb phrase quietly gave the tall stranger the locked briefcase decomposes into verb phrase (quietly gave) noun phrase (the tall stranger) noun phrase (the locked briefcase); however you would be hard pressed to find a verb phrase which decomposed into five small noun phrases in a row.

Manning and Schütze (1999) demonstrate how a sentence can be built up systematically from phrasal categories by allowable substitutions. They use the substitution rules seen in Figure 3.1. The substitution of a single symbol for multiple symbols according to a grammar rule may alternatively be referred to as an expansion.

Consider the rule for NP: this is the noun phrase rule. What this rule means is that in any position where a noun phrase can be substituted, we can substitute any of the following three patterns:

1. A determiner (AT, a or the) followed by a plural noun (NNS).

2. A determiner followed by a singular noun (NN).

3. A nested noun phrase followed by a prepositional phrase (PP, for example at ... or on ...).

Current sentence               Next rule(s) to apply
S                              S → NP VP
NP VP                          NP → AT NNS; VP → VBD NP
AT NNS VBD NP                  NP → AT NN
AT NNS VBD AT NN               (lexical category substitutions)
The children ate the cake

Figure 3.2: Derivation of The children ate the cake using the grammar given in Figure 3.1 (Manning and Schütze 1999).

Tagging The/DT children/NNS ate/VBD the/DT cake/NN ./.

Parse (ROOT (S (NP (DT The) (NNS children)) (VP (VBD ate) (NP (DT the) (NN cake))) (. .)))

Figure 3.3: The parse of the sentence the children ate the cake produced by the Stanford Parser (Klein and Manning 2002), generated using http://nlp.stanford.edu:8080/parser.


S is a pseudo-phrasal category representing the whole sentence. Manning and Schütze (1999) demonstrate the derivation of The children ate the cake by repeated substitution of rules from Figure 3.1. The steps of the derivation can be seen in Figure 3.2. To begin with, the sentence category S is expanded using its only rule to NP VP; in this grammar all sentences are a noun phrase followed by a verb phrase. Phrasal categories may then be expanded freely according to the grammar rules, but in this case they are expanded with a particular target sentence in mind. A sequence of expansions is chosen that leaves no remaining phrasal categories in the sentence — only lexical categories remain. Finally, words are substituted for their lexical category.
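The grammar of Figure 3.1 can be reproduced mechanically. For example, NLTK accepts such a grammar directly and can enumerate the parses it licenses for the target sentence (a sketch; the transcription of the grammar into NLTK's notation is ours):

    import nltk

    # the toy grammar of Figure 3.1, in NLTK's CFG notation
    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> AT NNS | AT NN | NP PP
    VP -> VP PP | VBD | VBD NP
    PP -> IN NP
    AT -> 'the'
    NNS -> 'children' | 'students' | 'mountains'
    VBD -> 'slept' | 'ate' | 'saw'
    IN -> 'in' | 'of'
    NN -> 'cake'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the children ate the cake".split()):
        print(tree)
    # (S (NP (AT the) (NNS children)) (VP (VBD ate) (NP (AT the) (NN cake))))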

This is all very well as a system for generating grammatical sentences, but it doesn't (yet) help us to analyse sentences stored as data in a computer, nor to process those sentences. For that we need a process called parsing. In fact, many algorithms exist for parsing natural language. Figure 3.3 shows a sample output of the Stanford Parser (Klein and Manning 2002) on our running example. Careful examination reveals that this parse uses the same categories and substitutions as the derivation in Figure 3.2 (with some slight symbolic differences, such as DT for determiners instead of AT). Each pair of parentheses represents a substitution from the grammar: the first thing after an open parenthesis is one of the category symbols; after that come the sub-phrases it has been analysed into. Each sub-phrase in turn may be analysed further or may be a single word. We can also represent the parse of a sentence diagrammatically by suspending the expanded elements of a category below it in a parse tree. A tree representation of the parse is shown in Figure 3.4.

Figure 3.4: A tree diagram representing the parse shown in Figure 3.3.
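Bracketed parser output of the kind shown in Figure 3.3 is straightforward to process programmatically. For example, NLTK can read it back into a tree object (a sketch using the parse string from Figure 3.3):

    from nltk import Tree

    # the bracketed output of Figure 3.3, read back into a tree object
    parse = Tree.fromstring(
        "(ROOT (S (NP (DT The) (NNS children)) "
        "(VP (VBD ate) (NP (DT the) (NN cake))) (. .)))")
    for subtree in parse.subtrees(lambda t: t.label() == "NP"):
        print("noun phrase:", " ".join(subtree.leaves()))
    # noun phrase: The children
    # noun phrase: the cake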

3.1.4 Dependency grammars and parsing

A dependency grammar is an alternative to constituency grammars (see Section 3.1.3) for giving a tree structure analysis to a sentence. A node is created for each word in the sentence, and branches are assigned between words. Each branch is called a dependency link, with one word's node assigned as the dependent and the other as the head or governor.

Figure 3.5: Two representations of an example dependency parse for the rapidly tiring yellow cat ate the mouse. The first representation emphasises the tree structure of the parse whereas the second captures the word order.

An example dependency parse appears in Figure 3.5. Notice that the dependents (lower down the tree) typically modify or provide additional information about the heads: yellow modifies cat; rapidly modifies tiring; and so on. Furthermore, the dependent only features in the sentence by virtue of its dependency link relationship, whereas the head may or may not require the presence of the dependent. For example, yellow would be meaningless in this sentence without cat, but cat could surely do without yellow.

Useful dependency grammar properties

Dependency parsing is a useful step in an NLP pipeline. The minimalistic word-to-word tree representation is easy to work with while still capturing salient features of sentence structure. Here we discuss some typical features of dependency grammars which do not necessarily exist for all languages but are either common or very useful where they appear.

Covington (2001) surveys a range of dependency grammars and parsers and draws out common properties. We are interested in particular in the property of projectivity (adjacency). He uses the following definition: the subordinates of a head word are its dependents, the dependents' dependents and so on until no more dependents are found. Projectivity then means that “if word A depends on word B, then all words between A and B in the sentence are subordinate to B.” Note that the intervening words may or may not be subordinate to A as well. Covington (2001) gives the following example of projectivity violation, where B violates projectivity between A and C (it would not if it were headed by A or C instead):

[Diagram: the four-word sentence A B C D, with A depending on C and B depending on D; B lies between A and its head C but is not subordinate to C.]

Note the new style of dependency parse illustration, which emphasises both the tree structure and the word order of the sentence. The tree is raised above the original sentence and the node for each word is tied to the word by a dotted line. If one of the solid dependency lines “crosses” a dotted line, a projectivity violation has occurred. There is redundancy in the diagram in that each word is displayed twice: once in the sentence and once on the node. Therefore sometimes the nodes in this kind of dependency diagram are used to display additional information, such as the part of speech of the word.

Clearly projectivity does not hold in languages with a totally free word order. However, it permits a certain amount of freedom of word order for phrase arguments, so it is a useful abstraction for weakly ordered languages. Projectivity proves useful in the work described in Chapter 5.
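Covington's definition can be checked mechanically. The following Python sketch is our own formulation of the definition, with the head-array encoding chosen purely for illustration:

    def is_projective(heads):
        """Check Covington's projectivity property for a dependency parse.
        heads[i] is the index of word i's head; the root has head -1.
        Projectivity: every word lying between a dependent and its head
        must be subordinate to (a descendant of) that head."""
        def ancestors(i):
            while heads[i] >= 0:
                i = heads[i]
                yield i

        for dep, head in enumerate(heads):
            if head < 0:
                continue
            lo, hi = sorted((dep, head))
            for between in range(lo + 1, hi):
                if head not in ancestors(between):
                    return False
        return True

    # "what did you eat": attaching what to eat (as in Figure 3.6a below)
    # is non-projective, because did intervenes but is not subordinate to eat
    print(is_projective([3, -1, 1, 1]))  # False
    # attaching what to did instead (Figure 3.6b) is projective
    print(is_projective([1, -1, 1, 1]))  # True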

One other property highlighted by Covington (2001) is that dependency links are very close to semantic relations. For example, in Figure 3.5, the dependents of ate are the eater (cat) and the eaten (mouse). These relationships, when correctly identified, mean potentially much more to us than direct adjacency relationships between words. However, Osborne (2011) investigates a tension between this property and the projectivity property. One case covered is wh-fronting in English. Consider the dependency parses for the sentence what did you eat?. Figure 3.6a shows a dependency parse for the sentence which preserves the relationship eat → what, but violates projectivity. On the other hand, the parse shown in Figure 3.6b is a common simplification preferred by dependency parsers, restoring projectivity at the cost of instead coding a relationship did → what.

Figure 3.6: Possible dependency parses for the sentence What did you eat? (a) A parse that demonstrates a projectivity violation, but captures the relationship eat → what. (b) A parse that preserves projectivity but does not capture the relationship eat → what.

Osborne (2011) examines many more situations in which, in his terminology, a word's head dependency rises so that it is no longer subordinate to its governor, thus drawing a distinction between the true governor and the head to which the dependency is attached in the parse structure. His treatment also includes cases where rising occurs due to a dependency grammar (or parser) for reasons other than projectivity violation. We make no further comment on this phenomenon, except to state that we make use of dependency parsers relying both on projectivity and on discovering governor/governee relationships, and that we accept that trade-offs may have been made.

3.2 Japanese natural language processing

Japanese is a non-segmenting language, meaning that word boundaries are not explicitly marked by spaces as they are in English. This poses a problem for working computationally with Japanese; not least because our discussion of parsing so far has simply assumed that a sentence starts as a sequence of words. Segmentation of Japanese turns out to be a non-trivial problem to solve, especially since there is no definitive answer to what constitutes a word in Japanese. Several competing analyses exist for explaining Japanese sentences in terms of morphology, chunking and dependency parsing. We will focus on a small selection and on the features of each which informed our decisions to use tools associated with (sometimes inextricably related to) each analysis.

(21) ウィキペディア は オープンコンテント の 百科事典 です。
uikipedia wa ōpuncontento no hyakkajiten desu.
Wikipedia TOPIC open-content GEN encyclopedia is.
“Wikipedia is an open content encyclopedia.”

(a) A segmentation given by the Juman morphological analyser.

(22) ウィキペディア は オープンコンテント の 百科 事典 です。
uikipedia wa ōpuncontento no hyakka jiten desu.
Wikipedia TOPIC open-content GEN panoply-of-subjects dictionary is.
“Wikipedia is an open content encyclopedia.”

(b) A segmentation given by the MeCab morphological analyser. Note the finer grained segmentation of hyakka-jiten “encyclopedia” than that of Juman in Figure 3.7a.

Figure 3.7: Possible segmentations for the sentence in Example (20). Pronunciations, translations and explanatory notes are a guide, not part of the output of the respective morphological analysers.

To help us in understanding methods of Japanese NLP, let us consider the example sentence:

(20) ウィキペディアはオープンコンテントの百科事典です。

ウィ キ ペ ディ ア は オー プ ン コ ン テ ン ト の 百 科 事 典 で す 。
wi ki pe di a wa ō pu n co n te n to no hyak ka ji ten de su .

“Wikipedia is an open content encyclopedia.”

We begin by examining a few potential segmentations for this sentence. Figure 3.7a has been segmented by the Juman morphological analyser (Kurohashi and Kawahara 2017). We see that a number of the tokens Juman has identified are ones we understand as words — wikipedia, ōpuncontento “open content”, hyakkajiten “encyclopedia” and desu “is” — and a number are functional, typically marking case. Figure 3.7b shows the segmentation produced by another morphological analyser, MeCab. An interesting difference is that MeCab has segmented two words where Juman identified only one: hyakkajiten “encyclopedia” has been segmented into hyakka “panoply of subjects” and jiten “dictionary”. From the perspective given by the English translations this appears to be a failure to identify a lexical item. However, it is sometimes advantageous to identify parts of a compounded word when those parts fit neatly into classes of morphological variation.

Maekawa (2009), in an analysis of the Corpus of Spontaneous Japanese, describes how usages can be segmented on long and short units, noting that short unit words (SUWs) may be divisible into smaller short unit words, and drawing attention to the potential for short unit words to be compounds themselves of atomic constituents termed simplex SUWs. Two commonly used morpheme inventories for segmentation of Japanese are the IPAdic (Asahara and Matsumoto 2003) and unidic (Den et al. 2007) dictionaries. The latter is newer, has a finer granularity and can be considered to better represent the simplex short unit words of Japanese, thus leading to segmentations with tokens of a more uniform size (Den et al. 2008).

However, neither of the fine grained word segmentations in Figure 3.7 is particularly desirable for dependency parsing. In particular, we are likely to find that most dependency arrows have a functional word at one end. In practice, Japanese dependency parsers use a morphological tokenisation as a preprocessing step and then regroup tokens for the purpose of establishing a dependency parse between chunks, or rather bunsetsu, which are a maximal word unit combined with any attached prepositions, case-marking particles or auxiliary verbs. Figure 3.8 shows the output of the KNP dependency parser for the sentence in Example (20).

3.2.1 Available tokenisers and parsers

KNP is a dependency parser (see Section 3.1.4) for Japanese (Kurohashi and Nagao 1994). It identifies a tree of dependency relationships between bunsetsu, which are formed as groups of tokens using the output of the morphological analyser Juman1 (Kurohashi and Kawahara 2017).

1http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN

[Diagram: a dependency tree in which uikipedia-ha and hyakka-jiten both attach to the root desu, and ōpuncontento-no attaches to hyakka-jiten, drawn above the sentence uikipedia-ha ōpuncontento-no hyakka-jiten desu.]

Figure 3.8: A dependency parse structure for the sentence in Example (20) built with the KNP dependency parser using the output of Juman (see Figure 3.7a). Note that the true output of the parser uses Japanese script, not the romanisation shown here for clarity.

It was developed with the aim of handling co-ordinations such as A, B and C and A, B or C well, where the co-ordination has ambiguous scope because the co-ordinates A, B and C might be phrasal types. To help resolve this ambiguity KNP uses a Japanese thesaurus to attempt to assign co-ordination to phrases which have a higher semantic similarity to each other. The full algorithm looks at similarities between the POS composition of single bunsetsu and uses a dynamic programming method to calculate similarities between potentially co-ordinated sequences of bunsetsu of differing lengths (Kurohashi and Nagao 1992). It references other lexical resources as well, including a case frame dictionary that lists the argument types of 861 common Japanese verbs. After constraining the parse tree using the co-ordinated structure similarity measures and case frame information, a number of additional Japanese-specific rules are applied to resolve remaining ambiguity. In the original implementation of Kurohashi and Nagao (1994) dependency ambiguity is resolved in part by a scoring system for the available case frames. Recent versions use a generative probabilistic model of case frame assignments based on corpus statistics (Kawahara and Kurohashi 2006b).

The Juman tokeniser used by KNP has its own morpheme dictionary, but to use the newer dictionaries IPAdic and unidic a popular choice today is MeCab,2 which uses a conditional random field (CRF) model (Lafferty et al. 2001) to segment and tag Japanese text (Kudo et al. 2004). MeCab uses a network representation of a Japanese sentence called a lattice, where the nodes are known dictionary words that have a potential surface form matching a substring of the sentence and the edges represent adjacency in the text. It learns and solves for full paths through the dictionary matches connecting the start of a sentence to the end. As such, the conditional random field model has advantages over previous models of Japanese segmentation. In particular:

• Features extracted for edges can look forwards and backwards in the lattice;

• Full sentence training and optimisation eliminates bias on the basis of word and path length.

Additionally, the conditional random field model can be used with any dictionary and can simultaneously learn an arbitrary number of morpheme tags such as POS. In testing on standard Japanese tokenised corpora MeCab achieved a higher f1-score than Juman on segmentation and fine grained POS tagging, and a substantially higher f1-score on a labelling task that included multiple levels of POS and inflection labelling.

2http://taku910.github.io/mecab/
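For reference, MeCab is easily invoked from Python. The sketch below assumes the mecab-python3 bindings and a system dictionary are installed; the exact feature fields in the output depend on whether IPAdic or unidic is in use:

    import MeCab  # assumes mecab-python3 and a dictionary are installed

    tagger = MeCab.Tagger()
    print(tagger.parse("ウィキペディアはオープンコンテントの百科事典です。"))
    # one line per token: surface form, a tab, then a feature string whose
    # fields (POS tags, lemma, reading, ...) depend on the dictionary in use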

A popular dependency parser for Japanese today that can be used with either IPAdic or unidic tokenisations output by MeCab is CaboCha (Kudo and Matsumoto 2002). CaboCha uses a method called cascaded chunking for building up a dependency tree leaf-first. The algorithm iteratively processes the sentence, deciding whether each bunsetsu is dependent on the bunsetsu directly following it (taking advantage of a head-last assumption which is valid for Japanese). All bunsetsu which are marked as dependent are then removed from the sentence before the next iteration. Iteration stops when only one bunsetsu is left. The decision regarding whether a bunsetsu is dependent on its neighbour is made by an SVM classifier based on features of the bunsetsu themselves and the parse constructed so far. The classifier is trained by simulating the parsing algorithm on gold standard parsed sentences to build up example gold standard decision results.

3.2.2 Dependency parsing and the Japanese language

CaboCha and KNP both make particular assumptions about the Japanese language that simultaneously make dependency parsing more suitable than constituent parsing and also help to guide interpretation of parse trees. Specifically, they assume:

• projectivity (see Section 3.1.4);

• head-finality (Kurohashi and Nagao 1994; Kudo and Matsumoto 2002); and

• (relatively) free word order.

The first assumption means that an intervening word in a phrase must be a modifier of one of the phrase's constituents. The second assumption means that in Japanese dependencies are unidirectional — parent nodes in the dependency tree always come later in the sentence than their children. In Chapter 5 we use these assumptions to simplify feature extraction for words in a syntactic relationship with constituents of MWEs in an idiom token candidate disambiguation task.

The third assumption means that the dependent children of a head word do not have a strict ordering. A word's relationship to its head is explicitly denoted by a functional particle rather than by its position in the sentence. Although the conventional Subject-Object-Verb order is usually used, the order of indirect objects is less fixed, and it is beneficial to have a parser that can handle unusual-but-not-illegal orderings. An additional property that dependency parsing handles well is that ellipsis of words is quite commonly allowable in Japanese. For example, the subject of a sentence can be freely elided if it is the same as that of the previous sentence. Ellipsis is also frequently used in place of personal pronouns where the context makes the person clear. Phrase structure grammars are not well suited to representation of free word order or ellipsis; a separate rule is required for each permutation, elided word and combination thereof. Therefore, in this thesis (and in Chapter 5 in particular), we chose to use dependency parsing when requiring syntactic analysis of Japanese.

3.3 Princeton WordNet

One of the most important — and certainly widely used — resources in computational lexical semantics is the Princeton WordNet (“WordNet”) lexical knowledge base. WordNet is, at its heart, a dictionary, but it was created with computational processing rather than a human audience in mind (Miller 1995).

Noun:

• sleep, slumber (a natural and periodic state of rest during which consciousness of the world is suspended) he didn’t get enough sleep last night; calm as a child in dreamless slumber

• sleep, sopor (a torpid state resembling deep sleep)

• sleep, nap (a period of time spent sleeping) he felt better after a little sleep; there wasn’t time for a nap

• rest, eternal rest, sleep, eternal sleep, quietus (euphemisms for death (based on an analogy between lying in a bed and in a tomb)) she was laid to rest beside her husband; they had to put their family pet to sleep

Verb:

• sleep, kip, slumber, log Z’s, catch some Z’s (be asleep)

• sleep (be able to accommodate for sleeping) This tent sleeps six people

Figure 3.9: The synsets of the word sleep in WordNet 3.1. Retrieved from http://wordnetweb.princeton.edu/perl/webwn.

As such, it includes not just a lexicon and POS information but also rich information on semantic relations between words. WordNet is used and referenced extensively in the work of this thesis, so in this section we give a detailed review of the construction of WordNet with a particular focus on terminology. Any reader familiar with WordNet can safely skip this section.

Synonymy — the phenomenon of two words having the same meaning — is fundamental to the organisation of WordNet. Each word is a member of one or more synsets. A synset can be thought of as a concept represented by the collection of words that can signify that concept.

Definition glosses in WordNet are given for synsets, not words, and the semantic relations encoded in WordNet are between synsets as well. The senses of a word are the synsets of which it is a member; the synonyms of a word are the other words in its synsets. The WordNet senses of the word sleep are shown in Figure 3.9.
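For concreteness, the synset membership shown in Figure 3.9 can be reproduced programmatically; the sketch below assumes the NLTK interface to WordNet and its downloaded data, which is only one of several ways to query the database:

    # A minimal sketch using NLTK's WordNet interface; assumes the nltk
    # package is installed and its WordNet data has been downloaded
    # (python -m nltk.downloader wordnet).
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("sleep"):
        members = [lemma.name() for lemma in synset.lemmas()]
        print(synset.name(), members, "-", synset.definition())
    # e.g. sleep.n.01 ['sleep', 'slumber'] - a natural and periodic state of rest ...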

WordNet encodes the following relationships between synsets (Miller 1995):

Antonymy connecting concepts of opposite meaning;

Hyponymy and hypernymy connecting concepts with more specific or general meanings, respectively;

Meronymy and holonymy connecting concepts that are considered constituent parts of a respective whole;

Troponymy for verbs regarding the same action performed in a different manner; and

Entailment between verbs for occurrences that necessarily involve the occurrence of a related event.

The inclusion of the hyponymy relation makes WordNet a taxonomy of the English language; the suite of semantic relationships makes it an ontology.

WordNet is distributed with software for querying its database. This software encodes information on inflectional morphology: a search for went will return the entry for go. WordNet did not originally encode any information about derivational or compound morphology and so, for example, interpret was not linked to interpreter in any way (Miller 1995). However, there is now a “standoff file”3 for morpho-semantic links between word senses including the new semantic role of the morphologically derived form (Fellbaum et al. 2007). For example, for interpret the derived form interpreter is assigned the role agent.

3 a separately distributed dataset

Guide to terminology

As a dataset and software package, WordNet is quite complicated and has a sometimes-cryptic vocabulary for describing its parts. Here, we introduce some of WordNet’s terminology and structure as a beginner’s guide to its use and as an ongoing reference for later chapters. In particular, we focus on the origins of the terminology and composition of synset and sense identifiers, to emphasise that they are tied to specific WordNet versions. More detailed information and WordNet downloads can be found online (WordNet 2013).

Some of the key concepts of WordNet are:

Synset type or ss type is a part of speech (or “category”) for a synset. There are four main POSs and one special case. They are represented in several different ways in different parts of WordNet:

    Part of speech        numeric code   character   filename
    Noun                  1              n           noun
    Verb                  2              v           verb
    Adjective             3              a           adj
    Adverb                4              r           adv
    Adjective Satellite   5              s           adj

A definition of satellite adjectives is given below.

Lexicographer files are the source files from which everything else is built. They are not typically needed for applications but their design affects the terminology and usage of many of WordNet’s features.

lex filenum or lexicographer file numbers identify lexicographer files. Each file is strictly limited to one POS and many of the files are dedicated to subcategories: for example, 13 is noun.food and 30 verb.change.

lex id is the name given to a two digit identifier assigned by the lexicographer to identify a sense of a word: combined with a lemma it identifies a synset the word is a member of. These numbers are used in sense keys and lex sense fields (defined below) to identify word senses.

Data files contain entries for synsets. One file exists for each POS: data.noun, data.verb, data.adj and data.adv. These files are generated from the lexicographer files and are more machine readable.

Each line of each data file represents a single synset and as such synsets are often referred to by their byte offset into the file. See also: synset offset.

The following information is recorded about each synset:

• The lemma of each word in the synset and the sense number (lex id) of each word.

• Pointers to other synsets — and optionally specific words — representing lexical and semantic relations such as hypernymy.

• Additional information such as the source lexicographer file number for the synset, the POS as an ss type number and sentence frame information for verbs.

• Finally, a gloss which consists of a definition and zero or more example sentences.

Synset offsets (in the data file) combined with ss type (POS) characters are often used to identify synsets. The offset represents the byte offset of the synset’s entry in the data file for its ss type, written as an eight digit zero padded decimal integer. For example, the synset for the common house cat is 02124272-n in WordNet 3.1, but the offset changes from version to version.

Index files are indexes for data files. They are sorted by lemma and provide, amongst other things, a list of synsets of which each word is a member. The synsets are expressed as offsets into the corresponding data file, and are ordered by frequency of usage in a reference corpus.

Sense keys are identifiers for word senses in the lexicographer files. They comprise the word’s lemma joined by a % character to a sense identifier called a lex sense, which identifies a sense entry in the lexicographer files and has the following format:

    ss type:lex filenum:lex id:head word:head id

A typical sense key looks like interpret%2:31:00:: or interpret%2:31:02::. The final two fields are usually empty: they are used to link satellite adjectives and adverbs to their head lemma and lex id.

The sense index file, index.sense, maps sense keys to offsets in the data file for the sense key’s part of speech.
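The round trip between sense keys and synset offsets can be illustrated with NLTK (a sketch; NLTK bundles a particular WordNet version, so the printed offset will match that version rather than the 3.1 examples above):

    from nltk.corpus import wordnet as wn

    # Sense key -> word sense; the key is one of those quoted above.
    lemma = wn.lemma_from_key("interpret%2:31:00::")
    synset = lemma.synset()
    # A synset identifier is the zero-padded data-file offset plus ss_type.
    print("%08d-%s" % (synset.offset(), synset.pos()))
    print(lemma.key())  # round-trips back to the sense key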

Satellite adjectives can be thought of as nested synsets. They are defined inside other synsets in the lexicographer files and have very closely related meanings. For example, the first sense of preceding has a number of satellite synsets: one for above, one for above-mentioned and above-named, one for foregoing, one for introductory and so on.

Princeton WordNet has undergone several updates and each time it is updated the lexicographer files change, as do all the data and index files generated from them. This means that sense keys and synset offsets may change between WordNet versions. For this reason, users of WordNet annotated corpora must be very careful as to which versions of WordNet were used for annotation. Sense mappings between versions are available alongside the main WordNet distribution (WordNet 2013).

3.3.1 SemCor

SemCor is abbreviated from the term semantic concordance, which is a collection of text for which words have been labelled with senses from a dictionary or, to put it another way, a sense-annotated gold standard corpus. SemCor is a specific — and very popular — concordance with sense labels from Princeton WordNet. In fact, WordNet word senses are ordered by frequency based on their counts in SemCor annotations. SemCor’s construction is detailed in Landes et al. (1998), which we summarise briefly here.

The original semantic concordance for WordNet was assembled using 103 texts from the Brown Corpus (Francis 1965), which are each approximately 2000 words in length, coming from a wide variety of US English sources. It also included chapters from a novella on the American Civil War. The most recent version of SemCor distributes only Brown Corpus text but has been expanded by 83 additional fully annotated documents and 166 more with sense annotations on verbs.

Each document in SemCor was first segmented into words contained in WordNet and tagged for part of speech. Human annotators then selected sense tags for each word sequentially using a software application called ConText which was developed concurrently with the annotation task. The application displayed text from a document with a single word highlighted and the user selected from relevant WordNet senses displayed below the text. In the first pass, annotators were given sole responsibility for their annotations. Accuracy was verified by a manual check of every 11th word. Problematic words and documents were noted and fixed, then the check was re-run on every 12th word.

ConText files

The resulting semantic concordance was released in the ConText file format, which is an Extensible Markup Language (XML)-like text format. Specifically, it conforms to an XML precursor, Standard Generalized Markup Language (SGML), notably allowing unquoted tag attributes.

The root node(s) are <contextfile> tags. Nested within comes one (or more) <context> tags which contain documents. Nested within that come <p> for paragraph, then <s> for sentence and finally <wf> for words alongside <punc> for punctuation. The <wf> tag has many attributes, including lemma for the word’s base form as it appears in WordNet and lexsn which is a WordNet lex sense; together they represent a sense key: see Section 3.3 for key structure and usage. Other notable <wf> attributes include pos containing the word’s part of speech and cmd indicating the word’s annotation status (done for sense tagged words). A complete example appears in Figure 3.10.

According to documentation supplied with WordNet, SemCor is no longer maintained or distributed. However, it can still be downloaded from Mihalcea (2017).

[Figure body not reproduced: an SGML fragment whose sense-tagged text begins “The Fulton_County_Grand_Jury said Friday an investigation of ...”, with <wf> tags carrying lemma, lexsn, pos and cmd attributes.]

Figure 3.10: An abbreviated example ConText file from SemCor.
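Because of the unquoted attributes, a strict XML parser will reject ConText files, but a tolerant HTML parser copes. The following sketch, assuming the beautifulsoup4 package and a hypothetical local file name, reconstructs sense keys from <wf> tags:

    # A sketch assuming the beautifulsoup4 package; html.parser tolerates
    # SGML-style unquoted attribute values that strict XML parsers reject.
    from bs4 import BeautifulSoup

    with open("br-a01") as f:  # a hypothetical local SemCor file name
        soup = BeautifulSoup(f.read(), "html.parser")

    for wf in soup.find_all("wf"):
        if wf.get("cmd") == "done" and wf.get("lexsn"):
            # lemma + "%" + lex_sense is exactly a WordNet sense key
            print(wf.text, wf["lemma"] + "%" + wf["lexsn"])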

3.3.2 Other WordNets

The Princeton WordNet design has since been emulated for a number of languages other than English. Here we discuss a few which have made a particular effort to link synsets from the new WordNet back into Princeton WordNet wherever they represent the same — or at least very similar — concepts. Pianta et al. (2002) outline two methods for constructing an aligned WordNet:

• Build the new WordNet from scratch, including semantic relations, and then make crosslingual links between synsets where possible.

Pianta et al. (2002) attribute this approach to EuroWordNet, which originally covered English, Dutch, Italian and Spanish (Vossen 1997) and has expanded to more languages since. EuroWordNet can be licensed for a fee.

• Bootstrap the new WordNet by examining synsets from an existing WordNet to determine whether the concepts can be copied directly. All relevant target language words are added to the new synset. Any semantic relations between copied concepts are assumed to exist between the words in the new language’s WordNet. MultiWordNet (Pianta et al. 2002) and Japanese WordNet (JWordNet) (Isahara et al. 2008) are examples of this approach.

MultiWordNet is an Italian WordNet linked back to Princeton WordNet and is available free of charge; a license can be obtained from multiwordnet.fbk.eu/english/license.php.

JWordNet is similarly linked back to Princeton WordNet and lives at nlpwww.nict.go.jp/wn-ja/index.en.html.

Bond et al. (2008) described an automatic process for mapping Japanese words into Princeton WordNet synsets. It was done using a combination of translation dictionaries and existing non-English WordNets with mappings into Princeton WordNet. The basic idea was that if a Japanese word is a translation equivalent of an English one it should share some, but not all, of the English word’s synsets. Bond et al. (2008) observe that if a single English word translates to two distinct Japanese words with different meanings then it may translate to different words in another language as well. For example, one synset contained the English bat and chiropteran and the French chauve-souris. Two of those three words had translation dictionary entries for the Japanese kōmori “bat (mammal)”, making it a strong candidate for inclusion in the synset. Similarly, another synset contained the English bat and the French gourdin and batte. Again, two of three translated to the Japanese batto “bat (club)”, making it another strong candidate.

To formalise this idea, Bond et al. (2008) used linked WordNets for French, Spanish and German to include words from those languages in Princeton WordNet synsets. Then, for each mixed language synset S they used translation dictionaries to link the English, French, Spanish and German words to Japanese words. They scored Japanese words for membership in S according to how many words in S had translation links to them. A bonus was given to Japanese words which had translations from multiple languages within S. Manual analysis of the result found that the use of multiple language WordNets led to higher precision and lower recall for the task of identifying Japanese words for inclusion in Princeton WordNet synsets.
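Our reading of this scoring scheme can be rendered as a short sketch (this is not Bond et al.’s released code, and the data structures are hypothetical):

    def score_candidates(synset_words, translations):
        """Score Japanese words for membership in a mixed-language synset S.

        synset_words: (word, language) pairs already in S.
        translations: maps (word, language) to a set of Japanese equivalents.
        """
        scores, languages = {}, {}
        for word, lang in synset_words:
            for ja_word in translations.get((word, lang), set()):
                scores[ja_word] = scores.get(ja_word, 0) + 1
                languages.setdefault(ja_word, set()).add(lang)
        for ja_word, langs in languages.items():
            if len(langs) > 1:  # bonus for evidence from multiple languages
                scores[ja_word] += 1
        return scores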

JWordNet is used extensively in many chapters of this thesis.

3.3.3 Translations of SemCor

The existence of multilingual synset-linked WordNets presents an opportunity for sense tags on texts in one language to be transferred to words in a translation of that text. MultiSemCor is one such translation of SemCor, which is annotated with Princeton WordNet synsets, into Italian text annotated with MultiWordNet synsets (Bentivogli et al. 2004).

Bentivogli et al. (2004) started construction of MultiSemCor by having professional translators translate SemCor into Italian. The translators were given a reference translation dictionary and asked to use the word translations provided by the dictionary in their text translations. The same dictionary was used by a word alignment system called KNOWA (Pianta and Bentivogli 2004) to link words in the original SemCor to their direct translations in MultiSemCor. Word alignment is an imperfect process, but asking the translators to make use of the translation dictionary was found to improve alignment precision from 83.5% to 88.4% and recall from 57.9% to 67.5%. Finally, for all sense annotated words in SemCor which were aligned to Italian words with a linked synset, the annotation was transferred to MultiSemCor.

Not all of SemCor was translated: MultiSemCor consists of 116 texts translated from the Brown Corpus texts in SemCor (Bentivogli et al. 2005). Nevertheless, this amounted to 258,499 English word tokens and 267,607 Italian ones. MultiSemCor has 92,820 Italian tokens with sense annotations transferred from the 119,802 sense-annotated English tokens in the translated subset of SemCor. The corpus can be browsed at multisemcor.fbk.eu/index.php.

Pianta and Bentivogli (2004) give more detail on how the word alignment is performed by KNOWA. The input to KNOWA is a pair of sentences to be word-aligned. The sentences are first run through morphological analysers for their respective languages to get a list of potential lemmas for each word. Words are then allowed to align if their potential lemmas are equivalents in the supplied translation dictionary. KNOWA’s algorithm looks first in the English-Italian direction, preferring to align English words to Italian words in a similar position in the Italian sentence. It makes any alignments it can, then repeats the process in the other direction for Italian-English word translations. In a final step, unaligned words are considered for alignment based on graphemic similarity.
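One directional pass of this procedure might be sketched as follows (our simplification; lemmatisation, the reverse pass and the graphemic-similarity fallback are elided):

    def align_pass(src_lemmas, tgt_lemmas, dictionary):
        """One directional pass of a KNOWA-style aligner (simplified sketch).

        src_lemmas / tgt_lemmas: lists of candidate-lemma sets per token.
        dictionary: maps a source lemma to a set of target lemmas.
        Prefers target tokens near the source token's relative position.
        """
        alignments, used = {}, set()
        for i, src in enumerate(src_lemmas):
            expected = i * len(tgt_lemmas) / max(len(src_lemmas), 1)
            candidates = [
                j for j, tgt in enumerate(tgt_lemmas)
                if j not in used
                and any(t in dictionary.get(s, set()) for s in src for t in tgt)
            ]
            if candidates:
                j = min(candidates, key=lambda j: abs(j - expected))
                alignments[i] = j
                used.add(j)
        return alignments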

Japanese SemCor

Japanese SemCor was constructed as a translation of the original SemCor on the same sentences as MultiSemCor (Bond et al. 2012). By contrast to Bentivogli et al. (2004), word alignment data was collected as part of the translation process.

Translators were originally asked to translate single sentences in isolation. However, due to concerns about the coherence of the translations, the task was modified to give the translators more context for the sentence being translated and also to assign all sentences from each document to a single translator.

The word alignment data was collected indirectly via click-through data from the user interface used for translation. Words which were sense tagged in the English SemCor had potential translations drawn from Japanese WordNet for the tagged sense. These potential translations were generated via the mapping between Japanese WordNet and Princeton WordNet synsets. In the sentence translation interface, translators were supplied with the lists of candidate word translations. Clicking on a suggested Japanese word would insert it into the translation text. Translators were encouraged to use this feature wherever possible, without sacrificing the fluency of the translation. By recording these clicks, the software was able to record a partial word alignment between the source text and the translation. However, only the lemma of the inserted Japanese word was captured, not its position in the translated text, nor any later edits for inflection.

The word alignment click-through data collected from the translation interface recorded alignments between lexical entries within the scope of a sentence rather than at the word token level.

The process of converting this data into a word alignment, and of using that alignment for sense transfer and annotation of the Japanese SemCor, is detailed in Chapter 6.

3.4 Japanese MWE resources

The distinction between short unit words, long unit words and bunsetsu presents a problem much like MWE token identification in itself, but Japanese additionally has multi-bunsetsu expressions which take on idiomatic meanings much like MWEs in English. In this section we review two significant resources available for Japanese MWEs of the multi-bunsetsu kind.

3.4.1 The OpenMWE corpus

The OpenMWE corpus (Hashimoto and Kawahara 2009) is a collection of sample usages of potentially ambiguous Japanese MWEs. MWE types were collected from five Japanese expression dictionaries. Rare or dubious expressions were culled by only selecting expressions appearing in a majority (that is, three or more) of the dictionaries. For many expressions, their particular arrangement of constituent words could only ever mean one thing. However, for a subset of the expressions the words could be interpreted with an idiomatic expression meaning or compositionally with the words’ literal meanings. These MWEs are termed ambiguous. Two native speakers filtered the list of the MWE types gathered from the idiom dictionaries, selecting ambiguous expressions and rejecting expressions with only one meaning. Their work was validated by calculation of agreement with two additional native speakers on a subset of the idioms. The end result was that 146 idiom types were classified as potentially conflatable with a literal usage. The idioms were tokenised with Juman and parsed with KNP to get a canonical structure.

Next, the Japanese Web Corpus (Kawahara and Kurohashi 2006a) was searched for appearances of the ambiguous expressions. Example sentences were found in which an ambiguous expression’s constituent words appeared in the canonical syntactic arrangement found for the expression. These sentences were collated and manually annotated I if they were in fact an idiomatic token of the MWE, and L if they were literal and only happened to share the form of the idiom. The result is a corpus suitable for use in supervised machine learning identification of idiom tokens. Hashimoto and Kawahara (2009) performed some initial experiments of this kind but left many questions open.

The final corpus comprised 102,856 sentences, each containing one MWE token candidate. Of these, 68,239 were idiomatic MWE tokens and 34,617 were literal usages. 68 idioms had more than 1000 token candidates annotated; for the remaining idioms, fewer than 1000 candidates were found in the Japanese Web Corpus. Overall it is one of the largest idiom corpora available, especially for a gold-standard annotated resource.
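As an indication of the kind of supervised experiment the corpus enables, the following sketch trains a simple idiomatic-versus-literal classifier; it assumes scikit-learn, and the character n-gram features are only a stand-in for the richer features discussed in Chapter 5:

    # Illustrative sketch of idiom-token classification over OpenMWE-style
    # data; sentences are strings and labels are "I" (idiomatic) or "L"
    # (literal). Assumes the scikit-learn package.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def evaluate(sentences, labels):
        model = make_pipeline(
            # Character n-grams avoid the need for word segmentation.
            CountVectorizer(analyzer="char", ngram_range=(1, 3)),
            LogisticRegression(max_iter=1000),
        )
        return cross_val_score(model, sentences, labels, cv=5).mean()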

3.4.2 JDMWE

The recently published Japanese dictionary of multiword expressions (JDMWE) encodes type-level information on thousands of Japanese MWEs (Tanabe et al. 2014). Parts of the dictionary are available for download at http://jefi.info/. JDMWE encodes information about the lexico-syntactic variations allowed by each MWE type it contains. For example, the expression hana wo motaseru — literally “to have [someone] hold flowers” but figuratively “to let [someone] take the credit” — has the syntactic form entry [N wo] *V30. The asterisk indicates modifiability, telling us that the head [V]erb motaseru “cause to hold” allows modification by non-constituent dependants — such as adverbs — but the dependent [N]oun hana “flowers” does not. A subset of the dictionary was released some time ago (Shudo et al. 2011), and overlaps to some extent with the MWE types in the OpenMWE corpus. The OpenMWE corpus and the early subset of JDMWE are central resources used in Chapter 5 of this thesis.

Part II

Glossing

Chapter 4

The Wakaran glossing tool

A common bottleneck in the computational study of semantics is the acquisition of appropriate annotated corpora. Manual annotation with the abstract phenomena we wish to study, such as word sense, can be labour intensive to achieve coverage of a large number of word types. In Chapter 6 we will describe one shortcut to this process: the construction of a sense annotated corpus of Japanese text translated from English by transfer of annotations from the translation source corpus. However, that method is limited to existing manually annotated corpora and made use of the direct links between concepts in the Japanese lexical knowledge base (LKB) JWordNet and the original English WordNet. In this chapter we describe the development of an end-user application which naturally produces links from text to dictionary senses as a side-effect of its use. The annotations are not expected to carry the same level of rigour as an annotator with specialist linguistic knowledge could provide, but they do have the qualitative advantage of being based on a user’s information needs rather than an abstract annotation guideline. This chapter will give details of how the application was designed to cater to users’ needs whilst also producing a rich source of data for research. It will conclude with an analysis of some general purpose usage signals captured by the software to model how users chose to interact with it. Chapter 7 will go into detail on the collected annotations and make use of them for evaluation of WSD algorithms that attempt to reproduce the semantic interpretations made by users.


Figure 4.1: The Wakaran Glosser in action: Section 4.1 introduces and explains the features illustrated here. Text from Wikipedia (2016).

4.1 Introduction to the Wakaran Glossing tool

The Wakaran Glosser (“Wakaran”) is an interactive web site built for intermediate learners of Japanese as a second language (L2) with a first language (L1) of English. It provides reading assistance by means of a vocabulary reference. Specifically, Wakaran curates a full vocabulary list for a piece of Japanese text by retrieving entries from a Japanese-English (JE) dictionary for every Japanese word found in that text. The full vocabulary list is hidden by default, with dictionary entries for specific words available in a floating window that pops up when the user hovers their mouse over the Japanese word. However, Wakaran offers features to expand abbreviated definitions for additional information and to promote specific definitions to be permanently visible inline above the text. Figure 4.1 shows a number of these features in action at once. In this section we go into full detail of the features and intended usage of Wakaran.

Figure 4.2: This first view the user gets of a document submission contains only the original text, with highlights on text where definitions are available. Text from Wikipedia (2016).

The primary use case intended for the Wakaran Glosser is to assist students with reading tasks prescribed by courses in intermediate Japanese as a second language. The first page a user sees when navigating to the Wakaran URL is a text submission box into which they may type Japanese text, but they are primarily expected to paste the full text of a reading task which has previously been copied to their system’s clipboard. Wakaran assigns a unique identifier to every occurrence of a submission to this form and the full text is stored for processing. We refer to each copy of text submitted in this manner as a document submission (or, more briefly, a submission) to distinguish it from the source document’s content, which is expected to be submitted multiple times by a number of students taking the same course simultaneously. Once text is submitted, Wakaran identifies tokens of words in the Japanese Multilingual Dictionary (JMDict) before displaying for the user a copy of the text with interactive features for accessing the dictionary entries. The first view a user gets of a document is illustrated in Figure 4.2. None of the retrieved dictionary entries are visible as yet: the only text is that of the original document. Wakaran allows the student to read uninterrupted until they choose to seek additional help by hovering the mouse over a word to show its translations. Parts of the text for which translations are available are highlighted with a grey background. Before they can read the dictionary entries the user has to take specific action to trigger their display. This quality of Wakaran is beneficial to both its research goals and its application use case. For research it provides a means to keep an objective record of the dictionary entries the user has consulted. The benefits to the application use case of hiding the vocabulary aid until it has been explicitly requested are discussed in Section 4.4.

Figure 4.3: Hovering the mouse over a highlighted token shows glosses for all compounds that token takes part in. Text from Wikipedia (2016).

When a user hovers their mouse pointer over one of the highlighted segments of text, a floating window appears showing dictionary entries related to that piece of text. Figure 4.3 shows an example of the pop-up when it has just been triggered for a specific token ken. The window shows a compound shutoken “capital city area” in which the token participates and also shows two possible translations of the word ken itself: sphere/circle and category. There are many cases in which Wakaran will show multiple entries for a token. As seen in this example, a token may be a constituent of one or more compounds. In some cases a token may have an exact match to an entry but also be a potential inflection of a different entry. In a non-segmenting synthetic language like Japanese, identifying all compounds and uninflected forms relevant to a piece of text is a non-trivial task. A general procedure for identifying dictionary entries matching free Japanese text is developed in Chapter 6. Its specialised implementation for JMDict in Wakaran is discussed in more detail in Section 6.4.

All of the translations shown in Figure 4.3 are abbreviated and slightly faded, but have a pair of control buttons at the far right. To view a translation more clearly and in more detail the user can select it by clicking the star icon. Moreover, a translation can be promoted out of the gloss to be inserted inline in the text by clicking the other icon, which is a Japanese character for the word insert. In Figure 4.1, which shows a gloss for the token hō, the translation method is expanded and shown in high contrast because it has been selected; the translation law is highlighted and appears above the original token in the body of the text because it has been promoted. The promoted text law will remain in place even when the gloss pop-up has been dismissed by moving the mouse away. In this way the user may have a more visible and constant reminder of a translation of their choosing. Information about which translations are selected and promoted is logged by Wakaran for the purpose of research outcomes. Note that the glosses in Figure 4.1 and Figure 4.3 also have Japanese elements with controls for selection and promotion: these are phonetic readings for the words, which are additional information provided by the dictionary that may be useful to a reader. In Japanese, a phonetic reading that has been inserted above a word in text is called furigana. The inclusion of optional furigana injection as a feature is not part of the research focus of Wakaran but is an obvious extension of the translation injection functionality and implements a common information need for students.

Figure 4.4: The Wakaran wordlist page containing all entries with selected or promoted translations. Text from Wikipedia (2016).

Finally, Wakaran provides a link on the page containing the submission text to a wordlist page which shows any entries for which students have selected or promoted translations or phonetic readings. The full entry is shown but is formatted the same as the gloss pop-up to indicate which senses have been marked as selected or promoted. Figure 4.4 shows a wordlist for the running example that has been used in this section (Wikipedia 2016).

In this chapter we will explore the design of the Wakaran application and give a detailed account of user engagement with its user interface. In Chapter 7 we will look more closely at what can be learned from user interactions with the JMDict lexicon through the Wakaran interface. In addition, Chapter 7 will outline the implementation, deployment and evaluation of a WSD algorithm in Wakaran to automatically pre-select translations of polysemous words.

4.2 Related software

Wakaran shares features with a number of similar Japanese-English software applications for multilingual users. In this section we review a selection of the most similar applications that are freely available. None has exactly the same set of features as the others and in a sense Wakaran unifies the main features of each of the applications reviewed here.

4.2.1 WWWJDIC

The closest related software to Wakaran is WWWJDIC: Online Japanese Dictionary Service (Breen 2016). Like Wakaran, WWWJDIC is a website with a free text glossing feature which accepts submissions of Japanese text as input and looks up entries in the Japanese-English dictionary JMDict, which was developed as part of the same project (Breen 2004). Figure 4.5 illustrates a small portion of the output of WWWJDIC’s text glossing feature on the same paragraph of text used extensively in the Wakaran examples in Section 4.1. Each line of text has its complete list of extracted entries displayed in full beneath it. As a result, the entries for the line of text in Figure 4.5 take up a significant proportion of a page. Although this provides maximal access to dictionary information, it does hide the greater part of the source text at any one time. If Wakaran had taken this approach it would have made collection of linguistic data less feasible because there would not be a recordable interaction at the point when a user accesses a dictionary entry. In Section 4.4 we discuss the potential benefits to a language learner of Wakaran’s approach of displaying the complete source text and holding back the dictionary aid for optional consultation.

4.2.2 Rikai.com and Rikai-chan

The website www.rikai.com (Rudick 2016) and the derived Firefox browser plugin Rikai-chan (ffjon 2016) both extract JMDict entries for Japanese free text and hide dictionary entries behind pop-up windows in a similar manner to Wakaran. As shown in Figure 4.6, the pop-up dictionary entries of the browser extension Rikai-chan have a relatively concise presentation when compared to Wakaran, and can be applied to Japanese text in any HTML web page viewed in the browser. However, they lack features for marking specific translations by selection and, more significantly, for promoting translations and furigana readings inline above the text.

Figure 4.5: An example of the presentation of JMDict entries retrieved by WWWJDIC (Breen 2016) for free text. Text from Wikipedia (2016).

Figure 4.6: An example of the presentation of JMDict entries retrieved by the Firefox browser extension Rikai-chan. Text from Wikipedia (2016).

Figure 4.7: An example of the insertion of furigana readings and presentation of JMDict entries in a pop-up retrieved by the Firefox browser extension Furigana Inserter. Text from Wikipedia (2016).

4.2.3 Furigana Inserter

Another browser extension, Furigana Inserter, is designed to insert furigana reading aids inline above Japanese text in any web page. To insert furigana above a piece of text the user highlights that text with their mouse and then activates Furigana Inserter from a right-click context menu. Furigana Inserter also integrates the pop-up window dictionary entry abilities of Rikai-chan, which can be accessed by hovering the mouse over any word for which furigana has been inserted. Figure 4.7 shows Furigana Inserter’s features in action. Of the systems reviewed in this section, this one comes closest to feature parity with Wakaran, but with a proviso: only furigana readings can appear permanently shown above the original text; the translations are non-interactive and cannot be selected, promoted or made permanently visible in any way.

4.2.4 The Sharp Intelligent Dictionary

Whitelock and Edmonds (2000) describe the Sharp Intelligent Dictionary (SID), a computer application that glosses English text with Japanese. The Sharp Intelligent Dictionary is presented as a reading and learning aid on the principle that breaking away from a text to consult a dictionary disrupts the learning process. It attempts to choose the best glosses by first disambiguating the part of speech of the English words. It then seeks to segment the sentence into the largest possible MWEs with dictionary entries that exactly match the POS tags extracted. Ties between MWE lengths are broken first by the number of intervening tokens and then by the probability estimates output by the POS tagger for the tags matched on the constituent tokens. Together these steps perform a rudimentary coarse grained WSD and MWE token identification in a manner similar to the RETOKENISE algorithm of Chapter 6. Interestingly, the Sharp Intelligent Dictionary then allows the user to make corrections to the segmentation by accessing rejected MWEs in a context menu and re-calculates the selection of MWE tokens based on any corrections made. The Sharp Intelligent Dictionary shows the selected translations inserted below the lines of the original text by default. It does offer the ability to hide translations below a certain reading difficulty level. However, as we will discuss in Section 4.4, the automatic appearance of all glosses (above a student’s selected reading level) removes the incentive for the student to first try to infer the meaning of an unknown word, which is an activity that improves learning outcomes (Fraser 1999).

4.3 User base

The Wakaran application was presented to Japanese as a second language undergraduate students by means of a 5-10 minute guest appearance in timetabled classes giving a demonstration of the software’s features. A link to the web interface was also posted by the subject co-ordinator to the subject’s online materials, alongside a link to WWWJDIC. Students were free to use Wakaran, any similar tool of their choice or even just a paper dictionary as an aid to understanding Japanese language texts from the subject’s assigned readings.

On several occasions — in particular, after deployment of significant changes to Wakaran — we stayed in the classroom for an additional 30 minutes after an introductory presentation to provide assistance in the case of technical problems. In this time we made the following observations of the context in which Wakaran was being used:

• Students worked in small groups, typically of 2-4 people, sharing a computer to co-operatively read and understand a Japanese text.

• Reading tasks ranged from under one page to many pages long; longer texts were split over more than one week, with a portion of the text studied in a given week.

• The instructor fielded questions from students as they arose.

• Occasionally the instructor would initiate a whole-class discussion on a subtlety in a particular sentence.

The classes in which Wakaran was promoted were all second and third year university courses. The complexity of the texts and the relative independence of students’ engagement with the material also indicate that the potential user base of Wakaran was working at an intermediate to advanced level of Japanese as a second language. This gives us confidence that although the users are not native speakers or linguists they will still be making informed choices about how to interpret the Japanese texts they are studying.

4.4 Design considerations

It is important for our research that the design choices for Wakaran be consistent with results of research into design of software for computer assisted language learning (CALL). We want to study the usage of Wakaran by students who are engaged with figuring out the meanings of words and not with users who are engaged with figuring out an unhelpful user interface. Moreover,

CALL literature. We will not be investigating the learning outcomes for users of Wakaran itself, rather we seek to use the findings of CALL research to increase the potential for Wakaran to provide a valuable user experience. Since usage of Wakaran is motivated by the users own needs, we will use active engagement with Wakaran as a proxy measure of how well it achieves its goals. In

Section 4.6 we will examine whether actual usage of the website is consistent with an actively engaged user base.

4.4.1 Word glossing in computer assisted language learning

To date there have been numerous studies into the effects of providing contextual information for a second language reading task. They have measured, variously:

• the effects of glosses on incidental vocabulary acquisition (Kulkarni et al. 2008),

• the effects of glosses on reading comprehension (Nerbonne et al. 1998),

• the effects of varying the amount of information provided in glosses (Lomicka 1998),

• the effects of fully disambiguating sense in glosses, as opposed to forcing the learner to infer meaning from a selection of senses (Lin and Huang 2008), and

• the strategies language learners use when encountering an unknown word while reading (Fraser 1999).

There are a number of published reports on the inclusion of natural language processing technology in glossing software, such as Nerbonne et al. (1998) and Whitelock and Edmonds (2000), which both use morphological analysis to improve identification of dictionary entries in text. Kulkarni et al. (2008) made a promising step towards using WSD in this domain by showing in controlled user studies that reordering sense glosses using a state of the art supervised WSD method does indeed improve vocabulary acquisition for students using the system. In this section we explore the implications of these studies, and others, for the design of Wakaran.

The major use case intended for Wakaran is to provide assistance in reading comprehension. Lomicka (1998) describes an observational user study of twelve users performing reading comprehension tasks. Noting a lack of research on the effects of glosses on reading comprehension, Lomicka (1998) performed a think-aloud reading comprehension study, varying the level of detail given in the glosses. Subjects were observed to make more inferences from the text the more gloss material they were given. This result supports the decision for Wakaran to include all definitions of words from the JMDict dictionary and also to include all possible compounds each morpheme of text could participate in.

Fraser (1999) was a more comprehensive long term observational study aimed at getting a deep understanding of how language learners deal with unknown words in text and the effects this has on vocabulary acquisition. Strategies for dealing with unknown words were broken down into three parts: ignore, infer and consult. It was found that the worst strategy to take was to ignore an unknown word. Best overall was to infer a meaning but to then consult a dictionary for reference. These results have implications for glossing software such as Wakaran: it is important to be aware that any characteristic of the interface encouraging users to ignore, infer or consult on the meaning of unknown words might affect learning outcomes. In particular, designing Wakaran to hide dictionary entries until the user requests them supports the most constructive learning strategy because a user has a chance to infer the meaning of an unknown word before they have laid eyes on a translation.

Lin and Huang (2008) target an implicit challenge to glossing as an educational aid: namely, it is thought that the process of inferring meaning helps with incidental vocabulary acquisition, and that providing glosses takes away the need to infer meaning. Lin and Huang (2008) give a succinct summary of the literature relevant to this concern, covering the following key points:

1. Inferring meaning in context is thought to help with vocabulary acquisition.

2. However, making mistakes while reading is thought to hinder acquisition.

3. Therefore, student skill is considered to be a factor in the usefulness of glosses.

These results mean that Wakaran may be most helpful to less advanced users, who are more likely to make mistakes inferring the meaning of unknown words. However, Fraser (1999) found that checking a dictionary following an attempt at inferring meaning was more effective than inference alone. As such, the support of Wakaran for that use case should be sufficient to allow students to make use of the benefits of inference.

Lin and Huang (2008) conducted a study with 175 language students investigating two questions:

1. What effect does student skill have on the usefulness of glosses?

2. Which leads to better vocabulary acquisition: glosses which fully disambiguate sense (meaning-given glosses) or multiple choice glosses which require a degree of inference (meaning-inferred glosses)?

They found that the students given meaning-inferred glosses gained more vocabulary immediately and retained it better in tests two weeks later. Results for student skill were less clear: the students classified with higher skill scored higher as expected, but differences in retention were not statistically significant. In fact the higher skill students with the meaning-given glosses retained the least of all! Lin and Huang (2008) put the lack of variability in results for students of different skill down to the homogeneity of subjects – they were all taken from the same school. Nevertheless, these results do indicate that the design choice for Wakaran to present a full list of available translations and MWEs is a good one for vocabulary acquisition because it requires some degree of meaning inference.

Though the meaning-given glosses provided to subjects in the study of Lin and Huang (2008) were manually disambiguated, it raises interesting questions regarding the purpose and nature of automatic disambiguation. Although researchers in WSD have been mostly content to find the correct sense, perhaps it would also be interesting to identify senses as definitively incorrect. In Chapter 7 we investigate the inclusion of automatic WSD in Wakaran by using a graded sense annotator to pre-select a single best sense, which is the more traditional setup for WSD. We leave investigation of a worst-sense annotator for future work.

Nerbonne et al. (1998) make the case that natural language processing technology can improve CALL systems by demonstrating the application of morphological analysis in the system GLOSSER. GLOSSER provides reading tasks and vocabulary tests on the principle that vocabulary learning benefits from exposure to example uses of a word. The system allows the user to look up glosses including dictionary senses and information derived from the morphological analysis for words in the reading tasks. Accuracy was seen by Nerbonne et al. (1998) to be the biggest hurdle to the success of natural language processing in GLOSSER, for fear of losing the trust the user holds in the system. As such, a preliminary study was conducted testing the accuracy of the morphological analysis subsystem of GLOSSER directly. This was followed by a user study assessing the tool as a whole. The preliminary study turned up an error rate higher than was hoped, with errors fitting into a few main categories; however, these errors were not found to show up during the user study.

The user study described by Nerbonne et al. (1998) tested reading comprehension rather than the vocabulary acquisition of the human subjects. Nerbonne et al. (1998) point out that a cross lingual glossing tool is useful not only as a learning aid but also as a reading aid. Ultimately the results showed that students using GLOSSER did slightly better than students manually referring to hand-held dictionaries and were significantly more confident about their understanding. Nerbonne et al. (1998) claim that this success shows natural language processing is ready to improve CALL; however, no direct comparison to a CALL system not using natural language processing is shown. Like GLOSSER, Wakaran makes use of a morphological analyser to aid the dictionary lookup process. In the case of Wakaran the analysis allows us to present dictionary entries for a wider variety of inflections and compounds, which has two positive benefits: (a) it is a guard against poor coverage which might reduce confidence in Wakaran; and (b) it means that a student consulting a gloss will still have a small meaning inference task to perform, which can aid learning outcomes.

Tabata-Sandom (2016) notes the growing popularity of online Japanese glossing dictionaries, particularly for students engaging in wider reading on the web. The study was a qualitative one with 11 participants, built around a detailed analysis of a think-aloud procedure performed by the students to describe what they were doing as they performed a reading task. Vocabulary acquisition and reading comprehension were also tested, though the small number of subjects means the derived scores are not generalisable. Students from a large range of learning levels participated and Tabata-Sandom (2016) found a large variety of interaction levels with the pop-up glosses. The more advanced students engaged with the sentence top-down, meaning they engaged with the syntax first and only consulted glosses where they thought the meaning was important to the text. At the other end, the least advanced students consulted a large proportion of the glosses and built their understanding bottom up. Students at the intermediate level performed a mixture of both approaches. We highlight that in one of the quotes from the think-aloud exercise a student sounds ready to give up on understanding a word because “there are too many different definitions... I can’t get my head around it...”. Tabata-Sandom (2016) concludes that when the text is too demanding the pop-up glosses do not help students to engage deeply with the Japanese text, and also that students benefit more from reading when the language of naturally occurring texts is modified to suit pedagogical purposes. We advertised the existence of Wakaran to intermediate classes in Japanese, which inflates the chances of seeing intermediate students in our study and also makes it more likely that the texts studied will be those prepared by a teacher. In Section 4.6 we analyse data collected from usage of Wakaran to establish that Wakaran was used in the context and manner we intended. However, future work may require a means of detecting student reading level, the nature of student engagement with glosses, or both.

4.5 System design

In this section we will give an outline of the design of the Wakaran application software to meet the requirements outlined in Section 4.1 and Section 4.4. Additionally, the design includes provision for the automatic WSD features of Wakaran described in Chapter 7 to be integrated into Wakaran’s control flow.

Wakaran was designed as a web application so as to have a low barrier to entry for its use. Since the quantity and quality of research data collected is dependent on students choosing to operate Wakaran for their own purposes, we did not want operating system compatibility nor software installation and configuration to be an obstacle or disincentive. However, this choice of platform comes with a significant challenge: potentially long running NLP pipeline steps including morphological analysis, retrieval of dictionary entries and WSD (see Chapter 7) must be triggered by time-constrained HTTP POST requests. This means that data related to document processing jobs must be shared between a series of requests. Thus a major element of the design of Wakaran is a subsystem for spawning long running background tasks from HTTP requests and tracking their progress in order to display the results upon task completion.

The second main architectural challenge for Wakaran is to enable the collection of re- search data from user interactions with the web page front end. In this instance the decision to use a web-enabled application architecture is an advantage because it presupposes a communication channel between the application frontend and a centralised application server.

In this section, we will describe how data is stored and shared between Wakaran HTTP server processes and how research data is collected. We will also describe the parallel job processing model used to handle Wakaran’s NLP pipeline.

4.5.1 Data model

The main elements in the Wakaran data model for a document submission are the dictionary entries to be displayed and the spans of text to which they will be linked. Text is decomposed into paragraphs and sentences for formatting purposes and further into tokens identified by a morphological analyser. Wakaran uses the algorithms described in Section 6.4 to link dictionary entries to sequences of one or more morphemes: in the Wakaran data model such links are called gloss entities. Dictionary entries themselves are modelled as a headword linked to a collection of readings and paraphrases grouped into senses. Since Wakaran uses the Japanese-English portion of JMDict, the headwords are in Japanese, the readings are in the Japanese phonetic alphabet hiragana and the paraphrases are in English. To enable updating the version of the JMDict dictionary used in Wakaran whilst still enabling interpretation of research data, dictionary entries are grouped under a versioned resource identifier in the data model.

Figure 4.8: UML class diagram of the Wakaran data model. This figure shows only the entities and attributes supporting the main use case of user interaction with translational paraphrases.

Figure 4.8 gives a Unified Modeling Language (UML) class diagram showing the objects and fields used to implement interactive text glosses in Wakaran. Each document submission is represented as a Document object and is assigned a uniquely identifying key. Submissions are also grouped into mini-corpora unique to the browser session. Gloss objects assign dictionary headwords to sequences of tokens in the document. Where a Token participates in multiple overlapping headword matches it will be related to more than one Gloss. Each Gloss contains a sequence of Paraphrase objects representing its readings and senses. Paraphrase is the only object class with mutable state: the selected and promoted members denote whether the entry is shown highlighted and expanded, or inserted between the lines of the source text as shown earlier in Figure 4.1. Due to space considerations imposed by how Wakaran displays content, each Token is constrained to have at most one Paraphrase promoted to appear inserted above it. Whilst Paraphrase objects record state for a specific submission, they are proxies for concepts and readings found in a dictionary Resource, which are represented by objects of the Entity class. The Entity.kind member distinguishes reading entries from true meaning paraphrases. The reading and paraphrase text itself is stored as an EntityAttribute component of an Entity. A JMDict word sense may have several paraphrases, which are modelled as separate EntityAttribute objects on the Entity for that sense. A meaning Entity can also have attributes denoting POS, which are differentiated from the paraphrases by the kind member of EntityAttribute.

The model as presented has one significant omission: the static Entity objects are not grouped under their JMDict headwords; grouping of paraphrases under headwords is only represented by the per-submission Gloss objects. Information about how headwords are mapped to concepts in JMDict is stored in a separate index. The links are built by the NLP process that identifies entries for linking to the text, which is described in Section 6.4.

This data model is replicated in three components of Wakaran’s architecture:

1. The SQL database layer, which persists the document state between the reading view and the wordlist view.

2. The server application layer, where the document goes through the NLP pipeline and is rendered to HTML for the frontend.

3. The HTML and Javascript frontend, which implements the user interface and communicates changes to paraphrase selected and promoted state back to the server for persistence.

In the next two sections we examine the major components of Wakaran and how the control flow of Wakaran moves data between them.

Figure 4.9: UML component diagram of the system architecture of Wakaran.

4.5.2 Architecture

Wakaran is a dynamic web application, meaning that users interact with it using a web browser to view pages generated at request time by handlers in an HTTP server. An HTTP handler process in Wakaran is supported by a number of other long-lived server processes used to manage persistence, batch jobs and data synchronisation. A diagram of the system architecture can be seen in Figure 4.9. Computationally intensive requests are handled asynchronously in background worker processes managed by the Celery framework. Tasks for the background workers are submitted to a message queuing service implementing the AMQP standard, and task state is synchronised between processes using the in-memory key-value store Redis. Processed submissions are stored in an SQL database representation of the model described in Section 4.5.1.

4.5.3 Control flow

In a basic use case for Wakaran, a user uploads Japanese text to the website and in response receives a copy of their text with interactive features for showing definitions of each word. The user can then select certain parts of each definition to be highlighted or pencilled between the lines of the original text, and those changes are persisted so that they will be represented on the wordlist page and when the submission view page is reloaded. As the user interacts with various features of the Wakaran user interface, events are recorded in a log on the server for research purposes.

Figure 4.10: UML sequence diagram of interactions between the Wakaran HTTP handlers and other components in the period of time between the submission of text and the first display of glossed text content for the user.

The HTTP frontend is implemented in Python as a Web Server Gateway Interface (WSGI) handler served by the uWSGI HTTP server,1 and handles HTML generation and session management tasks. It receives new document submissions as HTTP POST requests with the text to be annotated in form data. The submission is immediately assigned a unique key based on the current timestamp.

1 http://uwsgi-docs.readthedocs.io/

Figure 4.11: UML sequence diagram of interactions between the Wakaran HTTP handlers and other components in the extended wait sequence, in which annotation does not complete before the pending-page timeout and the user refreshes the page.

Figure 4.10 shows the sequence of events that follow from the submission of a new document. The annotation of a submission with dictionary glosses involves some potentially long running NLP tasks — especially once automatic WSD is introduced in Chapter 7 — so new submissions are placed on a task queue hosted on an Advanced Message Queuing Protocol (AMQP) server (specifically, RabbitMQ2). After a new annotation task has been submitted to the queue, the WSGI handler returns an HTTP redirect notice to a URL for pending tasks that can be safely refreshed in a browser without re-submission of the original POST form data. The handler for the pending task URL waits for the completion of the task to be registered in the Redis key-value store, using a blocking call with a timeout of one minute. The completion of annotation tasks is handled by asynchronous worker processes not shown in Figure 4.10; the steps involved will be described in the next paragraph and are illustrated in Figure 4.12. If the task does not finish within the timeout period, the page renders a message informing the user that the task is taking a long time and that they may refresh the page to try again. The extended wait sequence including a page refresh is shown in Figure 4.11. Otherwise, when the handler detects that the task is complete, it sends a redirect response to a URL based on the submission’s unique key, from which the annotated document can be viewed using the interface described in Section 4.1 and shown in Figure 4.1.
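The blocking wait in the pending-task handler can be sketched as follows using the redis-py client. The key names, URL scheme and helper functions are hypothetical stand-ins for illustration, not the actual Wakaran code.

import redis

store = redis.Redis()  # connection parameters assumed

def redirect(url):
    """Stand-in for the framework's HTTP redirect response."""
    return "302 Found", {"Location": url}

def render_wait_page(key):
    """Stand-in for rendering the 'task is taking a while' page."""
    return "200 OK", f"Submission {key} is still being annotated; refresh to retry."

def handle_pending(key):
    # Block for up to one minute waiting for the worker to push a
    # completion flag for this submission (the key name is hypothetical).
    flag = store.blpop(f"done:{key}", timeout=60)
    if flag is None:
        # Timed out: ask the user to refresh (safe, as this is a GET URL).
        return render_wait_page(key)
    # Complete: redirect to the document view for this submission key.
    return redirect(f"/doc/{key}")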

Python worker processes spawned by a Celery framework3 worker daemon consume document annotation tasks off the AMQP queue and process them, periodically updating the task’s status in a shared Redis key-value store. Figure 4.12 shows the sequence of events that take place from when a worker takes a task off the AMQP message queue until the task has been fully persisted to the database. Initially, the worker preprocesses the submitted text to detect sentence and paragraph breaks. The sentences are then tokenised and dictionary entries relevant to the text are identified using the algorithm described in Section 6.4. When the process of identifying dictionary entries for the text has completed, a copy of the glossed document is sent to the Redis store to be immediately accessible to HTTP requests, and a flag is set. After this point, HTTP request sequences such as those shown in Figure 4.10 and Figure 4.11 can proceed from the submission pending URL past the block on the content being ready to the document view URL. However, the submission still needs to be persisted to the database before changes to the selected and promoted states of glosses can be reflected on the server. The worker persists a copy of the result to the SQL database, which uses a schema representation of the model described in Section 4.5.1. Once the submission has been persisted, its new state is flagged in the Redis store so that the HTTP handlers can start accepting changes to gloss state. The Redis copy of the document is then removed: subsequent requests to the document’s URL will fetch the document from the SQL database. Finally, the Celery worker saves a static copy of the document to the database as a record of its initial gloss state. In the basic version of Wakaran the glosses all start in the same state, so this snapshot is redundant, but in Chapter 7 we introduce automatic WSD features of Wakaran that set some paraphrases to be preselected in the initial version of the submission.

2 https://www.rabbitmq.com/
3 http://www.celeryproject.org/

Figure 4.12: UML sequence diagram of interactions between the Wakaran batch worker processes and other components in the period of time between the appearance of a new document on the AMQP queue and the persistence of the annotated document to the SQL database.
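A minimal sketch of such a worker follows, assuming Celery with a RabbitMQ broker and redis-py for the shared store. The task body mirrors the steps just described, but the helper functions and key names are hypothetical.

import json
import redis
from celery import Celery

app = Celery("wakaran", broker="amqp://localhost")  # RabbitMQ broker assumed
store = redis.Redis()

def preprocess(text):                # stand-in: sentence/paragraph detection
    return [s for s in text.split("\u3002") if s]

def gloss(sentences):                # stand-in: tokenisation and entry linking
    return [{"sentence": s, "glosses": []} for s in sentences]

def persist_to_sql(key, document):   # stand-in: write to the SQL schema
    pass

@app.task
def annotate(key, text):
    document = gloss(preprocess(text))
    # Publish the glossed copy to Redis so HTTP handlers can serve it
    # immediately, and flag the content as ready.
    store.set(f"content:{key}", json.dumps(document))
    store.lpush(f"done:{key}", 1)
    # Persist to the SQL database, then flag persistence so handlers can
    # begin accepting changes to gloss state.
    persist_to_sql(key, document)
    store.set(f"persisted:{key}", 1)
    # Drop the Redis copy; later reads fall through to the SQL database.
    store.delete(f"content:{key}")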

After a complete annotated document has been returned to a user, the user may interact with it using the features described in Section 4.1. A number of user interface events are communicated back to the server. The most important are the paraphrase selection and promotion events, which are sent to the server and reflected in the wordlist page if the user navigates to it. Section 4.6.2 gives full details of other user interface events that are logged by Wakaran. Figure 4.13 gives a simple example of the sequence of calls between components of Wakaran as the user interacts with a paraphrase in a gloss. Any user interface event that changes the selected or promoted state of a paraphrase is immediately forwarded to the HTTP handler. The handler then checks the submission’s state in Redis in case the worker that processed it has not yet finished persisting it to the database (see Figure 4.12). Once the HTTP handler can be certain the submission has been persisted, it updates the paraphrase in the database. Thus the change to the paraphrase will be reflected in the wordlist page and will persist if the submission view page is reloaded. Note that, in the example, the frontend queues up miscellaneous user interface event logs and only sends them to the server when an event occurs that changes the state of one of the paraphrases. However, during general usage, queued event logs are flushed to the server on a 5 second timer if no paraphrase state mutation occurs before then.

Figure 4.13: UML sequence diagram of interactions between the user’s web browser and the Wakaran HTTP handlers as the user selects and promotes paraphrases in dictionary entries linked to a submission.

In this section we have described details of Wakaran’s architecture, with special attention to the way Wakaran delivers the results of potentially long running NLP glossing algorithms to users of its web interface. In the next section we use the collected data to develop a picture of the Wakaran web application in action.

4.6 Evaluation

In this section we present results from the data logged during usage of Wakaran, which we use to discover whether users are engaging with the Japanese language through its features. Based on the casual observations of our intended classroom use case described in Section 4.3, we expect users to have been progressing through a document in a linear fashion, with the focus of their interaction moving forwards through the document over time.

In addition to this, we received a few responses to Wakaran’s user feedback submission form indicating that students do find the gloss pop-up and inline insertion features useful. However, the observations and feedback tell us neither the prevalence of this kind of usage nor the level of interaction with Wakaran’s interactive features. In this section we apply time series analysis and clustering methods to logs of Wakaran’s usage to reveal the level and nature of interaction that has occurred. In this chapter we do not examine the content of document submissions, but instead seek to describe as fully as we can the interaction that occurred between users and Wakaran. Chapter 7 will then describe how we used the record of those interactions for research into word sense disambiguation resources and methods.

4.6.1 User feedback

Wakaran includes a user feedback form running through Google Docs.4 The goal of the feedback form was to verify qualitatively that students are using Wakaran for its intended purpose and to receive notice of any unexpected pitfalls or problems with the application. We received a total of 7 submissions to the feedback form. This section contains an outline of their responses.

4 Using https://docs.google.com/forms

Response              Count
Once                  1
2-3 times             2
4-10 times            4
More than 10 times    0

Table 4.1: Survey responses on the number of times the respondent had used Wakaran.

Statement                                                    Mean agreement
Wakaran helps me to read and understand more quickly.        4.429
Wakaran helps me to read and understand more completely.     4.143
Wakaran helps me to learn new words more quickly.            3.857
Wakaran aids long-term word retention (i.e. I am more        3.286
likely to remember new words in a month or longer).

Table 4.2: A sequence of survey questions rating the usefulness of Wakaran on a 1 to 5 Likert scale.

The survey was broken into four parts. The first part asked for the most general feedback on the user’s experience with Wakaran, with just two questions. The first question asked: How satisfied are you overall with Wakaran? The average response was 3.857 on a 1 – 5 Likert scale. The other question was: Approximately how many times have you used Wakaran? The responses are shown in Table 4.1. These results indicate that the respondents were mostly return users of Wakaran who were moderately happy with the application overall.

The next section was a group of questions asking the user to respond to certain statements about Wakaran by rating them from 1 – Strongly disagree to 5 – Strongly agree. It read as follows:

Please rate the following statements according to how strongly you agree. Base your answers on the relative utility of Wakaran as compared to looking up words in a hardcopy dictionary.

The questions and average responses are shown in Table 4.2. The results indicate that the respondents see Wakaran primarily as a tool for reading comprehension and only see vocabulary acquisition as a secondary usage.

The following section allowed the user to select from multiple options per question. The section was introduced with the text:

Could you give us more detail about how you use Wakaran? Please check the boxes that apply to you.

Table 4.3 shows the responses. The first question asked what situations the respondents had used Wakaran in. All respondents selected that they used Wakaran in class or in preparation for class (or both). These results show that the respondents only used Wakaran for their course materials. Since the users giving feedback are most likely self-selected to be amongst the most active users, this makes it likely that Wakaran’s usage was limited to the classroom setting, despite being available at a public URL.

The next multiple choice question asked which features the respondents found most useful, with the clear winners being insertion of translations and of readings inline in the text. Since insertion of translations is one of the most explicit interactions with word meaning performed by a user, its popularity as a feature is a good sign that the research purpose of Wakaran is in agreement with its usefulness as an application. In other words, the feature which has the most usefulness for research data generation is also the one that the respondents find most useful.

The remaining three questions in the multiple choice section were designed to elicit priorities for potential improvements to Wakaran, including better handling for MWEs and improvements to page load times and responsiveness. The most popular choices were stacking of inserted translations with furigana, long term access to word lists, and more aggressive filtering of the word list (to remove unselected senses completely). This feedback may be used to guide future work on Wakaran.

The final section asked for comparisons to other dictionary tools that might be used for the same purpose as Wakaran. Respondents were able to select one or more tools from the following list:

• Rikai (rikai.com)

• Rikai-chan (browser extension)

Question: What contexts have you used Wakaran in?
    In class                                                       6
    Preparation for class                                          5
    Revision                                                       0
    Browsing the web                                               0
    Other (please specify)                                         0

Question: What features of Wakaran do you find useful?
    Selection of one or more translations                          1
    Insertion of translation into text                             5
    Selection of one or more readings                              0
    Insertion of reading into text (furigana)                      5
    The word list page                                             0

Question: When wanting Wakaran to translate a word, have you ever found:
    No translations for the word?                                  3
    All translations are inappropriate?                            0
    The word is part of an untranslated longer expression?         0
    A longer expression is translated but is not relevant?         1

Question: What features would you like to see added to Wakaran?
    Reduced word list displaying only selected translations        2
    Reordering translations for a word                             1
    Ability to delete translations                                 0
    Automatic selection of relevant translations                   0
    Automatic removal of irrelevant translations                   1
    Concurrent insertion of translation and reading (furigana) into text   3
    Ability to save selected translations and word lists online    3
    Other (please specify)                                         0

Question: Have you had any of the following issues when using Wakaran?
    Difficulty opening and closing translation popups              1
    Long lists of translations are difficult to read               1
    Page becomes unresponsive                                      1
    Page takes too long to load                                    1
    Other (please specify)                                         0

Table 4.3: A sequence of survey questions regarding Wakaran, allowing selection of multiple responses each.

• WWWJDIC (Jim Breen’s website)

• Electronic handheld dictionary

• Paper dictionary

• None

Of those, only the Rikai-chan browser extension and the WWWJDIC website were selected (although one user indicated in a later comment that they were in fact using an offline application for WWWJDIC). Respondents were asked to comment on their choice and all provided some comments. The common themes were:

• The Rikai-chan browser plugin is easier to use for general web browsing because it doesn’t require going to a specific web page and copying text.

• The inline translation and furigana insertion features of Wakaran are its main strength.

• The compactness of Wakaran’s display format makes it better for longer texts than WWWJDIC.

Overall, these results indicate that Wakaran suits its initial target audience of students in classroom activities: in this situation it is less important that the tool be useful for general purpose web browsing. However, to fully realise the potential for data collection, Wakaran would need to be useful for general web browsing, so a browser plugin implementation is considered a high priority for enabling future research.

The user feedback discussed in this section gives a qualitative indication that Wakaran is well suited to its design purpose as a user application for students of Japanese as a second language, and additionally that the features most relevant to the research goals are the ones that users find most useful. However, having had only seven responses, these results do not indicate the overall level of usage of, or interaction with, Wakaran. In the following sections we give statistics and visualisations of Wakaran usage based on data from user interface instrumentation.

4.6.2 Instrumentation

To measure interaction, we instrumented the frontend to log the following user interface interactions, which are given in increasing order of the significance we attach to them:

Token mouseover (Mouseover)

The mouse pointer passes over a token in the document.

Gloss show (Show)

The mouse pointer hovers over a token long enough to show its gloss.

Paraphrase selection change (Selection)

The user selects or deselects one of the definitions or readings in a displayed gloss.

Paraphrase promotion change (Promotion)

The user chooses to promote a definition or reading to insert it inline, or removes a previously

inserted gloss.

When a definition or reading is inserted inline it may force the removal of a previously inserted gloss. For this reason, we additionally logged separate events for:

Insert ruby (Insert)

Insertion of ruby text.

Remove ruby (Remove)

Removal of ruby text.

These latter two events are not separate user interactions but rather occur in response to a gloss promotion change.

Usage data logs are tied to a single document ID in the Wakaran data model (see Section 4.5.1). In addition to this, the token mouseover and gloss show events are linked to a specific token in the document. Likewise, the paraphrase selection change and paraphrase promotion change events are linked to a specific gloss. Figure 4.13 and the discussion in Section 4.5.3 describe how these events are communicated back to the Wakaran HTTP server and persisted to the SQL database.
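For illustration, a single logged event might take a form like the following record; the field names and values are assumptions rather than the actual log schema.

# A hypothetical logged interaction event: field names are illustrative only.
example_event = {
    "document_id": "20130312093045",   # ties the log to a single submission
    "type": "promotion",               # mouseover | show | selection |
                                       # promotion | insert | remove
    "token": None,                     # token offset (mouseover/show events)
    "gloss": 17,                       # gloss id (selection/promotion events)
    "timestamp": 1363080645.21,        # server time in seconds
}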

4.6.3 Usage by day

Wakaran was introduced to users by means of demonstrations during timetabled undergraduate classes (see Section 4.3). Use of the software was at the student’s discretion, and all of the respondents to the feedback questionnaire were aware of competing applications (see Section 4.6.1).

Demonstrations of the software were given during the early weeks of semesters one and two of 2013.

Figure 4.14: Heatmap showing distribution of new document submissions throughout the year.

In this section we examine the days on which Wakaran was used and which days it received its highest usage. The dates and times of Wakaran usage begin to give us an idea of the purposes it was used for.

Figure 4.14 shows the days in the period 2013 to 2014 on which Wakaran received some usage. Overall, 2286 submissions were made to Wakaran, though only 1902 showed activity in the logs, and it is the latter set that we show in our results in order to filter out submissions that were made but never used. We observe that Wakaran received its highest usage:

• In the semesters it was actively demonstrated.

• Only during semester.

• Mainly on working days (concomitant with class times).

Note that although usage is highest in 2013 there is a substantial amount of usage in semesters in which Wakaran was not actively promoted by us in student classes. We cannot discount the possibility that teaching staff of the subject may have introduced Wakaran of their own accord.

Nevertheless, we draw two conclusions:

Figure 4.15: Heatmap showing distribution of new document submissions across the week during semester 1 2013.

• Direct promotion of the application by means of demonstration had a substantial positive impact on usage levels.

• The application was sufficiently appealing to sustain some level of usage without active promotion from the developer.

We justify the second conclusion on the basis that neither teaching staff nor students were given any incentive or requirement to promote or use the application outside of the benefits of the application itself.

The pattern of usage shows clearly that Wakaran was mainly used during weeks that classes were running at the University, but was it used during classes themselves or for self-study?

Figure 4.15 is a weekly time of day histogram showing the days and times at which Wakaran received the most usage during semester 1 2013. There is scattered usage throughout the week at times when one or two submissions were made during the semester, but the heaviest usage was on Monday evening and Tuesday morning. As was suggested by the content of the feedback survey (see Section 4.6.1), these usage times most likely correspond to preparation for and participation in classes scheduled for Tuesday morning.

Usage dates and times alone are enough to determine that Wakaran was used chiefly for the purpose of studying class materials. However, we would like to know more about what users actually did while they were using the website. In the next sections we look first at a few specific examples of how Wakaran was used, and then generalise into broader patterns of usage to get the big picture of what Wakaran was used for.

4.6.4 Usage across a session

So far we have looked at the days and times at which Wakaran was used, and they indicate that it was used mainly for coursework, but we don’t yet have any details of the extent to which students used the features of Wakaran. In this section we build a picture of what the real-world usage of Wakaran might look like by examining the details of a few selected usage sessions. We will posit that these examples demonstrate determined and considered usage of the software. In the following sections we look at ways to mine the full set of submissions to see how much of Wakaran’s usage demonstrates the same characteristics.

The logs attached to a submission track user interface events from the initial submission of the document into Wakaran to the end of the browser session. In this section we examine the logs collected for single submissions to demonstrate that Wakaran is being used for its intended purpose (by at least some users).

Since the logs include an entry for each time the mouse passes over a token in the text and each time the user interacts with a gloss, we can get a picture of whether or not a submission was used or sat idle by plotting the location of logged interaction events against time. Two such plots are shown in Figure 4.16a and Figure 4.16b. They have been shaded to show interactions with glosses most prominently, but the density of the data means that they show the overall patterns of interaction rather than the specific details.

(a) In this submission, interaction with the glosses occurs in more or less monotonic progression through the document, but the user frequently sweeps the mouse across the full length of the document. This behaviour could indicate parking the mouse in a position away from the text where it would not accidentally trigger a gloss.

(b) In this submission, the user makes a single steady sweep of the document with very little deviation of mouse position from that sweep.

Figure 4.16: Plotting document position versus time for logs of interaction with two example document submissions.

Figure 4.17: Plotting position versus time for events logged on a variety of submissions.

In both sessions, the gloss interaction appears to occur in a more or less monotonic line that progresses steadily from the start of the document to the end. Each of these sessions also exhibits a distinctive behaviour that is consistent throughout the lifetime of the session: in Figure 4.16a the mouse frequently sweeps between the focus of gloss interaction and the end of the document (or, sometimes, the start of the document); in Figure 4.16b mouse movements appear to be clustered in the near vicinity of the gloss interaction. It is likely that, in the case of Figure 4.16a, the user is positioning the mouse outside the text area to avoid accidentally triggering pop-up glosses. The mouse position events in this plot appear in roughly regularly spaced bands mirroring the shape of the gloss interaction curve. Document position is recorded as an offset into the document text, so this phenomenon is most likely due to the wrapping of the text into lines. This text can be inferred to have been wrapped into seven or eight lines with a little over 50 characters in each. It should be noted that Figure 4.16b represents a much longer submission, and so positioning the mouse past the end of the text may have been impractical. Additionally, the user spent much less time interacting with the document: of the order of 25 minutes as opposed to almost an hour and a half. In this shorter amount of time, and working with a longer text, this second user has consulted glosses consistently throughout but has not stopped to promote many glosses for insertion inline. Conversely, Figure 4.16a shows extensive use of gloss promotion features. Together these two sessions show users working their way through a document methodically, but with some distinctly different behaviours in each case. The mouse hover logs for several more submissions are shown in Figure 4.17. They show that Wakaran saw a wide variety of levels of usage and usage patterns. In the next section we discuss a method to study the prevalence of different patterns of usage. However, before we proceed, we will examine some aspects of the logs for single submissions in a little more detail.

The submissions we have seen so far showed users progressing monotonically through the text. For some documents, such as the one whose gloss show events are shown in Figure 4.18, two-pass piecewise monotonic patterns emerge. This means that the user has triggered the display of glosses in one pass of the document and then returned near to the start and repeated the process. Noting that the second pass takes more time (has a smaller slope), we might suspect that the user had a different purpose in each pass. Indeed, in the first pass many paraphrases were promoted to display inline, whereas in the second pass the glosses are shown but few meanings are selected. A likely interpretation is that the user made word meaning selections in advance before attempting to read and understand the document as a whole.

Figure 4.18: Plotting document position versus time for interaction logs. In this submission the user makes two passes over the document. The user appears to have promoted glosses inline in the first pass, but only consulted glosses in the second.

These single examples don’t show the extent to which features of Wakaran were used overall. They also don’t show whether the monotonic or two-pass behaviour observed with respect to showing glosses is widespread. In the next section we look at statistics across all logs taken and cluster submissions according to various features of the logs taken for them.

4.6.5 Clustering usage patterns

In order to discover patterns in the usage of Wakaran across collections of submissions, we extracted several kinds of feature from the logs, mapping each submission to a point in a high dimensional real vector space. We then employed clustering techniques to examine the set of all submissions for the characteristics we observed in individual usage sessions in Section 4.6.4.

Event type frequencies

One factor in usage patterns we wish to explore is differences in usage of different features of Wakaran. For example, in the previous section we saw plots of logs for two submissions in Figure 4.16a and Figure 4.16b where one user slowly promoted translations for display inline whereas the other moved relatively quickly through the document, viewing glosses but not interacting with them. For each submission we extract the count of user interface events separately for each type of event (see Section 4.6.2 for the event types). These counts represent the extent to which each user interface feature was used in a session.

Using raw counts means that the feature vectors will distinguish more active or long running submissions from shorter ones by means of the substantial difference in raw event counts. However, we are also interested in the weight assigned to different kinds of events. Generally speaking, the less frequent event types — selection of definitions and promotion inline — are the ones of greater interest to us. The small magnitude of these features limits the influence they can have on clustering: for example, any algorithm using the l2 metric will care less about features with small absolute variances. Therefore, we scaled each feature by its standard deviation as measured across all documents. Note that the sparse features (that is, those that are frequently zero) will have low variance and therefore will scale to relatively large values when they do occur. Since the rare features are the ones we are most interested in, we do not consider this side effect of standardisation undesirable.
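The extraction and scaling steps can be sketched as follows, with synthetic counts standing in for the real per-submission event counts; the use of scikit-learn's KMeans for the clustering applied later in this section is an assumption about the implementation.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for per-submission event counts, one column per
# event type: mouseover, show, selection, promotion, insert, remove.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[120, 40, 6, 2, 2, 1], size=(200, 6)).astype(float)

# Scale each feature by its standard deviation across all documents so
# that rare but informative events (selection, promotion) are not
# swamped by frequent ones under the l2 metric.
scaled = counts / counts.std(axis=0)

# k = 10 clusters, as used for the event count clustering below.
labels = KMeans(n_clusters=10, n_init=10).fit_predict(scaled)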

Figure 4.19 shows the distribution of event type features across all submissions collected. It is a heatmap with the dimensions of the feature vector representation on the x-axis: each has a dimension number and a feature name. The y-axis represents the feature value: this is the normalised count of events on a single submission. The luminance of each box represents a count of submissions within its corresponding range of y-axis values. Thus each column in the plot represents a histogram of values for a feature.

Figure 4.19: Heatmap showing the number of documents with a given count for a given feature.

Each feature type is dependent on the one to its right to occur. For instance, a paraphrase cannot be promoted without being selected first, and a gloss cannot be shown without a token mouseover to precede it. However, due to the normalisation used, the distribution of features forms a peak at the gloss show event before tailing off for the rarer events. Note that the scaling has no effect on documents with zero logs for every event type. For the documents that did see a non-trivial amount of activity, normalisation means that the event count features all fall within roughly the same order of magnitude and should contribute equally to a clustering, though perhaps with the gloss show events dominating the results somewhat.

We applied the k-means clustering method to the feature space of event counts to form k = 10 clusters. The number of clusters was chosen to be small enough for manual interpretation whilst still being large enough to have moderate discriminatory power. The largest cluster, of 861 documents, saw relatively little activity and so is of little interest. The remaining clusters each fit into one of three groups:

1. showing glosses but not interacting with paraphrases,

2. selecting many paraphrases but little to moderate inline insertion,

3. high use of inline insertion.

One cluster from each group is shown in Figure 4.20.

(a) A cluster of 211 submissions with little interaction other than showing glosses.

(b) A cluster of 33 submissions with moderate usage of interactive features.

(c) A cluster of 25 submissions with heavy use of inline gloss promotion.

Figure 4.20: Feature distribution and sample cluster members for three clusters based on event count features. Cluster members are represented by position-time plots using the same key for event type as Figure 4.18.

The first group comprises the second to fifth largest clusters and the seventh largest. Each has a similar feature plot weighted towards the token mouseover and gloss show events, with a drop-off for events arising from interactions with specific paraphrases. The feature heatmap for the fourth ranked cluster is shown in Figure 4.20a, along with position-time plots for a number of submissions sampled from the cluster. Some submissions in this cluster do have a small amount of interaction with inline promotion of gloss content, but it is small in proportion to the distribution of the gloss show feature. The other clusters in this group look very similar, but the total level of interaction with the submissions increases as the cluster size decreases. The cluster sizes are 315, 306, 211, 102 and 31, for a total of 965 submissions.

The sixth largest cluster, of 33 submissions, is shown in Figure 4.20b. This cluster forms a bridge between the gloss-only clusters and the clusters with heavy interaction with individual paraphrases. Its members show a distinctively high proportion of paraphrase selection but mainly do not show high counts of promotion of paraphrases inline in the text. Finally, the three smallest clusters, with 25, 17 and 1 submissions for a total of 43, show a high level of usage of inline insertion of gloss content. Figure 4.20c shows the middle of these three clusters. As with the first group discussed above, the level of activity increases as the cluster size decreases.

This clustering on event counts gives a general picture of the level of interaction with document submissions. It shows that there were just under 900 submissions with very little activity and just under 1000 submissions for which many glosses were viewed but relatively few paraphrases were inserted inline. The remaining 76 submissions showed a very high level of interaction with paraphrase selection and inline insertion features. These high interaction submissions will be the richest source of information about user interaction with definitions, but in Chapter 7 we will discuss a method for also incorporating paraphrase interaction information from the submissions with lower counts of interaction events.

In the previous section we saw an example of a session in which the user read through the submission in one pass in Figure 4.16b, but also a session where the user took two passes through the document in Figure 4.18. The sample event plots drawn from clusters for Figure 4.20 show that there are in fact a number of instances of both reading patterns in the history of Wakaran’s usage. However, in this clustering different patterns are mixed together in the same clusters. In the next two sections we look at timeseries analysis methods to estimate the prevalence of each pattern.

Autocorrelation

In this section we develop a clustering of document submissions based on a standard timeseries statistic called autocorrelation. Autocorrelation is a measure of how predictable a timeseries’ future values are from the values that come before it. It is defined as a cross-correlation of the timeseries values when paired with values from a version of the timeseries that has been offset in time by a fixed time lag. In order to get a complete pairing, the forward copy of the timeseries has to omit samples from the start for a time period equal to the length of the lag. Similarly, the lagged copy of the timeseries omits samples from the end for a time period equal to the lag. Autocorrelation is simplest to define in terms of regularly spaced time series – that is, series with a constant sampling time period. In this section we adopt a definition of autocorrelation for time series with arbitrary time intervals between samples. We then use a predefined set of lag times to produce a vector of autocorrelation features for clustering submissions.

For a regularly spaced timeseries $x_t$ with $N$ samples, the formula for autocorrelation with a lag of $n$ timesteps is:

\[ \sum_{t=n+1}^{N} \frac{(u_t - E(u))(v_t - E(v))}{\sigma(u)\,\sigma(v)} \]

where $u$ is the forward samples:

\[ u := (u_t = x_t \mid t = n+1, \ldots, N) \]

$v$ is the lagged samples:

\[ v := (v_t = x_{t-n} \mid t = n+1, \ldots, N) \]

and $E$ is the expected value (arithmetic mean) and $\sigma$ is the standard deviation. However, in our dataset, logs can come anywhere from a few microseconds to several minutes apart, so it is undesirable to treat the time spacing as regular. It is also impractical to aggregate logs into time buckets, since the burstiness of the data will inevitably leave holes in the regularised series while also smoothing over potentially large variations in position within a single bucket. Instead, we choose a lag time $T$ in seconds and pair each sample in our timeseries by going back $T$ seconds and finding the first sample at or before that time. That is, for $X = ((x_i, t_i) \mid 1 \leq i \leq N)$ we define:

\[ u := (x_i \mid \min_j(t_j) + T \leq t_i \leq \max_j(t_j)) \]

As a helper, we define an interpolation function $I : [\min_j(t_j), \infty) \to \mathbb{R}$ where:

\[ I(s) := x_j \quad \text{where } j = \max(i \mid t_i \leq s) \]

Then we can define the lagged timeseries as:

\[ v := (I(t_i - T) \mid \min_j(t_j) + T \leq t_i \leq \max_j(t_j)) \]
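A sketch in Python of this lagged autocorrelation for irregularly spaced series follows, using the step-function interpolation $I$ defined above; the quadratic lag grid it lists is the one described in the next paragraph. Edge cases (for example, constant series) are not handled.

import numpy as np

def lagged_autocorrelation(times, values, lag):
    """Autocorrelation of an irregularly spaced series at a lag in seconds.
    Assumes `times` is sorted in ascending order."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # u: samples at least `lag` seconds after the first sample.
    keep = times >= times[0] + lag
    u = values[keep]
    # v: for each kept sample, step back `lag` seconds and take the most
    # recent sample at or before that time (the interpolation function I).
    prev = np.searchsorted(times, times[keep] - lag, side="right") - 1
    v = values[prev]
    return float(np.mean((u - u.mean()) * (v - v.mean())) / (u.std() * v.std()))

# Quadratic lag grid for the feature space: 1, 4, 9, ..., 3600 seconds.
# Features are only computed for lags up to half the session duration.
LAGS = [k * k for k in range(1, 61)]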

We design an autocorrelation feature space by choosing a set of quadratically increasing time lags of 1 second, 4 seconds, 9 seconds and so on through to 3600 seconds (that is, one hour). The quadratic stepping allows us to get detailed information at the sub-one-minute level whilst still getting a reasonable 1-2 minute stepping at the half hour to hour level. For each submission we calculate autocorrelation at each of these lags, but only for lags up to half the total session time. Our reason for stopping at half the session length is that using longer lags would leave the centre of the timeseries out of the analysis completely.

(a) For this submission, the extremely linear and compact time series results in high correlation for all lags.

(b) A two-pass reading of the document gives a negative correlation for lag distances that put the passes out of phase with each other.

(c) The sweeping motions of the mouse in this document cause the autocorrelation to be low, even at very short lag timespans.

Figure 4.21: Comparing timeseries with the corresponding autocorrelation plots.

Figure 4.21 shows a number of submission position-time plots with their corresponding autocorrelation plots (also known as correlograms). We can see that for the relatively linear usage session shown in Figure 4.21a, the corresponding autocorrelation is high for all lag amounts. This makes sense, as we would expect any offset version of this timeseries to look very much like the original (once offset by its mean).

The two-pass reading session shown in Figure 4.21b does something quite different. For lag times that align the bottom of the first pass with the top of the second, the correlation becomes negative. The autocorrelation calculation trims the start off one side of the comparison and the end off the other. Thus, for certain lags, the ‘forward’ copy of the timeseries is trimmed to start near the end of the document, backtracks to the start then returns to the end again. Conversely, the ‘lagged’ copy starts at the beginning of the document, tracks to the end and then returns to the beginning.

Of course, once the lag is sufficient to skip the entire first pass, the comparison is now between two timeseries that start near the beginning of the document and end at the end, so the correlation turns positive again for larger lags. In fact, for any periodic timeseries, autocorrelation tends to be highly positive at offsets that are multiples of the period and negative when the lagged series is “out of phase” with the original. In this way, our two-pass document readings could be seen as a special case of this periodic phenomenon. This pattern suggests that clustering on autocorrelation features should be able to reveal the subset of documents with two passes.

Some issues do arise in using autocorrelation values as clustering features across our complete set of submissions. Figure 4.21c shows a case where the user has been sweeping the mouse away from the main focus of reading. The autocorrelation drops off much faster than in either of the other two examples, which is the effect of having a lot of rapid down and up sweeps of the mouse. This suggests that submissions with busy mouse movement between gloss consultations could be detected via autocorrelation of token mouseover logs with lags under about thirty seconds. However, in this section our purpose is to measure longer range trends in gloss interaction. Therefore, for the purposes of clustering, we removed the token mouseover events from the timeseries and additionally used only lag values of 36 seconds or more. To ensure that the gloss interaction events occurred in sufficient number to represent a reading curve, we also limited this clustering to submissions with at least 10 events remaining after filtering.

To make autocorrelation features work well for clustering the overall trend in reading focus, additional restrictions were necessary. Note that the session shown in Figure 4.21b lasted for a longer time than the one shown in Figure 4.21a and consequently has autocorrelation features for longer time lags. Our chosen clustering algorithm, k-means, does not handle sparse features, so the maximum lag usable for an autocorrelation feature is effectively capped by the shortest duration submission in the collection. On the other hand, we are trying to detect longer term trends in reading style, so we do not want to eliminate longer lags too aggressively. In particular, the strongest indicator of the two-pass reading session shown in Figure 4.21b is the negative correlation values, which all occur at lags larger than the maximum lag of the shorter session. As a compromise, we limited autocorrelation clustering to the set of submissions with event series duration of at least 1252 seconds, and consequently capped the maximum lag at 625 seconds. Relative to a one hour classroom reading activity these seem to be reasonable timespans to study.

interaction logs we were left with 411 submissions to cluster. We extracted autocorrelation features

for a series of lags on a quadratic scale, starting at 36 seconds and ending at 625 seconds. The

quadratic scale was chosen to capture more detail in movements over a short timespan while still

getting more detail at the longer timespans than an exponential scale would have yielded. Each fea-

ture was normalised by its standard deviation as we did for the event count features in the previous

section. The final feature distribution appears in Figure 4.22. Since normalisation scaled features

without translating them the distribution of positive and negative autocorrelations can be clearly

seen. It is evident that a large number of submissions maintain a high level of autocorrelation even

for very long lag distances but there is also a significant population of submissions which drift into

negative correlations for larger lags.

We again used k-means clustering, this time with only k = 5 clusters, since we are mainly looking for two specific phenomena. In particular, we are seeking to divide single pass from multi pass readings of the submission. A small buffer above two clusters was allowed in case of unexpected behaviour. To understand the results we once again inspected feature distributions and sampled timeseries plots from each cluster: three clusters are shown in Figure 4.23. The two largest clusters exhibited high autocorrelation for all time lags and both consist of submissions showing monotonic single pass reads of the text. The larger of the two is shown in Figure 4.23a; the other had a slightly lower level of correlation in the tail and its timeseries showed a few gaps or oddities. These clusters contained 139 and 80 members, for a total of 219 monotonic submissions. The third largest cluster, of 67 submissions, had a feature distribution with a similar downward trend and a broader spread that reached down to zero at the largest lags. Its members showed substantial monotonic segments but had multiple discontinuities or segments with unfocused activity on glosses. The fourth largest cluster had 66 members and is illustrated in Figure 4.23b. This cluster comprises submissions with low and negative correlations at relatively short time lags, and the event plots all show quite mobile interaction across the length of the text. Finally, the smallest cluster had 59 members. Illustrated in Figure 4.23c, it is a good candidate for a cluster of two pass readings of the text. Around half of the samples show two or more clear linear passes of the text in the gloss show features and some in the gloss promotion features. Some of the rest show an interesting V or inverted V shape, including one pass of reading the text proceeding backwards from token to token.

Figure 4.22: Heatmap showing the number of documents with a given value for a given autocorrelation feature.

(a) A cluster of 139 submissions with high correlation across all timeseries lags and monotonic time series.

(b) A cluster of 66 submissions with low or negative correlation at many timeseries lags and non-linear timeseries plots for gloss interactions.

(c) A cluster of 59 submissions with high correlation for short timeseries lags trailing off to zero or negative correlation for longer offsets. Many of these timeseries show two pass reading behaviour.

Figure 4.23: Feature distribution and sample cluster members for three clusters based on autocorrelation features. Cluster members are represented by position-time plots using the same key for event type as Figure 4.18.

However, the samples show that this cluster includes a few submissions that do not fit the desired category of periodic sweeps of the text. The autocorrelation feature values in this cluster are distinguished by being high for smaller time lags but dipping to or below zero for the longer time lags. Although a periodic sweep of the text does produce this pattern of features, there are other causes of low correlation at longer time lags.

Clustering on autocorrelation features gives some insight into the distribution of user behaviour in our collection of document submissions, but has its limitations. This method proved effective at identifying monotonic single pass readings of the text, and so could be used as a cleaning step were that the desired result. Highly uncorrelated timeseries were also successfully identified, which has a similar benefit. One limitation was that the method could only be performed on a subset of the collection due to constraints on both session duration and gloss interaction count. Although the parameters chosen for those constraints were not overly severe, it would still be preferable to gain insight into a greater subset of the full collection of submissions. The other limitation of this method is that it did not cleanly separate periodic timeseries from the moderately uncorrelated ones. On the other hand, the cluster with higher correlation on shorter time lags and negative correlation at higher time lags did serve to highlight some unexpected but interesting behaviours involving piecewise monotonic reading sessions. Overall, this clustering gives interesting insight into the distribution of user behaviours in the subset of submissions that had a high level of interaction over a longer period of time.

4.7 Conclusion

In this chapter we established the design of Wakaran as a novel text glossing software application capable of serving a naturally arising use case whilst also collecting information about human interpretation of word senses. Wakaran’s novelty stems from linking a user information need to an NLP research question via the design of interactive paraphrase visibility controls. We established the suitability of Wakaran for its second language learning use case in three ways: comparison to existing software, user feedback surveys and analysis of usage logs. Finally, we studied detailed logs collected from usage of Wakaran to verify that users engaged with its interactive features for controlling paraphrase visibility in order to help understand the meaning of the text.

The design of Wakaran specified in this chapter solves several challenges related to its application and research goals:

• Making potentially long running NLP services available in a user friendly way over an HTTP interface.

• Improving user experience by hiding information, whilst designing visibility controls to correspond to desired research data.

• Collating logs of user interface events from remote clients to the research database.

In Chapter 7 we will utilise Wakaran’s architectural design qualities more fully by introducing a new longer running NLP component. The major elements of this design will be reusable in future work in NLP resource construction. The improvement to user experience from hiding information is partly validated by user feedback regarding screen space for the submitted text; most importantly, the visibility controls enable the infer+consult method for approaching unfamiliar words, determined by Fraser (1999) to be the most effective strategy for vocabulary acquisition.

Comparison to similar applications indicates that there is user demand for the functionality offered by Wakaran. There exist a number of popular web applications with the same intended usage as Wakaran, with features that overlap to varying degrees. Wakaran is notable mainly for its interactivity, allowing the user to control the extent to which translation information is exposed or hidden away, at a paraphrase granularity. The evaluation in this chapter shows that Wakaran garnered a user base who made substantial use of those features.

User feedback confirmed for a small number of users that Wakaran was found useful in its intended use case of second language class work and preparation. Feedback indicated that in contexts outside class work, other similar applications with browser plugins and the ability to annotate text in arbitrary web pages were more useful than Wakaran. This information will have a bearing on future work expanding the range of usage and data collection with Wakaran.

Finally, we measured the extent and nature of interaction with Wakaran’s features in the full user population in a variety of ways. Firstly, we established from counts and timestamps of text submissions made to Wakaran that Wakaran was used primarily on semester days, during specific contact hour timeslots and on the nights before those timeslots. We concluded that Wakaran was primarily used for its intended use case of language learning reading activities. Next we looked at the distribution of feature usage intensity across document submissions. Although many submissions were barely used or not used at all, around half saw active use of Wakaran’s interactive features. Finally, we looked at submissions with active feature use and moderate session duration to identify categories of reading behaviour. We found that around half of the chosen submissions showed users reading monotonically through the document, but a non-trivial number showed users reading the document in multiple passes, sometimes with both forwards and backwards passes. We conclude that users were engaging with Wakaran’s interactive features for paraphrase visibility and were actively using them to help understand the meanings of words in the text. On this basis we are satisfied that Wakaran’s usage logs are a valid source of information about user interaction with English paraphrases of Japanese words.

With the design of Wakaran for its second language learning use case established, we can turn to its purpose as a means of collecting research data on natural language. User interaction with visibility controls for paraphrases in Wakaran provides a coarse record of the level of interest in specific translational paraphrases of Japanese words. In Chapter 7, we will investigate the data collected about word senses and discuss the implementation of WSD for Wakaran glosses. However, we first investigate some theoretical topics with applications in gloss selection. Chapter 5 looks at identification of multiword expressions and how to distinguish them from ordinary usages of their component words. Chapter 6 will then go into detail of how entries from a lexical resource can be identified in text of a non-segmenting language like Japanese. It concludes with experiments in WSD over a particular lexical resource, JWordNet, which leads into the work of Chapter 7.

Part III

Disambiguation

Chapter 5

Multiword Expressions

A multiword expression (MWE) is a specific combination of lexicalised words which, when used or seen together, act in some marked way as a separate lexicalised unit. In this thesis we deal exclusively with MWEs which have already been encoded in a dictionary, and thereby take it as assumed that they are lexicalised. This chapter covers the identification of usages of MWEs in text. MWEs can be a challenge for NLP (Sag et al. 2002) and second language learners (Irujo 1986; Laufer 2000) alike. As a first step to dealing with MWEs in an application like the Wakaran Glosser introduced in Chapter 4, we must be able to identify their usages in text.

Generally speaking, a canonical arrangement of particular words and syntactic structure can be identified for each MWE (Morimoto et al. 2016). However, matches to the canonical structure of a MWE are not always unambiguously a usage of the idiomatic expression: in some contexts the particular combination of the idiom’s words takes its standard literal compositional meaning (Hashimoto and Kawahara 2009). For such an ambiguous expression, we call a usage of its constituents in their canonical structure a multiword expression token candidate (token candidate). Many studies have focused on resolution of token candidate ambiguity by identifying fixedness of constituent inflection and syntactic dependencies in the surrounding text (Fazly et al. 2009; Nasr et al. 2015). Success of such a model would mean that idiom token identification could be solved cheaply and easily in glossing software with a lemmatiser and dependency parser, which would be a great asset to Wakaran because it would allow either the gloss of a MWE or the glosses of its constituents to be cheaply discarded as appropriate. Others have sought instead to use semantic models to distinguish idiomatic and literal usages (Schneider and Smith 2015; Gharbieh et al. 2016), which in general will introduce more latency to an end user application than a dependency parsing approach. In this chapter, we explore both approaches using a large annotated corpus of Japanese idioms, OpenMWE, and an idiom dictionary that encodes idiom constraints on syntactic relationships with the surrounding text, the Japanese dictionary of multiword expressions (JDMWE).

Even under the category of multiword expression token candidate disambiguation (candidate disambiguation) we see a range of different properties and nuances to research projects. This chapter begins by presenting some of the motivations for candidate disambiguation strategies and properties seen in candidate disambiguation research. We then describe three computational tasks which all fit the description of candidate disambiguation but which address different motivations and facilitate different properties of the task solution. We introduce our terminology for these subtasks and briefly describe existing research from Section 2.3.3 in its terms.

The remainder of the chapter describes our supervised machine learning solutions to the tasks defined in Section 5.1. We make use of the OpenMWE corpus of Japanese idioms as it is the

largest freely available corpus of ambiguous token candidates. As such all of our work is with the

Japanese language, but we make notes about generalisation to other languages as we go along. We

begin in Section 5.2 by detailing feature extraction for instances in the OpenMWE corpus. To help

manage the huge variety of features explored, we organise them into a taxonomy. Additionally we

motivate feature classes in terms of the tasks and desirable properties explored in Section 5.1.

Having introduced the features being used, we move on to tackling each of the tasks in

Section 5.1 and testing our hypotheses about existing and proposed features. We divide the experiments into two main parts based on the datasets used, for pragmatic reasons which will become

clear later. Our investigation of features is more complete and robust than previous work. In part

this is because the size of the OpenMWE dataset allows us to work with a meaningful sample of many low frequency features. The work categorising features into a taxonomy in Section 5.2 also helps us to identify holes in previous research for further investigation. Our development of new features takes inspiration from work in languages other than Japanese and from resources peculiarly available for the Japanese language. We focus in particular on which features are specific to identification of tokens of a single MWE type and which features could be used to disambiguate independent of type.

In our conclusions we confirm our hypotheses for several feature classes. However, some of the feature classes disappoint our expectations and others produce unintuitive results which raise questions that warrant further investigation.

5.1 Flavours of token candidate disambiguation

In this section we develop a subcategorisation for several different flavours of token candidate disambiguation. This subcategorisation is somewhat reminiscent of the division of word sense disambiguation research into all-words and lexical sample disambiguation, in that it describes how the relative valuation of goals such as coverage and precision entails different modelling and research strategies. We define a family of related computational tasks and organise them into a small taxonomy under the token candidate disambiguation root. In subsections, we go into more detail for each disambiguation task, exploring the advantages and disadvantages of pursuing a solution to it.

Computational MWE research has developed a healthy array of task types. Until recently it could mainly be categorised into one of three sub-fields: "MWE extraction", "[type] disambiguation" or "token disambiguation". However, to avoid certain technical inaccuracies we find it preferable to follow the recent trend and discard "token disambiguation" as a term in favour of multiword expression token identification and candidate disambiguation. As we saw in Section 2.3, multiword expression token identification consists of identification of token candidates, in some cases followed by a necessary multiword expression token candidate disambiguation step. Candidate disambiguation has seen very little attention until recent years, but is mature enough now for consistent themes to emerge. We identify and formalise three clear subtasks of candidate disambiguation (type-specialised, type-generalised and cross-type) and explore the characteristics of each with reference to the others.

Figure 5.1: Illustration of cross-validation partitioning strategies used to evaluate different classification schemes. (a) Type-specialised training isolates each MWE type; (b) type-generalised cross-validation stratifies partitions across types; (c) cross-type evaluation partitions on type boundaries. (Legend: training set vs. test set, per MWE type.)

We begin with type-specialised classification, in which MWEs are modelled one type at a time, independently of each other. If the investigation uses supervised classification, the corpus token candidates of a single MWE type are collected to train a machine learning classifier, which is tested on held-out candidates for that same type. A cross-validation partitioning for testing type-specialised classification is illustrated in Figure 5.1a.

In contrast to this we define two kinds of multi-type aware disambiguation, which we call type-generalised classification and cross-type classification. Cross-validation partitionings for these are illustrated in Figure 5.1b and Figure 5.1c respectively. In supervised type-generalised classification a single classifier is trained to classify any token of all MWE types for which training data exists. Unlike type-specialised classification, in which a separate model is trained for each

MWE type, one model is trained and used for tokens of any type in the training corpus. This allows for any commonalities between literal or figurative language to be shared between types. However, since supervised resources are costly to acquire, a greater aim is to be able to classify tokens of

MWE types for which no training data has been collected or labelled. With this aim in mind we motivate cross-type classification in which classifiers are trained on tokens of one set of MWE types, and tested on token candidates for types not found in the training set.
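To make the three partitioning schemes of Figure 5.1 concrete, the following Python sketch shows how each strategy could split a corpus for a single cross-validation fold. The corpus representation (a list of instance dictionaries with a "type" key) and all function names are illustrative rather than our actual experimental code, and randomisation of fold assignment is omitted for brevity:

    from collections import defaultdict

    def split_type_specialised(corpus, fold, k=10):
        """One train/test split per MWE type (Figure 5.1a): each type's
        token candidates are held out and classified independently."""
        splits = {}
        by_type = defaultdict(list)
        for inst in corpus:
            by_type[inst["type"]].append(inst)
        for mwe_type, insts in by_type.items():
            test = insts[fold::k]  # every k-th instance, starting at `fold`
            train = [i for n, i in enumerate(insts) if n % k != fold]
            splits[mwe_type] = (train, test)
        return splits

    def split_type_generalised(corpus, fold, k=10):
        """Stratified across types (Figure 5.1b): every type contributes
        instances to both the training and the test partition."""
        train, test = [], []
        for tr, te in split_type_specialised(corpus, fold, k).values():
            train += tr
            test += te
        return train, test

    def split_cross_type(corpus, fold, k=10):
        """Partitioned on type boundaries (Figure 5.1c): the MWE types in
        the test partition never appear in the training partition."""
        types = sorted({inst["type"] for inst in corpus})
        test_types = set(types[fold::k])
        train = [i for i in corpus if i["type"] not in test_types]
        test = [i for i in corpus if i["type"] in test_types]
        return train, test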

5.1.1 Qualities

Our discussion of candidate disambiguation subtasks will proceed with reference to the following qualities of solutions:

Performance refers to the accuracy of the resulting classifier, nominally on unseen data but measured relative to gold-standard annotated data as a proxy for performance "in the field".

Coverage refers to the proportion of token candidates a classifier is able to accept as input. Pragmatically this is assessed in terms of the number of MWE types it is designed to accept.

Annotation cost refers to the amount of gold-standard hand-annotated data a solution requires, representing a manual solution to instances of the task. If a task solution has a high annotation cost we call it annotation intensive; when the cost fails to be met we say we have a data sparsity problem.

Representation cost refers to the size of the model that results from statistical machine learning. In practice this means size on disk or in memory for the trained classifier, which may impact how well it integrates with other applications.

Training cost is the resources, including time, disk and memory, required to produce a classifier from the available data.

5.1.2 Type specialised disambiguation

The first subcategory of candidate disambiguation we will look at is type-specialised disambiguation. The defining characteristic of this task is that the classifier is trained on a substantial amount of gold standard data for a specific MWE type and is only expected to be able to classify tokens of that type. Type-specialised classification is epitomised by the work of Hashimoto and

Kawahara (2009). Let us examine the properties of the task.

A typical solution to this task will achieve high levels of accuracy on a targeted MWE, but will have a high annotation cost. Given that MWEs with literal homomorphs number in the thousands, achieving high coverage over the range of MWE types is problematic. Not only is the cost of manual annotation prohibitive, but additionally the solution must inevitably encode specific knowledge of every MWE type, and therefore its representation may be large as well.

One advantage of this approach is in training cost. The size of a partition for a single

MWE type is quite small compared to the size of a multi-type dataset. Since the time cost of training for our chosen machine learning algorithm is usually worse than linear (Chang and Lin 2011), training many small models is usually far more tractable than solving the constrained optimisation problem on a full corpus.

This task most closely resembles word sense disambiguation in the analogy explored by

Section 2.3.3. In particular, type-specialised disambiguation corresponds to lexical sample classification, where systems are trained with specialist knowledge of a (usually small) selection of ambiguous word types, and model an independent set of semantic classes for each. The challenges of data acquisition that apply to supervised WSD apply to type-specialised disambiguation. As such, coverage is the greatest limitation of type-specialised classification.

5.1.3 Type generalised disambiguation

Type-specialised disambiguation has two major drawbacks:

1. it requires sufficient training data for each MWE type to allow for statistical machine learning,

which makes it costly to resource.

2. it does not generalise behaviours common to different MWE types.

In situations of data sparsity it may even prove impossible to train type-specialised classifiers. In constructing the OpenMWE corpus, Hashimoto and Kawahara (2009) aimed to perform manual annotation on 1000 samples for each MWE type, but did not manage to find the full 1000 token candidates for many types in their web crawl resource. To solve this problem, Diab and Bhutada

(2009) simply bundle all their training data, regardless of MWE type, into learning one classification model, which is intended to identify tokens of any of the training MWE types. This serves as a prototypical example of the task we have termed type-generalised classification. The distinguishing characteristic of type-generalised classification is that it shares information between different

MWE types. A typical solution to this task will train a single machine learning model on all of the input data and distinguish the types of individual instances only by means encoded in training features. This is more amenable to efficient representation of trained models and it alleviates the problems of a lack of training data by allowing knowledge gained from common MWE types to transfer to sparse ones.
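As an illustration of the bundling strategy, the sketch below folds every labelled instance into one shared training set, with type identity visible to the learner only as an ordinary feature. The instance representation and names are hypothetical:

    def generalised_instance(inst):
        """Fold one labelled token candidate into the single shared training
        set; the MWE type survives only as an ordinary feature, so knowledge
        can transfer between common and sparse types."""
        features = dict(inst["features"])           # context/idiom features
        features["mwe_type=" + inst["type"]] = 1.0  # type identity as a feature
        return features, inst["label"]

    # One model over every type, instead of one model per type:
    # X, y = zip(*(generalised_instance(i) for i in corpus))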

A significant drawback of this approach is the cost of training. A small amount of data for each type still adds up when trying to achieve coverage over a large number of types. Since training time usually increases superlinearly with the training set size this can soon become problematic.

This can result in a loss of accuracy, depending on how the terminating conditions of the machine learning algorithm trade time for precision. Therefore, on specific types we expect type-generalised classification to perform worse than type-specialised classification, which has the advantage of being able to make a good fit for each MWE type separately. Additionally, type-generalised classification significantly reduces the potential to use WSD methods effectively, because the semantics of the literal and idiomatic senses of each MWE are presumed to be disjoint from the other MWE types, making the semantic disambiguation aspect a high-cardinality multiclass labelling problem. Therefore, we expect type-generalised classifiers to rely more on the idiom characteristic features for their performance.

Kawahara and Palmer (2014) investigate supersense WSD for verbs using a type-generalised classification strategy. Supersense tagging parallels multiword expression token candidate disambiguation even more closely than fine-grained WSD since it uses the same set of classes for all word types.1 Their results showed that overall performance was competitive with word-specific models and that performance on word types with as few as 5 annotated token samples in the corpus was very strong.

Neither the problem of over-generalisation nor the problem of training cost is insurmountable. More advanced methodologies might start by sub-partitioning the training data into subsets of a reasonable size for training. If partitioning is done using an appropriate clustering algorithm this will even allow for specialisation to families of MWE types with similar distributional properties.

However, such solutions inevitably eat away at the advantages of type-generalised classification, particularly that of cost of representation, and introduce much complexity to the machine learning problem.

5.1.4 Cross-type disambiguation

To increase coverage of type-generalised classification we might, perhaps, spread our annotation cost thinly by annotating just a few example tokens for every MWE type found to suffer ambiguity. However, even this endeavour has a large annotation cost and is limited to the classification of MWE types that have been seen at the time of data collection. If we are able to learn generalisable features of MWE types, why not go all the way and learn to classify tokens of MWE types for which we have no annotated data at all? This is, ultimately, the only pragmatic course of action for using supervised methods in an application that demands high coverage (such as the glossing software described in Chapter 4). We call the task in which training data is supplied for some

MWE types, but the classifier is expected to work for previously unseen2 MWE types, cross-type classification.

1 albeit typically a different set of tags for each part of speech

From our starting point in type-specialised classification prioritising performance, we have now swung all the way around to prioritising coverage. In the process we have almost entirely given up on treating this as a WSD task, since we have no training data for the "senses" of unseen MWE types. We expect a substantial drop in performance on unseen types in this task when compared to a seen-types evaluation. This mirrors the typical drop in performance one tends to see for all-words WSD compared to supervised lexical sample models.

Interestingly, the other properties from Section 5.1.1 are the same as for type-generalised classification. In fact there will be a large overlap in the solution space with the main difference being that greater expectations are placed on the classifier in terms of being able to produce a reasonable output on a greater range of inputs. To put it another way, overfitting to the particular

MWE types and their semantics will be penalised more in this task than in the type-generalised task. Thus we should expect the idiom characteristic features to be the most useful for this kind of classification and the WSD inspired features to be of limited use.

A good example of a cross-type methodology is Li and Sporleder (2010). However, they pursued it mainly as a means for overcoming annotation cost — fundamentally the same as the motivation we give for type-generalised classification. Indeed, it turned out that their results were inconclusive for many of the categories of features they experimented with, due to low frequencies of those features within their corpus. Cross-type classification is the most challenging of the tasks we have discussed, but also potentially the most useful, if solved.

Distinction between task and solution

Fundamentally type-specialised and type-generalised models represent different strategies for performing the same learning task: given labelled training instances for a set of MWE types build

a model for classification of unseen instances of those types. However, the consequences of the choice of strategy have a substantial impact on many of the properties described in Section 5.1.1, which we consider to justify a separate classification.

2 by "unseen" we mean types with no labelled examples; we still assume a human judgement has classified the type itself as ambiguous

Cross-type classification on the other hand has distinctly different constraints on its inputs and outputs. One starts with a supervised corpus of one or more MWE types and uses it to inform an algorithm which classifies instances of unseen types.

5.1.5 Additional resources

We have defined type-specialised, type-generalised and cross-type classification in terms of how the task relates to a corpus of annotated examples — that is, examples for which the task has been solved by hand. This is the most critical resource to the task, however other resources may be used as well. This might mean knowledge bases — compiled expert knowledge on words or

MWE types — or it might mean unannotated corpus resources not explicitly dealing with MWEs.

The use of such resources is inevitable, even if by indirect means via use of linguistic tools such as dependency parsers or morphological analysers which have been trained on corpora annotated for their own task. It should be understood that the task definitions above are not meant to exclude the use of such resources.

5.1.6 Related work

In this section we briefly discuss the MWE token identification literature reviewed in

Section 2.3.3 in terms of the terminology described in the preceding sections.

Hashimoto and Kawahara (2009) performed a type-specialised classification study, taking advantage of the size of their OpenMWE corpus to train per-type classifiers on a substantial number of examples. In this chapter, we extend that work, using the OpenMWE corpus to compare type-specialised, type-generalised and cross-type classification results. Additionally we extend the features used by Hashimoto and Kawahara (2009), in particular introducing features capturing information about the MWE types for type-generalised and cross-type classifiers to take advantage of. The new features include type-level information found encoded in the Japanese dictionary of multiword expressions.

Fazly et al. (2009) use an unsupervised method to discover expression-specific canonical forms and test a heuristic whereby token candidates are classified as literal if they vary from the canonical form. They also build type-specialised models of context for idiomatic and literal usages using a supervised corpus and a corpus built using the canonical form heuristic.

Diab and Bhutada (2009) trained a single sequence-labelling classification model on a collection of MWE types and performed a type-generalised evaluation. Although Diab and Bhutada (2009) did not perform cross-type evaluation, their sequence-labelling model included a label for non-constituent words, and it is interesting to note that the approach of labelling every word in the sentence opens up the possibility of labelling unseen types. In fact, they did observe some cases of their models identifying usages of expressions not seen in the training data.

Nasr et al. (2015), since they attempt to perform token candidate disambiguation within the framework of dependency parsing, are effectively performing a type-generalised classification as well. It is noteworthy that, like us, they integrate features from a syntactic lexicon to assist in disambiguation. The lexicon records whether certain verbs take complements introduced by the specific conjunctions that appear in the MWEs being studied. Our work is quite similar since the

JDMWE specifies constraints on syntactic arguments allowed by an idiomatic usage. Another case of type-generalised disambiguation being performed as part of a more general task is the study of

Schneider and Smith (2015), wherein token candidate disambiguation is performed concurrently with supersense labelling of all words in the text.

Li and Sporleder (2010) were motivated to forgo type-specialised models by a lack of sufficient labelled examples per type, but interestingly this led them to attempt cross-type classification. However, the small size of their data set meant that most of the linguistically motivated features they were investigating were based on phenomena that did not occur sufficiently often in the corpus to have a statistically significant effect on results.

Finally, there is Gharbieh et al. (2016), who compare type-specialised and cross-type classification using vector embeddings composed for the MWE and the context of its usage. They found that the cross-type classifier performed better than a majority-class classifier but not as well as the unsupervised method of Fazly et al. (2009).

5.2 Feature extraction

To lay the groundwork for the experimental work in this chapter, we organise existing features for classification from the literature (particularly Hashimoto and Kawahara (2009)) into a taxonomy, which reveals some omitted features that can be justified on the same grounds as the originals. We define the additional features and refer to them as our extensions to the feature set.

We then introduce a new top-level category of features which are targeted at solving the cross-type disambiguation task and for use in type-generalised classification.

Our features are drawn from morphological and syntactic analyses of each token, from a dependency parse of the idiom context, and from a specialised dictionary capturing the morpho-syntactic idiosyncrasies of Japanese MWE types.

5.2.1 Preprocessing

The features used for classification in this chapter are generally expressed in terms of the Japanese morphological and syntactic preprocessing steps applied. This section describes those steps, highlighting data to be extracted from the output of the preprocessing tools. It builds on the introduction to Japanese NLP given in Section 3.2. We include a brief discussion of how best to replicate the preprocessing steps for other languages.

Our feature extraction makes extensive use of the Japanese dependency parser KNP (Kurohashi and Nagao 1994),3 however the features should be reproducible in other languages with a suitable dependency parser, morphological analyser, and electronic thesaurus or ontology. Each instance in

the corpus was preprocessed by running it through KNP to extract specific linguistic information.4

3 http://nlp.ist.i.kyoto-u.ac.jp/EN/?KNP

Figure 5.2: English summary of selected parts of the output of KNP when run on the sentence in Example (23). The parse yields four chunks: 桂子さんは (桂子 keiko, noun; さん san, suffix (title); は wa, particle (subject marker); flagged TOPIC), サッカーの (サッカー sakkā, noun, category=“abstract thing”, domain=“sports”; の no, particle; flagged ADNOMINAL), 腕を (腕 ude, noun, category=“body part”; を o, particle (object marker)) and 上げた。 (上げ age, verb, baseform=上げる; た ta, suffix (past tense)). Pronunciation and translation guides have been added for the reader and are not part of the output.

To help elucidate both the details of our feature extraction and how it might be replicated for another language, we will look at the information extracted for the sentence in Example (23):

(23) 桂子さんは、 サッカーの 腕を 上げた。 keiko-san-wa, sakkā-no ude-o ageta. Keiko-TOPIC soccer-GEN arm-OBJ raised. # “Keiko raised her soccer arm.” (literal)

“Keiko improved her skills at soccer.” (idiomatic)

Japanese is a non-segmenting language in that it has no clearly marked word boundaries.

In Figure 5.2, Example (23) has been segmented by KNP into tokens5 which are then grouped into chunks (or, more properly, bunsetsu). Each chunk has a parent link to a higher chunk describing the

dependency parse of the sentence. In the conventional word order for Japanese, dependency links always point forwards, so the head of a phrase is also its rightmost (final) chunk.

4 KNP's memory usage is high and on some sentences exceeded the available memory in our machine (32GB). Those instances were excluded from our analysis.

5 In fact, KNP delegates the initial segmentation and token feature annotation to the morphological analyser Juman.

From Figure 5.2, we see that keiko “Keiko” and ude “arm” are the dependants of ageta

“raised”, with sakkā “soccer” modifying ude “arm”. This gives rise to the incorrect literal interpretation #Keiko raised her soccer arm. In fact, ude-o ageru “to improve one's skills” is an idiomatic

homomorph of ude-o ageru “to raise one’s arm”.

In this chapter when we refer to words we are in fact referring to the chunks returned by

KNP. This is the most appropriate level of segmentation for our purposes because it is the level

at which the dependency relations exist and because the morphological tokenisation level includes

affixes and particles which would need filtering. KNP labels each chunk with a normalised form, which we use as the lemma of the word for our feature extraction. We also extract the part of speech

(POS), category and domain of tokens corresponding most closely to the lemma, as additional

features of each word (see the lexical features below).
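For readers wishing to replicate this preprocessing, the pyknp Python bindings to KNP and Juman expose the chunk and morpheme information described above. The sketch below follows the usage pattern in the pyknp documentation, but it requires local installations of KNP and Juman, and attribute names may vary between versions:

    from pyknp import KNP  # Python bindings for KNP/Juman

    knp = KNP()
    result = knp.parse("桂子さんは、サッカーの腕を上げた。")

    for bnst in result.bnst_list():      # bnst = chunk (bunsetsu)
        mrphs = bnst.mrph_list()         # morphological tokens in the chunk
        head = mrphs[0]                  # crude choice of content token (assumption)
        print(
            bnst.bnst_id,                # chunk index within the sentence
            bnst.parent_id,              # dependency head (a later chunk, or -1 for the root)
            head.genkei,                 # lemma (normalised form)
            head.hinsi,                  # part of speech
            head.imis,                   # semantic info; category/domain appear here
        )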

Inflections

Returning to Figure 5.2, take note of the TOPIC flag on the first chunk and the ADNOMINAL flag on the second. KNP produces many chunk annotations; we made use of five main kinds:

ADNOMINAL appearing on adnominal modifiers;

TOPIC appearing on sentential topics and emphasised chunks;

VOICE of inflected verbs;

NEGATED denoting negation; and

VOLITIONAL denoting a volitional modality.

For all but the volitional modality group, KNP outputs only one or two annotation variants (e.g. there are two voices in Japanese: passive and causative). For the volitional modality, Hashimoto and Kawahara (2009) used five classes: request, invitation, order, volition and prohibition. We made use of the same subset of modalities output by KNP.

The KNP chunk annotations we used chiefly capture inflections on words in the text, something which Diab and Bhutada (2009) approximate with a character n-gram feature. For implementation in other languages, the n-gram heuristic or any available morphological analyser might be used.

Lexical features

Our preprocessing step allows us to collect a number of data points on individual words.

We collect each of the following features of a word’s type under the banner of “lexical features”:

1. Lemma

2. Part of Speech

3. Hypernym (“category”)

4. Domain

The category and domain information performs a similar function to the named entity information used by Diab and Bhutada (2009): it collapses classes of words into a single feature while retaining relevant semantic information. An information source such as an ontology or thesaurus could be substituted for use in other languages. Hashimoto and Kawahara (2009) translate the kategori “category” field of KNP output as hypernym, and we will adopt the same terminology hereafter.
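A minimal sketch of the four-field lexical feature record follows; the parsing of category and domain out of a Juman-style semantic-information string is an assumption made for illustration:

    import re
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LexicalFeatures:
        lemma: str
        pos: str
        hypernym: Optional[str]  # KNP's "category" field
        domain: Optional[str]

    def lexical_features(lemma: str, pos: str, imis: str) -> LexicalFeatures:
        """Extract the four lexical features of a word. Category and domain
        are assumed to sit in a Juman-style semantic-info string, e.g.
        'カテゴリ:抽象物 ドメイン:スポーツ' (this format is an assumption)."""
        category = re.search(r"カテゴリ:([^ ]+)", imis)
        domain = re.search(r"ドメイン:([^ ]+)", imis)
        return LexicalFeatures(
            lemma=lemma,
            pos=pos,
            hypernym=category.group(1) if category else None,
            domain=domain.group(1) if domain else None,
        )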

5.2.2 Idiom characteristic features

A widely depended-upon characteristic of MWEs is lexico-syntactic fixedness. This characteristic features in a range of MWE research, from MWE extraction to candidate disambiguation (Bannard 2007; Fazly et al. 2009; Li and Sporleder 2010; Lossio-Ventura et al. 2014; Morimoto et al. 2016) (see Section 2.3). This motivates the use of a number of features of the token candidate itself, such as inflections on the constituent word tokens and the presence of non-constituent words interrupting the expression. In the context of our work with idiomatic expressions, token features which have been motivated by lexico-syntactic inflexibility are generally referred to as idiom features, but may interchangeably be referred to as MWE features.

The idiom features of Hashimoto and Kawahara (2009) included a single binary feature for each of the KNP chunk annotation groups listed at the end of Section 5.2.1. For each of the groups, the feature fires if one of the annotations appears on a relevant constituent of the token candidate. Details of each are given in the outline of our extensions below.

The final idiom feature of Hashimoto and Kawahara (2009) was the adjacency feature.

This feature fires if the constituents of the token candidate are contiguous, i.e., there are no intervening chunks. They found that this feature had a greater impact on classification performance than the other idiom features combined.

Extended idiom features

Each of the Boolean idiom features of Hashimoto and Kawahara (2009) captures some variation of the form of the MWE. We introduce MWE token features capturing details on the kind of variation:

• The ADNOMINAL modification feature fires if a non-constituent adnominal modifies a noun

in the MWE.

Where such a modifier exists, our extensions include features capturing the four kinds of

lexical information described in Section 5.2.1: lemma, POS, hypernym and domain.

• The TOPIC feature fires if a constituent noun is marked as a sentential topic.

In this case our extensions include the lexical features from Section 5.2.1 as extracted for the

topic-marked constituent.

• The VOICE, NEGATION and VOLITIONAL features fire if the MWE has a head verb and it has voice marking, is negated, or carries a volitional modality.

Our extensions include features specifying what form of voice, negation or volitional marking

is used.

• Finally we come to the adjacency feature, which we have tweaked as well as extended. We

reverse the sense of the feature to fire when the constituents are not adjacent, and rename the

feature as the internal modification flag.

Our extensions include an internal modification Boolean flag, and additionally extract the

lexical features from Section 5.2.1 for a single intervening chunk where any existed. When

more than one intervening chunk is found, we take the rightmost.

We take a few words here to explain our transformation of the adjacency feature into an internal modification feature. This change is almost entirely superficial; it is not expected to alter the results, but is instead intended to enable a more intuitive understanding of the features.

First, we observe that the other idiom features fire (or rather, take a true value) where they deviate from the so-called canonical form of the MWE. This means that they fire less often than not; it also means that the feature is true for a phenomenon that may be proscribed by the MWE's lexico-syntactic fixedness. Since constituent adjacency is the common, canonical form for a MWE, it is more consistent to treat it as a false-valued or non-firing feature. Second, we observe that syntactic constraints ensure that an intervening word will be a dependent modifier of a constituent. For these reasons we name the modified feature internal modification.
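Reduced to code, the reversed feature is a simple check over chunk positions. The function and argument names below are hypothetical:

    def internal_modification(constituent_ids, chunks):
        """Internal modification flag (the reversed adjacency feature).
        Returns (flag, modifier): flag is True iff some non-constituent chunk
        sits between the first and last constituents of the token candidate;
        the modifier returned is the rightmost intervening chunk, from which
        the lemma/POS/hypernym/domain extensions are then extracted."""
        lo, hi = min(constituent_ids), max(constituent_ids)
        intervening = [i for i in range(lo + 1, hi) if i not in constituent_ids]
        if not intervening:
            return False, None            # adjacent, canonical form: non-firing
        return True, chunks[intervening[-1]]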

Examining the information in Figure 5.2 extracted by KNP for the simple sentence in Example (23), idiom features would fire only for adnominal modification. The adnominal modification Boolean feature fires because sakkā “soccer” modifies the constituent ude “arm”; consequently our extensions generate lexical features (lemma, POS, hypernym and domain) for the modifier: sakkā, noun, abstract thing and sports respectively. As was the case when sakkā “soccer” was considered in its role as a modifier of ude “arm”, we note that it is in fact informative that the adnominal modifier is a sport and not, for example, a person.

Note that although Figure 5.2 shows the TOPIC flag for the word keiko “Keiko”, it is not

a constituent of the MWE so the TOPIC flag and features do not fire.

JDMWE conformance features

The Japanese dictionary of multiword expressions (JDMWE) contains lexicographer-defined constraints on the syntactic flexibility of MWEs. Here, we introduce refined features which fire only when a MWE token has a modifier which violates the constituent modification constraints encoded in JDMWE. To explain the constituent modification constraints, Tanabe et al. (2014) use the example JDMWE rule [[*Nwo]*V30] for the expression kao-o suru “make a face”. In the rule, the N is the POS of the noun kao “face”, the wo represents the particle o that follows kao, and V30 represents the POS and conjugation class of the verb suru “to do”. The square brackets represent the nested dependency relationships between the constituents. The asterisks are used to represent modifiability at different positions in the dependency tree. In this case both constituents are modifiable: the first asterisk means that the noun kao can take adnominal modifiers and the second means that the verb suru can take internal adverbial modifiers. Thus the following example is legal according to the JDMWE:

(24) 悲しい 顔を いつも する kanashii kao-o itsumo suru sad face-OBJ always do “always make a sad face”

However, if the first asterisk had not been there then the adjective kanashii “sad” of kao would be a proscribed adnominal modifier and if the second asterisk were not there then the adverb itsumo

“always” on the verb suru would be a proscribed internal modification.

In Section 5.2.2 we defined an internal modifier as a dependent of a constituent which is not a constituent itself but divides an MWE token into two parts. For example, in the English kick seven buckets, seven is an internal modifier intervening between the constituents of a token candidate of the

MWE kick the bucket. Our internal modification features and the adjacency feature of related work

(Hashimoto and Kawahara 2009) flag the presence of any internal modifier unconditionally. For the

JDMWE features we include an internal modification flag and extensions comprising lexical information about the modifier, which only fire if the modification is proscribed by the type's constraints.

Since the JDMWE constraints can also apply to external modifiers, the dictionary also presents an opportunity to extend the adnominal modification feature, which indiscriminately flags external modifications on a leaf noun. Sentential subjects and other external arguments of the head verb are too common to be sensibly proscribed, but we did include a feature flagging proscribed external modification of leaf constituents, such as water in kick the bucket of water, and included our four kinds of lexical information about the modifier.
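The conformance features therefore need only an encoding of which constituents an entry marks as modifiable. The encoding below is our own illustrative simplification, not JDMWE's actual record format:

    # One entry's modifiability constraints, keyed by constituent; the rule
    # [[*Nwo]*V30] for kao-o suru marks both constituents with an asterisk.
    KAO_O_SURU = {"kao": True, "suru": True}  # illustrative encoding only

    def proscribed(rule, modified_constituent):
        """A JDMWE conformance feature fires only when a modifier attaches
        to a constituent the entry does not mark as modifiable."""
        return not rule.get(modified_constituent, False)

    # With the first asterisk removed, an adnominal on kao would be proscribed:
    strict_rule = {"kao": False, "suru": True}
    assert proscribed(strict_rule, "kao")
    assert not proscribed(KAO_O_SURU, "kao")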

5.2.3 Context features

As seen in Section 2, Hashimoto and Kawahara (2009) put together a set of features based on WSD best practice as espoused by Lee and Ng (2002). We observe that each is a representation of the context in which the MWE token appears. Each of the WSD features they used can be sorted into one of three kinds of context, as follows:

1. local context, in the form of part of speech and lemma features for indexed word offsets to

each side of the token candidate.

2. syntactic context, in the form of lemmas and POS for context words in a syntactic relationship

with either the first or last constituent word of the MWE.

3. full context, in the form of bag features for lexical features of single words in the full surrounding context.

Due to the way the OpenMWE corpus was constructed, the full context is a single sentence, but often a relatively long one. In general, it might be more appropriate to use a paragraph, several sentences, or a wide word window. Note that we use the same definitions of local and syntactic context as Lee and Ng (2002), with the specialisations to MWEs outlined by Hashimoto and Kawahara (2009).

Local context

The local context is drawn from a window of three words to the left of the leftmost constituent in the token candidate and three words to the right of the rightmost constituent. They are labelled with indices −3 through −1 and 1 through 3 respectively. Lexical features are drawn for

each of these indices separately. Additionally, paired lexical features are drawn for a selection of

index pairings: (−3, −2), (−3, −1), (−2, −1), the symmetrical positive-indexed equivalents and

finally (−1, 1). Hashimoto and Kawahara (2009) drew two versions of these features: one with the

lemmas at the given offsets and one with parts of speech. We draw each of the four lexical features

listed in Section 5.2.1. The hypernym and domain features we have thus added for context

words perform a similar function to the named-entity features used by Diab and Bhutada (2009). In

particular, they are features which fire for a select subset of words and collapse collections of words

with similar semantics into a single value.
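A sketch of the windowed extraction for one lexical field follows, assuming chunks are dictionaries keyed by field; positions falling outside the sentence yield a NULL value, matching the (NULL, ...) paired features visible in Figure 5.3:

    PAIRINGS = [(-3, -2), (-3, -1), (-2, -1), (1, 2), (1, 3), (2, 3), (-1, 1)]

    def local_context_features(chunks, left, right, field="lemma"):
        """Offset-tagged features for one lexical field. `left`/`right` are
        the indices of the first and last constituent chunks; negative offsets
        count from the left edge, positive offsets from the right edge."""
        def at(offset):
            idx = left + offset if offset < 0 else right + offset
            if 0 <= idx < len(chunks):
                return chunks[idx].get(field)
            return "NULL"  # outside the sentence

        feats = {}
        for off in (-3, -2, -1, 1, 2, 3):
            feats[f"{field}{off}"] = at(off)
        for a, b in PAIRINGS:
            feats[f"{field}({a},{b})"] = (at(a), at(b))
        return feats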

Syntactic context

As with the local context, we also extend the features extracted from the syntactic context

with hypernym and domain information. There is a very good motivation for doing this, apart from

completeness. The intent of the syntactic features is to capture selectional restrictions involving

constituents of the MWE. Violation of selectional restrictions for or by constituents of the MWE

leads us to strongly suspect an idiomatic usage. In the case of Example (25) we see that having the

sport sakkā “soccer” modifying ude “arm” is strongly indicative of the idiomatic ude-o ageru “to

raise skills”. For this MWE, any sport has the same implication. We can see from Figure 5.2 that

KNP has extracted the domain sports for sakkā “soccer”, so a classification algorithm can use this

feature to make a valid generalisation.

Full context

The full context represents the full text in which the token candidate is found as a bag of words. Generally speaking this might be limited to a full document, a paragraph, or a large word window. Since the OpenMWE corpus was built on long single sentences taken from the web we take the sentence as a whole for our full context. Like Hashimoto and Kawahara (2009) before us, we use a bag of lemmas found in the sentence to represent context, encoded as a vector. In our extended features we extract a bag of items for each of the lexical features listed in Section 5.2.1 including the hypernym and domain but excluding part of speech, which we do not expect to undergo meaningful variation.
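A sketch of the bag extraction, under the same illustrative chunk representation as the earlier sketches:

    from collections import Counter

    def full_context_bags(chunks, constituent_ids,
                          fields=("lemma", "hypernym", "domain")):
        """One bag of items per lexical field over the whole sentence,
        excluding the token candidate's own constituents; POS is omitted
        since it is not expected to vary meaningfully."""
        bags = {f: Counter() for f in fields}
        for i, chunk in enumerate(chunks):
            if i in constituent_ids:
                continue
            for f in fields:
                if chunk.get(f) is not None:
                    bags[f][chunk[f]] += 1
        return bags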

Example

Figure 5.3 contains a sample of all original and new WSD features, extracted for the MWE token in Example (25), which expands Example (23):

(25) 桂子さんは 真面目に 練習して、 サッカーの 腕を 上げた。 お母さんは keiko-san-wa majime-ni renshū-shite, sakkā-no ude-o ageta. okāsan-wa Keiko-TOPIC diligent-ly practice, soccer-GEN arm-OBJ raised. Mother-TOPIC 喜んだ。 yorokonda. pleased. “Keiko practised diligently and improved her skills at soccer. Her mother was pleased.”

Note that for the sake of the example we have included a second sentence to show that its content should appear in the full context and local context features.

full context features: lemma: keiko, majime, renshū, okāsan, yorokobu; hypernym: abstract thing, body part, person; domain: sports, familial associations.

local context features: lemma: majime(−3), renshū(−2), sakkā(−1), okāsan(1), yorokobu(2); (majime, renshū)(−3,−2), (majime, sakkā)(−3,−1), (renshū, sakkā)(−2,−1), ...; POS: adjective(−3), noun(−2), noun(−1), ...; (adjective, noun)(−3,−1), ...; hypernym: abstract thing(−2), abstract thing(−1), ...; (NULL, abstract thing)(−3,−1), ...; domain: sports(−1), ...; (NULL, sports)(−3,−1), ...

syntactic context features: lemma: sakkā(child); POS: noun(child); hypernym: abstract thing(child); domain: sports(child).

Figure 5.3: A sample of the WSD-inspired features extracted for Example (25).

5.2.4 Type features

The features we have discussed so far have, for the most part, ignored the constituent words of the MWE type itself. For type-specialised classifiers this is inconsequential, since the constituents are constant. However, features of the MWE type may be important for cross-type classification, where similarities between different MWEs could be leveraged. Therefore, for each MWE type, we use the lexical features of Section 5.2.1 of the headword in particular, and a bag feature for each across all constituents of the MWE.

One motivation for these features is to allow a type-generalised classifier to make more

informed decisions when confronted with tokens of types it was trained on. For example, if a

collection of constituents has been encountered in training, a supervised statistical classifier may

capture the majority class or prior probability for idiomaticity of the training MWE type. For cross-type classification, some constituents, in particular the headword, may be indicative of the relative idiomaticity of an idiom. For example, common verbs such as take and make are common in idioms such as take a shot and make a stand.
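A sketch of the type-level feature extraction, with frozensets standing in for the bag features; the constituent representation is the same illustrative one used above:

    FIELDS = ("lemma", "pos", "hypernym", "domain")

    def type_features(constituents, headword):
        """Type-level features: the headword's four lexical features, plus a
        bag of each field across all constituents of the MWE type."""
        feats = {f"head_{f}": headword.get(f) for f in FIELDS}
        for f in FIELDS:
            feats[f"bag_{f}"] = frozenset(
                c[f] for c in constituents if c.get(f) is not None)
        return feats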

JDMWE type-level features

We extracted the modifiability flags from the syntactic field of entries in JDMWE and

generated a feature for each modifiable constituent, identified by its position in the type’s canonical

parse tree. The motivation for this is to allow machine learning algorithms to capture any similarities

in idiomaticity between MWE types with similar modifiability constraints. This is not expected to

be as effective as the JDMWE based MWE token features. However, if MWE types come in families

with similar behaviour this may be a way to detect it.

5.3 OpenMWE corpus experiments

In this section we explore the supervised classification of ambiguous MWE tokens using the OpenMWE corpus of Japanese idioms (Hashimoto and Kawahara 2009). We put our new features tailored to the cross-type classification task to the test. We also test our refined features for type-specialised classification. We explore more deeply the interaction between several aspects of the task, including differences between cross-type and type-specialised classification; combinations of major classes of features; and finally, the size of the training corpus. We find that:

1. Our extensions to existing WSD-inspired features offer consistent improvements in performance, and our extensions to idiom features usually offer marginal improvements. The extended WSD and idiom features used together for type-specialised classification outperform all comparable previous systems in terms of raw performance.

2. Our type features for cross-type classification interact with the task and with other features in interesting ways, and in some cases give substantial improvements to performance.

3. Our best results in cross-type classification use only our extended WSD features, with no idiom-inspired or type features. Despite using no tagged data for the target MWE types, they achieved a performance in excess of the (supervised) type-specialised baseline.

This last result is significant because it demonstrates the readiness of our cross-type classifiers to work on previously unseen MWE types. It is also interesting because it uses only semantic features and none of the lexico-syntactic fixedness features widely expected to be effective for MWE classification (Fazly et al. 2009; Hashimoto et al. 2006; Li and Sporleder 2010).

5.3.1 Results

We evaluated classifiers using combinations of the three main classes of features: idiom (or MWE token) features, WSD (or context) features and type (or MWE type) features. For all tasks, a tenfold cross-validation partitioning was used, and a feature count cutoff of one was used to filter out extremely rare features. Given the binary classification nature of the task, we used accuracy as our performance metric, micro-averaging across all instances in the corpus.

For statistical significance we used the sign test, because our cross-type classification testing partitions were unevenly weighted, making a t-test inappropriate. Except where otherwise stated, all differences we report here were significant with p < 0.05 (at least).
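For concreteness, the sign test can be expressed as a binomial test over the instances on which exactly one of two systems is correct. A sketch using SciPy (the helper name is ours):

    from scipy.stats import binomtest

    def sign_test(preds_a, preds_b, gold):
        """Two-sided sign test on the instances where exactly one of the two
        systems is correct; unlike a t-test it makes no assumption about
        equally weighted test partitions."""
        a_wins = sum(1 for a, b, g in zip(preds_a, preds_b, gold) if a == g != b)
        b_wins = sum(1 for a, b, g in zip(preds_a, preds_b, gold) if b == g != a)
        n = a_wins + b_wins
        return binomtest(a_wins, n, p=0.5).pvalue if n else 1.0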

We constructed two kinds of majority-class baseline using class counts from the corpus: the corpus baseline, which achieved an accuracy of 0.612, and the type-specialised baseline, with an accuracy of 0.741. All other systems were linear-kernel Support Vector Machine models trained using the libSVM package.6

Cross-type classification

For the purposes of testing cross-type classification we partitioned the set of 90 MWE types for cross-validation. Thus classifiers were trained on the instances of 81 types and tested on the instances of the 9 unseen types. Note that since the corpus contains a different number of instances for each type, the partition size was not strictly constant. Results across all feature combinations appear in Table 5.1.

6 We initially used the TinySVM package and quadratic kernels of Hashimoto and Kawahara (2009) for comparability reasons, but eventually changed system and kernel for consistency and speed of convergence.

Feature types                          Accuracy
idiom    type    wsd             basic      extended
         ∗                       0.623      –
∗                                0.630      0.627
∗        ∗                       0.626      0.651
                 ∗               0.737      0.745
∗                ∗               0.736      0.743
         ∗       ∗               0.738      0.746
∗        ∗       ∗               0.739      0.745
corpus baseline                  0.612
type-specialised baseline        0.741

Table 5.1: Results of combining different features for cross-type classification. “Basic” results restrict the idiom and WSD features to those of Hashimoto and Kawahara (2009); “extended” results include our extensions.

The idiom type and token features did manage to improve on the corpus baseline by a little over one percentage point each. However, this is over ten percentage points behind the results when using WSD features alone. In fact, the WSD features with our extensions achieved effectively our best results. The addition of idiom features with our extensions achieved fractionally better performance (without statistical significance), and all feature combinations which did not include our full complement of WSD features had lower performance.7 The WSD features' performance exceeds even the type-specialised baseline, for which the oracle majority class was computed from gold-standard data to which the cross-type classifier has no access.

It is a surprising result, for two reasons: first, idiom features are widely assumed to be a key information source, particularly for unsupervised disambiguation (Fazly et al. 2009); and second, WSD features, paragraph context in particular, are typically used as a model of the meaning of a token. It is counterintuitive that models of semantics are more informative than models of lexico-syntactic variation when the testing and training sets are explicitly disjoint with respect to MWE type.

Neither the idiom token features nor the type features alone stand up well in comparison. However, we note that combining our type features with the extended idiom token features provided a substantial boost compared to similar configurations, attaining a classification accuracy almost four percentage points above the baseline.

7 In this chapter, all stated differences were statistically significant unless explicitly declared otherwise.

Feature types                          Accuracy
idiom    type    wsd             basic      extended
         ∗                       0.741      –
∗        ∗                       0.748      0.752
∗                                0.630      0.639
                 ∗               0.844      0.847
∗                ∗               0.851      0.854
         ∗       ∗               0.847      0.849
∗        ∗       ∗               0.854      0.856
corpus baseline                  0.612
type-specialised baseline        0.741

Table 5.2: Results for classification of MWE tokens of MWE types seen in the training corpus. As in Table 5.1, “basic” results restrict the idiom and WSD features to those of Hashimoto and Kawahara (2009).

What happens when a nominally cross-type classifier encounters an instance of a MWE type which it has seen in its training set? We can test this by performing a type-generalised classification task (as defined in Section 5.1). To construct the task, we partitioned the corpus stratified

across types. That is, classifiers were trained on a combined 90% of instances from each of the

MWE types in the corpus and tested on a set comprising the remaining 10% from each type. The

results are shown in Table 5.2.

The idiom features once again improved a couple of percentage points on the baseline.

In this instance, the type features produced much better results, achieving the same results as the

type-specialised majority class baseline. It is not possible for any deterministic classifier to do better

on the same input because the idiom type features are constant across all instances of a MWE type.

Once again the WSD features did far better than any of the others, at 23 percentage points

over the corpus baseline and over ten points above even the type-specialised baseline. This is more

to be expected than the equivalent result for cross-type classification since, by their origin, WSD

features are designed to capture differences in semantics for known types.

The indisputable dominance of WSD features observed in these experiments, particularly in cross-type disambiguation, warrants further investigation, which we leave for future work.

Feature types                 Accuracy
idiom    wsd             basic      extended
∗                        0.768      0.769
         ∗               0.882      0.886
∗        ∗               0.886      0.890
type-specialised baseline        0.741

Table 5.3: Results of combining different features for type-specialised classifiers. “Basic” results restrict the idiom and WSD features to those of Hashimoto and Kawahara (2009); “extended” results include our extensions.

Type-specialised classification

When encountering a token candidate of a known MWE type with a large amount of labelled training data, a type-generalised or cross-type classifier need not be used: we can fall back on type-specialised classifiers like those of Hashimoto and Kawahara (2009). Although a type-specialised classifier has no capacity to transfer generalised knowledge from other types, it has the advantage of being able to fit the specific type better with the same model complexity. To see how much better we can do by selecting an appropriate type-specialised classifier, we trained and tested classifiers on the same partitioning as the previous task, but restricted to instances of one MWE type at a time. The results appear in Table 5.3.

In this case the idiom features performed better, exceeding even the type-specialised baseline. This indicates that the idiom features do contain information about MWE token idiomaticity, even if it does not generalise well across types.

The WSD features achieved close to the best results seen across all our experiments. An improvement of four percentage points is seen compared to the previous task, which is indicative of the challenge to the machine learning process of simultaneously training on data of 90 idioms.

Our best results were achieved using all of the extended idiom and WSD features together, hitting an even 0.890.

Finally, we used the type-specialised task to investigate the contribution of the size of the OpenMWE corpus. To do this we measured cross-validation accuracy while limiting the total number of training instances. Training set size caps ranging between 100 and 1000 instances were used, but in practice most of the MWE types had between 500 and 1000 available training instances, so the average number of training instances actually used was less than the cap. We performed these experiments using our complete WSD and idiom features and, for comparison, with the original features of Hashimoto and Kawahara (2009). The results appear in Figure 5.4.

Figure 5.4: Accuracies for type-specialised classifiers with capped training set sizes.
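The capping procedure amounts to per-type downsampling before training, as in the following sketch (instance representation as in the earlier sketches):

    import random
    from collections import defaultdict

    def cap_training_instances(train, cap):
        """Randomly downsample each MWE type's training instances to `cap`.
        Types with fewer instances keep all of them, which is why the average
        actual training size falls below the cap."""
        by_type = defaultdict(list)
        for inst in train:
            by_type[inst["type"]].append(inst)
        capped = []
        for insts in by_type.values():
            random.shuffle(insts)
            capped.extend(insts[:cap])
        return capped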

Even with 100 instances per MWE type, we achieved an accuracy of 0.834, which is an appreciable improvement on the type-specialised baseline. However the data show a definite positive trend with the number of instances, reaching 0.884 under a cap of 650 instances (and 589 average actual instances) per MWE type.

Setting the maximum number of instances per MWE type to 1000 achieved an accuracy of 0.888. Additionally, when restricted to the original features used by Hashimoto and Kawahara (2009), a performance of 0.884 is observed. Since Hashimoto and Kawahara (2009) also capped instance counts at 1000, this is our closest reproduction of their best configuration, which scored 0.893. The differences may be accounted for by a difference in the SVM library used, the use of linear rather than quadratic kernels, and our relative lack of training parameter tweaking.

Error analysis

One of the more interesting results of this study was the finding that semantic context

models were effective for classification of token candidates of unseen MWE types. In particular,

models trained on bag of words features were far more successful on this task than models trained

on features designed to capture idiom morphosyntactic inflexibility. To gain some insight into this

we sampled some MWE tokens from the corpus which were incorrectly classified by models trained

on our full complement of idiom features with extensions but were correctly classified by a model

trained solely on paragraph level bag of words features. We focussed on the shortest tokens of

the MWE types that showed the greatest improvements under bag of words based classification.

Specifically, we took the tokens that were incorrectly labelled by the idiom features model and

divided them by MWE type and by gold standard label, choosing the 10 largest such groups of error

instances. We then chose five of those groups which contained instances with short contexts so that

we could focus on cases where differences occurred with a small number of context words. Four

of those expressions had only one to three non-constituent words per MWE token. The sample of

the remaining expression comprised longer sentences up to fifteen words long, and was included

because it represented the rarer case of errors corrected by bag of words features where the true

label was idiomatic.

One of the expressions with many short examples containing only one or two non-constituent tokens was mizu to abura “incompatible” (idiom) vs. “oil and water” (literal). In the extracted sample the model trained on idiom features had incorrectly applied an idiomatic label whereas the bag of words model had correctly applied a literal label. Here are two examples:

(26) 水と 油が 乳化する mizu-to abura-ga nyūka-suru water-AND oil-SUBJ emulsify “Oil and water emulsify” (literal)

# “incompatible emulsify” (idiomatic)

(27) 水と 油を 混ぜる 研究 mizu-to abura-o mazeru kenkyū water-AND oil-OBJ to-mix research “research on mixing oil and water” (literal)

# “research on mixing incompatible” (idiomatic)

In the full set of extracted error samples the context words almost all had a physical or scientific aspect to them:

• mazaru “to mix (intransitive)”

• mazeru “to mix (transitive)”

• kongou “mixture”

• jikken “experiment”

• kyōkai “boundary”

• nyūka “emulsification”

• atsusa “heat”

• daietto “diet”

• kusuri “medicine”

• kenkyū “research”

It makes sense that words related to physical properties and chemical processes indicate that a sentence is talking about the real physical challenge of mixing oil and water. However, the MWE type was not in the training set, so the real question is how these words indicate the literal interpretation for an unseen MWE type. The OpenMWE corpus was constructed from a web crawl and general-purpose idiom dictionaries, so perhaps the expressions it contains would generally be of too informal a register for use in scientific discussions. Although specialised domains like science have their own inventories of MWEs (multiword expression itself being one such example), for expressions in general use it is plausible that scientific context words are a good indicator of literal speech.

Another expression with many instances incorrectly classified as idiomatic by the idiom feature model was te o nobasu “expand into an area” (idiom) vs. “extend a hand” (literal). The context words found in instances where the bag of words model corrects the idiom model's idiomatic label to literal were less consistent in this case. We divided the context words into a few thematic groups:

physical action: todoku “to reach”, nigiru “to clasp”, mageru “to bend”, akeru “to open”, naderu “to brush gently” and kata “direction”;

the body: ashi “foot”, koshi “lower back”, atama “head” and kao “face”;

everyday objects: yubiwa “ring”, mado “window”, ken “sword”, kabe “wall” and hitsugi “coffin”;

people: boku “I”, jijūchō “grand chamberlain”, sachiko “Sachiko” and shōjo “little girl”.

These are not words from a specialised domain as we saw with mizu to abura, and there is no particular reason to think they would co-occur with a more formal register. However, the literal interpretation of the idiom's constituents, extend a hand, is a very mundane physical action using a common body part, so it does make sense that words for mundane things like physical actions, other body parts, everyday objects and people might signify a literal interpretation. Any other idiom based on a metaphor for an everyday action or important feature of the body might follow the same pattern.

The expression hana ga takai "proud" (idiom) vs. "prominent nosed" (literal) also involves a body part and exhibits a somewhat similar selection of context words in short literal sentences that are mistaken for idiomatic by the idiom features classifier but not by the bag of words classifier:

physical actions: mieru "to see";

the body: irojiro "fair-skinned" and kao "face";

people: mama "mum", seiyōjin "Western people", maruko "Marco" and daredemo "anyone";

miscellaneous: sūjikan "a few hours", nochini "later on" and yume "dream".

In this case the metaphor is for a feature of a person's appearance, so it makes sense that words for people would appear more often. However, the idiomatic interpretation is also a description of a person's personal qualities, so it is difficult to say that people appearing in context should be a primary indicator of the literal sense.

The errors analysed so far have been for cases where the idiom feature model incorrectly applied an idiomatic label to a literal instance. However, the reverse case also happened, where the idiom features model applied a literal label and the bag of words model correctly applied an idiomatic label. One such expression is chi ga kayou "to show humanity" (idiom) vs. "blood flows" (literal) which, like the other expressions seen so far, had many short instances incorrectly labelled by the idiom features model:

communication and media: kansō "thoughts", serifu "speech", hatsugen "statement", tōjōjinbutsu "character (in play/novel)", kotoba "words" and sakuhin "artistic works";

people or persons: jitai "oneself", bokura "we", ningen "human being" and robotto "robot";

miscellaneous: kokoro "spirit/heart", kansei "completion", ikizuku "to breathe (heavily)" and tokoro "wild yam".

We see some similarities and some differences here in the kinds of words appearing. The references to body parts and everyday objects are all but gone. Instead we see words related to language and the arts. Words related to people (or at least intelligent entities) are still present, so perhaps they should not be taken as a strong indicator of idiomaticity either way. However, we do note that jitai "oneself" and ningen "human being" in particular lend themselves to less casual speech.

Another expression with many idiomatic instances that were mis-classified by the idiom feature classifier was mune o utsu "to be touching" (idiom) vs. "to strike the chest" (literal). The shortest sentences available for this analysis were relatively long, comprising up to fifteen words each. Here we show a uniform sample of the context words:

physical objects and actions: kata "direction", sugata "appearance", hirogeru "to spread", sesshoku "touch" and tsutsumu "to wrap up";

people: watashino "mine", hitobito "people", jibun "myself", kodomo "child", rōdōsha "worker", nakama "colleague", gakusei "student", otto "husband", waka "youth" and tsuma "wife";

communication and thought: uttaeru "to bring to someone's attention", kataru "to talk about", kyōmi "interest in", omou "to think or believe", gyōshuku "condensation (of ideas, emotions, etc.)", mugonde "wordlessly" and iu "to say";

abstract concepts: aware "pity", konshun "this spring", sensō "war", setsusetsu "passionate", shitashii "intimate", shinsotsu "honesty", kokorone "feelings" and kandō "being deeply moved emotionally".

We see that with the increase in sentence length, there is a mixture of mundane physical words with people and abstract concepts. However, the mundane physical words are still small in number compared to the words related to communication and to abstract concepts, so this sample does seem consistent with the sample previously seen.

These examples provide some indication that there are in fact identifiable classes of context words which are useful for cross-type classification. However, some caution is warranted given that categories of words like body parts or scientific terms might only be useful for identifying literal instances of particular kinds of expressions with constituents that are semantically related to those categories. Ultimately the patterns discovered in this analysis speak most strongly for compositional semantic approaches such as those of Gharbieh et al. (2016), where a model of the individual constituent semantics is compared to the context semantics to detect literal usages.

5.3.2 Discussion

We have shown that cross-type classification of ambiguous token candidates can surpass the type-specialised supervised baseline while alleviating the requirement for labelled token instances, thus enabling classification of tokens of previously unseen MWE types.

Our type features and new idiom features, working in concert with the idiom features of

Hashimoto and Kawahara (2009), substantially increase cross-type classification performance over the corpus baseline. However, their effect is wholly subsumed by the inclusion of WSD features, which are by a clear margin the most effective representation for disambiguation of token candidates of unseen types. This is a surprising result, since one would expect a supervised context model to only capture the semantics of seen types. On type-specialised classification, our new idiom and

WSD features achieve more consistent gains above the use of WSD features alone. However, the

WSD features still have by far the greatest impact on performance.

Finally, we conclude that the size of the OpenMWE corpus raises potential performance at supervised classification by leaps and bounds, but additional improvements are still to be had by collection of more data.

For future work we would like to investigate the dominance of WSD features at cross-type classification. The success of semantic features on a task where the training and test sets have different semantics by design is an intriguing, counterintuitive result, as is the relatively poor performance of features targeted at linguistic properties of MWEs.

However, to close out this chapter, we will first take a closer look at classifying ambiguous token candidates using the characteristic inflexibility of idioms, by applying a promising knowledge base for knowledge-rich cross-type classification: the Japanese dictionary of multiword expressions.

5.4 JDMWE lexicon experiments

This section builds on the work in the previous section by exploring whether type-level

MWE properties sourced from an idiom dictionary can boost the accuracy of cross-type MWE token classification. That is, we attempt to determine whether multiword expression token candidates are idiomatic or literal, based on: (a) annotated instances of other MWE types, and (b) dictionary-based information on the syntactic properties of the idiom in question.

The idiom dictionary we use, the Japanese dictionary of multiword expressions, is introduced in Section 3.4.2. It contains a lexicographer's judgements about the forms of lexico-syntactic modification that are allowed by idiomatic usages of specific Japanese MWEs. However, only a subset of the dictionary had been released at the time of our study, so in this section of the chapter we work with the intersection of the JDMWE headwords and the types in the OpenMWE corpus.

We find that a type signature representing constituent modifiability judgements extracted from the idiom dictionary is more predictive of the idiomaticity of tokens than our previously introduced MWE type features such as POS or lexeme of the idiom's constituents. However, individual token candidate violations of the dictionary's modifiability rules have variable utility for machine learning classification, being suggestive of the literal class but not definitively entailing it. Finally, we present evidence that WSD features do model the individual semantics of literal instances of token candidates when used in a type-specialised classification setting, which re-affirms the novelty of our previous result showing that the same feature set is a powerful resource for cross-type classification.

5.4.1 Results

We worked with a subset of the OpenMWE corpus comprising those types having: (a) an entry in the released subset of the JDMWE, and (b) both literal and idiomatic classes represented by at least 50 MWE tokens each in the corpus. This leaves 27 MWE types and 23,392 MWE tokens

and means that our results are not directly comparable either to those of Hashimoto and Kawahara (2009) or to those of the previous section of this chapter. The release of the full JDMWE should enable more comparable results in future research.

Figure 5.5: Cross-type classification accuracy using JDMWE type-level features and lexical type-level features in combination with various token-level features

We constructed a cross-type classification task by ten-fold cross validation of the MWE

types in the corpus subset, with micro-averaged results. Training sets were the union of all MWE

tokens of MWE types in a partition. The majority class was the idiomatic sense and provided a

(subcorpus) baseline accuracy of 0.594. Support Vector Machine models with linear kernels were trained on various feature combinations using the libSVM package.
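As an illustration of this setup, the following Python sketch reproduces the evaluation loop under the assumption that a feature matrix X, labels y and per-token MWE type identifiers have already been extracted (all hypothetical names); it uses scikit-learn's SVC, which wraps libSVM, and holds out whole MWE types in each fold so that test tokens always belong to unseen types.

    # Minimal sketch of cross-type evaluation; X, y and mwe_types are
    # assumed inputs prepared by an earlier feature extraction stage.
    import numpy as np
    from sklearn.model_selection import GroupKFold
    from sklearn.svm import SVC

    def cross_type_accuracy(X, y, mwe_types, n_folds=10):
        """Micro-averaged accuracy over folds holding out whole MWE types."""
        correct = 0
        folds = GroupKFold(n_splits=n_folds)
        for train_idx, test_idx in folds.split(X, y, groups=mwe_types):
            model = SVC(kernel="linear")  # linear-kernel SVM backed by libSVM
            model.fit(X[train_idx], y[train_idx])
            correct += int(np.sum(model.predict(X[test_idx]) == y[test_idx]))
        return correct / len(y)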

Figure 5.5 shows the results on cross-type classification of adding type level features to token level feature sets. Our JDMWE type-level features represented the modifiability constraints of the type itself. Somewhat unexpectedly, they performed comparatively well at the cross-type task, with an accuracy of 0.647, at 5.3 percentage points above the baseline. This is a marked improvement on the lexical type-level features from the previous section, which achieved an accuracy 4.0 points above the baseline. As was observed previously, the accuracy gained by using type-level features is much smaller than that gained by the WSD features. However, the advantage of the JDMWE type features over the lexical type features is sustained in combination with a variety of token level feature families.

Token level JDMWE features

Our JDMWE token-level features represented how well a token candidate conformed to the type level constraints. These token-level features performed quite badly at cross-type classification.

When measured against the baseline or used to augment other token features, they degraded or only marginally improved performance. The fact that using these features resulted in worse-than-baseline performance suggests that the constituent modifiability features extracted from JDMWE may not be the strict constraints they are construed as.

To better examine the quality of the JDMWE constituent modifiability constraint features, we constructed a heuristic classifier. The classifier applies the idiomatic class by default, but assigns the literal class to any MWE token which violates the JDMWE constituent modifiability constraints.

This classifier's precision on the literal class was 0.624, meaning that fully 37.6% of modifiability constraint violations in the corpus occurred for idiomatic tokens.

However, the classifier was correct in its literal class labels more than half the time, so it achieved a better overall accuracy, at 0.612, than the majority class classifier. It follows that the heuristic classifier comfortably outperformed the Support Vector Machine classifier based on the same features. This shows that our poor results with regards to the JDMWE constraint violation features are due in part to failures of the machine learning model to take advantage of the information they provide.
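A minimal sketch of the heuristic classifier and its evaluation follows; the predicate violates_constraints is a hypothetical stand-in for the JDMWE constraint checker described above.

    # Default-idiomatic heuristic: only a JDMWE modifiability-constraint
    # violation triggers the literal label. `violates_constraints` is a
    # hypothetical predicate over a token's parsed modifications.
    def heuristic_label(token):
        return "literal" if violates_constraints(token) else "idiomatic"

    def literal_precision_and_accuracy(tokens, gold):
        predictions = [heuristic_label(t) for t in tokens]
        # Gold labels of the tokens we predicted as literal.
        literal_gold = [g for p, g in zip(predictions, gold) if p == "literal"]
        precision = literal_gold.count("literal") / len(literal_gold)
        accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
        return precision, accuracy  # 0.624 and 0.612 in the experiments above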

As to the strength of the constraints encoded in JDMWE, we found that 4.4% of all idiomatic tokens in the corpus violated constituent modification constraints, as did 10.8% of literal tokens. That is, the idiomatic tokens do not obey the constraints universally, but do violate the type constraints less than half as often as the literal tokens. Thus the constraints seem sound, but are not the hard rule that could be inferred from their inclusion in the JDMWE.

Error analysis

To understand better how idioms are violating their JDMWE constituent modifiability constraints, we examined the modifications on a sample of such violating tokens from the OpenMWE corpus. We extracted 98 tokens from the 10 MWE types with the most violations, selecting at most 10 tokens for each kind of violation from each of those types. Five main categories accounted for the majority of tokens with constraint violations:

1. blending literal context with the figurative space (27 instances);

2. inserting an adverb between expression constituents (14 instances);

3. embellishing the expression with a pre-modifying adjective on a noun (8 instances);

4. systematic MWE-specific parsing errors (14 instances); and

5. general parsing errors (22 instances).

The remaining 13 instances did not fit into any of the above categories, or used complex language that made it difficult to attribute a concrete cause of failure.
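A short sketch of the sample selection follows; the token attributes mwe_type and violation_kind are illustrative assumptions about how the violating tokens might be recorded.

    # Take the 10 MWE types with the most violations, then at most 10 tokens
    # per (type, violation kind) pair, yielding the 98-token error sample.
    from collections import Counter, defaultdict

    def sample_violations(violating_tokens, n_types=10, per_kind=10):
        counts = Counter(t.mwe_type for t in violating_tokens)
        top_types = {mwe for mwe, _ in counts.most_common(n_types)}
        buckets = defaultdict(list)
        for tok in violating_tokens:
            key = (tok.mwe_type, tok.violation_kind)
            if tok.mwe_type in top_types and len(buckets[key]) < per_kind:
                buckets[key].append(tok)
        return [tok for group in buckets.values() for tok in group]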

What we mean by blending literal context with the figurative space is that some literal

subject matter of the sentence is given a specific relationship to a non-literal constituent. Some

expressions showed more of a tendency for this than others. For instance, it was the predominant

violation for the idiom me o samasu "become aware" (idiom) vs. "wake one's eyes" (literal). The

JDMWE constraint for this phrase is listed as [N wo]*V30.⁸ Although the JDMWE lists the eyes constituent as unmodifiable, in some idiomatic OpenMWE instances the eyes are expressed as being those of the person or entity who becomes aware. For example, consider the following fragments of sentences containing me o samasu:

⁸Section 3.4.2 has an explanation of the JDMWE constraint syntax.

(28) 眠っている 遺伝子の 目を 覚まさせる nemutte-iru idenshi-no me-o samasaseru The sleeping gene-GEN eyes-OBJ made to open # “The sleeping gene’s eyes are opened” (literal)

“The dormant gene is activated” (idiomatic)

(29) 文月先生の 目を 覚まさせる Fumizuki-sensei-no me-o samasaseru Mr Fumizuki-GEN eyes-OBJ made to open “Make Mr Fumizuki aware”

In Example (28) the eyes in the expression are used as though they belong to the gene that is being activated and in Example (29) it is the eyes of a teacher that are being opened.

For each constituent of an idiom the JDMWE encodes whether or not it allows dependent modifiers, including for any final verbal constituents. Since Japanese sentences are headed by the final verb, it would be unreasonable to expect a verbal idiom constituent to have no non-constituent children at all, but in keeping with the general fixedness of MWEs, a non-constituent which separates the constituents from each other is a clear violation of this constraint. Nonetheless, there are many cases in the OpenMWE corpus of adverbs appearing between a constituent noun and the head verb.

Consider the following examples of the expression nami ni noru “go with the times” (idiom) vs.

“ride the wave” (literal):

(30) デジタル化の 波に うまく 乗り dejitaru-ka-no nami-ni umaku nori digitisation-GEN wave-DAT skillfully ride “Skillfully ride the wave of digitisation” (literal)

“Skillfully adapt to the trend of digitisation” (idiomatic)

(31) 「世界化」の 波に 積極的に 乗って “sekaika”-no nami-ni sekkyokuteki-ni notte globalisation-GEN wave-DAT proactively ride “Proactively embrace globalisation”

In both cases the manner in which the subject adapts to a changing situation — umaku "skillfully" for Example (30) and sekkyokuteki-ni "proactively" for Example (31) — is given as an adverb as if describing the manner of riding. The adverb separates the verb constituent from the noun constituent so that the expression does not appear contiguous in the text. Note that in both examples the constituent wave has a non-constituent dependent ("digitisation" in the first and "globalisation" in the second) but this modification is allowable according to the constraints of the JDMWE.

In a sense the adverbial modifiers we have just looked at in nami-ni noru "ride the wave" extend a metaphor because they appear as the manner of "riding" the wave. Such modifications also sometimes occur as adjectives on nouns that are proscribed for modification by the JDMWE constraints. Here are a couple of examples for the expression aosuji wo tateru "becomes highly agitated" (idiom) vs. "vein stands out" (literal) where the veins have been elaborated in a somewhat literary manner:

(32) 額に 幾つもの 青筋を 立てて hitai-ni ikutsumo-no aosuji-wo tatete forehead-DAT many veins-OBJ raise "Many veins on his head stood out" (literal)

“He was apoplectic” (idiomatic)

(33) 先生は 顏に 蚯蚓のやうな 青筋を 立てて sensei-wa kao-ni mimizu-no-yō-na aosuji-wo tatete teacher face-DAT worm-like vein raise "A worm-like vein stood out on the teacher's head" (literal)

“It looked like the teacher’s head would explode” (idiomatic)

In Example (32) the sense of agitation is heightened by describing many veins standing out on the forehead. Note that the expression already carries an implication that the vein is on the face, but this example makes explicit reference to the forehead. Example (33) uses the same pattern, this time describing a single highly pronounced vein standing out from the face like a worm. It similarly serves to emphasise the strength of the teacher's anger. This example also extends the metaphor by locating the vein, this time on the face rather than the forehead specifically. These extensions of the metaphor to cover the size or location of the vein do not follow a fixed form and seem to occur as a form of creative expression.

One of the MWE types had a common pattern of usage which was prone to triggering attachment errors in the parser, causing a noun constituent to appear to have a constraint-violating dependent when in fact it does not. The expression is te o nigiru “cooperate” (idiom) vs. “grasp hands” (literal) and is commonly used to express co-operation with someone. However, the particle

to used to indicate “with” in Japanese can also be used like the conjunction “and”, co-ordinating a

list of nouns. Thus an ambiguity arises: did we clasp hands with the person, or did we clasp the

person and the hand? The latter interpretation is nonsensical, but the parser has chosen it in several

instances as illustrated by the following examples:

(34) 金大中氏と 手を 握り Kimudejun-shi-to te-o nigiri Kim Dae-jung-(with/and) hand-OBJ grasp # “Grasp Kim Dae-jung and the hand”

“Grasp Kim Dae-jung’s hand” (literal)

“Co-operate with Kim Dae-jung” (idiomatic)

(35) セルビア派と 手を 握る serubia-ha-to te-o nigiru Serb faction-(with/and) hand grasp # “Grasp the Serb faction and the hand”

“Co-operate with the Serbs”

In Example (34) the phrase was parsed by KNP as ((Kimudejun-shi)-to (te)-o) nigiri “grasp ((Kim

Dae-jung) and (the hand))” which makes the constituent te “hand” appear to have the dependent

(more strictly speaking, the co-ordinate) Kimudejun-shi-to “and Kim Dae-jung”. Since for this

expression modifications are not allowed on te, this is detected as a violation of the JDMWE constraints. In a correct parse Kimudejun-shi-to "with Kim Dae-jung" would be a dependent of the head verb, which would not be a constraint violation. Example (35) illustrates the same error, this time regarding cooperation with a collection of people. Errors of this kind occurred occasionally with other expressions as well, but it was the primary cause of error for te o nigiru in

particular.

Finally, there was an inevitable class of non-MWE-specific parsing failures which erroneously attached non-constituents to constituents proscribed for modification. One common kind seemed to be the attachment of connecting words and adverbs, as in the following examples for the expression ashi ga deru "overran the budget" (idiom) vs. "legs came out" (literal):

(36) 儲けに ならない 仕事 どころか、 足が 出た mouke-ni naranai shigoto dokoroka, ashi-ga deta profit-DAT not become job let alone legs came out # “Not only the unprofitable job, but also the legs, came out”

“Let alone being an unprofitable job, it also overran its budget”

(37) また 足が 出て しまいました mata ashi-ga dete shimaimashita again legs came out completely “The budget blew out completely again”

In Example (36) the co-ordinating expression dokoroka, which can mean "not only" or "let alone" depending

on context, has been mis-attached to the noun constituent ashi “legs” rather than the head of the

whole clause deta “came out”. The result is that the JDMWE constituent modifiability constraint

on ashi is unnecessarily violated. Similarly, in Example (37) the adverb mata “again”, which can

also act as a conjunction meaning "and also", is erroneously attached to the constituent noun ashi

instead of the verb deru.

In this error analysis we have seen a few ways in which an idiom may escape from its

usual inflexibility and a few ways in which an idiom’s modifications may be detected improperly

due to parsing errors. There is a trap in assuming that the constituents of an idiom will never have

non-constituent modifiers due to not having a direct semantic relationship with the context. Firstly,

we have seen a number of ways in which a metaphor itself can be extended coherently with adverbs

and adjectives that embellish the original expression. Of the 98 errors studied in our analysis, 22

had some form of embellishment. Secondly, we saw 27 cases of mixing the literal context into the

metaphor as a way of emphasising the mapping of the literal word into the metaphorical space. In

fact, this kind of mixing was so common in the violations of me o samasu "become aware" (idiom) vs. "wake one's eyes" (literal) where the eyes belonged to a literal subject that it is tempting to suggest that it is a standard usage pattern for the idiom. In this sense it is acting much like nami ni noru "go with the times" (idiom) vs. "ride the wave" (literal), where the JDMWE explicitly

allows usages of the form the wave of X where X is a part of the literal context. However, there

still exist a significant number of tokens with modifications proscribed by the JDMWE which do

not follow such a fixed rule or pattern and instead represent the user of the expression playing with

the boundaries of its expected form.

Overall 49 instances — exactly 50% — fit one of our patterns of legitimate violation of

a constraint. 13 further instances appeared valid but did not fit one of the described patterns. The

remaining 36 instances were influenced by parse errors. Violations of JDMWE constraints appear

to be a genuine phenomenon with nontrivial prevalence in natural language.

5.4.2 Another look at WSD features

A second look at Figure 5.5 reveals that even with our improvements to type-level fea-

tures, our finding of the previous section that WSD context features perform best at cross-type

classification of unseen types still holds after the introduction of the JDMWE features. Our results

do not shed light on the cause of this phenomenon directly, but a breakdown of the type-specialised

results does at least reveal the extent to which the WSD features model individual semantics of seen

types.

For our type-specialised classification task we once again performed cross-validation for

each JDMWE MWE type in isolation, aggregating final results with a micro-average. Some types

had a literal majority class, so the baseline accuracy was higher than the subcorpus baseline, at

0.741. Figure 5.6a shows accuracy of various feature subsets on the idiomatic tokens of the corpus

(that is, it shows recall for idiomatic tokens). The results show that type-specialised classification

performance is basically constant when restricting analysis to only the idiomatic test instances. The

huge performance boost produced through the use of WSD features occurs only on literal instances

as seen in Figure 5.6b. That is, in our type-specialised classifiers the WSD features are improving recall for the literal instances of a MWE type but not for the idiomatic instances. It should be noted that idiomatic recall is already very high and that the literal sense is the minority sense in the corpus, so this may be a case of improving the recall of the minority class. However, it seems clear that the WSD features are effective at highlighting contexts distinctive of particular literal senses of token candidates. If the WSD features chiefly manage to improve recall of specific expressions of non-idiomatic language, it is all the more difficult to see why they are so effective at improving performance at cross-type idiom token identification.

Figure 5.6: Looking separately at recall of idiomatic and literal instances in the corpus under various combinations of feature families: (a) recall for idiomatic instances with and without WSD context features, in a type-specialised classification setting; (b) recall for literal instances with and without WSD context features, in the same setting. Classifiers with no features at all are the type-specialised majority class baseline, which is literal for some types and idiomatic for others.

5.4.3 Discussion

Using a MWE dictionary as input to a supervised cross-type MWE token classification task, we have shown that the constituents' modifiability characteristics tell us more about idiomaticity than their lexical characteristics. We found that the constituent modification constraints in JDMWE are not hard-and-fast rules but do represent statistical trends for idiom flexibility in the OpenMWE corpus. Finally, we found that the WSD features improve the recall of literal instances of token candidates, which may serve as a guide to future investigations into the cross-type classification power of the same features.

5.5 Conclusion

In this chapter we have investigated various feature representations for discerning true

MWE usages from literal expressions that match the canonical syntactic form for MWEs in the

OpenMWE corpus. We used a supervised machine learning classification approach to determine the contribution of a variety of families of features including linguistically motivated features and features borrowed from word sense disambiguation techniques. We focused in particular on the extent to which morpho-syntactic variation — which has proven useful for identification of MWE types (Cook et al. 2007) — can provide sufficient information to identify MWE tokens.

This study is motivated by the needs of the web glossing application described in Chap-

ter 4 which identifies dictionary headwords in submitted text: for this application it would be desir-

able to identify MWE tokens only where they are actually relevant. If this can be done using morpho-syntactic variation alone, it could be performed as part of a preprocessing and tagging stage rather than a more compute-intensive semantic analysis stage. Ideally it would be applicable to MWE types for which no labelled instances exist, so in this chapter we pay particular attention to generalisable

models of token candidate disambiguation.

We found that when classifying instances of types unseen in the training data, lexical fea-

tures of the type including high level semantic categories of its constituents could improve perfor-

mance over a corpus majority class baseline. An even greater improvement could be found by use of

a type flexibility signature derived from morpho-syntactic variation rules encoded in the Japanese

dictionary of multiword expressions. However, our type level features alone were not enough to approach the performance of a type-specialised majority class baseline. The inclusion of token-level features describing specific morpho-syntactic variations improved performance somewhat, but again it was not close to achieving the type-specialised baseline. Only with features traditionally used for WSD such as bag of words and locally offset or syntactically related words were our models able to distinguish true MWE usages of types not seen in the training data at a rate equivalent to a supervised type-specialised majority class baseline. This is an interesting result because the features are traditionally used to model the semantics of a word or document but there is not an implicit common semantics between the literal interpretations of different MWE types. In future work we would like to investigate further why the WSD inspired features perform so well at cross-type classification.

5.5.1 Implications for Wakaran

Our best cross-type classifiers achieved an accuracy of under 75%, which leaves a lot to be desired where user applications are concerned. When testing classifiers on instances of types that were seen in the training set, accuracies approaching 89% were seen, especially when combining the use of WSD inspired features and idiom features from Hashimoto and Kawahara (2009) with our own extensions. This is much closer to a desirable level of performance, and our investigation of training set size indicates that further gains could be achieved by collecting more labelled samples. However, the coverage over which this performance can be achieved is very low because it is limited to MWE types with labelled training instances numbering in the order of 1000 or more.

Wakaran, the crowdsourcing application developed for this thesis, includes features allowing users to select competing glosses from a list of glosses for a single lexical unit. Interactions with these features are its primary mechanism for learning about how students interpret the senses of words in context. In Wakaran, gloss promotion to inline text does enforce mutual exclusivity between a MWE and its constituents, but the enforcement is indirect and the choice is not explicitly presented in the user interface the way a sense selection choice is presented. Although it may be possible to extend the interactive features to glosses that compete on different lexical units, the implementation and user interface would be more complicated than they are now. Future work on the user interface should investigate ways of making this choice more explicit.

As Wakaran stands, the most straightforward way of integrating candidate disambiguation would be to process tokens down to a single gloss by either eliminating a MWE gloss or eliminating the glosses for all of its constituents. Removal of this information is risky because the consequences of getting the classification wrong are that relevant glosses are eliminated entirely from consideration. In this chapter we have concluded that morpho-syntactic flexibility features alone will not achieve the levels of accuracy needed. The cost of training type-specialised classifiers with a high enough coverage of MWE types for general purpose glossing is prohibitively high, which leaves the cross-type classifiers with an accuracy of 75%. We consider this too low to justify eliminating glosses completely. Therefore we conclude that it is not appropriate to use the methods of this chapter in Wakaran to filter out MWE dictionary entries, nor to filter out literal entries for the constituents. Further investigation into lexical semantic approaches to token candidate disambiguation should be pursued in future work before such an integration occurs.

We chose for this thesis to prioritise investigating integration of WSD with Wakaran over integrating candidate disambiguation. A WSD integration has a more transparent user interaction mechanism and, since it will not remove information irrevocably from the output, accuracy considerations are less stringent. The next chapter includes evaluation of high coverage WSD algorithms for the JWordNet lexical inventory and Chapter 7 describes implementing WSD for the JMDict gloss inventory, integrating WSD with Wakaran and studying user interactions with senses from a WSD perspective.

Chapter 6

Connecting lexical resources to free text

Acquisition of manual semantic annotations for text from which to learn about language can be costly in terms of human effort, in response to which a number of projects have sought to bootstrap new resources for one language from a translation of an existing resource in English

(Bentivogli et al. 2004; Lupu et al. 2005). In this chapter we describe the transfer of manual sense annotations from the English SemCor corpus to a Japanese translation and use the resulting gold-standard annotations to evaluate resources for knowledge-based word sense disambiguation of

Japanese. Along the way we pay particular attention to an oft-neglected prerequisite for WSD: the initial identification of lexical knowledge base entries in free text. In Chapter 2 we defined the task of WSD in terms of selecting concepts described in a lexical knowledge base to best describe the meaning of a word token in free text. However, the definition makes an assumption that is not necessarily met by naturally occurring texts: that the text comes with an unambiguous tokenisation into word types found in the available lexical knowledge base. In Chapter 5 we investigated violations of that assumption introduced by certain ambiguous multiword expressions with idiomatic meanings. However, idioms are not the only source of multi-token ambiguity in Japanese: we will see in this chapter that there can be quite a discrepancy between the coarseness of a morpho-syntactic lexicon suitable for token segmentation and a semantic lexicon for which one might perform WSD.

The main causes for this discrepancy are the lack of whitespace segmentation and that Japanese is

highly agglutinative/synthetic: this means that complex compounds are both common and undelimited from surrounding text.

A major focus of this chapter will be on translating text tokenisations from a fine grained lexicon with morpho-syntactic resources to a coarser grained lexicon with semantic resources. This is a particular challenge for synthetic languages or those without explicit word boundaries, but even in the English language we may face many choices when enumerating words into a lexicon. We saw in Section 3.1.2 that compounding of words may result in new lexicalisations such as walking stick that may warrant their own entry in a dictionary – depending on the purpose of the dictionary of course. Chapter 5 explores the potential inclusion of even longer phrasal constructions in the lexicon and how to recognise when the appearance of an idiom's constituent words represents the lexicalised phrase and when it does not. In Section 3.2 we saw that the Japanese language, lacking explicit word boundaries, presents an even greater difficulty in deciding precisely what constitutes an entry in the lexicon. The unidic and IPAdic morpheme lexicons disagree in many places on the basic lexical granularity already, but compounding is common in Japanese so dictionaries like JWordNet may choose to recognise lexicalisations compounding several IPAdic morphemes. For example, the sequence of three IPAdic morpheme tokens kyōsan, shugi and sha may combine in many ways to form JWordNet entries: shugi "doctrine", shugisha "ideologue", kyōsanshugi "communism" and kyōsanshugisha "communist". Finally, new words are constantly entering general usage, meaning that a lexicon can become outdated (O'Donovan and O'Neill 2008). General purpose tokenisers and morphological analysers (such as MeCab, trained on unidic or IPAdic for Japanese) are a powerful tool for normalisation of inflection but they do not directly discover compound and MWE entries from a special purpose lexicon of interest. In this chapter we will study the problem of mapping between different lexicons, particularly monolingual mappings to different granularities.

The ultimate goal of the chapter is to bridge the gap between free text and the traditional conception of WSD on a tokenised, word-type annotated text. We applied the methods developed in this chapter to build a Japanese WordNet tokenisation for a free text translation of SemCor, which formed the basis of the official Japanese Semcor release. Transferring senses from SemCor to our tokenised Japanese Semcor gave us a sense annotated corpus of Japanese text. We then used this annotated corpus to evaluate select resources for knowledge-based WSD methods. This chapter also lays the groundwork for identifying tokens of word types in the Japanese-English translation dictionary Japanese Multilingual Dictionary (JMDict) which is less well resourced for knowledge-

based WSD. The Wakaran Glosser which appeared in Chapter 4 uses the algorithms developed in this chapter for identification of words to gloss with JMDict entries. Chapter 7 develops methods and resources for performing WSD relative to the JMDict lexicon and sense inventory.

6.1 Related work

Landes et al. (1998) isolated WordNet lexemes in the Brown Corpus using the

Brill Tagger (Brill 1994). They did so by means of some preprocessing for MWEs coupled with customisation of the tagger to account for the preprocessing. The overall process was relatively straightforward, leveraging the English language’s advantage of having analytic morphology and whitespace to guide tokenisation:

1. Additional whitespace was inserted between words and punctuation.

2. MWEs that could be found in WordNet were collapsed into single tokens by replacing white-

space with the underscore character. A suitable POS tag based on the WordNet entry was

appended to the token.

3. The text was then tokenised by splitting on whitespace.

4. Finally a Brill Tagger, modified to make use of the POS information appended to MWEs, was

run over the text to get indicative POS tags for each token.
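A minimal Python sketch of steps 1 to 3 follows, assuming a hypothetical mapping wordnet_mwes from each known MWE to a POS tag derived from its WordNet entry; step 4, the modified Brill tagging, is omitted.

    import re

    def preprocess(text, wordnet_mwes):
        # 1. Insert whitespace between words and punctuation.
        text = re.sub(r"([.,;:!?])", r" \1 ", text)
        # 2. Collapse each WordNet MWE into an underscore-joined token and
        #    append a POS tag based on its WordNet entry.
        for mwe, pos in wordnet_mwes.items():
            text = text.replace(mwe, mwe.replace(" ", "_") + "/" + pos)
        # 3. Tokenise by splitting on whitespace.
        return text.split()

    # preprocess("He stopped at the traffic light.", {"traffic light": "NN"})
    # -> ['He', 'stopped', 'at', 'the', 'traffic_light/NN', '.']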

Our method for dealing with tokenisation and MWEs must be far more robust in the absence of whitespace delimitation and presence of complex compounds. Instead of splitting on whitespace during preprocessing we tokenise using an existing morphological model to identify word boundaries. Whitespace delimitation also encourages the English tokenisation to agree in granularity with the WordNet lexicon; for Japanese we get a much higher count of MWEs for the JWordNet lexicon relative to the tokenisation. This results in a high level of ambiguity introduced by inflected tokens and overlapping matches for MWEs. To deal with this we developed algorithms to solve the problem of resolving the assignment of lexicon headwords to textual tokens in a one-to-many fashion.

MultiSemCor is a translation of SemCor into Italian with sense annotations transferred from the original English via a translation dictionary (Bentivogli et al. 2004). The translators of

MultiSemCor were asked to explicitly join Italian MWEs in their translation according to WordNet conventions to alleviate the need for MWE identification. However, the tool used for word align- ment between the Italian and English SemCor resources, called KNOWA, did also make use of an automatic MWE identification preprocessor. KNOWA is in fact a general purpose Italian-English word alignment system that makes use of information from cross-lingual lexical resources (Pianta and Bentivogli 2004). It uses morphological and MWE identification preprocessors to generate a ranked list of candidate lemmas and then seeks translations of those candidates in a general purpose

English-Italian/Italian-English dictionary. The candidate lemmas are ranked using a heuristic based on their part of speech and the word alignment is designed to prefer translations of high ranked lem- mas. The alignment part of the KNOWA algorithm takes SemCor tokens and performs an expanding search starting at the corresponding position in the Italian translation of the sentence for a transla- tional candidate lemma match. It has a fallback mode that searches for matches based on graphemic similarity between English and Italian words. In the work of this chapter, we also transfer senses from SemCor to its Japanese translation by aligning ranked candidate lemmas. Our work has the following differences to the construction of MultiSemCor:

1. We solve MWE identification jointly with word alignment and seek to identify MWEs at the

precise granularity of our target lexicon JWordNet. In the process, we must resolve high levels

of ambiguity in overlapping matches of lexical knowledge base lemmas to text.

2. We have access to an alignment of SemCor tokens to JWordNet lemmas specified by translators during translation of the text. Thus we can make alignments that are more closely

supervised than those used by MultiSemCor.

3. Unlike Italian, Japanese word order and orthography are very different to English, so we can

not make use of similar sentence positions or graphemic similarity to guide alignment. As

such, we only align and transfer senses from SemCor tokens selected by the translators.

Poznanski et al. (1998) also uses a ranked list of candidate lemmas for a cross-lingual word alignment but for the purpose of glossing source language text with target language dictionary entries. In this case, there is no target language sentence to align to but rather a lexicon of target language words and MWEs. They define the glossing task as a simplified version of lexicalist machine translation: tokens in the source text are matched to translational equivalencies in a bilingual lexicon; however, unlike sentence translation, the target language terms are not assembled into a grammatical sentence. Nevertheless, some POS and MWE ambiguity occurs in the source language and

Poznanski et al. (1998) proposes a method for dealing with it called prioritised tiling. Prioritised tiling is performed in two stages:

1. Identification of (potentially MWE) candidate lemmas.

2. Selection of a non-overlapping subset of candidate lemmas.

The identification step uses a morphological tagger with sufficient granularity to identify dictionary forms of MWEs which have been inflected or conjugated in the text. Candidate lemmas are gener- ated by taking potentially non-contiguous sequences of morphemes from the text. In this chapter we outline a general form for this step of the algorithm which also includes provision for varying the form of individual morphemes. We also specify detailed implementations for specific applications.

The second step, selection of non-overlapping candidate lemmas, is the tiling step. Poznanski et al.

(1998) rank candidate lemmas using a priority function and then build a non-overlapping tiling of the source sentence by greedy selection of candidates from the ranking. The priority function is a lexicographic sort key based on five levels of comparison:

1. number of source language morphemes;

2. number of intervening morphemes;

3. tagger probabilities on source language morphemes;

4. position of rightmost morpheme; and

5. frequency of target language match.
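The sketch below shows one way this could be realised in Python; the candidate attribute names and the direction of each comparison are illustrative assumptions, since only the five levels of the sort key are specified.

    # Rank candidates by a lexicographic sort key, then greedily tile the
    # sentence with the highest-priority non-overlapping candidates.
    def priority_key(c):
        return (-c.n_morphemes,         # 1. prefer longer source matches
                c.n_intervening,        # 2. prefer contiguous matches
                -c.tagger_probability,  # 3. prefer confident morpheme tags
                c.rightmost_position,   # 4. tie-break on sentence position
                -c.target_frequency)    # 5. prefer frequent target matches

    def prioritised_tiling(candidates):
        covered, tiling = set(), []
        for cand in sorted(candidates, key=priority_key):
            span = set(range(cand.start, cand.end))
            if not span & covered:
                tiling.append(cand)
                covered |= span
        return tiling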

As with the first step, we explore in this chapter a general form of this step of the algorithm and develop specific variants for different applications.

After tokenising the Japanese translation of SemCor and performing a sense annotation

transfer, we use the resulting JSemcor for an evaluation of lexical knowledge base resources for

Personalised PageRank based WSD. Moro et al. (2014) use the same lexical knowledge base, Ba-

belNet, together with a WSD algorithm they call Babelfy, which jointly solves MWE identification

with WSD and expands the task to include linking encyclopaedic entities to text. Identification of

multiword expression token candidates is achieved by extracting 1-5-grams from pre-tokenised text

and filtering for exact matches to lemmas in the lexicon (substring matches are allowed for named

entities). We use a similar approach but expand the possible matches by using a model of inflection

for the tokenisation lexicon. This allows us to extract normalised lemmas such as the compound

inshōzukeru "to impress" from its inflected usage inshōzukerareru "to be impressed" in sentence

31 of the Japanese translation of SemCor document br-f03. Babelfy¹ only manages to extract the

exactly matching constituent inshō "mental image" when given the same sentence; it does not match zuke[ru|rareru] "to [be] attach[ed]" at all. Our method also allows for solving MWE identification jointly with WSD and in fact we do so when transferring gold standard senses from SemCor to

its translation in Japanese. However, when we evaluate LKB based WSD on JSemcor we fix the

tokenisation to that produced by the sense annotation stage so as to make use of the gold standard

labels. Moro et al. (2014) include comparison to a variety of other WSD methods including Personalised PageRank for BabelNet. The comparison is performed on a variety of WSD tasks for a variety of inventories including WordNet, Wikipedia and BabelNet, but both Babelfy and Personalised PageRank use the BabelNet concept graph regardless of the inventory specified by the task.

¹http://babelfy.org accessed 2016-06-09

In this chapter we use a single evaluation task by varying the lexical knowledge base concept graph used to measure the change in WSD performance due to augmentations of the lexical knowledge base. Note that we use Personalised PageRank to perform graded WSD annotation and evaluation whereas Babelfy is inherently limited to single best sense WSD.

The Personalised PageRank WSD stage of this chapter is most closely related to Agirre and Soroa (2009) and Agirre et al. (2014), who released the UKB toolkit² for Personalised PageRank that we use for our experiments. They compared the performance results of using Personalised PageRank with a number of extensions of WordNet to include links based on supervised or unsupervised sense annotations on WordNet glosses. Their findings showed that the extension of

WordNet with unsupervised but higher quantity concept relations gave better results. We conclude this chapter by comparing a WordNet-with-gloss lexical knowledge base network to the much larger extensions in WordNet++ and BabelNet.

6.2 Re-tokenising text

In this chapter we seek to connect concepts described in the lexical resources JWordNet and

JMDict to expressions in arbitrary text. However, these lexical resources lack an encoding of morpho-syntactic rules for their entries that would allow us to identify conjugated forms such as playing for the entry of to play. Even for exact textual matches, morpho-syntactic information would be needed to disambiguate between homonyms with POS differences such as play and saw which may have entries for both Noun and Verb to distinguish

(38) I saw_V a play_N last night.

from

(39) Don't play_V with that saw_N.

²See http://ixa2.si.ehu.es/ukb/

Lexical resources with Japanese morpho-syntactic information do exist to solve this problem in the form of IPAdic and unidic. However, their lexicons differ in granularity: slightly from each other but significantly from the lexicons of the semantic resources JWordNet and JMDict. For example,

the word 高等学校 (kōtōgakkō) "Senior High School" in JMDict comprises two tokens from the IPAdic lexicon: 高等 (kōtō) "advanced" and 学校 (gakkō) "school". In this section we

outline the general form of an algorithm for mapping from a tokenisation using one fine grained

source lexicon with morpho-syntactic information into a tokenisation of a coarse grained target

lexicon. We develop the algorithm for a specific set of Japanese electronic dictionaries but the same

method could be applied, for example, to a dictionary of idiomatic multiword expressions in other

languages.

6.2.1 Token candidate generation

In this section we describe an algorithm RETOKENISE which takes a source lexicon to-

kenisation of some text as input and returns a set of potential target lexicon tokens for the text. We

solve the problem of mapping source lexicon tokens to target lexicon token candidates via a token

compounding and inflectional variation function of the source lexicon. Formally, we assume that

we have access to a function LEXICONMAP which takes a sequence of source lexicon tokens to be

compounded and uses morphological knowledge encoded in the source lexicon to predict dictionary

forms for a compound of the tokens in the target lexicon. LEXICONMAP must return at least one

predicted candidate lemma (l-candidate), but may return multiple candidate lemmas in the case

of ambiguity. In the next section we will explore specific implementations of LEXICONMAP. After

that, Section 6.2.3 presents algorithms for dealing with ambiguity introduced by over-generation of

target lexicon candidate tokens (t-candidates).

The full implementation of the RETOKENISE function appears in Algorithm (1).

Algorithm 1 Algorithm for generating candidate tokens in a target lexicon from a morphologically rich source lexicon tokenisation.
Require: Function LEXICONMAP(tokens) compounds a sequence of source lexicon tokens into one or more potential target lexicon lemmas.
Require: Tokens is a sequence of tokens tagged with entries in the source lexicon.
Require: L is the target lexicon: a set of lemmas.
function RETOKENISE(Tokens, L)
    limit ← max{|lemma| | lemma ∈ L}
    token-candidates ← ∅
    for i ← 1..|Tokens| do
        j ← i
        repeat
            ts ← ⟨Tokens_i ... Tokens_j⟩
            lemmas ← {l ∈ LEXICONMAP(ts) | |l| ≤ limit}
            token-candidates ← token-candidates ∪ {⟨i, j, lemma⟩ | lemma ∈ lemmas}
            j ← j + 1
        until j > |Tokens| or lemmas = ∅
    end for
    return {⟨i, j, lemma⟩ ∈ token-candidates | lemma ∈ L}
end function

The basic idea of the RETOKENISE algorithm is to iterate over subsequences of the source lexicon tokenisation, applying LEXICONMAP to produce candidate target lexicon tokens aligned to the source tokens. In this version of the algorithm we apply LEXICONMAP only to contiguous subsequences of

the source lexicon tokenisation. However, the algorithm could be modified to generate skip-grams

– sequences with items removed or replaced by a wildcard – as was used by Poznanski et al. (1998)

in their preprocessing for prioritised tiling. Alternatively, implementations of LEXICONMAP could

use knowledge of the target lexicon to selectively ignore tokens in their input n-gram.
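As a concrete illustration, the following is a Python transcription of Algorithm (1); lexicon_map stands in for an application-specific LEXICONMAP, and the target lexicon is modelled, for the purposes of the sketch, as a plain set of lemma strings.

    def retokenise(tokens, target_lexicon, lexicon_map):
        limit = max(len(lemma) for lemma in target_lexicon)
        candidates = set()
        for i in range(len(tokens)):
            j = i
            while j < len(tokens):
                # Predict target lexicon lemmas for the compound of
                # tokens[i..j], discarding any longer than the longest headword.
                lemmas = {l for l in lexicon_map(tokens[i:j + 1])
                          if len(l) <= limit}
                candidates |= {(i, j, lemma) for lemma in lemmas}
                if not lemmas:  # every prediction exceeded the limit: prune
                    break
                j += 1
        # Final filtering step: keep only lemmas found in the target lexicon.
        return {(i, j, lemma) for (i, j, lemma) in candidates
                if lemma in target_lexicon}

On the input of Figure 6.1, a lexicon_map that concatenates predicted dictionary forms would generate candidates such as kaigishugi, with illegitimate strings like shugishawa discarded in the final filtering step.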

Figure 6.1 illustrates with an example how RETOKENISE generates candidate tokens and

filters them down by matching lexicon headwords. Figure 6.1a shows the initial input to RETO-

KENISE for this example: it is the first part of a sentence from JSemcor that has been segmented

with the IPAdic lexicon. Figure 6.1b collects data from two stages in the algorithm’s execution:

subsequences collects subsequences of the tokenisation that are passed to LEXICONMAP in the

inner loop; token-candidates represents the candidate tokens that are generated using the can-

didate lemmas returned by LEXICONMAP. In this example, LEXICONMAP has returned only

(40) 懐疑 主義 者 は この 驚く べき 夢 の 現象 を ... kaigi shugi sha wa kono odoroku beki yume no genshō o ... doubt doctrine person TOPIC this surprise very dream GEN phenomena OBJ ... "Sceptics [may deny] the more surprising phenomena of dreams..."

(a) The input has been tokenised and POS-tagged by a morphological analyser.

subsequences = ⟨kaigi⟩, ⟨kaigi, shugi⟩, ⟨kaigi, shugi, sha⟩, ..., ⟨shugi⟩, ⟨shugi, sha⟩, ...

token-candidates = { ⟨1, 2, kaigi⟩, ⟨1, 3, kaigishugi⟩, ⟨1, 4, kaigishugisha⟩, ..., ⟨2, 3, shugi⟩, ⟨2, 4, shugisha⟩, ⟨2, 5, shugishawa⟩, ... }

(b) Substring inputs passed to LEXICONMAP and candidate tokens generated from the candidate lemma results.

token-candidates = { ⟨1, 2, kaigi⟩, ⟨1, 3, kaigishugi⟩, ⟨1, 4, kaigishugisha⟩, ⟨2, 3, shugi⟩, ⟨2, 4, shugisha⟩, ⟨4, 5, wa⟩, ⟨5, 6, kono⟩, ... }

(c) Candidate tokens returned by RETOKENISE must have a lemma that exists in the target lexicon.

Figure 6.1: Illustration of stages in RETOKENISE execution on an example from the start of sentence 22 in JSemcor document br-f03.

LEXICONMAP but before the algorithm terminates it filters the list of candidate lemmas to contain only lemmas found in the dictionary, as seen in Figure 6.1c. Note that a great many overlapping candidate tokens are in fact legitimate Japanese words but that shugishawa, at least, has been filtered out of the return value. The implementation details of the filtering step are omitted here and may be application dependent. For example, in the corresponding step of the prioritised tiling algorithm,

Poznanski et al. (1998) use a hash index on MWE constituents to optimise the dictionary lookup.

Filtering has been deferred to the final step of RETOKENISE as a consideration for implementation optimisation: it allows for amortised lookups of candidate lemma in batches and also allows re- peated candidates to be looked up only once. These optimisations were used in early versions of the application described in Chapter 4.

Looking again at Figure 6.1b, note that the number of input subsequences that LEXICONMAP is called on might grow very large. The total number of contiguous subsequences of a tokenisation is proportional to the square of the number of tokens. Specifically, counting every token as an end position and every earlier token as a start position gives the number of subsequences as

\[
\sum_{e=1}^{N} \sum_{s=1}^{e} 1 = \sum_{e=1}^{N} e = \frac{N(N+1)}{2}
\]

by the basic arithmetic series. However, we assume the target lexicon is a set with finite cardinality so there is an upper limit to the character length of lemmas in the target lexicon. Although we treat

LEXICONMAP as a black box for compounding and inflection, we can prune the search space of source token subsequences if we make the assumption that appending tokens to the input generates longer candidate lemmas. That is, we assume that

\[
i < j \Rightarrow \mathrm{mincandidate}(\langle tokens_1 \ldots tokens_i \rangle) < \mathrm{mincandidate}(\langle tokens_1 \ldots tokens_j \rangle)
\]

where

\[
\mathrm{mincandidate}(\langle tokens_1 \ldots tokens_i \rangle) := \min\{\, |lemma| \mid lemma \in \mathrm{LEXICONMAP}(\langle tokens_1 \ldots tokens_i \rangle) \,\}
\]

This means we can stop extending subsequences of tokens once the generated candidate lemmas ex-

ceed the longest in the target lexicon. For example, if the target lexicon for the example in Figure 6.1

contained no words longer than kaigishugi then after seeing that LEXICONMAP(⟨kaigi, shugi, sha⟩) =

{kaigishugisha} RETOKENISE could have detected that the limit was exceeded and skipped all

longer token subsequences, moving directly on to the next starting token shugi.

Even after limiting the size of candidate tokens, the RETOKENISE algorithm is potentially

very memory intensive because it generates all candidate lemmas before checking for their existence

in the target lexicon. In practice it may be preferable to perform target lexicon filtering incrementally

inside the loop. This will be a matter of trading memory for time: depending on how the target

lexicon is stored and queried, time may be saved by amortising the query costs in the final step;

otherwise, space may be saved by querying one lemma at a time as candidate lemmas are returned

by LEXICONMAP. A compromise in this trade-off could be struck by heuristic filtering — such as with a

Bloom filter (Bloom 1970) — within the candidate generation loop itself. We address some of these

issues in a specialised variant of RETOKENISE in Section 6.4.1. An approach that eliminates the

generate-and-filter model completely is introduced in Section 6.4.2.
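To make the heuristic filtering option concrete, the sketch below implements a small Bloom filter over the target lexicon lemmas; it can reject most spurious candidate lemmas cheaply, with no false negatives, before any expensive dictionary query is made. The sizing parameters are illustrative only.

    import hashlib

    class BloomFilter:
        """Membership pre-filter: no false negatives, rare false positives,
        so a positive answer still requires a real lexicon lookup."""
        def __init__(self, items, n_bits=1 << 20, n_hashes=3):
            self.n_bits, self.n_hashes = n_bits, n_hashes
            self.bits = bytearray(n_bits // 8)
            for item in items:
                for pos in self._positions(item):
                    self.bits[pos // 8] |= 1 << (pos % 8)

        def _positions(self, item):
            # Derive n_hashes independent bit positions from salted digests.
            for i in range(self.n_hashes):
                digest = hashlib.md5(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.n_bits

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

Inside the candidate generation loop, a lemma would then be queried against the stored target lexicon only when the Bloom filter reports a possible match.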

The RETOKENISE algorithm we have just introduced performs two simple high level tasks and delegates application and lexicon specific details to the LEXICONMAP function. The high level tasks performed by RETOKENISE are:

1. choosing subsequences of the source lexicon tokenisation to attempt to match to the target lexicon.

2. aggregation and bulk lookup of target lexicon headword candidates formatted by LEXICONMAP.

Implementation details specific to the source and target lexicons can be encapsulated within specialised implementations of LEXICONMAP. Thus, RETOKENISE forms the common core to a number of applications in this chapter.

Algorithm 2 Map a sequence of tokens (from a source lexicon) into a set of predicted compound lemmas in a target lexicon.
Require: Function HEADINDEX(tokens) returns the (index of the) best choice of syntactic head for a token sequence.
Require: Function BASES(token) returns a sequence of possible source lexicon lemmas for token.
Require: Function SURFACE(token) returns the original text of the document.

function HEADHEURISTICMAP(Tokens)
    result ← ∅
    i ← HEADINDEX(Tokens)
    bases ← BASES(Tokens_i)
    surface ← ⟨SURFACE(Tokens_j) | 1 ≤ j ≤ |Tokens|⟩
    prefix ← CONCAT(⟨surface_j | 1 ≤ j ≤ i − 1⟩)
    suffix ← CONCAT(⟨surface_j | i + 1 ≤ j ≤ |Tokens|⟩)
    for base ∈ bases do
        result ← result ∪ {prefix · base · suffix}
    end for
    return result
end function
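For illustration, a minimal Python transcription of Algorithm 2 might look as follows (a sketch only: head_index, bases and surface stand in for the three source-lexicon-specific functions the pseudocode requires, and whitespace-free concatenation is assumed, as is appropriate for Japanese):

    def head_heuristic_map(tokens, head_index, bases, surface):
        """Predict target lexicon lemmas for a source token sequence by
        lemmatising only the syntactic head token."""
        i = head_index(tokens)              # position of the head token
        surfaces = [surface(t) for t in tokens]
        prefix = "".join(surfaces[:i])      # tokens left of the head keep
        suffix = "".join(surfaces[i + 1:])  # their surface form, as do
                                            # tokens right of the head
        return {prefix + base + suffix for base in bases(tokens[i])}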

6.2.2 Lexicon mapping

The LEXICONMAP function that appears as an unspecified implementation detail in RETOKENISE in Section 6.2.1 is required to make predictions of target lexicon compounds for source lexicon token sequences. In this section, we illustrate the algorithm HEADHEURISTICMAP: a simple high-level template for LEXICONMAP. The template algorithm selects a candidate syntactic head token from a sequence of tokens and maps it to its source lexicon lemma; the remaining tokens retain their source text surface form. For example (traffic, lights) will map to traffic light and (ran, a, mile) will map to run a mile. In particular, note that only one of the tokens has its inflections removed, so baker's dozens would map to baker's dozen rather than baker dozen. The purpose of preserving all but one surface form in this way is to preserve grammatical correctness of MWEs.

Pseudo-code for the general form of HEADHEURISTICMAP appears in Algorithm (2). It relies on the availability of three functions specialised to tokens of the source lexicon:

SURFACE which retrieves the token as it appeared in the source text.

BASES which extracts a sequence of source lexicon lemmas for the token in order of decreasing likelihood.

HEADINDEX which takes a sequence of tokens and selects the one most likely to be inflected if conjugating the sequence of tokens as a phrasal unit.

In Section 6.3.1 we give an implementation of HEADINDEX for Japanese that simply selects the rightmost token in the sequence, which is a good heuristic for a mostly head-final language and works sufficiently well for that application. However, for different languages and applications a more sophisticated implementation of HEADINDEX may be called for. For BASES we included the dictionary form of the token in the source lexicon but also its surface form as a potential inflection in the target lexicon. We include the surface form so as not to completely exclude exact matches to the source text. Another reason to keep the surface form is that idiomatic MWEs are often known by their characteristic inflections (see Chapter 5) and thus may appear inflected in a dictionary headword. A detailed description of a specific implementation of HEADHEURISTICMAP for the IPAdic and JWordNet lexicons appears in Section 6.3.1. For the inflected phrase inshōzukerareru "to be impressed" our implementation of HEADHEURISTICMAP returns {inshōzukerareru, inshōzukeru}, which contains the uninflected inshōzukeru "to impress" that can be found in JWordNet. On the other hand, for the inflected form oyasuminasai "good night" HEADHEURISTICMAP returns {oyasuminasai, oyasuminasaru} and in this case the surface form is found in JWordNet and the inflection-normalised form is not, because the inflected form is an idiomatic expression.

In the previous section we outlined the RETOKENISE algorithm, which requires application and lexicon specific implementation details to be provided by a function fitting the LEXICONMAP interface. The HEADHEURISTICMAP algorithm described in this section proposes a simple implementation template that may be tailored to a variety of source and target lexicon pairs satisfying certain criteria. In particular, the source lexicon is required to encode sufficient information to propose dictionary forms for inflected tokens and to propose a single token in a compound most likely to appear uninflected in a dictionary entry. This basic template serves as a guide for implementation of LEXICONMAP for specific applications and lexicons in this chapter.

6.2.3 Candidate token disambiguation

The RETOKENISE algorithm in Section 6.2 may return more than one match in the target lexicon for any given source lexicon token. Firstly, it may generate compounded candidate tokens for overlapping subsequences of the source lexicon tokens. The example in Figure 6.1b illustrates a complex case of this. As a special case, overlapping sequences may be properly nested: one subsequence fully contained within the other, which represents a case of MWE token ambiguity as explored in Chapter 5. Secondly, RETOKENISE may return more than one lemma for a single sequence of tokens due to the behaviour of LEXICONMAP. This situation will commonly arise if LEXICONMAP chooses to supply a predicted dictionary form for a compound but also supply an exact match to the surface text as discussed at the end of the preceding section. If we wish only to identify a set of target lexicon lemmas present in the text then this may be sufficient. Chapter 4 describes the implementation of an application for which it is better to retain (most) candidate tokens than provide an unambiguous tokenisation in the target lexicon. However Section 6.3, which describes the tokenisation of a Japanese translation of the SemCor corpus into JWordNet tokens, requires us to fully retokenise the text into the target lexicon so we need some way to select a non-overlapping subset of token spans from the result of RETOKENISE. Additionally we may also need to resolve ambiguity resulting from multiple matches to a single token subsequence. The results of the Japanese Semcor retokenisation, given in Section 6.3.4, will show that there were indeed many cases of ambiguity between candidate tokens. In this section we outline general algorithms for candidate token disambiguation which we will later specialise to particular applications such as the task in Section 6.3.

Candidate disambiguation for RETOKENISE

The problem of selecting an unambiguous tokenisation from the candidate tokens returned by RETOKENISE bears a thematic resemblance to the task we investigated in Chapter 5, which we called the multiword expression token candidate disambiguation task. In Chapter 5 we investigated the problem of selecting between an idiomatic multiword expression candidate token and its constituent candidate tokens. In this section we generalise the task definition slightly to include candidate tokens which overlap without either fully containing the other. As seen in Section 2.3.2, one way to approach the task is to classify MWEs in advance, identifying MWE types which always take precedence over their constituent-only interpretation, that is, MWEs which will always be relevant whenever their constituent words appear together. However, it can also be the case that a MWE type is not consistently relevant to the context in which a candidate token appears for it. In this case we need a way to determine for a specific text whether to select the MWE candidate token or to select its constituents' candidate tokens. Chapter 5 gives a sophisticated treatment of this task for Japanese MWEs using a supervised classification method and shows that it is difficult to succeed at the task without using a representation of contextual semantics such as a bag of words for the full text. However in this section we give a simple algorithm for MWE candidate disambiguation for situations where the supervised method used in Chapter 5 is impractical.

The general form of our algorithm, GREEDYMWEIDENT, is shown in Algorithm (3). It uses the same strategy as the prioritised tiling algorithm of Poznanski et al. (1998), which chooses a non-intersecting subset of candidate tokens by producing a global ranking of all candidate tokens returned by RETOKENISE and then selecting higher ranked matches in preference to lower ranked intersecting matches. GREEDYMWEIDENT requires a ranking function RANKTOKENS which sorts the output of RETOKENISE placing "better" matches before "worse" ones according to a heuristic that may be specialised to the source and target lexicons. As candidate lemmas are selected from the ranking, lower ranked candidates that intersect with the new selection are eliminated from consideration. This version of the algorithm only considers contiguous MWEs, but MWEs with intervening tokens could also be handled using the resource consumption model used by Poznanski et al. (1998).

Algorithm 3 Heuristic algorithm for resolving conflicts in output of the RETOKENISE algorithm.
Require: token candidates has the structure of the output of RETOKENISE.
Require: A function RANKTOKENS(token candidates) which orders all candidate tokens (as returned by RETOKENISE) with the most preferable first.

function GREEDYMWEIDENT(token candidates)
    ignore ← ∅
    result ← ∅
    for ⟨i, j, lemma⟩ ← token candidate ∈ RANKTOKENS(token candidates) do
        if token candidate ∉ ignore then
            result ← result ∪ {token candidate}
            ignore ← ignore ∪ {⟨i′, j′, lemma′⟩ ∈ token candidates | j′ ≥ i and i′ ≤ j}
        end if
    end for
    return result
end function

Algorithm 4 Implementation of RANKTOKENS which maps each token candidate to a feature tuple (or "key") and then ranks based on a lexicographic ordering of the keys.
Require: token candidates has the structure of the output of RETOKENISE.
Require: Function RANKKEY(token candidate) that returns a heuristic comparison key to decide the rank of a token candidate.

function RANKTOKENSBYKEY(token candidates)
    n ← |token candidates|
    sort keys ← ⟨RANKKEY(token candidate) | token candidate ∈ token candidates⟩
    sorted indices ← INDEXSORT(sort keys)
    return ⟨token candidates_{sorted indices_1} ... token candidates_{sorted indices_n}⟩
end function

A simple, general, implementation of RANKTOKENS is given in Algorithm (4). It extracts features for each candidate token into a feature tuple and then ranks them based on a lexicographic ordering of the tuples. Feature extraction is delegated to function RANKKEY which should be specialised to make use of application specific information. For example, one feature extracted by the version of RANKKEY described in Section 6.3.2 represents whether or not the candidate token lemma is in a set of hints provided by human translators. Where no special information is available, a simple implementation of RANKKEY can just promote sorting based on a longest-first earliest-first heuristic:

\[ \langle i, j, \mathit{lemma} \rangle \xrightarrow{\textsc{RankKey}} \langle i - j,\; i \rangle \]

This basic implementation prefers MWE tokens over their constituents, which is a good heuristic given that it is unambiguously the right choice for so many classes of multiwords such as noun compounds and unambiguous idiomatic expressions. Returning once again to the example in Figure 6.1 we see a complex MWE disambiguation task for source lexicon tokens ⟨kaigi, shugi, sha⟩ including target lexicon candidate tokens for kaigi, kaigishugi, kaigishugisha, shugi, and shugisha. The full ranking for these alternatives using the simple RANKKEY above is:

1. kaigishugisha “sceptic”

2. kaigishugi “scepticism”

3. shugisha “ideologue”

4. kaigi “doubt”

5. shugi “doctrine”

The first item, kaigishugisha “sceptic” is selected, eliminating all the others. If it had not existed, then kaigishugi “scepticism” would have been chosen over shugisha “ideologue”: they have the same number of tokens so the one that appears earlier in the text ranks higher.
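The interplay of the simple key and the greedy selection can be seen in a few lines of Python (an illustrative sketch using 0-indexed spans, not the thesis code):

    def rank_key(candidate):
        i, j, _lemma = candidate
        return (i - j, i)        # longer spans first, then earlier starts

    def greedy_mwe_ident(candidates):
        result, ignore = [], set()
        for cand in sorted(candidates, key=rank_key):
            if cand not in ignore:
                result.append(cand)
                i, j, _ = cand
                # eliminate every candidate whose span intersects the selection
                ignore.update(c for c in candidates if c[1] >= i and c[0] <= j)
        return result

    candidates = [(0, 0, "kaigi"), (0, 1, "kaigishugi"), (0, 2, "kaigishugisha"),
                  (1, 1, "shugi"), (1, 2, "shugisha")]
    print(greedy_mwe_ident(candidates))   # [(0, 2, 'kaigishugisha')]

Sorting the candidates by this key reproduces exactly the five-way ranking listed above.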

The GREEDYMWEIDENT algorithm described in this section provides a lexicon-independent template for selecting a target lexicon tokenisation from a ranked list of candidate tokens. All lexicon specific implementation details are encapsulated in a ranking function. In this section we additionally described a simplification of the ranking function which reduces the lexicon specific details to a function RANKKEY which maps a candidate token to a tuple of values for use in a lexicographic sort. A GREEDYMWEIDENT implementation can be used as a disambiguation step for the output of the RETOKENISE algorithm described in Section 6.2.1 to identify tokens of target lexicon headwords in the source text. A highly specialised implementation of this collaboration will be examined in Section 6.3.

Word sense disambiguation

Now let us turn to the problem of selecting between target lexicon token candidates which match the same (potentially compounded) source lexicon tokens. We do not give a solution to this problem here but we will discuss some of the implications of the task. Some potential sources of ambiguity LEXICONMAP may produce include:

derivational morphology may be handled differently in the source and target lexicons. Thus, for example, the noun running may need to be mapped to both the noun running and the verb run.

POS ambiguity may arise when the target lexicon encodes distinct entries for different parts of speech. For example, distance may have separate entries as a noun and a verb. The knowledge encoded in the source lexicon may help us resolve this ambiguity, particularly if the target lexicon uses the same POS tagset.

sense ambiguity may arise when the target lexicon encodes distinct entries for different meanings of a word even within the same POS.

Sense ambiguity is a particularly challenging problem. We explore methods for automated resolution of sense ambiguity — that is, word sense disambiguation — later in this chapter in Section 6.5 and further still in Chapter 7. For now, having arrived at the task of word sense disambiguation, we have completed the bridge promised in the opening remarks of this chapter. This section has provided a high level tour of the methods we use in the following sections and chapters for bridging the gap between free text and the conventional formalism for WSD. In the next section, we take a deep dive into the topic, by applying the methods developed so far to bootstrapping a sense annotated corpus of Japanese from a translation of the English SemCor.

6.3 Tokenisation of Japanese SemCor

In Section 3.3.1 we wrote about the SemCor project which took news texts from the Brown Corpus and, using a heuristic, segmented them into tokens from the lexicon of WordNet. We additionally wrote, in Section 3.3.3, of existing work translating SemCor into Japanese for the Japanese Semcor project. The goal of the project was to transfer WordNet sense annotations from SemCor onto the translated Japanese Semcor via synset links to JWordNet. The translators for Japanese Semcor encoded novel information intended to aid in sense transfer, in the form of a partial mapping from English tokens to JWordNet lexicon entries. However the raw translation data did not include a tokenisation of the Japanese text into the JWordNet lexicon, let alone a word alignment to finish the sense transfer. In this section we describe the completion of that sense transfer and word alignment.

Figure 6.2 shows the interface used by translators when translating English SemCor text into Japanese. Its main feature is the text box for entry of the Japanese translation as free text. Above the text box appears the line of English SemCor to be translated. English tokens with WordNet sense annotations are highlighted; below the text box these terms appear alongside the JWordNet synonyms that are linked to the same synset. If the translator clicks on one of the JWordNet synonyms its text is inserted into the translation and a record is made that the single word translation was used. However, the full text translation is not marked up in any way to indicate the location in the text of the translated term, so if the text is later deleted, reused or edited an ambiguous token alignment may arise. Nevertheless, this translator click-through record provides a rich source for supervised sense annotation transfer. The tokenisation of Japanese SemCor described in this section was designed around enabling this supervised transfer and resolving any ambiguities that arose.

Figure 6.2: Interface used by translators to develop Japanese translation of SemCor text (Bond et al. (2012), Figure 2, page 3).

To complete the sense annotation transfer for Japanese SemCor we needed to tokenise the Japanese Semcor text into JWordNet tokens. To achieve this we first tokenised the Japanese texts into IPAdic tokens using the MeCab tokeniser and then applied an implementation of the RETOKENISE algorithm outlined in Section 6.2 to produce the JWordNet tokenisation. For this task we designed an implementation of LEXICONMAP based on the HEADHEURISTICMAP template from Section 6.2.2. This application calls for a fully disambiguated tokenisation of the Japanese text into JWordNet tokens, so a MWE candidate disambiguation step is required. Our implementation of this step resembles the implementation of GREEDYMWEIDENT given in Section 6.2.3 in that it ranks candidate tokens returned by LEXICONMAP across a whole sentence, selecting the best first and eliminating lower ranked collisions. The ranking is produced using a richer feature tuple than the one given in Section 6.2.3, the most notable addition being a feature indicating whether the target lexicon (JWordNet) candidate lemma appears in the translator's click-through data as a word translation. Such words are greatly preferred over any candidate tokens they may conflict with because:

1. there is a transferable sense annotation available for them; and

2. the translator click-through is a human judgement saying that the JWordNet lemma appears somewhere in the translated sentence.

In the remainder of this section, we go into the details of how we designed the RETOKENISE implementation for building a JWordNet tokenisation of Japanese Semcor, and the steps involved in transferring sense annotations from SemCor to Japanese Semcor.

6.3.1 Lexicon mapping for Japanese

Before we can segment the Japanese Semcor texts into JWordNet tokens we need some way of normalising inflected words to the form in which they appear in the lexicon. We can do this with one of many freely available tools, each of which will tokenise according to its own dictionary. Juman and MeCab are popular candidates for Japanese, but their dictionaries are more fine grained than JWordNet, which contains many compounds of shorter lemmas from, for example, the lexicon of IPAdic which we chose for use with MeCab. On the other hand, chunking parsers like KNP and CaboCha produce a coarser grained lexicon but do not explicitly map to JWordNet's precise lexicon, so using them to tokenise Japanese Semcor would still risk a granularity mismatch with the target JWordNet dictionary. By using MeCab with a finely grained lexicon to pretokenise Japanese Semcor and building compounds with our RETOKENISE algorithm we achieve higher coverage for JWordNet tokenisation and sense transfer than by using a general purpose tokeniser or parser alone. However, we do need to specialise the general template of LEXICONMAP given in Section 6.2.2 to build candidate lemmas for JWordNet out of IPAdic token compounds.

Since the purpose of this task was to create a freely available resource we chose MeCab for our tokeniser using the IPAdic lexicon since both are freely available and popular. The account of the Japanese language and text processing tools to deal with it given in Section 3.2 then guides the implementation of LEXICONMAP. The head-final property of Japanese is particularly useful: it makes the HEADHEURISTICMAP from Section 6.2.2 a good heuristic for generating compound lemmas. The implementation of the HEADINDEX function follows immediately from this decision: it simply returns the index of the last token:

\[ \mathit{tokens} \xrightarrow{\textsc{HeadIndex}} |\mathit{tokens}| \]

In Section 6.2.3 we noted that inconsistent treatment of derivational morphology between the source and target lexicon may lead to multiple target lexicon candidates for a token sequence. It turns out that Japanese is such a case because many compound nouns have constituents that are inflected verbs. For example, MeCab sees the word tsumekiri "nail clippers" as two tokens in IPAdic: the noun tsume "nail" and a nominalisation kiri "cutting" from kiru "to cut". Thus for our present implementation of BASES we return not just the lemma of the token — which would be kiru from our example — but the surface form as well (kiri). Finally we note that for Japanese there is no need to consider whitespace when concatenating tokens to form compounds: the operation can use simple string concatenation. Thus for the example of tsumekiri the tokenisation is ⟨tsume, kiri/kiru⟩ and the HEADHEURISTICMAP implementation may be used to generate the following token candidates for the target lexicon:

1. ⟨1, 1, tsume⟩,

2. ⟨1, 2, tsumekiri⟩,

3. ⟨1, 2, tsumekiru⟩,

4. ⟨2, 2, kiri⟩, and

5. ⟨2, 2, kiru⟩

This allows the RETOKENISE algorithm to identify each of tsume "nail", kiru "cut" and the compound tsumekiri "nail clippers" as candidate tokens in the target JWordNet lexicon.
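Concretely, the Japanese specialisation can be sketched in a few lines of Python (hypothetical Token type; the real implementation also handles the augmented lexicon described next):

    from collections import namedtuple

    Token = namedtuple("Token", "surface lemma")

    def head_index_ja(tokens):
        return len(tokens) - 1        # head-final: inflection is on the last token

    def bases_ja(token):
        return {token.lemma, token.surface}   # dictionary form plus surface form

    def lexicon_map_ja(tokens):
        i = head_index_ja(tokens)
        prefix = "".join(t.surface for t in tokens[:i])      # plain concatenation,
        return {prefix + base for base in bases_ja(tokens[i])}  # no whitespace

    tsume, kiri = Token("tsume", "tsume"), Token("kiri", "kiru")
    print(lexicon_map_ja([tsume, kiri]))   # {'tsumekiri', 'tsumekiru'}
    print(lexicon_map_ja([kiri]))          # {'kiri', 'kiru'}

Driven by the subsequence loop of RETOKENISE, these calls yield exactly the candidate list above.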

Finally, we wish to tokenise the full text even if it contains content words that have no entry in JWordNet. To achieve this we expand the target lexicon for RETOKENISE to include both JWordNet and the lexicon of the tokeniser:

\[ L := L_{\mathrm{JWordNet}} \cup L_{\mathrm{MeCab}} \]

where L_JWordNet is the Japanese WordNet lexicon and L_MeCab is the set of IPAdic dictionary forms.

6.3.2 Candidate token disambiguation for Japanese SemCor

In the previous section we described an implementation of LEXICONMAP that normalises inflections and generates compounds of tokens of the IPAdic lexicon as candidates for tokens from the JWordNet lexicon. However, LEXICONMAP over-generates candidates so there may be multiple candidate tokens intersecting on many spans of text. To produce our final JWordNet tokenisation of Japanese Semcor we therefore needed to select a subset of non-intersecting candidate tokens. We used a variant of the GREEDYMWEIDENT algorithm (Algorithm (3) in Section 6.2.3). The basic idea of GREEDYMWEIDENT is to rank candidate tokens and then greedily select a non-intersecting subset. We used the candidate token ranking function RANKTOKENSBYKEY shown in Algorithm (4) with an implementation of RANKKEY specialised to the sense transfer task for Japanese Semcor. Each candidate token ranking key is a feature tuple and they are compared lexicographically, meaning that candidate tokens are sorted by the first extracted feature and only if two candidate tokens have the same value for the first feature is the next feature consulted (and so on). The implementation of RANKKEY we used for Japanese Semcor extracts the features described in Section 6.2.3 and extends them with additional features to make use of information provided by the translators while translating SemCor into Japanese. Thus, for the JWordNet tokenisation of JSemcor, RANKKEY returns a tuple of features answering the following questions:

1. Is the target lexicon lemma in the set of Japanese lemmas selected by translators during translation? (Yes is preferred.)

2. How many source lexicon tokens is the match made of? (Longer is preferred.)

3. What position is the first token in the text? (Earlier is preferred.)

4. Is the target lexicon lemma actually in JWordNet, or is it part of the IPAdic lexicon augmentation? (JWordNet is preferred.)

5. Does the target lexicon lemma match the base form or surface form of the head token in the match? (Surface is preferred since this exactly matches the text.)

Feature 1 gives the primary preference to matches to a lemma in the translator click-through data. The next two features, 2 and 3, mean that when potential matches overlap precedence is given first to longer matches (e.g., 米国政府 (beikoku-seifu) "US Federal Government" is chosen over 政府 (seifu) "government") and then to earlier matches (e.g. 近代化 (kindaika) "modernisation" is chosen over 化する (ka-suru) "to change" where they intersect in the out-of-vocabulary 近代化する (kindaika-suru) "to modernise"). Feature 4 will only ever be false on single token matches to L_MeCab, ensuring that the source lexicon fallback candidate tokens always have lower rank than any identical lemmas in JWordNet. The final feature, 5, is designed to prefer compound words formed via derivational morphology. It only comes into play when comparing matches to L_JWordNet, since token surface forms are not included in L_MeCab and a JWordNet lemma will always be preferred to the source lexicon augmentation. The full RANKKEY implementation for the JSemcor sense transfer task is given in Algorithm (5). Note that its input domain has been augmented to allow for the additional features. The RANKTOKENS and GREEDYMWEIDENT functions must be similarly augmented to make allowance for this.

Algorithm 5 Implementation of RANKKEY for selection of JWordNet and MeCab tokens for Japanese Semcor.
Require: L_translation is the set of JWordNet lemmas indicated by the translators.
Require: L_JWordNet is the Japanese WordNet lexicon.
Require: Tokens is the full tokenised text.

function JSEMCORRANKKEY(L_JWordNet, L_MeCab, Tokens, token candidate)
    ⟨i, j, lemma⟩ ← token candidate
    surface ← ⟨SURFACE(Tokens_k) | i ≤ k ≤ j⟩
    return ⟨lemma ∈ L_translation, i − j, i, lemma ∈ L_JWordNet, lemma = CONCAT(surface)⟩
end function
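Since Python tuples already compare lexicographically, Algorithm 5 maps naturally onto a sort key. In the sketch below (illustrative only, with 0-indexed inclusive spans and a hypothetical Token type) the boolean features are negated so that an ascending sort places preferred candidates first:

    def jsemcor_rank_key(l_translation, l_jwordnet, tokens, candidate):
        i, j, lemma = candidate
        surface = "".join(t.surface for t in tokens[i:j + 1])
        return (lemma not in l_translation,   # click-through lemmas first
                i - j,                        # then longer matches
                i,                            # then earlier matches
                lemma not in l_jwordnet,      # JWordNet before IPAdic fallback
                lemma != surface)             # surface-exact matches preferred

    # ranking = sorted(candidates,
    #                  key=lambda c: jsemcor_rank_key(lt, lj, tokens, c))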

6.3.3 WSD for Japanese SemCor

With the retokenisation of Japanese Semcor complete, the remaining step is to transfer sense annotations from SemCor. The data we have to do this is the translation click-through data which maps tokens in SemCor to JWordNet lemmas. Each click-through datum applies only to the translation of a single SemCor sentence, which may be zero, one or more sentences in the (manually sentence-aligned) Japanese Semcor, but is usually one.

For each click-through datum we nominally perform the following actions:

1. Identify the source token t_e of the datum and the sentence S_e in which it occurs.

2. Identify the Japanese translation S_sj of S_e.

3. Identify the Japanese lemma, l, of the datum.

4. Identify all tokens t_sj in the retokenisation of S_sj which have been assigned l.

5. Extract the synset s from the sense annotation of t_e.

6. Apply a sense annotation of s to each t_sj.

Unfortunately there are a number of conditions under which this process may fail or at least fail to be deterministic.

• More than one source token is mapped to the same Japanese lemma:

  – If each has the same sense annotation this problem is mitigated.

  – If they have different sense annotations then either or both may apply to the target token.

  – If the Japanese lemma matches more than one target token — that is, |t_sj| > 1 — it is more likely that only one of the synsets applies to each.

• The target (Japanese) lemma is not in JWordNet. This situation arises because translators were initially given an interface to add their own text to the list of translation candidates available for click-through.

• The target (Japanese) lemma is not in the transferred synset s. This situation also arises because translators were able to add their own translation candidates; however, in this case it is suggestive that a synset membership is missing from JWordNet.

Additionally, due to rearrangement of senses between WordNet versions 1.6 and 3.0, some SemCor tokens are annotated with deleted senses and others with more than one sense. The end result of all this is that each click-through datum may lead to zero or more senses being transferred to zero or more JSemcor tokens, and each JSemcor token may receive sense annotations from zero or more click-through data. We nevertheless proceeded with the sense transfer as described, counting the incidence of any of the above irregularities for later investigation.

Once the sense transfer was complete we additionally tagged any JSemcor tokens which were monosemous in JWordNet with their single synset.
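The nominal procedure, together with the irregularity counting, can be summarised in a Python sketch (the data structures are hypothetical stand-ins for the JSemcor data: click_data records, the sentence alignment and the retokenised corpus):

    from collections import Counter, defaultdict

    def transfer_senses(click_data, alignment, retokenised):
        annotations = defaultdict(set)   # JSemcor token -> transferred synsets
        counters = Counter()
        for datum in click_data:
            sent_j = alignment.get(datum.english_sentence)       # steps 1-2
            targets = [] if sent_j is None else [
                t for t in retokenised[sent_j]
                if t.lemma == datum.japanese_lemma]               # steps 3-4
            synsets = datum.english_token_synsets                 # step 5
            if not targets or not synsets:
                counters["no transfer"] += 1
                continue
            if len(targets) > 1 or len(synsets) > 1:
                counters["ambiguous transfer"] += 1               # for later review
            for t in targets:                                     # step 6
                annotations[t].update(synsets)
        return annotations, counters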

6.3.4 Results

The preceding sections revealed that our retokenisation and sense annotation transfer of JSemcor started with the tokenisation of each sentence with the morphological analyser MeCab using the IPAdic lexicon and tagset. This resulted in 382,762 tokens overall and 148,249 open class tokens, giving averages of 27 and 10.5 per sentence respectively.

Corpus          SemCor    JSemcor
Sentences       12,842    14,169
Words           261,283   382,762
Content Words   119,802   150,555

Table 6.1: Corpus Size

The MeCab parse of each sentence was then retokenised into lemmas from our combined JWordNet and IPAdic lexicon by the procedure outlined in Section 6.3 up to the candidate token disambiguation in Section 6.3.2. The result is a tokenisation of the JSemcor corpus identifying a large number of usages of JWordNet lemmas, with any match failures filled in by IPAdic. We count open class IPAdic tokens and all of the JWordNet lemmas as open class words, in both SemCor and JSemcor. The results are summarised in Table 6.1.

To see the importance of a robust retokenisation we consider the incidence of MWEs in the click-through translation lemmas provided by translators. Of 61,827 Japanese lemmas suggested for some SemCor sentence by the click-through data, 7,551 are found in the MeCab parse of the translated text as compounds of IPAdic tokens. Of the rest, 44,813 are single tokens, and 9,463 are not found in the translated text at all: the translation interface allowed free editing of the translation text but did not allow click-through word alignments to be undone.

In Section 6.3.3 we detailed a number of conditions which may cause the word alignment (and subsequent sense transfer) from SemCor tokens to JSemcor tokens to not be one-to-one. In fact, we see 1,734 Japanese lemmas coming from more than one token in the source SemCor sentence, though only 190 come from more than one source lemma (and none from more than two). Conversely, 3,252 lemmas suggested by the click-through data match more than once in the Japanese translation of the sentence. Also, the alignment coverage is not complete: 51,450 sense tagged tokens in SemCor have not been translated, and 90,525 open class tokens in the Japanese sentence translations have no translation lemma mapped to them. Part of speech distributions for unaligned tokens in both languages are shown in Table 6.2. When compared to SemCor, there is a very heavy skew in Japanese Semcor towards nouns and, to a lesser extent, verbs. This imbalance in POS counts goes a long way to explaining why an alignment could not be found. In fact, the mapping between WordNet and JWordNet only groups words of a single POS into a single synset, so no cross POS annotation transfer is possible.

Part of speech   English Tokens   Japanese Tokens
Verb             13,457           24,698
Noun             9,979            41,394
Adjective        10,337           2,794
Adverb           12,321           5,635

Table 6.2: Part of speech distribution for tokens without word alignment

After completion of the word alignment, we perform the annotation transfer. Due to the word alignment not being one-to-one, and for the additional reasons given in Section 6.3.3, annotation transfer can result in zero or multiple senses being assigned to a word-aligned translation. Of the 61,827 Japanese lemmas suggested for a SemCor sentence by the click-through data, 131 are assigned more than one sense from their SemCor sentence and 13,771 have none. The remaining 47,925 click-through lemma suggestions are assigned a single sense. After taking into account suggested lemmas which appear more than once — or not at all — in the target sentence, 46,121 JSemcor tokens receive tags from the annotation transfer.

In some cases the word alignment is successful and senses are available for transfer but the transfer cannot be completed. This happens when the suggested Japanese lemma is user-contributed and is not found in the synset to be transferred. In fact, this occurred 13,857 times during our sense transfer, suggesting a large number of potential new synset memberships for Japanese WordNet.

After sense transfer we were left with 61,495 JWordNet tokens in JSemcor without sense annotations, either because sense transfer failed or because the word was not aligned. Of these, 12,144 were monosemous in JWordNet and so were tagged with their single sense. Applying these single sense annotations brings the total number of sense annotated tokens to 58,265. In the final corpus we also included open class MeCab tokens which have still not been assigned a Japanese WordNet lemma, as an additional 34,329 words. These may be useful if, for example, one wanted to generate a bag of words model of the JSemcor text.

All in all, approximately one in three tokens of JWordNet lemmas identified in Japanese Semcor received a sense annotation transferred to it by translator click-through data. This is by no means a complete coverage, but it is nevertheless a substantial corpus of supervised sense annotations. Since the sense annotation transfer itself was also supervised to some extent by the translators of the text it is also a potentially more reliable source of sense annotations than a corpus bootstrapped with a fully automatic alignment. Of the sense annotations transferred, approximately one in seven were transferred to compounds of multiple IPAdic tokens, illustrating the importance of having a systematic method for identifying MWEs relevant to the lexicon under study.

6.3.5 Distribution

We use the Kyoto Annotation Format (KAF) to share the corpus (Bosma et al. 2009). This is an emerging standard for WordNet annotation. We only use the two lowest layers (text and term), not including any higher levels that exist such as dependencies or geodata. In order to make the data accessible, it has been released under the same license as the English SemCor.3

3 Bond et al. (2012): http://nlpwww.nict.go.jp/wn-ja/eng/downloads.html

A sample KAF record is presented in Figure 6.3, containing two words with JWordNet senses (学校 (gakkō) "school" and 戻る (modoru) "return"), IPAdic POS tags for all open class tokens in the MeCab parse, and file and sentence IDs which align with English SemCor.

6.4 Retokenisation in the Wakaran Glosser

In this section, we give details of the algorithm used to identify dictionary entries in text submitted by users of Wakaran, the crowdsourcing glossing system introduced in Chapter 4. In this chapter we have studied the task of identifying entries from a semantic lexicon relevant to arbitrary Japanese texts. The main challenges of the task are applicable to finding glosses for the documents submitted to Wakaran by students: a lack of whitespace segmentation, the occurrence of MWEs and compounds, and inflection of compounded verbs. We have outlined a solution strategy in Section 6.2, with a detailed implementation for the tokenisation of JSemcor into JWordNet tokens in Section 6.3. In Wakaran we use another adaptation of the RETOKENISE algorithm, with differences arising because the requirements of Wakaran as an interactive user application differ from those of the task of sense transfer for JSemcor.

6.4.1 Retokenisation for JMDict

JMDict is a Japanese lexicon with multilingual glosses — English being the most common — which has been constructed to serve as an aid to second language speakers of Japanese (Breen 2004). It has a permissive license and is used in a variety of software applications, including many targeting students of Japanese as a second language (see Section 4.2 for a few examples). Wakaran shares in this goal and requires an advanced method for tokenising Japanese text into entries in the JMDict lexicon. JMDict entries include part of speech information but to our knowledge no tagging resources exist for its POS labels, nor any formalised method for JMDict tokenisation. Therefore, we applied the retokenisation methods developed in this chapter to the task of identifying JMDict entries in Japanese text. Due to the difference in application, this task requires different qualities to the JSemcor tokenisation task (Section 6.3). In this section we outline the desired qualities and our implementation.

The main qualitative difference between the task of tokenising into JMDict lemmas for Wakaran and the JSemcor task is that, Wakaran being a dictionary application for end users, the runtime performance requirements focus more on latency and co-existence with other applications and less on batch throughput. In Section 6.2 we briefly touched on the cost of having all candidate target lexicon entries in memory at once. For this reason we developed a new algorithm structure which avoids generating a full list of compounded and inflected lemmas simultaneously into memory only to have many pruned away at a later stage of processing.

Another consideration for Wakaran is that each background worker process will need to access a copy of the dictionary for retrieving matching lemmas. Storing a full copy of the dictionary in each may lead to memory availability constraining the number of concurrent workers that can run. However, storing the dictionary in a separate data store process (such as the SQL server) leads to a per-access overhead for dictionary lookup, so querying lemma candidates one at a time rather than batched together may introduce unacceptable time costs. Thus, the Wakaran RETOKENISE algorithm performs dictionary lookup in batches of bounded size rather than waiting to accumulate all candidates or performing lookup one lemma at a time.

Even where an entry exists in the target lexicon we may not eventually want to include it in our output. Firstly, algorithms such as those described in Section 6.2.3 could be used to filter out unnecessary multiword expressions from the result (or, conversely, to remove unnecessary single constituents of a MWE). Secondly, the POS identified by the morphological tagger for a token may not be an appropriate match for the JMDict entry found. Finally, as discussed in Section 6.2.3, we might use semantic algorithms such as automatic WSD to further filter out uninteresting entries or transform the associated semantic metadata. We do not cover these considerations in this chapter but in Chapter 7 we introduce automatic WSD for Wakaran.

Algorithm (6) gives a parallel version of the RETOKENISE algorithm originally described in Section 6.2. It uses parallel processing to save on accumulating hypothesised target lexicon compounds, by filtering for existence in the lexicon in parallel with generation of candidate lemmas from the source lexicon tokens. Parallelisation could be achieved by any pipelining method such as UNIX pipes, multi-threading, multi-processing or co-routining. Wakaran uses chained Python generator functions. An additional advantage of this form of the RETOKENISE algorithm over the first form (Algorithm (1)) is that the output is written to a pipeline and so can undergo further filtering and transformation without accumulating all candidate target lexicon entries in memory.

6.4.2 Eliminating over-generation and filtering

In the previous section, we described a strategy for improving memory usage of the RETOKENISE algorithm to allow for a large pool of Wakaran NLP worker processes to co-inhabit the same server machine. Although this optimisation increases the capacity of Wakaran as a system, under a partial load it does little to increase the speed of response from the server to a user. Although the algorithm described in Section 6.4.1 was initially deployed, it was eventually replaced by a direct JMDict tokenisation method which we describe in this section. The primary reason for using a RETOKENISE implementation is the lack of morphological resources for JMDict to allow for identification of inflected forms of dictionary entries. The approach we take in this section is to bootstrap a lexicon of inflectional variations of JMDict headwords by using the IPAdic lexicon used previously for the morphological analysis given as input to the RETOKENISE algorithm.

Algorithm 6 Parallel algorithm for retokenising text into a target lexicon.
Require: Function LEXICONMAP(tokens) compounds a sequence of source lexicon tokens into one or more potential target lexicon lemmas.
Require: Function LOOKUP(L, lemmas) returns the subset of lemmas found in lexicon L.
Require: Tokens is a sequence of tokens mapped to entries in the source lexicon.
Require: N is the batch size for lexicon queries.
Require: L is the target lexicon: a datastore containing a set of lemmas.

function RETOKENISE_parallel(Tokens, L, N)
    RETOKENISE_producer(Tokens, L) | RETOKENISE_consumer(L, N)
end function

function RETOKENISE_producer(Tokens, L)
    limit ← max{|lemma| | lemma ∈ L}
    for i ← 1..|Tokens| do
        j ← i
        repeat
            ts ← ⟨Tokens_i ... Tokens_j⟩
            lemmas ← {l ∈ LEXICONMAP(ts) | |l| ≤ limit}
            for lemma ∈ lemmas do
                WRITE(⟨i, j, lemma⟩)
            end for
            j ← j + 1
        until j > |Tokens| or lemmas = ∅
    end for
end function

function RETOKENISE_consumer(Tokens, L, N)
    lemmas ← candidates ← ∅
    while ⟨i, j, lemma⟩ ← READ do
        lemmas ← lemmas ∪ {lemma}
        candidates ← candidates ∪ {⟨i, j, lemma⟩}
        if |lemmas| ≥ N then
            WRITEMATCHES(L, lemmas, candidates)
            lemmas ← candidates ← ∅
        end if
    end while
    WRITEMATCHES(L, lemmas, candidates)
end function

function WRITEMATCHES(L, lemmas, tokencandidates)
    lemmas ← LOOKUP(L, lemmas)
    for ⟨i, j, lemma⟩ ∈ tokencandidates do
        if lemma ∈ lemmas then
            WRITE(⟨i, j, lemma⟩)
        end if
    end for
end function
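As noted above, Wakaran realises this pipeline with chained Python generator functions; the following is a minimal sketch of that structure (lexicon_map and lookup are placeholders for the JMDict-specific implementations, and the batch size is illustrative):

    def produce(tokens, lexicon_map, max_len):
        # lazily generate candidate (i, j, lemma) triples, pruning a start
        # position once no candidate fits under the longest target lemma
        n = len(tokens)
        for i in range(n):
            for j in range(i, n):
                lemmas = {l for l in lexicon_map(tokens[i:j + 1])
                          if len(l) <= max_len}
                if not lemmas:
                    break
                for lemma in lemmas:
                    yield i, j, lemma

    def consume(candidates, lookup, batch_size):
        # filter candidates against the target lexicon in bounded batches,
        # amortising datastore queries without holding all candidates at once
        batch = []
        for candidate in candidates:
            batch.append(candidate)
            if len(batch) >= batch_size:
                yield from _matches(batch, lookup)
                batch = []
        yield from _matches(batch, lookup)

    def _matches(batch, lookup):
        found = lookup({lemma for _, _, lemma in batch})
        return (c for c in batch if c[2] in found)

    # chained generators:
    # for i, j, lemma in consume(produce(tokens, lm, limit), lookup, 100): ...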

Bootstrapping the lexicon is a three step process for each JMDict headword:

1. Run the headword through the morphological tagger to get its best tokenisation.

2. Generate inflections of the tokenisation using information in the IPAdic lexicon.

3. For each generated inflection, generate bigram transition probabilities for the new surface form using information from the IPAdic lexicon.

To generate inflections for a tokenised JMDict headword, we reverse the HEADHEURISTICMAP algorithm described in Section 6.2.2. In HEADHEURISTICMAP, a sequence of tokens from free text has its final token converted to the source lexicon lemmatised form as a heuristic for converting the compound of the full sequence of tokens to a dictionary form. In this case we reverse the process: we take the lemmatised form of the final token of a JMDict headword and look up all inflections for that lemma in the source morphological lexicon IPAdic. Note that in English changing the inflection of the final word of a MWE would not be sufficient — instead you would look for the head verb or noun of the phrase — but in Japanese syntax the final token is a very good heuristic for the point of inflection. The end result is a set of inflections for the JMDict headword as a whole (whether it is a compound of IPAdic tokens or not).

The final step is to associate bigram transition probabilities to the generated JMDict headword inflections so that they can replace the IPAdic dictionary for use with the conditional random field (CRF) engine of the MeCab tokeniser. One challenge is that the first and last tokens of a compound JMDict headword may have different IPAdic POS to each other and therefore have different transition probabilities with respect to tags to the left and right of the candidate token. However, MeCab makes this challenge easy to overcome by allowing its dictionary entries to encode separate POS identities for their left and right roles in a bigram. Thus we simply assign the POS of the leftmost token of the compound as its POS in the right hand position of a bigram. Likewise, we assign the POS of the rightmost (potentially inflected) token as the POS of the compound in the left hand position of a bigram. The one remaining parameter that needs to be filled is the unigram weight for the JMDict headword. For this we take the weight of the rightmost token in the compound that has an open-class POS. The reason we take an open class POS is because affixes, suffixes and particles will give a poor indication of the frequency of the compounded or inflected word.
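By way of illustration, a MeCab dictionary entry is a CSV row whose first fields are the surface form, left and right context IDs and a unigram cost, followed by feature fields (IPAdic-style field layout; the token attribute names below are hypothetical, and this is a sketch of the scheme just described rather than the deployed code):

    def mecab_entry(surface, tokens, features):
        """Compose a dictionary row for one generated inflection of a
        compound JMDict headword."""
        left_id = tokens[0].left_id     # POS identity at the left edge
        right_id = tokens[-1].right_id  # POS identity at the (inflected) right edge
        # unigram weight: rightmost open class token, since particles and
        # affixes say little about the compound's frequency
        cost = next(t.cost for t in reversed(tokens) if t.is_open_class)
        return ",".join([surface, str(left_id), str(right_id), str(cost)]
                        + list(features))

Rows for every generated inflection can then be compiled into a binary dictionary with MeCab's mecab-dict-index tool.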

Having generated inflections with bigram left and right transition classes and a unigram weight we have enough information to compile a MeCab dictionary for JMDict. For precision in tokenisation we merge the JMDict dictionary with the IPAdic dictionary so as to fill any gaps in the JMDict lexicon, particularly with particles, affixes, and chained inflection morphemes. The main entry-point for MeCab operates in a full disambiguation mode, outputting a single tokenisation of non-overlapping morphemes or a list of such tokenisations in order of their estimated probability. However, Wakaran is interested in displaying all information about potentially ambiguous tokenisations. We use the lattice labelling mode of MeCab to include all candidate tokens which appear in high scoring tokenisations of a sentence to detect cases of MWE and compound ambiguity.

We have thus described the two main re-tokenisation algorithms deployed in Wakaran to choose glosses for arbitrary Japanese text. The other large NLP component developed for Wakaran is the JMDict WSD algorithm described in Chapter 7. To close out this chapter, we discuss experiments in knowledge-based WSD over JWordNet which lay the groundwork for it.

6.5 WSD with JWordNet

In this section we use the sense annotations transferred to Japanese Semcor by the work described in Section 6.3 to evaluate resources for knowledge-based Japanese WSD using Personalised PageRank algorithms. Japanese Semcor does represent a gold standard sense annotation of substantial size and as such it would be possible to investigate supervised word sense disambiguation training with it as we did using OpenMWE for multiword expressions in Chapter 5. However, since the annotations on Japanese Semcor were transferred from the original SemCor its coverage of word types and the number of annotations per type is inherently limited to that of SemCor, and extending coverage would come at the same high cost that extending the coverage of SemCor would require. For this reason, we choose to use Japanese Semcor as an evaluation set to estimate the performance of knowledge based methods which have the potential to achieve coverage over a larger lexicon than a supervised word expert WSD method. The resources for knowledge based WSD we seek to evaluate are JWordNet itself, WordNet++ and BabelNet. The JWordNet ontology is a rich semantic network for Japanese (Isahara et al. 2008) and is based on the English Princeton WordNet. In addition, the transfer of sense annotations from SemCor to JSemcor we developed in Section 6.3.3 gives us a suitable corpus for evaluation of an all words WSD algorithm using JWordNet senses. WordNet++ (Ponzetto and Navigli 2010) and BabelNet (Navigli and Ponzetto 2012) are extensions of the Princeton WordNet lexical knowledge base to use the network of links between pages of English Wikipedia and multilingual Wikipedia. Since they extend Princeton WordNet they are also compatible with the JWordNet tokenisation of Japanese Semcor. Each of the three lexical knowledge bases, WordNet, WordNet++ and BabelNet, is a superset of the one before, so our study serves as an investigation of the increased utility of the ongoing enrichment of a lexical knowledge base. For our WSD method we focus in particular on the network method Personalised PageRank, which has provided good results for WSD over WordNet and for semantic networks in other languages (Agirre and Soroa 2009). In the next section we outline our implementation of Personalised PageRank over JWordNet. Then, in Section 6.5.2, we evaluate the resulting WSD model over JSemcor.

6.5.1 WSD for Japanese WordNet

In our experimental setup we study changes in performance of a leading knowledge-based method for WSD, Personalised PageRank, as the size of the LKB is increased. Our LKBs are all built on a core of Princeton WordNet synsets, with Princeton WordNet and WordNet++ in particular including only English lexicalisations. In this section we describe how we implemented Personalised PageRank WSD for the Japanese WordNet tokenisation of Japanese Semcor developed in Section 6.3.

A key feature of Japanese WordNet is the mapping between its synsets and the synsets of English WordNet. This provides us with a supervised mapping from Japanese words to the English WordNet synsets at the heart of each of our lexical knowledge bases under examination, and could indeed be used in conjunction with any existing lexical knowledge base containing WordNet synsets to perform network methods for Japanese WSD. We leverage this to reproduce the Personalised PageRank method of Agirre and Soroa (2009) for the JWordNet Japanese lexicon.

The basic form for Personalised PageRank based WSD over JWordNet has the following steps:

1. identification of JWordNet lexicon words in Japanese text;

2. computation of Personalised PageRank to assign weights to WordNet senses of JWordNet headwords; and

3. mapping Personalised PageRank results to WSD judgements.

The first step, identification of JWordNet lemmas in running text, was the main result of Section 6.3 of this chapter.

For the computation of the Personalised PageRank itself we use UKB, which was developed by Agirre and Soroa (2009). To this end, we constructed and compiled UKB Personalised PageRank networks for the concept graphs of each of our lexical knowledge bases:

WN+gloss WordNet with sense-annotated gloss relations as used by Agirre and Soroa (2009);

WN++ WordNet++, which is to say the network of WN+gloss with additional relations sourced from Wikipedia links between pages mapped to the WordNet synsets (Ponzetto and Navigli 2010); and

BBN the full BabelNet 1.1.1 graph (Navigli and Ponzetto 2012), including additional concepts and relations from multilingual Wikipedia (but no Japanese lexicalisations for non-WordNet synsets).

We then constructed a UKB lexicon file linking JWordNet lemmas to their WordNet synsets by using the inherent links between JWordNet and WordNet.
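By way of illustration, building such a lexicon file amounts to grouping synset links by lemma. The sketch below is illustrative only: jwn_links is a hypothetical iterable of (Japanese lemma, WordNet synset offset-POS) pairs, the synset offset shown is a dummy, and a plain one-lemma-per-line dictionary format is assumed (the exact format expected by UKB is documented with the tool):

    from collections import defaultdict

    def write_ukb_lexicon(jwn_links, out_path):
        senses = defaultdict(list)
        for lemma, synset in jwn_links:        # e.g. ("学校", "01234567-n")
            senses[lemma].append(synset)
        with open(out_path, "w", encoding="utf-8") as out:
            for lemma in sorted(senses):
                # one line per lemma: the lemma then its candidate synsets
                out.write(lemma + " " + " ".join(senses[lemma]) + "\n")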

UKB provides a simple mechanism for mapping Personalised PageRank results to both exact match and graded sense judgements, which serves as the third and final step for disambiguation. We configured UKB to provide graded sense judgements, despite the single best sense nature of the annotations on JSemcor, so as to detect shifts in Personalised PageRank towards the gold standard sense even in cases where it fails to pick the best sense. However, it is simple to transform the graded sense annotations into single best sense annotations after the fact. For this reason too we used both exact match and graded evaluation measures when scoring performance of Personalised PageRank. In the next section we detail evaluation of this setup using the JSemcor resource.

6.5.2 Evaluation with JSemcor

In Section 6.3 we detailed the transfer of SemCor sense annotations onto the sentence aligned translation, Japanese Semcor. The result is a corpus of Japanese text with links into Japanese WordNet and (transferred) supervised sense annotations. Complete coverage of all open class terms with sense annotations was not achieved: of 116,226 tokens of JWordNet lemmas, only 58,265 were successfully sense tagged (including monosemous words). We do not try to use JSemcor for supervised classification but consider it a good resource for evaluation of JWordNet word sense disambiguation.

We evaluated Personalised PageRank using 3 criteria on 9 distinct configurations. Each of the 27 evaluation settings can be distinguished by a combination of the following factors:

Resource ∈ {WN+gloss, WN++, BBN}, as defined in the previous section.

Context size ∈ {SENT, PARA, DOC}, meaning sentence, paragraph and document sized contexts respectively.

Criteria ∈ {Exact, RRank, Spearman}, being evaluation with the exact match criteria, the reciprocal rank criteria defined below, or evaluation by Spearman rank correlation.

One challenge for evaluation is that the annotations on our gold corpus are for a single best sense whereas our WSD algorithm Personalised PageRank naturally outputs grades for all senses of a word. For the exact match evaluation we resolve the discrepancy by choosing the highest graded sense as the single best output of WSD. For Spearman rank correlation we convert the single best gold annotations on JSemcor into ranks by assigning rank 1 to the gold sense and, applying the standard convention for tied ranks, assigning the arithmetic mean of the remaining ranks to the other senses. Finally, we introduce the reciprocal rank (r-rank) method here as a measure that natively resolves the discrepancy between the single best sense gold standard corpus and the naturally graded output of the Personalised PageRank algorithm for WSD. For a single token gold-annotated with a single sense, the r-rank score for a graded sense annotator is given by

\[ \text{r-rank} = \frac{1}{r_g} - \frac{r_g - 1}{n(n - 1)} \]

where rg is the rank of the gold sense in the graded annotator’s output and n is the number of senses for the word. The 1 represents the most important component of the measure: it is the reciprocal rg of the ranked position of the gold sense in the WSD algorithm’s output. Thus r-rank is closely related to the mean average precision (MAP) measure commonly used for information retrieval (for example, in the TREC conferences (Voorhees and Harman 2001)): if WSD is viewed as a retrieval task for the correct sense then the 1 term represents the precision-at-r . Since there is only one rg g correct sense to retrieve this is identical to the average precision for the term. However WSD differs

from tasks typically evaluated with mean average precision in that the set of query results is a very

short bounded list rather than a large collection. This leads to some perverse results if 1 is used rg Chapter 6: Connecting lexical resources to free text 252 directly. For example, ranking the gold sense at position 2 should be a good result for a word of 10 senses but a terrible result for a word with only 2 senses but in both cases the annotator would get

rg−1 the same score of 0.5. For this reason we add an adjustment term n(n−1) which rescales r-rank so that ranking the gold sense first scores 1.0 and ranking it last scores 0.0. Overall, the two parts of the r-rank can be seen as a MAP with an adjustment for polysemy of the term. Of our three metrics, r-rank is most closely targeted at measuring performance of a graded annotator against a single best sense gold standard. However, the exact match metric provides a good test of how well the annotator can exactly reproduce the gold standard and the Spearman rank correlation measure is a robust and popular measure of graded output. The combination of these three metrics of different qualities reduces the risk of unusual results arising from peculiarities of the evaluation. Final aggregated scores for each measure were calculated as the micro-average score over all annotated gold tokens.

Note that when using this aggregation method the exact match criterion represents traditional WSD recall rather than precision or accuracy.
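The two conversions described above are small enough to sketch directly; the following is our own illustration, and the treatment of monosemous words is an assumption not fixed by the text:

    def gold_to_ranks(n_senses):
        """Convert a single-best gold annotation into a full ranking: the
        gold sense gets rank 1 and all other senses tie at the arithmetic
        mean of the remaining ranks 2..n."""
        tied_rank = (2 + n_senses) / 2
        return [1.0] + [tied_rank] * (n_senses - 1)

    def r_rank(gold_rank, n_senses):
        """Reciprocal rank of the gold sense, rescaled so that ranking it
        first scores 1.0 and ranking it last scores 0.0."""
        if n_senses < 2:
            return 1.0  # assumption: a monosemous word is trivially correct
        return 1.0 / gold_rank - (gold_rank - 1) / (n_senses * (n_senses - 1))

For example, r_rank(2, 10) is approximately 0.489 while r_rank(2, 2) is 0.0, reflecting the polysemy adjustment discussed above.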

We use the following notation for a configuration of the WSD algorithm:

$\text{WSD}^{\text{context}}_{\text{lkb}}$ := Personalised PageRank over the lkb graph, run once per context

where lkb ∈ {WN+gloss, WN++, BBN}

context ∈ {SENT, PARA, DOC}

For evaluation we use the notation

$M_{\text{measure}}(\text{WSD}, \text{tokens})$ for the result of applying an evaluation measure to the annotations given by WSD to a specific set of gold standard annotated tokens from the corpus. For convenience, we define evaluation over the whole corpus:

$M_{\text{measure}}(\text{WSD}) := M_{\text{measure}}(\text{WSD}, \text{corpus})$

Note that the scope of the evaluation does not restrict the scope of disambiguation. For example,

when calculating

$M_{\text{Exact}}(\text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{tokens}_{\text{house}})$

where $\text{tokens}_{\text{house}} = \{t \in \text{corpus} \mid \text{lemma}(t) = \text{house}\}$

the Exact score is aggregated only over tokens of the word house but using annotations output by

Personalised PageRank run over each and every token of each paragraph of the entire corpus.

Since each of the evaluation criteria used is a mean of token scores, we test for statistical significance of differences using the Student’s t-test or, for annotators with the same coverage, a paired Student’s t-test for the vector of differences between token scores. BBN had the same lexicon as WN++ and therefore the same coverage for Personalised PageRank. As such, the paired t-test was used for all comparisons except where WN+gloss is compared to WN++ or BBN. If we did not find a statistically significant difference between two scores we write

$M_{\text{measure}}(\text{WSD}, \text{tokens}) \approx M_{\text{measure}}(\text{WSD}', \text{tokens})$

On the other hand, where the first score exceeds the second with statistical significance we write

$M_{\text{measure}}(\text{WSD}, \text{tokens}) \succ M_{\text{measure}}(\text{WSD}', \text{tokens})$
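The test itself can be sketched with SciPy (the wrapper function and its names are ours, not the thesis software):

    from scipy.stats import ttest_ind, ttest_rel

    def significantly_different(scores_a, scores_b, same_coverage, alpha=0.05):
        """Paired t-test over per-token score vectors when both annotators
        cover the same tokens, otherwise an unpaired two-sample t-test."""
        if same_coverage:
            _, p_value = ttest_rel(scores_a, scores_b)
        else:
            _, p_value = ttest_ind(scores_a, scores_b)
        return p_value < alpha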

Results of the evaluation appear in Table 6.3. For all evaluation measures, leaving the lexical knowledge base fixed and varying the context size usually shows a statistically significant improvement when using larger contexts. That is, DOC almost always does better than PARA, which does better than SENT. There are a handful of exceptions for the Exact and RRank criteria where some PARA context annotators are not statistically different to the DOC context annotator.

                 Exact                   RRank                   Spearman
    LKB        SENT   PARA   DOC       SENT   PARA   DOC       SENT   PARA   DOC
    WN+gloss   0.623  0.628  0.637     0.747  0.751  0.759     0.637  0.642  0.651
    WN++       0.667  0.674  0.675     0.790  0.795  0.795     0.652  0.660  0.668
    BBN        0.661  0.668  0.672     0.786  0.791  0.793     0.646  0.656  0.664

Table 6.3: Performance of Personalised PageRank over JWordNet under several configurations and evaluation metrics. The best performing configuration under each metric is indicated in bold. The difference between any scores joined by snaking lines was not found to be statistically significant (by a t-test or paired t-test, LKB coverage permitting). All other differences in scores are statistically significant (p < 0.05).

The statistically significant differences found when varying context size can be captured as follows:

$M_{\text{measure}}(\text{WSD}^{\text{SENT}}_{\text{lkb}}) \prec M_{\text{measure}}(\text{WSD}^{\text{PARA}}_{\text{lkb}})$ for $(\text{lkb}, \text{measure}) \in \text{configs}$
$M_{\text{measure}}(\text{WSD}^{\text{PARA}}_{\text{lkb}}) \prec M_{\text{measure}}(\text{WSD}^{\text{DOC}}_{\text{lkb}})$ for $(\text{lkb}, \text{measure}) \in \text{configs} \setminus \text{excluded}$ (6.1)

configs = {WN+gloss, WN++, BBN} × {Exact, RRank, Spearman}

excluded = ({WN++} × {Exact, RRank}) ∪ {(BBN, RRank)}

In short, on this dataset, expanding the size of the context supplied to Personalised PageRank almost

always results in an increase in performance.

The results also show a significant improvement in recall (around 5 percentage points)

when moving up from JWordNet (with gloss annotation relations) to WordNet++. However, this

improvement is not repeated when expanding from WordNet++ to the full BabelNet 1.1.1 graph.

In fact, for any given context size the results using BBN are worse than WN++ with statistical

significance. That is to say:

$M_{\text{measure}}(\text{WSD}^{\text{context}}_{\text{WN+gloss}}) \prec M_{\text{measure}}(\text{WSD}^{\text{context}}_{\text{BBN}}) \prec M_{\text{measure}}(\text{WSD}^{\text{context}}_{\text{WN++}})$

for all $(\text{context}, \text{measure}) \in \{\text{SENT}, \text{PARA}, \text{DOC}\} \times \{\text{Exact}, \text{RRank}, \text{Spearman}\}$

This is a surprising result considering that, intuitively, having more information about concepts and semantic relations should improve the ability of a random walk algorithm to estimate related senses for words in the same context. That intuition is borne out by the results for the augmentation of JWordNet with Wikipedia relations to form WordNet++: for every context size, expanding JWordNet to WordNet++ improved every measure with statistical significance.

For the final step of the experiment, we conducted a detailed error analysis to examine why shifting from WordNet++ to BabelNet results in worse performance in our experiments.

BabelNet vs WordNet++

For our error analysis we examined in detail the lemmas of words affected by the shift

from WordNet++ to BabelNet when performing Personalised PageRank. The difference between

BabelNet and WordNet++ is of a somewhat different quality than the difference between WordNet++ and JWordNet. WordNet++ introduces only semantic relations between the existing synsets of JWordNet. By contrast BabelNet introduces new synsets to WordNet++ and adds new edges

involving at least one of the new synsets. The original, static, PageRank algorithm was a measure

of graph centrality and we expect that adding new links and nodes to the graph would change the

relative static importance of certain nodes. Specifically, adding additional links may improve the

connectivity of some nodes whilst neglecting others. Adding new nodes with connections into the

graph will alter connectivity as well, but will also siphon away some of the static PageRank weight

to the new nodes. We hypothesised that a change in the static PageRank centrality may have introduced a bias in the PageRank mass received by the senses of certain words. Conversely, it may

interfere with the PageRank mass distributed by certain words in context. Either of these changes

might lead to a change in performance: in the former case we expect to see BabelNet perform worse

than WordNet++ at Personalised PageRank on a consistent set of lemmas; in the latter case we

would expect it to perform broadly worse on a consistent set of documents. In this section we employ set similarity methods to look for lemmas and documents on which BabelNet has consistently poorer performance.

To test our hypotheses, we extracted the tokens, lemmas and documents of JWordNet

on which the BabelNet and WordNet++ Personalised PageRank disambiguation disagree. More

precisely, for each of the entity classes (single tokens, lemmas and documents) we split the corpus

up into partitions along entity lines:

$\text{singles} := \{\{t\} \mid t \in \text{corpus}\}$

$\text{lemmas} := \{\{t \in \text{corpus} \mid \text{lemma}(t) = l\} \mid l \in \{\text{lemma}(t) \mid t \in \text{corpus}\}\} \quad (6.2)$

$\text{documents} := \{\{t \in \text{corpus} \mid t \text{ occurs in document } d\} \mid d \text{ a document in the corpus}\}$

where $\text{corpus} := \{\text{token}_i \mid i \in 1 \ldots N\}$
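The partitions of Equation (6.2) amount to grouping tokens by a key function, as in the following sketch (the token attribute names are illustrative, not taken from the thesis software):

    from collections import defaultdict

    def partition_by(corpus, key):
        """Partition corpus tokens into entity sets sharing a key value."""
        parts = defaultdict(set)
        for token in corpus:
            parts[key(token)].add(token)
        return list(parts.values())

    # singles = [{t} for t in corpus]
    # lemmas = partition_by(corpus, lambda t: t.lemma)
    # documents = partition_by(corpus, lambda t: t.document)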

We then calculated the evaluation measures by aggregating over each partition of an entity type

individually. Two annotators were considered to disagree on an entity if their scores on the tokens

associated with that entity were different with statistical significance. For instance, two annotators

WSD and WSD′ were said to disagree on a lemma $l \in \text{lemmas}$, with WSD superior to WSD′ on $l$, if $M(\text{WSD}, l) \succ M(\text{WSD}', l)$. Note that in this example, $l$ is the set of all tokens in the corpus with a specific lemma.

We define the superiority set (s-set) of one annotator WSD over another WSD′ to be a subset of entities from one of singles, lemmas or documents on which WSD outperforms WSD′ with statistical significance. Formally we say:

$S(\text{WSD}, \text{WSD}', \text{entity}) := \{e \in \text{entity} \mid M_{\text{Exact}}(\text{WSD}, e) \succ M_{\text{Exact}}(\text{WSD}', e)\}$

where $\text{entity} \in \{\text{singles}, \text{lemmas}, \text{documents}\}$, as defined in Equation (6.2).

Since our focus is on the difference in behaviour of BabelNet against WordNet++, we define for convenience:

$S(\text{context}, \text{lkb}, \text{entity}) := S(\text{WSD}^{\text{context}}_{\text{lkb}}, \text{WSD}^{\text{context}}_{\text{lkb}'}, \text{entity})$

where $\text{context} \in \{\text{SENT}, \text{PARA}, \text{DOC}\}$ and $(\text{lkb}, \text{lkb}') \in \{(\text{WN++}, \text{BBN}), (\text{BBN}, \text{WN++})\}$

This means that we can use, for example, the expression S(DOC, WN++, lemmas) to represent the set of lemmas on which the WordNet++ based annotator performed better than the BabelNet annotator (where both are provided whole documents as context). Conversely, the lemmas on which BabelNet performs better are found in S(DOC, BBN, lemmas). By examining the lemmas, documents and single tokens on which one of the lexical knowledge bases performs better than the other, we aim to identify any patterns in the introduction of errors when expanding the Personalised PageRank graph from WordNet++ to BabelNet. Additionally, by varying the context used for disambiguation we can assess whether any differences we find are a consistent effect of the change of Personalised PageRank graph or are simply due to random chance.

Examining the differences

We constructed the superiority sets for WordNet++ over BabelNet on all entities and dis- ambiguation context sizes and found that, consistently,

$\forall \text{entity}, \forall \text{context}: \quad |S(\text{context}, \text{BBN}, \text{entity})| \ll |S(\text{context}, \text{WN++}, \text{entity})| \quad (6.3)$

This means that for each configuration using WN++, shifting WN++ to BBN mainly introduces new errors in WSD rather than swapping a large number of errors for a larger number of different errors. The statistical significance of the paired t-tests comparing WN++ configurations to BBN arises naturally from this highly skewed distribution of changes between the annotators. Since the superiority sets for BBN are of negligible size, we study mainly the superiority sets from the WN++ side for our error analysis.

    Lemma        子供-n  鉄道-n   幼虫-n     トレーニング-n    舞台-n  車-n   競争-n       目-n
    Translation  child   railway  chrysalis  practice session  stage   wheel  competition  eye
    Tokens       26      12       7          6                 5       4      4            4

(a) Eight most frequent words on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$

    Document     br-k24 (difference: 5.9%)     br-h13 (difference: 2.7%)
    Lemma        子供-n  脚部-n  長い-a        鉄道-n   航空機-n  支援-v  化学薬品-n  計画-n
    Translation  child   leg     long          railway  airplane  help    chemical    planning
    Tokens       26      1       1             12       2         1       1           1

(b) All words on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ from the top two documents on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better overall.

    Document     br-k24  br-h13   br-j30            br-j53  br-h16       br-k08  br-n17  br-j10
    Difference   5.9%    2.7%     2.0%              2.0%    1.9%         1.7%    1.7%    1.6%
    Lemma        子供-n  鉄道-n   トレーニング-n    負傷-n  競争-n       夕食-n  車-n    幼虫-n
    Translation  child   railway  practice session  injury  competition  dinner  wheel   chrysalis
    Tokens       26      12       6                 4       4            3       4       7

(c) Most frequent word on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ from each of the top eight documents on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better overall.

Table 6.4: Summary of tokens for which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ gives a better annotation than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$.

For convenience, we therefore define:

S(context, entity) := S(context, WN++, entity)

Table 6.4 illustrates the content of S(DOC, singles), S(DOC, lemmas) and S(DOC, documents).

Specifically, it shows samples of the words and documents where the best BabelNet based annotator

$\text{WSD}^{\text{DOC}}_{\text{BBN}}$ did worse than the equivalent WordNet++ annotator $\text{WSD}^{\text{DOC}}_{\text{WN++}}$. Although some patterns emerge, no consistent semantic theme or lexical crossover between documents is immediately apparent. Most of the difference in performance between the two annotators appears to be due to $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ making errors on a small number of lemmas with high token counts. This distribution of errors is not surprising: it is consistent with Zipf's law, which holds that when ranking lemmas by their frequency in natural language, the frequency tends to fall off in proportion to the reciprocal of the rank (Piantadosi 2014). Since Personalised PageRank makes one-sense-per-context annotations, a uniform distribution of errors on lemmas would translate to a Zipfian distribution of tokens. Thus the skew in the lemma distribution for tokens on which $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ introduces an error is consistent with a uniform sampling of lemmas.

    Lemma        マーサー-n   株-n  蜂-n  鉄道-n   父ちゃん-n  ルソー-n  マルハナバチ-n
    Translation  john mercer  stub  bee   railway  dad         rousseau  bumblebee

(a) Seven most distinctive words of the concatenated set of documents on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$

    Lemma        父ちゃん-n  母ちゃん-n  ウィスキー-n  タオル-n    子供-n  雪-n  擦る-v
    Translation  dad         mum         whiskey       terrycloth  child   snow  rub

(b) Seven most distinctive words of the document br-k24, on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ by the largest margin (+5.9%).

    Document     br-k24      br-h13   br-j30              br-j53             br-h16  br-k08
    Difference   5.9%        2.7%     2.0%                2.0%               1.9%    1.7%
    Lemma        父ちゃん-n  鉄道-n   ロールプレイング-n  触覚-n             株-n    ルソー-n
    Translation  dad         railway  roleplaying         tactual sensation  stub    rousseau

(c) Most distinctive word from each of the top six documents on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$.

Table 6.5: Distinctive words in documents where $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ performs worse than $\text{WSD}^{\text{DOC}}_{\text{WN++}}$. In all cases words are ranked by tf-idf, with term frequencies counted across a single document or a subset of the corpus and document frequencies calculated across the whole corpus.

We save a formal analysis of the distribution of introduced errors for later in this chapter;

here we offer some observations of the raw data. The frequency distribution is not unusual and

casual observation yields no semantic trends: there is no immediately apparent difference between

the content of the superiority sets S(DOC, singles), S(DOC, lemmas) and S(DOC, documents)

and a random sampling of language. This is suggestive of a negative result for the hypothesis that

the extension of WordNet++ to BabelNet introduces bias for specific target lemmas. Also indicative

of a negative result is that the most frequent errors overall are attributable to a single document each,

meaning that there are no significant cases of errors introduced on a specific lemma across a large

number of documents. Later in this section we will formalise these observations with a statistic to

formally test the hypothesis at hand. However, we first examine the data for the second hypothesis:

that BabelNet introduces biases for specific lexical contexts.

Table 6.5 shows the most distinctive words from the set of documents S(DOC, documents)

where the best BabelNet based annotator performed worse than the WordNet++ annotator. Distinctiveness of corpus subsets was measured by calculation of tf-idf for the subset. That is, term frequency was counted across one or more documents in a subset of the corpus and document frequency was counted across the whole corpus. Tf-idf was calculated with raw term frequencies and the log-document frequency; that is, with the formula:

$$\text{tf-idf} = \text{term-frequency} \times \log(N / \text{document-frequency})$$

where $N$ is the total number of documents in JSemcor. There is, of course, a consistent semantic theme apparent for the words extracted from individual documents. For example, the distinctive words for document br-k24 contain many words for family members and household items. However, as was the case in Table 6.4, when counting words on which $\text{WSD}^{\text{DOC}}_{\text{BBN}}$ introduced errors there appears to be little lexical overlap between documents. The most distinctive terms overall comprise mostly the most distinctive single term from each document in the superiority set. Casual observation of the data has not revealed an obvious trend in the content of documents on which $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ performs better than $\text{WSD}^{\text{DOC}}_{\text{BBN}}$.
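The distinctiveness ranking can be reproduced in a few lines; this sketch assumes documents are represented as lists of lemmas, which is our simplification:

    from math import log

    def tf_idf(term, subset, corpus_docs):
        """Raw term frequency over a subset of documents, weighted by the
        log inverse document frequency over the whole corpus."""
        tf = sum(doc.count(term) for doc in subset)
        df = sum(1 for doc in corpus_docs if term in doc)
        return tf * log(len(corpus_docs) / df) if df else 0.0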

Summarising the differences

Since manual inspection of the differences between using BabelNet and WordNet++ for

Personalised PageRank has not yielded any clear patterns, we sought a systematic way to measure consistency in the differences. To test for consistent words or contexts on which BabelNet introduces errors to Personalised PageRank not made by WordNet++ we examined overlap between superiority sets of similar annotator pairs. We define the superiority overlap (overlap):

$$O(\{\text{WSD-pair}_A, \text{WSD-pair}_B\}, \text{entity}) := \frac{|S_A \cap S_B|}{\min(|S_A|, |S_B|)}$$

where for $\chi \in \{A, B\}$: $S_\chi = S(\text{WSD}, \text{WSD}', \text{entity})$ with $\langle \text{WSD}, \text{WSD}' \rangle = \text{WSD-pair}_\chi$

The superiority overlap is the size of the intersection between two superiority sets relative to an upper bound on its value. To help understand the interpretation of superiority overlap, consider the example of

$O(\{\langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle\}, \text{lemmas}) = 0.081$

This result means that when switching from BBN to WN++, the specific lemmas on which performance improves depend on whether paragraphs or whole documents are used for disambiguation, to the extent that at most 8.1% of either s-set of improved lemmas is shared by the other s-set.

We use the superiority overlap to measure the consistency in the pattern of improvements made by a change in one WSD configuration parameter when varying a different parameter. After

finding the superiority sets of entities on which each WSD annotator surpasses its paired WSD′

annotator, the intersection of the superiority sets is found. This gives the set of entities on which

every WSD consistently performed better than its corresponding WSD′. The size of the intersection

will depend greatly on the magnitude of the superiority set sizes so we normalise the set size for

the final score. The normalisation factor is the size of the smallest superiority set which is an upper

bound for the size of the intersection.

Putting aside the notion of superiority sets, our notion of overlap can be understood by its

relationship to the subset containment measure defined by Broder (1997). The containment of one

set $A$ in another set $B$ is given by

$$C(A, B) := \frac{|A \cap B|}{|A|}$$

which is the proportion of elements in $A$ that are also in $B$ and attains a maximum of 1.0 when

A ⊆ B. Overlap can be considered the maximal containment of any one set within the other:

$$O(A, B) := \max(C(A, B), C(B, A)) = \max\left(\frac{|A \cap B|}{|A|}, \frac{|A \cap B|}{|B|}\right) = \frac{|A \cap B|}{\min(|A|, |B|)}$$
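Both measures reduce to simple set arithmetic, as in the following sketch (the convention of returning 0.0 for empty sets is our assumption):

    def containment(a, b):
        """Broder (1997) containment: proportion of a's elements also in b."""
        return len(a & b) / len(a) if a else 0.0

    def overlap(a, b):
        """Maximal containment of either set in the other."""
        return len(a & b) / min(len(a), len(b)) if a and b else 0.0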

The overlap calculation is asymmetric in the sense that the superiority sets of the right-hand WSD′ annotators of the annotator pairs over their left-hand WSD pairs are ignored. That is, the calculation is specifically targeted at the entities on which the left hand annotators outperform the right hand ones. In this spirit we will refer to the right hand annotators as the baseline set and the left hand annotators as the target set in the following discussion. Overlap scores have a reasonably natural interpretation as the proportion of the smallest superiority set found in both superiority sets.

For example, a score of 0.5 means that exactly half of the entities found in the smallest of the superiority sets are shared by the rest. Similarly, a score of 1.0 means that the superiority set of one target annotator over its baseline entirely contains the superiority set of the other target annotator over its baseline.

In order to study the effect of changing one of the lexical knowledge base or context size parameters in isolation of other changes we choose baseline and target sets so that each pair differs in a single specific way. Thus, we define the special case overlap sets:

$$O^L(\text{lkb}, \text{lkb}', \text{context}_A, \text{context}_B, \text{entity}) := O(\{\langle \text{WSD}^{\text{context}_\chi}_{\text{lkb}}, \text{WSD}^{\text{context}_\chi}_{\text{lkb}'} \rangle \mid \chi \in \{A, B\}\}, \text{entity})$$

$$O^C(\text{context}, \text{context}', \text{lkb}_A, \text{lkb}_B, \text{entity}) := O(\{\langle \text{WSD}^{\text{context}}_{\text{lkb}_\chi}, \text{WSD}^{\text{context}'}_{\text{lkb}_\chi} \rangle \mid \chi \in \{A, B\}\}, \text{entity})$$

$O^L$ represents the overlap in superiority sets for one lexical knowledge base versus another across a selection of disambiguation contexts, and $O^C$ represents the overlap in superiority sets of one disambiguation context size over another across a selection of lexical knowledge bases.

We use superiority overlap to answer the questions:

1. does shifting from WordNet++ to BabelNet introduce errors to the same lemmas regardless

of other parameters?

2. does shifting from WordNet++ to BabelNet introduce errors to the same topics regardless of

other parameters?

In the case of the former, we expect to see large values of

$O^L(\text{WN++}, \text{BBN}, \text{context}_A, \text{context}_B, \text{lemmas})$ and $O^L(\text{WN++}, \text{BBN}, \text{context}_A, \text{context}_B, \text{singles})$

For the latter we expect to see large values of

$O^L(\text{WN++}, \text{BBN}, \text{context}_A, \text{context}_B, \text{documents})$

However, to define “large” we need a point of comparison. Firstly, we note that if shifting from

WN++ to BBN consistently introduces errors on the same lemmas or documents then the O-values

will be at or near 1.0. However, there are a couple of drawbacks to comparing overlaps to an

ideal of 1.0. Firstly, if an entity in one superiority set is annotated with high accuracy by the

other baseline annotator it has no chance to enter the other superiority set. For example, a token

$t \in S_A = S(\text{WSD}_A, \text{WSD}'_A, \text{singles})$ may be annotated correctly by the other baseline annotator $\text{WSD}'_B$ and thus $t \notin S_B = S(\text{WSD}_B, \text{WSD}'_B, \text{singles})$. Thus $t$ cannot be in the intersection of superiority sets $S_A \cap S_B$ and may reduce the maximum possible superiority overlap. This will often

not eventuate: when $S_B$ is the smaller superiority set its size determines the denominator in the superiority overlap calculation, which cancels the effect. The exclusion of $t$ from $S_B$ only makes $S_B$ smaller, and Equation (6.3) showed that single factor changes in our data tend to bias these shifts in a single direction, which makes $S_B$ more likely to be the smaller superiority set. However, the

fact remains that differences between the baseline annotators $\text{WSD}'_A$ and $\text{WSD}'_B$ may prevent the superiority overlap from attaining a full 1.0. Secondly, although our interpretation of superiority

overlap as a proportion of a superiority set makes it intuitive to class an overlap of 0.05, say, as low

and a superiority overlap of 0.95 as high, intuition does not provide an objective criterion for setting

a threshold on overlap. Thus, instead of setting a fixed overlap threshold, we condition the outcome

of our test on the data. Specifically, we use a non-parametric estimate of how likely the overlap is to

be as large as it is under a simple null hypothesis. Our null hypothesis is that the superiority overlap

we measure for a specific set of WSD-pairs is the incidental superiority overlap we would measure by forming a set of random pairs drawn from the same annotators. The corresponding p-value is:

$p(\text{WSD-pairs}, e) := \Pr\big(O(\text{WSD-pairs}', e) \geq o \mid \text{WSD-pairs}' \in \text{permute-pairs}(\text{WSD-pairs})\big)$

where $o := O(\text{WSD-pairs}, e)$

and $\text{permute-pairs}(\{\langle \text{WSD}_1, \text{WSD}_2 \rangle, \langle \text{WSD}_3, \text{WSD}_4 \rangle\}) := \{\{\langle \text{WSD}_i, \text{WSD}_j \rangle, \langle \text{WSD}_k, \text{WSD}_l \rangle\} \mid \langle i, j, k, l \rangle \in \text{perms}(\{1, \ldots, 4\})\}$

(6.4)

p-values were calculated by brute force evaluation of O for every permutation of the four WSD

annotators into sets of pairs. Since there are only four items to permute, brute force calculation

across all permutations is quite feasible. There are 4! = 24 permutations of the annotators, but the

ordering of the pairs does not change the superiority overlap so only 12 calculations are required.
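The brute force computation can be sketched as follows; the code is our illustration and superiority_set(x, y) is a hypothetical helper returning the set of entities on which annotator x significantly outperforms annotator y:

    from itertools import permutations

    def overlap_p_value(a1, a2, a3, a4, superiority_set):
        """Proportion of re-pairings of the four annotators whose superiority
        overlap is at least the observed overlap of {<a1,a2>, <a3,a4>}."""
        def pairing_overlap(pair_1, pair_2):
            s_a = superiority_set(*pair_1)
            s_b = superiority_set(*pair_2)
            return (len(s_a & s_b) / min(len(s_a), len(s_b))
                    if s_a and s_b else 0.0)

        observed = pairing_overlap((a1, a2), (a3, a4))
        annotators = [a1, a2, a3, a4]
        # the order of the two pairs is irrelevant to the overlap, so the
        # 24 permutations collapse to 12 distinct pairings
        pairings = {frozenset([(annotators[i], annotators[j]),
                               (annotators[k], annotators[l])])
                    for i, j, k, l in permutations(range(4))}
        scores = [pairing_overlap(*pairing) for pairing in pairings]
        return sum(score >= observed for score in scores) / len(scores)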

See Appendix 6.A for an example of the permutations of

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle\}$

used in the calculation of a p-value for $O^L(\text{WN++}, \text{BBN}, \text{PARA}, \text{DOC}, \text{documents})$.

We tested the consistency of the errors introduced by switching from WordNet++ to BabelNet by measuring the superiority overlap $O^L(\text{WN++}, \text{BBN}, \text{PARA}, \text{DOC}, *)$, which should be

high under both word and context similarity hypotheses. The disambiguation contexts PARA and

DOC make a good reference pair because they exhibited the most similar disambiguation results

with all other parameters held equal. Note for example in Equation (6.1) the configurations that did

not show statistically significant differences were mainly between PARA and DOC context annotators. We also saw in Table 6.3 that the absolute difference in performance was quite small. All in all this means that differences between $\text{WSD}^{\text{PARA}}_{\text{WN++}}$ and $\text{WSD}^{\text{DOC}}_{\text{WN++}}$ will have minimal impact on the content of superiority sets when switching both to BBN for overlap comparison.

Results in Table 6.6 show that for all entity types the superiority overlap for WN++ vs BBN is quite small compared to the overlap of BBN vs WN+gloss and to the overlap of DOC vs PARA.

                                               singles         lemmas          documents
    overlap set                                O      p        O      p        O      p
    O^L(WN++, BBN, PARA, DOC, entity)          0.056  0.500    0.081  0.500    0.250  0.500
    O^L(BBN, WN+gloss, PARA, DOC, entity)      0.591  0.083    0.665  0.083    0.827  0.083
    O^C(DOC, PARA, WN++, BBN, entity)          0.891  0.167    0.856  0.333    0.600  0.333

Table 6.6: Superiority overlap for various configuration parameters. Errors introduced by switching from WN++ to BBN are in much less consistent locations than errors introduced by switching from BBN to WN+gloss or from DOC to PARA.

Relatively speaking, switching from DOC to PARA usually introduces errors on the same

tokens, words and documents, regardless of other configuration parameters. Switching from BBN

to WN+gloss also introduces consistent errors: less so on specific tokens but with high consistency

on lemmas and documents. By contrast, when switching to BBN from each of the relatively similarly behaved $\text{WSD}^{\text{PARA}}_{\text{WN++}}$ and $\text{WSD}^{\text{DOC}}_{\text{WN++}}$, the errors introduced show very little superiority overlap in terms of tokens, lemmas and documents. So it seems that the magnitude of overlap between

superiority sets when re-configuring BBN to WN++ is relatively small compared to other WSD

configuration changes. Inspecting the p-values in Table 6.6 we see that under a random shuffle

of the annotators in the superiority overlap calculation, 50% of results are as large or larger than

$O^L(\text{WN++}, \text{BBN}, \text{PARA}, \text{DOC}, \text{entity})$ for every entity. This means that the superiority overlap calculated for the difference between BBN and WN++ is not significantly different to the superiority overlap of a random selection of superiority overlap sets taken between the annotators selected for the calculation. By contrast, the p-values for $O^L(\text{BBN}, \text{WN+gloss}, \text{PARA}, \text{DOC}, \text{entity})$ are substantially smaller (all 0.083), indicating that this level of superiority overlap is fairly unlikely to occur by random chance between the selected annotators. We conclude that we are unable to identify a statistically significant consistency in the tokens, lemmas or documents on which shifting to BabelNet from WordNet++ decreases performance of Personalised PageRank.

In fact, we have found that the specifics of the errors introduced when shifting the Personalised PageRank graph from WordNet++ to BabelNet are rather inconsistent when varying other parameters. Certainly, the differences introduced by dropping from BabelNet to WN+gloss were more consistent under variation of parameters, and so were the differences when switching from paragraph disambiguation contexts to whole documents. As such, we are led to conclude that the Personalised PageRank errors introduced by the augmentation of WordNet++ to BabelNet are not skewed towards certain topics or words. We note that, at some point, the introduction of new paths between less related senses must inevitably lead to random walks distributing noise rather than signal in a Personalised PageRank network.

6.6 Conclusion

In this chapter we investigated the challenges of linking lexical knowledge base concepts to text that is not trivially segmented into lemmas of the lexicon. In particular we studied the

Japanese language, for which the lack of explicit segmentation and synthetic morphology make identification of an LKB's tokens a non-trivial task, noting that our approach may also be useful for other languages whose lexicons include a large number of multiword expressions. As a first step we developed a general algorithm for identifying inflections of lexical knowledge base entry lemmas that may span multiple tokens in a morpho-syntactic tagging lexicon, and gave specifics for implementations in Japanese tailored for use with the Japanese translation of JSemcor and for identifying JMDict words in the Wakaran Glosser. We applied the JWordNet tokenisation of JSemcor to the evaluation of resources for an existing knowledge based WSD algorithm. In our evaluation we found that the extension of the JWordNet (with gloss links) network for Personalised PageRank to use the Wikipedia-enriched WordNet++ significantly increases performance. Surprisingly, use of additional interlingual links found in the BabelNet superset of WordNet++ led to a decrease in performance, although when changing other parameters to the PPR computation, the decrease did

not have a consistent effect on specific tokens, lemmas, or documents in the JSemcor corpus. In

future work, we would like to reproduce these experiments in other languages and using other network methods for WSD to discover where BabelNet's strengths and weaknesses lie. We would also like to further investigate the performance drop of BabelNet in this experiment by delving into the

Personalised PageRank network itself to identify more details of the differences between BabelNet and WordNet++.

The methods developed in this chapter for tokenising Japanese text with respect to a specific lexical resource are an essential component of the Wakaran Glosser of Chapter 4, which uses

the translation dictionary JMDict. In Chapter 7, usage data collected from that application is used

to construct a sense annotated corpus for the JMDict translation inventory. The corpus is used to

evaluate a method for bootstrapping WSD for the JMDict inventory from the BabelNet Personalised

PageRank networks used in this chapter.

Appendix 6.A Permutation example for overlap p-value computation

To make an objective assessment of whether a superiority overlap value such as

$O^L(\text{WN++}, \text{BBN}, \text{PARA}, \text{DOC}, \text{documents})$ is small or large, we calculated the probability of getting a superiority overlap of equal or larger size using the formula in Equation (6.4). The formula requires calculating permutations of the input

WSD annotators:

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle\}$

For this example, those permutations are:

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{BBN}}, \text{WSD}^{\text{PARA}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{BBN}}, \text{WSD}^{\text{DOC}}_{\text{WN++}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{BBN}}, \text{WSD}^{\text{PARA}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{BBN}}, \text{WSD}^{\text{DOC}}_{\text{WN++}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{DOC}}_{\text{BBN}}, \text{WSD}^{\text{PARA}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle, \langle \text{WSD}^{\text{PARA}}_{\text{BBN}}, \text{WSD}^{\text{DOC}}_{\text{WN++}} \rangle\}$

$\{\langle \text{WSD}^{\text{DOC}}_{\text{BBN}}, \text{WSD}^{\text{PARA}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{PARA}}_{\text{BBN}}, \text{WSD}^{\text{DOC}}_{\text{WN++}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{PARA}}_{\text{BBN}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{PARA}}_{\text{BBN}}, \text{WSD}^{\text{DOC}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{PARA}}_{\text{WN++}}, \text{WSD}^{\text{DOC}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{BBN}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle\}$

$\{\langle \text{WSD}^{\text{DOC}}_{\text{WN++}}, \text{WSD}^{\text{PARA}}_{\text{WN++}} \rangle, \langle \text{WSD}^{\text{DOC}}_{\text{BBN}}, \text{WSD}^{\text{PARA}}_{\text{BBN}} \rangle\}$

Figure 6.3: Sample KAF record for スコッティ は 学校 に 戻ら なかっ た 。 (Scotty ha gakkō ni modora nakat ta.), the Japanese translation of the English sentence Scotty did not go back to school.

Part IV

Computational semantics and the

Wakaran Glosser

Chapter 7

Word Sense Disambiguation in Wakaran

In this chapter we describe the implementation and evaluation of word sense disambiguation for JMDict on documents submitted to the Wakaran Glosser, and the collection of human sense judgements from Wakaran users in both the presence and absence of automatic WSD. Our WSD integration for Wakaran starts with the Personalised PageRank models developed in Chapter 6, for which we have an ontology with a rich network of relations, JWordNet, and have previously performed an evaluation over the annotated corpus JSemcor. We describe the adaptation of those WSD models to work with the JMDict lexicon, for which no supervised annotated corpora exist and which has only sparse cross-referencing rather than a structured ontological network like JWordNet. The resulting WSD algorithm was deployed with Wakaran near the end of its lifetime to automatically preselect (but not promote to an inline position in the text) the automatically determined best sense for words in document submissions. Thus we have a sizeable collection of submissions for which users saw no automatic WSD and a smaller collection for which users sometimes saw a word sense highlighted when they first opened a gloss. For both of these sets we have records of users choosing to promote certain word senses for display inline in the text. As we showed in Chapter 4, the primary usage scenario for Wakaran was classroom reading activities. As such the collection of submissions contains duplicated text from the same source documents, often in fragments of different lengths and from different positions. In Section 7.3.2 we describe the process by which we

detected this duplication and merged word sense choices from multiple users into single documents in an annotated corpus. Since the users are students and are only rewarded implicitly by choosing a good word sense for their own purposes of understanding the text, we do not count this corpus as having gold standard annotations. In this chapter we will treat them as a “bronze standard” instead.

Finally, in Section 7.4, we evaluate a selection of WSD models from Chapter 6 with respect to data collected from Wakaran. The purpose of the investigation is to discover whether data crowdsourced from Wakaran can be of sufficient quality to improve models of computational semantics, what roadblocks to that quality exist, and whether improvements to models of WSD and improvements to the Wakaran glossing interface can be mutually reinforcing.

7.1 Related work

Wakaran has been designed to serve as a research testbed in two mutually reinforcing ways:

1. to crowdsource information about human sense preferences from interactions with word sense

visibility features; and

2. to serve as an extrinsic evaluation of WSD as a first class NLP feature.

In Chapter 4 we presented data collected from the usage of Wakaran showing that users chose to use the optional sense visibility features in the course of their usage, but we have not tracked the learning outcomes of Wakaran users. However, there have been a number of studies that have verified positive effects on vocabulary acquisition during reading tasks for students of a second language when supplied with optional word glosses and, in particular, when those glosses are automatically simplified in some manner. We review a few recent studies in this section. As to the second aim, we are not aware of any studies crowdsourcing sense annotations that arise incidentally from the course of dictionary usage. However, many crowdsourcing studies do exist that set up a dedicated environment for non-professional annotators to contribute sense judgements. We review a selection of such studies in this section. Many more are covered in Section 2.4.

7.1.1 WSD enabled glossing software

Kulkarni et al. (2008) describes a system which applies state of the art WSD to a vocabulary learning system: the REAP Tutoring System (REAP), which is designed to provide information on the meaning of a second language word in the context of an example usage. Kulkarni

et al. (2008) chose to use supervised disambiguation methods placing an emphasis on accuracy as a

primary motivator, much as Nerbonne et al. (1998) did with their more general natural language processing integration (see Chapter 4). A corpus for 30 target words was manually created for coarse

grained sense sets defined by manually lumping together related fine-grained senses in the dictionary used by REAP. The specific machine learning methods used were Support Vector Machines

and Naive Bayes. Features used were bag of words for varying sized context windows around the

target word and part of speech bigrams for local context. Offline evaluation of the trained WSD

models was performed, followed by user studies with 47 human participants. In the user studies

the sense selected by the machine learning algorithm was pushed to the top of the sense list in a

gloss. This was justified by previous research showing that students often ignore all senses after the

first few. WSD was also used for generation of vocabulary tests relevant to the reading task for the

ambiguous words. The results showed a statistically significant (with p < 0.001) improvement in

a vocab test on the disambiguated words for students using the WSD augmented system. This is a

highly encouraging result for research into the application of WSD to glossing software!

Rosa and Eskenazi (2011) updated the work of Kulkarni et al. (2008) with REAP by

crowdsourcing sense-labelled training texts to expand the coverage of the supervised WSD algorithm to 192 new words. They performed another user study to compare different presentations of

WSD results:

1. best sense only,

2. best sense first, and

3. best sense non-first (moved downwards if needed)

Students’ post-reading performance on a vocabulary test and 1-week vocabulary retention were highest for the best-sense-first presentation. Additionally, when asked for their opinion of the glosses, students showed a preference for the presentations that showed all senses over those that showed only one. In our implementation of automatic WSD for Wakaran we keep all senses rather than only the best one so as to avoid the risk of throwing away useful information completely. To give prominence to the automatically selected sense we choose to highlight automatically selected senses rather than re-order them.

Eom (2012) presents a similar study to Kulkarni et al. (2008) involving the implementation of a WSD-enabled glossing system and testing learning outcomes for students using disambiguated glosses. However, the English monolingual dictionary used in the glosser, COBUILD, has a different sense inventory to the WSD system, SenseRelate, which uses WordNet. This necessitated a way of using WordNet WSD results to select COBUILD senses for display. Eom (2012) and Eom et al. (2012a) devised a sense alignment algorithm by performing WSD on the COBUILD example sentences for each sense to build a probability distribution for WordNet senses conditioned on

COBUILD senses. They then considered the space of all alignments mapping each WordNet sense to exactly one COBUILD sense. A probability score was assigned to each alignment as a product of two factors:

1. the product over each WordNet sense of its conditional probability given the COBUILD sense

it has been aligned to (as computed using the WSD output); and

2. a heuristic adjustment to penalise skewed alignments that assign many WordNet senses to one

COBUILD sense whilst leaving other COBUILD senses under-aligned.

The heuristic adjustment factor takes the standard deviation of the number of WordNet senses aligned to each COBUILD sense:

1. inverts the ordering by subtracting all deviations from the greatest deviation amongst all align-

ments to penalise higher variance alignment counts; and

2. applies a normalising constant and an optional smoothing constant to make a probabilistic output.

The end result is a static injective mapping from WordNet senses to a COBUILD sense. It is noted that WordNet senses do often bear a similarity to more than one COBUILD sense, but the limitation of mapping to a single COBUILD sense is accepted for the study.

In Wakaran it is also necessary to map from the WSD inventory, JWordNet, to the glosser interface inventory, JMDict. Unlike Eom et al. (2012a), we do not map a single best sense WSD output to the target dictionary. Instead, we develop a dynamic method for selecting JMDict senses from graded WSD output. Our approach bears some high level similarities to that of Chen et al.

(2015), in the sense that both studies seek to construct a vector representation of word senses by starting with the content words of sense glosses and transforming them by an unsupervised machine learning method. However, the specifics of our methods are completely different: Chen et al. (2015) initialises a neural sense embedding training procedure by composing precomputed word embeddings from the sense glosses; in Wakaran the goal of our approach is to embed senses in a subspace of the graded Personalised PageRank output.

Eom (2012) and Eom et al. (2012b) prepared two reading tasks with manual gold standard

WSD annotations for South Korean learners of English as a second language. A second set of annotations was produced by the knowledge-based automatic WordNet WSD system SenseRelate described in Pedersen and Kolhatkar (2009). Unlike the WSD algorithms used by Kulkarni et al.

(2008) and Rosa and Eskenazi (2011), SenseRelate is an all words disambiguation algorithm and does not require sense-specific training data. Sixty participants were recruited to the study and divided into four groups:

GS to be displayed only the gold standard sense of polysemous words;

SS to be displayed only the sense selected by SenseRelate;

AS to be displayed all senses, with no disambiguation; and

NS to be displayed no glosses at all.

Participants were given the texts to read in the glosser system that allowed clicking on words to view dictionary glosses as prescribed for their group. They were administered pre- and post-tests for vocabulary knowledge of selected words from the texts, to measure vocabulary acquisition.

Additionally, they answered multiple choice reading comprehension questions after reading the text. All groups made statistically significant improvements to their vocabulary except for the NS group, who did not have access to glosses while reading. This shows the importance of dictionary access as part of using reading tasks as a learning activity. The overall ordering of group mean vocabulary improvement was GS > SS > AS > NS, with both sense specialised groups showing more improvement than NS with statistical significance and the gold-sense specialised group improving more than the all-senses group with significance as well. This indicates that quality WSD does indeed lead to better learning outcomes. Interestingly however, the average number of glosses accessed by members of each group had the same GS > SS > AS > NS ordering, although in this case the differences were not statistically significant. It is nonetheless noteworthy that the students with the more specialised and more accurate glosses made more use of them, indicating that trust in the information provided by the system and the level of effort required to understand it may be factors in both usage of and benefit from the glossing software. That is not to say that the number of glosses accessed is the only factor in learning: the results showed that the accuracy of WSD selections displayed to users in the SS group had an effect on vocabulary acquisition. For words where students failed on the pre-test and later viewed a gloss, 76% of correct senses shown translated to a success for that word on the post-test, whereas only 50% of incorrect senses shown led to students using the word correctly on the post-test. Interestingly, students usually retained knowledge of words that they used correctly in the pre-test irrespective of whether they viewed an incorrect gloss or a correct one. The results for reading comprehension were less well pronounced. The group of students who were displayed the WSD system derived senses (SS) received the highest scores on average, with the students who were displayed gold senses close behind. However, the only pairwise significant difference between the groups was between SS and the group of students with no glosses displayed

(NS). Overall it seems that it is better to be shown a gloss, even of a nominally incorrect sense, as the quality of the WSD mainly affects the degree of improvement. The results also indicate that being shown the correct gloss alone is more beneficial than being shown all potential senses of the word, at the very least because students use all-senses glosses less often. In Wakaran we aim to gain the benefits of single sense selection without risking full removal of a misidentified correct sense by highlighting senses selected by automatic WSD in the glosser interface without removing the other senses entirely.

Dang et al. (2013) implemented a glosser called RoLo aiming to improve learning outcomes by decreasing the cognitive load of reading long dictionary entries with word senses irrelevant to the context. Like Wakaran, RoLo allows users to manually select a sense from a dictionary pop-up to be recorded for later access. Additionally, RoLo allows users to add their own hypothesised senses to the list. They use two strategies to assist users with long dictionary entries, both

based on previous usage the user has made of the glosser:

1. presenting example usages of the same word taken from a store of documents the user has

previously requested glosses for; and

2. moving dictionary senses and manually entered entries selected recently by the user (in other

documents) to the top of the list.

No semantic model of WSD is used, but under the assumption that readers will tend to read documents on the same topic, the historic sense selections feature does leverage a one-sense-per-discourse heuristic. Dang et al. (2013) ran an evaluation in which 34 non-first-language English speakers used

RoLo in two sessions using two prepared texts. Half of the users were given access to the sense

and usage recall features and the other half only had the sense selection features. When tested for

vocabulary acquisition of words common to the two texts, the group who had access to the recall

features scored higher with statistical significance. In Wakaran we implemented automatic WSD as a way of simplifying the search for an appropriate word sense and have not implemented any way to connect usage from one session to another. However, in future work the methods used by Dang et al. (2013) in RoLo may prove useful additions to the Wakaran functionality.

Veras et al. (2014) designed a glossing software package for touch-enabled tablet computers following similar principles to Dang et al. (2013). In this case the monolingual glossing dictionary is targeted at young learners of English as a first language as well as second language learners. The application tracks words the learner has seen before and highlights them in the text.

In glosses it presents synonyms of the target word, ranking them by a two factor score:

1. familiarity to the user;

2. appropriateness to the context.

Appropriateness to the context is determined by a simple lexical substitution measure that takes an adjacent word in the text and prefers synonyms of the target that frequently co-occur with it in a large corpus. A second style of available gloss presents definitions instead of paraphrases.

To reduce cognitive load this gloss presents only the first sense, but has an interactive feature to show definitions of more senses. Veras et al. (2014) have plans to introduce automatic WSD to the glosser in the future. They also have plans to study the outcomes for learners using the software.

Interestingly they have a focus on using these gloss simplification tools as a way of reducing learner reading anxiety when confronted with unfamiliar language.

Kulkarni et al. (2008) and Eom (2012) have started down the path of testing WSD in a human context with a small number of machine learning strategies and contextual features and a single strategy for presenting the outcome of disambiguation to users of a glossing software system. One purpose of our research is to open up the possibility of exploring this area more deeply by opening a glossing system backed by WSD to the world wide web, thus gaining access to a much larger pool of potential participants. Additionally, by introducing interactive features for gloss senses we intend to collect more detailed information about user interpretations of word meaning without the need for direct observation of the users in a controlled laboratory.

7.1.2 Crowdsourcing sense labels

There have been a number of recent studies using methods for acquiring sense annotations from non-experts. Many have used a gamification strategy, inviting study participants to play a computer game designed to elicit word sense intuitions. Examples span from traditional sense annotation tasks that award points for agreement (Venhuizen et al. 2013) through lexical substitution games (Seemakurty et al. 2010) to reflex-based image selection games (Jurgens and Navigli 2014).

Gamification techniques are reviewed in detail in Section 2.4. In our study, students select senses from a list while reading a full text, and are motivated by their own information needs to select a good fit for the context rather than by competition with others. The most similar crowdsourcing studies are probably those that most resemble professional annotation but are crowdsourced to non-experts on platforms like Amazon Mechanical Turk.

Passonneau and Carpenter (2014) took a corpus that had previously been annotated by trained participants and reproduced annotations for a subset using a Mechanical Turk crowdsourcing

method. They aimed not just to get the same quality annotations at lower cost but also to be able

to build a probability model of sense occurrence and individual human annotation accuracy. The

point of the probability model was to be able to put a confidence value on the final annotations. To

achieve this they collected up to 25 human judgements per token, which is many more than would

usually be collected from trained annotators. The model has parameters for the true prevalence of

each sense in a categorical distribution and a categorical distribution for each human annotator of the

label they will apply conditioned on the true label. The parameters are estimated using a smoothed

maximum likelihood estimate for the observed annotator choices with each categorical distribution

drawn from a Dirichlet prior. On some labels that the previous professional annotators had been

in low agreement, Passonneau and Carpenter (2014) found that they could place a high confidence

on a crowdsourced label using the probability model. In particular, the model had the flexibility

to represent the extent to which each participant was biased towards selecting the first sense, or

to select certain similar distracting senses for specific true senses. Comparing the crowdsourcing costs to the previous trained annotation, Passonneau and Carpenter (2014) conclude that each gold sense label cost US$0.70 with training and only US$0.33 from crowdsourcing. Despite the lower cost, the crowdsourced data is arguably more useful: the professionally annotated corpus had an unusually high number of single-annotated tokens whereas the crowdsourced labels come with a large number of labels per token, to the extent of allowing estimates of gold label confidence, sense similarity (Lopez de Lacalle and Agirre 2015b) and also inherent token ambiguity (Lopez de

Lacalle and Agirre 2015a).

Martínez Alonso et al. (2013) crowdsources sense annotations to investigate cases where a token takes on two recognised meanings simultaneously. They study the specific case of words that undergo regular polysemy, arising from close semantic relations between concepts. For instance, the word bank can vary between a building at a location and the organisation housed there:

(41) The building next to the bank.

(42) I owe the bank a lot of money.

(43) I am going to the bank today.

Here, the word bank takes two or three different meanings: Example (41) refers to the physical building; Example (42) refers to the organisation; and Example (43) refers to both simultaneously.

They obtained annotated examples of these usages for various languages with the English subcorpus being annotated using crowdsourcing through Mechanical Turk, with five annotators per token.

They note that, in comparison to the trained annotators for the other languages, the crowdsourced annotators very rarely chose the underspecified sense option and hypothesise that the concept is too advanced for untrained annotators. Alonso and Romeo (2014) develop a method for post-processing crowdsourced data with experts to gain more reliable data from a hybrid collection method.

Since in our method we do not control the documents collected nor the tokens for which senses are applied, we cannot guarantee multiple annotations per token. Therefore we focus on aggregated statistics and trends for the experiment at this stage, and identify sources of bias in the way students use the Wakaran glossing interface. Based on our results we pose recommendations

7.2 Integrating WSD with Wakaran

While JWordNet comes with a semantic network which enables implementation of WSD using network algorithms such as Personalised PageRank, JMDict has no equivalent network. In

Chapter 6 we developed procedures for adapting a tokenisation from a morphologically rich lexicon

to tokenise text with respect to an under-resourced lexicon, in that case JWordNet. Using those

methods described in Section 6.4.1 we can also successfully identify JMDict headwords in text but

the method is oblivious to the English glosses and cannot discriminate between them. On the other

hand, the method in Section 6.3.1 can be used to tokenise JWordNet headwords in text and thus to

build a context of JWordNet concepts from which a Personalised PageRank network can be seeded,

but a JWordNet context model does not directly discriminate between the JMDict senses described

by English glosses. In this section we build a mapping between senses in the JMDict inventory and

synsets in JWordNet and use it to implement Personalised PageRank WSD for JMDict. We then

describe how it was used in Wakaran to highlight the automatically judged best sense when users

open a gloss.

7.2.1 PPR sense embedding for JMDict

The essence of our approach is to assign each JMDict headword a real-valued semantic feature space, as a subspace of the output of Personalised PageRank over JWordNet, and to map its

senses to vectors in the space. The dimensions of the vector space correspond to concept nodes that

exist in JWordNet and their values correspond to the activation level of those nodes in a Personalised

PageRank computation. To ensure relevance of the vectors to discrimination between the word’s

senses, the dimensions of the semantic vector are chosen as the union of:

• If the JMDict headword appears in JWordNet, the synsets of all of its senses therein.

• For each word in each English gloss of each sense of the JMDict entry, if it exists in WordNet,

the synsets of its senses therein.

Since JWordNet synsets are directly linked to WordNet synsets, we treat them as identical for the purposes of both the Personalised PageRank network and the vector representation of JMDict headword semantics.

Our method for identifying synsets to represent a JMDict headword requires identifying

synsets relevant to the English glosses. In Section 6.1 we reviewed the method used by Landes et al.

(1998) to assign WordNet lexemes to SemCor tokens. That method did pre-processing to identify

MWEs in WordNet and utilised a custom Brill Tagger to account for them. However, their goal was to get 100% coverage of all open class tokens in the text; ours is just to produce a set of synsets related to the Japanese headword, so coverage is not as important. Therefore we use a simplified method:

1. First we preprocess to separate punctuation from tokens using the Natural Language Toolkit's nltk.tokenize.word_tokenize function. This step also implicitly splits tokens on whitespace by identifying strings of non-whitespace text.

2. For POS tags we use the Natural Language Toolkit's recommended maximum entropy Penn Treebank tagger. Tags are mapped to the WordNet POS codes using a simple mapping on the first letter of the Penn Treebank POS tags: J to adjective, V to verb, N to noun and R to adverb.

3. Tokens are post-processed by the morphy lemmatiser that comes packaged with WordNet and is designed to identify WordNet headwords from inflected tokens. It uses a heuristic for stripping common inflective suffixes.¹
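To make the pipeline concrete, the following is a minimal sketch of the three steps using NLTK; the helper name gloss_synsets is our own illustrative choice, not the implementation used by Wakaran.

    # Minimal sketch of the three steps above using NLTK; gloss_synsets is a
    # hypothetical helper name, not the code used in the thesis.
    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    from nltk.corpus import wordnet as wn

    # Step 2: map the first letter of a Penn Treebank tag to a WordNet POS code.
    PENN_TO_WN = {"J": wn.ADJ, "V": wn.VERB, "N": wn.NOUN, "R": wn.ADV}

    def gloss_synsets(gloss):
        """Return the WordNet synsets for the open-class words of an English gloss."""
        synsets = set()
        for token, penn in pos_tag(word_tokenize(gloss)):  # steps 1 and 2
            pos = PENN_TO_WN.get(penn[0])
            if pos is None:
                continue  # skip closed-class tokens and punctuation
            lemma = wn.morphy(token.lower(), pos)  # step 3: heuristic suffix stripping
            if lemma is not None:
                synsets.update(wn.synsets(lemma, pos=pos))
        return synsets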

Having chosen the combined WordNet/JWordNet synsets to participate in the semantic vector for each headword, the final step is to assign a vector to each sense. Ultimately we will want to compare these vectors to the output of Personalised PageRank run over the JWordNet graph, so each sense's vector should be an estimate of the Personalised PageRank activations of the chosen synsets in contexts where that sense of the word is used. Thus, we estimate the vector for a sense by running Personalised PageRank, personalised to the synsets we extracted from the English glosses for that sense alone, and recording the final activations of all synsets chosen for the headword.

¹ See: https://wordnet.princeton.edu/man/morphy.7WN.html
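A minimal sketch of this per-sense estimation, assuming the JWordNet relation graph is available as a networkx graph over synset identifiers; the names jwn_graph, headword_synsets and sense_gloss_synsets are all hypothetical.

    # Hedged sketch of per-sense vector estimation; assumes the JWordNet graph
    # is available as a networkx graph whose nodes are synset identifiers.
    import networkx as nx

    def sense_vector(jwn_graph, headword_synsets, sense_gloss_synsets):
        """Personalise PageRank to one sense's gloss synsets, then read off the
        activations of the synsets chosen as dimensions for the whole headword."""
        seeds = {s: 1.0 for s in sense_gloss_synsets if s in jwn_graph}
        activations = nx.pagerank(jwn_graph, alpha=0.85, personalization=seeds)
        # Fixed dimension ordering so all senses of the headword are comparable.
        return [activations.get(s, 0.0) for s in sorted(headword_synsets)]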

Thus, we have created a mapping between JMDict senses and JWordNet synsets. Each JMDict headword is assigned a set of related JWordNet synsets based on the Japanese headword and the English glosses. Each sense of the JMDict headword is assigned a vector giving an estimate of the Personalised PageRank result for those synsets in contexts where the sense is used. We now have all the pieces assembled to perform all-words WSD over JMDict for arbitrary Japanese texts:

• the JMDict sense to JWordNet synset mapping described here;

• the model for Personalised PageRank WSD over JWordNet described in Section 6.5;

• the JWordNet tokenisation algorithm described in Section 6.3.1; and

• the JMDict tokenisation algorithm described in Section 6.4.1.

Precisely how these parts come together is the subject of the remainder of Section 7.2.

7.2.2 Disambiguating JMDict words in context

The core idea of our disambiguation algorithm for JMDict is to take a context-sensitive Personalised PageRank and compare it to the vector representation of JMDict senses developed in Section 7.2.1. However, in order to do this two pieces need to be in place:

• Seed concepts for the Personalised PageRank network.

• JMDict headwords to disambiguate from the text itself.

In this section we describe how each of those is identified and how the Personalised PageRank result is compared to the precomputed vectors for each JMDict sense.

Morphological processing and tokenisation

We saw in Chapter 6 that automatic identification of words from a Japanese lexicon in ordinary text is a non-trivial task, in large part due to the lack of explicit space-delimited word boundaries. In Section 6.3.1 and Section 6.4.1 we developed the means to identify words from the JWordNet and JMDict lexicons respectively, via a mapping from a finer-grained morphological tokenisation lexicon. We use both for the purposes of WSD for JMDict: using the same core IPAdic tokenisation, MWEs in both lexicons are identified using the methods of Chapter 6. Thus, given an input text $T$, we compute $T_{JWN}$ and $T_{JMD}$, tokenisations with respect to Japanese WordNet and JMDict respectively.

Contextual PageRank

We seed the personalisation vector for the Personalised PageRank network using the standard method for monolingual WSD. That is, we take the set of lemmas found in $T_{JWN}$ and find the union of all concepts found in the lexical knowledge base for those lemmas. Although Personalised PageRank-w2w performs better than Personalised PageRank for standard WSD (Agirre et al. 2014), it is not appropriate for this task: we are not disambiguating single words in the lexicon of the lexical knowledge base, and so are not able to leave out concepts linked directly to a word targeted for disambiguation. Instead, we perform a standard Personalised PageRank calculation and extract the final weights of all concepts used for the vector representations of the JMDict lemmas in $T_{JMD}$, as computed in Section 7.2.1.
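For illustration, a sketch of this seeding step under the same assumptions as before; jwn_senses_of is a hypothetical lookup from a JWordNet lemma to its LKB concepts.

    # Sketch of the standard monolingual seeding: personalise with the union of
    # all LKB concepts of the lemmas found in the JWordNet tokenisation T_JWN.
    import networkx as nx

    def context_activations(jwn_graph, t_jwn_lemmas, jwn_senses_of):
        seeds = set()
        for lemma in t_jwn_lemmas:
            seeds.update(jwn_senses_of(lemma))  # all concepts of the lemma
        personalization = {s: 1.0 for s in seeds if s in jwn_graph}
        return nx.pagerank(jwn_graph, personalization=personalization)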

Grading of JMDict senses

To distinguish between JMDict senses we extract headword-specific vectors from the full-context Personalised PageRank vector and score each sense according to its vector similarity to the context. A number of conventional vector space similarity measures could apply to this task, including:

Cosine similarity, which gives the cosine of the angle between two vectors. It is simple to calculate:

$$\cos(\theta) = \frac{a \cdot b}{|a||b|}$$

where $\cdot$ is the standard inner product and $\theta$ is the angle between vectors $a$ and $b$. This is a bigger-better measure since $\cos(\theta)$ decreases for vectors that have greater angles between them.

$\ell^n$ similarity, which is a family of measures on the difference of two vectors:

$$\ell^n(a, b) = \sqrt[n]{\sum_i |a_i - b_i|^n}$$

which for $n = 2$ is the Euclidean distance. This is a smaller-better measure since $\ell^n$ increases for vectors that have greater distance between them.

Cosine similarity is independent of the magnitude of the context vector, depending only on the relative mixture of weights assigned to concepts by the Personalised PageRank. This seems at first an attractive feature if we consider a JMDict sense to be a mixture of WordNet concepts. However, some polysemous words in JMDict map to only a single concept in WordNet. These words cannot be disambiguated by the cosine measure since all such vectors lie on the positive real line with pairwise similarity of $\cos(0) = 1.0$. This special case highlights the relevance of the magnitude: each JMDict sense will have either a high or low score for the WordNet concept depending on whether a synonym appears in the sense's gloss. In general, we do not want to factor out magnitude when comparing Personalised PageRank vectors, since they are already effectively normalised by the constraints of the algorithm. Thus we use the $\ell^2$ metric for comparing JMDict senses to a Personalised PageRank context vector.
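A sketch of the resulting grading step, assuming the precomputed sense vectors of Section 7.2.1; all names here are illustrative.

    # Sketch of sense grading by l2 distance: extract the headword's dimensions
    # from the full context vector and rank senses by closeness (smaller-better).
    import math

    def grade_senses(context_activations, headword_synsets, sense_vectors):
        """Return sense identifiers ordered best-first by l2 distance."""
        dims = sorted(headword_synsets)
        context_vec = [context_activations.get(s, 0.0) for s in dims]

        def l2(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        return sorted(sense_vectors,
                      key=lambda sense: l2(context_vec, sense_vectors[sense]))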

Presentation

In the final output we have three options for presenting the results of disambiguation:

graded by similarity may be confusing since the similarity values are not easily interpretable, and even with interpretable values different people tend to operate on different magnitudes (Erk et al. 2013).

ranking by similarity is the most appropriate to account for agreement between annotators giving grades (Erk and McCarthy 2009).

single best achieves the aim of reducing the information presented to a user of the application; however, it suffers the most when the WSD algorithm gets the wrong answer.

Ranking would have been our preferred approach. However, documentation accompanying JMDict indicates that certain conventional markings on senses, such as part of speech and formality labels, carry over to all subsequent senses. Although these can be automatically propagated, this indicates that the glosses are intended to be presented sequentially, so, erring on the side of caution, we chose not to re-order them. Instead we choose the single best sense (that of greatest similarity to the context) to highlight in the user interface of Wakaran. To mitigate the impact of making an incorrect selection we continue to display all other senses, just without the highlighting received by the single best. It is in the nature of the Wakaran interface that users are then free to highlight a different definition if they are not happy with the selected sense. One upside to this situation from a research perspective is that users have a self-interest and a low barrier of entry to provide the Wakaran system with corrections to annotations it gets wrong. Note that we do not want the automatic WSD feature to remove the optional nature of gloss consultation, so we do not go so far as to promote the selected sense inline in the text. We only activate the selection feature that expands a gloss and increases its display contrast. Figure 4.1 (in Chapter 4) shows the distinction between a selected sense (sense 2), an unselected sense (senses 3+) and a sense that has been promoted inline (sense 1).

7.3 A corpus of Wakaran submissions

The ultimate purpose of the present study with Wakaran is to show the feasibility of evaluating WSD algorithms with data collected from events naturally occurring in its use: that is the subject of this section. We start by describing the construction of a corpus of documents submitted to Wakaran and the conversion of user interaction data into annotations on that corpus. To handle cases where the same text (or different parts of the same text) was submitted by multiple users, we also describe the method we used for near-duplicate and subdocument detection. Finally, we examine the sense selections made by users and compare them to selections made by a number of configurations of automatic WSD systems.

7.3.1 Exporting and annotating the corpus

Our first step towards using document submissions for WSD evaluation was to export a faithful copy of each submission to an on-disk format. We chose the Kyoto Annotation Format (KAF; Bosma et al. 2009) for its expressivity and similarity to the Wakaran data model (see Section 4.5.1). A description of the KAF format, with examples of how it was used in the Japanese Semcor distribution, can be found in Section 6.3.5. In this section we outline how submission content and user activity data were encoded into sense-annotated KAF documents.

Information exported to the corpus

Each submission was exported to a single KAF file and assigned a directory and filename based on the stored corpus and document name attributes of the submission. Wakaran assigned a new corpus name to each user session so the exported Wakaran corpus is organised into subcorpora by user session.

The exported documents are labelled with linguistic processor metadata indicating how the submission was processed by Wakaran at the time it was submitted. These include each of the following processors in one of its versions:

MeCab for tokenisation.

Wakaran sentence segmenter for sentence and paragraph labels on tokens (noting that paragraph labels were used to define WSD context boundaries).

Wakaran JMDict retokeniser for linking of JMDict entries to tokens in the form of Wakaran glosses, with versions:

  1.0 The initial pipelined implementation of the retokenisation algorithm described in Section 6.4.1. This version filtered out glosses for common stopwords from the interface.

  1.1 A modular re-implementation of the JMDict retokenisation. This version produces the same tokenisation as the previous, but stopword filtering was left out so as to present all available entries of the JMDict dictionary.

  2.0 The optimisation of JMDict retokenisation outlined in Section 6.4.2, which avoids the overgenerate-and-filter approach to JMDict headword identification. The version number was increased to 2 because this version changes the set of JMDict entries identified as glosses on some texts.

Wakaran JMDict Personalised PageRank for automatic selection of JMDict senses in the Wakaran interface using the method described in Section 7.2.2. This existed in two versions, or three if you count the time with no automatic disambiguation:

  No version of WSD ran at all for some time after the first release of Wakaran.

  1.0 The initial release used cosine similarity to compare Personalised PageRank output vectors to JMDict sense representation vectors, which caused perverse results for JMDict senses that had been mapped to only a one-dimensional vector. This version ran for just over one week and is not used when aggregating scores for WSD algorithms later in this chapter.

  1.1 The main release, using $\ell^2$ similarity.

The token layer of the Wakaran data model records the surface text segments and original character indices of the MeCab tokenisation performed in the first step of the Wakaran retokenisation algorithm (see Section 6.4.1). This translates neatly into the text section of a KAF document. Note that neither Wakaran nor the KAF export records part of speech or lemma information for tokens; for KAF such information comes only in the terms layer. Identifiers for the token elements in the KAF export are derived from the primary key of the token in the Wakaran data model.

The terms layer of the KAF export of Wakaran records the output of the Wakaran retokenisation algorithm in the form of externalRef links to JMDict senses. Gloss objects in the Wakaran data model are mapped directly to term elements in the KAF model:

token links of the Wakaran gloss become KAF target elements contained in the term. Each links to the KAF element ID of the exported token. In this way the exported KAF term links to the set of tokens its source gloss was linked to.

The gloss index of the gloss in the submission's list of glosses becomes an element ID for the KAF term.

headwords assigned to the gloss from JMDict headword entries become the lemma of the KAF term.

paraphrases are links from the gloss to JMDict senses and are exported to the KAF term as externalRef elements with an identifier encoding the JMDict entry ID and the sense number.
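As a rough illustration of the shape of one exported term, the following sketch builds a term element with ElementTree; element and attribute names approximate the description above rather than reproduce the exact KAF schema, and the identifiers are invented.

    # Illustrative sketch of one exported term; attribute names and identifiers
    # are approximations of the layout described above, not the exact schema.
    import xml.etree.ElementTree as ET

    term = ET.Element("term", id="t42", lemma="辞書")   # gloss index and headword
    span = ET.SubElement(term, "span")
    ET.SubElement(span, "target", id="w17")             # link to an exported token
    ET.SubElement(term, "externalRef",
                  resource="JMDict", reference="1582700-2")  # entry id + sense no.
    print(ET.tostring(term, encoding="unicode"))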

The paraphrases listed in a Wakaran gloss have mutable state that can be affected by user interaction and by the output of Wakaran's automatic word sense disambiguation. Information about the state of each paraphrase is exported as externalRef elements nested inside the reference already exported for the paraphrase itself. These nested externalRef elements are marked with the source of the annotation. In the case of user annotations, they are marked with the random identifier assigned to the browser session of the submission. Automatic WSD annotations are marked as coming from the Wakaran Personalised PageRank annotator. Exported information includes:

user selections of paraphrases, where a paraphrase has been highlighted by the user, are exported

as a session annotation and assigned a confidence of only 0.1, because we found that users almost never deselect paraphrases once selected.

user promotions of paraphrases, where a paraphrase from the gloss pop-up has been chosen to be displayed above the linked tokens in the main text, are exported as a session annotation and assigned a confidence of 1.0 to clearly distinguish them from selections, which have a much more subtle user interface impact.

automatic selections made by the Wakaran Personalised PageRank annotator are exported as Personalised PageRank annotations and represented by the score assigned by the Personalised PageRank annotator to the paraphrase. Note that Wakaran normalises annotator scores to the range [0, 1] and sets the initial state of a paraphrase to selected if and only if its score was 1.0, that is, the sense had the highest score of all senses of the word.

The Wakaran sense identifier encoding changed between Wakaran retokeniser versions 1.N and 2.0 to make it future-proof against upgrades of the version of JMDict used by Wakaran. When exporting a document to KAF we used the sense identifiers of the retokeniser that processed the submission. However, we built a map from the version 1.N encodings to the version 2.0 encodings, and for the remainder of this section they are treated as identical, having been mapped to the 2.0 format where necessary.

In its lifetime, Wakaran had a number of incremental changes but only one major update: the release of automatic WSD for glosses. As we will later see, user behaviour changes significantly between those two major versions, thus we take care to label each submission with the WSD version used. This effectively splits the corpus into two parts: the USER-ONLY and WITH-WSD subcorpora. In fact, for approximately the first week of its deployment, the Wakaran WSD implementation operated with the cosine vector similarity measure before switching to the final $\ell^2$ measure (see Section 7.2.2). For this reason, we take USER-ONLY to comprise documents before the WSD release date of 8 October 2013, and WITH-WSD to comprise documents from the release date of the $\ell^2$ measure, 16 October 2013, onwards. Documents from the interim period we denote the COS-WSD subcorpus: these documents are excluded from most of our analysis.

7.3.2 Near duplicate and subdocument detection

The Wakaran corpus contains a large number of segments of duplicated text, due mainly to classroom activities being its primary use case. We used well-established techniques for near-duplicate and subdocument detection to identify submissions sourced from the same document, and merged clusters of submissions from the same source into a single bronze standard for evaluation. The purposes of this preprocessing step include:

• to reduce work in running WSD algorithms for evaluation repeatedly on the same text;

• to retain more context for WSD algorithms to access rather than test them on arbitrary fragments of sentences;

• to prevent highly duplicated text from being assigned greater weight in evaluation;

• to increase the density of annotations on bronze standard documents and represent potential differences in human judgement in the bronze standard.

In this section we outline the methods used for near-duplicate and subdocument detection, and how annotations were merged to form a more densely annotated bronze standard.

Detecting near-duplication

For a number of reasons, text submitted to Wakaran from the same source document does not come through exactly the same way every time. A major source of differences between submissions is that many users select only a part of the whole text for submission. In addition to this major content variation, there are small insertions or deletions of what appear to be page numbers, header or footer text and explanatory notes in parentheticals. There are also many cases of a single heading word or final punctuation being included by some and not by others. Due to these various kinds of difference we need to be able to detect not only when the text of one submission is a smaller part of another, but also cases of this happening with small differences in the content as well. To account for this we used the methods of Broder (1997) for detection of duplication in document collections.

Given that we expect to find text duplication due to students copying text from a single copy distributed by a teacher, it is tempting to use a standard method from computer science for exact substring detection such as tries or suffix arrays. However, manual inspection of the data reveals a proliferation of minor differences, ranging from insertion of stray characters such as triangles, underscores and single digits to insertion of short parentheticals containing Japanese paraphrases or readings. We put this down to differences in the users' host system software: the web browser, the operating system's text clipboard behaviour and the document viewer used to view the original document. In particular, our own experience with Portable Document Format (PDF) viewers and web browsers has been that out-of-flow content such as page numbers, footnotes, interlinear glosses and paragraph dividers is converted to plain text inconsistently across applications and platforms when copied to the operating system clipboard.

One other change we observed was cases of a single word being replaced or corrected, which manifests as another case of small differences between texts. Consequently, detecting duplicate text in a system designed for free submission of text, without control over the user software or document content, requires a method for inexact matching.

An old method for measuring the similarity of two pieces of text is the Ratcliff/Obershelp algorithm of Ratcliff and Metzener (1988). The basic idea of the algorithm is to find the longest common substring between the two texts and then to split each text at the location of the common substring. The algorithm then recurses on the parts of the texts before the common substring and, likewise, the parts after it. The final similarity is twice the number of characters aligned as common substrings divided by the total number of characters in both texts. A side effect of the algorithm is to produce a character-by-character alignment of the two texts. This method ought to be effective on our data, but its time complexity is cubic in the length of the texts and could therefore be quite costly for pairwise comparisons between all texts of a large corpus.
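The algorithm is available in Python's standard library as difflib.SequenceMatcher, which we return to later in this section; a quick illustration of the measure on two invented strings:

    # Ratcliff/Obershelp similarity via Python's standard library; ratio()
    # returns twice the matched characters over the combined length.
    from difflib import SequenceMatcher

    a = "the cat sat on the mat"
    b = "the cat sat on a mat."
    print(SequenceMatcher(None, a, b).ratio())  # ~0.88 for this pair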

Broder (1997) published algorithms for duplicate text detection which do not require iterating over alignments between document positions and are therefore much more efficient. In addition to this, the method lends itself to effective dimension reduction heuristics to further decrease the cost of doing comparisons between documents. The fundamental idea is to break each document into the set of its n-grams and to compare documents on the basis of their set intersection.

Two kinds of similarity are addressed: resemblance and containment. Resemblance, the degree to which two documents contain the same text, is measured as the Jaccard similarity of the n-gram sets. That is, resemblance is the number of n-grams in the intersection divided by the total number of n-grams present in either:

$$R(D_1, D_2) := \frac{|S_n(D_1) \cap S_n(D_2)|}{|S_n(D_1) \cup S_n(D_2)|}$$

where $S_n(D)$ is the set of n-grams (Broder (1997) calls them "shingles") of document $D$. Containment is a measure of the extent to which one document $D_1$ is contained in the other, $D_2$, and as such is the size of the intersection divided by the number of n-grams in $D_1$ alone:

$$C(D_1, D_2) := \frac{|S_n(D_1) \cap S_n(D_2)|}{|S_n(D_1)|}$$

This latter measure of similarity is very useful for detecting cases of submissions that contain different-sized portions of the same document.

In addition to defining these efficiently comparable measures of resemblance and containment, Broder (1997) gave methods for reducing the number of n-grams that need to be compared to get a confident estimate of the true similarity. The first is an optimisation for resemblance only. It requires a deterministic mapping $g$ of n-grams into a well-ordered space where the ordering is statistically independent of the original document order of the n-grams. Broder (1997) recommends using a text hashing function for $g$ and we chose the md5 hashing algorithm for our implementation. The set of n-grams for each document $D$ is then mapped itemwise under $g$ into the ordered space (in our case, of md5 hashes) to get the image of $S_n(D)$ under $g$:

$$g(S_n(D)) := \{g(s) \mid s \in S_n(D)\}$$

When two documents are compared, resemblance is calculated using the standard formula but on only the smallest N hashes of the union of the two documents' n-grams. Because of this, it is sufficient to store only the smallest N hashes of each document: this reduced set of hashes is called the document's sketch. Thus the method approximates resemblance by drawing (at most) N random samples without replacement from the union of two documents' n-gram sets and counting the proportion of the samples that lie in the intersection. This method is now commonly referred to as a minhash algorithm, and advances on the concept continue to be developed today (see, for example, Shrivastava and Li (2015)). The simplicity of the Broder (1997) method, particularly its simplicity of implementation, led us to select it for the present study; however, in future expansions of this work newer descendants of the algorithm may be called for.
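A minimal sketch of this bottom-N construction, with md5 as the hash g; the parameter defaults follow the values used later in this section, and the function names are our own.

    # Bottom-N minhash sketch in the style of Broder (1997), with md5 as the
    # deterministic hash g; parameters follow the text (5-grams, N smallest).
    import hashlib

    def sketch(tokens, n=5, N=200):
        """md5-hash every n-gram and keep only the N smallest hashes."""
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        hashes = {hashlib.md5(" ".join(g).encode("utf-8")).hexdigest()
                  for g in grams}
        return set(sorted(hashes)[:N])

    def resemblance(sketch1, sketch2, N=200):
        """Estimate Jaccard resemblance from the N smallest hashes of the union."""
        sample = set(sorted(sketch1 | sketch2)[:N])
        return len(sample & sketch1 & sketch2) / len(sample)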

Broder (1997) also gives an approximation method for calculating containment of one document in another. The method is similar to the minhash method, but instead of taking the smallest N hashes, the size of the sketches is reduced by converting hashes to integers and keeping only items that are equal to 0 modulo m. Containment of the n-gram sets is then approximated by the containment of the sketches.

For resemblance and containment we place a threshold t on the value of the similarity measure to designate when the relationship is held to exist. The sketch-based approximations of resemblance and containment can both be viewed as variables with a hypergeometric distribution, where samples are drawn from the union of the n-gram sets (in the case of resemblance) or the contained document only (in the case of containment), and a success is a sample drawn from the intersection. For large enough documents this can be approximated by a binomial distribution with probability of success being the true similarity. Broder (1997) uses this fact to derive a formula for the expected number of ε-sized errors when using sketches to estimate resemblance in a document collection. We use it as part of our detection algorithm itself: if our approximated similarity e is less than threshold t, we calculate the cumulative probability p of getting an estimate e′ < e if the true similarity were in fact exactly equal to t. In cases where the approximated similarity does not exceed the threshold but is reasonably likely (p ≥ 0.05) under a qualifying true resemblance, we give it a second chance by calculating the true resemblance or containment on the n-gram sets themselves. We also calculate the true similarity directly in cases where the document size is too small to use the approximated distribution, but in these cases the computational cost is small.
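A sketch of this second-chance test under the binomial approximation, using SciPy; the function name and exact interface are illustrative.

    # Sketch of the second-chance test: when the sketch estimate falls below
    # threshold t, check whether such a low estimate is still plausible under
    # a true similarity of exactly t before paying for the exact comparison.
    from scipy.stats import binom

    def needs_exact_check(successes, samples, t=0.9, alpha=0.05):
        """True if the estimate fails the threshold but p >= alpha under a
        binomial(samples, t) model, so the exact n-gram comparison is run."""
        if samples == 0 or successes / samples >= t:
            return False  # qualifies outright (or nothing to test)
        return binom.cdf(successes, samples, t) >= alpha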

Deduplication of submissions

Since the resemblance detection caps the size of the sketches to a maximal value and containment does not, we chose to perform deduplication based on resemblance first, to reduce the size of the corpus before running the containment detection on the deduplicated corpus. We performed our deduplication on paragraphs rather than whole submissions, as this was the size of the context used by Wakaran's automatic WSD. As such, we refer to this unit of text as a context in this section. The final result of our deduplication of document submissions is a set of representative original contexts found not to be contained in any other, with user annotations from any near-duplicate or contained contexts having been merged.

In preparation, we split the corpus of document submissions into three parts, USER-ONLY, COS-WSD and WITH-WSD, along the lines of the major releases of automatic WSD for Wakaran as explained in Section 7.3.1. We then split each submission into paragraphs, the disambiguation context size used by Wakaran's automatic WSD. Contexts smaller than 11 tokens were filtered out, considering them to be too fragmentary to contain useful annotations. In all, 10160 contexts were filtered out and 8891 contexts of reasonable size were retained. Table 7.1 gives the size and part of speech distribution of the corpus after short contexts and COS-WSD submissions have been filtered out.

Part of Speech    USER-ONLY   WITH-WSD     Total
Noun                 253819      80422    334241
Verb                  59212      33283     92495
Adjective             31913      13840     45753
Adverb                13540       3891     17431
Conjunction            7626       2158      9784
Preposition          117316      51518    168834
Other                 87217      47739    134956
Total                570643     232851    803494

Table 7.1: Part of speech counts of the corpus after filtering out short contexts and the COS-WSD subcorpus.

To first cluster near-duplicates, we compared each context in the corpus to each other context using resemblance with threshold t = 0.9 and 5-gram sketches of size N = 200. We started

each context in its own singleton cluster, and for any pair of contexts with resemblance meeting the threshold we merged their clusters. This procedure is similar to single-linkage clustering, but instead of selecting links in order of similarity to form a hierarchy we just produce a flat clustering by applying a threshold to the similarity. Once we had clusters of near-duplicate documents, we selected the member of each cluster with the most user annotations to be its representative in the subsequent containment detection. In all, 3074 clusters of nearly equal text were formed from 8891 contexts.
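A sketch of this flat threshold clustering using a simple union-find structure; the resembles callable stands in for the sketch-based resemblance test and is hypothetical.

    # Sketch of the flat threshold clustering: union-find over contexts,
    # merging whenever the pairwise resemblance estimate meets the threshold.
    def flat_clusters(contexts, resembles):
        parent = list(range(len(contexts)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i in range(len(contexts)):
            for j in range(i + 1, len(contexts)):
                if resembles(contexts[i], contexts[j]):  # e.g. resemblance >= 0.9
                    parent[find(i)] = find(j)

        clusters = {}
        for i in range(len(contexts)):
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())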

To perform substring detection we took the representatives of the near-duplicate clusters and again compared documents pairwise, this time for containment. We used 5-gram sketches reduced by a factor of m = 2 and again set the similarity threshold to t = 0.9. The modulus factor is quite low, resulting in only a halving of sketch size, but since our filtering allowed contexts as small as 11 tokens any greater reduction would have been too severe. Once all pairwise containment relationships had been established or rejected, we selected contexts which had no containers as super-contexts to represent containment clusters. To each containment cluster we assigned all contexts contained within the representative super-context of the cluster. This resulted in some smaller contexts being assigned to more than one cluster. In all, 35 contexts were assigned more than once, affecting a total of 55 clusters. An error analysis later in this section will show a number of steps which could be taken to reduce this redundancy further. However, it cannot be eliminated entirely, and we are satisfied that, compared to the total number of clusters (2443 after containment detection), the duplication has been sufficiently reduced for our purposes. Table 7.2 shows the POS counts for the corpus at each stage of the redundancy elimination procedures: before deduplication, after elimination of near-duplicate contexts and after elimination of contained contexts.

Part of Speech    Filtered   Deduplicated   Subdocuments merged
Noun                334241          91254                 73973
Verb                 92495          25631                 20766
Adjective            45753          11805                  9407
Adverb               17431           4479                  3673
Conjunction           9784           2389                  1888
Preposition         168834          48977                 40366
Other               134956          38332                 31903
Total               803494         222867                181976

Table 7.2: Part of speech counts at each stage of the deduplication procedure: before duplicate removal, after duplicate removal and after subdocument elimination.

Having identified contexts belonging to sets of near-duplicates or a containment hierarchy, we merged all user annotations from each cluster of contexts into a single representative context for each cluster. We merge annotations by aligning the terms they were placed on according to the position they are linked to in the original document text. We consider every context in a content duplication cluster to have come from the same source document and merge annotations from terms linked to the same token position in the original document. This presents a challenge, particularly for containment clusters: the starting position of each member of a cluster in the original document may be different. Therefore we first need an alignment for the content of each context in the cluster.

To achieve this, we first ordered the terms linked to each context in a consistent way. Specifically, terms of each context were first ordered by the position of the earliest token they were linked to and then by the number of tokens linked to the term. We then aligned the terms of each member context to the terms of the representative context based on equality of the JMDict entry linked to the term and the surface text of its tokens. The algorithm we used for fuzzy text alignment was a recursive longest-common-substring sequence alignment similar to that of Ratcliff and Metzener (1988), described earlier in Section 7.3.2. For our implementation we used the SequenceMatcher from the difflib module in the Python programming language's standard library. It has the disadvantage of not precisely implementing the published algorithm but the advantage of being part of a common, freely available open source programming language distribution. Any terms from cluster members that did not align to a term from the representative were not merged into the master copy, so as to maintain consistency with the representative copy's text. Figure 7.1 shows a visualisation of the alignment of one of the larger containment clusters. Note that to simplify the visualisation it has been produced based on an alignment of tokens rather than the potentially overlapping MWE terms.

Figure 7.1: Visualisation of one of the larger clusters of contexts in a containment relationship. Token alignment was performed with a recursive longest common substring method.
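For illustration, the matching blocks that SequenceMatcher exposes, and on which such an alignment can be built (the two toy strings are invented):

    # Illustration of the difflib alignment behind the term merging: matching
    # blocks give offset correspondences between two near-duplicate texts.
    from difflib import SequenceMatcher

    representative = "東京は日本の首都です。"
    member = "東京は、日本の首都です"  # stray comma, missing final punctuation

    for block in SequenceMatcher(None, representative, member).get_matching_blocks():
        # block.a and block.b are start offsets; block.size is the matched
        # length (the final block is a size-0 sentinel).
        print(block.a, block.b, block.size)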

Since small differences in text exist between cluster members, we use the text of the representative context selected during clustering as a basis for the master copy. For resemblance clusters the representative was the text with the most user annotations. Terms from non-representative member contexts may fail to align and therefore fail to be included, so this choice provides a foundation of guaranteed inclusions. In the case of containment clustering, the outermost containing submission was chosen for the master text of each cluster. We merged user annotations into the representatives of their resemblance clusters first, and then merged the collected annotations from those representatives into the outermost contexts of their containment clusters.

Once term alignments were complete, we merged user sense annotations into the terms of the representative context. For each sense of each term in the representative context we totalled the number of contexts in which that sense had been promoted by a user and used that total as the final score for the sense. However, for our later WSD evaluation we rescaled the final scores (after containment clustering) of each annotated term so that the maximum sense score was 1.0 and the minimum was 0.0. This was done for consistency with the automatic WSD of Wakaran, which performs the same scaling. Finally, note that we did not use user sense selections (as opposed to sense promotions) in our evaluation data, since users tended not to deselect senses once selected. Overall, the deduplication process did merge some user-annotated terms into each other, but many user annotations were unique to their term. The total number of annotations on terms in the corpus started at 2304 (after filtering out short documents), reduced to 1983 after merging near-duplicates, and only reduced to 1926 after merging short contexts into containing contexts. This reduction is quite small compared to the nearly fourfold decrease in the total number of terms (see Table 7.2). Many of the user annotations were on monosemous terms. Out of the 2304 user annotations on the filtered corpus, 440 are on polysemous words. Once all deduplication is complete, the count of polysemous annotated terms is reduced to 375.
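A sketch of the per-term rescaling, assuming raw promotion counts per sense; the names are illustrative.

    # Sketch of the per-term rescaling of merged promotion counts onto [0, 1],
    # mirroring Wakaran's normalisation of automatic annotator scores.
    def rescale(counts):
        lo, hi = min(counts.values()), max(counts.values())
        if hi == lo:
            return {sense: 1.0 for sense in counts}
        return {sense: (c - lo) / (hi - lo) for sense, c in counts.items()}

    # e.g. rescale({"s1": 3, "s2": 1, "s3": 0}) -> {"s1": 1.0, "s2": 0.33..., "s3": 0.0}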

Analysis of remaining duplication

In the previous section we found that a number of short disambiguation contexts from submissions were contained in more than one supercontext selected by containment clustering. In this section we study the situations in which multiple cluster membership occurred and develop methods that could be used to more effectively identify duplicated content in the corpus.

We started with a manual examination of sets of supercontexts with common contained subcontexts, and identified several common causes of the duplication:

Overlap
In many cases the supercontexts both appear to come from a single larger text, unattested by any single submission in the corpus. Specifically, a substantial prefix of one is identical, or nearly identical, to a suffix of the other, and the smaller text that is contained in both appears in the overlapping region.

Failed similarity identification
In some cases we found that resemblance or containment detection had failed to identify a relationship between the supercontexts due to a significant amount of noise in the text. In particular, parentheticals appearing in one copy but not the other were a major source of lack of resemblance.

Multiple source documents
In one case, shared text appeared to come from multiple distinct source documents. A shared introductory sentence causes there to be some common content between contexts that are not true candidates for merging.

Cases where no copy of the full document was submitted, and only overlapping portions are present, are not accounted for by our resemblance or containment detection. The main reason for this is that we set a high cut-off of 0.9 for both, requiring 90% of the n-grams of (at least one of) the documents to overlap the other. However, we can still use these measures as a starting point for detecting overlap: two contexts that overlap must have n-grams in common, after all. Containment is best suited to this, since the proportion of either document in the overlap must be no smaller than the proportion of their union.

We ran a test to discover the extent of duplication remaining due to overlapping contexts. For any cases where the containment test failed but the estimated containment was greater than or equal to 0.2, we ran the SequenceMatcher on the two texts and recorded the number of matched tokens o. The computational cost of performing this alignment is mitigated by the relatively small number of context pairs (2593 out of 5117876) having containment greater than 0.2 without qualifying for a containment relationship outright. We then took prefixes and suffixes of the two contexts of length o and compared the prefix of each to the suffix of the other using a resemblance test. If the prefix of either resembled the suffix of the other to a sufficient threshold, we designated the pair as overlapping. We used a smaller threshold than the 0.9 used for our other near-duplication tests because errors in the affixes tend to compound: an inserted token inside the overlap also pushes a potentially matching token out of the end of the o-length affix. With overlap defined in this way, 348 deduplicated contexts had a detected overlap with another context, with 10 unique pairwise overlaps detected between different containment clusters.

In our merge procedure for clusters of overlapping contexts, terms that do not exist in the representative (super)context were discarded. For the cases of near-duplication and containment this discarding causes minimal losses. For example, during the merging of small contexts into containing contexts, 859 terms with annotations were merged upwards and only 24 terms were lost due to lack of alignment. However, if overlapping text detection were used to form clusters of contexts then no single context could be used to determine the full set of terms to keep. Our context merging procedure in Section 7.3.2 can be extended using recursion to provide an immediate solution for merging a synthetic master document from overlapping documents. As a property of the SequenceMatcher algorithm, terms from the cluster member contexts which are not aligned with the longest cluster member are instead aligned to intervals between two terms of the master, or to a pre-initial or post-final interval. A complication is that the method merges more than two pieces of text simultaneously, so gaps between supercontext tokens may be filled by content from more than one subcontext. In the main merging algorithm the inter-token content is excluded from the merging process, but for a more general algorithm we can apply the merging procedure recursively to sets of subordinate context terms that are aligned to the same inter-term interval of the longest context. In detail, for each space between terms of the longest context of a cluster:

1. From each other context, collect text aligned to that space

2. If more than one is non-empty, recurse on the non-empty term sequences aligned to the space

Once duplication has been eliminated from the inter-term content, the result can be spliced into the supercontext.

The other source of duplication we found was failed resemblance detection due to high numbers of inserted parenthetical comments. Performing containment detection after first stripping all parentheticals of 10 or fewer tokens did reduce the total number of top-level contexts, from 2443 to 2386.

                                  Automatic sense
User sense     None      1      2      3      4      5    6    7+     Total
None          84382   6057   6939   3951   5789   1918   80  2813    111929
1                30     59    114     25      6      5    1     1       241
2                10     13     11      3      0      0    0     1        38
3                 1      7      4      1      0      0    1     0        14
4                 0      0      1      0      1      0    0     0         2
5                 0      0      0      0      0      0    0     0         0
6                 0      0      0      0      0      0    1     0         1

Table 7.3: Confusion table of senses selected by users while automatic WSD was not operational, and the sense that would have been selected had WSD been enabled at that time.

Beyond removing parentheticals and detecting overlaps, further attempts at removing duplication show diminishing returns. For example, we had a single curious case where one copy of a document was stripped of a particular kind of Japanese consonant-modifying mark, the dakuten and handakuten.² Such an error may be due to a flaw in an automatic process such as optical character recognition. Alternatively, in some character encodings the dakuten and handakuten are represented as a combining character, so an encoding error might also be at fault. Having only one sample of this means we cannot properly assess effective measures for accounting for it, nor did we have a strong reason to do so for the present study.

² These marks are used to modify consonants. For example, the syllabic は (ha) becomes ば (ba) with the dakuten and ぱ (pa) with the handakuten.

7.3.3 Distribution of user selections vs. automatic selections

The most important property of the Wakaran corpus is its annotation with senses selected by users to appear inline in the text. Table 7.3 shows the distribution of senses selected by users on the USER-ONLY subcorpus: that is, selections made while no senses were being highlighted by automatic WSD in the Wakaran interface. It shows that, without any sense hints being displayed, users had a strong bias towards the first sense in the list. In particular, users predominantly selected the first sense even on tokens for which the WSD algorithm would have computed one of the later senses as being most strongly related to context.


                           Automatic sense
User sense     None      1      2      3      4     5+     Total
None          52724   2751   4691   1924   3377   2654     68121
1                 3     21     13      5      3      2        47
2                 2      3     14      1      0      0        20
3                 3      2      0      6      0      0        11
4                 0      0      0      0      1      0         1

Table 7.4: Confusion table of senses selected by users while automatic WSD was operational, and the sense that was automatically selected in the interface by the WSD algorithm.

Table 7.4 shows the distribution of senses on the WITH-WSD subcorpus, for which automatic WSD was highlighting its selections in the Wakaran interface. By comparison to Table 7.3, a great deal of the user preference has shifted towards the sense selected by the automatic WSD algorithm. This is apparent in the larger counts of sense selections appearing on the main diagonal of the table. However, users still exhibit a strong bias towards the first sense, splitting annotations between first and WSD-selected senses rather evenly.

It is noteworthy that on both the USER-ONLY and WITH-WSD subcorpora, users show no inclination to select any sense beyond the first four. This is in part because many words have fewer than five sense entries, but the results in Table 7.3 and Table 7.4 show that there are thousands of tokens for which the automatic WSD would have suggested senses displayed with rank of five or greater for which users made no selection, and many such tokens for which users selected senses one to four. This phenomenon serves to highlight the impact of information presentation in glossing systems: information overload leads to users ignoring all but the most prominent gloss information; and ordering and highlighting of senses has a substantial effect on the information considered by the users.

7.4 Evaluation of word sense disambiguation algorithms

After deduplication, the corpus of Wakaran documents serves as a corpus of JMDict sense annotations. In this section we compare the distribution of those annotations to the output of a variety of JMDict WSD implementations.

Under test are four variations of the JMDict WSD algorithm described in Section 7.2. The key feature of this algorithm was its use of a vector representation for the meaning of each JMDict word, embedded in a word-specific subspace of the results of a Personalised PageRank calculation over the concepts in JWordNet. The details of the subspace selection are given in Section 7.2.1 but the salient feature here is that the subspace for a word comprises concepts that are related to any of its JMDict senses. A Personalised PageRank calculation needs to be seeded with weights given to LKB concepts related to the context. For the Wakaran WSD algorithm we used a separate tokenisation of the source text into JWordNet lemmas and used their senses to seed the calculation.

However, the direct mapping of JMDict headwords to sets of related JWordNet senses used in our JMDict sense embedding provides an alternative method of seeding the Personalised PageRank, and we test that method in this section. In Chapter 6 we saw that the WordNet++ LKB performed much better at Personalised PageRank over JSemcor than the JWordNet LKB, which was used in the deployed Wakaran WSD algorithm. As such, in this section we compare the following four configurations of Personalised PageRank to the Wakaran corpus data:

wnpp-map WordNet++ LKB with Personalised PageRank seeded using JWordNet concepts mapped to JMDict senses.

wnpp-tok WordNet++ LKB with Personalised PageRank seeded using a JWordNet tokenisation of the source text.

wn-map JWordNet LKB with JMDict-mapped senses.

wn-tok JWordNet LKB with JWordNet-tokenised senses.

                       measure
annotator         spr        wp        wr
random         0.000pt    0.424m    0.421s
firstsense     0.678      0.811     0.805
wnpp-map       0.033t     0.410m    0.515s
wnpp-tok       0.197t     0.465m    0.587s
wn-map        -0.244t     0.277m    0.348s
wn-tok        -0.241t     0.278m    0.344s

Table 7.5: Results for evaluation of all WSD configurations on the section of the Wakaran corpus for which automatic WSD was not enabled. All values are different (or not) from best (p < 0.05) by test: "m" = Monte Carlo test, "t" = independent t-test, "pt" = paired t-test or "s" = sign test, as appropriate.

All configurations use cosine similarity to compare JMDict senses to the result of the Personalised PageRank, normalised to assign a numeric weight between 0.0 and 1.0 to each JMDict sense. We also use a random baseline and a first sense baseline for comparison. The first sense baseline must inevitably be highly correlated with user preferences, given the strong bias towards the first sense we saw from users in Section 7.3.3.

Although the users’ selections were single best in nature, the graded output of the WSD might be quite high for a user chosen sense without actually being the single best. Therefore, for comparison of the WSD results to user annotations on the Wakaran corpus we used Spearman rank correlation (spr). In addition to this, we calculated weighted precision and recall scores for the automatic WSD annotations, as though the user selections were a gold standard.

We look first at the results for the USER-ONLY subsection of the corpus. For the documents in this subcorpus, senses were displayed to users with no discriminating display method other than the dictionary ordering of the senses. Table 7.5 gives the results of evaluation of all WSD configurations on the USER-ONLY contexts. The dominance of the first sense baseline in the results shows a clear preference for the first sense over the other senses. However, as we saw in Section 7.3.3, when WSD was turned on user preferences shifted dramatically towards the highlighted senses. Thus, these results do not necessarily show a true preference for the first translation but rather a bias towards it. Removing the first sense baseline from the results, Table 7.6a reveals the differences between the leading Personalised PageRank methods. The best WordNet++ annotator has a better Spearman rank correlation with the user data than any other Personalised PageRank annotator or the random baseline (with statistical significance). Note in particular that using a JWordNet tokeniser on the source text to build the semantic context (wnpp-tok), rather than using the full set of JWordNet senses linked to JMDict headwords (wnpp-map), results in a better score on this measure. WordNet++ with the JWordNet tokenisation annotator also has better weighted recall than the other annotators, though the difference between the two WordNet++ configurations is not statistically significant in this case. Finally, we note that the annotators' weighted precision values are generally lower than their weighted recall scores. This is a natural result of the total weight assigned by the Personalised PageRank annotators being greater than one (since the grades are normalised to span the interval [0, 1]). Since the bronze data primarily comprises single-best sense selections, it assigns a weight of 1.0 per term, which acts as a cap on the denominator in the weighted recall formula but does not affect weighted precision. The random baseline also assigns grades with total weight 1.0 and has a precision not statistically worse than the best WordNet++ annotator.

(a) Scores aggregated across all terms:

annotator         spr        wp        wr
random         0.000t     0.424     0.421s
wnpp-map       0.033pt    0.410     0.515
wnpp-tok       0.197      0.465     0.587
wn-map        -0.244t     0.277m    0.348s
wn-tok        -0.241t     0.278m    0.344s

(b) Scores aggregated across only terms for which a user selected a non-first sense:

annotator         spr        wp        wr
random         0.000t     0.386     0.386
wnpp-map       0.076      0.383     0.522
wnpp-tok       0.211      0.416     0.570
wn-map        -0.155t     0.276     0.388
wn-tok        -0.111t     0.291     0.399

Table 7.6: Results for evaluation of all WSD configurations on the section of the Wakaran corpus for which automatic WSD was not enabled. All values are different (or not) from best (p < 0.05) by test: "m" = Monte Carlo test, "t" = independent t-test, "pt" = paired t-test or "s" = sign test, as appropriate.

(a) Scores aggregated across all terms:

annotator         spr       wp       wr
random         0.000     0.391    0.391
firstsense     0.318     0.570    0.570
wnpp-map       0.282     0.498    0.640
wnpp-tok       0.176     0.438    0.589
wn-map         0.281     0.451    0.615
wn-tok         0.330     0.470    0.633

(b) Scores aggregated across only terms for which the user promoted a non-first sense that was not automatically preselected for them by the integrated WSD algorithm:

annotator         spr       wp       wr
random         0.000t    0.400m   0.364s
wnpp-map      -0.070     0.359    0.377
wnpp-tok       0.289     0.569    0.616
wn-map        -0.473t    0.247    0.149s
wn-tok        -0.473t    0.233    0.138s

Table 7.7: Results for evaluation of all WSD configurations on the section of the Wakaran corpus for which automatic WSD was enabled. All values are different (or not) from best (p < 0.05) by test: "m" = Monte Carlo test, "t" = independent t-test, "pt" = paired t-test or "s" = sign test, as appropriate.

Given the suspected bias in user annotations towards the first presented sense, good results may simply indicate the inclination of a WSD algorithm to choose the first sense. However, if we restrict the analysis to cases where users chose a non-first sense we achieve two things:

1. more confidence that the user’s choice was made based on gloss content; and

2. a comparison between automatic WSD algorithms that does not benefit a first sense bias.

Table 7.6b shows results for all classifiers on terms in USER-ONLY where users selected a non-first sense. Numerically the results turn out to be very similar to the results on the corpus as a whole, though the dataset is too small for many of the differences to have statistical significance. However, the WordNet++ based Personalised PageRank annotators still exceed the other annotators on Spearman rank correlation with statistical significance. As such, we conclude that these results show WordNet++ improves over JWordNet for Personalised PageRank based WSD of the JMDict sense inventory.

Table 7.7 shows results of evaluating our selection of JMDict WSD algorithms on the WITH-WSD submissions, for which automatic WSD was actively preselecting senses for users. As with Table 7.6, two tabulations are given: Table 7.7a shows evaluation over the whole subcorpus, while Table 7.7b shows evaluation on only the tokens that exclude sources of bias. The results on the whole corpus are very similar for all annotators. Understandably, the configuration that was used in Wakaran for sense highlighting, wn-tok, has the highest correlation with the user selections. However, it is interesting to note that all of the other annotators are competitive, even exceeding wn-tok on some aspects of the evaluation. Table 7.7b shows the results on tokens where the users have chosen a sense that is neither the first sense, nor a sense that was highlighted by the automatic WSD algorithm. In such cases the user has avoided both of the most obvious choices to select their own preference. Although the number of tokens is too few to get statistically significant differences out of the subtly different configurations wnpp-map and wnpp-tok, we do note that there is a large numerical difference between them. Specifically, the configuration that seeds the context with senses of JWordNet words identified by a separate tokenisation pass appears to perform better than the configuration that seeds the Personalised PageRank with senses of English words in the glosses. This inequality holds for all metrics in all of the corpus subsets presented in this section, with the exception of the full WITH-WSD evaluation. It seems likely, though not conclusive, that the JWordNet tokenisation is the more effective context model. Finally, we note that wn-tok is severely disadvantaged on this subset of tokens because we have specifically eliminated tokens for which it graded the user-selected sense with the top score of 1.0. However, it still manages to achieve a non-zero score because it is a graded annotator: the presentation bias filter allows tokens on which wn-tok assigned the bronze sense any score strictly less than 1.0.

7.5 Discussion

The results of deploying automatic WSD in Wakaran show that it had a strong influence on user attention, which suggests that the performance of the WSD algorithm is very important to user experience. Since users otherwise tend to gravitate strongly to the first sense, the deployment of automatic WSD has the potential to encourage users to look at a more appropriate translation further down the listing. Additionally, the split in user selections between WSD-selected senses and first senses in the WITH-WSD dataset could suggest two things: that users are taking more care choosing between the two, or that users' attention bias is split between the two. These hypotheses should be tested in future research. If users can be encouraged to more often consider the appropriateness of at least two of the glosses, they may gain some of the benefit to vocabulary acquisition of practising meaning inference as described by Fraser (1999). This too is a question for future research.

One purpose of testing WordNet++ and JWordNet LKBs for Personalised PageRank as we have done in this chapter is to see whether the crowdsourced data draws us to the same conclusions as the gold data used in Chapter 6. The strength of the influence automatic sense highlighting had on user sense selections means that JWordNet can only be fairly compared to WordNet++ on the USER-ONLY subcorpus. On this subcorpus the crowdsourced data clearly reproduces the result of Chapter 6 that WordNet++ works better as a resource for Personalised PageRank. This is true on the USER-ONLY corpus as a whole, but in particular the WordNet++ annotators have a significantly better Spearman rank correlation score over tokens where the user annotations select non-first senses.
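For reference, the correlation measure used here is standard; a minimal sketch of computing it for one annotator is given below, assuming paired lists of annotator scores and user-derived grades per candidate sense. The use of scipy is an assumption of the sketch, not a statement about the tooling used in this thesis:

```python
from scipy.stats import spearmanr

def annotator_user_correlation(annotator_scores, user_grades):
    """Spearman rank correlation between an annotator's graded sense scores
    and the grades derived from user selections, over token-sense pairs."""
    rho, p_value = spearmanr(annotator_scores, user_grades)
    return rho, p_value

# e.g. the annotator's scores for three candidate senses of a token, against
# 1.0 for the sense the user selected and 0.0 otherwise:
rho, p = annotator_user_correlation([0.9, 0.4, 0.1], [1.0, 0.0, 0.0])
```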

In these experiments we lacked a good measurement of the extent of the user bias due to sense ordering. This bias could be measured by randomising the order of gloss display. Caution surrounding certain elements of gloss content that carry forward from sense to sense prevented us from applying this technique in this round of data collection. However, a manual vetting of a subset of the dictionary could be used to establish the impact that a random re-ordering would have on the coherence of gloss text. Thus, future work may be able to safely deploy a randomisation technique and other algorithmic sense re-ordering techniques.

7.5.1 Quality control implications

So far in this chapter we have looked at the degree to which the senses users choose to consult while reading Japanese as a second language agree with WSD algorithms, and at sources of bias such as sense order and highlighting. In this section we take what we have learned and make an assessment of what quality control measures might be used to derive gold standard sense labels from future work using the Wakaran system, in combination with providing more helpful automatic disambiguation within the system. We discuss the implications of the output-based quality control decision matrix of Kern (2014) for Wakaran, and make additional comments on other families of quality control methods discussed in Section 2.4.2.

Kern (2014) lays out a decision procedure for worker output validation in crowdsourcing based on four properties of the crowdsourced task: determinacy, validation effort, granularity and difficulty. Determinacy, referring to whether the task is deterministic or non-deterministic, is decided by asking whether repeated valid performances of a task will always yield the same result. For example, asking a worker to paraphrase a paragraph of text would be a non-deterministic task with many valid responses. For the purpose of gold standard label selection, if we define a correct result as the sense agreed upon by a committee of professional annotators, there will usually be a single deterministic best sense. However, it has been well noted that multiple senses are sometimes applicable for a variety of reasons (Jurgens et al. 2014), so sense label selection does not sit perfectly at one end of the determinacy scale. Validation effort refers to the difficulty of checking worker output compared to performing the original task. In the case of sense labels this effort is moderate to high: it is moderately easy to judge the relevance of a single selected sense to context, but to do so may ignore the presence of another more relevant sense. Fully accounting for all sense possibilities when evaluating a selection is very close to performing a sense selection task anew.

Granularity refers to whether a task is decomposable into subtasks. An all-words annotation task for a piece of text is a granular composition of token annotation tasks, whereas a lexical selection task is more atomic, especially when labelling a single token or lexeme in a single text. In the case of Wakaran the task is more loosely defined: users will not, on the whole, label all words. However, the default mode of operation does not highlight a specific lexical selection for the users to annotate, so it is best thought of as a granular task that no individual user will fully complete. Finally, there is the difficulty of the task, which comes down to whether a worker who has passed any selection procedures can be reasonably expected to produce correct output in most cases. Based on the biases in sense labelling identified in this chapter, the sense labelling task should be treated as difficult for the workers involved.

One method we can rule out right away is the control group or validation pattern. In this pattern one worker performs a task and a committee of workers validates their output. It is good for non-deterministic tasks with a low validation effort. Since sense labelling is effectively deterministic and has a relatively high validation effort, this quality control method distributes the costs of quality control in all the wrong places.

Two quality control methods which are suitable for deterministic tasks and are applicable to word sense labelling are the gold pattern and the voting pattern. In the ground truth or gold pattern, a subset of the tasks that the user performs have a known-correct answer. In the case of sense labelling this may come by professionally annotating a subset of the documents that are to receive crowdsourced labels. The performance of workers on the gold subset of their tasks then serves as a measure of the quality of their novel labels. Gold standard tasks can also be used as feedback to train users in the expectations of the work (Kern 2014). In Wakaran, user submitted documents present something of a challenge to the gold pattern. One way it could be managed is with a post-hoc professional annotation of selected documents submitted to the system. Additionally, special modules may be introduced for which gold standard data could be prepared. For instance, offering practice exercises for self-test or targeted at popular standardised examinations such as the Japanese Language Proficiency Test (JLPT)3 would involve supplying texts to the users which could have some senses pre-annotated by professionals. In contrast to the gold pattern, the voting pattern uses crowdsourced workers alone to validate the output produced. A number of workers are given the same task and expected to produce the same output. Typically majority voting is used to decide the correct output, where the most popular answer is considered correct. This method validates task output, but like the gold pattern it can also be used to evaluate the quality of workers. In fact, Passonneau and Carpenter (2014) propose a redundancy method more robust than the majority vote, building a probability model of the annotations produced by each worker that accounts for reliability and bias. It still relies on having each token annotated by a number of users, which does not occur very frequently with ad-hoc usage of Wakaran. Multiple annotation could potentially be achieved by setting special tasks as discussed above for the gold pattern. Another interesting avenue for investigation would be to use knowledge-based disambiguation algorithms as pseudo-workers to increase the number of sense selections on each token. Additionally, annotations produced by professionals contributing to any gold annotated subsets would be easy to include in such a model.

3 http://www.jlpt.jp/e/
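Both patterns reduce to simple aggregation over per-token sense labels. A minimal sketch follows, under the assumption that labels are stored per worker in plain dictionaries; the data structures are illustrative:

```python
from collections import Counter

def worker_accuracy(worker_labels, gold_labels):
    """Gold pattern: score a worker by agreement on the gold-annotated subset.

    Both arguments map token ids to sense ids; only tokens present in both
    are scored. Returns None if the worker saw no gold tokens.
    """
    shared = set(worker_labels) & set(gold_labels)
    if not shared:
        return None
    hits = sum(worker_labels[t] == gold_labels[t] for t in shared)
    return hits / len(shared)

def majority_vote(labels_for_token):
    """Voting pattern: the most popular sense among workers wins."""
    counts = Counter(labels_for_token)
    sense, _ = counts.most_common(1)[0]
    return sense
```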

The voting pattern incurs a high cost in the sense that it requires more workers and work per task completed. However, Wakaran has been designed to be open to the widest possible audience of language learners, and Chapter 4 showed that it attracts dedicated long term users, so workforce size is one of our method's strengths. It is also the case that the probability modelling method of Passonneau and Carpenter (2014) can achieve more reliable results with lower redundancy as long as it has a large number of tokens labelled by each user. If their method is to be applied to Wakaran it will help to have an estimation of selection bias towards the first sense. This could be accomplished by displaying glosses in a randomised or semi-randomised order.

The gold pattern and voting pattern are well attested in the crowdsourced sense disambiguation and similarity literature. Jurgens et al. (2014) use the gold pattern to assess the quality of workers in their gamified sense labelling projects Puzzle Racer and Ka-boom!. In Puzzle Racer, users who select the correct label for the gold instances are shown proportionally more genuine task data as they play. In Ka-boom!, users lose the game if they fail to destroy too many of the known-incorrect instances, which is additionally an example of how the gold pattern can be used for feedback and training. Venhuizen et al. (2013) use the voting pattern in a word sense labelling game called Wordrobe. They use gold data, but rather than use it to assess individual workers they use it to measure the reliability of the voting scheme itself under different agreement cutoff parameters.

In the decision matrix put forward by Kern (2014), difficult tasks have a uniform preference for two methods above all others: the comparison pattern and the iteration pattern. For coarse grained (or non-decomposable) tasks, such as sense labelling for a selected token, the comparison pattern is recommended. It works by having a number of workers produce independent answers for the same task, and then having a committee of different workers vote on which is the best. For sense labelling this could mean taking senses selected by a number of Wakaran users for a token and displaying a reduced selection to other users based on their output. This method works on non-deterministic difficult tasks, so it should be robust to the stronger cases of sense ambiguity. However, the engineering cost of sharing information between user sessions in real time would be much higher than for many of the methods recommended so far. The other pattern recommended by Kern (2014) that fits the use case of Wakaran is the iteration pattern, which would also necessitate sharing of labels between user sessions. The iteration pattern is for fine-grained decomposable tasks and involves passing a single task to multiple workers in sequence, giving each worker the output of the previous. This allows quality to be built into the output gradually. It is good for fine grained tasks in particular because subsequent workers can choose a small part of the task on which to refine the answers of other workers. In the case of Wakaran we saw in Section 7.3.2 that users largely selected glosses for non-overlapping sets of tokens. Thus the iteration pattern would mean passing on a partially annotated document from each user to the next to get a more complete annotation, but potentially gaining refinement of the previous users' annotations as well. This method may be problematic for Wakaran's natural classroom use case because multiple user submission of the same text is unlikely to extend over a long enough timespan to build a long chain of workers on the one task. It is less of a problem for the comparison pattern because that only requires two steps and so would be feasible within the timeframe of a single class of students reading the same document. However, as with many of the methods discussed here, special recruitment of users to perform practice tests on common documents would enable the iteration pattern to be implemented. Kittur (2010) found that seeding an iterative crowdsourcing task with a small amount of initial output greatly increased the engagement of their workers. This is achieved in Wakaran with the use of automatic WSD based sense selection, even in cases where a long chain of iteration is not performed.

Overall, the best output based quality management methods for Wakaran appear to be the gold standard, redundancy with a probability model, and the comparison method. The gold standard method in particular would require converging students onto pre-prepared texts, but practice tasks for popular examinations like the JLPT would be a strong drawcard for students. Annotations supplied by professional annotators for the gold method could also be folded into the estimation process for the annotator model of Passonneau and Carpenter (2014) to improve the overall quality of the model. Although the comparison method would require more engineering up front, it has the advantage of being applicable to the general use case of Wakaran for ad-hoc classroom work. Annotations selected by students could be shared amongst the class, particularly where different students made different selections, and these disagreements could be raised as points of discussion.

So far we have covered output-based methods for quality control, but other methods exist as well that are applicable to getting high standard sense labels from Wakaran usage. One major aspect of crowdsourcing quality is the expertise of the workers themselves, and qualification tests are one way to control for it. In the context of output based quality control we discussed reasons for administering practice tests for the purposes of converging student activity on common documents, but another use for them can be assessing the learning level of the students themselves. An additional way the qualification level of users could be tested is voluntary surveys on Japanese learning and usage experience. A sense of contribution to research may be sufficient to motivate a moderate degree of participation in such surveys. Additionally, if users could be prompted to create an account with Wakaran to gain access to the practice tests, then their proficiency with certain words and overall proficiency with Japanese could be tracked over time. The administration of practice tests may also increase the motivation of students to take care in selecting their interpretations of words.

The other main area of quality control with applicability to Wakaran is execution process monitoring. In Chapter 4 we described collecting detailed interaction timeseries data from usage sessions of Wakaran and performing unsupervised clustering on a variety of fingerprint models to identify major usage patterns. This data could be used to perform an analysis like that of Rzeszotarski and Kittur (2011), where supervised machine learning is used to predict worker output quality on the basis of their activity fingerprints.
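A minimal sketch of that style of analysis, assuming per-session behavioural feature vectors and quality labels derived from gold tasks; the chosen features and the use of scikit-learn are assumptions for illustration, not the pipeline used in this thesis:

```python
from sklearn.ensemble import RandomForestClassifier

# One row per usage session: illustrative behavioural features.
# Columns: session_seconds, glosses_opened, senses_selected, backtracks
X = [
    [1200, 45, 12, 3],
    [90, 4, 1, 0],
    [2400, 80, 30, 9],
]
# Quality label per session, e.g. agreement with gold labels above a cutoff.
y = [1, 0, 1]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Predict the likely output quality of a new session from its fingerprint.
print(clf.predict([[600, 20, 5, 1]]))
```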

Finally, there are a number of ways in which we can leverage the open-to-all crowd selection strategy (Allahbakhsh et al. 2013). One finding of Chapter 4 was that users stated a preference for browser based extensions over Wakaran for general web browsing. Implementation of such an extension could greatly increase access to Wakaran's userbase. We also found during our experiments that teaching staff can be quite interested in trialling new technologies in the classroom. Making contacts with Japanese teaching staff at universities around the world could be an effective way of targeting the particular classroom environment that we are interested in. Finally, the implementation of longitudinal tracking of users via user accounts would enable smaller scale reputation or credential based crowdsourcing via the Wakaran environment.

7.6 Conclusion

The big question we sought to answer in this chapter was whether the sense preferences of Wakaran users can be used to inform the development of better models of word senses. The answer comes in two parts: firstly, evaluation using the user preferences reproduced the same conclusions about resources for Personalised PageRank as evaluation on the gold standard data used in Chapter 6; secondly, we identified the main sources of bias in user selections and proposed methods for quantifying and dealing with that bias.

Whether we choose to automatically highlight a word sense or not, the presentation of senses in a particular order is by itself a strong influence on the focus of users' attention. It is likely that sense entries further down the sense ranking are rarely being consulted, in spite of their relevance to the text and the lexicographer's intentions in including them in the gloss. If we can find ways to increase engagement with more salient senses regardless of their positioning in the dictionary sense ordering, we can get better help about word meanings into the hands of students at the times that they have requested it. More importantly, if we can encourage students to make an informed choice between two or more distinct options, this may have some of the vocabulary acquisition benefits of a meaning-inference strategy for unknown words. This goal is well aligned with the crowdsourcing goals of Wakaran, since a careful consideration of two possibilities would give a high quality data point: in the data collected so far it was only possible to rely on aggregate statistics to correlate automatic WSD with user preferences, but with careful interface design a silver standard could be built at the same time as improving learning outcomes for users. We now know that sense highlighting and sense ordering both have a strong effect on user attention. Sense ordering in particular draws focus to the top three to four senses; presenting a semantically diverse subset of senses in those positions may maximise the incentive to choose whilst also providing good coarse grained WSD judgements. In Chapter 8 we outline a research plan for encouraging students to study and choose senses more closely. In this way we have shown that we can advance computer assisted language learning and computational semantics research goals in tandem, using Wakaran as a crowdsourcing and second language learning tool.

Chapter 8

Conclusion

In this final chapter, we return to the original research questions and state our findings. We then give a summary of the contributions of each chapter to the thesis and to the field. Finally, building off the findings of the thesis, we develop a plan for several directions of future work.

In this thesis we sought to discover whether human judgements of word meanings can be collected from naturally occurring interactions of students with dictionaries in a second language (L2) learning context. In particular, we were interested in finding out:

• whether students would use word sense visibility features in an interactive Japanese-English glosser to satisfy their information needs; and

• whether it is possible to draw sensible conclusions about lexical semantic models based on a record of sense choices made by students in that setting.

To that end we developed a web-enabled glossing application, the Wakaran Glosser, and advertised it to several intermediate Japanese L2 classes. It was introduced and its design was discussed in detail in Chapter 4. The experimental work of the thesis comprises two main parts:

• Evaluation of NLP models supporting the design of Wakaran and integration of WSD features.

• Analysis of records of Wakaran usage to determine patterns of interaction confirming feature usage and patterns of sense selection for comparison to models.

In Chapter 4 we demonstrated that students of Japanese as a second language will use interactive features of the Wakaran Glosser for paraphrase visibility in class reading tasks (and for class preparation). Offline evaluation of NLP models was performed in Chapter 5 and Chapter 6. Finally, in Chapter 7, we showed that relative WSD evaluation on the JSemcor corpus does transfer to relative correlation with user word sense preferences. However, we also showed that user sense selections are highly influenced by the order in which senses are presented and any prominence given to senses by automatic WSD. We conclude that interactive glosser features can be a rich source of cheap information about user sense judgements, but that care must be taken to investigate the language learning outcomes for users.

8.1 Contributions

In this section, we go into more detail about the contribution of each chapter to the thesis. In the next section we use the overall findings discussed here to lay out a plan for future work using Wakaran for crowdsourcing lexical semantic judgements.

In Chapter 4 we outlined how the Wakaran Glosser was designed to offer interactive word sense visibility features for second language students which have, as a side effect of their use, the result of collecting human judgements of word senses. A major design goal was to make the software an attractive choice for assistance in classroom reading tasks for students of Japanese as a second language. We established that the interactive feature design in Wakaran is compatible with the most effective vocabulary acquisition strategies for second language students encountering unknown words, but did not verify learning outcomes for students using Wakaran itself. Using a variety of aggregation and visualisation methods on the application usage logs, we showed that around half of all usage sessions of Wakaran were reasonably intensive reading sessions. Around 20% of sessions lasted for longer than 20 minutes, and examining those revealed a variety of interesting reading strategies, including many users who interacted with glosses in a single monotonic pass, some who read in two passes, some who read in a haphazard order and a handful who seemed to read their document forwards once and then backwards once. Based on these logs, and the bulk of usage occurring during university semester weeks at specific times during the week, we concluded that students were in fact using Wakaran for its intended purpose of in-class reading tasks and class preparation.

In Chapter 5 we investigated information sources for classifying candidate MWE tokens as idiomatic or literal text. This was motivated by the utility a computationally cheap feature base would have for the construction of glossing software. The focus of the investigation was on whether the morpho-syntactic fixedness properties of idioms that are useful in identifying their types are a viable means of identifying specific usages as well. We found that SVM models built on features identifying variations of idioms from their canonical forms were barely more effective than a supervised first sense baseline, whereas models of the context semantics were substantially more effective, especially at classifying the literal instances correctly. Indeed, even on MWE types that did not appear in the training set, models of the context semantics had similar performance to the type-specialised first sense baseline. We examined candidate MWE tokens in the corpus for conformance with syntactic constraints on idiomatic usages as encoded by a lexicographer in the JDMWE, finding that true idiomatic usages violate their constraints about half as often as literal usages make use of their implicit syntactic freedom to do so. We conclude that a model of an idiom's lexico-syntactic flexibility is unlikely to ever provide a high level of accuracy for distinguishing between literal and idiomatic instances of MWE candidate tokens. The results of evaluation on unseen MWE types suggest that identification of textual contexts that prefer idiomatic (or literal) language may be a novel alternative line of approach. However, overall it seems that MWE candidate disambiguation will not be solved to a level of accuracy that engenders trust from users of a glossing application without a solution that accounts for semantics, either via distributional models of the constituent words or through use of information from a lexical knowledge base. Either way, a great challenge will lie in simultaneously satisfying the accuracy and latency requirements of an interactive glossing application.

In Chapter 6 we investigated resources for developing and evaluating knowledge-based models of word sense disambiguation in Japanese. These are of interest in two respects:

1. As a way of implementing high coverage automatic WSD in a glossing application, knowledge-based methods are not blocked by the same knowledge acquisition bottleneck as supervised methods.

2. Traditional evaluation methods for automatic WSD provide a reference point for interpretation of users' sense judgements crowdsourced from the glossing application.

We transferred senses from the English SemCor onto an existing Japanese translation using logs of lemma to lemma translations selected by the human translators as part of the translation process. We then used the resulting corpus, JSemcor, as a gold standard for evaluation of different LKBs in the task of Japanese WSD. We found that WordNet++, which is an extension of WordNet with additional relations automatically extracted from Wikipedia, performed better than JWordNet as a lexical knowledge base for Personalised PageRank on JSemcor. However, the further extension to BabelNet containing concepts and relations from multiple languages of Wikipedia did not result in a further improvement. In fact, it resulted in a slight decrease in performance, though the specific tokens, lemmas and documents on which performance decreased were found to be highly sensitive to other configuration parameters. We conclude that the expansion of WordNet++ to BabelNet introduced a saturation of the network that Personalised PageRank is unable to account for, and hypothesise that an investigation of assigning weights to relations may be required to leverage the information of multilingual Wikipedia in this task.

Finally, in Chapter 7 we investigated interactions between Wakaran users and word senses, including some interaction with automatic WSD. We outlined a method for adapting the knowledge-based WSD models used in Chapter 6 to the glossing dictionary, JMDict, by embedding each sense in a word specific subspace of the Personalised PageRank output. The simplest version, using JWordNet as the lexical knowledge base, was running in Wakaran for a few weeks, highlighting word senses for users where possible. We found that user sense selections were highly biased by presentation:

1. to the first sense when WSD was turned off; and

2. split between the first sense and the highlighted sense while it was turned on.

However, when the automatic WSD was not highlighting senses, user sense preferences did correlate better with selections made by WSD algorithms that performed better in the offline evaluation against JSemcor in Chapter 6. We conclude that statistical trends in user sense preferences can be captured using Wakaran and that they can reproduce the results of evaluations using professionally produced corpora. Additionally, it is clear that the use of automatic WSD sense highlighting has a substantial impact on which senses users focus on, and therefore quality WSD may have a substantial impact on the outcomes for a user of Wakaran. Properly interpreted, it should be possible to use records of user interactions with automatic WSD output to evaluate the quality of WSD models and, in turn, to improve them in ways that increase their utility in the glossing application.

8.2 Future work

Our agenda for future work spans four major axes:

1. Establishing the effect of word sense presentation in Wakaran on students' learning outcomes.

2. Identifying improvements to the current data collection mechanisms in Wakaran and new ways to analyse the data being collected.

3. Reconciling MWE identification and entity linking with WSD.

4. Investigating latency of WSD as an evaluation metric orthogonal to label accuracy metrics.

Each of these directions for future work builds on the possibilities opened up by Wakaran, and is reliant on its ongoing usage as a glossing application.

Given the clear impact sense order and automatic sense highlighting had on user selections, studying the effect that changing sense presentation has on the learning experience will be a critical component of future work involving Wakaran. There are two main directions in which this can proceed: direct observation of users in a controlled environment, and longitudinal tracking of remote users. In both cases, assessment performance in reading comprehension and vocabulary acquisition will be the primary effect to be measured. Direct observation of think-aloud exercises followed by exit interviews additionally provides a rich source of information for the interpretation of user interface events, allowing us, for instance, to interrogate the reasons for observed behaviours such as choices regarding use — or disuse — of interactive options including gloss showing, sense selection and deselection, and inline insertion of paraphrases. Longitudinal study of learning outcomes is possible both in a direct observation setting and in a setting where users are registered with Wakaran but are otherwise monitored using only the glosser's standard logging apparatus. In the case of a remote study it will be feasible to use a much larger pool of participants. Volunteers drawn from either a pool of established users or recruited from Japanese language courses with the assistance of teaching staff could be given a small participation incentive in exchange for a greater commitment than ad-hoc users. For instance, this might include:

• provision of simple demographic information,

• information on their extent of previous and ongoing exposure to Japanese,

• participation in a fixed number of language tests at regular time intervals.

The demographic and language exposure data provides a degree of control in the interpretation of test results under different sense presentation schemes and changes in results over time. Finally, longitudinal data on learning outcomes could be collected in a less controlled manner if Wakaran were to provide its own assessment material, such as practice exams for standardised tests like the Japanese Language Proficiency Test (JLPT). In such cases, participation timelines would not be controlled, but a small amount of demographic information may still be collectable and participants would be self-motivated to return for repeat participation. At a minimum, any sense preferences expressed by a user while taking a self-test of learning progress are likely to be more carefully made than those of users in an arbitrary usage session under less intense performance pressure.

With monitoring in place to ensure learning outcomes are tracked, experiments with sense presentation in controlled and ad-hoc usage settings can be implemented, starting with refinements to the existing data collection and analysis. Further changes to sense presentation should be investigated with a split testing method — assigning different users concurrently to different presentation schemes — stratifying the split where possible to distribute experimental variables evenly across different users submitting the same document and across users from different demographics. Some adjustments may be made to the way automatic WSD is presented interactively. Specifically, users may be given an option to provide feedback on how well the automatic WSD performed in a way that does not produce a potentially unwanted change in the document such as an inline insertion.

Additionally, satisfaction with automatic WSD could be measured implicitly by making it an optional feature with gloss level opt-in; however, the effects of this change on participation would need to be measured and considered. The results in Chapter 7 showed a strong preference for the first sense, so the effect of sense ordering is a high priority for further investigation. Randomisation of sense order would be a good way to ascertain whether the first sense preference is semantic, a result of the presentation, or a mixture of both. Additionally, the effect of re-ordering senses using automatic WSD could be explored, which would extend research into annotation schemes other than single best sense. The introduction of learning assessment tasks to Wakaran, as discussed in the previous paragraph, may result in students applying themselves more to the task of sense selection and improve the quality of sense annotations produced. Direct observation studies will give a detailed insight into student choices in a controlled environment, but the Wakaran usage logs may also provide a window into how much consideration students put into sense selections when they are not subject to the scrutiny of a laboratory environment. A simple measure of consideration which could be applied to the data already collected is the time spent viewing each gloss pop-up before making a selection (see the sketch after this paragraph). This method could be elaborated to study reading patterns on shorter timescales, analysing behaviours such as time per paragraph, backtracking and forward tracking, which can be used to measure engagement with text (Vo et al. 2010). Even with motivation structures to encourage thoughtful sense selections and metrics to detect deliberation, the quality of sense judgements provided by a single Wakaran user could not be considered the same as that of a professional annotator.
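A minimal sketch of the dwell-time measure, assuming a time-ordered log of interface events in which a gloss pop-up open event precedes the corresponding sense selection; the event schema is illustrative, not Wakaran's actual log format:

```python
def selection_dwell_times(events):
    """Time from opening a gloss pop-up to selecting a sense within it.

    `events` is a time-ordered list of dicts such as
    {"t": 12.4, "kind": "gloss_open", "token": 7} and
    {"t": 15.9, "kind": "sense_select", "token": 7, "sense": 2}.
    """
    opened_at = {}
    dwell = []
    for ev in events:
        if ev["kind"] == "gloss_open":
            opened_at[ev["token"]] = ev["t"]
        elif ev["kind"] == "sense_select" and ev["token"] in opened_at:
            dwell.append(ev["t"] - opened_at.pop(ev["token"]))
    return dwell
```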

This being the case, we would like to explore the relationship between professional annotators and users of Wakaran with a more advanced analysis than standard measures of agreement. In particular, we would like to extend the work of Passonneau and Carpenter (2014), who model different human word sense annotators as random variables with individualised distribution parameters. This method could be applied to comparing Wakaran users, professional annotators, and automatic WSD output. It even has the potential to provide confident labels for single tokens if sufficiently many annotations are collected.

A major contribution that Wakaran can make to computational semantics is to provide a setting for studying the unification of WSD, entity linking and MWE token identification. A notable development in this direction is the Babelfy algorithm, which uses Personalised PageRank sourced at single concepts to build an unweighted graph of closely related concepts and then disambiguates words, multiwords and named entities in context using a heuristic densest subgraph algorithm to find cohesive interpretations (Moro et al. 2014). This method is a big success in unified modelling of lexical semantics, but it does regress to the paradigm of treating word senses as a categorical variable rather than concepts with a non-exclusive relationship with word usages. We would like to use Wakaran as an environment and data source for evaluation of graded models of lexical semantics that are inspired by the Babelfy algorithm in building cohesive subgraphs of the Personalised PageRank output. For instance, we would like to investigate the use of single source Personalised PageRank vectors to solve for a weighted Personalised PageRank where the contribution of each sense to the context most closely matches its activation in the context. To do this we would constrain the linear sum of each token's sense contributions to one. This constraint dovetails nicely with the convex addition properties of Personalised PageRank vectors. Specifically, the normalised linear sum of Personalised PageRank vectors for single concepts is the same as the Personalised PageRank vector for the union of the concepts. Setting an objective function to minimise the difference between the contribution of each sense to the context and the activation of the sense in context will then ensure cohesiveness by giving weight to the senses that most support each other. Thus, we can solve a convex constrained optimisation problem to get a graded solution to WSD, entity linking and MWE token identification. The results in Chapter 6 suggested another avenue for developing knowledge-based WSD: exploration of the weights assigned to lexical knowledge base relations to account for differences in strength or quality of the relationship. Weighting schemes could be evaluated using Personalised PageRank and compared to the results of Babelfy, particularly any graded adaptation thereof. However, the convergence properties of Personalised PageRank under a weighting scheme would require investigation, since the damping parameters of PageRank were originally tuned by Brin and Page (1998) for a uniform exit distribution at each node. Finally, the method we used in Chapter 7 for embedding JMDict senses in the Personalised PageRank output space requires more investigation and evaluation. Specifically, we would like to investigate different choices for the embedding space, including a universal deep learning based word and sense embedding (Johansson and Pina 2015).
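To make the proposed optimisation concrete, one way it might be formalised is sketched below; the notation is ours and the exact objective would need to be validated in future work. Let $p_s$ denote the single source Personalised PageRank vector for sense $s$, let $S(t)$ be the candidate senses of token $t$, and let $a_s(\cdot)$ read off the (suitably normalised) activation of sense $s$ from a context vector.

```latex
% Context vector as a convex combination of single-sense PPR vectors:
%   c(w) = \sum_t \sum_{s \in S(t)} w_{t,s} \, p_s
% Choose the weights by solving:
\min_{w} \; \sum_{t} \sum_{s \in S(t)} \bigl( w_{t,s} - a_s(c(w)) \bigr)^2
\qquad \text{s.t.} \quad \sum_{s \in S(t)} w_{t,s} = 1, \;\; w_{t,s} \ge 0 .
```

Since $c(w)$ is linear in $w$, the objective is a convex quadratic, and the per-token simplex constraints keep the problem a convex program, consistent with the convex addition property of Personalised PageRank vectors noted above.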

With the elaboration of WSD schemes implemented within the context of Wakaran, one factor will increasingly affect the user experience that has not received extensive study: latency of the WSD algorithm. It has long been known that page load times are a significant factor in web page usability, with significant drop-offs in user attention around the one second and ten second marks (Nah 2004). As such, long running WSD algorithms may be a liability to usability in spite of high accuracy. Although there are absolute time thresholds for usability, with computation power continuously increasing they are a moving target, so our interest is in the progressive trade-off between WSD accuracy and computation time. In Chapter 4 we outlined a computation time optimisation for identification of JMDict entries in text by collapsing our POS tagging and MWE compounding preprocessors into a single conditional random field model, but we did not investigate any time-accuracy trade-offs. Similarly, Moro et al. (2014) introduce a novel heuristic for the NP-hard densest subgraph problem in Babelfy but do not give an analysis of the time-accuracy trade-off of the heuristic, touching only briefly on the accuracy benefit of including the densest subgraph algorithm in the first place. A number of opportunities exist to move the position of existing WSD algorithms on the time-accuracy graph, and perhaps even move it closer to the ideal, which could be evaluated offline using JSemcor and Wakaran evaluation data, or online by measuring return visitor rates for ad-hoc visitors. Firstly, due to the linear composability property of Personalised PageRank it is possible to precompile single source results and construct contextual Personalised PageRank results by convex composition at runtime (Jeh and Widom 2003). However, the storage requirements for precomputed Personalised PageRank scale quadratically with the number of concepts (Fogaras and Rácz 2004), so compression and approximation methods would be required to make this approach practical. Tong et al. (2008) give a good example of one such method that would be worth exploring in the context of word sense disambiguation. In particular, it is notable that with precomputed single source Personalised PageRank vectors the Personalised PageRank-w2w method of Agirre and Soroa (2009) could be performed with the same computational complexity as standard Personalised PageRank (albeit with a constant factor cost). We would like to investigate these and other trade-offs between Personalised PageRank accuracy and speed, in both offline experiments and with reference to user satisfaction in Wakaran.
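A minimal sketch of the runtime composition step, assuming the single source vectors have already been precomputed and stored as dense arrays keyed by concept; compression and storage concerns are set aside here:

```python
import numpy as np

def contextual_ppr(precomputed, context_concepts, weights=None):
    """Compose a contextual Personalised PageRank vector as a convex
    combination of precomputed single source vectors (Jeh and Widom 2003).

    `precomputed` maps concept ids to their single source PPR vectors;
    `weights`, if given, must be non-negative and is normalised to sum to 1.
    """
    if weights is None:
        weights = np.ones(len(context_concepts))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()  # normalise to form a convex combination
    vectors = np.stack([precomputed[c] for c in context_concepts])
    return weights @ vectors  # weighted sum over the stacked vectors
```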

Further work with Wakaran requires acquiring and sustaining a base of both ad-hoc and longitudinally tracked users. One potential source of ad-hoc usage was suggested by user feedback regarding Wakaran discussed in Chapter 4: the implementation of a browser extension would make Wakaran more appealing for use in general web browsing. An added advantage of this approach would be the ability to identify URLs of publicly available content for which sense preference data had been obtained, which could lead to publishing an open corpus. However, in the work studied so far we have found that classroom reading activities are the primary use case for Wakaran, and we have identified avenues for future work that involve usage of Wakaran in language learning assessment conditions. It is clear, then, that outreach to teachers of Japanese as a second language will be the most effective way to ensure that Wakaran's crowdsourcing potential — and its potential to have an impact on language learning as a glossing application — is realised. Our experience so far has been that in the field of tertiary education there is a strong interest in finding ways to use technology in the classroom, which augurs well for Wakaran's future.

8.3 Final remarks

In this thesis we have developed a novel crowdsourcing method for gathering human judgements of word sense relevance to context. We designed a platform, the Wakaran Glosser, in which students of Japanese as a second language may interact with the senses of words and multiwords to help them mitigate the information overload that occurs due to presentation of irrelevant glosses. In so doing, they implicitly provide feedback as to their preferred sense of the word in a given context. Using logs gathered from instrumentation of Wakaran, we verified that students engage with the platform and its interactive features in a variety of common patterns. From the same logs we compiled a corpus of user sense preferences and verified that it can be used to reproduce certain results of offline evaluation of WSD algorithms on professionally annotated corpora. In the process of analysing sense selections made by users of Wakaran, we discovered the greatest sources of bias in user sense selections and have outlined a robust plan for future work to control for those biases and expand knowledge in several areas of both computational semantics and computer assisted language learning. Overall, we have shown that crowdsourcing of annotations, extrinsic evaluation of computational semantic models and delivery of those models as an application feature can be mutually sustaining objectives in our future research.

Bibliography

AGIRRE, ENEKO, CARMEN BANEA, DANIEL CER, MONA DIAB, AITOR GONZALEZ-AGIRRE, RADA MIHALCEA, GERMAN RIGAU, and JANYCE WIEBE. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, USA.

——, OIER LÓPEZ DE LACALLE, and AITOR SOROA. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics 40(1).

——, and AITOR SOROA. 2008. Using the Multilingual Central Repository for graph-based word sense disambiguation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.

——, and ——. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the ACL, Athens, Greece.

——, ——, and MARK STEVENSON. 2010. Graph-based word sense disambiguation of biomedical documents. Bioinformatics 26(22).

AL-SABBAGH, RANIA, ROXANA GIRJU, and JANA DIESNER. 2014. Unsupervised construction of a lexicon and a repository of variation patterns for Arabic modal multiword expressions. In Proceedings of the 10th Workshop on Multiword Expressions, Gothenburg, Sweden.

AL-SAFFAR, SINAN, and GREGORY HEILEMAN. 2007. Experimental bounds on the usefulness of personalized and topic-sensitive PageRank. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Silicon Valley, USA.

ALETRAS, NIKOLAOS, and MARK STEVENSON. 2015. A hybrid distributional and knowledge-based model of lexical semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

ALLAHBAKHSH, MOHAMMAD, BOUALEM BENATALLAH, ALEKSANDAR IGNJATOVIC, HAMID REZA MOTAHARI-NEZHAD, ELISA BERTINO, and SCHAHRAM DUSTDAR. 2013. Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Computing 17(2).

ALONSO, HÉCTOR MARTÍNEZ, and LAUREN ROMEO. 2014. Crowdsourcing as a preprocessing for complex semantic annotation tasks. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland.


ANONYMOUS. 2014. The Arabian Nights: Illustrated with Popeye, Aladdin and His Wonderful Lamp Movie (1939). http://www.agelessreads.com: Ageless Reads.

ARTSTEIN, RON, and MASSIMO POESIO. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics 34(4).

ASAHARA, MASAYUKI, and YUJI MATSUMOTO. 2003. IPADIC version 2.7.0 user's manual. Technical report, NAIST, Information Science Division, Nara, Japan.

BAISA, VÍT, JANE BRADBURY, SILVIE CINKOVÁ, ISMAIL EL MAAROUF, ADAM KILGARRIFF, and OCTAVIAN POPESCU. 2015. SemEval-2015 task 15: A CPA dictionary-entry-building task. In Proceedings of the 9th International Workshop on Semantic Evaluation, Denver, USA.

BALDWIN, TIMOTHY, JONATHAN POOL, and SUSAN COLOWICK. 2010. PanLex and LEXTRACT: Translating all words of all languages of the world. In 23rd International Conference on Computational Linguistics, Demonstrations Volume, Beijing, China.

BANERJEE, SATANJEEV, and TED PEDERSEN. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico.

BANNARD, COLIN. 2007. A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, Prague, Czech Republic.

BAUER, LAURIE. 1983. English Word-Formation. Cambridge, United Kingdom: Cambridge University Press.

BENTIVOGLI, LUISA, PAMELA FORNER, and EMANUELE PIANTA. 2004. Evaluating cross-language annotation transfer in the MultiSemCor corpus. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.

——, PAMELA FORNER, and MARCELLO RANIERI. 2005. MultiSemCor: an English Italian aligned corpus with a shared inventory of senses. In Proceedings of the Meaning Workshop, Trento, Italy.

BHINGARDIVE, SUDHA, DHIRENDRA SINGH, RUDRAMURTHY V, HANUMANT REDKAR, and PUSHPAK BHATTACHARYYA. 2015. Unsupervised most frequent sense detection using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, USA.

BLOOM, BURTON H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7).

BOND, FRANCIS, TIMOTHY BALDWIN, RICHARD FOTHERGILL, and KIYOTAKA UCHIMOTO. 2012. Japanese SemCor: A sense-tagged corpus of Japanese. In Proceedings of the 6th International Conference of the Global WordNet Association (GWC), Matsue, Japan.

——, HITOSHI ISAHARA, KYOKO KANZAKI, and KIYOTAKA UCHIMOTO. 2008. Boot-strapping a WordNet using multiple existing WordNets. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.

BOSER, BERNHARD E., ISABELLE M. GUYON, and VLADIMIR N. VAPNIK. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, USA.

BOSMA, WAUTER, PIEK VOSSEN, AITOR SOROA, GERMAN RIGAU, MAURIZIO TESCONI, ANDREA MARCHETTI, MONICA MONACHINI, and CARLO ALIPRANDI. 2009. KAF: a generic semantic annotation format. In Proceedings of the 5th International Conference on Generative Approaches to the Lexicon, Pisa, Italy.

BREEN, JAMES. 2004. JMDict: a Japanese-multilingual dictionary. In Proceedings of the Workshop on Multilingual Linguistic Resources, Geneva, Switzerland.

BREEN, JIM, 2016. WWWJDIC: Online Japanese dictionary service. Online; accessed 15-Jun-2016. http://nihongo.monash.edu/cgi-bin/wwwjdic?9T.

BRILL, ERIC. 1994. Some advances in transformation-based part of speech tagging. In Proceedings of The Twelfth National Conference on Artificial Intelligence, Seattle, USA.

BRIN, SERGEY, and LAWRENCE PAGE. 1998. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh World Wide Web Conference, Brisbane, Australia.

BRODER, ANDREI Z. 1997. On the resemblance and containment of documents. In International Conference on Compression and Complexity of Sequences, Positano, Italy.

BRUCE, REBECCA, and JANYCE WIEBE. 1994. Word-sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, Las Cruces, USA.

CHAN, YEE SENG, and HWEE TOU NG. 2006. Estimating class priors in domain adaptation for word sense disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia.

CHANG, CHIH-CHUNG, and CHIH-JEN LIN. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3).

CHEN, JENNY J., NATALA J. MENEZES, ADAM D. BRADLEY, and T. A. NORTH. 2011. Opportunities for crowdsourcing research on Amazon Mechanical Turk. Interfaces 5(3).

CHEN, TAO, RUIFENG XU, YULAN HE, and XUAN WANG. 2015. Improving distributed representation of word sense via WordNet gloss composition and context clustering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China.

CHENG, JIANPENG, and DIMITRI KARTSAKLIS. 2015. Syntax-aware multi-sense word embeddings for deep compositional models of meaning. In The 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.

CHOI, JOOHEE, HEEJIN CHOI, WOONSUB SO, JAEKI LEE, and JONGJUN YOU. 2014. A study about designing reward for gamified crowdsourcing system. In Design, User Experience, and Usability. User Experience Design for Diverse Interaction Platforms and Environments: Third International Conference, Proceedings Part II, Heraklion, Greece.

CHOMSKY, NOAM. 1957. Syntactic Structures. Berlin, Germany: Mouton.

COLLINS, ALLAN M., and ELIZABETH F. LOFTUS. 1975. A spreading-activation theory of semantic processing. Psychological Review 82(6).

COOK, PAUL, AFSANEH FAZLY, and SUZANNE STEVENSON. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, Prague, Czech Republic.

COOPER, SETH, FIRAS KHATIB, ADRIEN TREUILLE, JANOS BARBERO, JEEHYUNG LEE, MICHAEL BEENEN, ANDREW LEAVER-FAY, DAVID BAKER, ZORAN POPOVIĆ, and >57,000 FOLDIT PLAYERS. 2010. Predicting protein structures with a multiplayer online game. Nature 466(7307).

CORDEIRO, SILVIO, CARLOS RAMISCH, and ALINE VILLAVICENCIO. 2016. Filtering and measuring the intrinsic quality of human compositionality judgments. In Proceedings of the 12th Workshop on Multiword Expressions, Berlin, Germany.

COVINGTON, MICHAEL A. 2001. A fundamental algorithm for dependency parsing. In Proceedings of the 39th Annual ACM Southeast Conference, Athens, USA.

CRW FLAGS, 2017. Red cross and red crescent flags. Online; accessed 1-Jan-2017. http://www.crwflags.com/fotw/flags/int-red.html.

DANG, THANH-DUNG, GWO-DONG CHEN, GIAO DANG, LIANG-YI LI, and NURKHAMID. 2013. RoLo: A dictionary interface that minimizes extraneous cognitive load of lookup and supports incidental and incremental learning of vocabulary. Computers & Education 61.

DEERWESTER, SCOTT, SUSAN T. DUMAIS, GEORGE W. FURNAS, THOMAS K. LANDAUER, and RICHARD HARSHMAN. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6).

DEN, YASUHARU, JUNPEI NAKAMURA, TOSHINOBU OGISO, and HIDEKI OGURA. 2008. A proper approach to Japanese morphological analysis: Dictionary, model, and evaluation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.

——, TOSHINOBU OGISO, HIDEKI OGURA, ATSUSHI YAMADA, NOBUAKI MINEMATSU, KIYOTAKA UCHIMOTO, and HANAE KOISO. 2007. The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics (in Japanese). Japanese Linguistics 22.

DIAB, MONA T., and PRAVIN BHUTADA. 2009. Verb noun construction MWE token supervised classification. In Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation, Applications, Singapore.

——, and MADHAV KRISHNA. 2009. Handling sparsity for verb noun MWE token classification. In GEMS '09: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, Athens, Greece.

EHRMANN, MAUD, FRANCESCO CECCONI, DANIELE VANNELLA, PHILIPP CIMIANO, and ROBERTO NAVIGLI. 2014. Representing multilingual data as linked data: the case of BabelNet 2.0. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland.

EOM, SOOJEONG, 2012. Automatic presentation of sense-specific lexical information in an intelligent learning system. Georgetown University dissertation.

——, MARKUS DICKINSON, and GRAHAM KATZ. 2012a. Using semi-experts to derive judgments on word sense alignment: a pilot study. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey.

——, MARKUS DICKINSON, and REBECCA SACHS. 2012b. Sense-specific lexical information for reading assistance. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, Montreal, Canada.

ERK, KATRIN, and DIANA MCCARTHY. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore.

——, ——, and NICHOLAS GAYLORD. 2013. Measuring word meaning in context. Computational Linguistics 39(3).

FARALLI, STEFANO, and ROBERTO NAVIGLI. 2012. A new minimally-supervised framework for domain word sense disambiguation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea.

FAZLY, AFSANEH, PAUL COOK, and SUZANNE STEVENSON. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics 35(1).

——, and SUZANNE STEVENSON. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions, Prague, Czech Republic.

FELLBAUM, CHRISTIANE, ANNE OSHERSON, and PETER CLARK. 2007. Putting semantics into WordNet's "morphosemantic" links. In Proceedings of the Third Language and Technology Conference, Poznan, Poland.

FFJON, 2016. Rikai. Online; accessed 15-Jun-2016. https://addons.mozilla.org/en-US/firefox/addon/rikaichan/.

FOGARAS, DÁNIEL, and BALÁZS RÁCZ. 2004. Towards scaling fully personalized PageRank. In Algorithms and Models for the Web-Graph, Third International Workshop, Proceedings, Rome, Italy.

FRANCIS, W. NELSON. 1965. A standard corpus of edited present-day American English. College English 26(4).

FRANTZI, KATERINA T., and SOPHIA ANANIADOU. 1999. The C-value/NC-value domain-independent method for multi-word term extraction. Journal of Natural Language Processing 6(3).

FRASER, CAROL A. 1999. Lexical processing strategy use and vocabulary learning through reading. Studies in Second Language Acquisition 21(2).

GALE, WILLIAM, KENNETH WARD CHURCH, and DAVID YAROWSKY. 1992a. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, USA.

GALE, WILLIAM A., KENNETH W. CHURCH, and DAVID YAROWSKY. 1992b. One sense per discourse. In Proceedings of the DARPA Speech and Natural Language Workshop, Harriman, USA.

GHARBIEH, WASEEM, VIRENDRA C. BHAVSAR, and PAUL COOK. 2016. A word embedding approach to identifying verb-noun idiomatic combinations. In Proceedings of the 12th Workshop on Multiword Expressions, Berlin, Germany.

GLAVAŠ, GORAN, FEDERICO NANNI, and SIMONE PAOLO PONZETTO. 2016. Unsupervised text segmentation using semantic relatedness graphs. In Proceedings of *SEM 2016: the Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany.

GUPTA, ABHIJEET, JASON UTT, and SEBASTIAN PADÓ. 2015. Dissecting the practical lexical function model for compositional distributional semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

HASHIMOTO, CHIKARA, and DAISUKE KAWAHARA. 2009. Compilation of an idiom example database for supervised idiom identification. Language Resources and Evaluation 43(4).

——, SATOSHI SATO, and TAKEHITO UTSURO. 2006. Detecting Japanese idioms with a linguistically rich dictionary. Language Resources and Evaluation 40(3-4).

HAVELIWALA, TAHER H. 2002. Topic-sensitive PageRank. In Proceedings of the 11th International World Wide Web Conference, Honolulu, USA.

HILDEN, JØRGEN. 1984. Statistical diagnosis based on conditional independence does not require it. Computers in Biology and Medicine 14(4).

HOWE, JEFF. 2006. The rise of crowdsourcing. Wired Magazine 14(6).

IACOBACCI, IGNACIO, MOHAMMAD TAHER PILEHVAR, and ROBERTO NAVIGLI. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.

ICRC BLOG, 2017. Red cross & red crescent movement. Online; accessed 1-Jan-2017. http://blogs.icrc.org/ilot/red-cross-red-crescent-movement/.

IREN, DENIZ, and SEMIH BILGEN. 2014. Cost of quality in crowdsourcing. Human Computation 1(2).

IRUJO, SUZANNE. 1986. A piece of cake: Learning and teaching idioms. ELT Journal 40(3).

ISAHARA, HITOSHI, FRANCIS BOND, KIYOTAKA UCHIMOTO, MASAO UTIYAMA, and KYOKO KANZAKI. 2008. Development of the Japanese WordNet. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, Marrakech, Morocco.

JEH, GLEN, and JENNIFER WIDOM. 2003. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary.

JIANG, JAY, and DAVID CONRATH. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th Research on Computational Linguistics International Conference, Taipei, Taiwan.

JOHANSSON, RICHARD, and LUIS NIETO PIÑA. 2015. Embedding a semantic network in a word space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, USA.

JURAFSKY, DANIEL, and JAMES H. MARTIN. 2000. Speech and Language Processing. Upper Saddle River, USA: Prentice Hall.

JURGENS, DAVID. 2013. Embracing ambiguity: A comparison of annotation methodologies for crowdsourcing word sense labels. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, USA.

——, and IOANNIS KLAPAFTIS. 2013. SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, USA.

——, and ROBERTO NAVIGLI. 2014. It’s all fun and games until someone annotates: Video games with a purpose for linguistic annotation. Transactions of the ACL 2.

——, MOHAMMAD TAHER PILEHVAR, and ROBERTO NAVIGLI. 2014. SemEval-2014 task 3: Cross-level semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation, Dublin, Ireland.

KAMHOLZ, DAVID, JONATHAN POOL, and SUSAN MCOLOWICK. 2014. PanLex: Building a resource for panlingual lexical translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland.

KATZ, GRAHAM, and EUGENIE GIESBRECHT. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia.

KAWAHARA, DAISUKE, and SADAO KUROHASHI. 2006a. Case frame compilation from the web using high-performance computing. In Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy.

——, and ——. 2006b. A fully-lexicalized probabilistic model for Japanese syntactic and case structure analysis. In Proceedings of the Human Language Technology Conference of the NAACL, New York, USA.

——, and MARTHA PALMER. 2014. Single classifier approach for verb sense disambiguation based on generalized features. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland.

KEMPEN, GERARD, and EDWARD HOENKAMP. 1987. An incremental procedural grammar for sentence formulation. Cognitive Science 11(2).

——, and PIETER HUIJBERS. 1983. The lexicalization process in sentence production and naming: Indirect election of words. Cognition 14(2).

KERN, ROBERT. 2014. Dynamic Quality Management for Cloud Labor Services. Cham, Switzerland: Springer.

KHATIB, FIRAS, FRANK DIMAIO, SETH COOPER, MACIEJ KAZMIERCZYK, MIROSLAW GILSKI, SZYMON KRZYWDA, HELENA ZABRANSKA, IVA PICHOVA, JAMES THOMPSON, ZORAN POPOVIĆ, and others. 2011. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nature Structural & Molecular Biology 18(10).

KILGARRIFF, ADAM. 2007. Word senses. In Word Sense Disambiguation, ed. by Eneko Agirre and Philip Edmonds. Dordrecht, Netherlands: Springer.

KITTUR, ANIKET. 2010. Crowdsourcing, collaboration and creativity. XRDS: Crossroads, The ACM Magazine for Students 17(2).

KLEIN, DAN, and CHRISTOPHER D. MANNING. 2002. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15, Vancouver, Canada.

KORNAI, ANDRÁS, JUDIT ÁCS, MÁRTON MAKRAI, MÁRK DÁVID NEMESKEY, KATALIN PAJKOSSY, and GÁBOR RECSKI. 2015. Competence in lexical semantics. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

KOZIMA, HIDEKI, and TEIJI FURUGORI. 1993. Similarity between words computed by spreading activation on an English dictionary. In Proceedings of the Sixth Conference on European Chapter of the Association for Computational Linguistics, Utrecht, Netherlands.

KREMER, GERHARD, KATRIN ERK, SEBASTIAN PADÓ, and STEFAN THATER. 2014. What substitutes tell us – analysis of an “all-words” lexical substitution corpus. In 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden.

KUDO, TAKU, and YUJI MATSUMOTO. 2002. Japanese dependency analysis using cascaded chunking. In Proceedings of the 6th Conference on Natural Language Learning, Taipei, Taiwan.

——, KAORU YAMAMOTO, and YUJI MATSUMOTO. 2004. Applying conditional random fields to Japanese morphological analysis. In The 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain.

KULKARNI, ANAGHA, MICHAEL HEILMAN, MAXINE ESKENAZI, and JAMIE CALLAN. 2008. Word sense disambiguation for vocabulary learning. In Proceedings of the 9th International Conference on Intelligent Tutoring Systems, Montreal, Canada.

KUROHASHI, SADAO, and DAISUKE KAWAHARA, 2017. Juman (a user-extensible morphological analyzer for Japanese). Online; accessed 28-Jan-2017. http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN.

——, and MAKOTO NAGAO. 1992. Dynamic programming method for analyzing conjunctive structures in Japanese. In Proceedings of the 14th Conference on Computational Linguistics, Nantes, France.

——, and ——. 1994. KN parser: Japanese dependency/case structure analyzer. In Proceedings of the Workshop on Sharable Natural Language Resources, Nara, Japan.

LAFFERTY, JOHN, ANDREW MCCALLUM, and FERNANDO PEREIRA. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, USA.

LANDAUER, THOMAS K., and SUSAN T. DUMAIS. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2).

LANDES, SHARI, CLAUDIA LEACOCK, and RANDEE I. TENGI. 1998. Building semantic concordances. In WordNet: An electronic lexical database, ed. by Christiane Fellbaum, chapter 8. Cambridge, USA: MIT Press.

LAUFER, BATIA. 2000. Avoidance of idioms in a second language: The effect of L1-L2 degree of similarity. Studia Linguistica 54(2).

LE, PHONG, and WILLEM ZUIDEMA. 2015. Compositional distributional semantics with long short term memory. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

LEE, YOONG KEOK, and HWEE TOU NG. 2002. An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, USA.

LESK, MICHAEL. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, New York, USA.

LI, JIWEI, and DAN JURAFSKY. 2015. Do multi-sense embeddings improve natural language understanding? In The 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.

LI, LINLIN, and CAROLINE SPORLEDER. 2010. Linguistic cues for distinguishing literal and non-literal usages. In 23rd International Conference on Computational Linguistics, Posters Volume, Beijing, China.

LIEBEROTH, ANDREAS, MADS KOCK PEDERSEN, ANDREEA CATALINA MARIN, TILO PLANKE, and JACOB FRIIS SHERSON. 2014. Getting humans to do quantum optimization - user acquisition, engagement and early results from the citizen cyberscience game Quantum Moves. Human Computation 1(2).

LIN, CHIH-CHENG, and HUI-MIN HUANG. 2008. Meaning-inferred gloss and meaning-given gloss on incidental vocabulary learning. Journal of National Taiwan Normal University 53(2).

LIN, DEKANG. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, USA.

LOMICKA, LARA L. 1998. “To gloss or not to gloss”: an investigation of reading comprehension online. Language Learning & Technology 1(2).

LOPEZ DE LACALLE, OIER, and ENEKO AGIRRE. 2015a. Crowdsourced word sense annotations and difficult words and examples. In Proceedings of the 11th International Conference on Computational Semantics, London, UK.

——, and ——. 2015b. A methodology for word sense disambiguation at 90% based on large-scale crowdsourcing. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

LOSSIO-VENTURA, JUAN ANTONIO, CLÉMENT JONQUET, MATHIEU ROCHE, and MAGUELONNE TEISSEIRE. 2014. Yet another ranking function for automatic multiword term extraction. In International Conference on Natural Language Processing, Warsaw, Poland.

LUPU, MONICA, DIANA TRANDABAT, and MARIA HUSARCIUC. 2005. A Romanian SemCor aligned to the English and Italian MultiSemCor. In 1st ROMANCE FrameNet Workshop at EUROLAN, Cluj-Napoca, Romania.

MAEKAWA, KIKUO. 2009. Analysis of language variation using a large-scale corpus of spontaneous speech. In Linguistic Patterns in Spontaneous Speech. Taipei, Taiwan: Institute of Linguistics, Academia Sinica.

MANANDHAR, SURESH, IOANNIS P. KLAPAFTIS, DMITRIY DLIGACH, and SAMEER S. PRADHAN. 2010. SemEval-2010 task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden.

MANNING, CHRISTOPHER D., and HINRICH SCHÜTZE. 1999. Foundations of Statistical Natural Language Processing. Cambridge, USA: MIT Press.

MARCHAND, ANDRÉ, and THORSTEN HENNIG-THURAU. 2013. Value creation in the video game industry: Industry economics, consumer benefits, and research opportunities. Journal of Interactive Marketing 27(3).

MÀRQUEZ, LLUÍS, GERARD ESCUDERO, DAVID MARTÍNEZ, and GERMAN RIGAU. 2007. Supervised corpus-based methods for WSD. In Word Sense Disambiguation: Algorithms and Applications, ed. by Eneko Agirre and Philip Edmonds, chapter 7. New York, USA: Springer.

MARTÍNEZ ALONSO, HÉCTOR, BOLETTE SANDFORD PEDERSEN, and NÚRIA BEL. 2013. Annotation of regular polysemy and underspecification. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria.

MCCARTHY, DIANA, ROB KOELING, JULIE WEEDS, and JOHN CARROLL. 2007. Unsupervised acquisition of predominant word senses. Computational Linguistics 33(4).

——, and ROBERTO NAVIGLI. 2007. SemEval 2007 task 10: English lexical substitution task. In Proceedings of the Fourth International Workshop on Semantic Evaluations, Prague, Czech Republic.

MESGAR, MOHSEN, and MICHAEL STRUBE. 2015. Graph-based coherence modeling for assessing readability. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

MIHALCEA, RADA, 2017. Rada Mihalcea - Downloads. Online; accessed 30-Jan-2017. http://web.eecs.umich.edu/~mihalcea/downloads.html.

——, PAUL TARAU, and ELIZABETH FIGA. 2004. PageRank on semantic networks, with application to word sense disambiguation. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.

MIKOLOV, TOMAS, KAI CHEN, GREG CORRADO, and JEFFREY DEAN. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations Workshop, Scottsdale, USA.

MILLER, GEORGE A. 1995. WordNet: a lexical database for English. Communications of the ACM 38(11).

MILLER, GEORGE A., RICHARD BECKWITH, CHRISTIANE FELLBAUM, DEREK GROSS, and KATHERINE J. MILLER. 1990. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography 3(4).

MORIMOTO, AYAKA, AKIFUMI YOSHIMOTO, AKIHIKO KATO, HIROYUKI SHINDO, and YUJI MATSUMOTO. 2016. Identification of flexible multiword expressions with the help of dependency structure annotation. In Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces, Osaka, Japan.

MORO, ANDREA, and ROBERTO NAVIGLI. 2015. SemEval-2015 task 13: multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation, Denver, USA.

——, ALESSANDRO RAGANATO, and ROBERTO NAVIGLI. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the ACL 2.

MORSCHHEUSER, BENEDIKT, JUHO HAMARI, and JONNA KOIVISTO. 2016. Gamification in crowdsourcing: a review. In Proceedings of the 49th Hawaii International Conference on System Sciences, Hawaii, USA.

MURATA, MASAKI, MASAO UTIYAMA, KIYOTAKA UCHIMOTO, QING MA, and HITOSHI ISAHARA. 2001. Japanese word sense disambiguation using the simple Bayes and support vector machine methods. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France.

NAH, FIONA FUI-HOON. 2004. A study on tolerable waiting time: how long are web users willing to wait? Behaviour & Information Technology 23(3).

NAKOV, PRESLAV, ALAN RITTER, SARA ROSENTHAL, FABRIZIO SEBASTIANI, and VESELIN STOYANOV. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, USA.

NASR, ALEXIS, CARLOS RAMISCH, JOSÉ DEULOFEU, and ANDRÉ VALLI. 2015. Joint dependency parsing and multiword expression tokenization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China.

NAVIGLI, ROBERTO. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia.

——. 2009. Word sense disambiguation: A survey. ACM Computing Surveys 41(2).

——, K. C. LITKOWSKI, and O. HARGRAVES. 2007. SemEval 2007 task 07: Coarse-grained English all-words task. In Proceedings of the Fourth International Workshop on Semantic Evaluations, Prague, Czech Republic.

——, and SIMONE PAOLO PONZETTO. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193.

——, and DANIELE VANNELLA. 2013. SemEval-2013 task 11: Word sense induction and disambiguation within an end-user application. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, USA.

NERBONNE, JOHN, DUCO DOKTER, and PETRA SMIT. 1998. Morphological processing and computer-assisted language learning. Computer Assisted Language Learning 11(5).

NUNBERG, GEOFFREY, IVAN A. SAG, and THOMAS WASOW. 1994. Idioms. Language 70(3).

O’DONOVAN, RUTH, and MARY O’NEILL. 2008. A systematic approach to the selection of neologisms for inclusion in a large monolingual dictionary. In Proceedings of the 13th Euralex International Congress, Barcelona, Spain.

OSBORNE, TIMOTHY. 2011. Type 2 rising, a contribution to a DG account of discontinuities. In Proceedings of the International Conference on Dependency Linguistics, Barcelona, Spain.

PAETZOLD, GUSTAVO, and LUCIA SPECIA. 2016. SemEval 2016 task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, USA.

PAGE, LAWRENCE, SERGEY BRIN, RAJEEV MOTWANI, and TERRY WINOGRAD. 1998. The PageRank citation ranking: bringing order to the web.

PALMER, MARTHA, HOA TRANG DANG, and CHRISTIANE FELLBAUM. 2007a. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering 13(2).

——, HWEE TOU NG, and HOA TRANG DANG. 2007b. Evaluation of WSD systems. In Word Sense Disambiguation: Algorithms and Applications, ed. by Eneko Agirre and Philip Edmonds, chapter 4. New York, USA: Springer.

PANTEL, PATRICK, and DEKANG LIN. 2002. Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada.

PARASCA, IULIANA-ELENA, ANDREAS LUKAS RAUTER, JACK ROPER, ALEKSANDAR RUSINOV, GUILLAUME BOUCHARD, SEBASTIAN RIEDEL, and PONTUS STENETORP. 2016. Defining words with words: Beyond the distributional hypothesis. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany.

PASSONNEAU, REBECCA J., and BOB CARPENTER. 2014. The benefits of a model of annotation. Transactions of the Association for Computational Linguistics 2.

PEDERSEN, TED, and VARADA KOLHATKAR. 2009. WordNet::SenseRelate::AllWords: a broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, companion volume: Demonstration session, Boulder, USA.

PIANTA, EMANUELE, and LUISA BENTIVOGLI. 2004. Knowledge intensive word alignment with KNOWA. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.

——, ——, and CHRISTIAN GIRARDI. 2002. MultiWordNet: developing an aligned multilingual database. In First International Conference on Global WordNet, Mysore, India.

PIANTADOSI, STEVEN T. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21(5).

POESIO, MASSIMO, JON CHAMBERLAIN, UDO KRUSCHWITZ, LIVIO ROBALDO, and LUCA DUCCESCHI. 2013. Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems 3(1).

PONTIKI, MARIA, DIMITRIS GALANIS, HARIS PAPAGEORGIOU, ION ANDROUTSOPOULOS, SURESH MANANDHAR, MOHAMMAD AL-SMADI, MAHMOUD AL-AYYOUB, YANYAN ZHAO, BING QIN, ORPHÉE DE CLERCQ, VÉRONIQUE HOSTE, MARIANNA APIDIANAKI, XAVIER TANNIER, NATALIA LOUKACHEVITCH, EVGENIY KOTELNIKOV, NÚRIA BEL, MARÍA SALUD JIMÉNEZ-ZAFRA, and GÜLŞEN ERYİĞİT. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, USA.

PONZETTO, SIMONE PAOLO, and ROBERTO NAVIGLI. 2010. Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

POZNANSKI, VICTOR, PETE WHITELOCK, JAN IJDENS, and STEFFAN CORLEY. 1998. Practical glossing by prioritised tiling. In Proceedings of the 17th International Conference on Computational Linguistics, volume 2, Montreal, Canada.

PRABHAKARAN, VINODKUMAR, and BRANIMIR BOGURAEV. 2015. Learning structures of negations from flat annotations. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, USA.

PREISS, JUDITA, JON DEHDARI, JOSH KING, and DENNIS MEHAY. 2009. Refining the most frequent sense baseline. In Semantic Evaluations: Recent Achievements and Future Directions, Boulder, USA.

PUSTEJOVSKY, JAMES, and AMBER STUBBS. 2012. Natural Language Annotation for Machine Learning. Cambridge, USA: O’Reilly Media.

RADA, ROY, HAFEDH MILI, ELLEN BICKNELL, and MARIA BLETTNER. 1989. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 19(1).

RAMISCH, CARLOS. 2015. Multiword Expressions Acquisition. Switzerland: Springer International Publishing.

RANARD, BENJAMIN L., YOONHEE P. HA, ZACHARY F. MEISEL, DAVID A. ASCH, SHAWNDRA S. HILL, LANCE B. BECKER, ANNE K. SEYMOUR, and RAINA M. MERCHANT. 2014. Crowdsourcing – harnessing the masses to advance health and medicine, a systematic review. Journal of General Internal Medicine 29(1).

RATCLIFF, JOHN W., and DAVID E. METZENER. 1988. Pattern matching: the Gestalt approach. Dr. Dobb’s Journal 13(7).

REDDY, SIVA, DIANA MCCARTHY, and SURESH MANANDHAR. 2011. An empirical study on compositionality in compound nouns. In Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand.

RESNIK, PHILIP. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada.

——, and DAVID YAROWSKY. 1997. A perspective on word sense disambiguation methods and their evaluation. In Proceedings of Tagging Text with Lexical Semantics: Why, What and How, Washington, USA.

——, and ——. 1999. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering 5(2).

RICHARDSON, MATTHEW, and P. DOMINGOS. 2001. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in Neural Information Processing Systems 14, Vancouver, Canada.

RIEDL, MARTIN, and CHRIS BIEMANN. 2015. A single word is not enough: ranking multiword expressions using distributional semantics. In The 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.

ROSA, KEVIN DELA, and MAXINE ESKENAZI. 2011. Impact of word sense disambiguation on ordering dictionary definitions in vocabulary learning tutors. In Proceedings of the 24th International Florida Artificial Intelligence Research Society Conference, Palm Beach, USA.

RUDICK, TODD, 2016. Rikai. Online; accessed 15-Jun-2016. http://www.rikai.com/perl/Home.pl.

RUMSHISKY, ANNA, NICK BOTCHAN, SOPHIE KUSHKULEY, and JAMES PUSTEJOVSKY. 2012. Word sense inventories by non-experts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, Istanbul, Turkey.

RZESZOTARSKI, JEFFREY M., and ANIKET KITTUR. 2011. Instrumenting the crowd: using implicit behavioral measures to predict task performance. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, USA.

SAG, IVAN, TIMOTHY BALDWIN, FRANCIS BOND, ANN COPESTAKE, and DAN FLICKINGER. 2002. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico.

SALEHI, BAHAR, and PAUL COOK. 2013. Predicting the compositionality of multiword expressions using translations in multiple languages. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Atlanta, USA.

——, ——, and TIMOTHY BALDWIN. 2014. Detecting non-compositional MWE components using Wiktionary. In The 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar.

SALES, JULIANO EFSON, ANDRÉ FREITAS, BRIAN DAVIS, and SIEGFRIED HANDSCHUH. 2016. A compositional-distributional semantic model for searching complex entity categories. In Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, USA.

SCHNEIDER, NATHAN, DIRK HOVY, ANDERS JOHANNSEN, and MARINE CARPUAT. 2016. SemEval-2016 task 10: Detecting minimal semantic units and their meanings (DiMSUM). In Proceedings of the 10th International Workshop on Semantic Evaluation, San Diego, USA.

——, and NOAH A. SMITH. 2015. A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, USA.

SCHÜTZE, HINRICH. 1998. Automatic word sense discrimination. Computational Linguistics Special Issue on Word Sense Disambiguation 24(1).

SEEMAKURTY, NITIN, JONATHAN CHU, LUIS VON AHN, and ANTHONY TOMASIC. 2010. Word sense disambiguation via human computation. In Proceedings of the ACM SIGKDD Workshop on Human Computation, Washington, USA.

SHRIVASTAVA, ANSHUMALI, and PING LI. 2015. Asymmetric minwise hashing for indexing binary inner products and set containment. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy.

SHUDO, KOSHO, AKIRA KURAHONE, and TOSHIFUMI TANABE. 2011. A comprehensive dictionary of multiword expressions. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, USA.

SNOW, RION, BRENDAN O’CONNOR, DANIEL JURAFSKY, and ANDREW Y. NG. 2008. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In 2008 Conference on Empirical Methods in Natural Language Processing, Waikiki, USA.

SONG, LINFENG, ZHIGUO WANG, HAITAO MI, and DANIEL GILDEA. 2016. Sense embedding learning for word sense induction. In Proceedings of *SEM 2016: the Fifth Joint Conference on Lexical and Computational Semantics, Berlin, Germany.

SUSSNA, MICHAEL. 1993. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the Second International Conference on Information and Knowledge Management, Washington, USA.

TABATA-SANDOM, MITSUE. 2016. How do learners of Japanese read texts when they use online pop-up dictionaries? The Reading Matrix: An International Online Journal 16(2).

TAN, PANG-NING, MICHAEL STEINBACH, and VIPIN KUMAR. 2006. Introduction to Data Mining. Boston, USA: Pearson Education.

TANABE, TOSHIFUMI, MASAHITO TAKAHASHI, and KOSHO SHUDO. 2014. A lexicon of multiword expressions for linguistically precise, wide-coverage natural language processing. Computer Speech & Language 28(6).

TONG, HANGHANG, CHRISTOS FALOUTSOS, and JIA-YU PAN. 2008. Random walk with restart: fast solutions and applications. Knowledge and Information Systems 14(3).

TSVETKOV, YULIA, and SHULY WINTNER. 2014. Identification of multiword expressions by combining multiple linguistic information sources. Computational Linguistics 40(2).

VENHUIZEN, NOORTJE J., VALERIO BASILE, KILIAN EVANG, and JOHAN BOS. 2013. Gamification for word sense labeling. In Proceedings of the 10th International Conference on Computational Semantics, Potsdam, Germany.

VERAS, RAFAEL, ERIK PALUKA, MENG-WEI CHANG, VIVIAN TSANG, FRASER SHEIN, and CHRISTOPHER COLLINS. 2014. Interaction for reading comprehension on mobile devices. In Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices & Services, Toronto, Canada.

VO, TAN, B. SUMUDU U. MENDIS, and TOM GEDEON. 2010. Gaze pattern and reading comprehension. In Proceedings of the 17th International Conference on Neural Information Processing: Models and Applications, Sydney, Australia.

VON AHN, LUIS, 2005. Human Computation. Carnegie Mellon University dissertation.

VOORHEES, ELLEN M., and DONNA HARMAN. 2001. Overview of TREC 2001. In The Tenth Text REtrieval Conference, Gaithersburg, USA.

VOSSEN, PIEK. 1997. EuroWordNet: a multilingual database for information retrieval. In Proceedings of the DELOS Workshop on Cross-Language Information Retrieval, Zurich, Switzerland.

WHITELOCK, PETE, and PHILIP EDMONDS. 2000. The Sharp Intelligent Dictionary. In Proceedings of the 9th EURALEX, Stuttgart, Germany.

WIKIPEDIA, 2016. Tonegawa — Wikipedia, the free encyclopedia. Online; accessed 15-Jun-2016. https://ja.wikipedia.org/w/index.php?title=%E5%88%A9%E6%A0%B9%E5%B7%9D&oldid=59786515.

——, 2017. Mobile phone — Wikipedia. Online; accessed 1-Jan-2017. https://en.wikipedia.org/w/index.php?title=Mobile_phone&oldid=767699669.

WIKTIONARY, 2017a. device - Wiktionary. Online; accessed 1-Jan-2017. https://en.wiktionary.org/wiki/device.

——, 2017b. plot device - Wiktionary. Online; accessed 1-Jan-2017. https://en.wiktionary.org/wiki/plot_device.

WORDNET, PRINCETON UNIVERSITY, 2013. About WordNet. Online; accessed 25-Feb-2013. http://wordnet.princeton.edu.

YAZDANI, MAJID, MEGHDAD FARAHMAND, and JAMES HENDERSON. 2015. Learning semantic composition to detect non-compositionality of multiword expressions. In The 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.

YU, ALAN, 2003. Optimizing sales of online shopping cart within the digital customer life cycle. Department of Linguistics, University of California, Berkeley dissertation.

ZHANG, XINGJIAN, and JEFF HEFLIN. 2010. Calculating word sense probability distributions for semantic web applications. In Proceedings of the Fourth IEEE International Conference on Semantic Computing, Chicago, USA.
