
Orthographic support for passing the reading hurdle in Japanese

A thesis presented by

Lars Yencken

to e Department of Computer Science and Software Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Melbourne
Melbourne, Australia
April 2010

©2010 Lars Yencken

All rights reserved.

Thesis advisor(s): Timothy Baldwin
Author: Lars Yencken

Orthographic support for passing the reading hurdle in Japanese

Abstract

Learning a second language is, for the most part, a day-in day-out struggle against the mountain of new vocabulary a learner must acquire. Furthermore, since the number of new words to learn is so great, learners must acquire them autonomously. Evidence suggests that for languages with writing systems, native-like vocabulary sizes are only developed through reading widely, and that reading is only fruitful once learners have acquired the core vocabulary required for it to become smooth. Learners of Japanese face an especially high barrier in the form of the Japanese orthography, in particular its use of kanji characters. Recent work on dictionary accessibility has focused on compensating for learner errors in pronouncing unknown words; however, much difficulty remains.

This thesis uses the rich visual nature of the Japanese orthography to support the study of vocabulary in several ways. Firstly, it proposes a range of kanji similarity measures and evaluates them over several new data sets, finding that the stroke edit distance and tree edit distance metrics best approximate human judgements. Secondly, it uses stroke edit distance to construct a model of kanji misrecognition, which we use as the basis for a new form of kanji search by similarity. Analysing query logs, we find that this new form of search was rapidly adopted by users, indicating its utility. We finally combine kanji confusion and pronunciation models into a new adaptive testing platform, Kanji Tester, modelled after aspects of the Japanese Language Proficiency Test. As the user tests themselves, the system adapts to their error patterns and uses this information to make future tests more difficult. Investigating logs of use, we find a weak positive correlation between ability estimates and the time the system has been used. Furthermore, our adaptive models generated questions which were significantly more difficult than their control counterparts.

Overall, these contributions make a concerted effort to improve tools for learner self-study, so that learners can successfully overcome the reading hurdle and propel themselves towards greater proficiency. The data collected from these tools also forms a useful basis for further study of learner error and vocabulary development.

This is to certify that

i. the thesis comprises only my original work towards the PhD

ii. due acknowledgement has been made in the text to all other material used

iii. the thesis is less than 100,000 words in length, exclusive of tables, maps, bibliographies and appendices

Lars Yencken

Citations to Previously Published Work

Large portions of Chapter 4 have appeared in the following papers:

Yencken, Lars and Timothy Baldwin. 2008. Measuring and predicting orthographic associations: modelling the similarity of Japanese kanji. In Proceedings of the 22nd International Conference on Computational Linguistics, 1041–1048, Manchester, UK.

Yencken, Lars and Timothy Baldwin. 2006. Modelling the orthographic neighbourhood for Japanese Kanji. In Proceedings of the 21st International Conference on Computer Processing of Oriental Languages, 321–332, Sentosa, Singapore.

Similarly, large portions of Chapter 5 have appeared in:

Yencken, Lars and Timothy Baldwin. 2008. Orthographic similarity search for dictionary lookup of Japanese words. In Proceedings of the 18th European Conference on Artificial Intelligence, 343–347, Patras, Greece.

Yencken, Lars and Timothy Baldwin. 2005. Efficient grapheme-phoneme alignment for Japanese. In Proceedings of the Australasian Language Technology Workshop 2005, 143–151, Sydney, Australia.

Acknowledgments

There are many people who have supported me through this long adventure. I'd like to begin by thanking my parents for their persistent and dogged encouragement, my wife Lauren for her love and understanding, and my cat Nero for his quiet, enduring company along the way.

When I first began this work, Slaven Bilac visited our lab in Melbourne and handed over both source code and control of the FOKS system so that I could use it as a test bed for new dictionary research. I would like to thank him for his kindness and generosity, without which much of this work would not have been possible.

In late 2006 I spent 6 months with the Tanaka Lab at the University of Tokyo. I would like to thank Kumiko Tanaka-Ishii for kindly supporting my visit, and for expanding my perspective on many aspects of Japanese life, language and culture. My warmest thanks go also to my friend and colleague Zhihui Jin for our many long discussions during this period, and for his fresh insights as a Chinese learner of Japanese.

Finally, my most sincere and wholehearted thanks go to my long-suffering supervisor Tim Baldwin, for his sharp insight, stoic patience, constant warmth and all round good humour.

Thank you.

Dedicated to Lauren, my wife and partner in all things.

Contents

List of Figures ...... xii
List of Tables ...... xvi

1 Introduction 1
  1.1 Aim and contributions ...... 5
  1.2 Audience ...... 6
  1.3 Thesis structure ...... 6

2 The Japanese writing system 10
  2.1 Overview ...... 10
  2.2 Input and lookup of Japanese ...... 14
  2.3 Word formation in Japanese ...... 18

3 Learning Japanese vocabulary 22
  3.1 Autonomous vocabulary acquisition ...... 23
    The beginner's paradox ...... 23
    Knowing a word ...... 26
    Ordering strategies ...... 29
    Acquisition strategies ...... 30
    Self-study tools ...... 36
  3.2 Structure of the mental lexicon ...... 41
    Visual word recognition ...... 42
    General lexical relationships ...... 46
    Graphemic neighbourhoods: errors and associations ...... 49
  3.3 Dictionary lookup of Japanese ...... 52
    Lookup for producers ...... 52
    Lookup for receivers ...... 53
    FOKS dictionary ...... 56
  3.4 Second language testing ...... 57
    The role of testing ...... 57
    Testing theory ...... 59
    Vocabulary testing and drilling ...... 61


  3.5 Conclusion ...... 63

4 Orthographic similarity 64
  4.1 Introduction ...... 64
    Overview ...... 64
    Distance within alphabetic orthographies ...... 64
  4.2 Distance models: a first approach ...... 66
    Form of models ...... 66
    Bag of radicals model ...... 67
    L1 norm across rendered images ...... 68
  4.3 Similarity experiment ...... 70
    Experiment outline ...... 70
    Control pairs ...... 72
    Results ...... 72
    Evaluating similarity models ...... 74
    Model evaluation ...... 75
  4.4 Distance models: a second approach ...... 77
    Models ...... 78
    Evaluation ...... 82
  4.5 Discussion ...... 89
  4.6 Conclusion ...... 90

5 Extending the dictionary 92
  5.1 FOKS: the baseline ...... 93
    Intelligent lookup ...... 93
    Architecture ...... 95
    Error models ...... 97
    Limitations ...... 97
    FOKS rebuilt ...... 99
  5.2 Grapheme-phoneme alignment ...... 99
    Overview ...... 99
    Existing algorithm ...... 101
    Modified algorithm ...... 105
    Evaluation ...... 108
    Improved alignment ...... 112
  5.3 Usability enhancements ...... 113
    Place names and homographs ...... 113
    Sense separation ...... 115
    Query and reading explanation ...... 115
  5.4 Intelligent grapheme search ...... 118
    Overview ...... 118
    Overall model ...... 119

    Confusion model ...... 120
  5.5 Log analysis ...... 121
  5.6 Conclusion ...... 126

6 Testing and drilling 128
  6.1 From search to testing ...... 129
    Testing as an application space ...... 129
    The ideal test ...... 130
  6.2 Japanese Language Proficiency Test ...... 132
    Overview of the JLPT ...... 132
    Question types ...... 134
    Potential criticisms of the JLPT ...... 137
  6.3 Kanji Tester ...... 139
    User interface ...... 139
    User modelling ...... 144
    Generating questions ...... 148
    Update algorithm ...... 151
  6.4 Evaluation ...... 155
    User demographics ...... 155
    User ability and improvement ...... 157
    Power users ...... 163
    Measuring adaptiveness ...... 166
    Error models ...... 169
  6.5 Discussion ...... 171
  6.6 Conclusion ...... 173

7 Conclusions 175
  7.1 Conclusions ...... 175
    Graphemic similarity ...... 176
    Dictionary search ...... 177
    Vocabulary testing ...... 178
  7.2 Future work ...... 178
    Lexical relationships ...... 178
    Dictionary search ...... 179
    Proficiency testing ...... 181
  7.3 Summary ...... 182

Bibliography 183

Index 194

List of Figures

2.1 Exposing the compositional nature of kanji. 漁 asa(ru) "fishing" is made by combining 水 mizu "water" (via short form 氵) and 魚 sakana "fish". Further decomposition of 魚 is possible; however, its subcomponents do not have a semantic relationship with the whole, because 魚 is a pictograph (i.e. it is derived from a picture of a fish rather than from these components). ...... 12
2.2 The evolution of the character 歳 sai "year, age" from oracle characters to modern-day usage in Chinese and Japanese, and the cognate relationship between the two usages. Cognates are from Goh et al. (2005), and early character forms are from Sears (2008). ...... 15
2.3 An example of IME use to type in Japanese. The pronunciation is first entered, then matching word forms are displayed to choose from. The example reads カレーを食べよう karē o tabeyō "let's eat curry". ...... 16

3.1 Nuttall's (1996) cycles of frustration and growth in L2 reading. ...... 26
3.2 The basic study mode of Quizlet, shown part way through study of the JLPT 4 kanji set. The JLPT 4 kanji list was entered by another user. ...... 37
3.3 The "Scatter" game in Quizlet, where users must drag and drop matching elements of word-gloss pairs onto each other to eliminate them. The user is scored by their time taken, which is compared to their previous best. ...... 38
3.4 The card system proposed by Leitner for spaced repetition. From Mondria (2007). ...... 39
3.5 Rikai used to aid reading of a news article from Asahi Shimbun. When the mouse is hovered over the word 商会, its gloss "firm; company" is displayed, as well as an individual gloss for each kanji in the word. ...... 40
3.6 The multilevel interactive-activation model for alphabetic orthographies. From McClelland and Rumelhart (1981). ...... 43
3.7 The Companion Activation Model (CAM). From Saito et al. (1998). ...... 44
3.8 Lemma unit connections in Joyce's (2002) multilevel interactive-activation model for Japanese. ...... 45


3.9 Interlingual relationships based on near-homophony, homography and synonymy. The first example is English-French; the second and third are Japanese-Chinese. ...... 48
3.10 Kanji lookup methods to find 明 aka "bright". Steps beginning with "IDENTIFY" or "COUNT" are potential error sites for learners. Steps beginning with "LOOKUP" involve use of an index, and a subsequent visual scan for the desired item. ...... 54

4.1 Drawing rough equivalence between the linguistic units used in English and Japanese. ...... 65
4.2 Kanji decomposed using radkfile into their naive radical-member feature set representations. ...... 67
4.3 Example stimulus pair for the similarity experiment. This pair contains a shared radical on the left. ...... 71
4.4 Groups of control pairs used, with an example for each. Parts of readings in brackets indicate okurigana, suffixes necessary before the given kanji forms a word. ...... 73
4.5 Participant responses grouped by language background, measured over the control set stimulus. On the left we give the mean response for each group; on the right, the mean pairwise rank correlation between raters of the same group. ...... 74
4.6 Rank correlation of pixel and radical models against raters in given participant groups. Mean and median raters provided as reference scores. ...... 75
4.7 Rank correlation of pixel and radical models against raters across bands of kanji knowledge. Each band contains raters whose number of known kanji falls within that band's range. ...... 76
4.8 Histograms of scaled responses across all experimental stimulus pairs, taken from the mean rating, pixel and bag of radicals models. Responses were scaled into the range [0, 1]. ...... 76
4.9 A summary of our kanji distance metrics. ...... 79
4.10 Mean value of Spearman's rank correlation ρ, calculated over each rater group for each metric. ...... 83
4.11 An example White Rabbit Press flashcard for the kanji 晩 baN "evening". Note the two large characters on the right, provided due to visual similarity with the main kanji. ...... 84
4.12 Example stimulus from the distractor pool experiment. For each kanji on the left-hand side, participants could mark one or more of the kanji on the right as potentially confusable visual neighbours. ...... 88

5.1 FOKS search results for the reading kazeja. The first result 風邪 kaze "common cold" is the desired word. ...... 94
5.2 The architecture of the original FOKS interface. ...... 95

5.3 The dictionary entry for 感謝する kaNshasuru "to give thanks, be thankful", with two of its potential alignments shown. ...... 101
5.4 The TF-IDF based alignment algorithm. ...... 102
5.5 Disambiguation using the reading model. ...... 107
5.6 Examples of incorrect alignment in the combined task. ...... 112
5.7 The distribution of place-name homography in Japanese. The vertical axis shows how many place names had at least n homographs, out of the 79,139 unique place names encountered. ...... 114
5.8 The old and new translation pages shown for the compound 下手. ...... 116
5.9 FOKS explaining the relationship between the query tokyo and the word 東京 Tōkyō. ...... 117
5.10 The potential accessibility improvement of kanji similarity search. ...... 121
5.11 Histogram of log query/target likelihood ratio for all corrected grapheme queries, taken on a per-kanji basis. ...... 125

6.1 The various types of Japanese proficiency testing as they vary in scope and availability. The dotted line represents the current state-of-the-art. ...... 131
6.2 An example results card from the JLPT 2007 tests. ...... 134
6.3 Easily testable aspects of word knowledge for words containing kanji, and words without kanji. A dotted line indicates a more tenuous link. ...... 135
6.4 Example question: the learner must select the correct reading for each kanji word in context. Taken from the JLPT level 3 example questions, Writing and Vocabulary section. ...... 136
6.5 Example question: the learner must select the correct kanji form for words given by their reading. Taken from the JLPT level 3 example questions, Writing and Vocabulary section. ...... 136
6.6 Creating a profile for Kanji Tester. The user enters a first language, and optionally any second languages they have studied. ...... 140
6.7 A sample test generated by Kanji Tester for JLPT 3. ...... 141
6.8 An example question measuring the surface-to-reading link for the word 用意 yōi "preparation". ...... 142
6.9 The feedback given to a user immediately upon completing a test. ...... 143
6.10 Examples of user and error models of varying granularity considered for use by Kanji Tester. On the left, user models can be maintained globally (treating all users identically), in aggregate by some common attribute (e.g. first language), or individually. Similarly, error prevalence can be modelled globally, by broad error type, by kanji or by word. The actual configuration of Kanji Tester as reported models users as individuals and errors at the kanji level. ...... 146
6.11 The high-level architecture of Kanji Tester. Dotted lines indicate areas designed for easy extension. ...... 149
6.12 The algorithm used by Kanji Tester to generate each test question. ...... 150

6.13 Number of users (reverse cumulative) set against number of tests. ...... 156
6.14 The proportion of times between tests taken sequentially by a user, expressed as the proportion of such times of magnitude > x hours (reverse cumulative histogram). ...... 157
6.15 Mean score per user, as a histogram. ...... 158
6.16 Ability, as measured by mean score for each user, plotted against usage, as measured by the total number of tests each user took. ...... 159
6.17 Ability, as measured by test score on the nth test, plotted against study time, as measured by the number of tests taken n, for all users. ...... 159
6.18 User scores plotted against time in days since their first test. ...... 160
6.19 The difference in scores between the first time a stimulus was seen and the last time it was seen, expressed as a histogram. ...... 161
6.20 State transition diagram for known and unknown items. In this context "known" means the user answered the most recent question about the item correctly. ...... 162
6.21 Accuracy and exposure for words and kanji by the top-5 power users. Each graph shows the total number of unique items tested after the nth test, and the number of items responded to correctly upon last encounter. ...... 165
6.22 Error rates compared for simple and adaptive plugins. ...... 168

List of Tables

3.1 Aspects of word knowledge, marked "R" for receptive knowledge and "P" for productive knowledge. From Nation (2001:27). ...... 27
3.2 Our proposed aspects of kanji knowledge, in complement to the aspects of word knowledge shown in Table 3.1. ...... 28
3.3 Types of lexical relationships available to the ideal bilingual speaker. ...... 46
3.4 Four types of question used by Laufer and Goldstein (2004) in CATSS, each indicating the task the learner must perform. For each condition, ⊕ indicates increased difficulty, ⊖ indicates decreased difficulty. ...... 61

4.1 Groups of characters which share an identical feature-set representation according to radkfile. ...... 68
4.2 Accuracy at detecting which of two pairs (flashcard vs. random) has high similarity. ...... 85
4.3 The mean average precision (MAP), and mean precision at N ∈ {1, 5, 10}, over the flashcard data. ...... 86
4.4 Pairwise significance figures for MAP scores over the flashcard data, shown as p values from a two-tailed Student's t-test. ...... 86
4.5 The mean average precision (MAP), and mean precision at N ∈ {1, 5, 10}, over the pooled distractor data. ...... 89
4.6 Pairwise significance figures for MAP scores over the pooled distractor data, shown as p values from a two-tailed Student's t-test. ...... 89

5.1 Alignment accuracy across models. ...... 109
5.2 Alignment execution time across models, in minutes and seconds. ...... 109
5.3 Okurigana detection accuracy across models. ...... 110
5.4 Best model accuracy for the combined task. ...... 111
5.5 FOKS usage as reported by Google Analytics in the period from October 2007 to March 2009, broken down by browser-reported language. ...... 122
5.6 Number of FOKS searches logged between October 2007 and March 2009, arranged by search type. ...... 123
5.7 Automatically determined error types for intelligent queries by reading. ...... 124


6.1 The four levels of the JLPT, with approximate numbers of kanji and words which learners should know when testing at each level, and estimated hours of study required. From Japan Foundation (2009). ...... 133
6.2 Sections of a JLPT exam and their relative weighting. From Japan Foundation (2009). ...... 133
6.3 Distribution of active users by first language. ...... 156
6.4 Basic user statistics for the 5 users with the most responses. ...... 163
6.5 Plugin name, type and number of questions generated. ...... 166
6.6 Mean number of exposures across all users by item type. The Kanji Combined type counts words containing kanji as exposures for those kanji. ...... 167
6.7 Error types for adaptive reading questions. ...... 170
6.8 Top 8 words or kanji by number of grapheme substitution errors. ...... 170

Chapter 1

Introduction

Words are the basic building blocks of language, and acquiring a large enough vocabulary to both understand others and be understood is the dominant task in second language learning. For learners of Japanese, learning words means learning the writing system used to encode them, in particular the kanji characters used to write them. Kanji are complex characters which embed meaning in a hierarchical, two-dimensional manner, in contrast to the linear, phonetic manner of most alphabetic scripts. This thesis centres around kanji, modelling how they may be confused by learners and in turn demonstrating how these confusion models can be used to improve dictionary lookup and automated test generation. Through this contribution, it aims to help learners in their self-study of Japanese vocabulary, so that they may better achieve their proficiency goals.

To make this more concrete, let us consider a common scenario of a hypothetical learner, Jane. Jane studies Japanese as a second language, but struggles constantly with the large number of words she has to learn. Since there are so many, most are only glossed over briefly in class and almost all her vocabulary study is done at home in her own time. Really, she would love to be reading in Japanese, but when she has tried even simple texts she finds she has to look up every second word in the dictionary.

One day she is trying to read a Japanese article, and comes across a new word, 養い. The second half い is phonetic, pronounced i, but she has no idea about the first half. Nothing about the character 養 indicates its pronunciation in this context, and without the pronunciation she cannot type the word into an online dictionary. Even if such a dictionary


supported wildcards, i is a very common word ending, and she would never find her word. Jane realises that she must do things the hard way, and look it up in her paper dictionary.

In order to look the word up, she sets about trying to find the first character in a traditional kanji dictionary. To do this, she has to identify the indexing radical, the component of the character traditionally used in dictionary indexes, usually the left-most or top-most component. Often components within characters are clearly delineated by white space, but in this character they are not. She guesses that the top half is probably the indexing radical, counts that it has 6 strokes, and tries to find it amongst the list of 6-stroke radicals in the dictionary. However, the nearest match she can find is 羊, which isn't quite what she was looking for because its main vertical stroke is too long at the bottom. She nonetheless continues just in case, going to the page number indicated for kanji containing 羊, counts that the rest of 養 would have 9 remaining strokes, and looks in the 9-stroke section for a match, without success. At this point, with some frustration, she tries to work out what went wrong, rechecking stroke counts and wondering how else she could find this character in a dictionary.

More frustrating still, Jane can recognise at least the bottom half of the kanji. It looks just like 食 from 食事 shokuji "meal". Then she recalls a recent online dictionary improvement which allows the user to search for words by visual similarity. She loads up its web site, and decides to put in the query 食い. The search results yield two words based on kanji which look similar to 食, and the second of these is the word she was looking for. She clicks "Translate", and finally reaches the information she needs. Her word is 養い yashinai, meaning "nutrition".

The broad problem Jane faces is really one of insufficient vocabulary, and it is common to all language learners.
It severely impedes early attempts at reading, since the time it takes to look up the many unknown words encountered makes comprehension difficult, and enjoyment unlikely. However, some of the additional difficulties faced by Jane are specific to learners of Japanese, especially those from an English or Western language background, and relate to the problem of unknown kanji characters. Once they have acquired a small base of oral vocabulary, learners often study a new kanji character at the same time as they study a new word which uses it, so in Japanese the study of kanji is very similar, and ultimately bound, to the study of vocabulary. This thesis does not propose any revolutionary new method for studying kanji and words, but instead aims to help learners better utilise their existing study methods by modelling the implicit relationships between different kanji due to their visual similarity. This visual similarity is then available as a resource which can aid not only in dictionary lookup, as in Jane's case above, but also in evaluating a learner's proficiency.

This thesis targets Japanese for two main reasons. Firstly, Japanese is considered amongst the most difficult languages to learn for learners coming from an English or European language background, comparable only to Arabic, Chinese and Korean.1 This difficulty makes it an attractive target for improvements to study methods, since even a small improvement may be significant over a long enough time span. Secondly, the aspects of Japanese kanji characters which present the largest obstacles for learners also present the largest opportunities for the use of rich user modelling.

In order to construct confusion models which approximate the human experience of kanji similarity and confusability, we draw inspiration from psycholinguistic research on kanji perception, and construct a number of plausible approximations to the human experience.
These models consider the similarity of kanji as generic images, the extent to which components are shared, the extent to which broad layout is shared, the similarity in hierarchical tree structure, and the difference between the stroke sequences of two kanji. In order to choose between these models, we develop three separate data sets on which they can be evaluated. These data sets span a range of authentic sources, from the layman's raw judgements of kanji similarity to expert opinions on kanji confusability. By evaluating our models over these data sets, we find that the two models that best match human similarity judgements are tree edit distance and stroke edit distance.

As an initial application for these similarity models, we show that the FOKS (Forgiving Online Kanji Search) dictionary interface provides a useful test-bed for new search techniques. After re-engineering the dictionary into a more modular state, we convert our similarity models into confusion models, and use these kanji confusion models as the basis for search by similar kanji. It is this similarity search which allows queries like that of our earlier example, where 食い was used to find 養い. This significantly increases the accessibility of the dictionary by allowing learners to search for unknown kanji using their more

frequent and usually simpler visual neighbours. At the same time, it is able to automatically compensate for errors where users unintentionally mistake an unknown kanji for a familiar one, allowing learners to still find the word they were looking for. Through log analysis, we find that learners' use of this new form of search exceeds that of earlier lookup by pronunciation. Analysis of query and result pairs suggests that, rather than confusing rare items for known items as we had assumed, learners instead tend to confuse pairs of kanji where both lie on the periphery of their knowledge.

As part of our overhaul of the dictionary, we also rebuild its grapheme-phoneme alignment algorithm, the basis of its ability to model kanji reading errors. By moving to a semi-supervised algorithm, we increase its scalability with increasing dictionary sizes. This allows FOKS to upgrade its core dictionaries as new versions of them become available. We also extend the dictionary to include greater coverage of place names, and improve its usability through better word translations and more transparent behaviour.

Most learners do not have the luxury of an immersive second-language environment in which to acquire words and test word knowledge through active use. This means that aside from dictionary-mediated attempts at reading, they are limited to repetitive drills such as flashcards as their actual vocabulary study method. Whilst flashcards are a useful tool, they are not designed to measure proficiency over large numbers of words, such as an entire course or syllabus. For these larger proficiency estimates, learners typically use class exams, which are limited in scope to the class in question.

1 From National Virtual Translation Center (2007) estimates of the time required to learn a language fluently for English native speakers.
For more rigorous testing and accreditation, the annual Japanese Language Proficiency Test (JLPT) is also available. However, the JLPT takes time, costs money, and can only be taken once a year. These obstacles are due to a fundamental limitation of manually constructed pen-and-paper tests: if the test is to measure what it was intended to, it cannot be taken by the same learner twice. This makes testing for proficiency expensive, since each new test requires a large investment of resources. Since it is expensive, tests with any strength are metered out sparingly to learners.

We show that, using models of kanji confusion and kanji (mis)pronunciation, new tests can be generated automatically and uniquely for each learner. Since every test is different, there is no limit to the number of tests a user can take. We argue that by removing the scarcity from proficiency testing, learners are able to more effectively self-evaluate their knowledge, which in turn allows them to make better decisions about their study techniques.
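To make the idea of automatic question generation concrete, the sketch below shows how a confusion model could drive multiple-choice distractor selection. This is an illustration only, not the actual Kanji Tester implementation: the function name, the candidate kanji and their confusion probabilities are all hypothetical.

```python
import random

def generate_distractors(answer, confusion_probs, n=3, rng=None):
    """Sample n distinct multiple-choice distractors for `answer`.

    confusion_probs maps each candidate kanji to a modelled probability
    that a learner confuses it with the answer; more confusable
    candidates are sampled more often, making the question harder.
    """
    rng = rng or random.Random()
    candidates = [k for k in confusion_probs if k != answer]
    weights = [confusion_probs[k] for k in candidates]
    distractors = set()
    # Sample with replacement, keeping only distinct distractors.
    while len(distractors) < min(n, len(candidates)):
        distractors.add(rng.choices(candidates, weights=weights)[0])
    return sorted(distractors)

# Hypothetical confusion distribution for the kanji 養.
confusion = {'食': 0.5, '義': 0.2, '差': 0.2, '水': 0.1}
options = generate_distractors('養', confusion, n=3)
```

Because the distractors are sampled rather than fixed, the same answer kanji can yield a different option set on every call, which is what lets every generated test be unique.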

Furthermore, by taking the learner's responses to earlier questions into account, these tests adapt to each learner individually, potentially creating more difficult tests which better highlight any flaws in a learner's knowledge. We do this through a new testing platform, Kanji Tester, which we develop to showcase the potential of automated test generation. Kanji Tester models itself on the JLPT, focusing on question types which directly test vocabulary or kanji knowledge. The JLPT is entirely multiple choice, so that it can be objectively marked; Kanji Tester uses its two main error models to generate plausibly incorrect multiple-choice options, so as to increase the test's difficulty. Furthermore, each user's knowledge of kanji is explicitly modelled, and adapts to the user's responses after each test, so that any errors or areas of confusion the learner displays are more likely to be shown as options in later questions.

Analysis of several months of Kanji Tester's usage logs yields a number of interesting findings. The amount of per-user adaptation was limited by data sparsity issues, suggesting that future systems should model errors across groups of users, for example across all users of the same first language background, or else ensure that learners revisit error cases more frequently. Despite this limitation, we find that adaptive test questions based on reading are significantly more difficult than those of our control baseline, due largely to the strength of the priors used in the adaptive models. Analysis of power user responses shows a weak positive correlation between ability and period of use, although most power users had high initial proficiency estimates, suggesting that scope for their improvement was limited.
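The per-user adaptation described above can be pictured as a simple count-based update: each user starts from population-level pseudo-counts, and a count is incremented whenever that user makes a particular confusion. The following is a minimal sketch of that idea only; the class name, prior values and kanji are hypothetical, and Kanji Tester's actual update algorithm is the one described in Chapter 6.

```python
class KanjiErrorModel:
    """Per-user error model: pseudo-counts of how often the user
    confuses each kanji with each alternative, seeded from a prior
    (e.g. a population-level confusion model) and updated after
    every test response."""

    def __init__(self, prior_counts):
        # prior_counts[kanji][alternative] -> initial pseudo-count
        self.counts = {k: dict(v) for k, v in prior_counts.items()}

    def record_error(self, kanji, chosen):
        """The user answered `chosen` when the correct item was `kanji`."""
        self.counts[kanji][chosen] = self.counts[kanji].get(chosen, 0) + 1

    def confusion_dist(self, kanji):
        """Normalised confusion distribution for a kanji."""
        total = sum(self.counts[kanji].values())
        return {alt: c / total for alt, c in self.counts[kanji].items()}

# Hypothetical prior for 晩: mostly confused with 免 and 勉.
model = KanjiErrorModel({'晩': {'免': 2, '勉': 1}})
model.record_error('晩', '免')            # user picked 免 in a test
dist = model.confusion_dist('晩')         # → {'免': 0.75, '勉': 0.25}
```

After the observed error, 免 dominates the distribution, so it becomes more likely to reappear as a distractor in later questions, which is exactly the adaptive behaviour the text describes.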
Through this combination of enhanced dictionary support and improved, repeatable and accurate learner testing, this thesis supports learners of Japanese in their self-study of vocabulary, and ultimately in reaching their desired level of proficiency.

1.1 Aim and contributions

The primary aim of this research is to establish improved software-supported methods for self-study of vocabulary. In particular, we focus on Japanese as a language whose extreme orthographic depth means that learning to read well is difficult, and whose logographic characters offer a rich and under-utilised resource to draw from in aiding learners.

This thesis makes three main contributions. Firstly, it defines five separate models of graphemic similarity of Japanese kanji, establishes three separate data sets of human judgements, and evaluates the models over these judgements. It finds that stroke and tree edit distance metrics outperform the other similarity models over these data sets. Secondly, it transforms these similarity models into models of plausible misrecognition of kanji, which in turn are used to provide a novel form of dictionary search by similarity. Through post-hoc log analysis, it demonstrates that this form of search is both viable and highly useful to learners. Thirdly, it combines kanji confusion models with existing models of plausible misreading in order to generate vocabulary tests for quick self-evaluation of proficiency. It finds that this modelling allows the creation of questions which are significantly more difficult than those of a control baseline.

A general theme of this thesis is the use of careful error modelling to support learners in their study. These error models are used firstly to compensate for user errors in a dictionary search application, then to explicitly provoke user errors in a testing application. In both cases, we argue that learners are better served by the heightened awareness of their behaviour that these applications display.
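To give a flavour of the first contribution, stroke edit distance is, at its core, Levenshtein distance computed over the stroke sequences of two kanji. The sketch below assumes each kanji has already been converted to a list of stroke-type labels; the labels shown are hypothetical stand-ins for real stroke data, not the representation used later in the thesis.

```python
def stroke_edit_distance(a, b):
    """Levenshtein distance over two stroke sequences.

    Each argument is a list of stroke-type labels; the distance counts
    the insertions, deletions and substitutions needed to turn one
    sequence into the other, so visually similar kanji with nearly
    identical stroke sequences score a small distance."""
    m, n = len(a), len(b)
    # dist[i][j]: distance between the prefixes a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

# Hypothetical stroke sequences for two visually similar kanji.
strokes_a = ['horizontal', 'vertical', 'left-sweep', 'dot']
strokes_b = ['horizontal', 'vertical', 'left-sweep', 'horizontal']
print(stroke_edit_distance(strokes_a, strokes_b))  # → 1
```

In practice the raw distance would be normalised by sequence length before being compared against human similarity judgements, but the dynamic-programming core is as above.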

1.2 Audience

This thesis is aimed at a general academic audience with an interest in second language learning, in particular for East-Asian languages. Prior knowledge of the Japanese and Chinese writing systems is helpful but not required. However, several chapters perform formal modelling or evaluation which requires some basic knowledge of probability theory, metric spaces and concepts from information retrieval. We nonetheless expect a non-technical reader to be able to gloss over these parts and still gain a good understanding of this work.

1.3 Thesis structure

The remainder of this thesis is broken into six chapters as follows:

Chapter 2: e Japanese Writing System We provide an overview of the Japanese Writing System, discussing its origins, the Chapter 1: Introduction 7

three main scripts used, and in particular the kanji script. Issues relating to computer input and dictionary lookup of Japanese are covered, before concluding with a brief overview of four separate word formation effects which complicate word pronunciation.

Chapter 3: Learning Japanese vocabulary is chapter firstly makes the case that vocabulary study is the largest barrier for lan- guage learners, and argues that the majority of vocabulary is acquired through au- tonomous self-study. It examines in detail what it means to know a word, and thus demonstrates that the word is an appropriate lens through which to view second lan- guage acquisition.

We investigate and discuss strategies for deciding what words to study, and then how to study them. We then consider how words are accessed and stored in the mental lexicon, and relate these patterns to the general lexical relationships which form between words. We introduce the concept of graphemic neighbourhoods as the relation of near-homography, and situate it amongst other better known relationships such as near-synonymy and homophony. Finally, we explore the state-of-the-art in dictionary lookup for Japanese, in particular focusing on methods for input and lookup of kanji characters. We discuss the role of testing in second language learning, and consider the gradual transition from paper-based testing to computer-adaptive testing, before examining the current forms of tests actually used in the field.

Chapter 4: Orthographic similarity We investigate the notion of graphemic similarity and propose several formal models: a cosine similarity over radical vectors, designed to capture the salience of radicals as significant subcomponents; the L1 norm over kanji images, attempting to measure visual similarity in a raw and unbiased way; the tree edit distance between kanji structure representations, attempting to compare layout information; and an edit distance over stroke vectors, which tries both to capture the salience of stroke order and to perform a fuzzy matching of larger components.

Three separate data sets are used to evaluate these metrics: a simple experiment eliciting similarity judgements from learners, a series of expert judgements mined from a commercial flashcard set, and an experiment eliciting native-speaker judgements of confusability. Evaluating against these data sets, we find that the tree edit distance and stroke edit distance models achieve the best performance, and do comparably well to one another. Since stroke edit distance is two orders of magnitude faster to calculate than tree edit distance, we establish it as the preferred metric given available data.
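To make the preferred metric concrete, a stroke edit distance is, at heart, a standard Levenshtein edit distance computed over each kanji's sequence of stroke types. The sketch below illustrates the idea only: the stroke-type labels and the normalisation into a similarity score are our own illustrative assumptions, not the thesis's actual stroke data or cost scheme, which are developed in Chapter 4.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences, here
    applied to per-kanji stroke-type sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete
                          curr[j - 1] + 1,     # insert
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[n]

def stroke_similarity(a, b):
    """Normalise edit distance into a similarity score in [0, 1]."""
    return 1 - edit_distance(a, b) / max(len(a), len(b))

# illustrative (hypothetical) stroke sequences for two similar kanji
strokes_a = ['horizontal', 'vertical', 'horizontal']
strokes_b = ['horizontal', 'horizontal', 'vertical', 'horizontal']
print(stroke_similarity(strokes_a, strokes_b))  # → 0.75
```

One insertion separates the two sequences, so their distance is 1 and the normalised similarity is 0.75.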

Chapter 5: Extending the dictionary We examine the current contribution of the FOKS dictionary interface to the Japanese lookup space, and assess its suitability as a test bed for new forms of dictionary search, ultimately rebuilding it to serve this function. Its unsupervised grapheme-phoneme alignment algorithm is examined and replaced with a more scalable semi-supervised method using a dictionary of kanji readings. The resultant improvement in scalability allows our reimplementation of FOKS to utilise the newer and larger dictionary versions becoming available over time.

The usability of the system is then enhanced in several ways: its coverage of proper nouns is extended through the construction of a simple place name gazetteer, dictionary translations are improved to display richer word information, and error correction is made transparent to users through a query explanation tool.

Finally, and most significantly, we use the stroke edit distance developed earlier to power a confusability model for Japanese kanji, based on the assumption that learners will confuse rare kanji for their more frequent neighbours. We use this model to implement search-by-similarity in FOKS. A theoretical argument is made concerning the potential improvement in accessibility that this new form of search can provide for learners. Post-hoc analysis then finds that over the log period analysed, search-by-similarity was used more frequently by learners than FOKS’s existing intelligent reading search, demonstrating the utility of this method.

Chapter 6: Testing and drilling This chapter investigates the use of error models to generate randomised vocabulary tests through a new system called Kanji Tester, which serves both as an improved

means for learners to rapidly self-evaluate and as an alternative and more constrained platform for evaluating our error models. It proposes that proficiency tests can be sorted by their scope and their availability, and situates the ideal proficiency test, the JLPT tests, and our current work along these scales. It then discusses in detail the form of the JLPT test, giving examples of the question types that Kanji Tester attempts to emulate.

We then discuss Kanji Tester itself in detail, both in its interface from a user’s perspective, and in its extensible architecture and underlying modelling from our perspective. Amongst the various forms of potential user models Kanji Tester could use, we discuss and justify our use of per-user per-kanji error models. We then describe our algorithm for generating questions, and for updating each user’s error models.

In our evaluation, we discuss differences between human constructed tests and the automated tests generated by Kanji Tester. A weak positive correlation is found between test scores and the time over which Kanji Tester is used. Through an analysis of rater responses we find that, despite issues with data sparsity, our adaptive test questions cause significantly more errors than those of a control baseline.

Chapter 7: Conclusion Finally, we summarise the thesis and identify three main areas for future work. Firstly, we suggest potential improvements to the graphemic similarity models this thesis has proposed, and useful experiments which could be performed to better understand the topologies they generate. More generally, we identify in this discussion further areas of interest in the modelling of lexical relationships. Secondly, we suggest several areas in which dictionary accessibility could be improved, including better example sentence selection, crowdsourcing of dictionary enhancements, and the potential use of semagrams in future open dictionaries. Thirdly, we discuss the remaining gap between our automatically generated tests and human-constructed tests, and ways of narrowing that gap. These include improved user modelling for the same types of questions, and ways to generate new and different forms of questions which could help to test proficiency.

Chapter 2

The Japanese writing system

This chapter gives a brief and focused introduction to the Japanese writing system, in order to provide the background necessary to discuss issues specific to acquisition of Japanese vocabulary. Readers already familiar with Japanese may skip ahead to Chapter 3. Likewise, readers interested in a more comprehensive coverage of the Japanese writing system are referred to works by Backhouse (1996) and Tsujimura (1999).

2.1 Overview

The Japanese writing system uses three principal scripts: the morpho-syllabic kanji script and the syllabic hiragana and katakana scripts. The latter two scripts share the same sound inventory, and so are potentially interchangeable, but in practice each finds complementary and largely non-overlapping use. They encode not full syllables, but a type of short syllable called a mora. These scripts are best understood in terms of the division of labour between them, as suggested by Backhouse (1996:43). Hiragana is principally used for grammatical elements such as inflectional suffixes for verbs and adjectives, case markers and grammatical nouns, amongst others. Whilst katakana encodes the same sounds, its use is likened to that of italics in English to mark a word for special attention. It finds principal use in writing loanwords or names from English or European languages. These two syllabic scripts are usually learned in full in a short period of time, and are not a significant challenge for learners. We now discuss the more complicated kanji script.

10 Chapter 2: e Japanese writing system 11

Kanji are used at the morpheme level in Japanese, and are used to write most lexicalised words, such as 家 uchi “house, family”. Verbs and conjugating adjectives typically use kanji to represent the fixed stem of the word, and hiragana for the changing inflectional suffix (Backhouse 1996:51), as in 行く iku “to go”. Kanji in context have readings of one or more moras, but this reading need not cover the whole stem, as shown by examples such as 厳しい kibishī “strict”. Furthermore, the stem itself may not end on a mora boundary: our earlier example 行く iku has stem ik-, with inflected forms such as 行き iki and 行け ike. As a generalisation, kanji which form standalone words have more concrete meanings, whereas kanji which form bound morphemes only take on concrete meanings with the addition of either other kanji to form a compound or hiragana inflection. The senses of an individual kanji can be considered its semantic contribution to the many compounds and words in which it takes part, where such a contribution is transparent. These senses in turn are often extensions of a single core meaning. For example, Halpern (1999) lists glosses for several senses of the kanji 暴 bō:

1. violent, rough, wild, cruel, tyrannical as in 暴力 bōryoku “violence, force”

2. sudden, abrupt as in 暴落 bōraku “slump, crash (in prices)”

3. unrestrained, inordinate, wild, excessive, irrational as in 暴飲 bōiN “heavy drinking”

The core meaning given is “violent”. Kanji characters were historically borrowed from Chinese – whose characters we call hanzi – and most Japanese characters are also valid characters in Traditional Chinese. However, both Chinese and Japanese have undergone some simplification, and these simplified forms are not necessarily shared. The character etymology for 歳 sai “year, age” in Figure 2.2 is a common example of this process. We call two characters which share the same historical roots cognate. Despite more extensive character simplification in Chinese, this cognate relationship between character pairs means that they typically remain similar in form, and indeed are often identical. However, the same can not be said for their pronunciation.

[Figure 2.1: Exposing the compositional nature of kanji. 漁 asa(ru) “fishing” is made by combining 氵 (“water”, via the short form of 水 mizu) and 魚 sakana “fish”. Further decomposition of 魚 into 勹 “wrap”, 田 “field” and 火 “fire” is possible, however its subcomponents do not have a semantic relationship with the whole; this is because 魚 is a pictograph (i.e. it is derived from a picture of a fish rather than from these components).]

Each Japanese character, kanji or otherwise, occupies an identically sized square space when printed or hand-written. For example, a simple kanji such as 工 is given the same space as a more complex kanji like 蠹. This can be contrasted with other writing systems, for example the Latin alphabet, where each letter may occupy a different amount of space.

Kanji are also compositional, so that many basic kanji also occur as components of more complex kanji. An example of this composition is given in Figure 2.1, where we decompose 漁 asa(ru) “fishing” into its constituent parts 水 mizu “water” and 魚 sakana “fish”. Note that to preserve space, 水 is given in its short form 氵; a number of kanji have short forms used as radicals for composition in this manner. In this case, both “water” and “fish” are semantically related to “fishing”, so 漁 is a good example of reasonably transparent semantic composition. Note that 魚 itself is composed of the radicals 勹, 田, and 火 (via short form 灬), with meanings “wrap”, “field” and “fire” respectively, but these meanings seem to bear no relation to the overall meaning “fish”. This semantically opaque example occurs because 魚 is a pictograph, historically derived from a picture of a fish rather than from these component radicals.

Each kanji may have multiple pronunciations which it takes in different contexts, and to explain this it helps to briefly discuss how these readings occurred. Kanji were originally borrowed from Chinese during several different periods in history when different Chinese dialects were prominent, so each individual kanji may have different pronunciations borrowed from any or all of these periods. These borrowed readings are known as on readings (from 音 oN “sound”), and are the source of the limited similarity in pronunciation between cognate characters and words in Japanese and modern day Chinese.
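The compositional structure illustrated in Figure 2.1 can be captured as a simple decomposition tree; representations of this kind underlie the tree edit distance metric explored later. The sketch below is hand-written for this one example: real systems would draw their component data from a decomposition resource such as the RADKFILE/KRADFILE database, and the tree type here is our own illustrative construction.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A node in a kanji decomposition tree."""
    glyph: str
    children: list = field(default_factory=list)

def leaves(node):
    """Return the primitive (leaf) components, left to right."""
    if not node.children:
        return [node.glyph]
    return [g for child in node.children for g in leaves(child)]

# 漁 = 氵 + 魚; 魚 = 勹 + 田 + 灬 (the short form of 火)
fish = Component('魚', [Component('勹'), Component('田'), Component('灬')])
fishing = Component('漁', [Component('氵'), fish])
print(leaves(fishing))  # → ['氵', '勹', '田', '灬']
```

Flattening the tree to its leaves recovers the primitive radicals, while the tree itself preserves the layout information that simple radical-set comparisons discard.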
An on reading may be indicated visually by a phonetic component within a kanji, such as 同 dō, as in 銅 dō “copper”, but may also occur without visual cues. In contrast, kun readings (from 訓 kuN “explanation”) are native Japanese pronunciations which are never visually indicated, and must simply be learned.1 As a broad generalisation, on readings are used for kanji when combined with other kanji into compounds, whereas kun readings are used in combination with hiragana inflectional suffixes. Native speakers learn both for common characters, and can distinguish between

1Note that for transliteration of Japanese linguistic examples we use the Hepburn system with Backhouse-style “N” mora. This allows us to distinguish between the transliterations of 勲位 kuNi “order of merit” (segmented as ku-N-i) and 国 kuni “country” (segmented as ku-ni). However, when borrowing linguistic terminology like kun from Japanese, we use the standard Hepburn system for simplicity.

reading types. e frequency distributions of each reading type also differ, naturally reflect- ing the languages and dialects they originated from. In extreme examples, koN only occurs in Japanese as an on reading, but 154 kanji may take this pronunciation, including 金, 今 and 近; aki is only used as a kun reading, but 108 kanji may take it, including 日, 明 and 成. If we compare with Chinese, we find that Chinese only uses the hanzi script, and its ma- jor dialects have different sound inventories to Japanese. Furthermore, each hanzi typically has only one valid pronunciation in each dialect, which may in turn be cued by phonetic radicals where they occur. is substantially simpler form-to-sound mapping means that work based on Japanese pronunciation is not likely to be directly useful for Chinese. How- ever, where this thesis concerns itself with the visual form of single kanji and of multi-kanji compounds, we expect such work to generalise gracefully. Our examples of koN and aki also indicate the extreme level of homophony which is found in Japanese at the character level, though this is reduced at the word level. In spoken language, context and reading frequency provide sufficient tools to disambiguate a kanji or word. In written language, the use of kanji makes this effect negligible: a kanji’s pronuncia- tion is reflected in its visual form in a limited manner, so homophony and homography need not overlap. A clever tongue-twister takes advantage of this. にわにはにわにわとりが いる is pronounced niwa niwa niwa niwatori ga iru, and is extremely ambiguous if either spo- ken aloud or written in hiragana due to its use of several homophones. However, when kanji are used it is written as 庭には二羽鶏がいる and contains no semantic ambiguity, mean- ing “there are two chickens in the garden”. 
The comparable Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo sentence in English, comprised exclusively of three different senses of the word buffalo, retains its ambiguity in written form (Pinker 1994:210). This concisely illustrates the utility of kanji in disambiguating homophones during reading.

2.2 Input and lookup of Japanese

Computer input in Chinese and Japanese is markedly different from that of languages such as English with alphabetic scripts, since there are far too many different characters to admit any simple mapping from the available keys on a keyboard to characters a user may wish to write. For this reason, a sequence of keystrokes is required to build an input

[Figure 2.2: The etymology for the character 歳 sai “year, age” from oracle characters, through bronze characters and seal script, to modern day usage in Traditional Chinese, Simplified Chinese and Japanese, and the cognate relationship between the two usages. Cognates are from Goh et al. (2005), and early character forms are from Sears (2008).]

space large enough to map to any character. This led to the use of input method editors (IMEs), programs which translate a sequence of keystrokes into a shorter sequence of more complicated characters, usually interactively. Languages with alphabetic scripts also use an IME when the keyboard size is restricted, for example the “predictive text” feature on most modern mobile phones.

[Figure 2.3: An example of IME use to type in Japanese: (1) type in the pronunciation and hit return; (2) select the correct form from the options. The pronunciation is first entered, then matching word forms are displayed to choose from. The example reads カレーを食べよう karē o tabeyō “let’s eat curry”.]

In Chinese, where only one script is used,2 there is a considerable variety of methods used by IMEs to constrain the space of characters down to the character the user was looking for. Ultimately, this is achieved based on cues about the character’s form, as in the Wubizixing (五笔字型 [wǔbǐ zìxíng]) or Cangjie (仓颉 [cāngjié]) methods, or based on the character’s pronunciation, as in the Pinyin (拼音 [pīnyīn]) or Zhuyin (注音 [zhùyīn]) methods. For both the Chinese and Japanese languages, the large amount of homophony means that input by pronunciation is best disambiguated in word-level chunks, since the additional context helps to constrain possible matches.

2Actually, modern Chinese characters occur in both simplified and traditional forms, as shown in Figure 2.2. Since each dialect predominantly uses one script or the other, we can call Chinese languages single-script languages.

Both forms of input methods suffer the same basic problem: when based on form, the user is unable to type a character whose form they have forgotten or never learned; when based on pronunciation, the user is unable to type a character whose pronunciation they have forgotten or never learned. Since Japanese contains the two syllabic kana scripts, its input methods are based on pronunciation, and Japanese input thus faces the latter problem. However, the problem is worse for Japanese, since the user cannot easily change input method when faced with a problematic character, as graphemic input methods are not readily available for Japanese.3

Compounded with this, kanji have much weaker cues to pronunciation than hanzi. Recall that some kanji have reliably phonetic radicals which provide a known on reading, as in our earlier example 同 dō, as in 銅 dō “copper” and 胴 dō “body”. However, even in these cases, only one of many potential readings is known. For example, 同 can also be read ona (as in 同じ onaji “same”) and 銅 can also be read as akagane. In most cases there is no reliable phonetic to suggest the pronunciation of an unknown kanji, and thus no easy means of computer input.

An appropriate response to this fundamental problem for Japanese IMEs is a change in input modality. This is attempted by the many handwriting recognition interfaces for Japanese: dictionaries for the Nintendo DS,4 Todd Ruddick’s web-based recognition applet,5 Tomoe6 and many commercial electronic dictionaries. However, of the computer-based systems, many still suffer from several issues: the awkwardness of mouse input for drawing characters; sensitivity to both stroke order and connectivity of components; and the difference in hand-writing styles between learners and native speakers. On the other hand, such technology has improved markedly in the past few years.
For example, Ben Bullock’s new hand-written interface7 is significantly more robust than earlier interfaces in terms of these issues. Overall, whilst such systems are not always optimal, they provide a very useful

3Phonemic input methods are reasonably natural, since input is a simple representation of speech. However, graphemic input methods require some more complicated mapping scheme, which takes time and training to use. Furthermore, the two additional syllabic scripts in Japanese are far better suited to phonemic input methods.
4For example, 漢字そのまま楽引辞典 kaNji sono mama rakubiki jiteN.
5http://www.csse.monash.edu.au/~jwb/hwr/
6http://tomoe.sourceforge.jp/
7http://kanji.sljfaq.org/draw.html

alternative in situations where other input methods are too slow.

The Kanjiru dictionary takes a different approach, adapting Wills and MacKay’s (2006) Dasher accessibility interface, attempting to interactively assemble a character by shape and stroke in an adaptive manner (Winstead 2006; Winstead and Neely 2007). The user guides the search through mouse movements, building up components stroke by stroke until the desired character is found. Though cumbersome for typical input, it provides an alternative version of input-by-form for difficult characters.

Despite this arsenal of lookup and input techniques, inputting or looking up an unknown kanji or word remains a barrier for learners, as indicated by the continuing emergence of new lookup methods. We revisit the topic of dictionary lookup in more detail in Section 3.3.

2.3 Word formation in Japanese

Japanese is an agglutinative language, which can present additional difficulties in appropriately determining word-hood. New words in Japanese are principally created through one of several mechanisms: borrowing from foreign languages, compounding of existing words, modification of existing words, or clipping of existing words to create more colloquial forms (Backhouse 1996:81). Since most loanwords are written in katakana, which is phonetically transparent to learners, we largely ignore word formation from borrowing in this thesis. In later chapters we will show that many of the errors which learners make in dictionary lookup and testing relate to a lack of transparency in the compounding process by which many words are created. We thus also ignore word formation by modification or clipping in order to focus better on aspects of the compounding process.

In later parts of this thesis it will become important to be able to reverse the compounding process, decomposing a compound back into its smallest constituent morphemes. In Japanese, this task is essentially one of grapheme-phoneme alignment, which we consider in earnest in Section 5.2. Here, we discuss briefly four pertinent features of word formation by compounding and its reversal: okurigana, sequential voicing, sound euphony and grapheme gapping.

Okurigana

Single kanji do not always correspond to whole morphemes in written language, often requiring a hiragana suffix to form a unit. This is especially true for kanji forming verbs or i-adjectives, where the hiragana suffix forms an inflectional ending (Backhouse 1996:43). For example, 行 is not a word by itself, but with suffix く ku forms the verb 行く iku “to go”. In general, these suffixes are called okurigana, and useful alignments should include them along with their kanji stem in order to preserve the basic morpho-phonemic structure of the compound.

Although most cases of okurigana represent verb and adjective conjugation, there are many general cases such as that of the kanji 取, which occurs in compounds almost exclusively with suffix り ri as 取り tori, but is sometimes written with the suffix ri conflated with the kanji stem. For some words, the written form can occur either way. For example, 受け取り uketori “receipt” can equally be written 受取り, 受け取 or 受取, depending on whether the two hiragana け ke and り ri are conflated into their respective kanji stems. Ideally, alignment systems should capture such alternations in order to achieve consistent segmentation behaviour. This is also useful for attaining an accurate estimate of the frequency with which a given kanji occurs with a particular reading, independent of the exact lexical form of the word.
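One way an alignment system might capture these alternations is to enumerate every surface variant of a segmented word, optionally conflating each okurigana suffix into its kanji stem. The sketch below is our own illustration, not part of any existing system; the segment representation is an assumption.

```python
from itertools import product

def okurigana_variants(segments):
    """Enumerate surface variants of a word whose okurigana may
    optionally be conflated into the preceding kanji stem.

    `segments` is a list of (kanji, okurigana) pairs; an empty
    okurigana string means the kanji takes no suffix.
    """
    choices = []
    for kanji, okuri in segments:
        if okuri:
            choices.append([kanji + okuri, kanji])  # explicit vs conflated
        else:
            choices.append([kanji])
    return {''.join(parts) for parts in product(*choices)}

# 受け取り uketori "receipt" has two okurigana, hence four variants
print(sorted(okurigana_variants([('受', 'け'), ('取', 'り')])))
# → ['受け取', '受け取り', '受取', '受取り']
```

Matching any of the generated variants back to the canonical segmentation then yields consistent alignment behaviour across all four written forms.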

Sequential voicing

Sequential voicing occurs when the second component in a compound has its initial consonant changed from an unvoiced sound to a voiced sound (Backhouse 1996:82). For example: 本 hoN “book” + 棚 tana “shelf ” → 本棚 hoNdana “bookshelf ”. Its occurrence is limited by a number of constraints. For example, it is mainly restricted to native words, and occurrence is further constrained by Lyman’s law, which states that sequential voicing will not occur where there are existing voiced obstruents in the tailing segment (Vance 1987). It occurs in about 75% of cases where Lyman’s law is not violated, with some systematic irregularities for noun-noun compounds (Rosen 2003).

Alignment methods based on precedence or frequency counts may be hindered by sequential voicing, since aligned grapheme/phoneme pairs may not be recognised as phonological variants of previously seen kanji–reading pairs. Fortunately, devoicing is a relatively simple 1-to-1 process, so a common approach is to simply consider voiced and devoiced grapheme/phoneme pairs to be equivalent for counting or comparison.
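A minimal sketch of this equivalence approach follows, using Hepburn romaji rather than kana for readability. The devoicing table is a simplified assumption of our own: the full kana mapping covers more pairs, and ambiguities such as Hepburn ji (devoicing to either shi or chi) are ignored here.

```python
# Simplified initial-consonant devoicing table (Hepburn romaji).
DEVOICE = {'g': 'k', 'z': 's', 'j': 'sh', 'd': 't', 'b': 'h', 'p': 'h'}

def devoice_initial(reading):
    """Undo sequential voicing on a reading's initial consonant,
    e.g. dana -> tana."""
    for voiced, plain in DEVOICE.items():
        if reading.startswith(voiced):
            return plain + reading[len(voiced):]
    return reading

def readings_equivalent(a, b):
    """True if two readings differ at most by sequential voicing."""
    return devoice_initial(a) == devoice_initial(b)

# 棚 tana "shelf" surfaces as dana in 本棚 hoNdana "bookshelf"
print(readings_equivalent('dana', 'tana'))  # → True
```

Counting kanji–reading pairs under this equivalence lets an aligner recognise 棚 dana as a phonological variant of the previously seen 棚 tana.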

Sound euphony

Sound euphony occurs when the first component’s last syllable alters to match the sound of the tailing segment. This creates a syllable-final geminate consonant which is pronounced for audibly longer than a normal short consonant (Backhouse 1996:26). In Japanese, gemination is indicated by the small っ character, and is typically romanised into double consonants, as in まって matte “wait” and いっぱい ippai “full”. An example of sound euphonic change is in the combination of 国 koku “country” and 境 kyō “boundary” to create 国境 kokkyō “national border”.

Unlike sequential voicing, which imposes an easily reversible transformation, it is not clear from 国境 kokkyō “national border” what the original pronunciation for 国 was (possibilities include koki, koku, kosu and kotsu). This can introduce some difficulty in reversing the compounding process and determining correctly the pronunciation of the original parts.
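Since the transformation is not uniquely reversible, one heuristic is to enumerate all plausible original readings and filter them against a dictionary of known kanji readings. The sketch below illustrates only the enumeration step; the list of geminating moras and the function itself are our own illustrative assumptions rather than an established algorithm.

```python
# Vowel-final moras that commonly reduce to a geminate in compounding.
GEMINATING_MORAS = ['ki', 'ku', 'su', 'tsu', 'chi']

def degeminate_candidates(surface):
    """Enumerate plausible original readings for a component whose
    final mora geminated, e.g. kok- as in kokkyō. A real aligner
    would filter the output against known kanji readings."""
    stem = surface[:-1]  # strip the geminate consonant: 'kok' -> 'ko'
    return [stem + mora for mora in GEMINATING_MORAS]

# 国境 kokkyō "national border": the first component 国 surfaces as kok-
print(degeminate_candidates('kok'))
# → ['koki', 'koku', 'kosu', 'kotsu', 'kochi']
```

Filtering these candidates against the known readings of 国 would leave koku as the correct original pronunciation.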

Grapheme gapping

We saw earlier that in cases of okurigana, part of the reading for some words can be optionally conflated into the kanji. Grapheme gapping refers to a much rarer occurrence where a particle which is a standalone morpheme is subsumed into a kanji compound. Typically in such cases it is also acceptable to write the particle explicitly. For example: 山 yama “mountain” + の no “GENITIVE” + 手 te “hand” can be written as either 山手 or 山の手, both with identical pronunciation yamanote. In the first case the particle no is subsumed as part of the compound; in the second it is explicitly written. Grapheme gapping is very rare, normally occurs only with the particles ga or no, and tends not to be productive, suggesting that even high-precision alignment systems need only store individual known cases.

— Chapter 2: e Japanese writing system 21

With this focused overview of the Japanese writing system complete, and in particular having discussed the basic problems for learners associated with the kanji script, we can turn to the main theme of this thesis, vocabulary acquisition.

Chapter 3

Learning Japanese vocabulary

In this thesis, we focus on improving study tools for learners of Japanese so that they may better acquire vocabulary and, ultimately, improve their Japanese proficiency. This chapter provides the motivation for our approach, and then extended background for each of the focuses of later chapters.

In Section 3.1 we firstly situate this thesis within the debate on how to best facilitate second language acquisition. We argue that supporting learners in their self-study of vocabulary is likely to yield the greatest benefit. We then examine current self-study strategies and tools, focusing on Japanese as a target language, and argue that better modelling of user errors could greatly improve their effectiveness, especially for learners at the crucial early reading phase. In order to guide this error modelling, Section 3.2 examines current research as to the structure of the mental lexicon, to try to determine how words are accessed during reading. We show that the access structure for Japanese kanji is suggestive of how we might model misrecognition errors and build error models based on visual similarity.

Since dictionary lookup and language testing are arguably the two most fundamental mechanisms for second language learning, we go on to examine each in turn. Section 3.3 firstly discusses the current state of dictionary lookup for Japanese from the perspective of both readers and writers, and then Section 3.4 considers recent advances in language testing. These sections provide the necessary background for Chapters 5 and 6 respectively.

In discussing vocabulary, this chapter uses the basic unit of vocabulary, the word, as a

lens through which to consider language acquisition. However, different fields of research use different definitions of “word”. In this thesis, we use “word” to mean lemma, a headword in canonical form from which inflected forms can transparently be derived. For example, we consider change, changes, changed and changing to be forms of the same word change, rather than separate words in their own right. Where we refer to research which uses an alternative definition, such as the narrower meaning of token or the broader meaning of word family, we use the more specific terminology instead.

Whilst we recognise the strong role of memory in word knowledge, a proper treatment of this broad issue is beyond our scope. This chapter examines in detail how words are accessed and many aspects of how they are best stored, but largely ignores longer-term issues of ensuring that words learned stay learned. Readers looking for a more comprehensive discussion of the role of memory in language are referred instead to works by Baddeley (1997, 2003).

A final caveat to our discussion is that, despite referring to relevant Chinese- or Japanese-language work that we are aware of, our coverage of academic publications in these languages is regrettably more limited than that of English-language publications. This has little effect on our main focus of second language learning, since such work is multilingual by nature; however, it does restrict our view of native speakers of these languages. Detailed aspects of native speaker development particular to Chinese and Japanese are thus beyond the scope of this thesis.

3.1 Autonomous vocabulary acquisition

The beginner’s paradox

Vocabulary is significant, above all other areas of language proficiency, because it is open-ended. New product names, person names, and word forms are created every day, and a native speaker will continue to acquire words throughout their entire life. This open-ended nature makes learning sufficient vocabulary a heavy burden on second language learners. Frequency and redundancy allow communication with a limited lexicon, but for fluent communication learners must still develop a lexicon of sufficient size.

So how do learners acquire such a lexicon? For first-language (L1) learners, our answer has two parts. By the age of five, L1 English children know some 3000-5000 “word families”, covering most of spoken language, without direct instruction (Nation 2001:96). The source of this initial oral vocabulary is debated, but they are presumed to have learned it through exposure to adult speech. Yet between the ages of 8 and 18, they similarly continue to learn between 7 and 15 words per day, depending on how we count (Landauer and Dumais 1997; Vermeer 2001). Furthermore, since about three quarters of an English-speaking L1 adult’s vocabulary occurs almost exclusively in written text, it must have been acquired through reading, as suggested by Miller (1941). The vocabulary acquired this way seems to be the “long tail” of language, made up of large numbers of medium-to-low frequency words. Current evidence thus suggests that in an L1, vocabulary is acquired autonomously, and through reading widely. In a second language (L2), Krashen (1989) similarly argued that reading widely is the source of most vocabulary acquired. However, in an L2 this is premised on having already acquired the early core vocabulary. This is reflected in high-level reading skills which only begin transferring from a learner’s L1 to their L2 at around the 3000 word family mark (Laufer 1997:24), and also in Hsueh-Chao and Nation’s (2000) study of unknown word density, which suggested 98% of word tokens in a text should be known for reading to be comfortable. Although graded readers1 can be used to achieve this at lower vocabulary levels, learners must still acquire a significant amount of vocabulary before they can expect success in a reading program. Indeed, many factors may play a part in the effectiveness of reading widely for L2 vocabulary learning, and several studies such as Coady (1997:226) and Fraser (1999) describe mixed results.
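Hsueh-Chao and Nation's 98% threshold lends itself to a simple mechanical check: given a list of words a learner knows, compute the proportion of a text's tokens that are known. The sketch below is our own illustration, not part of the original studies; the whitespace tokeniser and the toy word set are simplifying assumptions:

```python
def coverage(text, known_words):
    """Fraction of word tokens in `text` that the learner knows."""
    tokens = text.lower().split()  # naive whitespace tokeniser
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in known_words) / len(tokens)

def comfortable(text, known_words, threshold=0.98):
    """Apply the 98%-coverage rule of thumb for comfortable reading."""
    return coverage(text, known_words) >= threshold

sentence = "the cat sat on the mat"
print(coverage(sentence, {"the", "cat", "sat", "on"}))  # 5 of 6 tokens known
```

In practice the tokeniser would need to lemmatise (and, for Japanese, segment) the text first, which is exactly why such measures are harder to apply to unsegmented orthographies.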
Landauer and Dumais (1997) provide a potential explanation for some of these results, suggesting that word knowledge is built gradually from large numbers of weak links with other words. If this were the case, then much of the vocabulary improvement from reading a series of texts would appear in words not explicitly encountered in the texts, but rather in words whose meaning has solidified through relationships with encountered words. This suggests a potential source of measurement error for studies of reading that look for vocabulary growth only in explicitly encountered words.

1 A graded reader is a text where words assumed to be too difficult for the reader are either pre-glossed or replaced by simpler near-synonyms. In this way a text is constructed with an appropriate density of unknown words for the reader.

Nonetheless, some controversy remains amongst experts as to the full role of reading in L2 vocabulary growth. In this thesis, we make the reasonable assumption that L2 learners achieve the same vocabulary growth from reading as L1 learners given similar vocabulary size, reading materials, and opportunities for input. Where differences exist, we assume they lie in background, environmental or strategic factors rather than the L1/L2 distinction.

Vocabulary knowledge promotes reading comprehension: it is the greatest predictor of reading success, better than general reading ability or syntactic ability (Laufer 1997; Urquhart and Weir 1998). The converse is also clear, that reading comprehension promotes vocabulary growth, since strong context allows learners to better infer the meaning of unknown words. These two mutually reinforcing factors seem to form a positive feedback loop, which could form a strong platform for reaching fluency.

Nuttall (1996:127) however found that there are not one but two feedback loops, depending on a learner’s attitudes, motivation and existing reading proficiency: a “virtuous circle” reinforcing success and a “vicious circle” reinforcing failure (Figure 3.1). These circles are so-named because starting at any of the points leads to each of the other points in the cycle. Pulido and Hambrick (2008) in turn validate these ideas through empirical study of L2 reading. Coady (1997) calls this problem the “Beginner’s Paradox”, since learners must acquire vocabulary through reading, but need sufficient vocabulary to bootstrap the process.

If we are serious about helping learners to reach higher levels of fluency, we must focus our efforts where the long-term impact will be the greatest. This is not a new idea: where the appropriate strategies are not developed automatically, many researchers now advocate teaching autonomous learning strategies explicitly, at the cost of time spent explicitly teaching language, because of the beneficial effects of these strategies on long-term autonomous behaviour (Huckin and Coady 1999; Barker 2007). Use of appropriate tools is also increasing; Tozcu and Coady (2004) found that teaching high-frequency words using a CALL system improved reading more than simple reading practice. This thesis describes attempts to improve tool support for learning vocabulary at the early reading stage, in order to ensure learners achieve a positive cycle of vocabulary growth in their reading.

[Figure: the vicious circle runs doesn't understand → reads slowly → doesn't enjoy reading → doesn't read much; the virtuous circle runs enjoys reading → reads faster → understands better → reads more.]

Figure 3.1: Nuttall’s (1996) cycles of frustration and growth in L2 reading.

Knowing a word

In order for a learner’s vocabulary size to be accurately measured, we need a sense of what it means to acquire a word. We settled earlier on using “word” to mean lemma. To determine what we might mean by acquire, we ask, what does it mean to know a word? Nation (2001:27) attempts to provide an exhaustive listing of dimensions of word knowledge. Each aspect fits under the broad categories of form, meaning and use, and furthermore distinguishes between the knowledge required to recognise a word (receptive knowledge [R]) and that required to produce the word correctly (productive knowledge [P]). Using this distinction, Table 3.1 attempts to exhaustively describe the dimensions of word knowledge. This enumeration is important since it provides an implicit definition of depth of vocabulary knowledge: the depth of knowledge of a word is the learner’s coverage over these aspects of knowledge. It also serves as a reminder that word knowledge does not exist independently in a vacuum, but in fact cross-cuts almost all aspects of linguistic proficiency. Most studies into vocabulary acquisition have used languages with alphabetic orthographies (in particular English, French and German) as their L2. So what differences could we expect with languages like Chinese and Japanese which use more complex scripts? At the

Form
  spoken
    R  What does the word sound like?
    P  How is the word pronounced?
  written
    R  What does the word look like?
    P  How is the word written and spelled?
  word parts
    R  What parts are recognisable in this word?
    P  What word parts are needed to express the meaning?
Meaning
  form and meaning
    R  What meaning does this word form signal?
    P  What word form can be used to express this meaning?
  concept and referents
    R  What is included in the concept?
    P  What items can the concept refer to?
  associations
    R  What other words does this make us think of?
    P  What other words could we use instead of this one?
Use
  grammatical functions
    R  In what patterns does the word occur?
    P  In what patterns must we use this word?
  collocations
    R  What words or types of words occur with this one?
    P  What words or types of words must we use with this one?
  constraints on use (register, frequency)
    R  Where, when and how often would we expect to meet this word?
    P  Where, when and how often can we use this word?

Table 3.1: Aspects of word knowledge, marked “R” for receptive knowledge and “P” for productive knowledge. From Nation (2001:27).

Form
  spoken
    P  How is the kanji pronounced in this context?
  written
    R  What semantic components are used in this kanji? How do they contribute to the whole-kanji meaning?
    R  What phonetic components are used in this kanji? Do they provide any on readings?
    R  What other variants exist for this kanji?
    P  What is the correct stroke order for this kanji?
Use
  constraints on use
    P  In what context is it appropriate to use this kanji? When should its kana equivalent be written instead?

Table 3.2: Our proposed aspects of kanji knowledge, in complement to aspects of word knowledge shown in Table 3.1.

kanji level, many word-level aspects also apply, but some elements such as pronunciations in context, variants and stroke order are also important. We propose the additional aspects of knowledge in Table 3.2 as applying to kanji. Returning to our question of what it means to acquire a word, we are left with two extremes. A word we have never encountered is clearly not acquired, yet must a learner know every aspect of a word (or kanji) to have acquired it? Acquisition is clearly a graded concept. In particular, the extent to which a learner knows about a word is considered the depth of their knowledge, whereas the number of words of which they have some knowledge is the breadth. Most vocabulary tests clearly measure breadth of knowledge, or how many words are known to some basic degree. Is one more important than the other in language learning? Fortunately, the evidence so far suggests breadth and depth are strongly related. Vermeer’s (2001) study of 4- and 7-year-old Dutch children showed high correlations between breadth and depth measures for both L1 and L2 children. We already considered words as nodes in a highly interconnected lexical network. Vermeer (2001:218) argues that “the denser the network around a word, the richer the set of connections around that word, the greater the number of words known, and the deeper the knowledge of that word.” The more words a learner knows, the better each individual word’s meaning is delineated from other words it is connected to. Similarly, a word’s exact aspects of meaning can only be known in the context of other words which are related to it. Thus, increased breadth in vocabulary knowledge coincides with increased depth, and vice versa, allowing us to side-step the depth issue in

applications and use common breadth measures of word knowledge instead. We now consider the order in which words should be studied to maximise the impact of the words acquired.

Ordering strategies

Given authentic input, most learners quickly run into far more unknown words than they have time to study. Since time and attention are finite resources, they should put effort into ensuring the words that they do study are those which will gain them the most benefit. The cost of learning a word in terms of these and potentially other resources is called its learning burden (Nation 2006; Barker 2007). Once acquired, knowing a word also accrues some benefit to the learner: the ability to understand its use in context. This section presents a brief discussion of two main strategies for choosing what words to study next. The first is a focus on compositional aspects of words, the second a focus on frequency of occurrence. We call both ordering strategies. A focus on composition advocates studying minimal words or morphemes before studying the larger words or multiword expressions in which they occur. This is typical of how native Japanese children study kanji, learning basic kanji first, and then their use as components in more complicated characters. It is also supported by the depth-of-processing hypothesis for memory, which suggests that the depth (extent of semantic engagement) with which a word is processed determines the success of subsequent recall (Craik and Tulving 1975). Since new words will be mentally linked to their component parts, we expect greater semantic engagement during their study; they should thus be easier to remember than words without such links. This effect was also confirmed by Yamashita and Maru’s (2000) study of kanji compositionality and ease-of-learning. We also note that basic kanji often relate to simple concepts, and that studies in English likewise suggest that such concepts are easier to learn (Rosch et al. 1976). In general, this strategy advocates a so-called “greedy” approach to studying words, in which learners study the easiest of the unknown words first and thus amass a large vocabulary quickly.
The most common criticism of ordering by composition as a strict strategy is that it advocates delaying the study of common words in favour of much rarer words with known components. Learners may end up with a larger vocabulary, but it may not serve them as well as a smaller vocabulary of more appropriately chosen words. In contrast, our second ordering strategy is to study words or characters in frequency order. In a simple form, fixed vocabulary lists are constructed for learners based on the frequency of occurrence of words (or kanji) in some representative corpora of texts. This approach tries to get maximum coverage over new and unseen input by asking learners to study the words most likely to reoccur, regardless of their difficulty. It also assumes both that learners are uniform in their exposure to input, and that words are uniform in the learning burden they incur. In practice, there may be less competition between these strategies than supposed. Given that each new word presents both a potential burden and benefit, Barker (2007) suggests a more sophisticated combination of these strategies, where learners pay attention to how often they encounter a word which is novel or on the cusp of their knowledge. They can then decide which words to study, trading off a word’s learning burden against its benefit, and thus using their time most efficiently. Regardless of the ordering strategy chosen, the words themselves must still be learned. We now ask how individual words might best be learned.
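In its simplest form, the frequency-ordering strategy reduces to counting lemma occurrences in a corpus and studying the most frequent words not yet known. A minimal sketch (the toy corpus and the known-word set are our own illustrative assumptions):

```python
from collections import Counter

def frequency_study_list(corpus_tokens, known_words, n=10):
    """Return the n most frequent words the learner does not yet know."""
    counts = Counter(t for t in corpus_tokens if t not in known_words)
    return [word for word, count in counts.most_common(n)]

corpus = "the dog saw the cat and the dog ran after the cat".split()
print(frequency_study_list(corpus, known_words={"the", "and"}, n=2))
# → ['dog', 'cat']
```

A combined burden/benefit strategy of the kind Barker suggests would replace the raw count with a score trading frequency off against an estimate of each word's learning burden.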

Acquisition strategies

In Section 3.1, we argued that vocabulary acquisition was largely an autonomous matter. To briefly recap, although the most obvious strategy is to explicitly teach vocabulary in the classroom, and doing so has strong advantages for the words which are taught, student-teacher time is ultimately limited. The sheer amount of vocabulary learners must acquire precludes this as a primary method of study. A consensus has formed that for high frequency, highly polysemous words, explicit teaching is useful, but that the vast majority of words will nonetheless be acquired through autonomous study. We now consider two main strategies for such study: mnemonics and inference from context. As we discuss these strategies, we recognise that a large number of variables may affect their success. It is however beyond the scope of this thesis to consider them in full. We instead refer the interested reader to Mohseni-Far’s (2008) excellent overview of acquisition variables and strategies, which we draw on in our discussion.

Mnemonics

From an abstract perspective, vocabulary learning is purely a matter of memorising information, so general memorisation methods such as mnemonics are applicable. Mnemonics in language learning are mediation strategies for mentally linking the form of an L2 word with its definition, or with an approximate L1 match (Mohseni-Far 2008). For example, suppose an English speaker wishes to learn the Japanese word 五 go “five”. Since its pronunciation bears similarity to the English word go, they may use this acoustic link to associate movement with the number five, perhaps creating a mental image of five men going somewhere. When they wish to recall the Japanese word for five, they remember the mental image they have constructed, and through this link recall its pronunciation as go. This technique is called the keyword method. Note that the association with movement and the additional details of the mental image are not part of the meaning of the Japanese word, but simply additional details used to aid its recall. A popular method for kanji study making use of mnemonics is presented in the “Remembering the Kanji” series of books by Heisig (1985), and is known informally as the “Heisig method”. Three main points define this method. Firstly, Heisig advocates studying a large number of kanji for their meaning only, and ignoring pronunciation. This is a form of ordering strategy, deferring study of speech, with the goal of greatly simplifying kanji acquisition. Secondly, the kanji studied should be studied in compositional order, a strategy we discussed earlier. Lastly, each individual kanji is studied through a mnemonic method, where the meanings of kanji components are tied with the whole-kanji meaning into a story the learner creates.
For example: the kanji 砕 sai “smash” could be remembered by the story: “To smash something you should hit it with a stone (石) nine (九) or ten (十) times.”2 This story provides a form of semantic mediation which may aid recall and increase processing depth. There have been no formal studies of the effectiveness of Heisig’s approach, but criticisms focus on the trade-off suggested by his ordering strategy; in contrast, his mnemonic method is considered to be a useful tool for learners (So 2008). Mnemonic methods are supported by the depth of processing hypothesis, as referred to in our discussion of ordering strategies. By creating additional phonetic and semantic links,

the word is being processed at an increased depth, and retention is subsequently increased. Though generally successful, mnemonic methods have several limitations. Firstly, many require sufficient phonetic or orthographic similarity between L1 and L2 words in order that mediating links can easily be constructed. Secondly, they can be difficult to use for abstract words. Finally, they only establish links for one aspect of word knowledge, namely form-to-meaning links in Heisig’s case, and sound-to-meaning links in our earlier example 五 go. Beyond the chosen dimension, they have little to offer in the acquisition or memorisation of the many other aspects of word knowledge discussed earlier.

2 A user example submitted to the site “Reviewing the Kanji”, at http://www.kanji.koohii.com/.

Inferring from context (IFC)

Mnemonic methods are a way of adding depth to the process of memorisation, and linking new words with existing knowledge so as to provide alternative mental access methods for the new information. An alternative to constructing these links “artificially” is to attempt to study words in their authentic context. This way, the text itself provides the related concepts to link to. The “Kanji in Context” series of kanji study books takes just this approach, attempting to provide authentic and rich contexts for each kanji (Nishiguchi 1994). This method is most applicable in open reading, when unknown words are encountered naturally in context. In such a case, learners can increase engagement by actively trying to guess the item’s meaning instead of or as well as looking it up. This technique is called inferring from context (IFC), or the “guessing strategy”. The underlying principle of IFC, that the meaning of a word is embedded in the many contexts in which it occurs, is well known as the distributional hypothesis in Computational Linguistics.3 An example given by Walters (2006) is illustrative of the idea: Typhoon Vera killed or injured 218 people and crippled the seaport city of Keelung. Suppose that the learner knows every word in the sentence but crippled. Then the context constrains the meaning of the unknown word, allowing the learner to ask, what would a typhoon do to a city? Notice also that the meaning is not uniquely constrained by this encounter; many closely related words could work in place of crippled, such as destroyed, devastated or ravaged, each with slightly different meanings or usage in other contexts. In other cases, the context may barely

3 The distributional hypothesis, originally proposed by Harris (1954), is that words which occur in similar contexts tend to have similar meanings.

constrain the meaning at all, and guessing may be unhelpful. Nonetheless, over the course of many encounters in sufficiently varied contexts, the learner should be able to gradually build a more complete understanding of word meaning. Part of this understanding may come from learning related words and thereby better delineating each word’s meaning. In its broadest sense, IFC can refer to any situation in which the learner attempts to determine the meaning of an unknown word based on available cues. These could include diagrams, images, topic information, the other words in a sentence, or any other contextual knowledge available.4 Although IFC as a strategy could refer to many uses and situations, it is usually described as a strategy to aid reading. The vocabulary size that learners bring to bear in their reading attempts strongly defines their reading experience. As a rough approximation, the relationship between the number of unique words and text size can be modelled as log-normal for alphabetic orthographies.5 This distribution suggests how often learners are likely to encounter unknown words as their vocabulary increases in size. There will thus be stages in aptitude as a learner progresses from most words in general text being unknown, to one unknown per phrase, to one unknown sense per sentence, then per paragraph, and so on. Even at a level of several thousand words, a learner still falls far short of the vocabulary of a native speaker, and thus encounters unknown words far more often. IFC is somewhat controversial for second language learners for several reasons. Firstly, inferring accurately requires a large amount of local and global context, which may not be available for every word. Only in rare cases will the word’s meaning be given fully and redundantly in the course of the sentence and surrounds; this is usually called pregnant context.
In normal cases, there may not be enough information to fully constrain the word’s meaning, especially if the context itself contains other unknown words. Thus IFC is most accurate for native speakers, and becomes less and less accurate with decreasing vocabulary sizes. In reality, this common argument is not restricted to IFC and L2 learners, but based rather on the amount of existing vocabulary knowledge. As we discussed earlier in Section 3.1,

4 Normally this strategy refers to inference from context within receptive materials, for example written or audio-visual materials. In dialogues, where a learner is better placed to negotiate meaning and attempt clarification, it is less necessary.

5 See Baayen (1993) for a comparison of log-normal, inverse Gauss-Poisson and generalised Zipf’s “laws” on modelling this distribution.

attempting to read widely with insufficient knowledge may be a difficult and unproductive exercise, causing the learner to enter a cycle of poor reading ability and poor motivation which is difficult to break from. Secondly, Dycus (1997) argues that unknown words are typically rarer in usage than known words, hence carry more information (by information-theoretic measures) than the known words, and are thus harder to guess. Unknown words are certainly likely to be rarer than known words: if we assume that learners build knowledge about new words through encounters with those words, a corollary is that unknown words are in general those encountered the least. For this reason, their exact meaning is indeed likely to be harder to guess. However, Dycus assumes that learners must guess the exact meaning. Returning to our earlier example, the words crippled, destroyed, devastated and ravaged would all have similar meaning in the provided context, only differing in slight and subtle ways. On first encounter, a learner would not be able to determine these nuances, but could still narrow down the word’s potential meaning to some broad sense shared between these words, an abstract concept of damage and destruction. Subsequent encounters or some additional source of information are needed for learners to acquire the new word’s full meaning. IFC remains useful as a technique for gradually refining a new word’s meaning until the learner’s understanding approximates that of the L1 population. In addition to IFC, or as an alternative to it, unknown words can be ignored or looked up in a dictionary. If ignored, the learner suffers the resultant penalty to comprehension. If a dictionary is used, the learner suffers from the costly time-delay before meaning is provided. The longer the delay, the less likely the learner is to continue to retain the earlier context in their working memory, making swift dictionary lookup important for comprehension.
In our earlier example, dictionary lookup of crippled after guessing would also have allowed the learner to pick up differences between it and its near-synonyms. In advocacy of IFC, Fraser’s (1999) study showed that inferring meaning gave similar vocabulary retention to dictionary lookup, although accuracy was lower. The highest retention strategy was to guess the word’s meaning, then confirm the meaning by dictionary lookup; the deeper processing of the combined approach explains the higher retention rate. Mondria (2003) performed later experiments where learners were given pregnant contexts so that inferring an incorrect meaning would be unlikely, and found instead that inferring

and lookup had similar retention rates, despite the use of a verification step in the IFC strategy. It may be that semantic engagement is increased by the process of reconciling an incorrect guess with an authoritative word definition in the verification step.

The partial successes of the guessing strategy could be explained by its effectiveness where there is a large amount of beneficial L1 transfer, for example, between languages that share Latin as a common ancestor. In Japanese, many loanwords are borrowed from English or other European languages and written in katakana. Guessing the meaning of a loanword may be effective when the source word is in the learner’s L1 and the loanword remains phonetically and semantically similar to it. However, in many cases the strategy will still have difficulty: the source word could be from another language; the nearest phonetic neighbour could be a different word; the loanword or the source word could have drifted in meaning from the time when the loan originated; or, the word could be a foreign loanword construction not used outside Japan.6

Suppose that no beneficial L1 transfer is available to aid IFC. Then the effectiveness of IFC is supported by a compositional ordering strategy for learning words, since such a strategy will allow learners to utilise the internal cues given by a word’s components as well as its context in order to infer its meaning. More broadly though, we take the position that many of the difficulties in applying IFC relate to the lack of an adequate base vocabulary, relative to the text being read.

To summarise, learners are typically unable to use guessing effectively until they reach very large vocabulary sizes, except in the presence of strong, beneficial L1 transfer. They thus rely heavily on dictionary lookup to aid them through their reading until they reach a level where they can utilise this technique usefully. Any improvements we can make to dictionary lookup will thus save much time for these learners, and may also help to maintain motivation to read.
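The distributional hypothesis behind IFC (footnote 3) can be made concrete with a toy model: count the words occurring within a small window of each target word, then compare targets by the cosine similarity of their context counts. Everything here, including the three-sentence corpus, is our own illustrative sketch, not a method from the literature reviewed above:

```python
import math
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """Count context words within +/- `window` positions of each word."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

sents = [["typhoon", "crippled", "the", "city"],
         ["typhoon", "destroyed", "the", "city"],
         ["tourists", "enjoyed", "the", "city"]]
vecs = cooccurrence(sents)
# "crippled" and "destroyed" share identical contexts here, so they come
# out more similar to each other than either is to "enjoyed"
assert cosine(vecs["crippled"], vecs["destroyed"]) > cosine(vecs["crippled"], vecs["enjoyed"])
```

This is the same intuition a learner exploits when narrowing crippled down to a broad "damage" sense: words that fit the same contexts cluster together, and only further encounters separate their nuances.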

6 For example マイナスドライバー mainasudoraibā “flathead screwdriver”, from “minus (screw)driver”. Such constructions are known as 和製英語 wasei eigo “Japanese-made English”.

Self-study tools

A vast number of vocabulary and kanji study tools are available on the web for Japanese, perhaps more so than for other languages. Much of this richness is attributable to the availability of the EDICT Japanese-English dictionary7 constructed by Jim Breen in 1991, and the matching KANJIDIC dictionary of kanji meanings.8 These resources allowed developers of study tools and dictionary interfaces to focus on user interaction rather than on dictionary construction. For a broader selection of study tools than we can discuss here, we refer the interested reader to Jim Breen’s Japanese Page9 and a recent portal built for this purpose by the Japan Foundation called Nihongo-e-na (日本語eな).10 German speakers may also find the Wadoku Wiki a useful resource for Japanese learning.11 Ultimately, each of the tools in this section has at its core one of the two themes from the previous section: either helping the user to memorise words, or aiding their acquisition through naturalistic input. We discuss both forms of self-study tool in this section.

Memorisation tools

The first tool we examine is Quizlet,12 a language-independent flashcard tool for memorisation of vocabulary or even general facts. Whilst there are many flashcard interfaces online, most are highly specialised and very limited in scope. Quizlet is interesting for several reasons. It allows users to develop flashcards themselves from their own study materials, to share their flashcard sets with others, and to study in a variety of different modes. Its main mode is a learning mode, where the learner must type explicitly the answer to the flashcard, or else postpone answering the question until later in the session. Each session thus constitutes one successful pass through the flashcard set. It also provides a test mode, which can generate tests from the flashcard set using multiple-choice or productive question formats, and two game modes for some variety. Despite its usefulness as a general learning tool, Quizlet has two main limitations.

7 http://www.csse.monash.edu.au/~jwb/j_edict.html
8 http://www.csse.monash.edu.au/~jwb/kanjidic.html
9 http://www.csse.monash.edu.au/~jwb/japanese.html
10 http://nihongo-e-na.com/
11 http://www.wadoku.de/wiki/display/WAD/WadokuWiki
12 http://quizlet.com/

Figure 3.2: e basic study mode of Quizlet, shown part way through study of the JLPT 4 kanji set. e JLPT 4 kanji list was entered by another user.

Firstly, we know that word knowledge is multi-dimensional, yet the general-purpose nature of its flashcards limits users to studying a single dimension of knowledge at a time. This criticism applies to the flashcard paradigm as a whole, although other more specialised flashcard sets – for example the paper flashcards developed by White Rabbit Press for kanji study13 – offer far richer linguistic aids which might help learners to acquire broader aspects of word knowledge. Secondly, for all its basis in memorisation, Quizlet doesn’t provide support for spaced repetition in any form, thus undermining its usefulness in developing long-term retention. We discussed earlier the Heisig method as advocating a particular form of mnemonic, where a story is constructed for a kanji involving either aspects of its form or its component radicals. An innovative site supporting this study method is the Reviewing the Kanji site.14 Heisig (1985) advocates learners developing their own mnemonic stories; however, this can sometimes be difficult if the elements a learner must include in these stories seem unrelated.

13http://www.whiterabbitpress.com/catalog/Flashcards-orderby0-p-1-c-248.html
14http://kanji.koohii.com/

Figure 3.3: e “Scatter” game in Quizlet, where users must drag and drop matching el- ements of word-gloss pairs onto each other to eliminate them. e user is scored by their time taken, which is compared to their previous best. Chapter 3: Learning Japanese vocabulary 39

Reviewing the Kanji allows learners to share mnemonic stories and vote on them, thus providing a very useful aid to this form of memorisation.

Reviewing the Kanji uses a form of spaced repetition developed by Leitner (1972) and discussed later by Mondria and Vries (1994), where a number of bins are used to store the different concepts being memorised (Figure 3.4). Each bin is used in a self-test, where a successful recall causes the concept to advance to the next bin, and a failed recall moves the concept back to the starting bin. Each successive bin represents an increasing timeframe before the next self-test: for example, the succession 1 day, 3 days, 1 week, 1 month, 6 months could be used. This system reduces the amount of time spent on successfully learned words, and yet also revisits them with sufficient spacing to promote their long-term accessibility in memory.
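The Leitner scheme just described is straightforward to make concrete. The following sketch uses the example interval succession from the text; the class and method names are our own invention, not part of any of the tools discussed.

```python
from dataclasses import dataclass, field

# Review intervals per bin, following the example succession in the text:
# 1 day, 3 days, 1 week, 1 month, 6 months.
INTERVALS_DAYS = [1, 3, 7, 30, 180]

@dataclass
class LeitnerScheduler:
    """Minimal Leitner scheduler: each card sits in a numbered bin, and
    each successive bin implies a longer wait before the next self-test."""
    bins: dict = field(default_factory=dict)  # card -> bin index

    def add(self, card):
        self.bins[card] = 0  # new cards start in the first bin

    def review(self, card, recalled):
        if recalled:
            # Successful recall: advance to the next bin (capped at the last).
            self.bins[card] = min(self.bins[card] + 1, len(INTERVALS_DAYS) - 1)
        else:
            # Failed recall: back to the starting bin.
            self.bins[card] = 0
        return INTERVALS_DAYS[self.bins[card]]  # days until the next self-test

sched = LeitnerScheduler()
sched.add("明")
print(sched.review("明", recalled=True))   # 3: promoted to the second bin
print(sched.review("明", recalled=False))  # 1: demoted back to the first bin
```

The demotion-to-the-first-bin rule is what concentrates review time on poorly known cards while well-known cards drift towards ever longer intervals.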

Figure 3.4: The card system proposed by Leitner for spaced repetition. From Mondria (2007).

Reading aids

In contrast to flashcard tools, the last two systems we consider aim to promote success in reading by augmenting computer-based texts with dictionary glosses. This approach is comparable to preparing a graded reader where rare or out-of-syllabus words are pre-glossed, but preparing text in this way using a manual annotation system such as JGloss15 is time consuming. Instead, the PopJisyo16 and Rikai17 sites both provide a manner of pre-processing an arbitrary Japanese text, and loading it with pop-up translations for all recognised words. Since the translations only pop up if the user places their mouse over the word, they are not obtrusive or distracting for readers who already know a word's meaning. However, when a word is unknown, the effect is that of instantaneous dictionary lookup. Figure 3.5 shows such an example, where Rikai is used to aid reading of a Japanese news article. A similar system is provided by the Reading Tutor toolbox,18 which prepares on-demand translations in a right-hand column next to the text.

15http://jgloss.sourceforge.net/
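The pre-processing behind such pop-up reading aids can be sketched as a greedy longest-match scan over the text. The toy GLOSSES dictionary below stands in for a real lexicon such as EDICT; its entries and the function name are illustrative only, not the actual implementation of any of these sites.

```python
# Toy gloss dictionary standing in for a real lexicon (entries invented
# for illustration; the 商会 gloss follows the Figure 3.5 example).
GLOSSES = {
    "商会": "firm; company",
    "商": "commerce",
    "会": "meeting",
}
MAX_WORD_LEN = max(len(w) for w in GLOSSES)

def annotate(text):
    """Return (token, gloss-or-None) pairs using greedy longest match."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking towards one character.
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if cand in GLOSSES:
                out.append((cand, GLOSSES[cand]))
                i += n
                break
        else:
            out.append((text[i], None))  # unrecognised character
            i += 1
    return out

print(annotate("商会は"))  # [('商会', 'firm; company'), ('は', None)]
```

Greedy longest match is only an approximation of real segmentation, but it illustrates why these tools can gloss every recognised word in a text at negligible cost.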

Figure 3.5: Rikai used to aid reading of a news article from Asahi Shimbun. When the mouse is hovered over the word 商会, its gloss “firm; company” is displayed, as well as an individual gloss for each kanji in the word.

For all of these systems, the lack of a significant time-cost penalty for unknown words could have several undesirable consequences, in particular over-reliance on and abuse of the dictionary as a learning tool. Critics of paper-based dictionaries charge that excessive dictionary use leads to reliance on word-by-word translation, potentially ignoring multiword expressions, and issues of lexical gridding (Walz 1990; Bell 1998; Prichard 2008). Furthermore, with instantaneous lookup the incentive to learn a word may be too small for a learner to invest the time and attention required to learn it. Nonetheless, they allow learners of a much greater variety of proficiencies to read authentic texts of their choosing, and are thus potentially a strong aid in facilitating the virtuous circle of successful reading.

A second criticism is that they provide little support for reading-for-vocabulary, in the sense that most learners will find too many new words to keep track of, with little to differentiate which new words should be further studied. Learning a new word takes time, but the payoff in doing so varies according to the word's frequency of occurrence. Of the three systems only Reading Tutor provides cues as to frequency, by colouring words in the source text by their difficulty according to Japanese Language Proficiency Test (JLPT) levels, and none of the systems provides a more integrated user-oriented lexicon based on actual exposure to words. Finally, these systems are limited by the media the learner is trying to read, and offline or scanned texts without optical character recognition are still beyond the reach of these reading aids. For such media, the learner must consult a dictionary system.

16http://www.popjisyo.com/WebHint/Portal_e.aspx
17http://www.rikai.com/perl/Home.pl
18http://language.tiu.ac.jp/index_e.html

3.2 Structure of the mental lexicon

Having made our case for supporting autonomous study of vocabulary, and having discussed how words are chosen for study and then learned, we now consider how we might improve the learning process. We contend that if we more closely model how learners make errors, particularly during reading, we can vastly improve the tools which support their vocabulary acquisition. For this reason, we now consider contemporary models of the mental lexicon. Such models are important for two main reasons. Firstly, they are suggestive of how to best populate the lexicon with new L2 words, which is indeed the focus of this thesis. Secondly, the relationships between words, especially those which emerge due to the lexicon's structure, might be leveraged to support vocabulary acquisition in a variety of applications.

As writing systems vary significantly as to what information they encode (Koda 2007), our investigations of graphemic similarity will inevitably be writing-system specific. In our discussion, we try to maintain a multilingual perspective where possible, focusing on Japanese only where necessary.

Visual word recognition

We learn about the structure of the mental lexicon primarily through studying visual word recognition, with the caveat that we are considering the ultimate structure for skilled L1 readers. Models of visual word recognition have to account for many known psycholinguistic effects if they are to be successful (Taft 1991:11; Handke 1995:166). Of these, Lupker (2005) suggests that the most important are the word superiority, word frequency, semantic priming and masked repetition priming effects.

In general, proposed models fall into one of two categories: search models, where some autonomous search process identifies candidates from sub-lexical features and proceeds to verify them until a match is found, or activation models, where continuous activation occurs between various levels of processing, and the word perceived is that with the strongest activation level (Lupker 2005). In this section we focus on two variants of the basic interactive-activation model, which we show in Figure 3.6, and which address each of these key effects appropriately. Note that we focus on the word level in isolation, rather than considering larger units such as phrases or sentences. As a receptive model, we model the process initiated by visual input and concluded by successful access to a word. This can be compared with productive models, for example that provided by Levelt et al. (1999) for speech production.

Though not shown in Figure 3.6, variants of the interactive-activation model posit an intermediate level between the letter and word levels which activates related phonological units. If we are to consider Chinese or Japanese characters, then radicals seem to be appropriate sub-word units to occupy this level. Saito et al. (1998) adapted this model to Japanese, developing the Companion Activation Model (CAM) shown in Figure 3.7.
Figure 3.6: The multilevel interactive-activation model for alphabetic orthographies, with activation and inhibition links between the visual input, feature, letter and word levels. From McClelland and Rumelhart (1981).

The CAM does indeed use radicals as intermediate sub-word units; however, above radicals and below words lies the additional whole-kanji level. They also assume two types of activation (foreground and background) which interact with one another. When a kanji radical is activated, in combination with other foregrounded radicals it activates whole-character matches in the foreground, and visual neighbours which also contain the radical (“companions”) in the background. Their results in several experiments are interpreted to mean that phonetic properties of radicals are automatically activated. This matches the results of earlier experiments by Flores d’Arcais et al. (1995), where pre-exposure to a phonetic radical reduced latency in a naming task.

Figure 3.7: The Companion Activation Model (CAM). From Saito et al. (1998).


Although Saito et al.'s (1998) model only considers kanji split into left and right components – roughly 53% of the JIS standard set – it could easily be extended to other kanji shapes: only left-right kanji have phonetic radicals with any reliability, so for other shapes we would assume that the radical level is not activated. It also provides explicit suggestions of other candidates which are considered in the recognition process, thus implicitly suggesting potential error outcomes should misrecognition occur. Unfortunately, the CAM provides little explanation of the role of multiple scripts in word recognition. It also suffers from problems of “representational redundancy, homographs, and varying degrees of semantic transparency” (Joyce 2002:80). Motivated by these problems, an alternative interactive-activation model was constructed by Taft et al. (1999a), and adapted for Japanese by Joyce (1999).

Figure 3.8: Lemma unit connections in Joyce's (2002) multilevel interactive-activation model for Japanese.

This model is very similar to CAM in its broad structure, in that activation flows from the lower levels upwards through to meaning; however, lemma units are added which mediate between orthography, meaning and phonology (Figure 3.8). These abstract units roughly represent the morpheme level, and they contain additional information about their links to orthographic and phonetic units which allows them to order their connections. Amongst its other advantages, this model provides clearer integration of the multiple scripts used in Japanese.

Link type    Intralingual                Interlingual
Semantic     (near-)synonymy/antonymy    (near-)synonymy
             hyponymy                    cognate
             meronymy
             holonymy
             entailment
Phonetic     (near-)homophony            (near-)homophony
Graphemic    (near-)homography           (near-)homography

Table 3.3: Types of lexical relationships available to the ideal bilingual speaker.

Each of these models makes heavy use of the hierarchical nature of kanji, which is unsurprising, given that such structure exists. Activated candidates during the reading process are suggestive of potential erroneous outcomes if a word is misread. They also give us insight into how words may become related through the recognition process. For example, kanji sharing radicals may be commonly activated as candidates when processed through reading, and this common activation may develop a bond between these characters. We now consider these and other relationships which form between words.

General lexical relationships

Our discussion of Japanese-specific aspects of the mental lexicon was dominated by the processing of kanji and kanji compounds. In this section we consider more broadly the relationships between words which occur within any language, and indeed interlingually between words of different languages. Ultimately, quite a large inventory of such relationships can be uncovered, as shown in Table 3.3.

As is clear from Table 3.3, the richest variety of relationships is semantic and available monolingually, i.e. between words of the same language. For semantic relationships within English the primary resource is WordNet, a lexical database supporting several relations (Fellbaum 1998), following earlier work on lexical relations by Cruse (1986). Firstly, words are grouped into synonym sets (“synsets”). For nouns, hyponymy (the is-a/kind-of relationship) and meronymy (the part-of relationship) are provided; for verbs, hyponymy, troponymy (the manner-of-doing relationship) and entailment are available. For Japanese, the Goi-Taikei project (Ikehara et al. 1997) also provides a hierarchical semantic lexicon, using a subset of the relations available for the WordNet project. More recently, a Japanese version of WordNet was constructed using the synsets from the English WordNet as its base (Isahara et al. 2008). We refer to semantic links in general as associations, and our listing of association types should not be considered exhaustive. These rich semantic relations are still insufficient to capture the semantic proximity of pairs such as doctor and patient. For this reason, work on circumventing the tip-of-the-tongue problem (Zock 2002; Zock and Bilac 2004) has more recently focused on extending WordNet with syntagmatic relationships (Ferret and Zock 2006). Other relationships may also exist: for example, recent experiments by Gaume et al. (2008) comparing child and adult learners of French found a salient semantic relationship between verbs based on co-hyponymy. Overall, sufficient proficiency in any language yields associations between words which are shared across speakers, and these can be captured by studying sufficient numbers of speakers, as in the case of Joyce's (2005) large-scale database of such associations for Japanese.19

If we now consider interlingual word relationships, the most obvious is synonymy, or more accurately near-synonymy. Every bilingual dictionary attempts to match words in one language with their closest synonyms in another. If we consider some abstract space of all possible meanings, then each language partitions and covers this space with words in a different manner. This has been called lexical gridding, with the idea that the “grids” of two languages often do not “line up” (Laufer-Dvorkin 1991:16).
For example, in English we describe the middle front of a creature's head as either its nose (for people), muzzle (for dogs), snout (for pigs), or trunk (for elephants); in Japanese, the single word 鼻 hana is used in all these contexts. In more complex examples, words only partially correspond to one another by overlapping in a few of their available senses, but not in others. Ploux and Ji (2003) visualise this situation by constructing two-dimensional semantic maps of near-synonymous English and French words. Lexical gridding effects in general are not limited

19http://www.valdes.titech.ac.jp/~terry/jwad.html

to denotational meaning, but cover all aspects of word knowledge, as discussed earlier (Section 3.1). When L2 words sound like words from our L1, or are written like words in our L1, we associate them with these words, regardless of semantic differences. Indeed, contemporary product names are chosen carefully to avoid negative associations with consumers from different linguistic backgrounds. When two languages share some common ancestry, word pairs will exist between these languages which share the same etymology. These word pairs are cognates, and due to the slow speed of linguistic change, these pairs are often near-synonyms, near-homographs, and near-homophones.


Figure 3.9: Interlingual relationships based on near-homophony, homography and synonymy. The first example is English-French; the second and third are Japanese-Chinese.

We can bring together many of these ideas with the examples shown in Figure 3.9. In the first example, the French word adresse is cognate with the English address, and shares the meaning “postal address”. However, it also means “dexterity”, which itself has a matching cognate dextérité in French. Both cognate pairs show correspondence in meaning in at least one sense of each word. Importantly, the cognate relationship is predictable from the orthographic and phonetic similarity. In our second example, a similar situation exists between compounds in Chinese and Japanese; again the cognate relationship is predictable and reliable. In our final example, the cognate relationship between 手紙 and 手纸 exists between characters at the morpheme level, but not at the whole-word level. The subsequent mismatch in whole-word meaning is significant, and in such cases we call the pair false friends, indicating the problems they cause for learners.

Altogether, these relationships transform our view of the mental lexicon from a series of isolated words to a densely interconnected graph. Lexical resources attempt to represent a subset of the vertices and edges in this graph. In Japanese and Chinese, their rich orthographies provide substantial graphemic relationships based on near-homography which have not yet been fully explored. These relationships could provide some important structure for learners, and deserve more attention.

Graphemic neighbourhoods: errors and associations

If we measure the physical similarity of two written symbols, for example using metric measurements or on-screen pixels, we expect that the closer the two symbols, the more likely they are to be confused. This is trivially true at extreme levels of proximity, but how the confusability changes as the distance between symbols increases depends strongly on the psychological reality of how the symbols are perceived. It is precisely for this reason that we looked at the word recognition process for Japanese earlier in this section. The general visual neighbourhood around a Japanese character – presumed to be shared between native speakers – remains unknown, though several experiments give useful indicators as to the writing system's topology.

The concept of a visual neighbourhood is important because a word's neighbours are assumed to be competing candidates during the word recognition process, precisely because they share salient visual features with the target word. Applications wishing to predict visual recognition errors should thus focus on visual neighbours as error candidates. Our previous discussion considered what lexical relationships might be consciously available to readers, but these conscious graphemic associations may be affected by visual recognition processes. We firstly discuss these processes, before considering evidence for the topology of such neighbourhoods.

Accessibility of graphemic neighbours

L2 learners and bilingual speakers clearly associate visually similar words interlingually, usually as cognates or mnemonic aids. However, the extent to which these high-similarity graphemic relationships are readily accessible within a single language is unclear, because of conflicting theoretical predictions from lexical competition and global activation effects.

Lexical competition suggests that in processing the stimulus word, visual neighbours of that word will be inhibited (Figure 3.6). In contrast, global activation suggests that the increased activation in that region will increase the activation in neighbours of the stimulus as well. In fact, evidence from Carreiras et al. (1997) suggests that both can occur and are task-dependent: orthographic density about a word facilitated naming and lexical decision tasks, but had an inhibitory effect in a progressive demasking task.20

If the stimulus word is previously unseen by the reader, then several possibilities are raised. Lexical retrieval will fail, so we might expect inhibitory effects at the word level to reduce or disappear, thus making graphemic neighbours more consciously accessible. On the other hand, Gaskell and Dumay's (2003) study of spoken word recognition suggested that the first few exposures to non-words can activate the nearest known word. This result seems intuitive: for example, if a new word epple was coined in English and presented as stimulus, we might expect readers to mentally access its neighbour apple. This could occur regardless of whether they thought the new word was a spelling mistake or not. If this effect also occurs in visual word recognition, inhibition effects could occur from the closest neighbour. However, we cannot examine these effects without first defining our notion of neighbour more precisely.

Graphemic neighbourhood topology

Studies of visual word processing in alphabetic orthographies have long been interested in how the behaviour of such processing changes in response to many variables, including visual neighbourhood density. For this reason, Landauer and Streeter (1973) introduced the N metric of visual density around a word. N is calculated as the number of words which differ from the original in one character only, implicitly defining this as the neighbour criterion. However, the many applications for orthographic similarity – including optical character recognition (Taghva et al. 1994), record matching (Cohen et al. 2003), spelling correction (Wagner and Fischer 1974) and others – have developed far richer string distance metrics than this. These richer metrics potentially provide the means to model much more closely the psychological reality of word perception; however, the expense of human trials and the lack of data on human similarity perception make their evaluation difficult. Furthermore, choosing between them on a theoretical basis is difficult when many may use features of unknown cognitive salience.

The case for Japanese and Chinese is similar, in that the psychological literature defines only a basic orthographic neighbourhood relationship, where neighbours of a kanji may have one radical from the original kanji swapped, or identical radicals but different layouts (see Saito et al. (1998) or Taft et al. (1999b)). However, the space of broader kanji distance metrics remains impoverished in Japanese for two reasons. Firstly, many applications of such metrics in English have no equivalents in Japanese. For example, computer input is mediated by an IME, and the types of mistakes made do not equate directly to spelling errors that could use such distance metrics for correction.21 Secondly, unlike words in alphabetic orthographies whose layout is constructed linearly of symbols, kanji have a complex, nested two-dimensional layout, limiting the ability of abstract string distance metrics to transfer to the kanji distance problem. So how might we best go about measuring and predicting the psychological reality of kanji similarity?

20Progressive demasking is a variant of the naming task where a word (e.g. “ladder”) is alternated with a visual mask (e.g. “######”) on-screen. First the mask is displayed, then it is replaced very briefly by the word. The mask and word alternate, and in each cycle the mask reduces in duration and the word increases in duration. The participant must read the word out aloud as soon as they can identify it, and the time taken for them to do this is recorded.
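The contrast between the N metric and richer string distances is easy to illustrate. The sketch below computes N over a toy English lexicon, alongside Levenshtein distance as one example of the richer metrics used in applications such as spelling correction; the lexicon is invented for the example.

```python
# Toy lexicon to illustrate neighbourhood density; real applications
# would use a full word list.
LEXICON = {"cat", "cot", "cut", "car", "bat", "cart"}

def n_metric(word, lexicon=LEXICON):
    """Landauer and Streeter's N: count of same-length words differing
    from the target in exactly one character position."""
    def one_substitution_apart(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1
    return sum(one_substitution_apart(word, w) for w in lexicon)

def levenshtein(a, b):
    """A richer string distance: minimum insertions, deletions and
    substitutions needed to turn one string into the other."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(n_metric("cat"))             # 4: cot, cut, car and bat are neighbours
print(levenshtein("cart", "cat"))  # 1, yet "cart" is not an N-neighbour
```

The last line shows the limitation discussed above: N's binary, same-length neighbour criterion discards close matches like cart that an edit-distance metric captures, and neither kind of metric transfers directly to the two-dimensional layout of kanji.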
Visual search tasks, such as the one described by Yeh and Li (2002) for Chinese, provide a good start. In the task, native Chinese speakers had to determine whether a target character was present amongst distractors. Yeh and Li found that shared structure between target and distractors provided the strongest interference, slowing decision times. They investigated two types of characters: those split into left and right components, and those split into top and bottom components. Shared radicals also slowed decision times, but only where the broad layout was also shared. This suggests that the layout of a character into broad components is the dominant initial feature used in kanji perception, and that components are considered as secondary features. This makes intuitive sense, since most kanji are covered by a handful of structural variants. For example, the popular SKIP indexing scheme uses a choice between four basic structures as its primary index feature (Halpern 1999). We discuss such indexing schemes in detail in Section 3.3, which follows.

21Common mistakes when using an IME include incorrectly choosing the first kanji compound displayed in the IME, when the correct compound appears later in the list.

Despite this high-level information on neighbourhood topology, how these neighbourhoods are structured remains open. Furthermore, applications cannot make use of this information until at least some of this structure is resolved.

3.3 Dictionary lookup of Japanese

Lookup for producers

Producers of language – speakers or writers – begin with a meaning in mind, and are looking for the right words to use to express that meaning. Even in a monolingual context, finding the right word can be difficult, whether the problem is one of limited knowledge or simply one of access, as in the case of the tip-of-the-tongue problem.

Access based on meaning (onomasiological search) requires firstly expressing that meaning somehow. This is done almost exclusively through other semantically proximate or related words. We discussed types of semantic relationship earlier in Section 3.2; each of these relationships defines a method of semantic access for producers, and indeed resources providing these relationships form lookup methods themselves. In a monolingual context, the user specifies the meaning by using a near-synonym; this leads to the traditional supporting resource, a monolingual thesaurus such as Roget's International Thesaurus (Roget and Chapman 1977). In a multilingual context, the learner can specify the meaning in their L1; the traditional resource is thus an L1-to-L2 dictionary, making available the same synonymy relationship but between bilingual word-pairs. Occasionally these two resources are merged, as in the “Hebrew-English-English” dictionary described by Laufer and Levitzky-Aviad (2006).

The main exception in the dictionary space is the use of semagrams by Moerdijk et al. (2008) in the Algemeen Nederlands Woordenboek, an online scholarly dictionary of contemporary standard Dutch currently under construction. A semagram is similar to a semantic frame, but serves as a formalised structure for the meaning of a word rather than for an event or state of being. Each word sense is a member of one or more semantic classes, and each class has a type template defining which slots are available and what values these slots can take. The term semagram refers to a type template filled in with values. For example, the semagram for cow has slots for size, colour, build, function, and many other aspects of meaning, based on its upper category as an animal and the type template for animals. This rich semantic encoding for dictionary entries provides a correspondingly rich semantic search, which can be performed across many aspects of meaning.

Lookup for receivers

Receivers trying to decode a word into meaning rely on looking up either its orthographic form or its pronunciation in a dictionary. For languages which are orthographically shallow, these two are roughly interchangeable; a transcribed pronunciation in such a language will match the correct word spelling or form. For deep orthographies such as Japanese, with complex form-to-sound relationships, form and pronunciation are markedly different sources of information to draw on. In this section we focus on lookup for Japanese and Chinese.

For these languages, lookup is also the mirror-image of the input problem. If the character or word can be input quickly and accurately, lookup is trivial. If a method allows quick and accurate lookup, it likewise has potential uses as an input method. The problem of input and that of lookup by form are thus equivalent. Where input methods may be circumvented, for example by copying and pasting a problematic word from an electronic text into a dictionary, lookup is also trivial – hence the availability of the reading aids discussed earlier. We also saw earlier that input by form is non-trivial, and discussed some solutions from the input perspective. We now consider the lookup perspective of the same problem by examining three different kanji indexing schemes, as shown in Figure 3.10.

Traditional paper dictionaries contain three parts: an index of primary radicals (“section headers”) ordered by stroke count, a per-primary-radical index of characters which use it, indexed by remaining stroke count, and the dictionary entries themselves. Learners can easily select the wrong component as the primary radical, or incorrectly count the number of strokes, and thus have much trouble finding the character they seek. The multiple indexing steps are also time consuming, even without mistakes.
The conversion to electronic dictionaries eliminates the time it would take to turn pages in the paper dictionary, but still requires the visual scan of the original dictionary.

Traditional
1. IDENTIFY: 日 as section header
2. COUNT: strokes in 日, finding 4
3. LOOKUP: 日 in 4-stroke radical index
4. COUNT: remaining strokes, finding 4
5. LOOKUP: page no. for 明 in 8-stroke characters containing 日
6. LOOKUP: 明 at given page no.

SKIP
1. IDENTIFY: shape of 明 as type 1 (left-right ⿰)
2. COUNT: strokes in 日, finding 4
3. COUNT: strokes in 月, finding 4
4. LOOKUP: 明 at index 1-4-4

Kansuke
1. COUNT: horizontal strokes, finding 6
2. COUNT: vertical strokes, finding 3
3. COUNT: other strokes, finding 1
4. LOOKUP: 明 at 6-3-1

Figure 3.10: Kanji lookup methods to find 明 aka “bright”. Steps beginning with “IDENTIFY” or “COUNT” are potential error sites for learners. Steps beginning with “LOOKUP” involve use of an index, and a subsequent visual scan for the desired item.
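The code-based schemes shown in Figure 3.10 ultimately reduce lookup to constructing a composite key and performing a single index access. A minimal Python sketch, with the toy codes for 明 hard-coded rather than computed from stroke data:

```python
# Sketch of code-based kanji lookup. The indices below contain only the
# single illustrative entry for 明; real indices would be built from a
# full kanji database.

# SKIP code: (pattern type, strokes in part 1, strokes in part 2)
skip_index = {
    (1, 4, 4): ["明"],   # type 1 (left-right): 日 (4 strokes) + 月 (4 strokes)
}

# Kansuke code: (horizontal, vertical, other) stroke counts
kansuke_index = {
    (6, 3, 1): ["明"],
}

def lookup(index, code):
    """A single index access: no radical identification, no page turning."""
    return index.get(tuple(code), [])

print(lookup(skip_index, (1, 4, 4)))     # ['明']
print(lookup(kansuke_index, (6, 3, 1)))  # ['明']
```

Either way, the learner's remaining burden is constructing the code correctly; the lookup step itself is a constant-time key match.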

The SKIP (System of Kanji Indexing by Patterns) system of lookup provides an alternative indexing scheme based on a kanji’s overall shape (Halpern 1999). Each kanji is indexed by a three-digit code. The first digit represents its broad shape: ⿰ left-right, ⿱ top-bottom, ⿴ exterior-interior or ⿻ other. The remaining two digits are given by the stroke counts of the two separated components, except in the “other” case, where they specify one of four sub-patterns and the overall stroke count. For example, 明 aka “bright” has SKIP code 1-4-4. The SKIP system has two main advantages. Firstly, it removes the need to identify a primary radical for the character. Secondly, it uses only one index, ordered on a code which can be determined without consulting the dictionary. This makes correct access about as fast as alphabetic dictionaries for other languages, though mistakes in stroke counting can still impede access.

The Kansuke electronic dictionary by Tanaka-Ishii and Godon (2006) aims to simplify the stroke counting method in order to avoid such problems. Users instead form a three-number code representing the number of horizontal, vertical and other strokes that make up a character. To avoid stroke count ambiguity, complex strokes are split into their component parts. For example, the single handwritten stroke ㇆ is counted as two strokes when determining the Kansuke code, one horizontal and one vertical. Characters can also be looked up from their components. For our earlier example, 明 consists of 日 with code 3-2-0 and 月 with code 3-1-1.

The Japanese-German dictionary constructed by Hans-Jörg Bibiko22 provides, amongst other lookup methods, a method based on selecting from a large table of components. If a radical such as 日 is selected from the table, a shortlist of matches is presented showing kanji containing this radical, but the table of components is also updated so that incompatible components (i.e. those which never co-occur with the current selection) are removed.
By propagating constraints this way, the user can more easily find the kanji they are looking for. Multi-paradigm dictionaries such as JEdict23 also combine several forms of lookup in a single interface. Each of these dictionaries provides a complementary approach which can be added to the learner’s arsenal.

Looking again at Figure 3.10, we can see that in newer systems the learner generates a single index key for the kanji, then performs just one lookup using this key, thus saving time. The trend across these systems is to reduce the amount of assumed or required knowledge needed to perform lookup. However, this means that additional knowledge is not used, even when it is available. A competing trend is to allow the user to specify partial knowledge whenever possible. This approach is taken by the FOKS dictionary system.

22http://lingweb.eva.mpg.de/kanji/
23http://jedict.com/

FOKS dictionary

We suggested earlier that if the pronunciation for a word is known, dictionary lookup becomes simple and fast. What of the case where the user only has partial knowledge of the pronunciation? Such a case is common in Japanese, where the correct kanji reading is context-dependent. In such a case, the FOKS (Forgiving Online Kanji Search) system by Bilac (2005) provides a useful lookup method.

Suppose, for example, a user wishes to find the word 山車 “festival float”, but is unsure of its pronunciation. FOKS allows them to guess the pronunciation based on readings they know for each character in other contexts. In this case, they might combine 山 yama “mountain” and 車 kuruma “car” and guess the word reading as yamakuruma. The correct reading dashi cannot be guessed from the word’s parts, but this educated guess would lead the user to the word, and provide access to both the correct reading and meaning.

FOKS uses an error model to correct for the most significant recoverable errors made by users in their dictionary queries. Incorrect choice of kanji reading accounts for roughly 80% of user errors, according to post-hoc log analysis (Bilac et al. 2004), often in combination with other error types. Three other error sources are also significant: incorrect voicing, incorrect gemination, and incorrect vowel length. Voicing and gemination are changes to character-level pronunciation that occur during word formation, as discussed in Section 2.3, and which can be incorrectly applied by learners. The third source of error occurs when a learner mistakes a long vowel for a short vowel, or vice versa. Vowel length errors are expected to be especially prevalent with learners from languages such as English which do not maintain an equivalent vowel length distinction. Like many dictionaries, FOKS also supports the use of wildcards in queries, which offers some help for lookup of partially known words.
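The compositional-guess idea behind FOKS can be illustrated with a toy sketch. The per-character readings below are invented sample data, not the real FOKS lexicon: we enumerate every reading combination a learner might plausibly guess, and map each guess back to the word it could have been intended for.

```python
from itertools import product

# Invented toy data: readings a learner might know for each character.
char_readings = {"山": ["yama", "saN"], "車": ["kuruma", "sha"]}

# The true reading of the whole word, for comparison.
word_by_reading = {"dashi": ("山車", "festival float")}

# Build a guess index: every combination of per-character readings
# points back to the word, so a wrong-but-plausible guess still succeeds.
guess_to_word = {}
word = "山車"
for combo in product(*(char_readings[c] for c in word)):
    guess_to_word.setdefault("".join(combo), []).append(word)

print(guess_to_word["yamakuruma"])  # ['山車']
```

A real system would weight each guessed reading by an error model rather than treating all combinations as equally likely, but the lookup principle is the same.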

3.4 Second language testing

This section provides an overview of the current state of the art in second language vocabulary testing. It is divided into three main parts, considering the role of testing in language study, general testing theory, and finally current testing methods and their limitations. These limitations motivate our work on adaptive randomised testing.

e role of testing

The broad purpose of testing is to make decisions (Murphy and Davidshofer 1998:2). In the context of language learning, the information from tests allows: the appropriate reward for students through grades; the optimal focus of future learning opportunities, for example targeting them to a particular student’s needs; and evidence-based evaluation of course structure and teaching methods based on their measured performance.

Any aspect of language that can be learned can be tested, but since our focus is on vocabulary study, a simple answer to the question, “What should we test?” could be all aspects of word knowledge. However, to do this even for a handful of words is prohibitive, since the testable aspects of knowledge for any individual word are numerous. This dilemma is a form of the bandwidth-fidelity dilemma discussed by Murphy and Davidshofer (1998:143) in the context of testing, and attributed to Shannon and Weaver (1949), where the amount of information conveyed is traded off against the accuracy with which it is conveyed. In general, tests bias towards breadth because testing depth is quite hard, given that the connections between words form a significant part of word knowledge (Meara 2009:74). In other words, knowing a word at depth means knowing its relationship to words around it, so testing a word at depth means testing these relationships too.

As this indicates, breadth and depth are now believed to be strongly intertwined, and a comparison of depth and breadth tests by Vermeer (2001) found strong correlation between them. This supports the validity of focusing on breadth in testing, for example in the

vocabulary levels tests (Nation 2001:412),24 provided that such tests are appropriately constructed. In the levels test, learners are tested on words sampled from a variety of frequency bands. By observing the tail-off from more frequent bands to less frequent bands, the learner’s lexicon size can be estimated.
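The tail-off estimate can be sketched in a few lines. The per-band scores below are invented for illustration; a real levels test would derive them from sampled test items in each 1000-word frequency band.

```python
# Sketch of a levels-test lexicon estimate: sample words from each
# 1000-word frequency band, observe the proportion answered correctly,
# and sum the implied number of known words per band.
band_size = 1000
correct_per_band = {1: 0.95, 2: 0.80, 3: 0.45, 4: 0.15}  # invented scores

estimated_lexicon = sum(p * band_size for p in correct_per_band.values())
print(round(estimated_lexicon))  # 2350
```

The characteristic tail-off from frequent to rare bands is what makes the sum a plausible whole-lexicon estimate rather than just a test score.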

A related and important consideration is the distinction between receptive and productive aspects of word knowledge discussed earlier in Section 3.1. If nearly any aspect of word knowledge can be composed of both receptive and productive parts, which should we test? Many researchers believe the receptive/productive distinction to be related to depth of word knowledge, and that receptive skills are a subset of productive skills. This view is supported by the fact that subjects score significantly higher on receptive tests than productive tests (Nation 2001), indicating they are easier. As a secondary effect, subjects’ scores increase if they are tested in the same manner that they are taught (i.e. productively or receptively). However, productive vocabulary tests are notoriously difficult to construct, since a question might typically require the subject to produce a particular word, but will not reward them for unexpected but correct alternatives (Meara 2009:34). Many early tests constructed this way thus yielded poor reliability, with widely varying scores on retests.

A rethink of productive testing, in the form of the Lex30 test (Meara 2009:37),25 elicited free word associations in order to obtain a representative sample of learner vocabulary knowledge. The resulting scores, based on the occurrence frequency of the words a learner provided, yield an index of productive vocabulary knowledge, although they do not as yet estimate it directly. By avoiding the issue of requiring a particular answer to be produced by the learner, the reliability of the method is increased. However, it remains unable to test a particular word of interest, instead testing a learner’s productive vocabulary as a whole.

Finally, an experiment performed by Mondria (2007) found that studying receptively and productively yielded equivalent receptive recall, but that productive study takes longer. She concluded that productive study (and testing) should only take place if productive ability was important.

24http://www.lextutor.ca/tests/
25http://www.lognostics.co.uk/tools/Lex30/index.htm

Testing theory

Modern testing theory aims to guide test design so as to improve the various quality attributes of tests, and provides statistical models for constructing and analysing tests. This section provides a brief overview of recent changes in testing theory, and aims to highlight three long-running and related trends: firstly, the trend from subjective to objective testing; secondly, the trend from estimating “true” test scores to estimating participant ability; and thirdly, the trend from pen-and-paper testing to computer-based testing. For a more comprehensive overview of modern testing theory than this section allows, we refer the reader to Hambleton and Jones (1993).

Two core concepts are central to any discussion of testing: validity and reliability. A test has validity to the extent that it actually measures the underlying attribute intended. For example, a written test might be a poor method of measuring oral language ability, and might thus have low validity. Validity can also refer to the logical validity of arguments (decisions) made on the basis of test results. Reliability, on the other hand, is the extent to which each individual’s test score is consistent and repeatable – free from noise or error – and is measured over a sample of test participants.

The trend towards objective testing is motivated by the desire to improve the validity and reliability of tests. Subjective tests are tests where the outcome is affected by participant, examiner and marker bias, and these biases in turn reduce test validity. Objective tests are tests which are unaffected by these biases, and they strongly favour question formats with only one correct answer, for example multiple-choice questions or simple scales, thus avoiding the subjectivity of marking more open-ended question types. A side-effect of this trend is that the human element is removed from marking tests, allowing tests to be assessed automatically by computer.
A parallel trend has occurred in the statistical modelling of tests and test responses: a shift from modelling test scores to modelling participant ability directly. This is most evident in the gradual transition from Classical Test Theory (CTT) to Item Response Theory (IRT). Both of these theories provide different and competing methods of formally assessing reliability. Classical Test Theory models a test score as composed of an individual’s “true score” over the test and an error term, which should be minimised. Item Response Theory instead models each individual question as contributing information about a participant’s ability. This information is usually described in the form of an item-characteristic curve, the plot of the likelihood of answering correctly as a function of participant ability (Bull and McKenna 2004:81). Crucially, this per-question modelling allows construction of large item banks which are drawn upon in designing tests to provide a desired level of reliability in ability estimates. Together, the shift to modelling ability and the calibration of individual questions raise the prospect of adaptive tests, where each individual is shown test items which are maximally relevant to determining their ability level. However, adaptive tests were infeasible before the advent of computer-based testing, our final trend under discussion.

Computer-based testing offers many benefits and disadvantages over traditional pen-and-paper testing. For example, provided that tests are objective, marking can be done instantaneously by computer, thus providing direct feedback to participants. The cost of administering tests to large populations is also significantly reduced. The disadvantages include the large set-up costs even when testing small populations, the logistics of computer resource availability and use during test administration, and medium-of-delivery effects (Chalhoub–Deville and Deville 1999). However, computer-based testing in combination with large item banks calibrated with IRT makes computer-adaptive testing feasible, since personalised tests can be constructed on-the-fly in response to an individual’s estimated ability so far. Adaptive tests are simply tests where the questions asked vary for each examinee. If examinees are shown different questions, then the use of the total number of questions correctly answered as a measure of test performance becomes meaningless.
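The item-characteristic curve just described is commonly parameterised as a two-parameter logistic function. A small sketch, not tied to any particular IRT package:

```python
import math

# Two-parameter logistic (2PL) item-characteristic curve:
# a is the item's discrimination, b is its difficulty.
def p_correct(theta: float, a: float, b: float) -> float:
    """Probability an examinee of ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the examinee has exactly a 50% chance of success;
# harder items (larger b) shift the curve to the right.
print(round(p_correct(theta=0.0, a=1.0, b=0.0), 2))  # 0.5
print(round(p_correct(theta=0.0, a=1.0, b=2.0), 2))  # 0.12
```

Calibrating a and b for every item in a bank is what lets an adaptive test pick, at each step, the item whose curve is most informative at the examinee's current ability estimate.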
Indeed, use of IRT to model item difficulty is usually a prerequisite for this form of testing; each response from the individual improves the estimate of their ability, which in turn is used to determine which question would maximally refine that estimate. In this way a test can be made more reliable, shorter, or potentially both. The main problem from which computer-adaptive tests still suffer is one of limited item bank size. This means that even adaptive tests reduce in validity when individuals re-test themselves, since participants will encounter, with high probability, the same questions as they did in earlier tests. Within a limited domain, that of vocabulary testing, this thesis aims

                         ⊕ Recall                      ⊖ Recognition
⊕ Active                 Translate the L1 word into    Choose the correct L2
(retrieval of form)      the correct L2 word.          translation for this L1 word
                                                       amongst distractors.
⊖ Passive                Translate the L2 word into    Choose the correct L1
(retrieval of meaning)   the correct L1 word.          translation for this L2 word
                                                       amongst distractors.

Table 3.4: Four types of question used by Laufer and Goldstein (2004) in CATSS, each indicating the task the learner must perform. For each condition ⊕ indicates increased difficulty, ⊖ indicates decreased difficulty.

to show that this limitation can be overcome through the use of randomised questions generated on-the-fly.

Vocabulary testing and drilling

Drills are a short, fast, throwaway form of testing in which the act of testing itself forms part of the learning process. If testing is used to make decisions, then drills are used to make immediate and short-term decisions about what to study next. Indeed, this description matches very closely the flashcard software we described earlier, in which this simple form of testing is used. However, flashcards limit drills to simple forms of questions, with ultimately limited linguistic motivation. This section asks: what types of questions are used in testing, and which are promising for automated testing?

A popular form of testing is the vocabulary levels test, as discussed earlier, which tests receptive knowledge of words (Nation 2001). As an extension to this form of testing, Laufer and Goldstein (2004) describe the CATSS (Computer Adaptive Test of vocabulary Size and Strength) test. Instead of our earlier view of vocabulary knowledge as multidimensional, Laufer and Goldstein consider knowledge of a word to be unidimensional but continuous, ranging from superficial familiarity to the ability to use the word correctly in open speech. Furthermore, they align the receptive/productive knowledge distinction along this axis of word proficiency, and consider four scenarios for test questions, as given in Table 3.4.

Let us use 山 yama “mountain” as our example, with English the L1 and Japanese the L2. In the active condition, the L1 meaning “mountain” is given, and the learner has to either recall the L2 form 山 or recognise it amongst distractors. In the passive condition the L2 form 山 is given and the learner has to either supply its L1 meaning “mountain” or recognise it amongst distractors (e.g. “city”, “hill”, “vehicle”, “river”). These correspond to the hypotheses that accessing the meaning from the form is easier than accessing the form from the meaning, and likewise that recognising a word is easier than producing it.
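The four conditions in Table 3.4 can be sketched as a simple question generator. The distractors below are invented, and the function and variable names are ours rather than anything in CATSS:

```python
import random

# Sketch of the four CATSS-style question types for a single item,
# with 山 yama "mountain" as the L2 word and English as the L1.
item = {"l2": "山", "l1": "mountain"}
distractors = {"l2": ["川", "丘", "車"], "l1": ["river", "hill", "vehicle"]}

def question(active: bool, recall: bool) -> str:
    # Active: L1 is given, L2 is the target; passive is the reverse.
    src, tgt = ("l1", "l2") if active else ("l2", "l1")
    q = f"Give the {tgt.upper()} equivalent of {item[src]!r}"
    if recall:
        return q  # open response: the learner must produce the answer
    # Recognition: present the answer hidden amongst distractors.
    options = distractors[tgt] + [item[tgt]]
    random.shuffle(options)
    return q + " from: " + ", ".join(options)

print(question(active=True, recall=True))    # active recall (hardest)
print(question(active=False, recall=False))  # passive recognition (easiest)
```

The two boolean flags reproduce the ⊕/⊖ axes of the table: active vs passive selects the direction of translation, recall vs recognition selects open response vs multiple choice.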
CATSS is adaptive in that, having hypothesised these levels of word knowledge, the lower levels of knowledge are only tested once a higher level has failed. Furthermore, the test is structured into levels based on 1000-word frequency bands, on which the learner is tested serially. Although selection of an appropriate frequency band would seem a prime candidate for further adaptive testing, CATSS instead allows individual learners both to skip ahead to later levels, and to determine for themselves when they have reached their cut-off point and stop the test.

In contrast to adaptive tests, we now consider drills, and their relationship to these tests. Zock and Quint (2004) proposed developing drill tutors from dictionaries based on pattern drills. Since Japanese has freely available dictionary resources, it seems a prime candidate for such drills. This work was later continued (Zock and Afantenos 2007) and developed into a system for drilling basic conversational patterns. These patterns aim not just to aid vocabulary growth, but to improve automaticity of knowledge, an additional commonly postulated dimension of vocabulary knowledge.

If we compare such drills to formalised tests, several factors emerge. Formalised tests often have much at stake for learners, so the emphasis on the reliability and validity of test outcomes is large. This means that strong tests are developed and evaluated for their ability to discriminate amongst candidates. In contrast, drills are typically characterised by flashcard-like simplicity, and are prized for their variety and, importantly, their coverage of the words under study. Since drill systems are used as vocabulary study methods, their users require this coverage to ensure that their desired vocabulary is acquired.
The amount of manual work required to generate the more linguistically interesting test questions – for example those used in the Simple Kanji Aptitude Test (SKAT) described by Toyoda and Hashimoto (2002) – means that full coverage will never be achieved by these traditional methods. Furthermore, the manual labour involved, and the subsequently limited number of test questions, means that these tests are impractical for use as a means of self-study; the number of tests available is small, and after the first exposure, subsequent exposures have greatly weakened validity. However, the flashcard drills which allow timely self-evaluation lack the sophistication of current adaptive tests, which use statistical modelling to quickly home in on learner aptitude.

One of the main contributions of this thesis is in bridging these two methods of testing, generating linguistically motivated drills which have the potential for use in adaptive testing. Questions for these drills are designed to be equivalent to questions manually developed for paper tests or computer-adaptive tests, but because they are generated automatically, they also have the coverage and availability required for successful use in learner drills.

3.5 Conclusion

In this chapter we discussed a range of issues relating to vocabulary study for Japanese. In particular, we argued that supporting the early reading process is the best way to help learners attain the large vocabulary they need for fluency. Of the many lexical relationships available, we identified near-homography as a promising resource yet to be utilised for aiding learner study. In Chapter 4 to follow, we examine near-homography in depth and attempt to determine the accessibility of a word’s orthographic neighbourhood. We also argued that dictionary resources and testing could both be greatly improved with better error modelling. Graphemic relationship models from Chapter 4 then provide a basis for extending the FOKS dictionary to allow partial knowledge search by kanji in Chapter 5. Ultimately, we combine both phonetic and graphemic confusion models into a system for adaptive testing in Chapter 6, before concluding with our overall findings.

Chapter 4

Orthographic similarity

4.1 Introduction

Overview

Chapter 3 has made clear the need for more accessible dictionaries for languages such as Japanese and Chinese. We now concern ourselves with the perceptual process of identifying characters, in particular the behaviour of perception within dense visual neighbourhoods. Within the dictionary accessibility space, we are motivated by the potential to correct confusion errors, but also to leverage the mental associations provided by visual proximity to allow advanced learners to find unknown characters faster. We develop this form of lookup later in Chapter 5. In order to do so, we require some formal notion of the distance between two characters, so as to distinguish near-neighbours, which are plausibly confusable or mentally associated, from higher-distance pairs. This chapter thus concerns itself with accurate modelling of graphemic similarity.

Distance within alphabetic orthographies

Within alphabetic orthographies such as English, distance metrics are used at the word level or higher, rather than at the character level. This is natural, since alphabetic characters individually do not have associated meanings. At the word level, the most common


distance metric is Levenshtein distance (or edit distance). The most natural application is spelling correction, since words with spelling errors are usually visually similar to their correct form.

English level    Example                       Japanese level    Example
Sentence         The waiter misunderstood      Sentence          私は医者になった。
                 my order.
Word             waiter, misunderstood, my     Word              は, 医者, に
Morpheme         mis-, understood              Character         医, 者
Character        m, i, s                       Component         匚, 矢
                                               Stroke            一, …

Figure 4.1: Drawing rough equivalence between the linguistic units used in English and Japanese.

Figure 4.1 provides a comparison of the linguistic units in English and Japanese. In Japanese, the individual kanji level is roughly equivalent to the morpheme level for English, with free morphemes such as 日 hi “day” which are words in their own right, and bound morphemes such as 向 which can only occur as part of a larger word. For example, 向 can occur with conjugational suffixes (向かい mukai “facing” or 向かう mukau “to face”) or as part of larger compounds (向心力 kōshiNryoku “centripetal force”). Although the stroke level in Japanese roughly aligns to the character level in English, it lacks both the phonemic contribution to whole-word pronunciation and the linear spatial continuity of characters within an English word.

Between strokes and characters lie common stroke groupings known as radicals. Some radicals are themselves characters, such as 日 hi “day/sun” and 月 gatsu “month/moon” within 明 aka(rui) “bright”. Others are known short-form variants of full kanji, for example the radical 扌 as a short form for 手 te “hand” and 氵 as a short form for 水 mizu “water”. Finally, some radicals have no whole-character equivalents, though their semantics are well

known, such as 艹 “grass”. So far all example radicals presented have been atomic. However, combinations of radicals may themselves be considered radicals with stable semantics or pronunciation. The complexity of kanji compared to English morphemes or words thus demands new methods of measuring similarity.

4.2 Distance models: a first approach

Form of models

We can formalise our distance models as follows. Let K be the set of all kanji. A distance model is a function d : K × K → R+, mapping kanji pairs to the value of the distance between them. In practice, each d is composed of two functions ϕ : K → F and d′ : F × F → R+, where ϕ maps each kanji to its representation in some intermediate feature space F, and d′ is a distance function on F. There are three additional desirable constraints on our choice of d. We require, for all

ka, kb, kc ∈ K:

d(ka, kb) = 0 ⇐⇒ ka = kb identity of indiscernibles (4.1)

d(ka, kb) = d(kb, ka) symmetry (4.2)

d(ka, kc) ≤ d(ka, kb) + d(kb, kc) triangle inequality (4.3)

Equation 4.1 ensures that d can distinguish between different characters, Equation 4.2 meets our intuition that similarity is a symmetric relationship between characters, and Equation 4.3 simply imposes additional regularity on the geometry that d generates on K. If these three requirements are met, then (K, d) is a metric space, and d is a metric on K.

These requirements in turn constrain our intermediate functions ϕ and d′. To satisfy them, d′ should also be a metric on F. Furthermore, if two kanji have an identical feature representation, then d′ will not be able to distinguish between them, and thus their composition will not meet Equation 4.1. For this reason, ϕ must be injective, mapping each kanji to a unique representation in F. Having scoped possible forms of d, we now consider two simple distance models.
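The three axioms (4.1)–(4.3) can be checked empirically for any candidate distance function over a small kanji sample. A brute-force sketch, with function names of our own invention:

```python
from itertools import product

# Check the metric axioms (4.1)-(4.3) for a candidate distance d over
# every triple drawn from a small sample of kanji.
def is_metric(d, sample):
    for ka, kb, kc in product(sample, repeat=3):
        if (d(ka, kb) == 0) != (ka == kb):
            return False  # identity of indiscernibles (4.1)
        if d(ka, kb) != d(kb, ka):
            return False  # symmetry (4.2)
        if d(ka, kc) > d(ka, kb) + d(kb, kc) + 1e-9:
            return False  # triangle inequality (4.3)
    return True

# The discrete metric trivially satisfies all three axioms.
print(is_metric(lambda x, y: 0 if x == y else 1, ["木", "林", "森"]))  # True
```

Such a check only demonstrates the axioms over the sample, of course; establishing them for all of K requires an argument about ϕ and d′ themselves, as above.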

Bag of radicals model

We have established that radicals are highly salient features of kanji, as components with their own semantic and phonemic information. Indeed, they are the primary means of looking up an unknown character using the traditional lookup method.1 At the radical level, the primary data set available is radkfile,2 which provides radical-membership data for each kanji, and serves as the basis of the WWWJDIC multi-radical kanji lookup system.3 The natural feature space is thus that of sets of radicals, where kanji are decomposed into their constituents, as shown in Figure 4.2.

新 → {八, 立, 十, 辛, 木, 斤} 薪 → {艹, 八, 立, 十, 辛, 木, 斤}

Figure 4.2: Kanji decomposed using radkfile into their naive radical-member feature set representations.

If we take the natural choice of ϕ, mapping kanji directly to their radical sets, we immediately find that two kanji can have the same feature representation. Examples of kanji which are mutually indistinguishable in this manner are shown in Table 4.1. This occurs for several reasons. Firstly, radical membership naturally discards both the position and number of radicals, and many small sets of kanji can be found which differ only in these aspects. For example, 木, 林 and 森 differ only in the number of the same primary radical (木) used, and are thus considered identical. Similarly, kanji pairs can be found in which only the position of the radicals differs, for example 略 and 畧 (radicals: 田, 口, 夂).

1The first known Chinese dictionary to index via radicals was the 说文解字 [shuōwén jiězì], written in the 1st century A.D. by 许慎 [Xǔ Shèn]. It chose one radical from each character, usually the left-most or top-most semantic radical, as the header under which the character could be found.
2http://www.csse.monash.edu.au/~jwb/kradinf.html
3http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1R

Identity group        Radicals       Reason
木, 林, 森            木             number of 木 radicals
略, 畧                田, 口, 夂     position of 田 radical
拐, 招                扌, 口, 刀     position of 口 and 刀 radicals
万, 丑, 乃, 乍, 垂    ノ, 一, |      ambiguity of subcomponents

Table 4.1: Groups of characters which share an identical feature-set representation according to radkfile.

Secondly, there is some small amount of noise generated by the ambiguity in the transition from the stroke level to the radical level. For example, 乃 nai “whereupon” is itself a radical, but is listed as containing the single-stroke radicals ノ, 一, and |. Whilst these may be useful from a dictionary lookup perspective, from a similarity perspective they add noise. The vast majority of characters are either themselves simple non-stroke radicals or are built completely from such radicals, but there are still many characters for which such a description is inaccurate, since they either contain known radicals with extra strokes, or they are composed of stroke groups too rare to be considered radicals. We get around both problems by adding each kanji to its radical set, using

ϕradical(k) = radicals(k) ∪ {k} (4.4)

This ensures a unique representation for each kanji. We can now add a distance metric onto our feature space. We choose a function based on cosine similarity, a commonly used measure in information retrieval, as given in Equation 4.5 below.

dradical(x, y) = 1 − |ϕradical(x) ∩ ϕradical(y)| / √(|ϕradical(x)| · |ϕradical(y)|)        (4.5)

Having focused on modelling at the radical level, we now consider the whole-kanji level as an alternative means of modelling similarity.
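Before moving on, the bag-of-radicals model can be sketched directly. The decompositions below are hand-copied from Figure 4.2 rather than read from radkfile:

```python
import math

# Toy radical decompositions, copied from Figure 4.2.
radicals = {
    "新": {"八", "立", "十", "辛", "木", "斤"},
    "薪": {"艹", "八", "立", "十", "辛", "木", "斤"},
}

def phi(k):
    """Equation 4.4: add the kanji itself to its radical set, so that
    every kanji has a unique feature representation."""
    return radicals.get(k, set()) | {k}

def d_radical(x, y):
    """Cosine-based distance over radical sets (cf. Equation 4.5)."""
    fx, fy = phi(x), phi(y)
    return 1.0 - len(fx & fy) / math.sqrt(len(fx) * len(fy))

# 新 and 薪 share six radicals, so their distance is small;
# any kanji is at distance zero from itself by construction.
print(round(d_radical("新", "薪"), 3))
print(d_radical("新", "新"))  # 0.0
```

Note that adding the kanji itself to the set (Equation 4.4) is what keeps identical-decomposition pairs like those in Table 4.1 at a non-zero distance from one another.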

L1 norm across rendered images

In Section 3.2, we briefly discussed evidence for stroke-level processing in visual character recognition. Indeed, the limited real-life confusability data which is available suggests that stroke-level contributions to whole-character similarity are a source of confusion. For example, a native-speaker analysis of FOKS error logs determined that 基 ki, moto “basis” and 墓 bo, haka “grave/tomb” were confused by a learner during search (Bilac et al. 2003). This example shows that learners can mistake very similar-looking kanji with few if any shared radicals, provided there are sufficient similar-looking strokes in similar positions.

The ultimate representation of how kanji are displayed to users is the actual pixels shown on-screen. These will vary by font, colour, size, placement and context. We naturally abstract away context, and ignore size and colour by rendering each kanji to a fixed-size black-and-white image. By choosing rendered images (or rather the intensity of each pixel in a rendered image) as our feature space, we discard any additional knowledge we may have about each kanji’s internal structure, and instead consider kanji as arbitrary symbols.

Image distance metrics may be used directly for image similarity search (Zhang and Lu 2003), or embedded within image recognition algorithms such as Radial Basis Function Support Vector Machines, Principal Component Analysis and Bayesian Similarity (Wang et al. 2005). For many applications, distance models must typically be scale-, translation- and rotation-invariant (Yang and Wang 2001), transformations which humans easily ignore but which require sophisticated algorithms to counteract.
Fortunately, these issues do not occur in our limited feature space: Chinese characters occupy regular-sized blocks when typeset, whereas roman characters are variable-sized, and must undergo kerning during typesetting to ensure that the perceived spacing between characters is even. This difference also exists at a pedagogical level: practice books for children learning to write English provide horizontal lines so that they learn to write with uniform height, whereas Japanese practice books contain a square grid so that children learn to write with uniform size. Since kanji are rendered aligned with each other, it suffices to use a simple image distance metric. We thus choose the L1 norm from the family of Lp norms known as Minkowski distances. The L1 norm, also known as the Manhattan or taxi-cab distance, is a simple baseline in the image similarity literature:

L1(x, y) = Σ_{i,j} |p_x(i, j) − p_y(i, j)|    (4.6)

The L1 norm is very simple to calculate, but is known to be sensitive to relatively small image perturbations. For our purposes, its output may be affected by changes to the rendering method, in particular the size of the rendered images and the font used for rendering. We considered an image size of 80×80 pixels to be sufficiently detailed, and used this in all experiments described here. To attain reasonable font independence, the same calculation was done over 5 commonly available fonts, then averaged. The fonts used were: Kochi Gothic (medium gothic), Kochi Mincho (thin mincho), Mikachan (handwriting), MS Gothic (thick gothic), and MS Mincho (thin mincho). The graphics program Inkscape4 was used to render them non-interactively.
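The per-pixel sum of Equation 4.6 and the font-averaging step can be sketched as follows. This is a hedged illustration over plain 2-D lists of pixel intensities standing in for the actual 80×80 renders; the function names are ours, not from any rendering library.

```python
def l1_distance(img_x, img_y):
    """L1 (Manhattan) distance between two equally-sized greyscale
    images, given as 2-D lists of pixel intensities (Equation 4.6)."""
    return sum(
        abs(px - py)
        for row_x, row_y in zip(img_x, img_y)
        for px, py in zip(row_x, row_y)
    )

def font_averaged_l1(renders_x, renders_y):
    """Average the L1 distance over parallel renderings of the same
    two characters in several fonts, as done in our experiments."""
    distances = [l1_distance(x, y) for x, y in zip(renders_x, renders_y)]
    return sum(distances) / len(distances)
```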

We expect that the L1 norm is likely to underestimate the perceptual salience of repeated stroke units (i.e. radicals), and thus underestimate radical-level similarity, except where identical radicals are well aligned. Nevertheless, we expect it to correlate well with human responses where stroke-level similarity is present. Pairs scored as highly similar by this method should thus also be rated as highly similar by human judges. We now discuss an experiment aimed at collecting data to evaluate these two metrics.

4.3 Similarity experiment

Experiment outline

In order to effectively evaluate our preliminary similarity models, we conducted an exploratory web experiment with the aim of collecting a set of gold-standard orthographic similarity judgements. Participants were first asked to state their first-language background and level of kanji knowledge, pegged to one of the levels of either the Japanese Kanji Aptitude Test5 or the Japanese Language Proficiency Test.6 Participants were then exposed to a number of pairs of kanji, in the manner shown in Figure 4.3, and asked to rate each pair on a five-point graded similarity scale. The number of similarity grades chosen represents a trade-off between rater agreement, which is highest with only two grades, and discrimination, which is highest with a large number of grades. Each kanji displayed was rendered as a 50×50 pixel image in MS Gothic font, for consistency across browsers. Whilst this did not guarantee a fixed visual size for each kanji rendered, since participants may have used computer displays of differing spatial pixel density, it was nonetheless expected to reduce the variation in a kanji’s visual size across participants. Although participants included both first and second language readers of Chinese, only Japanese kanji were included in the stimulus. Chinese hanzi and Japanese hiragana and katakana were not used as stimulus, in order to avoid potential confounding effects of character variants and of differing scripts. The pairs were also shuffled for each participant, with the ordering of kanji within a pair also random, in order to reduce any effects caused by participants shifting their judgements part-way through the experiment.

4 http://www.inkscape.org
5 日本漢字能力検定試験: The Japanese government test of kanji proficiency intended for native speakers, which is initially tied to Japanese grade school levels, but culminates at a level well above that expected of high-school graduates.
6 日本語能力試験: The Japanese government general-purpose test of Japanese aptitude for second-language learners of Japanese. The test is administered by the Japan Foundation.

Figure 4.3: Example stimulus pair for the similarity experiment. This pair contains a shared radical on the left.

Each participant was exposed to a common set of 65 control pairs, discussed in Section 4.3 below. A further 100 random kanji pairs were shown where both kanji were within the user’s specified level of kanji knowledge (where possible), and 100 where one or both kanji were outside the user’s level of knowledge. This was in order to determine any effects caused by knowing a kanji’s meaning, its frequency, its readings, or any other potentially confounding properties. Web-based experiments are known to provide access to large numbers of participants and a high degree of voluntariness, at the cost of self-selection (Reips 2002). Although participants of all language backgrounds and all levels of kanji knowledge were solicited, the nature of the experiment and the lists it was advertised to biased participants towards mainly English, Chinese or Japanese first-language backgrounds.

Control pairs

There are many possible influences on orthographic similarity judgements which we hoped to detect, in order to determine whether the data could be taken at face value. A sample pair and a description of each control effect is given in Figure 4.4. Since the number of potential effects considered was quite large, the aim was not statistical significance for the presence or absence of any effect, but rather guidance in similarity modelling should any individual effect seem strong. All frequency and co-occurrence counts were taken from 1990–1999 Nikkei Shimbun corpus data.

Results

The experiment had 236 participants, with a dropout rate of 24%. The participants who did not complete the experiment, and those who gave no positive responses, were filtered from the data set. The remaining 179 participants are spread across 20 different first languages. Mapping the responses from “Very different” as 0 to “Very similar” as 4, the mean response over the whole data set was 1.06, with an average standard deviation for each stimulus across raters of 0.98. The full data set is available in unfiltered form online.7 To measure inter-rater agreement, we consider the mean rank correlation across all pairs of raters. Although the kappa statistic is often used (Eugenio and Glass 2004), it underestimates agreement over data with graded responses. The mean rank correlation for all participants over the control set was strong at 0.60. However, it is still lower than that for many tasks, suggesting that many raters lack strong intuitions about what makes one kanji similar to another. Since many of the first-language backgrounds had too few raters for significant analysis, they were reduced to larger groupings of backgrounds, with the assumption that all alphabetic backgrounds were equivalent. Firstly, we group first-language speakers of Chinese (CFL) and Japanese (JFL). Secondly, we divide the remaining participants from alphabetic backgrounds into second language learners of Japanese (JSL), second language learners of Chinese (CSL), and the remainder (non-CJK). Participants who studied both languages were put into their dominant language based on their comments, or into the JSL

7 http://ww2.cs.mu.oz.au/~lljy/datasets/#kanjiexp

Effect type               Example  Description
Frequency (independent)   会 店    Frequency of occurrence of each kanji individually. Both kanji in the example pair are high-frequency.
Co-occurrence             法 考    Both kanji co-occur with high frequency with some third kanji. For example, 法 hō “Act (law)” occurs in 法案 hōaN “bill (law)”, and 考 kaNga(e) “thought” occurs in 考案 kōaN “plan, idea”.
Homophones                弘 博    Both kanji share a reading. In the example, both 弘 hiro(i) “spacious” and 博 haku “doctor” share the reading hiro. For 博 this is a name reading.
Stroke overlap            策 英    Both kanji share many similar strokes, although no radicals are shared.
Shared elements           働 動    Both kanji share one or more graphical elements. These elements might occur in any position.
Shared structure          幣 哲    Both kanji share the same structural break-down into sub-components, although the sub-components differ.
Stroke count              奮 撃    Pairs comparing and contrasting stroke counts. Both examples here have a very high stroke count.
Part of speech/function   方 事    Both kanji have a common syntactic function.
Semantic similarity       千 万    Both kanji are semantically similar. In the example, they are both numbering units.

Figure 4.4: Groups of control pairs used, with an example for each. Parts of readings in brackets indicate okurigana, necessary suffixes before the given kanji forms a word.

Group     N    Mean response  Mean rank correlation
All       179  1.06           0.60
Non-CJK   88   1.44           0.59
JSL       57   0.72           0.65
JFL       18   0.57           0.69
CSL       10   0.80           0.64
CFL       6    0.55           0.64


Figure 4.5: Participant responses grouped by language background, measured over the control set stimulus. On the left we give the mean response for each group; on the right, the mean pairwise rank correlation between raters of the same group.

group in borderline cases.8 Figure 4.5 shows mean responses and agreement data within these participant groups. This grouping of raters is validated by the marked difference in mean responses across the groups. The non-CJK group shows high mean responses, which are then halved for second language learners, and lower still for first language speakers. Agreement is higher for the first-language groups (JFL and CFL) than the second-language groups (JSL and CSL), which in turn have higher agreement than the non-speakers. Together, these results suggest that with increasing experience, participants were more discerning about what they found to be similar, and more consistent in their judgements.

Evaluating similarity models

Normally, with high levels of agreement, we would distil a gold-standard data set of similarity judgements, and evaluate any model of kanji similarity against it. Since agreement for the experiment was not sufficiently high, we instead evaluate a given model against all rater responses in a given rater group, measuring the mean rank correlation between the model and all individual raters in that group.

8 Many alternative groupings were considered. Here we restrict ourselves to the most interesting one.

We also have reference points for determining good levels of agreement, by measuring the performance of the mean rating and the median rater response in this way. The mean rating for a stimulus pair is simply the average response across all raters to that pair. The median rater response is the response of the best performing rater within each stimulus set (i.e. the most “agreeable” rater for each ability level), calculated using the above measure.
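The rank-correlation machinery used throughout this evaluation can be sketched as follows: Spearman’s ρ computed as the Pearson correlation of rank vectors (with average ranks for ties), averaged over all rater pairs. This is an illustrative re-implementation, not the code used in the thesis.

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def mean_pairwise_correlation(raters):
    """Mean rho over all rater pairs; each rater is a list of
    graded responses (0-4) to the same stimulus pairs."""
    rhos = [spearman(raters[i], raters[j])
            for i in range(len(raters))
            for j in range(i + 1, len(raters))]
    return sum(rhos) / len(rhos)
```

The same `spearman` function also serves to score a model against an individual rater, as done in the model evaluation below.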

Model evaluation

The pixel and radical models were evaluated against human judgements in various participant groups, as shown in Figure 4.6, and can be compared to the mean rating and median raters. The pixel-based similarity method exhibits weak rank correlation across the board, but higher correlation with increasing kanji knowledge. The radical model, however, shows strong rank correlation for all groups but the non-CJK, and better improvements in the other groups. These results match our predictions for the pixel-based approach, in that it performs reasonably, but remains only an approximation. The radical method’s results, however, are of a comparable level of agreement within the CFL and JFL groups to the median rater, a very strong result. It suggests that native speakers, when asked to assess the similarity of two characters, make their judgements primarily based on the radicals which are shared between the two characters. Intuitively, this makes sense. Native speakers have the greatest knowledge of radicals, their meaning, and their semantic and phonetic reliability. They also have the most experience in decomposing kanji into radicals for learning, writing and dictionary lookup.

Group     Mean   Median  L1     dradical
Non-CJK   0.69   0.55    0.34   0.47
CSL       0.60   0.65    0.38   0.56
CFL       0.51   0.62    0.44   0.66
JSL       0.64   0.70    0.43   0.59
JFL       0.56   0.69    0.46   0.68
All       0.65   0.62    0.39   0.54

Figure 4.6: Rank correlation of pixel and radical models against raters in given participant groups. Mean and median raters provided as reference scores.

Band          Mean   Median  L1     dradical
[0, 1)        0.69   0.55    0.34   0.47
[1, 200)      0.62   0.60    0.38   0.53
[200, 600)    0.64   0.69    0.41   0.61
[600, 1000)   0.69   0.72    0.46   0.52
[1000, 2000)  0.56   0.70    0.46   0.65
[2000, ...)   0.58   0.73    0.48   0.70

Figure 4.7: Rank correlation of pixel and radical models against raters across bands of kanji knowledge. Each band contains raters whose number of known kanji falls within that band’s range.


Figure 4.8: Histograms of scaled responses across all experimental stimulus pairs, taken from mean rating, pixel and bag of radical models. Responses were scaled into the range [0, 1].

The radical model still has poor correlation with the non-CJK group, but this is not an issue for applications, since similarity applications primarily target either native speakers or learners, who either already have or will pick up the skill of decomposing characters into radicals. To attempt to determine when this skill is picked up, Figure 4.7 shows agreement when raters are instead grouped by the number of kanji they claimed to know, based on their proficiency level. Aside from the [600, 1000) band, there are consistent increases in agreement with the radical method as more kanji are learned, suggesting that the change is gradual, rather than sudden. Indeed, learners may start by focusing on strokes, only to shift towards using radicals more as their knowledge of radicals improves. The response histograms in Figure 4.8 show stark differences between human responses and the two models. The radical model considers the majority of stimuli to be completely dissimilar. Once it reaches stimulus pairs with at least one shared radical, its responses are highly quantised. The pixel model in comparison always finds some similarities and some differences, and is distributed normally. Human responses for our experiment lie somewhere in between the pixel and radical models, featuring a much smaller number of stimuli which are completely dissimilar, and a shorter tail of high similarity than found with the pixel model. A potentially significant parameter which we do not investigate here is the relationship between the human response histogram and the visual size of the kanji stimulus presented. One participant commented that they commonly confuse kanji for one another, but only at small font sizes, not at the size presented in our experiment. It remains open whether the same experiment with kanji rendered at smaller sizes would have elicited increased judgements of high similarity, and equally the extent to which noise (as measured by rater agreement) would be affected. Full consideration of this parameter is beyond the scope of our investigation.

4.4 Distance models: a second approach

Our first attempts at modelling similarity gave several insights, relating both to modelling and to evaluation. Our radical metric captured coarse-grained salient features of kanji, but largely ignored the organisation of these features within a kanji. Our pixel model in contrast responded well to structure, but more poorly to salient substructure. In this section we

describe an improvement to the radical model and two new metrics which aim to reach a middle ground between these features. Our previous experiment sampled the space of kanji pairings randomly for the majority of human judgements sought. This was done in order to remove potential bias from human selection of pairs. However, what it unintentionally highlighted was the rarity of high-similarity pairs: the vast majority of characters are simply quite distinctive from one another. For any given character, there appear to be only a handful or fewer high-similarity neighbours with which it might reasonably be confused. We are thus virtually assured that a randomly chosen pair of kanji will be distinct from one another, rather than similar. Our random sampling thus had the unintended effect of limiting our measurement to low-similarity pairs. For applications, detection of and accuracy over high-similarity pairs is far more important. For this reason, we introduce two new data sets for evaluation, and compare metrics over these data sets.

Models

Bag of radicals with shape

We discussed earlier the salience of radicals, and developed our bag-of-radicals model

dradical. However, this model ignores the position of radicals, which is known to be important in similarity judgements, and also the number of times each radical occurs within a kanji. To address the multiplicity of radicals and the findings of Yeh and Li’s (2002) study, we set the above metric to unit distance whenever the two characters differ in their basic shape. We approximate the broad shape by the first part of each kanji’s 3-part SKIP code, which can take the values horizontal, vertical, containment or other. SKIP codes for each kanji are provided by kanjidic (introduced in Section 3.1), and radical membership by radkfile (introduced in Section 4.2).

This change allows the adjusted metric, dradical+shape, to distinguish between examples with repeated components. The final metric aims to capture the visual and semantic salience of radicals in kanji perception, while also taking into account some basic shape similarity.
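The shape adjustment can be sketched as a thin wrapper around the bag-of-radicals distance. The SKIP first-part values and radical sets below are illustrative stand-ins; real values come from kanjidic and radkfile respectively.

```python
import math

# Hypothetical SKIP first-part codes (1 = horizontal split,
# 2 = vertical split, 3 = containment, 4 = other).
SKIP_SHAPE = {"持": 1, "待": 1, "困": 3}
RADICALS = {
    "持": {"扌", "土", "寸"},
    "待": {"彳", "土", "寸"},
    "困": {"囗", "木"},
}

def d_radical(x, y):
    """Bag-of-radicals distance (Equation 4.5)."""
    a, b = RADICALS[x], RADICALS[y]
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

def d_radical_shape(x, y):
    """d_radical forced to unit distance whenever the two kanji
    differ in their broad SKIP shape class."""
    if SKIP_SHAPE[x] != SKIP_SHAPE[y]:
        return 1.0
    return d_radical(x, y)
```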

[Figure content: worked examples of each metric. dradical compares radical sets, e.g. 持 {扌, 土, 寸} versus 待 {彳, 土, 寸}; L1 compares rendered images, e.g. 乗 versus 東; dstroke compares stroke sequences, e.g. 日 {丨, ㇆, 一, 一} versus 目 {丨, ㇆, 一, 一, 一}; dtree compares the hierarchical tree representations of 持 and 待.]

Figure 4.9: A summary of our kanji distance metrics

Stroke edit distance

We saw earlier in Figure 4.1 that just as individual letters occupy the lowest orthographic level for English, individual strokes do the same for Japanese. Spelling correction in English has largely focused on mistakes at the letter level, since computer input for English is mediated by the keyboard, and the keyboard is a per-letter input device. In Japanese, with computer input mediated by IME software, stroke errors in input cannot occur unless the mistaken character is also a homophone of the desired character. Without such a source of errors to correct, stroke-based distance metrics have not to our knowledge been examined for Japanese. However, since strokes are the lowest unit from which radicals and kanji are constructed, metrics focusing on the stroke level should be able to capture very fine distinctions between characters, and thus distinguish between highly similar pairs. A major obstacle to such metrics has historically been the lack of data sets which describe kanji at the stroke level. Fortunately, Apel and Quint (2004) created a hierarchical data set for Japanese kanji which provides such a description. Each kanji is specified by its strokes, grouped into common stroke groups (components), and broken down in a hierarchical manner into relative positions within the kanji (for example: left and right, top and bottom). The strokes themselves are based on a taxonomy of some 26 stroke types (46 including sub-variants). Each kanji has a fixed stroke order, a single correct order in which the strokes are hand-written to construct the kanji. The strokes are provided by the data set in this order. Ideally, we might wish to transfer successful distance metrics from alphabetic orthographies and examine their performance on Japanese. The most common metric used is edit distance: the minimum cost over all edit paths – sequences of insertions, deletions and substitutions which transform one string into another. How might we adequately represent kanji as linear strings, given their two-dimensional layout and the hierarchical data available? Although kanji are two-dimensional, the order in which strokes are written is fixed not only for each character, but also for each component reused across characters. The ordered list of strokes used to write a character thus provides the representation we need to use edit distance, and effectively serves as a signature for each character. In terms of Apel and Quint’s

data set, this signature is formed from the leaves of each kanji’s tree representation, which represent strokes. Taking the edit distance over such signatures gives our dstroke metric. Figure 4.9 shows example signatures for 日 hi “sun” and 目 me “eye”, where each stroke type is represented by an alphanumeric code. For example, the signature for 日 is the ordered sequence of strokes {丨, ㇆, 一, 一}, internally represented by the alphanumeric codes {3, 11a, 2a, 2a}. In its generalised form the edit distance is the minimum cost over all edit paths from one string to another, where the cost of edit operations may be arbitrary and positive. However, we make the simplifying assumption that all stroke types are equally distinctive from one another. This allows us to use a unit cost for all edit operations. We also normalise the edit distance so as to avoid penalising larger kanji on account of their stroke count alone. Although we discard structural features of each character to create its signature, much useful information is preserved. Since radicals within each character are written in sequence, they form contiguous blocks within stroke signatures. The edit distance thus recognises shared radicals whenever their positions are similar enough. The order of radical blocks in a signature also reflects their position as part of the larger compound, since components are usually drawn in a left-to-right, top-to-bottom order. Finally, it provides a smooth blending from stroke similarity to radical similarity, and can recognise the similarity between pairs like 日 hi “sun” and 目 me “eye”.
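The dstroke metric reduces to unit-cost Levenshtein distance over stroke-type sequences. A minimal sketch follows, using the 日/目 signatures from the text; the normalisation by the longer signature is our assumption here, standing in for whatever normalisation the full implementation uses.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def d_stroke(sig_x, sig_y):
    """Edit distance over stroke-type signatures, normalised (here,
    by the longer signature) so that high stroke counts alone are
    not penalised."""
    return edit_distance(sig_x, sig_y) / max(len(sig_x), len(sig_y))

# Stroke signatures for 日 "sun" and 目 "eye" as alphanumeric
# stroke-type codes ({3, 11a, 2a, 2a} and {3, 11a, 2a, 2a, 2a}).
sun = ["3", "11a", "2a", "2a"]
eye = ["3", "11a", "2a", "2a", "2a"]
```

For this pair the raw edit distance is 1 (a single inserted stroke), correctly marking 日 and 目 as close neighbours.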

Tree edit distance

We discarded much of the hierarchical data to generate stroke signatures, in order to use string edit distance as a metric. This carries the implicit assumption that hierarchical information is not as useful as stroke information when measuring similarity. To test this assumption, we also created an additional metric which uses the entire tree representation:

the ordered tree edit distance between kanji tree representations, dtree. Tree edit distance is defined as the minimum cost edit path between one tree and another, where an edit path is any sequence of node insertions, deletions and relabellings which transforms one tree into another (Bille 2005). These operations closely mirror the edit operations used in string edit distance. As with string edit distance, a cost function can be specified giving arbitrary positive symbol-dependent costs to each type of edit operation. We again make a simplifying assumption, that all structural aspects are equally important, allowing us to use unit cost for each edit operation. Figure 4.9 provides an overview of the structure of each kanji’s tree representation, though the actual trees we generate also contain phonetic elements, radicals, and stroke groups whose strokes are in non-contiguous blocks. For our kanji trees, generalised tree edit distance completely subsumes stroke edit distance. By this we mean that tree edit distance can perfectly emulate stroke edit distance if an appropriate cost function is chosen, say one which gives zero-cost edit operations for non-stroke nodes. This is possible regardless of the cost function used for stroke edit distance. From this perspective, a comparison between these two methods is only a comparison between cost functions for tree edit distance. However, both the algorithmic complexity and the implementation of string edit distance are far simpler than those of tree edit distance. For two kanji i and j, let si be the number of strokes in kanji i, and ni the number of nodes in its tree representation. Then stroke edit distance takes O(si sj) time. In comparison, using Demaine et al.’s (2007) optimal decomposition algorithm – which the authors generously provided sample code for – our tree edit distance implementation had worst-case time complexity O(ni nj² (1 + log(ni/nj))). This is a significant improvement over earlier algorithms, such as those described by Bille (2005), but in practice our implementation remained two orders of magnitude slower than that for string edit distance.

Evaluation

We evaluated all four kanji distance metrics over three data sets. Firstly, we re-evaluated over the earlier similarity experiment data (from Section 4.3) in order to compare the newly added metrics against those already discussed. However, this data set had the acknowledged weakness of not including enough high-similarity pairs. We sought to rectify this by finding a source of expert similarity judgements, in the form of the White Rabbit JLPT Level 3 kanji flashcard set. Each flashcard contains one or two highly similar neighbours which might be confused with a given kanji. We use this set to determine our likely performance in a search task. Finally, we elicited further native speaker judgements in a similarity pool experiment, and again compared each of the metrics on this new data set. We discuss each of these evaluations below.

Similarity experiment data

Calculating the rank correlation ρ averaged over raters in each group, as we did earlier, gives the results shown in Figure 4.10.


Figure 4.10: Mean value of Spearman’s rank correlation ρ, calculated over each rater group for each metric.

The results show the same broad patterns as before in terms of language ability: the mean rank correlation increased as participants’ knowledge of Japanese increased. However, the dradical+shape metric dominates the other metrics, including the original dradical, at all levels of knowledge. This confirms the salience of radicals and the tendency for individuals to classify kanji by their broad shape, as suggested by Yeh and Li (2002). L1, dstroke and dtree perform poorly in comparison. Interestingly, despite large differences between them for non-speakers, all perform equivalently for native speakers.

Despite the overall poor performance of our new metrics, we have been able to improve on dradical by adding shape information. We now describe the flashcard data set, and evaluate over it for comparison.

Flashcard data set

Having identified problems with our earlier experiment due to the lack of high-similarity stimulus pairs, we looked for a source of such pairs. A series of kanji flashcards developed by White Rabbit Press provided this source. Each card is designed to allow the study of a single kanji, for example 晩 in Figure 4.11. For each kanji, one or two visual neighbours are provided whose visual similarity to the kanji being studied makes them potentially confusable. We used their JLPT 3 flashcards, providing such visual neighbours for 245 kanji in total. We have provided these similarity pairs online for further investigation and use.9

Figure 4.11: An example White Rabbit Press flashcard for the kanji 晩 baN “evening”. Note the two large characters on the right, provided due to their visual similarity with the main kanji.

We took two different approaches to evaluation. Firstly, for each high-similarity pair (a pivot kanji and its distractor), we randomly selected a third kanji from the jōyō character set10 which we combined with the pivot to form a second pair. By virtue of the large number of potential kanji pairings, most of which bear no real similarity, this second pair is

9 http://ww2.cs.mu.oz.au/~lljy/datasets/#whiterabbit
10 The “common use” government kanji set, containing 1945 characters.

Metric           Accuracy
L1               0.954
dtree            0.952
dstroke          0.926
dradical         0.736
dradical+shape   0.603
random baseline  0.500

Table 4.2: Accuracy at detecting which of two pairs (flashcard vs. random) has high similarity

highly likely to be of lower similarity than the first pair. We then compared how well each metric could classify the two pairs by imposing the correct ordering on them, measured as classification accuracy. The results of this evaluation are shown in Table 4.2. We include a theoretical random baseline of 0.500, since any decision has a 50% chance of being successful.
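This evaluation procedure can be sketched as follows. The function below is a hedged illustration of the setup, assuming a distance-style metric (smaller means more similar); the metric, flashcard pairs and kanji pool in the usage example are toy stand-ins for the real data.

```python
import random

def classification_accuracy(metric, flashcard_pairs, kanji_pool, seed=0):
    """For each (pivot, distractor) flashcard pair, draw a random
    third kanji from the pool and count how often the metric rates
    the flashcard pair as more similar (smaller distance) than the
    random pair."""
    rng = random.Random(seed)
    correct = 0
    for pivot, distractor in flashcard_pairs:
        other = rng.choice(
            [k for k in kanji_pool if k not in (pivot, distractor)])
        correct += metric(pivot, distractor) < metric(pivot, other)
    return correct / len(flashcard_pairs)
```

For instance, with a toy character-distance metric `lambda a, b: abs(ord(a) - ord(b))`, the pair ("b", "c") is always ranked above a pair formed with a distant character, giving accuracy 1.0.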

One immediate and surprising result is that the dradical+shape and dradical metrics which dominated the previous task perform far worse than the other metrics. This suggests that they are poor at distinguishing between high- and medium-similarity pairs, though our earlier evaluation suggests they broadly order examples correctly across the whole spectrum, as shown by their performance on the similarity experiment data. Their precision is simply too low for these high-similarity cases, but it is precisely these cases we are interested in for useful search and error correction. The three other metrics have accuracy above 0.9 on this task, indicating the ease with which they can distinguish such pairs. However, this does not guarantee that the neighbourhoods they generate will be free from noise, since the real-world prevalence of highly similar characters is likely to be very low. For applications, it is most important that the few true high-similarity neighbours are accurately reported by the chosen metric, and are not swamped by erroneous candidates. To better evaluate the retrieval performance of each metric, we considered the main flashcard kanji to be a query and its neighbours, ranked in order of proximity, as the retrieved documents for this query. We used the high-similarity flashcard neighbours as the full set of relevant documents, implicitly considering all other kanji to be irrelevant.

Metric          MAP    p@1    p@5    p@10
dtree           0.320  0.313  0.149  0.094
dstroke         0.318  0.310  0.151  0.099
L1              0.271  0.257  0.139  0.089
dradical+shape  0.211  0.197  0.087  0.063
dradical        0.177  0.144  0.085  0.065

Table 4.3: e mean average precision (MAP), and mean precision at N ∈ {1, 5, 10} over the flashcard data

                dtree   dstroke  L1      dradical  dradical+shape
dtree           -       0.934    0.026   <0.001    <0.001
dstroke         0.934   -        0.027   <0.001    <0.001
L1              0.026   0.027    -       <0.001    <0.001
dradical        <0.001  <0.001   <0.001  -         0.254
dradical+shape  <0.001  <0.001   <0.001  0.254     -

Table 4.4: Pairwise significance figures for MAP scores over the flashcard data, shown as p values from a two-tailed Student’s t-test

For each query we calculated the average precision (AP) for the given metric, according to Equation 4.7, where P (r) is the precision at a rank cutoff r and rel(r) is a binary function set to 1 when the retrieved result at rank r is a relevant document.

AP = \frac{1}{n_{\mathrm{relevant}}} \sum_{r=1}^{n_{\mathrm{retrieved}}} P(r) \times rel(r)    (4.7)

Using all queries gives a vector of AP values for each metric. Taking the mean of each vector yields the mean average precision (MAP) statistic for each metric. MAP is widely used in Information Retrieval to evaluate document ranking methods. This makes it suitable for use with similarity metrics, since for many applications the most important feature of a metric is the ranking of neighbours which it provides. A strength of MAP is thus that it evaluates metrics in a manner that corresponds closely to their intended use. A disadvantage is that it assumes that all relevant documents are equally relevant, whereas in practice we may find important differences in proximity even between high-similarity neighbours. In practice this is not a limitation, since the high-similarity neighbours provided by the flashcard set are provided to us unranked.
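As a concrete sketch of Equation 4.7, the following computes AP and MAP for ranked neighbour lists. The function names and data layout (a ranked list per query plus a set of relevant neighbours) are illustrative, not the thesis's implementation.

```python
def average_precision(ranked, relevant):
    """AP as in Equation 4.7: accumulate precision-at-r at each rank r
    holding a relevant neighbour, then divide by the number of
    relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for r, neighbour in enumerate(ranked, start=1):
        if neighbour in relevant:
            hits += 1
            total += hits / r        # P(r), counted only where rel(r) = 1
    return total / len(relevant)

def mean_average_precision(queries):
    """queries: iterable of (ranked_neighbours, relevant_set) pairs,
    one per flashcard kanji treated as a query."""
    aps = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(aps) / len(aps)
```

A metric's MAP is then simply the mean of its per-query AP vector; the same vector is what a pairwise significance test between two metrics would operate on.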

Table 4.3 shows the MAP score for each metric, along with the precision at N neighbours, for varied N. The precision statistics confirm the rough ranking of metrics found in the earlier classification task, with the radical-based metrics performing worst. In this more difficult task, the L1 norm is outperformed by dstroke and dtree. Pairwise significance figures as given in Table 4.4 were calculated by performing a two-tailed Student's t-test on the average precision vectors for each pairing of metrics. Most differences in MAP scores are significant at the 95% confidence level, with a few key exceptions that are telling of the relationship between the metrics. dstroke and dtree give a p value of 0.934, indicating that their performance is nearly identical for this task. The two radical-based methods give a p value of 0.254, which gives low confidence that their performance genuinely differs. Both of these results are unsurprising, given the similarity between the two forms of edit distance and between the two radical-based methods.

Distractor pool experiment

The flashcard data, though providing good examples of high-similarity pairs, suffers from several problems. Firstly, the constraints of the flashcard format limit the number of high-similarity neighbours which are presented on each flashcard to at most two; in some cases we might expect more. Secondly, the methodology for selecting high-similarity neighbours appears subjective. For these reasons, we conducted an experiment to attempt to replicate the flashcard data.

100 kanji were randomly chosen from the JLPT 3 set (hereafter pivots). For each pivot kanji, we generated a pool of possible high-similarity neighbours in the following way. Firstly, the pool was seeded with the neighbours from the flashcard data set. We then added the highest-similarity neighbour as given by each of our similarity metrics. Since these could overlap, we iteratively continued adding an additional neighbour from each of our metrics until the pool contained at least four neighbours.

Native or native-like speakers of Japanese were solicited as participants. After performing a dry run, each participant was presented with a series of pivots and their pooled neighbours, as shown in Figure 4.12. Their task was to select the neighbours (if any) which might be confused for the pivot kanji, based on their graphical similarity. The order of stimuli was randomised for each rater, as was the order of neighbours for each pivot. Kanji were provided as 30×30 pixel images using MS Gothic font, for consistency across browsers and with our earlier similarity experiment (Section 4.3).
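The pooling procedure just described can be sketched as follows. The function name and the `metric_rankings` layout (each metric's neighbours of the pivot, best first) are hypothetical stand-ins for the actual implementation.

```python
def build_distractor_pool(pivot, flashcard_neighbours, metric_rankings, min_size=4):
    """Seed the pool from the flashcard neighbours, then top it up
    round-robin with each metric's next-best neighbour until min_size
    is reached (or candidates run out)."""
    pool = [k for k in dict.fromkeys(flashcard_neighbours) if k != pivot]
    depth = 0
    max_depth = max((len(r) for r in metric_rankings), default=0)
    while len(pool) < min_size and depth < max_depth:
        for ranking in metric_rankings:
            if depth < len(ranking):
                candidate = ranking[depth]
                if candidate != pivot and candidate not in pool:
                    pool.append(candidate)
        depth += 1
    return pool
```

Because metrics frequently nominate overlapping neighbours, several round-robin passes may be needed before the pool reaches the minimum size.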

Figure 4.12: Example stimulus from the distractor pool experiment. For each kanji on the left-hand side, participants could mark one or more of the kanji on the right as potentially confusable visual neighbours.

Three participants completed the experiment, selecting 1.32 neighbours per pivot on average, fewer than the 1.86 per pivot provided by the flashcard data. Inter-rater agreement was quite low, with a mean κ of 0.34 across rater pairings, suggesting that participants found the task difficult. This is unsurprising, since as native speakers the participants are experts at discriminating between characters, and are unlikely to make the same mistakes as learners. Comparing their judgements to the flashcard data set yields a mean κ of 0.37. The full dataset is available for scrutiny and further use online.11

Ideally, this data generates a frequency distribution over potential neighbours based on the number of times they were rated as similar. However, since the number of participants is small, we simply pooled the neighbours with high-similarity judgements for each pivot, yielding an average of 2.45 neighbours per pivot. Re-evaluating our metrics on this data gives the figures in Table 4.5. Compared with the flashcard data set, the ordering and relative performance of metrics is similar, with dstroke marginally improving on dtree, but both outperforming L1 and dradical. The near-doubling of high-similarity neighbours from 1.32 to 2.45 is reflected by a corresponding increase in MAP and precision@N scores, though the effect is somewhat reduced as N increases.

11 http://ww2.cs.mu.oz.au/~lljy/datasets/#poolexp
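The κ statistics above are Cohen's kappa over paired binary judgements (neighbour marked confusable or not). A minimal sketch, assuming the two raters' judgements are parallel 0/1 lists:

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two parallel lists of binary judgements
    (1 = marked as a confusable neighbour, 0 = not)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a1 = sum(rater_a) / n            # rate at which rater A says "confusable"
    p_b1 = sum(rater_b) / n
    # Chance agreement: both say 1, or both say 0, independently.
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```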

Metric            MAP     p@1     p@5     p@10
dstroke           0.406   0.530   0.240   0.146
dtree             0.383   0.560   0.234   0.142
L1                0.349   0.530   0.210   0.123
dradical          0.288   0.350   0.168   0.122
dradical+shape    0.211   0.270   0.122   0.096

Table 4.5: e mean average precision (MAP), and mean precision at N ∈ {1, 5, 10} over the pooled distractor data

                  dstroke  dtree   L1      dradical  dradical+shape
dstroke           -        0.467   0.103   0.002     <0.001
dtree             0.467    -       0.368   0.023     <0.001
L1                0.103    0.368   -       0.115     <0.001
dradical          0.002    0.023   0.115   -         0.004
dradical+shape    <0.001   <0.001  <0.001  0.004     -

Table 4.6: Pairwise significance figures for MAP scores over the pooled distractor data, shown as p values from a two-tailed Student's t-test

Upon checking pairwise significance tests for MAP calculations over this new data set (Table 4.6), some differences become clear. The performance difference between dradical and dradical+shape is significant at the 95% confidence level, rather than the weaker confidence shown over the same measurement for the flashcard data. The differences between dstroke, dtree and L1 are also reduced in significance. Together, these changes could indicate a difference between the flashcard and pooled distractor data, perhaps reflecting the different biases of the experimental participants in comparison to the creators of the flashcard set.

4.5 Discussion

Modelling of graphemic similarity was initially hampered by a lack of data. In this chapter we have identified the White Rabbit Press flashcards as a source of expert judgements, and developed data sets for evaluating similarity metrics through two different experiments. This area is no longer data poor, and we anticipate the development of still better metrics based on this data.

Stroke edit distance and tree edit distance performed best on nearly all our evaluation methods on high-similarity pairs, and indeed were roughly comparable for each task. This suggests that the stroke signatures alone capture much of the structural information important to whole-character similarity, or alternatively that both metrics were able to provide an appropriate level of fuzzy matching between similar components. Fortunately, both metrics also have substantial scope for improvement, merely by providing more principled weights for edit operations; in particular, the tree edit distance has the ability to increase or decrease the estimated salience of different structural features through such weights, and perhaps better fit the reality of human perception.

Although we have focused on Japanese, these models are all equally applicable to Chinese, subject to finding or constructing the requisite data sources, and may suggest approaches for other more general symbol systems which aren't adequately indexed by the existing body of research on image similarity search. Alphabetic scripts lack a tree-based structure, though they may still use stroke sequences, and can easily be rendered as images, indicating that at least some of these metrics are transferable more generally to other languages with simpler scripts.

4.6 Conclusion

Visual similarity is assumed to be the basis for many kanji misrecognition errors, and so accurate kanji similarity models have useful application in error correction and dictionary search. In this chapter we firstly attempted to model similarity by using bag-of-radicals and image difference approaches. In order to evaluate these metrics, we performed a large experiment which directly asked participants to rate the graphical similarity of random pairs of kanji. Our initial evaluation on this data showed best performance for the bag-of-radicals metric, particularly when additional layout information was added, whereas the image distance metric performed poorly. We initially interpreted this to indicate the salience of radicals as sub-components. It remained clear that our metrics were still too noisy for downstream applications, and that the randomly chosen stimulus pairs followed too closely the natural distribution of high-similarity pairs – that is, there were very few in the entire data set. Since most applications rely heavily on high-similarity pairs, the data set was found to be inappropriate for evaluating metrics for their intended purpose. We addressed this issue by using human expert-selected similarity pairs, in the form of potentially confusable pairs borrowed from the White Rabbit Press JLPT 3 & 4 flashcards.

We simultaneously considered two new metrics, both based on Ulrich Apel's tree data for kanji: the stroke edit distance, and the tree edit distance. Both performed comparably poorly to the L1 norm on the original data set, which required only that metrics correctly order medium-to-low similarity pairs. However, on the new flashcard data set, we found a reversal of these results. The radical level which had proved so appropriate for ordering low-to-medium similarity pairs was simply too coarse to differentiate high-similarity pairs, and was outperformed by the other metrics. In particular, stroke and tree edit distances performed comparably best, followed closely by the L1 norm.

To determine how reproducible the flashcard data set was, we performed a small experiment which took a number of stimulus kanji and for each kanji pooled the top-ranked neighbours from each of our similarity metrics into a set of potentially confusable neighbours. We asked native or native-like speakers of Japanese to assess each neighbour pool and mark those neighbours which they considered truly confusable by learners. Since we had few participants, we in turn pooled their affirmative responses for each kanji presented. The pooled responses agreed moderately with the flashcard data, yielding a κ of 0.44, a result better than any individual rater paired with the flashcard data. This suggested that, although individuals may differ in their immediate conscious judgements, aggregated responses across groups of raters hold the promise of matching more closely the human experience of similarity and visual confusion.

We now wish to apply these metrics to error modelling so that we may aid learners in common tasks they perform. Since one of the most used tools in language learning is the dictionary, the following chapter describes our attempts to enhance a dictionary with improved usability and accessibility features. A key feature of this improved dictionary is search by similarity, which makes use of metrics from this chapter and incorporates them into the error models required for lookup. Chapter 6 then takes these confusability models one step further by incorporating them into automatic learner drills. The similarity metrics from this chapter provide the basis for these new models of learner error and the new applications that result.

Chapter 5

Extending the dictionary

In Chapter 3 we identified vocabulary acquisition as a crucial hurdle for language learners, and established the important role of dictionaries in supporting vocabulary learning, particularly in Japanese. We further argued that graphemic relationship modelling was underdeveloped in Japanese; these models were then developed in Chapter 4. This chapter examines an individual approach to dictionary lookup, FOKS, and investigates how improvements to the method – including use of these new graphemic neighbour models – might better support both lookup and retention.

We begin in Section 5.1 with an extended background to the FOKS system, including its original architecture and error models. Our work required re-implementation of FOKS; significant architectural changes from the original system are thus discussed. The remaining three sections then describe improvements to the system. Firstly, Section 5.2 examines the grapheme-phoneme alignment step in FOKS's construction, crucial to providing accurate reading and alternation frequencies for its core error modelling. We improved its accuracy and efficiency by adapting its unsupervised alignment method into a semi-supervised method. Secondly, Section 5.3 investigates micro-structural aspects of dictionary translations, with the goal of improving lookup and ultimately retention of the word once found. Thirdly, Section 5.4 combines FOKS's phonemic error-correction with a novel graphemic error-correcting search, making use of the graphemic neighbourhood modelling from Chapter 4. Indeed, the lack of such correction in FOKS and other dictionaries was a significant motivation for developing those models.


Finally, we look to log analysis in Section 5.5 to evaluate the effect of these changes where possible.

5.1 FOKS: the baseline

Intelligent lookup

Our focus on FOKS requires a recap of its contribution to dictionary search in Japanese. We can summarise the problems learners of Japanese (and native speakers) face when encountering a new word in the following way: words must be input into a computer by pronunciation, but the pronunciation is unknown when the word is unknown. For this reason, users normally have to fall back to a slower and more imprecise lookup method based on visual analysis. This is the basic situation, but this description is incomplete. There are many situations when the learner has partial knowledge of the word they are looking up. For example, the context in which the word occurs may give clues as to the word's meaning – in fact the entire inferring-from-context strategy (Section 3.1) is based on fleshing out this partial knowledge over many exposures to a word. Equally commonly, a kanji compound may be encountered where the learner knows something about each of the kanji involved in the compound. Just as they might guess the meaning by combining the meanings of both elements, they can guess the pronunciation in the same way. In Japanese, such guesses are likely to be wrong for one of three reasons:

1. Each kanji may have several readings, yet a particular compound typically only has a single valid reading. Choosing the correct reading for each kanji is difficult.

2. When readings are combined into a whole, one or more combination effects can occur. For example, 日 nichi "sun" + 本 hoN "origin" combine to form 日本 nippoN "Japan". The most prominent effects are sequential voicing and sound euphony, which we discussed in more detail earlier in Section 2.3.

3. Some compounds have non-compositional readings, i.e. their pronunciation is not based on any of their components, but is rather special to that compound. For example, 山車 dashi "festival float" is non-compositional in this way.

When a learner guesses incorrectly and searches by this pronunciation in a typical dictionary, they will not find the desired word, and will have to resort to slower and more complex searching. However, using the FOKS dictionary, they can search using the expected pronunciation. Even if they choose the wrong reading, or mistakenly apply combination effects, FOKS will seamlessly correct the error and find the word they were looking for quickly and accurately from within the same interface. They may equally use the interface to intentionally search for words by an implausible but compositional reading; it will correct for this as if it were simply an error.

For example, suppose the user wishes to look up 風邪 kaze “common cold”. He or she may know the kanji 風 kaze/fū “wind”, and also 邪 yokoshima/ja “evil, wicked”, and thus guess that the reading for 風邪 is kazeja, one possible combination of readings. However, the correct reading kaze is non-compositional. Figure 5.1 depicts the results of this search in the FOKS interface. Despite the incorrect guess, FOKS still lists the target word with the correct reading in its list of candidates for the guessed reading.

Figure 5.1: FOKS search results for the reading kazeja. The first result 風邪 kaze "common cold" is the desired word.

Architecture

FOKS is best termed a dictionary interface, rather than a dictionary resource; it serves as an advanced indexing scheme constructed on top of an existing dictionary. This underlying dictionary resource could be arbitrary, but in practice is a combination of one or more of the freely available EDICT family of Japanese-English word-level dictionaries maintained by the Electronic Dictionary Research and Development Group.1 This section provides an overview of the architecture of FOKS, as described by Bilac (2005), and thus shows how FOKS is able to recover from incorrect reading guesses. The construction of FOKS has three main stages, as shown in Figure 5.2: grapheme-phoneme (GP) alignment, canonisation and reading generation. The simplest way to explain is to step through FOKS's construction.

[Figure 5.2 depicts the build pipeline: dictionary entries → alignment → segmented entry-reading pairs → canonisation → alternation probabilities → reading generation → scored entry-reading pairs → database.]

Figure 5.2: The architecture of the original FOKS interface.

For each entry in the dictionary which contains kanji, FOKS constructs an exhaustive list of plausible guesses for the word's reading, scored by an estimate of their likelihood (or plausibility). In order to generate these readings, it needs to split words like 神社 jiNja "shrine" into their component segments by aligning word-reading pairs, a process known as grapheme-phoneme alignment, using an unsupervised method described by Baldwin and Tanaka (1999a). The result of this step is a series of GP-aligned words. Once GP-aligned, our example becomes 神|社 ↔ jiN|ja.

These alignments are used in two ways. Firstly, they are fed into a canonisation process which recognises any reading alternations which have occurred. For example, 社's reading ja is recognised to be sha altered by sequential voicing. If r is the segment reading, rc is its canonical reading, and s is its kanji form, then GP-alignment alone provides a frequency distribution for Pr(r|s), whereas canonisation converts this into further distributions for both Pr(r|rc, s) and Pr(rc|s). Secondly, the alignment for each word is used as the basis for reading generation. Error models use the frequency distributions from canonisation to generate a series of plausible readings for each grapheme segment, usually in the form of an alternation from an existing reading. These error models will be discussed further in the following section. Usually the correct reading occurs naturally as one of the plausible readings suggested; in rare cases where this does not occur, the correct reading is inserted into the generated list. Each word's readings are then weighted by the word's corpus frequency Pr(w), pruned using a plausibility threshold, and then stored in the database. From here, lookup is a simple matter of querying against the plausible readings in the database – this is facilitated by a straightforward web interface.

1 http://www.edrdg.org/
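The sequential-voicing side of canonisation can be illustrated with a small romanised consonant table: given an observed segment reading, propose the canonical readings it may be a voiced alternation of. The table below is an illustrative subset, not the full set of alternations the system handles.

```python
# Plain-to-voiced initial consonant pairs (rendaku), romanised.
# An illustrative subset only.
VOICING = {'k': 'g', 's': 'z', 't': 'd', 'h': 'b', 'sh': 'j', 'ch': 'j', 'f': 'b'}

def canonical_candidates(reading):
    """Propose canonical readings that `reading` may be a sequentially
    voiced form of. The reading itself is always a candidate, since it
    may already be canonical."""
    candidates = [reading]
    for plain, voiced in VOICING.items():
        if reading.startswith(voiced):
            candidates.append(plain + reading[len(voiced):])
    return candidates
```

For example, the observed reading ja of 社 yields sha among its candidates, which frequency counts over the aligned dictionary would then confirm as the canonical reading.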

The overall lookup model for a word w given a reading query r_w is as follows:

\Pr(w|r_w) \propto \Pr(r_w|w) \Pr(w) = \Pr(w) \Pr(r_{1 \ldots n}|s_{1 \ldots n}) \approx \Pr(w) \prod_{i=1}^{n} \Pr(r_i|s_i)    (5.1)

Firstly, Bayes' rule shows the results can be ranked by Pr(r_w|w) Pr(w); in this term, Pr(w) is simply modelled by the corpus frequency of words. We then break the reading and words up into GP segments, and approximate by assuming independence of GP segments, giving us the final form in Equation 5.1. This equation relates plausible (mis)readings for kanji with plausible (mis)readings for words.
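Equation 5.1 translates directly into a scoring routine for ranking candidate words against a query reading. The `misreading_model` name and the probability values below are illustrative stand-ins, not FOKS's actual distributions.

```python
def score_candidate(word_freq, segments, query_readings, misreading_model):
    """Score a candidate word for a query, per Equation 5.1:
    Pr(w) * product over i of Pr(r_i | s_i).
    misreading_model maps each kanji segment s_i to a dict of plausible
    readings r_i and their (possibly mistaken) probabilities."""
    score = word_freq  # Pr(w), from corpus frequency
    for s_i, r_i in zip(segments, query_readings):
        score *= misreading_model.get(s_i, {}).get(r_i, 0.0)
    return score
```

Candidates are then sorted by this score in descending order, so that a query like kazeja can still surface 風邪 even though its correct reading is non-compositional.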

Error models

We suggested three main errors made by learners in our recap of lookup using FOKS: inadequate choice of kanji reading, incorrect application of alternation rules for sequential voicing and sound euphony, and non-compositional readings. A more thorough list of error types encountered through extensive log analysis is discussed in Bilac (2005:25). Several of these extra error types are corrected by the original FOKS interface, including:

1. Palatalisation errors. For example, misreading 亜流 aryū “epigone” as arū.

2. Character/suffix co-occurrence errors. For example, misreading 激しい hageshī “violent” as kibishī due to common suffix with 厳しい kibishī “severe”.

3. Insufficient knowledge of proper nouns. Many proper nouns have difficult or archaic readings. For example, the word 上野 has at least 13 distinct readings as proper nouns, including agano, ueno, and uwano for place names, as well as uehara, kamitzuke, toshi for person names.

The last of these errors, insufficient knowledge of proper nouns, is better described as the problem of non-compositional readings.

Limitations

Language learners often have useful information they can use to constrain a query, but are unable to express it in traditional electronic dictionaries. In particular, most only allow partial but correct queries through wildcards. FOKS takes a unique approach amongst Japanese dictionaries in allowing the user to express their partial knowledge – in this case about kanji pronunciation – in terms of a noisy query, from which it then attempts to recover the original word. Despite success with this approach, the original system contained several limitations. A number of other errors were not corrected within FOKS, such as those due to:

1. Phonetic confusion of words or characters. Confusing a kanji with a homophone, and then applying a different reading borrowed from the homophone.

2. Graphemic similarity of words or characters. Confusing a kanji with a near-homograph, and then using a reading of the neighbour. For example, 闇 yami “darkness” being misread as oN due to visual similarity with 音 oN “sound”.

3. Semantic similarity of words or characters. Confusing a word with a near-synonym, and then using a reading of the synonym. For example, 火事 kaji “fire” being misread as kasai due to similarity with 火災 kasai “(disastrous) fire”.

Phonetic, graphemic and semantic links or "associations" are all increasingly the focus of new dictionary interfaces, as discussed in Section 3.3. It is not surprising that they cause errors in general, but it is at least unexpected that all three forms of proximity affect search by pronunciation in measurable ways. The magnitudes of these effects are given in Bilac et al.'s (2004) log analysis: 0.8% of errors were due to graphemic similarity, whereas 0.3% of errors were classified as either grapho-phonetic similarity, semantic similarity or suffix co-occurrence errors. These figures are simply lower bounds though, since only errors corrected by existing error models were logged and analysed. More generally, such errors would not have resulted in successful search, and thus would not be included in the log analysis.

These cases indicate that users sometimes have partial information about the word they wish to find, which they express through their search (intentionally or otherwise). In the well-known tip-of-the-tongue phenomenon (Brown and McNeill 1966), the rememberer can recall parts of the word's pronunciation but not the entire word. Partial information searches are aimed at allowing the learner to still find their word, even in such cases. One significant limitation of FOKS is that it only allows partial knowledge to be expressed in one manner – through pronunciation. Information about form or meaning is ignored. For example, if any kanji are found in the query, FOKS limits the search to exact matches. When FOKS does correct for misreadings, there is often a lack of transparency in the correction method. For example, log analysis determined that 塵芥 chiriakuta "garbage" was reached by the query chiNke, but the type of error made by the user was unclear even to an expert (Bilac 2005:27).
Finally, an important limitation for faster experimentation is the build time for FOKS; the grapheme-phoneme alignment cycle in particular is unsupervised and takes a long time to complete, as does the offline reading generation. If the longest word length is L, and there are W words, then the time complexity for GP alignment is O(W^2 2^{2L}). That is, it scales quadratically with dictionary size, and exponentially with longest word size. This effect is increased as dictionaries increase their coverage and thus their size.

FOKS rebuilt

With many of these limitations in mind, we rebuilt FOKS with several significant changes. Firstly, many offline steps such as reading generation were moved to become online steps occurring at query-time. This had the benefit of faster database rebuilds for easier experimentation, but also meant that plausible misreadings generated by the system could be easily reverse-engineered to explain how they were reached. Beyond these basic architectural changes, several other improvements were made to FOKS, which form the focus of this chapter. In Section 5.2 we examine the grapheme-phoneme alignment algorithm and develop variants suitable for faster experimentation. Section 5.3 then looks at so-called "microstructural" improvements to word translation to improve usability, and extends coverage over place names. Section 5.4 then uses our newly developed graphemic similarity models to provide error-correcting search by grapheme. We conclude with log analysis of these changes in Section 5.5.

5.2 Grapheme-phoneme alignment

Overview

This section considers improvements to FOKS's grapheme-phoneme alignment algorithm; however, these improvements have potential for broader application beyond FOKS. In our description and analysis of this problem, we situate them in this wider context.

The grapheme-phoneme ("GP") alignment task aims to maximally segment the orthographic form of an utterance into morpho-phonemic units, and align these units to a phonetic transcription of the utterance. Maximal indicates the desire to segment grapheme strings into the smallest meaningful units possible. Taking the English example word battleship and its phonetic transcription /bætlʃɪp/, one possible alignment is:

b   a   tt   le   sh   i   p
b   æ   t    l    ʃ    ɪ   p

Note that alignment in general is many-to-many. In the example above, tt aligns to /t/, le aligns to /l/ and sh aligns to /ʃ/. Equally it might be possible for some letters to align to an empty string. This task is challenging for any language without a one-to-one correspondence between individual graphemes and phonemes, as is the case with English (Zhang et al. 1999), Japanese (considering graphemes as kanji characters), and indeed most languages with a pre-existing writing system.

Aside from FOKS, GP alignment is a prerequisite for many applications. For example, the alignment process, and its resulting aligned GP tuples, are a precursor to achieving automated grapheme-to-phoneme mappings for text-to-speech systems such as MITALK (Allen et al. 1987), Festival (Black et al. 1999) and SONIC (Pellom and Hacioglu 2001). Further uses include accented lexicon compression (Pagel et al. 1998), identification of cognates (Kondrak 2003) and Japanese-English back-transliteration (Knight and Graehl 1998; Bilac and Tanaka 2005).

There are several successful approaches to Japanese GP alignment, notably the iterative rule-based approach taken by Bilac et al. (1999), later followed by Baldwin and Tanaka's (1999a) unsupervised statistical model based on TF-IDF. Although these models have high accuracy, their iterative approach has a high computational cost, making them impractical for many real-world applications. For the statistical models, this is partially a consequence of their strongly unsupervised nature. We thus explore the use of the EDICT and KANJIDIC electronic dictionaries (Breen 1995) as a means of constraining the alignment search space and reducing computational complexity. This section examines in detail Baldwin and Tanaka's (1999a) GP alignment method, and alters it to achieve comparable alignment accuracy at a much lower computational cost.
To achieve this goal, we split the task of GP alignment into a pure alignment subtask and an okurigana detection subtask, and compare algorithm variants of pre-existing approaches for both. In Japanese, the GP-alignment problem is simplified somewhat by the convenience of using the syllabic kana script as the phonemic representation, though we will continue to use romanised forms in our examples. A simple example is given in Figure 5.3, where several possible alignments are shown. Kana are convenient in this context because they can occur in both the grapheme and phoneme string, as in the suru suffix shown. Whereas kanji in the grapheme string serve as wildcards, kana serve as inflexible pronunciations which constrain the number of possible alignments.
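The role of kana as anchors can be sketched as a recursive enumeration in which kana tokens must match the phoneme string literally, while each kanji consumes at least one phoneme. The token encoding below is an assumption made for illustration, not the thesis's data structures.

```python
def alignments(tokens, phonemes):
    """Enumerate candidate alignments of a grapheme string to a phoneme
    string. tokens: list of ('kana', s) or ('kanji', c) pairs. Yields
    alignments as lists of (grapheme, phoneme_substring) pairs."""
    if not tokens:
        if not phonemes:       # both strings fully consumed
            yield []
        return
    kind, value = tokens[0]
    if kind == 'kana':
        # Kana are inflexible: they must match their own pronunciation.
        if phonemes.startswith(value):
            for rest in alignments(tokens[1:], phonemes[len(value):]):
                yield [(value, value)] + rest
    else:
        # Kanji act as wildcards: try every non-empty phoneme prefix.
        for i in range(1, len(phonemes) + 1):
            for rest in alignments(tokens[1:], phonemes[i:]):
                yield [(value, phonemes[:i])] + rest
```

For 感謝する with reading kaNshasuru, the literal suru suffix pins down the final four phonemes, leaving only the five ways of splitting kaNsha between the two kanji.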

[Figure 5.3 shows the grapheme string 感謝-suru aligned against the phoneme string kaNsha-suru, with candidate segmentations 感謝|suru ↔ kaNsha|suru and 感|謝|suru ↔ kaN|sha|suru.]

Figure 5.3: The dictionary entry for 感謝する kaNshasuru "to give thanks, be thankful", with two of its potential alignments shown.

There are four main word-formation effects in Japanese which complicate alignment, each of which we discussed in Section 2.3. They are: okurigana, sequential voicing, sound euphony and grapheme gapping. The first case, okurigana, describes inflectional suffixes and how they may change. The remaining three cases describe effects which can occur when kanji form compounds. The last case, grapheme gapping, needs little discussion: it occurs very rarely (under 0.1% of the evaluation set) and is productive only in very limited forms, as noted by Baldwin and Tanaka (1999b). It thus requires no special handling. In general however, each of these effects makes alignment more difficult because they add variability to kanji pronunciation. In the remainder of this section, we firstly describe the existing algorithm in detail before going on to describe our proposed improvements. Finally, we evaluate our results, and discuss the improvements and their re-integration back into FOKS.

Existing algorithm

Overview

We now examine the baseline GP-alignment algorithm used by FOKS, namely Baldwin and Tanaka's (1999a) unsupervised iterative algorithm. A high-level depiction of this

[Figure 5.4 depicts the algorithm as a pipeline: grapheme-phoneme pairs → generate alignments → ambiguous alignments → apply linguistic constraints → solved and remaining ambiguous alignments seed TF-IDF scoring → iterative disambiguation → solved alignments.]

Figure 5.4: The TF-IDF based alignment algorithm

algorithm is given in Figure 5.4. Firstly all potential segmentations and alignments for input entries are created. In general, each entry may have potential segmentations and alignments per segmentation numbering exponentially in the entry's length. As in any alignment task where two strings of length l and m respectively need to be aligned, there are 2^(lm) possible alignments before applying constraints (Brown et al. 1993). Fortunately, some simple linguistic constraints avoid this worst-case number of alignments to consider. Alignments must be strictly linear, each grapheme must align to at least one phoneme, and kana in the grapheme string must align exactly to their equivalents in the phoneme string. Further constraints used to prune entries include matching okurigana to pre-clustered variants and forcing script boundaries (except kanji to hiragana boundaries) to correspond to segment boundaries. Based on the linguistic constraints, we can reasonably expect to have uniquely determined some number of alignments for any sufficiently diverse data set.2 The uniquely determined alignments and the remaining ambiguous alignments are both used separately to seed frequency counts for the TF-IDF model. TF-IDF is a family of models originally developed for IR tasks, combining the TF (term frequency) and IDF (inverse document frequency) heuristics (Salton and Buckley 1988). In the GP alignment task, they mediate the tension between oversegmenting and undersegmenting. The TF value is largest for the most frequently occurring GP pair given any grapheme; an oversegmented alignment produces rarer segments with lower frequency, penalising the TF score. The IDF value on the other hand is largest for segments which occur in a wide variety of contexts, and penalises undersegmenting.

TF-IDF Alignment

We use a modified version of the TF-IDF model which takes into account the differing level of confidence we have in our frequency counts between solved (freqs) and ambiguous (frequ) alignments (Baldwin and Tanaka 2000). For each alignment, we count the occurrence of each grapheme segment ⟨g⟩, of each aligned grapheme-phoneme segment pair ⟨g, p⟩, and of the same pair with one additional pair of context on either side ⟨g, p, c⟩. For any frequency lookup, the ws and wu constants provide a weighting between information from solved and ambiguous alignments:

wtf(x) = ws × freqs(x) + wu × frequ(x)    (5.2)

To score a potential alignment, we calculate the TF and IDF scores for each grapheme-phoneme segment pair and multiply them together as in Equations 5.3-5.5. The score for the whole alignment is the average of the scores for every pair which contains a kanji character, since these are the non-trivial pairs. The constant α is intended as a smoothing factor for the TF and IDF scores. It must be assigned such that 0 < α < wu ≤ ws.

TF(g, p) = (wtf(⟨g, p⟩) − wu + α) / wtf(⟨g⟩)    (5.3)

2Notable exceptions to this are dictionaries of 4-kanji proverbs, such as the 4JWORDS electronic dictionary, whose entries' grapheme forms lack kana to help eliminate possible alignments.

IDF(g, p, c) = log(wtf(⟨g, p⟩) / (wtf(⟨g, p, c⟩) − wu + α))    (5.4)

score(g, p, c) = TF(g, p) × IDF(g, p, c) (5.5)
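Equations 5.2-5.5 can be sketched as follows. This is a hedged illustration only: the count keys ('g', 'gp', 'gpc') and the example weights are our own devising, not the thesis implementation.

```python
import math
from collections import Counter

# Illustrative weights; the text constrains them as 0 < alpha < wu <= ws.
ws, wu, alpha = 2.5, 2.5, 0.05

# Frequency counts seeded from solved and still-ambiguous alignments.
freq_solved, freq_unsolved = Counter(), Counter()

def wtf(x):
    # Equation 5.2: confidence-weighted frequency
    return ws * freq_solved[x] + wu * freq_unsolved[x]

def tf(g, p):
    # Equation 5.3: rare <g, p> pairs (oversegmentation) score low
    return (wtf(('gp', g, p)) - wu + alpha) / wtf(('g', g))

def idf(g, p, c):
    # Equation 5.4: pairs bound to one context (undersegmentation) score low
    return math.log(wtf(('gp', g, p)) / (wtf(('gpc', g, p, c)) - wu + alpha))

def score(g, p, c):
    # Equation 5.5: per-pair score; alignments average over kanji pairs
    return tf(g, p) * idf(g, p, c)
```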

Once all potential alignments have been scored, the highest-scoring alignment is chosen to disambiguate its entry. Its counts are removed from the unsolved pool and added to the solved pool, and the algorithm reiterates with updated counts. In this way entries are iteratively disambiguated until no more remain, and the algorithm is complete.
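The iterative loop just described can be sketched as below; `score_alignment` and `commit` are hypothetical caller-supplied hooks standing in for the TF-IDF scorer and the count bookkeeping:

```python
# Sketch of the iterative disambiguation loop: rescore every remaining
# candidate, commit the single best alignment, update counts, repeat.
def iterative_disambiguate(ambiguous, score_alignment, commit):
    """ambiguous: dict mapping entry -> list of candidate alignments."""
    while ambiguous:
        best_entry, best_alignment, best_score = None, None, float('-inf')
        for entry, candidates in ambiguous.items():
            for alignment in candidates:
                s = score_alignment(alignment)
                if s > best_score:
                    best_entry, best_alignment, best_score = entry, alignment, s
        commit(best_entry, best_alignment)  # move counts unsolved -> solved
        del ambiguous[best_entry]           # entry is now disambiguated
```

Each pass rescans all remaining candidates, which is the source of the quadratic cost discussed next.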

The iterative algorithm is effective but extremely expensive, with two main components to the cost. The first is the number of potential alignments per entry, of which there are exponentially many. In particular, long entries with many kanji and no kana to constrain them have prohibitively large numbers of possible alignments. These cases bloat the number of potential alignments to be rescored on each iteration so much that including them makes our main algorithm infeasibly expensive: the longest few entries together have as many potential alignments as all the remaining dictionary entries combined. The only way to tackle this component is to find additional constraints which will reduce the number of alignments.
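The blow-up for all-kanji entries is easy to quantify: k consecutive kanji aligned to m phonemes, each kanji taking at least one phoneme, admit C(m−1, k−1) alignments (the compositions of m into k parts). A quick sketch:

```python
from math import comb

# Number of alignments of k consecutive kanji over m phonemes, each
# kanji consuming at least one phoneme: compositions of m into k parts.
def alignment_count(num_kanji, num_phonemes):
    return comb(num_phonemes - 1, num_kanji - 1)
```

A 4-kanji proverb with an 8-mora reading already has C(7, 3) = 35 candidates, and the count grows rapidly with longer kanji runs.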

The second cost is in the scoring loop. Suppose there are n word-reading pairs, each with p possible alignments. Then the cost of the iterative rescoring loop is O(n²p²). Even having removed the problem cases above, if p is still high on average, the problem will prove intractable for suitably large n. For example, the evaluation set used by Baldwin and Tanaka (1999a) has 5000 word-reading entries, yet the EDICT dictionary has over 170,000 entries at the time of writing, representing an expected increase in computation time of three orders of magnitude. Although this could be mitigated by simply breaking the input down into smaller subsets for processing, it is desirable to process all the data in the same iterative loop, since this gives greatest consistency of alignment.

Strategies to better constrain alignments and alternatives to iterative scoring form the basis for our attempts at modifying the algorithm. Chapter 5: Extending the dictionary 105

Modified algorithm

The modified algorithm diverges from the unsupervised algorithm in three main respects. Firstly, we separate out okurigana handling into a separate step after alignment, benefiting both efficiency and error measurement. Secondly, a reading model is introduced based on the KANJIDIC electronic dictionary3 and is used to disambiguate the majority of remaining cases before the TF-IDF model is reached. Thirdly, we provide a maximum alignment size cutoff above which we use a simplified non-iterative alignment algorithm which meets resource constraints for problem cases. We discuss these changes below.

Separating okurigana handling

The okurigana handling in the original algorithm involves pre-clustering okurigana alternates, and attempting to restrict alignments to match these alternates wherever possible. Whilst this constraint does help reduce potential alignments, it also limits the application of the stronger constraint that script boundaries in the grapheme string must correspond to segment boundaries (i.e. every occurrence of a kanji–hiragana script boundary must be considered as a potential okurigana site). If okurigana detection is left as a post-processing task, we can strengthen this constraint to include all script boundaries, instead of omitting kanji-to-hiragana boundaries. This in turn provides a larger gain than the original okurigana constraint, since more entries are fully disambiguated. The GP-alignment task is then split into two parts: a pure alignment task, which can be carried out as per the original algorithm, and a separate okurigana detection task. This redesign also allows us to separately evaluate the error introduced during alignment, and that introduced during okurigana detection, and thus allows us to experiment more freely with possible models.

Short and long entries

Ultimately, any method which considers all possible alignments for a long entry will not scale well, since potential alignments increase exponentially with input length. We can however extend the applicability of the algorithms considered by simply disambiguating long entries in a non-iterative manner. The number of potential alignments for an entry can be estimated directly from the number of consecutive kanji. Our approach is to simply count the number of consecutive kanji in the grapheme string. If this number is above a given threshold, we delay alignment until all the short entries have been aligned. We then use the richer statistical model to align all the long entries in a single pass, without holding their potential alignments in memory. Although long entries were not an issue in our evaluation set, a threshold set experimentally to 5 consecutive kanji worked well using the EDICT dictionary as input, where such entries can prove difficult.

3http://www.csse.monash.edu.au/~jwb/kanjidic.html
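The long-entry filter amounts to a scan for the longest run of consecutive kanji; a minimal sketch (function names are our own):

```python
# Sketch of the long-entry filter: estimate alignment blow-up from the
# longest run of consecutive kanji, deferring entries above a threshold.
def is_kanji(c):
    # rough check over the main CJK Unified Ideographs block
    return '\u4e00' <= c <= '\u9fff'

def longest_kanji_run(grapheme_string):
    best = run = 0
    for c in grapheme_string:
        run = run + 1 if is_kanji(c) else 0
        best = max(best, run)
    return best

def is_long_entry(grapheme_string, threshold=5):
    return longest_kanji_run(grapheme_string) > threshold
```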

Reading model

For the pure alignment task, we added an additional reading model which disambiguates entries by eliminating alignments whose single kanji readings do not correspond to those in the Kanjidic and KANJD212 electronic dictionaries. These dictionaries list common readings for all kanji in the JIS X 0208-1990 and JIS X 0212-1990 standards respectively, covering 12154 kanji in total. Effectively, we are applying the closed world assumption and allowing only those alignment candidates for which each grapheme unit is associated with a known reading. Only in the instance of over-constraint, i.e. every GP alignment containing at least one unattested reading for a grapheme unit, do we relax this constraint over the overall alignment candidate space for the given grapheme string. A simple example of disambiguation using the reading model is that of 一両 ichi-ryou “one vehicle”, as shown in Figure 5.5. Since only one of the potential alignments is compatible with the known readings, we select it as the correct alignment. As an indication of the effectiveness of the reading model, our initial constraints uniquely determine 31.1% of the entries in the EDICT dictionary.4 The reading model disambiguates a further 60.6% of entries, effectively decreasing the input to the iterative alignment algorithm by an order of magnitude, to the remaining 8.3%.
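The closed-world filtering with relaxation on over-constraint can be sketched as follows; the data structures are hypothetical stand-ins for the Kanjidic-derived reading tables:

```python
# Sketch of the reading-model constraint: keep only alignments in which
# every kanji segment carries an attested reading, relaxing if none survive.
def filter_by_readings(candidates, known_readings):
    """candidates: alignments as lists of (grapheme, reading) pairs;
    known_readings: dict mapping kanji -> set of attested readings."""
    def attested(alignment):
        # graphemes absent from the table (kana, multi-kanji segments)
        # are left unconstrained
        return all(r in known_readings.get(g, {r}) for g, r in alignment)
    surviving = [a for a in candidates if attested(a)]
    return surviving if surviving else candidates  # relax on over-constraint
```

For the 一両 example, only the いち|りょう split survives the attested readings of each kanji.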

4http://www.csse.monash.edu.au/~jwb/edict.html

[Figure: candidate alignments of 一両 ichi-ryou, with those incompatible with the attested kanji readings eliminated]

Figure 5.5: Disambiguation using the reading model

Heuristic variants

We could continue to use the original TF-IDF model over the residue which is not disambiguated by the reading model, although the type of input has changed considerably after passing through the reading model. Since the reading model is likely to fully disambiguate any entry containing only single kanji segments, the only remaining ambiguous entries are likely to be those with solutions containing multi-kanji segments (which do not occur in either KANJIDIC or KANJD212); an instance of a multi-kanji segment is our earlier example 風邪 kaze “common cold”. With this in mind, we compare the original TF-IDF model (our baseline) with similar models using TF only, IDF only, or random selection to choose which entry/alignment to disambiguate next.

Okurigana detection

We similarly wish to determine what form of okurigana detection and realignment model is most appropriate. Since the majority of entries in the EDICT dictionary (our main experimental data set) which contain potential okurigana sites (i.e. kanji followed by hiragana) do contain okurigana in some form, we use as our baseline the simple assumption that every such site is an instance of okurigana. In this manner, the baseline simply removes every kanji-to-kana segment boundary. As a small enhancement, the boundary is not removed if the tailing kana segment is one of the hiragana particles no, ga or ni, which frequently occur alone. We consider three alternative okurigana models to compare to our baseline, of increasing complexity and expected coverage. Firstly, the Kanjidic dictionary contains common okurigana suffixes for some kanji with conjugating entries. Thus our first model uses these suffixes verbatim for okurigana detection. The coverage of okurigana suffixes in Kanjidic is somewhat patchy, so in our second model, in addition to Kanjidic suffixes, we also perform a frequency count over all potential okurigana sites in the EDICT dictionary, and include any occurrences above a set threshold as okurigana. Finally, most instances of okurigana are due to verb conjugation. As well as taking straight suffixes from the previous models, this final model harvests verbs from EDICT. Most verb entries in EDICT have a tag marking them as ichidan, godan or suru verbs.5 The verb type and stem allow us to conjugate regular verbs variously, giving us a large number of new okurigana suffixes not present in the previous models. In order to improve accuracy, all three methods fall back to the baseline method if they do not detect any okurigana.
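The baseline model with its particle exception can be sketched in a few lines (names and particle set as described in the text; this is an illustration, not the FOKS code):

```python
# Sketch of the baseline okurigana model: merge every kanji-to-kana
# segment boundary unless the kana segment is a standalone particle.
PARTICLES = {'の', 'が', 'に'}

def is_kanji(c):
    return '\u4e00' <= c <= '\u9fff'

def merge_okurigana(segments):
    """segments: list of grapheme segments from the alignment step."""
    out = []
    for seg in segments:
        if (out and is_kanji(out[-1][-1])
                and not is_kanji(seg[0]) and seg not in PARTICLES):
            out[-1] += seg  # treat trailing kana as okurigana
        else:
            out.append(seg)
    return out
```

So 感|謝|する becomes 感|謝する, while a particle such as の in 日|の|出 is left as its own segment.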

Evaluation

Having teased apart the alignment and okurigana detection algorithms, we are in a position to separately evaluate their performance. Our test set for the combined task consists of 5000 randomly chosen and manually aligned examples from EDICT, from which we then separated out an individual evaluation set for each sub-task. Since we are also interested in efficiency, we provide execution time as measured by elapsed time on a Pentium 4 desktop PC. Our emphasis however is on the relative time taken by different algorithms rather than the exact time as measured. In the following subsections we first evaluate alignment, then okurigana detection, and finally we assess performance over the combined task.

5The tagset for EDICT verbs is larger than this, but the additional tags largely mark subclasses and exceptions of the three main classes, which we ignore for the sake of simplicity.

                       Random   TF     IDF    TF-IDF
Iterative              47.8     23.7   94.7   93.4
Single-pass            47.3     23.6   90.5   90.8
Iterative + kanjidic   94.4     92.9   98.0   97.9

Table 5.1: Alignment accuracy across models.

                       Random   TF      IDF     TF-IDF
Iterative              0:10     24:10   22:47   21:54
Single-pass            0:10     0:11    0:09    0:10
Iterative + kanjidic   0:12     0:27    0:24    0:24

Table 5.2: Alignment execution time across models in minutes and seconds.

Alignment

We first compare the accuracy of the three main alignment algorithm variants along with several scoring variants for each, as given in Table 5.1. The Iterative methods pre-populate the frequency distributions with all potential alignments, and on each iteration they rescore all potential alignments and resolve the single best. Since rescoring every potential alignment each iteration is expensive, we also provide a Single-pass variant for comparison. The Single-pass method begins with the same alignment model, but uses only a single scoring round which determines the best alignment for every entry at once. The Iterative + kanjidic method uses readings from kanjidic as soft constraints on potential alignments; in practice this results in a significant reduction in the number of ambiguous alignments. After some experimentation, parameter values of 0.05 for α, and 2.5 for ws and wu were found to yield the best results, and were hence used to generate the results we discuss here. For each of the non-random heuristics, we expect that the iterative version will achieve higher accuracy than the non-iterative version, since the statistical model is rebuilt each iteration adding the best example from the last. As such, this represents a time/accuracy trade-off, a fact confirmed by our data (see Table 5.2). The gain (2% in the case of TF-IDF, 4% for IDF alone) comes at the cost of an order of magnitude larger execution time, which also increases exponentially with the number of input entries.

In contrast, the Kanjidic model consistently achieves a very high accuracy regardless of the heuristic chosen. A large number of entries are immediately disambiguated by the Kanjidic model, thus initially improving accuracy and then facilitating use of more accurate statistics in the iterative algorithm without significant penalty to efficiency. We also expect the Kanjidic model's execution time to scale more moderately with the number of input entries than the original iterative algorithm, since a far lesser proportion of the entries require iterative disambiguation. Comparing the individual heuristics at this stage, a surprise is that the IDF heuristic attains equivalent results to the TF-IDF heuristic, suggesting that broad occurrence of ⟨g, p⟩ pairs is a good indicator of their alignment probability. The TF heuristic in comparison performs worse than simply choosing randomly, suggesting that the proportion of times a grapheme occurs as the current ⟨g, p⟩ pair is a very poor indication of its alignment probability. From a different viewpoint, if TF guards against over-segmentation, then over-segmentation is not an issue in this task, where our goal is the maximal segmentation of the GP pair.

Okurigana detection

We now compare the performance of our okurigana detection algorithms. All the algorithms we compare are linear in the size of the input and thus run in much less time than the alignment phase, so efficiency is not a significant criterion in choosing between them. The accuracy found by each model is shown in Table 5.3.

Model              Accuracy
Simple             98.1%
Kanjidic           98.3%
Co-occurrence      97.7%
Verb conjugation   97.7%

Table 5.3: Okurigana detection accuracy across models

Interestingly, the simple baseline model which assumes that every potential case of okurigana is okurigana performs extremely well, beaten only by the addition of the Kanjidic common okurigana stems. Adding more information to the model about valid okurigana occurrences even reduces the accuracy slightly over our test data. Rather than indicating blanket properties of these models, the results suggest properties of our testing data. Since it consists entirely of dictionary entries without the common hiragana particles which would occur in open text, this greedy approach is very suitable, and suffers few of the shortcomings which it would normally face. In open text, we would consistently expect additional language features between lexical items which would break the assumptions made by our simple model, and thus reduce its accuracy dramatically. In contrast, the full verb conjugation model would then be expected to perform best, since it has the most information to accurately detect cases of okurigana even in the presence of other features.

Combined task

Selecting the two models which performed best on our test data, we can now evaluate the pair on the combined task. For the alignment subtask, the IDF heuristic with Kanjidic was used. For the okurigana detection subtask, the simple algorithm is used. The results are shown in Table 5.4.

Status         Count   Percentage
Correct        4809    96.2%
Incorrect      191     3.8%
→ Gapping      6       0.1%
→ Alignment    163     3.3%
→ Okurigana    22      0.4%

Table 5.4: Best model accuracy for the combined task

A final accuracy of 96.2% was achieved, with the errors caused mostly in the alignment subtask. As predicted, grapheme gapping was a source of errors only in a small percentage of cases, justifying its exclusion from our model. This level of accuracy is equivalent to that of earlier models, yet it has been achieved with a much lower computational cost. Examples of incorrect alignment are given in Figure 5.6 below.

a. Output:  挾|撃|chi  hasa|miu|chi    Correct:  挾|撃|chi  hasami|u|chi   “pincer attack”
b. Output:  赤-N|坊  akaN|bō           Correct:  赤|N|坊  aka|N|bō        “baby”

Figure 5.6: Examples of incorrect alignment in the combined task

Example (a) shows a typical alignment error, where one kanji has been attributed part of the reading of another. Example (b) on the other hand gives an error in okurigana detection, where the N kana is erroneously detected as an okurigana suffix of the 赤 kanji.

Improved alignment

We have decomposed the GP alignment task into an alignment subtask and an okurigana detection subtask, and explored various algorithm variants for use in both. In particular, the iterative IDF heuristic with a Kanjidic reading model provided the best accuracy in significantly less time than the original algorithm. For the okurigana detection subtask, a simple model outperformed more complicated models of conjugation due to peculiarities of dictionary entries as input to alignment. The modified algorithm was suitable for use with the FOKS system, and was thus adopted as part of the build process for the new FOKS interface. Ideally, this work would have applicability to open text as well as dictionary interfaces. However, the basic unsupervised method relies heavily on getting a representative sample of readings from the input before attempting alignment. One way to circumvent this would be to bootstrap the method with alignments from an existing representative sample, say the EDICT dictionary, and then use the bootstrapped algorithm to align new pairs. Okurigana detection remains the harder problem, for tasks which require it. The verb-conjugation model, despite its relatively poor performance for dictionary entries, suggests itself as the most fruitful approach to accurate detection for open text, and could easily be extended. In particular, the addition of conjugation suffixes of high-frequency irregular verbs would be a straightforward way to boost accuracy.

5.3 Usability enhancements

This section describes three basic enhancements to the dictionary, namely: the addition of a huge library of place names, and their display; the separation of the senses of polysemous kanji and words; and the ability to explain to a user how a query worked, and how the correct reading is structured. We discuss each of these in turn.

Place names and homographs

As part of error analysis of FOKS, Bilac et al. (2004) found that insufficient knowledge of proper nouns was a common error type, as discussed in Section 5.1. FOKS is especially useful for these proper nouns, since their pronunciation is idiosyncratic. However, FOKS relies on ENAMDICT for place names, which has two disadvantages. Firstly, ENAMDICT has limited coverage of place names in Japan. Secondly, it gives no cues as to relationships between places, only giving each place name a transliteration as its gloss. We address both of these problems by constructing a simple gazetteer resource from data mined from Japan Post.6 This resource provides a large number of place names not in ENAMDICT, but also provides some hierarchical structure which we can use to distinguish between the many homographs encountered in this exercise, structure which ENAMDICT does not provide. In total, 114591 place names were mined; of these, roughly 69% had unique written forms, whereas 31% did not. Figure 5.7 shows the distribution of homography over the 79139 unique written forms mined. The extremes of this distribution also suggest the limits of this data set. The most used place name found was 本町, with 315 places using this name, taking pronunciations hoNchō, hoNmachi or motomachi. However, this is better termed a suffix than a place name in its own right. More generally, there are a number of valid but high-frequency place names, such as 上野, with 76 matches. In reality, the senses of such place names – that is, the places they actually refer to – themselves vary in frequency. For 上野, the most prominent amongst

6The Japan Post Gazetteer: http://www.csse.unimelb.edu.au/~lljy/datasets/#gazetteer

these senses is likely to be the Ueno area of Tokyo. FOKS currently lacks a useful way of determining and displaying the relative importance of different place name senses. Instead, we simply group these names by region, so that the user can find the place they were looking for provided they know its broad region. The more difficult task of determining the most prominent place name senses remains as future work. For each of these place names, we generated an automatic transliteration using Kakasi7 to serve as a gloss.

Figure 5.7: The distribution of place-name homography in Japanese. The vertical axis shows how many place names had at least n homographs, out of the 79139 unique place names encountered.

Although the inclusion of this larger number of place names will allow the user to find them in the dictionary, it could also pollute the search results for common words. This is avoided in FOKS in two ways. Firstly, FOKS is primarily an indexing method for kanji compounds. Even in the case of 本町 above with 315 matching places, only a single entry is placed in the search results for a matching query. Secondly, users can filter by the type of query they are performing, or choose a default filter to apply to all queries. This provides an easy way to either focus on or exclude place names from results as desired. Search results aside, we have also made improvements to a word's translation page, which we now discuss.

7http://kakasi.namazu.org/

Sense separation

Both the new and old FOKS interfaces are targeted at achieving successful lookup of a word, in particular words containing kanji. However, the large amount of homography in Japanese means that the same word or compound might be used for several different purposes, and the two interfaces differ in how they present this information to the user. In the old interface, a gloss is shown for each word whose surface form matches exactly that of the successful query. When a word has multiple readings for the same gloss, this leads to repetition of the gloss. If many proper nouns are included in the translation results, they can pollute the results, depending on the purpose of the user's search. The new interface takes a different approach in allowing the user to visualise the complex relationship between surface form, readings and senses. For purposes of comparison, Figure 5.8 provides screenshots of both systems for the translation of the compound 下手. Firstly, the three main categories of sense – general word, person/company name and place name – are clearly demarcated in separate sections. This allows the user to skim to the section they care about. The general word section in turn separates out reading-senses (senses which occur with a specific reading), and groups them by pronunciation. For example, shitate and shitade are both listed as readings of the sense “humble position”. The two sections of proper nouns both list not translations but transliterations (i.e. romanisations) of the proper nouns. The section for place names also provides a hierarchical display of the place name, showing country, prefecture, ward and town; in this case 下手 shimode is located in Kagoshima Prefecture. Clicking on any of the levels in the place name issues a Google Maps query, locating the place quickly and conveniently.

Query and reading explanation

The original form of FOKS search bases itself around plausible misreadings of kanji and kanji compounds, and indeed reverse engineers the most prominent word-formation effects in order to determine what a plausible reading would look like. However, there is a lack of transparency for users in how FOKS works, and why it gave them a particular search

Figure 5.8: e old and new translation pages shown for the compound 下手. Chapter 5: Extending the dictionary 117

result. Since the new FOKS architecture generates the misreadings at query-time, it can also reverse-engineer queries to determine the relationship between the misreading and the target.

Figure 5.9: FOKS explaining the relationship between the query tokyo and the word 東京 Tōkyō.

Figure 5.9 gives an example for a common erroneous query, searching for 東京 Tōkyō using the reading tokyo; this is a misreading because both vowels should be long, but are instead short vowels in the query. The explanation firstly segments the reading into per-kanji readings, then explains the erroneous reading for each kanji as a correct base reading changed due to a vowel length error. The explanation also notes that the reading is incorrect, lists correct readings for the compound, and offers explanations for how the correct readings are derived. This last point is especially useful for learners who may be unsure how the reading of a long word can be attributed to readings of its parts. In some cases, such as 海豚 iruka “dolphin”, the correct reading is non-compositional. In these cases, the explanation cannot provide a segmented per-kanji reading; it instead simply states that the reading is correct but non-compositional, and thus special to this compound. To recap our usability enhancements, we have extended the dictionary entries with a large number of proper nouns in the form of place names, yet managed these additional candidates through the use of search filters and appropriate translation structure to avoid them burdening search for common words. We have provided an augmented word translation page, which organises word readings and senses in a manner which offers the user all the information about the word without overwhelming them. Finally, we have provided transparency to the search, by explaining to users how queries are reached, and providing them with the tools to better study and understand word formation principles in Japanese. Whilst we offer no strong evaluation for these changes, we make the case that they improve the value of the dictionary as a study tool, and may also improve retention of looked-up words.

5.4 Intelligent grapheme search

Overview

In Section 5.1 we gave an overview of FOKS, and discussed the shortcomings of the old system. They came in two forms: errors in searches by reading which could not be corrected, and searches by grapheme which were limited to exact matches. Having discussed and evaluated carefully various models of graphemic similarity, we are now in a position to provide this form of error correction to FOKS searches. When we originally developed our graphemic neighbourhood models, we intended to provide correction for mistaken searches by reading. Our motivation came from Bilac et al.'s (2004) log analysis of FOKS, which picked up a small number of these cases. Although the number itself was low, the log data itself was highly constrained: it only contained successful searches through FOKS, which only corrected for reading errors. Thus, to appear in the logs at all, such grapheme-confusion cases needed to coincide with some other form of reading error which FOKS did support. We wouldn't normally expect grapheme-confusion and other forms of reading error to coincide, so the true prevalence of this error type might be significant. However, allowing search for a word using a reading of a near-homograph represents a trade-off, since the many readings available for any kanji would swell the potential candidate list for any search enormously. An alternative view of this problem is that the link between a compound's reading and that of near-homographs is tenuous; attempting to correct such errors forces us to cast our net too wide. Either the results will appear, but will be lowly ranked, or they will displace more frequent error types, and thus decrease search performance. For these reasons, we extended search by grapheme – which had been exact match only – into an error-correcting search, with the potential to both correct errors based on graphemic similarity and also to allow intentional search of unknown characters, using known neighbours.
is section describes the details of this new error-correction model.

Overall model

The broad probability model for looking up words based on similar kanji is identical to the FOKS model for search based on readings, save that the query consists of kanji rather than readings. A unigram approximation leads us to Equation 5.6 below, where q is the query given by the user, and w is the desired word:

Pr(w|q) ∝ Pr(w) Pr(q|w)
        = Pr(w) ∏_i Pr(q_i | w, q_0 … q_{i−1})
        ≈ Pr(w) ∏_i Pr(q_i | w_i)        (5.6)

The final line of Equation 5.6 requires two models to be supplied. The first, Pr(w), is the probability that a word will be looked up. Here we approximate it using corpus frequency over the Nikkei newspaper data, acknowledging that a newspaper corpus may be skewed differently to learner data. The second model is our confusion model Pr(q_i|w_i), interpreted either as the probability of confusing kanji w_i with kanji q_i, or of the user intentionally selecting q_i to query for w_i. It is this model which we now focus on.
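To make Equation 5.6 concrete, the ranking it implies can be sketched as follows; the word probabilities and per-kanji confusion probabilities here are invented placeholders for illustration, not values from FOKS:

```python
from math import prod

# Hypothetical corpus-derived word probabilities Pr(w).
word_prob = {"時間": 0.004, "時問": 0.00001}

# Hypothetical per-kanji confusion probabilities Pr(q_i | w_i),
# keyed as (query kanji, word kanji).
confusion = {
    ("時", "時"): 0.95, ("間", "間"): 0.90,
    ("間", "問"): 0.05, ("問", "間"): 0.04,
}

def score(query: str, word: str) -> float:
    """Unigram approximation: Pr(w|q) ∝ Pr(w) · ∏_i Pr(q_i|w_i)."""
    if len(query) != len(word):
        return 0.0
    return word_prob.get(word, 0.0) * prod(
        confusion.get((q, w), 0.0) for q, w in zip(query, word)
    )

def rank(query: str, lexicon) -> list:
    """Rank candidate words for a grapheme query, best first."""
    scored = [(score(query, w), w) for w in lexicon]
    return [w for s, w in sorted(scored, reverse=True) if s > 0]
```

Here `score` multiplies the corpus prior by the per-kanji confusion terms; any candidate containing an unseen kanji pairing receives probability zero and is dropped from the ranking.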

Confusion model

Our confusion model must account for two main factors. Firstly, it must be based on the visual similarity of two characters q_i and w_i, which we address by use of one of our distance metrics. Secondly, we expect that users will tend to confuse unknown characters with characters they are already familiar with. This is a reasonable assumption, since the only characters they can input are those they already know. We thus propose a generic confusion model based on a similarity measure between kanji:

Pr(q_i|w_i) ≈ Pr(q_i) s(q_i, w_i) / ∑_j Pr(q_{i,j}) s(q_{i,j}, w_i)        (5.7)
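As an illustration of Equation 5.7 and the proportional thresholding described below, the following sketch assumes a distance metric d with range [0, 1], converted via s(x, y) = 1 − d(x, y); the kanji frequencies and pairwise distances are invented:

```python
# Hypothetical kanji frequencies Pr(q) and pairwise distances d(q, w).
kanji_prob = {"末": 0.002, "未": 0.003, "木": 0.010}
distance = {("末", "未"): 0.1, ("木", "未"): 0.3}

def similarity(q: str, w: str) -> float:
    # s(x, y) = 1 - d(x, y), for a metric with range [0, 1];
    # unseen pairs default to maximal distance.
    return 1.0 - distance.get((q, w), 1.0)

def confusion_prob(q: str, w: str, candidates) -> float:
    """Pr(q|w) ≈ Pr(q)s(q,w) / Σ_j Pr(q_j)s(q_j,w)  (Equation 5.7)."""
    z = sum(kanji_prob[c] * similarity(c, w) for c in candidates)
    return kanji_prob[q] * similarity(q, w) / z

def neighbours(w: str, candidates, ratio: float = 0.9):
    """Keep candidates scoring within `ratio` of the best similarity,
    following the proportional thresholding of Clark and Curran (2004)."""
    scored = sorted(((similarity(c, w), c) for c in candidates), reverse=True)
    cutoff = scored[0][0] * ratio
    return [c for s, c in scored if s >= cutoff]
```

With these invented values, 木 falls below the 0.9 cutoff relative to 末, so only 末 survives as a neighbour of 未.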

The confusion model uses a similarity function s(q_i, w_i) and a kanji frequency model Pr(q_i) to determine the relative probability of confusing w_i with q_i amongst other candidates. We convert the desired distance metric d into s according to s(x, y) = 1 − d(x, y) if the range of d is [0, 1], or s(x, y) = 1/(1 + d(x, y)) if the range of d is [0, ∞).

In order to maximise the accessibility of this form of search, we must find the appropriate trade-off between providing enough candidates and limiting the noise in the candidates present. We use a thresholding method borrowed from Clark and Curran (2004), where our threshold is set as a proportion of the first candidate's score. For example, using 0.9 as our threshold, if the first candidate has a similarity score of 0.7 with the target kanji, we would then accept any neighbours with a similarity greater than 0.63. Using the d_stroke metric with a ratio of 0.9, there are on average 2.65 neighbours for each kanji in the jōyō character set.

Search by similar grapheme has an advantage over search by word reading: reading results are naturally ambiguous due to homophony in Japanese, and attempts to perform error correction may interfere with exact matches in the results ranking. In contrast, grapheme-based search may have only one exact match, so additional secondary candidates need not be in direct competition with existing search practices.

Finally, we can consider the theoretical accessibility improvement of this form of search. Let us assume that learners study kanji in frequency order. For each kanji learned, one or more high-similarity neighbours also become accessible. Taking all pairings of kanji within the JIS X 0208-1990 character set, using the d_stroke metric with a cutoff ratio of 0.9, we get the accessibility curve found in Figure 5.10. Our baseline is a single kanji accessible for each kanji learned.

[Figure 5.10: The potential accessibility improvement of kanji similarity search. The curve plots the number of kanji accessible by search (y-axis, up to roughly 6000) against the number of kanji known (x-axis, 50 to 4550), against a baseline of one kanji accessible per kanji known.]

Given that the actual number of usable neighbours is small and our precision within the top-5 candidates is 0.228, we will need to expose the user to a larger set of candidates in order to achieve this level of improvement. In order to determine how this form of search is actually used by learners, we now look to log analysis.
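The accessibility computation behind Figure 5.10 can be sketched as follows; the frequency-ordered kanji list and neighbour map are tiny invented stand-ins for the real JIS X 0208-1990 data:

```python
# Kanji in (hypothetical) descending corpus-frequency order.
freq_ordered = ["日", "人", "木", "末", "未"]

# Hypothetical high-similarity neighbour map (cutoff ratio already applied).
neighbour_map = {"木": ["末", "未"], "末": ["未", "木"], "未": ["末", "木"]}

def accessibility_curve(ordered, nbrs):
    """For each prefix of kanji learned, count the kanji accessible via
    search: every known kanji, plus any kanji reachable as a high-similarity
    neighbour of a known kanji."""
    curve = []
    known = set()
    for kanji in ordered:
        known.add(kanji)
        reachable = set(known)
        for k in known:
            reachable.update(nbrs.get(k, []))
        curve.append(len(reachable))
    return curve
```

Each curve value counts known kanji plus their reachable neighbours; the baseline is simply the prefix length, so here learning 木 makes 末 and 未 accessible two steps early.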

5.5 Log analysis

This chapter has described a broad series of improvements to the FOKS dictionary, extending it with the new intelligent grapheme search, and providing a large number of usability enhancements. Many of these enhancements are difficult to evaluate directly. Dictionary search takes relatively unconstrained and free-form input from users, and this can be hard to interpret in the resulting logs. We address this problem in Chapter 6 in a more constrained testing application, where we provoke errors from our users and can thus evaluate our error models more rigorously. Nonetheless, the broad usage patterns of FOKS can still be examined.

The new system with grapheme error-correction was implemented in October 2007, and our results are based on logs taken from then until late March 2009. Over this period, Google Analytics was used to measure site traffic and get a better understanding of user location and language background. The majority of queries originated from within Japan, which is unsurprising since both native speakers and advanced learners are likely to be based there. Table 5.5 shows the breakdown of site usage by browser-reported language.

Language    Visits           Cumulative time on site (hours)
Japanese    42798    79.5%   1785    66.1%
English      6781    12.6%    623    23.0%
Chinese      1164     2.2%    148     5.5%
Turkish       765     1.4%     20     0.7%
Other        2301     4.3%    126     4.7%
Total       53809   100.0%   2702   100.0%

Table 5.5: FOKS usage as reported by Google Analytics in the period from October 2007 to March 2009, broken down by browser-reported language.

Note that the browser-reported language may differ from a user's actual first language; for example, a native German speaker may be running Japanese browser software if they are an advanced learner, or working in a Japanese computing environment. Nonetheless, it gives us an indication of usage patterns. 80% of visits are Japanese-language, perhaps indicating a large pool of native speakers as users, with English-language visits at 13%. Smaller pools of Chinese and Turkish speakers are also significant users. However, when measured by time actually spent on the site, the share given to Japanese-language users reduces to 66%, since they spend less time on average on the site than other language groups.

We now turn to query logs. Any interpretation of such logs bases itself on a model of user behaviour. Our basic assumptions are that a user is looking for a particular word whose kanji form is available, known or recognisable to them. When a user performs a search, they are thus presented with a list of scored word-reading results, each with a "Translate" button next to it. When a user clicks on this button, FOKS captures their decision and displays a detailed translation of the word.
This is presumed to be the word the user intended to find, and thus considered a successful search. We call such queries full queries, and those where the user does not select any word for translation partial queries, after Bilac et al. (2004). Partial queries give us little useful information, whereas full queries allow us to compare the query with the word found and determine if any error-correction was used.

This model represents only one possible series of user interactions, but we assume it is the dominant user behaviour. Other behaviours include: clicking on translations of unknown words mid-search out of curiosity, whether or not the desired word is present in the results; looking for a word of known pronunciation but unknown meaning, thus clicking on several translations to determine which matches the known meaning; searching to determine the pronunciation of a word of known form; searching to determine the form of a word of known pronunciation; and others. Distinguishing between these behaviours would be an interesting exercise, but is not the focus of this thesis, hence our assumption of a dominant behaviour.

Using these assumptions, basic search statistics can be compiled, as shown in Table 5.6.

              Full query             Partial query   Total
              Intelligent   Simple
By grapheme   997           16756    81018           98771
By phoneme    754           4262     22912           27928
Total         1751          21018    103930          126699

Table 5.6: Number of FOKS searches logged between October 2007 and March 2009, arranged by search type.

Roughly 29% of grapheme searches and 30% of phoneme searches are full queries; the remainder are partial queries. We divided the full queries into those which were correct (i.e. exact matches), and those which contained some error which our error models corrected. The former type of query we call "simple", the latter "intelligent". Focusing on the full queries, searches by grapheme are over 5 times more prevalent than those by phoneme, perhaps indicating the common practice of copying and pasting an unknown word into the dictionary. However, they enjoy a comparable share of the intelligent searches.
Since one feature of the rebuilt architecture was the ability to reverse-engineer the relationship between a query by reading and a word, we can give an automated breakdown of the prevalence of reading error types. This breakdown is given in Table 5.7.

Error type                  Frequency         Example query   Word
Choice of reading           547      72.0%    noriru          乗る noru "to get on"
Vowel length                 92      12.1%    tokyo           東京 tōkyō "Tokyo"
Non-compositional reading    59       7.8%    umibuta         海豚 iruka "dolphin"
Sequential voicing           40       5.3%    hayahaya        早々 hayabaya "early"
Sound euphony                13       1.7%    fukuki          復帰 fuk-ki "return"
Palatalisation                9       1.2%    seNko           選挙 seNkyo "election"
Total                       760     100.0%

Table 5.7: Automatically determined error types for intelligent queries by reading.

Inadequate choice of reading is the most prevalent error type, accounting for 72% of errors. Vowel length errors account for a further 12%, followed by the other error types in smaller proportions. This drop-off differs somewhat from Bilac et al.'s (2004) earlier log analysis of the original system, in that we encounter a far smaller proportion of gemination errors, voicing errors and errors due to non-compositional readings. This could be due to subtle changes in the search result ranking of the new system, or even in the demographic of its users. Nonetheless, each of the broad error types is successfully corrected for by the new system.

With only 997 grapheme-similarity searches, the forms of analysis open to us are limited. We can however test the assumption inherent in our error model, namely that kanji will be mistaken for their more frequent but visually similar neighbours. Looking at every intelligent grapheme query, and every target kanji t and query kanji q which differ, let us measure log(Pr(q)/Pr(t)). This log-ratio indicates the relative frequency of kanji errors compared to the kanji they are derived from. If strongly positive, it would suggest either that people confuse kanji for their more frequent neighbours, or that people use more frequent neighbours to query for rare kanji. If strongly negative, the result would be problematic: it would seem to suggest that people use rare kanji to query for common ones, a highly unlikely scenario. Instead, we would surmise either that unexpected confusion errors were occurring in selecting the appropriate search result, or alternatively that the corpus we use to measure kanji frequencies differs significantly from the order in which learners acquire kanji.

Plotting a histogram of the log ratio, Figure 5.11 shows that the distribution is remarkably even, with a very slight negative log-ratio on average.
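The per-kanji log-ratio statistic can be sketched as follows, with invented kanji frequencies and query/target pairs standing in for the real log data:

```python
from math import log

# Hypothetical corpus kanji frequencies Pr(k).
kanji_prob = {"末": 0.002, "未": 0.003, "問": 0.004, "間": 0.008}

# (query kanji, target kanji) pairs from hypothetical intelligent queries.
pairs = [("末", "未"), ("間", "問"), ("未", "末")]

def log_ratios(pairs, prob):
    """log(Pr(q)/Pr(t)) for each differing query/target kanji pair.
    Positive values mean the query kanji is more frequent than the target."""
    return [log(prob[q] / prob[t]) for q, t in pairs if q != t]

ratios = log_ratios(pairs, kanji_prob)
mean_ratio = sum(ratios) / len(ratios)
```

A histogram of `ratios` centred near zero, as in Figure 5.11, would indicate confusion between kanji of comparable frequency rather than a systematic bias in either direction.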
[Figure 5.11: Histogram of log query/target likelihood ratio for all corrected grapheme queries, taken on a per-kanji basis. Bin counts (log ratio: number of corrections): −10.8: 0; −8.4: 7; −6.0: 22; −3.6: 94; −1.2: 441; 1.2: 401; 3.6: 104; 6.0: 19; 8.4: 2; 10.8: 1.]

This suggests that, on average, kanji are mistaken for visual neighbours of comparable but not necessarily higher frequency, thus providing evidence for neither of the outcomes discussed above. Instead, a plausible explanation could be that learners tend to confuse kanji pairs on the periphery of their knowledge, rather than mistaking rare kanji for higher-frequency known ones. An alternative hypothesis is that users are distracted by the additional search results, which anecdotally seem to contain individually frequent kanji in rarer combinations, and may thus select a word other than their original target for translation. For example, a search for 海豚 iruka "dolphin" also yields the surname 海家 Umiie in the search results, which is a rare composition of the highly frequent kanji 海 and 家. Combinations such as this might be interesting enough to distract the user from their original search. Whilst this behaviour is not technically an error, it would break our simplifying assumption that the first word a user selects in search results was the original word used to formulate the query. This case could potentially be investigated through more sophisticated analysis of log behaviour, since it should be identifiable by the user selecting multiple query results for translation, one of which may be an exact match for the original word. Such analysis is beyond the scope of this thesis; we leave it for future work.

5.6 Conclusion

If learning words is the main obstacle to linguistic competency, then the dictionary is the gateway to this competency, since it is the primary resource for unlocking the meaning behind otherwise foreign words. In this chapter, we examined the FOKS dictionary, with the aim of making it a better resource for learners in their autonomous study. We took a holistic approach, taking apart the dictionary and rebuilding it with improved grapheme-phoneme alignment, improved place-name coverage, better word translation, explanations of word readings, and finally, a new form of search which allows users to substitute unknown kanji for their visually similar neighbours in queries. Our log analysis shows that the number of intelligent graphemic queries was higher than that of intelligent phonetic queries, indicating the adoption of this form of search by the existing user base.

There are many directions which future research could take in improving further upon this form of search. Log data accumulating from intelligent graphemic queries can be stored

and used as a means to evaluate improvements to graphemic error correction. Furthermore, such data also serves as yet another means of evaluating orthographic neighbourhood models, which still contain substantial noise and hence potential for improvement. Our automatically generated explanations for both correct and plausibly incorrect readings of kanji compounds could be extended to graphemic errors, identifying the strokes or components which differ between two characters and highlighting the differences between them. There is also potential for other forms of graphemic relationships, such as subsumption, to be used as graphemic search methods. For example, 動 could be used to search for 働, not based on their similarity, but based on its strict membership within the latter kanji. Such a search method has already been trialled by Jin (2008), but with further work the comparison between graphemic search methods could help us to better understand what kinds of partial information are most naturally available and most easily expressible in search queries.

Ultimately, at the core of the new FOKS implementation we have discussed lie two main error models: a model of plausible mispronunciation, and a model of plausible misrecognition. Yet the use of these models need not be limited to the dictionary. We now continue by applying these models of plausible error in the language testing domain, in order that students can better self-evaluate their study.

Chapter 6

Testing and drilling

In the previous chapter (Chapter 5), we examined the effect of augmenting an existing dictionary with the graphemic error models we developed earlier (Chapter 4). In this chapter, we examine the transferability of both phonemic and graphemic error models to the domain of language testing and drilling, using a new system called Kanji Tester.

We begin in Section 6.1 by discussing why testing is such an attractive application space for the graphemic and phonetic error models we used in search. We then propose criteria with which to situate Kanji Tester amongst other forms of testing. Since Kanji Tester is modelled after the Japanese Language Proficiency Test (JLPT), Section 6.2 describes the JLPT in more detail and justifies our interest in emulating it with our automated tests. With this background complete, Section 6.3 sets about describing the Kanji Tester system in detail, from the perspective of both its use and its implementation. It discusses our user modelling, from the granularity of modelling chosen to the algorithms used for updates and for question generation. Section 6.4 then evaluates Kanji Tester through analysis of its usage logs, exploring user demographics, user ability, power users and the overall effectiveness of our error models. Finally, we conclude in Section 6.6 with our findings.


6.1 From search to testing

Testing as an application space

We saw in Chapter 3 that tools for learning vocabulary can basically be divided into two forms: reading aids and tests. These mirror the two ways in which vocabulary can be learned, implicitly or explicitly. So far, we have focused on dictionaries, since they are the simplest reading aid and the first point of contact for learners. We now complete our study by considering language testing.

There are several strong reasons to choose testing as our next application area. Without exception, all forms of explicit vocabulary study focus heavily on testing. This is because, as learners, we are poor at self-evaluation; by testing ourselves we can objectively discover the limits of our knowledge, and thus begin to expand them. Testing also closes another loop: motivation. A strong motivation may provide the impetus for early study, but without evidence that the study is paying off, it may wane. Testing provides this evidence.

Aside from its heavy use in language learning, testing is also an attractive secondary application area for the error models we generated for search. In search, the learner provides us with their response to an unknown stimulus, and our task is to recover the original stimulus word. Since user queries are noisy, we are often unable to recover the original word. Such cases can be attributed to one of three issues:

1. Insufficient user knowledge: The user did not know enough accurate information about the word to formulate a useful query.

2. Poor modelling of error prevalence: The search results contained the desired word, but its rank was too low for the user to find it.

3. Poor coverage of errors: The search results did not contain the desired word, despite a correct query formulation by the user.

Unfortunately, we cannot distinguish between the first error type, which cannot be reduced, and the latter two, which can. In testing, however, the system generating the test provides the stimulus to which the user must respond. This more constrained form of interaction means that user responses can be used directly to evaluate our error models and thereby form a basis for their improvement.

The ideal test

It makes little sense to investigate language testing without considering what makes a test worthwhile. We discussed in Section 3.4 the two main quality attributes of tests, namely validity and reliability, and these are certainly desirable properties for any test. However, whilst they are useful for comparing two tests with the same aims, they offer little guidance as to what to test in the first place. For example, many tests, from flashcards to large-scale accredited language testing, can achieve high validity and high reliability, albeit measuring fundamentally different properties of the learner. In order to make sense of the variety of tests used in language study, we instead propose two scales along which we arrange these and other tests measuring language proficiency. The first is the scope of the test: does it attempt to measure all aspects of proficiency, a limited subset, or simply basic knowledge of a few words? The second scale we call the test's availability, a measure which encapsulates both its accessibility and any obstacles to taking it: can the test be taken on demand, or only rarely? Does it take a long time, or cost money? Figure 6.1 arranges several forms of tests along these scales.

We justify these scales by revisiting the actual purpose of testing. Testing is used to provide information supporting decision making. In the context of language learning, such decisions range from do I need to keep studying this word?, to the much broader what progress have I made? and what study methods or classes were most effective for me? Simpler decisions can already be made quickly and effectively with the use of basic tools such as flashcards. It is the more complex decisions which require much stronger supporting information. Large-scale formal tests of proficiency, such as the Japanese Language Proficiency Test (JLPT), are designed with wide scope to meet this need, but currently this scope comes at a great cost to availability.
For many learners, the information provided by this form of test comes too infrequently to usefully inform their study.

[Figure 6.1: The various types of Japanese proficiency testing as they vary in scope and availability, placing flashcards, class tests, the JLPT, the proposed system and the ideal test along these two scales. The dotted line represents the current state-of-the-art.]

Imagine an oracle which, when asked, could accurately and instantly determine the proficiency of a learner, across whichever dimensions this proficiency is best measured. Such an oracle would reduce or eliminate the need for formal testing, and would itself be considered the ideal test. This is indeed the underlying goal of Intelligent Tutoring Systems, where the user's study of content material informs a user model which in turn determines what will be presented to them next based on their current proficiency (Shute and Psotka 1994:10). From another perspective, these platforms offer a form of continuous testing, where learning and testing are either combined into a single activity or rapidly alternated between. The more the learner uses the system, the more information it has with which to aid the learner.

Without the pre-existing user data which such a platform can provide, our ideal test would share much with existing broad-scope manually-constructed tests such as the JLPT. The main problem with such tests currently is that they cannot be taken more than once without damaging their validity; once a learner has seen a test, their performance on subsequent attempts at the same test may be affected as much by their memory as by their ability. For this reason substantive tests are usually rewritten for every use by expert language instructors, at significant time and cost. The transition to computer-based testing allows postponing of this problem through the manual construction of large question banks, but the fundamental issue remains: if individuals are allowed to test repeatedly, validity suffers.

In contrast, today's automatically generated tests have high availability, but are universally primitive in the forms of questions they can offer, which in turn reduces their validity in assessing language proficiency.

In this chapter we propose a new system called Kanji Tester which generates automated vocabulary tests. Since tests are generated on demand, it meets the ideal test's goal of availability. It sits at an intermediate level in terms of scope, since the types of questions it can ask a learner are constrained by what can be reasonably automated. However, its intelligent error models allow it to generate a different test every time, and thus to approach the scope of some human-generated tests. Since the system allows a learner to test and re-test themselves repeatedly, we call its tests drills. To simplify Kanji Tester's construction and to gain it a larger audience, we attempted to emulate the well-known Japanese Language Proficiency Test, which we now discuss.

6.2 Japanese Language Proficiency Test

Overview of the JLPT

The Japanese Language Proficiency Test (JLPT) is a family of examinations run biannually by the Japan Foundation. This section provides an overview; the interested reader should visit the JLPT site¹ run by the Japan Foundation (2009). It currently consists of four separate exams targeted at different levels of difficulty, from Level 4 (easiest) to Level 1 (hardest), as shown in Table 6.1. Level 1 is designed to be the level required to live and interact in Japanese society, and tests all the kanji in the government-prescribed 常用 jōyō "daily use" set which native speakers are expected to know upon completing secondary education.

Level   # kanji   # words   Listening      Hours of study   Pass mark
4       100       800       Beginner       150              60%
3       300       1500      Basic          300              60%
2       1000      6000      Intermediate   600              60%
1       2000      10000     Advanced       900              70%

Table 6.1: The four levels of the JLPT, with approximate numbers of kanji and words which learners should know when testing at each level, and estimated hours of study required. From Japan Foundation (2009).

The test itself consists entirely of multiple-choice questions. It contains three parts, as shown in Table 6.2: Writing and Vocabulary; Listening; and Reading and Grammar. From 2010, the writing section of the test will be merged into the reading section, which will then contain 75% of the total points in the test.

Section                  Points
Writing and vocabulary   100 (25%)
Listening                100 (25%)
Reading and grammar      200 (50%)
Total                    400 (100%)

Table 6.2: Sections of a JLPT exam and their relative weighting. From Japan Foundation (2009).

Multiple-choice tests such as the JLPT give significant scope for guessing of unknown answers. Since JLPT questions have four potential answers for the learner to choose from, unknown questions may be correctly guessed 25% of the time. However, the standard correction for guessing is not applied to final test scores; instead, a higher pass mark is used to compensate. We note that with four options per question, a learner who knew the correct answer to half the questions and guessed the remainder randomly should score 62.5% on average. 62.5% is thus the theoretically appropriate break-even pass mark for such a test, rather than 50%. This figure is approximated for the JLPT by a 60% pass threshold for levels 4, 3 and 2. For level 1, the pass mark of 70% more than sufficiently corrects for guessing.

The JLPT is a criterion-referenced test. Thus, rather than measuring an individual against their peers (norm-referenced assessment) or against past performance (ipsative assessment), it assesses a learner against fixed proficiency criteria (Begg 1997:22). This means that pass rates for each level vary from year to year according to the proficiency of candidates, and also that an outcome of the JLPT is a certification of proficiency.

Several aspects of the JLPT make it suitable as a reference test for our purposes. JLPT exams are based entirely on multiple-choice questions. Such questions can be easily and objectively marked, although in principle they could be constructed so as to be arbitrarily complex. In practice, the Writing and Vocabulary and Reading and Grammar sections both contain question types which could plausibly be automatically generated. Together these sections contribute 75% of the test, suggesting that with reasonable coverage of their question types we should be able to estimate performance on the JLPT tests themselves, and through this measure proficiency. Although in this thesis we can only automate a limited number of question types, we nonetheless see opportunity for extending this coverage in the future. Finally, the JLPT provides a cohesive audience for a Japanese testing system: roughly 560,000 candidates undertook one of the JLPT levels in 2008, and for many of these candidates significant study is required beforehand. By targeting these users we aimed to collect significantly more data than we would otherwise have been able to.

[Figure 6.2: An example results card from the JLPT 2007 tests.]

¹ http://www.jlpt.jp/
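The break-even figure quoted above follows from a simple expectation; a minimal check, using the 4-option guessing probability as given:

```python
def expected_score(known_fraction: float, n_options: int = 4) -> float:
    """Expected mark when known questions are answered correctly and
    the remainder are guessed uniformly at random."""
    guess_prob = 1.0 / n_options
    return known_fraction + (1.0 - known_fraction) * guess_prob

# A learner knowing half the answers scores 0.5 + 0.5 * 0.25 = 0.625 on average.
break_even = expected_score(0.5)
```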

Question types

Each year, a new JLPT test is written for each of the four levels, and then administered to a large number of candidates. Since the aspects of language proficiency being tested are limited, and each question is multiple-choice, the number of question types used to test each aspect of proficiency is ultimately limited. It is beyond the scope of this thesis to give a full account of these question types; instead we focus on those which we believe best test vocabulary knowledge directly. In our view, the most basic aspects of word knowledge are the links between a word's form (or surface), its meaning (or gloss) and its pronunciation (or reading), as given in Figure 6.3.

[Figure 6.3: Easily testable aspects of word knowledge for words containing kanji, and words without kanji. A dotted line indicates a more tenuous link. For words containing kanji, the figure links surface, gloss and reading; for words without kanji, only gloss and reading.]

Many vocabulary-related questions in the JLPT interrogate these links. For example, Figure 6.4 asks the learner to select the correct pronunciation for each word in the sentence, thus testing the surface-to-reading link for these words. The same link is tested in the reverse direction in Figure 6.5. Both questions make use of expert knowledge on the part of test authors, who use their experience to choose distractors for each question, based on plausible misreadings in the first case and plausible misrecognitions in the second. Yet these are exactly the forms of error models we have for Japanese, and which formed the basis of our improved dictionary lookup in Chapter 5.

Since the JLPT is administered in Japanese, and is designed to accommodate participants from a variety of first-language backgrounds, the links to L1 meaning are not tested explicitly in the JLPT. However, since we choose a restricted audience for Kanji Tester – that of English speakers – we are also able to generate questions which interrogate the links to meaning.

There are many other question types which might plausibly be automatically generated.

Figure 6.4: Example question: the learner must select the correct reading for each kanji word in context. Taken from the JLPT level 3 example questions, Writing and Vocabulary section.

Figure 6.5: Example question: the learner must select the correct kanji form for words given by their reading. Taken from the JLPT level 3 example questions, Writing and Vocabulary section.

For example, questions testing grammatical patterns might be generated according to manually created templates akin to Zock and Afantenos's (2007) pattern drills, and similarly for questions focused on correct and appropriate verb conjugation. Likewise, questions based on reading comprehension might be constructed in the following way. Firstly, appropriate texts could be sourced online, for example news articles filtered by difficulty using Sato et al.'s (2008) measure. Semantic entailment or automatic summarisation techniques could then be used to generate sentences entailed by the text, to which the learner could give a true or false answer. Likewise, variations using words from the text but which are not entailed could be added as incorrect cases.

Despite the plausibility of these additional question types, they require significant additional labour and expertise in order to automate their generation. Since our scope is limited, we focus instead on the simpler vocabulary-based questions which might benefit most from the error models used earlier in this thesis. The following section describes how the Kanji Tester system presents and implements automatic test generation based on these simple question types.
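As a rough sketch of how a vocabulary question of the kind shown in Figure 6.4 might be generated automatically; the word, reading and misreading candidates here are invented, and a real generator would draw distractors from the error models discussed in earlier chapters:

```python
import random

def reading_question(word, reading, misreadings, n_options=4, rng=random):
    """Build a multiple-choice 'choose the correct reading' question:
    the correct reading plus distractors drawn from plausible misreadings."""
    distractors = [r for r in misreadings if r != reading][:n_options - 1]
    options = [reading] + distractors
    rng.shuffle(options)  # randomise option order
    return word, options, options.index(reading)

# Invented example: 東京 with three plausible misreadings as distractors.
word, options, answer = reading_question(
    "東京", "とうきょう", ["ときょう", "とうきょ", "とんきょう"],
    rng=random.Random(0))
```

The same skeleton applies in the reverse direction (Figure 6.5) by presenting a reading and drawing visually confusable kanji forms as distractors.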

Potential criticisms of the JLPT

We have described an abstract conception of an ideal test for language learners, and followed on to describe the most frequently used family of proficiency tests for non-native speakers of Japanese, the JLPT levels, which we use as a reference point for our test generation system. However, a gap also remains between the JLPT levels and the ideal test as we have characterised it. This section focuses on this gap, discussing potential criticisms of the JLPT and of our use of it as a reference point. The attributes of the ideal test include a broad scope, i.e. a goal of determining holistic proficiency in a language. The JLPT does not provide a single test like this, but rather several tests, each of which assesses participants against criteria for a certain level of proficiency. The choice of which levels to accredit, and thus how many tests to run, is somewhat arbitrary at all levels but Level 1, which is roughly pegged at the level of a native speaker having completed Japanese secondary school education. These arbitrary levels could arguably be replaced by a single test, calibrated in such a way as to provide reliable proficiency scores for a large variety of candidates of differing ability. The main arguments against such a large test are common to paper-based tests. Firstly, it could be argued that a single test would have to be too long to accurately measure a wide variety of different ability levels, and would contain many questions which would either be far too easy or far too hard for a given learner. However, computer-based adaptive testing would allow short tests for learners of a wide variety of levels. We note that the Test of English as a Foreign Language (TOEFL) is a single monolithic test; however, it is adaptive in the listening and structure sections (Chalhoub–Deville and Deville 1999:287).
Whilst we expect all major proficiency tests to move to this approach in the future, we nonetheless note that the JLPT is successfully in use as a proficiency test. The separation into four levels of testing is useful for us, since it allows us to focus on the more limited levels 3 and 4 in Kanji Tester, rather than having to construct a full test system for assessing native-like Japanese proficiency. The JLPT can also be criticised for measuring only receptive aspects of proficiency, according to the receptive/productive distinction we introduced in Section 3.4. This largely relates to the use of multiple-choice questions for vocabulary testing, since providing a missing word or pronunciation is harder than choosing it from amongst distractors. In particular, Laufer and Goldstein (2004) argue that this difficulty relates to the graded nature of vocabulary knowledge; by not evaluating productive knowledge directly, the JLPT reduces the depth at which it measures word knowledge. However, we noted in Section 3.4 that breadth and depth of knowledge are correlated, suggesting that objections based on depth of testing may be overcome simply by testing a greater number of words. We further note that multiple-choice questions reduce marker bias, as opposed to productive questions which require the learner to fill in the blank. The use of multiple-choice questions only makes the JLPT better suited to automation in Kanji Tester. More broadly though, the form of the test prevents the direct measurement of productive oral ability. This means that, for example, an individual with excellent general vocabulary, grammatical and listening skills but extremely poor pronunciation might fare well on the test, but poorly in the real-world interactions the test aims to mimic. This criticism is common to many proficiency tests, and the merits of including interactive oral testing remain up for debate.
Here, we merely hypothesise that for the average learner, productive oral skills may be well estimated by the combination of vocabulary, grammar and listening skills which the test does cover. Even without such oral testing, the current cost of test-building reduces the availability of tests to learners. Introducing more costly oral testing is a trade-off against other quality attributes of tests. In any case, the lack of interactive oral testing makes the goal of automatic test generation more achievable in the medium term. As this section has shown, most shortcomings of the JLPT are also benefits for attempts to automate JLPT-like test construction. The JLPT thus provides a meaningful yet demanding goal to aim for in test generation.

6.3 Kanji Tester

Having described in detail the JLPT examinations from which we draw our syllabi and question types, we now discuss our system for automatically generating tests for users, Kanji Tester. We do this in two parts. Firstly we walk through the system from a user’s perspective, describing the system as they use it. Secondly, we describe it from our perspective, looking at its internal architecture and the mechanics of question generation and user modelling.

User interface

When a user accesses Kanji Tester for the first time, they must sign up for an account. This provides us with the means to customise tests to individual users. As part of the sign-up process, the user must create a profile, as shown in Figure 6.6. This includes choosing a syllabus to study, which at the time of writing was either JLPT level 3 or 4. They must also give their first language, and may optionally list any other languages they have studied. The user is then presented with a dashboard encouraging them to take a test. They choose the length of the test, from 10 questions to 50 questions, and click “Take a test”. Kanji Tester then generates a new test for them based on their syllabus, as shown in Figure 6.7. Each question in the generated test (a test item) is multiple choice, where the user must select the single correct answer amongst distractors, and is based on a single word or kanji from the syllabus. Although there are many aspects of word knowledge, Section 3.1 discussed Vermeer’s (2001) work linking breadth and depth, indicating that well constructed

Figure 6.6: Creating a profile for Kanji Tester. The user enters a first language, and optionally any second languages they have studied.

Figure 6.7: A sample test generated by Kanji Tester for JLPT 3.

breadth testing was sufficient to measure proficiency. For this reason, only a few basic but crucial aspects of word knowledge are tested. Figure 6.3 shows the types of links available between the three forms of word knowledge, each of which represents an aspect of knowledge which can be tested, and each of which generates a different question type depending on the direction the link is traversed. For example, surface-to-reading yields a question where the user is presented the surface as stimulus, and asked to identify the correct reading (Figure 6.8); gloss-to-surface would use the gloss as stimulus, and ask the user to identify the correct surface. In the case of words without kanji, the reading and the surface components are the same, since kana are syllabic; this reduces the available links to only the gloss-reading link.

Figure 6.8: An example question measuring the surface-to-reading link for the word 用意 yōi “preparation”.

The user selects their answers to each question in the test and then clicks “Check answers”. They are then provided with feedback on their test performance, as shown in Figure 6.9, which lists their score and highlights mistakes. If the user clicks on “Study mistakes”, they are provided with a vocabulary list of any word or kanji they made a mistake on, complete with multiple readings and glosses. This study list is tied to the user’s syllabus in two ways, firstly in the sense that it can only contain words and kanji which the user was tested on and which are hence in their syllabus, and secondly in the sense that only aspects of word and kanji knowledge within the chosen syllabus are shown in the study view. In particular, a word with multiple pronunciations and kanji forms may have some of these excluded from the study view (and from testing) if they are not part of the syllabus. Ideally, word senses would be similarly restricted; however, the JLPT syllabi do not specify words or kanji to this level. Once the user returns to the dashboard, they are presented with two main statistics on their performance: their cumulative accuracy, and their accuracy on the most recent

Figure 6.9: e feedback given to a user immediately upon completing a test. 144 Chapter 6: Testing and drilling test. Ideally, if their performance is improving through study, their most recent test score should be higher than their cumulative average, increasing motivation for the user. eir accuracy on these tests is designed to itself be a proficiency estimate of receptive vocabulary knowledge. Cumulative accuracy is thus the best estimate of the user’s average proficiency over the entire period they have used the system, true to their ability but slow to respond to changes. e last test taken provides a dynamic estimate which responds quickly to changes in proficiency but which is far noisier. Note that in our heavy use of multiple-choice questions, we do not allow users to omit answers and neither do we use the standard correction for guessing in the user results, which takes the form of a penalty for incorrect answers. is approach was taken for several rea- sons. Firstly we note that no such correction is taken on the JLPT, which we emulate, although participants may omit answers. Secondly, the standard correction formula2 as- sumes that in cases where the learner doesn’t know the answer, the learner guesses randomly amongst the available options. is assumption is clearly false in cases of partial knowledge, and the modelling is complex to determine the appropriate penalty for such cases (Espinosa and Gardeazabal 2007). irdly, using a penalty and allowing users to omit answers would reduce guessing, and whilst this would decrease the number of explicit errors (as opposed to omitted answers), these errors are necessary to feed our adaptive error models. Finally, such corrections are typically aimed at determining the participant’s ability in terms of their “true score” on a given test; we instead use far more fine-grained user modelling to assess a users’s proficiency, making correction unnecessary from an ability estimation perspective.

User modelling

Kanji Tester shares many aspects of Intelligent Language Tutoring Systems, such as Heift and Nicholson’s (2001) German Tutor, with the main difference residing in Kanji Tester’s restricted focus on vocabulary rather than grammar. A major similarity, though, resides in the reliance of such systems on an accurate user model on which to base user interaction. A range of user characteristics can be modelled, including goals, knowledge, background,

preferences, interests, individual traits and environment (Brusilovsky 2001). Kanji Tester has two of these key elements available to it, namely the user’s first and second language background, and their vocabulary knowledge, as displayed in their test performance. Each question a user answers is based on a particular item in the syllabus, a word or a kanji, and requires distractors to be generated as plausible alternatives to the correct answer. The plausibility of the distractors is crucial; if users can systematically guess the correct answer without knowledge of the item in question, the test will have no validity and will not be able to measure proficiency. Choosing an appropriate user model is thus crucial to generating useful distractors. In order to construct distractors for each question with similar credibility to those a teacher would select, we make heavy use of both the plausible misreading and the plausible misrecognition models employed in our rebuilt version of FOKS. However, with a continuous influx of user responses we have the potential to improve these models, making them respond and adapt to user mistakes. In particular, the system can increase its difficulty by increasing the frequency of distractors or error types which users have made mistakes on. To do this we need to perform at least limited modelling of user knowledge and behaviour. There are several layers of granularity at which such user modelling could potentially occur, both in terms of errors and in terms of users, as shown in Figure 6.10.

2 The standard correction for guessing estimates a user’s corrected score s on a test with k options per question as s = r − w/(k−1), where r is the number of correct responses and w is the number of incorrect responses. This correction is effectively a penalty of 1/(k−1) for each incorrect answer.
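The correction-for-guessing formula in footnote 2 can be sketched as follows. The helper name is ours and the snippet is purely illustrative of the footnote's arithmetic; Kanji Tester itself applies no such correction.

```python
def corrected_score(r: int, w: int, k: int) -> float:
    """Standard correction for guessing: s = r - w / (k - 1), where r is the
    number of correct responses, w the number of incorrect responses, and k
    the number of options per question."""
    return r - w / (k - 1)

# 40 correct and 10 incorrect answers on 6-option questions:
print(corrected_score(40, 10, 6))  # -> 38.0
```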
Users could be modelled: (a) globally, thus adapting to the user population as a whole; (b) in groups, for example by first language background; or (c) individually, adapting to each user. The larger the user grouping, the greater the benefit to users who are representative of that group, and the greater the penalty to minorities and individuals with error profiles which differ from their group’s. If users are considered individually, data sparseness becomes an issue, and the level of adaptivity may be limited. A good example is provided by vowel length errors. Since English has no phonemic vowel length distinction, native speakers of English may be less sensitive to vowel length in Japanese, and thus make more errors. A global shared model would adapt to the background of the majority of the learner base, in our case English speakers, thus penalising learners from other backgrounds who might not make these errors. If our model is based instead on language background, then other language groups are not penalised, but individual native English speakers who are attuned to vowel length differences still are. If we model errors for each

[Figure 6.10 diagram: user model granularity (global; subgroups such as English, Chinese or Spanish speakers; individuals) set against error model granularity (correct/incorrect; error types such as choice of reading, sound euphony, palatalization or vowel length; kanji level, e.g. 東 tō vs to and 京 kyō vs kyo; word level, e.g. 東京 tōkyō vs tokyo).]
Figure 6.10: Examples of user and error models of varying granularity considered for use by Kanji Tester. On the left, user models can be maintained globally (treating all users identically), in aggregate by some common attribute (e.g. first language), or individually. Similarly, error prevalence can be modelled globally, by broad error type, by kanji or by word. The actual configuration of Kanji Tester as reported models users as individuals and errors at the kanji level.

user individually, no user is penalised by the behaviour of another, but they cannot benefit from other users either. For Kanji Tester, we chose this last form of error model, as advocated by Rich (1983), because out of our three main options it makes the fewest assumptions about each individual user’s behaviour. For any level of user model, there are four possible granularities of error model which Kanji Tester could potentially use. Suppose the user chooses the incorrect reading tokyo for the word 東京 tōkyō “Tokyo”. The coarsest model simply registers this as an error, without regard to the error type or the word involved, and notes that the learner is now more likely to be wrong again in the future. A slightly finer-grained model would register the error as one of incorrect vowel length, but would still ignore the details of the word involved. It would note that the learner is more likely to make vowel length errors in the future. A more detailed model at the kanji level would register that the learner had in fact made two errors, misreading 東 tō as to and 京 kyō as kyo, and would estimate the likelihood of the learner misreading either kanji this way as increased. The most fine-grained model would consider the error as specific to the word 東京, and increase the likelihood of this misreading for this word only. The main question here is the extent to which error occurrences can be generalised into useful broader trends. In Kanji Tester, we model error trends at the kanji level and apply these per-kanji models in sequence when kanji occur in sequence as words. This is convenient since error models from FOKS are applied at the per-kanji level, and thus transfer directly. Error models from FOKS take the form of probability distributions. For a word containing kanji, with r as its guessed reading, k as its kanji form, and k′ as a plausible misrecognition, the models specify Pr(r|k) and Pr(k′|k).
For the misrecognition model, we used the same error model as in FOKS, incorporating stroke edit distance as the similarity measure. In Kanji Tester, we simply use a static snapshot of these models as prior distributions, and maintain for each user u their own dynamic posterior distributions Pr_u(r|k, R) and Pr_u(k′|k, R), where R is the ordered sequence of their previous responses. Note that for simplicity this snapshot removes any distinctions between error types, and simply considers them as error candidates with different likelihoods. This is aimed at providing a reasonable trade-off between data sparseness and adaptability for each individual user. Thus, in the earlier case of 東 tō being misread as to, our kanji-level reading model would only note it as

an incorrect reading rather than as a vowel length error. The user model would then update itself so as to register an increased likelihood of 東 being misread this way in the future, rather than, say, modelling vowel length errors in general as increasing in likelihood for this user. The following two sections describe in more detail firstly how our user models are used to generate questions, and secondly our update algorithm for refining our error models from user responses.

Generating questions

In order to better describe how Kanji Tester generates test questions, a brief overview of its architecture is useful. Kanji Tester provides an online web-based user interface from which users can choose a syllabus and take tests of 10, 20 or 50 questions in length. Each test is generated on a question-by-question basis, using three key components of Kanji Tester: its learner syllabi, its user models and its question plugins. These correspond to the dotted boxes in Figure 6.11’s architectural overview, and are designed to be extensible. Each of these three components has a crucial role in question generation. Each syllabus supported by Kanji Tester is prepared for use in the form of a paired word and kanji list. Each word on the word list is matched to a word from a reference lexicon – in our case JMdict – in a semi-automatic fashion. Once a user has chosen a syllabus to study, every question generated for them is seeded by a word or kanji from that syllabus’s list. This is the first step in Kanji Tester’s question generation algorithm (Figure 6.12). The next choice in question construction is the type of question to generate. Individual questions are generated by question plugins, and within each test a single plugin generates each question type. Tests silently alternate between two types, control and adaptive. This mirrors the two types of question plugins available, simple and adaptive. Simple plugins represent a baseline effort; their question generation methods are fundamentally similar to those of adaptive plugins, but are static and unchanging. On the other hand, adaptive plugins make use of user models which they refine with each user response. Control tests are limited to using only simple plugins, whereas adaptive tests make use of adaptive plugins wherever possible.3 The con-

3 The main exception to this is the “Random gloss” plugin, a simple question plugin used in both

[Figure 6.11 diagram: a user interface sits above the drill generator, which draws on three extensible (dotted) component groups: question plugins (Random gloss, Random reading and Random surface as simple plugins; Reading alternations and Visual similarity as adaptive plugins), user model plugins (kanji′ | kanji and reading | kanji), and the JLPT 3 and JLPT 4 syllabi, supported by a lexicon.]
Figure 6.11: e high-level architecture of Kanji Tester. Dotted lines indicate areas de- signed for easy extension. 150 Chapter 6: Testing and drilling

1. Randomly select a seed item from the user’s syllabus.

2. Randomly select a question plugin from the available plugins which are capable of using the given seed item.

3. Within the question plugin:

3.1. Randomly select a question type from those the plugin can generate.

3.2. Generate a pool of valid distractors, optionally using one of the available user models.

3.3a. [Simple plugins] Randomly sample distractors from the pool of available distractors with uniform probability.

3.3b. [Adaptive plugins] Randomly sample distractors from the pool of available distractors with probability proportional to their estimated likelihood of eliciting an error.

4. Render the resulting question to the user.

5. [Adaptive plugins] Update user models with the user’s response.

Figure 6.12: e algorithm used by Kanji Tester to generate each test question. Chapter 6: Testing and drilling 151

trol/adaptive distinction is designed to allow the comparison of the two methods, with the expectation that adaptive tests should be more difficult than their control counterparts. Regardless of whether the test is a control test or an adaptive one, a fixed set of question plugins is then available to generate questions with. If the seed item is a word containing kanji, any plugin from that set can be used. If the item instead uses no kanji, then only the simple gloss plugin can be used. If several plugins are available, one is chosen randomly from amongst them. This corresponds to step 2 in Figure 6.12. If the chosen plugin can generate more than one type of question, it chooses randomly between them, and for the final question type it generates a pool of distractors. From this pool, five are randomly chosen to be presented alongside the correct answer. In simple plugins, distractors are chosen with uniform probability; in adaptive plugins, they are chosen with probability proportional to their estimated likelihood of eliciting an error. The question is then presented to the user, and their response recorded. If the plugin was adaptive, the user’s response is finally fed back into the user model, which updates itself according to the update rule we discuss in the following section.
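The distinction between uniform and likelihood-weighted distractor sampling (steps 3.3a and 3.3b of Figure 6.12) can be sketched as follows. This is our illustrative reading, not Kanji Tester's actual code; we assume a pool mapping each candidate distractor to its estimated likelihood of eliciting an error.

```python
import random

def sample_distractors(pool: dict, k: int, adaptive: bool) -> list:
    """pool maps distractor -> estimated error likelihood. Simple plugins
    sample uniformly; adaptive plugins sample in proportion to likelihood.
    Both sample without replacement so the k distractors are distinct."""
    if not adaptive:
        return random.sample(list(pool), min(k, len(pool)))
    remaining = dict(pool)
    chosen = []
    for _ in range(min(k, len(remaining))):
        items = list(remaining)
        weights = [remaining[d] for d in items]
        pick = random.choices(items, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # avoid showing the same distractor twice
    return chosen
```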

Update algorithm

The basic goal of our approach to testing is to attempt to maximise the number of genuine errors we provoke from the user. By making them aware of gaps in their knowledge, we provide them with the opportunity to correct these gaps. We are aware that difficult tests can reduce study effort (Marso 1969), an effect reinforced by more recent work by Ponsoda et al. (1999) on manipulating the difficulty of computer-adaptive tests, and this suggests that we may wish the tests to be easier for learners. However, multiple-choice test difficulty can always be reduced by using less plausible distractors. On the other hand, finding more plausible distractors to increase test difficulty for the same target words is far more cumbersome. We take the reasonable approach of simply maximising difficulty (and hence user errors), knowing that even with good distractors our tests may err on the easy side. In order to do this, we model each user’s knowledge and discriminative ability using the information we have available: our prior expectation of what common errors will be made, and the user’s actual responses to previous questions. This section is concerned with making

control and adaptive tests, since no adaptive counterpart exists. 152 Chapter 6: Testing and drilling

use of the latter information. Our update algorithm takes an error model, the question it generated and the user’s response to that question, and uses these elements to construct a new updated error model that better fits the learner’s observed behaviour. More intuitively, it should ensure that distractors which successfully confused a user are presented more frequently than distractors which did not. There is a large space of algorithms which could perform this function. The main trade-off any such algorithm faces has to do with the rate of adaptivity. If the algorithm adapts swiftly to user input, it is also more likely to over-fit noise; if it adapts too slowly, the strength of the model will rely entirely on its priors rather than on per-user information. Our user-level granularity tells us which error model to update with each user response, and our error-level granularity gives us a concrete probability distribution. We now describe the update algorithm employed in Kanji Tester, which compares each user response to the probability distribution(s) which generated it, and updates the distribution(s) so that the response is considered more likely to occur in the future. We start with basic definitions:

w = the word the question is based on, w = k_1 … k_n
O = the distractor space for the word, O = O_1 × ⋯ × O_n
D = the options shown to the user, D ⊂ O
c = the user’s chosen answer, c ∈ D

We displayed a random subset D of the possible options O to the user, and they chose option c as their answer. According to our user model, each d ∈ D first had its probability calculated by:

Pr(d|w) = Pr(d_1 … d_n | k_1 … k_n) ≈ ∏_{i=1}^{n} Pr(d_i | k_i)   (6.1)

and then normalised over all d ∈ D to get Pr(d|D). In this way, we already know the value Pr(c|D). We now wish to determine the posterior distribution Pr′(C|D). We base our update rule on the constraint:

∀ d ∈ D \ {c} : Pr′(c|D) ≥ Pr′(d|D) + ϵ   (6.2)

That is, the user chose c because it was better than any other option by a margin of ϵ. If we assume that the user did not guess at random, but chose their answer because it was the most likely one to be correct out of the options available, then this rule simply ensures that in the posterior distribution their answer becomes the most likely out of the given options. It does this by enforcing a margin of ϵ in the posterior distribution Pr′(C|D) used in the next iteration. Note that ϵ serves to smooth the update of the error model, and the value chosen represents a particular trade-off between adapting quickly even to noise and random guesses, and adjusting slowly to learner input. We used an ϵ of 0.2, intended as an intermediate value. The use of an ϵ margin also reflects the fact that not every question we ask the user yields new information. If the ϵ margin already exists between the user’s choice and the other options, then the current model adequately predicted the user’s response, and no change is needed. If the margin does not exist, we update the error model using the following steps:

1. Let m = max_{i : d_i ≠ c} Pr(d_i|D) + ϵ

2. Define the posterior distribution Pr′(C|D) as follows:

• Pr′(d_i|D) = Pr(d_i|D) if d_i ≠ c

• Pr′(c|D) = max{Pr(c|D), m}
• Normalise Pr′(C|D) such that ∑_i Pr′(d_i|D) = 1
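A minimal sketch of this update over the displayed options, combining the product approximation of Equation 6.1 with the two steps above; the names and data representation are ours, not Kanji Tester's internals.

```python
from math import prod

EPSILON = 0.2  # the margin used in Kanji Tester

def option_probs(components: dict) -> dict:
    """components maps each displayed option d to its list of per-kanji
    probabilities [Pr(d_1|k_1), ...]; returns the normalised Pr(d|D)
    of Equation 6.1."""
    raw = {d: prod(ps) for d, ps in components.items()}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}

def update_shown(probs: dict, chosen) -> dict:
    """Steps 1-2 above: raise the chosen option to m = max + epsilon if the
    margin does not already hold, then renormalise."""
    others = [p for d, p in probs.items() if d != chosen]
    if probs[chosen] >= max(others) + EPSILON:
        return dict(probs)  # the model already predicted this response
    post = dict(probs)
    post[chosen] = max(probs[chosen], max(others) + EPSILON)
    total = sum(post.values())
    return {d: p / total for d, p in post.items()}
```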

Now we know the posterior distribution Pr′(C|D), yet our error model is stored in terms of (C_i|D_i). In other words, for every d ∈ D we know the sequence probability Pr′(d|D) = Pr′(d_1 … d_n|D), but we still need to define Pr′(d_i|D_i). That is, we need to distribute the

difference between the prior and posterior distributions for (C|D) to each (C_i|D_i). We do this by first defining the constant Δ:

Δ = Pr′(d|D) / Pr(d|D) ≈ ∏_{i=1}^{n} Pr′(d_i|D_i) / Pr(d_i|D_i)   (6.3)

We reach Equation 6.3 above by replacing Pr ′(d|D) and Pr(d|D) with their sequence approximations from Equation 6.1. We are now close to being able to redistribute this probability mass. We add the additional approximation that the mass should be distributed evenly amongst sequence components:

Pr′(d_i|D_i) / Pr(d_i|D_i) = Pr′(d_j|D_j) / Pr(d_j|D_j)   ∀ i, j ∈ 1 … n   (6.4)

Combining this with Equation 6.3, our final update rule emerges:

Pr′(d_i|D_i) = Δ^{1/n} Pr(d_i|D_i)   ∀ i   (6.5)
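Equations 6.3–6.5 amount to giving each of the n components an equal multiplicative share of the word-level change. A sketch, with names of our own choosing:

```python
def distribute_update(prior_components: list, delta: float) -> list:
    """prior_components holds Pr(d_i|D_i) for the n kanji of the answer;
    delta is Pr'(d|D) / Pr(d|D) from Equation 6.3. Each component is
    scaled by delta ** (1/n), so the product of the posteriors recovers
    the word-level posterior, as required by Equation 6.5."""
    n = len(prior_components)
    factor = delta ** (1.0 / n)
    return [factor * p for p in prior_components]
```

For instance, for a two-kanji word whose word-level probability grew by a factor of Δ = 4, each component's probability doubles: `distribute_update([0.4, 0.1], 4.0)` gives `[0.8, 0.2]`.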

This update rule has some desirable properties. For example, only distractor components displayed to the user have their likelihood changed. This is true at the kanji level, though the likelihood of unseen kanji sequences may still change. Suppose the user had to identify the word meaning “composition, writing”. The correct answer is 作文, which they must identify amongst distractors generated by our (kanji′|kanji) misrecognition model. They make an error and choose 非人 as their answer. In response to their answer, probability mass is moved away from the other word-level (mis)recognition candidates displayed – including the correct answer – and distributed to (非人|作文). In turn, this additional probability mass is distributed by our update rule into increased likelihoods for the pairs (非|作) and (人|文). This algorithm thus provides a principled method for updating user error distributions at the kanji level based on their answers to test questions. In order to distribute probability mass to reading parts, each word in the syllabus containing kanji must be accurately GP-aligned to its reading. We used the unsupervised method described in Section 5.2 to automatically align each syllabus, and then manually corrected any alignment errors encountered. The only exception to our update rule lies in our (reading|kanji) user model, in the case of non-compositional readings. If the user must guess the reading of a word where the correct answer is non-compositional, as in the case of 山車 dashi “festival float”, a correct answer from the user will have no way of attributing the reading of the whole to the readings of its kanji parts, and thus no way to increase the likelihood of the correct reading in place of an incorrect one. In this rare and special case, no update takes place. However, this is not problematic, since the correct reading must always be amongst the available answers to a question.

6.4 Evaluation

In Section 6.1 we discussed our goal of replicating human-constructed tests, and then discussed our method of using the error models from FOKS to generate plausible distractors for different question types. After deploying the Kanji Tester system in November 2008, roughly one month before the 2008 JLPT tests, we advertised it on several mailing lists and used a small Google AdWords campaign to try to gain users who would be studying for the test. Since Google Trends indicated that the search term “JLPT” encountered heavy traffic in Singapore, we focused our limited AdWords budget on Singaporean users in an attempt to gain participants in a short timeframe. After the JLPT season was over, we analysed log data collected between November 2008 and February 2009 in an attempt to better understand this new system, how it was used, and the extent to which it helped users. This section discusses the details of this analysis.

User demographics

During the period under analysis, 225 users completed a minimum of one test, responding to 17065 questions in total (75.8 questions per user on average). In order to use the system, each user first had to enter their first and second language background. Table 6.3 shows that, although most users had English as their first language, we had users from 36 other language backgrounds, together accounting for 60% of the user population. Note that this includes 19 people of Japanese first language background, who may have been language educators experimenting with the system. For the remainder of this chapter we exclude these users from our analysis, since their responses are unlikely to be characteristic of typical learners. Of the two syllabi available to users, 71% of users chose the more advanced JLPT level 3, whereas 29% chose JLPT level 4. As with any web-based system, participation rates were not even between users. Figure 6.13 gives the number of users that have taken at least n tests, for increasing n. The majority of users completed fewer than ten tests in total. A

Language # Users English 90 41.3% German 28 11.9% Japanese 19 8.7% Marathi 12 5.5% Indonesian 9 4.1% Vietnamese 6 2.8% Chinese 6 2.8% Other (30 more) 50 22.9% Total 219 100%

Table 6.3: Distribution of active users by first language.

small handful completed many more.
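The reverse cumulative counts plotted in Figure 6.13 can be derived directly from per-user test totals. A minimal sketch; the toy data below is illustrative, not the thesis data:

```python
from collections import Counter

def reverse_cumulative(tests_per_user):
    """For each n, count the users who completed at least n tests."""
    counts = Counter(tests_per_user)
    result = {}
    total = 0
    # Walk n from the maximum down to 1, accumulating user counts.
    for n in range(max(counts), 0, -1):
        total += counts.get(n, 0)
        result[n] = total
    return result

# Toy data: most users take few tests, a handful take many.
curve = reverse_cumulative([1, 1, 1, 2, 2, 3, 5, 10])
print(curve[1], curve[3], curve[10])  # 8 3 1
```

Plotting `n` against `curve[n]` reproduces the shape of the figure: a steep drop at small `n` with a long, shallow tail.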

Figure 6.13: Number of users (reverse cumulative) set against number of tests.

Also of interest is how these users made use of the system. Did they test themselves repeatedly over a short space of time, or return to the site occasionally over longer periods to determine their progress? To measure this, we ordered the tests each user took sequentially by their timestamp, and measured the time between sequential tests. Figure 6.14 shows a reverse cumulative histogram of the mean time between tests for each user, with time on a


Figure 6.14: The proportion of times between tests taken sequentially by a user, expressed as the proportion of such times of magnitude > x hours (reverse cumulative histogram).

log scale. It shows that the vast majority of tests are taken in quick succession to the previous test, within a few minutes. This suggests that many users used the system as a drill rather than a proficiency test, and thus that we will be unable to measure long-term progress for most users due to the short time frames.

User ability and improvement

Kanji Tester is designed to help learners self-evaluate their proficiency; for this reason, items for tests were chosen randomly from the user's chosen syllabus, as discussed in Section 6.1. Over the course of many such tests, Kanji Tester gains an increasingly accurate picture of a user's knowledge level. If we take our best estimate of their ability, i.e. their average accuracy across all responses, we get scores as shown in the histogram in Figure 6.15. The majority of users in both tests had mean accuracy greater than 0.75 over all tests taken. This figure is representative of their likely performance on the equivalent portion of the JLPT, namely multiple choice questions of this form. However, the JLPT is designed to test various facets of Japanese knowledge, so it is unlikely that we could accurately predict

performance on the whole test from just receptive vocabulary knowledge.
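The drill-versus-proficiency-test distinction above rested on each user's mean gap between sequential tests (Figure 6.14). A minimal sketch of that computation, using hypothetical timestamps rather than the actual logs:

```python
from datetime import datetime, timedelta

def mean_gap_hours(timestamps):
    """Mean time between sequential tests for one user, in hours.

    Returns None for users with fewer than two tests, since no gap
    is defined for them."""
    ts = sorted(timestamps)
    gaps = [(b - a).total_seconds() / 3600.0 for a, b in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps) if gaps else None

t0 = datetime(2008, 11, 1, 12, 0)
# Three tests in quick succession: gaps of 5 and 10 minutes.
tests = [t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=15)]
print(mean_gap_hours(tests))  # 0.125
```

Binning these per-user means on a log time axis gives the reverse cumulative histogram of the figure.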


Figure 6.15: Mean score per user as a histogram.

We are of course interested in how this proficiency is related to usage of Kanji Tester, since the premise of our work is that better self-evaluation will allow users to make better decisions about their study patterns, ultimately improving their proficiency. If we compare our best estimate of ability against the number of tests each user takes, we get the results shown in Figure 6.16. Here, we see again that most users take few tests, and already score quite well. A linear trend line shows a very weak positive correlation between usage and proficiency estimates.

Beyond static estimates of user ability, user tests are also distributed in time, and thus can serve to measure change in ability over time. We now examine changes in ability in several ways. Firstly, we can use the number of tests taken as an estimate of study time, and compare study time and ability. Figure 6.17 takes this approach, and shows a very weak negative correlation between ability and study time as measured. Although it is theoretically possible that increased study with poor study habits could reduce ability, a more likely interpretation of this graph is that one or both of our measures

[Scatter plot: mean score vs number of tests taken; linear trend y = 0.0019x + 0.8267, R² = 0.0069.]

Figure 6.16: Ability, as measured by mean score for each user, plotted against usage, as measured by the total number of tests each user took.

[Scatter plot: score on nth test vs number of tests taken; linear trend y = -0.0015x + 0.8743, R² = 0.0123.]

Figure 6.17: Ability, as measured by test score on the nth test, plotted against study time, as measured by number of tests taken n, for all users.

lack full validity. For example, we saw in Figure 6.14 that most tests are taken in rapid succession. If users lose interest or get tired (user fatigue), using test scores to measure ability could lose validity, since we may be partially measuring user attentiveness instead. Similarly, if tests are taken in rapid succession in a single session, the number of tests taken may not be a good measure of broader time spent studying.
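The trend lines and R² values quoted for Figures 6.16 through 6.18 are ordinary least-squares fits, which can be computed directly. A minimal sketch; the toy data is illustrative, not the thesis data:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a*x + b, plus the R^2 goodness of fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx                 # slope
    b = my - a * mx               # intercept
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot      # proportion of variance explained
    return a, b, r2

# Sanity check: a perfect line recovers its slope, intercept and R^2 = 1.
a, b, r2 = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(round(a, 3), round(b, 3), round(r2, 3))  # 2.0 1.0 1.0
```

The very small R² values in the figures (all below 0.02) indicate that, whatever the sign of the slope, usage explains almost none of the variance in score.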

[Scatter plot: scores over time; linear trend y = 0.0004x + 0.8648, R² = 0.004.]

Figure 6.18: User scores plotted against time in days since their first test.

To address these concerns about measurement validity, we now plot user scores on tests against real time in days since their first test (Figure 6.18). Each point represents a user's mean score on a single day in which they took tests. The linear trend-line shows a very weak positive correlation between ability and time passed, indicating that users are in general improving in proficiency over time.

Compared to flashcard-style progression through a syllabus, where users are continuously shown the same items until they are reliably learned, users of Kanji Tester encounter new items constantly, since items are chosen randomly from the syllabus. However, with sufficient tests, many items are encountered and tested several times. One way to determine if, general improvement aside, users are improving on the actual items they are tested on is

to compare their responses to such items over time. Since data is quite sparse, we looked only at users who had encountered at least one item multiple times, and compared the mean score on their first encounter with such items to the mean score on the last encounter. The former should measure the knowledge they had before using the system; the latter represents any change.
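The pre/post comparison just described can be sketched as follows. The `responses` tuples are a hypothetical simplification of the actual log schema, not its real field names:

```python
from collections import defaultdict

def pre_post_differences(responses):
    """responses: iterable of (user, item, timestamp, correct) tuples,
    with correct as 0 or 1. For each user, over items they saw at least
    twice, return mean(last-encounter score) - mean(first-encounter score)."""
    by_user_item = defaultdict(list)
    for user, item, ts, correct in sorted(responses, key=lambda r: r[2]):
        by_user_item[(user, item)].append(correct)
    firsts, lasts = defaultdict(list), defaultdict(list)
    for (user, item), seq in by_user_item.items():
        if len(seq) >= 2:                 # items seen only once are excluded
            firsts[user].append(seq[0])
            lasts[user].append(seq[-1])
    return {u: sum(lasts[u]) / len(lasts[u]) - sum(firsts[u]) / len(firsts[u])
            for u in firsts}

responses = [
    ('u1', '紙', 1, 0),   # first encounter: incorrect
    ('u1', '紙', 2, 1),   # later encounter: correct
    ('u1', '月', 1, 1),   # seen only once: excluded
]
print(pre_post_differences(responses))  # {'u1': 1.0}
```

Histogramming the per-user differences gives a plot of the kind shown in Figure 6.19.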


Figure 6.19: e difference in scores between the first time a stimulus was seen and the last time it was seen, expressed as a histogram.

The results of this analysis are shown in Figure 6.19. They show that on average there is a slight improvement in accuracy on stimuli tested more than once, although for many users there is no difference. This in turn suggests either that few users use the provided "study view" and carefully study their mistakes, or that the majority of stimuli encountered twice were those which users already knew.

To eliminate these potential effects, we modelled learner knowledge as a finite state machine, where each word or kanji is either known or unknown for a given user when tested. A correct answer upon subsequent encounter with the stimulus indicates a transition to the known state, an incorrect answer a transition to unknown. We populated the model with response data by examining again all stimulus pairs that an individual user encountered at least twice, and ordered that user's responses temporally into a sequence such as

[State diagram: Known→Known 0.94, Known→Unknown 0.06, Unknown→Known 0.75, Unknown→Unknown 0.25.]

Figure 6.20: State transition diagram for known and unknown items. In this context "known" means the user answered the most recent question about the item correctly.

⟨incorrect, correct, correct⟩. We used the first response in each sequence to determine the starting state, and each subsequent response as a transition. Assuming that each stimulus item had an equal chance of being learned (or not) after each test, we combined data from all stimulus items into a single state transition model, given in Figure 6.20. The results are encouraging: known items have only a 6% chance of becoming unknown, but unknown items have a 75% chance of becoming known.

However, this model has limits on its predictive capacity. Exploring its performance characteristics, we find that the long-run likelihood of getting any item correct is 0.75/(0.06 + 0.75) ≈ 0.93, even after it has been tested many times. A previously unknown item reaches this level after it has been tested just three times. This artefact, a residual 7% error rate, suggests limits to the model. In particular, changes in user response do not always indicate changes in user knowledge, due to two main effects. Firstly, some correct answers are actually correct guesses. After a correct guess, the learner will not be prompted to study the item, and is thus less likely to improve their knowledge for subsequent tests. Secondly, knowledge of a kanji or word is graded and multi-faceted. In reality, a learner may easily recall one aspect of word knowledge (e.g. its meaning) but not another (e.g. its pronunciation). Success or failure on questions based around the same word may purely reflect the different aspects of word knowledge being tested, rather than indicating changes in word knowledge.

Ideally, both of these problems could be overcome through the addition of multiple states to our model. In the first case, states could be added for each main aspect of knowledge, in particular for the links we showed earlier in Figure 6.3. In the second case, states could be added indicating two, three or four successive correct answers, in order to capture both the graded nature of knowledge and the weaker relationship between a correct answer and item knowledge due to guessing. Unfortunately, since Kanji Tester tests words and kanji randomly from the syllabus, data sparsity issues prevent us from examining these richer models.

User      First      JLPT    # responses   # tests   Mean       Pre/post     Time
          language   level                           accuracy   difference   period
305       English    L3      1160          24        0.839       0.056       5 days
251       English    L4       860          18        0.914       0.056       12 hours
191       Shona      L3       670          67        0.666       0.087       66 days
167       German     L3       450          45        0.911      -0.058       62 days
253       German     L4       440          16        0.925       0.033       27 days
Average   [English]  [L3]     75.7         4.99      0.880       0.110       3.7 days

Table 6.4: Basic user statistics for the 5 users with the most responses.
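The two-state model of Figure 6.20 behaves as a small Markov chain, and its convergence and long-run accuracy can be checked directly. A sketch using the pooled transition probabilities from the figure:

```python
# Transition probabilities estimated from pooled response sequences
# (Figure 6.20): outer key is the current state, inner key the next state.
P = {
    'known':   {'known': 0.94, 'unknown': 0.06},
    'unknown': {'known': 0.75, 'unknown': 0.25},
}

def step(dist):
    """Propagate a state distribution through one test (one transition)."""
    return {
        s: sum(dist[t] * P[t][s] for t in dist)
        for s in ('known', 'unknown')
    }

dist = {'known': 0.0, 'unknown': 1.0}   # start from a fully unknown item
for i in range(1, 4):
    dist = step(dist)
    print(i, dist['known'])

# Stationary probability of answering correctly, from the balance equation:
print(round(0.75 / (0.06 + 0.75), 3))  # 0.926
```

After three transitions the probability of being in the known state is already about 0.92, matching the claim that a previously unknown item approaches the 0.93 ceiling after just three tests.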

Power users

Having examined estimation of user ability across the entire user population, we now restrict our focus to so-called "power users", the top 5 users in terms of number of tests taken. We provide some basic descriptive statistics for these users in Table 6.4, comparing them to the general user population. These users differ in nearly every aspect, giving little indication of what motivated them to use the system more than others. Their first languages include English and German, the largest two language groups, but also Shona, a Bantu language. They also span both syllabi supported by the system, JLPT levels 3 and 4. Each user performed 6-15 times more tests than the population average, but they varied widely as to the length of test they preferred: User 305 used almost exclusively tests with 50 questions, while User 191 used tests of 10 questions. Our power users vary around the mean accuracy level, with User 191 in particular having a very low accuracy compared to the population average.

All of these users continued for long enough to be tested randomly on the same item twice, and thus can be given a pre/post difference score averaged across such items. User 167 was the only user amongst the five to show negative improvement; the other four all registered improvement, although less than the population average. This would seem counter-intuitive: if increased use is beneficial to users, then we would expect a greater-than-average improvement in the ability of power users. However, many users in the general user pool never encountered an item more than once, and those that did typically encountered few such items. The smaller this number of items, the noisier the pre/post estimate for that user. This suggests that the population average on this measure may be skewed in magnitude due to data sparsity.
Figure 6.21 gives an alternative view of each of the five power users, plotting both the total number of unique stimulus items each was exposed to at the nth test, and the number they answered correctly on the last encounter. The graphs should be interpreted with an understanding of two caveats. Firstly, since tests taken differed in size according to each user's preferences, the number of items actually tested per data point can vary from one to the next. This could manifest itself in differing slopes at different points in time. Secondly, this effect could be compounded since the number of tests taken is only a rough indication of the passage of time: the time between two neighbouring tests could be a matter of minutes or weeks.

With these provisos, we note that the "Tested" lines display the asymptotic behaviour we would expect from randomly choosing stimuli: they gradually approach the total number of kanji or words in each syllabus, but with each test the likelihood of re-encountering a previously seen item increases, hence the decreasing slope as the asymptote is approached. The ratio between the number tested and the number correct when last tested is also interesting to observe between users. At any point in time, the ratio can be considered an estimate of the user's "true knowledge" of the syllabus's vocabulary at that point in time. For some users, for example user 251, a gap develops between the two lines for kanji knowledge, indicating a number of errors made. The gap later closes when those kanji are retested on later encounters. For user 191, a large gap emerges between the two lines and widens with increasing coverage of the syllabus, indicating the lower proficiency reached at the point of final testing. Interestingly, the final ratio of each user's lines visually agrees well with the mean accuracy measurements for each user in Table 6.4, particularly on their word graphs: user 191 performs most poorly, followed by user 305.
The remaining three power users are comparably high. User 191, who according to our pre/post ratio estimates improved the most over the testing period, also had the most room to improve. In comparison, the other users already had good knowledge of their syllabus, perhaps offering an explanation why they too did not show the same level of improvement, despite heavy use of the system.

Figure 6.21: Accuracy and exposure for words and kanji by the top-5 power users. Each graph shows the total number of unique items tested after the nth test, and the number of items responded to correctly upon last encounter.

Name                   Type       # questions
Random glosses         Simple     11580    57.1%
Visual similarity      Adaptive    2567    12.7%
Reading alternations   Adaptive    2346    11.6%
Random surfaces        Simple      1948     9.6%
Random readings        Simple      1831     9.0%
Total                             20272   100.0%

Table 6.5: Plugin name, type and number of questions generated.

Measuring adaptiveness

Kanji Tester is designed to generate authentic tests in two main ways: firstly, we make use of error models borrowed from the new FOKS system (Chapter 5). Secondly, we update these error models with user responses, allowing them to adapt the test to each user individually. As discussed in Section 6.3, Kanji Tester maintains two error models for each individual, namely a kanji reading model and a kanji misrecognition model, both of which are updated continuously with the user's responses. Recall that each user's tests alternate between control and adaptive versions, both of which consist of randomly generated questions. Control and adaptive tests consist of multiple choice questions of the same type, each with distractors selected randomly from the same pool. However, for control questions the distractors are sampled from the pool with uniform probability; for adaptive questions the distractors are sampled according to the likelihood of eliciting a user error, as predicted by the user's error model. The control/adaptive distinction is designed to allow us to better evaluate the utility of having adaptive user models, and by having both forms of testing draw on the same distractor pools we set a strong baseline from which to improve. In this section, we examine the types of questions used and use the control/adaptive question distinction to measure the extent and utility of our adaptive user models.

Recall that each question in a test is generated by a plugin, with each plugin responsible

Item type        Mean # exposures
Words            12.0
Kanji            12.4
Kanji combined   45.7

Table 6.6: Mean number of exposures across all users by item type. The Kanji combined type counts words containing kanji as exposures for those kanji.

for a limited range of question types. There are two types of plugins, namely simple plugins which do not adapt to the user, and adaptive plugins which maintain an error model which adjusts to user responses. Our plugins use the term gloss to refer to a word's translation in English, surface to refer to a word's visual form, and reading to refer to its pronunciation.

Table 6.5 shows the distribution of questions generated by each question plugin. It is quickly apparent that the majority of questions are generated by the "Random glosses" plugin, covering some 57.1% of questions. This occurs because many words, especially at earlier levels of proficiency, are not yet represented by their full kanji form. This prevents us from using our more sophisticated adaptive question types, which centre around kanji misreading or misrecognition. In these cases, only the reading-meaning relationship could be tested, hence the large coverage of this plugin in particular. This is an immediate blow to adaptivity, since the adaptive plugins need user responses in order to tune the user models.

Now, our two adaptive user models are P(reading|kanji) and P(kanji′|kanji). In order for these models to adapt to an individual user, they must face questions generated using these distributions, i.e. generated by their corresponding adaptive plugins. However, Table 6.6 shows that across all users, each word has only 12.0 exposures on average, with kanji not much higher at 12.4. Since each syllabus has many words and kanji, this means that on average each user won't encounter each word or kanji more than once. This is similarly a big problem for adaptivity at the per-user per-kanji level, although the situation is slightly improved because kanji may occur in many words, as the Kanji combined figure shows. Given these figures, we should expect very little adaptivity from Kanji Tester in its generation of questions.
However, we can nonetheless measure the effectiveness of the adaptive plugins by comparing them to their simple counterparts. Recall that for each user, every second test completed was a control test which used simple plugins only. Simple plugins used the same error candidates, but ignored their probability distribution and simply sampled uniformly across candidates.

Figure 6.22 provides error rates for each plugin (Reading: 0.14 simple, 0.19 adaptive; Surface: 0.09 simple, 0.09 adaptive; Gloss: 0.12 simple), and shows that questions based on surface form are the easiest, whereas those based on reading are the hardest. Comparing the simple and adaptive plugins, the gap between the two reading question plugins is substantial. Using a two-tailed Student's t-test between error vectors shows the difference is significant at the 99% level. This suggests that learners make many more reading errors when the distractors are plausible misreadings. On the other hand, simple and adaptive surface questions are roughly comparable in error rate, with their difference significant only at a 20% level. This could indicate either that learners make few errors of this type, or that our priors are too weak for these models. Note that questions based on gloss have no adaptive counterpart, since no appropriate error model could be easily constructed which could guarantee that distractor glosses were actually incorrect.
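The control/adaptive contrast amounts to uniform versus error-weighted sampling of distractors from a shared candidate pool. A minimal sketch; the candidate kanji and weights below are illustrative, not drawn from the actual error models:

```python
import random

def sample_distractors(candidates, weights=None, k=3):
    """Draw k distinct distractors from a shared candidate pool.

    Control questions pass weights=None (uniform sampling); adaptive
    questions pass weights proportional to the error model's estimated
    likelihood of each candidate eliciting a user error."""
    pool = list(candidates)
    w = list(weights) if weights is not None else [1.0] * len(pool)
    chosen = []
    for _ in range(min(k, len(pool))):
        pick = random.choices(range(len(pool)), weights=w)[0]
        chosen.append(pool.pop(pick))   # sample without replacement
        w.pop(pick)
    return chosen

random.seed(0)
candidates = ['大', '太', '犬', '木']
print(sample_distractors(candidates))                          # control: uniform
print(sample_distractors(candidates, [0.6, 0.3, 0.08, 0.02]))  # adaptive: error-weighted
```

Because both modes draw from the same pool, any difference in elicited error rates can be attributed to the weighting rather than to the candidates themselves.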


Figure 6.22: Error rates compared for simple and adaptive plugins.

In general it appears that adaptive questions are able to generate more user errors compared to their simple counterparts, but how was this possible if they adapted little to each user? The more difficult adaptive questions arise because the user models use rebuilt FOKS error models as priors. Since per-user error models adapt little, they are in general similar across all users, but still effective in generating distractors due to these priors.

Error models

We used two main error models: our reading alternation model P(reading|kanji), and our grapheme alternation model P(kanji′|kanji). A significant motivation for constructing Kanji Tester was the opportunity to test these models proactively rather than reactively. FOKS attempts to recover from user queries, but if users are conservative in their querying, we may encounter only limited forms of errors. Since Kanji Tester is a more constrained environment, we can attempt to provoke errors by introducing distractors to users, and thus obtain greater coverage over error forms. This greater coverage comes at a cost: users of dictionaries have at least partial knowledge of the word, whereas the worst-case scenario for a test question is where the user has no applicable knowledge, and guesses near-randomly. Nonetheless, the difference in error patterns between the two applications is worth examining.

We first consider the reading alternation model, which provoked significantly more errors than average. Using the same query explanation feature of FOKS, we can recover the types of errors made by users. Table 6.7 provides the distribution of error types users made, and is an interesting comparison with our earlier analysis of FOKS queries (see Table 5.7). Whilst choice of reading remains the most common error type, non-compositional reading errors are significantly reduced. This is most likely because the syllabi available for testing contain few proper nouns, which are often non-compositional. The other main difference is the increase in voicing errors compared to vowel length errors. Although some voicing errors were cases of sequential voicing, such as choosing tekami as the reading for 手紙 tegami "letter", many also mistakenly voiced initial consonants, for example choosing dzuki as the reading for 月 tsuki "month". There are two plausible reasons why these additional voicing errors are being made by learners.
Firstly, they may have encountered kanji or words in larger compounds first, where sequential voicing was correctly applied. From these encounters, they may assume mistakenly that the reading of the item outside the compound is also voiced this way. For example, if the learner knows the place name 新橋 shiNbashi, they may mistakenly read 橋 “bridge” as

Error type                  Frequency
Choice of reading           174    76.0%
Voicing error                23    10.0%
Vowel length                 14     6.1%
Non-compositional reading     9     3.9%
Sound euphony                 8     3.5%
Palatalisation                1     0.4%
Total                       229   100.0%

Table 6.7: Error types for adaptive reading questions.

Word/kanji                      # Errors   Users chose
急度 kitto “without fail”          7       急言/多計/意文/急来/思度/悪言
出る deru “to appear”              6       千る/高る/南る/者る/計る
意 i “feelings”                    5       思/春
長い nagai “long”                  4       大い/英い/足い
題 dai “subject”                   4       員/貸/買
代 dai “price”                     4       貸/会/仕
仕方 shikata “way”                 4       生方/仕走/工去/行方
小父さん ojisaN “uncle”            4       小入さん/少父さん/小大さん

Table 6.8: Top 8 words or kanji by number of grapheme substitution errors.

bashi instead of its correct reading hashi. Secondly, they may be confused by the graphemic similarity between pronunciation strings, which are provided in kana. In Japanese, voicing is marked in hiragana and katakana by a diacritic. For example, ha, ba and pa (unvoiced, voiced, semi-voiced) are written as は, ば and ぱ. Since these diacritics are small, learners could easily visually mistake voiced and unvoiced variants of these kana. Such errors are not explicitly modelled as reading alternations; incorporating knowledge of them into future systems would prove interesting. We now examine grapheme substitution errors. Table 6.8 shows the most common errors made by users of Kanji Tester. On the right of each word, we see distractors which different users mistook for the correct answer. From these results, we can see mixed levels of apparent similarity between the user errors and the correct answer. For example, pairs such as 小父さん and 小大さん seem visually similar, yet others such as 出る and 者る do not. The presence of the latter type of pair suggests that in some cases, the user simply guessed randomly amongst alternatives. This may have happened in our reading results as well, but each reading result is also a plausible misreading, whereas noise in our graphemic distance models means that some neighbours generated by the model will share little visual similarity. An alternative explanation of the same problem is that some regions in grapheme space are visually dense, and others visually sparse. Sparse items have few plausible neighbours, yet we must still generate a sufficiently large distractor pool for such items for use in testing. Thus, even a very accurate graphemic distance model is forced to occasionally include neighbours beyond a plausible misrecognition threshold.
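To make the sparsity problem concrete, the following sketch shows how a distractor pool might be assembled from a graphemic distance ranking. The function, distance values and threshold here are hypothetical illustrations, not the actual Kanji Tester implementation: when too few neighbours fall inside the plausibility threshold, the pool must be padded with implausibly distant ones.

```python
def build_distractor_pool(target, distances, threshold=0.3, pool_size=4):
    """Select distractors for `target`, ranked by graphemic distance.

    `distances` maps each candidate kanji to its distance from the
    target in [0, 1] (smaller = more visually similar).  Candidates
    within the plausibility threshold are preferred; if that region of
    grapheme space is sparse, the pool is padded with more distant,
    less plausible neighbours to reach the required size.
    """
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    plausible = [k for k, d in ranked if d <= threshold]
    padding = [k for k, d in ranked if d > threshold]
    return (plausible + padding)[:pool_size]

# 末 has only three close neighbours here, so a pool of four must
# include the visually distant 大 -- a padding distractor of the kind
# users can often eliminate by guesswork.
pool = build_distractor_pool("末", {"未": 0.05, "本": 0.20, "木": 0.25,
                                    "大": 0.60, "犬": 0.65})
```

For a kanji in a dense region of grapheme space the padding step never triggers; for sparse kanji it is unavoidable, which is exactly the behaviour described above.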

6.5 Discussion

In this chapter we have discussed a system allowing users to test their knowledge of Japanese vocabulary, within the scope of two supported syllabi. The goal of the system was to allow repeatable, accurate self-evaluation by replicating limited aspects of human tests. Have we succeeded? We can measure success in several ways. We have clearly built a system that allows rapid self-evaluation, assured by our random sampling over the user’s chosen syllabus and question generation which is at least minimally resistant to random guessing. However, any system with these features would achieve this aim without requiring significant error modelling. On the other hand, have we replicated human tests? Our reading alternation questions are certainly successful, eliciting many more errors than their control counterparts. Our grapheme alternation questions however were comparable to their control counterparts, and thus do not match a human-generated standard. There are several problems we face generating such questions. Firstly, our best graphemic distance model – stroke edit distance – still has significant noise when compared with human judgements. This noise must be reduced if we are to generate questions with more difficult distractors. Secondly, the natural sparsity of some areas in the visual similarity space means that there are simply not enough motivated distractors to generate difficult questions for all kanji or all words. This is solved in human-generated tests by the use of plausible non-kanji, made up of real components in combinations which do not occur in real life. Ultimately, the only way to solve the data sparsity problem is to find a way to automatically generate distractors using such non-kanji; as yet no such automated system exists, though use of Chinese, Korean or archaic variants could alleviate this concern.

We drew a distinction between different forms of testing in Section 6.1, from flashcards to human-generated tests, and claimed that Kanji Tester would provide the availability of flashcards, yet with validity and scope closer to human tests. In its current form, Kanji Tester serves as a useful intermediary between these two forms of testing. Since it is not yet as powerful as human-generated tests, it cannot replace them. Yet neither can it replace flashcards, for the simple reason that flashcards provide a progression through the syllabus which facilitates learning. In comparison to flashcards, Kanji Tester is superior in the level of self-assessment, yet inferior as a drilling tool. Could this be improved? Whilst flashcards will continue to be useful, a future system could certainly benefit by simply providing a linear progression through the syllabus, returning repeatedly to items responded to incorrectly. For example, it could use Leitner’s (1972) spaced repetition method, as discussed in Section 3.1. This linear progression would ensure full eventual coverage of the user’s syllabus. However, it would not be useful for testing the user’s vocabulary knowledge on the whole syllabus. For that, the current random sampling of the syllabus gives better validity. The current method of testing ability has not been explicitly evaluated against current forms of paper testing, however, and without such formal comparisons it cannot be used by teachers to make decisions about their students. For these reasons, the suggested future system would thus provide two modes of testing. One would provide an alternative to flashcards, as a means of vocabulary study.
The other would serve purely as a means of vocabulary self-assessment, evaluated formally against other tests to prove its validity. This formal assessment would allow it to bridge the divide between learner tools for self-use, and tools to aid language educators in rapid assessment of their students. Overall, the current Kanji Tester system provides a compelling example of the increasing possibility of generating intelligent learning tools for learners, based on strongly motivated error modelling. The better we can understand learners, the better we can adapt to their needs. Applied linguists have long looked at the minutiae of learner errors to better understand how they acquire language; our work does the same, but uses those errors to support learners in their self-study and self-discovery of language.
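As a sketch of the drilling mode suggested above, Leitner's (1972) box system can be captured in a few lines. This is an illustrative implementation under simple assumptions (box i is reviewed every 2^i sessions), not a proposal for an exact schedule.

```python
class LeitnerScheduler:
    """Sketch of Leitner's (1972) box system: a correct answer promotes
    an item to a higher box (reviewed less often); an incorrect answer
    demotes it back to box 0 (reviewed every session)."""

    def __init__(self, items, n_boxes=3):
        self.boxes = {item: 0 for item in items}  # everything starts in box 0
        self.n_boxes = n_boxes

    def due(self, session):
        # Assumed schedule: box i is reviewed every 2**i sessions.
        return [it for it, b in self.boxes.items() if session % (2 ** b) == 0]

    def record(self, item, correct):
        if correct:
            self.boxes[item] = min(self.boxes[item] + 1, self.n_boxes - 1)
        else:
            self.boxes[item] = 0

sched = LeitnerScheduler(["手紙", "月"])
sched.record("手紙", True)   # answered correctly: promoted to box 1
```

Unlike random sampling, this schedule guarantees that every syllabus item is eventually revisited, with incorrectly answered items returning most frequently.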

6.6 Conclusion

Nearly all forms of language learning can be construed as testing. In this chapter we have bound together all aspects of our work and embedded them in a vocabulary testing application, Kanji Tester, which approximates the well-known JLPT test family in its coverage and testing of vocabulary. Kanji Tester makes use of both intelligent kanji (mis)reading models from the FOKS dictionary (Chapter 5), and a new kanji (mis)recognition model generated from the novel stroke edit distance metric (Chapter 4). Furthermore, we explored how these error models might change over time in response to user input. In particular, the adaptive reading model achieved a statistically significant increase in the number of user errors elicited. The data sparsity issues we encountered suggest two possible directions the existing system could take in the future. Firstly, Kanji Tester could attempt to move to a user model of coarser granularity, for example by pooling user models for learners from the same first language background. Secondly, Kanji Tester could divide itself into two applications: a drilling application, which would revisit stimuli according to a spaced repetition schedule instead of randomly, and a testing application, meant for occasional authoritative testing. This would better match the patterns of current use, and by revisiting incorrect items more frequently in drill mode, data sparsity issues would be alleviated. Fortunately, the current body of user responses serves as a useful data set on which to evaluate new and as yet untested user models, without having to deploy them and collect new user data over time. Over longer time periods, it may also serve as a useful data set for measuring learner improvement. More broadly, Kanji Tester has limited itself to vocabulary testing and focused on kanji-based errors.
There are many additional aspects to proficiency which could be tested, and template-based systems such as Zock and Afantenos's (2007) pattern drills seem a good partner for our approach. A current shortcoming of Kanji Tester is its lack of intelligent gloss-based questions. A large barrier to such questions currently lies in the need to avoid using too-near synonyms as distractors, since their use poses the risk of unintentionally generating questions with more than one valid answer. If this risk could be avoided through use of semantic similarity measures, stronger testing would be possible.

Chapter 7

Conclusions

7.1 Conclusions

This thesis has investigated the use of linguistic error modelling to support language learners in their autonomous self-study of Japanese vocabulary. Acquiring sufficient vocabulary is perhaps the most difficult task in language learning because the breadth of knowledge to be acquired is enormous, too much to be taught explicitly. The most commonly advocated solution to acquiring large amounts of vocabulary is to read widely, yet Nuttall (1996) found a system of positive feedback not only in successful reading, but in unsuccessful reading too. This indicates some sort of tipping or boundary point beyond which successful readers propel themselves towards fluency, and before which reading is of limited use and can even reduce motivation. Laufer (1997) suggests this tipping point lies at the 3000 word family mark for vocabulary knowledge. The central problem in vocabulary learning is thus to allow learners to reach this mark, and to support their early attempts at reading.

This thesis has made three main contributions towards solving this problem. Firstly, noting the potential for lexical relationships to support vocabulary growth, it examined in depth graphemic relationships between words and characters based on similarity of visual form, and compared several novel similarity metrics for Japanese kanji. In order to do this, it has also generated three useful data sets for the evaluation of such models, not including the additional data which will continue to be generated over time by both the FOKS and Kanji Tester systems as they are used. In its evaluation, it has found the stroke edit distance and tree edit distance metrics to most effectively match human judgements, and based on the available data has identified stroke edit distance as the preferred distance metric due to its relative efficiency in comparison to its tree-based cousin.

The second main contribution involved the transformation of stroke edit distance into a confusability model for Japanese kanji, and the incorporation of this model into an existing dictionary to support search-by-similarity. Several other minor enhancements were made to the FOKS dictionary as part of this rebuild, including the addition of a more scalable grapheme-phoneme alignment algorithm. Upon post-hoc analysis of log data, our new search method was found to have been quickly adopted by users, to the extent that searches by similarity outnumbered the previous form of intelligent reading search during the same log period, indicating their utility.

Finally, we combined existing phonemic and graphemic error models and applied these to the new area of adaptive testing in the form of the Kanji Tester site. By emulating the authoritative JLPT test, Kanji Tester aimed to allow users better self-evaluation through rapid and repeatable proficiency testing. Through analysis of user responses over a several-month log period, we established that use of adaptive testing increased the difficulty of test questions significantly over a control baseline. These theoretical and practical contributions combined to form a platform for second-language vocabulary growth in Japanese.

Graphemic similarity

Four new graphemic distance models were proposed for Japanese kanji, based on radicals, pixels, tree-based structure and strokes. These distance models were evaluated on three main data sets: explicit human similarity judgements, expert judgements from a commercial flashcard set, and native speaker judgements from a candidate pool experiment. The first experiment showed that for low-to-medium similarity pairs, shared radicals and similar layout were most important for measuring similarity. Furthermore, the results suggested that non-speakers of Japanese do not differ fundamentally in their judgements to native speakers, but rather their judgements gain in consistency as their experience with such characters increases. The flashcard data set consisted of expert-selected high-similarity pairs, and here radical methods performed the most poorly across all methods of measurement. Instead, stroke and tree-based methods modelled most closely the expert judgements, followed by pixel comparison of kanji images. A distractor pool experiment eliciting native speaker judgements confirmed this ordering, suggesting that for high-similarity pairs, both form and layout of components were important. The success of the stroke-based metric was explained in terms of its success in fuzzy matching of structural and form-based features, and due to its efficiency in comparison to other metrics, it was established as the preferred metric given our data.
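To illustrate the flavour of a stroke-based metric, the sketch below computes a unit-cost Levenshtein distance over stroke sequences. The stroke codings here are hypothetical single-letter direction labels, far coarser than the actual stroke signatures; indeed, the example shows why richer features are needed, since 土 and 士 collapse to the same sequence at this granularity.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two stroke sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a stroke
                          curr[j - 1] + 1,     # insert a stroke
                          prev[j - 1] + cost)  # substitute a stroke
        prev = curr
    return prev[n]

# Hypothetical codings: one direction label per stroke (H = horizontal,
# V = vertical).  土 and 士 differ only in stroke length, which these
# coarse labels cannot express -- the kind of noise that richer stroke
# signatures reduce.
strokes = {
    "土": ["H", "V", "H"],
    "士": ["H", "V", "H"],
    "川": ["V", "V", "V"],
}
```

Here `edit_distance(strokes["土"], strokes["川"])` is 2, while 土 and 士 are indistinguishable at distance 0.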

Dictionary search

We examined in detail how an error-correcting dictionary, FOKS, could be extended and augmented to increase its accessibility and thus its support for early reading. As part of a complete rebuild, we modified the dictionary’s grapheme-phoneme alignment algorithm, showing that use of a kanji reading resource allowed far faster alignment times through stronger alignment constraints. Furthermore, its TF-IDF scoring function was shown to perform better when reduced to the IDF component alone, due to the importance of segmenting maximally in the GP alignment task. We extended the dictionary’s coverage of place names by making use of a simple gazetteer mined from Japan Post, and improved the display of translations to accommodate richer word information and yet remain manageable for users. We then made FOKS’s internal error models transparent to users by providing a query explanation tool, which also serves to explain the derivation of correct kanji compound readings. Finally, and most significantly, we incorporated our stroke-based grapheme distance metric into the dictionary to form a new error-correcting search by visual form. This innovative form of search allows users to search for visually similar words using their known neighbours. For example, a user could use the query 補左 to find the word 補佐 hosa “help”, based on the similarity between 左 and 佐. Query logs showed that this new form of search is actively used, and is thus providing benefit for learners.
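The search-by-similarity mechanism can be sketched as query expansion over a confusion table. The table and lexicon below are toy stand-ins for the stroke-based confusion model and the full dictionary; only the expansion logic is illustrated.

```python
from itertools import product

# Toy stand-ins for the stroke-based confusion model and dictionary.
SIMILAR = {"左": ["左", "佐", "右"], "補": ["補", "捕"]}
LEXICON = {"補佐": "hosa (help)", "捕鯨": "hogei (whaling)"}

def search_by_similarity(query):
    """Expand each character of `query` to its visual neighbours and
    return every expansion that is a real dictionary entry."""
    alternatives = [SIMILAR.get(ch, [ch]) for ch in query]
    return {"".join(cand): LEXICON["".join(cand)]
            for cand in product(*alternatives)
            if "".join(cand) in LEXICON}

# The thesis example: the (incorrect) query 補左 retrieves 補佐 "help".
hits = search_by_similarity("補左")
```

In a real system the expansion would be scored by the confusion probabilities rather than enumerated exhaustively, but the candidate-generation step has this shape.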

Vocabulary testing

We demonstrated how linguistic error models can be transferred from the dictionary search domain to the language testing domain, where they can actively provoke user errors rather than passively correcting for them. The reading and graphemic error models from the rebuilt FOKS system were used to develop Kanji Tester, an adaptive testing system which generates multiple-choice questions which are potentially unique for every user and every test. We discussed various methods for grouping users and errors at differing levels of granularity, ultimately choosing to model users as individuals and errors at the kanji level. We then proposed an update rule for this form of error modelling, and applied this rule to make tests adapt to user responses, increasing the likelihood of distractors which previously provoked an incorrect response. Finally, we performed extensive log analysis to determine the relationship between use of the system and user scores on the system, which serve as estimates of receptive vocabulary knowledge. On the whole we found a weak positive correlation between time and test scores, suggesting that users of the system improved in ability the longer they used the system. We also compared adaptive and control question types, and found that adaptive questions caused more errors than control questions, especially for the reading error model. Data sparseness limited the adaptivity of error models to users; however, strong prior distributions meant that tests were sufficiently difficult nonetheless.
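One plausible form of such an update rule (the exact rule is not reproduced here) is a multiplicative boost followed by renormalisation, with weights initialised from the prior error model. The boost factor and probabilities below are hypothetical.

```python
def update_distractor_weights(weights, chosen, correct, boost=1.5):
    """Boost the weight of a distractor the user wrongly chose, then
    renormalise.  `weights` maps each distractor to its selection
    probability, initialised from the prior error model; `boost` is a
    hypothetical free parameter."""
    if not correct:
        weights[chosen] *= boost
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

# Illustrative prior confusion probabilities for the kanji 意.
prior = {"思": 0.5, "春": 0.3, "息": 0.2}
posterior = update_distractor_weights(dict(prior), "春", correct=False)
# 春 provoked an error, so its selection probability rises; with little
# per-user data, the prior still dominates the distribution.
```

This shape matches the behaviour described above: per-user adaptation nudges the distribution, while the strong prior keeps distractors effective even for sparsely observed users.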

7.2 Future work

Many open issues remain for future investigation. The following sections describe these issues, aligning them roughly to our main areas of contribution.

Lexical relationships

In this thesis we focused on graphemic similarity as a novel and under-utilised relationship between words, and measured this similarity through a range of distance metrics which we evaluated. Our best performing metric, stroke edit distance, still contains significant noise which impedes better lookup and testing. However, this area of study is no longer data poor. We have contributed five data sets for evaluating graphemic similarity: two experiments, flashcard distractors, and two sets of log data. These provide a strong foundation for the development of better performing graphemic distance metrics. A simple direction to consider is the appropriate costs to use for tree and stroke edit distance calculations; we used unit costs, but if some features prove to be far more important than others, variable costs might provide a better result. More broadly, there is also scope to examine a broader range of OCR techniques for Japanese kanji, and to develop distance metrics from these techniques. Any such metric generates a topology on the space of kanji. Interesting aspects of these topologies should be explored, and explained with reference to models of kanji perception. For example, orthographic density about a word has already been explored in the literature; this could be extended to examine and explain differences between low and high-density regions of orthographic space. The limits of visual neighbour accessibility should be explored to determine the drop-off in neighbour availability as kanji grow more visually distant from one another. Visual neighbours that are close enough can be considered linked; the resulting network may have interesting properties worth exploring, and comparing with Yamamoto and Yamazaki’s (2009) recent work on network properties of kanji compounds. There are also graphemic relationships other than similarity, for example containment.
These alternative relationships may prove more accessible to learners than similarity, or at the very least complementary, and should thus be examined. Beyond graphemic relationships, there is much work to be done in mapping out, making sense of and unifying the various forms of semantic relationships, since doing so will greatly aid lexicography, and will provide meaningful new methods of understanding words and their relationships to each other.
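The variable-cost direction mentioned above drops naturally into a standard weighted edit distance, sketched below with a hypothetical substitution-cost function over stroke feature labels (the labels and costs are illustrative, not learned values).

```python
def weighted_edit_distance(a, b, sub_cost, indel_cost=1.0):
    """Edit distance in which substitution cost depends on the pair of
    features being confused, rather than a flat unit cost."""
    m, n = len(a), len(b)
    prev = [j * indel_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [i * indel_cost] + [0.0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + indel_cost,
                          curr[j - 1] + indel_cost,
                          prev[j - 1] + sub_cost(a[i - 1], b[j - 1]))
        prev = curr
    return prev[n]

# Hypothetical costs: confusing a horizontal stroke (H) with a
# diagonal one (D) is cheap; any other confusion costs the full unit.
COSTS = {("H", "D"): 0.3, ("D", "H"): 0.3}

def sub_cost(x, y):
    return 0.0 if x == y else COSTS.get((x, y), 1.0)
```

Fitting such costs to the contributed data sets, rather than assuming them, is exactly the research direction suggested above.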

Dictionary search

In this thesis we took an existing dictionary, FOKS, and rebuilt it to increase its accessibility; however, we believe there remain improvements worth investing in. Some of these opportunities lie in correcting known error types, such as errors due to mistaken voicing of kana; others are in helping learners understand the mistakes they have made, for example by providing a spatial explanation of how two similar kanji differ.

We have already mentioned the need for improved graphemic distance models, but any other form of lexical relationship can be used to augment a dictionary in useful ways. For example, semantic relatedness could be used to provide nearby words to users for comparison.

The most interesting area to pursue is the ability to provide users with the means to extend the dictionary in useful ways, thus “crowdsourcing” dictionary enhancements. The Tatoeba project1 already does this for example sentences, so that representative examples of word usage can be provided by users themselves. This could be combined with semantic similarity measures between words, so that very similar words could have example sentences provided which demonstrate the difference between them. Choosing between near-synonyms is extremely difficult for learners, since they lack contextual knowledge about when each word is appropriate, and what additional meaning a word may connote. An appropriate dictionary extension would help solve this difficult problem, and would thus alleviate the difficulties associated with lexical gridding issues between a learner’s first and second languages. There have been several other recent attempts to render dictionaries far more useful and useable, for example, the use of semagrams as a rich semantic encoding of words in dictionaries (Moerdijk et al. 2008). Tapping into user populations to crowdsource this additional richness would also be an approach worth exploring.

An alternative approach for improving dictionary accessibility would be to conduct formal user studies of dictionary search, for example using task timing and verbal protocols (Ericsson and Simon 1985) to elicit hard data on dictionary lookup performance. Such data could provide a principled means of discriminating between existing dictionary offerings based on accessibility, or could be used to look for particular points of difficulty which remain for learners.

1http://tatoeba.org/

Proficiency testing

We have shown that adaptive vocabulary testing is possible, and argued for its future role in replacing human-constructed tests. Kanji Tester is an important first step, but there is still a significant gap between human tests and automated tests. Several open research areas could help close this gap. Firstly, we know that vocabulary knowledge is graded, since there are many aspects of word knowledge, and that productive tests are more difficult than receptive tests, such as Kanji Tester. If Kanji Tester incorporated both forms of test question, the distinction between these two types or levels of knowledge could be analysed and solidified through extensive log analysis. Secondly, having argued for the importance of vocabulary knowledge, we could attempt to estimate the size of a user’s vocabulary knowledge directly, for example using a word familiarity database based on the method of Amano and Kondo (1998). Providing this estimate to learners would allow them to set vocabulary size goals and measure progress towards achieving such goals. We also noted the various levels and granularities of user and error models available for use, but only experimented with a single one of these combinations. By using the existing user response data, we could evaluate different user model combinations, and determine what model best predicts actual user behaviour. This in turn would provide useful insights as to the extent to which user errors and misconceptions are shared across different cross-sections of the learner community. There is significant scope for better receptive question generation, through new and different error models. Existing work on transliteration, such as that by Pervouchine et al. (2009), could be leveraged to generate plausible distractors for katakana loan-words. Learners are also known to confuse words which are semantically similar (Mondria 2007).
Semantic similarity measures, as discussed in the previous section, could thus be used to generate error-provoking gloss distractors based on semantic similarity. For this purpose, Japanese WordNet (Isahara et al. 2008) could be used, choosing siblings or cousins as semantic-similarity-based distractors. The ability to hand-write Japanese kanji is a productive skill requiring recall of form rather than pronunciation, one which we have not discussed at length in this thesis. Modern hand-writing dictionary interfaces such as those discussed in Section 2.2 suggest a way that this skill could be incorporated automatically into tests. Tests requiring both input of kanji by form and input of typed kanji by pronunciation could provide interesting data on the relative productive abilities of learners with respect to form vs sound recall. Nearly all our work on Japanese can be easily extended to Chinese, especially work on graphemic error modelling, since the Chinese writing system uses similar characters to Japanese kanji. The Pinyomi dictionary (Yencken et al. 2007) is a straightforward candidate for such improvement. Its Japanese-Chinese dictionary interface based on lookup-by-transliteration could easily support search by similar kanji or hanzi. Doing so would involve rebuilding one of the similarity models to encompass both hanzi and kanji, but the improvement in accessibility could be significant. Such multilingual models could also be useful in monolingual testing, since they could alleviate the sparse visual neighbour problem. Finally, recall that in Chapter 3 we noted our limited coverage of the Chinese- and Japanese-language literature, especially as relates to native speaker development. We encourage future multi-lingual surveys into these areas, as techniques designed to aid second language learners may equally prove useful for native speakers, especially child learners during their development or adults facing rare words.
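Sibling-based gloss distractor selection can be sketched with a toy taxonomy standing in for Japanese WordNet; the words, hypernyms and exclusion mechanism here are illustrative only.

```python
# Toy taxonomy standing in for Japanese WordNet: word -> hypernym.
HYPERNYM = {
    "犬": "動物", "猫": "動物", "馬": "動物",   # dog, cat, horse -> animal
    "机": "家具", "椅子": "家具",               # desk, chair -> furniture
}

def sibling_distractors(word, exclude=()):
    """Return taxonomy siblings of `word` (sharing its hypernym) as
    candidate gloss distractors.  `exclude` holds near-synonyms whose
    presence could make more than one answer valid."""
    parent = HYPERNYM.get(word)
    return [w for w, p in HYPERNYM.items()
            if p == parent and w != word and w not in exclude]
```

Siblings are semantically close enough to be plausible distractors, while the exclusion list addresses the multiple-valid-answer risk raised in the previous chapter.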

7.3 Summary

In summary, this thesis argued for better support for second language learners in their vocabulary study and early reading experiences, and indeed provides support in both areas. Firstly, it developed new linguistic error models based on graphemic proximity of words and kanji. Secondly, it extended an existing dictionary, improving its usability and accessibility through graphemic-error-correcting search. Finally, it used error models to provide an adaptive testing interface for learners to self-evaluate their study progress. Together, these applications form a platform to assist learners in their autonomous self-study of vocabulary, and thus in their ultimate progression towards fluency.

Bibliography

Allen, Jonathan, Sheri Hunnicut, and Dennis Klatt. 1987. From Text To Speech, The MITALK System. Cambridge, UK: Cambridge University Press.

Amano, Shigeaki, and Tadahisa Kondo. 1998. Estimation of mental lexicon size with word familiarity database. In Proceedings of the 5th International Conference on Spoken Language Processing, volume 5, 2119–2123, Sydney, Australia.

Apel, Ulrich, and Julien Quint. 2004. Building a graphetic dictionary for Japanese kanji – character look up based on brush strokes or stroke groups, and the display of kanji as path data. In Proceedings of the COLING 2004 Workshop on Enhancing and Using Electronic Dictionaries, 36–39, Geneva, Switzerland.

Baayen, Harald. 1993. Statistical models for word frequency distributions: A linguistic evaluation. Computers and the Humanities 26.347–363.

Backhouse, A. E. 1996. The Japanese Language: An Introduction. Melbourne, Australia: Oxford University Press.

Baddeley, A. D. 1997. Human Memory: Theory and Practice. Boston: Allyn and Bacon, revised edition.

Baddeley, Alan. 2003. Working memory and language: an overview. Journal of Communication Disorders 189–208.

Baldwin, Timothy, and Hozumi Tanaka. 1999a. The applications of unsupervised learning to Japanese grapheme-phoneme alignment. In Proceedings of the 1999 ACL Workshop on Unsupervised Learning in Natural Language, 9–16, College Park, USA.

——,and ——. 1999b. Automated Japanese grapheme-phoneme alignment. In Proceedings of the International Conference on Cognitive Science, 349–354, Tokyo, Japan.

——,and ——. 2000. A comparative study of unsupervised grapheme-phoneme alignment methods. In Proceedings of the 22nd Annual Meeting of the Cognitive Science Society, 597–602, Philadelphia, USA.


Barker, David. 2007. A personalized approach to analyzing ‘cost’ and ‘benefit’ in vocabulary selection. System 4.523–533.

Begg, Andy. 1997. Some emerging influences underpinning assessment in statistics. In The Assessment Challenge in Statistics Education, ed. by I. Gal and J. B. Garfield, chapter 2, 17–25. Amsterdam, Netherlands: IOS Press.

Bell, Timothy. 1998. Extensive reading: Why? and how? The Internet TESL Journal 4.

Bilac, Slaven, 2005. Automatically extending the dictionary to increase its coverage and accessibility. Tokyo Institute of Technology dissertation.

——, Timothy Baldwin, and Hozumi Tanaka. 1999. Incremental Japanese grapheme-phoneme alignment. In Information Processing Society of Japan SIG Notes, volume 99-NL-209, 47–54.

——, ——, and ——, 2003. Modeling learners’ cognitive processes for improved dictionary accessibility. Handout at 10th International Conference of the European Association for Japanese Studies, Warsaw, Poland.

——, ——, and ——. 2004. Evaluating the FOKS error model. In Proceedings of the 4th International Conference on Language Resources and Evaluation, 2119–2122, Lisbon, Portugal.

——, and Hozumi Tanaka. 2005. Direct combination of spelling and pronunciation information for robust back-transliteration. In Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing, 413–424, Mexico City, Mexico.

Bille, Philip. 2005. A survey on tree edit distance and related problems. Theoretical Computer Science 337.217–239.

Black, Alan W., Paul Taylor, and Richard Caley, 1999. The Festival speech synthesis system. System documentation.

Breen, Jim. 1995. Building an electronic Japanese-English dictionary. In Proceedings of the 1995 Japanese Studies Association of Australia Conference, Brisbane, Queensland.

Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19.263–311.

Brown, Roger, and David McNeill. 1966. The “tip of the tongue” phenomenon. Journal of Verbal Learning and Verbal Behavior 5.325–337.

Brusilovsky, Peter. 2001. Adaptive hypermedia. User Modeling and User-Adapted Interaction 11.87–110.

Bull, Joanna, and Colleen McKenna. 2004. Blueprint for Computer-Assisted Assessment. London, UK: Routledge Falmer.

Carreiras, Manuel, Manuel Perea, and Jonathan Grainger. 1997. Effects of orthographic neighborhood in visual word recognition: Cross-task comparisons. Journal of Experimental Psychology: Learning, Memory and Cognition 23.857–871.

Chalhoub-Deville, Micheline, and Craig Deville. 1999. Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics 19.273–299.

Clark, Stephen, and James R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In Proceedings of the 20th International Conference on Computational Linguistics, 282–288, Geneva, Switzerland.

Coady, James. 1997. L2 vocabulary acquisition through extensive reading. In Second Language Vocabulary Acquisition, ed. by James Coady and Thomas Huckin, 225–237. Cambridge, UK: Cambridge University Press.

Cohen, William W., Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, 73–78, Acapulco, Mexico.

Craik, Fergus I. M., and Endel Tulving. 1975. Depth of processing and the retention of words in episodic memory. Journal of Experimental Psychology: General 104.268–294.

Cruse, D. A. 1986. Lexical Semantics. New York: Cambridge University Press.

Demaine, Erik D., Shay Mozes, Benjamin Rossman, and Oren Weimann. 2007. An optimal decomposition algorithm for tree edit distance. In Proceedings of the 34th International Colloquium on Automata, Languages and Programming, 146–157, Wrocław, Poland. Springer.

Dycus, David. 1997. Guessing word meaning from context: Should we encourage it? Literacy Across Cultures 1.1–6.

Ericsson, K. Anders, and Herbert A. Simon. 1985. Protocol Analysis: Verbal Reports as Data. Cambridge, Massachusetts: MIT Press.

Espinosa, María Paz, and Javier Gardeazabal. 2007. Optimal Correction for Guessing in Multiple-Choice Tests. Technical report, Department of Foundations of Economic Analysis II, University of the Basque Country.

Eugenio, Barbara Di, and Michael Glass. 2004. The kappa statistic: A second look. Computational Linguistics 30.95–101.

Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press, 1st edition.

Ferret, Olivier, and Michael Zock. 2006. Enhancing electronic dictionaries with an index based on associations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, 281–288, Sydney, Australia.

Flores d’Arcais, Giovanni B., Hirofumi Saito, and Masahiro Kawakami. 1995. Phonological and semantic activation in reading kanji characters. Journal of Experimental Psychology: Learning, Memory and Cognition 21.34–42.

Fraser, Carol A. 1999. Lexical processing strategy use and vocabulary learning through reading. Studies in Second Language Acquisition 21.225–241.

Gaskell, M. Gareth, and Nicolas Dumay. 2003. Lexical competition and the acquisition of novel words. Cognition 89.105–132.

Gaume, Bruno, Karine Duvignau, Laurent Prévot, and Yann Desalle. 2008. Toward a cognitive organization for electronic dictionaries, the case for semantic proxemy. In Proceedings of the workshop on Cognitive Aspects of the Lexicon, 86–93, Manchester, UK.

Goh, Chooi-Ling, Masayuki Asahara, and Yuji Matsumoto. 2005. Building a Japanese-Chinese dictionary using kanji/hanzi conversion. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, 670–681, Jeju Island, Korea.

Halpern, Jack (ed.) 1999. The Kodansha Kanji Learner’s Dictionary. Tokyo, Japan: Kodansha International.

Hambleton, Ronald K., and Russell W. Jones. 1993. Comparison of Classical Test Theory and Item Response Theory and their applications to test development. Educational Measurement: Issues and Practice 12.38–47.

Handke, Jürgen. 1995. The Structure of the Lexicon: Human vs Machine. Berlin: Mouton de Gruyter.

Harris, Zellig S. 1954. Distributional structure. Word 10.146–162.

Heift, Trude, and Devlan Nicholson. 2001. Web delivery of adaptive and interactive language tutoring. International Journal of Artificial Intelligence in Education 12.310–324.

Heisig, James W. 1985. Remembering the Kanji I: A Complete Course on How Not to Forget the Meaning and Writing of Japanese Characters. Tokyo, Japan: Japan Publication Trading Co.

Hsueh-Chao, Marcella Hu, and Paul Nation. 2000. Unknown vocabulary density and reading comprehension. Reading in a Foreign Language 13.403–430.

Huckin, Thomas, and James Coady. 1999. Incidental vocabulary acquisition in a second language. Studies in Second Language Acquisition 21.181–193.

Ikehara, Satoru, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi. 1997. Goi-Taikei – A Japanese Lexicon. Tokyo, Japan: Iwanami Shoten.

Isahara, Hitoshi, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama, and Kyoko Kanzaki. 2008. Development of the Japanese WordNet. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco.

Japan Foundation, 2009. What is the JLPT: Contents of the Test. Available at http://www.jlpt.jp/e/about/content.html. Accessed 7th June 2009.

Jin, Zhihui. 2008. Pinyomi-lite - improved Japanese-Chinese dictionary lookup using logograph transliteration. In Proceedings of the 14th Annual Meeting of the Japan Society for Natural Language Processing, Tokyo, Japan.

Joyce, Terry. 1999. Lexical access and the mental lexicon for two-kanji compound words: A priming paradigm study. In Proceedings of the 7th International Conference on Conceptual Structures, 1–12, Blacksburg, VA, USA.

——. 2002. Constituent-morpheme priming: Implications from the morphology of two-kanji compound words. Japanese Psychological Research 44.79–90.

——. 2005. Constructing a large-scale database of Japanese word associations. Glottometrics 10.82–98.

Knight, Kevin, and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics 24.599–612.

Koda, Keiko. 2007. Reading and language learning: Crosslinguistic constraints on second language reading development. Language Learning 57.1–44.

Kondrak, Grzegorz. 2003. Identifying complex sound correspondences in bilingual wordlists. In Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, 432–443, Mexico City, Mexico.

Krashen, Stephen. 1989. We acquire vocabulary and spelling by reading: Additional evidence for the input hypothesis. Modern Language Journal 73.440–464.

Landauer, T. K., and L. A. Streeter. 1973. Structural differences between common and rare words: Failure of equivalence assumptions for theories of word recognition. Journal of Verbal Learning and Verbal Behavior 12.119–131.

Landauer, Thomas K., and Susan T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review 104.211–240.

Laufer, Batia. 1997. The lexical plight in second language reading: Words you don’t know, words you think you know, and words you can’t guess. In Second Language Vocabulary Acquisition, ed. by James Coady and Thomas Huckin, Cambridge Applied Linguistics, 20–34. Cambridge, UK: Cambridge University Press.

——, and Zahava Goldstein. 2004. Testing vocabulary knowledge: Size, strength and computer adaptiveness. Language Learning 54.399–436.

——, and Tamar Levitzky-Aviad. 2006. Examining the effectiveness of ‘bilingual dictionary plus’ – a dictionary for production in a foreign language. International Journal of Lexicography 19.135–155.

——. 1991. Similar Lexical Forms. Tübingen, Germany: Gunter Narr Verlag Tübingen.

Leitner, S. 1972. So lernt man lernen. Freiburg, Germany: Herder.

Levelt, Willem J. M., Ardi Roelofs, and Antje S. Meyer. 1999. A theory of lexical access in speech production. Behavioral and Brain Sciences 22.1–75.

Lupker, Stephen J. 2005. Visual word recognition: Theories and findings. In The Science of Reading: A Handbook, ed. by Margaret J. Snowling and Charles Hulme, chapter 3. Carlton, VIC, Australia: Blackwell Publishing.

Marso, Ronald N. 1969. The influence of test difficulty upon study efforts and achievement. American Educational Research Journal 6.621–632.

McClelland, James L., and David E. Rumelhart. 1981. An interactive activa- tion model of context effects in letter perception, Part 1: An account of basic findings. Psychological Review 88.375–407.

Meara, Paul. 2009. Connected Words: Word Associations and Second Language Vocabulary Acquisition. John Benjamins Publishing Company.

Miller, Georgia E. 1941. Vocabulary building through extensive reading. The English Journal 30.664–666.

Moerdijk, Fons, Carole Tiberius, and Jan Niestadt. 2008. Accessing the ANW dictionary. In Proceedings of the 2008 Workshop on Cognitive Aspects of the Lexicon, 18–24, Manchester, UK.

Mohseni-Far, Mohammad. 2008. In search of the best technique for vocabulary acquisition. Estonian Papers in Applied Linguistics (Eesti rakenduslingvistika uhingu aastaraamat) 4.121–138.

Mondria, Jan-Arjen. 2003. The effects of inferring, verifying and memorizing on the retention of L2 word meanings. Studies in Second Language Acquisition 25.473–499.

——. 2007. Myths about vocabulary acquisition. Babylonia 2.63–68.

——, and Siebrich Mondria-De Vries. 1994. Efficiently memorizing words with the help of word cards and hand computer: Theory and applications. System 22.47–57.

Murphy, Kevin R., and Charles O. Davidshofer. 1998. Psychological Testing: Principles and Applications. Upper Saddle River, NJ, USA: Prentice Hall, 4th edition.

Nation, I. S. P. 2001. Learning Vocabulary in Another Language. Cambridge, UK: Cambridge University Press.

Nation, P. 2006. Vocabulary: Second Language, 448–454. Oxford: Elsevier, 2nd edition.

National Virtual Translation Center, 2007. Language learning difficulty for English speakers. http://www.nvtc.gov/lotw/months/november/learningExpectations.html. Accessed 19th Jan 2008.

Nishiguchi, Koichi. 1994. Kanji in Context Workbook Volume 1: Book 1. Tokyo, Japan: Japan Times.

Nuttall, Christine. 1996. Teaching Reading Skills in a Foreign Language. London, UK: Heinemann English Language Teaching, 2nd edition.

Pagel, Vincent, Kevin Lenzo, and Alan W. Black. 1998. Letter to sound rules for accented lexicon compression. In Proceedings of the 5th International Conference on Spoken Language Processing, 252–255, Sydney, Australia.

Pellom, Bryan, and Kadri Hacioglu. 2001. Sonic: The University of Colorado continuous speech recognizer. Technical report, Center for Spoken Language Research, University of Colorado, Boulder, USA.

Pervouchine, Vladimir, Haizhou Li, and Bo Lin. 2009. Transliteration alignment. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, 136–144, Suntec, Singapore.

Pinker, Steven. 1994. The Language Instinct. New York, USA: Harper Perennial.

Ploux, Sabine, and Hyungsuk Ji. 2003. A Model for Matching Semantic Maps between Languages (French / English, English / French). Computational Linguistics 29.155–178.

Ponsoda, Vicente, Julio Olea, Maria Soledad Rodriguez, and Javier Revuelta. 1999. The effects of test difficulty manipulation in computerized adaptive testing and self-adapted testing. Applied Measurement in Education 12.167–184.

Prichard, Caleb. 2008. Evaluating L2 readers’ vocabulary strategies and dictionary use. Reading in a Foreign Language 20.216–231.

Pulido, Diana, and David Z. Hambrick. 2008. The virtuous circle: Modeling individual differences in L2 reading and vocabulary. Reading in a Foreign Language 20.164–190.

Reips, Ulf-Dietrich. 2002. Standards for internet-based experimenting. Experimental Psychology 49.243–256.

Rich, Elaine. 1983. Users are individuals: individualising user models. International Journal of Man-Machine Studies 18.199–214.

Roget, Peter Mark, and Robert L. Chapman. 1977. Roget’s International Thesaurus. New York, USA: Harper & Row.

Rosch, Eleanor, Carolyn B. Mervis, Wayne D. Gray, David M. Johnson, and Penny Boyes-Braem. 1976. Basic objects in natural categories. Cognitive Psychology 8.382–439.

Rosen, Eric. 2003. Systematic irregularity in Japanese rendaku: How the grammar mediates patterned lexical exceptions. Canadian Journal of Linguistics 48.1–37.

Saito, Hirofumi, Hisashi Masuda, and Masahiro Kawakami. 1998. Form and sound similarity effects in kanji recognition. Reading and Writing 10.323–357.

Salton, Gerard, and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management 24.513–523.

Sato, Satoshi, Suguru Matsuyoshi, and Yoshsuke Kondoh. 2008. Automatic assessment of Japanese text readability based on a textbook corpus. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco.

Sears, Richard, 2008. Chinese Etymology Home Page. Available at http://www.chineseetymology.org/. Accessed 10th Dec 2008.

Shannon, Claude Edward, and Warren Weaver. 1949. The Mathematical Theory of Communication. Urbana, USA: University of Illinois Press.

Shute, Valerie J., and Joseph Psotka. 1994. Intelligent Tutoring Systems: Past, Present and Future. Technical report, Human Resources Directorate, Manpower and Personnel Research Division, Brooks Air Force Base, Texas.

So, Sofumi. 2008. Review: Remembering the Kanji 1. The Modern Language Journal 92.663–664.

Taft, Marcus. 1991. Reading and the Mental Lexicon. London, UK: Lawrence Erlbaum Associates.

——, Ying Liu, and Xiaoping Zhu. 1999a. Morphemic processing in reading Chinese. In Reading Chinese Script: A Cognitive Analysis, ed. by Jian Wang, Albrecht W. Inhoff, and Hsuan-Chih Chen, 91–113. Mahwah, USA: Lawrence Erlbaum Associates.

——, Xiaoping Zhu, and Danling Peng. 1999b. Positional specificity of radicals in Chinese character recognition. Journal of Memory and Language 40.498–519.

Taghva, Kazem, Julie Borsack, and Allen Condit. 1994. An expert system for automatically correcting OCR output. In Proceedings of the IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology, 270–278, San Jose, California, USA.

Tanaka-Ishii, Kumiko, and Julian Godon. 2006. Kansuke: A kanji look-up system based on a few stroke prototype. In Proceedings of the 21st International Conference on Computer Processing of Oriental Languages, Sentosa, Singapore.

Toyoda, Etsuko, and Yoji Hashimoto. 2002. Improving a placement test battery: What can test analysis programs tell us. ASAA e-journal of Asian Linguistics and Language Teaching 2.

Tozcu, Anjel, and James Coady. 2004. Successful learning of frequent vocabulary through CALL also benefits reading comprehension and speed. Computer Assisted Language Learning 17.473–495.

Tsujimura, Natsuko (ed.) 1999. The Handbook of Japanese Linguistics. Great Britain: Blackwell Publishers Ltd.

Urquhart, Sandy, and Cyril Weir. 1998. Reading in a Second Language: Process, Product and Practice. New York, USA: Longman.

Vance, Timothy J. 1987. An Introduction to Japanese Phonology. New York, USA: SUNY Press.

Vermeer, Anne. 2001. Breadth and depth of vocabulary in relation to L1/L2 acquisition and frequency of input. Applied Psycholinguistics 22.217–234.

Wagner, Robert A., and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM 21.168–173.

Walters, JoDee. 2006. Methods of teaching inferring meaning from context. RELC Journal 37.176–190.

Walz, Joel. 1990. The dictionary as a secondary source in language learning. The French Review 64.79–94.

Wang, Liwei, Yan Zhang, and Jufu Feng. 2005. On the Euclidean distance of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 27.1–6.

Wills, Sebastian A., and David J. C. MacKay. 2006. DASHER – an efficient writing system for brain-computer interfaces? IEEE Transactions on Neural Systems and Rehabilitation Engineering 14.244–246.

Winstead, Chris. 2006. Electronic kanji dictionary based on “Dasher”. In Proceedings of the 2006 IEEE Mountain Workshop on Adaptive and Learning Systems, 144–148, Logan, Utah, USA.

——, and Atsuko Neely. 2007. Kanjiru: a software interface for visual kanji search. In Proceedings of the Fourth International Conference on Computer Assisted Systems for Teaching & Learning Japanese, Honolulu, USA.

Yamamoto, Ken, and Yoshihiro Yamazaki. 2009. Network of two-Chinese-character compound words in the Japanese language. Physica A: Statistical Mechanics and its Applications 388.2555–2560.

Yamashita, Hiroko, and Yukiko Maru. 2000. Compositional features of kanji for effective instruction. Journal of the Association of Teachers of Japanese 34.159–178.

Yang, Tai-Ning, and Sheng-De Wang. 2001. A rotation invariant printed Chinese character recognition system. Pattern Recognition Letters 22.85–95.

Yeh, Su-Ling, and Jing-Ling Li. 2002. Role of structure and component in judgments of visual similarity of Chinese characters. Journal of Experimental Psychology: Human Perception and Performance 28.933–947.

Yencken, Lars, Zhihui Jin, and Kumiko Tanaka-Ishii. 2007. Pinyomi - dictionary lookup via orthographic associations. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, Melbourne, Australia.

Zhang, Dengsheng, and Guojun Lu. 2003. Evaluation of similarity measurement for image retrieval. In Proceedings of the 2003 IEEE International Conference on Neural Networks and Signal Processing, volume 2, 928–931.

Zhang, Jianna Jian, Howard J. Hamilton, and Nick J. Cercone. 1999. Learning English grapheme segmentation using the iterated version space algorithm. In Proceedings of the 11th International Symposium on Methodologies for Intelligent Systems, Warsaw, Poland.

Zock, Michael. 2002. Sorry, what was your name again, or how to overcome the tip-of-the-tongue problem with the help of a computer? In Proceedings of the SemaNet 2002 Workshop on Building and Using Semantic Networks, Taipei, Taiwan.

——, and Stergos D. Afantenos. 2007. Let’s get the student into the driver’s seat. In Proceedings of the 7th International Symposium on Natural Language Processing, Chonburi, Thailand.

——, and Slaven Bilac. 2004. Word lookup on the basis of associations: from an idea to a roadmap. In Proceedings of the 20th International Conference on Computational Linguistics, 89–95, Geneva, Switzerland.

——, and Julien Quint. 2004. Why have them work for peanuts, when it is so easy to provide reward? Motivations for converting a dictionary into a drill tutor. In Proceedings of the 5th Workshop on Multilingual Lexical Databases, Grenoble, France.

Index

beginner’s paradox, 23
CATSS, 60
classical test theory, 59
companion activation model, 42
distance metrics
    and alphabetic orthographies, 63
    bag of radicals, 66, 77
    L1 norm, 67
    stroke edit distance, 79
    tree edit distance, 80
    within metric spaces, 65
EDICT, 94, 99
ENAMDICT, 112
flashcards, 38, 83, 129
Forgiving Online Kanji Search
    as baseline, 92
    background, 56
    limitations, 96
    log analysis, 120
grapheme gapping, 20
grapheme-phoneme alignment, 95, 98
graphemic neighbours
    accessibility, 48, 49
    topology, 50
Heisig method, 30
hiragana, 10
homography, 112
inferring from context, 31
input method editor (IME), 14
interactive-activation model, 45
item response theory, 59
Japanese
    word formation, 18
Japanese Language Proficiency Test
    criticisms, 136
    overview, 131
    question types, 133
JEDict, 55
JGloss, 39
Kakasi, 113
kanji
    comparison to hanzi, 14
    compositional nature, 11
    etymology, 11
    homophony, 14
    pronunciation, 13
KANJIDIC, 99
Kanjiru, 17
Kansuke, 55
katakana, 10
lexical gridding, 39
lexical relationships, 46, 177
mental lexicon, 41
mnemonics, 30
okurigana, 19, 104
onbin, see sound euphony
place names, 112
PopJisyo, 39
Quizlet, 36
Reading Tutor, 39
reliability, 58
rendaku, see sequential voicing
Reviewing the Kanji, 38
Rikai, 39
sequential voicing, 19
SKAT, 62
SKIP search, 53
sound euphony, 20
spaced repetition, 38
strategies
    acquisition, 30
    ordering, 28
testing
    adaptive, 59
    computer-based, 59
    vocabulary, 60
tf-idf, 99, 102
tip of the tongue, 52, 97
transliteration, 112, 180
user model
    granularity, 144
    update algorithm, 150
validity, 58
word
    aspects of knowledge, 26
    definition of, 22
    visual recognition, 42
WordNet, 46

Minerva Access is the Institutional Repository of The University of Melbourne

Author/s: YENCKEN, LARS

Title: Orthographic support for passing the reading hurdle in Japanese

Date: 2010

Citation: Yencken, L. (2010). Orthographic support for passing the reading hurdle in Japanese. PhD thesis, Department of Computer Science and Software Engineering, The University of Melbourne.

Persistent Link: http://hdl.handle.net/11343/35336

File Description: Orthographic support for passing the reading hurdle in Japanese

Terms and Conditions: Copyright in works deposited in Minerva Access is retained by the copyright owner. The work may not be altered without permission from the copyright owner. Readers may only download, print and save electronic copies of whole works for their own personal non-commercial use. Any use that exceeds these limits requires permission from the copyright owner. Attribution is essential when quoting or paraphrasing from these works.