Word Segmentation and Ambiguity in English and Chinese NLP & IR

by Jin Hu Huang, B.Eng.(Comp.Sc), Grad.Dip.(Soft.Eng.)
School of Computer, Engineering and Mathematics, Faculty of Science and Engineering
August 10, 2011

A thesis presented to the Flinders University of South Australia in total fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Adelaide, South Australia, 2012
© Jin Hu Huang, 2012

CONTENTS

Abstract
Certification
Acknowledgements
Preface

1. Introduction
    1.1 Context Sensitive Spelling Correction
    1.2 Chinese Pinyin Input
    1.3 Chinese Segmentation
    1.4 Chinese Information Retrieval (IR)
    1.5 Thesis Contribution
    1.6 Thesis Organization

Part I Word Disambiguation for English Spelling Checking and Chinese Pinyin Input

2. Machine Learning for Context Sensitive Spelling Checking
    2.1 Introduction
    2.2 Confused Words
    2.3 Context-sensitive Spelling Correction
    2.4 Experiment and Result
    2.5 Interface
    2.6 Conclusion and Future Work
    2.7 Reflections

3. Statistical N-gram Language Modeling
    3.1 Statistical Language Modeling
    3.2 N-Gram Markov Language Models
    3.3 Smoothing Methods
        3.3.1 Add One Smoothing
        3.3.2 Interpolation Smoothing
        3.3.3 Absolute Discounting
        3.3.4 Good-Turing Discounting
        3.3.5 Katz Back-off Smoothing
        3.3.6 Witten-Bell Smoothing
        3.3.7 Kneser-Ney Smoothing
        3.3.8 Modified Kneser-Ney Smoothing
    3.4 Discussion
    3.5 Conclusion

4. Compression-based Adaptive Approach for Chinese Pinyin Input
    4.1 Introduction
    4.2 Statistical Language Modelling
        4.2.1 Pinyin-to-Character Conversion
        4.2.2 SLM Evaluation
    4.3 Compression Theory
    4.4 Adaptive Modelling
    4.5 Prediction by Partial Matching
    4.6 Experiment and Result
    4.7 Conclusion

5. Error-driven Adaptive Language Modeling for Pinyin-to-Character Conversion
    5.1 Introduction
    5.2 LM Adaption Methods
        5.2.1 MAP Methods
        5.2.2 Discriminative Training Methods
    5.3 Error-driven Adaption
    5.4 Experiment and Result
    5.5 Conclusion

Part II Chinese Word Segmentation and Classification

6. Chinese Words and Chinese Word Segmentation
    6.1 Introduction
    6.2 The Definition of Chinese Word
    6.3 Chinese Word Segmentation
        6.3.1 Segmentation Ambiguity
        6.3.2 Unknown Words
    6.4 Segmentation Standards
    6.5 Current Research Work
    6.6 Conclusion

7. Chinese Word Segmentation Based on Contextual Entropy
    7.1 Introduction
    7.2 Contextual Entropy
    7.3 Algorithm
        7.3.1 Contextual Entropy
        7.3.2 Mutual Information
    7.4 Experiment Results
    7.5 Conclusion
    7.6 Reflections

8. Unsupervised Chinese Word Segmentation and Classification
    8.1 Introduction
    8.2 Word Classification
    8.3 Experiments and Future Work
    8.4 Conclusion
    8.5 Reflections

Part III Chinese Information Retrieval

9. Using Suffix Arrays to Compute Statistical Information
    9.1 Suffix Trees
    9.2 Suffix Arrays
    9.3 Computing Term Frequency and Document Frequency
    9.4 Conclusion

10. N-gram based Approach for Chinese Information Retrieval
    10.1 Introduction
    10.2 Chinese Information Retrieval
        10.2.1 Single-character-based (Uni-gram) Indexing
        10.2.2 Multi-character-based (N-grams) Indexing
        10.2.3 Word-based Indexing
        10.2.4 Previous Works
    10.3 Retrieval Models
        10.3.1 Vector Space Model
        10.3.2 Term Weighing
        10.3.3 Query and Document Similarity
        10.3.4 Evaluation
    10.4 Experimental Setup
        10.4.1 TREC Data
        10.4.2 Measuring Retrieval Performance
    10.5 Experiments and Discussion
        10.5.1 Using Dictionary-based Approach
        10.5.2 Statistical Segmentation Approach
        10.5.3 Using Different N-grams
        10.5.4 Word Extraction
        10.5.5 Removing Stop Words
    10.6 Discussion
    10.7 Conclusion

11. Conclusions
    11.1 Thesis Review
    11.2 Future Work

Appendix
    A. The Appendix: Tables for TREC 5 & 6 Chinese Information Retrieval Results
    B. The Appendix: Examples of TREC 5 & 6 Chinese Queries

Bibliography

LIST OF FIGURES

1.1 Number of Homonyms for Each Pinyin
5.1 Perplexity compare between static and adaptive model on Modern Novel
5.2 Perplexity compare between static and adaptive model on Martial Arts Novel
5.3 Perplexity compare between static and adaptive model on People's Daily 96
7.1 Contextual Entropy and Mutual Information for "The two world wars happened this century had brought great disasters to human being including China."
10.1 Average Precision At Different Recall For Dictionary-based and Statistical Segmentation Approaches
10.2 Average Precision At X Documents Retrieved For Dictionary-based and Statistical Segmentation Approaches
10.3 Average Precision for 54 Queries Using 1-gram, 2-grams, 3-grams and 4-grams
10.4 The Impact of Extracted Words on 54 Queries
10.5 The Impact of Stop Words on 54 Queries

LIST OF TABLES

2.1 Diameter with occurrences, significance and probability - number of contexts (WSJ 87-89,91-92)
2.2 Relationship between probability and significance - number of contexts (WSJ 87-89,91-92)
2.3 False errors and coverage testing on test and validation corpora with no errors seeded but two real errors found
2.4 True errors detected (recall) and corrected when errors seeded randomly
2.5 Seeded errors of the confusion set of "from" and "form" (S,P≥95%)
2.6 False positive rate (FPR), true positive rate (TPR) and informedness when errors seeded
2.7 Spelling Errors Found in WSJ0801 and WSJ1231
4.1 Compression models for the string "dealornodeal"
4.2 PPM model after processing the string dealornodeal
4.3 Compression results for different compression methods
4.4 Character Error Rates for Kneser-Ney, Static and Adaptive PPM
5.1 Witten-Bell smoothing model after processing the string dealornodeal
5.2 Comparing perplexity and CER using different smoothing methods on testing corpus
5.3 CER and percentage of data used for adaption
5.4 Testing on Xinhua 96 with different mixed models with adaption
6.1 Some differences between the segmentation standards
7.1 Validation results based on Recall, Precision and F-Measure for Eq. 7.1, 7.2, 7.3, 7.4
7.2 Validation results based on Recall, Precision and F-Measure for Eq. 7.5, 7.6, 7.7, 7.8
7.3 Validation results on Recall, Precision and F-measure according to Eq. 7.9, 7.10, 7.11
7.4
