Hanzi, Concept and Computation: A Preliminary Survey of Chinese Characters as a Knowledge Resource in NLP
by Shu-Kai Hsieh

Doctoral dissertation in philosophy (Philosophische Dissertation), accepted by the Faculty of Modern Languages (Neuphilologische Fakultät) of the University of Tübingen on 06.02.2006

Tübingen, 10.04.2006

Printed with the permission of the Faculty of Modern Languages of the University of Tübingen

Principal reviewer: Prof. Dr. Erhard W. Hinrichs
Second reviewer: Prof. Dr. Eschbach-Szabo
Dean: Prof. Dr. Joachim Knape

Acknowledgements

There are many people to whom I owe a debt of thanks for their support, both academic and personal, during the completion of this thesis. First, I would like to sincerely thank my advisor, Prof. Dr. Erhard Hinrichs, under whose influence the work here was initiated during my fruitful stay in Germany. Without his continuous and invaluable support, this work could not have been completed. I would also like to thank Prof. Dr. Eschbach-Szabo for reading this thesis and offering constructive comments. Besides my advisors, I am deeply grateful to the rest of my thesis committee, Frank Richter and Fritz Hamm, for their kind support and interesting questions. Special thanks go to Lothar Lemnitzer, who proofread the thesis carefully and gave insightful comments. I would like to thank my parents for their life-long love and support. Last but not least, I owe many thanks to my lovely wife Hsiao-Wen and my kids MoMo and NoNo for their understanding while I was away from home. Without them, it would not have been possible to complete the study.

Abstract

This thesis deals with Chinese characters (Hanzi): their key characteristics and how they could be used as a kind of knowledge resource in (Chinese) NLP. Part 1 deals with basic issues.
In Chapter 1, the motivation and the reasons for reconsidering the writing system are presented, and a short introduction to Chinese and its writing system is given in Chapter 2. Part 2 provides a critical review of the current, ongoing debate about Chinese characters. Chapter 3 outlines some important linguistic insights from the vantage point of indigenous scriptological and Western linguistic traditions, as well as a new theoretical framework in contemporary studies of Chinese characters. Chapter 4 concerns the search for appropriate mathematical descriptions of the systematic knowledge hidden in characters. The mathematical formalization of the shape structure of Chinese characters is described as well. Part 3 illustrates the representation issues. Chapter 5 addresses the design and construction of HanziNet, an enriched conceptual network of Chinese characters. Topics covered in this chapter include the ideas, architecture, methods and ontology design. Part 4 presents a case study based on the above-mentioned ideas. Chapter 6 presents an experiment exploring the character-triggered semantic class of Chinese unknown words. Finally, Chapter 7 summarizes the major findings of this thesis, depicts some potential avenues for future research, and assesses the theoretical implications of these findings for computational linguistic theory.

Contents

I Introduction
1 Motivation
  1.1 Does Writing Matter?
  1.2 Knowledge-Leanness: A Bottleneck in Natural Language Processing?
  1.3 Writing Systems: The Missing Corner?
2 A Brief Introduction of Chinese and its Characters
  2.1 What is Hanzi?
    2.1.1 An Overview
    2.1.2 The Relation Between Chinese and Hanzi
  2.2 Character Structure Units and Linguistic Issues
    2.2.1 Constituent Units of Chinese Characters
    2.2.2 Word/Morpheme Controversies in Chinese
II Background
3 A Review of Hanzi Studies
  3.1 Hanziology: A Definition
  3.2 Indigenous Frameworks
    3.2.1 Six Writing: Principles of Character Construction
    3.2.2 Yòu Wén Theory
  3.3 Contemporary Linguistic Studies
    3.3.1 The Classification of Writing Systems
    3.3.2 Ideographic or logographic?
    3.3.3 Word-centered or Character-centered?
    3.3.4 Critical Remarks
  3.4 Contemporary Hanzi Studies
    3.4.1 Overview
    3.4.2 Hanzi Gene Theory: a Biological Metaphor
    3.4.3 Hanzi, Concept and Conceptual Type Hierarchy
    3.4.4 Critical Remarks
4 Mathematical Description
  4.1 Introduction
  4.2 The Finite-State Automata and Transducer Model
    4.2.1 Finite-State Techniques: An Overview
    4.2.2 Topological Analysis via Planar Finite-State Machines
  4.3 Network Models
    4.3.1 Basic Notions
    4.3.2 Partial Order Relations
    4.3.3 Tree
    4.3.4 (Concept) Lattice
  4.4 Statistical Models
    4.4.1 Character Statistics
    4.4.2 Statistical Measures of Productivity and Association of Characters
    4.4.3 Characters in a Small World
  4.5 Conclusion
III Representation
5 HanziNet: An Enriched Conceptual Network of Chinese Characters
  5.1 Introduction
  5.2 Chinese Character Network: Some Proposed Models
    5.2.1 Morpheme-based
    5.2.2 Feature-based
    5.2.3 Radical Ontology-based
    5.2.4 Hanzi Ontology-based
    5.2.5 Remarks
  5.3 Theoretical Assumptions
    5.3.1 Concepts, Characters and Word Meanings
    5.3.2 Original Meaning, Polysemy and Homograph
    5.3.3 Hanzi Meaning Components as Partial Common-Sense Knowledge Indicators
  5.4 Architecture
    5.4.1 Basic Design Issues: Comparing Different Large-Scale Lexical Semantic Resources
    5.4.2 Components
  5.5 Issues in Hanzi Ontology Development
    5.5.1 What is an Ontology: A General Introduction from Different Perspectives
    5.5.2 Designing a Hanzi-grounded Ontology
IV Case Study
6 Semantic Prediction of Chinese Two-Character Words
  6.1 Introduction
  6.2 Word Meaning Inducing via Character Meaning
    6.2.1 Morpho-Semantic Description
    6.2.2 Conceptual Aggregate in Compounding: A Shift Toward Character Ontology
  6.3 Semantic Prediction of Unknown two-character Words
    6.3.1 Background
    6.3.2 Resources
    6.3.3 Previous Research
    6.3.4 A Proposed HanziNet-based Approach
    6.3.5 Experimental Settings
    6.3.6 Results and Error Analysis
    6.3.7 Evaluation
  6.4 Conclusion
V Gaining Perspectives
7 Conclusion
  7.1 Contributions
  7.2 Future Researches
    7.2.1 Multilevel Extensions
    7.2.2 Multilingual extensions
  7.3 Concluding Remarks
A Test Data
B Character Semantic Head: A List
C Character Ontology
D A Section of Semantic Classification Tree of CILIN

List of Figures

2.1 Some topological structures of Hanzi (adopted from Yiu and Wong (2003))
2.2 The word length distribution of Chinese characters
2.3 A three-layer hierarchy of the Hanzi lexicon structure
3.1 Hanzi triangle
3.2 Sampson’s classification scheme
3.3 Sproat’s classification scheme
3.4 Orthographic Depth Hypothesis
3.5 The 24 main Cang-Jie signs. The 4 rough categories here are designed for the purpose of memorizing.
3.6 First period-doubling bifurcation
3.7 Second period-doubling bifurcation and third bifurcation
3.8 A complete code definition of a character
4.1 One of the topological structures of Chinese characters described by γ(α) → [γ(β) ↓ [γ(ζ) → γ(δ)]].
4.2 A planar FSA that maps the expression γ(α) → [γ(β) ↓ [γ(ζ) → γ(δ)]] (the planar figure of “蹦”) given in figure 4.1. The labels “R” and “D” on the arcs indicate the recognizing direction (Right and Down); the label “left” on the starting state 0 indicates the position where scanning starts.
4.3 Elements of a Semantic Network
4.4 Two structures of the semantic network
4.5 Three kinds of partial order relations (taken from Sowa (1984))
4.6 A concept lattice represented by a line diagram
4.7 A more complex concept lattice
4.8 Character-based language laws testing
4.9 (a) Bipartite graphs of characters (the numerically indexed row) and components (the alphabetically indexed row); (b) reduced graph from (a) containing only characters.
5.1 Conceptual relatedness of characters: An example of qǔ
5.2 “Bound” and “free” morphemes: An example of comb
5.3 Venn diagram of characters: Chaon model
5.4 The pyramid structure model
5.5 Character-based concept tree and word-based semantic clouds
5.6 A common-sense knowledge lattice
5.7 The explicit structure of HanziNet
5.8 The complete architecture of HanziNet
5.9 The HanziNet ontology: A snapshot
5.10 A proposed “characterized” Ontology design
5.11 A snapshot of the HanziNet Ontology environment

List of Tables

2.1 Chinese signary: A historical comparison
2.2 How many Hanzi does a computer recognize?: A code scheme comparison
2.3 Number of radicals: A comparison
3.1 DeFrancis’s classification scheme
3.2 Chu’s tree-structured conceptual hierarchy (truncated for brevity)
3.3 A self-synchronizing code of Chinese characters
4.1 A formal context of vehicles
4.2 Statistical characteristics of the character network: N is the total number of nodes (characters), k is the average number of links per node, C is the clustering coefficient, ℓ̄ is the average shortest-path length, and ℓmax is the maximum length of the shortest path between a pair of characters in the network.
5.1 Cognate characters