
A Graph Model for Words and their Meanings

Von der Philosophisch-Historischen Fakultät der Universität Stuttgart zur Erlangung der Würde eines Doktors der Philosophie (Dr. phil.) genehmigte Abhandlung

vorgelegt von Beate Dorow aus Albstadt

Erstgutachter: PD Dr. phil. habil. Ulrich Heid
Zweitgutachter: Prof. Hinrich Schütze, Ph.D.

Tag der mündlichen Prüfung: 03.03.2006

Institut für Maschinelle Sprachverarbeitung der Universität Stuttgart, 2006

Contents

1 Introduction 13
1.1 Motivation ...... 13
1.2 Semantic similarity and ambiguity ...... 13
1.2.1 Ambiguity ...... 13
1.2.2 Semantic similarity ...... 14
1.3 Idioms ...... 15
1.4 Synonyms ...... 16
1.5 Contributions of the thesis ...... 17
1.6 Outline of the thesis ...... 18

2 Graphs 21
2.1 Terminology and notation ...... 21
2.2 Real-world graphs ...... 25
2.2.1 Social networks ...... 25
2.2.2 Technological networks ...... 26
2.2.3 Biological networks ...... 27
2.2.4 Graphs arising in linguistics ...... 27
2.3 Curvature as a measure of local graph topology ...... 29
2.3.1 Clustering in social networks ...... 29
2.3.2 Graph-theoretic definition ...... 31
2.3.3 Geometric interpretation ...... 33

3 Data acquisition 39
3.1 Acquisition assumption ...... 39
3.2 Text material ...... 40
3.3 Building the graph ...... 40
3.4 Preprocessing of the word graph ...... 43
3.4.1 Conflating variants of the same word ...... 43
3.4.2 Eliminating weak links ...... 44
3.5 Weighting links ...... 44
3.6 Spurious edges ...... 46
3.6.1 Errors in data acquisition ...... 46
3.6.2 Violation of the symmetry assumption ...... 48

3.6.3 Semantically heterogeneous lists ...... 48
3.6.4 Ambiguity due to conflation of variants ...... 48
3.6.5 Statistical properties of the word graph ...... 48

4 Discovering word senses 51
4.1 Intuition ...... 51
4.2 Markov Clustering ...... 52
4.3 Word Sense Clustering Algorithm ...... 53
4.4 Experimental Results ...... 54

5 Using curvature for ambiguity detection and semantic class acquisition 59
5.1 A linguistic interpretation of curvature ...... 59
5.2 Curvature as a measure of ambiguity ...... 61
5.3 Curvature as a clustering tool ...... 64
5.3.1 Determining the curvature threshold ...... 65
5.3.2 Allowing for overlapping clusters ...... 69
5.4 Evaluation against WordNet ...... 72
5.4.1 Precision and Recall ...... 72
5.4.2 Mapping clusters to WordNet senses ...... 75
5.4.3 Results and Conclusions ...... 76
5.5 Uncovering the meanings of novel words ...... 79

6 From clustering words to clustering word associations 87
6.1 Intuition ...... 87
6.2 Link graph transformation ...... 87
6.3 Some properties of the link graph ...... 89
6.4 Clustering the link graph ...... 92
6.5 Evaluation ...... 92
6.6 Pairs of contrasting clusters ...... 96
6.7 Conclusion ...... 102

7 Idiomaticity in the word graph 105
7.1 Introduction ...... 105
7.2 Asymmetric links ...... 106
7.2.1 Testing the reversibility (symmetry) of links ...... 106
7.2.2 Semantic constraints ...... 108
7.3 Idiomatic constraints ...... 113
7.4 Link analysis ...... 114
7.5 Statistical analysis ...... 118
7.5.1 Noncompositionality of meaning ...... 118
7.5.2 Semantic anomaly ...... 119
7.5.3 Limited syntactic variability ...... 122
7.5.4 Variability of context ...... 123

7.6 Conclusions ...... 126

8 Synonyms in the word graph 127
8.1 Hypothesis ...... 127
8.2 Methodology ...... 129
8.3 Evaluation experiment ...... 131
8.3.1 Test set ...... 131
8.3.2 Experimental set-up ...... 133
8.4 Revised Hypothesis ...... 137
8.5 Conclusion ...... 138

9 Related Work 143
9.1 Extraction of semantic relations ...... 143
9.2 Semantic similarity and ambiguity ...... 146
9.2.1 Bootstrapping semantic categories ...... 146
9.2.2 Hard clustering of words into classes of similar words ...... 150
9.2.3 Soft clustering of words into classes of similar words ...... 152
9.2.4 Word sense induction by clustering word contexts ...... 158
9.2.5 Identifying and assessing ambiguity ...... 165
9.3 Idiomaticity ...... 166
9.4 Synonymy ...... 167

10 Summary, Conclusion and Future Research 171
10.1 Summary ...... 171
10.2 Conclusion and Future Research ...... 173

11 Zusammenfassung 175
11.1 Semantische Ambiguität und Ähnlichkeit ...... 176
11.2 Idiome ...... 180
11.3 Synonyme ...... 180
11.4 Einordnung und zukünftige Forschungsarbeit ...... 181

Acknowledgements ...... 187

List of Figures

1.1 Local graph around rock ...... 15
1.2 Local graph around carrot and stick ...... 16
1.3 Local graph around writer and author ...... 17

2.1 A simple graph ...... 22
2.2 Two subgraphs ...... 22
2.3 A directed graph ...... 23
2.4 A weighted graph ...... 23
2.5 Complete graphs ...... 24
2.6 A hierarchy ...... 25
2.7 A parse tree ...... 28
2.8 WordNet snippet ...... 30
2.9 A social network ...... 32
2.10 Curvature ...... 33
2.11 Surfaces ...... 34
2.12 Boundary strips ...... 34
2.13 A planar triangle ...... 35
2.14 A triangle on a sphere ...... 36
2.15 A triangle on a saddle ...... 37
2.16 The Poincaré disk ...... 37

3.1 Building the word graph ...... 42
3.2 Local graph around body ...... 42
3.3 Conflating the variants of a word ...... 43
3.4 The triangle filter ...... 44

4.1 Local graph around mouse ...... 51
4.2 Local graph around wing ...... 52

5.1 Local graph around head (3D) ...... 60
5.2 Curvature vs. Frequency ...... 62
5.3 Local graph around monaco ...... 63
5.4 Evolution of the giant component ...... 67
5.5 Evolution of small components ...... 68

5.6 Curvature clustering ...... 70
5.7 The clusters of mouse ...... 73
5.8 The clusters of oil ...... 74
5.9 The clusters of recorder ...... 74
5.10 Precision, recall and F-score of the curvature clustering ...... 77
5.11 Paths in WordNet ...... 82
5.12 Length of the paths connecting a test word with its label ...... 83
5.13 Christmas tree ...... 85

6.1 Link graph transformation ...... 88
6.2 Two triangles involving squash ...... 89
6.3 Local graph around the word organ ...... 89
6.4 Link graph associated with organ ...... 90
6.5 A clique under the link graph transformation ...... 91
6.6 Curvature of nodes in the link graph ...... 92
6.7 Precision and recall of link graph clustering ...... 94
6.8 F-score of link graph clustering ...... 95
6.9 Two clusters of humour ...... 97
6.10 Subgraphs corresponding to three clusters of plant ...... 102
6.11 Contrasts between clusters ...... 103

7.1 Directed graph of family relationships ...... 111
7.2 Hierarchical relationships between aristocrats ...... 112
7.3 Directed graph of life-events ...... 112
7.4 Chronology of the seasons ...... 113
7.5 Lexical substitution ...... 115
7.6 Local graph around rhyme and reason ...... 115
7.7 Local graph around rhyme and song ...... 116
7.8 Local graph around man, woman and child ...... 117
7.9 Local graph around lock, stock and barrel ...... 117
7.10 Circular dispersion ...... 125

8.1 Local graph around pedestrian and walker ...... 128
8.2 First- and second-order Markov graph ...... 130
8.3 Local graph around gull and mug ...... 132
8.4 Target ranks under Hypothesis 1 ...... 135
8.5 Target ranks under Hypothesis 2 ...... 139

9.1 Bootstrapping semantic categories ...... 149
9.2 Agglomerative clustering ...... 151
9.3 A cooccurrence matrix ...... 159
9.4 Word Space ...... 159
9.5 Vectors in Word Space ...... 160
9.6 Shared Nearest Neighbors Clustering ...... 164

9.7 The neighborhood of chocolate in a dictionary ...... 169

11.1 Die Nachbarschaft von Gericht ...... 177
11.2 Krümmung ...... 178
11.3 Graphtransformation ...... 179

List of Tables

3.1 Statistical properties of the word graph ...... 49

4.1 Sample of test words and their discovered senses ...... 55
4.2 German nominalizations and their discovered senses ...... 57

5.1 Correlation between curvature and WordNet polysemy count ...... 64
5.2 Properties of curvature clustering for different curvature thresholds ...... 69
5.3 Examples of clusters resulting from the curvature approach ...... 71
5.4 WordNet similarities ...... 77
5.5 Breakdown of precision ...... 78
5.6 Breakdown of recall ...... 79
5.7 WordNet polysemy counts of the test words ...... 80
5.8 Examples of learned novel words ...... 80
5.9 Learned senses missing in WordNet ...... 83

6.1 Statistical properties of the link graph ...... 93
6.2 Contrasts between clusters of delivery ...... 98
6.3 Test words used by Yarowsky ...... 98
6.4 Clustering results for Yarowsky's test words ...... 100

7.1 Chi-square values for links attaching to chalk ...... 108
7.2 Sample of asymmetric word pairs ranked by chi-square scores ...... 109
7.3 Sample of asymmetric word pairs and their LSA similarities ...... 120
7.4 Sample of asymmetric word pairs and their WordNet similarities ...... 121
7.5 Sample of word pairs and the entropy of their contexts ...... 123

8.1 Sample of test instances in retrieval experiment ...... 134
8.2 Synonym rankings under different hypotheses ...... 140

Chapter 1

Introduction

1.1 Motivation

The lexicon develops with its users. New words appear and old ones adopt new meanings to accommodate new technology, recent scientific discoveries or current affairs. Some parts of the lexicon grow incredibly fast, in particular those pertaining to booming fields such as the biosciences or computer technology. The lively research activity in these areas produces a flow of publications teeming with new terminology. The sheer volume of data and information makes computer-based tools for text processing ever more essential. Most of these systems rely heavily on knowledge about words and their meanings as contained in a lexicon. General lexicons are often not adequate for describing a given domain: they contain many rare words and word senses which are completely unrelated to the domain in question, while at the same time failing to cover relevant words. Since it has become impracticable to maintain lexicons manually, automatic techniques for classifying words are needed.

In this thesis we take a graph-theoretic approach to the automatic acquisition of word meanings. We represent the nouns in a text in the form of a semantic graph consisting of words (the nodes) and relationships between them (the links). Links in the graph are based on the cooccurrence of nouns in lists. We find that valuable information about the meaning of words and their interactions can be extracted from the resulting semantic structure.

In the following we briefly describe the different semantic phenomena explored in this thesis. The structure of the thesis closely follows the structure of this introduction.
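To give a flavor of how list cooccurrences can be turned into graph links, the sketch below extracts coordinated items matching a toy "X, Y and Z" pattern and counts each pair as an edge. This is a deliberately simplified illustration with invented example sentences; the actual extraction pipeline is described in Chapter 3.

```python
import re
from collections import Counter

def list_edges(sentence):
    """Return cooccurrence pairs for words coordinated in a list
    (toy pattern; the real pipeline is far more careful)."""
    m = re.search(r"(\w+(?:, \w+)*,? (?:and|or) \w+)", sentence)
    if not m:
        return []
    items = re.split(r",? (?:and|or) |, ", m.group(1))
    # every pair of list members becomes an (undirected) edge
    return [(a, b) for i, a in enumerate(items) for b in items[i + 1:]]

edges = Counter()
for s in ["the soil, gravel and sand were eroded",
          "jazz, pop and rock concerts",
          "rock, gravel and sand deposits"]:
    for a, b in list_edges(s):
        edges[frozenset((a, b))] += 1
```

Repeated cooccurrence accumulates edge counts, so pairs such as (gravel, sand) end up with higher weight than chance pairings.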

1.2 Semantic similarity and ambiguity

1.2.1 Ambiguity

Semantic ambiguity occurs when the same word form can be used to refer to multiple different concepts. For example, rock "music" and rock "stone" have the same written form, yet their meanings are completely unrelated. Language is full of such examples,

which present enormous difficulties to natural language processing systems. A machine translation system, for example, must be able to choose among possible translations of an ambiguous word based on its context, such as between Rockmusik and Fels when translating the following English sentence into German:

Nearer the top, all the soil and loose stones had been completely eroded away, leaving a huge dome of smooth gray rock.

The problem of labeling an ambiguous word in a context with its intended sense is referred to as word sense disambiguation (WSD), and is receiving a lot of attention from the natural language processing community. Word sense disambiguation systems draw the sense labels from an existing collection of word senses such as that provided by the WordNet electronic lexicon. The word senses listed in an existing lexicon often do not adequately reflect the sense distinctions present in the text to be disambiguated. Different fields have their own, often specialized, vocabularies. A general lexicon such as WordNet contains many word senses which are irrelevant to the given text. What is even more problematic is that many of the senses present in the text are not covered by the lexicon. This means that even the best WSD system cannot correctly label their occurrences in the text. Domain-specific lexicons are, however, rare and, due to the constant introduction of new words, extremely difficult to maintain.

A way to overcome this lexical bottleneck is to learn the inventory of sense labels directly from the text to be disambiguated. We approach this task from a graph-theoretic perspective. The network representation of the nouns in a text provides a very effective way of predicting the different meanings of a word automatically from the web of connections among its neighboring words. We find that the neighborhoods of ambiguous words have a characteristic shape: they are composed of different, otherwise unconnected clusters of words.
As an illustration, Figure 1.1 shows the neighborhood of the ambiguous word rock in the word web. We distinguish two different word clusters which correspond to the "stone" and "music" senses of rock.

1.2.2 Semantic similarity

A common difficulty many natural language processing applications face is data sparseness. The problem of data sparseness can be overcome by making use of semantic category information. A semantic category is a collection of words which share similar semantic properties. Semantic categories include kitchen tools, academic subjects, body parts, musical instruments and sports. The availability of semantic category information allows a text processing system to generalize observed linguistic properties of the frequent category members to all other category members, including the infrequent and unfamiliar ones. In particular, semantic category membership can be used to make predictions about the meanings of unknown words based on the other known words in a category.

Figure 1.1: Graph snippet reflecting the ambiguity of the word rock. There are two different meanings of rock represented in the graph: rock "stone" and rock "music".

Semantic categories tend to aggregate in dense clusters in the word graph, such as the "music" and the "natural material" clusters in Figure 1.1. These node clusters are held together by ambiguous words such as rock which link otherwise unconnected word clusters. We attempt to divide the word graph into cohesive semantic categories by identifying and disabling these semantic "hubs".

In addition, we investigate an alternative approach which divides the links in the graph into clusters instead of the nodes. The links in the word graph contain more specific contextual information and are thus less ambiguous than the nodes, which represent words in isolation. For example, the link (rock, gravel) clearly addresses the "stone" sense of rock, and the link between rock and jazz unambiguously refers to rock "music". By dividing the links of the word graph into clusters, links pertaining to the same sense of a word can be grouped together (e.g. (rock, gravel) and (rock, sand)), and links which correspond to different senses of a word (e.g. (rock, gravel) and (rock, jazz)) can be assigned to different clusters, which means that an ambiguous word is naturally split up into its different senses.
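The link-grouping idea can be sketched on an assumed excerpt of the Figure 1.1 neighborhood: two links of an ambiguous word are placed in the same group when their other endpoints are themselves linked in the word graph. This greedy grouping is only an illustration; the actual clustering of the link graph is developed in Chapter 6.

```python
# Assumed excerpt of the word graph around "rock" (cf. Figure 1.1)
edges = {frozenset(e) for e in [
    ("rock", "gravel"), ("rock", "sand"), ("gravel", "sand"),
    ("rock", "jazz"), ("rock", "pop"), ("jazz", "pop")]}

def link_groups(word):
    """Greedily group the links of `word`: a neighbor joins an existing
    group if it is linked to some member of that group."""
    others = [b for e in edges if word in e for b in e if b != word]
    groups = []
    for n in others:
        for g in groups:
            if any(frozenset((n, m)) in edges for m in g):
                g.add(n)
                break
        else:
            groups.append({n})
    return groups
```

On this excerpt, the links of rock fall into a "stone" group (gravel, sand) and a "music" group (jazz, pop), splitting the ambiguous word into its senses.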

1.3 Idioms

Idioms pose an even greater challenge than ambiguity to systems trying to understand human language. Idioms are used extensively in both spoken and written, colloquial and formal language. Any successful natural language processing system must be able to recognize and interpret idiomatic expressions.

The following sentence is an example of idiomatic word usage:

Your level of control needs to be high enough so that your carrot and stick power matters and is taken seriously by others.

It is extremely difficult for non-native speakers of English, let alone for a computer, to grasp the meaning of the sentence. Neither the word carrot nor the word stick contributes its literal meaning to the meaning of the sentence. The word graph covers a certain type of idiom, namely those taking the form of lists. We find that words which are related on a purely idiomatic basis can be distinguished from the regularly related words in the word graph by the pattern of connections between their surrounding nodes. We have seen that ambiguous words correspond to nodes in the word graph which link otherwise isolated areas of meaning. Similarly, idiomatically related words correspond to links which connect semantically incompatible concepts. This effect is illustrated in Figure 1.2. The link between carrot and stick functions as a semantic bridge between the semantic categories of "vegetables" and "tools".

Figure 1.2: The neighborhood surrounding the idiomatically related words carrot and stick. The two nodes are linked to mutually exclusive clouds of meaning.

1.4 Synonyms

Ambiguity is the phenomenon that occurs when a single word form has several meanings. Conversely, synonymy is the phenomenon of several distinct word forms sharing the same single meaning.

Synonymy poses yet another challenge to automatic text analysis. For example, being asked

Who is the author of “Harry Potter”?,

a question answering system needs to be able to recognize that the sentence

JK Rowling is the writer of “Harry Potter”.

constitutes an answer to the question raised, which means it has to know that the words writer and author are synonyms. Again, synonymy information is implicit in the pattern of interactions between the words in the word graph. We have seen that idiomatic word pairs correspond to pairs of nodes which are linked to one another, but have different neighbors. Synonyms show the opposite behavior: they are linked to the same neighbors, but not to one another (cf. Figure 1.3).

Figure 1.3: The neighbors of writer and author in the word graph. The writer and the author node share many neighbors, but are not connected to one another.

1.5 Contributions of the thesis

We show that graphs can be learned directly from free text and used for ambiguity recognition and lexical acquisition. We introduce several combinatorial methods for analyzing the link structure of graphs, which can be used to discern the meanings of previously unknown

words. This ability is particularly valuable in rapidly expanding fields such as medicine and technology. Furthermore, our methods allow us to reliably acquire the different meanings of ambiguous words in a text without invoking external knowledge. Word senses can be learned automatically for any particular domain if a text representative of the domain is available. The proposed techniques also promise to be useful for the acquisition of complex terms, such as idiomatic expressions.

1.6 Outline of the thesis

Chapter 2 introduces the basic concepts of graph theory and provides examples of naturally occurring graphs. We focus particularly on their topological properties, which concern the manner in which the nodes are interconnected and which provide important information about the behavior of individual nodes as well as the graph as a whole. We introduce and motivate the concept of graph curvature, which plays a fundamental role in this thesis and measures the cohesiveness of a node's neighborhood in the graph.

Chapter 3 explains how data was collected and how we built a graph of words from corpus text. We describe the preprocessing steps which were taken to eliminate spurious links and give a brief summary of the types of mistakes made during data acquisition.

Chapter 4 describes an algorithm which automatically discovers the different senses of a word in the text. Sense clusters are iteratively computed by clustering the local graph of similar words around an ambiguous word. Discrimination against previously extracted sense clusters enables us to discover new senses. We report the results of two small experiments which tested the effectiveness of the algorithm on words of different kinds and degrees of polysemy.

Chapter 5 explores the concept of graph curvature from a semantic perspective. We show that curvature is a good predictor for the degree of polysemy of a word. In addition, curvature can be used as a clustering tool: by removing the nodes of low curvature which correspond to ambiguous words, the word graph breaks up into disconnected coherent semantic clusters. We evaluate the results of the clustering against WordNet. We find that curvature clustering is particularly suited for uncovering the meanings of unknown words.

Chapter 6 investigates an alternative approach which treats pairs of linked words rather than individual words as semantic nodes, and agglomerates these more contextual units into usage clusters corresponding closely to word senses. This technique naturally allows ambiguous words to belong to several clusters simultaneously. We present an evaluation of the results using WordNet as a gold standard and give some examples of learned word

senses.

Chapter 7 analyzes the asymmetric links in the word graph, which correspond to pairs of words which favor a particular word order. We find two causes for asymmetry: semantic and idiomatic constraints. We sketch several ideas for distinguishing pairs of regularly related words from idiomatically related words.

Chapter 8 describes a method for extracting synonyms from the word graph which is based on the assumption that synonyms share many neighboring words in the word graph but are not neighbors of one another. We find that, contrary to our intuition, synonyms are linked to one another quite often. We show by evaluation against WordNet that synonym retrieval can be improved significantly by relaxing our initial assumption so as to allow links between synonyms.

Chapter 9 gives an overview of relevant related work.

Chapter 10 provides a summary of this thesis, discusses its contributions to the field of automatic lexical acquisition, and offers ideas for future research.

Chapter 2

Graphs

Graphs are a very flexible mathematical model for numerous and varied problems. Any system that can be described as a set of objects and relationships or interactions between them can be naturally represented as a graph. Graphs arising in real life are called networks.

Networks pervade our daily life. Road, railway and airline networks, connecting different sites, cities or airports, allow us to travel from one place to another. Phone calls and emails are transmitted over a network of cables between telephones or computers. The World Wide Web is a network of Web pages connected by hyperlinks, and is increasingly becoming our primary source of information. We find ourselves being part of networks of people who are connected, for example, by friendship or professional relationships. There are also numerous networks within the human body itself. They include the network of biochemical reactions between the molecules in a cell, and the brain, which is a complex network of neurons and their synaptic connections. Human language, too, can be described as a network of words or concepts which are connected if one makes us think of the other.

The broad applicability of graphs has the major advantage that even very different problems can be approached using the same tools from graph theory. Furthermore, problems which can be modeled as graphs have a natural visual representation. This is a very useful aspect of graphs, as visualization is often the key to understanding a set of data.

In Section 2.1, we provide the fundamental terminology and notation regarding graphs which is used throughout this thesis. Section 2.2 supplements the formal description of graphs with examples of natural and man-made systems taking the form of graphs, as well as some of the questions which scientists are trying to answer by studying their structure.
Section 2.3 introduces the concept of curvature (or clustering coefficient), a measure of the degree of transitivity in a graph, which is central to this thesis.
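As a preview, the closely related local clustering coefficient of Watts and Strogatz can be computed as below: the fraction of pairs of a node's neighbors that are themselves linked. This is one common formalization of neighborhood cohesiveness, not necessarily identical to the curvature measure defined in Section 2.3, and the handling of degree-one nodes is an assumed convention.

```python
from itertools import combinations

def curvature(node, adj):
    """Local clustering coefficient: 1 inside a clique, 0 around a hub
    whose neighbors are mutually unconnected."""
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 1.0  # assumed convention for degenerate neighborhoods
    pairs = list(combinations(nbrs, 2))
    closed = sum(1 for u, v in pairs if v in adj[u])
    return closed / len(pairs)

# Graph Gex of Figure 2.1 as an adjacency dict
adj = {1: {2, 5}, 2: {1, 3, 4, 5}, 3: {2, 4},
       4: {2, 3, 5}, 5: {1, 2, 4, 6}, 6: {5}}
```

For instance, node 1's two neighbors (2 and 5) are linked, giving it the maximal value, while node 2's four neighbors form only three of six possible links.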

2.1 Terminology and notation

In its simplest form, a graph G consists of two components, a set of vertices V and a set of unordered pairs of vertices E which are called edges. In short, G = (V, E), where E is a

subset of V × V , the set of all possible combinations of two vertices. Less formally, vertices and edges may also be called nodes and links. The order of a graph is its number of vertices. A graph can be visualized as a collection of points and lines joining these points. Figure 2.1 is an example of a graph of order 6. It has vertices V = {1, 2, 3, 4, 5, 6} and edges E = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 5), (2, 5), (2, 4)}.

A_Gex      1  2  3  4  5  6
    1      0  1  0  0  1  0
    2      1  0  1  1  1  0
    3      0  1  0  1  0  0
    4      0  1  1  0  1  0
    5      1  1  0  1  0  1
    6      0  0  0  0  1  0

Figure 2.1: Pictorial and adjacency matrix representation of a simple graph Gex.

A graph G may also be represented as a matrix A_G with rows and columns labeled by the vertices of G. The matrix element in row i and column j is 1 if there exists an edge between vertices i and j, that is, if (i, j) ∈ E, and 0 otherwise. A_G is called the adjacency matrix of graph G. The adjacency matrix of the example graph Gex is shown in Figure 2.1. Note that A_Gex is symmetric: for all vertex labels i and j, the corresponding matrix entries (i, j) and (j, i) of A_G are the same. Naturally, if vertices i and j are connected by an edge, then so are j and i.

A graph G′ = (V′, E′) is called a subgraph of G if V′ and E′ are subsets of V and E. Figure 2.2 shows two subgraphs of our example graph.

Figure 2.2: Two subgraphs G′ex and G′′ex of the graph Gex shown in Figure 2.1.
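The adjacency matrix construction can be reproduced directly from the edge list of Gex:

```python
# Build the adjacency matrix of Gex (Figure 2.1) from its edge list.
V = [1, 2, 3, 4, 5, 6]
E = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 5), (2, 5), (2, 4)]

A = [[0] * len(V) for _ in V]
for i, j in E:
    A[i - 1][j - 1] = 1
    A[j - 1][i - 1] = 1  # undirected graph: the matrix is symmetric
```

Row 1 of the result, [0, 1, 0, 0, 1, 0], records vertex 1's edges to vertices 2 and 5, matching the matrix shown in Figure 2.1.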

If the edges are given a direction, the graph is called directed. In mathematical terms, the set of edges of a directed graph consists of ordered pairs of vertices which can be referred to as arcs. The adjacency matrix of a directed graph no longer needs to be symmetric. There may be an arc from vertex j to vertex i, but not from i to j. An example of a directed graph and its adjacency matrix is given in Figure 2.3. An entry (i, j) of the matrix is positive if there is an arc from node j to node i.

A       7  8  9
  7     0  1  1
  8     0  0  0
  9     1  1  0

Figure 2.3: A directed graph and its adjacency matrix.

Both directed and undirected graphs may be weighted. A weighted graph has a number associated with each edge which can be thought of as reflecting the strength of the connection. The graph shown in Figure 2.4 is a weighted version of graph G′ex shown in Figure 2.2. The entries of the adjacency matrix now represent edge weights.

        2  3  4  5  6
  2     0  3  2  0  0
  3     3  0  1  0  0
  4     2  1  0  0  0
  5     0  0  0  0  2
  6     0  0  0  2  0

Figure 2.4: A weighted graph and its adjacency matrix.
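The edge weights of Figure 2.4 can be stored sparsely, and summing weights along a sequence of edges gives the weight of a path, as used later in this section (helper names are illustrative):

```python
# Edge weights of the graph in Figure 2.4, stored sparsely.
w = {(2, 3): 3, (2, 4): 2, (3, 4): 1, (5, 6): 2}

def weight(u, v):
    """Weight of the undirected edge {u, v} (None if absent)."""
    return w.get((u, v), w.get((v, u)))

def path_weight(path):
    """Sum of the weights of the edges traversed by `path`."""
    return sum(weight(u, v) for u, v in zip(path, path[1:]))
```

The path {(2, 3), (3, 4)} from the text, written as [2, 3, 4], has length 2 and weight 3 + 1 = 4.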

A term which the reader will encounter many times throughout this dissertation is that of a node's degree. The degree of a vertex is the number of edges that touch it. For example, vertex 4 in Figure 2.4 participates in two edges, namely (4, 2) and (4, 3), and hence has a degree of 2. Directed graphs distinguish between a node's indegree and outdegree. The indegree is the number of edges arriving at a vertex. Similarly, a node's outdegree is given by the number of edges departing from it. Just as for undirected graphs, the degree of a node is

the number of edges attached to it, which is equal to the sum of in- and outdegree. Node 7 in Figure 2.3, for example, has indegree 2, outdegree 1 and degree 3.

A graph is called complete, or a clique, if there is an edge between any two of its vertices. The complete graphs of order one to five are given in Figure 2.5. In a complete graph of order k, each vertex is connected to all of the other k − 1 vertices. In other words, each of the vertices has maximal degree k − 1.

Figure 2.5: The complete graphs of order one to five.
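The degree bookkeeping can be checked on the directed graph of Figure 2.3, with arcs written tail → head as read off its adjacency matrix:

```python
# Arcs of the directed graph in Figure 2.3, written (tail, head).
arcs = [(7, 9), (8, 7), (9, 7), (8, 9)]
nodes = (7, 8, 9)

indeg = {n: sum(1 for _, head in arcs if head == n) for n in nodes}
outdeg = {n: sum(1 for tail, _ in arcs if tail == n) for n in nodes}
degree = {n: indeg[n] + outdeg[n] for n in nodes}
```

Node 7 indeed comes out with indegree 2, outdegree 1 and degree 3, as stated in the text.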

Two vertices are neighbors in a graph if they are connected by an edge. The neighborhood of a vertex v is the set of all its neighbors, and is denoted by N(v). Note that the number of a node's neighbors coincides with its degree. For example, node 2 of graph Gex in Figure 2.1 has neighbors N(2) = {1, 3, 4, 5}.

Two edges are called adjacent if they share a node, and a sequence of adjacent edges in the graph is called a path. The length of a path is the number of edges it is composed of. For example, a path of length 4 connecting nodes 1 and 6 in Gex (Fig. 2.1) is given by {(1, 2), (2, 4), (4, 5), (5, 6)}. In the case of a weighted graph, we can speak of the weight of a path as the sum of the weights of the traversed edges; e.g., the path {(2, 3), (3, 4)} in the graph shown in Figure 2.4 has length 2 and weight 4.

A path connecting two vertices v and w is called a shortest path if it has the shortest length among all paths connecting v and w. If edges are weighted, the term shortest path is sometimes used to refer to the path of lowest weight rather than shortest length. The shortest path between vertices may not be unique. For example, there are two shortest paths connecting vertices 3 and 5 in Gex: {(3, 2), (2, 5)} and {(3, 4), (4, 5)}. The notion of shortest path naturally gives rise to a measure of distance between the vertices of a graph. The distance between two vertices is simply defined as the length of their shortest connecting path, i.e. the smallest number of steps that it takes to get from one to the other.

A path is called a cycle if it ends at the same vertex it started from; e.g., edges (1, 2), (2, 4), (4, 5), (5, 1) in Gex form a cycle of length 4. Similarly, edges (7, 9), (9, 7) in the directed graph shown in Figure 2.3 form a cycle of length 2. An acyclic graph (also called a tree) is a graph without cycles. The graph G′′ex is an example of a tree. A directed acyclic graph (see Fig. 2.6 for an example) is called a hierarchy.
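Shortest-path distances in an unweighted graph can be computed by breadth-first search; here on Gex of Figure 2.1:

```python
from collections import deque

# Graph Gex of Figure 2.1 as an adjacency dict
adj = {1: {2, 5}, 2: {1, 3, 4, 5}, 3: {2, 4},
       4: {2, 3, 5}, 5: {1, 2, 4, 6}, 6: {5}}

def distance(s, t):
    """Length of a shortest path from s to t (None if unreachable)."""
    dist, queue = {s: 0}, deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            return dist[v]
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return None
```

For instance, vertices 3 and 5 are at distance 2 (via either of the two shortest paths named in the text).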
A graph is connected if there exists a path from any vertex to any other vertex. Otherwise it is disconnected. Graphs G′ex and G′′ex in Figure 2.2 are examples of a disconnected and a connected graph, respectively. The connected pieces of a disconnected graph are called its connected


Figure 2.6: An example of a hierarchy, i.e. a directed acyclic graph.

components. Graph G′ex consists of two connected components, one containing vertices 5 and 6 and the other one containing vertices 2, 3 and 4.

The characteristic path length of a connected graph is the average distance between any two of its nodes. The notion of characteristic path length can be generalized to unconnected graphs by considering only nodes in the largest connected component.

In this section, we have introduced the fundamental concepts of graph theory necessary to follow the remainder of the thesis. For a more thorough overview, we refer the reader to Chartrand (1985).
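The decomposition of G′ex into its two connected components can be verified by a simple graph sweep; vertex and edge sets are taken from its weighted version in Figure 2.4:

```python
# G'ex (Figure 2.2) as an adjacency dict, edges as in Figure 2.4
adj = {2: {3, 4}, 3: {2, 4}, 4: {2, 3}, 5: {6}, 6: {5}}

def components(adj):
    """Depth-first sweep collecting one vertex set per component."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps
```

The sweep returns exactly the two components named above: {2, 3, 4} and {5, 6}.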

2.2 Real-world graphs

Graphs arise in a large number of fields including sociology, biology, computer science and engineering, to name but a few. Graphs also play an important role in linguistics. Graphs are very useful for representing semantic knowledge (Fellbaum, 1998). Moreover, modeling human language as a graph of interacting words (Dorogovtsev and Mendes, 2001; Ferrer-i-Cancho and Solé, 2001) has shed light on the very nature of language, in particular, on how words may be organized in the brain and on how language evolves.

2.2.1 Social networks

We are all involved in various social networks, either as individuals or as part of a group. A social network can be seen as a graph depicting the social relationships which hold the members of a community together. The nodes of the graph are people or groups of people, and edges are the social ties connecting, for example, friends, relatives, collaborators or business partners.

The main goal of social network analysis is to gain insight into how certain communities operate. Issues which are commonly addressed include the identification of influential members of a community, or the detection of "interest groups" within a community (Girvan and Newman, 2002; Eckmann et al., 2004).

However, social scientists are not only interested in the static properties of social networks but also in their dynamics. For example, they try to understand the mechanisms by which social networks evolve, and how the structure of the network affects the communication between the members of a community. The latter issue has important applications in epidemiology. The structure of the connections between the members of a community determines how fast, for example, an infectious disease spreads across the network, and the challenge is to find ways of altering the pattern of interactions between the members of a group at risk such as to prevent an epidemic from spreading (Newman, 2002).

Two properties which the vast majority of social networks share are the small world pattern (Watts and Strogatz, 1998) and the scale-free structure (Barabási and Albert, 1999). A “small world” is one in which two strangers are connected by only a few intermediate acquaintances. In a scale-free network, few individuals are involved in an exceedingly large number of social relationships, whereas the majority of people have only a small number of social contacts. It is these two network properties which cause ideas, rumors and diseases to spread so quickly and thoroughly.

The small world and scale-free properties are not unique to social networks. Many other networks, technological as well as natural, show surprising similarity to social networks in that they exhibit these same properties which have important implications for their functioning.

2.2.2 Technological networks

Man-made networks include traffic and communication networks, electrical power grids and electronic circuits. The two technological networks which are attracting a lot of interest these days are the Internet and the World Wide Web. The Internet is a network consisting of millions of interconnected computers, and the World Wide Web can be treated as a huge directed graph whose nodes correspond to Web pages and whose arcs correspond to the hyperlinks pointing from one page to another. Both these networks grow extremely fast. New computers constantly enter the Internet, and new physical connections between computers are established. Similarly, every day numerous Web pages and new hyperlinks between Web pages make their way into the Web.

The Internet and the World Wide Web were the first networks which have been found to be scale-free. Barabási and Albert (1999) have discovered that the scale-free property of these networks is the result of a certain mechanism of growth which they call preferential attachment and which stipulates that new nodes are likely to link to already highly connected nodes. This is a surprising result since there are no rules establishing how computers have to be connected with each other, and Web pages can link to any other Web page they want.

Apart from investigating the evolution of the Web, researchers have been able to detect communities of Web pages centered around the same topic (Eckmann and Moses, 2002) by looking only at the pattern of hyperlinks. And sophisticated Web search tools have been developed which make use of the Web’s link structure to assess the relevance of a Web page (Brin and Page, 1998). Analysis of the structure of the Internet has raised awareness of its vulnerability against attacks by hackers and has provided practical information on how to effectively protect it (Albert et al., 2000).

2.2.3 Biological networks

The range of physiological processes in a cell which keep an organism alive and growing give rise to many networks in which vertices represent genes, proteins or other molecules, and edges indicate the presence of a biochemical or physical interaction between the corresponding chemicals. By analyzing the pattern of molecular interactions, researchers are trying to gain insight into the functional organization of the cell (Jeong et al., 2001). Graph-theoretic analysis of cellular networks allows us to identify the molecules most essential to a cell’s survival and to isolate groups of functionally related molecules (Eckmann and Moses, 2002). Barabási and Oltvai (2004) could trace the amazing robustness of cellular networks against accidental failures of individual components to their scale-free architecture. Furthermore, they found interesting links between the connectivity structure of cellular networks and their evolutionary history.

Another important class of biological networks are neural networks, such as the human brain. A neural network is a collection of nerve cells (neurons) which are interconnected through synapses of different strength. The stronger a synapse between two neurons, the more likely a signal will be transmitted from one neuron to the next. In mathematical terms, a neural network is a weighted directed graph. Neural networks are in constant development with links being reorganized and changing in strength. The main goal of neural network analysis is to understand how information is represented and processed within the brain and the nervous system.

2.2.4 Graphs arising in linguistics

Many different aspects of language have a natural representation as graphs. For example, linguists use parse trees resulting from the syntactic analysis of sentences to depict how the words in a sentence structurally relate to one another (cf. Figure 2.7). The links in a parse tree represent syntagmatic relationships. Graphs can also be used to describe how words relate to one another semantically.

WordNet: An example of a semantic graph is WordNet 1. WordNet is an electronic lexicon in which words are organized by meaning (Fellbaum, 1998). It consists of four separate databases, one for each of the four major word classes (nouns, verbs, adjectives and adverbs).

1http://wordnet.princeton.edu/

(S (NP (N Robina)) (VP (VP (ADV silently) (V munches)) (NP (N cornflakes))))

Figure 2.7: Example of a syntactic parse tree.

Within each word class, words are grouped into sets of (near-)synonyms, so-called synsets. The words in a synset collaboratively describe a single lexical concept. An example of a noun synset and its gloss is

{piano, pianoforte, forte-piano}: A stringed instrument that is played by depressing keys that cause hammers to strike tuned strings and produce sounds.

Synsets are linked to one another by semantic relationships. These relationships (or any subset thereof) induce a graph with vertices corresponding to synsets, and edges indicating the presence of a semantic relationship between the linked synsets. Symmetric (bi-directional) relationships give rise to undirected edges, non-symmetric relationships to directed edges.

The hyponymy relationship forms the backbone of the WordNet noun graph. A word u is a hyponym of another word w if u is a kind of w (in symbols, u ⊏ w). Conversely, in this case, w is said to be a hypernym of u. For example, piano is a hyponym of musical instrument (piano ⊏ musical instrument), and flower is a hypernym of poppy (poppy ⊏ flower). The hyponymy relationship is antisymmetric, that is, if u is a hyponym of w, then w cannot be a hyponym of u:

u ⊏ w ⇒ ¬(w ⊏ u). (2.1)

Edges representing the hyponymy relationship are thus directed. Further, ⊏ is transitive. This means if u is a hyponym of v which in turn is a hyponym of w, then u is a hyponym of w:

u ⊏ v and v ⊏ w ⇒ u ⊏ w. (2.2)

Properties 2.1 and 2.2 together imply that the hyponymy-induced subgraph of the WordNet noun graph is acyclic 2. Being both directed and acyclic, the hyponymy graph is a hierarchy.

Other, less central semantic relations encoded in WordNet are antonymy, the symmetric relationship connecting synsets with opposite meanings (e.g., war and peace are mutual antonyms), and meronymy (holonymy), which is the antisymmetric relationship linking a part (whole) to its whole (part) (e.g., pedal is a meronym of bicycle, and chair is a holonym of leg).

Polysemy (a word having multiple meanings) is implicitly represented in WordNet by a polysemous word belonging to several different synsets. Figure 2.8 shows a fragment of the hyponymy-induced WordNet noun hierarchy. The directed edges connect a synset with its hypernymic synset. The word chocolate is a member of three different synsets which correspond to its different meanings and which are thus implicitly related by polysemy (red links).

As this thesis is concerned with nouns and their meanings, we will not describe the organization of the other word classes (verbs, adjectives, adverbs) here, but instead refer the reader to Fellbaum (1998).
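The two properties can be illustrated on a toy set of (hyponym, hypernym) pairs; the pairs and the helper function below are hypothetical examples for illustration, not WordNet data:

```python
def transitive_closure(edges):
    """Close the hyponymy relation under transitivity (Property 2.2):
    u ⊏ v and v ⊏ w imply u ⊏ w."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (u, v) in list(closure):
            for (x, w) in list(closure):
                if v == x and (u, w) not in closure:
                    closure.add((u, w))
                    changed = True
    return closure

# Toy (hyponym, hypernym) pairs
edges = {("poppy", "flower"), ("flower", "plant"), ("piano", "instrument")}
closure = transitive_closure(edges)
print(("poppy", "plant") in closure)  # True, by transitivity (2.2)

# Antisymmetry (2.1): the closure of an acyclic hierarchy never
# contains both (u, w) and (w, u).
print(all((w, u) not in closure for (u, w) in closure))  # True
```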

2.3 Curvature as a measure of local graph topology

The pattern of connections between the vertices of a graph is what is known as its topology. It is independent of the nature of the nodes and links, and is concerned only with how the nodes are linked to one another. The topology of a network yields useful information about its behavior as well as about the behavior of individual nodes, and may even provide insight into the network’s evolutionary history.

A quantity which locally characterizes the topology of a graph is the clustering coefficient, which was first introduced in (Watts and Strogatz, 1998). It is a property of the nodes in the graph which measures the interconnectedness of a node’s neighbors. Eckmann and Moses (2002) show that the clustering coefficient has a geometric interpretation as the curvature of a graph at a given node, and thus suggest the alternative term curvature. In this thesis, we will use both terms interchangeably.

2.3.1 Clustering in social networks

The concept of curvature is best explained and understood by looking at social networks. Figure 2.9 shows the social network structure for the set of characters in Victor Hugo’s “Les Misérables”3. A link between two nodes corresponds to a social tie between the associated characters. Locally, we observe very different link structures in the graph. At one extreme are fully-connected regions in the graph, such as the subgraph induced by the “Joly”-node and its neighbors in the network. Very often, one’s friends are friends with each other, too. Joly’s circle of

2 If there were a cycle w1 ⊏ w2 ⊏ · · · ⊏ wn ⊏ w1 in the hyponymy graph, then Property 2.2 would imply that w1 ⊏ wn. But at the same time wn ⊏ w1, which contradicts Property 2.1.
3 Borrowed from http://www-personal.umich.edu/∼mejn/networks/

Figure 2.8: A fragment of the WordNet noun taxonomy.

friends carries this to an extreme. Any two of its members are linked. It forms a clique. In fact, Joly is a member of the “Société de la ABC”, an association of revolutionary students which is tightly held together by strong companionship 4:

... In addition to their revolutionary role, however, the Société also played an important role in its members’ social lives. The members were best friends; they laughed and philosophized, ate and drank, and even shared lodgings with one another. They spoke about world politics and their mistresses in the same breath; even when they disagreed they wanted what was best for each other. This to Hugo was the ideal lifestyle of the student: carefree yet serious and surrounded with like minds...

At the other extreme are subgraphs in which the only connections are those between a central node and its neighbors. The neighbors themselves are not interconnected. Topologically speaking, such a graph is a star. The set of the relationships in which the character “Myriel” is involved nearly exhibits such a star topology. Myriel’s social surroundings are very loose: There are hardly any relationships among his acquaintances. It is he who holds his social neighborhood together. This is in accord with Myriel’s nature. He is a very charitable bishop. His house is open to anyone who is in need, to people of any background, be it women, children, the poor or even aristocrats. Because of the diversity of the people Myriel cares for, most of them are mutual strangers.

In between, we observe neighborhoods which are neither fully-connected nor star-like, but feature a hybrid topology. Consider the example of the protagonist “Valjean”. The neighborhood of the “Valjean”-node is a star of dense (almost fully-interlinked) subgraphs (consisting of equally colored nodes), an indication of Valjean being connected to different tight-knit communities. He is the “social hub” of the network of characters at which the different social groups meet.

Just as Figure 2.9 shows the overall social network structure in “Les Misérables”, local neighborhoods provide insight into a single character’s immediate social surroundings. The curvature measure attempts to quantify the nature of these local neighborhoods. Naturally, a single number can’t fully describe the complex structure of a node’s neighborhood. Curvature rather measures a node’s degree of embeddedness in its immediate neighborhood.

2.3.2 Graph-theoretic definition

Let n be a node in a graph G, and let N(n) denote the set of n’s neighbors. There are at most #N(n) ∗ (#N(n) − 1)/2 links among the nodes in N(n). The curvature (resp. clustering coefficient) of the node n is defined as

curv(n) = #(links among n’s neighbors) / (#N(n) ∗ (#N(n) − 1)/2) =: k_linked / k_max.

4Quoting from a plot summary at http://www.mtholyoke.edu/courses/rschwart/hist255/bohem/tmizbook.html

Figure 2.9: Social network in Victor Hugo’s “Les Misérables”.

It is the fraction of actual links among a node’s neighbors out of the maximally possible number of such links. It assumes values between 0 and 1. A value of 0 occurs if there is no link between any of n’s neighbors (i.e., the neighbors are maximally disconnected), and a node has a curvature of 1 if all its neighbors are linked (i.e., its neighborhood is maximally connected). Figure 2.10 shows nodes of low, medium and high curvature respectively.

(a) curv(v0) = 0. (b) curv(v0) = 0.4. (c) curv(v0) = 1.

Figure 2.10: Nodes of low, medium and high curvature.

The three graphs depicted in Figure 2.10 reflect the different topologies we observed in the social network of the previous section: “Myriel”, “Valjean” and “Joly” are nodes of, respectively, low, medium and high curvature. Nodes of high curvature are well-integrated in a tight node community in which almost any two nodes are interconnected. Such tightly-knit communities exhibit a high degree of transitivity: if two nodes a and b are linked to a third node c, most likely, a and b are linked, too, and the three nodes form a triangle. Therefore, another equivalent way of expressing the curvature value is

curv(n) = #(triangles n participates in) / #(triangles n could possibly participate in).
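The definition can be sketched in a few lines of plain Python (the adjacency-dictionary representation and the two example neighborhoods are our own illustration):

```python
from itertools import combinations

def curvature(adj, n):
    """curv(n) = #(links among n's neighbors) / (#N(n)*(#N(n)-1)/2),
    for a graph given as a dict: node -> set of neighbors."""
    nbrs = adj[n]
    k_max = len(nbrs) * (len(nbrs) - 1) // 2
    k_linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return k_linked / k_max

# A star-like (Myriel-like) and a clique-like (Joly-like) neighborhood of v0
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
clique = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(curvature(star, 0), curvature(clique, 0))  # 0.0 1.0
```

Because every link among two neighbors of n closes a triangle through n, k_linked also counts the triangles n participates in, so both formulations above compute the same value.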

2.3.3 Geometric interpretation

Curvature, as Eckmann and Moses (2002) show, has a geometric interpretation which is related to the Gaussian curvature of a surface. If we consider the three-dimensional (Euclidean) space, then a surface is a 2-dimensional object which, locally (in a, no matter how small, neighborhood of every point), looks like a plane. Examples of surfaces are depicted in Figure 2.11.

The Gaussian curvature measures the “curvedness” of a surface. For each of the surfaces in Figure 2.11 imagine a closed curve on the surface surrounding the red point. The curvature of the surface at the red point is the angle by which a narrow strip surrounding the curve opens up when cut open and spread out in the plane. Figure 2.12 shows such strips for each of the surfaces shown in Figure 2.11.

The two ends of the strip enclose an angle. We can picture each end of the strip as the position of the hand of a clock. Since the convention is to measure angles counter-clockwise,

our clock runs backwards. For a parabolic surface, like the one on the left in Figure 2.11, to get from the clock time of one end of the strip to the clock time of the other by moving on the strip (that is, in the direction indicated by the arrow), the hand has to do less than a turn. In the case of a hyperbolic surface, like the one on the right, however, the hand of the clock will even pass its original position as we walk from one end of the strip to the other. This can also be interpreted as moving the hand of the clock less than a full turn against the direction of our funny clock. Another way to put this is to say that in the first case the two ends enclose a positive angle, i.e., the surface has positive curvature, whereas in the second case the enclosed angle is negative or the surface has negative curvature. The curvature of a plane is zero as the two ends are parallel.


Figure 2.11: Three surfaces in three-dimensional space. The curvature at the red point is positive, zero and negative, respectively.

Figure 2.12: Surface strips cut along a closed curve surrounding a point of positive, zero and negative curvature, respectively.

The geometry of non-planar surfaces is very different from the Euclidean geometry in the plane. In fact, the more curved a surface, the more its geometric properties differ from the properties of the plane. For example, in Euclidean geometry, we have the law of cosines which tells us how to compute the length c of one side of a triangle, given the lengths a and b of the other two sides and the angle γ they enclose (cf. Fig. 2.13):

c² = a² + b² − 2 ∗ a ∗ b ∗ cos(γ) (2.3)

In the case of an isosceles triangle, that is, a = b and γ = π/3, the formula becomes

c² = 2a² − 2 ∗ a² ∗ cos(π/3) = 2a² ∗ (1 − 1/2) = a² ⟺ c = a

and the triangle is equilateral, which doesn’t come as a surprise. However, this is not the case on curved surfaces.


Figure 2.13: A planar triangle.
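As a numeric sanity check of Equation 2.3 (assuming the standard law of cosines), the isosceles case a = b with γ = π/3 indeed yields c = a on the plane:

```python
import math

def third_side(a, b, gamma):
    """Law of cosines: c^2 = a^2 + b^2 - 2ab*cos(gamma)."""
    return math.sqrt(a * a + b * b - 2 * a * b * math.cos(gamma))

# Isosceles case a = b, gamma = 60 degrees: the planar triangle is equilateral.
a = 2.0
c = third_side(a, a, math.pi / 3)
print(abs(c - a) < 1e-12)  # True: c = a on the plane
```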

Figure 2.15 shows a triangle living on a hyperbolic surface (i.e. surrounding a point of negative curvature), using the so-called Poincaré Disk model. Figure 2.16 illustrates how points in the Poincaré Disk map to points on a hyperbolic surface. As a point gets closer to the boundary of the Poincaré Disk, its distance to the center of the Poincaré Disk approaches infinity. Therefore, two points which are a fixed distance apart are represented using a shorter line segment if they are close to the boundary than if they are close to the center.

The triangle in Figure 2.15 was drawn using NonEuclid 5, an interactive tool for constructions in hyperbolic geometry. As you can see from the measurements, the sides (A, B) and (B, C) of the triangle have the same size, namely 2, and enclose an angle of 60 degrees. On the plane, as a consequence of the law of cosines (Equation 2.3), the triangle (A, B, C) would be equilateral. But, as Figure 2.15 makes evident, on a hyperbolic surface, the third side of the triangle is longer than the other two.

On a parabolic surface, we observe the opposite effect. Figure 2.14 shows an isosceles triangle on the sphere, drawn using Spherical Easel 6, a tool for drawing in spherical geometry. Again, the two equally long sides enclose an angle of 60 degrees. This time, the third side of the triangle is shorter than the other two. Intuitively, this makes sense. On a sphere, the shortest line connecting two points is along a great circle (a circle whose center is the center of the sphere), i.e., the sides of a triangle lie on such great circles. Any two great circles meet at two points on opposite sides of the sphere. Therefore, as we extend two sides of the triangle, their (non-common)

5http://www.cs.unm.edu/ joel/NonEuclid/NonEuclid.html
6http://merganser.math.gvsu.edu/easel/

endpoints converge closer and closer together, whereas in the plane, they diverge. On a hyperbolic surface, two nonparallel lines diverge even more than two nonparallel lines in a plane, since the surface stretches into all directions of the surrounding space 7.

If we look at an isosceles triangle two sides of which have fixed equal lengths a = b and enclose a fixed angle γ, then the length c of the third side will tell us whether the surface is negatively or positively curved. The more extreme the difference in side length to what we would expect on the plane (as predicted by Equation 2.3), the more strongly the surface is curved. This can be understood by shrinking the size of the triangle. The smaller a triangle on a surface, the closer it is to being planar, and the more and more Euclidean geometry (the law of cosines amongst others) becomes adequate. Shrinking the size of the triangle has the same effect as making the surface less curved (flatter).

Note that we are able to compute a surface’s curvature on the surface itself without looking at it from the 3-dimensional space in which it is contained. Mathematicians make this clear by calling Gaussian curvature an intrinsic property of a surface. For example, even without satellite photos showing that the Earth is a sphere, we can conclude that the Earth is positively curved by measuring big triangles on the Earth’s surface. Similarly, physicists are trying to deduce the shape of the Universe from (large-scale) local measurements.

Figure 2.14: An isosceles triangle on a sphere.

To see the connection between the Gaussian curvature of a surface at a given point and the curvature of a node in the graph, let us have a look at triangles of vertices in a graph. Let us consider the subgraph induced by a node n and its set of neighbors N(n). Any two neighbors a and b in N(n) which are not directly linked are linked via n. Any two nodes

7In fact, on a hyperbolic surface, even parallel lines diverge.

Figure 2.15: An isosceles triangle on a hyperbolic surface.

Figure 2.16: The Poincaré disk model of hyperbolic geometry.

in N(n) are thus either one or two links away from one another. The average length of a path connecting two neighbors a and b is

d_avg = (k_linked ∗ 1 + (k_max − k_linked) ∗ 2) / k_max = 2 − k_linked / k_max = 2 − curv(n)

⇔ curv(n) = 2 − d_avg.
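The relation just derived can be checked on a small example graph (a hypothetical sketch, not code from the thesis): the neighbors of n that are directly linked are at distance 1, all others at distance 2 via n.

```python
from itertools import combinations

def avg_neighbor_distance(adj, n):
    """Average graph distance between pairs of n's neighbors:
    1 if a pair is directly linked, else 2 (via n)."""
    pairs = list(combinations(adj[n], 2))
    return sum(1 if b in adj[a] else 2 for a, b in pairs) / len(pairs)

# Node 0 has neighbors {1, 2, 3}; only the pair (1, 2) is linked,
# so k_linked = 1, k_max = 3 and curv(0) = 1/3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(avg_neighbor_distance(adj, 0))  # 5/3 = 2 - curv(0)
```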

A triangle with apex n has thus two sides of unit length (the two links from n to the two neighbors a and b) and a third side which has an average length of d_avg = 2 − curv(n). The larger (smaller) the curvature of a node, the shorter (longer) the average distance is between any two of its neighbors, and hence the bigger the curvature is of a surface on which an isosceles triangle with side lengths 1, 1 and 2 − curv(n) can be drawn.

Chapter 3

Data acquisition

In this thesis, we study a network of nouns built from free text. The vertices are nouns, and edges connect nouns if they interact with each other. We define word interaction as cooccurrence in lists. Our main aim is to predict the meanings of a word from the pattern of interactions with the other words in the word web. In the following we explain how the word graph is assembled.

3.1 Acquisition assumption

Several researchers, inspired by Hearst’s work (Hearst, 1992) on the acquisition of hyponymy relationships, have successfully used lexicosyntactic patterns to learn semantic relations from text (Kietz et al., 2000; Morin and Jacquemin, 2004; Cederberg and Widdows, 2003). The idea is that certain patterns provide evidence of a specific semantic relationship between the constituent words. For example, the pattern y such as x1(, x2, ... and/or xn), as in

Start with helmets: They are important for sports such as football, hockey, baseball, softball, biking, skateboarding, in-line skating, skiing, and snowboarding - to name just a few.

indicates that all of the xi are kinds of y. Section 9.1 provides a summary of previous work on the extraction of semantic relationships using lexicosyntactic patterns.

We take a similar pattern-based approach to find instances of semantic similarity between nouns. Our basic hypothesis is that conjunctions (books and catalogs) and, more generally, lists (art historians, philosophers, aestheticians, anthropologists, historians or biographers), consist of words which are related semantically. More specifically, the members of a list often belong together in a particular semantic category such as musical instruments (trumpet, violin, flute, flugel horn and recorder), academic subjects (optics, physics, mechanics, astronomy, mathematics and philosophy), diseases (TB, measles, polio, whooping cough, tetanus or diphtheria) or countries (Britain, India, Israel, and the USA).

3.2 Text material

As a text source, we use the British National Corpus (BNC) 1, a 100 million word collection of contemporary British English. The corpus covers a wide range of domains (such as fiction, science, journalism), and includes texts of very different genres, ranging from transcribed phone conversations to user manuals and recipes, novels, newspaper articles and scientific papers, giving a well-balanced account of the modern . The entire BNC has been marked up with part of speech (PoS) tags which makes it relatively easy to extract specific grammatical patterns with the help of regular expressions.

3.3 Building the graph

Lists of words have previously been used to find categories of similar objects (Roark and Charniak, 1998; Riloff and Shepherd, 1997). They occur frequently and are relatively easy to recognize. In the following, we will describe how associations between words as conveyed by lists can be used to build a graph representation of the nouns in a text. The structures we are interested in are lists of nouns which we capture by the following regular expression:

NP(, NP)*,?( CJC NP)+ (3.1)

where

CJC (conjunction) = (and | or | nor), and
NP (noun phrase) = AT?( CRD)*( ADJ)*( NOUN)+.

The PoS tags provided by the BNC (see examples below) are used to determine the grammatical class of a word.

Pattern matching on the word/PoS level is a very shallow approach to extracting semantic relationships. However, pattern-based approaches have few requirements for preprocessing of the text, which makes them robust and widely applicable. In particular, they do not rely on a priori lexical knowledge. Our approach to extracting relationships between words is recall-oriented. To make up for the sacrifice of precision, we apply different filters to the extracted relationships to enhance the quality of the results (see Section 3.4).

The following are examples of words and word combinations matched by the constituents of Pattern 3.1:

AT (determiner): the, a, no, all, this, these, any, my, her CRD (cardinal/ordinal): two, 3, first, next, last, 23rd, million ADJ (adjective): public, better, nicest, obliged, increasing, other

1http://www.natcorp.ox.ac.uk/

NOUN (noun): summer, daisies, housework, September, Mozart NP (noun phrase): late successful career, the four London areas, her own business, the Belgrave Beauty Clinic, Professor Jonathan Mann

Pattern 3.1 as a whole reveals, for example, the following lists:

• the bones, ligaments, tendons and muscles • detainee, lawyer or human right activist • pencil, crayon or burnt wood or cork • seventeen crime stories, two novels, two plays and a book • protestant loyalism and catholic nationalism • shoes and gloves and belts
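A much simplified sketch of Pattern 3.1 might look as follows. It operates on hypothetical word_TAG tokens; the tag names used here (AT, ADJ, NOUN, CJC) are toy stand-ins for the BNC's actual tagset, and the CRD slot is omitted for brevity:

```python
import re

# Simplified NP: optional determiner, any adjectives, one or more nouns
NP = r"(?:\w+_AT\s)?(?:\w+_ADJ\s)*\w+_NOUN(?:\s\w+_NOUN)*"
# Simplified Pattern 3.1: NP(, NP)*,?( CJC NP)+
LIST = rf"{NP}(?:,\s{NP})*,?(?:\s(?:and|or|nor)_CJC\s{NP})+"

sent = "the_AT bones_NOUN, ligaments_NOUN, tendons_NOUN and_CJC muscles_NOUN"
m = re.search(LIST, sent)
# Each NP here has a single noun, so collecting all nouns yields the heads.
heads = re.findall(r"(\w+)_NOUN", m.group(0))
print(heads)  # ['bones', 'ligaments', 'tendons', 'muscles']
```

For multi-noun phrases like human right activist, only the last noun of each NP would be kept as the head, as described below.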

In order to extract noun-noun relationships, we extract the head noun of each of the coordinated noun phrases. For simplicity, we assume that the head of a noun phrase is the last noun in the construction and that the preceding words function merely as modifiers (e.g., a human right activist is a special kind of activist). Furthermore, we assume that

1. Any two of the coordinated nouns are equally strongly related. 2. Lists are symmetric in their constituents, i.e., the meaning of a list is independent of its order.

Having made these simplifying assumptions, we can identify a list of nouns l = x1, x2, ..., xk with the set E_l = {(x_i, x_j) | x_i, x_j ∈ l} of all possible pairs of co-members. Pairs (x_i, x_j) with x_i = x_j are disregarded. If we think of the nouns as nodes in a graph, then the collection E = ⋃_l E_l defines a set of undirected edges between nodes. In particular, the members of a list of length k form a k-clique. For example, a coordination with three nouns A, B and C results in three edges (A, B), (A, C) and (B, C).

In summary, our method for constructing a conceptual network of nouns consists of the following steps:

1. Extract lists of noun phrases via pattern matching. 2. Identify the head noun of each noun phrase. 3. Introduce a node for each noun. 4. Draw an edge between any two nouns if they occur together in a list.
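The steps above can be sketched as follows (the function and the toy lists of head nouns are our own illustration):

```python
from itertools import combinations

def build_graph(lists):
    """Turn lists of head nouns into an undirected graph: every list
    of length k contributes a k-clique (steps 3 and 4 above)."""
    adj = {}
    for nouns in lists:
        for u, v in combinations(set(nouns), 2):  # set() drops pairs with u == v
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

# Toy head-noun lists in the spirit of Figure 3.1
lists = [["mind", "body", "spirit"], ["head", "neck", "body"]]
g = build_graph(lists)
print(sorted(g["body"]))  # ['head', 'mind', 'neck', 'spirit']
```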

Figure 3.1 illustrates by means of a short piece of example text 2 how the word graph is assembled. As more and more lists are processed, new nodes and edges join the graph. Figure 3.2 shows a fragment of the word graph after the entire text has been analyzed.

2The sentences are taken from the BNC. 42 3. Data acquisition

• Legend has it that the mandarin was so grateful to Earl Grey for services rendered that he gave him his secret tea recipe, to keep mind, body and spirit together in perfect harmony.

• So far as ordinary citizens and non-governmental bodies are concerned, the background principle of English law is that a person or body may do anything which the law does not prohibit.

• Christopher was also bitten on the head, neck and body before his pet collie Waldo dashed to the rescue.

Figure 3.1: Illustration of the graph building process. Two nouns are connected by a link if they cooccur in a list. [The figure shows the graph growing around the nodes mind, body, spirit, person, head and neck as the lists are processed.]

Figure 3.2: The neighborhood of body in the final word graph.

3.4 Preprocessing of the word graph

3.4.1 Conflating variants of the same word

Most words appear in various surface forms throughout the corpus, in particular in their singular and plural forms. Although they are only morphological variations of the same word, each of the surface forms has its own representation as a node in the word graph. To reduce redundant nodes and links in the graph, we fuse the nodes corresponding to a word’s different morphological forms into a single node. The resulting node inherits all the edges of the nodes being fused. This process is illustrated using a hypothetical graph in Figure 3.3.


Figure 3.3: Nodes corresponding to nouns sharing the same base form are conflated to a single node.

We use WordNet to identify morphological variations of the same underlying base form. WordNet provides a word’s different base forms. We replace a word with its base form if the base form is unique. For example, WordNet lists only a single base form for the noun cars, which becomes car. Some words, however, have more than one base form because they are ambiguous across their morphological guises. The noun aids, for example, can be either the plural form of aid or the singular form of Aids. Consequently, it is left unchanged. This means that words, such as cars and car, which share the same unique base form are considered to be surface representations of each other and are collapsed to a single node in the graph. Words, such as aids and aid, which, dependent on the context, may or may not be morphological variations of the same base form, are represented by different nodes in the graph.

Identifying the different morphological forms of a word with each other not only reduces redundancy but also data sparseness. By generalizing across word forms, unobserved links, such as those between van and car, car and bus, and bus and bike in the above example graph, can be inferred. Furthermore, the reduced graph has a higher level of clustering (the nodes have higher curvature, cf. Section 2.3), i.e. exhibits a higher degree of community structure than the original graph.
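The conflation step can be sketched as follows. The base-form map and the graph are toy examples in the spirit of Figure 3.3; in the thesis the base forms come from WordNet:

```python
def conflate(adj, base_form):
    """Merge nodes sharing a unique base form; the merged node inherits
    all edges. `base_form` maps a surface form to its base form; words
    with an ambiguous base (like aids) are simply absent from the map."""
    merged = {}
    for u, nbrs in adj.items():
        bu = base_form.get(u, u)
        for v in nbrs:
            bv = base_form.get(v, v)
            if bu != bv:  # drop self-loops created by merging
                merged.setdefault(bu, set()).add(bv)
                merged.setdefault(bv, set()).add(bu)
    return merged

base = {"cars": "car", "vans": "van", "bikes": "bike"}
adj = {"cars": {"bus"}, "bus": {"cars", "bike"}, "bike": {"bus"},
       "vans": {"car"}, "car": {"vans"}}
g = conflate(adj, base)
print(sorted(g["car"]))  # the merged node inherits the edges of cars and car
```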

3.4.2 Eliminating weak links

One way of eliminating weak links without having to choose a threshold is to discard all edges which do not occur in a triangle in the graph. Intuitively, an edge connecting two nodes in the word graph suggests that the corresponding words are semantically similar. This conjecture is substantiated further if, in addition, the two words share a semantic neighbor. In general, semantic similarity is not transitive: two words which are similar to the same third word are not necessarily similar themselves. For example, the words tennis and zucchini are very different in meaning although they are both semantically similar to readings of squash. If we do observe transitivity among three nodes, this can be taken as strong evidence that the edges by which the nodes are connected correspond to true word associations. Extremely weak or mistaken associations due to errors in collecting or processing the data are unlikely to occur in a triangle and are hence eliminated. Figure 3.4 illustrates the triangle filtering process.


Figure 3.4: We discard edges which do not occur in a triangle.
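A minimal sketch of the triangle filter, assuming the graph is stored as a symmetric adjacency map (the example words in the test are illustrative only): an edge (u, v) lies in a triangle exactly when u and v share at least one neighbor.

```python
def triangle_filter(graph):
    """Keep an edge (u, v) only if u and v share a common neighbor,
    i.e. only if the edge occurs in at least one triangle."""
    filtered = {}
    for u, nbrs in graph.items():
        kept = {v for v in nbrs if graph[u] & graph[v]}
        if kept:  # nodes left without edges drop out of the graph
            filtered[u] = kept
    return filtered
```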

3.5 Weighting links

The word graph can be extended by assigning weights to its edges to reflect the strength of the observed word associations. The simplest option for weighting edges is to use raw frequency counts. However, very frequent words naturally have a bigger chance of occurring together than rare words. Several statistical measures for assessing the degree of association between words have been proposed and studied (for a comprehensive overview see Evert (2004)). We decided to use the log-likelihood score, which was first proposed by Dunning (1993) as a measure of association between words. The log-likelihood measure provides sensible estimates of association strength even for relatively rare words (Dunning, 1993). Suppose we are interested in measuring the strength of association between two words v and w. All of the observed cooccurrences of words in lists can be partitioned into the following four categories, which contain those word pairs which involve

• both v and w
• v but not w
• w but not v
• neither v nor w

We denote the number of word pairs falling into each of these categories with Oij where the index i (j) indicates the presence (1) or absence (0) of v (w). If the words v and w cooccurred only by chance, we would expect to see the following category frequencies:

\[ E_{ij} = \frac{O_{i\cdot}\, O_{\cdot j}}{N}, \quad \text{where} \tag{3.2} \]

\[ O_{i\cdot} = O_{i0} + O_{i1}, \qquad O_{\cdot j} = O_{0j} + O_{1j}, \qquad N = \sum_{i,j} O_{ij}. \]

The log-likelihood score measures the discrepancy between the observed category counts Oij and the expected counts Eij, and is given by the following term (Evert, 2004):

\[ 2 \sum_{(i,j)\in\{0,1\}^2} O_{ij} \log \frac{O_{ij}}{E_{ij}}. \tag{3.3} \]

The log-likelihood measure also has an information-theoretic interpretation. Let us introduce two random variables Iv and Iw on the sample space consisting of all cooccurrences of words, which are defined such that Iv (Iw) assumes a value of 1 or 0 depending on whether v (w) forms part of an observed pair of cooccurring words. We use P (Iv = i), P (Iw = j) and P (Iv = i, Iw = j) to denote the empirical probabilities of the corresponding events. The log-likelihood measure can then be written in the following alternative form:

\[ 2N \sum_{(i,j)\in\{0,1\}^2} P(I_v = i, I_w = j) \log \frac{P(I_v = i, I_w = j)}{P(I_v = i)\, P(I_w = j)}. \tag{3.4} \]

This is, disregarding the multiplicative factor of 2N, equal to the mutual information between the random variables Iv and Iw. Mutual information measures the “distance” between the joint probability distribution and the product of the individual probability distributions of two random variables. If Iv and Iw are statistically independent, then, by definition, the joint and the product probability distribution of Iv and Iw are identical and, as a result, the mutual information between Iv and Iw is 0. The mutual information between two random variables is therefore a measure of their dependency. This means the log-likelihood score of v and w (which, as we have seen, is equal to the mutual information between Iv and Iw) measures the degree to which the presence or absence of word v influences the presence or absence of word w and vice versa.

In fact, most of the lexical acquisition methods described in the following chapters do not make use of edge weights, but rely solely on connectivity information.
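The computation of the log-likelihood score (3.3), and its equality with the mutual-information form (3.4) up to the factor 2N, can be sketched as follows; the contingency counts in the test are made up for illustration:

```python
import math

def expected(O, i, j):
    """E_ij = O_i. * O_.j / N  (Eq. 3.2)."""
    N = sum(sum(row) for row in O)
    return (O[i][0] + O[i][1]) * (O[0][j] + O[1][j]) / N

def log_likelihood(O):
    """2 * sum_ij O_ij * log(O_ij / E_ij)  (Eq. 3.3), with 0 log 0 := 0."""
    return 2 * sum(O[i][j] * math.log(O[i][j] / expected(O, i, j))
                   for i in (0, 1) for j in (0, 1) if O[i][j] > 0)

def mi_form(O):
    """2N times the mutual information of the indicators Iv, Iw  (Eq. 3.4)."""
    N = sum(sum(row) for row in O)
    total = 0.0
    for i in (0, 1):
        for j in (0, 1):
            p = O[i][j] / N
            if p > 0:
                pi = (O[i][0] + O[i][1]) / N
                pj = (O[0][j] + O[1][j]) / N
                total += p * math.log(p / (pi * pj))
    return 2 * N * total
```

Both functions return the same value on any 2x2 table, which makes the equivalence argued above easy to check numerically.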

3.6 Spurious edges

3.6.1 Errors in data acquisition

A number of wrongly detected lists are the result of tagging errors. Other types of problems occurring during both the collection and interpretation of lists are described in the following sections.

Wrongly detected lists: Our shallow approach to extracting lists leads to a number of spurious matches. For example, a problem arises when a list is preceded by a prepositional or adverbial phrase, as in the following example sentences:

• For art historians, incomplete schemes or dismembered works such as altarpieces have the attraction of needing detective work.
• In compensation, picture or sculpture may be accompanied by comment from the artist.

The noun phrase in the introductory phrase (art historians and compensation, respectively), and the comma which separates the introductory phrase from the main clause, are wrongly interpreted as forming part of the subsequent list. Pattern 3.1 cannot distinguish the above constructions from true lists such as the following ones:

• For every city, town and village there are insider’s reviews of accommodation, restaurants and nightlife.
• In France, Prussia and Spain, the aristocracy did not even have to pay taxes.

Another source of mismatches are coordinations involving complex noun phrases. Consider the following example sentences.

• His teaching was at variance with that of the Archbishop of Canterbury and other bishops.
• Like the flashing of fireflies and ticking of pendulum clocks, ....

Both sentences contain a coordination of noun phrases, the first of which consists of a head noun followed by a prepositional phrase. Pattern 3.1 does not account for complex noun phrases and hence ignores the head (archbishop and flashing, respectively) and considers only the object of the preposition (canterbury and fireflies, respectively) to form part of the list. Defining a more general pattern would not improve this situation, as it is impossible on the token/part-of-speech level to distinguish the above cases of coordinated complex noun phrases from cases such as the following ones, where a coordination forms the object of a prepositional phrase.

• The Department of Physics and Astronomy of The University of Montana-Missoula.
• ...for the identification and characterization of proteins and peptides from mass spectrometry data.

Similar mismatches occur when Pattern 3.1 encounters a coordination of sentences, such as

• It was a warm, sunny day and my foot was starting to ache.

Our pattern cannot determine that the coordination is between sentences rather than nouns. A large number of spurious associations arising from wrongly detected lists are eliminated by the triangle filter. A deeper syntactic analysis, e.g. using a chunker, may further improve the reliability of the extracted lists but will likely harm recall.

Another source of false positives are appositive constructions, such as

• Mariam Firouz, a writer and translator in her mid-70s, was arrested for her political activities on April 6, 1983.

Although appositions do not constitute lists in the strict sense, their constituents are semantically closely related and give rise to extremely useful links in the word graph. In fact, Riloff and Shepherd (1997) and Roark and Charniak (1998) have explicitly used appositives (amongst other constructions) for collecting groups of semantically similar words.

Wrongly interpreted lists: Some spurious links occur as a result of misinterpretation of the extracted lists, and are mostly due to our rather simplistic approach to bracketing coordinated nouns. In many cases, in particular for elliptic constructions (music and theater critic, summer and autumn day, rose and herb bed), the correct bracketing is of the form “(A and B) C” rather than the assumed “A and (B C)”. A simple approach to avoiding such incorrect bracketings could be to look for occurrences of “A and B” and “A and C” in the corpus, whose presence or absence confirms or disproves a potential bracketing.

Moreover, whereas stripping off prenominal modifiers often leads to relationships at a reasonable level of specificity, this approach is not adequate when dealing with compound nouns. For example, consider the following lists which give rise to links connecting the underlined nouns.

• library and post office
• Paris and New York
• the Queen and the Prime Minister
• the black death and the plague
• more red tape and more regulation

Naively splitting up a noun compound into a head and its modifiers often produces incorrect or at least extremely arguable relationships. Compound extraction techniques could alleviate this problem (Baldwin et al., 2003). Many of the spurious links introduced while collecting and interpreting lists of words disappear with the triangle filter described earlier.

3.6.2 Violation of the symmetry assumption

Contrary to our intuition, the words in a list often cannot be arranged freely. Various, mostly semantic, rules make us prefer a particular ordering over the other possible ones. For example, lists are often ordered by the value, influence or size of their constituents (e.g., food and clothing; landlord and tenant; elephant and mouse), or according to temporal, logical or causal constraints (e.g., spring, summer and autumn; shampoo and conditioner; initiation and enactment). Assuming symmetry between the words in a list means ignoring word order preferences, which are adequately accounted for only if edges are given a direction. Asymmetric links are examined more closely in Chapter 7.

3.6.3 Semantically heterogeneous lists

Other lists are even more rigid in terms of word order. These include proper names, such as book or movie titles, which often take the form of lists (e.g., Romeo and Juliet; The lion, the witch and the wardrobe), but also lists which through consistent usage have made their way into the lexicon (e.g., fish and chips; bed and breakfast). In addition to violating the symmetry assumption, many of these lists do not even meet our basic assumption of being composed of semantically similar words. This is particularly the case for idiomatic expressions, such as carrot and stick or rank and file, which are often semantically extremely heterogeneous. Rather than considering these lists to contribute only noise, we take on the challenge and use the structural properties of the word graph to work out which links arise from idiomatic usage rather than semantic similarity. This work is described in Chapter 7.

3.6.4 Ambiguity due to conflation of variants

Very often, number is a good disambiguating clue, and conflating singular and plural may introduce ambiguity where it could have been avoided. For example, the noun squash in its plural form squashes unambiguously refers to the “vegetable” and not the “racquet sport” sense of squash. Similarly, the noun apples most likely refers to a collection of fruits rather than a set of computers. However, as we will see in Chapter 5, conflating different words or different senses of a word leaves its mark on the connectivity pattern of the graph, so that we are able to restore the different meanings.

3.6.5 Statistical properties of the word graph

Table 3.1 summarizes the basic statistical properties (the number of nodes, links, triangles and connected components, as well as the size of the largest component) of the word graph at different stages of preprocessing. Conflating the different variants of a word reduces the number of nodes and links but, at the same time, causes a considerable increase in triangles, that is, in clustering. Before application of the triangle filter, the word graph contains a large number of isolated links. The number of connected components thus drops considerably if only links in triangles are considered.

                    #nodes    #links   #triangles   #comps   largest comp
raw coocs          112,809   933,551    7,813,865    3,407        105,220
+ variant confl     99,755   861,985   10,968,040    3,177         92,692
+ tri filter        48,346   746,229   10,968,040      600         46,320
+ WN filter         21,082   617,947   10,667,181       21         21,019

Table 3.1: Some simple statistical properties of the word graph at different stages of pre- processing.

In the subsequent chapters, we will distinguish between two graphs: the fully preprocessed word graph W (third row in Table 3.1), and the subgraph Wwn which is induced by the subset of nouns which are known to WordNet (last row in Table 3.1). We remove dangling links in Wwn by another application of the triangle filter.

Chapter 4

Discovering word senses

4.1 Intuition

Polysemous words link otherwise unrelated areas of meaning. For example, the words rat and monitor mean very different things, yet they both have an interpretation which is closely related to one of the meanings of the polysemous word mouse. Consider Figure 4.1, which shows part of the word graph surrounding the word form mouse. While there are many links within the same area of meaning, nodes representing meanings from different areas are linked to one another only via the mouse-node. Upon removing the mouse-node and all its incident edges, the graph shown in Figure 4.1 decomposes into two parts, one representing the “electronic device” meaning of mouse and another representing its animal sense.

[Figure 4.1 shows two densely connected regions, one of computer terms (drive, ram, disk, memory, screen, monitor, keyboard, joystick, trackball) and one of animal terms (fox, pig, rabbit, gerbil, hamster, cat, rat, shrew, vole, squirrel, among others), joined only through the mouse-node.]

Figure 4.1: The word graph in the vicinity of the word form mouse.

The kind of variation in word meaning present in the mouse-example is just one of several types of ambiguity (Kilgarriff, 1992). As can be seen in Figure 4.2, wing “part of a

[Figure 4.2 shows the neighborhood of wing, with clusters of political terms (government, party, faction, leader), body parts (arm, leg, neck, tail), aircraft parts (fuselage, rudder, cockpit, aircraft, bomber, helicopter) and building terms (door, wall, roof, building).]

Figure 4.2: Neighborhood graph of the word form wing.

bird” is closely related to tail, as is wing “part of a plane”. Therefore, even after removal of the wing-node, the two areas of meaning are still linked via tail. The same happens with wing “part of a building” and wing “political group”, which are linked via policy. However, while there are many edges within an area of meaning, there is only a small number of (weak) links between different areas of meaning. To detect the different areas of meaning in our local graphs, we use a graph clustering algorithm (Markov clustering, MCL) developed by van Dongen (2000).

4.2 Markov Clustering

The idea underlying the MCL algorithm is that random walks within the graph will tend to stay in the same cluster rather than jump between clusters. The following notation and description of the MCL algorithm borrows heavily from van Dongen (2000). Consider a weighted graph G with adjacency matrix AG (cf. Section 2.1) whose entries are the link weights. Normalizing the columns of AG results in the Markov matrix TG whose entries (TG)uv can be interpreted as transition probabilities from node u to node v. It can easily be shown that the k-th power of TG lists the probabilities (TG^k)uv of reaching node v from node u in k steps. The MCL algorithm simulates flow in G by iteratively recomputing the set of transition probabilities via two steps, expansion and inflation. The expansion step corresponds to taking the k-th power of TG as outlined above and allows nodes to see new neighbors. The inflation step takes each individual matrix entry to the r-th power and then rescales each column so that the entries sum to 1. Via inflation, popular neighbors are further supported at the expense of less popular ones. Flow within dense regions in the graph is concentrated by both expansion and inflation.

Eventually, flow between dense regions will disappear, the matrix of transition probabilities TG will converge and the limiting matrix can be interpreted as a clustering of the graph.
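The expansion and inflation steps described above can be sketched as follows. This is a minimal illustration of MCL, not van Dongen's reference implementation; the self-loops, parameter defaults, convergence test and attractor read-out are choices made here for the sketch:

```python
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, max_iter=100, tol=1e-12):
    """Alternate expansion (matrix power) and inflation (entrywise power plus
    column renormalization) until the flow matrix stops changing, then read
    clusters off the supports of the nonzero (attractor) rows."""
    n = len(adj)
    M = adj.astype(float) + np.eye(n)   # self-loops, standard MCL practice
    M /= M.sum(axis=0)                  # column-normalize: transition matrix TG
    for _ in range(max_iter):
        prev = M
        M = np.linalg.matrix_power(M, expansion)  # expansion: k-th power
        M = M ** inflation                        # inflation: entrywise r-th power
        M /= M.sum(axis=0)                        # rescale columns to sum to 1
        if np.abs(M - prev).max() < tol:
            break
    clusters = []
    for row in M:  # nonzero rows are attractors; their support is a cluster
        members = frozenset(np.flatnonzero(row > 1e-6))
        if members and members not in clusters:
            clusters.append(members)
    return clusters
```

On a toy graph of two triangles joined by a single edge, the flow between the two dense regions dies out and the two triangles emerge as clusters.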

4.3 Word Sense Clustering Algorithm

The output of the MCL algorithm strongly depends on the inflation and expansion parameters r and k as well as the size of the local graph which serves as input to MCL. An appropriate choice of the inflation parameter r depends on the ambiguous word w to be clustered. In the case of homonymy, a small inflation parameter r is appropriate. However, there are ambiguous words with more closely related senses which are metaphorical or metonymic variations of one another. In that case, different regions of meaning are more strongly interlinked and a small power coefficient r would lump different meanings together. Usually, one sense of an ambiguous word w is much more frequent than its other senses present in the corpus. If the local graph handed over to the MCL process is small, we might miss some of w’s meanings in the corpus. On the other hand, if the local graph is too big, we will get a lot of noise.

Below, we outline an algorithm which circumvents the problem of choosing the right parameters. In contrast to pure Markov clustering, we do not try to find a complete clustering of the local graph Gw, which surrounds the ambiguous word w, into senses at once. Instead, in each step of the iterative process, we try to find only the most distinctive cluster c of Gw (i.e. the most distinctive meaning of w). We then recompute the local graph Gw by discriminating against the cluster c. This means that, in the updated graph Gw, we consider only words which are neither contained in, nor linked to, any of the words (nodes) in c. The process is stopped if the similarity between w and its best neighbor in the updated graph is below a fixed threshold. This is similar in spirit to the sense clustering method of Pantel and Lin (2002), who, once a word has been assigned to a cluster, remove from it those features which overlap with the features of the cluster. This approach allows us to reveal even the less prominent senses of a word.

Consider an ambiguous word w.
Let L be the output of the algorithm, i.e. a list of sense clusters, initially empty. The algorithm consists of the following steps:

1. Compute a small local graph Gw around w, considering only words which are neither contained in, nor linked to, any of the clusters c in L. If the similarity between w and its closest neighbor is below a fixed threshold, go to 5.

2. Recursively remove all nodes of degree one. Then remove the node corresponding to w from Gw.

3. Apply MCL to Gw with a fairly big inflation parameter r which is fixed.

4. Take the most significant cluster (the one that is most strongly connected to w in Gw before removal of w), add it to the final list of clusters L and go back to 1.

5. Go through the final list of clusters L and assign a name to each cluster using a broad-coverage taxonomy (see below).

As the initial local graph Gw in Step 1, we choose the graph which consists of w, the n1 top neighbors of w and the n2 top neighbors of the neighbors of w. In the experiment described below, neighbors have been ranked by frequency of occurrence, but may alternatively be ranked by a different measure of association, such as the log-likelihood measure introduced in Section 3.5. In each iteration, we attempt to find only one significant cluster. Therefore, it suffices to build a relatively small graph in Step 1. Step 2 removes noisy strings of nodes pointing away from Gw. The removal of w from Gw might already separate the different areas of meaning, but will at least significantly loosen the ties between them.

In our simple model based on noun cooccurrences in lists, Step 1 corresponds to rebuilding the graph under the restriction that the nodes in the new graph must not cooccur (or at least not very often) with any of the cluster members already extracted. We used the class labeling algorithm of Widdows (2003) in Step 5 to assign a name to the extracted clusters. This algorithm exploits the taxonomic structure of WordNet. More specifically, the hypernym which subsumes as many cluster members as possible, and does so as closely as possible in the taxonomic tree, is chosen as class label.
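The iterative loop of Steps 1 to 4 can be sketched as follows. This is a simplified stand-in, not the full procedure: the MCL call of Step 3 is replaced by connected components of the local graph, the local graph is just w's direct neighborhood, and the similarity threshold is omitted; the toy mouse graph in the test is invented for illustration:

```python
from collections import defaultdict

def build_graph(edges):
    g = defaultdict(set)
    for u, v in edges:
        g[u].add(v)
        g[v].add(u)
    return g

def discover_senses(graph, w):
    """Iteratively extract sense clusters of w, discriminating in each round
    against the words contained in, or linked to, clusters found so far."""
    clusters = []
    while True:
        banned = set()
        for c in clusters:          # Step 1: exclude words in or linked to
            banned |= c             # any previously extracted cluster
            for u in c:
                banned |= graph[u]
        local = {u for u in graph[w] if u not in banned}
        if not local:
            break
        comps, seen = [], set()     # stand-in for Step 3: components of the
        for start in local:         # local graph after removal of w (Step 2)
            if start in seen:
                continue
            comp, stack = set(), [start]
            while stack:
                x = stack.pop()
                if x in comp:
                    continue
                comp.add(x)
                seen.add(x)
                stack.extend((graph[x] & local) - comp)
            comps.append(comp)
        # Step 4: keep the cluster most strongly connected to w
        clusters.append(max(comps, key=lambda c: len(c & graph[w])))
    return clusters
```

On the toy graph, the densely connected "device" neighbors of mouse come out first, and the "animal" neighbors emerge in the second round once the device cluster is discriminated against.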

4.4 Experimental Results

As an initial evaluation experiment, we gathered a list of nouns with varying degrees of ambiguity, from homonymy (e.g. jersey) to systematic polysemy (e.g. cherry), applied our algorithm to each word in the list (with parameters n1 = 20, n2 = 10, r = 2.0, k = 2.0) and extracted the top two sense clusters. We then used the class labeling algorithm to determine the WordNet synsets which most adequately characterize the sense clusters. An extract of the results is listed in Table 4.1. Automatically generated sense clusters for several other words are similarly encouraging.

We expect that the word sense discovery algorithm can be considerably improved if words are linked based on shared grammatical relations (e.g. verb-object, verb-subject and adjective-noun relationships) instead of mere cooccurrence in lists. In particular, words with systematically related senses could benefit greatly from the use of more distinctive grammatical features. For example, the word graph has great difficulty distinguishing between BMW the car and BMW the company, because all the neighbors of BMW share the same sense ambiguity. Syntactic information can help disambiguate these two senses: BMW in its company sense occurs, for example, as a subject of employ and decide, just as the words company or corporation do. Instead, when it refers to a car, BMW tends to occur as an object of drive, or to be modified by new or expensive, just as other objects do.

In a first attempt to improve the recognition of systematically related senses, we used verb-object, verb-subject and adjective-noun relationships, extracted from German newspaper text using the Lexicalized Probabilistic Context-Free Grammar developed by Schulte im Walde (2003), to characterize the nouns in the corpus. We then built a graph of nouns by linking any two of them based on feature overlap. The degree to which two words share features is measured using Lin (1998)’s information-theoretic word similarity measure.

Word: jersey
Sense 1: israel colombo guernsey luxembourg denmark malta greece belgium sweden turkey gibraltar portugal ireland mauritius britain cyprus netherlands norway australia italy japan canada kingdom spain austria zealand england france germany switzerland finland poland america usa iceland holland scotland uk (class label: European country)
Sense 2: bow apron sweater tie anorak hose bracelet helmet waistcoat jacket pullover equipment cap collar suit fleece tunic shirt scarf belt crucifix (class label: garment)

Word: head
Sense 1: voice torso back chest face abdomen side belly groin spine breast bill rump midhair hat collar waist tail stomach skin throat neck speculum (class label: body part)
Sense 2: ceo treasurer justice chancellor principal founder president commander deputy administrator constable librarian secretary governor captain premier executive chief curator assistant committee patron ruler (class label: person)

Word: squash
Senses 1 and 2: football gardening rounders running theatre snooker gymnastics bowling mini wrestling chess netball fishing gymnasium lounge table soccer bar playground baseball cup basketball city ball league volleyball bath room fencing cricket weather hockey polo walking dancing boxing garden sauna terrace court golf union swimming solarium croquet tennis rugby handball development athletics archery badminton music horse pool whirlpool (class labels: sport; area)

Word: oil
Sense 1: apple rum mint lime cola honey coconut lemonade chocolate lemon bergamot pepper eucalyptus acid rind (class label: foodstuff)
Sense 2: heat coal power water gas food wood fuel steam tax heating kerosene fire petroleum dust sand light steel telephone timber supply drainage diesel electricity acid air insurance petrol (class label: object)

Word: paint
Sense 1: tempera gouache watercolour poster pastel collage acrylic
Sense 2: lemon bread cheese mint butter jam cream pudding yogurt sprinkling honey jelly toast ham chocolate pie syrup milk meat beef cake yoghurt grain (class label: foodstuff)

Word: ?
Sense: hazel elder holly family virgin hawthorn (class label: shrub)

Word: cherry
Sense 1: cedar larch mahogany water sycamore lime teak ash hornbeam oak walnut hazel pine beech alder thorn poplar birch chestnut blackthorn spruce holly yew laurel maple elm fir hawthorn willow (class label: wood)
Sense 2: bacon cream honey pie grape blackcurrant cake banana (class label: foodstuff)

Table 4.1: Some test words and their senses as extracted by the word sense discovery algorithm.

A class of regularly polysemous nouns in German is the set of nominalizations ending in “-ung”, most of which have both a “process” and a “state” reading, where the latter is often the result or the object of the former. For example, the word form Bemalung, just like its English translation painting, denotes both the activity of painting and its result, and the word form Lieferung (delivery) may refer to both the process of delivering and the thing being delivered.

We adjust the word sense discovery algorithm to the new way of linking words based on shared grammatical relations as follows. In Step 1, instead of discriminating against the words contained in previously extracted clusters, we discriminate against the features characterizing them. This means we rebuild the local graph surrounding the target word w so that we consider only those words which have a non-significant feature overlap with the already extracted word clusters.

We applied the modified algorithm to the “-ung” nominalizations occurring in the corpus. Table 4.2 lists some of the results. Again, the clusters shown for each test word are the first two in the list of extracted clusters. The algorithm discriminates well between the “activity/process” and “object” readings of the test words. Furthermore, the grammatical relationships characterizing a cluster provide valuable clues about which meaning a particular instance of the target word refers to. For example, the first cluster generated for the test word Leitung relates to Leitung “pipe/powerline”, i.e. to a physical object. The cluster members share the following grammatical relations:

• subj-verb: lecken (leak), einfrieren (freeze), durchschmoren (melt down), brechen (break apart), etc.

• obj-verb: verlegen (lay), legen (lay), verlaufen (run), explodieren (explode), reparieren (repair), reißen (break), auswechseln (exchange), erneuern (renew), etc.

• adj-noun: defekt (defective), beschädigt (damaged), undicht (leaky), verrostet (rusty), isoliert (insulated), etc.

On the other hand, the second cluster, which characterizes the “function, duty” sense of Leitung, is defined by the features

• subj-verb: zufallen (evolve upon), obliegen (be incumbent upon), zuwachsen (accrue), erfordern (require)

• obj-verb: übernehmen (accept), wahrnehmen (perform), zuweisen (assign), übertragen (delegate), erfüllen (perform), etc.

• adj-noun: wichtig (important), undankbar (thankless), exekutiv (executive), dankbar (thankful), sozial (social), etc.

These syntagmatic features associated with each of the two clusters can help to disambiguate Leitung in sentences such as

Word Sense clusters Class label Leitung Rohr Trasse Pipeline Wasserleitung Kabel Behälter Faß Vorrichtung Gasleitung Abwasserkanal Schlauch etc. Kompetenz Rolle Verantwortung Funktion Aufgabe Pflicht Erkrankung Vorfall Verkehrsunfall Vorgang Einbruch Zwischenfall Tod Vorfall Straftat Überfall Diebstahl Unglück Virus Erreger Panik Unruhe Infektion Krankheit Fieber Epidemie Seuche Angst etc. Krankheit Verbindung Verbundenheit Atmosphäre Kommunikation Austausch Freundschaft Zusammenhang Beziehung Tradition Kooperation Umstand Zusammenarbeit etc. Gift Dampf Stoff Dioxin Schadstoff Asbest Kohlenwasserstoff Substanz Plutonium Chemikalie Lösungsmittel etc. Verletzung Verkehrsunfall Beschädigung Rauchvergiftung Knochenbruch Riß Verwundung Gesundheitsschaden Fraktur Trauma Menge Schnittwunde etc. Verhalten Brauch Verstoß Vergehen Verbrechen Delikt Handlung Betrug Menschenrechtsverletzung Mord etc. Geschehen Bedienung Betreuung Behandlung Versorgung Hilfe Berücksichtigung Geschehen Unterbringung Krankenschwester Passant Frau Verkäuferin Bürger Dame Lebewesen Nachbarin Herr Leute Kellner Drogenberatung Jugendarbeit Schutz Überwachung Betreuung Kontrolle Versorgung Überprüfung Aufsicht Beratung Untersuchung etc. Geschehen Hof Gebäude Kindergarten Spielplatz Einrichtung Station Bauwerk Hallenbad Pforte Türe Krankenhaus etc. Versteinerung Einführung Angleichung Beitritt Abriß Ausstieg Privatisierung Währungsunion Reduzierung Geschehen Müll Kot Abfall Überbleibsel Exkrement Brot Substanz

Table 4.2: Examples of German nominalizations ending in -ung and their learned senses.

• Der Physiker Albert Einstein übernimmt die Leitung des Kaiser-Wilhelm-Instituts für Physik. (The physicist Albert Einstein takes over the leadership of the Kaiser Wilhelm Institute for Physics.)
• Der Netzwerkausfall war auf eine beschädigte Leitung der Telekom zurückzuführen. (The network failure was due to a damaged Telekom line.)

We have given some examples to illustrate the potential usefulness of this approach to the induction of word senses. We are aware that a more thorough evaluation is needed. Our current approach is time-consuming, as word senses are learned for each word separately. In the following, we will investigate global approaches to inducing word senses which analyze the whole word graph at once.

Chapter 5

Using curvature for ambiguity detection and semantic class acquisition

5.1 A linguistic interpretation of curvature

The aim of this chapter is to understand the concept of curvature, which was introduced in Section 2.3, from a semantic point of view. The definition of curvature is based on purely graph-theoretic arguments. The question addressed in this chapter is how this definition relates to the meanings of the words in the word graph.

Figure 5.1 shows the neighborhood of the word form head in the word graph. Two highly clustered regions, corresponding to word communities, are clearly visible: one referring to the “body part” sense and another referring to the “person in a leading position” sense of head. There is a high density of triangles, and thus a high degree of transitivity, within each area of meaning, but transitivity breaks down at community boundaries. Words within an area of meaning, i.e. words with a precise meaning, thus have high curvature. Ambiguous words such as head, which connect otherwise unrelated areas of meaning, instead have low curvature, because transitivity generally holds only between neighbors referring to the same sense of a word. The more meanings a word has, the fewer triangles it occurs in and, consequently, the lower its curvature.

In terms of geometry, this means that neighborhoods of specific, unambiguous words can be embedded in a highly positively curved surface and thus require little space (forming a ball-like structure). The neighborhood of a semantically fuzzy word, on the other hand, can only be accommodated on a surface of negative curvature, which takes up much more room in semantic space. Clusters of high curvature words correspond to cohesive semantic communities around which the space of word meanings is spherical in nature, i.e. very compact. Ambiguous words live in between such word communities. Their neighborhoods take up more space. Ambiguity locally molds the semantic space into a saddle.

Figure 5.1: The neighborhood of the ambiguous word form head. We can identify two sphere-like clusters of highly connected nodes corresponding to different meanings of head. These are connected by a small number of low curvature nodes.

5.2 Curvature as a measure of ambiguity

Word frequency is often used as a measure of a word's information content or ambiguity (Resnik, 1999; Lin, 1998). The idea, which is rooted in information theory (Shannon, 1948), is that rare (unexpected) words convey much more information than very frequent (expected) ones. Most high frequency words have a very general scope and are often used in a figurative sense. The familiar word head, for example, in addition to being used in its original meaning of "top part of the human body", is often employed to refer to the top or chief part of all kinds of concrete or abstract things, including nails, spears, bones, departments, schools and movements. Similarly, the word fruit denotes the "fruits of a tree" or the state of "being in fruit", but it can also be used metaphorically to refer to the outcome of something, as in "the fruit of labor".

Whereas ambiguity tends to increase with word frequency, frequency alone is not a very reliable indicator of ambiguity. There are many examples of rare words which have multiple meanings. Among the most ambiguous low-frequency words are acronyms and abbreviations, such as ge, which may refer either to a chemical element (Germanium) or to an American company (General Electric). Other examples are names of companies or products which are borrowed from the pool of already existing words (e.g., condor and python). We shall see in the following that curvature quantifies ambiguity better than measures based on word frequency alone.

Among the most frequent words covered by the graph model are names of countries. Figure 5.2 is a scatter plot of word frequency versus curvature for the words in the graph. Names of countries (indicated by black dots) have considerably higher curvature than other words of comparable frequency. The curvature measure correctly recognizes that, despite their high frequencies, names of countries are very informative, that is, unambiguous.
The outlier is the word monaco, which may not seem ambiguous, but which indeed has several different meanings in the BNC: country, city, 14th century painter and 20th century tenor (cf. Figure 5.3). Just as names of countries are set apart from other frequent words by high curvature values, we find many family names among the low-frequency, low-curvature words. In addition to different persons sharing the same name, family names are often taken from names of colors (Green, Black, White, Brown), places (Wolverton, Adlington, Barnham), professions (Smith, Taylor, Marshall), natural objects (Hill, Bush, Wood, Moore), animals (Crane, Swift), or first names (John, Martin), which makes them highly ambiguous.

To confirm the empirical finding that curvature quantifies ambiguity better than word frequency, we check how well each of the two measures predicts the words' "true" ambiguity index as determined by their WordNet polysemy counts (number of different senses). Instead of word frequency, i.e. the total number of occurrences of a word, one might also measure a word's level of ambiguity by its degree in the word graph, i.e. the number of different words it occurs with. We consider all words in the graph model which are listed in WordNet and compute a ranking of these words for each of the three measures: curvature, frequency and degree.

[Scatter plot omitted: log-log axes with curvature (10^-2 to 10^0) on the x-axis and frequency (10^0 to 10^3) on the y-axis; names of countries are drawn as black dots, all other words as open circles, and the outlier monaco is labeled.]

Figure 5.2: Scatter plot of curvature vs. frequency. Note that countries (black dots) have substantially higher curvature values than other words with similar frequencies, meaning that they are very specific. The outlier is monaco.

Figure 5.3: Local neighborhood of the word form monaco.

The different rankings are compared to each other as well as to the ranking based on the WordNet polysemy counts, which serves as the benchmark. Table 5.1 contains the correlations between the different rankings as estimated by the Pearson coefficient (Pruscha, 1996). We find that, with a negative correlation of −0.538, curvature is more strongly related to the WordNet-defined benchmark than frequency (0.475) or degree (0.480), and is thus the better predictor of ambiguity.1

       WN      Freq    Deg     Curv
WN     1.000   0.475   0.480   -0.538
Freq           1.000   0.963   -0.865
Deg                    1.000   -0.884
Curv                           1.000

Table 5.1: Rank correlations between any two of WordNet polysemy count, word frequency, degree and curvature.
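The rank correlations above amount to computing the Pearson coefficient on rank positions. A minimal self-contained sketch, using hypothetical sense counts and curvature values (ties are handled naively, without the usual tie correction):

```python
def ranks(values):
    """Rank positions of `values` (1 = smallest); ties get first-come ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rank_correlation(x, y):
    """Pearson coefficient computed on the rankings of x and y."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: curvature should correlate negatively with polysemy.
polysemy  = [1, 2, 3, 5, 8]            # WordNet sense counts (toy values)
curv_vals = [0.9, 0.7, 0.5, 0.2, 0.1]  # curvature values (toy values)
print(rank_correlation(polysemy, curv_vals))  # close to -1.0: anti-ordered
```

A negative coefficient, as in Table 5.1, means that higher curvature goes with fewer senses; the strength of the relationship is its absolute value.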

The curvature approach to measuring ambiguity shares some similarity with the approach proposed by Sproat and van Santen (1998), in that curvature also quantifies ambiguity based on the semantic cohesiveness of the target word's neighborhood. Words with a very tightly-knit neighborhood are assigned smaller ambiguity scores than words whose neighborhood is rather fuzzy.

5.3 Curvature as a clustering tool

Besides being a useful measure of ambiguity, curvature immediately gives rise to a powerful tool for separating the word graph into groups of words with similar meanings. The high average curvature of the nodes in the word graph is indicative of community structure: whereas the word graph as a whole is sparsely connected, it contains many highly connected regions. These node communities correspond to semantically homogeneous groups of words, i.e. words denoting the same or similar concepts. Ambiguous words function as "semantic hubs". They blur community boundaries by connecting different communities which evoke unrelated concepts. As we have seen in the previous section, the curvature measure allows us to locate the ambiguous words by their low curvature values. To reveal the community structure underlying the word graph, we attack the low-curvature hubs, so that the word graph naturally fragments into isolated islands of high-curvature nodes. These correspond to semantically coherent clusters of unambiguous words. More precisely, we compute the curvature of the nodes in the graph

1 The strength of the relationship is given by the absolute value of the correlation coefficient; the sign merely defines the direction of the relationship.

and delete all those nodes, and their adjacent links, whose curvature falls below a given threshold. This causes the word graph to break into several disconnected pieces (clusters). The particular choice of the curvature threshold regulates the trade-off between the coverage and the coherence of the clustering. Words failing to pass the curvature threshold are not contained in the resulting clustering and hence are not classified. Imposing a high curvature threshold therefore results in low coverage, but at the same time yields small, coherent clusters. In contrast, the more lenient the curvature threshold, the more words exceed it, i.e. are classified, but the bigger and more heterogeneous the clusters tend to be.
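The thresholding step just described can be sketched in a few lines: drop every node whose curvature falls below the threshold and read off the connected components of what remains. A plain-Python illustration on hypothetical toy data (not the thesis code):

```python
from collections import deque

def fragment(graph, curvature, theta):
    """Delete all nodes with curvature below `theta` (and their links),
    then return the connected components of the remaining graph.
    `graph` is an adjacency dict, `curvature` maps node -> curvature value."""
    keep = {n for n in graph if curvature[n] >= theta}
    seen, components = set(), []
    for start in keep:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search within the surviving nodes
            n = queue.popleft()
            comp.add(n)
            for m in graph[n]:
                if m in keep and m not in seen:
                    seen.add(m)
                    queue.append(m)
        components.append(comp)
    return components

# Toy graph: a low-curvature hub "head" joins two small communities.
g = {
    "head": {"neck", "chief"}, "neck": {"head", "shoulder"},
    "shoulder": {"neck"}, "chief": {"head", "director"}, "director": {"chief"},
}
curv = {"head": 0.0, "neck": 0.6, "shoulder": 0.6, "chief": 0.6, "director": 0.6}
comps = fragment(g, curv, theta=0.35)
s_rel = max(len(c) for c in comps) / len(g)  # relative size of largest piece
print(sorted(sorted(c) for c in comps), s_rel)
```

Deleting the hub splits the toy graph into two coherent clusters; the same quantity s_rel, computed over a range of thresholds, gives the curve Srel(θ) studied in the next subsection.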

5.3.1 Determining the curvature threshold

In order to find a suitable curvature threshold θ, we examine how changing θ affects the fragmentation of the graph. A quantity which characterizes the degree of fragmentation of a graph is the proportion of nodes contained in the largest fragment (connected component): the smaller the largest component, the more fragmented the graph. Erdős and Rényi (1960) found that randomly wired graphs (any two nodes are linked with a certain probability p) show an abrupt change in their connectivity as the wiring probability p passes a critical value p0. Below the critical threshold, the graph consists of a multitude of small isolated components. But as p approaches p0, a giant cluster comprising the vast majority of the nodes suddenly emerges, and the graph is said to percolate.

Even more interesting is to study percolation in real-world graphs. Most real-world networks are characterized by the existence of a giant component. By observing the behavior of this giant cluster as nodes and/or edges are deleted, it is often possible to make inferences about the stability of the network. Nodes belonging to the same connected component of a graph can be reached from each other. The size of the largest component thus measures the degree of communication between the nodes and strongly influences the functionality of the network. The giant component shrinks as nodes (links) are disabled and eventually disappears. Its behavior depends very much on the order in which nodes (links) are removed. For example, random removal of nodes often leads to a gradual decrease in the size of the giant component. Removing the most connected nodes first, in contrast, leads to an abrupt collapse of the giant component, i.e. a communication breakdown, once the number of removed nodes exceeds a critical value (the percolation threshold).
The study of the percolation properties of a network has many applications, for example in epidemiology, where it can help in the design of vaccination strategies aimed at protecting the population at risk from an infectious disease. Vaccination of a person is modeled as the removal of a node from the social network, and the percolation threshold corresponds to the number of vaccinations necessary to prevent the disease from spreading. If people with many social contacts are vaccinated first, the disease can be controlled much more efficiently and effectively. Seeking guidance on the choice of the curvature threshold, we study the percolation properties of the word graph as we vary the curvature threshold, our control parameter of ambiguity. If there occurs a sharp transition in the connectivity of the word

graph, then the curvature value at which it occurs corresponds to the critical amount of ambiguity which has to be eliminated in order to break the word graph into coherent meaning clusters.

We use Srel(θ) to denote the relative size of the largest component for a given value of θ, i.e. the size of the largest component divided by the total number of nodes in the graph. Figure 5.4 shows the behavior of Srel(θ) as a function of the curvature threshold θ (black line). Low curvature thresholds θ give rise to a large giant component, whereas high θ-values result in many small isolated clusters. The most substantial change in fragmentation occurs within the interval 0.1 ≤ θ ≤ 0.3, where slight changes in the curvature threshold θ are associated with significant changes in Srel(θ). The curve levels off at θ = 0.35. For comparison, the figure also includes the fragmentation values obtained by removing the same number of nodes at random rather than in increasing order of curvature (red line).2 When nodes are removed in random order, the largest component shrinks only very slowly. Even after a significant number of nodes has been removed, the largest cluster still covers a substantial fraction of the graph. In order to break up the graph, nearly all nodes have to be removed. If, on the other hand, low curvature nodes are removed first, the largest cluster disintegrates very quickly. This supports our intuition that it is the low curvature nodes, corresponding to ambiguous words, which are chiefly responsible for holding the word graph together.

Until now, we have only examined the behavior of the largest component, neglecting all other components. However, studying the small components can also be illuminating. We do this by counting the number of clusters containing at least 5 nodes, which we denote by N5(θ), and by recording this number for different values of θ (cf. Figure 5.5). For small values of θ (≤ 0.1), N5(θ) is extremely small: removing the small number of lowest-curvature nodes causes only small pieces to break off the largest component. As θ increases, bigger clusters suddenly start falling off.
However, these clusters are quickly broken apart themselves as θ gets even larger. Table 5.2 summarizes properties of the clustering for different values of θ. These include the size Sabs(θ) and density of the largest cluster, the size Sabs(2)(θ) of the second largest cluster, and the number N5(θ) of clusters of size ≥ 5. We calculate the density of the largest cluster as the ratio of the number of links to the number of nodes it contains; it serves as a measure of the cluster's coherence. An optimal curvature threshold θ has to be large enough to ensure that the largest cluster is internally coherent and no longer dominates the graph. At the same time, θ should not be so large as to cause the small components to split.

Whereas N5(θ) reaches its maximum at θ = 0.25, the giant component disappears no

2 The values are averaged over several runs. The apparent dip at θ = 0.675 is due to a large number of nodes having a curvature ≥ 0.65 but < 0.675. When θ increases from 0.65 to 0.675, a much larger number of nodes is removed at once than when moving from θ = 0.625 to θ = 0.65 or from θ = 0.675 to θ = 0.7. If plotted against the number of removed nodes instead of against θ, Srel is a smooth curve.


Figure 5.4: Relative size Srel(θ) of the largest cluster as a function of the curvature threshold θ. The red line shows the effect of removing the same number of nodes at random.


Figure 5.5: The number of clusters comprising at least five words as a function of θ.

sooner than at θ = 0.325. This, however, is the point at which it is sparsest, i.e. least coherent. We decide to trade the coverage of the clustering for its accuracy and choose θ = 0.35 as the threshold for curvature clustering to ensure homogeneous clusters. Inspection of the clusterings associated with different values of θ further confirms this choice. Using θ = 0.35 as the curvature cut-off point, we obtain a graph comprising 28,763 nodes. The majority of the nodes (16,731) are isolated (not connected to any other node). The remaining 12,032 nodes are distributed across 4,770 clusters of size ≥ 2.

θ        #nodes   Sabs(θ)   Sabs(2)(θ)   density   N5(θ)
0.2      39,809   17,006    32           2.70      348
0.225    37,471   9,277     42           1.95      461
0.25     36,080   5,681     44           1.68      491
0.275    34,191   1,997     241          1.37      476
0.3      33,061   986       121          1.26      437
0.325    31,496   96        79           1.10      367
→ 0.35   28,763   20        19           1.45      208
0.375    28,296   14        13           1.50      177
0.4      26,618   13        10           1.85      102

Table 5.2: Some properties of the curvature clustering associated with different curvature thresholds θ (the chosen threshold θ = 0.35 is marked with an arrow).

5.3.2 Allowing for overlapping clusters

Curvature clustering in its current form covers only the high curvature words, that is, words with a well-defined meaning. Information on the meaning of the ambiguous low curvature words is not provided. By augmenting each of the curvature clusters with those nodes in the original word graph which are directly attached to it, we can expand the clustering to also cover words with a less precise meaning. Moreover, since a node may attach to more than one cluster, this process allows for multiple cluster membership. This means that an ambiguous word can simultaneously belong to several clusters corresponding to its various meanings. As an example, mouse links to both the (scsi, trackball, joystick) and the (pika, terrapin, marmot, hamster, gerbil) cluster (cf. Table 5.3), and is hence added to both of them. The steps of the extended curvature clustering algorithm (also illustrated in Figure 5.6) are the following:

1. Compute the curvature of all nodes in the graph.

2. Remove all nodes whose curvature falls below a certain threshold (θ = 0.35). This decomposes the word graph into a set of non-overlapping clusters.

3. Expand each of these crisp clusters with its immediate neighboring nodes to produce a “soft” clustering of the word graph. Only neighbors with at least two links to the cluster are added.


(a) Identify hubs (red). (b) Delete hubs and their links. (c) Add neighbors to clusters.

Figure 5.6: Illustration of the curvature clustering.
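Steps 2 and 3 can be sketched as follows. The crisp clusters are assumed to be the components left after thresholding, and, as in Step 3, an outside node joins a cluster only if it has at least two links into it (hypothetical toy data, not the thesis implementation):

```python
def expand_clusters(graph, clusters):
    """Step 3: add to each crisp cluster every outside node that has at
    least two links into it. A node may join several clusters, which turns
    the crisp clustering into a soft (overlapping) one."""
    expanded = []
    for cluster in clusters:
        extra = set()
        for node in graph:
            if node in cluster:
                continue
            if len(graph[node] & cluster) >= 2:  # >= 2 links into the cluster
                extra.add(node)
        expanded.append(cluster | extra)
    return expanded

# Toy fragment: "mouse" has two links into each of two crisp clusters.
g = {
    "scsi": {"trackball", "joystick", "mouse"},
    "trackball": {"scsi", "joystick", "mouse"},
    "joystick": {"scsi", "trackball"},
    "pika": {"terrapin", "marmot", "mouse"},
    "terrapin": {"pika", "marmot", "mouse"},
    "marmot": {"pika", "terrapin"},
    "mouse": {"scsi", "trackball", "pika", "terrapin"},
}
crisp = [{"scsi", "trackball", "joystick"}, {"pika", "terrapin", "marmot"}]
soft = expand_clusters(g, crisp)
print(soft)  # "mouse" is added to both clusters
```

The two-link requirement is what prevents "infection": a node attached to a single cluster member never joins, so a mistakenly admitted neighbor cannot pull in its own neighborhood.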

Expansion of the 4,770 clusters resulting from Step 2 with their neighbors (Step 3) turns the initially crisp clustering into a fuzzy clustering which now covers 16,446 words (as opposed to 12,032 before expansion). We try to avoid spurious cluster members by adding to a cluster only nodes which are neighbors of at least two of the cluster members. Table 5.3 lists some of the enriched clusters. Original cluster members (before expansion) are shown in black, cluster neighbors added in Step 3 in gray font. The cluster cores behave similarly to the committees in Pantel and Lin's Clustering by Committee approach (Pantel and Lin, 2002) (cf. Section 9.2.3). Words in the core of a cluster (original cluster members) are very specific and well-defined. The new cluster members, on the other hand, have a wider range of meaning or are even ambiguous. For example, applewood and fruitwood both have a single specific meaning: they are both kinds of wood. Cherry, pine and oak, in contrast, can also refer to a tree, and cherry has yet another meaning, that of a fruit. Words in the core tend to be the less well-known representatives of a semantic category. If we were to list the world religions, we would probably start with the most common ones: islam, buddhism, hinduism, christianity and judaism. Only later in the list, if at all, would we come up with jainism, sikhism and vaisnavism. It seems that, because of their low profile (i.e. high information content), these words are especially well-suited to determine which other words should belong to the category. Since only the core members decide which new words are added to the cluster, even if a word is mistakenly accepted, this won't cause "infection" (Roark and Charniak, 1998): the neighbors of the wrongly admitted word won't be added to the cluster, as they are not linked to the cluster core.
It is also important to note that the cluster extension step doesn't affect the purity of closed semantic categories, like, for example, the cluster composed of the four nucleotide bases (adenine, uracil, guanine, thymine).

scsi trackball joystick adaptor mouse modem interface disk
pika terrapin marmot hamster gerbil donkey dog mouse hedgehog rat sheep cat rabbit
applewood fruitwood cherry ivory pine oak
jainism sikhism vaisnavism islam buddhism hinduism christianity judaism
osmium iridium rhodium ruthenium silver platinum gold
rubella cytomegalovirus poliomyelitis enteritis mumps diphtheria measles tetanus pox disease tuberculosis disorder polio cough
cuboid octahedron icosahedron tetrahedron dodecahedron cone cylinder pyramid cube
adenine uracil guanine thymine
collarbone plexus forearm sternum groin wrist head heart chest neck rib spine nose hand
tartar mongol bedouin yakut tatar evenki turk pole jew russian eskimo greek people
slough maidenhead london reading sheffield
agraphia dyslexia dysgraphia reading
pantaloon lapel anklets sash boot collar button dress jacket scarf shirt
nutritionist dietitian dietician expert physiotherapist doctor representative psychologist
cerise umber ultramarine red green ochre violet blue
brasserie bistro cafe bar bars restaurant
tempera gouache pen oil paint
glycerine lanolin hexane fragrance oil
kerosene gasoline fuel oil
nutmeg tofu tablespoon breadcrumb paprika allspice anise sultana peppercorn cassia tabasco treacle pepper yoghurt egg garlic margarine clove oil almond
alstroemeria gladiolus freesia gypsophila carnation rose dahlia lily chrysanthemum flower daisy
laptop portable adapter notebook workstation desktop pc device
stalactite stalagmite breccia formation cave rock

Table 5.3: Some examples of clusters resulting from the curvature approach. 72 5. Using curvature for ambiguity detection and semantic class acquisition

5.4 Evaluation against WordNet

In this section, we complement the qualitative analysis of the curvature scheme with a more formal quantitative evaluation. In order to assess the validity of the curvature scheme, we need to measure how well a word's cluster affiliations agree with its different meanings. To do this objectively, we would need large amounts of sense-tagged data. However, sense-tagged corpora are both few and small. For this reason, we have chosen to resort to WordNet as the "gold-standard" sense inventory against which we evaluate the performance of the curvature algorithm. We are aware that this evaluation procedure is not ideal, as the word senses covered by the BNC do not map one-to-one to WordNet synsets. Not all sense distinctions made by the creators of WordNet necessarily appear in the corpus. Similarly, the corpus contains word meanings which are not represented in the WordNet lexicon. In particular, corpus and lexicon will differ in terms of sense granularity.3 Nevertheless, we expect an evaluation against WordNet to provide insight into the strengths and limitations of the curvature approach. Naturally, the target senses of non-WordNet words are unknown. For this reason, we perform the evaluation on Wwn, the graph induced by the set of nouns which are known to WordNet. Wwn is a subgraph of the full word graph W and consists of 21,082 nodes (words) and 617,947 edges (relationships). Application of the curvature clustering algorithm to Wwn results in 730 (partly overlapping) clusters which together cover 3,578 words.

5.4.1 Precision and Recall

An ideal clustering tool would assign words to clusters in such a way that, for each word, there is a one-to-one correspondence between its cluster affiliations and its WordNet senses. In the following, we describe an evaluation procedure which quantifies the degree to which the curvature scheme deviates from this optimal clustering, using the criteria of precision and recall traditionally employed in evaluating the performance of search engines. We define the precision and recall of an individual word as follows. The precision of a word is the proportion of retrieved true meanings out of the total number of retrieved meanings. The recall of a word is the ratio of retrieved true meanings to the total number of true meanings:

Precision(w) = (# retrieved true meanings of w) / (# retrieved meanings of w)    (5.1)

Recall(w) = (# retrieved true meanings of w) / (# true meanings of w)    (5.2)

3 WordNet is known to make extremely subtle sense distinctions in certain domains.

In the following, we illustrate the two evaluation measures with some examples. We find different types of correspondences between a word's clusters on the one side and its WordNet synsets on the other. The word mouse shows an ideal match of clusters to WordNet senses. It occurs in exactly two clusters, each of which maps to one of its two WordNet senses (cf. Figure 4.1). Consequently, mouse has both a precision and a recall of 100%. In Section 5.4.2, we give a detailed description of how we determine whether a cluster corresponds to a WordNet synset. For now, we simply assume that we have a tool available which tells us whether a cluster actually represents a true WordNet sense of a word.

Clusters → WordNet senses:
  hedgehog rat cat rabbit donkey dog sheep marmot pika terrapin ... → mouse#1 ("gnawing animal")
  adaptor disk interface modem scsi trackball joystick → mouse#2 ("electronic device")

Figure 5.7: The clusters of mouse map one-to-one to its WordNet senses.

For most words, we observe only a partial match between a word's clusters and its different meanings in WordNet. Consider the word oil (cf. Figure 5.8). Just as in the case of mouse, we succeeded in finding both WordNet meanings of oil, which means oil has maximum recall, too. However, two of the clusters are left unmatched, resulting in a precision of less than 100%.4 More precisely, the curvature scheme retrieves five senses of oil, three of which can be accurately mapped to WordNet senses of oil. Hence, the precision of oil is 3/5. In fact, two of the clusters map to the same, "lipid" sense of oil. Counting both of these clusters as correctly identified senses unfairly raises precision. To penalize duplicate senses, we consider only one of the clusters as correctly identified, and the other as an incorrectly extracted sense of oil.5 Hence, the revised precision of oil is 2/5. A word whose precision is maximal but whose recall is not is recorder, which appears in two clusters, one mapping to its "recording machine" sense, the other to its "musical instrument" sense. However, recorder has two additional senses in WordNet which we failed to discover. We correctly extracted two of the four WordNet senses of recorder, which means a recall of 0.5. To put it simply, the precision of a word is obtained from a diagram such as those

4 The "food" sense of oil is not encoded in WordNet, although arguably it should be. Consequently, the precision of oil is judged lower than it actually is.

5 We adopt this approach from Pantel and Lin (2002).

Clusters → WordNet senses:
  fuel gasoline kerosene → oil#1 ("lipid")
  lanolin glycerine hexane fragrance → oil#1 ("lipid")
  tempera gouache paint pen → oil#2 ("oil color")
  marjoram oregano garlic lavender basil thyme salt parsley ... → (no matching sense)
  groundnut sorghum fruit banana coffee sugar bean rice yam ... → (no matching sense)

Figure 5.8: The clusters oil appears in and their mappings to WordNet senses. There are two clusters which map to the same WordNet sense lipid. Two other clusters, reflecting the food aspect of oil, cannot be mapped to a WordNet sense of oil.

Clusters → WordNet senses:
  xt telephone pager → recorder#1 ("recording machine")
  drum orchestra trumpet keyboard flute bassoon tuba bass sax ... → recorder#4 ("flute")
  (no cluster) → recorder#2 ("registrar")
  (no cluster) → recorder#3 ("judge")

Figure 5.9: The clusters recorder appears in and their mappings to WordNet senses. Curvature clustering failed to discover the registrar and the judge sense of recorder.

shown in Figures 5.7, 5.8 and 5.9 by counting the number of different colors (not counting white as a color) and dividing by the total number of clusters (the number of boxes on the left side of the diagram). Similarly, the recall of a word is the number of different colors divided by the total number of its WordNet senses (the number of boxes on the right side). We define the precision and recall of a clustering scheme simply as the average precision and recall over all words:

Precision = (1/#w) · Σ_w Precision(w)    (5.3)

Recall = (1/#w) · Σ_w Recall(w)    (5.4)

A single measure of the overall efficiency of a system is the F-score, which is the harmonic mean of precision and recall:

F = (2 · Precision · Recall) / (Precision + Recall).    (5.5)

It assumes high values when both precision and recall are high. When precision and recall are close to being equal, the F-score is near the arithmetic mean of the two. The bigger the imbalance between the two, the more F deviates from the arithmetic mean, penalizing systems which have high precision at the expense of low recall, or high recall at the expense of low precision.
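Once each of a word's clusters has either been mapped to a WordNet sense or left unmatched, the measures above are straightforward to compute. The following sketch (illustrative helper names, not the thesis implementation) also encodes the duplicate-sense penalty described for oil:

```python
# Sketch of the word-level evaluation measures (Formulas 5.3-5.5),
# including the penalty for clusters mapping to the same sense.

def word_precision_recall(cluster_senses, n_wordnet_senses):
    """cluster_senses holds one entry per cluster the word occurs in:
    the WordNet sense the cluster was mapped to, or None if unmatched."""
    matched = [s for s in cluster_senses if s is not None]
    distinct = set(matched)  # a duplicate sense counts only once
    precision = len(distinct) / len(cluster_senses)
    recall = len(distinct) / n_wordnet_senses
    return precision, recall

def clustering_scores(words):
    """words: list of (cluster_senses, n_wordnet_senses) pairs."""
    ps, rs = zip(*(word_precision_recall(c, n) for c, n in words))
    precision, recall = sum(ps) / len(ps), sum(rs) / len(rs)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# oil (Figure 5.8): five clusters, two mapped to the same "lipid" sense,
# one to the "oil color" sense, two unmatched -> precision 2/5, recall 1.
p, r = word_precision_recall(["lipid", "lipid", "oil color", None, None], 2)
```

For recorder (Figure 5.9), `word_precision_recall(["machine", "instrument"], 4)` gives precision 1 and recall 0.5, matching the discussion above.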

5.4.2 Mapping clusters to WordNet senses

We employ the methodology of Pantel and Lin (2002) to check whether a cluster in which a word participates does indeed reflect a true sense, that is, a WordNet sense of the word. WordNet provides a set of candidate senses for each word w, which we will denote by S_w. Similarly, we use C_w to denote the set of clusters of which w is a member. We map each of the clusters c ∈ C_w to the most similar candidate sense s ∈ S_w, where the similarity between a cluster c and a WordNet sense s is measured as the average similarity with s over the fellow cluster members w̃ of w in c:

sim(c, s) = (1/(|c| − 1)) · Σ_{w̃ ∈ c, w̃ ≠ w} sim(w̃, s).

The similarity between a word w̃ and a word sense s, in turn, is defined as the maximum similarity between s and a sense s̃ of w̃:

sim(w̃, s) = max_{s̃ ∈ S(w̃)} sim(s̃, s).

Finally, the similarity between two word senses (i.e. WordNet synsets) is computed as proposed by Lin (1998) and implemented by Pedersen et al. (2004) in the WordNet::Similarity package:

sim(s̃, s) = (2 · log P(lcs)) / (log P(s̃) + log P(s)),    (5.6)

where lcs is the lowest common subsumer of s̃ and s in the WordNet hypernymy tree. The (empirical) probability P(s) of a synset is defined as the relative frequency of all its descendant synsets in a reference corpus, the BNC in our case. The intuition behind this definition of similarity is rooted in information theory: the higher the probability of encountering a concept (synset) in the corpus, the lower its information content. Synsets whose lowest common subsumer is very informative (infrequent) are deemed more similar than synsets which share only a highly frequent and hence uninformative superordinate term.

As an example of mapping a cluster to a WordNet sense, let us consider the word beauty, which occurs in the cluster {beauty, pleasantness, tranquility}. Beauty has the following senses in WordNet:

• beauty#1: The qualities that give pleasure to the senses.

• beauty#2: A very attractive or seductive looking woman.

• beauty#3: An outstanding example of its kind.

To determine which of these senses to map the cluster to, we compute the similarities of each of the senses of beauty with each of the possible WordNet senses of tranquility and pleasantness. These are listed in Table 5.4. The similarity between a sense of beauty and the word tranquility (resp. pleasantness) is given by the maximum of the two similarity values in the corresponding table cell, e.g.

sim(tranquility, beauty#1) = max_{i=1,2} sim(tranquility#i, beauty#1) = max(0.327, 0) = 0.327.

The bottom row of the table contains the similarities sim(c, beauty#i), i = 1, 2, 3, between the cluster and the different senses of beauty, which are obtained by averaging over the maximum scores of a column's cells, e.g.

sim(c, beauty#1) = 0.5 · (sim(pleasantness#2, beauty#1) + sim(tranquility#1, beauty#1)) = 0.5 · (0.384 + 0.327) = 0.356.

The sense with the highest similarity to the cluster is beauty#1 (0.356), and it is thus the sense to which the cluster is mapped. Mappings c ↦ s for which sim(c, s) ≥ θ are said to correspond to correctly inferred word senses. Naturally, the higher the similarity threshold θ, the more clusters will be rejected as wrongly identified word senses.
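The whole mapping procedure can be sketched as follows. The helper and the similarity dictionary below are illustrative only (in the thesis, the pairwise similarities come from the Lin measure via WordNet::Similarity); the numbers are those of Table 5.4:

```python
# Hedged sketch of the cluster-to-sense mapping of Pantel and Lin (2002)
# as described above. sense_sim[(w_tilde, s)] holds the value
# max over senses of w_tilde of sim(sense of w_tilde, s).

def map_cluster_to_sense(word, cluster, senses, sense_sim, theta):
    others = [w for w in cluster if w != word]
    best_sense, best_score = None, -1.0
    for s in senses:
        # average similarity of the fellow cluster members with sense s
        score = sum(sense_sim[(w, s)] for w in others) / len(others)
        if score > best_score:
            best_sense, best_score = s, score
    # reject the mapping if it does not pass the similarity threshold
    return (best_sense, best_score) if best_score >= theta else (None, best_score)

# Word-to-sense similarities from Table 5.4.
sims = {("pleasantness", "beauty#1"): 0.384, ("pleasantness", "beauty#2"): 0.0,
        ("pleasantness", "beauty#3"): 0.231,
        ("tranquility", "beauty#1"): 0.327, ("tranquility", "beauty#2"): 0.0,
        ("tranquility", "beauty#3"): 0.280}
sense, score = map_cluster_to_sense(
    "beauty", ["beauty", "pleasantness", "tranquility"],
    ["beauty#1", "beauty#2", "beauty#3"], sims, theta=0.25)
# -> maps the cluster to "beauty#1" with score ~0.356, accepted at theta = 0.25
```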

5.4.3 Results and Conclusions

Figure 5.10 shows precision, recall and F-score of the curvature scheme using different similarity thresholds θ. Naturally, as θ increases, all three measures decrease, since more and more clusters fail to pass the “validity” test. Not surprisingly, recall is lower than

                 beauty#1   beauty#2   beauty#3
pleasantness#1   0          0          0.231
pleasantness#2   0.384      0          0
tranquility#1    0.327      0          0
tranquility#2    0          0          0.280
sim(c, ·)        0.356      0          0.256

Table 5.4: Similarities between the different senses of beauty and those of pleasantness, resp. tranquility as measured by Formula 5.6.


Figure 5.10: Precision, recall and F-score using different similarity thresholds θ.

precision. In any event, we did not expect all the many fine-grained sense distinctions used in WordNet to show up in the corpus.

Imposing a similarity threshold of θ = 0.25, we obtain a precision of 61.1%, which is slightly higher than the precision of 60.8% reported by Pantel and Lin (2002) for their word sense discovery system and the same θ-value. Pantel and Lin achieve a higher recall than we do (50.8 vs. 48.4). However, the numbers are not directly comparable, as the sets of target senses used in computing recall differ. While we take all of a word's WordNet senses as the “true” senses to be discovered, Pantel and Lin use only a subset of these as the set of target senses to be retrieved. Their set of target senses consists of the collective “true” results (correctly identified WordNet senses) of several clustering algorithms. Hence, the lower recall value observed for the curvature scheme may simply reflect the stricter approach we follow in measuring the recall of our system.

It has to be noted that the fairly high performance of curvature clustering comes at the expense of low coverage. The curvature scheme covers only a relatively small number of nodes (3,578 altogether), since many of the more general and ambiguous nouns do not pass the curvature threshold or do not attach directly to a cluster. Consequently, they are left uncategorized.

Tables 5.5 and 5.6 break down the number of correctly inferred word senses (using θ = 0.25) by the number of clusters a word belongs to, respectively by a word's WordNet polysemy count. Each cell (i, j) of Table 5.5 contains the number of words which occur in j clusters, i of which correspond to true and distinct WordNet senses. Words falling into cell (i, j) therefore have a precision of i/j.
Similarly, each cell (i, j) of Table 5.6 contains the number of words which have j senses in WordNet, i of which the curvature scheme was able to identify; the recall of a word falling into cell (i, j) of Table 5.6 is thus i/j.

# discovered                      # clusters
  senses      1    2   3   4   5   6   7   8   9  10  14   Total
    0      1152   58  13   4   5   2   0   0   0   0   0    1234
    1      2047  154  53  24   6   3   1   1   0   0   0    2289
    2         -   20   7  10   3   4   1   1   0   1   1      48
    3         -    -   1   1   2   1   0   1   1   0   0       7
  Total    3199  232  74  39  16  10   2   3   1   1   1    3578

Table 5.5: Frequency table of the number of correctly identified word senses broken down by the number of clusters a word occurs in.

The words covered by the curvature scheme are the high-curvature ones, the majority of which have only one sense in WordNet, which is reproduced quite reliably. We expect the set of high-curvature words to contain many “novel” words, since most newly coined words relate to very specific domains and are introduced to name, for example, newly discovered phenomena or new technologies. The purpose of the following sections is to estimate the potential of the curvature scheme to learn the meanings of novel words.

# discovered                  # WN senses
  senses      1     2    3    4   5   6   7   8   9  10
    0       715   226  110   58  39  22  20  17  12   5
    1      1355   474  188   93  49  41  35   8   9  15
    2         -    11    4   10   9   6   3   1   1   0
    3         -     -    0    0   0   1   2   1   0   3
  Total    2070   711  302  161  97  70  60  27  22  23

# discovered                  # WN senses
  senses     11   12   13   14   15   16   17   18   29   32   Total
    0         1    3    3    0    0    2    1    0    0    0    1234
    1         9    1    3    3    2    1    0    1    1    1    2289
    2         0    2    0    0    1    0    0    0    0    0      48
    3         0    0    0    0    0    0    0    0    0    0       7
  Total      10    6    6    3    3    3    1    1    1    1    3578

Table 5.6: Frequency table of the number of correctly identified word senses broken down by the number of a word’s WordNet senses.

5.5 Uncovering the meanings of novel words

We believe that curvature clustering is particularly well suited to learning the meanings of previously unseen words. These abound in specialized vocabularies, such as those of medicine or technology, which grow extremely quickly. Medical neologisms, for instance, include names of newly emerging diseases, novel treatments or medications. Most of these newly coined terms are unambiguous and have a very concise meaning. They are found among the high-curvature words in the word graph, which are well covered by the curvature clustering.

In this section, we investigate how well curvature clustering predicts the meanings of novel words. We follow the evaluation procedure proposed by Ciaramita and Johnson (2003), who trained a noun classifier on an earlier version of WordNet which was then tested on a later release. The curvature clustering contains 242 nouns which appear in WordNet-1.7.1 but not yet in the earlier WordNet-1.6 release. These constitute our test set of “novel” words. We find that the curvature clustering algorithm assigns each of these novel nouns to exactly one cluster. In fact, the majority of these words have exactly one sense in WordNet-1.7.1. Table 5.7 shows the frequency distribution of the number of WordNet-1.7.1 senses of these test words. The most polysemous test word according to WordNet-1.7.1 is “Robinson”, which is listed with 7 senses, each of which refers to a different personality. Table 5.8 shows some of the test words and the cluster they were assigned to. Non-core cluster members (words which were not covered by the initial hard clustering, but joined the cluster in the final soft clustering step) are shown in a lighter gray font.

# of WN-1.7.1 senses     1    2   3   7
# of test words        221   18   2   1

Table 5.7: Distribution of the test words’ polysemy counts in WordNet-1.7.1.

Test word           Cluster
vldl                hdl ldl triglyceride cholesterol lipoprotein
nephropathy         neutropenia splenomegaly thrombocytopenia failure function anaemia jaundice
gastroenterologist  radiologist oculist doctor surgeon
bruegel             durer breughel holbein veronese rembrandt rubens titian
dysgraphia          dyslexia agraphia spelling reading
coursework          doctorate dissertation assignment assessment work essay examination exam
etruscan            phoenician scythian gaul greek
trackball           joystick scsi adaptor disk modem mouse interface
carolina            arkansas wyoming kentucky oklahoma california north mexico georgia colorado virginia texas pennsylvania louisiana kansas tennessee missouri nebraska
tchaikovsky         berlioz dvorak mendelssohn symphony mozart elgar beethoven wagner
pledgee             grantee mortgagee mortgagor assignee lessee purchaser lessor court landlord interest surety father tenant person society buyer
bikers              camper backpacker cyclist walker
kandinsky           grosz kirchner klee schmidt marc corinth
biosphere           hydrosphere earth lithosphere atmosphere
kosovo              montenegro slovenia karelia kirgizia estonia uzbekistan byelorussia belorussia slovakia kirghizia byelarus moldavia rumania kirgizstan tajikistan dalmatia turkmenia adriatic moldova turkmenistan ...
halon               chlorofluorocarbon cfc dioxide chemical
rompers             slacks pullover sneaker underpants jodhpur fatigues slipover anorak trousers shirt pair clothes red sweater waistcoat jean boot trainer balaclava dress jacket marks shorts skirt knickers hair sock cardigan jumpers shoe slipper blouse
tagliatelle         kohlrabi courgette pasties broccoli flan radish quiche omelette caramel kale nob macaroon fruitcake cauliflower meatball noodle dish sausage turkey lettuce pepper green dessert sprout pea ...
hypnotherapy        autosuggestion psychotherapy meditation therapy counselling analysis
sikhism             vaisnavism jainism hinduism buddhism
physiotherapist     dietitian dietician nutritionist psychologist expert doctor representative

Table 5.8: Examples of test words and the clusters they appear in.

We now evaluate to what extent the meanings predicted by the curvature clustering algorithm coincide with the senses of the test words in WordNet-1.7.1. We do this by mapping each of the test words into the WordNet-1.7.1 noun hierarchy using its fellow cluster affiliates and the class labeling algorithm described in Widdows (2003). The class labeling algorithm returns the WordNet synset which best subsumes the test word's fellow cluster members. We chose to give more weight (a factor of 1.5) to core cluster members, as they tend to describe a cluster's meaning more reliably. The label is chosen so that it subsumes as many of the cluster members as closely as possible. The sense of a test word is correctly predicted by the clustering if it is directly subsumed by the class label.

Some parts of the WordNet noun hierarchy go down into extreme and arbitrary detail, and even if the computed sense label is not an immediate hypernym of the test word's true sense, it may still constitute a sensible label. For example, the class labeling algorithm tells us that Heisenberg is a physicist. The physicist synset is two rather than just one level above Heisenberg in the hypernymy tree but nevertheless pinpoints the true meaning of Heisenberg, who belongs to the nuclear physicist synset, quite accurately. In some cases, we have to walk both up and down in the hierarchy to trace a test word to its label. Figure 5.11 shows some of the observed paths between test word (red) and label (blue). The paths have different lengths and consist of different numbers of up and down steps. All four labelings seem quite reasonable, though. The fact that the computed sense label does not subsume the true sense of the test word in the latter two examples is attributable to structural inconsistency in WordNet: according to WordNet, a gastroenterologist is not a medical specialist, but all other kinds of physicians, e.g. neurologists, psychiatrists, rheumatologists, orthopedists, etc., are.
Similarly, it is not immediately obvious why a therapist is not a kind of specialist. In general, however, we expect a good sense label to be separated from the test word by only a few steps in the WordNet hypernymy tree. The farther the label is above the test word, the less telling it is. Down steps, however, are even less desirable. In particular, the path should involve fewer down than up steps, as the semantic category of a word cannot be more specific than the word itself. In summary, good labelings are characterized by short paths which involve few down steps.

Figure 5.12 shows the distribution of the number of up and down steps in the shortest paths between test word and label. Each cell (i, j) in the grid contains the number of test words which are separated from their label by i up and j down steps. The proportion of test words falling into a cell is also indicated by the degree of gray shading. Cell (1, 0) contains the test words which are directly subsumed by the label they were assigned. For these 87 test words (36%), the meaning discovered by the curvature clustering coincides with a sense listed in WordNet (this constitutes the optimal case). Another 25 test words (10%) are subsumed at a distance of 2 by their label, which means the discovered sense is one WordNet level more general than the “true” WordNet sense. There is one instance where the assigned label and the test word are the same WordNet synset (cell (0, 0)). The test word in this case is bridges, which, in its singular form, was already present in WordNet-1.6, but which entered WordNet-1.7.1 as also being the name of Harry Bridges,

Figure 5.11: Some examples of paths linking a test word to its assigned label in the WordNet noun hierarchy.

a United States labor leader. The curvature clustering assigns bridges to the same cluster as footbridge, underpass, flyover, wall and line, a cluster which also contains the singular form bridge, and from which it infers the correct sense of the common noun bridges rather than the meaning associated with the proper noun Bridges.

As the shading shows, the test words are concentrated largely in the lower-left cells, which correspond to short paths. Furthermore, there is only a relatively small number of test words for which the assigned sense label is more specific than the test word. These are found in the cells which lie above the stepped line. We manually inspected the clusters and their labels and found that, in most of the cases of far-apart word-label pairs, we did in fact discover a correct sense of the test word. Many of the discrepancies between word and label can be attributed to structural inconsistencies and missing senses in the WordNet lexicon. More precisely, we noticed the following causes of long test-word-label paths:

1. The sense discovered by the curvature clustering does not appear in WordNet, but is nevertheless a valid sense of the test word.

2. WordNet is missing some of the senses of the other cluster members, resulting in an inappropriate cluster label.

3. The long path is due to extremely subtle sense distinctions in WordNet. This is particularly the case with the vocabularies of medicine and the natural sciences which have an exceedingly fine granularity level in WordNet.

4. The sense distinctions made in WordNet are inconsistent.

Some miscategorizations, however, are due to mistakes on our side, caused by

5. Incorrect treatment of compound nouns during data preparation.

6. Failure to recognize a correct sense of the word.


Figure 5.12: Frequency histogram of the number of up and down steps along the shortest path from a test word to its label.
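The up/down step counts behind Figure 5.12 can be obtained by walking from both synsets up to their lowest common ancestor in the hypernymy tree: the number of up steps is the test word's distance to that ancestor, the number of down steps the label's. A minimal sketch over a toy child-to-parent map (the entries below are illustrative, not actual WordNet structure):

```python
# Hedged sketch: count up and down steps between a test word and its
# label in a hypernymy tree given as a child -> parent dictionary.

def ancestors(node, parent):
    """The chain from a node up to the root, node included."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def up_down_steps(word, label, parent):
    word_chain = ancestors(word, parent)
    label_chain = ancestors(label, parent)
    for i, a in enumerate(word_chain):  # first shared ancestor = LCA
        if a in label_chain:
            return i, label_chain.index(a)
    raise ValueError("no common ancestor")

# Toy fragment inspired by the Heisenberg example in the text:
parent = {"heisenberg": "nuclear physicist",
          "nuclear physicist": "physicist",
          "physicist": "scientist"}
up, down = up_down_steps("heisenberg", "physicist", parent)  # -> (2, 0)
```

In a tree, the path through the lowest common ancestor is the shortest one, so this simple two-chain walk suffices.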

Type 1: Examples of correctly inferred word senses which are not captured in WordNet are the “travelers” sense of bikers, the “device” sense of traverser, the “town” sense of Harlow and the “author” sense of Spearpoint. Each of these test words is shown in Table 5.9 together with its fellow cluster members, the cluster label and its WordNet sense.

Test word    Fellow cluster members             Cluster label    WN sense label
bikers       camper backpacker cyclist walker   traveler         youth subculture
traverser    capstan crane                      lifting device   mover
harlow       harrisburg haifa                   port             actress
spearpoint   lyricist singer                    author           point

Table 5.9: Examples of correctly learned word senses absent in WordNet.

The WordNet gloss for the noun bikers is

Originally a British youth subculture that evolved out of the teddy boys in the 1960s; wore black leather jackets and jeans and boots; had greased hair and rode motorcycles and listened to rock’n’roll; were largely unskilled manual laborers.

WordNet is missing the much more general and very common sense of biker, which is simply a person riding a motorcycle. The cluster label traveler comes quite close to this meaning of bikers. Similarly, traverser is encoded in WordNet as being a person (“someone who moves or passes across”). However, traverser also has a technical meaning, referring to a transfer table, a laterally movable piece of track which is used to move a train from one track to another. Naturally, WordNet cannot have full coverage of named entities. The clustering reveals that harlow is not only the name of a famous actress but also the name of a town in Essex, England, and that, besides being literally interpretable, spearpoint also refers to the songwriter of the same name.

Type 2: Some clusters closely correspond to a correct sense of the test word. This sense, which is shared with the other cluster members, is also a WordNet sense of the test word, but it is missing from some of the other cluster members' WordNet sense profiles. As a result, the class labeling algorithm produces an inadequate cluster label. An example of a test word which was wrongly labeled for this reason is the noun bulimia. It occurs in a cluster together with the other core member overeating and the non-core members anorexia and disorder. Overeating, being a core member, has a bigger influence on the cluster label than the other two cluster members. WordNet lists only a single sense for overeating, namely that of “mortal sin”. It does not include the “eating disorder” sense of overeating, and as a consequence, bulimia receives the incorrect label mortal sin.

We observe the same type of mislabeling for the noun palau. It occurs in the same cluster as yap and manus, which belong to the core of the cluster, and island, which is a non-core member. WordNet is ignorant of the “island” sense of both yap and manus. But, as it happens, the WordNet senses of yap (“mouth”) and manus (“hand”) are very closely related, both being body parts. For this reason, the labeling algorithm infers that palau is a body part too, where in fact it is a group of islands in the Pacific Ocean.

Type 3: For some of the test words, the discovered sense and the WordNet sense are very close in meaning but nevertheless distant in WordNet. The granularity of sense distinctions varies widely and somewhat arbitrarily across topic areas. Whereas certain branches of the hypernymy tree make rather coarse-grained sense distinctions, others go into great detail. The length of a path separating two WordNet synsets therefore reflects not only their semantic relatedness but also, to some extent, the level of detail of a domain. This variation in sense granularity often causes the shortest path connecting a test word and its label to be extremely long and is the reason why many of the discovered senses are given low confidence. The test word camcorder, for example, has the fellow cluster members vcr, tape, player and machine, from which the class labeling algorithm computes the label “tape recorder”. The WordNet sense of camcorder is “television camera”, which is seven links (4 steps up and 3 steps down) away from the computed sense label “tape recorder”. In comparison, christmas tree and highway are only five links (3 steps up, 2 down) apart from each other (cf. Figure 5.13).

Figure 5.13: Camcorder and tape recorder are more distant in WordNet than highway and christmas tree.

Type 4: Many of the long paths between a test word and its label are due to inconsistencies in the way WordNet assigns senses to words. The noun scythian, for example, forms a cluster together with other ethnic groups, phoenician, etruscan, gaul and greek, and the cluster receives the label “European”. According to WordNet, a Scythian is a “nomad”, who in turn is a “wanderer”, who is a “traveler”, and so on. There is no sense of Scythian in WordNet which specifies the geographical origin of these people. This is, however, the case for other ethnic groups: according to WordNet, a Bedouin is an “Arab”, a Tartar is a “Mongol”, and a Maya is a “North American Indian”. It seems that WordNet, in order to be consistent, should include a “Eurasian” sense for Scythian, which is close to the computed sense label “European”.

In many cases, the name of an ethnic group is also the name of their language. WordNet is very inconsistent with respect to this regular People/Language polysemy. For example, WordNet lists the “language” sense of Thracian and Illyrian, but is missing their “person” sense. As a result, Dardanian, which forms a cluster together with Thracian and Illyrian, receives the false label “Indo-European language”, where in fact Dardanian refers only to a person (“Trojan: a native of ancient Troy”).

Type 5: In building the word graph, we decided to strip off a noun's adjectival and nominal modifiers to increase the graph's community structure (degree of clustering). However, this preprocessing step treats a compound noun as if it were a noun phrase consisting of the head preceded by a set of nominal modifiers. Many compound nouns which do not have a head, such as South Carolina or Upper Volta, mistakenly become carolina and volta, respectively. The meanings of these corrupted compounds are very different

from the original ones. Carolina, for example, forms a cluster together with

arkansas wyoming kentucky oklahoma california north mexico georgia colorado virginia texas pennsylvania louisiana kansas tennessee missouri nebraska

The labeling algorithm assigns this cluster the label American State. Whereas North and South Carolina are indeed American States, Carolina isn’t. Its single WordNet sense is “geographical region” which is not only quite distant from but even more general than the computed label “American State”. Similarly, volta appears in a cluster composed of many, mostly African, countries:

togo gambia cameroun qatar senegal antilles nauru lesotho berzelius niger malawi tahiti caliphate burundi tanganyika myanmar tobago cameroon micronesia benin kiribati surinam guyana gabon vanuatu bhutan congo gatt botswana emirate maldives mauritania tuvalu ...

Naturally, this cluster receives the label African country. This accurately describes the meaning of Upper Volta (the former name of Burkina Faso). Volta, on the other hand, has only one meaning in WordNet which refers to the physicist Alessandro Volta and which is very different from our discovered meaning of African country.

Type 6: In some cases, the sense discovered simply isn’t a valid sense of the test word. This happens, for example, with the test word gorbachev which is mistakenly assigned to a cluster of countries. Another example of a wrongly labeled test word is hrt (hormone replacement therapy) which occurs in a cluster with topically rather than paradigmatically related words:

Test word   Fellow cluster members                                  Cluster label      WN sense label
gorbachev   montenegro slovenia karelia kirgizia estonia            European country   statesman
            uzbekistan byelorussia belorussia slovakia
            kirghizia byelarus ...
hrt         puberty menarche menopause pregnancy time age           time period        therapy

Chapter 6

From clustering words to clustering word associations

6.1 Intuition

Until now, we have tried to extract semantic classes in the word graph by clustering the nodes representing words into groups of similar words based on shared links. As we have seen, many of the nodes are an amalgamation of multiple meanings. The links, which correspond to pairs of related words, on the other hand, involve much more contextual information than the nodes, which makes them far less ambiguous. In this chapter, we investigate an alternative approach to inducing word meaning from the word graph which treats the links (word relationships) rather than the nodes (isolated words) as the objects to be clustered. Since most of the links address a particular sense of the words they connect (cf. Yarowsky (1993)'s one-sense-per-collocation principle), we naturally avoid having to deal with fuzzy cluster boundaries caused by ambiguity. At the same time, a word can be present in more than one cluster, as its relationships can be distributed across different clusters. This is very much in the style of Schütze (1998), who induces word senses by grouping the occurrences of a word into clusters which can be seen as representing its senses. Similarly, we will interpret the different clusters in which a word's relationships occur as characterizing the different meanings of the word.

6.2 Link graph transformation

To shift the focus from nodes to links, we transform the links of the original word graph W into nodes of a new graph L(W). This allows us to reduce the problem of clustering the edges of W to the conjugate (and already familiar) problem of clustering the nodes of the new graph L(W). The conjugate graph L(W) is constructed by

1. Creating a vertex v_e for each edge e in the original graph W.

2. Putting a link between two (new) vertices v_e and v_f of L(W) if the corresponding edges e and f belong to a common triangle in W.

We will refer to this new transformed graph L(W) as the link graph associated with W. The link graph transformation is illustrated in Figure 6.1.
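The two construction rules can be sketched directly in code. The helper below is illustrative only (not the thesis software); it enumerates triangles naively, which suffices for small graphs:

```python
from itertools import combinations

def link_graph(edges):
    """Rule 1: one node of L(W) per edge of W.
    Rule 2: join two nodes iff their edges share a triangle in W."""
    E = {frozenset(e) for e in edges}
    adj = {e: set() for e in E}  # the nodes of L(W)
    for u, v, w in combinations(sorted(set().union(*E)), 3):
        tri = [frozenset((u, v)), frozenset((v, w)), frozenset((u, w))]
        if all(t in E for t in tri):          # {u, v, w} is a triangle in W
            for a, b in combinations(tri, 2):  # it contributes three links
                adj[a].add(b)
                adj[b].add(a)
    return adj

# Two triangles meeting in "squash" (cf. Figure 6.2): the "vegetable" and
# "sport" contexts remain disconnected in L(W).
W = [("squash", "pumpkin"), ("squash", "zucchini"), ("pumpkin", "zucchini"),
     ("squash", "tennis"), ("squash", "badminton"), ("tennis", "badminton")]
L = link_graph(W)
```

Note that each triangle of W yields exactly three links of L(W), in line with the counting argument given later in Section 6.3.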

Figure 6.1: From W to L(W): the original graph, the new nodes v_e, the new links, and the resulting graph L(W).

The idea behind joining the newly created vertices according to the rather restrictive Rule 2 is that we only want to join semantically compatible word associations. Two links in W which co-occur in a triangle correspond to word associations which not only share a word but, due to transitivity, share the same sense of that word. Similarly, two adjacent links which do not form sides of the same triangle are likely to refer to different senses of the word (node) they have in common. This idea is illustrated in Figure 6.2, which shows two triangles in the word graph which meet at the node representing the word form squash. Links incident to the squash node which appear in the same triangle (i.e. have the same color) refer to the same sense of squash: the red links address the “vegetable” sense, the green ones the “sport” sense of squash. In contrast, links which belong to different triangles (i.e. are differently colored) refer to different senses of squash.

Figures 6.3 and 6.4 show the neighborhood of the ambiguous word organ before and after application of the link graph transformation. In the original word graph, the different meanings of organ are not immediately identifiable, since some nodes in the graph (movement and body in addition to organ) are highly ambiguous and cause fuzzy sense boundaries. The nodes in the transformed graph, on the other hand, have a sharp-cut


meaning and refer to either the “body part” or the “musical instrument” sense of organ. Transition to the link graph naturally divides the initially fuzzy neighborhood of organ into semantically consistent pieces corresponding to its different senses.

Figure 6.2: Two triangles involving the word form squash. Transitivity of meaning holds only within, not across, triangles.


Figure 6.3: Local word graph around the word form organ based on the original word graph W. Unrelated areas of meaning (body parts, musical instruments) are connected to one another.

6.3 Some properties of the link graph

The number of nodes in the transformed graph L(W) is obviously equal to the number of links in W. This means that the link-transformed graph is considerably larger than the original graph. Each triangle in W gives rise to three unique links in L(W) (cf. Figure 6.1). The number of links in L(W) is thus three times the number of triangles in W.

Figure 6.4: Organ's associated link graph. Contexts belonging to the "body part" and "musical instrument" sense are neatly separated.

To get an idea of how triangles are arranged in the link graph, let us examine how a clique Ck (complete graph containing k nodes, cf. Section 2.1) behaves under the link graph transformation. The process is illustrated in Figure 6.5. The k − 1 edges incident to a node u in Ck are transformed into a clique of order k − 1 since any two of them belong to a common triangle. Any edge e = (u, v) in Ck is therefore transformed into a node ve which is a member of two (k − 1)-cliques, one associated with each of its endpoints, and thus has a degree of 2(k − 2).

Let us have a look at Figure 6.6. The neighbors of ve in L(Ck) correspond to those links in Ck which are incident to either u (the red links) or v (the blue links). There are $\binom{k-2}{2}$ links among ve's red neighbors and $\binom{k-2}{2}$ links among its blue neighbors, since both sets of equally colored links are transformed into a clique of order k − 1. In addition, for each red link incident to u, there is exactly one blue link incident to v which has the same endpoint. There are therefore an additional k − 2 links between the red and blue neighbors of ve in L(Ck), and the new node ve in L(Ck) has a curvature of

$$\mathrm{curv}(v_e) = \frac{2\binom{k-2}{2} + (k-2)}{\binom{2(k-2)}{2}} = \frac{(k-2)(k-3) + (k-2)}{(k-2)\,(2(k-2)-1)} = \frac{k-2}{2k-5},$$

which is ≥ (k − 2)/(2k − 4) = 0.5 for k ≥ 3. As k becomes larger, curv(ve) approaches this lower bound. We conclude that the image of a highly clustered area in W also shows a considerable degree of clustering in L(W). It seems therefore that the community structure present in the original graph W is largely preserved under the link graph transformation.

(a) The original graph C5. (b) Edges incident to the same node (dashed) are transformed into a clique of order 4.

(c) Repeat procedure for each node in C5. (d) The transformed graph L(C5).

Figure 6.5: The image of a clique under the link graph transformation. The edges incident to a node are transformed into a clique whose order is one less than that of the original clique.

Figure 6.6: Links incident to the same endpoint of an edge e (i.e. equally colored links) are linked to one another in L(Ck). Similarly, links incident to opposite endpoints of e (i.e. differently colored links) are linked in L(Ck) if they share their other endpoint.
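The derivation above can be checked numerically. The sketch below is an illustration under the triangle-based curvature definition of Section 2.3.2; it builds L(Ck) for a few clique sizes and verifies that the curvature of an edge-node equals (k − 2)/(2k − 5).

```python
from itertools import combinations

def complete_graph(k):
    """The clique C_k as adjacency sets over the nodes 0 .. k-1."""
    return {i: set(range(k)) - {i} for i in range(k)}

def link_graph(adj):
    """L(W): nodes are the edges of W; two edge-nodes are joined iff the
    corresponding edges are two sides of a common triangle."""
    edge = lambda u, v: tuple(sorted((u, v)))
    L = {}
    for u in adj:
        for v, w in combinations(adj[u], 2):
            if w in adj[v]:                              # triangle u-v-w
                sides = [edge(u, v), edge(u, w), edge(v, w)]
                for e, f in combinations(sides, 2):
                    L.setdefault(e, set()).add(f)
                    L.setdefault(f, set()).add(e)
    return L

def curvature(adj, v):
    """curv(v) = (# links among v's neighbors) / (deg(v) choose 2)."""
    nbrs = list(adj[v])
    d = len(nbrs)
    among = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return among / (d * (d - 1) / 2)

for k in (4, 5, 8):
    L = link_graph(complete_graph(k))
    v_e = next(iter(L))                                  # any edge-node; C_k is symmetric
    assert abs(curvature(L, v_e) - (k - 2) / (2 * k - 5)) < 1e-9
```

For k = 5 (the case of Figure 6.5), every edge-node of L(C5) has curvature 3/5 = 0.6, above the lower bound of 0.5.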

6.4 Clustering the link graph

In order to keep the link graph at a manageable size, we start off with the subgraph Wwn of the word graph W which is formed by the nouns known to WordNet. Furthermore, we drop weak links. We consider a link in Wwn to be weak if its log-likelihood weight (cf. Section 3.5) falls below a certain threshold λ. The parameter λ not only regulates the size of the link graph, but also affects its coverage and purity. High values of λ result in lower coverage of words and word meanings, but in higher significance of the links.

We use Wλ to denote the subgraph of Wwn which consists of those links whose log-likelihood weight is greater than λ. Table 6.1 summarizes the basic statistical properties of Wλ and its associated link graph L(Wλ) for different values of λ.

The nodes of the transformed graph are highly contextual units of meaning and thus far less ambiguous than the original nodes. We can therefore use a hard clustering algorithm, such as Markov Clustering (cf. Section 4.2), to divide the set of new nodes into (non-overlapping) classes of similar word relationships. These clusters of relationships can be translated back into clusters of words, each consisting of the words its relationships are composed of. The back transition from word relationships to words naturally yields a soft clustering of the original word graph. We interpret the different clusters a word appears in as representing its different senses.
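The back transition from clusters of word relationships to (soft) clusters of words can be sketched as follows. The clustering step itself (Markov Clustering) is assumed to be given; the example clusters below are hypothetical, loosely modeled on the organ neighborhood of Figure 6.4.

```python
def word_clusters(edge_clusters):
    """Translate a hard clustering of edge-nodes (word pairs) into a soft
    clustering of words: each cluster of pairs becomes the set of words
    occurring in those pairs, so a word may appear in several clusters."""
    return [sorted({w for pair in cluster for w in pair})
            for cluster in edge_clusters]

# Hypothetical Markov Clustering output over the link graph around 'organ':
edge_clusters = [
    [("body", "organ"), ("organ", "tissue"), ("brain", "organ")],   # body-part sense
    [("church", "organ"), ("organ", "piano"), ("music", "organ")],  # instrument sense
]
senses = word_clusters(edge_clusters)
# 'organ' ends up in both word clusters, one per sense
```

Although each edge-node belongs to exactly one cluster, the ambiguous word organ is a member of both resulting word clusters, which is exactly the soft clustering effect described above.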

6.5 Evaluation

We apply Markov Clustering to each of the transformed graphs L(Wλ), with λ ranging from 3 to 12. High values of λ imply lower coverage, but also increased significance of the links. In the following, we evaluate the clusterings associated with different values of the control parameter λ and investigate how λ affects the quality of the clustering.

          Wλ                                  L(Wλ)
 λ    #nodes    #edges    #triangles     #nodes     #edges     #triangles
 3    19,916   289,215       635,941    289,215   1,907,823    4,864,389
 4    19,611   227,829       422,436    227,829   1,267,308    2,708,276
 5    19,182   180,966       297,521    180,966     892,563    1,643,345
 6    18,623   145,383       217,300    145,383     651,900    1,042,900
 7    17,885   118,967       165,407    118,967     496,221      720,323
 8    16,957    98,886       130,379     98,886     391,137      529,919
 9    15,949    83,672       105,484     83,672     316,452      405,356
10    14,870    71,735        87,613     71,735     262,839      323,121
11    13,743    62,202        73,564     62,202     220,692      261,724
12    12,716    54,681        62,780     54,681     188,340      216,392

Table 6.1: The number of nodes, links and triangles before and after the link graph transformation for different log-likelihood thresholds λ.

In order to assess the quality of a clustering we use the same evaluation methodology as for the curvature clustering (cf. Section 5.4). There, a sense of a given word form was considered correctly recognized if the word form occurred in a cluster whose similarity to the sense exceeded a certain threshold θ. The control parameter θ determines the rigor of the evaluation. The following data is based on a similarity threshold of θ = 0.25 (as used by Pantel and Lin (2002)).

Figure 6.7 shows precision and recall (black dots) for the clusterings associated with different log-likelihood thresholds λ. Since the coverage of the clustering changes with λ, we compute a clustering's global recall. That is, we take the set of all WordNet senses of the words contained in the largest graph (which is the one associated with the smallest value of λ, i.e. W3) as the set of word meanings to be discovered.

The precision-recall curve is monotonically decreasing. Raising the log-likelihood threshold λ improves precision but decreases recall, which means that word meanings are extracted more reliably but less completely. Removing more and more low-weight edges reduces not only noise but also removes more and more semantic information. The green curve shows the precision and recall values obtained by applying Markov Clustering directly to the untransformed word graph. Whereas the precision of the baseline is considerably higher, its recall is much lower.

Figure 6.7: The black dots correspond to precision and (global) recall of the clusterings associated with different log-likelihood thresholds λ. The red points are obtained if the clustering is not penalized for multiply discovered senses. The green points correspond to the baseline, which is Markov Clustering of the (untransformed) word graph.

Figure 6.8 shows the corresponding F-scores (see Section 5.4.1 for the definition), which measure the trade-off between precision and recall. The F-measure increases with λ until it reaches its maximum at λ = 7 and then decreases. This means that for λ ≥ 8 the increase in the quality of the links is outweighed by the loss of valuable semantic information. As can be seen, the corresponding baseline F-scores (green dots) are much lower.

Figure 6.8: Strict (black) and relaxed (red) F-measure of the clusterings associated with different log-likelihood thresholds λ.
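The F-score used here is the harmonic mean of precision and recall (cf. Section 5.4.1). A minimal sketch, with illustrative values, shows why a clustering that trades too much recall for precision scores lower than a balanced one:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F-score of Section 5.4.1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values: high precision with very low recall loses to balance.
assert f_measure(0.7, 0.2) < f_measure(0.45, 0.45)
```

The harmonic mean is dominated by the smaller of the two values, which is why raising λ beyond 7 (gaining precision, losing recall) lowers the F-measure.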

The low precision of the clustering (≤ 50%, even for λ = 12) is mainly due to repeatedly discovered senses, of which only one is counted as correctly discovered while all others are considered wrongly detected. Due to our strict rule for connecting the new nodes ve in L(W), the link graph is quite sparse. Many adjacent links which do not belong to a common triangle may still correspond to similar word associations; the third, missing edge which would complete the triangle may simply not have been observed in the text. Due to this sparse connectivity, Markov Clustering divides the link graph into a large number of rather small pieces. This means that a word's relationships are distributed across many different clusters, many of which do in fact represent the same sense.

If we relax the definition of precision (cf. Section 5.4.1) so that each correctly discovered sense (whether discovered before or not) counts as such, then precision increases considerably (red dots in Figure 6.7) and even exceeds the precision of the baseline clustering.

We think that a fine-grained clustering is a lot more useful than a coarse-grained clustering which lumps different senses together. Still, we would like to be able to tell the more meaningful of a word's cluster affiliations from the less informative ones. Moreover, we would like to identify those clusters which represent different senses of a word.

6.6 Pairs of contrasting clusters

The clusters are composed of conjugate nodes ve corresponding to word pairs. A word may participate in more than one conjugate node within a cluster. The more conjugate nodes it participates in, the stronger is its affiliation to the cluster, and the more reliably the cluster characterizes a meaning of the word.

We define a word w's affinity to a cluster c as follows. Suppose the cluster c is composed of n words. We sort these words according to how many conjugate nodes in c they participate in, and we use rank_c(w) to denote the rank of w in this sorted list of words (1 ≤ rank_c(w) ≤ n). The affinity of w to c is given by

$$\mathrm{aff}_c(w) = \frac{n - \mathrm{rank}_c(w)}{n}, \qquad (6.1)$$

which assumes a value between 0 and 1. The last word in the sorted list of cluster members receives an affinity score of 0, the first one a score of (n − 1)/n. This ensures that words appearing early in short lists are not favored too much over words appearing early in long lists.

The word humour, for example, is a member of the two clusters shown in Figure 6.9. Each cluster member is followed by the number of conjugate nodes it participates in (Freq) and its resulting rank in the frequency-sorted list of cluster members (Rank). Whereas humour ranks first in the left cluster, it is among the lowest-ranking members of the right cluster (Rank = 9.5). The two clusters consist of 13 and 11 members, respectively, and Formula 6.1 yields the following cluster affinities:

aff_c593(humour) = (13 − 1)/13 = 0.923
aff_c802(humour) = (11 − 9.5)/11 = 0.136

Cluster 593                      Cluster 802
Word        Freq   Rank          Word          Freq   Rank
humour       10     1            gentleness      8     1
irony         4     3            courtesy        4     3
wit           4     3            sensitivity     4     3
fantasy       4     3            patience        4     3
sarcasm       3     6            meekness        2     6
satire        3     6            grace           2     6
fun           3     6            kindness        2     6
fairytale     2     9            humility        1     9.5
charm         2     9            hospitality     1     9.5
pathos        2     9            humour          1     9.5
pain          1    12            warmth          1     9.5
adventure     1    12
honesty       1    12

Figure 6.9: Two of the clusters the word humour occurs in, sorted by the number of relationships they are involved in (Freq). Whereas humour ranks first in Cluster 593, it ranks low (9.5) in Cluster 802.

The above-defined affinity score may assign high scores to several clusters representing the same, prevalent sense of a word. In the following, we describe a way of identifying pairs of clusters representing different senses of a target word. We call such pairs of clusters contrasting pairs. Given a word w, we measure the contrast between two of w's clusters as

$$\mathrm{contr}_w(c_1, c_2) = \frac{\mathrm{aff}_{c_1}(w) \cdot \mathrm{aff}_{c_2}(w)}{\mathrm{sim}(c_1, c_2)}, \qquad (6.2)$$

where sim(c1, c2) is a measure of the similarity of two clusters. We measure the similarity between two clusters c1 and c2 as the similarity of their vector representations in a vector space model built using Latent Semantic Analysis (cf. Section 9.2.4). Contrast is a trade-off between the dissimilarity of two clusters and their relevance to the target word. Two clusters c1 and c2 which show high contrast are thus good indicators of different senses of w.

As an example, let us consider the word delivery. The table on the left hand side of Table 6.2 lists three of its clusters and its affinity to them computed according to Formula 6.1. The table on the right hand side shows the inter-cluster similarities sim(c1, c2) as well as the resulting cluster contrasts. Using Formula 6.2, the contrast between Clusters 944 and 1033, for example, equates to

$$\mathrm{contr}_{delivery}(c_{944}, c_{1033}) = \frac{\mathrm{aff}_{c_{944}}(delivery) \cdot \mathrm{aff}_{c_{1033}}(delivery)}{\mathrm{sim}(c_{944}, c_{1033})} = \frac{0.64 \cdot 0.90}{0.28} = 2.05.$$

Note that the two least similar clusters (c944 and c1426, which have the lowest similarity of 0.14) are not the most contrasting ones. The affinity of delivery to cluster c1426 is considerably weaker than its affinity to cluster c1033, which indicates that c1426 less reliably describes a correct meaning of delivery. Hence, the contrast measure assigns the pair (c944, c1033) a higher value.

We tested our approach on the set of ambiguous words which were used by Yarowsky (1995) in the evaluation of his word sense disambiguation system. These words and the senses given by Yarowsky are shown in Table 6.3. The results are presented in Table 6.4. Each row corresponds to a contrasting pair of clusters (Cluster 1 vs. Cluster 2). For each of the test words, the three top-scoring (that is, most contrasting) cluster pairs are shown. Only the twelve most frequent members of a cluster are listed.

The two words missing are poach and axes. Poach is a verb and hence not covered by our model. The word axes does not occur in our model due to an oversight during preprocessing. We mistakenly did not account for words which have more than one WordNet base form

No.    Cluster                                                       Affinity
944    pregnancy labour delivery child lactation abuse service         0.64
       birth foetus marriage depression
1033   delivery ordering transfer payment quantity date quality        0.90
       sum service price
1426   purchaser auctioneer seller price delivery vendor investor      0.39
       manufacturer supplier

Cluster similarity / contrast:

           c1033           c1426
c944       0.28 / 2.05     0.14 / 1.74
c1033                      0.55 / 0.64

Table 6.2: Computing the contrasts between three of the clusters the word delivery appears in. The left table contains the clusters and affinity scores. The right table lists the cluster-cluster similarities together with the resulting contrasts (similarity / contrast). The contrast between two clusters is a trade-off between high word-to-cluster affinities and low cluster-cluster similarity.
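Formulas 6.1 and 6.2 can be sketched as follows. This is an illustration only: `affinities` assigns average ranks to ties (which produces ranks such as 9.5), and the frequencies reproduce Cluster 802 of Figure 6.9.

```python
def affinities(freqs):
    """aff_c(w) = (n - rank_c(w)) / n, with average ranks for ties (Formula 6.1).

    `freqs` maps each cluster member to the number of conjugate nodes
    (word pairs) it participates in within the cluster.
    """
    n = len(freqs)
    members = sorted(freqs, key=freqs.get, reverse=True)
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and freqs[members[j]] == freqs[members[i]]:
            j += 1                      # tie block occupies positions i+1 .. j
        avg = (i + 1 + j) / 2           # average rank of the tie block
        for w in members[i:j]:
            rank[w] = avg
        i = j
    return {w: (n - rank[w]) / n for w in freqs}

def contrast(aff1, aff2, sim):
    """contr_w(c1, c2) = aff_c1(w) * aff_c2(w) / sim(c1, c2) (Formula 6.2)."""
    return aff1 * aff2 / sim

# Cluster 802 from Figure 6.9:
c802 = {"gentleness": 8, "courtesy": 4, "sensitivity": 4, "patience": 4,
        "meekness": 2, "grace": 2, "kindness": 2, "humility": 1,
        "hospitality": 1, "humour": 1, "warmth": 1}
aff = affinities(c802)
```

With these frequencies, humour receives the average rank 9.5 of the four-way tie and hence the affinity (11 − 9.5)/11 ≈ 0.136, matching the worked example; likewise, contrast(0.64, 0.90, 0.28) reproduces the delivery value of about 2.05.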

Word      Sense 1    Sense 2
bass      fish       music
tank      vehicle    container
space     volume     outer
crane     bird       machine
duty      tax        obligation
sake      benefit    drink
drug      medicine   narcotic
palm      tree       hand
motion    legal      physical
plant     living     factory
axes      grid       tools
poach     steal      boil

Table 6.3: Test words used by Yarowsky.

Test word: Cluster 1 | Cluster 2

bass:
  ray codling flounder whiting pout dogfish dab eel bass conger pollack cod | bass guitar amplifier deck tuner player keyboard equaliser drummer rhythm combo headphone
  ray codling flounder whiting pout dogfish dab eel bass conger pollack cod | treble bass boost clef percussion tone
  ray codling flounder whiting pout dogfish dab eel bass conger pollack cod | soprano tenor alto bass mezzo cello baritone sax contralto conductor

tank:
  howitzer tank machine weapon mortar soldier carrier artillery gun | reservoir wells bore tank pipes user tunnel watercourse
  tank transporter carrier artillery troop time woman parent industry equipment | fish hood tank men heater cabinet sink
  howitzer tank machine weapon mortar soldier carrier artillery gun | oxygen helium hydrogen nitrogen methane tank monoxide sulfur dioxide cylinder chlorine deuterium

space:
  space facility comfort privacy enjoyment pleasantness warehouse resource seclusion service heating peace | space void gap form vacuum street chance air light sunlight
  space facility comfort privacy enjoyment pleasantness warehouse resource seclusion service heating peace | space comparison shape marker weight
  space facility comfort privacy enjoyment pleasantness warehouse resource seclusion service heating peace | world universe mankind man einstein physicist space scotsman language dimension mystic life

crane:
  crane truck digger excavator skid generator shed wheel trailer conveyor driver lorry | heron grebe curlew swan flamingo crane plover spoonbill wildfowl pelican duck waterfowl
  crane truck digger excavator skid generator shed wheel trailer conveyor driver lorry | swift shelley schopenhauer kafka schiller rousseau conrad johnson donizetti tasso crane barrie
  heron grebe curlew swan flamingo crane plover spoonbill wildfowl pelican duck waterfowl | capstan crane winch traverser

duty:
  piety patriotism loyalty duty love value authority cruelty kindness squalor obedience bigotry | tax taxis duty fee burden interest levy charge contribution cost benefit expense
  piety patriotism loyalty duty love value authority cruelty kindness squalor obedience bigotry | damage injunction account duty breach trespass plaintiff rule development congestion tort court
  piety patriotism loyalty duty love value authority cruelty kindness squalor obedience bigotry | custom excise vat immigration paye duty revenue department palette contribution commissioner squad

sake:
  fate jesus sake | competitiveness growth profits materialism success profitability acquisitiveness sake productivity investment
  honour glory flop challenge respectability sake pride excitement shame | competitiveness growth profits materialism success profitability acquisitiveness sake productivity investment
  fate jesus sake | honour glory flop challenge respectability sake pride excitement shame

drug:
  antibiotic drug hormone anaesthetic steroid painkiller food penicillin zidovudine immunization substance vaccine pill | drug gun terrorist launcher explosive allegation ira obesity firearm
  drug booze contraceptive antihistamine alcohol medicine cigarette | drug gun terrorist launcher explosive allegation ira obesity firearm
  drug treatment medicine ect woman development | drug gun terrorist launcher explosive allegation ira obesity firearm

palm:
  wrist palm fingertip finger toe knuckles grin thigh knee shoulder ear heel | palm banana papaya beach cassava mango pool
  wrist palm fingertip finger toe knuckles grin thigh knee shoulder ear heel | portico tower building palm stone porch block hollyhock entrance steps cloister apse
  portico tower building palm stone porch block hollyhock entrance steps cloister apse | palm banana papaya beach cassava mango pool

motion:
  amendment motion bill revision addendum resolution repeal provision vote speed modification | motion time region eternity shape colour sound figure position
  motion relativity optic electromagnetism space heat | motion time region eternity shape colour sound figure position
  amendment motion bill revision addendum resolution repeal provision vote speed modification | texture shape colour size appearance colours flavour pallidum movement shine form sound

plant:
  plant house storefront life interest men woman quality man merchandise parent building | insect bird mammal butterfly plant mollusc spider fish wildlife creature protist beast
  plant house storefront life interest men woman quality man merchandise parent building | plant organism substrate fossil crustacean decor rock seedling worm protist accessory
  plant merchandise spare planter tool equipment shop cargo | animal plant soil phytoplankton zooplankton being crop vegetation men microbe moss

Table 6.4: The three most contrasting pairs of sense clusters for the test words used by Yarowsky.

none of which coincides with the word. The word axes (which has the base forms ax, axe and axis) happens to be one of these few left-out words.

The results are based on a log-likelihood threshold of λ = 6. This is the highest threshold λ for which the other ten test words can still be classified. Our method discriminates well among the different senses of these test words, and contrasting clusters almost always correspond to different senses of a word.

Apart from making a broad distinction between the "fish" and "music" sense of bass, the clustering algorithm breaks down the "music" sense of bass even further into bass "musical instrument", bass "clef" and bass "singer". Similarly, the algorithm discovers both the "(military) vehicle" and "container" sense of tank, but further distinguishes between different types (and sizes) of containers: fish, gas and water tanks. A finer sense distinction than the one proposed by Yarowsky is also made for the word form space. The contrasting meanings discovered are space "universe", space "volume" and space "area".

Our method correctly recognizes that crane has the two contrasting meanings of "bird" and "machine". In addition, crane is placed into a cluster which is composed mainly of writers' names, which accurately reflects the fact that crane is also a family name and, in particular, the name of a famous writer.

For the test word duty, the first and the third contrasting pair represent the sense distinction between "responsibility" and "tax" given by Yarowsky. But our algorithm finds the additional "legal obligation" sense of duty, which is defined in an online glossary of legal terms¹ as

Duty: In negligence cases, a "duty" is an obligation to conform to a particular standard of care. A failure to so conform places the actor at risk of being liable

1 http://www.edgarsnyder.com/resources/terms/d.html

to another to whom a duty is owed for an injury sustained by the other of which the actor’s conduct is a legal cause.

We failed to find the rather rare "beverage" sense of sake, which is probably simply due to data sparseness. The clustering instead discriminates between two different connotations of sake, namely a religious and a secular one.

Our algorithm distinguishes between (legal) drug as in medicine and its illegal sense. It seems that the "narcotic" sense of drug is closely associated with criminal activities in the BNC, which explains the high contrast wrongly attributed to the second pair of clusters, both of which relate to the same "narcotic" sense of drug.

The most contrasting clusters extracted for the word form palm quite accurately reflect the "tree" and "body part" senses given in Table 6.3. The second and third contrasting pairs, however, involve the rather heterogeneous cluster {portico, tower, building, palm, stone, etc.}, which is only vaguely similar to palm "tree".

The clustering covers both the "legal" and the "physical" sense of motion. However, the clusters corresponding to motion "movement" contain several words which are only distantly related to that sense. Similarly, our method successfully identifies the "organism" sense of plant, but none of the other clusters clearly reflects the meaning of "industrial plant".

The contrast measure seems to be vulnerable to semantically heterogeneous clusters because of their low LSA similarity to all other clusters. By filtering out heterogeneous clusters, the recognition and discrimination of different word senses could be further improved. The within-cluster frequencies of the words in a cluster give some indication of the cluster's homogeneity or heterogeneity. Consider the following three clusters, which all contain the word form plant. The number in parentheses following each word is the number of times it is present in the cluster.

1. plant (24), house (3), storefront (2), authority (1), mother (1), union (1), staff (1), merchandise (1), development (1), man (1), woman (1), wife (1), building (1), gov- ernment (1), parent (1), number (1), money (1), life (1), family (1), germany (1), body (1), society (1), quality (1), men (1), age (1), interest (1)

2. rose (9), potentilla (8), flower (8), larkspur (6), greenery (4), mimosa (4), astrantia (4), plant (4), hydrangea (4), daisy (2), strawberry (2), hellebore (2), hawthorn (2), campanula (1), iris (1), foliage (1), cornflower (1), tree (1)

3. fortification (7), dockyard (5), bridges (4), harbour (3), canal (2), castle (2), mills (2), factory (2), plant (2), aqueduct (1).

Whereas Cluster 1 is characterized by the dominant presence of a single word (plant), Clusters 2 and 3 contain a core of several frequent words. Remember that each cluster is a set of nodes in the link graph L(W), and thus corresponds to a set of links in the original word graph W, that is, to a subgraph of W. Figure

6.10 shows the subgraphs of W associated with each of the three clusters above. Cluster 1 corresponds to a star-shaped graph which shows hardly any transitivity; it contains merely a single triangle. The graphs associated with Clusters 2 and 3, by contrast, show a much higher degree of clustering, i.e. transitivity. The degree of clustering, that is, the density of triangles within a cluster, reflects the diversity or homogeneity of the cluster members. Highly transitive clusters are semantically coherent, while clusters with a low level of transitivity are semantically heterogeneous. Sifting out heterogeneous clusters first may improve the quality of the contrasting cluster pairs.


Figure 6.10: The subgraphs of W associated with three of the clusters the word form plant belongs to.

6.7 Conclusion

Link graph clustering has been shown to successfully discriminate between a word's different, even closely related, senses. Our algorithm computes clusters of relationships between words as opposed to clusters of words; as a byproduct, a word may belong to several clusters, which correspond closely to its different meanings.

We have introduced the measure of contrast to tackle the problem of too finely grained clusters. The contrast measure seems quite effective in picking out pairs of clusters which

are representative of different meanings of a word. Certainly, a more general measure of contrast, which allows the extraction of sets of more than two mutually contrasting clusters, would be even more valuable. One way of extending contrast from pairs to larger groups of clusters is to consider the clusters a word appears in as the nodes of a graph, connected if their contrast exceeds a certain threshold. Maximal sets of mutually contrasting clusters then simply correspond to the maximal cliques of the graph.

Figure 6.11 shows the graph of cluster contrasts associated with the word form line using a threshold of 0.5. Each triangle in the graph corresponds to a set of three mutually contrasting clusters. Many of the nodes represent the same meaning of line, and hence many links represent the same meaning contrast. For example, the link connecting Clusters 1 and 2 and the link connecting Clusters 1 and 4 both correspond to the distinction between the "geometric object" and "fishing line" sense of line. The problem of determining which of the clusters correspond to the same sense of line can be formulated as a (well-studied) vertex-coloring problem (Bollobás, 1998): color the vertices of the graph in such a way that any two adjacent vertices are differently colored. Differently colored vertices are conflicting clusters. Equally colored vertices, on the other hand, are likely to correspond to the same meaning of line. Figure 6.11 shows one possible coloring of the graph. The coloring of a graph is, however, not necessarily unique. For example, vertices 5 and 6 may as well have been colored orange.

1. line point click axis polygon
2. reel line rod winch knot gaffe handle hook
3. station men woman building bridge line government road railway car train canal
4. bait tackle line lure hook
5. circle dot spike line keyhole
6. contour curve line shape
7. junction road line signpost lamppost links

Figure 6.11: The graph of contrasts between clusters containing the word form line. Each node corresponds to a cluster. Two clusters are linked to one another if there is a significant contrast between them. A coloring of the graph which assigns different colors to adjacent nodes may be interpreted as a partition into different senses of line.
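A simple greedy strategy suffices to produce such a coloring. The sketch below is illustrative only: the edge set is assumed for this example and is not the exact edge set of Figure 6.11.

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color not used by any already-colored
    neighbor. Guarantees a proper coloring (not necessarily minimal)."""
    color = {}
    for v in sorted(adj):
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

# Illustrative contrast graph over seven clusters (edge set assumed):
edges = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 6), (5, 7), (6, 7)]
adj = {v: set() for v in range(1, 8)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
colors = greedy_coloring(adj)
# Adjacent clusters (contrasting senses) always receive different colors:
assert all(colors[u] != colors[v] for u, v in edges)
```

Vertices sharing a color are then candidates for the same sense, exactly as in the interpretation of Figure 6.11; note that, as the text observes, such colorings are not unique.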

The link graph transformation is just one of a variety of possible graph transformations. Graph transformations have been examined mainly from theoretical perspectives (Prisner, 1995; Bagga, 2004), but have also been applied to real data (Nacher et al., 2004; Eckmann et al., 2004). It would be interesting to study (both experimentally and theoretically) the effect of applying other graph transformations to the word graph or to graphs built from other kinds of linguistic data.

Chapter 7

Idiomaticity in the word graph

7.1 Introduction

In order to speak and understand a language proficiently, extensive knowledge of its vocabulary and syntax alone is not sufficient. The English language (as well as many other languages) is rich in well-established word combinations whose comprehension and proper employment non-native speakers find extremely difficult to master. These established word combinations vary in their syntactic rigidity as well as in their semantic transparency, and range from institutionalized phrases (preferred ways of expressing an idea), such as ladies and gentlemen (as opposed to gentlemen and ladies) and bumps and bruises (and not bumps and scratches), to fully idiomatic expressions, such as pull someone's leg or kick the bucket. We will refer to the family of such established word combinations as multi-word expressions (Sag et al., 2002).

Multi-word expressions do not only pose a challenge to second language learners, but also to computers trying to understand (let alone generate) natural language. Both lexicography and natural language understanding would greatly benefit from automated methods for their detection. A natural application of the word graph is the automatic identification and partial classification of list-like multi-word expressions. These include coordinations which have become well-established among the native speakers of a language (fish and chips; pipes and drums), many of which have assumed a meaning which can no longer be predicted from the meanings of their constituents (smoke and mirrors; carrot and stick).

Lists are also a very popular way of naming things. Many names of products (Macaroni and Cheese, Head and Shoulders), book and movie titles (The Beauty and the Beast; The Cook, The Thief, His Wife & Her Lover), names of ministries or organizations (Organization for Security and Co-operation in Europe; Ministry of Economy, Trade and Industry), names of rock bands and sports teams (Love and Rockets, Cherry and White), etc. take the form of lists.
One of the main characteristics of established lists is their limited commutability. Changing their order makes them sound awkward (?gentlemen and ladies) and may even involve a (sometimes drastic) shift in meaning (?breakfast and bed). In the word graph, limited commutability of a list is reflected in the corresponding links being traversed preferably in one of the two possible directions.

7.2 Asymmetric links

For assembling the word graph, we decided to simply put a link between two words if they co-occurred in a list. By ignoring the order of the words, we implicitly assumed that lists are permutable. In terms of the word graph, this means that links are reversible. They simply connect nodes, but do not have a direction. However, certain lists have a strong preference for one order over the other and violate this symmetry/reversibility assumption. Lists such as chalk and cheese; lock, stock and barrel; Romeo and Juliet; stars and stripes; Father, Son and Holy Ghost; bread and butter and mother and father are almost always stated in a specific order. The undirected word graph studied so far does not account for these word order preferences.

7.2.1 Testing the reversibility (symmetry) of links

We can divide the set of coordinations a pair of words (w1, w2) occurs in into those in which w1 precedes w2, and those in which w2 precedes w1. We will denote the two sets with S1 and S2 and their sizes with n1 = |S1| and n2 = |S2|. If S1 and S2 are approximately equally sized, we can conclude that there is no preferred order for w1 and w2, which means that the link connecting the two is reversible. A strong imbalance between n1 and n2, on the other hand, suggests that there is a bias towards one of the two possible orders, i.e., the link is directed.

To check whether a pair of words favors one order over the other, we apply the Pearson Chi-Square test (Pruscha, 1996). The Chi-Square test examines how well categorical data¹ fits a certain expectation (null hypothesis). Imagine a categorical random variable X which falls into k different bins (categories) c_1, ..., c_k with specified proportions (probabilities) p_i = P(X ∈ c_i), i = 1, ..., k. An independent sample X_1, ..., X_N of the random variable X will on average fall E_i = N·p_i times into bin c_i, i = 1, ..., k. The Chi-Square statistic tells us how well the hypothesized category probabilities p_i fit the data. This is done by comparing the actually observed bin sizes O_i with the expected bin sizes E_i = N·p_i:

    χ² = Σ_{i=1}^{k} (O_i − E_i)² / E_i    (7.1)

The bigger χ², the more the observed category frequencies depart from the expected ones. Squaring the discrepancies O_i − E_i prevents negative and positive differences from

¹Data which can be divided into a small number of specific groups/categories. Examples of categorical variables are marital status (single/married/widowed/divorced), sex (male/female), employment status (yes/no), type of treatment (drug/placebo) or lexical categories (noun/verb/adjective/adverb/preposition).

canceling each other out. Furthermore, the normalization by the expected values makes the discrepancies comparable across categories. Assuming that all bins are equally likely, i.e.,

    H0: P(c_i) = p_i = 1/k,  i = 1, ..., k,    (7.2)

χ² approaches, as the number of trials N increases, the Chi-Square distribution χ²_{k−1} with k − 1 degrees of freedom. Large values of χ² are unlikely to occur under H0 and indicate that the discrepancies between observed and expected frequencies cannot be explained by chance alone.

When there are only two categories (e.g., male/female or healthy/sick), many statisticians recommend using Yates’ correction (Evert, 2004). The discrepancies O_i − E_i in Formula 7.1 are substituted with the weakened differences |O_i − E_i| − 0.5:

    χ² = Σ_{i=1}^{k} (|O_i − E_i| − 0.5)² / E_i    (7.3)

This adjustment makes the value of the statistic smaller, and hence makes it more difficult to establish significance.
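The statistic in Formulas 7.1 and 7.3 is straightforward to compute. The following is a small illustrative sketch; the function name and the example counts are ours, not part of the thesis:

```python
def chi_square(observed, expected, yates=False):
    """Pearson Chi-Square statistic (Formula 7.1); with yates=True,
    Yates' continuity correction is applied (Formula 7.3)."""
    total = 0.0
    for o, e in zip(observed, expected):
        d = abs(o - e) - 0.5 if yates else o - e
        total += d * d / e
    return total

# 90 observations over three bins assumed equally likely (E_i = 30).
observed = [25, 30, 35]
expected = [30, 30, 30]
print(chi_square(observed, expected))              # plain statistic
print(chi_square(observed, expected, yates=True))  # corrected (smaller)
```

As the text notes, the corrected statistic is always at most as large as the uncorrected one, so significance becomes harder to establish.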

Extracting directed (non-reversible) links: In our case, there are two categories:

c1: w1 precedes w2
c2: w2 precedes w1

Our null hypothesis is that both cases are equally likely:

    H0: p1 = p2 = 0.5

In other words, w1 and w2 can swap places in any list they co-occur in. The pair (w1, w2) occurs in N = n1 + n2 lists. If the null hypothesis is true, we expect the lists to split up into even halves (E1 = N·p1 = N/2 and E2 = N·p2 = N/2). The actually observed category frequencies are given by O1 = n1 = |S1| and O2 = n2 = |S2|, and Formula 7.3 becomes:

    χ² = (|O1 − E1| − 0.5)²/E1 + (|O2 − E2| − 0.5)²/E2
       = [(|n1 − N/2| − 0.5)² + (|n2 − N/2| − 0.5)²] / (N/2)
       = [(|n1 − N/2| − 0.5)² + (|N − n1 − N/2| − 0.5)²] / (N/2)
       = 4 · (|n1 − N/2| − 0.5)² / N

Since we are dealing with two categories, the test statistic χ² is approximated by the Chi-Square distribution with one degree of freedom.
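The closed form just derived, together with the 1% critical value used in the experiments, can be sketched as follows (the function names and the `is_directed` helper are our own illustrative choices):

```python
def order_chi_square(n1, n2):
    """Yates-corrected Chi-Square statistic for H0: both word orders are
    equally likely, via the closed form 4 * (|n1 - N/2| - 0.5)^2 / N."""
    N = n1 + n2
    return 4 * (abs(n1 - N / 2) - 0.5) ** 2 / N

CRITICAL_0_01_1 = 6.63  # critical value, Chi-Square with 1 d.f., alpha = 0.01

def is_directed(n1, n2):
    """True if the order preference is significant at the 1% level."""
    return order_chi_square(n1, n2) >= CRITICAL_0_01_1

# Frequencies as reported for (man, woman) and (chalk, flint):
print(is_directed(2965, 423))  # True: strongly directed link
print(is_directed(5, 4))       # False: no significant order preference
```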

Our experiments are based on a subgraph W′ of the word graph W which is obtained as follows. We consider only those nodes (words) which were observed at least five times in a list, and the links between them. Furthermore, we drop all links whose log-likelihood weight is less than ten. We compute the Chi-Square statistic for each of the remaining links in the graph and filter out those links for which χ² exceeds the critical value χ²_{0.01,1} = 6.63.² This means that the category frequencies n1 and n2 would have less than a 1% chance of occurrence if w1 and w2 were indeed commutable.

For example, consider Table 7.1, which lists the observed category frequencies and resulting χ²-values of pairings of chalk with three other words. We found 27 lists containing both chalk and cheese. In each of these lists, chalk appeared before cheese. The extremely uneven category frequencies O1 = 27 and O2 = 0 give rise to a χ² value of 27 and a rejection of H0. In the case of chalk and flint, the category frequencies are much more balanced (O1 = 5 and O2 = 4). In fact, they are very close to expectation (E1 = E2 = 4.5), which is reflected in the very small χ²-value of 0.11, well below the critical value. The word pair (chalk, limestone) seems to occur preferably in the order stated. The value of the test statistic χ² is close to the critical value, but the discrepancy in the observed category frequencies O1 = 10 and O2 = 2 may be simply due to randomness.

w1       w2          n1    n2    χ²
chalk    cheese      27     0    27
chalk    flint        5     4     0.11
chalk    limestone   10     2     5.33

Table 7.1: The χ²-values for three different links attaching to the chalk node.

Application of the cut-off criterion χ² ≥ χ²_{0.01,1} results in 2,956 word pairs with a strong preference for a specific order. Table 7.2 is an exemplary assortment of these asymmetric links, which we manually assembled so as to cover the main factors influencing word order. We identify two main causes for asymmetry, namely

• Semantic constraints affecting the order of the words, and

• Idiomaticity.

In the remainder of this chapter, we will examine both of these phenomena in more depth.

7.2.2 Semantic constraints

Certain systematic rules, which are listed below, seem to regulate the order of words in lists. The examples following each of the rules are taken from our list of asymmetric links.

²χ²_{α,k} is the smallest value for which P(Z ≥ χ²_{α,k}) < α, where Z is a random variable which is distributed according to the Chi-Square distribution with k degrees of freedom.

w1         w2            n1     n2     χ²
man        woman        2965    423   1907.25
bed        breakfast     433      2    427.04
life       death         313     24    247.84
strength   weakness      297     23    234.61
ladies     gentleman     247      5    232.4
success    failure       279     17    231.91
north      south         270     14    230.76
day        night         555    172    201.77
fish       chip          209      8    186.18
bread      butter        188      2    182.08
rise       fall          175     10    147.16
spring     summer        156      4    144.4
birth      death         120      7    100.54
doctor     nurse         207     45    104.14
landlord   tenant        127     12     95.14
parent     guardian       68      1     65.06
rock       roll           58      0     58
odd        end            44      0     44
hue        cry            40      0     40
twists     turn           40      1     37.1
childhood  adolescence    39      1     36.1
beauty     beast          35      1     32.11
chalk      cheese         27      0     27
son        grandson       29      0     29
god        goddess        31      3     23.06
spring     autumn         74     26     23.04

Table 7.2: A subset of the graph’s links together with their χ² values.

From now on, we will use the notation x>y to express that word x tends to appear before word y.

1. male > female: man > woman; brother > sister; boy > girl; prince > princess; hero > heroine.

2. old > young: father > son; son > grandson; woman > child; wife > daughter; host > hostess.

3. big > small: church > chapel; hotel > restaurant; street > alley; university > college; town > village.

4. prototypical > nonprototypical: apple > cherry; beef > mutton; biology > zoology; iron > manganese; wool > nylon; oil > resin.

5. early > late: spring > autumn; morning > afternoon; beginning > end; arrival > departure; shampoo > conditioner.

6. strong/important/valuable > weak/less important/less valuable: landlord > tenant; teacher > pupil; cardinal > bishop; doctor > nurse; gold > silver; friend > acquaintance.

7. positive > negative: love > hate; rise > fall; joy > sorrow; boom > slump; winner > loser; friend > foe.

8. second requires first: horse > rider; father > son; pregnancy > birth; war > aftermath; question > answer; marriage > divorce; crime > punishment; input > output; dog > owner.

9. more specific > less specific: pot > container; theft > offence; car > vehicle; mathematics > science; farmer > worker; director > person; cat > animal; heroin > drug.

10. human > nonhuman: people > animal; child > pet.

11. animate > inanimate: man > machine; people > situation; women > politics.

12. concrete > abstract: body > soul; food > aid; bed > sleep; hardware > software.

These constraints overlap largely with those identified by Müller (1998) and Benor and Levy (2006) in their studies. Some of the semantic constraints are more salient than others. Whereas we observe a tendency to name males before females, strength (Rule 6) seems to override this male-first bias, as can be seen in the examples queen > duke and landlady > lodger. This is also observed in parenting relationships, where females appear before males (mother > father; mum > dad; grandma > grandpa). Perhaps it is usually the mother who is bringing up the children, which would make her more important than the father as far as parenting is concerned.

Several of the above-listed relationships, such as maturity (Rule 2), temporal or logical precedence (Rule 5), dominance (Rule 6), specificity (Rule 9) and size (Rule 3), are hierarchical in nature. Rather than being scattered over the word graph, they attach to each other, forming fully-directed subgraphs, some of which are illustrated in Figures 7.1 through 7.4. Figures 7.1 and 7.2 are composed of links which are directed due to ranking according to both maturity/pedigree and gender. Relationships with temporal and/or logical order give rise to the directed subgraphs depicted in Figures 7.3 and 7.4.

Figure 7.1: Relationships between family members.

The kinds of links which we have covered in this section are only restricted in terms of their word order, which has become fixed over time or is the result of certain systematic semantic (sometimes also phonological) rules. As far as meaning is concerned, they are fully transparent. However, a non-negligible number of the asymmetric links in the word graph arises from expressions which, in addition to having a rigid word order, constitute lexical units in their own right, often with a non-literal meaning. Their meaning cannot be deduced from the individual words they are composed of. These fall into the category of idioms, the semantically most opaque word combinations, and thus the most difficult expressions for non-native speakers of a language (and a computer) to understand and produce.

Figure 7.2: Hierarchical relationships between aristocrats.

Figure 7.3: Different events in one’s life and their temporal relationships.

Figure 7.4: Chronology of the seasons.

7.3 Idiomatic constraints

Idioms are classically defined by the following two properties (Gläser, 1998; Burger, 1998):

1. Lexicalization: Idioms have become established among native speakers of a language (or within smaller communities). They constitute lexical units in their own right.

2. Noncompositionality of meaning: The meaning of an idiom can, if at all, only be partially predicted from the meanings of its parts.

Idiomaticity comes on a scale, and it is often extremely hard to decide whether an expression should be considered an idiom or not. Phrases which are often cited as prototypical idioms include kick the bucket, spill the beans, pull someone’s leg and piece of cake, but the definition given above applies also to expressions such as stars and stripes, bed and breakfast or rock and roll. In addition to the defining characteristics 1 and 2, idioms often exhibit the following properties (Nunberg et al., 1994; Burger, 1998):

3. Semantic anomaly: The words involved have incompatible meanings. Examples are chalk and cheese; smoke and mirrors; carrot and stick.

4. Limited syntactic flexibility: Idioms often show little (if any) variation in form. In particular, the individual words within an idiom undergo little modification, and their order is rather rigid.

5. Limited semantic flexibility: In contrast to free word combinations, many idioms lose their original, i.e., their idiomatic, meaning if individual words are replaced with (near-)synonyms.

6. Broad scope: Many idioms describe very general and common situations (Nunberg et al., 1994) (e.g., working too much (all work and no play), arguing (cut and thrust), (something happening) rapidly (in/by leaps and bounds)).

Characteristics 3 to 6 are not necessary conditions for idiomaticity. Idioms often do show some greater or lesser degree of syntactic or semantic variability (chalk and bloody cheese; bits and bobs/pieces). Similarly, many idioms are in fact composed of semantically compatible words (morning, noon and night; (play) cat and mouse). Nonetheless, these additional characteristics provide important clues as to whether and to what degree an expression is idiomatic.

The remaining part of this chapter describes different approaches for making the recognition of idioms automatic. So far, it is a collection of ideas which has yet to be explored in detail. In the following section, we present a combinatorial method for determining whether a link is idiomatic or not. It will rely solely on the link structure of its neighborhood in the word graph.

7.4 Link analysis

Many idioms respond highly sensitively to semantic transformations (cf. Characteristic 5). Since the individual components of an idiom assume and contribute an (often metaphoric) meaning which is different from their literal one, they usually cannot be replaced with synonyms or near-synonyms. Consider the idiom carrot and stick, for example. By substituting another vegetable, e.g., tomato, for carrot, the original idiomatic meaning gets lost.³

In terms of the word graph, replacing a constituent word with a near-synonym means rewiring one of a link’s ends to a neighboring node. Consider a coordination x and y. Each possible way of paraphrasing x and y using a (near-)synonym x′ of x, or a (near-)synonym y′ of y, gives rise to a triangle (x, y, x′) or (x, y, y′) (cf. Figure 7.5). This means that the more triangles a link occurs in, the more flexible it is in terms of lexical substitution. This idea is illustrated in Figure 7.6 for the idiomatic expression rhyme and reason. The rhyme and reason nodes do not share a single neighbor and consequently do not co-occur in a single triangle. This means that neither of the two can be replaced by a semantic neighbor.

Besides reflecting the lexical substitutability of an expression, the triangle count indicates the degree of semantic incompatibility of the words which make it up (cf. Characteristic 3). Whereas the words in a list are usually semantically similar, idiomatic expressions are often composed of apparently very dissimilar concepts. Typical examples are chalk and cheese and theft and fire. It seems that there is a strong correlation between the semantic transparency of a list and the semantic relatedness of its coordinated words.

³It’s more imaginable that idiomaticity can be maintained by replacing carrot with chocolate, for example, since chocolate and carrot have similar metaphoric interpretations (namely as being a reward).
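Counting the triangles a link occurs in reduces to intersecting the neighbor sets of its two endpoints. A minimal sketch over a toy undirected graph (the adjacency data below is invented for illustration and only loosely modeled on Figure 7.6):

```python
def link_triangles(graph, x, y):
    """Number of triangles the edge (x, y) participates in, i.e. the
    number of common neighbours of x and y."""
    return len(graph[x] & graph[y])

# Toy undirected word graph as a dict of neighbour sets.
graph = {
    "rhyme":  {"reason", "song", "poem", "verse"},
    "reason": {"rhyme", "excuse", "motive"},
    "song":   {"rhyme", "poem", "verse", "music"},
    "poem":   {"rhyme", "song", "verse"},
    "verse":  {"rhyme", "song", "poem"},
    "excuse": {"reason", "motive"},
    "motive": {"reason", "excuse"},
    "music":  {"song"},
}

print(link_triangles(graph, "rhyme", "reason"))  # 0: no shared neighbours
print(link_triangles(graph, "rhyme", "song"))    # 2: poem and verse
```

A triangle count of zero for a link is thus a first, purely combinatorial indicator that neither word can be substituted by a semantic neighbor.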


Figure 7.5: Lexical substitution in the word graph: If an expression x and y can be paraphrased as x′ and y, where x′ is a neighbor of x in the word graph, the link connecting x and y occurs in a triangle with corners x, y and x′.


Figure 7.6: The neighborhood of the link connecting the words rhyme and reason.

In Chapter 5, we observed that ambiguous words correspond to nodes which connect different node clusters in the word graph. In the same vein, idiomatic links correspond to links connecting otherwise distant clusters. For example, the idiom rhyme and reason is the sole connection between the mutually exclusive semantic neighborhoods of rhyme and reason. The neighborhoods of semantically compatible words, such as rhyme and song, on the other hand, have considerable overlap (cf. Figure 7.7).


Figure 7.7: The neighborhood of the link connecting the words rhyme and song.

As we have seen, lists composed of two semantically clashing words correspond to edges which bridge between two mutually exclusive node clusters. Similarly, lists composed of three semantically incompatible words correspond to triangles linking three contradictory semantic neighborhoods. Figures 7.8 and 7.9 show the network structure surrounding the two word triples (man, woman, child) and (lock, stock, barrel). Both triples have a strong preference for one word order (the one specified) over the other possible five. However, they differ a great deal in their analyzability. The meaning of man, woman and child is fully predictable and simply the composition of the semantically compatible individual meanings. The meaning of the expression lock, stock and barrel, on the other hand, cannot be derived from the individual meanings. Lists are usually semantically homogeneous, and whenever they are not, there is strong evidence that the list has a non-literal interpretation.

The difference in compatibility shows nicely in the topology of the graphs surrounding the two triples. Whereas the triangle (man, woman, child) is well-embedded in the dense subgraph of human beings surrounding it, the triangle (lock, stock, barrel) is much less well-integrated in its neighborhood. It sits in between different node communities corresponding to semantically distant categories.

Several questions remain open which need to be addressed in future work. Mere triangle counts do not adequately reflect idiomaticity. Naturally, links connecting frequent words are much more likely to appear in triangles than links between infrequent words. We therefore


Figure 7.8: The neighborhood of the links connecting man, woman and child.


Figure 7.9: The neighborhood of the links connecting lock, stock and barrel.

need to find a (natural) way of adjusting the triangle counts to the frequencies of the words in the coordination. A link between two words which form part of a list of n words naturally occurs in at least n − 2 triangles, irrespective of its idiomaticity.

In order to properly deal with lists composed of more than two words, we propose to assign weights to the triangles of the graph which reflect the degree of asynchrony among their three sides. The edges of highly synchronized triangles, such as (lock, stock, barrel), occur only simultaneously (within the same list). The triangle (man, woman, child) is much less synchronized. Its three links often do co-occur in the same list, but not always. There are many lists which involve only two of the three words. Eckmann et al. (2004) propose an information-theoretic measure of the degree of synchronization between the nodes in a dynamic graph which may be adapted to suit our purposes.

There are many idioms which are in fact composed of semantically similar words, such as, for example, (cost an) arm and a leg and skin and bones. Such links, in spite of their idiomaticity, occur in many triangles, as their constituent words share many semantic neighbors.

The following sections are intended to provide a series of ideas on how idiom recognition can be improved using statistical techniques in addition to the combinatorial method described in this section.

7.5 Statistical analysis

Rather than presenting a set of fully implemented and tested techniques, this section is a collection of ideas for improving idiom recognition which have yet to be explored further. The following subsections are devoted to the distinguishing characteristics of idioms listed in Section 7.3, and to their assessment.

7.5.1 Noncompositionality of meaning

We expect that the contexts surrounding an idiomatic expression are very different from the contexts the individual words occur in. In contrast, the contexts of a fully compositional word combination are similar to the contexts of its constituent words. Consider the following example sentences:

1. ...closed galleries. The finances of the two museums are chalk and cheese, with taxpayers providing a mere 17% of ...

2. ...party. With lots of beer and home-made bread and cheese and ham and some of my mother’s special cakes ...

3. ...removal of about 81 tonnes of topsoil, clay and chalk and the use of a total weight of timber, ...

The topic of the sentence containing the idiomatic expression chalk and cheese is “finance”, which is fully unrelated to the topics of the other two sentences, each of which contains one of the individual words cheese and chalk. In contrast, Sentence 2, which contains the free coordination bread and cheese, talks about food, as most sentences containing one of cheese or bread do. Similarly, Sentence 3 is similar to many contexts containing chalk or clay.

Latent Semantic Analysis (LSA) (Deerwester et al., 1990) provides a framework for comparing the contexts of a word pair with those of the individual words. LSA yields a vector representation for each word in a corpus (which can be seen as a profile of its usage in the corpus), and the distributional similarity between two words is measured by the cosine of the angle enclosed by the corresponding vectors. Similarly, any context can be represented as a vector in the same LSA space, namely the (normalized) sum of the vectors of the words in the context, and, just as similarity between words, similarity between contexts can be computed using cosine similarity.
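Assuming word vectors from some LSA (or comparable) model are available, the comparison sketched above reduces to cosine similarity between a word’s vector and the summed vector of a context. A toy illustration with invented 3-dimensional vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def context_vector(words, vectors):
    """Represent a context as the sum of its word vectors."""
    dim = len(next(iter(vectors.values())))
    ctx = [0.0] * dim
    for w in words:
        if w in vectors:
            ctx = [c + x for c, x in zip(ctx, vectors[w])]
    return ctx

# Invented toy "LSA" vectors, for illustration only.
vectors = {
    "cheese":  [0.9, 0.1, 0.0],
    "bread":   [0.8, 0.2, 0.0],
    "ham":     [0.7, 0.3, 0.1],
    "tax":     [0.0, 0.1, 0.9],
    "finance": [0.1, 0.0, 0.8],
}

food_ctx = context_vector(["bread", "ham"], vectors)
money_ctx = context_vector(["tax", "finance"], vectors)
print(cosine(vectors["cheese"], food_ctx))   # high: compositional use
print(cosine(vectors["cheese"], money_ctx))  # low: idiomatic use
```

A low similarity between a word and the context of the coordination it appears in is then evidence for a noncompositional reading.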

7.5.2 Semantic anomaly

In Section 7.4, we have argued that links which occur in few if any triangles are composed of semantically anomalous words. Another way of measuring the semantic incompatibility of the words in a list is by means of a statistical or taxonomy-based word similarity measure.

Distributional word similarity: The LSA similarity just mentioned can be used to measure the similarity between the words in a list. Latent Semantic Analysis has been used as a filter before. Cederberg and Widdows (2003), for example, improved the precision of pattern-based hyponymy extraction by using the LSA similarity between the candidate hypernym and hyponym as a measure of confidence in the extracted relationship. And Baldwin et al. (2003) have used Latent Semantic Analysis to distinguish between compositional and non-compositional verb-particle constructions.

Table 7.3 shows examples of asymmetric word pairs (as extracted in Section 7.2.1) and their LSA similarities (based on an LSA model built from the BNC). Whereas there are many idioms among the word pairs with low LSA similarity scores (e.g., rhyme and reason, taste and pocket), we also find several fully compositional word pairs among the lowest scoring ones. The reason is that many of them involve a very general or ambiguous word. The word attention, for example, is ambiguous between attention “care” and attention as in to pay attention. Rather than reflecting idiomaticity, the low LSA similarity score assigned to the word pair (love, attention) is the result of the ambiguity of the word form attention. The same phenomenon gives rise to the low LSA score between background and interest.

As was already discussed in the previous section, quite a lot of idioms, such as (in every) nook and cranny, (to earn one’s) bread and butter and (to cost an) arm and leg, are semantically fully homogeneous, and hence their constituents are close to each other in the Latent Semantic Space. In fact, lots of idiomatic expressions, such as bits and pieces and twists and turns, make use of semantic repetition, sometimes even using exactly the same words, as in run and run or out and out.

Word pair               LSA sim
love        attention    -0.23
rhyme       reason       -0.07
taste       pocket       -0.05
background  interest      0.04
star        stripe        0.12
family      colleague     0.14
rank        file          0.17
hat         umbrella      0.25
bucket      spade         0.26
totem       taboo         0.34
cat         mouse         0.36
spic        span          0.55
echo        bunnymen      0.57
jekyll      hyde          0.63
kith        kin           0.64
bits        bobs          0.7
arm         leg           0.73
gold        silver        0.75
bread       butter        0.79
jacket      trouser       0.88
question    answer        0.91
north       south         0.93
nook        cranny        0.95

Table 7.3: Examples of asymmetric word pairs and their LSA similarities.

Some idiomatic expressions are composed of words which do not occur in the corpus outside idioms. These include spic and span, bits and bobs and kith and kin. For example, since bobs is only observed in the context of bits, the LSA similarity of the two words is extremely high. The same problem occurs for named entities such as Jekyll and Hyde and Echo and the Bunnymen, none of whose constituents occurs in isolation.

Taxonomy-based word similarity: Several measures of semantic relatedness have been proposed which rely on the taxonomic information provided by the WordNet database; they have been implemented by Pedersen et al. (2004) in the WordNet Similarity package.⁴ We have described one of them, the similarity measure developed by Lin (1998), earlier in Section 5.4.2 of Chapter 5. Table 7.4 lists some of the asymmetric links and the WordNet similarities assigned to their constituent words using Lin’s similarity measure.

Word pair           WN sim
lies      statistics  0
bubble    squeak      0
love      duty        0
music     lyric       0
colitis   disease     0
bread     circuses    0.062
pen       ink         0.062
snake     ladder      0.129
taste     pocket      0.135
nook      cranny      0.179
rhyme     reason      0.254
ball      chain       0.419
arms      legs        0.884
bits      pieces      1
twists    turns       1

Table 7.4: Example word pairs and their WordNet similarities.

Again, we find some idioms among the word pairs with a high WordNet similarity score which do not fulfill the assumption of semantic incompatibility. Inspection of the results shows that semantic dissimilarity alone does not suffice for distinguishing idiomatic from non-idiomatic expressions. Nevertheless, semantic similarity measurements might prove useful as additional factors for extracting idioms. We find several idioms among the word pairs which are assigned a WordNet similarity of zero (e.g., lies and statistics, bread and circuses). However, the set of zero-similarity word pairs includes many fully compositional word pairs which are members of the same

⁴http://search.cpan.org/dist/WordNet-Similarity/

semantic field but nevertheless occur at very different places in the taxonomic tree. These include love and duty, music and lyric and colitis and disease. According to WordNet, colitis is an inflammation which is subsumed by the top node psychological feature. A disease, on the other hand, is encoded as an illness which is subsumed by the top node state. Consequently, the WordNet similarity measure wrongly judges the word pair (colitis, disease) as being highly idiomatic.

7.5.3 Limited syntactic variability

Idioms are not only rigid in terms of word order, but also in terms of other structural variation, such as modification of the individual words. It seems that the syntactic flexibility of an expression and the transparency of its meaning are closely correlated (Fazly et al., 2005). Highly idiomatic phrases are hardly ever modified. The word pair (chalk, cheese), for example, occurs 21 times in the corpus. Twenty of the occurrences take the form chalk and cheese, and only a single one takes the different form chalk and bloody cheese. Fully transparent word combinations, on the other hand, are fully flexible syntactically. The coordinations containing the word pair (bread, cheese), for example, come in many different forms, such as:

• bread and the marvellous brown norwegian cheese
• cheese and fresh crusty bread
• bread and mousetrap cheese
• cheese and some bread
• bread and mouldy cheese
• bread and wine and cheese
• cheese and rye bread
• cheese, olive, sardine and bread

We can think of a word pair’s different syntactic forms as being the different outcomes x_1, ..., x_n of a random variable X, whose probabilities p_i = P(x_i) are given by their relative frequencies of occurrence. The degree of variability in a word pair’s surface forms can be measured using Shannon entropy, which is a measure of the amount of randomness in a discrete probability distribution P_X, and which is defined as follows:

    H(P_X) = − Σ_{i=1}^{n} p_i · log(p_i)    (7.4)

H(P_X) assumes the minimal possible value of zero if the probability distribution P_X is deterministic, i.e., if there exists a j such that p_j = 1 and p_i = 0 for all i ≠ j. It is maximal

when all possible outcomes xi are equally likely, in which case it assumes a value of log(n). Table 7.5 shows some example word pairs and their entropy scores.

Word pair               Entropy
pro        con            0
hustle     bustle         0
dustpan    brush          0
benson     hedge          0
part       parcel         0.06
gin        tonic          0.08
trial      error          0.12
nook       cranny         0.13
intent     purpose        0.15
beauty     beast          0.18
chalk      cheese         0.23
bed        breakfast      0.43
earl       countess       0.44
trial      conviction     0.47
disease    recovery       0.54
melody     accompaniment  0.65
bread      butter         0.99
actor      audience       1.06
rebellion  revolution     1.25
chalk      flint          1.66
cheese     onion          2.01
bread      cheese         2.73
bed        mattress       2.95
fruit      yoghurt        3.35
sand       rock           3.69
bacon      bread          4.25
shirt      tie            5.31

Table 7.5: Examples of word pairs and their entropy.
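Formula 7.4 applied to a word pair’s surface-form counts can be sketched as follows (the flexible-pair counts below are invented for illustration; we use base-2 logarithms, so entropy is measured in bits):

```python
import math
from collections import Counter

def form_entropy(form_counts):
    """Shannon entropy (Formula 7.4) of the distribution over the
    observed surface forms of a coordination, in bits."""
    total = sum(form_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in form_counts.values())

# A rigid pair: almost every occurrence has the same surface form.
rigid = Counter({"chalk and cheese": 20, "chalk and bloody cheese": 1})
# A flexible pair: many different surface forms (invented counts).
flexible = Counter({"bread and cheese": 5, "cheese and bread": 4,
                    "bread and mouldy cheese": 2, "cheese and rye bread": 2})

print(form_entropy(rigid))     # close to 0
print(form_entropy(flexible))  # substantially higher
```

A pair whose every occurrence takes a single form gets entropy zero, matching the top rows of Table 7.5.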

7.5.4 Variability of context

While free word combinations tend to occur in very specific contexts, idioms describe much more general situations and are distributed across a wide range of topics. Again, we will use the two coordinations chalk and cheese and bread and cheese to illustrate this idea. The free coordination bread and cheese occurs almost exclusively in contexts dealing with food and eating:

• ...party. With lots of beer and home-made bread and cheese and ham and some of my mother’s special cakes ...

• ...o’clock! Dinner time!” It was bread and cheese. Sam took his out into the sunshine. Rose ...

• ...ate the contents of the saucepan and some bread and cheese and an apple, and drank most of the wine ...

Sentences containing the idiomatic coordination chalk and cheese, in contrast, cover very different domains, ranging from finance to parenting:

• ...closed galleries. The finances of the two museums are chalk and cheese, with tax- payers providing a mere 17% of ...

• ...that game to next Sunday’s contest is to compare chalk and cheese. But if nothing else, the McKenna ...

• ...parents describe their children as being as different as chalk and cheese. They can often be blind to ...

The contexts surrounding an idiom tend to vary more than the contexts surrounding a fully compositional word combination. Nunberg et al. (1994) refer to this phenomenon as the proverbial character of many idioms:

“Idioms are typically used to describe – and implicitly, to explain – a recur- rent situation of particular social interest (becoming restless, talking informally, divulging a secret, or whatever) in virtue of its resemblance or relation to a sce- nario involving homey, concrete things and relations – climbing walls, chewing fat, spilling beans.”

The diversity of the contexts an expression occurs in, in addition to the discriminating factors described in the preceding sections, gives helpful additional clues to its idiomatic character. Again, Latent Semantic Analysis provides a framework for measuring the diversity of a potential idiom’s contexts: the more scattered the corresponding context vectors are in Latent Semantic Space, the more variable the contexts, and hence the bigger the scope of the expression. Conversely, similar contexts give rise to context vectors which point in very similar directions in LSA space, and are indicative of the expression having a very narrow scope. In the following, we describe several methods for measuring the degree of dispersion of a set of context vectors in LSA space. Each context vector corresponds to a point in Latent Semantic Space (namely to its “head”). The size, that is, the volume or diameter, of the smallest box which contains all these points can be taken as indicating the degree of dispersion of the corresponding contexts.⁵ It has to be noted that this approach may be sensitive to outliers in the form of rare, peculiar contexts.

⁵This idea has been developed by Dominic Widdows and Aria Haghighi, but has not been published; hence this footnote instead of a reference.

Another way of assessing the dispersion of a set of context vectors in LSA space is by means of the circular standard deviation (Fisher et al., 1987), which makes it possible to measure the dispersion of circular data, that is, data which lies on a circle or, more generally, a sphere. Since the context vectors resulting from Latent Semantic Analysis are all of unit length, their endpoints lie on the surface of a (high-dimensional) unit sphere.

For a given set of vectors v_1, ..., v_n, we can compute the average vector v = (1/n) ∗ Σ_i v_i. The length ‖v‖ of v assumes a value between zero and one. Highly dispersed context vectors give rise to a short average vector v. If the context vectors point in similar directions, ‖v‖ is close to 1 (cf. Figure 7.10). The circular standard deviation of v_1, ..., v_n is defined as

σ(v_1, ..., v_n) = √(−2 · log ‖v‖).    (7.5)

It is monotonically decreasing with ‖v‖, and thus returns small values for similar, and large values for highly dispersed, context vectors.

Figure 7.10: Two unit-length vectors pointing in similar directions give rise to a long average vector v (left). Vectors pointing in very different directions result in a short average vector.
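A minimal sketch of the circular standard deviation of Equation 7.5, using hypothetical two-dimensional unit vectors in place of the high-dimensional LSA context vectors:

```python
import math

def circular_std(vectors):
    """Circular standard deviation (Eq. 7.5) of a set of unit vectors.
    Dispersed vectors give a short mean vector and a large deviation;
    near-parallel vectors give a long mean vector and a small deviation."""
    dim = len(vectors[0])
    n = len(vectors)
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    r = math.sqrt(sum(x * x for x in mean))  # ||v||, in (0, 1]
    return math.sqrt(-2.0 * math.log(r))

similar = [(1.0, 0.0), (0.8, 0.6)]                 # nearly parallel unit vectors
dispersed = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]  # spread over the circle
print(circular_std(similar) < circular_std(dispersed))  # True
```

As Figure 7.10 illustrates, the dispersed set yields a short mean vector and hence a large deviation; note that a set of perfectly opposed vectors would make ‖v‖ = 0 and the measure undefined.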

Yet another way of measuring the variability of a set of contexts is by dividing the documents of the corpus into topic clusters using a standard clustering method, such as K-means clustering (Manning and Schütze, 1999), and by examining how much the contexts are scattered across the topic clusters, for example, using a measure of entropy (Shannon, 1948). Word pairs with a spiky distribution, i.e. which occur frequently in a small subset of clusters and are absent from all other clusters, are likely to denote very specific, unidiomatic concepts. On the other hand, expressions which are evenly distributed across the

topic clusters are likely to describe more general situations and are hence good candidates for idioms. It has to be noted that variability of context is a distinguishing trait of “classical” proverbial idioms, such as spill the beans, beat around the bush or piece of cake. According to the definition given in Section 7.3, the class of idioms also includes specialized terms, such as cut and paste or park and ride, which are not fully compositional in meaning. In contrast to proverbial idioms, these expressions are highly specific. This means they show low context variability, and will hence not be recognized as idiomatic expressions. The same applies to named entities, such as Bubble and Squeak (a traditional British dish made from leftovers) or Snake and Leader (a traditional Indian game), whose meanings are fully opaque, yet which denote highly specific concepts.

7.6 Conclusions

This chapter has explored the set of word pairs which give rise to asymmetric links in the word graph, that is, pairs of words which almost always occur in a particular order. Two causes for this asymmetry have been identified: semantic and idiomatic constraints. We proposed several techniques, both combinatorial and statistical, for distinguishing idiomatic word pairs from compositional ones, based on syntactic and semantic characteristics of idioms. Many of these approaches have yet to be implemented and tested. The most difficult question which has to be addressed is how the proposed measures can best be combined.

Chapter 8

Synonyms in the word graph

In this chapter, we describe the use of the word graph for the automatic extraction of synonyms. Synonymy is a relationship between different words, or more precisely between word senses, expressing the same concept. Examples of synonyms are aristocracy and nobility, car and automobile, flora and vegetation, and chest and thorax. Although alternate forms are not considered synonyms in the strict sense, we extend the term “synonym” to such forms as abbreviations (refrigerator and fridge; bicycle and bike), acronyms (European Union and EU; television and TV), and spelling variants (moslem and muslim; urbanisation and urbanization). The availability of synonyms is fundamental to many applications in natural language processing, such as Information Retrieval, as it enables a retrieval system to paraphrase queries, and Document Summarization, as it helps to identify duplicate information to be fused.

8.1 Hypothesis

Most semantically related words not only cooccur with the same words, but also cooccur with each other. For example, consider this short passage from the BNC, which contains three members of the semantic category of vehicles:

Two drivers escaped injury when their vehicles collided near Thirsk. The acci- dent happened on the southbound carriageway of the A19 and involved a VW van and a Ford Sierra car.

Synonyms, however, tend to defy this schema. Synonyms are alternative names for the same object or idea, and since we often prefer using one of the names over the others (or are simply not aware of other naming options), we tend to stick to the same name throughout an entire document (when writing) or conversation (when speaking). Consequently, synonyms scarcely appear in each other’s vicinity in the text. In analogy to the “one sense per discourse” principle of Gale et al. (1992), this argument may be put as follows: there is only one word per sense in a discourse.

This claim is a bit too strong, as we can well imagine an author making deliberate use of synonyms to avoid the monotony of using the same word over and over. Yet, it becomes much more plausible if we restrict the very general notion of cooccurrence to mean cooccurrence in a specific grammatical context. Lists, for example, contain synonyms only in a very small number of exceptional cases. As we have seen in Chapter 7, some expressions, such as bits and pieces, twists and turns, hale and hearty and such and such, are composed of synonymous words (or even repetitions of the same word) to be more expressive, but these cases are extremely rare and mostly idiomatic. Typically, each of the words in a list contributes some unique meaning to the meaning of the whole, which is not covered by the other list elements. Synonyms, however, constitute repeated, that is, redundant semantic information.

In short, the assumption we make is that synonyms tend to occur with the same words in lists, but, unlike other semantically related words, rarely occur together in the same list. The interpretation of this hypothesis in terms of the word graph is the following:

Hypothesis 1 Synonyms correspond to nodes which are not (or only weakly) linked to one another, but which are (strongly) linked to the same other nodes.

In other words, synonyms are weak first-order, but strong second-order neighbors in the graph. Figure 8.1 shows the section of the word graph surrounding the pedestrian and the walker node. It is important to note that the two nodes are not immediate neighbors in the graph. Yet, they are strongly held together by their shared neighbors, cyclist, rider and motorist.

Figure 8.1: The neighborhoods of the pedestrian and the walker node. The two nodes are not linked to one another directly but have several neighbors in common.

8.2 Methodology

As mentioned earlier and exploited by the Markov Clustering algorithm (cf. Section 4.2), the Markov matrix T_G of a graph G, that is, its column-normalized adjacency matrix, contains the node-to-node transition probabilities of a random walk on the graph. More generally, the k-th power T_G^k of T_G lists the probabilities of moving from one node to another in k steps. The matrix T_G^k can be interpreted as the adjacency matrix of a weighted directed graph G_{T^k}. The first row of Figure 8.2 shows an example graph G and its adjacency matrix A_G. By normalizing the columns of A_G so that their entries sum up to one, we obtain the Markov matrix T_G which, viewed as an adjacency matrix, gives rise to the graph G_T to its right. The widths of the arcs are proportional to the link weights encoded in T_G, that is, to the transition probabilities between the nodes they connect. The thicker an arc, the more likely a random walker is to traverse it.

Squaring the matrix of transition probabilities T_G results in the second-order Markov matrix T_G^2 and the associated second-order Markov graph G_{T^2}, both shown in the third row of Figure 8.2. The width of an arc leading from a node i to another node j reflects the probability of a random walk on the original graph G reaching node j in two steps from node i. The transition from the first-order to the second-order Markov graph induces an interesting redistribution of transition probabilities: some originally strong bonds (those between nodes 1 and 3, 1 and 4, 2 and 3, and 2 and 4) weaken. The previously unconnected nodes 1 and 2, on the other hand, develop strong ties to each other. According to Hypothesis 1, such pairs of nodes are good candidates for synonymy. Nodes such as 3 and 4, which are close first-order as well as second-order neighbors, are likely to correspond to members of the same semantic category. They occur not only with the same other words, but with each other, too. A characteristic of the second-order Markov graph which stands out is that each node is quite strongly connected to itself via a self-loop. Nodes in the original graph, on the other hand, are not linked to themselves. In fact, this observation fits well with our hypothesis, as any word is naturally a synonym of itself.

We now formalize Hypothesis 1. Let T = T_W denote the first-order, and T^2 = T_W^2 the second-order Markov matrix associated with the word graph W. The differences between 1-step and 2-step transition probabilities are encoded in the matrix D := T^2 − T. Since synonymy is a symmetric relationship, we expect a pair of synonyms w_i and w_j to show a considerable increase in transition probability in both directions between the corresponding nodes i and j. This means we expect both of the entries d_ij and d_ji of D to assume large values. The average increase in transition probability between each pair of nodes i and j is given by

d̄_ij (= d̄_ji) = (1/2) ∗ (d_ij + d_ji),

which assumes a value between −1 and 1, and corresponds to the (i, j)-element of the

          1    2    3    4
     1  [ 0    0    1    1 ]
A_G= 2  [ 0    0    1    1 ]
     3  [ 1    1    0    1 ]
     4  [ 1    1    1    0 ]

          1    2    3    4
     1  [ 0    0   .33  .33 ]
T_G= 2  [ 0    0   .33  .33 ]
     3  [ .5   .5   0   .33 ]
     4  [ .5   .5  .33   0  ]

          1    2    3    4
     1  [ .33  .33  .11  .11 ]
T_G²=2  [ .33  .33  .11  .11 ]
     3  [ .17  .17  .44  .33 ]
     4  [ .17  .17  .33  .44 ]

Figure 8.2: An adjacency matrix and its corresponding graph; the first-order Markov matrix and associated graph; the second-order Markov matrix and associated graph.

symmetric matrix

D̄ = (1/2) ∗ (D + Dᵀ)    (8.1)

where Dᵀ is the transpose of D, obtained by turning rows into columns and vice versa. Cells (i, j) of D̄ with large values are indicative of a synonymy relationship between words w_i and w_j.
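The construction above can be checked on the toy four-node graph of Figure 8.2. The following sketch (NumPy, an illustration only) column-normalizes the adjacency matrix, squares it, and computes the symmetrized difference matrix of Equation 8.1:

```python
import numpy as np

# Adjacency matrix A_G of the example graph in Figure 8.2: nodes 1 and 2
# are not linked to each other but share the neighbors 3 and 4.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)

T = A / A.sum(axis=0)      # column-normalized: first-order Markov matrix T_G
T2 = T @ T                 # second-order transition probabilities T_G^2
D = T2 - T                 # difference between 2-step and 1-step probabilities
D_bar = 0.5 * (D + D.T)    # symmetrized difference matrix (Eq. 8.1)

np.set_printoptions(precision=2, suppress=True)
print(T2)                  # reproduces the matrix T_G^2 shown in Figure 8.2
# The largest off-diagonal entry of D_bar links nodes 1 and 2 (indices 0
# and 1): exactly the synonym candidates predicted by Hypothesis 1.
print(round(float(D_bar[0, 1]), 2))  # 0.33
```

Note that the diagonal entries of D̄ (the self-loops, e.g. 4/9 for nodes 3 and 4) are even larger than d̄_12, in line with the observation that every word is trivially a synonym of itself.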

8.3 Evaluation experiment

In this section, we describe an experiment which tests the validity of our hypothesis.

8.3.1 Test set

The experiment is based on the same subgraph W′ of the word graph W which was used in Chapter 7 for the extraction of idioms. W′ consists of n = 5,952 nodes and m = 62,184 links, and the weight associated with a link is the log-likelihood score between the words it connects. The link weights form the entries of the adjacency matrix A_W′, from which we obtain the Markov matrix T_W′ of first-order transition probabilities by normalizing the columns so that they each sum up to one.

We divide the set of all C(5952, 2) = 17,710,176 word pairs into two categories, synonyms and non-synonyms, where category membership is based on WordNet. In WordNet, synonymy information is captured in the WordNet synsets. A synset represents a certain concept and consists of its alternative names. Hence, two words are synonyms in WordNet if they are fellow members of a synset. For example, the following WordNet synsets encode the synonymy relations pedestrian ∼ walker, pedestrian ∼ footer, walker ∼ footer, footnote ∼ footer and urbanization ∼ urbanisation.

{pedestrian, walker, footer}: A person who travels by foot. {footnote, footer}: A printed note placed below the text on a printed page. {urbanization, urbanisation}: The social process whereby cities grow and societies become more urban.

Furthermore, we consider a word’s derived forms¹, in particular singular and plural, to be synonymous. 1,757 of the 17,710,176 word pairs fall into the category of synonyms. From these synonym pairs, we construct a test set of 2 ∗ 1,757 = 3,514 test instances as follows. Each synonym pair (v, w) gives rise to two test instances: one in which v serves as the problem word and w is the sought synonym (or the target, as we shall call it), and one in which w is the problem word and v is the target. In what follows, we will refer to this test set as SynonymsWN.

¹We obtain a word’s derived forms using the validForms function of the WordNet::QueryData perl interface (http://search.cpan.org/∼jrennie/WordNet-QueryData/).

Not all words which are synonyms in WordNet are necessarily synonyms in the BNC. Gull and mug, for example, both belong to the WordNet synset

{chump, fool, gull, . . . , mug}: A person who is gullible and easy to take advantage of.

In the BNC, however, gull and mug assume meanings which are very different from the above synset and which are far from being interchangeable. This is illustrated in Figure 8.3 which shows the neighborhoods of the two words in the word graph.

Figure 8.3: As the neighborhoods of gull and mug indicate, the two words have very different meanings in the word graph.

For this reason, we consider a second test set SynonymsWN+LSA, a subset of SynonymsWN consisting of those problem-target pairs which, in addition to being WordNet synonyms, show high contextual similarity in the corpus. Contextual similarity is measured as the cosine similarity of the words’ vectors in LSA space (cf. Section 9.2.4)

and assumes values between 0 and 1. More precisely, for SynonymsWN+LSA we drop all those problem-target pairs in the test set SynonymsWN which have a cosine similarity of less than 0.2, since these WordNet-attested synonymy relationships seem to be inconsistent with the corpus. This results in a reduced test set SynonymsWN+LSA which consists of 1,770 problem-target pairs. In the following section, we test whether and how well our method is able to reproduce the synonymy relationships between problem and target words.

8.3.2 Experimental set-up

The experiment is set up as follows. For each of the two test sets SynonymsWN and SynonymsWN+LSA, and for each problem word therein, we compile a set of ten potential synonyms which includes the target, i.e. the “true” synonym. Given a problem word, our method provides a score for each candidate synonym which is a measure of the degree of synonymy between the problem word and the candidate. We sort the candidate synonyms in descending order of their scores and record the rank of the target. The highest scoring candidate has a rank of one, the lowest scoring a rank of ten. The composition of the set of candidate synonyms determines the complexity of the task. We consider two different ways of choosing candidate synonyms, each of which gives rise to a different experiment:

1. Random selection (The candidate set consists of 9 words selected at random from the set of words represented in the word graph, plus the target.)

2. Selection based on contextual similarity to the problem word (The candidate set consists of the top 9 contextual neighbors of the problem word plus the target.)

Identifying the target in a set of randomly chosen words should be fairly easy. Candidates assembled according to selection criterion 2 all show a high degree of similarity to the problem word, and the task of tracking down the target becomes considerably harder. We will denote the experiments associated with the two modes of choosing candidate words as Exprandom and ExpLSA, respectively. Table 8.1 shows two of the test instances of both experiment Exprandom and experiment ExpLSA. The example test instances are all derived from the synonym pair (consequence, outcome). In experiment Exprandom, the set of alternative words is an arbitrary sample of the words in the word graph, chosen fully independently of the problem word. The set of alternatives is completed by adding the target, the last word in each list. In experiment ExpLSA, the alternatives are contextual neighbors and thus semantically closely related to the problem word. Many of the choices are near-synonyms of the problem word, which makes it much more difficult to spot the true synonym. In fact, besides the target at the end of the list, there are other valid synonyms among the alternatives (effect and result, respectively). If our method picks out one of these alternative synonyms, we consider this just as good a choice as picking out the target.

Exprandom
Problem word   Candidate synonyms (target last)
consequence    amphetamine, postcard, move, invention, ratio, cart, child, fulfillment, adult, outcome
outcome        plumbing, boats, inspection, hour, pool, frankincense, loneliness, fair, honesty, consequence

ExpLSA
Problem word   Candidate synonyms (target last)
consequence    impact, effect, change, implication, circumstance, disruption, uncertainty, disaster, behaviour, outcome
outcome        result, evaluation, assessment, effectiveness, uncertainty, conclusion, participation, trial, circumstance, consequence

Table 8.1: Two of the test instances used in experiments Exprandom (upper table) and ExpLSA (lower table).

A word may be a problem word of more than one test instance. For example, design occurs in several of the WordNet-derived synonym pairs. Its synonyms are invention, innovation, plan and pattern, each of which serves as the target of design in a test instance. Some of the test instances allude to different meanings of design. Since the LSA space is a representation of words rather than of word senses, the set of alternative words in experiment ExpLSA is the same for all test instances sharing the same problem word, design in this case. Synonymy is very sensitive to context, and in this form, our approach to synonymy extraction does not take context into consideration. In Section 8.5, we will offer some ideas on how to make the synonymy measure context-sensitive.

We evaluated our approach on all possible combinations of test set and experiment type. Figure 8.4 shows the distribution of the target ranks. Performance is disappointingly poor.

Let us first inspect the evaluation results on test set SynonymsWN+LSA (first column). In the easier task of tracking down the true synonym from among randomly chosen candidates (setup Exprandom), the target was the highest-scoring candidate in more than 40% of the test instances. However, there are almost as many cases in which the target received the lowest score of all candidates. Only very few target ranks fall in the middle of the range. We expected to see a rank distribution which is concentrated near the beginning of the range and decreases more or less rapidly towards its end. Instead, the bulk of the distribution agglomerates at both extremes. While less pronounced, the distribution of target ranks in experiment ExpLSA shows qualitatively the same behavior.

The test instances fall into three major categories, direct hit, complete failure and undecidable test problem, which correspond to the big peak at the lower end of the range, the big peak at the upper end of the range, and the small one in the middle. The category

[Figure: four histograms of relative frequency over ranks 1–10; rows Exprandom and ExpLSA, columns SynonymsWN+LSA and SynonymsWN.]

Figure 8.4: Histogram of the target ranks for each of the four experimental setups.

of undecidable test instances is composed of those test instances for which all candidates, including the target, received the same or similar scores. Candidates with the same score each receive the average rank. In particular, if all candidates score equally highly, they all receive the medium rank of 5.5.
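The tie-handling just described can be made concrete. The following sketch (a hypothetical helper, not the thesis’ actual evaluation code) assigns the target its rank among scored candidates, with tied candidates sharing the average of the ranks they jointly occupy:

```python
def target_rank(scores, target):
    """Rank of `target` among candidates (higher score = better rank).
    Candidates with equal scores all receive the average of the ranks
    they jointly occupy, as in the evaluation above."""
    t = scores[target]
    higher = sum(1 for s in scores.values() if s > t)
    tied = sum(1 for s in scores.values() if s == t)
    # the tie group occupies ranks higher+1 .. higher+tied
    return higher + (tied + 1) / 2

print(target_rank({"cat": 0.9, "dog": 0.5, "fox": 0.1}, "dog"))  # 2.0
# Ten candidates with identical scores: each receives the medium rank 5.5.
print(target_rank({f"w{i}": 0.0 for i in range(10)}, "w3"))      # 5.5
```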

We now compare these results with those obtained on the bigger test set SynonymsWN (right column in Figure 8.4). The target rank distributions are very similar to the corresponding distributions in the left column, except that some of the distributional mass at the ends of the range has flowed into the middle peak. This means there is a larger number of test instances in which the target is indistinguishable from the other candidates. We suggest that the bigger proportion of undecidable test problems in test set SynonymsWN can be explained by those problem-target pairs which, though synonyms in WordNet, do not show this synonymy in the corpus, and therefore seem just as arbitrary as the other alternatives.

To understand the poor performance of our method, let us have a look at the test instances for which the target was the lowest-scoring choice. In some of these cases, the problem word and the target, although members of the same WordNet synset, are not entirely exchangeable, but rather near-synonyms, such as bag and suitcase or picnic and outing. Near-synonyms naturally do occur in lists together, and our synonymy measure based on Hypothesis 1 rightly recognizes that these words are not genuine synonyms.

Another frequent cause of links between synonyms is ambiguity. Synonymy is a relationship between word senses rather than between words. Cooccurrences of synonymous words in which they both refer to the shared sense are rare, but synonyms may well cooccur if they refer to different senses. For example, pond and pool both refer to a small lake. But pool may also denote a swimming pool, and the chances that pool “swimming pool” and pond “small lake” occur together in a list are quite high. Such cases are extremely frequent among words which share the same systematic polysemy, and are synonyms in one of their shared senses.
The words apricot and peach, and chocolate and coffee, for example, are non-synonymous as fruits or food, but they are synonyms for the same color. Similarly, rabbit and hare are names of different animals, yet they are names for the same meal. In its current form, our approach is extremely vulnerable to such cases of regular polysemy coupled with synonymy. We expect that performance can be improved by taking context into account (cf. Section 8.5). But even apart from the cases listed above, very often the target word scored lowest although it seems to be a perfect synonym of the problem word. Going back to the corpus, we find that, contrary to our intuition, many synonyms in fact cooccur in lists surprisingly often. These cooccurrences of synonyms comprise conventionalized expressions such as “(under the) terms and conditions” and “(The) aim and purpose of”, but also many free coordinations:

• Consequence in the form of penalty and punishment is the subject of the next chapter.

• ...and craftsmen and artisans have reached as far south as Lisbon by the mid-twelfth century.

• Rob is full of stories and tales of fish landed and fish lost ...
• ...the continuing process of adaptation and adjustment without which ...
• Mr Morraine spent most of his time dusting his collection of strange-looking rocks and stones ...
• Even if a non-verbal visual form of representation is not in the form of an implicit picture or image ...

In particular, we observe a considerable number of “or”-coordinated synonyms. Coordinations such as

• The words Muslim or Moslem refer to the practitioners of Mohammed’s religion ...
• Within the chest, or thorax, lie the lungs, which take in oxygen from the air and expel carbon dioxide.

are sometimes even used to explicitly express synonymy between the constituent words. Another reason for the often strong connections between synonyms in the word graph is modification. Synonyms, once modified, co-occur in lists quite often. The coordinations

• death penalty and corporal punishment
• high-voltage power lines or supply cables
• college accommodation office or citizens advice bureau,

for example, are far from being redundant. Though the head nouns of the coordinands are synonyms, the coordinands themselves aren’t. The way we assemble the word graph, we (crudely) drop the modifying adjectives and nouns (cf. Section 3.3) and link the head nouns, which results in the following links for the examples given above: (penalty, punishment), (line, cable) and (office, bureau). Most importantly, however, genuine synonyms hardly ever exist. Even the most similar words show some (if only subtle) difference in meaning or connotation, and it is almost always possible to come up with a context in which they are not fully interchangeable. All in all, the assumption made in Section 8.1 that synonyms do not occur together in lists seems to be far too restrictive. Synonyms sometimes cooccur and sometimes don’t, but they do tend to cooccur with the same words.

8.4 Revised Hypothesis

We therefore set up the following revised hypothesis on how synonymy is encoded in the topology of the word graph:

Hypothesis 2 Synonyms correspond to pairs of nodes which may or may not be directly linked, but which are strongly connected via shared neighbors.

Based on this more moderate hypothesis, we define a new measure of synonymy. Instead of the difference between second order and first order transition probabilities, we now simply use the second order transition probabilities between words to measure their degree of synonymy. Formally, the degree of synonymy of two words (wi,wj) is expressed as

Syn(w_i, w_j) = 0.5 ∗ (t^(2)_ij + t^(2)_ji)    (8.2)

which is the (i, j) entry of the matrix (1/2) ∗ (T^2 + (T^2)ᵀ) and represents the average two-step transition probability between the nodes n_i and n_j representing w_i and w_j in the graph. We will now investigate whether we can improve synonymy recognition using the new synonymy measure. We test the new measure following the same evaluation process. Results are shown in Figure 8.5. The new synonymy measure yields considerably better results. In experiment setting

Exprandom, the previously large number of complete misses at the end of the rank range has been fully absorbed by the category of direct hits (rank = 1), which accounts for nearly 80% of all test instances. We observe a similar redistribution of ranks for the more difficult task

ExpLSA. Although in this case some of the mass at rank 10 remains, the rank distribution tails off towards the end of the range. 28% direct hits may not seem like a lot, but considering the difficulty of the task, and the fact that we evaluate against WordNet, which itself shows some inconsistency in the synonymy relationship, the results are encouraging. The effect of dropping the assumption that synonyms are weak first-order neighbors is illustrated in Table 8.2. The upper table shows some example rankings of the candidate synonyms associated with a problem word. The target, together with all other synonyms in the candidate list, is highlighted in bold font. Whereas the initial synonymy measure correctly and directly identifies the synonymy relationship between scene and setting, and between consequence and effect, it performs rather badly on most of the other example problems. The picture changes, however, if we use the revised synonymy measure to rank the candidates. The targets, as well as the other synonyms among the candidates, move towards the beginning of the list, occupying ranks between 1 and 3.
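Equation 8.2 can be illustrated on the toy four-node graph of Figure 8.2 (standing in for the word graph W; an illustration only):

```python
import numpy as np

def syn(T, i, j):
    """Revised synonymy measure (Eq. 8.2): the average two-step
    transition probability between nodes i and j."""
    T2 = T @ T
    return 0.5 * (T2[i, j] + T2[j, i])

# Toy adjacency matrix of Figure 8.2 (not the actual word graph).
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=0)  # first-order Markov matrix

# Nodes 1 and 2 (indices 0 and 1) are not directly linked but share all
# their neighbors; the revised measure scores them highly anyway.
print(round(float(syn(T, 0, 1)), 2))  # 0.33
print(round(float(syn(T, 0, 2)), 2))  # lower score for the mixed pair
```

Unlike the difference-based measure D̄, this score no longer penalizes node pairs that are also directly linked, which is exactly the relaxation Hypothesis 2 calls for.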

8.5 Conclusion

The initial idea of looking for words which are distant in terms of first-order but close in terms of second-order cooccurrence had to be revised as, contrary to our intuition, synonyms cooccur not only with the same words but with each other, too. The method based on this assumption works well in recognizing absolute synonymy relationships, such as between different spellings of the same word ((moslem, muslim), (urbanization, urbanisation)), derivational variants ((cut, cutting), (advertisement, advertising)), singular and plural forms ((bookshop, bookshops), (lady, ladies)), and full forms and their abbreviations ((gym, gymnasium), (fridge, refrigerator)). However, fully equivalent words are rare. We found that for tracking down close synonyms as well, we need to relax our assumptions about first-order cooccurrence of synonyms. Mere second-order cooccurrence

[Figure: two pairs of histograms, one pair for Exprandom and one for ExpLSA, each evaluated against SynonymsWN+LSA and SynonymsWN; the x-axis shows the rank, the y-axis the proportion of targets at that rank.]

Figure 8.5: Histogram of the target ranks for each of the four experimental setups obtained using the revised synonymy measure.

Initial synonymy measure

Rank  bag       suitcase   scene    setting     consequence   outcome        disability   impairment
 1    tray      basket     setting  scene       effect        participation  bereavement  disorder
 2    jar       tray       image    objective   impact        result         mobility     depression
 3    handbag   drawer     drama    emphasis    circumstance  circumstance   retirement   diarrhoea
 4    tin       handbag    film     aim         uncertainty   assessment     carers       stress
 5    envelope  desk       comedy   guideline   disaster      evaluation     pensioner    fatigue
 6    drawer    armchair   cinema   layout      disruption    conclusion     people       schizophrenia
 7    bottle    cardboard  movie    monitoring  change        uncertainty    age          personality
 8    packet    sofa       offence  smoke       behaviour     trial          illness      illness
 9    suitcase  raincoat   opera    style       outcome       effectiveness  handicap     handicap
10    box       bag        cameras  task        implication   consequence    impairment   disability

Revised synonymy measure

Rank  bag       suitcase   scene    setting     consequence   outcome        disability   impairment
 1    suitcase  bag        setting  style       effect        consequence    handicap     illness
 2    box       basket     image    scene       impact        participation  impairment   disability
 3    tray      tray       drama    objective   outcome       result         illness      handicap
 4    jar       drawer     film     emphasis    implication   circumstance   age          disorder
 5    handbag   handbag    comedy   task        circumstance  assessment     bereavement  depression
 6    bottle    desk       opera    aim         change        effectiveness  mobility     diarrhoea
 7    tin       armchair   cinema   guideline   behaviour     evaluation     retirement   stress
 8    envelope  cardboard  movie    layout      uncertainty   conclusion     carers       fatigue
 9    drawer    sofa       offence  monitoring  disaster      uncertainty    people       schizophrenia
10    packet    raincoat   cameras  smoke       disruption    trial          pensioner    personality

Table 8.2: Examples of synonym rankings using the initial (upper table) and the revised measure of synonymy (lower table).

strength (as measured by second-order transition probabilities) proved to be the more adequate measure of (near-)synonymy. A problem with our approach to learning synonyms is its vulnerability to polysemy. A way out may be to look only for synonyms within clusters of similar words as obtained by the curvature and link graph clustering approaches (cf. Sections 5.6 and 6.1). This would mean that we apply our technique to each of the subgraphs which correspond to the members of a cluster and their links within the cluster. We expect that better results could also be obtained by enriching the topological information of the word graph with contextual information as, for example, provided by Latent Semantic Analysis. More specifically, it is conceivable to compute a vector representation of each link in the word graph from the contexts it was observed in. This would allow us to consider only links related to a certain topic or context, and thus to extract context-sensitive synonyms.

Chapter 9

Related Work

9.1 Extraction of semantic relations

Several researchers have used lexical patterns to extract semantic relations from text. Some of these approaches are described in this section.

Automatic acquisition of hyponyms from large text corpora: The first major study on the acquisition of hyponymy relations from text was that of Hearst (1992). She discovered that certain syntactic constructions, which she calls lexico-syntactic patterns, are reliable indicators of a hyponymy relationship between words in the text. Consider the following example:

Make a final sowing of an early carrot, such as Amsterdam Forcing, Early Nantes, Almoro or Bertop, for pulling as baby carrots at around Christmas time.

Even if we have never encountered the terms Amsterdam Forcing or Early Nantes before, this sentence leads us to conclude that they are kinds of early carrots. Hearst identified several other patterns, for instance

• y, such as x1, ..., xn−1 and/or xn
• y, including x1, ..., xn−1 and/or xn
• x1, ..., xn and other y

which all indicate that the xi are kinds of y. In addition, Hearst suggests that new patterns can be learned automatically by the following bootstrapping algorithm:

1. Start with a set of word pairs which exhibit a clear hyponymy relationship.
2. Look in the text for places where hypernym and hyponym occur close to each other and remember the syntactic context.
3. Examine the set of contexts. Groups of similar syntactic patterns give rise to new patterns which can be used to extract more hyponymy relationships.
4. Add the newly discovered hyponymy relationships to the set of known hypernym-hyponym pairs and go to Step 2.
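As an illustration, the "such as" pattern can be approximated with a simple regular expression. This is a hedged sketch, not Hearst's implementation: real extractors match part-of-speech-tagged noun phrases, whereas the toy regex below only handles one- or two-word terms and assumes the hyponym list ends the text fragment.

```python
import re

# A toy approximation of the pattern "y, such as x1, ..., xn-1 and/or xn".
SUCH_AS = re.compile(
    r"(\w+(?: \w+)?), such as "
    r"((?:\w+(?: \w+)?(?:, | and | or ))*\w+(?: \w+)?)")

def extract_hyponyms(text):
    """Return (hypernym, hyponym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for hyponym in re.split(r", | and | or ", match.group(2)):
            pairs.append((hypernym, hyponym))
    return pairs

fragment = ("Make a final sowing of an early carrot, such as "
            "Amsterdam Forcing, Early Nantes, Almoro or Bertop")
print(extract_hyponyms(fragment))
```

On the example fragment this recovers "early carrot" as the hypernym of each of the four carrot varieties; a production system would of course need chunked noun phrases to delimit the list reliably.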

Automatic Acquisition and Expansion of Hypernym Links: Prométhée (Morin and Jacquemin, 2004) is a supervised system which extracts conceptual relations between terms. It builds on Hearst (1992)'s work on the automatic extraction of hypernym links using lexico-syntactic patterns. The process of learning conceptual relations can be outlined as follows:

1. Select a conceptual relation, e.g., hypernymy.
2. Collect a list of term pairs linked through the selected relation, either manually or by using a thesaurus or another knowledge base.
3. Find sentences in which these term pairs occur and represent the sentences as lexico-syntactic expressions.
4. Produce candidate lexico-syntactic patterns by clustering similar lexico-syntactic expressions.
5. Let an expert validate the candidate lexico-syntactic patterns.
6. Use the new patterns to extract more pairs of candidate terms.
7. Let the expert validate the pairs of candidate terms, and go to Step 3.

To reduce human involvement, the developers suggest feeding Prométhée with the output of an unsupervised relation extraction system instead of manually collecting pairs of semantically related terms in Step 2. Unsupervised relation extraction systems usually identify relations between terms based on co-occurrences in sentences or documents using association measures such as mutual information and log-likelihood. Whereas they achieve high coverage in identifying semantically related words, their precision is low. However, the pairs of semantically related words are well suited as input to a supervised system such as Prométhée, which is able to sort out noisy pairs of terms. Finkelstein-Landau and Morin (1999) designed a technique for the expansion of links between single-word terms to links between multi-word terms. For example, given a link between apple and fruit, a similar link between apple juice and fruit juice, and between apple growing and cultivation of fruit, is inferred.
Two phrases p1(w1, w2) and p2(w1′, w2′) containing the content words w1, w2 and w1′, w2′ respectively are considered to be semantic variants of each other if

• w1 and w1′ are head words and w2 and w2′ are arguments with similar thematic roles.
• w1 and w1′ and/or w2 and w2′ are semantically related through a relation S (e.g., synonymy, hypernymy). The non-semantically related words are either identical or morphologically related.
• p1(w1, w2) and p2(w1′, w2′) are linked through the same semantic relation S.

Morin and Jacquemin use a set of predefined syntactic patterns describing the possible syntactic realizations of semantic variants, and morphological and semantic relatedness between the single-word terms wi and wi′, to establish links between the multi-word terms p1(w1, w2) and p2(w1′, w2′). For example, the syntactic pattern (w2 w1) ⇒ (w1′ Prep Det? Adj* w2′) describes the variation between apple growing and cultivation of fruit. Since apple and fruit are related via the hyponymy relationship, and growing and cultivation are synonyms, it can be inferred that apple growing is a kind of cultivation of fruit.

Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction: Cederberg and Widdows (2003) follow Hearst (1992)'s approach of using lexico-syntactic patterns to automatically acquire hyponymy relations from text. Latent Semantic Analysis (LSA) is used to filter the results, which yields an increase in precision. Hyponymy relations not attested in the corpus are deduced from noun coordination information, leading to improved recall. The lexico-syntactic patterns identified in (Hearst, 1998) are used to extract candidate hypernym-hyponym relations. The list of possible relations is sorted according to a context-based similarity score between the candidate hypernym and hyponym. The similarity score used is based on a vector model, as commonly used in Information Retrieval, in which each word is represented by a vector of co-occurrence counts of the word in question with a set of content-bearing words. This high-dimensional vector space is then reduced using LSA, and similarity between words is computed as the cosine of the angle between the corresponding vectors in the reduced space (Deerwester et al., 1990). The top ranking hypernym-hyponym pairs can be considered valid relations: not only do the terms appear in a construction which frequently indicates a hyponymy relation, but the high LSA similarity, indicating high contextual similarity of the two terms, further enhances the plausibility of the candidate relation. Only part of all correct hypernym-hyponym pairs detectable in the corpus actually occur in the constructions identified by Hearst. To deduce hypernym-hyponym relations not covered by these patterns, Cederberg and Widdows use noun coordination information.
Nouns which co-occur in lists are often semantically similar and tend to originate not only from the same semantic field but even from the same ontological level; e.g., there are many more instances of “fruit and vegetables” and “apples and pears” to be found in a corpus than instances of “apples and vegetables” or “carrots and fruit”. From the hypernym-hyponym pairs (p, c) which passed the LSA filter, a clustering algorithm (Widdows and Dorow, 2002) based on co-occurrences of nouns in lists is used to find nouns c̃ which are similar to the hyponym c. These neighbors are used to infer new hypernym-hyponym relations (p, c̃), namely the pairings of the hypernym p with each of the neighbors c̃ of the hyponym c. Because of ambiguity, the hyponym neighbors c̃ often do not correspond to the meaning of the hyponym c in the original hypernym-hyponym pair (p, c); e.g., mass co-occurs in lists with length, weight, etc., but inferring length IS-A religious service from the correct relation mass IS-A religious service results in an invalid relation. To avoid the derivation of incorrect relations (p, c̃), once again LSA is applied to filter out relations for which the LSA similarity between the hypernym p and the potential new hyponym c̃ falls below a certain threshold. In cases where the hypernym p consists of several words, Cederberg and Widdows first determine the noun np in the compound whose meaning is most closely related to the meaning of the hyponym c, and use only np instead of the entire compound p for filtering. For example, in the relation mass IS-A religious service, religious is more predictive of the meaning of mass than is service. Predictivity is measured using LSA similarity. That is, filtering proceeds as follows:

• Compute LSA similarity between each noun in the multi-word hypernym p and the original hyponym c.

• Choose the noun np in the compound which scores highest.

• To filter the derived potential hypernym-hyponym pairs (p, c̃), compute the LSA similarity between each of the potential new hyponyms c̃ and np. Those pairs (p, c̃) for which this similarity exceeds a certain threshold are deemed valid relations.

Since length and religious are not contextually similar, the invalid relation length IS-A religious service is eliminated.
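The cosine-threshold filtering step can be sketched as follows. The vectors below are invented stand-ins for reduced-space LSA vectors, and `filter_pairs`, `cosine` and the threshold value are hypothetical names and settings chosen for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy reduced-space (LSA-like) vectors; the values are invented.
vectors = {
    "religious": np.array([0.9, 0.1, 0.0]),
    "mass":      np.array([0.7, 0.2, 0.1]),
    "length":    np.array([0.0, 0.1, 0.9]),
    "weight":    np.array([0.1, 0.0, 0.9]),
}

def filter_pairs(hypernym_noun, candidates, threshold=0.5):
    """Keep candidate hyponyms whose similarity to the most predictive
    noun of the hypernym exceeds the threshold."""
    return [c for c in candidates
            if cosine(vectors[hypernym_noun], vectors[c]) >= threshold]

# 'religious' stands in for the compound 'religious service'; the inferred
# candidates 'length' and 'weight' are rejected, while 'mass' survives.
print(filter_pairs("religious", ["mass", "length", "weight"]))  # → ['mass']
```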

9.2 Semantic similarity and ambiguity

9.2.1 Bootstrapping semantic categories

This section reviews several studies which aim at bootstrapping categories of similar words from a set of known “seed” category members.

A Corpus-Based Approach for Building Semantic Lexicons: Riloff and Shepherd (1997) propose a method for growing semantic categories given a small set of seed category members. The algorithm relies on the assumption that the members of a category are found in close vicinity of each other in text. Category members are assembled using the following bootstrapping algorithm: 1. Start with a set of seeds S of known category members.

2. Search the text for occurrences of the seed words, and collect narrow contexts around each occurrence. Pick out the nouns appearing in these contexts. They are candidates for new seed words.

3. Compute the affinity of each candidate n to the set of category seeds S using Equation 9.1.

4. Add the k top scoring seed candidates n1, ..., nk to the seed list: S = S ∪ {n1, ..., nk}, and go to Step 2.

The affinity of a candidate noun n to the category, which is represented by the current set of seeds S, is computed in the following way:

Aff(n) = Σ_{c∈C(S)} f(n, c) / f(n)    (9.1)

where

• C(S) is the set of all contexts of seeds s ∈ S

• f(n) is the frequency of the noun n in the corpus
• f(n, c) is the frequency of n in context c.

Bootstrapping is stopped after a certain number of iterations, and the seed list S is used to produce a list of candidate category members, which is sorted by affinity to S. The list is then given to the user who decides which of the nouns in the list are true category members.
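The bootstrapping loop around Equation 9.1 can be sketched roughly as follows; the toy contexts and the parameters `k` and `iterations` are invented for illustration, and real contexts would be extracted from corpus text around seed occurrences.

```python
# Toy corpus: each "context" is a small window of nouns, a simplification
# of the narrow textual contexts collected around seed occurrences.
contexts = [
    ["gun", "rifle", "bullet"],
    ["rifle", "pistol"],
    ["pistol", "holster", "gun"],
    ["apple", "pear"],
]

def affinity(noun, seeds):
    """Aff(n): seed-context occurrences of n divided by f(n) (cf. Eq. 9.1)."""
    freq = sum(ctx.count(noun) for ctx in contexts)              # f(n)
    if freq == 0:
        return 0.0
    seed_contexts = [c for c in contexts if any(s in c for s in seeds)]
    return sum(c.count(noun) for c in seed_contexts) / freq

def bootstrap(seeds, k=2, iterations=2):
    seeds = set(seeds)
    for _ in range(iterations):
        candidates = {n for ctx in contexts for n in ctx} - seeds
        ranked = sorted(candidates, key=lambda n: affinity(n, seeds),
                        reverse=True)
        seeds.update(ranked[:k])                                 # grow seed list
    return seeds

print(sorted(bootstrap({"gun"})))  # → ['bullet', 'gun', 'holster', 'pistol', 'rifle']
```

Note how the rare words bullet and holster reach the top score of 1.0 before the more frequent rifle and pistol: exactly the low-frequency bias of this statistic that Roark and Charniak (1998) address.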

Noun-phrase co-occurrence statistics for semiautomatic semantic lexicon construction: Roark and Charniak (1998) adopt Riloff and Shepherd (1997)'s approach to learning semantic categories. However, they obtain more coherent results by

• Using more structured contexts for collecting seed candidates

• Using more elaborate statistics for selecting new seed words and for ranking candidate category members.

Rather than considering all nouns in the vicinity of a seed as seed candidates, the authors focus on very specific noun constructions, namely conjunctions (“cars and trucks”), coordinations (“man, woman and child”), appositives (“The African elephant shrew, a highly-strung insect-eating mammal”) and compound nouns (“panther cat”), which are often composed of nouns belonging to the same semantic category. Roark and Charniak state that different statistics are needed for seed word selection and the final ranking of category members. New seed words are selected using a statistic which is very similar to the one used by Riloff and Shepherd:

Aff(n) = Σ_{s∈S} f(n, s) / f(n, ∗)    (9.2)

where

• f(n, s) is the number of times the candidate noun n and the seed s jointly appear in one of the earlier listed noun constructions

• f(n, ∗) is the total number of cooccurrences of noun n with any word in such a noun construction.

Since this measure tends to favor low frequency words over more frequent ones, Riloff and Shepherd had to employ a minimum frequency cut-off, which not only eliminates noise, but also disposes of a lot of valid category members. Rather than using a minimum frequency cut-off, Roark and Charniak propose to properly deal with these low frequency words in the final ranking. Whereas Riloff and Shepherd use the same statistic for both tasks (seed word selection and final ranking), Roark and Charniak use the log-likelihood measure (Dunning, 1993) to decide which of the final candidates are likely to be true category members. The

log-likelihood measure has been reported to accurately predict the unexpectedness of cooccurrence counts of low to medium frequency words. Therefore, those low frequency words which were erroneously added to the seed list during the bootstrapping procedure are demoted in the final candidate list. The authors conclude that each of the two statistics is well suited for one of the tasks but not the other. The empirical probability of Equation 9.2 accurately handles high frequency words. This is very important in the process of growing the list of seed words. Adding a highly frequent incorrect word to the seed list is particularly dangerous, as it will cause lots of other non-category members to become seeds and contaminate the category. Wrongly included low frequency words, on the other hand, do not do as much harm. However, for the final step of ranking category candidates, it is important to detect erroneous low frequency words which sneaked into the seed list because of the empirical probability's tendency to promote low frequency words.
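A standard way to compute Dunning's log-likelihood ratio for a 2×2 cooccurrence table is sketched below; this is a generic implementation of the measure, not Roark and Charniak's code, and the interpretation of the four counts is an assumption for illustration.

```python
import math

def log_likelihood(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G2) for a 2x2 contingency table.
    k11: noun-seed cooccurrences; k12/k21: remaining marginal counts;
    k22: everything else."""
    def h(*counts):  # sum of k * log(k / total) over the non-zero counts
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))

print(log_likelihood(10, 10, 10, 10))  # ≈ 0 for an independent table
print(log_likelihood(20, 5, 5, 70))    # clearly positive: strong association
```

Unlike the empirical probability of Equation 9.2, this score stays small for sparse, uninformative counts, which is why it suits the final ranking step.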

A Graph Model for Unsupervised Lexical Acquisition and Automatic Word- Sense Disambiguation: The algorithm described in (Widdows and Dorow, 2002) has the same basic structure as the approaches of Riloff and Shepherd (1997) and Roark and Charniak (1998). Just as in the two preceding algorithms, a semantic category is grown around a set of seed words by incrementally selecting new seeds based on their affinity to the current set of seeds. The novelty of the algorithm is the way new seed words are selected. To measure affinity of a word to a set of seeds, the authors rely on a word graph. Affinity is defined as the degree to which a word’s neighborhood in the graph lies within the neighborhood of the current set of seeds. The authors, too, focus on nouns cooccurring in lists, assuming that nouns in lists have similar properties. To build a word graph, lists of nouns are extracted from the corpus. Then, a graph is built by drawing a node for each noun and establishing a link between any pair of nouns which form part of a list. In each iteration of the incremental algorithm, each word in the neighborhood of the current set of seeds is considered for addition to the seed list1. The affinity of a candidate seed w to the current set of seeds S is defined to be the percentage of its links which lie within the neighborhood N(S) of S:

Aff(w) = |N(w) ∩ N(S)| / |N(w)|    (9.3)

The picture series in Figure 9.1 illustrates the algorithm. Bootstrapping is launched using node 1 as the seed of the semantic category to be assembled. As can be seen, the more a candidate seed's links point away from S, the less likely it is added to the seed list, making the algorithm particularly robust to what Roark and Charniak name “infection” (the process of the seed list drifting away from the intended category due to the inclusion of a spurious or ambiguous word).

¹The neighborhood N(u) of a node u in the graph is the set of nodes which are directly connected to u. The neighborhood of a set of nodes is simply the union of the individual neighborhoods.

[Figure: four snapshots of bootstrapping on an example graph of 23 nodes, each listing the affinities of the current candidate seeds:
Step 1 — node 3: 2/3, node 4: 2/3, node 5: 1/2, node 6: 1/2
Step 2 — node 3: 2/3, node 4: 2/3, node 5: 1/2, node 6: 1/2
Step 3 — node 5: 2/3, node 6: 1/2, node 10: 1/2, node 13: 2/3, node 16: 1/3
Step 4 — node 6: 1/2, node 10: 1/2, node 14: 1/3, node 15: 1, node 16: 2/3, node 18: 1/2, node 19: 1/3]

Figure 9.1: Bootstrapping a semantic category in the word graph. Red is used to indicate the current cluster, blue marks the cluster's neighborhood.
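Equation 9.3 can be computed directly from adjacency sets. The toy graph below is invented, with "apple" deliberately linked to both fruit words and a company word, to show how a candidate whose links point away from the seed neighborhood scores low.

```python
# Toy undirected word graph (invented links); "apple" is ambiguous between
# its fruit sense and its company sense.
graph = {
    "apple":     {"pear", "orange", "ibm"},
    "pear":      {"apple", "orange"},
    "orange":    {"apple", "pear", "lemon"},
    "lemon":     {"orange"},
    "ibm":       {"apple", "microsoft"},
    "microsoft": {"ibm"},
}

def neighborhood(nodes):
    """N(S): the union of the direct neighbors of the nodes in S."""
    return set().union(*(graph[n] for n in nodes))

def affinity(word, seeds):
    """Aff(w) = |N(w) ∩ N(S)| / |N(w)|  (Equation 9.3)."""
    nw = graph[word]
    return len(nw & neighborhood(seeds)) / len(nw)

print(affinity("pear", {"apple"}))  # → 0.5: half of pear's links stay near the seed
print(affinity("ibm", {"apple"}))   # → 0.0: ibm's own links point away from N(S)
```

Even though ibm is directly linked to the ambiguous seed apple, its affinity is 0 because none of its neighbors lie inside N(S), which is what makes the measure robust to infection.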

9.2.2 Hard clustering of words into classes of similar words

We distinguish two types of approaches for extracting semantic categories from text: hard and soft clustering methods. Hard clustering approaches allow words to belong to a single cluster only. Soft clustering approaches, on the other hand, allow for multiple cluster membership. This section describes some of the hard clustering techniques.

Automatic construction of a hypernym-labeled noun hierarchy from text: Instead of just clustering nouns into classes of similar nouns, Caraballo's algorithm organizes the nouns in a text in the form of a hierarchy of hyponymy relationships. She adopts the working hypothesis of Roark and Charniak (1998) that nouns which occur together in coordinations, conjunctions or appositive structures are semantically related. All cooccurrences of nouns within these constructions are collected in a cooccurrence matrix C. The row (resp. column) vectors of the matrix contain the cooccurrence counts of a noun with all other nouns, and act as semantic profiles. Caraballo computes the similarity between any two of the nouns using the cosine similarity measure². Replacing the cooccurrence counts in the cooccurrence matrix with the cosine similarities results in a matrix S which consists of all noun-noun similarities³. The following agglomerative clustering procedure is used to arrange the set of nouns in a hierarchy:

1. Initially, each noun forms a (leaf) node in the hierarchy.

2. Select the highest entry S(i, j) = sim(ni, nj) of the similarity matrix S. Introduce a new node nij which subsumes the nouns ni and nj.

3. If all nodes have been assembled under a common root node, stop. Otherwise, update S using Formula 9.4 and return to Step 2.

Let us assume that a newly introduced node nAB resulted from merging two nodes nA and nB. The similarity between the new node nAB and another node nC (which is not subsumed by nAB) is defined to be a weighted average of the similarities sim(nA, nC) and sim(nB, nC) of the merged nodes nA and nB with nC:

sim(nAB, nC) = [w(nA) ∗ sim(nA, nC) + w(nB) ∗ sim(nB, nC)] / [w(nA) + w(nB)]    (9.4)

where w(n) is the number of descendants of node n. Figure 9.2 illustrates the agglomerative clustering process by means of a hypothetical similarity matrix. To get the idea of how the similarity matrix S is updated, let us use Formula 9.4 to recompute similarities after node n123 has been introduced. Node n123 is

²The cosine measure is given by the scalar product of the vectors after they have been normalized. It is the cosine of the angle enclosed by the two vectors.
³If the rows of the cooccurrence matrix C are normalized, then the similarity matrix S is simply the product of C with itself: C² = C Cᵀ = (ciᵀ cj)i,j = (⟨ci, cj⟩)i,j = (sim(ni, nj))i,j = S.

the result of merging nodes n12 and n3. Hence, the similarity of the new node n123 and node n4, for example, is

sim(n123, n4) = [w(n12) ∗ sim(n12, n4) + w(n3) ∗ sim(n3, n4)] / [w(n12) + w(n3)]
             = (2 ∗ 0.15 + 1 ∗ 0.3) / (2 + 1)
             = 0.2

The similarity matrix S evolves as follows while the noun hierarchy grows:

      1    2    3    4    5
 1   1.0  0.8  0.6  0.1  0.2
 2   0.8  1.0  0.7  0.2  0.3
 3   0.6  0.7  1.0  0.3  0.4
 4   0.1  0.2  0.3  1.0  0.6
 5   0.2  0.3  0.4  0.6  1.0

(merge 1 and 2 into node 12)

      12    3     4     5
 12   1.0   0.65  0.15  0.25
 3    0.65  1.0   0.3   0.4
 4    0.15  0.3   1.0   0.6
 5    0.25  0.4   0.6   1.0

(merge 12 and 3 into node 123)

      123   4    5
 123  1.0  0.2  0.3
 4    0.2  1.0  0.6
 5    0.3  0.6  1.0

(merge 4 and 5 into node 45)

      123    45
 123  1.0    0.225
 45   0.225  1.0

(merge 123 and 45 into the root node 12345)

Figure 9.2: The Agglomerative Clustering approach.
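The clustering procedure and the update of Formula 9.4 can be sketched as follows; run on the toy similarities of Figure 9.2, it reproduces the worked example value sim(n123, n4) = 0.2. The `merge_once` helper is a hypothetical name introduced for illustration.

```python
# Pairwise similarities between the current nodes (toy data of Figure 9.2).
sim = {
    ("1", "2"): 0.8, ("1", "3"): 0.6, ("1", "4"): 0.1, ("1", "5"): 0.2,
    ("2", "3"): 0.7, ("2", "4"): 0.2, ("2", "5"): 0.3,
    ("3", "4"): 0.3, ("3", "5"): 0.4,
    ("4", "5"): 0.6,
}
weight = {n: 1 for n in "12345"}  # number of leaf descendants per node

def get(a, b):
    return sim.get((a, b), sim.get((b, a)))

def merge_once():
    a, b = max(sim, key=sim.get)            # most similar pair of nodes
    new = a + b
    others = {n for pair in sim for n in pair} - {a, b}
    for c in others:                        # weighted average, Formula 9.4
        sim[(new, c)] = (weight[a] * get(a, c) + weight[b] * get(b, c)) \
                        / (weight[a] + weight[b])
    weight[new] = weight[a] + weight[b]
    for pair in list(sim):                  # drop entries involving a or b
        if a in pair or b in pair:
            del sim[pair]
    return new

merge_once()                       # merges 1 and 2 into "12"
merge_once()                       # merges "12" and 3 into "123"
print(round(get("123", "4"), 3))   # → 0.2, as in the worked example
```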

The coordination patterns which Caraballo considers when counting cooccurrences in- clude constructions such as n1,...,nk and other n from which she concludes that n is likely 152 9. Related Work

to be a hypernym (superordinate term) of the nouns ni. She exploits this information for the purpose of labeling the nodes in the noun hierarchy. Labeling of the tree is done in a bottom up fashion, starting with the leaves first. For each leaf node, a binary vector is compiled which specifies whether a noun has been seen as a hypernym of the leaf (1) or not (0). For labeling the internal nodes of the tree, the vectors associated with its children are summed up. The largest entry (entries) of a node are then chosen as node label(s). Due to the way labels are assigned, it may happen that the label of an ancestor is not a hypernym of the label of a descendant. Caraballo states that the resulting noun hierarchy is highly redundant, which is mainly due to its binary structure. To collapse similar nodes, Caraballo proposes to summarize subtrees with uniform node labels under a single superordinate node.

9.2.3 Soft clustering of words into classes of similar words

An advantage of soft clustering approaches over hard clustering methods is their ability to assign words to several clusters if necessary. The clusters a word belongs to can be regarded as being representative of its different senses. This section describes some of the soft clustering approaches.

Distributional Clustering of English Words: Pereira et al. (1993) use soft distributional clustering to group words into similarity classes with the aim of overcoming data sparseness. The behavior of a rare word can then be predicted from the average behavior of the words similar to it. The authors focus on clustering nouns according to the verbs with which they appear in verb-direct object relations. These relations are extracted from a parsed corpus as verb-noun pairs (v, n) such that n is the direct object of v. Each noun n is considered to be a probability distribution Pn over the set of verbs V:

Pn(v) = P(v|n), v ∈ V    (9.5)

Pn(v) can be interpreted as the probability that, given that we see noun n in a verb / direct object relationship (ṽ, n), ṽ coincides with v. The Pn can be regarded as points in a similarity space which consists of all distributions over the set of verbs V. The authors use the Kullback-Leibler (KL) divergence D(·, ·), a commonly used measure for comparing probability distributions, to measure similarity (or rather dissimilarity) in this space. The dissimilarity between two nouns n1 and n2 then corresponds to the KL divergence

D(Pn1, Pn2) of the corresponding distributions Pn1 and Pn2, which equals

D(Pn1, Pn2) = Σ_{v∈V} Pn1(v) log(Pn1(v) / Pn2(v))    (9.6)

and which measures the loss of information in using the distribution Pn2 instead of Pn1 to describe n1.
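Equation 9.6 translates directly into code; the two verb distributions below are invented for illustration. Note that D(p‖q) requires q to be positive wherever p is, and that the measure is not symmetric.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over v of p(v) * log(p(v) / q(v))  (cf. Equation 9.6).
    p and q map verbs to probabilities; q must be positive on p's support."""
    return sum(pv * math.log(pv / q[v]) for v, pv in p.items() if pv > 0)

# Invented verb distributions for two nouns with similar direct-object behavior.
p_wine = {"drink": 0.6, "pour": 0.3, "spill": 0.1}
p_beer = {"drink": 0.5, "pour": 0.4, "spill": 0.1}

print(kl_divergence(p_wine, p_beer))  # small: the nouns behave alike
print(kl_divergence(p_beer, p_wine))  # close, but not equal: KL is asymmetric
```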

The goal is to find a set C of similarity clusters with centroids Pc,c ∈ C (corresponding to points in the similarity space) and cluster membership probabilities P (c|n) so that Pn(v) can be decomposed as follows:

Pn(v) = P(v|n) = Σ_c P(c|n) Pc(v), v ∈ V    (9.7)

That is, each noun n is a weighted average of the clusters it belongs to, the weights being the affinity P(c|n) = P(n ∈ c) with which n belongs to the cluster. Conversion of Equation 9.7 (using the definition of conditional probability) yields

Pn(v) = P(v|n) = Σ_c P(n|c) P(v|c) P(c)    (9.8)

This amounts to the assumption that within a cluster c, the events v and n are conditionally independent, i.e., v and n are associated only through the clusters c. The P(c|n), c ∈ C, n ∈ N and the P(v|c), v ∈ V, c ∈ C are the unknown model parameters which have to be estimated. We want to estimate the model parameters in such a way that the model best fits the observed data; that is, the average distance of the n ∈ N to the clusters c ∈ C, taking into account the membership probabilities P(c|n), should be minimal:

D = Σ_n Σ_c P(c|n) D(Pn, Pc) −→ min    (9.9)

Without any further restrictions, an optimal clustering is obtained by letting each noun constitute a cluster by itself (i.e., C = N) and by assigning each noun with probability 1 to itself (i.e., P(c|n) = 1 if c = n, and 0 otherwise). Therefore, as few distributional assumptions as possible should be made about cluster membership. This is achieved by adding the constraint that the average entropy H of the cluster membership distributions is maximal (that is, cluster membership is maximally random):

H = − Σ_n Σ_c P(c|n) log(P(c|n)) −→ max    (9.10)

Constraints 9.9 and 9.10 can be combined into one single constraint, namely

F = D − H/β −→ min (9.11)

F is called the free energy and β is a free parameter. Minimization of the free energy F corresponds to finding a trade-off between minimizing distortion and maximizing entropy. The distortion D and the entropy H depend on each other (the cluster distributions P(v|c) obviously affect the cluster membership probabilities P(c|n) and vice versa). It is therefore difficult to minimize distortion and maximize entropy at the same time. Pereira, Tishby and Lee follow a two-stage iterative approach to minimize the free energy F. In the first step of the iterative process, the distortion D is minimized under fixed class membership probabilities P(c|n), and thus fixed entropy H, yielding estimates P(v|c) for

the cluster distributions. In the second iteration step, the entropy H of the membership distributions is maximized under the constraint that the distortion D and the cluster distributions P(v|c) be fixed at their current estimates. The cluster membership probabilities P(c|n) maximizing H are found using Lagrange multipliers and are a function of the current cluster distributions P(v|c). This procedure leads to the following re-estimation formulas for the cluster membership distributions P(c|n) and the cluster centroid distributions P(v|c):

P(v|c) = Σ_n P(n|c) PMLE(v|n)

and

P(c|n) = Zn ∗ e^(−β D(Pn, Pc))

The first of these formulas tells us that each cluster is represented by a distribution which is a weighted average of the distributions of its constituents, the weights P(n|c) indicating the probability of observing noun n in cluster c. The second formula is intuitively clear as well: the probability of a noun n belonging to class c decreases as the distance between the noun's distribution and the cluster distribution increases. Zn is a normalizing constant which ensures that P(c|n) is a probability distribution, i.e., Σ_c P(c|n) = 1.
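The two re-estimation steps can be sketched as follows for a fixed β. The toy noun-verb distributions and initial memberships are invented, and the simplification P(n|c) ∝ P(c|n) (i.e., a uniform prior over nouns) is an assumption of this sketch, not part of the original formulation.

```python
import numpy as np

# Rows of P: empirical verb distributions P_n for three toy nouns.
P = np.array([[0.8, 0.2, 0.0],
              [0.7, 0.3, 0.0],
              [0.0, 0.1, 0.9]])
memb = np.array([[0.9, 0.1],   # P(c|n): initial, hand-picked memberships
                 [0.8, 0.2],
                 [0.1, 0.9]])
beta = 5.0                     # free parameter of the free energy

def kl(p, q):
    """D(p || q), summing only over the support of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for _ in range(10):
    # Step 1: centroids P(v|c) as membership-weighted averages of the P_n
    # (with P(n|c) obtained by normalizing the membership columns).
    pn_c = memb / memb.sum(axis=0)
    centroids = pn_c.T @ P
    # Step 2: memberships P(c|n) ∝ exp(-beta * D(P_n, P_c)), normalized per noun.
    d = np.array([[kl(P[n], centroids[c]) for c in range(2)]
                  for n in range(3)])
    memb = np.exp(-beta * d)
    memb /= memb.sum(axis=1, keepdims=True)

print(memb.round(2))  # nouns 0 and 1 settle in one cluster, noun 2 in the other
```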

Clustering By Committee: Pantel and Lin (2002) developed an algorithm called Clustering By Committee for the automatic discovery of word senses from corpus text. To begin with, the authors recursively identify a set of tight clusters with highly different features, called committees. Words are then assigned to the committees with which they share the biggest overlap of features, forming concept clusters. Once a word is assigned to a concept cluster, the overlapping features are removed from the word, which enables the discovery of its less frequent senses. A parser is used to extract grammatical relationships of the form (head, relation, modifier) from the corpus. Each word is then represented by a feature vector whose dimensions correspond to possible grammatical contexts (relation, modifier) of the word. For example, the grammatical contexts of the word “reptile” (i.e., the features corresponding to the positive entries in the feature vector) are:

(subj-of, survive), (subj-of, move), ...
(obj-of, catch), (obj-of, chase), ...
(nmod-of, eggs), (nmod-of, skin), ...
(nmod, lizard), (nmod, adult), ...
(adjmod, slow), (adjmod, herbivorous), ...

Since some features are more descriptive of a word w than others, Lin defines the value of a feature f for a word w to be the pointwise mutual information MI(w, f) = P(w, f) / (P(w) ∗ P(f)) between w and f, which can be approximated by

MI(w, f) = (|w, f| ∗ |∗, ∗|) / (|w, ∗| ∗ |∗, f|)    (9.12)

using the maximum likelihood estimates P̂(w) = |w, ∗| / |∗, ∗|, P̂(f) = |∗, f| / |∗, ∗| and P̂(w, f) = |w, f| / |∗, ∗| to estimate the probabilities P(w), P(f) and P(w, f). Since mutual information is known to favor low frequency words/features, the MI scores are adjusted via multiplication with an empirical discounting factor

a = [P̂(w, f) / (P̂(w, f) + 1)] × [min(P̂(w), P̂(f)) / (min(P̂(w), P̂(f)) + 1)]    (9.13)

The similarity between two words w1 and w2 is then defined to be the cosine of the angle between the corresponding MI vectors vw1 and vw2:

sim(w1, w2) = ⟨vw1, vw2⟩ / (‖vw1‖ ∗ ‖vw2‖)    (9.14)

To cluster a set of words W, the algorithm performs the following three steps:

1. For each w ∈ W compute its top n neighbors N(w), where the neighbors are ranked according to the cosine similarity measure defined in Equation 9.14 earlier.

2. Recursively identify a set of tight and mutually far-off clusters (the so-called com- mittees).

3. Assign each w to the cluster to whose committee it is most similar.

To determine committees, candidate committees are extracted from the set of local neighborhoods N(w) in the following way. Each of the local neighborhoods N(w) is divided into clusters (by means of average-link clustering). The tightest of the resulting clusters (the cluster whose average similarity between cluster members is biggest) is added to a list of candidate committees. The list of candidate committees is sorted in descending order of tightness. Starting with the tightest cluster, the candidates are iteratively added to the final list of committees if their similarity to each of the previously extracted committees falls below a certain threshold. Finally, in Step 3, concept clusters are computed, each of which initially corresponds to a committee. This is done by assigning each w to the concept cluster whose committee (more precisely, whose committee centroid) it is most similar to. To prevent ambiguous words from affecting cluster quality, words are only assigned to a cluster and not to the committee of the cluster. To allow a word w to belong to several concept clusters, after addition of w to a cluster c, the features which w shares with c are removed from w; re-comparison of w's reduced feature vector with the committee centroids then reveals the less prevalent senses of w.
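Step 3 and the feature-removal trick can be sketched as follows. The committee centroids, feature weights and similarity threshold are all invented for illustration; sparse vectors are represented as feature-to-weight dictionaries:

```python
import math

def cosine(u, v):
    """Equation 9.14 on sparse dict vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_senses(word_vec, committee_centroids, sim_threshold=0.1):
    """Repeatedly assign a word to the most similar committee centroid,
    then strip the features shared with that committee so that the word's
    remaining (rarer) senses can surface in later rounds."""
    vec = dict(word_vec)
    senses = []
    while vec:
        best = max(committee_centroids,
                   key=lambda c: cosine(vec, committee_centroids[c]))
        if cosine(vec, committee_centroids[best]) < sim_threshold:
            break
        senses.append(best)
        # remove the features the word shares with the chosen committee
        vec = {f: x for f, x in vec.items()
               if f not in committee_centroids[best]}
    return senses

# Toy centroids for two hypothetical committees (fruit vs. company senses).
centroids = {
    "fruit":   {"obj-of:eat": 1.0, "nmod:tree": 0.8},
    "company": {"nmod:computer": 1.0, "subj-of:release": 0.7},
}
apple = {"obj-of:eat": 0.9, "nmod:tree": 0.5, "nmod:computer": 0.8}
print(assign_senses(apple, centroids))
```

After "apple" is assigned to the fruit-like committee, its food-related features are removed, and the remaining computer-related features let the second sense emerge in the next round.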

Automatic Word Class and Word Sense Identification: Gauch and Futrelle (1994) describe a method for the automatic classification of words into classes of related words using a context-based similarity measure. In contrast to the bag-of-words approach commonly used in IR, Gauch and Futrelle also take into account the position of the context words relative to the target word to be classified. Initially, a set of content words is determined, namely the 150 most frequent words in the corpus. Given a target word t, Gauch and Futrelle then record the words appearing within a context window of ±2 words around t, together with their position within the window. These co-occurrence counts are summarized in a 600-dimensional context vector: the first 150 indices correspond to the instantiations of the first context position with each of the 150 content words, the entries 151-300 correspond to the instantiations of the second context position with each of the content words, and so on. Each entry of the context vector is filled with the mutual information between the context word c and the target word t given position j in the context window (the pair (c, j) coincides with the entry's index):

M_j(c, t) = log( (f_j(c, t) ∗ f_j(∗, ∗)) / (f_j(c, ∗) ∗ f_j(∗, t)) + 1 )

where f_j(c, t) denotes the corpus frequency of c occurring at position j in a context window around t, and

f_j(c, ∗) = Σ_t̃ f_j(c, t̃)

f_j(∗, t) = Σ_c̃ f_j(c̃, t)

f_j(∗, ∗) = Σ_c̃,t̃ f_j(c̃, t̃)

That is, the context vector of the target word t is given by

v⃗_t = ( M_{−2}(c_1, t), . . . , M_{−2}(c_150, t), . . . . . . , M_{2}(c_1, t), . . . , M_{2}(c_150, t) )
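The construction of such positional context vectors can be sketched as follows. For brevity the toy corpus and a three-word content list (instead of the 150 most frequent words) are invented, and the entries hold raw counts rather than the mutual information scores of the formula above:

```python
from collections import Counter

# A tiny invented "content word" list (the paper uses the 150 most frequent words).
content_words = ["the", "of", "music"]
positions = [-2, -1, 1, 2]          # context window of +/-2 around the target

def positional_counts(tokens, target):
    """Count f_j(c, t): content word c at offset j from an occurrence of t."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in positions:
            if 0 <= i + j < len(tokens) and tokens[i + j] in content_words:
                counts[(j, tokens[i + j])] += 1
    return counts

tokens = "the sound of music is the music of the people".split()
counts = positional_counts(tokens, "music")
# Context vector: one block of len(content_words) entries per window position.
vec = [counts[(j, c)] for j in positions for c in content_words]
print(vec)
```

Each block of three entries corresponds to one window position, mirroring the four 150-entry blocks of the 600-dimensional vector described above.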

Using the inner product of the mutual information vectors, Gauch and Futrelle compute the similarity for all pairs of words. Words are then grouped into classes by application of hierarchical agglomerative clustering in the similarity space. Initially, each word constitutes a cluster. The two most similar clusters are successively joined, producing a binary word tree. Once a new cluster is formed, it is represented by an updated MI-vector whose elements are the MI-scores resulting from the sum of the context frequencies of its constituent members. Each node in the binary tree represents a word class which consists of all its descendant nodes (words) in the tree. Gauch and Futrelle found that nodes close to the root node consisted of syntactically close words, i.e., words with the same part-of-speech, but with little semantic commonality. Further down the hierarchy, however, syntactically and semantically close words were grouped together.

Just as the clustering method groups contextually similar words together, it can be used to detect the different senses of an ambiguous word by classifying all instances of the ambiguous word according to their context. The context vector of a word instance is built from only one realization of the target and its surrounding context. Therefore, in each of the four 150-dimensional segments of the context vector (corresponding to the four positions in the context window), at most one entry is non-zero. This makes classification almost impossible, since there will hardly be any instances of the ambiguous word with overlapping context vectors. Based on the previously computed word similarities, Gauch and Futrelle expand the context vectors by adding non-zero entries for all those content words whose similarity with the actually observed context word exceeds a fixed threshold.
Similarity of word instances then corresponds to similarity of the expanded context vectors, and classes of similar instances (i.e., sense clusters) are found using hierarchical agglomerative clustering. Gauch and Futrelle mention that this methodology can also be applied to words with syntactic (that is, POS) rather than semantic ambiguity, such as deduced and cloned, which can function as adjectives or past participles.

9.2.4 Word sense induction by clustering word contexts

Based on the idea that word contexts represent smaller (more specific) meaning units than words, several methods have been developed to cluster a target word's contexts into groups of similar contexts, which can then be interpreted as senses. In this section, we review some of this work.

Automatic word sense discrimination: Upon examining the words surrounding a word in focus, we often know which sense of the word is being referred to in a particular instance. The following are occurrences of the word rock in the BNC together with a context of ten words to either side of rock.

... mile or so along the sand, she sat on a rock, looking at the grey/indigo streak of Italian coastline. The ...
... the river valley. Along the crest of this ridge the rock had been exposed in several places, and eroded into a ...
... sell 15,000 jazz LPs throughout the UK. For pop and rock artists, success has to be measured by records selling in ...
... British musicians and bands currently working outside the mainstream of rock and pop. The bands which appeared were selected by the ...

Each of these contexts addresses a particular sense of the word rock: the first two contexts are about rock as "stone", the other two about rock "music". The idea of Schütze's word sense discrimination algorithm is to use unsupervised clustering to automatically divide the set of all occurrences of a word into clusters of contextually similar occurrences, which can then be interpreted as word senses.

Words and their occurrences are described as points in a high-dimensional vector space, the so-called Word Space, in which spatial proximity reflects semantic similarity. The Word Space is built by recording cooccurrences of each word in the text with a set of content-bearing words in a cooccurrence matrix M. Each entry M(i, j) of the matrix holds the number of times words wi and wj occur within a distance of less than k words in the text. Figure 9.3 shows a sketch of the cooccurrence matrix. The columns of the matrix form the dimensions of the Word Space, and each row constitutes a coordinate vector defining the position of the corresponding word in Word Space. Since words with related meanings are surrounded by similar words, they receive similar coordinates and are therefore mapped to points near one another in Word Space. This is visualized in Figure 9.4, which shows a projection of the Word Space onto two dimensions by considering only the music and ice coordinates of each word.

Very frequent words have big coordinates and are mapped to points far away from the origin. Rare words, on the other hand, have small coordinates and are close to the origin and to each other, even if they share only few positive coordinates. This frequency effect can be overcome by normalizing the coordinate vectors (which are called word vectors from now on). All words then have the same distance to the origin. They are mapped to points on the unit circle, as is illustrated in Figure 9.5.

             . . .   music   . . .   ice   . . .
  sand       . . .       7   . . .    32   . . .
  guitar     . . .      32   . . .     6   . . .
  concert    . . .      25   . . .     8   . . .
  lava       . . .      21   . . .     2   . . .
  song       . . .       7   . . .    32   . . .

Figure 9.3: A sketch of the word cooccurrence matrix M.

Figure 9.4: A projection of the Word Space onto two dimensions.

Similarity between words in the Word Space is measured as the cosine of the angle enclosed by the corresponding word vectors, which is equal to their scalar product and takes values between 0 and 1. The coordinate vectors of nearby points on the unit circle enclose a small angle and give rise to a high similarity score (close to 1). Sets of words can now be represented in this same Word Space by their centroid which is obtained by summing their coordinate vectors and normalizing the result. In particular, Sch¨utze takes an occurrence of a word to be the set of its surrounding words. A word occurrence is therefore mapped to the point in Word Space whose coordinate vector is the centroid of the coordinate vectors of its neighboring words. Occurrence vectors can be thought of as the average direction of their contributing word vectors. Figure 9.5 shows the vector representations of two occurrences of rock, one surrounded by the words sand and lava, and another one with neighboring words guitar and concert.
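The normalization and centroid construction can be sketched as follows, with an invented two-dimensional toy Word Space (the counts are made up for illustration and are not those of Figure 9.3):

```python
import numpy as np

# Invented two-dimensional Word Space (dimensions: music, ice).
word_space = {
    "sand":    np.array([2.0, 30.0]),
    "lava":    np.array([3.0, 25.0]),
    "guitar":  np.array([30.0, 2.0]),
    "concert": np.array([25.0, 4.0]),
}
# Normalize the word vectors: every word now lies on the unit circle,
# which removes the frequency effect described above.
for w in word_space:
    word_space[w] /= np.linalg.norm(word_space[w])

def occurrence_vector(context_words):
    """An occurrence is represented by the normalized centroid of the
    word vectors of its surrounding words."""
    v = sum(word_space[w] for w in context_words)
    return v / np.linalg.norm(v)

occ1 = occurrence_vector(["sand", "lava"])       # rock as "stone"
occ2 = occurrence_vector(["guitar", "concert"])  # rock as "music"
# For unit vectors, cosine similarity is simply the scalar product.
print(float(occ1 @ occ2))
```

The two occurrence vectors point in clearly different directions, which is exactly the separation the clustering step later exploits.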

Figure 9.5: Normalized word vectors and context vectors.

The cooccurrence matrix tends to be very sparse, which is a problem when computing similarities between the row vectors. The scalar product will be close to 0 even for semantically similar words because of the many, not always matching, zero entries. This problem is circumvented by applying Singular Value Decomposition (SVD) (Deerwester et al., 1990) to the original cooccurrence matrix. SVD identifies the main directions in Word Space which are then used as the new dimensions of a lower-dimensional feature space. Related content-bearing words which constituted different dimensions in the original Word Space are collapsed to form a single dimension in the reduced Word Space.

For ease of understanding, however, Figures 9.4 and 9.5 show vectors in the original Word Space; to picture their representations in the reduced Word Space, simply replace music and ice with two of the new dimensions.

To learn the senses of a target word, Schütze collects all instances of the word in the text and computes their representations in Word Space. Instances of the same sense of the word will have similar surroundings and will therefore be much closer to each other in Word Space than instances of different senses of the target word. An EM-based clustering algorithm is applied to the set of all occurrence vectors to identify groups of similar occurrences. The desired number of clusters needs to be specified in advance and allows the user to decide how fine the distinctions between different senses should be. The centroids of the resulting occurrence clusters (the "sense vectors") serve as Word Space representations of the different senses of the word. To disambiguate an unseen occurrence of an ambiguous word, its vector is compared to all of the sense vectors, and the occurrence is categorized as an instance of the sense to whose representation it is most similar.

Discovering the Meanings of an Ambiguous Word by Searching for Sense Descriptors with Complementary Context Patterns: Rapp (2003) pursues two different approaches to the automatic induction of word senses. The intuitions underlying his approaches are:

• Complementarity of features: The lexical environment of an ambiguous word is the union of the lexical environments of each of its meanings.

• Variance of co-occurrence frequencies: The words the ambiguous word w co-occurs with, and the frequencies with which they co-occur with w, vary widely across different text types and corpora. For example, while, in a text on nutrition, "apple" will often co-occur with "eat", "vitamins", "food" etc., it will rarely co-occur with words such as "computer", "toshiba", "monitor", etc. In a technical journal, on the other hand, we expect many more occurrences of "apple" with computer-related terms, but hardly any occurrences with food-related terms.

The basic assumption underlying this approach is that the different meanings of an ambiguous word are described by complementary features. Therefore, the feature vectors of a set of words, each representative of one of the meanings, should add up to the feature vector of the ambiguous word. A feature vector is computed for each word by recording its co-occurrences with all other words in the corpus within a window of ±1 words (after elimination of function words). Semantic similarity of words is then measured as the similarity of the corresponding co-occurrence vectors, using a standard vector similarity measure.4 More formally, the assumption is that if the k senses of an ambiguous word w are well-described by the words s1 . . . sk, the co-occurrence vector vw of w should be approximately

4 Rapp uses the city-block metric. The city-block distance between vectors x = (x1, . . . , xN) and y = (y1, . . . , yN) is defined as d(x, y) = Σ_{i=1}^{N} |x_i − y_i|.

equal to the average (1/k) ∗ Σ_{i=1}^{k} v_si of the co-occurrence vectors v_si of the sense descriptors si. In his experiments, Rapp only considers the case of k = 2, but his procedure can easily be generalized to k > 2.

To induce the senses of a given ambiguous word w, the author first identifies a set of possible sense descriptors consisting of the top n words most closely related to w. Relatedness to w is measured using the log-likelihood association score with a window of two words. This method allows first-order associates (a word which occurs close to the ambiguous word in the corpus, e.g., "eat" as sense descriptor for apple-"fruit") as well as second-order associates (a word with a similar lexical neighborhood, e.g., "strawberry" as sense descriptor for apple-"fruit") to function as sense descriptors.

For each possible combination of candidate sense descriptors, he then computes the average of the corresponding co-occurrence vectors and compares the resulting vector with the co-occurrence vector vw of w using the city-block vector similarity. The pair of descriptors for which the average co-occurrence vector is most similar to vw is considered to be the most characteristic of the two main senses of w.

Rapp states that a pair of words which perfectly distinguishes the meanings of an ambiguous word usually does not exist. Instead of using single words as sense descriptors, Rapp suggests using linear combinations of words, which are better suited to capture word meaning and can be computed using Independent Component Analysis (ICA). ICA is a formalism usually applied in signal processing to recover k independent signals from at least k different linear mixtures of these signals. For a correct application of ICA, independence of the original signals is crucial. In the case of sense induction, ICA can be applied to co-occurrence vectors of w derived from corpora of different domains.
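The exhaustive search over candidate descriptor pairs can be sketched as follows. The co-occurrence vectors and candidate words are invented toy data; "apple" is constructed as a mixture of a fruit profile and a computer profile:

```python
from itertools import combinations

def cityblock(x, y):
    """City-block distance (footnote 4): sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def best_descriptor_pair(v_w, candidates):
    """Rapp's search for k = 2: find the pair of candidate sense descriptors
    whose averaged co-occurrence vector best matches v_w."""
    def score(pair):
        s1, s2 = pair
        avg = [(a + b) / 2 for a, b in zip(candidates[s1], candidates[s2])]
        return cityblock(v_w, avg)
    return min(combinations(candidates, 2), key=score)

# Toy co-occurrence vectors over dimensions (eat, vitamin, monitor, keyboard).
v_apple = [10, 8, 9, 7]
candidates = {
    "strawberry": [20, 16, 0, 0],    # fruit-like profile
    "toshiba":    [0, 0, 18, 14],    # computer-like profile
    "banana":     [18, 14, 1, 0],    # another fruit-like profile
}
print(best_descriptor_pair(v_apple, candidates))
```

The fruit/computer pair wins because its average reproduces the mixed profile of the ambiguous word, while two fruit descriptors leave the computer dimensions unexplained.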
The co-occurrence vectors constitute different mixtures of w’s senses and ICA can be used to identify the underlying independent components which correspond to w’s senses. Rapp takes Yarowsky’s list of ambiguous words and computes co-occurrence vectors for each of the words in each of three corpora of different domains. He then applies ICA to each vector-triple to identify two independent components, i.e., the descriptors of the two main senses. To arrive at an interpretation of these independent vectors, that is to label the senses, Rapp compares the independent components with the co-occurrence vector of each word in the corpus which consists of the union of the three different corpora. The word whose co-occurrence vector in the joint corpus is closest (in terms of vector similarity) to the independent vector is chosen as sense label.

Cartographie lexicale pour la recherche d'information: HyperLex (Véronis, 2003) is an algorithm for the automatic discrimination of word-uses in a given text. The approach is based on the detection of high-density components in the graph consisting of a target word w and its cooccurrences, which are collected using a bag-of-words approach. The basic hypothesis underlying HyperLex is that the different uses of a word form groups of highly interlinked nodes in the graph. Detecting the different uses of a word therefore amounts to isolating regions of high density in the cooccurrence graph.

The algorithm iteratively determines the most characteristic (root) cooccurrent of each of the uses of the target word w. It starts from a list of w's neighbors in the graph, sorted by decreasing frequency of occurrence. For each node in the list, the algorithm successively checks whether it qualifies as the root of a (previously unseen) usage of w. A word v constitutes the root of a usage if

1. at least k of its neighbors are not neighbors of a previously discovered sense, and

2. the average link strength between v and these k neighbors is higher than a predefined threshold.

The strength of the links in the graph is determined according to

w(v1, v2) = max( P(v1|v2), P(v2|v1) )   (9.15)

where

P(v1|v2) = freq(v1, v2) / freq(v2)  and  P(v2|v1) = freq(v1, v2) / freq(v1)   (9.16)

For words v1 and v2 which are representative of the same usage of w, the probability of observing one word given the other is significantly higher than the marginal probabilities. For words v1 and v2 which are representative of different senses, the conditional probabilities are expected to be lower than the marginal probabilities. In this case, the presence of one of the words predicts the absence of the other. The algorithm underlying HyperLex works as follows:

L = Neighbors(w)   (sorted by decreasing frequency)
L̄ = {}             (list of nodes already considered)
R = {}             (list of roots)

NEIGHBOR: for each n ∈ L {
    if n ∈ L̄ { next NEIGHBOR }
    if n satisfies Conditions 1 and 2 {
        R = R ∪ {n}
        L̄ = L̄ ∪ {n} ∪ Neighbors(n)
    }
}

The set R contains the discovered root words, which together with their proper neighbors (words which have a strong link to the root and which are not linked to any other root word) represent the different uses of the target word in the corpus.
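The root search above can be turned into a runnable sketch. The frequencies, co-occurrence counts and thresholds below are invented toy data; Conditions 1 and 2 and the link strength of Equation 9.15 are implemented as described:

```python
def hyperlex_roots(target_neighbors, freq, cofreq, neighbors, k=3, threshold=0.4):
    """Sketch of the HyperLex root search.  target_neighbors: the target's
    neighbors sorted by decreasing frequency; freq: word -> frequency;
    cofreq: frozenset pair -> co-occurrence frequency; neighbors: word ->
    neighbor set.  k and threshold are Conditions 1 and 2."""
    def strength(v1, v2):
        f = cofreq.get(frozenset((v1, v2)), 0)           # Equation 9.15
        return max(f / freq[v2], f / freq[v1])
    seen, roots = set(), []
    for n in target_neighbors:
        if n in seen:
            continue
        fresh = [v for v in neighbors[n] if v not in seen]   # Condition 1
        if len(fresh) >= k:
            avg = sum(strength(n, v) for v in fresh[:k]) / k # Condition 2
            if avg > threshold:
                roots.append(n)
                seen |= {n} | neighbors[n]
    return roots

# Toy graph around an ambiguous target: a music-related and a beach-related
# region (all numbers invented).
freq = {"guitar": 50, "concert": 40, "band": 60, "album": 30,
        "sand": 45, "beach": 35, "sea": 55, "stone": 25}
cofreq = {frozenset(p): 20 for p in [("guitar", "concert"), ("guitar", "band"),
          ("guitar", "album"), ("sand", "beach"), ("sand", "sea"),
          ("sand", "stone")]}
nbrs = {"guitar": {"concert", "band", "album"}, "sand": {"beach", "sea", "stone"},
        "concert": {"guitar"}, "band": {"guitar"}, "album": {"guitar"},
        "beach": {"sand"}, "sea": {"sand"}, "stone": {"sand"}}
print(hyperlex_roots(["guitar", "sand", "concert"], freq, cofreq, nbrs))
```

"concert" is skipped because it is swallowed by the usage rooted at "guitar", so only one root per region of the graph survives.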

Discovering word senses from a network of lexical cooccurrences: Ferret (2004) developed a method for differentiating the senses of a word from a network of lexical cooccurrences. Nodes in the network correspond to words, and links represent significant cooccurrences of words in the corpus, which are collected using a bag-of-words approach (Ferret experimented with both first-order and second-order cooccurrences).

Word senses are discovered word by word by investigating the subgraph around the target word's cooccurrents. The idea is that cooccurrents describing the same sense of the target word are much more strongly interlinked than cooccurrents describing different senses. To identify groups of related cooccurrents in the cooccurrence graph, Shared Nearest Neighbors (SNN) Clustering is used, which is illustrated in Figure 9.6 and which can be summarized as consisting of the following steps:

1. For each link in the graph, count how many neighbors the incident nodes have in common.

2. Build the SNN graph by keeping only links with a minimum of n0 shared neighbors.

3. In the SNN graph, mark all "strong" links. These are links with a weight of at least n1 > n0 common neighbors.

4. Assign a weight to each node which is the number of strong links the node is involved in. Nodes which participate in at least m strong links give rise to core cooccurrents of a sense.

5. Core nodes connected by a strong link are grouped into the same sense cluster. Other non-core nodes are then added to a core cluster if they are attached to the cluster via a strong link.

Figure 9.6: SNN Clustering. The five panels illustrate, on an example graph: 1. the cooccurrence graph, 2. building the SNN graph, 3. marking strong links, 4. identifying core nodes, and 5. enriching the clusters.
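The five steps of SNN clustering can be sketched as follows, on an invented undirected toy graph consisting of two cliques joined by a weak bridge (the thresholds n0, n1 and m are arbitrary):

```python
from collections import deque

def snn_clusters(adj, n0, n1, m):
    """Sketch of SNN clustering over an undirected graph given as a
    node -> neighbor-set map.  Thresholds: n0 (keep a link in the SNN
    graph), n1 (mark a link strong), m (strong links for a core node)."""
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    shared = {e: len(adj[min(e)] & adj[max(e)]) for e in edges}  # step 1
    snn = {e for e in edges if shared[e] >= n0}                  # step 2
    strong = {e for e in snn if shared[e] >= n1}                 # step 3
    degree = {}
    for e in strong:
        for u in e:
            degree[u] = degree.get(u, 0) + 1
    cores = {u for u, d in degree.items() if d >= m}             # step 4
    # step 5a: group cores connected by strong links (BFS components)
    clusters, done = [], set()
    for c in cores:
        if c in done:
            continue
        comp, queue = {c}, deque([c])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in cores and v not in comp and frozenset((u, v)) in strong:
                    comp.add(v)
                    queue.append(v)
        done |= comp
        clusters.append(comp)
    # step 5b: attach non-core nodes linked to a cluster by a strong link
    for comp in clusters:
        comp |= {v for u in comp for v in adj[u]
                 if v not in cores and frozenset((u, v)) in strong}
    return clusters

clique = lambda ns: {u: set(ns) - {u} for u in ns}
adj = {**clique("abcd"), **clique("efgh")}
adj["c"].add("e"); adj["e"].add("c")      # a weak bridge between the senses
print(sorted(sorted(c) for c in snn_clusters(adj, n0=1, n1=2, m=2)))
```

The bridge link c-e has no shared neighbors, so it is never strong, and the two cliques come out as separate sense clusters.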

9.2.5 Identifying and assessing ambiguity

Ambiguity is not a binary decision of false versus true, but rather varies along a continuous scale ranging from non-ambiguous to highly ambiguous. Little work exists on the topic of automatic ambiguity assessment. In fact, to our knowledge, the only piece of work in this direction is the ambiguity detection algorithm of Sproat and van Santen (1998). The phenomenon of (non-)ambiguity is, however, closely related to the concept of generality (specificity), work on which we also review in this section.

Automatic ambiguity detection: The approach of Sproat and van Santen (1998) to quantifying ambiguity rests on the idea that, whereas the neighborhood of an unambiguous word is semantically coherent, dissimilarity among an ambiguous word's neighbors is high. A word's neighbors are represented as points in a two-dimensional space in such a way that Euclidean distance best preserves semantic distance between neighbors. Semantic distance between two words is measured as the relative frequency of co-occurrence of the words within a fixed-size text window. The authors observed that the neighbors of an ambiguous word w are distributed along rays diverging from the origin, which corresponds to the target word w, and that different rays correspond to different meanings of w. In terms of polar coordinates (r, θ), this means that the distribution of the angle θ shows a peak for each of w's meanings. Therefore, for unambiguous words, only one (broad) peak will be observed. With increasing ambiguity of w, the distribution of θ deviates more and more from a single-peaked distribution. To quantify the degree of deviation from single-peakedness, the authors compute a best single-peaked fit to the distribution of θ, and use the inverse of the goodness of fit as a measure for a word's degree of ambiguity.

Measuring Semantic Entropy: Melamed (1997) developed a measure for semantic ambiguity/non-specificity which is based on the concept of entropy. The idea underlying Melamed’s approach is that very unambiguous/specific words, which he calls semantically heavy, are consistently translated into other languages. The translational counterparts of ambiguous/general words, on the other hand, vary a lot. The more random the distribution of a word’s translations, the more ambiguous/uninformative the word. The randomness of the distribution, and hence ambiguity, is measured using the information-theoretic concept of entropy.

Determining the specificity of nouns from text: Caraballo and Charniak (1999) propose and evaluate a method for determining the relative specificity of nouns. The idea underlying their method is that very general words tend to be modified frequently while more specific words are rarely modified. Caraballo and Charniak proceed by checking not only how often a noun is modified, but also how much the modifiers vary. This is captured by the distributional entropy of a noun's modifiers which, in information-theoretic terms, measures the indeterminateness of the modifier distribution. Caraballo and Charniak showed that the measure works well for assessing the relative specificity of nouns. Evaluation was done by looking at three small WordNet sub-hierarchies and counting how many of the hypernym-hyponym relationships in the tree could be recovered.
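Both Melamed's translational entropy and Caraballo and Charniak's modifier entropy come down to the same computation: the Shannon entropy of an empirical distribution. The sketch below uses invented modifier lists for a specific and a general noun:

```python
import math
from collections import Counter

def distribution_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of the
    observations -- the quantity behind both the translational-entropy
    and the modifier-entropy measures."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A specific noun draws few distinct modifiers; a general one draws many
# (invented toy modifier lists).
specific = distribution_entropy(["red", "red", "red", "ripe"])
general = distribution_entropy(["big", "new", "old", "strange"])
print(specific, general)
```

A uniform distribution over four distinct modifiers yields the maximal entropy of 2 bits, while the skewed distribution scores much lower, so the general noun is ranked as less specific.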

9.3 Idiomaticity

The following approach by Degand and Bestgen (2003) to the acquisition of idioms is based on ideas very similar to those behind our approach to idiom recognition described in Chapter 7. Furthermore, the results reported are similar to our findings.

Towards automatic retrieval of idioms in French newspaper corpora: Degand and Bestgen (2003) propose a procedure for automatically retrieving idiomatic expressions from large text corpora. Idioms are viewed as a kind of multi-word expression, and the core idea is to use multi-word extraction techniques to obtain a list of candidate idioms, and then apply several filters to remove all non-idiomatic terms from the list. The proposed filters are based on the following three hypotheses:

1. Idioms are lexically and syntactically rigid, i.e., they do not allow for synonym substitution and their word order is fixed.

2. The individual words composing the idiom show little semantic similarity.

3. Idioms show little similarity with their surrounding context.

To assess the degree to which a candidate idiom fulfills these three assumptions, the authors propose the following. Property 1 is measured by counting an expression's number of "lexical neighbors". A "lexical neighbor" of an expression is a word sequence which is obtained from the expression by making only a small change, such as substitution of a single word, insertion of a word, or swapping of two words. Expressions which have many lexical neighbors are syntactically and lexically flexible. Expressions for which only few neighbors are observed, on the other hand, are good candidates for idioms.

Properties 2 and 3 are both measured using Latent Semantic Analysis. Words, expressions and contexts all have a vector representation in the same vector space, and similarity between words, as well as similarity between an expression and its context, is measured using cosine similarity.

The authors applied their algorithm to French newspaper text and found that Filter 1 retrieves most idioms, but a lot of non-idiomatic expressions, too. Filter 2, on the other hand, turns out to miss a lot of idiomatic expressions whose constituents are, contrary to the assumption, semantically similar. Most of the non-idiomatic expressions are correctly recognized as such. Similarly, Filter 3 failed to extract some of the idiomatic expressions, which did show some similarity to the surrounding text segments.
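The lexical-neighbor count of the first filter can be sketched as follows. The toy vocabulary and n-gram set are invented, and insertion variants are omitted for brevity:

```python
def lexical_neighbors(expr, vocabulary):
    """Enumerate small variations of a candidate idiom: single-word
    substitutions and adjacent swaps (insertions omitted for brevity)."""
    words = expr.split()
    variants = set()
    for i in range(len(words)):
        for w in vocabulary:                     # substitute one word
            if w != words[i]:
                variants.add(" ".join(words[:i] + [w] + words[i + 1:]))
    for i in range(len(words) - 1):              # swap two adjacent words
        swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        variants.add(" ".join(swapped))
    return variants

def flexibility(expr, vocabulary, corpus_ngrams):
    """Count how many lexical neighbors are attested in the corpus.
    A low count suggests a rigid, idiom-like expression."""
    return sum(v in corpus_ngrams for v in lexical_neighbors(expr, vocabulary))

corpus = {"kick the ball", "kick the can", "hit the bucket"}   # toy n-grams
print(flexibility("kick the bucket", {"hit", "ball", "can"}, corpus))
```

With a real corpus, literal phrases would accumulate many attested neighbors, whereas a frozen idiom would accumulate few, which is exactly the rigidity signal the filter exploits.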

9.4 Synonymy

Many techniques have been developed for the automatic extraction of similar words. How- ever, most of these techniques do not distinguish between contextually similar words and true synonyms. This section reviews some work on the extraction of true synonyms.

Identifying synonyms among distributionally similar words: Lin et al. (2003) developed a filter for eliminating semantically conflicting words (in particular antonyms, whose distributional profiles are often very close to that of the target word) from a list of distributional neighbors. Their method is based on a set of patterns ("from ... to ..." and "either ... or ...") indicative of antonymy or, more generally, semantic incompatibility. If a distributional neighbor and the target word often occur in such a pattern (this is tested by querying a Web search engine), they are likely to be semantically conflicting.

Lin et al. propose another filter which makes use of a bilingual dictionary. By translating the French translations of an English target word back into English, the authors obtain a set of potential synonyms of the target word which they compare against the set of its distributional neighbors. True synonyms are those words which occur in both sets. One drawback of this method is its dependence on the availability and quality of bilingual dictionaries.

Automatic extraction of synonyms from a dictionary: Blondel and Senellart (2002) use graph analysis to extract synonyms from a monolingual dictionary. The relationship "appears in gloss" induces a directed graph where each node denotes a dictionary entry. More precisely, there is a link from word u to word v if v appears in the definition of u. Given a target word, a neighborhood graph is constructed which consists of all the words the target word points to (the words in its gloss) or is pointed to by (the words in whose glosses the target word appears). The nodes in the neighborhood graph are then ranked according to their centrality score. A node showing a high degree of centrality is one which is a good hub (points to many nodes) as well as a good authority (is pointed to by many nodes).5 The centrality scores are computed by comparing the neighborhood graph with the line graph (a directed path of three nodes). A node's centrality is its similarity to node 2 in the line graph. Nodes in the neighborhood graph which resemble node 2 have many ingoing and outgoing links. Thus, centrality is a measure of the amount of information passing through a node. The authors' hypothesis is that these busy central nodes are good candidates for synonymy.

5 The notions of hub and authority were coined by Kleinberg (1999) to describe the quality of web pages. A good authority is a web page which is highly relevant to a certain topic. Such a web page is characterized by many incoming hyperlinks from other people who think it is interesting. A good hub is a web page which points to many good authorities related to the same topic. As Kleinberg puts it: "Hubs and authorities exhibit what could be called a mutually reinforcing relationship. A good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs."

1 → 2 → 3   (the line graph)

We will illustrate Blondel et al.'s technique with the help of an example. Let us consider the target word "chocolate", which may have the following entry in a dictionary:

Chocolate: A preparation made from cocoa beans that have been roasted, husked, and ground. Chocolate today is usually sweetened with sugar.

This definition has some overlap with the (imaginary) dictionary entry of "cocoa":

Cocoa: Refers to the fruit of the cocoa plant; or to the dry powder made by grinding the beans and removing the cocoa butter. Heated with milk and sweetened with sugar, cocoa makes a tasty beverage.

Besides their definitions sharing words, "chocolate" and "cocoa" appear jointly in the glosses of other words, for example cacao and truffle:

Cacao: A small (4-8 m tall) evergreen tree in the family Sterculiaceae, native to tropical South America, but now cultivated throughout the tropics. Its seeds are used to make cocoa and chocolate.

Truffle: A bite-sized petit four, made from chocolate and ganache to which flavourings have been added, such as liqueurs or essences. Truffle mixtures can be piped in balls or long strands, rolled in cocoa powder, icing sugar or dipped in couverture.

Words in italics are shared by glosses. Figure 9.7 shows the neighborhood graph of the target chocolate. Apart from the chocolate node itself, the node with the highest centrality is cocoa, as it has both many incoming and many outgoing links. It is therefore the most appropriate synonym for chocolate among the nodes.
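The centrality computation can be sketched with the similarity iteration of Blondel et al., using an invented toy neighborhood graph in place of the dictionary graph:

```python
import numpy as np

def blondel_centrality(A, iters=50):
    """Sketch of the centrality used by Blondel and Senellart: each node of
    the neighborhood graph (adjacency A, with A[i, j] = 1 for an edge
    i -> j) is scored by its similarity to the middle node of the line
    graph 1 -> 2 -> 3, computed with the similarity iteration
    X <- B X A^T + B^T X A (normalized), where B is the line graph's
    adjacency matrix."""
    B = np.array([[0., 1., 0.], [0., 0., 1.], [0., 0., 0.]])
    X = np.ones((3, A.shape[0]))
    for _ in range(iters):
        X = B @ X @ A.T + B.T @ X @ A
        X /= np.linalg.norm(X)
    return X[1]                      # row of the middle node

# Invented toy neighborhood graph: node 2 has both incoming links (from
# nodes 0 and 1) and outgoing links (to nodes 3 and 4), like the central
# "cocoa" node, so it should receive the highest centrality score.
A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (2, 3), (2, 4)]:
    A[i, j] = 1.0
scores = blondel_centrality(A)
print(int(scores.argmax()))
```

Nodes with only incoming or only outgoing links resemble the end nodes of the line graph, whereas the node that relays links in both directions resembles its middle node and therefore dominates the centrality ranking.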

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL: Turney (2001) uses simple pointwise mutual information on web frequency data obtained by querying a search engine (AltaVista) to pick the best synonym of a target word from among a set of given alternatives. Mutual information measures the discrepancy between the observed probability of cooccurrence of a candidate synonym c and the target t and the probability expected under the assumption that c and t occur independently of each other:

MI(c, t) = log( P(c, t) / (P(c) ∗ P(t)) )   (9.17)

Different ways of defining cooccurrence of target and candidate synonym are explored, each of which corresponds to a different web query. He finds that simple queries such as "c AND t" or "c NEAR t" score antonyms as highly as synonyms. He observes better performance

when requiring in addition that neither the candidate synonym nor the target occurs near the word "not". If context is available, synonymy recognition can be further improved by requiring that c and t occur not only near each other but also together in a document with (some of) the given context words.

Figure 9.7: The dictionary neighborhood graph of chocolate. Links u → v represent the relationship "u appears in the gloss of v".

Chapter 10

Summary, Conclusion and Future Research

10.1 Summary

We have studied a graph model for nouns and their relationships in a text. The model was built by collecting lists from the British National Corpus and by connecting any two nouns which occurred together in a list. Our aim was to make explicit the semantic information hidden in the connectivity structure of the resulting word graph. We have found that a lot of interesting information about the meanings of words and their interactions can be extracted from this simple graph model.

In Chapter 4, we have proposed an algorithm which iteratively extracts the different senses of an ambiguous word from the subgraph induced by its neighboring words. By discriminating against previously extracted senses, the discovery of duplicate senses is avoided. We presented results on a manually compiled test set consisting of words of different degrees of ambiguity, and demonstrated that the algorithm is able to extract the different senses of a word even if they are closely related.

In Chapter 5, we have studied the graph-theoretic measure of curvature from a semantic point of view. Curvature is a measure of the interconnectedness of a node's neighbors. The neighbors of highly specific, unambiguous words tend to be densely interlinked, which is reflected in a high curvature score. The neighborhoods of ambiguous words are instead more heterogeneous: the neighbors of an ambiguous word form different word communities which correspond to its different meanings. While neighbors pertaining to the same sense are linked to one another, neighbors related to different meanings are unconnected, which is reflected in a low curvature score.

Curvature has turned out to be useful not only for detecting ambiguous words; it also provides an effective clustering tool. Ambiguous words behave like semantic "hubs" which connect otherwise unrelated word communities.
By disabling these hubs, i.e., by removing nodes with low curvature scores, the word graph decomposes into separate groups of words which correspond closely to cohesive semantic categories.

We have shown that curvature clustering is particularly useful for predicting the meaning of unknown words. Whenever an unknown word is observed in a cluster together with other known words, its meaning can be inferred from the meanings of its known fellow cluster members. Our evaluation experiment was based on two different releases of the WordNet lexicon. We took the nouns covered by the earlier release to be the known words, based on which we tried to infer the meanings of the nouns which were added in the later release. Curvature succeeded in mapping many of the test words to the appropriate place in the WordNet noun hierarchy.

The nodes in the word graph represent words, many of which are ambiguous between different meanings. The links between them, on the other hand, have a much more precise meaning, as their constituent words often disambiguate one another, and can be interpreted as word contexts. Based on this observation, we investigated an alternative clustering approach in Chapter 6, which treats the links in the graph as the semantic atoms to be clustered. This approach yields a "soft" clustering of words into semantic categories, meaning that words may belong to several clusters at the same time. This is a great advantage over most conventional clustering approaches which, by forcing each word into a single cluster, do not allow for words to have several senses.

The tool which allows us to shift our attention from the nodes of the word graph to its links is the link graph transformation. The links of the original graph become the nodes of a new, transformed graph (the link graph). Two nodes are linked in the transformed graph if, as links in the original graph, they occur together in a triangle.
Since triangles embody a transitivity relationship, there is strong evidence that the links within a triangle all address the same meaning of the words involved. The nodes of the transformed graph are thus no longer (or at least far less) ambiguous, and can be divided into non-overlapping clusters. We have used Markov Clustering to do this. By replacing links with their constituent words, the (mutually exclusive) clusters of links can be translated into (overlapping) clusters of words.

We have evaluated the performance of link graph clustering using WordNet as a gold standard. Link graph clustering yields considerably better results than the baseline, which we took to be Markov Clustering on the original, untransformed graph. The link graph suffered some loss of density in the course of the graph transformation. Markov Clustering thus divided the link graph into many small clusters, such that several of the clusters a word appears in correspond in fact to the same sense. We have developed a measure of contrast which allows us to select, from among the many clusters a word was assigned to, pairs of clusters which pertain to different senses of the word. Given two clusters a word appears in, contrast measures the trade-off between the two clusters both being good representatives of the word and the clusters being dissimilar.

When building the word graph, we neglected word order by not giving the links in the graph any direction. In Chapter 7, we have extracted those word pairs (links) which have a bias for one of the two possible orderings. We found that asymmetry can be the result of semantic constraints which determine the ordering, or can be due to idiomaticity. The meaning of links whose asymmetry arises from semantic constraints is the composition of the meanings of their constituent words. The meaning of idiomatic links, on the other hand, cannot be inferred from the constituent words.
Such word pairs pose a great challenge to computers trying to understand human language. We have proposed different approaches for distinguishing between compositional and non-compositional links, which include graph-theoretic as well as statistical techniques. Several of these methods have yet to be implemented and tested.

Many techniques have been developed for extracting semantically similar words. They do not, however, distinguish between semantic coordinates (such as biker and climber) and true synonyms (such as biker and cyclist). In Chapter 8, we have attempted to extract synonyms from the word graph, our initial hypothesis being that synonyms correspond to nodes which live in the same densely connected word community, yet are not connected to one another. We have found this hypothesis to apply only to a small number of synonyms, in particular to spelling variants (moslem and muslim) or pairs of singular and plural forms, which we included in our working definition of synonymy. We revised our initial hypothesis to simply assume a strong bond between synonyms via shared neighbors, which led to a considerable improvement in performance. Again, we used WordNet as the gold standard to evaluate against.

10.2 Conclusion and Future Research

We have presented a collection of graph-theoretic tools for extracting semantic information from graphs built from free text. Both the model and the methods are easy to understand and implement, and are easily adaptable to other kinds of linguistic (and even non-linguistic) data. Our techniques require minimal linguistic preprocessing, do not rely on any external knowledge sources, and require only a few parameters to be set. The results are comparable to those reported in the literature for previous approaches.

Our focus was on the graph-theoretic methodology rather than on sophisticated linguistic data processing. However, we expect our techniques to yield more accurate results if applied to data which is assembled in a more advanced way, such as by using a chunker. Furthermore, using more complex grammatical relationships, in particular syntagmatic relationships, will allow us to study other semantic phenomena, including regular polysemy and the disambiguation of nominalizations, for which the word graph in its current form is not applicable. The use of syntagmatic relationships will also allow us to extend the study of idiomaticity to expressions other than lists, such as adjective-noun or verb-object pairs (e.g., French leave; wet blanket; kick the bucket; spill the beans).

Another interesting direction to pursue is to investigate how enhancing the semantic information contained in (local) grammatical relationships with (global) topical information, as provided by Latent Semantic Analysis, allows us to make predictions about the meanings of words even if they are not observed in the grammatical relationships used for building the graph. For example, we can associate a vector with each link in the graph, which is a representation of the contexts it occurs in. Each subgraph of the word graph is then associated with a context vector which is the average of the vector representations of the links it contains.
This provides us with the ability to disambiguate an occurrence of a word by comparing its surrounding context with the context representations of its clusters, which correspond to subgraphs.

Chapter 11

Summary (Zusammenfassung)

The ever-growing flood of data and information, particularly in the medical and technological domains, can only be coped with by means of automatic language processing systems. Most of these systems require knowledge about words and their meanings, and for this they draw on existing electronic lexicons, such as the English online dictionary WordNet or its German counterpart GermaNet. A general-purpose lexicon, however, can never account for all the sense distinctions contained in a text. Technical literature in particular abounds with domain-specific terms which are not listed in general-purpose lexicons. Specialized lexicons are rare and, owing to the constant appearance of new words, cannot be kept up to date by hand. Automatic methods are needed which can extract information about words and their meanings from a representative text.

This thesis presents a word graph which models nouns and their relationships in a text. A graph is a model for describing objects which stand in some relation to one another. The objects are represented as nodes (points), the relationships as edges (lines).

Enumerations, such as the following, often contain words with similar semantic properties:

Würzen Sie Ihre Speisen also bevorzugt mit Ingwer, Meerrettich, Pfeffer, Senf, Oregano, Zimt, Nelken und Koriander. ('So season your dishes preferably with ginger, horseradish, pepper, mustard, oregano, cinnamon, cloves and coriander.')

We collect all enumerations occurring in a text¹ and build a graph from them by introducing a node for every word and connecting two nodes by an edge whenever the corresponding words occur together in an enumeration. The resulting word network contains a wealth of interesting and useful information about words, their meanings and their relationships. The aim of this thesis is to extract this information, which is hidden in the link structure. We address several semantic phenomena, namely ambiguity, similarity, idiomaticity and synonymy, each of which is treated in a separate section.

¹We use the English-language British National Corpus as our example case.

11.1 Semantic Ambiguity and Similarity

One and the same word form can, depending on context, take on very different meanings. Ambiguous words pose a particular difficulty for many text processing tools. A machine translation system, for example, must be able to recognize which meaning of the word Gericht is referred to in the following sentence: Das Gericht vertagte das Verfahren, in dem mehr als 15.000 Aktionäre klagen, auf den 21. Juni 2005. ('The court adjourned the proceedings, in which more than 15,000 shareholders are suing, until 21 June 2005.') While it causes us no difficulty to recognize the intended meaning (usually we will not even notice the ambiguity), it is a hard task for a computer to decide whether Gericht must be translated as court or as dish.

The research field of word sense disambiguation is concerned with the task of automatically inferring the meanings of words from a given context. Disambiguation methods require in advance a lexicon which lists all possible word senses, which can then be used to attach a sense "label" to the ambiguous words in the text. As we have already mentioned, the senses occurring in the text and the word senses listed in the lexicon cannot be mapped onto one another one-to-one. Many word senses that are important for the text at hand are missing from the lexicon. Conversely, the lexicon contains many word senses that are of no interest at all for the given text or domain.

Alternatively, the repertoire of possible word senses could be learned directly from the text. The word graph offers us this possibility. Consider Figure 11.1. The neighbors of the Gericht node fall into two sense groups, which correspond to the two different meanings 'dish' and 'court'. Words within a sense group are tightly interconnected.
Words belonging to different sense groups, in contrast, are separated from one another. The two word groups are held together solely by the Gericht node. If it is removed from the graph, the graph decomposes into two subgraphs which correspond exactly to the sense distinction 'dish' versus 'court'.

Based on this idea, we have developed an algorithm which extracts sense clusters one by one from the neighborhood graph of an ambiguous word. We require that each new cluster show no similarity to the clusters found previously. This prevents the same sense of a word from being extracted several times. In addition, the rarer senses of a word thereby become more clearly visible and can be found as well.

We have tested this approach on various words of different degrees of ambiguity, and found that the different senses of a word can often be identified even when they are related to one another.


Figure 11.1: The neighborhood of the word Gericht in the word graph splits into two groups, which correspond to the different meanings 'dish' and 'court'.
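The idea, removing the ambiguous node and taking the connected components of its neighborhood, can be sketched as follows; the adjacency data is a small invented fragment of the Gericht example:

```python
def sense_groups(graph, target):
    """Split the neighbors of `target` into sense groups: conceptually
    delete the target node and return the connected components of the
    subgraph induced by its neighborhood.

    `graph` is an undirected adjacency dict {node: set(neighbors)}.
    """
    neighbors = set(graph[target])
    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        # Traverse within the neighborhood, never passing through the target.
        comp, queue = set(), [start]
        while queue:
            node = queue.pop()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(graph[node] & neighbors - comp)
        seen |= comp
        groups.append(comp)
    return groups

# Toy neighborhood of the ambiguous word Gericht ('dish' vs. 'court');
# the edges are invented for illustration.
graph = {
    "Gericht": {"Suppe", "Mahl", "Richter", "Amtsgericht"},
    "Suppe": {"Gericht", "Mahl"}, "Mahl": {"Gericht", "Suppe"},
    "Richter": {"Gericht", "Amtsgericht"}, "Amtsgericht": {"Gericht", "Richter"},
}
print(sense_groups(graph, "Gericht"))  # two groups: dish words, court words
```

The actual algorithm additionally discriminates each new cluster against the previously extracted ones; the component split above only illustrates the underlying graph intuition.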

First, however, it must be possible to track down the ambiguous words. A graph-theoretic measure which makes this possible is the so-called curvature of a node in the graph. Curvature measures the cohesion within the neighborhood of a node. The neighbors of very specific, unambiguous words are often neighbors of one another as well. In contrast, as we have seen for the example word Gericht, the cohesion among the neighbors of a general and/or ambiguous word is much weaker. The curvature of a node is computed as the proportion of the links that actually exist between the neighbors of the node out of all possible links between its neighbors. Figure 11.2 shows three nodes of different curvature.


(a) Kr(v0)=0. (b) Kr(v0)=0.4. (c) Kr(v0)=1.

Figure 11.2: Nodes of low, medium, and high curvature.
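In graph-theoretic terms, this curvature is the clustering coefficient of a node. A minimal sketch; the adjacency data is our own, chosen so that four of the ten neighbor pairs of v0 are linked, matching the value Kr(v0) = 0.4 of panel (b):

```python
from itertools import combinations

def curvature(graph, v):
    """Fraction of actually linked neighbor pairs among all possible pairs.

    `graph` is an undirected adjacency dict {node: set(neighbors)}.
    """
    neighbors = graph[v]
    pairs = list(combinations(neighbors, 2))
    if not pairs:
        return 0.0
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)

# v0 with five neighbors; 4 of the 10 possible neighbor pairs are linked.
graph = {
    "v0": {"v1", "v2", "v3", "v4", "v5"},
    "v1": {"v0", "v2"},
    "v2": {"v0", "v1", "v4"},
    "v3": {"v0", "v5"},
    "v4": {"v0", "v2", "v5"},
    "v5": {"v0", "v3", "v4"},
}
print(curvature(graph, "v0"))  # -> 0.4
```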

We were able to show that the curvature of a word in the graph correlates strongly with the number of senses of the word in the WordNet lexicon, which means that curvature is indeed a good measure for predicting a word's degree of ambiguity. Nodes of high curvature correspond to unambiguous words, nodes of low curvature to ambiguous ones. One group of words characterized by very high curvature in the word graph are country names, which indeed, with few exceptions, are unambiguous. Family names, by contrast, constitute the group of words with the lowest curvature. Family names are not only often "borrowed" from, for example, professions (Schmied 'smith', Schneider 'tailor'), colors (braun 'brown', schwarz 'black') or nature (Stein 'stone', Wald 'forest'), but are also ambiguous among the many different persons bearing the same name.

Curvature is not only useful for detecting ambiguity. It is also excellently suited as a clustering tool. The ambiguous words, which are characterized by low curvature, are the pivots of the graph. If they are removed from the graph, the only cohesion between the different word groups is lost, and the graph decomposes naturally into coherent semantic classes. The resulting classes are particularly well suited for predicting the meanings of hitherto unknown words on the basis of the other, known words in the class. We tested this in an experiment as follows.

We took those nouns in our graph to be known which were already contained in an early version of WordNet. All words which only found their way into WordNet in a later version were instead regarded as new. Our task was to identify the meanings of the new words on the basis of the known ones and to place them at the correct position in WordNet. For many of the new words we succeeded in doing so quite precisely.

The nodes of the word graph represent words, many of which have several meanings. The edges of the graph, which represent word pairs, by contrast almost always have just one meaning. For example, the edge (Gericht, Essen) unambiguously refers to a dish, and the edge (Gericht, Arbeitsgericht) to the court. It therefore seems more sensible to cluster the edges rather than the nodes. The edges of an ambiguous word can then be distributed over different clusters, which correspond to the different senses of the word.

To make this possible, we perform a graph transformation. From the original word graph we construct a new, transformed graph (the link graph) by turning every edge into a new node. Within a triangle, words almost always take on just one meaning. We therefore connect two new nodes in the link graph whenever, as edges of the original graph, they occur together in a triangle. The graph transformation is illustrated in Figure 11.3.


Figure 11.3: From the word graph to the link graph: the original graph, the new nodes, the new edges, the link graph.
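The transformation can be sketched as follows; the toy graph (a triangle plus a pendant edge) is our own, chosen to show that edges lying in no triangle end up isolated in the link graph:

```python
def link_graph(edges):
    """Turn each edge of an undirected graph into a node; connect two
    edge-nodes iff the corresponding edges lie in a common triangle."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def key(a, b):
        # Canonical, order-independent name for an undirected pair.
        return tuple(sorted((a, b)))

    new_edges = set()
    for u, v in edges:
        for w in adj[u] & adj[v]:  # w closes a triangle u-v-w
            new_edges.add(key(key(u, v), key(u, w)))
            new_edges.add(key(key(u, v), key(v, w)))
    return new_edges

# Toy graph: triangle 0-1-2 plus the pendant edge 2-3.
print(link_graph([(0, 1), (1, 2), (0, 2), (2, 3)]))
```

The triangle contributes three link-graph edges connecting its three edge-nodes pairwise, while the pendant edge (2, 3) becomes an isolated node.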

The newly created link graph can now be clustered. For this we use the freely available Markov Clustering tool. The resulting clusters consist of

nodes of the link graph, that is, of edges of the original graph. These (non-overlapping) clusters of edges can be converted into (overlapping) clusters consisting of the individual words. Ambiguous words can naturally belong to several clusters via their different edges. This is a great advantage over most other clustering methods, which allow words, even ambiguous ones, to appear in only a single cluster.

We have found that the resulting clustering does recognize the meanings of ambiguous words very well. The clustering is, however, somewhat too fine-grained, so that often several of the clusters in which a word occurs refer to the same sense of the word. To get around this problem, we have developed a method which allows us to extract, from the clusters in which a word occurs, those pairs which correspond to different senses. We call such a pair of clusters a contrasting pair.

We have evaluated our clustering by checking, for each word and each of its senses, how close they are in WordNet to the senses of the other cluster members. We compared the results of this evaluation with a naive approach which partitions the words into disjoint clusters. Clustering the link graph yields considerably better results than the naive approach. Our results are comparable with the results published for other methods.
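The Markov Clustering step can be sketched as follows. This is a minimal re-implementation of the MCL idea (alternating expansion and inflation), not the official tool, and the example adjacency matrix (two triangles joined by a bridge edge) is our own:

```python
import numpy as np

def mcl(A, expansion=2, inflation=2.0, iters=50):
    """Minimal Markov Clustering sketch: alternate expansion (random-walk
    steps, i.e. matrix powers) and inflation (elementwise powers followed
    by renormalization) until the column-stochastic matrix converges."""
    M = A / A.sum(axis=0)                       # column-normalize
    for _ in range(iters):
        M = np.linalg.matrix_power(M, expansion)
        M = M ** inflation
        M = M / M.sum(axis=0)
    # Each surviving (attractor) row spans the columns of one cluster.
    clusters = set()
    for row in M:
        members = frozenset(np.nonzero(row > 1e-6)[0])
        if members:
            clusters.add(members)
    return clusters

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge 2-3;
# self-loops on the diagonal, as is usual for MCL.
A = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=float)
print(mcl(A))  # expected: the two triangle clusters
```

The inflation parameter controls cluster granularity, which is exactly the knob behind the "too fine-grained" behavior discussed above.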

11.2 Idioms

When we constructed the word graph, we simply ignored the order of the words in an enumeration. Often, however, word order does matter. We have extracted those edges in the word graph which correspond to word pairs occurring significantly more often in one order than in the other. These edges could be divided into two classes. In the first class, the fixed order is the result of semantic constraints. Examples of such constraints are the strong before the weak (Lehrer und Schüler 'teacher and pupil'), the positive before the negative (Leben und Tod 'life and death'), or the earlier before the later (Frühling und Sommer 'spring and summer'). The other class comprises idiomatic expressions, such as Kraut und Rüben (literally 'cabbage and turnips', i.e. a complete mess) or (an allen) Ecken und Enden ('(at all) corners and ends', i.e. everywhere). While the meaning of the word pairs of the first class can be derived from the individual words, this is not possible for idiomatic word pairs. These therefore pose a problem for language processing systems. Using various graph-theoretic and statistical methods, we attempt to separate the idiomatic from the non-idiomatic word pairs automatically.

11.3 Synonyms

While many methods exist for the extraction of similar words, there are only few approaches to the extraction of true synonyms. We initially assumed that synonyms do occur in enumerations with the same words, but never occur together (Badeanzug und Bademütze 'swimsuit and bathing cap', but not Badeanzug und Schwimmanzug 'swimsuit and swimsuit'). In the word graph this means that synonyms correspond to nodes which are not connected to one another but share many neighbors. This assumption, however, has turned out to be too restrictive: contrary to our assumption, synonyms occur together in enumerations surprisingly often. A frequent reason for this is ambiguity. It may well happen that a word is a synonym of only one of several senses of another word. This is the case, for example, for the word Kakao ('cocoa'). Kakao and (Trink-)Schokolade ('drinking chocolate') are synonyms, Kakao and Schokolade(ntafel) ('chocolate (bar)') are not. An enumeration containing both Kakao and Schokolade(ntafel) is thus entirely conceivable.

But synonyms frequently occur together in other cases too, in particular in or-coordinated enumerations (e.g., Schwierigkeiten oder Probleme 'difficulties or problems'). A weakened version of our original hypothesis has proven more effective for the extraction of synonyms: synonyms possibly occur together, but in any case occur with the same words in enumerations; or, expressed graph-theoretically, synonyms correspond to nodes which are connected to one another by an extremely large number of two-edge paths. With this method we were able to identify as synonyms many of the word pairs (edges) which are listed as such in WordNet².
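The weakened hypothesis, that synonyms are nodes connected by exceptionally many two-edge paths, can be sketched as follows (the toy enumeration graph is invented):

```python
from collections import defaultdict
from itertools import combinations

def two_edge_paths(edges):
    """Count, for every pair of nodes, the number of shared neighbors,
    i.e. the number of two-edge paths connecting them."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = {}
    for a, b in combinations(sorted(adj), 2):
        counts[(a, b)] = len(adj[a] & adj[b])
    return counts

# Toy enumeration graph: the spelling variants "moslem" and "muslim"
# never occur together here but share all their list neighbors.
edges = [("moslem", "christian"), ("moslem", "jew"), ("moslem", "hindu"),
         ("muslim", "christian"), ("muslim", "jew"), ("muslim", "hindu")]
counts = two_edge_paths(edges)
print(max(counts, key=counts.get))  # -> ('moslem', 'muslim')
```

In practice the raw shared-neighbor count would be normalized against the degrees of the two nodes; the sketch only shows the two-edge-path criterion itself.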

11.4 Conclusion and Future Research

In this thesis we have developed and applied various graph-theoretic tools in order to extract valuable semantic information from texts. Our methods require minimal preprocessing and are easy to understand and to implement. Owing to the generality of both the model and the presented methods, they can be applied directly to other kinds of linguistic (and also non-linguistic) material. It will be the subject of future work to extend the graph model to other word classes (verbs, adjectives) and to more complex grammatical relationships between words, in particular syntagmatic relationships. From this we hope to be able to investigate semantic phenomena, such as regular polysemy or the ambiguity of nominalizations, for which the word graph in its current form is not suitable.

²In WordNet the synonym relation is encoded indirectly: words are grouped into so-called synsets, which represent concepts. Words belonging to the same synset are (near-)synonyms.

References

R. Albert, H. Jeong, and A.-L. Barabási. 2000. The Internet's Achilles' heel: error and attack tolerance of complex networks. Nature, 406(378).
J. Bagga. 2004. Old and new generalizations of line graphs. International Journal of Mathematics and Mathematical Sciences, 29:1509–1521.
T. Baldwin, C. Bannard, T. Tanaka, and D. Widdows. 2003. An empirical model of multiword expression decomposability. In ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 89–96, Sapporo, Japan.
A.-L. Barabási and R. Albert. 1999. Emergence of scaling in random networks. Science, 286:509–512.
A.-L. Barabási and Z. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5:101–111.
S. Benor and R. Levy. 2006. The Chicken or the Egg? A Probabilistic Analysis of English Binomials. Language, to appear.
V.D. Blondel and P.P. Senellart. 2002. Automatic extraction of synonyms in a dictionary. In SIAM Workshop on Text Mining, Arlington, TX.
B. Bollobás. 1998. Modern Graph Theory. Number 184 in Graduate Texts in Mathematics. Springer-Verlag.
S. Brin and L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117.
H. Burger. 1998. Phraseologie - Eine Einführung am Beispiel des Deutschen. Erich Schmidt, Berlin.
S. Caraballo and E. Charniak. 1999. Determining the specificity of nouns from text. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing (EMNLP) and Very Large Corpora (VLC).
S. Cederberg and D. Widdows. 2003. Using LSA and noun coordination information to improve the precision and recall of automatic hyponymy extraction. In Proceedings of the Conference on Natural Language Learning (CoNLL), Edmonton, Canada.
G. Chartrand. 1985. Introductory Graph Theory. Dover.
M. Ciaramita and M. Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2003).
G. de Chalendar and B. Grau. 2000. SVETLAN' or how to classify words using their context. In Proceedings of the 12th European Workshop on Knowledge Acquisition, Modeling and Management (EKAW), pages 203–216.
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
L. Degand and Y. Bestgen. 2003. Towards automatic retrieval of idioms in French newspaper corpora. Literary and Linguistic Computing, 18(3):249–259.
S.N. Dorogovtsev and J.F.F. Mendes. 2001. Language as an evolving word web. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268(1485):2603–2606, December.
T.E. Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
J.-P. Eckmann and E. Moses. 2002. Curvature of co-links uncovers hidden thematic layers in the world-wide web. Proceedings of the National Academy of Sciences, 99:5825–5829.
J.-P. Eckmann, E. Moses, and D. Sergi. 2004. Entropy of dialogues creates coherent structure in e-mail traffic. Proceedings of the National Academy of Sciences, 101:14333–14337.
P. Erdős and A. Rényi. 1960. On the evolution of random graphs. Mat. Kutató Int. Közl., 5:17–60.
S. Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
A. Fazly, R. North, and S. Stevenson. 2005. Automatically distinguishing literal and figurative usages of highly polysemous verbs. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, pages 38–47, Ann Arbor, Michigan.
C. Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
R. Ferrer-i-Cancho and R.V. Solé. 2001. The small world of human language. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268(1482):2261–2265, November.
O. Ferret and B. Grau. 1998. A thematic segmentation procedure for extracting semantic domains from texts. In Proceedings of the European Conference on Artificial Intelligence (ECAI), Brighton, UK.
O. Ferret. 2004. Discovering word senses from a network of lexical cooccurrences. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.
M. Finkelstein-Landau and E. Morin. 1999. Extracting semantic relationships between terms: Supervised vs. unsupervised methods. In Proceedings of the International Workshop on Ontological Engineering on the Global Information Infrastructure.
N. I. Fisher, T. Lewis, and B. J. J. Embleton. 1987. Statistical Analysis of Spherical Data. Cambridge University Press.
W. Gale, K. Church, and D. Yarowsky. 1992. One sense per discourse. In DARPA Speech and Natural Language Workshop, Harriman, NY.
S. Gauch and R. Futrelle. 1994. Experiments in automatic word class and word sense identification for information retrieval. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pages 425–434, Las Vegas, NV.
M. Girvan and M.E.J. Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99:7821–7826.
R. Gläser. 1998. The stylistic potential of phraseological units in the light of genre analysis. In A. P. Cowie, editor, Phraseology: Theory, Analysis, and Applications, chapter 3, pages 125–143. Clarendon Press, Oxford.
M.A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, Nantes, France.
M.A. Hearst. 1998. Automated discovery of WordNet relations. In Fellbaum (1998), chapter 5, pages 131–152.
H. Jeong, B. Tombor, R. Albert, Z. Oltvai, and A. Barabási. 2001. The large-scale organization of metabolic networks. Nature, 407:651–654.
J.-U. Kietz, A. Maedche, and R. Volz. 2000. A method for semi-automatic ontology acquisition from a corporate intranet. In Proceedings of the EKAW-2000 Workshop "Ontologies and Text", Juan-les-Pins, France, October 2000. Number 1937 in Springer Lecture Notes in Artificial Intelligence (LNAI).
A. Kilgarriff. 1992. Polysemy. Ph.D. thesis, University of Sussex, CSRP 261, School of Cognitive and Computing Sciences.
J. M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.
D. Lin, S. Zhao, L. Qin, and M. Zhou. 2003. Identifying synonyms among distributionally similar words. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 1492–1493, Acapulco, Mexico.
D. Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI.
A. Maedche and S. Staab. 2000. Discovering conceptual relations from text. In Proceedings of the 14th European Conference on Artificial Intelligence, pages 321–325.
C.D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
I.D. Melamed. 1997. Measuring semantic entropy. In Proceedings of the SIGLEX Workshop on Tagging Text with Lexical Semantics, Washington, DC.
E. Morin and C. Jacquemin. 2004. Automatic acquisition and expansion of hypernym links. Computers and the Humanities, 38:343–362.
G. Müller. 1998. Beschränkungen für Binomialbildungen im Deutschen. Zeitschrift für Sprachwissenschaft, 16:5–51.
J.C. Nacher, N. Ueda, T. Yamada, M. Kanehisa, and T. Akutsu. 2004. Clustering under the line graph transformation: application to reaction network. BMC Bioinformatics, 5:207.
M.E.J. Newman. 2002. The spread of epidemic disease on networks. Physical Review E, 66.
G. Nunberg, I.A. Sag, and T. Wasow. 1994. Idioms. Language, 70:491–538.
P. Pantel and D. Lin. 2002. Discovering word senses from text. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
T. Pedersen, S. Patwardhan, and J. Michelizzi. 2004. WordNet::Similarity - measuring the relatedness of concepts. In Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-04), Boston, MA.
F. Pereira, N. Tishby, and L. Lee. 1993. Distributional clustering of English words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183–190, Columbus, Ohio.
D. Prescher. 2002. EM-basierte maschinelle Lernverfahren für natürliche Sprachen. Ph.D. thesis.
E. Prisner. 1995. Graph Dynamics. Pitman Research Notes in Mathematics Series.
H. Pruscha. 1996. Angewandte Methoden der Mathematischen Statistik. B.G. Teubner, Stuttgart, 2nd edition.
R. Rapp. 2003. Discovering the meanings of an ambiguous word by searching for sense descriptors with complementary context patterns. In Actes des cinquièmes rencontres terminologie et intelligence artificielle, pages 145–155, Strasbourg, France.
P. Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:93–130.
E. Riloff and J. Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Claire Cardie and Ralph Weischedel, editors, Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 117–124. Association for Computational Linguistics, Somerset, New Jersey.
B. Roark and E. Charniak. 1998. Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1110–1116.
M. Rooth, S. Riezler, D. Prescher, G. Carroll, and F. Beil. 1999. Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland.
I. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), pages 1–15, Mexico City, Mexico.
S. Schulte im Walde. 2003.
Experiments on the Automatic Induction of German Semantic Verb Classes. Ph.D. thesis, Institut f¨ur Maschinelle Sprachverarbeitung, Universit¨at Stuttgart. Published as AIMS Report 9(2). H. Sch¨utze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124. C. Shannon. 1948. A mathematical theory of communication. Bell system technical jour- nal, (27):379–423, 623–656. R. Sproat and J. van Santen. 1998. Automatic ambiguity detection. In Proceedings of International Conference on Spoken Language Processing, Sydney, Australia. P. D. Turney. 2001. Mining the Web for synonyms: PMI–IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167. S. van Dongen. 2000. Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht, May. J. V´eronis. 2003. Hyperlex : cartographie lexicale pour la recherche d’informations. In 186 11. Zusammenfassung

Actes de la Conference Traitement Automatique des Langues (TALN), pages 265–274, Batz-sur-mer, France. D.J. Watts and S. Strogatz. 1998. Collective dynamics of ’small-world’ networks. Nature, 393(6):440–442, June. D. Widdows and B. Dorow. 2002. A graph model for unsupervised lexical acquisition. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1093–1099, Taipei, Taiwan, August. D. Widdows. 2003. Unsupervised methods for developing taxonomies by combining syn- tactic and statistical information. In HLT-NAACL. HLT-NAACL, Edmonton, Canada. D. Yarowsky. 1993. One sense per collocation. In ARPA Human Language Technology Workshop, pages 266–271, Princeton, NJ. D. Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised meth- ods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Acknowledgements

Many people have helped me greatly with this thesis, both scientifically and personally. I am especially grateful to Dominic Widdows for his tremendous support and constant encouragement. It was wonderful to work with Dominic, not only for his extensive knowledge and cool ideas but particularly for his kindness and contagious enthusiasm. I would like to thank my thesis advisors Ulrich Heid, Hinrich Schütze and Christian Rohrer for their cheerfulness, their many helpful comments and their genuine interest in my thesis research. I am grateful to Stanley Peters for his kindness and for having made another stay at CSLI possible. Many thanks to Jean-Pierre Eckmann, Elisha Moses and Danilo Sergi, who were important collaborators on this thesis project and contributed many brilliant ideas. It was they who introduced Dominic and me to the concept of graph curvature, which has become central to this thesis, and who originated the curvature clustering scheme. Thank you for the kind hospitality in Geneva and Rehovot; I very much enjoyed both the scientific and the social part of my visits. The friendly and cheerful atmosphere at the IMS made me feel welcome right from the beginning. In particular, I would like to thank Tina Säuberlich, Sabine Dieterle, Inga Frese, Kristina Spranger, Giusy Rota, Sabine Schulte im Walde, Arne Fitschen, Christian Hying, Arndt Riester and André Blessing for their friendship and for distractions in the form of coffee breaks, movies, parties, swims, runs and climbs. Most thanks go to my husband Bernd for supporting me with his unlimited and gentle patience, and for cheering me up by turning mountains into molehills. It is wonderful to be working on a new "project" together: our now four-and-a-half-month-old daughter Anna Noemi, who is incomparably more exciting and exhilarating than any scientific endeavor.