
On Link Predictions in Complex Networks with an Application to Ontologies and

Inaugural dissertation submitted for the degree of Doktor der Philosophie at the Department of Language, Literature, Culture (05) of Justus-Liebig-Universität Gießen

submitted by Bastian Entrup

from Hardegsen

2016
Dean: Prof. Dr. Magnus Huber
First reviewer: Prof. Dr. Henning Lobin
Second reviewer: Prof. Dr. Thomas Gloning
Date of the disputation: 21 June 2017

For my wife.

Contents

List of Figures

List of Tables

1 Introduction

2 Graph Theory and Complex, Natural Networks
  2.1 Properties of Graphs and Networks
    2.1.1 Vertices, Edges, In- and Out-Degree
    2.1.2 Components, Cliques, Clusters, and Communities
    2.1.3 Multigraphs, Trees, and Bipartite Graphs
    2.1.4 Measure of Centrality and Importance of Vertices
    2.1.5 Small-World Networks and Scale-Free Distribution
    2.1.6 Terminology
  2.2 Language Networks and Computational Models of Language
  2.3 Graphs of Language
    2.3.1 Hierarchical Structures of the Mind
    2.3.2 Complex Network Growth and Models of Language Network Evolution
  2.4 Conclusion

3 Semantics and Ontologies
  3.1 Frame Semantics and Encyclopedic Knowledge
  3.2 Field Theories
  3.3 Distributional Semantics
  3.4 Semantic Relations
    3.4.1 Synonymy
    3.4.2 Hyponymy
    3.4.3 Meronymy
    3.4.4 Antonymy and Opposition
  3.5 Ontologies
    3.5.1 Conception
    3.5.2 Ontology Formalizations
  3.6 Conclusion

4 Relation Extraction and Prediction
  4.1 Extending Ontologies: Relation Extraction from Text
  4.2 Predicting Missing Links
  4.3 Possible Machine Learning Algorithms
    4.3.1 Machine Learning
    4.3.2 Machine Learning Algorithms
  4.4 Conclusion

5 WordNet: Analyses and Predictions
  5.1 Ambiguity: Polysemy and Homonymy
    5.1.1 Homonymy
    5.1.2 Polysemy
  5.2 Architecture of WordNet
    5.2.1 Nouns
    5.2.2 Adjectives
    5.2.3 Adverbs
    5.2.4 Verbs
    5.2.5 Overview and Shortcomings
  5.3 Network Analysis
    5.3.1 WordNet as a Graph
    5.3.2 Graphs of Single POS
    5.3.3 Network Analysis: Overview
  5.4 Link Prediction: Polysemy
    5.4.1 State of the Art: Finding Polysemy in WordNet
    5.4.2 Feature Selection: the Network Approach
    5.4.3 Data Preparation
  5.5 Evaluation and Results
    5.5.1 Homograph Nouns
    5.5.2 Homograph Adjectives
    5.5.3 Homograph Adverbs
    5.5.4 Homograph Verbs
  5.6 Results

6 DBpedia: Analyses and Predictions
  6.1 DBpedia: an Ontology Extracted from Wikipedia
    6.1.1 Knowledge Bases
    6.1.2 What Is DBpedia?
    6.1.3 Extraction of Data from Wikipedia
    6.1.4 DBpedia Ontology
    6.1.5 Similar Approaches
    6.1.6 Shortcomings of DBpedia
  6.2 Network Analysis
  6.3 Approach to Link Identification in DBpedia
    6.3.1 Related Work
    6.3.2 Identifying Missing Links and the Network Structure of DBpedia
  6.4 The Data Sets
  6.5 Identification of Missing Links: Evaluation and Results
    6.5.1 Setup: Apriori
    6.5.2 Quantitative Evaluation: Apriori
    6.5.3 Setup: Clustering
    6.5.4 Quantitative Evaluation: Clustering
  6.6 Unsupervised Classification of Missing Links
    6.6.1 Setup: Link Classification Using Apriori Algorithm
    6.6.2 Evaluation
  6.7 Conclusion

7 Discussion, Conclusion, and Future Work

8 References

A Appendix

List of Figures

1 Simple example graphs.
2 Closed triangle assumption.
3 A small, simple tree.
4 A simple bipartite graph with two different kinds of vertices: {a,c,e} and {b,d}.
5 Weighted graphs: (a) weighted by degree, (b) weighted by betweenness centrality, and (c) weighted by closeness centrality.
6 Plot of frequency and rank of words based on the Brown Corpus: (a) the data following a power law and (b) the same data plotted on logarithmic scale.
7 The same data as in Fig. 6(b) but scaled by a factor of 2.
8 An Erdős and Rényi random graph with 100 vertices and a connectivity probability of 0.2.
9 Watts and Strogatz graphs with n = 20 and k = 4.
10 A Barabási and Réka graph with 100 vertices.
11 Degree distributions of Erdős and Rényi, Watts and Strogatz, and Barabási and Réka models.
12 An exemplary RDF graph.
13 Sample of hierarchically structured text.
14 Sample of hierarchically structured text with marked terms.
15 HRG: hierarchical random graph model.
16 Word senses, synsets, and word forms in WordNet.
17 Exemplary organization of nouns in WordNet.
18 Exemplary organization of adjectives in WordNet.
19 Exemplary organization of adverbs in WordNet.
20 Exemplary organization of verbs in WordNet.
21 WordNet degree distribution (directed network).
22 WordNet degree distribution (ignoring direction) with fitted power law.
23 Schematic plot of synset {law, jurisprudence} and its first- and second-degree neighbors.
24 NCC plot of WordNet.
25 Degree distribution, in- and out-degree combined, of the four WordNet subsets with the appropriate power law fitting. Note the different scaling of the plots.
26 NCP plots of the WordNet POS subsets.

27 Classes of instances yes/blue and no/red plotted according to the geodesic path.
28 Instances containing word senses from an adverb and the other given POS and their membership of class yes (blue) or no (red).
29 Correlation of geodesic path and class (yes/blue and no/red) in the adjective test and training set.
30 Classes in adverb set relative to geodesic path.
31 Correlation of degree (word form 1) and class in the adverb test and training set.
32 Closeness of the two vertices involved in the instances.
33 Changes in the WordNet graph: connected and unconnected word forms of bank.
34 Exemplary graph: DBpedia classes and instances and their interrelations.
35 Connectivity between the four Beatles in DBpedia.
36 The Beatles and their neighbors in DBpedia.
37 DBpedia degree distribution.
38 Unconnected triad with assumed connection (dotted edge) between two vertices A and C.
39 Precision and recall in relation to the similarity.
40 Classification results with all selected classes: predicted relations in %. Green: correct classifications, red: incorrect classifications.
41 Screenshot of Wikipedia page: Barenaked Ladies.
42 Classification results with only well-performing classes: predicted relations in %. Green: correct classifications, red: incorrect classifications.
43 Existing relations (grey, dotted) and newly added relations (black) between the four Beatles in DBpedia.

List of Tables

1 Lexical field wîsheit. Left: ca. 1200; right: ca. 1300.
2 Componential analysis of seats (Pottier, 1978, p. 404).
3 Exemplary co-occurrence matrix.
4 Exemplary word vector: food.
5 Binary semantic property of siblings.
6 Logical operators, restrictions, and set theoretic expressions used in description logics.
7 Evaluation in IR: true positives, false positives, true negatives, and false negatives.
8 List of 25 unique beginners in the noun set of WordNet.
9 Relations in WordNet.
10 Overview: relations per word class.
11 Proposed feature set for the machine learning task.
12 Instances of the kind noun ↔ POS.
13 Ablation study: accuracy difference obtained by removal of single feature in the noun set. Features in italics are used for evaluation later on.
14 Precision and recall of the random-forest algorithm on the noun set using only basic network measures.
15 Correctly and incorrectly classified instances in the noun set using only basic network-based measures.
16 Comparison of feature sets.
17 Comparison: support vector machines.
18 Comparison: Naive Bayes classifier.
19 Comparison: J48 decision tree.
20 Comparison: multilayer perceptron neural network using back-propagation.
21 Comparison: logistic regression.
22 Homograph pairs of the kind adjective ↔ POS.
23 Correctly and incorrectly classified instances in adjective test set.
24 Ablation study: accuracy difference obtained by removal of single feature in adjective set.
25 Precision and recall of the random-forest algorithm on the adjective test set.
26 Instances of the kind adverb ↔ POS.
27 Correctly and incorrectly classified instances in adverb test set.
28 Precision difference obtained by removal of the single feature in adverb set.
29 Precision and recall of the random-forest algorithm on the adverb test set.

30 Instances of the kind verb ↔ POS.
31 Correctly and incorrectly classified instances in verb set.
32 Precision difference obtained by removal of the single feature in the verb set.
33 Precision and recall of the random-forest algorithm on the verb set.
34 Confusion matrix: verb classification.
35 Overview: the different models' performances.
36 The average path length l, the power-law exponent γ, and the clustering coefficient C for Wikipedia and DBpedia. Wikipedia values for l and γ are taken from Zlatić et al. (2006), C is given in Mehler (2006).
37 Graph-based features of DBpedia data set (dbp2).
38 Most frequent classes in dbp2.
39 Correctly and incorrectly classified missing connections in DBpedia using the a priori approach.
40 True positives and other values for the classification of missing connections in DBpedia using the a priori approach.
41 Confusion matrix: clustering DBpedia instances.
42 Confusion matrix: clustering DBpedia instances using only network-based measures.
43 Classes and the percentage of correct and incorrect classifications compared to the whole test set.
44 Apriori rules to predict relation in DBpedia (excerpt I).
45 Apriori rules to predict relation in DBpedia (excerpt II).
46 Complete list of Apriori rules to predict relations in DBpedia.

1 Introduction

When Alan Turing started to occupy himself seriously with artificial intelligence (AI), it was still something most people had never heard of. In the 1940s, Isaac Asimov wrote his first stories about intelligent robots; his most prominent books were not published until the 1950s. Earlier literature had rarely imagined humans creating intelligent machines. Before science fiction, human creations were mostly unintelligent beings: from the Jewish Golem myth to Frankenstein's monster or Hoffmann's Olimpia in Der Sandmann, none of these artificial humans came close to human intelligence. It must have seemed far-fetched that humans might actually engineer something intelligent. Before Turing occupied himself with AI, he laid the foundations of modern computers and computer science with his work On Computable Numbers, with an Application to the Entscheidungsproblem and the definition of what was later called the Turing machine (Turing, 1937). Turing proposed computing machines1 to solve the so-called Entscheidungsproblem formulated by Hilbert at the beginning of the 20th century:

Das Entscheidungsproblem ist gelöst, wenn man ein Verfahren kennt, das bei einem vorgelegten logischen Ausdruck durch endlich viele Operationen die Entscheidung über die Allgemeingültigkeit bzw. Erfüllbarkeit erlaubt.2 (Hilbert and Ackermann, 1928, p. 73)

Alonzo Church was the first to show, using the lambda calculus, that no such procedure exists (Church, 1936a,b). Turing found a different solution. He defined what was later called the Turing machine, a hypothetical machine consisting of a set of rules and an input. Today's computers still do not do anything more than a Turing machine does. A programming language is called Turing complete if it can be used to simulate a Turing machine, assuming unlimited memory. The universal Turing machine (UTM) is a Turing machine that can read and execute programs: It can simulate or execute other Turing machines. With regard to the Entscheidungsproblem, Turing defined the so-called halting problem: Can there be a Turing machine that can, given the rules and input of another Turing machine, determine whether that Turing machine will come to a halt (i.e., whether the program will finish or run forever)? Turing proved that no such Turing machine exists.3

1 In the time before modern computers based on Turing machines, the word computer referred to a person doing calculations. Turing's computing machines are essentially sets of computing instructions that can be executed by human computers.
2 Translation: The Entscheidungsproblem is solved if there is a procedure that can decide in a finite number of steps whether a given logical expression is universally valid or satisfiable.
3 All one can do is follow (i.e., execute) the set of instructions, the program code. But since a program that runs forever never comes to a halt, one cannot determine in a finite number of steps whether such a program will ever stop.

Since the halting problem is equivalent to the Entscheidungsproblem, Turing could show that there was no way to solve the Entscheidungsproblem (Turing, 1937). He thereby arrived, a few months after Church, at the same answer to Hilbert's question, a result that came to be known as the Church–Turing thesis. Before the first computer was ever built or used, Turing had already defined the foundations of every modern computer and also outlined its limitations. Turing shows that the UTM can carry out any computation that any Turing machine can carry out. Contrary to claims sometimes made, he does not show that a UTM can compute whatever any machine could compute.4 It is "common in modern writing on computability and the brain . . . to hold that Turing's results somehow entail that the brain, and indeed any biological or physical system whatever, can be simulated by a Turing machine" (Copeland, 2008). If this were true, our own mind should be explainable in terms of a Turing machine, a position Copeland (2002) calls narrow mechanism. Our mind would hence be subject to the same restrictions that apply to a UTM. But as Copeland (2008) put it: "Yet it is certainly possible that psychology will find the need to employ models of human cognition that transcend Turing machines". Copeland (2002) calls the assumption that the brain is a machine, but possibly one that cannot be mimicked by a Turing machine, wide mechanism. Although it still remains unclear what the philosophical implications of the Church–Turing thesis, the negative answer to the Entscheidungsproblem, are for our own minds (cf. Copeland, 2008, p. 15), Turing seems to have had no doubts that, regardless of the limitations of computers, AI was a solvable problem:

The original question, “Can machines think?” I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted. (Turing, 1950, p. 442)

In 1950, he invented the imitation game, today referred to as the Turing test: a game in which a human interrogator has to decide whether his or her opponent, the one answering the questions, is human or not. Turing (1950, p. 442) says the following:

I believe that in about fifty years' time it will be possible, to programme computers . . . to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.

4 There might still be non-Turing machines.

Even though his belief in the progress to be made during the 20th century turned out to be exaggerated, his ideas started what we call AI. The imitation game is actually played on a regular basis, and, despite claims made in 2014,5 the systems have not yet accomplished the final goal of making the computer behave, act, and think like a human being. One important step towards this goal is the work being done on natural language semantics and its computational processing. An intelligent machine has to overcome a number of obstacles. Among them are basic necessities that are still not solved, like natural language processing (NLP) and understanding (i.e., the treatment and understanding of semantics or even pragmatics), as well as image processing and interpretation, before one can come to more advanced technologies such as knowledge retrieval or logical inference, not to mention creativity and self-awareness. Besides a computational treatment of syntactic structures, the main goal is the understanding of human language and hence semantics. The branch of formal semantics takes insights from logic and philosophy to treat language in a formal, hence computable, way. Montague (1974) realized not only that the semantics of a sentence arises from the composition of its parts, and is therefore parallel to the syntax, but also that first-order logic could be used to describe a subset of English. Frege, seen as one of the founding fathers of modern logic, is often credited with the principle of compositionality:

The principle of compositionality: The meaning of a complex expression is a function of the meanings of its parts and of the way they are syntactically combined. (Partee, 2011, p. 16)

Following the principle of compositionality, the combination of words makes up the meaning of a sentence. But what is the meaning of a word? This is the second big question in the computational treatment of natural language semantics. Theories such as those of the lexical field, or Harris' distributional structure (Harris, 1954), have been applied to construct word semantics. Work on lexical fields and semantic domains, including semantic relations like synonymy or hyponymy, can be formalized in terms of logic and hence in the form of ontologies. We will see how ontologies can be used to describe the meaning of lexical units or objects and how such ontologies are put to use. Ontologies in the computer science context do not describe the essence of all that is, as in the philosophical use of the word, but rather small parts of the world, or domains. Such ontologies are either manually compiled, a time- and money-consuming task, or they are automatically derived from texts, databases, and other knowledge resources. A combination of both is also very common.

5 The University of Reading claimed that the Turing test was passed at its Turing Test 2014 event (http://www.reading.ac.uk/news-and-events/releases/PR583836.aspx), although this interpretation is highly doubtful (http://www.wired.com/2014/06/turing-test-not-so-fast/).

Ontologies are a building block of question answering (QA) systems. Examples of such ontologies can be found in modern Internet technologies. Apple's Siri is a question answering system that uses an ontology, Wolfram Alpha, at its core. Google is working on an apparently automatically derived ontology to support its search, making it a semantic search engine in addition to the statistical techniques employed today. The IBM Watson Jeopardy! challenge is a good example of the capabilities of AI, and it is built on techniques from NLP and on information or knowledge bases in the form of ontologies. IBM Watson is a QA system. It was built to take part in the Jeopardy! TV show, where three contestants are presented with an answer, like This US president was the only president to be elected more than twice, to which the correct question has to be formulated (e.g., Who was Franklin D. Roosevelt?). The Watson system first has to analyze the given input, the answer, to identify clues, such as US president and elected more than twice, transform these into a logical form that can be used to query a database (e.g., a list of US presidents), and infer information such as the fact that Franklin D. Roosevelt (FDR) won more than two elections. To perform such inferences on data, one needs an ontology that formalizes possible relations and connections between entities, and thereby allows not having to store every single bit of information explicitly: It suffices to store enough so that the information is given at least implicitly, like the fact that FDR was elected more than twice, through relations of the form FDR won us presidential election 1932 and so on. The IBM team uses a range of different ontologies, among them Freebase,6 part of the Google Knowledge Graph, and DBpedia, an ontology extracted from Wikipedia. Furthermore, they applied WordNet to draw fine-grained distinctions between words. In NLP, and in the IBM Watson Jeopardy! task as such, different ontologies serve two different purposes: On the one hand, there are those ontologies that contain world knowledge on different entities, such as the US presidents, and that are mainly used to infer and find knowledge about entities of the real world. On the other hand, ontologies like WordNet are used to capture the meaning of a single word of a language (in the case of WordNet, English), which does not necessarily refer to a physical entity of the extra-linguistic world. Concepts such as love or democracy, as well as actions in the form of verbs or modifiers in the form of adverbs and adjectives, are described in relation and contrast to each other. As will be shown, both domains, word knowledge and world knowledge, are essential to process natural language and to find answers to questions or to carry out orders given in natural language.
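To make the inference idea concrete, the following is a minimal sketch in Python (not part of the thesis; the triples, entity names, and the rule are hypothetical examples) of how a fact that is only implicit in the stored relations, such as a president having been elected more than twice, can be derived from them.

    # Minimal illustration of rule-based inference over ontology-like triples.
    # The data and the rule are invented examples, not taken from any real ontology.
    from collections import defaultdict

    triples = [
        ("FDR", "won", "us_presidential_election_1932"),
        ("FDR", "won", "us_presidential_election_1936"),
        ("FDR", "won", "us_presidential_election_1940"),
        ("FDR", "won", "us_presidential_election_1944"),
        ("Obama", "won", "us_presidential_election_2008"),
        ("Obama", "won", "us_presidential_election_2012"),
    ]

    # Count how many election victories are stored explicitly for each entity.
    wins = defaultdict(int)
    for subj, pred, obj in triples:
        if pred == "won" and obj.startswith("us_presidential_election"):
            wins[subj] += 1

    # Derived (implicit) fact: elected more than twice.
    elected_more_than_twice = [s for s, n in wins.items() if n > 2]
    print(elected_more_than_twice)  # ['FDR']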

6Google announced in December of 2014 that it is not going to develop Freebase any further. The data will be loaded into Wikidata, a collaboratively edited database of facts.

One property of ontologies that makes it possible to extend the information beyond what is directly observable in the data is so-called inference or reasoning. But reasoning only works in small, well-defined areas where the necessary relations have been defined. One can, for example, define a relation sibling as a relation between any two entities that share common parents, without the human collecting the data having to know this relation and manually assign it to the entities in question. A set of entities connected by relations, such as in ontologies, necessarily makes up a network. The nodes or vertices of the network are the entities; they are connected by the relations, the edges in network jargon. While it is commonly known that ontologies are based on semantic networks, and that they form (directed) graphs (cf. Hesse, 2002, p. 478), I have found almost no work actually treating ontologies as networks. There have been graph analyses of WordNet and other ontologies, but the findings have not yet been used to improve the quantity or quality of the ontology in question. Network or graph theory is, nonetheless, often applied in NLP, and in semantics in particular, especially to word networks based on co-occurrences or collocations. These networks are built from words and their co-occurrence with other words in large corpora. The co-occurrence can either be estimated by direct neighbors, by word windows around focus words, or by the dependency or syntactic structure, which might not necessarily correspond to the word order. A network built from co-occurrences connects words to other words in syntagmatic relations. Words that share neighbors in such a graph are paradigmatically related and occur in similar contexts. Words that share many common neighbors can be thought of as being semantically similar. The meaning of a word can then be represented as the connections it has in the network, or as a word vector, where the occurrence of every word in the defined neighborhood of a word is counted. Although these networks do not reach the fine-grained sense distinctions a manually composed ontology contains and do not offer the possibility to perform logical inference, they have the major advantage of being knowledge free and being built without the necessity of human intervention. Still, distributional semantics can make some interesting, though for a linguist maybe not surprising, distinctions based on the distribution of words. For example, Biemann (2009) was able to distinguish female and male first names based on their distribution. Mikolov et al. (2013) have shown that word vectors are compositional to a certain degree: The word vector for London minus the word vector for England is quite similar to the word vector for Paris minus the word vector for France. Since ontologies add human knowledge and analysis of the usage of words to pure distribution, they achieve better results but do not have the broad coverage of the distributional approach. The time and work that is put into ontologies is on the one hand the reason for their accuracy, but on the other hand a major downside when it comes to coverage.
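As a toy illustration of the distributional approach described above, the following sketch builds word vectors from co-occurrence counts in a made-up three-sentence corpus and compares them with cosine similarity; the corpus, the window size, and the code are illustrative assumptions, not material from the thesis.

    # Toy distributional semantics: word vectors from co-occurrence counts.
    import math
    from collections import defaultdict

    corpus = [
        "the cat chased the mouse",
        "the dog chased the cat",
        "the mouse ate the cheese",
    ]
    window = 2  # words to the left and right that count as context

    vocab = sorted({w for sent in corpus for w in sent.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = defaultdict(lambda: [0] * len(vocab))

    for sent in corpus:
        words = sent.split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if i != j:
                    vectors[w][index[words[j]]] += 1

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Words that occur in similar contexts get similar vectors:
    # cat and mouse share more contexts here than cat and cheese do.
    print(cosine(vectors["cat"], vectors["mouse"]))
    print(cosine(vectors["cat"], vectors["cheese"]))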

This leads to the two main problems that occur when working with ontologies: First, they are very labor-intensive to build, and second, they might still be missing needed information; that is, they are incomplete. Some automatic or semi-automatic approaches to extending ontology data without the need to employ human knowledge and human labor have been proposed. These will be reviewed later. Many of these approaches use text structures (i.e., syntactical or hierarchical text structures such as headings or paragraphs) to identify useful and meaningful relations that can be added to the data set. But what if the necessary information is already given, even if it is hidden, in the ontology data? It is assumed that ontologies can be represented and treated as networks and that these networks show properties of so-called complex networks. Just like ontologies, "our current pictures of many networks are substantially incomplete" (Clauset et al., 2008, p. 3ff.). For this reason, networks have been analyzed and methods for identifying missing edges have been proposed. The goal of this thesis is to show how treating and understanding an ontology as a network can be used to extend and improve existing ontologies, and how measures from graph theory and techniques developed in recent years for social network analysis and other complex networks can be applied to semantic networks in the form of ontologies. Given a large enough amount of data, here data organized according to an ontology, and the relations defined in the ontology, the goal is to find patterns that help reveal implicitly given information in an ontology. The approach does not, unlike reasoning and methods of inference, rely on predefined patterns of relations; rather, it is meant to identify patterns of relations or of other structural information taken from the ontology graph in order to calculate probabilities of yet unknown relations between entities. The methods adopted from network theory and the social sciences presented in this thesis are expected to reduce the work and time necessary to build an ontology considerably by automating it. They are believed to be applicable to any ontology and can be used in either supervised or unsupervised fashion to automatically identify missing relations, add new information, and thereby enlarge the data set and increase the information explicitly available in an ontology. As seen in the IBM Watson example, different knowledge bases are applied in NLP tasks. An ontology like WordNet contains lexical and semantic knowledge on words, while general knowledge ontologies like Freebase and DBpedia contain information on entities of the non-linguistic world. In this thesis, examples from both kinds of ontologies are used: WordNet and DBpedia. WordNet is a manually crafted resource that establishes a network of representations of word senses, connects them to the word forms used to express them, and links these senses and forms with lexical and semantic relations in a machine-readable form. As will be shown, although a lot of work has been put into WordNet, it can still be improved.

While it already contains many lexical and semantic relations, it is not possible to distinguish between polysemous and homonymous words. As will be explained later, this distinction can be useful for NLP problems regarding word sense disambiguation and hence QA. Using graph- and network-based centrality and path measures, the goal is to train a machine learning model that is able to identify new, missing relations in the ontology and assign this new relation to the whole data set (i.e., WordNet). The approach presented here will be based on a deep analysis of the ontology and the network structure it exposes. Using different measures from graph theory as features and a set of manually created examples, a so-called training set, a supervised machine learning approach will be presented and evaluated that will show the benefit of interpreting an ontology as a network compared to other approaches that do not take the network structure into account. DBpedia is an ontology derived from Wikipedia. The structured information given in Wikipedia infoboxes is parsed, and relations according to an underlying ontology are extracted. Unlike Wikipedia, it only contains the small amount of structured information (e.g., the infoboxes of each page) and not the large amount of unstructured information (i.e., the free text) of Wikipedia pages. Hence DBpedia is missing a large number of possible relations that are described in Wikipedia. Also compared to Freebase, an ontology used and maintained by Google, DBpedia is quite incomplete. This, and the fact that Wikipedia is expected to be usable for comparing possible results, makes DBpedia a good subject of investigation. The approach to extending DBpedia presented in this thesis will be based on a thorough analysis of the network structure and the assumed evolution of the network, which will point to the locations in the network where information is most likely to be missing. Since the structure of the ontology and the resulting network is assumed to reveal patterns that are connected to certain relations defined in the ontology, these patterns can be used to identify what kind of relation is missing between two entities of the ontology. This will be done using unsupervised methods from the field of data mining and machine learning. The thesis is structured as follows: In Chp. 2, the most important concepts of graph/network theory, regarding the scope of this thesis, are introduced. It will also be shown what distinguishes complex networks from simple networks. In Chp. 2.2, it is shown how graphs, and mathematical concepts in general, can be helpful for understanding the functioning of the human mind. It will be shown how early approaches have taken similar (graph-based) notions to understand lexical processing in the human mind, and to what extent graph theory is more than just a helpful computational model and could indeed explain the functioning of the mind itself.

In Chp. 3, an overview of important theories in semantics is given, to a degree that is helpful to follow the argumentation and experiments in this thesis. Especially lexical semantics, including frame semantics (Chp. 3.1), field theories (Chp. 3.2), distributional approaches (Chp. 3.3), and semantic relations (Chp. 3.4), are introduced. Semantic fields, domains, and relations are important to understand the architecture of WordNet. Domains are important when it comes to the computational treatment of polysemy and the nature of DBpedia and similar ontologies. Furthermore, the fundamental notions of ontologies, their conception and formalization, are treated briefly in Chp. 3.5. These will not only be helpful for working with WordNet, but also for working with DBpedia. In Chp. 4.1, non-graph-based approaches to automatically extending ontologies are introduced and explained. In Chp. 4.2, approaches to extending and completing networks are discussed with regard to their applicability in the context of this thesis. Chapter 4.3 briefly presents possible machine learning algorithms that are evaluated later on. In Chp. 5, both the architecture (Chp. 5.2) and the network structure (Chp. 5.3) of WordNet are analyzed in order to deduce useful features for a machine learning algorithm. Afterwards, the state of the art regarding polysemy identification in WordNet is presented and discussed; then useful features of WordNet for this task are presented in Chp. 5.4, before the data preparation and the machine learning algorithms to be applied are discussed. These algorithms and the obtained results are then evaluated and presented in Chp. 5.5. In Chp. 6, DBpedia, its creation (Chp. 6.1), and the ontology and network structure (Chp. 6.1.4 and Chp. 6.2) are presented. In Chp. 6.3, already existing approaches to automatically extending DBpedia are presented, and it is shown how these can be extended using the findings of social network analysis. Afterwards, the proposed features are used in a machine learning task and the results are evaluated. Finally, the results are presented and discussed, the findings are analyzed in their context, remaining open questions are briefly discussed, and possible future work is laid out.

2 Graph Theory and Complex, Natural Networks

Networks are a universal, natural way of organization. In the world around us, one can find things that are treated or represented as a network, as well as many structures that are in fact organized in the form of networks. In biology there are the blood circuit, the nervous system, the brain, and even genes; in human interaction one finds social networks; in engineering there are networks of transportation and communication (e.g., the Internet or the World Wide Web). Especially social networks will be of great interest for this thesis and will often be referred to. Newman (2010, p. 1) defines a network in its simplest form as a "collection of points joined together in pairs by lines. In the jargon of the field the points are referred to as vertices or nodes and the lines are referred to as edges". The mathematical field of graph theory forms the basis of network analysis. A graph G is a pair G = (V, E), where V is a set of vertices and E is a set of edges. Vertices are connected by edges. An edge between v1 ∈ V and v2 ∈ V is usually written as {v1, v2}. A network is defined in the same way. In the following, the basic terminology of graph theory will be introduced, measures commonly used in network analyses will be presented,7 and theories about network and language evolution will be discussed to a degree that is helpful in understanding the argumentation of the following chapters.
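For readers who want to follow the definitions computationally, the following minimal sketch builds a small graph G = (V, E) (the same four-vertex example used in Fig. 1(a) below) and reads off some basic quantities. The use of the networkx library is an assumption of this illustration, not a tool prescribed by the thesis.

    # A small graph G = (V, E); networkx is an illustrative choice of library.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")])

    print(sorted(G.nodes()))                     # V = ['a', 'b', 'c', 'd']
    print(G.number_of_edges())                   # |E| = 5
    print(dict(G.degree()))                      # degree of every vertex
    print(nx.shortest_path_length(G, "b", "d"))  # geodesic distance between b and d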

2.1 Properties of Graphs and Networks

2.1.1 Vertices, Edges, In- and Out-Degree

Vertices or nodes are connected to each other by edges. The number of edges incident to a vertex is called its degree. The degree k_i of vertex i can be calculated as

k_i = \sum_{j=1}^{n} A_{ij}, \qquad (1)

where n is the number of vertices and A is the adjacency matrix. In matrix A we can see whether or not an edge between some vertex i and another vertex j exists: if A_ij is 1, the two vertices are connected by an edge.

A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}

7 The overview, including the formulas, presented in the following follows, if not stated differently, Newman (2010). The examples, matrices, and graphs are my own.

In a graph, edges can be undirected (see Fig. 1(a) or the corresponding matrix A8) or directed (see Fig. 1(b) or matrix A′). Directed graphs are called digraphs. When the edges are directed (i.e., pointing in one direction from one vertex to another) they are often referred to as arcs. An edge {a, c} hence is an edge whose source vertex is a and whose target vertex is c.

  0 1 1 0   0  0 0 1 0  A =    0 0 0 1    1 0 0 0

In a digraph, the in- and out-degree can be distinguished. In-degree is the number of edges pointing to the vertex, while the out-degree is the number of edges pointing from the vertex to others. In the undirected case in Fig. 1(a), vertex a has a degree of three. In the digraph in Fig. 1(b), vertex a has an out-degree of two (sum of the first row of A′) and an in-degree of one (sum of the first column of A′). The total degree is hence three. The average degree of a network is the sum of the degrees of all vertices, divided by the number of vertices. A sequence of neighboring edges forms a path. The length of a path connecting two vertices is called their distance. The geodesic path is the shortest path between two vertices. The longest geodesic path between two vertices of a network is called the network diameter. A path connecting two vertices over three edges has the length three. If there is no possible shorter path, this is also the shortest distance. Distance measures will be of special importance throughout this thesis. Especially networks with an overall high centrality have a low geodesic distance. Looking at Fig. 1, one can see the difference between a digraph and an undirected graph regarding the distance: In the undirected graph in Fig. 1(a), the shortest distance from a to d is through edge 4. Other paths do exist (e.g., through edges 5, 3 or 1, 2, 3), but these are longer (lengths of two and three).

8 The matrix is symmetric: it is equal to its own transpose, since edges in an undirected graph have no direction.

(a) Undirected graph (b) Directed graph

Figure 1: Simple example graphs.
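As a check on the degree definitions, the following sketch (numpy is my choice for the illustration) computes row and column sums of the adjacency matrices A and A′ given above, reproducing the degrees stated in the text.

    # Degrees from adjacency matrices, following Eq. (1): k_i = sum_j A_ij.
    import numpy as np

    # Undirected graph of Fig. 1(a); vertices in the order a, b, c, d.
    A = np.array([[0, 1, 1, 1],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [1, 0, 1, 0]])

    # Directed graph of Fig. 1(b).
    A_dir = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1],
                      [1, 0, 0, 0]])

    print(A.sum(axis=1))      # undirected degrees: [3 2 3 2], so deg(a) = 3
    print(A_dir.sum(axis=1))  # out-degrees (row sums): out-deg(a) = 2
    print(A_dir.sum(axis=0))  # in-degrees (column sums): in-deg(a) = 1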

In the digraph in Fig. 1(b), the shortest path between a and d is the path through edges 5, 3. The other possible path, through 1, 2, 3, is longer. Furthermore, there is no path from a to d through edge 4, since the edge's direction is from d to a. In social networks, for example, two friends would have a distance of one, because there is one edge connecting them directly (i.e., the path is of length one). The friend of a friend has the distance two: There is a shortest path of two edges connecting one vertex to the friend of a friend. Given three vertices in a social network where two pairs of vertices are connected, there is often assumed to be a missing connection between the two yet unconnected vertices, especially if many such constellations indicate this connection. In this case, these three vertices make up an unconnected triad. In social networks, it is usually assumed that, given many common neighbors, two vertices are expected to be connected as well. This is commonly known as homophily, same-love. Homophily is the principle that we tend to be similar to our friends. Typically, your friends don't look like a random sample of the underlying population: Viewed collectively, your friends are generally similar to you (Easley and Kleinberg, 2010, p. 86). In Fig. 2(a) an example of a connection between two persons is given. Following the homophily assumption that a friend of a friend is likely to be a friend as well, one can assume the connection given by the dotted line in Fig. 2(b). This is called the closed triangle assumption or triadic closure.

(a) Open triangle: three nodes connected by two edges. (b) Closed triangle: assumed connection.

Figure 2: Closed triangle assumption.

The degree to which vertices are connected to their possible neighbors in a network is called the clustering coefficient or transitivity of the network. It can be calculated as

C = \frac{\text{number of closed triads}}{\text{number of geodesic paths of length two}}. \qquad (2)
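The triadic-closure example of Fig. 2 can be verified directly; in the sketch below, networkx's transitivity function computes the same ratio as Eq. 2 for the open and the closed triangle (the library choice is an illustrative assumption).

    # Triadic closure and the clustering coefficient (Eq. 2) on the Fig. 2 example.
    import networkx as nx

    open_triangle = nx.Graph([("a", "b"), ("a", "c")])
    print(nx.transitivity(open_triangle))    # 0.0: the triad b-a-c is not closed

    closed_triangle = nx.Graph([("a", "b"), ("a", "c"), ("b", "c")])
    print(nx.transitivity(closed_triangle))  # 1.0: every path of length two is closed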

2.1.2 Components, Cliques, Clusters, and Communities

The terms components, cliques, clusters, and communities all refer to subsets of vertices of a graph that are, in one way or another, connected to each other. A component is a subset of vertices of a graph such that there exists a path from any vertex to any other vertex of the component. In directed graphs, these paths have to be directed. In many cases, and it will be shown for WordNet as well, a network consists of more than one component: Not every vertex can be reached by a path from every other vertex. A network that consists of only one large component is called connected. A fully connected graph is one where each vertex is connected to every other vertex of the network directly by an edge. Cliques are sets of vertices such that there exists an edge (i.e., a path of length one) from any member of the clique to every other vertex of the clique and such that no other vertex could be added that fulfills the first requirement. Clustering is "the division of network nodes into groups within which the network connections are dense, but between which they are sparser" (Newman and Girvan, 2004, p. 1). Not every vertex of a cluster, or community, is connected to every other vertex, but the vertices of a cluster are, overall, more interconnected within their cluster than they are with vertices of other clusters. Still, there can be paths from one cluster to another. A connected component can, and in social networks this is almost surely the case, contain many clusters. Looking, for example, at the members of a parliament and treating the members as vertices and their interactions as edges, one would surely find that there is more interaction between members of the same faction, the same committee, and the like than between members of different factions or committees. The factions and committees are the natural clusters of such a network. Clusters are harder to identify than the straightforwardly defined components and cliques. There might be cases where vertices lie at the intersection of different clusters and belong to more than one of them. Clustering algorithms in general aim to identify groups of similar elements. In networks this means identifying the sets of vertices that are strongly interconnected. In Chp. 4.3.2, clustering algorithms with an application to machine learning and network analysis will be discussed.

2.1.3 Multigraphs, Trees, and Bipartite Graphs

Vertices of a graph can be connected by more than one edge; such edges are called multiedges, and such graphs are called multigraphs. The matrix of a multigraph does not only contain Boolean values of 0 and 1, but can contain any number in R≥0, as can be seen in matrix B. In other cases, graphs can contain self-loops (i.e., vertices that are connected through an edge to themselves). Matrix B is the matrix of a multigraph containing a self-loop (see second row, second column).

B = \begin{pmatrix} 0 & 1 & 2 \\ 2 & 3 & 1 \\ 1 & 4 & 0 \end{pmatrix}

The ontologies we are going to examine are multigraphs: Entities of an ontology can be connected to one another by more than one relation. A special form of a connected graph that contains no loops is a tree. An interesting property of a tree is "that a tree of n vertices always has exactly n − 1 edges" (Newman, 2010, p. 128). Even though a tree does not have to be directed, it always implies a hierarchical structure, since multiple inheritance is not permitted (i.e., a vertex cannot have more than one direct parent vertex, as can be seen in Fig. 3). This is also the reason why a tree of n vertices always has exactly n − 1 edges.

Figure 3: A small, simple tree.

Another special graph that will be of interest later on is the so-called bipartite graph. A bipartite graph also consists of vertices and edges, but the vertices are of two different kinds (see Fig. 4). An example of such a graph is a collaboration network in the movie business: Three actors {a, c, e} are linked to the movies {b, d} they appear in. The vertices are of the kinds movie and actor. Such affiliation networks are widely used in social network analyses. There are no direct connections between the actors; they are only connected through the movies they both appear in. One can also reduce these graphs to monopartite graphs (i.e., actors are seen as connected if they worked together on a movie). The latter graph of course is not as meaningful as the original bipartite graph.
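The actor-movie example can be reproduced as follows; since the exact edges of Fig. 4 are not repeated here, the edge list in this sketch is an assumption chosen for illustration, and networkx's bipartite helpers provide the monopartite projection.

    # Bipartite actor-movie graph in the style of Fig. 4 and its projection.
    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    actors = ["a", "c", "e"]
    movies = ["b", "d"]
    B.add_nodes_from(actors, bipartite=0)
    B.add_nodes_from(movies, bipartite=1)
    # Hypothetical appearances: the figure's exact edges are not given here.
    B.add_edges_from([("a", "b"), ("c", "b"), ("c", "d"), ("e", "d")])

    # Two actors become adjacent if they appeared in at least one common movie.
    P = bipartite.projected_graph(B, actors)
    print(list(P.edges()))  # a-c (via movie b) and c-e (via movie d)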

2.1.4 Measure of Centrality and Importance of Vertices

An important measure for vertices in a network is centrality. The degree centrality is simply the degree of a vertex: A vertex with a relatively high degree in a network can be described as central. The overall centrality of a network can be measured using Freeman's (1978) general formula of centrality:

C_D = \frac{\sum_{i=1}^{n} [C_D(a^*) - C_D(a_i)]}{n^2 - 3n + 2}, \qquad (3)

where n is the number of vertices, a_i the particular node, and a^* the maximum degree of a vertex in the network.

Figure 4: A simple bipartite graph with two different kinds of vertices: {a,c,e} and {b,d}.

The centrality C_D of a network is thus defined as the sum, over i = 1, ..., n, of the centrality of the vertex with the maximum degree minus the centrality of vertex a_i, divided by n^2 − 3n + 2 (Freeman, 1978, p. 226ff.). Beside the very basic degree centrality, other measures like eigenvector centrality, betweenness, closeness, and PageRank are commonly applied in network analysis. All the measures presented here will be used later on in link prediction tasks.
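As a small sanity check of Eq. 3: for a star graph, in which one vertex is connected to all others, the centralization is 1, while for a cycle, in which every vertex has the same degree, it is 0. The sketch below (plain Python written for this illustration) computes Eq. 3 from a dictionary of degrees.

    # Freeman's degree centralization (Eq. 3) computed from vertex degrees.
    def degree_centralization(degrees):
        n = len(degrees)
        c_max = max(degrees.values())
        return sum(c_max - c for c in degrees.values()) / (n**2 - 3 * n + 2)

    star = {"hub": 4, "v1": 1, "v2": 1, "v3": 1, "v4": 1}  # 5-vertex star
    cycle = {f"v{i}": 2 for i in range(5)}                 # 5-vertex cycle

    print(degree_centralization(star))   # 1.0: maximally centralized
    print(degree_centralization(cycle))  # 0.0: no vertex is more central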

Eigenvector Centrality

Newman (2010, p. 169) says that

[i]n many circumstances a vertex’s importance in a network is increased by having connections to other vertices that are themselves important. This is the concept behind eigenvector centrality. Instead of awarding vertices just one point for each neighbor, as in degree centrality, eigenvector centrality gives each vertex a score proportional to the sum of the scores of its neighbors.

Bonacich (1987, p. 1173) defines the eigenvector centrality as shown in Eq. 4 and Eq. 5. The centrality of vertex i is calculated as the sum, over the edges between i and any connected vertex j, of the centrality of j (c_j) times β plus α:

c_i(\alpha, \beta) = \sum_j (\alpha + \beta c_j) A_{ij}, \qquad (4)

or

c(\alpha, \beta) = \alpha (I - \beta A)^{-1} A \mathbf{1}, \qquad (5)

where A is the adjacency matrix of the graph and I is the identity matrix:9

I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.

The vector 1 is a column vector of all ones.10 The constants α and β determine the importance or influence of the degree of a vertex. Quite similar to the eigenvector centrality is the so-called Katz centrality (Katz, 1953). The two constants α and β in Eq. 6 have the effect that even if the centrality of a vertex is 0, it will still be awarded β:

\lambda x_i = \alpha \sum_{j=1}^{n} a_{ij} x_j + \beta, \qquad i = 1, \dots, n. \qquad (6)

Given a vertex that is pointed to by many vertices that themselves have no other vertices pointing to them, the eigenvector centrality would be low or 0, while the Katz measure still rewards these links. The constant α is used to control the weight of the eigenvector in relation to β (Newman, 2010, p. 172).
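To connect the eigenvector idea back to the adjacency matrix A used earlier, the following sketch runs a plain power iteration (a simplified stand-in for Bonacich's parameterized version, chosen for illustration): the score vector is repeatedly multiplied by A and normalized, so each vertex's score becomes proportional to the sum of its neighbors' scores.

    # Eigenvector centrality by power iteration on the undirected matrix A from above.
    import numpy as np

    A = np.array([[0, 1, 1, 1],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)

    x = np.ones(A.shape[0])
    for _ in range(100):
        x = A @ x
        x = x / np.linalg.norm(x)

    print(np.round(x, 3))  # vertices a and c (degree 3) score higher than b and d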

Betweenness Centrality

Another important measure in (social) networks is the so-called betweenness. It is "based on the assumption that information is passed from one person to another only along the shortest paths linking them" (Freeman et al., 1991, p. 142).

9 The identity matrix, being its own inverse, is the matrix-multiplication equivalent of 1: an n × m matrix R multiplied by the m × m identity matrix is R.
10 Multiplying an adjacency matrix by a column vector of all ones sums up the values of each row of the matrix, which corresponds to the degree of the respective vertex.

Figure 5: Weighted graphs: (a) weighted by degree, (b) weighted by betweenness centrality, and (c) weighted by closeness centrality.

This means that, if one is looking at the spread of information in a social network, those vertices are best informed that lie on a high number of shortest paths between any two vertices. A vertex can have a very high degree but still a very small betweenness, and vice versa. If the number of geodesic paths between the vertices j and k is g_jk, and the number of these paths between j and k going through vertex i is g_jk(i), the betweenness centrality of i is the quotient

b_{jk}(i) = \frac{g_{jk}(i)}{g_{jk}}. \qquad (7)

Now, one has to sum up this value for all pairs of vertices j, k with i ≠ j ≠ k to get the overall betweenness centrality of i.

Closeness Centrality

Closeness, as defined by Freeman (1978, cf. p. 221), indicates the length of the average geodesic distance between a vertex i and any other vertex of a network.

C_c(i) = \left[ \frac{\sum_{j=1}^{n} d(i, j)}{n - 1} \right]^{-1} \qquad (8)

Dividing the sum of the distances between i and all other vertices j by the total number of vertices minus 1 (n − 1) normalizes the value of C_c(i). This equation is problematic in networks with unconnected components (Freeman, 1978, cf. p. 226): When there are vertices to which no path exists, the sum is infinite. Inverting an infinite number leads to a value of 0 as the outcome of the equation. Opsahl et al. (2010, p. 245) therefore suggest to "sum the inversed distance instead of the inverse sum of distances".

27 PageRank

The last measure to introduce at this point is PageRank. Page et al. (1998) developed this ranking system at Stanford before founding Google on the basis of PageRank. On the World Wide Web, it assigns each web page a rank based not on its in-degree alone but on the in-degree of the vertices pointing to the page.11 Again we take a vertex i to be ranked. The sum over all vertices j in B_i pointing to i is normalized by a factor c. The rank of each j is divided by N_j (i.e., the number of links j is pointing to):

R(i) = c \sum_{j \in B_i} \frac{R(j)}{N_j} \qquad (9)

This function is different from the ones shown above because it only takes into account the in-degree of the vertex itself and its nearest neighbors. One does not need information on the whole network to calculate the importance of a node. In the case of a huge network like the World Wide Web, this is an advantage. The measures of degree, betweenness, and closeness are compared in Fig. 5(a)-(c). The size of the vertices indicates the corresponding value of the measure in question.
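Since all of these measures reappear later as link-prediction features, it is convenient to compute them side by side. The sketch below does so for the undirected example graph of Fig. 1(a); the use of networkx and its built-in implementations is an assumption of this illustration.

    # Degree, betweenness, closeness, eigenvector centrality, and PageRank
    # for the undirected example graph of Fig. 1(a).
    import networkx as nx

    G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")])

    measures = {
        "degree": dict(G.degree()),
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G),
        "pagerank": nx.pagerank(G),
    }

    for name, values in measures.items():
        print(name, {v: round(values[v], 3) for v in sorted(values)})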

2.1.5 Small-World Networks and Scale-Free Distribution

Social networks have been of special interest in the social sciences for decades. It has been found that social networks show a property that is called the small-world phenomenon. Small-world networks, and social networks as such, expose two distinguishing features: First, they have a short average geodesic path and, second, a high clustering coefficient (compared to random graphs). Both features result in the common observation that a friend of a friend tends to be a friend as well. One of the first pieces of systematic research in the field of social networks that became extraordinarily popular, so popular indeed that its findings are now part of general knowledge, was the work of Milgram (1967). Milgram sent out letters to randomly chosen people in the US and asked them to hand the letter to a first-name acquaintance of theirs in order for the letter to reach a specific person located in Boston. The letters that actually made it to their destination got there through about six stations (on average 5.8 steps). This is commonly known as the six degrees of separation, and it can be seen as the short average geodesic path of the social network of US citizens. Not only was the average path from a randomly chosen starting point to the destination relatively short, but most of the letters came to their destination through the hands of only a few people connected to the destination person. Travers and Milgram (1969) call these people authorities, since they seem to be very well connected not only to the destination person but also to the others they were getting the letters from. These authorities are also called hubs, and they have ever since been found in many networks with a small geodesic distance. Those networks of course also tend to be clustered: Groups of strongly connected vertices build clusters, and their degree distributions follow a power law: Very few vertices have a very high degree, while most vertices have a very low degree. A good example of a power-law distribution is given by the so-called Zipf's law.12 Zipf (1965) found that the frequency of words is inversely proportional to their rank. A distribution of this kind can be plotted as can be seen in Fig. 6(a). This distribution follows a power law of the form p(x) = Cx^{-α}. An interesting observation on this kind of plot can be seen in Fig. 6(b): When plotted on a logarithmic scale, the graph pretty much follows a straight line.13

11 The WWW is a directed multigraph (see Chp. 2.1.3).

Figure 6: Plot of frequency and rank of words based on the Brown Corpus: (a) the data following a power law and (b) the same data plotted on logarithmic scale.
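A rough way to reproduce the straight line on a log-log scale is sketched below: a synthetic Zipf-distributed sample (standing in for the Brown Corpus counts, which are not reproduced here) is counted, and a line is fitted to log frequency against log rank; the slope approximates the negative exponent of the power law.

    # Zipf-like data: frequency is (roughly) a power of rank, i.e. a straight
    # line on a log-log scale. The synthetic sample stands in for real word counts.
    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(42)
    samples = rng.zipf(a=2.0, size=100_000)

    freqs = np.array(sorted(Counter(samples).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)

    # Fit log(frequency) = slope * log(rank) + const over the well-populated ranks.
    top = 100
    slope, _ = np.polyfit(np.log(ranks[:top]), np.log(freqs[:top]), 1)
    print(round(float(slope), 2))  # close to -2 for this sample: the power-law signature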

12 Adamic (2002) shows that Zipf's law and the Pareto distribution are equivalent; the difference lies in the usage of the two axes, which are inverted.
13 According to Newman (2004), this observation was first made by Auerbach (1913) in his work on the density of population (Bevölkerungsdichte).

Power-law distributions are scale free. This means they show the same distribution no matter what the scale is. To understand what scale free means, one can have a look at Fig. 6(a) and Fig. 7, which show the same data scaled by a factor of 2. The distribution "is the same whatever scale we look at it on" (Newman, 2004, p. 334).

Figure 7: The same data as in Fig. 6(b) but scaled by a factor of 2.

The degree distribution of many real-world networks follows such power laws, with the limitation that the number of vertices is not infinite. But power laws have not only been found in word frequencies or degree distributions of networks. In sociology, it has been shown that the distribution of the size of cities and the distribution of wealth among people follow a power law: Compared to the number of cities in the world, only very few have a very high number of inhabitants, while a very large number have only very few inhabitants (though probably not only one). In other words, large cities attract ever more inhabitants. Similarly, it has been found that people with a large fortune, of whom there are only very few, tend to increase their wealth further, while the great majority of people have much less wealth. This has been called the rich-get-richer principle. We will come back to how such structures (i.e., that words with many connections tend to attract ever more connections) arise in (language) networks later on (see Chp. 2.2).

2.1.6 Terminology

A number of terms that can be used synonymously have been introduced already, and there are still more terms in use that have not been mentioned yet: graph and network; vertex and node; edge, arc, and link; vertex and site; edge and bond; or, in the social sciences, vertex and actor, and edge and tie. In Chp. 3.5, the terms entity and relation in the context of ontologies will be introduced. These terms correspond, when the ontology is referred to as a network, to vertex and edge in graph theory. Throughout this thesis, I will use either vertex or entity and edge or relation, depending on the context and the connotation, meaning either the element of the graph or the entity of the ontology. Sometimes the relations of an entity will be called its attributes or properties.

2.2 Language Networks and Computational Models of Language

Graph and network theory have been used in different ways in , especially in the study of semantics, language evolution, language acquisition, as well as in neurosciences. Some questions that are of interest for the evaluation of the results of this thesis are based on the following models. Questions such as How do networks in natural language arise? How can network or (hierarchical) graph models explain the processes of the human mind and how accurate are they? have to be kept in mind when using network methods to model natural language. The idea of computation in language processing in the human mind is not new. Chom- sky (1988) suggests that language understanding, learning, and evolution depends on the human mind’s capability to carry out computations. Hauser, Chomsky, and Fitch call this the faculty of language in a narrow sense, which they define as “the core computational mechanisms of recursion” (Hauser et al., 2002, p. 1573). And in his computational theory of mind (CTM), Pinker (2005, p. 22) defines the mind as a “naturally selected system of organs of computation”. In the early 1990s, Steven Pinker proposed a dual mechanism: The brain contains on the one hand a lexicon where words are stored and on the other hand a grammar apparatus to infer, among other things, regular forms of words (Pinker, 1991; Pinker and Prince, 1991). Pinker denies the idea of either a “homogeneous associa- tive memory structure or alternatively, of a set of genetically determined computational modules in which rules manipulate symbolic representations” (Pinker, 1991, p. 253), and he thereby contradicts Chomsky. Pinker’s distinction implies that there must be a differ-

between the retrieval of an irregular past-tense verb stored in human memory and a regular verb form generated by a rule-based mechanism. The lexicon must contain not only word–meaning pairs for the word stems but also irregular word forms that cannot be mentally computed using grammatical rules. Other research backing this idea has been undertaken and will be briefly reviewed now. When it comes to the actual storage of words in the brain, several different approaches to investigating its structure can be distinguished. One indication of the internal network structure of the brain is the finding that semantically related words are retrieved faster than words that are related merely phonologically. In these studies, subjects were shown a prime–target sequence (e.g., 'swan/goose' and 'gravy/grave'). The semantic relation in the first case was stronger, resulting in a shorter reaction time than the phonological relation between the semantically unrelated words gravy and grave (Marslen-Wilson and Tyler, 1997, p. 592). Furthermore, Marslen-Wilson and Tyler (1997) took a prime/target approach comparing the relation between a stem and its past-tense form, whether regular or irregular, across patients with different kinds of neurological damage. Two aphasic patients with ungrammatical speech were compared to one patient with right and left hemisphere damage and to a group of healthy subjects. The results suggest that the two aphasic patients are not able to use regular past-tense forms but are able to retrieve irregular forms. This suggests damage to the part of the brain where the grammatical mental computation is done. The other patient neither shows semantic priming nor is he able to retrieve irregular verb forms, suggesting that his mental lexicon is damaged. His grammatical apparatus for mental computation is not affected; he is able to use regular verb forms. Other studies undertaken by Indefrey et al. (1997) and Jaeger et al. (1996) using positron emission tomography (PET) show that different regions of the brain are used to form regular verb forms and to retrieve irregular ones. According to Indefrey et al. (1997, p. 548), both "regular and irregular morphological production induced significant rCBF [regional cerebral blood flow] increases in midbrain and cerebellum, but showed no overlap in cortical areas". A study on German irregular plural forms using event-related brain potentials (ERPs) provides, according to the authors, "a clear electrophysiological distinction between regular and irregular inflections" (Weyerts et al., 1997, p. 961), supporting the hypothesis of

a dual mechanism.14 Ullman et al. (1997) analyze the capacities of patients with Alzheimer's disease (AD), which leads to impairments of the lexical memory, on the one hand, and Parkinson's disease (PD), which leads to impairments of the grammatical rules, on the other hand. Their findings also suggest that the distinction between the lexicon and the grammatical apparatus is correct. They find that patients with impairments of the lexical memory (AD patients) have more trouble finding irregular verbs than converting even unknown verbs to the regular past tense. PD patients showed the opposite pattern. Manning et al. (2012) study how the retrieval of words in the human mind is influenced by their meaning. Electrocorticographic recordings of 46 "neurosurgical patients who were implanted with subdural electrode arrays and depth electrodes during presurgical evaluation of a treatment for drug-resistant epilepsy" (Manning et al., 2012, p. 8872), who memorized and recalled words on a list, show how individuals store related words. Manning et al. (2012) chose latent semantic analysis (LSA) to describe the semantics of a given word. They say:

If a participant shows a strong correspondence between neural and semantic similarity . . . , then we consider them to exhibit neural clustering in the sense that neural patterns associated with words that are similar in meaning (according to LSA) will be clustered nearby in their neural space (where each point in neural space is a pattern of neural activity) (Manning et al., 2012, p. 8873f.).

They calculate the cosine similarity between the semantic vectors as well as the similarity between the patterns of brain activity during memorizing and recall of semantically related words. Using the underlying LSA to interpret their findings, these results indicate that the words are, in a network sense, similarly connected. This experiment shows that there is a relation between a network of words and the neural network in the human mind, or, as they put it: "This indicates that temporal and frontal networks organize conceptual information by representing relationships among stored concepts" (Manning et al., 2012, p. 8876), as in a network or, even more so, as in an ontology.

14A differing view from Indefrey et al. (1997) on the localization of the human linguistic capacities is held by Grodzinsky (2000). The exact region or regions responsible for human linguistic abilities remain unclear. This is mostly because of the unsatisfactory accuracy of the applied neuroimaging methods (PET and ERP).

The presented studies suggest (1) that a mental lexicon exists in the human brain that connects semantically related words more closely than other words, i.e., a network structure,15 and (2) that a part of the brain carries out mental computations, using grammatical rules, to inflect regular nouns and verbs.

2.3 Graphs of Language

From these and similar considerations and findings in medicine, psychology, and biology, different theories about how language networks evolve have been proposed. In the following, some of these approaches will be presented. Starting with an older computational model of a hierarchically ordered, tree-like structure of lexico-semantic information, models of how networks evolve and develop will be presented that finally lead to a model of language network growth that can explain some statistical features of complex networks representing human language. Afterwards, the properties that can be expected from a language network and a complex network, compared to simple or ordered networks, can be defined.

2.3.1 Hierarchical Structures of the Mind

Quillian (1967) uses a tree-like, hierarchically structured model for the "simulation of some basic semantic capabilities" (Quillian, 1967, p. 459) in the form of a computer program. His model of "memory as a 'semantic network' [is] representing factual assertions about the world" (Quillian, 1969, p. 459) in a hierarchical class model. Quillian stores each word with "a configuration of pointers to other words in the memory; this configuration represents the word's meaning" (Collins and Quillian, 1969, p. 240), and in his opinion it is a reverse-engineered model of the human semantic capacity.16 Collins and Quillian's (1969) model allows one to make predictions about the time it should take to retrieve information. They argue that if the information is hierarchically organized in the form of a tree, where child vertices inherit the information of their parent vertices, the retrieval time for information inherited from a parent should be longer than that for information belonging to the vertex itself. As an example they use the concept of a canary bird. Like all birds, canaries have

15This lexicon also contains irregular forms of words that cannot be formed using grammatical rules. 16These considerations are very early examples of semantic networks that eventually led to the development of ontologies. The words are the vertices; the pointers are the edges of the network.

feathers and can fly.17 Like all animals, birds have skin. Since a canary is a bird, and a bird is an animal, it has skin as well. Moreover, a canary is yellow and can sing. Collins and Quillian predict that there should be a higher retrieval time for information that is inherited. To show this, subjects had to decide whether a sentence is true or false while their reaction time was measured. The subjects were shown sentences like: "A canary can sing.", "A canary can fly.", or "A canary has skin." (all taken from Collins and Quillian (1969, p. 241)). These sentences are all true, but the authors assume that there should be a difference in the time necessary to retrieve the needed information. They suspect that singing is a property of the canary itself, while flying is a property of birds and hence one level higher in the hierarchy, and that having skin is a property of animals and hence one further step higher in the tree. They measure the reaction time and find that it takes about 75 ms "to move from a vertex to its superset" (Collins and Quillian, 1969, p. 244). They conclude that "there was substantial agreement between the predictions and the data" (Collins and Quillian, 1969, p. 246). However, later studies could not verify these findings (McCloskey and Glucksberg, 1979; Murphy and Brownell, 1985). Furthermore, McClelland and Rogers (2003) argue that the hierarchical model has some issues. They ask:

just which superordinate should be included, and which properties should be stored with them? At what point in development are they introduced? What are the criteria for creating such categories? And how does one deal with the fact that properties that are shared by many items, which could be treated as members of the same category, are not necessarily shared by all members? (McClelland and Rogers, 2003, p. 311)

From their study of language acquisition and forms of dementia, McClelland and Rogers suggest a network model: Information is “latent in the connections among the neurons in the brain that processes semantic information” (McClelland and Rogers, 2003, p. 310). The generalized properties (e.g., that all birds fly) are derived from clusters of the network, or from the distance between word vectors in a vector space. Interestingly enough, McClelland and Rogers (2003) deny a hierarchical organization

17Exceptions like an ostrich or a penguin would have to be marked as a bird that cannot fly. It is “more economical to store generalized information with superset vertices rather than with all the individuals to which such a generalization might apply” (Collins and Quillian, 1969, p. 246).

and instead propose – without exploring this further – that the neighborhood and connectivity of a concept implicitly provide this information.
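To make the kind of hierarchical storage Collins and Quillian assume more tangible, the following minimal Python sketch stores each property only at the most general vertex it applies to and counts how many levels have to be traversed upwards to retrieve it. The taxonomy and the property names are illustrative assumptions, not part of the original model.

```python
# A minimal sketch of Collins and Quillian's idea: properties are stored at the
# highest vertex to which they apply and are inherited by all child vertices.
taxonomy = {            # child -> parent
    "canary": "bird",
    "ostrich": "bird",
    "bird": "animal",
    "animal": None,
}
properties = {          # vertex -> properties stored directly at that vertex
    "canary": {"can sing", "is yellow"},
    "ostrich": {"cannot fly"},      # exception stored locally (cf. footnote 17)
    "bird": {"can fly", "has feathers"},
    "animal": {"has skin"},
}

def levels_to_retrieve(concept, prop):
    """Return how many superset vertices must be visited to find prop."""
    level = 0
    while concept is not None:
        if prop in properties.get(concept, set()):
            return level
        concept = taxonomy[concept]
        level += 1
    return None  # property not stored anywhere along the path to the root

# "A canary can sing" (0 levels) should be verified faster than
# "A canary can fly" (1 level) or "A canary has skin" (2 levels).
for p in ("can sing", "can fly", "has skin"):
    print(p, "->", levels_to_retrieve("canary", p), "level(s) up")
```

In the retrieval-time experiment described above, each additional level would correspond to roughly 75 ms of extra reaction time.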

2.3.2 Complex Network Growth and Models of Language Network Evolution

After seeing how language can be treated as a graph or network, the question arises how such networks evolve. Different models have been proposed for different kinds of networks. The most important ones are presented and compared here.

Random Networks (Erdös and Rényi)

In 1959, Erdös and Rényi proposed a model of random graphs. They start with $n$ vertices and connect them with $N$ edges, each edge being placed at random with a probability $P$ (Erdös and Rényi, 1959, p. 290). Such a network's degree distribution follows a Poisson distribution. The Poisson distribution indicates the probability that a certain state occurs. If we know that a graph has $n$ vertices and $N$ edges, the average number of edges a vertex has, its degree, should be $A = \frac{2N}{n}$, since each edge contributes to the degree of two vertices. The Poisson distribution then gives the probability that the state $A \pm x$ occurs. An example of an Erdös and Rényi graph is given in Fig. 8. With a lack of data on real-world networks, this model could not be tested extensively for a long time. The model constructs a random network that is not like the real-world, complex networks we have already seen and will analyze later in this thesis. It misses the most basic features of complex networks: Neither the power-law degree distribution, and thus the scale-free property, nor the small-world property of natural networks can be explained using this model (Barabási and Réka, 1999, p. 510).
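As a rough illustration, the following sketch (assuming the networkx library) generates an Erdös and Rényi random graph with the parameters of Fig. 8 and inspects its degree distribution, which should be approximately Poisson around the mean degree; the seed value is an arbitrary assumption for reproducibility.

```python
# Sketch: an Erdös-Rényi random graph with 100 vertices and edge probability 0.2,
# as in Fig. 8. The degree histogram is concentrated around the mean degree.
import networkx as nx
from collections import Counter

n, p = 100, 0.2
G = nx.gnp_random_graph(n, p, seed=42)

degrees = [d for _, d in G.degree()]
print("mean degree:", sum(degrees) / n)          # roughly (n - 1) * p
print("degree histogram:", sorted(Counter(degrees).items()))
```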

Small-World Networks (Watts and Strogatz)

One feature of complex networks that the model shown above does not predict is the small-world phenomenon. Watts and Strogatz (1998) propose a model that explains how small-world networks emerge. Starting from a network of $n$ vertices and $k$ edges per node, Watts and Strogatz "rewire each edge at random with probability p" (Watts and Strogatz, 1998, p. 440), where $p = 0$ results in an ordered network (see Fig. 9(a)) and $p = 1$ results in an unordered network (see Fig. 9(c)). While an ordered network has a long average path length $L$ (a large-world network) and a high clustering coefficient $C$, a network with $0 < p < 1$ (see Fig. 9(b)) leads to a less clustered network with a short average path length, i.e., a small-world network.

Figure 8: An Erdös and Rényi random graph with 100 vertices and a connectivity probability of 0.2.

Even for "small p, each short cut has a highly nonlinear effect on L, contracting the distance not just between the pair of vertices that it connects, but between their immediate neighbourhoods" (Watts and Strogatz, 1998, p. 440). But this model still misses the power-law distribution that is typically found in complex networks, and it is no explanation of how such networks evolve, since it does not incorporate growth of any kind.
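A small sketch (again assuming networkx) of the rewiring model with the parameters of Fig. 9 makes the trade-off visible: the average shortest path length $L$ drops quickly with increasing $p$, while the clustering coefficient $C$ stays comparatively high for intermediate $p$. The connected variant of the generator is used here only so that $L$ is always defined.

```python
# Sketch: Watts-Strogatz graphs with n = 20 and k = 4 as in Fig. 9, for
# p = 0 (ordered), p = 0.5 (partly rewired), and p = 1 (unordered).
import networkx as nx

for p in (0.0, 0.5, 1.0):
    G = nx.connected_watts_strogatz_graph(n=20, k=4, p=p, seed=1)
    L = nx.average_shortest_path_length(G)   # average geodesic path length
    C = nx.average_clustering(G)             # clustering coefficient
    print(f"p = {p}: L = {L:.2f}, C = {C:.2f}")
```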

Preferential Attachment (Barabási and Réka)

Network models of language allow one to analyze the supposed structure of the mind and to predict and analyze the evolution of language. Barabási and Réka (1999) describe a model of network growth using preferential attachment:18 Given a network, this model predicts that "new vertices in the growing network are preferentially attached to an existing vertex with a probability proportional to the degree of such a node" (Ferrer and Solé, 2001, p. 2263). This is equivalent to the rich-get-richer principle introduced before. The model leads to a power-law distribution, which is generally seen as a natural result of the way a system, like a network, grows over time.

18Preferential attachment applies Bayesian models, which are also used, e.g., to predict the growth of the mind (cf. Tenenbaum et al., 2011). Bayesian models complement network approaches and are widely used in language models in psychology, linguistics, and medicine.

37 (a) Probability p = 0. (b) Probability p = 0.5. (c) Probability p = 1.

Figure 9: Watts and Strogatz graphs with n = 20 and k = 4.

Barabási and Réka state that this indicates that "the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems" (Barabási and Réka, 1999, p. 509).19

In formal terms, for a network that starts with $m_0$ vertices and adds a new vertex with $m \le m_0$ edges at every time step $t$, the probability $P(k_i)$ that the new vertex will connect to an existing vertex $i$ depends on the degree $k_i$ of that node:

$P(k_i) = \frac{k_i}{\sum_j k_j}.$

This leads to a random network with $t + m_0$ vertices and $mt$ edges that is "following a power law with an exponent $\gamma_{\text{model}} = 2.9 \pm 0.1$" (Barabási and Réka, 1999, p. 5). An example is given in Fig. 10. Comparing this graph to an Erdös and Rényi random graph, one can see that the edges per vertex are not distributed in Poisson fashion, but that some vertices have a considerably higher degree than others. This also leads to clustering, as Fig. 10 shows. This model predicts degree distributions that follow a power law, more or less as can be observed in networks of natural language or, in general, in scale-free, small-world networks. In Fig. 11, the degree distributions of the models presented above are given.

19In protein networks, for example, scale-free distribution follows from preferential attachment in the growing process, which seems to be a result of gene duplication (cf. Barabasi and Oltvai, 2004, p. 106).

Figure 10: A Barabási and Réka graph with 100 vertices.

In accordance with what has been stated above, and as one can see, neither the degree distribution shown in Fig. 11(a) nor the one in Fig. 11(b) follows a power law. Only the model of Barabási and Réka (1999), Fig. 11(c), shows a degree distribution of the kind found in most natural networks, such as social networks, and, as will be shown in later chapters, in ontologies.
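The preferential-attachment mechanism is also easy to sketch computationally (assuming networkx, which implements it as barabasi_albert_graph); the parameter m = 2 is an arbitrary assumption. Unlike in the two previous models, the resulting degree distribution is heavy-tailed: a few hubs collect a large share of the edges.

```python
# Sketch: a preferential-attachment graph with 100 vertices.
import networkx as nx
from collections import Counter

G = nx.barabasi_albert_graph(n=100, m=2, seed=7)   # each new vertex brings m = 2 edges
degree_counts = Counter(d for _, d in G.degree())

# Plotting P(k) against k on log-log axes would show the roughly straight
# line that is characteristic of a power-law distribution.
for k in sorted(degree_counts):
    print(k, degree_counts[k] / 100)
```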

Language Network Models (Steyvers and Tenenbaum)

Steyvers and Tenenbaum compare the Barabási and Réka model to findings in natural language networks20 and find that preferential attachment as proposed by Barabási and Réka (1999) does not explain the structure of semantic networks. From a language evolution21 point of view, Steyvers and Tenenbaum argue that "[w]ords that enter the network early are expected to show higher connectivity" (Steyvers and Tenenbaum, 2005, p. 44). Also, existing complex concepts (i.e., those with a high connectivity or degree) in a language are more likely to be differentiated over time.

20Unfortunately, they restrict themselves to an analysis of networks and ignore the large amount of linguistic research in the areas of language evolution and language change. 21Steyvers and Tenenbaum’s model could just as well be applicable to individual language acquisition, even though it cannot account for all the diverse processes that happen during the acquisition or evolution of natural languages.

(a) Degree distribution of the Erdös and Rényi random graph in Fig. 8. (b) Degree distribution of the Watts and Strogatz graph in Fig. 9(c).

(c) Degree distribution of the Barabási and Réka graph in Fig. 10.

Figure 11: Degree distributions of the Erdös and Rényi, Watts and Strogatz, and Barabási and Réka models.

This means that complex and very broad terms tend to be differentiated into narrower terms. There are, of course, many other processes concerning the evolution of a natural language that could be discussed, but this process leads to concepts with a high connectivity (i.e., hubs or authorities) and thereby to a small-world network. The authors propose two models for the growth of language networks: The first explains the growth of an undirected network, while the second explains the growth of a directed network (e.g., a semantic relationship network of words such as WordNet or other language ontologies). The first growth model can be formulated as follows: Given a fully connected network of size $n$ that grows over time and has $n(t)$ vertices at time point $t$, a randomly chosen vertex $i$ is differentiated by adding a new vertex with $M$ connections ($M < n$) to randomly chosen vertices in the neighborhood of $i$. This has the effect that a "new vertex can be thought of as differentiating the existing node, by acquiring a similar but slightly more specific pattern of connectivity" (Steyvers and Tenenbaum, 2005, p. 57).

Now the probability that vertex $i$ is chosen has to be defined. The probability $P_i(t)$ corresponds to the connectivity of vertex $i$ (i.e., its degree):

$P_i(t) = \frac{k_i(t)}{\sum_{j=1}^{n(t)} k_j(t)}.$

The degree of $i$ at time $t$, $k_i(t)$, is divided by the sum of the degrees of all vertices $j = 1, \dots, n(t)$ (i.e., all vertices at time $t$).

To choose a vertex $j$ in the neighborhood $H_i$ of $i$ to which the new vertex will be connected, the probability $P_{ij}(t)$ is calculated in proportion to the utility of the corresponding node:

$P_{ij}(t) = \frac{1}{k_i(t)}.$

This is repeated until $M$ vertices from $H_i$ have been chosen; the new vertex is then connected to them. These steps are repeated until the desired network size is reached. The network produced by the model can then be compared to a real-world network of the same size. The second model of network growth results in a directed network. The process is very similar to the first model; the only difference is how the new node is connected:

Still, the vertices it will be connected to are chosen with the probability $P_{ij}(t)$, but the direction of the connecting edges is chosen randomly. While this model fits the examined features of language graphs, it should be mentioned that a model of the growth or evolution of a language network should also account for the loss of words (i.e., at randomly chosen time steps the network should be pruned and poorly connected concepts should be erased). A diachronic analysis of language shows not only a differentiation of words, and hence semantic shift, but also that concepts that are poorly connected, and therefore less frequently used, cease to exist in the vocabulary of a single person or even of a language community (cf. Pagel et al., 2013). This property is, in my opinion, missing in the model. Also, as Dorogovtsev and Mendes (2001) point out, because of semantic shift, existing concepts should at times be rewired to other concepts.
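The following is a minimal sketch of the first (undirected) growth model as described above, assuming networkx; the final size, the parameter M, and the seed are arbitrary assumptions, and the sketch implements only what the formulas state (it adds neither the pruning nor the rewiring steps discussed in the previous paragraph).

```python
# Sketch of the first (undirected) Steyvers-Tenenbaum growth model:
# a vertex i is picked with probability proportional to its degree and
# "differentiated" by a new vertex that connects to M of i's neighbours.
import random
import networkx as nx

def grow_network(n_final=200, M=3, seed=0):
    rng = random.Random(seed)
    G = nx.complete_graph(M + 1)                     # small, fully connected seed network
    while G.number_of_nodes() < n_final:
        degrees = dict(G.degree())
        total = sum(degrees.values())
        # P_i(t) = k_i(t) / sum_j k_j(t): pick the vertex to differentiate
        i = rng.choices(list(degrees), weights=[degrees[v] / total for v in degrees])[0]
        # P_ij(t) = 1 / k_i(t): pick M distinct neighbours of i uniformly
        targets = rng.sample(list(G.neighbors(i)), k=min(M, G.degree(i)))
        new_vertex = G.number_of_nodes()
        G.add_edges_from((new_vertex, j) for j in targets)
    return G

G = grow_network()
print(G.number_of_nodes(), "vertices,", G.number_of_edges(), "edges")
```

The degree distribution of a network grown this way can then be compared to that of a real-world language network of the same size, as the authors propose.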

2.4 Conclusion

The use of network models of language and the application of methods of graph theory have been explained and legitimized using findings from linguistics and from other relevant fields of research. Basic concepts of graph theory as well as more sophisticated measures of network structure and vertex centrality have been introduced; they will be utilized to analyze the structure of networks extracted from ontologies and as features for machine learning tasks to extend the ontologies. As will be shown in Chp. 4, many interesting ideas from social network analysis might also be applicable to ontologies. The network structure, nonetheless, will be the main basis for the application of machine learning to identify and classify missing edges. Furthermore, properties of complex, natural networks, namely the degree distribution and the small-world phenomenon (i.e., the short average geodesic path and the high clustering coefficient), have been introduced. These features distinguish complex networks from other networks (e.g., random networks as proposed by Erdös and Rényi (1959)). Looking at the neurological network in the human brain, a definition of the semantics of a word or concept was found that is not only true for semantic networks such as the ontologies used in this thesis. Manning et al. (2012, p. 8873f.) find "that neural patterns associated with words that are similar in meaning . . . will be clustered nearby in their neural space" of the human brain. Hence, the meaning of a word is defined by its connections or relations to other words or, more generally, as a pattern of connectivity (cf. Steyvers and Tenenbaum, 2005). As will be shown in the following chapter, similar definitions of meaning have been made in semantics.

Based on these definitions, machine learning algorithms that predict patterns of connectivity can be employed to build or complete word networks. Trees, as a special network structure, can also be found in representations of human grammar (e.g., phrase structure grammars (PSGs)). Hierarchical structures like these are considered to be a unique feature of human communication. Other animals are capable of learning finite-state grammars (FSGs) but fail to learn the recursive rules of a PSG. Fitch and Hauser explain that species other than humans "suffer from a specific and fundamental computational limitation on their ability to spontaneously recognize or remember hierarchically organized acoustic structures" (Fitch and Hauser, 2004, p. 380) and conclude with the thesis that the "acquisition of hierarchical processing ability may have represented a critical juncture in the evolution of the human language faculty" (Fitch and Hauser, 2004, p. 380). This is a further indication of the existence of hierarchically structured data and of an organ of computation, which supports the use of computational methods to model language. It does not explain, though, why this form of organization, a network, is employed. Phrases can be embedded in other phrases; sentences form a syntax tree. Analogous to a syntax tree, semantic relations between entities in human language are also organized in the form of a tree. We find structures, especially trees, that can be interpreted as a network on all levels of human language. That the evolution of language and of vocabulary, at least when looked at as a network, is generally similar to the growth of a neural network is not surprising, since it is the neural network of the human brain that stores, retrieves, and manipulates language. Network structures are a very robust way of organization. The structure of the data system (the human brain's neural network) seems to be naturally reflected in the internal structure of the language processed by this system. In this way, the structure of the language capacity is adapted to the medium. Of course, the human neural network does not function like the much simpler network models applied in the following. One does not find vertices representing words in the neural network of the brain, but patterns of network activity still seem to correspond to distinct words, concepts, and their understanding and processing. Therefore, using this resemblance and the term network to indicate a similarity between technical processes and mental processes can be very problematic. The notion of artificial neural networks (ANNs) in so-called deep learning has become very popular in recent years. ANNs were developed with a (simplified) understanding of how the human neural network behaves in mind.

While the practical advantages of such techniques are indisputable, the notion that such systems are inspired by or even work like the human brain has to be rejected. Many people in machine learning have contradicted this understanding of neural networks. For example, Michael Jordan has pointed out in an interview that to "infer that something involving neuroscience is behind it . . . is just patently false" (Gomes, 2014). When networks and ideas from psychology, and especially neurology, were first used to explain semantics, Fillmore (1975, p. 124) found that the analogies to the human mind and brain might "sound like extremely naive psychology". In an interview with The New York Times, Andrew Ng, one of the heads behind the Google Brain project, admits that it is a "loose and frankly awful analogy . . . that our numerical parameters correspond to synapses".22

22Source: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html.

3 Lexical Semantics and Ontologies

Most of what’s called “semantics” is, in my opinion, syntax. It is the part of syntax that is presumably close to the interface system that involves the use of language. So there is that part of syntax and there certainly is pragmatics in some general sense of what you do with words and so on. But whether there is semantics in the more technical sense is an open question. I don’t think there’s any reason to believe that there is. (Chomsky, 2001, p. 73)

This quote of Chomsky will be referred to a few times in the following. In my opinion, its main message is not that there is no semantics, but rather that semantics cannot be anything coming from outside the language; it can only be understood from the language structure. In this sense, it is a very structuralist statement, although Chomsky denies the fruitfulness of empirical linguistics and, as such, of many structuralist attempts. In the following, different theories of semantics that are vital to understanding computational semantics are introduced briefly, to the extent necessary and useful for the understanding and the argumentation in the following chapters of this thesis. This is not an introduction to semantics. In this sense, this section neither aims at an as-broad-as-possible overview, nor is the chronological order of the development of these theories very important. Hence, the subsections are arranged in a logical order. Starting from the quote above, the first theory to be introduced very briefly here is formal or Montague semantics. It is the attempt to translate natural language into a logical form. A phrase like

(1) fierce black cat

corresponds to a logical form like

(2) λx[fierce(x) ∧ black(x) ∧ cat(x)]

which means that the entity x is fierce, black, and a cat (cf. Copestake et al., 2005, p. 284). Predicate logic can be used to formalize all kinds of relations, not only relations expressed in natural language. For example, if the cat had a name, this could be represented using a relation hasName, as shown in Example 3.

(3) ∃x[cat(x) ∧ hasName(x, Anton)]

While complete semantic parsing based on predicate logic is not possible for many reasons, among them the incompleteness of Montague semantics (i.e., its inability to account for more than a snippet of natural English), the more simplistic approach of semantic role labeling is often applied to parse sentences and to derive logical relations from the syntactic structure of natural language.23 These relations are of special interest in NLP, both in parsing texts and in ontologies. When parsing natural language texts, often a more simplistic representation is chosen. A possible logical form for Example 4,24 which also corresponds to the dependency structure of the sentence, is given in Example 5.

(4) Julius gave Anne a rose.
(5) give(Julius, Anne, a rose)

Example 5 shows a give relation (predicate) with three arguments for the agent (subject), the recipient (indirect object), and the theme (direct object). The semantic role has to be distinguished from the grammatical or syntactic role. This assumption of semantic roles, very basic compared to a full logical analysis, is in fact applied quite regularly in NLP (cf. Jaworski and Przepiórkowski, 2014; Lally et al., 2012). The logic-based approach of deriving meaning from syntax, thereby showing the strong relationship between the two, is a very effective tool to describe the semantics of a complex phrase. But it stops at a very basic point: the words or lexical units. Formal semantics can show how different parts of phrases are combined and how compositionality constructs meaning, but it does not describe the meaning of any lexical unit. What or who are Julius and Anne, what is to give, and what is a rose? And of course a word like love does not receive its meaning from the syntactic structure but rather from its paradigmatic relations. While the main contributions to formal semantics were made in the Anglo-American research community, European structuralism developed a different concept of describing meaning during the last century. These theories, among others, are presented in the following, since there is a direct link between how natural language semantics is treated and how ontologies work.
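A hedged sketch of how such a flat predicate-argument representation might be stored in a program follows; the role labels and the Predication structure are illustrative assumptions, not an existing standard or API.

```python
# Sketch: the predicate-argument structure of Example 5 as a data structure.
from dataclasses import dataclass, field

@dataclass
class Predication:
    predicate: str
    roles: dict = field(default_factory=dict)   # semantic role -> filler

ex5 = Predication(
    predicate="give",
    roles={"agent": "Julius", "recipient": "Anne", "theme": "a rose"},
)

# The semantic roles stay constant even when the grammatical roles change,
# e.g. in the passive "Anne was given a rose by Julius".
print(ex5.predicate, ex5.roles)
```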

23Lohnstein (2011) shows and explains what he calls the isomorphism between syntactic and semantic structures ("... Isomorphie zwischen syntaktischen und semantischen Strukturen") (Lohnstein, 2011, p. 153). 24This example is taken from Wittenberg et al. (2014).

3.1 Frame Semantics and Encyclopedic Knowledge

"[E]verything you know about the concept is part of its meaning . . . . From this it follows that there is no essential difference between (linguistic) semantic representation and (general) knowledge representation" (Croft and Cruse, 1993, p. 336f).

Frame semantics takes a very broad view of meaning. Unlike other approaches that are discussed later in this chapter, frame semantics explicitly includes non-linguistic knowledge, world knowledge or general knowledge, about concepts in the representation of a concept's meaning. This is meant, for one, to be close to the meaning representation humans have of concepts, words, or facts, and, for another, to serve as a basis for AI, where the understanding of human utterances makes knowledge about concepts important. Therefore, researchers from cognitive linguistics, like Fillmore (1975), as well as computer scientists, such as Minsky (1974), contributed to this theory of semantics. Fillmore distinguishes between a scene and a frame, where the scene is the situation itself, and the frame comprises the linguistic material available for or connected to that scene (cf. Fillmore, 1975, 1977). The frame or domain is nothing that necessarily exists in the language itself. It is derived from experience and can be seen as the (non-linguistic) context of a word. Fillmore states that

frames and scenes, in the mind of a person who has learned the associations between them, activate each other; and that furthermore frames are associated in memory with other frames by virtue of their shared linguistic material, and that scenes are associated with other scenes by virtue of sameness or similarity of the entities or relations or substances in them, or their contexts of occurrence. (Fillmore, 1975, p. 124)

For example, in a garage, both speaker and listener have frames of cars and tools associated with the scene (the garage) that can be used: cars have engines, they drive fast, and much more. Fillmore proposes his frame theory as an "integrated view of language structure, language behavior, language comprehension, language change, and language acquisition" (Fillmore, 1977, p. 55); in fact, he gives detailed examples from those fields, as well as from issues of language, that cannot be repeated here.

The frame, the base, itself gives meaning to the unit. If one understands the scene, the event, or the activity in question, "then, given that knowledge, we can know exactly what the vocabulary pertaining to that semantic domain means" (Fillmore, 1977, p. 60). The scene itself is independent of the frame and the lexical material. When a child learns a new word within a frame, the scene does not alter; only the associated frame is changed (cf. Fillmore, 1977, p. 66). Langacker (1987) refines Fillmore's notion of the frame, such that the "semantic value of a symbolic expression is given only by the base and profile together" (Langacker, 1987, p. 187). The profile is "a substructure of designation" (Langacker, 1987, p. 183); it is what distinguishes a word from others connected to the same base/domain/frame. One often finds network-related terms in cognitive linguistics and especially in the field of frame semantics: "[a]lthough in theory all knowledge about an entity is accessible – that is, the whole knowledge network is accessible – some knowledge is more central . . . , and the pattern of centrality and peripherality is a major part of what distinguishes the meaning of one word from that of another" (Croft and Cruse, 1993, p. 337). Frame theory is based on a supposed understanding of the neural network of the human brain, although this might "sound like extremely naive psychology" (Fillmore, 1975, p. 124). Minsky's early work is highly influenced by the network metaphor. Therefore, it seems unsurprising that there is a straight line from the notion of frames to that of the networks and ontologies representing general knowledge used in NLP today. As mentioned above, the knowledge represented by a frame is not entirely linguistic but also contains general knowledge or knowledge derived from experience. It is no coincidence that Minsky was involved in establishing this theory. From his background in AI, he realized that a lexical resource to be used in AI needs to be more than a dictionary. If a machine is to understand language (i.e., complex sentences, texts, and discourses), it needs knowledge beyond linguistic knowledge of the single words. Text and language understanding is more than parsing syntactic structures; it is more than understanding grammatical relations or dependencies. It is even more than the understanding of basic lexical units, the words of a language. In today's state-of-the-art NLP systems, a number of different resources are used. Besides linguistic ontologies like WordNet, different general knowledge ontologies are used to understand statements or to find answers to questions. There is a straight line from the notion of the frame to these ontologies. In Chp. 6, DBpedia, an ontology often used in NLP tasks, covering different real-world concepts, persons, and places, will be introduced, analyzed, and extended.

3.2 Field Theories

Jost Trier first famously described a concept he called Wortfeld. Wortfeld is often translated as semantic field. Geckeler (1982, p. 89) argues that the English term semantic field and the French term champ sémantique are not very good translations, since the term semantic does not solely refer to the lexical dimension of meaning. Therefore, he prefers the terms lexical field and champ lexical (Geckeler, 1982, p. 85). Although this distinction does not seem to be universally accepted and the term semantic field is still used to refer to the German Wortfeld, I will use lexical field in this context. A lexical field contains semantically related lexical units. Words in a lexical field can replace each other in a paradigm but differ from each other semantically in one way or another. They have distinguishing semantic features, later called semes. Trier (1931) gave his prominent example of a lexical field study using the Middle High German wîsheit. His diachronic study compares the field at a first state around 1200 and a second state around 1300. While wîsheit is the superordinate of kunst and list around 1200, the field changes over time, and in the 1300s, wîsheit, kunst, and wizzen make up the field, but wîsheit no longer includes the other members of the field (see Table 1). In the first state, wîsheit (knowledge) is the hypernym of kunst (courtly knowledge) and list (other knowledge). In the work of Meister Eckhart around 1300, the field has changed: wîsheit (closer to the modern German Weisheit (wisdom), especially in a mystical and religious sense), kunst (art), and wizzen (knowledge) are now members of the field. The word list shifted closer to its modern meaning (cunning).

Table 1: Lexical field wîsheit. Left: ca. 1200; right: ca. 1300.

wîsheit    wîsheit
kunst      kunst
list       wizzen

Trier states that the meaning of a word depends on its neighbors in the field (Trier, 1931, p. 3); the position of the word in a field determines its meaning. The user's knowledge of the composition of the field affects his or her understanding of the word. In Trier's understanding, the members of a field are like a mosaic: only by seeing all members of a field can one grasp the whole meaning. Geckeler (1982, p. 199) names the most important properties of a lexical field:

• Lexical fields are not .

• The distinguishing traits of the members of a field do not have to be evident in the things themselves. This is especially true for adjectives such as young/old, beautiful/ugly, and so on.

• Lexical fields do not consist of associations.

• The fields do not depend on the usage of a word.

Lexical fields consist not only of similar or nearly synonymous words but also of antonymous pairs of words. What the members of a field have in common is a mutual semantic trait: young/old denote the age of a person, beautiful/ugly their appearance. In the structural analysis of a lexical field, three terms are of special relevance: lexeme, archilexeme, and seme. Lexeme corresponds to the (simple) words as the units of a language. The archilexeme is the head of a lexical field. Lexemes in a lexical field are distinguished by semantic features, the semes. The seme is a distinguishing trait of a lexeme. Examining the semes of the lexemes of a field is called a componential analysis. Pottier (1978, p. 404) defines the difference between a chair and a stool as the existence or non-existence of a back.

Table 2: Componential analysis of seats (Pottier, 1978, p. 404).

Seme:    used for sitting    on legs    with a back
Lexeme
Chair    +                   +          +
Stool    +                   +          −

The archilexeme of the lexical field is seats; the members of the field are the lexemes. The seme used for sitting can be called an archiseme. Both words in the example above share the semes used for sitting and on legs. The existence of a back can be seen as a differentiating seme in the lexical field of seating. One problem with the seme analysis is the question of where these distinctions arise from. The idea was taken from phonology, where a phoneme is the smallest, potentially meaning-changing unit of a language.

In phonology this distinction arises from the language itself. What do chair and stool have in common that allows them to be members of the same field? The componential analysis assumes that there is some common semantic trait, here that the object is made for sitting, and that there are fine-grained differences between the words that distinguish them from each other: they stand in opposition to each other in a paradigm. This analysis, of course, seems to be mainly extra-linguistic: There is nothing having-a-back-like about the word chair that distinguishes it from stool. The distinction is found on the side of the signifieds rather than the signifiers. This, of course, does not do justice to the structuralist claim of how linguistics should be done. Or, as Lyons (1977, p. 267) put it: "What is lacking . . . , as most field-theorists would probably admit, is a more explicit formulation of the criteria which define a lexical field than has yet been provided". Despite the justifiable criticism of lexical field theory and the fact that it never found its way into the curriculum of North American linguistics, it contributed some interesting points that are still of importance today, as will be shown in this and the following chapters. Not only semantic relations but also modern ontologies are based on the ideas of Trier and others. The dilemma of how lexical fields are constructed, if not by the opposition of semes and by facts we know about objects, can be solved using another structuralist approach. Based on Trier's ideas and on Wittgenstein's and Harris' assumption that the usage, or the distribution, of words in the language makes up their meaning, Gliozzo and Strapparava (2009) describe in great detail how to computationally construct lexical fields, which they call semantic domains, by looking at the distribution of words in large text corpora. To answer the question asked above of what chair and stool have in common: they are similar in the sense that they share very similar contexts in speech, or that they are, in certain contexts, interchangeable.

3.3 Distributional Semantics

Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache. (PU §43 [Ludwig Wittgenstein: Philosophische Untersuchungen, 1953])25

These words of Wittgenstein are supposedly the most often cited reference when introducing the notion of distributional semantics.

25The meaning of a word is its use in the language (translated by the author).

The sentence itself remains a little vague. But Wittgenstein continues to ask:

Aber besteht der gleiche Sinn der Sätze nicht in ihrer gleichen Verwendung? (PU §20b [Ludwig Wittgenstein: Philosophische Untersuchungen, 1953])26

or, as Firth famously put it:

You shall know a word by the company it keeps. (Firth, 1957, p. 11)

Wittgenstein's rhetorical question indicates that two expressions share the same meaning if they are interchangeable. Firth shows that it is the context of an expression that indicates its meaning. Hence, one can conclude that two words that share a similar environment are likely to share a similar meaning as well. Most researchers attribute the notion of distribution in the sense used here to Harris (1968). Harris analyzes language on a purely distributional basis, and his conclusions are strangely reminiscent of Chomsky when he states that "it frequently happens that when we do not rest with the explanation that something is due to meaning, we discover that it has a formal regularity of 'explanation'." (Harris, 1968, p. 785). This is close to Chomsky's opinion on semantics that the main transmission of meaning does not lie in an inexplicable and obscure meaning, but in structural findings; here, the distribution of words.27 Miller and Charles (1991) define semantic similarity as follows: "the more often two words can be substituted into the same contexts the more similar in meaning they are" (Miller and Charles, 1991, p. 1). This does not mean, however, that two words that are substituted for each other on a regular basis are synonymous. Far from that, phrases such as the following do not indicate that good, tasty, spicy, and bad are synonyms.

(6) good food
(7) tasty food
(8) spicy food

26But does not the equality of meaning of the sentences exist in their equal use? (translated by the author). 27Still, Chomsky explicitly rejects corpus-based, empirical, and hence distributional approaches to linguistics. Chomsky writes: "I think that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure" (Chomsky, 1957, p. 17) and in an interview he even says: "Corpus linguistics doesn't mean anything" (Andor, 2004, p. 97).

(9) bad food

These examples show a range of different relations, from partly synonymous (Example 6 and Example 7) to antonymous (Example 6 and Example 9, or Example 7 and Example 9). Of the theories presented here, distributional semantics is closest to what is called structuralism. Looking at the syntagmatic relations of a word, presuming the text corpus is large enough, gives hints about its paradigmatic relations. The adjectives {good, tasty, spicy, bad} are paradigmatically related and can, depending on the context, substitute for each other. All these adjectives stand in syntagmatic relations with food and in paradigmatic opposition to each other. Looking at a word in a large corpus, its occurrences, or more exactly its co-occurrences with other words, are counted. Assuming the examples above to be our corpus, a possible word co-occurrence matrix is shown in Table 3.

Table 3: Exemplary co-occurrence matrix.

         good   tasty   spicy   bad   food
good     0      0       0       0     1
tasty    0      0       0       0     1
spicy    0      0       0       0     1
bad      0      0       0       0     1
food     1      1       1       1     0

This simplistic approach, Boolean values to indicate whether words co-occur, is seldom utilized. One can also count the absolute number of co-occurrences to weight the connection between two words. The window of words taken into account (i.e., the number of neighbors to the right or to the left) can be set higher than 1. Other approaches do not look at the direct co-occurrence but at syntactic relations between words, obtained by parsing the sentences (Pado, 2002; Padó, 2007). The distance between two words can be used to weight the co-occurrence. Also, more elaborate statistical measures (e.g., the log likelihood ($G^2$) (Dunning, 1993) to find only significant co-occurrences) have been proposed and employed to identify significant word pairs (cf. Biemann and Quasthoff, 2009). The co-occurrence matrix in Table 3 can be read and interpreted in the same way the network matrices have been treated before. Each row or column corresponds to one word and its neighbors in the word network.

In this case, the matrix forms a graph in which the word food is connected to each of the adjectives by one edge. Words with similar syntagmatic relations would have the same neighbors and hence be similar. The paradigmatically related words are those that have many common neighbors in the graph. Adjectives like those in the examples above are similar to a certain degree because they are interchangeable and have similar syntagmatic relations. Another very common way of treating the co-occurrences of a word in computational semantics is a vector model. Looking at only one row of the matrix, one gets the word vector in Table 4. Therefore, both approaches, networks and word vectors, are basically identical. Both approaches, of course, offer different means of computing, for example, the similarity. It has to be mentioned that word vectors usually start at this point and that more sophisticated (similarity) measures are applied to form the actual word vector (e.g., in latent semantic analysis (LSA)).

Table 4: Exemplary word vector: food.

         good   tasty   spicy   bad   food
food     1      1       1       1     0
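The step from the toy corpus to the matrix, and from the matrix to word vectors, can be sketched in a few lines (assuming numpy). Cosine similarity over the rows then makes the paradigmatic relatedness of the adjectives explicit: they share the neighbor food, while food itself is only syntagmatically related to them. The toy corpus and the window of one neighbor are assumptions for illustration only.

```python
# Sketch: build the Boolean co-occurrence matrix of Table 3 from examples 6-9
# and compare the resulting word vectors (rows) with cosine similarity.
import numpy as np

corpus = [["good", "food"], ["tasty", "food"], ["spicy", "food"], ["bad", "food"]]
vocab = ["good", "tasty", "spicy", "bad", "food"]
index = {w: i for i, w in enumerate(vocab)}

M = np.zeros((len(vocab), len(vocab)))
for phrase in corpus:
    for w1 in phrase:
        for w2 in phrase:
            if w1 != w2:
                M[index[w1], index[w2]] = 1      # Boolean co-occurrence

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(M[index["good"]], M[index["tasty"]]))  # 1.0: identical contexts
print(cosine(M[index["good"]], M[index["food"]]))   # 0.0: purely syntagmatic relation
```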

Given a word like food, one could expect that a partly synonymous word like dish would have a similar pattern of connections in the network, as well as a vector very similar to that of food seen above. If words share only very few, statistically insignificant, or no connections, they are not similar. For word vectors of cities and countries, it has been found that these vectors are, to some degree, compositional. For example, Mikolov et al. (2013, p. 7) show that, given word vectors for Paris ($\vec{P}$), France ($\vec{F}$), Rome ($\vec{R}$), and Italy ($\vec{I}$), it holds that $\vec{P} - \vec{F} + \vec{I} \approx \vec{R}$. This means that the word vector of Paris minus the word vector of France is very similar to the word vector of Rome minus the word vector of Italy. This shows that there is more similarity between Paris and Rome than just being large European cities: Both are their nation's capital. The similarity of these vectors, or the connectedness of words in a network, is a very useful finding for NLP tasks such as part-of-speech (POS) tagging, since words of the same paradigm share the same POS. Biemann (2009) and Biemann and Riedl (2013) show how the distribution of words can be used to build and improve, respectively, POS taggers.

Knowing that a word that is not in the vocabulary of a POS tagger is distributionally similar to a known word allows one to assign the same POS to the out-of-vocabulary word (Biemann and Riedl, 2013). The approach presented in Biemann (2009) is solely based on distribution and shows that the distribution allows for very fine distinctions of POS. The adjectives in examples 6 to 9 can be expected to be the neighbors of both food and dish, and in fact both words are of the same POS. Furthermore, Biemann (2009) shows that it is possible to distinguish between female and male first names. The author is also able to form clusters of similar words (e.g., the city names used in the example above). The finding that names form clusters can be used for named entity recognition and tagging. A great part of the work done on semantic networks in recent years was undertaken on distributional data. Many findings that were made in different kinds of networks, among them the lexical networks presented here, will be used and ported to applications in ontologies. Only quite recently has distributed computing (i.e., the usage of a cluster of personal computers) made the distributional analysis of huge text corpora possible without the need for a supercomputer. A huge text corpus here means at least a few million sentences that are processed and analyzed. Only such large numbers of sentences make the assumptions derived from them meaningful. Unlike ontologies and other hand-crafted resources, distributional semantics is what is called knowledge-free: It does not rely on a priori information. While it is much more computationally intensive, it requires less human work to analyze huge amounts of text statistically. Still, the results are not yet as accurate as those of man-made resources, as will be shown later on. Despite its accordance with the requirements of structuralism, distributional semantics differs even from the structuralist approaches presented earlier. Unlike semantic field theory, it is heavily based on empiricism and includes paradigmatic as well as syntagmatic relations. The distinction between the kinds of relations that exist between words (i.e., the semantic relations) is still an open problem, though some experiments show that a distinction between different relations should be possible (Erk and Padó, 2008; Mikolov et al., 2013; Socher et al., 2013). The negation of words (e.g., bad versus not bad versus not not bad) and their representation in a vector model are problematic. Or, in general, the question of "how textual logic would best be represented in a continuous vector space model remains an open problem" (Hermann et al., 2013, p. 74f.). In comparison to frame semantics, distributional semantics does not include extra-linguistic knowledge at all. As Harris formulates it, the co-occurrences of a word "may still be 'due to meaning' in one sense, but it accords with a distributional regularity" (Harris, 1968, p. 785).

All the vagueness of semantic fields and frames is avoided; only empirical distributional regularities remain.

3.4 Semantic Relations

The distributional approach connects words both in a syntagmatic way, as neighbors in the network, and in a paradigmatic way, as words with a similar pattern of connectivity. It does not specify what exactly the relation between two words is; it only describes it in terms of similarity. The following chapter will briefly recap the basic semantic relations, especially those that are going to be of interest when examining ontologies, i.e., those relations realized in WordNet and other ontologies. Because they will be discussed in more detail, especially with regard to their specific implementation within WordNet, in Chp. 5, only a very short overview will be given here.

3.4.1 Synonymy

Synonymy as the total semantic identity of two words is a very rare feature. Different words hardly ever mean exactly the same in all contexts of a natural language.28 Cruse (1986, p. 88) defines X and Y as cognitive synonyms if both are syntactically identical and any sentence S containing X has truth conditions equivalent to those of an otherwise identical sentence $S_1$ in which X is replaced by Y. Even though big and large can substitute for each other in some contexts, they cannot substitute for each other in every context. While a big boy is a boy that is behaving like an adult, a large boy is an overweight boy. Big/large are therefore only partly synonymous.

3.4.2 Hyponymy

The relations of hypernymy (the type > token relation) and its inverse hyponymy (the token < type relation) are hierarchical relations or, in terms of set theory, relations of inclusion. A hypernym includes all its hyponyms. A is the hypernym of B (A ⊃ B) if B is a hyponym of A (B ⊂ A). For example, tree is a hypernym of oak, and oak is a hyponym of tree. The set of all trees also includes cherry, linden, and other kinds of trees. All these trees are co-hyponyms; they share the same hypernym.

28Technical vocabulary and scientific language as special forms of a language are exceptions.

Most hypernym–hyponym pairs make up a taxonomy. The hyponymy relation is sometimes called a kind-of relation. As Löbner (2003, p. 133) states in his introduction to semantics, a car is a type of (motor) vehicle, but a boy is not a type of child. Cruse (2000, p. 152) illustrates some problems arising when hyponymy is defined as X is a kind/type/sort of Y. When saying a kitten is a sort/type/kind of cat, Cruse argues, this sounds rather odd. He therefore tries to find definitions of hyponymy aside from the inclusion definitions given above. In computational approaches, hyponymy is often called an IS-A relation. If we apply the IS-A relation to kitten/cat, this problem does not arise at all. It is perfectly fine to say that a kitten is a cat or that a boy is a child. A quite regular case of hyponym–hypernym pairs is that of the endocentric compounds, where the compound is a hyponym of the head of the compound (e.g., a lifeboat is a kind of boat).29
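Since WordNet, which is analyzed in Chp. 5, encodes exactly this IS-A relation between synsets, a quick sketch (assuming NLTK with the WordNet corpus downloaded) shows how the hypernym chain of oak can be followed upwards; the exact synsets returned depend on the installed WordNet version.

```python
# Sketch: follow the IS-A (hypernym) relation upwards from "oak" in WordNet.
from nltk.corpus import wordnet as wn

oak = wn.synsets("oak", pos=wn.NOUN)[0]   # first noun reading of "oak"
print(oak.name(), "-", oak.definition())

synset = oak
while synset.hypernyms():                 # walk up until the root is reached
    synset = synset.hypernyms()[0]
    print("  is a kind of", synset.name())
```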

3.4.3 Meronymy

A related concept is the meronymy relation, or part–whole relation. A meronym A of B is not a hierarchical subconcept of B but a part of B itself. A hand has fingers and a palm. The inverse relation is called holonymy. Concept B does not logically include A; rather, it is part of the physical constitution or definition of B to include A. Computationally, and especially with respect to ontologies, this relation is often included in a HAS-A or a PART-OF relation. We speak of meronymy only if both relations apply. As Cruse (1986, p. 161) shows, a wife has a husband and changing diapers is part of being a mother do not indicate a meronymy relation. Only if both conditions are satisfied do we speak of meronymy: A hand has a finger, and a finger is part of a hand.

3.4.4 Antonymy and Opposition

Antonymous words are mutually exclusive and mark points on a scale, like large and small. We call this opposition antonymy. Two expressions are contradictory if "the truth of one implies the falsity of the other and [. . . ] contrary if only one proposition can be true but both can be false" (Fellbaum et al., 1990, p. 29). Logically incompatible, antonymous word pairs are not necessarily complementary, meaning that the negation of one does not equal its antonym.

29Of course there are many exceptions, where a compound's meaning is not directly derived from its parts. A skinhead is not a special form of a head.

Not large is not the same as small but can also lie in between these two on a scale (cf. Löbner, 2003, p. 124); the words are bi-polar. The negated form does not inevitably "express the opposite value of an attribute but something like 'everything else' " (Fellbaum et al., 1990, p. 35). Lyons (1977, p. 286) states that the opposition of two words arises from "some dimension of similarity", meaning that only two words that are similar up to a certain point, and differ from each other in one property, can be antonymous. Murphy (2003, p. 170) says that in antonymy "only one relevant property of the words is contrasted". Both brother and sister denote a sibling. The only difference is the gender of the sibling referred to.30 This can be called a binary semantic property (see Table 5). Both words are semantically complementary. They are not the opposite of each other, but complementary and exclusive.

Table 5: Binary semantic property of siblings.

Expression   Sibling   Female
brother      +         −
sister       +         +

As Murphy (2003, p. 167) puts it: “The boundary between synonymy and antonymy is not always so clear”. She states that antonyms are co-hyponyms with exclusive senses, while (near-)synonymous words are co-hyponyms with similar or “overlapping senses” (Murphy, 2003, p. 167). Fellbaum (1995) shows that semantically related words across part of speech (POS) also co-occur at a very high rate. A word pair like dead (noun) and live (verb) co-occurs more often than the antonym pair live/die. This shows that there is of course a semantic relation between both concepts and that lexical entities from different POS are being used contrastively. From a semantic point of view, leaving out the paradigmatic or grammatical level, the relation between words across different POS is just as strong as that between words of the same POS. Still, this effect should not be confused with antonymy. In an NLP context, one would still be interested in the information that the pair dead/live describes a kind of opposition.

30 This kind of antonymy is called relational antonymy.

3.5 Ontologies

3.5.1 Conception

Unlike in metaphysics in philosophy, the term ontology in the context of information and computer sciences does not refer to a system that tries to explain or categorize everything-that-is in one coherent scheme. Instead, the term refers to a system describing special relations or meaning in a concrete domain, based on ideas taken from the early semantic networks introduced in Chp. 2.2. Furthermore, ontologies are closely related to lexical fields and frame semantics. An often cited definition states that an ontology is an “explicit specification of a conceptualization” (Gruber, 1993). This definition has been criticized for being overly broad (e.g., by Smith and Welty (2001)) and, at the same time, for excluding some linguistic ontologies in particular (i.e., WordNet (Stuckenschmidt, 2009)). To understand the definition, one also has to define what explicit specification and conceptualization are. Taking these critiques into account, Gruber later defines ontologies in the context of information sciences as

a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application (Gruber, 2009).

This second definition gives some clues about the way an ontology formalizes knowledge. Classes, attributes, and properties are used to build a consistent system of knowledge that is logically connected. If such a system is treated as an open world, it is said to be incomplete: Not every existing entity, relation, or class has to be modeled in the ontology. Non-existence in the ontology is no proof of falsehood, and entities of an ontology may well have further properties that are neither explicitly nor implicitly mentioned. The closed-world assumption, in contrast, means that whatever is not present in the model is treated as false (Stuckenschmidt, 2009, p. 35). The ontologies that are going to be examined in this thesis are open-world ontologies: They are incomplete but can be extended by new elements.
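The difference between the two assumptions can be made concrete with a minimal Python sketch; the toy knowledge base and the capitalOf relation are my own illustration and are not part of the ontologies discussed in this thesis.

# A toy knowledge base containing only explicitly asserted facts.
FACTS = {("capitalOf", "Berlin", "Germany")}

def query_closed_world(fact):
    # Closed-world assumption: everything that is not asserted counts as false.
    return fact in FACTS

def query_open_world(fact):
    # Open-world assumption: an absent fact is merely unknown, not false.
    return True if fact in FACTS else "unknown"

fact = ("capitalOf", "Paris", "France")   # true in the world, but missing from the model
print(query_closed_world(fact))           # False
print(query_open_world(fact))             # unknown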

The objects an ontology describes are called instances or entities. Instances are members of a class. Classes are defined as having special attributes or properties, and they are ordered hierarchically: Child classes inherit the properties of their parent class. A class can have different attributes which might be used to describe the instances of that class. If we imagine a class mammal, all child classes and all their instances share some common attributes: Their offspring is born alive, they have skin, and they have fur or hair. But child classes might have special properties. Felines have four feet, fur, and so on, while the members of the class Homo sapiens walk upright on two feet and have mostly hairless, or at least furless, skin. One instance of Homo sapiens might have the name Peter. It has a birthday, hair color, address, sex, and more properties that distinguish this instance from others of the same class. Instances can be members of different classes or, in terms of set theory, members of different sets. Set-theoretic concepts (e.g., inclusion, union, intersection, and so on) form the basis of ontology formalization.
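As a rough analogy only (ontology classes are sets described by logic, not object-oriented classes), the inheritance of attributes just described can be sketched in Python; the attribute names follow the mammal example and are otherwise invented.

class Mammal:
    # Parent class: attributes shared by all mammals and their child classes.
    offspring_born_alive = True
    has_skin = True
    has_fur_or_hair = True

class Feline(Mammal):
    # Child class: inherits the Mammal attributes and adds its own.
    number_of_feet = 4

class HomoSapiens(Mammal):
    walks_upright = True
    has_fur_or_hair = False   # mostly hairless, or at least furless, skin

# One instance with properties distinguishing it from other members of its class.
peter = HomoSapiens()
peter.name = "Peter"

print(isinstance(peter, Mammal))     # True: membership propagates up the hierarchy
print(peter.offspring_born_alive)    # True, inherited from the parent class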

3.5.2 Ontology Formalizations

Ontologies can be formalized in different ways. The membership of an instance in a class, or of a class in a parent class, might be expressed using sets. Relationships between classes or instances, properties, and attributes, as well as inheritance and hierarchy, can be described using description logics. Standard notations such as the Resource Description Framework (RDF) or the Web Ontology Language (OWL) can be used to write statements equivalent to description logic in the form of an XML notation. The following sections give a short overview of these formalisms.

3.5.2.1 Description Logics

The fundamental concepts of ontology formalization lie within the so-called description logics (DL). DL are a defined subset of predicate logic expressions. They can be defined to be more or less expressive, depending on the application (cf. Nardi and Brachman, 2003, p. 7). DL are a family of knowledge representation formalisms

that represent the knowledge of an application domain (the ‘world’) by first defining the relevant concepts of the domain (its terminology), and then using

these concepts to specify properties of objects and individuals occurring in the domain (the world description) (Baader and Nutt, 2003, p. 47).

The formalization of a domain (e.g., classes, relations, or hierarchies) is called the DL TBox (terminological box), while the data is called the ABox (assertional box) (cf. Nardi and Brachman, 2003, p. 17). The TBox is intensional; this information does not change. The ABox describes objects, i.e., their concrete relations or attributes. New information might be added to an ABox. The most basic relations in a DL are those of set theory (e.g., inclusion (A ⊆ B) or equivalence (A ≡ B)). A can be defined as the atomic concept B, as in the equivalence just given, or the right-hand side of the equation can be a complex logical expression. In this fashion, the definitions of the TBox are specified. An ABox consists of expressions on entities or individuals. For example, an entity e can be assigned to a class C, written e : C. Given that a relation bandMember was defined in the TBox, the expressions in Examples 10 and 11 can be made. The examples are taken from the data set of DBpedia and have been translated into possible DL expressions.

(10) bandMember(The Beatles, Paul McCartney)
(11) bandMember(The Beatles, John Lennon)

The information that both musicians played in the same band should be automatically available through reasoning. Later, we will come back to the issue of inference when it comes to link prediction in large networks built from ontology data, i.e., the ABox. Table 6 gives a brief overview of possible DL notations. Some of these notations have direct equivalents in the semantic relations that were shown before. The subset/superset relations (⊆ / ⊇) correspond to hyponymy and hypernymy. The equivalence (≡) is similar to synonymy.31 It has already been explained that the antonym of an expression is not necessarily its negation (¬). But antonyms are mutually exclusive: An entity is either a member of the set uncle (U) or of the set aunt (A): U ∨ A. The universal restriction (∀) can be used to assign some property to every member of a class (e.g., every person has a name), while the existential restriction (∃) is used to allow an entity to have a value (e.g., an academic title) without saying that every instance has to have one. Using this set of notions, other, more sophisticated relations between entities or classes, such as meronymy, can be defined and used in an ontology.

31 In Chp. 5, I will claim that the synonymy relation in WordNet actually defines an equivalence.

Table 6: Logical operators, restrictions, and set-theoretic expressions used in description logics.

∀       universal restriction: ∀x, for all x
∃       existential restriction: ∃x, there is at least one x
¬       negation (non)
∧       conjunction (et)
∨       disjunction (vel)
∩       intersection
∪       union
⊆ / ⊇   subset/superset
≡       equivalence
:       assertion

The more expressive a DL is, the more complex, or even impossible, reasoning becomes. DL are based on the open-world assumption: Concepts that cannot be found in an ontology, or information missing from a data set, are not necessarily false; other information might exist that simply has not been formalized. Reasoning on the data makes some of this implicit information accessible.
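As a minimal illustration of this kind of reasoning over an ABox, and not a description-logic reasoner in any formal sense, the following Python sketch derives the implicit fact that the two musicians of Examples 10 and 11 played in the same band; the bandmate relation name is my own.

from itertools import combinations

# Explicit ABox assertions as (relation, subject, object) triples.
abox = [
    ("bandMember", "The Beatles", "Paul McCartney"),
    ("bandMember", "The Beatles", "John Lennon"),
]

def infer_bandmates(assertions):
    # Rule: two people who are members of the same band are bandmates.
    members_of = {}
    for relation, band, person in assertions:
        if relation == "bandMember":
            members_of.setdefault(band, set()).add(person)
    inferred = set()
    for band, members in members_of.items():
        for a, b in combinations(sorted(members), 2):
            inferred.add(("bandmate", a, b))
    return inferred

print(infer_bandmates(abox))
# {('bandmate', 'John Lennon', 'Paul McCartney')}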

3.5.2.2 RDF

The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) specification for the description of any kind of resource. A resource can be a website or a web service, as well as other resources like physical objects of the real world. It is a description framework (i.e., a set of terms used to describe objects) “for describing just about anything” (McBride, 2004, p. 51). The description principle of RDF is a directed graph; directed graphs have been examined in Chp. 2. To reference the objects that are being described, uniform resource identifiers (URIs) are used. The most widely known examples are web addresses. URIs consist of a scheme (e.g., http), an authority (e.g., w3c.org), a path (the simplest being /), and an optional query or fragment. In the case of a web address, the URI references exactly one website (cf. Berners-Lee, 1994). Full URIs can be seen in Listing 1. In the ontologies that are examined in this thesis, URIs are used to reference objects,

such as persons, institutions, and concepts, as well as their descriptions. Objects are linked using attributes, but attributes can also take literal or numerical values.

Figure 12: An exemplary RDF graph.

This can be seen in Fig. 12. The elliptic nodes, labeled by URIs, represent resources. The edges stand for attributes or relations, likewise labeled by URIs. The rectangular nodes stand for literal values. Empty nodes are equivalent to the ∃ quantifier in predicate logic: There is at least one, not further defined, value or object in this relation. Distinguishing these different types of nodes is going to be important when it comes to predicting relations in an ontology. Literal values and empty nodes might not be of great interest or use in the prediction of new attributes of a given object. As seen in Fig. 12, literal values such as the population of a city or country or the age or birthday of a person are likely to contain special values that are not common to, or shared by, a cluster of elements. The graphical representation in the form of a graph (see Fig. 12) is not the only possible formalization of RDF data. The description framework consists of terms to describe relations between objects, and RDF has several possible formal description languages. The two most important notations in the context of this thesis are N-triples and the RDF/XML notation.32 N-triples consist of three elements, in this case URIs (see Listing 1).

Listing 1: N-triple taken from DBpedia (URIs abbreviated): <subject URI> <predicate URI> "1990-05-16" .
The first element refers to the object that is being described. The second refers to the relation that exists between the first object and the third element of the triple. In this case, the third element is not a URI-referenced object like the first or the second, but a literal value. N-triples are widely used to exchange resource descriptions or the ABox of a DL. Lists of such triples make up the data of DBpedia (see Chp. 6).
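As a sketch of how such a triple can be built and serialized programmatically, the following assumes the Python library rdflib; the subject resource is a hypothetical placeholder in the style of DBpedia, and the birthDate property is only assumed to be the relation behind the literal in Listing 1.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

# Namespaces in the style of DBpedia; the concrete subject is a placeholder.
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
subject = DBR["Some_Person"]                        # hypothetical resource
predicate = DBO["birthDate"]                        # assumed property for the literal
value = Literal("1990-05-16", datatype=XSD.date)

# A triple is added as (subject, predicate, object), mirroring the N-triple order.
g.add((subject, predicate, value))

print(g.serialize(format="nt"))    # N-triples notation
print(g.serialize(format="xml"))   # the RDF/XML notation discussed below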

32 Many other ways to formalize RDF exist that cannot be presented here.

The RDF/XML notation describes the same relation between two objects but makes use of the Extensible Markup Language (XML). The syntax is more sophisticated and can be processed by widely used XML processors. XML is often used to exchange data between different systems. The whole expressiveness of RDF/XML cannot be discussed here; for further information see, e.g., Carroll and Klyne (2004). RDF is capable of describing concrete objects or abstract ideas, but it cannot be used to describe a domain or knowledge in the way ontologies do. RDF is therefore not an adequate instrument to model ontologies, but rather to describe relations. An extension of RDF called RDFS solves some of the problems that occur when using RDF: It adds the possibility to declare subsets of classes and the use of negation (cf. McBride, 2004, p. 63). Still, more expressiveness is needed to formalize description logics. Some of these shortcomings were resolved with the development of OWL.

3.5.2.3 OWL – Web Ontology Language

To make full use of the expressiveness of description logics, the Web Ontology Language (OWL) was developed. Usually, OWL is written in an XML notation. Among OWL’s distinctive features is the possibility to declare properties that may exist only between objects of certain defined classes (Antoniou and Harmelen, 2003, p. 92). Given the relations seen in Examples 10 and 11, a restriction on the relation bandMember(x, y) is that x is in fact a member of the class Band (x : Band) and that y is a person (y : Person). The former is called the relation’s domain, the latter its range. Classes can be disjoint, meaning that there cannot exist objects that are members of both classes A and B, or, in other words, the intersection A ∩ B is empty. For example, the sets of male and female members of a group have no intersection and are disjoint. Declared disjointness implies that the intersection is an empty set, but an empty intersection in the data does not imply that two classes are defined as disjoint. Similarly, OWL allows one to define subsets (A ⊂ B) and unions (A ∪ B). Properties can also be defined to have a special data type. OWL is defined in three versions, each with its own level of expressiveness: OWL Full is a superset of OWL DL, which is itself a superset of OWL Lite. OWL Full is hence the most expressive variant of OWL, and it is undecidable. The subset OWL DL is a version that is based on description logics; it is decidable, and there are reasoning algorithms that work with OWL DL. OWL Lite was intended as a much simpler version of OWL. Still, it

has been shown that it is in fact just as expressive as OWL DL (Grau et al., 2008). In the context of this thesis, it is relevant to know that there are differences between RDF and OWL, especially those described above, and that OWL can be used as an ontology language to define the terminology of an ontology (TBox), while RDF is in this context mostly used to define relations between entities or classes that are defined in an ontology (ABox). As one can easily see from the descriptions above, the relations used in RDF that are defined in an ontology using OWL make up not only a graph but a network of entities and their properties and relations. The ABox of an ontology is made up of directed triples, describing relations between one entity and another entity or describing a property of an entity. Looking at Listing 1 above, one can also think of a triple as describing a graph. In the graph notation, the first and the third element each refer to a vertex of the graph. As mentioned before, a graph can be written as an edge list. For example, given two vertices v1 and v2 that are connected by an edge, the corresponding notation is {v1, v2}. In the case of ontology data, the triple has to be interpreted as directed: The first element points to the third element, not vice versa.
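This reading of the ABox as a directed graph can be sketched with the Python library networkx; the triples reuse the bandMember examples, and keeping the relation name as an edge attribute preserves the different kinds of edges.

import networkx as nx

# ABox triples as (subject, relation, object); the subject points to the object.
triples = [
    ("The Beatles", "bandMember", "Paul McCartney"),
    ("The Beatles", "bandMember", "John Lennon"),
]

G = nx.DiGraph()
for subj, relation, obj in triples:
    # The relation name is stored as an edge attribute so that the different
    # kinds of edges remain distinguishable in the resulting network.
    G.add_edge(subj, obj, relation=relation)

print(list(G.edges(data=True)))
# [('The Beatles', 'Paul McCartney', {'relation': 'bandMember'}),
#  ('The Beatles', 'John Lennon', {'relation': 'bandMember'})]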

3.6 Conclusion

In Chp. 2, definitions of semantics based on a network were given that are, in many aspects, very close to what has been shown in this chapter. The idea that the meaning of a word is based on its relation to other words,33 or the word’s neighbors, can be found in the distributional approaches, in frame semantics, and in lexical field theories. Trier states that “Worte im Feld stehen in gegenseitiger Abhängigkeit voneinander. Vom Gefüge des Ganzen her empfängt das Einzelwort seine inhaltliche begriffliche Bestimmtheit”34 (Trier, 1973, p. 41). He compares the meaning of a word to a mosaic, where only the knowledge of all neighboring words yields the meaning of the single word. Instead of a mosaic, one could also speak of a network. In a network, and in an ontology as well, the meaning of a concept or word is given by its neighbors, by the entities or vertices it is related to. Semantic relations define how words are connected. Where the picture of the mosaic

33 These relations, ultimately, result from the usage of the words.
34 Translation: Words of a field stand in mutual dependency on each other. From the structure of the whole, the single word receives its conceptual determination.

and the network only gives the fact that a relation exists, semantic relations define, similar to ontologies, how words depend on each other. Frame semantics furthermore states that the extra-linguistic knowledge of entities is important and tries to formalize it in a coherent way. Like formal semantics, ontologies are grounded in logics, in description logics to be precise. This reliable philosophical–mathematical framework makes ontologies very adaptable to different domains and needs, and it allows one to apply logical reasoning and inference. Ontologies are not restricted to purely intra-linguistic information, but can be open to world knowledge, as has been proposed in frame semantics. Relations between entities can take different forms and further define the meaning of an entity. Synonymy, antonymy, hyponymy, and meronymy are basic relations not only in human language but also in the relationships between things. They are therefore commonly used in ontologies. Based on the relations between entities, the data naturally forms a graph or a network.

Distributional semantics, an approach that looks at words in context (i.e., at their realization in syntagmatic and paradigmatic relations), aims to find these regularities in large amounts of data. Only within the last 20 years has distributional semantics been recognized as being very useful; today it is probably the most widely used approach to semantics in NLP. This is of course due to the possibilities modern computers and the Internet offer. Today it is used not only in state-of-the-art search engines (e.g., in the form of latent semantic analysis (LSA) (Deerwester et al., 1990)) but also in a variety of other NLP tasks such as POS tagging (Biemann and Riedl, 2013), word-sense disambiguation (Biemann and Giesbrecht, 2011), and others. Distributional semantics does not need human interaction and knowledge; it is usually referred to as knowledge-free and is hence much easier to apply. Still, the results are not as semantically exact as those of hand-made resources. Ontologies have very exact relations and can have very fine-grained sense distinctions. But this advantage comes at a price: Creating ontologies, both the formalization and the data, is very time-intensive.

Therefore, the goal of this thesis is to improve existing ontologies without the need for human interaction, or with only very limited interaction. For both ontologies that are to be analyzed and extended here (i.e., WordNet and DBpedia), the proposed methods are based on findings and techniques developed in the social sciences, biology, and physics in recent years. These methods and ideas are reviewed in the following before adapting them to ontologies, taking into account what has been said about graphs and networks in the chapter before. Afterwards, the two ontologies in question are analyzed, first, to identify possibly

missing information and, secondly, to find features that might help in identifying missing relations between two entities automatically.

4 Relation Extraction and Prediction

The structure and properties of networks, as seen in Chp. 2.1, show a special feature of networks: The information given in a network is present not only in the vertices and their edges but also in the overall structure. A good example are the centrality measures introduced before: They are calculated by looking at least at the direct neighbors, but most of them take the whole network into account. The information given in the complexity of a network is more than the pure sum of its vertices. The structure of networks has been used to make assumptions about the probability of the existence of non-defined edges between vertices. The information given in the graph is used to predict possible relations between two vertices that the graph, or an ontology that will be treated as a graph in the context of this thesis, does not actually represent, but that the network structure makes plausible. This has lately been used especially in online social networks or in the context of marketing and advertisement, as will be shown in Chp. 4.2. In Chp. 4.3, I will present existing machine learning techniques to compute such probabilities, before discussing the conceptual difference between a social network and a semantic network and the implications these differences have for the methods that are to be applied to predicting links or identifying missing edges in semantic networks. Before turning to techniques based on networks, classical text-based approaches to extending ontologies shall be discussed.

4.1 Extending Ontologies: Relation Extraction from Text

Ontology creation and development is a time-consuming, often manually undertaken, task. Enrichment and automatic extension of ontologies have therefore been a field of intense study in the last decades. This thesis is going to focus on methods adopted from graph and network analysis, and ontologies will be looked at as graphs. Still, there have been different approaches to enriching ontologies, especially semantic ontologies like WordNet. In the following, I will evaluate existing methods of ontology enrichment through relation extraction from natural language texts. These approaches do not take the ontology (e.g., WordNet or DBpedia and its network structure) into account, but rather work with freely available texts from different domains. The methods are based on the idea that

natural language texts contain and convey knowledge in the form of (syntactic) structures and that language can be parsed and processed, and information extracted. Three related, yet quite different, approaches shall be examined here. Hearst (1992) proposes a method to extract hyponymy–hypernymy relations from texts, taking into account only the surface structure of sentences (i.e., only a shallow analysis based on POS tagging); this method is evaluated in Hearst (1998). Lüngen and Lobin (2010) introduce a method to transform information given in tables of contents into a Multilayered Semantic Network (MultiNet). Unlike in the first approach, a deep syntactic analysis of the dependency structure is undertaken to identify entities and relations; furthermore, the text structure and organization are taken into account. Others, such as McCord et al. (2012), take parse trees and the deep structure of sentences into account when looking for recognizable patterns to match information given in texts to semantic relations such as those of DBpedia.

Hearst (1998) presents a method she calls lexico-syntactic pattern extraction to find relations between words. The method is meant to support lexicographers in their work (e.g., the developers of WordNet). The goal is to find constructions “that frequently and reliably indicate the relation of interest” (Hearst, 1992, p. 540). Hearst (1998, p. 134) gives some examples of such constructions in texts:

(12) . . . works by such authors as Herrick, Goldsmith, and Shakespeare.
(13) Bruises, . . . , broken bones, or other injuries

Example 12 shows a typical itemization of names that are subsumed by the preceding noun phrase (NP) authors. The pattern that matches such constructions is

(14) such NP as {NP, }* {, and/or} NP.

The first NP is a hypernym of the following NPs:

• hyponym(Herrick, author),

• hyponym(Goldsmith, author), and

• hyponym(Shakespeare, author).
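As a very rough sketch of how pattern 14 could be matched in running text, the following Python regular expression approximates each NP by a single capitalized word; this is my own simplification for Example 12, not Hearst’s implementation, which works on POS-tagged text.

import re

# Simplified rendering of pattern (14): "such NP as {NP,}* {, and/or} NP".
PATTERN = re.compile(
    r"such\s+(?P<hyper>\w+)\s+as\s+"
    r"(?P<hypos>[A-Z]\w*(?:,\s*[A-Z]\w*)*(?:,?\s*(?:and|or)\s+[A-Z]\w*)?)"
)

def extract_hyponym_pairs(sentence):
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group("hyper").rstrip("s")   # crude singularization: authors -> author
        hyponyms = [w for w in re.split(r"[,\s]+", match.group("hypos"))
                    if w not in ("and", "or")]
        pairs.extend((h, hypernym) for h in hyponyms)
    return pairs

print(extract_hyponym_pairs("... works by such authors as Herrick, Goldsmith, and Shakespeare."))
# [('Herrick', 'author'), ('Goldsmith', 'author'), ('Shakespeare', 'author')]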

The pattern of interest in Example 13 is of the form

(15) NP {, NP}* {,} or other NP.

The NPs in the first slots are all hyponyms of the NP in the last position.35 In total, Hearst identifies eight such patterns that unambiguously and reliably indicate a hypernymy/hyponymy relation. One way of finding such patterns is to derive a list of words that are already connected by the relation in question in WordNet and to extract sentences from large text corpora that contain the two words. Looking at the derived sentences should give a good overview of possible constructions and hence patterns.

However, adding newly found relations to WordNet comes with some problems. Since the structure of WordNet is not word based, but rather word-sense based (see Chp. 5.2 for a deeper analysis of the WordNet structure), one has to decide which word sense is present in the relation. The word sense has to be disambiguated against the WordNet senses. This is, as will be explained later, not an easy task. Moreover, if a word form does not exist, the lexicographer has to decide to which word sense (called synonym set or, for short, synset) to add this word form, or whether to create a new synset. Another problem arises when text genres other than encyclopedic texts are used to find new relations. Especially in newspapers, a genre for which a great amount of corpus data is available, authors often present subjective interpretations. Hearst (1998, p. 17) found that this leads to some noise when extracting relations from newspaper articles. The weakness of this approach lies in its inability to find patterns for relations other than hyponymy/hypernymy. Nonetheless, the method was used in the creation of WordNet and helped identify hyponymous words.

While Hearst (1998) only mentions the possibility of applying a deeper analysis, Lüngen and Lobin (2010) transfer the idea of finding patterns in language structures that indicate a certain semantic relation, not necessarily a lexical relation, to tables of contents (TOCs) of academic textbooks. They find that the hierarchical order of text structures, such as sections, paragraphs, and others, in combination with morpho-syntactic information, can be used to identify semantic relations between terms in headings. The number of possible relations identified by different patterns is quite large. The

35 Although this notion is widely accepted in the field of natural language processing, it is probably rejected by many scientists in other fields (cf. Kripke, 1980). Hyponymy is not used in a strictly linguistic sense here; it is used to refer to a logical categorization of objects in nature. What is important in the field of NLP is that a pattern like this one allows one to categorize entities, here represented by proper names. It gives meaning to a proper name that otherwise, to a computer, is only a string of characters without any meaning.

structure of headings, especially in academic textbooks, indicates per se a hierarchical relation that also exists between the single elements of the TOC. In a first step, a number of syntactic and grammatical analyses are undertaken (e.g., lemmatization and syntactic parsing). Statistical methods can be applied to find technical terms that are specific to a domain (see the bold-faced terms in Fig. 13). Between these terms, depending on the hierarchical and syntactic order, the relations are established. Hebborn (2013, p. 204) gives the example shown in Fig. 13, where an exemplary excerpt from a TOC can be seen.

2.3.5 Weitere Kernmerkmale politischer Systeme

2.3.5.1 Die Stellung des Parlaments
2.3.5.2 Die vertikale Gewaltenteilung
2.3.5.3 Verfassungsgerichte

Figure 13: Sample of hierarchically structured text.

A human reader easily understands that the subordinated sections and paragraphs extend or explain the superordinate section. The terms that are related are in bold.36

2.3.5 [Weitere Kern[merkmale]keyword [politischer Systeme]A]NPGen

2.3.5.1 Die [Stellung des Parlaments]B

2.3.5.2 Die [vertikale Gewaltenteilung]C

2.3.5.3 [Verfassungsgerichte]D

Figure 14: Sample of hierarchically structured text with marked terms.

In Fig. 14, the important elements are labeled:37 The keyword Merkmal (English: property), in plural and here in the form of a compound, in the superordinate heading (Further main properties of political systems) is the head of an NPGen (of political systems) along with the term A, politischer Systeme (political systems), which indicates the focus of

36 The accentuation does not exist in the original TOC but was added by Hebborn.
37 Based on Hebborn (2013, p. 204); extended by appropriate tags.

the following headings. The subordinate headings contain the terms B–D. This structure implicitly comprises relations of the kind has(termA, termX) (e.g., has(politisches System, Verfassungsgericht)). The focus term in the superordinate heading is explained or extended in the subordinate headings. This approach is very fruitful when working with general knowledge ontologies, or when creating such an ontology. The approach seems unfitting for WordNet, where a relatively small set of semantic relations exists and where the focus is not on general knowledge terms. It could, nonetheless, be applied to DBpedia and other similar ontologies to fill existing gaps. Still, because of its relatively free use of relations, one might have to create further rules to match these relations to those defined in the ontology. An example of how to match relations given in texts to relations in an ontology is given in the following approach.

Based on the syntactic structure of utterances, the isomorphism or parallelism between syntactic and semantic structures can be exploited to assign semantic roles to phrases. For example, in question answering (QA) systems such as the already mentioned IBM Watson, finding information on different entities is needed to understand a question and to find the appropriate answer. In IBM Watson, relation extraction is used at different points in the workflow. Watson uses, besides other knowledge sources, DBpedia. In contrast to the approach of Hearst, deep dependency parsing is used in Watson. A parser returns a parse tree indicating syntactic and semantic dependencies between constituents. While a purely surface-based analysis depends on situations where the structure of a sentence corresponds to a predefined pattern, parsing allows one to identify entities in nested structures or complex phrases. Multi-word tokens, too, are easily identified as single vertices in such a dependency tree. The parser used by IBM is a so-called slot grammar (SG) (McCord, 1980). In SG, verbs are assigned basic semantic types indicating their sense. This information is stored within the system’s lexicon module (McCord, 1993, p. 127). During the work on the Watson Jeopardy! challenge, the semantic type system was extended to include WordNet synset information and to match nouns to related verb frames. This can be used to define semantic relations by choosing a class of verbs or verb senses to be responsible for expressing the relation (McCord et al., 2012, p. 5). Having a verb frame writeVerb, each verb belonging to this frame is expected to express a writing relation. The semantic typing system and the WordNet synsets are used to match different verbs, such as write, compose, and others, as well as constructions such as the composer

of to this frame, which establishes an authorOf relation. The dependency structure of a sentence38 like

(16) In 1936, he wrote his last play, “The Boy David”; an actress played the title role (McCord et al., 2012, p. 10).

can easily be displayed in a predicate–argument structure such as

writeVerb(“he”, “his last play, ‘The Boy David’”)39 while still keeping in memory the internal structure of the object phrase. The relation can be reduced to

(17) writeVerb(he, The Boy David).

Resolving the pronoun to the correct entity is then the actual problem in this question-answering task. The underlying question to answer is “Who wrote The Boy David?”. From different ontologies as well as a large number of natural language texts that are processed in the same way as the example question above, the system tries to find the answer. It has to look for a relation like the one in Example 17, where the first slot of the relation, the he, is filled with the actual author name: J. M. Barrie. These and other approaches to extending the sparse information of ontologies or to analyzing relations in natural language texts are based on text corpora, i.e., external sources of information. In the following, approaches to extending networks will be presented that make no use of external knowledge but focus on the network structure itself.

4.2 Predicting Missing Links

As an alternative to the approaches mentioned so far, some recent work from the field of social network analysis (SNA), especially on link prediction and inferring missing links, will be presented and discussed. These will later be ported and applied to extending ontologies. Social networks are formed by (human) interaction, and they are “structures whose vertices represent people or other entities embedded in a social context, and whose edges

38 The example might appear a little odd. This is due to the fact that IBM Watson was used to take part in the US TV show Jeopardy!, where the contestants have to find a fitting question to a given answer.
39 Only the two important slots, subj and obj (i.e., direct object), are shown here; other arguments have already been removed.

represent interaction, collaboration, or influence between entities” (Liben-Nowell and Kleinberg, 2007, p. 1). Liben-Nowell and Kleinberg (2007, p. 1) define the link-prediction problem in social networks as follows:

Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to the network during the interval from time t to a given future time t′.

This definition sets link prediction close to the “problem of inferring missing links from an observed network” (Liben-Nowell and Kleinberg, 2007, p. 2), and, as we will see, most studies in the field do not take different time points into account but work on static networks, as will be the case for the problems given in this thesis. Link prediction in SNA has been an active field of research40 in the last decade. Most of the link prediction studies undertaken on Facebook, co-authorship graphs, or other social networks are quite similar in their approach to predicting unseen links and with regard to the kinds of links they predict (e.g., suggesting new friends, predicting future interactions between users, or other behaviors). The first three approaches presented in the following are all based on the Facebook graph. They do not try to identify possible edges between existing users (i.e., vertices in the network) but predict personal information from what one likes, friends outside of the social network, or the home residence by looking at the friends’ activities.41 Afterwards, other approaches are presented that try to find missing links in networks using heuristics based on human/social behavior or using machine learning techniques.

On Facebook, users are connected to other users, their friends, as well as to sites representing entities like companies, places, and so on that the users like. Furthermore, Facebook users are connected with properties like their age, sex, and many more. Kosinski et al. (2013) examine a relatively large sample of vertices (here users and their personal information) and edges (page likes) from Facebook. From these, they remove, one after another, attributes such as the age, gender, political or religious views,

40 A good review of the methods used can be found in Sarkar et al. (2011).
41 In the case of the first two studies presented here, one should mention that the possible threat to users’, or even non-users’, privacy was a main concern of the researchers. Others have warned about the possibility of predicting a user’s personality (Golbeck et al., 2011) or of identifying people in anonymized data (Butler, 2007). The third paper presented in this section is less critical and was published by Facebook employees. It shows what Facebook actually does with the user data.

relationship status, and sexual orientation of the users. The pages a user likes are used as features to predict the user’s personal information. They further take into account the user’s network size and density, as well as information derived from questionnaires filled out by the users (e.g., an estimate of the user’s intelligence). The authors try to predict two different kinds of properties of a user: numerical values (density, size, intelligence, or age) and sets of dichotomous attributes, including political view (i.e., liberal or conservative) and sexual orientation (i.e., heterosexual or homosexual). Despite some odd findings (e.g., that liking curly fries is, according to the machine learning model they trained, a good indicator of intelligence42), Kosinski et al. (2013) are able to classify users quite well. The highest accuracy was obtained when predicting ethnicity (95%), followed by gender (93%), political views (85%), and sexual orientation (homosexuality of males 88%, of females 75%). To classify the numerical values, Kosinski et al. (2013) employ a regular regression model; to predict the other values, they use a logistic regression model. Regression is, alongside other machine learning techniques, going to be explained in Chp. 4.3. In a narrow sense, these predictions are not link predictions since they do not add or predict links between vertices (users or sites in Facebook) but predict attributes connected to the user or even attributes defined outside of Facebook (e.g., the intelligence is inferred from questionnaires the participants filled out). Information like gender, age, sexual orientation, or political views constitutes attributes that are present in the Facebook graph and could therefore be treated as vertices themselves. Still, this is not what this thesis is about.

Horvát et al. (2012) aim at a different goal. Taking a part of the Facebook graph, containing users and their friendships, plus the users’ address books outside of Facebook, Horvát et al. predict the probability of two non-members being friends if they have common friends on Facebook. Although the accuracy is not very high, it is still higher than a baseline of randomly assigned edges (Horvát et al., 2012, p. 5ff.). The findings are, from a user’s, or rather a non-user’s, point of view, maybe even more alarming than the ones seen before. In this case, one can speak of a good example of the prediction of edges in a network, thereby extending it. The approach does not aim at finding links between existing vertices but at extending the network by new vertices. As features for their prediction, Horvát et al. (2012) take the degree, the shortest path,

42 There is of course no significant correlation, let alone causality.

and a few other measures into account. These measures are then used to train a random-forest classifier and to build decision trees. This machine learning approach will also be discussed in Chp. 4.3.

A third study that is to be presented here was undertaken by three Facebook employees. Backstrom et al. (2010) propose an algorithm to predict “the location of an individual from a sparse set of located users with performance that exceeds IP-based geolocation” (Backstrom et al., 2010, p. 61).43 They state that about 6% of the ca. 100 million users in the US entered a home address. Of these, they were able to parse about 60%, which gives about 3.5 million addresses that can be geolocated. They observed that with increasing distance the probability of friendship decreases. This observation allows one to predict the location of users without an address on Facebook by taking into account the locations of their friends. The likelihood of a user living close to his or her closest friends, the ones he or she has the most contact with, is very high and makes these predictions possible. But since this assumption only holds true for social networks, this approach is not adequate for predicting other relations.

After looking at these possibilities to predict different features of vertices in a social network, link prediction, or the identification of possibly missing edges within a social network, is the main purpose of the following approaches. Often used, or extended, is the approach of Adamic and Adar (2003) (cf. Davis et al., 2011; Liben-Nowell and Kleinberg, 2007; Sarkar et al., 2012; Scellato et al., 2011). Adamic and Adar (2003) look at faculty home pages (i.e., personal websites of faculty members of a university), which are thought of as representing the user’s social environment and can therefore be used to “learn a great deal about both virtual and real-world communities of people” (Adamic and Adar, 2003, p. 212). To build a network out of these home pages, four sources are used: the text of a home page, the out- and in-links, and the mailing lists of the different departments involved. Combining these, a network structure arises that is believed to show the same clusters as the actual involvement of the persons. The network should therefore exhibit structures that arise from university department structures or from projects in different departments. The analysis was undertaken for different universities (e.g., Stanford University and the Massachusetts Institute of Technology). Looking at the in- and out-links, the mailing list membership, and the home page text, Adamic and Adar (2003) “predict whether one person is associated with another [. . . ]. [The] matchmaking algorithm is based on the well-established result in sociology

43 IP-based geolocation works by looking up a user’s IP in a special database that returns a close approximation of the location of the user.

that friends tend to be similar . . . . Hence the more things two people have in common, the more likely they are to be friends, and the more likely they are to link to each other on their homepages” (Adamic and Adar, 2003, p. 222). This has been called homophily before, and it is what is meant when Adamic and Adar (2003) speak of similarity: the similarity of connections.

Liben-Nowell and Kleinberg (2007, p. 4) treat link prediction as “computing a measure of proximity or ‘similarity’ between vertices x and y, relative to the network topology”. To calculate this similarity, they apply a number of measures, some of them very similar to one another. The most basic approach is the length of the shortest path between the two vertices. Also, the measure presented in Katz (1953) is applied: It is defined as the sum over all paths, up to a certain length, between the two vertices, where, because of a parameter β, longer paths contribute only very little to the sum. Since two different time points of the same social network are being examined, evolution models such as preferential attachment can be used as well. Furthermore, PageRank and random walk models, as well as clustering algorithms, are used. Liben-Nowell and Kleinberg (2007, p. 8) state that many of the measures they applied overlap with each other. Based on the closed-triangle assumption, or the homophily that Adamic and Adar (2003) use to predict connections, they apply a measure based on the common neighborhood. It is a score for common neighbors (i.e., the intersection of the set of neighbors of x (Γ(x)) and the set of neighbors of y (Γ(y))). Widely used in information retrieval and similarity calculation tasks is the Jaccard coefficient (Jaccard, 1912), which divides the intersection of two sets by the union of the two sets and thereby calculates the similarity of the two sets:

|Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|.
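A minimal Python sketch of these two neighborhood-based scores, computed from a plain adjacency dictionary; the toy graph is invented for illustration.

# Toy graph as an adjacency dictionary: vertex -> set of neighbors.
graph = {
    "x": {"a", "b", "c"},
    "y": {"b", "c", "d"},
}

def common_neighbors(g, u, v):
    # |Γ(u) ∩ Γ(v)|: the size of the shared neighborhood.
    return len(g[u] & g[v])

def jaccard(g, u, v):
    # |Γ(u) ∩ Γ(v)| / |Γ(u) ∪ Γ(v)|
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0

print(common_neighbors(graph, "x", "y"))   # 2
print(jaccard(graph, "x", "y"))            # 0.5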

Taskar et al. (2003) take a different step in predicting links in social networks. They not only ask whether there is a link between two vertices but also what type of link it is. Just as Adamic and Adar (2003), Taskar et al. look at faculty websites and their interaction in the form of hyperlinks and similarity (i.e., topics mentioned on the sites themselves), which gives them a broad network. They use a method called relational Markov network (RMN) to make predictions. RMNs find cliques in the data which are

77 then used to make predictions.44 What makes this approach interesting is the fact that it takes into account what kind of link may exist between the vertices. For every two vertices an attribute exists is defined, which gives information on whether a link between the two vertices can exist or not. This can be applied to ontologies with respect to possible domain/range restrictions. Another example of using supervised machine learning in link prediction is given in Hasan et al. (2006). Hasan et al. use different features given in co-authorship networks, though not all of them are topological features of the network. The shortest path is one of the network measures utilized but also the common neighbors, as seen above, and the clusters. Among the tested classification algorithms (e.g., support vector machines, decision trees, multilayer perceptron, and Naive Bayes) they find that all of them work comparably well with an accuracy between 80% and 90%. These classifiers will also be evaluated in this thesis. Looking at the research done in SNA, one can find some clues to link prediction that might be applied to ontologies as well. First, one can see that a profound analysis and understanding of the data helps in finding ways to predict unseen structures. Considering link prediction as a classification task is a second one. The consideration that two people are likely to interact if they share a lot of common acquaintances is not true for language networks such as WordNet,45 but might be applied to DBpedia data set. A problem arising when working with ontologies is that of domain or range restrictions, which are not formalized in graphs. Taskar et al. (2003) have shown how these restrictions might be handled to avoid false predictions that contradict the ontology. In the following chapters I will explain how these range restrictions can be handled in WordNet and DBpedia. The next step in evaluating existing methods is to look at other mathematical models that have been proposed to make predictions in complex, non-social networks. Predicting links in social networks is based on some basic assumptions made on the structure of human interaction. Two vertices that share many common neighbors are more likely to be related (at some point) because humans tend to meet their friend’s friends and people with similar interests and hobbies. But this is not true for other networks, for example gene networks or semantic networks. Techniques that can be employed to successfully

44As we will see in the analysis of the WordNet graph, cliques are no help in predicting links in semantic word nets like WordNet. 45In WordNet, many neighbors are unconnected because of the structure of the WordNet ontology. Synonyms, for example, are child nodes of a common synset but are not connected themselves.

(a) Hierarchical organization D of G with vertices {a, . . . , f}. (b) Original graph G with vertices {a, . . . , f}.

Figure 15: HRG: hierarchical random graph model.
predict links in social networks can therefore not be directly ported to other types of networks. A more general approach, which is not limited to social networks since it is not based on properties of social interaction, will be reviewed in the following. Clauset et al. (2008, p. 99) state that “our current pictures of many networks are substantially incomplete”. This is due to the incompleteness of most networks available.46 To predict the missing links, they introduce their approach of hierarchical random graphs (HRGs). HRGs are arbitrary hierarchical structures that correspond in their topology (i.e., in terms of statistical properties such as degree distribution or path length) to the original graph. Clauset et al. (2008) claim that the hierarchical structure itself can explain these features of a graph. The algorithm produces a set of different random graphs representing the original graph’s features (i.e., “hierarchical random graphs with probability proportional to the likelihood that they generate the observed network . . . each of which is a reasonably likely model of the data” (Clauset et al., 2008, p. 98)). These hierarchical random graphs are then combined. Figure 15(a) shows the dendrogram D of the graph G that can be seen in Fig. 15(b). The vertices {a, . . . , f} are the vertices taken from G. The non-leaf vertices of the tree indicate the probability of potential edges between the vertices of the left and the right subtree. Reading from left to right, the vertices e and f are less likely to be connected to a or b than, for example, d. Looking at the hierarchical graph in Fig. 15, one can see that closely related vertices (e.g., a and b) are very likely to be connected, while, moving up the tree, this probability decreases. This also corresponds to one’s intuition when looking at the original graph.

46 Hence the need to extend networks, just like ontologies, by predicting missing edges.

These kinds of clusters can be called assortative. But the HRG model is also capable of representing what the authors call disassortative structures, which indicate that the vertices are unlikely to be related directly, even though they are connected by a short geodesic path. The algorithm comes with some problems with regard to the task of this thesis. First, it ignores the different kinds of edges as they exist in ontologies; the different kinds of relations are expected to yield important information. It also ignores the directedness of the graph. Both are features that might be of great interest in the context of ontologies. Another problem that is not solely related to this algorithm but to most clustering algorithms is the treatment of clusters as disjoint sets: No intersection between these sets exists; that is, no vertex can be a member of more than one cluster. Even though the authors state that their approach of calculating HRGs for the network should be more efficient than doing calculations for every possible pair of vertices of a graph to check the possibility of them being connected,47 the approach of calculating random graphs that fit the network and combining these graphs is, when done on large networks such as semantic networks, highly computationally intensive. Both WordNet and DBpedia seem unfitting for this approach. WordNet’s structure, as will be shown in the following chapter, is different from that of many other networks and looks incompatible with the ideas underlying HRGs. Also, both ontologies are very large, too large for these computations to be finished in a reasonable time. In the appendix to Clauset et al. (2008), Clauset et al. state that the algorithm works well with networks of a few thousand vertices; WordNet and DBpedia have a few hundred thousand vertices. The authors admit that “equilibration could take exponential time in the worst case” (appendix to Clauset et al. (2008)).

4.3 Possible Machine Learning Algorithms

4.3.1 Machine Learning

From the great number of available machine learning algorithms, three main directions can be identified: classification, clustering, and association. All three shall be briefly introduced here. Afterwards, those algorithms that have either been used in similar tasks

47 The goal is to avoid checking every possible combination of vertices by restricting the possible pairs before doing such calculations. How this can be achieved will be shown in the following chapters.

(see Chp. 4.2, Chp. 5.4.2, and Chp. 6.3.1) or seem to fit the requirements of the task of this thesis will be explained in more detail.

Machine learning algorithms, be they clustering, classification, or association algorithms, work on instances. These instances represent events or entities that are thought of as manifesting some underlying patterns. An instance consists of a number of (meaningful) features, a so-called feature vector, that can be used to distinguish, relate, or compare instances. Machine learning algorithms can be either supervised or unsupervised. Supervised approaches are based on a labeled data set that has been (manually) classified or measured in some way. The set of instances is split into two subsets, one used to train the algorithm and one for testing, used to evaluate the learned model on previously unseen data. A special case of validation is x-fold cross-validation, where the whole set is split up into x subsets of the same size; in each iteration, (x−1)/x of the set is used for training and the remaining 1/x of the whole set for testing/evaluating. Unsupervised approaches are not based on a training set, but are meant to find meaningful distinctions or similarities between the instances of the data themselves, without a manual annotation.

Classification tasks are supervised and work on a predefined set of already classified instances. Often the classification is binary (e.g., classifying emails as either spam or not spam), but there can also be multiple classes; in digit recognition, for example, the elements have to be assigned one of the digits from 0 to 9. The instances have one class attribute. The goal of classification is to learn or identify non-obvious rules of attribute combinations that predict the class of each instance. The necessary rules can be obtained using different algorithms, from decision trees to support vector machines. New instances, i.e., data not previously seen in the training set, can then be classified using these rules.

Clustering algorithms are unsupervised and used to form (meaningful) subsets of instances of a data set. These clusters can be previously unknown. The task is to find patterns that make it very likely that different instances are somehow connected. Two different kinds of clustering algorithms are of relevance here: The first group is used to cluster networks; that is, it is used to identify densely connected communities within a graph. The second kind works on feature sets and can be thought of as clustering similar vectors. If one thinks of instances as vectors in a vector space, one can also calculate the distance between two vectors, using the Euclidean distance or other calculations like the cosine similarity. Those vectors with a small distance form a cluster. In the end, which algorithms can be used

is a question of how the data is represented: either as a network or as a set of vectors. Most data can be represented in both fashions.

Association techniques are unsupervised and most often used in marketing and sales contexts. To my knowledge, association rules were first described by Agrawal et al. (1993) and applied to sales data to show “an association between departments in the customer purchasing behavior” (Agrawal et al., 1993, p. 215). Association rules do not classify instances; they are, in a way, similar to clustering approaches, though: They do not look at the properties of single instances themselves, but they look for patterns in their co-occurrence. In marketing contexts, the idea is to find any kind of association rule between items that are, for example, often purchased together or by the same customer. This information can be used, for example, to build a recommendation system.

The most important measures of quality in evaluating machine learning models are the precision, the recall, and the accuracy. These measures are taken from information retrieval (IR) and can be applied most successfully to classification tasks. They are computed using the fractions of correctly identified, or classified, instances versus those that were not found.

Table 7: Evaluation in IR: true positives, false positives, true negatives, and false negatives.

                        Assigned class
                        false     true
Actual class   false    tn        fp
               true     fn        tp

The documents the system regards as relevant are called hits. In IR, the precision is the fraction of relevant hits divided by the number of total hits. As can be seen in Table 7, the relevant hits are called true positives (tp); the irrelevant hits are called false positives (fp). The documents the system correctly identifies as irrelevant are called true negatives (tn). Those documents that the system incorrectly regards as irrelevant are called false negatives (fn). Transferring this idea to a classification task, precision is the fraction of correctly

classified instances divided by the sum of correctly and incorrectly classified instances:

precision = tp / (tp + fp)   (10)

Recall in IR refers to the coverage: the number of correctly identified hits divided by the number of relevant documents (i.e., those that are correctly classified (tp) and those that are incorrectly regarded as irrelevant (fn)).

recall = tp / (tp + fn)   (11)

Most systems tend to have either a high precision but a low coverage (recall) or tend to have a high coverage with a low precision. To estimate how well the system is balanced, the so-called F measure is used:

F = 2 · (precision · recall) / (precision + recall)   (12)

The accuracy is the number of documents that are correctly classified (tp and tn) divided by the total number of documents (i.e., the sum of tp, tn, fp, and fn):

accuracy = (tp + tn) / (tp + tn + fp + fn)   (13)

In IR as well as in classification tasks, the goal is to find a system or model that provides both high precision and high recall. For a multiclass classification task, the calculation of precision, recall, F-measure, and accuracy has to be done for each class separately. The results can then be averaged.
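To make the four measures concrete, the following minimal Python sketch (not part of the thesis' experimental setup; the counts are invented for illustration) computes them directly from the four cells of Table 7:

    def evaluate(tp, fp, tn, fn):
        """Compute precision, recall, F measure, and accuracy from the confusion counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f_measure = 2 * precision * recall / (precision + recall)
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        return precision, recall, f_measure, accuracy

    # Example with invented counts: 40 true positives, 10 false positives,
    # 45 true negatives, and 5 false negatives.
    print(evaluate(tp=40, fp=10, tn=45, fn=5))
    # -> precision 0.8, recall 0.889, F measure 0.842, accuracy 0.85

In a multiclass setting, the same function can be called once per class and the results averaged, as described above.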

4.3.2 Machine Learning Algorithms

In the following, possible algorithms will be introduced before being tested on actual network data in the following chapters. The set of algorithms to test is based on two considerations: The first is whether they have been applied to similar tasks or in similar contexts in the past. The second is based on the idea of treating link prediction as a classification task (i.e., deciding whether or not two vertices are connected). First, supervised classification algorithms will be presented. Afterwards clustering algorithms, sometimes also called unsupervised classification, will be presented for both

vector and network data before looking at association rule mining.

4.3.2.1 Classification

Logistic Regression

Logistic regression is a supervised classification algorithm. Despite the misleading name, “it should be emphasized that this is a model for classification rather than regression” (Bishop, 2006, p. 205). The name is derived from the sigmoid function, also known as the logistic function. The logistic function is given by le Cessie and van Houwelingen (1992, p. 192) as follows:

p(X_i) = e^{X_i β} / (1 + e^{X_i β})   (14)

p(X_i) is the so-called hypothesis (i.e., the probability that instance X_i is of class Y_i = 1), where X is a matrix containing all instances and the corresponding features, and β is a vector containing the parameters that are to be used in the logistic regression. To calculate how well the parameters in β fit the data (i.e., the classes given by Y), the cost is calculated. The overall cost is basically the sum of the differences between the classes and the hypotheses. It can be calculated as follows:

cost = −∑_{i=1}^{m} [ Y_i log(p(X_i)) + (1 − Y_i) log(1 − p(X_i)) ]   (15)

The logistic regression further needs some algorithm to find the optimal values of β (i.e., those values of β that minimize the cost function given above). There are different algorithms that can be used to iterate over possible values of β (e.g., gradient descent). The result is a sum over the products of the parameters and the features and takes the form of a polynomial, for example β_1 X_{i1} + β_2 X_{i2} + β_3 X_{i1}^2 + β_4 X_{i2}^2. Each feature of the data set corresponds to one parameter of the polynomial expression. To predict a new instance, the instance's features and the parameters are used in the sigmoid function. The results lie between 0 and 1. This can be interpreted as either false for a value < 0.5 or true for a value ≥ 0.5.
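As a small illustration of equations (14) and (15), the following Python sketch (illustrative only, not the implementation used later in this thesis) fits β by plain gradient descent on the cost and then classifies instances by thresholding the sigmoid output at 0.5; the toy data and parameter values are invented:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, Y, learning_rate=0.5, iterations=5000):
        """Gradient descent on the cost of equation (15); X already contains a bias column."""
        beta = np.zeros(X.shape[1])
        for _ in range(iterations):
            p = sigmoid(X @ beta)               # hypothesis p(X_i) for every instance
            gradient = X.T @ (p - Y) / len(Y)   # derivative of the cost with respect to beta
            beta -= learning_rate * gradient
        return beta

    # Toy data: a bias column plus one feature; class 1 for large feature values.
    X = np.array([[1.0, 0.1], [1.0, 0.4], [1.0, 0.6], [1.0, 0.9]])
    Y = np.array([0.0, 0.0, 1.0, 1.0])
    beta = fit_logistic_regression(X, Y)
    print(sigmoid(X @ beta) >= 0.5)   # expected predictions: False, False, True, True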

Bayes Classifier

The Naive Bayes algorithm is a supervised algorithm and hence uses a set of instances that are already classified to a specific class. To use the Naive Bayes classifier, we need the feature vectors d_1 . . . d_n for the n instances, and the possible classes c_1 . . . c_n. Given the data set (i.e., the instances and their feature vectors), one can calculate the probability with which a certain value of a feature predicts the assigned class. John and Langley (1995, p. 340)48 give a simple example: Given a data set with a class attribute {true, false} and a nominal feature that is either {a, b}, containing the instances {(true, a), (true, b), (true, a), (false, b), (false, b)}, the probability of class true is P(true) = 3/5. The probability of the feature value a given class true is 2/3, i.e., P(a|true) = 2/3. The probability of the value b given class true is correspondingly P(b|true) = 1/3. These probabilities are treated as absolutely independent of any other existing features, which is why this algorithm is called naïve. The algorithm calculates the probability of an instance belonging to any of the possible classes given the feature vector d. Each feature value has its own probability of predicting any of the possible classes. These predictions are, based on the naïve assumption of independence, combined, and the instance is assigned the class with the highest probability. In a more general way, considering this information, Bayes' theorem is given (Aas and Eikvil, 1999, p. 13f.) as the probability of class c_j given the feature vector d:

P(c_j | d) = P(c_j) P(d | c_j) / P(d)   (16)

Since the probability P(d) is independent of the class, the formula can be reduced to the following:

P(c_j | d) = P(c_j) P(d | c_j)   (17)

The naïve idea “that all attributes of the examples are independent of each other given the context of the class . . . is clearly false in most real-world tasks” (McCallum and Nigam, 1998, p. 1). Nevertheless, accepting this naïve assumption, we can calculate the

48The original data set was slightly altered to fit the other examples used in this section.

probability of class c_j given the feature vector d as

P(c_j | d) = P(c_j) ∏_{i=1}^{M} P(d_i | c_j)   (18)

such that the overall probability of the class, P(c_j), is multiplied by the product of the probabilities of each feature d_i given class c_j.
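The toy data set quoted above can be used to illustrate equation (18). The following Python sketch (purely illustrative; the helper names are not from the thesis) estimates the prior and the per-feature probabilities from the five instances and scores a new instance with feature value a:

    from collections import Counter, defaultdict

    # The toy data set from John and Langley (1995): (class, feature value) pairs.
    data = [("true", "a"), ("true", "b"), ("true", "a"), ("false", "b"), ("false", "b")]

    class_counts = Counter(label for label, _ in data)
    feature_counts = defaultdict(Counter)
    for label, value in data:
        feature_counts[label][value] += 1

    def naive_bayes_score(label, value):
        """P(c_j) * P(d_i | c_j) for a single feature, as in equation (18)."""
        prior = class_counts[label] / len(data)                           # e.g. P(true) = 3/5
        likelihood = feature_counts[label][value] / class_counts[label]   # e.g. P(a|true) = 2/3
        return prior * likelihood

    # Score the feature value "a" for both classes; the higher score wins.
    print({c: naive_bayes_score(c, "a") for c in class_counts})
    # -> {'true': 0.4, 'false': 0.0}; the instance is assigned the class true

In practice, a smoothing term is usually added so that unseen feature values do not force a probability of zero.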

Support Vector Machines

Support vector machines (SVMs) are very widely applied in all different kinds of supervised classification tasks. They are claimed to be “the most robust and accurate methods among all well-known algorithms” (Wu et al., 2007, p. 10), and an SVM “requires only a dozen examples for training, and is insensitive to the number of dimensions” (Wu et al., 2007, p. 10). Looking at a two-class classification problem such as the question of whether or not two vertices of a network are likely to be connected, the algorithm treats the instances and their n features as vectors in an (n + x)-dimensional space.49 From the infinite number of possible hyperplanes that divide the space into two sets, the algorithm identifies the hyperplane with the largest margin to the two sets. The margin is defined as the shortest distance of any data point of each class to the hyperplane. To train an SVM, the SVM optimization problem has to be solved. Two parameters are used in the implementation of Chang and Lin (2011) that will be used later on: a margin parameter C and a (Gaussian) radial basis function kernel (RBF kernel) parameter γ. The implementation uses a grid search to determine the best combination of values for C and γ.
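For illustration, the grid search over C and γ described above can be sketched with scikit-learn's SVC, which is itself built on the LIBSVM library of Chang and Lin; the toy data and the parameter grid below are invented and not those used later in this thesis:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Toy two-class data standing in for "are these two vertices connected?".
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # Grid search over the margin parameter C and the RBF kernel parameter gamma.
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)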

Decision Trees

Without going into too great detail about the mathematical foundations of decision trees such as ID3 (Quinlan, 1986) or C4.5 (Quinlan, 1993), the basis of these supervised classification algorithms shall be described here. Starting from a root node, decision trees split up into different branches; at the end of these, the leaves indicate the classes.

49SVMs use mapping techniques to increase the number of dimensions.

At each split the algorithm takes into account the available attributes in the data set and chooses the attribute with the highest information gain (i.e., the feature that best distinguishes between the classes). For each value, or range of values, of the attribute, a new branch is built. This is repeated recursively until all remaining instances that satisfy the conditions defined along the path are of the same class. To classify an instance, the algorithm follows the tree top-down. The tree can be thought of as a series of if/else conditions: Attribute x of instance y satisfies some condition c, and the next condition is checked recursively until the class can be assigned. At each split it checks the defined feature and its value. Then it follows the path according to the value of the feature. The number of steps depends on the depth of the tree as well as on the instance. Some features might unambiguously split the data set into the classes if they reach certain values, but sometimes a number of different features have to be checked against the values defined in the model. This way the algorithm ends up at a leaf that indicates the class.
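The selection criterion can be made concrete with a small Python sketch of entropy and information gain; the feature names and toy instances below are invented for illustration and are not the features used in the later experiments:

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(instances, labels, feature):
        """Entropy reduction achieved by splitting on one nominal feature."""
        before = entropy(labels)
        after = 0.0
        for value in set(x[feature] for x in instances):
            subset = [l for x, l in zip(instances, labels) if x[feature] == value]
            after += len(subset) / len(labels) * entropy(subset)
        return before - after

    # Toy data: the hypothetical feature "degree" separates the classes, "colour" does not.
    instances = [{"degree": "high", "colour": "red"}, {"degree": "high", "colour": "blue"},
                 {"degree": "low", "colour": "red"}, {"degree": "low", "colour": "blue"}]
    labels = ["linked", "linked", "unlinked", "unlinked"]
    print(information_gain(instances, labels, "degree"))   # 1.0: chosen for the split
    print(information_gain(instances, labels, "colour"))   # 0.0: ignored at this node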

Random Forests

Following the definition of Breiman (2001, p. 6), a “random forest is a classifier consisting of a collection of tree-structured classifiers where . . . each tree casts a unit vote for the most popular class”. A random forest is a collection of random trees (hence forest). Their number is not fixed and can be changed according to the needs. The trees are random in the sense that at every node in the tree, n random features from the training set are chosen as candidates to split the tree on. In contrast to the decision trees seen above, the candidate features at each node are thus not all available attributes but a randomly drawn subset, and the split is chosen only among these candidates. The depth of the tree can be set and an accordingly large or small random tree is built. The random trees are then combined and used to vote on the class of any given input instance x. Due to the randomness, the number of trees and the number of random features selected at every split have a great influence on the performance, while a single feature might not have the strong influence it has on decision trees. A higher number of random features does not necessarily result in better classifications. In fact, Breiman (2001, p. 14) found that “using a single randomly chosen input variable to split on at each node could

produce good accuracy”. Unlike decision trees, which can be read and give explicit instructions on how to select the appropriate class, random-forest models are shrouded by their (random) trees: there can be a few hundred underlying trees. In the end, each instance is assigned the class that most of these trees voted for. Accordingly, the random-forest model outputs a likelihood for each class.
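A minimal sketch with scikit-learn's RandomForestClassifier (an off-the-shelf stand-in, not the implementation used in the following chapters) shows the two parameters discussed above, the number of trees and the number of features considered at each split, as well as the per-class likelihood output:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=12, random_state=1)

    # 200 random trees; at every split only a random subset of the features is considered.
    forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=1)
    forest.fit(X, y)

    # The model outputs a likelihood for each class (the fraction of trees voting for it).
    print(forest.predict_proba(X[:3]))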

Deep Learning and Neural Networks

Deep learning has become a buzzword in recent years, just like big data. Deep learning comprises the idea that, given enough computational resources and enough data, a system can learn in an unsupervised fashion. Often cited in this context is the joint venture of Google and members of Stanford University presented in Le et al. (2012), which has become widely popular due to the fact that the neural network employed was able not only to categorize pictures of faces into clusters but also to identify cat faces. All this was undertaken in an unsupervised fashion, using 16,000 cores in 1,000 machines during a period of three days. Underlying the idea of deep learning are artificial neural networks (ANNs). ANNs consist of artificial neurons (i.e., vertices in an artificial neural network). Given some input vertices (e.g., the features of the instances to be classified), a number of hidden layers perform calculations and transformations on the input and subsequently produce the output (e.g., the corresponding class of an instance). In each layer, a sigmoid activation function is used. It takes the output of the previous layer as input and, using different parameters at each layer (in ANN terminology called weights), performs relatively simple calculations. Combining these simple functions by adding extra layers can result in complex functions.50 Given are the input (the instances and their features) and the desired output (the class). Now one has to decide on the number of hidden layers, and one has to assign the weights or parameters in such a way that the output corresponds to the class. Again, there is a cost or error function that computes the difference between the classification made by the algorithm, computed via forward-propagation, and the classes given in the data set. To minimize the cost, back-propagation can be used. Back-propagation was first

50For example, given a hidden layer that computes the logical functions AND and NOR, with a third layer that computes the OR function, one gets an XNOR function.

described in Rumelhart et al. (1986). First the error value for the last layer, the output layer, is calculated. Using this value, the error of the previous layer is calculated, and so forth until the input layer is reached. In each layer, the weights are changed according to the error that is back-propagated. It is often stated (e.g., in Le et al. (2012)) that the architecture of artificial neural networks is similar to that of a human neural network, although “with around 10 million parameters . . . our network is still tiny compared to the human visual cortex, which is 10^6 times larger in terms of the number of neurons and synapses” (Le et al., 2012, p. 3). It has already been mentioned that such statements have to be interpreted with care.
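The layered use of the sigmoid function can be made concrete with a tiny forward-propagation sketch. The weights below are set by hand purely for illustration (in a real ANN they would be learned via back-propagation); the hidden layer computes AND and NOR, and the output layer combines them with OR, yielding the XNOR function mentioned in footnote 50:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x):
        """Two-layer forward propagation with hand-set weights."""
        h_and = sigmoid(20 * x[0] + 20 * x[1] - 30)    # close to 1 only if both inputs are 1
        h_nor = sigmoid(-20 * x[0] - 20 * x[1] + 10)   # close to 1 only if both inputs are 0
        return sigmoid(20 * h_and + 20 * h_nor - 10)   # OR of the two hidden units

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((a, b), round(float(forward((a, b)))))   # 1, 0, 0, 1: the XNOR function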

4.3.2.2 Clustering

In the following, vector-based and network-based approaches to clustering are presented. While the first kind treats instances as vectors and searches for clusters containing only instances with a small distance between each other, network approaches look for sets of densely connected vertices.51 Clustering is an unsupervised machine learning approach: Vectors, representing the instances, are compared and those that are most similar are clustered together.

k-Means

Given the number of clusters k and a data set with n data points, the algorithm has to find the centers for the k clusters in a way “to minimize . . . the sum of the squared distances between each point and its closest center” (Arthur and Vassilvitskii, 2007, p. 1027). Doing this exactly is an NP-hard52 problem (Drineas et al., 2004). Lloyd (1982) presents what today usually is referred to as the k-means algorithm. The k-means algorithm is a straightforward vector-space-based clustering algorithm. It clusters the set of vectors into k clusters. In a first step, k centroids are chosen randomly (i.e., k points in the vector space representing the k clusters). Then the algorithm iterates over two phases: First, all instances are assigned to their nearest cluster centroid. Then the center of the cluster is moved to the mean (i.e., the centroid is reassigned). These

51Of course one could transform each network to a vector and many vectors to networks. Working with numbers though, distance measures on vectors work better than a network representation of the same data. 52Non-deterministic polynomial-time hard problems are believed to be unsolvable in polynomial time.

steps are repeated until the algorithm converges (i.e., until the clusters no longer change during the iterations). As Arthur and Vassilvitskii (2007, p. 1027) state, “the empirical speed and simplicity of the k-means algorithm come at the price of accuracy.” In fact, the choice of the initial centroids is a weak point of the algorithm, since choosing the wrong initial centroids leads to less than optimal clusters. Usually the Euclidean distance is used to find the nearest cluster centroid. Other measures are possible. The cost function is then the minimum distance to the centroids, which means that assigning a vector to a cluster other than the nearest cluster increases the cost. To avoid problems arising from poorly chosen centroids in the first step, the process is often repeated several times using different initial centroids. Arthur and Vassilvitskii (2007) propose a modified version of the algorithm called k-means++. Instead of choosing k centroids purely at random in the first step, they only choose the first centroid at random from the whole set of data points. The remaining k − 1 centroids are subsequently chosen taking into account “the shortest distance from a data point x to the closest center we have already chosen” (Arthur and Vassilvitskii, 2007, p. 1029), such that the centroids are not near each other.
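A compact Python sketch of Lloyd's two alternating phases (assignment and centroid update) could look as follows; the two-dimensional toy points and the helper name k_means are purely illustrative:

    import numpy as np

    def k_means(points, k, iterations=100, seed=0):
        """Lloyd's algorithm: assign every point to its nearest centroid, then move
        each centroid to the mean of its cluster, until the centroids stop changing."""
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):
            distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            assignment = distances.argmin(axis=1)
            new_centroids = np.array([points[assignment == j].mean(axis=0)
                                      if np.any(assignment == j) else centroids[j]
                                      for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, assignment

    # Two well-separated blobs of two-dimensional feature vectors.
    rng = np.random.default_rng(1)
    points = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(3.0, 0.2, (20, 2))])
    centroids, assignment = k_means(points, k=2)
    print(centroids)

Libraries such as scikit-learn additionally offer the k-means++ seeding of Arthur and Vassilvitskii (2007), e.g., via KMeans(init="k-means++").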

Network Clustering: Based on Edge Betweenness

There is a very wide range of different algorithms to identify communities in networks (e.g., based on the edge betweenness (Newman and Girvan, 2004), using greedy optimization of modularity (Clauset et al., 2004), based on propagating labels (Raghavan et al., 2007), using eigenvectors (Newman, 2006), using multi-level optimization (Blondel et al., 2008), the maximal modularity score (Brandes et al., 2008), short random walks (Pons and Latapy, 2005), or the similarity of the neighborhood (Biemann, 2006)). Some have complexity that is linear in the number of edges or vertices; some are not that efficient. Here, only one commonly used approach shall be presented. Using the edge betweenness to identify communities in networks was proposed in Newman and Girvan (2004). The betweenness measure, a “measure that favors edges that lie between communities and disfavors those that lie inside communities” (Newman and Girvan, 2004, p. 3), has been discussed before (see Chp. 2.1.4). Here the edge betweenness is calculated (i.e., those edges are identified that lie on the largest number of shortest paths between the vertices of the network).

The assumption underlying this approach is that the members of a cluster are densely connected among each other but not to members of other clusters. Given different clusters that are connected by only a few edges, these edges can be expected to have a very high betweenness, since they connect all the vertices from one cluster to those of the other clusters. This is because “all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges” (Newman and Girvan, 2004, p. 3). The algorithm works by first calculating the betweenness for each edge. Then the edge with the highest betweenness is removed. Afterwards the betweenness is recalculated to again find those remaining edges with the highest betweenness. These values can change between the steps. These steps are repeated and can be stopped at any given moment. The components of the network resulting from removing the edges can be considered the original network's communities or clusters. Because of the recalculation of the betweenness values in every iteration, the run time of this algorithm is quite long for larger networks.
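The procedure can be tried out with the girvan_newman implementation in the NetworkX library; the toy graph below (two cliques joined by a single bridge edge of high betweenness) is invented for illustration:

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    # Two dense five-vertex cliques joined by a single bridge edge.
    G = nx.barbell_graph(5, 0)

    # Each iteration removes the edge with the highest (recomputed) betweenness;
    # the first split already separates the two communities.
    communities = next(girvan_newman(G))
    print([sorted(c) for c in communities])   # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

As noted above, the repeated recomputation of the betweenness makes this approach slow on very large graphs.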

4.3.2.3 Association Rules

Apriori Algorithm

The Apriori algorithm is unsupervised and was developed for marketing tasks. Hence the instances are called item sets. The goal is to mine rules from a large set of shopping baskets containing different items to find items that are regularly combined and purchased together. Although these techniques were developed 20 years ago, it was only online shopping sites with millions of data sets that made them widely popular and known to the general public. The algorithm itself is quite simple and easy to implement. It first counts the occurrences of each set of co-occurring items of size n = 1 and compares the counts to a predefined value called minimum support (i.e., the minimum occurrence of the item). Afterwards the algorithm increases n by one and iterates over the set until no new sets that satisfy the minimum support value can be found. The second important value is the confidence value of a rule. Having found all large item sets (i.e., all sets with at least minimum support), the system can formulate rules that predict what items are usually bought together. Agrawal et al. (1993, p. 487) give the example that people who “purchase tires and auto accessories also get automotive

services done”. Such a rule has to meet a predefined confidence threshold: the confidence is the fraction of cases in the data set in which the rule makes a correct prediction. If a data set contains tire purchases and auto accessories in four item sets, but only two of these contain automotive services as well, the confidence of this rule is only 0.5. While the minimum support defines the item sets that are used to derive rules from, the confidence defines how accurate these rules have to be in the data set. If the minimum support is very low, many rules have to be considered. These rules might even reach the necessary confidence. For example, if an item was only purchased once, the confidence that this item predicts the purchase of the other items in the item set would be 1. This is why it is important to set both values to a reasonably high, but not too high, value. The fact that the algorithm only works with nominal data is a restriction. Each feature is identified by a name and counted. Numerical values are not recognized or treated as such.
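The support and confidence bookkeeping can be illustrated with a few lines of Python on a toy basket set mirroring the tires example above; this is only the counting step, not a full Apriori implementation, and the thresholds are invented:

    from itertools import combinations
    from collections import Counter

    baskets = [{"tires", "auto accessories", "automotive services"},
               {"tires", "auto accessories", "automotive services"},
               {"tires", "auto accessories"},
               {"tires", "auto accessories"},
               {"beer"}]

    min_support = 3   # an item set must occur at least three times to be considered

    # Count every item set of size 1 and 2 and keep those reaching the minimum support.
    counts = Counter(frozenset(c) for basket in baskets
                     for size in (1, 2) for c in combinations(sorted(basket), size))
    frequent = {s: n for s, n in counts.items() if n >= min_support}

    # Confidence of the rule {tires, auto accessories} -> {automotive services}.
    antecedent = frozenset({"tires", "auto accessories"})
    both = sum(1 for b in baskets if antecedent | {"automotive services"} <= b)
    print(both / frequent[antecedent])   # 0.5: the rule holds in half of the supporting baskets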

4.4 Conclusion

One common problem that arises when working with both networks and ontologies is the sparsity of data. In both cases, the creation of new data is very time-consuming. Therefore, since the rise of electronic ontologies such as WordNet, a number of researchers have tried to extract structured information, entities, and their relations, from unstructured sources, especially free texts. Exploiting the structural relations in human language, given by either syntax or hierarchical text structures, it was shown how ontologies can be extended by identifying patterns in texts (or syntactic structures) that describe a certain relation between two entities. Having identified a new relation between two entities of an ontology, this relation can be added to the ontology's ABox, or the data set. Problems arise mainly from two points: For one, when using WordNet or other semantic networks with a high degree of homonymy, the problem of word sense disambiguation, that is, of identifying the right word sense in the dictionary, has to be solved. The second problem is related to the relation itself: For WordNet it was found that only hypernymy/hyponymy relations could be identified with a high accuracy based on natural language texts. For other ontologies the problem of mapping the relation given by the texts to the set of terms defined by the ontology arises: One has to find different patterns for every relation given in the ontology to reasonably extend an existing ontology. The pattern-based approach

therefore has its limitations, especially when working with ontologies and even more so when working with WordNet. While all these approaches rely on external sources to extend ontologies and the networks they expose, findings in the work with social networks indicate that the structure of the network itself (e.g., the neighborhood of the vertices in question) can give approximations as to which vertices are likely to be connected or not. Some of the approaches introduced in this chapter are based on human observations, while a great number of approaches apply different machine learning algorithms. These algorithms, plus some others that have not yet been used but that seem to be promising or are heavily used in other neighboring fields, have been introduced here and will be tested on the WordNet or DBpedia ontologies in the following. From the comparison of social-network-based and non-social-network-based approaches in this chapter, we can conclude that the best method of solving the problem of identifying missing relations can only be chosen after the structure of the network, of the ontology, has been analyzed. For example, while in social networks some behaviors of human beings can be employed to predict connections between them (e.g., homophily, the fact that friends tend to be similar to each other), these do not carry over to other complex networks, such as gene networks, or, as will be shown shortly in more detail, ontologies like WordNet. A deep understanding of the source the data was taken from or arises from is just as important as an understanding of the structure of the data itself in order to choose correct features to identify missing relations. The structure is thought of as not only providing the missing information but also as indicating how this information might be contained. What this means in detail will be clarified in the following chapters with regard to WordNet and DBpedia. As has been pointed out by Clauset et al. (2008), checking every vertex of a large network against every other vertex might be very uneconomical. Looking at the problems that are to be solved regarding WordNet and DBpedia, and at the ontologies themselves, ways have to be described to circumvent this problem. Helpful with this problem might be a special property of ontologies: Relations are restricted as to what classes of vertices they can connect in the first place. This is called a relation's range or domain restriction. The restrictions can forbid some vertices to be connected by certain relations or might even forbid two vertices to be connected at all. Furthermore, it is expected that the structure of the networks will show densely connected areas, called clusters, where new relations might be probable, but it might also show where connections might be missing. Based on

a deep analysis of the networks exposed by the ontologies, one should be able to determine what kind of machine learning is the most promising approach to solving the problem, as well as what kinds of features of the ontology and networks should be the most useful ones.

5 WordNet: Analyses and Predictions

In the following, the structure of the WordNet ontology, and of the corresponding graph, will be examined. This will show what kinds of relations exist in WordNet and how these relations are used in the subsets of WordNet corresponding to the POS nouns, verbs, adjectives, and adverbs. This will also disclose some shortcomings of the WordNet data. WordNet is widely used in many NLP tasks. Since it contains accurate and detailed information on English nouns, verbs, adjectives, and adverbs, it can be used to calculate similarity between words, find related words, and group words, among other things. But WordNet is based on word senses rather than words or lemmas. The relation between most word senses sharing a common lemma remains unclear in WordNet. A further difficulty is the very fine sense distinction that is not reachable with today's NLP methods. Stokoe (2005) undertakes a study on the impact of word sense disambiguation, especially in cases of homonymy and polysemy, for information retrieval (IR) methods. The author concludes that “making fine-grained sense distinctions can offer increased retrieval effectiveness” (Stokoe, 2005, p. 409). Lacking other resources, WordNet is used as a basis for distinguishing homonymy from polysemy. As Biemann (2012) shows, the fine-grained sense distinction of WordNet is a hindrance in word sense disambiguation (i.e., the task of assigning a word in context a given sense in WordNet). Many closely related senses are not distinguishable, and therefore the accuracy is limited to an upper bound of 80%. While others have tried to group closely related word senses to reduce the number of senses generally (e.g., Snow et al., 2007), only word senses sharing a common lemma are to be considered in this thesis. Hence, the approach tries to distinguish between word senses that share a common lemma by chance (homonymy) and those that share a common lemma because of an underlying semantic relation (polysemy) based on some kind of cognitive process that is to be discussed further in the following. The general idea is that it should be possible to exploit the network structure to either identify missing links or to add new relations to the ontology, and the relation between semantically related, polysemous word senses is such a missing relation. In Chp. 5.1, polysemy and homonymy will be examined more closely. Chapter 5.2 will show the WordNet architecture in more detail. In Chp. 5.3, a network analysis will be undertaken on the WordNet data, treating the word forms and word senses and their interrelations as a graph. Analyzing the ontology structure and the network will give hints as to what measures or features are promising to be used in a machine learning

task. In Chp. 5.4, the state of the art in identifying polysemy in WordNet will be presented, before a new approach is presented that makes use of the network structure and machine learning. In Chp. 5.5, these features are used to classify data and the results are evaluated and compared to other possible approaches.

5.1 Ambiguity: Polysemy and Homonymy

While all lexical units are possibly vague or ambiguous, polysemy is a special case of semantic ambiguity, and it is common in natural languages. It is often found in jokes, punch lines, and metaphorical and metonymical uses of language. In polysemy, one lexeme has several related meanings (Lyons, 1995, p. 58). Croft and Cruse (2004, p. 111) define polysemic units as “derived from the same lexical source, being the result of processes of extension such as metaphor and metonymy”. Homonymy, from Greek meaning having the same name, describes the fact that different objects are referred to by the same word form. We speak of a true homonym when the same string of characters (homograph) and the same sound (homophone) refer to different, unrelated things. In cases of polysemy, the different senses of homograph/homophone words are related. Polysemy is an interesting aspect of human language, because it shows human creativity and also allows a glimpse at how the evolution of languages works. Furthermore, as will be explained later on, knowing which words are related instead of just arbitrarily being homographs might improve word sense disambiguation and question-answering tasks. In the following, only the written forms (i.e., homographs) are considered, since WordNet does not offer information on pronunciations.

5.1.1 Homonymy

At first sight, homonymy seems like a very straightforward matter. Homonymy in a broad sense refers to two or more words that have the same graphical representation and the same pronunciation. Since we are only looking at written words, homophony is not relevant, since no information on the pronunciation is available. The cases that are of interest here are, actually, homographs (i.e., words that share the same orthography). Words

like lead (/li:d/), meaning a position of leadership, and lead (/lEd/), Latin plumbum, a heavy metal, are homographs but not homophones. Still, in computational processing based on texts, not spoken language, these have to be treated as homonyms as well. Working only with the written form of language here, we can call two words homonymous if they are written identically but have no semantic or etymological relation. The biggest problem of this definition is the last part, the etymology. There are cases where the semantic relation between two words that once might have existed is not known or not obvious to the users of today's form of a language anymore. Due to semantic shift, the meaning of a word may have changed, and the common meaning that two once polysemous words shared might be lost.

A very common example is that of bank1, meaning a financial institution, and that of bank2, meaning the slope of a river or water. The first is derived from the Italian banco, while the etymology of the second can be traced back to a Germanic origin. The same example shows how problematic this distinction can be. The Romance word Italian/Spanish banco, French banque, and thereby bank2, originally derive from the same Germanic word53 meaning bench, counter, hillock. Still, the two lexemes bank1 and bank2 should be considered homonymous. They are, as Lyons (1995, p. 28) puts it, “semantically unrelated: there is assumed to be no connexion – more precisely, no synchronically discernible connexion”. The easiest way to define homonymy, and also a very fruitful one in a computational approach,54 is to ignore the etymological roots of two homonymous words. The etymology is not relevant for the understanding of a word's sense and mapping this sense to other, related meanings.

5.1.2 Polysemy

Polysemy is regarded as either regular polysemy55 or irregular polysemy. First, regular polysemy will be defined, before taking a closer look at those cases that do not match this definition and are hence to be treated as irregular polysemy.

53Bank meaning bench still exists in German in this form. Other Germanic languages have similar forms. The English word bench shows signs of the regular phonological change of English. 54When looking at homonymous words, the important question for an NLP task is whether two words are semantically related. Polysemous words are, but homonymous words, even if they might share some common ancestor or a sense lost over time, are not. In natural language understanding, this distinction is important. 55The terms systematic or productive polysemy can also be used synonymously.

Apresjan (1974, p. 16) defines regular polysemy as follows:

Polysemy of the word A with the meanings a_i and a_j is called regular if, in the given language, there exists at least one other word B with the meanings b_i and b_j, which are semantically distinguished from each other in exactly the same way as a_i and a_j and if a_i and b_i, a_j and b_j are nonsynonymous.

What Apresjan calls regular polysemy is a very common phenomenon, and the process of transferring the meaning or a sense of one word to another follows some rules. The word olive for example has the following senses:56

1. olive – small ovoid fruit of the European olive tree; important food and source of oil

2. olive, European olive tree, Olea europaea – evergreen tree cultivated in the Mediterranean region since antiquity and now elsewhere; has edible shiny black fruits

3. olive – hard yellow often variegated wood of an olive tree; used in cabinetwork

4. olive – one-seeded fruit of the European olive tree usually pickled and used as a relish

5. olive – a yellow-green color of low brightness and saturation

One can easily see that the senses all bear a common feature: their relation to the olive tree (sense 2).57 The wood taken from this tree is also called olive (sense 3) and so are the fruits (sense 4), the food made from the fruits (sense 1) and the color of the fruit (sense 5). One could say an olive (tree) has olive (wood) and olives (fruit) (cf. meronymy). The relation between the senses seems to be of metonymic nature: Either a part stands for the whole (pars pro toto), or the name of the whole is used to refer to its parts. If we now look at other trees with fruits (e.g., a cherry) we find a very similar pattern. Cherry can refer to a tree, wood,58 fruit, food, and color. The same is true for other trees as well. A more general process is that of the relation between fruit and food. The food is often called the same as the fruit. The same is true in many cases of the naming of food

56Taken from WordNet. 57One could just as well argue that all of them are related to the fruit. Still, using the tree as a common reference point makes this approach consistent with non-fruit-bearing trees. 58WordNet 3.1 is missing this sense of cherry.

gained from animals. For example, chicken can refer to the bird and the meat.59 This case of regularity (i.e., the metonymic usage of the animal name for products gained from the animal) is called grinding. According to frame semantics, the interpretation of a polysemous word has to be done by taking into account both profile and base of the word. Katz and Fodor (1963) give the following prominent example of the polysemous word bachelor:

1. a young knight serving under the standard of another knight

2. one who possesses the first and lowest academic degree

3. a man who has never married

4. a young fur seal when without a mate during the breeding time.

Fillmore argues that the word was borrowed from one frame, the one expressed in sense no. 3, based on analogy and then transferred to a new domain, the one expressed in sense no. 4, instead of claiming that the sense of bachelor was extended from the range of humans to animals (cf. Fillmore, 1977, p. 68). Pustejovsky formalizes the vague notion of analogy found in polysemous words: He states that “every [word] exhibits some degree of ambiguity, what I call logical polysemy” (Pustejovsky, 1993, p. 73). This view leads to the conclusion that the meaning of a word cannot be named at all but can only be concluded from the context. While the classical lexicographic approach to semantics, the sense enumeration as Pustejovsky calls it, is suitable for (hypothetical, non-natural) monomorphic languages (i.e., “[a] language where lexical items and complex phrases are provided a single type and denotation” (Pustejovsky, 1998, p. 56))60, Pustejovsky finds speaking of background knowledge suitable for meaning shifts in unrestricted polymorphic languages, where “there is nothing inherent in the language that constrains the meaning of the words in context” (Pustejovsky, 1998, ibid.). Weakly polymorphic languages lie in between the two extremes; here, “semantic operations of lexically determined type changing (e.g., type coercions) operate under well-defined constraints” (Pustejovsky, 1998, p. 57).

59In English, many words used to refer to food gained from animals, such as beef, pork, or venison, are Anglo-Norman loanwords and hence of Romance origin. Still, the process of deriving the term for meat from the animal is still working as examples such as chicken, pheasant, or lamb show. Furthermore, the Anglo-Norman words themselves are derived using grinding in old northern French dialects. 60Such as programming languages or logic-based languages.

Pustejovsky's (1998) generative lexicon overcomes some flaws of traditional sense enumeration lexicons, where every entry stands for a sense. Sense enumerations not only have to provide an entry for every possible or thinkable use of a word, but the entries are also not flexible and not open to creative usage of the word: The “numbers of, and distinctions between, senses within an entry are ‘frozen’ ” (Pustejovsky, 1993, p. 73). Furthermore, these entries do not account for the connection and relatedness of the meanings. Polysemous words (i.e., their meanings) share one entry in the lexicon. But this does not offer insight into the processes that lie behind the meaning shift that is undertaken to use a word in a different context (Pustejovsky, 1998, p. 47). The core element of the generative lexicon is the qualia structure. It “tells us . . . [that] a concept is the set of semantic constraints by which we understand a word when embedded within the language” (Pustejovsky, 1998, p. 86) and hence supplies “the structural template over which semantic transformations may apply to alter the denotation of a lexical item or phrase” (Pustejovsky, 1998, p. 86). Pustejovsky (1998, p. 45) gives the following example of different senses of the word want:

(18) Mary wants another cigarette.
(19) Bill wants a beer.
(20) Mary wants a job.

Want has to be interpreted differently in all three examples. Words like want and others are not strictly polysemous or ambiguous; Pustejovsky (1998, p. 87) calls them “simply underspecified, with respect to the particular activity being performed”. Example 18 can be paraphrased as Mary wants to smoke another cigarette, while Example 19 would be Bill wants to drink a beer, and Example 20 can be something like Mary wants to find a job or something similar.61 The actual type found in the exemplary sentences is that of an NP or, in the form of semantic types, that of ⟨⟨e, t⟩, t⟩. Now it is assumed that want expects an argument of type t, i.e., a phrase like to drink a beer. This can be seen in the interpretations given above. This is a case of type coercion, where a verb coerces its expected type onto its argument. The argument places have to be filled to avoid a type error. To fill the argument place, the qualia structure of the noun is used. The qualia structure consists of four different roles: constitutive, formal, telic, and agentive. The formal level indicates what an object is (e.g., its orientation, shape, dimensionality, or color). The

61The interpretation as wants to have a job would not help much, since have can be used as a light or underspecified verb as well.

information given here distinguishes the object from others in a broader domain. The constitutive role names an object's material, weight, or parts and components. The telic role gives the purpose or function, and the agentive role the cause of an object/event. In the attribute value matrix below (Pustejovsky, 1998, p. 100), an excerpt of the features of the noun beer is shown. The telic role drink(e,y,x) gives the necessary interpretation to the sentence in Example 19 (i.e., that of want to drink x). The telic role indicates the purpose of the noun (i.e., being drunk).

beer
    ARGSTR = [ ARG1 = x : liquid ]
    QUALIA = [ FORMAL = x
               TELIC  = drink(e, y, x) ]

An example of this behavior can be seen in some adjectives whose meaning is highly dependent on the context. Examples 21–23 show the adjective sad in different contexts. One sees different, yet related, meanings that all derive from one common, more abstract meaning.

(21) a sad poem
(22) a sad poet
(23) a sad day

Example 21 refers to a poem that is about something unhappy or that is doleful or distressing. In any case, it is not the same as in Ex. 22, where a person is in pain or sorrow. A day cannot be sad in this sense; it does not experience these kinds of feelings. Instead, it is a day when something that is saddening has happened or a day causing someone to be sad. Pustejovsky (1991) argues that adjectives like sad operate on the head's qualia structure and therefore modify it in different ways. Still, the entry for sad in the generative lexicon only exists once and does not need to be changed. Therefore, Copestake and Briscoe (1996) treat “this type of polysemy as a question of selecting the appropriate aspect of meaning of the complement, rather than a change in the meaning of the NP itself” (Copestake and Briscoe, 1996, p. 32). They further argue that these kinds of words are vague and not ambiguous.

Copestake and Briscoe (1996) subdivide the field of regular polysemy further into constructional polysemy and sense extension. Constructional polysemy is used to describe cases in which a “lexical item is assigned one (often more abstract) sense and processes of syntagmatic combination or ‘composition’ . . . are utilized to specialize this sense appropriately”. An example can be the lexical unit reel, which is assigned the meaning of container and is then disambiguated according to context or syntagmatic relations to refer to a reel of film or a fishing reel62 (cf. Copestake and Briscoe, 1996, p. 31). The second group of regular polysemous words, following Copestake and Briscoe (1996), is characterized by sense extension. This second group also includes the cases of grinding where the animal's name can be used for a product made from it (e.g., meat or fur/skin). For grinding they propose a rule that transforms one feature structure into another, from animal to meat, as shown in Lexical Rule 1 (cf. Copestake and Briscoe, 1996, p. 41), where c_subst stands for comestible substance.

Lexical Rule 1 (grinding). ⟨0 QUALIA⟩ = animal ⟹ ⟨1 QUALIA⟩ = c_subst

While these are metonymic (i.e., pars pro toto) sense extensions, there also exists a metaphorical sense extension. These include calling people by the name of an animal and thereby transferring some, not predictable, connotation of that animal to the person in question. For example, the epithet Lion, which was, for centuries, attributed to many kings, can refer to many positive attributes of lions: their strength, their courage, or the notion of the lion as the ruler of the animal kingdom63. These examples are a metaphorical word usage (i.e., they are comparisons without the use of a word of comparison). Calling someone a lion means that the person is as strong as a lion or has courage like a lion. Such “associations . . . would not be encoded in the qualia structure” (Copestake and Briscoe, 1996, p. 37). We will shortly come back to similar cases of irregular polysemy. Nunberg describes “productive linguistic processes that enable us to use the same expression to refer to what are intuitively distinct sorts of categories of things” (Nunberg, 1996, p. 109), which he calls predicate transfer and defines in more detail as “[t]he principle

62Furthermore, container nouns can be used to refer to their content. 63The Frankish historian Fredegar referred to Clovis I as a lion and interpreted this in terms of his courage (Jaeger et al., 1996, p. 22).

. . . that the name of a property that applies to something in one domain can sometimes be used as the name of a property that applies to things in another domain, provided the two properties correspond in a certain way” (Nunberg, 1996, p. 111). In a more formal way, he defines the condition on predicate transfer as

Definition 1. “Let 𝒫 and 𝒫′ be sets of properties that are related by a salient transfer function g_t: 𝒫 → 𝒫′. Then if F is a predicate that denotes a property P ∈ 𝒫, there is also a predicate F′, spelt like F, which denotes the property P′, where P′ = g_t(P).” (Nunberg, 1996, p. 112)

In the example to be shown below (24), the correspondence between P and P′ can be described using a salient function h from a set A, where F denotes a property P ∈ A, to a disjoint set B, so that F′ denotes

λP.λy(∀x ∈ dom h . h(x) = y → P(x))   (19)

or

λP.λy(∃x ∈ dom h . h(x) = y ∧ P(x))   (20)

representing the universal and the existential interpretation of the utterance. Nunberg (1996, p. 115) gives a straightforward example of a metonymic relation that can be explained using the functions above:

(24) Who is the ham sandwich?

This is from the context of a restaurant. Here, at least from the point of view of the service personnel, “customers acquire their most usefully distinctive properties in virtue of their relations to the dishes they order” (Nunberg, 1996, p. 115). Without going too deep into detail, one can say that the predicate F is taken from the set P, which roughly encompasses the menu of the establishment, and that F′ refers to an object of P′, being the customers, and where the common property or relation is given by the order. The example given by Nunberg is interesting, yet for the direction taken in this thesis not of too much importance, since only lexicalized words, and then only those cases of polysemy that can be found in WordNet, are of interest. Still, the formalism given in Definition 1 gives an interesting explanation of processes that take place in the creation of polysemy. In my eyes it is not only true for regular cases, such as those similar to Example 24, but also for the irregular ones looked at in the following.

Before looking at examples of irregular cases of polysemy, one should recall what has been said about the application of animal names to humans: that the characteristics “cannot be predicted from knowledge of the animal sense” and that “properties ascribed to a person . . . are stereotypical associations with the animal, which would not be encoded in the qualia structure” (Copestake and Briscoe, 1996, p. 37). Two things are noteworthy in this statement: For one, the non-predictability64 of the property transferred from one usage of a word to the other seems to be evidence that we are not dealing with regular polysemy here. The case of grinding is not affected by this observation. Grinding is a regular process, even though it might be used to refer to different products derived from an animal. But these products, even if not necessarily predictable, are not open to all properties of the given animal. This is different with irregular polysemy: When someone is called a pig, for example, this is likely to refer to the cleanliness or tidiness of that person, or more precisely the lack of it, or to the person's body size. Still, such metaphorical usage of words can also be constructed ad hoc depending on context. Copestake and Briscoe simply call this “stereotypical associations with the animal”, which is also the reason why it would not be found in the qualia structure. Even though the characteristics that are transferred “cannot be predicted from knowledge of the animal sense”, they are taken, nonetheless, from the world knowledge about that animal.65 Taking into account what has been said in Definition 1, the properties of one thing in one domain can be used to refer to something else by the same name in a different domain. Given the relations shown in equations 19 and 20, we can see that this definition can indeed be used to also describe processes of irregular polysemy. The case of irregular polysemy is one that seems much harder to process automatically by a computer than regular cases, where explicit rules can be stated. Often a single nuance of the connotation or world knowledge about an entity is transferred to a new situation. A look at the word senses of chicken, quoted from WordNet, shows the problem:

1. chicken, poulet, volaille – the flesh of a chicken used for food

64From a computational point of view, there are no clear rules, so the program can make no a priori assumptions about the meaning transfer. 65Of course there is also distributional evidence of these associations: Looking at a large corpus of English text, such as the one given in the Wortschatz Leipzig project (http://corpora.informatik.uni-leipzig.de/), a very frequent co-occurrence of pig is indeed fat. In other words, there is more than just stereotypes: There is structural evidence in the usage of the words, which could be taken into account.

2. chicken, Gallus gallus – a domestic fowl bred for flesh or eggs; believed to have been developed from the red jungle fowl

3. wimp, chicken, crybaby – a person who lacks confidence, is irresolute and wishy-washy

4. chicken – a foolhardy competition; a dangerous activity that is continued until one competitor becomes afraid and stops.

Senses 1 and 2 follow the pattern of grinding. But senses 3 and 4 are different. A chicken, as in sense 2, is considered easily scared. From this attribute, which is not a lexical property of chicken (sense 2) but an attribute taken from world knowledge, sense 3 is derived. When calling someone a chicken, this does not mean he or she has feathers, lays eggs, or anything like that; only the attribute easily scared is transferred to the new context: The person is easily scared like a chicken. Sense 4 is derived from sense 3: Two competitors play against each other until one becomes the chicken (sense 3). The processes that take place in the above-described cases of polysemy require a certain amount of world knowledge.66 A person living on an isolated island who has never seen a chicken, and therefore is not accustomed to its characteristics, could not understand why a wimp may be called a chicken. Copestake and Briscoe (1996, p. 28f.) propose rules, like the one in Lexical Rule 1, for this kind of meaning transfer as well. Their rules allow names of animals to be used for referents other than animals, e.g., humans, but they do not, however, formalize an abstraction of meaning such that the underlying meaning transfer from the animal to the denoted human would be made explicit. This is of course a problem for a computer automatically processing language. The connection that exists between senses 3 and 4 of chicken on the one hand and sense 2 of chicken on the other hand is not directly evident in the lexical material or in the network of WordNet. The computer, so to say, lacks the necessary general knowledge. Also, the metaphorical usage of words bears connotations. While lion is usually a positive byname, chicken and pig are not. This is not evident to a computer, and the underlying meaning transfer is not directly observable or deducible from a resource like WordNet.

66Admittedly to a degree that is not available in WordNet directly.

5.2 Architecture of WordNet

In WordNet, a word is defined as “a conventional association between a lexicalized concept and an utterance that plays a syntactic role” (Miller et al., 1990a, p. 3). The term word normally refers to both the meaning and the realization. Therefore, word form will be used to refer to the physical realization of the word, and word meaning or word sense will be used to refer to the concept. This division of words into form and meaning raises the question of how these two concepts are represented and connected. Synonymy is the most basic lexical relation between word forms in WordNet. Synonymy is described by a set of word forms that constitute the word sense, so-called synonymy sets (synsets). Synsets contain word forms that are synonymous.67 While real synonymy is usually defined as two words that can be substituted in every linguistic situation and where this substitution never changes the truth value (true or false) of an expression, WordNet defines synonymy as follows:

two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value (Miller et al., 1990a, p. 6).

Synsets are formed by a set of word forms that in a defined context C can be substituted for each other. These word forms are therefore, in terms of C, synonymous. A synset does not, however, describe the meaning of the set of word forms. The meaning is formed by the synset's members and the relations the synset has, the synset's neighbors. In Fig. 16(a), a prototypical synset is shown: The synset is made up of word forms that together constitute the meaning. Figure 16(b) gives an example from WordNet: the synset bank as in a container for keeping money at home. As one can see, the word form bank is used to refer to this word sense, and only in combination with other word forms like money box can one understand the meaning of bank in this example compared to other synsets that contain bank word forms but have a different meaning.68

67English is quite rich in synonyms. A possible explanation is the large number of loanwords from the Nordic languages (e.g., meat, derived from the Nordic word for food, replaces the Anglo-Saxon flesh), as well as loanwords that were used by different social classes after the Normans conquered England (again, many Norman words replaced the Old English animal names in the context of food), and of course the expansion of English over huge parts of the world resulting in the development of synonyms (e.g., autumn/fall; although this pair can be treated as synonymous, there are still diatopic differences). 68This dependence on the neighbors of a word to define the meaning has already been shown in the chapters before for lexical fields, in what Trier called the mosaic, but also for other approaches like frame theory, and it is also a basic assumption in medical studies.

(a) Structure of a synset, formed by word forms. (b) Structure of the synset bank, a container for keeping money at home.

Figure 16: Word senses, synsets, and word forms in WordNet.

Synonymy (S) is a symmetric relation,69 meaning that it is equal to its inverse relation (S ≡ S^{-1}). If a and b are synonymous (i.e., aSb), the inverse also holds (i.e., bSa). This implies that aSb ⟹ bSa. Furthermore, and this is true at least for true or full synonymy as well as for the actual formalization of synonymy in WordNet, but may not always be true for all uses of synonymy in linguistics, it is transitive: aSb ∧ bSc ⟹ aSc. Also, one could say that synonymy is reflexive: aSa.70 If a relation is symmetric, transitive, and reflexive, it is an equivalence relation: a ∼ b or simply a = b.71 These conditions are actually the mathematical explanation for the clustering of the WordNet graph. Every synset contains equivalent elements. There are no elements that belong to more than one synset. In other words, the intersection of any two distinct synsets s_1 and s_2 of a graph G is empty; that is, they are disjoint: ∀ s_1, s_2 ∈ G, s_1 ≠ s_2 : s_1 ∩ s_2 = ∅. Following the definition of synonymy given above, WordNet distinguishes different parts of speech (i.e., nouns, verbs, adjectives, and adverbs). Since an adjective cannot replace a noun without changing the truth value of the expression, the relation of synonymy is always restricted to elements of the same POS. The synonymy relation clusters, or partitions, all vertices of the network into synsets.

69This is also briefly mentioned in Miller et al. (1990a, p. 7). The two following properties, transitivity and reflexivity, are not mentioned there, nor are the implications these properties have for the network. 70By definition, two expressions are synonymous if the replacement of one for the other does not change the truth value of the expression in a context C. This is trivially the case if one replaces a word with itself. 71This is at least true for fully synonymous words in natural language, but also for synsets in WordNet, with the restriction that the word forms making up a synset are not necessarily equivalent to each other in terms of lexical relations, such as antonymy.

These synsets represent the meaning. Their parts make up the synset and define the meaning, while at the same time the synset is the entity that contains the meaning and defines the meaning of its parts. This is reminiscent of Trier’s statement that the whole receives its meaning from its parts. WordNet is not one fully connected component. It actually consists of four components, or subnetworks – one each for nouns, adjectives, adverbs, and verbs – which are only loosely connected by morphosemantic relations between related word forms. In the following, the different POS contained in WordNet will be discussed and their interconnections will be shown. WordNet distinguishes between lexical relations (i.e., relations between the lexical units or word forms) and semantic relations (i.e., relations between word meanings or synsets).

5.2.1 Nouns

Synonymy, as defined above, is one of the main structuring features of nouns in WordNet. The lexemes in WordNet are not organized in a hierarchical tree, but the synsets are organized hierarchically. Hyponymy and hypernymy are semantic relations between synsets in WordNet.72 These relations are the inverse of each other. Furthermore, they are transitive and asymmetrical (Lyons, 1977, p. 292) as well as irreflexive. Transitive relations have already been defined above. In turn, an asymmetric relation R between two synsets a and b (aRb) is defined in such a way that the inverse (bRa) is always false. Furthermore, as an asymmetric relation it is irreflexive, meaning that no relation of the kind aRa is defined.73 From this it also follows that a ≠ b. In mathematics, a relation that is transitive, asymmetrical, and irreflexive is called a partial order and orders a set of elements. The set of all nouns is therefore ordered hierarchically. A hyponym inherits all features of its hypernym and “adds at least one feature that distinguishes it from its superordinate and from any other hyponyms of that superordinate” (Miller et al., 1990b, p. 8). Multiple inheritance is possible. If A is a superordinate or hypernym and B is its hyponym, then the logical expression ∀x[B(x) → A(x)] applies. In terms of set theory one can say B ⊂ A.

72This is also a feature of the WordNet structure that is very similar to lexical field theory: Not the lexemes themselves are ordered hierarchically, but the fields (i.e., synsets) are.
73Lyons (1977, p. 308) claims that dog can be its own hypernym when asking whether a dog (1) is a dog (2) or a bitch. But dog (2) must be distinguished from dog (1). Usage (1) refers to the set of all animals of the species Canis lupus familiaris. Usage (2) only refers to the male members of this set, excluding the female members that are called bitch.
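
The following sketch illustrates the transitivity of hypernymy with NLTK; the synset identifiers tree.n.01 and plant.n.02 are assumptions that depend on the WordNet version shipped with NLTK:

    from nltk.corpus import wordnet as wn

    tree = wn.synset('tree.n.01')    # the plant sense of "tree" (assumed identifier)
    plant = wn.synset('plant.n.02')  # the living-organism sense of "plant" (assumed identifier)

    # Transitive closure of the hypernymy relation: all direct and indirect hypernyms.
    hypernym_closure = set(tree.closure(lambda s: s.hypernyms()))

    # Because hypernymy is transitive, plant is reachable from tree even though it
    # is not a direct hypernym; in set-theoretic terms, trees form a subset of plants.
    print(plant in hypernym_closure)
    print(tree.root_hypernyms())     # in recent WordNet versions, all noun chains end in entity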

For example, the set of all trees is a subset of plants. If an object belongs to the class or set of trees, it can also be said to be a plant and to share the common features of plants, such as using photosynthesis. The main distinctive feature of the organization of nouns in WordNet is their hierarchical structure. Originally, this did not mean a fully connected graph: 25 unique beginners were identified that subdivide the nouns into multiple hierarchies which “correspond to relatively distinct semantic fields, each with its own vocabulary” (Miller et al., 1990b, p. 16). They can be seen in Table 8. Miller (1998, p. 28) says that a unique beginner “corresponds roughly to a primitive semantic component in a compositional theory of lexical semantics”. Like lexical fields, they are not built on lexical differences but on differences found in the nature of things, be they physical, biological, or philosophical, and are therefore of an ontological nature, in the philosophical sense.74 The unique beginners are mainly used in the organization of the lexicographers’ work. Each one represents one file that the lexicographers work on.

Table 8: List of 25 unique beginners in the noun set of WordNet.

{act, activity}          {food}                   {possession}
{animal, fauna}          {group, grouping}        {process}
{artifact}               {location}               {quantity, amount}
{attribute}              {motivation, motive}     {relation}
{body}                   {natural object}         {shape}
{cognition, knowledge}   {natural phenomenon}     {state}
{communication}          {person, human being}    {substance}
{event, happening}       {plant, flora}           {time}
{feeling, emotion}

During the compilation of WordNet, however, some of these unique beginners have been grouped together, leading to only 11 unique beginners that are assigned the attribute nouns.tops in the WordNet database.

74While this association between nature and language has been disputed in linguistics for many decades now, the more pragmatic approach in NLP still constantly relates utterances to objects in the real world. The knowledge NLP is interested in is what objects or things in the real world a person is referring to. From a linguistic point of view and for the speaker of a language, this is of course quite obvious.
75In Miller (1998, p. 28), the possibility of one topmost synset or root vertex of the noun tree was rejected, saying that there exists no linguistic justification for it and that this vertex would only be a “generic concept” (Miller, 1998, p. 28). Still, WordNet 3.1 does exhibit exactly this behavior.

Although it is often75 stated that the unique beginners do not have a superordinate synset, this is, at least in the most recent version of WordNet 3.1, not true. The synsets marked as nouns.tops are in fact hyponyms of entity. Hence, all nouns are organized in one single tree. A distinguishing feature of these top vertices or unique beginners is that, due to their use as an auxiliary construction, they do not contain word forms.

Meronymy is the part–whole relation (has-a). Given a synset X = {x1, x2, ..., xn} and a synset Y = {y1, y2, ..., ym}, meronymy between X and Y is defined if ∀x[X(x) → hasA(y, x)] is true: if x is an element of X, then y has an x, for every x. Meronymy, and its inverse holonymy, is also said to be transitive and asymmetrical (Cruse, 1986, p. 165). But it is, just like the hypernymy relation, also irreflexive. There is no synset that can be its own meronym. These relations are only available in the noun set. As a relation that is transitive and asymmetrical, meronymy therefore forms a partial order, just like hypernymy/hyponymy. Besides these semantic relations that exist between synsets, WordNet also contains a set of lexical relations. One lexical relation is antonymy. Antonymy is not a very rich feature of nouns. Still, some nouns do have antonyms (e.g., war is an antonym of peace). Antonymy is very common within the field of descriptive adjectives. Nouns (i.e., noun word forms) can also be related to word forms of other POS by morphological relations indicating derivational processes. An adjective can refer to a similar concept or situation as the noun from which it is derived. This will be discussed later. The derivationally related relation exists between word forms of different POS. It indicates, besides the morphological processes underlying the word formation, a relation in meaning, since it transfers the meaning from one POS to another (e.g., music (noun) and musical (adjective)). This information can be of great use for the purposes of this thesis, since it may connect word forms belonging to a polysemous word, but it should not exist between word forms of a homonymous word. Figure 17 shows an exemplary graph of a possible organization of nouns in WordNet. One can see that some relations are restricted to synsets (i.e., in terms of an ontology, the relation’s range and domain are restricted to entities of the type synset). Other relations are lexical and therefore only exist between the lexical units or word forms. The relation of synonymy in the graph is dotted since this relation is not explicitly stated by a relation linking the synonymous word forms, but can be assumed from the fact that both word forms are members of the same synset.


Figure 17: Exemplary organization of nouns in WordNet.
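
To make the noun relations shown in Fig. 17 concrete, the following sketch queries a few of them through NLTK; the lemmas chair, war, and music are merely illustrative choices, and the exact output depends on the WordNet version:

    from nltk.corpus import wordnet as wn

    # Meronymy/holonymy are semantic relations and therefore attached to synsets.
    chair = wn.synset('chair.n.01')
    print(chair.part_meronyms())     # parts of a chair, e.g. a leg synset

    # Antonymy is a lexical relation and therefore attached to lemmas (word forms).
    for lemma in wn.lemmas('war', pos=wn.NOUN):
        if lemma.antonyms():
            print(lemma.name(), '<->', [a.name() for a in lemma.antonyms()])

    # Derivationally related forms link word forms across parts of speech.
    for lemma in wn.lemmas('music', pos=wn.NOUN):
        related = lemma.derivationally_related_forms()
        if related:
            print(lemma.name(), '->', [r.name() for r in related])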

5.2.2 Adjectives

Adjectives are also organized in synsets. Unlike the noun set, the adjective synsets are not hierarchically ordered. Fellbaum et al. (1990, p. 27) say that “[t]he semantic organization of adjectives is more naturally thought of as an abstract hyperspace of N dimensions rather than as a hierarchical tree”. WordNet distinguishes two main categories of adjectives: descriptive adjectives and relational adjectives. Descriptive adjectives can be used as attributes, while relational adjectives can be used as modifiers. Unlike in the field of nouns, antonymy is a very rich feature among descriptive adjectives. Antonymy in WordNet is a lexical relation between word forms. The synsets {high, tall, top, up} and {low, down, inferior, short, little} are opposites of each other, but they are not antonymous. While high/low, up/down, and tall/short are antonyms, high/down or tall/low are not. One condition of antonymy regularly cited is the association test (Charles and Miller,

1989; Fellbaum et al., 1990; Justeson and Katz, 1991, 1992). In the case of antonymy, the use of a lexical item automatically invokes its antonym. Given a descriptive adjective, people are asked to name associated words. The answers turn out to be mostly their antonyms: Good invokes bad, and high invokes low. Also, the antonym of a word x is not always ¬x. Saying that someone is not rich does not say that he or she is poor (cf. Miller et al., 1990a, p. 7). Still, poor and rich are antonyms, or, as said in Chp. 3.4.4, they lie on opposite ends of a scale of wealthiness. High and low are antonymous adjectives. They are said to be possible values of an attribute HEIGHT (cf. Fellbaum et al., 1990, p. 27). Most attributes are bipolar: The values of HEIGHT can lie between low and high. This explains why the opposed pair high/down is not treated as antonyms in WordNet. For one, they are not values of the same attribute HEIGHT, and secondly they “are not familiar antonym pairs” (Fellbaum et al., 1990, p. 28). Of course, the second condition, the association test with English speakers, is connected to the first one. Antonymy is “not the same as the conceptual opposition between word meanings” (Fellbaum et al., 1990, p. 28). Except for some descriptive adjectives of Germanic origin (e.g., good/bad), most antonymous pairs of descriptive adjectives in English are derived through morphological rules by adding a negation prefix like un-, in-, and others. This is the reason why Fellbaum et al. (1990) call antonymy a relation between word forms: “Morphological rules apply to word forms, not word meaning” (Fellbaum et al., 1990, p. 28). Following this statement, it is argued that antonymy is not a semantic relation but a lexical relation: “Antonymy is a lexical relation between word forms, not a semantic relation between word meanings” (Miller et al., 1990b, p. 7). Opposing pairs, such as high/down, do not have a special relation in WordNet. Looking for an opposing adjective when no direct antonyms are available means looking for synonyms that have an explicit antonym. All descriptive adjectives in WordNet have at least indirect antonyms. The second main class of adjectives in WordNet is that of relational adjectives. These adjectives function much like a modifying noun. Relational adjectives are used to indicate some form of association with a noun. Fellbaum et al. (1990, p. 34) give the examples of dental, having to do with teeth, or fraternal, associated with brother. But not only adjectives of Latin origin can be used as relational adjectives. Using derivation, some nouns, such as music, “give rise to two homonymous adjectives; one relational adjective

restricted to predicative use, the other descriptive” (Fellbaum et al., 1990, p. 34).76 This results in polysemous adjectives, here musical, which has two senses as in musical instrument (an instrument used to play music) and musical child (a child with musical ability). In contrast to descriptive adjectives, relational adjectives are not related to a scalable attribute of a noun. Instead, these adjectives are semantically closely related to the corresponding noun (i.e., they “refer to the same concept, but they differ formally (morphologically)” (Fellbaum et al., 1990, p. 35)). Also, they mostly have no direct or indirect antonyms. Other adjectives include what Fellbaum et al. (1990) call reference-modifying adjectives and the color adjectives. The latter naturally have no antonyms but can have synonyms.77 Color adjectives are widely used in metaphors and metonymy and are a rich source of polysemous words. Polysemous usage of color adjectives is not always regular, as was shown with the example of olive above. In most cases there will be some semantic connection and not only coincidental accordance of word forms, but these connections might not be obvious.78 Examples of metonymic usage of color adjectives include the reds, referring, according to WordNet, to Marxists, or the name greys for the soldiers of the Confederate army wearing grey uniforms, similar to the term redcoat for the red-wearing soldiers of the British army during the Revolutionary War. These metonymical usages stand pars pro toto for the soldiers wearing the color. Furthermore, color adjectives can be used as verbs to mean that something turns that color, e.g., to blue or to yellow. Figure 18 shows a possible graph of an adjective synset in WordNet. The relations of synonymy and antonymy have already been discussed. Apart from these lexical relations among word forms, adjective synsets are not ordered hierarchically by relations such as hyponymy or hypernymy. Instead they can be linked by also see or similar to. If we look at an adjective synset such as {good}, it will be linked to the corresponding comparison {better} by also see. Similar to indicates an almost synonymous relation, not between word forms but between synsets.

76Fellbaum et al. (1990) state the relation between the two lexemes of the adjective musical as homonymous. Polysemous would be more appropriate here to make a distinction from homonymy as defined in Chp. 5.1.
77In WordNet, purple and violet are subsumed in one synset, though such adjectives are highly debatable, culture-dependent, and often a matter of subjective estimation.
78For example, Ammer (1993, p. 192) traces the meaning of the now rarely used expression you’re yellow (meaning someone is a coward) back to the connection of yellow with betrayal. Judas Iscariot has often been portrayed wearing yellow clothes.


Figure 18: Exemplary organization of adjectives in WordNet.

Again, the synset {good} might be related to that of {fresh}. Saying an apple is good or an apple is fresh is not the same, but they are closely related. A fresh apple is expected to be good and vice versa, though this is not a strict causal connection. An adjective synset such as {good} might also be used as an attribute in a specific context. It is then linked to a noun synset it can be used to describe (e.g., {quality}). As can be seen in Fig. 19, adjectives can also be related to word forms from other POS through a derivationallyRelated connection, which indicates a relation through derivation.
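
A short sketch of the adjective relations discussed above, again via NLTK; the lemma good is an illustrative assumption:

    from nltk.corpus import wordnet as wn

    # WordNet stores head adjectives ('a') and satellite adjectives ('s').
    for synset in wn.synsets('good'):
        if synset.pos() not in ('a', 's'):
            continue
        if synset.similar_tos():       # similar to: a relation between synsets
            print(synset.name(), 'similar to', [s.name() for s in synset.similar_tos()])
        if synset.attributes():        # attribute: adjective synset -> noun synset
            print(synset.name(), 'attribute of', [s.name() for s in synset.attributes()])

    # Antonymy is again a lexical relation between word forms.
    for lemma in wn.lemmas('good'):
        if lemma.antonyms():
            print(lemma.name(), '<->', [a.name() for a in lemma.antonyms()])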

5.2.3 Adverbs

The modeling of adverbs in WordNet is similar to that of adjectives, though more simplistic. The adverb graph mostly consists of a collection of unrelated synsets. Sometimes word forms have antonyms, or some adverbs are derived from adjectives and therefore derivationally related to an adjective synset (cf. Miller, 1998, p. 64ff.). This can be seen in Fig. 19. As one can see, the structure of adverbs is not very elaborate. The complexity of the graph is low; there are no larger clusters or even larger connected components. The level of information given in the adverb graph is therefore expected to be very low.

5.2.4 Verbs

Not only in grammars and syntactic theories but also in most semantic theories such as frame semantics (Fillmore, 1976), semantics in construction grammars (Pollard and


Figure 19: Exemplary organization of adverbs in WordNet.

Sag, 1988), or formal semantics, the verb is the linchpin of the theory. Since WordNet treats verbs just like the other POS, they are not assigned any kind of further grammatical information. For example, they contain no information on combinatory restrictions (e.g., on which nouns they can be combined with or on the number and kind of grammatical objects they take). The meaning of a verb can change with the arguments it takes. The most common verbs in English (e.g., have, be, run, make, set, go, take) do not have one meaning but a multitude of possible senses.79 Fellbaum (1990, p. 40) says that verbs in the Collins English Dictionary have on average 2.11 senses while nouns only have 1.74. The same tendency can be observed in WordNet: Verbs are much more polysemous than nouns. Homograph word forms make up 30.38% of the nouns, while 74.94% of the verbs are homographs. WordNet currently contains 13,767 verb synsets. These are organized in different files corresponding to semantic domains such as “verbs of bodily care and functions, change, cognition, communication, competition, consumption, contact, creation, emotion, motion, perception, possession, social interaction, and weather verbs” (Fellbaum, 1990, p. 41) and

79Fellbaum (1990, p. 40) cites Gentner and France (1988), who conducted an experiment in which verbs were used in violation of their selectional restrictions for nouns. When subjects were asked to paraphrase the sentences, they reinterpreted the verbs in relation to the nouns and assigned a new semantic meaning, differing from the verb’s known senses. They conclude that “noun representations may be typically more internally cohesive than verb representations and therefore less easily altered. . . . Change of meaning does occur to accommodate contextual constraints, but that change involves computations over the internal structure of the word meanings, particularly that of the verb” (Gentner and France, 1988, p. 379).

one file containing verbs of states that are not semantically connected otherwise.


Figure 20: Exemplary organization of verbs in WordNet.

Figure 20 shows an exemplary verb structure in WordNet. One can see the hierarchical structure that is established through hyponymy and hypernymy relations. Even though there are very few synonymous verbs, verbs are also organized in synsets. These mostly contain only one word form. The relation verb group is used to group verb synsets that are closely related though not synonymous; it can be thought of as similar to frames in frame semantics as well as to the notion of fields mentioned earlier. This relation is of special interest since it is expected to relate polysemous verb synsets. It is, for example, used to relate the following two senses of have: have or possess, either in a concrete or an abstract sense, and have as a feature. This relation is not applied in a coherent way in WordNet (i.e., it is often missing where it could have been applied). Not yet mentioned is the entailment relation. Entailment includes various relations. Fellbaum (1998, p. 77) defines entailment as a “relation between two verbs V1 and V2 that holds when the sentence Someone V1 logically entails the sentence Someone V2 ”. The relation in WordNet exists between two synsets, as can be seen in Fig. 20. For example, Someone snores entails Someone sleeps. One of the relations subsumed under entailment is troponymy, “which relates synset pairs such that one expresses a particular manner of the other (e.g., {whisper}-{talk} and {punch}-{strike})” (Fellbaum, 2006, p. 667). Further entailment relations in WordNet according to Fellbaum (1998, p. 77ff.) are backward entailment {divorce–marry}, presupposition {buy–pay}, and cause {show–see} (Fellbaum, 1998, p. 83).

An interesting feature of verb synsets is the possible relation domain. A domain can designate a topic (i.e., a noun synset). Verb synsets like {run}, {jump}, and {play} can be associated with the domain {sports}.
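
The verb relations described in this section can be inspected in the same way; the chosen verbs are illustrative, and the exact relation targets depend on the WordNet version:

    from nltk.corpus import wordnet as wn

    snore = wn.synset('snore.v.01')
    print(snore.entailments())       # entailment: snoring is expected to entail sleeping

    whisper = wn.synset('whisper.v.01')
    print(whisper.hypernyms())       # troponymy is encoded via verb hypernymy (a manner of talking)

    have = wn.synset('have.v.01')
    print(have.verb_groups())        # closely related but not synonymous senses of "have"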

5.2.5 Overview and Shortcomings

Table 9 gives an overview of the existing relations in WordNet with respect to their type (lexical or semantic), the POS in question, and a short example.

In order to understand how a network-based link prediction in WordNet might be done, one should also look at these relations in more detail. Table 10 shows the relations with regard to the word classes and the absolute number of their occurrences. Especially the relations with few instances will be of great interest. Derivationally related word forms are a good indicator of closely related meanings and might indicate polysemy if the two word forms are homographs. The same is true for the attribute, pertainym, similar to, see also, verb group, entailment, and cause relations. These are often referred to as the cousin relations in WordNet. These relations, even though small in number and apparently unequally distributed, are, if they connect homograph word forms, a first indicator of polysemy, but they will only apply to a small percentage of the possibly polysemous cases. In the WordNet graph, these relations lead to short distances: in the case of a lexical relation the geodesic path will be 1, in the case of a semantic relation 3.80 Furthermore, these relations tightly connect meanings that are otherwise far apart and not directly connected. This should lead to a small-world structure. Some etymologically related word forms will therefore tend to have a short geodesic path due to relations that connect (derivationally) related forms. Unfortunately, this is only true for very few polysemic cases in WordNet. In general, one can expect forms that are closely related, with a geodesic path of <3 or a little over that, to be polysemous or at least to have a very high probability of being related. This does not mean that word forms separated by a much longer or even infinite geodesic path cannot be related.

80Semantic relations connect synsets. The distance between word forms of two synsets linked by a semantic relation is therefore 3.

Table 9: Relations in WordNet.

Relation | Type | Part of Speech | Examples
synonymy | lexical | N, V, Adj, Adv | {buy, purchase}
antonymy | lexical | N, V, Adj, Adv | good, bad
hyponymy | semantic | N, V | {oak, ...}, {tree, ...}
hypernymy | semantic | N, V | {tree, ...}, {oak, ...}
meronymy | semantic | N | {leg, ...}, {chair, ...}
holonymy | semantic | N | {chair, ...}, {leg, ...}
entailment | semantic | V | {jump, ...}, {fall, ...}
cause | semantic | V | {burn, ...}, {combust, ...}
also see | lexical | V, Adj | jump, jump on
verb group | semantic | V | {burn, fire, ...}, {incinerate}
domain of synset | semantic | V, N | {run, ...}, {sport, ...}
similar to | semantic | Adj | {good, ...}, {fresh, ...}
attribute | semantic | Adj, N | {good, ...}, {quality, ...}
derivationally related | lexical | N, V, Adj, Adv | good, goodness

Table 10: Overview: relations per word class.

Relation | Noun | Adjective | Verb | Adverb
synonymy | 146312 | 30070 | 25061 | 5592
antonymy | 1560 | 3547 | 690 | 540
hyponymy | 84427 | 0 | 13256 | 0
hypernymy | 84427 | 0 | 13256 | 0
meronymy | 22196 | 0 | 0 | 0
holonymy | 22196 | 0 | 0 | 0
entailment | 0 | 0 | 408 | 0
cause | 0 | 0 | 221 | 0
see also | 0 | 2690 | 349 | 0
verb group | 0 | 0 | 1744 | 0
domain of synset | 6352 | 1427 | 1284 | 110
member of domain | 9242 | 0 | 0 | 0
similar to | 0 | 21434 | 0 | 0
attribute | 639 | 639 | 0 | 0
derivationally related | 22100 | 8994 | 13274 | 7
pertainym | 0 | 3751 | 0 | 0
participle | 0 | 59 | 0 | 0

This analysis of the structure of WordNet undertaken in this chapter reveals some shortcomings and possible problems that might arise when using WordNet in the context of NLP. In the scope of NLP, WordNet is sometimes found to be too fine-grained. In word sense disambiguation (WSD), some homonymous words are found in too many, sometimes very closely related synsets, which makes WSD against WordNet senses a very hard or maybe even impossible task (cf. Mccarthy, 2006; Navigli, 2009). Biemann (2012) mentions an upper bound of 80%, which WSD algorithms do not exceed due to the “fine-grained sense structure of WordNet” (Biemann, 2012, p. 4038). Biemann and Riedl (2013) also found that WordNet’s coverage in some domains is very low. Depending on the texts that are to be processed, one might find homonyms whose senses are not fully represented in WordNet. Biemann and Riedl use Wikipedia and find that anime is contained in two synsets, neither of which covers the sense Japanese animation movie. Such findings might further hinder WSD tasks based on WordNet. A further problem that might arise when using WordNet is its limitation to semantic relations.81 Verbs carry no information on the number or kind of objects they take or on possible selectional restrictions. Sometimes verbs that share a common word form can be distinguished using such features. The so-called cousin relations have been found to be a good indication of polysemy between word forms. It has also been stated that these relations, unfortunately, are used only very seldom and are missing in most cases.

5.3 Network Analysis

In this chapter, the network topology of WordNet will be analyzed. This analysis gives insight into the structure and characteristics of the network. Together with the in-depth analysis of the ontology structure, this information will be used to choose features for the link-prediction task.

5.3.1 WordNet as a Graph

The network has a total of 324,602 nodes. These vertices include all word forms and synsets contained in WordNet, and they are connected by 584,528 edges. The graph is not connected; it consists of 5,051 weakly connected components.

81Even though there exist some morphological relations as well.
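
The counts reported in this section can be approximated with a sketch along the following lines; it builds a directed graph with synsets and word forms as vertices using networkx and NLTK. Only a subset of the relation types is added here, and NLTK typically ships WordNet 3.0 rather than 3.1, so the resulting numbers will not exactly match those above:

    import networkx as nx
    from nltk.corpus import wordnet as wn

    G = nx.DiGraph()
    for synset in wn.all_synsets():
        G.add_node(synset.name(), kind='synset', pos=synset.pos())
        for lemma in synset.lemmas():
            wf = (lemma.name(), synset.name())          # one vertex per word form (word sense)
            G.add_node(wf, kind='wordform', pos=synset.pos())
            G.add_edge(synset.name(), wf, rel='member') # the synset points to its members
            for ant in lemma.antonyms():                # lexical relation between word forms
                G.add_edge(wf, (ant.name(), ant.synset().name()), rel='antonym')
        for hyper in synset.hypernyms():                # semantic relations between synsets
            G.add_edge(synset.name(), hyper.name(), rel='hypernym')
        for mero in synset.part_meronyms():
            G.add_edge(synset.name(), mero.name(), rel='part_meronym')

    print(G.number_of_nodes(), G.number_of_edges())
    print(nx.number_weakly_connected_components(G))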

The degree distribution shows how many vertices have a certain degree. As has been mentioned before, small-world networks expose a degree distribution that can be fitted to a power law. If such a distribution is plotted on a log–log scale, one can see the scale-free distribution as a more or less straight line. Furthermore, a small-world network is defined as having relatively short geodesic paths (due to vertices with a very high degree, so-called hubs or authorities) and as having a high clustering coefficient. The small-world phenomenon is important because the paths between every two vertices in the network are relatively short. This behavior has especially been noticed for social networks, as well as for other complex networks, among them semantic networks. Figure 21 shows the degree distribution of the WordNet network when the directedness of the edges is respected. Figure 21(a) shows the in-degree distribution and the corresponding power law p(x) = x^α, where α = −2.708156.82 Figure 21(b) shows the out-degree distribution and the power law p(x) = x^−3.498834 fitted to it. Figure 22 shows the degree distribution when the direction of the edges is ignored, and the fitted power law p(x) = x^−2.820797. The overall findings in these three degree distributions show the typical behavior one would expect of a small-world, scale-free network. Still, the clustering coefficient is much smaller than would be expected from a small-world network: 0.004. As suspected from these findings, the geodesic path in the graph (i.e., the average shortest path or degree of separation) is found to be ∼9.7 edges. This is within the range of geodesic distances normally found in social networks. The network diameter (i.e., the farthest geodesic path between two vertices) is 27. Small-world networks contain few vertices with a very high degree, while most vertices have a degree close to the average or lower. These vertices result in the degree distribution seen above and in the short geodesic paths. In WordNet, the synset containing the word forms {metropolis, city, urban center} has the highest out- and in-degree: 674 and 671, respectively, which results in a total degree of 1,345. In this case, most relations are instance relations. Instead of using the hypernym relation, the WordNet architecture offers the instance relation. Instance is a subtype of hypernym or hyponym.83

82The power-law distribution is limited by the lowest possible value and the highest, which is, as we will see, 1,345. Every word form belongs to a synset, and all vertices are either word forms or synsets, so in the undirected graph the lowest possible degree is 1. In the directed graph, however, some word forms have an out-degree of 0 and some synsets an in-degree of 0. The lowest possible value is therefore 0 when treating the graph as directed and 1 when treating it as undirected.
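
Reusing the graph G from the sketch above, the power-law exponent can be estimated roughly by a least-squares fit in log–log space; this is a crude approximation of a proper (e.g., maximum-likelihood) power-law fit, so it will not exactly reproduce the α values reported here:

    import numpy as np
    from collections import Counter

    # Total-degree distribution (in- and out-degree combined) of G.
    degree_counts = Counter(d for _, d in G.degree())
    degrees = np.array(sorted(d for d in degree_counts if d > 0))
    counts = np.array([degree_counts[d] for d in degrees])

    # Straight-line fit in log-log space: log(count) = alpha * log(degree) + c.
    alpha, intercept = np.polyfit(np.log(degrees), np.log(counts), 1)
    print('estimated exponent alpha:', alpha)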

(a) In-degree distribution with fitted power law. (b) Out-degree distribution with fitted power law.

Figure 21: WordNet degree distribution (directed network).


Figure 22: WordNet degree distribution (ignoring direction) with fitted power law.

Hubs or authorities are interesting since they connect one synset to another with a relatively short path. A synset such as Dakar can be an instance of city as well as of other geographical entities such as port. This offers a rich source of possibly polysemous uses of words (e.g., a city name can refer to the city, the port, the airport, and so on). But not only synsets containing many instances, such as cities, can have many relations and thus a high degree. The in-degree of the {law, jurisprudence} synset (see Fig. 23) is 613, its out-degree 615, and its total degree sums up to 1,228. This synset is connected to others by the topic of relation, which is the inverse of the domain relation. This relation connects nouns, adjectives, verbs, and adverbs. Synsets like {law, jurisprudence} connect the different subsets (e.g., nouns, adjectives, verbs, and adverbs) of WordNet and are therefore of special interest. A hub like this builds shortcuts between related vertices in the graph. The degree distribution, and the short average geodesic path it causes, fit the definition of a small world. Still, the clustering coefficient is very low in WordNet. The clustering coefficient indicates how well connected a vertex and its neighbors are.

83An instance in an ontology is a vertex that is not an abstract category, but an entity. In this case the instances are city names.

Figure 23: Schematic plot of synset {law, jurisprudence} and its first- and second-degree neighbors.
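
The hubs and the clustering coefficient can be computed on the undirected view of the graph G built above; which synsets come out on top depends on the relation types included, so hubs such as {law, jurisprudence} or {metropolis, city, urban center} will only appear if the topic/domain and instance relations are part of the graph:

    import networkx as nx

    U = G.to_undirected()

    # The five vertices with the highest total degree (the hubs).
    hubs = sorted(U.degree(), key=lambda item: item[1], reverse=True)[:5]
    print(hubs)

    # Average clustering coefficient of the whole network
    # (this may take a while on a graph of several hundred thousand vertices).
    print(nx.average_clustering(U))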

The synset structure, in which a synset is connected to its word forms but the word forms are not connected to each other (see Fig. 16), causes the very low clustering coefficient of 0.004. Also, given the hierarchical structure of the synsets, not all child synsets of a superconcept are connected (e.g., most co-hyponyms subsumed under the synset {metropolis, city, urban center} are not connected in any way). This is a behavior that is not expected of social networks, for example. In other words, WordNet exposes an unusual network topology that is caused by design choices made in the ontology (e.g., that synonyms and co-hyponyms are not directly connected). While many complex networks and their topology can be explained by the preferential attachment model of Barabási and Réka (1999), WordNet’s structure does not match the features of social networks. It has been claimed by Steyvers and Tenenbaum (2005) that WordNet and other semantic networks evolve similarly to the preferential attachment model, but with a certain probability that a new vertex is connected not only to an existing vertex with a probability that depends on the vertex degree, but also to other neighbors of this vertex. Steyvers and Tenenbaum (2005) claim that this process can be thought of as refining an existing representation of a word’s meaning by adding a new vertex with a similar, yet different, pattern of connectivity. An example of such a behavior might be the already mentioned synset {law, jurisprudence}, which is defined by many connections in very different domains. The short average geodesic path and the degree distribution indicate a growth model in which new vertices are added following preferential attachment. New vertices are, at first sight, more likely to connect to already well-connected vertices in the graph. This would also be expected under the Steyvers and Tenenbaum (2005) growth model. Still, this latter model would predict a much higher clustering coefficient, even higher than that of preferential attachment. To further investigate the supposed evolution behind a static network, the large-scale structure of a network can be examined by looking at its communities. Especially in large networks, such as the WordNet network, plotting the community structure can give a good impression of the topology of the network. The so-called network community profile (NCP) (Leskovec et al., 2008a,b, 2010) can give such an overview. In Leskovec et al. (2010), different social and information networks are compared and their NCPs analyzed. Leskovec et al. (2010, p. 4) define a good community as a set of vertices that “have many internal edges and few edges pointing to the rest of the network”. This property is measured by the conductance. The lower the conductance, the

more community-like the set of nodes is. The NCP is a function over the size of possible communities in a network. The NCP plot for the WordNet network can be seen in Fig. 24.


Figure 24: NCP plot of WordNet.

The local minima of the plot indicate the size k of the communities and the corresponding low conductance. As one can see, the best community sizes for the WordNet network (i.e., those with the lowest conductance) lie between 10^2 and 10^3 vertices per community. This plot shares some typical properties with other complex networks such as large social or information networks. Leskovec et al. (2010) found that the best community sizes for most of the large networks they examined lie between 10^2 and 10^5, most often near 10^2. NCP plots of small social networks are normally sloping downward. This means that with increasing size the conductance decreases, which means that smaller communities can be combined into larger, still meaningful communities. Complex networks do not normally show this behavior. Most large networks expose an upward-sloping plot, meaning that the communities get less community-like when growing in size.
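
The conductance values behind such an NCP plot can be sketched as follows; instead of the spectral and flow-based community sweeps used by Leskovec et al., the candidate communities here are simply the hyponym subtrees of a few noun synsets (an illustrative simplification), evaluated on the undirected graph U from the earlier sketches:

    from networkx.algorithms.cuts import conductance
    from nltk.corpus import wordnet as wn

    def subtree_community(root_name):
        """All synset and word-form vertices below a noun synset: one candidate community."""
        root = wn.synset(root_name)
        synsets = {root} | set(root.closure(lambda s: s.hyponyms()))
        nodes = set()
        for s in synsets:
            nodes.add(s.name())
            nodes.update((l.name(), s.name()) for l in s.lemmas())
        return nodes & set(U.nodes())

    for root in ['dog.n.01', 'tree.n.01', 'vehicle.n.01']:   # illustrative roots
        community = subtree_community(root)
        phi = conductance(U, community)                      # lower = more community-like
        print(root, len(community), round(phi, 4))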

The graph in Fig. 24 first slopes down and, after reaching the local minima, slopes slightly up again. Following the analysis of Leskovec et al. (2010, p. 41) on the typical NCP for different network evolution models, a network created using pure preferential attachment is expected to slope up more before reaching an almost flat line. When comparing the WordNet plot to the plots generated by different evolution models, none perfectly fits the given plot. While preferential attachment can explain the degree distribution and the short geodesic paths, it does not account for the community structure of WordNet. The model proposed by Steyvers and Tenenbaum (2005) also accounts for the degree distribution, but not for the low clustering coefficient that is caused by the design choices in WordNet. Both models fail to predict or explain the existence of the components WordNet is made of. The existence of such components seems to be responsible for the slope of the NCP plot. Therefore, they cannot be readily used to give insight into how missing relations should be connected. We will see later on that, for other networks such as DBpedia, the evolution models give a very good indication of where relations might be missing.

5.3.2 Graphs of Single POS

When looking at the graphs built of only those vertices of the same POS, one can expect to see some structural properties that can also be concluded from the analysis of the WordNet architecture. For one, one might expect to find non-connected graphs with relatively long geodesic paths in comparison to the graph that is built of all word forms and synsets. This is due to the fact that a lot of WordNet’s structure is found in inter-POS relations (e.g., derivational relations) that are missing within these subsets. In a first attempt to extract subsets from WordNet, I defined the subsets as all vertices belonging to one POS plus their immediate neighbors (including those of other POS). This definition leads to unexpected and unusable results (e.g., all subsets, including the adverb subset, exposed a power-law degree distribution, which would not be expected from the WordNet analysis). This was due to the fact that the subsets are interconnected through a relatively small number of hubs (e.g., a noun synset acting as the domain of many adjective or adverb synsets). Especially adjectives and nouns are related through derivation and domain relations. The latter is also true for adverbs, which are related to nouns via the domain relation and not to verbs, as one might expect from their use in

language.84 Nouns are furthermore strongly connected to verbs. This shows the central role of the noun subset of vertices in WordNet. Such highly connected vertices are expected to have a high betweenness value, since many shortest paths between vertices of different POS-based subsets are expected to run through them. The subset definition was therefore restricted to only those relations between two vertices that are both members of the same POS. In doing so, all information on the interconnections is lost. Also, the vertices do not show their full degree, only the degree within the subset. The whole graph is more than the sum of these subsets, since the connections from one POS to another are missing.
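
This stricter subset definition, keeping only edges whose two endpoints belong to the same POS, can be expressed with the pos attribute stored when the graph G was built in the earlier sketch; note that WordNet adjectives carry two POS tags ('a' for heads and 's' for satellites), which have to be merged for the adjective subset:

    import networkx as nx

    def pos_subgraph(G, pos_tags):
        """Subgraph of all vertices of the given POS, keeping only edges within that POS."""
        nodes = [n for n, data in G.nodes(data=True) if data.get('pos') in pos_tags]
        return G.subgraph(nodes).copy()

    nouns = pos_subgraph(G, {'n'})
    adjectives = pos_subgraph(G, {'a', 's'})
    print(nouns.number_of_nodes(), nouns.number_of_edges())
    print(nx.number_weakly_connected_components(adjectives))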

The Noun Graph

The noun graph is a weakly (but fully) connected graph (i.e., there is a path from every vertex to every other vertex of the graph). The graph is weakly connected because the direction of the edges has to be ignored to find a path from every vertex to any other vertex of the graph.85 The connectedness of the noun subset is caused by the fact that all nouns are organized in a taxonomy sharing one common root node, while the adjectives and adverbs are not. The noun graph consists of 251,396 vertices and 412,772 edges. The longest geodesic path between two vertices is 22, and the average geodesic path length is ∼9.55 and hence slightly shorter than that of the whole graph. The vertices with the highest degree (i.e., the hubs) are those already mentioned before. Since the noun subset makes up more than half of WordNet, it is not surprising that it shows very similar properties to those of the whole WordNet graph. The degree distribution of the noun graph, as shown in Fig. 25(a), is similar to that of the whole graph. It fits a power law with α = −2.67. With regard to the small-world phenomenon, the clustering coefficient of the subset is 0.004 as well. The NCP plot in Fig. 26(a) is also very similar but indicates a stronger clustering at a size of around 10^3.
84One has to keep in mind that this grammatical or syntactical information is not present in WordNet.
85Every synset points to its members, but the members do not point back.

(a) Degree distribution of the noun subset and fitted power law with α = −2.67. (b) Degree distribution of the verb subset and fitted power law with α = −2.91. (c) Degree distribution of the adjective subset and fitted power law with α = −3.36. (d) Degree distribution of the adverb subset and fitted power law with α = −5.52.

Figure 25: Degree distribution, in- and out-degree combined, of the four WordNet subsets with the appropriate power law fitting. Note the different scaling of the plots.

129 The Verb Graph

The verb subset contains 56,138 vertices and 78,741 edges. It is thus significantly smaller than the noun graph. The clustering coefficient is 0.004 and thus the same as in the noun graph. The average geodesic path length is ∼11 and hence longer than that of the whole WordNet graph. The length of the farthest geodesic path between two vertices (i.e., the diameter of the graph) is 33. Although verbs do not share one common root node, the verb graph consists of only one weakly connected component. This is mainly due to the fact that verbs are organized in verb groups, similar to frames in frame semantics. The degree distribution that the verb graph exposes (see Fig. 25(b)) resembles that of the total graph and that of the noun subset, but the power law shows a steeper slope and more noise around the count of 10^4 as well as around 10. The power law’s α value lies around −2.91. The NCP plot (see Fig. 26(b)) is more constant, without clear cuts or local minima, and shows a slight downward slope. This means there is no clear tendency as to the size of optimal clusters: The higher the number of vertices in the cluster, the better these clusters are. This general downward slope is not surprising: The whole connected component apparently makes up the best cluster of the graph.

The Adjective Graph

The nouns and the verbs are organized in a hierarchical order. The adjectives do not form such hierarchies. This results in 3,774 small unconnected components. The graph consists of 61,212 vertices and 77,456 edges. The average geodesic path, which is undefined for unconnected vertex pairs, can nonetheless be estimated using methods such as random walks. Still, this results in very long paths, on average 17.77, and the diameter of the adjective graph is as long as 44 edges. The degree distribution of the adjective graph, seen in Figure 25(c), follows a power law with α = −3.36 that is in general very similar to that of the noun or verb graph, even with similar limits and similar noise around the end of the distribution. The clustering coefficient is, again, 0.004. Since the graph is not wholly connected, the NCP plot shows several local minima at cluster sizes k between 10^2 and 10^3. Not surprisingly, the NCP plot in Figure 26(c) shows an upward slope, indicating that the bigger k gets, the less likely those clusters are optimal clusters. This conforms to the architecture of WordNet and the existence of components, i.e., the non-connectedness of the adjective subset.

The Adverb Graph

Like adjectives, adverbs are not organized in a connected structure, but in many smaller components, 3,198 in number. The subset consists of 12,335 vertices but only 9,566 edges. This is due to the fact that most of the graph is made up of synsets consisting of only the synset vertex and one word form. The clustering coefficient is 0. Since the 3,301 components mostly consist of only one synset, the average path length between the connected vertices is only 1.39. The longest geodesic path is only 4. This shows how small those connected components are. The degree distribution does not fit the power law with α = −5.52 well (see Fig. 25(d)). The plot is less noisy as well as not as steep as the ones seen for the whole graph and the other subsets. The overall skewed NCP plot shown in Fig. 26(d) reflects this unusual behavior of the graph. The mixture of a small average degree, low connectedness, and a large number of unconnected components, as well as the relatively small number of vertices, leads to this finding. The optimal cluster size k lies slightly above 10, indicating the relatively small components described above.

5.3.3 Network Analysis: Overview

Mehler (2008) hypothesizes that “agents build communities in the form of small worlds in order to generate knowledge networks which themselves are small worlds” and that “word networks tend to evolve as small worlds though very differently” (Mehler, 2008, p. 622). The analysis of the WordNet network structure supports this assumption, although it comes with the constraint that the clustering coefficient is quite low. The network analysis of WordNet has shown that WordNet is a small-world network with respect to the degree distribution, and hence the existence of hubs and a small average geodesic path, but that the clustering coefficient is much lower than in other complex networks such as neural or social networks. Small-world networks expose a degree distribution that follows a power law and a high clustering coefficient. Social networks show just these features. Some vertices have a great number of connections (i.e., a high degree) while others have close to zero connections. In social networks, vertices are very likely to be connected to the same vertices as their neighbors, hence the high clustering coefficient. The synset structure of WordNet causes its low clustering coefficient (0.004).

(a) NCP plot of the noun subset. (b) NCP plot of the verb subset. (c) NCP plot of the adjective subset. (d) NCP plot of the adverb subset.

Figure 26: NCP plots of the WordNet POS subsets.

From the analysis of the WordNet ontology, one can suspect that this very low value is caused by the structure of the synsets and by how large synsets are connected to their members; these synonymous word forms are not connected to each other. Also, co-hyponyms are not necessarily connected. The semantic relations found in WordNet order the network hierarchically, while only the lexical relations interconnect vertices in different parts of the hypernym/hyponym tree. All this leads to a relative sparseness of edges in the network. These assumptions based on the ontology structure are confirmed by the network analysis. Especially looking at the subsets built by the different POS shows how sparsely connected the parts of WordNet can be: Adjectives and adverbs form many unconnected components. Also, the community structure caused by the components contributes to the low clustering coefficient. The verb subset shows an NCP plot indicating that the whole component is the most community-like set within the verbs, while especially the unconnected sets of adverbs and adjectives show that the components found in the data are actually very small: around 100 vertices for the adjectives and only about 10 for the adverbs. Steyvers and Tenenbaum (2005) describe a model that should be able to explain the WordNet structure. After looking at the degree distribution of the WordNet graph and the subsets separately, as well as at the NCP plots, their model seems unlikely to explain WordNet’s structure. While the NCP plot of the noun subset in Fig. 26(a) resembles that of a network generated using preferential attachment found by Leskovec et al. (2008a, p. 42), the other subsets as well as the total graph do not. The assumption that vertices that have existed in the graph for a longer time are, over time, more likely to be well connected is not generally true. Other factors might be the frequency of usage as well as the POS. The Steyvers and Tenenbaum (2005) assumption that the meaning of a word becomes more differentiated the longer the word is used cannot be proved using WordNet. The analysis undertaken here cannot confirm the findings and interpretations of Steyvers and Tenenbaum (2005). The model does not account for the existence and evolution of the large number of components, and it was shown how strong an influence this division has on the network analysis. Actually, to confirm their finding, Steyvers and Tenenbaum (2005) only look at WordNet’s biggest components, ignoring the strong division of the network into components. WordNet’s structure is hence different from that of other complex networks. One can conclude that measures or features that take the neighbors of a vertex into account, as seen in the social-network approaches to link prediction, as well as purely geodesic-path-based

133 approaches, will not be good fits for this network structure. A new set of features has to be found that differs from previous attempts on other networks to meet WordNet’s special properties. It has been assumed that the so-called cousin relations might be very precise indicators of polysemy as defined in the context of this thesis. Still, looking at the frequency of usage of these relations, the coverage is expected to be very low. From the existence of different components in the subsets, as well as the range and domain restrictions that de facto exist for some relations, one can conclude that the relations that interconnect different POS might be of special interest. Also, it can be assumed that some POS are more likely to take part in the polysemy relation that is to predicted in the following. The geodesic path has been claimed to be very precise, but it is not sufficient to classify an instance as being polysemous of homonymous. Having in mind these properties and the topology of WordNet and its subsets, appro- priate properties of vertices and the graphs can be taken into account when attributes for the machine learning algorithms are being chosen. WordNet’s structure is very different to that of social networks. Second degree neighbors that a vertex shares many connec- tions with are not good candidates to connected the vertex to. This is due to the synset structure as well as the hierarchical structure formed by hyponymy and hypernymy. It has been stated before that checking every possible combination of any two vertices of a graph for similarity or probability of connectedness is not very economical (cf. Clauset et al., 2008). Since only forms with the same lemma can be polysemous or homonymous, the number of possible candidates to check (i.e., the number of pairs of word forms in WordNet with a shared lemma) is reduced significantly compared to checking every vertex against any other vertex of the network. While this solution here is caused by the definition of the relation in question, we will further see when looking at the DBpedia later on how the network topology can also be used to solve the problem.

5.4 Link Prediction: Polysemy

5.4.1 State of the Art: Finding Polysemy in WordNet

The CoreLex resource (Buitelaar, 1998) defines a set of 39 basic types. These basic types are assigned to noun synsets in WordNet and are meant to indicate a property of all subordinate synsets. For example, the basic types animal or food are assigned to different WordNet vertices and thus define the type of these vertices and their hyponyms. All

synsets that are a child element of animal are animals, and all child elements of food are foods. As mentioned before, nouns are hierarchically ordered, and the topmost (empty) synsets are called unique beginners. The basic types are an extension to the existing hypernymy/hyponymy tree of the WordNet noun subnet and aim to be more exact and especially useful for the identification of regular polysemy. Many of the approaches presented here are based on, or are related to, CoreLex and thus restricted to nouns. The general idea is to provide a taxonomy of basic types that can be used to identify regular polysemy based on the definition by Apresjan (1974) seen before. Taking for example the grinding rule, one expects to find regular patterns of co-occurrence of homograph words in both the set of vertices subsumed under animal and the set subsumed under food. Based on the CoreLex resource, Boleda et al. (2012) and Utt and Padó (2011) present an approach to identify regular polysemy (i.e., common patterns used by speakers that form polysemous words such as in grinding). The authors identify common pairs of basic types, like the one shown above, that indicate a productive (i.e., regular) polysemy pattern (cf. Utt and Padó, 2011). Given a word like lamb with many word forms in WordNet, the system looks up the senses’ basic types. If it finds one sense corresponding to the type animal and another of the type food, this pattern can also be found for chicken and others.
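
The basic idea behind these pattern-based approaches can be sketched with NLTK's lexicographer files (lexname labels such as noun.animal or noun.food), which are used here as a stand-in for the CoreLex basic types: for every noun lemma, the files of its senses are collected, and frequently co-occurring pairs of files are candidates for regular polysemy patterns. The threshold for calling a pair regular is left open here:

    from collections import Counter
    from itertools import combinations
    from nltk.corpus import wordnet as wn

    pair_counts = Counter()
    for lemma_name in wn.all_lemma_names(pos='n'):
        files = {s.lexname() for s in wn.synsets(lemma_name, pos='n')}  # e.g. noun.animal, noun.food
        for pair in combinations(sorted(files), 2):
            pair_counts[pair] += 1

    # Frequent pairs point to productive patterns such as animal/food (grinding)
    # or artifact/group (institutions and their buildings); rare pairs are more
    # likely idiosyncratic and thus homonymous under this view.
    for pair, count in pair_counts.most_common(10):
        print(pair, count)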

135 polysemy as a property of words, not of word forms. If among the word forms they find at least two that fit one of the defined patterns, this word is treated as polysemous. This is in my opinion an oversimplification and should be avoided. There are cases in which some word forms belonging to one lemma are indeed related while others belonging to the same lemma are obviously not (e.g., bank may refer to a financial institution and the building that institution resides in (polysemy) as well as to a slope of land (homonymous)). To avoid this pitfall, or to at least reduce the negative effect, Utt and Pad´o(2011) calculate a ratio whose values lie between 0 and 1. Afterwards they define a cutoff value of the ratio >0 to distinguish between polysemous words and homonymous words. Utt and Pad´o(2011, p. 268) state that they “consider polysemy and homonymy the two points on a gradient, where the words in the middle show elements of both”. Although this is a nice thought, it of course leads to nothing since they still define a border (i.e., cutoff value) between both. Also the necessity to define such a ratio in the first place is only necessary because polysemy is defined on the word level and not on the sense level. The question remains how to treat the different senses and related word forms of bank: In WordNet, most of the senses of bank are connected to the financial institution, only four to mound or river bank. Is bank thus polysemous since most senses are related, or is it homonymous since there are two quite different clusters of meaning? Many people would surely agree that bank is homonymous. Regarding the problem of identifying word forms that are related in WordNet, and WordNet is based on senses, not on words or lemmas, neither of the two possibilities solves the problem. The only possibility is to adapt the problem and the definition of homonymy and polysemy to WordNet and look at the relations on the level of word forms. Also, when thinking of the possible use-case in word WSD, working with sense instead of words is necessary since the task is to distinguish between the different senses based on the usage of a word. Peters and Peters (2000) examine how the-top level vertices (nouns.TOPS) of the WordNet hierarchies could be used to identify regular polysemy. The top-level vertices each enfold a tree of hypernym/hyponym relations beneath it. The idea is similar to that of CoreLex but uses the structure of the WordNet data set and uses the different files. Each file corresponds to one top-level vertex used to store parts of the noun tree. Every noun in WordNet is a child of at least one of these unique beginners, i.e., categories that structure the whole noun subgraph (e.g., artifact: a man-made object or group: any number of entities (members) considered as a unit). Peters and Peters (2000) identify cases where polysemous lemmas are members of both artifact or group.

For example, the name of an institution can refer to an organization (group) and the corresponding building (artifact). This also shows the metonymic character of regular polysemy since it can be argued that the building, belonging to, owned, or rented by the institution, is part of that institution. To identify more cases, they also added other hypernyms within a given geodesic path (≤ 4) to find pairs of other (super) concepts that show a similar regularity. So, taking a word form w1 and a word form w2 that share the same lemma, not only their corresponding unique beginners are taken into account, but also their hypernyms up to a geodesic path of 4. The polysemy concept used by Peters and Peters (2000) is similar to the one used here as it only looks at pair-wise combinations of word forms, and not at words, i.e., sets of word forms that correspond in appearance but not necessarily in meaning, to identify polysemy. It is the internal WordNet structure, the non-existence of the word concept, and the organization in synsets formed by word forms that make this treatment of polysemy as a property of word forms, and not of words, plausible in this context. The goal of Peters and Peters (2000) is to extend WordNet by looking for missing pairs (i.e., cases where one sense is missing). One pair is that of container/quantity (e.g., cup is the container as well as the amount of substance in it). The same pattern of regular polysemy could be applied to every container (e.g., an amphora) to build the sense of the quantity, as in He drank the whole amphora of wine. According to this goal, their evaluation only looks at those cases that were identified, checking whether they are found in WordNet and are thereby right (true positives) or wrong (false positives), without taking into account those cases that were not covered by their approach (false negatives, i.e., patterns that were not identified in the first place).86 Another interesting, diverging method is introduced by Veale (2004). Instead of looking at structural similarities, he proposes a system that compares synsets containing potentially polysemous word forms to identify those that are in fact related. Since this method is not applicable in the context of this thesis (i.e., using the network structure to identify missing relations) because it ignores the structure of WordNet altogether, I will not elaborate on the details of how this similarity is computed. My approach differs from the ones cited above. First, after looking at the different approaches and at the structure of WordNet, I consider polysemy a relation on the level

86Another approach that is based on the systems mentioned so far is presented by Barque and Chaumartin (2009), but it does not offer much new information, insight, or ideas.

137 of word forms (i.e., the elements that make up a word sense) and not on the level of words, a concept that does not exist in WordNet in this form. Some word forms can be seen as polysemous: their sense is related and they share a lemma, while other forms only share a lemma but the senses are not related. There is no binary opposition on the word level between polysemous and homonymous words but only between related (or polysemous) and unrelated (or homonymous) pairs of word forms. A word can therefore exhibit polysemous senses (cf. to bank (money) and bank as financial institution) and homonymous senses (cf. the before mentioned senses of bank and one meaning shore or sandbank) at the same time. Some senses of bank form a set of polysemous semantically related meanings while at the same time they share a common form (homograph) with others only by chance. Secondly, I do not restrict my approach to only nouns. Restricting the research to only nouns seems to be a result of two factors. One is the hierarchical structure of the noun graph that supports disambiguation. CoreLex is based on nouns and often used. Furthermore, the hierarchical structure makes computing the similarity easier than in other subnets that do not offer hierarchies. The other reason is the fact that many researchers in the field have information retrieval tasks in mind and nouns are by far the most searched for word class. But the distinction between polysemous and homonymous might also play a role in word sense disambiguation and tasks where there is no restriction to only nouns. Also, the approach presented here does not ignore irregular cases of polysemy. The goal is to identify both those cases that are formed by regular rules, as well as those formed using some other kind of similarity between object, be this relation metaphorical or metonymic.

5.4.2 Feature Selection: the Network Approach

As Davis et al. (2011) put it: “Choosing features that sufficiently represent the information inherent in the network topology is a serious challenge”. In Chp. 2, a number of measurements were introduced that can be utilized to support machine learning algorithms on semantic networks; some approaches to link prediction in social networks were introduced in Chp. 4. If we try to predict polysemy in WordNet, we will find that polysemous concepts are often, though not always, related in WordNet. Especially the cousin relations and morpho-syntactic relations like derivation seem promising. As seen from the network

analysis, WordNet consists of different components. Also, the network is relatively sparse and has a low clustering coefficient. As a result, homograph word forms do not always have common neighbors and do not necessarily denote closely related concepts, even though they share a common semantic trait. Calculating semantic similarity might be a possible step towards the classification since it is assumed that polysemous word forms belong to word senses that share some common semantic trait. Some similarity measures are based on the network's topology and will be applied to the task presented in this thesis. Some are not and will only be mentioned for the sake of completeness. In the following, the set of graph-based measures and similarity measures will be introduced. Some notes on the reasons for choosing these measures will be given. All selected features can be found in Table 11.

5.4.2.1 Features Based on Similarity

In Snow et al. (2007), an approach to word sense merging in WordNet is presented. Because of the observation that WordNet senses are too fine-grained, they propose an alternate version of WordNet where synsets are merged. Even though this is not directly related to the distinction between polysemy and homonymy, their methods are of great interest to the problem addressed in this work. Snow et al. (2007) use a supervised support vector machine classifier, the two classes being whether to merge two synsets or not. If two synsets are found to be sufficiently closely related, they are merged (i.e., a new synset is formed containing the word forms of both original synsets).87 Snow et al. (2007) use three measures that are based on path calculations, three based on information content, and two gloss-based and therefore not strictly graph-based measures. The similarity measures include the following that are of interest for this thesis: Resnik and Yarowsky (1999), Lin (1998), Jiang and Conrath (1997), Banerjee and Pedersen (2003), Hirst and St-Onge (1998), Leacock et al. (1998), and Wu and Palmer (1994).88 The original Lesk measure (Lesk, 1986) was proposed for use in WSD. The neighborhood of a word to be disambiguated was compared to the glosses of possible word forms available in WordNet. Starting from this assumption, Banerjee and Pedersen (2003)

87To evaluate the approach they use hand-labeled data sets of groupings of WordNet senses. The first one is presented in Kilgarriff (2001); the second one can be found in Philpot et al. (2005). 88Snow et al. (2007) use the WordNet::Similarity package described in Pedersen et al. (2004).

propose their extended gloss overlap measure, which assumes that the more words two glosses share, the more closely related the corresponding synsets are. This is also true for synsets that are not directly related in WordNet (e.g., the glosses for car and tire do share a certain amount of lexical material). They also include the glosses of related synsets (e.g., hypernyms as well as other relations). In contrast to Lesk (1986), they do not simply count overlapping words. They add a higher score to overlapping phrases, since “phrasal n-word overlap is a much rarer occurrence than a single word overlap” (Banerjee and Pedersen, 2003, p. 806). The Hirst and St-Onge (1998) measure is based on the path between two synsets and takes into account the direction of these relations. A change in the direction is used as an indicator of a weaker similarity. If a path consists of only hypernymy or hyponymy relations, both concepts are more similar or related than in cases where both hypernymy and hyponymy relations exist in the path between two concepts. Similarly based on the path between two vertices are the measures described in Leacock et al. (1998) and Wu and Palmer (1994). Lin (1998) proposes a different measure of similarity that can be applied to WordNet, as well as to other word representations (e.g., vectors). Two synsets to be evaluated are compared in relation to their mutually shared hypernyms, taking into account the probability of a synset being the hyponym of the found hypernyms. This is called information content (Cover and Thomas, 1991). The measures of Jiang and Conrath (1997) and Resnik and Yarowsky (1999) are also based on the information content of the closest common hypernym and are computed on a corpus of raw texts. Based on an analysis of which other approaches seem promising (e.g., including some features from Snow et al. (2007)), especially the measures of semantic similarity, the features in Table 11 will be included in the machine learning approach proposed in this thesis. Other approaches to the polysemy/homonymy distinction (cf. Boleda et al., 2012; Buitelaar, 1998; Peters and Peters, 2000; Utt and Padó, 2011) have to be examined and compared with the features chosen here. However, the CoreLex basic types will not be used as a feature, since they are an external source of knowledge and the idea of this thesis is to show how information contained in an ontology, especially in the form of graph-based measures, can be used to predict not yet existing connections.
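For reference only, the path- and information-content-based measures named above are usually given in the following standard formulations (this is a recap of the literature, not the exact variants of the WordNet::Similarity implementation used later; c1 and c2 are synsets, lcs their lowest common subsumer, P(c) the corpus probability of concept c, and D the depth of the taxonomy):

    IC(c) = -\log P(c)
    sim_{Resnik}(c_1, c_2) = IC(lcs(c_1, c_2))
    sim_{Lin}(c_1, c_2) = \frac{2 \cdot IC(lcs(c_1, c_2))}{IC(c_1) + IC(c_2)}
    dist_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC(lcs(c_1, c_2))
    sim_{LCh}(c_1, c_2) = -\log \frac{len(c_1, c_2)}{2D}
    sim_{WP}(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{depth(c_1) + depth(c_2)}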

Table 11: Proposed feature set for the machine learning task.

Abbreviation                    Source or Description                                          POS
Resnik and Yarowsky             Resnik and Yarowsky (1999)                                     N, V
Lin                             Lin (1998)                                                     N, V
Jiang and Conrath               Jiang and Conrath (1997)                                       N, V
Hirst and St. Onge              Hirst and St-Onge (1998)                                       N
Leacock, Miller, and Chodorow   Leacock et al. (1998)                                          N, V
Wu and Palmer                   Wu and Palmer (1994)                                           N, V
minDist2SharedHypernym          Snow et al. (2007)                                             N, V
isA-Rel                         is-A relation between word forms                               N, V
areDerivationalRelated          is one word form derived from the other? Snow et al. (2007)    all
closeness                       the closeness value of the node                                all
betweenness                     the betweenness value of a node                                all
shortestPath                    geodesic path between the two nodes                            all
word sense degree               degree of the word sense nodes                                 all
synset degree                   degree of the synset nodes                                     all
eigenvector centrality          the eigenvector centrality values of the nodes                 all
page rank                       the page rank values of the vertices, Page et al. (1998)       all
POS                             the part of speech of the word form                            all
sharedLemmas                    number of lemmas shared by the synsets                         all
isSameVerbFrame                 do both word forms share the same verb frame?                  V

5.4.2.2 Network-Based Features

Apart from measures of semantic similarity, some basic features deduced from the WordNet ontology structure, and from the network topology and the position or centrality of the vertices of the networks, are to be considered. The minimum geodesic path of two presumably polysemous word senses to their closest common hypernym is a feature derived from a quite similar feature mentioned in Snow et al. (2007). All hypernyms of both word forms are compared; the shared ones are stored. Then the geodesic paths between the two word forms and the shared hypernyms are calculated and the nearest one is selected. The authors look at whether word senses contained in the synsets in question are derivationally related or whether they share a common antonym or pertainym. They found that especially the binary feature of two word senses being derivationally related (or not), when used as an explicit feature, increased the F-score by up to 9.8%. Indicating whether two word forms share an antonym only improved the prediction in the verb subset, while indicating common pertainyms did not improve the performance. On the contrary, it impaired the results. Both features are not to be expected from polysemous word forms. Of interest might be their measures of the closest shared hypernym and, accordingly, the maximal shared hypernym. Again, these features are only available for verbs and nouns. Other features they included are not of interest in the approach undertaken in this thesis, because they are not based on the network itself. Apart from the features used in Snow et al., the WordNet ontology structure indicates that other relations might be of special interest as well. The isA-Rel feature calculates whether the two synsets the word forms belong to are either a hypernym or hyponym of the other. The assumption is that a word sense that subsumes another word sense using the same lemma is polysemous (e.g., the synset {human, man} subsumes {man} (male human being)). The shortestPath indicates the length of the geodesic path connecting both word forms. It can be assumed that word forms closely connected in WordNet are more likely to be polysemous than those connected through a long path. Furthermore, the areDerivationalRelated feature will be used, indicating whether or not one word form was derived from the other; this in itself indicates polysemy. Such derivational relations

also lead to a short geodesic path. But Snow et al. (2007) suggest that using this relation explicitly as a feature improves the results in their experiments. This will be evaluated here as well. The same is true for isSameVerbFrame, which can only be used for verbs. Besides these relations that are based on the WordNet structure, network-based measures (i.e., the degree, closeness, betweenness, eigenvector centrality, and page rank of the vertex) are used as features. Vertices with high closeness are likely to be connected to more vertices than others. Vertices with a high betweenness are likely to be connected to many vertices via relatively short paths, and a high degree itself indicates a higher connectedness than a very low degree does. The degree will not only be calculated for the word form itself, but also for its synset. Vertices with high values for these measures might show relatively short paths to other vertices, but this might not always be a sign of semantic relatedness. Also, the POS of the word form in question is added as a feature. Another feature that seems promising is the number of lemmas shared by the two synsets the word forms in question are taken from. Though not a graph feature, it is information that is available in the graph without using any external knowledge. All these features will be evaluated individually to see their impact on the performance of the machine learning algorithm used and to finally arrive at an optimal feature set for the task. A small sketch of how the per-vertex measures can be computed is given below.
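The following is a minimal sketch only, not the thesis code, of how these per-vertex measures and the geodesic path of a pair could be computed with igraph in R; the graph object g, its vertex names, and the feature names are assumptions:

    library(igraph)

    # g: the WordNet graph with named vertices for word forms and synsets;
    # v1, v2: names of two homograph word-form vertices.
    vertex_features <- function(g, v) {
      c(degree      = as.numeric(degree(g, v = v)),
        closeness   = as.numeric(closeness(g, vids = v)),
        betweenness = as.numeric(betweenness(g, v = v)),
        eigenvector = as.numeric(eigen_centrality(g)$vector[v]),
        pagerank    = as.numeric(page_rank(g)$vector[v]))
    }

    pair_features <- function(g, v1, v2) {
      f1 <- vertex_features(g, v1); names(f1) <- paste0("w1.", names(f1))
      f2 <- vertex_features(g, v2); names(f2) <- paste0("w2.", names(f2))
      # geodesic path is Inf if the two forms lie in different components
      c(f1, f2, geodesic.path = as.numeric(distances(g, v = v1, to = v2)))
    }

In practice the whole-graph quantities (eigenvector centrality, page rank) would be computed once and looked up per vertex rather than recomputed for every pair, as done here for brevity.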

5.4.3 Data Preparation

In order to use the network data given in WordNet, I extract the graph from WordNet using the MIT Java Wordnet Interface.89 Most calculations are done in R using an R–Java interface90 and igraph91 for R. Instead of using the WordNet::Similarity package (Pedersen et al., 2004) for Perl directly, I use David Hope's (University of Sussex) Java implementation92, which also uses the MIT Java Wordnet Interface. I adapted the implementation to my needs, i.e., I added methods to compare MIT Java Wordnet Interface objects to each other instead of comparing just strings, representing words, with each other. To evaluate different machine learning algorithms, the WEKA software is used (Hall et al., 2009). The process starts by obtaining a list of all word forms. Then these word forms are

89http://projects.csail.mit.edu/jwi/. 90http://www.rforge.net/Rserve/. 91http://igraph.org/. 92http://www.sussex.ac.uk/Users/drh21/.

compared to each other to find those forms that are homographs. The retrieved sets of homograph word forms are then processed pairwise, comparing each word form to every other word form in the set, and the above-mentioned calculations are made using either the Java implementation of WordNet or the igraph R package (a sketch of this pairing step is given below). Since run-time performance is not a high priority here, because this data extraction is only performed once and the stored data can then be reused for evaluation and training purposes, using the R–Java interface, which is in fact quite slow and would not be sufficient for any real-time purposes, can be justified. Any application that needed to do calculations on the fly would need to implement other approaches. The data set consists of instances. Each instance contains two homograph word forms, the features that are to be chosen, and the class. It can be thought of as a table containing information on the two individual word forms and on the relations between these according to the feature set that is to be explained in Chp. 5.5. Each instance in the training and test set was manually assigned a class: either yes or no, indicating whether the two word forms are related or not (polysemous vs. homonymous).
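A sketch of the pairing step, under the assumption that the word forms and their lemmas are available as a data frame (the column names are hypothetical):

    # word_forms: one row per word-form vertex, columns 'id' (vertex name)
    # and 'lemma' (written form).
    pairs_per_lemma <- lapply(split(word_forms$id, tolower(word_forms$lemma)),
                              function(ids) if (length(ids) >= 2) t(combn(ids, 2)))
    homograph_pairs <- do.call(rbind, pairs_per_lemma)
    colnames(homograph_pairs) <- c("word1", "word2")

    # Each row then becomes one instance: the pair, the features computed for
    # it, and (for the training subset) the manually assigned class yes/no.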

5.5 Evaluation and Results

5.5.1 Homograph Nouns

5.5.1.1 Data Set and Baseline

44,449 noun word forms in WordNet are homographs of at least one other word form. This makes up 30.38% of all noun word forms. Table 12 shows the number of instances containing a noun and a member of the other subnets. Almost half of the instances contain two nouns (49.09%). Many noun and verb word forms share a lemma: 40.35% of instances contain, besides a noun, a verb, while 9.62% of the instances connect a noun word form to an adjective word form, and only 0.93% of the instances in the noun set contain an adverb as second word form. In total, 124,034 instances, i.e., pairs of homograph word forms together with the features to be extracted, exist in WordNet. From these, a subset of 2,511 was manually classified as either sharing a similar/common meaning or as being homographs merely by chance. 1,237 pairs have been classified as related, while 1,274 have been classified as being unrelated. Using a simple prediction of the most common class (i.e., no) as a baseline, an algorithm has to top at least 50.74%

Table 12: Instances of the kind noun ↔ POS.

Part of Speech            Total No. of Occurrences   Percentage
Total No. of Instances    124,034
nouns                     60,893                     49.09%
adjectives                11,938                     9.62%
verbs                     50,053                     40.35%
adverbs                   1,150                      0.93%

of correctly classified instances. In the following, some algorithms that have been found to exceed the baseline will be evaluated. First, the model with the highest accuracy, precision, and recall will be presented. The model was obtained by training a random forest. This algorithm was also used by Horvát et al. (2012), as has been shown before. Parts of this evaluation have already been presented in Entrup (2014).
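For illustration, the majority-class baseline just described amounts to nothing more than the following (the data frame and column name are assumptions):

    # noun_instances: the 2,511 manually labeled noun instances.
    class_counts <- table(noun_instances$class)            # yes: 1,237  no: 1,274
    baseline     <- max(class_counts) / sum(class_counts)  # 1274 / 2511 = 0.5074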

5.5.1.2 Evaluation: Random Forest

A good feature divides the data unambiguously, if possible, into the given classes. None of the features proposed here can account for this behavior on its own. The best measure to determine whether two word forms share a common meaning seems to be, at first glance, the geodesic path. This is at the same time the easiest and best known measure of similarity in WordNet. Figure 27 shows the data plotted according to the geodesic path on both axes and the two classes. While the geodesic path is a very good indicator for the class yes (i.e., a good indicator of semantic relatedness), it is, unfortunately, not a secure indicator of two word forms being unrelated: Instances below a certain geodesic path, around 6, are definitely of class yes. Yet a high or even infinite geodesic path does not indicate the other class. Still, the probability of an instance being classified as no rises with the length of the geodesic path. The random forest is a black box when it comes to evaluating the impact of the attributes used. Since this is a known problem of many algorithms, I chose a way of evaluating the feature set inspired by Snow et al. (2007): One feature at a time is removed from the set and a new model is trained, leaving all other properties the same. This is called an ablation study. The gain and loss resulting from this new evaluation give an indication of how important the features are in the context of the model.

Figure 27: Classes of instances yes/blue and no/red plotted according to the geodesic path.
Since the random-forest model randomly chooses features for every node of every tree it builds, the impact of one feature might be relatively small. As Breiman (2001) shows, even features with a small information content can, when combined in the right way, result in a very good model. In the following, the different features that have been proposed will be evaluated and compared. First, the features in Table 13 will be examined and the impact of the features will be evaluated. Afterwards, a subset of high-quality features will be extracted. These will be compared to machine learning models based on other feature sets that have been used in polysemy detection in WordNet before: the CoreLex basic types of Buitelaar (1998) and the unique beginners used in Peters and Peters (2000). Using only the features shown in Table 13, a random-forest model has been trained that is based on 100 random trees, each constructed while considering 17 random features and using 10 seeds. The model reaches a precision of 0.87 out of 1. The algorithm achieves 87.02% of correctly classified instances and thus outperforms the baseline by 36.28 points. The gain or loss when one feature is left out and the model is trained again using the exact same properties is shown in Table 13; a negative number indicates a gain in accuracy if the feature is left out. The graph measures of closeness and betweenness result in a good gain of precision. Even the simple measure of degree, the page rank, and the eigenvector centrality have a high impact.
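The thesis runs this training and ablation procedure in WEKA; as an illustrative stand-in only, the same loop could look as follows in R with the randomForest package (package choice, column names, and the use of out-of-bag error instead of WEKA's cross-validation are assumptions):

    library(randomForest)

    # noun_instances: one row per homograph pair with the Table 13 features
    # and a factor column 'class' (yes/no).
    set.seed(10)
    full_model <- randomForest(class ~ ., data = noun_instances, ntree = 100, mtry = 17)
    full_acc   <- 1 - tail(full_model$err.rate[, "OOB"], 1)   # out-of-bag accuracy

    # Ablation study: drop one feature at a time, retrain with identical
    # settings, and record the loss (negative = better without the feature).
    features <- setdiff(names(noun_instances), "class")
    ablation <- sapply(features, function(feat) {
      data_wo <- noun_instances[, setdiff(names(noun_instances), feat)]
      m <- randomForest(class ~ ., data = data_wo, ntree = 100,
                        mtry = min(17, ncol(data_wo) - 1))
      full_acc - (1 - tail(m$err.rate[, "OOB"], 1))
    })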

Table 13: Ablation study: accuracy difference obtained by removal of single feature in the noun set. Features in italics are used for evaluation later on.

Attribute                        Loss    Attribute                        Loss
word 1 degree                    0.52    word 2 pos                       0.48
word 1 closeness                 5.98    word 2 degree                    −0.24
word 1 betweenness               1.19    word 2 closeness                 0.16
word 1 eigenvector centrality    0.68    word 2 betweenness               0.24
word 1 page rank                 1.23    word 2 eigenvector centrality    0.32
word 1 synset degree             3.46    word 2 page rank                 0.36
is-A rel.                        0.78    word 2 synset degree             0.52
areDerivationalRelated           0.52    minDis2SharedHypernym            0.20
sharedLemmas                     0.92    Lin                              0.36
Hirst and St. Onge               0.36    Resnik and Yarowsky              0.36
Leacock and Chodorow             0.48    Jiang and Conrath                0.84
Wu and Palmer                    0.28    geodesic path                    −1.95

The graph-based measures that were proposed in this thesis (see features in italics in Table 13) perform mostly well or even very well. Surprisingly enough, the shortest path, as mentioned above a good indicator of class yes, is better left out: It has a negative impact on the performance. The other semantic similarity measures (see non-italic features in Table 13), which are all based on paths between two vertices, add little to the precision. After this evaluation step, the path-based semantic similarity measures were completely left out and only the graph-based measures were used. This new feature set, now only containing basic network measures such as degree, closeness, betweenness, eigenvector centrality, and page rank, plus the number of shared lemmas in the synsets, the distance to the next common hypernym, if applicable, and the information whether both word forms are derivationally related, results in 90.12% of correctly classified instances. Recall, precision, and F-measure are all around 0.9, as can be seen in Table 14. The semantic similarity measures that were thought to indicate semantic relatedness, and could thus be used to find polysemous word forms in WordNet, do not have a positive impact after all. Using only graph-based measures (see italic features in Table 13) yields a higher accuracy and outperforms classic similarity measures on WordNet. An overview of the model's performance is given in Table 15.

Table 14: Precision and recall of the random-forest algorithm on the noun set using only basic network measures.

Precision   Recall   F Measure   Class
0.904       0.894    0.899       yes
0.898       0.908    0.903       no
0.901       0.901    0.901       weighted average (both)

Table 15: Correctly and incorrectly classified instances in the noun set using only basic network-based measures.

Correctly Classified Instances     2,263   90.12%
Incorrectly Classified Instances   248     9.88%
Total No. of Instances             2,511
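For reference, the per-class values in Table 14 and in the later evaluation tables follow the usual definitions, with TP, FP, and FN counted per class and the weighted average weighting each class by its number of instances:

    Precision = \frac{TP}{TP + FP}, \qquad
    Recall = \frac{TP}{TP + FN}, \qquad
    F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}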

To compare the graph-based features to the CoreLex basic types, every noun was annotated with the appropriate basic type assigned by the CoreLex resource. The assumption is that those basic types show regular patterns of polysemy that the algorithm used for classification should be able to identify. The actual rules proposed by Buitelaar (1998), Utt and Padó (2011), and Boleda et al. (2012) were not used. This is therefore not a direct comparison to those results, but it should give insight into the quality of the features used. The CoreLex approach can only be used to identify regular polysemy. The ratio of correctly classified instances can therefore be expected to drop compared to the graph-based approach. In addition, only nouns can be assigned basic types. All instances of nouns sharing a lemma with another POS will not show any significant patterns.93 This again can be expected to result in a drop of accuracy. Using only the basic types to train a model and evaluating it as before results in 64.99% correctly classified instances.94 This is 25.13 points less accurate than the method

93Even though it was stated by Buitelaar (1998) that a similar resource could be made available for other POS as well, this seems questionable since only nouns employ a deep hierarchical tree. Verbs only form shallow trees, and adverb and adjectives have no hierarchical order at all. 94The numbers given in the following are always the highest possible rates of accuracy of a random-

148 proposed in this thesis.95 As has already been mentioned, the CoreLex basic types cannot be used to classify instances containing other POS than nouns. Using only just-noun instances and no other POSs, and only the basic types as features, results in 66.77% accuracy. Using the graph- based measures instead results in 82.9%. Thus, they are still 16.13 points more accurate. Instead of using the basic types, Peters and Peters (2000) used patterns of WordNet unique beginners to identify polysemous word forms. On the set of instances containing different POS, a model trained using only the unique beginners results in 62.01% of correctly classified instances and is thus outperformed by the CoreLex basic types and the measures of network topology. On only noun instances, it reaches an accuracy of 63.52%. Buitelaar (1998) states that the basic types are more fine-grained and therefore more meaningful than using the unique beginners that were later used by Peters and Peters (2000). Here, using the random-forest model, this statement seems to be valid. As will be shown when looking at other algorithms, this is not necessarily true, at least when using this data set. A combination of the graph-based measures presented in this thesis, the CoreLex basic types, and the unique beginners yields the best results: 93.31% correctly classified instances on the set containing instances of nouns and other POS and 87.9% when only pairs comparing nouns to nouns are used. An overview of all results is given in Table 16.

Table 16: Comparison of feature sets.

Data Set          Attributes          Accuracy (%)   Precision   Recall   F Measure
All Instances     network measures    90.12          0.9         0.9      0.9
All Instances     basic types         61.61          0.62        0.62     0.62
All Instances     unique beginners    62.01          0.64        0.62     0.60
All Instances     all                 93.39          0.93        0.93     0.93
Only Noun Pairs   network measures    82.9
Only Noun Pairs   basic types         66.23          0.67        0.66     0.66
Only Noun Pairs   unique beginners    63.46          0.63        0.64     0.63
Only Noun Pairs   all                 87.9           0.88        0.88     0.88

forest model of 100 trees. The number of randomly selected features varies. 95Different models were trained using different algorithms. Still the random-forest model was the most accurate one.

5.5.1.3 Evaluation: Other Algorithms

In Chp. 4.2, a number of other machine learning algorithms have been proposed. At this point, the introduced algorithms (i.e., support vector machines, naive Bayes classifier, the J48 decision tree, multilayer perceptron (neural network), and logistic regression) are to be evaluated on the different feature sets.

Table 17: Comparison: support vector machines.

Data Set        Attributes          Cost Value   Gamma Value   Accuracy in %
All Instances   network measures    128          8.0           73.64
All Instances   basic types         2.0          0.125         62.37
All Instances   unique beginners    8192         8.0           62.49
All Instances   all                 2048         0.5           81.00

Overall, the results are quite similar to those of the random-forest algorithm with regard to which feature sets work best. Nonetheless, no algorithm comes close to the results obtained using the random-forest algorithm. Using support vector machines with appropriate values of γ (gamma) and c (cost), based on the implementation presented in Chang and Lin (2011), the proposed feature set of only graph-based measures still outperforms the other feature sets by far (73.64% compared to 62.37% for basic types and 62.49% for unique beginners, respectively). Combining all feature sets results in 81% correctly classified instances (see Table 17).
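The cost and gamma values in Table 17 come from a grid search. A sketch of such a search in R with the e1071 package, which wraps LIBSVM (Chang and Lin, 2011); the data frame, column name, and parameter grid are assumptions, not the thesis's actual setup:

    library(e1071)

    # Grid search over cost and gamma for an RBF-kernel SVM on the instance table.
    tuned <- tune(svm, class ~ ., data = noun_instances,
                  kernel = "radial",
                  ranges = list(cost = 2^(1:13), gamma = 2^(-3:3)))
    best  <- tuned$best.model   # e.g., cost = 128, gamma = 8 for the network measures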

Table 18: Comparison: Naive Bayes classifier.

Data Set        Attributes          Accuracy in %
All Instances   network measures    68.42
All Instances   basic types         56.67
All Instances   unique beginners    60.10
All Instances   all                 72.48

Using the Naive Bayes classifier (see Table 18), the graph-based features still perform best. In this case, even the unique beginners perform better than the CoreLex basic types.

The precision is, overall, much lower than that of the random-forest model and even that of the SVM.

Table 19: Comparison: J48 decision tree.

Data Set        Attributes               Accuracy in %
All Instances   only network measures    77.82
All Instances   basic types              60.14
All Instances   unique beginners         62.01
All Instances   all                      81.08

The evaluation of the results obtained from the J48 decision tree (see Table 19) gives a similar picture, although the accuracy is a little higher: graph-based measures perform best, followed by the unique beginners and the CoreLex basic types.

Table 20: Comparison: multilayer perceptron neural network using back-propagation.

Data Set        Attributes               Accuracy in %
All Instances   only network measures    70.37
All Instances   basic types              61.45
All Instances   unique beginners         57.11
All Instances   all                      78.1

Using a neural network, a multilayer perceptron with back-propagation, the network-based feature set proposed in this thesis reaches a value of 70.37% accuracy (see Table 20), compared to the basic types with 61.45% and the unique beginners with only 57.11%. All three feature sets combined reach 78.1% accuracy. The logistic regression could be expected to perform less accurately than the other, more sophisticated algorithms. Indeed, it yields the lowest results. The picture remains similar: The graph-based measures outperform the other proposed feature sets (see Table 21). These short evaluations confirm the results stated above and found using the random-forest algorithm. There is no bias between the feature sets and a specific algorithm; the results are quite comparable over a number of different algorithms. The statement that

Table 21: Comparison: logistic regression.

Data Set        Attributes               Accuracy in %
All Instances   only network measures    62.80
All Instances   basic types              60.57
All Instances   unique beginners         57.15
All Instances   all                      72.48

the CoreLex basic types are better suited to identify regular polysemy than the unique beginners in WordNet can, at least when using a data set containing not only nouns and not only regular polysemic cases, not be confirmed. In the following steps, the algorithms presented here will not be evaluated any further. Only the random-forest algorithm will be used for the remaining POS sets.

5.5.2 Homograph Adjectives

5.5.2.1 Data Set and Baseline

Out of the total 38,917 instances containing adjectives, ≈ 47% contain an adjective as second word form as well; 30.67% of the instances contain a noun, 15.25% contain adjectives and verbs, and 7% adjectives and adverbs. Table 22 gives an overview of both the percentages and the total number of instances. The numbers show that adjectives tend to be more polysemous than nouns.

Table 22: Homograph pairs of the kind adjective ↔ POS.

Part of Speech            Total No. of Occurrences   Percentage
Total No. of Instances    38,917
nouns                     11,938                     30.67%
adjectives                18,319                     47.07%
verbs                     5,934                      15.25%
adverbs                   2,726                      7%

For the test and training set, 1,291 instances have been classified manually. Of these,

1,122 instances were classified as yes, and 169 instances as no. A basic baseline, assigning each instance the most common class (yes), reaches 86.91% of correctly classified instances. A model for machine learning has to top this relatively high value.

5.5.2.2 Evaluation of the Feature Set

The random-forest model was trained using 100 trees and 5 random features at each node of these trees. Interestingly enough, very simple and few pieces of information extracted from the WordNet graph are enough to reach a total of 91.17% correctly classified instances, as can be seen in Table 23. This is a gain in accuracy of 4.26 points compared to the baseline.

Table 23: Correctly and incorrectly classified instances in adjective test set.

Correctly Classified Instances     1,177   91.17%
Incorrectly Classified Instances   114     8.83%
Total No. of Instances             1,291

The features, similar to those used in the noun set, and their impact (i.e., the loss or gain to the model if left out) can be seen in Table 24.

Table 24: Ablation study: accuracy difference obtained by removal of single feature in adjective set.

Attribute                        Loss    Attribute                        Loss
word 1 degree                    0.31    word 2 degree                    0.72
word 1 closeness                 0.85    word 2 closeness                 1.01
word 1 betweenness               0.54    word 2 betweenness               0.00
word 1 eigenvector centrality    0.39    word 2 eigenvector centrality    0.08
word 1 page rank                 0.70    word 2 page rank                 0.47
word 1 synset degree             0.16    word 2 synset degree             0.62
geodesic path                    0.77    word 2 pos                       0.93
sharedLemmas                     1.16    areDerivationalRelated           0.39

Again, the combination of useful attributes is more important than the information

gain of one single attribute. Attributes need to divide the data set in a useful way. Since there exists no single feature that divides the whole data set in the desired way, the goal is best achieved using more attributes that interlock and therefore, in their combination, result in a good model for the data. The difference between the loss values of the features is, compared to the noun set, relatively small. Good performance is shown by the graph measures (i.e., closeness, betweenness, eigenvector centrality, and page rank) of the two vertices involved. Remember that the second word form may be, or is likely to be, of another POS. Especially the closeness of both vertices has a high impact, 0.85 and 1.01. The betweenness of the first word form has a relatively high gain value, while the betweenness of the second vertex involved has no relevance for the model. The impact of the first vertex's graph measures is overall higher than that of the second vertex: The eigenvector centrality and the quite similar page rank both have a considerable gain compared to the same measures of the second word form. Only the degree of the second word form seems to add more information than its counterpart for the first word form. This is also true for the degree of the related synset. While the vertex degree indicates lexical relations, the synset's degree indicates the number of semantic relations a word sense has. The number of shared lemmas of the two synsets involved results in a gain in accuracy. The information whether two word forms are derivationally related yields a relatively small amount of information. Although it can be thought of as a very good indicator of class yes, it is a very rare relation in WordNet. Therefore it only exists in a few instances in the test and training set. The degree values of the synsets have a lower impact than the same measures of the word forms. This could indicate that the lexical relations that result in the degree of the word forms add more to the model than the semantic relations connected to the synsets. There is a considerable correlation between the POS of the second word sense and the likelihood of it being related to an adjective in the first place. This can be seen in Fig. 28: Instances containing nouns are much more likely to be of class no (red) than instances where both word senses belong to the adjective subnet. Unlike the model used in the noun set, the model evaluated here actually gains from using the geodesic path (see Fig. 29 for details) between the two word forms in question. A major difference is that the noun set is connected (i.e., there exists a path from every vertex of the set to any other) while the adjective set is unconnected. Table 25 gives more detailed information on the precision with regard to the different classes.

Figure 28: Instances containing word senses from an adjective and the other given POS and their membership of class yes (blue) or no (red).

Figure 29: Correlation of geodesic path and class (yes/blue and no/red) in the adjective test and training set.

In the adjective set the classification task is even harder than in the noun set, since the no class is strongly underrepresented in the test and training sets, and supposedly in the set of all instances.

Table 25: Precision and recall of the random-forest algorithm on the adjective test set.

Precision   Recall   F Measure   Class
0.923       0.98     0.951       yes
0.778       0.456    0.575       no
0.904       0.912    0.901       weighted average

Admittedly, the results also show the weakness of the model. While it is relatively easy to reach the 86.91% of the baseline, as shown by the high precision and recall for the yes class96, it is much harder to find and identify the relatively few cases of no. This is reflected by the, in comparison to the other class, rather low precision, recall, and F measure. While 78% of the instances that were classified as belonging to no were classified correctly, only 46% of all cases of no were identified in the first place.

5.5.3 Homograph Adverbs

5.5.3.1 Data Set and Baseline

Out of 6,403 instances containing at least one adverb, 17.96% contain an adverb and a noun word form, 42.57% an adverb and an adjective, 10.15% an adverb and a verb, and 29.73% describe homography between two adverbs. Out of these possible instances, 925 have been classified manually: 688 as yes and 237 as no. This results in a baseline, if every instance is assigned the most common class, of 74.38%.

5.5.3.2 Evaluation of the Feature Set

As can be seen in Table 27, the random-forest model trained using 100 trees and 5 random features yields a total of 87.35% correctly classified instances and 12.65% incorrectly

96Remember that the baseline assigns each instance the class yes and therefore reaches a recall of 100%, while the precision is lower due to the fact that all no instances were also classified as yes.

Table 26: Instances of the kind adverb ↔ POS.

Part of Speech            Total No. of Occurrences   Percentage
Total No. of Instances    6,403
nouns                     1,150                      17.96%
adjectives                2,726                      42.57%
verbs                     650                        10.15%
adverbs                   1,904                      29.73%

Table 27: Correctly and incorrectly classified instances in adverb test set.

Correctly Classified Instances     808   87.35%
Incorrectly Classified Instances   117   12.65%
Total No. of Instances             925

classified instances. This outperforms the baseline of 74.38% by 12.97 points. To train the model, the features in Table 28 were used. The geodesic path feature has a relatively high impact on the adverb model, taking into account how sparse the adverb subnet is (i.e., how few of the adverb vertices actually are connected to vertices outside of the subnet). Many vertices are only connected with their synset but not with any other vertex in the network. Therefore, most adverbs have an infinite geodesic path to any other vertex of WordNet. Figure 30(a) shows those instances that actually are connected. The blue crosses are those pertaining to yes; the red ones belong to the no class. As one can see, there is a correlation between yes and a geodesic path lower than around 10. Figure 30(b) shows all instances in the set. An infinitely long geodesic path between two vertices, however, indicates neither yes nor no. A higher degree of a word form (e.g., when it is connected to an adjective it is derived from) indicates that the adverb word form is more likely to share meaning with other word forms. This can be seen in Fig. 31: The higher the degree, the more dominant are the blue dots (class yes) compared to the red ones (those instances classified as no). The same holds for the synset's degree. Again, we find that a word form's or node's closeness is a very good feature. Figure 32(a) shows the closeness values of the first word form and the corresponding classes.

Table 28: Precision difference obtained by removal of the single feature in adverb set.

Attribute                        Loss    Attribute                        Loss
word 1 degree                    0.43    word 2 degree                    0.86
word 1 closeness                 3.35    word 2 closeness                 3.57
word 1 betweenness               0.21    word 2 betweenness               0.54
word 1 eigenvector centrality    0.43    word 2 eigenvector centrality    −0.33
word 1 page rank                 1.51    word 2 page rank                 0.21
word 1 synset degree             1.84    word 2 synset degree             −0.33
sharedLemmas                     0.21    word 2 pos                       −0.11
geodesic path                    0.86    areDerivationalRelated           0.54

(a) Geodesic path of instances plotted against the classes: yes/blue, no/red.
(b) Geodesic path of instances on the x axis, number of instances on the y axis. The colors indicate the classes: yes/blue, no/red.

Figure 30: Classes in adverb set relative to geodesic path.

Figure 31: Correlation of degree (word form 1) and class in the adverb test and training set.

Figure 32(b) shows the same for the second word form. These are only simplified illustrations, with one bar for values lower than 0.5 and one for higher values. Still, one can see that, especially for the second word form, a low value indicates a higher probability of the class yes.

(a) Simplified plot of closeness values and classes in relation to word sense 1.
(b) Simplified plot of closeness values and classes in relation to word sense 2.

Figure 32: Closeness of the two word sense vertices involved in the instances.

The page rank and eigenvector centrality of the first word sense have a high impact on the model trained and evaluated here. Besides the POS feature already mentioned, the synset degree of the second word form involved and the eigenvector centrality of the second word form also have a negative influence on the model's performance.

Table 29: Precision and recall of the random-forest algorithm on the adverb test set.

Precision   Recall   F Measure   Class
0.9         0.933    0.916       yes
0.783       0.7      0.739       no
0.874       0.874    0.871       weighted average

Like the adjective model, this model works better on the yes class than on the instances

classified as no. Details can be found in Table 29. Still, it is an improvement compared to the baseline, which of course only reaches 0.0 points in precision and recall on the no class, since it does not classify any instance as no. The model presented here reaches a precision of 78% on those instances: the instances classified as no are in 78% of the cases actually of the class no. The recall of 70% indicates that 7 out of 10 of all no cases were actually found. For the yes class, the precision reaches 0.9 and the recall 0.93.

5.5.4 Homograph Verbs

5.5.4.1 Data Set and Baseline

As can be seen in Table 30, 47.42% of the instances contain two verbs, 46.47% a noun, 5.5% an adjective, and only 0.6% an adverb. Again, most instances exist between forms of the same POS, while those sharing a lemma with a noun, because of the general dominance of nouns in WordNet, make up the second largest group.

Table 30: Instances of the kind verb ↔ POS.

Part of Speech            Total No. of Occurrences   Percentage
Total No. of Instances    107,715
nouns                     50,053                     46.47%
adjectives                5,934                      5.5%
verbs                     51,078                     47.42%
adverbs                   650                        0.6%

Out of the 107,715 possible instances, 1,019 have been classified manually, resulting in 926 instances classified as yes and only 93 classified as no. Again, using the most common class (yes) and assigning it to every instance as the most basic approach results in a baseline of 90.87%. This pretty high baseline has to be topped by a model trained using the proposed feature set.

5.5.4.2 Evaluation of the Feature Set

Using the random-forest algorithm, 100 trees with 14 random features per vertex of the tree, results in 94.11% correctly classified instances and only 5.89% incorrectly classified instances (see Table 31). It thus outperforms the baseline by 3.24 points.

Table 31: Correctly and incorrectly classified instances in verb set.

Correctly Classified Instances     959     94.11%
Incorrectly Classified Instances   60      5.89%
Total No. of Instances             1,019

Table 32 shows the features that have been used to train the model. Most features have, if removed, only a very small impact. If removed from the set, some even result in a gain of correctly classified instances. The page rank measure works well for both vertices involved in the instances, resulting in a drop of 1.19 points for the first vertex and 1.08 points for the second vertex, respectively. The closeness of the second word sense node has a similar influence. Little impact is seen in the synset degree (i.e., the semantic relations the connected synset has). The number of shared lemmas and the distance to the closest common hypernym are also of little impact. All other features have no or even a negative influence on the model and could be left out to train a model to be used in a real-life scenario. Almost no influence can be found in the geodesic path.

Table 32: Precision difference obtained by removal of the single feature in the verb set.

Attribute                        Loss    Attribute                        Loss
word 1 degree                    0.00    word 2 degree                    −0.10
word 1 closeness                 −0.20   word 2 closeness                 0.88
word 1 betweenness               −0.10   word 2 betweenness               0.00
word 1 eigenvector centrality    0.10    word 2 eigenvector centrality    −0.10
word 1 page rank                 1.19    word 2 page rank                 1.08
word 1 synset degree             0.19    word 2 synset degree             0.29
geodesic path                    0.11    word 2 pos                       −0.10
sharedLemmas                     0.10    areDerivationalRelated           0.00
is-a relation                    0.00    minDis2SharedHypernym            0.10
same verbframe                   −0.10

The semantic similarity measures have been left out in this evaluation since they were shown to have a negative impact on the model in the noun set. Also, the low impact of the geodesic path in the feature set is an indicator that these measures are (almost) no

improvement to the model. Still, a model was trained including the appropriate semantic similarity measures. This second model did not result in a significant gain in precision; only one more instance was classified correctly. Again, the semantic similarity measures can be left out; the network measures turn out to be a good feature set for this classification task. Table 33 shows that in almost 80% of the cases in which the algorithm assigned an instance to class no, the classification was correct. The recall (i.e., the share of instances of the class that were actually identified and correctly classified) is much lower, around 0.48 out of 1.

Table 33: Precision and recall of the random-forest algorithm on the verb set.

Precision   Recall   F Measure   Class
0.95        0.987    0.968       yes
0.789       0.484    0.6         no
0.935       0.941    0.935       weighted average

This can also be seen in the confusion matrix in Table 34: Only 45 out of 93 instances of class no were correctly classified, while most cases of the class yes were classified correctly.

Table 34: Confusion matrix: verb classification.

                          Assigned Class
                          false    true
Actual Class   true         12      914
               false        45       48
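For illustration, the verb-set values reported in Tables 31 and 33 follow directly from this matrix (with true = yes and false = no):

    Accuracy = \frac{914 + 45}{1019} = 94.11\%, \qquad
    Precision_{no} = \frac{45}{45 + 12} \approx 0.789, \qquad
    Recall_{no} = \frac{45}{93} \approx 0.484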

5.6 Results

The structure of WordNet is not arbitrary but formed in a systematic manner that helps to identify new, not yet formalized relations, which can then be added in a coherent way. The sparseness of the network, as well as what we know about other semantic networks, makes the assumption that there are missing connections plausible. The network measures

are therefore a good feature set to classify instances (sets of two nodes together with their features and class) as either showing these new relations or not. A new set of features derived from the network structure was proposed that can be used to identify both regular and irregular polysemy and that can be applied to all four POS present in WordNet. After training, testing, and evaluating the proposed feature set consisting of basic network centrality measures and features derived from WordNet's ontology structure, these have been found to be a very good fit for the task of distinguishing related from unrelated homograph word forms (i.e., polysemous from homonymous word forms). Indeed, the features perform even better than similarity measures and other features that have been proposed to identify polysemy in WordNet.

Table 35: Overview: the different models’ performances.

                  random forest:
Part of Speech    trees   features   Baseline   Model Performance   Gain
nouns             100     17         50.74%     90.12%              39.38
adjectives        100     5          86.91%     91.17%              4.26
adverbs           100     5          74.38%     87.35%              12.97
verbs             100     14         90.87%     94.11%              3.24

After evaluating different classification algorithms on the noun set, only the random- forest model was used for training and testing of the other models. Furthermore, the noun set was used to compare the network-based features that have been proposed to identify polysemy in WordNet (i.e., the WordNet unique beginners (cf. Peters and Peters, 2000) and the CoreLex basic types (Buitelaar, 1998)). These features come with some down sides: They can only be applied to the hierarchical structure of the noun set and not to adverbs, adjectives, and also not readily to verbs. Furthermore, they are only thought of as indicating regular polysemy, thereby treating irregular polysemy as mere homonymy. Regarding the evaluation of the proposed feature set, an overview of the results, the baseline that had to be topped, the actual percentage of correctly classified instances, and the resulting gain compared to the baseline, is given in Table 35. The gain values seem to vary considerably. This is related to the very different baselines: While only every second homograph instance containing a noun is regarded as being related (i.e., can be called

polysemous), over 90% of the instances containing a verb are of the class yes and hence polysemous. Therefore the range of possible out-performance of the baseline varies. In the end, all models reach an accuracy of around 90%, regardless of the baseline. The verb set reaches 94.11%, the adverbs reach 87.35%, while nouns, 90.12%, and adjectives, 91.17%, lie between these values. When looking at the evaluation of the single features, one can find some features that work consistently well. The degree of either the vertex itself (i.e., the number of lexical relations a word form has) or of the related synset (i.e., the number of semantic relations a word sense has) often has a positive influence on the performance. The number of shared lemmas, if > 1, of the two involved synsets often indicates a close relation between the two synsets and therefore indicates polysemy. The page rank was found to be helpful in the adverb set as well as in the verb set, while the geodesic path between the two vertices involved had almost no impact on verbs, a negative impact on nouns, and a positive impact on adjectives and adverbs. Before evaluating the data sets, the geodesic path and related measures of semantic similarity that have been proposed and used for similar tasks, e.g., in Snow et al. (2007), were thought of as indicating the relatedness and therefore the non-arbitrary homography of two word forms in WordNet. After the evaluation of the different sets, this cannot be confirmed: Both the verb and especially the noun set do not profit from using the geodesic path as a feature. Furthermore, the noun set model's performance even worsens when the semantic similarity measures are used. The closeness values of one or both nodes have the highest impact. The closeness indicates the close neighborhood of a vertex to any other vertex. A high closeness value therefore indicates a good connectedness (i.e., relatively short geodesic paths to any other vertex in the network). This does not, however, mean that a high closeness is necessarily an indication of either polysemy or homonymy. Since the random-forest model is a black box consisting of many independent decision trees that are combined to classify an instance, I tried to estimate both the impact a single feature has and how this feature might be used to classify an instance. Different features can be combined to get results the single features cannot offer. In this sense, it cannot be said whether a high or low closeness is generally a good indicator for either of the two classes, only that the closeness is an important feature in identifying the correct class. Basically, using the proposed graph- and ontology-based features, non-arbitrary homography between different word senses could be identified in WordNet.
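For convenience, the two most influential measures can be recalled in their standard form (up to normalization conventions), with d(u, v) the geodesic distance, σ_{st} the number of shortest paths from s to t, and σ_{st}(v) those passing through v:

    C_C(v) = \frac{1}{\sum_{u \neq v} d(u, v)}, \qquad
    C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}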

(a) Nodes with the lemma bank and their interconnections before applying the machine learning model presented here.
(b) Nodes with the lemma bank and their interconnections after applying the machine learning model presented here.

Figure 33: Changes in the WordNet graph: connected and unconnected word forms of bank.
Using the same or similar network-based features, it should also be possible to predict other relations between word forms or word senses in WordNet, as well as in other ontologies. For example, the so-called cousin relations are not consistently used in every possible case. The approach presented here, using network-based features, might also be used to identify cases where these relations are missing in WordNet. Finally, looking at Fig. 33, which shows a very small subset of WordNet nodes, one can see the impact the newly found edges have on WordNet. In Fig. 33(a), the interconnections of the word forms with the lemma bank in WordNet are shown. Four vertices are unconnected to the others. In contrast, Fig. 33(b) shows the same subset with the newly identified relations of semantic relatedness. Now all forms of bank that are semantically related are directly connected. Furthermore, one can see that two components emerge: one (red) with those forms of bank related to the financial system, the other one (grey) related to a slope (of land). There are no interconnections between the components, and the four nodes that were previously unconnected to the rest of the graph are now connected to the correct component.

6 DBpedia: Analyses and Predictions

6.1 DBpedia: an Ontology Extracted from Wikipedia

6.1.1 Knowledge Bases

The meaning of a word is not solely defined by its semantic or lexical relations such as those used in WordNet. Looking at WordNet, we know that a bird is a vertebrate and hence an animal. We do not know that birds are animals that fly, have feathers, lay eggs, sing, and are related to dinosaurs. Not all of this is part of the meaning of the word, but it is the knowledge (almost) every speaker of English has about birds. A computer does not have this information. Early on (e.g., in Minsky (1974)) it was stated that an intelligent machine must have this kind of information at hand, and it is similar to what Fillmore (1975) proposed to call the frame of a lexical unit. To have access to this and similar information, a computer needs a special database called a knowledge base (KB). A KB contains some kind of information in an organized or structured way. A triple store, for example, contains triples of the kind subject–predicate–object. Often these triples follow an underlying ontology that describes implications of relations or what kinds of objects can be connected by certain relations. KBs contain either specific domain knowledge or general knowledge. Depending on the purpose of the NLP or AI system one is planning to build, one might need different kinds of KBs for different purposes. In the WordNet use case, it was shown that different similarity and distance measures taken from graph and network theory can be used to train a machine learning model to predict new relations from a set of manually made classifications. The task presented in this chapter differs not only in the usage of a different ontology, DBpedia, but also in its scope. Since DBpedia is expected to be very sparse and to miss connections, the assumption is that by looking at the network and ontology structure of DBpedia it should be possible to fill the existing gaps in the KB.
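As a minimal illustration of the triple format, the bird example above could be stored as follows (the predicate names are invented for illustration and are not actual DBpedia properties):

    # Toy triple store: subject-predicate-object rows.
    triples <- data.frame(
      subject   = c("Bird", "Bird",        "Bird"),
      predicate = c("isA",  "hasBodyPart", "laysEggs"),
      object    = c("Animal", "Feather",   "true"),
      stringsAsFactors = FALSE
    )
    # Simple lookup: everything this KB states about "Bird".
    subset(triples, subject == "Bird")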

6.1.2 What Is DBpedia?

DBpedia (DB stands for database) extracts structured information from Wikipedia to make this source of knowledge accessible, inter-linkable, and especially machine-readable. By assigning one identifier and a stable URI to each entity, nets of linked open data are established that can be used to enrich existing information by freely adding

new relations. Exploiting the availability of Wikipedia in different languages, it is multilingual and connects different information from different languages under one common identifier. Furthermore, it links to entities outside Wikipedia/DBpedia and offers links to other knowledge bases (WordNet, YAGO, Freebase, the Gemeinsame Normdatei97 (GND), GeoNames, and others). Unlike many other sources of knowledge, DBpedia is domain independent and covers all the areas that are covered by Wikipedia. Offered in this structured form, Wikipedia data can be used as a knowledge base in different kinds of tasks, such as semantic search (cf. Perez-aguera et al., 2010), named entity disambiguation (cf. Mendes et al., 2011), or the disambiguation of geographical identifiers (cf. Volz et al., 2007).

6.1.3 Extraction of Data from Wikipedia

Most information in DBpedia is taken from Wikipedia. Wikipedia consists in large part of unstructured data in the form of free text, but almost every Wikipedia page also contains structured data in the form of so-called infoboxes. These are often displayed in the upper right corner of a Wikipedia page (but can also appear in other places). The infoboxes use underlying templates that transform data given in the form of a table, similar to key-value pairs, into what is later displayed to the user. These key-value pairs can take a form such as

(25) birth date = {{Birth date|1940|10|9|df=yes}}, which is taken from the page John Lennon. During the process of creating DBpedia, all wikipages are parsed and such structured information is transformed into triples of the kind John Lennon birthdate 1940-10-9.98 Unfortunately, there is no single infobox template that is used consistently across all pages or languages in Wikipedia. Instead, there are different templates that can be used to describe entities. For example, when describing a person, the birth date can be expressed as birthdate or as dateofbirth. The name of a wikipage is unique and can hence be used as an identifier. Each wikipage contains an abstract that is also available as free text in DBpedia. Links to versions in different languages are created according to Wikipedia's language links.

97English: Integrated Authority File. 98This is only a simplified presentation. In the real data set these objects consist of full URIs.
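As an illustration of the extraction step described above, the following sketch turns an infobox line of the form shown in (25) into a triple. It is a simplified approximation, not the actual DBpedia extraction framework; the key-to-property mapping and the helper name are assumptions made for the example:

import re

# Map diverging infobox keys onto one canonical DBpedia-style property
# (the mapping-based extraction described in the text does this with manually
# created mappings; the keys below are just the examples mentioned above).
KEY_MAP = {"birth date": "birthDate", "birthdate": "birthDate", "dateofbirth": "birthDate"}

def infobox_to_triples(page_title, infobox_source):
    """Turn 'key = {{Birth date|Y|M|D|...}}'-style lines into triples."""
    triples = []
    for line in infobox_source.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        prop = KEY_MAP.get(key.lower())
        if prop is None:
            continue
        # Pull the date parts out of the {{Birth date|Y|M|D|...}} template.
        m = re.search(r"\{\{Birth date\|(\d{4})\|(\d{1,2})\|(\d{1,2})", value)
        if m:
            value = "-".join(m.groups())
        triples.append((page_title, prop, value))
    return triples

print(infobox_to_triples("John_Lennon", "birth date = {{Birth date|1940|10|9|df=yes}}"))
# [('John_Lennon', 'birthDate', '1940-10-9')]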

Wikipedia furthermore contains redirect pages (i.e., pages that point to a page with a different name). This is how Wikipedia handles synonyms, and this can of course also be exploited in DBpedia. Correspondingly, Wikipedia contains disambiguation pages that point from one name to different pages that represent different entities with the same name. This way homonyms are handled, identified, and disambiguated. Geodata (i.e., coordinates of places from Geonames99) is available and stored in DBpedia. Wikipedia also contains a rich set of hierarchical categories and links to resources such as YAGO that contain their own categories. All these categories are available in DBpedia as classes or types. Furthermore, DBpedia is based on an ontology that defines its own category system. The DBpedia data set contains several subsets of different sizes. When the term DBpedia data set is used henceforth, it refers to the so-called mapping-based version. While the generic infobox extraction creates triples as seen above, the mapping-based approach maps all information found in an infobox to the corresponding DBpedia relation if possible. The different templates using birthdate or dateofbirth are both mapped to one DBpedia relation. Since not all information can be mapped, this data set is a little smaller than the generic data set. Still, it is connected to the underlying ontology and does not contain ambiguous relations.

6.1.4 DBpedia Ontology

The mapping-based data set describes 843,169 entities from 170 classes in over 7 million triples using 720 properties described in the DBpedia ontology (Bizer et al., 2009). The classes are organized in a shallow hierarchy: The class Person, for example, subsumes the class Artist, which is the superclass of Actor and MusicalArtist, among others. Subclasses inherit properties of their superclasses. An entity of the class Actor is a Person as well. Other upper-level classes are Place, Organisation, and Work. A member of the class Person can have, among others, the properties name, birthdate, birthplace, or spouse. Artists have activeyears, awards, genre, and many more properties. Some of these classes and their relations can be seen in Fig. 34. An Artist inherits the relations from its superclass, and every instance can have the properties of a Person. But not every Person can have the properties of an Artist. The relations can have exact range and domain restrictions; the birthplace relation, for example, is restricted in this way.

99http://www.geonames.org.


Figure 34: Exemplary graph: DBpedia classes and instances and their interrelations.

The OWL definition of the birthplace relation can be seen in Listing 2:

Listing 2: XML/OWL declaration of the birthplace relation
This OWL definition can be read as follows: The relation http://dbpedia.org/ontology/birthPlace (which is a URI) can only point from a member of the class http://dbpedia.org/ontology/Person to an instance of the class http://dbpedia.org/ontology/Place. In other words, only a person can have a birthplace, and only a place can be a birthplace. Fictional characters and animals do not have this relation (according to the DBpedia ontology). The domain and range restrictions as well as the whole ontology were manually created from Wikipedia infobox templates.
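Although the XML of Listing 2 is not reproduced here, the domain and range statements it expresses can be sketched as RDF triples. The following example uses the Python rdflib library to state exactly the three facts paraphrased above; it is an illustrative reconstruction, not the DBpedia ontology source itself:

from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
# birthPlace is an object property ...
g.add((DBO.birthPlace, RDF.type, OWL.ObjectProperty))
# ... that may only start at a Person ...
g.add((DBO.birthPlace, RDFS.domain, DBO.Person))
# ... and may only point to a Place.
g.add((DBO.birthPlace, RDFS.range, DBO.Place))

print(g.serialize(format="turtle"))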

6.1.5 Similar Approaches

The DBpedia project is not the only effort undertaken to make the knowledge and information contained in Wikipedia machine-accessible and machine-readable. Freebase

(Bollacker et al., 2007) and YAGO (Mahdisoltani et al., 2015) are two other knowledge bases that exploit Wikipedia to offer structured information. YAGO is quite similar to DBpedia in its design and coverage. Just like DBpedia, it parses Wikipedia to extract semi-structured data from links, redirection pages, and infoboxes. Just like DBpedia, it works with an underlying ontology that defines classes, relations, and so on. DBpedia is available in different languages. YAGO takes the idea of a multilingual knowledge base a step further and does not offer different versions of its data in different languages but parses the multilingual versions of Wikipedia to extract information from them and to combine the information found across different languages in one data set. In other words, if certain information only exists in one version of Wikipedia (e.g., only in German for a certain German actor or city), this information is not available in the English version of DBpedia. YAGO aims at combining the different information from different Wikipedia sites and adds new entities as well as new facts (Mahdisoltani et al., 2015). Freebase is a "collaboratively built, graph-shaped database of structured general human knowledge" (Bollacker et al., 2007, p. 1962). As such it contains tuples and supports querying the database using the Metaweb Query Language (MQL), just as DBpedia does using SPARQL. Furthermore, Freebase offers an API not only to read but also to write data to Freebase. Unlike DBpedia or YAGO, Freebase does not directly exploit information given in Wikipedia. The information is collaboratively assembled and can come, and does come, from many different sources. Freebase is not based on one underlying ontology as DBpedia and YAGO are. In contrast, it offers the possibility to "design . . . simple types and properties" (Bollacker et al., 2007, p. 1963). Hence, different types and properties exist that reflect different views on objects and facts at the same time. The downside of this approach is the inability to conduct logical reasoning on the data. Still, categorizing all possible knowledge in one coherent ontology would have been a very ambitious task, especially when working collaboratively. Freebase is open source. In 2010 it was acquired by Google and serves as a basis for Google's Knowledge Graph.100

100In December 2014, it was announced that Freebase would be closed. Existing data are meant to be transferred to the similar Wikidata project.

6.1.6 Shortcomings of DBpedia

DBpedia is an automatically extracted, non-domain-specific ontology, and its data set, the ABox, is incomplete. This is a problem that exists in many ontologies.


Figure 35: Connectivity between the four Beatles in DBpedia.

In Fig. 35, a small and simplified example taken from DBpedia can be seen. It shows the small social network of The Beatles (i.e., only the members of The Beatles and their direct relations to each other). It can be seen that the shortest distance between John Lennon and Paul McCartney on the one side, and George Harrison and Ringo Starr on the other side, is 2, while Paul McCartney and John Lennon are directly connected (shortest path of length 1). They are all connected through The Beatles. One would expect these nodes to be more interconnected. We know from our experience and external knowledge that they are connected, not just through their membership in The Beatles. In Wikipedia, the corresponding pages are more interlinked. This information is apparently missing in DBpedia. Their connection to The Beatles nonetheless tells us that they have to be connected in other ways as well. Possible properties for these connections within the DBpedia ontology are associatedAct or associatedMusicalArtist. As has been shown for the Facebook data, people that are closely connected share common attributes (e.g., all the Beatles are from Liverpool, they are likely to know Yoko

Figure 36: The Beatles and their neighbors in DBpedia.

Ono, they are English singers, and they all worked with Apple Records). This information is also not complete in DBpedia. All this seems obvious to a human reader, but it is not to a computer, the user of the DBpedia ontology. Figure 36 provides a picture of the characteristics the four Beatles actually share according to DBpedia.

6.2 Network Analysis

The data set will be examined to find the typical properties of a non-random, or complex, network. These properties can be used to calculate new, not yet available relations between the entities. As mentioned before, there are several characteristics of natural networks that arise in different domains and distinguish these from random networks. The distinguishing marks form a formal basis not only for the analysis of networks but also for predictions, classifications, and other computational tasks performed on network data. Steyvers and Tenenbaum (2005) mention five characteristics: sparsity, connectedness, short path lengths, neighborhood clustering, and power-law degree distribution. If the relatedness of the network of entities of the entire ontology is similar to that usually found in natural social networks, techniques used to compute attributes in a social network might also be employed to calculate attributes of other entities in the ontology. The degree distribution of many natural, complex networks follows a power law of the form P(k) ∼ k^{−γ}.
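One straightforward way to estimate the exponent γ from a degree sequence is a least-squares fit on the log-log scale, as sketched below. The sketch runs on a synthetic scale-free graph because the full DBpedia graph is too large to reproduce here; whether the fits in Fig. 37 were obtained exactly this way is not specified, so the snippet is only meant to illustrate the idea:

import numpy as np
import networkx as nx

# Illustration on a synthetic scale-free graph; for the analysis in the text,
# the degree sequence would come from the DBpedia entity graph instead.
G = nx.barabasi_albert_graph(n=10000, m=3, seed=1)
degrees = np.array([d for _, d in G.degree()])

# Empirical degree distribution P(k).
ks, counts = np.unique(degrees, return_counts=True)
pk = counts / counts.sum()

# P(k) ~ k^(-gamma)  =>  log P(k) = -gamma * log k + c,
# so a straight-line fit on the log-log scale gives a rough estimate of gamma
# (maximum-likelihood estimators are preferable in practice).
slope, _ = np.polyfit(np.log(ks), np.log(pk), deg=1)
print(f"estimated gamma: {-slope:.2f}")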

The in-degree distribution, plotted on a log–log scale in Fig. 37(a), fits the power law with γ = 1.91 quite well. The out-degree plot (see Fig. 37(b)) is more skewed and does not fit the power law with γ = 3.79 as well. Ignoring the directedness, as seen in Fig. 37(c), the degree distribution is quite close to the power law with γ = 2.11. This distribution is caused by hubs or authorities, vertices with a very high degree. These hubs also cause a small average geodesic path l. Owing to the growth process of complex networks, small-world networks expose a clustering coefficient C (i.e., the sum over the number of links between any vertex and the vertices its direct neighbors are connected to), which is, on average, relatively high.101 These measures, l, γ, and C, have been analyzed in Zlatić et al. (2006) for a number

101This means it is expected to be higher than in a random network.

Figure 37: DBpedia degree distribution. (a) In-degree distribution with fitted power law (green line), γ = 1.91. (b) Out-degree distribution with fitted power law (green line), γ = 3.79. (c) Degree distribution ignoring direction, with fitted power law (green line), γ = 2.11.

Network      l       γ       C
Wikipedia    3.28    2.37    0.0098
DBpedia      5.5     2.11    0.0003

Table 36: The average path length l, the power-law exponent γ, and the clustering coefficient C for Wikipedia and DBpedia. Wikipedia values for l and γ are taken from Zlatić et al. (2006); C is given in Mehler (2006).
of Wikipedia sites of different languages, and, overall, the study shows that the reported features are quite similar for Wikipedia sites of different languages. They also report the tendency that, with a larger number of vertices, the clustering coefficient drops. The in- and out-degree distribution of Wikipedia, as well as the undirected degree distribution, is overall very similar to the findings for DBpedia above. Table 36 shows an overview of the different measures, the γ values, the average path length l, and the clustering coefficient C, for the English Wikipedia, which is also the basis for the DBpedia version used here. Only the numbers for an undirected network, ignoring the direction of the edges between the vertices, are shown. The numbers for the clustering coefficient C are taken from Mehler (2006). Comparing these values of Wikipedia to those of DBpedia shows that the numbers do not match each other. The average shortest path, for Wikipedia only 3.28 edges, is longer in DBpedia but not unlikely for an SWN: l = 5.5.102 The clustering coefficient is calculated to be 0.0003 in DBpedia (see Table 36) and is hence very low. The C value of Wikipedia is around 30 times higher. Although smaller differences can be expected from the different structures of Wikipedia and DBpedia (e.g., DBpedia introduces types that are not represented in the Wikipedia structure and some other links to other ontologies, such as the friend-of-a-friend ontology), the differences, especially for l and C, are quite large, and all of them point to one underlying problem: missing links.

102This is also shown when one looks at the diameter (i.e., the longest shortest path between any two vertices in the network), which equals 41 edges in length.
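Both measures compared in Table 36 can be computed with standard network libraries. The following sketch shows the computation with networkx on a synthetic small-world graph standing in for the (much larger) undirected DBpedia entity graph:

import networkx as nx

# Toy stand-in for the undirected DBpedia entity graph; in practice the
# measures were computed on the full graph, which requires sampling or
# approximate algorithms because of its size.
G = nx.watts_strogatz_graph(n=2000, k=6, p=0.05, seed=1)

# Average clustering coefficient C: how strongly a vertex' neighbors are
# themselves interconnected.
C = nx.average_clustering(G)

# Average geodesic path length l, computed on the largest connected
# component (the measure is undefined for disconnected graphs).
giant = G.subgraph(max(nx.connected_components(G), key=len))
l = nx.average_shortest_path_length(giant)

print(f"C = {C:.4f}, l = {l:.2f}")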

6.3 Approach to Link Identification in DBpedia

6.3.1 Related Work

Much of the information that is contained in Wikipedia is lost during the process of extracting it from Wikipedia to DBpedia. Different approaches to enriching DBpedia are conceivable. For example, extracting relations from free text in Wikipedia using pattern recognition techniques (e.g., Blessing and Kuhn, 2014) might be used to add new triples to the DBpedia data set. Approaches to enriching ontologies automatically, based solely on the ontologies themselves, have been proposed. This field of research is called ontology mining and aims at discovering underspecified knowledge in ontological knowledge bases (d'Amato et al., 2010). Due to the great workload that is necessary to build high-quality ontologies, methods have to be found to make knowledge more accessible even from data that do not explicitly contain this information. Recent progress in this field includes the automatic induction of a schema by statistically analyzing relational data sets and thereby inducing the ontology and its rules based on observed interactions of entities (Völker and Niepert, 2011). Patterns of links between instances are extracted that can be used to heuristically find information on the type of an instance and possible relations a type might have (Paulheim, 2012). Paulheim and Bizer (2013) further show that the classical inference system does not work well on DBpedia, and they therefore propose an inference system that takes probabilities into account. Paulheim and Bizer (2013) find that inferring the types of entities using reasoning on the RDF data of DBpedia leads to false conclusions. The existence of only one false statement results in nonsensical types of entities. For example, "there are more than 38,000 statements about dbpedia:Germany, i.e., an error rate of only 0.0005 is enough to end up with . . . a complete nonsensical reasoning result" (Paulheim and Bizer, 2013, p. 2). This means that, even if the RDF data are over 99.9% correct, reasoning could still produce wrong results. Also, reaching 99.9% is almost impossible in the first place. Paulheim and Bizer (2013, p. 2) state:

These problems exist because traditional reasoning is only useful if (a) both the knowledge base and the schema do not contain any errors and (b) the schema is only used in ways foreseen by its creator. . . . Both assumptions are not realistic for large and open knowledge bases.

Since certain relations only occur with defined types (e.g., due to range/domain restrictions), the types should be inferable from the relations an entity has. Therefore, the probability of an entity belonging to a certain class based on the observations of relations is calculated. If only very few relations (e.g., one of the above-mentioned 38,000) predict a certain type, the confidence or probability will be very low. Since many relations might be overrepresented in the data set, the probability is weighted based on the deviation of the property in the whole data set. The approach is hence a probability-based kind of inference and reaches a precision of up to 0.99. The idea presented in Völker and Niepert (2011) is to use the ontology's ABox, i.e., the relational data, to conclude how these relations must be defined. In other words, the assumption is "that the semantics of any RDF resource, such as a predicate for example, is revealed by patterns we can observe when considering the usage of this resource in the repository" (Völker and Niepert, 2011, p. 129). To identify these patterns or rules, association rule mining, namely the Apriori algorithm, is used. This assumption is tested by mining DBpedia's ABox and evaluating the resulting rules against DBpedia's TBox. Since the "expressivity of the DBpedia ontology is relatively low (it equals the complexity of the ALF (D) description logic)" (Völker and Niepert, 2011, p. 134), and the rules are tested against DBpedia itself, it can be assumed that not all false classifications would actually be false but could just be missing in the ontology itself. This approach reaches a precision of up to 0.854, while having a low recall of only 0.258. In Paulheim (2012), missing types of instances are calculated by looking at the co-occurrence of types. While a large and very expressive ontology could be used to infer types using the axioms of the ontology, "most of the ontologies used in Linked Open Data do not provide that rich level of axiomatization" (Paulheim, 2012). Hence, the types cannot be inferred from the ontology. To overcome this problem and to add missing types to the data set of DBpedia, the author employs the Apriori algorithm. Given the fact that an entity is assigned the class Singer and the class AmericanMusicians, one might like to infer the fact that the entity is also of the type AmericanSinger. Using association rule mining, a pattern or rule can be identified that states the following: if an entity e1 is a member of the classes c1 and c2, it is also a member of class c3. To make sure the rules conform to the ontology, rules are checked for their plausibility (e.g., if any class that is to be assigned is disjoint from any already explicitly assigned class of the entity, this class cannot be assigned). Also, classes are only inferred in a combination observable in the data. As the author states, there are no entities

that are both an Athlete and an AmericanMusician, and therefore no rule predicting this combination would be accepted. The system reaches an accuracy between 82.9% and 85.6% (Paulheim, 2012, p. 5). The author concludes that a similar system could be defined that predicts "missing statements other than types" (Paulheim, 2012). This is exactly what will be done in the following, based on both the structure of the ontology and the network, taking into account the assumed evolution of the network. Given the small-world-like structure of Wikipedia and the differing structure of DBpedia, one can assume that DBpedia could be, in its structure, much more like a social network. Hence, some of the methods taken from SNA mentioned above might be applicable to DBpedia as well. The approaches to type inference proposed in Völker and Niepert (2011), Paulheim (2012), and Paulheim and Bizer (2013) are based on DBpedia as well. The usage of association rules and especially the Apriori algorithm as proposed in Völker and Niepert (2011) and Paulheim (2012) seems promising. Still, keeping in mind that the general prediction of missing statements is a broader field than the prediction of missing types, the accuracy to be expected from a similar approach based on the Apriori algorithm can be assumed to be considerably lower. In the following, a method to predict potential relations between entities of DBpedia based on the ontology structure and some general assumptions about prediction in complex networks is proposed and the underlying assumptions explained.

6.3.2 Identifying Missing Links and the Network Structure of DBpedia

Most of DBpedia is constructed using structured information given in Wikipedia infoboxes. This leaves out all the unstructured information, including the links between Wikipedia pages, since these links cannot easily be matched to the relations given in the DBpedia ontology. To identify these missing links, some of the assumptions about networks and DBpedia presented above are combined to formulate an approach that finds missing connections between vertices of the DBpedia graph with a very high precision. In a first step, a network-based method for identifying vertices that are likely to be connected is presented; afterwards, the same network-based approach is used to predict the kind of relation missing between these vertices.

Looking at the research done in SNA, one can find some clues to link prediction that might be applied to ontologies as well. First, one can see that a profound analysis and understanding of the data helps in finding ways to predict unseen structures. Different notions of similarity (e.g., a short geodesic path between two vertices and common neighbors) are applied to find those vertices most likely to be missing a connection or to be connected in the future. As has been shown, link prediction has been an active field of research in the last decade. The dawn of online social networks, such as Facebook or Twitter, and of co-authorship graphs enables researchers to conduct SNA on a new scale. Still, as stated in Clauset et al. (2008, p. 99), "our current pictures of many networks are substantially incomplete". When looking at the comparison of the Wikipedia graph and the DBpedia graph above, one can extend this statement: our current pictures of many ontologies are substantially incomplete. As said before, the difference between the clustering coefficients of Wikipedia and DBpedia leaves room for improvements to DBpedia. While the low clustering coefficient of the WordNet graph was, in combination with the analysis of the ontology structure, an indication that the neighbors of a vertex do not play a vital role in WordNet when it comes to link prediction, the conclusion drawn from the numbers here has to be different. The structure of DBpedia is sparser than that of Wikipedia, and the difference between the two is assumed to consist of missing links. As the Beatles example above shows, there are missing connections. The network structure is of course a result of how the ontology is expected to develop over time (i.e., its evolution): The power-law distribution, the average geodesic path, and the clustering coefficient indicate an evolution model based on preferential attachment with some kind of variance that causes neighbors to be interconnected, like the model given by Steyvers and Tenenbaum (2005). Hence, new vertices are connected not only to those vertices with a high degree but also to the vertex' neighbors. These second-degree neighbors that are not yet connected to the vertex in question are possible candidates for the classification of a missing connection. A basic assumption that will define which vertices are to be checked for missing relations in DBpedia can be formulated: Given a vertex A, its neighbors (e.g., B), and second-degree neighbors (e.g., C), the higher value for l and the relatively small value of C indicate that the probability of A being connected to C is higher than the probability of it being connected to a random vertex of the graph.
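A minimal sketch of this candidate selection, enumerating all pairs of second-degree neighbors that are not yet directly connected, could look as follows (edge direction is ignored, and the node names are only a simplified illustration):

import networkx as nx
from itertools import combinations

def candidate_pairs(G):
    """Yield pairs (A, C) that share at least one neighbor B but are not
    themselves connected, i.e., the unconnected triads described above."""
    seen = set()
    for B in G.nodes():
        for A, C in combinations(G.neighbors(B), 2):
            if G.has_edge(A, C):
                continue
            pair = tuple(sorted((A, C)))
            if pair not in seen:
                seen.add(pair)
                yield A, C

# Tiny illustration with simplified node names:
G = nx.Graph()
G.add_edges_from([("John_Lennon", "The_Beatles"), ("Paul_McCartney", "The_Beatles"),
                  ("Ringo_Starr", "The_Beatles")])
print(sorted(candidate_pairs(G)))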


Figure 38: Unconnected triad with assumed connection (dotted edge) between two vertices A and C.

Often the vertices A and C will not have only one common neighbor, here B, as the simplified Fig. 38 might suggest, but they can have many common neighbors. Following this assumption, unconnected triads from DBpedia will be classified as either missing a connection between A and C or not. What appears to be a quite straightforward task is hindered by the fact that DBpedia is incomplete. The unconnected triads contain false negatives (i.e., cases where there should be a connection but where there is none). This complicates the application of machine learning techniques, since noisy data have to be cleaned to get a data set to draw conclusions from. To achieve this goal, all connected triads, where A and C are in fact connected, are extracted from DBpedia, ignoring the unconnected and therefore possibly erroneous instances. Using the relations between the three vertices (i.e., the relation from A to B (rel(A, B)), from B to C (rel(B,C)), as well as the connection rel(A, C) from A to C), association rules are mined using the Apriori algorithm. These rules are of the kind if rel(A, B) equals x and rel(B,C) equals y, then rel(A, C) equals z. Furthermore, each rule is assigned a confidence value, which indicates the probability with which the rule predicts the assigned class (i.e., the relation between A and C). The rules obtained are applied to the unclosed triads, where A and C are unconnected. In the first step, it will be evaluated whether a rule can be applied and, if so, whether A and C should be connected. The problem here is that while there is a good set of connected cases in DBpedia, there is no such set for unconnected cases, since the yet unconnected cases are believed to be erroneous. Afterwards, in a second step, the cases that are missing a connection are assigned a fitting relation and the results are evaluated. In the following, the different data sets to be used in the different tasks and evaluations are introduced. The different data sets are necessary to predict both where connections

are missing and what kind of connection or relation is missing. Furthermore, different machine learning algorithms use different kinds of features. Afterwards, these data sets will be used in different setups. Two unsupervised methods for predicting missing connections are to be tested: the Apriori algorithm, which is often used in ontology mining and has already been applied to DBpedia, and the k-means clustering algorithm. For the link classification task, the Apriori algorithm will be used.103

6.4 The Data Sets

DBpedia's ABox can be downloaded in the form of triples of the kind subject-predicate-object.104 These triples follow the underlying DBpedia ontology and make up a network of relations between the entities in DBpedia.

Data Set dbp0

To build a data set as a basis for further investigation and evaluation, the mapping-based DBpedia set has been cleaned (i.e., all relations not belonging to the DBpedia ontology have been removed, and all instances that have no correspondence in Wikipedia105 will be ignored in the evaluation process).

Data Set dbp1

This data set will be used to train a machine learning algorithm to identify whether two second-degree neighbors in the network should be connected. Two vertices A and C that share at least one common neighbor, e.g., B, are extracted. The relations from A to any member of the set of common neighbors of A and C (i.e., the intersection of the neighbor sets Γ(A) and Γ(C)), and from any member of this set to C, are stored and will be used to find patterns that predict the connectedness of the triad. One item hence consists of the set of relations rel(A, B) and the set of relations rel(B,C).

103Supervised methods are less feasible. Although there are many relations in the DBpedia data set, these are not fit for training a supervised algorithm. The reason is that the data set contains a lot of noise in the form of missing information (i.e., false negatives). 104Here, Version 31 was used. It can be downloaded from http://wiki.dbpedia.org/Downloads31. 105In many cases, an entity given in DBpedia cannot be mapped to a Wikipedia page. DBpedia contains special entities that represent, for example, career steps or life phases of athletes. These support entities do not exist in Wikipedia and can hence not be evaluated. Therefore, these entities, or rather items or triads containing such entities, were left out of the process.

These sets contain at least one relation but may very well contain many more relations from and to different common neighbors of A and C. To process these data, the sets of relations are transformed into a vector model: If relation x exists between A and any B, it is counted as 1; otherwise the value is 0 or a missing value. The same is done for all relations rel(B,C). In other words, for any pair of A and C, the data set contains one vector indicating the existence of a certain relation between A and any common neighbor, and one vector containing the relations from any common neighbor to vertex C. Since no rules based on the absence of relations are desired (there would be thousands of these), the vector model is altered in such a way that only the connected values remain and the unrelated ones are treated as missing. Thus the algorithm does not try to find rules containing non-existing, 0, or missing values. In total, the data set contains 917 attributes (i.e., different possible relations rel(A, B) and rel(B,C)). For evaluation purposes, a local installation of the Wikipedia database was used to determine which other pages a Wikipedia page, corresponding to a DBpedia entity, links to.106 This data is not used for training, but only to estimate the quality of the results. The dbp1 data set was extracted as both a training set with 272,566 items and a test set with 721,299 items.
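The construction of such an item can be sketched as follows; the attribute prefixes AB_ and BC_ as well as the example relations are assumptions made for the illustration, not the actual attribute names of dbp1:

def triad_item(rels_ab, rels_bc, all_relations):
    """Build one dbp1-style item: for every possible relation, mark whether it
    occurs between A and any common neighbor B (prefix 'AB_') or between any
    common neighbor B and C (prefix 'BC_').  Absent relations are left out,
    i.e., treated as missing values, as described above."""
    item = {}
    for rel in all_relations:
        if rel in rels_ab:
            item["AB_" + rel] = 1
        if rel in rels_bc:
            item["BC_" + rel] = 1
    return item

# Hypothetical example: A and C share neighbors reached via these relations.
all_relations = ["associatedBand", "formerBandMember", "birthPlace"]
print(triad_item({"associatedBand"}, {"formerBandMember"}, all_relations))
# {'AB_associatedBand': 1, 'BC_formerBandMember': 1}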

Data Set dbp2

The dbp2 data set will be used for clustering. Besides the relations as in dbp1, it contains the closeness, betweenness, degree, and page rank values of vertices A and C. Furthermore, the Jaccard similarity coefficient of the neighbor sets of nodes A and C is available in the data set. For the set of neighbors of A (a_n) and the set of neighbors of C (c_n), the Jaccard coefficient J is computed as the size of the intersection of both sets divided by the size of the union of both sets:

J(A, C) = |a_n ∩ c_n| / |a_n ∪ c_n|.
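Computed on plain Python sets, the coefficient looks as follows (the neighbor sets in the example are hypothetical):

def jaccard(neighbors_a, neighbors_c):
    """Jaccard coefficient of two neighbor sets, as defined above."""
    union = neighbors_a | neighbors_c
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_c) / len(union)

# Hypothetical neighbor sets of two entities A and C:
print(jaccard({"The_Beatles", "Liverpool", "Apple_Records"},
              {"The_Beatles", "Liverpool", "Yoko_Ono"}))  # 0.5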

Also, the number of paths of length 2 is given as a feature. An overview showing all features in more detail is given in Table 37. The far more numerous relation features are individually less frequent and are not shown in the table. The dbp2 data set consists of 949,193 items. Of these, 470,942 are connected and

106DBpedia offers a data set with all links between all wikipages as well.

Table 37: Graph-based features of DBpedia data set (dbp2).

Feature Name      Feature Name
J(A, C)           no. of shortest paths
degree A          degree C
closeness A       closeness C
betweenness A     betweenness C
page rank A       page rank C

contain rel(A, C) as a class attribute. An overview of the most frequent classes is given in Table 38.

Data Set dbp3

The data set dbp3 is different from dbp1 and dbp2. First, it contains full, connected triads: the relation from A to B, from B to C, and from A to C. This last relation will be used to find rules to classify missing relations in unconnected triads. Again, these relations are given in the form of a vector, where a relation between two vertices is indicated as either present or absent. Since the total number of possible relations in the data set is very high, it seems most promising to restrict oneself to the most common classes in the data set. In Table 38, an overview of the most common classes is given. Some relations (e.g., those related to biological taxonomies, like kingdom, class, family) are by definition hierarchical and transitive and are hence ignored in the following for the sake of clarity. For each of the remaining classes, 4,000 items were extracted, i.e., triads where the existing relation between A and C is a member of the set of classes. Furthermore, 10,000 unconnected triads, according to Wikipedia, were added to the data set. A test set with 10,000 items yet unconnected in DBpedia, but connected according to Wikipedia, was extracted as well. The choice to use a fixed number of occurrences per class regardless of the actual distribution of classes in the data set was made because the distribution of classes is very skewed. While some classes make up over 10% of the total data set, most classes exist in less than 1% of the items (see Table 38). Therefore, a good restriction on the

Table 38: Most frequent classes in dbp2.

Class                       No. of Instances    in %
team                        50,417              10.71
country                     47,895              10.17
isPartOf                    38,949               8.27
genre                       30,649               6.51
order                       23,193               4.92
class                       22,041               4.68
kingdom                     21,932               4.66
family                      21,579               4.58
phylum                      17,251               3.66
birthPlace                  16,608               3.53
recordLabel                 15,358               3.26
location                    11,817               2.51
timeZone                     9,957               2.11
producer                     8,814               1.87
artist                       7,218               1.53
deathPlace                   6,731               1.43
hometown                     5,361               1.14
division                     5,349               1.15
writer                       4,219               0.9
region                       4,202               0.89
associatedMusicalArtist      4,169               0.89

minimum support is very hard to find: For the very common classes there are hundreds of rules, while no rules for the less common classes could be identified. Such a setup would be very impractical in supervised approaches, where the class distribution itself can indicate the overall probability of a class. The Apriori algorithm is insusceptible to this effect. An overview of the selected classes is given in Table 43.

6.5 Identification of Missing Links: Evaluation and Results

Two methods of unsupervised learning seem promising to solve the task of identifying missing relations using the data set dbp1 described above: clustering and association rule mining, using k-means and the Apriori algorithm. Starting with an approach based on the Apriori algorithm, a quantitative analysis and evaluation of the method presented here will be given.

6.5.1 Setup: Apriori

The Apriori algorithm, which was described in Chp. 4.3.2, is used to extract rules from the data set dbp1 described above. The vector model of relations is used to find rules of the form if the set of relations from A to any B contains the relation x, and the set of relations from any B to C contains the relation y, the two vertices A and C are connected. For several reasons, only the in fact connected items were used in the Apriori algorithm. For one, the proportion between the connected and unconnected triads is strongly biased towards those cases that are unconnected. Furthermore, we are not interested in rules predicting when a triad is not connected in the data set. Also, the data are expected to contain noise: The yet unconnected items are believed to be missing connections (false negatives in the learning set) and are hence not well suited for learning. The minimum support was set to 0.01, which equates to a minimum number of 2,726 items exposing one rule.107 The confidence was set to 0.9. The confidence is a measure of the certainty of a discovered rule (i.e., the probability with which the rule predicts the class: being connected, or having a relation). In other words, a rule is accepted only if it predicts the class connected in at least 90% of the cases and is supported by at least 2,726 items.

107This is a quite conservative choice which, as will be seen later on, guarantees only high-quality rules to be generated.
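The following sketch illustrates this setup with the Apriori implementation from the mlxtend package on a toy one-hot encoded data set; the thesis does not state which implementation was actually used, and the four example items are invented, but the support and confidence thresholds are the ones given above:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy one-hot encoded triad items (columns = relations on the A-B and B-C
# legs plus the class "connected"); the real dbp1 set has 917 such columns.
rows = [
    {"AB_isPartOf": 1, "BC_isPartOf": 1, "connected": 1},
    {"AB_isPartOf": 1, "BC_part": 1, "connected": 1},
    {"AB_associatedBand": 1, "BC_associatedBand": 1, "connected": 1},
    {"AB_associatedBand": 1, "BC_formerBandMember": 1, "connected": 1},
]
df = pd.DataFrame(rows).fillna(0).astype(bool)

# Frequent itemsets with the minimum support named in the text (0.01);
# rules are kept only if their confidence reaches 0.9.
itemsets = apriori(df, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.9)
print(rules[["antecedents", "consequents", "support", "confidence"]])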

The evaluation of this first step is undertaken as follows: For each triad or item taken from the DBpedia graph, the Wikipedia database is queried to determine whether a connection between A and C exists. This information is only used for evaluation and not during training of the machine learning algorithm.

6.5.2 Quantitative Evaluation: Apriori

If only the connections between A and B, and between B and C, are taken as features, the algorithm identifies 62 rules. The rules generated as described above are then applied to the items of the dbp1 test set, which contains the yet unconnected triads. The rules find only very few true positives, and both precision and recall are very low. Different combinations of features and sets of training items were tested. If the right pattern of features is found and the rule matches the item, this prediction is compared to Wikipedia. If both the prediction and Wikipedia indicate a connection, this event is counted as a true positive. If the rule predicts a connection that does not exist in Wikipedia, this event is counted as a false positive (see Table 39). From these two values, the precision can be calculated. Taking the number of items that are unconnected in DBpedia but connected in Wikipedia, the recall can be calculated as well. Here the unsupervised approach reaches its limits. The similarity feature is hard to consider in Apriori rules, since the algorithm does not handle numerical values. Nonetheless, it is suspected that the similarity (i.e., the ratio of shared neighbors of yet unconnected vertices) can play a role in identifying these cases. To test this assumption, the evaluation was undertaken again. This time, the similarity J for the two vertices in question was calculated and a cut-off value for J specified: The cut-off value, lying between 0 and 1, was raised in every step until it reached 1. Only those cases that top the cut-off value for J were considered in each step. As can be seen in Fig. 39, the precision rises with a higher similarity and reaches its highest point of 0.74 for a similarity of 0.8. The recall, which is very low to begin with, drops further and approaches 0. It reaches a low of 0.004 for a similarity of 0.9. The best results are hence reached using the 62 Apriori rules and accepting a classification only if the similarity of the two vertices in question tops a predefined threshold of 0.8. The numbers in Table 39 seem quite promising: Almost 74% of all instances are classified correctly; the error rate is close to 26%. These numbers are not very good,

Figure 39: Precision and recall in relation to the similarity.

Table 39: Correctly and incorrectly classified missing connections in DBpedia using the a priori approach.

Correctly Classified Instances       678,420    73.98%
Incorrectly Classified Instances      42,879    26.02%
Total No. of Instances               721,299

but in fact they are even worse than they look at first glance. One has to look at the true positive and false positive values (see Table 40) to find the real problem behind this approach. While most classified items are correctly classified as being unconnected, only 236 missing connections out of 721,299 triads could be identified. The recall is very low, and the approach misses many connections (i.e., the false negatives: 678,184). After looking at the results of this first evaluation step, it seems very likely that the relations between three vertices A, B, and C, where A and B as well as B and C are known to be connected, do not necessarily predict the existence of a connection between vertices A and C. In the following, clustering will be applied to the same problem, before a final conclusion on the step of link prediction in DBpedia is drawn.

Table 40: True positives and other values for the classification of missing connections in DBpedia using the a priori approach.

                                 in Wikipedia
                            connected        unconnected
the algorithm  connected        236 (tp)          83 (fp)
             unconnected    678,184 (fn)      62,796 (tn)

6.5.3 Setup: Clustering

To evaluate how well connected and unconnected triads can be distinguished using a clustering approach, the data set dbp2 was used. Since clustering works well with numerical features, this data set contains the relations, the similarity J, plus the measures of centrality (see Table 37). The k-means algorithm described in Chp. 4.3.2 was used. It takes the data set as vectors in an n-dimensional space and tries to find two centroids, where each vector is assigned the class of its closest centroid. The actual classes derived from Wikipedia are not used by the algorithm. It tries to find a natural distinction between the two sets. While supervised learning can weigh certain features according to the class, unsupervised learning, especially clustering, has to find distinctions in the data set that make a split into two classes most plausible. Therefore, the precision is expected to be lower than the results found in the WordNet data set using supervised learning.
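A corresponding clustering run can be sketched with scikit-learn as follows; the random feature matrix merely stands in for the numerical dbp2 features, and the feature scaling is an addition of this sketch rather than a documented part of the original setup:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the numerical dbp2 features (Jaccard similarity, degree,
# closeness, betweenness, page rank of A and C, number of length-2 paths).
rng = np.random.default_rng(0)
X = rng.random((1000, 10))

# k-means with two centroids: each item is assigned to the closer centroid,
# giving an unsupervised split into "connected" and "unconnected" candidates.
features = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print(np.bincount(labels))  # sizes of the two clusters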

6.5.4 Quantitative Evaluation: Clustering

In a first step, the clustering is performed using the above-mentioned subset with all relations and all numerical features. The k-means algorithm divides the set into two clusters, one for each class; 871,491 instances are identified as being unconnected, class false, and 77,702 as being connected, class true.

Table 41: Confusion matrix: clustering DBpedia instances.

                           assigned class
                         false        true
actual class   false     702,887     47,113
               true      168,604     30,589

The cases that are classified as true and that are of the class true are the true positives (30,589). The false negatives are those that are connected in Wikipedia, but not according to the clustering (168,604). The cases that have been correctly classified as being unconnected are the true negatives (702,887), while those instances that were classified as being connected but that are not connected according to Wikipedia are the false positives (47,113). All values are given in Table 41. The precision is calculated as the fraction of the true positives over the sum of true positives and false positives and equates to 0.4. The recall is the fraction of the true positives over the sum of the true positives and the false negatives and equates to 0.39. These numbers do not change considerably if the numerical features are ignored and only the relations are used. As seen above, this seems to indicate that the relations are not a good feature for predicting the existence of a connection between two vertices A and C in an unconnected triad. Since the numbers are very unsatisfactory, the relations will be ignored and only the numerical features, the network-based measures that were already used for the WordNet problem, are considered. In this data set, 5% of the instances (i.e., 48,052 instances) are assigned the class true. Table 42 shows the confusion matrix. As one can see, 727,287 instances of the class false are correctly classified as false (true

Table 42: Confusion matrix: clustering DBpedia instances using only network-based measures.

                           assigned class
                         false        true
actual class   false     727,287     22,713
               true      173,854     25,339

negatives); 173,854 instances that actually are of the class true have been classified as belonging to the class false (false negatives); 25,339 instances have been correctly identified as being connected (true positives), while 22,713 instances that are of the class false were wrongfully assigned the class true (false positives). Even though the overall error rate is only 20.71%, or 196,567 instances out of 949,193, which means that almost 80% of the data set was clustered correctly, over 170 thousand instances were not correctly identified, and even worse, over 22 thousand instances were incorrectly classified as being related. The precision value is only 0.53 and hence 0.14 points higher than when the relations are included in the form of a vector, and the recall is only 0.13 for the true class. Hence, ignoring the relations and only using numerical features calculated on the network structure works a little better than considering the relations. Still, the results are very unsatisfactory. The problem of automatically identifying where edges are missing in the DBpedia network could not be solved, at least using Wikipedia as a gold standard and the setup presented here.

6.6 Unsupervised Classification of Missing Links

6.6.1 Setup: Link Classification Using Apriori Algorithm

The previous step consisted of predicting whether two second-degree neighbors are missing a relation or not. Since this information is given in Wikipedia, the task of classifying the missing relation can still be fulfilled even though the results above are far from optimal. The data set dbp3 is used to predict not whether two yet unconnected vertices are missing a direct connection or edge, but what kind of edge is missing. The task

is to assign a fitting relation defined in the DBpedia ontology to triads that are connected in Wikipedia but not in DBpedia, using the Apriori algorithm. The confidence was set to 0.9 (i.e., a rule is accepted only if its pattern corresponds to the class, the kind of relation between two vertices, in at least 90% of the cases). The minimum support was set to 0.001. To test the identified rules, the dbp3 test set is used and the classification is manually evaluated using Wikipedia. The data set is made up of unconnected second-degree neighbors that share a set of common neighbors and their relations to these neighbors, given as a set or vector indicating the existence or non-existence of a certain relation between entity A and any common neighbor B, and between any of these neighbors and C. Other setups would have been conceivable (e.g., using each path from A through any B to C as a feature and mining rules on this data set). Still, the chosen setup, with the vector representing the relations, is thought of as exhibiting more information, since it shows all relations that exist on any path of length 2 from A to C. As shall be explained later, this setup does in fact have some advantages over other possible data sets.
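Applying such rules to an unconnected triad can be sketched as follows; the two rules and their confidence values are invented for the illustration and are not the rules actually mined from dbp3:

# Each rule: if the A-B legs contain all relations in `ab` and the B-C legs
# contain all relations in `bc`, predict `rel_ac` with the given confidence.
RULES = [
    ({"associatedBand"}, {"associatedBand"}, "associatedMusicalArtist", 0.95),
    ({"isPartOf"}, {"isPartOf"}, "isPartOf", 0.87),
]

def classify(rels_ab, rels_bc, rules=RULES):
    """Return the highest-confidence relation predicted for an unconnected
    pair A, C, or None if no rule fires."""
    hits = [(conf, rel) for ab, bc, rel, conf in rules
            if ab <= rels_ab and bc <= rels_bc]
    return max(hits)[1] if hits else None

# Hypothetical unconnected triad: A and C are both linked to common
# neighbors via associatedBand.
print(classify({"associatedBand"}, {"associatedBand"}))
# associatedMusicalArtist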

6.6.2 Evaluation

The Apriori algorithm identifies 42 rules (see Table 46 in Appendix A) of the following form: If one rel(A, B) is of kind x, and one rel(B,C) is of kind y, then the relation A–C is expected to be of kind z. The predictions are then checked against the information found in Wikipedia. Table 43 gives an overview of the classes to be predicted and the corresponding number of correct and incorrect classifications related to the class. Considering all classes shown in Table 43, the rules result in 78.49% correctly classified instances and an error rate of 20.51%. The recall can only be estimated. Out of 100 unconnected entities, the rules shown here predict a relation in 12.5 cases. Given that there could be more than one relation missing between two entities, this coverage is expected to be even a little smaller. The most commonly assigned class is region, which is assigned to almost 17% of all instances in the test set. The cases correctly classified as region make up 16.73% of the data set; the incorrectly classified cases make up only 0.4%. Looking at the ratio between correctly and incorrectly classified instances, one can see that not every class is predicted equally well (see Table 43). The large red portions of the bars in Fig. 40 show this: The isPartOf relation, for example, is more often wrongly

Table 43: Classes and the percentage of correct and incorrect classifications compared to the whole test set.

Predicted Relation          Percent Correct    Percent Incorrect
associatedMusicalArtist     15.14               1.20
author                       1.99               3.98
country                     15.54               0.00
isPartOf                     6.77              10.36
location                     0.00               0.00
region                      16.73               0.40
team                         7.97               0.00
writer                       8.37               1.99
hometown                     5.98               3.59
Total                       78.49              20.51


Figure 40: Classification results with all selected classes: predicted relations in %. Green: correct classifications, red: incorrect classifications.

assigned (10.36%) than used in correct classifications (6.77%). Also, the error rate of the author and hometown relations is higher than the rate of correct classifications, and the location relation is not assigned a single time in the test set. This is due to the fact that no rule satisfying the minimum support and confidence values was identified in the data set.

Table 44: Apriori rules to predict relation in DBpedia (excerpt I).

Rel. A–B                   Rel. B–C             Prediction                  Counter
isPartOf                   isPartOf/part        isPartOf                    4,960
isPartOf                   isPartOf             isPartOf                    1,510
associatedBand             associatedBand       associatedMusicalArtist     2,689
associatedMusicalArtist    formerBandMember     associatedMusicalArtist     2,557
producer                   bandMember/writer    writer                      1,874
careerStation              team                 team                        496
city/state                 country              country                     668
region/department          country              country                     266
department                 region/isPartOf      region                      438

Two different rules predicting an isPartOf relation are given in Table 44. The first applies if A is part of any common neighbor B and any B is part of C and there also exists a relation part between B and C.108 The semantics behind part is similar to isPartOf but has a range restriction on dbpedia-owl:Place. The second rule applies if A is part of any B and any B is part of C. While the first rule results in 92% incorrect classifications, the second rule, contrarily, makes correct classifications in 87% of the cases. The error rate of the hometown relation classifications is also very high. In the test set, eight different rules predict this relation; in some cases, numerous rules predict the same relation for two concrete entities. In the following, an example will be shown that is related to the Canadian band Barenaked Ladies. In total, five of the eight rules in the rule set apply to the entity, and all five rules classify Toronto as the band's hometown. First, one should have a look at Wikipedia to see if this assumed relation is indeed correct. The information to be identified is not given in the infobox of a wikipage, but in

108One has to remember that due to the setup of the data set, B can represent any shared neighbor of both vertices, but not strictly and necessarily the same neighbor in both conditions!

193 the free text of the page as can be seen in the screenshot of the corresponding wikipage in Fig. 41.

Figure 41: Screenshot of Wikipedia page: Barenaked Ladies.

The applied rules predict the relation in different ways. A quite simplistic rule states that if Barenaked Ladies has the hometown Scarborough, Ontario, and both entities share a common neighbor that is part of Toronto, then the Barenaked Ladies are from Toronto. Another rule indicates that if at least one band member of the band is from Toronto, then the band is from Toronto. Despite its simplicity, and despite the fact that there obviously must be many exceptions, this rule is often correct and can even predict cases where there is no clear evidence even in the free text of the corresponding wikipage and where parsing-based approaches, such as those discussed before (cf. Blessing and Kuhn, 2014; McCord et al., 2012), would not be able to identify the desired relation. Within the test set, there is the example of a Norwegian heavy metal band called I (band), which, according to the rule presented here, originates from the city of Bergen. The free text mentions Bergen only as the place where the band had their first, and apparently only, public performance. To answer the question of whether the band itself is from Bergen, one can look at the band members and find that they are in fact from Bergen. This is what the rule states. Of course, there might be, and there in fact are, exceptions to this and other similar rules. Especially these geography-related rules have been found to be used in a rather unexpected way in the original data set. While evaluating the predictions made by the rules, some seem rather odd at first glance. For example, the Harwich Town railway station was said to be related to Essex by a country relation. Since Essex is not a country, and the range restriction of the relation states that it has to be a country, this seems wrong.

Still, Essex is a country according to DBpedia. Other types (e.g., a region) would have been more appropriate. Also, US states are de facto of the type country, according to the relations they take part in. Also, the hometown relation has a range restriction on the type Settlement. Still, settlement is apparently a very broad term in DBpedia and can also refer to countries. The hometown and country relations are actually used quite often in DBpedia, and hence the rules presented here also predict that some country would be the hometown of some person. This is not wrong according to the ontology and is also frequently done in the original data set, but it still shows that some relations of DBpedia seem clearly underspecified with regard to their range or domain restrictions. Overall, the error rate of the proposed rule set is quite high. To minimize the error rate, one could leave out not only the rules that predict classes that are often wrongly classified, but also rules that do not seem very convincing to the human implementing the rules in, e.g., SPARQL (a recursive acronym for SPARQL Protocol And RDF Query Language), a graph-based query language for RDF data. To reach a higher accuracy in the classifications, the four classes with either no classifications or a high error rate, author, isPartOf, hometown, and location, and the corresponding rules will be ignored. Applying the remaining rules to the test set results in the values shown in Table 45. Summing up the percentages of correctly classified instances results in an accuracy of 94.67% and hence an error rate of 5.33%.

Table 45: Apriori rules to predict relation in DBpedia (excerpt II).

Predicted Relation          Percent Incorrect    Percent Correct
associatedMusicalArtist      1.78                22.49
country                      0.00                23.08
region                       0.59                24.85
team                         0.00                11.83
writer                       2.96                12.43
Total                        5.33                94.67

These numbers are visualized in Fig. 42: The large green portions of the bars indicate the correct classifications. The team and country relations are always assigned correctly; the rules predict these relations with a very high accuracy. In Table 44, two rules predicting the relation associatedMusicalArtist are shown.


Figure 42: Classification results with only well-performing classes: predicted relations in %. Green: correct classifications, red: incorrect classifications.

Each of these two rules is applied over 2.5 thousand times in the data set. The first says that if A is connected to a common neighbor of A and C by the relation associatedBand, and a common neighbor of both C and A is connected to C by associatedBand as well, A and C are related through associatedMusicalArtist. In other words, if A, B, and C are all bands, and A and B as well as B and C are associated, then A and C are also associated. The second rule states that if the set of shared neighbors of A and C contains an associatedMusicalArtist of A and a formerBandMember of C, A and C are missing a relation associatedMusicalArtist. There might of course be exceptions to this special rule.
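A rule of this kind can also be turned into a query against the public DBpedia SPARQL endpoint to list candidate pairs that still lack the predicted relation. The sketch below uses the SPARQLWrapper package; the endpoint URL, the LIMIT, and the exact query are assumptions made for the example and were not part of the original setup:

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?a ?c WHERE {
  ?a dbo:associatedBand ?b .
  ?b dbo:associatedBand ?c .
  FILTER (?a != ?c)
  FILTER NOT EXISTS { ?a dbo:associatedMusicalArtist ?c }
} LIMIT 10
"""

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    # Each candidate pair would receive the predicted relation
    # associatedMusicalArtist with the confidence of the rule.
    print(row["a"]["value"], "->", row["c"]["value"])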

Listing 3: The band members of the Ramones and the newly identified relations between them. Johnny_Ramone associatedMusicalArtist Dee_Dee_Ramone Johnny_Ramone associatedMusicalArtist Joey_Ramone Johnny_Ramone associatedMusicalArtist Tommy_Ramone Johnny_Ramone associatedMusicalArtist Marky_Ramone Johnny_Ramone associatedMusicalArtist Richie_Ramone Johnny_Ramone associatedMusicalArtist Clem_Burke Johnny_Ramone associatedMusicalArtist C._J._Ramone

The examples in Listing 3 show cases where one of the most common classes is assigned to otherwise unconnected entities, the members of the punk-rock band Ramones. Contrary to what one might think, the members of the Ramones are not related by family; the names of the band members are pseudonyms. In the list shown here, the only exception seems to be Clem Burke. In fact, Clem Burke was the drummer of the band Blondie and worked with the Ramones for a limited period of time, during which he went by the pseudonym Elvis Ramone.

Each of these relations is predicted by two different, yet similar, rules. If there exists a relation associatedBand between A and a member of the set of shared neighbors of A and C, as well as between one of the shared neighbors and C, and if there further is a relation formerBandMember to C, the rule predicts the relation associatedMusicalArtist. The second rule differs only in that it checks whether A and any B are connected by an associatedMusicalArtist relation. This also illustrates, in part, how difficult it would be to identify such rules manually. In total, 10 different rules are identified that predict the relation associatedMusicalArtist. While one can easily think of a number of possible relations in a triad that could be expected to result in this class for rel(A, C), one would need to study the ontology intensively and might still not think of every possible rule. Given the fact that there are almost 500 possible relations in DBpedia and at least 3 relations involved in a rule, this results in 500³ (i.e., 125 million) possible combinations. Given the fact that there is no reason why only one relation could exist between two entities, this number has to be increased further.

As mentioned before, single triads containing exactly one relation from A to B and from exactly this B to C, hence forming a possible path of length 2 between the two entities in question, would have been a possible alternative to the vector model used here. Mining rules on a data set built in this fashion was found to result in a large number of nonsensical relations. For example, the often used team relation that is never incorrectly assigned in this data set led to rules that would predict this relation to exist between two teams themselves. This is not strictly forbidden by the ontology and hence the predictions are made, but they contradict the common meaning. The underlying problem is that some relations exhibit fuzzy semantics, as shown above, and are hence used in a wide variety of ways. Furthermore, relations like team lack a defined domain restriction (e.g., to Person), although only a person can be a member of a team.
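To illustrate the kind of nominal, shared-neighbor representation referred to above as the vector model, and how rules of the form "relation pattern → missing relation" can be derived from it, the sketch below counts itemsets and their confidences over a few invented toy rows. It is a naive, brute-force stand-in for the candidate-pruning Apriori algorithm named here, and the item encoding is an assumption, not the format used in this thesis.

from itertools import combinations
from collections import Counter

# Toy data: each row is the set of nominal items describing one pair (A, C) of
# second-degree neighbors -- the relations observed around the pair plus, for
# known cases, the relation that should hold between A and C ("missing=...").
pairs = [
    {"A-B:associatedBand", "B-C:associatedBand", "missing=associatedMusicalArtist"},
    {"A-B:associatedMusicalArtist", "B-C:formerBandMember", "missing=associatedMusicalArtist"},
    {"A-B:hometown", "B-C:isPartOf", "missing=hometown"},
    {"A-B:hometown", "B-C:isPartOf", "missing=hometown"},
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.25, 0.9
n = len(pairs)

# Step 1: count the support of every itemset up to size 3 (what Apriori does
# level-wise with pruning; here done exhaustively for brevity).
support = Counter()
for row in pairs:
    for size in (1, 2, 3):
        for itemset in combinations(sorted(row), size):
            support[frozenset(itemset)] += 1

# Step 2: derive rules "antecedent -> missing=X" with sufficient support and confidence.
for itemset, count in support.items():
    targets = {item for item in itemset if item.startswith("missing=")}
    if len(targets) != 1 or count / n < MIN_SUPPORT:
        continue
    antecedent = itemset - targets
    if not antecedent:
        continue
    confidence = count / support[frozenset(antecedent)]
    if confidence >= MIN_CONFIDENCE:
        print(f"{sorted(antecedent)} -> {targets.pop()} "
              f"(support={count / n:.2f}, confidence={confidence:.2f})")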

6.7 Conclusion

DBpedia is an automatically derived ontology based on structured information given in Wikipedia pages. It is already widely used in NLP tasks although the knowledge base is quite incomplete. Just how incomplete can be seen when comparing the network analysis of DBpedia to that of Wikipedia. DBpedia is a complex network (i.e., it shows a degree distribution that follows a power law). It is hence a small-world network with a short average geodesic path length between any two vertices of the network, and it has a clustering coefficient higher than that of a random network with a similar average degree.

Since the structure of DBpedia is based on the Wikipedia structure, entities that are connected in Wikipedia but unconnected in DBpedia are assumed to be missing a relation. Therefore, the second step (i.e., classifying the missing relations between any two entities) can be accomplished by comparing both resources. First, an analysis of Wikipedia's and DBpedia's network structure shows that Wikipedia has a shorter average geodesic path length than DBpedia. As has been shown in the small-world network model (Watts and Strogatz, 1998), adding very few wide-spanning relations between entities otherwise only connected by a long geodesic path can lead to a very small average path length in the whole network. In other words, the fact that the average geodesic path length of DBpedia is longer than that of Wikipedia might only indicate very few missing relations. Comparing the clustering coefficients of both networks shows that Wikipedia's clustering coefficient is over 30 times higher than that of DBpedia. Since the clustering coefficient is the ratio of second-degree neighbors a vertex is also connected to, or, in other words, the degree to which a vertex is connected within its neighborhood, the much smaller value in DBpedia indicates that a great number of relations is missing within the vertices' neighborhoods, among the neighbors of their neighbors.

This assumption is further confirmed when analyzing the network's structure and looking at how this structure would evolve. Taking into account that the degree distribution follows a power law, one can assume that an evolution model using preferential attachment can explain this finding. Furthermore, the clustering coefficient is higher than a pure preferential attachment model (Barabási and Réka, 1999) would predict. As has been proposed by Steyvers and Tenenbaum (2005), a model that fits the structure of semantic networks has to be extended in such a way that newly added vertices are connected not only to other already well-connected vertices but also to vertices in their neighborhood. The probability with which this is done influences the clustering coefficient.
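The two network statistics used in this comparison, and the extraction of unconnected pairs of second-degree neighbors, can be computed with standard graph libraries. The following is a minimal sketch using networkx on a toy graph; it illustrates the measures themselves and is not a reproduction of the actual DBpedia/Wikipedia analysis.

import networkx as nx
from itertools import combinations

# Toy undirected graph standing in for an ontology's entity network.
G = nx.karate_club_graph()

# Global statistics used above to compare DBpedia and Wikipedia.
avg_path = nx.average_shortest_path_length(G)   # average geodesic path length
avg_clust = nx.average_clustering(G)            # clustering coefficient C
print(f"average geodesic path length: {avg_path:.3f}")
print(f"average clustering coefficient: {avg_clust:.3f}")

# Unconnected triads: pairs of second-degree neighbors (they share at least one
# neighbor) that are not directly connected -- the candidate pairs examined here.
unconnected_triads = set()
for b in G.nodes():
    for a, c in combinations(G.neighbors(b), 2):
        if not G.has_edge(a, c):
            unconnected_triads.add(frozenset((a, c)))
print(f"unconnected second-degree neighbor pairs: {len(unconnected_triads)}")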

From these findings in the network, and from the observation made in social network analysis that neighbors tend to be similar in their connectedness to their neighborhood, the assumption emerged that looking in the neighborhood of a vertex for missing connections would be a good starting point for a link prediction task. Hence, connected and unconnected triads (pairs of second-degree neighbors that are not yet connected) were extracted. The Wikipedia graph gives indications of which pairs should be connected, and this information can be used to check the quality of any prediction made. Two tasks have been identified in order to extend the ontology based on the network it exposes: first, those second-degree neighbors that should be connected have to be identified; second, the relation between these two entities has to be classified.

Examining the state-of-the-art methods in ontology mining, techniques that have already been applied to derive missing type relations in DBpedia, the Apriori algorithm was identified as a possibly suitable algorithm. Another possible unsupervised approach that was tested was the k-means clustering algorithm. Testing different setups, it was found that the results of the task of identifying possibly related entities using the Apriori algorithm were very unsatisfactory. The relations and paths between the entities in question seem unsuited to this task. Furthermore, clustering, with network-based centrality measures included as features, was employed to solve the task of identifying possibly connected vertices. Here the centrality measures caused an increase in accuracy of 0.14 points. In the end, this problem could not be solved using unsupervised approaches relying on the ontology and network structure with the features proposed here.

The reason for this is believed to be found in the creation and the structure of Wikipedia itself. Many wikipages refer to other Wikipedia pages in a way that DBpedia, in its current form, does not account for: many people are connected based on common professions; places are connected by geographic closeness. Most of these relations are missing in DBpedia, and the interlinked pages are otherwise poorly connected since their degree of relatedness is expected to be very low. It is assumed here that this property of Wikipedia's structure might hinder a successful identification, at least when Wikipedia is used as a gold standard to define which entities should and should not be connected.

The task of classifying missing relations was solved using the known difference between DBpedia and its defined gold standard, Wikipedia. By comparing both resources, one can know exactly where there are missing relations. It was stated before that approaches of parsing the free text of Wikipedia can be used

to extract relations that are given in natural language. Here, however, the network and the ontology structure are considered as features to identify rules that can classify which relation is missing. For a defined, and admittedly relatively small, set of possible relations, the method evaluated here was found to reach an accuracy of 94.67%. In order to build an industrial-level system, one could remove relations that are predicted with a low accuracy, as was done here; one could also look at single rules and only use those that work with a high accuracy, as was suggested above. The rules proposed here can easily be translated to SPARQL, the language used to query DBpedia. The example given in Listing 4 corresponds to the rule that for any n that has the hometown x, where x is part of z, n also has the hometown z.109

Listing 4: SPARQL query

PREFIX p: <http://dbpedia.org/property/>
PREFIX o: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>

SELECT DISTINCT ?n '' ?z '.'
WHERE {
  ?n p:hometown ?x.
  ?x o:isPartOf ?z.
}
LIMIT 100
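Such a rule can also be executed programmatically. The sketch below runs the pattern from Listing 4 against the public DBpedia SPARQL endpoint using the SPARQLWrapper library and prints candidate hometown pairs. It is an illustration only, not part of the tooling used in this thesis, and the prefix-to-namespace mapping is the same assumption as in Listing 4.

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX p: <http://dbpedia.org/property/>
PREFIX o: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?n ?z
WHERE {
  ?n p:hometown ?x .
  ?x o:isPartOf ?z .
}
LIMIT 100
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    # Each binding is a candidate new triple: <n> hometown <z> .
    print(row["n"]["value"], "hometown", row["z"]["value"])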

During the manual evaluation, some relations were found to exhibit a rather strange distribution (i.e., they were used to relate entities that did not seem to match the intuitive semantics of the relation). Especially the geography-related connections were found to be underspecified with regard to their range or domain restrictions. A relation like hometown can essentially be used in the same way as country or region. Furthermore, the country relation, although restricted to countries, can also refer to regions like Essex that are clearly not an independent country and have not been an independent country or kingdom for many centuries. These missing restrictions, as well as the de facto rather vague typing system, might be problematic for ontology mining tasks such as the one presented here. If there is no clear distinction between certain relations, they are redundant. Furthermore, the expressiveness is clearly reduced if a certain number of the almost 500 possible relations cannot be clearly

109 DBpedia offers an online tool to run such queries at http://dbpedia.org/snorql/. The number of results is limited, though.

distinguished from each other. This reduced expressiveness, of course, might itself lead to less precise results in ontology mining.

[Figure 43 – graph with vertices Paul_McCartney, John_Lennon, George_Harrison, Ringo_Starr, and The_Beatles.]

Figure 43: Existing relations (grey, dotted) and newly added relations (black) between the four Beatles in DBpedia.

Nonetheless, it was found while evaluating and checking the predicted relations against Wikipedia that some predicted relations could not easily be identified in Wikipedia, since even the free text of the wikipage did not contain this information. The example of the Norwegian metal band I was given above, where only a closer look at the band members revealed that the band must be from the city of Bergen. Hence, the network of entities not only encompasses much of the information given in the form of natural language in Wikipedia, but even indicates information that cannot be found or observed directly in Wikipedia and that can only be concluded from information given by its neighbors.

At the beginning of the chapter, the four Beatles and their interconnections were mentioned. It was stated that although they share a number of neighbors in the network, they are not connected to each other, as would be expected from both our world knowledge and the existing links from the corresponding wikipages. Following the approach presented here, the four Beatles can now be identified as associatedMusicalArtist by a number of rules given in Table 46 in Appendix A. Figure 43 shows the newly added relations between the band members.

7 Discussion, Conclusion, and Future Work

Networks naturally evolve all around us. They are often found in nature. For example, the neurons of the brain form networks, the blood circuit is a network, genes can be treated as a network, and so on. But not only natural organisms and structures take the form of a network. Human interaction takes the form of networks too, such as social networks, communication networks, and transportation networks. Networks are well suited to these different tasks because of their robustness. Thinking of the network of streets in a city, there is a way from every point to any other point even if certain streets are closed. Thinking of the Internet, even if one undersea cable were malfunctioning, there would always be a route using another cable. In this sense, even the removal of, or damage to, one of the edges of a network does not jeopardize the whole network.

Therefore, this form of organizing objects of any kind is often chosen in nature and in human action. Humans choose the organization in the form of networks either consciously, as in the organization of train tracks across a country, or unconsciously, as the networks arising from human language show. Underlying the idea of treating language as a network is the assumption that language networks, just as many other kinds of human interaction and especially social networks or communication networks, evolve as a result of the working of the human mind, which is itself network-like. It has been claimed in this thesis that the similarities between the organization of the human brain and that of human language are not a coincidence, but an isomorphism, where the organization of the first is reflected in the second110.

The neural network of the human brain is thought of as exhibiting patterns of connectivity: it shows patterns of interaction between different clusters of neurons in the network that correspond to the semantics of words or concepts. Even though the analogy between the neural network of a brain and that of word networks is very skewed, since word networks never reach the fine-grained, detailed, and large structure that even the simplest neural networks in biology reach, the idea of treating the interaction of words as a network offers new ways to understand language and, as was the goal of this thesis, can be used to extend existing systems in NLP.

Furthermore, networks not only make good models of interactions because they are

110 Similarly, it has been claimed in Mehler (2008) that wikis are formed as a result of the small-world structure of their users.

fitting, but they also offer a concise mathematical approach to problems. As has been shown, graph theory offers the possibility to undertake calculations on networks that help to understand their structure and their evolution, and that might finally be used to extend networks. Especially the evolution of different networks was looked at in greater detail. The idea is, by looking at the status quo of a network, to understand which processes drove the objects or vertices in the network to be organized in the way they are now. By understanding which processes can explain the structure we see, we can conclude how future growth should function and how a network's structure is supposed to behave when it grows (i.e., if new vertices or edges are added).

Looking at the semantic theories that were proposed and employed during the last century, the idea of networks arises in most of the more common theories: starting with lexical fields, where words are thought of as entities whose meaning is defined in relation to their neighbors, and frame semantics, where the extra-linguistic situation, the extra-linguistic knowledge, and the lexical units themselves are related and interact with each other, to semantic relations that define word meaning in hierarchical or vertical lexical and semantic relations, to the already mentioned paradigmatic and syntagmatic relations words exhibit in distributional approaches to semantics based on large text corpora. Co-occurrence graphs are built from large amounts of text, and the co-occurrence of two words establishes a relation between them. The more often two words co-occur, the stronger the tie. Such syntagmatic co-occurrence graphs also exhibit paradigmatic relations (i.e., a certain degree of similarity, as these words can substitute each other, of course not without changing the meaning or truth value of an utterance).

Modern ontologies combine many properties of the above-mentioned semantic theories in a coherent, mathematical framework. A straight line of evolution has been traced in this thesis that connects the ideas of lexical fields, frame semantics, and lexical relations to ontologies used in the information sciences and NLP. Furthermore, the distributional approaches are closely related to ontologies in the way they form networks and hence in the ways they can be treated mathematically. Starting with Frege and Leibniz, and continuing with Montague semantics, mathematical logic was introduced into the treatment of linguistics. Logic is also used to define ontologies and to infer meaning from ontologies. It seems therefore only plausible that, since ontologies form networks and are based on logic, they should be treated in mathematical ways, using reasoning and graph theory.

Ontologies have been found to be of very great use for many problems in NLP. But they have almost just as often been found to have major shortcomings: either their

incompleteness due to the limitations of their manual creation, or their incompleteness or erroneousness due to automatic extraction. Different ways have been proposed to extend or refine ontologies, e.g., by including data from other data or knowledge bases, or by parsing natural language and thereby exploiting the fact that syntax gives indications of the relations expressed in natural language and hence of the semantics of words or concepts. Most approaches are based on external unstructured data (i.e., texts). The term unstructured refers to the fact that the structure is not machine readable without further preprocessing. Texts and language of course follow a defined structure. This structure is so complex that automatic approaches, unlike the human mind, have problems understanding, identifying, and using these structures. Lacking a thinking machine that understands language the way humans do, researchers have built sets of rules that have been found to express certain relations. Although these techniques are used regularly, problems arise especially when working with ontologies and trying to add new information to an existing ontology. For one, an ontology defines a closed set of vocabulary that can be used to describe relations. If a relation is identified in a text, it has to be mapped to the ontology. The other problem arises from the entities in question. Especially in WordNet, a word, a certain string of characters, is not unambiguously existent in the ontology. Some kind of sense disambiguation has to be performed to map what is found in a data set or in a human utterance to what exists in the ontology.

Because of the difficulties arising from these approaches, this thesis looks at the network the data (i.e., the ontology) exposes. Ontologies and other knowledge bases can be treated as networks too. It has often been stated that RDF triples form a graph and that a large amount of such triples makes up a network. Still, most approaches treating ontologies as networks have so far only been concerned with structural properties of such graphs in relation to other complex networks such as collaboration networks (cf. Mehler, 2007). Both networks and ontologies suffer from the fact that many of them are incomplete and missing information (cf. Clauset et al., 2008). Therefore, the methods presented in this thesis are based on the mathematical foundation of graph theory as well as on findings made in different research fields, such as the computational social sciences, namely that, when looking at social networks, some vertices are much more likely to be connected than others.

An extensive review of different techniques, approaches, and studies undertaken in this and neighboring fields has been presented here. In social networks, especially the

neighborhood of two vertices indicates a relation. In other words, two vertices in a social network that are connected tend to have similar neighborhoods. This is called homophily. Based on assumptions made in the social sciences, on research from physics and other areas, and on basic concepts of graph and network theory, this thesis presented two very different approaches to find, predict, and classify missing relations in two different kinds of ontologies that are regularly used in NLP tasks: WordNet and DBpedia.

WordNet is a manually crafted resource that interlinks word senses and word forms with each other. It contains information on nouns, adjectives, adverbs, and verbs, and organizes these in the form of a network by establishing different lexical and semantic relations between them. It is one of the most often used resources in English NLP. Although undoubtedly a very helpful resource, it has often been criticized either for being too fine-grained in its distinction of meanings or for missing certain information. As has been shown in this thesis, a lot of work has been put into the distinction of polysemous and homonymous words in WordNet. Looking at the linguistic literature regarding polysemy, a polysemous word can be defined as being underspecified or very broad in meaning; therefore it can be used in very different contexts and thus refer to different concepts. The most basic example is the so-called grinding rule, where animal names are regularly used to refer to products gained from the animal, be it fur, skin, or meat. Animal names can also be used to refer to people, and in this case there is no a priori rule determining which property connected to the animal is transferred to the new referent. If someone is called a lion or tiger, it might be due to their strength, but it might also refer to their power, or even their cruelty. These connotations are available to humans, but they are not part of WordNet.111

Homonymy, in a narrow sense, refers to the fact that the same sound (homophony) and the same form (homography) refer to distinct, unrelated things in the extra-linguistic world. In the setting of this thesis, only homograph cases can be considered since WordNet does not offer any information on the pronunciation of words. Homonyms are hence arbitrarily similar forms of words that have no semantic connection. From this, the hypothesis arose that polysemous word forms should be semantically more closely related. Since WordNet is based on word forms and word senses, the term word is of no use in this context. In the context of WordNet, and especially when looking at what the use of

111 Other resources such as FrameNet contain more circumstantial knowledge of lexical units, since this knowledge is thought of as belonging to the meaning of a word as well.

the knowledge of a word being polysemous or homonymous is, i.e., knowing whether a certain sense of it is semantically related to the other senses, the problem has to be redefined: the task is to identify whether two (homograph) word forms, and hence the meanings connected to them, are related or unrelated.

Using different similarity measures from social network analysis on WordNet was originally assumed to be a fruitful approach to identifying polysemy in WordNet. Furthermore, the different centrality measures (closeness, betweenness, degree, and PageRank), plus some measures based on considerations from the analysis of the ontology structure of WordNet, among them the number of shared lemmas of two synsets or the existence of certain relations (e.g., indicating derivational relatedness), were extracted for any pair of homograph word forms in WordNet. A large set of these instances was then manually tagged as either being related (i.e., sharing a common meaning or being underspecified) or as being unrelated and hence homonymous. This is of course not the usual definition of polysemy and homonymy, but it is the one applicable to WordNet, and it also serves the purpose of knowing whether two word senses share a common semantic trait or whether these senses are merely expressed by homograph word forms purely at random.

The task of finding homograph word forms restricts the possible candidates in the network to a relatively small set of instances. It has been claimed before that checking every vertex of a complex network against every other vertex would not be a good strategy (Clauset et al., 2008). In this task, the set of instances is naturally restricted. The instances, containing the word forms and synsets in question, plus the set of features defined in this thesis and the class attribute, were used in different machine learning algorithms to train a model that could be applied to yet unseen instances to classify the two word forms in question as polysemous or homonymous. The best results were obtained using a random-forest algorithm (i.e., a set of decision trees based on random feature selection, where each tree casts a vote and where the finally assigned class is the one voted for most often). Other algorithms have been tested as well; none of them came close to the results of the random-forest model, though. Interestingly enough, the features of semantic similarity, including the geodesic path between the two word forms, were found to have a negative influence on the model's performance on the noun and verb sets. The shortest path does have a positive influence on the results for adjectives and adverbs.
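As an illustration of this classification setup, the sketch below trains a random forest on a toy feature matrix of the kind described here (centrality values and WordNet-derived counts per pair of homograph word forms). It uses scikit-learn and invented data; the thesis's own experiments are not reproduced here, and the feature layout is an assumption.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy feature matrix: one row per pair of homograph word forms.
# Assumed columns: closeness, betweenness, degree, PageRank, shared lemmas.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
# Class attribute: 1 = related (polysemous), 0 = unrelated (homonymous).
y = rng.integers(0, 2, size=200)

# A random forest: an ensemble of decision trees built on random feature
# subsets, where the majority vote of the trees determines the predicted class.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold cross-validation accuracy: {scores.mean():.3f}")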

Other features that have previously been proposed to identify polysemy in WordNet were compared to the feature set presented in this thesis, namely the CoreLex basic types (Buitelaar, 1998) that have been used by Utt and Padó (2011) and Boleda et al. (2012), among others, and the WordNet unique beginners that were used in Peters and Peters (2000). Both approaches rely on the identification of patterns of co-occurrence of either basic types or unique beginners that would predict regular polysemy. Three main problems arise here: the approaches are only applicable to nouns, they are only able to identify regular cases of polysemy, and they, especially the basic types, do not work well on the word form level, but rather treat all homograph word forms as one word and assign either the label polysemy or homonymy to it, even though many words show both related and unrelated senses112. The goal in this thesis was instead to work on the word form level and hence to identify those word senses that are in fact similar. Furthermore, the approach presented here is not restricted to nouns or to regular cases of polysemy. As could be expected from these restrictions, the previously proposed methods did not perform well on the new task (i.e., identifying polysemy across different POS, including irregular cases, and looking at the word forms and not at sets of word forms). Using the proposed set of graph-based features (i.e., the centrality measures plus the measures based on the WordNet structure) yielded an accuracy between 87.35% for adverbs and 94.11% for verbs, whereas the results for adjectives (91.17%) and nouns (90.12%) lie between these values.

Using the same features, it is expected to also be possible to identify other missing relations in WordNet. The so-called cousin relations, for example, are only used very seldom in WordNet. Since they offer very rich information on the connectedness of otherwise unconnected word senses or forms, it might be useful to add these relations in as many cases as possible, without the need to do this work manually.

The second ontology is not a purely linguistic ontology; it contains general knowledge, and it is in its structure much more similar to social and other complex networks than WordNet. DBpedia is an ontology that is automatically derived from the structured data of Wikipedia (e.g., the infoboxes, redirect pages, and others). While this information is quite precise and accurate, it is missing many links that exist in Wikipedia that are not identified during the extraction of DBpedia and that cannot be mapped to relations defined in the ontology.

112 Related and unrelated with regard to their semantics in today's language and not with regard to earlier historical stages of the language.

A comparison between Wikipedia and DBpedia on the network level shows these differences. The average geodesic path in DBpedia is longer than that of Wikipedia. This indicates that some, but maybe only very few, connections are missing between vertices that are otherwise far apart in the network. As has been shown in the small-world model (Watts and Strogatz, 1998), adding only very few of these connections can reduce the length of the average geodesic path greatly. Looking at the clustering coefficient C, one can see that Wikipedia's C value is much higher than that of DBpedia. This indicates that there are missing connections in the direct neighborhood of the vertices in the DBpedia network. Therefore, unconnected triads are examined in this thesis: the second-degree neighbors of any vertex are looked at, and if the vertices are unconnected, the idea is to estimate whether a connection is missing or not. Furthermore, this approach reduces the complexity of the task, since not every vertex has to be checked against every other vertex of the network. The resulting classifications, either missing a relation or not, can be compared to the Wikipedia graph. In a second step, the missing relation is classified (i.e., in the case that a connection is missing, a machine learning algorithm tries to identify the most plausible relation defined in the DBpedia ontology).

Looking at the field of ontology mining, the Apriori algorithm was identified as a promising technique to solve the given task. It was used before to automatically derive ontologies based on DBpedia (Völker and Niepert, 2011) and to infer missing types in DBpedia (Paulheim and Bizer, 2013). Especially the identification of missing types is to some degree similar to the tasks in this thesis. Since the Apriori algorithm only counts occurrences of events, it does not distinguish between numerical and nominal values. The algorithm can be applied straightforwardly, but only to nominal features. The occurrence of different relations between the unconnected vertices is taken into account. It is assumed that certain patterns arise (i.e., the occurrence of two relations between two vertices should indicate whether they are connected or not). One main problem is the noise in the data set (i.e., errors that exist and therefore hinder machine learning). This noise is caused by the fact that many unconnected vertices in DBpedia should in fact be connected. Trying to solve the first problem, the Apriori algorithm identifies 62 rules that seem to indicate a connection, but there are many exceptions to these rules. Applying the identified rules to unconnected vertices in the DBpedia set and checking the predictions against the Wikipedia graph showed that the rules are not very accurate. The results are unsatisfactory, and the task of identifying missing relations using DBpedia relations

and the Apriori and the k-means algorithms remains unsolved. Other approaches might be conceivable, for example using supervised learning and bootstrapping to overcome the problem of noise.

The second task was to classify the kind of relation missing between two entities. Since DBpedia is extracted from Wikipedia, it is expected to show a similar interlinking. From the Wikipedia graph, it is known which entities are connected and are possibly missing a link in DBpedia. An Apriori rule mining approach was undertaken to identify patterns of connections between two (unconnected) entities sharing a common neighborhood that could be used to classify the missing relation. Using only a small subset of the almost 500 possible relations in DBpedia, 42 rules were mined that can be used to identify which relation is missing between two second-degree neighbors that should be connected. After analyzing the accuracy of the rules, it was decided to exclude further relations that were shown to perform inaccurately. Another possible way to reduce the error rate might be to exclude only certain rules from the rule set that can be shown to perform inaccurately. Applying the proposed rule set results in an accuracy of 94.67%. Since the data set only contains a very small subset of the possible relations, the recall or coverage of these rules is assumed to be rather low. Examining how the proposed rules work and what the reasons for bad performance might be, some issues could be identified that might be causing the problems with some of the classes or rules.

For one, it was found that many relations are rather redundant, both in their application in DBpedia and with regard to their semantics. This can be seen in many already existing triples in the DBpedia data set, but just as well in some of the predictions made here. The hometown relation, for example, is often used where one would expect a country or region relation. In other words, some relations are just too similar and therefore hard to distinguish. This might be causing problems when trying to predict these relations, as was found in this thesis. Related to this problem, many relations are underspecified with regard to their domain and range restrictions. The hometown relation is similar to the other mentioned relations because it is not restricted to towns or cities, but to settlements. But in DBpedia, settlements can be anything from a country to a village. Hence, the relations are very similar, since they do not differ in their range and domain restrictions in the way one would expect. Others, such as the team relation, might be expected to only exist between a person and a sports team. Instead, anything could be connected to a team by this relation without contradicting the ontology. Another reason for the non-optimal performance of many of the above-mentioned

geographic relations might be found in the linking structure of Wikipedia and is hence related to the chosen gold standard. Wikipedia offers rich interlinking between, e.g., villages or cities pertaining to a common region or district. While this information is actually given in a more or less structured way, it is not given in a way readily readable by the DBpedia extraction mechanism. Wikipedia automatically inserts special navigation boxes at the bottom of a page based on the information given in the wikipage. The line {{List of European capitals by region}} in a wikipage causes Wikipedia to insert a list of other European capitals for the user. Since this is done at runtime and based on the category system of Wikipedia, this information is not extracted for DBpedia. In DBpedia this results in many missing connections that exist according to the Wikipedia linking table. Furthermore, since these relations are missing for every wikipage in DBpedia, there are not many examples of these missing relations in the data set to mine corresponding rules from. Most of these relations cannot usefully be mapped to the existing DBpedia relations; they were not taken into account when the DBpedia ontology was designed. This last point is the main reason why the link prediction task, identifying whether a link is missing or not, based on triads does not work with an acceptable accuracy. The Wikipedia graph is just not a perfect gold standard for the kind of relations that are missing and that are of interest in DBpedia. The low coverage of the link classification task can also partly be traced back to this systematic mismatch between the two data sets, Wikipedia and DBpedia.

Other attempts to extract relations from Wikipedia include text-parsing approaches. Interestingly enough, the evaluation undertaken in this thesis shows that a graph- or network-based approach is, in some cases, even more accurate than a parsing approach. It was shown that some of the predictions or classifications that the approach presented here makes were also not given in the text of the related wikipage. In the example presented in Chp. 6.6.2, a hometown relation between a band and a city was established. This information was taken from the interconnection of the wikipages and was not contained in the single page. Only by reading other related pages (i.e., those regarding the band members) could the information that the band was indeed from the city be confirmed by a human reader. A parsing-based approach would not be able to identify this information given by the network structure.

Both approaches to identifying different relations, in different contexts, using different machine learning or data mining techniques, indicate that interpreting language as a network can be of great use in many NLP tasks, as shown here for ontologies. Ontologies

build complex networks, and like other networks they are often incomplete. But just as has been shown for social networks and other complex networks in recent years, the network structure itself contains a lot of information that can be exploited to extend the network and hence the ontology in question. Given two solid candidates, two entities of the ontology or two vertices in the network, one is likely able to predict certain properties of these vertices (e.g., what kind of relation should connect them). While the network structure of both language and ontologies has often been stressed before, this thesis presents the first approach to extending the knowledge given in an ontology by exploiting its network structure.

8 References


Aas, K. and Eikvil, L. (1999). Text categorisation: A survey.

Adamic, L. (2002). Zipf, power-laws, and pareto – a ranking tutorial. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.

Adamic, L. A. and Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211 – 230.

Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD Rec., 22(2), 207–216.

Ammer, C. (1993). Seeing Red Or Tickled Pink: Color Terms in Everyday Language. Plume Books for Wordwatchers. Plume.

Andor, J. (2004). The master and his performance: An interview with noam chomsky. Intercultural Pragmatics, 1, 93 – 111.

Antoniou, G. and Harmelen, F. V. (2003). Web ontology language: Owl. In Handbook on Ontologies in Information Systems, pages 67–92. Springer-Verlag.

Apresjan, J. U. D. (1974). Regular Polysemy. Linguistics, 12, 5–32.

Arthur, D. and Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA. Society for Industrial and Applied Mathematics.

Auerbach, F. (1913). Das Gesetz der Bevölkerungskonzentration. Petermanns Geographische Mitteilungen, LIX, 73–76.

Baader, F. and Nutt, W. (2003). Basic description logics. In Baader et al. (2003), pages 43–95.

Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., and Patel-Schneider, P. F., editors (2003). The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press.

Backstrom, L., Sun, E., and Marlow, C. (2010). Find Me If You Can: Improving Geographical Prediction with Social and Spatial Proximity. In Proceedings of the 19th International Conference on World Wide Web (WWW '10), pages 61–70.

Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805–810.

Barabasi, A.-L. and Oltvai, Z. N. (2004). Network biology: understanding the cell’s functional organization. Nature Reviews Genetics, 5(2), 101–113.

Barabási, A.-L. and Réka, A. (1999). Emergence of Scaling in Random Networks. Science, 286, 509–512.

Barque, L. and Chaumartin, F.-R. (2009). Regular polysemy in WordNet. JLCL, 24(2), 5–18.

Berners-Lee, T. (1994). Universal resource identifiers in www. Technical report, http://www.ietf.org/rfc/rfc1630.txt.

Biemann, C. (2006). Chinese Whispers – an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. In Proceedings of TextGraphs: the Second Workshop on Graph Based Methods for Natural Language Processing, pages 73–80, New York City. Association for Computational Linguistics.

Biemann, C. (2009). Unsupervised part-of-speech tagging in the large. Res. Lang. Com- put., 7(2-4), 101–135.

Biemann, C. (2012). Turk bootstrap word sense inventory 2.0: A large-scale resource for lexical substitution. In N. Calzolari, K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, LREC, pages 4038–4042. European Language Resources Association (ELRA).

Biemann, C. and Giesbrecht, E. (2011). Distributional semantics and compositionality 2011: Shared task description and results. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 21–28, Portland, Oregon, USA. Association for Computational Linguistics.

Biemann, C. and Quasthoff, U. (2009). Networks generated from natural language text. In N. Ganguly, A. Deutsch, and A. Mukherjee, editors, Dynamics On and Of Complex Networks, Modeling and Simulation in Science, Engineering and Technology, pages 167–185. Birkhäuser Boston.

Biemann, C. and Riedl, M. (2013). Text: Now in 2d! a framework for lexical expansion with contextual similarity. Journal of Language Modelling, 1(1).

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia – a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3), 154–165.

Blessing, A. and Kuhn, J. (2014). Textual emigration analysis (tea). In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008 (12pp).

Boleda, G., Padó, S., and Utt, J. (2012). Regular polysemy: A distributional model. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 151–160, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bollacker, K. D., Cook, R. P., and Tufts, P. (2007). Freebase: A shared database of structured general human knowledge. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada, pages 1962–1963. AAAI Press.

Bonacich, P. (1987). Power and Centrality: A family of Measures. The American Journal of Sociology, 92(5), 1170–1182.

Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., and Wagner, D. (2008). On modularity clustering. IEEE Trans. Knowl. Data Eng., 20(2), 172–188.

Breiman, L. (2001). Random forests. Machine Learning, (45), 5–32.

Buitelaar, P. (1998). Corelex: An ontology of systematic polysemous classes. In Proceedings of the 1st International Conference on Formal Ontology in Information Systems (FOIS'98), June 6–8, volume 46 of Frontiers in Artificial Intelligence and Applications, pages 221–235, Trento, Italy. IOS Press.

Butler, D. (2007). Data sharing threatens privacy. Nature, 449(7163), 644–645.

Carroll, J. J. and Klyne, G. (2004). Resource description framework (RDF): Concepts and abstract syntax. W3C recommendation, W3C.

Chang, C.-C. and Lin, C.-J. (2011). Libsvm: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3), 27:1–27:27.

Charles, W. G. and Miller, G. A. (1989). Contexts of antonymous adjectives. Applied Psycholinguistics, 10, 357–375.

Chomsky, N. (1957). Syntactic structures. Janua linguarum ; 4. Mouton, ’s-Gravenhage.

Chomsky, N. (1988). Language and Problems of Knowledge: The Managua Lectures. Current studies in linguistics series. MIT Press.

Chomsky, N. (2001). The architecture of language. Oxford Univ. Press, Oxford, 2. impr. edition.

Church, A. (1936a). A note on the entscheidungsproblem. J. Symbolic Logic, 1(1), 40–41.

Church, A. (1936b). An unsolvable problem of elementary number theory. American Journal of Mathematics, 58(2), pp. 345–363.

Clauset, A., Newman, M. E. J., and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, pages 1–6.

Clauset, A., Moore, C., and Newman, M. E. J. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453, 98–101.

Collins, A. M. and Quillian, M. R. (1969). Retrieval time from semantic memory. Journal Of Verbal Learning And Verbal Behavior, 8(2), 240–247.

Copeland, B. J. (2008). The church-turing thesis. In E. N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Fall 2008 edition.

Copeland, J. (2002). Narrow versus wide mechanism. In M. Scheutz, editor, Computationalism: New Directions, volume 97, pages 5–32. MIT Press.

Copestake, A. and Briscoe, T. (1996). Semi-productive polysemy and sense extension. In Pustejovsky and Boguraev (1996), pages 15–68.

Copestake, A., Flickinger, D., Pollard, C., and Sag, I. (2005). Minimal Recursion Semantics: An Introduction. Research on Language & Computation, 3(2-3), 281–332.

Cover, T. M. and Thomas, J. A. (1991). Elements of information theory. Wiley-Interscience, New York, NY, USA.

Croft, W. and Cruse, D. (1993). Cognitive Linguistics. Cambridge Textbooks in Linguistics. Cambridge University Press.

Croft, W. and Cruse, D. A. (2004). Cognitive linguistics. Cambridge textbooks in linguistics. Cambridge Univ. Press, Cambridge.

Cruse, D. A. (1986). Lexical semantics. Cambridge textbooks in linguistics. Cambridge Univ. Press, Cambridge.

Cruse, D. A. (2000). Meaning in language: an introduction to semantics and pragmatics. Oxford linguistics. Oxford Univ. Press, Oxford [u.a.].

d'Amato, C., Fanizzi, N., and Esposito, F. (2010). Inductive learning for the Semantic Web: What does it buy? Semantic Web, 1(1-2), 53–59.

Davis, D. A., Lichtenwalter, R., and Chawla, N. V. (2011). Multi-relational link prediction in heterogeneous information networks. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 281–288. IEEE Computer Society.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal Of The American Society For Information Science, 41(6), 391–407.

Dorogovtsev, S. N. and Mendes, J. F. F. (2001). Language as an evolving word web. Proceedings of the Royal Society of London B: Biological Sciences, 268(1485), 2603– 2606.

Drineas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. (2004). Clustering large graphs via the singular value decomposition. Machine Learning, 56(1), 9–33.

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Comput. Linguist., 19(1), 61–74.

Easley, D. A. and Kleinberg, J. M. (2010). Networks, Crowds, and Markets – Reasoning About a Highly Connected World. Cambridge University Press.

Entrup, B. (2014). Graph-Based, Supervised Machine Learning Approach to (Irregular) Polysemy in WordNet, pages 84–91. Springer International Publishing, Cham.

Erdős, P. and Rényi, A. (1959). On random graphs. Publications Mathematicae, 6, 290–297.

Erk, K. and Padó, S. (2008). A structured vector space model for word meaning in context. In Proceedings of EMNLP, pages 897–906, Honolulu, HI.

Fellbaum, C. (1990). English verbs as a Semantic Net. In Five Papers on WordNet, Princeton University, Cognitive Science Laboratory, pages 40–61. (Tech. Rep. No. CSL-43), Princeton, NJ.

Fellbaum, C. (1995). Co-Occurrence and Antonymy. International Journal of Lexicography, 8(4), 281–303.

Fellbaum, C. (1998). A semantic network of English verbs. In C. Fellbaum, editor, WordNet: an Electronic Lexical Database, pages 69–104. MIT.

Fellbaum, C. (2006). WordNet(s). Encyclopedia of Language & Linguistics, 13(Second edition), 665–670.

Fellbaum, C., Gross, D., and Miller, K. (1990). Adjectives in WordNet. In Five Papers on WordNet, Princeton University, Cognitive Science Laboratory, pages 26–39. (Tech. Rep. No. CSL-43), Princeton, NJ.

Ferrer, R. and Solé, R. V. (2001). The small-world of human language. Proc R Soc Lond B, 268, 2261–2266.

Fillmore, C. J. (1975). An alternative to checklist theories of meaning. Proceedings of the First Annual Meeting of the Berkeley Linguistics Society, 1, 123–131.

Fillmore, C. J. (1976). Frame semantics and the nature of language. Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech, 280(1), 20–32.

Fillmore, C. J. (1977). Scenes-and-frames semantics. Number 59 in Fundamental Studies in Computer Science. North Holland Publishing.

Firth, J. R. (1957). A synopsis of linguistic theory, 1930-55. Studies in Linguistic Analysis (special volume of the Philological Society), 1952-59, 1–32.

Fitch, W. T. and Hauser, M. D. (2004). Computational constraints on syntactic processing in a nonhuman primate. Science, 303(5656), 377–380.

Freeman, L. C. (1978). Centrality in Social Networks Conceptual Clarification. Social Networks, 1(1978), 215–239.

Freeman, L. C., Borgatti, S. P., and White, D. R. (1991). Centrality in valued graphs: A measure of betweenness based on network flow. Social Networks, 13(2), 141–154.

Geckeler, H. (1982). Strukturelle Semantik und Wortfeldtheorie. Fink, München, 3. edition.

Gentner, D. and France, I. M. (1988). The verb mutability effect: Studies of the combinatorial semantics of nouns and verbs. In Lexical Ambiguity Resolution, pages 343–382. Morgan Kaufman, San Mateo, CA.

Gliozzo, A. M. and Strapparava, C. (2009). Semantic Domains in Computational Linguistics. Springer.

Golbeck, J., Robles, C., and Turner, K. (2011). Predicting personality with social media. In CHI '11 Extended Abstracts on Human Factors in Computing Systems, CHI EA '11, pages 253–262, New York, NY, USA. ACM.

Gomes, L. (2014). Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts.

Grau, B. C., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P., and Sattler, U. (2008). OWL 2: The next step for OWL. Web Semantics: Science, Services and Agents on the World Wide Web, 6(4), 309–322.

Grodzinsky, Y. (2000). The neurology of syntax: Language use without Broca’s area. Behavioral and Brain Sciences, 23(1), 1–21.

Gruber, T. (1993). A translation approach to portable ontologies. Knowledge Acquisition, 5(2), 199–220.

Gruber, T. (2009). Ontology. Encyclopedia of Database Systems.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The weka data mining software: an update. SIGKDD Explor. Newsl., 11(1), 10–18.

Harris, Z. (1954). Distributional structure. Word, 10(23), 146–162.

Harris, Z. (1968). Mathematical Structures of Language. John Wiley and Son, New York.

Hasan, M. A., Chaoji, V., Salem, S., and Zaki, M. (2006). Link prediction using supervised learning. In Proc. of SDM 06 workshop on Link Analysis, Counterterrorism and Security.

Hauser, M. D., Chomsky, N., and Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298, 1569–1579.

Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, pages 539–545.

Hearst, M. A. (1998). Automated discovery of WordNet relations. In C. Fellbaum, WordNet: An Electronic Lexical Database, pages 131–153. MIT Press.

Hebborn, M. (2013). Automatische Extraktion semantischer Relationen für die Ontologieerstellung aus Textgliederungsstrukturen: ein SBCG-Ansatz. Kassel Univ. Press, Kassel.

Hermann, K. M., Grefenstette, E., and Blunsom, P. (2013). ”not not bad” is not ”bad”: A distributional account of negation. CoRR, abs/1306.2158.

Hesse, W. (2002). Ontologie(n) – aktuelles schlagwort. Informatik Spektrum, 25(6), 477–480.

Hilbert, D. and Ackermann, W. (1928). Grundzüge der theoretischen Logik. Die Grundlehren der mathematischen Wissenschaft, Bd. 27. J. Springer.

Hirst, G. and St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 305–332. MIT Press.

Horvát, E.-A., Hanselmann, M., Zweig, K., and Hamprecht, F. (2012). One Plus One Makes Three (for Social Networks). PloS one, 7(4), 1–8.

Indefrey, P., Brown, C. M., Hagoort, P., Herzog, H., Sach, M., and Seitz, R. J. (1997). A PET Study of Cerebral Activation Patterns Induced by Verb Inflection. Neuroimage, 5, 548.

Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50.

Jack Copeland, B. (2008). Computation, pages 1–17. Blackwell Publishing Ltd.

Jaeger, J. J., Lockwood, A. H., Kemmerer, D. L., Valin, R. D. V., Murphy, B. W., and Khalak, H. G. (1996). A Positron Emission Tomographic Study of Regular and Irregular Verb Morphology in English. Language, 72(3), 451–497.

Jaworski, W. and Przepiórkowski, A. (2014). Syntactic approximation of semantic roles. In A. Przepiórkowski and M. Ogrodniczuk, editors, Advances in Natural Language Processing, volume 8686 of Lecture Notes in Computer Science, pages 193–201. Springer International Publishing.

Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. Computing Research Repository.

John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Justeson, J. S. and Katz, S. M. (1991). Co-occurrences of antonymous adjectives and their contexts. Comput. Linguist., 17(1), 1–19.

Justeson, J. S. and Katz, S. M. (1992). Redefining antonymy: the textual structure of a semantic relation. Literary and Linguistic Computing, 7(3), 176–184.

Katz, J. J. and Fodor, J. A. (1963). The structure of a semantic theory. Language, 39(2), pp. 170–210.

Katz, L. (1953). A new status index derived from sociometric analysis. Psychometrika, 18(1), 39–43.

Kilgarriff, A. (2001). English lexical sample task description. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, pages 17–20, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kosinski, M., Stillwell, D., and Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences.

Kripke, S. A. (1980). Naming and necessity. Cambridge, Mass. : Harvard University Press.

Lally, A., Prager, J. M., McCord, M. C., Boguraev, B. K., Patwardhan, S., Fan, J., Fodor, P., and Chu-Carroll, J. (2012). Question analysis: How watson reads a clue. IBM J. Res. Dev., 56(3), 250–263.

Langacker, R. (1987). Foundations of Cognitive Grammar: Theoretical prerequisites. Number Bd. 1 in Foundations of Cognitive Grammar. Stanford University Press.

Le, Q., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G., Dean, J., and Ng, A. (2012). Building high-level features using large scale unsupervised learning. In International Conference in Machine Learning.

le Cessie, S. and van Houwelingen, J. (1992). Ridge estimators in logistic regression. Applied Statistics, 41(1), 191–201.

Leacock, C., Miller, G. A., and Chodorow, M. (1998). Using corpus statistics and wordnet relations for sense identification. Comput. Linguist., 24(1), 147–165.

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, SIGDOC '86, pages 24–26, New York, NY, USA. ACM.

Leskovec, J., Lang, K. J., Dasgupta, A., and Mahoney, M. W. (2008a). Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.

Leskovec, J., Lang, K. J., Dasgupta, A., and Mahoney, M. W. (2008b). Statistical prop- erties of community structure in large social and information networks. In Proceedings of the 17th international conference on World Wide Web, WWW ’08, pages 695–704, New York, NY, USA. ACM.

Leskovec, J., Lang, K. J., and Mahoney, M. W. (2010). Empirical Comparison of Algo- rithms for Network Community Detection.

Liben-Nowell, D. and Kleinberg, J. M. (2007). The link-prediction problem for social networks. JASIST, 58(7), 1019–1031.

Lin, D. (1998). An information-theoretic definition of similarity. In J. W. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning (ICML 1998), July 24–27, 1998, Madison, WI, USA, pages 296–304. Morgan Kaufmann Publishers, San Francisco, CA, USA.

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28, 129–136.

Löbner, S. (2003). Semantik: eine Einführung. De Gruyter Studienbuch. de Gruyter, Berlin.

Lohnstein, H. (2011). Formale Semantik und natürliche Sprache. Mouton de Gruyter, Berlin, New York.

Lüngen, H. and Lobin, H. (2010). Extracting domain knowledge from tables of contents. In Proceedings of Digital Humanities, pages 331–334, London.

Lyons, J. (1977). Semantics. Cambridge University Press, Cambridge. Volumes 1–2.

Lyons, J. (1995). Linguistic Semantics: An Introduction. Cambridge University Press, Cambridge.

Mahdisoltani, F., Biega, J., and Suchanek, F. M. (2015). YAGO3: A knowledge base from multilingual Wikipedias. In 7th Biennial Conference on Innovative Data Systems Research (CIDR 2015).

Manning, J. R., Sperling, M. R., Sharan, A., Rosenberg, E. A., and Kahana, M. J. (2012). Spontaneously Reactivated Patterns in Frontal and Temporal Lobe Predict Semantic Clustering during Memory Search. The Journal of Neuroscience, 32(26), 8871–8878.

Marslen-Wilson, W. D. and Tyler, L. K. (1997). Dissociating types of mental computation. Nature, 387(June), 592–594.

McBride, B. (2004). The resource description framework (RDF) and its vocabulary description language RDFS. In S. Staab and R. Studer, editors, Handbook on Ontologies, pages 51–66. Springer, 2 edition.

McCallum, A. and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on 'Learning for Text Categorization'.

McCarthy, D. (2006). Relating WordNet senses for word sense disambiguation. In Proceedings of the ACL Workshop on Making Sense of Sense, pages 17–24.

McClelland, J. L. and Rogers, T. T. (2003). The parallel distributed processing approach to semantic cognition. Nat Rev Neurosci, 4(4), 310–322.

McCloskey, M. and Glucksberg, S. (1979). Decision processes in verifying category membership statements: Implications for models of semantic memory. Cognitive Psychology, 11(1), 1–37.

McCord, M. C. (1980). Slot grammars. Comput. Linguist., 6(1), 31–43.

McCord, M. C. (1993). Heuristics for broad-coverage natural language parsing. In Proceedings of the workshop on Human Language Technology, HLT '93, pages 127–132, Stroudsburg, PA, USA. Association for Computational Linguistics.

McCord, M. C., Murdock, J. W., and Boguraev, B. (2012). Deep parsing in Watson. IBM Journal of Research and Development, 56(3), 3.

Mehler, A. (2006). Text linkage in the wiki medium: A comparative study. In Proceedings of the EACL 2006 Workshop on New Text: Wikis and Blogs and Other Dynamic Text Sources, pages 1–8.

Mehler, A. (2007). Evolving Lexical Networks. A Simulation Model of Terminological Alignment. In A. Benz, C. Ebert, and R. Van Rooij, editors, Proceedings of the Workshop on Language Games and Evolution at the 9th European Summer School in Logic, Language and Information (ESSLLI 2007), Trinity College Dublin, 6–17 August, pages 57–67.

Mehler, A. (2008). Structural Similarities of Complex Networks: a Computational Model By Example of Wiki Graphs. Applied Artificial Intelligence, 22(7-8), 619–683.

Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. pages 1–8.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ArXiv e-prints.

Milgram, S. (1967). The small world problem. Psychology Today, 1, 61.

Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. J. (1990a). Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), 235–244.

Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. (1990b). Introduction to WordNet: An On-line Lexical Database. In Five Papers on WordNet, Princeton University, Cognitive Science Laboratory, pages 1–25. (Tech. Rep. No. CSL-43), Princeton, NJ.

Miller, K. J. (1998). Modifiers in WordNet. In C. Fellbaum, editor, WordNet: an Electronic Lexical Database, pages 47–67. MIT, London.

Minsky, M. (1974). A framework for representing knowledge. Technical report, Cambridge, MA, USA.

Montague, R. (1974). Formal Philosophy; Selected Papers of Richard Montague. New Haven: Yale University Press.

Murphy, G. and Brownell, H. (1985). Category differentiation in object recognition: Typicality constraints on the basic category advantage. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(1), 70–84.

Murphy, M. L. (2003). Semantic relations and the lexicon: antonymy, synonymy and other paradigms. Cambridge Univ. Press, Cambridge.

Nardi, D. and Brachman, R. J. (2003). An introduction to description logics. In Baader et al. (2003), pages 1–40.

Navigli, R. (2009). Word sense disambiguation: a survey. ACM Computing Surveys, 41(2), 1–69.

Newman, M. (2010). Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA.

Newman, M. E. J. (2004). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351.

Newman, M. E. J. (2006). Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3), 036104+.

Newman, M. E. J. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E, 69(2), 026113.

Nunberg, G. (1996). Transfers of meaning. In Pustejovsky and Boguraev (1996), pages 109–132.

Opsahl, T., Agneessens, F., and Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3), 245–251.

Pado, S. (2002). Extracting semantic information from corpora using dependency relations.

Padó, S. (2007). Cross-Lingual Annotation Projection Models for Role-Semantic Information. PhD thesis, Saarland University.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, Brisbane, Australia.

Pagel, M., Atkinson, Q. D., Calude, A. S., and Meade, A. (2013). Ultraconserved words point to deep language ancestry across Eurasia. Proceedings of the National Academy of Sciences, 110(21), 8471–8476.

Partee, B. H. (2011). Formal Semantics: Origins, Issues, Early Impact. The Baltic International Yearbook of Cognition, Logic and Communication, 6, 1–52.

Paulheim, H. (2012). Browsing linked open data with auto complete. In Semantic Web Challenge.

Paulheim, H. and Bizer, C. (2013). Type inference on noisy RDF data. In H. Alani, L. Kagal, A. Fokoue, P. T. Groth, C. Biemann, J. X. Parreira, L. Aroyo, N. F. Noy, C. Welty, and K. Janowicz, editors, International Semantic Web Conference (1), volume 8218 of Lecture Notes in Computer Science, pages 510–525. Springer.

Pedersen, T., Patwardhan, S., and Michelizzi, J. (2004). WordNet::Similarity – measuring the relatedness of concepts. In D. L. McGuinness and G. Ferguson, editors, AAAI, pages 1024–1025. AAAI Press / The MIT Press.

Perez-Aguera, J. R., Greenberg, J., and Perez-Iglesias, J. (2010). INEX+DBpedia: A Corpus for Semantic Search Evaluation. pages 1161–1162.

Peters, W. and Peters, I. (2000). Lexicalised systematic polysemy in WordNet. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000. European Language Resources Association.

Philpot, A., Hovy, E., and Pantel, P. (2005). The Omega ontology. In Proceedings of the ONTOLEX Workshop at IJCNLP 2005, pages 59–66.

Pinker, S. (1991). Rules of Language. Science, 253, 530–535.

Pinker, S. (2005). So how does the mind work? Mind and Language, 20, 1–24.

Pinker, S. and Prince, A. (1991). Regular and Irregular Morphology and the Psychological Status of Rules of Grammar. Proceedings of the Seventeenth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on The Grammar of Event Structure, pages 230–251.

Pollard, C. and Sag, I. A. (1988). Information-based syntax and semantics: Vol. 1: fundamentals. Center for the Study of Language and Information, Stanford, CA, USA.

Pons, P. and Latapy, M. (2005). Computing communities in large networks using random walks. In Proceedings of the 20th International Conference on Computer and Information Sciences, ISCIS'05, pages 284–293, Berlin, Heidelberg. Springer-Verlag.

Pottier, B. (1978). Die semantische Definition in den Wörterbüchern. In H. Geckeler, editor, Strukturelle Bedeutungslehre, pages 402–411. Wissenschaftliche Buchgesellschaft, Darmstadt.

Pustejovsky, J. (1991). The generative lexicon. Computational Linguistics, 17(4), 409–441.

Pustejovsky, J. (1993). Type coercion and lexical selection. In J. Pustejovsky, editor, Semantics and the Lexicon, pages 73–94. Kluwer, London.

Pustejovsky, J. (1998). The Generative Lexicon. MIT Press, 2nd edition.

Pustejovsky, J. and Boguraev, B. (1996). Lexical Semantics: The Problem of Polysemy. Clarendon paperbacks. Clarendon Press.

Quillian, M. R. (1967). Word concepts: a theory and simulation of some basic semantic capabilities. Behavioral Science, 12(5), 410–430.

Quillian, M. R. (1969). The teachable language comprehender: A simulation program and theory of language. Commun. ACM, 12(8), 459–476.

Quinlan, J. R. (1986). Induction of decision trees. Mach. Learn., 1(1), 81–106.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Raghavan, U. N., Albert, R., and Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks.

Resnik, P. and Yarowsky, D. (1999). Distinguishing systems and distinguishing senses: new evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2), 113–133.

Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

Sarkar, P., Chakrabarti, D., and Moore, A. W. (2011). Theoretical justification of popular link prediction heuristics. In Proceedings of the Twenty-Second international joint conference on Artificial Intelligence, IJCAI'11, pages 2722–2727. AAAI Press.

Sarkar, P., Chakrabarti, D., and Jordan, M. I. (2012). Nonparametric link prediction in dynamic networks. In ICML. icml.cc / Omnipress.

Scellato, S., Noulas, A., and Mascolo, C. (2011). Exploiting place features in link prediction on location-based social networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 1046–1054, New York, NY, USA. ACM.

Smith, B. and Welty, C. A. (2001). Ontology – towards a new synthesis. Proceedings of the International Conference on Formal Ontology in Information Systems.

Snow, R., Prakash, S., Jurafsky, D., and Ng, A. Y. (2007). Learning to merge word senses. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28–30, 2007, pages 1005–1014. ACL.

Socher, R., Bauer, J., Manning, C. D., and Ng, A. Y. (2013). Parsing With Compositional Vector Grammars. In ACL.

Steyvers, M. and Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41–78.

Stokoe, C. (2005). Differentiating homonymy and polysemy in information retrieval. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 403–410, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stuckenschmidt, H. (2009). Ontologien: Konzepte, Technologien und Anwendungen (Informatik im Fokus). Springer, Berlin, 1 edition.

Taskar, B., Wong, M.-F., Abbeel, P., and Koller, D. (2003). Link prediction in relational data. In Advances in Neural Information Processing Systems.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. (2011). How to grow a mind: statistics, structure, and abstraction. Science, 331(6022), 1279–1285.

Travers, J. and Milgram, S. (1969). An Experimental Study of the Small World Problem. Sociometry, 32(4), 425–443.

Trier, J. (1931). Der deutsche Wortschatz im Sinnbezirk des Verstandes: von den Anfängen bis zum Beginn des 13. Jahrhunderts. Germanische Bibliothek. Winter, Heidelberg.

Trier, J. (1973). Aufsätze und Vorträge zur Wortfeldtheorie. Janua Linguarum; 174. Mouton, The Hague [u.a.].

Turing, A. M. (1937). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, s2-42(1), 230–265.

Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.

Ullman, M. T., Corkin, S., Hickok, G., Growdon, J. H., Koroshetz, W. J., and Pinker, S. (1997). A Neural Dissociation within Language: Evidence that the Mental Dictionary Is Part of Declarative Memory, and that Grammatical Rules Are Processed by the Procedural System. Journal of Cognitive Neuroscience, 9(2), 266–276.

Utt, J. and Padó, S. (2011). Ontology-based distinction between polysemy and homonymy. In Proceedings of the Ninth International Conference on Computational Semantics, IWCS '11, pages 265–274, Stroudsburg, PA, USA. Association for Computational Linguistics.

Veale, T. (2004). Polysemy and category structure in WordNet: An evidential approach. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004. European Language Resources Association.

Völker, J. and Niepert, M. (2011). Statistical schema induction. In G. Antoniou, M. Grobelnik, E. Simperl, B. Parsia, D. Plexousakis, P. De Leenheer, and J. Pan, editors, The Semantic Web: Research and Applications, volume 6643 of Lecture Notes in Computer Science, pages 124–138. Springer Berlin Heidelberg.

Volz, R., Kleb, J., and Mueller, W. (2007). Towards ontology-based disambiguation of geographical identifiers.

Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(June), 440–442.

Weyerts, H., Penke, M., Dohrn, U., Clahsen, H., and Münte, T. F. (1997). Brain potentials indicate differences between regular and irregular German plurals. Cognitive Neuroscience and Neuropsychology, 8(4), 957–962.

Wittenberg, E., Paczynski, M., Wiese, H., Jackendoff, R., and Kuperberg, G. (2014). The difference between giving a rose and giving a kiss: Sustained neural activity to the light verb construction. Journal of Memory and Language, 73(0), 31–42.

Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., and Steinberg, D. (2007). Top 10 algorithms in data mining. Knowl. Inf. Syst., 14(1), 1–37.

Wu, Z. and Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, ACL ’94, pages 133–138, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zipf, G. K. (1965). Human Behaviour and the Principle of Least Effort. Hafner Publishing, reprint of the 1949 edition.

Zlatić, V., Božičević, M., Štefančić, H., and Domazet, M. (2006). Wikipedias: Collaborative web-based encyclopedias as complex networks. Physical Review E, 74(1).

A Appendix

Table 46: Complete list of Apriori rules to predict relations in DBpedia.

Relation A–B | Relation B–C | Prediction | Counter
careerStation | team | team | 496
recordLabel/associatedBand | associatedMusicalArtist | associatedMusicalArtist | 174
isPartOf | part | isPartOf/part | 4960
bandMember | hometown | hometown | 75
associatedBand | bandMember/associatedBand | associatedMusicalArtist | 2689
city/state | country | country | 668
hometown | location | hometown | 55
associatedMusicalArtist | formerBandMember | associatedMusicalArtist | 2557
hometown | hometown | hometown | 120
department | region | region | 6
formerBandMember/bandMember | associatedBand | associatedMusicalArtist | 468
department | country | country | 0
bandMember | country | hometown | 19
arrondissement | isPartOf | region | 16
province | isPartOf | region | 17
region/department | country | country | 266
department | region/isPartOf | region | 438
subsequentWork/producer | bandMember | writer | 337
bandMember | birthPlace | hometown | 68
state | country | country | 395
producer | writer/bandMember | writer | 1874
associatedBand | associatedBand/formerBandMember | associatedMusicalArtist | 2557
formerBandMember | hometown | hometown | 49
bandMember | associatedMusicalArtist | associatedMusicalArtist | 338
formerBandMember | birthPlace | hometown | 69
recordLabel/associatedBand | associatedBand | associatedMusicalArtist | 174
previousWork | author | author | 18
isPartOf | isPartOf | isPartOf | 1510
hometown | birthPlace | hometown | 61
currentMember | team | team | 61
formerBandMember | associatedMusicalArtist | associatedMusicalArtist | 468
arrondissement | region | region | 6
hometown/associatedBand | country | hometown | 232
district | country | country | 232
series/subsequentWork | writer | writer | 75
isPartOf | state | isPartOf | 4
associatedMusicalArtist/associatedBand | bandMember | associatedMusicalArtist | 2689
formerBandMember | country | hometown | 19
hometown/associatedBand | isPartOf | hometown | 216
successor | region | region | 113
bandMember | associatedBand | associatedMusicalArtist | 338
subsequentWork | author | author | 20
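To make the rule format concrete: each row can be read as stating that whenever relation A–B holds between two resources and relation B–C holds between the second resource and a third, the relation in the Prediction column is proposed as a candidate link between the first and the third. The listing below is a minimal illustrative sketch of this reading, not the implementation used in this work; the helper apply_rule and the example triples (Player_X, Station_1, Team_Y) are invented for illustration, and cells containing a slash (e.g. recordLabel/associatedBand), which appear to group alternative relations, are reduced here to a single relation name for simplicity.

# Minimal sketch (assumed, not the author's code): applying one composition
# rule from Table 46 to a set of (subject, relation, object) triples.
from itertools import product

def apply_rule(triples, rel_ab, rel_bc, pred):
    """Return candidate triples (a, pred, c) implied by rel_ab + rel_bc -> pred."""
    existing = set(triples)
    candidates = set()
    for (a, r1, b), (b2, r2, c) in product(triples, repeat=2):
        if r1 == rel_ab and r2 == rel_bc and b == b2 and a != c:
            candidate = (a, pred, c)
            if candidate not in existing:  # only propose genuinely missing links
                candidates.add(candidate)
    return candidates

# Hypothetical example data in the spirit of the first rule in Table 46
# (careerStation + team -> team).
triples = [
    ("Player_X", "careerStation", "Station_1"),
    ("Station_1", "team", "Team_Y"),
]
print(apply_rule(triples, "careerStation", "team", "team"))
# -> {('Player_X', 'team', 'Team_Y')}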
