
The Topology of Semantic Knowledge

Jimmy Dubuisson, Jean-Pierre Eckmann
Département de Physique Théorique and Section de Mathématiques, Université de Genève
[email protected]

Christian Scheible, Hinrich Schütze
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, and Center for Information and Language Processing, University of Munich
[email protected]

Abstract

Studies of the graph of dictionary definitions (DD) (Picard et al., 2009; Levary et al., 2012) have revealed strong semantic coherence of local topological structures. The techniques used in these papers are simple and the main results are found by understanding the structure of cycles in the dictionary graph (where words point to definitions). Based on our earlier work (Levary et al., 2012), we study a different class of word definitions, namely those of the Free Association (FA) dataset (Nelson et al., 2004). These are responses by subjects to a cue word, which are then summarized by a directed association graph. We find that the structure of this network is quite different from both the Wordnet and the dictionary networks. This difference can be explained by the very nature of free association as compared to the more "logical" construction of dictionaries. It thus sheds some (quantitative) light on the psychology of free association. In NLP, semantic groups or clusters are interesting for various applications such as word sense disambiguation. The FA graph is tighter than the DD graph, because of the large number of triangles. This also makes drift of meaning quite measurable, so that FA graphs provide a quantitative measure of the semantic coherence of small groups of words.

1 Introduction

The computer study of semantic networks has been around since the advent of computers (Brunet, 1974) and has been used to study semantic relations between concepts and for analyzing semantic data. Traditionally, a popular lexical database of English is Wordnet (Miller, 1995; Miller and Fellbaum, 1998), which organizes the lexicon in terms of synsets. In contrast to manual approaches, the automatic analysis of semantically interesting graph structures of language has received increasing attention. For example, it has become clear more recently that cycles and triangles play an important role in semantic networks, see e.g., (Dorow et al., 2005). These results suggest that the underlying semantic structure of language may be discovered through graph-theoretical methods. This is in line with similar findings in much wider realms than NLP (Eckmann and Moses, 2002).

In this paper, we compare two different types of association networks. The first network is constructed from an English dictionary (DD), the second from a free association (FA) database (Nelson et al., 2004). We represent both datasets through directed graphs. For DD, the nodes are words and the directed edges point from a word to its definition(s). For FA, the nodes are again words, and each cue word has a directed edge to each association it elicits.

Although the links in these graphs were not constructed by following a rational centralized process, their graph exhibits very specific features and we concentrate on the study of its topological properties. We will show that these graphs are quite different in global and local structure, and we interpret this as a reflection of the different nature of DD vs. FA. The first is an objective set of rela-


Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 669–680, Seattle, Washington, USA, 18-21 October 2013. © 2013 Association for Computational Linguistics

tions between words and their meaning, as explained by other words, while the second reveals the nature of subjective reactions to cue words by individuals. This matter of fact is reflected by several quantitative differences in the structure of the corresponding graphs.

The main contribution of this paper is an empirical analysis of the way semantic knowledge is structured, comparing two different types of association networks (DD and FA). We conduct a mathematical analysis of the structure of the graphs to show that the way humans express their thoughts exhibits structural properties in which one can find semantic patterns. We show that a simple graph-based approach can leverage the information encoded in free association to narrow down the ambiguity of meaning, resulting in precise semantic groups. In particular, we find that the main strongly connected component of the FA graph (the so-called core) is very cyclic in nature and contains a large predominance of short cycles (i.e., co-links and triangles). In contrast to the DD graph, bunches of triangles form well-delimited lexical fields of collective semantic knowledge. This property may be promising for downstream tasks. Further, the methods developed in this paper may be applicable to graph representations that occur in other problems such as word sense disambiguation (e.g., (Heylighen, 2001; Agirre and Soroa, 2009)) or sentiment polarity induction (Hassan and Radev, 2010; Scheible, 2010).

To show the semantic coherence of these lexical fields of the FA graph, we perform an experiment with human raters and find that cycles are strongly semantically connected even when compared to close neighbors in the graph.

The reader might wonder why sets of pairwise associations can lead to any interesting structure. One of the deep results in graph theory (Bollobás, 2001) is that in sparse graphs, i.e., in graphs with few links per vertex, triangles are extremely rare. Therefore, if one does find many triangles in a graph, they must be not only a sign of non-randomness, but carry relevant information about the domain of research as shown earlier (Eckmann and Moses, 2002).

2 The USF FA dataset

This dataset is one of the largest existing databases of free associations (FA) and has been collected at the University of South Florida since 1973 by researchers in psychology (Nelson et al., 2004). Over the years, more than 6'000 participants produced about 750'000 responses to 5'019 stimulus words. The procedure for collecting the data is called discrete association task and consists in asking participants to give the first word that comes to mind (target) when presented a stimulus word (cue).

For creating the initial set of stimulus words, the Jenkins and Palermo word association norms (Palermo and Jenkins, 1964) proved useful but too limited as they consist of only 200 words. For this reason, additional words have been regularly added to the pool of normed words, unfortunately without well established rules being followed. For instance, some were selected as potentially interesting cues, some were added as responses to the first sets of cues and some others were collected for supporting new studies on verbs. We still work with this database, because of its breadth.

The final pool of stimuli comprises 5'019 words of which 76% are nouns, 13% adjectives, and 7% verbs. A word association is said to be normed when the target is also part of the set of norms, i.e., a cue. The USF dataset of free associations contains 72'176 cue-target pairs, 63'619 of which are normed. As an example, the association puberty-sex is normed whereas the association puberty-thirteen is not, because thirteen is not a cue.

3 Mathematical definitions

We collect here those notions we need for the analysis of the data.

A directed graph is a pair G = (V, E) of a set V of vertices and a set E of ordered pairs of vertices, also called directed edges. For a directed edge (u, v) ∈ E, u is called the tail and v the head of the edge. The number of edges incident to a vertex v ∈ V is called the degree of v. The in-degree (resp. out-degree) of a vertex v is the number of edge heads (resp. edge tails) adjacent to it. A vertex with null in-degree is called a source and a vertex with null out-degree is called a sink.

A directed path is a sequence of vertices such

that a directed edge exists between each consecutive pair of vertices of the graph. A directed graph is said to be strongly connected (resp. weakly connected) if for every pair of vertices in the graph, there exists a directed path (resp. undirected path) between them. A strongly connected component, SCC (resp. weakly connected component, WCC), of a directed graph G is a maximal strongly connected (resp. weakly connected) subgraph of G.

A directed cycle is a directed path such that its start vertex is the same as its end vertex. A co-link is a directed cycle of length 2 and a triangle a directed cycle of length 3.

The distance between two vertices in a graph is the number of edges in the shortest path connecting them. The diameter of a graph G is the greatest distance between any pair of vertices. The characteristic path length is the average distance between any two vertices of G.

The density of a directed graph G(V, E) is the proportion of existing edges over the total number of possible edges and is defined as:

d = |E| / (|V|(|V| − 1))

The neighborhood N_i of a vertex v_i is N_i = {v_j : e_ij ∈ E or e_ji ∈ E}.

The local clustering coefficient C_i for a vertex v_i corresponds to the density of its neighborhood subgraph. For a directed graph, it is thus given by:

C_i = |{e_jk : v_j, v_k ∈ N_i, e_jk ∈ E}| / (|N_i|(|N_i| − 1))

The clustering coefficient of a graph G is the average of the local clustering coefficients of all its vertices.

The efficiency Eff of a directed graph G is an indicator of the traffic capacity of a network. It is the harmonic mean of the distance between any two vertices of G. It is defined as:

Eff = (1 / (|V|(|V| − 1))) Σ_{i≠j∈V} 1/d_ij

The linear correlation coefficient between two random variables X and Y is defined as:

ρ(X, Y) = (E[XY] − µ_X µ_Y) / (σ_X σ_Y)

where µ_X and σ_X are respectively the mean and standard deviation of the random variable X.

The linear degree correlation coefficient of a graph is called the assortativity and is expressed as:

ρ_D = Σ_xy xy (e_xy − a_x b_y) / (σ_a σ_b)

where e_xy is the fraction of all links that connect nodes of degree x and y and where a_x and b_y are respectively the fraction of links whose tail is adjacent to nodes with degree x and whose head is adjacent to nodes with degree y, satisfying the following three conditions:

Σ_xy e_xy = 1,  a_x = Σ_y e_xy,  b_y = Σ_x e_xy

When ρ_D is positive, the graph possesses assortative mixing and high-degree nodes tend to connect to other high-degree nodes. On the other hand, when ρ_D is negative, the graph features disassortative mixing and high-degree nodes tend to connect to low-degree nodes.

The intersection graph of sets A_i, i = 1, . . . , m, is constructed by representing each set A_i as a vertex v_i ∈ V and adding an edge for each pair of sets with a non-empty intersection:

E = {(v_i, v_j) : A_i ∩ A_j ≠ ∅}

4 Graph topology analysis

4.1 Graph generation

Our goal being to study the FA network topology, we first concentrate on the generation of an unweighted directed graph. We generate the corresponding graph by adding a directed edge for each cue-target pair of the dataset. We only consider pairs whose target was normed in order to avoid overloading the graph with noisy data (e.g., a response meaningful only to a specific participant). The graph has 5'019 vertices and 63'619 edges. It is composed of a single WCC and 166 SCCs.

For comparison with dictionary definitions (DD), we construct a graph from the Wordnet dictionary (nouns only), following (Levary et al., 2012). This graph contains 54'453 vertices and 179'848 edges.
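The graph generation just described, together with the degree and density definitions of Section 3, can be sketched in a few lines of Python. The cue-target pairs below are a toy sample for illustration (only puberty-sex appears in the text above; the rest are invented), not the USF data:

```python
from collections import defaultdict

def build_digraph(pairs):
    """Directed graph from cue-target pairs: a vertex set and an edge set."""
    vertices, edges = set(), set()
    for cue, target in pairs:
        vertices.update((cue, target))
        edges.add((cue, target))
    return vertices, edges

def density(vertices, edges):
    """d = |E| / (|V| (|V| - 1)) for a directed graph without self-loops."""
    n = len(vertices)
    return len(edges) / (n * (n - 1))

def degree_stats(vertices, edges):
    """In-/out-degrees plus the sources (null in-degree) and sinks (null out-degree)."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    sources = {v for v in vertices if indeg[v] == 0}
    sinks = {v for v in vertices if outdeg[v] == 0}
    return indeg, outdeg, sources, sinks

# Toy cue-target pairs (illustrative only, not the USF dataset).
pairs = [("puberty", "sex"), ("money", "food"), ("food", "money"), ("water", "food")]
V, E = build_digraph(pairs)
indeg, outdeg, sources, sinks = degree_stats(V, E)
```

On this toy graph, puberty and water are sources, sex is a sink, and money-food forms a co-link.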

4.2 Core extraction

The so-called core was defined previously in (Picard et al., 2009; Levary et al., 2012) as that subset of nodes in which a random walker gets trapped after only a few steps.

The shave algorithm was used in (Levary et al., 2012) to isolate this subset. It consists in recursively removing the source and sink nodes from a weakly connected directed graph and permits to get the subgraph induced by the union of its strongly connected components. Note that the dictionary graph (DD) has no sinks (i.e., words that never get defined) and that it contains a giant SCC whose size is comparable to the one of the initial graph.

It turns out that the FA graph also contains a giant SCC, therefore getting the core consists more simply in extracting the main SCC of the initial graph. We use Tarjan's algorithm (Tarjan, 1972) for isolating the FA core.

Figure 1: Distribution of shortest cycle lengths in the core compared to equivalent ER models. One should bear in mind that we only consider the set of shortest cycles. Thus, a k-cycle is not counted if each of its nodes belongs to a cycle whose length is < k. Although the number of 4-shortest cycles is comparable in the core and core-ER graphs for example, there are in reality far more 4-cycles in the core (i.e., 42'738 versus 6'517). We see that when considering shortest cycles, short cycles tend to hide long ones, and, as a large proportion of nodes in the core belong to 2- and 3-cycles, many longer cycles do not get counted at all.

4.3 Vertex degree analysis

The FA core has a maximum in-degree of 313, a maximum out-degree of 33 and an average degree of 25.42. The in-degree distribution follows a power law (γ = 1.93) and the out-degrees are Poisson-like distributed with a peak at 14 (Steyvers and Tenenbaum, 2005; Gravino et al., 2012).

Words having a high in-degree are targets that tend to be cited more frequently. On the other hand, words having a high out-degree are cues that evoke many different targets.

The most evocative cues are, in decreasing order of out-degree: field (33), body (31), condemn (29), farmer (29), crisis (28), plan (28), attention (27), animal (27), and hang (27). Interestingly, the most cited targets (i.e., targets with highest in-degree) are in decreasing order: food (313), money (295), water (271), car (251), good (246), bad (221), work (187), house (183), school (182), love (179).

4.4 Cycle decomposition of the core

We define the vertex k-cycle multiplicity (resp. edge k-cycle multiplicity) as the number of k-cycles a given vertex (resp. edge) belongs to. We call core-ER the set of Erdős-Rényi (ER) random graphs G(n, M) having the same number of nodes and the same number of edges as the FA core. We start by extracting the 2- and 3-cycles by using a customized version of Johnson's algorithm (Johnson, 1975). The first thing we observe is that the core has a very high density of short cycles: the subset of nodes belonging to 2-cycles (i.e., nodes with 2-cycle multiplicities > 0) covers 95% of the core vertices and the 3-cycles cover 88% of the core vertices. The corresponding core-ER graphs have on average about 100 times fewer 2-cycles and almost 20 times fewer 3-cycles.

This shows that the core is very cyclic in nature and that it remains very well connected for short-length cycles: most vertices of the core indeed belong to at least one co-link or triangle.

In order to limit computation times, we only considered shortest cycles for lengths ≥ 3 and analyzed the distribution of the number of shortest cycles in the core compared to equivalent random graphs. Whereas there are many more short cycles in the core, we observe a predominance of 4-, 5- and 6-cycles in core-ER graphs. However, we find again a slight predominance of long cycles (length between 7 and 15) in the core (see Fig. 1). See (Levary et al., 2012), Fig. 3, where the cycle distribution is very different, with a minimum at length 5.
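Counting co-links and triangles per vertex (the 2- and 3-cycle multiplicities defined in Section 4.4) does not require a full cycle-enumeration algorithm; a minimal sketch, not the customized Johnson variant used above:

```python
from collections import defaultdict

def short_cycle_multiplicities(edges):
    """Per-vertex 2-cycle (co-link) and 3-cycle (triangle) multiplicities
    for a directed graph given as a set of (u, v) edges."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    m2, m3 = defaultdict(int), defaultdict(int)
    # Co-links u <-> v: count each unordered pair once via u < v.
    for u, v in edges:
        if u < v and u in succ[v]:
            m2[u] += 1
            m2[v] += 1
    # Directed triangles u -> v -> w -> u: count each cycle once
    # by anchoring it at its lexicographically smallest vertex u.
    for u, v in edges:
        if u >= v:
            continue
        for w in succ[v]:
            if w != u and u < w and u in succ[w]:
                for x in (u, v, w):
                    m3[x] += 1
    return m2, m3
```

Note that the two directed triangles on a vertex set {a, b, c} (one per orientation) are counted separately, as in the paper's definition of directed cycles.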

4.5 Interpretation of cycles

2-cycles are composed of concretely related words (e.g., drug-coke, destiny-fate, einstein-genius, . . . ). The vertex with highest 2-cycle multiplicity is music (22).

Words in 3- and 4-cycles often belong to the same lexical field. Examples of 3-cycles: protect-guard-defend or space-universe-star. The vertex (resp. edge) with highest 3-cycle multiplicity is car (86) (resp. bad-crime (11)). Examples of 4-cycles: monster-dracula-vampire-ghost or flu-virus-infection-sick.

Longer cycles are more difficult to describe: relations linking words of a given cycle exhibit semantic drift with increasing length (cf. (Levary et al., 2012)). Examples of 5-cycles: yellow-coward-chicken-soup-noodles and sleep-relax-music-art-beauty.

The cumulated set of free associations reflects the way in which a group of people retrieved its semantic knowledge. As the associated graph is highly circular, this suggests that this knowledge is not stored in a hierarchical way (Steyvers and Tenenbaum, 2005). The large predominance of short cycles in the core may indeed be a natural consequence of the semantic information being acquired by means of associative learning (Ashcraft and Radvansky, 2009; Shanks, 1995).

4.6 FA core clustering

4.6.1 The Walktrap community algorithm

Complex networks are globally sparse but contain locally dense subgraphs. These groups of highly interconnected vertices are called communities and convey important properties of the network.

Although the notion of community is difficult to define formally, the current consensus establishes that a partition P = {C1, C2, . . . , Ck} of the vertex set of a graph G represents a good community structure if the proportion of edges inside the Ci is higher than the proportion of edges between them (Fortunato, 2010). Finding such communities in a large graph is generally computationally expensive (Lancichinetti and Fortunato, 2009).

We use the so-called 'Walktrap' community detection algorithm (Pons and Latapy, 2006) for extracting communities from the FA networks. The idea lying behind this algorithm is that random walks on a graph will tend to get trapped in the densely connected subgraphs.

Let P^t_ij be the probability of going from vertex i to vertex j through a random walk of length t. The distance between two vertices i and j of the graph is defined as:

r_ij(t) = sqrt( Σ_{k=1}^{n} (P^t_ik − P^t_jk)² / d(k) )

where d(k) is the degree of vertex k. One defines the probability P^t_{C,j} to go from community C to vertex j in t steps: P^t_{C,j} = Σ_{i∈C} P^t_ij / |C|, and the distance is then easily generalized to two communities C1, C2.

The algorithm starts with a partition P1 = {{v}, v ∈ V} of the initial graph into n communities, each of which is a single vertex. At each step, two communities are chosen and merged according to the criterion described below and the distances between communities are updated. The process goes on until we obtain the partition Pn = {V}.

In order to reduce complexity, only adjacent communities are considered for merging. The decision is then made according to Ward's method (Everitt et al., 2001): at each step k, the two communities whose merger minimizes the mean σ_k of the squared distances between each vertex and its community are merged:

σ_k = (1/n) Σ_{C∈P_k} Σ_{i∈C} r²_{iC}

4.6.2 Clustering of the core

We first identify the communities of the FA core using the Walktrap algorithm. We immediately observe that when the path length parameter increases, the number of identified communities decreases (i.e., for a length of 2, we find 35 communities whereas for a length of 9, we only find 8 communities).

For a path length of 2, the algorithm extracts 35 communities, 7 of which contain more than 100 vertices, 3 of which contain between 50 and 100 vertices and 25 of which contain less than 50 vertices. We observe that for most small communities (i.e., the ones containing less than 50 vertices), there exists a clear relation between the labels of their ver-

tices. Typically, the labels are part of the same lexical field (e.g., all the planets (except earth)) or related by a common grammatical function (such as why, where, what, . . . ).

4.6.3 Clustering of the core co-links

We define the k-cycle induced subgraph of a graph G as the subgraph of G induced by the set of its vertices with k-cycle multiplicity > 0.

The co-link graph of a graph G(V, E) is the undirected graph obtained by replacing each co-link (i.e., 2-cycle) of the 2-cycle induced subgraph of G by a single undirected edge and removing all other edges.

The co-link graph of the FA core has 4'508 vertices and 8'309 edges for a density of 8×10^-4. It is composed of a single weakly connected component that can be seen as a projection of the strongest semantic links from the original graph. Extracting the co-link graph is thus an efficient way of selecting the set of most important semantic links (i.e., the set of 2-cycles that appear in large predominance in the core compared to what is found in an equivalent random graph) while filtering out the noisy or negligible ones.

Figure 2: Neighborhood of conflict in the FA core. The set of words belonging to the neighborhood of conflict are clearly part of the same lexical field. The high density of co-links leads to cyclicity and we see that many directed triangles are present in the local subgraph (e.g., conflict-trouble-fight, conflict-argument-disagree). We can even find triangles of co-links that link together words semantically strongly related (e.g., fight-war-battle, fight-quarrel-argument). Nodes that are part of the neighborhood of conflict in both FA and DD are in empty circles.

The sets of communities extracted by the Walktrap algorithm exhibit different degrees of granularity depending on the length parameter. For short paths, a large number of very small communities are returned (e.g., 923 communities when length equals 2) whereas for longer paths the average size of the communities increases more and more.

The community detection thus exhibits a far finer degree of granularity for the core co-links graph than for the core itself. The size of the communities being much smaller on average, it is striking to notice to what extent the words of a given community are semantically related.

Examples of communities found in the core co-links graph include (standards, values, morals, ethics), (hopeless, romantic, worthless, useless), (thesaurus, dictionary, vocabulary, encyclopedia) or (molecule, atom, electron, nucleus, proton, neutron).

4.6.4 DD core clustering vs FA core clustering

The clustering of both cores has very different characteristics: We illustrate the neighborhoods of conflict for both cases in Fig. 2 and 3.

On one hand, the words in communities of the DD core are in most cases either synonyms, e.g., (declaration, assertion, claim), or in an instance-of kind of relation, e.g., (signal, gesture, motion) or (zero, integer).

On the other hand, communities of the FA core are generally composed of words belonging to the same lexical field and sharing the same level of abstraction.

Moreover, we notice that it is often difficult to establish the semantic relation existing between words of many small communities (i.e., containing less than 10 words) of the DD core. Two such examples are: (choice, probate, executor, chosen, certificate, testator, will) and (numeral, monarchy, monarch, crown, significance, autocracy, symbol, interpretation).

The comparison of DD and FA reveals, in a quantitative way, fundamental differences between the two realms. The interesting data are shown in Table 1.

Figure 3: Neighborhood of conflict in the DD core. First, we note that the neighborhood has a lower density than in the FA core. We also see that there is no cycle and there seems to be a flow going from source nodes to sink nodes. As it generally happens in the neighborhood subgraphs of the DD core, source nodes are rather specific words whereas sink nodes are generic words.

                            FA core     DD core
# vertices                  4'843       1'496
# edges                     61'544      4'766
density                     2.5×10^-3   2.1×10^-3
avg degree                  25.4        6.37
max in-degree               313         59
diameter                    10          29
characteristic path length  4.26        10.42
efficiency                  2.5×10^-1   1.2×10^-1
clustering coefficient      8.5×10^-2   5.1×10^-2
assortativity               5.5×10^-2   6.1×10^-2

Table 1: Comparison FA vs DD

Note that while the FA core is in fact larger than the DD core, its diameter is smaller. This illustrates in a beautiful way the nature of free association as compared to the more neutral dictionary. In particular, the characteristic path length is smaller in the FA graph, because humans use generalized event knowledge (McRae and Matsuki, 2009) in free association, producing semantic shortcuts. For example, FA contains a direct link mirage→water, whereas in DD, the shortest path between the two words is mirage→refraction→wave→water.

5 The Bricks of Meaning

5.1 Extraction of the seed

We already saw that most vertices of the core belong to directed 2- and 3-cycles. Whereas 2-cycles establish strong semantic links (i.e., synonymy or antonymy relations) and provide cyclicity to the underlying directed graph, we claim that 3-cycles (i.e., triangles) form the set of elementary concepts of the core. These structures are common to DD and to FA, but we will see that the links in FA are somehow more direct than in DD.

We call seed the subgraph of the core induced by the set V3 of vertices belonging to directed triangles and shell the subgraph of the core induced by the set V\V3 (i.e., the set of vertices with a null 3-cycle multiplicity), see Fig. 4.

Figure 4: Composition of the FA graph. The graph of FA contains a giant SCC (the core). The subgraph of the core induced by the set of nodes belonging to at least one triangle also forms a giant component we call the 'seed'. The subgraph of the core induced by the set of nodes not belonging to any triangle is called the 'shell' and is composed of many small SCCs, including single vertices. Although the shell has a low density, its nodes are very well connected to the seed.

The shell contains 530 nodes and 309 edges. There are 7'035 edges connecting the shell to the seed. The shell consists of many small SCCs and although its average degree is low (1.17), its vertices have on average many (13.27) connections to the seed.

The seed contains 4'313 vertices (89% of the core) and 54'197 edges. The first thing to notice is that it has 100 times more co-links (7'895) and 20 times more triangles (13'119) than an equivalent random graph. We call shortcuts the 32'773 edges of the seed that do not belong to 3-cycles, see Fig. 5.

The seed obviously also contains cycles whose length is greater than 3. One can check that there exist only 5 basic motifs involving 2 attached triangles and 1 shortcut for creating 4- and 5-cycles, and that linking 2 isolated triangles with 2 shortcuts also per-

Distance   Acc   κ      KS
original   –     0.404  30
1          74    0.522  42
2          97    0.899  89
∞          99    0.899  89

Table 2: Accuracy, κ, and count(p < 0.05) for KS
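Fleiss' κ in Table 2 is the standard chance-corrected agreement statistic over the 10 raters; a minimal sketch of the textbook computation (not the authors' code):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings[i][j] = number of raters who assigned
    item i to category j; every row must sum to the same rater count n."""
    N = len(ratings)                 # number of items
    n = sum(ratings[0])              # raters per item
    k = len(ratings[0])              # number of categories
    total = N * n
    # Overall proportion of assignments falling into each category.
    p = [sum(row[j] for row in ratings) / total for j in range(k)]
    # Mean observed per-item agreement P_bar.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Expected agreement by chance.
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)
```

For example, perfect agreement on two items split over two categories yields κ = 1, while a 50/50 rater split on every item yields a slightly negative κ.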

mit to form 4-, 5- and 6-cycles. All longer cycles are simply made of a juxtaposition of these basic motifs.

Furthermore, there is a limit on the number of shortcuts that can possibly be added in the seed before it gets saturated, as all its vertices belong to at least one triangle. We show that at most 16 shortcuts can be added between two isolated triangles, at most 6 between two triangles sharing one vertex and at most 2 between two triangles sharing two vertices (see Fig. 5).

Figure 5: Shortcut edges between two triangles sharing a single vertex S. Two triangles can share 0, 1 or 2 vertices. For each of these three basic motifs, we count the maximum number of shortcut edges (i.e., edges not belonging to 3-cycles) that can be added. By linking two triangles, these shortcuts permit to move two basic semantic units closer together and create longer cycles (i.e., 4-, 5-, and 6-cycles). Long cycles can thus be considered as groupings of basic semantic units. In the case of two triangles sharing one vertex for example, it is possible to add at most 6 shortcuts, whereas, for two triangles sharing two vertices, at most 2 shortcuts can be added.

5.2 The elementary lexical fields

Once the seed is isolated, we go on digging into its structure. We focus on the arrangements of triangles as they constitute the set of elementary concepts.

We start by removing all shortcuts from the seed and then convert it to an undirected graph, in order to get a homogeneous simplicial 2-complex.

Let t be the graph operator which transforms a graph G into the intersection graph tG of its 2-simplices (i.e., triangles sharing an edge). We apply t to the homogeneous simplicial 2-complex found previously. The result represents the links between the basic semantic units of the seed. We call seed-crux the giant WCC in the intersection graph.

We enumerate the 8'380 maximal cliques of the FA seed-crux and get the list of words composing each. By removing the ones that are subsets of bigger lists, we finally obtain 3'577 lists of words. These lists of words have a rather small and homogeneous size (between 4 and 17) and 95% have a size comprised between 4 and 10. More interestingly, they clearly define well-delimited lexical fields. We will show this through two experiments in the following sections. A few examples include (honest, trustworthy, reliable, responsible), (stress, problem, worry, frustration) and (data, process, computer, information).

From a topological perspective, we deduce that bunches of triangles (i.e., cliques of elementary concepts) span the seed in a homogeneous way. These bunches form a set of cohesive lexical fields and constitute essential bricks of semantic knowledge.

5.3 Semantic similarity of the lexical fields

In order to quantify the relative meaning of words in the lexical fields of the seed-crux, we define the following semantic similarity metric based on the Wordnet WUP metric (Wu and Palmer, 1994) for a given set of words L:

S_ℓ(L) = 2 Σ_{w_i, w_j ∈ L, w_i ≠ w_j} S_w(w_i, w_j) / (|L|(|L| − 1))

where S_w(w_i, w_j) = max_{S_k ∋ w_i, S_ℓ ∋ w_j} {wup(S_k, S_ℓ)}, wup is the WUP semantic metric and S_k and S_ℓ are Wordnet synsets.

The average value of S_ℓ for the set of cliques of the seed-crux is 0.6 whereas it is only 0.43 for randomly sampled sets of words. This suggests the corresponding lists of words are indeed semantically related. We will show the strength of this relation in the following experiment with human raters.

5.4 Human evaluation of the lexical fields

To validate our findings, we conducted an empirical evaluation through human annotators. Starting from

5.4 Human evaluation of the lexical fields

To validate our findings, we conducted an empirical evaluation through human annotators. Starting from the 1’204 4-groups, we designed the following experiment: We corrupt the groups by exchanging one of the 4 elements with a randomly chosen word at distance 1, 2, or “infinity” (i.e., any word of the whole core) from the group. We presented 100 random samples for each of the 3 distances, as well as 100 unperturbed groups (original), to annotators at Amazon Mechanical Turk (http://www.mturk.com), asking which word fits the group the least. Intuitively, the closer the randomly chosen words get to the group, the closer the distribution of the votes for each sample should be to the uniform distribution. We collected 10 votes for each of the 100 samples in each of the 4 conditions. We calculated accuracy (i.e., the relative frequency of correctly identified random words) for the 3 random confounder experiments, as well as Fleiss’ κ. Further, we used the Kolmogorov-Smirnov (KS) test to assess how uniform the label distribution is, reporting the relative frequency of samples that are significantly (p < 0.05) different from the uniform distribution. The results of this experiment are summarized in Table 2 and clearly show that the certainty about the “odd man out” increases with the distance.
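The corruption step can be sketched as follows. This is an illustrative sketch only (the function names `bfs_distances` and `corrupt` are ours, and the real experiment samples from the USF core): compute graph distances from the group by breadth-first search, then swap a random member of the group for a confounder at the required distance.

```python
import random
from collections import deque

def bfs_distances(adj, sources):
    """Graph distance from the nearest vertex of `sources`."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def corrupt(group, adj, d, rng=random):
    """Replace one of the group's words by a confounder at distance d
    from the group; d=None stands for "infinity" (any core word)."""
    dist = bfs_distances(adj, set(group))
    pool = [w for w in adj
            if w not in group and (d is None or dist.get(w) == d)]
    out = list(group)
    out[rng.randrange(len(out))] = rng.choice(pool)
    return out

# Toy core: a chain w1 - w2 - ... - w6.
chain = ["w1", "w2", "w3", "w4", "w5", "w6"]
adj = {w: set() for w in chain}
for a, b in zip(chain, chain[1:]):
    adj[a].add(b)
    adj[b].add(a)
group = ["w1", "w2", "w3", "w4"]
```

Here `corrupt(group, adj, 2)` must pull in "w6", the only word at distance 2 from the group.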

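The Fleiss' κ reported for the annotation experiment can be computed directly from the item-by-category vote counts. A minimal sketch of the standard formula (not the authors' code):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i][j] = number of raters who assigned
    item i to category j; every item must have the same rater total."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # mean observed agreement per item
    p_bar = sum((sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1))
                for row in counts) / n_items
    # chance agreement from the marginal category proportions
    totals = [sum(row[j] for row in counts)
              for j in range(len(counts[0]))]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_exp) / (1 - p_exp)
```

Ten raters unanimously picking a different label for each of two items gives κ = 1; two raters splitting evenly on every item gives κ = −1.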
5.5 Error analysis

If we view our results as a resource for a downstream task, it is important to know about possible downsides. First, we note that there are words which are not in any triangle and will thus be missing from the intersection graph. This indicates that the corresponding word is less well embedded contextually, so, conversely, any prediction made about it from the data may be less reliable. Additionally, semantic leaps caused by generalized event knowledge may lead to less well connected groups such as (steel, pipe, lead, copper). Jumps like these may or may not be desired in a subsequent application.

6 The Case of the EAT FA dataset

The Edinburgh Associative Thesaurus (EAT) (Kiss et al., 1973) is a large dataset of free associations. We extract the EAT FA seed-crux with the previously described methods. We start by generating the initial graph (23’219 vertices and 325’589 edges), then extract its core (7’754 vertices and 247’172 edges) and its seed (7’500 vertices and 238’677 edges). It is interesting to notice at this stage that the EAT seed contains 74% of the words belonging to the USF seed. After generating the seed-crux, which contains 63’363 vertices, 6’825’731 edges, and 342’490 maximal cliques, we finally obtain 40’998 lists of words. These lists comprise between 4 and 233 words, but 80% of them have a relatively small size of between 4 and 20 words. Although we find exceptions for this graph, most of the extracted lists again form well-delimited lexical fields (e.g., (health, resort, spa, bath, salts) or (god, devil, angel, satan)).

Comparing the two association experiments, we see that the local topologies are quite similar. Both FA cores have a high density of connected triangles, whereas cycles in the DD graph tend to be longer and most triangles are isolated. This can be attributed to the different ways in which DD and FA are obtained, the former being built rationally by following a humanly-driven process and the latter reflecting an implicit collective semantic knowledge.

7 Related Work

A number of metrics like Latent Semantic Analysis (Deerwester et al., 1990) and Word Association Spaces (Steyvers et al., 2004) have recently been developed for quantifying the relative meaning of words. As the topological properties of free association graphs reflect key aspects of semantic knowledge, we believe some graph-theoretic metrics could be used efficiently to derive new ways of measuring semantic similarity between words.

Topological analysis of the Florida Word Associations (FA) was started by Steyvers and Tenenbaum (2005) and Gravino et al. (2012), who extracted global statistics. We follow the basic methodology of these studies, but extend their approach. First, we conduct deeper analyses by examining the neighborhood of nodes and extracting the statistics of cycles. Second, we compare the properties of FA and DD graphs.

Word clustering based on graphs has been the subject of various earlier studies. Close to our work is (Widdows and Dorow, 2002). These authors recognize that nearest-neighbor-based clustering of co-occurrences gives rise to semantic groups. This type of approach has since been applied in various modified forms, e.g., by Biemann (2006), who performs label propagation based on randomized nearest neighbors, or Matsuo et al. (2006), who perform greedy clustering. Hierarchical clustering algorithms (e.g., (Jonyer et al., 2002; Manning et al., 2008)) are related as well; however, the key difference is that in hierarchical clustering, the granularity of a cluster is difficult to determine.

Dorow et al. (2005) recognize that triangles form semantically strongly cohesive groups and apply clustering coefficients for word sense disambiguation. Their work focuses on undirected graphs of corpus co-occurrences, whereas our work builds on directed associations. Building on this work, we take finer topological graph structures into account, which is one of the main contributions of this paper.

8 Conclusion

The cognitive process of discrete free association being an epiphenomenon of our semantic memory at work, the cumulative set of free associations of the USF dataset can be viewed as the projection of a collective semantic memory.

To analyze this semantic memory, we use the tools of graph theory, and we also compare it to dictionary graphs. In both cases, triangles play a crucial role in the local topology and form the set of elementary concepts of the underlying graph. We also show that cohesive lexical fields (taking the form of cliques of concepts) constitute essential bricks of meaning and span the core homogeneously at the global level; 89% of all words in the core belong to at least one triangle, and 77% belong to cliques of triangles containing 4 words (i.e., pairs of triangles sharing an edge or forming tetrahedra). As the words of a graph of free associations acquire their meaning from the set of associations they are involved in (Deese, 1962), we go a step further by examining the neighborhood of nodes and extracting the statistics of cycles. We further check through human evaluation that the clustering is strongly related to meaning and, furthermore, that the meaning becomes measurably more confused as one walks away from a cluster.

Comparing dictionaries to free association, we find the free association graph to be more concept-driven, with words in small clusters being on the same level of abstraction. Moreover, we think that graphs of free associations could find interesting applications in Word Sense Disambiguation (e.g., (Heylighen, 2001; Agirre and Soroa, 2009)), and could be used for detecting psychological disorders (e.g., depression, psychopathy) or whether someone is lying (Hancock et al., 2013; Kent and Rosanoff, 1910). Finally, we believe that studying the dynamics of graphs of free associations may be of particular interest for observing the change in meaning of certain words (Deese, 1967), or more generally to follow the cultural evolution arising within a social group.

References

Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09, pages 33–41, Stroudsburg, PA, USA. Association for Computational Linguistics.

Mark H Ashcraft and Gabriel A Radvansky. 2009. Cognition. Pearson Prentice Hall.

Chris Biemann. 2006. Chinese Whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, TextGraphs-1, pages 73–80, Stroudsburg, PA, USA. Association for Computational Linguistics.

Béla Bollobás. 2001. Random Graphs, volume 73 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, second edition.

Étienne Brunet. 1974. Le traitement des faits linguistiques et stylistiques sur ordinateur. Texte d’application: Giraudoux, Statistique et Linguistique. David, J. y Martin, R. (eds.). Paris: Klincksieck, pages 105–137.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

James Deese. 1962. On the structure of associative meaning. Psychological Review, 69:161.

James Deese. 1967. Meaning and change of meaning. The American Psychologist, 22(8):641.
Beate Dorow, Dominic Widdows, Katerina Ling, Jean-Pierre Eckmann, Danilo Sergi, and Elisha Moses. 2005. Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination. In MEANING-2005, 2nd Workshop organized by the MEANING Project, February 3rd-4th 2005, Trento, Italy.

Jean-Pierre Eckmann and Elisha Moses. 2002. Curvature of co-links uncovers hidden thematic layers in the World Wide Web. Proc. Natl. Acad. Sci. USA, 99(9):5825–5829 (electronic).

Brian Everitt, Sabine Landau, and Morven Leese. 2001. Cluster Analysis. 4th edition. Arnold, London.

Santo Fortunato. 2010. Community detection in graphs. Physics Reports, 486(3):75–174.

Pietro Gravino, Vito D. P. Servedio, Alain Barrat, and Vittorio Loreto. 2012. Complex structures and semantics in free word association. Advances in Complex Systems, 15(03n04).

Jeffrey T Hancock, Michael T Woodworth, and Stephen Porter. 2013. Hungry like the wolf: A word-pattern analysis of the language of psychopaths. Legal and Criminological Psychology, 18(1):102–114.

Ahmed Hassan and Dragomir Radev. 2010. Identifying text polarity using random walks. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 395–403. Association for Computational Linguistics.

Francis Heylighen. 2001. Mining associative meanings from the web: from word disambiguation to the global brain. In Proceedings of Trends in Special Language & Language Technology, pages 15–44.

Donald B Johnson. 1975. Finding all the elementary circuits of a directed graph. SIAM Journal on Computing, 4(1):77–84.

Istvan Jonyer, Diane J Cook, and Lawrence B Holder. 2002. Graph-based hierarchical conceptual clustering. The Journal of Machine Learning Research, 2:19–43.

Grace H Kent and Aaron J Rosanoff. 1910. A study of association in insanity. American Journal of Insanity.

George R Kiss, Christine Armstrong, Robert Milroy, and James Piper. 1973. An associative thesaurus of English and its computer analysis. The Computer and Literary Studies, pages 153–165.

Andrea Lancichinetti and Santo Fortunato. 2009. Community detection algorithms: A comparative analysis. Physical Review E, 80(5):056117.

David Levary, Jean-Pierre Eckmann, Elisha Moses, and Tsvi Tlusty. 2012. Loops and self-reference in the construction of dictionaries. Phys. Rev. X, 2:031018.

Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Yutaka Matsuo, Takeshi Sakaki, Kôki Uchiyama, and Mitsuru Ishizuka. 2006. Graph-based word clustering using a web search engine. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, pages 542–550, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ken McRae and Kazunaga Matsuki. 2009. People use their knowledge of common events to understand language, and do so as quickly as possible. Language and Linguistics Compass, 3(6):1417–1429.

George Miller and Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

George A Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Douglas L Nelson, Cathy L McEvoy, and Thomas A Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–407.

David S Palermo and James J Jenkins. 1964. Word Association Norms: Grade School through College. University of Minnesota Press.

Olivier Picard, Alexandre Blondin-Massé, Stevan Harnad, Odile Marcotte, Guillaume Chicoisne, and Yassine Gargouri. 2009. Hierarchies in dictionary definition space. In Annual Conference on Neural Information Processing Systems.

Pascal Pons and Matthieu Latapy. 2006. Computing communities in large networks using random walks. In Journal of Graph Algorithms and Applications, pages 284–293. Springer.

Christian Scheible. 2010. Sentiment translation through lexicon induction. In Proceedings of the ACL 2010 Student Research Workshop, pages 25–30, Uppsala, Sweden, July. Association for Computational Linguistics.

David R Shanks. 1995. The Psychology of Associative Learning, volume 13. Cambridge University Press.

Mark Steyvers and Joshua B Tenenbaum. 2005. The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1):41–78.

Mark Steyvers, Richard M Shiffrin, and Douglas L Nelson. 2004. Word association spaces for predicting semantic similarity effects in episodic memory. Experimental Cognitive Psychology and its Applications: Festschrift in Honor of Lyle Bourne, Walter Kintsch, and Thomas Landauer, pages 237–249.

Robert Tarjan. 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160.

Dominic Widdows and Beate Dorow. 2002. A graph model for unsupervised lexical acquisition. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, COLING ’02, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138. Association for Computational Linguistics.