
Random Walks for Text Semantic Similarity

Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
{dramage,manning}@cs.stanford.edu, [email protected]

Abstract

Many tasks in NLP stand to benefit from robust measures of semantic similarity for units above the level of individual words. Rich semantic resources such as WordNet provide local semantic information at the lexical level. However, effectively combining this information to compute scores for phrases or sentences is an open problem. Our algorithm aggregates local relatedness information via a random walk over a graph constructed from an underlying lexical resource. The stationary distribution of the graph walk forms a "semantic signature" that can be compared to another such distribution to get a relatedness score for texts. On a paraphrase recognition task, the algorithm achieves an 18.5% relative reduction in error rate over a vector-space baseline. We also show that the graph walk similarity between texts has complementary value as a feature for recognizing textual entailment, improving on a competitive baseline system.

1 Introduction

Many natural language processing applications must directly or indirectly assess the semantic similarity of text passages. Modern approaches to information retrieval, summarization, and textual entailment, among others, require robust numeric relevance judgments when a pair of texts is provided as input. Although each task demands its own scoring criteria, a simple lexical overlap measure such as cosine similarity of document vectors can often serve as a surprisingly powerful baseline. We argue that there is room to improve these general-purpose similarity measures, particularly for short text passages.

Most approaches fall under one of two categories. One set of approaches attempts to explicitly account for the fine-grained structure of the two passages, e.g. by aligning trees or constructing logical forms for theorem proving. While these approaches have the potential for high precision on many examples, errors in alignment judgments or formula construction are often insurmountable. More broadly, it is not always clear that there is a correct alignment or logical form that is most appropriate for a particular sentence pair. The other approach tends to ignore structure, as canonically represented by the vector space model, where any lexical item in common between the two passages contributes to their similarity score. While these approaches often fail to capture distinctions imposed by, e.g., negation, they do correctly capture a broad notion of similarity or aboutness.

This paper presents a novel variant of the vector space model of text similarity based on a random walk algorithm. Instead of comparing two bags-of-words directly, we compare the distribution each text induces when used as the seed of a random walk over a graph derived from WordNet and corpus statistics. The walk posits the existence of a distributional particle that roams the graph, biased toward the neighborhood surrounding an input bag of words. Eventually, the walk reaches a stationary distribution over all nodes in the graph, smoothing the peaked input distribution over a much larger semantic space. Two such stationary distributions can be compared using conventional measures of vector similarity, producing a final relatedness score.

This paper makes the following contributions. We present a novel random graph walk algorithm for semantic similarity of texts, demonstrating its efficiency as compared to a much slower but mathematically equivalent model based on summed similarity judgments of individual words. We show that walks effectively aggregate information over multiple types of links and multiple input words on an unsupervised paraphrase recognition task. Furthermore, when used as a feature, the walk's semantic similarity score can improve the performance of an existing, competitive textual entailment system. Finally, we provide empirical results demonstrating that indeed, each step of the random walk contributes to its ability to assess paraphrase judgments.

2 A random walk example

To provide some intuition about the behavior of the random walk on text passages, consider the following example sentence: I ate a salad and spaghetti.

No measure based solely on lexical identity would detect overlap between this sentence and another input consisting of only the word food. But if each text is provided as input to the random walk, local relatedness links from one word to another allow the distributional particle to explore nearby parts of the semantic space. The number of non-zero elements in both vectors increases, eventually converging to a stationary distribution for which both vectors have many shared non-zero entries.

Word     Step 1   Step 2   Step 3   Conv.
eat      3        8        9        9
corrode  10       33       53       >100
pasta    –        2        3        5
dish     –        4        5        6
food     –        –        21       12
solid    –        –        –        26

Table 1: Ranks of sample words in the distribution for I ate a salad and spaghetti after a given number of steps and at convergence. Words in the vector are ordered by probability at time step t; the word with the highest probability in the vector has rank 1. "–" indicates that the node had not yet been reached.

Table 1 ranks elements of the sentence vector based on their relative weights. Observe that at the beginning of the walk, corrode has a high rank due to its association with the WordNet sense of eat corresponding to eating away at something. However, because this concept is not closely linked with other words in the sentence, its relative rank drops as the distribution converges and other word senses more related to food are pushed up. The random walk allows the meanings of words to reinforce one another. If the sentence above had ended with drank wine rather than spaghetti, the final weight on the food node would be smaller since fewer input words would be as closely linked to food. This matches the intuition that the first sentence has more to do with food than does the second, although both walks should and do give some weight to this node.

3 Related work

Semantic relatedness for individual words has been thoroughly investigated in previous work. Budanitsky and Hirst (2006) provide an overview of many of the knowledge-based measures derived from WordNet, although other data sources have been used as well. Hughes and Ramage (2007) is one such measure based on random graph walks.

Prior work has considered random walks on various text graphs, with applications to query expansion (Collins-Thompson and Callan, 2005), email address resolution (Minkov and Cohen, 2007), and word-sense disambiguation (Agirre and Soroa, 2009), among others.

Measures of similarity have also been proposed for sentence or paragraph length text passages. Mihalcea et al. (2006) present an algorithm for the general problem of deciding the similarity of meaning in two text passages, coining the name "text semantic similarity" for the task. Corley and Mihalcea (2005) apply this algorithm to paraphrase recognition.

Previous work has shown that similarity measures can have some success as a measure of textual entailment. Glickman et al. (2005) showed that many entailment problems can be answered using only a bag-of-words representation and web co-occurrence statistics. Many systems integrate lexical relatedness and overlap measures with deeper semantic and syntactic features to create improved results upon relatedness alone, as in Montejo-Ráez et al. (2007).
4 Random walks on lexical graphs

In this section, we describe the mechanics of computing semantic relatedness for text passages based on the random graph walk framework. The algorithm underlying these computations is related to topic-sensitive PageRank (Haveliwala, 2002); see Berkhin (2005) for a survey of related algorithms.

To compute semantic relatedness for a pair of passages, we compare the stationary distributions of two Markov chains, each with a state space defined over all lexical items in an underlying corpus or database. Formally, we define the probability of finding the particle at a node n_i at time t as:

  n_i^(t) = Σ_{n_j ∈ V} n_j^(t−1) P(n_i | n_j)

where P(n_i | n_j) is the probability of transitioning from n_j to n_i at any time step. If those transitions bias the particle to the neighborhood around the words in a text, the particle's distribution can be used as a lexical signature.

To compute relatedness for a pair of texts, we first define the graph nodes and transition probabilities for the random walk Markov chain from an underlying lexical resource. Next, we determine an initial distribution over that state space for a particular input passage of text. Then, we simulate a random walk in the state space, biased toward the initial distribution, resulting in a passage-specific distribution over the graph. Finally, we compare the resulting stationary distributions from two such walks using a measure of distributional similarity. The remainder of this section discusses each stage in more detail.
4.1 Graph construction

We construct a graph G = (V, E) with vertices V and edges E extracted from WordNet 3.0. WordNet (Fellbaum, 1998) is an annotated graph of synsets, each representing one concept, that are populated by one or more words. The set of vertices extracted from the graph is all synsets present in WordNet (e.g. foot#n#1 meaning the part of the human leg below the ankle), all part-of-speech tagged words participating in those synsets (e.g. foot#n linking to foot#n#1 and foot#n#2 etc.), and all untagged words (e.g. foot linking to foot#n and foot#v). The set of edges connecting synset nodes is all inter-synset edges contained in WordNet, such as hyponymy, synonymy, antonymy, etc., except for regional and usage links. All WordNet relational edges are given uniform weight. Edges also connect each part-of-speech tagged word to all synsets it takes part in, and each untagged word to all of its part-of-speech tagged forms. These edge weights are derived from corpus counts as in Hughes and Ramage (2007). We also included a low-weight self-loop for each node.

Our graph has 420,253 nodes connected by 1,064,464 edges. Because synset nodes do not link outward to part-of-speech tagged nodes or word nodes in this graph, only the 117,659 synset nodes have non-zero probability in every random walk—i.e. the stationary distribution will always be non-zero for these 117,659 nodes, but will be non-zero for only a subset of the remainder.

4.2 Initial distribution construction

The next step is to seed the random walk with an initial distribution over lexical nodes specific to the given sentence. To do so, we first tag the input sentence with parts-of-speech and lemmatize each word based on the finite state transducer of Minnen et al. (2001). We search over consecutive words to match multi-word collocation nodes found in the graph. If the word or its lemma is part of a sequence that makes a complete collocation, that collocation is used. If not, the word or its lemma with its part of speech tag is used if it is present as a graph node. Finally, we fall back to the surface word form or underlying lemma form without part-of-speech information if necessary. For example, the input sentence The boy went with his dog to the store would result in mass being assigned to the underlying graph nodes boy#n, go with, he, dog#n, and store#n.

Term weights are set with tf.idf and then normalized. Each term's weight is proportional to the number of occurrences in the sentence times the log of the number of documents in some corpus divided by the number of documents containing that term. Our idf counts were derived from the English Gigaword corpus 1994-1999.
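To make the seeding step concrete, here is a small sketch of the tf.idf weighting and normalization, assuming the input tokens have already been matched to graph nodes as described above. The node names and document-frequency counts are invented for illustration; the real idf counts come from English Gigaword 1994-1999.

```python
import math
from collections import Counter

def seed_distribution(matched_nodes, doc_freq, num_docs):
    """Toy tf.idf seed vector over graph nodes already matched for a sentence.

    matched_nodes: graph node names that the input tokens mapped to
    doc_freq: documents containing each term (illustrative counts)
    num_docs: total documents in the background corpus
    """
    tf = Counter(matched_nodes)
    weights = {
        node: count * math.log(num_docs / doc_freq.get(node, 1))
        for node, count in tf.items()
    }
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}  # sums to 1

# Illustrative numbers only.
doc_freq = {"boy#n": 120_000, "go_with": 40_000, "he": 900_000,
            "dog#n": 80_000, "store#n": 150_000}
seed = seed_distribution(["boy#n", "go_with", "he", "dog#n", "store#n"],
                         doc_freq, num_docs=1_000_000)
print(sorted(seed.items(), key=lambda kv: -kv[1]))
```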

4.3 Computing the stationary distribution

We use the power iteration method to compute the stationary distribution for the Markov chain. Let the distribution over the N states at time step t of the random walk be denoted v^(t) ∈ R^N, where v^(0) is the initial distribution as defined above. We denote the column-normalized state-transition matrix as M ∈ R^(N×N). We compute the stationary distribution of the Markov chain with probability β of returning to the initial distribution at each time step as the limit as t → ∞ of:

  v^(t) = β v^(0) + (1 − β) M v^(t−1)

In practice, we test for convergence by examining whether Σ_{i=1..N} |v_i^(t) − v_i^(t−1)| < 10^-6, which in our experiments was usually reached after about 50 iterations.

Note that the resulting stationary distribution can be factored as the weighted sum of the stationary distributions of each word represented in the initial distribution. Because the initial distribution v^(0) is a normalized weighted sum, it can be re-written as v^(0) = Σ_k γ_k · w_k^(0) for w_k having a point mass at some underlying node in the graph and with γ_k positive such that Σ_k γ_k = 1. A simple proof by induction shows that the stationary distribution v^(∞) is itself the weighted sum of the stationary distribution of each underlying word, i.e. v^(∞) = Σ_k γ_k · w_k^(∞).

In practice, the stationary distribution for a passage of text can be computed from a single specially-constructed Markov chain. The process is equivalent to taking the weighted sum of every word type in the passage computed independently. Because the time needed to compute the stationary distribution is dominated by the sparsity pattern of the walk's transition matrix, the computation of the stationary distribution for the passage takes a fraction of the time needed if the stationary distribution for each word were computed independently.
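A minimal sketch of this power iteration, using SciPy sparse matrices, is shown below. The update rule and the L1 convergence test follow the description above; the value of β, the toy three-node transition matrix, and the iteration cap are illustrative assumptions rather than the system's actual settings.

```python
import numpy as np
from scipy import sparse

def stationary_distribution(M, v0, beta=0.15, tol=1e-6, max_iter=200):
    """Power iteration for v(t) = beta*v(0) + (1-beta)*M*v(t-1).

    M: column-normalized (N x N) sparse transition matrix
    v0: initial (seed) distribution over the N nodes, summing to 1
    beta: probability of returning to the seed distribution at each step
    """
    v = v0.copy()
    for _ in range(max_iter):
        v_next = beta * v0 + (1.0 - beta) * (M @ v)
        if np.abs(v_next - v).sum() < tol:   # L1 convergence test
            return v_next
        v = v_next
    return v

# Tiny illustrative 3-node chain; each column sums to 1.
M = sparse.csc_matrix(np.array([[0.0, 0.5, 0.5],
                                [0.5, 0.0, 0.5],
                                [0.5, 0.5, 0.0]]))
v0 = np.array([1.0, 0.0, 0.0])
print(stationary_distribution(M, v0))
```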
4.4 Comparing stationary distributions

In order to get a final relatedness score for a pair of texts, we must compare the stationary distribution from the first walk with the distribution from the second walk. There exist many measures for computing a final similarity (or divergence) score from a pair of distributions, including geometric measures, information theoretic measures, and probabilistic measures. See, for instance, the overview of measures provided in Lee (2001).

In system development on training data, we found that most measures were reasonably effective. For the rest of this paper, we report numbers using cosine similarity, a standard measure in information retrieval; Jensen-Shannon divergence, a commonly used symmetric measure based on KL-divergence; and the dice measure extended to weighted features (Curran, 2004). A summary of these measures is shown in Table 2. Justification for the choice of these three measures is discussed in Section 6.

  Cosine            x · y / (||x||_2 ||y||_2)
  Jensen-Shannon    (1/2) D(x || (x+y)/2) + (1/2) D(y || (x+y)/2)
  Dice              2 Σ_i min(x_i, y_i) / (Σ_i x_i + Σ_i y_i)

Table 2: Three measures of distributional similarity between vectors x and y used to compare the stationary distributions from passage-specific random walks. D(p||q) is KL-divergence, defined as Σ_i p_i log(p_i / q_i).
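The three measures in Table 2 translate directly into code. The following NumPy sketch is one straightforward transcription; the epsilon guard in the KL-divergence and the example vectors are illustrative choices, not details of the original implementation.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def kl(p, q, eps=1e-12):
    # KL-divergence D(p || q); eps guards against log(0)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def jensen_shannon(x, y):
    m = (x + y) / 2.0
    return 0.5 * kl(x, m) + 0.5 * kl(y, m)   # divergence: lower is more similar

def dice(x, y):
    return float(2.0 * np.minimum(x, y).sum() / (x.sum() + y.sum()))

x = np.array([0.5, 0.3, 0.2, 0.0])
y = np.array([0.4, 0.4, 0.1, 0.1])
print(cosine(x, y), jensen_shannon(x, y), dice(x, y))
```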
5 Evaluation

We evaluate the system on two tasks that might benefit from semantic similarity judgments: paraphrase recognition and recognizing textual entailment. A complete solution to either task will certainly require tools more tuned to linguistic structure; the paraphrase detection evaluation argues that the walk captures a useful notion of similarity at the sentence level. The entailment system evaluation demonstrates that the walk score can improve a larger system that does make use of more fine-grained linguistic knowledge.

5.1 Paraphrase recognition

The Microsoft Research (MSR) paraphrase data set (Dolan et al., 2004) is a collection of 5801 pairs of sentences automatically collected from newswire over 18 months. Each pair was hand-annotated by at least two judges with a binary yes/no judgment as to whether one sentence was a valid paraphrase of the other. Annotators were asked to judge whether the meanings of each sentence pair were reasonably equivalent. Inter-annotator agreement was 83%. However, 67% of the pairs were judged to be paraphrases, so the corpus does not reflect the rarity of paraphrases in the wild. The data set comes pre-split into 4076 training pairs and 1725 test pairs.

Because annotators were asked to judge if the meanings of two sentences were equivalent, the paraphrase corpus is a natural evaluation testbed for measures of semantic similarity. Mihalcea et al. (2006) defines a measure of text semantic similarity and evaluates it in an unsupervised paraphrase detector on this data set. We present their algorithm here as a strong reference point for semantic similarity between text passages, based on similar underlying lexical resources.

The Mihalcea et al. (2006) algorithm is a wrapper method that works with any underlying measure of lexical similarity. The similarity of a pair of texts T_1 and T_2, denoted as sim_m(T_1, T_2), is computed as:

  sim_m(T_1, T_2) = (1/2) f(T_1, T_2) + (1/2) f(T_2, T_1)

  f(T_a, T_b) = [ Σ_{w ∈ T_a} maxSim(w, T_b) · idf(w) ] / [ Σ_{w ∈ T_a} idf(w) ]

where the maxSim(w, T) function is defined as the maximum similarity of the word w within the text T as determined by an underlying measure of lexical semantic relatedness. Here, idf(w) is defined as the number of documents in a background corpus divided by the number of documents containing the term. maxSim compares only within the same WordNet part-of-speech labeling in order to support evaluation with lexical relatedness measures that cannot cross part-of-speech boundaries.

Mihalcea et al. (2006) presents results for several underlying measures of lexical semantic relatedness. These are subdivided into corpus-based measures (using latent semantic analysis (Landauer et al., 1998) and a pointwise mutual information measure) and knowledge-based resources driven by WordNet. The latter include the methods of Jiang and Conrath (1997), Lesk (1986), Resnik (1999), and others.
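For reference, the wrapper can be transcribed as follows. The word-level similarity function and the idf table are left as parameters, and the toy inputs (and the exact-match similarity used to exercise them) are purely illustrative.

```python
def f(text_a, text_b, word_sim, idf):
    """Directed half of the Mihalcea et al. (2006) score: each word in text_a
    is matched against its most similar word in text_b, weighted by idf."""
    num = sum(max(word_sim(w, w2) for w2 in text_b) * idf.get(w, 1.0)
              for w in text_a)
    den = sum(idf.get(w, 1.0) for w in text_a)
    return num / den

def sim_m(t1, t2, word_sim, idf):
    return 0.5 * f(t1, t2, word_sim, idf) + 0.5 * f(t2, t1, word_sim, idf)

# Toy example with exact-match word similarity; any lexical relatedness
# measure (LSA, Jiang-Conrath, Lesk, Resnik, ...) could be plugged in instead.
overlap = lambda w1, w2: 1.0 if w1 == w2 else 0.0
idf = {"salad": 3.0, "spaghetti": 4.0, "food": 2.0, "ate": 1.5, "i": 0.1}
t1 = ["i", "ate", "salad", "and", "spaghetti"]
t2 = ["i", "ate", "food"]
print(sim_m(t1, t2, overlap, idf))
```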

In this unsupervised experimental setting, we consider using only a thresholded similarity value from our system and from the Mihalcea algorithm to determine the paraphrase or non-paraphrase judgment. For consistency with previous work, we threshold at 0.5. Note that this threshold could be tuned on the training data in a supervised setting. Informally, we observed that on the training data a threshold of near 0.5 was often a good choice for this task.

System            Acc.    F1: c1   F1: c0   Macro F1
Random Graph Walk
  Walk (Cosine)   0.687   0.787    0.413    0.617
  Walk (Dice)     0.708   0.801    0.453    0.645
  Walk (JS)       0.688   0.805    0.225    0.609
Mihalcea et al., Corpus-based
  PMI-IR          0.699   0.810    0.301    0.625
  LSA             0.684   0.805    0.170    0.560
Mihalcea et al., WordNet-based
  J&C             0.693   0.790    0.433    0.629
  Lesk            0.693   0.789    0.439    0.629
  Resnik          0.690   0.804    0.254    0.618
Baselines
  Vector-based    0.654   0.753    0.420    0.591
  Random          0.513   0.578    0.425    0.518
  Majority (c1)   0.665   0.799    —        0.399

Table 3: System performance on 1725 examples of the MSR paraphrase detection test set. Accuracy (micro-averaged F1), F1 for the c1 "paraphrase" and c0 "non-paraphrase" classes, and macro-averaged F1 are reported.

Table 3 shows the results of our system and a representative subset of those reported in (Mihalcea et al., 2006). All the reported measures from both systems do a reasonable job of paraphrase detection – the majority of pairs in the corpus are deemed paraphrases when the similarity measure is thresholded at 0.5, and indeed this is reasonable given the way in which the data were collected. The first three rows are the performance of the similarity judgments output by our walk under three different distributional similarity measures (cosine, dice, and Jensen-Shannon), with the walk score using the dice measure outperforming all other systems on both accuracy and macro-averaged F1. The output of the Mihalcea system using a representative subset of underlying lexical measures is reported in the second and third segments. The fourth segment reports the results of baseline methods—the vector space similarity measure is cosine similarity among vectors using tf.idf weighting, and the random baseline chooses uniformly at random, both as reported in (Mihalcea et al., 2006). We add the additional baseline of always guessing the majority class label because the data set is skewed toward "paraphrase."

In an unbalanced data setting, it is important to consider more than just accuracy and F1 on the majority class. We report accuracy, F1 for each class label, and the macro-averaged F1 on all systems. F1: c0 and Macro F1 are inferred for the system variants reported in (Mihalcea et al., 2006). Micro-averaged F1 in this context is equivalent to accuracy (Manning et al., 2008).

Mihalcea et al. (2006) also reports a combined classifier which thresholds on the simple average of the individual classifiers, resulting in the highest numbers reported in that work, with accuracy of 0.703, "paraphrase" class F1: c1 = 0.813, and inferred Macro F1 = 0.648. We believe that the scores from the various walk measures might also improve performance when used in a combination classifier, but without access to the individual judgments in that system we are unable to evaluate the claim directly. However, we did create an upper bound reference by combining the walk scores with easily computable simple surface statistics. We trained a support vector classifier on the MSR paraphrase training set with a feature space consisting of the walk score under each distributional similarity measure, the length of each text, the difference between those lengths, and the number of unigram, bigram, trigram, and four-gram overlaps between the two texts. The resulting classifier achieved accuracy of 0.719 with F1: c1 = 0.807, F1: c0 = 0.487, and Macro F1 = 0.661. This is a substantial improvement, roughly on the same order of magnitude as from switching to the best performing distributional similarity function.
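A rough sketch of such a combination classifier is given below using scikit-learn. The choice of a linear SVM, the n-gram overlap computation, and the toy training pairs are assumptions about a reasonable setup, not the classifier actually trained on the MSR data.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ngram_overlap(a, b, n):
    """Count of shared word n-grams between two tokenized sentences."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    return len(grams(a) & grams(b))

def features(tokens1, tokens2, walk_scores):
    """walk_scores: (cosine, dice, jensen_shannon) from the random walk."""
    return [*walk_scores,
            len(tokens1), len(tokens2), abs(len(tokens1) - len(tokens2)),
            *(ngram_overlap(tokens1, tokens2, n) for n in (1, 2, 3, 4))]

# Toy training data: two sentence pairs with made-up walk scores and labels.
pairs = [((["the", "cat", "sat"], ["a", "cat", "sat", "down"]), (0.9, 0.8, 0.1), 1),
         ((["stocks", "fell"], ["the", "cat", "sat"]), (0.2, 0.1, 0.9), 0)]
X = np.array([features(t1, t2, ws) for (t1, t2), ws, _ in pairs])
y = np.array([label for _, _, label in pairs])
clf = LinearSVC().fit(X, y)
print(clf.predict(X))
```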
Note that the running time of the Mihalcea et al. algorithm for comparing texts T_1 and T_2 requires |T_1| · |T_2| individual similarity judgments. By contrast, this work allows semantic profiles to be constructed and evaluated for each text in a single pass, independent of the number of terms in the texts.

The performance of this unsupervised application of walks to paraphrase recognition suggests that the framework captures important intuitions about similarity in text passages. In the next section, we examine the performance of the measure embedded in a larger system that seeks to make fine-grained entailment judgments.

5.2 Textual entailment

The Recognizing Textual Entailment Challenge (Dagan et al., 2005) is a task in which systems assess whether a sentence is entailed by a short passage or sentence. Participants have used a variety of strategies beyond lexical relatedness or overlap for the task, but some have also used only relatively simple similarity metrics. Many systems incorporate a number of these strategies, so we experimented with using the random walk to improve an existing RTE system. This addresses the fact that using similarity alone to detect entailment is impoverished: entailment is an asymmetric decision while similarity is necessarily symmetric. However, we also experiment with thresholding random walk scores as a measure of entailment to compare to other systems and provide a baseline for whether the walk could be useful for entailment detection.

We tested performance on the development and test sets for the Second and Third PASCAL RTE Challenges (Bar-Haim et al., 2006; Giampiccolo et al., 2007). Each of these data sets contains 800 pairs of texts for which to determine entailment. In some cases, no words from a passage appear in WordNet, leading to an empty vector. In this case, we use the Levenshtein string similarity measure between the two texts (Levenshtein, 1966); this fallback is used in fewer than five examples in any of our data sets.

Data Set    Cosine   Dice    Jensen-Shannon
RTE2 dev    55.00    51.75   55.50
RTE2 test   57.00    54.25   57.50
RTE3 dev    59.00    57.25   59.00
RTE3 test   55.75    55.75   56.75

Table 4: Accuracy of entailment detection when thresholding the text similarity score output by the random walk.

Table 4 shows the results of using the similarity measure alone to determine entailment; the system's ability to recognize entailment is above chance on all data sets. Since the RTE data sets are balanced, we used the median of the random walk scores for each data set as the threshold rather than using an absolute threshold. While the measure does not outperform most RTE systems, it does outperform some systems that used only lexical overlap, such as the Katrenko system from the second challenge (Bar-Haim et al., 2006). These results show that the measure is somewhat sensitive to the distance metric chosen, and that the best distance metric may vary by application.

To test the random walk's value for improving an existing RTE system, we incorporated the walk as a feature of the Stanford RTE system (Chambers et al., 2007). This system computes a weighted sum of a variety of features to make an entailment decision. We added the random walk score as one of these features and scaled it to have a magnitude comparable to the other features; other than scaling, there was no system-specific engineering to add this feature.

Data Set    Baseline   Cosine   Dice    JS
RTE2 dev    66.00      66.75    65.75   66.25
RTE2 test   63.62      64.50    63.12   63.25
RTE3 dev    70.25      70.50    70.62   70.38
RTE3 test   65.44      65.82    65.44   65.44

Table 5: Accuracy when the random walk is added as a feature of an existing RTE system (left column) under various distance metrics (right columns).

As shown in Table 5, adding the random walk feature improves the original RTE system. Thus, the random walk score provides meaningful evidence for detecting entailment that is not subsumed by other information, even in a system with several years of feature engineering and competitive performance. In particular, this RTE system contains features representing the alignment score between two passages; this score is composed of a combination of lexical relatedness scores between words in each text. The ability of the random walk to add value to the system even given this score, which contains many common lexical relatedness measures, suggests we are able to extract text similarity information that is distinct from other measures. To put the gain we achieve in perspective, an increase in the Stanford RTE system's score of the same magnitude would have moved the system's two challenge entries from 7th and 25th to 6th and 17th, respectively, in the second RTE Challenge. It is likely the gain from this feature could be increased by closer integration with the system and by optimizing the initial distribution creation for this task.

By using the score as a feature, the system is able to take advantage of properties of the score distribution. While Table 4 shows performance when a threshold is picked a priori, experimenting with that threshold increases performance by over two percent. By lowering the threshold (classifying more passages as entailments), we increase recall of entailed pairs without losing as much precision in non-entailed pairs since many have very low scores. As a feature, this aspect of the score distribution can be incorporated by the system, but it cannot be used in a simple thresholding design.
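As a small illustration of the thresholding baseline behind Table 4, the sketch below labels a pair as entailed when its walk similarity exceeds the per-dataset median; the scores themselves are invented for the example.

```python
import numpy as np

def entailment_by_median(scores):
    """Label a pair as entailed iff its walk similarity exceeds the dataset
    median (the RTE dev/test sets are balanced, so this guesses half-and-half)."""
    scores = np.asarray(scores)
    threshold = np.median(scores)
    return scores > threshold

walk_scores = [0.12, 0.55, 0.31, 0.78, 0.44, 0.67]   # invented similarity scores
print(entailment_by_median(walk_scores))
```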

6 Discussion

The random walk framework smoothes an initial distribution of words into a much larger lexical space. In one sense, this is similar to the technique of query expansion used in information retrieval. A traditional query expansion model extends a bag of words (usually a query) with additional related words. In the case of pseudo-relevance feedback, these words come from the first documents returned by the search engine, but other modes of selecting additional words exist. In the random walk framework, this expansion is analogous to taking only a single step of the random walk. Indeed, in the case of the translation model introduced in (Berger and Lafferty, 1999), they are mathematically equivalent. However, we have argued that the walk is an effective global aggregator of relatedness information. We can formulate the question as an empirical one—does simulating the walk until convergence really improve our representation of the text document?

To answer this question, we extracted a 200-item subset of the MSR training data and truncated the walk at each time step up until our convergence threshold was reached at around 50 iterations. We then evaluated the correlation of the walk score with the correct label from the MSR data for 10 random resamplings of 66 documents each. Figure 1 plots this result for different distributional similarity measures. We observe that as the number of steps increases, performance under most of the distributional similarity measures improves, with the exception of the asymmetric skew-divergence measure introduced in (Lee, 2001).

[Figure 1: Impact of number of walk steps on correlation with MSR paraphrase judgments. The left column shows absolute correlation across ten resampled runs (y-axis) versus number of steps taken (x-axis). The right column plots the mean ratio of performance at step t (x-axis) versus performance at convergence.]

This plot also gives some insight into the qualitative nature of the stability of the various distributional measures for the paraphrase task. For instance, we observe that the Jensen-Shannon score and dice score tend to be the most consistent between runs, but the dice score has a slightly higher mean. This explains in part why the dice score was the best performing measure for the task. In contrast, cosine similarity was observed to perform poorly here, although it was found to be the best measure when combined with our textual entailment system. We believe this discrepancy is due in part to the feature scaling issues described in Section 5.2.
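The step-wise analysis above can be outlined in code as follows. The use of Pearson's r and the synthetic truncated-walk scores are assumptions for illustration; only the resampling scheme (10 samples of 66 items drawn from a 200-item subset) comes from the text.

```python
import numpy as np

def correlation_by_step(scores_by_step, labels, n_resamples=10, sample_size=66,
                        seed=0):
    """Mean/std of Pearson correlation between truncated-walk scores and labels.

    scores_by_step: array of shape (num_steps, num_pairs); similarity of each
    sentence pair when the walk is stopped after t steps.
    labels: 0/1 paraphrase judgments for the same pairs.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)
    results = []
    for step_scores in scores_by_step:
        rs = []
        for _ in range(n_resamples):
            idx = rng.choice(len(labels), size=sample_size, replace=False)
            rs.append(np.corrcoef(step_scores[idx], labels[idx])[0, 1])
        results.append((np.mean(rs), np.std(rs)))
    return results

# Synthetic data: 200 pairs, 5 walk lengths, scores loosely tied to the labels.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
scores_by_step = np.stack([labels * 0.1 * t + rng.normal(0, 0.5, 200)
                           for t in range(1, 6)])
print(correlation_by_step(scores_by_step, labels))
```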
For in- may very well add value as a lexical resource un- stance, we observe that the Jensen-Shannon score derlying the walk. One might also consider tailor- and dice score tend to be the most consistent be- ing the output of the walk with machine learning tween runs, but the dice score has a slightly higher techniques like those presented in (Minkov and mean. This explains in part why the dice score was Cohen, 2007). the best performing measure for the task. In con- trast, cosine similarity was observed to perform References poorly here, although it was found to be the best measure when combined with our textual entail- E. Agirre and A. Soroa. 2009. Personalizing ment system. We believe this discrepancy is due for word sense disambiguation. In EACL, Athens, Greece. in part to the feature scaling issues described in section 5.2. R. Bar-Haim, I. Dagan, B. Dolan, L. Ferro, D. Gi- ampiccolo, B. Magnini, and I. Szpektor. 2006. The 2nd PASCAL recognizing textual entailment chal- 7 Final remarks lenge. In PASCAL Challenges Workshop on RTE. Notions of similarity have many levels of gran- A. Berger and J. Lafferty. 1999. Information retrieval ularity, from general metrics for lexical related- as statistical translation. SIGIR 1999, pages 222– ness to application-specific measures between text 229. passages. While lexical relatedness is well stud- P. Berkhin. 2005. A survey on pagerank computing. ied, it is not directly applicable to text passages Internet Mathematics, 2(1):73–120. without some surrounding environment. Because A. Budanitsky and G. Hirst. 2006. Evaluating this work represents words and passages as in- -based measures of lexical semantic related- terchangeable mathematical objects (teleport vec- ness. Computational Linguistics, 32(1):13–47. N. Chambers, D. Cer, T. Grenager, D. Hall, C. Kiddon, M. Lesk. 1986. Automatic sense disambiguation us- B. MacCartney, M. de Marneffe, D. Ramage, E. Yeh, ing machine readable dictionaries: how to tell a pine and C. D. Manning. 2007. Learning alignments and cone from an ice cream cone. ACM SIGDOC: Pro- leveraging natural logic. In ACL-PASCAL Workshop ceedings of the 5th Annual International Conference on Textual Entailment and Paraphrasing. on Systems Documentation, 1986:24–26. K. Collins-Thompson and J. Callan. 2005. Query ex- V. I. Levenshtein. 1966. Binary Codes Capable pansion using random walk models. In CIKM ’05, of Correcting Deletions, Insertions, and Reversals. pages 704–711, New York, NY, USA. ACM Press. Ph.D. thesis, Soviet Physics Doklady. C. Corley and R. Mihalcea. 2005. Measuring the se- C. Manning, P. Raghavan, and H. Schutze, 2008. In- mantic similarity of texts. In ACL Workshop on Em- troduction to information retrieval, pages 258–263. pirical Modeling of Semantic Equivalence and En- Cambridge University Press. tailment, pages 13–18, Ann Arbor, Michigan, June. ACL. R. Mihalcea, C. Corley, and C. Strapparava. 2006. Corpus-based and knowledge-based measures of J. R. Curran. 2004. From Distributional to Semantic text semantic similarity. AAAI 2006, 6. Similarity. Ph.D. thesis, University of Edinburgh. E. Minkov and W. W. Cohen. 2007. Learning to rank I. Dagan, O. Glickman, and B. Magnini. 2005. typed graph walks: Local and global approaches. In The PASCAL recognizing textual entailment chal- WebKDD and SNA-KDD joint workshop 2007. lenge. In Quinonero-Candela et al., editor, MLCW 2005, LNAI Volume 3944, pages 177–190. Springer- G. Minnen, J. Carroll, and D. Pearce. 2001. Applied Verlag. morphological processing of English. 
Natural Lan- guage Engineering, 7(03):207–223. B. Dolan, C. Quirk, and C. Brockett. 2004. Unsu- pervised construction of large paraphrase corpora: A. Montejo-Raez,´ J.M. Perea, F. Mart´ınez-Santiago, Exploiting massively parallel news sources. In Col- M. A. Garc´ıa-Cumbreras, M. M. Valdivia, and ing 2004, pages 350–356, Geneva, Switzerland, Aug A. Urena˜ Lopez.´ 2007. Combining lexical-syntactic 23–Aug 27. COLING. information with machine learning for recognizing textual entailment. In ACL-PASCAL Workshop on C. Fellbaum. 1998. WordNet: An electronic lexical Textual Entailment and Paraphrasing, pages 78–82, database. MIT Press. Prague, June. ACL. D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan. R. Navigli and P. Velardi. 2005. Structural seman- 2007. The 3rd PASCAL Recognizing Textual En- tic interconnections: A knowledge-based approach tailment Challenge. In ACL-PASCAL Workshop on to word sense disambiguation. IEEE Trans. Pattern Textual Entailment and Paraphrasing, pages 1–9, Anal. Mach. Intell., 27(7):1075–1086. Prague, June. P. Resnik. 1999. Semantic similarity in a : O. Glickman, I. Dagan, and M. Koppel. 2005. Web An information-based measure and its application to based probabilistic textual entailment. In PASCAL problems of ambiguity in natural language. JAIR, Challenges Workshop on RTE. (11):95–130. T. H. Haveliwala. 2002. Topic-sensitive pagerank. In R. Snow, D. Jurafsky, and A. Y. Ng. 2006. Semantic WWW ’02, pages 517–526, New York, NY, USA. taxonomy induction from heterogenous evidence. In ACM. ACL, pages 801–808. T. Hughes and D. Ramage. 2007. Lexical semantic relatedness with random graph walks. In EMNLP- CoNLL, pages 581–589. M. Jarmasz and S. Szpakowicz. 2003. Roget’s the- saurus and semantic similarity. In Proceedings of RANLP-03, pages 212–219. J. J. Jiang and D. W. Conrath. 1997. Semantic similar- ity based on corpus statistics and lexical taxonomy. In ROCLING X, pages 19–33. T.K. Landauer, P.W. Foltz, and D. Laham. 1998. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284. L. Lee. 2001. On the effectiveness of the skew diver- gence for statistical language analysis. In Artificial Intelligence and Statistics 2001, pages 65–72.