Estimating Sentence Semantic Similarity Using N-Gram Regression Models and Web Snippets

DeepPurple: Estimating Sentence Semantic Similarity using N-gram Regression Models and Web Snippets Nikos Malandrakis, Elias Iosif, Alexandros Potamianos Department of ECE, Technical University of Crete, 73100 Chania, Greece [nmalandrakis,iosife,potam]@telecom.tuc.gr Abstract on word-level semantic similarity estimation to de- sign and implement a system for sentence-level STS We estimate the semantic similarity between for Task6 of the SemEval’12 campaign. two sentences using regression models with Semantic similarity between words can be re- features: 1) n-gram hit rates (lexical matches) garded as the graded semantic equivalence at the between sentences, 2) lexical semantic similarity between non-matching words, and 3) lexeme level and is tightly related with the tasks of sentence length. Lexical semantic similarity is word sense discovery and disambiguation (Agirre computed via co-occurrence counts on a cor- and Edmonds, 2007). Metrics of word semantic sim- pus harvested from the web using a modified ilarity can be divided into: (i) knowledge-based met- mutual information metric. State-of-the-art re- rics (Miller, 1990; Budanitsky and Hirst, 2006) and sults are obtained for semantic similarity com- (ii) corpus-based metrics (Baroni and Lenci, 2010; putation at the word level, however, the fusion Iosif and Potamianos, 2010). of this information at the sentence level pro- vides only moderate improvement on Task 6 When more complex structures, such as phrases of SemEval’12. Despite the simple features and sentences, are considered, it is much harder used, regression models provide good perfor- to estimate semantic equivalence due to the non- mance, especially for shorter sentences, reach- compositional nature of sentence-level semantics ing correlationof 0.62 on the SemEval test set. and the exponential explosion of possible interpre- tations. STS is closely related to the problems of paraphrasing, which is bidirectional and based on 1 Introduction semantic equivalence (Madnani and Dorr, 2010) and Recently, there has been significant research activ- textual entailment, which is directional and based ity on the area of semantic similarity estimation on relations between semantics (Dagan et al., 2006). motivated both by abundance of relevant web data Related methods incorporate measurements of sim- and linguistic resources for this task. Algorithms ilarity at various levels: lexical (Malakasiotis and for computing semantic textual similarity (STS) are Androutsopoulos, 2007), syntactic (Malakasiotis, relevant for a variety of applications, including in- 2009; Zanzotto et al., 2009), and semantic (Rinaldi formation extraction (Szpektor and Dagan, 2008), et al., 2003; Bos and Markert, 2005). Measures question answering (Harabagiu and Hickl, 2006) from machine translation evaluation are often used and machine translation (Mirkin et al., 2009). Word- to evaluate lexical level approaches (Finch et al., or term-level STS (a special case of sentence level 2005; Perez and Alfonseca, 2005), including BLEU STS) has also been successfully applied to the prob- (Papineni et al., 2002), a metric based on word n- lem of grammar induction (Meng and Siu, 2002) gram hit rates. and affective text categorization (Malandrakis et al., Motivated by BLEU, we use n-gram hit rates and 2011). In this work, we built on previous research word-level semantic similarity scores as features in 565 First Joint Conference on Lexical and Computational Semantics (*SEM), pages 565–570, Montreal,´ Canada, June 7-8, 2012. c 2012 Association for Computational Linguistics a linear regression model to estimate sentence level ated above) can significantly reduce the semantic semantic similarity. We also propose sigmoid scal- similarity estimation error of the mutual information ing of similarity scores and sentence-length depen- metric I(i, j). This is also experimentally verified in dent modeling. The models are evaluated on the Se- (Iosif and Potamianos, 2012c). mEval’12 sentence similarity task. In addition, one can modify the mutual information metric to further reduce estimation error (for 2 Semantic similarity between words the theoretical foundation behind this see (Iosif and In this section, two different metrics of word simi- Potamianos, 2012a)). Specifically, one may intro- α larity are presented. The first is a language-agnostic, duce exponential weights in order to reduce the p i p j corpus-based metric requiring no knowledge re- contribution of ( ) and ( ) in the similarity metric. The modified metric I i, j , is defined as: sources, while the second metric relies on WordNet. a( ) Corpus-based metric: Given a corpus, the se- 1 pˆ(i, j) pˆ(i, j) Ia(i, j)= log + log . (1) mantic similarity between two words, wi and wj, 2 pˆα(i)ˆp(j) pˆ(i)ˆpα(j) is estimated as their pointwise mutual information The weight α was estimated on the corpus of (Iosif I i, j pˆ(i,j) (Church and Hanks, 1990): ( ) = log pˆ(i)ˆp(j) , and Potamianos, 2012b) in order to maximize word where pˆ(i) and pˆ(j) are the occurrence probabili- sense coverage in the semantic neighborhood of ties of wi and wj, respectively, while the probability each word. The Ia(i, j) metric using the estimated of their co-occurrence is denoted by pˆ(i, j). These value of α = 0.8 was shown to significantly out- probabilities are computed according to maximum perform I(i, j) and to achieve state-of-the-art results likelihood estimation. The assumption of this met- on standard semantic similarity datasets (Rubenstein ric is that co-occurrence implies semantic similarity. and Goodenough, 1965; Miller and Charles, 1998; During the past decade the web has been used for Finkelstein et al., 2002). For more details see (Iosif estimating the required probabilities (Turney, 2001; and Potamianos, 2012a). Bollegala et al., 2007), by querying web search en- WordNet-based metrics: For comparison pur- gines and retrieving the number of hits required poses, we evaluated various similarity metrics on to estimate the frequency of individual words and the task of word similarity computation on three their co-occurrence. However, these approaches standard datasets (same as above). The best re- have failed to obtain state-of-the-art results (Bolle- sults were obtained by the Vector metric (Patward- gala et al., 2007), unless “expensive” conjunctive han and Pedersen, 2006), which exploits the lexical AND queries are used for harvesting a corpus and information that is included in the WordNet glosses. then using this corpus to estimate similarity scores This metric was incorporated to our proposed ap- (Iosif and Potamianos, 2010). proach. All metrics were computed using the Word- 1 Recently, a scalable approach for harvesting a Net::Similarity module (Pedersen, 2005). corpus has been proposed where web snippets are downloaded using individual queries for each word 3 N-gram Regression Models (Iosif and Potamianos, 2012b). Semantic similar- Inspired by BLEU (Papineni et al., 2002), we pro- ity can then be estimated using the I(i, j) metric pose a simple regression model that combines evi- and within-snippet word co-occurrence frequencies. dence from two sources: number of n-gram matches Under the maximum sense similarity assumption and degree of similarity between non-matching (Resnik, 1995), it is relatively easy to show that a words between two sentences. In order to incorpo- (more) lexically-balanced corpus2 (as the one cre- rate a word semantic similarity metric into BLEU, 1The scalability of this approach has been demonstrated in we apply the following two-pass process: first lexi- (Iosif and Potamianos, 2012b) for a 10K vocabulary, here we cal hits are identified and counted, and then the se- extend it to the full 60K WordNet vocabulary. mantic similarity between n-grams not matched dur- 2According to this assumption the semantic similarity of two words can be estimated as the minimum pairwise similarity of respond to all senses, i.e., the denominator of I(i, j) is overes- their senses. The gist of the argument is that although words timated causing large underestimation error for similarities be- often co-occur with their closest senses, word occurrences cor- tween polysemous words. 566 ing the first pass is estimated. All word similar- length for each sentence pair, in words) acts as a ity metrics used are peak-to-peak normalized in the scaling factor for the linearly estimated similarity. [0,1] range, so they serve as a “degree-of-match”. The hierarchical fusion scheme is actually a col- The semantic similarity scores from word pairs are lection of (overlapping) linear regression models, summed together (just like n-gram hits) to obtain each matching a range of sentence lengths. For ex- a BLEU-like semantic similarity score. The main ample, the first model DL1 is trained with sentences problem here is one of alignment, since we need with length up to l1, i.e., l ≤ l1, the second model to compare each non-matched n-gram from the hy- DL2 up to length l2 etc. During testing, sentences pothesis with an n-gram from the reference. We with length l ∈ [1, l1] are decoded with DL1, sen- use a simple approach: we iterate on the hypoth- tences with length l ∈ (l1, l2] with model DL2 etc. esis n-grams, left-to-right, and compare each with Each of these partial models is a linear fusion model the most similar non-matched n-gram in the refer- as shown in (2). In this work, we use four models ence. This modification to BLEU is only applied with l1 = 10, l2 = 20, l3 = 30, l4 = ∞. to 1-grams, since semantic similarity scores for bi- grams (or higher) were not available. 4 Experimental Procedure and Results Thus, our list of features are the hit rates obtained by BLEU (for 1-, 2-, 3-, 4-grams) and the total se- Initially all sentences are pre-processed by the mantic similarity (SS) score for 1-grams3.

Estimating Sentence Semantic Similarity Using N-Gram Regression Models and Web Snippets

Combining Semantic and Lexical Measures to Evaluate Medical Terms Similarity?

Evaluating Lexical Similarity to Build Sentiment Similarity

Using Prior Knowledge to Guide BERT's Attention in Semantic

Latent Semantic Analysis for Text-Based Research

Ying Liu, Xi Yang

Untangling Semantic Similarity: Modeling Lexical Processing Experiments with Distributional Semantic Models

Measuring Semantic Similarity of Words Using Concept Networks

Semantic Similarity Between Sentences

Semantic Sentence Similarity: Size Does Not Always Matter

Word Embeddings for Semantic Similarity: Comparing LDA with Word2vec

Event-Based Knowledge Reconciliation Using Frame Embeddings And

Mathlingbudapest: Concept Networks for Semantic Similarity