<<

LIPN-CORE: Semantic Text Similarity using n-grams, WordNet, Syntactic Analysis, ESA and Information Retrieval based Features

Davide Buscaldi, Joseph Le Roux, Jorge J. García Flores
Laboratoire d'Informatique de Paris Nord, CNRS (UMR 7030)
Université Paris 13, Sorbonne Paris Cité
F-93430 Villetaneuse, France
{buscaldi,joseph.le-roux,jgflores}@lipn.univ-paris13.fr

Adrian Popescu
CEA, LIST, Vision & Content Engineering Laboratory
F-91190 Gif-sur-Yvette, France
[email protected]

Abstract

This paper describes the system used by the LIPN team in the Semantic Textual Similarity task at *SEM 2013. It uses a support vector regression model, combining different text similarity measures that constitute the features. These measures include simple distances like Levenshtein edit distance, cosine, Named Entities overlap and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

1 Introduction

The Semantic Textual Similarity task (STS) at *SEM 2013 requires systems to grade the degree of similarity between pairs of sentences. It is closely related to other well known tasks in NLP such as textual entailment or paraphrase detection. However, as noticed in (Bär et al., 2012), the major difference is that STS systems must give a graded, as opposed to binary, answer.

One of the most successful systems in *SEM 2012 STS (Bär et al., 2012) managed to grade pairs of sentences accurately by combining focused measures, either simple ones based on surface features (i.e. n-grams), more elaborate ones based on lexical semantics, or measures requiring external corpora such as Explicit Semantic Analysis, into a robust measure by using a log-linear regression model.

The LIPN-CORE system is built upon this idea of combining simple measures with a regression model to obtain a robust and accurate measure of textual similarity, using the individual measures as features for the global system. These measures include simple distances like Levenshtein edit distance, cosine and Named Entities overlap, and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

The paper is organized as follows. Measures are presented in Section 2. Then the regression model, based on Support Vector Machines, is described in Section 3. Finally we discuss the results of the system in Section 4.

2 Text Similarity Measures

2.1 WordNet-based Conceptual Similarity (Proxigenea)

First of all, sentences p and q are analysed in order to extract all the included WordNet synsets. For each WordNet synset, we keep the noun synsets and put them into the sets of synsets associated to the sentences, Cp and Cq, respectively. If a synset belongs to one of the other POS categories (verb, adjective, adverb), we look for its derivationally related forms in order to find a related noun synset: if there is one, we put this synset in Cp (or Cq). For instance, the word "playing" can be associated in WordNet to synset (v)play#2, which has two derivationally related forms corresponding to synsets (n)play#5 and (n)play#6: these are the synsets that are added to the synset set of the sentence. No disambiguation process is carried out, so we take all possible meanings into account.
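To make this synset-collection step concrete, here is a minimal sketch using NLTK's WordNet interface; the paper does not name a toolkit, so the library choice and the helper name sentence_concepts are illustrative assumptions.

```python
# Sketch of the synset-collection step, assuming NLTK's WordNet interface
# (the paper does not name a specific toolkit).
from nltk.corpus import wordnet as wn

def sentence_concepts(tokens):
    """Collect noun synsets for a tokenized sentence; non-noun synsets are
    mapped to noun synsets through their derivationally related forms."""
    concepts = set()
    for token in tokens:
        for synset in wn.synsets(token):
            if synset.pos() == wn.NOUN:
                concepts.add(synset)
            else:
                # verb/adjective/adverb: look for a derivationally related noun
                for lemma in synset.lemmas():
                    for related in lemma.derivationally_related_forms():
                        if related.synset().pos() == wn.NOUN:
                            concepts.add(related.synset())
    return concepts

# No disambiguation is performed: all candidate meanings are kept.
C_p = sentence_concepts(["children", "playing", "football"])
```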


Given Cp and Cq as the sets of concepts contained in sentences p and q, respectively, with |Cp| ≥ |Cq|, the conceptual similarity between p and q is calculated as:

ss(p, q) = \frac{\sum_{c_1 \in C_p} \max_{c_2 \in C_q} s(c_1, c_2)}{|C_p|}    (1)

where s(c1, c2) is a conceptual similarity measure. Concept similarity can be calculated in different ways. For the participation in the 2013 Semantic Textual Similarity task, we used a variation of the Wu-Palmer formula (Wu and Palmer, 1994) named "ProxiGenea" (from the French Proximité Généalogique, genealogical proximity), introduced by (Dudognon et al., 2010), which is inspired by the analogy between a family tree and the concept hierarchy in WordNet. Among the different formulations proposed by (Dudognon et al., 2010), we chose the ProxiGenea3 variant, already used in the STS 2012 task by the IRIT team (Buscaldi et al., 2012). The ProxiGenea3 measure is defined as:

s(c_1, c_2) = \frac{1}{1 + d(c_1) + d(c_2) - 2 \cdot d(c_0)}    (2)

where c0 is the most specific concept that is present both in the synset path of c1 and of c2 (that is, the Least Common Subsumer or LCS). The function returning the depth of a concept is noted with d.

2.2 IC-based Similarity

This measure has been proposed by (Mihalcea et al., 2006) as a corpus-based measure which uses Resnik's Information Content (IC) and the Jiang-Conrath (Jiang and Conrath, 1997) similarity metric:

s_{jc}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \cdot IC(c_0)}    (3)

where IC is the information content introduced by (Resnik, 1995) as IC(c) = − log P(c). The similarity between two text segments T1 and T2 is therefore determined as:

sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} \max_{w_2 \in T_2} ws(w, w_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} \max_{w_1 \in T_1} ws(w, w_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)    (4)

where idf(w) is calculated as the inverse document frequency of word w, taking into account Google Web 1T (Brants and Franz, 2006) frequency counts. The similarity between two words is calculated as:

ws(w_i, w_j) = \max_{c_i \in W_i, c_j \in W_j} s_{jc}(c_i, c_j)    (5)

where Wi and Wj are the sets containing all the synsets in WordNet corresponding to words wi and wj, respectively. The IC values used are those calculated by Ted Pedersen (Pedersen et al., 2004) on the British National Corpus (http://www.d.umn.edu/~tpederse/similarity.html).
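As an illustration of formulas (1) and (2), the following sketch computes the ProxiGenea3 concept similarity and the sentence-level score, reusing sentence_concepts from the sketch above; mapping the depth function d to NLTK's min_depth() and taking the deepest of multiple lowest common hypernyms are assumptions, since the paper does not specify these details.

```python
# Sketch of formulas (1) and (2), reusing sentence_concepts() from the
# previous sketch. NLTK's min_depth() stands in for the depth function d,
# and the deepest common hypernym is used as the LCS (both are assumptions).
def proxigenea3(c1, c2):
    """ProxiGenea3 concept similarity, formula (2)."""
    subsumers = c1.lowest_common_hypernyms(c2)
    if not subsumers:                       # no common subsumer in WordNet
        return 0.0
    d0 = max(lcs.min_depth() for lcs in subsumers)
    return 1.0 / (1 + c1.min_depth() + c2.min_depth() - 2 * d0)

def conceptual_similarity(C_p, C_q):
    """Sentence-level score ss(p, q), formula (1)."""
    if len(C_p) < len(C_q):                 # enforce |C_p| >= |C_q|
        C_p, C_q = C_q, C_p
    if not C_p or not C_q:
        return 0.0
    total = sum(max(proxigenea3(c1, c2) for c2 in C_q) for c1 in C_p)
    return total / len(C_p)

# Usage: ss = conceptual_similarity(sentence_concepts(tokens_p),
#                                   sentence_concepts(tokens_q))
```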

2.3 Syntactic Dependencies

We also wanted our system to take syntactic similarity into account. As our measures are lexically grounded, we chose to use dependencies rather than constituents. Previous experiments showed that converting constituents to dependencies still achieved best results on out-of-domain texts (Le Roux et al., 2012), so we decided to use a 2-step architecture to obtain syntactic dependencies. First we parsed pairs of sentences with the LORG parser (https://github.com/CNGLdlab/LORG-Release). Second we converted the resulting parse trees to Stanford dependencies, using the default built-in converter provided with the Stanford Parser (2012-11-12 revision).

Given the sets of parsed dependencies Dp and Dq, for sentences p and q, a dependency d ∈ Dx is a triple (l, h, t) where l is the dependency label (for instance, dobj or prep), h the governor and t the dependant. We define the following similarity measure between two syntactic dependencies d1 = (l1, h1, t1) and d2 = (l2, h2, t2):

dsim(d_1, d_2) = Lev(l_1, l_2) \cdot \frac{idf_h \cdot s_{WN}(h_1, h_2) + idf_t \cdot s_{WN}(t_1, t_2)}{2}    (6)

where idf_h = max(idf(h_1), idf(h_2)) and idf_t = max(idf(t_1), idf(t_2)) are the inverse document frequencies calculated on Google Web 1T for the governors and the dependants (we retain the maximum for each pair), and s_WN is calculated using formula 1, with two differences:

• if the two words to be compared are antonyms, then the returned score is 0;

• if one of the words to be compared is not in WordNet, their similarity is calculated using the Levenshtein distance.

The similarity score between p and q is then calculated as:

s_{SD}(p, q) = \max\left( \frac{\sum_{d_i \in D_p} \max_{d_j \in D_q} dsim(d_i, d_j)}{|D_p|}, \frac{\sum_{d_i \in D_q} \max_{d_j \in D_p} dsim(d_i, d_j)}{|D_q|} \right)    (7)

2.4 Information Retrieval-based Similarity

Let us consider two texts p and q, an Information Retrieval (IR) system S and a document collection D indexed by S. This measure is based on the assumption that p and q are similar if the documents retrieved by S for the two texts, used as input queries, are ranked similarly.

Let Lp = {dp_1, ..., dp_K} and Lq = {dq_1, ..., dq_K}, dx_i ∈ D, be the sets of the top K documents retrieved by S for texts p and q, respectively. Let us define sp(d) and sq(d) as the scores assigned by S to a document d for the queries p and q, respectively. Then, the similarity score is calculated as:

sim_{IR}(p, q) = 1 - \frac{\sum_{d \in L_p \cap L_q} \frac{\sqrt{(s_p(d) - s_q(d))^2}}{\max(s_p(d), s_q(d))}}{|L_p \cap L_q|}    (8)

if Lp ∩ Lq ≠ ∅, and 0 otherwise.

For the participation in this task we indexed a collection composed by the AQUAINT-2 (http://www.nist.gov/tac/data/data_desc.html#AQUAINT-2) and the English NTCIR-8 (http://metadata.berkeley.edu/NTCIR-GeoTime/ntcir-8-databases.php) document collections, using the Lucene (http://lucene.apache.org/core) 4.2 search engine with BM25 similarity. The K value was empirically set to 20 after some tests on the STS 2012 data.
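As a concrete reading of formula (8), the sketch below compares the two ranked result lists through their retrieval scores; the dictionary-based input format and the example scores are illustrative assumptions rather than the authors' actual Lucene integration.

```python
from math import sqrt

def ir_similarity(scores_p, scores_q):
    """Formula (8): compare the rankings produced for queries p and q.

    scores_p, scores_q: dicts mapping document ids to the retrieval scores
    assigned by the search engine to the top-K results of each query.
    """
    shared = set(scores_p) & set(scores_q)
    if not shared:
        return 0.0
    penalty = sum(
        sqrt((scores_p[d] - scores_q[d]) ** 2) / max(scores_p[d], scores_q[d])
        for d in shared
    )
    return 1.0 - penalty / len(shared)

# Example with two hypothetical top-5 result lists:
p_results = {"doc1": 9.2, "doc2": 7.5, "doc3": 6.1, "doc7": 4.0, "doc9": 3.2}
q_results = {"doc1": 8.9, "doc3": 6.5, "doc4": 5.0, "doc7": 2.1, "doc8": 1.8}
print(ir_similarity(p_results, q_results))
```

Documents retrieved for only one of the two queries are ignored, so the score reflects how consistently the engine weights the results shared by both queries.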

2.5 ESA

Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007) represents meaning as a weighted vector of Wikipedia concepts. Weights are supposed to quantify the strength of the relation between a word and each Wikipedia concept using the tf-idf measure. A text is then represented as a high-dimensional real valued vector space spanning all along the Wikipedia database. For this particular task we adapt the research-esa implementation (Sorg and Cimiano, 2008) (http://code.google.com/p/research-esa/) to our own home-made weighted vectors corresponding to a Wikipedia snapshot of February 4th, 2013.

2.6 N-gram based Similarity

This feature is based on the Clustered Keywords Positional Distance (CKPD) model proposed in (Buscaldi et al., 2009) for the passage retrieval task. The similarity between a text fragment p and another text fragment q is calculated as:

sim_{ngrams}(p, q) = \frac{\sum_{\forall x \in Q} \frac{h(x, P)}{d(x, x_{max})}}{\sum_{i=1}^{n} w_i}    (9)

where P is the set of n-grams with the highest weight in p, where all terms are also contained in q; Q is the set of all the possible n-grams in q and n is the total number of terms in the longest passage. The weights for each term and each n-gram are calculated as follows:

• w_i calculates the weight of the term t_i as:

w_i = 1 - \frac{\log(n_i)}{1 + \log(N)}    (10)

where n_i is the frequency of term t_i in the Google Web 1T collection, and N is the frequency of the most frequent term in the Google Web 1T collection;

• the function h(x, P) measures the weight of each n-gram and is defined as:

h(x, P_j) = \begin{cases} \sum_{k=1}^{j} w_k & \text{if } x \in P_j \\ 0 & \text{otherwise} \end{cases}    (11)

where w_k is the weight of the k-th term (see Equation 10) and j is the number of terms that compose the n-gram x;

• 1/d(x, x_{max}) is a distance factor which reduces the weight of the n-grams that are far from the heaviest n-gram. The function d(x, x_{max}) determines numerically the value of the separation according to the number of words between an n-gram and the heaviest one:

d(x, x_{max}) = 1 + k \cdot \ln(1 + L)    (12)

where k is a factor that determines the importance of the distance in the similarity calculation and L is the number of words between an n-gram and the heaviest one (see Equation 11). In our experiments, k was set to 0.1, the default value in the original model.

2.7 Other measures

In addition to the above text similarity measures, we also used the following common measures:

2.7.1 Cosine

Given p = (w_{p_1}, ..., w_{p_n}) and q = (w_{q_1}, ..., w_{q_n}) the vectors of tf.idf weights associated to sentences p and q, the cosine distance is calculated as:

sim_{cos}(p, q) = \frac{\sum_{i=1}^{n} w_{p_i} \times w_{q_i}}{\sqrt{\sum_{i=1}^{n} w_{p_i}^2} \times \sqrt{\sum_{i=1}^{n} w_{q_i}^2}}    (13)

The idf value was calculated on Google Web 1T.

2.7.2 Edit Distance

This similarity measure is calculated using the Levenshtein distance as:

sim_{ED}(p, q) = 1 - \frac{Lev(p, q)}{\max(|p|, |q|)}    (14)

where Lev(p, q) is the Levenshtein distance between the two sentences, taking into account the characters.

2.7.3 Named Entity Overlap

We used the Stanford Named Entity Recognizer by (Finkel et al., 2005), with the 7-class model trained for MUC: Time, Location, Organization, Person, Money, Percent, Date. Then we calculated a per-class overlap measure (in this way, "France" as an Organization does not match "France" as a Location):

O_{NER}(p, q) = \frac{2 \cdot |N_p \cap N_q|}{|N_p| + |N_q|}    (15)

where Np and Nq are the sets of NEs found, respectively, in sentences p and q.

3 Integration of Similarity Measures

The integration has been carried out using the ν-Support Vector Regression model (ν-SVR) (Schölkopf et al., 1999) implementation provided by LIBSVM (Chang and Lin, 2011), with a radial basis function kernel with the standard parameters (ν = 0.5).
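A minimal sketch of this feature-combination step is given below, assuming scikit-learn's NuSVR (which wraps LIBSVM) instead of the authors' direct LIBSVM setup; the feature layout and the train_and_predict helper are illustrative, and only the edit-distance feature of formula (14) is computed here.

```python
# Minimal sketch of the feature-combination step of Section 3, assuming
# scikit-learn's NuSVR (a LIBSVM wrapper) with an RBF kernel and nu = 0.5.
# Only the normalized Levenshtein similarity of formula (14) is computed;
# the other measures would fill the remaining columns of the feature matrix.
import numpy as np
from sklearn.svm import NuSVR

def levenshtein(a, b):
    """Character-level Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def edit_similarity(p, q):
    """sim_ED of formula (14)."""
    if max(len(p), len(q)) == 0:
        return 1.0
    return 1.0 - levenshtein(p, q) / max(len(p), len(q))

def features(p, q):
    # One column per similarity measure; only sim_ED is computed here.
    return [edit_similarity(p, q)]  # + [sim_cos, ss, s_SD, sim_IR, ...]

# train_pairs: list of (sentence_p, sentence_q); train_scores: gold 0-5 scores
def train_and_predict(train_pairs, train_scores, test_pairs):
    X_train = np.array([features(p, q) for p, q in train_pairs])
    X_test = np.array([features(p, q) for p, q in test_pairs])
    model = NuSVR(nu=0.5, kernel="rbf")
    model.fit(X_train, train_scores)
    return model.predict(X_test)
```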

4 Results

In order to evaluate the impact of the different features, we carried out an ablation test, removing one feature at a time and training a new model with the reduced set of features. In Table 2 we show the results of the ablation test for each subset of the *SEM 2013 test set; in Table 1 we show the same test on the whole test set. Note: the results have been calculated as the Pearson correlation on the whole test set and not as an average of the correlation scores calculated over the composing test sets.

Feature Removed   Pearson   Loss
None              0.597     0
N-grams           0.596     0.10%
WordNet           0.563     3.39%
SyntDeps          0.602     −0.43%
Edit              0.584     1.31%
Cosine            0.596     0.10%
NE Overlap        0.603     −0.53%
IC-based          0.598     −0.10%
IR-Similarity     0.510     8.78%
ESA               0.601     −0.38%

Table 1: Ablation test for the different features on the whole 2013 test set.

Feature         FNWN Pearson   FNWN Loss   Headlines Pearson   Headlines Loss   OnWN Pearson   OnWN Loss   SMT Pearson   SMT Loss
None            0.404          0           0.706               0                0.694          0           0.301         0
N-grams         0.379          2.49%       0.705               0.12%            0.698          −0.44%      0.289         1.16%
WordNet         0.376          2.80%       0.695               1.09%            0.682          1.17%       0.278         2.28%
SyntDeps        0.403          0.08%       0.699               0.70%            0.679          1.49%       0.284         1.62%
Edit            0.402          0.19%       0.689               1.70%            0.667          2.72%       0.286         1.50%
Cosine          0.393          1.03%       0.683               2.38%            0.676          1.80%       0.303         −0.24%
NE Overlap      0.410          −0.61%      0.700               0.67%            0.680          1.37%       0.285         1.58%
IC-based        0.391          1.26%       0.699               0.75%            0.669          2.50%       0.283         1.76%
IR-Similarity   0.426          −2.21%      0.633               7.33%            0.589          10.46%      0.249         5.19%
ESA             0.391          1.22%       0.691               1.57%            0.702          −0.81%      0.275         2.54%

Table 2: Ablation test for the different features on the different parts of the 2013 test set.

Feature      FNWN    Headlines   OnWN    SMT     ALL
N-grams      0.285   0.532       0.459   0.280   0.336
WordNet      0.395   0.606       0.552   0.282   0.477
SyntDeps     0.233   0.409       0.345   0.323   0.295
Edit         0.220   0.536       0.089   0.355   0.230
Cosine       0.306   0.573       0.541   0.244   0.382
NE Overlap   0.000   0.216       0.000   0.013   0.020
IC-based     0.413   0.540       0.642   0.285   0.421
IR-based     0.067   0.598       0.628   0.241   0.541
ESA          0.328   0.546       0.322   0.289   0.390

Table 3: Pearson correlation calculated on individual features.

The ablation test shows that the IR-based feature turned out to be the most effective one, especially for the headlines subset (as expected), and, quite surprisingly, on the OnWN data. In Table 3 we show the correlation between each feature and the result (feature values normalised between 0 and 5): from this table we can also observe that, on average, IR-based similarity was better able to capture the semantic similarity between texts. The only exception was the FNWN test set: the IR-based similarity returned a 0 score 178 times out of 189 (94.1%), indicating that the indexed corpus did not fit the content of the FNWN sentences. This result also shows the limits of the IR-based similarity score, which needs a large corpus to achieve enough coverage.

4.1 Shared submission with INAOE-UPV

One of the files submitted by INAOE-UPV, INAOE-UPV-run3, has been produced using seven features produced by different teams: INAOE, LIPN and UMCC-DLSI. We contributed to this joint submission with the IR-based, WordNet and cosine features.

5 Conclusions and Further Work

In this paper we introduced the LIPN-CORE system, which combines semantic, syntactic and lexical measures of text similarity in a linear regression model. Our system was among the best 15 runs for the STS task. According to the ablation test, the best performing feature was the IR-based one, where a sentence is considered as a query and its meaning represented as a set of documents indexed by an IR system. The second and third best-performing measures were WordNet similarity and Levenshtein's edit distance. On the other hand, the worst performing similarity measures were Named Entity Overlap, Syntactic Dependencies and ESA.

However, a correlation analysis calculated on the features taken one-by-one shows that the contribution of a feature to the overall regression result does not correspond to the actual capability of the measure to represent the semantic similarity between the two texts. These results raise the methodological question of how to combine semantic, syntactic and lexical similarity measures in order to estimate the impact of the different strategies used on each dataset.

Further work will include richer similarity measures, like quasi-synchronous grammars (Smith and Eisner, 2006) and random walks (Ramage et al., 2009). Quasi-synchronous grammars have been used successfully for paraphrase detection (Das and Smith, 2009), as they provide a fine-grained modeling of the alignment of syntactic structures, in a very flexible way, enabling partial alignments and the inclusion of external features, like WordNet lexical relations for example. Random walks have been used effectively for paraphrase recognition and as a feature for recognizing textual entailment. Finally, we will continue analyzing the question of how to combine a wide variety of similarity measures in such a way that they tackle the semantic variations of each dataset.

Acknowledgments

We would like to thank the Quaero project and the LabEx EFL (http://www.labex-efl.org) for their support to this work.

References

[Bär et al., 2012] Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch. 2012. UKP: Computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the 6th International Workshop on Semantic Evaluation, held in conjunction with the 1st Joint Conference on Lexical and Computational Semantics, pages 435–440, Montreal, Canada, June.

[Brants and Franz, 2006] Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram corpus version 1.1.

[Buscaldi et al., 2009] Davide Buscaldi, Paolo Rosso, José Manuel Gómez, and Emilio Sanchis. 2009. Answering questions with an n-gram based passage retrieval engine. Journal of Intelligent Information Systems (JIIS), 34(2):113–134.

[Buscaldi et al., 2012] Davide Buscaldi, Ronan Tournier, Nathalie Aussenac-Gilles, and Josiane Mothe. 2012. IRIT: Textual similarity combining conceptual similarity with an n-gram comparison method. In Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval 2012), Montreal, Quebec, Canada.

[Chang and Lin, 2011] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Das and Smith, 2009] Dipanjan Das and Noah A. Smith. 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proc. of ACL-IJCNLP.

[Dudognon et al., 2010] Damien Dudognon, Gilles Hubert, and Bachelin Jhonn Victorino Ralalason. 2010. Proxigénéa : une mesure de similarité conceptuelle. In Proceedings of the Colloque Veille Stratégique Scientifique et Technologique (VSST 2010).

[Finkel et al., 2005] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Gabrilovich and Markovitch, 2007] Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606–1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[Jiang and Conrath, 1997] J.J. Jiang and D.W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the Int'l. Conf. on Research in Computational Linguistics, pages 19–33.

[Le Roux et al., 2012] Joseph Le Roux, Jennifer Foster, Joachim Wagner, Rasul Samad Zadeh Kaljahi, and Anton Bryl. 2012. DCU-Paris13 systems for the SANCL 2012 shared task. In The NAACL 2012 First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), pages 1–4, Montréal, Canada, June.

[Mihalcea et al., 2006] Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI'06, pages 775–780. AAAI Press.

[Pedersen et al., 2004] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: Measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, HLT-NAACL–Demonstrations '04, pages 38–41, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Ramage et al., 2009] Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. 2009. Random walks for text semantic similarity. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 23–31. The Association for Computer Linguistics.

[Resnik, 1995] Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI'95, pages 448–453, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[Schölkopf et al., 1999] Bernhard Schölkopf, Peter Bartlett, Alex Smola, and Robert Williamson. 1999. Shrinking the tube: a new support vector regression algorithm. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 330–336, Cambridge, MA, USA. MIT Press.

[Smith and Eisner, 2006] David A. Smith and Jason Eisner. 2006. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the HLT-NAACL Workshop on Statistical Machine Translation, pages 23–30, New York, June.

[Sorg and Cimiano, 2008] Philipp Sorg and Philipp Cimiano. 2008. Cross-lingual information retrieval with explicit semantic analysis. In Working Notes for the CLEF 2008 Workshop.

[Wu and Palmer, 1994] Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, ACL '94, pages 133–138, Stroudsburg, PA, USA. Association for Computational Linguistics.
