LIPN-CORE: Semantic Text Similarity Using N-Grams, Wordnet, Syntactic Analysis, ESA and Information Retrieval Based Features

Davide Buscaldi, Joseph Le Roux, Jorge J. García Flores
Laboratoire d'Informatique de Paris Nord, CNRS (UMR 7030)
Université Paris 13, Sorbonne Paris Cité, F-93430 Villetaneuse, France
{buscaldi,joseph.le-roux,jgflores}@lipn.univ-paris13.fr

Adrian Popescu
CEA, LIST, Vision & Content Engineering Laboratory
F-91190 Gif-sur-Yvette, France
[email protected]

Abstract

This paper describes the system used by the LIPN team in the Semantic Textual Similarity task at SemEval 2013. It uses a support vector regression model, combining different text similarity measures that constitute the features. These measures include simple distances like Levenshtein edit distance, cosine and Named Entities overlap, and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

1 Introduction

The Semantic Textual Similarity task (STS) at SemEval 2013 requires systems to grade the degree of similarity between pairs of sentences. It is closely related to other well-known tasks in NLP such as textual entailment, question answering or paraphrase detection. However, as noticed in (Bär et al., 2012), the major difference is that STS systems must give a graded, as opposed to binary, answer.

One of the most successful systems in SemEval 2012 STS (Bär et al., 2012) managed to grade pairs of sentences accurately by combining focused measures, either simple ones based on surface features (i.e. n-grams), more elaborate ones based on lexical semantics, or measures requiring external corpora such as Explicit Semantic Analysis, into a robust measure by using a log-linear regression model.

The LIPN-CORE system is built upon this idea of combining simple measures with a regression model to obtain a robust and accurate measure of textual similarity, using the individual measures as features for the global system. These measures include simple distances like Levenshtein edit distance, cosine and Named Entities overlap, and more complex distances like Explicit Semantic Analysis, WordNet-based similarity, IR-based similarity, and a similarity measure based on syntactic dependencies.

The paper is organized as follows. Measures are presented in Section 2. Then the regression model, based on Support Vector Machines, is described in Section 3. Finally we discuss the results of the system in Section 4.

2 Text Similarity Measures

2.1 WordNet-based Conceptual Similarity (ProxiGenea)

First of all, sentences p and q are analysed in order to extract all the included WordNet synsets. We keep the noun synsets and put them into the sets of synsets associated to the sentences, C_p and C_q, respectively. If a synset belongs to one of the other POS categories (verb, adjective, adverb), we look for its derivationally related forms in order to find a related noun synset: if there is one, we put it in C_p (or C_q). For instance, the word "playing" can be associated in WordNet to synset (v)play#2, which has two derivationally related forms corresponding to synsets (n)play#5 and (n)play#6: these are the synsets that are added to the synset set of the sentence. No disambiguation process is carried out, so we take all possible meanings into account.

Given C_p and C_q as the sets of concepts contained in sentences p and q, respectively, with |C_p| \geq |C_q|, the conceptual similarity between p and q is calculated as:

    ss(p, q) = \frac{\sum_{c_1 \in C_p} \max_{c_2 \in C_q} s(c_1, c_2)}{|C_p|}    (1)

where s(c_1, c_2) is a conceptual similarity measure. Concept similarity can be calculated in different ways. For the participation in the 2013 Semantic Textual Similarity task, we used a variation of the Wu-Palmer formula (Wu and Palmer, 1994) named "ProxiGenea" (from the French Proximité Généalogique, genealogical proximity), introduced by (Dudognon et al., 2010), which is inspired by the analogy between a family tree and the concept hierarchy in WordNet. Among the different formulations proposed by (Dudognon et al., 2010), we chose the ProxiGenea3 variant, already used in the STS 2012 task by the IRIT team (Buscaldi et al., 2012). The ProxiGenea3 measure is defined as:

    s(c_1, c_2) = \frac{1}{1 + d(c_1) + d(c_2) - 2 \cdot d(c_0)}    (2)

where c_0 is the most specific concept that is present both in the synset path of c_1 and of c_2 (that is, the Least Common Subsumer or LCS), and d is the function returning the depth of a concept.

2.2 IC-based Similarity

This measure has been proposed by (Mihalcea et al., 2006) as a corpus-based measure which uses Resnik's Information Content (IC) and the Jiang-Conrath (Jiang and Conrath, 1997) similarity metric:

    s_{jc}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \cdot IC(c_0)}    (3)

where IC is the information content introduced by (Resnik, 1995) as IC(c) = -\log P(c). The similarity between two text segments T_1 and T_2 is therefore determined as:

    sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} \max_{w_2 \in T_2} ws(w, w_2) \cdot idf(w)}{\sum_{w \in T_1} idf(w)} + \frac{\sum_{w \in T_2} \max_{w_1 \in T_1} ws(w, w_1) \cdot idf(w)}{\sum_{w \in T_2} idf(w)} \right)    (4)

where idf(w) is calculated as the inverse document frequency of word w, taking into account the Google Web 1T (Brants and Franz, 2006) frequency counts. The semantic similarity between words is calculated as:

    ws(w_i, w_j) = \max_{c_i \in W_i, c_j \in W_j} s_{jc}(c_i, c_j)    (5)

where W_i and W_j are the sets containing all the synsets in WordNet corresponding to words w_i and w_j, respectively. The IC values used are those calculated by Ted Pedersen (Pedersen et al., 2004) on the British National Corpus¹.
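Formulas (1) and (2) can be sketched in Python. The concept depths and the least-common-subsumer lookup would normally come from a WordNet interface; here they are passed in as plain values and functions rather than tied to any specific API, so the sketch only illustrates the arithmetic of the two formulas.

```python
def proxigenea3(d1, d2, d0):
    """Formula (2): ProxiGenea3 similarity from the depths of two
    concepts (d1, d2) and of their least common subsumer (d0)."""
    return 1.0 / (1 + d1 + d2 - 2 * d0)

def concept_set_similarity(cp, cq, s):
    """Formula (1): each concept of the larger set is matched with its
    best counterpart in the other set; scores are averaged over the
    larger set, so unmatched extra concepts lower ss(p, q)."""
    if len(cp) < len(cq):          # enforce |Cp| >= |Cq| as the formula assumes
        cp, cq = cq, cp
    if not cp or not cq:
        return 0.0
    return sum(max(s(c1, c2) for c2 in cq) for c1 in cp) / len(cp)
```

For two identical concepts, c_0 coincides with both, so proxigenea3(d, d, d) = 1.0, the maximum of the measure.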
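The idf-weighted aggregation of formula (4) can be sketched as follows. The word similarity ws of formula (5) and the Web 1T inverse document frequencies are passed in as functions; the toy versions used in the test are placeholders, not the resources used by the system.

```python
def ic_text_similarity(t1, t2, ws, idf):
    """Formula (4): idf-weighted best-match similarity, averaged over
    both directions to keep the measure symmetric.
    t1, t2: token lists; ws(w, w2): word similarity (formula 5);
    idf(w): inverse document frequency of word w."""
    def directed(src, tgt):
        num = sum(max(ws(w, w2) for w2 in tgt) * idf(w) for w in src)
        den = sum(idf(w) for w in src)
        return num / den if den else 0.0
    return 0.5 * (directed(t1, t2) + directed(t2, t1))
```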
The function returning triple (l; h; t) where l is the dependency label (for in- the depth of a concept is noted with d. stance, dobj or prep), h the governor and t the depen- dant. We define the following similarity measure be- 2.2 IC-based Similarity tween two syntactic dependencies d1 = (l1; h1; t1) and d = (l ; h ; t ): This measure has been proposed by (Mihalcea et 2 2 2 2 al., 2006) as a corpus-based measure which uses dsim(d1; d2) = Lev(l1; l2) idf ∗ s (h ; h ) + idf ∗ s (t ; t ) Resnik’s Information Content (IC) and the Jiang- ∗ h WN 1 2 t WN 1 2 Conrath (Jiang and Conrath, 1997) similarity metric: 2 (6) 1 s (c ; c ) = (3) where idfh = max(idf(h1); idf(h2)) and idft = jc 1 2 IC(c ) + IC(c ) − 2 · IC(c ) 1 2 0 max(idf(t1); idf(t2)) are the inverse document fre- where IC is the information content introduced by quencies calculated on Google Web 1T for the gov- (Resnik, 1995) as IC(c) = − log P (c). ernors and the dependants (we retain the maximum The similarity between two text segments T1 and for each pair), and sWN is calculated using formula T2 is therefore determined as: 2, with two differences: 0 P max ws(w; w2) ∗ idf(w) 1 w2fT g w22fT2g • if the two words to be compared are antonyms, sim(T ;T ) = B 1 1 2 2 @ P idf(w) then the returned score is 0; w2fT g 1 1 P 1 http://www.d.umn.edu/˜tpederse/similarity.html max ws(w; w1) ∗ idf(w) 2 https://github.com/CNGLdlab/LORG-Release w2fT g w12fT1g + 2 C(4) 3We used the default built-in converter provided with the P idf(w) A Stanford Parser (2012-11-12 revision). w2fT2g • if one of the words to be compared is not in weighted vector of Wikipedia concepts. Weights WordNet, their similarity is calculated using are supposed to quantify the strength of the relation the Levenshtein distance. between a word and each Wikipedia concept using the tf-idf measure. 
2.4 Information Retrieval-based Similarity

Let us consider two texts p and q, an Information Retrieval (IR) system S and a document collection D indexed by S. This measure is based on the assumption that p and q are similar if the documents retrieved by S for the two texts, used as input queries, are ranked similarly.

Let L_p = \{d_{p_1}, \dots, d_{p_K}\} and L_q = \{d_{q_1}, \dots, d_{q_K}\}, d_{x_i} \in D, be the sets of the top K documents retrieved by S for texts p and q, respectively, and let s_p(d) and s_q(d) be the scores assigned by S to a document d for the queries p and q, respectively. […]

2.5 ESA

[…] weighted vector of Wikipedia concepts. The weights are supposed to quantify the strength of the relation between a word and each Wikipedia concept using the tf-idf measure. A text is then represented as a high-dimensional real-valued vector spanning the whole Wikipedia database. For this particular task we adapt the research-esa implementation (Sorg and Cimiano, 2008)⁷ to our own home-made weighted vectors, corresponding to a Wikipedia snapshot of February 4th, 2013.

2.6 N-gram based Similarity

This feature is based on the Clustered Keywords Positional Distance (CKPD) model proposed in (Buscaldi et al., 2009) for the passage retrieval task. The similarity between a text fragment p and another text fragment q is calculated as:

    sim_{ngrams}(p, q) = \frac{\sum_{\forall x \in Q} \frac{h(x, P)}{d(x, x_{max})}}{\sum_{i=1}^{n} w_i}    (9)

where P is the set of n-grams with the highest weight in p, where all terms are also contained in q; Q is the set of all the possible n-grams in q, and n […]
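The ESA representation described above (texts as weighted vectors of Wikipedia concepts) can be sketched as follows. The excerpt does not state how two concept vectors are compared; the cosine used below is the standard choice in ESA and is an assumption here, as are the toy concept weights, which stand in for the home-made Wikipedia-derived vectors.

```python
from math import sqrt

def esa_vector(tokens, word_to_concepts):
    """Build a text vector by summing the (concept -> tf-idf weight)
    vectors of its words; word_to_concepts is a dict of dicts standing
    in for the Wikipedia-derived word vectors."""
    vec = {}
    for w in tokens:
        for concept, weight in word_to_concepts.get(w, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```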
