
UNIBA: Combining Distributional Semantic Models and Sense Disambiguation for Textual Similarity

Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro
Department of Computer Science, University of Bari Aldo Moro
Via E. Orabona 4 - 70125 Bari (Italy)
{firstname.surname}@uniba.it

Abstract

This paper describes the UNIBA team participation in the Cross-Level Semantic Similarity task at SemEval 2014. We propose to combine the output of different semantic similarity measures which exploit Word Sense Disambiguation and Distributional Semantic Models, among other lexical features. The integration of similarity measures is performed by means of two supervised methods based on Gaussian Process and Support Vector Machine. Our systems obtained very encouraging results, with the best one ranked 6th out of 38 submitted systems.

1 Introduction

Cross-Level Semantic Similarity (CLSS) is the task of computing the similarity between two text fragments of different sizes. The task focuses on the comparison between texts at different lexical levels, i.e. between a larger and a smaller text. The task comprises four different levels: 1) paragraph to sentence; 2) sentence to phrase; 3) phrase to word; 4) word to sense. The task objective is to provide a framework for evaluating general vs. level-specialized methods.

Our general approach consists in combining scores coming from different semantic similarity algorithms. The combination is performed by a supervised method using the training data provided by the task organizers. The data set comprises pairs of text fragments that can be rated with a score between 0 and 4, where 4 indicates the maximum level of similarity.

We select algorithms which provide similarities at different levels of semantics: surface (or string-based), lexical (word sense disambiguation), and distributional level. The idea is to combine in a unique system the semantic aspects that pertain to text fragments.

The following section gives more details about the similarity measures and their combination in a unique score through supervised methods (Section 2). Section 3 describes the system set up for the evaluation and comments on the reported results, while Section 4 concludes the paper.

2 System Description

The idea behind our system is to combine the output of several similarity measures/features by means of a supervised algorithm. Those features were grouped in three main categories. The following three sub-sections describe in detail each feature exploited by the system.

2.1 Distributional Level

Distributional Semantic Models (DSM) are an easy way for building geometrical spaces of concepts, also known as Semantic (or Word) Spaces, by skimming through huge corpora of text in order to learn the context of word usage. In the resulting space, semantic relatedness/similarity between two words is expressed by the closeness of the points that represent those words. Thus, the semantic similarity can be computed as the cosine of the angle between the two vectors that represent the words. This concept of similarity can be extended to whole sentences by combining words through vector addition (+), which corresponds to the point-wise sum of the vector components. Our DSM measure (DSM) is based on a SemanticSpace, represented by a co-occurrence matrix M, built by analysing the distribution of words in the British National Corpus (BNC). Then, M is reduced using Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997). Vector addition and cosine similarity are then used for building the vector representation of each text fragment and computing their pairwise similarity, respectively.

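For concreteness, the fragment-level similarity can be sketched in Java as follows. This is our illustration, not the system's actual code; the word vectors are assumed to be already loaded from the LSA-reduced SemanticSpace.

  import java.util.Map;

  public class DsmSimilarity {
      // Sum the vectors of all known words in the fragment (point-wise addition).
      static double[] fragmentVector(String[] words, Map<String, double[]> space, int dim) {
          double[] v = new double[dim];
          for (String w : words) {
              double[] wv = space.get(w);
              if (wv == null) continue; // skip out-of-vocabulary words
              for (int i = 0; i < dim; i++) v[i] += wv[i];
          }
          return v;
      }

      // Cosine of the angle between two fragment vectors.
      static double cosine(double[] a, double[] b) {
          double dot = 0, na = 0, nb = 0;
          for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              na += a[i] * a[i];
              nb += b[i] * b[i];
          }
          return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }
  }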
2.2 Lexical Semantics Level

Word Sense Disambiguation. Most of our measures rely on the output of a Word Sense Disambiguation (WSD) algorithm. Our newest approach to WSD, recently presented in Basile et al. (2014), is based on the simplified Lesk algorithm (Vasilescu et al., 2004). Each word w_i in a sequence w_1 w_2 ... w_n is disambiguated individually by choosing the sense that maximizes the similarity between the gloss and the context of w_i (i.e. the whole text where w_i occurs). To boost the overlap between the context and the gloss, this last is expanded with glosses of related meanings, following the approach described in Banerjee and Pedersen (2002). As sense inventory we choose BabelNet 1.1, a huge multilingual semantic network which comprises both WordNet and Wikipedia (Navigli and Ponzetto, 2012). The algorithm consists of the following steps:

1. Building the glosses. We retrieve all possible word meanings for the target word w_i that are listed in BabelNet. BabelNet mixes senses in WordNet and Wikipedia. First, senses in WordNet are searched for; if no sense is found (as often happens with named entities), senses for the target word are sought in Wikipedia. We preferred that strategy rather than retrieving senses from both sources at once because this last approach produced worse results when tuning the system. Once the set of senses S_i = {s_i1, s_i2, ..., s_ik} associated to the target word w_i has been retrieved, gloss expansion occurs. For each sense s_ij of w_i, the algorithm builds the sense extended gloss g*_ij by appending the glosses of meanings related to s_ij to its original gloss g_ij. The related meanings, with the exception of "antonym" senses, are the output of the BabelNet function "getRelatedMap". Moreover, each word in g*_ij is weighted by a function inversely proportional to the distance between s_ij and its related meaning. The distance d is computed as the number of edges linking the two senses in the graph. The function takes also into account the frequencies of the words in all the glosses, giving more emphasis to the most discriminative words; this can be considered as a variation of the inverse document frequency (idf) for retrieval that we named inverse gloss frequency (igf). The igf for a word w_k occurring gf*_k times in the set of extended glosses for all the senses in S_i, the sense inventory of w_i, is computed as IGF_k = 1 + log_2(|S_i| / gf*_k). The final weight for the word w_k appearing h times in the extended gloss g*_ij is given by:

    weight(w_k, g*_ij) = h × IGF_k × 1 / (1 + d)    (1)

2. Building the context. The context C for the word w_i is represented by all the words that occur in the text.

3. Building the vector representations. The context C and each extended gloss g*_ij are represented as vectors in the SemanticSpace built through the DSM described in Subsection 2.1.

4. Sense ranking. The algorithm computes the cosine similarity between the vector representation of each extended gloss g*_ij and that of the context C. Then, the cosine similarity is linearly combined with the probability p(s_ij | w_i), which takes into account the sense distribution of s_ij given the word w_i. The sense distribution is computed as the number of times the word w_i was tagged with the sense s_ij in SemCor, a collection of 352 documents manually annotated with WordNet synsets. The additive (Laplace) smoothing prevents zero probabilities, which can occur when some synsets do not appear in SemCor. The probability is computed as follows:

    p(s_ij | w_i) = (t(w_i, s_ij) + 1) / (#w_i + |S_i|)    (2)

The output of this step is a ranked list of synsets.

The WSD measure (WSD) is computed on top of the output of the last step. For each text fragment, we build a Bag-of-Synset (BoS) as the sum, over the whole text, of the weighted synsets associated with each word. Then, we compute the WSD similarity as the cosine similarity between the two BoS.
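The sense-ranking step (Equation 2 plus the linear combination with the gloss/context cosine) can be summarised in code. This is our simplified sketch, reusing the cosine from the previous sketch; gloss and context vectors and the SemCor counts are assumed precomputed, and the 0.5 combination factor anticipates the value reported in Section 3.

  import java.util.*;

  public class SenseRanking {
      // Equation 2: Laplace-smoothed sense distribution from SemCor counts.
      static double prior(int taggedCount, int wordCount, int inventorySize) {
          return (taggedCount + 1.0) / (wordCount + inventorySize);
      }

      // Rank the senses of a word: linear combination (factor 0.5, see Section 3)
      // of the gloss/context cosine similarity and the smoothed prior.
      static List<Map.Entry<String, Double>> rank(Map<String, double[]> glossVectors,
                                                  double[] contextVector,
                                                  Map<String, Integer> semcorCounts,
                                                  int wordCount) {
          int k = glossVectors.size(); // |S_i|, size of the sense inventory
          List<Map.Entry<String, Double>> ranked = new ArrayList<>();
          for (Map.Entry<String, double[]> e : glossVectors.entrySet()) {
              double cos = DsmSimilarity.cosine(e.getValue(), contextVector);
              double p = prior(semcorCounts.getOrDefault(e.getKey(), 0), wordCount, k);
              ranked.add(new AbstractMap.SimpleEntry<>(e.getKey(), 0.5 * cos + 0.5 * p));
          }
          ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
          return ranked;
      }
  }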

Graph. A sub-graph of BabelNet is built for each text fragment starting from the synsets provided by the WSD algorithm. For each word the synset with the highest score is selected, then this initial set is expanded with the related synsets in BabelNet. We apply the Personalized Page Rank (Haveliwala, 2002) to each sub-graph, where the synset scores computed by the WSD algorithm are exploited as prior probabilities (see the sketch at the end of this subsection). The weighted rank of synsets provided by Page Rank is used to build the BoS of the two text fragments, then the Personalized Page Rank measure (PPR) is computed as the cosine similarity between them.

Synset Distributional Space. Generally, similarity measures between synsets rely on the synset hierarchy in a lexical resource (e.g. WordNet). We define a new approach that is completely different, and represents synsets as points in a geometric space that we call SDS (Synset Distributional Space). SDS is generated taking into account the synset relationships, and similarity is defined as the synsets' closeness in the space. We build a symmetric matrix S which contains synsets on both rows and columns. Each cell in the matrix is set to one if a semantic relation exists between the corresponding synsets. The relationships are extracted from BabelNet, limiting synsets to those occurring also in WordNet, while synsets coming from Wikipedia are removed to reduce the size of S. The method for building the matrix S relies on Reflective Random Indexing (RRI) (Cohen et al., 2010), a variation of the Random Indexing technique for matrix reduction (Kanerva, 1988). RRI retains the advantages of RI, which incrementally builds a reduced space where the distance between points is nearly preserved. Moreover, cyclical training, i.e. the retraining of a new space exploiting the RI output as basis vectors, makes indirect inference emerge. Two different similarity measures can be defined by exploiting this space for representing synsets: WSD-SDS and PPR-SDS, based on WSD and PPR respectively. Each BoS is represented as the sum of the synset vectors in the SDS space. Then, the similarity is computed as the cosine similarity between the two vector representations.
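The Personalized Page Rank step itself reduces to a power iteration whose teleport distribution is the normalized vector of WSD scores rather than the uniform one. The following self-contained sketch is our illustration (the actual system relies on the JUNG library); the damping factor 0.85 and the 50-iteration cap are the values reported in Section 3.

  public class PersonalizedPageRank {
      // graph[i] lists the neighbours of node i; prior is the normalized
      // vector of WSD scores used as teleport distribution.
      static double[] run(int[][] graph, double[] prior, double damping, int maxIter) {
          int n = graph.length;
          double[] rank = prior.clone();
          for (int iter = 0; iter < maxIter; iter++) {
              double[] next = new double[n];
              for (int i = 0; i < n; i++) {
                  if (graph[i].length == 0) continue; // dangling nodes lose mass in this simplified version
                  double share = rank[i] / graph[i].length; // spread rank over neighbours
                  for (int j : graph[i]) next[j] += share;
              }
              // teleport back to the prior with probability (1 - damping)
              for (int i = 0; i < n; i++) next[i] = (1 - damping) * prior[i] + damping * next[i];
              rank = next;
          }
          return rank;
      }
  }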
2.3 Surface Level

At the surface level, we compute the following features (a small sketch of the n-gram features follows the list):

EDIT The edit, or Levenshtein, distance between the two texts;

MCS The most common subsequence between the two texts;

2-gram, 3-gram For each text fragment, we build the Bag-of-n-gram (with n varying in {2, 3}); then we compute the cosine similarity between the two Bag-of-n-gram representations;

BOW For each tokenized text fragment, we build its Bag-of-Word, and then compute the cosine similarity between the two BoW;

L1 The length in characters of the first text fragment;

L2 The length in characters of the second text fragment;

DIFF The difference between L1 and L2.
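As an illustration of the 2-gram/3-gram features (and, with n = 1, the BOW feature), the following sketch builds a Bag-of-n-gram and compares two bags with the cosine; the code and names are ours.

  import java.util.*;

  public class NgramFeatures {
      // Build a Bag-of-n-gram: counts of contiguous token n-grams.
      static Map<String, Integer> bagOfNgrams(String[] tokens, int n) {
          Map<String, Integer> bag = new HashMap<>();
          for (int i = 0; i + n <= tokens.length; i++) {
              String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
              bag.merge(gram, 1, Integer::sum);
          }
          return bag;
      }

      // Cosine similarity between two bags (sparse vectors over the n-gram space).
      static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
          double dot = 0, na = 0, nb = 0;
          for (Map.Entry<String, Integer> e : a.entrySet()) {
              na += e.getValue() * (double) e.getValue();
              Integer other = b.get(e.getKey());
              if (other != null) dot += e.getValue() * (double) other;
          }
          for (int v : b.values()) nb += v * (double) v;
          return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }
  }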

2.4 Word to Sense

The word to sense level is different from the other ones: in this case the similarity is computed between a word and a particular word meaning. Since a word meaning is not a text fragment, this level poses a new challenge with respect to the classical text similarity task. In this case we decided to consider the word on its own as the first text fragment, while for the second text fragment we build a dummy text using the BabelNet gloss assigned to the word sense. In that way, the distributional and the lexical measures (WSD, Graph, and DSM) can be applied to both fragments. Table 1 recaps the features used for each task.

3 Evaluation

Dataset Description. The SemEval-2014 Task 3 Cross-Level Semantic Similarity is designed for evaluating systems on their ability to capture the semantic similarity between lexical items of different length (Jurgens et al., 2014). To this extent, the organizers provide four different levels of comparison which correspond to four different datasets: 1) Paragraph to Sentence (Par2Sent); 2) Sentence to Phrase (Sent2Ph); 3) Phrase to Word (Ph2W); and 4) Word to Sense (W2Sense). For each dataset, the organizers released trial, training and test data. While the trial includes a few examples (approximately 40), both training and test data comprise 500 pairs of text fragments.

Run        Par2Sent  Sent2Ph  Ph2W  W2Sense  Official Rank  Spearman Rank
bestTrain  .861      .793     .555  .420     -              -
LCS        .527      .562     .165  .109     -              -
run1       .769      .729     .229  .165     7              10
run2       .784      .734     .255  .180     6              8
run3       .769      .729     .229  .165     8              11

Table 2: Task results.

Each pair is associated with a human-assigned similarity score, which varies from 4 (very similar) to 0 (unrelated). The organizers provide the normalized Longest Common Substring (LCS) as baseline. The evaluation is performed through the Pearson (official rank) and the Spearman's rank correlation.

System setup. We developed our system in JAVA relying on the following resources:

• Stanford CoreNLP to pre-process the text: tokenization, lemmatization and PoS-tagging are applied to the two text fragments;

• BabelNet 1.1 as knowledge-base in the WSD algorithm;

• the JAVA JUNG library for Personalized Page Rank;

• the British National Corpus (tokenized text with stopword removal) and SVDLIB to build the SemanticSpace described in Subsection 2.1;

• a proprietary implementation of Reflective Random Indexing to build the distributional space based on synsets (SDS) extracted from BabelNet (we used two cycles of retraining);

• Weka for the supervised approach.

Feature   Par2Sent, Sent2Ph   Ph2W, W2Sense
DSM       √                   √
WSD       √                   √
PPR       √                   √
WSD-SDS   √                   √
PPR-SDS   √                   √
EDIT      √                   -
MCS       √                   -
2-gram    √                   -
3-gram    √                   -
BOW       √                   -
L1        √                   -
L2        √                   -
DIFF      √                   -

Table 1: Features per task.

After a tuning step using both training and trial data provided by the organizers, we selected three different supervised systems: Gaussian Process with Puk kernel (run1), Gaussian Process with RBF kernel (run2), and Support Vector Machine Regression with Puk kernel (run3). All the systems are implemented with the default parameters set by Weka, and we trained a different model on each dataset. The DSM is built using the 100,000 most frequent terms in the BNC, while the co-occurrences are computed on a window size of 5 words. The vector dimension is set to 400; the same value is adopted for building the SDS, where the number of seeds (non-zero components) generated in the random vectors is set to 10, with one step of retraining. The total number of synset vectors in the SDS is 576,736. In the WSD algorithm, we exploited the whole sentence as context. The linear combination between the cosine similarity and the probability p(s_ij | w_i) is performed with a factor of 0.5. The distance for expanding a synset with its related meanings is set to one. The same depth is used for building the graph in the PPR method, where we fixed the maximum number of iterations to 50 and the damping factor to 0.85.
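For concreteness, the supervised combination can be set up in Weka along the following lines. This is a sketch under the assumption of a standard Weka 3.x API; the ARFF file names and the surrounding scaffolding are ours, not the system's actual code.

  import weka.classifiers.functions.GaussianProcesses;
  import weka.classifiers.functions.supportVector.Puk;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class SupervisedCombination {
      public static void main(String[] args) throws Exception {
          // Each instance holds the similarity features of one pair;
          // the class attribute is the human-assigned score (0-4).
          Instances train = new DataSource("par2sent-train.arff").getDataSet();
          train.setClassIndex(train.numAttributes() - 1);

          GaussianProcesses gp = new GaussianProcesses(); // run1-style model
          gp.setKernel(new Puk());                        // default Weka parameters otherwise
          gp.buildClassifier(train);

          Instances test = new DataSource("par2sent-test.arff").getDataSet();
          test.setClassIndex(test.numAttributes() - 1);
          for (int i = 0; i < test.numInstances(); i++) {
              double predicted = gp.classifyInstance(test.instance(i));
              System.out.println(predicted);
          }
      }
  }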

Results. The results of our three systems for each similarity level are reported in Table 2, together with the baseline provided by the organizers (LCS). Our three systems always outperform the LCS baseline. Table 2 also shows the best results (bestTrain) obtained on the training data by a Gaussian Process with Puk kernel and a 10-fold cross-validation. Support Vector Machine and Gaussian Process with Puk kernel, run3 and run1 respectively, produce the same results. Comparing these figures with those obtained on the training data (bestTrain), we can observe that the Puk kernel tends to over-fit the training data, while the RBF kernel seems to be less sensitive to this problem.

We also analysed the performance of each measure on its own; the results in Table 3 are obtained by the best supervised system trained with default parameters on each feature individually. WSD obtains the best results in the first two levels, while the DSM is the best method in the last two ones. This behaviour can be ascribed to the size of the two text fragments: in large text fragments the WSD algorithm can rely on wider contexts to obtain good performance, while in short texts the information about the context is poor. At this level, the measure based on the Personalized Page Rank obtains the worst results; however, we noticed that the ablation of no feature causes a drop in performance of the supervised systems.

Table 3: Individual measures for each task.

After the submission deadline, we noticed that PoS-tagging sometimes produced wrong results on small texts. This incorrect behaviour negatively influenced the correct retrieval of synsets from BabelNet. Thus, we decided to exclude PoS-tagging for text fragments with less than three words; in such a case, all the synsets for the given word were retrieved. Making this adjustment, we were able to obtain improvements for the Ph2W and W2Sense tasks. The results of the submitted runs after this fix are reported in Table 4.

Run   Ph2W  ∆%      W2Sense  ∆%
run1  .263  +14.85  .242     +46.67
run2  .257  +0.78   .237     +31.67
run3  .263  +14.85  .242     +46.66

Table 4: Results after PoS-tagging removal for short texts (< 3 words).

4 Conclusions

We have reported the results of our participation in the cross-level semantic similarity task of SemEval-2014. Our systems combine different similarity measures based on string-matching, word sense disambiguation and distributional semantic models. Our best system ranked 6th out of 38 participants in the task with respect to the Pearson correlation, while it ranked 8th when the Spearman correlation was used. These results suggest that our methods are robust with respect to the evaluation measures.

Acknowledgments

This work fulfils the research objectives of the PON 02_00563_3470993 project "VINCENTE - A Virtual collective INtelligenCe ENvironment to develop sustainable Technology Entrepreneurship ecosystems" funded by the Italian Ministry of University and Research (MIUR).

References

Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Computational Linguistics and Intelligent Text Processing, pages 136–145. Springer.

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. In Proceedings of COLING 2014, Dublin, Ireland, August. (http://www.coling-2014.org/accepted-papers.php, in press).

Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.

Taher H. Haveliwala. 2002. Topic-sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web, WWW '02, pages 517–526, New York, NY, USA. ACM.

David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 Task 3: Cross-level semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), Dublin, Ireland, August 23–24.

Pentti Kanerva. 1988. Sparse Distributed Memory. MIT Press.

Thomas K. Landauer and Susan T. Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211–240.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Florentina Vasilescu, Philippe Langlais, and Guy Lapalme. 2004. Evaluating variants of the Lesk approach for disambiguating words. In Proceedings of the Conference on Language Resources and Evaluation (LREC), pages 633–636.
