
UNIBA: Combining Distributional Semantic Models and Sense Disambiguation for Textual Similarity

Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro
Department of Computer Science, University of Bari Aldo Moro
Via E. Orabona 4 - 70125 Bari (Italy)
{firstname.surname}@uniba.it

Abstract

This paper describes the UNIBA team participation in the Cross-Level Semantic Similarity task at SemEval 2014. We propose to combine the output of different semantic similarity measures which exploit Word Sense Disambiguation and Distributional Semantic Models, among other lexical features. The integration of similarity measures is performed by means of two supervised methods based on Gaussian Process and Support Vector Machine. Our systems obtained very encouraging results, with the best one ranked 6th out of 38 submitted systems.

1 Introduction

Cross-Level Semantic Similarity (CLSS) is the task of computing the similarity between two text fragments of different sizes. The task focuses on the comparison between texts at different lexical levels, i.e. between a larger and a smaller text. The task comprises four different levels: 1) paragraph to sentence; 2) sentence to phrase; 3) phrase to word; 4) word to sense. The task objective is to provide a framework for evaluating general vs. level-specialized methods.

Our general approach consists in combining scores coming from different semantic similarity algorithms. The combination is performed by a supervised method using the training data provided by the task organizers. The data set comprises pairs of text fragments that can be rated with a score between 0 and 4, where 4 indicates the maximum level of similarity.

We select algorithms which provide similarities at different levels of semantics: surface (or string-based), lexical (word sense disambiguation), and distributional level. The idea is to combine in a unique system the semantic aspects that pertain to text fragments.

The following section gives more details about the similarity measures and their combination in a unique score through supervised methods (Section 2). Section 3 describes the system set up for the evaluation and comments on the reported results, while Section 4 concludes the paper.

2 System Description

The idea behind our system is to combine the output of several similarity measures/features by means of a supervised algorithm. Those features were grouped in three main categories. The following three sub-sections describe in detail each feature exploited by the system.

2.1 Distributional Level

Distributional Semantic Models (DSM) are an easy way for building geometrical spaces of concepts, also known as Semantic (or Word) Spaces, by skimming through huge corpora of text in order to learn the context of word usage. In the resulting space, semantic relatedness/similarity between two words is expressed by the closeness of the points that represent those words. Thus, the semantic similarity can be computed as the cosine of the angle between the two vectors that represent the words. This concept of similarity can be extended to whole sentences by combining words through vector addition (+), which corresponds to the point-wise sum of the vector components. Our DSM measure (DSM) is based on a SemanticSpace, represented by a co-occurrence matrix M, built by analysing the distribution of words in the British National Corpus (BNC). Then, M is reduced using Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997). Vector addition and cosine similarity are then used for building the vector representation of each text fragment and computing their pairwise similarity, respectively.

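For concreteness, the fragment-level similarity can be sketched in Java as follows. This is our illustration, not the system's actual code; the word vectors are assumed to be already loaded from the LSA-reduced SemanticSpace.

  import java.util.Map;

  public class DsmSimilarity {
      // Sum the vectors of all known words in the fragment (point-wise addition).
      static double[] fragmentVector(String[] words, Map<String, double[]> space, int dim) {
          double[] v = new double[dim];
          for (String w : words) {
              double[] wv = space.get(w);
              if (wv == null) continue; // skip out-of-vocabulary words
              for (int i = 0; i < dim; i++) v[i] += wv[i];
          }
          return v;
      }

      // Cosine of the angle between two fragment vectors.
      static double cosine(double[] a, double[] b) {
          double dot = 0, na = 0, nb = 0;
          for (int i = 0; i < a.length; i++) {
              dot += a[i] * b[i];
              na += a[i] * a[i];
              nb += b[i] * b[i];
          }
          return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }
  }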
2.2 Lexical Semantics Level

Word Sense Disambiguation. Most of our measures rely on the output of a Word Sense Disambiguation (WSD) algorithm. Our newest approach to WSD, recently presented in Basile et al. (2014), is based on the simplified Lesk algorithm (Vasilescu et al., 2004). Each word w_i in a sequence w_1 w_2 ... w_n is disambiguated individually by choosing the sense that maximizes the similarity between the gloss and the context of w_i (i.e. the whole text where w_i occurs). To boost the overlap between the context and the gloss, this last is expanded with glosses of related meanings, following the approach described in Banerjee and Pedersen (2002). As sense inventory we choose BabelNet 1.1, a huge multilingual semantic network which comprises both WordNet and Wikipedia (Navigli and Ponzetto, 2012). The algorithm consists of the following steps:

1. Building the glosses. We retrieve all possible word meanings for the target word w_i that are listed in BabelNet. BabelNet mixes senses in WordNet and Wikipedia. First, senses in WordNet are searched for; if no sense is found (as often happens with named entities), senses for the target word are sought in Wikipedia. We preferred that strategy rather than retrieving senses from both sources at once because this last approach produced worse results when tuning the system. Once the set of senses S_i = {s_i1, s_i2, ..., s_ik} associated to the target word w_i has been retrieved, gloss expansion occurs. For each sense s_ij of w_i, the algorithm builds the sense extended gloss g*_ij by appending the glosses of meanings related to s_ij to its original gloss g_ij. The related meanings, with the exception of "antonym" senses, are the output of the BabelNet function "getRelatedMap". Moreover, each word in g*_ij is weighted by a function inversely proportional to the distance between s_ij and its related meaning. The distance d is computed as the number of edges linking the two senses in the graph. The function takes also into account the frequencies of the words in all the glosses, giving more emphasis to the most discriminative words; this can be considered as a variation of the inverse document frequency (idf) for retrieval that we named inverse gloss frequency (igf). The igf for a word w_k occurring gf*_k times in the set of extended glosses for all the senses in S_i, the sense inventory of w_i, is computed as IGF_k = 1 + log_2(|S_i| / gf*_k). The final weight for the word w_k appearing h times in the extended gloss g*_ij is given by:

    weight(w_k, g*_ij) = h × IGF_k × 1 / (1 + d)    (1)

2. Building the context. The context C for the word w_i is represented by all the words that occur in the text.

3. Building the vector representations. The context C and each extended gloss g*_ij are represented as vectors in the SemanticSpace built through the DSM described in Subsection 2.1.

4. Sense ranking. The algorithm computes the cosine similarity between the vector representation of each extended gloss g*_ij and that of the context C. Then, the cosine similarity is linearly combined with the probability p(s_ij | w_i), which takes into account the sense distribution of s_ij given the word w_i. The sense distribution is computed as the number of times the word w_i was tagged with the sense s_ij in SemCor, a collection of 352 documents manually annotated with WordNet synsets. The additive (Laplace) smoothing prevents zero probabilities, which can occur when some synsets do not appear in SemCor. The probability is computed as follows:

    p(s_ij | w_i) = (t(w_i, s_ij) + 1) / (#w_i + |S_i|)    (2)

The output of this step is a ranked list of synsets.

The WSD measure (WSD) is computed on top of the output of the last step. For each text fragment, we build a Bag-of-Synset (BoS) as the sum, over the whole text, of the weighted synsets associated with each word. Then, we compute the WSD similarity as the cosine similarity between the two BoS.
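The sense-ranking step (Equation 2 plus the linear combination with the gloss/context cosine) can be summarised in code. This is our simplified sketch, reusing the cosine from the previous sketch; gloss and context vectors and the SemCor counts are assumed precomputed, and the 0.5 combination factor anticipates the value reported in Section 3.

  import java.util.*;

  public class SenseRanking {
      // Equation 2: Laplace-smoothed sense distribution from SemCor counts.
      static double prior(int taggedCount, int wordCount, int inventorySize) {
          return (taggedCount + 1.0) / (wordCount + inventorySize);
      }

      // Rank the senses of a word: linear combination (factor 0.5, see Section 3)
      // of the gloss/context cosine similarity and the smoothed prior.
      static List<Map.Entry<String, Double>> rank(Map<String, double[]> glossVectors,
                                                  double[] contextVector,
                                                  Map<String, Integer> semcorCounts,
                                                  int wordCount) {
          int k = glossVectors.size(); // |S_i|, size of the sense inventory
          List<Map.Entry<String, Double>> ranked = new ArrayList<>();
          for (Map.Entry<String, double[]> e : glossVectors.entrySet()) {
              double cos = DsmSimilarity.cosine(e.getValue(), contextVector);
              double p = prior(semcorCounts.getOrDefault(e.getKey(), 0), wordCount, k);
              ranked.add(new AbstractMap.SimpleEntry<>(e.getKey(), 0.5 * cos + 0.5 * p));
          }
          ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
          return ranked;
      }
  }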

Graph. A sub-graph of BabelNet is built for each text fragment starting from the synsets provided by the WSD algorithm. For each word the synset with the highest score is selected, then this initial set is expanded with the related synsets in BabelNet. We apply the Personalized Page Rank (Haveliwala, 2002) to each sub-graph, where the synset scores computed by the WSD algorithm are exploited as prior probabilities (see the sketch at the end of this subsection). The weighted rank of synsets provided by Page Rank is used to build the BoS of the two text fragments, then the Personalized Page Rank measure (PPR) is computed as the cosine similarity between them.

Synset Distributional Space. Generally, similarity measures between synsets rely on the synset hierarchy in a lexical resource (e.g. WordNet). We define a new approach that is completely different, and represents synsets as points in a geometric space that we call SDS (Synset Distributional Space). SDS is generated taking into account the synset relationships, and similarity is defined as the synsets' closeness in the space. We build a symmetric matrix S which contains synsets on both rows and columns. Each cell in the matrix is set to one if a semantic relation exists between the corresponding synsets. The relationships are extracted from BabelNet, limiting synsets to those occurring also in WordNet, while synsets coming from Wikipedia are removed to reduce the size of S. The method for building the matrix S relies on Reflective Random Indexing (RRI) (Cohen et al., 2010), a variation of the Random Indexing technique for matrix reduction (Kanerva, 1988). RRI retains the advantages of RI, which incrementally builds a reduced space where the distance between points is nearly preserved. Moreover, cyclical training, i.e. the retraining of a new space exploiting the RI output as basis vectors, makes indirect inference emerge. Two different similarity measures can be defined by exploiting this space for representing synsets: WSD-SDS and PPR-SDS, based on WSD and PPR respectively. Each BoS is represented as the sum of the synset vectors in the SDS space. Then, the similarity is computed as the cosine similarity between the two vector representations.
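The Personalized Page Rank step itself reduces to a power iteration whose teleport distribution is the normalized vector of WSD scores rather than the uniform one. The following self-contained sketch is our illustration (the actual system relies on the JUNG library); the damping factor 0.85 and the 50-iteration cap are the values reported in Section 3.

  public class PersonalizedPageRank {
      // graph[i] lists the neighbours of node i; prior is the normalized
      // vector of WSD scores used as teleport distribution.
      static double[] run(int[][] graph, double[] prior, double damping, int maxIter) {
          int n = graph.length;
          double[] rank = prior.clone();
          for (int iter = 0; iter < maxIter; iter++) {
              double[] next = new double[n];
              for (int i = 0; i < n; i++) {
                  if (graph[i].length == 0) continue; // dangling nodes lose mass in this simplified version
                  double share = rank[i] / graph[i].length; // spread rank over neighbours
                  for (int j : graph[i]) next[j] += share;
              }
              // teleport back to the prior with probability (1 - damping)
              for (int i = 0; i < n; i++) next[i] = (1 - damping) * prior[i] + damping * next[i];
              rank = next;
          }
          return rank;
      }
  }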
2.3 Surface Level

At the surface level, we compute the following features (a small sketch of the n-gram features follows the list):

EDIT The edit, or Levenshtein, distance between the two texts;

MCS The most common subsequence between the two texts;

2-gram, 3-gram For each text fragment, we build the Bag-of-n-gram (with n varying in {2, 3}); then we compute the cosine similarity between the two Bag-of-n-gram representations;

BOW For each tokenized text fragment, we build its Bag-of-Word, and then compute the cosine similarity between the two BoW;

L1 The length in characters of the first text fragment;

L2 The length in characters of the second text fragment;

DIFF The difference between L1 and L2.
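As an illustration of the 2-gram/3-gram features (and, with n = 1, the BOW feature), the following sketch builds a Bag-of-n-gram and compares two bags with the cosine; the code and names are ours.

  import java.util.*;

  public class NgramFeatures {
      // Build a Bag-of-n-gram: counts of contiguous token n-grams.
      static Map<String, Integer> bagOfNgrams(String[] tokens, int n) {
          Map<String, Integer> bag = new HashMap<>();
          for (int i = 0; i + n <= tokens.length; i++) {
              String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
              bag.merge(gram, 1, Integer::sum);
          }
          return bag;
      }

      // Cosine similarity between two bags (sparse vectors over the n-gram space).
      static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
          double dot = 0, na = 0, nb = 0;
          for (Map.Entry<String, Integer> e : a.entrySet()) {
              na += e.getValue() * (double) e.getValue();
              Integer other = b.get(e.getKey());
              if (other != null) dot += e.getValue() * (double) other;
          }
          for (int v : b.values()) nb += v * (double) v;
          return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
      }
  }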

2.4 Word to Sense

The word to sense level is different from the other ones: in this case the similarity is computed between a word and a particular word meaning. Since a word meaning is not a text fragment, this level poses a new challenge with respect to the classical text similarity task. In this case we decided to consider the word on its own as the first text fragment, while for the second text fragment we build a dummy text using the BabelNet gloss assigned to the word sense. In that way, the distributional and the lexical measures (WSD, Graph, and DSM) can be applied to both fragments. Table 1 recaps the features used for each task.

3 Evaluation

Dataset Description. The SemEval-2014 Task 3 Cross-Level Semantic Similarity is designed for evaluating systems on their ability to capture the semantic similarity between lexical items of different length (Jurgens et al., 2014). To this extent, the organizers provide four different levels of comparison which correspond to four different datasets: 1) Paragraph to Sentence (Par2Sent); 2) Sentence to Phrase (Sent2Ph); 3) Phrase to Word (Ph2W); and 4) Word to Sense (W2Sense). For each dataset, the organizers released trial, training and test data. While the trial includes a few examples (approximately 40), both training and test data comprise 500 pairs of text fragments.

Run        Par2Sent  Sent2Ph  Ph2W  W2Sense  Official Rank  Spearman Rank
bestTrain  .861      .793     .555  .420     -              -
LCS        .527      .562     .165  .109     -              -
run1       .769      .729     .229  .165     7              10
run2       .784      .734     .255  .180     6              8
run3       .769      .729     .229  .165     8              11

Table 2: Task results.

Each pair is associated with a human-assigned similarity score, which varies from 4 (very similar) to 0 (unrelated). The organizers provide the normalized Longest Common Substring (LCS) as baseline. The evaluation is performed through the Pearson (official rank) and the Spearman's rank correlation.

System setup. We developed our system in JAVA relying on the following resources:

• Stanford CoreNLP to pre-process the text: tokenization, lemmatization and PoS-tagging are applied to the two text fragments;

• BabelNet 1.1 as knowledge-base in the WSD algorithm;

• the JAVA JUNG library for Personalized Page Rank;

• the British National Corpus (tokenized text with stopword removal) and SVDLIB to build the SemanticSpace described in Subsection 2.1;

• a proprietary implementation of Reflective Random Indexing to build the distributional space based on synsets (SDS) extracted from BabelNet (we used two cycles of retraining);

• Weka for the supervised approach.

Feature   Par2Sent, Sent2Ph   Ph2W, W2Sense
DSM       √                   √
WSD       √                   √
PPR       √                   √
WSD-SDS   √                   √
PPR-SDS   √                   √
EDIT      √                   -
MCS       √                   -
2-gram    √                   -
3-gram    √                   -
BOW       √                   -
L1        √                   -
L2        √                   -
DIFF      √                   -

Table 1: Features per task.

After a tuning step using both training and trial data provided by the organizers, we selected three different supervised systems: Gaussian Process with Puk kernel (run1), Gaussian Process with RBF kernel (run2), and Support Vector Machine Regression with Puk kernel (run3). All the systems are implemented with the default parameters set by Weka, and we trained a different model on each dataset. The DSM is built using the 100,000 most frequent terms in the BNC, while the co-occurrences are computed on a window size of 5 words. The vector dimension is set to 400; the same value is adopted for building the SDS, where the number of seeds (non-zero components) generated in the random vectors is set to 10, with one step of retraining. The total number of synset vectors in the SDS is 576,736. In the WSD algorithm, we exploited the whole sentence as context. The linear combination between the cosine similarity and the probability p(s_ij | w_i) is performed with a factor of 0.5. The distance for expanding a synset with its related meanings is set to one. The same depth is used for building the graph in the PPR method, where we fixed the maximum number of iterations to 50 and the damping factor to 0.85.
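For concreteness, the supervised combination can be set up in Weka along the following lines. This is a sketch under the assumption of a standard Weka 3.x API; the ARFF file names and the surrounding scaffolding are ours, not the system's actual code.

  import weka.classifiers.functions.GaussianProcesses;
  import weka.classifiers.functions.supportVector.Puk;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class SupervisedCombination {
      public static void main(String[] args) throws Exception {
          // Each instance holds the similarity features of one pair;
          // the class attribute is the human-assigned score (0-4).
          Instances train = new DataSource("par2sent-train.arff").getDataSet();
          train.setClassIndex(train.numAttributes() - 1);

          GaussianProcesses gp = new GaussianProcesses(); // run1-style model
          gp.setKernel(new Puk());                        // default Weka parameters otherwise
          gp.buildClassifier(train);

          Instances test = new DataSource("par2sent-test.arff").getDataSet();
          test.setClassIndex(test.numAttributes() - 1);
          for (int i = 0; i < test.numInstances(); i++) {
              double predicted = gp.classifyInstance(test.instance(i));
              System.out.println(predicted);
          }
      }
  }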

Results. The results of our three systems for each similarity level are reported in Table 2, together with the baseline provided by the organizers (LCS). Our three systems always outperform the LCS baseline. Table 2 also shows the best results (bestTrain) obtained on the training data by a Gaussian Process with Puk kernel and a 10-fold cross-validation. Support Vector Machine and Gaussian Process with Puk kernel, run3 and run1 respectively, produce the same results. Comparing these figures with those obtained on the training data (bestTrain), we can observe that the Puk kernel tends to over-fit the training data, while the RBF kernel seems to be less sensitive to this problem.

We also analysed the performance of each measure on its own; the results in Table 3 are obtained by the best supervised system trained with default parameters on each feature individually. WSD obtains the best results in the first two levels, while the DSM is the best method in the last two ones. This behaviour can be ascribed to the size of the two text fragments: in large text fragments the WSD algorithm can rely on wider contexts to obtain good performance, while in short texts the information about the context is poor. At this level, the measure based on the Personalized Page Rank obtains the worst results; however, we noticed that the ablation of no feature causes a drop in performance of the supervised systems.

Table 3: Individual measures for each task.

After the submission deadline, we noticed that PoS-tagging sometimes produced wrong results on small texts. This incorrect behaviour negatively influenced the correct retrieval of synsets from BabelNet. Thus, we decided to exclude PoS-tagging for text fragments with less than three words; in such a case, all the synsets for the given word were retrieved. Making this adjustment, we were able to obtain improvements for the Ph2W and W2Sense tasks. The results of the submitted runs after this fix are reported in Table 4.

Run   Ph2W  ∆%      W2Sense  ∆%
run1  .263  +14.85  .242     +46.67
run2  .257  +0.78   .237     +31.67
run3  .263  +14.85  .242     +46.66

Table 4: Results after PoS-tagging removal for short texts (< 3 words).

4 Conclusions

We have reported the results of our participation in the cross-level semantic similarity task of SemEval-2014. Our systems combine different similarity measures based on string-matching, word sense disambiguation and distributional semantic models. Our best system ranked 6th out of 38 participants in the task with respect to the Pearson correlation, while it ranked 8th when the Spearman correlation was used. These results suggest that our methods are robust with respect to the evaluation measures.

Acknowledgments

This work fulfils the research objectives of the PON 02_00563_3470993 project "VINCENTE - A Virtual collective INtelligenCe ENvironment to develop sustainable Technology Entrepreneurship ecosystems" funded by the Italian Ministry of University and Research (MIUR).

References

Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Computational Linguistics and Intelligent Text Processing, pages 136–145. Springer.

Pierpaolo Basile, Annalina Caputo, and Giovanni Semeraro. 2014. An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model. In Proceedings of COLING 2014, Dublin, Ireland, August. (http://www.coling-2014.org/accepted-papers.php, in press).

Trevor Cohen, Roger Schvaneveldt, and Dominic Widdows. 2010. Reflective random indexing and indirect inference: A scalable method for discovery of implicit connections. Journal of Biomedical Informatics, 43(2):240–256.

Taher H. Haveliwala. 2002. Topic-sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web, WWW '02, pages 517–526, New York, NY, USA. ACM.

David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval-2014 Task 3: Cross-level semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), Dublin, Ireland, August 23–24.

Pentti Kanerva. 1988. Sparse Distributed Memory. MIT Press.

Thomas K. Landauer and Susan T. Dumais. 1997. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211–240.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Florentina Vasilescu, Philippe Langlais, and Guy Lapalme. 2004. Evaluating variants of the Lesk approach for disambiguating words. In Proceedings of the Conference on Language Resources and Evaluation (LREC), pages 633–636.
