

Short texts semantic similarity based on embeddings

Karlo Babić, Sanda Martinčić-Ipšić, Ana Meštrović
University of Rijeka, Department of Informatics
Radmile Matejčić 2, 51000 Rijeka
{kbabic, smarti, amestrovic}@domain.com

Francesco Guerra
University of Modena and Reggio Emilia, Department of Engineering Enzo Ferrari
via Vivarelli 10, 41125 Modena
[email protected]

Abstract. Evaluating semantic similarity of texts is a task that assumes paramount importance in real-world applications. In this paper, we describe some experiments we carried out to evaluate the performance of different forms of word embeddings and their aggregations in the task of measuring the similarity of short texts. In particular, we explore the results obtained with two publicly available sets of pre-trained word embeddings (one based on word2vec, trained on a specific dataset, and the second extending it with embeddings of word senses). We test five approaches for aggregating words into texts. Two approaches are based on centroids and summarize a text as a single embedding. The other approaches are variations of the Okapi BM25 function and directly provide a measure of the similarity of two texts.

Keywords. semantic similarity, short texts similarity, word embeddings, word2vec, NLP

1 Introduction

Measuring semantic similarity of texts has an important role in various tasks from the field of natural language processing (NLP), such as information retrieval, document classification, word sense disambiguation, plagiarism detection, machine translation, text summarization, etc. A more specific task, measuring semantic similarity of short texts, is of great importance in applications such as opinion mining and news recommendation in the domain of social media (De Boom et al., 2016).

A large number of approaches have been developed for addressing this problem. Some of these approaches typically model a short text as an aggregate of words and apply specific metrics to compute the similarity of these aggregations. Most of the existing techniques represent text as a weighted set of words (e.g., bag of words), where the order of the words in the text (i.e. the context) and the possible meanings associated with the words are not taken into account. Recently, neural networks have been adopted for building word embeddings, thus providing a real breakthrough in this field.

Word embeddings represent a corpus-based distributional semantic model which describes the context in which a word is expected to appear. There is a variety of representation models based on word embeddings (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017; Peters et al., 2018). Within these models, the context of the word is taken into account in the process of defining the embeddings, and the accuracy of the applications using them is typically improved. However, word embeddings have certain limitations: for example, they cannot capture more than one meaning per word (polysemy). Furthermore, a large number of lexical knowledge bases have been developed in the last years. The knowledge they convey could be exploited for creating embeddings that better represent the meaning of the words they stand for. For this reason, other techniques have been proposed to extend the aforementioned approaches with embeddings of words and of the meanings associated with words. This is the case of the NASARI dataset (Camacho-Collados et al., 2016), which integrates pre-trained word embeddings based on the word2vec model with word sense embeddings obtained from BabelNet. BabelNet is a multilingual dictionary which contains synsets (Navigli and Ponzetto, 2012). It merges WordNet with other lexical and encyclopedic resources such as Wikipedia and Wiktionary.

Representing the words in short texts with (semantic) embeddings is only the first step toward capturing their meaning and measuring their similarities. Identifying the meaning of short texts is another challenging task due to the complexities of semantic interactions among words. More precisely, word embeddings can model the semantics of one word, but scaling from words to texts is not a straightforward process. A large number of techniques have been proposed and there is no consensus in the community on how to proceed. The simplest approaches suggest taking the sum or the average (centroid) of the individual word embeddings for all words in the text. These approaches have been widely adopted in many experiments, for example (Brokos et al., 2016; Rossiello et al., 2017; Sinoara et al., 2019), and in general they perform well. However, by calculating only the sum or centroid of a set of word embeddings, we lose a certain part of the semantic information, so this may not be an optimal approach.


There are other possible approaches to generating text embeddings based on word embeddings, for example in (Kenter and De Rijke, 2015; Kusner et al., 2015).

SemTexT is a project that involves the University of Rijeka and the University of Modena and Reggio Emilia with the aim of studying and developing semantic techniques for measuring the similarity of short texts. As one of the first actions in the project, the idea is to evaluate the performance of some of the existing techniques for representing short texts and measuring their similarities. In this paper, we describe our preliminary experiments where we evaluate how five similarity measures perform with respect to human judgment. Two word representation models have been evaluated: one is a typical word2vec model; the second representation model is built on the NASARI set, which includes word sense descriptions.

In short, in this paper, we address three main issues related to the task of measuring semantic similarity: (a) how to represent the words, (b) how to aggregate word representations for modeling short texts, and (c) how to measure the similarity between aggregations. To resolve (a) we apply two existing representation models based on word embeddings; for (b) and (c) we test five methods that aggregate word embeddings and provide the semantic similarity score.

The results of our preliminary experiments on two datasets were quite surprising for us: the semantics provided by NASARI do not improve the results, and centroid-based measures generally perform better than other, more complex measures.

The rest of the paper is organized as follows. In Section 2, we present related work. In Section 3, we describe the approach with word sense embeddings and we give an overview of various word-embedding-based methods for calculating the semantic similarity of short texts. In Section 4, we provide evaluation results. Finally, in the last section, we give a conclusion and possible directions for future work.

2 Related Work

So far, numerous approaches have been developed for the task of measuring the semantic similarity of words and texts; they can generally be classified into two groups: corpus-based and knowledge-based (Mihalcea et al., 2006).

Knowledge-based measures of semantic similarity rely on external sources of knowledge (e.g. ontologies processed as semantic graphs or semantic networks, and/or lexical bases such as WordNet (Miller, 1995; Lenat, 1995), Wikipedia (Witten and Milne, 2008; Nastase and Strube, 2008), etc.). Moreover, these approaches use formal expressions of knowledge explicitly defining how to compare entities in terms of semantic similarity.

Corpus-based measures enable the comparison of language units such as words or texts based on statistics. They determine semantic similarity between words or texts using information derived from large corpora. These include traditional approaches like simple n-gram measures (Salton, 1989; Damashek, 1995) and bag of words (BoW) (Salton et al., 1975; Manning et al., 2010), or more complex approaches such as Latent Semantic Analysis (LSA) proposed by Landauer (Landauer et al., 1998).

Recent trends in NLP prefer corpus-based approaches and representation models such as word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2017) and, more recently, ELMo (Peters et al., 2018). The results of these models are words represented as embeddings, with the property that semantically or syntactically similar words tend to be close in the semantic space (Collobert and Weston, 2008; Mikolov et al., 2013a).

Identifying the degree of semantic similarity of short texts based on word embeddings is a challenging task that has been studied extensively in the past years. Certain approaches offer sentence or document embeddings as a solution (Le and Mikolov, 2014; Cer et al., 2018). However, in this study, we focus on methods that determine the semantic similarity of short texts based only on word embeddings.

Mihalcea et al. proposed an approach for measuring the semantic similarity of texts by exploiting the information that can be drawn from the similarity of the component words. The proposed approach is based on two corpus-based and six knowledge-based measures of word semantic similarity. According to the presented results, it outperforms the vector-based similarity approach in the task of paraphrase detection. However, their approach is rather traditional and is not based on word embeddings.

Kusner et al. introduced a new measure, called the Word Mover's Distance (WMD), which measures the dissimilarity between two text documents (Kusner et al., 2015). Documents are represented using word embeddings, and the distance is calculated as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of another document. The measure was evaluated in the task of text classification, and the results show that it tends to have lower classification error rates in comparison to other state-of-the-art baseline methods. Furthermore, in (Brokos et al., 2016) the authors used the WMD method in the task of information retrieval. They applied the proposed method in the biomedical domain for retrieving documents from BIOASQ and showed that their method is competitive with PUBMED. These are examples of indirect evaluation of the WMD method, since the measure is not tested in the task of measuring semantic similarity. In our approach, we use similar methods and perform a direct evaluation.
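For readers unfamiliar with WMD, the following is a minimal illustration only (it is not part of our experiments), assuming gensim with its optional WMD dependency installed and a local word2vec file in KeyedVectors format; the file name and the example sentences are placeholders.

```python
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format embedding file would do here.
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

# WMD: minimum cumulative distance the embedded words of doc1 must
# "travel" to reach the embedded words of doc2 (lower = more similar).
distance = vectors.wmdistance(doc1, doc2)
print(distance)
```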


In (De Boom et al., 2016) the authors defined a novel method for vector representations of short texts. The method uses word embeddings and learns how to weigh each embedding based on its idf value. The proposed method works with texts of a predefined length but can be extended. The authors showed that their method outperforms other baseline methods that aggregate word embeddings for modeling short texts.

Kenter and De Rijke proposed an approach for measuring short texts semantic similarity by combining word embeddings with methods based on external knowledge sources (Kenter and De Rijke, 2015). They used various text features to train a supervised learning algorithm. They employed a modification of the Okapi BM25 function for document ranking in information retrieval and adjusted it for the task of measuring the semantic similarity of short texts. They showed that their method outperforms some baseline methods in this task. In our approach, we test the same function for measuring semantic similarity. However, we perform evaluations using different word embeddings and different datasets. Moreover, we introduce two modified versions of the proposed method.

In the end, we briefly discuss approaches related to Semantic Text Similarity, while a full description of the extensive related work emerging from SemEval tasks in recent years is beyond the scope of this paper. In the 1st task of SemEval2015, the evaluation results in terms of Pearson correlation with the human judgments are around 0.5 and 0.6, with the exception of the highest result of 0.735 (Xu et al., 2015). In (Marelli et al., 2014) the results of evaluation using Pearson correlation are in the range from 0.479 to 0.828. As defined within the task of SemEval2014, these approaches used compositional distributional semantic models and other semantic systems on full sentences. However, these models are trained to fit this particular dataset and are not likely to hold in future tasks of the same kind.

3 Methodology

In this section, we describe the methodology adopted in the paper by introducing the embeddings and the methods used for the pairwise measurement of the semantic similarity of short texts.

3.1 Word and Word Senses Embeddings

In the experimental evaluation, we use two sets of pre-trained word embeddings and compare the performance of two text representation models in the task of measuring semantic similarity.

The first model is based on classical word2vec embeddings. In particular, we test UMBCw2v, a set of word embeddings trained on the UMBC corpus (Han et al., 2013). These word embeddings are freely available and have already been used in a number of experiments in the literature. By using these embeddings we make our experiment more general, and the results are comparable with other experimental evaluations. The application of this kind of embedding is straightforward: each word is replaced with its corresponding embedding from the word2vec set through a lookup table.

The second model is based on the NASARI set of embeddings. These embeddings incorporate external knowledge by introducing word sense embeddings through links to the BabelNet synsets (Camacho-Collados et al., 2016). In our experiments, similarly as in (Sinoara et al., 2019), we use the NASARI dataset combined with the UMBCw2v embeddings, and we call this representation model NASARI+word2vec. The application of these embeddings requires the Babelfy system (Moro et al., 2014) to retrieve the ID of the sense associated with each word. The ID is then used to find the embedding for that word or phrase in the NASARI+word2vec set of embeddings. For a word which is not covered in Babelfy and thus does not have any ID, the embedding is extracted from the UMBCw2v set. This way, we enable the disambiguation of different word senses.

Note that both sets of embeddings are trained on the same vector space and their representation vector length is 300. Thus, all the embeddings are semantically comparable.
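The two lookup strategies can be sketched as follows. This is a minimal sketch, not the exact pipeline: the dictionaries stand in for the UMBCw2v and NASARI+word2vec sets, and babelfy_sense_id is a placeholder for a call to the Babelfy service, whose API we do not reproduce here.

```python
umbc_w2v = {}     # placeholder: word -> 300-dim vector (UMBCw2v)
nasari_w2v = {}   # placeholder: BabelNet synset ID -> 300-dim vector (NASARI)

def babelfy_sense_id(word, context):
    """Placeholder: return the BabelNet synset ID that Babelfy assigns to
    `word` in `context`, or None when the word is not disambiguated."""
    return None

def embed_word2vec(word):
    # Plain word2vec model: one embedding per surface form, via lookup table.
    return umbc_w2v.get(word)

def embed_nasari(word, context):
    # NASARI+word2vec model: try the sense embedding first and fall back
    # to the word2vec embedding when Babelfy returns no ID.
    sense_id = babelfy_sense_id(word, context)
    if sense_id is not None and sense_id in nasari_w2v:
        return nasari_w2v[sense_id]
    return umbc_w2v.get(word)
```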
3.2 Methods for Measuring Short Texts Semantic Similarity

In this section, we introduce the five methods used in our experiments for calculating semantic similarity scores between two short texts based on their word embeddings.

The first method is based on centroids. For a given text represented with the set of word embeddings V, the centroid of V is calculated according to equation (1):

cent(V) = \frac{\sum_{v \in V} v}{|V|}.   (1)

The centroid is typically adopted in the literature for synthesizing the meaning of a text. We also experiment with a modified version of the centroid method that multiplies each word vector by its inverse document frequency (idf). This variant builds weighted centroids, where the uncommon terms in the collection assume bigger importance:

cent_{idf}(V) = \frac{\sum_{v \in V} idf(v) \cdot v}{|V|}.   (2)

Using the centroids, the similarity sim_cos of two documents is computed as the cosine similarity between the centroids of two texts t_1 and t_2, represented with the sets of embeddings V_1 and V_2 respectively:

sim_{cos}(t_1, t_2) = \cos(cent(V_1), cent(V_2)).   (3)
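Equations (1)-(3) translate directly into code. The sketch below assumes each text has already been mapped to a list of embedding vectors (as described in subsection 3.1) and that idf weights for the collection are available; the helper names are ours, not from any library.

```python
import numpy as np

def centroid(vectors):
    # Equation (1): component-wise average of the word embeddings.
    return np.mean(vectors, axis=0)

def centroid_idf(words, vectors, idf):
    # Equation (2): idf-weighted sum, normalized by the number of words,
    # so uncommon terms contribute more to the text representation.
    weighted = [idf.get(w, 1.0) * v for w, v in zip(words, vectors)]
    return np.sum(weighted, axis=0) / len(vectors)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sim_cos(vectors1, vectors2):
    # Equation (3): cosine similarity between the two text centroids.
    return cosine(centroid(vectors1), centroid(vectors2))

def sim_cos2(words1, vectors1, words2, vectors2, idf):
    # Weighted variant: cosine similarity between idf-weighted centroids.
    return cosine(centroid_idf(words1, vectors1, idf),
                  centroid_idf(words2, vectors2, idf))
```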


Analogously, sim_cos2 is calculated as the cosine similarity between the weighted centroids of two texts (short texts or sentences).

We also experiment with three other methods that are based on the Okapi BM25 function. The first method was introduced in (Kenter and De Rijke, 2015), and the other two are simplified modifications of the original method.

The modified version of the Okapi BM25 function that can be applied for measuring the semantic similarity of two texts (short texts or sentences), introduced in (Kenter and De Rijke, 2015), is:

sts(t_l, t_s) = \sum_{w \in t_l} idf(w) \cdot \frac{sem(w, t_s) \cdot (k_1 + 1)}{sem(w, t_s) + k_1 \cdot (1 - b + b \cdot \frac{|t_s|}{avgtl})},   (4)

where t_l is the longer text and t_s is the shorter text. The variables k_1 and b are parameters which can be optimised, and avgtl is the average text length. The function sem for a given word w and text t is defined as:

sem(w, t) = \max_{w' \in t} \cos(w, w').   (5)

Next, we introduce two modifications of equation (4) by leaving out the constants k_1 and b. The results are two simplified versions of equation (4). The rationale behind these simplifications is the additional experiments we performed by changing the values of k_1 and b (explained at the end of the fourth section). Since there were no substantial differences between the results obtained with different values of b and k_1, we decided to remove those variables.

Equation (6) calculates the average value returned by the function sem (5), multiplied by the idf:

stss(t_l, t_s) = \frac{\sum_{w \in t_l} idf(w) \cdot sem(w, t_s)}{|t_l|}.   (6)

Equation (7) is another modification of (4). Here, instead of just calculating the idf of the word from the longer text, the idf is calculated for words from both texts (w_s represents the word from the shorter text). One more difference is that the resulting value is passed through the log function, so the most extreme values are reduced:

stss2(t_l, t_s) = \log\left(\frac{\sum_{w \in t_l} (idf(w) + idf(w_s)) \cdot sem(w, t_s)}{|t_l|}\right).   (7)
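A compact sketch of equations (4)-(7), under the same assumptions as the centroid sketch above: each text is a list of tokens with a parallel list of embedding vectors, and idf is a word-to-weight dictionary. Note that reading w_s in equation (7) as the word of the shorter text that maximizes the cosine in equation (5) is our interpretation of the definition.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def sem(w_vec, text_vecs):
    # Equation (5): similarity of a word to a text is the cosine
    # similarity to the closest word in that text.
    return max(cosine(w_vec, v) for v in text_vecs)

def sts(long_words, long_vecs, short_vecs, idf, avg_len, k1=1.2, b=0.75):
    # Equation (4): the BM25-style measure of Kenter and De Rijke (2015).
    total = 0.0
    for w, w_vec in zip(long_words, long_vecs):
        s = sem(w_vec, short_vecs)
        norm = s + k1 * (1 - b + b * len(short_vecs) / avg_len)
        total += idf.get(w, 1.0) * s * (k1 + 1) / norm
    return total

def stss(long_words, long_vecs, short_vecs, idf):
    # Equation (6): idf-weighted sem, averaged over the longer text.
    return sum(idf.get(w, 1.0) * sem(v, short_vecs)
               for w, v in zip(long_words, long_vecs)) / len(long_words)

def stss2(long_words, long_vecs, short_words, short_vecs, idf):
    # Equation (7): also adds the idf of the matched word of the shorter
    # text, and damps extreme values with a log.
    total = 0.0
    for w, w_vec in zip(long_words, long_vecs):
        sims = [cosine(w_vec, v) for v in short_vecs]
        j = int(np.argmax(sims))
        total += (idf.get(w, 1.0) + idf.get(short_words[j], 1.0)) * sims[j]
    return np.log(total / len(long_words))
```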

4 Evaluation and Discussion

4.1 Datasets

To evaluate the performance of the short texts similarity measures, we used two datasets.

The first dataset (d1), called the SICK dataset in its original version, was defined within the SemEval-2014 international workshop for two tasks: determining the degree of relatedness between two sentences and detecting the entailment relation between sentences (Marelli et al., 2014). The dataset consists of 5,000 English sentence pairs. Each sentence pair is annotated with a score that represents the degree of sentence similarity on a scale ranging from 1 to 5 (where 1 means that there is no semantic similarity and 5 refers to semantically equivalent sentences).

The second dataset (d2), referred to as the Lee dataset, was defined in (Lee et al., 2005) for the task of evaluating measures of text-to-text similarity. The dataset is composed of 50 short English documents presenting news from the Australian Broadcasting Corporation's news mail service. Each document pair (2,500 pairs in total) is annotated with a relatedness score using discrete values from 1 to 5, given as an average of the scores proposed by ten participants, with an inter-agreement score of 0.61.

4.2 Evaluation Results

To evaluate and compare the representation models and the methods for measuring short texts semantic similarity (described in subsection 3.2), we computed the pairwise similarity of all short texts in both datasets. We compare the results with human judgments through the Pearson and Spearman correlations. Tables 1 and 2 show the results obtained on the SICK and the Lee datasets, respectively. The rows represent the similarity measures experimented with, namely sim_cos, sim_cos2, sts, stss and stss2. The columns represent the correlation measures computed with the word2vec embeddings (a1) and NASARI+word2vec (a2).

The set of experiments performed on the SICK dataset shows that, according to the Pearson correlation, sim_cos2 has the best performance in combination with the word2vec model, while stss2 has the best performance in combination with the NASARI+word2vec model. Next, in the case of the Spearman correlation, sim_cos2 has the best performance with both representation models. We repeated the same set of experiments on the Lee dataset, and sim_cos2 shows the highest correlation values in all cases.

An overall comparison of the two approaches, the first with NASARI embeddings and the second with word2vec embeddings, shows that the approach with classical word2vec embeddings slightly outperforms the approach with NASARI embeddings on both datasets and for all measures. This might be explained by the assumption that the Babelfy system does not achieve its full potential, since it does not always return the correct word sense embedding for a given word within its context. There is certainly room for improvement in the BabelNet and Babelfy systems.
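For concreteness, the correlations in the tables below can be computed with SciPy as follows; the score lists are placeholders for a dataset's gold ratings and a measure's output over the same pairs.

```python
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 1.0, 3.2, 2.8]        # placeholder gold relatedness ratings
system_scores = [0.91, 0.12, 0.77, 0.55]   # placeholder, e.g. sim_cos per pair

r, _ = pearsonr(human_scores, system_scores)
rho, _ = spearmanr(human_scores, system_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```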


Table 1: Pearson (r) and Spearman (ρ) correlations of five similarity measures for the word2vec (a1) and NASARI+word2vec (a2) approaches on the SICK dataset.

            r (a1)   ρ (a1)   r (a2)   ρ (a2)
  sim_cos    0.642    0.585    0.586    0.557
  sim_cos2   0.661    0.579    0.604    0.550
  sts        0.503    0.468    0.470    0.443
  stss       0.565    0.534    0.549    0.516
  stss2      0.642    0.537    0.612    0.520

Table 2: Pearson (r) and Spearman (ρ) correlations of five similarity measures for the word2vec (a1) and NASARI+word2vec (a2) approaches on the Lee dataset.

            r (a1)   ρ (a1)   r (a2)   ρ (a2)
  sim_cos    0.582    0.519    0.472    0.464
  sim_cos2   0.589    0.519    0.502    0.478
  sts        0.283    0.193    0.276    0.225
  stss       0.474    0.293    0.500    0.406
  stss2      0.424    0.288    0.444    0.378

There are minor deviations compared to the results for the centroid similarity measure on the Lee dataset reported in Sinoara et al. (2019), because there is a new version of the NASARI dataset. However, the overall results of centroid-based similarity are in line with the previous study for both approaches (the NASARI approach and the word2vec approach).

According to the Pearson correlation, the weighted centroid measure slightly outperforms the centroid measure (on both datasets) and, in some cases, so do the variations of the sts measure.

In comparison to the results reported for the SICK dataset, the presented approaches are better than a few approaches described in (Marelli et al., 2014). However, only word2vec in combination with both centroid-based methods slightly outperforms the baseline (reported as an overlap of 0.63). For the Lee dataset, the results show that both centroid-based methods and stss2 perform better than the human inter-agreement of 0.61.

Figure 1 shows scatter plots of the relationships between the automatically determined scores of semantic similarity and the human ratings in four different cases. The first two cases, (A) and (B), refer to the methods with the highest score on the SICK and Lee datasets, respectively. The two other cases, (C) and (D), illustrate the relationships between automatic and human judgments in the worst cases. It is obvious from the large dispersion that the automatically determined scores are not strongly correlated with human intuition.

Figure 1: Scatter plots of relationships between human judgments and various approaches on two datasets: (A) the best results, with the sim_cos method on the SICK dataset; (B) the best results, with the sim_cos2 method on the Lee dataset; (C) the worst results, with the sts method on the SICK dataset; (D) the worst results, with the sts method on the Lee dataset.

Since the sts measure has the lowest values for both the Pearson and Spearman correlations in all cases, we did some extra experiments with tuning the parameters k_1 and b. Their default values are k_1 = 1.2 and b = 0.75. Through tuning of those parameters, we found values that are optimal for this task. The parameter tuning and the optimal values are shown in Figure 2.

Figure 2: k_1 and b tuning for the SICK and Lee datasets. The y-axis is the Pearson value and the x-axis is the b value. The top curve (blue) is for k_1 = 100, and the bottom curve (red) is for k_1 = 1.2.
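The tuning procedure amounts to a small grid search. Below is a minimal sketch under stated assumptions: score_pairs is a stand-in for running the sts measure with the given k_1 and b over every pair in a dataset, and the placeholder data exists only to make the snippet runnable.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
human_scores = rng.uniform(1, 5, size=100)   # placeholder gold ratings

def score_pairs(k1, b):
    # Placeholder: the real experiment would return sts scores computed
    # with these parameters for the same 100 pairs.
    return human_scores + rng.normal(0.0, 0.1 + b, size=100)

# Sweep b for a few fixed k1 values and keep the best Pearson correlation.
best = max(
    ((pearsonr(human_scores, score_pairs(k1, b))[0], k1, b)
     for k1 in (1.2, 100.0)
     for b in np.linspace(0.0, 1.0, 21)),
    key=lambda t: t[0],
)
print("best Pearson r = %.3f at k1 = %s, b = %.2f" % best)
```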


5 Conclusion

In this paper, we present preliminary research focused on measuring the semantic similarity of short texts. We test and compare two representation models: the traditional word2vec model and its extension with the NASARI embeddings of word senses, provided through the Babelfy system. We combine these representation models with centroid-based and BM25-based methods and their variations.

Evaluation results on two datasets (SICK and Lee), in terms of Pearson and Spearman correlations, indicate that the word2vec model performs better than its extension. The reason might be that the NASARI dataset is not yet fully developed. We expect that with better versions of NASARI and Babelfy, this extended representation model, NASARI+word2vec, will perform better.

Concerning the different methods for measuring similarity, there is no consensus on which method is the best. According to the Pearson correlation, the weighted centroid measure slightly outperforms the centroid measure (on both datasets), and the variations of the sts measure that we propose in this paper perform better than sts in general. The results are slightly above the human inter-agreement, but the overall results are still not sufficiently correlated with the human scores. However, there is still room for improvement.

For future work, we plan to systematically experiment with all other available representation models. Moreover, we will explore and incorporate more external knowledge resources, e.g. ontologies, the Google Knowledge Graph, Wikipedia, etc.

Acknowledgments

This work has been supported in part by the University of Rijeka under the project uniri-drustv-18-38.

References

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Brokos, G.-I., Malakasiotis, P., and Androutsopoulos, I. (2016). Using centroids of word embeddings and word mover's distance for biomedical document retrieval in question answering. arXiv preprint arXiv:1608.03905.

Camacho-Collados, J., Pilehvar, M. T., and Navigli, R. (2016). NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64.

Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Damashek, M. (1995). Gauging similarity with n-grams: Language-independent categorization of text. Science, 267(5199):843–848.

De Boom, C., Van Canneyt, S., Demeester, T., and Dhoedt, B. (2016). Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, 80:150–156.

Han, L., Kashyap, A. L., Finin, T., Mayfield, J., and Weese, J. (2013). UMBC_EBIQUITY-CORE: Semantic textual similarity systems. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 44–52.

Kenter, T. and De Rijke, M. (2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1411–1420. ACM.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015). From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284.

Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Lee, M. D., Pincombe, B., and Welsh, M. (2005). An empirical evaluation of models of text document similarity. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 27.

Lenat, D. B. (1995). Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Manning, C., Raghavan, P., and Schütze, H. (2010). Introduction to information retrieval. Natural Language Engineering, 16(1):100–103.

Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., and Zamparelli, R. (2014). SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1–8.


Mihalcea, R., Corley, C., Strapparava, C., et al. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In AAAI, volume 6, pages 775–780.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Moro, A., Raganato, A., and Navigli, R. (2014). Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.

Nastase, V. and Strube, M. (2008). Decoding Wikipedia categories for knowledge acquisition. In AAAI, volume 8, pages 1219–1224.

Navigli, R. and Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Rossiello, G., Basile, P., and Semeraro, G. (2017). Centroid-based text summarization through compositionality of word embeddings. In Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, pages 12–21.

Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading: Addison-Wesley.

Salton, G., Wong, A., and Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620.

Sinoara, R. A., Camacho-Collados, J., Rossi, R. G., Navigli, R., and Rezende, S. O. (2019). Knowledge-enhanced document embeddings for text classification. Knowledge-Based Systems, 163:955–971.

Witten, I. H. and Milne, D. N. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links.

Xu, W., Callison-Burch, C., and Dolan, B. (2015). SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1–11.
