
SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering

Delphine Charlet and Géraldine Damnati
Orange Labs, Lannion, France
delphine.charlet,[email protected]

Abstract

This paper describes the SimBow system submitted at SemEval-2017 Task 3, for the question-question similarity subtask B. The proposed approach is a supervised combination of different unsupervised textual similarities. These textual similarities rely on the introduction of a relation matrix in the classical cosine similarity between bag-of-words, so as to get a soft-cosine that takes into account relations between words. According to the type of relation matrix embedded in the soft-cosine, semantic or lexical relations can be considered. Our system ranked first among the official submissions of subtask B.

1 Introduction

Social networks enable people to post questions and to interact with other people to obtain relevant answers. The popularity of forums shows that they are able to provide reliable answers. Due to this tremendous popularity, forums are growing fast, and the first reflex for an internet user is to check with their favorite search engine whether a similar question has already been posted. Community Question Answering at SemEval focuses on this task, with 3 different subtasks. Subtask A (resp. subtask C) aims at re-ranking the comments of one original question (resp. the comments of a set of 10 related questions) according to their relevance to the original question. Subtask B aims at re-ranking 10 related questions proposed by a search engine according to their relevance to the original question. Subtasks A and C are question-answering tasks.

Subtask B can be viewed as a pure semantic textual similarity task applied to community questions, with noisy user-generated texts, which makes it different from SemEval Task 1 (Agirre et al., 2016), which focuses on semantic similarity between short well-formed sentences.

In this paper, we only focus on subtask B, with the purpose of developing semantic textual similarity measures for such noisy texts. Question-question similarity appeared in SemEval-2016 (Nakov et al., 2016) and is pursued in SemEval-2017 (Nakov et al., 2017). The approaches explored last year were mostly supervised fusions of different similarity measures, some unsupervised, others supervised. Among the unsupervised measures, many were based on overlap counts between components, from n-grams of words or characters to knowledge-based components such as named entities, frame representations or knowledge graphs, e.g. (Franco-Salvador et al., 2016). Much attention was also paid to the use of word embeddings (e.g. (Mihaylov and Nakov, 2016)), with question-level averaged vectors used directly with a cosine similarity or as input of a neural classifier. Finally, fusion was often performed with SVMs (Filice et al., 2016).

Our motivation in this work was slightly different: we considered that forum data were too noisy to get reliable outputs from linguistic analysis, and we wanted to focus on core textual semantic similarity. Hence, we avoided using any metadata analysis (such as user profiles) in order to get results that could easily generalize to other similarity tasks. Thus, we explore unsupervised similarity measures, with no external resources and hardly any linguistic processing (except a list of stopwords), relying only on the availability of sufficient unannotated corpora representative of the data. We then fuse them in a robust and simple supervised framework (logistic regression).

The rest of the paper is organized as follows: the core unsupervised similarity measure is presented in Section 2, the submitted systems are described in Section 3, and Section 4 presents results.

2 Soft-Cosine Similarity Measure

In a classical bag-of-words approach, texts are represented by a vector of TF-IDF coefficients of size N, N being the number of different words occurring in the texts. Computing a cosine similarity between 2 vectors is directly related to the amount of words that the two texts have in common:

\cos(X, Y) = \frac{X^t Y}{\sqrt{X^t X}\,\sqrt{Y^t Y}}, with X^t Y = \sum_{i=1}^{N} x_i y_i    (1)

When there are no words in common between texts X and Y (i.e. no index i for which both x_i and y_i are non-zero), the cosine similarity is null. However, even with no words in common, texts can be semantically related when the words themselves are semantically related. Hence we propose to take word-level relations into account by introducing a relation matrix M into the cosine similarity formula, as in Equation 2:

\cos_M(X, Y) = \frac{X^t M Y}{\sqrt{X^t M X}\,\sqrt{Y^t M Y}}    (2)

X^t M Y = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i m_{i,j} y_j    (3)

where M is a matrix whose element m_{i,j} expresses some relation between word i and word j. With such a matrix, the similarity between two texts is non-null as soon as the texts share related words, even if they have no words in common. Introducing the relation matrix in the denominator normalization factors ensures that the reflexive similarity is 1. If the words are only related to themselves (m_{i,i} = 1 and m_{i,j} = 0 for all i ≠ j), M is the identity matrix and the soft-cosine reduces to the classical cosine.

We first investigated this modified cosine similarity in the context of topic segmentation of TV Broadcast News (Bouchekif et al., 2016), using semantic relations between words to improve the computation of semantic cohesion between consecutive snippets. Other researchers have also proposed this measure (e.g. (Sidorov et al., 2014)), along with the soft-cosine denomination, where the matrix was based for instance on a Levenshtein distance between n-grams. In this work, we investigate different kinds of word relations that can be used for computing M.

2.1 Semantic relations

Distributed representations of words, such as the word2vec approach proposed by (Mikolov et al., 2013), have known a tremendous success recently. They make it possible to obtain relevant semantic relations between words from a simple similarity measure (e.g. cosine) between the vector representations of these words.

In this work, 2 distributed representations of words are computed with the word2vec toolkit in the cbow configuration: one is estimated on English Wikipedia, and the other is estimated on the unannotated corpus of questions and comments from the Qatar Living forum distributed in the campaign, which contains 100 million words. The vector dimension is 300 (experiments with various vector dimensions did not show any significant difference), and only words with a minimal frequency of 50 are taken into account.

Once the word2vec representations of words are available, M can be computed in different ways. We have explored different variants, and the best results were obtained with the following framework, where v_i stands for the word2vec representation of word w_i:

m_{i,j} = \max(0, \mathrm{cosine}(v_i, v_j))^2    (4)

Grounding to 0 is motivated by the observation that negative cosines between words are hard to interpret and often irrelevant. Squaring is applied to emphasize the dynamics of the semantic relations: it insists more on strong semantic relations and flattens weak ones. Indeed, we have observed in several applicative domains that high semantic similarities derived from word embeddings are more significant than low similarities.
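To make the formulas above concrete, here is a minimal NumPy sketch of the soft-cosine of Equations (2)-(3) with a semantic relation matrix built as in Equation (4). It assumes word vectors are already available as a plain dict mapping each vocabulary word to an array (e.g. loaded from a word2vec model); the function names are illustrative and not part of the SimBow system.

```python
import numpy as np

def semantic_relation_matrix(vocab, vectors):
    """Relation matrix of Eq. (4): m_ij = max(0, cosine(v_i, v_j))^2.

    `vocab` lists the words indexing the bag-of-words dimensions;
    `vectors` maps a word to its embedding (a 1-D numpy array)."""
    n = len(vocab)
    M = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            vi, vj = vectors.get(vocab[i]), vectors.get(vocab[j])
            if vi is None or vj is None:
                continue  # out-of-vocabulary word: keep only the diagonal entry
            cos = np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))
            M[i, j] = M[j, i] = max(0.0, cos) ** 2
    return M

def soft_cosine(x, y, M):
    """Soft-cosine of Eq. (2) between two TF-IDF vectors x and y (numpy arrays)."""
    num = x @ M @ y
    den = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return 0.0 if den == 0 else float(num / den)
```

With M set to the identity matrix, soft_cosine reduces to the classical cosine of Equation (1).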
2.2 Edit-based relations

Using a Levenshtein distance between words, an edit relation between words can be computed: it makes it possible to cope, for instance, with the small typographic errors that are frequent in user-generated corpora such as the Qatar Living forum. It is defined as m_{i,i} = 1 and, for i ≠ j:

m_{i,j} = \alpha \left( 1 - \frac{\mathrm{Levenshtein}(w_i, w_j)}{\max(\|w_i\|, \|w_j\|)} \right)^{\beta}    (5)

where \|w\| is the number of characters of word w, α is a weighting factor relative to the diagonal elements, and β is a factor that emphasizes the score dynamics. Experiments on train and dev led to setting α = 1.8 and β = 5.
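The edit-based relation of Equation (5) can be sketched as follows; the Levenshtein implementation is a plain textbook version, and the default values α = 1.8 and β = 5 are the ones reported above.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_relation(wi, wj, alpha=1.8, beta=5):
    """Off-diagonal coefficient m_ij of Eq. (5); m_ii is set to 1 separately."""
    sim = 1.0 - levenshtein(wi, wj) / max(len(wi), len(wj))
    return alpha * sim ** beta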

3 System Description

3.1 Data pre-processing

Some basic pre-processing steps are applied to the text: lowercasing, suppression of punctuation marks and stopwords, and replacement of urls and images with the generic terms "_url_" and "_img_". As for the bag-of-words representation, TF-IDF coefficients are computed in a specific way: TF coefficients are computed in the text under consideration as usual, but IDF coefficients are computed from the large unannotated Qatar Living forum corpus.

3.2 Supervised combination of unsupervised similarities

For a given pair of texts to compare, 3 textual similarity measures are considered: cosMrel (soft-cosine with semantic relations), cosMlev (soft-cosine with Levenshtein-based relations), and wavg-word2vec (cosine between weighted averaged word2vec vectors). The soft-cosine measures are computed as explained in Section 2. For the latter, weights are given by the TF-IDF coefficients computed as described in Section 3.1.

Each original question contains a subject and a body. Each related question additionally contains a thread of comments. We have considered several variants for text selection (subject, body, subject+body, or comments). The 3 textual similarities are then computed over every possible text pairing, i.e. 12 pairings (3 possible texts for original questions × 4 possible texts for related questions), constituting a set of 36 potential features. We also include in this set the reciprocal rank rrk of the IR system. A logistic regression combining these features is then trained on the "train-part1" set of 1999 paired texts of SemEval-2016. We evaluate all possible subsets of features among the set of 37 potential features, and we keep for our primary submission the one that gave the best result on average on dev and test2016. contrastive1 was chosen as the best candidate including only soft-cosine metrics and rrk, and contrastive2 was chosen as the best candidate system with the lowest number of features.
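As an illustration of the pre-processing and weighting scheme described in Section 3.1, here is a simplified sketch; the tokenizer, the stopword list and the way images are detected are assumptions, not the exact rules used in the submitted system.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "to", "and", "is"}  # placeholder list, not the authors' one

def preprocess(text):
    """Lowercase, map urls/images to generic tokens, drop punctuation and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " _url_ ", text)
    text = re.sub(r"<img[^>]*>", " _img_ ", text)  # assumed image markup
    tokens = re.findall(r"[a-z0-9_]+", text)
    return [t for t in tokens if t not in STOPWORDS]

def idf_from_corpus(documents):
    """IDF estimated on a large unannotated corpus (e.g. the Qatar Living dump)."""
    df = Counter()
    for doc in documents:
        df.update(set(preprocess(doc)))
    n_docs = len(documents)
    return {w: math.log(n_docs / d) for w, d in df.items()}

def tfidf(text, idf):
    """TF counted in the text under consideration, IDF taken from the external corpus."""
    tf = Counter(preprocess(text))
    return {w: count * idf.get(w, 0.0) for w, count in tf.items()}
```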
4 Evaluation

In this section, we present detailed evaluations on Task 3, subtask B. Given a new question (aka original question), the task consists in re-ranking the 10 questions (aka related questions) proposed by a search engine. A precise description of the corpus and metrics can be found in the Task 3 description paper (Nakov et al., 2017). Results are reported with the MAP evaluation measure on 3 corpora: dev (50 original questions × 10 related questions), test2016 (70 original questions × 10 related questions) and test2017 (88 original questions × 10 related questions).

It is worth noticing that the MAP scorer used in this campaign is sensitive to the number of original questions that do not have any relevant related question in the gold labels. In fact, these questions always account for a precision of 0 in the MAP scoring. Hence, an Oracle evaluation, giving a score of 1 to all related questions labeled as "true" and a score of 0 to all related questions labeled as "false" in the gold labels, does not yield a 100% MAP but an Oracle MAP that corresponds to the proportion of original questions that have at least 1 relevant related question. Hence the upper bound of MAP performance is 86.00% for dev, 88.57% for test2016, and only 67.05% for test2017 (29 original questions out of 88 without any relevant related question). Another difference between test2016 and test2017 is the average number of "true" labels for questions that have at least one relevant associated question (3.7 for test2016 and 2.7 for test2017). Overall, test2017 is more difficult for the task.

4.1 Unsupervised textual similarity measures

Table 1 presents the MAP results obtained for different unsupervised textual similarities. Here, the focus is on the unsupervised textual similarity measures, and we only present results for the subject+body configuration for both the original and related questions. Performances of the Information Retrieval system (IR) and of the best system submitted at SemEval-2016 (Franco-Salvador et al., 2016) are reported for comparison purposes.

similarity              dev     test2016  test2017
IR                      71.35   74.75     41.85
best SemEval2016        -       77.33     -
baseline token cos      62.22   68.54     40.88
baseline pp cos         67.49   71.05     42.80
baseline pp cos tfidf   69.41   75.53     44.37
cosMrel relations WP    72.25   77.11     45.38
cosMrel relations QL    75.24   77.96     45.27
cosMlev Levenshtein     70.02   76.34     46.10
wavg-word2vec on QL     73.31   75.77     46.99

Table 1: MAP results for unsupervised textual similarity measures.
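The following sketch illustrates the MAP behaviour discussed above: an original question with no relevant related question in the gold labels contributes an average precision of 0, which is why even an Oracle ranking does not reach 100% MAP. It is illustrative only, not the official scorer.

```python
def average_precision(ranked_labels):
    """AP of one original question over its ranked related questions
    (True = relevant). A question with no relevant related question
    yields 0, as in the campaign scorer discussed above."""
    hits, score = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP (in percent) over all original questions."""
    return 100.0 * sum(average_precision(r) for r in rankings) / len(rankings)

def oracle_map(gold):
    """Oracle MAP: rank every 'true' related question before every 'false' one."""
    return mean_average_precision([sorted(labels, reverse=True) for labels in gold])
```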

As a baseline, we use the baseline token cos defined in SemEval-2015 Task 2 (Agirre et al., 2015) for semantic textual similarity between sentences. It is a simple cosine similarity between binary bag-of-tokens vectors (a token is a non-white-space sequence between 2 white spaces, and weights are 1 or 0). The performance of baseline pp cos, which is also a cosine of binary vectors but obtained after the pre-processing step, shows the importance of suitable pre-processing. baseline pp cos tfidf shows the influence of appropriate term weighting over simple binary coefficients. The next results reveal significant improvements when introducing a relation matrix M in the soft-cosine metric (cosM). When M contains semantic relations, a significant difference is observed on dev between relations estimated on a general corpus WP (Wikipedia, 2.7 billion words) and on a specialized corpus QL (Qatar Living, 100 million words). The difference is much lower for test2016, and even negative for test2017. On the contrary, the Levenshtein-based M matrix performs best on test2017, whereas its gain is only marginal on dev and test2016. In all cases, introducing a carefully chosen relation matrix M in the cosine-based similarity measure improves performance. Finally, the cosine between TF-IDF weighted averaged word2vec vectors is less effective on dev and test2016, but performs well on test2017. It is worth noticing that the mere cosMrel soft-cosine on QL alone would have won the 2016 challenge.
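For reference, with binary weights the baseline token cos reduces to a simple overlap count between the two token sets; a minimal sketch, assuming plain whitespace tokenization as described above:

```python
import math

def baseline_token_cos(text_a, text_b):
    """Cosine between binary bag-of-tokens vectors: a token is any
    whitespace-delimited sequence, each weight is 0 or 1, so the
    cosine reduces to |A & B| / sqrt(|A| * |B|)."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / math.sqrt(len(tokens_a) * len(tokens_b))
```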

4.2 Evaluation of supervised combination

Table 2 presents the MAP obtained for different supervised combinations of similarity measures. First, for a given unsupervised textual similarity measure, all the possible combinations of paired texts are evaluated, and we give the result of the subset that gives the best performance on average over dev, test2016 and test2017. Interestingly, it is the same combination of paired texts that performs best for the 3 textual similarity measures: similarity between subject+body for both questions, and between subject+body for the original question and comments for the related question. This last pairing performs poorly alone but is interesting in combination with the first one.

Then we report the results of the systems submitted to the official evaluation. As can be seen in Table 2, contrastive2 was more robust to the more difficult conditions of test2017. Additionally, as the IR system performs much worse on test2017, we re-trained the systems excluding rrk from the features. While rrk was helpful on both the dev and test2016 corpora, we can see that removing rrk provides better results on test2017, yielding a maximum MAP score of 48.38. This performance is obtained with the following set of similarities: cosMrel between subject+body, cosMlev between subject+body and between the subject of the original question and the body of the related question, and wavg-w2v between subject+body and between the subject+body of the original question and the comments of the related question.

similarity              dev     test2016  test2017
IR                      71.35   74.75     41.85
best SemEval2016        -       77.33     -
text combination
cosMrel relations QL    75.76   78.76     46.67
cosMlev Levenshtein     72.26   78.19     47.48
wavg-word2vec on QL     75.91   76.70     47.40
submissions
primary                 77.30   79.77     47.22
contrastive1            77.04   79.12     46.84
contrastive2            77.30   79.43     47.87
removing rrk
primary -rrk            76.71   78.61     47.68
contrastive1 -rrk       76.09   78.90     46.96
contrastive2 -rrk       76.73   78.97     48.38

Table 2: MAP results for supervised combinations of textual similarity measures.

5 Conclusion

In this work, we have explored a modified version of the cosine similarity between bag-of-words representations of texts. In this so-called soft-cosine similarity, a relation matrix M is embedded, allowing relations between words to be taken into account. The computation of M is unsupervised and can be derived from distributed representations of words. The soft-cosine performed well at the SemEval Task 3 question-question similarity subtask B. A simple supervised logistic-regression combination of different unsupervised similarity measures over different text selection strategies ranked first at the official evaluation. In the future, we plan to pursue the work on soft-cosine in two directions: including other relations between words, for instance using …, and studying how this matrix M, efficiently initialized in an unsupervised way, could be further trained for specific tasks.

References

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, Denver, Colorado, pages 252–263. http://www.aclweb.org/anthology/S15-2045.

Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016. pages 497–511.

Preslav Nakov, Lluís Màrquez, Alessandro Moschitti, Walid Magdy, Hamdy Mubarak, Abed Alhakim Freihat, Jim Glass, and Bilal Randeree. 2016. SemEval-2016 task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, California, pages 525–545. http://www.aclweb.org/anthology/S16-1083.

Grigori Sidorov, Alexander F. Gelbukh, Helena Gómez-Adorno, and David Pinto. 2014. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas 18(3). http://cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043.

Abdessalam Bouchekif, Géraldine Damnati, Delphine Charlet, Nathalie Camelin, and Yannick Estève. 2016. Title assignment for automatic topic segments in TV broadcast news. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016. pages 6100–6104.

Simone Filice, Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2016. KeLP at SemEval-2016 task 3: Learning semantic relations between questions and answers. In SemEval@NAACL-HLT.

Marc Franco-Salvador, Sudipta Kar, Thamar Solorio, and Paolo Rosso. 2016. UH-PRHLT at SemEval-2016 task 3: Combining lexical and semantic-based features for community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016. pages 814–821.

Todor Mihaylov and Preslav Nakov. 2016. SemanticZ at SemEval-2016 task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In Proceedings of SemEval, pages 879–886.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. SemEval-2017 task 3: Community question answering. In Proceedings of the 11th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Vancouver, Canada, SemEval '17.
