
Ebiquity: Paraphrase and Semantic Similarity in Twitter using Skipgrams

Taneeya Satyapanich, Hang Gao and Tim Finin
University of Maryland, Baltimore County, Baltimore, MD 21250, USA
[email protected], [email protected], [email protected]

Abstract

We describe the system we developed to participate in SemEval 2015 Task 1, Paraphrase and Semantic Similarity in Twitter. We create similarity vectors from two-skip trigrams of preprocessed tweets and measure their semantic similarity using our UMBC-STS system. We submitted two runs. The best result was ranked eleventh out of eighteen teams with an F1 score of 0.599.

1. Introduction

In this task (Xu et al., 2015), participants were given pairs of text sequences from Twitter and produced a binary judgment for each stating whether or not they are paraphrases (i.e., semantically the same), and optionally a graded score (0.0 to 1.0) measuring their degree of semantic equivalence. For example, for the trending topic "A Walk to Remember" (a film released in 2002), the pair "A Walk to Remember is the definition of true love" and "A Walk to Remember is on and Im in town and Im upset" might be judged as not paraphrases with a score of 0.2, whereas the pair "A Walk to Remember is the definition of true love" and "A Walk to Remember is the cutest thing" could be judged as paraphrases with a score of 0.6.

Many methods have been proposed to solve the paraphrase detection problem. Early approaches were often based on lexical matching techniques, e.g., word n-gram overlap (Barzilay and Lee, 2003) or predicate argument tuple matching (Qiu et al., 2006). Other approaches that go beyond simple lexical matching have also been developed. For example, Mihalcea et al. (2006) estimated the semantic similarity of sentence pairs with word-to-word similarity measures and a word specificity measure. Zhang and Patrick (2005) used text canonicalization to transfer texts of similar meaning into the same surface text with a higher probability than texts of different meaning.

Many of these approaches adopt distributional semantic models, but are limited to the word level. To extend distributional semantic models beyond words, several researchers have learned phrase or sentence representations by composing the representations of individual words (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010). An alternative approach by Socher et al. (2011) represents phrases and sentences with fixed-size matrices consisting of pooled word and phrase pairwise similarities. Le and Mikolov (2014) learn representations of sentences directly by predicting context, without composing word representations.

In our work, we judge two sentences to be paraphrases if they have a high degree of semantic similarity. We use the UMBC Semantic Textual Similarity system (Han et al., 2013), which provides highly accurate semantic similarity measurements. The remainder of this paper is organized as follows. Section 2 describes the task and the details of our method. Section 3 presents our results and a brief discussion. The last section offers conclusions.



2. Our Method

To decide whether two tweets are paraphrases or not, we use a measurement based on semantic similarity values. If two tweets are semantically similar, they are judged as paraphrases; otherwise they are not. We describe the steps of our method below.

2.1. Preprocessing

Tweets are generally informal text sequences that include abbreviations, neologisms, emoticons and slang terms, as well as genre-specific elements such as hashtags, URLs and @mentions of other Twitter accounts. This is due to both the informal nature of the medium and the requirement to limit content to at most 140 characters. Thus, before measuring semantic similarity, we replace abbreviations and slang terms with readable versions. We collected about 685 popular abbreviations and slang terms from several Web resources (including http://webopedia.com, http://blog.mltcreative.com and http://internetslang.com) and combined these with the Twitter normalization lexicon developed by Han and Baldwin (2011).

After replacing abbreviations and slang terms, we remove all stop words to obtain the final processed tweets. We then produce a set of two-skip trigrams for each tweet, which we call its trigram set. We adapted the skip-gram technique from Guthrie et al. (2006).

Take the tweet "Google Now for iOS simply beautiful" as an example. After removing stop words, we get "Google Now iOS simply beautiful". Then a two-skip trigram set is produced: {'Google Now iOS', 'Now iOS simply', 'iOS simply beautiful', 'Google iOS simply', 'Google simply beautiful', 'Now simply beautiful', 'Google Now beautiful', 'Google Now simply', 'Now iOS beautiful'}. We transform every raw tweet into its processed version and then into its corresponding trigram set.
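As a concrete illustration of this step, the following is a minimal sketch of two-skip trigram generation in the spirit of Guthrie et al. (2006); the function name and the skip-counting convention (at most k tokens skipped in total within each trigram) are our own assumptions rather than the authors' exact implementation.

```python
from itertools import combinations

def two_skip_trigrams(tokens, k=2):
    # All ordered triples of tokens that skip at most k intervening
    # tokens in total (the k-skip-n-gram idea of Guthrie et al., 2006).
    trigrams = []
    for i, j, m in combinations(range(len(tokens)), 3):
        skipped = (m - i) - 2   # tokens inside the span not part of the trigram
        if skipped <= k:
            trigrams.append((tokens[i], tokens[j], tokens[m]))
    return trigrams

# Example: the preprocessed tweet from the text above
print(two_skip_trigrams("Google Now iOS simply beautiful".split()))
```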
2.2. LSA Word Similarity Model

Our LSA word similarity model is a revised version of the one we used in the 2013 and 2014 SemEval semantic textual similarity tasks (Han et al., 2013; Kashyap et al., 2014). LSA relies on the fact that semantically similar words (e.g., cat and feline, or nurse and doctor) are more likely to occur near one another in text, so evidence for word similarity can be computed from a statistical analysis of a large text corpus. We extract raw word co-occurrence statistics from a portion of the 2007 Stanford WebBase dataset (Stanford, 2001). We performed part-of-speech tagging and lemmatization on the corpus using the Stanford POS tagger (Toutanova et al., 2000). Word/term co-occurrences were counted with a sliding window of fixed size over the entire corpus. We generated two co-occurrence models using window sizes of ±1 and ±4. The smaller window provides more precise context, which is better for comparing words of the same part of speech, while the larger one is more suitable for computing the semantic similarity between words of different syntactic categories.

Our word co-occurrence models are based on a predefined vocabulary of 22,000 common English open-class words and noun phrases, extended with about 2,000 verb phrases from WordNet. The final dimensions of our word/phrase co-occurrence matrices are 29,000 × 29,000 when words/phrases are POS tagged. We apply singular value decomposition to the word/phrase co-occurrence matrices (Burgess et al., 1998) after transforming the raw co-occurrence counts into their log frequencies, and select the 300 largest singular values. The LSA similarity between two words/phrases is then defined as the cosine similarity of their corresponding LSA vectors generated by the SVD transformation.

To compute the semantic similarity of two text sequences, we use the simple align-and-penalize algorithm described in (Han et al., 2013) with a few improvements, including some sets of common disjoint concepts and an enhanced stop word list.
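A minimal sketch of the LSA word-similarity steps described above — window-based co-occurrence counting, log transformation, truncated SVD, and cosine similarity — is shown below. It uses a toy in-memory matrix and plain NumPy; the real model uses a 29,000 × 29,000 matrix over a POS-tagged vocabulary with 300 singular values, and the function names here are illustrative, not the authors' code.

```python
import numpy as np

def build_cooccurrence(sentences, vocab, window=4):
    # Count word co-occurrences within a +/-window sliding window.
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, w in enumerate(sent):
            if w not in index:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in index:
                    counts[index[w], index[sent[j]]] += 1
    return counts

def lsa_vectors(counts, k=300):
    # Log-transform the raw counts and keep the k largest singular values.
    log_counts = np.log1p(counts)
    U, S, _ = np.linalg.svd(log_counts, full_matrices=False)
    k = min(k, len(S))
    return U[:, :k] * S[:k]

def lsa_similarity(vectors, index, w1, w2):
    # Cosine similarity of the two LSA word vectors.
    v1, v2 = vectors[index[w1]], vectors[index[w2]]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))
```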

2.3. Features

For two trigram sets, we compute the semantic similarity of every possible pair of trigrams across the two sets using the UMBC Semantic Textual Similarity system.


For each pair of tweets (T1 and T2), six features are produced (the pairwise trigram comparison is sketched after this list):
• Feature 1 = semantic similarity value between the two whole tweets (with abbreviations and slang replaced and stop words removed)
• Feature 2 =
• Feature 3 =
• Feature 4 =
• Feature 5 =
• Feature 6 = the weighted average, by tweet length, of the two averages above.
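As a rough illustration of how such features can be derived from two trigram sets, the sketch below computes all pairwise trigram similarities and one plausible length-weighted average; the sim() callback stands in for the UMBC STS call, and the aggregation choices shown are our assumptions, not the exact definitions of Features 2–5.

```python
def pairwise_similarities(trigrams1, trigrams2, sim):
    # Similarity of every trigram pair across the two sets, using a
    # caller-supplied sim(text_a, text_b) function (a stand-in for the
    # UMBC Semantic Textual Similarity service).
    return [sim(" ".join(a), " ".join(b)) for a in trigrams1 for b in trigrams2]

def length_weighted_average(avg1, avg2, len1, len2):
    # One plausible reading of Feature 6: the two per-tweet averages,
    # weighted by the lengths of the corresponding tweets (an assumption).
    return (len1 * avg1 + len2 * avg2) / (len1 + len2)
```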

2.4. Training

We used the LIBSVM system (Chang and Lin, 2011) to train a logistic regression model and a support vector regression model, running a grid search to find the best parameters for each. All training data (13,063 pairs of tweets) were used to train the models without discarding any debatable data. We tested the contribution of each feature through ablation experiments on the development data, in which each feature was deleted in turn. Table 1 shows the results of each feature ablation run.
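As a hedged illustration of this training setup, the sketch below uses scikit-learn's LogisticRegression and SVR with GridSearchCV in place of LIBSVM; the parameter grids, the 0.5 decision threshold on the regression output, and the placeholder feature matrix are our assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Placeholder data: one row of six features per tweet pair, binary paraphrase labels.
X = np.random.rand(200, 6)
y = np.random.randint(0, 2, 200)

# Grid search over the regularization strength for the logistic regression run.
logit = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
logit.fit(X, y)

# Grid search over C and epsilon for the support vector regression run; its
# continuous output is thresholded to produce a binary paraphrase judgment.
svr = GridSearchCV(SVR(kernel="rbf"),
                   {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1]}, cv=5)
svr.fit(X, y.astype(float))

paraphrase_pred = (svr.best_estimator_.predict(X) >= 0.5).astype(int)
```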

Feature deleted   F1      Precision   Recall
Feature 1         0.700   0.709       0.728
Feature 2         0.697   0.706       0.726
Feature 3         0.697   0.706       0.726
Feature 4         0.691   0.700       0.722
Feature 5         0.696   0.706       0.726
Feature 6         0.695   0.705       0.725

Table 1. Performance of our system on the development data when each feature is removed in turn.

From Table 1, we can see that the feature of lowest performance is Feature 1, the semantic similarity computed on the entire tweets without using the skip-gram technique. We nevertheless keep Feature 1, since the performance differences among the six features are not significant. Table 2 shows the performance of each model on the development data.

Model                        F1      Precision   Recall
Logistic Regression          0.697   0.706       0.726
Support Vector Regression    0.691   0.707       0.726

Table 2. Performance of the two models on the development data.

Since the performance of the two systems is almost the same, we decided to submit one run of each.

3. Results and Discussion

We submitted two runs. Run 1 (Logistic Regression) obtained an F1 score of 0.599, a precision of 0.651 and a recall of 0.554; Run 2 (Support Vector Regression) obtained an F1 of 0.590, a precision of 0.646 and a recall of 0.543. When ranked, we placed eighteenth (Run 1) and nineteenth (Run 2) out of the 38 runs. The first-ranked run has an F1 score of 0.674. The full distribution of F1 scores is shown in Figure 1. The relatively low ranking of our system might be the result of several factors.

The first factor is the prevalence of neologisms, misspellings, informal slang and abbreviations in tweets. Better preprocessing to make the tweets closer to normal text might improve our results.

Another factor is the UMBC STS system itself. Examples of input pairs on which the UMBC STS system performs poorly are shown in Table 3. We can group these into two sets, each associated with a problem in performing the paraphrase task.

The first problem is that a slang word may have different meanings when used in different genres. As we can see in the first example in Table 3, 'bombs' does not mean 'a container filled with explosive' but is a synonym of 'home runs' when mentioned in a sports or baseball context. We can recognize this meaning by reading sports articles, but it is not included in any dictionary or in WordNet. Thus our system predicts that the two tweets, one containing 'bombs' and the other 'home runs', have low semantic similarity and so are not paraphrases.

The second problem involves out-of-vocabulary words, such as the named entities found in the examples in Table 3. Tweet 1 of the second example, 'NOW YOU SEE ME and AFTER EARTH Cant Outpace FAST FURIOUS 6', is full of movie names whose meanings our STS system cannot recognize. We could solve this problem by adding named entity recognition to the system. Another potential solution would be to adopt a simple string-matching component. With string matching, we could handle out-of-vocabulary situations like the third and fourth examples by matching 'orr' and 'chara' between the two tweets of the third example and 'new ciroc' between the tweets of the fourth.

To improve the performance of our STS system, which is trained on a corpus consisting mostly of reasonably well-written narrative text, we need to expand the training corpus. Training an LSA model on a collection of tweets, or on a mixture of tweets and narrative text, and adding a named entity recognition step, may lead to better results.

Figure 1. Ranked F1 scores of the 38 runs.

#   Tweet 1                                                       Tweet 2                                       System   Gold
1   chris davis is 44 with two bombs                              Chris Davis has 2 home runs tonight           False    True
2   NOW YOU SEE ME and AFTER EARTH Cant Outpace FAST FURIOUS 6    I wanna see the movie after earth             True     False
3   Orr with a big hit on Chara                                   I keep waiting for the chara vs orr fight     False    True
4   New Ciroc Amaretto I NEED THAT                                Oh shit I gotta try that new ciroc flavor     False    True

Table 3. Examples of input pairs on which our system performed poorly.

4. Conclusion

We have described the system we submitted to SemEval 2015 Task 1, Paraphrase and Semantic Similarity in Twitter. We preprocess tweets into sets of two-skip trigrams and measure their semantic similarity using the UMBC-STS system. We compute statistics such as the maximum and average similarity over trigram pairs and use two regression models, logistic regression and support vector regression. Our best performing run achieved an F1 score of 0.599 and was ranked eleventh out of eighteen teams.

Acknowledgments

Partial support for this research was provided by grants from the National Science Foundation (1228198 and 1250627) and a grant from the Maryland Industrial Partnerships program.


References

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010).

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (HLT-NAACL).

William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP, Jeju Island, Korea, pp. 546-556.

Curt Burgess, Kay Livesay and Kevin Lund. 1998. Explorations in context space: Words, sentences, discourse. Discourse Processes, 25:211–257.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27.

David Guthrie, Ben Allison, Wei Liu, Louise Guthrie and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), pp. 1-4.

Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics.

Lushan Han, Tim Finin, Paul McNamee, Anupam Joshi and Yelena Yesha. 2013. Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy. IEEE Transactions on Knowledge and Data Engineering, IEEE Computer Society, v25n6, pp. 1307-1322.

Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan Weese. 2013. UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems. In Second Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, June.

Lushan Han. 2014. Schema Free Querying of Semantic Data. Ph.D. Dissertation, University of Maryland, Baltimore County, August 2014.

Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi and Tim Finin. 2014. Meerkat Mafia: Multilingual and Cross-Level Semantic Textual Similarity Systems. In International Workshop on Semantic Evaluation, Association for Computational Linguistics.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Rada Mihalcea, Courtney Corley and Carlo Strapparava. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the National Conference on Artificial Intelligence (AAAI 2006), Boston, Massachusetts, pp. 775-780.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8).

Long Qiu, Min-Yen Kan and Tat-Seng Chua. 2006. Paraphrase recognition via dissimilarity significance classification. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 18–26, Sydney, Australia, July. Association for Computational Linguistics.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems (NIPS 2011).

Stanford. 2001. Stanford WebBase project. http://bit.ly/WebBase.

Kristina Toutanova, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty and Michel Galley. 2000. Stanford log-linear part-of-speech tagger. http://nlp.stanford.edu/software/tagger.shtml.

Wei Xu, Chris Callison-Burch and William B. Dolan. 2015. SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Yitao Zhang and Jon Patrick. 2005. Paraphrase identification by text canonicalization. In Proceedings of the Australasian Language Technology Workshop 2005, pages 160–166, Sydney, Australia, December.
