Unsupervised Sparse Vector Densification for Short Text Similarity

Yangqiu Song and Dan Roth
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{yqsong,danr}@illinois.edu

Abstract

Sparse representations of text, such as bag-of-words models or extended explicit semantic analysis (ESA) representations, are commonly used in many NLP applications. However, for short texts, the similarity between two such sparse vectors is not accurate due to the small term overlap. While there have been multiple proposals for dense representations of words, measuring similarity between short texts (sentences, snippets, paragraphs) requires combining these token level similarities. In this paper, we propose to combine ESA representations and word2vec representations as a way to generate denser representations and, consequently, a better similarity measure between short texts. We study three densification mechanisms that involve aligning sparse representations via many-to-many, many-to-one, and one-to-one mappings. We then show the effectiveness of these mechanisms on measuring similarity between short texts.

1 Introduction

The bag-of-words model has been used for many applications as the state-of-the-art method for tasks such as document classification and information retrieval. It represents each text as a bag-of-words, and computes the similarity, e.g., the cosine value, between two sparse vectors in the high-dimensional space. When the contextual information is insufficient, e.g., due to the short length of the document, explicit semantic analysis (ESA) has been used as a way to enrich the text representation (Gabrilovich and Markovitch, 2006; Gabrilovich and Markovitch, 2007). Instead of using only the words in a document, ESA uses a bag-of-concepts retrieved from Wikipedia to represent the text. The similarity between two texts can then be computed in this enriched concept space.

Both bag-of-words and bag-of-concepts models suffer from the sparsity problem. Because both models use sparse vectors to represent text, when comparing two pieces of text, the similarity can be zero even when the text snippets are highly related but use different vocabulary. We can expect that the two texts are related, but the similarity value does not reflect that. ESA, despite augmenting the lexical space with relevant Wikipedia concepts, still suffers from the sparsity problem. We illustrate this problem with the following simple experiment, done by choosing documents from the "rec.autos" group in the 20-newsgroups data set¹. For both the documents and the label description "cars" (here we follow the description shown in (Chang et al., 2008; Song and Roth, 2014)), we computed 500 concepts using ESA. Then we identified the concepts that appear both in the document ESA representation and in the label ESA representation. The average sizes of this intersection (number of overlapping concepts in the document and label representations) are shown in Table 1. In addition to the original documents, we also split each document into 2, 4, 8, and 16 equal-length parts, computed the ESA representation of each, and then the intersection with the ESA representation of the label. Table 1 shows that the number of concepts shared by the label and the document representation decreases significantly, even if not as significantly as the drop in the document size. For example, there are on average 8 concepts in the intersection of two vectors with 500 non-zero concepts when we split each document into 16 parts.

¹ http://qwone.com/~jason/20Newsgroups/
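As a concrete illustration of how this overlap statistic can be computed, here is a minimal sketch (ours, not released code); `esa_top_concepts` is a hypothetical helper standing in for an ESA implementation that returns the k highest-scoring Wikipedia concepts for a text.

```python
# Sketch of the Table 1 measurement. Hypothetical helper:
#   esa_top_concepts(text, k) -> list of the k top-ranked Wikipedia
#   concept IDs for `text` under an ESA inverted index.

def concept_overlap(doc_text, label_text, k=500):
    # Size of the intersection between the two ESA concept sets.
    doc_concepts = set(esa_top_concepts(doc_text, k))
    label_concepts = set(esa_top_concepts(label_text, k))
    return len(doc_concepts & label_concepts)

def split_into_parts(words, n_parts):
    # Split a tokenized document into n_parts approximately equal-length
    # chunks, as done for the 2/4/8/16-way splits in Table 1.
    size = max(1, len(words) // n_parts)
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return [" ".join(chunk) for chunk in chunks[:n_parts]]
```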



Table 1: Average sizes of the intersection between the ESA concept representations of documents and labels. Both the documents and the label are represented with 500 Wikipedia concepts. Documents are split into different lengths.

# of splits   Avg. # of words per doc.   Avg. # of concepts
1             209.6                      23.1
2             104.8                      18.1
4             52.4                       13.8
8             26.2                       10.6
16            13.1                       8.4

When there are fewer overlapping terms between two pieces of text, this can cause a mismatch or a biased match and result in a less accurate comparison. In this paper, we propose to use unsupervised approaches to improve the representation, along with a corresponding similarity measure between these representations. Our contribution is twofold. First, we incorporate the popular word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) representations into the ESA representation, and show that incorporating semantic relatedness between Wikipedia titles can indeed help the similarity measure between short texts. Second, we propose and evaluate three mechanisms for comparing the resulting representations. We verify the superiority of the proposed methods on three different NLP tasks.

2 Sparse Vector Densification

In this section, we introduce a way to compute the similarity between two sparse vectors by augmenting the original similarity measure, i.e., cosine similarity. Suppose we have two vectors x = (x_1, \ldots, x_V)^T and y = (y_1, \ldots, y_V)^T, where V is the vocabulary size. Traditional cosine similarity computes the dot product between these two vectors and normalizes it by their norms: \cos(x, y) = \frac{x^T y}{\|x\| \cdot \|y\|}. This requires each dimension of x to be aligned with the same dimension of y. Note that for sparse vectors x and y, most of the elements can be zero. Aligning the indices can then result in zero similarity even though the two pieces of text are related. Thus, we propose to align different indices of x and y together to increase the similarity value.

We can rewrite the vectors x and y as x = \{x_{a_1}, \ldots, x_{a_{n_x}}\} and y = \{y_{b_1}, \ldots, y_{b_{n_y}}\}, where a_i and b_j are the indices of the non-zero terms in x and y (1 \le a_i, b_j \le V), and x_{a_i} and y_{b_j} are the weights associated with the corresponding terms in the vocabulary. Suppose there are n_x and n_y non-zero terms in x and y respectively. Then cosine similarity can be rewritten as:

\cos(x, y) = \frac{\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} \delta(a_i - b_j)\, x_{a_i} y_{b_j}}{\|x\| \cdot \|y\|},   (1)

where \delta(\cdot) is the Dirac function: \delta(0) = 1 and \delta(\text{other}) = 0. Suppose we can compute the similarity between terms a_i and b_j, denoted \phi(a_i, b_j); then the problem is how to aggregate the similarities between all the a_i's and b_j's to augment the original cosine similarity.

2.1 Similarity Augmentation

The most intuitive way to integrate the similarities between terms is to average them:

S_A(x, y) = \frac{1}{n_x \|x\| \cdot n_y \|y\|} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} x_{a_i} y_{b_j} \phi(a_i, b_j).   (2)

This similarity averages all the pairwise similarities between the terms a_i and b_j. However, we can expect many of the similarities \phi(a_i, b_j) to be close to zero. In this case, instead of introducing relatedness between nonidentical terms, averaging will also introduce noise. Therefore, we also consider an alignment mechanism that we implement greedily via a maximum matching mechanism:

S_M(x, y) = \frac{1}{\|x\| \cdot \|y\|} \sum_{i=1}^{n_x} x_{a_i} y_{b_j} \max_j \phi(a_i, b_j).   (3)

We choose j as \arg\max_{j'} \phi(a_i, b_{j'}) and substitute the similarity \phi(a_i, b_j) between terms a_i and b_j into the final similarity between x and y. Note that this similarity is not symmetric. Thus, if one needs a symmetric similarity, it can be computed by averaging the two similarities S_M(x, y) and S_M(y, x).

The above two similarity measurements are simple and intuitive. We can think of S_A(x, y) as leveraging a many-to-many term mapping, while S_M(x, y) uses only a one-to-many term mapping.
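To make Eqs. (2) and (3) concrete, here is a minimal Python sketch (our illustration, not the authors' released code), assuming each sparse vector is given as a dict from term index to weight and `phi` is any term-similarity function such as the one described in Section 2.2.

```python
import math

def norm(vec):
    # Euclidean norm of a sparse vector {term_index: weight}.
    return math.sqrt(sum(w * w for w in vec.values()))

def s_average(x, y, phi):
    # Eq. (2): many-to-many mapping, averaging all pairwise similarities.
    total = sum(wx * wy * phi(a, b)
                for a, wx in x.items() for b, wy in y.items())
    return total / (len(x) * norm(x) * len(y) * norm(y))

def s_max(x, y, phi):
    # Eq. (3): each term a_i in x is greedily aligned with its best
    # match argmax_j phi(a_i, b_j) in y.
    total = 0.0
    for a, wx in x.items():
        b = max(y, key=lambda b_: phi(a, b_))
        total += wx * y[b] * phi(a, b)
    return total / (norm(x) * norm(y))

def s_max_symmetric(x, y, phi):
    # Symmetrized variant: average of S_M(x, y) and S_M(y, x).
    return 0.5 * (s_max(x, y, phi) + s_max(y, x, phi))
```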

Figure 1: Accuracy of dataless classification using ESA and Dense-ESA with different numbers of concepts. Panels: (a) rec.autos vs. sci.electronics (full doc.); (b) rec.autos vs. sci.electronics (1/16 doc.); (c) rec.autos vs. rec.motorcycles (full doc.); (d) rec.autos vs. rec.motorcycles (1/16 doc.).

S_A(x, y) can introduce small and noisy similarity values between terms. While S_M(x, y) essentially aligns each term in x with its best match in y, we run the risk that multiple components of x will select the same element in y. To ensure that all the non-zero terms in x and y are matched, we propose to constrain the matching by disallowing many-to-one mappings. We do that by using a similarity metric based on the Hungarian method (Papadimitriou and Steiglitz, 1982). The Hungarian method is a combinatorial optimization algorithm that solves the bipartite graph matching problem by finding an optimal assignment matching the two sides of the graph on a one-to-one basis. Assume that we run the Hungarian method on the pair \{x, y\}, and let h(a_i) = b_j denote the outcome of the algorithm, that is, a_i is aligned with b_j. (We assume here, for simplicity, that n_x = n_y; we can always achieve that by adding some zero-weighted terms that are not aligned.) Then we define the similarity as:

S_H(x, y) = \frac{1}{\|x\| \cdot \|y\|} \sum_{i=1}^{n_x} x_{a_i} y_{h(a_i)} \phi(a_i, h(a_i)).   (4)

2.2 Term Similarity Measure

To evaluate the term similarity \phi(\cdot, \cdot), we use local contextual similarity based on distributed representations. We adopt the word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) approach to obtain a dense representation of words, where the representation of each word is predicted based on the distribution of context words in a window around it. We trained word2vec on the Wikipedia dump data using the default parameters (CBOW model with a window size of five). For each word, we finally obtained a 200-dimensional vector. If a term is a phrase, we simply average the word vectors of the phrase to obtain its representation, following the original word2vec approach (Mikolov et al., 2013a; Mikolov et al., 2013b). Let a and b denote the vectors of two terms. To evaluate the similarity between the two terms, for the average approach of Eq. (2) we use the RBF kernel over the two vectors, \exp\{-\|a - b\|^2 / (0.03 \cdot \|a\| \cdot \|b\|)\}, as the similarity in all the experiments, since it has the good property of cutting off terms with small similarities. For the max and Hungarian approaches of Eqs. (3) and (4), we simply use the cosine similarity between the two word2vec vectors. In addition, we cut off all similarities below a threshold γ and map them to zero.
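The following sketch (ours, with assumed helper names) implements Eq. (4) with scipy's `linear_sum_assignment` as the Hungarian solver, together with the two term similarities just described; the γ default and the zero-padding of the shorter vector follow the simplifications stated above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rbf_term_sim(a, b):
    # Section 2.2 similarity for the average approach (Eq. 2):
    # exp{-||a - b||^2 / (0.03 * ||a|| * ||b||)} over two word2vec vectors.
    return np.exp(-np.sum((a - b) ** 2)
                  / (0.03 * np.linalg.norm(a) * np.linalg.norm(b)))

def cos_term_sim(a, b, gamma=0.85):
    # Cosine similarity with the threshold-gamma cutoff used for the
    # max and Hungarian approaches (Eqs. 3 and 4).
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim if sim >= gamma else 0.0

def s_hungarian(x, y, phi):
    # Eq. (4): x, y are sparse vectors {term: weight}; phi is a term
    # similarity. The shorter side is implicitly padded with zero-weight
    # dummy terms (zero rows/columns in the score matrix).
    xs, ys = list(x), list(y)
    n = max(len(xs), len(ys))
    score = np.zeros((n, n))
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            # One reasonable assignment objective: the weighted term
            # similarity; negate it so the solver's minimum is our maximum.
            score[i, j] = x[a] * y[b] * phi(a, b)
    rows, cols = linear_sum_assignment(-score)
    total = score[rows, cols].sum()
    nx = np.linalg.norm(list(x.values()))
    ny = np.linalg.norm(list(y.values()))
    return total / (nx * ny)
```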

3 Experiments

We experiment on three data sets. We use dataless classification (Chang et al., 2008; Song and Roth, 2014) over the 20-newsgroups data set to verify the correctness of our argument about the short text problem, and use two short text data sets to evaluate document similarity measurement and event classification for sentences.

3.1 Dataless Classification

Dataless classification uses the similarity between documents and labels in an enriched "semantic" space to determine in which category a given document is. In this experiment, we used the label descriptions provided by (Chang et al., 2008). It has been shown that ESA outperforms other representations for dataless classification (Chang et al., 2008; Song and Roth, 2014). Thus, we chose ESA as our baseline method.

Table 2: Accuracy of dataless classification using ESA and Dense-ESA with 500 dimensions.

                        rec.autos vs. sci.electronics (easy)    rec.autos vs. rec.motorcycles (difficult)
Method                  Full document    Short (1/16 doc.)      Full document    Short (1/16 doc.)
ESA (Cosine)            87.75%           56.55%                 80.95%           46.64%
Dense-ESA (Average)     87.80%           64.67%                 81.11%           59.38%
Dense-ESA (Max)         87.10%           64.34%                 84.30%           59.11%
Dense-ESA (Hungarian)   88.85%           65.95%                 82.15%           59.65%

To demonstrate how the length of documents affects the classification result, we used both the full documents and the 16 split parts (the parts are associated with the same label as the original document). To demonstrate the impact of densification, we selected two problems as an illustration: "rec.autos vs. sci.electronics" and "rec.autos vs. rec.motorcycles." The former problem is relatively easy since the two groups belong to different super-classes, while the latter is more difficult since they are under the same super-class. The value of the threshold γ for the max matching and Hungarian based densification is set to 0.85 empirically.

Figure 1 shows the results of dataless classification using ESA and ESA with densification (Dense-ESA) with different numbers of Wikipedia concepts as the representation dimensionality. We can see that Dense-ESA significantly improves the dataless classification results. As shown in Table 2, while the max matching and Hungarian matching based methods typically give the best results, the improvements are more significant for shorter documents and for more difficult problems. Figure 2 highlights this observation.

Figure 2: Boxplot of similarity scores for "rec.autos vs. sci.electronics" (easy, left) and "rec.autos vs. rec.motorcycles" (difficult, right). For ESA and for Dense-ESA with max matching (Eq. (3)), we compute S(d, l_1) and S(d, l_2) between a document d and the labels l_1 and l_2, and then compute S(d) = S(d, l_1) − S(d, l_2). For each ground-truth label, we draw the distribution of S(\cdot) with outliers. For example, "ESA:autos" shows the distribution of S(\cdot) for the data with label "rec.autos." The t-test results show that the distributions of the different labels are significantly different (99%). We can see that Dense-ESA pulls apart the distributions of the different labels and that the separation is more significant for the more difficult problem (right).

3.2 Document Similarity

We used the data set provided by Lee et al.² (Lee et al., 2005) to evaluate pairwise short document similarity. There are 50 documents, and the average number of words per document is 80.2. We averaged all the human annotations for the same document pair to obtain the similarity score. After computing the scores for pairs of documents, we used Spearman's correlation to evaluate the results. A larger correlation score means that the similarity is more consistent with the human annotation. The best word-level similarity result is close to 0.5 (Lee et al., 2005). We tried the cosine similarity between ESA representations and also Dense-ESA.

Table 3: Spearman's correlation of document similarity using ESA and Dense-ESA with 500 concepts.

Method                  Spearman's correlation
ESA (Cosine)            0.5665
Dense-ESA (Average)     0.5814
Dense-ESA (Max)         0.5888
Dense-ESA (Hungarian)   0.6003

² http://faculty.sites.uci.edu/mdlee/similarity-data/
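The evaluation loop for this data set is then only a few lines; a sketch with scipy, where `densified_sim` stands for whichever of Eqs. (2)-(4) is being tested, and the variable names are ours.

```python
from scipy.stats import spearmanr

def evaluate_document_similarity(docs, human_scores, densified_sim):
    # docs: {doc_id: sparse ESA vector}.
    # human_scores: {(id1, id2): similarity}, already averaged over all
    # annotators for that document pair.
    pairs = sorted(human_scores)
    predicted = [densified_sim(docs[i], docs[j]) for i, j in pairs]
    gold = [human_scores[p] for p in pairs]
    rho, _ = spearmanr(predicted, gold)
    return rho
```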

The value of γ for the max matching based densification is set to 0.95, and for the Hungarian based densification it is set to 0.89. We can see from Table 3 that ESA is better than the word-based method, and that all versions of Dense-ESA outperform the original ESA.

3.3 Event Classification

In this experiment, we chose the ACE2005³ data set to test how well we can classify sentences into event types without any training. There are eight types of events: life, movement, conflict, contact, etc. We chose all the sentences that contain event information as the data set. Following the dataless classification protocol, we compare the similarity between sentences and label descriptions to determine the event types. There are 3,644 unique sentences with events, including 2,712 sentences having only one event type, 421 having two event types, and 30 having three event types. The average length of the sentences is 23.71 words. Thus, this is a multi-label classification problem. To test the approaches, we used five-fold cross validation to select the thresholds for each class that determine whether a sentence belongs to an event type. The value of the threshold γ for both the max matching and Hungarian based densification is also set to 0.85 empirically. We report the mean and standard deviation over the five runs. The results are shown in Table 4. We can see that Dense-ESA also outperforms ESA.

Table 4: F1 of sentence event type classification using ESA and Dense-ESA with 500 concepts.

Method                  F1 (mean ± std)
ESA (Cosine)            0.469 ± 0.011
Dense-ESA (Average)     0.451 ± 0.010
Dense-ESA (Max)         0.481 ± 0.008
Dense-ESA (Hungarian)   0.475 ± 0.016

³ http://www.itl.nist.gov/iad/mig/tests/ace/2005/
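One way to realize the per-class threshold selection described above is sketched below; this is our reading of the protocol, meant to be run once per training fold of the five-fold cross validation, and the candidate grid is an assumption.

```python
import numpy as np

def tune_thresholds(sims, labels, grid=np.linspace(0.0, 1.0, 101)):
    # sims: (n_sentences, n_types) matrix of densified similarities between
    # sentences and event-type descriptions; labels: binary gold matrix.
    # For each event type, pick the threshold that maximizes F1 on the
    # training fold; the chosen thresholds are applied to the held-out fold.
    thresholds = []
    for t in range(sims.shape[1]):
        best_f1, best_th = -1.0, grid[0]
        positives = labels[:, t] == 1
        for th in grid:
            pred = sims[:, t] >= th
            tp = np.sum(pred & positives)
            prec = tp / max(pred.sum(), 1)
            rec = tp / max(positives.sum(), 1)
            f1 = 2 * prec * rec / max(prec + rec, 1e-9)
            if f1 > best_f1:
                best_f1, best_th = f1, th
        thresholds.append(best_th)
    return np.array(thresholds)
```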
4 Related Work

ESA (Gabrilovich and Markovitch, 2006; Gabrilovich and Markovitch, 2007) and distributed word representations (Ratinov and Roth, 2009; Turian et al., 2010; Collobert et al., 2011; Mikolov et al., 2013a; Mikolov et al., 2013b; Pennington et al., 2014) are popular text representations that encode world knowledge. Recently, several representations were proposed to extend word representations to phrases or sentences (Lu and Li, 2013; Hermann and Blunsom, 2014; Passos et al., 2014; Kalchbrenner et al., 2014; Le and Mikolov, 2014; Hu et al., 2014; Sutskever et al., 2014; Zhao et al., 2015). In this paper, we evaluate how to combine two off-the-shelf representations to densify the similarity between text data.

Yih et al. also used average matching and a different maximum matching for the QA problem (Yih et al., 2013). However, their sparse representation is still at the word level, while ours is based on ESA. Interestingly, ideas related to our average matching mechanism have also been proposed in the computer vision community, namely the set kernel (or set similarity) (Smola et al., 2007; Gretton et al., 2012; Xiong et al., 2013).

5 Conclusion

In this paper, we studied mechanisms for combining two popular representations of text, i.e., ESA and word2vec, to enhance the computation of short text similarity. We proposed three different mechanisms to compute the similarity between these representations, and demonstrated, using three different data sets, that the proposed method outperforms the traditional ESA.

Acknowledgments

This work is supported by the Multimodal Information Access & Synthesis Center at UIUC, part of CCICADA, a DHS Science and Technology Center of Excellence, by the Army Research Laboratory (ARL) under agreement W911NF-09-2-0053, and by DARPA under agreement number FA8750-13-2-0008. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of these agencies or the U.S. Government.


References

M. Chang, L. Ratinov, D. Roth, and V. Srikumar. 2008. Importance of semantic representation: Dataless classification. In AAAI, pages 830–835.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

E. Gabrilovich and S. Markovitch. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In AAAI, pages 1301–1306.

E. Gabrilovich and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI, pages 1606–1611.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. 2012. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773.

K. M. Hermann and P. Blunsom. 2014. Multilingual models for compositional distributed semantics. In ACL, pages 58–68.

B. Hu, Z. Lu, H. Li, and Q. Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In NIPS, pages 2042–2050.

N. Kalchbrenner, E. Grefenstette, and P. Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL, pages 655–665.

Q. V. Le and T. Mikolov. 2014. Distributed representations of sentences and documents. In ICML, pages 1188–1196.

M. D. Lee, B. Pincombe, and M. Welsh. 2005. An empirical evaluation of models of text document similarity. In CogSci, pages 1254–1259.

Z. Lu and H. Li. 2013. A deep architecture for matching short texts. In NIPS, pages 1367–1375.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013a. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

T. Mikolov, W.-t. Yih, and G. Zweig. 2013b. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

C. H. Papadimitriou and K. Steiglitz. 1982. Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs, NJ: Prentice-Hall.

A. Passos, V. Kumar, and A. McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In CoNLL, pages 78–86.

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL, pages 147–155.

A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. 2007. A Hilbert space embedding for distributions. In ALT, pages 13–31.

Y. Song and D. Roth. 2014. On dataless hierarchical text classification. In AAAI, pages 1579–1585.

I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

J. Turian, L. Ratinov, and Y. Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In ACL, pages 384–394.

L. Xiong, B. Póczos, and J. G. Schneider. 2013. Efficient learning on point sets. In ICDM, pages 847–856.

W. Yih, M. Chang, C. Meek, and A. Pastusiak. 2013. Question answering using enhanced lexical semantic models. In ACL, pages 1744–1753.

Y. Zhao, Z. Liu, and M. Sun. 2015. Phrase type sensitive tensor indexing model for semantic composition. In AAAI.
