Implementation Notes for the Soft Cosine Measure
Vít Novotný
Masaryk University, Faculty of Informatics
Brno, Czech Republic
[email protected]

ABSTRACT

The standard bag-of-words vector space model (vsm) is efficient, and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning, but different terminology. To overcome this limitation, Sidorov et al. [14] proposed the Soft Cosine Measure (scm) that incorporates term similarity relations. Charlet and Damnati [2] showed that the scm is highly effective in question answering (qa) systems. However, the orthonormalization algorithm proposed by Sidorov et al. [14] has an impractical time complexity of $O(n^4)$, where $n$ is the size of the vocabulary. In this paper, we prove a tighter lower worst-case time complexity bound of $O(n^3)$. We also present an algorithm for computing the similarity between documents and we show that its worst-case time complexity is $O(1)$ given realistic conditions. Lastly, we describe the implementation in general-purpose vector databases such as Annoy, and Faiss and in the inverted indices of text search engines such as Apache Lucene, and ElasticSearch. Our results enable the deployment of the scm in real-world information retrieval systems.

KEYWORDS

Vector Space Model, computational complexity, similarity measure

ACM Reference Format:
Vít Novotný. 2018. Implementation Notes for the Soft Cosine Measure. In The 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3269206.3269317

1 INTRODUCTION

The standard bag-of-words vector space model (vsm) [13] represents documents as real vectors. Documents are expressed in a basis where each basis vector corresponds to a single term, and each coordinate corresponds to the frequency of a term in a document. Consider the documents

$d_1$ = “When Antony found Julius Caesar dead”, and
$d_2$ = “I did enact Julius Caesar: I was killed i’ the Capitol”

represented in a basis $\{\alpha_i\}_{i=1}^{14}$ of $\mathbb{R}^{14}$, where the basis vectors correspond to the terms in the order of first appearance. Then the corresponding document vectors $v_1$, and $v_2$ would have the following coordinates in $\alpha$:

$$(v_1)_\alpha = [\,1\ 1\ 1\ 1\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\,]^T, \text{ and}$$
$$(v_2)_\alpha = [\,0\ 0\ 0\ 1\ 1\ 0\ 2\ 1\ 1\ 1\ 1\ 1\ 1\ 1\,]^T.$$

Assuming $\alpha$ is orthonormal, we can take the inner product of the $\ell^2$-normalized vectors $v_1$, and $v_2$ to measure the cosine of the angle (i.e. the cosine similarity) between the documents $d_1$, and $d_2$:

$$\Big\langle \frac{v_1}{\|v_1\|}, \frac{v_2}{\|v_2\|} \Big\rangle = \frac{(v_1)_\alpha^T\,(v_2)_\alpha}{\sqrt{(v_1)_\alpha^T (v_1)_\alpha}\,\sqrt{(v_2)_\alpha^T (v_2)_\alpha}} \approx 0.23.$$

Intuitively, this underestimates the true similarity between $d_1$, and $d_2$. Assuming $\alpha$ is orthogonal but not orthonormal, and that the terms Julius, and Caesar are twice as important as the other terms, we can construct a diagonal change-of-basis matrix $W = (w_{ij})$ from $\alpha$ to an orthonormal basis $\beta$, where $w_{ii}$ corresponds to the importance of a term $i$. This brings us closer to the true similarity:

$$(v_1)_\beta = W(v_1)_\alpha = [\,1\ 1\ 1\ 2\ 2\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\,]^T,$$
$$(v_2)_\beta = W(v_2)_\alpha = [\,0\ 0\ 0\ 2\ 2\ 0\ 2\ 1\ 1\ 1\ 1\ 1\ 1\ 1\,]^T, \text{ and}$$
$$\Big\langle \frac{v_1}{\|v_1\|}, \frac{v_2}{\|v_2\|} \Big\rangle = \frac{(W(v_1)_\alpha)^T\, W(v_2)_\alpha}{\sqrt{(W(v_1)_\alpha)^T W(v_1)_\alpha}\,\sqrt{(W(v_2)_\alpha)^T W(v_2)_\alpha}} \approx 0.53.$$

Since we assume that the bases $\alpha$ and $\beta$ are orthogonal, the terms dead and killed contribute nothing to the cosine similarity despite the clear synonymy, because $\langle \beta_{\text{dead}}, \beta_{\text{killed}} \rangle = 0$. In general, the vsm will underestimate the true similarity between documents that carry the same meaning but use different terminology.
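For illustration, both similarity scores can be reproduced with a few lines of NumPy. The following is a minimal sketch of the computation above, using exactly the vectors and term weights of the running example:

```python
import numpy as np

# Coordinate vectors of d1 and d2 in the basis alpha (14 terms in
# order of first appearance; the term "I" occurs twice in d2).
v1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
v2 = np.array([0, 0, 0, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1], dtype=float)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, assuming an orthonormal basis."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine_similarity(v1, v2), 2))  # 0.23

# Diagonal change-of-basis matrix W: Julius and Caesar (the fourth
# and fifth terms) are twice as important as the other terms.
W = np.diag([1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1.0])

print(round(cosine_similarity(W @ v1, W @ v2), 2))  # 0.53
```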
In this paper, we further develop the soft vsm described by Sidorov et al. [14], which does not assume $\alpha$ is orthogonal and which achieved state-of-the-art results on the question answering (qa) task at SemEval 2017 [2]. In Section 2, we review the previous work incorporating term similarity into the vsm. In Section 3, we restate the definition of the soft vsm and present several computational complexity results. In Section 4, we describe the implementation in vector databases and inverted indices. We conclude in Section 5 by summarizing our results and suggesting future work.

2 RELATED WORK

Most works incorporating term similarity into the vsm published prior to Sidorov et al. [14] remain in an orthogonal coordinate system and instead propose novel document similarity measures. To name a few, Mikawa et al. [8] propose the extended cosine measure, which introduces a metric matrix $Q$ as a multiplicative factor in the cosine similarity formula. $Q$ is the solution of an optimization problem that maximizes the sum of extended cosine measures between each vector and the centroid of the vector’s category. Conveniently, the metric matrix $Q$ can be used directly with the soft vsm, where it defines the inner product between basis vectors. Jimenez et al. [6] equip the multiset vsm with a soft cardinality operator that corresponds to cardinality, but takes term similarities into account.

The notion of generalizing the vsm to non-orthogonal coordinate systems was perhaps first explored by Sidorov et al. [14] in the context of entrance exam question answering, where the basis vectors did not correspond directly to terms, but to n-grams constructed by following paths in syntactic trees. The authors derive the inner product of two basis vectors from the edit distance between the corresponding n-grams. Soft cosine measure (scm) is how they term the formula for computing the cosine similarity between two vectors expressed in a non-orthogonal basis. They also present an algorithm that computes a change-of-basis matrix to an orthonormal basis in time $O(n^4)$. We present an $O(n^3)$ algorithm in this paper.

Charlet and Damnati [2] achieved state-of-the-art results at the qa task at SemEval 2017 [10] by training a document classifier on soft cosine measures between document passages. Unlike Sidorov et al. [14], Charlet and Damnati [2] already use basis vectors that correspond to terms rather than to n-grams. They derive the inner product of two basis vectors both from the edit distance between the corresponding terms, and from the inner product of the corresponding word2vec term embeddings [9].

3 COMPUTATIONAL COMPLEXITY

In this section, we restate the definition of the soft vsm as it was described by Sidorov et al. [14]. We then prove a tighter lower worst-case time complexity bound for computing a change-of-basis matrix to an orthonormal basis. We also prove that under certain assumptions, the inner product is a linear-time operation.

Definition 3.1. Let $\mathbb{R}^n$ be the real $n$-space over $\mathbb{R}$ equipped with the bilinear inner product $\langle\cdot,\cdot\rangle$. Let $\{\alpha_i\}_{i=1}^n$ be the basis of $\mathbb{R}^n$ in which we express our vectors. Let $W = (w_{ij})$ be a diagonal change-of-basis matrix from $\alpha$ to a normalized basis $\{\beta_i\}_{i=1}^n$ of $\mathbb{R}^n$, i.e. $w_{ij} = 0$ for $i \neq j$ and $w_{ii} = \sqrt{\langle\alpha_i, \alpha_i\rangle}$, so that each $\beta_i = \alpha_i / w_{ii}$ has unit length. Let $S = (s_{ij})$ be the metric matrix of the basis $\beta$, i.e. $s_{ij} = \langle\beta_i, \beta_j\rangle$. We call the tuple $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ a soft vsm.

Lemma 3.3. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm and let $x, y \in \mathbb{R}^n$. Then $\langle x, y \rangle = (W(x)_\alpha)^T S\, W(y)_\alpha$.

Proof. Let $E$ be the change-of-basis matrix from the basis $\beta$ to an orthonormal basis $\gamma$ of $\mathbb{R}^n$. Then:

$$\begin{aligned}
\langle x, y \rangle &= (x)_\gamma^T\,(y)_\gamma = (E(x)_\beta)^T\,(E(y)_\beta) = (EW(x)_\alpha)^T\,(EW(y)_\alpha) \\
&= \sum_{i=1}^n \sum_{j=1}^n \big\langle (\beta_i)_\gamma\, w_{ii}\, (x_i)_\alpha,\ (\beta_j)_\gamma\, w_{jj}\, (y_j)_\alpha \big\rangle \\
&= \sum_{i=1}^n \sum_{j=1}^n w_{ii}\, (x_i)_\alpha\, \langle \beta_i, \beta_j \rangle\, w_{jj}\, (y_j)_\alpha \\
&= \sum_{i=1}^n \sum_{j=1}^n w_{ii}\, (x_i)_\alpha\, s_{ij}\, w_{jj}\, (y_j)_\alpha = (W(x)_\alpha)^T S\, W(y)_\alpha. \qquad \square
\end{aligned}$$
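Lemma 3.3 translates directly into code. The sketch below implements the inner product $(W(x)_\alpha)^T S\, W(y)_\alpha$ and evaluates the soft cosine measure on the running example from Section 1; the value $s_{\text{dead},\text{killed}} = 0.5$ is an assumed illustrative similarity, standing in for one derived from edit distances or word2vec embeddings [9]:

```python
import numpy as np

def soft_cosine_measure(x, y, W, S):
    """Soft cosine measure between coordinate vectors (x)_alpha and
    (y)_alpha, using the inner product of Lemma 3.3:
    <x, y> = (W (x)_alpha)^T S (W (y)_alpha)."""
    def inner(u, v):
        return (W @ u) @ S @ (W @ v)
    return inner(x, y) / np.sqrt(inner(x, x) * inner(y, y))

# The running example from Section 1: dead is the 6th term and
# killed the 11th term (0-based indices 5 and 10).
v1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
v2 = np.array([0, 0, 0, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1], dtype=float)
W = np.diag([1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1.0])
S = np.eye(14)
S[5, 10] = S[10, 5] = 0.5  # assumed similarity of dead and killed

print(round(soft_cosine_measure(v1, v2, W, S), 2))  # 0.56
```

With the synonymy of dead and killed taken into account, the similarity rises above the 0.53 attained under the orthogonality assumption.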
The matrix $E$ can be computed from $S$ in time $O(n^3)$ as the Cholesky factor of $S$, i.e. the lower triangular matrix $E$ such that $S = EE^T$. Table 1 compares the Cholesky factorization against the $O(n^4)$ orthonormalization algorithm of Sidorov et al. [14], implemented as iterated Gaussian elimination.

Table 1: The real time to compute a matrix $E$ from a dense matrix $S$, averaged over 100 iterations. We used two Intel Xeon E5-2650 v2 (20M cache, 2.60 GHz) processors to evaluate the $O(n^3)$ Cholesky factorization from NumPy 1.14.3, and the $O(n^4)$ iterated Gaussian elimination from LAPACK. For $n$ > 1000, only sparse $S$ seem practical.

n terms  Algorithm               Real computation time
    100  Cholesky factorization    0.0006 sec (0.606 ms)
    100  Gaussian elimination      0.0529 sec (52.893 ms)
    500  Cholesky factorization    0.0086 sec (8.640 ms)
    500  Gaussian elimination     22.7361 sec (22.736 sec)
   1000  Cholesky factorization    0.0304 sec (30.378 ms)
   1000  Gaussian elimination    354.2746 sec (5.905 min)

Even for a sparse matrix $S$, the Cholesky factor $E$ can be arbitrarily dense and therefore expensive to store. Given a permutation matrix $P$, we can instead factorize $P^T S P$ into $FF^T$. Finding the permutation matrix $P$ that minimizes the density of the Cholesky factor $F$ is NP-hard [16], but heuristic strategies are known [3, 4]. Using the fact that $P^T = P^{-1}$, and basic facts about the transpose, we can derive $E = PF$ as follows:

$$S = PP^T S PP^T = PFF^T P^T = PF(PF)^T = EE^T.$$
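A sketch of both factorizations in NumPy follows; the $4 \times 4$ matrix $S$ and the permutation are assumed illustrative values, with the arbitrary permutation standing in for one produced by a fill-reducing heuristic [3, 4]:

```python
import numpy as np

# An assumed 4-term similarity matrix S (symmetric positive definite):
# terms 0 and 1 are similar (s_01 = 0.8), terms 2 and 3 mildly (0.3).
S = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.3],
              [0.0, 0.0, 0.3, 1.0]])

# O(n^3) Cholesky factorization: the lower triangular E with S = E E^T.
E = np.linalg.cholesky(S)
assert np.allclose(E @ E.T, S)

# The permuted factorization P^T S P = F F^T, here with an arbitrary
# permutation in place of a fill-reducing ordering.
perm = np.array([2, 0, 3, 1])
P = np.eye(4)[:, perm]
F = np.linalg.cholesky(P.T @ S @ P)

# E = P F satisfies E E^T = S, as derived above.
assert np.allclose((P @ F) @ (P @ F).T, S)
```

For a sparse $S$, this identity lets an implementation store only the sparser factor $F$ together with the permutation, rather than a dense $E$.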