Cosine similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of −1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. The name derives from the term “direction cosine”: in this case, note that unit vectors are maximally “similar” if they are parallel and maximally “dissimilar” if they are orthogonal (perpendicular). This is analogous to the cosine, which is unity (maximum value) when the segments subtend a zero angle and zero (uncorrelated) when the segments are perpendicular.

Note that these bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval and text mining, each term is notionally assigned a different dimension and a document is characterised by a vector where the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

Cosine distance is a term often used for the complement in positive space, that is: D_C(A, B) = 1 − S_C(A, B), where D_C is the cosine distance and S_C is the cosine similarity. It is important to note, however, that this is not a proper distance metric as it does not have the triangle inequality property (or, more formally, the Schwarz inequality) and it violates the coincidence axiom; to repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance (see below).

One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered.

1 Definition

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

    a \cdot b = \|a\| \, \|b\| \cos\theta

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

    \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}

where A_i and B_i are components of vector A and B respectively.

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality (decorrelation), and in-between values indicating intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

If the attribute vectors are normalized by subtracting the vector means (e.g., A − Ā), the measure is called centered cosine similarity and is equivalent to the Pearson correlation coefficient.
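To make the definition concrete, the following is a minimal Python sketch (the function name, the toy vocabulary and the count vectors are illustrative examples, not part of the article):

    import math

    def cosine_similarity(a, b):
        """Cosine of the angle between two equal-length, non-zero vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Term-frequency vectors of two short documents over the (hypothetical)
    # vocabulary ("cosine", "similarity", "distance"): raw counts per term.
    doc_a = [2, 1, 0]
    doc_b = [1, 1, 1]
    print(cosine_similarity(doc_a, doc_b))  # ≈ 0.775, i.e. fairly similar documents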
1.1 Angular distance and similarity

The term “cosine similarity” is sometimes used to refer to a different definition of similarity provided below. However, the most common use of “cosine similarity” is as defined above, and the similarity and distance metrics defined below are referred to as “angular similarity” and “angular distance” respectively.

The normalized angle between the vectors is a formal distance metric and can be calculated from the similarity score defined above. This angular distance metric can then be used to compute a similarity function bounded between 0 and 1, inclusive.


When the vector elements may be positive or negative:

    \text{distance} = \frac{\cos^{-1}(\text{similarity})}{\pi}

    \text{similarity} = 1 - \text{distance}

Or, if the vector elements are always positive:

    \text{distance} = \frac{2 \cdot \cos^{-1}(\text{similarity})}{\pi}

    \text{similarity} = 1 - \text{distance}

Although the term “cosine similarity” has been used for this angular distance, the term is used as the cosine of the angle only as a convenient mechanism for calculating the angle itself and is no part of the meaning. The advantage of the angular similarity coefficient is that, when used as a difference coefficient (by subtracting it from 1), the resulting function is a proper distance metric, which is not the case for the first meaning. However, for most uses this is not an important property. For any use where only the relative ordering of similarity or distance within a set of vectors is important, then which function is used is immaterial as the resulting order will be unaffected by the choice.
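A self-contained Python sketch of these conversions (the function names, the nonnegative flag and the example vectors are illustrative assumptions, not from the article):

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    def angular_distance(a, b, nonnegative=False):
        """Normalized angle between two vectors; a proper distance metric in [0, 1]."""
        cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # clamp rounding drift
        # Elements of arbitrary sign: the angle can reach pi, so divide by pi.
        # Non-negative elements: the angle is at most pi/2, hence the factor of 2.
        factor = 2.0 if nonnegative else 1.0
        return factor * math.acos(cos) / math.pi

    def angular_similarity(a, b, nonnegative=False):
        return 1.0 - angular_distance(a, b, nonnegative)

    print(angular_distance([1.0, 0.0], [0.0, 1.0], nonnegative=True))  # 1.0 (orthogonal)
    print(angular_similarity([3.0, 4.0], [6.0, 8.0]))                  # 1.0 (parallel)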

1.2 Confusion with “Tanimoto” coefficient

The cosine similarity may be easily confused with the Tanimoto metric, a specialised form of a similarity coefficient with a similar algebraic form:

    T(A, B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}

In fact, this algebraic form was first defined by Tanimoto as a mechanism for calculating the Jaccard coefficient in the case where the sets being compared are represented as bit vectors. While the formula extends to vectors in general, it has quite different properties from cosine similarity and bears little relation other than its superficial appearance.

1.3 Ochiai coefficient

This coefficient is also known in biology as the Ochiai coefficient, or Ochiai-Barkman coefficient, or Otsuka-Ochiai coefficient:[3][4]

    K = \frac{n(A \cap B)}{\sqrt{n(A) \times n(B)}}

Here, A and B are sets, and n(A) is the number of elements in A. If sets are represented as bit vectors, the Ochiai coefficient can be seen to be the same as the cosine similarity.

2 Properties

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual \|A - B\|, and observe that

    \|A - B\|^2 = (A - B)^\top (A - B) = \|A\|^2 + \|B\|^2 - 2 A^\top B

by expansion. When A and B are normalized to unit length, \|A\|^2 = \|B\|^2 = 1, so the previous expression is equal to 2(1 - \cos(A, B)).

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of 1/n (where n is the number of dimensions), and although the distribution is bounded between −1 and +1, as n grows large the distribution is increasingly well approximated by the normal distribution.[5][6] For other types of data, such as bitstreams (taking values of 0 or 1 only), the null distribution will take a different form, and may have a nonzero mean.[7]
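A quick numerical check of the unit-length identity above, as a Python sketch (the example vectors are arbitrary choices, not from the article):

    import math

    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    a = unit([3.0, 4.0])
    b = unit([4.0, 3.0])

    cos_ab = sum(x * y for x, y in zip(a, b))
    sq_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))

    # For unit-length vectors, ||A - B||^2 = 2 * (1 - cos(A, B)).
    assert abs(sq_euclidean - 2 * (1 - cos_ab)) < 1e-12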

3 Soft cosine measure

The soft cosine measure is a measure of “soft” similarity between two vectors, i.e., a measure that considers the similarity of pairs of features.[8] The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in the VSM, which allows generalization of the concept of the cosine measure and also of the idea of similarity (soft similarity).

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams or syntactic n-grams[9] can be quite similar, though formally they are considered as different features in the VSM. For example, the words “play” and “game” are different words and thus are mapped to different dimensions in the VSM; yet it is obvious that they are related semantically. In the case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculation of the soft cosine measure, the matrix s of similarity between features is introduced. It can be calculated using Levenshtein distance or other similarity measures, e.g., various WordNet similarity measures. Then we just multiply by this matrix.

Given two N-dimension vectors a and b, the soft cosine similarity is calculated as follows:

    \operatorname{soft\_cosine}_1(a, b) = \frac{\sum_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij} a_i a_j} \, \sqrt{\sum_{i,j}^{N} s_{ij} b_i b_j}}

where s_{ij} = similarity(feature_i, feature_j).

If there is no similarity between features (s_{ii} = 1, s_{ij} = 0 for i ≠ j), the given equation is equivalent to the conventional cosine similarity formula.

The complexity of this measure is quadratic, which makes it perfectly applicable to real-world tasks. The complexity can be transformed to subquadratic.
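A direct Python sketch of this formula (the function name and the toy similarity matrix are illustrative assumptions; the double loop mirrors the quadratic complexity noted above):

    import math

    def soft_cosine(a, b, s):
        """Soft cosine of vectors a and b given a feature-similarity matrix s,
        where s[i][j] = similarity(feature_i, feature_j) and s[i][i] = 1."""
        def form(x, y):
            return sum(s[i][j] * x[i] * y[j]
                       for i in range(len(x)) for j in range(len(y)))
        return form(a, b) / (math.sqrt(form(a, a)) * math.sqrt(form(b, b)))

    # With the identity matrix as s, the result reduces to ordinary cosine similarity.
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(soft_cosine([2, 1, 0], [1, 1, 1], identity))  # ≈ 0.775

    # A toy matrix that treats features 2 and 3 as partly similar (s_23 = 0.5).
    s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [0.0, 0.5, 1.0]]
    print(soft_cosine([2, 1, 0], [1, 1, 1], s))         # ≈ 0.783, higher once overlap is credited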

4 See also

• Sørensen's quotient of similarity

• Correlation

• Dice’s coefficient

• SimRank

• Information retrieval

5 References

[1] Singhal, Amit (2001). “Modern Information Retrieval: A Brief Overview”. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 35–43.

[2] P.-N. Tan, M. Steinbach & V. Kumar, “Introduction to Data Mining”, Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.

[3] Ochiai A. Zoogeographical studies on the soleoid fishes found in Japan and its neighboring regions. II // Bull. Jap. Soc. sci. Fish. 1957. V. 22. № 9. P. 526-530.

[4] Barkman J.J. Phytosociology and ecology of cryptogamic epiphytes, including a taxonomic survey and description of their vegetation units in Europe. – Assen. Van Gorcum. 1958. 628 p.

[5] Spruill, Marcus C (2007). “Asymptotic distribution of coordinates on high dimensional spheres”. Electronic Communications in Probability. 12: 234–247. doi:10.1214/ECP.v12-1294.

[6] CrossValidated: Distribution of dot products between two random unit vectors in R^D

[7] Graham L. Giller (2012). “The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity”. Giller Investments Research Notes (20121024/1). doi:10.2139/ssrn.2167044.

[8] Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David. “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”. Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.

[9] Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana. Syntactic Dependency-based N-grams as Classification Features. LNAI 7630. pp. 1–11. ISBN 978-3-642-37798-3. Retrieved 7 October 2014.

6 External links

• Weighted cosine measure

• A tutorial on cosine similarity using Python

• Web API to Compute Cosine, Jaccard and Dice for Text in Any Language

7 Text and image sources, contributors, and licenses

7.1 Text

• Cosine similarity Source: https://en.wikipedia.org/wiki/Cosine_similarity?oldid=773090778 Contributors: Michael Hardy, Pengo, Dave6, Dfrankow, Beland, Thorwald, Hydrox, Xezbeth, Jason Davies, Rickyp, Simetrical, Qwertyus, Crasshopper, SmackBot, InverseHypercube, Mcld, Jmnbatista, Freelance Intellectual, Fyedernoggersnodden, Headbomb, Nizar grira, Chilliam, Fylwind, Jericbryledy, SieBot, Backfiah, Paolo.dL, Shidzu, UKoch, MystBot, Addbot, Yobot, Bublegun, AnomieBOT, Lynxoid84, Brenthecht, Tarantulae, FrescoBot, Hobsonlane, Hosszuka, ZéroBot, RichardThePict, Alvin.pastore, Горшков Михаил, Christian.adriano, Lxcythian, July1024, Glennschuetze, Asdllc, Loraof, Grigori sidorov, The Quixotic Potato, Abhishekborar93 and Anonymous: 35

7.2 Images

7.3 Content license

• Creative Commons Attribution-Share Alike 3.0