Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing
Tim vor der Brück and Marc Pouly
School of Information Technology
Lucerne University of Applied Sciences and Arts, Switzerland
[email protected]  [email protected]

Abstract

The prevalent way to estimate the similarity of two documents based on word embeddings is to apply the cosine similarity measure to the two centroids obtained from the embedding vectors associated with the words in each document. Motivated by an industrial application from the domain of youth marketing, where this approach produced only mediocre results, we propose an alternative way of combining the word vectors using matrix norms. The evaluation shows superior results for most of the investigated matrix norms in comparison to both the classical cosine measure and several other document similarity estimates.

1 Introduction

Estimating semantic document similarity is of utmost importance in a lot of different areas, like plagiarism detection, information retrieval, or text summarization. We focus here on an NLP application that has been less researched: the assignment of people to the best matching target group to allow for running precise and customer-oriented marketing campaigns.

Until recently, similarity estimates were predominantly based either on deep semantic approaches or on typical information retrieval techniques like Latent Semantic Analysis. In the last couple of years, however, so-called word and sentence embeddings became state-of-the-art.

The prevalent approach to document similarity estimation based on word embeddings consists in measuring the similarity between the vector representations of the two documents, derived as follows:

1. The word embeddings (often weighted by the tf-idf coefficients of the associated words (Brokos et al., 2016)) are looked up in a hashtable for all the words in the two documents to compare. These embeddings are determined beforehand on a very large corpus, typically using either the skip-gram or the continuous bag of words variant of the Word2Vec model (Mikolov et al., 2013).

2. The centroid over all word embeddings belonging to the same document is calculated to obtain its vector representation.

Once the vector representations of the two documents have been established, a similarity estimate is obtained by applying the cosine measure to the two vectors. Let x_1, ..., x_m and y_1, ..., y_n be the word vectors of the two documents. The cosine similarity between the two document centroids C_1 and C_2 is given by

    \cos\bigl(\angle\bigl(\tfrac{1}{m}\sum_{i=1}^{m} x_i,\ \tfrac{1}{n}\sum_{i=1}^{n} y_i\bigr)\bigr)
    = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \langle x_i, y_j \rangle}{m n \,\lVert C_1 \rVert \,\lVert C_2 \rVert}    (1)

Hence, potentially small values of \langle x_i, y_j \rangle can in aggregate have a considerable influence on the total similarity estimate, which makes this estimate vulnerable to noise in the data. We propose an alternative approach that is based on matrix norms and that proved to be more noise-robust by focusing primarily on high word similarities. Finally, we conducted an evaluation in which our method achieved higher accuracy in target group assignments than several traditional word-embedding-based methods.
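Purely as an illustration of this centroid baseline (not the authors' implementation), the following minimal sketch computes Equation (1) with NumPy. The toy embedding table, the plain dictionary lookup, and the omission of tf-idf weighting are illustrative assumptions.

```python
import numpy as np

def centroid(tokens, emb):
    """Average of the embedding vectors of all in-vocabulary tokens (the document centroid)."""
    vectors = [emb[w] for w in tokens if w in emb]
    return np.mean(vectors, axis=0)

def centroid_cosine(doc_t, doc_u, emb):
    """Baseline similarity: cosine measure applied to the two document centroids, cf. Equation (1)."""
    c1, c2 = centroid(doc_t, emb), centroid(doc_u, emb)
    return float(np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2)))

# Toy example with made-up 3-dimensional "word embeddings"
emb = {
    "snowboard": np.array([0.9, 0.1, 0.0]),
    "skiing":    np.array([0.8, 0.3, 0.1]),
    "festival":  np.array([0.1, 0.9, 0.2]),
    "music":     np.array([0.2, 0.8, 0.3]),
}
print(centroid_cosine(["snowboard", "skiing"], ["festival", "music"], emb))
```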
2 Related Work

The most popular method to come up with word vectors is Word2Vec, which is based on a three-layer neural network architecture in which the word vectors are obtained as the weights of the hidden layer. Alternatives to Word2Vec are GloVe (Pennington et al., 2014), which is based on aggregated global word co-occurrence statistics, and Explicit Semantic Analysis (ESA for short) (Gabrilovich and Markovitch, 2009), in which each word is represented by its column vector in the tf-idf matrix over Wikipedia.

The idea of Word2Vec can be transferred to the level of sentences as well. In particular, the so-called Skip Thought Vector model (STV) (Kiros et al., 2015) derives a vector representation of the current sentence by predicting the surrounding sentences.

(Song and Roth, 2015) propose an alternative to applying the cosine measure to the two word vector centroids for ESA word embeddings. In particular, they establish a bipartite graph consisting of the best matching vector components by solving a linear optimization problem. The similarity estimate for the documents is then given by the global optimum of the objective function. However, this method is only useful for sparse vector representations. In the case of dense vectors, (Mijangos et al., 2017) suggest applying the Frobenius kernel to the embedding matrices, which contain the embedding vectors for all document components (usually either sentences or words) (cf. also (Hong et al., 2015)). However, crucial limitations are that the Frobenius kernel is only applicable if the numbers of words (or sentences, respectively) in the compared documents coincide and that a word from the first document is only compared with its counterpart from the second document. Thus, an optimal matching has to be established beforehand. In contrast, the matrix norm approach presented here applies to arbitrary embedding matrices. Since it conducts a pairwise comparison of all words contained in the two documents, there is also no need for any matching method.

Another similarity estimate that employs the entire embedding matrix is the word mover's distance (Kusner et al., 2015), which is a special case of the earth mover's distance, a well-studied transportation problem. Basically, this approach determines the minimum effort (with respect to embedding vector changes) to transform the words of one text into the words of another text. The word mover's distance requires a linear optimization problem to be solved. Linear optimization is usually tackled by the simplex method, which has exponential runtime complexity in the worst case (which, however, rarely occurs).

A drawback of the conventional similarity estimates described above is that slightly related word pairs can in aggregate have a considerable influence on their values, i.e., these estimates are sensitive to noise in the data. In contrast, several of our matrix norm based similarity estimates focus primarily on strongly related word pairs and are therefore less vulnerable to noise.

3 Similarity Measure / Matrix Norm

Before going into more detail, we review some concepts that are crucial for the remainder of this paper. According to (Belanche and Orozco, 2011), a similarity measure on some set X is an upper bounded, exhaustive and total function s : X × X → I ⊂ ℝ with |I| > 1 (therefore I is upper bounded and sup I exists). Additionally, it should fulfill the properties of reflexivity (the supremum is reached if an item is compared to itself) and symmetry. We call such a measure normalized if the supremum equals 1 (Attig and Perner, 2011). Note that an asymmetric similarity measure can easily be converted into a symmetric one by taking the geometric or arithmetic mean of the asymmetric measure applied twice to the same arguments in switched order.

A norm is a function f : V → ℝ over some vector space V that is absolutely homogeneous, positive definite and fulfills the triangle inequality. It is called a matrix norm if its domain is a set of matrices and if it is sub-multiplicative, i.e., ‖AB‖ ≤ ‖A‖·‖B‖. Several popular matrix norms are given in Table 1. Note that the Frobenius norm can also be represented as ‖A‖_F = \sqrt{\operatorname{tr}(AA^\top)}.

    Name              Definition
    Frobenius norm    \lVert A\rVert_F := \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |A_{ij}|^2}
    2-norm            \lVert A\rVert_2 := \sqrt{\rho(A^\top A)}
    L_{1,1}-norm      \lVert A\rVert_{L_{1,1}} := \sum_{i=1}^{m}\sum_{j=1}^{n} |A_{ij}|
    1-norm            \lVert A\rVert_1 := \max_{1\le j\le n} \sum_{i=1}^{m} |A_{ij}|
    ∞-norm            \lVert A\rVert_\infty := \max_{1\le i\le m} \sum_{j=1}^{n} |A_{ij}|

    Table 1: Examples of matrix norms; A is an m × n matrix; ρ(X) denotes the largest absolute eigenvalue of a square matrix X.
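To make the definitions in Table 1 concrete, here is a small sketch that evaluates the listed norms for a real m × n matrix with NumPy; the function and variable names are ours, and the 2-norm is computed via the eigenvalues of A^T A exactly as stated in the table.

```python
import numpy as np

def matrix_norms(A):
    """Evaluate the matrix norms of Table 1 for a real m x n matrix A."""
    abs_A = np.abs(A)
    return {
        "Frobenius": np.sqrt((abs_A ** 2).sum()),                 # sqrt of the sum of squared entries
        "2-norm":    np.sqrt(np.linalg.eigvalsh(A.T @ A).max()),  # sqrt of the largest eigenvalue of A^T A
        "L_{1,1}":   abs_A.sum(),                                 # sum of all absolute entries
        "1-norm":    abs_A.sum(axis=0).max(),                     # maximum absolute column sum
        "inf-norm":  abs_A.sum(axis=1).max(),                     # maximum absolute row sum
    }

A = np.array([[1.0, -2.0, 0.5],
              [3.0,  4.0, -1.0]])
for name, value in matrix_norms(A).items():
    print(f"{name:>9}: {value:.4f}")
```

For the Frobenius, 2-, 1- and ∞-norms the results coincide with NumPy's built-in np.linalg.norm(A, "fro"), np.linalg.norm(A, 2), np.linalg.norm(A, 1) and np.linalg.norm(A, np.inf).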
4 Document Similarity Measure based on Matrix Norms

For an arbitrary document t we define the embedding matrix E(t) as follows: E(t)_{ij} is the i-th component of the normalized embedding vector belonging to the j-th word of document t. Let t and u be two arbitrary documents; then entry (i, j) of the product E(t)^T E(u) is the value of the cosine measure estimating the semantic similarity between word i of document t and word j of document u. The value of a matrix norm ‖E(t)^T E(u)‖ is then a measure for the similarity of the two documents. Since the vector components obtained by Word2Vec can be negative, the cosine measure between two word vectors can also assume negative values (though rather rarely in practice). Negative cosine values indicate negatively correlated words and should be handled akin to the uncorrelated case. Because a matrix norm usually treats negative and positive matrix entries alike, we replace all negative values in the matrix by zeros.

Writing A := E(t) and B := E(u), we normalize the resulting value to obtain the similarity estimate s_n(t, u) = ‖A^T B‖ / \sqrt{‖A^T A‖·‖B^T B‖}. This estimate is symmetric for every matrix norm satisfying ‖Z‖ = ‖Z^T‖ for arbitrary matrices Z, since with this property we have

    s_n(t, u) = \frac{\lVert A^\top B \rVert}{\sqrt{\lVert A^\top A \rVert \cdot \lVert B^\top B \rVert}}
              = \frac{\lVert (B^\top A)^\top \rVert}{\sqrt{\lVert B^\top B \rVert \cdot \lVert A^\top A \rVert}}
              = \frac{\lVert B^\top A \rVert}{\sqrt{\lVert B^\top B \rVert \cdot \lVert A^\top A \rVert}}
              = s_n(u, t)    (2)

Let M and N be arbitrary matrices such that MN and NM are both defined and square; then (see (Chatelin, 1993))

    \rho(MN) = \rho(NM)    (3)

where ρ(X) denotes the largest absolute eigenvalue of a square matrix X. Using identity (3), one can easily infer that

    \lVert Z \rVert_2 = \sqrt{\rho(Z^\top Z)} = \sqrt{\rho(Z Z^\top)} = \lVert Z^\top \rVert_2    (4)
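A minimal sketch of the matrix-norm based estimate described in this section, shown here with the Frobenius norm as an example. The embedding lookup, the toy vectors, and the choice to clip negative entries in all three matrix products (the text above only prescribes it for E(t)^T E(u)) are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def embedding_matrix(tokens, emb):
    """E(t): column j is the length-normalized embedding vector of the j-th in-vocabulary word."""
    cols = [emb[w] / np.linalg.norm(emb[w]) for w in tokens if w in emb]
    return np.stack(cols, axis=1)          # shape: embedding dimension x number of words

def clipped(M):
    """Replace negative cosine values by zero (treated like uncorrelated word pairs)."""
    return np.maximum(M, 0.0)

def sn(doc_t, doc_u, emb, norm=lambda M: np.linalg.norm(M, "fro")):
    """Normalized matrix-norm similarity s_n(t, u), cf. Equation (2)."""
    A = embedding_matrix(doc_t, emb)
    B = embedding_matrix(doc_u, emb)
    numerator = norm(clipped(A.T @ B))
    # Clipping the two normalization terms as well is an assumption made for this sketch.
    denominator = np.sqrt(norm(clipped(A.T @ A)) * norm(clipped(B.T @ B)))
    return float(numerator / denominator)

# Toy usage with made-up 3-dimensional "word embeddings"
emb = {"snowboard": np.array([0.9, 0.1, 0.0]),
       "skiing":    np.array([0.8, 0.3, 0.1]),
       "festival":  np.array([0.1, 0.9, 0.2])}
print(sn(["snowboard", "skiing"], ["festival"], emb))
```

Passing, e.g., norm=lambda M: np.linalg.norm(M, 2) switches to the 2-norm variant; np.linalg.norm(M, 1) and np.linalg.norm(M, np.inf) correspond to the 1-norm and ∞-norm of Table 1.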