On Efficient Retrieval of Top Similarity Vectors

Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, Ping Li
Cognitive Computing Lab, Baidu Research USA
1195 Bordeaux Dr, Sunnyvale, CA 94089, USA
10900 NE 8th St, Bellevue, WA 98004, USA
{shulongtan, zhixinzhou, v xuzhaozhuo, liping11}@baidu.com

Abstract

Retrieval of relevant vectors produced by representation learning critically influences the efficiency of natural language processing (NLP) tasks. In this paper we demonstrate an efficient method for searching vectors via a typical non-metric matching function: inner product. Our method, which constructs an approximate Inner Product Delaunay Graph (IPDG) for top-1 Maximum Inner Product Search (MIPS), transforms retrieval of the most suitable latent vectors into a graph search problem, with great benefits in efficiency. Experiments on data representations learned for different machine learning tasks verify the superior effectiveness and efficiency of the proposed IPDG.

1 Introduction

With the popularity of representation learning methods such as Word2vec (Mikolov et al., 2013a), words are represented as real-valued embedding vectors in a semantic space. Retrieval of similar word embeddings is therefore one of the most basic operations in natural language processing, with wide applicability in synonym extraction (Yoon et al., 2017), sentence alignment (Levy et al., 2017), polysemous word learning (Sun et al., 2017) and semantic search for documents related to a query.

In this work, we address efficient retrieval of similar word embeddings via inner product (dot product) similarity. Inner product is a general semantic matching function with applications in neural probabilistic language models (Bengio et al., 2003), machine translation (Gao et al., 2014), question answering (Lee et al., 2015), and attention mechanisms (Vaswani et al., 2017). For normalized vectors, inner product is equivalent to cosine similarity, a common semantic textual similarity utilized in semantic classification and search (Sahami and Heilman, 2006; Ramage et al., 2009; Agirre et al., 2012; Huang et al., 2013; Liu et al., 2015; Faruqui et al., 2016; Sultan et al., 2016; Köper and im Walde, 2018; Gong et al., 2018), Relation Extraction (RE) (Plank and Moschitti, 2013) and text coherence evaluation (Putra and Tokunaga, 2017). For un-normalized vectors, although cosine similarity is still widely applied, the final matching scores of word embeddings are usually weighted (Acree et al., 2016; Srinivas et al., 2010) by ranking-based coefficients (e.g., side information), which transforms the problem back to search via inner product (see Eq. (2)).

Formally, retrieving the most similar word with the inner product ranking function is a Maximum Inner Product Search (MIPS) problem. MIPS is a continuously addressed topic (Bachrach et al., 2014; Shrivastava and Li, 2014; Kalantidis and Avrithis, 2014; Shrivastava and Li, 2015; Guo et al., 2016; Wu et al., 2017), and it has non-trivial differences from traditional Approximate Nearest Neighbor Search (ANNS) (Friedman et al., 1975, 1977; Indyk and Motwani, 1998) problems. ANNS is the optimization problem of finding the points closest to a query point in a given set, where "close" usually means small in a metric distance such as cosine or Euclidean distance, which has an obvious geometrical interpretation. Inner product, however, is a typical non-metric measure, which distinguishes MIPS from traditional ANNS problems; methods designed for ANNS may thus have performance limitations in MIPS. For NLP tasks such as retrieving relevant word embeddings by cosine or Euclidean distance, different ANNS methods have been studied (Sugawara et al., 2016). To the best of our knowledge, there is little literature on MIPS for retrieving word or language representations.

Currently, search-on-graph methods, such as Hierarchical Navigable Small World graphs (HNSW), are regarded as the state-of-the-art ANNS approach (Malkov and Yashunin, 2018). Performance evaluations have demonstrated that HNSW strongly outperforms other methods on ANNS benchmarks for metric distances. Meanwhile, the graph structure has the flexibility of defining measures on edges, which makes HNSW feasible for MIPS. Morozov and Babenko (2018) apply HNSW to MIPS with positive results and also introduce the concept of the Delaunay Graph to explain similarity-graph-based methods for MIPS. Nevertheless, the link between HNSW and the Delaunay Graph is still tenuous: although the global optimum of MIPS can be retrieved on a Delaunay Graph, there is little evidence that HNSW approximates a proper Delaunay Graph for inner product. How to provide a solid graph-based MIPS method is still an open question.

In this paper, we propose a new search-on-graph method, namely the Inner Product Delaunay Graph (IPDG), for MIPS. Our key contributions can be summarized as follows:

• Design an edge selection algorithm specifically for inner product, which reduces useless edges in the graph and thus improves searching efficiency.
• Propose a two-round graph construction algorithm for effectively approximating the Delaunay Graph under inner product.
• Empirically evaluate the effectiveness and efficiency of IPDG, providing a state-of-the-art MIPS method for similarity search on word embedding datasets.

The organization of this paper is as follows: in the next section, we introduce the research background. In Section 3, the approximate Inner Product Delaunay Graph (IPDG) is introduced. In Section 4, we explore the effectiveness and efficiency of IPDG in maximum inner product word retrieval and compare it with state-of-the-art MIPS methods. Section 5 concludes the paper.

2 Background

In this section, we first introduce the definition of the Maximum Inner Product Search (MIPS) problem and review state-of-the-art methods for MIPS. We then summarize a theoretical solution for MIPS based on searching Delaunay Graphs.

2.1 Problem Statement

In machine learning tasks, embedding methods such as Word2vec (Mikolov et al., 2013a,b), GloVe (Pennington et al., 2014) or deep collaborative filtering (Xu et al., 2018) learn representations of data as dense, distributed, real-valued vectors. Formally, for a latent space X ⊂ R^d, given an arbitrary query vector q ∈ X and a set of vectors S = {x_1, . . . , x_n} ⊂ X, vector similarity is defined by a continuous symmetric matching function f : X × X → R. The goal of similar vector retrieval is to find

    arg max_{x ∈ S} f(x, q).    (1)

In this paper, we specifically discuss the non-metric similarity measure of inner product:

    f(x, q) = x⊤q,  x, q ∈ X = R^d \ {0}.

Without loss of generality, we can always assume ‖q‖ = 1. We are not interested in the zero vector since its inner product with any vector is always zero. The problem in Eq. (1) with respect to the inner product is often referred to as Maximum Inner Product Search (MIPS) in the literature.

The weighted cosine ANNS problem can also be viewed as a MIPS problem. Consider a data set S = {(z_i, w_i) : i ∈ [n]}, where w_i is a real scalar and z_i is a vector. Then

    w cos(z, q) = w · z⊤q / (‖z‖ ‖q‖) = (w z / ‖z‖)⊤ (q / ‖q‖),    (2)

where ‖q‖ = 1. As can be seen, weighted ANNS w.r.t. cosine similarity is equivalent to MIPS by letting x_i = w_i z_i / ‖z_i‖.

2.2 Related Works

Previous approaches for Maximum Inner Product Search (MIPS) can be mainly categorized into: (1) reducing MIPS to ANNS; and (2) non-reduction methods. Reduction methods add wrappers on indexed data and queries asymmetrically and reduce the MIPS problem to ANNS in metric spaces (Shrivastava and Li, 2015; Bachrach et al., 2014). For example, given the query q, the indexed data S = {x_1, ..., x_n} and Φ = max_i ‖x_i‖, the wrapper can be defined as:

    P(x) = [x/Φ; √(1 − ‖x‖²/Φ²)],    (3)
    Q(q) = [q; 0].    (4)

It is not difficult to prove that searching the new data by cosine or ℓ2-distance is equivalent to searching the original data by inner product. Recently, researchers found that the methods above can be improved further, based on the observation of the long-tail distribution of data norms (Huang et al., 2018; Yan et al., 2018).
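
The reduction in Eqs. (3)–(4) is easy to verify numerically. The following minimal numpy sketch is only an illustration (the data and shapes are made up): since ‖P(x)‖ = 1 for every indexed vector, the cosine between P(x) and Q(q) equals x⊤q/(Φ‖q‖), so the cosine top-1 on the transformed data coincides with the inner-product top-1 on the original data.

    import numpy as np

    def P(X):
        # applied to the indexed vectors; phi is the largest data norm
        phi = np.max(np.linalg.norm(X, axis=1))
        extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(X ** 2, axis=1) / phi ** 2))
        return np.hstack([X / phi, extra[:, None]])

    def Q(q):
        # applied to the query; appends a zero coordinate
        return np.append(q, 0.0)

    X = np.random.randn(1000, 50)
    q = np.random.randn(50)
    Xp, qp = P(X), Q(q)
    cosine = (Xp @ qp) / (np.linalg.norm(Xp, axis=1) * np.linalg.norm(qp))
    assert np.argmax(cosine) == np.argmax(X @ q)   # same top-1 answer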

New approaches have been proposed by adding wrappers for each norm range, such as Range-LSH (Yan et al., 2018). With reductions like the above, any ANNS method can be applied to MIPS. However, it has been shown that reduction-based MIPS methods have performance limitations (Morozov and Babenko, 2018).

Recently, more and more non-reduction methods have been proposed specifically for MIPS. Guo et al. (2016) proposed a MIPS method based on Product Quantization (PQ). Yu et al. (2017) used an upper bound of the inner product as the approximation of MIPS and designed a greedy search algorithm, called Greedy-MIPS, to find this approximation. A graph-based non-reduction MIPS method, ip-NSW, was first introduced in Morozov and Babenko (2018), where the theoretical basis for conducting MIPS by similarity graphs was also provided. Carrying over the advantages of similarity-graph-based methods for ANNS, ip-NSW showed superior performance for MIPS.

2.3 Delaunay Graph

The Delaunay Graph plays an important role in similarity search. The properties and construction of the ℓ2-Delaunay Graph have been considered in the literature (Aurenhammer, 1991; Cignoni et al., 1998). Indeed, one can generalize the definition to any real binary function, including inner product.

Definition 2.1. The Voronoi cell R_i with respect to f and x_i is the set

    R_i := {q ∈ X : f(x_i, q) ≥ f(x, q) for all x ∈ S}.

Moreover, x ∈ S is an extreme point if it is associated with a nonempty Voronoi cell.

Definition 2.2. The Delaunay Graph G with respect to f and S is an undirected graph with vertices S satisfying {x_i, x_j} ∈ G if and only if R_i ∩ R_j ≠ ∅.

Figure 1: The relation between the Delaunay Graph and Voronoi cells in the inner product space. The red dots are extreme points of each Voronoi cell. The Delaunay Graph connects extreme points with black edges. If we search on this dataset, every query has a maximum inner product with one of these extreme points (i.e., the red ones).

An example of Voronoi cells and the corresponding Delaunay Graph in inner product space is shown in Figure 1. Regions in different colors correspond to the Voronoi cells of extreme points (red ones). The Delaunay Graph connects extreme points. Different from metric similarities (e.g., ℓ2-distance), the Voronoi cells of some data points with respect to inner product are possibly empty. By Definition 2.2, a data point is isolated (i.e., has no incident edges) if its Voronoi cell is empty. As we can see in Figure 1, there are many isolated points (blue ones). The proportion of extreme points is relatively small in general, and Theorem 2.1 will show that only extreme points can achieve a maximum inner product score for any nonzero query. The definition of an extreme point is equivalent to the one in (Barber et al., 1996), i.e., x ∈ S is extreme if and only if x is on the boundary of the convex hull of S. In the two-dimensional case, the edges form the boundary of the convex hull, which is also shown in Figure 1.

2.4 Search on Delaunay Graph

Searching on the Delaunay Graph has been demonstrated to be effective for similarity search (Morozov and Babenko, 2018). In the inner product case, given any query vector q ∈ X, we start from an extreme point, then move to its neighbor that has a larger inner product with q. We repeat this step until reaching an extreme point that has a larger inner product with q than all its neighbors, and then we return it. It can be demonstrated that this returned local optimum is actually the global optimum.

Generally, for any searching measure f, if the corresponding Voronoi cells are connected, then the local optimum returned by the greedy search is also the global optimum. Formally, the statement can be summarized as below. The proof can be found in Morozov and Babenko (2018).

Theorem 2.1. Suppose f satisfies that the Voronoi cells R_i with respect to any subset of S (including S itself) are connected on X, and G is the Delaunay Graph with respect to f and some S. Then for q ∈ X, a local maximum of the greedy search starting from an extreme point, that is, x_i ∈ S satisfying

    f(x_i, q) ≥ max_{x ∈ N(x_i)} f(x, q),    (5)

where N(x_i) = {x ∈ S : {x_i, x} ∈ G}, is a global maximum.
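
The greedy procedure of Section 2.4 is a simple hill climb on the graph. The sketch below is a minimal Python illustration, assuming the graph is stored as a dict from node id to a list of neighbor ids and X holds the data vectors (these names are illustrative, not from the paper). Under the conditions of Theorem 2.1, starting from an extreme point, the local maximum it returns is the global MIPS answer.

    import numpy as np

    def greedy_walk(q, graph, X, start):
        """Repeatedly move to the neighbor with the largest inner product with q."""
        cur = start
        while True:
            neighbors = graph[cur]
            if not neighbors:
                return cur
            best = max(neighbors, key=lambda v: X[v] @ q)
            if X[best] @ q <= X[cur] @ q:   # no neighbor improves: local maximum
                return cur
            cur = best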

Suppose the assumptions (i.e., connected Voronoi cells) in Theorem 2.1 hold; then searching on the Delaunay Graph finds the global maximum. It is easy to check that the assumptions hold for the inner product case, since the Voronoi cells w.r.t. the inner product are either empty or a convex cone, and hence connected. We can then claim that, when searching on the Delaunay Graph under inner product, the vector in S that has the maximum inner product with the query vector will be retrieved.

3 Inner Product Delaunay Graph

Although the Delaunay Graph has demonstrated its potential in similarity search, direct construction of the Delaunay Graph on large-scale and high-dimensional datasets is unfeasible, due to the exponentially growing number of edges in high dimension. To remedy this issue, practical algorithms usually approximate Delaunay Graphs. In this section, we present the newly proposed algorithm for constructing an approximate Delaunay Graph in inner product space, namely the Inner Product Delaunay Graph (IPDG). Two key features of our algorithm are introduced first: i) edge selection specifically for inner product; and ii) the two-round graph construction. We then conduct a case study on a toy dataset to show the effectiveness of IPDG in constructing better approximate Delaunay Graphs for inner product.

3.1 Edge Selection for Inner Product

To balance the effectiveness (retrieval of the nearest neighbor) and the efficiency (completing the process within limited time) of the retrieval, some empirical tricks are usually applied in previous search-on-graph methods: a) use directed edges instead of undirected edges; b) restrict the degree of outgoing edges for each node; and c) select more diverse outgoing edges (Malkov and Yashunin, 2018; Morozov and Babenko, 2018).

Specifically, for the inner product case, ip-NSW proposed in (Morozov and Babenko, 2018) applies all the tricks listed above (although the authors did not mention it in the paper, the implementation did inherit all features from HNSW). We found that the edge selection method is vital for the trade-off between effectiveness and efficiency in searching. However, the existing edge selection techniques used in HNSW and ip-NSW are actually designed for metric distances and are inapplicable to a non-metric measure such as inner product.

Figure 2: An example of edge selection used in constructing an approximate Delaunay Graph. (a) The selection method for metric spaces used in HNSW and ip-NSW: c is selected while b is abandoned since it is not diverse from a. (b) The edge selection in IPDG: b is ignored because the already selected a is a "super" point of it.

As shown in Figure 2 (a), the edge selection for metric spaces works as follows: for each newly inserted node (or edge-updating node) q and its nearest neighbor set (candidates) from Algorithm 2, a directed edge from q to the nearest neighbor a is constructed first. For any other candidate, say b, the edge selection algorithm checks whether:

    dis(q, b) < dis(a, b),    (6)

where dis(·, ·) is a distance between two vectors, such as the ℓ2-distance or the angular distance. If it is true, an edge from q to b is added; otherwise, b is abandoned in the selection. In this way, within a restricted degree, the newly inserted node will have diverse outgoing neighbors. As shown in Figure 2 (a), b is not selected while c is selected.

It is obvious that this edge selection method for metric spaces is not suitable for inner product. As presented in Figure 2 (b), although q⊤b > a⊤b (corresponding to dis(q, b) < dis(a, b)), b should not be selected, since a⊤b > b⊤b and, for any query vector q′ with all positive elements, we have q′⊤a > q′⊤b. This means that b is dispensable for the top-1 MIPS task and the edge from q to b should not be constructed. To solve this issue, we propose a new edge selection method that checks whether:

    b⊤b > a⊤b.    (7)

If it is true, we select b. Otherwise, we skip b, since a is a "super" point of b and b is dispensable. In this way, each inserted node tends to connect with extreme points rather than with other short-norm vectors.
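
The two checks can be written as small predicates. In the illustrative Python sketch below, q is the inserted vector, a an already selected neighbor and b the current candidate (all numpy arrays); Eq. (6) is the diversity test used by HNSW and ip-NSW in metric spaces, while Eq. (7) is the inner-product test used by IPDG.

    import numpy as np

    def keep_metric(q, a, b):
        """Eq. (6): keep b only if it is closer to q than to the selected neighbor a."""
        return np.linalg.norm(q - b) < np.linalg.norm(a - b)

    def keep_inner_product(a, b):
        """Eq. (7): keep b only if the selected a is not a "super" point of b."""
        return b @ b > a @ b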

5239 vectors. The detailed algorithm is summarized in final searching performance. A straightforward Algorithm1 Lines 17 − 28. method may probably help: inserting data points with larger norms first. We tried this trick but it Algorithm 1 IPDG Construction did not work well. The reason is that high norm 1: Input: dataset S, the size of candidate size N, points are not necessarily extreme points. Norms maximum outgoing degree of graph M. of extreme points for some Voronoi cells may be 2: Initialize graph G = ∅. round = 0 relatively small. The top large norm points may be 3: while round < 2 do just from one or a few Voronoi cells. In high dimen- 4: round = round + 1 sional data, it is difficult to find true extreme points. 5: for each x in S do Alternatively, we design a two rounds construc- 6: A ← GREEDY SEARCH(x, G, N). tion algorithm to solve this issue and exploit the 7: B ← EDGE SELECTION(A, M). additional round construction to update edges, es- −→ 8: Add edges xy to G for every y ∈ B. pecially for nodes inserted in the beginning. In this 9: for each y in B do . Edge Updating way, the graph construction algorithm can detect −→ 10: C ← {z ∈ S : yz ∈ G} ∪ {x}. extreme points automatically. We tried to conduct 11: D ← EDGE SELECTION(C,M). this two rounds construction method for ip-NSW 12: Remove original outgoing edges of too, but there are no significant improvements. −→ y, and add edges yz to G for z ∈ D. We share the graph construction algorithm for 13: end for IPDG, including the edge selection function in Al- 14: end for gorithm1. After the graph being constructed, we 15: end while perform MIPS via a greedy search algorithm pre- 16: Output: graph G. sented in Algorithm2. The greedy search algo- 17: function EDGE SELECTION(A, M) rithm is also used in the graph construction for 18: B = ∅. candidates collecting. 19: for y ∈ A do > > 20: if y y ≥ maxz∈B y z then Algorithm 2 GREEDY SEARCH(q, G, N)) 21: B = B ∪ {y} . 1: Input: The query q, the index graph G, the 22: end if size of candidate set N. 23: |B| ≥ M if then 2: Randomly choose a node with outgoing edges, 24: Break. say y. A ← {y}. Mark y as checked and the 25: end if rest as unchecked. . 26: end for In the practical implementation, A is a priority 27: Output: B. queue for efficiency. We note A as a set here 28: end function to simplify the expression. 3: while not all nodes in G are checked do −→ 3.2 Two-Round Construction 4: A ← A ∪ {z ∈ S : yz ∈ G, y ∈ A, z unchecked} Based on the new edge selection method introduced 5: Mark nodes in A as checked. above (and the reverse edge updating, see Algo- 6: A ← top N candidates in A ∪ Z in de- rithm1 Lines 9 − 13), nodes with larger norms scending order of inner product with q. will have higher probabilities to be selected as out- 7: if A does not update then going neighbors. So extreme points of the dataset 8: Break. will have more incoming edges and non-extremes 9: end if points will more likely have no incoming edges in 10: end while general. This is consistent with the true Delaunay 11: Output: A. Graphs in inner product space as previously shown in Figure1. However, at beginning of the graph construc- 3.3 A Toy Example tion, relatively “super” points are not true extreme points. Vectors coming in later may be better can- To further explain the differences between the pro- didates (i.e., true extreme points). This issue will posed method and previous state-of-the-art, ip- damage the overall graph quality and affect the NSW, we conduct a case study on a toy example


3.3 A Toy Example

To further explain the differences between the proposed method and the previous state-of-the-art, ip-NSW, we conduct a case study on a toy dataset, which is shown in Figure 3. We randomly generate 400 two-dimensional vectors following the distribution Normal(0, I_2). Figure 3 (a) shows the true Delaunay Graph for inner product. Red nodes correspond to extreme points of this dataset. Figure 3 (b) and (c) are graphs built by the proposed IPDG and by ip-NSW, respectively. The parameter N is set to 10 and M is set to 2 for both algorithms in this study. Note that the graphs built by IPDG and ip-NSW are directed graphs. For a clearer illustration, we only keep edges of nodes with incoming edges; other edges are ignored. Nodes without incoming edges will not be visited, do not affect the searching process, and thus can be removed after the graph construction. As can be seen, the graph built by IPDG is closer to the true Delaunay Graph and is more efficient for MIPS, while the graph built by ip-NSW has too many useless edges, as shown in Figure 3 (c).

Figure 3: A toy example of approximate inner product Delaunay Graph construction (green lines are edges; red dots are extreme points). (a) is the true Delaunay Graph. (b) is an approximation by IPDG. (c) is built by ip-NSW. Note that IPDG and ip-NSW construct directed edges instead of undirected ones for efficiency. Only edges of nodes with incoming edges are shown in (b) and (c).

4 Experiments

In this section, we evaluate the proposed IPDG by comparing it with state-of-the-art MIPS methods.

4.1 Datasets

We used the following three pre-trained embeddings to investigate the performance of IPDG in MIPS for similar word searching. For each word embedding dataset, we randomly select 10000 vectors as queries and use the others as the base data.

fastTextEn and fastTextFr are 300-dimensional English and French word embeddings trained on Wikipedia using fastText (Joulin et al., 2016).

GloVe50 are 50-dimensional word embeddings trained on Wikipedia2014 and Gigaword5 using GloVe (Pennington et al., 2014).

As most state-of-the-art MIPS algorithms evaluate their performance on recommendation datasets, we also benchmark IPDG on three recommendation datasets: Amazon Movie (Amovie), Yelp and Netflix. We use the Matrix Factorization (MF) method of (Hu et al., 2008) to obtain latent vectors of users and items. In the retrieval process, user vectors are regarded as queries, and the item vector with the highest inner product score for each query should be returned by the MIPS algorithm.

Table 1: Statistics of the datasets.

    Datasets    | Dimension | # Base Data
    fastTextEn  | 300       | 989873
    fastTextFr  | 300       | 1142501
    GloVe       | 50        | 1183514
    Amovie      | 64        | 104708
    Yelp        | 64        | 25815
    Netflix     | 50        | 17770

Statistics of the six datasets are listed in Table 1. They vary in dimension (300, 64 and 50), source (recommendation ratings, word documents) and extraction method (fastText, GloVe and MF), which is sufficient for a fair comparison. The ground truth is the top-1 nearest neighbor by inner product.
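
As a point of reference, the exact top-1 ground truth can be computed by brute force, as in the following numpy sketch for the recommendation setting (user vectors as queries, item vectors as base data); the sizes and names here are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)
    item_vecs = rng.standard_normal((20000, 64))   # base data, e.g., MF item factors
    user_vecs = rng.standard_normal((1000, 64))    # queries, e.g., MF user factors

    # exact top-1 MIPS answer per query: argmax_i  x_i . q
    ground_truth = np.argmax(user_vecs @ item_vecs.T, axis=1)

    def recall_at_1(returned_ids, truth):
        """Average top-1 recall of the item ids returned by a MIPS algorithm."""
        return float(np.mean(np.asarray(returned_ids) == truth))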

Figure 4: Recall vs. Time curves (Queries Per Second vs. Avg. Recall) for all methods in top-1 MIPS on fastTextEn, fastTextFr, GloVe, Amovie, Yelp and Netflix. Results for Faiss-PQ on fastTextEn and fastTextFr are not shown since they cannot produce recalls greater than 0.6. Best results are in the upper right corners.

4.2 Baselines

In this paper, we compare IPDG with state-of-the-art MIPS methods. Firstly, reduction methods can serve as baselines. Some popular open-source ANNS platforms utilize the reduction trick to solve MIPS, such as Annoy¹. As introduced in Section 2.2, with such reductions, any ANNS method can be applied to MIPS.

¹ https://github.com/spotify/annoy

In this line, we choose HNSW (Malkov and Yashunin, 2018) (referred to as HNSW-Wrapper) as the baseline and neglect other alternatives, since HNSW is usually regarded as the most promising method for ANNS in metric spaces. We use the original implementation of HNSW² and add the wrapper introduced in Section 2.2.

Range-LSH (Yan et al., 2018) is also a reduction MIPS method and considers the norm distribution of the data. The original implementation³ is used.

Faiss-PQ⁴ is a popular open-source ANNS platform from Facebook, which is mainly implemented with Product Quantization (PQ) techniques. It contains MIPS as one component.

Greedy-MIPS is a MIPS algorithm from Yu et al. (2017). We use the original implementation⁵.

ip-NSW is a state-of-the-art MIPS algorithm proposed in (Morozov and Babenko, 2018)⁶.

² https://github.com/nmslib
³ https://github.com/xinyandai/similarity-search
⁴ https://github.com/facebookresearch/faiss
⁵ https://github.com/rofuyu/exp-gmips-nips17
⁶ https://github.com/stanis-morozov/ip-nsw

4.3 Experimental Settings

There are two popular ways to evaluate ANNS/MIPS algorithms: i) Recall vs. Time; ii) Recall vs. Computations. Recall vs. Time reports the number of queries an algorithm can process per second at each recall level. Recall vs. Computations reports the amount/percentage of pairwise distance/similarity computations that the ANNS/MIPS algorithm costs at each recall level. Both evaluation indicators have their own pros and cons: Recall vs. Time is straightforward, but it may introduce bias from implementation details; Recall vs. Computations is independent of the implementation, but it does not account for the cost of different index structures. We show both perspectives in the following experiments for a comprehensive evaluation.

All compared methods have tunable parameters. In order to present a fair comparison, we vary all parameters over a fine grid for all methods. For each algorithm in each experiment, we thus have multiple points scattered on the plane. To plot curves, we first find the best result, max_x, along the x-axis (i.e., Recall). Then 100 buckets are produced by splitting the range from 0 to max_x evenly. For each bucket, the best result along the y-axis (e.g., the largest number of queries per second or the lowest percentage of computations) is chosen. If a bucket contains no data points, it is ignored. In this way, we obtain multiple pairs of data for drawing curves. All time-related experiments were performed on a 2 × 3.00 GHz 8-core i7-5960X CPU server with 32GB memory.
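
The bucketing step just described is easy to reproduce; the Python sketch below is only an illustration of that procedure (for Recall vs. Time the best value per bucket is the largest QPS, for Recall vs. Computations it is the smallest percentage).

    import numpy as np

    def curve_points(recalls, ys, n_buckets=100, larger_is_better=True):
        """Keep, for each of n_buckets evenly split recall ranges, the best y value."""
        recalls, ys = np.asarray(recalls, float), np.asarray(ys, float)
        edges = np.linspace(0.0, recalls.max(), n_buckets + 1)
        points = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (recalls >= lo) & (recalls <= hi)
            if not mask.any():                # buckets without data points are ignored
                continue
            idx = np.argmax(ys[mask]) if larger_is_better else np.argmin(ys[mask])
            points.append((recalls[mask][idx], ys[mask][idx]))
        return points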

4.4 Experimental Results

We first show experimental results for all compared algorithms from the Recall vs. Time perspective in Figure 4. Overall, the proposed method IPDG performs consistently and significantly better than the baselines on all six datasets. As can be seen, some baselines show promising performance on part of the datasets but work much worse on the others. For example, on the lower-dimensional datasets (i.e., the last four panels of Figure 4), ip-NSW works well, but it fails on the high-dimensional datasets (i.e., fastTextEn and fastTextFr). Greedy-MIPS shows advantages on high-dimensional datasets but becomes worse on some lower-dimensional datasets, such as Netflix and GloVe. Among all methods, only IPDG works consistently well on all datasets, which shows its effectiveness and robustness. Range-LSH performs badly in these experiments; the main reason is that Range-LSH does not have a good "budget" setting, analogous to the budget in Greedy-MIPS and the Nsearch parameter in graph-based methods. HNSW-Wrapper does not work comparably with IPDG either, especially on the word embedding datasets. At some recall levels, say higher than 0.5, searching by HNSW-Wrapper is extremely slow (see the first three panels). It is clear that HNSW-Wrapper is far from state-of-the-art on challenging MIPS tasks, such as larger or higher-dimensional vector datasets. The PQ-based method, Faiss-PQ, works badly on all datasets: quantization codes speed up the retrieval but may largely reduce the search quality, especially for the challenging top-1 MIPS problem. Note that results for Faiss-PQ on fastTextEn and fastTextFr are not shown in Figure 4 since they cannot produce recalls greater than 0.6.

Figure 5: Recall vs. Computations curves (% Computations vs. Avg. Recall) in top-1 MIPS on fastTextFr, GloVe and Amovie. Note that the results for HNSW-Wrapper on fastTextFr and GloVe fall outside the shown range. Best results are in the lower right corners.

We also show experimental results by Recall vs. Computations in Figure 5. Greedy-MIPS and Faiss-PQ cannot be evaluated from this view, so the other four methods are explored here. Due to limited space, only results on part of the datasets are presented. As can be seen, only IPDG and ip-NSW work consistently well on all shown datasets. HNSW-Wrapper and Range-LSH work comparably with the other two methods on the recommendation dataset, Amovie, while performing much worse on the word embedding datasets, fastTextFr and GloVe; the results for HNSW-Wrapper on fastTextFr and GloVe cannot even be shown within the plotted range. Since IPDG and ip-NSW share similar index structures, it is fair to compare their computation amount per query. To reach a similar recall, IPDG requires much less inner product computation. For example, on fastTextFr, to reach a recall of 95%, ip-NSW requires about 0.3% of the pairwise computations while IPDG only needs 0.07%. This also demonstrates the efficiency of vector inner product retrieval by IPDG.

4.5 More Comparison with ip-NSW

In this section, we conduct a study comparing the proposed IPDG and its most closely related method, ip-NSW, on index graph quality. The evaluation measure is the number of nodes with incoming edges. Intuitively, only extreme points of each dataset are useful for top-1 MIPS retrieval. Non-extreme points could be ignored in graph construction (i.e., left without incoming edges, so they will not be visited in searching). Results for N = 100 and M = 16 are shown in Table 2.

Table 2: Number and percentage of nodes with incoming edges for graphs built by ip-NSW and IPDG.

    Datasets    | ip-NSW          | IPDG
    fastTextEn  | 144339 (14.6%)  | 100138 (10.1%)
    fastTextFr  | 378875 (33.2%)  | 250750 (21.9%)
    GloVe       | 622080 (52.6%)  | 437378 (37.0%)
    Amovie      | 32434 (31.0%)   | 12985 (12.4%)
    Yelp        | 5224 (20.2%)    | 1871 (7.2%)
    Netflix     | 17154 (96.5%)   | 14867 (83.7%)
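
The measure reported in Table 2 is straightforward to compute from a directed index graph; a tiny illustrative sketch, assuming the same dict-of-outgoing-edges layout as in the earlier sketches:

    def nodes_with_incoming_edges(graph):
        """Count and share of nodes that receive at least one incoming edge."""
        targets = {v for neighbors in graph.values() for v in neighbors}
        return len(targets), len(targets) / max(len(graph), 1)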

As can be seen, the graphs built by IPDG have far fewer nodes with incoming edges, which is consistent with the toy example introduced above. The reason can be explained as follows. The carefully designed edge selection method in IPDG tends to select extreme points as outgoing neighbors for each newly inserted node or edge-updating node (see Algorithm 1, Lines 9−13). Meanwhile, extreme points have more opportunities to keep incoming edges during the edge updating and the second round of graph construction, while non-extreme points tend to lose their incoming edges in these processes.

5 Conclusion and Future Work

Fast similarity search for data representations via inner product is a crucial and challenging task, since it is one of the basic operations in machine learning algorithms and recommendation methods. To address this problem, we propose a search-on-graph method, namely the Inner Product Delaunay Graph (IPDG), for Maximum Inner Product Search (MIPS) over embedded latent vectors. IPDG provides a better approximation to Delaunay Graphs for inner product than previous methods and is more efficient for the MIPS task. Experiments on extensive benchmarks demonstrate that IPDG outperforms previous state-of-the-art MIPS methods in retrieving latent vectors under inner product.

In this paper, we improve top-1 MIPS performance with a graph-based index. In the future, we will try to move the state-of-the-art frontier further, not only for top-1 MIPS but also for top-n (n > 1) MIPS. Besides metric measures (e.g., ℓ2-distance and cosine similarity) and inner product, more complicated measures have been studied, for example in (Tan et al., 2019); it would be interesting to adopt these measures in NLP tasks. Another promising direction is to adopt a GPU-based system for fast ANNS or MIPS, which has been shown to be highly effective for generic ANNS tasks (Li et al., 2012; Johnson et al., 2017; Zhao et al., 2019). Developing GPU-based algorithms for MIPS is a topic which has not been fully explored.

Acknowledgement

The authors would like to sincerely thank the anonymous reviewers of NAACL 2019 and EMNLP 2019 for their helpful comments, which have improved the quality of this paper.

References

Brice Acree, Eric Hansen, Joshua Jansa, and Kelsey Shoub. 2016. Comparing and evaluating cosine similarity scores, weighted cosine similarity scores and substring matching. Technical report.

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.

Franz Aurenhammer. 1991. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3):345–405.

Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys), pages 257–264.

C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. 1996. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469–483.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Paolo Cignoni, Claudio Montani, and Roberto Scopigno. 1998. DeWall: A fast divide and conquer Delaunay triangulation algorithm in E^d. Computer-Aided Design, 30(5):333–341.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.

Jerome H. Friedman, F. Baskett, and L. Shustek. 1975. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006.

Jerome H. Friedman, J. Bentley, and R. Finkel. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209–226.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 699–709.

Hongyu Gong, Tarek Sakakini, Suma Bhat, and Jinjun Xiong. 2018. Document similarity for texts of varying lengths via hidden topics. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 2341–2351.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Artificial Intelligence and Statistics (AISTATS), pages 482–490.

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), pages 263–272.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), pages 2333–2338.

Qiang Huang, Guihong Ma, Jianlin Feng, Qiong Fang, and Anthony K. H. Tung. 2018. Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1561–1570.

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 604–613, Dallas, TX.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Yannis Kalantidis and Yannis Avrithis. 2014. Locally optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2321–2328.

Maximilian Köper and Sabine Schulte im Walde. 2018. Analogies in complex verb meaning shifts: the effect of affect in semantic similarity models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), volume 2, pages 150–156.

Moontae Lee, Xiaodong He, Wen-tau Yih, Jianfeng Gao, Li Deng, and Paul Smolensky. 2015. Reasoning in vector space: An exploratory study of question answering. arXiv preprint arXiv:1511.06426.

Omer Levy, Anders Søgaard, and Yoav Goldberg. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 1, pages 765–774.

Ping Li, Anshumali Shrivastava, and Christian A. König. 2012. GPU-based minwise hashing. In Proceedings of the 21st World Wide Web Conference (WWW), pages 565–566, Lyon, France.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 912–921.

Yury A. Malkov and Dmitry A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119.

Stanislav Morozov and Artem Babenko. 2018. Non-metric similarity graphs for maximum inner product search. In Advances in Neural Information Processing Systems (NeurIPS), pages 4722–4731.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 1498–1507.

Jan Wira Gotama Putra and Takenobu Tokunaga. 2017. Evaluating text coherence based on semantic similarity graph. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 76–85.

Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. 2009. Random walks for text semantic similarity. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 23–31.

Mehran Sahami and Timothy D. Heilman. 2006. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International Conference on World Wide Web (WWW), pages 377–386, Edinburgh, Scotland, UK.

Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems (NIPS), pages 2321–2329, Montreal, Canada.

Anshumali Shrivastava and Ping Li. 2015. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 812–821, Amsterdam, The Netherlands.

Gokavarapu Srinivas, Niket Tandon, and Vasudeva Varma. 2010. A weighted tag similarity measure based on a collaborative weight model. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 79–86.

Kohei Sugawara, Hayato Kobayashi, and Masajiro Iwasaki. 2016. On approximately searching for similar word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 2265–2275.

Md Arafat Sultan, Jordan Boyd-Graber, and Tamara Sumner. 2016. Bayesian supervised domain adaptation for short text similarity. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 927–936.

Yifan Sun, Nikhil Rao, and Weicong Ding. 2017. A simple approach to learn polysemous word embeddings. arXiv preprint arXiv:1707.01793.

Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, and Ping Li. 2019. Fast Item Ranking under Neural Network based Measures. Technical report, Baidu Research.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008.

Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N. Holtmann-Rice, David Simcha, and Felix Yu. 2017. Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems (NIPS), pages 5745–5755.

Jun Xu, Xiangnan He, and Hang Li. 2018. Deep learning for matching in search and recommendation. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 1365–1368.

Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. 2018. Norm-ranging LSH for maximum inner product search. In Advances in Neural Information Processing Systems (NeurIPS), pages 2952–2961, Montreal, Canada.

Seunghyun Yoon, Pablo Estrada, and Kyomin Jung. 2017. Synonym discovery with etymology-based word embeddings. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–6.

Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit S. Dhillon. 2017. A greedy approach for budgeted maximum inner product search. In Advances in Neural Information Processing Systems (NIPS), pages 5453–5462, Long Beach, CA.

Weijie Zhao, Shulong Tan, and Ping Li. 2019. SONG: Approximate nearest neighbor search on GPU. Technical report, Baidu Research.