On Efficient Retrieval of Top Similarity Vectors

Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, Ping Li
Cognitive Computing Lab, Baidu Research USA
1195 Bordeaux Dr, Sunnyvale, CA 94089, USA
10900 NE 8th St, Bellevue, WA 98004, USA
{shulongtan, zhixinzhou, v xuzhaozhuo, liping11}@baidu.com

Abstract

Retrieval of relevant vectors produced by representation learning critically influences the efficiency of natural language processing (NLP) tasks. In this paper we demonstrate an efficient method for searching vectors via a typical non-metric matching function: inner product. Our method, which constructs an approximate Inner Product Delaunay Graph (IPDG) for top-1 Maximum Inner Product Search (MIPS), transforms retrieval of the most suitable latent vectors into a graph search problem, with great benefits in efficiency. Experiments on data representations learned for different machine learning tasks verify the superior effectiveness and efficiency of the proposed IPDG.

1 Introduction

With the popularity of representation learning methods such as Word2vec (Mikolov et al., 2013a), words are represented as real-valued embedding vectors in a semantic space. Retrieval of similar word embeddings is therefore one of the most basic operations in natural language processing, with wide applicability in synonym extraction (Yoon et al., 2017), sentence alignment (Levy et al., 2017), polysemous word learning (Sun et al., 2017) and semantic search for documents related to a query.

In this work, we address efficient retrieval of similar word embeddings via inner product (dot product) similarity. Inner product is a general semantic matching function with applications in neural probabilistic language models (Bengio et al., 2003), machine translation (Gao et al., 2014), question answering (Lee et al., 2015), and attention mechanisms (Vaswani et al., 2017). For normalized vectors, inner product is equivalent to cosine similarity, a common semantic textual similarity utilized in semantic classification and search (Sahami and Heilman, 2006; Ramage et al., 2009; Agirre et al., 2012; Huang et al., 2013; Liu et al., 2015; Faruqui et al., 2016; Sultan et al., 2016; Köper and im Walde, 2018; Gong et al., 2018), Relation Extraction (RE) (Plank and Moschitti, 2013) and text coherence evaluation (Putra and Tokunaga, 2017). For un-normalized vectors, although cosine similarity is still widely applied, the final matching scores of word embeddings are usually weighted (Acree et al., 2016; Srinivas et al., 2010) by ranking-based coefficients (e.g., side information), which transforms the problem back to search via inner product (see Eq. (2)).

Formally, retrieving the most similar word with the inner product ranking function is a Maximum Inner Product Search (MIPS) problem. MIPS is a continuously addressed topic (Bachrach et al., 2014; Shrivastava and Li, 2014; Kalantidis and Avrithis, 2014; Shrivastava and Li, 2015; Guo et al., 2016; Wu et al., 2017), and it has non-trivial differences from traditional Approximate Nearest Neighbor Search (ANNS) (Friedman et al., 1975, 1977; Indyk and Motwani, 1998) problems. ANNS is the optimization problem of finding the points closest to a query point in a given set, where "close" usually means small in a metric distance such as cosine or Euclidean distance, which has an obvious geometrical interpretation. Inner product, however, is a typical non-metric measure, which distinguishes MIPS from traditional ANNS problems; methods designed for ANNS may thus have performance limitations in MIPS. For NLP tasks such as retrieving relevant word embeddings by cosine or Euclidean distance, different ANNS methods have been studied (Sugawara et al., 2016). To the best of our knowledge, there is little literature on MIPS for retrieving word or language representations.

Currently, search-on-graph methods, such as Hierarchical Navigable Small World graphs (HNSW), are regarded as the state-of-the-art ANNS approach (Malkov and Yashunin, 2018). Performance evaluations have demonstrated that HNSW strongly outperforms other methods on ANNS benchmarks for metric distances. Meanwhile, the graph structure has the flexibility of defining measures on edges, which makes HNSW feasible for MIPS. Morozov and Babenko (2018) apply HNSW to MIPS with positive results and also introduce the concept of the Delaunay Graph to explain similarity-graph-based methods for MIPS. Nevertheless, the link between HNSW and the Delaunay Graph is still tenuous: although the global optimum of MIPS can be retrieved on a Delaunay Graph, there is little evidence that HNSW approximates a proper Delaunay Graph for inner product. How to provide a solid graph-based MIPS method is still an open question.

In this paper, we propose a new search-on-graph method, namely the Inner Product Delaunay Graph (IPDG), for MIPS. Our key contributions can be summarized as follows:

• Design an edge selection algorithm specifically for inner product, which reduces useless edges in the graph and thus improves searching efficiency.
• Propose a two-round graph construction algorithm for effectively approximating the Delaunay Graph under inner product.
• Empirically evaluate the effectiveness and efficiency of IPDG, providing a state-of-the-art MIPS method for similarity search on word embedding datasets.

The organization of this paper is as follows: in the next section, we introduce the research background. In Section 3, the approximate Inner Product Delaunay Graph (IPDG) is introduced. In Section 4, we explore the effectiveness and efficiency of IPDG in maximum inner product word retrieval and compare it with state-of-the-art MIPS methods. Section 5 concludes the paper.

2 Background

In this section, we first introduce the definition of the Maximum Inner Product Search (MIPS) problem and review state-of-the-art methods for MIPS. We then summarize a theoretical solution for MIPS based on searching Delaunay Graphs.

2.1 Problem Statement

In machine learning tasks, embedding methods such as Word2vec (Mikolov et al., 2013a,b), GloVe (Pennington et al., 2014) or deep collaborative filtering (Xu et al., 2018) learn representations of data as dense, distributed, real-valued vectors. Formally, for a latent space X ⊂ R^d, given an arbitrary query vector q ∈ X and a set of vectors S = {x_1, . . . , x_n} ⊂ X, vector similarity is defined by a continuous symmetric matching function f : X × X → R. The goal of similar vector retrieval is to find

    arg max_{x ∈ S} f(x, q).    (1)

In this paper, we specifically discuss the non-metric similarity measure of inner product:

    f(x, q) = x⊤q,  x, q ∈ X = R^d \ {0}.

Without loss of generality, we can always assume ‖q‖ = 1. We are not interested in the zero vector since its inner product with any vector is always zero. The problem in Eq. (1) with respect to the inner product is often referred to as Maximum Inner Product Search (MIPS) in the literature.

The weighted cosine ANNS problem can also be viewed as a MIPS problem. Consider a data set S = {(z_i, w_i) : i ∈ [n]}, where w_i is a real scalar and z_i is a vector. Then

    w cos(z, q) = w · z⊤q / (‖z‖ ‖q‖) = (w z / ‖z‖)⊤ (q / ‖q‖),    (2)

where ‖q‖ = 1. As can be seen, weighted ANNS w.r.t. cosine similarity is equivalent to MIPS by letting x_i = w_i z_i / ‖z_i‖.

2.2 Related Works

Previous approaches for Maximum Inner Product Search (MIPS) can be mainly categorized into: (1) reducing MIPS to ANNS; and (2) non-reduction methods. Reduction methods add wrappers on indexed data and queries asymmetrically and reduce the MIPS problem to ANNS in metric spaces (Shrivastava and Li, 2015; Bachrach et al., 2014). For example, given the query q, the indexed data S = {x_1, ..., x_n} and Φ = max_i ‖x_i‖, the wrapper can be defined as:

    P(x) = [x/Φ; √(1 − ‖x‖²/Φ²)],    (3)
    Q(q) = [q; 0].    (4)

It is not difficult to prove that searching the new data by cosine or ℓ2-distance is equivalent to searching the original data by inner product. Recently, researchers found that the methods above can be improved further, based on the observation of the long-tail distribution of data norms (Huang et al., 2018; Yan et al., 2018).
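
The reduction in Eqs. (3)–(4) is easy to verify numerically. The following minimal numpy sketch is only an illustration (the data and shapes are made up): since ‖P(x)‖ = 1 for every indexed vector, the cosine between P(x) and Q(q) equals x⊤q/(Φ‖q‖), so the cosine top-1 on the transformed data coincides with the inner-product top-1 on the original data.

    import numpy as np

    def P(X):
        # applied to the indexed vectors; phi is the largest data norm
        phi = np.max(np.linalg.norm(X, axis=1))
        extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(X ** 2, axis=1) / phi ** 2))
        return np.hstack([X / phi, extra[:, None]])

    def Q(q):
        # applied to the query; appends a zero coordinate
        return np.append(q, 0.0)

    X = np.random.randn(1000, 50)
    q = np.random.randn(50)
    Xp, qp = P(X), Q(q)
    cosine = (Xp @ qp) / (np.linalg.norm(Xp, axis=1) * np.linalg.norm(qp))
    assert np.argmax(cosine) == np.argmax(X @ q)   # same top-1 answer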

New approaches have been proposed by adding wrappers for each norm range, such as Range-LSH (Yan et al., 2018). With reductions like the above, any ANNS method can be applied to MIPS. However, it has been shown that reduction-based MIPS methods have performance limitations (Morozov and Babenko, 2018).

Recently, more and more non-reduction methods have been proposed specifically for MIPS. Guo et al. (2016) proposed a MIPS method based on Product Quantization (PQ). Yu et al. (2017) used an upper bound of the inner product as the approximation of MIPS and designed a greedy search algorithm, called Greedy-MIPS, to find this approximation. A graph-based non-reduction MIPS method, ip-NSW, was first introduced in Morozov and Babenko (2018), where the theoretical basis for conducting MIPS by similarity graphs was also provided. Carrying over the advantages of similarity-graph-based methods for ANNS, ip-NSW showed superior performance for MIPS.

2.3 Delaunay Graph

The Delaunay Graph plays an important role in similarity search. The properties and construction of the ℓ2-Delaunay Graph have been considered in the literature (Aurenhammer, 1991; Cignoni et al., 1998). Indeed, one can generalize the definition to any real binary function, including inner product.

Definition 2.1. The Voronoi cell R_i with respect to f and x_i is the set

    R_i := {q ∈ X : f(x_i, q) ≥ f(x, q) for all x ∈ S}.

Moreover, x ∈ S is an extreme point if it is associated with a nonempty Voronoi cell.

Definition 2.2. The Delaunay Graph G with respect to f and S is an undirected graph with vertices S satisfying {x_i, x_j} ∈ G if and only if R_i ∩ R_j ≠ ∅.

Figure 1: The relation between the Delaunay Graph and Voronoi cells in the inner product space. The red dots are extreme points of each Voronoi cell. The Delaunay Graph connects extreme points with black edges. If we search on this dataset, every query has a maximum inner product with one of these extreme points (i.e., the red ones).

An example of Voronoi cells and the corresponding Delaunay Graph in inner product space is shown in Figure 1. Regions in different colors correspond to the Voronoi cells of extreme points (red ones). The Delaunay Graph connects extreme points. Different from metric similarities (e.g., ℓ2-distance), the Voronoi cells of some data points with respect to inner product are possibly empty. By Definition 2.2, a data point is isolated (i.e., has no incident edges) if its Voronoi cell is empty. As we can see in Figure 1, there are many isolated points (blue ones). The proportion of extreme points is relatively small in general, and Theorem 2.1 will show that only extreme points can achieve a maximum inner product score for any nonzero query. The definition of an extreme point is equivalent to the one in (Barber et al., 1996), i.e., x ∈ S is extreme if and only if x is on the boundary of the convex hull of S. In the two-dimensional case, the edges form the boundary of the convex hull, which is also shown in Figure 1.

2.4 Search on Delaunay Graph

Searching on the Delaunay Graph has been demonstrated to be effective for similarity search (Morozov and Babenko, 2018). In the inner product case, given any query vector q ∈ X, we start from an extreme point, then move to its neighbor that has a larger inner product with q. We repeat this step until reaching an extreme point that has a larger inner product with q than all its neighbors, and then we return it. It can be demonstrated that this returned local optimum is actually the global optimum.

Generally, for any searching measure f, if the corresponding Voronoi cells are connected, then the local optimum returned by the greedy search is also the global optimum. Formally, the statement can be summarized as below. The proof can be found in Morozov and Babenko (2018).

Theorem 2.1. Suppose f satisfies that the Voronoi cells R_i with respect to any subset of S (including S itself) are connected on X, and G is the Delaunay Graph with respect to f and some S. Then for q ∈ X, a local maximum of the greedy search starting from an extreme point, that is, x_i ∈ S satisfying

    f(x_i, q) ≥ max_{x ∈ N(x_i)} f(x, q),    (5)

where N(x_i) = {x ∈ S : {x_i, x} ∈ G}, is a global maximum.
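
The greedy procedure of Section 2.4 is a simple hill climb on the graph. The sketch below is a minimal Python illustration, assuming the graph is stored as a dict from node id to a list of neighbor ids and X holds the data vectors (these names are illustrative, not from the paper). Under the conditions of Theorem 2.1, starting from an extreme point, the local maximum it returns is the global MIPS answer.

    import numpy as np

    def greedy_walk(q, graph, X, start):
        """Repeatedly move to the neighbor with the largest inner product with q."""
        cur = start
        while True:
            neighbors = graph[cur]
            if not neighbors:
                return cur
            best = max(neighbors, key=lambda v: X[v] @ q)
            if X[best] @ q <= X[cur] @ q:   # no neighbor improves: local maximum
                return cur
            cur = best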

Suppose the assumptions (i.e., connected Voronoi cells) in Theorem 2.1 hold; then searching on the Delaunay Graph finds the global maximum. It is easy to check that the assumptions hold for the inner product case, since the Voronoi cells w.r.t. the inner product are either empty or a convex cone, and hence connected. We can then claim that, when searching on the Delaunay Graph under inner product, the vector in S that has the maximum inner product with the query vector will be retrieved.

3 Inner Product Delaunay Graph

Although the Delaunay Graph has demonstrated its potential in similarity search, direct construction of the Delaunay Graph on large-scale and high-dimensional datasets is unfeasible, due to the exponentially growing number of edges in high dimension. To remedy this issue, practical algorithms usually approximate Delaunay Graphs. In this section, we present the newly proposed algorithm for constructing an approximate Delaunay Graph in inner product space, namely the Inner Product Delaunay Graph (IPDG). Two key features of our algorithm are introduced first: i) edge selection specifically for inner product; and ii) the two-round graph construction. We then conduct a case study on a toy dataset to show the effectiveness of IPDG in constructing better approximate Delaunay Graphs for inner product.

3.1 Edge Selection for Inner Product

To balance the effectiveness (retrieval of the nearest neighbor) and the efficiency (completing the process within limited time) of the retrieval, some empirical tricks are usually applied in previous search-on-graph methods: a) use directed edges instead of undirected edges; b) restrict the degree of outgoing edges for each node; and c) select more diverse outgoing edges (Malkov and Yashunin, 2018; Morozov and Babenko, 2018).

Specifically, for the inner product case, ip-NSW proposed in (Morozov and Babenko, 2018) applies all the tricks listed above (although the authors did not mention it in the paper, the implementation did inherit all features from HNSW). We found that the edge selection method is vital for the trade-off between effectiveness and efficiency in searching. However, the existing edge selection techniques used in HNSW and ip-NSW are actually designed for metric distances and are inapplicable to a non-metric measure such as inner product.

Figure 2: An example of edge selection used in constructing an approximate Delaunay Graph. (a) The selection method for metric spaces used in HNSW and ip-NSW: c is selected while b is abandoned since it is not diverse from a. (b) The edge selection in IPDG: b is ignored because the already selected a is a "super" point of it.

As shown in Figure 2 (a), the edge selection for metric spaces works as follows: for each newly inserted node (or edge-updating node) q and its nearest neighbor set (candidates) from Algorithm 2, a directed edge from q to the nearest neighbor a is constructed first. For any other candidate, say b, the edge selection algorithm checks whether:

    dis(q, b) < dis(a, b),    (6)

where dis(·, ·) is a distance between two vectors, such as the ℓ2-distance or the angular distance. If it is true, an edge from q to b is added; otherwise, b is abandoned in the selection. In this way, within a restricted degree, the newly inserted node will have diverse outgoing neighbors. As shown in Figure 2 (a), b is not selected while c is selected.

It is obvious that this edge selection method for metric spaces is not suitable for inner product. As presented in Figure 2 (b), although q⊤b > a⊤b (corresponding to dis(q, b) < dis(a, b)), b should not be selected, since a⊤b > b⊤b and, for any query vector q′ with all positive elements, we have q′⊤a > q′⊤b. This means that b is dispensable for the top-1 MIPS task and the edge from q to b should not be constructed. To solve this issue, we propose a new edge selection method that checks whether:

    b⊤b > a⊤b.    (7)

If it is true, we select b. Otherwise, we skip b, since a is a "super" point of b and b is dispensable. In this way, each inserted node tends to connect with extreme points rather than with other short-norm vectors.
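
The two checks can be written as small predicates. In the illustrative Python sketch below, q is the inserted vector, a an already selected neighbor and b the current candidate (all numpy arrays); Eq. (6) is the diversity test used by HNSW and ip-NSW in metric spaces, while Eq. (7) is the inner-product test used by IPDG.

    import numpy as np

    def keep_metric(q, a, b):
        """Eq. (6): keep b only if it is closer to q than to the selected neighbor a."""
        return np.linalg.norm(q - b) < np.linalg.norm(a - b)

    def keep_inner_product(a, b):
        """Eq. (7): keep b only if the selected a is not a "super" point of b."""
        return b @ b > a @ b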

5239 vectors. The detailed algorithm is summarized in final searching performance. A straightforward Algorithm1 Lines 17 − 28. method may probably help: inserting data points with larger norms first. We tried this trick but it Algorithm 1 IPDG Construction did not work well. The reason is that high norm 1: Input: dataset S, the size of candidate size N, points are not necessarily extreme points. Norms maximum outgoing degree of graph M. of extreme points for some Voronoi cells may be 2: Initialize graph G = ∅. round = 0 relatively small. The top large norm points may be 3: while round < 2 do just from one or a few Voronoi cells. In high dimen- 4: round = round + 1 sional data, it is difficult to find true extreme points. 5: for each x in S do Alternatively, we design a two rounds construc- 6: A ← GREEDY SEARCH(x, G, N). tion algorithm to solve this issue and exploit the 7: B ← EDGE SELECTION(A, M). additional round construction to update edges, es- −→ 8: Add edges xy to G for every y ∈ B. pecially for nodes inserted in the beginning. In this 9: for each y in B do . Edge Updating way, the graph construction algorithm can detect −→ 10: C ← {z ∈ S : yz ∈ G} ∪ {x}. extreme points automatically. We tried to conduct 11: D ← EDGE SELECTION(C,M). this two rounds construction method for ip-NSW 12: Remove original outgoing edges of too, but there are no significant improvements. −→ y, and add edges yz to G for z ∈ D. We share the graph construction algorithm for 13: end for IPDG, including the edge selection function in Al- 14: end for gorithm1. After the graph being constructed, we 15: end while perform MIPS via a greedy search algorithm pre- 16: Output: graph G. sented in Algorithm2. The greedy search algo- 17: function EDGE SELECTION(A, M) rithm is also used in the graph construction for 18: B = ∅. candidates collecting. 19: for y ∈ A do > > 20: if y y ≥ maxz∈B y z then Algorithm 2 GREEDY SEARCH(q, G, N)) 21: B = B ∪ {y} . 1: Input: The query q, the index graph G, the 22: end if size of candidate set N. 23: |B| ≥ M if then 2: Randomly choose a node with outgoing edges, 24: Break. say y. A ← {y}. Mark y as checked and the 25: end if rest as unchecked. . 26: end for In the practical implementation, A is a priority 27: Output: B. queue for efficiency. We note A as a set here 28: end function to simplify the expression. 3: while not all nodes in G are checked do −→ 3.2 Two-Round Construction 4: A ← A ∪ {z ∈ S : yz ∈ G, y ∈ A, z unchecked} Based on the new edge selection method introduced 5: Mark nodes in A as checked. above (and the reverse edge updating, see Algo- 6: A ← top N candidates in A ∪ Z in de- rithm1 Lines 9 − 13), nodes with larger norms scending order of inner product with q. will have higher probabilities to be selected as out- 7: if A does not update then going neighbors. So extreme points of the dataset 8: Break. will have more incoming edges and non-extremes 9: end if points will more likely have no incoming edges in 10: end while general. This is consistent with the true Delaunay 11: Output: A. Graphs in inner product space as previously shown in Figure1. However, at beginning of the graph construc- 3.3 A Toy Example tion, relatively “super” points are not true extreme points. Vectors coming in later may be better can- To further explain the differences between the pro- didates (i.e., true extreme points). This issue will posed method and previous state-of-the-art, ip- damage the overall graph quality and affect the NSW, we conduct a case study on a toy example


3.3 A Toy Example

To further explain the differences between the proposed method and the previous state-of-the-art, ip-NSW, we conduct a case study on a toy dataset, which is shown in Figure 3. We randomly generate 400 two-dimensional vectors following the distribution Normal(0, I_2). Figure 3 (a) shows the true Delaunay Graph for inner product. Red nodes correspond to extreme points of this dataset. Figure 3 (b) and (c) are graphs built by the proposed IPDG and by ip-NSW, respectively. The parameter N is set to 10 and M is set to 2 for both algorithms in this study. Note that the graphs built by IPDG and ip-NSW are directed graphs. For a clearer illustration, we only keep edges of nodes with incoming edges; other edges are ignored. Nodes without incoming edges will not be visited, do not affect the searching process, and thus can be removed after the graph construction. As can be seen, the graph built by IPDG is closer to the true Delaunay Graph and is more efficient for MIPS, while the graph built by ip-NSW has too many useless edges, as shown in Figure 3 (c).

Figure 3: A toy example of approximate inner product Delaunay Graph construction (green lines are edges; red dots are extreme points). (a) is the true Delaunay Graph. (b) is an approximation by IPDG. (c) is built by ip-NSW. Note that IPDG and ip-NSW construct directed edges instead of undirected ones for efficiency. Only edges of nodes with incoming edges are shown in (b) and (c).

4 Experiments

In this section, we evaluate the proposed IPDG by comparing it with state-of-the-art MIPS methods.

4.1 Datasets

We used the following three pre-trained embeddings to investigate the performance of IPDG in MIPS for similar word searching. For each word embedding dataset, we randomly select 10000 vectors as queries and use the others as the base data.

fastTextEn and fastTextFr are 300-dimensional English and French word embeddings trained on Wikipedia using fastText (Joulin et al., 2016).

GloVe50 are 50-dimensional word embeddings trained on Wikipedia2014 and Gigaword5 using GloVe (Pennington et al., 2014).

As most state-of-the-art MIPS algorithms evaluate their performance on recommendation datasets, we also benchmark IPDG on three recommendation datasets: Amazon Movie (Amovie), Yelp and Netflix. We use the Matrix Factorization (MF) method of (Hu et al., 2008) to obtain latent vectors of users and items. In the retrieval process, user vectors are regarded as queries, and the item vector with the highest inner product score for each query should be returned by the MIPS algorithm.

Table 1: Statistics of the datasets.

    Datasets    | Dimension | # Base Data
    fastTextEn  | 300       | 989873
    fastTextFr  | 300       | 1142501
    GloVe       | 50        | 1183514
    Amovie      | 64        | 104708
    Yelp        | 64        | 25815
    Netflix     | 50        | 17770

Statistics of the six datasets are listed in Table 1. They vary in dimension (300, 64 and 50), source (recommendation ratings, word documents) and extraction method (fastText, GloVe and MF), which is sufficient for a fair comparison. The ground truth is the top-1 nearest neighbor by inner product.
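
As a point of reference, the exact top-1 ground truth can be computed by brute force, as in the following numpy sketch for the recommendation setting (user vectors as queries, item vectors as base data); the sizes and names here are illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)
    item_vecs = rng.standard_normal((20000, 64))   # base data, e.g., MF item factors
    user_vecs = rng.standard_normal((1000, 64))    # queries, e.g., MF user factors

    # exact top-1 MIPS answer per query: argmax_i  x_i . q
    ground_truth = np.argmax(user_vecs @ item_vecs.T, axis=1)

    def recall_at_1(returned_ids, truth):
        """Average top-1 recall of the item ids returned by a MIPS algorithm."""
        return float(np.mean(np.asarray(returned_ids) == truth))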

Figure 4: Recall vs. Time curves (Queries Per Second vs. Avg. Recall) for all methods in top-1 MIPS on fastTextEn, fastTextFr, GloVe, Amovie, Yelp and Netflix. Results for Faiss-PQ on fastTextEn and fastTextFr are not shown since they cannot produce recalls greater than 0.6. Best results are in the upper right corners.

4.2 Baselines

In this paper, we compare IPDG with state-of-the-art MIPS methods. Firstly, reduction methods can serve as baselines. Some popular open-source ANNS platforms utilize the reduction trick to solve MIPS, such as Annoy¹. As introduced in Section 2.2, with such reductions, any ANNS method can be applied to MIPS.

¹ https://github.com/spotify/annoy

In this line, we choose HNSW (Malkov and Yashunin, 2018) (referred to as HNSW-Wrapper) as the baseline and neglect other alternatives, since HNSW is usually regarded as the most promising method for ANNS in metric spaces. We use the original implementation of HNSW² and add the wrapper introduced in Section 2.2.

Range-LSH (Yan et al., 2018) is also a reduction MIPS method and considers the norm distribution of the data. The original implementation³ is used.

Faiss-PQ⁴ is a popular open-source ANNS platform from Facebook, which is mainly implemented with Product Quantization (PQ) techniques. It contains MIPS as one component.

Greedy-MIPS is a MIPS algorithm from Yu et al. (2017). We use the original implementation⁵.

ip-NSW is a state-of-the-art MIPS algorithm proposed in (Morozov and Babenko, 2018)⁶.

² https://github.com/nmslib
³ https://github.com/xinyandai/similarity-search
⁴ https://github.com/facebookresearch/faiss
⁵ https://github.com/rofuyu/exp-gmips-nips17
⁶ https://github.com/stanis-morozov/ip-nsw

4.3 Experimental Settings

There are two popular ways to evaluate ANNS/MIPS algorithms: i) Recall vs. Time; ii) Recall vs. Computations. Recall vs. Time reports the number of queries an algorithm can process per second at each recall level. Recall vs. Computations reports the amount/percentage of pairwise distance/similarity computations that the ANNS/MIPS algorithm costs at each recall level. Both evaluation indicators have their own pros and cons: Recall vs. Time is straightforward, but it may introduce bias from implementation details; Recall vs. Computations is independent of the implementation, but it does not account for the cost of different index structures. We show both perspectives in the following experiments for a comprehensive evaluation.

All compared methods have tunable parameters. In order to present a fair comparison, we vary all parameters over a fine grid for all methods. For each algorithm in each experiment, we thus have multiple points scattered on the plane. To plot curves, we first find the best result, max_x, along the x-axis (i.e., Recall). Then 100 buckets are produced by splitting the range from 0 to max_x evenly. For each bucket, the best result along the y-axis (e.g., the largest number of queries per second or the lowest percentage of computations) is chosen. If a bucket contains no data points, it is ignored. In this way, we obtain multiple pairs of data for drawing curves. All time-related experiments were performed on a 2 × 3.00 GHz 8-core i7-5960X CPU server with 32GB memory.
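
The bucketing step just described is easy to reproduce; the Python sketch below is only an illustration of that procedure (for Recall vs. Time the best value per bucket is the largest QPS, for Recall vs. Computations it is the smallest percentage).

    import numpy as np

    def curve_points(recalls, ys, n_buckets=100, larger_is_better=True):
        """Keep, for each of n_buckets evenly split recall ranges, the best y value."""
        recalls, ys = np.asarray(recalls, float), np.asarray(ys, float)
        edges = np.linspace(0.0, recalls.max(), n_buckets + 1)
        points = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (recalls >= lo) & (recalls <= hi)
            if not mask.any():                # buckets without data points are ignored
                continue
            idx = np.argmax(ys[mask]) if larger_is_better else np.argmin(ys[mask])
            points.append((recalls[mask][idx], ys[mask][idx]))
        return points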

4.4 Experimental Results

We first show experimental results for all compared algorithms from the Recall vs. Time perspective in Figure 4. Overall, the proposed method IPDG performs consistently and significantly better than the baselines on all six datasets. As can be seen, some baselines show promising performance on part of the datasets but work much worse on the others. For example, on the lower-dimensional datasets (i.e., the last four panels of Figure 4), ip-NSW works well, but it fails on the high-dimensional datasets (i.e., fastTextEn and fastTextFr). Greedy-MIPS shows advantages on high-dimensional datasets but becomes worse on some lower-dimensional datasets, such as Netflix and GloVe. Among all methods, only IPDG works consistently well on all datasets, which shows its effectiveness and robustness. Range-LSH performs badly in these experiments; the main reason is that Range-LSH does not have a good "budget" setting, analogous to the budget in Greedy-MIPS and the Nsearch parameter in graph-based methods. HNSW-Wrapper does not work comparably with IPDG either, especially on the word embedding datasets. At some recall levels, say higher than 0.5, searching by HNSW-Wrapper is extremely slow (see the first three panels). It is clear that HNSW-Wrapper is far from state-of-the-art on challenging MIPS tasks, such as larger or higher-dimensional vector datasets. The PQ-based method, Faiss-PQ, works badly on all datasets: quantization codes speed up the retrieval but may largely reduce the search quality, especially for the challenging top-1 MIPS problem. Note that results for Faiss-PQ on fastTextEn and fastTextFr are not shown in Figure 4 since they cannot produce recalls greater than 0.6.

Figure 5: Recall vs. Computations curves (% Computations vs. Avg. Recall) in top-1 MIPS on fastTextFr, GloVe and Amovie. Note that the results for HNSW-Wrapper on fastTextFr and GloVe fall outside the shown range. Best results are in the lower right corners.

We also show experimental results by Recall vs. Computations in Figure 5. Greedy-MIPS and Faiss-PQ cannot be evaluated from this view, so the other four methods are explored here. Due to limited space, only results on part of the datasets are presented. As can be seen, only IPDG and ip-NSW work consistently well on all shown datasets. HNSW-Wrapper and Range-LSH work comparably with the other two methods on the recommendation dataset, Amovie, while performing much worse on the word embedding datasets, fastTextFr and GloVe; the results for HNSW-Wrapper on fastTextFr and GloVe cannot even be shown within the plotted range. Since IPDG and ip-NSW share similar index structures, it is fair to compare their computation amount per query. To reach a similar recall, IPDG requires much less inner product computation. For example, on fastTextFr, to reach a recall of 95%, ip-NSW requires about 0.3% of the pairwise computations while IPDG only needs 0.07%. This also demonstrates the efficiency of vector inner product retrieval by IPDG.

4.5 More Comparison with ip-NSW

In this section, we conduct a study comparing the proposed IPDG and its most closely related method, ip-NSW, on index graph quality. The evaluation measure is the number of nodes with incoming edges. Intuitively, only extreme points of each dataset are useful for top-1 MIPS retrieval. Non-extreme points could be ignored in graph construction (i.e., left without incoming edges, so they will not be visited in searching). Results for N = 100 and M = 16 are shown in Table 2.

Table 2: Number and percentage of nodes with incoming edges for graphs built by ip-NSW and IPDG.

    Datasets    | ip-NSW          | IPDG
    fastTextEn  | 144339 (14.6%)  | 100138 (10.1%)
    fastTextFr  | 378875 (33.2%)  | 250750 (21.9%)
    GloVe       | 622080 (52.6%)  | 437378 (37.0%)
    Amovie      | 32434 (31.0%)   | 12985 (12.4%)
    Yelp        | 5224 (20.2%)    | 1871 (7.2%)
    Netflix     | 17154 (96.5%)   | 14867 (83.7%)
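
The measure reported in Table 2 is straightforward to compute from a directed index graph; a tiny illustrative sketch, assuming the same dict-of-outgoing-edges layout as in the earlier sketches:

    def nodes_with_incoming_edges(graph):
        """Count and share of nodes that receive at least one incoming edge."""
        targets = {v for neighbors in graph.values() for v in neighbors}
        return len(targets), len(targets) / max(len(graph), 1)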

As can be seen, the graphs built by IPDG have far fewer nodes with incoming edges, which is consistent with the toy example introduced above. The reason can be explained as follows. The carefully designed edge selection method in IPDG tends to select extreme points as outgoing neighbors for each newly inserted node or edge-updating node (see Algorithm 1, Lines 9−13). Meanwhile, extreme points have more opportunities to keep incoming edges during the edge updating and the second round of graph construction, while non-extreme points tend to lose their incoming edges in these processes.

5 Conclusion and Future Work

Fast similarity search for data representations via inner product is a crucial and challenging task, since it is one of the basic operations in machine learning algorithms and recommendation methods. To address this problem, we propose a search-on-graph method, namely the Inner Product Delaunay Graph (IPDG), for Maximum Inner Product Search (MIPS) over embedded latent vectors. IPDG provides a better approximation to Delaunay Graphs for inner product than previous methods and is more efficient for the MIPS task. Experiments on extensive benchmarks demonstrate that IPDG outperforms previous state-of-the-art MIPS methods in retrieving latent vectors under inner product.

In this paper, we improve top-1 MIPS performance with a graph-based index. In the future, we will try to move the state-of-the-art frontier further, not only for top-1 MIPS but also for top-n (n > 1) MIPS. Besides metric measures (e.g., ℓ2-distance and cosine similarity) and inner product, more complicated measures have been studied, for example in (Tan et al., 2019); it would be interesting to adopt these measures in NLP tasks. Another promising direction is to adopt a GPU-based system for fast ANNS or MIPS, which has been shown to be highly effective for generic ANNS tasks (Li et al., 2012; Johnson et al., 2017; Zhao et al., 2019). Developing GPU-based algorithms for MIPS is a topic which has not been fully explored.

Acknowledgement

The authors would like to sincerely thank the anonymous reviewers of NAACL 2019 and EMNLP 2019 for their helpful comments, which have improved the quality of this paper.

References

Brice Acree, Eric Hansen, Joshua Jansa, and Kelsey Shoub. 2016. Comparing and evaluating cosine similarity scores, weighted cosine similarity scores and substring matching. Technical report.

Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 385–393. Association for Computational Linguistics.

Franz Aurenhammer. 1991. Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR), 23(3):345–405.

Yoram Bachrach, Yehuda Finkelstein, Ran Gilad-Bachrach, Liran Katzir, Noam Koenigstein, Nir Nice, and Ulrich Paquet. 2014. Speeding up the Xbox recommender system using a Euclidean transformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys), pages 257–264.

C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. 1996. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS), 22(4):469–483.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Paolo Cignoni, Claudio Montani, and Roberto Scopigno. 1998. DeWall: A fast divide and conquer Delaunay triangulation algorithm in E^d. Computer-Aided Design, 30(5):333–341.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.

Jerome H. Friedman, F. Baskett, and L. Shustek. 1975. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006.

Jerome H. Friedman, J. Bentley, and R. Finkel. 1977. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209–226.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 699–709.

Hongyu Gong, Tarek Sakakini, Suma Bhat, and Jinjun Xiong. 2018. Document similarity for texts of varying lengths via hidden topics. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 2341–2351.

Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. 2016. Quantization based fast inner product search. In Artificial Intelligence and Statistics (AISTATS), pages 482–490.

Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), pages 263–272.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), pages 2333–2338.

Qiang Huang, Guihong Ma, Jianlin Feng, Qiong Fang, and Anthony K. H. Tung. 2018. Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1561–1570.

Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing (STOC), pages 604–613, Dallas, TX.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Yannis Kalantidis and Yannis Avrithis. 2014. Locally optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2321–2328.

Maximilian Köper and Sabine Schulte im Walde. 2018. Analogies in complex verb meaning shifts: the effect of affect in semantic similarity models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), volume 2, pages 150–156.

Moontae Lee, Xiaodong He, Wen-tau Yih, Jianfeng Gao, Li Deng, and Paul Smolensky. 2015. Reasoning in vector space: An exploratory study of question answering. arXiv preprint arXiv:1511.06426.

Omer Levy, Anders Søgaard, and Yoav Goldberg. 2017. A strong baseline for learning cross-lingual word embeddings from sentence alignments. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 1, pages 765–774.

Ping Li, Anshumali Shrivastava, and Christian A. König. 2012. GPU-based minwise hashing. In Proceedings of the 21st World Wide Web Conference (WWW), pages 565–566, Lyon, France.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 912–921.

Yury A. Malkov and Dmitry A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119.

Stanislav Morozov and Artem Babenko. 2018. Non-metric similarity graphs for maximum inner product search. In Advances in Neural Information Processing Systems (NeurIPS), pages 4722–4731.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Barbara Plank and Alessandro Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 1498–1507.

Jan Wira Gotama Putra and Takenobu Tokunaga. 2017. Evaluating text coherence based on semantic similarity graph. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 76–85.

Daniel Ramage, Anna N. Rafferty, and Christopher D. Manning. 2009. Random walks for text semantic similarity. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pages 23–31.

Mehran Sahami and Timothy D. Heilman. 2006. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International Conference on World Wide Web (WWW), pages 377–386, Edinburgh, Scotland, UK.

Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems (NIPS), pages 2321–2329, Montreal, Canada.

Anshumali Shrivastava and Ping Li. 2015. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 812–821, Amsterdam, The Netherlands.

Gokavarapu Srinivas, Niket Tandon, and Vasudeva Varma. 2010. A weighted tag similarity measure based on a collaborative weight model. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, pages 79–86.

Kohei Sugawara, Hayato Kobayashi, and Masajiro Iwasaki. 2016. On approximately searching for similar word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 2265–2275.

Md Arafat Sultan, Jordan Boyd-Graber, and Tamara Sumner. 2016. Bayesian supervised domain adaptation for short text similarity. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 927–936.

Yifan Sun, Nikhil Rao, and Weicong Ding. 2017. A simple approach to learn polysemous word embeddings. arXiv preprint arXiv:1707.01793.

Shulong Tan, Zhixin Zhou, Zhaozhuo Xu, and Ping Li. 2019. Fast Item Ranking under Neural Network based Measures. Technical report, Baidu Research.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pages 5998–6008.

Xiang Wu, Ruiqi Guo, Ananda Theertha Suresh, Sanjiv Kumar, Daniel N. Holtmann-Rice, David Simcha, and Felix Yu. 2017. Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems (NIPS), pages 5745–5755.

Jun Xu, Xiangnan He, and Hang Li. 2018. Deep learning for matching in search and recommendation. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 1365–1368.

Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, and James Cheng. 2018. Norm-ranging LSH for maximum inner product search. In Advances in Neural Information Processing Systems (NeurIPS), pages 2952–2961, Montreal, Canada.

Seunghyun Yoon, Pablo Estrada, and Kyomin Jung. 2017. Synonym discovery with etymology-based word embeddings. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–6.

Hsiang-Fu Yu, Cho-Jui Hsieh, Qi Lei, and Inderjit S. Dhillon. 2017. A greedy approach for budgeted maximum inner product search. In Advances in Neural Information Processing Systems (NIPS), pages 5453–5462, Long Beach, CA.

Weijie Zhao, Shulong Tan, and Ping Li. 2019. SONG: Approximate nearest neighbor search on GPU. Technical report, Baidu Research.