Derived Codebooks for High-Accuracy Nearest Neighbor Search
Fabien André (Technicolor), Anne-Marie Kermarrec (Inria), Nicolas Le Scouarnec (Technicolor)

ABSTRACT
High-dimensional Nearest Neighbor (NN) search is central in multimedia search systems. Product Quantization (PQ) is a widespread NN search technique which offers high performance and good scalability. PQ compresses high-dimensional vectors into compact codes thanks to a combination of quantizers. Large databases can therefore be stored entirely in RAM, enabling fast responses to NN queries. In almost all cases, PQ uses 8-bit quantizers as they offer low response times.

In this paper, we advocate the use of 16-bit quantizers. Compared to 8-bit quantizers, 16-bit quantizers boost accuracy, but they increase response time by a factor of 3 to 10. We propose a novel approach that allows 16-bit quantizers to offer the same response time as 8-bit quantizers, while still providing a boost in accuracy. Our approach builds on two key ideas: (i) the construction of derived codebooks that allow a fast and approximate distance evaluation, and (ii) a two-pass NN search procedure which builds a candidate set using the derived codebooks, and then refines it using the 16-bit quantizers. On 1 billion SIFT vectors, with an inverted index, our approach offers a Recall@100 of 0.85 in 5.2 ms. By contrast, 16-bit quantizers alone offer a Recall@100 of 0.85 in 39 ms, and 8-bit quantizers a Recall@100 of 0.82 in 3.8 ms.

1 INTRODUCTION
Over the last decade, the amount of multimedia data handled by online services, such as video sharing web sites or social networks, has soared. This profusion of multimedia content calls for efficient multimedia search techniques, so as to allow users to find relevant content. Efficient Nearest Neighbor (NN) search in high dimensionality is key in multimedia search. High-dimensional feature vectors, or descriptors, can be extracted from multimedia files, capturing their semantic content. Finding similar multimedia objects then consists in finding objects with a similar set of descriptors.

In low dimensionality, efficient solutions to the NN search problem, such as KD-trees, have been proposed. However, exact NN search remains challenging in high-dimensional spaces. Therefore, current research work focuses on finding Approximate Nearest Neighbors (ANN) efficiently. Due to its good scalability and low response time, Product Quantization (PQ) [9] is a popular approach [12, 14, 15] for ANN search in high-dimensional spaces. PQ compresses high-dimensional vectors into compact codes occupying only a few bytes. Therefore, large databases can be stored entirely in main memory. This enables responding to NN queries without performing slow I/O operations.

PQ compresses vectors into compact codes using a combination of quantizers. To answer ANN queries, PQ first pre-computes a set of lookup tables comprising the distances between the sub-vectors of the query vector and the centroids of the quantizers. PQ then uses these lookup tables to compute the distance between the query vector and the compact codes stored in the database. Previous work relies almost exclusively on 8-bit quantizers (2^8 centroids), which result in small lookup tables that are both fast to pre-compute and fit in the fastest CPU caches.

Novel quantization models inspired by Product Quantization (PQ) have been proposed to improve ANN search accuracy. These models, such as Additive Quantization (AQ) [4] or Tree Quantization (TQ) [6], offer a higher accuracy, but some of them lead to a much increased response time. An orthogonal approach to achieve a higher accuracy is to use 16-bit quantizers instead of the common 8-bit quantizers. However, 16-bit quantizers are generally believed to be intractable because they result in a 3 to 10 times increase in response time compared to 8-bit quantizers.

In this paper, we introduce a novel approach that makes 16-bit quantizers almost as fast as 8-bit quantizers while retaining the increase in accuracy. Our approach builds on two key ideas: (i) we build small derived codebooks (2^8 centroids each) that approximate the codebooks of the 16-bit quantizers (2^16 centroids each), and (ii) we introduce a two-pass NN search procedure which builds a candidate set using the derived codebooks, and then refines it using the 16-bit quantizers. Moreover, our approach can be combined with inverted indexes, commonly used to manage large databases. More specifically, this paper makes the following contributions:
• We detail how we build derived codebooks that approximate the 16-bit quantizers. We describe how our two-pass NN search procedure exploits derived codebooks to achieve both a low response time and a high accuracy.
• We evaluate our approach in the context of Product Quantization (PQ) and Optimized Product Quantization (OPQ), with and without inverted indexes. We show that it offers a response time close to that of 8-bit quantizers, while having the same accuracy as 16-bit quantizers.
• We discuss the strengths and limitations of our approach. We show that because it increases accuracy without significantly impacting response time, our approach compares favorably with the state of the art.
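As a concrete illustration of the two-pass procedure outlined above, the following NumPy sketch first filters the database with approximate distances computed from the derived codebooks, and then re-ranks the surviving candidates with the exact lookup-table distances of the 16-bit quantizers. The data layout it assumes (a derived_map array linking each 16-bit centroid to a derived centroid, pre-computed approx_tables and exact_tables, and a fixed candidate_size) is purely illustrative and not the paper's actual data structures; the construction of the derived codebooks and the management of the candidate set are described later in the paper.

    import numpy as np

    def two_pass_search(codes, approx_tables, exact_tables, derived_map, candidate_size, r):
        """Illustrative two-pass search over a database of 16-bit compact codes.

        codes:         (n, m) integer array, one 16-bit sub-quantizer index per column
        approx_tables: m arrays of 2^8 query-to-centroid distances (derived codebooks)
        exact_tables:  m arrays of 2^16 query-to-centroid distances (16-bit codebooks)
        derived_map:   m arrays of 2^16 entries mapping each 16-bit centroid index to the
                       index of the derived centroid that approximates it (assumed to be
                       pre-computed when the derived codebooks are built)
        """
        n, m = codes.shape
        # First pass: approximate distances from the small derived tables.
        approx_dists = np.zeros(n)
        for j in range(m):
            approx_dists += approx_tables[j][derived_map[j][codes[:, j]]]
        # Keep the candidate_size codes with the smallest approximate distances.
        candidates = np.argpartition(approx_dists, candidate_size)[:candidate_size]
        # Second pass: exact lookup-table distances from the 16-bit tables, candidates only.
        exact_dists = np.zeros(len(candidates))
        for j in range(m):
            exact_dists += exact_tables[j][codes[candidates, j]]
        order = np.argsort(exact_dists)[:r]
        return candidates[order], exact_dists[order]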
2 BACKGROUND
In this section, we describe how Product Quantization (PQ) and Optimized Product Quantization (OPQ) encode vectors into compact codes. We then show how to perform ANN search in databases of encoded vectors.

2.1 Vector Encoding
Product Quantization (PQ) [9] is based on the principle of vector quantization. A vector quantizer is a function q which maps a vector x ∈ R^d to a vector C[i] ∈ C, where C is a predetermined set of d-dimensional vectors. The set of vectors C is named the codebook, and its elements are named centroids. A quantizer which minimizes the quantization error maps a vector x to its closest centroid:

    q(x) = arg min_{C[i] ∈ C} ||x − C[i]||.

The index i of the closest centroid C[i] of a vector x ∈ R^d can be used as a compact code representing the vector x: code(x) = i, such that q(x) = C[i]. The compact code i only uses b = ⌈log2(k)⌉ bits of memory, while a d-dimensional vector stored as an array of floats uses d · 32 bits of memory. The accuracy of ANN search primarily depends on the quantization error induced by the vector quantizer. Therefore, to minimize quantization error, it is necessary to use a quantizer with a large number of centroids, e.g., k = 2^64 centroids. Yet, training such quantizers is not tractable.

Product Quantization addresses this issue by making it possible to generate a quantizer with a large number of centroids from several low-complexity quantizers. To quantize a vector x ∈ R^d, a product quantizer first splits x into m sub-vectors, x = (x^0, ..., x^{m−1}). Then, each sub-vector x^j is quantized using a distinct quantizer q^j. Each quantizer q^j has a distinct codebook C^j, of size k. A product quantizer maps a vector x ∈ R^d as follows:

    pq(x) = (q^0(x^0), ..., q^{m−1}(x^{m−1}))
          = (C^0[i_0], ..., C^{m−1}[i_{m−1}])

Thus, the codebook C of the product quantizer is the Cartesian product of the codebooks of the quantizers, C = C^0 × · · · × C^{m−1}. The product quantizer has a codebook of size k^m, while only requiring the training of m codebooks of size k. Product quantizers encode high-dimensional vectors into compact codes by concatenating the codes generated by the m quantizers: pqcode(x) = (i_0, ..., i_{m−1}), such that pq(x) = (C^0[i_0], ..., C^{m−1}[i_{m−1}]).
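The sketch below, assuming NumPy and codebooks already trained offline (e.g., by k-means), illustrates these definitions: it splits a vector into m sub-vectors and replaces each one by the index of its nearest centroid in the corresponding codebook, yielding the compact code pqcode(x). The function names are illustrative, not those of the paper's implementation.

    import numpy as np

    def quantize(x_sub, codebook):
        # Index of the centroid closest to sub-vector x_sub (codebook has shape (k, d/m)).
        return int(np.argmin(np.sum((codebook - x_sub) ** 2, axis=1)))

    def pq_encode(x, codebooks):
        # codebooks: list of m codebooks C^0, ..., C^{m-1}; d must be divisible by m.
        m = len(codebooks)
        sub_vectors = np.split(x, m)                       # x = (x^0, ..., x^{m-1})
        return [quantize(xj, cj) for xj, cj in zip(sub_vectors, codebooks)]

For instance, with m = 8 sub-quantizers of k = 2^8 centroids each, the resulting code occupies 8 × 8 = 64 bits; with k = 2^16 centroids per sub-quantizer, it occupies 128 bits.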
Optimized Product Quantization (OPQ) [8] and Cartesian K-means (CKM) [10] are two similar quantization schemes, both relying on the same principles as Product Quantization (PQ). An optimized product quantizer multiplies a vector x ∈ R^d by an orthonormal matrix R ∈ R^{d×d} before quantizing it in the same way as a product quantizer:

    opq(x) = pq(Rx), such that R^T R = I.

The matrix R allows an arbitrary rotation of the vector x, thus enabling a better distribution of the information between the quantizers.

2.2 Inverted Indexes
[...] been assigned to. Because they offer both a decrease in response time and an increase in accuracy, inverted indexes are widely used. They are especially useful for large databases, as exhaustive search is hardly tractable in this case.

2.3 ANN Search
ANN search in a database of compact codes takes three steps: Index, where the inverted index is used to retrieve the most appropriate inverted lists; Tables, where a set of lookup tables is pre-computed; and Scan, which involves computing the distance between the query vector and the compact codes stored in the inverted lists. The Index step is only necessary when non-exhaustive search is used, and is skipped in the case of exhaustive search. Algorithm 1 summarizes these steps.

Algorithm 1 ANN Search
 1: function nns({C^j}_{j=0}^{m−1}, database, y, r)
 2:     list ← index_list(database, y)                          ▷ Index
 3:     {D^j}_{j=0}^{m−1} ← compute_tables(y, {C^j}_{j=0}^{m−1})   ▷ Tables
 4:     return scan(list, {D^j}_{j=0}^{m−1}, r)                 ▷ Scan
 5: function scan(list, {D^j}_{j=0}^{m−1}, r)
 6:     neighbors ← binheap(r)                                  ▷ binary heap of size r
 7:     for i ← 0 to |list| − 1 do
 8:         c ← list[i]                                         ▷ ith compact code
 9:         d ← adc(c, {D^j}_{j=0}^{m−1})
10:         neighbors.add((i, d))
11:     return neighbors
12: function adc(c, {D^j}_{j=0}^{m−1})
13:     d ← 0
14:     for j ← 0 to m − 1 do
15:         d ← d + D^j[c[j]]
16:     return d

Index Step. In this step, the closest cell to the query vector y is determined using the inverted index, and the corresponding inverted list is retrieved.
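The Tables and Scan steps are common to exhaustive and non-exhaustive search. The NumPy sketch below follows Algorithm 1 (compute_tables, adc and scan), replacing the binary heap by a full sort for brevity and, in the exhaustive case, scanning the whole database instead of an inverted list. It is a minimal illustration of this background material, not the paper's optimized implementation.

    import numpy as np

    def compute_tables(y, codebooks):
        # Tables step: D^j[i] = squared distance between sub-vector y^j and centroid C^j[i].
        sub_queries = np.split(y, len(codebooks))
        return [np.sum((cj - yj) ** 2, axis=1) for yj, cj in zip(sub_queries, codebooks)]

    def adc(c, tables):
        # Distance between the query and one compact code (Algorithm 1, lines 12-16).
        return sum(tables[j][c[j]] for j in range(len(tables)))

    def scan(codes, tables, r):
        # Scan step: evaluate every compact code and keep the r closest ones.
        distances = np.array([adc(c, tables) for c in codes])
        order = np.argsort(distances)[:r]
        return [(int(i), float(distances[i])) for i in order]

    # Usage, with codebooks trained offline and codes the list of database compact codes:
    # neighbors = scan(codes, compute_tables(query, codebooks), r=100)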