Optimized Product Quantization for Approximate Nearest Neighbor Search


Tiezheng Ge1*    Kaiming He2    Qifa Ke3    Jian Sun2
1University of Science and Technology of China    2Microsoft Research Asia    3Microsoft Research Silicon Valley

*This work was done when Tiezheng Ge was an intern at Microsoft Research Asia.

Abstract

Product quantization is an effective vector quantization approach to compactly encode high-dimensional vectors for fast approximate nearest neighbor (ANN) search. The essence of product quantization is to decompose the original high-dimensional space into the Cartesian product of a finite number of low-dimensional subspaces that are then quantized separately. Optimal space decomposition is important for the performance of ANN search, but still remains unaddressed. In this paper, we optimize product quantization by minimizing quantization distortions w.r.t. the space decomposition and the quantization codebooks. We present two novel methods for optimization: a non-parametric method that alternately solves two smaller sub-problems, and a parametric method that is guaranteed to achieve the optimal solution if the input data follows some Gaussian distribution. We show by experiments that our optimized approach substantially improves the accuracy of product quantization for ANN search.

1. Introduction

Approximate nearest neighbor (ANN) search is of great importance for many computer vision problems, such as retrieval [17], classification [2], and recognition [18]. Recent years have witnessed increasing interest (e.g., [18, 20, 3, 10, 6]) in encoding high-dimensional data into distance-preserving compact codes. With merely tens of bits per data item, compact encoding not only saves the cost of data storage and transmission, but more importantly, it enables efficient nearest neighbor search on large-scale datasets, taking only a fraction of a second for each nearest neighbor query [18, 10].

Hashing [1, 18, 20, 19, 6, 8] has been a popular approach to compact encoding, where the similarity between two data points is approximated by the Hamming distance of their hashed codes. Recently, product quantization (PQ) [10] was applied to compact encoding, where a data point is vector-quantized to its nearest codeword in a predefined codebook, and the distance between two data points is approximated by the distance between their codewords. PQ achieves a large effective codebook size with the Cartesian product of a set of small sub-codebooks. It has been shown to be more accurate than various hashing-based methods (cf. [10, 3]), largely due to its lower quantization distortions and more precise distance computation using a set of small lookup tables. Moreover, PQ is computationally efficient and thus attractive for large-scale applications—the Cartesian product enables pre-computed distances between codewords to be stored in tables with feasible sizes, and a query is done merely by table lookups using codeword indices. It takes about 20 milliseconds to query against one million data points for the nearest neighbor by exhaustive search.

To keep the size of the distance lookup table feasible, PQ decomposes the original vector space into the Cartesian product of a finite number of low-dimensional subspaces. It has been noticed [10] that prior knowledge about the structure of the input data is of particular importance, and the accuracy of ANN search becomes substantially worse if such knowledge is ignored. The method in [11] optimizes a Householder transform under the intuition that the data components should have balanced variances. It is also observed that a random rotation achieves similar performance [11]. But the optimality in terms of quantization error is unclear. Thus, optimal space decomposition for PQ remains largely an unaddressed problem.

In this paper, we formulate product quantization as an optimization problem that minimizes the quantization distortions by searching for optimal codebooks and space decomposition. Such an optimization problem is challenging due to the large number of free parameters. We propose two solutions. In the first solution, we split the problem into two sub-problems, each having a simple solver. The space decomposition and the codebooks are then optimized alternately, by solving for the space decomposition while fixing the codewords, and vice versa. Such a solution is non-parametric in that it does not assume any prior information about the data distribution. Our second solution is a parametric one in that it assumes the data follows a Gaussian distribution. Under this assumption, we show that the lower bound of the quantization distortion has an analytical formulation, which can be effectively optimized by a simple Eigenvalue Allocation method. Experiments show that our two solutions outperform the original PQ [10] and other alternatives like transform coding [3] and iterative quantization [6], even when the prior knowledge about the structure of the input data is used by PQ [10].

Concurrent with our work, a very similar idea has been independently developed by Norouzi and Fleet [14].
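To make the non-parametric alternation described above concrete, the following is a minimal sketch in Python/NumPy (with scikit-learn's KMeans). It is an illustration under stated assumptions rather than the authors' implementation: the space decomposition is modeled as a single orthogonal matrix R applied to the data, and the rotation update uses a closed-form orthogonal Procrustes solve, a choice assumed here because this excerpt does not spell out the update rule. The function name train_opq_nonparametric_sketch is likewise hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_opq_nonparametric_sketch(X, M=4, k=256, iters=10, seed=0):
    """Alternately refit the sub-codebooks (k-means per subspace) and the
    orthogonal matrix R that defines the space decomposition. Illustrative only."""
    D = X.shape[1]
    d = D // M                                          # assumes D is divisible by M
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((D, D)))    # random orthogonal start
    for _ in range(iters):
        Xr = X @ R                                      # data in the rotated space
        # Step 1: fix R, solve for the codebooks -- plain PQ on the rotated data.
        codebooks, Y = [], np.empty_like(Xr)
        for m in range(M):
            sub = Xr[:, m * d:(m + 1) * d]
            km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(sub)
            codebooks.append(km.cluster_centers_)
            Y[:, m * d:(m + 1) * d] = km.cluster_centers_[km.labels_]
        # Step 2: fix the codewords Y, update R (assumed rule):
        #   min_R ||X R - Y||_F^2  s.t.  R^T R = I  ->  orthogonal Procrustes.
        U, _, Vt = np.linalg.svd(X.T @ Y)
        R = U @ Vt
    return R, codebooks
```

The parametric solution mentioned above would instead pick the space decomposition analytically (the Eigenvalue Allocation method) under a Gaussian assumption; that variant is not sketched here.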
2. Quantization Distortion

In this section, we show that a variety of distance approximation methods, including k-means [13], product quantization [10], and orthogonal hashing [19, 6], can be formulated within the framework of vector quantization [7], where quantization distortion is used as the objective function. Quantization distortion is tightly related to the empirical ANN performance, and thus can be used to measure the "optimality" of a quantization algorithm for ANN search.

2.1. Vector Quantization

Vector quantization (VQ) [7] maps a vector x ∈ R^D to a codeword c in a codebook C = {c(i)} with i in a finite index set. The mapping, termed a quantizer, is denoted by x → c(i(x)). In information theory, the function i(·) is called an encoder, and the function c(·) is called a decoder [7]. The quantization distortion E is defined as:

    E = \frac{1}{n} \sum_x \| x - c(i(x)) \|^2,    (1)

where ‖·‖ denotes the l2-norm, n is the total number of data samples, and the summation is over all the points in the given sample set. Given a codebook C, a quantizer that minimizes the distortion E must satisfy the first Lloyd's condition [7]: the encoder i(x) should map any x to its nearest codeword in the codebook C. The distance between two vectors can then be approximated by the distance between their codewords, which can be precomputed offline.

2.2. Codebook Generation

We show that a variety of methods minimize the distortion w.r.t. the codebook using different constraints.

K-means

If there is no constraint on the codebook, minimizing the distortion in Eqn. (1) leads to the classical k-means clustering algorithm [13]. With the encoder i(·) fixed, the codeword c of a given x is the center of the cluster that x belongs to—this is the second Lloyd's condition [7].

Product Quantization [10]

If any codeword c must be taken from the Cartesian product of a finite number of sub-codebooks, minimizing the distortion in Eqn. (1) leads to the product quantization method [10].

Formally, denote any x ∈ R^D as the concatenation of M subvectors: x = [x^1, ..., x^m, ..., x^M]. For simplicity it is assumed [10] that the subvectors have a common number of dimensions D/M. The Cartesian product C = C^1 × ... × C^M is the set in which a codeword c ∈ C is formed by concatenating the M sub-codewords: c = [c^1, ..., c^m, ..., c^M], with each c^m ∈ C^m. We point out that the objective function for PQ, though not explicitly defined in [10], is essentially:

    \min_{C^1, \dots, C^M} \sum_x \| x - c(i(x)) \|^2,    (2)
    \text{s.t.} \quad c \in C = C^1 \times \dots \times C^M.

It is easy to show that x's nearest codeword c in C is the concatenation of the M nearest sub-codewords, c = [c^1, ..., c^m, ..., c^M], where c^m is the nearest sub-codeword of the subvector x^m. So Eqn. (2) can be split into M separate subproblems, each of which can be solved by k-means in its corresponding subspace. This is the PQ algorithm.

The benefit of PQ is that it can easily generate a codebook C with a large number of codewords. If each sub-codebook has k sub-codewords, then their Cartesian product C has k^M codewords. This is not possible for classical k-means when k^M is large. PQ also enables fast distance computation: the distances between any two sub-codewords in a subspace are precomputed and stored in a k-by-k lookup table, and the distance between two codewords in C is simply the sum of the distances computed from the M subspaces.
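To illustrate how Eqn. (2) splits into M independent k-means problems and how the k-by-k lookup tables are used, here is a short Python/NumPy sketch. The helper names (pq_train, pq_encode, and so on) are hypothetical and this is not the reference implementation of [10]; it only mirrors the decomposition described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, M=4, k=256, seed=0):
    """Solve Eqn. (2): run k-means independently in each of the M subspaces."""
    D = X.shape[1]
    d = D // M                                   # assumes D is divisible by M
    codebooks = []
    for m in range(M):
        sub = X[:, m * d:(m + 1) * d]
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(sub)
        codebooks.append(km.cluster_centers_)    # (k, d) sub-codewords
    return codebooks

def pq_encode(X, codebooks):
    """Map each vector to its M nearest sub-codeword indices (its compact code)."""
    d = codebooks[0].shape[1]
    codes = np.empty((X.shape[0], len(codebooks)), dtype=np.int32)
    for m, C in enumerate(codebooks):
        sub = X[:, m * d:(m + 1) * d]
        sq_dists = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # (n, k)
        codes[:, m] = sq_dists.argmin(axis=1)
    return codes

def pq_lookup_tables(codebooks):
    """Precompute, per subspace, the k-by-k squared distances between sub-codewords."""
    return [((C[:, None, :] - C[None, :, :]) ** 2).sum(-1) for C in codebooks]

def pq_distance(code_a, code_b, tables):
    """Approximate squared distance between two encoded points:
    the sum of M table lookups, one per subspace."""
    return sum(float(t[i, j]) for t, i, j in zip(tables, code_a, code_b))
```

Once the tables are built, the approximate distance between two encoded points costs just M table lookups and additions, which is what makes exhaustive search over millions of compact codes fast.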
Iterative Quantization [6]

If any codeword c must be taken from "the vertices of a rotating hyper-cube," minimizing the distortion leads to a hashing method called Iterative Quantization (ITQ) [6].

The D-dimensional vectors in {−a, a}^D are the vertices of an axis-aligned D-dimensional hyper-cube. Suppose the data has been zero-centered. The objective function in ITQ [6] is essentially:

    \min_{R, a} \sum_x \| x - c(i(x)) \|^2,    (3)
    \text{s.t.} \quad c \in C = \{ c \mid R c \in \{-a, a\}^D \}, \quad R^\top R = I,

where R is an orthogonal matrix and I is an identity matrix.

The benefit of using a rotating hyper-cube as the codebook is that the squared Euclidean distance between any two codewords is equivalent to the Hamming distance between their indices. So ITQ is in the category of binary hashing methods [1, 20, 19]. Eqn. (3) also indicates that any orthogonal hashing method is equivalent to a vector quantizer. The length a in (3) does not impact the resulting hashing functions, as noticed in [6], but it matters when we compare the distortion with other quantization methods.

2.3. Distortion as the Objective Function

The above methods all optimize the same form of quantization distortion, but subject to different constraints on the codebook.
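Since all of the above quantizers are scored by the same distortion of Eqn. (1), their codebook constraints can be compared empirically on a given sample set. The sketch below is a hedged illustration of that idea, not a benchmark: it measures the distortion of an unconstrained k-means codebook and of a rotated hyper-cube codebook as in Eqn. (3), using a fixed random rotation rather than ITQ's learned one (an assumption of this sketch, as is the choice of a).

```python
import numpy as np
from sklearn.cluster import KMeans

def distortion(X, quantized):
    """Empirical quantization distortion of Eqn. (1): mean squared error."""
    return np.mean(np.sum((X - quantized) ** 2, axis=1))

def kmeans_quantize(X, num_codewords, seed=0):
    """Unconstrained codebook (the 'K-means' case of Section 2.2)."""
    km = KMeans(n_clusters=num_codewords, n_init=1, random_state=seed).fit(X)
    return km.cluster_centers_[km.labels_]

def hypercube_quantize(X, R, a):
    """Constrained codebook of Eqn. (3): map each zero-centered x to the
    nearest vertex of the rotated hyper-cube C = {c | Rc in {-a, +a}^D}."""
    Z = X @ R.T                    # coordinates in the rotated frame (z = Rx)
    V = np.where(Z >= 0.0, a, -a)  # nearest vertex: the signs, scaled by a
    return V @ R                   # back to the original frame: c = R^T v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 16))
    X -= X.mean(axis=0)                                  # zero-center, as ITQ assumes
    R, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # a random orthogonal matrix
    a = np.abs(X).mean()   # a changes the distortion value but not the binary codes
    print("k-means distortion:   ", distortion(X, kmeans_quantize(X, 64)))
    print("hyper-cube distortion:", distortion(X, hypercube_quantize(X, R, a)))
```

Note that the two numbers are not a like-for-like comparison unless the code lengths are matched; the point is only that a single distortion measure applies across the differently constrained codebooks.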
