Optimized Product Quantization for Approximate Nearest Neighbor Search

Tiezheng Ge^1*   Kaiming He^2   Qifa Ke^3   Jian Sun^2
^1 University of Science and Technology of China   ^2 Microsoft Research Asia   ^3 Microsoft Research Silicon Valley
* This work was done when Tiezheng Ge was an intern at Microsoft Research Asia.

Abstract

Product quantization is an effective vector quantization approach to compactly encode high-dimensional vectors for fast approximate nearest neighbor (ANN) search. The essence of product quantization is to decompose the original high-dimensional space into the Cartesian product of a finite number of low-dimensional subspaces that are then quantized separately. Optimal space decomposition is important for the performance of ANN search, but still remains unaddressed. In this paper, we optimize product quantization by minimizing quantization distortions w.r.t. the space decomposition and the quantization codebooks. We present two novel methods for optimization: a non-parametric method that alternatively solves two smaller sub-problems, and a parametric method that is guaranteed to achieve the optimal solution if the input data follows some Gaussian distribution. We show by experiments that our optimized approach substantially improves the accuracy of product quantization for ANN search.

1. Introduction

Approximate nearest neighbor (ANN) search is of great importance for many problems, such as retrieval [17], classification [2], and recognition [18]. Recent years have witnessed increasing interest (e.g., [18, 20, 3, 10, 6]) in encoding high-dimensional data into distance-preserving compact codes. With merely tens of bits per data item, compact encoding not only saves the cost of data storage and transmission, but more importantly, it enables efficient nearest neighbor search on large-scale datasets, taking only a fraction of a second for each nearest neighbor query [18, 10].

Hashing [1, 18, 20, 19, 6, 8] has been a popular approach to compact encoding, where the similarity between two data points is approximated by the Hamming distance of their hashed codes. Recently, product quantization (PQ) [10] was applied to compact encoding, where a data point is vector-quantized to its nearest codeword in a predefined codebook, and the distance between two data points is approximated by the distance between their codewords. PQ achieves a large effective codebook size with the Cartesian product of a set of small sub-codebooks. It has been shown to be more accurate than various hashing-based methods (cf. [10, 3]), largely due to its lower quantization distortions and more precise distance computation using a set of small lookup tables. Moreover, PQ is computationally efficient and thus attractive for large-scale applications: the Cartesian product enables pre-computed distances between codewords to be stored in tables of feasible size, and a query is merely done by table lookups using codeword indices. It takes about 20 milliseconds to query against one million data points for the nearest neighbor by exhaustive search.

To keep the size of the distance lookup table feasible, PQ decomposes the original space into the Cartesian product of a finite number of low-dimensional subspaces. It has been noticed [10] that prior knowledge about the structure of the input data is of particular importance, and the accuracy of ANN search becomes substantially worse if such knowledge is ignored. The method in [11] optimizes a Householder transform under the intuition that the data components should have balanced variances. It is also observed that a random rotation achieves similar performance [11]. But the optimality in terms of quantization error is unclear. Thus, optimal space decomposition for PQ remains largely an unaddressed problem.

In this paper, we formulate product quantization as an optimization problem that minimizes the quantization distortion by searching for optimal codebooks and space decomposition. Such an optimization problem is challenging due to the large number of free parameters. We propose two solutions. In the first solution, we split the problem into two sub-problems, each having a simple solver. The space decomposition and the codebooks are then alternatively optimized, by solving for the space decomposition while fixing the codewords, and vice versa. Such a solution is non-parametric in that it does not assume any prior information about the data distribution. Our second solution is a parametric one in that it assumes the data follows a Gaussian distribution. Under such an assumption, we show that the lower bound of the quantization distortion has an analytical formulation, which can be effectively optimized by a simple Eigenvalue Allocation method. Experiments show that our two solutions outperform the original PQ [10] and other alternatives like transform coding [3] and iterative quantization [6], even when the prior knowledge about the structure of the input data is used by PQ [10].

Concurrent with our work, a very similar idea is independently developed by Norouzi and Fleet [14].

2. Quantization Distortion

In this section, we show that a variety of distance approximation methods, including k-means [13], product quantization [10], and orthogonal hashing [19, 6], can be formulated within the framework of vector quantization [7], where quantization distortion is used as the objective function. Quantization distortion is tightly related to the empirical ANN performance, and thus can be used to measure the "optimality" of a quantization algorithm for ANN search.

2.1. Vector Quantization

Vector quantization (VQ) [7] maps a vector x ∈ R^D to a codeword c in a codebook C = {c(i)} with i in a finite index set. The mapping, termed a quantizer, is denoted by x → c(i(x)). In information theory, the function i(·) is called an encoder, and the function c(·) is called a decoder [7]. The quantization distortion E is defined as:

E = \frac{1}{n} \sum_x \| x - c(i(x)) \|^2,    (1)

where \|\cdot\| denotes the l2-norm, n is the total number of data samples, and the summation is over all the points in the given sample set. Given a codebook C, a quantizer that minimizes the distortion E must satisfy the first Lloyd's condition [7]: the encoder i(x) should map any x to its nearest codeword in the codebook C. The distance between two vectors can be approximated by the distance between their codewords, which can be precomputed offline.
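To make Eqn.(1) concrete, the following sketch encodes each point by its nearest codeword (the first Lloyd's condition) and evaluates the empirical distortion. The NumPy formulation and the array names are illustrative assumptions of this note; the paper's released code is in Matlab.

```python
import numpy as np

def vq_encode(X, codebook):
    """Map each row of X (n-by-D) to the index of its nearest codeword (codebook: k-by-D)."""
    # squared Euclidean distances between all points and all codewords: n-by-k
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def distortion(X, codebook):
    """Empirical quantization distortion E of Eqn.(1)."""
    idx = vq_encode(X, codebook)
    return ((X - codebook[idx]) ** 2).sum(axis=1).mean()
```

By the second Lloyd's condition, a k-means codebook trained on the same sample set is a local minimizer of this distortion.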
2.2. Codebook Generation

We show that a variety of methods minimize the distortion w.r.t. the codebook under different constraints.

K-means

If there is no constraint on the codebook, minimizing the distortion in Eqn.(1) leads to the classical k-means clustering algorithm [13]. With the encoder i(·) fixed, the codeword c of a given x is the center of the cluster that x belongs to; this is the second Lloyd's condition [7].

Product Quantization [10]

If any codeword c must be taken from the Cartesian product of a finite number of sub-codebooks, minimizing the distortion in Eqn.(1) leads to the product quantization method [10].

Formally, denote any x ∈ R^D as the concatenation of M subvectors: x = [x^1, ..., x^m, ..., x^M]. For simplicity it is assumed [10] that the subvectors have a common number of dimensions D/M. The Cartesian product C = C^1 × ... × C^M is the set in which a codeword c ∈ C is formed by concatenating the M sub-codewords: c = [c^1, ..., c^m, ..., c^M], with each c^m ∈ C^m. We point out that the objective function for PQ, though not explicitly defined in [10], is essentially:

\min_{C^1,...,C^M} \sum_x \| x - c(i(x)) \|^2,   s.t.  c ∈ C = C^1 × ... × C^M.    (2)

It is easy to show that x's nearest codeword c in C is the concatenation of the M nearest sub-codewords, c = [c^1, ..., c^m, ..., c^M], where c^m is the nearest sub-codeword of the subvector x^m. So Eqn.(2) can be split into M separate subproblems, each of which can be solved by k-means in its corresponding subspace. This is the PQ algorithm.

The benefit of PQ is that it can easily generate a codebook C with a large number of codewords. If each sub-codebook has k sub-codewords, then their Cartesian product C has k^M codewords. This is not possible for classical k-means when k^M is large. PQ also enables fast distance computation: the distances between any two sub-codewords in a subspace are precomputed and stored in a k-by-k lookup table, and the distance between two codewords in C is simply the sum of the distances computed from the M subspaces.

Iterative Quantization [6]

If any codeword c must be taken from "the vertexes of a rotating hyper-cube," minimizing the distortion leads to a hashing method called Iterative Quantization (ITQ) [6].

The D-dimensional vectors in {-a, a}^D are the vertices of an axis-aligned D-dimensional hyper-cube. Suppose the data has been zero-centered. The objective function in ITQ [6] is essentially:

\min_{R, a} \sum_x \| x - c(i(x)) \|^2,   s.t.  c ∈ C = \{ c \mid Rc ∈ \{-a, a\}^D \},  R^T R = I,    (3)

where R is an orthogonal matrix and I is an identity matrix. The benefit of using a rotating hyper-cube as the codebook is that the squared Euclidean distance between any two codewords is equivalent to the Hamming distance between their indices. So ITQ is in the category of binary hashing methods [1, 20, 19]. Eqn.(3) also indicates that any orthogonal hashing method is equivalent to a vector quantizer. The length a in (3) does not impact the resulting hashing functions, as noticed in [6], but it matters when we compare the distortion with other quantization methods.
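As an illustration of the PQ objective in Eqn.(2) and of the lookup tables described above, here is a minimal sketch that trains one k-means sub-codebook per subspace, encodes data into M sub-indices, and evaluates symmetric distances by table lookups. The use of scikit-learn's KMeans and the helper names are assumptions of this sketch, not the original implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, M, k=256):
    """Run k-means independently in each of the M subspaces (Eqn. 2). D must be divisible by M."""
    subs = np.split(X, M, axis=1)
    return [KMeans(n_clusters=k, n_init=4).fit(s).cluster_centers_ for s in subs]

def pq_encode(X, codebooks):
    """Return the M sub-indices of each point as an n-by-M integer array."""
    codes = []
    for s, C in zip(np.split(X, len(codebooks), axis=1), codebooks):
        d2 = ((s[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        codes.append(d2.argmin(axis=1))
    return np.stack(codes, axis=1)

def sdc_tables(codebooks):
    """One k-by-k table per subspace of squared distances between sub-codewords."""
    return [((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2) for C in codebooks]

def sdc_distance(code_q, code_x, tables):
    """Symmetric distance: sum the M table entries indexed by the two codes."""
    return sum(T[cq, cx] for T, cq, cx in zip(tables, code_q, code_x))
```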
2.3. Distortion as the Objective Function

The above methods all optimize the same form of quantization distortion, but subject to different constraints. This implies that distortion is an objective function that can be evaluated across different quantization methods. We empirically observe that the distortion is tightly correlated to the ANN search accuracy of different methods.

[Figure 1: mAP vs. quantization distortion. We show results from five methods: k-means, ITQ, and three variants of PQ. The datasets are SIFT1M and GIST1M from [10]. All methods are given 16 bits for codeword length. The data consist of the largest 16 principal components (this is to enable measuring the ITQ distortion).]

To show this, we investigate nearest neighbor search for 100 nearest neighbors on two large datasets. We adopt the common strategy of linear search using compact codes [10, 6]: the returned results are ordered by their approximate distances to the query. Here both the query and the data are quantized, and their distance is approximated by their codeword distances (for ITQ this is equivalent to ranking by Hamming distance). We compute the mean Average Precision (mAP) over ten thousand and one thousand queries on the first and second dataset, respectively. The mAP of the different methods and their distortions are shown in Fig. 1. We compare five methods: k-means, ITQ, and three variants of PQ (decomposed into M = 2, 4, or 8 subspaces, denoted as PQ_M). For all methods the codeword length is fixed to B = 16 bits, which essentially gives K = 2^16 codewords (though PQ and ITQ need not explicitly store them). We can see that the mAP (from different methods) has a strong correlation with the quantization distortion. We observe similar behaviors under various ANN metrics, like precision/recall at the first N samples, with various numbers of ground-truth nearest neighbors.

Based on the above observations, we use distortion as an objective function to evaluate the optimality of a quantization method.

3. Optimized Product Quantization

Product quantization involves decomposing the D-dimensional vector space into M subspaces, and computing a sub-codebook for each subspace. M is determined by the budget constraint of memory space (to ensure a feasible lookup table size) and computational costs, and is pre-determined in practice. We use an orthonormal matrix R to represent the space decomposition. The D-dimensional vector space is first transformed by R. The dimensions of the transformed space are then divided into M chunks. The i-th chunk, consisting of the dimensions (i-1)*D/M + {1, 2, ..., D/M}, is then assigned to the i-th subspace. Note that any reordering of the dimensions can be represented by an orthonormal matrix. Thus R decides the dimensions of the transformed space assigned to each subspace. The free parameters of product quantization thus consist of the sub-codebooks for the subspaces and the matrix R. The additional free parameters of R allow the vector space to rotate, thus relaxing the constraints on the codewords. From Fig. 1 we see that relaxing constraints leads to lower distortions. We optimize product quantization over these free parameters:

\min_{R, C^1,...,C^M} \sum_x \| x - c(i(x)) \|^2,   s.t.  c ∈ C = \{ c \mid Rc ∈ C^1 × ... × C^M \},  R^T R = I.    (4)

We call the above problem Optimized Product Quantization (OPQ). The effective codebook C is jointly determined by R and the sub-codebooks {C^m}_{m=1}^M.

Notice that assigning x to its nearest codeword c is equivalent to assigning Rx to the nearest Rc. To apply the codebook in Eqn.(4) for encoding, we only need to pre-process the data by Rx, and the remaining steps are the same as in PQ.

Optimizing the problem in Eqn.(4) was considered "not tractable" [11], because of the large number of degrees of freedom. Previous methods pre-processed the data using simple heuristics like randomly ordering the dimensions [10] or randomly rotating the space [11]. The matrix R has not been considered in any optimization. In the following we propose a non-parametric iterative solution and a simple parametric solution to the problem of Eqn.(4).
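Because encoding with the optimized codebook only requires pre-processing by R, an OPQ encoder can be sketched as a thin wrapper around the hypothetical PQ helpers above (samples are rows of X, so applying R to each column vector x becomes X @ R.T):

```python
def opq_train_with_R(X, R, M, k=256):
    # pre-process by R as in Eqn.(4): x_hat = R x, written row-wise
    return pq_train(X @ R.T, M, k)

def opq_encode(X, R, codebooks):
    # identical to PQ encoding, applied to the rotated data
    return pq_encode(X @ R.T, codebooks)
```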
3.1. A Non-Parametric Solution

Our non-parametric solution does not assume any data distribution.¹ We split the problem in Eqn.(4) into two simpler sub-problems that are optimized in an alternative way.

¹We follow the terminology in statistics that a "non-parametric" model is one that does not rely on any assumption about the data distribution, while a "parametric" model explicitly assumes a certain parameterized distribution such as a Gaussian distribution.

Step (i): Fix R, optimize the sub-codebooks {C^m}_{m=1}^M.

Recall that assigning x to the nearest c is equivalent to assigning Rx to the nearest Rc. Denote x̂ = Rx and ĉ = Rc. Since R is orthonormal, we have \|x - c\|^2 = \|x̂ - ĉ\|^2. With R fixed, Eqn.(4) then becomes:

\min_{C^1,...,C^M} \sum_{x̂} \| x̂ - ĉ(i(x̂)) \|^2,   s.t.  ĉ ∈ C^1 × ... × C^M.    (5)

Eqn.(5) is the same problem as PQ in Eqn.(2). We can separately run k-means in each subspace to compute the sub-codebooks.

Step (ii): Fix the sub-codebooks {C^m}_{m=1}^M, optimize R.

Since \|x - c\|^2 = \|Rx - ĉ\|^2, the sub-problem becomes:

\min_{R} \sum_x \| Rx - ĉ(i(x̂)) \|^2,   s.t.  R^T R = I.    (6)

For each x̂, the codeword ĉ(i(x̂)) is fixed in this subproblem and can be derived from the sub-codebooks computed in Step (i). To find ĉ(i(x̂)), we simply concatenate the M sub-codewords of the sub-vectors in x̂. We denote ĉ(i(x̂)) as y. Given n training samples, we denote X and Y as two D-by-n matrices whose columns are the samples x and y respectively. Note that Y is fixed in this subproblem. Then we can rewrite (6) as:

\min_{R} \| RX - Y \|_F^2,   s.t.  R^T R = I,    (7)

where \|\cdot\|_F is the Frobenius norm. This is the Orthogonal Procrustes problem [16, 6] and there is a closed-form solution: first apply Singular Value Decomposition to XY^T = USV^T, and then let R = VU^T. In [6] this solution was used to optimize the ITQ problem in Eqn.(3).

Our algorithm alternately optimizes Step (i) and Step (ii). Note that in Step (i) we need to run k-means, which by itself is an iterative algorithm. However, we notice that after Step (ii) the updated matrix R would not change the cluster membership of the previous k-means clustering results, thus we only need to refine the previous k-means results instead of restarting k-means. With this strategy, we empirically find that even if we only run one k-means iteration in each Step (i), our entire algorithm still converges to a good solution. A pseudo-code is in Algorithm 1.

Algorithm 1 Non-Parametric OPQ
Input: training samples {x}, number of subspaces M, number of sub-codewords k in each sub-codebook.
Output: the matrix R, sub-codebooks {C^m}_{m=1}^M, M sub-indices {i^m}_{m=1}^M for each x.
1: Initialize R, {C^m}_{m=1}^M, and {i^m}_{m=1}^M.
2: repeat
3:   Step (i): project the data: x̂ = Rx.
4:   for m = 1 to M do
5:     for j = 1 to k: update ĉ^m(j) by the sample mean of {x̂^m | i^m(x̂^m) = j}.
6:     for all x̂^m: update i^m(x̂^m) by the sub-index of the sub-codeword ĉ^m that is nearest to x̂^m.
7:   end for
8:   Step (ii): solve R by Eqn.(7).
9: until the maximum iteration number is reached

[Figure 2: Convergence of Algorithm 1 on the SIFT1M dataset [10]. Here we use M = 4 and k = 256 (32 bits).]

Note that if we ignore line 3 and line 8 in Algorithm 1, it is equivalent to PQ (for PQ one would usually put line 2 in the inner loop). Thus its complexity is comparable to PQ, except that in each iteration our algorithm updates R and transforms the data by R. Fig. 2 shows the convergence of our algorithm. In practice we find 100 iterations are good enough for the purpose of ANN search.

Like many other alternating-optimization algorithms, our algorithm is only locally optimal and the final solution depends on the initialization. In the next subsection we propose a parametric solution that can be used to initialize our alternating optimization procedure.
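A compressed NumPy rendering of Algorithm 1 is sketched below. Two simplifications are assumptions of this sketch rather than the paper's procedure: k-means is restarted in each Step (i) instead of refining the previous clustering, and the hypothetical pq_train/pq_encode helpers from the earlier sketch are reused.

```python
import numpy as np

def solve_R(X, Y):
    """Orthogonal Procrustes step of Eqn.(7): min_R ||RX - Y||_F with R^T R = I.
    X, Y are D-by-n matrices (columns are samples x and their codewords y)."""
    U, _, Vt = np.linalg.svd(X @ Y.T)      # X Y^T = U S V^T
    return Vt.T @ U.T                      # R = V U^T

def opq_nonparametric(X_rows, M, k=256, iters=100):
    """Alternating optimization of R and the sub-codebooks (samples are rows of X_rows)."""
    D = X_rows.shape[1]
    R = np.eye(D)                          # identity init; the parametric solution is a better start
    for _ in range(iters):
        Xr = X_rows @ R.T                  # Step (i): project the data, x_hat = R x
        codebooks = pq_train(Xr, M, k)     # update sub-codebooks and sub-indices
        codes = pq_encode(Xr, codebooks)
        # y = c_hat(i(x_hat)): concatenate the chosen sub-codewords of every sample
        Y_rows = np.hstack([C[codes[:, m]] for m, C in enumerate(codebooks)])
        R = solve_R(X_rows.T, Y_rows.T)    # Step (ii): Eqn.(7)
    return R, codebooks
```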
3.2. A Parametric Solution

We further propose another solution assuming the data follows a parametric Gaussian distribution. This parametric solution has both practical and theoretical merits. First, it is a simpler method and is globally optimal if the data follows a Gaussian distribution. Second, it provides a way to initialize the non-parametric method. Third, it provides new theoretical explanations for two commonly used criteria in some other quantization methods.

3.2.1 Distortion Bound of Quantization

We assume each dimension of x ∈ R^D is subject to an independent Gaussian distribution with zero mean. From rate distortion theory [5] we know that the codeword length b of any quantizer achieving a distortion of E must satisfy:

b(E) \ge \sum_{d=1}^{D} \frac{1}{2} \log_2 \frac{D \sigma_d^2}{E},    (8)

where \sigma_d^2 is the variance at each dimension. In this equation we have assumed \sigma_d^2 is sufficiently large (a more general form is in [5]). Equivalently, the distortion satisfies:

E \ge k^{-\frac{2}{D}} D |\Sigma|^{\frac{1}{D}},    (9)

where k = 2^b and we assume all codewords have identical code length. The matrix \Sigma is the covariance of x, and |\Sigma| is its determinant. Here we have relaxed the independence assumption and allowed x to follow a Gaussian distribution N(0, \Sigma). Eqn.(9) is the distortion lower bound for any quantizer with k codewords. The following table shows the values of this bound and the empirical distortion of k-means (10^5 samples, k = 256, \sigma_d^2 randomly generated in [0.5, 1]):

D                      32     64     128
distortion bound       16.2   38.8   86.7
empirical distortion   17.1   39.9   88.5

It is reasonable to consider this bound as an approximation to the k-means distortion. The small gap (~5%) may be due to two reasons: k-means can only achieve a locally optimal solution, and the fixed code-length for all codewords may not achieve the optimal bit rate.²

²In information theory, it is possible to reduce the average bit rate by varying the bit-length of codewords, known as entropy encoding [5].

3.2.2 Distortion Bound of Product Quantization

Now we study the distortion bound of product quantization when x ~ N(0, \Sigma). Suppose x has been decomposed into a concatenation of M equal-dimensional sub-vectors. Accordingly, we can decompose \Sigma into M × M sub-matrices:

\Sigma = \begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1M} \\ \vdots & \ddots & \vdots \\ \Sigma_{M1} & \cdots & \Sigma_{MM} \end{pmatrix}.    (10)

Here the diagonal sub-matrices \Sigma_{mm} are the covariances of the m-th subspace. Notice that the marginal distribution of x^m is subject to the D/M-dimensional Gaussian N(0, \Sigma_{mm}). From (9), the distortion bound of PQ is:

E_{PQ} = k^{-\frac{2M}{D}} \frac{D}{M} \sum_{m=1}^{M} |\Sigma_{mm}|^{\frac{M}{D}}.    (11)
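The bounds in Eqn.(9) and Eqn.(11) are straightforward to evaluate from a covariance matrix; a small sketch follows (using log-determinants for numerical stability, which is an implementation choice not discussed in the paper):

```python
import numpy as np

def vq_bound(cov, k):
    """Distortion lower bound of Eqn.(9): k^(-2/D) * D * |Sigma|^(1/D)."""
    D = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return k ** (-2.0 / D) * D * np.exp(logdet / D)

def pq_bound(cov, M, k):
    """Distortion bound of PQ, Eqn.(11), from the diagonal sub-matrices of Sigma."""
    D = cov.shape[0]
    d = D // M
    total = 0.0
    for m in range(M):
        _, logdet = np.linalg.slogdet(cov[m*d:(m+1)*d, m*d:(m+1)*d])
        total += np.exp(logdet * M / D)          # |Sigma_mm|^(M/D)
    return k ** (-2.0 * M / D) * (D / M) * total
```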
3.2.3 Minimizing Distortion Bound of PQ

Remember that the space decomposition can be parameterized by an orthonormal matrix R. When applying R to the data, the variable x̂ = Rx is subject to N(0, \hat{\Sigma}) with \hat{\Sigma} = R \Sigma R^T. We propose to minimize the distortion bound w.r.t. R to optimize the space decomposition in product quantization:

\min_{R} \sum_{m=1}^{M} |\hat{\Sigma}_{mm}|^{\frac{M}{D}},   s.t.  R^T R = I,    (12)

where \hat{\Sigma}_{mm} is the m-th diagonal sub-matrix of \hat{\Sigma}. The constant scale in Eqn.(11) has been ignored in this objective function. Due to the orthonormal constraint, this problem is inherently non-convex. Fortunately, the special form of our objective function can be minimized by a simple algorithm, as we show next.

3.2.4 Eigenvalue Allocation

We first show that the objective function in Eqn.(12) has a constant lower bound. Using the inequality of arithmetic and geometric means (AM-GM inequality) [4], the objective in Eqn.(12) satisfies:

\sum_{m=1}^{M} |\hat{\Sigma}_{mm}|^{\frac{M}{D}} \ge M \prod_{m=1}^{M} |\hat{\Sigma}_{mm}|^{\frac{1}{D}}.    (13)

The equality holds if and only if the term |\hat{\Sigma}_{mm}| has the same value for all m. Further, Fischer's inequality [9] gives:

\prod_{m=1}^{M} |\hat{\Sigma}_{mm}| \ge |\hat{\Sigma}|.    (14)

The equality holds if the off-diagonal sub-matrices in \hat{\Sigma} equal zero. Note that |\hat{\Sigma}| ≡ |\Sigma| is a constant given the data distribution. Combining (13) and (14), we obtain the constant lower bound for the objective in (12):

\sum_{m=1}^{M} |\hat{\Sigma}_{mm}|^{\frac{M}{D}} \ge M |\Sigma|^{\frac{1}{D}}.    (15)

The minimum is achieved if the following two criteria are both satisfied:

(i) Independence. If we align the data by PCA, the equality in Fischer's inequality (14) is achieved. This implies we make the dimensions independent of each other.

(ii) Balanced Subspaces' Variance. The equality in AM-GM (13) is achieved if |\hat{\Sigma}_{mm}| has the same value for all subspaces. Suppose the data has been aligned by PCA. Then |\hat{\Sigma}_{mm}| equals the product of the eigenvalues falling in the m-th subspace. By re-ordering the principal components, we can balance the product of eigenvalues for each subspace, and thus the values |\hat{\Sigma}_{mm}| for all subspaces. As a result, both equalities in AM-GM (13) and Fischer's (14) are satisfied, so the objective function is minimized.

Based on the above analysis, we propose a simple Eigenvalue Allocation method to optimize Eqn.(12). This method is a greedy solution for the balanced partition problem. We first align the data using PCA and sort the eigenvalues in descending order \sigma_1^2 \ge ... \ge \sigma_D^2. Note that we do not reduce dimensions. We prepare M empty buckets, one for each of the M subspaces. We sequentially pick out the largest eigenvalue and allocate it to the bucket having the minimum product of the eigenvalues in it (unless the bucket is full, i.e., with D/M eigenvalues in it). The principal directions corresponding to the eigenvalues in each bucket form the subspace. In fact, the principal directions are re-ordered to form the columns of R.

In real datasets, we find this greedy algorithm is sufficiently good for minimizing the objective function. To show this fact, we compute the covariance matrix \Sigma from the SIFT/GIST datasets [10]. The following table shows the theoretical minimum of the objective function (the right hand side in (15)) and the objective function value (the left hand side in (15)) obtained by our Eigenvalue Allocation algorithm. Here we use M = 8 and k = 256. We can see that the greedy algorithm achieves the theoretical minimum:

         theoretical minimum    Eigenvalue Allocation
SIFT     2.9286 × 10^3          2.9287 × 10^3
GIST     1.9870 × 10^-3         1.9870 × 10^-3
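A sketch of the Eigenvalue Allocation procedure described above follows. Two details are assumptions of this sketch rather than statements of the paper: empty buckets are treated as having the smallest product (so the allocation stays balanced even when the eigenvalues are below one), and the principal directions are placed in the rows of R so that x̂ = Rx is expressed in the re-ordered PCA coordinates.

```python
import numpy as np

def eigenvalue_allocation(cov, M):
    """Greedy balanced partition of principal directions into M buckets (Sec. 3.2.4)."""
    D = cov.shape[0]
    cap = D // M
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # process eigenvalues from largest to smallest
    buckets = [[] for _ in range(M)]
    log_prod = [None] * M                           # None marks an empty bucket; otherwise log-product
    for d in order:
        candidates = [m for m in range(M) if len(buckets[m]) < cap]
        # empty buckets first, then the bucket with the smallest product of eigenvalues
        m = min(candidates,
                key=lambda i: -np.inf if log_prod[i] is None else log_prod[i])
        buckets[m].append(d)
        lv = np.log(max(eigvals[d], 1e-12))         # guard against zero eigenvalues
        log_prod[m] = lv if log_prod[m] is None else log_prod[m] + lv
    perm = [d for b in buckets for d in b]          # re-ordered principal directions
    return eigvecs[:, perm].T                       # rows of R: one principal direction each
```

Working with log-products instead of raw products avoids overflow or underflow when the eigenvalues span many orders of magnitude, as they do for SIFT and GIST.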

Interestingly, existing encoding methods have adopted, either heuristically or in objective functions different from ours, the criteria of "independence" or "balance" mentioned above. "Independence" was used in [20, 19, 3] in the form of PCA projection. "Balance" was used in [11, 20, 3]: the method in [11] rotates the data to "balance" the variance of each component but loses "independence"; the methods in [20, 3] adaptively allocate the codeword bits to the principal components. Our derivation provides theoretical explanations for the two criteria: they can be considered as a way of minimizing the quantization distortion under a Gaussian distribution assumption.

Summary of the parametric solution. Our parametric solution first computes the covariance matrix \Sigma of the data and uses Eigenvalue Allocation to generate the orthonormal matrix R, which determines the space decomposition. The data are then transformed by R. The original PQ algorithm is then performed on these transformed data.
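Putting the pieces together, the parametric pipeline summarized above might look as follows. It reuses the hypothetical eigenvalue_allocation, pq_train and pq_encode helpers from the earlier sketches; the zero-centering step is an assumption of this sketch.

```python
import numpy as np

def opq_parametric(X_rows, M, k=256):
    """Parametric OPQ: covariance -> Eigenvalue Allocation -> rotate -> plain PQ."""
    Xc = X_rows - X_rows.mean(axis=0)        # zero-center the data (assumed pre-processing)
    cov = np.cov(Xc, rowvar=False)           # D-by-D covariance matrix
    R = eigenvalue_allocation(cov, M)        # space decomposition from Sec. 3.2.4
    Xr = Xc @ R.T                            # x_hat = R x for every sample
    codebooks = pq_train(Xr, M, k)           # original PQ on the transformed data
    codes = pq_encode(Xr, codebooks)
    return R, codebooks, codes
```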

4. Experiments

We evaluate our method for ANN search using three public datasets. The first two datasets are SIFT1M and GIST1M [10]. SIFT1M consists of 1 million 128-d SIFT features [12] and 10k queries. GIST1M consists of 1 million 960-d GIST features [15] and 1k queries. The third dataset MNIST³ consists of 70k images of handwritten digits, each as a 784-d vector concatenating all pixels. We randomly sample 1k images as the queries and use the remaining as the database. We further generate a synthetic dataset subject to a 128-d independent Gaussian distribution, where the eigenvalues of the covariance matrix are given by \sigma_d^2 = e^{-0.1d} (d = 1, ..., 128), a long-tail curve fit to the eigenvalues of the above real datasets. This synthetic set has 1 million data points and 10k queries.

We consider K = 100 Euclidean nearest neighbors as the true neighbors. We find that for lookup-based methods, K is not influential for the comparisons among the methods.

We follow a common exhaustive search strategy (e.g., [10, 6]). The data are ranked in the order of their approximate distances to the query. If both the query and the data are quantized, the method is termed Symmetric Distance Computation (SDC) [10]. SDC is equivalent to Hamming ranking for orthogonal hashing methods like ITQ [6]. If only the data are quantized, the method is termed Asymmetric Distance Computation (ADC) [10]. ADC is more accurate than SDC but has the same complexity. We have tested both cases. The exhaustive search is fast using lookup tables: e.g., for 64-bit indices it takes 20 ms per 1 million distance computations. Experiments are run on a PC with an Intel Core2 2.13GHz CPU and 8GB RAM. We do not combine with any non-exhaustive method like inverted files [17] as this is not the focus of our paper. We study the following methods:

• OPQ_P: our parametric solution.
• OPQ_NP: our non-parametric solution initialized by the parametric solution.
• PQ_RO: randomly order the dimensions, as suggested in [10].
• PQ_RR: the data are aligned using PCA and then randomly rotated, as suggested in [11].
• TC (Transform Coding [3]): a scalar quantization method that quantizes each principal component with an adaptive number of bits.
• ITQ [6]: one of the state-of-the-art hashing methods, a special vector quantization method.

Notice that in these settings we have assumed there is no prior knowledge available. Later we will study the case with prior knowledge.

Given the code-length B, all the PQ-based methods (OPQ_NP, OPQ_P, PQ_RO, PQ_RR) assign 8 bits to each subspace (k = 256). The subspace number M is B/8. We have published the Matlab code of our solutions⁴.

³http://yann.lecun.com/exdb/mnist/
⁴research.microsoft.com/en-us/um/people/kahe/
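The asymmetric search described above can be sketched in a few lines: the raw query is compared against each sub-codebook once, and database distances are then accumulated purely by table lookups. The function name and the code layout reuse the hypothetical helpers from the earlier sketches and are assumptions of this note.

```python
import numpy as np

def adc_search(query, codes, codebooks, topN=100):
    """Asymmetric Distance Computation: query stays un-quantized, database items are coded."""
    # one k-entry table per subspace: distance from the query sub-vector to each sub-codeword
    subs = np.split(query, len(codebooks))
    tables = [((s[None, :] - C) ** 2).sum(axis=1) for s, C in zip(subs, codebooks)]
    # accumulate approximate distances for all database items by table lookups
    dist = np.zeros(codes.shape[0])
    for m, T in enumerate(tables):
        dist += T[codes[:, m]]
    return np.argsort(dist)[:topN]
```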

Performance on the synthetic dataset

[Figure 3: Comparison on Gaussian synthetic data using Symmetric Distance Computation and 32-bit codes.]

Fig. 3 shows the performance on the synthetic data subject to a 128-d independent Gaussian distribution, evaluated through recall vs. N, i.e., the proportion of the true nearest neighbors ranked in the first N positions. We can see that OPQ_NP and OPQ_P perform almost the same. OPQ_P achieves the theoretical minimum of 6.314 × 10^-3 in Eqn.(15). This implies that, under a Gaussian distribution, our parametric solution is optimal. On the contrary, PQ_RO and PQ_RR perform substantially worse. This means that the subspace decomposition is important for PQ even under a simple Gaussian distribution. Besides, we find PQ_RO performs better than PQ_RR. This is because in this distribution PQ_RO automatically satisfies the independence criterion (see 3.2.4).

Performance without prior knowledge

Next we evaluate the performance when prior knowledge is not available. In Fig. 4, 5, and 6 we compare the results on SIFT1M, GIST1M, and MNIST. We show the recall at the first N positions with 64 bits using SDC (Fig. 4, 5, 6 (a)), and the mAP (mean Average Precision) vs. code-length using SDC and ADC, respectively (Fig. 4, 5, 6 (b)(c)).

[Figure 4: Comparisons on SIFT1M. (a): recall at the N top-ranked samples, using SDC and 64-bit codes. (b): mean Average Precision vs. code-length, using SDC. (c): mean Average Precision vs. code-length, using ADC.]

[Figure 5: Comparisons on GIST1M.]

[Figure 6: Comparisons on MNIST.]

We find that both of our solutions substantially outperform the existing methods. The superiority of our methods does not depend on the choice of SDC or ADC. Typically, in all cases even our simple parametric method OPQ_P shows prominent improvement over PQ_RO and PQ_RR. This indicates that PQ-based methods strongly depend on the space decomposition. We also notice that the performance of PQ_RR is disappointing. Although this method tries to balance the variance using a random rotation, the independence between subspaces is lost by such a random rotation.

Our non-parametric solution further improves the results of the parametric one on the SIFT1M and MNIST sets. This is because these two sets exhibit more than one cluster: the SIFT1M set has two distinct clusters (this can be visualized by the first two principal components), and MNIST can be expected to have 10 clusters due to the 10 digits. In these cases the non-parametric solution is able to further reduce the distortion. Such an improvement on MNIST is significant. On GIST1M our two methods are comparable, possibly because this set is mostly subject to a Gaussian distribution.

We notice that TC performs clearly better than PQ_RO and PQ_RR on the GIST1M set. This scalar quantization method attempts to balance the bits assigned to the eigenvalues, so it better satisfies the two criteria in Sec. 3.2.4. But TC is inferior to our methods on all datasets. This is because our method quantizes multi-dimensional subspaces instead of each scalar dimension. And unlike TC, which assigns a variable number of bits to each eigenvalue, our method assigns the eigenvalues to each subspace. Since bit numbers are discrete but eigenvalues are continuous, it is easier for our method to achieve balance.

Performance with prior knowledge

It has been noticed [10] that PQ works much better when utilizing the prior knowledge that SIFT and GIST are concatenated histograms. Typically, the so-called "natural" order is that each subspace consists of neighboring histograms. The "structural" order (when M = 8) is that each subspace consists of the same bin of all histograms (each histogram has 8 bins). It is observed [10] that the natural order performs better for SIFT features, and the structural order is better for GIST features. We denote PQ with prior knowledge as PQ_pri (using the recommended orders). Note that such priors may limit the choices of M and are not always applicable for all possible code-lengths.

[Figure 7: Comparisons with prior knowledge using 64 bits and SDC. (a): SIFT1M. (b): GIST1M.]

In Fig. 7 we compare PQ_pri with our prior-free non-parametric method OPQ_NP. We also evaluate our non-parametric method using the prior orders as initialization, denoted as OPQ_NP+pri. Our prior-free method outperforms the prior-based PQ. On SIFT1M our prior-dependent method improves further due to a better initialization. On GIST1M our prior-dependent method is slightly inferior to our prior-free method: this also implies that GIST1M largely follows a Gaussian distribution.

5. Conclusion

We have proposed to optimize the space decomposition in product quantization. Since space decomposition has been shown to have a great impact on the search accuracy both in our experiments and in existing studies [10], we believe our result makes product quantization a more practical and effective solution to general ANN problems.

References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459-468. IEEE, 2006.
[2] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
[3] J. Brandt. Transform coding for fast approximate nearest neighbor search in high dimensions. In CVPR, 2010.
[4] A. Cauchy. Cours d'analyse de l'École Royale Polytechnique. Imprimerie royale, 1821.
[5] T. Cover and J. Thomas. Elements of Information Theory, chapter 13, page 348. John Wiley & Sons, Inc., 1991.
[6] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
[7] R. Gray. Vector quantization. IEEE ASSP Magazine, 1984.
[8] K. He, F. Wen, and J. Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In CVPR, 2013.
[9] R. A. Horn and C. R. Johnson. Matrix Analysis, chapter 7, page 478. Cambridge University Press, 1990.
[10] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
[11] H. Jegou, M. Douze, C. Schmid, and P. Perez. Aggregating local descriptors into a compact image representation. In CVPR, pages 3304-3311, 2010.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60:91-110, 2004.
[13] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297. University of California Press, 1967.
[14] M. Norouzi and D. Fleet. Cartesian k-means. In CVPR, 2013.
[15] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[16] P. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1-10, 1966.
[17] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[18] A. B. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[19] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.
[20] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753-1760, 2008.