
Fast Orthogonal Projection Based on Kronecker Product


Xu Zhang1,2, Felix X. Yu2,3, Ruiqi Guo3, Sanjiv Kumar3, Shengjin Wang1, Shih-Fu Chang2
1Tsinghua University, 2Columbia University, 3Google Research

This work was supported by the National Science and Technology Support Program under Grant No. 2013BAK02B04 and the Initiative Scientific Research Program of Tsinghua University under Grant No. 20141081253. Felix X. Yu was supported in part by the IBM PhD Fellowship Award when working on this project.

Abstract

We propose a family of structured matrices to speed up orthogonal projections for high-dimensional data commonly seen in computer vision applications. In this, a structured matrix is formed by the Kronecker product of a series of smaller orthogonal matrices. This achieves O(d log d) computational complexity and O(log d) space complexity for d-dimensional data, a drastic improvement over the standard unstructured projections whose computational and space complexities are both O(d^2). We also introduce an efficient learning procedure for optimizing such matrices in a data-dependent fashion. We demonstrate the significant advantages of the proposed approach in solving the approximate nearest neighbor (ANN) image search problem with both binary embedding and quantization. Comprehensive experiments show that the proposed approach can achieve similar or better accuracy as the existing state-of-the-art but with significantly less time and memory.

1. Introduction

Linear projection is one of the most widely used operations, fundamental to many algorithms in computer vision. Given a vector x ∈ R^d and a projection matrix R ∈ R^{k×d}, the linear projection computes h(x) ∈ R^k:

h(x) = Rx.

In the area of large-scale search and retrieval in computer vision, linear projection is usually followed by quantization to convert high-dimensional image features into binary embeddings [15, 23, 34, 25] or product codes [16, 26]. These compact codes have been used to speed up search and reduce storage in image retrieval [32], feature matching [31], attribute recognition [28], and object categorization [6], among others. For example, the popularly used Locality Sensitive Hashing (LSH) technique applies a linear projection to the input data before converting it into a binary or a non-binary code. For instance, a k-bit binary code is simply given as

h(x) = sign(Rx). (1)
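As a concrete illustration of (1), here is a minimal NumPy sketch that generates an LSH-style code with a random unstructured projection; the sizes and names are illustrative only, and the quadratic cost of storing and applying R is exactly what the rest of the paper removes.

```python
import numpy as np

# Minimal sketch of eq. (1): a k-bit binary code from a random unstructured
# projection. Sizes are illustrative; at k = d = 50,000 the matrix R alone
# already occupies about 10 GB in single precision, as noted below.
rng = np.random.default_rng(0)
d, k = 512, 64
R = rng.standard_normal((k, d))   # k*d parameters: O(kd) storage and time
x = rng.standard_normal(d)

code = np.sign(R @ x)             # h(x) = sign(Rx), a vector in {+1, -1}^k
```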
However, the projection becomes expensive as the input dimensionality d increases. In practice, to achieve high recall in retrieval tasks, it is often desirable to use long codes with a large k such that k = O(d) [22, 8, 29]. In this case, the space and computational complexity of the projection is O(d^2), and such a high cost often becomes the bottleneck at both learning and prediction time. For instance, when k = d = 50,000, the projection matrix alone takes 10 GB (single precision) and projecting one vector can take 800 ms on a single core.

In addition, in many applications, it is desired that the projection matrix is orthogonal. An orthogonal transformation preserves the Euclidean distance between points. And it can also distribute the variance more evenly across the dimensions. These properties are important to make several well-known techniques perform well on real-world data [10, 26]. Another motivation for orthogonal projection comes from the goal of learning maximally uncorrelated bits while learning data-dependent binary codes. One way to achieve that is by imposing orthogonal (or near-orthogonal) constraints on the projections [10, 37]. In binary embedding, many independent empirical experiments [18, 15, 31, 10] have shown that imposing an orthogonality constraint on the projection achieves better results for approximate nearest neighbor search. Ji et al. [19] provided theoretical analysis to support the use of orthogonal projection. In quantization, the orthogonality constraint on the projection matrix makes some critical optimization problems possible to solve [26, 35]. However, the main challenge is that the computational complexity of building an orthogonal matrix is O(d^3), while its space and projection complexity is O(d^2), both of which are expensive for large d.

In order to speed up projection for high-dimensional data, researchers have studied various types of structured matrices, including the Hadamard matrix, the Toeplitz matrix and the circulant matrix. The idea behind projecting with structured matrices instead of the traditional unstructured matrices is that one can exploit the structure to achieve better space and computational complexity than quadratic. Projecting vectors with structured matrices has been studied in a variety of contexts. In dimensionality reduction, Ailon and Chazelle [1] studied projection using a sparse Gaussian matrix paired with a Hadamard matrix. This was followed by Dasgupta et al. [5], who used a combination of permutation and diagonal matrices along with a Hadamard matrix in LSH. These variants of Hadamard matrices were further used by Jacques et al. in compressed sensing [14] and by Le et al. [21] in kernel approximation. These works utilize the well-known fast Johnson-Lindenstrauss transform to achieve O(d log d) time complexity. Researchers have also used Toeplitz-structured matrices [2, 11] and circulant matrices [12, 37, 38, 4] for projection, which also obtain O(d log d) time complexity.

However, the main problem with the commonly used structured matrices is that they are not orthogonal. Although the Hadamard matrix is orthogonal by itself, it is typically used in combination with other matrices (e.g., sparse Gaussian or diagonal/permutation matrices), which convert it into a non-orthogonal matrix. Besides this, there is another problem with using the Hadamard matrix directly: there are no free parameters in Hadamard matrices. Thus, one cannot learn a Hadamard matrix in a data-dependent fashion.

In this work we introduce a very flexible family of orthogonal structured matrices formed by the Kronecker product of small element matrices, leading to substantially reduced space and computational complexity. One can vary the number of free parameters in these matrices to adapt to the needs of a given application. The most related work to our proposed method is the bilinear projection [8], which is also orthogonal and faster than quadratic. We show that the bilinear method can be viewed as a special case of the proposed method. Moreover, our structure is more flexible and has lower computational complexity than the O(d^1.5) of the bilinear method. Table 1 summarizes the space and time complexity of the proposed method in comparison with other structured matrices.

Method             | Time       | Space    | Time (Learning)
Unstructured [10]  | O(d^2)     | O(d^2)   | O(Nd^2)
Bilinear [8]       | O(d^1.5)   | O(d)     | O(Nd^1.5)
Circulant [37]     | O(d log d) | O(d)     | O(Nd log d)
Kronecker (ours)   | O(d log d) | O(log d) | O(Nd log^2 d)

Table 1. Computational and space costs. d: data dimensionality, N: number of training samples. The space and time complexities of the Kronecker projection are based on Aj ∈ R^{de×de}, ∀j, where de is a small constant.

1.1. Our Contributions

In this work, we propose a novel method to construct a family of orthogonal matrices by using the Kronecker product of a series of small orthogonal matrices. Formally, the Kronecker projection matrix is defined as

R = A1 ⊗ A2 ⊗ · · · ⊗ AM,

where Aj, j = 1, ..., M are small orthogonal matrices. We term them the element matrices. Such Kronecker product matrices have the following unique advantages: 1) They satisfy the orthogonality constraint and therefore preserve Euclidean distance in the original space; 2) Similar to Hadamard and circulant matrices, there exists a fast algorithm to compute the Kronecker projection with a time complexity of O(d log d); 3) By changing the sizes of the small orthogonal matrices, the resulting matrix has a varying number of parameters (degrees of freedom), making it easier to control the performance-speed trade-off; 4) The space complexity is O(log d), in comparison to O(d) for most other structured matrices.
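To make the construction concrete, here is a small hedged NumPy sketch (sizes chosen for illustration, not taken from the paper's code) that builds R from random 2x2 orthogonal element matrices and checks the orthogonality claim. Forming the full matrix here is only for verification; the fast algorithm of Section 3 never materializes R.

```python
import numpy as np
from functools import reduce

# Illustrative sketch of R = A1 ⊗ A2 ⊗ ... ⊗ AM built from random 2x2
# orthogonal element matrices (QR factorization of Gaussian matrices).
rng = np.random.default_rng(0)
M = 10                                                   # d = 2^10 = 1024
As = [np.linalg.qr(rng.standard_normal((2, 2)))[0] for _ in range(M)]

R = reduce(np.kron, As)                                  # full 1024 x 1024 matrix, for checking only
print(R.shape)                                           # (1024, 1024)
print(np.allclose(R.T @ R, np.eye(1024)))                # True: the Kronecker product preserves orthogonality
print(sum(A.size for A in As))                           # 40 stored parameters, versus 1024^2 unstructured
```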
We study the Kronecker projection in two application settings: binary embedding and quantization (Section 2). We propose a randomized version (Section 4) and a data-dependent learned version (Section 5) of such matrices. We conduct extensive image retrieval experiments on the ImageNet-32768, ImageNet-16384, and Flickr-16384 datasets (Section 6). The results show that with a fixed number of bits, the method needs much less space and time than the state-of-the-art methods to achieve similar or better performance.

2. Background

We begin by reviewing the two settings where the fast orthogonal projection based on the Kronecker product is applied: binary embedding and quantization.

2.1. Binary Embedding

Binary embedding methods map original vectors into k-bit binary vectors such that h(x) ∈ {+1, −1}^k. Since datapoints are stored as binary codes, the storage cost is reduced significantly even when k = O(d). The approximate nearest neighbors are retrieved using the Hamming distance in the binary code space, which can be computed very efficiently using table lookup, or the POPCNT instruction on modern computer architectures.

Locality Sensitive Hashing (LSH) is a popular method for generating binary codes that preserve cosine distance [3, 27], and it typically uses a randomized projection R in (1) to generate binary codes. However, many works have shown the advantages of learning data-dependent binary codes by optimizing the projection matrix R in (1) instead of using the randomized ones [20, 25, 36, 24, 8]. Specifically, Iterative Quantization (ITQ) [10] showed that by using a PCA projection followed by a learned orthogonal projection, the resulting binary embedding outperforms non-orthogonal or randomized orthogonal projections in retrieval experiments. The projection is learned by alternating between projecting datapoints and solving for projections via SVD. However, for high-dimensional features, this becomes infeasible unless one radically reduces the dimensionality, which hurts performance. We found that the projections learned with the Kronecker product have similar performance as ITQ, while being substantially more efficient.
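The following hedged toy sketch (NumPy only, illustrative sizes) shows the table-lookup flavor of the Hamming distance computation mentioned above; production code would typically use the POPCNT instruction directly on packed words.

```python
import numpy as np

# Toy sketch of Hamming-distance search over packed binary codes, using a
# precomputed 8-bit popcount table (the table-lookup counterpart of POPCNT).
rng = np.random.default_rng(0)
k, n = 128, 1000
db_codes = np.sign(rng.standard_normal((n, k)))        # database codes in {+1, -1}^k
q_code = np.sign(rng.standard_normal(k))               # query code

packed_db = np.packbits(db_codes > 0, axis=1)          # n x (k/8) bytes
packed_q = np.packbits(q_code > 0)

popcount = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)
hamming = popcount[np.bitwise_xor(packed_db, packed_q)].sum(axis=1)

nearest = np.argsort(hamming)[:10]                     # indices of the 10 nearest codes
```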
2.2. Quantization

Quantization methods represent datapoints via a set of quantizers, which are typically obtained by vector quantization algorithms such as k-means. To search for the nearest neighbors of a given query q, its Euclidean distances to all datapoints in the database are computed, and these are approximated by vector-to-quantizer distances. Furthermore, when the data is high-dimensional, quantization is often carried out in subspaces independently. A commonly used subspace decomposition is obtained simply by chunking the vectors, which leads to Product Quantization (PQ) [16, 7]. Formally, the distance between the query vector q and a database point x is given as:

||q − x|| ≈ sqrt( Σ_{i=1}^m ||q^(i) − μi(x^(i))||^2 ),

where m is the total number of subspaces, x^(i) and q^(i) are the subvectors, and μi(x^(i)) is the quantization function on subspace i. Because of its asymmetric nature, only the database points are quantized, not the query vector. Quantization methods have been shown to outperform LSH-based methods in many cases [16]. However, for quantization methods to work well, it is desirable that different subspaces have similar variance for the given data. One way to achieve that is by applying an orthogonal transformation R to the data. Thus,

||q − x|| = ||Rq − Rx|| ≈ sqrt( Σ_{i=1}^m ||(Rq)^(i) − μi((Rx)^(i))||^2 ). (2)

Since the projection matrix R is orthogonal, it also preserves the Euclidean distance. Norouzi and Fleet [26] proposed Cartesian k-means (ck-means), where they showed that instead of using a random projection matrix, it can be learned from the given data, leading to improved retrieval results. However, the projection operation can be time consuming in high-dimensional spaces.
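A minimal NumPy sketch of the subspace quantization and asymmetric distance above; for brevity it uses random sub-centers rather than k-means codebooks, and all sizes and variable names are illustrative assumptions.

```python
import numpy as np

# Toy product-quantization sketch: encode database points by their nearest
# sub-center in each subspace, then approximate ||q - x|| with per-subspace
# lookup tables (asymmetric: the query itself is never quantized).
rng = np.random.default_rng(0)
d, m, h, n = 64, 8, 16, 1000            # dim, subspaces, sub-centers, database size
ds = d // m                             # dimension of each subspace
X = rng.standard_normal((n, d))         # database points
q = rng.standard_normal(d)              # query
C = rng.standard_normal((m, h, ds))     # C[i]: h random sub-centers for subspace i

codes = np.empty((n, m), dtype=np.int64)
for i in range(m):
    Xi = X[:, i * ds:(i + 1) * ds]
    codes[:, i] = ((Xi[:, None, :] - C[i][None, :, :]) ** 2).sum(-1).argmin(1)

# Precompute query-to-sub-center distances, then sum table lookups over subspaces.
tables = np.stack([((q[i * ds:(i + 1) * ds] - C[i]) ** 2).sum(-1) for i in range(m)])
approx_dist = np.sqrt(tables[np.arange(m), codes].sum(axis=1))   # one value per database point
```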
In summary, for both binary embedding and quantization, a fast projection that is both orthogonal and learnable is needed. This motivates us to use the Kronecker product to design such projections, as described in the following sections.

3. Kronecker Product and Projection

We start by introducing the Kronecker product and its properties [33]. Let A1 ∈ R^{k1×d1} and A2 ∈ R^{k2×d2}. The Kronecker product of A1 and A2 is A1 ⊗ A2 ∈ R^{k1k2×d1d2}, defined as

A1 ⊗ A2 = [ a1(1,1)·A2    · · ·   a1(1,d1)·A2
            a1(2,1)·A2    · · ·   a1(2,d1)·A2
               ...         ...       ...
            a1(k1,1)·A2   · · ·   a1(k1,d1)·A2 ],

where a1(i, j) is the element of the i-th row and j-th column of A1. The Kronecker product is also known as the tensor product or direct product. We introduce two operations: mat(x, a, b) reshapes a d-dimensional vector into an a × b matrix (ab = d), and vec(·) forms a vector by column-wise stacking of a matrix, so that vec(mat(x, a, b)) = x.

We state two properties of the Kronecker product which will be used later in the paper:

• (A1 ⊗ A2)x = vec(A2 mat(x, d2, d1) A1^T).
• The Kronecker product preserves orthogonality. That is, if A1 and A2 are both orthogonal, A1 ⊗ A2 is also orthogonal.

We define the Kronecker projection matrix R ∈ R^{k×d} as the Kronecker product of several element matrices:

R = A1 ⊗ ... ⊗ Aj ⊗ ... ⊗ AM = ⊗_{j=1}^M Aj,

where Aj ∈ R^{kj×dj} with Π_{j=1}^M kj = k and Π_{j=1}^M dj = d.

One advantage of forming a large matrix in this way is that the Kronecker projection can be computed with a fast algorithm. In order to simplify the discussion, we assume that the matrix R is square, i.e., k = d, and that all the element matrices are also square with the same order de. We use floating point operations (FLOPs) to give an accurate estimate of the computational cost of different methods [13].

Let the FLOPs needed to compute the Kronecker projection of a d-dimensional vector, with element matrices of order de, be f(d, de). According to the property of the Kronecker product,

Rx = (⊗_{j=1}^M Aj)x = vec((⊗_{j=2}^M Aj) mat(x, d/de, de) A1^T).

Performing mat(x, d/de, de) A1^T needs d(2de − 1) FLOPs (d·de multiplications and d·de − d additions). After that, computing (⊗_{j=2}^M Aj) mat(x, d/de, de) A1^T turns out to be de smaller-scale problems, each computing a Kronecker projection with feature dimension d/de and element matrices of order de. Therefore,

f(d, de) = d(2de − 1) + de f(d/de, de).

Based on the above recursive relation, the FLOPs for performing the Kronecker projection of a d-dimensional vector is d(2de − 1) log_de d.

As a special case, when all the element matrices have order 2 (i.e., de = 2), the FLOPs count is 3d log2 d. When R is composed of a single square matrix of order d, the Kronecker projection becomes the unstructured projection. And when M = 2, the proposed method is exactly the bilinear projection [8]. The unstructured projection (de = d) requires 2d^2 − d FLOPs, and the bilinear projection requires at least 2d(2√d − 1) FLOPs, since de = √d. The circulant projection needs one d-dim real-to-complex FFT, one d-dim complex-to-real IFFT, and one d/2-dim complex multiplication [37]. The cost is then 4d log2 d FLOPs for large d, based on the split-radix implementation. Consequently, the Kronecker projection with small de (such as 2) has the lowest computational cost.

Another appealing property of the Kronecker projection is the flexibility of its structure: by controlling the sizes of Aj, j = 1, ..., M, one can easily balance the number of parameters (and therefore the capacity) of the model and the computational cost. There are log_de d element matrices, each with de^2 parameters. The number of parameters in the Kronecker projection is de^2 log_de d, which ranges from d^2 (when de = d) to 4 log2 d (when de = 2).¹

¹We assume storing all de^2 parameters of an element matrix. Note that an orthonormal matrix has only de(de − 1)/2 degrees of freedom, so the number of parameters can be further reduced by at least 50%.

Although we have only discussed the case when R and all the element matrices are square, the analysis can be easily extended to the non-square cases. In practice, the sizes of the element matrices can be chosen by factorizing d and k. When d or k cannot be factorized as a product of small numbers: for the input feature, one can change the dimension by subsampling or padding zeros; for the output, one can always use a longer code and then subsample. In Section 4 and Section 5, we discuss the generation of the Kronecker projection. First we assume the projection matrix R to be square. The non-square case will be discussed in Section 5.4.
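The recursion above translates directly into code. Below is a hedged NumPy sketch of the fast projection (not the authors' released implementation), checked against the explicitly formed Kronecker matrix on a small example.

```python
import numpy as np
from functools import reduce

def kron_project(As, x):
    """Apply R = A1 ⊗ ... ⊗ AM to x without forming R, using the identity
    (A1 ⊗ Arest) x = vec(Arest · mat(x, d/d1, d1) · A1^T) recursively."""
    if len(As) == 1:
        return As[0] @ x
    A1 = As[0]
    d1 = A1.shape[1]
    X = x.reshape((x.size // d1, d1), order="F")      # mat(x, d/d1, d1), column-major
    Y = X @ A1.T                                      # (d/d1) x k1
    cols = [kron_project(As[1:], Y[:, j]) for j in range(Y.shape[1])]
    return np.column_stack(cols).flatten(order="F")   # vec(Arest @ Y)

# Sanity check against the explicit Kronecker matrix on a small example.
rng = np.random.default_rng(0)
As = [np.linalg.qr(rng.standard_normal((2, 2)))[0] for _ in range(4)]   # 2x2 elements, d = 16
x = rng.standard_normal(16)
R = reduce(np.kron, As)
print(np.allclose(R @ x, kron_project(As, x)))        # True
```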
4. Randomized Kronecker Projection

Similar to the unstructured projection, circulant projection, bilinear projection, etc., the Kronecker projection can be generated randomly. Generating an unstructured orthogonal matrix of order d has time complexity O(d^3). Therefore it is not practical for high-dimensional data.

For the Kronecker projection, one only needs to generate M (small) element orthogonal matrices, which is done by generating small random Gaussian matrices and then performing QR factorization. If the element matrices are of size 2 × 2, the time complexity of generating a randomized Kronecker projection of order d is only O(log d).

To apply the randomized Kronecker projection in binary embedding and quantization, we replace the unstructured projection matrix (R in (1) and (2)) with the randomized Kronecker projection matrix.

5. Learning Kronecker Projection

The randomized projection is simple, but it does not utilize the data distribution. Similar to previous works [26, 10, 8, 37], we provide an efficient algorithm to optimize the Kronecker projection parameters. We first introduce the optimization problems in binary embedding (Section 5.1) and quantization (Section 5.2), and then show that both can be formulated as solving an orthogonal procrustes problem for each element matrix. We term such a problem the Kronecker procrustes, and provide our solution in Section 5.3. We assume that the training data is given as X = [x1, x2, ..., xN] ∈ R^{d×N}. Our analysis in Section 5.1 to Section 5.3 is based on the assumption that k = d. Section 5.4 extends the solution to the k ≠ d cases.

5.1. Optimized Binary Embedding

We follow [10, 8] to minimize the binarization loss for binary embedding. The optimization problem can be written as,

argmin_{B,R} ||B − RX||_F^2, s.t. RR^T = I, (3)

where the binary matrix B = [b1, b2, ..., bN] ∈ {−1, 1}^{d×N}, and bi is the binary code of xi, i.e. bi = sign(Rxi). Different from [10, 8], we impose the Kronecker structure on R.

Gong et al. [10] propose to find a local solution of (3) by alternating minimization. When R is fixed, B is computed by a straightforward binarization based on its definition. When B is fixed, and k = d (we will discuss the k < d case in Section 5.4), R is found by solving the orthogonal procrustes problem:

argmin_R ||B − RX||_F^2, s.t. R^T R = I.
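For reference, here is a hedged sketch of the unstructured alternating scheme of [10] that the Kronecker variant builds on (the structured per-element update itself is given in Section 5.3); the R step is the classical SVD solution of the orthogonal procrustes problem. Function and variable names are illustrative.

```python
import numpy as np

def learn_unstructured_projection(X, n_iter=20, seed=0):
    """Alternating minimization of ||B - RX||_F^2 over binary B and an
    unstructured orthogonal R (ITQ-style, eq. (3) without the Kronecker
    structure). X is d x N."""
    d = X.shape[0]
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal initialization
    for _ in range(n_iter):
        B = np.sign(R @ X)                             # fix R, update the codes
        B[B == 0] = 1
        U, _, Vt = np.linalg.svd(B @ X.T)              # fix B, update R:
        R = U @ Vt                                     # orthogonal procrustes solution
    return R
```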
5.2. Optimized Quantization

For quantization, we consider the Cartesian k-means (ck-means) [26] method, which is the state-of-the-art. Note that Product Quantization (PQ) [16] is a special case of ck-means where the orthogonal projection is randomized instead of optimized.

For ck-means, the input sample x is split into m subspaces, x = [x^(1); x^(2); ...; x^(m)], and each subspace is quantized to h sub-centers. Here, we only consider the case when all the sub-center sets have the same fixed cardinality. Our method can be easily generalized in a way similar to [26] for varying cardinalities.

Let p = [p^(1); p^(2); ...; p^(m)], where p^(j) ∈ {0, 1}^h and ||p^(j)||_1 = 1. In other words, p^(j) is an indicator of which sub-center x^(j) is closest to. Let C^(j) ∈ R^{(d/m)×h} be the j-th sub-center matrix and C ∈ R^{d×mh} be the center matrix formed by the diagonal-wise concatenation of all the sub-centers:

C = diag(C^(1), ..., C^(m)).

In ck-means, the center matrix C is parameterized by an orthogonal matrix R ∈ R^{d×d} and a block diagonal matrix D ∈ R^{d×mh}. The optimization problem of ck-means can be written as,

argmin_{R,P,D} ||X − RDP||_F^2, s.t. R^T R = I.

We impose the Kronecker structure on the orthogonal matrix R with a similar alternating procedure. When R is fixed, updating D and P is equivalent to vector quantization in each subspace with k-means. This is efficient because the number of centers is usually small, since the number of clusters for each subspace is always set to a small number (h = 256 in [16, 26]). Updating R with D and P fixed is also an orthogonal procrustes problem:

argmin_R ||X − RDP||_F^2, s.t. R^T R = I.

5.3. Kronecker Procrustes

For both binary embedding and quantization, we need to solve an orthogonal procrustes problem with Kronecker structure, which we call the Kronecker procrustes:

argmin_R ||RX − B||_F^2, (4)
s.t. R = A1 ⊗ · · · ⊗ AM, Aj^T Aj = I, j = 1, ..., M.

The above optimization is non-convex and quite challenging. We note that there exists a closed-form solution for the unstructured orthogonal procrustes problem, which requires computing the SVD of XB^T and takes O(d^3) time. For the Kronecker procrustes, there is no closed-form solution. We develop an efficient iterative method which updates each element matrix sequentially to find a local solution. We start by rewriting ||RX − B||_F^2 as,

||(⊗_{j=1}^M Aj)X − B||_F^2
= tr(((⊗_{j=1}^M Aj)X − B)^T ((⊗_{j=1}^M Aj)X − B)) (5)
= ||X||_F^2 − 2 tr((⊗_{j=1}^M Aj)XB^T) + ||B||_F^2.

The second equality holds because the Kronecker product preserves orthogonality. Thus, we need to maximize tr((⊗_{j=1}^M Aj)XB^T). Using the cyclic property of the trace, it can be expressed as,

tr(B^T (⊗_{j=1}^M Aj)X) = Σ_{i=1}^N bi^T (⊗_{j=1}^M Aj)xi,

where bi and xi are the i-th columns of matrix B and matrix X respectively. We solve this problem by updating one element matrix at a time, while keeping all others fixed. Without loss of generality, consider updating Aj:

argmax_{Aj} Σ_{i=1}^N bi^T (Apre ⊗ Aj ⊗ Anext)xi, s.t. Aj^T Aj = I, (6)

where Apre = 1 ⊗ (⊗_{i=1}^{j−1} Ai) and Anext = (⊗_{i=j+1}^M Ai) ⊗ 1. Let the dimensions of Apre, Anext and Aj be kpre × dpre, knext × dnext and kj × dj, respectively. Obviously, dpre·dj·dnext = d and kpre·kj·knext = k.

According to the property of the Kronecker product, the objective function of Aj in (6) can be written as,

Σ_{i=1}^N bi^T vec((Aj ⊗ Anext) mat(xi, dj·dnext, dpre) Apre^T). (7)

Let Gi = mat(xi, dj·dnext, dpre) Apre^T and Fi = mat(bi, kj·knext, kpre). Then, (7) can be written as,

Σ_{i=1}^N tr(Fi^T (Aj ⊗ Anext) Gi). (8)

The above is the same as the bilinear optimization problem [8], which can be solved via polar decomposition, needing an SVD on a matrix of size dj × kj, with computational complexity min(O(dj^2 kj), O(kj^2 dj)).

When updating one element matrix, the computational cost comes from three different sources: S1. Calculating the Kronecker projection of the data with the fixed element matrices. S2. Calculating the product of the projected data and the codes. S3. Performing SVD to get the optimal element matrix. When the element matrices are large, the optimization bottleneck is the SVD. When the element matrices are small, say 2 × 2, performing the SVD can be seen as roughly a constant-time operation. The main computational cost then comes from S1 (O(Nd log d)) and S2 (O(Nd)). Since there are a total of log_de d element matrices, the computational complexity of the whole optimization is O(Nd log^2 d).

In the optimization procedure, we use the randomized Kronecker projection as initialization. In practice, we find that, for both binary embedding (Section 5.1) and quantization (Section 5.2), the objective decreases quickly with the proposed algorithm. Similar to [8, 37], a satisfactory solution can be found within a few iterations.
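Below is a brute-force sketch of the per-element update in (6) for small problems. It is not the paper's efficient reshaping in (7)-(8), which never forms a Kronecker matrix: since the objective is linear in Aj, its coefficient matrix is accumulated entry by entry for clarity, and the orthogonal maximizer is read off from an SVD.

```python
import numpy as np
from functools import reduce

def update_element(As, j, X, B):
    """One coordinate-descent step of the Kronecker procrustes (sketch):
    maximize tr(B^T (A1 ⊗ ... ⊗ AM) X) over orthogonal As[j], others fixed.
    Brute-force coefficient computation, only suitable for tiny examples."""
    kj, dj = As[j].shape
    M = np.zeros((kj, dj))
    for p in range(kj):
        for q in range(dj):
            E = np.zeros((kj, dj))
            E[p, q] = 1.0
            R_pq = reduce(np.kron, As[:j] + [E] + As[j + 1:])
            M[p, q] = np.sum(B * (R_pq @ X))          # linear coefficient of As[j][p, q]
    U, _, Vt = np.linalg.svd(M)                       # objective equals tr(As[j]^T M)
    return U @ Vt                                     # orthogonal maximizer

# One sweep over all element matrices (random toy data, 2x2 elements, d = 16).
rng = np.random.default_rng(0)
As = [np.linalg.qr(rng.standard_normal((2, 2)))[0] for _ in range(4)]
X = rng.standard_normal((16, 100))
B = np.sign(reduce(np.kron, As) @ X)
for j in range(len(As)):
    As[j] = update_element(As, j, X, B)
```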

5.4. Learning with k ≠ d

We have presented our algorithm in the case of k = d. The projection matrix R with k ≠ d can be formed by the Kronecker product of non-square row/column orthogonal element matrices. It is easy to show that the Kronecker product also preserves row/column orthogonality. When k > d, the orthogonal procrustes optimization problem can be solved similarly as for k = d [30]. When k < d, R^T R ≠ I. Hence, the second equality in (5) does not hold, and ||RX − B||_F^2 becomes

tr(X^T R^T RX) − 2 tr(RXB^T) + ||B||_F^2.

We follow [9] and relax the problem by assuming that tr(X^T R^T RX) is independent of R, the same as in the k ≥ d case.

6. Experiments

6.1. Datasets and Methods

We evaluate our proposed Kronecker projection in approximate nearest neighbor search experiments on three high-dimensional datasets:

ImageNet-16384 contains 100k images sampled from ImageNet, with each image represented by a 16,384-dimensional VLAD feature vector [17].

ImageNet-32768 is constructed in the same way, except that each image is represented by a 32,768-dimensional VLAD feature vector.

Flickr-16384 contains 100k images sampled from a noisy internet image collection, represented by 16,384-dimensional VLAD feature vectors.

Following [8, 26, 37], we use 9,500 samples in training and 500 samples as queries for all three datasets. The features are ℓ2 normalized.

We use the following baseline methods in the experiments: LSH [27], ITQ [10], BBE (bilinear binary embedding) [8], and CBE (circulant binary embedding) [37]. We call the proposed Kronecker projection based binary embedding Kronecker Binary Embedding (KBE), and KBE-de represents KBE using element matrices of order de. We use the suffixes "-opt" and "-rand-orth" to denote the optimized and randomized versions of each projection type. Following [8], BBE-rand does not impose orthogonality.

For quantization methods, we use PQ [16] and ck-means [26] as baselines. KPQ is PQ with randomized Kronecker projection (with element matrices of order 2). Kck-means is ck-means with optimized Kronecker projection (with element matrices of order 2). In the approximate nearest neighbor retrieval experiments, for each query, we use its 10 nearest neighbors based on the ℓ2 distance as the ground truth, and use recall as the evaluation metric. All the results are averaged over 10 runs.

6.2. Computational and Space Costs

Table 2 summarizes the required computations and space for different types of projections.² Table 2 also shows the real-world execution time (ms) and space cost (MB). Based on our implementation, the real-world costs approximately match the theoretical estimates. Figure 1 further shows the required computations and memory as a function of d. The Kronecker projection provides significant speedup and space savings compared to the other methods. In addition, one advantage of the Kronecker projection is its flexibility in trading off model complexity against computation time. In the experiments, we used Kronecker projections with element matrices of order 2 and 4 for binary embedding, and of order 2 for quantization, which gave satisfactory performance.³

²Kronecker-de represents the Kronecker projection with element matrices of order de. The computational and space costs (as a function of d) are computed based on the assumption that d is large and k = d.

³The use of a few non-square matrices is sometimes required to match the input and output dimensions correctly. In the following experiments, for KBE-2, we use 2 × 2 and 2 × 4 element matrices. For KBE-4, we use 4 × 4, 4 × 8 and 2 × 2 element matrices.

Figure 1. Computational cost (measured by FLOPs count) and space cost for different types of projections (Unstructured, Bilinear, Circulant, Kronecker-2/4/8/16) as a function of the input dimension: (a) storage; (b) time. Kronecker-de means the Kronecker projection formed by element matrices of order de.
d        | Unstructured      | Bilinear         | Circulant          | Kronecker-2
         | Time     Space    | Time     Space   | Time       Space   | Time        Space
(count)  | 2d^2     d^2      | 4d^1.5   2d      | 4d log2 d  d       | 3d log2 d   4 log2 d
2^8      | 2.3e-1   2.5e-1   | 3.1e-2   2.0e-3  | 1.6e-2     9.8e-4  | 1.1e-2      1.2e-4
2^10     | 3.7      4.0      | 2.3e-1   7.8e-3  | 7.4e-2     3.9e-3  | 5.6e-2      1.5e-4
2^12     | 6.0e1    6.4e1    | 1.9      3.1e-2  | 3.8e-1     1.6e-2  | 3.0e-1      1.8e-4
2^14     | 9.5e2    1.0e3    | 1.5e1    1.2e-1  | 1.6        6.3e-2  | 1.2         2.1e-4
2^16     | 1.5e4    1.6e4    | 1.3e2    5.0e-1  | 8.4        2.5e-1  | 6.4         2.4e-4

Table 2. Execution time (ms) and space cost (MB, single precision) of different types of projections; the "(count)" row gives the FLOP and parameter counts. The result is based on a C implementation and a single core on a 2.6 GHz Intel CPU.
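As a quick sanity check of the trends in Table 2 (assuming single-precision storage, 4 bytes per parameter), take the last row, d = 2^16:

unstructured: 2d^2 ≈ 8.6e9 FLOPs and d^2 × 4 bytes ≈ 1.6e4 MB;
Kronecker-2: 3d log2 d ≈ 3.1e6 FLOPs and 4 log2 d × 4 bytes = 256 bytes ≈ 2.4e-4 MB,

i.e. roughly a 2,700x reduction in FLOPs, which is consistent with the measured 1.5e4 ms versus 6.4 ms.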

6.3. Approximate Nearest Neighbor Search

Low-dimensional data. We first test the proposed methods on a low-dimensional dataset, ImageNet-256, which is constructed by randomly sampling 256 dimensions of ImageNet-16384. Such a study is required as many popular methods such as ITQ and ck-means are not practical for very high-dimensional data. The results are shown in Figure 2. Note that the quantization methods outperform the binary embedding methods, but they come with the extra cost of center-based distance computation, which is more expensive than computing the Hamming distance [26].

Among all the quantization methods, ck-means outperforms PQ since it has an optimized orthogonal matrix. Replacing the orthogonal matrix by the randomized Kronecker projection (KPQ) or the optimized Kronecker projection (Kck-means) leads to very competitive performance.

Among all the binary embedding methods, ITQ outperforms all the others, since it uses an optimized unstructured orthogonal matrix. BBE-opt, CBE-opt, KBE-2 and KBE-4 give similar performance, and they outperform LSH, CBE-rand, and BBE-rand. A zoomed-in view of ITQ, BBE-opt, CBE-opt and KBE is given in Figure 2(b). It shows that the performance of KBE can be improved by using more parameters (ITQ, aka KBE-256 > BBE, aka KBE-16 > KBE-4 > KBE-2). Yet, with only 48 and 32 parameters (KBE-4, KBE-2), we already have very competitive performance compared to ITQ (65,536 parameters) and BBE (512 parameters). The proposed optimization algorithm further improves the recall (KBE-opt-4 > KBE-rand-orth-4, KBE-opt-2 > KBE-rand-orth-2).

Figure 2. Retrieval performance on the low-dimensional dataset ImageNet-256 (recall vs. number of retrieved points): (a) retrieval performance; (b) zoomed-in view of Figure 2(a).

Figure 3. Number of FLOPs for projecting d = 16,384-dimensional data to create different numbers of bits.

High-dimensional data. Figure 4 shows the retrieval performance with a fixed number of bits on the three high-dimensional datasets. One may notice that we did not compare with ITQ, ITQ with randomized unstructured orthogonal projection, ck-means, or ck-means with randomized unstructured orthogonal projection. The reason is that both the optimization and the generation of a randomized unstructured orthogonal matrix are prohibitively expensive for high-dimensional data (detailed in Section 4 and Section 5). Instead, we can only do ck-means without rotation (which is exactly PQ), and ITQ without the orthogonality constraint and optimization (which is exactly LSH).

One interesting finding is that in many cases PQ does not work well in such high-dimensional settings (also shown in [8]). The reason could be that the selection of the subspaces is critical in PQ [17, 16]. Therefore [17] suggests using PCA followed by unstructured randomized orthogonal projection as a pre-processing step. However, such operations are impractical for very high-dimensional data. The proposed Kronecker projection makes orthogonal projections possible for high-dimensional data: both KPQ and Kck-means achieve very impressive performance.

For binary embedding, the results of KBE-opt-2, KBE-opt-4, KBE-rand-orth-2 and KBE-rand-orth-4 are competitive with CBE, which is the state-of-the-art. Besides the competitive performance, the proposed method has the advantage of reduced space and computational cost. This is most obvious in the cases when k < d. Due to the nature of the method, the space and computational costs of CBE for k < d are identical to those for k = d [37]. On the contrary, both the space and computational costs of KBE can be reduced. Figure 3 compares the computational cost of CBE and KBE with different structures and different numbers of bits.

[Figure 4 plots recall vs. number of retrieved points for each method (LSH, BBE-rand, BBE-opt, CBE-rand, CBE-opt, KBE-rand-orth-2/4, KBE-opt-2/4, PQ, KPQ, Kck-means); the panels use #bits = 8,192 / 16,384 / 32,768 for ImageNet-32768 and #bits = 4,096 / 8,192 / 16,384 for ImageNet-16384 and Flickr-16384.]
Figure 4. Retrieval performance with fixed number of bits on ImageNet-32768 (first row), ImageNet-16384 (second row), and Flickr-16384 (third row). “#bits” is the number of bits used by each method.

For all settings, we found that the randomized Kronecker projection already gives competitive performance (this is especially true for high-dimensional data). It means that the orthogonality of the projection plays an important role in guaranteeing the retrieval performance of ANN methods. And the proposed optimization algorithm can further improve the recall.

7. Conclusion

We proposed a special structured matrix to speed up orthogonal linear projections. The proposed method has O(d log d) computational complexity and O(log d) space complexity, dramatically lower than those of the unstructured projection (O(d^2)). The method is also very flexible in trading off the number of parameters and the computational cost. We successfully applied the Kronecker projection to binary embedding and quantization tasks for large-scale approximate image retrieval. We also found that the orthogonality of the projection is important to the performance of ANN techniques. Comprehensive experiments showed that, with the same number of bits, the proposed method can achieve competitive or even better performance with much lower space and computational cost. The implementation of the method is available at https://github.com/spongezhang/Kronecker_Projection.git.

References

[1] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302-322, 2009.
[2] W. U. Bajwa, J. D. Haupt, G. M. Raz, S. J. Wright, and R. D. Nowak. Toeplitz-structured compressed sensing matrices. In IEEE/SP 14th Workshop on Statistical Signal Processing, 2007.
[3] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In ACM Symposium on Theory of Computing, 2002.
[4] Y. Cheng, F. X. Yu, R. Feris, S. Kumar, A. Choudhary, and S.-F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.
[5] A. Dasgupta, R. Kumar, and T. Sarlós. Fast locality-sensitive hashing. In SIGKDD, 2011.
[6] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.
[7] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer Science & Business Media, 1992.
[8] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In CVPR, 2013.
[9] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik. Angular quantization-based binary codes for fast similarity search. In NIPS, pages 1196-1204, 2012.
[10] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
[11] J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak. Toeplitz compressed sensing matrices with applications to sparse channel estimation. IEEE Transactions on Information Theory, 2010.
[12] A. Hinrichs and J. Vybíral. Johnson-Lindenstrauss lemma for circulant matrices. Random Structures & Algorithms, 39(3):391-398, 2011.
[13] R. Hunger. Floating point operations in matrix-vector calculus. Munich University of Technology, Inst. for Circuit Theory and Signal Processing, 2005.
[14] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Info. Theory, 2013.
[15] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
[16] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE TPAMI, 2011.
[17] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[18] H. Jégou, T. Furon, and J. Fuchs. Anti-sparse coding for approximate nearest neighbor search. In ICASSP, 2012.
[19] J. Ji, S. Yan, J. Li, G. Gao, Q. Tian, and B. Zhang. Batch-orthogonal locality-sensitive hashing for angular similarity. IEEE TPAMI, 2014.
[20] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
[21] Q. Le, T. Sarlós, and A. Smola. Fastfood: Approximating kernel expansions in loglinear time. In ICML, 2013.
[22] P. Li, A. Shrivastava, J. Moore, and A. C. König. Hashing algorithms for large-scale learning. In NIPS, 2011.
[23] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, 2012.
[24] M. Norouzi, D. Fleet, and R. Salakhutdinov. Hamming distance metric learning. In NIPS, 2012.
[25] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, 2011.
[26] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR, 2013.
[27] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS, 2009.
[28] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[29] J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.
[30] P. H. Schönemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1-10, 1966.
[31] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua. LDAHash: Improved matching with smaller descriptors. IEEE TPAMI, 2012.
[32] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.
[33] C. F. Van Loan. The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics, 123(1):85-100, 2000.
[34] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE TPAMI, 2012.
[35] J. Wang, J. Wang, J. Song, X.-S. Xu, H. T. Shen, and S. Li. Optimized Cartesian k-means. IEEE TKDE, 27(1):180-192, 2015.
[36] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.
[37] F. X. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant binary embedding. In ICML, 2014.
[38] F. X. Yu, S. Kumar, H. Rowley, and S.-F. Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.