Fast Random Projections Using Lean Walsh Transforms
Abstract

We present a random projection matrix that is applicable to vectors $x \in \mathbb{R}^d$ in $O(d)$ operations if $d \ge k^{2+\delta'}$ for arbitrarily small $\delta'$ and $k = O(\log(n)/\varepsilon^2)$. The projection succeeds with probability $1 - 1/n$ and it preserves lengths up to distortion $\varepsilon$ for all vectors such that $\|x\|_\infty \le \|x\|_2\, k^{-(1+\delta)}$ for arbitrarily small $\delta$. Previous results are either not applicable in linear time or require a bound on $\|x\|_\infty$ that depends on $d$.

1 Introduction

We are interested in producing a random projection procedure such that applying it to a vector in $\mathbb{R}^d$ requires $O(d)$ operations. The idea is to first project an incoming vector $x \in \mathbb{R}^d$ into an intermediate dimension $\tilde d = d^{\alpha}$ ($0 < \alpha < 1$) and then use a known construction to project it further down to the minimal Johnson-Lindenstrauss (JL) dimension $k = O(\log(n)/\varepsilon^2)$. The mapping is composed of three matrices, $x \mapsto R A D_s x$. The first, $D_s$, is a random $\pm 1$ diagonal matrix. The second, $A$, is a dense matrix of size $\tilde d \times d$ termed the Lean Walsh matrix. The third, $R$, is a $k \times \tilde d$ Fast Johnson Lindenstrauss matrix. As $R$, we use the fast JL construction by Ailon and Liberty [2]. Their matrix $R$ can be applied to $A D_s x$ in $O(\tilde d \log(k))$ operations as long as $\tilde d \ge k^{2+\delta''}$ for arbitrarily small $\delta''$. If $d > k^{(2+\delta'')/\alpha}$ this condition is met.¹ Therefore, for such $d$, applying $R$ to $A D_s x$ is length preserving w.h.p. and accomplishable in $O(\tilde d \log(k)) = O(d^{\alpha}\log(k)) = O(d)$ operations.

The main purpose of this manuscript is to produce a dense $\tilde d \times d$ matrix $A$ such that $x \mapsto A D_s x$ is computable in $O(d)$ operations and preserves length w.h.p. for vectors such that $\|x\|_\infty \le \|x\|_2\, k^{-(1+\delta)}$ for arbitrarily small $\delta$.

¹ Later in the manuscript we see that $\alpha$ can be chosen arbitrarily close to 1.

2 Norm concentration

For a matrix to exhibit the property of a random projection, it is enough for it to preserve the length of any one vector with very high probability:

$$\Pr\bigl[\,\bigl|\,\|A D_s x\|^2 - 1\,\bigr| \ge \varepsilon\,\bigr] < 1/n \tag{1}$$

where $A$ is a column-normalized matrix, $D_s$ is a random diagonal matrix whose diagonal entries are $\pm 1$ w.p. $1/2$ each, $x$ is a general vector (w.l.o.g. $\|x\|_2 = 1$), and $n$ is chosen according to a desired success probability, usually polynomial in the number of projected vectors. Notice that we can replace the term $A D_s x$ with $A D_x s$, where $D_x$ is a diagonal matrix holding the values of $x$ on its diagonal, i.e. $D_x(i,i) = x(i)$, and $s$ is a vector of random $\pm 1$ entries. By denoting $M = A D_x$ we view the term $\|Ms\|$ as a convex function over the product space $\{1,-1\}^d$ from which the variable $s$ is chosen. In his book, Talagrand [4] describes a strong concentration result for convex Lipschitz bounded functions over probability product spaces.

Lemma 2.1 (Talagrand [4]). Consider a matrix $M$ and a random entry-wise i.i.d. $\pm 1$ vector $s$. Define the random variable $Y = \|Ms\|_2$. Denote by $\mu$ the median of $Y$ and by $\sigma = \|M\|_{2\to 2}$ the spectral norm of $M$. The following holds:

$$\Pr[\,|Y - \mu| > t\,] < 4e^{-t^2/8\sigma^2} \tag{2}$$

Lemma 2.1 asserts that $\|A D_x s\|$ distributes like a (sub-)Gaussian around its median, with standard deviation $2\sigma$. First, in order to have $\mathrm{E}[Y^2] = 1$ it is necessary and sufficient for the columns of $A$ to be normalized (or normalized in expectation). To estimate the median $\mu$ we substitute $t^2 \to t'$ and compute:

$$\mathrm{E}[(Y-\mu)^2] = \int_0^\infty \Pr[(Y-\mu)^2 > t']\,dt' \le \int_0^\infty 4e^{-t'/(8\sigma^2)}\,dt' = 32\sigma^2.$$

Furthermore, by Jensen, $(\mathrm{E}[Y])^2 \le \mathrm{E}[Y^2] = 1$. Hence $\mathrm{E}[(Y-\mu)^2] = \mathrm{E}[Y^2] - 2\mu\mathrm{E}[Y] + \mu^2 \ge 1 - 2\mu + \mu^2 = (1-\mu)^2$. Combining, $|1 - \mu| \le \sqrt{32}\,\sigma$.
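The identity $AD_sx = AD_xs$ and the two facts used above, namely that column normalization of $A$ gives $\mathrm{E}[Y^2]=1$ and that $Y$ concentrates around a median close to 1, are easy to check numerically. The following sketch is only an illustration of the argument; it assumes NumPy and uses an arbitrary column-normalized Gaussian matrix as a stand-in for $A$, not the Lean Walsh matrix defined later:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1024, 64

# Stand-in for A: an arbitrary dense matrix with unit-norm columns (NOT the Lean Walsh matrix).
A = rng.standard_normal((k, d))
A /= np.linalg.norm(A, axis=0)

# A fixed unit vector x (w.l.o.g. ||x||_2 = 1).
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# Y = ||A D_s x|| = ||A D_x s|| over random sign vectors s.
Y = np.array([np.linalg.norm(A @ (x * s))        # A D_x s  ==  A @ (x * s)
              for s in rng.choice([-1.0, 1.0], size=(2000, d))])

sigma = np.linalg.norm(A * x, ord=2)             # sigma = ||A D_x||_{2->2}
print("E[Y^2]  ~", (Y ** 2).mean())              # ~ 1, since A is column normalized
print("median  ~", np.median(Y))                 # |1 - median| <= sqrt(32) * sigma
print("sigma    =", sigma)
```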
By setting $\varepsilon = t + |1 - \mu|$, we get:

$$\Pr[\,|Y - 1| > \varepsilon\,] < 4e^{-\varepsilon^2/8\sigma^2} \tag{3}$$

If we set $k = 16\log(n)/\varepsilon^2$ (assuming $\log(n) > 6$) the requirement of equation 1 is met for $\sigma = k^{-1/2}$. We see that a condition on $\sigma = \|A D_x\|$ is sufficient for the projection to succeed w.h.p. This naturally defines $\chi$.

Definition 2.1. Let $\chi \subset \mathbb{R}^d$ be such that for all $x \in \chi$ we have $\|x\|_2 = 1$ and the spectral norm of $A D_x$ is at most $k^{-1/2}$:

$$\chi(A, \varepsilon, n) = \bigl\{\, x \in \mathbb{R}^d \;\bigm|\; \|x\|_2 = 1,\ \|A D_x\|_{2\to 2} \le k^{-1/2} \,\bigr\} \tag{4}$$

where $D_x$ is a diagonal matrix such that $D_x(i,i) = x(i)$ and $k = 16\log(n)/\varepsilon^2$.

Lemma 2.2. For any column-normalized matrix $A$ and a random $\pm 1$ diagonal matrix $D_s$, the following holds:

$$\forall x \in \chi(A, \varepsilon, n) \qquad \Pr\bigl[\,\bigl|\,\|A D_s x\|^2 - 1\,\bigr| \ge \varepsilon\,\bigr] \le 1/n \tag{5}$$

Proof. By the definition of $\chi$ (2.1) and substituting $\sigma$ into equation 3.

It is convenient to think about $\chi$ as the "good" set of vectors for which $A D_s$ is length preserving with high probability. Our goal is to characterize $\chi(A, \varepsilon, n)$ for a specific fixed matrix $A$, which we term Lean Walsh. It is described in the next section.

3 Lean Walsh transforms

The Lean Walsh Transform, similar to the Walsh Transform, is a recursive tensor product matrix defined by a constant seed matrix, $A_1$, and constructed recursively by using Kronecker tensor products, $A_{\ell} = A_1 \otimes A_{\ell-1}$.

Definition 3.1. $A_1$ is a Lean Walsh seed (or simply 'seed') if: i) $A_1$ is a rectangular, possibly complex, matrix $A_1 \in \mathbb{C}^{r\times c}$ such that $r < c$; ii) $A_1$ is absolute valued $r^{-1/2}$ entry-wise, i.e. $|A_1(i,j)| = r^{-1/2}$; iii) the rows of $A_1$ are orthogonal; iv) all inner products between its different columns are bounded, in absolute value, by a constant $\rho < 1$. $\rho$ is called the coherence of $A_1$.

Definition 3.2. $A_\ell$ is a Lean Walsh transform if for all $\ell' \le \ell$ we have $A_{\ell'} = A_1 \otimes A_{\ell'-1}$. Here $\otimes$ stands for the Kronecker product and $A_1$ is a seed as in Definition 3.1.

The following are examples of seed matrices:

$$A'_1 = \frac{1}{\sqrt{3}}\begin{pmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}, \qquad A''_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 & 1 \\ 1 & e^{2\pi i/3} & e^{4\pi i/3} \end{pmatrix} \tag{6}$$

These examples are a part of a large family of possible seed matrices. This family includes, amongst other constructions, sub-Hadamard matrices (like $A'_1$) and sub-Fourier matrices (like $A''_1$). A simple construction is given for possibly larger seeds:

Fact 3.1. Let $F^c$ be the $c \times c$ Discrete Fourier matrix, not normalized, so that $|F^c(i,j)| = 1$. Define $A_1$ to be the matrix consisting of the first $r = c - 1$ rows of $F^c$, normalized by $1/\sqrt{r}$. $A_1$ is a viable Lean Walsh seed with coherence $1/r$.

Proof. The fact that $|A_1(i,j)| = 1/\sqrt{r}$ and the fact that the rows of $A_1$ are orthogonal are trivial. Moreover, the inner product of two different columns of $A_1$ must be $\rho = 1/r$ in absolute value:

$$\sum_{i=1}^{c} \bigl(F^c(i,j_1)\bigr)^* F^c(i,j_2) = 0 \tag{7}$$

$$\frac{1}{r}\sum_{i=1}^{r} \bigl(F^c(i,j_1)\bigr)^* F^c(i,j_2) = -\frac{1}{r}\bigl(F^c(c,j_1)\bigr)^* F^c(c,j_2) \tag{8}$$

$$\bigl|\bigl\langle A_1^{(j_1)}, A_1^{(j_2)} \bigr\rangle\bigr| = \frac{1}{r} \tag{9}$$

From elementary properties of Kronecker products we characterize $A_\ell$ using the number of rows, $r$, the number of columns, $c$, and the coherence, $\rho$, of its seed $A_1$.

Fact 3.2. The following facts hold true for $A_\ell$: i) $A_\ell$ is of size $\tilde d \times d$ where, for $c^\ell = d$ and $r^\ell = \tilde d$, we have $\tilde d = d^{\alpha}$ with $\alpha = \log(r)/\log(c) < 1$; ii) for all $i$ and $j$, $A_\ell(i,j) \in \pm\tilde d^{-1/2}$, which means that $A_\ell$ is column normalized; iii) the rows of $A_\ell$ are orthogonal.
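To make the seed construction of Fact 3.1 and the Kronecker recursion of Definition 3.2 concrete, here is a small numerical sketch. It assumes NumPy; the helper names dft_seed and lean_walsh are ours, not the paper's, and the dense Kronecker product is built explicitly only because the dimensions are tiny:

```python
import numpy as np


def dft_seed(c):
    """First r = c - 1 rows of the unnormalized c x c DFT matrix, scaled by 1/sqrt(r) (Fact 3.1)."""
    r = c - 1
    F = np.exp(-2j * np.pi * np.outer(np.arange(c), np.arange(c)) / c)
    return F[:r, :] / np.sqrt(r)


def lean_walsh(seed, ell):
    """A_ell = A_1 (x) A_{ell-1} with A_1 = seed (Definition 3.2). Dense, for illustration only."""
    A = seed
    for _ in range(ell - 1):
        A = np.kron(seed, A)
    return A


seed = dft_seed(4)                                   # r = 3, c = 4
gram = np.abs(seed.conj().T @ seed)                  # |<column j1, column j2>| for all pairs
off_diag = gram[~np.eye(gram.shape[0], dtype=bool)]
print(np.allclose(off_diag, 1 / 3))                  # coherence rho = 1/r (Fact 3.1)

A = lean_walsh(seed, 3)                              # size r^3 x c^3 = 27 x 64
d_tilde, d = A.shape                                 # d_tilde = d^alpha, alpha = log(r)/log(c)
print(np.allclose(np.abs(A), d_tilde ** -0.5))       # entries of magnitude d_tilde^{-1/2}
print(np.allclose(np.linalg.norm(A, axis=0), 1.0))   # columns are normalized (Fact 3.2 ii)
print(np.allclose(A @ A.conj().T, (d / d_tilde) * np.eye(d_tilde)))   # rows are orthogonal (Fact 3.2 iii)
```

For a complex sub-Fourier seed the entries of $A_\ell$ have magnitude $\tilde d^{-1/2}$ rather than being literally $\pm\tilde d^{-1/2}$, but the column normalization in Fact 3.2 is unaffected.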
Fact 3.3. The time complexity of applying $A_\ell$ to any vector $z \in \mathbb{R}^d$ is $O(d)$.

Proof. Let $z = [z_1; \ldots; z_c]$ where the $z_i$ are sections of length $d/c$ of the vector $z$. Using the recursive decomposition of $A_\ell$, we compute $A_\ell z$ by first summing over the different $z_i$ according to the values of $A_1$ and then applying $A_{\ell-1}$ to each sum. Denoting by $T(d)$ the time needed to apply $A_\ell$ to $z \in \mathbb{R}^d$, we get $T(d) = rT(d/c) + O(d)$. Due to the Master Theorem, and the fact that $r < c$, $T(d) = O(d)$; more specifically, $T(d) \le dcr/(c-r)$.

For clarity, we demonstrate Fact 3.3 for the first example, $A'_1$:

$$A'_\ell z = A'_\ell \begin{pmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{pmatrix} = \frac{1}{\sqrt{3}} \begin{pmatrix} A'_{\ell-1}(z_1 + z_2 - z_3 - z_4) \\ A'_{\ell-1}(z_1 - z_2 + z_3 - z_4) \\ A'_{\ell-1}(z_1 - z_2 - z_3 + z_4) \end{pmatrix} \tag{10}$$

In what follows we characterize $\chi(A, \varepsilon, n)$ for a general Lean Walsh transform in terms of the parameters of its seed, $r$, $c$ and $\rho$. When the subscript is omitted, $A$ stands for the $A_\ell$ of the right size to be applied to $x$, i.e. $\ell = \log(d)/\log(c)$.
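The recursion $T(d) = rT(d/c) + O(d)$ in the proof of Fact 3.3 translates directly into code. The sketch below is an illustration under the same NumPy assumption as above (apply_lean_walsh is our name, not the paper's); it applies $A'_\ell$ exactly as in equation (10), splitting $z$ into $c$ blocks, forming the $r$ seed-weighted sums of the blocks, and recursing, and then checks the result against an explicitly built Kronecker product:

```python
import numpy as np

# The sub-Hadamard seed A_1' from equation (6): r = 3 rows, c = 4 columns.
seed = np.array([[1,  1, -1, -1],
                 [1, -1,  1, -1],
                 [1, -1, -1,  1]]) / np.sqrt(3)


def apply_lean_walsh(seed, z):
    """Apply A_ell to z in O(d) time, where d = len(z) = c**ell (proof of Fact 3.3)."""
    r, c = seed.shape
    d = len(z)
    if d == 1:                       # A_0 is the 1 x 1 identity
        return z
    blocks = z.reshape(c, d // c)    # z = [z_1; ...; z_c], each block of length d/c
    # Block i of the output is A_{ell-1} applied to sum_j A_1(i, j) z_j, cf. equation (10).
    return np.concatenate([apply_lean_walsh(seed, seed[i] @ blocks) for i in range(r)])


ell = 5
z = np.random.default_rng(1).standard_normal(4 ** ell)     # d = c^ell = 1024

# Dense reference A_ell = A_1 (x) A_{ell-1}, feasible to build only for tiny ell.
A = seed
for _ in range(ell - 1):
    A = np.kron(seed, A)

print(np.allclose(apply_lean_walsh(seed, z), A @ z))        # the O(d) recursion matches the dense product
```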