The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
Improved Subsampled Randomized Hadamard Transform for Linear SVM
Zijian Lei, Liang Lan
Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
[email protected], [email protected]
Abstract

Subsampled Randomized Hadamard Transform (SRHT), a popular random projection method that can efficiently project d-dimensional data into an r-dimensional space (r ≪ d) in O(d log(d)) time, has been widely used to address the challenge of high dimensionality in machine learning. SRHT works by rotating the input data matrix X ∈ R^{n×d} with the Randomized Walsh-Hadamard Transform, followed by uniform column sampling on the rotated matrix. Despite its advantages, one limitation of SRHT is that it generates the new low-dimensional embedding without considering any specific properties of a given dataset. This data-independent random projection method may therefore result in inferior and unstable performance when used for a particular machine learning task, e.g., classification. To overcome this limitation, we analyze the effect of using SRHT for random projection in the context of linear SVM classification. Based on our analysis, we propose importance sampling and deterministic top-r sampling, instead of the uniform sampling in SRHT, to produce effective low-dimensional embeddings. In addition, we also propose a new supervised non-uniform sampling method. Our experimental results demonstrate that our proposed methods achieve higher classification accuracies than SRHT and other random projection methods on six real-life datasets.

Introduction

Classification is one of the fundamental machine learning tasks. In the era of big data, huge amounts of high-dimensional data have become common in a wide variety of applications.
As the dimensionality of the data in real-life classification tasks has increased dramatically in recent years, it is essential to develop accurate and efficient classification algorithms for high-dimensional data.

One of the most popular ways to address the high-dimensionality challenge is dimensionality reduction, which allows classification problems to be solved efficiently in a lower-dimensional space. Popular dimensionality reduction methods include Principal Component Analysis (PCA) (Jolliffe 2011) and Linear Discriminant Analysis (LDA) (Mika et al. 1999). However, these traditional dimensionality reduction methods are computationally expensive, which makes them unsuitable for big data. Recently, random projection (Bingham and Mannila 2001), which projects a high-dimensional feature vector x ∈ R^d into a low-dimensional feature x̃ ∈ R^r (r ≪ d) using a random orthonormal matrix R ∈ R^{d×r}, namely x̃ = R^T x, has become a popular alternative. Since random projection methods are (1) computationally efficient while maintaining accurate results, (2) simple to implement in practice, and (3) easy to analyze (Mahoney et al. 2011; Xu et al. 2017), they have attracted significant research attention in recent years.

The theoretical foundation of random projection is the Johnson-Lindenstrauss Lemma (JL lemma) (Johnson, Lindenstrauss, and Schechtman 1986). The JL lemma proves that, in Euclidean space, high-dimensional data can be randomly projected into a lower-dimensional space while the pairwise distances between all data points are preserved with high probability. Based on the JL lemma, several different ways have been proposed for constructing the random matrix R (the first three are sketched in the code below):
(1) Gaussian matrix (Dasgupta and Gupta 1999): each entry in R is generated from a Gaussian distribution with mean 0 and variance 1/d;
(2) Achlioptas matrix (Achlioptas 2003): each entry in R is drawn from {−1, 0, 1} according to a discrete distribution. This method generates a sparse random matrix R and therefore requires less memory and computational cost;
(3) sparse embedding matrix (also called count sketch matrix) (Clarkson and Woodruff 2017) and its similar variant, feature hashing (Weinberger et al. 2009; Freksen, Kamma, and Larsen 2018): for each row of R, only one column is randomly selected, and either 1 or −1 is assigned to that entry with probability 0.5;
(4) Subsampled Randomized Hadamard Transform (SRHT) (Tropp 2011): a highly structured matrix R is used for random projection. The procedure of SRHT can be summarized as follows: it first rotates the input matrix X ∈ R^{n×d} (n is the number of samples, d is the dimensionality of the data) by the Randomized Walsh-Hadamard Transform and then uniformly samples r columns from the rotated matrix. More details of SRHT are discussed in the next section.
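To make these constructions concrete, here is a minimal NumPy sketch of the first three random matrices (SRHT is sketched in the next section). The scaling factors are chosen so that each entry has variance 1/d, matching the Gaussian convention above; scaling conventions vary across the cited references, so treat this as an illustrative approximation rather than a faithful reproduction of any one paper.

```python
import numpy as np

def gaussian_matrix(d, r, rng):
    # Each entry drawn i.i.d. from N(0, 1/d) (Dasgupta and Gupta 1999).
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(d, r))

def achlioptas_matrix(d, r, rng):
    # Sparse {-1, 0, +1} entries (Achlioptas 2003): +/-1 with probability
    # 1/6 each, 0 with probability 2/3, scaled so entry variance is 1/d.
    entries = rng.choice([-1.0, 0.0, 1.0], size=(d, r), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / d) * entries

def count_sketch_matrix(d, r, rng):
    # Sparse embedding / count sketch (Clarkson and Woodruff 2017):
    # each row of R has a single +/-1 in a uniformly chosen column.
    R = np.zeros((d, r))
    R[np.arange(d), rng.integers(0, r, size=d)] = rng.choice([-1.0, 1.0], size=d)
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))            # toy data: n = 100, d = 64
X_low = X @ gaussian_matrix(64, 16, rng)  # project to r = 16 dimensions
```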
Although random projection methods have attracted a great deal of research attention in recent years (Sarlos 2006; Choromanski, Rowland, and Weller 2017) and have been studied for regression (Lu et al. 2013), classification (Paul et al. 2013; Zhang et al. 2013; Paul, Magdon-Ismail, and Drineas 2015) and clustering (Boutsidis, Zouzias, and Drineas 2010; Liu, Shen, and Tsang 2017), most of the existing studies focus on data-independent projection. In other words, the random matrix R is constructed without considering any specific properties of a given dataset.

In this paper, we focus on SRHT for random projection and study how it affects the classification performance of linear Support Vector Machines (SVM). SRHT achieves dimensionality reduction by uniformly sampling r columns from a rotated version of the input data matrix X. Even though the Randomized Walsh-Hadamard Transform in SRHT tends to equalize column norms (Boutsidis and Gittens 2013), we argue that the importance of different columns in the rotated data matrix is not equal. Therefore, the uniform random sampling procedure in SRHT can result in low accuracy when used as a dimensionality reduction method for high-dimensional data classification.

To overcome this limitation of SRHT, we propose to produce effective low-dimensional embeddings by using non-uniform sampling instead of uniform sampling. To achieve this goal, we first analyze the effect of using SRHT for random projection in linear SVM classification. Based on our analysis, we propose importance sampling and deterministic top-r sampling methods to improve SRHT. Furthermore, we propose a new sampling method that incorporates label information: it samples the columns that achieve optimal inter-class separability along with intra-class compactness of the data samples.

Finally, we performed experiments to evaluate our proposed methods on six real-life datasets. Our experimental results clearly demonstrate that our proposed methods obtain higher classification accuracies than SRHT and three other popular random projection methods while only slightly increasing the running time.

Preliminaries

Random Projection

Given a data matrix X ∈ R^{n×d}, random projection methods reduce the dimensionality of X by multiplying it by a random orthonormal matrix R ∈ R^{d×r} with parameter r ≪ d. The projected data in the low-dimensional space is

X̃ = XR ∈ R^{n×r}.    (1)

Note that the matrix R is randomly generated and is independent of the input data X. The theoretical foundation of random projection is the Johnson-Lindenstrauss Lemma (JL lemma), stated as follows.

Lemma 1 (Johnson-Lindenstrauss Lemma (JL lemma) (Johnson, Lindenstrauss, and Schechtman 1986)). For any 0 < ε < 1 and any integer n, let r be a positive integer such that r ≥ 4(ε²/2 − ε³/3)⁻¹ ln n. Then for any set V of n points in R^d, there exists a map f : R^d → R^r such that for all u, v ∈ V,
(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².

The JL lemma shows that, in Euclidean space, high-dimensional data can be randomly projected into a lower-dimensional space while the pairwise distances between all data points are well preserved with high probability. There are several different ways to construct the random projection matrix R. In this paper, we focus on SRHT, which uses a structured orthonormal matrix for random projection. The details of SRHT are as follows.

Subsampled Randomized Hadamard Transform (SRHT)

For d = 2^q, where q is any positive integer, SRHT defines a d × r matrix as

R = √(d/r) · DHS,    (2)

where
• D ∈ R^{d×d} is a diagonal matrix whose elements are independent random signs {1, −1};
• H ∈ R^{d×d} is a normalized Walsh-Hadamard matrix. The Walsh-Hadamard matrix is defined recursively as
H_d = [ H_{d/2}  H_{d/2} ; H_{d/2}  −H_{d/2} ],  with H_2 = [ 1  1 ; 1  −1 ],  and H = (1/√d) H_d ∈ R^{d×d};
• S ∈ R^{d×r} consists of r columns sampled uniformly at random from the d × d identity matrix. The purpose of multiplying by S is to uniformly sample r columns from the rotated data matrix X_r = XDH.

There are several advantages of using SRHT for random projection. First, we do not need to represent H explicitly in memory; we only need to store the diagonal random sign matrix D and the sampling matrix S with r non-zero values, so the memory cost of storing R is only O(d + r). Second, due to the recursive structure of the Hadamard matrix H, the matrix multiplication XR takes only O(nd log(d)) time using a Fast Fourier Transform (FFT) style algorithm (Tropp 2011). Ailon and Liberty (2009) further improved the time complexity of SRHT to O(nd log(r)) when only r columns of X_r are needed. A construction sketch is given below.
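The following is a minimal sketch of Eq. (2), assuming d is a power of two. It applies H through an iterative fast Walsh-Hadamard transform instead of materializing the d × d matrix, which is what gives SRHT its O(nd log(d)) cost; the function names here are ours, not from a reference implementation.

```python
import numpy as np

def fwht(A):
    # Normalized fast Walsh-Hadamard transform along the last axis of A
    # (computed on a copy), O(d log d) per row; d must be a power of two.
    A = A.copy()
    d = A.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = A[..., i:i + h].copy()
            b = A[..., i + h:i + 2 * h].copy()
            A[..., i:i + h] = a + b          # butterfly: sum
            A[..., i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return A / np.sqrt(d)  # normalization makes H orthonormal

def srht(X, r, rng):
    # Computes sqrt(d/r) * X D H S as in Eq. (2): D is a random sign
    # diagonal, H is applied via fwht, and S uniformly samples r columns.
    n, d = X.shape
    signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
    X_rot = fwht(X * signs)                      # X_r = X D H
    cols = rng.choice(d, size=r, replace=False)  # uniform sampling (S)
    return np.sqrt(d / r) * X_rot[:, cols]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))   # d = 64 = 2^6
X_low = srht(X, r=16, rng=rng)   # (100, 16) embedding
```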
Methodology

Limitation of Uniform Sampling in SRHT

Despite the advantages of SRHT mentioned in the previous section, one limitation of SRHT is that the random matrix R is constructed without considering any specific properties of the input data X. Based on the definition in (2), SRHT works by first rotating the input data matrix X by multiplying it with DH and then uniformly sampling r columns from the rotated matrix X_r = XDH. Since this uniform column sampling does not consider the underlying properties of the data, informative columns of X_r may be discarded, which can lead to the inferior and unstable classification performance described above. The toy sketch below illustrates the gap between uniform and data-aware column selection.
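To make the argument concrete, the following toy sketch (reusing fwht() from the sketch above) contrasts SRHT's uniform column choice with a deterministic top-r choice ranked by column norms of X_r = XDH. The norm-based score is a hypothetical stand-in used only for illustration; the paper's actual importance-sampling and top-r criteria are derived from the linear SVM analysis in later sections.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 100, 64, 16
X = rng.normal(size=(n, d)) * np.linspace(0.1, 3.0, d)  # uneven feature scales
signs = rng.choice([-1.0, 1.0], size=d)
X_rot = fwht(X * signs)  # rotated matrix X_r = X D H

uniform_cols = rng.choice(d, size=r, replace=False)  # SRHT: uniform sampling
scores = np.linalg.norm(X_rot, axis=0)               # illustrative "importance"
top_cols = np.argsort(scores)[-r:]                   # deterministic top-r choice

# The top-r choice retains at least as much column-norm mass as any
# uniform draw, illustrating that the columns of X_r are not equally important.
print(scores[uniform_cols].sum(), scores[top_cols].sum())
```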