A Survey of Dimensionality Reduction Techniques Based on Random Projection
Haozhe Xie, Jie Li, Hanqing Xue
arXiv:1706.04371v4 [cs.LG] 30 May 2018

Abstract—Dimensionality reduction techniques play important roles in the analysis of big data. Traditional dimensionality reduction approaches, such as principal component analysis (PCA) and linear discriminant analysis (LDA), have been studied extensively in the past few decades. However, as the dimensionality of data increases, the computational cost of traditional dimensionality reduction methods grows exponentially, and the computation becomes prohibitively intractable. These drawbacks have triggered the development of random projection (RP) techniques, which map high-dimensional data onto a low-dimensional subspace at greatly reduced time cost. However, the RP transformation matrix is generated without considering the intrinsic structure of the original data and usually leads to relatively high distortion. Therefore, in recent years, methods based on RP have been proposed to address this problem. In this paper, we summarize the methods used in different situations to help practitioners employ the proper techniques for their specific applications. Meanwhile, we enumerate the benefits and limitations of the various methods and provide further references for researchers to develop novel RP-based approaches.

Index Terms—random projection, compressive sensing, dimensionality reduction, high-dimensional data

H. Xie, J. Li, and H. Xue are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (email: [email protected]; [email protected]; [email protected]).

I. INTRODUCTION

The data in machine learning and data mining scenarios usually have very high dimensionality [1]. For example, market basket data in a hypermarket are high-dimensional, consisting of several thousand types of merchandise. In text mining, each document is usually represented by a vector whose dimensionality is equal to the vocabulary size. In bioinformatics, gene expression profiles can also be considered as matrices with more than ten thousand continuous values. High dimensionality leads to burdensome computation and curse-of-dimensionality issues. Therefore, dimensionality reduction techniques are often applied in machine learning tasks to alleviate these problems [2]. Traditional dimensionality reduction techniques, such as PCA [3] and LDA [4], have been widely studied in past decades. However, as data dimensionality increases, the computational cost of traditional dimensionality reduction approaches grows exponentially, and the computation becomes prohibitively intractable. RP [5], which projects the original high-dimensional matrix X_{n×d} onto a k-dimensional subspace using a random matrix W_{d×k}, is a simple and rapid approach to reduce dimensionality. RP can be formulated as follows:

    X^{RP}_{n×k} = X_{n×d} W_{d×k}    (1)

The essential idea of RP is based on the Johnson-Lindenstrauss lemma [6], which states that it is possible to project n points in a space of arbitrarily high dimension onto an O(log n)-dimensional space such that the pairwise distances between points are approximately preserved. Thus, RP has attracted increasing attention in recent years and has been employed in many machine learning scenarios, including classification [7], [8], [9], clustering [10], [11], [12], and regression [13], [14], [15]. Although RP is much less expensive in terms of computational cost, it often fails to capture the task-related information because the latent space is generated without considering the intrinsic structure of the original data. Various methods have been proposed to overcome this issue and to improve the performance of RP. These methods can be classified into three categories: feature extraction approaches, dimensionality increasing approaches, and ensemble approaches. Table I provides a taxonomy of the approaches developed to improve the performance of RP, their most prominent advantages and disadvantages, and the corresponding literature.

Feature extraction approaches, the most commonly used way to improve the performance of RP, attempt to construct informative and non-redundant features from a large set of data. These methods can be divided into two major categories: general-purpose methods and application-specific methods. Generally, application-specific feature extraction methods find better discriminative features than general-purpose methods do, but they are limited to a small number of datasets. The main drawback of feature extraction approaches is that they are usually computationally intensive.

Dimensionality increasing approaches map a low-dimensional feature space onto a higher-dimensional feature space while improving linear separability. The original features can be better represented in the high-dimensional space [32]. Because the generated high-dimensional space requires impracticably large computational resources, RP is then used for dimensionality reduction. According to the existing literature on RP, the extreme learning machine (ELM) [33] and rectangle filters [34] are often used to increase the dimensionality of the original feature space. Both methods are computationally fast; however, their architecture is so simple that they often have trouble fitting complex features [35].

Ensemble approaches have been studied extensively, including the well-known random forest [36] and AdaBoost [37]. These methods are robust and perform well on imbalanced datasets [38]. Also, ensembles of multiple RP instances lead to a lower risk of overfitting and better generalization performance. Nevertheless, as the data dimensionality increases, prediction becomes incredibly slow.

Although various approaches have been developed to improve the performance of RP, some issues still need to be addressed. Practitioners also require guidelines to select the proper approach for their specific application. Here, we review these approaches and summarize their benefits and limitations to provide a reference for further studies of RP-based methods.

TABLE I: A taxonomy of approaches to improve the performance of RP. The advantages and disadvantages are listed along with the corresponding references.

  Feature extraction, general-purpose methods
    Advantages: applicable to most datasets; good at finding discriminative features
    Disadvantages: computationally intensive
    Ref.: [8], [9], [15], [16], [17]

  Feature extraction, application-specific methods
    Advantages: better at finding discriminative features than general-purpose methods
    Disadvantages: computationally intensive; applicable to only a few datasets
    Ref.: [18], [19], [20], [21], [22]

  Dimensionality increasing
    Advantages: fast; improves linear separability
    Disadvantages: bad at fitting complex features; weak in finding discriminative features
    Ref.: [23], [24], [25], [26], [27]

  Ensemble
    Advantages: robust; lower risk of overfitting; applicable to most datasets; performs well on imbalanced datasets
    Disadvantages: computationally intensive; slow in making predictions; sensitive to noise (boosting)
    Ref.: [28], [29], [30], [31]

II. FEATURE EXTRACTION APPROACHES

Feature extraction transforms data from a high-dimensional space to a lower-dimensional space and is the most commonly used approach to improve the performance of RP. It is often applied before or after RP as a preprocessing or post-processing step, respectively (Figure 1). A general overview of preprocessing and post-processing methods for RP in different application fields is presented in Table II. Researchers and practitioners prefer to use feature extraction in the post-processing stage because RP reduces the dimensionality of the feature space, which greatly accelerates the feature extraction methods. In many machine learning and pattern recognition systems, feature extractors that transform raw data into feature vectors must be carefully designed, especially those in computer vision, including the histogram of oriented gradients (HOG) [39] and the scale-invariant feature transform (SIFT) [40]. Feature extractors are so important that they directly affect the performance of the developed methods. According to the extant literature on RP, feature extraction methods can be roughly divided into two categories: general-purpose methods and application-specific methods.

A. General-purpose methods

Traditional dimensionality reduction techniques search for dimensions with the maximum discriminative power, whereas RP excels at rapidly finding a low-dimensional space. It is natural to combine the two types of methods to solve dimensionality reduction problems. Thus, in the past few years, substantial research based on the two techniques has been conducted.

Xie et al. [8] incorporated RP into PCA, LDA, and feature selection (FS) [43] to classify gene expression profiles of breast cancer. The experimental results demonstrated that the classification accuracy of RP can be significantly improved by FS, especially on small-n-large-p datasets [44].

Zhao et al. [9] proposed semi-random projection (SRP) to find a discriminative subspace while maintaining a feasible computational load. In contrast to RP, where the values of the transformation matrix are assigned randomly, the weights of the transformation vectors of SRP are obtained by LDA. The SRP method (see Figure 2) consists of three steps. First, the original data matrix X ∈ R^{n×d} is mapped onto a subspace X̂_i ∈ R^{n×k} using k randomly selected features. Next, the data with k features are projected onto a single dimension h_i using a transformation vector W ∈ R^{k×1} learned by LDA. The above procedure is repeated r times to generate the latent subspace H ∈ R^{n×r}:

    H = [h_1, h_2, ..., h_r]    (2)
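The three SRP steps described above can be sketched roughly as follows. This is an illustrative reading of the method, not the authors' implementation: the two-class Fisher/LDA direction is computed directly in NumPy, and the ridge term `eps` is an assumption added here for numerical stability.

```python
import numpy as np

def fisher_direction(X, y, eps=1e-6):
    """Two-class Fisher/LDA direction; eps ridges the scatter matrix (an
    assumption for invertibility, not part of the published method)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter
    Sw += eps * np.eye(X.shape[1])
    return np.linalg.solve(Sw, mu1 - mu0)

def semi_random_projection(X, y, k, r, rng):
    """SRP sketch: r latent columns, each an LDA projection of k random features."""
    n, d = X.shape
    H = np.empty((n, r))
    for i in range(r):
        feats = rng.choice(d, size=k, replace=False)  # step 1: pick k random features
        w = fisher_direction(X[:, feats], y)          # step 2: LDA-learned weights
        H[:, i] = X[:, feats] @ w                     # project onto one dimension h_i
    return H  # Eq. (2): H = [h_1, ..., h_r]

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 200))
y = (rng.random(100) < 0.5).astype(int)
X[y == 1, :20] += 1.0                 # make the first 20 features discriminative
H = semi_random_projection(X, y, k=30, r=10, rng=rng)
print(H.shape)  # (100, 10)
```

Unlike plain RP, each column of H is data-dependent through the LDA step, which is what lets SRP find a discriminative subspace at moderate extra cost.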
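For reference, the basic RP mapping of Eq. (1) and its Johnson-Lindenstrauss distance-preservation property can be checked with a short sketch. The Gaussian random matrix scaled by 1/sqrt(k) is one common choice of W; other sub-Gaussian constructions work as well.

```python
import numpy as np

def random_projection(X, k, rng):
    """Project X (n x d) onto k dimensions with a Gaussian random matrix.

    The 1/sqrt(k) scaling keeps squared distances approximately unbiased."""
    d = X.shape[1]
    W = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ W  # Eq. (1): X^RP = X W

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2000))       # n = 50 points in d = 2000 dimensions
X_rp = random_projection(X, k=400, rng=rng)

# Pairwise distance before vs. after projection: the ratio concentrates near 1.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(X_rp[0] - X_rp[1])
ratio = proj / orig
print(round(ratio, 2))  # close to 1 for moderately large k
```

The deviation of the ratio from 1 shrinks on the order of 1/sqrt(k), which is why a target dimension of O(log n) already suffices for n points.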
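The dimensionality increasing idea from the introduction (an ELM-style random nonlinear layer to expand the feature space, followed by RP to compress it back to a workable size) can be sketched as below. The tanh activation and the layer sizes are illustrative assumptions, not choices taken from the cited papers.

```python
import numpy as np

def elm_style_expand(X, D, rng):
    """ELM-style random hidden layer: map d input features to D >> d
    nonlinear random features (weights are never trained)."""
    d = X.shape[1]
    A = rng.standard_normal((d, D))
    b = rng.standard_normal(D)
    return np.tanh(X @ A + b)  # random nonlinear expansion

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 10))              # low-dimensional input
X_high = elm_style_expand(X, D=5000, rng=rng)  # dimensionality increasing step
W = rng.standard_normal((5000, 100)) / np.sqrt(100)
X_low = X_high @ W                             # RP compresses the expanded space
print(X_high.shape, X_low.shape)  # (32, 5000) (32, 100)
```

The expansion improves linear separability, while the RP step keeps the downstream computation tractable; the simplicity of the random layer is also why such architectures can struggle with complex features.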
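The ensemble idea, combining several independent RP instances, can likewise be illustrated with a minimal sketch. The nearest-centroid base learner and the majority vote are hypothetical choices made here for brevity; they are not drawn from any particular surveyed method.

```python
import numpy as np

def fit_rp_centroids(X, y, k, rng):
    """One ensemble member: a random projection plus per-class centroids."""
    W = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    Z = X @ W
    centroids = {c: Z[y == c].mean(axis=0) for c in np.unique(y)}
    return W, centroids

def predict_member(X, W, centroids):
    """Nearest-centroid prediction in one member's projected space."""
    Z = X @ W
    classes = sorted(centroids)
    dists = np.stack([np.linalg.norm(Z - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

def rp_ensemble_predict(X_train, y_train, X_test, k, n_members, rng):
    votes = np.stack([
        predict_member(X_test, *fit_rp_centroids(X_train, y_train, k, rng))
        for _ in range(n_members)
    ])
    # Majority vote across the RP instances.
    return np.array([np.bincount(col).argmax() for col in votes.T])

rng = np.random.default_rng(7)
X = np.vstack([rng.standard_normal((60, 500)),
               rng.standard_normal((60, 500)) + 0.6])  # two shifted classes
y = np.array([0] * 60 + [1] * 60)
pred = rp_ensemble_predict(X, y, X, k=25, n_members=9, rng=rng)
acc = (pred == y).mean()
print(f"ensemble accuracy: {acc:.2f}")
```

Averaging over several projections smooths out the distortion any single random W introduces, at the cost of n_members forward passes per prediction, which is exactly the slow-prediction drawback noted in Table I.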