
A Survey of Techniques Based on Random Projection

Haozhe Xie, Jie Li, Hanqing Xue

H. Xie, J. Li, and H. Xue are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China.

arXiv:1706.04371v4 [cs.LG] 30 May 2018

Abstract—Dimensionality reduction techniques play important roles in the analysis of big data. Traditional dimensionality reduction approaches, such as principal component analysis (PCA) and linear discriminant analysis (LDA), have been studied extensively in the past few decades. However, as the dimensionality of data increases, the computational cost of traditional dimensionality reduction methods grows exponentially, and the computation becomes prohibitively intractable. These drawbacks have triggered the development of random projection (RP) techniques, which map high-dimensional data onto a low-dimensional subspace with extremely reduced time cost. However, the RP transformation matrix is generated without considering the intrinsic structure of the original data and usually leads to relatively high distortion. Therefore, in recent years, methods based on RP have been proposed to address this problem. In this paper, we summarize the methods used in different situations to help practitioners employ the proper techniques for their specific applications. Meanwhile, we enumerate the benefits and limitations of the various methods and provide further references for researchers to develop novel RP-based approaches.

Index Terms—random projection, compressive sensing, dimensionality reduction, high-dimensional data

I. INTRODUCTION

The data in machine learning and data mining scenarios usually have very high dimensionality [1]. For example, market basket data in the hypermarket are high-dimensional, consisting of several thousand types of merchandise. In text mining, each document is usually represented by a vector whose dimensionality equals the vocabulary size. In bioinformatics, gene expression profiles can also be considered as matrices with more than ten thousand continuous values. High dimensionality leads to burdensome computation and curse-of-dimensionality issues. Therefore, dimensionality reduction techniques are often applied in machine learning tasks to alleviate these problems [2]. Traditional dimensionality reduction techniques, such as PCA [3] and LDA [4], have been widely studied in past decades. However, as data dimensionality increases, the computational cost of traditional dimensionality reduction approaches grows exponentially, and the computation becomes prohibitively intractable. RP [5], which projects the original high-dimensional matrix $X_{n \times d}$ onto a k-dimensional subspace using a random matrix $W_{d \times k}$, is a simple and rapid approach to reduce dimensionality. RP can be formulated as follows

$$X^{RP}_{n \times k} = X_{n \times d} W_{d \times k} \tag{1}$$

The essential idea of RP is based on the Johnson-Lindenstrauss lemma [6], which states that it is possible to project n points in a space of arbitrarily high dimensions onto an O(log n)-dimensional space such that the pairwise distances between points are approximately preserved. Thus, RP has attracted increasing attention in recent years and has been employed in many scenarios, including classification [7], [8], [9], clustering [10], [11], [12], and regression [13], [14], [15]. Although RP is much less expensive in terms of computational cost, it often fails to capture the task-related information because the latent space is generated without considering the intrinsic structure of the original data. Various methods have been proposed to overcome this issue and to improve the performance of RP. These methods can be classified into three categories: feature extraction approaches, dimensionality increasing approaches, and ensemble approaches. Table I provides a taxonomy of the approaches developed to improve the performance of RP, their most prominent advantages and disadvantages, and the corresponding literature.

Feature extraction approaches, which are the most commonly used way to improve the performance of RP, attempt to construct informative and non-redundant features from a large set of data. These methods can be divided into two major categories: general-purpose methods and application-specific methods. Generally, application-specific feature extraction methods find better discriminative features than general-purpose methods do, but they are limited to a small number of datasets. The main drawback of feature extraction approaches is that they are usually computationally intensive.

Dimensionality increasing approaches map a low-dimensional feature space onto a higher-dimensional feature space while improving the linear separability. The original features can be better represented in high-dimensional space [32]. The generated high-dimensional space requires impracticably large computational resources, so RP is used for dimensionality reduction. According to the existing literature on RP, extreme learning machine (ELM) [33] and rectangle filters [34] are often used to increase the dimensionality of the original feature space. Both methods are computationally fast; however, their architecture is so simple that they often have trouble fitting complex features [35].

Ensemble approaches have been studied extensively, including the well-known random forest [36] and AdaBoost [37]. These methods are robust and perform well on imbalanced datasets [38]. Also, ensembles of multiple RP instances lead to lower risks of overfitting and better generalization performance. Nevertheless, as the data dimensionality increases, prediction becomes incredibly slow.

Although various approaches have been developed to improve the performance of RP, some issues still need to be addressed. Practitioners also require guidelines to select the proper approach to use for their specific application. Here, we review these approaches and summarize their benefits and limitations to provide a reference for further studies of RP-based methods.

TABLE I: A taxonomy of approaches to improve the performance of RP. The advantages and disadvantages are listed along with the corresponding references.

| Approach | Advantages | Disadvantages | Ref. |
|---|---|---|---|
| Feature extraction: general-purpose methods | Applicable to most datasets; good at finding discriminative features | Computationally intensive | [8], [9], [15], [16], [17] |
| Feature extraction: application-specific methods | Better at finding discriminative features than general-purpose methods | Computationally intensive; applicable to a few datasets | [18], [19], [20], [21], [22] |
| Dimensionality increasing | Fast; improves linear separability | Bad at fitting complex features; weak in finding discriminative features | [23], [24], [25], [26], [27] |
| Ensemble | Robust; lower risk of overfitting; applicable to most datasets; performs well on imbalanced datasets | Computationally intensive; slow in making predictions; sensitive to noise (Boosting) | [28], [29], [30], [31] |

II. FEATURE EXTRACTION APPROACHES

Feature extraction transforms data from a high-dimensional space to a lower-dimensional space and is the most commonly used approach to improve the performance of RP. It is often used before or after RP as a preprocessing or post-processing step, respectively (Figure 1). A general overview of preprocessing and post-processing methods for RP in different application fields is presented in Table II. Researchers and practitioners prefer to use feature extraction in the post-processing stage because RP reduces the dimensionality of the feature space, which greatly accelerates the feature extraction methods. In many machine learning and pattern recognition systems, feature extractors that transform raw data into feature vectors must be carefully designed, especially those in computer vision, including histogram of oriented gradients (HOG) [39] and scale invariant feature transform (SIFT) [40]. Feature extractors are so important that they directly affect the performance of the developed methods. According to the extant literature on RP, the use of feature extraction methods can be roughly divided into two categories: general-purpose methods, such as some statistical learning methods and neural networks, which can be applied to many fields (e.g., natural language processing and computer vision); and application-specific methods, which are designed to address a specific problem, like nonsubsampled pyramid (NSP) [41] and constrained energy minimization (CEM) [42].

A. General-purpose methods

Traditional dimensionality reduction techniques search for dimensions with the maximum discriminative power, whereas RP performs well in rapidly finding a low-dimensional space. It is natural to combine the two types of methods to solve dimensionality reduction problems. Thus, in the past few years, substantial research based on the two techniques has been conducted.

Xie et al. [8] incorporated RP into PCA, LDA, and feature selection (FS) [43] to classify gene expression profiles of breast cancer. The experimental results demonstrated that the classification accuracy of RP can be significantly improved by FS, especially in small-n-large-p datasets [44].

Zhao et al. [9] proposed semi-random projection (SRP) to find a discriminative subspace while maintaining a feasible computational load. In contrast to RP, where the values of the transformation matrix are assigned randomly, the weights for the transformation vectors of SRP are obtained by LDA. The SRP method (see Figure 2) consists of three steps. First, the original data matrix $X \in \mathbb{R}^{n \times d}$ is mapped onto a subspace $X_{c_i} \in \mathbb{R}^{n \times k}$ using k randomly selected features. Next, the data with k features are projected onto a single dimension $h_i$ using a transform vector $W \in \mathbb{R}^{k \times 1}$ learned by LDA. Finally, the above procedure is repeated r times to generate the following latent subspace $H \in \mathbb{R}^{n \times r}$

$$H = \begin{bmatrix} h_1 & h_2 & \cdots & h_r \end{bmatrix} \tag{2}$$

Experiments were performed on six datasets generated from the standard text 20 newsgroups corpus [45] for determining the categories of texts. The experimental results indicated that the classification accuracy of SRP followed by PCA (SRP + PCA) increases by approximately 25% compared to that of RP, but it is still lower than that of PCA.

[Figure 1 diagram: Original Space → Feature Extraction (Preprocessing) → Latent Space #1 → Random Projection → Latent Space #2 → Feature Extraction (Post-processing) → Latent Space #3]

Fig. 1: Preprocessing and post-processing methods for RP. Both are used for feature extraction and help RP better capture the intrinsic structure of the original data.
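As a rough illustration of the pipeline in Fig. 1, the sketch below applies a Gaussian random projection and then PCA as a post-processing step. It assumes a Gaussian projection matrix scaled by 1/sqrt(k), which is one common choice but not the only one. The helper names and the sizes (50 points, 5000 dimensions, k = 500) are illustrative assumptions.

```python
import numpy as np


def random_projection(X, k, seed=0):
    """X_rp = X W with a Gaussian W of shape (d, k), scaled by 1/sqrt(k)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ W


def pca(X, m):
    """Project centered data onto its top-m principal directions via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T


def pairwise_distances(X):
    """Euclidean distance matrix computed from the Gram matrix."""
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(D2, 0.0))


rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5000))      # original space
X_rp = random_projection(X, k=500)   # latent space after RP
Z = pca(X_rp, m=10)                  # post-processing feature extraction

# Ratio of projected to original distances for every pair of points.
iu = np.triu_indices(X.shape[0], k=1)
ratio = pairwise_distances(X_rp)[iu] / pairwise_distances(X)[iu]
print(ratio.min(), ratio.max())      # both should stay close to 1
```

Running PCA on the 500-dimensional projection rather than on the 5000-dimensional input is what makes the post-processing variant attractive: the SVD operates on a much smaller matrix, while the pairwise distances remain approximately preserved.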

TABLE II: Alternative preprocessing and post-processing feature extraction methods used to improve the performance of RP.

| Preprocessing | Post-processing | Application | Ref. |
|---|---|---|---|
| PCA; FS; (None) | PCA; LDA; FS; (None) | Tumor tissue classification | [8] |
| (None) | LDA | Text classification | [9] |
| BoW | (None) | Sales rank prediction | [15] |
| (None) | BoW | Texture image classification | [16] |
| CNN | RNN | Text recognition in images | [17] |
| FAST Corner Detector | (None) | Image classification | [18] |
| NSP | Mahalanobis distance | Target enhancing | [19] |
| (None) | CEM | Hyperspectral target detection | [20] |
| (None) | TCIMF | Hyperspectral target detection | [21] |
| (None) | MUSIC | Wideband spectrum sensing | [22] |

[Figure 2 diagram: features of the Original Space are mapped to the Latent Space by Random Projection and by Linear Discriminative Analysis]

Fig. 2: Unified graph explanation for random projection and semi-random projection. The latent space consists of features generated by different RP transformation matrices.
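One way to read Fig. 2 in matrix form is sketched below. The column-selection matrix $S_i$ is a hypothetical symbol introduced here for illustration and is not part of the original notation: it picks the k randomly chosen features used in the i-th repetition. Under that assumption, RP and SRP differ only in how the projection weights are obtained.

```latex
\begin{aligned}
\text{RP:}\quad  & X^{RP} = X W,
  && W \in \mathbb{R}^{d \times k} \text{ drawn at random}, \\
\text{SRP:}\quad & h_i = X S_i w_i,
  && S_i \in \{0,1\}^{d \times k} \text{ selects } k \text{ random features},\;
     w_i \in \mathbb{R}^{k \times 1} \text{ learned by LDA}, \\
                 & H = \begin{bmatrix} h_1 & h_2 & \cdots & h_r \end{bmatrix}
  && H \in \mathbb{R}^{n \times r}.
\end{aligned}
```

In this reading, each SRP column is a sparse d-dimensional projection vector $S_i w_i$ whose support is random but whose nonzero values are fitted to the labels.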

Stacked semi-random projection (SSRP) was developed to further improve the performance of SRP. In SSRP, SRP is stacked by feeding the output of the previous SRP layer as the input of the subsequent SRP layer. Suppose the output of the k-th SRP layer is denoted by $H^k$ with dimensionality $d_k$ and that the first layer is initialized with the original data $H^0 = X$ with dimensionality $d_0$. The relationship between the dimensionality of the k-th and (k − 1)-th SRP layers can be calculated as

$$d_k = \alpha \lfloor \sqrt{d_{k-1}} \rfloor \tag{3}$$

where α is a hyperparameter. The output of the k-th SRP layer can be determined as

$$H^k = \frac{1}{1 + \exp\left(-[W^k]^T H^{k-1}\right)} \tag{4}$$

where $W^k$ is the transformation vector of the k-th SRP layer learned from regularized LDA [46]. The overall process of SSRP is shown in Figure 3. SSRP outperforms other methods, including PCA, sparse PCA [47], RP, RP + PCA, mSDA [48], and SRP + PCA, on 5 of 6 datasets in terms of classification accuracy. In addition, the computational cost of SSRP is much lower than that of PCA, although it is 2 times higher than that of RP. Because LDA is adopted, SRP and SSRP may easily overfit the data in the presence of labeling noise and are not applicable to non-linear problems [49].

[Figure 3 diagram: Original Space → the 1st SRP layer → the 2nd SRP layer → ... → the last SRP layer → Latent Space, with layer outputs $H^0, H^1, H^2, \ldots, H^L$]

Fig. 3: Architecture of stacked semi-random projection with L layers, in which SRP is stacked layer by layer. The dimensionality shrinks over the first several layers, and the last several layers keep it unchanged.

The bag-of-words (BoW) model is a simplified representation used in natural language processing and computer vision [16], [50], [51]. The number of words in a BoW model usually reaches tens of millions, which explodes the dimensionality of predictor matrices. In the past few years, RP has been used to address this issue. Schneider et al. [15] proposed an attributes-based regression model in which historical data were used to forecast one-week-ahead rolling sales for tablet computers. In contrast to the extant approaches [52], [53], the authors considered customer reviews and used a BoW model to analyze product feedback. However, a key challenge of BoW is that the millions of words contained in the bag lead to infeasible computation. RP was adopted to reduce the dimensionality of the BoW predictor matrices to address this problem. Assume that the BoW matrix is denoted by $BOWS_{n \times d}$, where n and d are the number of reviews and words, respectively. RP maps the original space onto a subspace $\widehat{BOWS}_{n \times k}$ spanned by the k features. Compared to the baseline model without BoW, the proposed model with BoW [50] has better predictive performance on the dataset named “Market Dynamics and User Generated Content about Tablet Computers” [54]. Since the BoW representation ignores the semantic relation between words, it may fail to capture order information and dependency between the words in reviews [55].

In computer vision, the BoW model has been applied to texture research (including synthesis, classification, segmentation, compression, and shape from texture) [56], [57], [58], [59] by treating texture features as words. In BoW, texture images are statistically described as histograms over a dictionary of features. It has been shown that local texture feature descriptors learned from BoW are insensitive to local image perturbations such as rotation, affine changes and scale [16]. Liu et al. [60] proposed a robust and powerful texture classifier based on RP and BoW. First, RP is used to extract a small set of random features from local image patches. Then, random features are embedded into a BoW model to conduct texture classification. Finally, learning and classification are performed in the compressed domain. The proposed method improves the classification accuracy by 10.38% and 3.32% compared to local binary pattern (LBP) [61] and the combination of LBP and normalized Gabor filter (NGF) [62], respectively. Furthermore, the proposed method consumes considerably less time and storage space. However, BoW may still reduce the discriminative power of images because it ignores the geometric relationships among visual words [63].

Traditional machine learning techniques are limited in their ability to handle natural data in their raw form. Deep learning [64] has performed remarkably well and often outperforms traditional machine learning by a wide margin. Wu et al. [17] proposed a novel method based on RP and deep neural network (DNN) [65] for text recognition in natural scene images. First, a convolutional neural network (CNN) [65] is adopted as a feature extractor to convert word images into a multi-layer CNN feature sequence with sliding windows. Then, RP is used to map the high-dimensional CNN features onto a subspace with low dimensionality. Finally, a recurrent neural network (RNN) [66], used for decoding the embedded RP-CNN features, is trained to recognize the text in the image. Multiple RNNs are ensembled by the Recognizer Output Voting Error Reduction (ROVER) algorithm [67] to further improve the recognition rate. The authors found that the recognition rate of RP-CNN features, with 85% dimensionality reduction, is similar to that of the original high-dimensional features. Although CNNs and RNNs dramatically improve the recognition rate, they are difficult to train and require substantial amounts of time to select the hyperparameters [68].

Low-rank approximation plays a central role in data analysis. In mathematics, it is often desirable to find a good approximation of a given matrix with a lower rank. Eigenvalue decomposition is a typical strategy; the best known method is singular value decomposition (SVD) [69], but it usually leads to heavy computation. RP is a simple technique and has been widely used to accelerate the computations for such approximations. Assume that the original matrix can be formulated as $X \in \mathbb{R}^{n \times k}$. The quality of the approximation depends on how well the method captures the important part of X. Martinsson [70] proposed an approximation method based on QR decomposition [71] and RP. First, a matrix $Y \in \mathbb{R}^{n \times d}$ is formed according to $Y = (XX^T)^p XS$, where p stands for the number of iterations, and the entries of $S \in \mathbb{R}^{k \times d}$ follow the Gaussian distribution $\mathcal{N}(0, 1)$. Then, QR decomposition is applied: $Y = QR$. The low-rank approximation of matrix X can be represented as $\tilde{X} = Q(Q^T X)$. The proposed method constructs a nearly optimal rank-k approximation with much lower time complexity. Alternatively, the sampling matrix S can be sampled from the subsampled randomized Hadamard transform (SRHT) [72], and the corresponding approximation method is known as structured RP [73]. The experimental results revealed that the structured RP is faster but less accurate than the method proposed by Martinsson [70].

Zhang et al. [74] proposed a method to accelerate linear regression with the help of low-rank matrix approximation implemented with RP and QR decomposition. Suppose the observed data matrix and the corresponding response vector are represented as X and Y, respectively. Let $\tilde{X}$ be the low-rank approximation of X; thus, the coefficient vector β can be calculated according to the following recursive formula

$$\beta_{t+1} = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \left( -\frac{1}{n} (\beta - \beta_t)^T \tilde{X}^T (Y - \tilde{X} \beta_t) + \Phi_t \|\beta\|_1 + \frac{\lambda}{2} \|\beta - \beta_t\|_2^2 \right) \tag{5}$$

where λ represents the regularization term. $\beta_0$ is set to 0, and $\Phi_t = \min(\Phi_{\min}, \Phi_0 \eta^t)$, in which $\eta \in (0, 1)$ controls the shrinkage speed of $\Phi_t$. Compared to existing methods excluding RP, such as PGH [75] and ADG [76], the proposed method significantly reduces the computation time by 98.90% and 98.55%, respectively. Additionally, the convergence rate of the proposed method is comparable to that of PGH and much higher than that of ADG.

B. Application-specific methods

In some specific application fields, especially image processing, specific feature extraction methods are often designed to obtain better experimental results. Arriaga et al. [18] proposed two RP-based methods: sliding window RP (SWRP) and corner RP (CRP). In SWRP, a corresponding location in the projected images is filled with a color generated from a random combination of colors in a window that slides over the original image (see Figure 4a). Motivated by the intuition that human visual systems are feature detectors for colors, lines, corners, and so on, the features from accelerated segment test (FAST) [77] corner detector is adopted in CRP to detect features in images. The projected image can be represented as a combination of multiple corner images (see Figure 4b). The authors indicated that SWRP performs better than CRP in terms of classification accuracy; neural networks without SWRP have similar classification accuracy but much lower computational cost than neural networks with SWRP. However, the features extracted from SWRP and CRP are not rotation-invariant because neither introduces multi-scale features.

By taking advantage of the fact that RP is insensitive to noise in images, Qin et al. [19] proposed a method to suppress clutter while enhancing targets in infrared images (see Figure 5). First, a signal decomposition algorithm named NSP [41] is adopted to decompose an image into a single low-frequency subband and multiple high-frequency subbands. After K-scale NSP decomposition, K + 1 subbands, including one low-frequency subband and K high-frequency subbands, are produced. The high-frequency subbands mainly contain targets and minimally cluttered backgrounds, so it is easier to extract target information from the high-frequency subbands. To preserve as much target information as possible, a 3D image cube is constructed by concatenating all the high-frequency subbands. The cube model can be expressed as

$$F_{K \times M} = (f_1, f_2, \ldots, f_k, \ldots, f_K)^T \tag{6}$$

where $f_k$ is the row vector representation of a subband and M is the number of pixels of the k-th high-frequency subband. Then, RP is used to project the 3D image cube $F_{K \times M}$ onto a low-dimensional subspace $Q_{S \times M}$ to reduce the spatial redundancy of the target and background information. Finally, the Mahalanobis distance [78] is applied to remove background clutter in the dimensionality-reduced high-frequency subbands obtained above. Compared to other state-of-the-art methods for the background suppression of infrared small-target images, the proposed model outperforms max-median [79], morphological (top-hat) [80], phase spectrum of quaternion Fourier transform (PQFT) [81] and wavelet transform (WRX) [82] in terms of signal-to-clutter ratio gain (SCRG) and background suppression factor (BSF) [83].

Feng et al. [20] developed a new approach to detect hyperspectral targets using the CEM method [42]. The objective of CEM is to design a finite impulse response (FIR) linear filter with L filter coefficients, and the FIR filter can be represented using an L-dimensional vector $w = \{w_1, w_2, \ldots, w_L\}$. Let $r_i$ ($i = 1, 2, \ldots, n$) be an L-dimensional sample pixel vector. The optimized target detector $w^*$ can be given as

$$w^* = \frac{R^{-1}_{L \times L} d}{d^T R^{-1}_{L \times L} d} \tag{7}$$

where $R_{L \times L} = (1/n) \sum_{i=1}^{n} r_i r_i^T$ is the sample autocorrelation matrix of the target and d denotes the spectral signature of the target. RP is incorporated to resolve the issue of

[Figure 4 panels: (a) Sliding window random projection; (b) Corner random projection]

Fig. 4: Description of (a) sliding window random projection and (b) corner random projection. In sliding window random projection, features are represented by a random combination of colors within a sliding window. In corner random projection, the features are represented by corners of objects.
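The sliding window idea in Fig. 4(a) can be sketched as follows. This is a loose illustration rather than the method of Arriaga et al. [18]: the window size, the stride, the Gaussian weights, and the choice to reuse one random matrix for every window are all assumptions made here. The function name `sliding_window_rp` is hypothetical.

```python
import numpy as np


def sliding_window_rp(image, window=4, stride=4, seed=0):
    """Map each window of an H x W x C image to one output color.

    Every output pixel is a random linear combination of the colors inside
    the corresponding window; the same random matrix is shared by all
    windows (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    n_in = window * window * c
    W = rng.normal(size=(n_in, c)) / np.sqrt(n_in)
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w, c))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + window, left:left + window, :]
            out[i, j] = patch.reshape(-1) @ W  # random combination of colors
    return out


rng = np.random.default_rng(3)
img = rng.random((32, 32, 3))
small = sliding_window_rp(img)
print(small.shape)  # (8, 8, 3)
```

With non-overlapping 4 x 4 windows, the projected image has 16 times fewer pixels than the input, which is where any speed-up for a downstream classifier would come from.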

[Figure 5 diagram: Original Image → NSP Decomposition (one low-frequency subband and multiple high-frequency subbands) → Information Compression by Random Projection → Mahalanobis Distance → Target Image]

Fig. 5: Flowchart of the background suppression method, where NSP is used for feature extraction and RP is used for information compression.

the curse of dimensionality in hyperspectral imagery. The noise suppression effect of RP is superior to that of the maximum-noise-fraction (MNF) [84] and PCA, and therefore the CEM method with dimensionality reduction by RP (RP-CEM) outperforms MNF-CEM and PCA-CEM in terms of both detection accuracy and computation time. The drawback of CEM is that it can detect only a single target because CEM treats undesired targets as interference and does not make full use of the known information. The target-constrained interference-minimized filter (TCIMF) [85] was developed to address this issue. It assumes the pixels of an image are composed of three separate signal sources, D (desired targets), U (undesired targets), and I (interference), whereas CEM simply treats U as a part of I. Let $D = [d_1, d_2, \ldots, d_p]$ and $U = [u_1, u_2, \ldots, u_q]$ denote the desired-target signature and undesired-target signature, respectively. The spectral signature of the target d in Eq. 7 can be replaced with [D, U]. The optimal weight vector $w^*$ can be formulated as

$$w^* = R^{-1}_{L \times L} [D\ U] \left( [D\ U]^T R^{-1}_{L \times L} [D\ U] \right)^{-1} \begin{bmatrix} 1_{p \times 1} \\ 0_{q \times 1} \end{bmatrix} \tag{8}$$

where $1_{p \times 1}$ is a p × 1 column vector with ones in its components that is used to constrain the desired targets in D. Similarly, $0_{q \times 1}$ is a q × 1 column vector with zeros in all components that is used to suppress the undesired targets in U. Analogous to Feng et al. [20], Du et al. [21] conducted target detection by TCIMF, where RP is used for dimensionality reduction. Not only does RP reduce the computational complexity, but it also improves the target-detection accuracy by decision fusion across multiple RP instances. The experimental results demonstrated that the detection accuracy of RP-TCIMF with a single run of RP is slightly lower than that of TCIMF, but it can be further improved by performing multiple runs.

Subspace-based spectrum estimation is often used for wideband spectrum sensing. However, subspace-based techniques, which require eigendecomposition of the original data, are computationally expensive. To alleviate this issue, Majee et al. [22] applied low-rank approximation before multiple signal classification (MUSIC) [86], which is employed for wideband spectrum sensing. The low-rank approximation technique used in this work is based on Cholesky factorization (CF) [87] and RP. Suppose $X \in \mathbb{R}^{n \times n}$ is an arbitrary square matrix. First, RP is performed as $\hat{X} = XW$, where $W \in \mathbb{R}^{n \times k}$ is a random projection matrix. Then, $\Phi^T$ is filled with k left singular vectors of $\hat{X}$; thus, $\hat{X}$ can be estimated as $\hat{X}' = \Phi X \Phi^T$. Next, CF is applied as $\hat{X}' = LL^T$, and D can be calculated as $D = \hat{X}' \Phi^T (L^T)^{-1}$. SVD is then applied as $D = U \Sigma V^T$, and the rank-k approximation of X can be obtained as $X_{LR} = U \Sigma^2 V^T$. The authors concluded that the proposed method achieves a marginal reduction in time complexity. In addition, the spectrum sensing performance of the proposed method is comparable or superior to that of MUSIC in terms of the probability of alarm.

III. DIMENSIONALITY INCREASING APPROACHES

In some studies, researchers take the diametrically opposite approach. They first map the original dataset onto a higher-dimensional feature space to better represent the original features. Then, RP is applied to reduce the dimensionality and computational cost.

Ma et al. [23] proposed a robust method for face recognition, as illustrated in Figure 6. In the proposed method, multi-radius local binary pattern (MLBP) is first adopted to incorporate more structural facial features. Then, a high-dimension multiscale and multi-radius LBP (MMLBP) space $P \in \mathbb{R}^{L \times m}$ is obtained by convolving rectangle filters. The rectangle filters of a given face image with width w and height h can be defined as follows

$$p_i^{u_0, v_0} = \begin{cases} 1, & u_0 \le u \le w_i,\ v_0 \le v \le h_i \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

where $u_0$ and $v_0$ represent the offset coordinates of filter $p_i$ with width $w_i$ and height $h_i$. There are approximately $L = m^2 = (wh)^2$ (i.e., the number of possible locations times the number of possible scales) exhaustive rectangle filters for each face image. Generally, a large multiscale rectangle filter matrix P can be obtained by stacking the $p_i$ together, where $p_i \in \mathbb{R}^{1 \times m}$ can be formulated as a row vector whose dimensionality is equal to w × h. Since L is usually between $10^6$ and $10^{10}$, sparse RP [88] is used to perform dimensionality reduction. The subspace can be formulated as $M_{n \times m} = W_{n \times L} P_{L \times m}$, where W is a transformation matrix generated by RP. The proposed method not only achieves a higher recognition rate but also shows better robustness to corruption, occlusion, and disguise compared to Randomface [89] and Eigenfaces [90], even in low-dimensional spaces.

Multiple studies have focused on visual tracking [25], [26], [27], where RP is favored by researchers because of its computational efficiency and data independence. Zhang et al. [26], [27] proposed a tracking framework, named CT tracker, where rectangle filters, following Eq. 9, are adopted to form a very high-dimensional multiscale image feature vector and RP is applied to compress samples of foreground targets and the background. The tracking window in the first frame is determined manually. To predict the target in the subsequent frame, positive samples are taken near the current target while negative samples are taken far from the current target. Both positive and negative samples are used to update the Bayes classifier. However, rectangle filters are sensitive to the presence of a few outliers in the samples. To overcome this problem, a more stable model was proposed by Gao et al. [91], where the rectangle filters are replaced by maximally stable extremal regions (MSERs) [92]. To robustly adapt to variations in target appearance, a least squares support vector machine (LS-SVM) [93] is employed. The authors stated that the proposed tracker outperforms the CT-based tracker [27] in terms of mean distance precision and mean frame rate (FPS). Nevertheless, MSER features perform poorly on blurred images [94]. Therefore, the performance of the proposed tracking method may be seriously affected on blurred videos.

Recent studies have revealed that ELM performs well in regression and classification problems [33]. Additionally, compared to traditional algorithms, such as back-propagation (BP) and support vector machine (SVM), ELM provides not only good generalization performance but also fast learning speed. Alshamiri et al. [24] performed RP in conjunction with ELM for low-dimensional data classification. The method consists of two phases. First, the original data $X \in \mathbb{R}^{n \times d}$ are projected onto an L-dimensional subspace ($L \gg d$) using ELM. The linear separability of the data is often increased by mapping the data onto a high-dimensional ELM feature space. Then, RP is applied to reduce the dimensionality of the ELM feature space. There is a slight improvement in the classification accuracy of ELM-RP compared to that of ELM. Since ELM consists of a single hidden layer, it is difficult for it to encode complex features and achieve satisfactory accuracy. Also, on small-n-large-p datasets, ELM is prone to overfitting because of the lack of training samples [95].

IV. ENSEMBLE APPROACHES

Inspired by the fact that ensemble methods can significantly improve the performance of weak classifiers, several algorithms were proposed based on ensembles of decision trees. Well-known methods include random forest [36] and AdaBoost [37]. Several studies have shown that ensembles of multiple RP instances help RP to produce more stable results.

Schclar et al. [28] proposed an ensemble method based on RP and nearest-neighbor (NN) inducers [96]. First, K random matrices are generated by RP. Then, K training sets are constructed for the ensemble classifiers by applying the random matrices to the original dataset. Next, K NN classifiers are trained, and the final classification result is produced via a voting scheme. The proposed method is more accurate than the non-ensemble NN classifier.

Zhang et al. [29] proposed an RP ensemble method that is analogous to the work of Schclar et al. [28] for drug-target interaction prediction, where the “PaDEL-Descriptor” [97] is adopted as a feature detector. The authors stated that the proposed drug-target interaction prediction method improves the accuracy by 4.5%-8.2% compared to the work of [98]. Similarly, Yoshioka et al. [30] proposed an RP ensemble method for dysarthric speech recognition, in which an automatic speech recognition (ASR) system is adopted as a feature detector. Compared to the PCA-based feature projection method, the method proposed in [30] improves the recognition rate by 5.23%.

Gondara [31] also proposed an ensemble classifier using RP. In contrast to the method proposed by Schclar et al. [28],

[Figure 6 diagram. Panel labels: Original Image; MLBP Representation; Dimensionality Increasing (fused map, MMLBP space); Dimensionality Reduction (RP, feature space).]

Fig. 6: Overview of a robust face recognition method. The facial features are mapped onto a very high-dimensional space by rectangle filters. Then, RP is used to compress features into a low-dimensional subspace.

where the RP matrices are applied to the same feature space, in this method, the RP matrices are applied to random subsets of the original feature set. The authors demonstrated that the proposed method performs equally well or better than random forest and AdaBoost in terms of classification accuracy.

To alleviate the high distortion in a single run of a clustering algorithm in the feature spaces produced by RP, Fern et al. [10] investigated how RP could best be used for clustering and proposed a cluster ensemble approach based on expectation-maximization (EM) [99] and RP. In the proposed ensemble approach, EM generates a probabilistic model $\theta$ of a mixture of $k$ Gaussian distributions after each RP run. The final clusters are aggregated by measuring the similarity of clusters in different clustering results, where the similarity of two clusters $c_i$ and $c_j$ can be defined as

$$\mathrm{sim}(c_i, c_j) = \min_{p_i \in c_i,\; p_j \in c_j} P^{\theta}_{ij} \tag{10}$$

where $P^{\theta}_{ij}$ denotes the probability of data points $i$ and $j$ belonging to the same cluster, which can be calculated as

$$P^{\theta}_{ij} = \sum_{l=1}^{k} P(l \mid i, \theta)\, P(l \mid j, \theta) \tag{11}$$

The authors noted that the proposed ensemble method (RP + EM) is more robust and produces better clusters than PCA + EM. However, the EM algorithm often gets stuck in a local optimum, even on perfect datasets [100].

V. CONCLUSIONS AND FUTURE PERSPECTIVES

RP is an efficient and powerful dimensionality reduction technique that has been developed and matured during the past 15 years. With the rapid increase of big data, RP has provided tangible benefits to counteract the burdensome computational requirements and has met the needs of real-time processing in some situations. Despite the fact that RP is computationally efficient, it often introduces relatively high distortion. To solve this problem, major successful efforts have been made to improve the performance of RP, as summarized in this survey.

The features cannot be well-represented in the high-dimensional space produced by extant methods, such as rectangle filters and ELM. Therefore, the proposal of accurate dimensionality increasing approaches constitutes a promising direction for future RP research.

Other opportunities for future RP research will be the extension towards upcoming tasks that require real-time computation, such as speech, voice, image and video recognition. In these tasks, we believe RP will play an important role in handling the high-dimensional characteristics of these applications.

ACKNOWLEDGMENT

This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61471147, 61371179), the Natural Science Foundation of Heilongjiang Province (Grant No. F2016016), the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2017037), and the National Key Research and Development Program of China (Grant No. 2016YFC0901905).

REFERENCES

[1] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
[2] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[3] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[4] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers, “Fisher discriminant analysis with kernels,” in Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE, 1999, pp. 41–48.
[5] S. Dasgupta, “Experiments with random projection,” in Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2000, pp. 143–151.
[6] W. B. Johnson and J. Lindenstrauss, “Extensions of lipschitz mappings into a hilbert space,” Contemporary mathematics, vol. 26, no. 189-206, p. 1, 1984.
[7] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001, pp. 245–250.

[8] H. Xie, J. Li, Q. Zhang, and Y. Wang, “Comparison among dimensionality reduction techniques based on random projection for cancer classification,” Computational Biology and Chemistry, vol. 65, pp. 165–172, 2016.
[9] R. Zhao and K. Mao, “Semi-random projection for dimensionality reduction and extreme learning machine in high-dimensional space,” IEEE Computational Intelligence Magazine, vol. 10, no. 3, pp. 30–41, 2015.
[10] X. Z. Fern and C. E. Brodley, “Random projection for high dimensional data clustering: A cluster ensemble approach,” in ICML, vol. 3, 2003, pp. 186–193.
[11] T. Sakai and A. Imiya, “Fast spectral clustering with random projection and sampling,” in International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 2009, pp. 372–384.
[12] S. Tasoulis, L. Cheng, N. Välimäki, N. J. Croucher, S. R. Harris, W. P. Hanage, T. Roos, and J. Corander, “Random projection based clustering for population genomics,” in Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 2014, pp. 675–682.
[13] S. Wan, M.-W. Mak, and S.-Y. Kung, “R3p-loc: A compact multi-label predictor using ridge regression and random projection for protein subcellular localization,” Journal of theoretical biology, vol. 360, pp. 34–45, 2014.
[14] Q. Sun, H. Zhu, Y. Liu, and J. G. Ibrahim, “Sprem: sparse projection regression model for high-dimensional linear regression,” Journal of the American Statistical Association, vol. 110, no. 509, pp. 289–302, 2015.
[15] M. J. Schneider and S. Gupta, “Forecasting sales of new and existing products using consumer reviews: A random projections approach,” International Journal of Forecasting, vol. 32, no. 2, pp. 243–256, 2016.
[16] L. Liu, P. Fieguth, D. Clausi, and G. Kuang, “Sorted random projections for robust rotation-invariant texture classification,” Pattern Recognition, vol. 45, no. 6, pp. 2405–2418, 2012.
[17] R. Wu, S. Yang, D. Leng, Z. Luo, and Y. Wang, “Random projected convolutional feature for scene text recognition,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 132–137.
[18] R. I. Arriaga, D. Rutter, M. Cakmak, and S. S. Vempala, “Visual categorization with random projection,” Neural computation, 2015.
[19] H. Qin, J. Han, X. Yan, J. Li, H. Zhou, J. Zong, B. Wang, and Q. Zeng, “Multiscale random projection based background suppression of infrared small target image,” Infrared Physics & Technology, vol. 73, pp. 255–262, 2015.
[20] W. Feng, Q. Chen, W. He, G. R. Arce, G. Gu, and J. Zhuang, “Random projection-based dimensionality reduction method for hyperspectral target detection,” in SPIE Optical Engineering+ Applications. International Society for Optics and Photonics, 2015, pp. 961 117–961 117.
[21] Q. Du, J. E. Fowler, and B. Ma, “Random-projection-based dimensionality reduction and decision fusion for hyperspectral target detection,” in Geoscience and Remote Sensing Symposium (IGARSS), 2011 IEEE International. IEEE, 2011, pp. 1790–1793.
[22] S. Majee, P. Ray, and Q. Cheng, “Efficient wideband spectrum sensing using random projection,” in Signals, Systems and Computers, 2015 49th Asilomar Conference on. IEEE, 2015, pp. 141–145.
[23] C. Ma, J.-Y. Jung, S.-W. Kim, and S.-J. Ko, “Random projection-based partial feature extraction for robust face recognition,” Neurocomputing, vol. 149, pp. 1232–1244, 2015.
[24] A. K. Alshamiri, A. Singh, and B. R. Surampudi, “Combining elm with random projections for low and high dimensional data classification and clustering,” in Proceedings of the Fifth International Conference on Fuzzy and Neuro Computing (FANCCO-2015). Springer, 2015, pp. 89–107.
[25] D. Shan and Z. Chao, “Improved ℓ1-tracker using robust pca and random projection,” Machine Vision and Applications, vol. 27, no. 4, pp. 577–583, 2016.
[26] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” in European Conference on Computer Vision. Springer, 2012, pp. 864–877.
[27] K. Zhang, L. Zhang, and M.-H. Yang, “Fast compressive tracking,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 10, pp. 2002–2015, 2014.
[28] A. Schclar and L. Rokach, “Random projection ensemble classifiers,” in International Conference on Enterprise Information Systems. Springer, 2009, pp. 309–316.
[29] J. Zhang, M. Zhu, P. Chen, and B. Wang, “Drugrpe: Random projection ensemble approach to drug-target interaction prediction,” Neurocomputing, vol. 228, pp. 256–262, 2017.
[30] T. Yoshioka, T. Takiguchi, and Y. Ariki, “Evaluation of random-projection-based feature combination on dysarthric speech recognition,” American Journal of Signal Processing, vol. 3, no. 3, pp. 41–48, 2013.
[31] L. Gondara, “Rpc: An efficient classifier ensemble using random projections,” in Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on. IEEE, 2015, pp. 559–564.
[32] H. Lu, B. Du, J. Liu, H. Xia, and W. K. Yeap, “A kernel extreme learning machine algorithm based on improved particle swam optimization,” Memetic Computing, vol. 9, no. 2, pp. 121–128, 2017.
[33] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
[34] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–I.
[35] J. Cao and Z. Lin, “Extreme learning machines on high dimensional and large data applications: a survey,” Mathematical Problems in Engineering, vol. 2015, 2015.
[36] A. Liaw and M. Wiener, “Classification and regression by randomforest,” R news, vol. 2, no. 3, pp. 18–22, 2002.
[37] Y. Freund, R. Schapire, and N. Abe, “A short introduction to boosting,” Journal-Japanese Society For Artificial Intelligence, vol. 14, no. 771-780, p. 1612, 1999.
[38] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, and F. Herrera, “A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 4, pp. 463–484, 2012.
[39] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
[40] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[41] A. L. Da Cunha, J. Zhou, and M. N. Do, “The nonsubsampled contourlet transform: theory, design, and applications,” IEEE transactions on image processing, vol. 15, no. 10, pp. 3089–3101, 2006.
[42] J. C. Harsanyi, “Detection and classification of subpixel spectral signatures in hyperspectral image sequences,” Ph.D. dissertation, University of Maryland Baltimore County Baltimore, MD, 1993.
[43] Student, “The probable error of a mean,” Biometrika, pp. 1–25, 1908.
[44] A. Antoniadis, S. Lambert-Lacroix, and F. Leblanc, “Effective dimension reduction methods for tumor classification using gene expression data,” Bioinformatics, vol. 19, no. 5, pp. 563–570, 2003.
[45] K. Lang, “Newsweeder: Learning to filter netnews,” in Proceedings of the 12th international conference on machine learning, 1995, pp. 331–339.
[46] J. H. Friedman, “Regularized discriminant analysis,” Journal of the American statistical association, vol. 84, no. 405, pp. 165–175, 1989.
[47] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet, “A direct formulation for sparse pca using semidefinite programming,” SIAM review, vol. 49, no. 3, pp. 434–448, 2007.
[48] M. Chen, Z. Xu, K. Q. Weinberger, and F. Sha, “Marginalized stacked denoising autoencoders,” in Proceedings of the Learning Workshop, Utah, UT, USA, vol. 36, 2012.
[49] H. Yan and Y. Dai, “The comparison of five discriminant methods,” in Management and Service Science (MASS), 2011 International Conference on. IEEE, 2011, pp. 1–4.
[50] A. Alavi, A. Wiliem, K. Zhao, B. C. Lovell, and C. Sanderson, “Random projections on manifolds of symmetric positive definite matrices for image classification,” in Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on. IEEE, 2014, pp. 301–308.
[51] A. I. Maqueda, A. Ruano, C. R. del Blanco, P. Carballeira, F. Jaureguizar, and N. García, “Novel multi-feature bag-of-words descriptor via subspace random projection for efficient human-action recognition,” in Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on. IEEE, 2015, pp. 1–6.
[52] J. Chevalier and A. Goolsbee, “Measuring prices and price competition online: Amazon.com and barnesandnoble.com,” Quantitative marketing and Economics, vol. 1, no. 2, pp. 203–222, 2003.
[53] N. Archak, A. Ghose, and P. G. Ipeirotis, “Deriving the pricing power of product features by mining consumer reviews,” Management Science, vol. 57, no. 8, pp. 1485–1509, 2011.

[54] X. Wang, F. Mai, and R. H. Chiang, “Database submission–market dynamics and user-generated content about tablet computers,” Marketing Science, vol. 33, no. 3, pp. 449–458, 2013.
[55] R. Kosala and H. Blockeel, “Web mining research: A survey,” ACM Sigkdd Explorations Newsletter, vol. 2, no. 1, pp. 1–15, 2000.
[56] T. Leung and J. Malik, “Representing and recognizing the visual appearance of materials using three-dimensional textons,” International journal of computer vision, vol. 43, no. 1, pp. 29–44, 2001.
[57] M. Varma and A. Zisserman, “A statistical approach to texture classification from single images,” International Journal of Computer Vision, vol. 62, no. 1-2, pp. 61–81, 2005.
[58] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: A comprehensive study,” International journal of computer vision, vol. 73, no. 2, pp. 213–238, 2007.
[59] M. Varma and A. Zisserman, “A statistical approach to material classification using image patch exemplars,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 11, pp. 2032–2047, 2009.
[60] L. Liu and P. Fieguth, “Texture classification from random features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 574–586, 2012.
[61] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on pattern analysis and machine intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[62] D. A. Clausi and H. Deng, “Design-based texture feature fusion using gabor filters and co-occurrence probabilities,” IEEE Transactions on Image Processing, vol. 14, no. 7, pp. 925–936, 2005.
[63] C.-F. Tsai, “Bag-of-words representation in image annotation: A review,” ISRN Artificial Intelligence, vol. 2012, 2012.
[64] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[65] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[66] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2, 2010, p. 3.
[67] J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover),” in Automatic Speech Recognition and Understanding, 1997. Proceedings., 1997 IEEE Workshop on. IEEE, 1997, pp. 347–354.
[68] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[69] G. H. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” Numerische mathematik, vol. 14, no. 5, pp. 403–420, 1970.
[70] P.-G. Martinsson, “Randomized methods for matrix computations and analysis of high dimensional data,” arXiv preprint arXiv:1607.01649, 2016.
[71] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge university press, 2012.
[72] J. A. Tropp, “Improved analysis of the subsampled randomized hadamard transform,” Advances in Adaptive Data Analysis, vol. 3, no. 01n02, pp. 115–126, 2011.
[73] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, “A fast randomized algorithm for the approximation of matrices,” Applied and Computational Harmonic Analysis, vol. 25, no. 3, pp. 335–366, 2008.
[74] W. Zhang, L. Zhang, R. Jin, D. Cai, and X. He, “Accelerated sparse linear regression via random projection,” in AAAI, 2016, pp. 2337–2343.
[75] L. Xiao and T. Zhang, “A proximal-gradient homotopy method for the sparse least-squares problem,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1062–1091, 2013.
[76] Y. Nesterov, “Gradient methods for minimizing composite functions,” Mathematical Programming, vol. 140, no. 1, pp. 125–161, 2013.
[77] E. Rosten and T. Drummond, “Machine learning for high-speed corner detection,” in European conference on computer vision. Springer, 2006, pp. 430–443.
[78] R. De Maesschalck, D. Jouan-Rimbaud, and D. L. Massart, “The mahalanobis distance,” Chemometrics and intelligent laboratory systems, vol. 50, no. 1, pp. 1–18, 2000.
[79] S. D. Deshpande, H. E. Meng, R. Venkateswarlu, and P. Chan, “Max-mean and max-median filters for detection of small targets,” in SPIE’s International Symposium on Optical Science, Engineering, and Instrumentation. International Society for Optics and Photonics, 1999, pp. 74–83.
[80] V. T. Tom, T. Peli, M. Leung, and J. E. Bondaryk, “Morphology-based algorithm for point target detection in infrared backgrounds,” in Proc. SPIE, vol. 1954, 1993, pp. 2–11.
[81] C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform,” in Computer vision and pattern recognition, 2008. cvpr 2008. ieee conference on. IEEE, 2008, pp. 1–8.
[82] I. Daubechies, “The wavelet transform, time-frequency localization and signal analysis,” IEEE transactions on information theory, vol. 36, no. 5, pp. 961–1005, 1990.
[83] C. Yang, J. Ma, S. Qi, J. Tian, S. Zheng, and X. Tian, “Directional support value of gaussian transformation for infrared small target detection,” Applied optics, vol. 54, no. 9, pp. 2255–2265, 2015.
[84] A. A. Green, M. Berman, P. Switzer, and M. D. Craig, “A transformation for ordering multispectral data in terms of image quality with implications for noise removal,” IEEE Transactions on geoscience and remote sensing, vol. 26, no. 1, pp. 65–74, 1988.
[85] H. Ren and C.-I. Chang, “Target-constrained interference-minimized approach to subpixel target detection for hyperspectral images,” Optical Engineering, vol. 39, no. 12, pp. 3138–3145, 2000.
[86] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE transactions on antennas and propagation, vol. 34, no. 3, pp. 276–280, 1986.
[87] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
[88] P. Li, T. J. Hastie, and K. W. Church, “Very sparse random projections,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 287–296.
[89] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[90] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[91] Y. Gao, X. Shan, Z. Hu, D. Wang, Y. Li, and X. Tian, “Extended compressed tracking via random projection based on msers and online ls-svm learning,” Pattern Recognition, vol. 59, pp. 245–254, 2016.
[92] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image and vision computing, vol. 22, no. 10, pp. 761–767, 2004.
[93] J. A. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle, “Weighted least squares support vector machines: robustness and sparse approximation,” Neurocomputing, vol. 48, no. 1, pp. 85–105, 2002.
[94] A. Śluzek, “Improving performances of mser features in matching and retrieval tasks,” in Computer Vision–ECCV 2016 Workshops. Springer, 2016, pp. 759–770.
[95] H. Zhong, C. Miao, Z. Shen, and Y. Feng, “Comparing the learning effectiveness of bp, elm, i-elm, and svm for corporate credit ratings,” Neurocomputing, vol. 128, pp. 285–295, 2014.
[96] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967.
[97] C. W. Yap, “Padel-descriptor: An open source software to calculate molecular descriptors and fingerprints,” Journal of computational chemistry, vol. 32, no. 7, pp. 1466–1474, 2011.
[98] Z. He, J. Zhang, X.-H. Shi, L.-L. Hu, X. Kong, Y.-D. Cai, and K.-C. Chou, “Predicting drug-target interaction networks based on functional groups and biological features,” PloS one, vol. 5, no. 3, p. e9603, 2010.
[99] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society. Series B (methodological), pp. 1–38, 1977.
[100] Y. Wang and N. L. Zhang, “Severity of local maxima for the em algorithm: Experiences with hierarchical latent class models,” in Probabilistic Graphical Models, 2006, pp. 301–308.
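
As a supplementary illustration of the RP ensemble classification scheme attributed to Schclar et al. [28] in the discussion of ensemble approaches (K random matrices, K projected training sets, K NN classifiers, and a final vote), the following is a minimal sketch and not the authors' implementation. The Gaussian random matrix, the 1-NN rule, the plain majority vote, the synthetic data, and the names `rp_ensemble_1nn`, `n_members`, and `target_dim` are all assumptions made for this example.

```python
import numpy as np


def rp_ensemble_1nn(X_train, y_train, X_test, n_members=11, target_dim=10, seed=0):
    """Majority vote over 1-NN classifiers, each working in its own random projection.

    Hypothetical helper written for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    classes, y_idx = np.unique(y_train, return_inverse=True)
    votes = np.zeros((X_test.shape[0], classes.size), dtype=int)
    rows = np.arange(X_test.shape[0])

    for _ in range(n_members):
        # Step 1: draw one random matrix (Gaussian entries are an assumed choice).
        R = rng.standard_normal((n_features, target_dim)) / np.sqrt(target_dim)
        # Step 2: build this member's training set by projecting the original data.
        Z_train, Z_test = X_train @ R, X_test @ R
        # Step 3: 1-NN prediction using squared Euclidean distances.
        d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        votes[rows, y_idx[nearest]] += 1

    # Step 4: combine the members by majority vote.
    return classes[votes.argmax(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Two well-separated Gaussian blobs in 200 dimensions (synthetic data).
    X = np.vstack([rng.standard_normal((60, 200)),
                   rng.standard_normal((60, 200)) + 3.0])
    y = np.array([0] * 60 + [1] * 60)
    perm = rng.permutation(120)
    train, test = perm[:80], perm[80:]
    pred = rp_ensemble_1nn(X[train], y[train], X[test])
    print("accuracy:", (pred == y[test]).mean())
```

Each member sees the same samples through a different projection, so individual projection errors tend to average out in the vote. Applying each projection to a random subset of the features instead, as the survey describes for Gondara [31], would change only Step 2.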