Introduction to Machine Learning (67577) Lecture 11


Introduction to Machine Learning (67577), Lecture 11: Dimensionality Reduction. Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem.

Dimensionality Reduction
Dimensionality reduction = taking data in a high-dimensional space and mapping it into a low-dimensional space.
Why?
• Reduces training (and testing) time
• Reduces estimation error
• Interpretability of the data, finding meaningful structure in data, illustration
Linear dimensionality reduction: $x \mapsto Wx$ where $W \in \mathbb{R}^{n,d}$ and $n < d$.

Outline
1 Principal Component Analysis (PCA)
2 Random Projections
3 Compressed Sensing

Principal Component Analysis (PCA)
$x \mapsto Wx$. What makes $W$ a good matrix for dimensionality reduction?
Natural criterion: we want to be able to approximately recover $x$ from $y = Wx$.
PCA:
• Linear recovery: $\tilde{x} = Uy = UWx$
• Measures "approximate recovery" by the averaged squared norm: given examples $x_1, \ldots, x_m$, solve
$$\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \| x_i - U W x_i \|^2 .$$
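As a quick illustration of this criterion (not part of the lecture; the function name and the random data are my own), the objective can be evaluated directly for any candidate pair $(W, U)$; PCA then picks the pair that minimises it.

```python
import numpy as np

def reconstruction_error(X, W, U):
    """Sum of ||x_i - U W x_i||^2 over the rows x_i of X.
    Shapes: X is (m, d), W is (n, d), U is (d, n)."""
    X_rec = X @ W.T @ U.T          # row i equals (U W x_i)^T
    return float(np.sum((X - X_rec) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))     # m = 100 examples in d = 50 dimensions
W = rng.normal(size=(5, 50))       # an arbitrary (not optimal) candidate pair
U = rng.normal(size=(50, 5))
print(reconstruction_error(X, W, U))
```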
Solving the PCA Problem
$$\operatorname*{argmin}_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \| x_i - U W x_i \|^2$$
Theorem. Let $A = \sum_{i=1}^{m} x_i x_i^\top$ and let $u_1, \ldots, u_n$ be the $n$ leading eigenvectors of $A$. Then the solution to the PCA problem is to set the columns of $U$ to be $u_1, \ldots, u_n$ and to set $W = U^\top$.
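A direct NumPy sketch of the theorem (the function name and the convention that the rows of X hold the examples are my own choices):

```python
import numpy as np

def pca_fit(X, n):
    """Columns of U are the n leading eigenvectors of A = sum_i x_i x_i^T;
    W = U^T.  X has shape (m, d), one example per row."""
    A = X.T @ X                            # equals sum_i x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :n]            # n leading eigenvectors as columns
    return U.T, U                          # (W, U)

# Usage: compress with Y = X @ W.T, approximately reconstruct with Y @ U.T.
```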
Proof main ideas
• $UW$ is of rank $n$, therefore its range is an $n$-dimensional subspace, denoted $S$.
• The transformation $x \mapsto UWx$ moves $x$ to this subspace.
• The point in $S$ which is closest to $x$ is $VV^\top x$, where the columns of $V$ are an orthonormal basis of $S$.
• Therefore, we can assume w.l.o.g. that $W = U^\top$ and that the columns of $U$ are orthonormal.

Proof main ideas (continued)
Observe:
$$\|x - UU^\top x\|^2 = \|x\|^2 - 2x^\top UU^\top x + x^\top UU^\top UU^\top x = \|x\|^2 - x^\top UU^\top x = \|x\|^2 - \operatorname{trace}(U^\top x x^\top U).$$
Therefore, an equivalent PCA problem is
$$\operatorname*{argmax}_{U \in \mathbb{R}^{d,n} :\, U^\top U = I} \; \operatorname{trace}\!\left( U^\top \left( \sum_{i=1}^{m} x_i x_i^\top \right) U \right).$$
The solution is to set $U$ to be the $n$ leading eigenvectors of $A = \sum_{i=1}^{m} x_i x_i^\top$.

Value of the objective
It is easy to see that
$$\min_{W \in \mathbb{R}^{n,d},\, U \in \mathbb{R}^{d,n}} \sum_{i=1}^{m} \| x_i - U W x_i \|^2 = \sum_{i=n+1}^{d} \lambda_i(A),$$
where $\lambda_1(A) \ge \cdots \ge \lambda_d(A)$ are the eigenvalues of $A$.

Centering
It is a common practice to "center" the examples before applying PCA, namely:
• First calculate $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$.
• Then apply PCA to the vectors $(x_1 - \mu), \ldots, (x_m - \mu)$.
This is also related to the interpretation of PCA as variance maximization (will be given in exercise).

Efficient implementation for $d \gg m$ and kernel PCA
Recall: $A = \sum_{i=1}^{m} x_i x_i^\top = X^\top X$, where $X \in \mathbb{R}^{m,d}$ is the matrix whose $i$'th row is $x_i^\top$.
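The excerpt stops here. A standard way the $d \gg m$ case is usually handled (an assumption on my part, since the slide is cut off) is to eigendecompose the $m \times m$ Gram matrix $B = XX^\top$ instead of the $d \times d$ matrix $A$: if $Bv = \lambda v$ with $\lambda > 0$, then $u = X^\top v / \sqrt{\lambda}$ is a unit eigenvector of $A$ with the same eigenvalue. A sketch of that route, including the centering step above:

```python
import numpy as np

def pca_fit_gram(Xc, n):
    """PCA via the m x m Gram matrix B = X X^T, useful when d >> m.
    Xc holds the already-centered examples as rows (shape (m, d)).
    A sketch under the assumption stated above; the names are mine."""
    B = Xc @ Xc.T                                      # m x m instead of d x d
    lam, V = np.linalg.eigh(B)                         # ascending eigenvalues
    lam, V = lam[::-1][:n], V[:, ::-1][:, :n]          # n leading eigenpairs
    U = (Xc.T @ V) / np.sqrt(np.maximum(lam, 1e-12))   # columns u_i = X^T v_i / sqrt(lam_i)
    return U.T, U                                      # (W, U)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10_000))                      # m = 50 examples, d = 10,000
mu = X.mean(axis=0)                                    # centering, as on the slide
W, U = pca_fit_gram(X - mu, n=5)
```

Replacing the inner products in $B$ with kernel evaluations is what gives kernel PCA, which is presumably where the truncated slide is heading.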
Recommended publications
  • Experiments with Random Projection
Experiments with Random Projection. Sanjoy Dasgupta, AT&T Labs – Research.
Abstract: Recent theoretical work has identified random projection as a promising dimensionality reduction technique for learning mixtures of Gaussians. Here we summarize these results and illustrate them by a wide variety of experiments on synthetic and real data.
From the introduction: It has recently been suggested that the learning of high-dimensional mixtures of Gaussians might be facilitated by first projecting the data into a randomly chosen subspace of low dimension (Dasgupta, 1999). In this paper we present a comprehensive series of experiments intended to precisely illustrate the benefits of this technique. [...] (in a PAC-like sense) algorithm for learning mixtures of Gaussians (Dasgupta, 1999). Random projection can also easily be used in conjunction with EM. To test this combination, we have performed experiments on synthetic data from a variety of Gaussian mixtures. In these, EM with random projection is seen to consistently yield models of quality (log-likelihood on a test set) comparable to or better than that of models found by regular EM. And the reduction in dimension saves a lot of time. Finally, we have used random projection to construct a classifier for handwritten digits, from a canonical USPS data set in which each digit is represented as a vector in R^256. We projected the training data randomly into R^40, and were able to fit a mixture of fifty Gaussians (five per digit) to this data quickly and easily, without any tweaking or covariance restrictions. The details of the experiment directly corroborated our theoretical results.
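A toy version of this kind of experiment can be put together with NumPy and scikit-learn; the dimensions, component counts and the use of the adjusted Rand index below are my choices, not the paper's protocol.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
d, k, n_per = 256, 5, 200                     # toy sizes, not the paper's setup
means = rng.normal(scale=6.0, size=(k, d))
X = np.vstack([rng.normal(loc=m, size=(n_per, d)) for m in means])
labels = np.repeat(np.arange(k), n_per)

target_dim = 40                               # as in the digits experiment quoted above
R = rng.normal(scale=1.0 / np.sqrt(target_dim), size=(d, target_dim))
X_low = X @ R                                 # random projection into R^40

# Diagonal covariances in the original space (full ones would be heavily
# over-parameterised here), unrestricted full covariances after the projection.
for name, data, cov in [("original", X, "diag"), ("projected", X_low, "full")]:
    gm = GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(data)
    print(name, adjusted_rand_score(labels, gm.predict(data)))
```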
  • Random Projections for Machine Learning and Data Mining
Random Projections for Machine Learning and Data Mining: Theory and Applications. Robert J. Durrant & Ata Kabán, University of Birmingham. {r.j.durrant, a.kaban}@cs.bham.ac.uk, www.cs.bham.ac.uk/~durranrj, sites.google.com/site/rpforml. ECML-PKDD 2012, Friday 28th September 2012.
Outline:
1 Background and Preliminaries
2 Johnson-Lindenstrauss Lemma (JLL) and extensions
3 Applications of JLL (1): Approximate Nearest Neighbour Search, RP Perceptron, Mixtures of Gaussians, Random Features
4 Compressed Sensing: SVM from RP sparse data
5 Applications of JLL (2): RP LLS Regression, Randomized low-rank matrix approximation, Randomized approximate SVM solver
6 Beyond JLL and Compressed Sensing: Compressed FLD, Ensembles of RP
Motivation: the "curse of dimensionality" is a collection of pervasive, and often counterintuitive, issues associated with working with high-dimensional data. What constitutes high-dimensional depends on the problem setting, but data vectors with arity in the thousands are very common in practice. An obvious way to mitigate the curse: dimensionality d is too large, so reduce d to k << d.
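As a concrete warm-up for the JLL part of that outline, here is a quick numerical check (my own toy code, not the tutorial's) that a Gaussian random projection into k dimensions roughly preserves pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 200, 5000, 400
X = rng.normal(size=(m, d))
R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))   # E ||R^T x||^2 = ||x||^2

def sq_dists(A):
    """All pairwise squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

i, j = np.triu_indices(m, 1)
ratio = sq_dists(X @ R)[i, j] / sq_dists(X)[i, j]
print(ratio.min(), ratio.max())   # concentrates around 1 as k grows, as the JLL predicts
```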
  • Targeted Random Projection for Prediction from High-Dimensional Features
Targeted Random Projection for Prediction from High-Dimensional Features. Minerva Mukhopadhyay and David B. Dunson (arXiv:1712.02445v1 [math.ST], 6 Dec 2017).
Abstract: We consider the problem of computationally-efficient prediction from high dimensional and highly correlated predictors in challenging settings where accurate variable selection is effectively impossible. Direct application of penalization or Bayesian methods implemented with Markov chain Monte Carlo can be computationally daunting and unstable. Hence, some type of dimensionality reduction prior to statistical analysis is in order. Common solutions include application of screening algorithms to reduce the regressors, or dimension reduction using projections of the design matrix. The former approach can be highly sensitive to threshold choice in finite samples, while the latter can have poor performance in very high-dimensional settings. We propose a TArgeted Random Projection (TARP) approach that combines positive aspects of both strategies to boost performance. In particular, we propose to use information from independent screening to order the inclusion probabilities of the features in the projection matrix used for dimension reduction, leading to data-informed sparsity. We provide theoretical support for a Bayesian predictive algorithm based on TARP, including both statistical and computational complexity guarantees. Examples for simulated and real data applications illustrate gains relative to a variety of competitors. Some key words: Bayesian; Dimension reduction; High-dimensional; Large p, small n; Random projection; Screening. Short title: Targeted Random Projection.
From the introduction: In many applications, the focus is on prediction of a response variable y given a massive-dimensional vector of predictors x = (x1, x2, ...
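One loose way to read the TARP idea in code (a sketch under my own simplifications, not the authors' algorithm): use marginal screening statistics to set feature-inclusion probabilities in a sparse random projection, then regress on the projected features.

```python
import numpy as np

def targeted_random_projection(X, y, k, rng):
    """Sketch: feature-inclusion probabilities proportional to |marginal correlation|,
    so informative features are more likely to enter each projection column."""
    corr = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])   # p marginal scores
    probs = corr / corr.sum()
    p = X.shape[1]
    R = np.zeros((p, k))
    for j in range(k):
        keep = rng.random(p) < np.minimum(1.0, k * probs)     # data-informed sparsity
        R[keep, j] = rng.normal(size=keep.sum())
    return R

rng = np.random.default_rng(1)
n, p = 100, 2000
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:10] = 3.0                            # 10 truly relevant features
y = X @ beta + rng.normal(size=n)
R = targeted_random_projection(X, y, k=20, rng=rng)
Z = X @ R                                                      # reduced design matrix
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)                   # then any low-dim regression
```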
  • Bilinear Random Projections for Locality-Sensitive Binary Codes
Bilinear Random Projections for Locality-Sensitive Binary Codes. Saehoon Kim and Seungjin Choi, Department of Computer Science and Engineering, Pohang University of Science and Technology, Korea. {kshkawa,seungjin}@postech.ac.kr
Abstract: Locality-sensitive hashing (LSH) is a popular data-independent indexing method for approximate similarity search, where random projections followed by quantization hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Most high-dimensional visual descriptors for images exhibit a natural matrix structure. When visual descriptors are represented by high-dimensional feature vectors and long binary codes are assigned, a random projection matrix requires expensive complexities in both space and time. In this paper we analyze a bilinear random projection method where feature matrices are transformed to binary codes by two smaller random projection matrices. We base our theoretical analysis on extending Raginsky and Lazebnik's result where random Fourier features are composed with random binary quantizers to form locality sensitive binary codes.
[Figure 1: Two exemplary visual descriptors, which have a natural matrix structure, are often converted to long vectors. The top illustration describes LLC [15], where spatial information of initial descriptors is summarized into a final concatenated feature with spatial pyramid structure (feature extraction, pooling, concatenation). The bottom shows VLAD [7], where the residual between initial descriptors and their nearest visual vocabulary (marked as a triangle) is encoded in a matrix form (feature extraction, feature averaging, concatenation).]
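A minimal sketch of the bilinear idea (toy dimensions of my own choosing, not the authors' implementation): instead of one large projection applied to vec(Z), apply two small matrices, one per side of the descriptor matrix, and quantise the result to bits.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 128, 500              # descriptor matrix Z is d1 x d2
k1, k2 = 16, 64                # code length = k1 * k2 bits

R1 = rng.normal(size=(d1, k1))     # two small projection matrices instead of one
R2 = rng.normal(size=(d2, k2))     # huge (d1*d2) x (k1*k2) matrix

Z = rng.normal(size=(d1, d2))      # a stand-in visual descriptor with matrix structure
code = (R1.T @ Z @ R2).reshape(-1) > 0   # k1*k2-bit binary code via sign quantisation

# Storage for the projections: d1*k1 + d2*k2 numbers, versus d1*d2*k1*k2 for a full RP.
print(d1 * k1 + d2 * k2, "vs", d1 * d2 * k1 * k2)
```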
  • Large Scale Canonical Correlation Analysis with Iterative Least Squares
Large Scale Canonical Correlation Analysis with Iterative Least Squares. Yichao Lu, University of Pennsylvania, [email protected]; Dean P. Foster, Yahoo Labs, NYC, [email protected].
Abstract: Canonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.
From the introduction: Canonical Correlation Analysis (CCA) is a widely used spectrum method for finding correlation structures in multi-view datasets introduced by [15]. Recently, [3, 9, 17] proved that CCA is able to find the right latent structure under certain hidden state model. For modern machine learning problems, CCA has already been successfully used as a dimensionality reduction technique for the multi-view setting. For example, a CCA between the text description and image of the same object will find common structures between the two different views, which generates a natural vector representation of the object. In [9], CCA is performed on a large unlabeled dataset in order to generate low dimensional features to a regression problem where the size of labeled dataset is small. In [6, 7] a CCA between words and its context is implemented on several large corpora to generate low dimensional vector representations of words which captures useful semantic features.
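This is not the L-CCA algorithm itself, but a small alternating-least-squares sketch of the underlying idea that the leading canonical directions can be reached through repeated regressions; the ridge term, iteration count and function name are my additions.

```python
import numpy as np

def cca_first_pair(X, Y, iters=100, reg=1e-6):
    """First pair of canonical directions via alternating least squares.
    Assumes the columns of X and Y are already centred (a sketch, not L-CCA)."""
    v = np.random.default_rng(0).normal(size=Y.shape[1])
    for _ in range(iters):
        # regress Y v on X, then X u on Y, renormalising the canonical variates
        u = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ (Y @ v))
        u /= np.linalg.norm(X @ u)
        v = np.linalg.solve(Y.T @ Y + reg * np.eye(Y.shape[1]), Y.T @ (X @ u))
        v /= np.linalg.norm(Y @ v)
    corr = float((X @ u) @ (Y @ v))          # empirical canonical correlation
    return u, v, corr
```

The iteration is a power method on the whitened cross-covariance, so it converges to the top canonical pair; further pairs would require deflation.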
  • Efficient Dimensionality Reduction Using Random Projection
Computer Vision Winter Workshop 2010, Libor Špaček, Vojtěch Franc (eds.), Nové Hrady, Czech Republic, February 3–5, Czech Pattern Recognition Society. Efficient Dimensionality Reduction Using Random Projection. Vildana Sulić, Janez Perš, Matej Kristan, and Stanislav Kovačič, Faculty of Electrical Engineering, University of Ljubljana, Slovenia. [email protected]
Abstract: Dimensionality reduction techniques are especially important in the context of embedded vision systems. A promising dimensionality reduction method for a use in such systems is the random projection. In this paper we explore the performance of the random projection method, which can be easily used in embedded cameras. Random projection is compared to Principal Component Analysis in the terms of recognition efficiency on the COIL-20 image data set. Results show surprisingly good performance of the random projection in comparison to the principal component analysis even without explicit orthogonalization or normalization of transformation subspace. These results support the use of random projection in our hierarchical ...
From the introduction: ... transformations have been very popular in determining the intrinsic dimensionality of the manifold as well as extracting its principal directions (i.e., basis vectors). The most famous method in this category is the Principal Component Analysis (PCA) [11]. PCA (also known as the Karhunen-Loève transform) is a vector-space transform that reduces multidimensional data sets to lower dimensions while minimizing the loss of information. A low-dimensional representation of the data is constructed in such a way that it describes as much of the variance in the data as possible. This is achieved by finding a linear basis of reduced dimensionality for the data (a set of eigenvectors) in which the variance in the data is maximal [27].
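The comparison in the abstract can be mimicked on synthetic data (my own stand-in for COIL-20, using nearest-neighbour recognition with scikit-learn; the class counts and dimensions are illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_classes, n_per, d, k = 20, 72, 1024, 30        # loosely COIL-20-sized synthetic data
means = rng.normal(scale=0.5, size=(n_classes, d))
X = np.vstack([rng.normal(loc=m, size=(n_per, d)) for m in means])
y = np.repeat(np.arange(n_classes), n_per)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)

R = rng.normal(size=(d, k)) / np.sqrt(k)         # plain RP: no orthogonalisation or normalisation
pca = PCA(n_components=k).fit(Xtr)
variants = {"random projection": (Xtr @ R, Xte @ R),
            "PCA": (pca.transform(Xtr), pca.transform(Xte))}
for name, (tr, te) in variants.items():
    acc = KNeighborsClassifier(n_neighbors=1).fit(tr, ytr).score(te, yte)
    print(name, round(acc, 3))
```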
• Random Projections: Data Perturbation for Classification Problems (arXiv:1911.10800v1)
Random projections: data perturbation for classification problems. Timothy I. Cannings, School of Mathematics, University of Edinburgh.
Abstract: Random projections offer an appealing and flexible approach to a wide range of large-scale statistical problems. They are particularly useful in high-dimensional settings, where we have many covariates recorded for each observation. In classification problems there are two general techniques using random projections. The first involves many projections in an ensemble: the idea here is to aggregate the results after applying different random projections, with the aim of achieving superior statistical accuracy. The second class of methods include hashing and sketching techniques, which are straightforward ways to reduce the complexity of a problem, perhaps therefore with a huge computational saving, while approximately preserving the statistical efficiency.
[Figure 1: Projections determine distributions! Left: two 2-dimensional distributions, one uniform on the unit circle (black), the other uniform on the unit disk (blue). Right: the corresponding densities after projecting into a 1-dimensional space. In fact, any p-dimensional distribution is determined by its one-dimensional projections (cf. Theorem 1).]
From the introduction: Modern approaches to data analysis go far beyond what early statisticians, such as Ronald A. Fisher, may have dreamt up. Rapid advances in the way we can collect, store and process data, as well as the value in what we can learn from data, has led to a vast number of innovative and creative new methods. Broadly speaking, random projections offer a universal and flexible approach to complex statistical problems.
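A bare-bones ensemble-of-projections classifier in the spirit of the first technique described above; the base learner, ensemble size and majority-vote aggregation are my own choices for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class RPEnsemble:
    """Aggregate base classifiers fitted on different random projections (a sketch).
    Assumes integer class labels 0, 1, ..., C-1."""
    def __init__(self, n_proj=20, k=10, seed=0):
        self.n_proj, self.k, self.rng = n_proj, k, np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        self.members = []
        for _ in range(self.n_proj):
            R = self.rng.normal(size=(d, self.k)) / np.sqrt(self.k)
            clf = LogisticRegression(max_iter=1000).fit(X @ R, y)
            self.members.append((R, clf))
        return self

    def predict(self, X):
        votes = np.stack([clf.predict(X @ R) for R, clf in self.members])
        # majority vote over the ensemble, one column per test point
        return np.apply_along_axis(lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)

# Usage sketch:
# model = RPEnsemble(n_proj=20, k=10).fit(X_train, y_train)
# y_hat = model.predict(X_test)
```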
  • Experiments with Random Projection Ensembles: Linear Versus Quadratic Discriminants
Experiments with Random Projection Ensembles: Linear versus Quadratic Discriminants. Xi Zhang (Microsoft (China) Co., Ltd, Beijing, China; [email protected]) and Ata Kabán (School of Computer Science, University of Birmingham, Birmingham, UK; [email protected]).
Abstract: Gaussian classifiers make the modelling assumption that each class follows a multivariate Gaussian distribution, and include Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). In high dimensional low sample size settings, a general drawback of these methods is that the sample covariance matrix may be singular, so its inverse is not available. Among many regularisation methods that attempt to get around this problem, the random projection ensemble approach for LDA has been both practically successful and theoretically well justified. The aim of this work is to investigate random projection ensembles for QDA, and to experimentally determine the relative strengths of LDA vs QDA in this context. We identify two different ways to construct averaging ensembles for QDA. The first is to aggregate low dimensional matrices to construct a smoothed estimate of the inverse covariance, which is ...
From the introduction: ...more, an averaging ensemble of RP-ed LDA classifiers was shown to enjoy state-of-the-art performance [4]. It is then natural to wonder if this idea could be extended to devise QDA ensembles? This work is motivated by the above question. We investigate random projection ensembles for QDA, and conduct experiments to assess its potential relative to the analogous LDA ensemble. The main idea is to use the random projections to reduce the dimensionality of the covariance matrix, so the smaller dimensional covariance matrix will not be singular. We identify two different ways to construct averaging ensembles for QDA, which will be referred to as ΣRP-QDA and RP Ensemble-QDA respectively.
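A sketch of the first construction mentioned above, aggregating low-dimensional matrices into a smoothed inverse-covariance estimate; the exact weighting and scaling in the paper may differ, so treat this as illustrative only.

```python
import numpy as np

def rp_smoothed_precision(S, k, n_proj=100, seed=0):
    """Average R (R^T is k x d here written as R of shape (k, d)):
    mean of R^T (R S R^T)^{-1} R over random projections, used as a regularised
    stand-in for S^{-1} when the d x d sample covariance S is singular.
    Choose k no larger than the rank of S, or the inner inverse may fail."""
    d = S.shape[0]
    rng = np.random.default_rng(seed)
    P = np.zeros((d, d))
    for _ in range(n_proj):
        R = rng.normal(size=(k, d)) / np.sqrt(k)
        P += R.T @ np.linalg.inv(R @ S @ R.T) @ R
    return P / n_proj

# In a QDA-style rule one could then score class c (mean mu_c, sample covariance S_c,
# prior prior_c) with -(x - mu_c)^T P_c (x - mu_c) + 2*log(prior_c), where
# P_c = rp_smoothed_precision(S_c, k); the log-determinant term of exact QDA is
# omitted in this sketch.
```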
• Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data
Applied Sciences (article). Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data. Hyeongmin Cho and Sangkyun Lee, School of Cybersecurity, Korea University, Seoul 02841, Korea; [email protected]. Correspondence: [email protected].
Abstract: Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training data, many datasets are being disclosed and published online. From a data consumer or manager point of view, measuring data quality is an important first step in the learning process. We need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially when it comes to large-scale high-dimensional data, such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
Keywords: data quality; large-scale; high-dimensionality; linear discriminant analysis; random projection; bootstrapping.
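One plausible shape for such a measure, purely illustrative and not the paper's actual definition: a Fisher-style class-separability score averaged over random projections of bootstrap resamples.

```python
import numpy as np

def rp_separability(X, y, k=20, n_rounds=30, seed=0):
    """Average between-class / within-class scatter ratio over random
    projections of bootstrap resamples (a sketch of the general recipe)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # bootstrap resample
        R = rng.normal(size=(d, k)) / np.sqrt(k)      # random projection to k dims
        Z, labels = X[idx] @ R, y[idx]
        mu = Z.mean(axis=0)
        between = within = 0.0
        for c in np.unique(labels):
            Zc = Z[labels == c]
            between += len(Zc) * np.sum((Zc.mean(axis=0) - mu) ** 2)
            within += np.sum((Zc - Zc.mean(axis=0)) ** 2)
        scores.append(between / within)
    return float(np.mean(scores))
```

Working in the k-dimensional projected space keeps each round cheap even when d is very large, which is the computational point the abstract emphasises.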
• Dimension Reduction with Extreme Learning Machine (Liyanaarachchi Lekamalage Chamara Kasun, Yan Yang, Guang-Bin Huang and Zhengyou Zhang)
Dimension Reduction with Extreme Learning Machine. Liyanaarachchi Lekamalage Chamara Kasun, Yan Yang, Guang-Bin Huang and Zhengyou Zhang. IEEE Transactions on Image Processing, DOI 10.1109/TIP.2016.2570569.
Abstract: Data may often contain noise or irrelevant information which negatively affects the generalization capability of machine learning algorithms. The objective of dimension reduction algorithms such as Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), random projection (RP) and auto-encoder (AE) is to reduce the noise or irrelevant information of the data. The features of PCA (eigenvectors) and the linear AE are not able to represent data as parts (e.g. the nose in a face image); on the other hand, NMF and the non-linear AE are hampered by slow learning speed, and RP only represents a subspace of the original data. This paper introduces a dimension reduction framework which to some extent represents data as parts, has fast learning speed and learns the between-class scatter subspace.
From the introduction: PCA is a holistic-based representation algorithm which represents the original data sample to different degrees, and learns a set of linearly uncorrelated features named principal components that describe the variance of the data. Dimension reduction is achieved by projecting the input data via a subset of principal components that describes the most variance of the data. The features that have less contribution to the variance are considered to be less descriptive and therefore removed. The PCA algorithm converges to a global minimum. NMF is a parts-based representation algorithm which decomposes the data to a positive basis matrix and positive ...
  • Constrained Extreme Learning Machines: a Study on Classification Cases
Constrained Extreme Learning Machines: A Study on Classification Cases. Wentao Zhu, Jun Miao, and Laiyun Qing.
Abstract: Extreme learning machine (ELM) is an extremely fast learning method and has a powerful performance for pattern recognition tasks proven by enormous researches and engineers. However, its good generalization ability is built on large numbers of hidden neurons, which is not beneficial to real time response in the test process. In this paper, we proposed new ways, named "constrained extreme learning machines" (CELMs), to randomly select hidden neurons based on sample distribution. Compared to completely random selection of hidden nodes in ELM, the CELMs randomly select hidden nodes from the constrained vector space containing some basic combinations of original sample vectors. The experimental results show that the CELMs have better generalization ability than traditional ELM, SVM and some other related methods. Additionally, the CELMs have a similar fast learning speed as ELM.
From the introduction: ... (LM) method [5], dynamically network construction [6], evolutionary algorithms [7] and generic optimization [8], the enhanced methods require heavy computation or cannot obtain a global optimal solution. 2. Optimization based learning methods. One of the most popular optimization based SLFNs is Support Vector Machine (SVM) [9]. The objective function of SVM is to optimize the weights for maximum margin corresponding to structural risk minimization. The solution of SVM can be obtained by convex optimization methods in the dual problem space and is the global optimal solution. SVM is a very popular method attracting many researchers [10]. 3. ...