Addressing the Curse of Dimensionality with Convolutional Neural Networks

Total pages: 16

File type: PDF, size: 1020 KB

Joan Bruna, UC Berkeley → Courant Institute, NYU
Collaborators: Ivan Dokmanic (UIUC), Stéphane Mallat (ENS, France), Pablo Sprechmann (NYU), Yann LeCun (NYU)

From Sanja Fidler (inspired by R. Urtasun, ICLR'16):
• Computer Vision: what Neural Networks can do for me.
• Machine Learning: what can I do for Neural Networks.
• Theory??

Curse of Dimensionality
• Images, videos, audio and text are instances of high-dimensional data: x ∈ R^d, with d ~ 10^6.
• Learning as a high-dimensional interpolation problem: observe {(x_i, y_i = f(x_i))}_{i ≤ N} for some unknown f : R^d → R.
• Goal: estimate f̂ from the training data.
• "Pessimist" view: assume as little as possible about f, e.g. f is Lipschitz: |f(x) − f(x′)| ≤ L ‖x − x′‖.
• Q: How many points do we need to observe to guarantee an error |f̂(x) − f(x)| < ε?
• N should be of order ε^(−d): we pay an exponential price on the input dimension.
• Therefore, in order to beat the curse of dimensionality, it is necessary to make assumptions about our data and exploit them in our models.
• Invariance and local stability perspective:
  – Supervised learning: (x_i, y_i)_i with labels y_i ∈ {1, …, K}. The posterior f(x) = p(y | x) satisfies f(x_τ) ≈ f(x) whenever {x_τ}_τ is a high-dimensional family of deformations of x.
  – Unsupervised learning: (x_i)_i. The density f(x) = p(x) also satisfies f(x_τ) ≈ f(x).

Ill-posed Inverse Problems
• Consider the following linear problem: y = Γx + w, with x, y, w ∈ L²(R^d), where Γ is singular, x is drawn from a certain distribution (e.g. natural images), and w is independent of x.
• Examples: super-resolution, inpainting, deconvolution.
• Standard regularization route: x̂ = argmin_x ‖Γx − y‖² + R(x), with R(x) (typically) convex.
• Q: How to leverage training data?
• Q: Is this formulation always appropriate to estimate images?
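A minimal numerical sketch of the ε^(−d) scaling (mine, not from the slides): with a fixed budget of N training points, nearest-neighbour interpolation of a 1-Lipschitz function degrades quickly as the dimension grows, because the typical distance to the nearest sample grows.

```python
# Sketch: 1-nearest-neighbour interpolation of the 1-Lipschitz target
# f(x) = ||x|| on [0,1]^d.  For a fixed sample budget, the nearest-neighbour
# distance (and with it the interpolation error) blows up with d.
import numpy as np

rng = np.random.default_rng(0)
f = lambda X: np.linalg.norm(X, axis=1)          # a 1-Lipschitz target

N, M = 1000, 200                                 # training / test points
for d in [1, 2, 5, 10, 20]:
    Xtr, Xte = rng.random((N, d)), rng.random((M, d))
    ytr = f(Xtr)
    # distance from every test point to every training point
    D = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    nn = D.argmin(axis=1)                        # index of nearest neighbour
    err = np.abs(f(Xte) - ytr[nn]).mean()
    print(f"d={d:2d}  mean |f_hat - f| = {err:.3f}  "
          f"mean NN distance = {D.min(axis=1).mean():.3f}")
```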
Regularization and Learning
• The inverse problem requires a high-dimensional model for p(x | y).
• The underlying probabilistic model for regularized least squares is p(x | y) ∝ N(Γx − y, I) · e^(−R(x)).
• Suppose x_1, x_2 are such that p(x_1 | y) = p(x_2 | y).
• Since −log p(x | y) is convex, it follows that p(αx_1 + (1 − α)x_2 | y) ≥ p(x_1 | y) for all α ∈ [0, 1].
• Jitter model (illustrated in the slides): two sharp images x_1, x_2, the observation y, and their blurry average αx_1 + (1 − α)x_2.
• The conditional distribution of images is not well modeled with Gaussian noise alone; it is non-convex in general.

Plan
• CNNs and the curse of dimensionality on image/audio regression.
• Multi-scale wavelet scattering convolutional networks.
• Generative models using CNN sufficient statistics.
• Applications to inverse problems: image and texture super-resolution.
Joint work with P. Sprechmann (NYU), Yann LeCun (NYU/FAIR), Ivan Dokmanic (UIUC), S. Mallat (ENS) and M. de Hoop (Rice).

Learning, Features and Kernels
• Change of variable Φ(x) = {φ_k(x)}_{k ≤ d′} to nearly linearize f(x), which is then approximated by the 1-D projection f̃(x) = ⟨Φ(x), w⟩ = Σ_k w_k φ_k(x).
• Data x ∈ R^d, features Φ(x) ∈ R^{d′}, linear classifier w; the metric ‖x − x′‖ is replaced by ‖Φ(x) − Φ(x′)‖.
• How and when is it possible to find such a Φ? What "regularity" of f is needed?

Feature Representation Wish-List
• A change of variable Φ(x) must linearize the orbits {g.x}_{g ∈ G}, i.e. linearize symmetries.
• Lipschitz: for all x, g: ‖Φ(x) − Φ(g.x)‖ ≤ C ‖g‖.
• Discriminative: ‖Φ(x) − Φ(x′)‖ ≥ C⁻¹ |f(x) − f(x′)|.

Geometric Variability Prior
• x(u), u: pixels, time samples, etc.; τ(u): deformation field; L_τ(x)(u) = x(u − τ(u)): warping. (Illustrated in the slides with a video of Philipp Scott Johnson.)
• Deformation prior: ‖τ‖ = λ sup_u |τ(u)| + sup_u |∇τ(u)|.
  – Models changes of viewpoint in images.
  – Models frequency transpositions in sounds.
  – Consistent with local translation invariance.

Rotation and Scaling Variability
• Rotation and deformations: group SO(2) × Diff(SO(2)).
• Scaling and deformations: group R × Diff(R).

Geometric Variability Prior (continued)
• Blur operator: Ax = x ⋆ φ, with φ a local average.
  – The only linear operator A stable to deformations: ‖A L_τ x − A x‖ ≤ ‖τ‖ ‖x‖. [Bruna'12]

(Figure 2.1, from Chapter 2, "Invariant Scattering Representations": dilation of a complex bandpass window; if ξ σ_x ≫ s⁻¹, then the supports of x̂ and x̂_τ are nearly disjoint.)
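The blur-stability bound ‖A L_τ x − Ax‖ ≤ ‖τ‖ ‖x‖ can be eyeballed numerically. A small sketch (mine, not from the slides; a 1-D box average stands in for φ and np.interp implements the warp): under the same smooth deformation, the raw high-frequency signal changes a lot while its local average barely moves.

```python
# Sketch: compare how much a high-frequency signal and its local average
# change under a smooth warp L_tau x(u) = x(u - tau(u)).
import numpy as np

n = 512
u = np.arange(n)
x = np.sin(2 * np.pi * 40 * u / n)              # high-frequency signal

tau = 2.0 * np.sin(2 * np.pi * u / n)           # small, smooth deformation field
x_warp = np.interp(u - tau, u, x, period=n)     # L_tau x

phi = np.ones(32) / 32                          # local average phi
A = lambda s: np.convolve(s, phi, mode="same")  # blur operator A

nrm = np.linalg.norm
print("||L_tau x - x||     / ||x|| =", round(nrm(x_warp - x) / nrm(x), 3))
print("||A L_tau x - A x|| / ||x|| =", round(nrm(A(x_warp) - A(x)) / nrm(x), 3))
```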
(Excerpt from Chapter 2, "Invariant Scattering Representations":)

Besides deformation instabilities, the Fourier modulus and the autocorrelation lose too much information. For example, a Dirac δ(u) and a linear chirp e^{iu²} are two signals having Fourier transforms whose moduli are equal and constant. Very different signals may not be discriminated from their Fourier modulus.

A canonical invariant [KDGH07; Soa09] Φ(x) = x(u − a(x)) registers x ∈ L²(R^d) with an anchor point a(x), which is translated when x is translated: a(x_c) = a(x) + c. It thus defines a translation-invariant representation: Φx_c = Φx. For example, the anchor point may be a filtered maximum a(x) = argmax_u |x ⋆ h(u)|, for some filter h(u). A canonical invariant Φx(u) = x(u − a(x)) carries more information than a Fourier modulus, and characterizes x up to a global absolute position [Soa09]. However, it has the same high-frequency instability as a Fourier modulus transform. Indeed, for any choice of anchor point a(x), applying the Plancherel formula proves that ‖x(u − a(x)) − x′(u − a(x′))‖ ≥ (2π)⁻¹ ‖ |x̂(ω)| − |x̂′(ω)| ‖. If x′ = x_τ, the Fourier transform instability at high frequencies implies that Φx = x(u − a(x)) is also unstable with respect to deformations.

2.2.5 SIFT and HoG

SIFT (Scale Invariant Feature Transform) is a local image descriptor introduced by Lowe in [Low04], which achieved huge popularity thanks to its invariance and discriminability properties. The SIFT method originally consists in a keypoint detection phase, using a Differences-of-Gaussians pyramid, followed by a local description around each detected keypoint. The keypoint detection computes local maxima on a scale space generated by isotropic Gaussian differences, which induces invariance to translations, rotations and partially to scaling. The descriptor then computes histograms of image gradient amplitudes, using 8 orientation bins on a 4 × 4 grid around each keypoint, as shown in Figure 2.2.

Geometric Variability Prior (continued)
• Wavelet filter bank: Wx = {x ⋆ ψ_k}_k, with ψ_k(u) = 2^(−j) ψ(2^(−j) R_θ u) and ψ a spatially localized band-pass filter, complementing the blur operator Ax = x ⋆ φ above.
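The last bullet can be made concrete with a toy filter bank. A minimal sketch (my construction: a Gabor-type mother wavelet stands in for a proper Morlet, which would also subtract a small constant to enforce zero mean): it builds ψ_{j,θ}(u) = 2^(−j) ψ(2^(−j) R_θ u) on a small grid and applies Wx = {x ⋆ ψ_{j,θ}} by FFT.

```python
# Sketch of a small 2-D oriented, dilated filter bank and its application.
import numpy as np

def mother_wavelet(u, v, sigma=0.8, xi=3.0):
    # spatially localized envelope, oscillating along the first axis
    return np.exp(-(u**2 + v**2) / (2 * sigma**2)) * np.exp(1j * xi * u)

def filter_bank(size=32, J=3, n_angles=4):
    g = np.arange(size) - size // 2
    V, U = np.meshgrid(g, g)                         # pixel coordinates
    bank = {}
    for j in range(J):
        for t in range(n_angles):
            th = np.pi * t / n_angles
            # evaluate the mother wavelet at 2^{-j} R_theta u
            Ur = (np.cos(th) * U + np.sin(th) * V) / 2**j
            Vr = (-np.sin(th) * U + np.cos(th) * V) / 2**j
            bank[(j, t)] = 2.0**(-j) * mother_wavelet(Ur, Vr)
    return bank

def wavelet_transform(x, bank):
    # circular convolution with each filter via the FFT (filter grid = image grid)
    X = np.fft.fft2(x)
    return {k: np.fft.ifft2(X * np.fft.fft2(np.fft.ifftshift(psi), s=x.shape))
            for k, psi in bank.items()}

x = np.random.default_rng(0).random((32, 32))        # stand-in image
Wx = wavelet_transform(x, filter_bank())
print(len(Wx), "band-pass channels of shape", next(iter(Wx.values())).shape)
```

A scattering network would follow each such convolution with a complex modulus and another local averaging, but that second stage is omitted here.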
Recommended publications
  • Curse of Dimensionality for TSK Fuzzy Neural Networks
    Curse of Dimensionality for TSK Fuzzy Neural Networks: Explanation and Solutions
    Yuqi Cui, Dongrui Wu and Yifan Xu
    School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China
    {yqcui, drwu, yfxu}@hust.edu.cn

    Abstract—Takagi-Sugeno-Kang (TSK) fuzzy system with Gaussian membership functions (MFs) is one of the most widely used fuzzy systems in machine learning. However, it usually has difficulty handling high-dimensional datasets. This paper explores why TSK fuzzy systems with Gaussian MFs may fail on high-dimensional inputs. After transforming defuzzification to an equivalent form of softmax function, we find that the poor performance is due to the saturation of softmax. We show that two defuzzification operations, LogTSK and HTSK, the latter of which is first proposed in this paper, can avoid the saturation. Experimental results on datasets with various dimensionalities validated our analysis and demonstrated the effectiveness of LogTSK and HTSK.

    Index Terms—Mini-batch gradient descent, fuzzy neural network, high-dimensional TSK fuzzy system, HTSK, LogTSK

    […] usually use distance based approaches to compute membership grades, so when it comes to high-dimensional datasets, the fuzzy partitions may collapse. For instance, FCM is known to have trouble handling high-dimensional datasets, because the membership grades of different clusters become similar, leading all centers to move to the center of the gravity [18]. Most previous works used feature selection or dimensionality reduction to cope with high-dimensionality. Model-agnostic feature selection or dimensionality reduction algorithms, such as Relief [19] and principal component analysis (PCA) [20], [21], can filter the features before feeding them into TSK models. Neural networks pre-trained on large datasets can also be used as feature extractor to generate high-level features with low dimensionality [22], [23].
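    A minimal sketch of the saturation phenomenon this abstract describes (my illustration, not the authors' code; the rescaling marked "HTSK-style" is my reading of the idea, namely averaging the squared distances over input dimensions instead of summing them):

```python
# Sketch: with Gaussian MFs, a rule's normalized firing level is a softmax over
# -sum_k (x_k - c_k)^2 / (2 sigma^2).  The sum grows with the dimension d, so
# for large d one rule grabs essentially all the weight (saturation).  The
# "HTSK-style" variant divides by d, which keeps the softmax un-saturated.
import numpy as np

rng = np.random.default_rng(0)

def firing_levels(x, centers, sigma=1.0, average=False):
    d2 = ((x - centers) ** 2).sum(axis=1) / (2 * sigma**2)   # one value per rule
    if average:
        d2 = d2 / x.size                                     # HTSK-style rescaling (assumed)
    z = -d2
    z -= z.max()                                             # numerically stable softmax
    w = np.exp(z)
    return w / w.sum()

R = 10                                                       # number of rules
for d in [2, 10, 100, 1000]:
    x = rng.normal(size=d)
    centers = rng.normal(size=(R, d))
    w_tsk = firing_levels(x, centers, average=False)
    w_resc = firing_levels(x, centers, average=True)
    print(f"d={d:5d}  max weight: plain={w_tsk.max():.3f}  rescaled={w_resc.max():.3f}")
```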
  • Optimal Grid-Clustering : Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering
    Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering
    Alexander Hinneburg ([email protected]) and Daniel A. Keim ([email protected])
    Institute of Computer Science, University of Halle, Kurt-Mothes-Str. 1, 06120 Halle (Saale), Germany

    Abstract: Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called "curse of dimensionality". In addition, the high-dimensional data often contains a significant amount of noise which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condensation-based approaches have severe weaknesses with respect to their effectiveness in high- […]

    1 Introduction: Because of the fast technological progress, the amount of data which is stored in databases increases very fast. This is true for traditional relational databases but also for databases of complex 2D and 3D multimedia data such as image, CAD, geographic, and molecular biology data. It is obvious that relational databases can be seen as high-dimensional databases (the attributes correspond to the dimensions of the data set), but it is also true for multimedia data which - for an efficient retrieval - is usually transformed into high-dimensional feature vectors such as color histograms [SH94], shape descriptors [Jag91, MG95], Fourier vectors [WW80], and text descriptors [Kuk92]. In many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions.
  • High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction Lecturer: Pravesh Kothari Scribe:Sanjeev Arora
    princeton univ. F'16 cos 521: Advanced Algorithm Design
    Lecture 9: High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction
    Lecturer: Pravesh Kothari    Scribe: Sanjeev Arora

    High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by a Netflix customer, etc.) and this lecture seeks to introduce high dimensional geometry. We encounter the so-called curse of dimensionality, which refers to the fact that algorithms are simply harder to design in high dimensions and often have a running time exponential in the dimension. We also encounter the blessings of dimensionality, which allow us to reason about higher dimensional geometry using tools like Chernoff bounds. We also show the possibility of dimension reduction: it is sometimes possible to reduce the dimension of a dataset, for some purposes.

    Notation: For a vector x ∈ R^d its ℓ2-norm is |x|_2 = (Σ_i x_i²)^(1/2) and the ℓ1-norm is |x|_1 = Σ_i |x_i|. For any two vectors x, y their Euclidean distance refers to |x − y|_2 and Manhattan distance refers to |x − y|_1.

    High dimensional geometry is inherently different from low-dimensional geometry.

    Example 1: Consider how many almost orthogonal unit vectors we can have in space, such that all pairwise angles lie between 88 degrees and 92 degrees. In R² the answer is 2. In R³ it is 3. (Prove these to yourself.) In R^d the answer is exp(cd) where c > 0 is some constant. (Figure 1: Number of almost-orthogonal vectors in R², R³, R^d.)

    Example 2: Another example is the ratio of the volume of the unit sphere to its circumscribing cube (i.e.
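    A quick numerical check of Example 1 (my sketch, not part of the lecture notes): for a fixed number of random unit vectors, the worst pairwise deviation from 90 degrees shrinks as the dimension grows, which is the flip side of being able to pack exp(cd) nearly-orthogonal vectors.

```python
# Sketch: sample n random unit vectors in R^d and look at how far the
# pairwise angles stray from 90 degrees.
import numpy as np

rng = np.random.default_rng(0)
for d, n in [(2, 10), (3, 10), (100, 500), (1000, 2000)]:
    V = rng.normal(size=(n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)      # n random unit vectors
    C = np.abs(V @ V.T)                                # |cos(angle)| for all pairs
    np.fill_diagonal(C, 0.0)                           # ignore self-pairs
    worst_cos = np.clip(C.max(), 0.0, 1.0)
    deviation = 90.0 - np.degrees(np.arccos(worst_cos))
    print(f"d={d:5d}, n={n:5d}: worst deviation from 90 deg = {deviation:5.1f}")
```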
  • Clustering for High Dimensional Data: Density Based Subspace Clustering Algorithms
    International Journal of Computer Applications (0975 – 8887), Volume 63, No. 20, February 2013
    Clustering for High Dimensional Data: Density based Subspace Clustering Algorithms
    Sunita Jahirabadkar and Parag Kulkarni
    Department of Computer Engineering & I.T., College of Engineering, Pune, India

    ABSTRACT: Finding clusters in high dimensional data is a challenging task as the high dimensional data comprises hundreds of attributes. Subspace clustering is an evolving methodology which, instead of finding clusters in the entire feature space, aims at finding clusters in various overlapping or non-overlapping subspaces of the high dimensional dataset. Density based subspace clustering algorithms treat clusters as the dense regions compared to noise or border regions. Many momentous density based subspace clustering algorithms exist in the literature. Each of them is characterized by different characteristics caused by different assumptions, input parameters or by the use of different techniques etc. Hence it is quite unfeasible for the future developers to compare all these algorithms using one common […]

    […] hiding clusters in noisy data. Then, a common approach is to reduce the dimensionality of the data, of course, without losing important information. Fundamental techniques to eliminate such irrelevant dimensions can be considered as Feature selection or Feature Transformation techniques [5]. Feature transformation methods such as aggregation, dimensionality reduction etc. project the higher dimensional data onto a smaller space while preserving the distance between the original data objects. The commonly used methods are Principal Component Analysis [6, 7], Singular Value Decomposition [8] etc. The major limitation of feature transformation approaches is that they do not actually remove any of the attributes, and hence the information from the not-so-useful dimensions is preserved, making the clusters less meaningful.
  • A Brief Survey of Deep Reinforcement Learning Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath
    IEEE SIGNAL PROCESSING MAGAZINE, SPECIAL ISSUE ON DEEP LEARNING FOR IMAGE UNDERSTANDING (ARXIV EXTENDED VERSION)
    A Brief Survey of Deep Reinforcement Learning
    Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, Anil Anthony Bharath

    Abstract—Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage […]

    […] is to cover both seminal and recent developments in DRL, conveying the innovative ways in which neural networks can be used to bring us closer towards developing autonomous agents. For a more comprehensive survey of recent efforts in DRL, including applications of DRL to areas such as natural language processing [106, 5], we refer readers to the overview by Li [78]. Deep learning enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. Amongst recent work in the field of DRL, there have been two outstanding success stories.
  • Subspace Clustering for High Dimensional Data: a Review ∗
    Subspace Clustering for High Dimensional Data: A Review
    Lance Parsons, Ehtesham Haque, Huan Liu
    Department of Computer Science Engineering, Arizona State University, Tempe, AZ 85281
    [email protected]  [email protected]  [email protected]

    ABSTRACT: Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering based on their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results.

    […] in an attempt to learn as much as possible about each object described. In high dimensional data, however, many of the dimensions are often irrelevant. These irrelevant dimensions can confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Feature selection methods have been employed somewhat successfully to improve cluster quality. These algorithms find a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions. Unlike feature selection methods which examine the dataset as a whole, subspace clustering algorithms localize their search and are able to uncover clusters that exist in multiple, possibly overlapping […]
  • Dimensionality Reduction
    Dimensionality Reduction
    Aarti Singh, Machine Learning 10-701/15-781, Nov 17, 2010
    Slides Courtesy: Tom Mitchell, Eric Xing, Lawrence Saul

    High-Dimensional Data
    • High dimensions = lots of features.
      – Document classification: features per document = thousands of words/unigrams, millions of bigrams, contextual information.
      – Surveys (Netflix): 480189 users x 17770 movies.
      – Discovering gene networks: 10,000 genes x 1000 drugs x several species.
      – MEG brain imaging: 120 locations x 500 time points x 20 objects.

    Curse of Dimensionality
    • Why are more features bad?
      – Redundant features (not all words are useful to classify a document): more noise added than signal.
      – Hard to interpret and visualize.
      – Hard to store and process data (computationally challenging).
      – Complexity of the decision rule tends to grow with the number of features; hard to learn complex rules as the VC dimension increases (statistically challenging).

    Dimensionality Reduction ("unrolling the swiss roll")
    • Feature selection: only a few features are relevant to the learning task (e.g. an irrelevant X3 alongside X1, X2).
    • Latent features: some linear/nonlinear combination of features provides a more efficient representation than the observed features.

    Feature Selection
    • Approach 1: score each feature and extract a subset, either in one step (choose the d highest-scoring features) or iteratively.
    • Approach 2: regularization (MAP): integrate feature selection into the learning objective by penalizing the number of features with non-zero weights (negative log-likelihood plus a penalty; the penalty minimizes the number of features, and a convex compromise keeps the weights of the chosen features small).

    Latent Feature Extraction
    • Combinations of observed features provide a more efficient representation, and capture underlying relations that govern the data. E.g.
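    Picking up the truncated example above: the standard instance of latent-feature extraction is PCA. A minimal sketch (mine, not from the slides), using the SVD:

```python
# Sketch: project data onto its top-k principal components.
import numpy as np

def pca(X, k):
    """Return the k-dimensional PCA projection of the rows of X."""
    Xc = X - X.mean(axis=0)                     # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                           # k latent features per point
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()
    return Z, Vt[:k], explained

rng = np.random.default_rng(0)
# 500 points that really live on a 3-D subspace of R^50, plus a little noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(500, 50))
Z, components, explained = pca(X, k=3)
print(Z.shape, f"variance explained by 3 components: {explained:.3f}")
```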
  • On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness
    Journal of Machine Learning Research 18 (2018) 1-60. Submitted 3/17; Revised 2/18; Published 4/18.
    On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness
    Fabrizio Angiulli ([email protected]), DIMES - Dept. of Computer, Modeling, Electronics, and Systems Engineering, University of Calabria, 87036 Rende (CS), Italy
    Editor: Sanjiv Kumar

    Abstract: Over the years, different characterizations of the curse of dimensionality have been provided, usually stating the conditions under which, in the limit of the infinite dimensionality, distances become indistinguishable. However, these characterizations almost never address the form of associated distributions in the finite, although high-dimensional, case. This work aims to contribute in this respect by investigating the distribution of distances, and of direct and reverse nearest neighbors, in intrinsically high-dimensional spaces. Indeed, we derive a closed form for the distribution of distances from a given point, for the expected distance from a given point to its kth nearest neighbor, and for the expected size of the approximate set of neighbors of a given point in finite high-dimensional spaces. Additionally, the hubness problem is considered, which is related to the form of the function Nk representing the number of points that have a given point as one of their k nearest neighbors, which is also called the number of k-occurrences. Despite the extensive use of this function, the precise characterization of its form is a longstanding problem. We derive a closed form for the number of k-occurrences associated with a given point in finite high-dimensional spaces, together with the associated limiting probability distribution.
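    The k-occurrence function Nk is easy to probe empirically. A small sketch (mine, not the paper's closed-form derivation) showing the skew of Nk, i.e. hubness, growing with dimension for i.i.d. Gaussian data:

```python
# Sketch: count, for each point, how many other points list it among their
# k nearest neighbours (N_k), and measure how skewed that count becomes.
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 10
for d in [3, 20, 100]:
    X = rng.normal(size=(n, d))
    G = X @ X.T                                        # Gram matrix
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2.0 * G           # squared Euclidean distances
    np.fill_diagonal(D2, np.inf)                       # exclude self-matches
    knn = np.argsort(D2, axis=1)[:, :k]                # k nearest neighbours of each point
    Nk = np.bincount(knn.ravel(), minlength=n)         # k-occurrences of every point
    skew = ((Nk - Nk.mean()) ** 3).mean() / Nk.std() ** 3
    print(f"d={d:3d}: max N_k = {Nk.max():3d} (mean is {k}), skewness = {skew:.2f}")
```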
  • Curse of Dimensionality in Adversarial Examples
    IJCNN 2019 - International Joint Conference on Neural Networks, Budapest, Hungary, 14-19 July 2019
    Curse of Dimensionality in Adversarial Examples
    Nandish Chattopadhyay*†, Anupam Chattopadhyay†, Sourav Sen Gupta† and Michael Kasper*
    * Fraunhofer Singapore, Nanyang Technological University, Singapore. Email: [email protected], [email protected]
    † School of Computer Science and Engineering, Nanyang Technological University, Singapore. Email: [email protected], [email protected], [email protected]

    Abstract—While machine learning, and deep neural networks in particular, have undergone massive progress in the past years, this ubiquitous paradigm faces a relatively newly discovered challenge, adversarial attacks. An adversary can leverage a plethora of attacking algorithms to severely reduce the performance of existing models, therefore threatening the use of AI in many safety-critical applications. Several attempts have been made to try and understand the root cause behind the generation of adversarial examples. In this paper, we try to relate the geometry of the high-dimensional space in which the model operates and optimizes, and the properties and problems therein, to such adversarial attacks. We present the mathematical background, the intuition behind the existence of adversarial examples and substantiate them with empirical results from our experiments.

    […] capacity to learn features and patterns in the data, to extents which were not reachable before.

    A. Motivation: Adversarial Examples. Soon after the boom in flourishing AI, also popularly referred to as the "AI revival", a new discovery proved to a newly found problem. In 2015, it was shown that the high performing neural networks were vulnerable to adversarial attacks [4]. The models, which would perform the task of classification very accurately on a test image dataset, would perform very […]
  • Fractional Norms and Quasinorms Do Not Help to Overcome the Curse of Dimensionality
    entropy Article Fractional Norms and Quasinorms Do Not Help to Overcome the Curse of Dimensionality Evgeny M. Mirkes 1,2,∗ , Jeza Allohibi 1,3 and Alexander Gorban 1,2 1 School of Mathematics and Actuarial Science, University of Leicester, Leicester LE1 7HR, UK; [email protected] (J.A.); [email protected] (A.G.) 2 Laboratory of Advanced Methods for High-Dimensional Data Analysis, Lobachevsky State University, 603105 Nizhny Novgorod, Russia 3 Department of Mathematics, Taibah University, Janadah Bin Umayyah Road, Tayba, Medina 42353, Saudi Arabia * Correspondence: [email protected] Received: 4 August 2020; Accepted: 27 September 2020; Published: 30 September 2020 Abstract: The curse of dimensionality causes the well-known and widely discussed problems for machine learning methods. There is a hypothesis that using the Manhattan distance and even fractional lp quasinorms (for p less than 1) can help to overcome the curse of dimensionality in classification problems. In this study, we systematically test this hypothesis. It is illustrated that fractional quasinorms have a greater relative contrast and coefficient of variation than the Euclidean norm l2, but it is shown that this difference decays with increasing space dimension. It has been demonstrated that the concentration of distances shows qualitatively the same behaviour for all tested norms and quasinorms. It is shown that a greater relative contrast does not mean a better classification quality. It was revealed that for different databases the best (worst) performance was achieved under different norms (quasinorms). A systematic comparison shows that the difference in the performance of kNN classifiers for lp at p = 0.5, 1, and 2 is statistically insignificant.
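    The abstract's qualitative claims are easy to reproduce on synthetic data. A minimal sketch (mine, not the paper's experiments) of the relative contrast (D_max − D_min)/D_min for p = 0.5, 1 and 2:

```python
# Sketch: relative contrast of l_p (quasi)norm distances from a random query
# to n random points in [0,1]^d, for several p and growing d.
import numpy as np

def lp_dist(A, b, p):
    return (np.abs(A - b) ** p).sum(axis=1) ** (1.0 / p)

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000]:
    X, q = rng.random((n, d)), rng.random(d)          # data and query in [0,1]^d
    row = []
    for p in [0.5, 1.0, 2.0]:
        D = lp_dist(X, q, p)
        row.append((D.max() - D.min()) / D.min())     # relative contrast
    print(f"d={d:4d}  contrast p=0.5: {row[0]:.3f}  p=1: {row[1]:.3f}  p=2: {row[2]:.3f}")
```

    Smaller p gives a somewhat larger contrast at every fixed dimension, but the contrast decays with d for all three exponents, consistent with the paper's conclusion.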
  • Coping with the Curse of Dimensionality by Combining Linear Programming and Reinforcement Learning
    Utah State University, DigitalCommons@USU, All Graduate Theses and Dissertations, Graduate Studies, 5-2010
    Coping with the Curse of Dimensionality by Combining Linear Programming and Reinforcement Learning
    Scott H. Burton, Utah State University

    Recommended Citation: Burton, Scott H., "Coping with the Curse of Dimensionality by Combining Linear Programming and Reinforcement Learning" (2010). All Graduate Theses and Dissertations. 559. https://digitalcommons.usu.edu/etd/559

    A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. Approved by Nicholas S. Flann (Major Professor), Donald H. Cooley and Dan W. Watson (Committee Members), and Byron R. Burnham (Dean of Graduate Studies). Utah State University, Logan, Utah, 2010. Copyright © Scott H. Burton 2010. All Rights Reserved.

    ABSTRACT: Coping with the Curse of Dimensionality by Combining Linear Programming and Reinforcement Learning, by Scott H. Burton, Master of Science, Utah State University, 2010. Major Professor: Dr. Nicholas S. Flann. Department: Computer Science.

    Reinforcement learning techniques offer a very powerful method of finding solutions in unpredictable problem environments where human supervision is not possible.
  • On the Effects of Dimensionality on Data Analysis with Neural Networks
    On the effects of dimensionality on data analysis with neural networks
    M. Verleysen (1), D. François (2), G. Simon (1), V. Wertz (2), Université catholique de Louvain
    (1) DICE, 3 place du Levant, B-1348 Louvain-la-Neuve, Belgium; (2) CESAME, 4 av. G. Lemaitre, B-1348 Louvain-la-Neuve, Belgium
    {verleysen, simon}@dice.ucl.ac.be, {francois, wertz}@auto.ucl.ac.be

    Abstract. Modern data analysis often faces high-dimensional data. Nevertheless, most neural network data analysis tools are not adapted to high-dimensional spaces, because of the use of conventional concepts (such as the Euclidean distance) that scale poorly with dimension. This paper shows some limitations of such concepts and suggests some research directions, such as the use of alternative distance definitions and of non-linear dimension reduction.

    1. Introduction. In the last few years, data analysis has become a specific discipline, sometimes far from its mathematical and statistical origin, where understanding of the problems and limitations coming from the data themselves is often more valuable than developing complex algorithms and methods. The specificity of modern data mining is that huge amounts of data are considered. There are new fields where data mining becomes crucial (medical research, financial analysis, etc.); furthermore, collecting huge amounts of data often becomes easier and cheaper. A main concern in that direction is the dimensionality of data. Think of each measurement of data as one observation, each observation being composed of a set of variables. It is very different to analyze 10000 observations of 3 variables each than to analyze 100 observations of 50 variables each! One way to get some feeling of this difficulty is to imagine each observation as a point in a space whose dimension is the number of variables.