The Teaching Dimension of Kernel Perceptron

Total Pages: 16

File Type: PDF, Size: 1020 KB

The Teaching Dimension of Kernel Perceptron

Akash Kumar (MPI-SWS), Hanqi Zhang (University of Chicago), Adish Singla (MPI-SWS), Yuxin Chen (University of Chicago)

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

Algorithmic machine teaching has been studied under the linear setting where exact teaching is possible. However, little is known for teaching nonlinear learners. Here, we establish the sample complexity of teaching, aka teaching dimension, for kernelized perceptrons for different families of feature maps. As a warm-up, we show that the teaching complexity is Θ(d) for the exact teaching of linear perceptrons in R^d, and Θ(d^k) for kernel perceptron with a polynomial kernel of order k. Furthermore, under certain smoothness assumptions on the data distribution, we establish a rigorous bound on the complexity for approximately teaching a Gaussian kernel perceptron. We provide numerical examples of the optimal (approximate) teaching set under several canonical settings for linear, polynomial and Gaussian kernel perceptrons.

1 Introduction

Machine teaching studies the problem of finding an optimal training sequence to steer a learner towards a target concept (Zhu et al., 2018). An important learning-theoretic complexity measure of machine teaching is the teaching dimension (Goldman & Kearns, 1995), which specifies the minimal number of training examples required in the worst case to teach a target concept. Over the past few decades, the notion of teaching dimension has been investigated under a variety of learner's models and teaching protocols (e.g., Cakmak & Lopes (2012); Singla et al. (2013; 2014); Liu et al. (2017); Haug et al. (2018); Tschiatschek et al. (2019); Liu et al. (2018); Kamalaruban et al. (2019); Hunziker et al. (2019); Devidze et al. (2020); Rakhsha et al. (2020)). One of the most studied scenarios is the case of teaching a version-space learner (Goldman & Kearns, 1995; Anthony et al., 1995; Zilles et al., 2008; Doliwa et al., 2014; Chen et al., 2018; Mansouri et al., 2019; Kirkpatrick et al., 2019). Upon receiving a sequence of training examples from the teacher, a version-space learner maintains a set of hypotheses that are consistent with the training examples, and outputs a random hypothesis from this set.

As a canonical example, consider teaching a 1-dimensional binary threshold function f_{θ∗}(x) = 1{x − θ∗ > 0} for x ∈ [0, 1]. For a learner with a finite (or countably infinite) version space, e.g., θ ∈ {i/n}_{i=0,...,n} where n ∈ Z+ (see Fig. 1a), a smallest training set is {(i/n, 0), ((i+1)/n, 1)} where i/n ≤ θ∗ < (i+1)/n; thus the teaching dimension is 2. However, when the version space is continuous, the teaching dimension becomes ∞, because it is no longer possible for the learner to pick out a unique threshold θ∗ with a finite training set. This is due to two key (limiting) modeling assumptions of the version-space learner: (1) all (consistent) hypotheses in the version space are treated equally, and (2) there exists a hypothesis in the version space that is consistent with all training examples. As one can see, these assumptions fail to capture the behavior of many modern learning algorithms, where the best hypotheses are often selected via optimizing certain loss functions, and the data is not perfectly separable (i.e., not realizable w.r.t. the hypothesis/model class).

To lift these modeling assumptions, a more realistic teaching scenario is to consider the learner as an empirical risk minimizer (ERM). In fact, under the realizable setting, the version-space learner could be viewed as an ERM that optimizes the 0-1 loss: one that finds all hypotheses with zero training error.
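The canonical 1-D example above is easy to make concrete in a few lines of code. The sketch below (not from the paper) builds the two-point teaching set for a target threshold on the grid {i/n} and checks that exactly one grid hypothesis survives; the strict-inequality prediction rule is an assumption chosen to match the convention i/n ≤ θ∗ < (i+1)/n.

```python
def predict(theta, x):
    # Threshold hypothesis f_theta(x) = 1{x - theta > 0}; the strict ">" is
    # the tie-breaking convention assumed in this sketch.
    return int(x - theta > 0)

def teaching_set(i_star, n):
    # Target theta* = i_star/n lies on the grid; the smallest teaching set is
    # the pair of grid points straddling theta*, labeled by f_{theta*}.
    theta_star = i_star / n
    return [(x, predict(theta_star, x)) for x in (i_star / n, (i_star + 1) / n)]

def version_space(train, n):
    # Version-space learner over the grid {i/n}: keep every consistent hypothesis.
    return [i / n for i in range(n + 1)
            if all(predict(i / n, x) == y for x, y in train)]

D = teaching_set(3, 10)          # theta* = 0.3
print(D)                         # [(0.3, 0), (0.4, 1)]
print(version_space(D, 10))      # [0.3]: a unique consistent hypothesis, so TD = 2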
Recently, Liu & Zhu (2016) studied the teaching dimension of linear ERM, and established values of the teaching dimension for several classes of linear (regularized) ERM learners, including support vector machine (SVM), logistic regression and ridge regression. As illustrated in Fig. 1b, for the previous example it suffices to use {(θ∗ − ε, 0), (θ∗ + ε, 1)} with any ε ≤ min(1 − θ∗, θ∗) as the training set to teach θ∗ as an optimizer of the SVM objective (i.e., ℓ2-regularized hinge loss).
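The SVM claim above also admits a quick numerical sanity check. The following sketch (illustrative, not code from the paper) uses the hard-margin view of the ℓ2-regularized hinge objective: in one dimension the max-margin threshold separating the two teaching points is their midpoint, which is exactly θ∗.

```python
def svm_teaching_set(theta_star, eps):
    # Two points placed symmetrically around theta*; eps <= min(theta*, 1 - theta*)
    # keeps both points inside [0, 1].
    return [(theta_star - eps, 0), (theta_star + eps, 1)]

def max_margin_threshold(train):
    # In one dimension the max-margin separator sits halfway between the
    # largest negatively-labeled point and the smallest positively-labeled one.
    neg = max(x for x, y in train if y == 0)
    pos = min(x for x, y in train if y == 1)
    assert neg < pos, "teaching set must be separable"
    return (neg + pos) / 2

theta_star, eps = 0.6, 0.25                  # eps <= min(0.6, 0.4) holds
D = svm_teaching_set(theta_star, eps)
print(D)                                     # [(0.35, 0), (0.85, 1)]
print(max_margin_threshold(D))               # 0.6: the learner recovers theta*
```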
Recommended publications
  • Designing Algorithms for Machine Learning and Data Mining
    Designing Algorithms for Machine Learning and Data Mining. Antoine Cornuéjols and Christel Vrain. Abstract: Designing Machine Learning algorithms implies answering three main questions: First, what is the space H of hypotheses or models of the data that the algorithm considers? Second, what is the inductive criterion used to assess the merit of a hypothesis given the data? Third, given the space H and the inductive criterion, how is the exploration of H carried on in order to find an as-good-as-possible hypothesis? Any learning algorithm can be analyzed along these three questions. This chapter focusses primarily on unsupervised learning, on one hand, and supervised learning, on the other hand. For each, the foremost problems are described as well as the main existing approaches. In particular, the interplay between the structure that can be endowed over the hypothesis space and the optimisation techniques that can in consequence be used is underlined. We cover especially the major existing methods for clustering: prototype-based, generative-based, density-based, spectral-based, hierarchical, and conceptual, and visit the validation techniques available. For supervised learning, the generative and discriminative approaches are contrasted and a wide variety of linear methods, in which we include the Support Vector Machines and Boosting, are presented. Multi-layer neural networks and deep learning methods are discussed. Some additional methods are illustrated, and we describe other learning problems including semi-supervised learning, active learning, online learning, transfer learning, learning to rank, learning recommendations, and identifying causal relationships. We conclude this survey by suggesting new directions for research. 1 Introduction Machine Learning is the science of, on one hand, discovering the fundamental laws that govern the act of learning and, on the other hand, designing machines that learn from experiences, in the same way as physics is both the science of uncovering the
  • Mistake Bounds for Binary Matrix Completion
    Mistake Bounds for Binary Matrix Completion. Mark Herbster and Stephen Pasteris, University College London, Department of Computer Science, London WC1E 6BT, UK; Massimiliano Pontil, Istituto Italiano di Tecnologia, 16163 Genoa, Italy, and University College London, Department of Computer Science, London WC1E 6BT, UK. Abstract: We study the problem of completing a binary matrix in an online learning setting. On each trial we predict a matrix entry and then receive the true entry. We propose a Matrix Exponentiated Gradient algorithm [1] to solve this problem. We provide a mistake bound for the algorithm, which scales with the margin complexity [2, 3] of the underlying matrix. The bound suggests an interpretation where each row of the matrix is a prediction task over a finite set of objects, the columns. Using this we show that the algorithm makes a number of mistakes which is comparable up to a logarithmic factor to the number of mistakes made by the Kernel Perceptron with an optimal kernel in hindsight. We discuss applications of the algorithm to predicting as well as the best biclustering and to the problem of predicting the labeling of a graph without knowing the graph in advance. 1 Introduction: We consider the problem of predicting online the entries in an m × n binary matrix U. We formulate this as the following game: nature queries an entry (i1, j1); the learner predicts ŷ1 ∈ {−1, 1} as the matrix entry; nature presents a label y1 = U_{i1,j1}; nature queries the entry (i2, j2); the learner predicts ŷ2; and so forth.
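The online protocol described in this abstract can be written as a short loop. The sketch below uses a toy rank-1 sign matrix and a placeholder "always +1" predictor rather than the paper's Matrix Exponentiated Gradient algorithm; names and data are illustrative only.

```python
import numpy as np

def online_matrix_game(U, queries, predictor):
    # Protocol: nature queries an entry, the learner predicts a value in
    # {-1, +1}, nature reveals the true entry, and mistakes are counted.
    mistakes, revealed = 0, []
    for (i, j) in queries:
        y_hat = predictor(i, j, revealed)      # must predict before seeing the label
        y = U[i, j]
        mistakes += int(y_hat != y)
        revealed.append((i, j, y))             # past labels the learner may reuse
    return mistakes

rng = np.random.default_rng(0)
u, v = rng.choice([-1, 1], size=5), rng.choice([-1, 1], size=6)
U = np.outer(u, v)                             # toy rank-1 sign matrix
queries = [(i, j) for i in range(5) for j in range(6)]
print(online_matrix_game(U, queries, lambda i, j, revealed: 1))
```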
  • Kernel Methods and Factorization for Image and Video Analysis
    KERNEL METHODS AND FACTORIZATION FOR IMAGE AND VIDEO ANALYSIS. Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science (by Research) in Computer Science by Ranjeeth Dasineni (200507017), International Institute of Information Technology, Hyderabad, India, November 2007. CERTIFICATE: It is certified that the work contained in this thesis, titled "Kernel Methods and Factorization for Image and Video Analysis" by Ranjeeth Dasineni, has been carried out under my supervision and is not submitted elsewhere for a degree. Advisor: Dr. C. V. Jawahar. Copyright © Ranjeeth Dasineni, 2007. All Rights Reserved. To IIIT Hyderabad, where I learnt all that I know of Computers and much of what I know of Life. Acknowledgments: I would like to thank Dr. C. V. Jawahar for introducing me to the fields of computer vision and machine learning. I gratefully acknowledge the motivation, technical and philosophical guidance that he has given me throughout my undergraduate and graduate education. His knowledge and valuable suggestions provided the foundation for the work presented in this thesis. I thank Dr. P. J. Narayanan for providing an excellent research environment at CVIT, IIIT Hyderabad. His advice on research methodology helped me in facing the challenges of research. I am grateful to the GE Foundation and CVIT for funding my graduate education at IIIT Hyderabad. I am also grateful to fellow lab mates at CVIT and IIIT Hyderabad for their stimulating company during the past years. While working on this thesis, a few researchers across the world lent their valuable time to validate my work, check the correctness of my implementations, and provide critical suggestions.
  • COMS 4771 Perceptron and Kernelization
    COMS 4771: Perceptron and Kernelization. Nakul Verma. Last time… • Generative vs. Discriminative Classifiers • Nearest Neighbor (NN) classification • Optimality of k-NN • Coping with drawbacks of k-NN • Decision Trees • The notion of overfitting in machine learning. A Closer Look at Classification: knowing the boundary is enough for classification. Linear Decision Boundary (example: male vs. female data plotted by Height and Weight). Assume binary classification y ∈ {-1,+1} (what happens in the multi-class case?). Learning Linear Decision Boundaries: g = decision boundary, f = linear classifier with f(x) = +1 if g(x) ≥ 0 and -1 if g(x) < 0. How many parameters must be learned in R^d? Dealing with the bias w0: homogeneous "lifting". The Linear Classifier: compute Σ_i w_i x_i + w_0 over inputs x_1, ..., x_d plus a bias term, then apply a nonlinearity (popular choices: linear, threshold, sigmoid); this is a basic computational unit in a neuron, and such units can be combined to make a network, an artificial neural network. Amazing fact: it can approximate any smooth function! How to Learn the Weights? Given labeled training data (bias included), we want the w which minimizes the training error. How do we minimize? Cannot use the standard technique (take the derivative and examine the stationary points). Why? Unfortunately, it is NP-hard to solve or even approximate! Finding Weights (Relaxed Assumptions): can we approximate the weights if we make reasonable assumptions? What if the training data is linearly separable? Linear Separability: say there is a linear decision boundary which can perfectly separate the training data; the margin γ is the distance of the closest point to the boundary. Finding Weights: given labeled training data S, we want to determine whether there is a w which satisfies y_i (w · x_i) > 0 for all i, i.e., is the training data linearly separable? Since there are d+1 variables and |S| constraints, it is possible to solve this efficiently via a (constraint) optimization program.
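This excerpt stops just before the perceptron algorithm itself. As a small illustration, here is a standard mistake-driven perceptron with the bias folded in via the homogeneous lifting mentioned above; the toy data and parameter choices are illustrative, not taken from the lecture.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    # Mistake-driven perceptron on homogeneously "lifted" data: append a
    # constant-1 feature so the bias w0 is learned as part of w.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):              # labels are in {-1, +1}
            if yi * (w @ xi) <= 0:             # misclassified (or on the boundary)
                w += yi * xi                   # perceptron update
                mistakes += 1
        if mistakes == 0:                      # converged: the data was separated
            break
    return w

# Toy linearly separable data: label = sign(x1 + x2 - 1)
X = np.array([[0.0, 0.0], [0.2, 0.2], [2.0, 2.0], [0.0, 2.0], [2.0, 0.0], [1.5, 1.5]])
y = np.array([-1, -1, 1, 1, 1, 1])
print(perceptron(X, y))                        # learned [w1, w2, w0]
```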
  • Kernel Methods for Pattern Analysis
    Kernel Methods for Pattern Analysis. Pattern Analysis is the process of finding general relations in a set of data, and forms the core of many disciplines, from neural networks to so-called syntactical pattern recognition, from statistical pattern recognition to machine learning and data mining. Applications of pattern analysis range from bioinformatics to document retrieval. The kernel methodology described here provides a powerful and unified framework for all of these disciplines, motivating algorithms that can act on general types of data (e.g. strings, vectors, text, etc.) and look for general types of relations (e.g. rankings, classifications, regressions, clusters, etc.). This book fulfils two major roles. Firstly it provides practitioners with a large toolkit of algorithms, kernels and solutions ready to be implemented, many given as Matlab code suitable for many pattern analysis tasks in fields such as bioinformatics, text analysis, and image analysis. Secondly it furnishes students and researchers with an easy introduction to the rapidly expanding field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, while covering the required conceptual and mathematical tools necessary to do so. The book is in three parts. The first provides the conceptual foundations of the field, both by giving an extended introductory example and by covering the main theoretical underpinnings of the approach. The second part contains a number of kernel-based algorithms, from the simplest to sophisticated systems such as kernel partial least squares, canonical correlation analysis, support vector machines, principal components analysis, etc.
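As a minimal illustration of the kernel formalism this book builds on, the sketch below evaluates two common kernels (polynomial and Gaussian, both of which also appear in the main paper above) on toy vectors and checks that their Gram matrices are symmetric positive semidefinite; the parameter values are arbitrary.

```python
import numpy as np

def poly_kernel(X, Z, degree=3, c=1.0):
    # Polynomial kernel k(x, z) = (<x, z> + c)^degree
    return (X @ Z.T + c) ** degree

def gaussian_kernel(X, Z, gamma=0.5):
    # Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(1).normal(size=(6, 2))
for K in (poly_kernel(X, X), gaussian_kernel(X, X)):
    # A valid kernel yields a symmetric positive semidefinite Gram matrix.
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-9)
```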
  • Fast Kernel Classifiers with Online and Active Learning
    Fast Kernel Classifiers with Online and Active Learning. Antoine Bordes (NEC Laboratories America, Princeton, NJ, USA, and École Supérieure de Physique et Chimie Industrielles, Paris, France), Seyda Ertekin (The Pennsylvania State University, University Park, PA, USA), Jason Weston (NEC Laboratories America), Léon Bottou (NEC Laboratories America). Journal of Machine Learning Research (2005), to appear. Editor: Nello Cristianini. Abstract: Very high dimensional learning systems become theoretically possible when training examples are abundant. The computing cost then becomes the limiting factor. Any efficient learning algorithm should at least pay a brief look at each example. But should all examples be given equal attention? This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels. 1. Introduction: Electronic computers have vastly enhanced our ability to compute complicated statistical models. Both theory and practice have adapted to take into account the essential compromise between the number of examples and the model capacity (Vapnik, 1998). Cheap, pervasive and networked computers are now enhancing our ability to collect observations to an even greater extent.
  • Online Passive-Aggressive Algorithms on a Budget
    Online Passive-Aggressive Algorithms on a Budget. Zhuang Wang and Slobodan Vucetic, Dept. of Computer and Information Sciences, Temple University, USA. Abstract: In this paper a kernel-based online learning algorithm, which has both constant space and update time, is proposed. The approach is based on the popular online Passive-Aggressive (PA) algorithm. When used in conjunction with a kernel function, the number of support vectors in PA grows without bounds when learning from noisy data streams. This implies unlimited memory and ever increasing model update and prediction time. To address this issue, the proposed budgeted PA algorithm maintains only a fixed number of support vectors. By introducing an additional constraint to the original PA optimization problem, a closed-form ... 1 INTRODUCTION: The introduction of Support Vector Machines (Cortes and Vapnik, 1995) generated significant interest in applying the kernel methods for online learning. A large number of online algorithms (e.g. perceptron (Rosenblatt, 1958) and PA (Crammer et al., 2006)) can be easily kernelized. The online kernel algorithms have been popular because they are simple, can learn non-linear mappings, and could achieve state-of-the-art accuracies. Perhaps surprisingly, online kernel classifiers can place a heavy burden on computational resources. The main reason is that the number of support vectors that need to be stored as part of the prediction model grows without limit as the algorithm progresses. In addition to the potential of exceeding the physical memory, this property also implies an unlimited growth in model update and prediction time.
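To illustrate the support-vector growth issue and the budget idea described here, the sketch below implements a kernelized PA-I learner that simply discards the oldest support vector once a budget is exceeded. The removal policy and all parameter values are simplifications for illustration; the paper instead derives a closed-form update from a PA optimization problem with an added budget constraint.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2)))

class BudgetedKernelPA:
    """Kernelized Passive-Aggressive (PA-I) learner that keeps at most `budget`
    support vectors by discarding the oldest one (a simplified stand-in for the
    paper's budget-maintenance step)."""

    def __init__(self, budget=50, C=1.0, kernel=rbf):
        self.budget, self.C, self.kernel = budget, C, kernel
        self.sv, self.alpha = [], []

    def decision(self, x):
        return sum(a * self.kernel(s, x) for s, a in zip(self.sv, self.alpha))

    def update(self, x, y):                      # y in {-1, +1}
        loss = max(0.0, 1.0 - y * self.decision(x))
        if loss > 0.0:                           # be "aggressive" only on a loss
            tau = min(self.C, loss / self.kernel(x, x))
            self.sv.append(x)
            self.alpha.append(tau * y)
            if len(self.sv) > self.budget:       # enforce constant memory
                self.sv.pop(0)
                self.alpha.pop(0)

learner = BudgetedKernelPA(budget=10)
rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.normal(size=2)
    learner.update(x, 1 if x.sum() > 0 else -1)
print(len(learner.sv))                           # never exceeds the budget (10)
```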
  • Output Neuron
    COMPUTATIONAL INTELLIGENCE (CS) (INTRODUCTION TO MACHINE LEARNING), SS16, Lecture 4: • Neural Networks: Perceptron, Feedforward Neural Network, Backpropagation algorithm. Human-level intelligence (strong AI) • SF example: Marvin the Paranoid Android • www.youtube.com/watch?v=z9Yp6D1ASnI • Robots that exhibit complex behaviour, as skilful and flexible as humans, capable of: • Communicating in natural language • Reasoning • Representing knowledge • Planning • Learning • Consciousness • Self-awareness • Emotions • How to achieve that? Any idea? NEURAL NETWORKS MOTIVATION: The brain • Enables us to do some remarkable things such as: • learning from experience • instant memory (experiences) search • face recognition (~100 ms) • adaptation to a new situation • being creative • The brain is a biological neural network (BNN): • extremely complex and highly parallel system • consisting of 100 billion neurons communicating via 100 trillion synapses (each neuron connected to 10000 other neurons on average) • Mimicking the brain could be the way to build a (human-level) intelligent system! Artificial Neural Networks (ANN) • ANN are computational models that model the way the brain works • Motivated and inspired by BNN, but still only a simple imitation of BNN! • ANN mimic human learning by changing the strength of simulated neural connections on the basis of experience • What are the different types of ANNs? • What level of detail is required? • How do we implement and simulate them? ARTIFICIAL NEURAL NETWORKS
  • A Review of Learning Planning Action Models Ankuj Arora, Humbert Fiorino, Damien Pellier, Marc Métivier, Sylvie Pesty
    A Review of Learning Planning Action Models. Ankuj Arora, Humbert Fiorino, Damien Pellier, Marc Métivier, Sylvie Pesty. Knowledge Engineering Review, Cambridge University Press (CUP), 2018, 33, 10.1017/S0269888918000188. HAL Id: hal-02010536, https://hal.archives-ouvertes.fr/hal-02010536, submitted on 7 Feb 2019. Authors: ANKUJ ARORA*, HUMBERT FIORINO*, DAMIEN PELLIER*, MARC MÉTIVIER**, SYLVIE PESTY*. *Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France. **Paris Descartes University, LIPADE, 45 rue des Saints-Pères, 75006 Paris, France. Abstract: Automated planning has been a continuous field of study since the 1960s, since the notion of accomplishing a task using an ordered set of actions resonates with almost every known activity domain.
  • Explaining RNN Predictions for Sentiment Classification
    2018 Fall CSCE 638 Course Project: Explaining RNN Predictions for Sentiment Classification. Ninghao Liu and Qingquan Song, 2018-11-29. Why Recurrent Neural Networks? RNNs have made lots of progress in the NLP domain. Why Interpretation? NNs are regarded as black boxes (e.g. in self-driving cars and medical diagnosis); decisions made by RNNs can be critical, and interpretation can increase trust ("Interpreting Machine Learning Models", Medium.com, 2017). Target Models and Settings. Target models: Long Short-Term Memory and Gated Recurrent Unit. Dataset: Stanford Sentiment Treebank; the corpus contains 11,855 sentences extracted from movie reviews [Train: 6920; Dev: 872; Test: 1821]. Hidden units: 150, learning rate: 1e-3, training epochs: 20, loss: negative log-loss. Developed Interpretation Methods. Naive Explanation (NaiveExp): the response score to the t-th input word. Motivation: utilize the prediction change at time step t. Drawback: over-simplifies the internal forgetting mechanism of the RNN. Vanilla Gradient (VaGrad): post-hoc explanation using the gradient, i.e., the local prediction change w.r.t. the input. Drawbacks: (1) easily affected by noise; (2) does not distinguish between positive and negative contributions. Integrated Gradient (InteGrad): denoising. Gradient Times Embedding (EmbGrad): directional (polarized). Integrated
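As a rough illustration of the "vanilla gradient" and "gradient times embedding" ideas described in these slides, the sketch below approximates input-gradient saliency by finite differences on a black-box scoring function; the helper names and the toy linear "model" are hypothetical, and a real implementation would backpropagate through the RNN instead.

```python
import numpy as np

def gradient_saliency(score_fn, embeddings, eps=1e-4):
    # Finite-difference stand-in for gradient-based explanation: sensitivity
    # of the prediction score to each word-embedding coordinate.
    base = score_fn(embeddings)
    grads = np.zeros_like(embeddings)
    for t in range(embeddings.shape[0]):         # words / time steps
        for d in range(embeddings.shape[1]):     # embedding dimensions
            bumped = embeddings.copy()
            bumped[t, d] += eps
            grads[t, d] = (score_fn(bumped) - base) / eps
    vagrad = np.linalg.norm(grads, axis=1)       # per-word gradient norm
    embgrad = (grads * embeddings).sum(axis=1)   # "gradient times embedding"
    return vagrad, embgrad

# Toy "model": score = mean of a fixed linear readout over word embeddings.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))                      # 5 words, 8-dim embeddings
readout = rng.normal(size=8)
score_fn = lambda emb: float((emb @ readout).mean())
print(gradient_saliency(score_fn, E))
```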
  • Outline of Machine Learning
    Outline of machine learning The following outline is provided as an overview of and topical guide to machine learning: Machine learning – subfield of computer science[1] (more particularly soft computing) that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".[2] Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.[3] Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions. Contents What type of thing is machine learning? Branches of machine learning Subfields of machine learning Cross-disciplinary fields involving machine learning Applications of machine learning Machine learning hardware Machine learning tools Machine learning frameworks Machine learning libraries Machine learning algorithms Machine learning methods Dimensionality reduction Ensemble learning Meta learning Reinforcement learning Supervised learning Unsupervised learning Semi-supervised learning Deep learning Other machine learning methods and problems Machine learning research History of machine learning Machine learning projects Machine learning organizations Machine learning conferences and workshops Machine learning publications