Learning with Kernels at Scale and Applications in Generative Modeling


Wei-Cheng Chang
CMU-LTI-20-008

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
  Yiming Yang (Chair), Carnegie Mellon University
  Barnabás Póczos, Carnegie Mellon University
  Jeff Schneider, Carnegie Mellon University
  Sanjiv Kumar, Google Research NYC

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies.

Copyright © 2020 Wei-Cheng Chang

Keywords: deep kernel learning, deep generative modeling, kernelized discrepancies, Stein variational gradient descent, large-scale kernels

To all of the people that light up my life.

Abstract

Kernel methods are versatile in machine learning and statistics. For instance, the kernel two-sample test induces the Maximum Mean Discrepancy (MMD) to compare two distributions, which serves as a distance metric for learning implicit generative models (IGMs). The kernel goodness-of-fit test, as another example, induces the Kernel Stein Discrepancy (KSD) to measure model-data discrepancy and connects to a variational inference procedure for explicit generative models (EGMs). Other extensions include time series modeling, graph-based learning, and more. Despite their ubiquity, kernel methods often suffer from two fundamental limitations: the difficulty of kernel selection for complex downstream tasks and the limited tractability of large-scale problems. This thesis addresses both challenges in several complementary aspects.

In Part I, we tackle the issue of kernel selection in learning implicit generative models (IGMs) with the kernel MMD. Conventional methods using a fixed-kernel MMD have limited success on high-dimensional, complex distributions. We propose to optimize MMD with neural-parametrized kernels, which are more expressive and improve the state-of-the-art results on high-dimensional distributions (Chapter 2). We also formulate kernel selection as learning kernel spectral distributions, and enrich the spectral distributions by learning IGMs to draw samples from them (Chapter 3).

In Part II, we aim at learning suitable kernels for Stein variational gradient descent (SVGD) in explicit generative modeling (EGMs). Although SVGD with fixed kernels shows encouraging performance on low-dimensional (up to a few hundred dimensions) Bayesian inference tasks, its success on high-dimensional problems such as image generation has been limited. We propose noise-conditional kernel SVGD (NCK-SVGD) for adaptive kernel learning, the first SVGD variant successfully scaled to distributions with dimensions in the thousands, which performs competitively with state-of-the-art IGMs. With a novel entropy regularizer, NCK-SVGD enjoys flexible control over the trade-off between sample diversity and quality (Chapter 4).
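
To make the SVGD procedure concrete, the following is a minimal sketch of a single SVGD update with a fixed RBF kernel and a median-heuristic bandwidth. It is illustrative only: it omits the noise-conditional kernels and entropy regularizer developed in Chapter 4, and it assumes the target's score function is available as a callable.

```python
import numpy as np

def svgd_step(X, score_fn, step=1e-2):
    """One SVGD update on particles X (n, d); score_fn(X) returns grad log p at each row."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # pairwise squared distances
    h = np.median(d2) / np.log(n + 1.0) + 1e-12          # median-heuristic bandwidth
    K = np.exp(-d2 / h)                                   # RBF kernel matrix
    grad_logp = score_fn(X)                               # (n, d) scores at the particles
    attract = K @ grad_logp                               # kernel-weighted score term
    # Repulsion term: sum_j grad_{x_j} k(x_j, x_i) = (2/h) * (x_i * sum_j K_ij - sum_j K_ij x_j)
    repulse = (2.0 / h) * (X * K.sum(axis=1, keepdims=True) - K @ X)
    return X + step * (attract + repulse) / n

# Example: particles transported toward a standard normal target, whose score is -x.
X = np.random.uniform(-4.0, 4.0, size=(200, 2))
for _ in range(500):
    X = svgd_step(X, lambda x: -x, step=0.1)
```
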
In Part III, we address the kernel tractability challenge with variants of random Fourier features (RFF). We propose to learn non-uniformly weighted RFF, which perform as well as uniformly weighted RFF while demanding less memory (Chapter 5). For the kernel contextual bandit problem, we reduce the computational cost of the kernel UCB algorithm by using RFF to approximate the predictive mean and confidence estimate (Chapter 6).

In Part IV, we extend kernel learning to time series modeling and graph-based learning. For change-point detection over time series, we optimize the kernel two-sample test via auxiliary generative models, which act as surrogate samplers of the unknown anomaly distributions (Chapter 7). For graph-based transfer learning, we construct the graph Laplacian by kernel diffusion and leverage label propagation to transfer knowledge from source to target domains (Chapter 8).
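
The kernel two-sample test behind Part I, and reused for change-point detection in Chapter 7, reduces to estimating the MMD between two sample sets. Below is a minimal sketch of the standard unbiased MMD^2 estimator with a fixed Gaussian kernel of bandwidth sigma; in the thesis this fixed kernel is replaced by learned, neural-parametrized kernels.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows in A and B.
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between samples X (m, d) and Y (n, d)."""
    m, n = len(X), len(Y)
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    # Exclude diagonal terms so each within-sample average is unbiased.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()
```

A two-sample test thresholds this statistic (e.g., against a permutation-based null); in Chapter 2 the same quantity, computed with a learned kernel, serves as the training signal for the generator.
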
Contents

1 Introduction
  1.1 Motivations and Background
  1.2 Challenges
  1.3 Thesis Structure and Contributions

I Kernel Learning for Implicit Generative Models

2 Optimizing Kernels for Generative Moment Matching Network
  2.1 Background
  2.2 Two-Sample Test and Generative Moment Matching Network
    2.2.1 MMD with Kernel Learning
    2.2.2 Properties of MMD with Kernel Learning
  2.3 MMD GAN
    2.3.1 Feasible Set Reduction
    2.3.2 Encoding Perspectives and Relations to WGAN
  2.4 Experiments and Results
    2.4.1 Qualitative Analysis
    2.4.2 Quantitative Evaluation
    2.4.3 Stability of MMD GAN
    2.4.4 Computation Issue
    2.4.5 Better Lipschitz Approximation and Necessity of Auto-Encoder
  2.5 Summary
  2.6 Appendix
    2.6.1 Proof of Theorem 3
    2.6.2 Proof of Theorem 4
    2.6.3 Proof of Theorem 5

3 Learning Kernel Spectral Distributions
  3.1 Background
  3.2 IKL Framework
  3.3 IKL Application to MMD-GAN
    3.3.1 Property of MMD GAN with IKL
    3.3.2 Experiments and Results
  3.4 IKL Application to Random Kitchen Sinks
    3.4.1 Consistency and Generalization
    3.4.2 Experiment and Results
  3.5 Summary
  3.6 Appendix
    3.6.1 Proof of Theorem 8
    3.6.2 Proof of Lemma 9
    3.6.3 Proof of Theorem 11
    3.6.4 Hyperparameters of GAN and RKS Experiments

II Kernel Learning for Explicit Generative Models

4 Kernel Stein Generative Modeling
  4.1 Background
  4.2 Score-based Generative Modeling: Inference and Estimation
    4.2.1 Stein Variational Gradient Descent (SVGD)
    4.2.2 Score Estimation
  4.3 Challenges of SVGD in High-dimension
    4.3.1 Mixture of Gaussian with Disjoint Support
    4.3.2 Anneal SVGD in High-dimension
  4.4 SVGD with Noise-conditional Kernels and Entropy Regularization
    4.4.1 Learning Deep Noise-conditional Kernels
    4.4.2 Entropy Regularization and Diversity Trade-off
  4.5 Experiments and Results
    4.5.1 Quantitative Evaluation
    4.5.2 Qualitative Analysis
    4.5.3 The Effect of Entropy Regularization on Precision/Recall
  4.6 Summary
  4.7 Appendix
    4.7.1 Proof of Proposition 17
    4.7.2 Toy Experiment Setup in Section 4.3

III Large-scale Kernel Approximations

5 Data-driven Random Fourier Features
  5.1 Background
    5.1.1 Random Fourier Features
    5.1.2 Bayesian Quadrature for Random Features
  5.2 Stein Effect Shrinkage (SES) Estimator
  5.3 Experiments and Results
  5.4 Discussion and Conclusion

6 Kernel Contextual Bandit at Scale
  6.1 Background
  6.2 Kernel Upper-Confident-Bound Framework
  6.3 Random Fourier Features for Kernel UCB
    6.3.1 RFF-UCB-NI: A Non-incremental Algorithm
    6.3.2 Theoretical Analysis of RFF-UCB-NI
    6.3.3 RFF-UCB: A Fast Incremental Algorithm
  6.4 Experiment and Results
  6.5 Summary
  6.6 Appendix
    6.6.1 Derivation Details of RFF-UCB Online Update
    6.6.2 Kernel Approximation Guarantees
    6.6.3 Regret Bound
    6.6.4 Additional Experiment Settings and Results

IV Extensions to Time-Series and Graph-based Learning

7 Kernel Change-point Detection
  7.1 Background
  7.2 Optimizing Test Power of Kernel CPD
    7.2.1 Difficulties of Optimizing Kernels for CPD
    7.2.2 A Practical Lower Bound on Optimizing Test Power
    7.2.3 Surrogate Distributions using Generative Models
  7.3 KLCPD: Realization for Time Series Applications
    7.3.1 Relations to MMD GAN
  7.4 Experiments and Results
    7.4.1 Evaluation on Real-world Data
    7.4.2 In-depth Analysis on Simulated Data
  7.5 Summary

8 Graph-based Transfer Learning
  8.1 Background
  8.2 KerTL: Kernel Transfer Learning on Graphs
    8.2.1 TL with Graph Laplacian
    8.2.2 Graph Constructions
    8.2.3 Diffusion Kernel Completion
  8.3 Experiments and Results
    8.3.1 Benchmark Datasets and Comparing Methods
    8.3.2 Quantitative Evaluation
  8.4 Summary
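
Chapters 5 and 6 above start from the random Fourier feature approximation. For reference, here is a minimal sketch of uniformly weighted RFF for a Gaussian kernel; the feature count, bandwidth, and seed are illustrative, and Chapter 5 replaces the uniform weights with learned, non-uniform ones.

```python
import numpy as np

def rff_features(X, n_features=256, sigma=1.0, seed=0):
    """Map X (n, d) to features Z such that Z @ Z.T approximates the Gaussian-kernel Gram matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies drawn from the kernel's spectral density N(0, sigma^-2 I), plus random phases.
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Downstream kernel machines then operate on the n x D feature matrix Z
# instead of the n x n Gram matrix, e.g. K_hat = Z @ Z.T.
```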