Learning with Kernels at Scale and Applications in Generative Modeling
Wei-Cheng Chang
CMU-LTI-20-008
Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, www.lti.cs.cmu.edu
Thesis Committee:
Yiming Yang (Chair), Carnegie Mellon University
Barnabás Póczos, Carnegie Mellon University
Jeff Schneider, Carnegie Mellon University
Sanjiv Kumar, Google Research NYC
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies
Copyright © 2020 Wei-Cheng Chang

Keywords: deep kernel learning, deep generative modeling, kernelized discrepancies, Stein variational gradient descent, large-scale kernels

To all of the people that light up my life.

Abstract

Kernel methods are versatile in machine learning and statistics. For instance, the kernel two-sample test induces the Maximum Mean Discrepancy (MMD) to compare two distributions and serves as a distance metric for learning implicit generative models (IGMs). The kernel goodness-of-fit test, as another example, induces the Kernel Stein Discrepancy (KSD) to measure model-data discrepancy and connects to a variational inference procedure for explicit generative models (EGMs). Other extensions include time series modeling, graph-based learning, and more. Despite their ubiquity, kernel methods often suffer from two fundamental limitations: the difficulty of kernel selection for complex downstream tasks and the tractability of large-scale problems. This thesis addresses both challenges in several complementary aspects.

In Part I, we tackle the issue of kernel selection in learning implicit generative models (IGMs) with kernel MMD. Conventional methods using a fixed-kernel MMD have limited success on high-dimensional complex distributions. We propose to optimize MMD with neural-parametrized kernels, which is more expressive and improves the state-of-the-art results on high-dimensional distributions (Chapter 2). We also formulate kernel selection as learning kernel spectral distributions, and enrich the spectral distributions by learning IGMs to draw samples from them (Chapter 3).

In Part II, we aim at learning suitable kernels for Stein variational gradient descent (SVGD) in explicit generative modeling (EGMs). Although SVGD with fixed kernels shows encouraging performance in low-dimensional (within hundreds) Bayesian inference tasks, its success on high-dimensional problems such as image generation is limited. We propose the noise-conditional kernel SVGD (NCK-SVGD) for adaptive kernel learning, which is the first SVGD variant successfully scaled up to distributions with dimension in the several thousands, and performs competitively with state-of-the-art IGMs. With a novel entropy regularizer, NCK-SVGD enjoys flexible control between sample diversity and quality (Chapter 4).

In Part III, we address the kernel tractability challenge with variants of random Fourier features (RFF). We propose to learn non-uniformly weighted RFF, which performs as well as uniformly weighted RFF while demanding less memory (Chapter 5). For the kernel contextual bandit problem, we reduce the computational cost of the kernel UCB algorithm by using RFF to approximate the predictive mean and confidence estimate (Chapter 6).

In Part IV, we extend kernel learning to time series modeling and graph-based learning. For change-point detection over time series, we optimize the kernel two-sample test via auxiliary generative models, which act as surrogate samplers of unknown anomaly distributions (Chapter 7). For graph-based transfer learning, we construct the graph Laplacian by kernel diffusion and leverage label propagation to transfer knowledge from source to target domains (Chapter 8).

Contents
1 Introduction
  1.1 Motivations and Background
  1.2 Challenges
  1.3 Thesis Structure and Contributions

I Kernel Learning for Implicit Generative Models

2 Optimizing Kernels for Generative Moment Matching Network
  2.1 Background
  2.2 Two-Sample Test and Generative Moment Matching Network
    2.2.1 MMD with Kernel Learning
    2.2.2 Properties of MMD with Kernel Learning
  2.3 MMD GAN
    2.3.1 Feasible Set Reduction
    2.3.2 Encoding Perspectives and Relations to WGAN
  2.4 Experiments and Results
    2.4.1 Qualitative Analysis
    2.4.2 Quantitative Evaluation
    2.4.3 Stability of MMD GAN
    2.4.4 Computation Issue
    2.4.5 Better Lipschitz Approximation and Necessity of Auto-Encoder
  2.5 Summary
  2.6 Appendix
    2.6.1 Proof of Theorem 3
    2.6.2 Proof of Theorem 4
    2.6.3 Proof of Theorem 5

3 Learning Kernel Spectral Distributions
  3.1 Background
  3.2 IKL Framework
  3.3 IKL Application to MMD-GAN
    3.3.1 Property of MMD GAN with IKL
    3.3.2 Experiments and Results
  3.4 IKL Application to Random Kitchen Sinks
    3.4.1 Consistency and Generalization
    3.4.2 Experiment and Results
  3.5 Summary
  3.6 Appendix
    3.6.1 Proof of Theorem 8
    3.6.2 Proof of Lemma 9
    3.6.3 Proof of Theorem 11
    3.6.4 Hyperparameters of GAN and RKS experiments

II Kernel Learning for Explicit Generative Models

4 Kernel Stein Generative Modeling
  4.1 Background
  4.2 Score-based Generative Modeling: Inference and Estimation
    4.2.1 Stein Variational Gradient Descent (SVGD)
    4.2.2 Score Estimation
  4.3 Challenges of SVGD in High-dimension
    4.3.1 Mixture of Gaussian with Disjoint Support
    4.3.2 Anneal SVGD in High-dimension
  4.4 SVGD with Noise-conditional Kernels and Entropy Regularization
    4.4.1 Learning Deep Noise-conditional Kernels
    4.4.2 Entropy Regularization and Diversity Trade-off
  4.5 Experiments and Results
    4.5.1 Quantitative Evaluation
    4.5.2 Qualitative Analysis
    4.5.3 The Effect of Entropy Regularization on Precision/Recall
  4.6 Summary
  4.7 Appendix
    4.7.1 Proof of Proposition 17
    4.7.2 Toy Experiment Setup in Section 4.3

III Large-scale Kernel Approximations

5 Data-driven Random Fourier Features
  5.1 Background
    5.1.1 Random Fourier Features
    5.1.2 Bayesian Quadrature for Random Features
  5.2 Stein Effect Shrinkage (SES) Estimator
  5.3 Experiments and Results
  5.4 Discussion and Conclusion

6 Kernel Contextual Bandit at Scale
  6.1 Background
  6.2 Kernel Upper-Confident-Bound Framework
  6.3 Random Fourier Features for Kernel UCB
    6.3.1 RFF-UCB-NI: A Non-incremental Algorithm
    6.3.2 Theoretical analysis of RFF-UCB-NI
    6.3.3 RFF-UCB: A Fast Incremental Algorithm
  6.4 Experiment and Results
  6.5 Summary
  6.6 Appendix
    6.6.1 Derivation Details of RFF-UCB Online Update
    6.6.2 Kernel Approximation Guarantees
    6.6.3 Regret Bound
    6.6.4 Additional Experiment Settings and Results

IV Extensions to Time-Series and Graph-based Learning

7 Kernel Change-point Detection
  7.1 Background
  7.2 Optimizing Test Power of Kernel CPD
    7.2.1 Difficulties of Optimizing Kernels for CPD
    7.2.2 A Practical Lower Bound on Optimizing Test Power
    7.2.3 Surrogate Distributions using Generative Models
  7.3 KLCPD: Realization for Time Series Applications
    7.3.1 Relations to MMD GAN
  7.4 Experiments and Results
    7.4.1 Evaluation on Real-world Data
    7.4.2 In-depth Analysis on Simulated Data
  7.5 Summary

8 Graph-based Transfer Learning
  8.1 Background
  8.2 KerTL: Kernel Transfer Learning on Graphs
    8.2.1 TL with Graph Laplacian
    8.2.2 Graph Constructions
    8.2.3 Diffusion Kernel Completion
  8.3 Experiments and Results
    8.3.1 Benchmark Datasets and Comparing Methods
    8.3.2 Quantitative Evaluation
  8.4 Summary

V Conclusions and Future Work

9 Concluding Remarks
  9.1 Main Contributions
  9.2 Discussions
  9.3 Future Work

Bibliography
List of Figures

2.1 Generated samples from GMMN-D (Dataspace), GMMN-C (Codespace) and our MMD GAN with batch size B = 64.
2.2 Generated samples from GMMN-D and GMMN-C with training batch size B = 1024.
2.3 Generated samples from WGAN and MMD GAN on MNIST, CelebA, and LSUN bedroom datasets.
2.4 Training curves and generated samples at different stages of training. We can see a clear correlation between lower distance and better sample quality.
2.5 Computation time per one generator update step with different batch sizes.
2.6 MMD GAN results using gradient penalty (Gulrajani et al., 2017) and without auto-encoder reconstruction loss during training.
3.1 Samples generated by MMDGAN-IKL on CIFAR-10, CelebA and LSUN datasets.
3.2 Convergence of MMD GANs with different kernels on text generation.
3.3 Learning MMD GAN (IKL) without the variance constraint on the Google Billion Words dataset for text generation.
3.4 Computational time per one generator step with different batch sizes.
3.5 The comparison between IKL-NN and IKL-RFF on CIFAR-10 under different numbers of random features.
3.6 Left: the training examples when d = 2. Right: the classification error versus data dimension.
3.7 Test error rate versus number of bases in the second stage on benchmark binary classification tasks. We report mean and standard deviation over five runs. Our method (IKL) is compared with RFF (Rahimi and Recht, 2009), OPT-KL (Sinha and Duchi, 2016), SM (Wilson and Adams, 2013) and the end-to-end trained MLP (NN).
4.1 (a) Visualization of real data from p_d(x) and samples from different inference algorithms. The three figures in the left column are on a 2-dimensional mixture of Gaussians. The right column shows the results of different sampling algorithms on 64-dimensional distributions, where we visualize only the first two dimensions, as the real-data distribution is isotropic. (b) Maximum Mean Discrepancy (MMD) between the real-data samples (P_d) and generated samples (Q) from different inference algorithms, where A-SGLD means Anneal-SGLD, A-SVGD means Anneal-SVGD with a fixed kernel, and NCK-SVGD means Anneal-SVGD with noise-conditional kernels. Note that SGLD (blue) and SVGD (green) have alike samples with even mixture weights, so the two curves overlap.
4.2 Medians of pairwise distances under different noise levels σ(1) > ... > σ(10) across dimension d.
4.3 (Non-normalized) density of p^{1/β} where p(x) is a 2-dimensional mixture of Gaussians with imbalanced mixture weights.
4.4 MNIST experiment evaluated with Improved Precision and Recall (IPR). (a) NCK-SVGD-D (data-space kernels) and NCK-SVGD-C (code-space kernels) with varying kernel bandwidth (γ from the RBF kernel) and the entropy regularizer α. (b) With 0.5% of A-SGLD (i.e., 5 steps at noise level l = 0) as initialization, NCK-SVGD-C achieves higher recall than A-SGLD. (c) Uncurated samples of NCK-SVGD-C, from high-precision/low-recall (A) to low-precision/high-recall (B).
4.5 CIFAR-10 Precision-Recall Curve.
4.6 Uncurated samples generated from NCK-SVGD on MNIST and CIFAR-10 datasets.
4.7 SVGD baseline comparison on MNIST.
4.8 SVGD baseline comparison on CIFAR-10.
4.9 SVGD baseline comparison on CelebA.
4.10 Varying the entropy regularizer β on the Improved Precision Recall (IPR) curve and its corresponding samples. For A, β = 0.01 and (P, R) = (0.99, 0.04). For B, β = 0.2 and (P, R) = (0.96, 0.24). For C, β = 0.8 and (P, R) = (0.90, 0.54). For D, β = 2.2 and (P, R) = (0.81, 0.83). The empirical observation on MNIST's IPR curve aligns well with our theoretical insights on the entropy regularization β.
5.1 Kernel approximation results. The x-axis is the number of samples used to approximate the integral and the y-axis shows the relative kernel approximation error.
5.2 A comparison of the distribution of weights on the adult data set. M is the number of random features.
5.3 A comparison of different sampled sizes of training data in solving objective function (5.8).
5.4 Comparison of uniform and non-uniform shrinkage weights.
6.1 Running the RFF-UCB-NI algorithm on the letter dataset. The kernel norm assumption of Theorem 24 is verified in Figure 6.1a. Figure 6.1b plots the relative kernel approximation error. We present the data-dependent random features s_t and s'_t in comparison with time step t. Lastly, we compare the running time of inverting the kernel matrix in the GP-UCB and RFF-UCB-NI algorithms.
6.2 Test reward comparison of realizability-based methods: RFF-UCB, LinUCB and CGP-UCB. The y-axis is test reward and the x-axis is the number of examples (rounds). Results are averaged over 10 different random seeds.
6.3 Running time comparison of realizability-based methods: RFF-UCB, LinUCB and CGP-UCB. The y-axis is running time and the x-axis is the number of examples (rounds). Results are averaged over 10 different random seeds.
6.4 Test reward comparison of agnostic-based approaches: ε-Greedy, Online-Cover, Bagging, and RegCB. The y-axis is test reward and the x-axis is the number of examples (rounds). Results are averaged over 10 different random seeds.
6.5 Test rewards of RFF-UCB versus GP-UCB.
6.6 Running time of RFF-UCB versus GP-UCB.
6.7 Test rewards of RFF-UCB versus GP-UCB.
6.8 Running time of RFF-UCB versus GP-UCB.
7.1 A sliding window over the time series input with two intervals, the past and the current, where w_l, w_r are the sizes of the past and current intervals, respectively. X^(l), X^(r) consist of the data in the past and current intervals, respectively.
7.2 Left: 5 × 5 Gaussian grid, samples from P, Q and G. We discuss two cases of Q, one with sufficient samples, the other with insufficient samples. Right: test power of kernel selection versus q. Choosing kernels by γ_{k*}(X, Z) using a surrogate distribution G is advantageous when we do not have sufficient samples from Q, which is typically the case in the time series CPD task.
7.3 Ablation test of KL-CPD.
7.4 Varying w_r on Bee-Dance.
7.5 MMD with different encoders f_{φ_e} versus data dimension, under 10 random seeds.
7.6 Conditionally generated samples by KL-CPD and system-predicted CPD scores on the Bee-Dance (left) and HASC (right) datasets. The first three subplots show ground truth signals (blue line), 10 conditionally generated samples (green lines) and change points (red vertical lines). The last subplot is the MMD score, which mostly peaks around ground truth change points.
8.1 Comparison on APR and MNIST.
8.2 CorrNet, HHTL and KerTL on the APR dataset (left) and the MNIST dataset (right) with a varying quantity of parallel data.
8.3 SVM, SSL and KerTL on the APR dataset (left) and the MNIST dataset (right) with a varying quantity of labeled data in the target domain.
List of Tables

2.1 Inception scores of MMD GAN and different GAN algorithms.
3.1 Inception scores, FID scores, and JS-4 divergence results.
3.2 The comparison between IKL-NN and IKL-RFF on Google Billion Word.
4.1 CIFAR-10 Inception and FID scores.
5.1 Data statistics and supervised learning errors. n is the number of instances, d is the dimension of instances, M is the number of random features. For regression, we report the relative regression error ‖y − ỹ‖_2/‖y‖_2. For classification, we report the classification error.
6.1 Data statistics: T the number of instances, N the number of classes, d the number of unique features, d̃ the number of unique features after dimension reduction. N/A denotes not applying dimension reduction.
7.1 Datasets. T is the length of the time series; #labels is the average number of labeled change points.
7.2 AUC on four real-world datasets. KL-CPD has the best AUC on three out of four datasets.
7.3 AUC on three artificial datasets. Mean and standard deviation under 10 random seeds.
8.1 Data Statistics.
8.2 Overall results on the APR dataset with target domain training-set size of 2 and parallel set size of 1024. Bold-faced numbers indicate the best result in each row.
8.3 Overall results on the MNIST dataset with target domain training-set size of 2 and parallel set size of 1024. Bold-faced numbers indicate the best result in each row.
Chapter 1
Introduction
1.1 Motivations and Background
Kernel methods are ubiquitous in machine learning research (Scholkopf and Smola, 2001; Hofmann et al., 2008), covering a broad range of topics, including Support Vector Machines for classification and regression problems (Cortes and Vapnik, 1995; Drucker et al., 1997), non-linear component analysis for dimensionality reduction (Schölkopf et al., 1998; Bach and Jordan, 2002), Gaussian Processes for sequential modeling and extrapolation (Williams and Rasmussen, 2006), large-scale kernel approximation techniques (Williams and Seeger, 2001; Rahimi and Recht, 2008), kernel learning based on structured data such as trees and graphs (Vishwanathan et al., 2004, 2010), and more. A newly emerging topic in kernel learning is kernelized discrepancy analysis for generative models. That is, for optimizing a generative model, we need to measure the discrepancy between the true-data distribution (denoted as P) and the system-produced distribution (denoted as Q), and to develop tractable algorithms for minimizing that discrepancy. Kernel methods provide a principled way to specify the desirable properties of the objective functions and induce analytical forms for solving the optimization problems (Gretton et al., 2012a; Chwialkowski et al., 2016; Liu et al., 2016). A fundamental question we need to answer is how to define or measure the discrepancy between two distributions. Existing work on this aspect can be divided into two important classes of kernelized discrepancies, as characterized below.
• Maximum Mean Discrepancy (MMD): Given two easy-to-sample distributions P and Q, MMD is a data-to-data metric that defines the discrepancy M_k(P, Q) over samples from P and Q. In other words, no explicit density modeling of either P or Q is required. MMD has been found useful for the kernelized two-sample hypothesis test as an application (Gretton et al., 2012a).
• Kernel Stein Discrepancy (KSD): Given an easy-to-sample distribution P and an explicit density model Q, KSD is a data-to-model metric that defines the discrepancy S_k(P, Q) via samples from P and the gradients of the log-density of Q. In other words, KSD avoids the (often expensive) sampling process from an explicit density model. KSD has been found useful for the kernelized goodness-of-fit test as an application (Chwialkowski et al., 2016; Liu et al., 2016).
In comparison, the main difference between MMD and KSD is that the former only requires samples from the model distribution Q rather than explicit density information, while the latter requires explicit density modeling of the distribution Q without the need for sampling. The complementary perspectives of MMD and KSD lead to different usages in two categories of generative models (Diggle and Gratton, 1984), as introduced below.
• Implicit Generative Models (IGMs): IGMs model the sampling process that draws novel samples from the underlying data distribution. The most well-known example is Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and recent variants are capable of generating high-fidelity samples from high-dimensional complex real-world datasets (Brock et al., 2019; Karras et al., 2019). Notably, IGMs are implicit in that they do not define explicit densities of the model distributions.
• Explicit Generative Models (EGMs): EGMs model explicit densities of the underlying data distributions. EGMs can be either normalized or un-normalized: the former often have limited function classes, such as normalizing flows (Dinh et al., 2016), while the latter often have strong expressive power, such as deep energy models (Du and Mordatch, 2019). While un-normalized EGMs learn robust features for many downstream tasks (Grathwohl et al., 2020), high-dimensional inference for un-normalized EGMs is expensive when using MCMC sampling (Welling and Teh, 2011).
As IGMs are sampling-based methods, MMD is a natural choice of metric for evaluating the discrepancy between P and Q by comparing the generated samples from Q against real samples from P. On the other hand, as EGMs are based on explicit density modeling of Q, KSD is a natural choice of metric for evaluating the discrepancy, eluding the expensive MCMC sampling from un-normalized density models when the dimensionality is high. KSD also connects to a kernelized variational inference algorithm, namely Stein Variational Gradient Descent (SVGD) (Liu and Wang, 2016). SVGD iteratively transforms samples to approximate the target EGM, with gradient updates that optimally decrease the KL divergence; the gradient flow of the KL divergence amounts to the KSD (Liu, 2017). SVGD has shown encouraging results in several low-dimensional Bayesian inference tasks (Feng et al., 2017; Gong et al., 2019), but how to make it successful in high-dimensional spaces is an open challenge.
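To make the SVGD update concrete, below is a minimal NumPy sketch of the standard SVGD particle update with a fixed RBF kernel (Liu and Wang, 2016). The score function `score_fn`, the bandwidth `h`, and the step size are illustrative placeholders; this is the vanilla update, not the noise-conditional variant developed in Chapter 4.

```python
import numpy as np

def svgd_step(X, score_fn, step_size=1e-2, h=1.0):
    """One SVGD update on particles X of shape (n, d):
    x_i <- x_i + eps/n * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ],
    with the RBF kernel k(x, x') = exp(-||x - x'||^2 / h)."""
    diffs = X[:, None, :] - X[None, :, :]                  # (n, n, d), entry [j, i] = x_j - x_i
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / h)           # (n, n) kernel matrix
    grad_K = (-2.0 / h) * (K[:, :, None] * diffs).sum(0)   # (n, d): sum_j grad_{x_j} k(x_j, x_i)
    phi = (K.T @ score_fn(X) + grad_K) / X.shape[0]        # driving term + repulsive term
    return X + step_size * phi

# Example: sample from a standard Gaussian, whose score is grad log p(x) = -x.
particles = np.random.default_rng(0).normal(size=(100, 2)) * 3.0
for _ in range(500):
    particles = svgd_step(particles, score_fn=lambda X: -X, step_size=0.1)
```

The first term inside `phi` pulls particles toward high-density regions of the target, while the kernel-gradient term acts as a repulsive force that keeps the particles spread out.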
1.2 Challenges
Although kernelized discrepancies offer principled perspectives for the learning of IGMs and the sampling of EGMs, the performance of kernel-based algorithms on downstream tasks often hinges on the choice of kernels (e.g., which kernel families to use and what kernel hyper-parameters to set). This issue is commonly known as the kernel selection problem (a.k.a. kernel learning or kernel optimization). Kernel selection for traditional supervised learning settings (e.g., classification with SVMs and regression with Gaussian Processes) has been studied (Rakotomamonjy et al., 2008; Gönen and Alpaydın, 2011; Wilson and Adams, 2013). Nonetheless, how to successfully apply kernelized discrepancies to modern deep generative modeling problems (e.g., high-dimensional computer vision datasets) is still underexplored. In the scope of this thesis, we aim at providing solutions or insights to the following open research questions:
1. What kernel function classes are suitable for learning high-dimensional IGMs?
2. What kernel function classes are suitable for drawing samples from high-dimensional EGMs?
3. How to optimize the kernel two-sample test when samples from one distribution are extremely sparse?
4. How to construct a unified kernel when the cross-domain graphs have very sparse links?
Besides tackling the kernel learning challenge for high-dimensional applications and sample-sparsity problems, kernel-based algorithms often suffer from tractability issues when the number of instances n is large. For example, consider solving a linear regression over n instances {(x_i, y_i)}_{i=1}^n where x_i ∈ R^d and y_i ∈ R. The time complexity is O(nd²) and the memory cost is O(nd). Its kernel-based counterpart, however, requires solving a linear system, which takes O(n³) time and O(n²) memory. Such complexity is far from desirable in big-data applications. To overcome the kernel scalability issue, several kernel approximation techniques have been proposed, and random features (Rahimi and Recht, 2008, 2009) is arguably one of the most elegant and impactful approaches. For shift-invariant kernels, the kernel function can be approximated by an inner product of two random feature maps, hence greatly reducing the time/space complexity to the linear case. In this thesis, we seek to further improve random features by asking two research questions:
1. Can we derive non-uniformly weighted random features that require fewer samples (hence less memory) while maintaining a similar kernel approximation error?
2. Can we leverage random features to speed up kernel contextual bandit algorithms, which solve sequential decision-making problems that rely heavily on kernel matrix computations?
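To make the random-feature idea concrete, here is a minimal NumPy sketch of the standard uniformly weighted RFF approximation of a Gaussian RBF kernel (Rahimi and Recht, 2008); the dimensions, bandwidth, and number of features below are illustrative only.

```python
import numpy as np

def rff_map(X, W, b):
    """Random Fourier feature map z(x) with z(x)^T z(y) ~= exp(-gamma * ||x - y||^2)."""
    D = W.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
n, d, D, gamma = 500, 10, 2048, 0.5
X = rng.normal(size=(n, d))
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))   # samples from the kernel's spectral density
b = rng.uniform(0.0, 2.0 * np.pi, size=D)                 # random phases

Z = rff_map(X, W, b)                                      # O(ndD) time, O(nD) memory
K_approx = Z @ Z.T                                        # replaces the exact O(n^2) Gram matrix
K_exact = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(K_approx - K_exact).max())                   # small pointwise approximation error
```

With Z in hand, kernel ridge regression reduces to ordinary linear regression on the D-dimensional features, recovering roughly the O(nD²) time and O(nD) memory of the linear case.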
1.3 Thesis Structure and Contributions
Thesis Statement This thesis aims to advance kernel-based algorithms in two complementary aspects: the expressiveness of kernel functions for modern complex applications, and the tractability of kernel methods in large-scale settings (e.g., large data and high dimensionality). The contributions of this thesis are threefold. (1) We propose generic frameworks that optimize deep kernels for deep generative modeling, covering both the IGM and EGM families; (2) we present non-uniform random Fourier features for large-scale kernel approximation, which enjoy lower memory usage, and study applications to kernel contextual bandit problems; (3) we extend kernel learning to two novel applications with structured information: time series change-point detection and graph-based transfer learning. We outline the thesis structure and the contributions of the individual chapters below.

Part I: Learning Kernels for Implicit Generative Modeling. We tackle the kernel selection challenge when using kernel MMD to learn IGMs. Dziugaite et al. (2015) and Li et al. (2015b) are the pioneering works that train IGMs by minimizing the kernel MMD, without any kernel selection procedure. Nevertheless, due to the restricted power of simple Gaussian kernels and the sampling variance of the empirically estimated MMD, they only achieve limited success on low-dimensional datasets (several hundreds of dimensions at most) with simple structure such as MNIST, but fail on high-dimensional distributions (several thousands of dimensions and higher) with complex objects such as CIFAR-10, CelebA, and more.
• In Chapter 2, we revisit the kernel two-sample test and introduce its connection with learning IGMs. We first point out the deficiency of using a simple kernel MMD (e.g., pre-defined Gaussian kernels) when training IGMs on high-dimensional data distributions. We then propose to enrich the function class of MMD via deep neural-parameterized kernels, which leads to a min-max objective where the maximization is regarded as the kernel selection procedure, hence the name MMD GAN. From the perspective of Integral Probability Metrics, we connect MMD GAN with many GAN frameworks, such as the state-of-the-art Wasserstein GAN (Arjovsky et al., 2017; Gulrajani et al., 2017). Quantitatively speaking, MMD GAN significantly outperforms its predecessors that did not conduct kernel learning, and also consistently improves upon the state-of-the-art WGAN on conventional image generation metrics such as the Inception score and FID score.

The content of Chapter 2 appears in (Li et al., 2017): Chun-Liang Li∗, Wei-Cheng Chang∗, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In NeurIPS, 2017. Code available at: https://github.com/OctoberChang/MMD-GAN

• In Chapter 3, we take a further step and optimize shift-invariant kernels through the lens of kernel spectral distributions. Existing kernel learning approaches (Gönen and Alpaydın, 2011; Wilson and Adams, 2013) explicitly model the spectral density with simple exponential families (e.g., mixtures of Gaussians), often resulting in kernels of limited capacity. We instead model the sampling process of the kernel spectral distribution by learning an IGM, which induces an implicit kernel learning (IKL) framework that constructs kernel evaluations from the Fourier samples drawn from the IGM. The IGM can be optimized by back-propagation from task-dependent kernel learning objectives. We demonstrate the success of the IKL framework on training generative models (MMD-GAN IKL) for unsupervised image/text generation, as well as on learning more discriminative kernels for classification.

The content of Chapter 3 appears in (Li et al., 2019): Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, and Barnabás Póczos. Implicit kernel learning. In AISTATS, 2019. Code available at: https://github.com/OctoberChang/kernel_rff_learn

Part II: Learning Kernels for Explicit Generative Modeling. We aim at learning suitable kernels for Stein Variational Gradient Descent (SVGD) to draw samples from score-based EGMs. Induced from the Kernel Stein Discrepancy, SVGD (Liu and Wang, 2016; Liu, 2017) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate a given distribution, based on functional gradient descent in an RKHS. SVGD with simple pre-determined kernels has shown promising results on several low-dimensional (within hundreds of dimensions) Bayesian inference tasks. However, how to make its inference effective and scalable for complex high-dimensional distributions (several thousands of dimensions and more) is an open question that has not been studied in sufficient depth.
• In Chapter 4, we first point out that the challenge of high-dimensional inference is dealing with multi-modal distributions with many low-density regions, where SVGD can fail even on simple Gaussian mixtures. To this end, we propose noise-conditional kernel SVGD (NCK-SVGD), where the kernels are conditionally learned or selected based on multi-scale noise-perturbed data distributions. With an additional entropy regularization term in the objective, NCK-SVGD enjoys flexible control between sample diversity and quality. Quantitatively speaking, NCK-SVGD achieves a new state-of-the-art FID score of 21.95 on CIFAR-10 among score-based EGMs, and is comparable to recent GAN-based models in the IGM family.

The content of Chapter 4 appears in (Chang et al., 2020a): Wei-Cheng Chang, Chun-Liang Li, Youssef Mroueh, and Yiming Yang. Kernel Stein generative modeling. In submission to NeurIPS 2020 (arXiv:2007.03074).

Part III: Large-scale Kernel Approximation. We address the kernel scalability challenge (with respect to the number of instances) by studying variants of random Fourier features (RFF) (Rahimi and Recht, 2008). The value of a shift-invariant kernel can be approximated by an inner product of two Fourier feature maps, where the Fourier bases are equally weighted samples drawn from the kernel spectral distribution. RFF has received great attention in the past decade due to its mathematically elegant form and decent performance on certain classification/regression benchmarks. Recently, many advanced variants (Yang et al., 2014; Sinha and Duchi, 2016; Yu et al., 2016; Bach, 2017) have been proposed to improve RFF under different contexts and settings.
• In Chapter 5, we propose to use unequally weighted Fourier bases to approximate the kernel integral, backed by a novel bias-variance trade-off insight. While seemingly counter-intuitive, the James-Stein effect (Stein, 1956; Muandet et al., 2014) suggests that a family of biased estimators can result in lower risk than the Monte Carlo estimate. Thus, we consider learning a data-driven reweighting scheme on RFF by optimizing the kernel approximation error. By lowering variance with slightly biased estimates, data-driven RFF enjoys better kernel approximation when the number of random features is small, which results in lower memory demand for large-scale applications.

The content of Chapter 5 appears in (Chang et al., 2017a): Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabás Póczos. Data-driven random Fourier features using Stein effect. In IJCAI, 2017. Code available at: https://github.com/OctoberChang/stein_rff

• In Chapter 6, we aim to improve the scalability of kernel contextual bandit algorithms, in particular within the upper-confidence-bound (UCB) framework. While kernel UCB enjoys a provable no-regret guarantee, its computational cost is expensive because of iteratively updating the inverse of the kernel matrix (i.e., O(KT³) up to the T-th trial, where K is the number of arms). To this end, we propose using RFF to approximate the posterior mean and covariance estimates of the kernel UCB algorithm. Empirically, we find that a number of random features growing at the rate O(polylog(t)) is sufficient to obtain bounded kernel approximation error and similar regret, while enjoying an order of magnitude (e.g., 10x) faster computation compared to the exact kernel UCB algorithm.

The content of Chapter 6 appears in (Chang et al., 2019b): Wei-Cheng Chang, Rajat Sen, Yiming Yang, and Sanjay Shakkottai. Fast kernel contextual bandits with random Fourier features. Technical Report, 2019. Code available at: https://github.com/OctoberChang/rff_ucb

Part IV: Extensions to Time-Series Modeling and Graph-based Learning. Kernel methods also find applications to structured data such as time series and graphs. In particular, we are interested in time series change-point detection (CPD) problems, where an anomaly suggests a distribution shift from the historical to the current time window. We also look at graph-based transfer learning problems, where graph Laplacians can be constructed from different kernel families.
• In Chapter 7, we formulate time series change-point detection as a distribution-shift detection problem using the kernel two-sample test. We first demonstrate that conventional kernel selection strategies for the kernel two-sample test end up with sub-optimal performance when only very sparse (or even no) samples are available from the anomaly distribution. Thus, we propose to optimize a lower bound of the test power via an auxiliary generative model that mimics the sampling process of the anomaly distribution. With deep kernel parameterization, the proposed kernel learning framework empirically detects different types of change-points on both synthetic datasets and real-world applications.
The content of Chapter 7 appears in (Chang et al., 2019a): Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabás Póczos. Kernel change-point detection with auxiliary deep generative models. In ICLR, 2019. Code available at: https://github.com/OctoberChang/klcpd_code
• In Chapter 8, we focus on the setting of transductive transfer learning, where the source and target domains have heterogeneous feature spaces (e.g., English vs. other languages). A key challenge of graph-based label propagation methods emerges: how to compute kernel values between data from different domains if they are not in the same feature space? We propose diffusion kernel completion to construct a unified graph Laplacian across domains and leverage semi-supervised label propagation to transfer knowledge from the source to the target domain.
The content of Chapter 8 appears in (Chang et al., 2017b): Wei-Cheng Chang∗, Yuexin Wu∗, Hanxiao Liu, and Yiming Yang. Cross-domain kernel induction for transfer learning. In AAAI, 2017. Code available at: https://github.com/OctoberChang/KerTL
Additional Contributions Apart from the kernel-based algorithms for generative models and structured data, I have also worked on several other topics during the course of my PhD study, as listed below.
Large-scale multi-label text classification (Chang et al., 2020c, 2018; Liu et al., 2017)
Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit Dhillon. Taming pretrained transformers for extreme multi-label text classification. In KDD, 2020. Code available at: https://github.com/OctoberChang/X-Transformer
Wei-Cheng Chang, Hsiang-Fu Yu, Inderjit Dhillon, and Yiming Yang. SeCSeq: Semantic coding for sequence-to-sequence based extreme multi-label classification. In NeurIPS CDNNRIA Workshop, 2018. Code available at: https://github.com/OctoberChang/SeCSeqXMC
Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In SIGIR, 2017.
Pre-training tasks for large-scale retrieval (Chang et al., 2020b) Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. In ICLR, 2020.
Time series modeling with deep learning (Lai et al., 2018) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. In SIGIR, 2018. Code available at: https://github.com/laiguokun/LSTNet
Word quality estimation for machine translation (Hu et al., 2018) Junjie Hu, Wei-Cheng Chang, Yuexin Wu, and Graham Neubig. Contextual encoding for translation quality estimation. In WMT, 2018. Code available at: https://github.com/JunjieHu/CEQE
Part I
Kernel Learning for Implicit Generative Models
Chapter 2
Optimizing Kernels for Generative Moment Matching Network
2.1 Background
Modeling an arbitrary density is a statistically challenging task (Wasserman, 2013). For many applications, however, accurate density estimation is not necessary, since we are only interested in sampling from the approximated distribution. Instead of estimating the density of P_X, Implicit Generative Models (IGMs) (Goodfellow et al., 2014; Mohamed and Lakshminarayanan, 2016) start from a base distribution P_Z over Z, such as a Gaussian distribution, and then train a transformation function g_θ such that Q_θ ≈ P_X, where Q_θ is the underlying distribution of g_θ(z) and z ∼ P_Z. At the heart of learning IGMs is defining a distance D(P_X, Q_θ) between the data distribution P_X and the model distribution Q_θ when we can only access samples from the two distributions; we then train the transformation function (i.e., the generator) g_θ by optimizing the defined distance. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) train a binary classifier f_φ to estimate the distance between P_X and Q_θ, which corresponds to the dual norm of the Jensen-Shannon divergence. Using auxiliary neural networks to estimate different distances for training GANs has been widely studied, for example with f-divergences (Mao et al., 2017; Nowozin et al., 2016). Integral probability metrics (IPMs) (Müller, 1997) are another class of distances for learning IGMs, including total variation (Zhao et al., 2017), the Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017), the Cramer distance (Bellemare et al., 2017), the Fisher IPM (Mroueh and Sercu, 2017), and more. We focus on learning IGMs with the kernel maximum mean discrepancy (MMD), which is also an IPM-based distance (Gretton et al., 2012a). Specifically, the Generative Moment Matching Network (GMMN) (Dziugaite et al., 2015; Li et al., 2015b) learns the generator g_θ by minimizing the MMD distance induced by a fixed mixture of Gaussian kernels. However, GMMN does not have promising empirical results comparable with GANs on challenging benchmarks (Yu et al., 2015; Liu et al., 2015). In this chapter, we improve GMMN by optimizing deep neural-parameterized kernels for a more discriminative MMD distance between P_X and Q_θ, and draw a deeper connection between GMMN, the two-sample test, kernel learning, and adversarial training.

2.2 Two-Sample Test and Generative Moment Matching Network

Assume we are given data {x_i}_{i=1}^n, where x_i ∈ X and x_i ∼ P_X. If we are interested in sampling from P_X, it is not necessary to estimate the density of P_X. Instead, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) train a generator g_θ parameterized by θ to transform samples z ∼ P_Z, where z ∈ Z, into g_θ(z) ∼ Q_θ such that Q_θ ≈ P_X. To measure the distance D(P_X, Q_θ) between P_X and Q_θ via their samples {x_i}_{i=1}^n and {g_θ(z_j)}_{j=1}^n during training, Goodfellow et al. (2014) train a discriminator f_φ parameterized by φ. The learning is done by playing a two-player game, where f_φ tries to distinguish x_i from g_θ(z_j) while g_θ aims to confuse f_φ by generating g_θ(z_j) similar to x_i.

On the other hand, distinguishing two distributions from finite samples is known as the two-sample test in statistics. One way to conduct a two-sample test is via the kernel maximum mean discrepancy (MMD) (Gretton et al., 2012a). Let k be the kernel of a reproducing kernel Hilbert space (RKHS) H_k of functions on a set X. We assume that k is measurable and bounded, sup_{x∈X} k(x, x) < ∞. Given a kernel k with regular conditions, the square of the MMD distance between two distributions P and Q is defined as

M_k(P, Q) ≜ ‖µ_P − µ_Q‖²_H = E_P[k(x, x′)] − 2 E_{P,Q}[k(x, y)] + E_Q[k(y, y′)],   (2.1)

where µ_P = E_{x∼P}[k(x, ·)] and µ_Q = E_{y∼Q}[k(y, ·)] are the kernel mean embeddings of P and Q.

Theorem 1 ((Gretton et al., 2012a, Theorem 5, p. 727)). Given a kernel k, if k is a characteristic kernel, then M_k(P, Q) = 0 iff P = Q.

One example of a characteristic kernel is the Gaussian kernel k(x, x′) = exp(−‖x − x′‖²). Based on Theorem 1, Li et al. (2015b) and Dziugaite et al. (2015) propose the generative moment-matching network (GMMN), which uses M_k(P, Q) as D(P, Q) and trains g_θ by

min_θ M_k(P_X, Q_θ),   (2.2)

with a fixed Gaussian kernel k rather than training an additional neural-parameterized discriminator f as in GANs. While GMMN is easy to train without solving a min-max objective as in GANs, its empirical performance is not satisfactory, as we show in Figure 2.1.
2.2.1 MMD with Kernel Learning
In practice, we consider finite samples from two distributions to estimate the MMD distance. Given X = {x_1, ..., x_n} ∼ P and Y = {y_1, ..., y_n} ∼ Q, one estimator of M_k(P, Q) is

M̂_k(X, Y) = (1/\binom{n}{2}) Σ_{i≠i′} k(x_i, x_{i′}) − (2/\binom{n}{2}) Σ_{i≠j} k(x_i, y_j) + (1/\binom{n}{2}) Σ_{j≠j′} k(y_j, y_{j′}).
Because of the sampling variance, M̂_k(X, Y) may not be zero even when P = Q. We therefore conduct a hypothesis test with the null hypothesis H_0: P = Q. For a given allowable probability of false rejection α, we can only reject H_0, which implies P ≠ Q, if M̂_k(X, Y) > c_α for some chosen threshold c_α > 0. Otherwise, Q passes the hypothesis test and Q is indistinguishable from P under this test. Please refer to Gretton et al. (2012a) for more details.
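As a concrete reference, the following PyTorch sketch computes an empirical MMD between two sample sets with a mixture-of-RBF kernel. It uses the standard unbiased estimator of Gretton et al. (2012a), which differs from the display above only in the normalization constants; the bandwidths and the kernel parameterization are illustrative placeholders, not the exact code released with this thesis.

```python
import torch

def mix_rbf_kernel(X, Y, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Mixture-of-RBF Gram matrix: k(x, y) = sum_q exp(-||x - y||^2 / (2 * sigma_q^2))."""
    sq_dists = torch.cdist(X, Y) ** 2
    return sum(torch.exp(-sq_dists / (2.0 * s ** 2)) for s in sigmas)

def mmd2_unbiased(X, Y, kernel=mix_rbf_kernel):
    """Unbiased estimate of M_k(P, Q) from samples X ~ P of shape (n, d) and Y ~ Q of shape (m, d)."""
    n, m = X.shape[0], Y.shape[0]
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    term_xx = (Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))   # estimates E_P[k(x, x')]
    term_yy = (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1))   # estimates E_Q[k(y, y')]
    return term_xx + term_yy - 2.0 * Kxy.mean()                    # minus 2 E_{P,Q}[k(x, y)]
```

Because the estimate is differentiable with respect to X and Y, it can be back-propagated through a generator, which is how the training procedures in this chapter use it.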
Intuitively, if the kernel k cannot produce a high MMD distance M_k(P, Q) when P ≠ Q, then the estimate M̂_k(X, Y) has more chance of being smaller than c_α. In that case we are unlikely to reject the null hypothesis H_0 with finite samples, which implies that Q is not distinguishable from P. Therefore, instead of training g_θ via Eq. (2.2) with a pre-specified kernel k as in GMMN, we consider training g_θ via
min_θ max_{k∈K} M_k(P_X, Q_θ),   (2.3)

which takes different possible characteristic kernels k ∈ K into account. Alternatively, we can view Eq. (2.3) as replacing the fixed kernel k in Eq. (2.2) with an adversarially learned kernel, obtained via kernel learning arg max_{k∈K} M_k(P_X, Q_θ), so as to have a stronger signal where P_X ≠ Q_θ for training g_θ. We refer interested readers to Fukumizu et al. (2009) for more rigorous discussions about test power and increasing MMD distances. However, it is difficult to optimize over all characteristic kernels when solving Eq. (2.3). By Gretton et al. (2012a,b), if f is an injective function and k is characteristic, then the compositionally-induced kernel k̃ = k ∘ f, where k̃(x, x′) = k(f(x), f(x′)), is still characteristic. If we have a family of injective functions parameterized by φ, denoted f_φ, we can change the objective to
min_θ max_φ M_{k∘f_φ}(P_X, Q_θ).   (2.4)

We consider the case of combining Gaussian kernels with injective functions f_φ, where k̃(x, x′) = exp(−‖f_φ(x) − f_φ(x′)‖²). One example function class of f is {f_φ | f_φ(x) = φx, φ > 0}, which is equivalent to tuning the kernel bandwidth. A more advanced realization will be discussed in Section 2.3. In the following, we abuse notation and write M_{f_φ}(P, Q) for the MMD distance given the kernel composed of the Gaussian kernel and f_φ. Note that Gretton et al. (2012b) consider linear combinations of characteristic kernels, which can also be incorporated into the discussed composition kernels. More general kernels are studied in Wilson et al. (2016).
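A minimal PyTorch sketch of the composed kernel k̃ = k ∘ f_φ follows: a small placeholder encoder (the actual f_φ in Section 2.4 is a DCGAN discriminator) followed by a Gaussian kernel on the code space. The resulting Gram matrix can be plugged directly into the MMD estimator sketched above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Placeholder feature map f_phi; any architecture producing an h-dimensional code works here."""
    def __init__(self, in_dim, h_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, h_dim))

    def forward(self, x):
        return self.net(x)

def composed_gaussian_kernel(f_phi, X, Y):
    """k_tilde(x, y) = exp(-||f_phi(x) - f_phi(y)||^2): a Gaussian kernel composed with f_phi."""
    return torch.exp(-torch.cdist(f_phi(X), f_phi(Y)) ** 2)

# Usage with the estimator above (hypothetical variable names):
#   f_phi = Encoder(in_dim=784, h_dim=16)
#   mmd2 = mmd2_unbiased(X, Y, kernel=lambda A, B: composed_gaussian_kernel(f_phi, A, B))
```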
2.2.2 Properties of MMD with Kernel Learning
Arjovsky et al. (2017) discuss different distances between distributions adopted by existing deep learning algorithms, and show that many of them are discontinuous, such as the Jensen-Shannon divergence (Goodfellow et al., 2014) and total variation (Zhao et al., 2017), with the Wasserstein distance being an exception. The discontinuity makes gradient descent infeasible for training. From Eq. (2.4), we train g_θ by minimizing max_φ M_{f_φ}(P_X, Q_θ). Next, we show that max_φ M_{f_φ}(P_X, Q_θ) also enjoys the advantage of being a continuous and differentiable objective in θ under mild assumptions.

Assumption 2. g: Z × R^m → X is locally Lipschitz, where Z ⊆ R^d. We denote by g_θ(z) the evaluation at (z, θ) for convenience. Given a Lipschitz f_φ and a probability distribution P_z over Z, g satisfies Assumption 2 if there are local Lipschitz constants L(θ, z) for f_φ ∘ g, independent of φ, such that E_{z∼P_z}[L(θ, z)] < +∞.

Theorem 3. Assume the generator function g_θ parameterized by θ satisfies Assumption 2. Let P_X be a fixed distribution over X and Z be a random variable over the space Z. Denote by Q_θ the distribution of g_θ(Z). Then max_φ M_{f_φ}(P_X, Q_θ) is continuous everywhere and differentiable almost everywhere in θ.
If g_θ is parameterized by a feed-forward neural network, it satisfies Assumption 2 and can be trained via gradient descent and back-propagation, since the objective is continuous and differentiable by Theorem 3. Recall that an IPM defines the probability distance as
D(P, Q) = sup_{f∈F} E_{x∼P}[f(x)] − E_{y∼Q}[f(y)].   (2.5)
With different function classes F, we can recover several distances, such as total variation, the Wasserstein distance, and the MMD distance. For MMD, the function class F is {f : ‖f‖_{H_k} ≤ 1}, where H_k is the RKHS associated with kernel k. Different from many distances (e.g., the Wasserstein distance), there is an analytical representation (Gretton et al., 2012a) of the MMD-based IPM, namely
M_k(P, Q) = ( sup_{f∈H_k, ‖f‖_{H_k}≤1} E_{x∼P}[f(x)] − E_{y∼Q}[f(y)] )² = E_P[k(x, x′)] − 2 E_{P,Q}[k(x, y)] + E_Q[k(y, y′)].   (2.6)
Thus, GMMN does not require an additional discriminator network f_φ to estimate the distance D(P, Q) = M_k(P, Q). Here, we provide an interpretation of the proposed MMD with kernel learning under the IPM framework. The MMD distance with adversarially learned kernels is
max_{k∈K} M_k(P, Q) = sup_{f ∈ H_{k_1} ∪ ... ∪ H_{k_n}} E_{x∼P}[f(x)] − E_{y∼Q}[f(y)],

where k_i ∈ K, ∀i. The proposed MMD distance with kernel learning is thus still defined by an IPM, but with a larger function class. Next, we prove that max_φ M_{f_φ}(P_X, Q_θ) is not only a valid probability distance, but also enjoys weak convergence under Assumption 2.

Theorem 4 (weak* topology). Let {P_n} be a sequence of distributions. Considering n → ∞, under the mild Assumption 2, max_φ M_{f_φ}(P_X, P_n) → 0 ⇐⇒ P_n →_D P_X, where →_D denotes convergence in distribution (Wasserman, 2013).
Theorem 4 shows that max_φ M_{f_φ}(P_X, P_n) is a sensible cost function with respect to the distance between P_X and P_n: the distance decreases as P_n gets closer to P_X, which helps monitor the improvement during training. All proofs are deferred to Section 2.6. Next, we introduce a practical realization of training g_θ by optimizing min_θ max_φ M_{f_φ}(P_X, Q_θ).
2.3 MMD GAN
To approximate Eq. (2.4), we use neural networks to parameterize g_θ and f_φ with expressive power. For g_θ, the assumption is being locally Lipschitz, which commonly used feed-forward neural networks satisfy. Also, the gradient ∇_θ(max_φ f_φ ∘ g_θ) has to be bounded, which can be achieved by weight clipping (Arjovsky et al., 2017) or a gradient penalty (Gulrajani et al., 2017). The non-trivial part is that f_φ has to be injective. For an injective function f, there exists a function f^{-1} such that f^{-1}(f(x)) = x, ∀x ∈ X and f^{-1}(f(g(z))) = g(z), ∀z ∈ Z¹, which can be approximated by an autoencoder. In the following, we denote φ = {φ_e, φ_d} as the parameters of the discriminator network, which consists of an encoder f_{φ_e}, and we train the corresponding decoder f_{φ_d} ≈ f^{-1} to regularize f. The objective (2.4) is relaxed to

min_θ max_φ M_{f_{φ_e}}(P(X), P(g_θ(Z))) − λ E_{y∈X∪g(Z)} ‖y − f_{φ_d}(f_{φ_e}(y))‖².   (2.7)

¹Note that an injective function is not necessarily invertible.
Note that Bińkowski et al. (2018) extend the analysis of Theorem 3 to remove the injectivity constraint. Our empirical study suggests that the autoencoder objective is not necessary for the success of MMD GAN training, as we show in Section 2.4.5, which is consistent with Bińkowski et al. (2018). The proposed algorithm is similar to GAN (Goodfellow et al., 2014) in that it optimizes two neural networks g_θ and f_φ in a min-max formulation, while the meaning of the objective is different.
In Goodfellow et al. (2014), f_{φ_e} is a binary classifier (discriminator) used to distinguish the two distributions. In the proposed algorithm, distinguishing the two distributions is still done by a two-sample test via MMD, but with an adversarially learned kernel parametrized by f_{φ_e}; g_θ is then trained to pass the hypothesis test. Because of the similarity to GAN, we call the proposed algorithm MMD GAN. We present an implementation with weight clipping in Algorithm 1, but one can easily extend it to other Lipschitz approximations, such as the gradient penalty of Gulrajani et al. (2017).

Algorithm 1: MMD GAN, our proposed algorithm.
input: α the learning rate, c the clipping parameter, B the batch size, n_c the number of discriminator iterations per generator update.
initialize generator parameters θ and discriminator parameters φ
while θ has not converged do
    for t = 1, ..., n_c do
        Sample minibatches {x_i}_{i=1}^B ∼ P(X) and {z_j}_{j=1}^B ∼ P(Z)
        g_φ ← ∇_φ [ M_{f_{φ_e}}(P(X), P(g_θ(Z))) − λ E_{y∈X∪g(Z)} ‖y − f_{φ_d}(f_{φ_e}(y))‖² ]
        φ ← φ + α · RMSProp(φ, g_φ)
        φ ← clip(φ, −c, c)
    Sample minibatches {x_i}_{i=1}^B ∼ P(X) and {z_j}_{j=1}^B ∼ P(Z)
    g_θ ← ∇_θ M_{f_{φ_e}}(P(X), P(g_θ(Z)))
    θ ← θ − α · RMSProp(θ, g_θ)
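For readers who prefer code, here is a minimal, self-contained PyTorch sketch of Algorithm 1 on toy 2-D data. The architectures, toy data distribution, and hyper-parameter values are illustrative stand-ins (the experiments in Section 2.4 use DCGAN networks and the settings listed there), and `mmd2_unbiased` refers to the estimator sketched in Section 2.2.1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_data, d_noise, d_code = 2, 4, 8
B, n_c, lam, clip, lr = 64, 5, 8.0, 0.01, 5e-5

generator = nn.Sequential(nn.Linear(d_noise, 64), nn.ReLU(), nn.Linear(64, d_data))
encoder = nn.Sequential(nn.Linear(d_data, 64), nn.ReLU(), nn.Linear(64, d_code))   # f_{phi_e}
decoder = nn.Sequential(nn.Linear(d_code, 64), nn.ReLU(), nn.Linear(64, d_data))   # f_{phi_d}
opt_g = torch.optim.RMSprop(generator.parameters(), lr=lr)
opt_d = torch.optim.RMSprop(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)

sample_real = lambda: torch.randn(B, d_data) * 0.5 + 2.0    # toy data distribution P_X
sample_noise = lambda: torch.randn(B, d_noise)              # base distribution P_Z

for it in range(1000):
    for _ in range(n_c):                                    # n_c discriminator updates
        x, fake = sample_real(), generator(sample_noise()).detach()
        y = torch.cat([x, fake], dim=0)
        ae_loss = ((y - decoder(encoder(y))) ** 2).sum(dim=1).mean()
        loss_d = -(mmd2_unbiased(encoder(x), encoder(fake)) - lam * ae_loss)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        for p in list(encoder.parameters()) + list(decoder.parameters()):
            p.data.clamp_(-clip, clip)                      # weight clipping for boundedness
    x, z = sample_real(), sample_noise()
    loss_g = mmd2_unbiased(encoder(x), encoder(generator(z)))  # one generator update
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```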
2.3.1 Feasible Set Reduction

Theorem 5. For any f_φ, there exists f_{φ′} such that M_{f_φ}(P_r, Q_θ) = M_{f_{φ′}}(P_r, Q_θ) and E_x[f_φ(x)] ⪰ E_z[f_{φ′}(g_θ(z))].

With Theorem 5, we can reduce the feasible set of φ during the optimization by solving

min_θ max_φ M_{f_φ}(P_r, Q_θ)   s.t.   E[f_φ(x)] ⪰ E[f_φ(g_θ(z))],

whose optimal solution is still equivalent to that of (2.3). However, it is hard to solve the constrained optimization problem with backpropagation. We relax the constraint via ordinal regression (Herbrich et al., 1999) to

min_θ max_φ M_{f_φ}(P_r, Q_θ) + λ min( E[f_φ(x)] − E[f_φ(g_θ(z))], 0 ),

which only penalizes the objective when the constraint is violated. In practice, we observe that reducing the feasible set makes the training faster and more stable.
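As a sketch of how this relaxation enters the implementation, the constraint can be added as a hinge-style penalty on the discriminator objective, applied per output dimension of f_φ; `lam_rank` and the variable names below are illustrative and refer to the training sketch above.

```python
import torch

def feasibility_penalty(f_real, f_fake):
    """min(E[f_phi(x)] - E[f_phi(g_theta(z))], 0), per output dimension:
    zero when the constraint holds, negative (penalizing the maximization) otherwise."""
    gap = f_real.mean(dim=0) - f_fake.mean(dim=0)
    return torch.clamp(gap, max=0.0).sum()

# In the discriminator step of the earlier sketch (lam_rank is an illustrative weight):
#   f_real, f_fake = encoder(x), encoder(fake)
#   loss_d = -(mmd2_unbiased(f_real, f_fake)
#              + lam_rank * feasibility_penalty(f_real, f_fake) - lam * ae_loss)
```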
2.3.2 Encoding Perspectives and Relations to WGAN
Besides using kernel selection to explain MMD GAN, another way to view the proposed MMD GAN is to regard f_{φ_e} as a feature transformation function, with the kernel two-sample test performed on this transformed feature space (i.e., the code space of the autoencoder). The optimization then looks for a manifold with stronger signals for the MMD two-sample test. From this perspective, GMMN (Li et al., 2015b; Dziugaite et al., 2015) is the special case of MMD GAN in which f_{φ_e} is the identity mapping; in that case, the kernel two-sample test is conducted in the original data space.
If we compose f_φ with a linear kernel instead of a Gaussian kernel, and restrict the output dimension h to be 1, we then have the objective

min_θ max_φ ‖E[f_φ(x)] − E[f_φ(g_θ(z))]‖².   (2.8)

Parameterizing f_φ and g_θ with neural networks and assuming ∃φ′ ∈ Φ such that f_{φ′} = −f_φ, ∀φ ∈ Φ, recovers the Wasserstein GAN (WGAN) (Arjovsky et al., 2017)². If we treat f_φ(x) as the data transformation function, WGAN can be interpreted as first-order moment matching (linear kernel), while MMD GAN aims to match infinitely many orders of moments with a Gaussian kernel via Taylor expansion (Li et al., 2015b). Theoretically, the Wasserstein distance has similar theoretical guarantees as Theorems 1, 3 and 4. In practice, Arora et al. (2017) show that neural networks do not have enough capacity to approximate the Wasserstein distance. In Section 7.4.1, we demonstrate that matching high-order moments benefits the results. Mroueh et al. (2017) also propose McGAN, which matches second-order moments from a primal-dual norm perspective. However, their algorithm requires matrix (tensor) decompositions because of exact moment matching (Zellinger et al., 2017), which is hard to scale to higher-order moment matching. On the other hand, by giving up exact moment matching, MMD GAN can match high-order moments with kernel tricks.
Difference from Other Works with Autoencoders Energy-based GANs (Zhao et al., 2017; Zhai et al., 2016) also utilize an autoencoder (AE) in the discriminator from an energy-model perspective, minimizing the reconstruction error of real samples x while maximizing the reconstruction error of generated samples g_θ(z). In contrast, MMD GAN uses the AE to approximate invertible functions by minimizing the reconstruction errors of both real samples x and generated samples g_θ(z). Also, Arjovsky et al. (2017) show that EBGAN approximates total variation, with the drawback of discontinuity, while MMD GAN optimizes the MMD distance. Another line of work (Kingma and Welling, 2013; Makhzani et al., 2015; Li et al., 2015b) aims to match in the AE code space f(x) and utilizes the decoder f_dec(·). Kingma and Welling (2013) and Makhzani et al. (2015) match the distribution of f(x) and z via different distribution distances and generate data (e.g., images) by f_dec(z). Li et al. (2015b) use MMD to match f(x) and g(z), and generate data via f_dec(g(z)). The proposed MMD GAN matches f(x) and f(g(z)), and generates data via g(z) directly, as in GAN. Ulyanov et al. (2017) is similar to MMD GAN but considers a KL-divergence objective without showing continuity and weak* topology guarantees.
²Theoretically, they are not equivalent but the practical neural network approximation results in the same algorithm.

2.4 Experiments and Results
Datasets We train MMD GAN for image generation on the MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CelebA (Liu et al., 2015), and LSUN bedrooms (Yu et al., 2015) datasets, whose numbers of training instances are 50K, 50K, 160K, and 3M, respectively.
Network Architecture and Kernel Designs In our experiments, we follow the architecture of DCGAN (Radford et al., 2016), designing g_θ after its generator and f_φ after its discriminator, except that we expand the output layer of f_φ to h dimensions. The loss function of MMD GAN is implicitly associated with a family of characteristic kernels. Similar to prior works (Dziugaite et al., 2015; Li et al., 2015b; Sutherland et al., 2017), we consider a mixture of K RBF kernels k(x, x′) = Σ_{q=1}^K k_{σ_q}(x, x′), where k_{σ_q} is a Gaussian kernel with bandwidth parameter σ_q. Tuning the kernel bandwidth σ_q optimally remains an open problem. In this work, we fix K = 5 and σ_q to be {1, 2, 4, 8, 16}, and leave f_φ to learn the kernel (feature representation) under these σ_q.
Hyper-parameters We use RMSProp (Tieleman and Hinton, 2012) with a learning rate of 0.00005 for a fair comparison with WGAN, as suggested in Arjovsky et al. (2017). We ensure the boundedness of the discriminator's model parameters by clipping the weights point-wise to the range [−0.01, 0.01], as required by Assumption 2. The dimensionality h of the latent space is set manually according to the complexity of the dataset: h = 16 for MNIST, h = 64 for CelebA, and h = 128 for CIFAR-10 and LSUN bedrooms. The batch size is set to B = 64 for all datasets.
2.4.1 Qualitative Analysis
Comparison with GMMN We start with comparing MMD GAN with GMMN on two standard bench- marks, MNIST and CIFAR-10. We consider two variants for GMMN. The first one is original GMMN, which trains the generator by minimizing the MMD distance on the original data space. We call it as GMMN-D. To compare with MMD GAN, we also pretrain an autoencoder for projecting data to a manifold, then fix the autoencoder as a feature transformation, and train the generator by minimizing the MMD distance in the code space. We call it as GMMN-C. The results are pictured in Figure 2.1. Both GMMN-D and GMMN-C are able to generate meaningful digits on MNIST because of the simple data structure. By a closer look, nonetheless, the boundary and shape of the digits in Figure 2.1a and 2.1b are often irregular and non-smooth. In contrast, the sample digits in Figure 2.1c are more natural with smooth outline and sharper strike. For CIFAR-10 dataset, both GMMN variants fail to generate meaningful images, but resulting some low level visual features. We observe similar cases in other complex large- scale datasets such as CelebA and LSUN bedrooms, thus results are omitted. On the other hand, the proposed MMD GAN successfully outputs natural images with sharp boundary and high diversity. The results in Figure 2.1 confirm the success of the proposed adversarial learned kernels to enrich statistical testing power, which is the key difference between GMMN and MMD GAN. If we increase the batch size of GMMN to 1024, the image quality is improved, as shown in Figure 2.2. However, it is still not competitive to MMD GAN with B = 64. This demonstrates that the proposed MMD GAN can be trained more efficiently than GMMN with smaller batch size.
Figure 2.1: Generated samples from GMMN-D (data space), GMMN-C (code space), and our MMD GAN with batch size B = 64; panels (a)-(c) show MNIST and (d)-(f) show CIFAR-10.
Figure 2.2: Generated samples from GMMN-D and GMMN-C with training batch size B = 1024; panels (a)-(b) show MNIST and (c)-(d) show CIFAR-10.
Comparison with GANs There are several representative extensions of GANs. We consider the recent state-of-the-art WGAN (Arjovsky et al., 2017), built on the DCGAN architecture (Radford et al., 2016), because of its connection with MMD GAN. The results are shown in Figure 2.3. For MNIST, the digits generated by WGAN in Figure 2.3a look less natural, with peculiar strokes, while the digits from MMD GAN in Figure 2.3d enjoy smoother contours.
Figure 2.3: Generated samples from WGAN (panels a-c) and MMD GAN (panels d-f) on the MNIST, CelebA, and LSUN bedrooms datasets.

Furthermore, both WGAN and MMD GAN generate diversified digits, avoiding the mode collapse problems that appear in the GAN training literature. For CelebA, we can see the difference between samples generated by WGAN and MMD GAN. Specifically, we observe varied poses, expressions, genders, skin colors, and lighting in Figures 2.3b and 2.3e. On closer inspection (view on screen with zoom), faces from WGAN are more often blurry or twisted, while faces from MMD GAN have sharper and cleaner facial outlines. For the LSUN dataset, we could not distinguish salient differences between the samples generated by MMD GAN and WGAN.
2.4.2 Quantitative Evaluation
To quantitatively measure the quality and diversity of generated samples, we compute the inception score (Salimans et al., 2016) on CIFAR-10 images. The inception score measures sample quality and diversity for GANs using a pretrained Inception model (Salimans et al., 2016); models that generate collapsed samples receive a relatively low score. Table 2.1 lists the results for 50K samples generated by various unsupervised generative models trained on the CIFAR-10 dataset. The inception scores
of DFM (Warde-Farley and Bengio, 2017), ALI (Dumoulin et al., 2017), and Improved-GAN (Salimans et al., 2016) are taken directly from the corresponding references. Although both WGAN and MMD GAN can generate sharp images as shown above, our score is better than those of the other GAN techniques except DFM (Warde-Farley and Bengio, 2017). This seems to confirm empirically that matching higher-order moments between the real data and the fake sample distribution helps generate more diversified samples. Also note that DFM appears compatible with our method, and combining the training techniques of DFM is a possible avenue for future work.
Method                                   Scores ± std.
Real data                                11.95 ± .20
DFM (Warde-Farley and Bengio, 2017)       7.72
ALI (Dumoulin et al., 2017)               5.34
Improved GANs (Salimans et al., 2016)     4.36
MMD GAN                                   6.17 ± .07
WGAN                                      5.88 ± .07
GMMN-C                                    3.94 ± .04
GMMN-D                                    3.47 ± .03
Table 2.1: Inception scores of MMD GAN and different GAN algorithms.
2.4.3 Stability of MMD GAN
Figure 2.4: Training curves and generated samples at different stages of training on (a) MNIST, (b) CelebA, and (c) LSUN bedrooms. We can see a clear correlation between lower distance and better sample quality.
We further illustrate how well the MMD distance correlates with the quality of the generated samples. Figure 2.4 plots the evolution of the estimated MMD distance M̂_{f_φ}(P_X, P_θ) during training for the MNIST, CelebA, and LSUN datasets. We report a moving average of M̂_{f_φ}(P_X, P_θ) to smooth the curves and reduce the variance caused by mini-batch stochastic training. We observe that, during the whole training process, samples generated from the same noise vector across iterations remain similar in nature (e.g., face identity and bedroom style stay alike while details and backgrounds evolve). This qualitative observation indicates valuable stability of the training process. The decreasing curve accompanied by improving image quality supports the weak* topology shown in Theorem4. We can also see from the plots that the model converges quickly; in Figure 2.4b, for example, it converges shortly after tens of thousands of generator iterations on the CelebA dataset.
2.4.4 Computation Issue
We analyze the time complexity with respect to the batch size B. The time complexity of each iteration is O(B) for WGAN and O(KB²) for the proposed MMD GAN with a mixture of K RBF kernels. The quadratic complexity O(B²) of MMD GAN comes from computing the kernel matrix, which is sometimes criticized as inapplicable with large batch sizes in practice. However, we point out that several recent works, such as EBGAN (Zhao et al., 2017), also match pairwise relations between samples within a batch, leading to O(B²) complexity as well. Empirically, we find that on GPUs, highly parallelized matrix operations reduce the quadratic time to almost linear time for modest B. Figure 2.5 compares the computation time per generator iteration for different B on a Titan X. When B = 64, which is the setting used for training MMD GAN in our experiments, the time per iteration of WGAN and MMD GAN is 0.268 and 0.676 seconds, respectively. When B = 1024, which is used for training GMMN in its original reference (Li et al., 2015b), the time per iteration becomes 4.431 and 8.565 seconds, respectively. This result supports our argument that the empirical computation time of MMD GAN is not quadratically more expensive than WGAN given powerful GPU parallel computation.
Figure 2.5: Computation time per one generator update step with different batch sizes.
2.4.5 Better Lipschitz Approximation and Necessity of Auto-Encoder
We used weight-clipping for the Lipschitz constraint in Assumption2. Another approach for obtaining a discriminator with similar constraints that approximates a Wasserstein distance is Gulrajani et al.(2017), where the gradient norm of the discriminator is constrained to be 1 between the generated and data points. Inspired by Gulrajani et al.(2017), an alternative approach is to apply the gradient constraint as a regularizer to the witness function for a different IPM, such as the MMD. This idea was first proposed in Bellemare et al.(2017) for the Energy Distance (in parallel to our submission), which was shown in Gretton(2017) to correspond to a gradient penalty on the witness function for any RKHS-based MMD. Here we undertake a preliminary investigation of this approach, where we also drop the requirement in Algorithm 1 that fφ be injective, which we observe is not necessary in practice. We show preliminary results of training MMD GAN with the gradient penalty and without the auto-encoder in Figure 2.6. This preliminary study indicates that MMD GAN can generate satisfactory results with other approximations of the Lipschitz constraint. A potential future direction is a more thorough empirical comparison between different approximations.
Figure 2.6: MMD GAN results using the gradient penalty of Gulrajani et al.(2017) and without the auto-encoder reconstruction loss during training, on (a) CIFAR-10 and (b) CelebA, after 300K generator iterations.
2.5 Summary
We introduce a new deep generative model trained via MMD with adversarially learned kernels. We further study its theoretical properties and propose a practical realization, MMD GAN, which can be trained with a much smaller batch size than GMMN and has competitive performance with state-of-the-art GANs. We can view MMD GAN as a first practical step toward connecting moment matching networks and GANs. One important direction is applying tools developed for moment matching (Muandet et al., 2017) to general GAN works, based on the connections shown by MMD GAN. Also, in Section 2.3.2, we connect WGAN and MMD GAN through first-order and infinite-order moment matching. Finally, although injectivity of fφ is assumed in the theory, we observe that it is not mandatory in practice, as shown in Section 2.4.5.
2.6 Appendix
2.6.1 Proof of Theorem3
The proof follows the lemma from Borisenko and Minchenko(1992), which we state below.
Lemma 6 (Borisenko and Minchenko(1992)). Define τ(x) = max{f(x, u) | u ∈ U}. If f is locally Lipschitz in x, U is compact, and ∇f(x, u*(x)) exists, where u*(x) = arg max_u f(x, u), then τ(x) is differentiable almost everywhere.
For simplicity, we define k_φ := k ∘ f_φ. We aim to show that
max_φ M_φ(P_X, Q_θ) = E_{x,x'}[k_φ(x, x')] − 2E_{x,z}[k_φ(x, g_θ(z))] + E_{z,z'}[k_φ(g_θ(z), g_θ(z'))]   (2.9)

is differentiable with respect to θ almost everywhere, using the auxiliary Lemma6. We first show that E_{z,z'}[k_φ(g_θ(z), g_θ(z'))] in Eq. 2.9 is locally Lipschitz in θ. By definition, k_φ(x, x') = k(f_φ(x) − f_φ(x')), thus

|E_{z,z'}[k_φ(g_θ(z), g_θ(z'))] − E_{z,z'}[k_φ(g_{θ'}(z), g_{θ'}(z'))]|
  = |E_{z,z'}[k(f_φ(g_θ(z)) − f_φ(g_θ(z')))] − E_{z,z'}[k(f_φ(g_{θ'}(z)) − f_φ(g_{θ'}(z')))]|
  ≤ E_{z,z'}[L_k ‖f_φ(g_θ(z)) − f_φ(g_θ(z')) − f_φ(g_{θ'}(z)) + f_φ(g_{θ'}(z'))‖]
  ≤ E_{z,z'}[L_k L(θ, z)‖θ − θ'‖ + L_k L(θ, z')‖θ − θ'‖]
  = 2L_k E_z[L(θ, z)] ‖θ − θ'‖.   (2.10)

The first inequality holds because the Gaussian kernel k is Lipschitz (locally Lipschitz) in (x, x'), with an upper bound L_k on the Lipschitz constants. By Assumption2, E_z[L(θ, z)] < ∞, so E_{z,z'}[k_φ(g_θ(z), g_θ(z'))] is locally Lipschitz. A similar argument applies to the other terms in Eq. 2.9; therefore, Eq. 2.9 is locally Lipschitz in θ. Finally, with the compactness assumption on Φ and the differentiability assumption on M_φ(P_X, Q_θ), applying Lemma6 proves Theorem3.
2.6.2 Proof of Theorem4
Proof. We first show that if P_n →_D P, then max_φ M_{f_φ}(P_n, P) → 0. The result is based on (Dudley, 2018, Corollary 11.3.4); a similar proof is also given in Arbel et al.(2018). Following Dudley(2018), the only thing we need to show is that ‖k(f_φ(x), ·) − k(f_φ(y), ·)‖_{H_k} is Lipschitz. By definition, we know that ‖k(f_φ(x), ·) − k(f_φ(y), ·)‖_{H_k} = 2(1 − k(f_φ(x), f_φ(y))). Also, since the Gaussian kernel k is Lipschitz, we have k(0) − k(x, x') ≤ L_k‖0 − (x − x')‖. With k(0) = 1 for the Gaussian kernel,

‖k(f_φ(x), ·) − k(f_φ(y), ·)‖_{H_k} ≤ 2L_k‖f_φ(x) − f_φ(y)‖ ≤ 2L_k L‖x − y‖,

where the last inequality holds since f_φ is also a Lipschitz function with Lipschitz constant L.
For the other direction, that max_φ M_{f_φ}(P_n, P) → 0 implies P_n →_D P, the argument is relatively simple. Without loss of generality, assume there exists φ₀ such that f_{φ₀} is an identity function (up to scaling), which recovers the Gaussian kernel k. Then max_φ M_{f_φ}(P_n, P) → 0 implies M_{f_{φ₀}}(P_n, P) → 0, which completes the proof because MMD with any Gaussian kernel is weak (Gretton et al., 2012a).
2.6.3 Proof of Theorem5
Proof. The proof assumes f_φ(x) is scalar-valued; the vector case can be proved with the same sketch. First, if E[f_φ(x)] > E[f_φ(g_θ(z))], then φ' = φ. If E[f_φ(x)] < E[f_φ(g_θ(z))], we let f = −f_φ; then E[f(x)] > E[f(g_θ(z))], and flipping the sign does not change the MMD distance. If we parameterize f_φ by a neural network with a linear output layer, φ' can be realized by flipping the sign of the weights of the last layer.
Chapter 3
Learning Kernel Spectral Distributions
3.1 Background
Kernel methods are among the essential foundations of machine learning and have been extensively studied in the past decades. In supervised learning, kernel methods allow us to learn non-linear hypotheses. They also play a crucial role in statistics: kernel maximum mean discrepancy (MMD) (Gretton et al., 2012a) is a powerful two-sample test based on a statistic computed via kernel functions. Even with the surge of deep learning in recent years, several successes have been achieved by combining kernel methods with deep feature extraction. Wilson et al.(2016) demonstrate state-of-the-art performance by incorporating deep learning, kernels, and Gaussian processes. Also, in Chapter2, we show that MMD GAN reaches state-of-the-art performance on generating complex image data. In practice, however, kernel selection is always an important step. Instead of choosing kernels by heuristics, several works have studied kernel learning. Multiple kernel learning (MKL) (Bach et al., 2004; Lanckriet et al., 2004; Bach, 2009; Gönen and Alpaydın, 2011; Duvenaud et al., 2013) is one of the pioneering frameworks for combining predefined kernels. One recent line of kernel learning focuses on learning kernel spectral distributions (the Fourier transform of the kernel). Wilson and Adams(2013) model spectral distributions via Gaussian mixtures, which is an extension of linear combinations of kernels (Bach et al., 2004); Oliva et al.(2016) extend this to Bayesian non-parametric models. In addition to modeling spectral distributions with explicit density models, many works optimize the sampled random features or their weights (Băzăvan et al., 2012; Yang et al., 2015; Sinha and Duchi, 2016; Chang et al., 2017a; Bullins et al., 2018). An orthogonal approach is learning feature maps for standard kernels (e.g., Gaussian); feature maps learned by deep learning lead to state-of-the-art performance on different tasks (Hinton and Salakhutdinov, 2008; Wilson et al., 2016; Li et al., 2017). In this chapter, we propose to model kernel spectral distributions with implicit generative models, which we call Implicit Kernel Learning (IKL). The IGMs are learned to generate samples from the desirable kernel spectral distributions according to different downstream tasks. We start from the generic IKL framework and propose an easily implemented neural network parameterization that satisfies Bochner's theorem (Rudin, 1962). We then demonstrate two applications of the proposed IKL. First, we explore MMD GAN (Li et al., 2017) with IKL for learning to generate images and text. Second, we consider a standard two-staged supervised learning task with Random Kitchen Sinks (Sinha and Duchi, 2016). The
conditions required for training IKL and its theoretical guarantees in both tasks are also studied. In both tasks, we show that IKL leads to competitive or better performance than heuristic kernel selection and existing approaches that model kernel spectral densities. This demonstrates the potential of learning more powerful kernels via deep generative models.
3.2 IKL Framework
Given data x ∈ ℝ^d, kernel methods compute the inner product of the feature transformation ϕ(x) in a high-dimensional Hilbert space H via a kernel function k : X × X → ℝ, defined as k(x, x') = ⟨ϕ(x), ϕ(x')⟩_H, where ϕ(x) is usually high- or even infinite-dimensional. If k is shift-invariant (i.e., k(x, y) = k(x − y)), we can represent k as an expectation with respect to a spectral distribution P_k(ω).
Bochner's theorem (Rudin, 1962) A continuous, real-valued, symmetric and shift-invariant function k on ℝ^d is a positive definite kernel if and only if there is a positive finite measure P_k(ω) such that

k(x − x') = ∫_{ℝ^d} e^{iω^⊤(x−x')} dP_k(ω) = E_{ω∼P_k}[e^{iω^⊤(x−x')}].
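For intuition, the following is a small numerical sketch (ours, assuming NumPy) that checks Bochner's theorem for the Gaussian kernel, whose spectral distribution is N(0, σ^{-2} I); the imaginary part cancels by symmetry, so a Monte-Carlo average of cos(ω^⊤(x − y)) recovers the kernel value:

    import numpy as np

    rng = np.random.default_rng(0)
    d, sigma, m = 5, 2.0, 200000
    x, y = rng.normal(size=d), rng.normal(size=d)

    # closed form: k(x - y) = exp(-||x - y||^2 / (2 sigma^2))
    k_exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    # Bochner: sample omega ~ N(0, sigma^{-2} I) and average cos(omega^T (x - y))
    omega = rng.normal(scale=1.0 / sigma, size=(m, d))
    k_mc = np.cos(omega @ (x - y)).mean()
    print(k_exact, k_mc)   # the two values agree up to Monte-Carlo error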
IKL objective We restrict ourselves to learning shift-invariant kernels. Accordingly, learning kernels is equivalent to learning a spectral distribution by optimizing

arg max_{k∈K} Σ_i E_{x∼P_i, x'∼Q_i}[F_i(x, x') k(x, x')] = arg max_{k∈K} Σ_i E_{x∼P_i, x'∼Q_i}[F_i(x, x') E_{ω∼P_k}[e^{iω^⊤(x−x')}]],   (3.1)

where F_i is a task-specific objective function and K is a set of kernels. Eq. (3.1) covers many popular objectives, such as kernel alignment (Gönen and Alpaydın, 2011) and the MMD distance (Gretton et al., 2012a). Existing works (Wilson and Adams, 2013; Oliva et al., 2016) learn the spectral density P_k(ω) in explicit forms via parametric or non-parametric models. When we learn kernels via Eq. (3.1), accessing the density of P_k(ω) is not necessary as long as we can estimate kernel evaluations k(x − x') = E_ω[e^{iω^⊤(x−x')}] via sampling from P_k(ω) (Rahimi and Recht, 2008). Alternatively, implicit probabilistic (generative) models define a stochastic procedure that can generate (sample) data from P_k(ω) without modeling its density. Recently, neural implicit generative models (MacKay, 1995) regained attention with promising results (Goodfellow et al., 2014) and simple sampling procedures: we first sample ν from a base distribution P(ν) (e.g., a Gaussian distribution), then use a deterministic function h_ψ, parametrized by ψ, to transform ν into ω = h_ψ(ν), where ω follows the complex target distribution P_k(ω). Inspired by the success of deep implicit generative models (Goodfellow et al., 2014), we propose an Implicit Kernel Learning (IKL) method that models P_k(ω) via an implicit generative model h_ψ(ν), with ν ∼ P(ν), which results in

k_ψ(x, x') = E_ν[e^{i h_ψ(ν)^⊤(x−x')}],   (3.2)

and reduces Eq. (3.1) to solving

arg max_ψ Σ_i E_{x∼P_i, x'∼Q_i}[F_i(x, x') E_ν[e^{i h_ψ(ν)^⊤(x−x')}]].   (3.3)
The gradient of Eq. (3.3) can be represented as

Σ_i E_{x∼P_i, x'∼Q_i}[E_ν[∇_ψ (F_i(x, x') e^{i h_ψ(ν)^⊤(x−x')})]].

Thus, Eq. (3.3) can be optimized via SGD: in every iteration we sample x, x' from data and ν from the base distribution to estimate the gradient as shown above. Next, we discuss the parametrization of h_ψ so that it satisfies Bochner's theorem, and describe how to evaluate the IKL kernel in practice.
Symmetric P_k(ω) To result in real-valued kernels, the spectral density has to be symmetric, i.e., P_k(ω) = P_k(−ω). Thus, we parametrize h_ψ(ν) = sign(ν) ∘ h̃_ψ(abs(ν)), where ∘ is the Hadamard product and h̃_ψ can be any unconstrained function, provided the base distribution P(ν) is symmetric (i.e., P(ν) = P(−ν)), such as a standard normal distribution.
Kernel Evaluation Although there is usually no closed form for the kernel evaluation k_ψ(x, x') in Eq. (3.2) with a fairly complicated h_ψ, we can evaluate (approximate) k_ψ(x, x') by sampling a finite number of random Fourier features: k̂_ψ(x, x') = φ̂_{h_ψ}(x)^⊤ φ̂_{h_ψ}(x'), where

φ̂_{h_ψ}(x)^⊤ = [φ(x; h_ψ(ν_1)), . . . , φ(x; h_ψ(ν_m))] ∈ ℝ^m

and φ(x; ω) is the evaluation at ω of the Fourier transformation φ(x) (Rahimi and Recht, 2008). Next, we demonstrate two example applications covered by Eq. (3.3) where we can apply IKL, namely maximum mean discrepancy (MMD) for generative modeling and kernel alignment for supervised learning.
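To make this concrete, here is a minimal sketch (assuming PyTorch; the module and function names are ours) of the symmetric parametrization of h_ψ and the random-feature approximation of the IKL kernel with cosine/sine features:

    import torch
    import torch.nn as nn

    class IKLSampler(nn.Module):
        # implicit spectral sampler: omega = h_psi(nu), nu ~ N(0, I)
        def __init__(self, dim, hidden=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

        def forward(self, nu):
            # h_psi(nu) = sign(nu) * h_tilde(|nu|): keeps the spectral
            # measure symmetric, hence the kernel real-valued
            return torch.sign(nu) * self.net(nu.abs())

    def ikl_kernel(x, y, h_psi, m=1024):
        # x, y: [B, d]; Monte-Carlo approximation of k_psi(x, y) with m features
        nu = torch.randn(m, x.size(1), device=x.device)
        omega = h_psi(nu)                                    # [m, d]
        px, py = x @ omega.t(), y @ omega.t()                # projections [B, m]
        fx = torch.cat([torch.cos(px), torch.sin(px)], dim=1) / m ** 0.5
        fy = torch.cat([torch.cos(py), torch.sin(py)], dim=1) / m ** 0.5
        return fx @ fy.t()                                   # [B, B] kernel matrix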
3.3 IKL Application to MMD-GAN
Although the composition kernel with a learned feature embedding f_φ is powerful in MMD GAN (e.g., Chapter2), choosing a good base kernel k is still crucial in practice (Bińkowski et al., 2018). Different base kernels for MMD GAN, such as the rational quadratic kernel (Bińkowski et al., 2018) and the distance kernel (Bellemare et al., 2017), have been studied. Instead of choosing it by hand, we propose to learn the base kernel by IKL, which extends the compositional kernel to k_{ψ,φ} = k_ψ ∘ f_φ with the form

k_{ψ,φ}(x, x') = E_ν[e^{i h_ψ(ν)^⊤(f_φ(x) − f_φ(x'))}].   (3.4)

We then extend the MMD GAN objective to

min_θ max_{ψ,φ} M_{ψ,φ}(P_X, Q_θ),   (3.5)

where M_{ψ,φ} is the MMD distance with the IKL kernel in (3.4). Clearly, for a given φ, the maximization over ψ in Eq. (3.5) can be represented as (3.1) by letting F_1(x, x') = 1, F_2(x, y) = −2, and F_3(y, y') = 1. In what follows, we use, for convenience, k_{ψ,φ}, k_ψ, and k_φ to denote the kernels defined in (3.4), (3.2), and the compositional kernel k(f_φ(x), f_φ(x')), respectively.
3.3.1 Property of MMD GAN with IKL
As proven by Arjovsky and Bottou(2017), some probability distances adopted by existing works (e.g., Goodfellow et al.(2014)) are not weak (i.e., it is not guaranteed that P_n →_D P implies D(P_n ‖ P) → 0), and therefore cannot provide a better training signal for g_θ. They also usually suffer from discontinuity, so they cannot be trained via gradient descent at certain points. We extend Theorem3 and Theorem4 to prove that max_{ψ,φ} M_{ψ,φ}(P_X, Q_θ) is a continuous and differentiable objective in θ and is weak under mild conditions, based on the following assumption.
Assumption 7. g_θ(z) is locally Lipschitz and differentiable in θ; f_φ(x) is Lipschitz in x and φ ∈ Φ, where Φ is compact. f_φ ∘ g_θ(z) is differentiable in θ with local Lipschitz constants L(θ, z), independent of φ, such that E_{z∼P_z}[L(θ, z)] < +∞. The above assumptions are adopted from Arjovsky et al.(2017). Lastly, assume that for any ψ ∈ Ψ, where Ψ is compact, k_ψ(x, x') = E_ν[e^{i h_ψ(ν)^⊤(x−x')}] with |k_ψ(x, x')| < ∞ is differentiable and Lipschitz in (x, x'), with an upper bound L_k on the Lipschitz constants over different ψ.

Theorem 8. Assume the function g_θ and kernel k_{ψ,φ} satisfy Assumption7. Then max_{ψ,φ} M_{ψ,φ} is weak, that is, max_{ψ,φ} M_{ψ,φ}(P_X, P_n) → 0 ⟺ P_n →_D P_X. Also, max_{ψ,φ} M_{ψ,φ}(P_X, Q_θ) is continuous everywhere and differentiable almost everywhere in θ.

Lemma 9. Assume X is bounded. Let x, x' ∈ X. Then k_ψ(x, x') = E_ν[e^{i h_ψ(ν)^⊤(x−x')}] is Lipschitz in (x, x') if E_ν[‖h_ψ(ν)‖²] < ∞.

Note that E_ν[‖h_ψ(ν)‖²] is the variance, since E_ν[h_ψ(ν)] = 0. In practice, we penalize λ_h(E_ν[‖h_ψ(ν)‖²] − u)² as an approximation of Lemma9 to ensure that the assumptions of Theorem8 are satisfied. The algorithm with IKL and the gradient penalty (Bińkowski et al., 2018) is shown in Algorithm2.
Algorithm 2: MMD GAN with IKL
Input: learning rate η, batch size B, number of f, h updates per g update n_c, number of random features m, gradient penalty coefficient λ_GP, variance constraint coefficient λ_h.
Initialize: parameters θ for g, φ for f, ψ for h.
Define: L(ψ, φ) = M_{ψ,φ}(P_X, Q_θ) − λ_GP (‖∇_{x̂} f_φ(x̂)‖_2 − 1)² − λ_h (E_ν[‖h_ψ(ν)‖²] − u)².
while θ has not converged do
    for t = 1, . . . , n_c do
        Sample {x_i}_{i=1}^B ∼ P(X), {z_j}_{j=1}^B ∼ P(Z), {ν_k}_{k=1}^m ∼ P(ν)
        (ψ, φ) ← (ψ, φ) + η Adam((ψ, φ), ∇_{ψ,φ} L(ψ, φ))
    Sample {x_i}_{i=1}^B ∼ P(X), {z_j}_{j=1}^B ∼ P(Z), {ν_k}_{k=1}^m ∼ P(ν)
    θ ← θ − η Adam(θ, ∇_θ M_{ψ,φ}(P_X, Q_θ))
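For concreteness, a schematic sketch of one outer iteration of Algorithm2 is given below (ours, in PyTorch-style pseudocode; f_phi, g_theta, h_psi, the samplers, mmd2, and grad_penalty are placeholders for the corresponding components and are not actual library functions):

    for _ in range(n_c):                                   # n_c critic/kernel updates
        x, z, nu = sample_data(B), sample_noise(B), sample_nu(m)
        fake = g_theta(z).detach()
        var_pen = (h_psi(nu).pow(2).sum(dim=1).mean() - u) ** 2
        loss_fh = -(mmd2(f_phi(x), f_phi(fake), h_psi, nu)
                    - lam_gp * grad_penalty(f_phi, x, fake)
                    - lam_h * var_pen)                     # ascend L(psi, phi)
        opt_fh.zero_grad(); loss_fh.backward(); opt_fh.step()

    x, z, nu = sample_data(B), sample_noise(B), sample_nu(m)
    loss_g = mmd2(f_phi(x), f_phi(g_theta(z)), h_psi, nu)  # descend MMD w.r.t. theta
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()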
3.3.2 Experiments and Results
We consider image and text generation tasks for quantitative evaluation. For image generation, we evaluate the inception score (Salimans et al., 2016) and FID score (Heusel et al., 2017) on CIFAR-10 (Krizhevsky and Hinton, 2009). We use DCGAN (Radford et al., 2016) and expand the output of f_φ to be 16-dimensional, as in Bińkowski et al.(2018). For text generation, we consider a length-32 character-level generation task on the Google Billion Words dataset. The evaluation is based on the Jensen-Shannon divergence of empirical 4-gram probabilities (JS-4) between the generated sequences and the validation data, as used by Gulrajani et al.(2017); Heusel et al.(2017); Mroueh et al.(2018). The model architecture follows Gulrajani et al.(2017) in using a ResNet with 1D convolutions. We train every algorithm for 10,000 iterations for comparison. For MMD GAN with fixed base kernels, we consider the mixture of Gaussian kernels k(x, x') = Σ_q exp(−‖x − x'‖²/(2σ_q²)) (Li et al., 2017) and the mixture of RQ kernels k(x, x') = Σ_q (1 + ‖x − x'‖²/(2α_q))^{−α_q}. We tuned the hyperparameters σ_q and α_q for each kernel, as reported in Appendix 3.6.4. Lastly, for learning base kernels, we compare IKL with the SM kernel (Wilson and Adams, 2013), which learns a mixture of Gaussians to model the kernel spectral density; it can also be treated as the explicit generative model counterpart of the proposed IKL. In both tasks, P(ν), the base distribution of IKL, is a standard normal distribution, and h_ψ is a 3-layer MLP with 32 hidden units in each layer. Similar to the aforementioned mixture kernels, we consider a mixture of IKL kernels with the variance constraints E[‖h_ψ(ν)‖²] = 1/σ_q, where the σ_q are the bandwidths of the mixture of Gaussian kernels. Note that if h_ψ is an identity map, we recover the mixture of Gaussian kernels. We fix λ_h to be 10 and resample m = 1024 random features for IKL in every iteration. For other settings, we follow Bińkowski et al.(2018); the hyperparameters can be found in Appendix 3.6.4.
Quantitative and Qualitative Results We compare MMD GAN with the proposed IKL and with different fixed kernels. We repeat the experiments 10 times and report the average results with standard errors in Table 3.1. Note that for the inception score, larger is better, while for JS-4, smaller is better. We also report WGAN-GP results as a reference. Sampled images on larger datasets are shown in Figure 3.1.
Method       Inception Scores (↑)   FID Scores (↓)   JS-4 (↓)
Gaussian     6.726 ± 0.021          32.50 ± 0.07     0.381 ± 0.003
RQ           6.785 ± 0.031          32.20 ± 0.09     0.463 ± 0.005
SM           6.746 ± 0.031          32.43 ± 0.08     0.378 ± 0.003
IKL          6.876 ± 0.018          31.98 ± 0.05     0.372 ± 0.002
WGAN-GP      6.539 ± 0.034          36.413 ± 0.05    0.379 ± 0.002

Table 3.1: Inception scores, FID scores, and JS-4 divergence results.
Bińkowski et al.(2018) show that RQ kernels outperform Gaussian and energy distance kernels on image generation. Our empirical results agree with this finding: RQ kernels achieve a 6.785 inception score while the Gaussian kernel achieves 6.726, as shown in the left column of Table 3.1. In text generation, however, RQ kernels only achieve a 0.463 JS-4 score¹ and are not on par with the 0.381 obtained by Gaussian kernels, which is itself slightly worse than WGAN-GP. These results imply that kernel selection is task-specific. The proposed IKL, on the other hand, learns kernels in a data-driven way, which results in the best performance in both tasks. On CIFAR-10, although the Gaussian kernel is worse than RQ, IKL is still able to transform P(ν), which is Gaussian, into a powerful kernel and outperforms RQ on inception score (6.876 vs. 6.785). For text generation, from Table 3.1 and Figure 3.2, we observe that IKL can further boost the Gaussian kernel into a better kernel with substantial improvement. Also, the difference between IKL and pre-defined kernels in Table 3.1 is significant based on a t-test at the 95% confidence level.

¹ For RQ kernels, we searched 10 possible hyperparameter settings and report the best one in the Appendix, to ensure that the unsatisfactory performance is not caused by improper parameters.

Figure 3.1: Samples generated by MMDGAN-IKL on the CIFAR-10, CelebA, and LSUN datasets.
Figure 3.2: Convergence of MMD GANs with different kernels (Gaussian, RQ, IKL) on text generation, measured by JS-4 over generator iterations.
The SM kernel (Wilson and Adams, 2013), which learns the spectral density via a mixture of Gaussians, does not significantly outperform the Gaussian kernel in Table 3.1, since Li et al.(2017) already use an equally weighted mixture of Gaussians. This suggests that the proposed IKL can learn more complicated and effective spectral distributions than simple mixture models.
Study of Variance Constraints In Lemma9, we prove that bounding the variance E[‖h_ψ(ν)‖²] guarantees k_ψ to be Lipschitz, as required by Theorem8. We investigate the importance of this constraint. In Figure 3.3, we show the training objective (MMD), E[‖h_ψ(ν)‖²], and the JS-4 divergence when training MMD GAN (IKL) without the variance constraint, i.e., λ_h = 0. We observe that the variance keeps growing without the constraint, which leads to exploding MMD values. Also, when the explosion becomes severe, the JS-4 divergence starts increasing, which implies the MMD can no longer provide a meaningful signal to g_θ. This study justifies the validity of Theorem8 and Lemma9.
Figure 3.3: Learning MMD GAN (IKL) without the variance constraint on the Google Billion Words dataset for text generation; the MMD objective, E[‖h_ψ(ν)‖²], and JS-4 are plotted against generator iterations.
Figure 3.4: Computational time per one generator step with different batch sizes.
In Section 3.3, we propose to constrain the variance via the penalty λ_h(E_ν[‖h_ψ(ν)‖²] − u)². There are other alternatives, such as an L2 penalty or Lagrange multipliers; in practice, we do not observe a significant difference. Although Figure 3.3 shows the necessity of the variance constraint for language generation, we remark that the proposed constraint is only a sufficient condition. For CIFAR-10, without the constraint, we observe that the variance keeps bouncing between 1 and 2 without the explosion seen in Figure 3.3. The training therefore still leads to a satisfactory result with a 6.731 ± 0.034 inception score, although it is slightly worse than IKL in Table 3.1. Necessary or weaker sufficient conditions are worth further study as future work.
Model Capacity and Computational Time One concern with the proposed IKL is the computational overhead introduced by sampling random features as well as the additional parameters used to model h_ψ. For f_φ, the number of parameters of DCGAN is around 0.8 million for 32 × 32 images and 3 million for 64 × 64 images; the ResNet architecture used in Gulrajani et al.(2017) has around 10 million parameters. In contrast, in all experiments we use a simple three-layer MLP as h_ψ for IKL, where the input and output dimensions are 16 and the hidden layer size is 32, for a total of only around 2,000 parameters.
Compared with fφ, the additional number of parameters used for hψ is almost negligible.
The other concern with IKL is sampling random features for each example. See Figure 3.4 for results, where we use m = 1024 random features per iteration. We measure the time per critic update (f for WGAN-GP and MMD GAN with a Gaussian kernel; f and h for IKL) with different batch sizes on a Titan X. The difference between WGAN-GP, MMD GAN, and IKL is not significant. The reason is that computing the MMD and the random features is highly parallelizable, and other computation, such as evaluating f_φ and its gradient penalty, dominates the cost because f_φ has many more parameters, as mentioned above. Therefore, we believe the proposed IKL is still cost-effective in practice.
IKL with and without Neural Networks in GAN Training Instead of learning a transformation function h_ψ for the spectral distribution as we propose (namely, IKL-NN), another realization of IKL is to keep a pool of a finite number of learned random features Ω = {ω̂_i}_{i=1}^m and approximate the kernel evaluation by k̂_Ω(x, x') = φ̂_Ω(x)^⊤ φ̂_Ω(x'), where φ̂_Ω(x) = [φ(x; ω̂_1), . . . , φ(x; ω̂_m)]. During learning, the ω̂_i are optimized directly. Many existing works study this idea for supervised learning, such as Băzăvan et al.(2012); Yang et al.(2015); Sinha and Duchi(2016); Chang et al.(2017a); Bullins et al.(2018). We call this latter realization IKL-RFF. Next, we discuss and compare the differences between IKL-NN and IKL-RFF. The crucial difference is that IKL-NN can sample an arbitrary number of random features by first sampling ν ∼ P(ν) and transforming it via h_ψ(ν), while IKL-RFF is restricted to the pool size m. If the application needs more random features, IKL-RFF becomes memory inefficient. Specifically, we compare IKL-NN and IKL-RFF with different numbers of random features in Figure 3.5. With the same number of parameters (i.e., |h_ψ| = m × dim(ν))², IKL-NN outperforms IKL-RFF with m = 128 on inception score (6.876 versus 6.801). For IKL-RFF to achieve the same or better inception score than IKL-NN, the number of random features m needs to be increased to 4096, which is less memory efficient than the IKL-NN realization. In particular, h_ψ of IKL-NN is a three-layer MLP with 2,048 parameters (16 × 32 + 32 × 32 + 32 × 16), while IKL-RFF has 2,048 and 65,536 parameters for m = 128 and 4096, respectively.
Figure 3.5: The comparison between IKL-NN and IKL-RFF on CIFAR-10 under different number of random features.
² |h_ψ| denotes the number of parameters in h_ψ, m is the number of random features, and dim(ν) is the dimension of ν.
Algorithm          JS-4
IKL-NN             0.372 ± 0.002
IKL-RFF            0.383 ± 0.002
IKL-RFF (+2)       0.380 ± 0.002
IKL-RFF (+4)       0.377 ± 0.002
IKL-RFF (+8)       0.375 ± 0.002

Table 3.2: The comparison between IKL-NN and IKL-RFF on Google Billion Word.
On the other hand, using a large m for IKL-RFF not only increases the number of parameters but may also increase the optimization difficulty. Zhang et al.(2017) discuss the difficulty of optimizing random features directly on different tasks. Here we compare IKL-NN and IKL-RFF on the challenging Google Billion Word dataset. We train IKL-RFF with the same setting as in Section 3.3, where we set the pool size m to 1024 and the update schedule between critic and generator to 10 : 1, but we tune the Adam optimization parameters of IKL-RFF for a fair comparison. As discussed above, note that the number of parameters of h_ψ is 2,048 while IKL-RFF uses 16,384 when m = 1024. The results are shown in Table 3.2. Even though IKL-RFF uses more parameters, its performance of 0.383 is not competitive with IKL-NN, which achieves 0.372.
In Algorithm2, we update f_φ and h n_c times in each iteration, where we use n_c = 10. We keep the number of updates of f_φ at 10, but increase the number of updates for {ω̂_i}_{i=1}^{1024} to 12, 14, and 18 in each iteration. The results are shown in Table 3.2 with the symbols +2, +4, and +8, respectively. Clearly, IKL-RFF needs more updates to achieve performance competitive with IKL-NN. These results suggest that IKL-RFF poses a more difficult optimization problem with more parameters than IKL-NN. They also confirm the effectiveness of learning implicit generative models with deep neural networks (Goodfellow et al., 2014), although the underlying theory is still an open research question. A better optimization algorithm (Zhang et al., 2017) may close the performance gap between IKL-NN and IKL-RFF, which is worth further study as future work.
3.4 IKL Application to Random Kitchen Sinks
We also show that the proposed IKL framework can benefit conventional kernel methods in supervised learning. Rahimi and Recht(2009) propose Random Kitchen Sinks (RKS) as follows. We sample ω_i ∼ P_k(ω) and transform x ∈ ℝ^d into

φ̂(x) = [φ(x; ω_1), . . . , φ(x; ω_M)],   where sup_{x,ω} |φ(x; ω)| < 1.
We then learn a classifier on the transformed features φ̂(x). Kernel methods with random Fourier features (Rahimi and Recht, 2008) are an example of RKS, where P_k(ω) is the spectral distribution of the kernel and φ(x; ω) = [cos(ω^⊤x), sin(ω^⊤x)]. We usually learn a model w by solving

arg min_w  (λ/2)‖w‖² + (1/n) Σ_{i=1}^n ℓ(w^⊤ φ̂(x_i)).   (3.6)
If ℓ is a convex loss function, the objective (3.6) can be solved efficiently to a global optimum. The spectral distribution P_k is usually set to a parameterized form, such as a Gaussian distribution, but the selection of P_k is important in practice. If we view RKS as a kernel method with random features, then selecting P_k is equivalent to the well-known kernel selection (learning) problem in supervised learning (Gönen and Alpaydın, 2011).
Two-Stage Approach We follow Sinha and Duchi(2016) and consider kernel learning for RKS with a two-stage approach. In stage 1, we consider kernel alignment (Cristianini et al., 2002) of the form arg max_{k∈K} E_{(x,y),(x',y')}[y y' k(x, x')]. By parameterizing k via the implicit generative model h_ψ as in Section 3.2, we have the following problem:

arg max_ψ E_{(x,y),(x',y')}[y y' E_ν[e^{i h_ψ(ν)^⊤(x−x')}]],   (3.7)

which can be treated as (3.1) with F_1(x, x') = y y'. After solving (3.7), we obtain a learned sampler h_ψ from which we can easily sample. In stage 2, we then have the advantage of solving a convex problem (3.6) in RKS with IKL. The procedure is shown in Algorithm3.
Algorithm 3: Random Kitchen Sinks with IKL
Stage 1: Kernel Learning
Input: X = {(x_i, y_i)}_{i=1}^n, batch size B for data and m for random features, learning rate η
Initialize: ψ for the transformation function h
while ψ has not converged do
    Sample {(x_i, y_i)}_{i=1}^B ⊆ X and sample {ν_j}_{j=1}^m ∼ P(ν)
    g_ψ ← ∇_ψ [ 1/(B(B−1)) Σ_{i≠i'} y_i y_{i'} (1/m) Σ_{j=1}^m e^{i h_ψ(ν_j)^⊤(x_i − x_{i'})} ]
    ψ ← ψ − η Adam(ψ, g_ψ)
Stage 2: Random Kitchen Sinks
Sample {ν_i}_{i=1}^M ∼ P(ν); note that M is not necessarily equal to m
Transform X into φ̂(X) via h_ψ and {ν_i}_{i=1}^M
Learn a linear classifier on (φ̂(X), Y)
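A minimal sketch of stage 2 (ours, assuming NumPy and scikit-learn; X_train, y_train, X_test, y_test, and the trained sampler h_psi from stage 1 are assumed to exist; cosine/sine random Fourier features are used):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rks_features(X, omegas):
        # random Fourier features for the learned spectral samples omegas [M, d]
        proj = X @ omegas.T                                    # [n, M]
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

    M = 4096                                                   # may differ from m used in stage 1
    nu = np.random.randn(M, X_train.shape[1])
    omegas = h_psi(nu)                                         # omega_i = h_psi(nu_i)

    clf = LogisticRegression(C=1.0, max_iter=1000)
    clf.fit(rks_features(X_train, omegas), y_train)
    test_error = 1.0 - clf.score(rks_features(X_test, omegas), y_test)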
Note that in stage 1 we resample {ν_j}_{j=1}^m in every iteration to train the implicit generative model h_ψ. The advantage of Algorithm3 is that the random features used in kernel learning and in RKS can be different, which allows us to use fewer random features in kernel learning (stage 1) and to sample more features for RKS (stage 2). One can also jointly train the feature mapping ω and the model parameters w, as in neural networks. We remark that our intent is not to show state-of-the-art results on supervised learning, where deep neural networks dominate (Krizhevsky et al., 2012; He et al., 2016). We use RKS as a protocol to study kernel learning and the proposed IKL, which still yields competitive performance with neural networks on some tasks (Rahimi and Recht, 2009; Sinha and Duchi, 2016). Also, the simple procedure of RKS with IKL allows us to provide theoretical guarantees on the performance, which is still challenging for deep learning models.
Comparison with Existing Works Sinha and Duchi(2016) learn non-uniform weights for M random features via kernel alignment in stage 1 and then use these optimized features in RKS in stage 2. Note that the random features used in stage 1 have to be the same as the ones in stage 2. Jointly training the feature mapping and the classifier can be treated as a 2-layer neural network (Băzăvan et al., 2012; Alber et al., 2017; Bullins et al., 2018). Learning kernels with the aforementioned works becomes more costly if we want to use a large number of random features for training classifiers. In contrast to implicit generative models, Oliva et al.(2016) learn an explicit Bayesian nonparametric generative model for spectral distributions, which requires specifically designed inference algorithms. Learning kernels for (3.6) in the dual form without random features has also been proposed; it usually requires costly steps, such as an eigendecomposition of the Gram matrix (Gönen and Alpaydın, 2011).
3.4.1 Consistency and Generalization
The simple two-stage approach of IKL with RKS allows us to provide consistency and generalization guarantees. Consistency guarantees that the solution of the finite-sample approximation of (3.7) approaches the optimum of (3.7) (the population optimum) as the number of training data and the number of random features increase. We first define the necessary symbols and then state the theorem.
Let s(x_i, x_j) = y_i y_j be a label similarity function, where |y_i| ≤ 1. We use s_ij to denote s(x_i, x_j) interchangeably. Given a kernel k, we define the true and empirical alignment functions as

T(k) = E[s(x, x') k(x, x')],    T̂(k) = 1/(n(n−1)) Σ_{i≠j} s_ij k(x_i, x_j).
In the following, we write k_h for k_ψ to ease the notation. Recall the definitions k_h(x, x') = ⟨φ_h(x), φ_h(x')⟩ and k̂_h(x, x') = φ̂_h(x)^⊤ φ̂_h(x'). We define two hypothesis sets

F_H = {f(x) = ⟨w, φ_h(x)⟩_H | h ∈ H, ⟨w, w⟩ ≤ 1},
F̂_H = {f(x) = w^⊤ φ̂_h(x) | h ∈ H, ‖w‖ ≤ 1, w ∈ ℝ^m}.

Definition 10. (Rademacher's Complexity) Given a hypothesis set F, where f : X → ℝ for f ∈ F, and a fixed sample X = {x_1, . . . , x_n}, the empirical Rademacher complexity of F is defined as
" n # 1 X Rn (F) = sup σ f(x ) , X nEσ i i f∈F i=1 where σ are n i.i.d. Rademacher random variables. We then have the following theorems showing that the consistency guarantee depends on the com- plexity of the function class induced by IKL as well as the number of random features. The proof can be found in Appendix 3.6.3. m Theorem 11. (Consistency) Let bh = arg maxh∈H Tb(bkh), with i.i.d, samples {νi}i=1 drawn from P(ν). With probability at least 1 − 3δ, we have |T (k ) − sup T (k )| ≤ bbh h∈H h s s 1 4 h n−1 n−1 m i 8 log δ 2 log δ 2 X R (FH) + R (Fb ) + + . E X X H n m
Applying Cortes et al.(2010a), we also have a generalization bound, which depends on the number of training data n, the number of random features m, and the Rademacher complexity of the IKL kernel, as shown below.

Theorem 12. (Generalization (Cortes et al., 2010a)) Define the true and empirical misclassification errors of a classifier f as R(f) = P(Y f(X) < 0) and R̂_γ(f) = (1/n) Σ_{i=1}^n min{1, [1 − y_i f(x_i)/γ]_+}. Then

sup_{f∈F̂_H} {R(f) − R̂_γ(f)} ≤ (2/γ) R^n_X(F̂_H) + 3√(log(2/δ)/(2n))

with probability at least 1 − δ.
The Rademacher complexity R^n_X(F_H), for example, can be 1/√n or even 1/n for kernels with different boundedness conditions (Cortes et al., 2010a). We would expect worse rates for more powerful kernels; this suggests a trade-off between consistency/generalization and the use of powerful kernels parametrized by neural networks.
3.4.2 Experiment and Results
We evaluate the proposed IKL on both synthetic and benchmark binary classification tasks. For IKL, P(ν) is a standard normal distribution and h_ψ is a 3-layer MLP in all experiments. The number of random features m used to train h_ψ in Algorithm3 is fixed at 64. Other experimental details are described in Appendix 3.6.4.
Kernel learning with a poor choice of P_k(ω) We generate {x_i}_{i=1}^n ∼ N(0, I_d) with y_i = sign(‖x_i‖_2 − √d), where d is the data dimension. A two-dimensional example is shown in Figure 3.6. Competitive baselines include random features (RFF) (Rahimi and Recht, 2008) as well as OPT-KL (Sinha and Duchi, 2016). In the experiments, we fix M = 256 in RKS for all algorithms. Since the Gaussian kernel with bandwidth σ = 1 is known to be ill-suited for this task (Sinha and Duchi, 2016), we directly use random features from it for RFF and OPT-KL. Similarly, we set P(ν) to be the standard normal distribution as well.
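For reference, the synthetic data can be generated as follows (a small sketch in NumPy; the function name is ours):

    import numpy as np

    def make_sphere_data(n, d, seed=0):
        # x_i ~ N(0, I_d); y_i = sign(||x_i||_2 - sqrt(d)), so the decision boundary
        # is a sphere of radius sqrt(d), which a unit-bandwidth Gaussian kernel
        # handles poorly as d grows (Sinha and Duchi, 2016)
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(n, d))
        y = np.sign(np.linalg.norm(X, axis=1) - np.sqrt(d))
        return X, y

    X, y = make_sphere_data(n=10000, d=20)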
Figure 3.6: Left: the training examples when d = 2. Right: the classification error v.s. data dimension.
The test error for data dimensions d = {2, 4, . . . , 18, 20} is shown in Figure 3.6. Note that RFF is competitive with OPT-KL and IKL when d is small (d ≤ 6), while its performance degrades rapidly as d increases, which is consistent with the observation in Sinha and Duchi(2016); we refer to Sinha and Duchi(2016) for more discussion of the reason for this failure. On the other hand, although using the standard normal as the spectral distribution is ill-suited for this task, both OPT-KL and IKL can adapt to the data, learn to transform it into effective kernels, and degrade more slowly with d. Note that OPT-KL learns sparse weights on the sampled random features (M = 256). However, the sampled random features can fail to contain informative ones, especially in high dimensions (Bullins et al., 2018). Thus, when using a limited number of random features, OPT-KL may perform worse than IKL in the high-dimensional regime in Figure 3.6.
Figure 3.7: Test error rate versus number of basis functions in the second stage on benchmark binary classification tasks: (a) MNIST (4 vs. 9), (b) MNIST (5 vs. 6), (c) CIFAR-10 (auto vs. truck), (d) CIFAR-10 (plane vs. bird). We report the mean and standard deviation over five runs. Our method (IKL) is compared with RFF (Rahimi and Recht, 2009), OPT-KL (Sinha and Duchi, 2016), SM (Wilson and Adams, 2013), and end-to-end training of an MLP (NN).
Performance on benchmark datasets Next, we evaluate our IKL framework on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky and Hinton, 2009) datasets; each task consists of 10,000 training and 2,000 test examples. For all datasets, raw pixels are used as the feature representation. We set the bandwidth of the RBF kernel by the median heuristic. We also compare with the spectral mixture (SM) kernel of Wilson and Adams(2013), which uses a Gaussian mixture to learn the spectral density and can be seen as the explicit generative model counterpart of IKL; the SM kernel is also an MKL variant with a linear combination of kernels (Gönen and Alpaydın, 2011). In addition, we consider the joint training of random features and model parameters, which can be treated as a two-layer neural network (NN) and serves as a lower bound on the error when comparing different kernel learning algorithms. The test error versus different M = {2^6, 2^7, . . . , 2^13} in the second stage is shown in Figure 3.7. First, for computational efficiency, SM and the proposed IKL only sample m = 64 random features in each iteration of the first stage, and then draw different numbers of basis functions M from the learned h_ψ(ν) for the second stage. For OPT-KL, on the contrary, the random features used in training and testing must be the same, so OPT-KL has to handle all M random features during training, which raises computational concerns when M is large. In addition, IKL demonstrates improvement over the representative kernel learning method OPT-KL, especially on challenging datasets such as CIFAR-10. In some cases, such as MNIST, IKL almost reaches the performance of NN, while OPT-KL degrades to RFF except for a small number of basis functions (M = 2^6). This illustrates the effectiveness of learning the kernel spectral
distribution via the implicit generative model hψ. Also, IKL outperforms SM, which is consistent with the finding in Section 3.3 that IKL can learn more complicated spectral distributions than simple mixture models (SM).
3.5 Summary
We propose a generic kernel learning algorithm, IKL, which learns the sampling process of a kernel spectral distribution by transforming samples from a base distribution P(ν) into samples from the desired spectral density. We compare IKL with other algorithms for learning MMD GAN and for supervised learning with Random Kitchen Sinks (RKS). For these two tasks, the conditions and guarantees of IKL are studied. Empirical studies show IKL is better than or competitive with state-of-the-art kernel learning algorithms, demonstrating that IKL can learn to transform P(ν) into effective kernels even if P(ν) is less favorable to the task. We note that the preliminary idea of IKL is mentioned in Băzăvan et al.(2012), but they ended up with an algorithm that directly optimizes sampled random features (RF), which has many follow-up works (e.g., Sinha and Duchi(2016); Bullins et al.(2018)). The major difference is that, by learning the transformation function h_ψ, the RF used in training and evaluation can differ. This flexibility allows a simple training algorithm (SGD) and does not require keeping the learned features. In our studies on GAN and RKS, we show that a simple MLP can already achieve better or competitive performance compared with those works, which suggests IKL can be a new direction for kernel learning and is worth further study. We highlight that IKL is not in conflict with existing works but can be combined with them: in Section 3.3, we combine IKL with kernel learning via embeddings (Wilson et al., 2016) and with mixtures of spectral distributions (Wilson and Adams, 2013). Therefore, in addition to the examples shown in Section 3.3 and Section 3.4, IKL is directly applicable to many existing works with kernel learning via embeddings (e.g., Dai et al.(2014); Li and Póczos(2016); Wilson et al.(2016); Al-Shedivat et al.(2017); Arbel et al.(2018); Jean et al.(2018); Chang et al.(2019a)). A possible extension is combining IKL with Bayesian inference (Oliva et al., 2016) under a framework similar to Saatchi and Wilson(2017); the learned sampler from IKL could provide an easier way to perform Bayesian inference via sampling.
3.6 Appendix
3.6.1 Proof of Theorem8
The proof is a straightforward extension of Section 2.6.1 and 2.6.2. We still include the full derivation for completeness.
First, we need to show that ‖k_ψ(f_φ(x), ·) − k_ψ(f_φ(y), ·)‖_{H_K} is Lipschitz. By definition, we know that ‖k_ψ(f_φ(x), ·) − k_ψ(f_φ(y), ·)‖_{H_K} = 2(1 − k_ψ(f_φ(x), f_φ(y))). Also, since k_ψ(0) = 1 and k_ψ(0) − k_ψ(x, x') ≤ L_k‖0 − (x − x')‖ (the Lipschitz assumption on k_ψ), we have

‖k_ψ(f_φ(x), ·) − k_ψ(f_φ(y), ·)‖_{H_K} ≤ 2L_k‖f_φ(x) − f_φ(y)‖ ≤ 2L_k L‖x − y‖,

where the last inequality holds as f_φ is also a Lipschitz function with Lipschitz constant L.
For the other direction, that max_{ψ,φ} M_{ψ,φ}(P, P_n) → 0 implies P_n →_D P, the argument is relatively simple. We assume there exist ψ' and φ' such that h_{ψ'} and f_{φ'} are identity functions (up to scaling), which recovers the Gaussian kernel k. Therefore, max_{ψ,φ} M_{ψ,φ}(P, P_n) → 0 implies M_{ψ',φ'}(P, P_n) → 0, which completes the proof because MMD with any Gaussian kernel is weak (Gretton et al., 2012a).
Next, we show the proof of continuity. We are going to show that

max_{ψ,φ} M_{ψ,φ}(P_X, P_θ) = E_{x,x'}[k_{ψ,φ}(x, x')] − 2E_{x,z}[k_{ψ,φ}(x, g_θ(z))] + E_{z,z'}[k_{ψ,φ}(g_θ(z'), g_θ(z))]   (3.8)

is differentiable with respect to θ almost everywhere, using the auxiliary Lemma6. We first show that E_{z,z'}[k_{ψ,φ}(g_θ(z'), g_θ(z))] in (3.8) is locally Lipschitz in θ:

|E_{z,z'}[k_{ψ,φ}(g_θ(z), g_θ(z'))] − E_{z,z'}[k_{ψ,φ}(g_{θ'}(z), g_{θ'}(z'))]|
  = |E_{z,z'}[k_ψ(f_φ(g_θ(z)) − f_φ(g_θ(z')))] − E_{z,z'}[k_ψ(f_φ(g_{θ'}(z)) − f_φ(g_{θ'}(z')))]|
  ≤ E_{z,z'}[L_k ‖f_φ(g_θ(z)) − f_φ(g_θ(z')) − f_φ(g_{θ'}(z)) + f_φ(g_{θ'}(z'))‖]
  ≤ E_{z,z'}[L_k L(θ, z)‖θ − θ'‖ + L_k L(θ, z')‖θ − θ'‖]
  = 2L_k E_z[L(θ, z)] ‖θ − θ'‖.

The first inequality follows from the assumption that k_ψ is locally Lipschitz in (x, x'), with an upper bound L_k on the Lipschitz constants. By Assumption7, E_z[L(θ, z)] < ∞, so E_{z,z'}[k_ψ(f_φ(g_θ(z)) − f_φ(g_θ(z')))] is locally Lipschitz. A similar argument applies to the other terms in Eq. (3.8); therefore, Eq. (3.8) is locally Lipschitz in θ. Finally, with the compactness assumptions on Φ and Ψ, and the differentiability assumption on M_{ψ,φ}(P_X, P_θ), applying Lemma6 proves Theorem8.
39 3.6.2 Proof of Lemma9
Without loss of generality, we can rewrite the kernel function as k_ψ(t) = E_ν[cos(h_ψ(ν)^⊤ t)], where t is bounded. We then have

‖∇_t k_ψ(t)‖ = ‖E_ν[sin(h_ψ(ν)^⊤ t) h_ψ(ν)]‖
            ≤ E_ν[|sin(h_ψ(ν)^⊤ t)| · ‖h_ψ(ν)‖]
            ≤ E_ν[‖t‖ ‖h_ψ(ν)‖²].

The last inequality follows from |sin(x)| ≤ |x|. Since t is bounded, if E_ν[‖h_ψ(ν)‖²] < ∞, there exists a constant L such that ‖∇_t k_ψ(t)‖ < L for all t. By the mean value theorem, for any t and t', there exists s = αt + (1 − α)t', with α ∈ [0, 1], such that

k_ψ(t) − k_ψ(t') = ∇_s k_ψ(s)^⊤ (t − t').

Combining this with ‖∇_t k_ψ(t)‖ < L for all t, we prove

|k_ψ(t) − k_ψ(t')| ≤ L‖t − t'‖.
3.6.3 Proof of Theorem 11
We first prove two lemmas.
Lemma 13. (Consistency with respect to data) With probability at least 1 − δ, we have

sup_{h∈H} |T̂(k_h) − T(k_h)| ≤ 2 E_X[R^{n−1}_X(F_H)] + √((2/n) log(1/δ)).
Proof. Define

ρ(x_1, . . . , x_n) = sup_{h∈H} |T̂(k_h) − T(k_h)|.

Since |k_h(x, x')| ≤ 1, it is clear that

sup_{x_1,...,x_i,x_i',...,x_n} |ρ(x_1, . . . , x_i, . . . , x_n) − ρ(x_1, . . . , x_i', . . . , x_n)| ≤ 2/n.

Applying McDiarmid's inequality, we get

P(ρ(x_1, . . . , x_n) − E[ρ(x_1, . . . , x_n)] ≥ ε) ≤ exp(−nε²/2).

By Lemma 14, we can bound E[ρ(x_1, . . . , x_n)] ≤ 2 E_X[R^{n−1}_X(F_H)], which finishes the proof.
Lemma 14. Given X = {x_1, . . . , x_n}, define

ρ(x_1, . . . , x_n) = sup_{h∈H} |T̂(k_h) − T(k_h)|.

Then E[ρ(x_1, . . . , x_n)] ≤ 2 E_X[R^{n−1}_X(F_H)].

Proof. The proof closely follows Dziugaite et al.(2015). Given h, we first define t_h(x, x') = s(x, x') k_h(x, x') as a new kernel function to simplify the notation. We can then write

E[ρ(x_1, . . . , x_n)]
  = E_X [ sup_{h∈H} | E[t_h(z, z')] − (1/(n(n−1))) Σ_{i≠j} t_h(x_i, x_j) | ]
  ≤ E_{X,Z} [ sup_{h∈H} (1/(n(n−1))) Σ_{i≠j} (t_h(z_i, z_j) − t_h(x_i, x_j)) ]

by Jensen's inequality. Using conditional expectations and introducing the Rademacher random variables {σ_i}_{i=1}^{n−1}, we can write the above bound as

E_{X_{−i},Z_{−i}} [ E_{x_i,z_i} sup_{h∈H} (1/n) Σ_i ( Σ_{j≠i} (t_h(z_i, z_j) − t_h(x_i, x_j)) / (n−1) ) ]
  = E_{X,Z} E_{X',Z',σ} [ sup_{h∈H} (1/(n−1)) Σ_{i=1}^{n−1} σ_i (t_h(z_n, z_i') − t_h(x_n, x_i')) ].   (3.9)

The equality follows because X − X' and −(X − X') have the same distribution when X and X' are independent samples from the same distribution. Finally, we can bound (3.9) by

  ≤ E_{X'} E_{σ,X} [ sup_{h∈H} (2/(n−1)) Σ_{i=1}^{n−1} σ_i t_h(x_i, x') ]
  ≤ E_X E_σ [ sup_{f∈F_H} (2/(n−1)) Σ_{i=1}^{n−1} σ_i f(x_i) ]
  = 2 E_X[R^{n−1}_X(F_H)].

The second inequality follows because s(x, x') φ_h(x') ∈ F_H, since |s(x, x')| ≤ 1.
Lemma 15. (Consistency with respect to sampling random features) With probability 1 − δ, we have

| sup_{h∈H} T̂(k̂_h) − sup_{h∈H} T̂(k_h) | ≤ √(2 log(4/δ)/m).

Proof. Let the optimal solutions be

h* = arg max_{h∈H} T̂(k_h),    ĥ = arg max_{h∈H} T̂(k̂_h).

By definition,

T̂(k̂_h) = 1/(n(n−1)) Σ_{i≠j} s_ij ( (1/m) Σ_{k=1}^m cos(h(ν_k)^⊤(x_i − x_j)) )
        = (1/m) Σ_{k=1}^m ( 1/(n(n−1)) Σ_{i≠j} s_ij cos(h(ν_k)^⊤(x_i − x_j)) ).

It holds that | 1/(n(n−1)) Σ_{i≠j} s_ij cos(h(ν_k)^⊤(x_i − x_j)) | ≤ 1, since |s_ij| ≤ 1 and |cos(x)| ≤ 1. We then have

P( | sup_{h∈H} T̂(k̂_h) − sup_{h∈H} T̂(k_h) | > ε )
  ≤ P( |T̂(k_{h*}) − T̂(k̂_{h*})| > ε ) + P( |T̂(k_ĥ) − T̂(k̂_ĥ)| > ε )
  ≤ 4 exp(−mε²/2),

where the last inequality follows from Hoeffding's inequality.
With Lemma 13 and Lemma 15, we are ready to prove Theorem 11. We can decompose

|T(k_ĥ) − sup_{h∈H} T(k_h)|
  ≤ | sup_{h∈H} T̂(k_h) − sup_{h∈H} T(k_h) | + | sup_{h∈H} T̂(k̂_h) − sup_{h∈H} T̂(k_h) | + | T̂(k̂_ĥ) − T(k_ĥ) |
  ≤ sup_{h∈H} |T̂(k_h) − T(k_h)| + | sup_{h∈H} T̂(k̂_h) − sup_{h∈H} T̂(k_h) | + sup_{h∈H} |T̂(k̂_h) − T(k̂_h)|.

We then bound the first and third terms by Lemma 13 (and its analogue for F̂_H) and the second term by Lemma 15. Finally, a union bound completes the proof.
3.6.4 Hyperparameters of GAN and RKS experiments
IKL on GAN For Gaussian kernels, we use σ_q = {1, 2, 4, 8, 16} for images and σ_q = {0.5, 1, 2, 4, 8} for text; for RQ kernels, we use α_q = {0.2, 0.5, 1, 2, 5} for images and α_q = {0.04, 0.1, 0.2, 0.4, 1} for text. We use Adam as the optimizer. The learning rate for training both f_φ and g_θ is 0.0005 and 0.0001 for the image and text experiments, respectively. The batch size B is 64. We set the hyperparameter n_c for updating the critic to n_c = 5 and n_c = 10 for the CIFAR-10 and Google Billion Word datasets, respectively. The learning rate of h_ψ for Adam is 10^{-6}.
IKL on RKS For OPT-KL, we use the code provided by Sinha and Duchi(2016)³. We tune the hyperparameter ρ = {1.25, 1.5, 2, 4, 16, 64} on the validation set. For RFF, OPT-KL, and IKL, the linear classifier is logistic regression (Fan et al., 2008)⁴, to make a reasonable comparison with the MLP. We
³ https://github.com/amansinha/learning-kernels
⁴ https://github.com/cjlin1/liblinear
use 3-fold cross-validation to select the best C on the training set and report the error rate on the test set. For CIFAR-10 and MNIST, we normalize the data to zero mean and unit standard deviation in each feature dimension. The learning rate for Adam is 10^{-6}. We follow Bullins et al.(2018) in using early stopping when the performance on the validation set does not improve.
Part II
Kernel Learning for Explicit Generative Models
Chapter 4
Kernel Stein Generative Modeling
4.1 Background
Drawing novel samples from the data distribution is at the heart of generative modeling, where existing work falls into two categories, namely Implicit Generative Models (IGMs) and Explicit Generative Models (EGMs). Recall that in Chapter2 and Chapter3 we demonstrate how kernel learning improves moment matching networks for implicit generative modeling. EGMs, on the other hand, typically optimize the likelihood of non-normalized density models (e.g., energy-based models (LeCun et al., 2006)) or learn score functions (Hyvärinen, 2005; Vincent, 2011) (i.e., the gradient of the log data-density). Because they explicitly model densities or gradients of log-densities, EGMs remain favorable for many applications such as anomaly detection (Du and Mordatch, 2019; Grathwohl et al., 2020), image processing (Song and Ermon, 2019), and more. However, the generative capability of many EGMs is not competitive with GANs on high-dimensional distributions such as images. Recent advances in Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011) have led to some success for EGMs, especially energy-based models (Du and Mordatch, 2019; Nijkamp et al., 2019; Grathwohl et al., 2020) and score-based models (Song et al., 2019), on high-dimensional inference tasks such as image generation. As a stochastic optimization procedure, SGLD moves samples along the gradient of the log-density with carefully controlled, diminishing random noise, and converges to the true data distribution with a theoretical guarantee. The recent noise-conditional score network (Song and Ermon, 2019) estimates the score functions of data distributions perturbed with varying levels of Gaussian noise; for inference, it uses annealed SGLD to produce impressive samples that are comparable to state-of-the-art (SOTA) GAN-based models on CIFAR-10. Another interesting sampling technique is kernel Stein variational gradient descent (SVGD) (Liu and Wang, 2016; Liu, 2017), which iteratively produces samples via deterministic updates that optimally reduce the KL divergence between the model and the target distribution. The particles (samples) in SVGD interact with each other, simultaneously moving toward high-density regions following the gradients while pushing each other away due to a repulsive force induced by the kernel. These interesting properties have made SVGD promising for various challenging applications such as Bayesian optimization (Gong et al., 2019), deep probabilistic models (Wang and Liu, 2016; Feng et al., 2017; Pu et al., 2017), and reinforcement learning (Haarnoja et al., 2017).
Despite the attractive nature of SVGD, how to make its inference effective and scalable for complex high-dimensional data distributions is an open question that has not been studied in sufficient depth. One major challenge in high-dimensional inference is dealing with multi-modal distributions with many low-density regions, where SVGD can fail even on simple Gaussian mixtures. A remedy for this problem is "noise-annealing"; however, such a relatively simple solution may still lead to deteriorating performance as the data dimensionality increases. In this chapter, we aim to significantly enhance the capability of SVGD for sampling from complex and high-dimensional distributions. Specifically, we propose a novel method, namely the Noise Conditional Kernel SVGD, or NCK-SVGD in short, where the kernels are conditionally learned or selected based on the perturbed data distributions. Our main contributions are threefold. First, we propose to learn parameterized kernels with noise-conditional auto-encoders, which capture shared visual properties of sampled data at different noise levels. Second, we introduce NCK-SVGD with an additional entropy regularization for flexible control of the trade-off between sample quality and diversity, which is quantitatively evaluated with precision and recall curves. Third, the proposed NCK-SVGD achieves a new SOTA FID score of 21.95 on CIFAR-10 within the EGM family and is comparable to the results of GAN-based models in the IGM family. Our work shows that high-dimensional inference can be successfully achieved with SVGD.
4.2 Score-based Generative Modeling: Inference and Estimation
In this section, we review Stein Variational Gradient Descent (SVGD) for drawing samples from target distributions, and describe how to estimate score functions via the recent Noise Conditional Score Network (NCSN).
4.2.1 Stein Variational Gradient Descent (SVGD)
Let $p_d(x)$ be a positive and continuously differentiable probability density function on $\mathbb{R}^d$. For simplicity, we denote $p_d(x)$ as $p$ in the following derivation. SVGD (Liu and Wang, 2016; Liu, 2017) aims to find a set of particles $\{x_i\}_{i=1}^n \subset \mathbb{R}^d$ to approximate $p$, such that the empirical distribution $q(x) = \frac{1}{n}\sum_i \delta(x - x_i)$ of the particles weakly converges to $p$ when $n$ is large. Here $\delta(\cdot)$ denotes the Dirac delta function. To achieve this, SVGD begins with a set of initial particles from $q$, and iteratively updates them with a deterministic transformation function $T_\phi(\cdot)$:
$$x' \leftarrow T_\phi(x) = x + \epsilon\,\phi(x), \qquad \phi^*_{q,p} = \arg\max_{\phi\in\mathcal{H}^d}\Big\{-\tfrac{d}{d\epsilon}\,\mathrm{KL}\big(T_\phi(q)\,\|\,p\big)\Big|_{\epsilon=0} \ \text{ s.t. } \|\phi\|_{\mathcal{H}} \le 1\Big\}, \quad (4.1)$$
where $T_\phi(q)$ is the measure of the updated particles $x' = T_\phi(x)$, and $\phi^*_{q,p}: \mathbb{R}^d \to \mathbb{R}^d$ is the optimal transformation function maximally decreasing the KL divergence between the target data distribution $p$ and the transformed particle distribution $T_\phi(q)$, for $\phi$ in the unit ball of the RKHS $\mathcal{H}^d$. By letting $\epsilon$ go to zero, the continuous Stein descent is given by $dX_t = \phi^*_{q_t,p}(X_t)\,dt$, where $q_t$ is the density of $X_t$. A key observation of Liu (2017) is that, under some mild conditions, the negative gradient of the KL divergence in Eq. (4.1) is exactly equal to the squared Kernel Stein Discrepancy (KSD):
$$-\frac{d}{dt}\,\mathrm{KL}(q_t\,\|\,p) = \mathbb{S}^2(q_t\,\|\,p) = \max_{\phi\in\mathcal{F}}\ \big(\mathbb{E}_{x\sim q_t}[\mathcal{A}_p\phi(x)]\big)^2, \quad (4.2)$$
where $\mathcal{A}_p\phi(x) = \nabla\log p(x)^\top\phi(x) + \nabla\cdot\phi(x)$, with $\nabla\cdot\phi = \sum_{j=1}^d \partial_{x_j}\phi_j(x)$, and $\mathcal{A}_p$ is the Stein operator that maps a vector-valued function $\phi$ to a scalar-valued function $\mathcal{A}_p\phi$. The KSD $\mathbb{S}(q\,\|\,p)$ provides a discrepancy measure between $q$ and $p$, and $\mathbb{S}(q\,\|\,p)=0$ iff $q=p$, given that $\mathcal{H}$ is sufficiently large. By taking $\mathcal{H}$ to be a reproducing kernel Hilbert space (RKHS), KSD provides a closed-form solution of Eq. (4.2). Specifically, let $\mathcal{H}_0$ be an RKHS of scalar-valued functions with a positive definite kernel $k(x, x')$, and $\mathcal{H} = \mathcal{H}_0\times\cdots\times\mathcal{H}_0$ the corresponding $d\times 1$ vector-valued RKHS. The optimal solution of Eq. (4.2) (Liu et al., 2016; Chwialkowski et al., 2016) is $\mathbb{S}(q\,\|\,p) = \|\phi^*_{q,p}(\cdot)\|_{\mathcal{H}}$, where
$$\phi^*_{q,p}(\cdot) \propto \mathbb{E}_{x\sim q}\big[\mathcal{A}_p k(x,\cdot)\big] = \mathbb{E}_{x\sim q}\big[\nabla\log p(x)\, k(x,\cdot) + \nabla_x k(x,\cdot)\big]. \quad (4.3)$$
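To make the update concrete, the following is a minimal NumPy sketch of one SVGD iteration, where the expectation in Eq. (4.3) is replaced by the empirical average over the current particles and an RBF kernel with the median-heuristic bandwidth is assumed; `score_fn`, which returns $\nabla\log p$ at each particle, is a hypothetical callable supplied by the user.

```python
import numpy as np

def rbf_kernel_and_grad(X):
    """RBF kernel matrix K and, for each particle i, sum_j grad_{x_j} k(x_j, x_i)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)          # (n, n)
    h = np.sqrt(0.5 * np.median(sq_dists) / np.log(X.shape[0] + 1)) + 1e-12   # median heuristic
    K = np.exp(-sq_dists / (2 * h ** 2))
    grad_K = (K.sum(axis=0)[:, None] * X - K.T @ X) / h ** 2                  # (n, d)
    return K, grad_K

def svgd_step(X, score_fn, step_size=1e-1):
    """One deterministic SVGD update of the particles X (n, d), following Eq. (4.3)."""
    K, grad_K = rbf_kernel_and_grad(X)
    phi = (K @ score_fn(X) + grad_K) / X.shape[0]    # empirical optimal direction
    return X + step_size * phi
```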
The remaining question is how to estimate the score function $s_p(x) = \nabla_x\log p_d(x)$, given only samples $x\sim p_d(x)$ from the unknown data distribution, for generative modeling tasks.
4.2.2 Score Estimation
To circumvent the expensive Markov Chain Monte Carlo (MCMC) sampling or the intractable partition function when estimating a non-normalized probability density (e.g., energy-based models), Score Matching (Hyvärinen, 2005) directly estimates the score by $\arg\min_\theta \frac{1}{2}\mathbb{E}_{p_d(x)}\big[\|s_\theta(x) - \nabla_x\log p_d(x)\|_2^2\big]$, where $s_\theta(x): \mathbb{R}^d\to\mathbb{R}^d$. To train the model $s_\theta(x)$, we use the equivalent objective
$$\arg\min_\theta\ \mathbb{E}_{p_d(x)}\Big[\mathrm{tr}\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\|s_\theta(x)\|_2^2\Big], \quad (4.4)$$
which does not require access to $\nabla_x\log p_d(x)$. However, score matching is not scalable to deep neural networks and high-dimensional data (Song et al., 2019) due to the expensive computation of $\mathrm{tr}(\nabla_x s_\theta(x))$. To overcome this issue, denoising score matching (DSM) (Vincent, 2011) instead matches $s_\theta(x)$ to a non-parametric kernel density estimator (KDE), i.e., it minimizes $\frac{1}{2}\mathbb{E}_{p_\sigma(\tilde{x})}\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}}\log p_\sigma(\tilde{x})\|^2$, where $p_\sigma(\tilde{x}) = \frac{1}{n}\sum_{i=1}^n p_\sigma(\tilde{x}|x_i)$ and $p_\sigma(\tilde{x}|x) \propto \exp(-\|\tilde{x}-x\|^2/2\sigma^2)$ is a smoothing kernel with isotropic Gaussian of variance $\sigma^2$. Vincent (2011) further shows that this objective is equivalent to
$$\frac{1}{2}\,\mathbb{E}_{p_\sigma(\tilde{x}|x)p_d(x)}\Big[\big\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}}\log p_\sigma(\tilde{x}|x)\big\|^2\Big], \quad (4.5)$$
where the target $\nabla_{\tilde{x}}\log p_\sigma(\tilde{x}|x) = (x - \tilde{x})/\sigma^2$ has a simple closed form. We can interpret Eq. (4.5) as learning a score function $s_\theta(\tilde{x})$ that moves the noisy input $\tilde{x}$ toward the clean data $x$. One caveat of DSM is that the optimal score $s_{\theta^*}(x) = \nabla_x\log p_\sigma(x) \approx \nabla_x\log p_d(x)$ holds only when the noise $\sigma$ is small enough. However, learning the score function from a single-noise perturbed data distribution leads to inaccurate score estimation in low-density regions of a high-dimensional data space, which can be severe under the low-dimensional manifold assumption. Thus, Song and Ermon (2019) propose learning a noise-conditional score network (NCSN) based on multiple perturbed data distributions with Gaussian noises of varying magnitudes:
$$\frac{1}{2L}\sum_{l=1}^{L}\sigma_l^2\cdot\mathbb{E}_{p_{\sigma_l}(\tilde{x}|x)p_d(x)}\Big[\big\|s_\theta(\tilde{x};\sigma_l) - \nabla_{\tilde{x}}\log p_{\sigma_l}(\tilde{x}|x)\big\|^2\Big], \quad (4.6)$$
where $\{\sigma_l\}_{l=1}^L$ is a positive geometric sequence such that $\sigma_1$ is large enough to mitigate the low-density regions of the high-dimensional space and $\sigma_L$ is small enough to minimize the effect of the perturbation. Note that the factor $\sigma_l^2$ balances the scale of the loss across noise levels. After learning the noise-conditional score function $s_\theta(\tilde{x},\sigma)$, Song and Ermon (2019) run annealed SGLD to draw samples from the sequence of model scores under annealed noise levels $\sigma_1 > \cdots > \sigma_L$:
$$x_{t+1} \leftarrow x_t + \frac{\eta_l}{2}\,s_\theta(x_t,\sigma_l) + \sqrt{\alpha\,\eta_l}\; z_t, \quad \forall t=1,\dots,T \text{ and } l=1,\dots,L, \quad (4.7)$$
where $z_t\sim\mathcal{N}(0, I)$ is standard normal noise and $\alpha$ is a constant controlling the diversity of SGLD. It is crucial to choose the learning rate $\eta_l = \eta_0\cdot\sigma_l^2/\sigma_L^2$, which balances the scale between the gradient term $\|\eta_l s_\theta(x,\sigma_l)\|$ and the noise term $\|\sqrt{\eta_l}\,z_t\|$. See Song and Ermon (2019) for a detailed explanation.
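For reference, a minimal sketch of the annealed SGLD sampler of Eq. (4.7) is shown below; `score_net(x, sigma)` stands for the trained noise-conditional score estimate and is a hypothetical callable, and the default step size and noise schedule are illustrative values only.

```python
import numpy as np

def annealed_sgld(score_net, x_init, sigmas, eta0=2e-5, T=100, alpha=1.0):
    """Annealed SGLD of Eq. (4.7): loop over noise levels sigma_1 > ... > sigma_L."""
    x = x_init.copy()
    for sigma_l in sigmas:
        eta_l = eta0 * (sigma_l / sigmas[-1]) ** 2        # eta_l = eta_0 * sigma_l^2 / sigma_L^2
        for _ in range(T):
            z = np.random.randn(*x.shape)                 # z_t ~ N(0, I)
            x = x + 0.5 * eta_l * score_net(x, sigma_l) + np.sqrt(alpha * eta_l) * z
    return x
```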
4.3 Challenges of SVGD in High-dimension
In this section, we provide a deeper analysis of SVGD on a toy mixture model with imbalanced mixture weights, as the dimension of the data distribution increases.
4.3.1 Mixture of Gaussian with Disjoint Support
Consider a simple mixture distribution $p_d(x) = \pi p_1(x) + (1-\pi)p_2(x)$, where $p_1(x)$ and $p_2(x)$ have disjoint, separated supports and $\pi\in(0,1)$. Wenliang et al. (2019) and Song and Ermon (2019) demonstrate that score-based sampling methods such as SGLD cannot correctly recover the mixture weights of these
two modes in reasonable time. The reason is that the score function $\nabla_x\log p_d(x)$ equals $\nabla_x\log p_1(x)$ in the support of $p_1$ and $\nabla_x\log p_2(x)$ in the support of $p_2$. In either case, score-based sampling algorithms are blind to the mixture weights, which may lead to samples with an arbitrary reweighing of the components, depending on the initialization. We now show that vanilla SVGD (SVGD) also suffers from this issue on a 2-dimensional mixture of Gaussians $p_d(x) = 0.2\,\mathcal{N}((-5,-5), I) + 0.8\,\mathcal{N}((5,5), I)$. We use the ground-truth scores (i.e., $\nabla_x\log p_d(x)$) when sampling with SVGD. See the middle left of Figure 4.1a for the samples and the green curve in Figure 4.1b for the error (Maximum Mean Discrepancy) between the generated samples and the ground-truth samples. This phenomenon can also be explained by the objective of SVGD, as the KL divergence is not sensitive to the mixture weights.
4.3.2 Anneal SVGD in High-dimension
To overcome this issue, Song and Ermon (2019) propose Annealed SGLD (A-SGLD) with a sequence of noise-perturbed score functions, where injecting noise with varying magnitudes of variance enlarges the support of the different mixture components and mitigates the low-density regions in high-dimensional settings. We also run Annealed SVGD (A-SVGD) with the same sequence of noise-perturbed score functions as A-SGLD, and set the RBF kernel bandwidth to the median pairwise distance of samples from $p_d(x)$. In the low-dimensional case (e.g., $d=2$), A-SVGD produces samples that represent the correct mixture weights of $p_d(x)$, as shown in the bottom left of Figure 4.1a. However, the performance of A-SVGD deteriorates as $d$ increases, as shown by the red curve in Figure 4.1b. On the other hand,
(a) Samples visualization (b) MMD loss versus data dimension
Figure 4.1: (a) Visualization of real data from $p_d(x)$ and samples from different inference algorithms. The three figures in the left column are on a 2-dimensional mixture of Gaussians. The right column shows the results of different sampling algorithms on 64-dimensional distributions; we visualize only the first two dimensions, as the real-data distribution is isotropic. (b) Maximum Mean Discrepancy (MMD) between the real-data samples ($P_d$) and generated samples ($Q$) from different inference algorithms, where A-SGLD means Annealed SGLD, A-SVGD means Annealed SVGD with a fixed kernel, and NCK-SVGD means Annealed SVGD with noise-conditional kernels. Note that SGLD (blue) and SVGD (green) produce similar samples with even mixture weights, so the two curves overlap.
Figure 4.2: Medians of pairwise distance under different noise levels $\sigma_1 > \cdots > \sigma_{10}$ across $d$.
the performance of A-SGLD is rather robust to increasing $d$ (see the orange curve in Figure 4.1b and the top-right samples in Figure 4.1a). We argue that the failure of A-SVGD in high dimensions may be due to an inadequate kernel choice, which is fixed to the median pairwise distance of the real-data distribution $p_d(x)$ regardless of the noise level of $p_\sigma$. In Figure 4.2, we present $\mathrm{med}(\{\|x-y\|_2 : x, y\sim p_\sigma\})$, the median pairwise distance of samples from different noise-perturbed $p_\sigma$, as $d$ increases. We observe that in low-dimensional settings (e.g., $d=2$) the medians of the different $p_\sigma$ do not deviate much from each other, which explains the good performance of A-SVGD for $d<8$ in Figure 4.1b. Nevertheless, in high-dimensional settings (e.g., $d=64$), the median differs substantially between the largest and smallest noise-perturbed data distributions (i.e., $p_{\sigma_1}$ vs. $p_{\sigma_{10}}$). A bandwidth suitable for the large perturbation noise $\sigma_1$ no longer holds for the small perturbation noise $\sigma_{10}$, hence limiting the performance of fixed kernels. Following this insight, we propose noise-conditional kernels (NCK) for A-SVGD, namely NCK-SVGD, where the data-driven kernel $k(x, x'; \sigma)$ is conditioned on the annealing noise $\sigma$ to dynamically adapt to the varying scales. In this example, we set the bandwidth of the RBF kernel to the median pairwise distance of the noise-perturbed data distribution conditional on each noise $\sigma$. See the sample visualization and performance of NCK-SVGD in Figures 4.1a and 4.1b, respectively.
4.4 SVGD with Noise-conditional Kernels and Entropy Regularization
In Figure 4.1b, we showed that a simple noise-conditional kernel (NCK) for A-SVGD leads to a considerable gain on the Gaussian mixtures described in Section 4.3. Here we discuss how to learn the NCK with deep neural networks for complex real-world data distributions. Moreover, we extend the proposed NCK-SVGD with an entropy regularization that controls the quality-diversity trade-off.
4.4.1 Learning Deep Noise-conditional Kernels
Kernel selection, also known as kernel learning, is a critical ingredient in kernel methods, and has been actively studied in many applications such as generative modeling (Sutherland et al., 2017; Li et al., 2017; Bińkowski et al., 2018; Li et al., 2019), Bayesian non-parametric learning (Wilson et al., 2016), change-point detection (Chang et al., 2019a), statistical testing (Wenliang et al., 2019), and more. To make the kernel adapt to different noises, we consider a deep NCK of the form
$$k_\psi(x, x'; \sigma) = k_{\mathrm{rbf}}\big(E_\psi(x,\sigma), E_\psi(x',\sigma);\sigma\big) + k_{\mathrm{imq}}\big(E_\psi(x,\sigma), E_\psi(x',\sigma);\sigma\big), \quad (4.8)$$
where the Radial Basis Function (RBF) kernel is $k_{\mathrm{rbf}}(x, x';\sigma) = \exp(-\gamma(\sigma)\|x-x'\|_2^2)$, the inverse multiquadratic (IMQ) kernel is $k_{\mathrm{imq}}(x, x';\sigma) = (1 + \|x-x'\|_2^2)^{\tau(\sigma)}$, and $E_\psi(x,\sigma)$ is a learnable deep encoder with parameters $\psi$. Note that the kernel hyper-parameters $\gamma$ and $\tau$ are also conditioned on the noise $\sigma$. Next, we learn the deep encoder $E_\psi(x,\sigma): \mathbb{R}^d\to\mathbb{R}^h$ via a noise-conditional auto-encoder framework to capture common visual semantics of the noise-perturbed data distributions over $\{\sigma_l\}_{l=1}^L$:
$$\frac{1}{2L}\sum_{l=1}^{L}\frac{1}{\sigma_l^2}\cdot\mathbb{E}_{p_{\sigma_l}(\tilde{x}|x)p_d(x)}\Big[\big\|D_\varphi\big(E_\psi(\tilde{x},\sigma_l),\sigma_l\big) - x\big\|^2\Big], \quad (4.9)$$
where $D_\varphi(\cdot,\sigma_l): \mathbb{R}^h\to\mathbb{R}^d$ is the corresponding noise-conditional decoder and $h$ is the dimension of the auto-encoder's code space. As in the objective of Eq. (4.6), the scaling constant $1/\sigma_l^2$ keeps the reconstruction loss of each noise level scale-balanced. We note that kernel learning via autoencoding is simple to train and works well in our experimental study. There are many recent advances in deep kernel learning (Wilson et al., 2016; Jacot et al., 2018; Wenliang et al., 2019) that could potentially bring additional performance gains; we leave combining advanced kernel learning with the proposed algorithm for future work.
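As an illustration, a NumPy sketch of the noise-conditional kernel in Eq. (4.8) evaluated on code-space features is given below; `encoder` is the trained $E_\psi$ (a hypothetical callable here), and `gamma0`, `tau0` play the role of the hyper-parameters $\gamma(\sigma)$, $\tau(\sigma)$ via the median heuristic over perturbed samples.

```python
import numpy as np

def noise_conditional_kernel(x, y, sigma, encoder, gamma0=1.0, tau0=-0.2):
    """RBF + IMQ kernel of Eq. (4.8) on encoder features E_psi(., sigma)."""
    ex, ey = encoder(x, sigma), encoder(y, sigma)                  # (n, h), (m, h)
    sq_dists = np.sum((ex[:, None, :] - ey[None, :, :]) ** 2, axis=-1)
    gamma = gamma0 / (np.median(sq_dists) + 1e-12)                 # gamma(sigma), data-driven
    k_rbf = np.exp(-gamma * sq_dists)
    k_imq = (1.0 + sq_dists) ** tau0                               # tau(sigma) fixed to tau0 here
    return k_rbf + k_imq
```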
4.4.2 Entropy Regularization and Diversity Trade-off
Similar to Wang and Liu (2019), we propose to learn a distribution $q$ such that, for $\beta\ge 1$, $\min_q F_\beta(q) = \mathrm{KL}(q, p) - (\beta-1)H(q)$. The first term is the fidelity term and the second term controls the diversity. We have
$$F_\beta(q) = \int q\log(q/p) + (\beta-1)\int q\log(q) = \beta\int q\log\big(q/p^{\frac{1}{\beta}}\big) = \beta\,\mathrm{KL}\big(q, p^{\frac{1}{\beta}}\big).$$
Remark 16. Note that writing $\mathrm{KL}(q, p^{1/\beta})$ is an abuse of notation, since $p^{1/\beta}$ is not normalized. Nevertheless, as we will see next, sampling from $p^{1/\beta}$ can be seamlessly achieved by an entropic regularization of the Stein descent.

Proposition 17. Consider the continuous Stein descent $dX_t = \phi^*_{q_t,p,\beta}(X_t)\,dt$, where
$$\phi^*_{q_t,p,\beta}(x) = \mathbb{E}_{x'\sim q_t}\big[s_p(x')K(x',x) + \beta\,\nabla_{x'}K(x',x)\big].$$
We have $\frac{dF_\beta(q_t)}{dt} = -\beta^2\,\mathbb{S}^2(q_t, p^{1/\beta})$. Hence $\phi^*_{q_t,p,\beta}$ is a descent direction for the entropy-regularized KL divergence. More importantly, the amount of decrease is the closeness, in the Stein sense, of $q_t$ to the smoothed distribution $p^{1/\beta}$.

See Appendix 4.7.1 for the detailed derivation. From this proposition we see that if $\beta$ is small, the entropic-regularized Stein descent will converge to $p^{1/\beta}$, which concentrates on the high-likelihood modes of the distribution, and we have less diversity. If $\beta$ is high, on the other hand, we will have more diversity, but we will target a smoothed distribution $p^{1/\beta}$, since we have equivalently:
$$\frac{d\,\mathrm{KL}(q_t, p^{1/\beta})}{dt} = -\beta\,\mathbb{S}^2\big(q_t, p^{1/\beta}\big).$$
Hence $\beta$ controls the diversity of the sampling and the precision/recall of the samples. In Figure 4.3, we visualize the non-normalized density of a 2-dimensional Gaussian mixture $p(x)^{1/\beta}$ with varying choices of the entropy regularizer $\beta$. Specifically, we consider
$$p(x) \propto 0.8\,\mathcal{N}\big((5,5), I\big) + 0.2\,\mathcal{N}\big((-5,-5), I\big) + 0.6\,\mathcal{N}\big((5,-5), I\big) + 0.4\,\mathcal{N}\big((-5,5), I\big).$$
When $\beta$ is small (e.g., $\beta=0.5$), the resulting $p^{1/\beta}$ shows mode dropping compared to the original $p$. When $\beta$ is large (e.g., $\beta=4.0$), the resulting $p^{1/\beta}$ covers all four modes, but incorrectly with almost equal weights. We also confirm the effect of the diversity regularizer $\beta$ on real-world datasets, as shown in Section 4.5.3.

Remark 18. In Eq. (4.7), $\alpha$ plays the same role as $\beta$ and ensures convergence of SGLD to $p^{1/\alpha}$.

Algorithm 4: NCK-SVGD with Entropic Regularization.
Input: data-perturbing noises $\{\sigma_l\}_{l=1}^L$, noise-conditional score function $s_\theta(x,\sigma)$, noise-conditional kernel $k_\psi(x,x';\sigma)$, a set of initial particles $\{x_i^{(0)}\}_{i=1}^n$, entropic regularizer $\beta$, initial learning rate $\epsilon$, and maximum iteration $T$.
Output: a set of particles $\{x_i^{(T)}\}_{i=1}^n$ that approximates the target distribution.
for $l \leftarrow 1$ to $L$ do
    $\eta_l \leftarrow \epsilon\cdot(\sigma_l/\sigma_L)^2$
    for $t \leftarrow 1$ to $T$ do
        $x_i^{(t)} \leftarrow x_i^{(t-1)} + \eta_l\cdot\phi^*_{q_t,p,\beta}(x_i^{(t-1)})$, where
        $\phi^*_{q_t,p,\beta}(x) = \frac{1}{n}\sum_{j=1}^n\big[k_\psi(x_j^{(t-1)}, x;\sigma_l)\,s_\theta(x,\sigma_l) + \beta\cdot\nabla_{x_j^{(t-1)}}k_\psi(x_j^{(t-1)}, x;\sigma_l)\big]$
    $x_i^{(0)} \leftarrow x_i^{(T)}, \ \forall i=1,\dots,n$.
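A compact NumPy sketch of the particle update in Algorithm 4 is given below, assuming hypothetical callables `score_net(x, sigma)` for $s_\theta$ and `kernel(x, sigma)` returning the kernel matrix together with the repulsive term $\sum_j\nabla_{x_j}k_\psi(x_j, x_i;\sigma)$ for each particle $i$.

```python
import numpy as np

def nck_svgd(x, score_net, kernel, sigmas, eps=4e-4, T=50, beta=1.0):
    """NCK-SVGD (Algorithm 4): annealed SVGD with noise-conditional kernels
    and entropic regularization beta."""
    n = x.shape[0]
    for sigma_l in sigmas:                                  # sigma_1 > ... > sigma_L
        eta_l = eps * (sigma_l / sigmas[-1]) ** 2
        for _ in range(T):
            K, grad_K = kernel(x, sigma_l)                  # (n, n), (n, d)
            phi = (K @ score_net(x, sigma_l) + beta * grad_K) / n
            x = x + eta_l * phi
    return x
```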
(a) β = 0.5 (b) β = 1.0 (c) β = 2.0 (d) β = 4.0
Figure 4.3: Non-normalized density of $p^{1/\beta}$, where $p(x)$ is a 2-dimensional mixture of Gaussians with imbalanced mixture weights.
4.5 Experiments and Results
We begin with the experimental setup and show that NCK-SVGD not only produces good-quality images on the MNIST, CIFAR-10, and CelebA datasets, but also offers flexible control between sample quality and diversity. A-SGLD is the primary competing method, and we use the recent state-of-the-art realization (Song and Ermon, 2019) for comparison.
Network Architecture For the noise-conditional score network, we use the pre-trained model of Song and Ermon (2019)¹. For the noise-conditional kernel, we consider a modified NCSN architecture where the encoder consists of a ResNet with instance normalization layers, and the decoder is a U-Net-type architecture. The critical difference from the score network is the dimension of the bottleneck layer $h$, which is $h=196$ for MNIST and $h=512$ for CIFAR-10. Note that $h$ for both MNIST and CIFAR-10 is considerably smaller than the data dimension, which is $d=784$ for MNIST and $d=3072$ for CIFAR-10. In contrast, the dimension of the hidden layers of NCSN is around 4x larger than the data dimension.
Kernel Design We consider a mixture of RBF and IMQ kernels on the data-space and code-space features, as defined in Eq. (4.8). The bandwidth of the RBF kernel is $\gamma(\sigma) = \gamma_0/\mathrm{med}(X_\sigma)$, where $\mathrm{med}(X_\sigma)$ denotes the median pairwise distance of samples drawn from the annealed data distribution $p_\sigma(\tilde{x}|x)$. We search for the best kernel hyper-parameters over $\gamma_0\in\{0.4, 0.6, 0.8, 1.0, 2.0, 4.0\}$ and $\tau_0\in\{-0.1, -0.2, -0.3, -0.4\}$.
Inference Hyper-parameters Following Song and Ermon (2019), we choose $L=10$ noise levels where the standard deviations $\{\sigma_i\}_{i=1}^L$ form a geometric sequence with $\sigma_1=1$ and $\sigma_{10}=0.01$. Note that Gaussian noise of $\sigma=0.01$ is almost indistinguishable to human eyes for image data. For A-SGLD, we choose $T=100$, $\epsilon=2\times 10^{-5}$, and $\alpha\in\{0.1, 0.2,\dots, 1.1, 1.2\}$. For NCK-SVGD, we choose $n=128$, $T=50$, $\epsilon\in\{2,4,6\}\times 10^{-4}$, and $\beta\in\{0.01, 0.05, 0.1, 0.2,\dots, 1.0,\dots, 3.9, 4.0\}$.
Evaluation Metric We report the Inception (Salimans et al., 2016)² and FID (Heusel et al., 2017)³ scores using 50k samples. In addition, we also present the Improved Precision and Recall (IPR) curve (Kynkäänniemi et al., 2019)⁴ to examine the impact of entropy regularization and kernel hyper-parameters on the diversity-versus-quality trade-off. For the IPR curve on MNIST, we use data-space features (i.e., raw image pixels) to compute the KNN-3 data manifold, since the VGG-16 network does not apply to gray-scale images. For the IPR curve on CIFAR-10, we follow the original setting of Kynkäänniemi et al. (2019), which uses code-space embeddings from a pre-trained VGG-16 model to construct the KNN-3 data manifold. For simplicity, we generate 1024 samples to compute precision and recall, and report the average of 5 runs with different random seeds.
(a) Ablation of NCK-SVGD (b) Comparison with A-SGLD (c) Representative samples
Figure 4.4: MNIST experiment evaluated with Improved Precision and Recall (IPR). (a) NCK-SVGD-D (data-space kernels) and NCK-SVGD-C (code-space kernels) with varying kernel bandwidth ($\gamma$ of the RBF kernel) and entropy regularizer $\beta$. (b) With 0.5% of A-SGLD (i.e., 5 steps at the first noise level) as initialization, NCK-SVGD-C achieves higher recall than A-SGLD. (c) Uncurated samples of NCK-SVGD-C, ranging from high-precision/low-recall (A) to low-precision/high-recall (B).
4.5.1 Quantitative Evaluation
MNIST Results We analyze variants of NCK-SVGD quantitatively with IPR (Kynkäänniemi et al., 2019) on MNIST, as shown in Figure 4.4. NCK-SVGD-D denotes NCK-SVGD with data-space kernels and NCK-SVGD-C denotes NCK-SVGD with code-space kernels. We have three main observations. First and foremost, both NCK-SVGD-D and NCK-SVGD-C demonstrate flexible control of the quality (precision) versus diversity (recall) trade-off. This finding aligns well with our theoretical analysis that the entropy regularization constant $\beta$ explicitly controls sample diversity (see the green and red curves in Figure 4.4a). Furthermore, the bandwidth $\gamma$ of the RBF kernel also impacts sample diversity, which corroborates the original study of SVGD on the repulsive term $\nabla_{x'}k(x', x)$ (Liu, 2017). Secondly, NCK-SVGD-C improves upon NCK-SVGD-D in the higher-precision region, which justifies the advantage of the deep noise-conditional kernel learned via auto-encoding. Finally, when initialized with samples obtained from 5 out of 100 steps of A-SGLD at the first noise level $\sigma_1$ (i.e., 0.5% of the total A-SGLD steps), NCK-SVGD-C achieves a higher recall than the SOTA A-SGLD.
CIFAR-10 Results We compare the proposed NCK-SVGD (using code-space kernels) with two representative families of generative models: IGMs (e.g., WGAN-GP (Gulrajani et al., 2017), MoLM (Ravuri et al., 2018), SN-GAN (Miyato et al., 2018), ProgressGAN (Karras et al., 2018)) and gradient-based EGMs (e.g., EBM (Du and Mordatch, 2019), Short-run MCMC (Nijkamp et al., 2019), A-SGLD (Song and Ermon, 2019), JEM (Grathwohl et al., 2020)). See Table 4.1 for the Inception/FID scores and Figure 4.5 for the IPR curve.
¹ https://github.com/ermongroup/ncsn
² https://github.com/openai/improved-gan/tree/master/inception_score
³ https://github.com/bioinf-jku/TTUR
⁴ https://github.com/kynkaat/improved-precision-and-recall-metric
Model | Inception | FID
CIFAR-10 Unconditional
WGAN-GP (Gulrajani et al., 2017) | 7.86 | 36.4
MoLM (Ravuri et al., 2018) | 7.90 | 18.9
SN-GAN (Miyato et al., 2018) | 8.22 | 21.7
ProgressGAN (Karras et al., 2018) | 8.80 | -
EBM (Ensemble) (Du and Mordatch, 2019) | 6.78 | 38.2
Short-MCMC (Nijkamp et al., 2019) | 6.21 | -
A-SGLD (Song and Ermon, 2019) | 8.87 | 25.32
NCK-SVGD | 8.20 | 21.95
CIFAR-10 Conditional
EBM (Du and Mordatch, 2019) | 8.30 | 37.9
JEM (Grathwohl et al., 2020) | 8.76 | 38.4
SN-GAN (Miyato et al., 2018) | 8.60 | 17.5
BigGAN (Brock et al., 2019) | 9.22 | 14.73
Table 4.1: CIFAR-10 Inception and FID scores.
Figure 4.5: CIFAR-10 precision-recall curve.
(a) MNIST (b) CIFAR-10 (c) Intermediate samples
Figure 4.6: Uncurated samples generated from NCK-SVGD on MNIST and CIFAR-10 datasets.
Motivated by the MNIST experiment, we initialize NCK-SVGD using A-SGLD samples generated from the first 5 noise levels $\{\sigma_1,\dots,\sigma_5\}$, then continue running NCK-SVGD for the remaining 5 noise levels $\{\sigma_6,\dots,\sigma_{10}\}$. This amounts to using 50% of A-SGLD as initialization.
Compared with the gradient-based EGMs (e.g., EBM, Short-run MCMC, and A-SGLD), NCK-SVGD achieves a new SOTA FID score of 21.95, which is considerably better than the competing A-SGLD and even better than some class-conditional EGMs. The Inception score of 8.20 is also comparable to top existing methods such as SN-GAN (Miyato et al., 2018).
From the IPR curve in Figure 4.5, we see that NCK-SVGD improves over A-SGLD by a sizable margin, especially in the high-recall region. This again certifies the advantage of noise-conditional kernels and shows that the entropy regularization indeed encourages samples with higher recall.
4.5.2 Qualitative Analysis
We present the generated samples from NCK-SVGD in Figure 4.6. Our generated images have quality comparable to or higher than those of modern IGMs such as GANs and of gradient-based EGMs. To illustrate the procedure of NCK-SVGD, we provide intermediate samples in Figure 4.6c, where each row shows how samples evolve from the initialization to high-quality images. Similar to the study in Section 4.3, we compare NCK-SVGD with two SVGD baselines, namely vanilla SVGD (SVGD) and annealed SVGD with a fixed kernel (A-SVGD), on three image generation benchmarks. From Figures 4.7, 4.8, and 4.9, we observe that the proposed NCK-SVGD produces higher-quality samples than both baselines, SVGD and A-SVGD. We therefore omit the quantitative evaluation, as the performance difference can be clearly distinguished from the sample quality alone.
(a) SVGD (b) A-SVGD (c) NCK-SVGD Figure 4.7: SVGD baseline comparison on MNIST.
(a) SVGD (b) A-SVGD (c) NCK-SVGD Figure 4.8: SVGD baseline comparison on CIFAR-10.
(a) SVGD (b) A-SVGD (c) NCK-SVGD Figure 4.9: SVGD baseline comparison on CelebA.
(a) IPR of different β (b) Corresponding samples of different β
Figure 4.10: Varying the entropy regularizer $\beta$ on the Improved Precision and Recall (IPR) curve and the corresponding samples. For A, $\beta=0.01$ and $(P,R)=(0.99, 0.04)$. For B, $\beta=0.2$ and $(P,R)=(0.96, 0.24)$. For C, $\beta=0.8$ and $(P,R)=(0.90, 0.54)$. For D, $\beta=2.2$ and $(P,R)=(0.81, 0.83)$. The empirical observation on MNIST's IPR curve aligns well with our theoretical insights on the entropy regularization $\beta$.
4.5.3 The Effect of Entropy Regularization on Precision/Recall
The MNIST IPR curve in Figure 4.10 provides empirical evidence supporting the theoretical insights on the entropy regularizer $\beta$ of NCK-SVGD. We visualize four representative samples on the IPR curve, namely A, B, C, and D, corresponding to $\beta$ of 0.01, 0.2, 0.8, and 2.2, respectively. When $\beta$ is small (e.g., point A), the precision is high and the recall is small; the resulting samples show that many modes disappear. On the contrary, when $\beta$ is large (e.g., point D), the precision becomes lower but the recall greatly increases; the resulting samples have better coverage of the different digits.
4.6 Summary
In this chapter, we presented NCK-SVGD, a diverse and deterministic sampling procedure for high-dimensional inference in image generation. NCK-SVGD is competitive with advanced stochastic MCMC methods such as A-SGLD, reaching a lower FID score of 21.95. In addition, NCK-SVGD with entropic regularization offers flexible control between sample quality and diversity, which is quantitatively verified by the precision and recall curves.
4.7 Appendix
4.7.1 Proof of Proposition 17
We propose to learn a distribution $q$ such that, for $\beta\ge 1$,
$$\min_q F_\beta(q) = \mathrm{KL}(q, p) - (\beta-1)H(q).$$
The first term is the fidelity term and the second term controls the diversity. We have
$$F_\beta(q) = \int q\log(q/p) + (\beta-1)\int q\log(q).$$
Its first variation is given by:
$$D_q F(q) = \nabla_x\log\frac{q}{p} + (\beta-1)\nabla_x\log(q) = \nabla_x\log\frac{q^\beta}{p}.$$
Proposition 19. Consider the continuous Stein descent
$$dX_t = \phi^*_{p,q_t,\beta}(X_t)\,dt,$$
where
$$\phi^*_{p,q_t,\beta}(x) = \mathbb{E}_{x'\sim q_t}\big[s_p(x')K(x',x) + \beta\,\nabla_{x'}K(x',x)\big].$$
We have
$$\frac{dF_\beta(q_t)}{dt} = -\beta^2\,\mathbb{S}^2\big(q_t, p^{1/\beta}\big).$$
Hence $\phi^*_{p,q_t,\beta}$ is a descent direction for the entropy-regularized KL divergence. More importantly, the amount of decrease is the closeness, in the Stein sense, of $q_t$ to the smoothed distribution $p^{1/\beta}$. From this proposition we see that if $\beta$ is small, the entropic-regularized Stein descent will converge to $p^{1/\beta}$ and will converge only to the high-likelihood modes of the distribution, yielding less diversity. If $\beta$ is high, on the other hand, we will have more diversity, but we will target a smoothed distribution $p^{1/\beta}$. Hence $\beta$ controls the diversity of the sampling and the precision/recall of the samples.
Proof. Let $q_t$ be the density of $X_t$. We have:
$$\frac{\partial q_t}{\partial t} = -\mathrm{div}\big(q_t\,\phi^*_{p,q_t,\beta}\big).$$
It follows that
$$\begin{aligned}
\frac{dF_\beta(q_t)}{dt} &= \int \big\langle \nabla_x D_q F(q_t),\, \phi^*_{p,q_t,\beta}\big\rangle\, q_t \\
&= \int \big\langle \nabla_x\log(q_t^\beta/p),\, \phi^*_{p,q_t,\beta}\big\rangle\, q_t \\
&= \int \big\langle \phi^*_{p,q_t,\beta},\, \beta\nabla_x\log(q_t) - s_p(x)\big\rangle\, q_t \\
&= -\int \big\langle s_p,\, \phi^*_{p,q_t,\beta}\big\rangle\, q_t + \beta\int \big\langle \nabla_x q_t,\, \phi^*_{p,q_t,\beta}\big\rangle \\
&= -\int \big\langle s_p,\, \phi^*_{p,q_t,\beta}\big\rangle\, q_t - \beta\int \mathrm{div}\big(\phi^*_{p,q_t,\beta}\big)\, q_t \quad \text{(using the divergence theorem)} \\
&= -\beta^2\Big(\int \big\langle \tfrac{1}{\beta}s_p,\, \tfrac{1}{\beta}\phi^*_{p,q_t,\beta}\big\rangle\, q_t + \int \mathrm{div}\big(\tfrac{1}{\beta}\phi^*_{p,q_t,\beta}\big)\, q_t\Big) \\
&= -\beta^2\,\mathbb{S}^2\big(q_t, p^{1/\beta}\big) \le 0.
\end{aligned}$$
4.7.2 Toy Experiment Setup in Section 4.3
For the results in Figure 4.1, we mainly follow the setting of Song and Ermon(2019) with
pd(x) = 0.2N ((−5, −5),I) + 0.8N ((5, 5),I).
We generate 1024 samples for each subfigure of Figure 4.1. The score function can be derived analytically from $p_d$. The initial samples are chosen uniformly in the $8\times 8$ square. For SGLD and SVGD, we use $T=1000$. For A-SGLD, A-SVGD, and NCK-SVGD, we use $T=100$, $L=10$, $\sigma_1=20.0$, $\sigma_{10}=1.0$. The learning rate is chosen from $\{0.1, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0\}$. When evaluating the Maximum Mean Discrepancy $M_k(P_d, Q)$ between the real data samples from $p_d$ and the generated samples $Q$ from the different sampling methods, we use the RBF kernel with the bandwidth set by the median heuristic. The experiment was run on one Nvidia 2080Ti GPU.
Part III
Large-scale Kernel Approximations
Chapter 5
Data-driven Random Fourier Features
5.1 Background
Kernel methods offer a comprehensive collection of non-parametric techniques for modeling a broad range of problems in machine learning. However, standard kernel machines such as kernel SVM or kernel ridge regression suffer from slow training and prediction, which limits their use in real-world large-scale applications. Consider, for example, training linear regression over $n$ labeled data points $\{(x_i, y_i)\}_{i=1}^n$, where $x_i\in\mathbb{R}^d$ with $n\gg d$ and $y_i$ is a binary label. The time complexity of a standard least squares fit is $O(nd^2)$ and the memory demand is $O(nd)$. Its kernel-based counterpart requires solving a linear system with an $n$-by-$n$ kernel matrix, which typically takes $O(n^3 + n^2 d)$ time and $O(n^2)$ memory. Such complexity is far from desirable in big-data applications. Recently, significant effort has been devoted to low-rank kernel approximation for enhancing the scalability of kernel methods. Among existing approaches, random features (Rahimi and Recht, 2008, 2009) have drawn considerable attention and yielded state-of-the-art results in classification accuracy on benchmark data sets (Huang et al., 2014; Dai et al., 2014; Li and Póczos, 2016). Specifically, inspired by Bochner's theorem (Rudin, 1962), random Fourier features have been studied for evaluating the expectation of shift-invariant kernels (i.e., $k(x, x') = g(x - x')$ for some function $g$). Rahimi and Recht (2008) estimate the expectation by Monte Carlo (MC) methods; Yang et al. (2014) leverage the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce integration error. Both the MC and QMC methods rely on the (implicit) assumption that all $M$ random features are equally important, and hence assign a uniform weight $\frac{1}{M}$ to the features in the kernel estimate. Such a treatment, however, is arguably sub-optimal for minimizing the expected risk of kernel approximation. Avron et al. (2016) present a weighting strategy that minimizes a loose error bound depending on the maximum norm of data points. On the other hand, Bayesian Quadrature (BQ) is a powerful framework that solves the integral approximation by $\mathbb{E}[f(x)]\approx\sum_m \beta_m f(x_m)$, where $f(x_m)$ is the feature function and the $\beta_m$ are non-uniform weights. BQ leverages a Gaussian Process (GP) to model the prior over the function $f(\cdot)$, and the resulting estimator is Bayes optimal (Huszár and Duvenaud, 2012). Thus, BQ could be considered a natural choice for kernel approximation. Nonetheless, a well-known limitation (or potential weakness) is that the performance of Bayesian models relies on extensive hyper-parameter tuning, e.g., on the condition of the covariance matrix in the Gaussian prior, which is not suitable for large-scale applications.
To address the fundamental limitations of the aforementioned work, we propose a novel approach to data-driven feature weighting in the approximation of shift-invariant kernels, motivated by the Stein effect in the statistical literature, and solve it using an efficient stochastic algorithm with a convex optimization objective. We also present a natural extension of BQ to kernel approximation. The adapted BQ, together with the standard MC and QMC methods, serves as a representative baseline in our empirical evaluations on six benchmark data sets. The empirical results show that the proposed Stein-Effect Shrinkage (SES) estimator consistently outperforms the baseline methods in terms of lowering the kernel approximation error, and is competitive with or better than the best baseline in most task-oriented evaluations (on supervised regression and classification tasks).
5.1.1 Random Fourier Features
At the heart of kernel methods lies the idea that inner products in high-dimensional feature spaces can be computed in an implicit form via a kernel function $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$, defined on an input data domain $\mathcal{X}\subset\mathbb{R}^d$, such that $k(x, x') = \langle\phi(x), \phi(x')\rangle_{\mathcal{H}}$, where $\phi: \mathcal{X}\to\mathcal{H}$ is a feature map that associates the kernel $k$ with an embedding of the input space into a high-dimensional Hilbert space $\mathcal{H}$. A key to kernel methods is that as long as kernel algorithms have access to $k$, one need not represent $\phi(x)$ explicitly. Most often, this means that although $\phi(x)$ can be high-dimensional or even infinite-dimensional, the inner products can be evaluated by $k$. This idea is known as the kernel trick. Rahimi and Recht (2008) propose a low-dimensional feature map $z(x): \mathcal{X}\to\mathbb{R}^{2M}$ for the kernel function $k$,
$$k(x, x') \approx \langle z(x), z(x')\rangle, \quad (5.1)$$
under the assumption that $k$ falls in the family of shift-invariant kernels. The starting point is a celebrated result that characterizes the class of positive definite functions:

Theorem 20 (Bochner's theorem, Rudin (1962)). A continuous, real-valued, symmetric and shift-invariant function $k$ on $\mathbb{R}^d$ is a positive definite kernel if and only if there is a positive finite measure $\mathbb{P}(w)$ such that
$$k(x - x') = \int_{\mathbb{R}^d} 2\big[\cos(w^\top x)\cos(w^\top x') + \sin(w^\top x)\sin(w^\top x')\big]\, d\mathbb{P}(w). \quad (5.2)$$
The most well-known kernel in the shift-invariant family is the Gaussian kernel $k(x - x') = \exp\big(-\frac{\|x - x'\|^2}{2\sigma^2}\big)$; the associated density $\mathbb{P}(w)$ is again Gaussian, $\mathcal{N}(0, \sigma^{-2} I_d)$.
Monte Carlo Estimation Rahimi and Recht (2008) approximate the integral representation of the kernel (5.2) by the Monte Carlo method as follows:
$$k(x - x') \approx \frac{1}{M}\sum_{m=1}^{M} h_{x,x'}(w_m) = \langle z(x), z(x')\rangle, \quad (5.3)$$
66 where
$$h_{x,x'}(w) := 2\big[\cos(w^\top x)\cos(w^\top x') + \sin(w^\top x)\sin(w^\top x')\big],$$
$$z(x) := \frac{1}{\sqrt{M}}\big[\phi_{w_1}(x), \dots, \phi_{w_M}(x)\big], \quad (5.4)$$
$$\phi_w(x) := \sqrt{2}\,\big[\cos(w^\top x), \sin(w^\top x)\big],$$
and $w_1, \dots, w_M$ are i.i.d. samples from $\mathbb{P}(w)$. After obtaining the randomized feature map $z$, the training data can be transformed into $\{(z(x_i), y_i)\}_{i=1}^n$. As long as $M$ is sufficiently smaller than $n$, this leads to more scalable solutions; e.g., for regression we get back to $O(nM^2)$ training and $O(Md)$ prediction time, with $O(nM)$ memory requirements. We can also apply random features to unsupervised learning problems, such as kernel clustering (Chitta et al., 2012).
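A small NumPy sketch of the MC random Fourier feature map for the Gaussian kernel is given below. Note that it uses the standard normalization in which the cos/sin pair averages directly to the kernel (rather than the factor-2 convention of Eqs. (5.2)-(5.4)), so the inner product can be checked against the exact kernel.

```python
import numpy as np

def rff_features(X, M, sigma, rng):
    """Feature map z(x) with w_m ~ N(0, sigma^{-2} I_d), so that
    <z(x), z(x')> approximates exp(-||x - x'||^2 / (2 sigma^2))."""
    W = rng.normal(scale=1.0 / sigma, size=(M, X.shape[1]))   # frequencies w_1..w_M
    proj = X @ W.T                                            # (n, M)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(M)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff_features(X, M=5000, sigma=1.0, rng=rng)
K_exact = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())                        # small Monte Carlo error
```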
Quasi-Monte Carlo Estimation Instead of using plain MC approximation, Yang et al. (2014) propose to use the low-discrepancy properties of Quasi-Monte Carlo (QMC) sequences to reduce the integration error in approximations of the form (5.3). Due to space limitations, we restrict our discussion to the background necessary for understanding subsequent sections, and refer interested readers to the comprehensive reviews (Caflisch, 1998; Dick et al., 2013) for a more detailed exposition. The QMC method is generally applicable to integrals over a unit cube, so the procedure is to first generate a low-discrepancy sequence $t_1, \dots, t_M\in[0,1]^d$ and transform it into a sequence $w_1, \dots, w_M$ in $\mathbb{R}^d$. To convert the integral representation of the kernel (5.2) to an integral over the unit cube, a simple change of variables suffices. For $t\in[0,1]^d$, define $\Phi^{-1}(t) = [\Phi_1^{-1}(t_1), \dots, \Phi_d^{-1}(t_d)]\in\mathbb{R}^d$, where we assume that the density function in (5.2) can be written as $p(w) = \prod_{j=1}^d p_j(w_j)$ and $\Phi_j$ is the cumulative distribution function (CDF) of $p_j$, $j=1,\dots,d$. By setting $w = \Phi^{-1}(t)$, the integral (5.2) is equivalent to
$$k(x - x') = \int_{\mathbb{R}^d} h_{x,x'}(w)\, p(w)\, dw = \int_{[0,1]^d} h_{x,x'}\big(\Phi^{-1}(t)\big)\, dt.$$
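As a sketch of this construction (assuming SciPy's `scipy.stats.qmc` module is available), the uniform MC frequencies above can be replaced by a scrambled Sobol sequence pushed through the coordinate-wise Gaussian inverse CDF:

```python
import numpy as np
from scipy.stats import norm, qmc

def qmc_frequencies(M, d, sigma, seed=0):
    """Frequencies w = Phi^{-1}(t) from a scrambled Sobol sequence t in [0, 1]^d,
    where Phi is the CDF of N(0, sigma^{-2}) applied coordinate-wise.
    (Sobol sequences work best when M is a power of two.)"""
    t = qmc.Sobol(d=d, scramble=True, seed=seed).random(M)
    t = np.clip(t, 1e-12, 1 - 1e-12)            # avoid +-inf at the cube boundary
    return norm.ppf(t, scale=1.0 / sigma)        # (M, d)
```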
5.1.2 Bayesian Quadrature for Random Features
Random features constructed by either the MC (Rahimi and Recht, 2008) or QMC (Yang et al., 2014) approach employ equal weights $1/M$ on the integrand functions, as shown in (5.3). An important question is how to construct different weights on the integrand functions to obtain a better kernel approximation. Given a fixed QMC sequence, Avron et al. (2016) solve for the weights by optimizing a derived error bound based on the observed data. There are two potential weaknesses of Avron et al. (2016). First, it does not fully utilize the data information, because the optimization objective depends only on an upper bound of $\|x_i\|$ rather than on the distribution of $x_i$. Second, the error bound used in Avron et al. (2016) is potentially loose: for technical reasons, they relax $h_{x,x'}(w)$ to $h_{x,x'}(w)\,\mathrm{sinc}(Tw)$ when deriving the error bound to optimize, and it is not studied whether this relaxation yields a proper upper bound. On the other hand, selecting weights in such a way is closely connected to Bayesian Quadrature (BQ), originally proposed by Rasmussen and Ghahramani (2003) and theoretically guaranteed under
Bayes assumptions. BQ is a Bayesian approach that puts prior knowledge on functions to solve the integral estimation problem. We first introduce the work of Rasmussen and Ghahramani (2003), and then discuss how to apply BQ to kernel approximation. Standard BQ considers a single integration problem of the form
$$\bar{f} = \int f(w)\, d\mathbb{P}(w). \quad (5.5)$$
To utilize the information of f, BQ puts a Gaussian Process (GP) prior on f with mean 0 and covariance matrix KGP where
$$K_{GP}(w, w') = \mathrm{Cov}\big(f(w), f(w')\big) = \exp\Big(-\frac{\|w - w'\|_2^2}{2\sigma_{GP}^2}\Big).$$
After conditioning on samples $w_1, \cdots, w_M$ from $\mathbb{P}$, we obtain a closed-form GP posterior over $f$ and use the posterior to obtain the optimal Bayes estimator¹ $\hat{f}$ as
$$\hat{f} = \mathbb{E}_{GP}\Big[\int f(w)\, d\mathbb{P}(w)\Big] = \gamma^\top K_{GP}^{-1} f(W) = \sum_{m=1}^{M}\beta_{BQ}^{(m)} f(w_m),$$
where $\gamma_m = \int K_{GP}(w, w_m)\, d\mathbb{P}(w)$, $\beta_{BQ}^{(m)} = (K_{GP}^{-1}\gamma)_m$, and $f(W) = [f(w_1), \cdots, f(w_M)]$. Clearly, the resulting estimator is simply a weighted combination of the empirical observations $f(w_m)$, with weights $\beta_{BQ}^{(m)}$ derived from the Bayes perspective. In kernel approximation, each kernel evaluation can be treated as a single integration problem, so BQ can be applied directly. Here we discuss the technical issues of applying BQ to kernel approximation. First, we need to evaluate $\gamma$. For the Gaussian kernel, we derive the closed-form expression
$$\gamma_m = \int K_{GP}(w, w_m)\, d\mathbb{P}(w) = \exp\Big(-\frac{\|w_m - 0\|_2^2}{2(\sigma_{GP}^2 + \sigma^{-2})}\Big),$$
where $\gamma_m$ is evaluated at the sampled frequency $w_m$.
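Putting the pieces together, a sketch of the BQ weights for a Gaussian GP prior and Gaussian $\mathbb{P}(w)$ might look as follows; $\sigma_{GP}$ and the jitter term are tuning choices, and the closed form for $\gamma_m$ is the (unnormalized) expression stated above.

```python
import numpy as np

def bq_weights(W, sigma, sigma_gp, jitter=1e-8):
    """beta_BQ = K_GP^{-1} gamma for frequencies W (M, d), with GP covariance
    exp(-||w - w'||^2 / (2 sigma_gp^2)) and P(w) = N(0, sigma^{-2} I_d)."""
    sq = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=-1)
    K_gp = np.exp(-sq / (2 * sigma_gp ** 2))
    gamma = np.exp(-np.sum(W ** 2, axis=1) / (2 * (sigma_gp ** 2 + sigma ** -2)))
    return np.linalg.solve(K_gp + jitter * np.eye(len(W)), gamma)
```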
Data-driven by Kernel Hyper-parameters The most important issue in applying BQ to kernel approximation is that it models the information of $h_{x,x'}$ through the kernel $K_{GP}$. In practice, we usually use a Gaussian kernel for $K_{GP}$, so the feature information of $(x, x')$ is modeled through the kernel bandwidth. As suggested by Rasmussen and Williams (2006), one can tune this hyper-parameter either by maximizing the marginal likelihood or by cross-validation on the metric of interest. Note that the existing BQ approximates the integral representation of a kernel function value evaluated at a single pair of data $(x, x')$. However, it is infeasible to tune an individual $\sigma_{GP}$ for $n^2$ pairs of data. Therefore, we tune a global hyper-parameter $\sigma_{GP}$ on a subset of data pairs.
1For the square loss.
5.2 Stein Effect Shrinkage (SES) Estimator
In practice, it is well known that the performance of Gaussian Process based models heavily depends on the hyper-parameter $\sigma_{GP}$, i.e., the bandwidth of the kernel function $k_{GP}$. Therefore, instead of using the Bayesian route to obtain the optimal Bayes estimator from BQ, we start from the risk minimization perspective and use the Stein effect to derive a shrinkage estimator for random features. The standard estimator of Monte Carlo methods is the unbiased empirical average using equal weights $1/M$ for $M$ samples, which sum to 1. However, the Stein effect (or James–Stein effect) suggests that a family of biased estimators can attain lower risk than the standard empirical average. Here we extend the analysis of the non-parametric Stein effect to kernel approximation with random features by considering the risk function $R(k, k') = \mathbb{E}_{x,x',w}\big(k(x - x') - k'(x - x'; w)\big)^2$.

Theorem 21 (Stein effect on kernel approximation). Let $\hat{k}(x - x', w) = \frac{1}{M}\sum_{m=1}^M h_{x,x'}(w_m)$. For any estimator $\mu$ that is independent of $x$ and $w$, there exists $0\le\alpha<1$ such that $\tilde{k}(\cdot) = \alpha\mu + (1-\alpha)\hat{k}(\cdot)$ is an estimator with lower risk $R(k, \tilde{k}) < R(k, \hat{k})$.
Proof. The proof closely follows Muandet et al.(2014) for a different problem under the non-parametric setting 2. By bias-variance decomposition, we have
$$R(k, \hat{k}) = \mathbb{E}_{x,x'}\,\mathrm{Var}_w\big(\hat{k}(x - x', w)\big),$$
$$R(k, \tilde{k}) = (1-\alpha)^2\,\mathbb{E}_{x,x'}\,\mathrm{Var}_w\big(\hat{k}(x - x', w)\big) + \alpha^2\,\mathbb{E}_{x,x'}\big(\mu - k(x - x')\big)^2.$$
By elementary calculation, we have $R(k, \tilde{k}) < R(k, \hat{k})$ when
$$0 \le \alpha \le \frac{2\,\mathbb{E}_{x,x'}\mathrm{Var}_w\big(\hat{k}(x-x', w)\big)}{\mathbb{E}_{x,x'}\mathrm{Var}_w\big(\hat{k}(x-x', w)\big) + \mathbb{E}_{x,x'}\big(\mu - k(x-x')\big)^2},$$
and the optimal value is attained at
$$\alpha^* = \frac{\mathbb{E}_{x,x'}\mathrm{Var}_w\big(\hat{k}(x-x', w)\big)}{\mathbb{E}_{x,x'}\mathrm{Var}_w\big(\hat{k}(x-x', w)\big) + \mathbb{E}_{x,x'}\big(\mu - k(x-x')\big)^2} < 1.$$
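A toy simulation of this result, under the stated assumptions and with the oracle quantities estimated from repeated draws, shows that shrinking the Monte Carlo kernel estimate toward $\mu=0$ by the factor $1-\alpha^*$ indeed lowers the squared risk:

```python
import numpy as np

rng = np.random.default_rng(0)
delta = np.array([0.8, -0.5, 0.3])                     # a fixed x - x'
k_true = np.exp(-np.sum(delta ** 2) / 2.0)             # Gaussian kernel value, sigma = 1
M, trials = 32, 5000

W = rng.normal(size=(trials, M, delta.size))           # w ~ N(0, I), repeated draws
k_hat = np.cos(W @ delta).mean(axis=1)                 # unbiased MC estimates \hat{k}

alpha_star = k_hat.var() / (k_hat.var() + k_true ** 2) # optimal shrinkage toward mu = 0
risk_plain = np.mean((k_hat - k_true) ** 2)
risk_shrunk = np.mean(((1 - alpha_star) * k_hat - k_true) ** 2)
print(risk_plain, risk_shrunk)                         # the shrunk estimator has lower risk
```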
Corollary 22 (Shrinkage estimator). If $k(x - x') > 0$, there is a shrinkage estimator $\tilde{k}(\cdot) = (1-\alpha^*)\hat{k}(\cdot)$, obtained by setting $\mu = 0$, that has lower risk than the empirical mean estimator $\hat{k}(\cdot)$.

The standard shrinkage estimator derived from the Stein effect uses the equal weights $\frac{1-\alpha}{M}$ to minimize the variance; however, non-uniform weights usually yield better performance in practice (Huszár and Duvenaud, 2012). Therefore, based on the analysis of Theorem 21, we propose the estimator $\tilde{k}(x - x', w) = \sum_{m=1}^M \beta_m h_{x,x'}(w_m)$, obtained by minimizing the empirical risk with the constraint $\sum_{m=1}^M \beta_m^2 \le c$ to shrink the weights, where $c$ is a small constant. With a Lagrange multiplier, we end up with the following risk minimization formulation, which is data-driven in that it directly takes $k(x, x')$ into account:
$$\min_{\beta\in\mathbb{R}^M}\ \sum_{(x,x')\sim\mathcal{D}}\Big[k(x, x') - [z(x)\circ z(x')]^\top\beta\Big]^2 + \lambda_\beta\|\beta\|^2, \quad (5.6)$$
2The original Stein effect has a Gaussian distribution assumption.
2 T min kt − Zβk + λββ β, (5.7) β∈RM
n2 n2×M where t ∈ R ,Z ∈ R , such that for all i, j ∈ S = {1, . . . , n}, there exist a corresponding pair T mapping function ξ : ξS(i, j) → k satisfying tk := φ(xi) φ(xj) and Z(k, :) := z(xi) ◦ z(xj). We approximately solve optimization problem (5.7) via the matrix sketching techniques (Avron et al., 2013, 2017a; Woodruff et al., 2014):
2 T min kSt − SZβk + λββ β, (5.8) β∈RM
r×n2 where S ∈ R is the randomized sketching matrix. Solving the sketched LSR problem (5.8) now costs 2 only O(rd + Ts) where Ts is the time cost of sketching. From the optimization perspective, suppose β∗ be the optimal solution of LSR problem (5.7), and β˜ be the solution of sketched LSR problem (5.8). Wang et al.(2017) prove that if r = O(M/ + poly(M)), the the objective function value of (5.8) at β˜ is ∗ at most worse than the value of (5.7) at β . In practice, we consider an efficient Ts = O(r) sampling- based sketching matrix where S is a diagonal matrix with Si,i = 1/pi if row i was include in the sample, and Si,i = 0 otherwise. That being said, we form a sub-matrix of Z by including each row of Z in the sub-matrix independently with probability pi. Woodruff et al.(2014) suggests to set pi proportional to kZik, the norm of ith row of Z, to have a meaningful error bound.
5.3 Experiments and Results
In this section, we empirically demonstrate the benefits of our proposed method on six data sets listed in Table 5.1. The features in all the six data sets are scaled to zero mean and unit variance in a preprocessing step. We consider two different tasks: i) the quality of kernel approximation and ii) the applicability in supervised learning. Both the relative kernel approximation errors and the regression/classification errors are reported on the test sets.
Comparing Methods and Hyper-parameters • MC: Monte Carlo approximation with uniform weight. • QMC: Quasi-Monte Carlo approximation with uniform weight. We adapt the implementation of Yang et al.(2014) on Github. 3 • BQ: QMC sequence with Bayesian Quadrature weights. We modify the source code from Huszár and Duvenaud(2012). 4 Gaussian RBF kernel is used through all the experiments where the kernel bandwidth σ is tuned on the set {2−10, 2−8,..., 28, 210} that is in favor of Monte Carlo methods. For QMC method, we use scrambling and shifting techniques recommended in the QMC literature (See Dick et al.(2013) for
3https://github.com/chocjy/QMC-features 4http://www.cs.toronto.edu/ duvenaud/
70 0 more details). In BQ framework, we consider the GP prior with the covariance matrix kGP (w, w ) = 0 2 kw−w k2 −8 −6 6 8 exp(− 2 ) where σGP is tuned on the set {2 , 2 ,..., 2 , 2 } over a subset of sampled pair data 2σGP 0 (x, x ). Likewise, regularization coefficient λβ of in the proposed objective function (5.6) is tuned on the set {2−8, 2−6,..., 26, 28} over the same subset of sampled pair data. For regression task, we solve ridge regression problems where the regularization coefficient λ is tuned by five folds cross-validation. For classification task, we solve L2-loss SVM where the regularization coefficient C is tuned by five folds cross-validation. All experiments code are written in MATLAB and run on a Intel(R) Xeon(R) CPU 2.40GHz Linux server with 32 cores and 190GB memory.
(a) cpu (b) census (c) years
(d) adult (e) mnist (f) covtype
Figure 5.1: Kernel approximation results. x-axis is number of samples used to approximate the integral and y-axis shows the relative kernel approximation error.
Quality of Kernel Approximation In our setting, the most natural metric for comparison is the quality of approximation of the kernel matrix. We use the relative kernel approximation error kK − K˜ kF /kKkF to measure the quality, as shown in Figure 5.1. SES outperforms the other methods on cpu, census, adult, mnist, covtype, and is competitive with the best baseline methods over the years data sets. Notice that SES is particularly strong when the number of random features is relatively small. This phenomenon confirms our theoretical analysis, that is, the smaller the random features are, the larger the variance would be in general. Our shrinkage estimator with Stein effect enjoys better performance than the empirical mean estimators (MC and QMC) using uniformly weighted features without shrinkage. Although BQ does use weighted random features, its performance is still not as good as SES because tuning the covariance matrix in the Gaussian Process is rather difficult. Moreover, it is even harder to tune a global bandwidth for all integrand functions that approximate the kernel function.
71 Dataset Task n d M MC QMC BQ SES CPU regression 6554 21 512 3.35% 3.29% 3.29% 3.27% Census regression 18,186 119 256 8.46% 6.60% 6.44% 6.49% Years regression 463,715 90 128 0.474% 0.473% 0.473% 0.473% Adult classification 32,561 123 64 16.21% 15.57% 15.87% 15.45% MNIST classification 60,000 778 256 7.33% 7.72% 7.96% 7.71% Covtype classification 581,012 54 128 23.36% 22.88% 23.13% 22.67%
Table 5.1: Data Statistics and Supervised learning errors. n is number of instances, d is dimension of instances, M is number of random features. For regression, we report the relative regression error ky −
y˜k2/kyk2. For classification task, we report the classification error.
Supervised Learning We investigate if better kernel approximations reflect better predictions in the supervised learning tasks. The results are shown in Table 5.1. We can see that among the same data set (cpu, census, adult, and covtype) our method performs very well in terms of kernel approximation, and it also yields lower errors in most of the cases of the supervised learning tasks. Note that we only present partial experiment results on some selected number of random features. However, we also observe that the generalization error of our proposed method may be higher than that of other state-of-the-art methods at some other number of random features that we did not present here. This situation is also observed in other works (Avron et al., 2016): better kernel approximation does not always result in better supervised learning performance. Under certain conditions, less number of features (worse kernel approximation) can play a similar role of regularization to avoid overfitting (Rudi and Rosasco, 2017). Rudi and Rosasco (2017) suggest to solve this issue by tuning number of random features.
(a) M = 32 (b) M = 128 (c) M = 512 (d) sum of weights
Figure 5.2: A comparison of distribution of weights on adult data set. M is the number of random features.
(a) M = 64, census (b) M = 256, census (c) M = 64, covtype (d) M = 256, covtype
Figure 5.3: A comparison of different sampled training-data sizes in solving objective function (5.8).
Distributions of Feature Weights We are interested in the distributions of the random feature weights derived from BQ and SES. For different numbers of random features ($M = 32, 128, 512$), we plot the corresponding histograms and compare them with the uniform weight $1/M$ used in MC and QMC, as shown in Figure 5.2. When the number of random features is small, the difference between BQ and SES is large. As the number of random features increases, the weights of each method gradually converge to a stable distribution and get increasingly closer to the uniform-weight spike. This empirical observation agrees with our theoretical analysis, namely with the role of the weight estimator in the bias-variance trade-off.
Behavior of the Randomized Optimization Solver In theory, our randomized solver only needs to sample $O(M)$ data points for the proposed objective function (5.8) to obtain a sufficiently accurate solution. We verify this assertion by varying the sampled data size $r$ in the sketching matrix $S$ and examining how it affects the relative kernel approximation error. The results are shown in Figure 5.3. For the census data set, we observe that once the number of sampled data $r$ exceeds the number of random features $M$, the kernel approximation of our proposed method outperforms the state-of-the-art QMC. Furthermore, the number of sampled data required by our randomized solver remains of the same order as $M$ even for the much larger covtype data set with 0.5 million data points. On the other hand, BQ does not behave consistently with respect to the sampled data size, which is used for tuning the hyper-parameter $\sigma_{GP}$. This again illustrates that BQ is highly sensitive to the covariance matrix and that it is difficult to tune $\sigma_{GP}$ on only a small subset of sampled data. In practice, we find that sampling about 2-4 times the number of random features $M$ usually suffices to obtain a relatively stable and good minimizer $\beta^*$.
Uniform vs. Non-uniform Shrinkage Weights We conducted experiments with SES using a uniform shrinkage weight versus non-uniform shrinkage weights on the cpu and census data sets. The results in Figure 5.4 confirm our conjecture that non-uniform shrinkage weights are more beneficial, although both enjoy the lower empirical risk afforded by the Stein effect. We observed similar behavior on other data sets and omit those results.
(a) cpu data set (b) census data set
Figure 5.4: Comparison of uniform and non-uniform shrinkage weight.
5.4 Discussion and Conclusion
Nyström methods (Williams and Seeger, 2001; Drineas and Mahoney, 2005) are yet another line of research on kernel approximation. Specifically, an eigendecomposition of the kernel matrix is approximated via low-rank factorization with various sampling techniques (Wang and Zhang, 2013). We refer interested readers to the comprehensive study of Yang et al. (2012) for more detailed comparisons; this chapter focuses only on random Fourier features. Finally, we note that Sinha and Duchi (2016) propose a weighting scheme for random features that matches the label correlation matrix, with the goal of optimizing the supervised objective (e.g., classification error). In contrast, our work aims to optimize the kernel approximation itself, which does not require label information to solve for the weights and can be applied to different tasks once the weights are solved. To conclude, we propose a novel shrinkage estimator that enjoys the benefits of the Stein effect in lowering the empirical risk compared to empirical mean estimators. Our estimator induces non-uniform weights on the random features with desirable data-driven properties. We further provide an efficient randomized solver to optimize this estimation problem for real-world large-scale applications. Extensive experimental results demonstrate the effectiveness of the proposed method.
Chapter 6
Kernel Contextual Bandit at Scale
6.1 Background
The contextual bandit problem is an important challenge in many real-world machine learning applications such as online advertisement optimization (Bottou et al., 2013; Li et al., 2010). In the advertisement use-case, the learning algorithm sequentially observes user features, places an advertisement, and receives a reward in terms of click/no-click. This is encompassed by the general contextual bandit setting (Auer, 2002; Langford and Zhang, 2008), which is defined as a sequential game of $T$ trials, such that in each round $t$, the learner receives a context (e.g., user features), selects an action (e.g., an ad placement), and receives noisy feedback of the reward (e.g., click/no-click). In this setting, the objective is to maximize the expected cumulative reward, which involves a careful trade-off between the exploration of all possibilities and the exploitation of previously known good decisions. As discussed in Foster et al. (2018), at a high level there are two families of contextual bandit algorithms: agnostic-based (a.k.a. the expert setting) and realizability-based methods. Agnostic-based methods (Langford and Zhang, 2008; Agarwal et al., 2014; Sen et al., 2018) aim to find the best expert/policy in a class of experts $\Pi = \{\pi_1, \dots, \pi_M\}$. The regret analysis of agnostic-based methods is generically applicable across policy classes (under mild assumptions). However, the policy optimization relies on training cost-sensitive classification oracles, which may be expensive. Realizability-based methods use a model-based approach (the model could be either parametric or non-parametric; however, tractable algorithms are available only for a few cases). Examples include LinUCB and related kernel algorithms (Li et al., 2010; Chu et al., 2011; Srinivas et al., 2010; Chowdhury and Gopalan, 2017) as well as Thompson Sampling based algorithms (Agrawal and Goyal, 2013; Chowdhury and Gopalan, 2017); we refer to Foster et al. (2018) for additional discussion. Recently, Krishnamurthy et al. (2016) and Foster et al. (2018) show that the realizability-based LinUCB algorithm is competitive in practice with representative agnostic-based baselines that use linear policy functions.

There have been several notable efforts in extending the linear realizable reward function in LinUCB to a more expressive function class (e.g., a Gaussian Process (Srinivas et al., 2010; Krause and Ong, 2011), or functions with small RKHS norm (Valko et al., 2013)). While RKHS functions indeed have high expressivity, bandit algorithms using kernels face computational challenges. For instance, with $T$ being the time horizon, the CGP-UCB algorithm (Krause and Ong, 2011) needs $O(T^3)$ computation owing to the inverse of the kernel Gram matrix. Even the incremental version KernelUCB (Valko et al., 2013) demands $O(T^2)$ computation, leading to intractability with large data sets (e.g., when $T$ is in the tens of thousands). Random Fourier Features (RFF), proposed by Rahimi and Recht (2008, 2009), is now a mainstream technique for speeding up kernel methods. The basic idea is to replace evaluations of (shift-invariant) kernels by an approximation based on a finite sum of $s$ random Fourier features. In the context of Kernel Ridge Regression (KRR), the corresponding approximate regression problem can be solved in $O(Ts^2)$ time and $O(Ts)$ space, where $T$ is the number of samples for KRR (Avron et al., 2017b). This can lead to a considerable speed-up in the regime where $s$ is much smaller than $T$.
6.2 Kernel Upper-Confidence-Bound Framework
We consider the problem of stochastic contextual bandits under the predictable-rewards setting. Contexts are drawn from $\mathcal{X}\subseteq\mathbb{R}^d$ and actions from a finite set $\mathcal{A} := \{1,\dots,N\}$, for some fixed integer $N$. A contextual bandit algorithm (Li et al., 2010) $\mathcal{M}$ participates in a sequential game of $T$ trials, where at each time step $t\in\{1,\dots,T\}$, the bandit algorithm $\mathcal{M}$ first observes contextual features $x_{a,t}\in\mathcal{X}$ for each arm $a\in\mathcal{A}$. Next, based on the observations in the previous trials, $\mathcal{M}$ selects an arm $a_t\in\mathcal{A}$ and receives the reward $y_{a_t,t} = f(x_{a_t,t}) + \epsilon_t$. Here, $f(\cdot)$ is an unknown reward function and $\epsilon_t$ is a zero-mean noise random variable. Finally, the bandit $\mathcal{M}$ improves its strategy for selecting the next arm by augmenting its previous information with the new observation $(x_{a_t,t}, a_t, y_{a_t,t})$. It should be noted that feedback is not obtained for arms $a\ne a_t$ (i.e., the unplayed arms). For simplicity, we define $x_t := x_{a_t,t}$ and $y_t := y_{a_t,t}$ to be the context and reward at time $t$, when there is no ambiguity.
Cumulative Regret The goal of the learner $\mathcal{M}$ is to maximize the cumulative reward over $T$ trials, or equivalently, to minimize the cumulative regret, defined as $R(T) = \sum_{t=1}^{T} f(x_{a_t^*,t}) - f(x_t)$, where $a_t^* = \arg\max_{a\in\mathcal{A}} f(x_{a,t})$. Cumulative regret is a metric widely used in the literature (Agarwal et al., 2012; Krause and Ong, 2011; Valko et al., 2013).
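To make the protocol concrete, a minimal simulation of the sequential game and its cumulative regret, under a hypothetical `policy` object with `select`/`update` methods and a synthetic reward function `f`, might look as follows.

```python
import numpy as np

def run_contextual_bandit(policy, f, T=1000, N=10, d=5, noise=0.1,
                          rng=np.random.default_rng(0)):
    """Play T rounds: observe contexts for all arms, pick one, see only its reward."""
    regret = 0.0
    for t in range(T):
        X_arms = rng.normal(size=(N, d))              # contexts x_{a,t} for every arm
        a_t = policy.select(X_arms)                    # arm chosen by the bandit algorithm
        rewards = f(X_arms)                            # true mean rewards (unknown to policy)
        y_t = rewards[a_t] + noise * rng.normal()      # noisy feedback for the played arm only
        policy.update(X_arms[a_t], y_t)
        regret += rewards.max() - rewards[a_t]         # accumulate R(T)
    return regret
```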
Assumptions and Notations The main assumption of this chapter is that the reward function $f$ lies in $\mathcal{H}_k$, the Reproducing Kernel Hilbert Space (RKHS) associated with a positive semi-definite kernel $k(x, y)\in\mathbb{R}$, where $x, y\in\mathcal{X}$. We briefly review the notation and assumptions about RKHS used here, and refer interested readers to Wahba (1990) and Scholkopf and Smola (2001) for more details.
Let $\|f\|_{\mathcal{H}_k}$ denote the RKHS norm of a function $f$ w.r.t. the kernel $k(\cdot, \cdot)$. Note that $f \in \mathcal{H}_k$ iff $\|f\|_{\mathcal{H}_k}$ is finite. We assume that the RKHS norm of the reward function is bounded, that is, $\|f\|_{\mathcal{H}_k} \leq B$. In this work, we focus on shift-invariant kernels, that is, $k(x, y) = k(x - y)$. Further, we assume that $k(x, x) = 1$ for all $x \in \mathcal{X}$ and $0 < k(x, y) \leq 1$ for all $x, y \in \mathcal{X}$. Of particular interest is the well-known RBF kernel $k(x, y) = \exp(-\gamma \|x - y\|^2)$, which satisfies these conditions. We also assume that the noise in the observations (defined above) is bounded, i.e., $|\epsilon_t| < \sigma$, and i.i.d. sub-Gaussian for all $t$. Further, we require that the diameter of $\mathcal{X}$ is bounded by $D$, similar to the assumptions in Rahimi and Recht (2008); Avron et al. (2017b).
Kernel UCB Framework Upper Confidence Bound (UCB) algorithms have been popular in the multi-armed bandit literature (Lai and Robbins, 1985; Auer, 2002). UCB algorithms rely on the principle of optimism: in the context of stochastic $N$-armed bandits, the idea is to choose the arm with the maximum upper bound at time $t$, defined by $\mu_{a,t} + \sigma_{a,t}$, where $\mu_{a,t}$ is the empirical mean of arm $a$ up to that time and $\sigma_{a,t}$ is a confidence estimate. These algorithms were generalized to the contextual setting where the rewards depend linearly on the contextual features (Chu et al., 2011; Li et al., 2010). However, linear reward assumptions are not expressive enough and unlikely to hold in practice. This inspired the development of the kernel bandit GP-UCB (Srinivas et al., 2010) and its contextual variant CGP-UCB (Krause and Ong, 2011), which generalize the reward function from linear form to lie in a general RKHS. The theoretical guarantees were further refined in Valko et al. (2013). Gaussian Processes (GPs) (Rasmussen, 2004) are central to these kernel bandit settings. A GP over a domain $\mathcal{X}$ is denoted by $GP(\mu(x), k(x, x'))$ and is fully specified by the mean function $\mu(x) = \mathbb{E}[f(x)]$ and a covariance/kernel function $k(x, x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))]$. Without loss of generality, the GP prior over $f$ is set to have zero mean and covariance given by the PSD kernel $k(\cdot, \cdot)$. In this setting, after observing noisy samples $y_T = [y_1, \ldots, y_T]$ at points $A_T = \{x_1, \ldots, x_T\}$ such that $y_t = f(x_t) + \epsilon_t$ ($\epsilon_t \sim \mathcal{N}(0, \sigma^2 = \lambda_T)$), the posterior mean and covariance of $f$ are given by
$$\mu_T(x) = k_T(x)^\top (K_T + \lambda_T I)^{-1} y_T,$$
$$k_T(x, x') = k(x, x') - k_T(x)^\top (K_T + \lambda_T I)^{-1} k_T(x'),$$
$$\sigma_T^2(x) \triangleq k_T(x, x). \tag{6.1}$$
We denote by $K_T \in \mathbb{R}^{T \times T}$ the gram matrix whose $(i, j)$-th entry is $k(x_i, x_j)$, and $k_T(x) = [k(x, x_1), \ldots, k(x, x_T)]^\top$. In the CGP-UCB algorithm (Krause and Ong, 2011), the expected mean reward of arm $a$ at time $t$ is set to $\mu_{t-1}(x_{a,t})$ and the upper confidence bound is set to a multiple of $\sigma_{t-1}(x_{a,t})$, where the prior kernel of the GP corresponds to the RKHS kernel $k$ such that the reward function lies in $\mathcal{H}_k$. For more details, we refer readers to the original papers (Krause and Ong, 2011; Srinivas et al., 2010). Before we move further, it will be beneficial to define a fundamental quantity that appears in the analysis of GP-based UCB algorithms: the maximum information gain (Srinivas et al., 2010).
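To make Eq. (6.1) concrete, the following is a minimal NumPy sketch of the GP posterior mean and variance at a query point; the RBF bandwidth, the regularizer, and the synthetic data are illustrative placeholders, not values used in this chapter.

import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) between row sets X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def gp_posterior(X_train, y_train, x_query, lam=0.1, gamma=0.5):
    """Posterior mean and variance of Eq. (6.1) at a single query point."""
    K = rbf_kernel(X_train, X_train, gamma)                   # K_T
    k_x = rbf_kernel(X_train, x_query[None, :], gamma)[:, 0]  # k_T(x)
    mean = k_x @ np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    v = np.linalg.solve(K + lam * np.eye(len(K)), k_x)
    var = rbf_kernel(x_query[None, :], x_query[None, :], gamma)[0, 0] - k_x @ v
    return mean, var

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
mu, var = gp_posterior(X, y, rng.normal(size=3))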
Maximum Information Gain We define a slightly more general version of the maximum information gain (Srinivas et al., 2010; Krause and Ong, 2011). Given a PSD kernel $k(\cdot, \cdot)$, a sequence of parameters
$\Lambda_T = \{\lambda_1, \ldots, \lambda_T\}$ such that $\lambda_t > 0$ for all $t \in [T]$, and an ordered subset $A_T = \{x_1, \ldots, x_T\} \subset \mathcal{X}$, let us define the quantity
$$I(A_T, k, \Lambda_T) = \frac{1}{2} \sum_{t=1}^{T} \log\left(1 + \lambda_t^{-1} \sigma_{t-1}^2(x_t)\right), \tag{6.2}$$
where $\sigma_t(\cdot)$ is as defined in Eq. (6.1). Then the maximum information gain associated with the kernel $k(\cdot, \cdot)$ and the parameters $\Lambda_T$ is given by
$$\gamma_T(\Lambda_T) = \max_{A_T} I(A_T, k, \Lambda_T). \tag{6.3}$$
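As an illustration of Eq. (6.2), here is a short self-contained sketch that accumulates the information gain of an observed sequence by recomputing the posterior variance one point at a time. The data, bandwidth, and the choice of using $\lambda_t$ as the regularizer inside the posterior are illustrative assumptions of this sketch.

import numpy as np

def rbf(X, Y, gamma=0.5):
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))

def information_gain(X_seq, lambdas, gamma=0.5):
    """I(A_T, k, Lambda_T) of Eq. (6.2): 0.5 * sum_t log(1 + sigma_{t-1}^2(x_t) / lambda_t)."""
    total = 0.0
    for t, (x_t, lam_t) in enumerate(zip(X_seq, lambdas)):
        if t == 0:
            var = rbf(x_t[None, :], x_t[None, :], gamma)[0, 0]   # prior variance k(x, x)
        else:
            X_prev = X_seq[:t]
            K = rbf(X_prev, X_prev, gamma)
            k_x = rbf(X_prev, x_t[None, :], gamma)[:, 0]
            var = rbf(x_t[None, :], x_t[None, :], gamma)[0, 0] \
                  - k_x @ np.linalg.solve(K + lam_t * np.eye(t), k_x)
        total += 0.5 * np.log(1.0 + var / lam_t)
    return total

rng = np.random.default_rng(0)
X_seq = rng.normal(size=(30, 2))
print(information_gain(X_seq, lambdas=np.full(30, 0.1)))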
Kernel Approximation Although non-parametric kernel methods are powerful universal function approximators, they often suffer from scalability issues as the number of instances increases. For example,
the CGP-UCB algorithm (Krause and Ong, 2011) adapted to the finite-arm setting requires the inversion of the gram matrix at each time step. This leads to a computational complexity of $O(Nt^2 + t^3)$ at each time step. In Algorithm 1 of the KernelUCB paper (Valko et al., 2013), even with the online inverse update of $K_t$ using the Schur complement (Zhang, 2006; Petersen et al., 2008), an $O(t^2)$ cost is required for the kernel matrix-vector product for each arm $a \in \mathcal{A}$. This together leads to an $O(t^2 N)$ cost per time step $t$. When running a contextual bandit experiment for $T$ trials, KernelUCB thus incurs an $O(T^3 N)$ cost in total, where $N$ is the number of arms. On the other hand, in the linear case, LinUCB (Li et al., 2010) only requires an $O(d^2 N)$ cost to compute the parameters per time step $t$, resulting in an $O(T N d^2)$ time complexity over $T$ trials. It is clear that LinUCB is computationally faster than KernelUCB if $d < T$, which typically holds as the contextual bandit plays over a long time horizon such as tens of thousands of rounds or more. We aim to bridge the computational gap between kernel contextual bandits and the LinUCB algorithm using Random Fourier Features (RFF) (Rahimi and Recht, 2008, 2009). The key idea comes from Bochner's Theorem: any shift-invariant kernel (i.e., $k(x, z) = k(x - z)$, $k(0, 0) = 1$) has an associated
density function $p_k(\cdot)$ satisfying
$$k(x, z) = \int_{\mathbb{R}^d} e^{-2\pi i \eta^\top (x - z)}\, p_k(\eta)\, d\eta, \tag{6.4}$$
where we closely follow the notation in Avron et al. (2017b). If we define $\varphi(x, \eta) = e^{-2\pi i \eta^\top x}$, then we have the relation $k(x, z) = \mathbb{E}_{\eta \sim p_k(\cdot)}[\varphi(x, \eta)^* \varphi(z, \eta)]$, where $\varphi(x)^*$ denotes the conjugate transpose. Therefore, with $\eta_1, \ldots, \eta_s$ drawn randomly from $p_k(\cdot)$, the random Fourier features are defined as $\psi_s(x) = \frac{1}{\sqrt{s}}\left(\varphi(x, \eta_1), \ldots, \varphi(x, \eta_s)\right)$. This results in the approximation
$$\tilde{k}_s(x, z) = \psi_s(x)^* \psi_s(z) = \frac{1}{s} \sum_{l=1}^{s} e^{-2\pi i \eta_l^\top (x - z)}. \tag{6.5}$$
The RBF kernel is an example of (6.4), with $\psi_s(x) = \frac{1}{\sqrt{s}}\left(\sqrt{2}\cos(\eta_1^\top x + b_1), \ldots, \sqrt{2}\cos(\eta_s^\top x + b_s)\right)$, where $\eta_1, \ldots, \eta_s$ and $b_1, \ldots, b_s$ are sampled independently from $\frac{1}{Z}\exp\left(-\gamma\|\eta\|_2^2\right)$ and uniformly from $[0, 2\pi]$, respectively.
For $x_1, \ldots, x_t \in \mathcal{X}$, let $K_t \in \mathbb{R}^{t \times t}$ denote the gram matrix such that $K_{i,j} = k(x_i, x_j)$. Also, let $Z_{t,s} \in \mathbb{C}^{t \times s}$ be the matrix whose $i$-th row is $\psi_s(x_i)^*$. Then $\tilde{K}_{t,s} = Z_{t,s} Z_{t,s}^*$ is an approximation of the gram matrix $K_t$. Note that our focus will be on the RBF kernel, where $Z_{t,s}$ is real-valued, and therefore we will use transpose and conjugate transpose interchangeably. We restate one of the main results from Rahimi and Recht (2008) that will be useful in subsequent sections.
Theorem 23 (Uniform Approximation, Rahimi and Recht (2008)). If the space $\mathcal{X}$ is a compact subset of $\mathbb{R}^d$ with diameter $D$, then the random feature approximation defined in Eq. (6.5) has the following guarantee:
$$\mathbb{P}\left(|\tilde{k}_s(x, z) - k(x, z)| > \epsilon\right) \leq \delta, \quad \forall x, z \in \mathcal{X}, \tag{6.6}$$
for $s \geq \frac{C(d+2)}{\epsilon^2} \log\frac{\sigma_\eta D}{\epsilon\delta} \triangleq \zeta(\epsilon, \delta)$. Here, $C$ is a universal constant and $\sigma_\eta = \mathbb{E}_{\eta \sim p_k}[\eta^\top \eta]$.
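A minimal NumPy sketch of the real-valued cosine feature map above for the RBF kernel, comparing the approximate gram matrix $\tilde K_{t,s} = Z_{t,s} Z_{t,s}^\top$ against the exact one. The bandwidth, sample sizes, and data are illustrative placeholders, and the frequencies are drawn in the radian (omega) parametrization, which is equivalent to the $2\pi\eta$ convention used in the text.

import numpy as np

def rff_features(X, s, gamma=0.5, rng=None):
    """Random Fourier features for k(x, z) = exp(-gamma * ||x - z||^2).

    Frequencies are drawn from the Gaussian spectral density of the RBF kernel
    and phases uniformly from [0, 2*pi], matching the cosine map above.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, s))
    b = rng.uniform(0.0, 2.0 * np.pi, size=s)
    return np.sqrt(2.0 / s) * np.cos(X @ omega + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
gamma = 0.5
K_exact = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Z = rff_features(X, s=2000, gamma=gamma, rng=1)
K_approx = Z @ Z.T
print("max abs error:", np.abs(K_exact - K_approx).max())  # shrinks as s grows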
6.3 Random Fourier Features for Kernel UCB
6.3.1 RFF-UCB-NI: A Non-incremental Algorithm
We propose the non-incremental version of our algorithm (RFF-UCB-NI), which uses random Fourier features for the kernel contextual bandit problem, and compare it with the CGP-UCB algorithm (Krause and Ong, 2011). The natural way to approximate CGP-UCB would be to replace the kernel mean and variance estimates with the corresponding approximated versions through random Fourier features. However, initial studies (Cortes et al., 2010b; Sutherland and Schneider, 2015; Sriperumbudur and Szabó, 2015) suggest that $O(t)$ random features are needed to achieve an $\epsilon_t = O(1/\sqrt{t})$ scaling of the approximation error. A richer history of this literature is available in Li et al. (2018). However, recent work shows much better bounds. Specifically, for kernel ridge regression (KRR), Avron et al. (2017b) show that $o(t)$ random features suffice (as opposed to $O(t)$ from the prior literature) for risk bounds that are comparable with unapproximated KRR. Furthermore, Rudi and Rosasco (2017) show that, in fact, only $O(\sqrt{t}\log t)$ features suffice for an $O(1/\sqrt{t})$ bound (with minimax optimality (Caponnetto and De Vito, 2007)). We refer the interested readers to the recent papers Li et al. (2018); Sun et al. (2018); Carratino et al. (2018); Szabó and Sriperumbudur (2018) and the references therein. It should be noted that these risk bounds rely on spectral approximation of the kernel gram matrix and directly analyze the empirical risk in KRR. This is not immediately applicable to the regret analysis in the bandit setting.
Algorithm 5: Non-Incremental RFF-UCB-NI
Initialize GP prior $\mu_0(x) = 0$ and $\sigma_0(x) = k(x, x)$ for all $x \in \mathcal{X}$.
for $t = 1, 2, \ldots, T$ do
  Let $\lambda_t = \max\{t/\max\{1, \log t\}, \sigma^2\}$ and $s_t = \max\big(11 \log t\,(\log(16 \log t/\delta) + 2\log t), 2\big)$.
  Receive context vectors $\{x_{a,t}\}$ for all $a \in \mathcal{A}$.
  Let $\hat{\sigma}^2_{a,t} := k(x_{a,t}, x_{a,t}) - k_{t-1}(x_{a,t})^\top (\hat{K}_{t-1,s_t} + \lambda_t I)^{-1} k_{t-1}(x_{a,t})$ for all $a \in \mathcal{A}$.
  Let $\hat{\sigma}_{m,t} = \min_a \hat{\sigma}_{a,t}$ and define $s'_t = \zeta\!\left(\frac{\sqrt{\beta_t}\,\hat{\sigma}_{m,t}}{(B+\sigma)\log^2 t}, \frac{\delta}{t^2}\right)$, for $\zeta(\cdot,\cdot)$ defined in Theorem 23.
  Let $\hat{\mu}_{a,t} = k_{t-1}(x_{a,t})^\top (\hat{K}_{t-1,s'_t} + \lambda_t I)^{-1} y_t$.
  Let $a_t = \arg\max_{a \in \mathcal{A}} (\hat{\mu}_{a,t} + 3\sqrt{2\beta_t}\,\hat{\sigma}_{a,t})$.
  Select arm $a_t$ and observe reward $y_t$. Let $x_t = x_{a_t,t}$.
In our algorithm, we incorporate the spectral bounds mentioned above in order to provide approximation error bounds on the variance/confidence estimates for each arm. Equipped with these estimates, we use just enough kernel random features such that the approximation error in the mean function is no more than a multiple of the minimum confidence estimate. This leads to a data-dependent bound on the number of RFFs, which can be much less than $O(t)$ at time $t$. In Algorithm 5, $k_{t-1}(x) = [k(x_1, x), \ldots, k(x_{t-1}, x)]^\top$, where $x_t$ is as defined in the algorithm. $\hat{K}_{t-1,s}$ is the gram matrix at time $t$, approximated by $s$ random features, i.e., $\hat{K}_{t-1,s} = Z_{t-1,s} Z_{t-1,s}^\top$. We set $\beta_t = 2B^2 + 300\,\gamma_t(\Lambda_t)\log^3(t/\delta)$. The key idea in the algorithm is that a multiplicative approximation of the variance $\sigma_{a,t}$ can be achieved using only $s_t$ random features. After that, in approximating the kernel mean $\mu_{a,t-1}$ of any arm, the approximation error $|\mu_{a,t-1} - \hat{\mu}_{a,t-1}|$ needs to be no more than a multiple of $\sqrt{\beta_t}\,\sigma_{a,t}$, which can be
achieved by having $s'_t$ random features for approximating the gram matrix in the kernel mean function. The inversion of $Z_{t-1,s} Z_{t-1,s}^\top + \lambda_t I$ can be performed with the Woodbury matrix identity (Hager, 1989), as follows:
$$\left(Z_{t-1,s} Z_{t-1,s}^\top + \lambda_t I\right)^{-1} = \lambda_t^{-1} I - \lambda_t^{-2} Z_{t-1,s}\left(I + \lambda_t^{-1} Z_{t-1,s}^\top Z_{t-1,s}\right)^{-1} Z_{t-1,s}^\top.$$
Thus the variance and mean functions for all the arms can be calculated at an $O(N c_t^2 + c_t^3)$ cost at time step $t$, where $c_t := \max(s_t, s'_t)$.
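A small NumPy sketch of this Woodbury-based inversion, checked against direct inversion of $Z Z^\top + \lambda I$; the matrix sizes and $\lambda$ are arbitrary placeholders. In practice one would apply this inverse to vectors (or work entirely in the $s$-dimensional feature space) rather than materialize the full $t \times t$ inverse as done here for verification.

import numpy as np

def woodbury_inverse(Z, lam):
    """Inverse of (Z Z^T + lam*I) via an s x s linear system instead of a t x t one."""
    t, s = Z.shape
    inner = np.eye(s) + (Z.T @ Z) / lam          # I + lam^{-1} Z^T Z  (s x s)
    return np.eye(t) / lam - (Z @ np.linalg.solve(inner, Z.T)) / lam**2

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 40))   # t = 500 samples, s = 40 random features
lam = 0.1
A_inv = woodbury_inverse(Z, lam)
direct = np.linalg.inv(Z @ Z.T + lam * np.eye(500))
print(np.abs(A_inv - direct).max())  # agreement up to numerical precision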
6.3.2 Theoretical analysis of RFF-UCB-NI
In this section, we provide regret bounds and computational cost bounds for the non-incremental algorithm (Algorithm 5). We show that we can achieve essentially the same regret bound as the exact CGP-UCB algorithm (Krause and Ong, 2011), but with a data-dependent computational cost bound that can be much smaller than $\sum_{t=1}^{T} O(Nt^2 + t^3)$ (the cost of CGP-UCB). Our main theoretical result is provided as Theorem 24.
Theorem 24. Let $\Lambda_t = \{\lambda_1, \ldots, \lambda_t\}$ for $\lambda_t$ as defined in Algorithm 5, and let $\gamma_t(\Lambda_t)$ be as defined in Eq. (6.3). When the reward function and the noise at each time step follow the conditions in Section 6.2
and under the assumption that $\|K_t\|_2 \geq \lambda_t$, the regret of Algorithm 5 satisfies the bound
$$\mathbb{P}\left( R(T) \leq \sqrt{\frac{81\, T\, \beta_T\, \gamma_T(\Lambda_T)}{\log(1 + \lambda_T^{-1})}},\ \forall T \right) \geq 1 - 7\delta. \tag{6.7}$$
The total computational cost $C(T)$ over $T$ time steps of Algorithm 5 has the data-dependent bound
$$C(T) = O\left(\sum_{t=1}^{T} \left(N c_t^2 + c_t^3\right)\right),$$
where $c_t = \max\left(22 \log^2(16\, t \log t/\delta),\ \zeta\!\left(\frac{\sqrt{\beta_t}\,\sigma_{m,t-1}}{(B+\sigma)\log^2 t}, \frac{\delta}{t^2}\right)\right)$ and $\sigma_{m,t}^2 = \min_{a \in \mathcal{A}}\ k(x_{a,t}, x_{a,t}) - k_{t-1}(x_{a,t})^\top (K_{t-1} + \lambda_t I)^{-1} k_{t-1}(x_{a,t})$.
The proof is in Appendix 6.6.3. Note that $\sigma_{m,t}$ is the minimum variance estimate, without any approximation, at time $t$.
Discussion Our theoretical bounds should be compared with CGP-UCB (Krause and Ong, 2011) for the finite-arm setting. The regret bound in Theorem 24 is order-wise the same as Theorem 1(3) in Krause and Ong (2011). However, CGP-UCB requires a total computational cost of $O\big(\sum_{t=1}^{T}(Nt^2 + t^3)\big)$, while the data-dependent computational cost bound of our algorithm is $O\big(\sum_{t=1}^{T}(N c_t^2 + c_t^3)\big)$, which can be considerably smaller than that of CGP-UCB. This is because we can approximate the variances $\sigma_{a,t}$ with only $s_t = O(\mathrm{polylog}(t))$ random features through spectral approximation bounds on the kernel gram matrix. As for the kernel mean functions of each arm, we use $s'_t$ features, which suffice for the approximation error in the mean function not to be order-wise larger than the minimum variance. In Figure 6.1, we show that the data-dependent numbers of random features $s_t$ and $s'_t$ specified by the RFF-UCB-NI algorithm grow much more slowly than $t$. In fact, $s'_t$ is often dominated by $s_t$, which grows as $O(\mathrm{polylog}(t))$. Consequently, RFF-UCB-NI enjoys a lower computational cost (in running time) than CGP-UCB when inverting the kernel matrix. Finally, in Figure 6.1a, we show that our assumption $\|K_t\|_2 > \lambda_t$ holds on real datasets.
Figure 6.1: Running the RFF-UCB-NI algorithm on the letter dataset. Panels: (a) $\|K_t\|$ versus $\lambda_t$; (b) relative approximation error $\|K_t - \hat{K}_{t-1,s_t}\|/\|K_t\|$; (c) number of random features $s_t$; (d) running time of the kernel inverse. The kernel norm assumption of Theorem 24 is verified in Figure 6.1a. Figure 6.1b plots the relative kernel approximation error. We present the data-dependent random features $s_t$ and $s'_t$ in comparison with the time step $t$. Lastly, we compare the running time of inverting the kernel matrix in the GP-UCB and RFF-UCB-NI algorithms.
6.3.3 RFF-UCB: A Fast Incremental Algorithm
Inspired by the theoretical analysis of RFF-UCB-NI, we introduce a faster incremental version, namely
RFF-UCB. We consider the setting where the context xt is shared among the arms, but the reward function fa(·) is different for different arms. LinUCB (Li et al., 2010) and its kernel variants (Krause and Ong, 2011) are commonly evaluated in this setting in practice. In the linear case, this setting is equivalent to fitting a single function with context being dependent on actions. We first note that, with the
Woodbury identity (Petersen et al., 2008) and the PSD property of the kernel matrix, the predictive mean $\hat{\mu}_{a,t}$ and confidence width $\hat{\sigma}_{a,t}$ of CGP-UCB at time step $t$ can be approximated with $s$ RFF as:
$$\hat{\mu}_{a,t} \approx \psi_s(x_t)^\top C_{a,t}^{-1} Z_{a,t}^\top y_{a,t}, \tag{6.8}$$
$$\hat{\sigma}_{a,t} \approx \sqrt{\psi_s(x_t)^\top C_{a,t}^{-1} \psi_s(x_t)}, \tag{6.9}$$
where $Z_{a,t}$ is the RFF matrix at time $t$ for arm $a$, whose rows correspond to the $m_a(t)$ RFF inputs (e.g., the $m_a(t)$ contexts previously observed for action $a$), and $C_{a,t} := Z_{a,t}^\top Z_{a,t} + \lambda_t I$. Note that with the linear kernel, $\psi_s(x_t) = x_t$, these equations reduce to the LinUCB algorithm. The remaining questions are how to set the scheduling rule for the number of random features $s_t$, and how to maintain a fast incremental update of the model parameters as new instances arrive and the number of random features grows. In RFF-UCB-NI, we saw that $s_t$ grew logarithmically in $t$. In RFF-UCB, we set
$$s_t = \alpha \cdot \lfloor \log_2 t \rfloor + s_0, \tag{6.10}$$
where $\alpha$ and $s_0$ are the slope and intercept. We found this setting to be sufficient for kernel approximation in practice, as later supported by our empirical experiments. This suggests that our bound on $s_t$ in Theorem 24 can possibly be improved, which we leave as future work. Also note that, in practice, we set $\lambda_t = \lambda$, i.e., a constant, unlike in RFF-UCB-NI. We now present RFF-UCB, outlined in Algorithm 6. Specifically, when a new instance/context $x_t$ arrives, we use the Sherman-Morrison formula (Petersen et al., 2008) to efficiently compute the inverse of $C_t$ via a rank-one update:
$$\left(C_t + \psi(x_t)\psi(x_t)^\top\right)^{-1} = C_t^{-1} - \frac{C_t^{-1}\psi(x_t)\psi(x_t)^\top C_t^{-1}}{1 + \psi(x_t)^\top C_t^{-1}\psi(x_t)}.$$
Similarly, when sampling new random features, we use the Schur complement together with the Woodbury identity to efficiently update the inverse $C_t^{-1}$ without recomputing it from scratch. More detailed derivations are available in Appendix 6.6.1.
Algorithm 6: RFF-UCB with fast incremental update
for $t = 1, 2, \ldots, T$ do
  Observe context features $x_t \in \mathbb{R}^d$ at time $t$.
  // Incrementally update the model for additional RFFs
  $s_t = \alpha \cdot \lfloor \log_2 t \rfloor + s_0$
  if $s_t > s_{t-1}$ then
    Construct $\tilde{Z}_{a,t}$ with new $\eta_1, \ldots, \eta_\alpha \sim p_k(\cdot)$
    $\Theta_{a,t-1} \leftarrow \text{UpdateFeature}(\Theta_{a,t-1}, \tilde{Z}_{a,t})$, $\forall a \in \mathcal{A}$
  // Prediction to select an arm
  for $a = 1, \ldots, N$ do
    $\hat{\mu}_{a,t} \leftarrow \psi_s(x_t)^\top C_{a,t}^{-1} Z_{a,t}^\top y_{a,t}$
    $\hat{\sigma}_{a,t} \leftarrow \sqrt{\psi_s(x_t)^\top C_{a,t}^{-1} \psi_s(x_t)}$
    $p_{a,t} \leftarrow \hat{\mu}_{a,t} + \beta \cdot \hat{\sigma}_{a,t}$
  Select arm $a_t = \arg\max_{a \in \mathcal{A}} p_{a,t}$, with ties broken arbitrarily, and observe a real-valued reward $y_{a_t,t}$.
  // Incrementally update the model for the additional instance
  $\Theta_{a_t,t} \leftarrow \text{UpdateInstance}(\Theta_{a_t,t-1}, \psi(x_t), y_{a_t,t})$
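To make one round of Algorithm 6 concrete, here is a minimal NumPy sketch of the prediction step of Eqs. (6.8)-(6.9) together with the Sherman-Morrison instance update; the fixed RFF map, exploration coefficient, rewards, and sizes are illustrative placeholders, and the feature-growth (UpdateFeature) step is omitted for brevity.

import numpy as np

class RFFArm:
    """Per-arm state: C^{-1} = (Z^T Z + lam*I)^{-1} and b = Z^T y, kept incrementally."""
    def __init__(self, s, lam=1.0):
        self.C_inv = np.eye(s) / lam
        self.b = np.zeros(s)

    def ucb(self, phi, beta=0.3):
        mu = phi @ self.C_inv @ self.b             # Eq. (6.8)
        sigma = np.sqrt(phi @ self.C_inv @ phi)    # Eq. (6.9)
        return mu + beta * sigma

    def update(self, phi, reward):
        # Sherman-Morrison rank-one update of C^{-1} after appending one row phi
        v = self.C_inv @ phi
        self.C_inv -= np.outer(v, v) / (1.0 + phi @ v)
        self.b += reward * phi

# toy usage: 3 arms, 64 random features, random contexts and rewards
rng = np.random.default_rng(0)
s, d = 64, 10
omega = rng.normal(size=(d, s))
b_phase = rng.uniform(0, 2 * np.pi, size=s)
feat = lambda x: np.sqrt(2.0 / s) * np.cos(x @ omega + b_phase)   # fixed RFF map

arms = [RFFArm(s) for _ in range(3)]
for t in range(100):
    phi = feat(rng.normal(size=d))
    a_t = int(np.argmax([arm.ucb(phi) for arm in arms]))
    arms[a_t].update(phi, reward=rng.normal())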
Time Complexity We analyze the computational cost of RFF-UCB, which can be divided into three terms: 1) the mean and confidence bound estimation, 2) the UpdateInstance() function, and 3) the UpdateFeature() function. For each time step $t$, it is straightforward to see that computing 1) and 2) together takes $O(N s_t^2)$, since $Z^\top y$ can be maintained incrementally. Note that $O(N s_t^2)$ can be as cheap as LinUCB if $s_t$ equals the (constant) data dimension. For the UpdateFeature() function, the main complexity is $O\big((t s_t + s_t^2)\alpha\big)$, which is dominated by the first term $O(t s_t \alpha)$. Putting it all together, the total cost up to the $T$-th trial is $\sum_{t=1}^{T} O(N s_t^2) + \sum_{k=1}^{k \leq \log T} O(\alpha 2^k \log^2 k)$, which can be simplified to $O(\alpha N T \log T)$ in total. Note that the computational cost of RFF-UCB is close to that of LinUCB, which is $O(N T d^2)$, if $d = O(\log T)$.
6.4 Experiment and Results
We compare the proposed RFF-UCB algorithm with representative realizability-based UCB variants, as well as with competitive methods from the agnostic-based framework. Detailed descriptions of the datasets and benchmark algorithms, as well as further experimental results, are included in Appendix 6.6.4.
Datasets Following Agarwal et al. (2014); Foster et al. (2018); Sen et al. (2018), we study six multi-class classification datasets using a transformation from the supervised setting to the contextual bandit setting (Beygelzimer et al., 2011; Dudík et al., 2011). This evaluation strategy has been widely used in the contextual bandit literature (Agarwal et al., 2014; Foster et al., 2018; Sen et al., 2018). Specifically, at each time step, the feature (context) of an instance is revealed, following which the contextual bandit algorithm chooses one of the $N$ classes, and the observed reward is 1 if it is the correct class and 0 otherwise. This is bandit feedback, as the correct class is never revealed if not chosen. Table 6.1 summarizes the dataset statistics. Since LinUCB is slow on the high-dimensional dataset (i.e., CIFAR-100), we use the pretrained VGG16 model (Simonyan and Zisserman, 2014)1 for feature extraction, which yields a low-dimensional feature space of size $\tilde{d}$.
Dataset     T        N    d      d~
letter      20,000   26   16     N/A
sensor      58,509   11   48     N/A
MNIST       60,000   10   780    N/A
CIFAR-100   60,000   100  3,072  512
shuttle     58,000   7    9      N/A
covtype     581,012  7    54     N/A
Table 6.1: Data statistics: $T$ is the number of instances, $N$ the number of classes, $d$ the number of unique features, and $\tilde{d}$ the number of unique features after dimension reduction. N/A denotes that no dimension reduction is applied.
Algorithms We evaluate RFF-UCB against six competitive baselines. The first two are realizability-based methods: LinUCB (Li et al., 2010) and CGP-UCB (Krause and Ong, 2011). LinUCB is equivalent to RFF-UCB without random Fourier features, using the original features instead. CGP-UCB is equivalent to the dual counterpart of RFF-UCB using infinitely many random features. The remaining baselines, borrowed from Bietti et al. (2018), are agnostic-based frameworks, including $\epsilon$-Greedy (Langford and Zhang, 2008), Online-Cover (Agarwal et al., 2014), Bagging (Bietti et al., 2018) (which explores a basket of policies using posterior updates and Thompson sampling), and RegCB (Foster et al., 2018) (an online optimistic algorithm). For additional discussion of agnostic-based approaches, we refer to Bietti et al. (2018).
Evaluation Each dataset is split into a training set and a test set. All hyper-parameters are tuned via five-fold cross-validation on the training set; see Appendix 6.6.4 for more details on the hyper-parameters and the range of the grid search. We conduct online evaluation on the test set with the cumulative mean reward as the evaluation metric. The implemented algorithms (LinUCB, CGP-UCB and RFF-UCB) are run in a mini-batch setting with batch size 100, i.e., the learned parameters are only updated every 100 time steps. All results are averaged over 10 runs with different random seeds.
1https://github.com/geifmany/cifar-vgg
Figure 6.2: Test reward comparison of realizability-based methods: RFF-UCB, LinUCB, and CGP-UCB. The y-axis is the test reward and the x-axis is the number of examples (rounds). Results are averaged over 10 different random seeds.
Figure 6.3: Running time comparison of realizability-based methods: RFF-UCB, LinUCB, and CGP-UCB. The y-axis is the running time and the x-axis is the number of examples (rounds). Results are averaged over 10 different random seeds.
Comparison to realizability-based models Figure 6.2 and Figure 6.3 show the performance of the methods in cumulative mean reward and their running time curves on three representative datasets, respectively. Results on the remaining datasets are available in Appendix 6.6.4 due to space limits. The proposed RFF-UCB performs almost as well as CGP-UCB, suggesting that the number of random features $s_t$ used for kernel approximation is sufficient. Importantly, the number of random features (i.e., #RFF) $s_t = \alpha \cdot \lfloor \log_2 t \rfloor + s_0$ controls the trade-off between kernel approximation quality and computational efficiency. At one extreme, when $s_t$ is equal to the feature dimension $d$, the computational cost of RFF-UCB is almost the same as LinUCB, but the performance of the method would be sub-optimal due to inaccurate kernel approximation. At the other extreme, when $s_t$ is on the order of $O(t)$, although the kernel approximation is good, there is no saving in computation, ending with the same time complexity as CGP-UCB.
With reasonable choices of $s_0$ and $\alpha$ in our scheduling rule (6.10), the performance of RFF-UCB is very competitive in comparison with CGP-UCB while requiring significantly less running time. As shown in Figure 6.3 (and Figure 6.6 in the Appendix), RFF-UCB has a speedup of more than 10x over CGP-UCB in regions with a large $T$ (e.g., $T > 10^3$). Interestingly, for high-dimensional datasets such as MNIST and CIFAR-100, the running time of RFF-UCB becomes very close to, or even faster than, that of LinUCB. In other words, when $s_t = O(\log t)$ is smaller than the feature dimension $d$, RFF-UCB can actually perform as well as the exact kernel UCB algorithm CGP-UCB while maintaining a computation cost as low as that of the linear model LinUCB. A more comprehensive study of different scheduling rules for $s_t$ is presented in Appendix 6.6.4.
Figure 6.4: Test reward comparison of agnostic-based approaches: $\epsilon$-Greedy, Online-Cover, Bagging, and RegCB. The y-axis is the test reward and the x-axis is the number of examples (rounds). Results are averaged over 10 different random seeds.
Comparison to agnostic-based approaches We now show how RFF-UCB performs in comparison with other state-of-the-art contextual bandit solvers implemented in the Vowpal Wabbit (VW) library. Figure 6.4 shows the test-set cumulative mean rewards on all six datasets. Overall, the proposed RFF-UCB algorithm outperforms all agnostic-based algorithms except on the CIFAR-100 dataset; the second best algorithm is RegCB, which has performed strongly in the contextual bandit literature (Foster et al., 2018; Bietti et al., 2018). A crucial fact worth noticing is that all the agnostic-based algorithms implemented in VW only support linear reward functions, with the option of bagging over multiple learners. Thus, it is not surprising that kernel contextual bandit algorithms such as RFF-UCB and CGP-UCB, with non-linear reward functions, show superior performance over those with linear oracles. Supporting evidence for this argument can be found in the CIFAR-100 results of Figure 6.4. Note that for feature extraction from CIFAR-100 we used a pretrained deep neural network with the VGG16 architecture (Simonyan and Zisserman, 2014). The extracted feature representations already provide good classification accuracy through a logistic output layer in the original model. In this case, the advantage of kernel contextual bandit algorithms becomes less clear.
Finally, we omit the running time comparison between RFF-UCB and the agnostic-based frameworks implemented in Vowpal Wabbit. It is important to note that our implementation of RFF-UCB is based on Python with the NumPy and SciPy packages, while the Vowpal Wabbit library is implemented in C/C++ with highly optimized packages, making a fair running time comparison difficult.
6.5 Summary
In this chapter, we present computationally efficient kernel UCB algorithms via random Fourier features. For the non-incremental RFF-UCB-NI algorithm, we not only prove the same order of regret bound as that of CGP-UCB, but also derive a data-dependent bound on the computational cost, which is much less than that of non-approximated kernel UCB algorithms. Motivated by this theory, we propose a fast incremental algorithm (RFF-UCB) and show that it empirically performs comparably or better than the other representative baseline methods on multiple benchmark datasets, while being more than 10x faster than other kernel UCB algorithms.
6.6 Appendix
6.6.1 Derivation Details of RFF-UCB Online Update
Algorithm 7: UpdateFeature($\Theta$, $\tilde{Z}$) via Schur complement
Define $B := Z^\top \tilde{Z}$ ($s \times r$ matrix)
Define $D := \tilde{Z}^\top \tilde{Z}$ ($r \times r$ matrix)
// update $Z$ and $C$
$Z \leftarrow [Z, \tilde{Z}]$
$C_{11} \leftarrow C$, $C_{22} \leftarrow D$
$C_{12} \leftarrow B$, $C_{21} \leftarrow B^\top$
new $C \leftarrow [C_{11}, C_{12}; C_{21}, C_{22}]$
// update $Z^\top y$ and $C^{-1}$
$Z^\top y \leftarrow [Z^\top y; \tilde{Z}^\top y]$, $M := D - B^\top C^{-1} B$
$C_{11}^{-1} \leftarrow C^{-1} + (C^{-1}B)\, M^{-1}\, (C^{-1}B)^\top$
$C_{12}^{-1} \leftarrow -C^{-1} B M^{-1}$, $C_{21}^{-1} \leftarrow -M^{-1} B^\top C^{-1}$
$C_{22}^{-1} \leftarrow M^{-1}$
$C^{-1} \leftarrow [C_{11}^{-1}, C_{12}^{-1}; C_{21}^{-1}, C_{22}^{-1}]$
Algorithm 8: UpdateInstance($\Theta$, $\psi(x_t)$, $y_t$) via Sherman-Morrison rank-one update
$C \leftarrow C + \psi(x_t)\psi(x_t)^\top$
Define $v_t := C^{-1}\psi_s(x_t)$
$C^{-1} \leftarrow C^{-1} - (1 + \hat{\sigma}_t^2)^{-1}\, v_t v_t^\top$
$Z^\top y \leftarrow Z^\top y + y_t \cdot \psi_s(x_t)$
$Z \leftarrow [Z; \psi_s(x_t)^\top]$
$y \leftarrow [y; y_t]$
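A minimal NumPy sketch of these two update routines, maintaining $C^{-1}$ and $Z^\top y$ as in Algorithms 7 and 8 and checked against recomputing $(Z^\top Z + \lambda I)^{-1}$ from scratch. The sizes and regularizer are placeholders; as an assumption of this sketch, the $\lambda I$ term is added to the new diagonal block so that $C = Z^\top Z + \lambda I$ stays consistent, and the bookkeeping that appends new rows to $Z$ and $y$ after an instance update is omitted.

import numpy as np

def update_instance(C_inv, Zty, phi, reward):
    """Algorithm 8: Sherman-Morrison rank-one update after one (context, reward) pair."""
    v = C_inv @ phi
    C_inv = C_inv - np.outer(v, v) / (1.0 + phi @ v)
    Zty = Zty + reward * phi
    return C_inv, Zty

def update_feature(C_inv, Z, Zty, Z_new, y, lam):
    """Algorithm 7: block (Schur-complement) update when r new feature columns are appended."""
    B = Z.T @ Z_new                                         # s x r
    D = Z_new.T @ Z_new + lam * np.eye(Z_new.shape[1])      # assumption: keep lam*I on the new block
    CB = C_inv @ B
    M_inv = np.linalg.inv(D - B.T @ CB)                     # inverse of the Schur complement
    top_left = C_inv + CB @ M_inv @ CB.T
    top_right = -CB @ M_inv
    C_inv = np.block([[top_left, top_right],
                      [top_right.T, M_inv]])
    Z = np.hstack([Z, Z_new])
    Zty = np.concatenate([Zty, Z_new.T @ y])
    return C_inv, Z, Zty

# toy check against recomputation from scratch
rng = np.random.default_rng(0)
lam, t, s, r = 0.5, 200, 32, 8
Z = rng.normal(size=(t, s)); y = rng.normal(size=t)
C_inv = np.linalg.inv(Z.T @ Z + lam * np.eye(s))
C_inv, Z2, Zty = update_feature(C_inv, Z, Z.T @ y, rng.normal(size=(t, r)), y, lam)
print(np.abs(C_inv - np.linalg.inv(Z2.T @ Z2 + lam * np.eye(s + r))).max())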
6.6.2 Kernel Approximation Guarantees
In this section, we introduce two main kernel approximation results. The first one is adapted from a bound on the approximation error of kernel ridge regression in Cortes et al. (2010b). This will be useful in bounding the approximation gaps in the posterior kernel mean function in our algorithm.
Theorem 25 (Cortes et al. (2010b)). Let $K_t \in \mathbb{R}^{t \times t}$ denote the kernel gram matrix and $\hat{K}_{t,s}$ be its approximation using $s$ random features. Let $\mu_t(x)$ be as defined in Eq. (6.1), and let $\hat{\mu}_{t,s}(x)$ be its approximated version with $K_t$ replaced by $\hat{K}_{t,s}$. Then, we have the following:
$$|\hat{\mu}_{t,s}(x) - \mu_t(x)| \leq \frac{(B + \sigma)\, t\, \|\hat{K}_{t,s} - K_t\|_2}{\lambda_t^2} \leq \frac{(B + \sigma)\, t^2\, \|\hat{K}_{t,s} - K_t\|_\infty}{\lambda_t^2}. \tag{6.11}$$
The next theorem provides spectral approximation bounds for the random Fourier features.
Theorem 26 (Avron et al. (2017b)). Let $K_t$ be the kernel gram matrix at time $t$ and $\hat{K}_{t,s}$ be its approximation using $s$ random features. Then we have
$$(1 - \Delta)(K_t + \lambda_t I) \preceq (\hat{K}_{t,s} + \lambda_t I) \preceq (1 + \Delta)(K_t + \lambda_t I) \tag{6.12}$$
with probability at least $1 - \delta$, provided $s \geq \frac{8}{3\Delta^2}\frac{t}{\lambda_t}\log\frac{16t}{\lambda_t\delta}$ and $\|K_t\|_2 \geq \lambda_t$.
This leads us to our first lemma, which says that with the choice of the number of random features in Algorithm 5, the differences between the true and approximated variances are uniformly bounded.
Lemma 27. Let $\sigma^2_{a,t} := k(x_{a,t}, x_{a,t}) - k_{t-1}(x_{a,t})^\top (K_{t-1} + \lambda_t I)^{-1} k_{t-1}(x_{a,t})$ be the actual variance estimates for all $a \in \mathcal{A}$. Let $\hat{\sigma}^2_{a,t}$ be the approximated one, as defined in Algorithm 5. Then, we have the following:
$$\mathbb{P}\left(\frac{1}{2}\sigma^2_{a,t} \leq \hat{\sigma}^2_{a,t} \leq \frac{3}{2}\sigma^2_{a,t},\ \forall a \in \mathcal{A},\, t \in [T]\right) \geq 1 - \frac{\pi^2}{6}\delta.$$
Proof. This follows directly from Theorem 26. Recall the definition of $s_t$ and substitute $\delta/t^2$ in place of $\delta$ in Theorem 26. Putting in $\Delta = 1/2$, and observing that $s_t \geq \frac{8}{3\Delta^2}\frac{t}{\lambda_t}\log\frac{16t}{\lambda_t(\delta/t^2)}$, we have that
$$\frac{1}{2}\, k_{t-1}(x_{a,t})^\top (K_{t-1} + \lambda_t I)^{-1} k_{t-1}(x_{a,t}) \leq k_{t-1}(x_{a,t})^\top (\hat{K}_{t-1,s_t} + \lambda_t I)^{-1} k_{t-1}(x_{a,t}) \leq \frac{3}{2}\, k_{t-1}(x_{a,t})^\top (K_{t-1} + \lambda_t I)^{-1} k_{t-1}(x_{a,t}), \quad \text{for all } a \in \mathcal{A}$$
with probability at least $1 - \frac{\delta}{t^2}$. Substituting the above in the expressions of $\hat{\sigma}^2_{a,t}$ and $\sigma^2_{a,t}$ and applying a union bound over all $t$ yields the lemma, since $\sum_t (1/t^2) \leq \pi^2/6$.
This brings us to our second key approximation lemma, which bounds the difference in the approximation of the kernel mean function.
Lemma 28. Let $\mu_{a,t} = k_{t-1}(x_{a,t})^\top (K_{t-1} + \lambda_t I)^{-1} y_t$ be the actual mean estimate for arm $a \in \mathcal{A}$. Then, for the choice of parameters in Algorithm 5, we have the following bound:
$$\mathbb{P}\left(|\hat{\mu}_{a,t} - \mu_{a,t}| \leq \sqrt{\tfrac{3}{2}\beta_t}\,\sigma_{m,t}\ \ \forall a \in \mathcal{A},\, t \in [T]\right) \geq 1 - \frac{\pi^2}{3}\delta.$$
Proof. By Theorem 23 and the choice of $s'_t$ in Algorithm 5, we have that
$$\|\hat{K}_{t,s} - K_t\|_\infty \leq \frac{\sqrt{\beta_t}\,\hat{\sigma}_{m,t}}{(B + \sigma)\log^2 t} \leq \frac{\sqrt{\tfrac{3}{2}\beta_t}\,\sigma_{m,t}}{(B + \sigma)\log^2 t},$$
with probability $1 - \frac{\delta}{t^2}$, given that the high-probability event in Lemma 27 holds. Therefore, by Theorem 25, we have
$$|\hat{\mu}_{a,t} - \mu_{a,t}| \leq \sqrt{\tfrac{3}{2}\beta_t}\,\sigma_{m,t} \quad \forall a \in \mathcal{A}$$
with probability $1 - \frac{\delta}{t^2}$, given that the high-probability event in Lemma 27 holds. Let $E_1$ be the high-probability event in Lemma 27. Then, we have that
$$\mathbb{P}\left(|\hat{\mu}_{a,t} - \mu_{a,t}| > \sqrt{\tfrac{3}{2}\beta_t}\,\sigma_{m,t} \text{ for any } a \in \mathcal{A}, t \in [T]\right) \leq \mathbb{P}\left(|\hat{\mu}_{a,t} - \mu_{a,t}| > \sqrt{\tfrac{3}{2}\beta_t}\,\sigma_{m,t} \text{ for any } a \in \mathcal{A}, t \in [T] \,\middle|\, E_1\right) + \mathbb{P}(E_1^c)$$
$$\leq \sum_{t=1}^{T}\frac{\delta}{t^2} + \frac{\pi^2}{6}\delta \leq \frac{\pi^2}{3}\delta.$$
This completes the proof.
6.6.3 Regret Bound
In this section, we first show that the true mean function can closely approximate the reward function $f$, and then show that the approximated mean function is not too far away from the true kernel mean estimates. This is sufficient to show the final regret bound.
Theorem 29 (Srinivas et al. (2010)). Consider the true kernel mean estimate $\mu_{a,t} = k_{t-1}(x_{a,t})^\top (K_{t-1} + \lambda_t I)^{-1} y_t$. We have the following high-probability bound:
$$\mathbb{P}\left(|f(x_{a,t}) - \mu_{a,t}| \leq \sqrt{\beta_t}\,\sigma_{a,t}\ \ \forall a \in \mathcal{A},\, t \in [T]\right) \geq 1 - \delta.$$
Now, we are ready to prove our main theorem.
Proof of Theorem 24. From Theorem 29 and Lemma 28 we have that
$$\mathbb{P}\left(|f(x_{a,t}) - \hat{\mu}_{a,t}| \leq \left(1 + \sqrt{\tfrac{3}{2}}\right)\sqrt{\beta_t}\,\sigma_{a,t}\ \ \forall a \in \mathcal{A},\, t \in [T]\right) \geq 1 - \left(1 + \frac{\pi^2}{3}\right)\delta$$
$$\implies \mathbb{P}\left(|f(x_{a,t}) - \hat{\mu}_{a,t}| \leq 3\sqrt{\beta_t}\,\sigma_{a,t}\ \ \forall a \in \mathcal{A},\, t \in [T]\right) \geq 1 - 7\delta.$$
Let $E = \left\{|f(x_{a,t}) - \hat{\mu}_{a,t}| \leq 3\sqrt{\beta_t}\,\sigma_{a,t}\ \forall a \in \mathcal{A},\, t \in [T]\right\}$ be the high-probability event in the equation above. Let $a_t$ be the arm chosen at time $t$. Given $E$, we have the following:
$$\hat{\mu}_{a_t,t} + 3\sqrt{2\beta_t}\,\hat{\sigma}_{a_t,t} \geq \hat{\mu}_{a_t^*,t} + 3\sqrt{2\beta_t}\,\hat{\sigma}_{a_t^*,t} \geq \hat{\mu}_{a_t^*,t} + 3\sqrt{\beta_t}\,\sigma_{a_t^*,t} \geq f(x_{a_t^*,t}).$$
This implies that
$$r_t = f(x_{a_t^*,t}) - f(x_{a_t,t}) \leq 3(1 + \sqrt{3})\sqrt{\beta_t}\,\sigma_{a_t,t} \leq 9\sqrt{\beta_t}\,\sigma_{a_t,t}.$$
Therefore, given $E$, we have $\sum_{t=1}^{T} r_t^2 \leq \sum_{t=1}^{T} 81\,\beta_t\,\sigma^2(x_{a_t,t})$. Now, we have that
$$\beta_t \sigma^2(x_{a_t,t}) \leq \beta_T \lambda_t \left(\lambda_t^{-1}\sigma^2(x_{a_t,t})\right) \leq \beta_T \frac{1}{\log(1 + \lambda_t^{-1})}\log\left(1 + \lambda_t^{-1}\sigma^2(x_{a_t,t})\right) \leq \beta_T \frac{1}{\log(1 + \lambda_T^{-1})}\log\left(1 + \lambda_t^{-1}\sigma^2(x_{a_t,t})\right),$$
since $s^2 \leq C \log(1 + s^2)$ for $s^2 \in [0, \lambda_t^{-1}]$, where $C = \lambda_t^{-1} / \log(1 + \lambda_t^{-1})$. Note that $\sigma^2(x_{a_t,t}) \leq 1$. The result follows from the fact that $R(T)^2 \leq T \sum_{t=1}^{T} r_t^2$.
Figure 6.5: Test Rewards of RFF-UCB versus GP-UCB
6.6.4 Additional Experiment Settings and Results
Hyper-parameters For agnostic-based approaches, we consider the implementation of the well-known online learning system Vowpal Wabbit (VW)2 with the linear predictive class. For $\epsilon$-Greedy, we tune the constant $\epsilon \in \{0.01, 0.05, 0.10, 0.15, 0.25, 0.30, 0.35\}$. For Online-Cover and Bagging, we tune the number of policies $M \in \{1, 2, 4, 8, 16, 32\}$ and set the reduction method to the doubly robust estimator. For RegCB, we search the mellowness parameter $C_0 \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\}$ and tune the reduction method from either DR or MTR. For realizability-based algorithms, we implement LinUCB, CGP-UCB, and RFF-UCB and tune the exploration coefficient $\beta \in \{0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45\}$. For the kernel UCB algorithms CGP-UCB and RFF-UCB, we first compute the kernel bandwidth using the median heuristic, denoted as $\gamma_0$, and search the best bandwidth $\gamma = \gamma_0 \cdot c_0$ where $c_0 \in \{0.25, 0.50, 1, 2, 4\}$.
Additional Results We present additional experimental results for the realizability-based methods, as shown in Figure 6.5 and Figure 6.6. In general, RFF-UCB is close to the performance of CGP-UCB while being more than 10x faster in most cases, except on the letter dataset. On the covtype and shuttle datasets, RFF-UCB is even 100x faster than CGP-UCB.
Figure 6.7 and Figure 6.8 show the RFF-UCB algorithm under a variety of choices of $(\alpha, s_0)$ appearing in the scheduling rule for $s_t$ (Eq. (6.10)). Specifically, we examine $\alpha \in \{32, 64, 128, 256\}$ and $s_0 \in \{128, 256, 512\}$.
2http://hunch.net/~vw/
Figure 6.6: Running time of RFF-UCB versus GP-UCB
Figure 6.7: Test Rewards of RFF-UCB versus GP-UCB
Figure 6.8: Running time of RFF-UCB versus GP-UCB
Part IV
Extensions to Time-Series and Graph-based Learning
Chapter 7
Kernel Change-point Detection
7.1 Background
Detecting changes in the temporal evolution of a system (biological, physical, mechanical, etc.) in time series analysis has attracted considerable attention in machine learning and data mining for decades (Basseville et al., 1993; Brodsky and Darkhovsky, 2013). This task, commonly referred to as change-point detection (CPD) or anomaly detection, aims to predict significant changing points in a temporal sequence of observations. CPD has a broad range of real-world applications such as medical diagnostics (Gardner et al., 2006), industrial quality control (Basu and Meckesheimer, 2007), financial market analysis (Pepelyshev and Polunchenko, 2015), video anomaly detection (Liu et al., 2018) and more. As shown in Figure 7.1, we focus on retrospective CPD (Takeuchi and Yamanishi, 2006; Li et al., 2015a), which allows a flexible time window to react to the change-points. Retrospective CPD not only enjoys more robust detection (Chandola et al., 2009) but also embraces many applications such as climate change detection (Reeves et al., 2007), genetic sequence analysis (Wang et al., 2011), and network intrusion detection (Yamanishi et al., 2004), to name just a few. Various methods have been developed (Gustafsson and Gustafsson, 2000), and many of them are parametric with strong assumptions on the distributions (Basseville et al., 1993; Gustafsson, 1996), including auto-regressive models (Yamanishi and Takeuchi, 2002) and state-space models (Kawahara et al., 2007) for tracking changes in the mean, the variance, and the spectrum.
Figure 7.1: A sliding window over the time series input with two intervals: the past and the current, where $w_l$ and $w_r$ are the sizes of the past and current intervals, respectively. $X^{(l)}$ and $X^{(r)}$ consist of the data in the past and current intervals, respectively.
Ideally, the detection algorithm should be free of distributional assumptions to have robust performance, as neither the true data distributions nor the anomaly types are known a priori. Thus the parametric assumptions in many works are unavoidably a limiting factor in practice. As an alternative, nonparametric and kernel approaches are free of distributional assumptions and hence enjoy the advantage of producing more robust performance over a broader class of data distributions.
Kernel two-sample tests have been applied to time series CPD with some success. For example, Harchaoui et al. (2009) presented a test statistic based upon the maximum kernel Fisher discriminant ratio for hypothesis testing, and Li et al. (2015a) proposed a computationally efficient test statistic based on maximum mean discrepancy with block sampling techniques. The performance of kernel methods, nevertheless, relies heavily on the choice of kernels. Gretton et al. (2007, 2012a) conducted kernel selection for RBF kernel bandwidths via the median heuristic. While this is certainly straightforward, it has no guarantees of optimality with regard to the statistical test power of hypothesis testing. Gretton et al. (2012b) show that explicitly optimizing the test power leads to better kernel choices for hypothesis testing under mild conditions. Kernel selection by optimizing the test power, however, is not directly applicable to time series CPD due to insufficient samples, as we discuss in the following.
Notations and Settings Given a sequence of $d$-dimensional observations $\{x_1, \ldots, x_t, \ldots\}$, $x_i \in \mathbb{R}^d$, our goal is to detect the existence of a change-point1 such that before the change-point, samples are i.i.d. from a distribution $\mathbb{P}$, while after the change-point, samples are i.i.d. from a different distribution $\mathbb{Q}$. Notice that there are multiple settings for change-point detection (CPD) where samples could be piecewise i.i.d., non-i.i.d. autoregressive, and more. It is truly difficult to come up with a generic framework to tackle all these different settings. In this work, following the previous CPD works (Harchaoui et al., 2009; Kawahara et al., 2007; Matteson and James, 2014; Li et al., 2015a), we stay with the piecewise i.i.d. assumption on the time series samples.
Two-sample Test Hypothesis testing offers thorough statistical guarantees of whether two finite sample sets are from the same distribution. Following Gretton et al. (2012a), the hypothesis test is defined by the null hypothesis $H_0: \mathbb{P} = \mathbb{Q}$ and the alternative $H_1: \mathbb{P} \neq \mathbb{Q}$, using the test statistic $m\hat{M}_k(X, Y)$. For a given allowable false rejection probability $\alpha$ (i.e., false positive rate or Type I error), we choose a test threshold $c_\alpha$ and reject $H_0$ if $m\hat{M}_k(X, Y) > c_\alpha$. We now describe the objective for choosing the kernel $k$ to maximize the test power (Gretton et al., 2012b; Sutherland et al., 2017). First, note that, under the alternative $H_1: \mathbb{P} \neq \mathbb{Q}$, $\hat{M}_k$ is asymptotically normal,
$$\frac{\hat{M}_k(X, Y) - M_k(\mathbb{P}, \mathbb{Q})}{\sqrt{V_m(\mathbb{P}, \mathbb{Q})}} \xrightarrow{D} \mathcal{N}(0, 1), \tag{7.1}$$
where $V_m(\mathbb{P}, \mathbb{Q})$ denotes the asymptotic variance of the $\hat{M}_k$ estimator. The test power is then
$$\Pr\left(m\hat{M}_k(X, Y) > c_\alpha\right) \longrightarrow \Phi\left(\frac{M_k(\mathbb{P}, \mathbb{Q})}{\sqrt{V_m(\mathbb{P}, \mathbb{Q})}} - \frac{c_\alpha}{m\sqrt{V_m(\mathbb{P}, \mathbb{Q})}}\right), \tag{7.2}$$
where $\Phi$ is the CDF of the standard normal distribution. Given a set of kernels $\mathcal{K}$, we aim to choose a kernel $k \in \mathcal{K}$ to maximize the test power, which is equivalent to maximizing the argument of $\Phi$.
1Technically speaking, the two-sample test only informs us whether two finite sample sets are from the same distribution or not. We abuse the change-point notation as equivalent to two sample sets differing in distribution. One can further apply other segmentation techniques, such as a maximum partition strategy, to locate a single change point after detection by the two-sample test.
7.2 Optimizing Test Power of Kernel CPD
In time series CPD, we denote by $\mathbb{P}$ the distribution of usual events and by $\mathbb{Q}$ the distribution of events when change-points happen. The difficulty of choosing kernels by optimizing the test power in Eq. (7.2) is that we have very limited samples from the abnormal distribution $\mathbb{Q}$. Kernel learning in this case may easily overfit, leading to sub-optimal performance in time series CPD.
7.2.1 Difficulties of Optimizing Kernels for CPD
To demonstrate how limited samples of $\mathbb{Q}$ would affect optimizing the test power, we consider kernel selection for Gaussian RBF kernels on the Blobs dataset (Gretton et al., 2012b; Sutherland et al., 2017), which is considered hard for kernel two-sample tests. $\mathbb{P}$ is a 5 × 5 grid of two-dimensional standard normals, with spacing 15 between the centers. $\mathbb{Q}$ is laid out identically, but with covariance $\frac{q-1}{q+1}$ between the coordinates (so the ratio of eigenvalues in the variance is $q$). The left panel of Fig. 7.2 shows $X \sim \mathbb{P}$ (red samples), $Y \sim \mathbb{Q}$ (blue dense samples), and $\tilde{Y} \sim \mathbb{Q}$ (blue sparse samples) with $q = 6$. Note that when $q = 1$, $\mathbb{P} = \mathbb{Q}$. For $q \in \{4, 6, 8, 10, 12, 14\}$, we take 10000 samples for $X, Y$ and 200 samples for $\tilde{Y}$. We consider two objectives for choosing kernels among 20 kernel bandwidths: 1) the median heuristic; 2) the max-ratio $\eta_{k^*}(X, Y) = \arg\max_k \hat{M}_k(X, Y)/\sqrt{V_m(X, Y)}$. We repeat this process 1000 times and report the test power under false rejection rate $\alpha = 0.05$. As shown in the right panel of Fig. 7.2, optimizing kernels using the limited samples $\tilde{Y}$ significantly decreases the test power compared to $Y$ (the blue curve drops down to the cyan curve). This result not only verifies our claim about the inaptness of existing kernel learning objectives for the CPD task, but also stimulates us with the following question: how can we optimize kernels with very limited samples from $\mathbb{Q}$, or even none in the extreme case?
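As an illustration of this setup, here is a small NumPy sketch that draws samples from the Blobs distributions $\mathbb{P}$ and $\mathbb{Q}$ described above; the grid size, spacing, and covariance ratio follow the description, while the sample counts and seeds are placeholders.

import numpy as np

def sample_blobs(n, q=6.0, grid=5, spacing=15.0, rng=None):
    """Draw n samples each from P (isotropic blobs) and Q (correlated blobs).

    Q uses covariance [[1, rho], [rho, 1]] with rho = (q-1)/(q+1), whose
    eigenvalue ratio is q; q = 1 recovers P = Q.
    """
    rng = np.random.default_rng(rng)
    rho = (q - 1.0) / (q + 1.0)
    cov_q = np.array([[1.0, rho], [rho, 1.0]])
    centers = spacing * rng.integers(0, grid, size=(n, 2))   # random grid cell per sample
    X = centers + rng.multivariate_normal(np.zeros(2), np.eye(2), size=n)
    centers = spacing * rng.integers(0, grid, size=(n, 2))
    Y = centers + rng.multivariate_normal(np.zeros(2), cov_q, size=n)
    return X, Y

X, Y = sample_blobs(10000, q=6.0, rng=0)        # plentiful samples from P and Q
_, Y_tilde = sample_blobs(200, q=6.0, rng=1)    # the scarce-sample regime discussed above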
7.2.2 A Practical Lower Bound on Optimizing Test Power
We first assume there exists a surrogate distribution $\mathbb{G}$ that we can easily draw samples from ($Z \sim \mathbb{G}$, $|Z| \gg |\tilde{Y}|$), and that also satisfies the following property:
Mk(P, P) < Mk(P, G) < Mk(P, Q), ∀k ∈ K. (7.3)
Besides, we assume we are dealing with a non-trivial case of $\mathbb{P}$ and $\mathbb{Q}$ where a lower bound $\frac{1}{m} v_l \leq V_{m,k}(\mathbb{P}, \mathbb{Q})$, $\forall k$, exists. Since $M_k(\mathbb{P}, \mathbb{Q})$ is bounded, there exists an upper bound $v_u$. With the bounded variance condition $\frac{v_l}{m} \leq V_{m,k}(\mathbb{P}, \mathbb{Q}) \leq \frac{v_u}{m}$, we derive a lower bound $\gamma_{k^*}(\mathbb{P}, \mathbb{G})$ on the test power:
$$\max_{k \in \mathcal{K}} \frac{M_k(\mathbb{P}, \mathbb{Q})}{\sqrt{V_m(\mathbb{P}, \mathbb{Q})}} - \frac{c_\alpha/m}{\sqrt{V_m(\mathbb{P}, \mathbb{Q})}} \;\geq\; \max_{k \in \mathcal{K}} \frac{M_k(\mathbb{P}, \mathbb{Q})}{\sqrt{v_u/m}} - \frac{c_\alpha}{\sqrt{m v_l}} \;\geq\; \max_{k \in \mathcal{K}} \frac{M_k(\mathbb{P}, \mathbb{G})}{\sqrt{v_u/m}} - \frac{c_\alpha}{\sqrt{m v_l}} \;\triangleq\; \gamma_{k^*}(\mathbb{P}, \mathbb{G}). \tag{7.4}$$
Just for now, in the Blobs toy experiment, we construct this surrogate distribution $\mathbb{G}$ by mimicking $\mathbb{Q}$ with covariance parameter $g = q - 2$. We defer the discussion on how to find $\mathbb{G}$ to a later paragraph. Choosing kernels via $\gamma_{k^*}(X, Z)$ using surrogate samples $Z \sim \mathbb{G}$, as represented by the green curve in Fig. 7.2, substantially boosts the test power compared to $\eta_{k^*}(X, \tilde{Y})$ with sparse samples $\tilde{Y} \sim \mathbb{Q}$. This toy example not only suggests that optimizing kernels with a surrogate distribution $\mathbb{G}$ leads to better test power when samples from $\mathbb{Q}$ are insufficient, but also demonstrates that the effectiveness of our kernel selection objective holds without introducing any autoregressive/RNN modeling to control the Type-I error.
Figure 7.2: Left: 5 × 5 Gaussian grid, samples from $\mathbb{P}$, $\mathbb{Q}$ and $\mathbb{G}$. We discuss two cases of $\mathbb{Q}$: one with sufficient samples, the other with insufficient samples. Right: Test power of kernel selection versus $q$. Choosing kernels by $\gamma_{k^*}(X, Z)$ using a surrogate distribution $\mathbb{G}$ is advantageous when we do not have sufficient samples from $\mathbb{Q}$, which is typically the case in the time series CPD task.
Test Threshold Approximation The test threshold $c_\alpha$ is a function of $k$ and $\mathbb{P}$. Under $H_0: \mathbb{P} = \mathbb{Q}$, $m\hat{M}_k(X, Y)$ converges asymptotically to a distribution that depends on the unknown data distribution $\mathbb{P}$ (Gretton et al., 2012a, Theorem 12); thus there is no analytical form of $c_\alpha$, which makes optimizing Eq. (7.4) difficult. Commonly used heuristics iteratively estimate $c_\alpha$ by a permutation test and fix it during optimization (Sutherland et al., 2017), which is computationally demanding and non-differentiable. For $X, X' \sim \mathbb{P}$, we know that $c_\alpha$ is a function of the empirical estimator $\hat{M}_k(X, X')$ that controls the Type I error. Bounding $\hat{M}_k(X, X')$ can therefore serve as an approximation to bounding $c_\alpha$. Thus, we propose the following objective that maximizes a lower bound on the test power:
$$\arg\max_{k \in \mathcal{K}}\ M_k(\mathbb{P}, \mathbb{G}) - \lambda \hat{M}_k(X, X'), \tag{7.5}$$
where λ is a hyper-parameter to control the trade-off between Type-I and Type-II errors, as well as
absorbing the constants $m$, $v_l$, $v_u$ in the variance approximation. Note that in experiments, the optimization of Eq. (7.5) is solved using the unbiased estimator of $M_k(\mathbb{P}, \mathbb{G})$ with empirical samples.
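A minimal NumPy sketch of this objective using the standard unbiased MMD² estimator with an RBF kernel; the samples, the surrogate $\mathbb{G}$, the bandwidth grid, and $\lambda$ are illustrative placeholders, and the kernel selection here is a simple grid search over bandwidths rather than the richer kernel families discussed later.

import numpy as np

def mmd2_unbiased(X, Y, gamma):
    """Unbiased estimator of MMD^2 between samples X and Y under an RBF kernel."""
    def k(A, B):
        return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

def surrogate_objective(X, X_prime, Z, gamma, lam=1.0):
    """Eq. (7.5): reward separating P from the surrogate G, penalize M_k(X, X') under H0."""
    return mmd2_unbiased(X, Z, gamma) - lam * mmd2_unbiased(X, X_prime, gamma)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # samples from P
X_prime = rng.normal(size=(200, 2))            # an independent copy of P
Z = rng.normal(size=(200, 2)) + 0.5            # samples from a surrogate G, slightly shifted

gammas = np.logspace(-2, 1, 20)                # candidate RBF bandwidth parameters
best_gamma = max(gammas, key=lambda g: surrogate_objective(X, X_prime, Z, g))
print("selected gamma:", best_gamma)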
7.2.3 Surrogate Distributions using Generative Models
The remaining question is how to construct the surrogate distribution $\mathbb{G}$ without any sample from $\mathbb{Q}$. Injecting random noise into $\mathbb{P}$ is a simple way to construct $\mathbb{G}$. While straightforward, it may result in a sub-optimal $\mathbb{G}$ because of sensitivity to the level of injected random noise. Since we have no prior knowledge of $\mathbb{Q}$, to ensure that (7.3) holds for any possible $\mathbb{Q}$ (e.g., $\mathbb{Q} \neq \mathbb{P}$ but $\mathbb{Q} \approx \mathbb{P}$), intuitively, we have to make $\mathbb{G}$ as close
to $\mathbb{P}$ as possible. We propose to learn an auxiliary generative model $\mathbb{G}_\theta$ parameterized by $\theta$ such that
$$\hat{M}_k(X, X') < \min_\theta M_k(\mathbb{P}, \mathbb{G}_\theta) < M_k(\mathbb{P}, \mathbb{Q}), \quad \forall k \in \mathcal{K}.$$
To ensure the first inequality holds, we set an early stopping criterion when solving for $\mathbb{G}_\theta$ in practice. Also, if $\mathbb{P}$ is sophisticated, which is common in time series settings, the limited capacity of the parametrization of $\mathbb{G}_\theta$ with a finite-size model (e.g., neural networks) (Arora et al., 2017) and the finite samples of $\mathbb{P}$ also hinder us from fully recovering $\mathbb{P}$. Thus, we derive a min-max formulation that considers all possible $k \in \mathcal{K}$ when we learn $\mathbb{G}$:
$$\min_\theta \max_{k \in \mathcal{K}}\ M_k(\mathbb{P}, \mathbb{G}_\theta) - \lambda \hat{M}_k(X, X'), \tag{7.6}$$
and solve for the kernel for the hypothesis test in the meantime. In experiments, we use simple alternating (stochastic) gradient descent to solve for each in turn.
7.3 KLCPD: Realization for Time Series Applications
In this section, we present a realization of the kernel learning framework for time series CPD.
Compositional Kernels To have expressive kernels for complex time series, we consider compositional kernels $\tilde{k} = k \circ f$ that combine RBF kernels $k$ with injective functions $f_\phi$:
$$\mathcal{K} = \left\{\tilde{k} \;\middle|\; \tilde{k}(x, x') = \exp\left(-\|f_\phi(x) - f_\phi(x')\|^2\right)\right\}. \tag{7.7}$$
The resulting kernel $\tilde{k}$ is still characteristic if $f$ is an injective function and $k$ is characteristic (Gretton et al., 2012a). This ensures the MMD endowed by $\tilde{k}$ is still a valid probabilistic distance. One example function class is $\{f_\phi \mid f_\phi(x) = \phi x,\ \phi > 0\}$, equivalent to kernel bandwidth tuning. Inspired by the recent success of combining deep neural networks into kernels (Wilson et al., 2016; Al-Shedivat et al.,
2017; Li et al., 2017), we parameterize the injective functions $f_\phi$ by recurrent neural networks (RNNs) to capture the temporal dynamics of time series. For an injective function $f$, there exists a function $F$ such that $F(f(x)) = x$, $\forall x \in \mathcal{X}$, which can be approximated by an auto-encoder via a sequence-to-sequence architecture for time series. One practical realization of $f$ would be an RNN encoder parametrized by $\phi$, while the function $F$ is an RNN decoder parametrized by $\psi$ trained to minimize the reconstruction loss. Thus, our final objective is