Probabilistic Data Analysis with Probabilistic Programming

by Feras Ahmad Khaled Saad

S.B., Electrical Engineering and Computer Science, M.I.T. (2016)

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

Massachusetts Institute of Technology

September 2016

© 2016 Massachusetts Institute of Technology. All rights reserved.

Signature of Author: Department of Electrical Engineering and Computer Science, September 21, 2016

Certified by: Dr. Vikash Mansinghka, Research Scientist, Thesis Supervisor

Accepted by: Dr. Christopher Terman, Chairman, Master of Engineering Thesis Committee


Abstract

Probabilistic techniques are central to data analysis, but different approaches can be challenging to apply, combine, and compare. This thesis introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include hierarchical Bayesian models, multivariate kernel methods, discriminative machine learning, clustering algorithms, dimensionality reduction, and arbitrary probabilistic programs. We also demonstrate the integration of CGPMs into BayesDB, a probabilistic programming platform that can express data analysis tasks using a modeling language and a structured query language. The practical value is illustrated in two ways. First, CGPMs are used in an analysis that identifies satellite data records which probably violate Kepler's Third Law, by composing causal probabilistic programs with non-parametric Bayes in under 50 lines of probabilistic code. Second, for several representative data analysis tasks, we report on lines of code and accuracy measurements of various CGPMs, plus comparisons with standard baseline solutions from Python and MATLAB libraries.

Thesis Supervisor: Vikash Mansinghka
Title: Research Scientist

To my mother and father, who gave and give me everything.

Acknowledgments

This project would not be possible without the help of my friends and colleagues at the Probabilistic Computing Project. I extend my sincerest gratitude to our principal investigator Vikash Mansinghka, the ideal professional and personal mentor in all respects. I am grateful to DARPA for funding this research and my appointment as a research assistant through the Probabilistic Programming for Advanced Machine Learning program. I thank my brother, Khaled, for his uplifting support, affection, and tolerance of and engagement with my most atypical of thoughts.

Contents

1 Introduction
2 Related Work
3 Composable Generative Population Models
  3.1 Populations
  3.2 Computational description of composable generative population models
  3.3 Statistical description of composable generative population models
  3.4 Composable generative population models are an abstraction for probabilistic processes
4 Algorithmic Implementations of Composable Generative Population Models
  4.1 Primitive univariate distributions and statistical data types
  4.2 Cross-Categorization
  4.3 Ensemble classifiers and regressors
    4.3.1 Classification
    4.3.2 Regression
  4.4 Factor analysis & probabilistic PCA
  4.5 Parametric mixture of experts
  4.6 Generative nearest neighbors
  4.7 Multivariate kernel density estimation
  4.8 Probabilistic programs in VentureScript
5 Integrating Composable Generative Population Models into BayesDB
  5.1 Querying composable generative population models using the Bayesian Query Language
  5.2 Compositional networks of composable generative population models
  5.3 Building populations and networks of composable generative population models with the Metamodeling Language
    5.3.1 Population Schemas
    5.3.2 Metamodel Definitions
    5.3.3 Homogeneous Ensembles of CGPM Networks
    5.3.4 Heterogeneous Ensembles of CGPM Networks
  5.4 Composable generative population models generalize and extend generative population models in BayesDB
6 Applications of Composable Generative Population Models
  6.1 Analyzing satellites using a composite CGPM built from causal probabilistic programs, discriminative machine learning, and Bayesian non-parametrics
  6.2 Comparing code length and accuracy on representative data analysis tasks
7 Discussion and Future Work

References

1. Introduction

Probabilistic techniques are central to data analysis, but can be difficult to apply, combine, and compare. Families of approaches such as parametric statistical modeling, machine learning, and probabilistic programming are each associated with different formalisms and assumptions. This thesis shows how to address these challenges by defining a new family of probabilistic models and integrating them into BayesDB, a probabilistic programming platform for data analysis. It also gives empirical illustrations of the efficacy of the framework on multiple synthetic and real-world tasks in probabilistic data analysis.

This thesis introduces composable generative population models (CGPMs), a computational formalism that extends graphical models for use with probabilistic programming. CGPMs specify a table of observable random variables with a finite number of columns and a countably infinite number of rows. They support complex intra-row dependencies among the observables, as well as inter-row dependencies among a field of latent variables. CGPMs are described by a computational interface for generating samples and evaluating densities for random variables derived from the base table by conditioning and marginalization. We show how to implement CGPMs for a broad class of model families including discriminative statistical learning, kernel density estimators, non-parametric Bayesian methods, and arbitrary probabilistic programs. We also describe algorithms and new syntaxes in the probabilistic Metamodeling Language for building compositions of CGPMs that can interoperate with BayesDB.

The practical value is illustrated in two ways. First, the thesis outlines a collection of data analysis tasks with CGPMs on a high-dimensional, real-world dataset with heterogeneous types and sparse observations. The BayesDB script builds models which combine non-parametric Bayes, principal component analysis, random forest classification, and a causal probabilistic program that implements a stochastic variant of Kepler's Third Law. Second, we illustrate coverage and conciseness of the CGPM abstraction by quantifying the lines of code and accuracy achieved on several representative data analysis tasks. Estimates are given for models expressed as CGPMs in BayesDB, as well as for baseline methods implemented in Python and MATLAB. Savings in lines of code of ~10x, with no cost in accuracy and sometimes an improvement, are typical.

The remainder of the thesis is structured as follows. Section 2 reviews work related to CGPMs in both graphical modeling and probabilistic programming. Section 3 describes the conceptual, computational, and statistical formalism for CGPMs. Section 4 formulates a wide range of probabilistic models as CGPMs, and provides both algorithmic implementations of the interface as well as examples of their invocations through the Metamodeling Language and Bayesian Query Language. Section 5 outlines an architecture of BayesDB for use with CGPMs. We show how CGPMs can be composed to form a generalized directed acyclic graph, constructing hybrid models from simpler primitives. We also present new syntaxes in MML and BQL for building and querying CGPMs in BayesDB. Section 6 applies CGPMs to several probabilistic data analysis tasks in a complex real-world dataset, and reports on lines of code and accuracy measurements. Section 7 concludes with a discussion and directions for future work.

2. Related Work

Directed graphical models from statistics provide a compact, general-purpose modeling language to describe both the factorization structure and conditional distributions of a high-dimensional joint distribution (Koller et al., 2007). Each node is a random variable which is conditionally independent of its non-descendants given its parents, and its conditional distribution given all its parents is specified by a conditional probability table or density. CGPMs extend this mathematical description to a computational one, where nodes are not only random variables with conditional densities but also computational units (CGPMs) with an interface that allows them to be composed directly as software. A CGPM node typically encapsulates a more complex statistical object than a single variable in a graphical model. Each node has a set of required input variables and output variables, and all variables are associated with statistical data types. Nodes are required to both simulate and evaluate the density of a subset of their outputs by conditioning on all their inputs, as well as either conditioning or marginalizing over another subset of their outputs. Internally, the joint distribution of output variables for a single CGPM node can itself be specified by a general model which is either directed or undirected.

CGPMs combine ideas from the vast literature on modeling and inference in graphical models with ideas from probabilistic programming. This thesis illustrates CGPMs by integrating them into BayesDB (Mansinghka et al., 2015a), a probabilistic programming platform for data analysis. BayesDB demonstrated that the Bayesian Query Language (BQL) can express several tasks from statistics and probabilistic machine learning in a model-independent way. However, this idea was illustrated by emphasizing that a domain-general baseline model builder based on CrossCat (Mansinghka et al., 2015b), with limited support for plug-in models, provides "good enough" performance for common statistical tasks. Due to limitations in the underlying formalism of generative population models (GPMs), which do not accept inputs and only learn joint distributions over observable variables, the paper did not provide an expressive modeling language for constructing a wide class of models applicable to different data analysis tasks, or for integrating domain-specific models built by experts into BayesDB. By both accepting input variables and exposing latent variables as queryable outputs, CGPMs provide a concrete proposal for mediating between automated and custom modeling using the Metamodeling Language, and model-independent querying using the Bayesian Query Language. The CGPM abstraction thus exposes the generality of BQL to a much broader model class than originally presented, which includes hybrid models with generative and discriminative components.

It is helpful to contrast CGPMs in BayesDB with other probabilistic programming formalisms such as Stan (Carpenter et al., 2015). Stan is a probabilistic programming language for specifying hierarchical Bayesian models, with built-in algorithms for automated, highly efficient posterior inference.
However, it is not straightforward to (i) integrate models from different formalisms such as discriminative machine learning as sub-parts of the overall model, (ii) directly query the outputs of the model for downstream data analysis tasks, which needs to be done on a per-program basis, and (iii) build composite programs out of smaller Stan programs, since each program is an independent unit without an interface. CGPMs provide an interface for addressing these limitations and make it possible to wrap Stan programs as CGPMs that can then interact, through BayesDB, with CGPMs implemented in other systems.

Tabular (Gordon et al., 2014) is a schema-driven probabilistic programming language which shares some similarity to composable generative population models. For instance, both the statistical representation of a CGPM (Section 3.3) and a probabilistic schema in Tabular characterize a data generating process in terms of input variables, output variables, latent variables, parameters, and hyper-parameters. However, unlike Tabular schemas, CGPMs explicitly provide a computational interface, which is more general than the description of their internal structure, and facilitates their composition (Section 5.2). In Tabular, probabilistic programs are centered around parametric statistical modeling in factor graphs, where the user manually constructs variable nodes, factor nodes, and the quantitative relationships between them. On the other hand, CGPMs express a broad range of model classes which do not necessarily admit natural representations as factor graphs, and combine higher-level automatic model discovery (using baseline generative CGPMs) with user-specified overrides for hybrid modeling.

3. Composable Generative Population Models

In this section we describe composable generative population models (CGPMs), a computational abstraction that provides a uniform treatment of a broad class of models and methods in probabilistic data analysis. This section is divided into three parts. The first part formalizes the notion of a population in terms of a random tabular data structure with a finite number of columns and a countably infinite number of rows, and establishes notation used throughout the paper. The second part outlines the computational interface that defines CGPMs. The third part describes, in terms of conditional independence constraints, a class of statistical models which can be naturally expressed using the CGPM framework.

3.1 Populations

In our framework, a population P is defined in terms of a finite set of variables (v1,...,vT), where each variable vt takes values in a general observation space Vt. Each variable has a qualitative interpretation as a particular property or attribute of the members of the population. The rth member of the population, denoted xr, is a T-dimensional vector (x[r,1],..., x[r,T]), and the element x[r,t] is a variable corresponding to the variable vt of member r. The entire population is then organized as an infinite exchangeable sequence (x1, x2, ...) of members.

The population can be conceptualized as a tabular data structure with a finite number of columns and an infinite number of rows. Column t corresponds to variable vt, row r to member xr, and cell (r,t) to element x[r,t]. The table is further associated with the observation spaces Vt. The exchangeability assumption translates into the requirement that P is unchanged by permuting the member ids, or equivalently permuting the rows themselves. Finally, a measurement is defined as an observed value for cell (r,t) in the data structure. In general, we use x[r,t] to indicate the element as a variable as well as its measured value (if one exists); the meaning is typically clear from context. A collection of measurements recorded in the infinite table is referred to as a dataset D.

It is informative to consider both the conceptual and mathematical differences between the notion of a statistical population from classical multivariate statistics, and the characterization of a population given above. In standard multivariate statistics, a data analysis task starts with a "data matrix", a finite array containing the measurements from some experiment, and additional modeling assumptions then specify that these measurements are a "random sample" from a statistical population. The members of the population are generated by a distribution (often a multivariate normal) whose unknown parameters (population mean, population covariance, etc.) we wish to discover (Timm, 2002; Khattree and Naik, 2000; Gelman and Hill, 2006). This usage of the term "statistical population" blends together variables, members, measurements, modeling assumptions, and statistical distributions and their parameters under the umbrella idea of "population". By contrast, our framing characterizes a population directly in terms of a set of population variables and their observation spaces, and separately population members and their measurements. Absent from this description are additional probabilistic descriptions of the data generating process. This framing is intended to invite questions about populations without reference to sampling from parametric distributions or particular statistical models.

Moreover, our mathematical specification of a population attempts to be more granular than the standard formalism from multivariate statistics. We explicitly differentiate between a variable vt, and the set of elements {x[r,t] : r = 1,2,...} which are versions of that variable vt for each member. By separating a variable (a "column" in the infinite table) from its related element-level variables

("cells" in that column), and carefully accounting for all elements in the data structure, we can discuss precisely the mathematical and algorithmic operations performed by CGPMs. This level of analysis would not be possible had we coarsely specified a population as a single random vector x = (x1,..., xT), and viewed measurements collected in a "data matrix" as independent realizations of x. Moreover, specifying measurements at the cell level deals with arbitrary/sparse patterns of observations in the infinite table, as opposed to "data matrices" which are often treated as objects from linear algebra. Similarly, notating the observation spaces {Vt : t ∈ [T]} allows us to capture heterogeneity in population variables, rather than assume the universe is T-dimensional Euclidean space. These characteristics are common in real-world populations that arise in probabilistic data analysis.
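To make the cell-level view of measurements concrete, the following minimal Python sketch represents a population as variable declarations plus a sparse dictionary of measurements keyed by (member, variable) pairs. The variable names and helper function are hypothetical illustrations for exposition, not part of BayesDB.

# Hypothetical sketch of a population as a sparse collection of cell-level
# measurements; names and structure are illustrative, not BayesDB's API.

population_variables = {
    'apogee_km':    'NUMERICAL',   # observation space: real numbers
    'orbit_type':   'NOMINAL',     # observation space: finite set of symbols
    'launch_count': 'COUNT',       # observation space: non-negative integers
}

# A dataset D is a sparse map from cell (member id, variable) -> measured value.
# Members may have arbitrary patterns of missing cells.
dataset = {
    (0, 'apogee_km'):    35786.0,
    (0, 'orbit_type'):   'GEO',
    (1, 'apogee_km'):    780.0,    # member 1 has no 'orbit_type' measurement
    (2, 'launch_count'): 3,
}

def measurements_for_member(dataset, r):
    """Return the observed cells of member r as a dict {variable: value}."""
    return {var: val for (member, var), val in dataset.items() if member == r}

print(measurements_for_member(dataset, 1))   # {'apogee_km': 780.0}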

3.2 Computational description of composable generative population models

Having established populations, we now introduce composable generative population models in terms of the computational interface they provide. A composable generative population model (CGPM) G characterizes the data generating process for a population P. The CGPM selects from the population variables (v1, v2, ..., vT) a set of output variables (v_1^out, ..., v_O^out) and a set of input variables (v_1^in, ..., v_I^in). For each member r, G is responsible for modeling the full joint distribution of all the output variables conditioned on all the input variables. CGPMs differ from the mathematical definition of a probability density in that they are defined directly in terms of a computational interface, as shown in Listing 1. This interface explicitly differentiates between the sampler of a random variable from its conditional distribution, and the assessor of its conditional density.

Listing 1 Computational interface for composable generative population models.

• G ← create(population: P, outputs: {v_i^out : i ∈ [O]}, inputs: {v_j^in : j ∈ [I]}, binary: B, seed: s)
  Create a CGPM for the population, with the specified inputs and outputs.

• s ← simulate(G, member: r, query: Q = {q_k}, evidence: E = {x[r,e_j]} ∪ yr)
  Generate a sample from the distribution s ∼_G x[r,Q] | {x[r,E], yr, D}.

• c ← logpdf(G, member: r, query: Q = {x[r,q_k]}, evidence: E = {x[r,e_j]} ∪ yr)
  Evaluate the log density log pG(x[r,Q] | {x[r,E], yr, D}).

• G′ ← incorporate(G, measurement: x[r,k])
  Record a measurement x[r,k] ∈ Vk into the dataset D.

• G′ ← unincorporate(G, member: r)
  Eliminate all measurements of input and output variables for member r.

• G′ ← infer(G, program: T)
  Adjust internal state in accordance with the learning procedure specified by program T.

There are several key ideas to draw from the interface. In create, P contains the set of all population variables and their observation spaces. The binary is an opaque probabilistic program containing implementations of the interface, and seed is the entropy source from which the CGPM draws random bits. The outputs requires at least one entry, the inputs may be an empty set, and any variable which is neither an input nor an output is unmodeled by the CGPM. For simplicity, we use the symbol x[r,t] to denote the output variable x[r, v_t^out], and similarly y[r,t] for the input variable y[r, v_t^in], of member r. These elements are often collected into vectors xr and yr, respectively.

In incorporate, measurements are recorded at the cell level, allowing only a sparse subset of observations for member r to exist. However, the first invocation of incorporate on a fresh member r* must be accompanied by a full realization of the inputs y_{r*}. This requirement allows CGPMs to capture several discriminative techniques where the inputs are, e.g., "feature vectors" or predictors.

Both simulate and logpdf are computed for a single member r of the population. The query parameter differs between the two methods: in simulate, Q = {q_k} is a set of indices of output variables that are to be simulated jointly; in logpdf, Q = {x[r,q_k]} is a set of values for the output variables whose density is to be assessed jointly. The evidence parameter is the same for both simulate and logpdf, and contains additional information about r, possibly including the values of a set of output variables that are disjoint from the query variables. In particular, if x[r,E] is empty, the CGPM is asked to marginalize over all its output variables that are not in the query Q; if x[r,E] is not empty, the CGPM is required to condition on those output values.

The target distributions in simulate and logpdf are also conditioned on all previously incorporated measurements in the dataset D. Because CGPMs generally model populations with inter-row dependencies, measurements of other members s ≠ r are relevant to a simulate or logpdf query about r. The CGPM interface allows the user to override a previous measurement of r in D on a per-query basis; this occurs when an element x[r,e_j] or yr in the evidence contradicts an existing measurement x′[r,e_j] or y′r in D. Asking such hypothetical queries addresses several tasks of interest in probabilistic data analysis, such as simulating "what-if" scenarios and detecting outliers in high-dimensional populations.

Finally, the infer procedure evolves the CGPM's internal state in response to the inflow of measurements. The inference program T can be based on any learning strategy applicable to the CGPM, such as Markov chain Monte Carlo transitions, variational inference, maximum likelihood, least-squares estimation, or no learning.
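To make the interface concrete, a minimal sketch of Listing 1 as an abstract Python class is given below. The class name, method signatures, and dictionary-based argument passing are illustrative assumptions for exposition; they are not the BayesDB or cgpm implementation.

# Illustrative sketch of the CGPM interface from Listing 1 as an abstract
# Python class; signatures are assumptions, not the BayesDB/cgpm API.
from abc import ABC, abstractmethod

class CGPM(ABC):
    """A composable generative population model over outputs given inputs."""

    def __init__(self, outputs, inputs, seed=0):
        self.outputs = list(outputs)   # indices of modeled output variables
        self.inputs = list(inputs)     # indices of required input variables
        self.seed = seed               # entropy source for random bits

    @abstractmethod
    def simulate(self, member, query, evidence):
        """Sample x[member, query] ~ p(. | evidence, D) for query indices."""

    @abstractmethod
    def logpdf(self, member, query, evidence):
        """Evaluate log p(x[member, query] = values | evidence, D)."""

    @abstractmethod
    def incorporate(self, member, measurement):
        """Record a cell-level measurement {index: value} for `member`."""

    @abstractmethod
    def unincorporate(self, member):
        """Remove all measurements associated with `member`."""

    @abstractmethod
    def infer(self, program=None):
        """Update internal latent state, e.g. by MCMC or maximum likelihood."""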

3.3 Statistical description of composable generative population models

The previous section outlined the external interface that defines a CGPM without specifying its internal structure. In practice, many CGPMs can be described using a general graphical model with both directed and undirected edges. The data generating process is characterized by a collection of variables in the graph,

\[
\mathcal{G} = \left(\alpha, \theta, Z = \{\mathbf{z}_r\}_{r=1}^{\infty}, X = \{\mathbf{x}_r\}_{r=1}^{\infty}, Y = \{\mathbf{y}_r\}_{r=1}^{\infty}\right).
\]

• α: Fixed quantities such as input and output dimensionalities, observation spaces, dependence structures and statistical hyperparameters.

• θ: Population-level, or global, latent variables relevant to all members.


Figure 1: Internal independence constraints for a broad class of composable generative population models. All nodes in the diagram are multidimensional. Internally, the hyperparameters α are fixed and known quantities. The global latents θ are shared by all members of the population. Member-specific latents zr interact only with their corresponding observations xr, as well as other member latents {zs : s ≠ r}, as indicated by the dashed loop around the plate. Nodes xr and xs across different members r and s are independent conditioned on their member latents. However, general dependencies are permitted within elements {x[r,i] : i ∈ [O]} of node xr. The input variables yr are ambient conditioning variables in the population and are always observed; in general, yr may be the output of another CGPM (Section 5.2). Externally, G is specified by an opaque binary, e.g. a probabilistic program, describing the data generating process, and outputs and inputs that specify the variable names for simulate and logpdf.

• zr = (z[r,1],..., z[r, L]): Member-specific latent variables governing only member r directly. A subset of these variables may be exposed, and treated as queryable output variables.

• xr = (x[r,1],..., x[r,O]): Output variables representing observable attributes of member r.

• yr = (y[r,1],... y[r, I]): Input variables that must be present for any query about xr , such as the “feature vectors” in a discriminative model.

The notion of global and local latent variables is a common motif in the hierarchical modeling literature. They are useful in specifying the set of constraints governing the dependence between observable variables in terms of some latent structure. From this lens, CGPMs satisfy the following conditional independence constraint,

\[
\forall r \neq s \in \mathbb{N},\ \forall j, k \in [O]:\quad x_{[r,j]} \perp\!\!\!\perp x_{[s,k]} \mid \{\alpha, \theta, \mathbf{z}_r, \mathbf{z}_s\}. \tag{1}
\]

Equation (1) formalizes the notion that all dependencies across members r ∈ N are fully mediated by the global parameters θ and member-specific variables {zr}. However, elements x[r,j] and x[r,i] within a member are free to assume any dependence structure, allowing for arbitrary intra-row dependencies. A common specialization of constraint (1) further requires that the member-specific latent variables {zr} are conditionally independent given θ (a comprehensive list of models in machine learning and statistics satisfying this additional constraint is given in (Hoffman et al., 2013, Section 2.1)). However, CGPMs permit more general dependencies in that member latents may be coupled conditioned on θ, thus allowing for complex inter-row dependencies. Figure 1 summarizes these ideas by showing a CGPM as a graphical model.

3.4 Composable generative population models are an abstraction for probabilistic processes

By providing a computational interface, CGPMs separate the internal implementation of a probabilistic model from the generative process it represents. In this section we explore how the computational (external) description of a CGPM provides a fundamentally different understanding than its statistical (internal) description. As an example, consider a Dirichlet process mixture model (Antoniak, 1974) expressed as a CGPM. The hyperparameters α = (H, γ, F) are the base measure H, concentration parameter γ, and parametric distribution F of the observable variables {xr}. The member-specific latent variable zr = (zr) is the cluster assignment of r. Consider now two different representations of the underlying DP, each leading to a different notion of (i) the population parameters θ, and (ii) the conditional independence constraints.

• In the stick-breaking representation (Sethuraman, 1994), the population parameters are θ = {(φi, πi) : i ∈ N}, where the φi are the atoms that parameterize the likelihood F(·|φi) (drawn i.i.d. from H) and the πi their weights (drawn jointly from GEM(γ)). Conditioned on {α, θ}, the member latents are independent, zr ∼ Categorical({π1, π2, ...}) i.i.d.

• In the Chinese restaurant process representation (Aldous, 1985), the population parameters θ = {φi : i ∈ N} are now only the atoms, and the weights are fully collapsed out. Conditioned on {α, θ}, the member latents are exchangeably coupled, {z1, z2, ...} ∼ Crp(γ).

These internal representation choices are not exposed by the CGPM interface and may be interchanged without altering the queries it can answer. (However, it is important to note that this interchange may result in different performance characteristics, such as compute time or the approximation quality of simulate and logpdf.) It follows that the computational description of CGPMs provides an abstraction boundary between a particular implementation of a probabilistic model and the generative process for the population that it represents. It is also possible for a CGPM to fully implement the interface without admitting a "natural" representation in terms of the graphical structure in Figure 1, as illustrated in Section 4.

The encapsulation of a CGPM's internal state can be relaxed by asking the CGPM to expose some member-specific latent variables as outputs. In terms of the infinite table metaphor from Section 3.1, this operation may be conceptualized as the CGPM "fantasizing" the existence of new columns in the underlying population. Providing a gateway into the internal state of a CGPM trades off the model independence of the interface with the ability to query the hidden structure of a particular probabilistic process. Section 5 describes surface-level syntaxes that allow for exposing latent variables, and Section 6.1 illustrates its utility for inferring the latent cluster assignments in an infinite mixture model, as well as simulating projections of high-dimensional data onto low-dimensional latent subspaces.

4. Algorithmic Implementations of Composable Generative Population Models

In this section, we illustrate that the computational abstraction of CGPMs is applicable to broad classes of models, and can be used to uniformly express several modeling approaches and philosophies. Table 1 shows the collection of models whose internal structure we will develop from the perspective of CGPMs. Section 6 shows both comparisons of these CGPMs and their practical application to data analysis tasks.

Composable Generative Population Model                   Modeling Approach
Section 4.2  Cross-Categorization                        non-parametric Bayesian generative modeling
Section 4.3  Ensemble Classifiers and Regressors         discriminative machine learning
Section 4.4  Factor Analysis & Probabilistic PCA         dimensionality reduction
Section 4.5  Parametric Mixture of Experts               discriminative statistical modeling
Section 4.6  Generative Nearest Neighbors                clustering based generative modeling
Section 4.7  Multivariate Kernel Density Estimation      classical multivariate statistics
Section 4.8  Probabilistic Programs in VentureScript     probabilistic programming

Table 1: Examples of composable generative population models, and the modeling approach for data analysis to which each belongs.

The two methods from the interface in Listing 1 whose algorithmic implementations we outline for each CGPM are

• s ← simulate(G, member: r, query: Q = {q_k}, evidence: E = {x[r,e_j]} ∪ yr)
  Generate a sample from the distribution s ∼_G x[r,Q] | {x[r,E], yr, D}.

• c ← logpdf(G, member: r, query: Q = {x[r,q_k]}, evidence: E = {x[r,e_j]} ∪ yr)
  Evaluate the log density log pG(x[r,Q] | {x[r,E], yr, D}).

In both simulate and logpdf, the target distributions for the query variables x[r,Q] require an implementation of two operations:

• Conditioning on the evidence variables x[r,E], in addition to the input variables yr and the entire measurement set D.

• Marginalizing over all output variables {x[r,i] : i ∈ [O]\(E ∪ Q)} not in the query or evidence.

Both conditioning and marginalizing over joint distributions allow users of CGPMs to pose non-trivial queries about populations that arise in multivariate probabilistic data analysis. All our algorithms generally assume that the information known about member r in simulate and logpdf is only what is provided for the evidence parameter. Extending the implementations to deal with observed members r′ ∈ D is mostly straightforward and often implementation-specific. We also note that the figures in these subsections contain excerpts of probabilistic code in the Bayesian Query Language, Metamodeling Language, and VentureScript; most of their syntaxes are outlined in Section 5. Finally, we leave the many possible implementations of infer for each CGPM, which learns the latent state using observed data, primarily to external references.

4.1 Primitive univariate distributions and statistical data types

The statistical data type of a population variable vt provides a more refined taxonomy than the "observation space" Vt described in Section 3.1. Table 2 shows the collection of statistical data types available in the Metamodeling Language (Section 5.3), out of which more complex CGPMs are built. The (parameterized) support of a statistical type defines the set in which samples from simulate take values. Each statistical type is also associated with a base measure which ensures logpdf is well-defined. In high-dimensional populations with heterogeneous types, logpdf is taken against the product measure of these univariate base measures. The statistical type also identifies invariants that the variable maintains. For instance, the values of a NOMINAL variable are permutation-invariant; the distance between two values for a CYCLIC variable is defined circularly (modulo the period), etc. The final column in Table 2 shows the primitive univariate CGPMs that are compatible with each statistical type. For these simple CGPMs, logpdf is implemented directly from their probability density functions, and algorithms for simulate are well-known (Devroye, 1986). For infer, the CGPMs may have fixed parameters, or learn from data using, e.g., maximum likelihood (Casella and Berger, 2002, Ch. 7) or Bayesian priors (Fink, 1997).

Statistical Data Type   Parameters         Support               Measure/σ-Algebra   Primitive Univariate CGPM
BINARY                  –                  {0, 1}                (#, 2^{0,1})        BERNOULLI
NOMINAL                 symbols: S         {0, 1, ..., S−1}      (#, 2^[S])          CATEGORICAL
COUNT/RATE              base: b            {0, 1/b, 2/b, ...}    (#, 2^N)            POISSON, GEOMETRIC
CYCLIC                  period: p          (0, p)                (λ, B(R))           VON-MISES
MAGNITUDE               –                  (0, ∞)                (λ, B(R))           LOGNORMAL, EXPONENTIAL
NUMERICAL               –                  (−∞, ∞)               (λ, B(R))           NORMAL
NUMERICAL-RANGED        low: l, high: h    (l, h) ⊂ R            (λ, B(R))           BETA, NORMAL-TRUNC

Table 2: Statistical data types, and their supports, base measures, and primitive CGPMs.

Figure 2: Samples from the primitive CGPMs of each statistical data type.
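As an illustration of a primitive univariate CGPM, the sketch below implements a NUMERICAL (Normal) CGPM whose infer performs maximum-likelihood updates of its parameters; it is a simplified, hypothetical implementation under the interface of Listing 1, not production code.

# Hypothetical primitive univariate CGPM for the NUMERICAL statistical type:
# a Normal whose parameters are fit by maximum likelihood in `infer`.
import math, random

class NormalCGPM:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.data = {}            # member id -> measured value
        self.mu, self.sigma = 0.0, 1.0

    def incorporate(self, member, value):
        self.data[member] = value

    def infer(self):
        # Maximum-likelihood estimates of (mu, sigma) from the measurements.
        values = list(self.data.values())
        if values:
            self.mu = sum(values) / len(values)
            var = sum((v - self.mu) ** 2 for v in values) / len(values)
            self.sigma = math.sqrt(var) or 1.0
        # (A Bayesian variant would place a prior on (mu, sigma) instead.)

    def simulate(self, member):
        return self.rng.gauss(self.mu, self.sigma)

    def logpdf(self, member, value):
        z = (value - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2 * math.pi))

gpm = NormalCGPM(seed=1)
for r, x in enumerate([1.2, 0.8, 1.1, 0.9]):
    gpm.incorporate(r, x)
gpm.infer()
print(gpm.simulate(member=4), gpm.logpdf(member=4, value=1.0))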

4.2 Cross-Categorization

Cross-Categorization (CrossCat) is a Bayesian non-parametric method for learning the joint distribution over all variables in a heterogeneous, high-dimensional population (Mansinghka et al., 2015b). The generative process begins by first partitioning the set of variables (v1,...,vT) into blocks. This step is known as CrossCat's "outer" clustering, since it partitions the "columns" (when viewing the population in terms of its infinite table representation from Section 3.1). Let π denote the variable partition, and {B_i : i ∈ |π|} denote its blocks. π is a global latent variable which dictates the structural dependencies between variables; any collection of variables in different blocks are mutually independent, and all variables in the same block are mutually dependent. It follows that for each member r, the joint distribution for xr factorizes,

\[
p_{\mathcal{G}}(\mathbf{x}_r \mid \theta) = \prod_{B \in \pi} p_{\mathcal{G}}(\mathbf{x}_{[r,B]} \mid \theta_B).
\]

The bundle of global parameters θ includes π as well as a set of block-specific latent variables {θ_B}_{B∈π}. Within each block B of dependent variables, the elements {x[r,i] : i ∈ B} are conditionally independent given a member-specific latent variable z[r,B] ∈ N. This variable is known as an "inner" clustering assignment in CrossCat, since it specifies the cluster identity of row r with respect to the variables in block B. The joint distribution over elements then factorizes,

\[
p_{\mathcal{G}}(\mathbf{x}_{[r,B]} \mid \theta_B) = \sum_{k} \left( \prod_{i \in B} p_{\mathcal{G}}(x_{[r,i]} \mid \phi_{[i,k]}) \right) p_{\mathcal{G}}(z_{[r,B]} = k \mid \omega_B). \tag{2}
\]

The global parameter φ[i,k] parameterizes the primitive univariate CGPM (of the appropriate statistical type) for vi in cluster k, and ωB is a parameter governing the distribution of the latent variable z[r,B]. This description fully specifies the CrossCat factorization of the joint distribution pG(xr | θ). This generative template can be encoded into a hierarchical Bayesian model by specifying priors over the partition π, the mixture weights ωB in each block B ∈ π, and the distributional parameters φ[i,k]. In contrast to (Mansinghka et al., 2015b), we provide a fully uncollapsed representation of the generative process in Algorithm 2a.

Having described the generative process and established notation, we now outline algorithms for logpdf and simulate. Since CrossCat is a Bayesian CGPM, the distribution of interest pG(x[r,Q] | x[r,E], D) requires us to marginalize out the latent variables (θ, Z). Sampling from the posterior is covered in (Mansinghka et al., 2015b, Section 2.4), so we only focus on implementing simulate and logpdf assuming posterior samples of latents are available.1 These implementations are summarized in Algorithms 2b and 2c, where all routines have access to a posterior sample of the latent variables in Algorithm 2a. While our algorithms are based on an uncollapsed CrossCat, in practice, the Parameter-Prior and primitive CGPMs from lines 8 and 13 in Algorithm 2a form a conjugate pair. The density terms pG(x[r,c] | φ[c,k]) are computed by marginalizing φ[c,k], and using the sufficient statistics in cluster k along with the column hyperparameters λ_c, i.e. pG(x[r,c] | {x[r′,c] : z[r′,B] = k}, λ_c). This Rao-Blackwellization enhances the inferential quality and predictive performance of CrossCat, and the one-sample approximation on line 6 of Algorithm 2d, an instance of Algorithm 8 from (Neal, 2000), becomes exact.

1. Section 4.8 outlines the Monte Carlo estimator for aggregating the samples in a general probabilistic programming setting.

Algorithm 2a Forward sampling a population in the CrossCat CGPM.

1: α ∼ Crp-Concentration-Prior  ▷ sample a concentration for the outer CRP
2: π ∼ Crp(α | [T])  ▷ sample a partition of the variables {v1,...,vT}
3: for B ∈ π do  ▷ for each block B in the variable partition
4:    αB ∼ Crp-Concentration-Prior  ▷ sample a concentration for the inner CRP at B
5:    (ω[B,1], ω[B,2], ...) ∼ GEM(αB)  ▷ sample stick-breaking weights of its clusters
6: for i ∈ [T] do  ▷ for each variable vi in the population
7:    λi ∼ Parameter-Hyper-Prior  ▷ sample hyperparameters from a hyperprior
8:    (φ[i,1], φ[i,2], ...) ∼ Parameter-Prior(λi) i.i.d.  ▷ sample component distribution parameters
9: for r = 1, 2, ... do  ▷ for each member r in the population
10:   for B ∈ π do  ▷ for each block B in the variable partition
11:      z[r,B] ∼ Categorical(ωB)  ▷ sample the cluster assignment of r in B
12:      for i ∈ B do  ▷ for each variable vi in the block
13:         x[r,i] ∼ pG(· | φ[i, z[r,B]])  ▷ sample observable element vi for r

Algorithm 2b simulate for the CrossCat CGPM.

1: function Simulate
2:    x[r,Q] ← ∅  ▷ initialize empty sample
3:    for B ∈ π do  ▷ for each block B in the variable partition
4:       l ← Compute-Cluster-Probabilities(B)  ▷ retrieve posterior probabilities of proposal clusters
5:       z[r,B] ∼ Categorical(l)  ▷ sample a cluster
6:       for q ∈ (Q ∩ B) do  ▷ for each query variable in the block
7:          x[r,q] ∼ pG(· | φ[q, z[r,B]])  ▷ sample an observation element
8:    return x[r,Q]  ▷ overall sample of query variables

Algorithm 2c logpdf for the CrossCat CGPM.

1: function LogPdf
2:    for B ∈ π do  ▷ for each block B in the variable partition
3:       l ← Compute-Cluster-Probabilities(B)  ▷ retrieve posterior probabilities of proposal clusters
4:       K ← |l|  ▷ compute the number of proposed clusters
5:       t_B ← \sum_{k=1}^{K} \left[ \left( \prod_{q \in (Q \cap B)} p_{\mathcal{G}}(x_{[r,q]} \mid \phi_{[q,k]}) \right) \frac{l_k}{\sum_{k'=1}^{K} l_{k'}} \right]  ▷ compute the density for the query variables in B
6:    return \sum_{B \in \pi} \log(t_B)  ▷ overall log density of the query

Algorithm 2d Computing the cluster probabilities in a block of the CrossCat partition.

1: function Compute-Cluster-Probabilities(block: B)
2:    K ← max_{r′ ∈ D} {z[r′,B]}  ▷ compute the number of occupied clusters
3:    for k = 1, 2, ..., K do c_k ← |{r′ ∈ D : z[r′,B] = k}|  ▷ compute the number of members in each cluster
4:    for k = 1, 2, ..., K do  ▷ for each cluster k
5:       l_k ← \left( \frac{c_k}{\sum_j c_j + \alpha_B} \right) \prod_{e \in (E \cap B)} p_{\mathcal{G}}(x_{[r,e]} \mid \phi_{[e,k]})  ▷ compute the probability of r joining cluster k
6:    l_{K+1} ← \left( \frac{\alpha_B}{\sum_j c_j + \alpha_B} \right) \prod_{e \in (E \cap B)} p_{\mathcal{G}}(x_{[r,e]} \mid \phi_{[e,K+1]})  ▷ compute the probability of r in a singleton cluster
7:    return (l_1, ..., l_K, l_{K+1})  ▷ unnormalized probabilities of the proposed clusters
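To make the per-block computations concrete, the following Python sketch mirrors Algorithms 2b and 2d for a single block B. It assumes one posterior sample of the latent state is available through the hypothetical callbacks component_logpdf and component_simulate; all names and the data layout are illustrative assumptions, not the production CrossCat implementation.

# Illustrative Python mirroring Algorithms 2b/2d for one CrossCat block B.
# `cluster_of[r]` holds z[r,B] for observed members, `component_logpdf(e, k, x)`
# evaluates log p(x | phi[e,k]), and `component_simulate(q, k)` samples from it;
# all are assumed to come from one posterior sample of the latent state.
import math, random

def cluster_probabilities(cluster_of, alpha_B, evidence, component_logpdf):
    """Unnormalized log probabilities of r joining each existing or a new cluster."""
    K = max(cluster_of.values(), default=0)
    counts = [sum(1 for z in cluster_of.values() if z == k) for k in range(1, K + 1)]
    total = sum(counts) + alpha_B
    log_l = []
    for k in range(1, K + 2):                       # clusters 1..K plus singleton K+1
        crp = (counts[k - 1] if k <= K else alpha_B) / total
        log_like = sum(component_logpdf(e, k, x) for e, x in evidence.items())
        log_l.append(math.log(crp) + log_like)
    return log_l

def simulate_block(rng, cluster_of, alpha_B, evidence, query,
                   component_logpdf, component_simulate):
    """simulate for the query variables of one block, per Algorithm 2b."""
    log_l = cluster_probabilities(cluster_of, alpha_B, evidence, component_logpdf)
    weights = [math.exp(l - max(log_l)) for l in log_l]      # normalize safely
    k = rng.choices(range(1, len(weights) + 1), weights=weights)[0]
    return {q: component_simulate(q, k) for q in query}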

(a) Black dots represent observed samples from a noisy ring with decreasing noise level. Colored dots represent samples from CrossCat's posterior predictive after two minutes of analysis. The color of a point indicates its latent cluster assignment from CrossCat's inner Dirichlet process mixture. This panel illustrates a phenomenon known as the Bayes Occam's razor. At higher noise levels (left side plots) there is less evidence for patterns in the data, so the posterior prefers a less complex model with a small number of large clusters. At lower noise levels (right side plots) there is more evidence for the functional relationship, so the posterior prefers a more complex model with a large number of small clusters, which is required to emulate the ring.

(b) The heatmaps show the evolution of CrossCat’s posterior predictive density with increasing number of inference transitions, given a ring with fixed noise level (sixth ring from right in panel (a)). Brighter shades of green indicate greater density mass in the region. The surface plots to the right of each heatmap show the same density, projected in three dimensions. During early stages of inference, the density surface is unimodal and appears as a cloud in the 2D plane. Modalities and patterns in the data are captured with increasing inference, as the Markov chain centers on regions of high posterior mass in CrossCat’s latent state.

Figure 3: Using simulate and logpdf to study CrossCat’s emulation of a noisy ring.

4.3 Ensemble classifiers and regressors

In this section, we describe how to construct CGPMs for a class of ensemble-based classifiers and regressors that are common in machine learning. These CGPMs are not typically described by a graphical model (Section 3.3) yet are still able to satisfy the CGPM interface by implementing simulate and logpdf. For each member r, we assume the CGPM G generates a single output variable xr, and requires as input a feature vector yr = (y[r,1],..., y[r,I]). In an ensemble method, G carries a set of learners {L1,..., LK}, where each learner Lk returns a point prediction of xr given yr, denoted Lk(yr). As a simple example, G may represent a random forest, and each learner Lk a constituent decision tree. For infer, G may construct the ensemble of learners given measurements D using any meta-learning algorithm such as Boosting (Freund and Schapire, 1995), Bagging (Breiman, 1996), or others.

4.3.1 Classification

Let {1,..., S} denote the set of possible values for the output variable xr (this specification is consistent with a BINARY or NOMINAL statistical data type from Table 2 in Section 4.1). Given an input yr, the simplest strategy to define a probability for the event [xr = s] is to compute the proportion of learners in the ensemble who predict [Lk(yr) = s]. This baseline strategy guarantees that the discrete probabilities sum to 1; however, it suffers from degeneracy in that simulate and logpdf are undefined when D is empty. To address this issue, we introduce a smoothing parameter α. With probability α, the output xr is uniform over the S symbols, and with probability (1 − α), it is an aggregate of the outputs from the learners,

\[
p_{\mathcal{G}}(x_r \mid \mathbf{y}_r, D) = \sum_{s=1}^{S} \mathbb{I}[x_r = s] \left( (1-\alpha) \left( \frac{1}{K} \sum_{k=1}^{K} \mathbb{I}[L_k(\mathbf{y}_r) = s] \right) + \alpha \frac{1}{S} \right). \tag{3}
\]

In practice, a prior is placed on the smoothing parameter, α ∼ Uniform([0,1]), which is transitioned by gridded Gibbs sampling (Ritter and Tanner, 1992) over the prediction likelihood on the measurement set. When the distribution of xr given yr is in the hypothesis space of the learners, we expect that \(\lim_{n\to\infty} p_{\mathcal{G}}(\alpha \mid \mathcal{G}, D_n) = 0\). Both simulate and logpdf can be implemented directly from (3).
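A direct transcription of (3) can be sketched as follows, where the learners are assumed to be objects exposing a predict method that returns a class label, and the smoothing parameter α is held fixed rather than transitioned by gridded Gibbs sampling; names are illustrative.

# Illustrative implementation of equation (3): an ensemble classifier CGPM
# with uniform smoothing. `learners` are assumed to expose predict(y) -> label.
import math, random

def ensemble_logpdf(x, y, learners, num_symbols, alpha=0.1):
    """log p(x | y) as the smoothed fraction of learners predicting x."""
    votes = sum(1 for learner in learners if learner.predict(y) == x)
    p = (1 - alpha) * votes / len(learners) + alpha / num_symbols
    return math.log(p)

def ensemble_simulate(rng, y, learners, num_symbols, alpha=0.1):
    """Sample a label from the smoothed ensemble distribution; rng is random.Random."""
    probs = [math.exp(ensemble_logpdf(s, y, learners, num_symbols, alpha))
             for s in range(num_symbols)]
    return rng.choices(range(num_symbols), weights=probs)[0]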

4.3.2 Regression

In the regression setting, the predictions {Lk(yr)} returned by each learner are real-valued, and so the discrete aggregation strategy from (3) does not lead to a well-defined implementation of logpdf. Instead, for an input vector yr the ensemble-based regression CGPM G first computes the set of predictions {L1(yr),..., LK(yr)}, and then incorporates them into a primitive univariate CGPM compatible with the statistical type of the output variable. This strategy fits a noise model appropriate to the statistical type, based on the variability in the responses from the learners. The implementations of logpdf and simulate are directly inherited from the constructed primitive CGPM.
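Under the same assumptions, the regression variant can be sketched by fitting a primitive Normal CGPM (as in Section 4.1) to the learners' point predictions; the factory argument below is a hypothetical stand-in for choosing a primitive of the appropriate statistical type.

# Hypothetical sketch of the ensemble regression CGPM: fit a primitive CGPM
# (e.g. the NormalCGPM sketch from Section 4.1) to the learners' predictions.
def ensemble_regression_cgpm(y, learners, make_primitive_cgpm):
    primitive = make_primitive_cgpm()              # e.g. lambda: NormalCGPM(seed=0)
    for k, learner in enumerate(learners):
        primitive.incorporate(k, learner.predict(y))
    primitive.infer()
    return primitive                               # simulate/logpdf are inherited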

4.4 Factor analysis & probabilistic PCA

Our development of factor analysis closely follows (Murphy, 2012, Chapter 12); we extend the exposition to describe implementations of simulate and logpdf for arbitrary patterns of latent and observable variables.

Factor analysis is a continuous latent variable model where the vector of output variables xr = (x[r,1],..., x[r,D]) is a noisy linear combination of a set of L basis vectors {w1,...,wL},

\[
\mathbf{x}_r = \boldsymbol{\mu} + \mathbf{w}_1 z_{[r,1]} + \mathbf{w}_2 z_{[r,2]} + \cdots + \mathbf{w}_L z_{[r,L]} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \text{Normal}(\mathbf{0}, \text{diag}(\psi_1, \ldots, \psi_D)). \tag{4}
\]

Each basis vector wl is a D-dimensional vector and the dimension of the latent space L (a hyperparameter) is less than D. The member latents zr ∈ R^L are known as factor scores, and they represent a low-dimensional projection of xr. The global latents are the bases W = [w1 ... wL], the covariance matrix Ψ of the noise ε, and the mean vector µ of xr. To specify a generative model, the member-specific latent variables are given a prior zr ∼ Normal(0, I). Combining this prior with (4), the joint distribution over the latent and observable variables is

\[
\mathbf{s}_r = \begin{pmatrix} \mathbf{z}_r \\ \mathbf{x}_r \end{pmatrix} \sim \text{Normal}\left( \mathbf{m} = \begin{pmatrix} \mathbf{0} \\ \boldsymbol{\mu} \end{pmatrix},\ \boldsymbol{\Sigma} = \begin{pmatrix} I_{L \times L} & W^{\top}_{L \times D} \\ W_{D \times L} & (WW^{\top} + \Psi)_{D \times D} \end{pmatrix} \right), \tag{5}
\]

where we have defined the joint vector sr = (zr, xr) ∈ R^{D+L}. The CGPM G implementing factor analysis exposes the member-specific latent variables as output variables. The multivariate normal (5) provides the ingredients for simulate and logpdf on any pattern of latent and observable variables with query s[r,Q] and evidence s[r,E]. To arrive at the target distribution, the Bayes theorem for Gaussians (Bishop, 2006) is invoked in a two-step process.

\[
\textbf{Marginalize} \quad \mathbf{s}_{[r, Q \cup E]} \sim \text{Normal}\left( \begin{pmatrix} \boldsymbol{\mu}_Q \\ \boldsymbol{\mu}_E \end{pmatrix},\ \begin{pmatrix} \Sigma_Q & \Sigma_{Q \cup E} \\ \Sigma_{Q \cup E}^{\top} & \Sigma_E \end{pmatrix} \right)
\]
\[
\textbf{Condition} \quad \mathbf{s}_{[r,Q]} \mid \mathbf{s}_{[r,E]} \sim \text{Normal}\left( \boldsymbol{\mu}_Q + \Sigma_{Q \cup E} \Sigma_E^{-1} (\mathbf{s}_{[r,E]} - \boldsymbol{\mu}_E),\ \Sigma_Q - \Sigma_{Q \cup E} \Sigma_E^{-1} \Sigma_{Q \cup E}^{\top} \right)
\]

There are several ways to implement infer, such as expectation maximization for factor analysis (Ghahramani and Hinton, 1997) or posterior inference in the Bayesian setting (Press et al., 1997). Finally, probabilistic principal component analysis (Tipping and Bishop, 1999) is recovered when the covariance of ε is further constrained to satisfy ψ1 = ··· = ψD.
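The marginalize/condition steps can be sketched in numpy as below, with Q and E holding index lists into the joint vector sr from (5); the variable names and the toy parameter values are illustrative assumptions, not the BayesDB implementation.

# Illustrative Gaussian marginalization and conditioning for the factor
# analysis CGPM, following equation (5). Q and E index into s_r = (z_r, x_r).
import numpy as np

def conditional_gaussian(m, Sigma, Q, E, s_E):
    """Return mean and covariance of s[Q] | s[E] = s_E under Normal(m, Sigma)."""
    m_Q, m_E = m[Q], m[E]
    S_QQ = Sigma[np.ix_(Q, Q)]
    S_QE = Sigma[np.ix_(Q, E)]
    S_EE = Sigma[np.ix_(E, E)]
    solve = np.linalg.solve(S_EE, s_E - m_E)             # Sigma_E^{-1} (s_E - mu_E)
    mean = m_Q + S_QE @ solve
    cov = S_QQ - S_QE @ np.linalg.solve(S_EE, S_QE.T)
    return mean, cov

# Toy example: L=1 latent factor, D=2 outputs, W = [[1.0], [0.5]], Psi = 0.1*I.
W = np.array([[1.0], [0.5]])
Psi = 0.1 * np.eye(2)
mu = np.array([2.0, 1.0])
m = np.concatenate([np.zeros(1), mu])                     # joint mean (0, mu)
Sigma = np.block([[np.eye(1), W.T], [W, W @ W.T + Psi]])  # joint covariance
# Condition the latent factor score (index 0) on observing x_{r,1} (index 1).
mean, cov = conditional_gaussian(m, Sigma, Q=[0], E=[1], s_E=np.array([3.0]))
print(mean, cov)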

%mml CREATE TABLE iris FROM 'iris.csv';
%mml CREATE POPULATION p FOR iris (GUESS (*));
%mml CREATE METAMODEL m FOR p (
....    OVERRIDE GENERATIVE MODEL FOR
....    sepal_length, sepal_width,
....    petal_length, petal_width
....    AND EXPOSE
....    flower_pc1 NUMERICAL,
....    flower_pc2 NUMERICAL
....    USING probabilistic_pca(L=2));
%mml INITIALIZE 1 MODEL FOR m;
%mml ANALYZE m FOR 10 ITERATION;
%bql .scatter
....    INFER EXPLICIT
....    PREDICT flower_pc1 USING 10 SAMPLES,
....    PREDICT flower_pc2 USING 10 SAMPLES,
....    flower_name
....    FROM p;

Figure 4: Low-dimensional projection of flowers in the iris dataset using the probabilistic PCA CGPM (scatter legend: setosa, versicolor, virginica). The two latent principal component scores are exposed as queryable outputs in BQL.

4.5 Parametric mixture of experts

The mixture of experts (Jacobs et al., 1991) is a regression model for data which exhibit highly non-linear characteristics, such as heteroskedastic noise and piecewise continuous patterns. Let G be a CGPM which generates output variables xr = (x[r,1],..., x[r,T]) given input variables yr, using a mixture of local parametric models. The member latent variable zr = (zr) takes values in [K] (possibly unbounded), which induces a Naive Bayes factorization over the outputs

\[
p_{\mathcal{G}}(\mathbf{x}_{[r,Q]} \mid \mathbf{y}_r, \theta) = \sum_{k=1}^{K} \left( \prod_{t=1}^{T} p_{\mathcal{G}}(x_{[r,t]} \mid \mathbf{y}_r, \gamma_{[t,k]}) \right) p_{\mathcal{G}}(z_r = k \mid \mathbf{y}_r, \theta), \tag{6}
\]

where γ[t,k] are the regression parameters for variable x[r,t] when zr = k. While (6) looks similar to the Naive Bayes factorization (2) from CrossCat, they differ in important ways. First, in CrossCat, the variables x[r,t] are sampled from primitive univariate CGPMs, while in the mixture of experts they are sampled from a discriminative CGPM conditioned on yr. The term pG(x[r,t] | yr, γ[t,k]) may be any conditional density of the correct statistical data type (such as a Gaussian for NUMERICAL, a logistic regression for NOMINAL, or a Poisson regression for COUNTS). Second, the mixture of experts has a "gating function" for pG(zr = k | yr, θ) which is also conditioned on yr and may be a general function such as a softmax or even a Dirichlet process mixture (Hannah et al., 2011). In CrossCat, the member latents z[r,B] are necessarily given a CRP prior in each block. We leave out implementations of simulate and logpdf, and refer to Figure 5 for a comparison of posterior samples from CrossCat and the mixture of experts given data from a piecewise continuous function.


(a) CrossCat (b) Mixture of linear regression experts

Figure 5: Posterior samples from CrossCat and mixture of experts given a piecewise continuous linear function. Observed data points are shown in black, and posterior samples are shown in color, which represents a latent cluster assignment internal to each CGPM. (a) CrossCat emulates the curve using a mixture of axis-aligned Gaussians, requiring a larger number of small, noisy clusters. (b) Mixture of linear regression experts identifies the two linear regimes and is able to interpolate well (red dots in top curve). The two orange datapoints that appear as outliers are samples from a "singleton" cluster, since the gating function is implemented using a Dirichlet process mixture.

4.6 Generative nearest neighbors

In this section, we present a composable generative population model which implements simulate and logpdf by building ad-hoc statistical models on a per-query basis. The method is an extension of K nearest neighbors, which is typically used for univariate classification and regression (Peterson, 2009), to multivariate generative modeling.

Let G be a generative nearest neighbor CGPM, and let x[r,Q] and x[r,E] denote the query and evidence for a simulate or logpdf query. The method begins by first finding the K nearest neighbors to r in the dataset D, based on the values of the evidence variables x[r,E]. Let N_E denote the top K neighbors, whose generic member x_k ∈ N_E we call an "exemplar" k. The second step is to find the L nearest neighbors of each x_k, based on both its query and evidence dimensions x[k,Q∪E]. This neighborhood L_k is called the "locality" of exemplar k. Within L_k, we assume the query variables Q are independent, and learn a CGPM G_k = {G[k,q] : q ∈ Q} which is a product of primitive univariate CGPMs G[k,q] (based on the appropriate statistical data type of each variable q from Table 2). This procedure is summarized in Algorithm 3c. Given the set of K product CGPMs {G_k : k ∈ [K]}, simulate and logpdf follow directly, as summarized in Algorithms 3a and 3b.

It is worth emphasizing why building independent models within the localities {L_k : k ∈ [K]}, and not the neighborhood N_E, is important for the generative nearest neighbor to work effectively. In general, the query variables x[r,Q] may still be highly dependent conditioned on x[r,E]; in the limiting case, the evidence could be independent of all the perfectly correlated query variables. Second, we note that marginalization over variables not in Q or E, in the context of this CGPM, is achieved by dropping those dimensions from the neighbor searches.


%mml CREATE METAMODEL xcross_m WITH BASELINE gknn(K=?) FOR xcross;
%bql .scatter SIMULATE x, y, z FROM xcross_m LIMIT 50;

Figure 6: Posterior samples from the generative nearest neighbors CGPM given an x-cross, for 10 localities with varying sizes. The leftmost plot shows samples from the synthetic x-cross data generator. It produces three variables: x and y are real-valued and are scattered in the 2D plane, and z is a binary variable indicating the functional regime. Parameter K specifies the number of local models to build. The remaining plots show samples from the CGPM for K=2,4,6,8. When K is small, the samples appear "patchy" and capture only local patterns; when K is large, the localities "smooth out" the dependence between x and y.

Parameter                               Symbol
no. of neighbors                        K
neighborhood of r                       N_E
exemplar k                              x_k ∈ N_E
locality of exemplar k                  L_k
product CGPM in locality k              G_k
primitive CGPM for q in locality k      G_[k,q]
evidence set                            x_[r,E]
query set                               x_[r,Q]

Figure 7: Parameters and symbols used in Algorithms 3a, 3b, and 3c.

Algorithm 3a simulate for generative nearest neighbors CGPM.

1: x[r,Q] ← ∅  ▷ initialize empty sample
2: (G1,..., GK) ← Build-Local-Cgpms()  ▷ retrieve the local parametric CGPMs
3: j ∼ Uniform({1,..., K})  ▷ sample a local CGPM j uniformly at random
4: for q ∈ Q do  ▷ for each query variable q
5:    x[r,q] ← simulate(G[j,q], r, {q}, ∅)  ▷ sample from the primitive CGPM in model j
6: return x[r,Q]  ▷ overall sample of query variables

Algorithm 3b logpdf for generative nearest neighbors CGPM.

1: (G1,..., GK) ← Build-Local-Cgpms()  ▷ retrieve the local parametric CGPMs
2: for k = 1,..., K do  ▷ for each local CGPM
3:    log wk ← 0  ▷ initialize the density estimate
4:    for q ∈ Q do  ▷ for each query variable q
5:       log wk ← log wk + logpdf(G[k,q], r, x[r,q], ∅)  ▷ accumulate the density of q
6: return log(\sum_{k=1}^{K} w_k) − log(K)  ▷ aggregate of the density estimates

Algorithm 3c Building local parametric models in the generative nearest neighbor CGPM.

1: function Build-Local-Cgpms
2:    D_{Q∪E} ← {x[r′,Q∪E] : x_{r′} ∈ D}  ▷ marginalize by exclusion from the neighbor search
3:    N ← Nearest-Neighbors(K, D_E, x[r,E])  ▷ find the K nearest neighbors of r
4:    for k ∈ N do  ▷ for each exemplar k
5:       L_k ← Nearest-Neighbors(L, D_{Q∪E}, x[k,Q∪E])  ▷ find the L nearest neighbors of k
6:       G_k ← ∅  ▷ initialize the product CGPM of k
7:       for q ∈ Q do  ▷ for each query variable q
8:          G[k,q] ← Primitive-Univariate-Cgpm  ▷ initialize a primitive CGPM
9:          for j ∈ L_k do  ▷ for each member of the locality
10:            G[k,q] ← incorporate(G[k,q], j, x[j,q])  ▷ incorporate into the primitive CGPM
11:         G[k,q] ← infer(G[k,q], T_ML)  ▷ transition the primitive CGPM
12:   return (G_k : k ∈ N)  ▷ collection of product CGPMs for each exemplar

4.7 Multivariate kernel density estimation

In this section, we show how to express multivariate kernel density estimation with mixed data types, as developed by (Racine and Li, 2004), using CGPMs. Similarly to ensemble methods (Section 4.3), this approach implements the CGPM interface without admitting a natural representation in terms of the graphical model in Figure 1. We extend the exposition of (Racine and Li, 2004) to include algorithms for conditional sampling and density assessment. Given measurements D, the joint distribution over the variables of xr is estimated non-parametrically

\[
p_{\mathcal{G}}(\mathbf{x}_r \mid D) = \frac{1}{|D|} \sum_{r' \in D} K(\mathbf{x}_r \mid \boldsymbol{\gamma}) = \frac{1}{|D|} \sum_{r' \in D} \left( \prod_{i \in [O]} \frac{1}{\gamma_i} K_i\!\left( x_{[r,i]}, x_{[r',i]} \mid \gamma_i \right) \right), \tag{7}
\]

where K(xr | γ) is a product kernel and γ is a global parameter containing the bandwidths for each kernel Ki. It is important to emphasize that using a product kernel does not imply independence of elements x[r,i] and x[r,j] within a member. The bandwidths are typically learned by cross-validation or maximum likelihood. The kernel for a NOMINAL statistical type with S symbols is \(K_q(x, x' \mid \gamma_q) = (1-\gamma_q)\mathbb{I}[x = x'] + \frac{\gamma_q}{S-1}\mathbb{I}[x \neq x']\) from (Aitchison and Aitken, 1976), and the kernel for a NUMERICAL statistical type is a second-order Gaussian \(K_q(x, x' \mid \gamma_q) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x - x'}{\gamma_q} \right)^2 \right)\). To implement simulate and logpdf, we first show how the product kernel in (7) ensures that marginalization is tractable,

\[
\begin{aligned}
\textbf{Marginalize} \quad p_{\mathcal{G}}(\mathbf{x}_{[r,Q]} \mid D)
&= \int_{\mathbf{x}_{[r,\setminus Q]}} p_{\mathcal{G}}(\mathbf{x}_{[r,Q]}, \mathbf{x}_{[r,\setminus Q]}) \, d\mathbf{x}_{[r,\setminus Q]}
 = \frac{1}{|D|} \sum_{r' \in D} \int_{\mathbf{x}_{[r,\setminus Q]}} K(\mathbf{x}_r \mid \boldsymbol{\gamma}) \, d\mathbf{x}_{[r,\setminus Q]} \\
&= \frac{1}{|D|} \sum_{r' \in D} \int_{\mathbf{x}_{[r,\setminus Q]}} \prod_{i \in [O]} \frac{1}{\gamma_i} K_i\!\left( x_{[r,i]}, x_{[r',i]} \mid \gamma_i \right) d\mathbf{x}_{[r,\setminus Q]} \\
&= \frac{1}{|D|} \sum_{r' \in D} \left( \prod_{q \in Q} \frac{1}{\gamma_q} K_q\!\left( x_{[r,q]}, x_{[r',q]} \mid \gamma_q \right) \right) \underbrace{\int_{\mathbf{x}_{[r,\setminus Q]}} \prod_{j \in \setminus Q} \frac{1}{\gamma_j} K_j\!\left( x_{[r,j]}, x_{[r',j]} \mid \gamma_j \right) d\mathbf{x}_{[r,\setminus Q]}}_{\text{density normalized to } 1} \\
&= \frac{1}{|D|} \sum_{r' \in D} \prod_{q \in Q} \frac{1}{\gamma_q} K_q\!\left( x_{[r,q]}, x_{[r',q]} \mid \gamma_q \right). \qquad (8)
\end{aligned}
\]

Conditioning is a direct application of the Bayes rule, where the numerator and denominator are computed separately using (8).

pG (x[r,Q],x[r,E]|D) Condition pG (x[r,Q]|x[r,E], D) = (9) pG (x[r, E]|D)

25 Combining (8) and (9) provides an immediate algorithm for logpdf. To implement simulate, we begin by ignoring the normalizing constant in the denominator of (9) which is unnecessary for sampling. We then express the numerator suggestively,

X Y 1   Y 1   p (x |x ,D) ∝ K x , x 0 |γ K x , x 0 |γ , (10) G [r,Q] [r, E] * γ q [r,q] [r ,q] q γ e [r,e] [r ,e] e + r0 ∈D .q ∈Q q e∈E e / . | {z }/ . 0 / . weight wr / . / 0 0 In particular, the simulate algorithm, first samples a member r ∼ Categorical({wr : -r ∈ D}), 0 where the weight wr is labeled in (10). Next, it samples the query elements x[r,q] independently from the corresponding kernels curried at r 0. Intuitively, the CGPM weights each member r 0 in the population by how well its local kernel explains the evidence x[r, E] known about r.

4.8 Probabilistic programs in VentureScript In this section, we show how to construct a composable generative population model directly in terms of its computational and statistical definitions from Section 3 by expressing it in the Venture- Script probabilistic programming language. For simplicity, this section assumes the CGPM satisfies a more refined conditional independence constraint than (1), namely

0 0 0 0 ∃q,q : (r,c) , (r ,c ) =⇒ x[r,c] ⊥⊥ x[r0,c0] | {α,θ, z[r,q], z[r0,q0],yr ,yr }. (11)

In words, for every observation element x[r,c], there exists a latent variable z[r,q] that (in addition to θ) mediates all coupling with other variables in the population. The member latent variables Z may still exhibit arbitrary dependencies within and among one another. While not essential, this require- ment simplifies exposition of the inference algorithms. The approach for simulate and logpdf is based on approximate inference in tagged subparts of the Venture trace.2 The CGPM carries a { }K |D set of K independent samples θk k=1 from an approximate posterior pG (θ ). These samples of global latent variables are assigned weights on a per-query basis. Since VentureScript CGPMs are Bayesian, the target distribution for simulate and logpdf marginalizes over all internal state, Z pG (x[r,Q]|x[r, E],D) = pG (x[r,Q]|x[r, E],θ,D)pG (θ|x[r,E],D)dθ (12) θ Z pG (x[r,E]|θ,D)p(θ|D) = p(x[r,Q]|x[r, E],θ,D) dθ θ pG (x[r, E]|D, G) 1 XK ≈ x |x θ D θ ∼G |D PK pG ( [r,Q] [r, E], k , )wk k . (13) k=1 wk k=1

The weight wk = pG (x[r,E]|θk ,D) is the likelihood of the evidence under θk . The weighting scheme (13) is a computational trade-off circumventing the requirement to run inference on popu- 3 lation parameters θ on a per-query basis, i.e. when given new evidence x[r, E] about r.

2. In Venture, every random choice may be in a scope which is divided into a set of blocks. The CGPM places each member r in its own scope, and each observable x[r,i] and latent z[r,i] element in a block within that scope. 3. An alternative strategy is to compute a estimator based directly on (12).

26 It suffices now to consider the target distribution under single sample θk : Z pG (x[r,Q]|x[r, E],θk ,D) = pG (x[r,Q],zr |x[r, E],θk ,D)dzr (14) zr Z Y = pG (x |z ,θ ) pG (z |x ,θ ,D)dz (15)  [r,q] r k r [r,E] k r  zr  q ∈Q  .* /+  1 XT  Y  ≈ , p (x |z ,θ -)  z ∼G |{x ,θ,D}. (16) T G [r,q] [t,r] k [t,r] [r, E] t=1 q ∈Q Eq (14) suggests that simulate for can be implemented by sampling from the joint local posterior {x[r,Q],zr |x[r, E],θk ,D}, and returning only elements x[r,Q]. Eq (16) shows that logpdf can be im- plemented by first sampling the member latents zr from the local posterior. By invoking conditional independence constraint (11) in Eq (15), the query x[r,Q] factors into a product of density terms for each element x[r,q] which can be evaluated directly. This description completes the algorithm for simulate and logpdf in trace θk , and is repeated for {θ1,...,θK }. The CGPM implements simulate by drawing a trace j ∼ Categorical({w1,...,wK }) and returning the sample x[r,Q] from θj . Similarly, logpdf is computed using the weighted Monte Carlo estimator (13). Algorithms 4a and 4b illustrate implementations in a general probabilistic programming environment.

%sql CREATE TABLE sin_t(x, y REAL); 1.0 %mml CREATE POPULATION sin_p FOR t WITH SCHEMA( 0.5 .... MODEL x, y AS NUMERICAL); 0.0 0.5 %mml CREATE METAMODEL sin_m FOR sin_p( 1.0 .... OVERRIDE MODEL FOR x USING 1.5 .... inline_venturescript(‘ 4 2 0 2 4 .... () ~> {uniform(low: -4.71, high: 4.71)} .... ’); 1.0 .... OVERRIDE MODEL FOR y GIVEN x USING 0.5 .... inline_venturescript(‘ 0.0 .... (x) ~> { 0.5 .... if (cos(x) > 0) { 1.0 .... uniform(low: cos(x)-0.5, high: cos(x))} 1.5 .... else { 4 2 0 2 4 .... uniform(low: cos(x), high: cos(x)+0.5)}} .... ’) 1.0 .... ); 0.5 %mml ANALYZE 1 MODEL for sin_m; 0.0 %bql .scatter SIMULATE x, y FROM sin_p LIMIT 100; 0.5 %bql .scatter SELECT x, 0.5 FROM( 1.0 .... SIMULATE x FROM sin_p GIVEN y=-0.75 LIMIT 50); 1.5 4 2 0 2 4

(a) (b)

Figure 8: Composing VentureScript expressions by compiling them into CGPMs. (a) Expres- sions in teal are lambda expressions in VentureScript. Both forward simulation (blue query) and inversion (red query) of the generative model are achieved by Algorithm 5a. This code is an in- stance of polyglot probabilistic programming; it includes expressions from two different languages interacting in a single program. (b) Top plot shows forward samples of x and y; middle shows sam- ples of x GIVEN y=-0.75, successfully capturing the two posterior modes; bottom is an overlay.

27 Parameter Symbol Parameter Symbol no. of trace instances K weight of trace k wk global latent variables in trace k θk sample of zr in trace k z[k,r] local latent variables in trace k Zk sample of x[r,Q] in trace k x[k,r,Q] observation set in trace k Dk no. of internal Monte Carlo samples T input variable yr t-th Monte Carlo sample of z[k,r] z[k,t,r] evidence set x[r, E] weighted density estimate in trace k qk

Table 3: Parameters and symbols used in Algorithms 4a and 4b.

Algorithm 4a simulate for CGPMs in a general probabilistic programming environment.

1: function Simulate 2: for k = 1,..., K do . for each trace k 3: wk ← Compute-Trace-Weight (k,x[r, E]) . retrieve the weight

4: j ∼ Categorical({w1,...,wk }) . importance resample the traces G 5: {x[j,r,Q],z[j,r]} ∼ |{θj ,Zj ,Dj } . transition operator leaving target invariant 6: return x[j,r,Q] . select samples of query set from resampled trace

Algorithm 4b logpdf for CGPMs in a general probabilistic programming environment.

1: function LogPdf 2: for k = 1,..., K do . for each trace k 3: wk ← Compute-Trace-Weight (k,x[r, E]) . retrieve the weight 4: for t = 1,...,T do . obtain T samples of latents in scope r G 5: z[k,t,r] ∼ |{θk ,Zk ,Dk } . transition operator leaving target invariant Q 6: h[k,t] ← q ∈Q p(x[r,q]|θk ,z[k,t,r]) . compute a density estimate 1 PT 7: rk ← T t=1 h[k,t] . aggregate density estimates by simple Monte Carlo 8: qk ← rk wk . importance weight the estimate PK  PK  9: return log k=1 qk − log k=1 wk . weighted importance sampling estimator

Algorithm 4c Computing the weight of a trace on a per-query basis.

1: function Compute-Trace-Weight (trace: k, evidence: x[r, E]) 2: Dk ← Dk ∪ yr . observe the input variable 3: if z[k,r] < Zk then . if member r has unknown local latents G 4: z[k,r] ∼ |{θk ,Zk ,Dk } . sample from the prior

5: Dk ← Dk ∪ x[r,E] . observe new evidence variables Q 6: wk ← pG (x[r,e]|θk ,z[k,r]) . weight by likelihood of x[r, E] e∈E 7: return wk

28 5. Integrating Conditional Generative Population Models into BayesDB Without probabilistic programming systems and languages that treat data analysis computationally, it is difficult to both utilize the expressive power of CGPMs and use general-purpose inference ma- chinery to develop and query them. In this section, we show how CGPMs have been integrated into BayesDB, a probabilistic programming platform with two languages: the Bayesian Query Lan- guage (BQL) for model-independent querying, and the Metamodeling Language (MML) for model discovery and building. We first describe how simple BQL queries map directly to invocations of the CGPM interface. We then show how to compose CGPMs into networks, and outline new expres- sions in MML used to construct populations and networks of CGPMs. The in Section 6 illustrate how extending BayesDB with CGPMs can be used for non-trivial data analysis tasks.

%mml CREATE TABLE t FROM "customers.csv" %mml CREATE POPULATION p FOR t( Data Tables Populations Metamodels .... GUESS STATTYPES FOR (*); .... MODEL age AS MAGNITUDE .... );

%mml CREATE METAMODEL m FOR p .... WITH BASELINE crosscat( MML Metamodeling Language CGPM .... SET CATEGORY MODEL Script (MML) Interpreter Library .... FOR age TO lognormal; .... OVERRIDE GENERATIVE MODEL .... FOR income GIVEN age, state .... USING linear_regression Composable Generative .... ); Population Models (CGPMs)

%mml INITIALIZE 4 MODELS FOR m; %mml ANALYZE m FOR 1 MINUTE;

%bql SIMULATE age, state BQL Bayesian Query Language .... GIVEN income = 145000 Query (BQL) Interpreter .... FROM p LIMIT 100;

age state income 29 CA 145000 Query Results 61 TX 145000 48 MA 145000

Figure 9: System architecture and modules that comprise BayesDB. The Metamodeling Lan- guage interpreter reads (i) population schemas to define variables and statistical types, (ii) meta- model definitions to apply automatic and custom modeling strategies for groups of variables in the population, and (iii) commands such as INITIALIZE, which instantiates an ensemble of CGPM net- works, and ANALYZE, which applies inference operators to CGPMs to learn from observed data. The Bayesian Query Language is a model-independent probabilistic query language that allows users to (i) ESTIMATE properties of CGPMs such strength and existence of dependence relationships be- tween variables, similarity between members, and conditional density queries, and (ii) SIMULATE missing or hypothetical observations subject to user-provided constraints. Together, these compo- nents allow users to build population models and query the probable implications of their data.

29 5.1 Querying composable generative population models using the Bayesian Query Language The BQL interpreter allows users to ask probabilistic questions about populations using a structured query language. Figure 10 shows how the BQL queries SIMULATE and ESTIMATE PROBABILITY OF translate into invocations of simulate and logpdf for an illustrative population and CGPM. BQL defines a large collection of row-wise and column-wise estimators for CGPMs (Mans- inghka et al., 2015a, Sec. 3.2.2), such as , DEPENDENCE PROBABILITY and SIMILIARITY WITH RESPECT TO. These quantities admit default implementations in terms of Monte Carlo estimators formed by simulate and logpdf, and any CGPM may override the BQL interpreter’s generic implementations with a custom, optimized implementation. A full de- scription of implementing BQL in terms of the CGPM interface is beyond the scope of this work.

rowid a b c d 1 57 2.5 Male 15 2 15 0.8 Female 10 3 NA 1.4 NA NA ...... r x[r,a] x[r,b] x[r,c] y[r,d] ......

(a) A population represented as an infinite table in BayesDB, modeled by a CGPM G which generates vari- ables a, b, and c as outputs, and requires variable d as input.

BQL SIMULATE a FROM G GIVEN d=12 WHERE rowid=3 LIMIT 2 CGPM simulate (G, member: 3, query: {a}, evidence: {(d,12)}) G Quantity si ∼ x[3,a]|{y[3,d] = 3, x[3,b] = 1.4,D} for i = 1,2 rowid a d Result 3 51 12 3 59 12

(b) Mapping a SIMULATE query to the CGPM interface invocation of simulate. The sampled quantity si also includes {x[3,b] = 1.4} as a conditioning value, which was extracted from the dataset D. The CGPM must condition on every observed value in D, as well as additional per-query constraints specified by the user, such as {y[3,d] = 3}. The result is a table with two rows corresponding to the two requested samples.

BQL ESTIMATE PROBABILITY OF a=49, c=‘MALE’ GIVEN d=12 FROM G WHERE rowid=3 CGPM logpdf( G, member: 3, query : {(a,49), (c,‘MALE’)} evidence : {(d,12)}) Quantity pG (x[3,a] = 49, x[3,c] = ‘MALE’|y[3,d] = 12, x[3,b] = 1.4,D) rowid a c d bql_pdf((a,c),(d)) Result 3 49 ‘Male’ 12 0.117

(c) Mapping an ESTIMATE PROBABILITY OF query to the CGPM interface invocation of logpdf. The output is a table with a single row that contains the value of the queried joint density.

Figure 10: Translating BQL queries into invocations of the CGPM interface.

30 5.2 Compositional networks of composable generative population models Our development of CGPMs has until now focused on their computational interface and their inter- nal probabilistic structures. In this section, we outline the mathematical formalism which justifies closure of CGPMs under input/output composition. For a collection of CGPMs {Gk : k ∈ [K]} op- erating on the same population P, we will show how they be organized into a generalized directed graph, which itself is a CGPM G[K]. This composition allows complex probabilistic models to be built from simpler CGPMs. They communicate with one another using the simulate and logpdf interface to answer queries against the overall network. In the next section, we describe the surface syntaxes in MML to construct networks of CGPMs in BayesDB. Let v = (v1,...,vT ) be the variables of P, and Ga be a CGPM which generates outputs vout = (vout ,...,vout ), accepts inputs vin = (vin ,...,vin ), and satisfies (vout ∪ vin ) ⊂ v. a [a,1] [a,Oa ] a [a,1] [a, Ia ] a a G out in Similarly, consider another CGPM b on the same population with outputs vb and inputs vb . G ◦ G vout G The composition ( [b, B] [a, A]) applies the subset of outputs [a, A] of a to the subset of inputs vin G Gc vout ∪ vout vin ∪ vout [b, B] of b, resulting in a new CGPM with output ( a b ) and input ( a [b,\B]). The out ∩ out G G rules of composition require that (va vb ) = ∅ i.e. a and b do not share any output, and vout vin P that [a, A] and [b, B] correspond to the same subset of variables in the original population . Gen- eralizing this idea further, a collection of CGPMs {Gk : k ∈ [K]} can thus be organized as a graph where node k represents internal CGPM Gk , and the labeled edge aA → bB denotes the compo- sition (G[b, B] ◦ G[a, A]). These labeled edges between different CGPMs in the network must form a directed acyclic graph. However, elements x[k,r,i] and x[k,r, j] of the same member r within any particular Gk are only required to satisfy constraint (1) which may in general follow directed and/or undirected dependencies. The topology of the overall CGPM network G[K] can be summarized by : { : ∈ [ ]} { : out ∈ vin } its generalized adjacency matrix π[K] = πk k K , where πk = (p,t) v[p,t] k is the set of all output elements from upstream CGPMs connected to the inputs of Gk . To illustrate that the class of CGPMs is closed under composition, we need to show how the network G[K] implements the interface. First note that G[K] produces as outputs the union of all output variables of its constituent CGPMs, and takes as inputs the collection of variables in the population which are not the output of any CGPM in the network. The latter collection of variables are “exogenous” to the network, and must be provided for queries that require them. The implementations of simulate and logpdf against G[K] are shown in Algorithms 5a and 5b. Both algorithms use an importance sampling scheme which combines the methods provided by each individual node Gk , and a shared forward-sampling subroutine in Algorithm 5c. The estimator for logpdf uses ratio likelihood weighting; both estimators derived from lines 2 and 4 of Algorithm 5b are computed using unnormalized importance sampling, so the ratio estimator on line 6 is exact in the infinite limit of importance samples J and J0. The algorithms explicitly pass the member id r between each CGPM so that they agree about which member-specific latent variables are relevant for the query, while preserving abstraction boundaries. 
The network’s infer method can be implemented by invoking infer separately on each in- ternal CGPM node. In general, several improvements on this baseline strategy are possible. Ap- proaches for a compositional implementation of infer against the entire network, as well as alter- nate estimators for compositional simulate and logpdf with different time/accuracy characteris- tics, are interesting areas for further research and are left to future work (Section 7).

31 Parameter Symbol number of importance samples J, J0 identifier of the population r indices of CGPM nodes in the network k = 1,2,..., K cgpm representing node k Gk parents of node k πk input variables exogenous to network for node k y[k,r]

query set for node k x[k,r,Qk ]

evidence set for node k x[k,r, Ek ] query/evidence sets aggregated over all nodes in network x[r, A] = ∪ x[k,r, A ] k ∈[K] k

Table 4: Parameters and symbols used in Algorithms 5a, 5b, and 5c. All parameters provided to the functions in which they appear. Weighted-Sample ignores query and evidence from the global environment, and is provided with an explicit set of constrained nodes by Simulate and LogPdf.

Algorithm 5a simulate in a directed acyclic network of CGPMs.

1: function Simulate 2: for j = 1,..., J do . generate J importance samples 3: (sj ,wj ) ← Weighted-Sample (x[r, E]) . retrieve sample weighted by evidence

4: m ← Categorical({w1,...,wJ }) . resample importance sample 5: return ∪ x[k,r,Q ] ∈ sm . overall sample of query variables k ∈[K] k

Algorithm 5b logpdf in a directed acyclic network of CGPMs.

1: function LogPdf 2: for j = 1,..., J do . generate J importance samples 3: (sj ,wj ) ← Weighted-Sample (x[r, E] ∪ x[r,Q]) . joint density of query/evidence 4: for j = 1,..., J0 do . generate J0 importance samples 5: 0 0 ← eighted ample (sj ,wj ) W -S (x[r, Ek ]) . marginal density of evidence P P  0 6: return log [J] wj / [J0] wj − log(J/J ) . likelihood ratio importance estimator

Algorithm 5c Weighted forward sampling in a directed acyclic network of CGPMs.

1: function Weighted-Sample (constraints: x[r,Ck ]) 2: (s,log w) ← (∅,0) . initialize empty sample with zero weight 3: for k ∈ TopoSort ({π1 . . . πK }) do . topologically sort the adjacency matrix 4: y˜[k,r] ← y[k,r] ∪ {x[p,r,t] ∈ s : (p,t) ∈ πk } . retrieve required inputs at node k ← G 5: log w log w + logpdf( k ,r,x[k,r,Ck ],y˜[k,r]) . update weight by constraint likelihood ← G \ ∪ 6: x[k,r,\Ck ] simulate( k ,r, Ck ,x[k,r,Ck ] y˜[k,r]) . simulate unconstrained nodes ← ∪ 7: s s x[k,r,Ck ∪\Ck ] . append to sample 8: return (s,w) . overall sample and its weight

32 5.3 Building populations and networks of composable generative population models with the Metamodeling Language As shown in Figure 9, the MML interpreter in BayesDB interacts with data tables and populations, metamodels, and a library of CGPMs. Population schemas are MML programs which are used to declare a list of variables and their statistical types. Every population is backed by a base table in BayesDB, which stores the measurements. Metamodel definitions are MML programs which are used to declare a composite network of CGPMs for a given population. The internal CGPMs nodes in this network come from the CGPM library available to BayesDB. After declaring a population and a metamodel for it, further MML commands are used to instantiate stochastic ensembles of CGPM networks (INITIALIZE), and apply inference operators to them (ANALYZE). In this section, we describe the surface level syntaxes in the Metamodeling Language for population schemas, meta- model definitions, and other MML commands. We also describe how to use the Bayesian Query Language to query ensembles of CGPMs at varying levels of granularity.

5.3.1 Population Schemas A population schema declares a collection of variables and their statistical types.

CREATE POPULATION

FOR

WITH SCHEMA ([; ...]); Declares a new population p in BayesDB. The token table references a database table, which stores the measurements and is known as the base table for p.

schemum := MODEL AS Uses stat-type as the statistical data type of var-names.

schemum := IGNORE Excludes var-names from the population. This command is typically applied for columns in the base table representing unique names, timestamps, and other metadata.

schemum := GUESS STATTYPES FOR (* | ) Uses existing measurements in the base table to guess the statistical data types of columns in the table. When the argument is (*), the target columns are all those which do not appear in MODEL or IGNORE. When the argument is (var-names), only those subset of columns are guessed.

Every column in the base table must have a derivable policy (guess, ignore, or explicitly model with a user-provided statistical data type) from the schema. The statistical data types available in MML are shown in Table 2. The GUESS command is implemented using various heuristics on the measurements, and only assigns a variable to either NOMINAL or NUMERICAL. Using a more refined statistical type for a variable is achieved with an explicit MODEL...AS command. Finally, two populations identical same base tables and variables, but different statistical type assignments, are considered distinct populations.

33 5.3.2 Metamodel Definitions After creating a population P in BayesDB, we use metamodel definitions to declare CGPMs for the population. This MML program specifies both the topology and internal CGPM nodes of the network (Section 5.2). Starting with a baseline CGPM at the “root”of the graph, nodes and edges are constructed by a sequence overrides that extract variables from the root node and place them into newly created CGPM nodes. The syntax for a metamodel definition is:

CREATE METAMODEL FOR WITH BASELINE [([; ...])]; Declares a new metamodel m. The token population references a BayesDB population, which contains a set of variable names and their statistical types and is known as the base population for m.

baseline-cgpm ::= (crosscat | multivariate_kde | generative_knn) Identifies the automatic model discovery engine, which learns the full joint distribution of all variables in the population of m. Baselines include Cross-Categorization (Section 4.2), Multivariate Kernel Density Estimation (Section 4.7), or Generative K-Nearest-Neighbors (Section 4.6).

schemum := OVERRIDE GENERATIVE MODEL FOR [GIVEN ] [AND EXPOSE ( )[, ...]] USING Overrides baseline-cgpm by creating a new node in the CGPM network. The node gen- erates output-vars, possibly requires the specified input-vars. Additionally, the CGPM may expose some of its latent variable as queryable outputs. The token cgpm-name refers to the name of the CGPM which is overriding baseline-cgpm on the specified subpart of the joint distribution.

schemum := SET CATEGORY MODEL FOR TO (This command is only available when baseline-cgpm is crosscat.) Replaces the default category model used by crosscat for output-var, based on its statis- tical type, with an alternative primitive-cgpm that is also applicable to that statistical type (last column of Table 2).

To answer arbitrary BQL queries about a population, BayesDB requires each CGPM to carry a full joint model over all the population variables. Thus, each metamodel is declared with a baseline CGPM, such as CrossCat, a non-parametric Bayesian structure learner for high-dimensional and heterogeneous data tables (Mansinghka et al., 2015b), among others outlined in Section 4. It is important to note that the input-vars in the OVERRIDE MODEL command may be the outputs of not only the baseline but any collection of upstream CGPMs. It is also possible to completely override the baseline by overriding all the variables in the population.

34 5.3.3 Homogeneous Ensembles of CGPM Networks N In BayesDB, a metamodel M is formally defined as an ensemble of CGPM networks {(Gk ,wk )}i=1, where wk is the weight of network Gk (Mansinghka et al., 2015a, Section 3.1.2). The CGPMs in M are homogeneous in that (from the perspective of MML) they have the same metamodel definition, and (from the perspective of the CGPM interface) they are all created with the same population, inputs, outputs, and binary. The ensemble M is populated with K instances of CGPMs using the following MML command:

INITIALIZE MODELS FOR ; Creates K independent replicas of the composable generative population model network con- tained in the MML definition of metamodel.

CGPM instances in the ensemble are different in that BayesDB provides each Gk a unique seed during create. This that invoking infer(Gk , program: T ) causes each network’s internal state to evolve differently over the course of inference (when T contains non-deterministic execution). In MML surface syntax, infer is invoked using the following command:

ANALYZE FOR (ITERATIONS | SECONDS) [()]; Runs analysis (in parallel) on all the initialized CGPM networks in the ensemble, according to an optional inference plan.

plan := (VARIABLES | SKIP) If VARIABLES, then runs analysis on all the CGPM nodes which have at least one output variable in var-names. If SKIP, then then transitions all the CGPM nodes except those which have a an output variable in var-names. As outlined at the end of Section 5.2, each CGPM node is learned independently at present time.

Weighted ensembling of homogeneous CGPMs can be interpreted based on the modeling and inference tactics internal to a CGPM. For example, in Bayesian CGPM network where ANALYZE invokes MCMC transitions, each Gk may represent a different posterior sample; for variational inference, each Gk may converge to a different set of latent parameters due to different random initializations. More extensive syntaxes for inference plans in MML are left for future work.

5.3.4 Heterogeneous Ensembles of CGPM Networks Section 5.3.3 defined a metamodel M as an ensemble of homogenous CGPM networks with the same metamodel definition. It is also possible construct a heterogenous ensemble of CGPM net- works by defining a set of metamodels {M1,..., MK } for the same population P but with different th metamodel definitions. Let G[k,t] be the t CGPM network in the metamodel Mk . The Bayesian Query Language is able to query CGPM networks at three levels of granularity, starting from the most coarse to the most granular.

(ESTIMATE | SIMULATE | INFER) FROM ;

Executes the BQL query by aggregating responses from all metamodels {M1,...,Mk } de- fined for .

35 (ESTIMATE | SIMULATE | INFER) FROM MODELED BY ;

Executes the BQL query by aggregating responses from all the CGPM networks {G[k,t]} that have been initialized with the MML definition for . (ESTIMATE | SIMULATE | INFER) FROM MODELED BY USING MODEL ;

Executes the BQL query by returning the single response from G[k,t] in . Monte Carlo estimators obtained by simulate and logpdf remain well-defined even when the ensemble contains heterogenous CGPMs. All CGPMs across different metamodels are defined for the same population, which determines the statistical types of the variables. This guarantees that the associated supports and (product of) base measures (from Table 2) for simulate and logpdf queries are all type-matched.

5.4 Composable generative population models generalize and extend generative population models in BayesDB It is informative to compare both the conceptual and technical differences between generative pop- ulation models (GPMs) in BayesDB (Mansinghka et al., 2015a) with composable generative popu- lation models (CGPMs). In its original presentation, the GPM interface served the purpose of being the primary vehicle for motivating BQL as a model-independent query language (Mansinghka et al., 2015a, Sec.3.2). Moreover, GPMs were based around CrossCat as the baseline model-discovery en- gine (Mansinghka et al., 2015a, Sec. 4.5.1), which provided good solutions for several data analysis tasks. However, by not accepting inputs, GPMs offered no means of composition; non-CrossCat ob- jects, known as “foreign predictors”, were discriminative models embedded directly into the Cross- Cat joint density (Mansinghka et al., 2015a, Sec. 4.4.2). By contrast, the main purpose of the CGPM interface is to motivate more expressive MML syntaxes for building hybrid models, comprised of arbitrary generative and discriminative components. Since CGPMs natively accept inputs, they ad- mit a natural form of composition (Section 5.2) which does violate the internal representation of any particular CGPM. The computational interface and probabilistic structure of GPMs and CGPMs are different in several respects. Because GPMs were presented as Bayesian models with Markov Chain Monte Carlo inference (Mansinghka et al., 2015a, Sec. 4.2), both simulate and logpdf were explicitly conditioned on a particular set of latent variables extracted from some state in the posterior inference chain (Mansinghka et al., 2015a, Sec. 3.1.1). On the other hand, CGPMs capture a much broader set of model classes, and simulate and logpdf do not impose any conditioning constraints internal to the model besides conditioning on input variables and the entire dataset D. Internally, GPMs enforced much stronger assumptions regulating inter-row independences; all the elements in a row are conditionally independent give a latent variable (Mansinghka et al., 2015a, Sec.3.1), effectively restricting the internal structure to a directed graphical model. CGPMs allow for arbitrary coupling between elements within a row from Eq (1), which uniformly expresses both directed and undirected probabilistic models, as well approaches which are not naturally probabilistic that implement the interface. Finally, unlike GPMs, CGPMs may expose some of member-specific latent variables as queryable outputs. This features trades-off the model independence of BQL with the ability to learn and query the details of the internal probabilistic process encapsulated by the CGPM.

36 6. Applications of Composable Generative Population Models The first part of this section outlines a case study applying compositional generative population models in BayesDB to a population of satellites maintained by the Union of Concerned Scientists. The dataset contains 1163 entries, and each satellites has 23 numerical and categorical features such as its material, functional, physical, orbital and economic characteristics. We construct a hybrid CGPM using an MML metamodel definition which combines (i) a classical physics model written as a probabilistic program in VentureScript, (ii) a random forest to classify a a nominal variable, (iii) an ordinary least squares regressor to predict a numerical variable, and (iv) principal component analysis on the real-valued features of the satellites. These CGPMs allow us to identify satellites that probably violate their orbital mechanics, accurately infer missing values of anticipated lifetime, and visualize the dataset by projecting the satellite features into two dimensions. The second part of this section explores the efficacy of hybrid compositional generative pop- ulation models on a collection of common tasks in probabilistic data analysis by reporting lines of code and accuracy measurements against standard baseline solutions. Large savings in lines of code and improved accuracy are demonstrated in several important regimes. Most of the analysis of experimental results is contained in the figure gallery at the end of the section.

6.1 Analyzing satellites using a composite CGPM built from causal probabilistic programs, discriminative machine learning, and Bayesian non-parametrics The left panel in Figure 11 illustrates a session in MML which declares the population schema for the satellites data, as well as the metamodel definition for building the hybrid CGPM network that models various relationships of interest between variables.4 The CREATE POPULATION block shows the high-dimensional features of each satellite and their heterogeneous statistical types. For simplic- ity, several variables such as perigee_km, launch_mass_kg and anticipated_lifetime have been modeled as NUMERICAL rather than a more refined type such as MAGNITUDE. In the remain- der of this section, we explain the CGPMs declared in the MML metamodel definition under the CREATE METAMODEL block, and refer to figures for results of BQL queries executed against them. The PCA CGPM on line 34 of the metamodel definition generates as output five real-valued variables, and exposes the first two principal component scores to BayesDB. This low-dimensional projection allows us to both visualize a clustering of the dataset in latent space, and discover oddities in the distribution of latent scores for satellites whose class_of_orbit is elliptical. It also identifies a single satellite, in cyan at grid point (1,1.2), as a candidate for further investigation. Figure 13 shows the result and further commentary on this experiment. Four variables in the population relate to the orbital characteristics of each satellite: apogee_km A, perigee_km P, period_minutes T, and eccentricity e. These variables are constrained by q A−P ((A+P)/2)3 the theoretical Keplerian relationships e = A+P and T = 2π GM , where GM is a physical constant. In reality, satellites deviate from their theoretical orbits for a variety of reasons, such or- bital and measurement noise, having engines, or even data-entry errors. The right panel of Figure 11 shows a CGPM in pure VentureScript which accepts as input yr = (Ar , Pr ) (apogee and perigee), and generates as output xr = Tr (period). The prior is a Dirichlet process mixture model on the

4. This program is executed in iVenture, an experimental interactive probabilistic programming environment that sup- ports running %bql, %mml and %venturescript code cells, all of which operate on a common underlying BayesDB instance and Venture interpreter.

37 error, based on a stochastic variant of Kepler’s Law,

G ∼ DP(α,Normal-Inverse-Gamma(m,V,a, b)) 2 (µr ,σr )|G ∼ G 2  r |yr ∼ Normal(·|µr ,σr ) where  r := Tr − Kepler(Ar , Pr ).

While the internal details, external interface, and adapter which compiles the VentureScript source into a CGPM are beyond the scope of this paper, note that its MML declaration uses the EXPOSE command on line 45. This command makes the inferred cluster identity and noise latent variables (lines 17 and 22 of the VentureScript program) available to BQL. Figure 12 shows a posterior sample of the cluster assignments and error distribution, which identifies three distinct classes of anomalous satellites based on the magnitude of error. For instance, satellite Orion6 in the right panel of Figure 12, belongs to a cluster with “extreme” deviation. Further investigation reveals that Orion6 has a period 23.94 minutes, a data-entry error for the true period of 24 hours (1440 minutes). Figure 14 shows the improvement in prediction accuracy achieved by the hybrid CGPM over the purely generative CrossCat baseline, for a challenging multiclass classification task. As shown in lines 57-62 of the metamodel definition in Figure 11, the hybrid CGPM uses a random forest CGPM for the target variable type_of_orbit given five numerical and categorical predictors. Figures 14a and 14b shows the confusion matrices on the test set for both the composite and baseline CGPMs. While both methods systematically confuse sun-synchronous with intermediate orbits, the use of a random forest classifier results in 11 less classification errors, or an improvement of 11 percentage points. Using a purely discriminative model for this task, i.e. a random forest without a generative model over the features (not shown), would require additional logic and heuristic imputation on feature vectors in the test set, which general contained missing entries. Figure 15 uses model-independent querying in BQL to compare the performance of the three baseline CGPMs (Section 5.3.2) in BayesDB, namely CrossCat, multivariate kernel density estima- tion, and generative nearest neighbors, on a bivariate density estimation task. Each method gener- ated 1000 posterior samples from the joint distribution of country_of_operator and purpose for a hypothetical satellite, given that its class_of_orbit is geosynchronous. The y-axis shows the Hellinger distance between these posterior samples, and the empirical frequency of country and purpose for geosynchronous satellites in the dataset. The x-axis shows the number of neighbors K used for generative nearest neighbors. Both CrossCat and multivariate KDE achieve similar perfor- mance by probabilistically smoothing over the dataset (via kernel bandwidth parameters for KDE, and cluster assignments in CrossCat’s inner DP mixture). The performance of generative nearest neighbors goes from over-fitting for small K (too little smoothing), to under fitting for large K (too much smoothing). There dozens of additional BQL queries that can be posed about the satellites population and, based on the analysis task of interest, answered using both the existing CGPMs in the hybrid meta- model as well as more customized CGPMs. The empirical studies in this section has shown it is possible and practical to apply CGPMs in BayesDB to challenging data analysis tasks in a real- world dataset, and use BQL queries to compare their performance characteristics.

38 6.2 Comparing code length and accuracy on representative data analysis tasks One of the most sparsely observed variables in the satellites dataset is the anticipated_lifetime, with roughly one in four missing entries. The analysis task in Figure 16 is to infer the anticipated lifetime x∗ of a new satellite, given the subset of its numerical and nominal features y∗ shown in the codeblock above the plot. To quantify performance, the predictions of the CGPM were evaluated on a held-out set of satellites with known lifetimes. Many satellites in both the training set and test set contained missing entries in their covariates, requiring the CGPM to additionally impute missing values in the predictors before forward simulating the regression. Unlike the purely generative and purely discriminative baselines (shown in the legend), the hybrid CGPM learns both a joint distribution over the predictors and a discriminative model for the response, leading to significantly improved predictive performance. The improvement in lines of code over the baseline methods in Figure 16 is due to using com- binations of (i) SQL for data processing, (ii) MML for model building, and (iii) BQL for predictive querying, in BayesDB. All the baselines required custom logic for (i) manual data preprocessing such as reading csv files, (ii) Euclidean embedding of large categorical values, and (iii) heuristic imputation of missing features during train and test time (i.e. either imputing the response from its mean value, or imputing missing predictors from their mean values). The left panel from Figure 16a shows and end-to-end session in BayesDB which preprocesses the data, builds the hybrid CGPM, runs analysis on the training set and computes predictions on the test set. The right panel from Figure 16b shows a single ad-hoc routine used by the Python baselines, which dummy codes a data frame with missing entries and nominal data types. For nominal variables taking values in a large set, dummy coding with zeros may cause the solvers to fail when the system is under-determined. The workaround in the code for baselines is to drop such problematic dimensions from the feature vector. The regression in the hybrid CGPM does not suffer from this problem because, the default linear regressor in the CGPM library gives all parameters a Bayesian prior (Banerjee, 2008), which smooths irregularities. Figures 17, 18, 19 and 20 extend the lines of code and accuracy comparisons for CGPMs and baseline methods to several more tasks using diverse statistical methodologies. These figures further illustrate coverage and conciseness of CGPMs – the captions detail the setup and commentary of each experiment in greater detail.

39 Figure 11: Building a hybrid CGPM in Venturescript and MML for the satellites population. %mml %venturescript 1 CREATE TABLE satellites_ucs FROM ’satellites.csv’ // Kepler CGPM. 2 define kepler = () -> { 3 .nullify satellites_ucs ’NaN’ // Kepler’s law. 4 assume keplers_law = (apogee, perigee) -> { 5 CREATE POPULATION satellites FOR satellites_ucs let GM = 398600.4418; 6 WITH SCHEMA ( let earth_radius = 6378; 7 IGNORE Name; let a = (abs(apogee) + abs(perigee)) * 8 0.5 + earth_radius; 9 MODEL 2 * 3.1415 * sqrt(a**3 / GM) / 60 10 country_of_operator, operator_owner, }; 11 purpose, class_of_orbit, type_of_orbit // Internal samplers. 12 users, contractor, launch_vehicle, assume crp_alpha = .5; 13 country_of_contractor, launch_site, assume cluster_sampler = make_crp(crp_alpha); 14 source_used_for_orbital_data assume error_sampler = mem((cluster) -> 15 AS NOMINAL; make_nig_normal(1, 1, 1, 1)); 16 // Output simulators. 17 MODEL assume sim_cluster_id = 18 perigee_km, apogee_km, eccentricity, mem((rowid, apogee, perigee) ~> { 19 period_minutes launch_mass_kg, tag(atom(rowid), atom(1), cluster_sampler()) 20 dry_mass_kg, power_watts, }); 21 date_of_launch, anticipated_lifetime assume sim_error = 22 AS NUMERICAL; mem((rowid, apogee, perigee) ~> { 23 let cluster_id = sim_cluster_id( 24 MODEL rowid, apogee, perigee); 25 longitude_radians_of_geo, tag(atom(rowid), atom(2), 26 inclination_radians error_sampler(cluster_id)()) 27 AS CYCLIC }); 28 ); assume sim_period = 29 mem((rowid, apogee, perigee) ~> { 30 CREATE METAMODEL sat_hybrid FOR satellites keplers_law(apogee, perigee) + 31 WITH BASELINE crosscat( sim_error(rowid, apogee, perigee) 32 SET CATEGORY MODEL FOR eccentricity TO beta; }); 33 // List of simulators. 34 OVERRIDE GENERATIVE MODEL FOR assume simulators = [ 35 launch_mass_kg, dry_mass_kg, power_watts, sim_period, sim_cluster_id, sim_error]; 36 perigee_km, apogee_km }; 37 AND EXPOSE 38 pc1 NUMERICAL, pc2 NUMERICAL // Output observers. 39 USING factor_analysis(L=2); define obs_cluster_id = 40 (rowid, apogee, perigee, value, label) -> { 41 OVERRIDE GENERATIVE MODEL FOR $label: observe sim_cluster_id( 42 period_minutes $rowid, $apogee, $perigee) = atom(value); 43 GIVEN }; 44 apogee_km, perigee_km define obs_error = 45 AND EXPOSE (rowid, apogee, perigee, value, label) -> { 46 kepler_cluster CATEGORICAL, $label: observe sim_error( 47 kepler_noise NUMERICAL $rowid, $apogee, $perigee) = value; 48 USING venturescript(sp=kepler); }; 49 define obs_period = 50 OVERRIDE GENERATIVE MODEL FOR (rowid, apogee, perigee, value, label) -> { 51 anticipated_lifetime let theoretical_period = run( 52 GIVEN sample keplers_law($apogee, $perigee)); 53 date_of_launch, power_watts, apogee_km, obs_error( 54 perigee_km, dry_mass_kg, class_of_orbit rowid, apogee, perigee, 55 USING linear_regression; value - theoretical_period, label); 56 }; 57 OVERRIDE GENERATIVE MODEL FOR // List of observers. 58 type_of_orbit define observers = [ 59 GIVEN obs_period, obs_cluster_id, obs_error]; 60 apogee_km, perigee_km, period_minutes, // List of inputs. 61 users, class_of_orbit define inputs = ["apogee", "perigee"]; 62 USING random_forest(k=7); // Transition operator. 63 ); define transition = (N) -> {mh(default, one, N)};

40 Clusters Identified by Kepler CGPM Empirical Distribution of Orbital Deviations 5000 28 Theoretically Negligible 7 Feasible Orbits 2 Noticeable 4000 Large 26 Extreme Geotail Geotail 3000 25

24 2000 Amos5 23 Period [mins] Amos5 Meridian4

1000 Number of Satellites Meridian4 NavStar 22 Orion6 Orion6 21 0

20 0 10000 20000 30000 40000 1e-10 1e-5 1e0 1e5 1e10 Perigee [km] Magntiude of Deviation from Kepler's Law [mins^2] %bql INFER kepler_cluster, kepler_noise FROM satellites; Figure 12: Finding satellites whose orbits are likely violations of Kepler’s Third Law using a causal CGPM in Venturescript, which learns a Dirichlet process mixture on the residuals. Each dot in the (left) is a satellite in the dataset, and its color represents the latent cluster assignment learned by the causal CGPM. Both the cluster identity and inferred noise are exposed latent variables. The (right) shows that each of the four distinct clusters roughly translates to a qualitative description for the magnitude of a satellite’s deviation from its theoretical period: yellow (negligible), magenta (noticeable), green (large), and blue (extreme). These clusters were learned non-parametrically.

6 MEO 5 LEO 4 GEO Elliptical 3

2

1

0

1 Second Principal Component Score 2 2 1 0 1 2 3 4 5 First Principal Component Score %bql INFER EXPLICIT PREDICT pc1, PREDICT pc2, class_of_orbit FROM satellites; Figure 13: Low dimensional projection of the satellites using the PCA CGPM reveals cluster- ings in latent space and suggests candidate outliers. The principal component scores are based on the numerical features of a satellite, and the color is the class_of_orbit. Satellites in low earth, medium earth, and geosynchronous orbit form tight clusters in latent space along PC1, and exhibit most within-cluster along PC2. The distribution on factor scores for elliptical satellites has much higher variability along both dimensions, indicating a collection of weak local modes depend- ing on the regime of the satellite’s eccentricity (not shown), and/or many statistical outliers.

41 etisi yn,teoealerrrt srdcdb 1 ntehbi CGPM. hybrid the in 11% CrossCat by both reduced is While rate error data. overall missing the contained cyan), (lines sets in variables (entries training categorical and CGPM. and test each numerical both Crosscat by of and and label consist 11), predicted and Figure the shows of dimensional x-axis five 57-62 the are and vectors satellites, feature held-out The 100 prediction baseline. of improved CrossCat orbit” show the of over task “type CGPM classification hybrid multiclass the a by for accuracy matrices Confusion 14: held_out_satellites; Figure FROM type_of_orbit INFER %bql ewe 0ad2,altremtospouecmaal eut.A rw ag,telocal the large, grows K large. As grows distribution reference results. the comparable from distance produce the methods and noisy, three become models all neighborhood 20, e and CGPM di 10 neighbors to nearest between correspond generative the bars small, error SQL is The a K When dataset. query, BQL the data. given from the observed for distribution method empirical each task. of the classification samples of and bivariate posterior a between density distance kernel on Hellinger multivariate the neighbors shows CrossCat, nearest of generative performance and predictive ‘GEO’; estimation, the = Comparing class_of_orbit 15: Figure GIVEN purpose country_of_operator, SIMULATE %bql Deep Highly Eccentric Sun­Synchronous (a) SELECT Crosscat Intermediate Molniya Polar + edn oavr ls ac adoe-tig oetisi h aae.FrK For dataset. the in entries to over-fitting) (and match close very a to leading , admFrs ytmtclycnue“u-ycrnu”n itreit”orbits “intermediate” “sun-synchronous”and confuse systematically Forest Random / admFrs yrdCGPM. hybrid Forest Random 44 Sun­Synchronous 0 4 3 0

Deep Highly Eccentric 0 0 0 1 0 17 Polar 0 0 0 0 26

Hellinger Distance 0 1 0 3 0.10 0.15 0.20 0.25 0.05 Intermediate 1 0 0 0 0 0 Molniya K (forGenerativeNearestNeighbors) 0 8 16 24 32 40 20 40 42 60 Deep Highly Eccentric Sun­Synchronous 80 Multivariate KDE CrossCat Generative KNN Intermediate (b) Molniya 100 Polar rsCtbsln CGPM. baseline CrossCat h -xssostetu ae for label true the shows y-axis The 120 41 12 Sun­Synchronous 0 3 0 1 0 Deep Highly Eccentric 0 0 0 16 0 0 0 0

ff Polar cieybhvslike behaves ectively 18 Intermediate 0 2 0 6 ff 1 0 0 0 0 rn ust of subsets erent Molniya 0 8 16 24 32 40 h y-axis The CREATE TABLE data_train FROM satellites_train.csv; def dummy_code_categoricals(frame, maximum=10): .nullify data_train ’NaN’; def dummy_code_categoricals(series): CREATE POPULATION satellites FOR data_train categories = pd.get_dummies( WITH SCHEMA( series, dummy_na=1) GUESS STATTYPES FOR (*) if len(categories.columns) > maximum - 1: ); return None if sum(categories[np.nan]) == 0: CREATE METAMODEL cc_ols FOR satellites del categories[np.nan] WITH BASELINE crosscat( categories.drop( OVERRIDE GENERATIVE MODEL FOR categories.columns[-1], axis=1, anticipated_lifetime inplace=1) GIVEN return categories type_of_orbit, perigee_km, apogee_km, period_minutes, date_of_launch, def append_frames(base, right): launch_mass_kg for col in right.columns: ); base[col] = pd.DataFrame(right[col])

INITIALIZE 4 MODELS FOR cc_ols; numerical = frame.select_dtypes(include=[float]) ANALYZE cc_ols FOR 100 ITERATION WAIT; categorical = frame.select_dtypes( include=[’object’]) CREATE TABLE data_test FROM satellites_test.csv; categorical_coded = filter( .nullify data_test ’NaN’; lambda s: s is not None, .sql INSERT INTO satellites_train [dummy_code_categoricals(categorical[c]) SELECT * FROM data_test; for c in categorical.columns])

CREATE TABLE predicted_lifetime AS joined = numerical INFER EXPLICIT for sub_frame in categorical_coded: PREDICT anticipated_lifetime append_frames(joined, sub_frame) CONFIDENCE pred_conf FROM satellites WHERE _rowid_ > 1000; return joined (a) Full session in BayesDB which loads the (b) Ad-hoc Python routine (used by baselines) for training and test sets, creates a hybrid CGPM, and dummy coding nominal predictors in a dataframe runs the regression. with missing values and heterogeneous types.

102

ridge ols lasso 1 10 kernel forest bayesdb(crosscat+ols) bayesdb(crosscat) Mean Squared Error 100 101 102 Lines of Code

Figure 16: In a high-dimensional regression problem with mixed data types and , the composite CGPM shows improvement in prediction accuracy over purely generative and purely discriminative baselines. The task is to infer the anticipated lifetime of a held-out satellite given categorical and numerical features such as type of orbit, launch mass, and orbital period. Some feature vectors in the test set have missing entries, leading purely discriminative models (ridge, lasso, OLS) to either heuristically impute missing features, or to ignore the features and predict the mean lifetime from its marginal distribution in the training set. The purely generative model (CrossCat) is able to impute missing data from their full joint distribution, but only indirectly mediates dependencies between the predictors and response through latent variables. The composite CGPM (CrossCat+OLS) combines advantages of both approaches; statistically rigorous imputation followed by direct regression on the features leads to improved predictive accuracy. 43 0.9 likeRatioTest (0.05)

0.8 likeRatioTest (0.01) hsicTestGamma (0.05) 0.7 hsicTestGamma (0.01) hsicTestBoot (0.05) 0.6 hsicTestBoot (0.01)

0.5 bayesdb(crosscat) GreGyoL1Test (0.05) 0.4 Average Detection Rate GreGyoL1Test (0.01) GreGyoL1Boot (0.05) 0.3 0 1 10 10 GreGyoL1Boot (0.01) Lines of Code %bql ESTIMATE DEPENDENCE PROBABILITY OF x WITH y; Figure 17: Dependence discovery. Binary hypothesis tests of independence for synthetic two- dimensional data drawn from five noisy zero-correlation datasets: sin wave, parabola, x-cross, dia- mond, and ring. For all datasets the two dimensions are dependent. The y-axis shows the fraction of correct hypotheses achieved by each method, averaged over all datasets. The decision rule for kernel-based tests (Gretton et al., 2007; Gretton and Györfi, 2008, 2010), is based on a frequen- tist significance level of 5% and 1%. The decision rule for CrossCat is based on a dependence probability threshold of 50%.

25

20

15 kernel-density-estimation k-nearest-neighbor 10 bayesdb(crosscat)

5

0 Squared Estimation Error

5 100 101 Lines of Code %bql ESTIMATE MUTUAL INFORMATION OF x WITH y; Figure 18: Dependence strength Estimating the mutual information of a noisy sin wave. The y-axis shows the squared estimation error, randomized over observed datasets. The “ground truth” mutual information was derived analytically, and the integral computed by quadrature. Baseline methods estimate mutual information using K nearest neighbors (Kraskov et al., 2004) and kernel density estimation (Moon et al., 1995). CrossCat estimates the mutual information first by learning a Dirichlet process mixture of Gaussians, and using Monte Carlo estimation by generating samples from the posterior predictive distribution and assessing their density.

44 0.8

0.7

0.6 svm-sigmoid 0.5 svm-rbf 0.4 random-forest 0.3 logistic-l2

0.2 logistic-l1 Hellinger Distance bayesdb(crosscat) 0.1

0.0 101 102 Lines of Code %bql SIMULATE country_of_operator, purpose GIVEN class_of_orbit = ‘GEO’; Figure 19: Bivariate categorical density estimation. Simulating from the posterior joint distribu- tion of the country and purpose of a hypothetical satellite, given its orbit type. 500 samples were obtained from each method to estimate the posterior probabilities. The y-axis shows the Hellinger distance between posterior samples from each method and the empirical conditional distribution from the dataset, used as “ground truth”. Standard discriminative baselines struggle to learn the dis- tribution of a two-dimensional discrete outcome based on a discrete input, where both the predictor and response variables take values in large categorical sets.

20

15 one-class-svm 10 -abs-deviation local-outlier-factor 5 elliptic-envelope bayesdb(crosscat+kepler)

Anomalies Detected 0 bayesdb(crosscat)

5 101 102 Lines of Code %bql ESTIMATE PREDICTIVE PROBABILITY OF period_minutes; Figure 20: Anomaly detection. Detecting satellites with anomalous orbital periods. 18 satellites from the dataset demonstrated a non-trivial deviation (greater than five minutes) from their theo- retical period, used as “ground truth” anomalies. For each method, the top 20 satellites ranked by “outlyingness” score were used as the predicted anomalies. Hybrid CGPMs learn multivariate and multimodal distributions over all variables in the dataset, leading to higher detection rates than baseline methods which use univariate and/or unimodal statistics. The Kepler CGPM identifies most anomalies at the expense of a highly complex program in comparison to baselines.

45 7. Discussion and Future Work This paper has shown that it is possible to use a computational formalism in probabilistic pro- gramming to apply, combine, and compare a broad class of probabilistic data analysis techniques. CGPMs extend the core provided by directed graphical models, which express elaborate probabilis- tic models in terms of smaller univariate pieces, by specifying a computational interface that allows these pieces to be multivariate, more black-box, and defined directly as software. A key feature of this framework is that it enables statistical modelers to compose discriminative, generative and hybrid models from different philosophies in machine learning and statistics using probabilistic pro- gramming. Moreover, the compositional abstraction is neutral to a CGPM’s internal choices of (i) modeling assumptions, which may be i.e. hierarchical or flat, or Bayesian or non-Bayesian, and (ii) inference tactics, which may be i.e. optimization- or sampling-based. Several models from statistics admit natural implementations in terms of the current CGPM interface, such as non-linear mixed effect models (Davidian and Giltinan, 1995), where each mem- ber represents a potentially repeated measurement with latent variables grouping the members into observation units; or Gaussian processes (Rasmussen and Williams, 2006), where the input vari- ables are time indexes from another CGPM, and the outputs are noisy observations of the (latent) function values (Tresp, 2001; Rasmussen and Ghahramani, 2002). Computational representations of these models as CGPMs allows them to be composable as hybrid models, reusable as software, and queryable in interesting ways using the Bayesian Query Language. A worthy direction for future work is extending the set of statistical data types (Section 4.1), and possibly CGPM interface, to support analysis tasks beyond traditional multivariate statistics. Some possible new data types and associated CGPMs are

• GRAPH data type, using a relational data CGPM based on the stochastic block model (Nowicki and Snijders, 2001) or infinite relational model (Kemp et al., 2006),

• TEXT data type, using a topic model CGPM such as latent Dirichlet allocation (Blei et al., 2003) or probabilistic latent semantic analysis (Hofmann, 1999),

• IMAGE data type, using a CGPM based on neural networks.

Composing CGPMs with these data types leads to interesting tasks over their induced joint distributions. Consider an IMAGE variable with an associated TEXT annotation: a generative CGPM for the image and a discriminative CGPM for the text (given the image) leads to image classification, while a generative CGPM for the text and a discriminative CGPM for the image (given the text) allows simulating unstructured text and then the associated images.

It is also interesting to consider introducing additional structure into our current formalism of populations from Section 3.1 to support richer notions of population modeling. For instance, populations may be hierarchical, in that the variables of population A correspond to outputs produced by a CGPM for population B; the simplest case is summary statistics such as means and inter-quartile ranges. Such hierarchical populations are common in census data, which contain raw measurements of variables for individual households, as well as row-wise and column-wise summaries based on geography, income level, ethnicity, educational background, and so on. Populations can also be extended to support "merge" operations in MML, which are analogous to JOIN operations in SQL. For instance, two populations A and B might share a subset of columns (variables) and/or rows (members); what is the conceptual and probabilistic formalism for harmonizing these populations? How can CGPMs, which carry the full joint distribution over all variables and members in both populations, be used to transfer information from population A to population B, e.g., for imputing missing values in either population?

Our presentation of the algorithm for infer in a composite network of CGPMs (Section 5.2) left open improvements to the baseline strategy of learning each CGPM node separately. One way to achieve joint learning, without violating the abstraction boundaries of the CGPM interface, is to follow the per-CGPM runs of infer with a "refine" phase, in which (i) missing measurements in the population are imputed using one forward pass of simulate throughout the network, and then (ii) each CGPM updates its parameters based on the imputed measurements. This strategy can be repeated to generate several such imputed networks, which are then organized into an ensemble of CGPMs in a BayesDB metamodel (Section 5.3.3), where each CGPM in the metamodel corresponds to a different set of imputations. The weighted averaging of these CGPMs by BayesDB would then correspond to integrating over the different imputations, as well as over their induced parameters.
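A minimal Python sketch of this refine-and-ensemble loop is shown below. It assumes the baseline per-CGPM infer has already run, and it uses simplified stand-ins for the interface: a caller-supplied forward_simulate_missing helper for the forward pass of simulate through the network, and per-CGPM incorporate and infer methods that accept the completed dataset. These names and signatures are illustrative, not the exact interface from Section 3.2.

    import copy

    def refine_ensemble(network, dataset, forward_simulate_missing,
                        num_imputations=10, num_rounds=5):
        # Each repetition produces one imputed network; the collection is
        # later registered as an ensemble of CGPMs in a BayesDB metamodel.
        ensemble = []
        for _ in range(num_imputations):
            cgpms = copy.deepcopy(network)
            for _ in range(num_rounds):
                # (i) Impute missing measurements with one forward pass of
                #     simulate throughout the composite network.
                completed = forward_simulate_missing(cgpms, dataset)
                # (ii) Each CGPM updates its parameters on the imputed data.
                for cgpm in cgpms:
                    cgpm.incorporate(completed)
                    cgpm.infer()
            ensemble.append(cgpms)
        return ensemble

Weighted averaging over the returned networks then corresponds to the integration over imputations described above.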
Extending BQL, or developing new probabilistic programming languages, to assess the inference quality of CGPMs built in MML will be an important step toward broader application of these probabilistic programming tools for real-world analysis tasks. For instance, it is possible to develop a command in BQL such as
ESTIMATE KL BETWEEN <cgpm-1> AND <cgpm-2> FOR VARIABLES <variable-names> GIVEN <constraints>;

which takes two CGPMs (and an overlapping subset of their output variables) and returns an estimate of the KL divergence between their conditional predictive distributions, based on a Monte Carlo estimator using simulate and logpdf. Such model-independent estimators of inference quality, backed by the CGPM interface, provide a proposal for unifying the testing infrastructure among a range of candidate solutions for a given data analysis task.
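Such an estimator needs nothing beyond the simulate and logpdf methods of the two CGPMs. The Python fragment below is a minimal sketch, assuming each CGPM exposes simulate(variables, constraints), which returns a sampled assignment of the query variables, and logpdf(assignment, constraints), which returns its log density under the conditional predictive distribution; these simplified signatures stand in for the full interface of Section 3.2.

    def estimate_kl(cgpm_p, cgpm_q, variables, constraints, num_samples=1000):
        # Monte Carlo estimate of KL(p || q) between the conditional
        # predictive distributions of two CGPMs over shared output variables.
        total = 0.0
        for _ in range(num_samples):
            # Draw a joint assignment of the query variables from the first CGPM.
            assignment = cgpm_p.simulate(variables, constraints)
            # Score the same assignment under both predictive distributions.
            total += cgpm_p.logpdf(assignment, constraints) \
                - cgpm_q.logpdf(assignment, constraints)
        return total / num_samples

The same pattern extends to other discrepancy measures that can be written as expectations under one of the two predictive distributions.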
In summary, we hope the formalisms in this thesis lead to practical, ecumenical tools for data analysis that integrate these ideas, and provide abstractions that enable the probabilistic programming community to collaboratively explore these research directions.
References

John Aitchison and Colin GG Aitken. Multivariate binary discrimination by the kernel method. Biometrika, 63(3):413–420, 1976.

David J Aldous. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1983, pages 1–198. Springer, 1985.

Charles E Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174, 1974.

Sudipto Banerjee. Bayesian Linear Model: Gory Details, 2008. URL http://www.biostat.umn.edu/~ph7440/pubh7440/BayesianLinearModelGoryDetails.pdf.

C.M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 9780387310732.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Bob Carpenter, Daniel Lee, Marcus A Brubaker, Allen Riddell, Andrew Gelman, Ben Goodrich, Jiqiang Guo, Matt Hoffman, Michael Betancourt, and Peter Li. Stan: A probabilistic programming language. 2015.

G. Casella and R.L. Berger. Statistical Inference. Duxbury advanced series in statistics and decision sciences. Thomson Learning, 2002. ISBN 9780534243128.

M. Davidian and D.M. Giltinan. Nonlinear Models for Repeated Measurement Data. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1995. ISBN 9780412983412.

Luc Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM, 1986.

Daniel Fink. A compendium of conjugate priors. 1997.

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23–37. Springer, 1995.

A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Ana- lytical Methods for Social Research. Cambridge University Press, 2006. ISBN 9781139460934.

Zoubin Ghahramani and Geoffrey E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical report, 1997.

Andrew D Gordon, Thore Graepel, Nicolas Rolland, Claudio Russo, Johannes Borgstrom, and John Guiver. Tabular: a schema-driven probabilistic programming language. In ACM SIGPLAN Notices, volume 49, pages 321–334. ACM, 2014.

Arthur Gretton and László Györfi. Nonparametric independence tests: Space partitioning and kernel approaches. In Algorithmic Learning Theory, pages 183–198. Springer, 2008.

Arthur Gretton and László Györfi. Consistent nonparametric tests of independence. The Journal of Machine Learning Research, 11:1391–1423, 2010.

Arthur Gretton, Kenji Fukumizu, Choon H Teo, Le Song, Bernhard Schölkopf, and Alex J Smola. A kernel statistical test of independence. In Advances in neural information processing systems, pages 585–592, 2007.

Lauren A Hannah, David M Blei, and Warren B Powell. Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research, 12(Jun):1923–1953, 2011.

Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.

Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991.

Charles Kemp, Joshua B Tenenbaum, Thomas L Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2006.

R. Khattree and D.N. Naik. Multivariate Data Reduction and Discrimination with SAS Software. Wiley, 2000. ISBN 9780471323006.

Daphne Koller, Nir Friedman, Lise Getoor, and Ben Taskar. Graphical models in a nutshell. In Introduction to Statistical Relational Learning. MIT Press, 2007.

Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

V. Mansinghka, R. Tibbetts, J. Baxter, P. Shafto, and B. Eaves. BayesDB: A probabilistic programming system for querying the probable implications of data. arXiv preprint arXiv:1512.05006, 2015a.

Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, and Joshua B Tenenbaum. CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high-dimensional data. arXiv preprint arXiv:1512.01272, 2015b.

Young-Il Moon, Balaji Rajagopalan, and Upmanu Lall. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.

K.P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive computation and machine learning. MIT Press, 2012. ISBN 9780262304320.

Radford M Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

Krzysztof Nowicki and Tom A B Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.

L. E. Peterson. K-nearest neighbor. Scholarpedia, 4(2):1883, 2009. Revision #136646.

S. James Press and K. Shigemasu. Bayesian inference in factor analysis - revised. Technical report, 1997.

Jeff Racine and Qi Li. Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119(1):99–130, 2004.

Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. Advances in neural information processing systems, 2:881–888, 2002.

C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Adaptive computation and machine learning series. University Press Group Limited, 2006. ISBN 9780262182539.

Christian Ritter and Martin A Tanner. Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. Journal of the American Statistical Association, 87(419):861–868, 1992.

Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.

N.H. Timm. Applied Multivariate Analysis. Springer Texts in Statistics. Springer New York, 2002. ISBN 9780387953472.

Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

Volker Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13, pages 654–660. MIT Press, 2001.
