Turing: a Language for Flexible Probabilistic Inference

Hong Ge (University of Cambridge), Kai Xu (University of Edinburgh), Zoubin Ghahramani (University of Cambridge, Uber)

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Abstract

Probabilistic programming is becoming an attractive approach to probabilistic machine learning. By relieving researchers of the tedious burden of hand-deriving inference algorithms, it not only enables the development of more accurate and interpretable models but also encourages reproducible research. However, successful probabilistic programming systems require flexible, generic and efficient inference engines. In this work, we present a system called Turing for flexible composable probabilistic programming inference. Turing has an intuitive modelling syntax and supports a wide range of sampling-based inference algorithms. Most importantly, Turing inference is composable: it combines Markov chain sampling operations on subsets of model variables, e.g. using a combination of a Hamiltonian Monte Carlo (HMC) engine and a particle Gibbs (PG) engine. This composable inference engine allows the user to easily switch between black-box style inference methods such as HMC, and customized inference methods. Our aim is to present Turing and its composable inference engines to the community and encourage other researchers to build on this system to help advance the field of probabilistic machine learning.

1 Introduction

Probabilistic model-based machine learning [Ghahramani, 2015, Bishop, 2013] has been used successfully for a wide range of problems and new applications are constantly being explored. For each new application, however, it is currently necessary first to derive the inference method, e.g. in the form of a variational or Markov chain Monte Carlo (MCMC) algorithm, and then implement it in application-specific code. Worse yet, building models from data is often an iterative process, where a model is proposed, fit to data and modified depending on its performance. Each of these steps is time-consuming, error-prone and usually requires expert knowledge in mathematics and computer science, an impediment for researchers who are not experts. In contrast, deep learning methods have benefited enormously from easy-to-use frameworks based on automatic differentiation that implement end-to-end optimisation. There is a real potential for automated probabilistic inference methods (in conjunction with existing automated optimisation systems) to revolutionise machine learning practice.

Probabilistic programming languages [Goodman et al., 2008, Wood et al., 2014, Lunn et al., 2000, Minka et al., 2014, Stan Development Team, 2014, Murray, 2013, Pfeffer, 2001, 2009, Paige and Wood, 2014, Murray et al., 2017] aim to fill this gap by providing a very flexible framework for defining probabilistic models and automating the model learning process using generic inference engines. This frees researchers from writing complex models by hand, enables them to focus on designing a suitable model using their insight and expert knowledge, and accelerates the iterative process of model modification. Moreover, probabilistic programming languages make it possible to implement and publish novel learning and inference algorithms in the form of generic inference engines. This enables fair, direct comparison between new and existing learning and inference algorithms on the same set of problems, something that is sorely needed by the scientific community. Furthermore, open problems that cannot be solved by state-of-the-art algorithms can be published in the form of challenging problem sets, allowing inference experts to easily identify open research questions in the field.

In this work, we introduce a system for probabilistic machine learning called Turing. Turing is an expressive probabilistic programming language developed with a focus on intuitive modelling syntax and inference efficiency. The remainder of this paper is organised as follows. Section 2 sets up the problem and notation. Section 3 describes the proposed inference engines and Section 4 describes some techniques used for the implementation of the proposed engines. Section 5 discusses some related work. Section 6 concludes.

2 Background

In probabilistic modelling, we are often interested in the problem of simulating from a probability distribution $p(\theta \mid y, \gamma)$. Here, $\theta$ could represent the parameters of interest, $y$ some observed data and $\gamma$ some fixed model hyper-parameters. The target distribution $p(\theta \mid y, \gamma)$ arises from conditioning a probabilistic model $p(y, \theta \mid \gamma)$ on some observed data $y$.

2.1 Model as computer programs

One way to represent a probabilistic model is by using a computer program. Perhaps the earliest and most influential probabilistic programming system so far is BUGS [Lunn et al., 2000]. The BUGS language dates back to the 1990s. In BUGS, a probabilistic model is encoded using a simple programming language that conveniently resembles statistical notation. After specifying a BUGS model and conditioning on some observed data, Monte Carlo samples can be automatically drawn from the model's posterior distribution. Algorithm 2.1 shows the generic structure of a probabilistic program.

Algorithm 2.1. A generic probabilistic program.

Input: data $y$ and hyper-parameter $\gamma$

Step 1: Define global parameters

$$\theta^{\text{global}} \sim p(\cdot \mid \gamma); \tag{1}$$

Step 2: For each observation $y_n$, define (local) latent variables and compute likelihoods

$$\theta^{\text{local}}_{n} \sim p(\cdot \mid \theta^{\text{local}}_{1:n-1}, \theta^{\text{global}}, \gamma) \tag{2}$$

$$y_n \sim p(\cdot \mid \theta^{\text{local}}_{1:n}, \theta^{\text{global}}, \gamma) \tag{3}$$

where $n = 1, 2, \ldots, N$.

Above, model variables (or parameters) are divided into two groups: $\theta^{\text{local}}_{n}$ denotes model parameters (or latent variables) specific to observation $y_n$, such as a mixture indicator for a data item in mixture models, and $\theta^{\text{global}}$ denotes global parameters.

Currently there are two main approaches to probabilistic programming. The first approach is based on the idea that probabilistic programs should only support the family of models for which we can perform efficient inference. Although motivated from a pragmatic point of view, this approach has led to a fruitful collection of software systems including BUGS, Stan [Stan Development Team, 2014], and Infer.NET [Minka et al., 2014].

The second approach to probabilistic programming relaxes constraints imposed by existing inference algorithms, and attempts to introduce languages that are flexible enough to encode arbitrary probabilistic models.

2.2 Inference for probabilistic programs

Probabilistic programs can only realize their flexibility potential when accompanied by efficient inference engines. To explain how inference in probabilistic programming works, we consider the following HMM example with K states:

$$
\begin{aligned}
\pi_k &\sim \mathrm{Dir}(\theta) \\
\phi_k &\sim p(\gamma) \qquad (k = 1, 2, \ldots, K) \\
z_t \mid z_{t-1} &\sim \mathrm{Cat}(\cdot \mid \pi_{z_{t-1}}) \\
y_t \mid z_t &\sim h(\cdot \mid \phi_{z_t}) \qquad (t = 1, 2, \ldots, N)
\end{aligned}
\tag{4}
$$

Here Dir and Cat denote the Dirichlet and Categorical distributions respectively. The complete collection of parameters in this model is $\{\pi_{1:K}, \phi_{1:K}, z_{1:T}\}$. An efficient Gibbs sampler with the following series of three steps is often used for Bayesian inference:

Step 1: Sample $z_{1:T} \sim z_{1:T} \mid \phi_{1:K}, \pi_{1:K}, y_{1:T}$;
Step 2: Sample $\phi_k \sim \phi_k \mid z_{1:T}, y_{1:T}, \gamma$;
Step 3: Sample $\pi_k \sim \pi_k \mid z_{1:T}, \theta$ $(k = 1, \ldots, K)$.

2.3 Computation graph based inference

One challenge of performing inference for probabilistic programs is how to obtain the computation graph between model variables. This is in contrast with graphical models, where the dependence structure (or computation graph) is normally static and known ahead of time. For example, among the variables in the HMM model, the chain $z_1, z_2, \ldots, z_T$ is dependent, while the other variables are independent given this chain. However, when the HMM model is represented in the form of a probabilistic program (i.e. $\theta = \{\pi_{1:K}, \phi_{1:K}, z_{1:T}\}$, cf. Algorithm 2.1), we no longer have the computation graph between model parameters.

For certain probabilistic programs, it is possible to construct the computation graph between variables through static analysis. This is the approach taken by the BUGS language [Lunn et al., 2000] and Infer.NET [Minka et al., 2014]. Once the computation graph is constructed, a Gibbs sampler or message passing algorithm [Minka, 2001, Winn and Bishop, 2005] can be applied to each random node of the computation graph. However, one caveat of this approach is that the computation graph underlying a probabilistic program needs to be fixed during inference time. For programs involving stochastic branches, this requirement may not be satisfied. In such cases, we have to resort to other inference methods.

Sampler   Support discrete   Require      Require     Support universal   Composable MCMC
          variables?         gradients?   adaption?   programs?           operator?
HMC       No                 Yes          Yes         No                  Yes
NUTS      No                 Yes          Yes         No                  Yes
IS        Yes                No           No          Yes                 No
SMC       Yes                No           No          Yes                 No
PG        Yes                No           No          Yes                 Yes
PMMH      Yes                No           No          Yes                 Yes
IPMCMC    Yes                No           No          Yes                 Yes

Table 1: Supported Monte Carlo algorithms in Turing (v0.4).

2.4 Hamiltonian Monte Carlo based inference

For the family of models whose log probability is pointwise computable and differentiable, there exists an efficient sampling method using Hamiltonian dynamics. Developed as an extension of the Metropolis algorithm, [...] with stochastic branches [Goodman et al., 2008, Mansinghka et al., 2014, Wood et al., 2014], such as conditionals, loops and recursions, poses a substantially more challenging Bayesian inference problem because inference engines have to manage a varying number of model dimensions, dynamic computation graphs and so on.

2.5 Simulation based inference

Currently, most inference engines for universal probabilistic programs (those involving stochastic branches) use forward-simulation based sampling methods such as rejection sampling (RS), sequential Monte Carlo (SMC), and particle MCMC [Andrieu et al., 2010].
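As a minimal illustration of forward-simulation based inference, the Python sketch below applies rejection sampling to a toy program with one stochastic branch. The program, its variable names (`one_die`, `x1`, `x2`) and the observed value are hypothetical, chosen only so that the exact posterior can be checked by hand; this is not Turing code and makes no claim about how Turing implements its samplers. Note that the number of random variables in a trace depends on which branch is taken, so no fixed computation graph is available.

```python
import random


def program(rng):
    """Run the toy program forward once and return (trace, total).

    The trace holds 2 or 3 random choices depending on the stochastic branch,
    so the dimension of the latent space varies from run to run.
    """
    trace = {"one_die": rng.random() < 0.5}  # stochastic branch, prior 0.5
    if trace["one_die"]:
        trace["x1"] = rng.randint(1, 6)      # single die, inclusive bounds
        total = trace["x1"]
    else:
        trace["x1"] = rng.randint(1, 6)      # two dice
        trace["x2"] = rng.randint(1, 6)
        total = trace["x1"] + trace["x2"]
    return trace, total


def rejection_sample(observed_total, num_runs, seed=0):
    """Keep only the forward runs whose simulated output matches the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(num_runs):
        trace, total = program(rng)
        if total == observed_total:
            accepted.append(trace)
    return accepted


traces = rejection_sample(observed_total=5, num_runs=20000)
posterior_one_die = sum(t["one_die"] for t in traces) / len(traces)
print(f"accepted {len(traces)} of 20000 runs")
print(f"estimated P(one_die | total = 5): {posterior_one_die:.3f}")
```

For this toy program the exact posterior is P(one_die | total = 5) = (1/6) / (1/6 + 4/36) = 0.6, which the estimate should approach as the number of runs grows. The sampler only needs to run the program forward, which is why methods of this kind apply to programs with stochastic branches. The price is efficiency: every run that does not reproduce the observation is discarded.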
