Approximate Bayesian Computation with Kullback-Leibler Divergence as Data Discrepancy

Bai Jiang (Princeton University)    Tung-Yu Wu (Stanford University)    Wing Hung Wong (Stanford University)

Abstract

Complex simulator-based models usually have intractable likelihood functions, rendering likelihood-based inference methods inapplicable. Approximate Bayesian Computation (ABC) emerges as an alternative framework of likelihood-free inference methods. It identifies a quasi-posterior distribution by finding values of parameter that simulate synthetic data resembling the observed data. A major ingredient of ABC is the discrepancy measure between the observed and the simulated data, which conventionally involves a fundamental difficulty of constructing effective summary statistics. To bypass this difficulty, we adopt a Kullback-Leibler divergence estimator to assess the data discrepancy. Our method enjoys asymptotic consistency and linearithmic time complexity as the data size increases. In experiments on five benchmark models, this method achieves a comparable or higher quasi-posterior quality, compared to the existing methods using other discrepancy measures.

1 Introduction

The likelihood function is of central importance in statistical inference by characterizing the connection between the observed data and the value of parameter in models. Many simulator-based models, which are stochastic data generating mechanisms taking parameter values as input and returning data as output, have arisen in evolutionary biology [1, 2], dynamic systems [3, 4], economics [5, 6], epidemiology [7–9], aeronautics [10] and other disciplines. In these models, the observed data are seen as outcomes of the data generating mechanisms given some underlying true parameter. The simulator-based models are said to be implicit [11] because their likelihood functions involve integrals over latent variables or solutions to differential equations and have no explicit form. These models are also said to be generative [12] because they specify how to generate synthetic data sets.

As the unavailability of the likelihood function renders the conventional likelihood-based inference methods inapplicable, Approximate Bayesian Computation (ABC) emerges as an alternative framework of likelihood-free inference. [13–15] provide general overviews of ABC. Rejection ABC [1, 16–18], the first and simplest ABC algorithm, repeatedly draws values of parameter θ independently from some prior π, simulates synthetic data Y for each value of θ, and rejects the parameter θ if the discrepancy D(X, Y) between the observed data X and the simulated data Y exceeds a tolerance threshold ε. This algorithm obtains an independent and identically distributed (i.i.d.) random sample of parameter from a quasi-posterior distribution. Later, [3, 19–23] enhance the efficiency over rejection ABC by incorporating Markov chain Monte Carlo and sequential Monte Carlo techniques. From another perspective, [24] interprets the acceptance-rejection rule with the tolerance threshold ε in ABC as convoluting the target posterior distribution with a small ε-noise term. This interpretation encompasses the spirit of assigning continuous weights to proposed parameter draws rather than binary weights (either accepting or rejecting) in many ABC implementations [25].
A major ingredient of ABC is the data discrepancy measure D(X, Y), which crucially influences the quality of the quasi-posterior distribution. An ABC algorithm typically reduces the data X, Y to their summary statistics S(X), S(Y) and measures the distance between S(X) and S(Y) instead as the data discrepancy. Naturally one would like to use a low-dimensional and quasi-sufficient summary statistic S(·), which offers a satisfactory tradeoff between the acceptance rate of proposed parameters and the quality of the quasi-posterior [26]. Constructing effective summary statistics presents a fundamental difficulty and is actively pursued in the literature (see reviews [26, 27] and references therein). Most existing methods can be grouped into three main categories: best subset selection, regression and Bayesian indirect inference. The first category of methods selects the optimal subset from a set of pre-chosen candidate summary statistics according to various information criteria (e.g. measure of sufficiency [28], entropy [29], AIC/BIC [26], impurity in random forests [30]) and then uses the selected subset as the summary statistics. The candidate summary statistics are usually provided by experts in the specific scientific domain. The second category assembles methods that fit regressions on candidate summary statistics [31, 25]. The last category, inspired by indirect inference [32], dispenses with candidate summary statistics and directly constructs summary statistics from an auxiliary model [33, 34, 27].

This paper mainly focuses on the setting in which no informative candidate summary statistic is available but the observed data contain a moderately large number of i.i.d. samples X = {X_i}_{i=1}^n, allowing direct data discrepancy measures. Most methods for constructing summary statistics fail in this setting. An exception is Fearnhead and Prangle's semi-automatic method [25], which works well for univariate distributions. It performs a regression on quantiles (and their 2nd, 3rd and 4th powers) of the empirical distributions. Still unclear is how to extend this method to multivariate distributions. Another exception is the category of Bayesian indirect inference methods [33, 34, 27], but their performance relies on the choice of auxiliary models.

We propose using a Kullback-Leibler (KL) divergence estimator [35], which is based on the observed data X = {X_i}_{i=1}^n and the simulated data Y = {Y_i}_{i=1}^m, as the data discrepancy for ABC. We denote this estimator or data discrepancy by D_KL(X, Y). As such, we bypass the construction of summary statistics. The KL divergence KL(g0||g1), also known as information divergence or relative entropy, measures the distance between two distributions g0(x) and g1(x) [36]. Denote by {p_θ : θ ∈ Θ} the model under study, and by θ* the true parameter that generates X. Interpreting KL(p_θ*||p_θ) as the expectation of the log-likelihood ratio nicely connects it to the maximum likelihood estimation. Our method brings the KL divergence to Bayesian inference. Using the consistent KL divergence estimator developed by [35], our method is asymptotically consistent in the sense that the quasi-posterior converges to π(θ | KL(p_θ*||p_θ) < ε) ∝ π(θ) I(KL(p_θ*||p_θ) < ε) as the sample sizes n, m increase. Here I(·) denotes the indicator function. The KL divergence estimator used in our method might be replaced with other estimators [37–42].

The KL divergence method achieves a comparable or higher quality of the quasi-posterior distribution in the experiments on five benchmark models, compared to its three cousins: the classification accuracy method [43], the maximum mean discrepancy method [44] and the Wasserstein distance method [45]. It also enjoys a linearithmic time complexity: the cost for a single call of D_KL(X, Y), given observed and simulated data X, Y with n samples each, is O(n ln n). This cost is smaller than the O(n^2) cost of computing the maximum mean discrepancy in [44] and the Wasserstein distance in [45]. Computing the classification accuracy in [43] costs O(n) in general, but it generates much worse quasi-posteriors than the other methods in the experiments.

The remaining parts of the paper are structured as follows: Section 2 describes the ABC algorithm and five data discrepancy measures including our KL divergence estimator. Section 3 establishes the asymptotic consistency of our method and compares its limiting quasi-posterior distribution to those of other methods. In Section 4, we apply the methodology to five benchmark simulator-based models. Section 5 discusses our method and concludes the paper.
2 ABC and Data Discrepancy

Denote by X ⊆ R^d the data space, and by Θ the parameter space of interest. The model M = {p_θ : θ ∈ Θ} is a collection of distributions on X. It has no explicit form of p_θ(x) but can simulate i.i.d. random samples given a value of parameter. To identify the true parameter θ* that generates the i.i.d. observed data X = {X_i}_{i=1}^n, an Approximate Bayesian Computation (ABC) algorithm finds the values of parameter that generate synthetic data Y = {Y_i}_{i=1}^m ∼ p_θ i.i.d. resembling the observed data X. The extent to which the observed and simulated data resemble each other is quantified by a data discrepancy measure D(X, Y). Since our goal is to compare different data discrepancy measures rather than present a complete methodology for ABC, we only use rejection ABC, the simplest ABC algorithm (Algorithm 1), throughout this paper. For convenience of notation, we write p_θ(y) = Π_{i=1}^m p_θ(y_i).

Algorithm 1 Rejection ABC Algorithm
Input: - the observed data X = {X_i}_{i=1}^n;
       - a prior π(θ) over the parameter space Θ;
       - a tolerance threshold ε > 0;
       - a data discrepancy measure D : X^n × X^m → R^+.
for t = 1, ..., T do
    repeat: propose θ ∼ π(θ) and draw Y = {Y_i}_{i=1}^m ∼ p_θ i.i.d.
    until: D(X, Y) < ε
    Let θ^(t) = θ
end for
Output: θ^(1), θ^(2), ..., θ^(T)

Algorithm 1 outputs a random sample {θ^(t)}_{t=1}^T of the quasi-posterior distribution

    π(θ | X; D, ε) ∝ π(θ) ∫ I(D(X, y) < ε) p_θ(y) dy.    (1)

Next, we introduce our KL divergence method and describe other direct data discrepancies. The semi-automatic method and the Bayesian indirect inference method are also included as they do not require candidate summary statistics. Let us collect more notation. g0 and g1 denote densities of two d-dimensional distributions on the sample space X ⊆ R^d. For a vector x, ||x|| denotes its ℓ2 norm, and x_(i) denotes its i-th ordered element. For two scalars a and b, we write max{a, b} as a ∨ b. We write D in short for a data discrepancy D(X, Y), if the observed and simulated data X and Y are clear in the context.
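To make Algorithm 1 concrete, the following is a minimal Python sketch of rejection ABC (NumPy assumed). The function and argument names are ours, not from the paper; the prior sampler, simulator and discrepancy are placeholder callables.

```python
import numpy as np

def rejection_abc(x_obs, sample_prior, simulate, discrepancy, eps, T, seed=0):
    """Minimal rejection ABC in the spirit of Algorithm 1.

    sample_prior(rng) -> theta; simulate(theta, m, rng) -> (m, d) array;
    discrepancy(x, y) -> float. Returns T accepted parameter values.
    """
    rng = np.random.default_rng(seed)
    m = len(x_obs)                        # the experiments in the paper use m = n
    accepted = []
    while len(accepted) < T:
        theta = sample_prior(rng)         # propose theta ~ pi(theta)
        y = simulate(theta, m, rng)       # draw Y = {Y_i} ~ p_theta i.i.d.
        if discrepancy(x_obs, y) < eps:   # keep theta if D(X, Y) < eps
            accepted.append(theta)
    return accepted
```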

2.1 Kullback-Leibler (KL) Divergence as Data Discrepancy

The KL divergence between g0 and g1 is defined as

    KL(g0||g1) = ∫ g0(x) ln( g0(x) / g1(x) ) dx ≥ 0,

which is zero if and only if g0(x) = g1(x) almost everywhere. Given the observed and simulated data X and Y, we use the following estimator for KL(p_θ*||p_θ):

    D_KL = (d/n) Σ_{i=1}^n ln( min_j ||X_i − Y_j|| / min_{j≠i} ||X_i − X_j|| ) + ln( m/(n−1) ).    (2)

This estimator is the special case of Equation (14) in [35] using the 1-nearest-neighbor density estimate. Theorem 2 in [35] establishes the almost-sure convergence of (2). We use (2) as the data discrepancy in Algorithm 1. As this method involves 2n operations of nearest neighbor search, we use k-d trees [46, 47] to implement it. The time cost per call of D_KL is O((n ∨ m) ln(n ∨ m)) on average.
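A sketch of the estimator (2) built on SciPy's k-d tree (scipy.spatial.cKDTree). The assumption that no points are duplicated, so all nearest-neighbor distances are positive, is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def kl_divergence_1nn(x, y):
    """1-nearest-neighbor KL divergence estimate, as in equation (2).

    x: (n, d) observed sample, y: (m, d) simulated sample.
    Assumes no duplicated points, so all nearest-neighbor distances are > 0.
    """
    n, d = x.shape
    m = y.shape[0]
    # distance from each X_i to its nearest neighbor in Y
    nu = cKDTree(y).query(x, k=1)[0]
    # distance from each X_i to its nearest neighbor in X excluding itself
    rho = cKDTree(x).query(x, k=2)[0][:, 1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1.0))
```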
2.2 Classification Accuracy (CA) as Data Discrepancy

The classification accuracy method in [43] originates from the phenomenon that distinguishing Y from X is usually easier when θ is very different from θ* than when θ is similar to θ*. This method first labels X_i as class 0 and Y_i as class 1, yielding an augmented data set D = {(X_1, 0), ..., (X_n, 0), (Y_1, 1), ..., (Y_m, 1)}, and then trains a prediction rule (classifier) h : x → {0, 1} to distinguish the two classes. The authors define the discriminability or classifiability between two samples X and Y as the K-fold cross-validation classification accuracy, and use it as the data discrepancy for ABC. Formally, denoting by D_k the k-th fold of D, and by |D_k| its size,

    D_CA = (1/K) Σ_{k=1}^K (1/|D_k|) [ Σ_{i:(X_i,0)∈D_k} (1 − ĥ_k(X_i)) + Σ_{i:(Y_i,1)∈D_k} ĥ_k(Y_i) ],    (3)

where ĥ_k is the prediction rule trained on the data set D\D_k. Our experiments set K = 5 and h to be the Linear Discriminant Analysis (LDA) classifier. We choose LDA because [43] reports that the quasi-posterior quality seems insensitive to the choice of classifiers, and LDA is computationally cheaper than other classifiers. The time cost per call of D_CA is O(n + m).
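A sketch of (3) using scikit-learn's LDA classifier with 5-fold cross-validation; using StratifiedKFold with shuffling and the default accuracy scoring of cross_val_score are our implementation choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

def classification_accuracy(x, y, n_splits=5, seed=0):
    """K-fold cross-validated accuracy of an LDA classifier separating X from Y,
    i.e. the data discrepancy D_CA of equation (3)."""
    data = np.vstack([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearDiscriminantAnalysis(), data, labels, cv=cv)
    return scores.mean()
```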

2.3 Maximum Mean (MM) Discrepancy

The kernel embedding of a distribution g(x), defined as µ_g = ∫ k(·, x) g(x) dx, is an element in the RKHS H associated with a positive definite kernel k : X × X → R [48, 49]. The maximum mean (MM) discrepancy [50] between g0 and g1 is the distance of µ_g0 and µ_g1 in the Hilbert space H, i.e. MM^2(g0, g1) = ||µ_g0 − µ_g1||_H^2. [44] adopts an unbiased estimator of MM^2(p_θ*, p_θ) as the data discrepancy in ABC. The square of the estimator is as follows:

    D_MM^2 = Σ_{1≤i≠j≤n} k(X_i, X_j) / (n(n−1)) + Σ_{1≤i≠j≤m} k(Y_i, Y_j) / (m(m−1)) − 2 Σ_{i=1}^n Σ_{j=1}^m k(X_i, Y_j) / (nm).    (4)

In the same fashion as [44], we choose a Gaussian kernel with the bandwidth being the median of {||X_i − X_j|| : 1 ≤ i ≠ j ≤ n}. The time cost per call of D_MM is O((n + m)^2), as it requires computing the (n + m) × (n + m) pairwise distance matrix.
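A sketch of the unbiased estimator (4). The Gaussian kernel parameterization k(a, b) = exp(-||a - b||^2 / (2 h^2)), with h the median pairwise distance within the observed data, is our assumption about the bandwidth convention.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def mmd_squared(x, y):
    """Unbiased estimate of the squared maximum mean discrepancy, equation (4),
    with a Gaussian kernel; x, y are (n, d) and (m, d) arrays."""
    n, m = len(x), len(y)
    h = np.median(pdist(x))    # median pairwise distance among observed samples
    kern = lambda a, b: np.exp(-cdist(a, b, 'sqeuclidean') / (2.0 * h ** 2))
    kxx, kyy, kxy = kern(x, x), kern(y, y), kern(x, y)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```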
2.4 Wasserstein Distance

Let ρ be a distance on X ⊆ R^d. The q-Wasserstein distance between g0 and g1 is defined as

    W_q(g0, g1) = inf_{γ∈Γ(g0,g1)} ( ∫_{X×X} ρ(x, y)^q dγ(x, y) )^{1/q},

where Γ(g0, g1) is the set of all joint distributions γ(x, y) on X × X such that γ has marginals g0 and g1. An estimator of W_q(p_θ*, p_θ) based on the samples X and Y can serve as the data discrepancy for ABC. In particular, with q = 2 and ρ being the Euclidean distance, an instance is given by

    D_W2 = min_γ ( Σ_{i=1}^n Σ_{j=1}^m γ_ij ||X_i − Y_j||^2 )^{1/2}    (5)
    s.t. γ 1_m = 1_n,  γ^T 1_n = 1_m,  0 ≤ γ_ij ≤ 1,

where γ = {γ_ij : 1 ≤ i ≤ n, 1 ≤ j ≤ m} is an n × m matrix, and 1_n, 1_m are vectors of n and m ones, respectively.

For multivariate distributions (d > 1), exactly solving this optimization problem costs O((n + m)^3 ln(n + m)) [51] and approximate optimization algorithms [52, 53] reduce the cost to O((n + m)^2). For univariate distributions (d = 1), if n = m and ρ(x, y) = |x − y|, the q-Wasserstein distance has the explicit form ( (1/n) Σ_{i=1}^n |X_(i) − Y_(i)|^q )^{1/q}. In this special case, the computation cost is O(n ln n).
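A sketch of the 2-Wasserstein discrepancy for the equal-size case n = m, which is the setting used in the experiments. For d = 1 it uses the sorted-sample formula above; for d > 1 it solves the exact assignment problem with SciPy's linear_sum_assignment. Normalizing the squared cost by n is our convention and may differ from (5) by a constant factor.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein2(x, y):
    """2-Wasserstein distance between two (n, d) samples of equal size n = m."""
    if x.shape[1] == 1:
        # univariate case: match sorted samples
        return np.sqrt(np.mean((np.sort(x[:, 0]) - np.sort(y[:, 0])) ** 2))
    cost = cdist(x, y, 'sqeuclidean')          # pairwise squared Euclidean distances
    rows, cols = linear_sum_assignment(cost)   # exact optimal one-to-one coupling
    return np.sqrt(cost[rows, cols].mean())
```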

2.5 Distance between Summary Statistics

An ABC algorithm typically uses the distance between the data summaries S(X), S(Y) as the discrepancy measure D. In particular, if using the Euclidean distance,

    D_S = ||S(X) − S(Y)||.    (6)

The subscript S specifies the choice of summary statistic S. Most methods for constructing S require expertise in the specific scientific domain to provide candidate summary statistics. Two exceptions are the semi-automatic method [25] and the Bayesian indirect inference methods [33, 34, 27].

If no candidate summary statistic is given, the semi-automatic method can work for univariate distributions. The method first generates a large artificial data set of pairs (θ, Y) from the prior π and the simulated model. When the distribution is one-dimensional, each Y in the artificial data set is a vector of length m. The semi-automatic method performs a linear regression on the artificial data set with θ as target and candidate summary statistics as regressors. In the absence of candidate summary statistics, Fearnhead and Prangle [25] suggested using evenly-spaced quantiles (and their 2nd, 3rd and 4th powers) of Y as regressors. Formally, E[θ|Y] ≈ β_0 + Σ_{k=1}^{K−1} Σ_{l=1}^4 β_kl Y_(km/K)^l. The summary statistic S(y) is merely the prediction of θ given y, which is understood as an approximation of the posterior mean E[θ|y]. As the original paper [25] does not clearly show how to extend this method to multivariate distributions, in which case each simulated data set Y is an n × d matrix, we simply put quantiles of each marginal together as a total of 4d(K − 1) regressors. Our experiments set K = 8.

The Bayesian indirect inference methods construct the summary statistic from an auxiliary model {p_A(x|φ) : φ ∈ Φ}. See a general review of the literature in [27]. An instance of these methods is given by [33]. The authors suggest the maximum likelihood estimate (MLE) of the auxiliary model as the summary statistic. Formally,

    S(y) = φ̂(y) = argmax_{φ∈Φ} Π_{i=1}^m p_A(y_i|φ).    (7)

In particular, if p_A(x|φ) is d-dimensional Gaussian with parameter φ (which aggregates mean and covariance parameters together), the summary statistics S(y) are merely the sample mean and covariance of y = {y_i}_{i=1}^m. [34] also describes an approach that uses the auxiliary likelihood (AL) to set up a data discrepancy

    D_AL = (1/m) ln p_A(Y|φ̂(Y)) − (1/m) ln p_A(Y|φ̂(X)).    (8)

3 Asymptotic Analysis

This section analyzes the asymptotic quasi-posterior distributions as the size n of the observed data X goes to infinity (and the size m of the simulated data Y increases at the same rate, m/n → α > 0) with different data discrepancy measures. The tolerance threshold ε is assumed to be fixed. This analysis explains how different data discrepancy measures weight the values of parameter in their quasi-posterior distributions. The theoretical results show that the quasi-posterior distribution obtained by our KL divergence method (2) identifies θ with KL(p_θ*||p_θ) < ε asymptotically. In addition, our method is related to the classification accuracy method (3) by the concept of f-divergence [54]. The KL divergence method is also asymptotically equivalent to (8), a Bayesian indirect inference method, in case the auxiliary model is bijective to the true model. Proofs of theorems and corollaries are deferred to the Appendix.

We start this section with Theorem 1, which is an application of Lévy's upward theorem.

Theorem 1 (Asymptotic Quasi-Posterior) If the data discrepancy measure D(X, Y) in Algorithm 1 converges to some real number D(p_θ*, p_θ) almost surely as the data size n → ∞, m/n → α > 0, then the quasi-posterior distribution π(θ|X; D, ε) defined by (1) converges to π(θ | D(p_θ*, p_θ) < ε) for any θ. That is,

    lim_{n→∞} π(θ|X; D, ε) = π(θ | D(p_θ*, p_θ) < ε) ∝ π(θ) I(D(p_θ*, p_θ) < ε).

This theorem asserts that the asymptotic quasi-posterior is a restriction of the prior π on the region {θ ∈ Θ : D(p_θ*, p_θ) < ε}. Putting it together with the established almost sure convergence of D_KL in [35], we have a corollary for D_KL as follows.

Corollary 1 (Asymptotic Quasi-Posterior of D_KL) Let n → ∞ and m/n → α > 0. If Algorithm 1 uses D_KL defined by (2) as the data discrepancy measure, then the quasi-posterior distribution

    lim_{n→∞} π(θ|X; D_KL, ε) = π(θ | KL(p_θ*||p_θ) < ε) ∝ π(θ) I(KL(p_θ*||p_θ) < ε).

It is known that the maximum likelihood estimator minimizes the KL divergence between the empirical distribution of p_θ* and p_θ. ABC with D_KL shares the same idea of finding θ with small KL divergence.

Next, we relate D_KL to D_CA in the framework of f-divergence [54]. For a convex function f(t) with f(1) = 0, the f-divergence between two distributions g0 and g1 is defined as D_f(g0||g1) = ∫ f(g0(x)/g1(x)) g1(x) dx. The KL divergence belongs to this class of divergences with f(t) = t ln t. D_CA (induced by the optimal classifier) is also related to f-divergence.

Corollary 2 (Asymptotic Quasi-Posterior of D_CA) Let n → ∞ and m/n → α > 0. If the optimal (Bayes) classifier h(x) = I(α p_θ(x) ≥ p_θ*(x)) induces D_CA in (3), then Algorithm 1 with D_CA yields the quasi-posterior distribution

    lim_{n→∞} π(θ|X; D_CA, ε) = π(θ | D_f(p_θ*||p_θ) + c(α) < ε) ∝ π(θ) I(D_f(p_θ*||p_θ) + c(α) < ε),

where the constant c(α) = (α ∨ 1)/(1 + α) is the classification accuracy of the naive prediction rule h_0(x) = I(α ≤ 1) and the f-divergence D_f corresponds to f(t) = (α/t) ∨ 1 − α ∨ 1.

This result suggests that D_CA with a general classifier is a suboptimal estimator for the f-divergence D_f(p_θ*||p_θ), since the classifier in use is an approximation of the optimal (Bayes) classifier.

Thirdly, our KL divergence method also resembles one of the Bayesian indirect inference methods, namely (8).

Corollary 3 (Asymptotic Quasi-Posterior of D_AL) Let n → ∞ and m/n → α > 0. If an auxiliary model {p_A(·|φ) : φ ∈ Φ} is bijective to the model {p_θ : θ ∈ Θ} and induces D_AL in (8), then Algorithm 1 with D_AL yields the quasi-posterior distribution

    lim_{n→∞} π(θ|X; D_AL, ε) = π(θ | KL(p_θ||p_θ*) < ε) ∝ π(θ) I(KL(p_θ||p_θ*) < ε).

It is worth noting that similar results may not hold for D_MM defined by (4) and D_W2 defined by (5), as their convergences have not been established in the literature, except that [50] shows that D_MM converges to MM in some special cases. For the summary-based data discrepancy (6), in general

    π(θ|X; D_S, ε) → π(θ | ||s(θ*) − s(θ)|| < ε)

if s(θ*) and s(θ) are the limits of S(X) and S(Y) as n, m → ∞. The semi-automatic method advocates an approximation of the posterior mean as the summary statistic. As S(X) = E[θ|X] → θ* and S(Y) = E[θ|Y] → θ, we would have π(θ|X; D_S, ε) → π(θ | ||θ* − θ|| < ε). However, this appealing asymptotic quasi-posterior is a mirage because the semi-automatic construction provides only a projection of E[θ|y] onto the function class spanned by the regressors. In most cases, finding the posterior mean E[θ|y], the optimal estimator in the Bayesian sense, is even harder than the task of constructing summary statistics.

4 Experiments

We run experiments on five benchmark models: a bivariate Gaussian mixture model (d = 2), an M/G/1-queuing model (d = 5), a bivariate beta model (d = 2), a moving-average model of order 2 and length d = 10, and a multivariate g-and-k distribution (d = 5). In each experiment, we set n = m and the tolerance threshold ε adaptively such that 50 of 10^5 proposed θ are accepted.

4.1 Toy Example: Gaussian Mixture Model

The univariate Gaussian mixture model serves as a benchmark model in the ABC literature [20, 24]. Here we use a more challenging bivariate Gaussian mixture model: Z ∼ Bernoulli(p), X|Z=0 ∼ N(µ0, [0.5, −0.3; −0.3, 0.5]), X|Z=1 ∼ N(µ1, [0.25, 0; 0, 0.25]). The unknown parameter θ = (p, µ0, µ1) consists of the mixture ratio p and the subpopulation means µ0, µ1. We perform ABC on n = 500 observed samples which are generated from the true parameters p* = 0.3, µ0* = (+0.7, +0.7), µ1* = (−0.7, −0.7). The prior we used is p ∼ Uniform[0, 1], µ0, µ1 ∼ Uniform[−1, +1]^2.

Figure 1 shows that the KL divergence D_KL visibly outperforms the other methods. D_W2, D_MM and D_S with the auxiliary MLE as summary roughly figure out the true parameters but mix up the two subpopulation means. The other two methods do not find the true parameters.

For estimating the mixture ratio p, the KL divergence method achieves a mean square error of 0.001, more accurate than the other methods (D_W2: 0.002; D_MM: 0.015; D_CA: 0.053; D_S with the auxiliary MLE as summary: 0.020; D_S with the semi-automatic construction as summary: 0.025).
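A minimal sketch of this mixture simulator (NumPy assumed; the function and argument names are ours):

```python
import numpy as np

def simulate_mixture(p, mu0, mu1, n, rng):
    """Simulate n draws from the bivariate Gaussian mixture of Section 4.1."""
    cov0 = np.array([[0.5, -0.3], [-0.3, 0.5]])
    cov1 = np.array([[0.25, 0.0], [0.0, 0.25]])
    z = rng.random(n) < p                            # Z ~ Bernoulli(p)
    x0 = rng.multivariate_normal(mu0, cov0, size=n)  # draws for Z = 0
    x1 = rng.multivariate_normal(mu1, cov1, size=n)  # draws for Z = 1
    return np.where(z[:, None], x1, x0)

# e.g. observed data at the true parameters of the experiment:
x_obs = simulate_mixture(0.3, [0.7, 0.7], [-0.7, -0.7], 500, np.random.default_rng(0))
```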

Figure 1: Quasi-posteriors for the Gaussian mixture model. Black lines cross at µ0* and µ1*.

4.2 M/G/1-queuing Model

Queuing models are usually easy to simulate from but have intractable likelihoods. We adopt a specific M/G/1-queue that has been studied by ABC methods [55, 25]. In this model, the service times follow Uniform[θ1, θ2] and the inter-arrival times are exponentially distributed with rate θ3. Each datum is a 5-dimensional vector consisting of the first five inter-departure times x = (x1, x2, x3, x4, x5) after the queue starts from empty. The unknown parameter is θ = (θ1, θ2, θ3). We perform ABC on n = 500 observed samples which are generated from the true parameters θ* = (1, 5, 0.2). The prior we used is θ1 ∼ Uniform[0, 10], θ2 − θ1 ∼ Uniform[0, 10], θ3 ∼ Uniform[0, 0.5].

Figure 2 shows that D_KL produces better inference for θ1* and θ2* than D_W2 and D_MM. It also captures θ3*, although the marginal quasi-posterior does not concentrate as much as those of D_W2 and D_MM. The summary statistics constructed by the semi-automatic method using marginal octiles identify θ1* and give meaningless inference about θ2* and θ3*.
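A sketch of this queue simulator; the Lindley-style recursion for departure times is our reconstruction of the standard M/G/1 setup, not code from the paper.

```python
import numpy as np

def simulate_mg1(theta1, theta2, theta3, n, rng):
    """Simulate n data points for the M/G/1 queue of Section 4.2: each datum is
    the vector of the first five inter-departure times, starting from an empty queue."""
    data = np.empty((n, 5))
    for k in range(n):
        service = rng.uniform(theta1, theta2, size=5)                 # service times
        arrivals = np.cumsum(rng.exponential(1.0 / theta3, size=5))   # arrival times
        depart_prev = 0.0
        for i in range(5):
            depart = max(arrivals[i], depart_prev) + service[i]
            data[k, i] = depart - depart_prev                         # inter-departure time
            depart_prev = depart
    return data
```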

Figure 2: Quasi-posteriors for the M/G/1-queue model. Black lines intercept the x-axis at the true parameters.

4.3 Bivariate Beta Model

Arnold and Ng [56] define an 8-parameter model as follows. Let U_i ∼ Gamma(θ_i, 1), i = 1, ..., 8,

    V1 = (U1 + U5 + U7)/(U3 + U6 + U8),
    V2 = (U2 + U5 + U8)/(U4 + U6 + U7),

and then Z1 = V1/(1 + V1), Z2 = V2/(1 + V2) are marginally beta distributed, and Z = (Z1, Z2) jointly follows a bivariate beta distribution. Crackel and Flegal [57] consider a 5-parameter sub-model by restricting θ3 = θ4 = θ5 = 0. Another variant of this model was studied by [58]. We apply the methodology to Crackel and Flegal [57]'s sub-model.

We perform ABC on n = 500 observed samples which are generated from θ* = (1, 1, 1, 1, 1). This parameter setting was used in [56]. The prior we used is (θ1, θ2, θ6, θ7, θ8) ∼ Uniform[0, 5]^5. Figure 3 shows an overall satisfactory performance of D_KL among the others.
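A sketch of the sub-model simulator (with θ3 = θ4 = θ5 = 0, the corresponding gamma variables U3, U4, U5 are zero); the function signature is ours.

```python
import numpy as np

def simulate_bivariate_beta(theta1, theta2, theta6, theta7, theta8, n, rng):
    """Simulate n draws (Z1, Z2) from the 5-parameter bivariate beta sub-model of Section 4.3."""
    u1 = rng.gamma(theta1, 1.0, n)
    u2 = rng.gamma(theta2, 1.0, n)
    u6 = rng.gamma(theta6, 1.0, n)
    u7 = rng.gamma(theta7, 1.0, n)
    u8 = rng.gamma(theta8, 1.0, n)
    v1 = (u1 + u7) / (u6 + u8)    # U5 = U3 = 0 in the sub-model
    v2 = (u2 + u8) / (u6 + u7)    # U5 = U4 = 0 in the sub-model
    z1, z2 = v1 / (1.0 + v1), v2 / (1.0 + v2)
    return np.column_stack([z1, z2])
```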

Figure 3: Quasi-posteriors for the bivariate beta model. Black lines intercept the x-axis at the true parameters.

4.4 Moving-average Model of Order 2

Marin et al. [14] use the moving-average model of order 2 as a benchmark model in their review. This model generates Y_j = Z_j + θ1 Z_{j−1} + θ2 Z_{j−2}, j = 1, ..., d, with Z_j being unobserved noise error terms. Each datum Y is a time series of length d. We take Z_j to follow Student's t distribution with 5 degrees of freedom and set d = 10. With the prior (θ1, θ2) ∼ Uniform([−2, +2] × [−1, +1]), ABC was performed on n = 200 observed samples generated from θ* = (0.6, 0.2). D_KL, D_W2, D_MM and D_S with the auxiliary MLE as summary obtain comparably high quality quasi-posteriors. See Figure 4.
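A sketch of this MA(2) simulator with Student-t(5) noise; padding the noise sequence with two extra terms for the lags is our implementation choice.

```python
import numpy as np

def simulate_ma2(theta1, theta2, n, d=10, df=5, seed=None):
    """Simulate n length-d series from Y_j = Z_j + theta1*Z_{j-1} + theta2*Z_{j-2},
    with Z_j i.i.d. Student-t with df degrees of freedom (Section 4.4)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_t(df, size=(n, d + 2))   # two extra noise terms for the lags
    return z[:, 2:] + theta1 * z[:, 1:-1] + theta2 * z[:, :-2]
```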

Figure 4: Quasi-posteriors for the moving-average model. Black lines cross at the true parameters.

4.5 Multivariate g-and-k Distribution

The univariate g-and-k distribution is defined by its inverse distribution function (9). It has no analytical form of the density function, and the numerical evaluation of the likelihood function is costly [59]:

    F^{-1}(x) = A + B [ 1 + c (1 − e^{−g z_x}) / (1 + e^{−g z_x}) ] (1 + z_x^2)^k z_x,    (9)

where z_x is the x-th quantile of the standard normal distribution, the parameters A, B, g, k are related to location, scale, skewness and kurtosis, respectively, and c = 0.8 is the conventional choice [25]. As the inversion transform method can conveniently sample from this distribution, by drawing Z ∼ N(0, 1) i.i.d. and then transforming them into g-and-k distributed random variables, [60, 5, 61, 25] have performed ABC on it. The multivariate g-and-k distribution has also been considered [62, 63]. Here we study a 5-dimensional g-and-k distribution: first draw (Z1, ..., Z5)^T ∼ N(0, Σ)

with Σ having the sparse structure Σ_ii = 1 and Σ_ij = ρ if |i − j| = 1 and 0 otherwise, and then transform the coordinates marginally as the univariate g-and-k distribution does. Again D_KL obtains comparably high quality quasi-posteriors. See Figure 5.
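A sketch of this multivariate g-and-k simulator; treating (A, B, g, k) as shared across the five coordinates, with ρ as the additional correlation parameter, is our reading of the text rather than a specification from the paper.

```python
import numpy as np

def simulate_gandk(A, B, g, k, rho, n, c=0.8, seed=None):
    """Draw Z ~ N(0, Sigma) with Sigma tridiagonal (1 on the diagonal, rho on the
    first off-diagonals), then apply the marginal transform of equation (9)."""
    rng = np.random.default_rng(seed)
    d = 5
    sigma = np.eye(d) + rho * (np.eye(d, k=1) + np.eye(d, k=-1))
    z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    return A + B * (1 + c * (1 - np.exp(-g * z)) / (1 + np.exp(-g * z))) * (1 + z ** 2) ** k * z
```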

Figure 5: Quasi-posteriors for the multivariate g-and-k model. Solid lines cross at µ0* and µ1*.

5 Discussion

We add the KL divergence estimator to the arsenal of data discrepancy measures for ABC. This estimator converges to the exact KL divergence. Thus D_KL, as the data discrepancy for ABC, yields a quasi-posterior which eventually concentrates on {θ : KL(p_θ*||p_θ) < ε} as the sample size increases. Our method and the maximum likelihood estimation share the same spirit of finding minimizers of the KL divergence. We also connect the KL divergence method to the classification accuracy method D_CA and a Bayesian indirect inference method D_AL. Both methods are somewhat suboptimal to D_KL, as they are equivalent to D_KL asymptotically only if their key ingredients (the prediction rule for D_CA or the auxiliary model for D_AL) are "oracle" ones.

We run experiments on five benchmark models and compare different data discrepancy measures. In terms of the quasi-posterior quality, D_KL performs comparably well or even better than D_MM and D_W2, and much better than the other methods.

Apart from the quasi-posterior quality, another core consideration of ABC is the computational tractability of the data discrepancy function. A whole run of an ABC analysis usually calls the data discrepancy function millions of times. In the case of n = m, the O(n ln n) cost per call of the KL divergence method is thus attractive, whereas D_MM and D_W2 have O(n^2) cost per call.

Our theoretical results assume m = αn asymptotically, and our experiments use the typical setting m = n. In the experiments, our method performs well when α ∈ [0.5, 1.0]. If α is close to 0, then the commonly-seen imbalance issue in two-sample problems arises.

Our theoretical results also assume that n goes to infinity. For finite data samples, we need the convergence rate or non-asymptotic error bounds of the KL estimator in use to quantify the quality of the quasi-posterior distribution. Unfortunately, apart from the almost sure convergence result for the KL estimator, there are few results in the literature that more precisely justify the KL estimators.

Acknowledgements

This work is supported by NSF Grants DMS-1407557 and DMS-1721550. It was completed when Bai Jiang was a Ph.D. student at Stanford University. The authors thank the anonymous referees for their suggestions.

References

[1] Simon Tavaré, David J Balding, Robert C Griffiths, and Peter Donnelly. Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518, 1997.
[2] Mark A Beaumont, Wenyang Zhang, and David J Balding. Approximate Bayesian Computation in population genetics. Genetics, 162(4):2025–2035, 2002.
[3] Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, and Michael PH Stumpf. Approximate Bayesian Computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6(31):187–202, 2009.
[4] Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.
[5] Glynn W Peters and Scott A Sisson. Bayesian inference, Monte Carlo sampling and operational risk. Journal of Operational Risk, 1(3):27–50, 2006.
[6] Gareth W Peters, Scott A Sisson, and Yanan Fan. Likelihood-free Bayesian inference for α-stable models. Computational Statistics & Data Analysis, 56(11):3743–3756, 2012.
[7] Mark M Tanaka, Andrew R Francis, Fabio Luciani, and SA Sisson. Using Approximate Bayesian Computation to estimate tuberculosis transmission parameters from genotype data. Genetics, 173(3):1511–1520, 2006.
[8] Michael GB Blum and Viet Chi Tran. HIV with contact tracing: a case study in Approximate Bayesian Computation. Biostatistics, 11(4):644–660, 2010.
[9] Trevelyan J McKinley, Joshua V Ross, Rob Deardon, and Alex R Cook. Simulation-based Bayesian inference for epidemic models. Computational Statistics & Data Analysis, 71:434–447, 2014.
[10] Jason Christopher, Caelan Lapointe, Nicholas Wimer, Torrey Hayden, Ian Grooms, Gregory B Rieker, and Peter E Hamlington. Parameter estimation for a turbulent buoyant jet using Approximate Bayesian Computation. In 55th AIAA Aerospace Sciences Meeting, page 0531, 2017.
[11] Peter J Diggle and Richard J Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), pages 193–227, 1984.
[12] Christopher M Bishop and Julia Lasserre. Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8:3–24, 2007.
[13] Katalin Csilléry, Michael GB Blum, Oscar E Gaggiotti, and Olivier François. Approximate Bayesian Computation in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.
[14] Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ryder. Approximate Bayesian Computational methods. Statistics and Computing, pages 1–14, 2012.
[15] Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesian Computation. PLoS Computational Biology, 9(1):e1002803, 2013.
[16] Yun-Xin Fu and Wen-Hsiung Li. Estimating the age of the common ancestor of a sample of DNA sequences. Molecular Biology and Evolution, 14(2):195–199, 1997.
[17] Gunter Weiss and Arndt von Haeseler. Inference of population history using a likelihood approach. Genetics, 149(3):1539–1546, 1998.
[18] Jonathan K Pritchard, Mark T Seielstad, Anna Perez-Lezaun, and Marcus W Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
[19] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
[20] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
[21] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Correction for "Sequential Monte Carlo without likelihoods". Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2009.
[22] Christopher C Drovandi and Anthony N Pettitt. Estimation of parameters for macroparasite population evolution using Approximate Bayesian Computation. Biometrics, 67(1):225–233, 2011.
[23] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive Sequential Monte Carlo method for Approximate Bayesian Computation. Statistics and Computing, 22(5):1009–1020, 2012.
[24] Richard D Wilkinson. Approximate Bayesian Computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141, 2013.
[25] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for Approximate Bayesian Computation: semi-automatic Approximate Bayesian Computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.
[26] Michael GB Blum, Maria A Nunes, Dennis Prangle, and Scott A Sisson. A comparative review of dimension reduction methods in Approximate Bayesian Computation. Statistical Science, 28(2):189–208, 2013.
[27] Christopher C Drovandi, Anthony N Pettitt, and Anthony Lee. Bayesian indirect inference using a parametric auxiliary model. Statistical Science, 30(1):72–95, 2015.
[28] Paul Joyce and Paul Marjoram. Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 7(1):26, 2008.
[29] Matthew A Nunes and David J Balding. On optimal selection of summary statistics for Approximate Bayesian Computation. Statistical Applications in Genetics and Molecular Biology, 9(1):34, 2010.
[30] Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P Robert. Reliable ABC model choice via random forests. Bioinformatics, 32(6):859–866, 2015.
[31] Daniel Wegmann, Christoph Leuenberger, and Laurent Excoffier. Efficient Approximate Bayesian Computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207–1218, 2009.
[32] Christian Gourieroux, Alain Monfort, and Eric Renault. Indirect inference. Journal of Applied Econometrics, 8(S1):S85–S118, 1993.
[33] Christopher C Drovandi, Anthony N Pettitt, and Malcolm J Faddy. Approximate Bayesian computation using indirect inference. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):317–337, 2011.
[34] Alexander Gleim and Christian Pigorsch. Approximate Bayesian Computation with indirect summary statistics. Draft paper: http://ect-pigorsch.mee.uni-bonn.de/data/research/papers, 2013.
[35] Fernando Pérez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory (ISIT), pages 1666–1670. IEEE, 2008.
[36] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[37] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074, 2005.
[38] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS, pages 1089–1096, 2007.
[39] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392–2405, 2009.
[40] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[41] Jorge Silva and Shrikanth S Narayanan. Information divergence estimation based on data-dependent partitions. Journal of Statistical Planning and Inference, 140(11):3180–3198, 2010.
[42] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, and Larry Wasserman. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.
[43] Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Likelihood-free inference via classification. Statistics and Computing, 28(2):411–425, 2018.
[44] Mijung Park, Wittawat Jitkrittum, and Dino Sejdinovic. K2-ABC: Approximate Bayesian Computation with kernel embeddings. In Artificial Intelligence and Statistics, pages 398–407, 2016.
[45] Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Inference in generative models using the Wasserstein distance. arXiv preprint arXiv:1701.05146, 2017.
[46] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[47] Songrit Maneewongvatana and David Mount. On the efficiency of nearest neighbor searching with data clustered in lower dimensions. Computational Science—ICCS 2001, pages 842–851, 2001.
[48] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
[49] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
[50] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[51] Rainer Burkard, Mauro Dell'Amico, and Silvano Martello. Assignment Problems. Society for Industrial and Applied Mathematics, 2009.
[52] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[53] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
[54] Imre Csiszár and Paul C Shields. Information theory and statistics: a tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.
[55] Michael GB Blum and Olivier François. Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1):63–73, 2010.
[56] Barry C Arnold and Hon Keung Tony Ng. Flexible bivariate beta distributions. Journal of Multivariate Analysis, 102(8):1194–1202, 2011.
[57] Roberto Crackel and James Flegal. Bayesian inference for a flexible class of bivariate beta distributions. Journal of Statistical Computation and Simulation, 87(2):295–312, 2017.
[58] Ingram Olkin and Ruixue Liu. A bivariate beta distribution. Statistics & Probability Letters, 62(4):407–412, 2003.
[59] Glen D Rayner and Helen L MacGillivray. Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12(1):57–75, 2002.
[60] Michele Haynes and Kerrie Mengersen. Bayesian estimation of g-and-k distributions using MCMC. Computational Statistics, 20(1):7–30, 2005.
[61] David Allingham, R AR King, and Kerrie L Mengersen. Bayesian estimation of quantile distributions. Statistics and Computing, 19(2):189–201, 2009.
[62] Christopher C Drovandi and Anthony N Pettitt. Likelihood-free Bayesian estimation of multivariate quantile distributions. Computational Statistics & Data Analysis, 55(9):2541–2556, 2011.
[63] Jingjing Li, David J Nott, Yanan Fan, and Scott A Sisson. Extending Approximate Bayesian Computation methods to high dimensions via a Gaussian copula model. Computational Statistics & Data Analysis, 106:77–89, 2017.