Approximate Bayesian Computation with Kullback-Leibler Divergence as Data Discrepancy

Bai Jiang (Princeton University)    Tung-Yu Wu (Stanford University)    Wing Hung Wong (Stanford University)

Abstract

Complex simulator-based models usually have intractable likelihood functions, rendering likelihood-based inference methods inapplicable. Approximate Bayesian Computation (ABC) emerges as an alternative framework of likelihood-free inference methods. It identifies a quasi-posterior distribution by finding values of parameter that simulate synthetic data resembling the observed data. A major ingredient of ABC is the discrepancy measure between the observed and the simulated data, which conventionally involves a fundamental difficulty of constructing effective summary statistics. To bypass this difficulty, we adopt a Kullback-Leibler divergence estimator to assess the data discrepancy. Our method enjoys asymptotic consistency and linearithmic time complexity as the data size increases. In experiments on five benchmark models, this method achieves a comparable or higher quasi-posterior quality, compared to the existing methods using other discrepancy measures.

1 Introduction

The likelihood function is of central importance in statistical inference by characterizing the connection between the observed data and the value of parameter in models. Many simulator-based models, which are stochastic data generating mechanisms taking parameter values as input and returning data as output, have arisen in evolutionary biology [1, 2], dynamic systems [3, 4], economics [5, 6], epidemiology [7–9], aeronautics [10] and other disciplines. In these models, the observed data are seen as outcomes of the data generating mechanisms given some underlying true parameter. The simulator-based models are said to be implicit [11] because their likelihood functions involve integrals over latent variables or solutions to differential equations and have no explicit form. These models are also said to be generative [12] because they specify how to generate synthetic data sets.

As the unavailability of the likelihood function renders the conventional likelihood-based inference methods inapplicable, Approximate Bayesian Computation (ABC) emerges as an alternative framework of likelihood-free inference. [13–15] provide general overviews of ABC. Rejection ABC [1, 16–18], the first and simplest ABC algorithm, repeatedly draws values of parameter θ independently from some prior π, simulates synthetic data Y for each value of θ, and rejects the parameter θ if the discrepancy D(X, Y) between the observed data X and the simulated data Y exceeds a tolerance threshold ε. This algorithm obtains an independent and identically distributed (i.i.d.) random sample of parameter from a quasi-posterior distribution. Later, [3, 19–23] enhance the efficiency over rejection ABC by incorporating Markov chain Monte Carlo and sequential Monte Carlo techniques. From another perspective, [24] interprets the acceptance-rejection rule with the tolerance threshold ε in ABC as convoluting the target posterior distribution with a small ε-noise term. This interpretation encompasses the spirit of assigning continuous weights to proposed parameter draws rather than binary weights (either accepting or rejecting) in many ABC implementations [25].
A major ingredient of ABC is the data discrepancy measure D(X, Y), which crucially influences the quality of the quasi-posterior distribution. An ABC algorithm typically reduces the data X, Y to their summary statistics S(X), S(Y) and measures the distance between S(X) and S(Y) instead as the data discrepancy. Naturally one would like to use a low-dimensional and quasi-sufficient summary statistic S(·), which offers a satisfactory tradeoff between the acceptance rate of proposed parameters and the quality of the quasi-posterior [26]. Constructing effective summary statistics presents a fundamental difficulty and is actively pursued in the literature (see reviews [26, 27] and references therein). Most existing methods can be grouped into three main categories: best subset selection, regression and Bayesian indirect inference. The first category of methods selects the optimal subset from a set of pre-chosen candidate summary statistics according to various information criteria (e.g. measure of sufficiency [28], entropy [29], AIC/BIC [26], impurity in random forests [30]) and then uses the selected subset as the summary statistics. The candidate summary statistics are usually provided by experts in the specific scientific domain. The second category assembles methods that fit regressions on candidate summary statistics [31, 25]. The last category, inspired by indirect inference [32], dispenses with candidate summary statistics and directly constructs summary statistics from an auxiliary model [33, 34, 27].

This paper mainly focuses on the setting in which no informative candidate summary statistic is available but the observed data contain a moderately large number of i.i.d. samples X = {X_i}_{i=1}^n, allowing direct data discrepancy measures. Most methods for constructing summary statistics fail in this setting. An exception is Fearnhead and Prangle's semi-automatic method [25], which works well for univariate distributions. It performs a regression on quantiles (and their 2nd, 3rd and 4th powers) of the empirical distributions. Still unclear is how to extend this method to multivariate distributions. Another exception is the category of Bayesian indirect inference methods [33, 34, 27], but their performance relies on the choice of auxiliary models.

We propose using a Kullback-Leibler (KL) divergence estimator [35], which is based on the observed data X = {X_i}_{i=1}^n and the simulated data Y = {Y_i}_{i=1}^m, as the data discrepancy for ABC. We denote this estimator or data discrepancy by D_KL(X, Y). As such, we bypass the construction of summary statistics. The KL divergence KL(g0||g1), also known as information divergence or relative entropy, measures the distance between two distributions g0(x) and g1(x) [36]. Denote by {p_θ : θ ∈ Θ} the model under study, and by θ* the true parameter that generates X. Interpreting KL(p_θ*||p_θ) as the expectation of the log-likelihood ratio nicely connects it to the maximum likelihood estimation. Our method brings the KL divergence to Bayesian inference. Using the consistent KL divergence estimator developed by [35], our method is asymptotically consistent in the sense that the quasi-posterior converges to π(θ | KL(p_θ*||p_θ) < ε) ∝ π(θ) I(KL(p_θ*||p_θ) < ε) as the sample sizes n, m increase. Here I(·) denotes the indicator function. The KL divergence estimator used in our method might be replaced with other estimators [37–42].

The KL divergence method achieves a comparable or higher quality of the quasi-posterior distribution in the experiments on five benchmark models, compared to its three cousins: the classification accuracy method [43], the maximum mean discrepancy method [44] and the Wasserstein distance method [45]. It also enjoys a linearithmic time complexity: the cost for a single call of D_KL(X, Y), given observed and simulated data X, Y with n samples each, is O(n ln n). This cost is smaller than the O(n^2) cost of computing the maximum mean discrepancy in [44] and the Wasserstein distance in [45]. Computing the classification accuracy in [43] costs O(n) in general, but it generates much worse quasi-posteriors than the other methods in the experiments.

The remaining parts of the paper are structured as follows: Section 2 describes the ABC algorithm and five data discrepancy measures including our KL divergence estimator. Section 3 establishes the asymptotic consistency of our method and compares its limiting quasi-posterior distribution to those of other methods. In Section 4, we apply the methodology to five benchmark simulator-based models. Section 5 discusses our method and concludes the paper.
2 ABC and Data Discrepancy

Denote by X ⊆ R^d the data space, and by Θ the parameter space of interest. The model M = {p_θ : θ ∈ Θ} is a collection of distributions on X. It has no explicit form of p_θ(x) but can simulate i.i.d. random samples given a value of parameter. To identify the true parameter θ* that generates the i.i.d. observed data X = {X_i}_{i=1}^n, an Approximate Bayesian Computation (ABC) algorithm finds the values of parameter that generate synthetic data Y = {Y_i}_{i=1}^m ∼ p_θ i.i.d. resembling the observed data X. The extent to which the observed and simulated data resemble each other is quantified by a data discrepancy measure D(X, Y). Since our goal is to compare different data discrepancy measures rather than present a complete methodology for ABC, we only use rejection ABC, the simplest ABC algorithm (Algorithm 1), throughout this paper. For convenience of notation, we write p_θ(y) = Π_{i=1}^m p_θ(y_i).

Algorithm 1 Rejection ABC Algorithm
Input: - the observed data X = {X_i}_{i=1}^n;
       - a prior π(θ) over the parameter space Θ;
       - a tolerance threshold ε > 0;
       - a data discrepancy measure D : X^n × X^m → R^+.
for t = 1, ..., T do
    repeat: propose θ ∼ π(θ) and draw Y = {Y_i}_{i=1}^m ∼ p_θ i.i.d.
    until: D(X, Y) < ε
    Let θ^(t) = θ
end for
Output: θ^(1), θ^(2), ..., θ^(T)

Algorithm 1 outputs a random sample {θ^(t)}_{t=1}^T of the quasi-posterior distribution

    π(θ | X; D, ε) ∝ π(θ) ∫ I(D(X, y) < ε) p_θ(y) dy.    (1)

Next, we introduce our KL divergence method and describe other direct data discrepancies. The semi-automatic method and the Bayesian indirect inference method are also included as they do not require candidate summary statistics. Let us collect more notation. g0 and g1 denote densities of two d-dimensional distributions on the sample space X ⊆ R^d. For a vector x, ||x|| denotes its ℓ2 norm, and x_(i) denotes its i-th ordered element. For two scalars a and b, we write max{a, b} as a ∨ b. We write D in short for a data discrepancy D(X, Y), if the observed and simulated data X and Y are clear in the context.
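To make Algorithm 1 concrete, the following is a minimal Python sketch of rejection ABC (NumPy assumed). The function and argument names are ours, not from the paper; the prior sampler, simulator and discrepancy are placeholder callables.

```python
import numpy as np

def rejection_abc(x_obs, sample_prior, simulate, discrepancy, eps, T, seed=0):
    """Minimal rejection ABC in the spirit of Algorithm 1.

    sample_prior(rng) -> theta; simulate(theta, m, rng) -> (m, d) array;
    discrepancy(x, y) -> float. Returns T accepted parameter values.
    """
    rng = np.random.default_rng(seed)
    m = len(x_obs)                        # the experiments in the paper use m = n
    accepted = []
    while len(accepted) < T:
        theta = sample_prior(rng)         # propose theta ~ pi(theta)
        y = simulate(theta, m, rng)       # draw Y = {Y_i} ~ p_theta i.i.d.
        if discrepancy(x_obs, y) < eps:   # keep theta if D(X, Y) < eps
            accepted.append(theta)
    return accepted
```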

2.1 Kullback-Leibler (KL) Divergence as Data Discrepancy

The KL divergence between g0 and g1 is defined as

    KL(g0||g1) = ∫ g0(x) ln( g0(x) / g1(x) ) dx ≥ 0,

which is zero if and only if g0(x) = g1(x) almost everywhere. Given the observed and simulated data X and Y, we use the following estimator for KL(p_θ*||p_θ):

    D_KL = (d/n) Σ_{i=1}^n ln( min_j ||X_i − Y_j|| / min_{j≠i} ||X_i − X_j|| ) + ln( m/(n−1) ).    (2)

This estimator is the special case of Equation (14) in [35] using the 1-nearest-neighbor density estimate. Theorem 2 in [35] establishes the almost-sure convergence of (2). We use (2) as the data discrepancy in Algorithm 1. As this method involves 2n operations of nearest neighbor search, we use k-d trees [46, 47] to implement it. The time cost per call of D_KL is O((n ∨ m) ln(n ∨ m)) on average.
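A sketch of the estimator (2) built on SciPy's k-d tree (scipy.spatial.cKDTree). The assumption that no points are duplicated, so all nearest-neighbor distances are positive, is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def kl_divergence_1nn(x, y):
    """1-nearest-neighbor KL divergence estimate, as in equation (2).

    x: (n, d) observed sample, y: (m, d) simulated sample.
    Assumes no duplicated points, so all nearest-neighbor distances are > 0.
    """
    n, d = x.shape
    m = y.shape[0]
    # distance from each X_i to its nearest neighbor in Y
    nu = cKDTree(y).query(x, k=1)[0]
    # distance from each X_i to its nearest neighbor in X excluding itself
    rho = cKDTree(x).query(x, k=2)[0][:, 1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1.0))
```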
2.2 Classification Accuracy (CA) as Data Discrepancy

The classification accuracy method in [43] originates from the phenomenon that distinguishing Y from X is usually easier when θ is very different from θ* than when θ is similar to θ*. This method first labels X_i as class 0 and Y_i as class 1, yielding an augmented data set D = {(X_1, 0), ..., (X_n, 0), (Y_1, 1), ..., (Y_m, 1)}, and then trains a prediction rule (classifier) h : x → {0, 1} to distinguish the two classes. The authors define the discriminability or classifiability between two samples X and Y as the K-fold cross-validation classification accuracy, and use it as the data discrepancy for ABC. Formally, denoting by D_k the k-th fold of D, and by |D_k| its size,

    D_CA = (1/K) Σ_{k=1}^K (1/|D_k|) [ Σ_{i:(X_i,0)∈D_k} (1 − ĥ_k(X_i)) + Σ_{i:(Y_i,1)∈D_k} ĥ_k(Y_i) ],    (3)

where ĥ_k is the prediction rule trained on the data set D\D_k. Our experiments set K = 5 and h to be the Linear Discriminant Analysis (LDA) classifier. We choose LDA because [43] reports that the quasi-posterior quality seems insensitive to the choice of classifiers, and LDA is computationally cheaper than other classifiers. The time cost per call of D_CA is O(n + m).
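A sketch of (3) using scikit-learn's LDA classifier with 5-fold cross-validation; using StratifiedKFold with shuffling and the default accuracy scoring of cross_val_score are our implementation choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

def classification_accuracy(x, y, n_splits=5, seed=0):
    """K-fold cross-validated accuracy of an LDA classifier separating X from Y,
    i.e. the data discrepancy D_CA of equation (3)."""
    data = np.vstack([x, y])
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearDiscriminantAnalysis(), data, labels, cv=cv)
    return scores.mean()
```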

2.3 Maximum Mean (MM) Discrepancy

The kernel embedding of a distribution g(x), defined as µ_g = ∫ k(·, x) g(x) dx, is an element in the RKHS H associated with a positive definite kernel k : X × X → R [48, 49]. The maximum mean (MM) discrepancy [50] between g0 and g1 is the distance of µ_g0 and µ_g1 in the Hilbert space H, i.e. MM^2(g0, g1) = ||µ_g0 − µ_g1||_H^2. [44] adopts an unbiased estimator of MM^2(p_θ*, p_θ) as the data discrepancy in ABC. The square of the estimator is as follows:

    D_MM^2 = Σ_{1≤i≠j≤n} k(X_i, X_j) / (n(n−1)) + Σ_{1≤i≠j≤m} k(Y_i, Y_j) / (m(m−1)) − 2 Σ_{i=1}^n Σ_{j=1}^m k(X_i, Y_j) / (nm).    (4)

In the same fashion as [44], we choose a Gaussian kernel with the bandwidth being the median of {||X_i − X_j|| : 1 ≤ i ≠ j ≤ n}. The time cost per call of D_MM is O((n + m)^2), as it requires computing the (n + m) × (n + m) pairwise distance matrix.
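A sketch of the unbiased estimator (4). The Gaussian kernel parameterization k(a, b) = exp(-||a - b||^2 / (2 h^2)), with h the median pairwise distance within the observed data, is our assumption about the bandwidth convention.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def mmd_squared(x, y):
    """Unbiased estimate of the squared maximum mean discrepancy, equation (4),
    with a Gaussian kernel; x, y are (n, d) and (m, d) arrays."""
    n, m = len(x), len(y)
    h = np.median(pdist(x))    # median pairwise distance among observed samples
    kern = lambda a, b: np.exp(-cdist(a, b, 'sqeuclidean') / (2.0 * h ** 2))
    kxx, kyy, kxy = kern(x, x), kern(y, y), kern(x, y)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()
```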
2.4 Wasserstein Distance

Let ρ be a distance on X ⊆ R^d. The q-Wasserstein distance between g0 and g1 is defined as

    W_q(g0, g1) = inf_{γ∈Γ(g0,g1)} ( ∫_{X×X} ρ(x, y)^q dγ(x, y) )^{1/q},

where Γ(g0, g1) is the set of all joint distributions γ(x, y) on X × X such that γ has marginals g0 and g1. An estimator of W_q(p_θ*, p_θ) based on the samples X and Y can serve as the data discrepancy for ABC. In particular, with q = 2 and ρ being the Euclidean distance, an instance is given by

    D_W2 = min_γ ( Σ_{i=1}^n Σ_{j=1}^m γ_ij ||X_i − Y_j||^2 )^{1/2}    (5)
    s.t. γ 1_m = 1_n,  γ^T 1_n = 1_m,  0 ≤ γ_ij ≤ 1,

where γ = {γ_ij : 1 ≤ i ≤ n, 1 ≤ j ≤ m} is an n × m matrix, and 1_n, 1_m are vectors of n and m ones, respectively.

For multivariate distributions (d > 1), exactly solving this optimization problem costs O((n + m)^3 ln(n + m)) [51] and approximate optimization algorithms [52, 53] reduce the cost to O((n + m)^2). For univariate distributions (d = 1), if n = m and ρ(x, y) = |x − y|, the q-Wasserstein distance has the explicit form ( (1/n) Σ_{i=1}^n |X_(i) − Y_(i)|^q )^{1/q}. In this special case, the computation cost is O(n ln n).
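A sketch of the 2-Wasserstein discrepancy for the equal-size case n = m, which is the setting used in the experiments. For d = 1 it uses the sorted-sample formula above; for d > 1 it solves the exact assignment problem with SciPy's linear_sum_assignment. Normalizing the squared cost by n is our convention and may differ from (5) by a constant factor.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein2(x, y):
    """2-Wasserstein distance between two (n, d) samples of equal size n = m."""
    if x.shape[1] == 1:
        # univariate case: match sorted samples
        return np.sqrt(np.mean((np.sort(x[:, 0]) - np.sort(y[:, 0])) ** 2))
    cost = cdist(x, y, 'sqeuclidean')          # pairwise squared Euclidean distances
    rows, cols = linear_sum_assignment(cost)   # exact optimal one-to-one coupling
    return np.sqrt(cost[rows, cols].mean())
```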

2.5 Distance between Summary Statistics

An ABC algorithm typically uses the distance between the data summaries S(X), S(Y) as the discrepancy measure D. In particular, if using the Euclidean distance,

    D_S = ||S(X) − S(Y)||.    (6)

The subscript S specifies the choice of summary statistic S. Most methods for constructing S require expertise in the specific scientific domain to provide candidate summary statistics. Two exceptions are the semi-automatic method [25] and the Bayesian indirect inference methods [33, 34, 27].

If no candidate summary statistic is given, the semi-automatic method can work for univariate distributions. The method first generates a large artificial data set of pairs (θ, Y) from the prior π and the simulated model. When the distribution is one-dimensional, each Y in the artificial data set is a vector of length m. The semi-automatic method performs a linear regression on the artificial data set with θ as target and candidate summary statistics as regressors. In the absence of candidate summary statistics, Fearnhead and Prangle [25] suggested using evenly-spaced quantiles (and their 2nd, 3rd and 4th powers) of Y as regressors. Formally, E[θ|Y] ≈ β_0 + Σ_{k=1}^{K−1} Σ_{l=1}^4 β_kl Y_(km/K)^l. The summary statistic S(y) is merely the prediction of θ given y, which is understood as an approximation of the posterior mean E[θ|y]. As the original paper [25] does not clearly show how to extend this method to multivariate distributions, in which case each simulated data set Y is an n × d matrix, we simply put quantiles of each marginal together as a total of 4d(K − 1) regressors. Our experiments set K = 8.

The Bayesian indirect inference methods construct the summary statistic from an auxiliary model {p_A(x|φ) : φ ∈ Φ}. See a general review of the literature in [27]. An instance of these methods is given by [33]. The authors suggest the maximum likelihood estimate (MLE) of the auxiliary model as the summary statistic. Formally,

    S(y) = φ̂(y) = argmax_{φ∈Φ} Π_{i=1}^m p_A(y_i|φ).    (7)

In particular, if p_A(x|φ) is d-dimensional Gaussian with parameter φ (which aggregates mean and covariance parameters together), the summary statistics S(y) are merely the sample mean and covariance of y = {y_i}_{i=1}^m. [34] also describes an approach that uses the auxiliary likelihood (AL) to set up a data discrepancy

    D_AL = (1/m) ln p_A(Y|φ̂(Y)) − (1/m) ln p_A(Y|φ̂(X)).    (8)

3 Asymptotic Analysis

This section analyzes the asymptotic quasi-posterior distributions as the size n of the observed data X goes to infinity (and the size m of the simulated data Y increases at the same rate, m/n → α > 0) with different data discrepancy measures. The tolerance threshold ε is assumed to be fixed. This analysis explains how different data discrepancy measures weight the values of parameter in their quasi-posterior distributions. The theoretical results show that the quasi-posterior distribution obtained by our KL divergence method (2) identifies θ with KL(p_θ*||p_θ) < ε asymptotically. In addition, our method is related to the classification accuracy method (3) by the concept of f-divergence [54]. The KL divergence method is also asymptotically equivalent to (8), a Bayesian indirect inference method, in case the auxiliary model is bijective to the true model. Proofs of theorems and corollaries are deferred to the Appendix.

We start this section with Theorem 1, which is an application of Lévy's upward theorem.

Theorem 1 (Asymptotic Quasi-Posterior) If the data discrepancy measure D(X, Y) in Algorithm 1 converges to some real number D(p_θ*, p_θ) almost surely as the data size n → ∞, m/n → α > 0, then the quasi-posterior distribution π(θ|X; D, ε) defined by (1) converges to π(θ | D(p_θ*, p_θ) < ε) for any θ. That is,

    lim_{n→∞} π(θ|X; D, ε) = π(θ | D(p_θ*, p_θ) < ε) ∝ π(θ) I(D(p_θ*, p_θ) < ε).

This theorem asserts that the asymptotic quasi-posterior is a restriction of the prior π on the region {θ ∈ Θ : D(p_θ*, p_θ) < ε}. Putting it together with the established almost sure convergence of D_KL in [35], we have a corollary for D_KL as follows.

Corollary 1 (Asymptotic Quasi-Posterior of D_KL) Let n → ∞ and m/n → α > 0. If Algorithm 1 uses D_KL defined by (2) as the data discrepancy measure, then the quasi-posterior distribution

    lim_{n→∞} π(θ|X; D_KL, ε) = π(θ | KL(p_θ*||p_θ) < ε) ∝ π(θ) I(KL(p_θ*||p_θ) < ε).

It is known that the maximum likelihood estimator minimizes the KL divergence between the empirical distribution of p_θ* and p_θ. ABC with D_KL shares the same idea of finding θ with small KL divergence.

Next, we relate D_KL to D_CA in the framework of f-divergence [54]. For a convex function f(t) with f(1) = 0, the f-divergence between two distributions g0 and g1 is defined as D_f(g0||g1) = ∫ f(g0(x)/g1(x)) g1(x) dx. The KL divergence belongs to this class of divergences with f(t) = t ln t. D_CA (induced by the optimal classifier) is also related to f-divergence.

Corollary 2 (Asymptotic Quasi-Posterior of D_CA) Let n → ∞ and m/n → α > 0. If the optimal (Bayes) classifier h(x) = I(α p_θ(x) ≥ p_θ*(x)) induces D_CA in (3), then Algorithm 1 with D_CA yields the quasi-posterior distribution

    lim_{n→∞} π(θ|X; D_CA, ε) = π(θ | D_f(p_θ*||p_θ) + c(α) < ε) ∝ π(θ) I(D_f(p_θ*||p_θ) + c(α) < ε),

where the constant c(α) = (α ∨ 1)/(1 + α) is the classification accuracy of the naive prediction rule h_0(x) = I(α ≤ 1) and the f-divergence D_f corresponds to f(t) = (α/t) ∨ 1 − α ∨ 1.

This result suggests that D_CA with a general classifier is a suboptimal estimator for the f-divergence D_f(p_θ*||p_θ), since the classifier in use is an approximation of the optimal (Bayes) classifier.

Thirdly, our KL divergence method also resembles one of the Bayesian indirect inference methods, namely (8).

Corollary 3 (Asymptotic Quasi-Posterior of D_AL) Let n → ∞ and m/n → α > 0. If an auxiliary model {p_A(·|φ) : φ ∈ Φ} is bijective to the model {p_θ : θ ∈ Θ} and induces D_AL in (8), then Algorithm 1 with D_AL yields the quasi-posterior distribution

    lim_{n→∞} π(θ|X; D_AL, ε) = π(θ | KL(p_θ||p_θ*) < ε) ∝ π(θ) I(KL(p_θ||p_θ*) < ε).

It is worth noting that similar results may not hold for D_MM defined by (4) and D_W2 defined by (5), as their convergences have not been established in the literature, except that [50] shows that D_MM converges to MM in some special cases. For the summary-based data discrepancy (6), in general

    π(θ|X; D_S, ε) → π(θ | ||s(θ*) − s(θ)|| < ε)

if s(θ*) and s(θ) are the limits of S(X) and S(Y) as n, m → ∞. The semi-automatic method advocates an approximation of the posterior mean as the summary statistic. As S(X) = E[θ|X] → θ* and S(Y) = E[θ|Y] → θ, we would have π(θ|X; D_S, ε) → π(θ | ||θ* − θ|| < ε). However, this appealing asymptotic quasi-posterior is a mirage because the semi-automatic construction provides only a projection of E[θ|y] onto the function class spanned by the regressors. In most cases, finding the posterior mean E[θ|y], the optimal estimator in the Bayesian sense, is even harder than the task of constructing summary statistics.

4 Experiments

We run experiments on five benchmark models: a bivariate Gaussian mixture model (d = 2), an M/G/1-queuing model (d = 5), a bivariate beta model (d = 2), a moving-average model of order 2 and length d = 10, and a multivariate g-and-k distribution (d = 5). In each experiment, we set n = m and the tolerance threshold ε adaptively such that 50 of 10^5 proposed θ are accepted.

4.1 Toy Example: Gaussian Mixture Model

The univariate Gaussian mixture model serves as a benchmark model in the ABC literature [20, 24]. Here we use a more challenging bivariate Gaussian mixture model: Z ∼ Bernoulli(p), X|Z=0 ∼ N(µ0, [0.5, −0.3; −0.3, 0.5]), X|Z=1 ∼ N(µ1, [0.25, 0; 0, 0.25]). The unknown parameter θ = (p, µ0, µ1) consists of the mixture ratio p and the subpopulation means µ0, µ1. We perform ABC on n = 500 observed samples which are generated from the true parameters p* = 0.3, µ0* = (+0.7, +0.7), µ1* = (−0.7, −0.7). The prior we used is p ∼ Uniform[0, 1], µ0, µ1 ∼ Uniform[−1, +1]^2.

Figure 1 shows that the KL divergence D_KL visibly outperforms the other methods. D_W2, D_MM and D_S with the auxiliary MLE as summary roughly figure out the true parameters but mix up the two subpopulation means. The other two methods do not find the true parameters.

For estimating the mixture ratio p, the KL divergence method achieves a mean square error of 0.001, more accurate than the other methods (D_W2: 0.002; D_MM: 0.015; D_CA: 0.053; D_S with the auxiliary MLE as summary: 0.020; D_S with the semi-automatic construction as summary: 0.025).
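A minimal sketch of this mixture simulator (NumPy assumed; the function and argument names are ours):

```python
import numpy as np

def simulate_mixture(p, mu0, mu1, n, rng):
    """Simulate n draws from the bivariate Gaussian mixture of Section 4.1."""
    cov0 = np.array([[0.5, -0.3], [-0.3, 0.5]])
    cov1 = np.array([[0.25, 0.0], [0.0, 0.25]])
    z = rng.random(n) < p                            # Z ~ Bernoulli(p)
    x0 = rng.multivariate_normal(mu0, cov0, size=n)  # draws for Z = 0
    x1 = rng.multivariate_normal(mu1, cov1, size=n)  # draws for Z = 1
    return np.where(z[:, None], x1, x0)

# e.g. observed data at the true parameters of the experiment:
x_obs = simulate_mixture(0.3, [0.7, 0.7], [-0.7, -0.7], 500, np.random.default_rng(0))
```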

Figure 1: Quasi-posteriors for the Gaussian mixture model. Black lines cross at µ0* and µ1*.

4.2 M/G/1-queuing Model

Queuing models are usually easy to simulate from but have intractable likelihoods. We adopt a specific M/G/1-queue that has been studied by ABC methods [55, 25]. In this model, the service times follow Uniform[θ1, θ2] and the inter-arrival times are exponentially distributed with rate θ3. Each datum is a 5-dimensional vector consisting of the first five inter-departure times x = (x1, x2, x3, x4, x5) after the queue starts from empty. The unknown parameter is θ = (θ1, θ2, θ3). We perform ABC on n = 500 observed samples which are generated from the true parameters θ* = (1, 5, 0.2). The prior we used is θ1 ∼ Uniform[0, 10], θ2 − θ1 ∼ Uniform[0, 10], θ3 ∼ Uniform[0, 0.5].

Figure 2 shows that D_KL produces better inference for θ1* and θ2* than D_W2 and D_MM. It also captures θ3*, although the marginal quasi-posterior does not concentrate as much as those of D_W2 and D_MM. The summary statistics constructed by the semi-automatic method using marginal octiles identify θ1* and give meaningless inference about θ2* and θ3*.
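A sketch of this queue simulator; the Lindley-style recursion for departure times is our reconstruction of the standard M/G/1 setup, not code from the paper.

```python
import numpy as np

def simulate_mg1(theta1, theta2, theta3, n, rng):
    """Simulate n data points for the M/G/1 queue of Section 4.2: each datum is
    the vector of the first five inter-departure times, starting from an empty queue."""
    data = np.empty((n, 5))
    for k in range(n):
        service = rng.uniform(theta1, theta2, size=5)                 # service times
        arrivals = np.cumsum(rng.exponential(1.0 / theta3, size=5))   # arrival times
        depart_prev = 0.0
        for i in range(5):
            depart = max(arrivals[i], depart_prev) + service[i]
            data[k, i] = depart - depart_prev                         # inter-departure time
            depart_prev = depart
    return data
```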

Figure 2: Quasi-posteriors for the M/G/1-queue model. Black lines intercept the x-axis at the true parameters.

4.3 Bivariate Beta Model

Arnold and Ng [56] define an 8-parameter model as follows. Let U_i ∼ Gamma(θ_i, 1), i = 1, ..., 8,

    V1 = (U1 + U5 + U7)/(U3 + U6 + U8),
    V2 = (U2 + U5 + U8)/(U4 + U6 + U7),

and then Z1 = V1/(1 + V1), Z2 = V2/(1 + V2) are marginally beta distributed, and Z = (Z1, Z2) jointly follows a bivariate beta distribution. Crackel and Flegal [57] consider a 5-parameter sub-model by restricting θ3 = θ4 = θ5 = 0. Another variant of this model was studied by [58]. We apply the methodology to Crackel and Flegal [57]'s sub-model.

We perform ABC on n = 500 observed samples which are generated from θ* = (1, 1, 1, 1, 1). This parameter setting was used in [56]. The prior we used is (θ1, θ2, θ6, θ7, θ8) ∼ Uniform[0, 5]^5. Figure 3 shows an overall satisfactory performance of D_KL among the others.
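A sketch of the sub-model simulator (with θ3 = θ4 = θ5 = 0, the corresponding gamma variables U3, U4, U5 are zero); the function signature is ours.

```python
import numpy as np

def simulate_bivariate_beta(theta1, theta2, theta6, theta7, theta8, n, rng):
    """Simulate n draws (Z1, Z2) from the 5-parameter bivariate beta sub-model of Section 4.3."""
    u1 = rng.gamma(theta1, 1.0, n)
    u2 = rng.gamma(theta2, 1.0, n)
    u6 = rng.gamma(theta6, 1.0, n)
    u7 = rng.gamma(theta7, 1.0, n)
    u8 = rng.gamma(theta8, 1.0, n)
    v1 = (u1 + u7) / (u6 + u8)    # U5 = U3 = 0 in the sub-model
    v2 = (u2 + u8) / (u6 + u7)    # U5 = U4 = 0 in the sub-model
    z1, z2 = v1 / (1.0 + v1), v2 / (1.0 + v2)
    return np.column_stack([z1, z2])
```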

Figure 3: Quasi-posteriors for the bivariate beta model. Black lines intercept the x-axis at the true parameters.

4.4 Moving-average Model of Order 2

Marin et al. [14] use the moving-average model of order 2 as a benchmark model in their review. This model generates Y_j = Z_j + θ1 Z_{j−1} + θ2 Z_{j−2}, j = 1, ..., d, with Z_j being unobserved noise error terms. Each datum Y is a time series of length d. We take Z_j to follow Student's t distribution with 5 degrees of freedom and set d = 10. With the prior (θ1, θ2) ∼ Uniform([−2, +2] × [−1, +1]), ABC was performed on n = 200 observed samples generated from θ* = (0.6, 0.2). D_KL, D_W2, D_MM and D_S with the auxiliary MLE as summary obtain comparably high quality quasi-posteriors. See Figure 4.
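A sketch of this MA(2) simulator with Student-t(5) noise; padding the noise sequence with two extra terms for the lags is our implementation choice.

```python
import numpy as np

def simulate_ma2(theta1, theta2, n, d=10, df=5, seed=None):
    """Simulate n length-d series from Y_j = Z_j + theta1*Z_{j-1} + theta2*Z_{j-2},
    with Z_j i.i.d. Student-t with df degrees of freedom (Section 4.4)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_t(df, size=(n, d + 2))   # two extra noise terms for the lags
    return z[:, 2:] + theta1 * z[:, 1:-1] + theta2 * z[:, :-2]
```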

Figure 4: Quasi-posteriors for the moving-average model. Black lines cross at the true parameters.

4.5 Multivariate g-and-k Distribution

The univariate g-and-k distribution is defined by its inverse distribution function (9). It has no analytical form of the density function, and the numerical evaluation of the likelihood function is costly [59]:

    F^{-1}(x) = A + B [ 1 + c (1 − e^{−g z_x}) / (1 + e^{−g z_x}) ] (1 + z_x^2)^k z_x,    (9)

where z_x is the x-th quantile of the standard normal distribution, the parameters A, B, g, k are related to location, scale, skewness and kurtosis, respectively, and c = 0.8 is the conventional choice [25]. As the inversion transform method can conveniently sample from this distribution, by drawing Z ∼ N(0, 1) i.i.d. and then transforming them into g-and-k distributed random variables, [60, 5, 61, 25] have performed ABC on it. The multivariate g-and-k distribution has also been considered [62, 63]. Here we study a 5-dimensional g-and-k distribution: first draw (Z1, ..., Z5)^T ∼ N(0, Σ)

with Σ having the sparse structure Σ_ii = 1 and Σ_ij = ρ if |i − j| = 1 and 0 otherwise, and then transform the coordinates marginally as the univariate g-and-k distribution does. Again D_KL obtains comparably high quality quasi-posteriors. See Figure 5.
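A sketch of this multivariate g-and-k simulator; treating (A, B, g, k) as shared across the five coordinates, with ρ as the additional correlation parameter, is our reading of the text rather than a specification from the paper.

```python
import numpy as np

def simulate_gandk(A, B, g, k, rho, n, c=0.8, seed=None):
    """Draw Z ~ N(0, Sigma) with Sigma tridiagonal (1 on the diagonal, rho on the
    first off-diagonals), then apply the marginal transform of equation (9)."""
    rng = np.random.default_rng(seed)
    d = 5
    sigma = np.eye(d) + rho * (np.eye(d, k=1) + np.eye(d, k=-1))
    z = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    return A + B * (1 + c * (1 - np.exp(-g * z)) / (1 + np.exp(-g * z))) * (1 + z ** 2) ** k * z
```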

Figure 5: Quasi-posteriors for the multivariate g-and-k model. Solid lines cross at µ0* and µ1*.

5 Discussion

We add the KL divergence estimator to the arsenal of data discrepancy measures for ABC. This estimator converges to the exact KL divergence. Thus D_KL, as the data discrepancy for ABC, yields a quasi-posterior which eventually concentrates on {θ : KL(p_θ*||p_θ) < ε} as the sample size increases. Our method and the maximum likelihood estimation share the same spirit of finding minimizers of the KL divergence. We also connect the KL divergence method to the classification accuracy method D_CA and a Bayesian indirect inference method D_AL. Both methods are somewhat suboptimal to D_KL, as they are equivalent to D_KL asymptotically only if their key ingredients (the prediction rule for D_CA or the auxiliary model for D_AL) are "oracle" ones.

We run experiments on five benchmark models and compare different data discrepancy measures. In terms of the quasi-posterior quality, D_KL performs comparably well or even better than D_MM and D_W2, and much better than the other methods.

Apart from the quasi-posterior quality, another core consideration of ABC is the computational tractability of the data discrepancy function. A whole run of an ABC analysis usually calls the data discrepancy function millions of times. In the case of n = m, the O(n ln n) cost per call of the KL divergence method is thus attractive, whereas D_MM and D_W2 have O(n^2) cost per call.

Our theoretical results assume m = αn asymptotically, and our experiments use the typical setting m = n. In the experiments, our method performs well when α ∈ [0.5, 1.0]. If α is close to 0, then the commonly-seen imbalance issue in two-sample problems arises.

Our theoretical results also assume that n goes to infinity. For finite data samples, we need the convergence rate or non-asymptotic error bounds of the KL estimator in use to quantify the quality of the quasi-posterior distribution. Unfortunately, apart from the almost sure convergence result for the KL estimator, there are few results in the literature that more precisely justify the KL estimators.

Acknowledgements

This work is supported by NSF Grants DMS-1407557 and DMS-1721550. It was completed when Bai Jiang was a Ph.D. student at Stanford University. The authors thank the anonymous referees for their suggestions.

References

[1] Simon Tavaré, David J Balding, Robert C Griffiths, and Peter Donnelly. Inferring coalescence times from DNA sequence data. Genetics, 145(2):505–518, 1997.
[2] Mark A Beaumont, Wenyang Zhang, and David J Balding. Approximate Bayesian Computation in population genetics. Genetics, 162(4):2025–2035, 2002.
[3] Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, and Michael PH Stumpf. Approximate Bayesian Computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6(31):187–202, 2009.
[4] Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.
[5] Glynn W Peters and Scott A Sisson. Bayesian inference, Monte Carlo sampling and operational risk. Journal of Operational Risk, 1(3):27–50, 2006.
[6] Gareth W Peters, Scott A Sisson, and Yanan Fan. Likelihood-free Bayesian inference for α-stable models. Computational Statistics & Data Analysis, 56(11):3743–3756, 2012.
[7] Mark M Tanaka, Andrew R Francis, Fabio Luciani, and SA Sisson. Using Approximate Bayesian Computation to estimate tuberculosis transmission parameters from genotype data. Genetics, 173(3):1511–1520, 2006.
[8] Michael GB Blum and Viet Chi Tran. HIV with contact tracing: a case study in Approximate Bayesian Computation. Biostatistics, 11(4):644–660, 2010.
[9] Trevelyan J McKinley, Joshua V Ross, Rob Deardon, and Alex R Cook. Simulation-based Bayesian inference for epidemic models. Computational Statistics & Data Analysis, 71:434–447, 2014.
[10] Jason Christopher, Caelan Lapointe, Nicholas Wimer, Torrey Hayden, Ian Grooms, Gregory B Rieker, and Peter E Hamlington. Parameter estimation for a turbulent buoyant jet using Approximate Bayesian Computation. In 55th AIAA Aerospace Sciences Meeting, page 0531, 2017.
[11] Peter J Diggle and Richard J Gratton. Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society. Series B (Methodological), pages 193–227, 1984.
[12] Christopher M Bishop and Julia Lasserre. Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8:3–24, 2007.
[13] Katalin Csilléry, Michael GB Blum, Oscar E Gaggiotti, and Olivier François. Approximate Bayesian Computation in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.
[14] Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ryder. Approximate Bayesian Computational methods. Statistics and Computing, pages 1–14, 2012.
[15] Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesian Computation. PLoS Computational Biology, 9(1):e1002803, 2013.
[16] Yun-Xin Fu and Wen-Hsiung Li. Estimating the age of the common ancestor of a sample of DNA sequences. Molecular Biology and Evolution, 14(2):195–199, 1997.
[17] Gunter Weiss and Arndt von Haeseler. Inference of population history using a likelihood approach. Genetics, 149(3):1539–1546, 1998.
[18] Jonathan K Pritchard, Mark T Seielstad, Anna Perez-Lezaun, and Marcus W Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
[19] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
[20] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
[21] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Correction for "Sequential Monte Carlo without likelihoods". Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2009.
[22] Christopher C Drovandi and Anthony N Pettitt. Estimation of parameters for macroparasite population evolution using Approximate Bayesian Computation. Biometrics, 67(1):225–233, 2011.
[23] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive Sequential Monte Carlo method for Approximate Bayesian Computation. Statistics and Computing, 22(5):1009–1020, 2012.
[24] Richard D Wilkinson. Approximate Bayesian Computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141, 2013.
[25] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for Approximate Bayesian Computation: semi-automatic Approximate Bayesian Computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.
[26] Michael GB Blum, Maria A Nunes, Dennis Prangle, and Scott A Sisson. A comparative review of dimension reduction methods in Approximate Bayesian Computation. Statistical Science, 28(2):189–208, 2013.
[27] Christopher C Drovandi, Anthony N Pettitt, and Anthony Lee. Bayesian indirect inference using a parametric auxiliary model. Statistical Science, 30(1):72–95, 2015.
[28] Paul Joyce and Paul Marjoram. Approximately sufficient statistics and Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 7(1):26, 2008.
[29] Matthew A Nunes and David J Balding. On optimal selection of summary statistics for Approximate Bayesian Computation. Statistical Applications in Genetics and Molecular Biology, 9(1):34, 2010.
[30] Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, and Christian P Robert. Reliable ABC model choice via random forests. Bioinformatics, 32(6):859–866, 2015.
[31] Daniel Wegmann, Christoph Leuenberger, and Laurent Excoffier. Efficient Approximate Bayesian Computation coupled with Markov chain Monte Carlo without likelihood. Genetics, 182(4):1207–1218, 2009.
[32] Christian Gourieroux, Alain Monfort, and Eric Renault. Indirect inference. Journal of Applied Econometrics, 8(S1):S85–S118, 1993.
[33] Christopher C Drovandi, Anthony N Pettitt, and Malcolm J Faddy. Approximate Bayesian computation using indirect inference. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):317–337, 2011.
[34] Alexander Gleim and Christian Pigorsch. Approximate Bayesian Computation with indirect summary statistics. Draft paper: http://ect-pigorsch.mee.uni-bonn.de/data/research/papers, 2013.
[35] Fernando Pérez-Cruz. Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory (ISIT), pages 1666–1670. IEEE, 2008.
[36] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[37] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074, 2005.
[38] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NIPS, pages 1089–1096, 2007.
[39] Qing Wang, Sanjeev R Kulkarni, and Sergio Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392–2405, 2009.
[40] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[41] Jorge Silva and Shrikanth S Narayanan. Information divergence estimation based on data-dependent partitions. Journal of Statistical Planning and Inference, 140(11):3180–3198, 2010.
[42] Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, and Larry Wasserman. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.
[43] Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Likelihood-free inference via classification. Statistics and Computing, 28(2):411–425, 2018.
[44] Mijung Park, Wittawat Jitkrittum, and Dino Sejdinovic. K2-ABC: Approximate Bayesian Computation with kernel embeddings. In Artificial Intelligence and Statistics, pages 398–407, 2016.
[45] Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Inference in generative models using the Wasserstein distance. arXiv preprint arXiv:1701.05146, 2017.
[46] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[47] Songrit Maneewongvatana and David Mount. On the efficiency of nearest neighbor searching with data clustered in lower dimensions. Computational Science—ICCS 2001, pages 842–851, 2001.
[48] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
[49] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
[50] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[51] Rainer Burkard, Mauro Dell'Amico, and Silvano Martello. Assignment Problems. Society for Industrial and Applied Mathematics, 2009.
[52] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
[53] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
[54] Imre Csiszár and Paul C Shields. Information theory and statistics: a tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.
[55] Michael GB Blum and Olivier François. Non-linear regression models for Approximate Bayesian Computation. Statistics and Computing, 20(1):63–73, 2010.
[56] Barry C Arnold and Hon Keung Tony Ng. Flexible bivariate beta distributions. Journal of Multivariate Analysis, 102(8):1194–1202, 2011.
[57] Roberto Crackel and James Flegal. Bayesian inference for a flexible class of bivariate beta distributions. Journal of Statistical Computation and Simulation, 87(2):295–312, 2017.
[58] Ingram Olkin and Ruixue Liu. A bivariate beta distribution. Statistics & Probability Letters, 62(4):407–412, 2003.
[59] Glen D Rayner and Helen L MacGillivray. Numerical maximum likelihood estimation for the g-and-k and generalized g-and-h distributions. Statistics and Computing, 12(1):57–75, 2002.
[60] Michele Haynes and Kerrie Mengersen. Bayesian estimation of g-and-k distributions using MCMC. Computational Statistics, 20(1):7–30, 2005.
[61] David Allingham, R AR King, and Kerrie L Mengersen. Bayesian estimation of quantile distributions. Statistics and Computing, 19(2):189–201, 2009.
[62] Christopher C Drovandi and Anthony N Pettitt. Likelihood-free Bayesian estimation of multivariate quantile distributions. Computational Statistics & Data Analysis, 55(9):2541–2556, 2011.
[63] Jingjing Li, David J Nott, Yanan Fan, and Scott A Sisson. Extending Approximate Bayesian Computation methods to high dimensions via a Gaussian copula model. Computational Statistics & Data Analysis, 106:77–89, 2017.