Monte Carlo Integration using Importance Sampling and Gibbs Sampling

Wolfgang Hörmann and Josef Leydold

Department of Statistics, University of Economics and Business Administration Vienna, Austria [email protected]

Abstract: To evaluate the expectation of a simple function with respect to a complicated multivariate density, Monte Carlo integration has become the main technique. Gibbs sampling and importance sampling are the most popular methods for this task. In this contribution we propose a new simple general purpose importance sampling procedure. In a simulation study we compare the performance of this method with the performance of Gibbs sampling and of importance sampling using a vector of independent variates. It turns out that the new procedure is much better than independent importance sampling; up to dimension five it is also better than Gibbs sampling. The simulation results indicate that for higher dimensions Gibbs sampling is superior.

1. Introduction

A standard problem in scientific computing is the evaluation of an integral of the form

$$ E(g) = \frac{\int g(x)\, f(x)\, dx}{\int f(x)\, dx} , $$

where $x$ denotes a vector in $\mathbb{R}^n$, $f(x)$ is a multiple of a density function with bounded but often unknown integral, and $g(x)$ is an arbitrary real valued function. In typical applications we want to find the expectation and some quantiles of all marginal distributions, together with all correlations between the variables. If we consider, for example, the expectation, it is obvious that we have to use $g(x) = x_i$ to obtain the expectation of the $i$-th marginal. Also for the variance, quantiles and correlations the evaluation of $g(x)$ is very cheap, whereas the evaluation of the density $f(x)$ can be very expensive.

An application for the above setting are, for example, the parameter estimation problems of Bayesian statistics. There $f$ is the posterior density, which is typically very expensive to evaluate. It is standard to calculate the expectation, the variance and several quantiles (eg. 2.5%, 50%, and 97.5%) of all or of several important one-dimensional marginals. These are calculated as they represent the Bayesian point estimates and confidence intervals for the parameters of the model.

It is important to note that the methods discussed here and the conclusions of this paper are valid for any integral of the form given above as long as the evaluation of $g(x)$ is much faster than the evaluation of $f(x)$.

Using naive Monte Carlo integration, random vectors $X_i$ with density proportional to $f$ are generated. The average of the $g(X_i)$ is the estimator (i.e. the approximate result) of the above $E(g)$. However, often the direct generation of iid. vectors with density proportional to $f$ is impossible or very slow. There are three general approaches to evaluate the above integral if direct generation of vectors with density proportional to $f$ is not possible:

+ Simulation of vectors with density proportional to $f$ using the rejection technique.

+ Importance sampling.

+ Markov chain Monte Carlo algorithms, in particular Gibbs sampling as the most popular one in Bayesian statistics.

There are quite a few papers on this topic in the literature, certainly an indication that this is a difficult practical problem. Papers mainly concerned with the computing methodology are eg. [6] (among many others) for the Gibbs sampler and [8], [11], [7], [13], [12] and [10] for importance sampling.

In the last years many authors have observed that routine use shifts more and more to Gibbs sampling or other Markov chain Monte Carlo methods. On the other hand, it is obvious that it is very easy to parallelize importance sampling that uses independent sampling from the same distribution. In Markov chain Monte Carlo normally one long Markov chain is generated, which makes parallelization more difficult. Nevertheless, it is possible to run a small number of chains in parallel.

We were not able to find papers that compare the performance of these two so different approaches. Thus, the aim of this paper is twofold. We develop a new simple general purpose importance sampling scheme, and we compare its performance to the performance of Gibbs sampling.
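As a point of reference for the methods compared below, the naive estimator just described is easy to state in code. The following is a minimal sketch, assuming direct sampling from the normalized target density is possible; the function names are illustrative placeholders, not part of the paper:

```python
import numpy as np

def naive_mc(sample_target, g, n_samples, rng):
    """Naive Monte Carlo: average g over iid. draws from the target density."""
    xs = sample_target(n_samples, rng)          # shape (n_samples, dim)
    return np.mean([g(x) for x in xs])

# Illustration: expectation of the first marginal of a standard bivariate normal.
rng = np.random.default_rng(42)
sample_target = lambda n, rng: rng.standard_normal((n, 2))
print(naive_mc(sample_target, lambda x: x[0], 10_000, rng))   # close to 0
```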


2. Importance Sampling

For importance sampling we have to find an importance sampling density (also called proposal density) $h(x)$. In most applications it is necessary to find more than one characteristic for several marginal distributions. It is thus in practice not possible to find the optimal importance density for a single $g(x)$. In this situation it is argued in the literature (see eg. [8] and [3]) that $h(x)$ should be roughly proportional to $f$ but with higher tails. The idea of importance sampling is then to sample vectors $X_i$ from the proposal density $h(x)$ and to evaluate the integral

$$ \int g(x)\, w(x)\, h(x)\, dx , \qquad \text{with } w(x) = \frac{f(x)}{h(x)} . $$

$w(x)$ is called the weight function and $w(X_i)$ is the weight for $X_i$. As we do not know the integral of $f$ we cannot use the classic importance sampling estimate $\frac{1}{N} \sum_i g(X_i)\, w(X_i)$ but have to use the ratio estimate

$$ \hat{I} = \frac{\sum_i g(X_i)\, w(X_i)}{\sum_i w(X_i)} , $$

see [8] for its properties.

It is quite clear that good weights should be approximately constant; in other words, the weights should have a small variance. In [8] it is suggested to use the "effective sample size"

$$ N_e = \frac{N}{1 + \mathrm{cv}^2} , \qquad \mathrm{cv}^2 = \frac{\mathrm{Var}(w)}{\mathrm{E}(w)^2} , $$

which is estimated from the sample by $N_e = \left(\sum_i w(X_i)\right)^2 / \sum_i w(X_i)^2$. Maximizing the effective sample size is equivalent to minimizing the variance of the weights. It is not difficult to show that

$$ \frac{N}{N_e} \approx \int \frac{\bar{f}(x)^2}{\bar{h}(x)}\, dx , $$

where $\bar{f}$ and $\bar{h}$ denote the normed densities associated with $f$ and $h$.
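A minimal sketch of this ratio estimate together with the effective sample size, assuming we can evaluate the unnormalized target and both sample from and evaluate the proposal (all names here are illustrative, not from the paper):

```python
import numpy as np

def importance_sampling(f_unnorm, g, h_sample, h_pdf, n, rng):
    """Self-normalized (ratio) importance sampling estimate plus effective sample size."""
    xs = h_sample(n, rng)                               # draws from the proposal h
    w = np.array([f_unnorm(x) / h_pdf(x) for x in xs])  # importance weights
    estimate = np.sum(w * np.array([g(x) for x in xs])) / np.sum(w)
    n_eff = np.sum(w) ** 2 / np.sum(w ** 2)             # effective sample size [8]
    return estimate, n_eff

# Example: mean of a standard normal, using a heavier-tailed Laplace proposal.
rng = np.random.default_rng(0)
f_unnorm = lambda x: np.exp(-0.5 * x**2)                # unnormalized N(0,1)
h_sample = lambda n, rng: rng.laplace(size=n)
h_pdf = lambda x: 0.5 * np.exp(-abs(x))
print(importance_sampling(f_unnorm, lambda x: x, h_sample, h_pdf, 10_000, rng))
```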

3. Rejection Sampling

For rejection sampling (see [9] or [2] for a detailed explanation) it is necessary to find a hat function $h(x)$ with

$$ h(x) \ge f(x) \qquad \text{for all } x \in \mathbb{R}^n . $$

The hat function must have bounded integral and should be chosen such that it is not too difficult to generate random vectors with density proportional to $h(x)$. The integral of the hat function should be as small as possible, as the expected number of trials $\alpha$ necessary to generate one random vector is

$$ \alpha = \frac{\int h(x)\, dx}{\int f(x)\, dx} . $$

Note that if we consider the normed densities $\bar{f}$ and $\bar{h}$ associated with $f$ and $h$, then we have

$$ \sup_x \frac{\bar{f}(x)}{\bar{h}(x)} = \alpha . $$
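A minimal sketch of the corresponding acceptance/rejection loop, under the assumption that the hat dominates the unnormalized target everywhere (names are again illustrative):

```python
import numpy as np

def rejection_sample(f_unnorm, hat_sample, hat_pdf, n, rng):
    """Generate n variates with density proportional to f_unnorm,
    assuming hat_pdf(x) >= f_unnorm(x) everywhere (hat unnormalized)."""
    out = []
    while len(out) < n:
        x = hat_sample(rng)
        if rng.uniform() * hat_pdf(x) <= f_unnorm(x):    # accept with prob f/h
            out.append(x)
    return np.array(out)

# Example: N(0,1) under a scaled Laplace hat exp(1/2)*exp(-|x|), which dominates it.
rng = np.random.default_rng(1)
f_unnorm = lambda x: np.exp(-0.5 * x**2)
hat_pdf = lambda x: np.exp(0.5) * np.exp(-abs(x))
hat_sample = lambda rng: rng.laplace()
print(rejection_sample(f_unnorm, hat_sample, hat_pdf, 5, rng))
```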

 

3.1. Comparing rejection and importance sampling

Perhaps the most frequently used error measure in statistics is the mean squared error (MSE). To compare the advantages and disadvantages of the different integration methods it is therefore instructive to consider the MSE when estimating the mean of a marginal distribution. (Without loss of generality we restrict the discussion to the first marginal distribution.) If we can generate iid. random variates directly, it is well known that the MSE for estimating the mean is equal to $\sigma^2 / N$, where $\sigma^2$ denotes the variance of the marginal distribution and $N$ the sample size. As we are focusing on the case that the evaluation of $f$ is very expensive, it seems to be of interest to express the MSE in terms of $N_f$, the number of evaluations of the density $f$ that are necessary.

If we consider rejection sampling and assume that no squeeze (simple lower bound of the density) is used, which is rarely possible for multidimensional problems, we can express the mean squared error of rejection sampling as a function of $N_f$:

$$ \mathrm{MSE}_{\mathrm{RS}} = \frac{\alpha\, \sigma^2}{N_f} . $$

For importance sampling we can use the fact that the MSE for the mean of a marginal distribution is approximately equal to $\sigma^2 / N_e$, as $N_e$ is the effective sample size. Thus we get for importance sampling:

$$ \mathrm{MSE}_{\mathrm{IS}} = \frac{\sigma^2}{N_f} \int \frac{\bar{f}(x)^2}{\bar{h}(x)}\, dx . $$

If we compare importance sampling with rejection sampling with the same normed densities $\bar{f}$ and $\bar{h}$, we can see (as pointed out in [4]) that we always have

$$ \mathrm{MSE}_{\mathrm{IS}} \le \mathrm{MSE}_{\mathrm{RS}} , $$

as

$$ \int \frac{\bar{f}(x)^2}{\bar{h}(x)}\, dx \le \sup_x \frac{\bar{f}(x)}{\bar{h}(x)} \int \bar{f}(x)\, dx = \alpha . $$

Of course the above result does not mean that rejection sampling is useless. Only for the case that we cannot use a squeeze (lower bound for the density) and that the evaluation of the density $f$ is much more expensive than the evaluation of $g$ do we see that importance sampling results in a smaller MSE for approximately the same computing effort.
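The inequality is easy to check numerically for a simple one-dimensional pair of densities. The following hypothetical example uses a standard normal target with a Laplace proposal/hat; it illustrates the statement above and is not a computation taken from the paper:

```python
import numpy as np
from scipy import integrate

# Normed densities: target N(0,1), proposal/hat Laplace(0,1).
f_bar = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
h_bar = lambda x: 0.5 * np.exp(-abs(x))

ratio_int, _ = integrate.quad(lambda x: f_bar(x)**2 / h_bar(x), -np.inf, np.inf)
alpha = np.sqrt(2 / np.pi) * np.exp(0.5)   # sup f_bar/h_bar, attained at |x| = 1

print(ratio_int, alpha)   # approx. 1.10 versus 1.32: MSE_IS <= MSE_RS holds
```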

Note: Our setting here is a typical one for Monte Carlo integration. For most discrete event simulations the situation is different. There the evaluation of $g$ involves the whole simulation system and is in most cases much more time consuming than the generation of the random variates.

4. A new class of importance sampling densities

In practice the success of importance sampling heavily depends on the importance sampling density used. If we consider the variety of shapes a multi-modal density can have in higher dimensions, it is not astonishing that it is impossible to find a general IS density for arbitrary densities in high dimension. There are few general purpose suggestions how to select the proposal density. An exception are adaptive importance sampling schemes (cf. eg. [7] or [13]), which have the disadvantage that they still need an importance sampling density to start with. In addition they use kernel density estimation and are thus very slow.

In this paper we concentrate on the more tractable but still difficult case of unimodal densities. One important importance sampling density for a unimodal $f$ with low tails is the split-normal distribution suggested in [3]. This approach assumes that the density is approximately multi-normal and guesses the variance-covariance matrix of the distribution using a rough numeric estimate of the Hessian matrix in the mode of the log-density. Good results are reported for this method for some Bayesian applications.

In contrast we develop a general approach to construct importance sampling densities for multivariate log-concave distributions. (Log-concave posterior densities are fairly common in Bayesian statistics.) The new method requires only the knowledge of the mode but no estimate of the variance-covariance matrix. It was motivated by automatic hat-construction algorithms based on transformed density rejection (TDR), see [9]. These TDR algorithms construct upper bounds for multivariate densities by using tangent hyperplanes of the log-densities.

For importance sampling densities it is not necessary to obtain upper bounds; only approximations of the densities are required. We thus try to approximate the given log-density $\log f$ by a hyperplane in each region. To do this it is necessary to know the location of the mode, and it is convenient to decompose the $n$-dimensional space into $2^n$ quadrants using the location of the mode as origin. It is easy to see that a vector of independent exponential variates has the property that its joint density transformed by the logarithm is a hyperplane. So in each quadrant we can use a vector of independent exponential distributions with different parameters $\lambda_i$ when we try to approximate the log-density by a hyperplane. The importance sampling density in the first quadrant is then given by:

$$ h(x) = \prod_{i=1}^{n} \lambda_i \, \exp\Big( -\sum_{i=1}^{n} \lambda_i x_i \Big) . $$

To choose the parameters $\lambda_i$ for all coordinates in each quadrant we use the following heuristic formula, which is based on the idea of finding a hyperplane not too different from the (log-)transformed density:

$$ \lambda_i = \frac{1}{2} \Big( \log f(m) - \log f(m + e_i) + \log f(m + \mathbf{1} - e_i) - \log f(m + \mathbf{1}) \Big) , $$

where $\mathbf{1}$ is the vector of dimension $n$ with all entries equal to 1, $e_i$ denotes the $i$-th vector of the canonical basis, and $m$ denotes the mode (for the other quadrants the signs of the offsets are adapted accordingly). If the probabilities $p_q$ of the proposal distribution are selected proportional to the reciprocal value of the product of all $\lambda_i$, this implies that the importance sampling densities of all quadrants have the same value in the origin. In our numerical experiments we observed that for distributions with high correlations the probabilities for some quadrants are too close to 0. So we decided to correct small probabilities such that for every quadrant the probability is guaranteed to be larger than or equal to $1/(10 \cdot 2^n)$. It should be obvious from our explanations that this proposal density requires a considerable number of parameters: $n$ different $\lambda_i$ and a probability in every quadrant, which results in a total of $(n+1)\, 2^n$ parameters. Nevertheless, it is straightforward and very fast to calculate these parameters.
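Since the parameters are so cheap to compute, the whole construction fits in a few lines. The following is a minimal sketch of our reading of this construction; the helper names, the guard against non-positive rates, and the exact handling of the probability floor are illustrative assumptions, not prescribed details of the method:

```python
import itertools
import numpy as np

def build_proposal(log_f, mode, n):
    """Quadrant-wise exponential proposal for a log-concave target (Section 4)."""
    quads = list(itertools.product([1.0, -1.0], repeat=n))
    lams, one = [], np.ones(n)
    for signs in quads:                        # signs = orientation of the quadrant
        s = np.array(signs)
        lam = np.empty(n)
        for i in range(n):
            e = np.zeros(n)
            e[i] = 1.0
            # heuristic slope of the log-density along coordinate i in this quadrant
            lam[i] = 0.5 * (log_f(mode) - log_f(mode + s * e)
                            + log_f(mode + s * (one - e)) - log_f(mode + s * one))
        lams.append(np.maximum(lam, 1e-3))     # guard: rates must stay positive
    p = np.array([1.0 / np.prod(lam) for lam in lams])
    p /= p.sum()
    p = np.maximum(p, 1.0 / (10 * 2**n))       # floor small quadrant probabilities
    return quads, lams, p / p.sum()

def sample_proposal(quads, lams, p, mode, rng):
    """Draw one vector from the quadrant mixture of exponential distributions."""
    q = rng.choice(len(quads), p=p)
    x = rng.exponential(1.0 / lams[q])         # numpy uses scale = 1/rate
    return mode + np.array(quads[q]) * x
```

The resulting quadrant mixture carries exactly the $(n+1)\,2^n$ parameters described above and can be plugged in as the proposal $h$ of the ratio estimate in Section 2.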

5. Gibbs sampling

Gibbs sampling (see eg. [5] or [9]) is the most popular Markov chain Monte Carlo algorithm. Given an arbitrary starting point, the Gibbs sampler sequentially generates variates from the one-dimensional full conditional distributions, using the current values of the other coordinates. Thus the Gibbs sampler is not generating an iid. sequence of random vectors. Instead it generates a Markov chain of random vectors whose distribution converges to the distribution with density proportional to $f$. It is possible to show that under mild regularity conditions the average of the $g(X_i)$ over the vectors $X_i$ of the generated Markov chain converges to the integral $\int g(x) f(x)\, dx / \int f(x)\, dx$.

Of course the choice of the starting value is of some importance. If the approximate location of the mode is known, it is a good starting point. Otherwise the sampler is run for some time till a statistical test indicates that the Markov chain has converged. Then this first part of the chain (the so-called burn-in) is discarded and the sampling is continued to generate a vector sample and to calculate the average of the $g(X_i)$. It is possible, although not simple, to calculate a confidence interval for the integration result using Gibbs sampling.

One practical problem of the Gibbs sampler is the fact that it is necessary to sample from the full conditional distributions.
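In code, one cycle of the sampler is a loop over coordinates. A minimal sketch, assuming samplers for all full conditionals are available (the callable interface is our illustrative choice, not prescribed by the paper):

```python
import numpy as np

def gibbs_chain(cond_samplers, x0, n_cycles, rng):
    """cond_samplers[i](x, rng) draws coordinate i from its full conditional
    distribution given the current values of all other coordinates."""
    x, chain = np.array(x0, dtype=float), []
    for _ in range(n_cycles):
        for i in range(len(x)):               # one cycle updates every coordinate
            x[i] = cond_samplers[i](x, rng)
        chain.append(x.copy())
    return np.array(chain)

# Illustration: bivariate normal with correlation rho (conditionals are normal).
rho = 0.5
cond = [lambda x, rng: rho * x[1] + np.sqrt(1 - rho**2) * rng.standard_normal(),
        lambda x, rng: rho * x[0] + np.sqrt(1 - rho**2) * rng.standard_normal()]
chain = gibbs_chain(cond, [0.0, 0.0], 5_000, np.random.default_rng(0))
print(chain[:, 0].mean())                     # close to 0
```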


If it is known that the full conditionals are log-concave, it is possible to use automatic random variate generation methods for one-dimensional log-concave distributions. For details see [6] and [9].

Despite the popularity of the Gibbs sampler there are hardly any investigations on the size of the mean squared error of integrals evaluated using the Gibbs sampler ([1] is such a rare example). Therefore we tried to quantify the error for a simple but important special case.

5.1. The MSE of the Gibbs sampler for the multi-normal distribution

It is instructive (and comparatively simple) to investigate the properties of the Gibbs sampling scheme for the multi-normal distribution. In the sequel we assume without loss of generality that the variances of all marginal distributions are equal to one.

 n    rho     MSE x 10^4
 2    0.5        1.7
 3    0.5        2.2
 4    0.5        2.7
 6    0.5        3.9
10    0.5        5.9
 2    0.9        9.5
 3    0.9       17.7
 4    0.9       25.9
 6    0.9       42.1
10    0.9       74.2
 2    0.99      98.8
 3    0.99     194.7
 4    0.99     289.2
 6    0.99     473.6
10    0.99     824.9

Table 1: MSE of the expectation of the first marginal, multiplied by 10^4, for the Gibbs sampler with sample size 10^4; the density $f$ is the multi-normal distribution with unit variances and all correlations equal to $\rho$.

It is easy to see that for the two-dimensional normal distribution the full conditional distribution of $x_1$ given $x_2$ is normal with mean $\rho x_2$ and variance $1 - \rho^2$. Thus we have the recursion

$$ x_1^{(t+1)} = \rho\, x_2^{(t)} + \sqrt{1 - \rho^2}\; \epsilon_{2t} , $$
$$ x_2^{(t+1)} = \rho\, x_1^{(t+1)} + \sqrt{1 - \rho^2}\; \epsilon_{2t+1} , $$

where $(\epsilon_t)$ denotes a sequence of iid. standard normal variates. It is easy to show that the $x_1^{(t)}$ form an AR(1) process with parameter $\rho^2$ and variance 1 [1]. To calculate the MSE of the mean of the first marginal we can write the sum of the $x_1^{(t)}$ as a sum of the $\epsilon_t$ with different coefficients and get:

$$ \text{MSE-Gibbs-2dim} = \frac{1}{N} \cdot \frac{1 + \rho^2}{1 - \rho^2} . $$
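This closed form can be checked empirically. A small sketch under the recursion above (sample sizes and seeds are arbitrary choices of ours):

```python
import numpy as np

def gibbs_mean_2dim(rho, n_cycles, rng):
    """Average of the first coordinate along a bivariate-normal Gibbs chain."""
    x1 = x2 = s = 0.0
    sd = np.sqrt(1.0 - rho**2)
    for _ in range(n_cycles):
        x1 = rho * x2 + sd * rng.standard_normal()
        x2 = rho * x1 + sd * rng.standard_normal()
        s += x1
    return s / n_cycles

rho, n, rng = 0.9, 10_000, np.random.default_rng(3)
mse = np.mean([gibbs_mean_2dim(rho, n, rng) ** 2 for _ in range(200)])
print(n * mse, (1 + rho**2) / (1 - rho**2))   # both close to 9.5, as in Table 1
```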

For dimensions $n > 2$ the Gibbs sampler of the multi-normal distribution is no longer a simple AR process, but if the correlations between all variables are equal it is still not difficult to write the MSE of the expectation of a marginal distribution as the variance of a sum of normal variates with different variances and to calculate this variance numerically. Our results for some dimensions can be found in Table 1. As expected the MSE grows for increasing correlation and of course also grows for increasing dimension.

6. Comparison of Gibbs sampling and importance sampling

It is not easy to make a fair comparison between these two very different methods. For the new importance sampling density proposed above we need the location of the mode, which can be easily found as we assume that the log-density is at least approximately concave. The remaining part of the algorithm is simple; it is also no problem to calculate confidence intervals for the results. However, it must not be overlooked that these confidence intervals can be wrong if the variance of the weights is high, which can lead to poor estimates of the standard error. Another clear advantage of importance sampling is that if high precision is needed we can easily parallelize the simulation, as we are generating iid. samples.

For Gibbs sampling it is not necessary to know the mode of the distribution, but it is necessary to have generation methods available to generate variates from the full conditional distributions of the density $f$. It is necessary to decide when the chain has converged, and it is not easy to calculate confidence intervals for the results as the generated vectors are not independent.

Looking at the above arguments we think that they are slightly in favor of importance sampling. However, most important is the question which of the two methods will lead to more precise results. So we compare the mean squared error for standard integration problems to get more objective arguments to decide between these two methods.

6.1. Comparing the mean squared error

First we have to decide about the rules for our comparison. As we assume that the density $f$ is expensive to evaluate, we should use the number of evaluations of $f$ as the main factor. Unfortunately, for the Gibbs sampler this number strongly depends on the structure of $f$. If the full conditional distributions of $f$ are standard distributions, no evaluations of $f$ are necessary at all. If we use rejection algorithms for each of the full conditionals, it can be necessary to evaluate the full conditionals four times or even more often to generate only one coordinate of the vector. And it is not clear if the evaluation of the full conditionals is as time consuming as that of the full density, or much cheaper. So there is clearly no general way to make a fair comparison. As a simple assumption, which should not be too unrealistic for many applications, we therefore assume that for dimension $n$ one cycle of the Gibbs sampler (ie. updating all coordinates) requires $n$ evaluations of the density.

For our experiments with dimension $n$ between two and eight we use the multi-normal distribution with unit variances and equal correlations between all variables. Thus the variance-covariance matrix has ones in the diagonal and all other elements equal to $\rho$. For $\rho$ we used the values 0, 0.5 and 0.9. We compare the mean squared error of the integral that calculates the mean and of the integral that calculates the variance of the first marginal, always making exactly $10^4$ evaluations of $f$. (The mode search for importance sampling and the burn-in for Gibbs sampling are not included.) Due to the nature of the methods and the "rule" we have explained above, this means that we always have $10^4$ repetitions for importance sampling but only between 5000 ($n = 2$) and 1250 ($n = 8$) repetitions for the Gibbs sampler.


 n   rho |  Gibbs        | IS indep.     | IS new
         |   E      V    |   E      V    |   E      V
 2   0.0 |  2.0    4.0   |  1.1    1.7   |  1.1    1.7
 3   0.0 |  3.0    6.0   |  1.2    1.7   |  1.2    1.7
 4   0.0 |  4.0    8.0   |  1.3    1.8   |  1.3    1.8
 6   0.0 |  6.0   12.0   |   -      -    |  1.6    2.0
 8   0.0 |  8.0   16.0   |   -      -    |  1.8    2.5
 2   0.5 |  3.5    4.6   |  1.9    3.1   |  1.2    1.6
 3   0.5 |  6.6    7.5   |  4.8    9.9   |  1.7    2.3
 4   0.5 |  11     11    |  16     42    |  2.9    4.6
 6   0.5 |  24     22    |   -      -    |  32     47
 8   0.5 |  40     32    |   -      -    |  210    220
 2   0.9 |  18     17    |  7.1    13    |  2.7    5
 3   0.9 |  55     54    |  80     255   |  8.4    18
 4   0.9 |  104    100   |  550    1070  |  30     73
 6   0.9 |  253    236   |   -      -    |  370    760
 8   0.9 |  472    432   |   -      -    |  1200   1600

Table 2: The MSE multiplied by 10^4 for the expectation (E) and the variance (V) of the first marginal, for the multi-normal distribution with unit variances and all correlations equal to $\rho$, always using 10^4 evaluations of $f$; "IS indep." uses independent double exponential random variates whereas "IS new" uses our new procedure of Section 4.

 n   rho |  Gibbs        | IS new
         |   E      V    |   E      V
 2   0.3 |  2.3    4.4   |  1.2    1.7
 3   0.3 |  4.2    6.9   |  1.4    2.0
 4   0.3 |  5.6    10    |  1.5    2.6
 6   0.3 |  9.6    12    |  2.0    2.6
 8   0.3 |  12     19    |  2.2    3.3
 2   0.5 |  3.4    6.3   |  1.5    2.3
 3   0.5 |  6.6    7.8   |  1.9    3.3
 4   0.5 |  10     11    |  2.3    3.8
 6   0.5 |  16     17    |  4.1    6.4
 8   0.5 |  20     24    |  8.0    13
 2   0.6 |  4.4    6.6   |  2.0    3.5
 3   0.6 |  8.4    10    |  3.1    5.0
 4   0.6 |  12     15    |  8.2    14
 6   0.6 |  18     19    |  36     58
 8   0.6 |  26     29    |  126    230

Table 3: Same as Table 2 but for a multi-normal mixture distribution with the two mixture components centered at $(\delta, \delta, \ldots, \delta)$ and $(-\delta, -\delta, \ldots, -\delta)$.

Interpreting the results of Table 2 we can first clearly see that, as expected, our new procedure to select the importance sampling density leads to a clearly smaller mean squared error than using an independent importance sampling density. Another point is that our computational results for the mean squared error of the expectation of the first marginal when using the Gibbs sampler (see Table 1) are confirmed by this simulation study.

Looking at the case $\rho = 0$ we can compare the results with naive simulation, where the MSE is $1/N$ for the expectation and $2/N$ for the variance. The Gibbs sampler is exactly the same as naive simulation in this case. However, as we are reporting the mean squared error for a fixed number of evaluations of $f$, the sample size decreases with dimension, which in turn leads to an increased MSE.

Comparing our new importance sampling procedure and Gibbs sampling, it is obvious that the importance sampling procedure is clearly better for smaller dimensions and for smaller correlations. For increasing dimension the performance of the importance sampling procedure deteriorates much more quickly than that of the Gibbs sampler, so that, depending on the correlation, the break-even point is somewhere around dimension five. For stronger correlations the advantage of the Gibbs sampler in higher dimensions is even more obvious.

Table 3 gives the results for our second simulation experiment. There we used normal mixture distributions with two equally probable parts: both standard normal vectors, with mean vectors $(-\delta, \ldots, -\delta)$ and $(\delta, \ldots, \delta)$ respectively. It can be clearly seen that the results are very similar to the results of Table 2.

Of course our simulation study only compares the performance of the two methods for two simple problems. Nevertheless, it is an indication that the Gibbs sampler, which is not using approximations of the density but samples from the exact full conditional distributions, performs well for increasing dimension.
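For readers who want to reproduce the flavor of this comparison without the full setup, the following simplified, self-contained harness contrasts the Gibbs sampler with iid sampling from the exact target (a best-case stand-in for a well-matched importance sampler) for the bivariate normal mean, under the same evaluation-budget rule; all parameter choices are illustrative:

```python
import numpy as np

def mse_for_budget(rho, n_evals, reps, rng):
    """Empirical MSE of the first-marginal mean under a fixed budget of density
    evaluations: Gibbs pays n = 2 evaluations per cycle, iid sampling pays 1."""
    sd = np.sqrt(1.0 - rho**2)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    gibbs_sq, iid_sq = [], []
    for _ in range(reps):
        x1 = x2 = s = 0.0
        cycles = n_evals // 2                 # the "n evaluations per cycle" rule
        for _ in range(cycles):
            x1 = rho * x2 + sd * rng.standard_normal()
            x2 = rho * x1 + sd * rng.standard_normal()
            s += x1
        gibbs_sq.append((s / cycles) ** 2)
        xs = rng.multivariate_normal(np.zeros(2), cov, size=n_evals)
        iid_sq.append(xs[:, 0].mean() ** 2)
    return np.mean(gibbs_sq), np.mean(iid_sq)

rng = np.random.default_rng(7)
print(mse_for_budget(0.9, 10_000, 100, rng))  # Gibbs MSE is clearly larger here
```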

7. Conclusions

We have proposed a simple novel method for constructing importance sampling densities. Our simulation results indicate that the new method works fine up to dimension five or six.

We have compared the performance of this importance sampling procedure with the Gibbs sampler. Our results show that for low dimension the performance of importance sampling is clearly better. However, due to the fact that the performance of importance sampling deteriorates much faster with the dimension, the Gibbs sampler performs clearly better for higher dimensions (larger than six).

For many densities the Gibbs sampler is more complicated to implement and it is also much more difficult to parallelize. Therefore the simplicity of our general purpose importance sampling approach and the possibility to easily distribute the computations should certainly be utilized for dimensions below six. For higher dimensions importance sampling should only be used if an importance sampling density very close to the real density can be found.

Acknowledgements

This work was supported by the Austrian Science Foundation (FWF), project no. P16767-N12.

References

[1] M.-H. Chen and B. Schmeiser. Performance of the Gibbs, Hit-and-run, and Metropolis samplers. Journal of Computational and Graphical Statistics, 2:251-272, 1993.

[2] L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986.

[3] J. Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317-1339, 1989.

[4] J. Geweke. Monte Carlo simulation and numerical integration. In A. Amman, D. Kendrick, and J. Rust, editors, Handbook of Computational Economics, pages 731-800. North-Holland, Amsterdam, 1996.

[5] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman & Hall, London, 1996.

[6] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41(2):337-348, 1992.

[7] G. H. Givens and A. E. Raftery. Local adaptive importance sampling for multivariate densities with strong nonlinear relationships. J. Amer. Statist. Assoc., 91(433):132-141, 1996.

[8] T. Hesterberg. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185-194, 1995.

[9] W. Hörmann, J. Leydold, and G. Derflinger. Automatic Nonuniform Random Variate Generation. Springer-Verlag, Berlin Heidelberg, 2004.

[10] S. J. Koopman and N. Shephard. Estimating the likelihood of the stochastic volatility model: testing the assumptions behind importance sampling. Technical report, Department of Econometrics, Free University Amsterdam, 2004.

[11] M.-S. Oh and J. O. Berger. Integration of multimodal functions by Monte Carlo importance sampling. J. Amer. Statist. Assoc., 88(422):450-456, 1993.

[12] A. Owen and Y. Zhou. Safe and effective importance sampling. J. Amer. Statist. Assoc., 95(449):135-143, 2000.

[13] P. Zhang. Nonparametric importance sampling. J. Amer. Statist. Assoc., 91(435):1245-1253, 1996.
