AdMit: Adaptive Mixtures of Student-t Distributions
by David Ardia, Lennart F. Hoogerheide and Herman K. van Dijk
Introduction

This note presents the package AdMit (Ardia et al., 2008, 2009), an R implementation of the adaptive mixture of Student-t distributions (AdMit) procedure developed by Hoogerheide (2006); see also Hoogerheide et al. (2007) and Hoogerheide and van Dijk (2008). The AdMit strategy consists of the construction of a mixture of Student-t distributions which approximates a target distribution of interest. The fitting procedure relies only on a kernel of the target density, so that the normalizing constant is not required. In a second step, this approximation is used as an importance function in importance sampling or as a candidate density in the independence chain Metropolis-Hastings (M-H) algorithm to estimate characteristics of the target density. The estimation procedure is fully automatic and thus avoids the difficult task, especially for non-experts, of tuning a sampling algorithm. Typically, the target is a posterior distribution in a Bayesian analysis, where we indeed often only know a kernel of the posterior density.

In a standard case of importance sampling or the independence chain M-H algorithm, the candidate density is unimodal. If the target distribution is multimodal, then some draws may have huge weights in the importance sampling approach and a second mode may be completely missed in the M-H strategy. As a consequence, the convergence behavior of these Monte Carlo integration methods is rather uncertain. Thus, an important problem is the choice of the importance or candidate density, especially when little is known a priori about the shape of the target density. For both importance sampling and the independence chain M-H, it holds that the candidate density should be close to the target density, and it is especially important that the tails of the candidate should not be thinner than those of the target.

Hoogerheide (2006) and Hoogerheide et al. (2007) mention several reasons why mixtures of Student-t distributions are natural candidate densities. First, they can provide an accurate approximation to a wide variety of target densities, with substantial skewness and high kurtosis. Furthermore, they can deal with multi-modality and with non-elliptical shapes due to asymptotes. Second, this approximation can be constructed in a quick, iterative procedure, and a mixture of Student-t distributions is easy to sample from. Third, the Student-t distribution has fatter tails than the Normal distribution; especially if one specifies Student-t distributions with few degrees of freedom, the risk is small that the tails of the candidate are thinner than those of the target distribution. Finally, Zeevi and Meir (1997) showed that under certain conditions any density function may be approximated to arbitrary accuracy by a convex combination of basis densities; the mixture of Student-t distributions falls within their framework.

The package AdMit consists of three main functions: AdMit, AdMitIS and AdMitMH. The first one allows the user to fit a mixture of Student-t distributions to a given density through its kernel function. The next two functions perform importance sampling and independence chain M-H sampling using the fitted mixture estimated by AdMit as the importance or candidate density, respectively.
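A typical session therefore chains the three functions. The lines below are only a minimal sketch of this workflow, not output from this article: the toy kernel myKernel is a placeholder (a standard bivariate Normal log-kernel), and the argument and component names (KERNEL, mu0, mit) follow the package documentation.

> library("AdMit")
> ## placeholder kernel: standard bivariate Normal, in log form
> myKernel <- function(x, log = TRUE)
+ {
+   if (is.vector(x))
+     x <- matrix(x, nrow = 1)
+   r <- -0.5 * rowSums(x^2)
+   if (!log)
+     r <- exp(r)
+   as.vector(r)
+ }
> fit <- AdMit(KERNEL = myKernel, mu0 = c(0, 0))      # fit the mixture
> outIS <- AdMitIS(KERNEL = myKernel, mit = fit$mit)  # importance sampling
> outMH <- AdMitMH(KERNEL = myKernel, mit = fit$mit)  # independence chain M-H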
To illustrate the use of the package, we apply the AdMit methodology to a bivariate bimodal distribution. We describe the use of the functions provided by the package and document the ability and relevance of the methodology to reproduce the shape of non-elliptical distributions.

Illustration

This section presents the functions provided by the package AdMit with an illustration of a bivariate bimodal distribution. This distribution belongs to the class of conditionally Normal distributions proposed by Gelman and Meng (1991) with the property that the joint density is not Normal. It is not a posterior distribution, but it is chosen because it is a simple distribution with non-elliptical shapes, so that it allows for an easy illustration of the AdMit approach.

Let X1 and X2 be two random variables, for which X1 is Normally distributed given X2 and vice versa. Then, the joint distribution, after location and scale transformations in each variable, can be written as (see Gelman and Meng, 1991):

  p(x) ∝ k(x) = exp[ -(1/2) (A x1^2 x2^2 + x1^2 + x2^2 - 2 B x1 x2 - 2 C1 x1 - 2 C2 x2) ]    (1)

where p(x) denotes a density, k(x) a kernel and x = (x1, x2)' for notational purposes. A, B, C1 and C2 are constants; we consider an asymmetric case in which A = 5, B = 5, C1 = 3, C2 = 3.5 in what follows.

The adaptive mixture approach determines the number of mixture components H, the mixing probabilities, and the modes and scale matrices of the components in such a way that the mixture density q(x) approximates the target density p(x) of which we only know a kernel function k(x) with x ∈ R^d. Typically, k(x) will be a posterior density kernel for a vector of model parameters x.

The AdMit strategy consists of the following steps:

(0) Initialization: computation of the mode and scale matrix of the first component, and drawing a sample from this Student-t distribution;

(1) Iterate on the number of components: add a new component that covers a part of the space of x where the previous mixture density was relatively small, as compared to k(x);

(2) Optimization of the mixing probabilities;

(3) Drawing a sample {x^[i]} from the new mixture q(x);

(4) Evaluation of the importance sampling weights {k(x^[i])/q(x^[i])}. If the coefficient of variation, i.e. the standard deviation divided by the mean, of the weights has converged, then stop. Otherwise, go to step (1). By default, the adaptive procedure stops if the relative change in the coefficient of variation of the importance sampling weights caused by adding one new Student-t component to the candidate mixture is less than 10%; the sketch after this list illustrates the stopping rule.
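The following minimal sketch expresses the step (4) convergence check in plain R; the helper CVstop and its arguments are hypothetical, not part of the package interface. Given the log-kernel and log-candidate values at the current draws, it computes the coefficient of variation of the importance sampling weights and applies the 10% rule.

> CVstop <- function(logk, logq, CVold, CVtol = 0.1)
+ {
+   w <- exp(logk - logq)      # importance sampling weights k(x)/q(x)
+   CVnew <- sd(w) / mean(w)   # coefficient of variation of the weights
+   ## stop when adding the latest component changes the CV by less than CVtol
+   list(CV = CVnew, stop = abs(CVnew - CVold) / CVold < CVtol)
+ }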
There are two main reasons for using the coefficient of variation of the importance sampling weights as a measure of the fitting quality. First, it is a natural, intuitive measure of the quality of the candidate as an approximation to the target. If the candidate and the target distributions coincide, all importance sampling weights are equal, so that the coefficient of variation is zero. For a poor candidate that does not even roughly approximate the target, some importance sampling weights are huge while most are (almost) zero, so that the coefficient of variation is high. The better the candidate approximates the target, the more evenly the weight is divided among the candidate draws, and the smaller the coefficient of variation of the importance sampling weights. Second, Geweke (1989) argues that a reasonable objective in the choice of an importance density is the minimization of the variance, or equivalently the coefficient of variation, of the importance weights. We prefer to quote the coefficient of variation because it does not depend on the number of draws or the integration constant of the kernel function.

By default, the algorithm uses the value one for the degrees of freedom of the mixture components (i.e. the approximation density is a mixture of Cauchy) since i) it enables the method to deal with fat-tailed target distributions and ii) it makes it easier for the iterative procedure to detect modes that are far apart. There may be a trade-off between efficiency and robustness at this point: setting the degrees of freedom to one reflects that we find a higher robustness more valuable than a slight possible gain in efficiency.

The core function provided by the package is the function AdMit. The main arguments of the function are: KERNEL, a kernel function of the target density on which the approximation is constructed; this function must contain the logical argument log, and when log = TRUE, the function KERNEL returns the logarithm of the kernel function, which is used for numerical stability. mu0 is the starting value of the first stage optimization; it is a vector whose length corresponds to the length of the first argument of KERNEL. Sigma0 is the (symmetric positive definite) scale matrix of the first component. If a matrix is provided by the user, it is used as the scale matrix of the first component, and mu0 is then used as the mode of the first component (instead of as a starting value for optimization). control is a list of tuning parameters, containing in particular: df (default: 1), the degrees of freedom of the mixture components, and CVtol (default: 0.1), the tolerance of the relative change of the coefficient of variation. For further details, the reader is referred to the documentation manual (by typing ?AdMit) or to Ardia et al. (2009). Finally, the last argument of AdMit is ..., which allows the user to pass additional arguments to the function KERNEL.

Let us come back to our bivariate conditionally Normal distribution. First, we need to code the kernel:

> GelmanMeng <- function(x, log = TRUE)
+ {
+   if (is.vector(x))
+     x <- matrix(x, nrow = 1)
+   ## kernel (1) with A = 5, B = 5 (-2B = -10),
+   ## C1 = 3 (-2C1 = -6) and C2 = 3.5 (-2C2 = -7)
+   r <- -0.5 * ( 5 * x[,1]^2 * x[,2]^2
+                 + x[,1]^2 + x[,2]^2
+                 - 10 * x[,1] * x[,2]
+                 - 6 * x[,1] - 7 * x[,2] )
+   if (!log)
+     r <- exp(r)
+   as.vector(r)
+ }

Note that the argument log is set to TRUE by default so that the function outputs the logarithm of the kernel.
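With the kernel coded, fitting the mixture is a single call to AdMit. The lines below are a sketch rather than output from this article: the seed and the starting value mu0 = c(0, 0.1) are arbitrary illustrative choices, and the CV component of the result (the coefficient of variation after each added component) follows the package documentation.

> library("AdMit")
> set.seed(1234)
> outAdMit <- AdMit(KERNEL = GelmanMeng, mu0 = c(0, 0.1))
> outAdMit$CV  # coefficient of variation after each added component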