Adversarial Variational Inference for Tweedie Compound Poisson Models
Yaodong Yang*¹ ², Sergey Demyanov*¹, Yuanyuan Liu¹, Jun Wang²

arXiv:1706.05446v3 [stat.ML] 16 Jul 2017

Abstract

Tweedie Compound Poisson models are heavily used for modelling non-negative continuous data with a discrete probability spike at zero. An important application is the modelling of the aggregate claim loss for insurance policies in actuarial science. However, the intractable density function and the unknown variance function have presented considerable challenges for Tweedie regression models. Previous studies have focused on numerical approximations to the density function. In this study, we tackle the Bayesian Tweedie regression problem via a variational approach. In particular, we empower the posterior approximation with an implicit model trained in an adversarial setting, introduce a hyper prior by making the parameters of the prior distribution trainable, and integrate out one local latent variable in the Tweedie model to reduce the variance. Our method is evaluated on the task of predicting losses for auto insurance policies. Results show that the proposed method achieves state-of-the-art performance among traditional inference methods, while providing a richer estimation of the variance function.

1. Introduction

Tweedie models (Jorgensen, 1997; 1987) are special members of the exponential dispersion family; they specify a power-law relationship between the variance and the mean: Var(Y) = E(Y)^P. For arbitrary positive P (the so-called index parameter) outside the interval (0, 1), Tweedie corresponds to a particular stable distribution, e.g., Gaussian (P = 0), Poisson (P = 1), Gamma (P = 2), Inverse Gaussian (P = 3). When P lies in the range (1, 2), the Tweedie model is equivalent to the compound Poisson–Gamma distribution (Simsekli et al., 2013), hereafter Tweedie for simplicity. Tweedie serves as a special Gamma mixture model, with the number of mixtures determined by a Poisson-distributed random variable. Tweedie is heavily used for modelling non-negative heavy-tailed continuous data with a discrete probability mass at zero (see Fig. 1a). In general, distributions whose support covers the non-negative real axis are usually not capable of capturing the density spike at zero. Conversely, Tweedie can parameterize a class of Gamma distributions that directly account for the mass at zero, while still retaining the ability to model the right-skewness of the positive part. As a result, Tweedie finds its importance across different real-world applications (Jørgensen, 1991; Smyth & Jørgensen, 2002), including actuarial science (premium / loss severity modelling), ecology (species biomass prediction), and meteorology (rainfall modelling).

Despite the importance of Tweedie, the intractable density function and the unknown variance function hinder inference on Tweedie models. On one hand, the Tweedie density function does not have an analytically tractable form, making traditional maximum likelihood methods difficult to apply. On the other hand, the index parameter P in the variance function is a latent variable that is usually unknown beforehand, even though the estimation of P is of great interest in applications such as hypothesis testing and predictive uncertainty measures (Davidian & Carroll, 1987). Unsurprisingly, there has been little work devoted to full-likelihood based inference on Tweedie, let alone Bayesian treatments.

To date, most practices of Tweedie modelling are conducted within the quasi-likelihood (QL) framework (Wedderburn, 1974), where the density function is approximated by its first- and second-order moments. Despite its wide applicability, QL has the drawback of requiring the index parameter P to be specified beforehand. Extended quasi-likelihood (EQL) (Nelder & Pregibon, 1987), together with the profile likelihood method (Cox & Reid, 1993), has been proposed to fill the gap of index estimation for QL. Nonetheless, EQL is not capable of handling exact zero values; consequently, EQL requires the observed data to be adjusted by an ad-hoc small positive number, which again could imperil the inferences. Most importantly, a consistent concern about the QL methodology is that the moment estimator may not be asymptotically equivalent to the maximum likelihood estimator (Pinheiro & Chao, 2012).

It was not until recently that numerical approximations have been applied to the Tweedie density function. Well-known approaches include the series expansion method (Dunn & Smyth, 2005) and the Fourier inversion method (Dunn & Smyth, 2008). In these cases, profile likelihood methods can access the "true" likelihood function, so accurate estimation of P is feasible. With the estimated P, Yang et al. (2015) and Chen & Guestrin (2016) applied gradient boosted trees to the Tweedie regression model to incorporate non-linear interactions among the features. Qian et al. (2016) investigated grouped elastic net and lasso methods for the Tweedie model. Nonetheless, the dependence on density approximations makes the above algorithms vulnerable to numerical issues. Series expansion methods are noted for suffering from an unbounded increment in the number of required terms. Fourier inversion methods may fail to work for small values of the observed variables, and can be computationally heavy to implement.

The expectation of a latent variable under an intractable likelihood function can typically be estimated using the EM algorithm in an empirical Bayes approach, or MCMC methods in a full Bayesian approach. Simsekli et al. (2013) and Zhang (2013) proposed solving Tweedie mixed-effect models by explicitly exploiting the latent variable in the Bayesian formulation; both MCMC and MCEM methods have been implemented. Zhang (2013) compared the latent variable approach with the aforementioned density function approximations and found that, although MCMC methods allow for the empirical estimation of the variance function without the need to directly evaluate the density function, they are computationally demanding on high-dimensional datasets and tend to be subject to Monte Carlo errors, requiring dedicated supervision.

In this work, we employ Variational Inference (VI) to solve Bayesian Tweedie regression while also estimating the distribution of the index parameter. VI offers the advantages of circumventing the calculation of the marginal probability density function, enabling the estimation of the variance function from the data, and, most importantly, scaling to high-dimensional, large datasets. Specifically, we apply implicit models in an adversarial setting to empower the posterior approximations. We set up the hyper prior by making the parameters of the prior distribution trainable. To reduce the variance during training, we integrate out the Poisson-distributed latent variable that causes the high variance. To the best of our knowledge, this is the first work that solves the Tweedie regression problem via the variational approach, and results show its state-of-the-art performance compared to conventional methods.

*Equal contribution. ¹American International Group (AIG), Science Team, The AIG Building, 58 Fenchurch Street, London EC3M 4AB, U.K. ²University College London. Correspondence to: Yaodong Yang <[email protected]>.

ICML 2017 Workshop on Implicit Models. Copyright 2017 by the author(s).

[Figure 1. (a) Tweedie density function (λ = 2.6, α = 4, β = 0.24); (b) Tweedie regression model. Graphical model of the Tweedie regression model.]

2. Preliminaries: Tweedie Regression Model

The Tweedie model with P ∈ (1, 2) equivalently describes the compound Poisson–Gamma distribution. We denote a random variable Y following such a distribution as:

Y = Σ_{i=1}^{N} G_i,  N ∼ Poisson(λ),  G_i ∼ i.i.d. Gamma(α, β),

where Y assumes Poisson arrivals of the events, and a Gamma-distributed "cost" for each individual event. In modelling real-world data, precipitation for example, λ models the frequency (how many days have rain), G_i models the severity of each event (the amount of rainfall on each rainy day), and Y is the aggregate precipitation during that time period. Judging from whether the aggregate number of arrivals N is zero, the joint density can be written as:

P(Y, N | λ, α, β) = P(Y | N, α, β) P(N | λ)   (1)
                  = d_0(y) · e^{−λ} · 1_{n=0} + [y^{nα−1} e^{−y/β} / (β^{nα} Γ(nα))] · [λ^n e^{−λ} / n!] · 1_{n>0},

where d_0(·) is the Dirac delta function at zero.

Note that although in this study we focus on the case of the Tweedie model with P ∈ (1, 2), our method can be extended to other cases with different P values. We refer to the Appendix for the connection between the parameters {λ, α, β} of the compound Poisson–Gamma model and the general Tweedie model parameterized by the mean, the dispersion, and the index parameter {µ, φ, P}.

In the context of Tweedie regression (see Fig. 1b), we define 1) the observed data D = (x_i, y_i)_{i=1,…,N}, representing the independent variables x_i and the dependent variable y_i; 2) global hidden variables w, which parameterize the function inferring the parameters of the Tweedie density function, i.e., (λ_i, α_i, β_i) = f_{λ,α,β}(x_i; w); and 3) local latent variables n_i, indicating the number of Gamma mixture components. The latent variables contain both local and global ones, z = (w, n_i)_{i=1,…,M}, and are assigned a prior distribution P(z). The log-likelihood is computed by summing over the number of observations, i.e., log P(D|z) = Σ_{i=1}^{M} log P(y_i | n_i, w, x_i). The goal is to find the posterior distribution P(z|D) so as to make future predictions via P(y_pred | D, x_pred) = ∫ P(y_pred | z, x_pred) P(z|D) dz.

Variational Inference (Jordan et al., 1999) approximates the posterior distribution, which is usually complicated and intractable, by proposing a class of probability distributions Q_θ(z) (so-called inference models), usually with more tractable forms, and then finding the best set of parameters θ by minimizing the KL divergence between the proposal distribution and the true distribution, i.e., KL(Q_θ(z) || P(z|D)).
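Minimizing KL(Q_θ(z) || P(z|D)) is equivalent to maximizing the evidence lower bound (ELBO). This is the standard VI identity (not specific to this paper), written here in the notation above:

```latex
\mathrm{KL}\big(Q_\theta(z)\,\|\,P(z\mid\mathcal{D})\big)
  = \log P(\mathcal{D})
  \;-\; \underbrace{\mathbb{E}_{Q_\theta(z)}\big[\log P(\mathcal{D}\mid z) + \log P(z) - \log Q_\theta(z)\big]}_{\mathrm{ELBO}(\theta)}
```

Since log P(D) does not depend on θ, minimizing the KL divergence amounts to maximizing the ELBO, which involves only the joint P(D|z)P(z) — this is precisely how VI circumvents the intractable marginal density.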
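As a concrete illustration of the compound Poisson–Gamma construction above, the following sketch (assuming NumPy; the function name is ours, not from the paper) draws Tweedie-distributed samples by summing a Poisson number of Gamma variables, producing exact zeros whenever N = 0:

```python
import numpy as np

def sample_tweedie_cp(lam, alpha, beta, size, rng=None):
    """Draw from Y = sum_{i=1}^{N} G_i with N ~ Poisson(lam)
    and G_i ~ i.i.d. Gamma(alpha, scale=beta)."""
    rng = np.random.default_rng(rng)
    n = rng.poisson(lam, size=size)  # number of Gamma mixture components
    # The sum of n i.i.d. Gamma(alpha, beta) variables is Gamma(n*alpha, beta);
    # zeros (n = 0) are handled explicitly since a zero shape is not valid.
    y = np.where(n > 0, rng.gamma(np.maximum(n, 1) * alpha, beta), 0.0)
    return y

samples = sample_tweedie_cp(2.6, 4.0, 0.24, size=200_000, rng=0)
print(samples.mean())           # ≈ lam * alpha * beta = 2.496
print((samples == 0).mean())    # ≈ exp(-2.6) ≈ 0.074, the spike at zero
```

The empirical mean and the proportion of exact zeros match the closed-form moments E[Y] = λαβ and P(Y = 0) = e^{−λ}, which is a quick sanity check on the construction.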
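The paper defers the {λ, α, β} ↔ {µ, φ, P} connection to its Appendix. The sketch below uses the standard reparameterization from the compound Poisson–Gamma literature (e.g. Dunn & Smyth), which may differ in notation from the authors' version:

```python
def tweedie_to_cp(mu, phi, p):
    """Map Tweedie (mean, dispersion, index) = (mu, phi, p), 1 < p < 2,
    to compound Poisson-Gamma parameters (lam, alpha, beta).
    Standard mapping; assumed here, see the paper's Appendix."""
    assert 1.0 < p < 2.0
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # Poisson rate
    alpha = (2.0 - p) / (p - 1.0)               # Gamma shape
    beta = phi * (p - 1.0) * mu ** (p - 1.0)    # Gamma scale
    return lam, alpha, beta

# Round trip through the moments:
# E[Y] = lam*alpha*beta = mu, and Var(Y) = lam*alpha*(1+alpha)*beta**2 = phi*mu**p.
lam, alpha, beta = tweedie_to_cp(mu=2.0, phi=1.5, p=1.5)
print(lam * alpha * beta)                       # 2.0
print(lam * alpha * (1 + alpha) * beta ** 2)    # 1.5 * 2.0 ** 1.5
```

The second printed quantity recovers the power-law variance function Var(Y) = φ µ^P from the Introduction, which is what makes this mapping consistent with the general Tweedie parameterization.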
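The joint density in Eq. (1) is most safely evaluated in log-space. A minimal sketch using only the standard library (the function name is ours):

```python
import math

def log_joint(y, n, lam, alpha, beta):
    """log P(y, n | lam, alpha, beta) following Eq. (1): a point mass
    e^{-lam} at y = 0 when n = 0, and a Gamma(n*alpha, beta) density
    times a Poisson(lam) mass when n > 0."""
    if n == 0:
        # P(Y = 0, N = 0) = e^{-lam}; y must be exactly zero
        return -lam if y == 0.0 else -math.inf
    if y <= 0.0:
        return -math.inf
    log_gamma_part = ((n * alpha - 1.0) * math.log(y) - y / beta
                      - n * alpha * math.log(beta) - math.lgamma(n * alpha))
    log_poisson_part = n * math.log(lam) - lam - math.lgamma(n + 1)
    return log_gamma_part + log_poisson_part
```

Summing exp(log_joint) over n recovers the marginal density P(y), which has no closed form — this infinite sum is exactly what the series expansion method of Dunn & Smyth (2005) truncates.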