Adversarial Variational Inference for Tweedie Compound Poisson Models

Yaodong Yang * 1 2 Sergey Demyanov * 1 Yuanyuan Liu 1 Jun Wang 2

*Equal contribution. 1American International Group (AIG) Science Team, The AIG Building, 58 Fenchurch Street, London EC3M 4AB, U.K. 2University College London. Correspondence to: Yaodong Yang.

ICML 2017 Workshop on Implicit Models. Copyright 2017 by the author(s).

Abstract

Tweedie Compound Poisson models are heavily used for modelling non-negative continuous data with a discrete probability spike at zero. An important practice is the modelling of the aggregate claim loss for insurance policies in actuarial science. However, the intractable density function and the unknown variance function have presented considerable challenges for Tweedie regression models. Previous studies are focused on numerical approximations to the density function. In this study, we tackle the Bayesian Tweedie regression problem via a Variational approach. In particular, we empower the posterior approximation by an implicit model trained in the adversarial setting, introduce the hyper prior by making the parameters of the prior distribution trainable, and integrate out one local latent variable in the Tweedie model to reduce the variance. Our method is evaluated on the application of predicting the losses for auto insurance policies. Results show that the proposed method enjoys state-of-the-art performance among traditional inference methods, while having a richer estimation of the variance function.

1. Introduction

Tweedie models (Jorgensen, 1997; 1987) are special members of the exponential dispersion family; they specify a power-law relationship between the variance and the mean: Var(Y) = E(Y)^P. For arbitrary positive P (the so-called index parameter) outside the interval (0, 1), Tweedie corresponds to a particular stable distribution, e.g., Gaussian (P = 0), Poisson (P = 1), Gamma (P = 2), Inverse Gaussian (P = 3). When P lies in the range (1, 2), the Tweedie model is equivalent to the Compound Poisson–Gamma distribution (Simsekli et al., 2013), hereafter Tweedie for simplicity. Tweedie serves as a special Gamma mixture model, with the number of mixtures determined by a Poisson-distributed random variable.

Tweedie is heavily used for modelling non-negative heavy-tailed continuous data with a discrete mass at zero (see Fig.(1a)). In general, distributions whose support covers the non-negative real axis are usually not capable of capturing the density spike at zero. Conversely, Tweedie can parameterize a class of Gamma distributions that directly accounts for the mass at zero, while still keeping the ability to model the right-skewness of the positive part. As a result, Tweedie finds its importance across different real-world applications (Jørgensen, 1991; Smyth & Jørgensen, 2002), including actuarial science (premium / loss severity modelling), ecology (species biomass prediction), meteorology (rainfall modelling), etc.

Despite the importance of Tweedie, the intractable density function and the unknown index parameter hinder inference on Tweedie models. On one hand, the Tweedie density function does not have an analytically tractable form, which makes traditional maximum likelihood methods difficult to apply. On the other hand, the index parameter P in the variance function is a latent variable that is usually unknown beforehand, even though the estimation of P is of great interest in applications such as hypothesis tests and predictive uncertainty measures (Davidian & Carroll, 1987). Unsurprisingly, there has been little work devoted to full-likelihood based inference on Tweedie, let alone Bayesian treatments.

To date, most practices of Tweedie modelling are conducted within the quasi-likelihood (QL) framework (Wedderburn, 1974), where the density function is approximated by the first and the second order moments. Despite its wide applicability, QL has the deformity of requiring the index parameter P to be specified beforehand. Extended quasi-likelihood (EQL) (Nelder & Pregibon, 1987), together with the profile likelihood method (Cox & Reid, 1993), has been proposed to fill the gap of index estimation for QL. Nonetheless, EQL is not capable of handling exact zero values; consequently, EQL requires the observed data to be adjusted by an ad-hoc small positive number, which again could imperil the inferences. Most importantly, a consistent concern with the QL methodology is that the moment estimator may not be asymptotically equivalent to the maximum likelihood estimator (Pinheiro & Chao, 2012).

It was not until recently that numerical approximations have been used for the Tweedie density function. Well-known approaches include the series expansion method (Dunn & Smyth, 2005) and the Fourier inversion method (Dunn & Smyth, 2008). In these cases, profile likelihood methods can access the "true" likelihood function so that accurate estimation of P is feasible. With the estimated P, Yang et al. (2015) and Chen & Guestrin (2016) applied gradient boosted trees to the Tweedie regression model in order to incorporate non-linear interactions among the features. Qian et al. (2016) investigated the grouped elastic net and lasso methods for the Tweedie model. Nonetheless, the dependence on density approximations makes the above algorithms vulnerable to numerical issues. Series expansion methods are noted for suffering from an unbounded increase in the number of required terms. Fourier inversion methods might fail to work for small values of the observed variable, and are computationally heavy to implement.

[Figure 1. (a) Tweedie density function with λ = 2.6, α = 4, β = 0.24 (x-axis: Value, y-axis: Density); (b) graphical model of the Tweedie regression model.]

The expectation of a latent variable under an intractable likelihood function can typically be estimated using the EM algorithm in an empirical Bayes approach, or MCMC methods in a full Bayesian approach. Simsekli et al. (2013) and Zhang (2013) proposed solving Tweedie mixed-effect models by explicitly exploiting the latent variable in the Bayesian formulation; both MCMC and MCEM methods have been implemented. Zhang (2013) compared the latent variable approach with the aforementioned density function approximations, and found that although MCMC methods allow for empirical estimation of the variance function without the need to evaluate the density function directly, they are computationally demanding on high dimensional datasets and tend to be subject to Monte Carlo errors, where dedicated supervision is needed.

In this work, we employ Variational Inference (VI) to solve Bayesian Tweedie regression while also estimating the distribution of the index parameter. VI offers the advantages of circumventing the calculation of the marginal probability density function, enabling the estimation of the variance function from the data, and, most importantly, scalability to high dimensional, large datasets. Specifically, we apply implicit models in adversarial settings to empower the posterior approximations. We set up the hyper prior by making the parameters of the prior distribution trainable. To reduce the variance during training, we integrate out the Poisson-distributed latent variable that causes the high variance. To the best of our knowledge, this is the first work that solves the Tweedie regression problem via the variational approach, and results show its state-of-the-art performance compared to conventional methods.

2. Preliminaries: Tweedie Regression Model

The Tweedie model with P ∈ (1, 2) equivalently describes the compound Poisson–Gamma distribution. We denote a random variable Y following such a distribution as:

Y = Σ_{i=1}^{N} G_i,   N ∼ Poisson(λ),   G_i ∼ Gamma(α, β) i.i.d.,

where Y assumes Poisson arrivals of the events and a Gamma-distributed "cost" for each individual event. In modelling real-world data, precipitation for example, λ models the frequency (how many days have rain), G_i models the severity each time (the amount of rainfall on each rainy day), and Y is the aggregate precipitation during that time period. Judging from whether the aggregate number of arrivals N is zero, the joint density can be written as:

P(Y, N | λ, α, β) = P(Y | N, α, β) P(N | λ)
                  = d_0(y) · e^{−λ} · 1_{n=0} + [ y^{nα−1} e^{−y/β} / (β^{nα} Γ(nα)) ] · [ λ^n e^{−λ} / n! ] · 1_{n>0},    (1)

where d_0(·) is the Dirac delta function at zero.

Note that although in this study we focus on the case of the Tweedie model with P ∈ (1, 2), our method can be extended to other cases with different P values. We refer to the Appendix for the connection between the parameters {λ, α, β} of the Compound Poisson–Gamma model and the general Tweedie model parameterized by the mean, the dispersion, and the index parameter, {µ, φ, P}.
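To make the generative story of the compound Poisson–Gamma model concrete, here is a small NumPy/SciPy sketch, ours rather than the paper's, that draws Y = Σ_i G_i and evaluates the joint log-density of Eqn.(1); the function names are hypothetical and the parameter values are those of Fig.(1a).

import numpy as np
from scipy.special import gammaln

def sample_tweedie_cpg(lam, alpha, beta, size, rng=None):
    # illustrative helper (not from the paper): Y = sum of N Gamma(alpha, beta) draws, N ~ Poisson(lam)
    rng = np.random.default_rng() if rng is None else rng
    n = rng.poisson(lam, size=size)                       # number of Gamma components per sample
    # the sum of n i.i.d. Gamma(alpha, beta) variables is Gamma(n * alpha, beta); n = 0 gives an exact zero
    y = np.where(n > 0, rng.gamma(np.maximum(n, 1) * alpha, beta), 0.0)
    return y, n

def joint_log_density(y, n, lam, alpha, beta):
    # log P(Y = y, N = n | lam, alpha, beta) following Eqn.(1); y > 0 exactly when n > 0
    if n == 0:
        return -lam if y == 0.0 else -np.inf              # Dirac mass at zero
    return ((n * alpha - 1) * np.log(y) - y / beta
            - n * alpha * np.log(beta) - gammaln(n * alpha)
            + n * np.log(lam) - lam - gammaln(n + 1))

y, n = sample_tweedie_cpg(lam=2.6, alpha=4.0, beta=0.24, size=5)
print(y, [joint_log_density(yi, ni, 2.6, 4.0, 0.24) for yi, ni in zip(y, n)])

The exact zeros produced by n = 0 are what give the marginal distribution of Y its probability spike at zero.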
In the context of Tweedie regression (see Fig.(1b)), we define 1) the observed data D = (x_i, y_i)_{i=1,...,N}, representing the independent variables x_i and the dependent variable y_i; 2) the global hidden variables w, which parameterize the function inferring the parameters of the Tweedie density function, i.e., (λ_i, α_i, β_i) = f_{λ,α,β}(x_i; w); and 3) the local latent variables n_i, indicating the number of Gamma mixture components. The latent variables contain both local and global ones, z = (w, n_i)_{i=1,...,M}, and are assigned a prior distribution P(z). The log-likelihood is computed by summing over the number of observations, i.e., log P(D | z) = Σ_{i=1}^{M} log P(y_i | n_i, w; x_i). The goal here is to find the posterior distribution P(z | D) so as to make future predictions via P(y_pred | D, x_pred) = ∫ P(y_pred | z; x_pred) P(z | D) dz.

Variational Inference (Jordan et al., 1999) approximates the posterior distribution, usually complicated and intractable, by proposing a class of probability distributions Q_θ(z) (so-called inference models), usually with friendlier forms, and then finds the best set of parameters θ by minimizing the KL divergence between the proposal distribution and the true distribution, i.e., KL(Q_θ(z) || P(z | D)). Minimizing the KL divergence is equivalent to maximizing the EVIDENCE LOWER BOUND, ELBO = E_{Q_θ(z)}[ log P(D, z) − log Q_θ(z) ].

If we set P(D, z) = P(D | z) P(z) in the ELBO, the optimization of finding the best Q_θ(z) can be rewritten as:

θ* = argmax_θ E_{Q_θ(z)}[ −log( Q_θ(z) / P(z) ) + log P(D | z) ]    (2)

3. Methods

Optimizing the ELBO requires its gradients. The REINFORCE algorithm (Williams, 1992) is a gradient-based method where the parameter gradients are sampled directly from the generative model, that is, ∇_θ ELBO ≈ (1/n) Σ_i ∇_θ log( P(D, z^i) / Q_θ(z^i) ). However, we find that direct application of REINFORCE does not offer reasonable results on Tweedie models. One main reason is the unacceptably high variance issue (Hinton et al., 1995). The issue is aggravated by Tweedie's gamma function and Dirac delta function; it persists even when REINFORCE is equipped with the "baseline" trick (Mnih & Gregor, 2014) or local expectation gradients (AUEB & Lazaro-Gredilla, 2015). In addition, we discover that over-simplified proposal distributions can harm the inference process. Common representations of Q_θ(z), such as a Gaussian distribution with mean and variance to be determined, are often limited in expressiveness when approximating the complicated posterior distribution P(z | D). This is especially true in Tweedie's case, as the density function itself is multi-modal, let alone the posterior. In this study, we solve these two problems by employing implicit inference models with a hyper prior, together with a Tweedie-specific variance reduction technique.

3.1. Adversarial Variational Bayes

AVB (Huszár, 2017; Karaletsos, 2016; Mescheder et al., 2017; Ranganath et al., 2016b;a; Wang & Liu, 2016) empowers VI methods by using neural networks as the proposed class of inference models; this allows more expressive approximations to the posterior distribution in the probabilistic model. As a neural network is a black box, the inference model is implicit; it does not have a closed-form expression, but it is still easy to draw samples from it. To circumvent the issue of computing the gradient through the implicit model, the mechanism of adversarial learning is introduced; in particular, a discriminative network T_φ(z) is used to model the first term in Eqn.(2), namely, to distinguish the latent variables that were sampled from the prior distribution P(z) from those that were sampled from the inference network Q_θ(z). If we find the optimal solution of the following optimization problem:

φ* = argmin_φ { −E_{Q_θ(z)}[ log σ(T_φ(z)) ] − E_{P(z)}[ log(1 − σ(T_φ(z))) ] }    (3)

where σ(x) is the sigmoid function, then we can estimate the ratio term in Eqn.(2) by E_{Q_θ(z)}[ log( Q_θ(z) / P(z) ) ] = E_{Q_θ(z)}[ T_{φ*}(z) ], with T_{φ*} the optimal discriminator of Eqn.(3).

Algorithm 1 AVB for Bayesian Tweedie Regression
 1: Input: data D = (x_i, y_i), i = 1, ..., M
 2: while θ not converged do
 3:   for N_T iterations do
 4:     Sample noises ε_{i,j=1,...,M} ∼ N(0, I)
 5:     Map noise to the prior: z_j^P = P_ψ(ε_j) = µ + σ ⊙ ε_j
 6:     Map noise to the posterior: z_i^Q = Q_θ(ε_i)
 7:     Minimize Eqn.(3) over φ via the gradients:
        ∇_φ (1/M) Σ_{i=1}^{M} [ −log σ(T_φ(z_i^Q)) − log(1 − σ(T_φ(z_i^P))) ]
 8:   end for
 9:   Sample noises ε_{i=1,...,M} ∼ N(0, I)
10:   Map noise to the posterior: z_i^Q = Q_θ(ε_i)
11:   Sample a minibatch of (x_i, y_i) from D
12:   Minimize Eqn.(2) over θ, ψ via the gradients:
        ∇_{θ,ψ} (1/M) Σ_{i=1}^{M} [ −T_{φ*}(z_i^Q) + Σ_{j=1}^{T} P(y_i | n_j, w; x_i) · P(n_j | w; x_i) ]
13: end while

AVB considers the optimization problems in Eqn.(2) and Eqn.(3) as a two-layer minimax game. In practice, we apply stochastic gradient descent iteratively to find a Nash equilibrium, updating Eqn.(3) N_T times and then updating the other objective once, for training stability. Mescheder et al. (2017) prove that such a Nash equilibrium, if reached, is a global optimum of the ELBO.

3.2. Hyper Prior

We parameterize the prior distribution P_ψ(z) by ψ and also make ψ trainable when optimizing Eqn.(2). We refer to such a trainable prior as the hyper prior. When the ELBO is maximized, the posterior is pushed towards the prior, the ideal case being log( Q_θ(z) / P_ψ(z) ) = 0. The prior distribution is usually simple, N(µ, σI) in this case; however, the intuition here is not to constrain the posterior distribution by one single prior that might be over-simplified for the problem. To further ensure the expressiveness of the inference network, we would like the prior to be adjusted so that the posterior Q_θ(z) stays close to a set of prior distributions, i.e., a self-adjustable prior, rather than a single distribution that never changes. Tuneable parameters in the prior help cover such multiple-prior settings. We can further apply the reparameterization trick (Kingma & Welling, 2013) via z = z_ψ(ε) = µ + σ ⊙ ε, and subsequently replace E_{P_ψ(z)}[ f(z) ] with E_{ε∼N(0,I)}[ f(z_ψ(ε)) ] for reduced variance when estimating the gradients in Eqn.(3).

3.3. Variance Reduction

In our study, we find that integrating out the local latent variable n_i in P(D | z) gives a significantly lower variance in the gradients w.r.t. Eqn.(2), and effectively helps the training converge:

P(y_i | w; x_i) = Σ_{j=1}^{T} P(y_i | n_j, w; x_i) · P(n_j | w; x_i)    (4)

As n_i is a Poisson-generated sample, the variance explodes in the cases where y_i is zero but the sampled n_i is positive, or y_i is positive but the sampled n_i is zero. This also accounts for why the direct application of the REINFORCE algorithm fails to work. In practice, we sum over all possible values of n_i and find that empirically limiting the summation bound T to between 2 and 10, depending on the dataset, gives the same performance as summing to a larger number.
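To make Algorithm 1 concrete, below is a minimal PyTorch sketch of one training iteration under our own assumptions. It is not the authors' implementation: the module and function names (Generator, Discriminator, tweedie_log_lik, train_step), the layer sizes, and the use of Adam are illustrative; only the structure of the two updates and the marginalized likelihood of Eqn.(4) follow the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    # implicit posterior Q_theta: maps noise to one sample of the 171-d latent weights z (sizes illustrative)
    def __init__(self, noise_dim=10, z_dim=171):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 10), nn.ReLU(),
                                 nn.Linear(10, 10), nn.ReLU(),
                                 nn.Linear(10, z_dim))
    def forward(self, eps):
        return self.net(eps)

class Discriminator(nn.Module):
    # T_phi(z): scalar logit used to estimate log Q_theta(z) - log P_psi(z)
    def __init__(self, z_dim=171):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 20), nn.ReLU(),
                                 nn.Linear(20, 20), nn.ReLU(),
                                 nn.Linear(20, 1))
    def forward(self, z):
        return self.net(z).squeeze(-1)

def tweedie_log_lik(y, x, z, T=10):
    # log P(y | w; x) with the Poisson index n summed out as in Eqn.(4);
    # z packs a 3 x d weight matrix and 3 biases giving (lambda, alpha, beta) per observation (our encoding)
    d = x.shape[-1]
    W, b = z[:, :3 * d].reshape(-1, 3, d), z[:, 3 * d:3 * d + 3]
    lam, alpha, beta = F.softplus(torch.einsum('bkd,bd->bk', W, x) + b).unbind(-1)
    ns = torch.arange(1, T + 1, dtype=y.dtype)                         # n = 1..T
    log_pois = ns * torch.log(lam)[:, None] - lam[:, None] - torch.lgamma(ns + 1)
    na = ns * alpha[:, None]
    log_gamma = ((na - 1) * torch.log(y.clamp_min(1e-12))[:, None] - y[:, None] / beta[:, None]
                 - na * torch.log(beta)[:, None] - torch.lgamma(na))
    return torch.where(y > 0, torch.logsumexp(log_pois + log_gamma, -1), -lam)  # y = 0 comes only from n = 0

gen, disc = Generator(), Discriminator()
mu, log_sigma = nn.Parameter(torch.zeros(171)), nn.Parameter(torch.zeros(171))  # hyper prior P_psi
opt_d = torch.optim.Adam(disc.parameters(), lr=3e-3)
opt_g = torch.optim.Adam(list(gen.parameters()) + [mu, log_sigma], lr=3e-3)     # theta and psi, as in step 12

def train_step(x, y, M=64):
    # discriminator step (Eqn.(3)): separate posterior samples from reparameterized prior samples
    z_q, z_p = gen(torch.randn(M, 10)), mu + log_sigma.exp() * torch.randn(M, 171)
    loss_d = -(F.logsigmoid(disc(z_q.detach())) + F.logsigmoid(-disc(z_p.detach()))).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # generator / hyper-prior step: negative ELBO of Eqn.(2), T_phi standing in for the log-ratio
    z_q = gen(torch.randn(M, 10))
    loss_g = disc(z_q).mean() - tweedie_log_lik(y, x, z_q).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

x, y = torch.randn(64, 56), torch.relu(torch.randn(64))   # dummy minibatch with exact zeros in y
train_step(x, y)

The first update fits the discriminator of Eqn.(3) on samples from the implicit posterior and the reparameterized prior; the second then descends the negative ELBO of Eqn.(2), with the index n summed out as in Eqn.(4) so that the zero and positive branches of the likelihood never mismatch a sampled n_i.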

Table 1. The pairwise Gini index comparison (entries: Gini index ± standard error), based on 20 random splits. Rows index the baseline model; columns index the model being compared.

BASELINE    GLM            PQL            LAPLACE        AGQ            MCMC           TDBOOST        AVB
GLM         /              -2.97 ± 6.28   1.75 ± 5.68    1.75 ± 5.68    -15.02 ± 7.06  1.61 ± 6.32    9.84 ± 5.80
PQL         7.37 ± 5.67    /              7.50 ± 6.26    7.50 ± 5.72    6.73 ± 5.95    0.81 ± 6.22    9.02 ± 6.07
LAPLACE     2.10 ± 4.52    -1.00 ± 5.94   /              8.84 ± 5.36    4.00 ± 4.61    21.45 ± 4.84   20.61 ± 4.54
AGQ         2.10 ± 4.52    -1.00 ± 5.94   8.84 ± 5.36    /              4.00 ± 4.61    21.45 ± 4.84   20.61 ± 4.54
MCMC        14.75 ± 6.80   -1.06 ± 6.41   3.12 ± 5.99    3.12 ± 5.99    /              7.82 ± 5.83    11.88 ± 5.50
TDBOOST     17.52 ± 4.80   17.08 ± 5.36   19.30 ± 5.19   19.30 ± 5.19   11.61 ± 4.58   /              20.30 ± 4.97
AVB         -0.17 ± 4.70   0.049 ± 5.62   3.41 ± 4.94    3.41 ± 4.94    0.86 ± 4.62    11.49 ± 4.93   /

4. Experiments

We evaluate our method on modelling the aggregate loss for auto insurance policies. Predicting the number of claims and the total cost of the claims is the essence of producing fair and reasonable prices for insurance products. We considered the public dataset (Yip & Yau, 2005; Zhang et al., 2005) that serves as a benchmark in Tweedie modelling. The dataset contains 10296 auto insurance policy records. Each record includes a driver's aggregate claim amount (y_i) for his policy, as well as 56 features (x_i). We separate the dataset into 50%/25%/25% splits for train/validation/test, and compare our method with six traditional inference methods for Tweedie regression: 1) Generalized Linear Model (McCullagh, 1984), 2) Penalized Quasi-Likelihood (Breslow & Clayton, 1993), 3) Laplace approximation (Tierney & Kadane, 1986), 4) Adaptive Gauss–Hermite Quadrature (Liu & Pierce, 1994), 5) MCMC (Zhang, 2013), and 6) TDBoost (Yang et al., 2017).

As the loss amount is sparse and right-skewed, we consider the ordered Lorenz curve and the corresponding Gini index (Frees et al., 2011; 2014) as the metric. Assuming that, for the i-th of n observations, y_i is the ground truth, b_i is the prediction from the baseline model, and ŷ_i is the prediction from the model to compare, we first sort all the samples by the relativity r_i = ŷ_i / b_i in increasing order, and then compute the empirical cumulative distributions F_b(r) = Σ_{i=1}^{n} b_i · 1(r_i ≤ r) / Σ_{i=1}^{n} b_i and F_y(r) = Σ_{i=1}^{n} y_i · 1(r_i ≤ r) / Σ_{i=1}^{n} y_i. The plot of (F_b(r), F_y(r)) is the ordered Lorenz curve, and the Gini index is twice the area between the curve and the 45° line. In short, a larger Gini index indicates that the model has a greater capability of separating the observations. In the insurance context, the Gini index profiles the insurer's capability of distinguishing policy holders with different risk levels.

[Figure 2. Posterior distribution of the index parameter P, as estimated by MCMC and AVB (x-axis: Value, y-axis: Density).]

The pairwise Gini comparison is presented in Table.(1). The AVB method shows state-of-the-art performance, comparing favourably even with the boosting tree method, which is considered a strong performer. We compare the posterior distribution of P estimated by MCMC and AVB in Fig.(2). AVB shows a richer representation of the index parameter, with two modes at 1.20 and 1.31. As different values of P characterize different Tweedie distributions, flexible choices of P, compared with the single value that the traditional profile likelihood method offers, avoid one size fitting all policies; instead, they endow the regression model with customized descriptions for insurance policies that have different underlying risk profiles. Note that AVB uses a neural sampler that does not involve any rejection procedure; unlike MCMC, it holds the potential for large-scale prediction on high dimensional datasets. A full description of the dataset, a flow chart of the AVB algorithm, detailed settings of each method, and plots of the pairwise ordered Lorenz curves can be found in the Appendix.
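For completeness, a short NumPy sketch (ours, with hypothetical names and synthetic data) of the ordered Lorenz curve and Gini index defined above:

import numpy as np

def ordered_lorenz_gini(y_true, y_base, y_model):
    # sort by the relativity r_i = yhat_i / b_i, then accumulate baseline predictions and losses
    r = y_model / y_base
    order = np.argsort(r)
    b, y = y_base[order], y_true[order]
    F_b = np.concatenate(([0.0], np.cumsum(b) / b.sum()))   # empirical premium distribution F_b(r)
    F_y = np.concatenate(([0.0], np.cumsum(y) / y.sum()))   # empirical loss distribution F_y(r)
    # Gini index: twice the area between the 45-degree line and the ordered Lorenz curve
    area = np.sum((F_b[1:] - F_b[:-1]) * (F_y[1:] + F_y[:-1]) / 2.0)
    return (F_b, F_y), 2.0 * (0.5 - area)

rng = np.random.default_rng(0)
y_true = rng.gamma(2.0, 1.0, 1000) * rng.binomial(1, 0.4, 1000)   # sparse, right-skewed synthetic losses
y_base = np.full(1000, y_true.mean())                             # flat baseline premium
y_model = y_true + rng.normal(0.0, 0.5, 1000)                     # a model correlated with the truth
print(ordered_lorenz_gini(y_true, y_base, y_model)[1])

With a flat baseline, a model that ranks policies well pushes the curve below the 45° line and yields a positive Gini index, matching the interpretation used in Table.(1).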
5. Conclusions

We employed Variational Inference to solve Bayesian Tweedie regression. Our method showed strong performance, with future potential for large-scale applications in the insurance domain. In the future, we plan to incorporate random effects, and to release an R package for Variational approaches to Tweedie mixed-effect regression.

References

AUEB, Michalis Titsias RC and Lazaro-Gredilla, Miguel. Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pp. 2638–2646, 2015.

Breslow, Norman E and Clayton, David G. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421):9–25, 1993.

Chen, Tianqi and Guestrin, Carlos. XGBoost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754, 2016.

Cox, DR and Reid, N. A note on the calculation of adjusted profile likelihood. Journal of the Royal Statistical Society. Series B (Methodological), pp. 467–471, 1993.

Davidian, Marie and Carroll, Raymond J. Variance function estimation. Journal of the American Statistical Association, 82(400):1079–1091, 1987.

Dunn, Peter K and Smyth, Gordon K. Series evaluation of Tweedie exponential dispersion model densities. Statistics and Computing, 15(4):267–280, 2005.

Dunn, Peter K and Smyth, Gordon K. Evaluation of Tweedie exponential dispersion model densities by Fourier inversion. Statistics and Computing, 18(1):73–86, 2008.

Firth, David and Turner, Heather L. Bradley–Terry models in R: The BradleyTerry2 package. Journal of Statistical Software, 48(9), 2012.

Frees, Edward W, Meyers, Glenn, and Cummings, A David. Summarizing insurance scores using a Gini index. Journal of the American Statistical Association, 106(495):1085–1098, 2011.

Frees, Edward W Jed, Meyers, Glenn, and Cummings, A David. Insurance ratemaking and a Gini index. Journal of Risk and Insurance, 81(2):335–366, 2014.

Giner, Göknur and Smyth, Gordon K. statmod: Probability calculations for the inverse Gaussian distribution. arXiv preprint arXiv:1603.06687, 2016.

Hinton, Geoffrey E, Dayan, Peter, Frey, Brendan J, and Neal, Radford M. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.

Huszár, Ferenc. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Jorgensen, Bent. Exponential dispersion models. Journal of the Royal Statistical Society. Series B (Methodological), pp. 127–162, 1987.

Jørgensen, Bent. Exponential dispersion models. Number 48. Instituto de Matemática Pura e Aplicada, 1991.

Jorgensen, Bent. The theory of dispersion models. CRC Press, 1997.

Karaletsos, Theofanis. Adversarial message passing for graphical models. arXiv preprint arXiv:1612.05048, 2016.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Langley, P. Crafting papers on machine learning. In Langley, Pat (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.

Liu, Qing and Pierce, Donald A. A note on Gauss–Hermite quadrature. Biometrika, 81(3):624–629, 1994.

McCullagh, Peter. Generalized linear models. European Journal of Operational Research, 16(3):285–292, 1984.

Mescheder, Lars, Nowozin, Sebastian, and Geiger, Andreas. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.

Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Nelder, John A and Pregibon, Daryl. An extended quasi-likelihood function. Biometrika, 74(2):221–232, 1987.

Pinheiro, José C and Chao, Edward C. Efficient Laplacian and adaptive Gaussian quadrature algorithms for multilevel generalized linear mixed models. Journal of Computational and Graphical Statistics, 2012.

Qian, Wei, Yang, Yi, and Zou, Hui. Tweedie's compound Poisson model with grouped elastic net. Journal of Computational and Graphical Statistics, 25(2):606–625, 2016.

Ranganath, Rajesh, Tran, Dustin, Altosaar, Jaan, and Blei, David. Operator variational inference. In Advances in Neural Information Processing Systems, pp. 496–504, 2016a.

Ranganath, Rajesh, Tran, Dustin, and Blei, David. Hierarchical variational models. In International Conference on Machine Learning, pp. 324–333, 2016b.

Simsekli, Umut, Cemgil, Ali Taylan, and Yilmaz, Yusuf Kenan. Learning the beta-divergence in Tweedie compound Poisson matrix factorization models. In ICML (3), pp. 1409–1417, 2013.

Smyth, Gordon K and Jørgensen, Bent. Fitting Tweedie's compound Poisson model to insurance claims data: dispersion modelling. ASTIN Bulletin, 32(01):143–157, 2002.

Tierney, Luke and Kadane, Joseph B. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.

Wang, Dilin and Liu, Qiang. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.

Wedderburn, Robert WM. Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika, 61(3):439–447, 1974.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Yang, Yi, Qian, Wei, and Zou, Hui. A boosted Tweedie compound Poisson model for insurance premium. arXiv preprint arXiv:1508.06378, 2015.

Yang, Yi, Qian, Wei, Zou, Hui, and Yang, Maintainer Yi. Package TDBoost. 2016.

Yang, Yi, Qian, Wei, and Zou, Hui. Insurance premium prediction via gradient tree-boosted Tweedie compound Poisson models. Journal of Business & Economic Statistics, pp. 1–15, 2017.

Yip, Karen CH and Yau, Kelvin KW. On modeling claim frequency data in general insurance with extra zeros. Insurance: Mathematics and Economics, 36(2):153–163, 2005.

Zhang, Tong, Yu, Bin, et al. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538–1579, 2005.

Zhang, Yanwei. Likelihood-based and Bayesian methods for Tweedie compound Poisson linear mixed models. Statistics and Computing, 23(6):743–757, 2013.

Appendix

Tweedie Exponential Dispersion Model

The Tweedie model is a special member of the exponential dispersion model (EDM) family. The density function of an EDM is defined by a two-parameter function:

f_X(x; θ, φ) = h(x, φ) exp( (θx − η(θ)) / φ )    (5)

where θ is the natural (canonical) parameter, φ is the dispersion parameter, η(θ) is the cumulant function ensuring normalization, and h(x, φ) is the base measure, independent of the parameter θ. As a member of the exponential family, an EDM has the property that the mean µ and the variance Var(x) can be computed from the first and second order derivatives of η(θ) w.r.t. θ, respectively. Since there exists a one-to-one mapping between θ and µ, η''(θ) can be denoted as a function of the mean µ, η''(θ) = v(µ), which is known as the variance function. One important property is that each variance function characterizes a unique EDM.

The Tweedie model specifies a power-law relationship between the variance and the mean: Var(x) = E(x)^P. There exist stable distributions for all positive P ∉ (0, 1). Corresponding to the general form of the EDM in Eqn.(5), Tweedie's natural parameter θ and cumulant function η(θ) are:

θ = { log µ,            if P = 1          η(θ) = { log µ,            if P = 2
      µ^{1−P} / (1−P),  if P ≠ 1 },                µ^{2−P} / (2−P),  if P ≠ 2 }.    (6)

And the normalization quantity h(x, φ) can be obtained as:

h(x, φ) = { 1,                                if x = 0
            (1/x) Σ_{n=1}^{∞} h_n(x, φ, P),   if x > 0 },    (7)

h_n(x, φ, P) = x^{nα} / [ (P − 1)^{nα} φ^{n(1+α)} (2 − P)^n n! Γ(nα) ],

where Σ_{n=1}^{∞} h_n(x, φ, P) is also known as Wright's generalized Bessel function. Note that h(x, φ) in the Tweedie model is also a function of P.

The Tweedie EDM {µ, φ, P} with P ∈ (1, 2) equivalently describes the Compound Poisson–Gamma distribution (CPGD) that is parameterized by {λ, α, β}. The connection between the parameters {λ, α, β} of the CPGD and the parameters {µ, φ, P} of the Tweedie model is:

λ = µ^{2−P} / (φ(2−P)),   α = (2−P)/(P−1),   β = φ(P−1)µ^{P−1},
µ = λαβ,   P = (α+2)/(α+1),   φ = λ^{1−P} (αβ)^{2−P} / (2−P).    (8)

Equivalent to Eqn.(1), the joint distribution P(X, N | µ, φ, P) has a closed-form expression. By substituting Eqn.(8) into Eqn.(1), we get the joint density function represented by {µ, φ, P}:

P(X, N | µ, φ, P) = exp( −µ^{2−P} / (φ(2−P)) ) · 1_{n=0}
  + exp{ n [ −log(φ)/(P−1) + ((2−P)/(P−1)) log( x/(P−1) ) − log(2−P) ] − log Γ(n+1)
         − (1/φ) [ x µ^{1−P}/(P−1) + µ^{2−P}/(2−P) ] − log Γ( ((2−P)/(P−1)) n ) − log(x) } · 1_{n>0}.    (9)

Auto Insurance Dataset

Each record includes a driver's aggregate claim amount (y_i) for the policy he bought, as well as 56 features (x_i). The features describe general information about the driver and his insured vehicle, for example, distance to work, marital status, whether the driving license has been revoked in the past 7 years, etc. The definition of each column can be found on page 24 of Yang et al. (2017). We are interested in predicting the distribution of the aggregate claim loss for each insurance policy holder, i.e., the driver, based on the provided features. Unsurprisingly, the distribution of the total claim amount is sparse and highly right-skewed, with 61% of the drivers having no claims at all, and 9% of the policy holders making over 65% of the total claims. The data are pre-processed following the same rules as Zhang (2013).

Algorithm Flow Chart

[Figure 3. Flow chart of AVB for Bayesian Tweedie Regression. Noise ε ∼ N(0, I) is mapped by reparameterization to the prior weights P_ψ(z) and by the generator Q_θ(ε) to the posterior weights Q_θ(z); the discriminator T_φ(·) produces T_φ*(z), and the Tweedie likelihood P(D|z) of Eqn.(4) is evaluated on (x_i, y_i) together with n_i; the discriminative loss is Eqn.(3) and the variational loss is Eqn.(2); legend: neural network, loss function, trainable parameters.] Pseudo-code can be found in Algorithm.(1).
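As a quick sanity check of Eqn.(8), a small NumPy sketch (ours; the function names are hypothetical) converting between the two parameterizations and verifying the round trip:

import numpy as np

def cpg_to_tweedie(lam, alpha, beta):
    # {lambda, alpha, beta} -> {mu, phi, P}, following Eqn.(8)
    mu = lam * alpha * beta
    P = (alpha + 2.0) / (alpha + 1.0)
    phi = lam ** (1.0 - P) * (alpha * beta) ** (2.0 - P) / (2.0 - P)
    return mu, phi, P

def tweedie_to_cpg(mu, phi, P):
    # {mu, phi, P} -> {lambda, alpha, beta}, the inverse map of Eqn.(8)
    lam = mu ** (2.0 - P) / (phi * (2.0 - P))
    alpha = (2.0 - P) / (P - 1.0)
    beta = phi * (P - 1.0) * mu ** (P - 1.0)
    return lam, alpha, beta

# round trip with the Fig.(1a) parameters; prints True
mu, phi, P = cpg_to_tweedie(2.6, 4.0, 0.24)
print(np.allclose(tweedie_to_cpg(mu, phi, P), (2.6, 4.0, 0.24)))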

Implementation Details

• GLM. We fit the Tweedie GLM model with the tweedie function in the statmod package (Giner & Smyth, 2016). The index parameter P is set to 1.34, which is the mean of the posterior distribution estimated by MCMC in Fig.(2).

• PQL. The penalized quasi-likelihood method proposed by Breslow & Clayton (1993) is implemented using the glmmPQL function in the R package BradleyTerry2 (Firth & Turner, 2012).

• Laplace. The Laplace approximation (Tierney & Kadane, 1986) likelihood-based method is carried out with the R package cplm (Zhang, 2013). The index parameter P is set to 1.32 by the self-contained profile likelihood method.

• AGQ. Adaptive Gauss–Hermite Quadrature (Liu & Pierce, 1994). The number of knots is set to 15 in the AGQ implementation using the cpglmm function in the R package cplm (Zhang, 2013). The index parameter P is set to 1.32 by the self-contained profile likelihood method.

• MCMC. The bcplm function in the cplm package (Zhang, 2013) is used for the Markov Chain Monte Carlo implementation. To be consistent with the studies in Zhang (2013), we use the same prior settings. They are specified as: β_j ∼ N(0, 100^2), j = 0, ..., 56, φ ∼ U(0, 100), p ∼ U(1, 2), and θ^2 ∼ Ga(0.001, 1000). All the prior distributions are non-informative. The random-walk Metropolis–Hastings algorithm is used for parameter simulation, except for the variance component, due to the conjugacy of its prior and posterior distribution. The (truncated) Normal proposal distributions are tuned according to sample variances, by running 5000 iterations before the MCMC, so that the acceptance rate is about 50% for the Metropolis–Hastings updates. The MCMC procedure contains 10000 iterations of three parallel chains, with the burn-in period set to the first 1000 iterations.

• TDBoost. The TDBoost model proposed by Yang et al. (2015) is implemented with the R package TDBoost (Yang et al., 2016). The number of trees is chosen by cross validation; other hyper-parameters are set to their defaults. The tree ensemble is trained with early stopping, selecting the best iteration via a validation set. The index parameter P is set to 1.39 by the self-contained profile likelihood method.

• AVB. The feature dimension is 56. For each observation, the model estimates {α, β, λ}. Q_θ(z) takes a 10-dimensional source of noise, initialized by Norm(0, I), has 2 hidden layers with 10 neurons each, and outputs a vector of size 171 (56 × 3 + 3). For the prior distribution P_ψ(z) network, the input layer is of size 171, initialized with Norm(0, I), and the output is a size-171 vector computed element-wise as noise ⊙ weights + biases. The discriminator input size is 171, with 3 hidden layers of size 20, and an output of size 1. The learning rate is 0.003, exponentially decaying with coefficient 0.99 (see the sketch below).
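Read literally, the AVB bullet above corresponds to the following PyTorch configuration sketch; the zero-initialized biases and the Adam optimizer are our assumptions, as the paper does not specify them.

import torch
import torch.nn as nn

Z_DIM = 56 * 3 + 3   # 171: one (lambda, alpha, beta) linear map over 56 features, plus 3 biases

# posterior sampler Q_theta: 10-d noise -> 2 hidden layers of 10 units -> 171-d weight sample
generator = nn.Sequential(nn.Linear(10, 10), nn.ReLU(),
                          nn.Linear(10, 10), nn.ReLU(),
                          nn.Linear(10, Z_DIM))

# hyper prior P_psi: element-wise reparameterization z = weights * eps + biases of 171-d noise
prior_weights = nn.Parameter(torch.randn(Z_DIM))   # initialized from Norm(0, I), as in the text
prior_biases = nn.Parameter(torch.zeros(Z_DIM))    # zero initialization is our assumption
def prior_sample(eps):                              # eps ~ N(0, I), shape (batch, 171)
    return prior_weights * eps + prior_biases

# discriminator T_phi: 171 -> three hidden layers of 20 -> scalar logit
discriminator = nn.Sequential(nn.Linear(Z_DIM, 20), nn.ReLU(),
                              nn.Linear(20, 20), nn.ReLU(),
                              nn.Linear(20, 20), nn.ReLU(),
                              nn.Linear(20, 1))

opt = torch.optim.Adam(list(generator.parameters()) + list(discriminator.parameters())
                       + [prior_weights, prior_biases], lr=3e-3)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)   # lr 0.003, exponential decay 0.99

z_post = generator(torch.randn(8, 10))       # 8 posterior weight samples
z_prior = prior_sample(torch.randn(8, Z_DIM))  # 8 prior weight samples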

[Figure 4. Pairwise ordered Lorenz curves for GLM, PQL, Laplace, AGQ, MCMC, TDBoost, and AVB (x-axis: Premium, y-axis: Loss, both as percentages 0–100). The corresponding Gini indices are given in Table.(1).]