Parameter Estimation for Tweedie Distribution with MCMC Approach Topic Presentation
Total Page:16
File Type:pdf, Size:1020Kb
Parameter Estimation for Tweedie Distribution with MCMC Approach Topic Presentation Tairan Ye 4/12/2018 University of Connecticut Summary 1. Markov Chain Monte Carlo 2. Tweedie Distribution 3. Parameter Estimation for Tweedie Model with MCMC Approach 1 Markov Chain Monte Carlo Introduction • Fact: Your desired parameters are not in the analytical form or extremely hard to solve out. • Solution: The sampling algorithms, like Markov Chain Monte Carlo • Reason: By constructing a Markov chain, we can sample the desired distribution by observing the chain after a number of iterations. The sample distribution will match more closely to the actual desired distribution as the iterations increase. 2 Example: Monte Carlo Integration General problem: Evaluating E[h(X )] = R h(x)p(x)dx can be difficult. However, if we can draw samples: X (1), X (2), ..., X (N) ∼ p(x) (1) Then, we can estimate: 1 N E[h(X )] ≈ ∑ h(X (t)) (2) N t=1 This is Monte Carlo (MC) Integration. 3 Markov Chain A Markov chain is generated by sampling: X (t+1) ∼ p(xjx(t)), t = 1, 2, ... (3) Here p is the transition kernel. So X (t+1) depends only on X (t), rather than X (0), X (1), X (2), ..., X (t−1) We can summarize as following: p(X (t+1)jX (t), X (t−1), ..., X (0)) = p(X (t+1)jX (t)) (4) 4 An example of Markov Chain A first order auto-regressive process with lag-1 auto-correlation 0.5. X (t+1)jX (t) ∼ N(0.5x(t), 1) (5) A simulation is conducted as following, which contains two different starting points: Figure 1: Simulation with two starting points The chains seemed to have forgotten their starting positions. 5 Sampling Algorithm Overview The most fundamental MCMC sampling algorithm are: • Gibbs Sampling 1. Require for the analytic form of the full conditional distribution. 2. High-efficiency { the sequence will converge with little iterations. • Metropolis-Hasting Algorithm 1. More flexible { no need for the deduction of full conditional distribution. 2. Some challenges for setting up the proposal distribution { experience needed to control the accept rate between 20% to 30%. 6 Full Conditional Distribution and Gibbs Sampling • Full conditional distributions: 2 2 The distributions p(qjs , y1, ..., yn) and p(s jq, y1, ..., yn) are called the full conditional distributions of s2 and q respectively, as they are each a conditional distribution of a parameter given everything else. • Gibbs Sampler: Given a current state of the parameters f(s) = fq(s), s2(s)g, we generate a new state as follows: (s+1) 2(s) 1. sample q ∼ p(qjs , y1, ..., yn); 2(s+1) 2 (s+1) 2. sample s ∼ p(s jq , y1, ..., yn); 3. update f(s+1) = fq(s+1), s2(s+1)g 7 Generalized Gibbs Sampler Suppose you have a vector of parameters f = ff1, ..., fpg, and your information about f is measured with p(f) = p(f1, ..., fp). For example, in the normal model f = fq, s2g, and the probability measure of interest 2 (0) (0) (0) is p(q, s jy1, ..., yn). Given a starting point f = ff1 , ..., fp g, the Gibbs sampler generates f(s) from f(s−1) as follows: (s) (s−1) (s−1) (s−1) • sample f1 ∼ p(f1jf2 , f3 , ..., fp ) (s) (s) (s−1) (s−1) • sample f2 ∼ p(f2jf1 , f3 , ..., fp ) ...... (s) (s) (s) (s) • sample fp ∼ p(fpjf1 , f2 , ..., fp−1) Markov Property: In this sequence, f(s) depends on f(0), ..., f(s−1) only through f(s−1), i.e. f(s) is conditionally independent of f(0), ..., f(s−2) given f(s−1). 8 Metropolis Algorithm The Metropolis algorithm proceeds by sampling a proposal value q∗ nearby the current value q(s) using a symmetric proposal distribution J(q∗jq(s)). Usually J(q∗jq(s)) is very simple, with samples from J(q∗jq(s)) being near q(s) with high probability. Examples include: • J(q∗jq(s)) ∼ U(q(s) − d, q(s) + d) • J(q∗jq(s)) ∼ N(q(s), d2) 9 Metropolis Algorithm The algorithm is as following: 1. Sample q∗ ∼ J(q∗jq(s)) 2. Compute the acceptance ratio: p(q∗jy) p(yjq∗)p(q∗) r = = (6) p(q(s)jy) p(yjq(s))p(q(s)) 3. Update ( ∗ q , p = min(r, 1) q(s+1) = (7) q(s), p = 1 − min(r, 1) • Step 3 can be accomplished by sampling u ∼ U(0, 1) and setting q(s+1) = q∗ if u < r and setting q(s+1) = q(s) otherwise. • In many cases, computing the ratio r directly can be numerically unstable, a problem that often can be remedied by computing the logarithm of r: 10 Metropolis Hastings Algorithm Metropolis Hastings Algorithm is the generalized and multi-parameters version of Gibbs sampling and Metropolis algorithm. Let's consider a simple example where our target probability distribution is p0(u, v), a bivariate distribution for two random variables U and V . 1. Update U: ∗ ∗ (s) (s) • Sample u ∼ Ju(u ju , v ) • Compute the acceptance ratio: p (u∗, v (s)) J (u(s)ju∗, v (s)) r = 0 ∗ u (8) (s) (s) ∗ (s) (s) p0(u , v ) Ju(u ju , v • Update ( u∗, p = min(r, 1) u(s+1) = (9) u(s), p = 1 − min(r, 1) • Sampling u ∼ U(0, 1) and setting u(s+1) = u∗ if u < r and setting u(s+1) = u(s) otherwise. 11 Metropolis Hastings Algorithm After updating U, we continue to update V: 1. Update V : ∗ ∗ (s+1) (s) • Sample v ∼ Jv (v ju , v ) • Compute the acceptance ratio: p (u(s+1), v ∗) J (v (s)ju(s+1), v ∗) r = 0 ∗ v (10) (s+1) (s) ∗ (s+1) (s) p0(u , v ) Jv (v ju , v ) • Update ( v ∗, p = min(r, 1) v (s+1) = (11) v (s), p = 1 − min(r, 1) • Sampling u ∼ U(0, 1) and setting v (s+1) = v ∗ if u < r and setting v (s+1) = v (s) otherwise. 12 Proposal Distribution The proposal distribution is the most "challenging" part for conducting the Metropolis algorithm. • Without restriction: ∗ 2 Normal distribution: q ∼ N(qt−1, sp ) • With restriction: ∗ Random walk: f (q ) = f (qt−1) + e and e ∼ N(0, 1) Example: q > 0, then apply log function. ∗ 2 q can be generated from lognormal(log(qt−1), sp ) 13 Combination In complex models, it is often the case that conditional distributions are available for some parameters but not for others. In these situations we can combine Gibbs and Metropolis-type proposal distributions to generate a Markov chain to approximate the joint distribution of all of the parameters. 14 More to talk about 1. Stationary: As t ! ¥, the Markov chain converges to its stationary distribution. Technique: trace plot, effective sample size (ESS) Package: 'coda' in R 2. Burn-in: Typically, we burn-in the initial simulation values since they are most likely nonstationary, and study the remaining part. 3. Gap: In order to avoid the dependency, we need to pick up the simulation values by gaps. 4. Chains: Also, we will build more than 1 chain to make sure the sampler is representative for the distribution. By the way, the calculation of deviance information criterion (DIC) requires at least 2 chains. 15 Example Figure 2: The trace plot of the raw MCMC 16 Example Figure 3: The trace plot of the updated sequence 17 Software • WinBUGS (Bayesian inference Using Gibbs Sampling) • JAGS (Just Another Gibbs Sampler) • Stan { with linkage of most of data science softwares via multiply platforms • R packages: rjags, coda, mcmc, BRugs, etc. 18 Tweedie Distribution Tweedie Distribution Overview Tweedie distribution can be defined as Tw(m, f, p). • m is the expected mean for the Tweedie distribution. • f is called the dispersion parameter. • p, or power, controls what distribution it belongs to. Property: • A special case of exponential dispersion models. • Variance = fmp. • Most interested in p 2 (1, 2), which is defined as compound Poisson-gamma distribution. 19 Compound Poisson-gamma distribution Due to its special property { mass at zero and continue otherwise, the Tweedie model with p 2 (1, 2), or the compound Poisson-gamma distribution, is popular in insurance industry. Denote the total auto insurance claim in dollar is Y , the frequency of claim is N, and the severity of each claim is Z. Then, we have: N Y = ∑ Zn (12) n=1 N ∼ Poisson(l) (13) Z ∼ gamma(a, b) (14) N = Z (15) j 20 Compound Poisson-gamma distribution We say Y ∼ Tw(m, f, p). Also: lab = m (16) la(a + 1)b2 = fmp (17) (18) We can solve: m2−p l = (19) f(2 − p) 2 − p a = (20) p − 1 b = f(p − 1)mp−1 (21) 21 Compound Poisson-gamma distribution The probability function for the Tweedie distribution with a power p 2 (1, 2) is: 1. For y = 0: P[Y = 0] = exp f−lg m2−p (22) = exp − f(2 − p) 2. For y > 0: ¥ n − y 1 l f (y) = e b e−l y na−1 Y ∑ na n=1 G(na)b n! j n 1 y 2−p o ( (2−p) ) ¥ ( ) p−1 y m 1 2−p f(p−1) = exp − − f(p − 1)m(p−1) f(2 − p) y ∑ j (2−p)j j=1 f G( p−1 )j! (23) 22 Parameter Estimation for Tweedie Model with MCMC Approach Model In order to study the key factors that affects the auto insurance claims, we can set up: log(ms ) = Xsg + erandom (24) 2 erandom ∼ N(0, s ) (25) where s indicates for different location; Xs contains the information about policyholders, including personal information as well as vehicle information; g is the factors, which we are interested in; e covers all other information we don't discuss here.