Advanced Topics in Computing Lectures

1 Generalized Linear Models

1.1 Exponential Dispersion Models

We write
\[ f(y \mid \theta, \phi) = \exp\left( \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right) \]
(alternatively, write $a(\phi)$ instead of $\phi$ in the denominator): this is an exponential dispersion model; for fixed $\phi$, this is an exponential family in $\theta$.

Write $m$ for the reference measure for the densities. Clearly,
\[ \int f(y \mid \theta, \phi) \, dm(y) = 1. \]
Differentiating with respect to $\theta$, assuming that interchanging integration and differentiation is justified, gives
\[
\begin{aligned}
0 &= \frac{\partial}{\partial\theta} \int f(y \mid \theta, \phi) \, dm(y) \\
  &= \int \frac{\partial f(y \mid \theta, \phi)}{\partial\theta} \, dm(y) \\
  &= \int \frac{y - b'(\theta)}{\phi} f(y \mid \theta, \phi) \, dm(y) \\
  &= \frac{E_{\theta,\phi}(y) - b'(\theta)}{\phi}
\end{aligned}
\]
so that
\[ E_{\theta,\phi}(y) = b'(\theta), \]
which does not depend on $\phi$. We can thus write
\[ E(y) = \mu = b'(\theta). \]
Differentiating once more, we get
\[
\begin{aligned}
0 &= \int \left( -\frac{b''(\theta)}{\phi} + \left( \frac{y - b'(\theta)}{\phi} \right)^2 \right) f(y \mid \theta, \phi) \, dm(y) \\
  &= -\frac{b''(\theta)}{\phi} + \frac{V_{\theta,\phi}(y)}{\phi^2}
\end{aligned}
\]
so that
\[ V_{\theta,\phi}(y) = \phi \, b''(\theta). \]
If $\mu = b'(\theta)$ defines a one-to-one relation between $\mu$ and $\theta$, we can write $b''(\theta) = V(\mu)$, where $V$ is the variance function of the family. Formally,
\[ V(\mu) = b''((b')^{-1}(\mu)). \]
With this,
\[ \mathrm{var}(y) = \phi \, b''(\theta) = \phi V(\mu). \]

1.2 Generalized Linear Models

For a GLM, we then have for $i = 1, \ldots, n$ responses $y_i$ and covariates (explanatory variables) $x_i$ so that for $E(y_i) = \mu_i = b'(\theta_i)$ we have
\[ g(\mu_i) = \beta' x_i = \eta_i, \]
where $g$ is the link function and $\eta_i$ is the linear predictor. Alternatively,
\[ \mu_i = h(\beta' x_i) = h(\eta_i), \]
where $h$ is the response function (and $g$ and $h$ are inverses of each other if invertible).
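The moment identities $E(y) = b'(\theta)$ and $\mathrm{var}(y) = \phi \, b''(\theta)$ from Section 1.1 can be checked numerically. The following is a minimal sketch, assuming the Poisson family written as an exponential dispersion model with $b(\theta) = e^\theta$, $\phi = 1$, $c(y, \phi) = -\log y!$ and counting measure as reference measure; the function name `poisson_edm_moments` and the truncation point `y_max` are illustrative choices, not part of the notes.

```python
import math

def poisson_edm_moments(theta, y_max=200):
    """Total mass, mean and variance of the Poisson EDM, by direct summation.

    Density w.r.t. counting measure: exp(y*theta - b(theta) + c(y)),
    with b(theta) = exp(theta), phi = 1 and c(y) = -log(y!).
    The sum is truncated at y_max (an assumption; the tail is negligible here).
    """
    b = math.exp(theta)
    probs = [math.exp(y * theta - b - math.lgamma(y + 1)) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var

theta = math.log(3.0)
phi = 1.0
b_prime = math.exp(theta)          # b'(theta)
b_double_prime = math.exp(theta)   # b''(theta), so V(mu) = mu for this family

total, mean, var = poisson_edm_moments(theta)
print(total, mean, b_prime, var, phi * b_double_prime)
```

Here the density sums to one, and both the mean and the variance agree with $b'(\theta) = b''(\theta) = e^\theta$ up to floating-point error.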
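Precisely, the GLM assumption is that for fixed $\phi_i$, conditional on $x_i$ the distribution of $y_i$ has density $f(y_i \mid \theta_i, \phi_i)$ where $g(b'(\theta_i)) = \beta' x_i$:
\[ y_i \mid x_i \sim f(y_i \mid \theta_i, \phi_i), \qquad g(b'(\theta_i)) = \beta' x_i. \]

1.3 Maximum Likelihood Estimation

The log-likelihood thus equals
\[ \ell = \sum_{i=1}^n \left( \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right). \]
To obtain the score function (partials of the log-likelihood with respect to the $\beta_j$), we first differentiate $g(b'(\theta_i)) = \beta' x_i$ with respect to $\beta_j$ to get
\[ x_{ij} = \frac{\partial \beta' x_i}{\partial \beta_j} = \frac{\partial g(b'(\theta_i))}{\partial \beta_j} = g'(b'(\theta_i)) \, b''(\theta_i) \frac{\partial \theta_i}{\partial \beta_j} = g'(\mu_i) V(\mu_i) \frac{\partial \theta_i}{\partial \beta_j}, \]
from which
\[ \frac{\partial \theta_i}{\partial \beta_j} = \frac{x_{ij}}{g'(\mu_i) V(\mu_i)} \]
and
\[ \frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^n \frac{y_i - b'(\theta_i)}{\phi_i} \frac{x_{ij}}{g'(\mu_i) V(\mu_i)} = \sum_{i=1}^n \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \frac{x_{ij}}{g'(\mu_i)}. \]

Maximum likelihood estimation is typically performed by solving the likelihood equations (i.e., setting the scores to zero). For Newton's method one needs to compute the Hessian $H_\ell(\beta) = [\partial^2 \ell / \partial \beta_j \partial \beta_k]$. As $\mu_i = b'(\theta_i)$,
\[ \frac{\partial \mu_i}{\partial \beta_j} = b''(\theta_i) \frac{\partial \theta_i}{\partial \beta_j} = V(\mu_i) \frac{x_{ij}}{g'(\mu_i) V(\mu_i)} = \frac{x_{ij}}{g'(\mu_i)} \]
and hence
\[
\begin{aligned}
\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k}
  &= \sum_{i=1}^n \frac{\partial}{\partial \beta_k} \left( \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \frac{x_{ij}}{g'(\mu_i)} \right) \\
  &= \sum_{i=1}^n \frac{x_{ij}}{\phi_i} \left( -\frac{\partial \mu_i}{\partial \beta_k} \frac{1}{V(\mu_i) g'(\mu_i)} - \frac{y_i - \mu_i}{(V(\mu_i) g'(\mu_i))^2} \frac{\partial (V(\mu_i) g'(\mu_i))}{\partial \beta_k} \right) \\
  &= -\sum_{i=1}^n \frac{x_{ij} x_{ik}}{\phi_i V(\mu_i) g'(\mu_i)^2}
     - \sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij} x_{ik}}{\phi_i V(\mu_i)^2 g'(\mu_i)^3} \left( V'(\mu_i) g'(\mu_i) + V(\mu_i) g''(\mu_i) \right).
\end{aligned}
\]
Noting that the second term looks complicated and has expectation zero, only the first term is used for the "Newton-type" iteration, the Fisher scoring algorithm. Equivalently, instead of using the observed information matrix (the negative Hessian of the log-likelihood), one uses its expectation (the Fisher information matrix).

Assume that $\phi_i = \phi / a_i$ with known case weights $a_i$.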
Then the Fisher information matrix is
\[ \left[ \frac{1}{\phi} \sum_{i=1}^n \frac{a_i}{V(\mu_i) g'(\mu_i)^2} x_{ij} x_{ik} \right]_{j,k} = \frac{X' W(\beta) X}{\phi}, \]
where $X$ is the usual regressor matrix (with $x_i'$ as row $i$) and
\[ W(\beta) = \mathrm{diag}\left( \frac{a_i}{V(\mu_i) g'(\mu_i)^2} \right), \qquad g(\mu_i) = x_i' \beta. \]
Similarly, the score function is
\[ \left[ \frac{1}{\phi} \sum_{i=1}^n \frac{a_i}{V(\mu_i) g'(\mu_i)^2} \, g'(\mu_i)(y_i - \mu_i) x_{ij} \right]_{j} = \frac{X' W(\beta) r(\beta)}{\phi}, \]
where $r(\beta)$ is the vector with elements $g'(\mu_i)(y_i - \mu_i)$, the so-called working residuals. (Note that from $g(\mu_i) = \eta_i$, $g'(\mu_i)$ is often written as $\partial \eta_i / \partial \mu_i$.) Thus, the Fisher scoring update is
\[
\begin{aligned}
\beta_{\text{new}} &\leftarrow \beta + (X' W(\beta) X)^{-1} X' W(\beta) r(\beta) \\
  &= (X' W(\beta) X)^{-1} X' W(\beta) (X\beta + r(\beta)) \\
  &= (X' W(\beta) X)^{-1} X' W(\beta) z(\beta)
\end{aligned}
\]
where $z(\beta)$, the working response, has elements
\[ \beta' x_i + g'(\mu_i)(y_i - \mu_i), \qquad g(\mu_i) = x_i' \beta. \]
Thus, the update is computed by weighted least squares regression of $z(\beta)$ on $X$ (using the square roots of $W(\beta)$ as weights), and the Fisher scoring algorithm for determining the MLEs is an iterative weighted least squares (IWLS) algorithm.

Everything is considerably simplified when using the canonical link $g = (b')^{-1}$, for which
\[ \eta_i = g(\mu_i) = g(b'(\theta_i)) = \theta_i. \]
As
\[ \frac{d}{d\mu} (b')^{-1}(\mu) = \frac{1}{b''((b')^{-1}(\mu))} = \frac{1}{V(\mu)}, \]
for the canonical link we have $g'(\mu) V(\mu) \equiv 1$. Hence,
\[ \frac{\partial \ell}{\partial \beta_j} = \frac{1}{\phi} \sum_{i=1}^n a_i (y_i - \mu_i) x_{ij}, \qquad \frac{\partial^2 \ell}{\partial \beta_j \partial \beta_k} = -\frac{1}{\phi} \sum_{i=1}^n a_i V(\mu_i) x_{ij} x_{ik}, \]
and the observed and expected information matrices coincide, so that the IWLS Fisher scoring algorithm is the same as Newton's algorithm.

(Note that all of the above becomes more convenient if we consistently use the response function instead of the link function.)

1.4 Inference

Under suitable conditions, the MLE $\hat\beta$ is asymptotically $N(\beta, I(\beta)^{-1})$ with expected Fisher information matrix
\[ I(\beta) = \frac{1}{\phi} X' W(\beta) X. \]
Thus, standard errors can be computed as the square roots of the diagonal elements of
\[ \widehat{\mathrm{cov}}(\hat\beta) = \phi \, (X' W(\hat\beta) X)^{-1}, \]
where $X' W(\hat\beta) X$ is a by-product of the final IWLS iteration. This needs an estimate of $\phi$ (unless it is known). Estimation by maximum likelihood is practically difficult; hence, $\phi$ is usually estimated by the method of moments.
Remembering that $\mathrm{var}(y_i) = \phi_i V(\mu_i) = \phi V(\mu_i) / a_i$, we see that if $\beta$ were known, an unbiased estimate of $\phi$ would be
\[ \frac{1}{n} \sum_{i=1}^n \frac{a_i (y_i - \mu_i)^2}{V(\mu_i)}. \]
Taking into account that $\beta$ is estimated, the estimate employed is
\[ \hat\phi = \frac{1}{n - p} \sum_{i=1}^n \frac{a_i (y_i - \hat\mu_i)^2}{V(\hat\mu_i)} \]
(where $p$ is the number of $\beta$ parameters).

1.5 Deviance

The deviance is a quality-of-fit statistic for models fitted by maximum likelihood, generalizing the idea of using the sum of squares of residuals in ordinary least squares, and (here) defined as
\[ D = 2\phi \, (\ell_{\text{sat}} - \ell_{\text{mod}}) \]
(assuming a common $\phi$, perhaps after taking out weights), where the saturated model uses separate parameters for each observation so that the data is fitted exactly. Here, this can be achieved via $y_i = \mu_i^* = b'(\theta_i^*)$, which gives zero scores (provided that this also maximizes, of course). The contribution of observation $i$ to $\ell_{\text{sat}} - \ell_{\text{mod}}$ is then
\[ \frac{y_i \theta_i^* - b(\theta_i^*)}{\phi_i} - \frac{y_i \hat\theta_i - b(\hat\theta_i)}{\phi_i} = \left. \frac{y_i \theta - b(\theta)}{\phi_i} \right|_{\hat\theta_i}^{\theta_i^*}, \]
where $\hat\theta_i$ is obtained from the fitted model (i.e., $g(b'(\hat\theta_i)) = \hat\beta' x_i$). Now,
\[ \left. (y_i \theta - b(\theta)) \right|_{\hat\theta_i}^{\theta_i^*} = \int_{\hat\theta_i}^{\theta_i^*} \frac{d}{d\theta} (y_i \theta - b(\theta)) \, d\theta = \int_{\hat\theta_i}^{\theta_i^*} (y_i - b'(\theta)) \, d\theta. \]
Substituting $\mu = b'(\theta)$ gives $d\mu = b''(\theta) \, d\theta$ or equivalently, $d\theta = V(\mu)^{-1} \, d\mu$, so that
\[ \int_{\hat\theta_i}^{\theta_i^*} (y_i - b'(\theta)) \, d\theta = \int_{\hat\mu_i}^{y_i} \frac{y_i - \mu}{V(\mu)} \, d\mu \]
and the deviance contribution of observation $i$ is
\[ 2\phi_i \left. \frac{y_i \theta - b(\theta)}{\phi_i} \right|_{\hat\theta_i}^{\theta_i^*} = 2 \int_{\hat\mu_i}^{y_i} \frac{y_i - \mu}{V(\mu)} \, d\mu. \]
Note that this can be taken as the definition for the deviance as well as a basis for defining quasi-likelihood models based on a known variance function $V$.

1.6 Residuals

Several kinds of residuals can be defined for GLMs:

response: $y_i - \hat\mu_i$

working: from the working response in IWLS, i.e., $g'(\hat\mu_i)(y_i - \hat\mu_i)$

Pearson: $r_i^P = \dfrac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}$, so that $\sum_i (r_i^P)^2$ equals the generalized Pearson statistic.

deviance: $r_i^D$, defined so that $\sum_i (r_i^D)^2$ equals the deviance (see above).

(All definitions are equivalent for the Gaussian family.)
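The IWLS iteration of Section 1.3, the moment estimate of $\phi$ and the standard errors of Section 1.4, and the deviance and residuals of Sections 1.5 and 1.6 can be sketched together in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `iwls`, its arguments, the starting value $\mu_i = y_i + 0.1$, the stopping rule, and the small data set are all assumptions made for the example. The usage part takes the Poisson family with log link, so $V(\mu) = \mu$ and $g'(\mu) = 1/\mu$.

```python
import numpy as np

def iwls(X, y, link, linkinv, dlink, variance, a=None, max_iter=50, tol=1e-10):
    """Fisher scoring via iterative weighted least squares (sketch)."""
    n, p = X.shape
    a = np.ones(n) if a is None else np.asarray(a, dtype=float)
    mu = y + 0.1                 # assumed starting value, suitable for counts
    eta = link(mu)
    beta = np.zeros(p)
    for _ in range(max_iter):
        gp = dlink(mu)                      # g'(mu_i)
        w = a / (variance(mu) * gp ** 2)    # diagonal of W(beta)
        z = eta + gp * (y - mu)             # working response z(beta)
        XtW = X.T * w                       # X' W(beta)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta_new
        mu = linkinv(eta)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    w = a / (variance(mu) * dlink(mu) ** 2)
    XtWX = (X.T * w) @ X
    # Method-of-moments estimate of phi and estimated covariance of beta-hat.
    phi_hat = np.sum(a * (y - mu) ** 2 / variance(mu)) / (n - p)
    cov = phi_hat * np.linalg.inv(XtWX)
    return beta, mu, phi_hat, cov

# Made-up count data with an intercept and one covariate.
x = np.arange(10) / 9.0
X = np.column_stack([np.ones(10), x])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0, 4.0, 6.0, 5.0, 9.0])

beta, mu, phi_hat, cov = iwls(
    X, y, link=np.log, linkinv=np.exp,
    dlink=lambda m: 1.0 / m, variance=lambda m: m,
)
# Using phi_hat here corresponds to a quasi-Poisson fit; for a pure
# Poisson model phi = 1 is known and could be used instead.
std_err = np.sqrt(np.diag(cov))

# Residuals from Section 1.6.
response_resid = y - mu
working_resid = (1.0 / mu) * (y - mu)
pearson_resid = (y - mu) / np.sqrt(mu)

# Deviance contributions 2 * integral_{mu}^{y} (y - t) / V(t) dt with V(t) = t,
# i.e. 2 * (y log(y / mu) - (y - mu)), using the convention 0 log 0 = 0.
ylogy = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
d = 2.0 * (ylogy - (y - mu))
deviance = d.sum()
deviance_resid = np.sign(y - mu) * np.sqrt(np.maximum(d, 0.0))

print(beta, std_err, phi_hat, deviance)
```

Because the log link is canonical for the Poisson family, the score equations reduce to $X'(y - \hat\mu) = 0$ at convergence, which gives a simple check on the fit. The sum of squared Pearson residuals equals $(n - p)\hat\phi$ when all $a_i = 1$.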
1.7 Generalized Linear Mixed Models

We augment the linear predictor by (unknown) random effects $b_i$:
\[ \eta_i = x_i' \beta + z_i' b_i, \]
where the $b_i$ come from a suitable family of distributions and the $z_i$ (as well as the $x_i$, of course) are known covariates. Typically,
\[ b_i \sim N(0, G(\vartheta)). \]
Conditionally on $b_i$, $y_i$ is taken to follow an exponential dispersion model with
\[ g(E(y_i \mid b_i)) = \eta_i = x_i' \beta + z_i' b_i. \]
The marginal likelihood function of the observed $y_i$ is obtained by integrating the joint density of the $y_i$ and $b_i$ over the $b_i$:
\[ L(\beta, \phi, \vartheta) = \prod_{i=1}^n \int f(y_i \mid \beta, \phi, \vartheta, b_i) \, f(b_i \mid \vartheta) \, db_i \]
(assuming for simplicity that random effects are independent across observation units).
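The integrals in the marginal likelihood rarely have a closed form, so in practice they are approximated numerically. The following is a minimal sketch of one possible approach, Gauss-Hermite quadrature, under simplifying assumptions that are not part of the notes: a scalar random intercept ($z_i = 1$, $b_i \sim N(0, \sigma^2)$, so $G(\vartheta) = \sigma^2$), a conditional Poisson model with log link and $\phi = 1$, and one observation per unit. The function names `marginal_lik_units` and `marginal_loglik` and the example data are hypothetical.

```python
import numpy as np
from math import lgamma

def marginal_lik_units(beta, sigma, X, y, n_nodes=20):
    """Per-unit marginal likelihoods for a Poisson random-intercept model.

    Approximates int f(y_i | b) N(b; 0, sigma^2) db by Gauss-Hermite
    quadrature: with b = sqrt(2) * sigma * t, the integral becomes
    (1 / sqrt(pi)) * sum_k w_k f(y_i | b_k).
    """
    t, w = np.polynomial.hermite.hermgauss(n_nodes)   # nodes/weights for exp(-t^2)
    b = np.sqrt(2.0) * sigma * t                      # random-effect values, shape (K,)
    eta = (X @ beta)[:, None] + b[None, :]            # linear predictors, shape (n, K)
    log_fact = np.array([lgamma(v + 1.0) for v in y])
    # Conditional Poisson log density: y * eta - exp(eta) - log(y!)
    log_f = y[:, None] * eta - np.exp(eta) - log_fact[:, None]
    return (np.exp(log_f) @ w) / np.sqrt(np.pi)

def marginal_loglik(beta, sigma, X, y, n_nodes=20):
    """Marginal log-likelihood: sum of the logs of the per-unit integrals."""
    return float(np.sum(np.log(marginal_lik_units(beta, sigma, X, y, n_nodes))))

# Made-up data: intercept plus one covariate, one count per unit.
X = np.column_stack([np.ones(6), np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])])
y = np.array([1.0, 0.0, 3.0, 2.0, 4.0, 7.0])
beta = np.array([0.2, 1.5])

print(marginal_loglik(beta, 0.5, X, y))
```

Setting `sigma = 0` collapses every quadrature node to $b = 0$ and recovers the ordinary Poisson log-likelihood, which is a convenient sanity check. Maximizing `marginal_loglik` over $(\beta, \sigma)$ would require a general-purpose optimizer and is not shown here.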
