J. R. Statist. Soc. B (1984), 46, No. 2, pp. 149-192

Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives

By P. J. GREEN

University of Durham, UK

[Read before the Royal Statistical Society at a meeting organised by the Research Section on Wednesday, December 7th, 1983, Professor J. B. Copas in the Chair]

SUMMARY
The scope of application of iteratively reweighted least squares to statistical estimation problems is considerably wider than is generally appreciated. It extends beyond the exponential-family-type generalized linear models to other distributions, to non-linear parameterizations, and to dependent observations. Various criteria for estimation other than maximum likelihood, including resistant alternatives, may be used. The algorithms are generally numerically stable, easily programmed without the aid of packages, and highly suited to interactive computation.

Keywords: NEWTON-RAPHSON; FISHER SCORING; GENERALIZED LINEAR MODELS; QUASI-LIKELIHOOD; ROBUST REGRESSION; RESISTANT REGRESSION; RESIDUALS

1. PRELIMINARIES
1.1. An Introductory Example
This paper is concerned with fitting regression relationships in probability models. We shall generally use likelihood-based methods, but will venture far from familiar Normal theory and linear models.

As a motivation for our discussion, let us consider the familiar example of logistic regression. We observe y₁, y₂, ..., y_m which are assumed to be drawn independently from Binomial distributions with known indices n₁, n₂, ..., n_m. Covariates {x_ij; i = 1, 2, ..., m; j = 1, 2, ..., p} are also available and it is postulated that y_i ~ B(n_i, {1 + exp(-Σ_j x_ij β_j)}⁻¹), for parameters β₁, β₂, ..., β_p whose values are to be estimated. The important ingredients of this example from the point of view of this paper are:

(A) a regression function η = η(β), which here has the form η_i = {1 + exp(-Σ_j x_ij β_j)}⁻¹; and
(B) a probability model, expressed as a log-likelihood function of η, L(η), which in this case is

    L = Σ_{i=1}^m {y_i log η_i + (n_i - y_i) log(1 - η_i)}.

In common with most of the problems we shall consider in this paper notice that, in the usual application of this example,
(i) η has much larger dimension than β,
(ii) the probability model (B) is largely unquestioned, except perhaps for some reference to goodness of fit, and
(iii) it is the form of the regression function (A) that is the focus of our attention.
Typically we would be interested in selecting covariates of importance, deciding the form of the regression function, and estimating the values of the βs.
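The IRLS iteration developed below can be sketched for this logistic model in a few lines. This is a minimal illustration, not the paper's code; the function name and test data are hypothetical, and the update is the Fisher-scoring step (D'AD)(β* − β) = D'u with the u, A and D appropriate to the binomial-logit model.

```python
import numpy as np

def irls_logistic(X, y, n_trials, iters=25):
    """Fisher scoring / IRLS for the binomial-logit model of Section 1.1.
    X: m x p covariate matrix, y: success counts, n_trials: binomial indices."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        u = (y - n_trials * eta) / (eta * (1.0 - eta))  # dL/d eta_i
        a = n_trials / (eta * (1.0 - eta))              # expected information (diagonal)
        D = X * (eta * (1.0 - eta))[:, None]            # d eta_i / d beta_j
        beta = beta + np.linalg.solve(D.T @ (a[:, None] * D), D.T @ u)
    return beta
```

Note that D'AD reduces to X' diag{n_i η_i(1 − η_i)} X and D'u to X'(y − nη), the familiar logistic-regression scoring equations.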

Present address: Department of Mathematical Sciences, The University, Durham DH1 3LE.

© 1984 Royal Statistical Society 0035-9246/84/46149 $2.00

1.2. General Formulation
We consider a log-likelihood L, a function of an n-vector η of predictors. Typically n is equal to, or comparable with, the number of individual observations of which the likelihood forms the density or probability function, but we shall be concerned also with cases where individual observations are difficult to define, for example with one or more multinomial samples.

The predictor vector η is functionally dependent on the p-vector β of parameters of interest: p is typically much smaller than n. We base our inference on the function η = η(β) by estimating the parameters β, and deriving approximate confidence intervals and significance tests. Initially we shall consider only maximum likelihood estimates, and suppose that the model is sufficiently regular that we may restrict attention to the likelihood equations

∂L/∂β = Dᵀu = 0   (1)

where u is the n-vector {∂L/∂η_i}, and D the n × p matrix {∂η_i/∂β_j}. The standard Newton-Raphson method for the iterative solution of (1) calls for evaluating u, D and the second derivatives of L for an initial value of β and solving the linear equations

(-∂²L/∂β∂βᵀ)(β* - β) = Dᵀu   (2)

for an updated estimate β*. This procedure is repeated until convergence. Equation (2) is derived from the first two terms of a Taylor series expansion for ∂L/∂β; for a log-likelihood quadratic in β the method converges in one step. Commonly the second derivatives in (2) are replaced by an approximation. Note that

-∂²L/∂β∂βᵀ = -Σ_i (∂L/∂η_i)(∂²η_i/∂β∂βᵀ) + Dᵀ(-∂²L/∂η∂ηᵀ)D

and we replace the terms on the right by their expectations (at the current parameter values η). By the standard arguments:

E(∂L/∂η) = 0 and E(-∂²L/∂η∂ηᵀ) = E{(∂L/∂η)(∂L/∂η)ᵀ} = A, say,

and with this approximation (essentially Fisher's scoring technique) (2) becomes

(DᵀAD)(β* - β) = Dᵀu.   (3)

We will assume that D is of full rank p, and that A is positive definite throughout the parameter space: thus (3) is a non-singular p × p system of equations for β*. Rather than handle their numerical solution directly, note that they have the form of normal equations for a weighted least squares regression: β* solves

minimize (A⁻¹u + D(β - β*))ᵀ A (A⁻¹u + D(β - β*)),   (4)

that is, it results from regressing A⁻¹u + Dβ onto the columns of D using weight matrix A. Thus we use an iteratively reweighted least squares (IRLS) algorithm (4) to implement the Newton-Raphson method with Fisher scoring (3), for an iterative solution to the likelihood equations (1). This treatment of the scoring method via least squares generalizes some very long-standing methods, and special cases are reviewed in the next Section.

Two common simplifications are that the model may be linearly parameterized, η = Xβ say, so that D is constant, or that L has the form ΣL_i(η_i) (e.g. observations are independent) so that

A is diagonal.

IRLS algorithms also arise in inference based on the concept of quasi-likelihood, which was proposed by Wedderburn (1974) and extended to the multivariate case by McCullagh (1983). Suppose that the n-vector of observations y has E(y) = η and var(y) = σ²V(η), where η = η(β) as before, σ² is a scalar, and the matrix-valued function V(·) is specified. The log-quasi-likelihood Q is defined as any function of η satisfying

∂Q/∂η = V⁻(η)(y - η)   (5)

where V⁻ is a generalized inverse. We estimate β by solving the quasi-likelihood equations

0 = ∂Q/∂β = (∂η/∂βᵀ)ᵀ(∂Q/∂η) = Dᵀu,

say. Since E(∂Q/∂η) = 0 and E(-∂²Q/∂η∂ηᵀ) = V⁻, the Newton-Raphson equations with expected second derivatives have the form (3) with A = V⁻.

Questions of existence and uniqueness of the maximum likelihood estimates in various cases are discussed by Wedderburn (1976), Pratt (1981) and Burridge (1981). The large-sample theory that justifies likelihood-ratio tests and confidence intervals for parameters will be found in Cox and Hinkley (1974, pp. 294-304) and McCullagh (1983). In particular, the asymptotic covariance matrix for the estimate of β is σ²[E(-∂²L/∂β∂βᵀ)]⁻¹ = σ²(DᵀAD)⁻¹.

The important connection between such theoretical results and the numerical properties of IRLS is that both are justified by the approximate quadratic behaviour of the log-likelihood near its maximum. Thus it is reasonable that IRLS should work when maximum likelihood is relevant.

1.3. History and Special Cases
From the first, Fisher noted that maximum likelihood estimates would often require iterative calculation. In Fisher (1925), use of Newton's method was mentioned, with correction

(dL/dθ) / (-d²L/dθ²),

either for one step only, or iterated to convergence. Implicitly he replaced the negative second derivative, the observed information, at each parameter value by its expectation assuming that value were the true one.
This technique, and its multi-parameter generalization, became known as "Fisher's method of scoring for parameters", and was further discussed by Bailey (1961, Appendix 1), Kale (1961, 1962) and Edwards (1972).

Use of the scoring method in what we term regression problems seems to date from Fisher's contributed appendix to Bliss (1935). This paper was concerned with dosage-mortality curves, a quantal response problem as in Section 1.1, except for the use of the probit transformation in place of the logit. The relative merits of using observed or expected information were discussed by Garwood (1941), and the method has become more generally known from the various editions of Finney's book on Probit Analysis (1947, 1952, Appendix II).

Moore and Zeigler (1967) discussed these binomial problems with an arbitrary regression function, and demonstrated the quite general connection with non-linear least-squares regression. Nelder and Wedderburn (1972) introduced the class of generalized linear models to unify a number of linearly parameterized problems in exponential family distributions. These models are discussed in Section 3.2.

The important connection between the IRLS algorithm for maximum likelihood estimation and the Gauss-Newton method for least-squares fitting of non-linear regressions was further elucidated by Wedderburn (1974). Jennrich and Moore (1975) considered maximum likelihood estimation in a more general

exponential family than did Nelder and Wedderburn; their approach is similar to ours, except that the predictors η must be the expected values of the observations.

Important recent contributions have come from McCullagh (1983) and Jorgensen (1983), particularly regarding the treatment of dependent observations.

2. PRACTICALITIES
2.1. Multinomial Data
Much of our discussion of the detailed properties of IRLS algorithms that follows can be motivated by examples with multinomial data.

First consider a single multinomial distribution with polynomially parameterized cell probabilities, such as often arises with data on gene frequencies. In the usual model for the human ABO blood system, A and B are alleles co-dominant to O, so that under random mating with gene frequencies of p, q and r for A, B and O, the probabilities for the phenotypes A, B, AB and O are p² + 2pr, q² + 2qr, 2pq and r². Both sets of frequencies sum to 1, and on removing this redundancy we obtain a regression problem with three predictors and 2 parameters, η being defined as the probabilities of the phenotypes A, B and AB, and β taken as (log p, log q)ᵀ. This is obviously a non-linear regression, and the derivative matrix D is

        [  pr   -pq ]
D = 2   [ -pq    qr ]
        [  pq    pq ]

Given observed frequencies y₁, y₂, y₃, y₄ for A, B, AB and O, with Σy_i = n, we have

u_i = y_i/η_i - y₄/η₄  and  A_ij = n(δ_ij/η_i + 1/η₄),  where η₄ = 1 - η₁ - η₂ - η₃.

Incidentally, if η₄ retains its identity as a predictor, the regression becomes of order 4 × 2 but A is now diagonal: this is essentially the algorithm used by Jennrich and Moore (1975) for this problem, with the exception that they take β = (p, q)ᵀ.

More generally, data in the form of R multinomial samples on the same set of S response categories often arise in the regression analysis of categorical data. We may arrange the data as a two-way table of counts y_rs, r = 1, 2, ..
., R; s = 1, 2, ..., S, and the log-likelihood is essentially L = ΣΣ y_rs log p_rs, where Σ_s p_rs = 1 for all r and the RS cell probabilities p_rs are suitably parameterized. There has been considerable interest in the case where the categories 1, 2, ..., S are ordered; see for example McCullagh (1980), Anderson and Philips (1981) and Pratt (1981). Then we may be prepared to write

Σ_{i=1}^s p_ri = Ψ(θ_s - Σ_j x_rj β_j)

for some given distribution function Ψ. Here the matrix (x_rj) represents covariate information, and the parameters are -∞ = θ₀ < θ₁ < ... < θ_{S-1} < θ_S = ∞, together with the β_j; the model is usable only if these order constraints on the θ_s are

satisfied.

Use of models such as these for categorical data was motivated by the case where Ψ is a logistic function and S = 2. This is the usual linear logistic model for binary data (Cox, 1970). An alternative model for ordinal data with a similar motivation is Anderson's stereotype model, a special case of the models in Anderson (1984), which we may write as

p_rs = exp(γ_s - φ_s Σ_j x_rj β_j) / Σ_t exp(γ_t - φ_t Σ_j x_rj β_j),

where β, γ and φ are to be fitted, under the constraints γ_S = 0 and 1 = φ₁ > φ₂ > ... > φ_S = 0. Here, the simplest parameterization for IRLS is to use η_rs = γ_s - φ_s Σ_j x_rj β_j, r = 1, 2, ..., R; s = 1, 2, ..., S - 1. However, this model suffers from rank deficiency of D at β = 0, and there will be consequent numerical difficulties, and problems in the large-sample theory. Further there is no guarantee that at the computed maximum, the φ_s will be correctly ordered.

2.2. Example
For a straightforward application of the grouped continuous multinomial model, consider the data in Table 1, relating school GCE A-level score to university degree performance for fifteen

TABLE 1
A-level score and degree classification

                  Degree class
Score     I   II(i)  II(ii)  III  Pass  Total
  15     22    13     10      3     0     48
  14     20    21     31      9     2     83
  13     13    43     31     16    10    113
  12      7    21     35     18     5     86
  11      3    21     26     32     8     90
  10      3    17     25     20    12     77
   9      1    10      9     15    11     46
   8      1     2      4     12     6     25
   7      0     1      2      6     1     10
   6      0     0      2      1     0      3

years' intake to a certain university first degree course. Clearly there may be heterogeneities here caused by changing standards with time, and it would have been preferable not to combine these data over the years. We examine the data for a possible linear relationship between the log-odds of attaining degree class s or better and A-level score. Here, s = 1, 2, 3 or 4 for degree I, II(i), II(ii) and III. Parallel regression lines in this domain define the proportional odds model (see McCullagh, 1980) and are equivalent to a logistically distributed latent response variable, which is categorized into unknown intervals to provide the degree classification.

Specifically, then, we suppose that the probability that a student with an A-level score of x_r attains degree class i is p_ri, such that Σ_{i=1}^s p_ri = Ψ(θ_s - αx_r), s = 1, 2, 3, 4, where Ψ(u) = (1 + e⁻ᵘ)⁻¹. Defining η_rs = Σ_{i=1}^s p_ri, s = 1, 2, 3, 4, r = 1, 2, ..., 10, and β = (θ₁, θ₂, θ₃, θ₄, α)ᵀ, and using an unweighted least squares regression of the empirical log-odds ratio to provide initial estimates for β, IRLS converged in 4 iterations to the maximum likelihood solution θ̂ = (-6.803, -5.177, -3.763, -2.096)ᵀ, α̂ = -0.3915 (standard error 0.040). This fit gave a deviance (likelihood-ratio statistic against the saturated model) of 48.5 on 35 degrees of freedom: it is therefore probably adequate as a summary of the data. Treating A-level score as a factor rather than a quantitative covariate reduced the deviance to 36.5 (27 d.f.) so that, based on a nominal significance test, no non-linearity is suggested. Use of the Normal distribution function for the latent response variable made practically no difference, while the Gumbel distribution (the "complementary log-log" link, or proportional hazards model) fitted less well, whether degree classes were arranged in increasing or decreasing order.
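The fitted proportional odds model can be turned into cell probabilities directly by differencing the cumulative form P(class ≤ s | score x) = Ψ(θ_s − αx). The following sketch (function name hypothetical, not from the paper) does this for the estimates just quoted.

```python
import numpy as np

def degree_class_probs(theta, alpha, scores):
    """Cell probabilities p_rs under the proportional odds model:
    cumulative P(class <= s | score x) = 1/(1 + exp(-(theta_s - alpha*x)))."""
    x = np.asarray(scores, dtype=float)[:, None]
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(theta)[None, :] - alpha * x)))
    # complete the cumulative distribution at s = 0 and s = S, then difference
    cum = np.hstack([np.zeros_like(x), cum, np.ones_like(x)])
    return np.diff(cum, axis=1)

# maximum likelihood estimates reported in Section 2.2
theta_hat = [-6.803, -5.177, -3.763, -2.096]
P = degree_class_probs(theta_hat, -0.3915, range(6, 16))  # scores 6, ..., 15
```

Because the θ_s are increasing, the differences are automatically non-negative and each row sums to one.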

For this example, of course, the allegedly latent response variable has an immediate interpretation as an examination mark underlying the degree classification. We might then know the cutpoints {θ_s}, or rather, allow an intercept and scale and write

Σ_{i=1}^s p_ri = Ψ((γ + βx_r - t_s)/σ)

where {t_s} are known. For illustration, if the cutpoints t are taken as (75, 60, 45, 30)ᵀ then the maximum likelihood estimates of γ, β and σ are 8.773, 3.835 and 9.736, with a deviance of 51.2 (37 d.f.). Thus each A-level point is "worth" nearly 4 examination marks; note, however, from the magnitude of the estimate for σ that there is considerable variation about this regression line.

The approach is not, of course, limited to the case of a single covariate. For example, we have successfully used IRLS to fit both the grouped continuous and stereotype models to the back pain study data of Anderson and Philips (1981) in which there are 6 response categories and three categorical explanatory factors.

2.3. Convergence
General experience seems to be that choice of starting values for the parameter estimates is not particularly critical. Jennrich and Moore (1975) make this point quite strongly, though they are working only in an exponential family framework. In a model where there is a danger of multiple maxima, it is of course important to repeat the iterative process from several different points in the parameter space, in order to obtain more confidence that the true global maximum has been obtained.

In certain specialized problems, accurate starting values can be obtained by explicit formulae. For the ABO genetic example, Rao (1973, p. 371) gives formulae that render an iterative solution almost unnecessary. The effect of ignoring such formulae and using quite arbitrary starting points is illustrated in Fig. 1 for this example, using the O, A, B, AB frequencies: 202, 179, 35 and 6, as used by Thompson and Baker (1981). It will be seen that convergence is successfully obtained from nearly every admissible point, but that the iterations can be rather wild unless a little thought is used to provide a sensible initial estimate.
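For the ABO example, the quantities u, A and D of Section 2.1 give a complete IRLS iteration in a few lines. The following is a sketch (function name hypothetical), using the paper's parameterization β = (log p, log q)ᵀ and the frequencies quoted above, reordered to (A, B, AB, O).

```python
import numpy as np

def irls_abo(y, p0, q0, iters=30):
    """IRLS/Fisher scoring for ABO gene frequencies, beta = (log p, log q).
    y = (A, B, AB, O) phenotype counts."""
    n = y.sum()
    beta = np.log([p0, q0])
    for _ in range(iters):
        p, q = np.exp(beta)
        r = 1.0 - p - q
        eta = np.array([p*p + 2*p*r, q*q + 2*q*r, 2*p*q])  # P(A), P(B), P(AB)
        eta4 = 1.0 - eta.sum()                             # P(O) = r*r
        u = y[:3] / eta - y[3] / eta4                      # dL/d eta_i
        A = n * (np.diag(1.0 / eta) + 1.0 / eta4)          # expected information
        D = 2.0 * np.array([[p*r, -p*q],
                            [-p*q, q*r],
                            [p*q,  p*q]])                  # d eta / d beta
        beta = beta + np.linalg.solve(D.T @ A @ D, D.T @ u)
    return np.exp(beta)

# Thompson and Baker's O, A, B, AB counts 202, 179, 35, 6, as (A, B, AB, O);
# a rough moment estimate is used as a sensible starting point
p_hat, q_hat = irls_abo(np.array([179.0, 35.0, 6.0, 202.0]), 0.25, 0.08)
```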


Fig. 1. The ABO blood group example: trajectories of successive iterations of IRLS from three different initial estimates, using the parameterization β = (log p, log q)ᵀ, plotted in barycentric coordinates. The shaded region covers initial values for which IRLS is unsuccessful.

For generalized linear models, it is easy to find usable starting values (see Section 3.2). In fact whenever the regression function is linear, η = Xβ, initial estimates of η can be obtained from a simple transformation of the data; an unweighted regression of these on the columns of X usually yields appropriate starting values of β. Jorgensen (1983) suggests a modification for the non-linear case.

We have mentioned use of both the expected and observed information matrices in the iterative step of the algorithm. Again this decision is not usually important. Expected information used to be preferred because in the problems being considered it was algebraically simpler, and because its value on convergence was needed to compute the asymptotic variance of the estimates. For discussion of these points see Garwood (1941), Edwards (1972) and Cox and Hinkley (1974, p. 308), for example, but see also Efron and Hinkley (1978). Further, the observed information matrix may not be positive definite. In exponential family models appropriately parameterized, observed and expected information may be the same (Nelder and Wedderburn, 1972; Jennrich and Moore, 1975; see Section 3.2). On the other hand, in certain situations the expected information is unknown (for example with censored survival data, the potential censoring times are not recorded for uncensored observations), or algebraically complicated, or involves nuisance parameters (see Section 3.1).

Only in simple cases can the behaviour of the algorithm on iteration be properly quantified. At worst, all we can say is that it is a fixed point method: if it converges to a point β̂, then β̂ is a solution of the likelihood equations. Exact Newton-Raphson is a quadratic method, so that convergence will be rapid near the solution, but may not be obtained at all far from this point. See Chambers (1977, p. 136) and Jennrich and Ralston (1979). As pointed out by Jorgensen (1983), we can always modify the Fisher scoring method by reducing the step size to ensure that the likelihood increases on each iteration.

The iterations must be monitored in order to detect convergence (or its failure). Two obvious methods for doing so are to record relative changes in parameter estimates or absolute changes in the log-likelihood. GLIM used a modification of the latter, but the former is more readily adapted to alternatives to likelihood methods (Section 5) and involves simpler calculations, while it is less suited to automatic application. When handling an unfamiliar problem, it is important to follow the entire iterative history of the solution to be confident that the convergence criterion employed is appropriate.
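The step-size modification just mentioned is easy to bolt onto any scoring step. A minimal sketch (names hypothetical; the caller supplies the log-likelihood and the quantities u, D, A as functions of β):

```python
import numpy as np

def damped_scoring_step(beta, loglik, quantities):
    """One Fisher-scoring step, halving the step length until the
    log-likelihood does not decrease (cf. Jorgensen, 1983)."""
    u, D, A = quantities(beta)
    delta = np.linalg.solve(D.T @ A @ D, D.T @ u)
    step = 1.0
    while loglik(beta + step * delta) < loglik(beta) and step > 1e-8:
        step *= 0.5
    return beta + step * delta
```

By construction each accepted iterate has a log-likelihood at least as large as its predecessor, at the cost of extra likelihood evaluations.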

2.4. Reparameterization
One-to-one transformation of either η or β in the model L = L(η(β)) will not essentially change the problem, but does change its specification, and can make the difference between success and failure in the application of IRLS.

Suppose that η and β are in appropriately differentiable one-to-one correspondence with ζ and γ respectively, and denote the associated Jacobians by S and R. Thus S is n × n with S_ij = ∂η_i/∂ζ_j, and R is p × p with R_ij = ∂β_i/∂γ_j. If we re-parameterize the model as L = L(ζ(γ)) then by elementary calculus, u, A and D are replaced by Sᵀu, SᵀAS and S⁻¹DR. The likelihood equations Dᵀu = 0 become RᵀDᵀu = 0 and the IRLS step is

(S⁻¹DR)ᵀ(SᵀAS)(S⁻¹DR)(γ* - γ) = RᵀDᵀu,

which reduces to

(RᵀDᵀADR)(γ* - γ) = RᵀDᵀu.

Thus, as expected, reparameterization in the η-space has made no difference to the iterative solution, but in the β-space it does make a difference, unless the transformation from β to γ is affine.

From a numerical point of view, there may be good reason to reparameterize in order approximately to linearize the problem. Successful use of IRLS depends essentially on the adequacy of

approximate linearity of ∂L/∂β in β. If for some γ, RᵀDᵀu is more nearly linear in γ than is Dᵀu in β, then transformation may be worthwhile.

For a simple example of this, consider again the ABO genetic system. If the problem were alternatively parameterized in terms of γ = (p, q)ᵀ, then the frequencies of the phenotypes (A, B, AB) are η where

ηᵀ = (γ₁(2 - γ₁ - 2γ₂), γ₂(2 - 2γ₁ - γ₂), 2γ₁γ₂)

and so

         [ 1-γ₁-γ₂     -γ₁    ]
DR = 2   [   -γ₂     1-γ₁-γ₂  ]
         [    γ₂        γ₁    ]

Although this has a simpler form than does D of Section 2.1, it is found that with this set-up the problem is more sensitive to initial values for γ. Figure 2 illustrates that starting values must be restricted to a much smaller part of the parameter space than with the earlier parameterization.
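The effect of the β-to-γ change can be checked numerically: a single scoring step transforms through the Jacobian R = ∂β/∂γ of Section 2.4 (here diag(1/p, 1/q)), but the updated parameter values differ because the map is not affine. A small sketch of this check (function name hypothetical):

```python
import numpy as np

def abo_step_quantities(p, q, y):
    """u, A and the two derivative matrices for the ABO model at (p, q):
    D for beta = (log p, log q), DR for gamma = (p, q)."""
    n = y.sum()
    r = 1.0 - p - q
    eta = np.array([p*p + 2*p*r, q*q + 2*q*r, 2*p*q])
    eta4 = 1.0 - eta.sum()
    u = y[:3] / eta - y[3] / eta4
    A = n * (np.diag(1.0 / eta) + 1.0 / eta4)
    D = 2.0 * np.array([[p*r, -p*q], [-p*q, q*r], [p*q, p*q]])
    DR = 2.0 * np.array([[r, -p], [-q, r], [q, p]])
    return u, A, D, DR
```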



Fig. 2. As for Fig. 1, but with the alternative parameterization γ = (p, q)ᵀ.

Although, as we have seen, reparameterizing η will make no numerical difference (if we neglect rounding error), it may cause considerable changes in setting up the problem, as the information matrix SᵀAS may have simpler structure than A.

2.5. Computing the Weighted Least Squares Solutions
The numerical analysis of the linear least squares problem is well developed. Chambers (1977, Chap. 5) provides a useful summary from a statistical viewpoint. For ordinary least squares, finding β* to minimize ||y - Dβ*||² = (y - Dβ*)ᵀ(y - Dβ*), there are methods that are more stable numerically than the obvious β* = (DᵀD)⁻¹Dᵀy. The orthogonal decomposition methods involve the explicit or implicit construction of an n × n orthogonal matrix Q (QᵀQ = I) such that

       [ R ]
QD =   [ 0 ]

where R is a p × p upper triangular matrix. Since orthogonal transformations preserve Euclidean length, our required solution β* satisfies Rβ* = Q̃y (where Q̃ denotes the first p rows of Q)

and may be obtained by back-substitution. Note also that DᵀD = RᵀR, facilitating the calculation of (DᵀD)⁻¹ etc.

Orthogonal decompositions may be found by a modified Gram-Schmidt method, Givens' method, or by means of Householder transformations. This last approach is recommended for general use (see the procedures decompose and solve of Businger and Golub (1965), implemented as NAG routines F01AXF and F04ANF (NAG, 1981)).

From a strictly numerical-analytic point of view, the back-substitution for β* should be followed by an iterative improvement, in which the residuals from the least squares solution become the right-hand sides for a new least squares problem. Providing the transformation has been saved this does not entail further decomposition; nevertheless, it is not likely to lead to any improvement relevant to data-analysis, particularly when the least squares solution is, as here, only a part of an iterative step.

Ordinary least squares solves our IRLS step (4) when A is a scalar matrix: the "modified dependent variate" y is A⁻¹u + Dβ. When A has a more complicated form, we need to use weighted least squares. Two possibilities are open to us:
(i) to transform the problem to ordinary least squares, or
(ii) to generalize the orthogonal decomposition method.
When A is diagonal, diag(v_i²) say, the transformation (i) is trivial. With y defined as above, we minimize (y - Dβ*)ᵀA(y - Dβ*) = Σ v_i²(y_i - Σ_k d_ik β*_k)² by component-wise multiplication of the entries in y and the rows of D by the square roots of the diagonal elements of A, and using ordinary least squares. This gives the simple prescription:

Regress {u_i/v_i + v_i Σ_k d_ik β_k} on {v_i d_ij}.   (6)

Chambers (1977, p. 120) suggests this procedure, and there seems no point in looking for a method of type (ii). For general, non-diagonal, A, conversion to ordinary least squares entails construction of the Cholesky square root matrix B and using:

Regress (B⁻¹u + BᵀDβ) on the columns of BᵀD.   (7)

An alternative approach would be to decompose D in the geometry determined by A, by constructing an n × n matrix Q for which

              [ R ]
QᵀQ = A, QD = [ 0 ]

and solving Rβ* = Q̃y = Q̃(A⁻¹u + Dβ) = Q̃A⁻¹u + Rβ as before, by back substitution. This hybrid Householder/Cholesky decomposition may offer advantages in efficiency or accuracy, but we have experienced no difficulties with the routine use of a separate Cholesky decomposition followed by ordinary least squares. Note that A may be of special form (e.g. tridiagonal) which may be exploited in the decomposition.

With this approach, no special purpose software is needed for any of these computations. For example, the interactive general purpose language APL has all the array-handling capabilities that are required. Of course, weighted least squares procedures are available in many statistical packages and subroutine libraries. In the BMDP series (Dixon, 1981), for example, the P3R and PAR programs fit non-linear regressions and the manual describes their usage for maximum likelihood estimation. The facilities in GLIM (Baker and Nelder, 1978) and GENSTAT (Alvey et al., 1977) for fitting generalized linear models will be reviewed in Section 3.3. Generalized least squares in which the weight matrix is not diagonal does not seem to be available in packages.
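Prescription (7) is a one-liner in any array language. A sketch in Python (function name hypothetical), returning the increment β* − β rather than β* itself, which amounts to the same regression applied to B⁻¹u alone:

```python
import numpy as np

def weighted_ls_increment(u, D, A):
    """Solve the weighted least squares step (D'AD)(beta* - beta) = D'u
    via the Cholesky factor A = B B', as in prescription (7)."""
    B = np.linalg.cholesky(A)
    z = np.linalg.solve(B, u)           # B^{-1} u
    W = B.T @ D                         # transformed design matrix
    delta, *_ = np.linalg.lstsq(W, z, rcond=None)
    return delta
```

The transformed normal equations are WᵀW = DᵀAD and Wᵀz = Dᵀu, so the ordinary least squares solve reproduces the weighted step while avoiding explicit formation of DᵀAD.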

2.6. Alternative Numerical Methods
The Newton-Raphson method, whether or not the second derivatives are replaced by their expectations, is conceptually the simplest numerical procedure for maximizing the likelihood function. It would be naive to suggest that it can be used universally for these problems, but we have argued that one principal reason why it may break down (an inadequate approximation of L by a quadratic near its maximum) may also mean that maximum likelihood estimation is not appropriate.

However, two other reasons in particular may suggest that other numerical methods are more suitable: slow convergence far from the optimum, and the necessity of supplying analytic second derivatives. These difficulties can be overcome by use of quasi-Newton methods (see for example Gill and Murray (1972)). This has been recommended by Anderson and Philips (1981) and Anderson (1981) for the multinomial problems discussed in Section 2.1. Quasi-Newton methods seem highly suited to maximum likelihood estimation, and should perhaps be generally recommended if IRLS fails.

Other optimization methods that may be needed in particular applications include search methods (e.g. Powell (1964), Nelder and Mead (1965)) and the conjugate gradient algorithms (Fletcher and Reeves, 1964). These latter are economical in storage when the number of parameters is large, and are advocated by McIntosh (1982) for the fitting of linearly parameterized models on small computers. Search methods are probably most suited to problems where the objective function is rather less well behaved than most log-likelihoods. Useful reviews of these methods are given by Chambers (1977, Chapter 6) and Jennrich and Ralston (1979).
For certain problems, special purpose algorithms are available. For example, in log-linear models for multi-way contingency tables, one rival to the IRLS method is the iterative proportional fitting algorithm (see, for example, Bishop, Fienberg and Holland, 1976). This will normally be preferable, but in a sparse contingency table the difference is less clear. Brown and Fuchs (1983) provide a valuable discussion of these points.

The MLP package (Ross, 1980) uses several different algorithms, including a modified Newton method, for various special maximum likelihood problems, including probit analysis and genetic linkage. User-defined models can be analysed, and the program will handle non-linear parameterizations.

We should also touch here on the relationship to the EM algorithm (Dempster, Laird and Rubin, 1977). This is not a numerical method for maximum likelihood estimation in the same sense as IRLS. It is rather a general principle for handling problems in which the likelihood takes a particularly complicated form because of unobserved latent or missing variables. In particular cases there can be a close connection with IRLS. For example, Hinde (1982) shows that, for a Normal-Poisson compound distribution, when a numerical integral in the EM algorithm is evaluated by Gaussian quadrature, the result is an IRLS algorithm. Brillinger and Preisler (1983) use a similar method for a variety of compound distributions.

3. NUISANCE PARAMETERS
3.1. Introduction
Not all of the parameters in a model need hold the same interest or have the same logical status. In certain cases, such "nuisance parameters" κ are additional to those naturally entering the regression function η(β), and that is the situation considered here. (It is a different distinction, for example, than that made between treatments and blocks in a designed experiment.) It is generally profitable to recast the model as L = L(η, κ) where η = η(β). There are now two sets of likelihood equations

∂L/∂β = Dᵀu = 0   (8)

∂L/∂κ = 0.   (9)

We continue to use the first set, but by analogy with the case of the variance in Normal linear regression we may replace (9) by some other criterion if convenient. We may in fact not wish to estimate κ, but need to do so because of its implications for β: solution of (8) may depend on κ, and estimates of the variance of β̂ may involve κ.

In some situations, the dependence on κ may essentially factor out of the problem. Jorgensen (1983) discusses models which may be expressed in the form:

L = c(y, κ) + φ(κ) t(y, η)   (10)

where the vector y represents the observed data, and φ(κ) is a type of "precision" factor dependent on the nuisance parameters. Jorgensen calls this the extended class of generalized linear models, but in fact no linear structure η = Xβ is necessarily intended.

With the model (10), we have

∂L/∂β = φ(κ) Dᵀ(∂t/∂η)

and

-∂²L/∂β∂βᵀ = φ(κ) { Dᵀ(-∂²t/∂η∂ηᵀ)D - Σ_i (∂t/∂η_i)(∂²η_i/∂β∂βᵀ) }.   (11)

The Newton-Raphson method for estimating β thus seems to involve κ only through the precision parameter φ, and indeed this appears to cancel from the two sides of (2). However, while the second term on the right in (11) disappears on taking expectations of the second derivatives (or would vanish anyway if η were linear in β), the expectation of -∂²t/∂η∂ηᵀ generally involves κ. As Jorgensen observes, this will be the case unless (10) is an exponential family with η the minimal canonical parameter (which covers Nelder and Wedderburn's models). Jorgensen advocates the compromise of ignoring the second term of (11) but not taking expectations in the first term. This iterative method, which he terms the "linearization method", can be realised by taking u = {∂t/∂η_i} and A = {-∂²t/∂η_i∂η_j}. This will allow estimation of β without interference from the nuisance parameter κ. The new difficulty that may arise in thus using "observed" rather than "expected" information is that A may not be positive-definite (or even semi-definite) at some points in the η-space. IRLS would then break down completely.

If so, then an alternative to treating β and κ in the same manner may be available if (9) is explicitly soluble for κ given fixed β. Often κ is one-dimensional, and enters (9) in a simple way. If this is the case, we can attempt a solution for (β, κ) by means of a 2-part iteration:
(i) holding κ fixed, use IRLS to update β;
(ii) holding β fixed, solve (9) for an updated κ.
The convergence theory of this is even more inaccessible, but some results are available for the linear regression case (Section 4.1).
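As an illustration of the 2-part iteration, consider the model y_i ~ N(exp(x_i β), κ), which is hypothetical and not from the paper: step (ii) has the closed form κ = n⁻¹Σ(y_i − η_i)². For this Normal model the precision in fact cancels from the β-step, so the sketch mainly shows the bookkeeping of the alternation.

```python
import numpy as np

def two_part_fit(x, y, beta0=0.0, kappa0=1.0, iters=50):
    """Alternate (i) one scoring/IRLS step for beta with kappa fixed and
    (ii) the explicit solution of dL/dkappa = 0 with beta fixed."""
    beta, kappa = beta0, kappa0
    for _ in range(iters):
        eta = np.exp(x * beta)
        u = (y - eta) / kappa                  # dL/d eta
        D = eta * x                            # d eta / d beta (p = 1)
        a = np.full_like(x, 1.0 / kappa)       # diagonal of A
        beta = beta + (D @ u) / (D @ (a * D))               # step (i)
        kappa = float(np.mean((y - np.exp(x * beta))**2))   # step (ii)
    return beta, kappa
```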

3.2. Generalized Linear Models
Nelder and Wedderburn (1972) proposed a class of likelihood functions for $n$ independent observations $\{y_i\}$ which may be written

$$L = \sum_{i=1}^{n} \left[ \frac{\pi_i\{y_i\theta_i - b(\theta_i)\}}{\phi} + c(y_i, \phi) \right] \qquad (12)$$

where $b(\cdot)$ and $c(\cdot)$ are prescribed functions, $\{\pi_i\}$ a set of known "prior weights", $\phi$ a

nuisance precision parameter (known or unknown), and the canonical parameter $\theta_i$ is functionally related to the linear predictor $\eta_i = \sum_j x_{ij}\beta_j$. This falls into our general framework, and in fact forms a subclass of Jorgensen's extended generalized linear models. In the notation of Section 1, $D$ is constant (the model or design matrix $X$), $A$ is diagonal since the observations are independent, and $u$ has a special form: because of the exponential family assumption (12) it involves $y_i$ only through $y_i - b'(\theta_i) = y_i - E(y_i)$. Standard examples include:
(i) $y_i \sim N(\eta_i, \sigma^2)$
(ii) $y_i \sim$ Poisson, with mean $\exp(\eta_i)$
(iii) $y_i \sim$ Binomial, with probability $(1 + \exp(-\eta_i))^{-1}$
(iv) $y_i \sim$ Gamma, with mean $\eta_i^{-1}$
(v) $y_i \sim$ Binomial, with probability $\Phi(\eta_i)$
(vi) $y_i \sim$ Gamma, with mean $\eta_i$
(vii) $y_i \sim$ Negative binomial, with probability $(1 + \exp(-\eta_i))^{-1}$
where in each case $\eta_i = \sum_j x_{ij}\beta_j$. Thus many standard analyses, including linear regression, log-linear models and logit and probit analysis, fit into this structure.

Using the IRLS approach described in Section 1 for generalized linear models has additional justification when the canonical parameter $\theta_i$ in (12) is identical to $\eta_i$. This includes the first four of the standard models listed above. If this holds then it is easy to see that the second derivatives of $L$ with respect to $\eta$ or $\beta$ do not involve $y$: the observed and expected information matrices therefore agree, and we are using Newton-Raphson exactly. Statistically this property implies the existence of sufficient statistics: in the case where the prior weights are all unity and $\phi$ is known, we see that the likelihood $L$ in (12) can be rearranged to exhibit $X^T y$ as sufficient for $\beta$. For examples (v) to (vii), the observed and expected second derivatives differ: exact Newton-Raphson will work for (v) and (vii) but not (vi).
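For a canonical-link model such as example (iii), the IRLS step is therefore exact Newton-Raphson. A minimal sketch in Python (illustrative names, not code from the paper):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """IRLS for logistic regression, a canonical-link GLM.  With the
    canonical link, A = diag(var(y_i)) is free of y, so Fisher scoring
    and Newton-Raphson coincide."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))     # fitted probabilities
        u = y - mu                          # score in eta: y_i - E(y_i)
        A = mu * (1.0 - mu)                 # expected (= observed) information
        # weighted least squares step: solve X' A X (beta* - beta) = X' u
        beta = beta + np.linalg.solve(X.T @ (A[:, None] * X), X.T @ u)
    return beta
```

At convergence the score equations $X^T(y - \mu) = 0$ hold, exhibiting the sufficiency of $X^T y$ noted above.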

3.3. GLIM and Composite Link Functions
The unity of generalized linear models has been exploited in the development of the GLIM program (Baker and Nelder, 1978), which is essentially an IRLS algorithm coupled with data-handling facilities, automatic specification of several standard models, and a high-level syntax for manipulating design matrices. Similar features are now available in GENSTAT (Alvey et al., 1977). GLIM permits the fitting of non-standard ("user-defined") models by way of the OWN directive and its associated macros.

It may be shown that for a generalized linear model, after cancelling out the nuisance parameter $\phi$ as in Section 3.1, $u_i = \pi_i(y_i - \mu_i)/(\tau_i^2\delta_i)$ and $A_{ii} = \pi_i/(\tau_i^2\delta_i^2)$, where $\mu_i = E(y_i)$, $\tau_i^2 = \phi^{-1}\pi_i\,\mathrm{var}(y_i)$ and $\delta_i = d\eta_i/d\mu_i$. Given a problem specified in terms of our $u$, $A$ and $D$, it seems impossible for GLIM to handle non-diagonal $A$. If $A$ is diagonal, the user's macros should define the GLIM vectors %YV, %FV, %VA, %DR and %PW ($y$, $\mu$, $\tau^2$, $\delta$ and $\pi$) to satisfy the above relations for $u_i$ and $A_{ii}$, and then use the columns of $D$ in the FIT directive. Because of the redundancy in notation, there are several ways of setting up such a problem.

Considerable ingenuity has been expended in coding some rather unamenable problems into the GLIM command language following this prescription. In particular, Thompson and Baker (1981) demonstrated a useful extension of standard generalized linear models by way of composite link functions. A number of important models, including the genetics example and the grouped continuous models that we have discussed, fall into the distribution family (12) but with a non-linear regression function. Burn (1982) and Roger (1983) have extended this application of GLIM to a wide variety of multinomial problems arising in genetics.

The forthcoming replacement for GLIM will provide considerably better facilities for user-defined models, including composite link functions.

4. LINEAR REGRESSION
4.1. Three IRLS Methods
We now turn to the linear model

$$y_i = \sum_j x_{ij}\beta_j + \sigma e_i \qquad (13)$$

in which the $\{e_i\}$ are independent and identically distributed with density $f$. We consider estimating $\beta$ and $\sigma$ according to one of three principles: (i) maximum likelihood, for known $f$; (ii) robust regression, providing estimates protected against departures of $f$ from Normality; and (iii) resistant regression (which is deferred to Section 5.2).

Define $\psi$ and $w$ by $\psi(t) = t\,w(t) = -(d/dt)\log f(t)$, and write $\eta_i = \sum_j x_{ij}\beta_j$ and $r_i = (y_i - \eta_i)/\sigma$. The log-likelihood is $L = -n\log\sigma + \sum_i \log f(r_i)$, so the likelihood equations are

$$\frac{\partial L}{\partial\beta_j} = \sigma^{-1}\sum_i \psi(r_i)\,x_{ij} = 0 \qquad (14)$$

and

$$\frac{\partial L}{\partial\sigma} = \sigma^{-1}\left(\sum_i \psi(r_i)\,r_i - n\right) = 0. \qquad (15)$$

Letting asterisks denote updated items, if we substitute $w(r_i)(y_i - \eta_i^*)/\sigma^*$ for $\psi(r_i)$ then (15) and (14) yield

$$\sigma^{*2} = n^{-1}\sum_i w(r_i)(y_i - \eta_i^*)^2 \qquad (16)$$

and

$$\sum_i w(r_i)\,y_i\,x_{ij} = \sum_k \left\{\sum_i w(r_i)\,x_{ij}\,x_{ik}\right\}\beta_k^*. \qquad (17)$$

Of these, (16) updates $\sigma^*$ from $\eta^*$ explicitly, while (17) are normal equations for a weighted least squares regression, obtained without any appeal to the Newton-Raphson procedure.

Proceeding more formally, and differentiating (14), we derive the Newton-Raphson equations

$$\sum_i \psi(r_i)\,x_{ij} = \sigma^{-1}\sum_k \left\{\sum_i \psi'(r_i)\,x_{ij}\,x_{ik}\right\}(\beta_k^* - \beta_k). \qquad (18)$$

These least squares equations may be used directly, or $\psi'(r_i)$ can be replaced by one of two approximations. Firstly, noting that $\psi'(r) = w(r) + r\,w'(r)$, we can just ignore the second term (note that $w(\cdot)$ = constant for a Normal distribution). In this case (18) reduces to (17). Alternatively, Fisher scoring uses

$$a = \int \psi'(r)\,f(r)\,dr = E(\psi'(r)) = \int \frac{\{f'(r)\}^2}{f(r)}\,dr,$$

say, known as the intrinsic accuracy of the distribution (Fisher, 1925), so that (18) is replaced by

$$\sum_i \psi(r_i)\,x_{ij} = \frac{a}{\sigma}\sum_k \left\{\sum_i x_{ij}\,x_{ik}\right\}(\beta_k^* - \beta_k). \qquad (19)$$

We have derived three alternative IRLS procedures for updating $\beta$, based on (17), (18) and (19) respectively:

(I) Regress $\sqrt{w(r_i)}\,y_i$ on $\sqrt{w(r_i)}\,x_{ij}$; (17')

(II) Regress $\sqrt{\psi'(r_i)}\left\{\eta_i + \dfrac{\sigma\psi(r_i)}{\psi'(r_i)}\right\}$ on $\sqrt{\psi'(r_i)}\,x_{ij}$; (18')


(III) Regress $\eta_i + a^{-1}w(r_i)(y_i - \eta_i)$ on $x_{ij}$. (19')
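The updates (17') and (19') can be coded directly. The following sketch is illustrative (not code from the paper), taking logistic errors, for which $\psi(t) = \tanh(t/2)$, $w(t) = \psi(t)/t$ and the intrinsic accuracy is $a = 1/3$; both variants update $\sigma$ via (16):

```python
import numpy as np

def psi(t):
    return np.tanh(t / 2.0)          # psi = -(d/dt) log f, logistic f

def w(t):
    t = np.asarray(t, dtype=float)   # w(t) = psi(t)/t, with w(0) = 1/2
    return np.where(np.abs(t) < 1e-6, 0.5, psi(t) / np.where(t == 0.0, 1.0, t))

def fit_irls(X, y, method="I", a=1.0 / 3.0, n_iter=500):
    """Methods (I) and (III) for the linear model (13)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares start
    sigma = np.std(y - X @ beta)
    for _ in range(n_iter):
        r = (y - X @ beta) / sigma
        if method == "I":
            # (17'): weighted least squares with weights w(r_i)
            W = w(r)
            beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
        else:
            # (19'): regress eta_i + (sigma/a) psi(r_i) on x_ij (Fisher scoring)
            beta = beta + (sigma / a) * np.linalg.solve(X.T @ X, X.T @ psi(r))
        sigma = np.sqrt(np.sum(w(r) * (y - X @ beta) ** 2) / n)   # (16)
    return beta, sigma
```

At convergence both variants satisfy the likelihood equations (14) and (15), and so agree with each other.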

Of these procedures, (I) was used by Beaton and Tukey (1974), (II) is exact Newton-Raphson, and (III) is similar to that used by Huber (1973) and Bickel (1975), with the exception that they used an empirical estimate of $a$, in the absence of assumptions on $f$. The three procedures coincide in the Normal case. In general, Newton's method (II) has the strongest justification, but it is usually more complicated to program, and is only defined when $\psi'(r) > 0$ for all $r$: that is, when $\psi$ is strictly increasing ($f$ log-concave). Methods (I) and (III) are about equally simple to program, and (III) has the advantage of using only the unweighted design matrix $X$.

Maximum likelihood for the error distribution $f(t) = k c^{1/k}\exp(-c|t|^k)/(2\Gamma(k^{-1}))$ coincides with $L_k$ regression, which minimizes $\sum_i |y_i - \sum_j x_{ij}\beta_j|^k$. These models, and asymmetric versions in which the exponent multiplier depends on the sign of $t$, are the only regressions in the class of models of Section 3.1: numerical solution is in principle easier since $\sigma$ factors out of (14), and (15) has an explicit solution for $\sigma$. However, these models are notoriously difficult to fit when $k < 1$. IRLS cannot be recommended when $k < 2$: separate methods based on linear programming are available for $L_1$ regression (Barrodale and Roberts, 1973).

Dempster, Laird and Rubin (1980) discuss maximum likelihood regression, using method (I), for error distributions from the "Normal/independent" family, which includes the $t$ distributions. They demonstrate that the IRLS algorithm so implemented is an example of an EM algorithm, so that theoretical results about convergence are available here.

Standard errors for the regression coefficients are readily obtained. We noted above that the expected negative second derivatives with respect to $\beta$ are

$$E\left[-\frac{\partial^2 L}{\partial\beta\,\partial\beta^T}\right] = \frac{a}{\sigma^2}\,X^T X,$$

so that the estimated asymptotic covariance matrix for the estimates of $\beta$ is $\hat\sigma^2\hat a^{-1}(X^T X)^{-1}$.

Turning now to robust regression, the principle of M-estimation (Huber, 1973) suggests estimating $\beta$ by minimizing $\sum_i \rho((y_i - \sum_j x_{ij}\beta_j)/\sigma)$ for a suitably chosen loss function $\rho$. If $\rho$ is $-\log f$ for some density function $f$ then this is numerically equivalent to maximum likelihood again, assuming that $f$ is the correct density. We proceed as above using $\psi = \rho'$, except that in method (III) the intrinsic accuracy $a$ must be replaced by an empirical estimate. Essentially the only other differences are in the basis for choice of the function $\psi$ (from robustness considerations rather than a probability model), and in the treatment of $\sigma$. In robust regression it is usual to use criteria other than the maximum likelihood equation (15), and often, not to iterate on scale. Holland and Welsch (1977) provide a useful discussion of these points, and of the choice of $\psi$ function. No convergence theory is known for iterating on scale unless the Huber function $\psi(r) = \mathrm{sign}(r)\min\{|r|, H\}$ (Huber, 1973) is used. They recommend using this first, and then, if a different $\psi$ function is preferred, continuing without further change in $\sigma$. They compare eight different $\psi$ functions in their paper, some based on likelihood models, and evaluate efficiencies and robustness properties by numerical integration and simulation.

4.2. A Regression Example from Materials Science
Delayed fracture in brittle materials may be demonstrated by observing the change in bend strength over a range of constant stress rates to failure. Braiden, Green and Wright (1982) use the classical Weibull model for the distribution of brittle strength to derive a failure model in which $y$, the logarithm of the fracture stress, turns out to be related to $x$, the logarithm of the stress rate, by the linear regression $y = \beta_1 + \beta_2 x + \sigma e$. The error $e$ has a Gumbel distribution, and $\beta_2$ and $\sigma$ are simply related to the stress corrosion and brittle fracture parameters that are of primary interest.

Raw data from one particular experiment is presented in Table 2. Note that the experiment was conducted at five different stress rates, with twelve independent observations at each, that there is considerable random variation in the data, and that a Normal linear regression, even after taking logarithms, would have suggested that there are several outliers.

TABLE 2
Stress fatigue data. Double torsion test on specimens of tungsten carbide 6% cobalt alloy with ground surface finish. Fracture stress (MN m⁻²) at five different stress rates

Stress rate (MN m⁻² s⁻¹)
 0.1     1      10     100    1000
1676   1895   2271   1997   2540
2213   1908   2357   2068   2544
2283   2178   2458   2076   2606
2297   2299   2536   2325   2690
2320   2381   2705   2384   2863
2412   2422   2783   2752   3007
2491   2441   2790   2799   3024
2527   2458   2827   2845   3068
2599   2476   2837   2899   3126
2693   2528   2875   2922   3156
2804   2560   2887   3098   3176
2861   2970   2899   3162   3685

See Braiden, Green and Wright (1982).

Each of the three IRLS methods of the previous section converges quickly to the maximum likelihood estimates of $\beta_1$, $\beta_2$ and $\sigma$: the resulting estimates appear in Table 3. For each of the methods, 7 iterations were needed to obtain the three estimates to relative accuracies of $10^{-5}$, $10^{-4}$ and $10^{-3}$ respectively, and 11 iterations to obtain all three to $10^{-5}$. There is nothing to choose between the methods on grounds of performance.

TABLE 3
Linear regressions of the natural logarithms of the data in Table 2

Estimates (s.e.'s in parentheses)

                                      β₁           β₂           σ
Least squares                       7.8089     0.021115     0.12887
Maximum likelihood:
  Ungrouped data                    7.8667     0.021867     0.10594
                                   (0.01675)   (0.00420)
  Grouped data, cut points:
    7.5(0.05)8.1                    7.8663     0.020454     0.09947
                                   (0.01654)   (0.00408)    (0.01052)
    7.6(0.1)8.0                     7.8657     0.021717     0.09662
                                   (0.01678)   (0.00459)    (0.01198)

As an experiment, the same model was fitted after grouping the data, using the method for multinomial data outlined in Section 2.1. Remarkably, even with a very coarse grouping, the parameter estimates and their estimated standard errors are close to those for the ungrouped data. This is illustrated in Table 3 for two different groupings. Braiden, Green and Wright (1982) gave further details of the analysis, which included a Monte-Carlo assessment of the adequacy of the Weibull model.

5. RESIDUALS AND RESISTANT METHODS
5.1. Defining Residuals in Non-linear Models
Residuals are traditionally thought of as (possibly standardized) discrepancies between observations and fitted values, $y - \hat E(y)$. More recently, definitions have been sought which yield residuals uncorrelated under the assumed model and thus form a better basis for diagnostic determination of model inadequacy. We have already met one such definition implicitly. Following the Householder decomposition for the simple linear model, we use the first $p$ components of $Qy$ to determine $\beta^*$ by back substitution. The remaining $(n - p)$ components are uncorrelated with variance $\sigma^2$, if the original data follow the given model and are uncorrelated with variance $\sigma^2$. Model inadequacy can thereby be diagnosed, but we cannot identify data inadequacy: the correspondence of "residuals" with "observations" has been lost.

Moving away from simple linear models, how are we to define useful "residuals" that can form a basis for assessment of model adequacy, detection of discrepant observations, and (to anticipate the next Section) accommodation of possibly discrepant observations by their use in resistant analyses?

One basis for a general definition is to assign residuals not to the observations, but to the predictors $\eta$. The philosophy is that our model is prescribed by a likelihood function $L(\eta)$, where $\eta$ varies a priori in an $n$-dimensional space. We then introduce restrictions on the freedom of $\eta$ to vary by requiring $\eta = \eta(\beta)$, a given function of a $p$-vector $\beta$ which is now the target of our inference. The $(n - p)$ degrees of freedom lost by thus parameterizing $\eta$ force discrepancies between the data and the model $L(\eta(\beta))$. The residual assigned to each component of $\eta$ should then measure the enforced change in $\eta$. A natural definition from this point of view would be a vector of residuals $\hat\eta - \eta(\hat\beta)$, where $\hat\eta$ maximizes $L(\eta)$ and $\hat\beta$ maximizes $L(\eta(\beta))$. This agrees with the usual definition for Normal linear regression, $y \sim N(\eta, \sigma^2 I)$; however, it is not invariant to trivial reparameterizations of $\eta$. It is better to examine changes in $\eta$ on the $L$-scale, so we define the $n$-vector of raw deviances $\Delta$ by

$$\Delta_i = 2\left(\sup_t\,\{L(\eta(\hat\beta) + t e_i)\} - L(\eta(\hat\beta))\right) \qquad (20)$$

(where $e_i$ is the unit vector in the $i$th direction); that is, $\Delta_i$ is twice the increase in log-likelihood attained by freeing $\eta_i$ from its dependence on $\beta$. In the case of Normal linear regression, $\Delta_i = (y_i - \hat\eta_i)^2/\sigma^2$, where $\hat\eta_i$ and $\sigma^2$ are the fitted mean and variance of $y_i$. More generally, if we can write

$$L(\eta) = \sum_i L_i(\eta_i), \qquad (21)$$

for example if we have $n$ independent observations each parameterized by one $\eta_i$, then

$$\Delta_i = 2\left(\sup_{\eta_i} L_i(\eta_i) - L_i(\eta_i(\hat\beta))\right)$$

and we have the useful property that

$$\sum_i \Delta_i = \text{constant} - 2L(\eta(\hat\beta)). \qquad (22)$$

Thus in this case maximum likelihood is equivalent to minimum sum-of-deviances. If (22) is deemed important, yet (21) does not hold, the only option seems to be to define deviances sequentially: for example, let

$$\Delta_i^* = \Delta_i^+ - \Delta_{i-1}^+, \quad\text{where } \Delta_i^+ = 2\left(\sup\,\{L(\eta')\} - L(\eta(\hat\beta))\right), \qquad (23)$$

the supremum being taken over $\eta'$ with $\eta_j' = \eta_j(\hat\beta)$ for all $j > i$.
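When (21) holds, the $\Delta_i$ are the familiar unit deviances of generalized linear models. An illustrative Python check for the Poisson case (an assumed example, not code from the paper), for which the supremum in each $\Delta_i$ is attained at $\mu_i = y_i$:

```python
import numpy as np

def poisson_loglik(y, mu):
    # Poisson log-likelihood terms, omitting the log(y!) constant
    return y * np.log(mu) - mu

def poisson_deviances(y, mu):
    """Raw deviances (20): twice the gain from freeing each eta_i.
    The saturated value takes mu_i = y_i, with the convention 0 log 0 = 0."""
    y_safe = np.where(y > 0, y, 1.0)
    sat = np.where(y > 0, poisson_loglik(y, y_safe), 0.0)
    return 2.0 * (sat - poisson_loglik(y, mu))
```

Property (22) then reads $\sum_i \Delta_i = 2L_{\text{sat}} - 2L(\mu)$, and each $\Delta_i$ reduces to the textbook form $2\{y_i\log(y_i/\mu_i) - (y_i - \mu_i)\}$.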

Then the analogue of (22) holds for $\{\Delta_i^*\}$, and this definition is equivalent to (20) if (21) holds. A similar definition could be given for any ordering of the components of $\eta$ in (23).

For elucidation of definition (23), suppose that $L$ can be replaced by its approximating quadratic $L(\eta) = c + b^T\eta - \frac{1}{2}\eta^T A\eta$, where $c$, $b$, and the non-negative definite matrix $A$ are all constant. Then

$$\frac{\partial L}{\partial\eta} = b - A\eta = u \quad\text{and}\quad -\left\{\frac{\partial^2 L}{\partial\eta\,\partial\eta^T}\right\} = A.$$

It may be shown that

$$\Delta_i^+ = u_{(i)}^T A_{(i)}^{-1} u_{(i)} = \sum_{j=1}^{i} z_j^2,$$

where the subscripts in parentheses truncate after the $i$th row and column, $z = B^{-1}u$, and $B$ is the Cholesky square root of $A$. Thus we have $\Delta_i^* = z_i^2$. This analysis is exact for the (correlated) Normal case, and otherwise only approximate.

For generalized linear models we have $\Delta_i \approx z_i^2$, where $z_i = \sqrt{\pi_i}(y_i - \mu_i)/\tau_i$ is just the $i$th observation standardized, after cancelling out the precision parameter $\phi$; these are referred to as standardized residuals in the GLIM program and manual.

In general, note from (7) that the IRLS algorithm is seeking $\hat\beta$ such that $z = B^{-1}u$ is uncorrelated with the columns of $B^T D$. It is easy to calculate $z$ (by forward substitution, since $B$ is lower-triangular), and the IRLS algorithm can be programmed to use these values directly for updating $\beta$. Jorgensen (1983) has also suggested the use of $B^{-1}u$ as a vector of "score residuals". The above discussion suggests that they could be used interchangeably with the deviances.

However, it is easy to find situations, even where the observations are independent, in which the use of score residuals does not make sense. In the case of linear regression (see Section 4.1) we have

$$u_i = \frac{\partial L}{\partial\eta_i} = \sigma^{-1}\psi(r_i) \quad\text{and}\quad A_{ii} = E\left(-\frac{\partial^2 L}{\partial\eta_i^2}\right) = \frac{a}{\sigma^2},$$

so that, whether we use $z_i$ as above, or replace the denominator by $\sigma^{-2}\psi'(r_i)$ without taking expectations, these residuals will not be defined and useful unless $\psi$ is strictly increasing. This is precisely the condition for method (II) of Section 4.1 to work: that is, $f$ must be log-concave.

This failure does not extend to the deviances: it is actually the quadratic approximation used above that breaks down. In fact, by (20) and (23), $\Delta_i = \Delta_i^* = 2\log(f_{\max}/f(r_i))$, where $f_{\max}$ is the supremum value attained by $f$. If $f$ is bounded and continuous, this gives a sensible definition.

Residuals for diagnostic purposes in logistic regression have been discussed by Pregibon (1981). Of the various possible available scales, he finds $\Delta_i$ (or its signed square root) most useful, but also uses $z_i$. His paper, while nominally addressed to the binomial/logistic model, is relevant to all generalized linear models. Our discussion suggests that deviances should be of wider applicability, but that care is needed if independence (21) does not apply.

5.2. Resistant Alternatives to Likelihood Methods
The principle of resistance in statistical data analysis (as distinct from robustness) dictates that fitted models should be almost invariant to large changes in individual observations. As applied to linear regression, this usually means that the least squares criterion:

Choose $\beta$ to minimize $\sum_i z_i^2$ (24)

where $z_i \equiv z_i(\beta) = y_i - \sum_j x_{ij}\beta_j$, is replaced by:

Choose $\beta$ to minimize $\sum_i w(z_i)\,z_i^2$ (25)

where $w(z)$ is a weight function chosen to be 1 for small $|z|$ and declining as $|z|$ increases, in order to give less weight to "discrepant" observations (discrepant in the sense only of fitting the model less well). As implemented by Tukey (1977) and others (e.g. McNeil (1977)), resistant regression is achieved not through (25) but by:

Solve $\sum_i w(z_i)\,z_i\,x_{ij} = 0$ for all $j$. (26)

Thus the normal equations, not the sum-of-squares criterion, are weighted. This distinction has led to some confusion. The obvious IRLS approach of updating $\beta$ to $\beta^*$ by regressing

$$\{w(z_i)\}^{1/2}\,y_i \text{ on } \{w(z_i)\}^{1/2}\,x_{ij} \text{ to get } \beta^* \qquad (27)$$

converges, if at all, to a solution of (26), not (25). If it is really intended to solve (25), then differentiation leads to normal equations of form (26) but with $w(\cdot)$ replaced by a different weight $w^*(z) = w(z) + \frac{1}{2} z\,w'(z)$. Unfortunately, for many sensible weight functions $w(\cdot)$, including Tukey's bi-square, $w(z) = (\max\{0, 1 - cz^2\})^2$, $w^*(\cdot)$ is not a valid weight function as it does not remain non-negative. Thus this IRLS approach will not apply for such $w(\cdot)$.

In this section we discuss the resistant methods obtained when (24) is regarded as maximum likelihood for Normal linear regression, and this model is replaced by an arbitrary one. In view of the points made above, it seems most natural to regard (26) rather than (25) as to be generalized. At least in the case where $L(\eta) = \sum_i L_i(\eta_i)$ we are led to consider weighted likelihood equations:

$$\sum_i w_i \frac{\partial L_i}{\partial\beta_j} = 0 \quad\text{for all } j, \qquad (28)$$

where the weights $w_i$ depend on the discrepancy between the data and the fitted model. For measures of discrepancy, we usually use the deviances of the previous section. Pregibon (1982) discusses such resistant fits for the binomial/logistic model, again in terms of deviances, but specified by analogy with (25) as minimizing $\sum_i \lambda(\Delta_i)$ where $\lambda(\cdot)$ is a differentiable non-decreasing function. Again, this can be cast in the form (28).

However the weights are obtained, it seems natural to attempt to solve (28) by IRLS. In matrix notation we have $D^T W u = 0$, where $W = \mathrm{diag}(w_i)$. In general, all of $u$, $W$ and $D$ depend on $\beta$, but in the spirit of our earlier discussion it is tempting to approximate the "second derivatives" by treating $W$ and $D$ as fixed. Updated estimates are thus obtained from the equations $D^T WAD(\beta^* - \beta) = D^T Wu$, so that $\beta^*$ is chosen to minimize

$$\{u + AD(\beta - \beta^*)\}^T W A^{-1} \{u + AD(\beta - \beta^*)\} = \sum_i w_i v_i \left\{\frac{u_i}{v_i} + \sum_j d_{ij}(\beta_j - \beta_j^*)\right\}^2, \qquad (29)$$

where $v_i = A_{ii}$ and $d_{ij}$ are the elements of $D$. Thus the only complication is an additional set of weights in the regression. Here of course $w_i$, as well as $u_i$, $v_i$ and $d_{ij}$, are calculated at the current value, $\beta$.

A program to fit the model conventionally may easily be modified to produce a resistant fit. It will be seen from (29) that it is necessary only to multiply $u_i$ and $v_i$ by $w_i$ immediately before the least squares step. In the case of generalized linear models this can be achieved using GLIM by either multiplying the prior weights $\pi_i$ (%PW) by $w_i$, or dividing the variance function $\tau_i^2$ (%VA) by $w_i$. Using the second alternative an OWN fit may be specified. It will usually be necessary also to redefine the deviance terms %DI to give a sensible convergence criterion.
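The modification is easily illustrated. The sketch below (assumed, not code from the paper) applies the weighted step of (29) to the binomial/logistic model, with Tukey's bi-square applied to the signed square-root deviances as the measure of discrepancy:

```python
import numpy as np

def bisquare(z, B=4.0):
    return np.maximum(0.0, 1.0 - (z / B) ** 2) ** 2

def resistant_logistic(X, y, m, B=4.0, n_iter=100):
    """Resistant fit of a logistic model for y successes out of m trials:
    the conventional IRLS quantities u_i and v_i = A_ii are multiplied by
    resistance weights w_i before each least squares step, as in (29)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = m / (1.0 + np.exp(-(X @ beta)))      # fitted counts
        u = y - mu                                # score with respect to eta
        v = mu * (m - mu) / m                     # binomial variance
        with np.errstate(divide="ignore", invalid="ignore"):
            dev = 2.0 * (np.where(y > 0, y * np.log(y / mu), 0.0)
                         + np.where(y < m, (m - y) * np.log((m - y) / (m - mu)), 0.0))
        z = np.sign(u) * np.sqrt(np.maximum(dev, 0.0))   # signed-root deviance
        wt = bisquare(z, B)
        beta = beta + np.linalg.solve(X.T @ ((wt * v)[:, None] * X),
                                      X.T @ (wt * u))
    return beta, wt
```

A grossly discrepant observation receives weight near zero and so barely influences the fit, while observations consistent with the model keep weights near one.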

It is found that the presence of $\{w_i\}$ adversely affects the convergence properties of this IRLS method. If the process does converge, it will be to a solution of (28), but it may not converge at all. Some experimentation with different starting points, for example using $L_1$ rather than unweighted least squares initially, is often needed. It is occasionally necessary to start with constant weights ($w_i \equiv 1$), and gradually change them towards the desired values as the iterations progress. These suggestions may seem subjective and ad hoc, but it is not the aim of resistant data-analysis to provide a unique "objective" solution, but rather to examine whether any doubt should be cast on a model fitted conventionally.

When the likelihood is not of the form $\sum_i L_i(\eta_i)$, the only way of proceeding seems to be to replace

Maximize $L$, i.e. minimize $\sum_i \Delta_i^*$, by

Minimize $\sum_i \lambda(\Delta_i^*)$, i.e. solve $\sum_i w_i\,\dfrac{\partial\Delta_i^*}{\partial\beta_j} = 0$ for all $j$.

Formally this can be treated just as above, although the problem is now not diagonal. We can proceed as if

$$u_r = -\tfrac{1}{2}\frac{\partial\Delta_r^*}{\partial\eta_r} \quad\text{and}\quad A_{rs} = -\tfrac{1}{2}\frac{\partial^2\Delta_r^*}{\partial\eta_r\,\partial\eta_s},$$

but this method is untested. Note that the definitions of $\Delta_i^*$ assume an ordering on the components of $\eta$, and so the behaviour of the algorithm, and its solution, may depend on this ordering.

5.3. An Example from Probit Analysis
For an illustration of a resistant analysis we consider the experiment described by Finney (1952, p. 69), which assessed the relative potency of three poisons. The data are given in Table 4.

TABLE 4
Relative potency of three poisons

Observation
number    Kill   Out of   Poison   Log dose
   1       44      50       R        1.01
   2       42      49       R        0.89
   3       24      46       R        0.71
   4       16      48       R        0.58
   5        6      50       R        0.41
   6       48      48       D        1.70
   7       47      50       D        1.61
   8       47      49       D        1.48
   9       34      48       D        1.31
  10       18      48       D        1.00
  11       16      49       D        0.71
  12       48      50       M        1.40
  13       43      46       M        1.31
  14       38      48       M        1.18
  15       27      46       M        1.00
  16       22      46       M        0.71
  17        7      47       M        0.40

Poisons: R - rotenone; D - deguelin; M - mixture.
From Finney (1952), p. 69.

Following Finney, we perform a series of probit analyses in which it is assumed that the kill probability for a log-dose $x$ of poison $r$ ($r = 1, 2, 3$) is $\Phi(\alpha_r + \beta_r x)$, where $\Phi$ is the standard Normal distribution function. Of interest here is the possibility that the $\alpha$'s or the $\beta$'s are equal. Deviances from maximum likelihood fits are given in Table 5: clearly there is ample evidence that the regression lines are neither the same nor parallel. Interpretation is easier if the lines are parallel, as this situation corresponds to constant relative potency at all levels of response. We therefore

TABLE 5
Maximum likelihood analyses for Finney's data

Number of     Observations              Degrees
parameters*   omitted        Deviance   of freedom
    2         -                70.8        15
    4         -                30.3        13
    6         -                20.1        11
    4         2, 11, 15        14.4        10
    4         11, 14, 15       13.7        10
    4         11, 16, 17        7.7        10
    2         11, 16, 17       67.7        12
    6         11, 16, 17        7.1         8

*2: $\alpha, \beta$; 4: $\alpha_1, \alpha_2, \alpha_3, \beta$; 6: $\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \beta_3$.

persevere with the four-parameter model $(\alpha_1, \alpha_2, \alpha_3, \beta)$ and attempt a resistant analysis. For comparison, three alternative weight functions were used, selected from those listed by Holland and Welsch (1977) for robust linear regression:

Bi-square: $w(z) = (\max(0, 1 - (z/B)^2))^2$
Cauchy: $w(z) = (1 + (z/C)^2)^{-1}$
Huber: $w(z) = \min(1, H/|z|)$
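These three weight functions are simple to code. The Python versions below are illustrative; each reduces to the ordinary likelihood weight $w \equiv 1$ as its tuning constant tends to infinity:

```python
import numpy as np

def bisquare(z, B):
    # Tukey's bi-square: zero weight beyond |z| = B
    return np.maximum(0.0, 1.0 - (z / B) ** 2) ** 2

def cauchy(z, C):
    return 1.0 / (1.0 + (z / C) ** 2)

def huber(z, H):
    # weight 1 inside [-H, H], declining as H/|z| outside
    return np.minimum(1.0, H / np.maximum(np.abs(z), 1e-300))
```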

In each case there is a tuning constant, taken as $+\infty$ for an ordinary likelihood analysis, and reduced for greater resistance. For residuals $z$ we use the score residuals, which are simply the observations standardized by their fitted means and standard deviations for this exponential family model. Both least-squares and least-absolute-deviations fits were used to provide initial estimates. For each weight function the tuning constant was gradually reduced until stability was achieved; see Table 6. This procedure generated various possible groups of candidate observations to be labelled discrepant. Likelihood analyses were performed with these omitted

TABLE 6
Resistant analyses of four-parameter model for Finney's data

Weight      Tuning     Initial values     Weighted   Observations severely
function    constant   (2 = L2, 1 = L1)   deviance   down-weighted
   B           9         1 or 2             28.2     -
   B           6         1 or 2             25.5     11, 2
   B           4.5       1 or 2             17.5     11, 16, 2
   B           3         2                   5.3     11, 17, 16
   B           3         1                   5.9     11, 14, 15
   C           6         2                  28.1     -
   C           4         2                  25.8     11, 2
   C           2         2                  17.7     11, 2
   C           1.4       1 or 2              9.3     11, 17, 16
   H           2         2                  29.3     -
   H           1.5       2                  26.0     11, 2
   H           1         1 or 2             19.2     11, 2
   H           0.5       2                   9.8     11, 2, 15
   H           0.4       1 or 2              7.8     11, 2, 15

(Table 5), revealing in particular that a dramatically better fit for the 4-parameter, parallel regression line, model is obtained if observations 11, 16 and 17 are neglected. There is no evidence to suggest that the 2- or 6-parameter models are preferable, having omitted these observations. We therefore settle with the conclusion that all observations except numbers 11, 16 and 17 are consistent with the four-parameter model in which the kill probabilities are $\Phi(-2.673 + 3.906x)$, $\Phi(-4.366 + 3.906x)$ and $\Phi(-3.712 + 3.906x)$ for rotenone, deguelin, and the mixture.

Following inspection of the data, Finney omitted the same three observations from his analysis altogether, a course of action defended on the grounds that the chief interest here is in the behaviour of the poisons at high concentrations. He then fitted parallel probit curves, and his estimates are similar to ours.

Note that we would not advocate the use of resistant methods in order simply to reject inconvenient data: such discrepancies should normally be followed up with the experimenter. It is obviously unsatisfactory to have to discard three observations out of 17. Note also that the result here is not just a better fit, but a better fit to a simpler and more interpretable model.

ACKNOWLEDGEMENTS
I am very grateful to Julian Besag and Allan Seheult for stimulation and encouragement during the course of this work, to Peter McCullagh, Bent Jorgensen and David Brillinger for helpful comments on an earlier version, to Professor H. Marsh for the degree classification data, and to the referees for valuable suggestions on presentation and a number of additional references.

REFERENCES
Alvey, N. G. et al. (1977) The Genstat Manual. Rothamsted Experimental Station.
Anderson, J. A. and Philips, P. R. (1981) Regression, discrimination and measurement models for ordered categorical variables. Appl. Statist., 30, 22-31.
Anderson, J. A. (1984) Regression and ordered categorical variables (with Discussion). J. R. Statist. Soc. B, 46 (in press).
Bailey, N. T. J. (1961) Introduction to the Mathematical Theory of Genetic Linkage. Oxford: University Press.
Baker, R. J. and Nelder, J. A. (1978) The GLIM System, Release 3. Oxford: Numerical Algorithms Group.
Barrodale, I. and Roberts, F. D. K. (1973) An improved algorithm for discrete L1 approximation. SIAM J. Numer. Anal., 10, 839-848.
Beaton, A. E. and Tukey, J. W. (1974) The fitting of power series. Technometrics, 16, 147-185.
Bickel, P. J. (1975) One-step Huber estimates in the linear model. J. Amer. Statist. Assoc., 70, 428-434.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1976) Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Bliss, C. (1935) The calculation of the dosage-mortality curve. Ann. Appl. Biol., 22, 134-167.
Braiden, P., Green, P. J. and Wright, B. D. (1982) Quantitative analysis of delayed fracture observed in stress tests on brittle materials. J. Material Sci., 17, 3227-3234.
Brillinger, D. R. and Preisler, H. K. (1983) Maximum likelihood estimation in a latent variable problem. In Studies in Econometrics, Time Series, and Multivariate Statistics, 31-65 (S. Karlin, T. Amemiya and L. A. Goodman, eds). New York: Academic Press.
Brown, M. B. and Fuchs, C. (1983) On maximum likelihood estimation in sparse contingency tables. Computational Statistics and Data Analysis, 1, 3-15.
Burn, R. (1982) Loglinear models with composite link functions in genetics. In Proceedings of the International Conference on Generalised Linear Models, 144-154 (R. Gilchrist, ed.). Lecture Notes in Statistics, vol. 14. New York: Springer.
Burridge, J. (1981) A note on maximum likelihood estimation for regression models using grouped data. J. Roy. Statist. Soc. B, 43, 41-45.
Businger, P. and Golub, G. H. (1965) Linear least squares solutions by Householder transformations. Numer. Math., 7, 269-276 (reprinted in Handbook for Automatic Computation, Vol. II, Linear Algebra, J. H. Wilkinson and C. Reinsch, eds. Berlin: Springer-Verlag).
Chambers, J. M. (1977) Computational Methods for Data Analysis. New York: Wiley.
Cox, D. R. (1970) Analysis of Binary Data. London: Methuen.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B, 39, 1-38.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1980) Iteratively reweighted least squares for linear regression when errors are Normal/independent distributed. In Multivariate Analysis-V, 35-57 (P. R. Krishnaiah, ed.). Amsterdam: North Holland.

This content downloaded from 138.25.2.146 on Mon, 24 Mar 2014 22:24:16 PM All use subject to JSTOR Terms and Conditions 170 GREEN [No. 2, Dixon,W. J. (ed.) (1981) TheBMDP Statistical Programs. University of CaliforniaPress. Edwards,A. W.F. (1972) Likelihood.Cambridge: University Press. Efron,B. and Hinkley,D. V. (1978) Assessingthe accuracy of themaximum likelihood estimator: Observed versusexpected Fisher information. Biometrika, 65, 457-487. Finney,D. J.(1947, 1952) ProbitAnalysis. Cambridge: University Press. Fisher,R. A. (1925) Theoryof statistical estimation. Proc. Camb. Phil. Soc., 22,700-725. (1935) The caseof zero survivors, In Bliss,(1935). Fletcher,R. and Reeves,C. M. (1964) Functionminimisation by conjugategradients. Computer J., 7, 149-154. Garwood,F. (1941) The applicationof maximumlikelihood to dosage-mortality curves. Biometrika, 32, 46-58. Gentleman,W. M. (1973) Leastsquares computations by Givenstransformations without square roots. J. Inst. Maths.Applic.,12,329-336. (1974) Basicprocedures for large, sparse or weighted least-squares. Appl. Statist., 23, 448-454. Gili,P. E. and Murray,W. (1972) Quasi-Newtonmethods for unconstrained optimisation. J.Inst. Math. Applic., 9, 91-108. Golub,G.. H. (1969) Matrixdecompositions and statisticalcalculations. In StatisticalComputation, 365-397. (R. C. Miltonand J.A. Nelder,eds). NewYork: Academic Press. Golub, G. H. and Styan,G. P. H. (1973) Numericalcomputations for univariate linear models. J. Statist. Comput.Simul., 2, 253-274. Hinde,J. (1982) CompoundPoisson regression models. In Proceedingsof theInternational Conference on GeneralisedLinear Models 109-121. (R. Gilchrist,ed.) LectureNotes in Statistics.Vol 14. New York: Springer. Holland,P. W. and Welsch,R. E. (1977) Robustregression using iteratively reweighted least-squares. Commun. Statist.Theor. Meth., A6, 813-827. Huber,P. J. (1973) Robustregression: asymptotics, conjectures and MonteCarlo. Ann. Statist., 1, 799-821. Jennrich,R. I. and Moore,R. H. 
(1975) Maximum likelihood estimation by means of non-linear least squares. Proc. Amer. Statist. Assoc. (Statistical Computing Section), 57-65.
Jennrich, R. I. and Ralston, M. L. (1979) Fitting non-linear models to data. Ann. Rev. Biophys. Bioeng., 8, 195-238.
Jorgensen, B. (1983) Maximum likelihood estimation and large-sample inference for generalised linear and non-linear regression models. Biometrika, 70, 19-28.
Kale, B. K. (1961) On the solution of the likelihood equation by iteration processes. Biometrika, 48, 452-456.
— (1962) On the solution of likelihood equations by iteration processes. The multiparametric case. Biometrika, 49, 479-486.
McCullagh, P. (1980) Regression models for ordinal data (with Discussion). J. Roy. Statist. Soc. B, 42, 109-142.
— (1983) Quasi-likelihood functions. Ann. Statist., 11, 59-67.
McIntosh, A. (1982) Fitting Linear Models: An Application of Conjugate Gradient Algorithms. Lecture Notes in Statistics, Vol. 10. New York: Springer.
McNeil, D. (1977) Interactive Data Analysis. New York: Wiley.
Moore, R. H. and Zeigler, R. K. (1967) The use of non-linear regression methods for analyzing sensitivity and quantal response data. Biometrics, 23, 563-566.
NAG (1981) Library Manual, Mark 8. Oxford: Numerical Algorithms Group.
Nelder, J. A. and Mead, R. (1965) A simplex method for function minimisation. Computer J., 7, 308-313.
Nelder, J. A. and Wedderburn, R. W. M. (1972) Generalized linear models. J. Roy. Statist. Soc. A, 135, 370-384.
Powell, M. J. D. (1964) An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer J., 7, 155-162.
Pratt, J. W. (1981) Concavity of the log likelihood. J. Amer. Statist. Assoc., 76, 103-106.
Pregibon, D. (1981) Logistic regression diagnostics. Ann. Statist., 9, 705-724.
— (1982) Resistant fits for some commonly used logistic models with medical applications. Biometrics, 38, 485-498.
Rao, C. R. (1973) Linear Statistical Inference and its Applications, 2nd ed. New York: Wiley.
Roger, J. H.
(1982) Composite link functions with linear log link and Poisson error. GLIM Newsletter, December 1982.
Ross, G. J. S. (1980) Maximum Likelihood Program. Rothamsted Experimental Station.
Thompson, R. and Baker, R. J. (1981) Composite link functions in generalised linear models. Appl. Statist., 30, 125-131.
Tukey, J. W. (1977) Exploratory Data Analysis. Reading, Mass.: Addison-Wesley.
Wedderburn, R. W. M. (1974) Quasi-likelihood functions, generalised linear models, and the Gauss-Newton method. Biometrika, 61, 439-447.
— (1976) On the existence and uniqueness of the maximum likelihood estimates for certain generalised linear models. Biometrika, 63, 27-32.

DISCUSSION OF DR GREEN'S PAPER

Dr Bent Jørgensen (Odense University): I am glad to see tonight's paper, and I find the paper difficult to criticize, because it promotes ideas that I have also been suggesting recently. My own starting point was an attempt to understand how GLIM fits generalized linear models. It is not so easy to understand Nelder and Wedderburn's (1972) paper, but it helps when one realizes that Fisher's scoring method may be interpreted as an iterative weighted least squares procedure for any type of distribution, not just for the exponential family. I wonder what led Dr Green on the track?

The iterative methods considered in the paper are all essentially of the form

    β* = β + δb,    b = (DᵀAD)⁻¹ Dᵀ ∂L/∂η,    (a)

where D = ∂η/∂βᵀ and A is a positive-definite symmetric matrix. Since D has full rank, DᵀAD is positive-definite. Hence (a) is a gradient method, and the steplength δ > 0 may be chosen to give an increase in the value of the likelihood compared with the previous iteration (Kennedy and Gentle, 1980, p. 430).

Although the computations in (a) may be performed via least-squares methods, I would like to suggest that the term iteratively reweighted least squares should only be used in connection with exponential families or quasi-likelihood estimation. Otherwise, I find the term too diffuse and ambiguous. I call (a) the delta-algorithm with weight matrix A, a reminder of the similarity between the form DᵀAD and the way asymptotic variance matrices are transformed by the δ-method.

I would like to stress the importance of calculating the steplength properly, because this often makes the difference between success and failure of gradient methods, even for the Newton-Raphson algorithm used with a concave log-likelihood. In Section 2.6 it is asserted that the convergence of the algorithm is slow far from the optimum, but if the steplength is properly computed, the convergence is in fact quick far from the optimum, and the choice of starting value is not important. But the convergence may be slow near the optimum, because the matrix −DᵀAD may be a poor approximation to the second derivative of L, particularly if D is non-constant.

For the resistant estimating equation (28) there is no objective function, so the steplength cannot be calculated in the usual way. It may be useful to define δ as the smallest positive solution to the equation

    u*ᵀW*D*(DᵀWAD)⁻¹DᵀWu = 0,

where δ enters the equation through the updated quantities u*, W* and D*. For maximum likelihood estimation, corresponding to W = I, this value of δ corresponds to the smallest local maximum of L(η(·)) on the half line {β + δb: δ > 0}.
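Dr Jørgensen's prescription of step-halving on δ until the likelihood increases can be sketched in a few lines of numpy. This is a sketch only: the callable arguments `eta`, `jac`, `score` and `loglik` are illustrative names, not anything defined in the paper.

```python
import numpy as np

def delta_step(beta, eta, jac, score, loglik, A):
    """One iteration of the 'delta-algorithm' (a):
    b = (D'AD)^{-1} D' dL/deta, followed by halving the steplength
    delta until the log-likelihood increases.  The function arguments
    are hypothetical callables supplied by the user."""
    D = jac(beta)                      # D = d eta / d beta'
    u = score(eta(beta))               # dL/deta at the current eta
    b = np.linalg.solve(D.T @ A @ D, D.T @ u)
    L0 = loglik(eta(beta))
    delta = 1.0
    while loglik(eta(beta + delta * b)) <= L0 and delta > 1e-8:
        delta /= 2.0                   # halve steplength until L increases
    return beta + delta * b
```

In the Normal linear case with A = I the step reduces to ordinary least squares in a single iteration, which makes a convenient check of the algebra.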
As suggested in the paper we may take A as either the observed or expected information matrix for η, but other choices for A are possible. If observed information is used when L(η) is not concave, a simple modification of the Cholesky decomposition (Kennedy and Gentle, 1980, p. 445) may be used to obtain a positive-definite A. If the log-likelihood is of the additive form (21), two possible choices for A are as a diagonal matrix with elements either

    A_ii = −(∂L_i/∂η_i)/(η_i − η̃_i)    (i = 1, ..., n)

or

    A_ii = 2{L_i(η̃_i) − L_i(η_i)}/(η_i − η̃_i)²    (i = 1, ..., n),

where η̃_i maximizes L_i(η_i). These choices are applicable if L_i is unimodal, and the first generalizes (17), whereas the second corresponds to an algorithm proposed by Ross (1982). I have used (17) to do L1 regression in GLIM, and with a minor modification of A_ii for η_i near y_i this works well, and converges in less than twenty iterations. This and other exercises with the delta-algorithm suggest that by judicious choice of A, the delta-algorithm may often cope with "notoriously

difficult" problems. The choice of weight matrix for the delta-algorithm is discussed by Jorgensen (1983b).

The proper definition of residuals depends on the purpose for which they are intended, but as a rule there should be a one-to-one relation between residuals and observations, such that a large residual may be interpreted as an outlying observation. In general, the concept of an observation is somewhat elusive, but often we may parametrize the model in such a way that the components of η correspond to observations, in some sense. In the case y ~ N_n(η, Σ), with Σ known, any definition should ideally produce y − η̂ as the vector of residuals, so let us try to apply the various definitions of residuals in Section 5.1 to this example. The score residuals give the vector of residuals Bᵀ(y − η̂), where B is the square root of Σ⁻¹, and hence fail the test, because the ith residual is a function of several y_i − η̂_i. Similarly, (23) fails, whereas Σu(η̂) = y − η̂ gives the correct residuals. For (20), it may be shown that a one-step approximation to determine the supremum, taking one step of Fisher's scoring method starting at η̂(β̂), gives Δ_i as the square of the ith component of the vector

    K⁻¹(Σ⁻¹ − Σ⁻¹D(DᵀΣ⁻¹D)⁻¹DᵀΣ⁻¹)(y − η̂),

where K is diagonal. The main term of this expression is K⁻¹Σ⁻¹(y − η̂), so this definition also fails. The definition I prefer is to take d^{1/2}A⁻¹u as the vector of residuals, where d = diag{A_11, ..., A_nn}, although this may be problematic if the log-likelihood is not concave. For the multivariate normal distribution this definition yields (y_i − η̂_i){(Σ⁻¹)_ii}^{1/2} as the ith residual. Obviously, more work needs to be done on this subject.
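Dr Jørgensen's remark above about doing L1 regression by iterative reweighting can be sketched as follows. This is a sketch only, not the GLIM macro he used: for L_i = −|y_i − η_i| the weight blows up when η_i is near y_i, so an ε-guard stands in for the "minor modification" he alludes to, and all names are illustrative.

```python
import numpy as np

def l1_regression_irls(X, y, eps=1e-6, n_iter=50):
    """Least absolute deviations (L1) regression by iterative
    reweighting: weight A_ii = 1/|y_i - eta_i|, guarded by eps
    when the residual is near zero."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting value
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), eps)      # diagonal weights A_ii
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)  # weighted LS step
    return beta
```

For an intercept-only design the L1 fit is the sample median, which gives a quick check of the recursion.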
My final comment is on the resistant fitting methods discussed in Section 5.2, and it probably reveals my ignorance on the subject. Apparently, resistant methods discard outlying observations from the fit, but according to the ultimate paragraph of the paper, this is not the purpose of the method. If one intends to find outlying or influential observations, one might instead use the regression diagnostic methods of Pregibon (1981). But then, what is the purpose of resistant methods?

Iterative techniques for maximum likelihood estimation resemble scientific investigation, in the sense that many small steps are taken in the direction of an unknown optimum. The difficulty with scientific investigation is that the objective function is not always clearly defined, and the weights attached to various issues vary from individual to individual. But I find that tonight's paper takes a good step in the right direction, and it is a pleasure for me to propose the vote of thanks.

Mr R. Thompson (Animal Breeding Research Organisation, Edinburgh): My first comments are related to the very generality of the approach, giving me difficulty in embedding specific problems into the framework. The colour inheritance of North Ronaldsay sheep (Ryder et al., 1974) illustrates the point. Over 10 years ago I went through algebra similar to that in the ABO example, with a succession of models with over 100 terms to be differentiated and little structure in the equations. Later I realized that the models could be concisely represented by a loglinear model for the underlying genotypes and a composite matrix, C, linking the phenotypes to genotypes. The generalized linear model approach needed only slight modification using functions of the distribution as weights, and working dependent and regressor variables. In other examples the C matrix allows transformation from one scale, allowing easy expression of the expectation of observations, to another, allowing independence of the observations, for example in Dr Green's Example 2.2.
I hope that the next version of GLIM will allow the concise specification of C with model formulae, although the calculations are easily programmed in GENSTAT.

Again on the subject of generality I would like to have rules for parameterization for quick convergence. Why should log p be better than p in the ABO example? Although I seem to be the only person to have used them, the calculations for the speed of convergence of the EM algorithm (Dempster et al., 1977) seem useful.

On the definition of residuals it seemed, at least in the written version, that the non-diagonal information matrix, A, was motivating the sequential definition of residuals. Estimation of parameters imposes correlation between residuals and this is taken account of in the cross-validatory definition (20). The definition of residuals should depend on the use intended for them, and if A is non-diagonal then some statistic dependent on cross-products of suitable residuals

might be informative. Can one expect the definition of residuals to be useful with binary data, when the residual deviance can be uninformative?

Given the confusion between (25) and (26), can the resistant methodology be interpreted as a formal model being a mixture of the data and the fitted model?

Over the years I can remember coming to hear important papers by Lindley and Smith, Curnow and Smith, Wilkinson et al. and Nelder, and also managing to attend Smithfield Show. This year is no exception and I think Dr Green's paper will be as well-thumbed as these papers, and so it gives me great pleasure to second the vote of thanks.

The vote of thanks was carried by acclamation.

Professor Murray Aitkin (University of Lancaster): This paper is a welcome addition to the small number of papers demonstrating the power and flexibility of IRLS approaches to maximum likelihood. As a confirmed GLIM user I will restrict my comments to three GLIM-related points:

(i) Censoring of observations presents an awkward problem for GLIM because of the different link function and (implicit) error model for the censored and uncensored observations. Composite link functions are complicated and not constructed easily, if at all, in this case. Alternative maximum likelihood procedures using special features of the likelihood function work for right-censored observations from the Weibull and extreme value distributions (Aitkin and Clayton, 1980) and for the left- and right-censored observations from the logistic and log-logistic distributions (Bennett, 1983). For the normal and lognormal distributions the EM algorithm can be used (Aitkin, 1981), but the general IRLS approach described by the author is much simpler.

(ii) The discussion of robust and resistant methods is not reassuring. Without a probability model the "tuning constant" approach seems completely ad hoc. Model-based procedures are not difficult to construct, as the author notes in referring to the t-distribution method of Dempster et al. A likelihood-based "sensitivity analysis" equivalent to the tuning constant approach is possible by fitting the t-distribution for a fixed degrees of freedom k, and constructing a profile likelihood over k. This would clearly demonstrate the need, if it existed, for robust estimation.

(iii) The example in Section 5.3 seems to me an alarming example of the misuse of resistant methods. Dr Green has made it clear that this example is an illustration of how such methods fit into the IRLS approach, rather than an example of their value. On the probit scale, the regressions are not parallel for the complete set of data, so a simple comparison of relative potency of the three treatments is not possible.
However the lines for Deguelin and Mixture are nearly parallel, and can be set equal with little change in deviance. Thus the relative potency of R to D or M depends on dose level, but that of D to M does not. The "discrepant observations" 11, 16 and 17, when removed, allow a common slope to be fitted. The simpler model is achieved at the expense of, not three observations, but 142. Surely this is bad modelling practice, even if one is interested only in the relative potencies at high dose levels.

Examination of the full probit model for the full data shows a bad fit of the model at points 10 and 11. Curvature of the Deguelin points on the probit scale is quite noticeable. A complementary log-log link gives a considerable improvement, with a deviance of 16.09, instead of 20.13 for the 6-parameter model, and this is unchanged if the slopes for M and D are set equal. On the CLL scale there are no large residuals.

Surely it is better to present a model for the whole data which includes a necessary interaction, rather than removing the part of the data which establishes the need for the interaction and presenting a simpler model. To repeat, this is not a criticism of the paper, which I have found stimulating and valuable.

Mr R. Burn (Brighton Polytechnic): I would first like to comment on the use of IRLS for fitting multinomial models in genetics. Loglinear models with linear composite link functions (Thompson and Baker, 1981) which, as Dr Green points out, constitute a special case of his general class of models, are proving to be a convenient way of formulating multinomial models in which the cell probabilities are polynomial functions of the parameters to be estimated. With μ denoting the vector of expected cell counts, the structure of these models is

    μ = Cγ,    γ = exp(ξ),    ξ = Xβ,

where β is usually a vector of log gene frequencies or other genetic parameters. Although Dr Green's direct IRLS approach to such problems, exemplified by the ABO problem, is interesting,

the composite link formulation appears to have at least two advantages. Firstly, there is no need to find analytic derivatives, which could require considerable effort for some problems. Secondly, the C matrix usually has some genetic meaning. In many cases C is an expression of the dominance relationships among the various alleles. For both of these reasons, together with the relative ease with which these models can be fitted in GLIM or GENSTAT, composite link models offer the prospect of a routine practical method for estimating genetic parameters. On the other hand, multinomial models do occur in which the cell probabilities are more complicated functions of the unknown parameters. There are problems, for instance, in which they are rational functions. Such models do not fit so naturally (if at all) into the composite link framework, and Dr Green's direct attack on the likelihood would be a useful alternative.

The other point that I would like to make concerns Dr Green's remark that packages such as GLIM or GENSTAT have no facilities for fitting models with a non-diagonal weight matrix. Although this appears to be true, it is still possible to implement the general algorithm, at least in GENSTAT, by first diagonalizing the matrix A. Since A is assumed positive definite and symmetric, there is an orthogonal matrix P such that PᵀAP = Λ, say, where Λ is the diagonal matrix of eigenvalues of A, which are positive. The IRLS step then becomes: regress Λ⁻¹ũ + D̃β onto the columns of D̃ with weights Λ, where ũ = Pᵀu and D̃ = PᵀD. I have tried this in GENSTAT on two of the examples in the paper, the ABO problem and the grouped continuous multinomial model for the A-level score data, with essentially the same results as the author. I would not suggest that this is necessarily an efficient way of computing IRLS solutions, but it may be convenient for those familiar with GENSTAT.
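Mr Burn's diagonalization device can be sketched in numpy; this is a sketch only, with an illustrative function name, and its correctness check is that the step agrees with the direct non-diagonal computation β + (DᵀAD)⁻¹Dᵀu.

```python
import numpy as np

def irls_step_diagonalized(D, u, A, beta):
    """One IRLS step with a non-diagonal weight matrix A, performed via
    the eigendecomposition P'AP = Lambda so that only diagonal weights
    are needed (Burn's device for GENSTAT)."""
    lam, P = np.linalg.eigh(A)         # A = P diag(lam) P'
    u_t = P.T @ u                      # transformed score, u-tilde
    D_t = P.T @ D                      # transformed derivative matrix, D-tilde
    z = D_t @ beta + u_t / lam         # working variate
    # weighted least squares of z on the columns of D_t with weights Lambda
    return np.linalg.solve(D_t.T @ (lam[:, None] * D_t), D_t.T @ (lam * z))
```

Since D̃ᵀΛD̃ = DᵀAD and D̃ᵀΛz = DᵀADβ + Dᵀu, the diagonalized regression reproduces the non-diagonal step exactly.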
I would like to join the previous discussants in thanking Dr Green for a stimulating paper.

Dr Susan R. Wilson (Australian National University): Tonight's paper offers an overview of relevant literature on an important topic, namely the numerical evaluation of maximum likelihood estimates. The reasons for choosing one type of numerical method rather than another are complex, and some of these are outlined in Section 2.6. One further advantage of iterative least squares procedures is that they are more likely than the alternative methods to have convergence difficulties: often closer inspection of the data will then reveal that the model is not at all appropriate for the data. One example where use of a quasi-Newton procedure led to an unacceptable model being adopted is given in Guerrero and Johnson (1982). The data were being used to predict the probability of female participation in the Mexican labour force as a function of dichotomous variables, locality, age, income and schooling. Guerrero and Johnson considered a model with the Box-Cox power transformation (with parameter λ) of the odds ratio as a linear function of the explanatory variables, thereby extending the logistic regression model (λ = 0). They concluded that λ̂ = −6.6, and that the interaction between age and income was significant. The (correct) implementation of this extended model in GLIM (involving updating the model matrix on each iteration) fails to converge for these data. This enforces closer examination of the logistic regression model fit, which reveals the reason. There is an aberrant observation, corresponding to "factor combination ac" (locality, income). On removal of this observation and re-fitting of the logistic regression model, a satisfactory fit is obtained, now with an interaction between locality and income in the linear part of the model. (Further details I have given elsewhere.) That their final model was inadequate should have been obvious from the parameter estimates and their corresponding standard errors (Guerrero and Johnson, 1982, Table 2).
Also, other procedures, such as the "resistant analysis" described in Section 5.2, and the use of graphical tools for determining the influence of individual observations on the power transformation, would alert one to the aberrant observation. In general, it should not be necessary to go so far in the overall analysis to find such a glaring discrepancy. To summarize, if IRLS fails, go back and examine the data carefully, before considering use of quasi-Newton methods.

Use of iterative least squares procedures for sparse contingency tables, where there are zeros in any of the marginal configurations defined by a log-linear model, is not well understood. In commenting on Brown and Fuchs (1983), Aston and Wilson (1984) describe how the implementation given by equation (4) cannot identify directly that some of the estimated cell frequencies for the model may be identically zero. Hence some of the parameter estimates and their standard errors will not be correct, although the fitted values will be correct. Stabilization of the parameter estimates and their standard errors can be readily achieved by identification of the occurrence of zeros in any of the marginal configurations, and then constraining the corresponding (zero) cells to have estimated cell frequencies of zero exactly.

Finally, I would like to add that the description of GLIM in Section 3.3 is a very limited one. In using GLIM's OWN directive and associated macros, I find the schematic diagram Fig. D1 (ignoring weights) helpful. Using a notation resembling that in tonight's paper, the diagram shows how the GLIM vectors (given in brackets) can be used, via the macros M1, M2, M3 and M4, to implement a generalized IRLS. Then it can be seen that the GLIM notation is sufficiently flexible (not redundant), that by overwriting the GLIM vectors appropriately a wide variety [far wider than

just the composite link function regression mentioned in Section 3.3 and the discussion] of problems can be solved in this framework. Although usually the quantities in Fig. D1 will correspond respectively to A, D, u, y and so on, they will not necessarily. For example, if A is not diagonal then a singular value decomposition can be used to produce a diagonal Ã with D being correspondingly altered to D̃, and so forth. At which stage one abandons such manipulations and opts for writing a one-off program will depend on the individual. Most statisticians find it usually more efficient in terms of their own time to use a readily-available and familiar package. GLIM also has an OFFSET facility which enables iterative proportional fitting types of steps to be implemented, and so an even wider range of (conditional) maximum likelihood problems can be readily solved (further details are given in Adena and Wilson, 1982).

[Fig. D1. Schematic of a generalized IRLS in GLIM via the macros M1-M4, linking the GLIM vectors (shown in brackets): η (%LP), μ (%FV), y (%YV), the variance function v (%VA) and the link derivative dη/dμ (%DR); the score u = (y − μ)/{v(dη/dμ)}, the working variate z = η + A⁻¹u with A⁻¹ = diag{v(dη/dμ)²}, the update β̂ = (DᵀAD)⁻¹DᵀAz, and the deviance (34).]

Dr J. A. Nelder (Rothamsted Experimental Station): First an etymological quibble: why "reweighted"? If you are iterating you can and do change the weight, but you may also be changing the dependent variate, the covariates, etc. I propose IWLS, not IRLS.

I agree with the proposer in finding sequentially defined residuals unsatisfactory; however, cross-validatory residuals can be defined for non-diagonal covariance structures, giving a (1, 1) correspondence between data values and residuals. They will of course be correlated, but so also are such residuals when the data are independent.

Professor Aitkin has commented on the fact that so-called resistant methods are not resistant to changes in the assumption about the link function. The same is true for variance functions; different variance functions will often lead to quite different subsets of points being given low weight. There are particular dangers in applying resistant methods to binary data, for if for a given unit the fitted value p is small then there are two possible residuals, one small and negative if the datum value y = 0 and one large and positive if it is 1. In a substantial data set there will be a few ones, and these are highly informative about the size of p, yet a resistant analysis will "down-weight" them drastically. I believe that "resistant" is a word that promises much more than it can deliver.

Finally, since the speaker has built his paper around an algorithm, could he not give us a classification of the models that can be fitted, characterized algorithmically, i.e. if some or all covariates change per iteration; similarly the dependent variate, weight, etc.? This might be illuminating.
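Dr Nelder's point about the two possible residuals for a binary observation is easily made numerically. A small sketch, using the Pearson form of the residual for concreteness (the function name and the value p = 0.01 are illustrative, not from the paper):

```python
import math

def pearson_residual(y, p):
    """Pearson residual for a single binary observation y in {0, 1}
    with fitted probability p."""
    return (y - p) / math.sqrt(p * (1.0 - p))

# With p = 0.01 a zero gives a small negative residual while a one
# gives a large positive residual, which a resistant fit would then
# downweight despite the one being highly informative about p.
p = 0.01
r0 = pearson_residual(0, p)   # small and negative
r1 = pearson_residual(1, p)   # large and positive
```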

Mr G. J. S. Ross (Rothamsted Experimental Station): Dr Jorgensen has already referred to my use of what McCullagh and Nelder (1983) call "deviance residuals". These are quite simple to calculate, and enable the log likelihood to be expressed directly as a sum of squares when the error distribution is non-Normal (except in the unlikely case where the mode of the log likelihood is other than at η = y). They may be differentiated, and used in any optimization algorithm for minimizing sums of squares, of which the Gauss-Newton method is the simplest. Details of the method are given in Ross (1982).

Direct comparison with IRLS may be made by solving the same problem by both methods from different initial parameter estimates. A graphical comparison in two dimensions is obtained by plotting contours of the log likelihood for the next iteration: large open contours indicate fast convergence. In all comparisons made so far the Gauss-Newton on deviance residuals is an improvement on IRLS. The reason is that the solution locus, when plotted in the space of deviance residuals, is more linear than in data space. Further improvements may be achieved by stable parameter transformations, particularly with general non-linear models, but the improvements would also apply to IRLS. Stable transformations improve linearity and orthogonality within the solution locus.

As a contributor to statistical packages (MLP, GENSTAT) I would claim that it is preferable to use their optimization facilities rather than rely on the IRLS method, especially when the problem is unfamiliar. IRLS needs the correct algebraic specification and good initial estimates, but may still be slow or divergent. The optimization applications require only the model formula and the distribution of errors, there is protection from divergence, and parameter transformations are easily incorporated.
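The device Mr Ross describes, expressing the log likelihood as a sum of squared deviance residuals, can be sketched for Poisson errors. This is a sketch only (the function name is illustrative): each residual is the signed square root of 2{L_i(y_i) − L_i(μ_i)}, so the residual sum of squares is the deviance.

```python
import numpy as np

def poisson_deviance_residuals(y, mu):
    """Deviance residuals for Poisson errors: signed square roots of
    2{L_i(y_i) - L_i(mu_i)}, so that sum(r**2) equals the deviance and
    Gauss-Newton least squares can be applied to the r_i directly."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)  # y log(y/mu), 0 at y=0
    return np.sign(y - mu) * np.sqrt(2.0 * (term - (y - mu)))
```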
I cannot agree with Dr Wilson that the inefficiency of a method is a virtue if it shows data to be inadequate. There is no excuse for failing to interpret results fully.

I thank the author for a paper that has stimulated a useful discussion.

Professor D. M. Titterington (University of Glasgow): One approach to the approximate solution of equations (3) which has not been touched on in the paper is the recursive one. Let V_n⁻¹ denote the matrix DᵀAD based on n observations and suppose the observations are independent. Then V_n⁻¹ − V_{n−1}⁻¹ is a matrix of rank 1 and V_n is easily computed from V_{n−1}. A sequence of recursive estimators for β can then be generated by

    β̂_n = β̂_{n−1} + V_n S(y_n, β̂_{n−1}),    n = 1, 2, ...,

where S(y, β) denotes the score vector for a single observation, y. Admittedly {β̂_n} usually depends on the order in which the observations are dealt with, but the recursion is simple and sometimes reassuring asymptotic properties can be obtained via stochastic approximation theory. The case of linear logistic regression is dealt with by Walker and Duncan (1967); see also Goodwin and Payne (1977, Chapter 7). Titterington (1984) looks at recursive estimation based on incomplete data and also notes an interesting parallel between the general method of scoring and the EM algorithm. The former is defined by, say,

    β_r = β_{r−1} + {nI(β_{r−1})}⁻¹ ∂L(β_{r−1})/∂β,    r = 1, 2, ...,

where I(β) is the information matrix from one observation, typically hard to compute and invert in incomplete-data problems. An alternative iteration is obtained by using, instead of I(β), the much simpler I_c(β), corresponding to a complete observation. This iteration is sometimes exactly the EM algorithm and often very close to it.
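Professor Titterington's rank-1 recursion can be sketched in the simplest Normal-linear case, where the score for one observation is x(y − xᵀβ), the rank-1 increment to V_n⁻¹ is xxᵀ, and the Sherman-Morrison formula updates V_n without refactorization. A sketch only: the function name and the diffuse-start constant v0 are illustrative.

```python
import numpy as np

def recursive_ls(X, y, v0=1e6):
    """Recursive estimation in the Normal linear case:
    V_n^{-1} - V_{n-1}^{-1} = x_n x_n' has rank 1, so V_n is updated by
    Sherman-Morrison; beta is then moved by V_n times the score.
    v0 is a large prior variance standing in for a diffuse start."""
    p = X.shape[1]
    beta = np.zeros(p)
    V = v0 * np.eye(p)
    for x, yi in zip(X, y):
        Vx = V @ x
        V = V - np.outer(Vx, Vx) / (1.0 + x @ Vx)   # rank-1 update of V_n
        beta = beta + V @ x * (yi - x @ beta)       # score step
    return beta
```

With a diffuse start the recursion reproduces ordinary least squares to high accuracy, whatever the order of the observations in this linear case.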

Professor A. C. Atkinson (Imperial College, London): This interesting paper complements both the recent published and unpublished work of Jorgensen and also the book of McCullagh and Nelder (1983). My comments are however addressed mainly to Section 5.2 of the paper.

I remain unclear about the operational difference between robust and resistant procedures. In principle the distinction is clear, but, as Huber comments (1981, p. 7), the procedures are, for all practical purposes, the same. Even so, because of the effect of leverage, (25) or, not equivalently, (26), may not provide what is required. For observations at remote points in the space of the carriers, that is with h_i = x_iᵀ(XᵀX)⁻¹x_i approaching one, the least-squares residuals will be small, regardless of the observed value of the response. Such observations will not be down-weighted and the procedure will not be resistant. Procedures which do allow for the effect of leverage include the limited influence regression of Krasker and Welsch (1982), further discussed by Huber (1983). But there remain some problems of interpretation (Cook and Weisberg, 1983).

If there are outlying observations at points of high leverage, use of the author's robust/resistant procedure may lead to the down-weighting of observations with low leverage, some of which may have large residuals due to the distortion of the fitted model by the outliers. An example of this is provided by Ruppert and Carroll (1980) whose robust method was trimmed least squares. In their analysis of some data on salinity, the conclusion of one analysis was that observations 1, 13, 15 and 17 should be trimmed. I have elsewhere (Atkinson, 1982) contrasted this rather muddy conclusion with the results of a diagnostic analysis which clearly demonstrates that observation 16 has an extreme value of one of the carriers. It has however to be admitted that this value, although not its importance, can be detected by visual inspection of the marginal values of the explanatory variables.
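Professor Atkinson's leverage effect is easy to reproduce numerically. A sketch, with an invented design and responses (the last point is remote in x with a grossly wrong response, yet its least-squares residual stays small):

```python
import numpy as np

def hat_diagonals(X):
    """Leverages h_i = x_i'(X'X)^{-1} x_i, the diagonal of the
    hat matrix H = X(X'X)^{-1}X'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# A remote point in the space of the carriers (high leverage) drags the
# fitted line towards itself, so its residual is small regardless of
# the observed response: Atkinson's caveat about (25) and (26).
x = np.array([0., 1., 2., 3., 20.])
X = np.column_stack([np.ones(5), x])
y = np.array([0., 1., 2., 3., 0.])          # last response is an outlier
h = hat_diagonals(X)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
```

The outlying point has leverage near one and the smallest residual in absolute value, so a residual-based downweighting rule would never touch it.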

Dr R. Gilchrist (Polytechnic of North London): It is a pleasure to add my congratulations to Peter Green on a thought-provoking paper. In the time allowed, I shall confine myself to discussing three aspects of the paper.

Firstly, Dr Green is indeed correct that for "nuisance parameters", a two-stage algorithm can be more successful in achieving convergence than a single-stage updating algorithm. For example, in Scallan et al. (1984), it is shown how such an algorithm can be usefully employed in the exponential family framework to estimate what we call a parametric link function, i.e. where E(Y_i) = μ_i = h(Σ_j x_ij β_j, k), for some known function h. This works well, for example, for the Box-Cox link with unknown exponent or for estimating the shape parameter and asymptote of the generalized logistic curve defined by

    ln(μ_i) = ln(k₁) − k₂ ln(1 + exp(−η_i/k₂)).

Moreover, this two-stage technique is easily incorporated into GLIM and is closely related to direct likelihood inference, as discussed by Aitkin (1982). For example, for one-dimensional k, our technique maximizes the likelihood by moving up the "profile likelihood".

Dr Green is correct in asserting that GLIM 3 is not designed to handle non-diagonal A. However, as several other discussants have noted, it is possible to handle non-diagonal A in GLIM 3. For example, you can use a Cholesky decomposition as in Section 2.5 and incorporate this in the design matrix. Examples of this procedure for some standard models should shortly appear in the GLIM Newsletter (Scallan, 1984); however, this technique is clearly somewhat tricky for the non-specialist. It would not be difficult to change the GLIM code to allow non-diagonal A, although at the cost of some efficiency. However, the problem remains as to what form of A should be made available.

My final comments concern residuals. Firstly, orthogonal residuals have some appeal to me, despite their lack of independence for non-Normal models.
For example, the Givens algorithm (available in GLIM 4) gives the so-called recursive residuals used by Brown et al. (1975). These can be useful for detecting changes in a model over time, even if they do not pinpoint discrepant observations. Secondly, the discussion in Section 5.1 may not make it crystal clear that the standardised residuals of GLIM 3 do not take into account the correlation between an observation yi and its fitted value μ̂i. Thus Gilchrist (1981) and Pregibon (1982) have suggested what Gilchrist refers to as adjusted residuals, which to some extent take into account this correlation. Moreover, Pregibon (1982) shows that it is these residuals which give the score test statistic corresponding to the hypothesis that a given observation is an outlier. (These adjusted residuals are trivially computed in GLIM 3 by dividing GLIM's standardized residuals by √(1 − hi), where h = %WT*%VL/%SC.) Moreover, this technique can be applied in a similar way to the case where the observations are correlated. Finally, Dr Green's consideration of changes in η on the L-scale certainly seems valuable; however, I have the feeling that selected transformations of η − η(β̂) may yield better tests against certain alternative hypotheses.
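The adjusted residuals just described are easy to compute outside GLIM as well. A sketch, assuming the variance and IRLS weight vectors come from an already fitted GLM (the function name and arguments are illustrative, not GLIM syntax):

```python
import numpy as np

def adjusted_residuals(y, mu, var, X, w):
    """Standardised residuals divided by sqrt(1 - h_i), where h_i is the
    leverage from the weight-scaled design matrix: the adjusted residuals
    of Gilchrist (1981) and Pregibon (1982), up to scale."""
    r_std = (y - mu) / np.sqrt(var)
    Q, _ = np.linalg.qr(X * np.sqrt(w)[:, None])  # weighted design
    h = np.sum(Q**2, axis=1)                      # leverages
    return r_std / np.sqrt(1.0 - h)
```

For a Binomial GLM one would take var as the fitted binomial variances and w as the IRLS weights at convergence; since 0 < hi < 1, the adjustment always inflates the standardised residual.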

Mr A. C. Davison (Imperial College, London): I have a few remarks pertaining to the deviance residuals Δi defined at equation (20) of the paper. Suppose that independent observations yi have common continuous distribution function F(y; ηi, α), predictors ηi = ηi(β), and that α is a shape or scale parameter common to all the observations: σ in the Normal linear model, the shape parameter of the gamma or Weibull distributions, and so on. In such a situation one would often work with the signed square roots

rD(yi; ηi, α) = sgn(tmax) Δi^{1/2}

of the deviance residuals Δi, where tmax is the value of t for which the supremum in (20) is attained, and wish to treat them as approximately independent standard Normal variates. In particular, when the rD are ordered and plotted against Normal order statistics, the resulting graph should be close to a straight line of unit slope through the origin if the rD are to be useful in assessing closeness of fit of F(.) to the data. How useful is this graph for a given distribution F(.)? This problem may be studied by considering the function

G(w) = rD(F⁻¹(Φ(w); η, α); η, α),

which may depend on α, as a function of w. For

F(x; η, α) = Φ((x − η)/α), that is, the standard Normal-theory linear model, rD(y; η, α) = (y − η)/α,

F⁻¹(Φ(w); η, α) = η + αw, and so G(w) = w, as we would require of a sensible measure of closeness to Normality. For the Weibull distribution, a Taylor expansion shows that the function G(w) is to second order in w the expression −0.345 + 1.023w + 0.0035w², independently of the value of α: very close to the required straight line for values of w in the usual range, but with a negative intercept. For the generalized Pareto distribution

F(y; η, α) = 1 − (1 − αy/η)^{1/α},

which is exponential when α = 0, a distribution useful in some extreme-value contexts, the function G(.) depends on the value of α but is quite close to the line G(w) = w for the usual values of w and α. Adjustments to bring the observed deviance residuals rD close to the line G(w) = w may sometimes be made; but it may be sufficient to know approximately the size and direction of the allowance the eye should make when inspecting the rD. The implication seems to be that appropriately defined deviance residuals for continuous distributions F(.) often have properties very close to those of standardized residuals in the Normal-theory linear model, and therefore are likely to be easily interpretable and of potentially wide applicability, even for measuring how close to F(.) the distributional form of the data is. The effect of the data actually having distribution H(.) rather than F(.) may be assessed in an obvious way. These remarks do not take into account the likely effect on the rD of their mutual correlation: that may be quite a different story, and may perhaps be tackled by methods analogous to those of Cox and Snell (1968).

Professor T. Lewis (The Open University): Paraphrasing Vanzetti (Frankfurter and Jackson, 1929, p. xi), I wish to congratulate Dr Green on most of his paper. But Professor Aitkin's remarks on the probit example (Section 5.3) need following up. Students wondering about what work to take up and what statisticians do are often advised to look at the journals and the RSS discussion papers.
It would grieve me to think of them taking at face value Dr Green's words: ". . . clearly there is ample evidence that the regression lines are neither the same nor parallel. Interpretation is easier if the lines are parallel. . . . We therefore persevere with the four-parameter model . . ."; in other words, bugger the data! As Dr Green mentioned, the data in Table 4 come originally from Martin (1942). He feels a bit queasy about omitting the three points 11, 16, 17. Martin, the experimenter, who also omitted these points, talked about breaks in the regression lines, but in fact the deviations that bothered him are non-significant. Dr Green defends his omission of the three points by saying Finney did the same. So what? He cites Finney's argument (Finney, 1947, 1952) that only the behaviour at high concentration x is of interest. On this argument, the low-x points 3, 4, 5 should have been omitted along with 11, 16, 17. But they were conveniently retained. In the third edition of his book (Finney, 1971, p. 233), Finney has a new text:

"So far as fair interpretation of Martin's experiment is concerned, the rejection of anomalous observations may be questioned, but argument about this after so many years is unprofitable. Here the three low doses are omitted, solely for the purpose of illustrating numerical procedures."

Had Dr Green stated that his resistant analysis was solely for the purpose of illustrating numerical procedures, all would have been fine. But on the contrary, he claims ". . . that the result . . . is not just a better fit, but a better fit to a simpler and more interpretable model." Professor Finney magisterially distances himself from the earlier analysis, as if it had nothing to do with him. Actually, if we go back to Martin's paper (Martin, 1942, pp. 69-81), we find it immediately followed by a companion paper by Finney (Finney, 1942, pp. 82-94). Mutual acknowledgements in the two papers suggest that Martin was the experimenter, Finney the statistical guru. Would Martin have thrown out his three points if his guru had advised against? It is easy to criticize; the comeback is "Can you do better?" I do not know, but here is an attempt. First note points 8 and 7 as reminders of the magnitude of the standard errors attaching to the observations. Now look at the same-x points 1(R), 15(M), 10(D), at the closer group 3(R), 16(M), 11(D), and then at 17(M), 5(R), remembering that M is a mixture of R and D and not just any old poison. Are the lines obliged to be parallel? Might they not be concurrent? This would make sense, because at the concentration at which Rotenone and Deguelin have the same toxicity the mixture would plausibly also have this toxicity. Using all 17 points, I fitted the 5-parameter model Yr = Φ⁻¹(Pr) = η + βr(x − ξ) (r = 1, 2, 3).
The fitted point of concurrence (ξ̂, η̂) was (0.4, −1.2), the fitted slopes were β̂R = 4.17, β̂M = 2.83, β̂D = 2.26 (β̂R > β̂M > β̂D, O.K.). The deviance on 12 df was about 24, p ≈ 2 per cent: rather high, but the model is credible. It implies of course that for high concentrations Rotenone is more toxic than Deguelin, and for low concentrations the reverse. I know nothing about these things, but is it possible? Well, I thought, if Dr Green can wheel in a leading authority to back up his case, why shouldn't I try? So I went and asked Steven Rose, the professor of biology at the Open University. He said, yes, indeed, it is perfectly possible for a poison A to be more toxic than poison B at high concentrations and less toxic than B at low concentrations; he could think of a variety of biochemical mechanisms that would generate such a phenomenon. As I was disappearing through the door he also, I think, murmured something about biologists often being too hung up on linear models, but I may have misheard him . . .
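A concurrent-lines probit model of this kind can be fitted by direct maximization of the binomial log-likelihood. The sketch below uses wholly synthetic data (Martin's Table 4 is not reproduced here) and a generic optimizer rather than IRLS; all numbers are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic dose-response data: three poisons at log-concentrations x,
# n subjects per batch, y killed.  NOT Martin's data.
x = np.tile([0.0, 0.4, 0.8], 3)
group = np.repeat([0, 1, 2], 3)
n = np.full(9, 50.0)
beta_true = np.array([4.2, 2.8, 2.3])
y = np.round(n * norm.cdf(-1.2 + beta_true[group] * (x - 0.4)))

def negloglik(theta):
    """Binomial negative log-likelihood for Y_r = eta + beta_r (x - xi):
    lines concurrent at the point (xi, eta)."""
    xi, eta, beta = theta[0], theta[1], np.asarray(theta[2:])
    p = np.clip(norm.cdf(eta + beta[group] * (x - xi)), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

start = np.array([0.5, -1.0, 3.0, 3.0, 3.0])
fit = minimize(negloglik, start, method="Nelder-Mead",
               options={"maxiter": 5000})
```

The five parameters are the concurrence point (ξ, η) and the three slopes; a derivative-free method is used here purely for brevity.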

Dr A. J. Lawrance (University of Birmingham): Tonight's paper seems to me to be important, lucid and clever. The section which most interested me was the one dealing with residuals. It is indeed difficult to define widely acceptable residuals or to agree on the properties they should have. Dr Green mentions the importance of uncorrelated residuals for diagnostic determination of model adequacy. Now I must declare my own interest in residuals from non-linear time series. It is a pity that the main ideas in the present paper do not have easier applications to simply dependent data, such as arise from time series. For non-linear autoregressive models the natural definition of residuals is not entirely clear. Peter Lewis and I have taken to defining linear residuals from non-linear models; such residuals are uncorrelated when the model is appropriate but, because of non-linearity, are not independent. The analysis of the higher order dependence properties of uncorrelated residuals is a problem needing further attention. In the time series domain we have worked graphically, and with various correlation functions of the residuals and their squares. Another trouble with most sorts of residuals is that they are omnibus, and not aimed at picking up particular types of departure from the proposed model. With non-linear time series, one may, for instance, consider further sorts of residuals which are sensitive to the directionality of the process, when this is of interest. Methods based on the standard second order properties will overlook this aspect. I also feel that what is needed now in time series is a class of models which can take advantage of the impressive progress made in regression over the last decade. In conclusion, I join with other contributors in congratulating the author; there is certainly no need to transgress the bounds of RSS English in the discussion of this paper.
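The point that uncorrelated residuals may still be dependent is easily illustrated: a conditionally heteroscedastic series has negligible autocorrelation while its squares are clearly autocorrelated. A sketch (the generating model and seed are illustrative only):

```python
import numpy as np

def acf(z, nlags=5):
    """Sample autocorrelations r_1, ..., r_nlags of the series z."""
    z = np.asarray(z, float) - np.mean(z)
    c0 = z @ z / len(z)
    return np.array([z[:-k] @ z[k:] / (len(z) * c0)
                     for k in range(1, nlags + 1)])

# An ARCH-type construction: res_t = e_t * sqrt(1 + 0.5 e_{t-1}^2).
# The series is uncorrelated, but its squares are not.
rng = np.random.default_rng(1)
e = rng.standard_normal(4000)
scale = np.sqrt(1.0 + 0.5 * np.concatenate(([0.0], e[:-1] ** 2)))
res = e * scale
```

Comparing acf(res) with acf(res**2), as the correlation functions of residuals and their squares suggest, exposes the hidden dependence that a second-order check would miss.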

The following contributions were received in writing after the meeting.

Mr J. E. Besag (visiting Carnegie-Mellon University): It is a pleasure to congratulate my colleague Peter Green on his wide-ranging paper. I wish briefly to comment on Section 5.2, concerning resistant variants of least squares and likelihood methods, for it seems there is a danger that the author has added to the confusion he seeks to avoid. Certainly, it may on occasion have been remarked that IRLS updating, as in (27), seeks to minimize Σ w(zi)zi², a claim which is false unless w(z) ∝ |z|⁻ᵖ and p < 2. However, such statements, from which I shall now admit Dr Green once saved me, are entirely incidental, since the minimization of Σ w(zi)zi², in itself, makes little sense as a general criterion for resistant estimation. The rationale for IRLS here is, equivalently, either through the normal equations, as mentioned by Dr Green, or, as I prefer, in terms of successive updating of the weights in weighted least squares (Mosteller and Tukey, 1977, p. 357; Besag, 1981, Section 2). Thus, in the latter framework, one chooses weights, applies these to obtain a weighted least squares fit, calculates corresponding residuals and, in the light of these, forms new weights to apply at the next iteration. Simultaneous minimization is irrelevant: for example, one might even contemplate use of the weight function w(z) = z⁻², though probably not for long. Lastly, I remind Dr Green that the suggestion of using IRLS with generalized linear models to obtain resistant variants of maximum likelihood fits is not a new one: see Besag (1981, Section 2).
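The successive-reweighting view just described translates directly into a loop: fit, compute residuals, re-form the weights, refit. A sketch in that spirit (the Huber-type default weight function and the MAD scale are illustrative choices, not prescriptions from the paper):

```python
import numpy as np

def resistant_fit(X, y, weight=lambda z: 1.0 / np.maximum(np.abs(z), 1.0),
                  n_iter=20):
    """Resistant regression by successive updating of weights in
    weighted least squares: choose weights, fit, compute residuals,
    form new weights, repeat."""
    w = np.ones(len(y))
    for _ in range(n_iter):
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)   # weighted LS fit
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12    # robust scale
        w = weight(r / s)                            # new weights
    return beta

# One gross outlier barely moves the resistant slope.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 2.0 * x
y[-1] += 50.0
beta_res = resistant_fit(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

No objective function is minimized here; the iteration is defined entirely by the weight-updating rule, which is exactly the point of the framework.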

Professor D. R. Brillinger (University of California, Berkeley): This is a most timely paper. The procedure of iterative reweighting has already found many uses and solved important practical problems, yet its greatest successes would appear to lie ahead. The procedure proves flexible, easy to program, and to have important byproducts. To mention one example of the procedure's power, with a "flick-of-the-wrist" it changes a given estimation method into a robust/resistant one. The first group of scientists to make substantial use of IRLS were quite possibly the seismologists. They continually find themselves dealing with non-linear models and outlying observations. Jeffreys (1936) made use of the procedure in the construction of tables for the travel times of earthquake waves. Bolt (1960) used it in the determination of the location of earthquakes or explosions given data on the arrival times of the waves at various observatories. Incidentally, this last reference introduces the technique of robust/resistant regression some years before statisticians began to study it. Preisler and I (1983, 1984) have employed iterative reweighting together with numerical integration to fit models involving random effects and latent variables. (Two related references are Bock and Lieberman, 1970, and Hinde, 1982.) In the first of our papers we fit a compound Poisson model to multiply-subscripted count data, yijk, given the covariate xijk, with yijk assumed Poisson of mean (αj + βi xijk)ui given the latent variate ui. (There is substantial physical background justifying such a detailed model.) GLIM and iterative reweighting lead quickly to the values of the maximum likelihood estimates. In Section 5.1, Green discusses the definition of residuals in non-linear models. There are also difficulties in defining residuals for random effect and latent variable models. In Brillinger and Preisler (1983) we found "uniform residuals" to be useful in detecting an unsuspected phenomenon.
(Uniform residuals are defined to be the values one obtains on applying the fitted cumulative distribution functions to the individual observations. They provide a definition of residual for any stochastic model.) A problem of quite a different sort that may be handled via the procedure of iterative reweighting is studied in Brillinger and Preisler (1984). The model θ(yij) = φ1(x1ij) + φ2(x2ij) + ei + eij is fit, with the functions θ, φ1, φ2 unknown, with ei a random effect and with eij error. The solution is obtained by numerical integration, iterative reweighting and repeated use of the ACE procedure of Breiman and Friedman (1984). (ACE determines functions θ, φ1, φ2 to maximize the sample correlation between the values θ(yi) and the values φ1(x1i) + φ2(x2i). It seems destined to find many exciting applications in practical statistics.) The option of using weights was added to ACE as an afterthought. This occurrence, and the importance that Green's paper shows IRLS to have, suggest that developers of statistical algorithms should provide weighted versions whenever possible. A word of caution needs to be added, and Green does address this issue. The very simplest of non-linear iterations can lead to oscillating and other non-convergent sequences. (May, 1976, is an easily approached reference to this phenomenon.) Now, because of the finite accuracy of computers, even linear relationships become effectively non-linear. Proofs of theoretical convergence

may not be pertinent. Our experience is that one wants to proceed with the greatest precision allowable. It seems that IRLS may come to dominate much of statistical computation. Green's paper is a very important one.
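The uniform residuals mentioned above are simple to compute: apply each observation's fitted distribution function to the observation itself, and check the results for uniformity. A sketch with a Poisson model (means and seed are illustrative; for discrete data a randomised version is often preferred, but the plain form is shown):

```python
import numpy as np
from scipy import stats

def uniform_residuals(y, cdfs):
    """Uniform residuals: the fitted CDF evaluated at each observation.
    If the model is right these are roughly Uniform(0, 1), whatever the
    stochastic model."""
    return np.array([F(yi) for yi, F in zip(y, cdfs)])

# Illustrative Poisson observations with differing fitted means.
rng = np.random.default_rng(0)
mu = np.array([2.0, 5.0, 10.0, 3.0])
y = rng.poisson(mu)
u = uniform_residuals(y, [stats.poisson(m).cdf for m in mu])
```

Because the construction uses only the fitted distribution functions, it applies unchanged to random-effect and latent-variable models once those are integrated out.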

Dr R. W. Farebrother (University of Manchester): The author assumes that A = BBᵀ is positive definite but does not consider the possibility of small elements on the diagonal of B. In this context Kourouklis and Paige (1981) would recommend the use of

min v'v subject to u + ADβ = ADβ* + Bv

rather than the author's equation (7), which may be rewritten as

min v'v subject to B⁻¹u + BᵀDβ = BᵀDβ* + v.

A similar problem arises in Section 4.1 when f(r) ∝ exp(−c|r|ᵏ), as w(r) = ck|r|ᵏ⁻² and W(r) = (k − 1)w(r) will take very large values if r is small and 1 < k < 2. But the author need not despair of IRLS, as Ekblom (1974) and Sposito and others (1977) have published variants of algorithms (17) and (18) which perform satisfactorily in this context, and Gentleman's (1974) algorithm is designed for use with non-constant weights. Further, we recommend that the early stages of the iterative process should be subjected to small random shocks, as it is possible that the simple iterative procedure will diverge from an apparent solution if subjected to random shocks. See Bohlin's discussion of this point in Wold (1981, pp. 48-50).
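The difficulty with w(r) = |r|^{k−2} blowing up as r → 0 for 1 < k < 2 can be seen, and crudely guarded against, in a few lines. The sketch below is only a generic IRLS for the Lp criterion with a small floor on |r|, a common safeguard; it is not the specific variant of Ekblom or Sposito:

```python
import numpy as np

def lp_regression(X, y, k=1.5, n_iter=50, eps=1e-8):
    """IRLS for minimising sum |y - Xb|^k with 1 < k < 2.  The weight
    w(r) = |r|^(k-2) explodes as r -> 0; flooring |r| at eps is one
    crude safeguard (a sketch, not the cited algorithms)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.maximum(np.abs(r), eps) ** (k - 2.0)
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return beta
```

For a one-parameter location fit the fixed point can be checked by hand: with y = (0, 0, 3) and k = 1.5 the L1.5 estimate solves 2·b^{1/2} = (3 − b)^{1/2}, i.e. b = 0.6.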

Dr M. Green (Polytechnic of North London): My only objection to Dr Green's excellent paper is his suggestion that IRLS is ". . . easily programmed without the aid of packages . . .". His choice of example in 5.3 is interesting since it provides a perfect counter-example to this suggestion. If we follow Finney a little further we find several alternative models that take into account the fact that the "third" poison is a mixture of the other two. I will consider two such alternative models that could be easily understood by a toxicologist lacking the sophistication in Mathematics and Computing necessary to fit such models by programming the IRLS algorithm. Following Finney these models can be called the Independent Action model and the Similar Action model.

Independent Action Model
If we suppose that the poisons act in different physiological ways we could assume that the probability of survival from a mixture is the product of the probabilities of survival from each poison if administered separately. This leads to a model for survival probability of

Φ(α1 + β1x1) Φ(α2 + β2x2),

where x1 and x2 are log doses for poisons 1 and 2.

Similar Action Model
If we suppose that the poisons act in exactly the same physiological way we could assume that the mixture of poisons at doses d1 and d2 is equivalent to a certain dose of poison 1. Taking ρ as the relative potency of poison 1, this leads to a model for kill probability of

Φ(α + β log(d1 + ρd2)).

Both models are plausible but face the modeller with the unenviable task of translation into an IRLS algorithm. The overwhelming success of packages such as GLIM is that for many models the user has only to understand the Statistics and how to use the package, and facilities are provided for the extension of the class of fittable models relatively easily without recourse to programming an interactive system. The model for Independent Action can be formulated as a simple case of the Composite Link Function models (Thompson and Baker, 1981). The specification and analysis of such models can be automated by use of macros, as described in an unpublished paper by Roger and Colman presented at the GLIM82 conference held at PNL. The model for Similar Action can be analysed simply by use of a macro that calculates the derivatives of the function

α + β log(d1 + ρd2)

with respect to the parameters α, β and ρ. Use of this macro and a standard GLIM gives estimates for α, β and ρ of −1.834, 1.216 and 0.4940. In both cases GLIM automatically provides the framework for estimation and extra information such as the deviance (36.26 for the Similar Action model) without the user having to program such calculations.

Professor R. I. Jennrich (University of California at L.A.): The author is to be congratulated for identifying a basically simple idea that has many important applications. We have all too few of these. By noting the broad spectrum of applications of IRLS he has illustrated the fundamental role of regression in statistical analysis. Carrying this perhaps to an extreme, I have long been a proponent of the following unified field theory for statistics: "Almost all of statistics is linear regression, and most of what is left over is non-linear regression." This proposition arose from extensive work on the development of statistical software, primarily BMDP. The heart of almost every program, from ANOVA to multidimensional scaling, looks a good deal like a regression program. Programs that iterate look like non-linear regression, those that do not like linear regression. Viewing analyses like maximum likelihood and robust estimation from the perspective of IRLS produces not just a convenient computing device, but some basic insights as well. Fitting expected responses to observed responses using IRLS with weights inversely proportional to variances is a very natural thing to do. For the exponential family this is precisely maximum likelihood estimation, as has been noted by many including this discussant. As is demonstrated, IRLS provides an insightful way to view robust estimation.
The weights play a natural role in reducing the effect of outlying observations, and the regression formulation has suggested a number of approximate standard error formulae. With regard to their use as a computing device, the author notes in his abstract that IRLS algorithms are easily programmed without the aid of packages. I agree, but for me one of the most rewarding uses of IRLS is that of formulating problems so they can be analysed by standard non-linear regression programs without the need to program anything more than the function to be fitted and the weights used. The key to success here, as the author points out, is finding a formulation that uses ordinary weights rather than generalized weights. The latter are not accepted by most package programs. I will conclude with a technical comment. The starting point for the discussion of maximum likelihood estimation is the Fisher scoring algorithm

Iβ Δβ = sβ,   (*)

where sβ = ∂L/∂β is the score vector for the parameter vector β, Iβ = Eβ(sβ sβᵀ) is the corresponding information matrix and Δβ is the Fisher scoring step. In the process of factoring (*) to give his basic equation (3), the author assumes that L(β) is a composite function L(η(β)) and that L(η) is a likelihood function. The second assumption is not required, and dropping it leads to a useful generalization. Assume only that L(β) is composite. Let u = ∂L/∂η and A = Eη(uuᵀ). Then Iβ = DᵀAD and sβ = Dᵀu, and (*) becomes formally identical to the author's equation (3). All that is lost are two of the three equations above his equation (3). For example, u may not have expectation zero and hence may not be a score vector. To see what can be gained by this generalization, consider the multinomial likelihood L = Σ yi log pi. Let L(η) = Σ yi log ηi without assuming the ηi sum to 1. Then ui = yi/pi, Aij = δij/pi and Dij = ∂pi/∂βj. A scoring step is obtained by regressing y on D with weights wi = pi⁻¹. This is a more symmetric and, I think, natural formulation of IRLS for the multinomial than that in Section 2.1, and because the weights are now simple, any of a variety of standard non-linear regression programs may be used to carry out the required calculations. Moreover, it is not necessary to eliminate a cell to fit the general theory.

Dr A. Pazman (Mathematical Institute, Slovak Academy of Sciences, Bratislava): An alternative to the measures of discrepancy given in Section 5 could be the Rao distance between η̂ and η(β) in the space of the predictors η. By definition, this distance is given by

D := min ∫_{t'}^{t''} [(dγ/dt)ᵀ E_{γ(t)}{(∂L/∂γ)(∂L/∂γ)ᵀ}(dγ/dt)]^{1/2} dt,   (1*)

where the minimum is taken over all twice differentiable mappings (curves) from t ∈ ⟨t', t''⟩ to γ(t) ∈ Rⁿ which are such that γ(t') = η̂, γ(t'') = η(β). The distance D is invariant under any differentiable reparameterization of η, and in the normal linear case D² is the residual sum of squares. There is no difficulty in computing D if y1, . . ., yn are independent and if L(η) = Σi Li(ηi) (as in equation (2.1)). In this case

D = min ∫_{t'}^{t''} [Σi (dγi/dt)² E_{γ(t)}{(dLi(γi)/dγi)²}]^{1/2} dt.

Hence D is simply the Euclidean distance after the reparameterization

νi(ηi) := ∫^{ηi} [E_γ{(dLi(γ)/dγ)²}]^{1/2} dγ   (i = 1, . . ., n).

For example, if we take the binomial model in Section 1.1 we have η̂i = yi/ni and

E_{ηi}{(dLi/dηi)²} = ni/{ηi(1 − ηi)}.

Thus

νi(ηi) = 2√ni arcsin √ηi,

and

D²(β) = Σi [νi(η̂i) − νi(ηi(β))]² = Σi 4ni [arcsin √(yi/ni) − arcsin √ηi(β)]²,   (2*)

where ηi(β) is given in (A), Section 1.1. It follows from (2*) that

−(1/2) ∂D²(β)/∂βj = Σi 2ni [arcsin √(yi/ni) − arcsin √ηi(β)] {ηi(β)[1 − ηi(β)]}^{-1/2} ∂ηi/∂βj

= Σi {yi − ni ηi(β)}/{ηi(β)[1 − ηi(β)]} ∂ηi/∂βj + o(‖η̂ − η(β)‖)

= ∂L/∂βj + o(‖η̂ − η(β)‖).
Hence for η̂ which is near to the set {η(β): β ∈ Rᵖ}, the ML estimate is almost equal to the estimate obtained by minimizing the Rao distance.

Dr A. N. Pettitt (University of Technology, Loughborough): I was interested to read Dr Green's reference to use of the IRLS method in the linear regression problem in Section 4. I have been working on methods of making estimation techniques robust with censored and grouped observations. One approach is with ranks; see, for example, Pettitt (1983). Another is to use the EM algorithm (Dempster et al., 1977) when applied to the regression problem with normal errors and censored and grouped observations. This gives IRLS as the method of estimation. Similarly,

as the author notes in Section 4.1, when the EM algorithm is applied to the regression problem with "normal/independent" errors, such as the t distribution, the IRLS method again results. If one combines the "normal/independent" errors with censored and grouped observations, then, again, the EM algorithm results in IRLS estimates. In this particular case two expectations in the E stage have to be carried out, and it is important to get the order correct or otherwise IRLS estimates do not result. Explicit results can be found for finite scale mixtures of normal distributions and for the t distribution with even degrees of freedom, therefore giving IRLS estimates which are robust to outliers. Estimates can be found using GLIM, for example. My point, as the author correctly notes, is that when the IRLS estimates result as an application of the EM algorithm, then the theory of the EM algorithm guarantees that the fully iterated values are local maxima or turning points of the likelihood function. Obviously, there is no guarantee of a global maximum unless special circumstances prevail. No such theory exists for the IRLS method in general. It would appear necessary to compute the observed information, not the expected information, at the fully iterated root, in order to check that the root gives a local maximum, and to evaluate the likelihood or quasi-likelihood at the steps of the algorithm in order to check the convergence of the technique. The IRLS technique may result in convergence, but it should be asked to what; otherwise, statisticians are showing an extremely cavalier attitude to the general problem of optimization.

Dr J. H.
Roger (Reading University): The general formulation outlined in this paper includes those models described by Thompson and Baker (1981) as Composite Link functions. Composite Link functions form a technique which allows one to fit any non-linear link function to a model with exponential family error function, using a package such as GLIM, by altering the design matrix at each iteration, and is numerically equivalent to what is described here as the Newton-Raphson method with Fisher scoring. I have used this approach for fitting both Variety × Environment interaction models (Finlay and Wilkinson, 1979) and Probits with adjustments C and D for natural responsiveness (Finney, 1971, p. 126). The major problem encountered is making the iterative process converge when the starting values for the parameters β are far from the optimum. It is necessary to monitor the changes in the log-likelihood at each step and alter the step length appropriately. Such a procedure, while certainly not guaranteeing convergence to a global optimum, has allowed the solution of several practical problems. The choice of the form for the matrix A seems to be of less practical importance. Indeed the use of a suitable parameterization of the model seems to be more crucial. The package GLIM has been used by several workers to solve certain IRLS problems. However, it is not ideal. There is a need for a computing environment which has the array manipulative features of APL, device independent graphics, a powerful macro facility (with editing), text handling and, finally, a simple IRLS procedure.

Dr A. H. Seheult (University of Durham): It is a pleasure to congratulate my colleague on a lucid and valuable paper. My comments concern residuals, particularly when the representation (21) does not hold. Although it may appear natural to define residuals in the observation space, surely this is directly valid only when observations and parameters are structurally related, as in linear models.
In the absence of such a structure, a definition of discrepancy in terms of likelihood, such as deviance in equation (20), seems more natural. Often, however, such measures are interpretable as signed squared residuals in the observation space. In the notation of the paper we would like to write

2{L(η̂) − L(η(β̂))} = Σi Δi,

where Δi = 2{Li(η̂) − Li(η(β̂))} is the deviance associated with yi; this partition is exact when equation (21) holds. One way of partially resolving the difficulties in the general case referred to by Dr Green is to use jackknife techniques. Denote by L̄(η) = n⁻¹ L(η) the average likelihood per observation and define the pseudo-likelihood generated by yi to be

Li⁰(η) = n L̄(η; y) − (n − 1) L̄(η; y(i)),

where y(i) is y without yi. Now define the pseudo-deviance associated with yi to be

Δi⁰ = 2{Li⁰(η̂) − Li⁰(η(β̂))}.

Note that Δi⁰ = Δi when (21) holds. Finally, it is interesting to note that Li⁰(η) ≈ log Pr(yi | y(i)), so that L⁰(η) = Σ Li⁰(η) is the "log-likelihood" used by Besag (1975, 1977) to determine maximum pseudo-likelihood estimates for conditionally specified spatial models involving awkward normalizing factors in the full likelihood. The above definition of Δi⁰ assumes that η̂ and β̂ are proper maximum likelihood estimates, but this is not necessary and they could be replaced by pseudo-estimates.

Mr W. D. Stirling (Massey University, New Zealand): Dr Green's paper extends the IRLS algorithm for generalized linear models to a wide range of other important models. Stirling (1984) independently derived the algorithm for models where responses are independent and involve linear functions of parameters; further applications are described in that paper. In many problems expected second derivatives cannot be easily specified or evaluated. When η = Xβ, Newton-Raphson iterations are an alternative to Fisher's scoring technique and they can also be found by IRLS: a step is given by equation (3) with

A = −∂²L/∂η∂ηᵀ.

Whereas Fisher's scoring technique ensures that A is positive semi-definite, making (4) a valid least squares problem, for Newton-Raphson iterations it may not be positive semi-definite. This, however, is not a problem, since (4) still describes the Newton-Raphson step and can also be easily solved. In particular, many weighted least-squares algorithms, such as Gauss-Jordan and Givens methods, also work with negative weights, allowing (4) to be solved when the responses are independent with A diagonal. Since DᵀAD should be positive semi-definite near the solution, and often in a large region of the parameter space, the algorithms are usually still numerically stable. Gill and Murray (1974) suggest a modification to ensure convergence.
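Both this remark and the earlier comments about non-diagonal A come down to how the step equation DᵀAD Δβ = Dᵀu is solved. A sketch (names illustrative, not GLIM code): use a Cholesky factor A = LLᵀ to reduce to ordinary least squares when A is positive definite, and fall back to the normal equations when, as with Newton-Raphson away from the solution, it is indefinite:

```python
import numpy as np

def irls_step(D, A, u):
    """Solve D'A D (step) = D'u.  If A is positive definite, the
    Cholesky factor A = L L' turns this into ordinary least squares of
    L^{-1}u on L'D; otherwise (e.g. an indefinite -d2L/deta deta' in a
    Newton-Raphson iteration) solve the normal equations directly."""
    try:
        L = np.linalg.cholesky(A)
        X = L.T @ D                    # transformed design
        z = np.linalg.solve(L, u)      # transformed working response
        return np.linalg.lstsq(X, z, rcond=None)[0]
    except np.linalg.LinAlgError:      # A not positive definite
        return np.linalg.solve(D.T @ A @ D, D.T @ u)
```

Either branch returns the same step when both apply, since the least-squares normal equations are XᵀX Δβ = Xᵀz, i.e. DᵀAD Δβ = Dᵀu.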
A consistent estimate of parameter variances is still given by (DᵀAD)⁻¹. The two-part iterations for (β, κ) suggested at the end of Section 3.1 converge very slowly if β̂ and κ̂ are highly correlated. However, if (9) is explicitly soluble for κ, κ can be eliminated by replacing it by its maximum likelihood estimator for fixed η, κ̂(η), thereby reducing the problem to simple IRLS on L(η, κ̂(η)). Finally, I wish to suggest an alternative to the use of ad hoc weight functions in Sections 5.2 and 5.3. Robust/resistant M-estimators for normal linear models are equivalent to maximum likelihood assuming a longer-tailed distribution than the normal. Similar methods can be applied to Poisson and binomial data by fitting negative binomial or beta-binomial distributions with the same means. The extra nuisance parameter can be adjusted to give the desired degree of robustness. An advantage of this approach is that the nuisance parameters can also be estimated from the data, allowing tests of whether robust/resistant fits are needed.

Dr G. Tunnicliffe Wilson (University of Lancaster): This paper will be extremely valuable to those, like myself, who have been aware for a long time of the importance of IRLS but have failed to appreciate the overall structure of the subject. My contribution is confined to a small point in Section 4 on Linear Regression. Most of the densities f which one sees in use for the error term are smooth and symmetric, in which case the weight function w(t) is well defined. It is, however, still quite possible to define w(t) for a smooth asymmetric density provided it is centred on its mode. I have found this useful in a time series context where the residuals in a univariate autoregressive model for a river flow series appeared to have approximately an exponential density. It is pointed out by Dr Green that even for a two-sided exponential density, IRLS cannot be recommended. However, the error model I adopted was the sum of two independent components: e = e1 + e2, where e1 is exponential with mean λ and e2 is Normal(0, σ²).
With λ/σ = 5, say, this gives a reasonable model of the observed error density, and has a plausible interpretation. The resulting density is smoothed by the presence of e₂. It is reasonably tractable, its mode is easily found numerically, and the corresponding weight vector w evaluated. Using only one or two steps of IRLS substantially improves the precision of the autoregressive parameter, and the model fit.

Many statistical packages have linear and non-linear least-squares estimation facilities (including time series modelling) with the option of a fixed weight vector supplied by the user. Internally, they solve the equations (17) with fixed weights. When coupled with calculation and loop facilities this allows the full power of IRLS to be exploited. In GENSTAT, for example, calculation of the above-mentioned weight vector required only three extra lines of code. Dr Green's paper stimulated me to try out this example, and I hope it will encourage many others to use the method of IRLS.
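Dr Tunnicliffe Wilson's scheme, in which a package's fixed-weight least-squares facility is driven in a loop to give full IRLS, can be sketched as follows. This is a minimal stand-in, not a reconstruction of the river-flow example: the weight function used here is the one derived from a standard logistic error density, w(t) = tanh(t/2)/t, and the regression data are invented.

```python
import numpy as np

def w_logistic(t):
    """Weight w(t) = -f'(t)/(t f(t)) for a standard logistic error density:
    tanh(t/2)/t, with the limit 1/2 at t = 0."""
    t = np.where(np.abs(t) < 1e-8, 1e-8, t)   # avoid 0/0 at t = 0
    return np.tanh(t / 2.0) / t

def irls_fixed_weight_loop(X, y, n_iter=500):
    """Repeatedly solve the weighted least-squares equations (17) with the
    weight vector held fixed within each pass, recomputing it between passes."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary LS start
    for _ in range(n_iter):
        w = w_logistic(y - X @ beta)              # weights from residuals
        Xw = w[:, None] * X
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

# Invented data with logistic errors.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + rng.logistic(size=40)
beta = irls_fixed_weight_loop(X, y)
```

At a fixed point the weighted score Xᵀ(w ∘ r) = Xᵀ tanh(r/2) vanishes, which is the M-estimating equation for this density.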

Professor C. F. J. Wu (University of Wisconsin, Madison): If the IRLS method is for minimizing an objective function M, be it a log likelihood or not, it is better to use the changes in M for monitoring convergence. This is because the IRLS or a suitable modification yields a monotone sequence of M values, while the behaviour of the sequence of parameter estimates is less predictable. In the context of the EM algorithm, Boyles (1983) gave an example that exhibits convergence in the likelihood values but oscillates in the parameter estimates. The convergence behaviours of the EM algorithm according to these two criteria are quite different (Wu, 1983). If no such M function exists, as in resistant regression, the changes in the left-hand expression of (26) may be used for monitoring convergence. When the iterative estimation of σ is incorporated in Method (I) (17'), more general convergence results are still available from Dempster et al. (1980) since their treatment depends on the assumption that the ψ function comes from the normal/independent family. Incidentally this restriction to the N/I family can be relaxed by applying Zangwill's Theorem as in Wu (1983).

My only question is on the appropriateness of the definition of Δ*_i in the dependent situation, since it may be quite sensitive to the ordering of the components of η. I congratulate the author for a stimulating paper which has greatly extended the utility of the IRLS method.

The author replied later, in writing, as follows.

I am very grateful to all the contributors for their stimulating comments. Together they provide a substantial discussion that says much about the power and flexibility of the IRLS method. I have arranged my replies by topic rather than in order of discussant, in the hope of picking out and concentrating on common themes. My intention is to reply to points with a computing emphasis, and then move on to the more statistical aspects, whilst recognizing that the distinction is necessarily blurred. First, however, let me comment on some of the less technical matters raised.
My motivation for looking at the subject of this paper was similar to that of Dr Jorgensen. When re-reading Nelder and Wedderburn's paper, I too felt that a more elementary presentation must be possible. The generality arose naturally from this re-formulation, and the practical computing details followed from using the method outside generalized linear models, and from attempting to implement resistant alternatives.

Dr Jorgensen and Dr Nelder both criticize my "IRLS" terminology. However, the term has the advantage of sounding like a computational method (which it is) and remaining neutral as to the statistical principles leading to it (which is appropriate as there are several). I agree with Dr Nelder that the first syllable of "reweighted" is redundant, but feel that the term has by now caught on rather widely and would be difficult to change. It also makes for a more desirable acronym, especially when generalized.

I am not sure of the utility of the classification of problems requested by Dr Nelder. As I see it, the key feature leading to IRLS is the specification of a regression model as a composite likelihood function L(η(β)). When the IRLS step is reduced to ordinary unweighted least squares, (6) or (7), which is how it will usually be performed, none of the working covariates or the dependent variable remain fixed between iterations, except in extremely special cases. The useful simplifications seem to be essentially those mentioned in Section 1.2: that of a diagonal information matrix A, and that in which there is an underlying linear predictor Xβ (not necessarily the same as η) so that the working model matrix changes only in a simple way.

I found the historical comments from Professor Brillinger very interesting; yet again we find

that a popular data-analytic technique was fostered outside statistics. With his delightful "unified field theory" and his other remarks, Professor Jennrich has given new insight into the fundamental role of regression in statistical estimation.

As I had expected, a number of contributors to the discussion commented on the use of GLIM for fitting non-standard models by IRLS. Mr Thompson and Mr Burn find the generality of my approach confusing in comparison. It is I think important to retain this generality of description in order to convey the unity of such a wide variety of estimation problems. That is not to imply that in particular problem areas there is no middle ground between this generality and a specific package solution. The user will not necessarily be required to differentiate hundreds of terms. Let me use as an example the multinomial problems with cell probabilities polynomial in the parameters arising in genetics, and mentioned by Mr Thompson, Mr Burn and also Dr Roger. These could be handled directly, but it is more convenient to use the "Poisson trick" and treat the cell frequencies as independent Poisson observations. Thus y ~ Poisson(η), where η = Cγ and γ = exp(Xβ). In the notation of the paper, u is (y/η) - 1, A is diag(1/η), D is C diag(γ)X, and the deviance is

2 Σ_i {y_i log(y_i/η_i) - y_i + η_i}.

(Here the division and exponentiation operate component-wise.) The differentiation is already done, yet a package is not needed. This specification is just that given via composite link functions by Thompson and Baker (1981), but expressed in terms of the score function rather than GLIM terminology.

In a similar vein, I am delighted to see that the other Dr Green and I are in such close agreement: I in turn find that he provides a perfect counter-example to his point about packages.
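The "Poisson trick" quantities quoted above translate directly into a scoring iteration. A minimal sketch, using exactly u = (y/η) - 1, A = diag(1/η) and D = C diag(γ)X; the matrices C and X and the data are invented for illustration.

```python
import numpy as np

def poisson_trick_step(y, C, X, beta):
    """One Fisher-scoring step for y ~ Poisson(eta), eta = C gamma,
    gamma = exp(X beta), using u = (y/eta) - 1, A = diag(1/eta),
    D = C diag(gamma) X, and the deviance 2 sum{y log(y/eta) - y + eta}."""
    gamma = np.exp(X @ beta)
    eta = C @ gamma
    u = y / eta - 1.0                     # score with respect to eta
    A = 1.0 / eta                         # diagonal information for eta
    D = C @ (gamma[:, None] * X)          # d eta / d beta
    step = np.linalg.solve(D.T @ (A[:, None] * D), D.T @ u)
    dev = 2.0 * np.sum(np.where(y > 0, y * np.log(y / eta), 0.0) - y + eta)
    return beta + step, dev

# Invented composite link: four cells built from three Poisson components.
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta_true = np.array([0.5, 0.3])
y = C @ np.exp(X @ beta_true)             # data placed exactly on the model
beta = np.zeros(2)
for _ in range(40):
    beta, dev = poisson_trick_step(y, C, X, beta)
```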
He takes the probit example in Section 5.3 a little further, exploiting the information that one poison is a known (1:4) mixture of the others by considering two models suggested by Finney. But the methods he suggests for fitting these two models in GLIM are different, and both are different from that by which probit models are fitted without using this mixture information. In fact all of these models, the more general model for synergistic action given by Finney in which the kill probability is Φ(α + β log(d₁ + ρd₂ + κ√(d₁d₂))), the model with complementary-log-log link used by Professor Aitkin, and Professor Lewis's concurrent line probit model, all have the same structure: u_i is ((y_i/n_i) - η_i)A_ii, where A_ii is n_i/(η_i(1 - η_i)), and the deviance is

2 Σ_i {y_i log(y_i/(n_iη_i)) + (n_i - y_i) log((n_i - y_i)/(n_i(1 - η_i)))}.

The differences lie only in the functional form of η(β) and the matrix D of its derivatives. The unity of structure can be reflected in use of essentially the same program, but not, according to Dr Green, if you use GLIM.

Mr Thompson, Professor Aitkin, Mr Burn, Dr Gilchrist and Dr Roger all raise points about estimation problems that are clearly within the scope of IRLS yet are awkward or impossible to handle in GLIM. I agree with all these points, and assume they are directed towards people with more influence in these matters than I. Considerable ingenuity has been shown by some discussants, including Mr Burn, Dr Gilchrist and Dr Tunnicliffe Wilson, in coding some other difficult problems, including some with non-diagonal A, into GLIM or GENSTAT. Dr Wilson finds my description of GLIM limited, but I think she will find that, apart from her comment about a singular value decomposition for non-diagonal A, her points are contained in the second half of the second paragraph in Section 3.3, admittedly rather briefly.

A number of contributors add some flesh or technical details to the bare bones of the IRLS method as I described it, while others propose alternative procedures.
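The common binomial structure noted above, u_i = ((y_i/n_i) - η_i)A_ii with A_ii = n_i/(η_i(1 - η_i)) and the binomial deviance, can be sketched with a probit choice of η(β); only the lines defining η and D would change for another link. The dose values, sample sizes and parameters below are invented.

```python
import numpy as np
from math import erf, sqrt, pi

Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))   # probit cdf
def phi(x):                                                      # its density
    return np.exp(-0.5 * x * x) / sqrt(2.0 * pi)

def binomial_parts(y, n, eta):
    """Link-free pieces: u_i = ((y_i/n_i) - eta_i) A_ii with
    A_ii = n_i/(eta_i(1 - eta_i)), and the binomial deviance
    (assumes 0 < y < n throughout)."""
    A = n / (eta * (1.0 - eta))
    u = (y / n - eta) * A
    dev = 2.0 * np.sum(y * np.log(y / (n * eta))
                       + (n - y) * np.log((n - y) / (n * (1.0 - eta))))
    return u, A, dev

def probit_fit(y, n, logdose, beta, n_iter=25):
    """Scoring iterations; only eta(beta) and D depend on the probit choice."""
    X = np.column_stack([np.ones_like(logdose), logdose])
    for _ in range(n_iter):
        lp = X @ beta
        eta = Phi(lp)
        D = phi(lp)[:, None] * X                          # d eta / d beta
        u, A, dev = binomial_parts(y, n, eta)
        beta = beta + np.linalg.solve(D.T @ (A[:, None] * D), D.T @ u)
    return beta, dev

# Invented dosage-mortality data placed exactly on a probit model.
logdose = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
n = np.full(5, 50.0)
beta_true = np.array([0.2, 1.0])
y = n * Phi(beta_true[0] + beta_true[1] * logdose)
beta, dev = probit_fit(y, n, logdose, np.zeros(2))
```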
Dr Jorgensen is to be congratulated on a very complete coverage of matters such as choice of step length and alternatives to the information matrix A. The basic IRLS method can indeed be adapted to an even larger class of problems, but the programming effort may be greater. As Dr Wilson comments, there is a point at which one gives up and writes a one-off program (or consigns the problem to a general-purpose optimization package). Mr Stirling also discusses choice of A, including possibilities where it does not remain positive semi-definite. These probably take IRLS out of the reach of those who, like Professor Jennrich, prefer to use standard weighted regression packages. The same is true of the very interesting and highly desirable modifications to the IRLS method via constrained optimization, suggested by Dr Farebrother.

Several points are raised about convergence and divergence, and monitoring the iterations. Professor Brillinger reminds us of the unpleasant oscillatory behaviour that may arise in fitting non-linear models and recommends maximum precision, as some of the non-linearity is rounding error. Dr Roger and Professor Wu both advocate monitoring changes in the likelihood. Dr

Roger thereby adjusts the step lengths. The situation described by Professor Wu is disturbing: should there be convergence in the likelihood, but not the parameter estimates, I would like to know! Dr Pettitt notes the absence of any theory in general about the solution to the iteration. The one point on which one can rely is that when iteration ceases at a finite point in parameter space the likelihood equations (1) are satisfied. Quite clearly, many factors may prevent this being an acceptable solution to the estimation problem. My own approach is to note that most of such difficulties will be identified by properly monitoring all aspects of the iteration. It may make for an untidy screen or print-out, but it seems to me essential to write out the current parameter estimates and log-likelihood on every iteration unless the model is an extremely familiar one.

At the end of Section 3.1, I made some brief remarks about use of a two-part iteration in the presence of nuisance parameters. I have found this approach successful in a variety of situations. Dr Gilchrist describes use of this technique when fitting models with parametric link functions, and claims that it is easy to carry this out in GLIM (only for low-dimensional κ, I would suspect, but this is the usual case). Mr Stirling points out that convergence of this two-part process may be slow if β̂ and κ̂ are highly correlated, and suggests eliminating κ by solving equation (9). To me, this rather begs the question, since the result will generally be to destroy the otherwise simple form for η that led me to term the additional parameters a "nuisance" in the first place.
Dr Farebrother recommends incorporating small random disturbances into the early iterations, and this seems to me to be a desirable and convenient complement to the good practice of running the iterative solution from several different starting values: both help to see whether the fitted maximum may be a global optimum. Mr Thompson would like rules for correct parameterization to obtain quick convergence (the experience of Dr Roger shows that these would be valuable) and Mr Ross's mention of stable parameter transformation seems to provide an answer.

Mr Ross also draws our attention to an important rival to IRLS of which I should have been aware. In a model with an additive unimodal log-likelihood, use of the deviance residuals allows the negative log-likelihood to be expressed as a sum-of-squares and hence minimized by the Gauss-Newton method, or some variation of it. Dr Jorgensen points out that this can still be regarded as IRLS, with a suitable choice of A. In fact, in the notation of the paper, but writing z_i = √Δ_i (whether signed or not), the Gauss-Newton step is to regress z on the columns of the matrix (D_ij u_i/z_i) to obtain the step in β. Thus, remarkably, the programming effort may be even less than with Fisher scoring. The information matrix A is still needed for the asymptotic covariance matrix (DᵀAD)⁻¹, or it may be approximated by diag(u_i²/z_i²). Can the idea be extended to the case of dependent observations?

Professor Titterington suggests the recursive solution of equation (3). This approach seems particularly attractive for very large data sets, where the inherent slight inefficiency does not matter, and especially where observations are acquired sequentially and processed in real time.

Ten contributors in all refer to the discussion of residuals in Section 5.1. I certainly agree with Dr Jorgensen, Mr Thompson and Dr Lawrance that the appropriateness of a definition depends on the use to be made of the residual.
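Mr Ross's Gauss-Newton rival, regressing the signed deviance residuals z on the columns of (D_ij u_i/z_i), can be sketched for a Poisson log-linear model; the counts below are invented. A useful check on the sketch is that, algebraically, Jᵀz = Dᵀu, so the fixed points of the iteration satisfy the score equations.

```python
import numpy as np

def gauss_newton_step(y, X, beta):
    """One step: z_i = sign(y_i - eta_i) sqrt(Delta_i); regress z on the
    columns of (D_ij u_i / z_i).  Poisson log-linear case, so
    eta = exp(X beta), u = y/eta - 1, Delta_i = 2{y log(y/eta) - y + eta}."""
    eta = np.exp(X @ beta)
    u = y / eta - 1.0
    delta = 2.0 * (np.where(y > 0, y * np.log(y / eta), 0.0) - y + eta)
    z = np.sign(y - eta) * np.sqrt(np.maximum(delta, 0.0))
    z_safe = np.where(np.abs(z) < 1e-10, 1e-10, z)    # guard the 0/0 case
    J = (eta[:, None] * X) * (u / z_safe)[:, None]    # D_ij u_i / z_i
    step, *_ = np.linalg.lstsq(J, z, rcond=None)
    return beta + step

# Invented counts; fixed points satisfy the score equations X^T (y - eta) = 0.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([2.0, 3.0, 6.0, 14.0])
beta = np.array([np.log(y.mean()), 0.0])
for _ in range(200):
    beta = gauss_newton_step(y, X, beta)
```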
In the present context of parametric regression function and probability model, my aim was to suggest methods for assessing the relative importance of the various contributions to the likelihood equations; these might then also be used to determine weights on those contributions in resistant alternatives to maximum likelihood. Viewing the likelihood equations (1) as a sum over components of the predictor vector η, it is natural to attach residuals to these predictors rather than the observations. Observations are in some contexts somewhat illusory, so that I am not concerned that there is not necessarily a one-to-one correspondence between η and y. The diagnostic use of residuals will involve examining the form of the probability model L(η) to find the data points which are associated with a given "discrepant" predictor η_i.

From this standpoint, I agree with the prevailing opinion that the deviances (20) or their signed square roots, the deviance residuals, are appropriate when the likelihood is additive (21). My only purpose in extending the discussion was to treat the case where A, the information matrix for η, is not diagonal. Dr Nelder and Professor Wu criticize the sequentially defined Δ*_i (23). These have the merit of restoring some additivity to the likelihood equations, but the demerit of losing symmetry between the components of η by depending on the ordering of these components. Where observations are obtained sequentially, this is perhaps acceptable, but in general it may be preferable to retain the symmetry. If "greater than" is replaced by "not equal to" in (23) then the {Δ*_i} reduce to the ordinary deviances {Δ_i} of (20). Mr Thompson and Dr Nelder mention "cross-validatory" residuals, where

Mr Thompson is referring to (20); I would prefer to reserve this term for deviances calculated when the corresponding predictor is omitted from the fitting altogether. Dr Seheult provides an original approach via conditional distributions, again with a cross-validatory flavour, and in which residuals are associated with observations not predictors. But providing that estimates are still obtained by maximizing likelihood not pseudo-likelihood, and that all observations are used in this fitting, Dr Seheult's {Δ_i^0} are equal to {Δ_i} in some cases even when (21) fails. For example, this is true for y ~ N(η, Σ) with Σ known, not diagonal, and also for the multinomial distribution with an appropriate interpretation of "missing" observation. How widely does this connection hold? Dr Seheult's suggestion does not succeed in restoring the truth of equation (22), but does provide a useful new viewpoint and deserves further study.

On other residual matters, Mr Davison has provided additional theory for the deviance residuals, which will aid their interpretation, especially for non-Normal linear regression, and Dr Pazman provides an alternative by using the Rao distance; although the interpretation as Euclidean distance in a transformed predictor space is appealing, I wonder if these are not too complicated to use in practice? I cannot agree with Dr Jorgensen that in the simplest non-trivial situation, y ~ N(η, Σ) with Σ known but not diagonal, the ith residual should be y_i - η̂_i, scaled. Surely, if for example a single observation y_i is suspected of being in error, it is its departure from its conditional mean given the fitted model and the other observations that should be examined? As above, this leads exactly to Dr Seheult's proposal and the ordinary deviances (20).
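The conditional-mean residual advocated here for y ~ N(η, Σ) with known non-diagonal Σ can be sketched directly: standardize y_i by its conditional distribution given the other observations, which (a standard multivariate-normal identity) equals (Σ⁻¹(y - η))_i/√(Σ⁻¹)_ii. The covariance matrix and data below are invented.

```python
import numpy as np

def conditional_residuals(y, eta, Sigma):
    """Residual of y_i against N(E[y_i | y_(i)], var[y_i | y_(i)]) when
    y ~ N(eta, Sigma); equivalently (Q e)_i / sqrt(Q_ii) with Q = Sigma^{-1}
    and e = y - eta."""
    m = len(y)
    r = np.empty(m)
    e = y - eta
    for i in range(m):
        idx = [j for j in range(m) if j != i]         # the other observations
        S_oo = Sigma[np.ix_(idx, idx)]
        s_io = Sigma[i, idx]
        cond_mean = eta[i] + s_io @ np.linalg.solve(S_oo, e[idx])
        cond_var = Sigma[i, i] - s_io @ np.linalg.solve(S_oo, s_io)
        r[i] = (y[i] - cond_mean) / np.sqrt(cond_var)
    return r

# Invented positive-definite covariance and data.
rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + 4.0 * np.eye(4)
eta = np.zeros(4)
y = rng.normal(size=4)
r = conditional_residuals(y, eta, Sigma)
```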
Dr Nelder, Dr Gilchrist and Dr Lawrance refer to correlation between residuals. We seem to have learnt to compensate for this in linear models, and I am unhappy at losing the correspondence between residuals and predictors (or observations). Routine use of Gilchrist's adjusted residuals seems entirely practicable and desirable, and the idea should apply to a wider variety of models. Dr Gilchrist's point about assessing η̂ - η(β̂) directly rather than through L(η̂) - L(η(β̂)) is I think answered by considering the case where the former is large and the latter small: does this not imply that apparent discrepancies do not much affect the likelihood for these observations and hence are not of interest?

Professor Brillinger's use of uniform residuals is most interesting. It is a pity that statisticians' vision is trained to assess a nominal Normal distribution in residual plots, but of course uniform residuals can always be re-transformed onto any other preferred baseline distribution.

Several contributors mention models and situations where residuals remain difficult to define. I entirely agree with Dr Nelder and Mr Thompson about binary data, and would not advocate these methods for this case. I have nothing to offer Professor Brillinger for latent variable models or Dr Lawrance for non-linear time series, except perhaps that the latter at least might be happy to define deviances sequentially.

Several contributors express puzzlement or concern over the description in Section 5.2 of resistant alternatives to maximum likelihood estimation in non-linear models. Such ideas, at least for generalized linear models, are not new; see Besag (1981) and Pregibon (1982). As to the purpose of such methods, I can only repeat my assertion that it is not the aim of resistant data-analysis (in the context of model-fitting) to provide a unique objective solution, but rather to examine whether any doubt should be cast on a model fitted conventionally.
It therefore seems to me eminently reasonable to report a bad fit of, say, a probit model to data from a dosage-mortality experiment, coupled with a considerably better fit to all but a few of the data points.

The "operational difference" between robust and resistant procedures, sought by Professor Atkinson, is surely not eliminated by the fact that at the numerical level the same computations may be carried out in each. In robust regression typically one proceeds with a fair amount of faith in a certain error distribution, but takes out an insurance policy to cover departure from this assumption; the resulting robust procedure is used once, in place of a "classical" method, and the fit is quoted as objective. Resistant methods, on the other hand, may involve the same numerical procedures, but used experimentally at varying levels of resistance, and as a supplement to classical methods. Mr Thompson's intriguing suggestion that resistant methods fit a "model" that is a mixture of the data and intended model is I think true, even if the idea is a little circular. It ought to aid our understanding of such methods, but I do not know of any principle for fitting the mixing weights in practice that leads to the resistant estimating equations (26) or (28).

Professor Aitkin and Mr Stirling make related points about the desirability of replacing "ad hoc" weight functions and tuning constants by prescribed alternative probability distributions and (possibly estimated) additional nuisance parameters. To the extent to which these approaches are

actually different, I wonder if we would not then be over-modelling? It should be recalled, incidentally, that since in the resistant approach the tuning constant used is divided into a standardized residual, it does itself have a standardized scale; thus, for example, with the bi-square function, c = 9, 6 and 3 correspond to mild, moderate and high degrees of resistance, for any set of data.

Mr Besag presents an alternative but equivalent rationale for IRLS in resistant methods, and makes the valuable point that there is no need to iterate to convergence, nor indeed any necessity for ultimate convergence at all. This would clearly require careful monitoring of the changing parameter estimates, which as I have stressed is always advisable. There are cases, for example in Table 6, where the fully iterated solution depends on the initial values used, and some where in addition the fit will not iterate away from the maximum likelihood estimates. There may in general be many solutions to the weighted likelihood equations (28) and some of these will not at all correspond to a "fit" in the ordinary sense. Therefore, we ought perhaps to restrict ourselves to initial estimates that do represent credible models: apart from the exploratory and experimental emphasis, the approach does come close to examining the "robustness" of such models.

Dr Nelder observes that resistant methods are not "resistant" to changes in the link or variance functions. This seems entirely appropriate if we are using a model-based notion of discrepancy. However I certainly agree with Dr Nelder about the risks of using resistant methods with binary data.
For example, slight changes in the way in which the method is formulated lead the resistant fit to the vaso-constriction data presented by Pregibon (1982) to go badly wrong: the slopes become infinite and a "perfect" fit is obtained by down-weighting only three observations.

Professor Atkinson notes the omission of any reference in the paper to high leverage and model fit. This is an important matter, but space was restricted. For the binomial/logistic model, I again refer to Pregibon (1982).

Before leaving the subject of resistance, I must return to the probit analysis example that alarms and grieves Professors Aitkin and Lewis. It is of course rather a small data-set (and rather old) to be subjected to such intense scrutiny. I wonder how many different models must be tried before the choice becomes as subjective as the use of resistant methods? Twelve models or minor variants are suggested in the paper and discussion, including those from Dr Green, and all of these fit considerably better without observations 11, 16 and 17. (If Professor Lewis had estimated his parameters by maximum likelihood, incidentally, he would have found a slightly closer fit; also these estimates are heavily influenced by points 11, 16, 17.)

For a data-set of this size, I would hesitate to abandon an apparently accepted probit transformation. Professor Aitkin's use of the complementary-log-log link yields a reduced deviance overall, but closer inspection of this fit reveals that the almost imperceptible curvature in Deguelin is little altered, while clear curvature and uniformly larger residuals appear for Rotenone. Further, Professor Lewis finds Rotenone to be more toxic than Deguelin for high concentrations, and less toxic for low: presumably, therefore, the poisons work through different biological mechanisms, so why should the line for the mixture be concurrent with the others?
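The remark a little earlier that the bi-square tuning constant operates on an already standardized residual scale, so that c = 9, 6 and 3 mean the same degrees of resistance for any data set, can be illustrated with a minimal sketch; the residual values are invented.

```python
import numpy as np

def bisquare(t, c):
    """Tukey's bi-square weight (1 - (t/c)^2)^2 for |t| <= c, else 0,
    applied to an already standardized residual t."""
    r = np.asarray(t, dtype=float) / c
    return np.where(np.abs(r) <= 1.0, (1.0 - r ** 2) ** 2, 0.0)

t = np.array([0.0, 2.0, 4.0, 8.0, 12.0])                 # standardized residuals
weights = {c: bisquare(t, c) for c in (9.0, 6.0, 3.0)}   # mild/moderate/high
```

Smaller c down-weights a given standardized residual more heavily, and zeroes it entirely beyond |t| = c.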
Among the remaining points with a statistical flavour, I am grateful to a number of contributors for bringing new problems to my attention. These include Dr Pettitt with further examples of the EM algorithm reducing to IRLS, namely with censored or grouped data from Normal/independent distributions; Professor Brillinger describes work with random effects and latent variables models; Mr Stirling's paper includes examples with censored data, and negative-binomial and beta-binomial distributions; and Dr Tunnicliffe Wilson describes experience with asymmetric error distributions in linear regression.

Dr Wilson and Dr Ross are in disagreement over the interpretation of IRLS failing to fit a model. I tend to side with Dr Wilson in asking of what use is an optimal fit of a model that is inappropriate. On the other hand, quite complicated and numerically ill-conditioned models are sometimes appropriate, and we may give up too early. Surely the optimum should still be sought, but the optimization not consigned to a black-box prohibiting visual monitoring of the iteration?

Finally, I am pleased to see that Professor Jennrich shows that any initial formulation is insufficiently general, thus widening even further the range of estimation problems where iteratively reweighted least squares can profitably be applied.

REFERENCES IN THE DISCUSSION

Adena, M. A. and Wilson, S. R. (1982) Generalized Linear Models in Epidemiological Research: Case Control Studies. Sydney: Intstat Foundation.
Aitkin, M. (1981) A note on the regression analysis of censored data. Technometrics, 23, 161-163.
Aitkin, M. (1982) Direct likelihood inference. In GLIM 82: Proceedings of the International Conference on Generalized Linear Models (R. Gilchrist, ed.), pp. 76-86. New York: Springer. (Lecture Notes in Statistics, Vol. 14.)
Aitkin, M. and Clayton, D. (1980) The fitting of exponential, Weibull and extreme value distributions to complex censored survival data using GLIM. Appl. Statist., 29, 156-163.
Aston, C. E. and Wilson, S. R. (1984) A comment on maximum likelihood estimation in sparse contingency tables. (In the press.)
Atkinson, A. C. (1982) Robust and diagnostic regression analyses. Commun. Statist., A11, 2559-2571.
Bennett, S. (1983) Log-logistic regression models for survival data. Appl. Statist., 32, 165-171.
Besag, J. E. (1975) Statistical analysis of non-lattice data. The Statistician, 24, 179-195.
Besag, J. E. (1977) Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64, 616-618.
Besag, J. E. (1981) On resistant techniques and statistical analysis. Biometrika, 68, 463-469.
Bock, R. D. and Lieberman, M. (1970) Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.
Bolt, B. A. (1960) The revision of earthquake epicentres, focal depths and origin times using a high-speed computer. Geophys. J., 3, 433-440.
Boyles, R. A. (1983) On the convergence of the EM algorithm. J. R. Statist. Soc. B, 45, 47-50.
Breiman, L. and Friedman, J. H. (1984) Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Ass., to appear.
Brillinger, D. R. and Preisler, H. K. (1984) An exploratory analysis of the Joyner-Boore attenuation data. Bull. Seismol. Soc. America, 74 (in press).
Brown, R. L., Durbin, J. and Evans, J. M. (1975) Techniques for testing the constancy of regression relationships over time. J. R. Statist. Soc. B, 37, 149-192.
Cook, R. D. and Weisberg, S. (1983) Comment on Huber (1983). J. Amer. Statist. Ass., 78, 74-75.
Cox, D. R. and Snell, E. J. (1968) A general definition of residuals. J. R. Statist. Soc. B, 30, 248-275.
Ekblom, H. (1973) Calculation of linear best Lp approximations. BIT, 13, 292-300.
Finlay, K. W. and Wilkinson, G. N. (1963) The analysis of adaptation in a plant breeding programme. Aust. J. Agric. Res., 14, 742-754.
Finney, D. J. (1942) The analysis of toxicity tests on mixtures of poisons. Ann. Appl. Biol., 29, 82-94.
Frankfurter, M. D. and Jackson, G. (eds) (1929) The Letters of Sacco and Vanzetti. London: Constable.
Gilchrist, R. (1981) Calculation of residuals for all GLIM models. GLIM Newsletter, 4, 26-28.
Gill, P. E. and Murray, W. (1974) Newton-type methods for unconstrained and linearly constrained optimization. Math. Programming, 7, 311-350.
Goodwin, G. C. and Payne, R. L. (1977) Dynamic System Identification: Experiment Design and Data Analysis. New York: Academic Press.
Guerrero, V. M. and Johnson, R. A. (1982) Use of the Box-Cox transformation with binary response models. Biometrika, 69, 309-314.
Hinde, J. (1982) Compound Poisson regression models. In GLIM 82: Proceedings of the International Conference on Generalized Linear Models (R. Gilchrist, ed.), pp. 109-121. New York: Springer. (Lecture Notes in Statistics, Vol. 14.)
Huber, P. J. (1981) Robust Statistics. New York: Wiley.
Huber, P. J. (1983) Minimax aspects of bounded-influence regression. J. Amer. Statist. Ass., 78, 66-72.
Jeffreys, H. (1936) On travel times in seismology. Bur. Centr. Seismol. Travaux Sci., 14, 3-86.
Jorgensen, B. (1983b) The delta algorithm and GLIM. Unpublished manuscript.
Kennedy, W. J., Jr and Gentle, J. E. (1980) Statistical Computing. New York: Marcel Dekker.
Kourouklis, S. and Paige, C. C. (1981) A constrained least squares approach to the general Gauss-Markov linear model. J. Amer. Statist. Ass., 76, 620-625.
Krasker, W. S. and Welsch, R. E. (1982) Efficient bounded-influence regression. J. Amer. Statist. Ass., 77, 595-605.
Martin, J. T. (1942) The problem of the evaluation of rotenone-containing plants, VI. Ann. Appl. Biol., 29, 69-81.
May, R. M. (1976) Simple mathematical models with very complicated dynamics. Nature, 261, 459-466.
McCullagh, P. and Nelder, J. A. (1983) Generalized Linear Models. London: Chapman and Hall.
Mosteller, F. and Tukey, J. W. (1977) Data Analysis and Regression. Reading, Mass.: Addison-Wesley.
Pettitt, A. N. (1983) Approximate methods using ranks for regression with censored data. Biometrika, 70, 121-132.
Pregibon, D. (1982b) Score tests in GLIM with applications. In GLIM 82: Proceedings of the International Conference on Generalized Linear Models (R. Gilchrist, ed.), pp. 87-97. New York: Springer. (Lecture Notes in Statistics, Vol. 14.)
Ross, G. J. S. (1982) Least squares optimization of general log-likelihood functions and estimation of separable linear parameters. In Compstat 1982 (H. Caussinus et al., eds), pp. 406-411.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. J. Amer. Statist. Ass., 75, 828-838.
Ryder, M. L., Land, R. B. and Ditchburn, R. (1974) Colour inheritance in Soay, Orkney and Shetland sheep. J. Zool., 173, 477-485.
Scallan, A. (1984) Fitting autoregressive models in GLIM. GLIM Newsletter (to appear).
Scallan, A., Gilchrist, R. and Green, M. (1984) Parametric link functions. Computat. Statist. and Data Anal., to appear.
Sposito, V. A., Kennedy, W. J. and Gentle, J. E. (1977) Lp norm fit of a straight line. Appl. Statist., 26, 114-118.
Stirling, W. D. (1984) Iteratively reweighted least squares for models with a linear part. Appl. Statist., 33, 7-17.
Titterington, D. M. (1984) Recursive parameter estimation using incomplete data. J. R. Statist. Soc. B, 46, in the press.
Walker, S. H. and Duncan, D. B. (1967) Estimation of the probability of an event as a function of several independent variables. Biometrika, 54, 167-179.
Wold, H. (1981) The Fix-point Approach to Interdependent Systems. Amsterdam: North-Holland.
Wu, C. F. J. (1983) On the convergence properties of the EM algorithm. Ann. Statist., 11, 95-103.
