J. R. Statist. Soc. B (1984), 46, No. 2, pp. 149-192

Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives

By P. J. GREEN

University of Durham, UK

[Read before the Royal Statistical Society at a meeting organised by the Research Section on Wednesday, December 7th, 1983, Professor J. B. Copas in the Chair]

SUMMARY
The scope of application of iteratively reweighted least squares to statistical estimation problems is considerably wider than is generally appreciated. It extends beyond the exponential-family-type generalized linear models to other distributions, to non-linear parameterizations, and to dependent observations. Various criteria for estimation other than maximum likelihood, including resistant alternatives, may be used. The algorithms are generally numerically stable, easily programmed without the aid of packages, and highly suited to interactive computation.

Keywords: NEWTON-RAPHSON; FISHER SCORING; GENERALIZED LINEAR MODELS; QUASI-LIKELIHOOD; ROBUST REGRESSION; RESISTANT REGRESSION; RESIDUALS

1. PRELIMINARIES
1.1. An Introductory Example
This paper is concerned with fitting regression relationships in probability models. We shall generally use likelihood-based methods, but will venture far from familiar Normal theory and linear models.

As a motivation for our discussion, let us consider the familiar example of logistic regression. We observe y₁, y₂, ..., y_m which are assumed to be drawn independently from Binomial distributions with known indices n₁, n₂, ..., n_m. Covariates {x_ij; i = 1, 2, ..., m; j = 1, 2, ..., p} are also available and it is postulated that y_i ~ B(n_i, {1 + exp(-Σ_j x_ij β_j)}⁻¹), for parameters β₁, β₂, ..., β_p whose values are to be estimated. The important ingredients of this example from the point of view of this paper are:

(A) a regression function η = η(β), which here has the form η_i = {1 + exp(-Σ_j x_ij β_j)}⁻¹; and
(B) a probability model, expressed as a log-likelihood function of η, L(η), which in this case is

    L = Σ_{i=1}^m {y_i log η_i + (n_i - y_i) log(1 - η_i)}.

In common with most of the problems we shall consider in this paper notice that, in the usual application of this example,
(i) η has much larger dimension than β,
(ii) the probability model (B) is largely unquestioned, except perhaps for some reference to goodness of fit, and
(iii) it is the form of the regression function (A) that is the focus of our attention.
Typically we would be interested in selecting covariates of importance, deciding the form of the regression function, and estimating the values of the βs.
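The IRLS iteration developed below can be sketched for this logistic model in a few lines. This is a minimal illustration, not the paper's code; the function name and test data are hypothetical, and the update is the Fisher-scoring step (D'AD)(β* − β) = D'u with the u, A and D appropriate to the binomial-logit model.

```python
import numpy as np

def irls_logistic(X, y, n_trials, iters=25):
    """Fisher scoring / IRLS for the binomial-logit model of Section 1.1.
    X: m x p covariate matrix, y: success counts, n_trials: binomial indices."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        u = (y - n_trials * eta) / (eta * (1.0 - eta))  # dL/d eta_i
        a = n_trials / (eta * (1.0 - eta))              # expected information (diagonal)
        D = X * (eta * (1.0 - eta))[:, None]            # d eta_i / d beta_j
        beta = beta + np.linalg.solve(D.T @ (a[:, None] * D), D.T @ u)
    return beta
```

Note that D'AD reduces to X' diag{n_i η_i(1 − η_i)} X and D'u to X'(y − nη), the familiar logistic-regression scoring equations.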

Present address: Department of Mathematical Sciences, The University, Durham DH1 3LE.

© 1984 Royal Statistical Society 0035-9246/84/46149 $2.00

1.2. General Formulation
We consider a log-likelihood L, a function of an n-vector η of predictors. Typically n is equal to, or comparable with, the number of individual observations of which the likelihood forms the density or probability function, but we shall be concerned also with cases where individual observations are difficult to define, for example with one or more multinomial samples.

The predictor vector η is functionally dependent on the p-vector β of parameters of interest: p is typically much smaller than n. We base our inference on the function η = η(β) by estimating the parameters β, and deriving approximate confidence intervals and significance tests. Initially we shall consider only maximum likelihood estimates, and suppose that the model is sufficiently regular that we may restrict attention to the likelihood equations

∂L/∂β = Dᵀu = 0   (1)

where u is the n-vector {∂L/∂η_i}, and D the n × p matrix {∂η_i/∂β_j}. The standard Newton-Raphson method for the iterative solution of (1) calls for evaluating u, D and the second derivatives of L for an initial value of β and solving the linear equations

(-∂²L/∂β∂βᵀ)(β* - β) = Dᵀu   (2)

for an updated estimate β*. This procedure is repeated until convergence. Equation (2) is derived from the first two terms of a Taylor series expansion for ∂L/∂β; for a log-likelihood quadratic in β the method converges in one step. Commonly the second derivatives in (2) are replaced by an approximation. Note that

-∂²L/∂β∂βᵀ = -Σ_i (∂L/∂η_i)(∂²η_i/∂β∂βᵀ) + Dᵀ(-∂²L/∂η∂ηᵀ)D

and we replace the terms on the right by their expectations (at the current parameter values η). By the standard arguments:

E(∂L/∂η) = 0 and E(-∂²L/∂η∂ηᵀ) = E{(∂L/∂η)(∂L/∂η)ᵀ} = A, say,

and with this approximation (essentially Fisher's scoring technique) (2) becomes

(DᵀAD)(β* - β) = Dᵀu.   (3)

We will assume that D is of full rank p, and that A is positive definite throughout the parameter space: thus (3) is a non-singular p × p system of equations for β*. Rather than handle their numerical solution directly, note that they have the form of normal equations for a weighted least squares regression: β* solves

minimize (A⁻¹u + D(β - β*))ᵀ A (A⁻¹u + D(β - β*)),   (4)

that is, it results from regressing A⁻¹u + Dβ onto the columns of D using weight matrix A. Thus we use an iteratively reweighted least squares (IRLS) algorithm (4) to implement the Newton-Raphson method with Fisher scoring (3), for an iterative solution to the likelihood equations (1). This treatment of the scoring method via least squares generalizes some very long-standing methods, and special cases are reviewed in the next Section.

Two common simplifications are that the model may be linearly parameterized, η = Xβ say, so that D is constant, or that L has the form ΣL_i(η_i) (e.g. observations are independent) so that

A is diagonal.

IRLS algorithms also arise in inference based on the concept of quasi-likelihood, which was proposed by Wedderburn (1974) and extended to the multivariate case by McCullagh (1983). Suppose that the n-vector of observations y has E(y) = η and var(y) = σ²V(η), where η = η(β) as before, σ² is a scalar, and the matrix-valued function V(·) is specified. The log-quasi-likelihood Q is defined as any function of η satisfying

∂Q/∂η = V⁻(η)(y - η)   (5)

where V⁻ is a generalized inverse. We estimate β by solving the quasi-likelihood equations

0 = ∂Q/∂β = (∂η/∂βᵀ)ᵀ(∂Q/∂η) = Dᵀu,

say. Since E(∂Q/∂η) = 0 and E(-∂²Q/∂η∂ηᵀ) = V⁻, the Newton-Raphson equations with expected second derivatives have the form (3) with A = V⁻.

Questions of existence and uniqueness of the maximum likelihood estimates in various cases are discussed by Wedderburn (1976), Pratt (1981) and Burridge (1981). The large-sample theory that justifies likelihood-ratio tests and confidence intervals for parameters will be found in Cox and Hinkley (1974, pp. 294-304) and McCullagh (1983). In particular, the asymptotic covariance matrix for the estimate of β is σ²[E(-∂²L/∂β∂βᵀ)]⁻¹ = σ²(DᵀAD)⁻¹.

The important connection between such theoretical results and the numerical properties of IRLS is that both are justified by the approximate quadratic behaviour of the log-likelihood near its maximum. Thus it is reasonable that IRLS should work when maximum likelihood is relevant.

1.3. History and Special Cases
From the first, Fisher noted that maximum likelihood estimates would often require iterative calculation. In Fisher (1925), use of Newton's method was mentioned, with correction

(dL/dθ) / (-d²L/dθ²),

either for one step only, or iterated to convergence. Implicitly he replaced the negative second derivative, the observed information, at each parameter value by its expectation assuming that value were the true one.
This technique, and its multi-parameter generalization, became known as "Fisher's method of scoring for parameters", and was further discussed by Bailey (1961, Appendix 1), Kale (1961, 1962) and Edwards (1972).

Use of the scoring method in what we term regression problems seems to date from Fisher's contributed appendix to Bliss (1935). This paper was concerned with dosage-mortality curves, a quantal response problem as in Section 1.1, except for the use of the probit transformation in place of the logit. The relative merits of using observed or expected information were discussed by Garwood (1941), and the method has become more generally known from the various editions of Finney's book on Probit Analysis (1947, 1952, Appendix II).

Moore and Zeigler (1967) discussed these binomial problems with an arbitrary regression function, and demonstrated the quite general connection with non-linear least-squares regression. Nelder and Wedderburn (1972) introduced the class of generalized linear models to unify a number of linearly parameterized problems in exponential family distributions. These models are discussed in Section 3.2.

The important connection between the IRLS algorithm for maximum likelihood estimation and the Gauss-Newton method for least-squares fitting of non-linear regressions was further elucidated by Wedderburn (1974). Jennrich and Moore (1975) considered maximum likelihood estimation in a more general

exponential family than did Nelder and Wedderburn; their approach is similar to ours, except that the predictors η must be the expected values of the observations.

Important recent contributions have come from McCullagh (1983) and Jorgensen (1983), particularly regarding the treatment of dependent observations.

2. PRACTICALITIES
2.1. Multinomial Data
Much of our discussion of the detailed properties of IRLS algorithms that follows can be motivated by examples with multinomial data.

First consider a single multinomial distribution with polynomially parameterized cell probabilities, such as often arises with data on gene frequencies. In the usual model for the human ABO blood system, A and B are alleles co-dominant to O, so that under random mating with gene frequencies of p, q and r for A, B and O, the probabilities for the phenotypes A, B, AB and O are p² + 2pr, q² + 2qr, 2pq and r². Both sets of frequencies sum to 1, and on removing this redundancy we obtain a regression problem with three predictors and 2 parameters, η being defined as the probabilities of the phenotypes A, B and AB, and β taken as (log p, log q)ᵀ. This is obviously a non-linear regression, and the derivative matrix D is

        [  pr   -pq ]
D = 2   [ -pq    qr ]
        [  pq    pq ]

Given observed frequencies y₁, y₂, y₃, y₄ for A, B, AB and O, with Σy_i = n, we have

u_i = y_i/η_i - y₄/η₄  and  A_ij = n(δ_ij/η_i + 1/η₄),  where η₄ = 1 - η₁ - η₂ - η₃.

Incidentally, if η₄ retains its identity as a predictor, the regression becomes of order 4 × 2 but A is now diagonal: this is essentially the algorithm used by Jennrich and Moore (1975) for this problem, with the exception that they take β = (p, q)ᵀ.

More generally, data in the form of R multinomial samples on the same set of S response categories often arise in the regression analysis of categorical data. We may arrange the data as a two-way table of counts y_rs, r = 1, 2, ..
., R; s = 1, 2, ..., S, and the log-likelihood is essentially L = ΣΣ y_rs log p_rs, where Σ_s p_rs = 1 for all r and the RS cell probabilities p_rs are suitably parameterized. There has been considerable interest in the case where the categories 1, 2, ..., S are ordered; see for example McCullagh (1980), Anderson and Philips (1981) and Pratt (1981). Then we may be prepared to write

Σ_{i=1}^s p_ri = Ψ(θ_s - Σ_j x_rj β_j)

for some given distribution function Ψ. Here the matrix (x_rj) represents covariate information, and the parameters are -∞ = θ₀ < θ₁ < ... < θ_{S-1} < θ_S = ∞, together with the β_j; the model is usable only if these order constraints on the θ_s are

satisfied.

Use of models such as these for categorical data was motivated by the case where Ψ is a logistic function and S = 2. This is the usual linear logistic model for binary data (Cox, 1970). An alternative model for ordinal data with a similar motivation is Anderson's stereotype model, a special case of the models in Anderson (1984), which we may write as

p_rs = exp(γ_s - φ_s Σ_j x_rj β_j) / Σ_t exp(γ_t - φ_t Σ_j x_rj β_j),

where β, γ and φ are to be fitted, under the constraints γ_S = 0 and 1 = φ₁ > φ₂ > ... > φ_S = 0. Here, the simplest parameterization for IRLS is to use η_rs = γ_s - φ_s Σ_j x_rj β_j, r = 1, 2, ..., R; s = 1, 2, ..., S - 1. However, this model suffers from rank deficiency of D at β = 0, and there will be consequent numerical difficulties, and problems in the large-sample theory. Further there is no guarantee that at the computed maximum, the φ_s will be correctly ordered.

2.2. Example
For a straightforward application of the grouped continuous multinomial model, consider the data in Table 1, relating school GCE A-level score to university degree performance for fifteen

TABLE 1
A-level score and degree classification

                  Degree class
Score     I   II(i)  II(ii)  III  Pass  Total
  15     22    13     10      3     0     48
  14     20    21     31      9     2     83
  13     13    43     31     16    10    113
  12      7    21     35     18     5     86
  11      3    21     26     32     8     90
  10      3    17     25     20    12     77
   9      1    10      9     15    11     46
   8      1     2      4     12     6     25
   7      0     1      2      6     1     10
   6      0     0      2      1     0      3

years' intake to a certain university first degree course. Clearly there may be heterogeneities here caused by changing standards with time, and it would have been preferable not to combine these data over the years. We examine the data for a possible linear relationship between the log-odds of attaining degree class s or better and A-level score. Here, s = 1, 2, 3 or 4 for degree I, II(i), II(ii) and III. Parallel regression lines in this domain define the proportional odds model (see McCullagh, 1980) and are equivalent to a logistically distributed latent response variable, which is categorized into unknown intervals to provide the degree classification.

Specifically, then, we suppose that the probability that a student with an A-level score of x_r attains degree class i is p_ri, such that Σ_{i=1}^s p_ri = Ψ(θ_s - αx_r), s = 1, 2, 3, 4, where Ψ(u) = (1 + e⁻ᵘ)⁻¹. Defining η_rs = Σ_{i=1}^s p_ri, s = 1, 2, 3, 4, r = 1, 2, ..., 10, and β = (θ₁, θ₂, θ₃, θ₄, α)ᵀ, and using an unweighted least squares regression of the empirical log-odds ratio to provide initial estimates for β, IRLS converged in 4 iterations to the maximum likelihood solution θ̂ = (-6.803, -5.177, -3.763, -2.096)ᵀ, α̂ = -0.3915 (standard error 0.040). This fit gave a deviance (likelihood-ratio statistic against the saturated model) of 48.5 on 35 degrees of freedom: it is therefore probably adequate as a summary of the data. Treating A-level score as a factor rather than a quantitative covariate reduced the deviance to 36.5 (27 d.f.) so that, based on a nominal significance test, no non-linearity is suggested. Use of the Normal distribution function for the latent response variable made practically no difference, while the Gumbel distribution (the "complementary log-log" link, or proportional hazards model) fitted less well, whether degree classes were arranged in increasing or decreasing order.
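The fitted proportional odds model can be turned into cell probabilities directly by differencing the cumulative form P(class ≤ s | score x) = Ψ(θ_s − αx). The following sketch (function name hypothetical, not from the paper) does this for the estimates just quoted.

```python
import numpy as np

def degree_class_probs(theta, alpha, scores):
    """Cell probabilities p_rs under the proportional odds model:
    cumulative P(class <= s | score x) = 1/(1 + exp(-(theta_s - alpha*x)))."""
    x = np.asarray(scores, dtype=float)[:, None]
    cum = 1.0 / (1.0 + np.exp(-(np.asarray(theta)[None, :] - alpha * x)))
    # complete the cumulative distribution at s = 0 and s = S, then difference
    cum = np.hstack([np.zeros_like(x), cum, np.ones_like(x)])
    return np.diff(cum, axis=1)

# maximum likelihood estimates reported in Section 2.2
theta_hat = [-6.803, -5.177, -3.763, -2.096]
P = degree_class_probs(theta_hat, -0.3915, range(6, 16))  # scores 6, ..., 15
```

Because the θ_s are increasing, the differences are automatically non-negative and each row sums to one.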

For this example, of course, the allegedly latent response variable has an immediate interpretation as an examination mark underlying the degree classification. We might then know the cutpoints {θ_s}, or rather, allow an intercept and scale and write

Σ_{i=1}^s p_ri = Ψ((γ + βx_r - t_s)/σ)

where {t_s} are known. For illustration, if the cutpoints t are taken as (75, 60, 45, 30)ᵀ then the maximum likelihood estimates of γ, β and σ are 8.773, 3.835 and 9.736, with a deviance of 51.2 (37 d.f.). Thus each A-level point is "worth" nearly 4 examination marks; note, however, from the magnitude of the estimate for σ that there is considerable variation about this regression line.

The approach is not, of course, limited to the case of a single covariate. For example, we have successfully used IRLS to fit both the grouped continuous and stereotype models to the back pain study data of Anderson and Philips (1981) in which there are 6 response categories and three categorical explanatory factors.

2.3. Convergence
General experience seems to be that choice of starting values for the parameter estimates is not particularly critical. Jennrich and Moore (1975) make this point quite strongly, though they are working only in an exponential family framework. In a model where there is a danger of multiple maxima, it is of course important to repeat the iterative process from several different points in the parameter space, in order to obtain more confidence that the true global maximum has been obtained.

In certain specialized problems, accurate starting values can be obtained by explicit formulae. For the ABO genetic example, Rao (1973, p. 371) gives formulae that render an iterative solution almost unnecessary. The effect of ignoring such formulae and using quite arbitrary starting points is illustrated in Fig. 1 for this example, using the O, A, B, AB frequencies: 202, 179, 35 and 6, as used by Thompson and Baker (1981). It will be seen that convergence is successfully obtained from nearly every admissible point, but that the iterations can be rather wild unless a little thought is used to provide a sensible initial estimate.
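For the ABO example, the quantities u, A and D of Section 2.1 give a complete IRLS iteration in a few lines. The following is a sketch (function name hypothetical), using the paper's parameterization β = (log p, log q)ᵀ and the frequencies quoted above, reordered to (A, B, AB, O).

```python
import numpy as np

def irls_abo(y, p0, q0, iters=30):
    """IRLS/Fisher scoring for ABO gene frequencies, beta = (log p, log q).
    y = (A, B, AB, O) phenotype counts."""
    n = y.sum()
    beta = np.log([p0, q0])
    for _ in range(iters):
        p, q = np.exp(beta)
        r = 1.0 - p - q
        eta = np.array([p*p + 2*p*r, q*q + 2*q*r, 2*p*q])  # P(A), P(B), P(AB)
        eta4 = 1.0 - eta.sum()                             # P(O) = r*r
        u = y[:3] / eta - y[3] / eta4                      # dL/d eta_i
        A = n * (np.diag(1.0 / eta) + 1.0 / eta4)          # expected information
        D = 2.0 * np.array([[p*r, -p*q],
                            [-p*q, q*r],
                            [p*q,  p*q]])                  # d eta / d beta
        beta = beta + np.linalg.solve(D.T @ A @ D, D.T @ u)
    return np.exp(beta)

# Thompson and Baker's O, A, B, AB counts 202, 179, 35, 6, as (A, B, AB, O);
# a rough moment estimate is used as a sensible starting point
p_hat, q_hat = irls_abo(np.array([179.0, 35.0, 6.0, 202.0]), 0.25, 0.08)
```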


Fig. 1. The ABO blood group example: trajectories of successive iterations of IRLS from three different initial estimates, using the parameterization β = (log p, log q)ᵀ, plotted in barycentric coordinates. The shaded region covers initial values for which IRLS is unsuccessful.

For generalized linear models, it is easy to find usable starting values (see Section 3.2). In fact whenever the regression function is linear, η = Xβ, initial estimates of η can be obtained from a simple transformation of the data; an unweighted regression of these on the columns of X usually yields appropriate starting values of β. Jorgensen (1983) suggests a modification for the non-linear case.

We have mentioned use of both the expected and observed information matrices in the iterative step of the algorithm. Again this decision is not usually important. Expected information used to be preferred because in the problems being considered it was algebraically simpler, and because its value on convergence was needed to compute the asymptotic variance of the estimates. For discussion of these points see Garwood (1941), Edwards (1972) and Cox and Hinkley (1974, p. 308), for example, but see also Efron and Hinkley (1978). Further, the observed information matrix may not be positive definite. In exponential family models appropriately parameterized, observed and expected information may be the same (Nelder and Wedderburn, 1972; Jennrich and Moore, 1975; see Section 3.2). On the other hand, in certain situations the expected information is unknown (for example with censored survival data, the potential censoring times are not recorded for uncensored observations), or algebraically complicated, or involves nuisance parameters (see Section 3.1).

Only in simple cases can the behaviour of the algorithm on iteration be properly quantified. At worst, all we can say is that it is a fixed point method: if it converges to a point β̂, then β̂ is a solution of the likelihood equations. Exact Newton-Raphson is a quadratic method, so that convergence will be rapid near the solution, but may not be obtained at all far from this point. See Chambers (1977, p. 136) and Jennrich and Ralston (1979). As pointed out by Jorgensen (1983), we can always modify the Fisher scoring method by reducing the step size to ensure that the likelihood increases on each iteration.

The iterations must be monitored in order to detect convergence (or its failure). Two obvious methods for doing so are to record relative changes in parameter estimates or absolute changes in the log-likelihood. GLIM used a modification of the latter, but the former is more readily adapted to alternatives to likelihood methods (Section 5) and involves simpler calculations, while it is less suited to automatic application. When handling an unfamiliar problem, it is important to follow the entire iterative history of the solution to be confident that the convergence criterion employed is appropriate.
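The step-size modification just mentioned is easy to bolt onto any scoring step. A minimal sketch (names hypothetical; the caller supplies the log-likelihood and the quantities u, D, A as functions of β):

```python
import numpy as np

def damped_scoring_step(beta, loglik, quantities):
    """One Fisher-scoring step, halving the step length until the
    log-likelihood does not decrease (cf. Jorgensen, 1983)."""
    u, D, A = quantities(beta)
    delta = np.linalg.solve(D.T @ A @ D, D.T @ u)
    step = 1.0
    while loglik(beta + step * delta) < loglik(beta) and step > 1e-8:
        step *= 0.5
    return beta + step * delta
```

By construction each accepted iterate has a log-likelihood at least as large as its predecessor, at the cost of extra likelihood evaluations.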

2.4. Reparameterization
One-to-one transformation of either η or β in the model L = L(η(β)) will not essentially change the problem, but does change its specification, and can make the difference between success and failure in the application of IRLS.

Suppose that η and β are in appropriately differentiable one-to-one correspondence with ζ and γ respectively, and denote the associated Jacobians by S and R. Thus S is n × n with S_ij = ∂η_i/∂ζ_j, and R is p × p with R_ij = ∂β_i/∂γ_j. If we re-parameterize the model as L = L(ζ(γ)) then by elementary calculus, u, A and D are replaced by Sᵀu, SᵀAS and S⁻¹DR. The likelihood equations Dᵀu = 0 become RᵀDᵀu = 0 and the IRLS step is

(S⁻¹DR)ᵀ(SᵀAS)(S⁻¹DR)(γ* - γ) = RᵀDᵀu,

which reduces to

(RᵀDᵀADR)(γ* - γ) = RᵀDᵀu.

Thus, as expected, reparameterization in the η-space has made no difference to the iterative solution, but in the β-space it does make a difference, unless the transformation from β to γ is affine.

From a numerical point of view, there may be good reason to reparameterize in order approximately to linearize the problem. Successful use of IRLS depends essentially on the adequacy of

approximate linearity of ∂L/∂β in β. If for some γ, RᵀDᵀu is more nearly linear in γ than is Dᵀu in β, then transformation may be worthwhile.

For a simple example of this, consider again the ABO genetic system. If the problem were alternatively parameterized in terms of γ = (p, q)ᵀ, then the frequencies of the phenotypes (A, B, AB) are η where

ηᵀ = (γ₁(2 - γ₁ - 2γ₂), γ₂(2 - 2γ₁ - γ₂), 2γ₁γ₂)

and so

         [ 1-γ₁-γ₂     -γ₁    ]
DR = 2   [   -γ₂     1-γ₁-γ₂  ]
         [    γ₂        γ₁    ]

Although this has a simpler form than does D of Section 2.1, it is found that with this set-up the problem is more sensitive to initial values for γ. Figure 2 illustrates that starting values must be restricted to a much smaller part of the parameter space than with the earlier parameterization.
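The effect of the β-to-γ change can be checked numerically: a single scoring step transforms through the Jacobian R = ∂β/∂γ of Section 2.4 (here diag(1/p, 1/q)), but the updated parameter values differ because the map is not affine. A small sketch of this check (function name hypothetical):

```python
import numpy as np

def abo_step_quantities(p, q, y):
    """u, A and the two derivative matrices for the ABO model at (p, q):
    D for beta = (log p, log q), DR for gamma = (p, q)."""
    n = y.sum()
    r = 1.0 - p - q
    eta = np.array([p*p + 2*p*r, q*q + 2*q*r, 2*p*q])
    eta4 = 1.0 - eta.sum()
    u = y[:3] / eta - y[3] / eta4
    A = n * (np.diag(1.0 / eta) + 1.0 / eta4)
    D = 2.0 * np.array([[p*r, -p*q], [-p*q, q*r], [p*q, p*q]])
    DR = 2.0 * np.array([[r, -p], [-q, r], [q, p]])
    return u, A, D, DR
```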



Fig. 2. As for Fig. 1, but with the alternative parameterization γ = (p, q)ᵀ.

Although, as we have seen, reparameterizing η will make no numerical difference (if we neglect rounding error), it may cause considerable changes in setting up the problem, as the information matrix SᵀAS may have simpler structure than A.

2.5. Computing the Weighted Least Squares Solutions
The numerical analysis of the linear least squares problem is well developed. Chambers (1977, Chap. 5) provides a useful summary from a statistical viewpoint. For ordinary least squares, finding β* to minimize ||y - Dβ*||² = (y - Dβ*)ᵀ(y - Dβ*), there are methods that are more stable numerically than the obvious β* = (DᵀD)⁻¹Dᵀy. The orthogonal decomposition methods involve the explicit or implicit construction of an n × n orthogonal matrix Q (QᵀQ = I) such that

       [ R ]
QD =   [ 0 ]

where R is a p × p upper triangular matrix. Since orthogonal transformations preserve Euclidean length, our required solution β* satisfies Rβ* = Q̃y (where Q̃ denotes the first p rows of Q)

and may be obtained by back-substitution. Note also that DᵀD = RᵀR, facilitating the calculation of (DᵀD)⁻¹ etc.

Orthogonal decompositions may be found by a modified Gram-Schmidt method, Givens' method, or by means of Householder transformations. This last approach is recommended for general use (see the procedures decompose and solve of Businger and Golub (1965), implemented as NAG routines F01AXF and F04ANF (NAG, 1981)).

From a strictly numerical-analytic point of view, the back-substitution for β* should be followed by an iterative improvement, in which the residuals from the least squares solution become the right-hand sides for a new least squares problem. Providing the transformation has been saved this does not entail further decomposition; nevertheless, it is not likely to lead to any improvement relevant to data-analysis, particularly when the least squares solution is, as here, only a part of an iterative step.

Ordinary least squares solves our IRLS step (4) when A is a scalar matrix: the "modified dependent variate" y is A⁻¹u + Dβ. When A has a more complicated form, we need to use weighted least squares. Two possibilities are open to us:
(i) to transform the problem to ordinary least squares, or
(ii) to generalize the orthogonal decomposition method.
When A is diagonal, diag(v_i²) say, the transformation (i) is trivial. With y defined as above, we minimize (y - Dβ*)ᵀA(y - Dβ*) = Σ v_i²(y_i - Σ_k d_ik β*_k)² by component-wise multiplication of the entries in y and the rows of D by the square roots of the diagonal elements of A, and using ordinary least squares. This gives the simple prescription:

Regress {u_i/v_i + v_i Σ_k d_ik β_k} on {v_i d_ij}.   (6)

Chambers (1977, p. 120) suggests this procedure, and there seems no point in looking for a method of type (ii). For general, non-diagonal, A, conversion to ordinary least squares entails construction of the Cholesky square root matrix B and using:

Regress (B⁻¹u + BᵀDβ) on the columns of BᵀD.   (7)

An alternative approach would be to decompose D in the geometry determined by A, by constructing an n × n matrix Q for which

              [ R ]
QᵀQ = A, QD = [ 0 ]

and solving Rβ* = Q̃y = Q̃(A⁻¹u + Dβ) = Q̃A⁻¹u + Rβ as before, by back substitution. This hybrid Householder/Cholesky decomposition may offer advantages in efficiency or accuracy, but we have experienced no difficulties with the routine use of a separate Cholesky decomposition followed by ordinary least squares. Note that A may be of special form (e.g. tridiagonal) which may be exploited in the decomposition.

With this approach, no special purpose software is needed for any of these computations. For example, the interactive general purpose language APL has all the array-handling capabilities that are required. Of course, weighted least squares procedures are available in many statistical packages and subroutine libraries. In the BMDP series (Dixon, 1981), for example, the P3R and PAR programs fit non-linear regressions and the manual describes their usage for maximum likelihood estimation. The facilities in GLIM (Baker and Nelder, 1978) and GENSTAT (Alvey et al., 1977) for fitting generalized linear models will be reviewed in Section 3.3. Generalized least squares in which the weight matrix is not diagonal does not seem to be available in packages.
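Prescription (7) is a one-liner in any array language. A sketch in Python (function name hypothetical), returning the increment β* − β rather than β* itself, which amounts to the same regression applied to B⁻¹u alone:

```python
import numpy as np

def weighted_ls_increment(u, D, A):
    """Solve the weighted least squares step (D'AD)(beta* - beta) = D'u
    via the Cholesky factor A = B B', as in prescription (7)."""
    B = np.linalg.cholesky(A)
    z = np.linalg.solve(B, u)           # B^{-1} u
    W = B.T @ D                         # transformed design matrix
    delta, *_ = np.linalg.lstsq(W, z, rcond=None)
    return delta
```

The transformed normal equations are WᵀW = DᵀAD and Wᵀz = Dᵀu, so the ordinary least squares solve reproduces the weighted step while avoiding explicit formation of DᵀAD.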

2.6. Alternative Numerical Methods
The Newton-Raphson method, whether or not the second derivatives are replaced by their expectations, is conceptually the simplest numerical procedure for maximizing the likelihood function. It would be naive to suggest that it can be used universally for these problems, but we have argued that one principal reason why it may break down (an inadequate approximation of L by a quadratic near its maximum) may also mean that maximum likelihood estimation is not appropriate.

However, two other reasons in particular may suggest that other numerical methods are more suitable: slow convergence far from the optimum, and the necessity of supplying analytic second derivatives. These difficulties can be overcome by use of quasi-Newton methods (see for example Gill and Murray (1972)). This has been recommended by Anderson and Philips (1981) and Anderson (1981) for the multinomial problems discussed in Section 2.1. Quasi-Newton methods seem highly suited to maximum likelihood estimation, and should perhaps be generally recommended if IRLS fails.

Other optimization methods that may be needed in particular applications include search methods (e.g. Powell (1964), Nelder and Mead (1965)) and the conjugate gradient algorithms (Fletcher and Reeves, 1964). These latter are economical in storage when the number of parameters is large, and are advocated by McIntosh (1982) for the fitting of linearly parameterized models on small computers. Search methods are probably most suited to problems where the objective function is rather less well behaved than most log-likelihoods. Useful reviews of these methods are given by Chambers (1977, Chapter 6) and Jennrich and Ralston (1979).
For certain problems, special purpose algorithms are available. For example, in log-linear models for multi-way contingency tables, one rival to the IRLS method is the iterative proportional fitting algorithm (see, for example, Bishop, Fienberg and Holland, 1976). This will normally be preferable, but in a sparse contingency table the difference is less clear. Brown and Fuchs (1983) provide a valuable discussion of these points.

The MLP package (Ross, 1980) uses several different algorithms, including a modified Newton method, for various special maximum likelihood problems, including probit analysis and genetic linkage. User-defined models can be analysed, and the program will handle non-linear parameterizations.

We should also touch here on the relationship to the EM algorithm (Dempster, Laird and Rubin, 1977). This is not a numerical method for maximum likelihood estimation in the same sense as IRLS. It is rather a general principle for handling problems in which the likelihood takes a particularly complicated form because of unobserved latent or missing variables. In particular cases there can be a close connection with IRLS. For example, Hinde (1982) shows that, for a Normal-Poisson compound distribution, when a numerical integral in the EM algorithm is evaluated by Gaussian quadrature, the result is an IRLS algorithm. Brillinger and Preisler (1983) use a similar method for a variety of compound distributions.

3. NUISANCE PARAMETERS
3.1. Introduction
Not all of the parameters in a model need hold the same interest or have the same logical status. In certain cases, such "nuisance parameters" κ are additional to those naturally entering the regression function η(β), and that is the situation considered here. (It is a different distinction, for example, than that made between treatments and blocks in a designed experiment.) It is generally profitable to recast the model as L = L(η, κ) where η = η(β). There are now two sets of likelihood equations

∂L/∂β = Dᵀu = 0   (8)

∂L/∂κ = 0.   (9)

We continue to use the first set, but by analogy with the case of the variance in Normal linear regression we may replace (9) by some other criterion if convenient. We may in fact not wish to estimate κ, but need to do so because of its implications for β: solution of (8) may depend on κ, and estimates of the variance of β̂ may involve κ.

In some situations, the dependence on κ may essentially factor out of the problem. Jorgensen (1983) discusses models which may be expressed in the form:

L = c(y, κ) + φ(κ) t(y, η)   (10)

where the vector y represents the observed data, and φ(κ) is a type of "precision" factor dependent on the nuisance parameters. Jorgensen calls this the extended class of generalized linear models, but in fact no linear structure η = Xβ is necessarily intended.

With the model (10), we have

∂L/∂β = φ(κ) Dᵀ(∂t/∂η)

and

-∂²L/∂β∂βᵀ = φ(κ) { Dᵀ(-∂²t/∂η∂ηᵀ)D - Σ_i (∂t/∂η_i)(∂²η_i/∂β∂βᵀ) }.   (11)

The Newton-Raphson method for estimating β thus seems to involve κ only through the precision parameter φ, and indeed this appears to cancel from the two sides of (2). However, while the second term on the right in (11) disappears on taking expectations of the second derivatives (or would vanish anyway if η were linear in β), the expectation of -∂²t/∂η∂ηᵀ generally involves κ. As Jorgensen observes, this will be the case unless (10) is an exponential family with η the minimal canonical parameter (which covers Nelder and Wedderburn's models). Jorgensen advocates the compromise of ignoring the second term of (11) but not taking expectations in the first term. This iterative method, which he terms the "linearization method", can be realised by taking u = {∂t/∂η_i} and A = {-∂²t/∂η_i∂η_j}. This will allow estimation of β without interference from the nuisance parameter κ. The new difficulty that may arise in thus using "observed" rather than "expected" information is that A may not be positive-definite (or even semi-definite) at some points in the η-space. IRLS would then break down completely.

If so, then an alternative to treating β and κ in the same manner may be available if (9) is explicitly soluble for κ given fixed β. Often κ is one-dimensional, and enters (9) in a simple way. If this is the case, we can attempt a solution for (β, κ) by means of a 2-part iteration:
(i) holding κ fixed, use IRLS to update β;
(ii) holding β fixed, solve (9) for an updated κ.
The convergence theory of this is even more inaccessible, but some results are available for the linear regression case (Section 4.1).
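As an illustration of the 2-part iteration, consider the model y_i ~ N(exp(x_i β), κ), which is hypothetical and not from the paper: step (ii) has the closed form κ = n⁻¹Σ(y_i − η_i)². For this Normal model the precision in fact cancels from the β-step, so the sketch mainly shows the bookkeeping of the alternation.

```python
import numpy as np

def two_part_fit(x, y, beta0=0.0, kappa0=1.0, iters=50):
    """Alternate (i) one scoring/IRLS step for beta with kappa fixed and
    (ii) the explicit solution of dL/dkappa = 0 with beta fixed."""
    beta, kappa = beta0, kappa0
    for _ in range(iters):
        eta = np.exp(x * beta)
        u = (y - eta) / kappa                  # dL/d eta
        D = eta * x                            # d eta / d beta (p = 1)
        a = np.full_like(x, 1.0 / kappa)       # diagonal of A
        beta = beta + (D @ u) / (D @ (a * D))               # step (i)
        kappa = float(np.mean((y - np.exp(x * beta))**2))   # step (ii)
    return beta, kappa
```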

3.2. Generalized Linear Models
Nelder and Wedderburn (1972) proposed a class of likelihood functions for $n$ independent observations $\{y_i\}$ which may be written

$$L = \sum_{i=1}^{n} \left[ \frac{\pi_i\{y_i\theta_i - b(\theta_i)\}}{\phi} + c(y_i, \phi) \right] \qquad (12)$$

where $b(\cdot)$ and $c(\cdot)$ are prescribed functions, $\{\pi_i\}$ a set of known "prior weights", $\phi$ a

nuisance precision parameter (known or unknown), and the canonical parameter $\theta_i$ is functionally related to the linear predictor $\eta_i = \sum_j x_{ij}\beta_j$. This falls into our general framework, and in fact forms a subclass of Jorgensen's extended generalized linear models. In the notation of Section 1, $D$ is constant (the model or design matrix $X$), $A$ is diagonal since the observations are independent, and $u$ has a special form: because of the exponential family assumption (12) it involves $y_i$ only through $y_i - b'(\theta_i) = y_i - E(y_i)$. Standard examples include:
(i) $y_i \sim N(\eta_i, \sigma^2)$
(ii) $y_i \sim$ Poisson, with mean $\exp(\eta_i)$
(iii) $y_i \sim$ Binomial, with probability $(1 + \exp(-\eta_i))^{-1}$
(iv) $y_i \sim$ Gamma, with mean $\eta_i^{-1}$
(v) $y_i \sim$ Binomial, with probability $\Phi(\eta_i)$
(vi) $y_i \sim$ Gamma, with mean $\eta_i$
(vii) $y_i \sim$ Negative binomial, with probability $(1 + \exp(-\eta_i))^{-1}$
where in each case $\eta_i = \sum_j x_{ij}\beta_j$. Thus many standard analyses, including linear regression, log-linear models and logit and probit analysis, fit into this structure.

Using the IRLS approach described in Section 1 for generalized linear models has additional justification when the canonical parameter $\theta_i$ in (12) is identical to $\eta_i$. This includes the first four of the standard models listed above. If this holds then it is easy to see that the second derivatives of $L$ with respect to $\eta$ or $\beta$ do not involve $y$: the observed and expected information matrices therefore agree, and we are using Newton-Raphson exactly. Statistically this property implies the existence of sufficient statistics: in the case where the prior weights are all unity and $\phi$ is known, we see that the likelihood $L$ in (12) can be rearranged to exhibit $X^T y$ as sufficient for $\beta$. For examples (v) to (vii), the observed and expected second derivatives differ: exact Newton-Raphson will work for (v) and (vii) but not (vi).
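For a canonical-link model such as example (iii), the IRLS step is therefore exact Newton-Raphson. A minimal sketch in Python (illustrative names, not code from the paper):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """IRLS for logistic regression, a canonical-link GLM.  With the
    canonical link, A = diag(var(y_i)) is free of y, so Fisher scoring
    and Newton-Raphson coincide."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))     # fitted probabilities
        u = y - mu                          # score in eta: y_i - E(y_i)
        A = mu * (1.0 - mu)                 # expected (= observed) information
        # weighted least squares step: solve X' A X (beta* - beta) = X' u
        beta = beta + np.linalg.solve(X.T @ (A[:, None] * X), X.T @ u)
    return beta
```

At convergence the score equations $X^T(y - \mu) = 0$ hold, exhibiting the sufficiency of $X^T y$ noted above.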

3.3. GLIM and Composite Link Functions
The unity of generalized linear models has been exploited in the development of the GLIM program (Baker and Nelder, 1978), which is essentially an IRLS algorithm coupled with data-handling facilities, automatic specification of several standard models, and a high-level syntax for manipulating design matrices. Similar features are now available in GENSTAT (Alvey et al., 1977). GLIM permits the fitting of non-standard ("user-defined") models by way of the OWN directive and its associated macros.

It may be shown that for a generalized linear model, after cancelling out the nuisance parameter $\phi$ as in Section 3.1, $u_i = \pi_i(y_i - \mu_i)/(\tau_i^2\delta_i)$ and $A_{ii} = \pi_i/(\tau_i^2\delta_i^2)$, where $\mu_i = E(y_i)$, $\tau_i^2 = \phi^{-1}\pi_i\,\mathrm{var}(y_i)$ and $\delta_i = d\eta_i/d\mu_i$. Given a problem specified in terms of our $u$, $A$ and $D$, it seems impossible for GLIM to handle non-diagonal $A$. If $A$ is diagonal, the user's macros should define the GLIM vectors %YV, %FV, %VA, %DR and %PW ($y$, $\mu$, $\tau^2$, $\delta$ and $\pi$) to satisfy the above relations for $u_i$ and $A_{ii}$, and then use the columns of $D$ in the FIT directive. Because of the redundancy in notation, there are several ways of setting up such a problem.

Considerable ingenuity has been expended in coding some rather unamenable problems into the GLIM command language following this prescription. In particular, Thompson and Baker (1981) demonstrated a useful extension of standard generalized linear models by way of composite link functions. A number of important models, including the genetics example and the grouped continuous models that we have discussed, fall into the distribution family (12) but with a non-linear regression function. Burn (1982) and Roger (1983) have extended this application of GLIM to a wide variety of multinomial problems arising in genetics.

The forthcoming replacement for GLIM will provide considerably better facilities for user-defined models, including composite link functions.

4. LINEAR REGRESSION
4.1. Three IRLS Methods
We now turn to the linear model

$$y_i = \sum_j x_{ij}\beta_j + \sigma e_i \qquad (13)$$

in which the $\{e_i\}$ are independent and identically distributed with density $f$. We consider estimating $\beta$ and $\sigma$ according to one of three principles: (i) maximum likelihood, for known $f$; (ii) robust regression, providing estimates protected against departures of $f$ from Normality; and (iii) resistant regression (which is deferred to Section 5.2).

Define $\psi$ and $w$ by $\psi(t) = t\,w(t) = -(d/dt)\log f(t)$, and write $\eta_i = \sum_j x_{ij}\beta_j$ and $r_i = (y_i - \eta_i)/\sigma$. The log-likelihood is $L = -n\log\sigma + \sum_i \log f(r_i)$, so the likelihood equations are

$$\frac{\partial L}{\partial\beta_j} = \sigma^{-1}\sum_i \psi(r_i)\,x_{ij} = 0 \qquad (14)$$

and

$$\frac{\partial L}{\partial\sigma} = \sigma^{-1}\left(\sum_i \psi(r_i)\,r_i - n\right) = 0. \qquad (15)$$

Letting asterisks denote updated items, if we substitute $w(r_i)(y_i - \eta_i^*)/\sigma^*$ for $\psi(r_i)$ then (15) and (14) yield

$$\sigma^{*2} = n^{-1}\sum_i w(r_i)(y_i - \eta_i^*)^2 \qquad (16)$$

and

$$\sum_i w(r_i)\,y_i\,x_{ij} = \sum_k \left\{\sum_i w(r_i)\,x_{ij}\,x_{ik}\right\}\beta_k^*. \qquad (17)$$

Of these, (16) updates $\sigma^*$ from $\eta^*$ explicitly, while (17) are normal equations for a weighted least squares regression, obtained without any appeal to the Newton-Raphson procedure.

Proceeding more formally, and differentiating (14), we derive the Newton-Raphson equations

$$\sum_i \psi(r_i)\,x_{ij} = \sigma^{-1}\sum_k \left\{\sum_i \psi'(r_i)\,x_{ij}\,x_{ik}\right\}(\beta_k^* - \beta_k). \qquad (18)$$

These least squares equations may be used directly, or $\psi'(r_i)$ can be replaced by one of two approximations. Firstly, noting that $\psi'(r) = w(r) + r\,w'(r)$, we can just ignore the second term (note that $w(\cdot)$ = constant for a Normal distribution). In this case (18) reduces to (17). Alternatively, Fisher scoring uses

$$a = \int \psi'(r)\,f(r)\,dr = E(\psi'(r)) = \int \frac{\{f'(r)\}^2}{f(r)}\,dr,$$

say, known as the intrinsic accuracy of the distribution (Fisher, 1925), so that (18) is replaced by

$$\sum_i \psi(r_i)\,x_{ij} = \frac{a}{\sigma}\sum_k \left\{\sum_i x_{ij}\,x_{ik}\right\}(\beta_k^* - \beta_k). \qquad (19)$$

We have derived three alternative IRLS procedures for updating $\beta$, based on (17), (18) and (19) respectively:

(I) Regress $\sqrt{w(r_i)}\,y_i$ on $\sqrt{w(r_i)}\,x_{ij}$; (17')

(II) Regress $\sqrt{\psi'(r_i)}\left\{\eta_i + \dfrac{\sigma\psi(r_i)}{\psi'(r_i)}\right\}$ on $\sqrt{\psi'(r_i)}\,x_{ij}$; (18')


(III) Regress $\eta_i + a^{-1}w(r_i)(y_i - \eta_i)$ on $x_{ij}$. (19')
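The updates (17') and (19') can be coded directly. The following sketch is illustrative (not code from the paper), taking logistic errors, for which $\psi(t) = \tanh(t/2)$, $w(t) = \psi(t)/t$ and the intrinsic accuracy is $a = 1/3$; both variants update $\sigma$ via (16):

```python
import numpy as np

def psi(t):
    return np.tanh(t / 2.0)          # psi = -(d/dt) log f, logistic f

def w(t):
    t = np.asarray(t, dtype=float)   # w(t) = psi(t)/t, with w(0) = 1/2
    return np.where(np.abs(t) < 1e-6, 0.5, psi(t) / np.where(t == 0.0, 1.0, t))

def fit_irls(X, y, method="I", a=1.0 / 3.0, n_iter=500):
    """Methods (I) and (III) for the linear model (13)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares start
    sigma = np.std(y - X @ beta)
    for _ in range(n_iter):
        r = (y - X @ beta) / sigma
        if method == "I":
            # (17'): weighted least squares with weights w(r_i)
            W = w(r)
            beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
        else:
            # (19'): regress eta_i + (sigma/a) psi(r_i) on x_ij (Fisher scoring)
            beta = beta + (sigma / a) * np.linalg.solve(X.T @ X, X.T @ psi(r))
        sigma = np.sqrt(np.sum(w(r) * (y - X @ beta) ** 2) / n)   # (16)
    return beta, sigma
```

At convergence both variants satisfy the likelihood equations (14) and (15), and so agree with each other.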

Of these procedures, (I) was used by Beaton and Tukey (1974), (II) is exact Newton-Raphson, and (III) is similar to that used by Huber (1973) and Bickel (1975), with the exception that they used an empirical estimate of $a$, in the absence of assumptions on $f$. The three procedures coincide in the Normal case. In general, Newton's method (II) has the strongest justification, but it is usually more complicated to program, and is only defined when $\psi'(r) > 0$ for all $r$: that is, when $\psi$ is strictly increasing ($f$ log-concave). Methods (I) and (III) are about equally simple to program, and (III) has the advantage of using only the unweighted design matrix $X$.

Maximum likelihood for the error distribution $f(t) = k c^{1/k}\exp(-c|t|^k)/(2\Gamma(k^{-1}))$ coincides with $L_k$ regression, which minimizes $\sum_i |y_i - \sum_j x_{ij}\beta_j|^k$. These models, and asymmetric versions in which the exponent multiplier depends on the sign of $t$, are the only regressions in the class of models of Section 3.1: numerical solution is in principle easier since $\sigma$ factors out of (14), and (15) has an explicit solution for $\sigma$. However, these models are notoriously difficult to fit when $k < 1$. IRLS cannot be recommended when $k < 2$: separate methods based on linear programming are available for $L_1$ regression (Barrodale and Roberts, 1973).

Dempster, Laird and Rubin (1980) discuss maximum likelihood regression, using method (I), for error distributions from the "Normal/independent" family, which includes the $t$ distributions. They demonstrate that the IRLS algorithm so implemented is an example of an EM algorithm, so that theoretical results about convergence are available here.

Standard errors for the regression coefficients are readily obtained. We noted above that the expected negative second derivatives with respect to $\beta$ are

$$E\left[-\frac{\partial^2 L}{\partial\beta\,\partial\beta^T}\right] = \frac{a}{\sigma^2}\,X^T X,$$

so that the estimated asymptotic covariance matrix for the estimates of $\beta$ is $\hat\sigma^2\hat a^{-1}(X^T X)^{-1}$.

Turning now to robust regression, the principle of M-estimation (Huber, 1973) suggests estimating $\beta$ by minimizing $\sum_i \rho((y_i - \sum_j x_{ij}\beta_j)/\sigma)$ for a suitably chosen loss function $\rho$. If $\rho$ is $-\log f$ for some density function $f$ then this is numerically equivalent to maximum likelihood again, assuming that $f$ is the correct density. We proceed as above using $\psi = \rho'$, except that in method (III) the intrinsic accuracy $a$ must be replaced by an empirical estimate. Essentially the only other differences are in the basis for choice of the function $\psi$ (from robustness considerations rather than a probability model), and in the treatment of $\sigma$. In robust regression it is usual to use criteria other than the maximum likelihood equation (15), and often, not to iterate on scale. Holland and Welsch (1977) provide a useful discussion of these points, and of the choice of $\psi$ function. No convergence theory is known for iterating on scale unless the Huber function $\psi(r) = \mathrm{sign}(r)\min\{|r|, H\}$ (Huber, 1973) is used. They recommend using this first, and then, if a different $\psi$ function is preferred, continuing without further change in $\sigma$. They compare eight different $\psi$ functions in their paper, some based on likelihood models, and evaluate efficiencies and robustness properties by numerical integration and simulation.

4.2. A Regression Example from Materials Science
Delayed fracture in brittle materials may be demonstrated by observing the change in bend strength over a range of constant stress rates to failure. Braiden, Green and Wright (1982) use the classical Weibull model for the distribution of brittle strength to derive a failure model in which $y$, the logarithm of the fracture stress, turns out to be related to $x$, the logarithm of the stress rate, by the linear regression $y = \beta_1 + \beta_2 x + \sigma e$. The error $e$ has a Gumbel distribution, and $\beta_2$ and $\sigma$ are simply related to the stress corrosion and brittle fracture parameters that are of primary interest.

Raw data from one particular experiment is presented in Table 2. Note that the experiment was conducted at five different stress rates, with twelve independent observations at each, that there is considerable random variation in the data, and that a Normal linear regression, even after taking logarithms, would have suggested that there are several outliers.

TABLE 2
Stress fatigue data. Double torsion test on specimens of tungsten carbide 6% cobalt alloy with ground surface finish. Fracture stress (MN m⁻²) at five different stress rates

Stress rate (MN m⁻² s⁻¹)
 0.1     1      10     100    1000
1676   1895   2271   1997   2540
2213   1908   2357   2068   2544
2283   2178   2458   2076   2606
2297   2299   2536   2325   2690
2320   2381   2705   2384   2863
2412   2422   2783   2752   3007
2491   2441   2790   2799   3024
2527   2458   2827   2845   3068
2599   2476   2837   2899   3126
2693   2528   2875   2922   3156
2804   2560   2887   3098   3176
2861   2970   2899   3162   3685

See Braiden, Green and Wright (1982).

Each of the three IRLS methods of the previous section converges quickly to the maximum likelihood estimates of $\beta_1$, $\beta_2$ and $\sigma$: the resulting estimates appear in Table 3. For each of the methods, 7 iterations were needed to obtain the three estimates to relative accuracies of $10^{-5}$, $10^{-4}$ and $10^{-3}$ respectively, and 11 iterations to obtain all three to $10^{-5}$. There is nothing to choose between the methods on grounds of performance.

TABLE 3
Linear regressions of the natural logarithms of the data in Table 2

Estimates (s.e.'s in parentheses)

                                      β₁           β₂           σ
Least squares                       7.8089     0.021115     0.12887
Maximum likelihood:
  Ungrouped data                    7.8667     0.021867     0.10594
                                   (0.01675)   (0.00420)
  Grouped data, cut points:
    7.5(0.05)8.1                    7.8663     0.020454     0.09947
                                   (0.01654)   (0.00408)    (0.01052)
    7.6(0.1)8.0                     7.8657     0.021717     0.09662
                                   (0.01678)   (0.00459)    (0.01198)

As an experiment, the same model was fitted after grouping the data, using the method for multinomial data outlined in Section 2.1. Remarkably, even with a very coarse grouping, the parameter estimates and their estimated standard errors are close to those for the ungrouped data. This is illustrated in Table 3 for two different groupings. Braiden, Green and Wright (1982) gave further details of the analysis, which included a Monte-Carlo assessment of the adequacy of the Weibull model.

5. RESIDUALS AND RESISTANT METHODS
5.1. Defining Residuals in Non-linear Models
Residuals are traditionally thought of as (possibly standardized) discrepancies between observations and fitted values, $y - \hat E(y)$. More recently, definitions have been sought which yield residuals uncorrelated under the assumed model and thus form a better basis for diagnostic determination of model inadequacy. We have already met one such definition implicitly. Following the Householder decomposition for the simple linear model, we use the first $p$ components of $Qy$ to determine $\beta^*$ by back substitution. The remaining $(n - p)$ components are uncorrelated with variance $\sigma^2$, if the original data follow the given model and are uncorrelated with variance $\sigma^2$. Model inadequacy can thereby be diagnosed, but we cannot identify data inadequacy: the correspondence of "residuals" with "observations" has been lost.

Moving away from simple linear models, how are we to define useful "residuals" that can form a basis for assessment of model adequacy, detection of discrepant observations, and (to anticipate the next Section) accommodation of possibly discrepant observations by their use in resistant analyses?

One basis for a general definition is to assign residuals not to the observations, but to the predictors $\eta$. The philosophy is that our model is prescribed by a likelihood function $L(\eta)$, where $\eta$ varies a priori in an $n$-dimensional space. We then introduce restrictions on the freedom of $\eta$ to vary by requiring $\eta = \eta(\beta)$, a given function of a $p$-vector $\beta$ which is now the target of our inference. The $(n - p)$ degrees of freedom lost by thus parameterizing $\eta$ force discrepancies between the data and the model $L(\eta(\beta))$. The residual assigned to each component of $\eta$ should then measure the enforced change in $\eta$. A natural definition from this point of view would be a vector of residuals $\hat\eta - \eta(\hat\beta)$, where $\hat\eta$ maximizes $L(\eta)$ and $\hat\beta$ maximizes $L(\eta(\beta))$. This agrees with the usual definition for Normal linear regression, $y \sim N(\eta, \sigma^2 I)$; however, it is not invariant to trivial reparameterizations of $\eta$. It is better to examine changes in $\eta$ on the $L$-scale, so we define the $n$-vector of raw deviances $\Delta$ by

$$\Delta_i = 2\left(\sup_t\,\{L(\eta(\hat\beta) + t e_i)\} - L(\eta(\hat\beta))\right) \qquad (20)$$

(where $e_i$ is the unit vector in the $i$th direction); that is, $\Delta_i$ is twice the increase in log-likelihood attained by freeing $\eta_i$ from its dependence on $\beta$. In the case of Normal linear regression, $\Delta_i = (y_i - \hat\eta_i)^2/\sigma^2$, where $\hat\eta_i$ and $\sigma^2$ are the fitted mean and variance of $y_i$. More generally, if we can write

$$L(\eta) = \sum_i L_i(\eta_i), \qquad (21)$$

for example if we have $n$ independent observations each parameterized by one $\eta_i$, then

$$\Delta_i = 2\left(\sup_{\eta_i} L_i(\eta_i) - L_i(\eta_i(\hat\beta))\right)$$

and we have the useful property that

$$\sum_i \Delta_i = \text{constant} - 2L(\eta(\hat\beta)). \qquad (22)$$

Thus in this case maximum likelihood is equivalent to minimum sum-of-deviances. If (22) is deemed important, yet (21) does not hold, the only option seems to be to define deviances sequentially: for example, let

$$\Delta_i^* = \Delta_i^+ - \Delta_{i-1}^+, \quad\text{where } \Delta_i^+ = 2\left(\sup\,\{L(\eta')\} - L(\eta(\hat\beta))\right), \qquad (23)$$

the supremum being taken over $\eta'$ with $\eta_j' = \eta_j(\hat\beta)$ for all $j > i$.
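When (21) holds, the $\Delta_i$ are the familiar unit deviances of generalized linear models. An illustrative Python check for the Poisson case (an assumed example, not code from the paper), for which the supremum in each $\Delta_i$ is attained at $\mu_i = y_i$:

```python
import numpy as np

def poisson_loglik(y, mu):
    # Poisson log-likelihood terms, omitting the log(y!) constant
    return y * np.log(mu) - mu

def poisson_deviances(y, mu):
    """Raw deviances (20): twice the gain from freeing each eta_i.
    The saturated value takes mu_i = y_i, with the convention 0 log 0 = 0."""
    y_safe = np.where(y > 0, y, 1.0)
    sat = np.where(y > 0, poisson_loglik(y, y_safe), 0.0)
    return 2.0 * (sat - poisson_loglik(y, mu))
```

Property (22) then reads $\sum_i \Delta_i = 2L_{\text{sat}} - 2L(\mu)$, and each $\Delta_i$ reduces to the textbook form $2\{y_i\log(y_i/\mu_i) - (y_i - \mu_i)\}$.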

Then the analogue of (22) holds for $\{\Delta_i^*\}$, and this definition is equivalent to (20) if (21) holds. A similar definition could be given for any ordering of the components of $\eta$ in (23).

For elucidation of definition (23), suppose that $L$ can be replaced by its approximating quadratic $L(\eta) = c + b^T\eta - \frac{1}{2}\eta^T A\eta$, where $c$, $b$, and the non-negative definite matrix $A$ are all constant. Then

$$\frac{\partial L}{\partial\eta} = b - A\eta = u \quad\text{and}\quad -\left\{\frac{\partial^2 L}{\partial\eta\,\partial\eta^T}\right\} = A.$$

It may be shown that

$$\Delta_i^+ = u_{(i)}^T A_{(i)}^{-1} u_{(i)} = \sum_{j=1}^{i} z_j^2,$$

where the subscripts in parentheses truncate after the $i$th row and column, $z = B^{-1}u$, and $B$ is the Cholesky square root of $A$. Thus we have $\Delta_i^* = z_i^2$. This analysis is exact for the (correlated) Normal case, and otherwise only approximate.

For generalized linear models we have $\Delta_i \approx z_i^2$, where $z_i = \sqrt{\pi_i}(y_i - \mu_i)/\tau_i$ is just the $i$th observation standardized, after cancelling out the precision parameter $\phi$; these are referred to as standardized residuals in the GLIM program and manual.

In general, note from (7) that the IRLS algorithm is seeking $\hat\beta$ such that $z = B^{-1}u$ is uncorrelated with the columns of $B^T D$. It is easy to calculate $z$ (by forward substitution, since $B$ is lower-triangular), and the IRLS algorithm can be programmed to use these values directly for updating $\beta$. Jorgensen (1983) has also suggested the use of $B^{-1}u$ as a vector of "score residuals". The above discussion suggests that they could be used interchangeably with the deviances.

However, it is easy to find situations, even where the observations are independent, in which the use of score residuals does not make sense. In the case of linear regression (see Section 4.1) we have

$$u_i = \frac{\partial L}{\partial\eta_i} = \sigma^{-1}\psi(r_i) \quad\text{and}\quad A_{ii} = E\left(-\frac{\partial^2 L}{\partial\eta_i^2}\right) = \frac{a}{\sigma^2},$$

so that, whether we use $z_i$ as above, or replace the denominator by $\sigma^{-2}\psi'(r_i)$ without taking expectations, these residuals will not be defined and useful unless $\psi$ is strictly increasing. This is precisely the condition for method (II) of Section 4.1 to work: that is, $f$ must be log-concave.

This failure does not extend to the deviances: it is actually the quadratic approximation used above that breaks down. In fact, by (20) and (23), $\Delta_i = \Delta_i^* = 2\log(f_{\max}/f(r_i))$, where $f_{\max}$ is the supremum value attained by $f$. If $f$ is bounded and continuous, this gives a sensible definition.

Residuals for diagnostic purposes in logistic regression have been discussed by Pregibon (1981). Of the various possible available scales, he finds $\Delta_i$ (or its signed square root) most useful, but also uses $z_i$. His paper, while nominally addressed to the binomial/logistic model, is relevant to all generalized linear models. Our discussion suggests that deviances should be of wider applicability, but that care is needed if independence (21) does not apply.

5.2. Resistant Alternatives to Likelihood Methods
The principle of resistance in statistical data analysis (as distinct from robustness) dictates that fitted models should be almost invariant to large changes in individual observations. As applied to linear regression, this usually means that the least squares criterion:

Choose $\beta$ to minimize $\sum_i z_i^2$ (24)

where $z_i \equiv z_i(\beta) = y_i - \sum_j x_{ij}\beta_j$, is replaced by:

Choose $\beta$ to minimize $\sum_i w(z_i)\,z_i^2$ (25)

where $w(z)$ is a weight function chosen to be 1 for small $|z|$ and declining as $|z|$ increases, in order to give less weight to "discrepant" observations (discrepant in the sense only of fitting the model less well). As implemented by Tukey (1977) and others (e.g. McNeil (1977)), resistant regression is achieved not through (25) but by:

Solve $\sum_i w(z_i)\,z_i\,x_{ij} = 0$ for all $j$. (26)

Thus the normal equations, not the sum-of-squares criterion, are weighted. This distinction has led to some confusion. The obvious IRLS approach of updating $\beta$ to $\beta^*$ by regressing

$$\{w(z_i)\}^{1/2}\,y_i \text{ on } \{w(z_i)\}^{1/2}\,x_{ij} \text{ to get } \beta^* \qquad (27)$$

converges, if at all, to a solution of (26), not (25). If it is really intended to solve (25), then differentiation leads to normal equations of form (26) but with $w(\cdot)$ replaced by a different weight $w^*(z) = w(z) + \frac{1}{2} z\,w'(z)$. Unfortunately, for many sensible weight functions $w(\cdot)$, including Tukey's bi-square, $w(z) = (\max\{0, 1 - cz^2\})^2$, $w^*(\cdot)$ is not a valid weight function as it does not remain non-negative. Thus this IRLS approach will not apply for such $w(\cdot)$.

In this section we discuss the resistant methods obtained when (24) is regarded as maximum likelihood for Normal linear regression, and this model is replaced by an arbitrary one. In view of the points made above, it seems most natural to regard (26) rather than (25) as to be generalized. At least in the case where $L(\eta) = \sum_i L_i(\eta_i)$ we are led to consider weighted likelihood equations:

$$\sum_i w_i \frac{\partial L_i}{\partial\beta_j} = 0 \quad\text{for all } j, \qquad (28)$$

where the weights $w_i$ depend on the discrepancy between the data and the fitted model. For measures of discrepancy, we usually use the deviances of the previous section. Pregibon (1982) discusses such resistant fits for the binomial/logistic model, again in terms of deviances, but specified by analogy with (25) as minimizing $\sum_i \lambda(\Delta_i)$ where $\lambda(\cdot)$ is a differentiable non-decreasing function. Again, this can be cast in the form (28).

However the weights are obtained, it seems natural to attempt to solve (28) by IRLS. In matrix notation we have $D^T W u = 0$, where $W = \mathrm{diag}(w_i)$. In general, all of $u$, $W$ and $D$ depend on $\beta$, but in the spirit of our earlier discussion it is tempting to approximate the "second derivatives" by treating $W$ and $D$ as fixed. Updated estimates are thus obtained from the equations $D^T WAD(\beta^* - \beta) = D^T Wu$, so that $\beta^*$ is chosen to minimize

$$\{u + AD(\beta - \beta^*)\}^T W A^{-1} \{u + AD(\beta - \beta^*)\} = \sum_i w_i v_i \left\{\frac{u_i}{v_i} + \sum_j d_{ij}(\beta_j - \beta_j^*)\right\}^2, \qquad (29)$$

where $v_i = A_{ii}$ and $d_{ij}$ are the elements of $D$. Thus the only complication is an additional set of weights in the regression. Here of course $w_i$, as well as $u_i$, $v_i$ and $d_{ij}$, are calculated at the current value, $\beta$.

A program to fit the model conventionally may easily be modified to produce a resistant fit. It will be seen from (29) that it is necessary only to multiply $u_i$ and $v_i$ by $w_i$ immediately before the least squares step. In the case of generalized linear models this can be achieved using GLIM by either multiplying the prior weights $\pi_i$ (%PW) by $w_i$, or dividing the variance function $\tau_i^2$ (%VA) by $w_i$. Using the second alternative an OWN fit may be specified. It will usually be necessary also to redefine the deviance terms %DI to give a sensible convergence criterion.
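The modification is easily illustrated. The sketch below (assumed, not code from the paper) applies the weighted step of (29) to the binomial/logistic model, with Tukey's bi-square applied to the signed square-root deviances as the measure of discrepancy:

```python
import numpy as np

def bisquare(z, B=4.0):
    return np.maximum(0.0, 1.0 - (z / B) ** 2) ** 2

def resistant_logistic(X, y, m, B=4.0, n_iter=100):
    """Resistant fit of a logistic model for y successes out of m trials:
    the conventional IRLS quantities u_i and v_i = A_ii are multiplied by
    resistance weights w_i before each least squares step, as in (29)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = m / (1.0 + np.exp(-(X @ beta)))      # fitted counts
        u = y - mu                                # score with respect to eta
        v = mu * (m - mu) / m                     # binomial variance
        with np.errstate(divide="ignore", invalid="ignore"):
            dev = 2.0 * (np.where(y > 0, y * np.log(y / mu), 0.0)
                         + np.where(y < m, (m - y) * np.log((m - y) / (m - mu)), 0.0))
        z = np.sign(u) * np.sqrt(np.maximum(dev, 0.0))   # signed-root deviance
        wt = bisquare(z, B)
        beta = beta + np.linalg.solve(X.T @ ((wt * v)[:, None] * X),
                                      X.T @ (wt * u))
    return beta, wt
```

A grossly discrepant observation receives weight near zero and so barely influences the fit, while observations consistent with the model keep weights near one.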

It is found that the presence of $\{w_i\}$ adversely affects the convergence properties of this IRLS method. If the process does converge, it will be to a solution of (28), but it may not converge at all. Some experimentation with different starting points, for example using $L_1$ rather than unweighted least squares initially, is often needed. It is occasionally necessary to start with constant weights ($w_i \equiv 1$), and gradually change them towards the desired values as the iterations progress. These suggestions may seem subjective and ad hoc, but it is not the aim of resistant data-analysis to provide a unique "objective" solution, but rather to examine whether any doubt should be cast on a model fitted conventionally.

When the likelihood is not of the form $\sum_i L_i(\eta_i)$, the only way of proceeding seems to be to replace

Maximize $L$, i.e. minimize $\sum_i \Delta_i^*$, by

Minimize $\sum_i \lambda(\Delta_i^*)$, i.e. solve $\sum_i w_i\,\dfrac{\partial\Delta_i^*}{\partial\beta_j} = 0$ for all $j$.

Formally this can be treated just as above, although the problem is now not diagonal. We can proceed as if

$$u_r = -\tfrac{1}{2}\frac{\partial\Delta_r^*}{\partial\eta_r} \quad\text{and}\quad A_{rs} = -\tfrac{1}{2}\frac{\partial^2\Delta_r^*}{\partial\eta_r\,\partial\eta_s},$$

but this method is untested. Note that the definitions of $\Delta_i^*$ assume an ordering on the components of $\eta$, and so the behaviour of the algorithm, and its solution, may depend on this ordering.

5.3. An Example from Probit Analysis
For an illustration of a resistant analysis we consider the experiment described by Finney (1952, p. 69), which assessed the relative potency of three poisons. The data are given in Table 4.

TABLE 4
Relative potency of three poisons

Observation
number    Kill   Out of   Poison   Log dose
   1       44      50       R        1.01
   2       42      49       R        0.89
   3       24      46       R        0.71
   4       16      48       R        0.58
   5        6      50       R        0.41
   6       48      48       D        1.70
   7       47      50       D        1.61
   8       47      49       D        1.48
   9       34      48       D        1.31
  10       18      48       D        1.00
  11       16      49       D        0.71
  12       48      50       M        1.40
  13       43      46       M        1.31
  14       38      48       M        1.18
  15       27      46       M        1.00
  16       22      46       M        0.71
  17        7      47       M        0.40

Poisons: R - rotenone; D - deguelin; M - mixture.
From Finney (1952), p. 69.

Following Finney, we perform a series of probit analyses in which it is assumed that the kill probability for a log-dose $x$ of poison $r$ ($r = 1, 2, 3$) is $\Phi(\alpha_r + \beta_r x)$, where $\Phi$ is the standard Normal distribution function. Of interest here is the possibility that the $\alpha$'s or the $\beta$'s are equal. Deviances from maximum likelihood fits are given in Table 5: clearly there is ample evidence that the regression lines are neither the same nor parallel. Interpretation is easier if the lines are parallel, as this situation corresponds to constant relative potency at all levels of response. We therefore

TABLE 5
Maximum likelihood analyses for Finney's data

Number of     Observations              Degrees
parameters*   omitted        Deviance   of freedom
    2         -                70.8        15
    4         -                30.3        13
    6         -                20.1        11
    4         2, 11, 15        14.4        10
    4         11, 14, 15       13.7        10
    4         11, 16, 17        7.7        10
    2         11, 16, 17       67.7        12
    6         11, 16, 17        7.1         8

*2: $\alpha, \beta$; 4: $\alpha_1, \alpha_2, \alpha_3, \beta$; 6: $\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \beta_3$.

persevere with the four-parameter model $(\alpha_1, \alpha_2, \alpha_3, \beta)$ and attempt a resistant analysis. For comparison, three alternative weight functions were used, selected from those listed by Holland and Welsch (1977) for robust linear regression:

Bi-square: $w(z) = (\max(0, 1 - (z/B)^2))^2$
Cauchy: $w(z) = (1 + (z/C)^2)^{-1}$
Huber: $w(z) = \min(1, H/|z|)$
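These three weight functions are simple to code. The Python versions below are illustrative; each reduces to the ordinary likelihood weight $w \equiv 1$ as its tuning constant tends to infinity:

```python
import numpy as np

def bisquare(z, B):
    # Tukey's bi-square: zero weight beyond |z| = B
    return np.maximum(0.0, 1.0 - (z / B) ** 2) ** 2

def cauchy(z, C):
    return 1.0 / (1.0 + (z / C) ** 2)

def huber(z, H):
    # weight 1 inside [-H, H], declining as H/|z| outside
    return np.minimum(1.0, H / np.maximum(np.abs(z), 1e-300))
```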

In each case there is a tuning constant, taken as $+\infty$ for an ordinary likelihood analysis, and reduced for greater resistance. For residuals $z$ we use the score residuals, which are simply the observations standardized by their fitted means and standard deviations for this exponential family model. Both least-squares and least-absolute-deviations fits were used to provide initial estimates. For each weight function the tuning constant was gradually reduced until stability was achieved; see Table 6. This procedure generated various possible groups of candidate observations to be labelled discrepant. Likelihood analyses were performed with these omitted

TABLE 6
Resistant analyses of four-parameter model for Finney's data

Weight      Tuning     Initial values     Weighted   Observations severely
function    constant   (2 = L2, 1 = L1)   deviance   down-weighted
   B           9         1 or 2             28.2     -
   B           6         1 or 2             25.5     11, 2
   B           4.5       1 or 2             17.5     11, 16, 2
   B           3         2                   5.3     11, 17, 16
   B           3         1                   5.9     11, 14, 15
   C           6         2                  28.1     -
   C           4         2                  25.8     11, 2
   C           2         2                  17.7     11, 2
   C           1.4       1 or 2              9.3     11, 17, 16
   H           2         2                  29.3     -
   H           1.5       2                  26.0     11, 2
   H           1         1 or 2             19.2     11, 2
   H           0.5       2                   9.8     11, 2, 15
   H           0.4       1 or 2              7.8     11, 2, 15

(Table 5), revealing in particular that a dramatically better fit for the 4-parameter, parallel regression line, model is obtained if observations 11, 16 and 17 are neglected. There is no evidence to suggest that the 2- or 6-parameter models are preferable, having omitted these observations. We therefore settle with the conclusion that all observations except numbers 11, 16 and 17 are consistent with the four-parameter model in which the kill probabilities are $\Phi(-2.673 + 3.906x)$, $\Phi(-4.366 + 3.906x)$ and $\Phi(-3.712 + 3.906x)$ for rotenone, deguelin, and the mixture.

Following inspection of the data, Finney omitted the same three observations from his analysis altogether, a course of action defended on the grounds that the chief interest here is in the behaviour of the poisons at high concentrations. He then fitted parallel probit curves, and his estimates are similar to ours.

Note that we would not advocate the use of resistant methods in order simply to reject inconvenient data: such discrepancies should normally be followed up with the experimenter. It is obviously unsatisfactory to have to discard three observations out of 17. Note also that the result here is not just a better fit, but a better fit to a simpler and more interpretable model.

ACKNOWLEDGEMENTS
I am very grateful to Julian Besag and Allan Seheult for stimulation and encouragement during the course of this work, to Peter McCullagh, Bent Jorgensen and David Brillinger for helpful comments on an earlier version, to Professor H. Marsh for the degree classification data, and to the referees for valuable suggestions on presentation and a number of additional references.

REFERENCES
Alvey, N. G. et al. (1977) The Genstat Manual. Rothamsted Experimental Station.
Anderson, J. A. and Philips, P. R. (1981) Regression, discrimination and measurement models for ordered categorical variables. Appl. Statist., 30, 22-31.
Anderson, J. A. (1984) Regression and ordered categorical variables (with Discussion). J. R. Statist. Soc. B, 46 (in press).
Bailey, N. T. J. (1961) Introduction to the Mathematical Theory of Genetic Linkage. Oxford: University Press.
Baker, R. J. and Nelder, J. A. (1978) The GLIM System, Release 3. Oxford: Numerical Algorithms Group.
Barrodale, I. and Roberts, F. D. K. (1973) An improved algorithm for discrete L1 approximation. SIAM J. Numer. Anal., 10, 839-848.
Beaton, A. E. and Tukey, J. W. (1974) The fitting of power series. Technometrics, 16, 147-185.
Bickel, P. J. (1975) One-step Huber estimates in the linear model. J. Amer. Statist. Assoc., 70, 428-434.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1976) Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Bliss, C. (1935) The calculation of the dosage-mortality curve. Ann. Appl. Biol., 22, 134-167.
Braiden, P., Green, P. J. and Wright, B. D. (1982) Quantitative analysis of delayed fracture observed in stress tests on brittle materials. J. Material Sci., 17, 3227-3234.
Brillinger, D. R. and Preisler, H. K. (1983) Maximum likelihood estimation in a latent variable problem. In Studies in Econometrics, Time Series, and Multivariate Statistics, 31-65 (S. Karlin, T. Amemiya and L. A. Goodman, eds). New York: Academic Press.
Brown, M. B. and Fuchs, C. (1983) On maximum likelihood estimation in sparse contingency tables. Computational Statistics and Data Analysis, 1, 3-15.
Burn, R. (1982) Loglinear models with composite link functions in genetics. In Proceedings of the International Conference on Generalised Linear Models, 144-154 (R. Gilchrist, ed.). Lecture Notes in Statistics, vol. 14. New York: Springer.
Burridge, J. (1981) A note on maximum likelihood estimation for regression models using grouped data. J. Roy. Statist. Soc. B, 43, 41-45.
Businger, P. and Golub, G. H. (1965) Linear least squares solutions by Householder transformations. Numer. Math., 7, 269-276 (reprinted in Handbook for Automatic Computation, Vol. II, Linear Algebra, J. H. Wilkinson and C. Reinsch, eds. Berlin: Springer-Verlag).
Chambers, J. M. (1977) Computational Methods for Data Analysis. New York: Wiley.
Cox, D. R. (1970) Analysis of Binary Data. London: Methuen.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B, 39, 1-38.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1980) Iteratively reweighted least squares for linear regression when errors are Normal/independent distributed. In Multivariate Analysis-V, 35-57 (P. R. Krishnaiah, ed.). Amsterdam: North Holland.

This content downloaded from 138.25.2.146 on Mon, 24 Mar 2014 22:24:16 PM All use subject to JSTOR Terms and Conditions 170 GREEN [No. 2, Dixon,W. J. (ed.) (1981) TheBMDP Statistical Programs. University of CaliforniaPress. Edwards,A. W.F. (1972) Likelihood.Cambridge: University Press. Efron,B. and Hinkley,D. V. (1978) Assessingthe accuracy of themaximum likelihood estimator: Observed versusexpected Fisher information. Biometrika, 65, 457-487. Finney,D. J.(1947, 1952) ProbitAnalysis. Cambridge: University Press. Fisher,R. A. (1925) Theoryof statistical estimation. Proc. Camb. Phil. Soc., 22,700-725. (1935) The caseof zero survivors, In Bliss,(1935). Fletcher,R. and Reeves,C. M. (1964) Functionminimisation by conjugategradients. Computer J., 7, 149-154. Garwood,F. (1941) The applicationof maximumlikelihood to dosage-mortality curves. Biometrika, 32, 46-58. Gentleman,W. M. (1973) Leastsquares computations by Givenstransformations without square roots. J. Inst. Maths.Applic.,12,329-336. (1974) Basicprocedures for large, sparse or weighted least-squares. Appl. Statist., 23, 448-454. Gili,P. E. and Murray,W. (1972) Quasi-Newtonmethods for unconstrained optimisation. J.Inst. Math. Applic., 9, 91-108. Golub,G.. H. (1969) Matrixdecompositions and statisticalcalculations. In StatisticalComputation, 365-397. (R. C. Miltonand J.A. Nelder,eds). NewYork: Academic Press. Golub, G. H. and Styan,G. P. H. (1973) Numericalcomputations for univariate linear models. J. Statist. Comput.Simul., 2, 253-274. Hinde,J. (1982) CompoundPoisson regression models. In Proceedingsof theInternational Conference on GeneralisedLinear Models 109-121. (R. Gilchrist,ed.) LectureNotes in Statistics.Vol 14. New York: Springer. Holland,P. W. and Welsch,R. E. (1977) Robustregression using iteratively reweighted least-squares. Commun. Statist.Theor. Meth., A6, 813-827. Huber,P. J. (1973) Robustregression: asymptotics, conjectures and MonteCarlo. Ann. Statist., 1, 799-821. Jennrich,R. I. and Moore,R. H. 
(1975) Maximum likelihood estimation by means of non-linear least squares. Proc. Amer. Statist. Assoc. (Statistical Computing Section), 57-65.
Jennrich, R. I. and Ralston, M. L. (1979) Fitting non-linear models to data. Ann. Rev. Biophys. Bioeng., 8, 195-238.
Jorgensen, B. (1983) Maximum likelihood estimation and large-sample inference for generalised linear and non-linear regression models. Biometrika, 70, 19-28.
Kale, B. K. (1961) On the solution of the likelihood equation by iteration processes. Biometrika, 48, 452-456.
— (1962) On the solution of likelihood equations by iteration processes. The multiparametric case. Biometrika, 49, 479-486.
McCullagh, P. (1980) Regression models for ordinal data (with Discussion). J. Roy. Statist. Soc. B, 42, 109-142.
— (1983) Quasi-likelihood functions. Ann. Statist., 11, 59-67.
McIntosh, A. (1982) Fitting Linear Models: An Application of Conjugate Gradient Algorithms. Lecture Notes in Statistics, Vol. 10. New York: Springer.
McNeil, D. (1977) Interactive Data Analysis. New York: Wiley.
Moore, R. H. and Zeigler, R. K. (1967) The use of non-linear regression methods for analyzing sensitivity and quantal response data. Biometrics, 23, 563-566.
NAG (1981) Library Manual, Mark 8. Oxford: Numerical Algorithms Group.
Nelder, J. A. and Mead, R. (1965) A simplex method for function minimisation. Computer J., 7, 308-313.
Nelder, J. A. and Wedderburn, R. W. M. (1972) Generalized linear models. J. Roy. Statist. Soc. A, 135, 370-384.
Powell, M. J. D. (1964) An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer J., 7, 155-162.
Pratt, J. W. (1981) Concavity of the log likelihood. J. Amer. Statist. Assoc., 76, 103-106.
Pregibon, D. (1981) Logistic regression diagnostics. Ann. Statist., 9, 705-724.
— (1982) Resistant fits for some commonly used logistic models with medical applications. Biometrics, 38, 485-498.
Rao, C. R. (1973) Linear Statistical Inference and its Applications, 2nd ed. New York: Wiley.
Roger, J. H.
(1982) Composite link functions with linear log link and Poisson error. GLIM Newsletter, December 1982.
Ross, G. J. S. (1980) Maximum Likelihood Program. Rothamsted Experimental Station.
Thompson, R. and Baker, R. J. (1981) Composite link functions in generalised linear models. Appl. Statist., 30, 125-131.
Tukey, J. W. (1977) Exploratory Data Analysis. Reading, Mass.: Addison-Wesley.
Wedderburn, R. W. M. (1974) Quasi-likelihood functions, generalised linear models, and the Gauss-Newton method. Biometrika, 61, 439-447.
— (1976) On the existence and uniqueness of the maximum likelihood estimates for certain generalised linear models. Biometrika, 63, 27-32.

DISCUSSION OF DR GREEN'S PAPER

Dr Bent Jørgensen (Odense University): I am glad to see tonight's paper, and I find the paper difficult to criticize, because it promotes ideas that I have also been suggesting recently. My own starting point was an attempt to understand how GLIM fits generalized linear models. It is not so easy to understand Nelder and Wedderburn's (1972) paper, but it helps when one realizes that Fisher's scoring method may be interpreted as an iterative weighted least squares procedure for any type of distribution, not just for the exponential family. I wonder what led Dr Green on the track?

The iterative methods considered in the paper are all essentially of the form

    β* = β + δb,    b = (DᵀAD)⁻¹ Dᵀ ∂L/∂η,    (a)

where D = ∂η/∂βᵀ and A is a positive-definite symmetric matrix. Since D has full rank, DᵀAD is positive-definite. Hence (a) is a gradient method, and the steplength δ > 0 may be chosen to give an increase in the value of the likelihood compared with the previous iteration (Kennedy and Gentle, 1980, p. 430).

Although the computations in (a) may be performed via least-squares methods, I would like to suggest that the term iteratively reweighted least squares should only be used in connection with exponential families or quasi-likelihood estimation. Otherwise, I find the term too diffuse and ambiguous. I call (a) the delta-algorithm with weight matrix A, a reminder of the similarity between the form DᵀAD and the way asymptotic variance matrices are transformed by the δ-method.

I would like to stress the importance of calculating the steplength properly, because this often makes the difference between success and failure of gradient methods, even for the Newton-Raphson algorithm used with a concave log-likelihood. In Section 2.6 it is asserted that the convergence of the algorithm is slow far from the optimum, but if the steplength is properly computed, the convergence is in fact quick far from the optimum, and the choice of starting value is not important. But the convergence may be slow near the optimum, because the matrix −DᵀAD may be a poor approximation to the second derivative of L, particularly if D is non-constant.

For the resistant estimating equation (28) there is no objective function, so the steplength cannot be calculated in the usual way. It may be useful to define δ as the smallest positive solution to the equation

    u*ᵀW*D*(DᵀWAD)⁻¹DᵀWu = 0,

where δ enters the equation through the updated quantities u*, W* and D*. For maximum likelihood estimation, corresponding to W = I, this value of δ corresponds to the smallest local maximum of L(η(·)) on the half line {β + δb: δ > 0}.
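Dr Jørgensen's prescription of step-halving on δ until the likelihood increases can be sketched in a few lines of numpy. This is a sketch only: the callable arguments `eta`, `jac`, `score` and `loglik` are illustrative names, not anything defined in the paper.

```python
import numpy as np

def delta_step(beta, eta, jac, score, loglik, A):
    """One iteration of the 'delta-algorithm' (a):
    b = (D'AD)^{-1} D' dL/deta, followed by halving the steplength
    delta until the log-likelihood increases.  The function arguments
    are hypothetical callables supplied by the user."""
    D = jac(beta)                      # D = d eta / d beta'
    u = score(eta(beta))               # dL/deta at the current eta
    b = np.linalg.solve(D.T @ A @ D, D.T @ u)
    L0 = loglik(eta(beta))
    delta = 1.0
    while loglik(eta(beta + delta * b)) <= L0 and delta > 1e-8:
        delta /= 2.0                   # halve steplength until L increases
    return beta + delta * b
```

In the Normal linear case with A = I the step reduces to ordinary least squares in a single iteration, which makes a convenient check of the algebra.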
As suggested in the paper we may take A as either the observed or expected information matrix for η, but other choices for A are possible. If observed information is used when L(η) is not concave, a simple modification of the Cholesky decomposition (Kennedy and Gentle, 1980, p. 445) may be used to obtain a positive-definite A. If the log-likelihood is of the additive form (21), two possible choices for A are as a diagonal matrix with elements either

    A_ii = −(∂L_i/∂η_i)/(η_i − η̃_i)    (i = 1, ..., n)

or

    A_ii = 2{L_i(η̃_i) − L_i(η_i)}/(η_i − η̃_i)²    (i = 1, ..., n),

where η̃_i maximizes L_i(η_i). These choices are applicable if L_i is unimodal, and the first generalizes (17), whereas the second corresponds to an algorithm proposed by Ross (1982). I have used (17) to do L1 regression in GLIM, and with a minor modification of A_ii for η_i near y_i this works well, and converges in less than twenty iterations. This and other exercises with the delta-algorithm suggest that by judicious choice of A, the delta-algorithm may often cope with "notoriously

difficult" problems. The choice of weight matrix for the delta-algorithm is discussed by Jorgensen (1983b).

The proper definition of residuals depends on the purpose for which they are intended, but as a rule there should be a one-to-one relation between residuals and observations, such that a large residual may be interpreted as an outlying observation. In general, the concept of an observation is somewhat elusive, but often we may parametrize the model in such a way that the components of η correspond to observations, in some sense. In the case y ~ N_n(η, Σ), with Σ known, any definition should ideally produce y − η̂ as the vector of residuals, so let us try to apply the various definitions of residuals in Section 5.1 to this example. The score residuals give the vector of residuals Bᵀ(y − η̂), where B is the square root of Σ⁻¹, and hence fail the test, because the ith residual is a function of several y_i − η̂_i. Similarly, (23) fails, whereas Σu(η̂) = y − η̂ gives the correct residuals. For (20), it may be shown that a one-step approximation to determine the supremum, taking one step of Fisher's scoring method starting at η̂(β̂), gives Δ_i as the square of the ith component of the vector

    K⁻¹(Σ⁻¹ − Σ⁻¹D(DᵀΣ⁻¹D)⁻¹DᵀΣ⁻¹)(y − η̂),

where K is diagonal. The main term of this expression is K⁻¹Σ⁻¹(y − η̂), so this definition also fails. The definition I prefer is to take d^{1/2}A⁻¹u as the vector of residuals, where d = diag{A_11, ..., A_nn}, although this may be problematic if the log-likelihood is not concave. For the multivariate normal distribution this definition yields (y_i − η̂_i){(Σ⁻¹)_ii}^{1/2} as the ith residual. Obviously, more work needs to be done on this subject.
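Dr Jørgensen's remark above about doing L1 regression by iterative reweighting can be sketched as follows. This is a sketch only, not the GLIM macro he used: for L_i = −|y_i − η_i| the weight blows up when η_i is near y_i, so an ε-guard stands in for the "minor modification" he alludes to, and all names are illustrative.

```python
import numpy as np

def l1_regression_irls(X, y, eps=1e-6, n_iter=50):
    """Least absolute deviations (L1) regression by iterative
    reweighting: weight A_ii = 1/|y_i - eta_i|, guarded by eps
    when the residual is near zero."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting value
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), eps)      # diagonal weights A_ii
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)  # weighted LS step
    return beta
```

For an intercept-only design the L1 fit is the sample median, which gives a quick check of the recursion.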
My final comment is on the resistant fitting methods discussed in Section 5.2, and it probably reveals my ignorance on the subject. Apparently, resistant methods discard outlying observations from the fit, but according to the ultimate paragraph of the paper, this is not the purpose of the method. If one intends to find outlying or influential observations, one might instead use the regression diagnostic methods of Pregibon (1981). But then, what is the purpose of resistant methods?

Iterative techniques for maximum likelihood estimation resemble scientific investigation, in the sense that many small steps are taken in the direction of an unknown optimum. The difficulty with scientific investigation is that the objective function is not always clearly defined, and the weights attached to various issues vary from individual to individual. But I find that tonight's paper takes a good step in the right direction, and it is a pleasure for me to propose the vote of thanks.

Mr R. Thompson (Animal Breeding Research Organisation, Edinburgh): My first comments are related to the very generality of the approach, giving me difficulty in embedding specific problems into the framework. The colour inheritance of North Ronaldsay sheep (Ryder et al., 1974) illustrates the point. Over 10 years ago I went through algebra similar to that in the ABO example, with a succession of models with over 100 terms to be differentiated and little structure in the equations. Later I realized that the models could be concisely represented by a loglinear model for the underlying genotypes and a composite matrix, C, linking the phenotypes to genotypes. The generalized linear model approach needed only slight modification using functions of the distribution as weights, and working dependent and regressor variables. In other examples the C matrix allows transformation from one scale, allowing easy expression of the expectation of observations, to another, allowing independence of the observations, for example in Dr Green's Example 2.2.
I hope that the next version of GLIM will allow the concise specification of C with model formulae, although the calculations are easily programmed in GENSTAT.

Again on the subject of generality I would like to have rules for parameterization for quick convergence. Why should log p be better than p in the ABO example? Although I seem to be the only person to have used them, the calculations for the speed of convergence of the EM algorithm (Dempster et al., 1977) seem useful.

On the definition of residuals it seemed, at least in the written version, that the non-diagonal information matrix, A, was motivating the sequential definition of residuals. Estimation of parameters imposes correlation between residuals and this is taken account of in the cross-validatory definition (20). The definition of residuals should depend on the use intended for them, and if A is non-diagonal then some statistic dependent on cross-products of suitable residuals

might be informative. Can one expect the definition of residuals to be useful with binary data, when the residual deviance can be uninformative?

Given the confusion between (25) and (26), can the resistant methodology be interpreted as a formal model being a mixture of the data and the fitted model?

Over the years I can remember coming to hear important papers by Lindley and Smith, Curnow and Smith, Wilkinson et al. and Nelder, and also managing to attend Smithfield Show. This year is no exception and I think Dr Green's paper will be as well-thumbed as these papers, and so it gives me great pleasure to second the vote of thanks.

The vote of thanks was carried by acclamation.

Professor Murray Aitkin (University of Lancaster): This paper is a welcome addition to the small number of papers demonstrating the power and flexibility of IRLS approaches to maximum likelihood. As a confirmed GLIM user I will restrict my comments to three GLIM-related points:

(i) Censoring of observations presents an awkward problem for GLIM because of the different link function and (implicit) error model for the censored and uncensored observations. Composite link functions are complicated and not constructed easily, if at all, in this case. Alternative maximum likelihood procedures using special features of the likelihood function work for right-censored observations from the Weibull and extreme value distributions (Aitkin and Clayton, 1980) and for the left- and right-censored observations from the logistic and log-logistic distributions (Bennett, 1983). For the normal and lognormal distributions the EM algorithm can be used (Aitkin, 1981), but the general IRLS approach described by the author is much simpler.

(ii) The discussion of robust and resistant methods is not reassuring. Without a probability model the "tuning constant" approach seems completely ad hoc. Model-based procedures are not difficult to construct, as the author notes in referring to the t-distribution method of Dempster et al. A likelihood-based "sensitivity analysis" equivalent to the tuning constant approach is possible by fitting the t-distribution for a fixed degrees of freedom k, and constructing a profile likelihood over k. This would clearly demonstrate the need, if it existed, for robust estimation.

(iii) The example in Section 5.3 seems to me an alarming example of the misuse of resistant methods. Dr Green has made it clear that this example is an illustration of how such methods fit into the IRLS approach, rather than an example of their value. On the probit scale, the regressions are not parallel for the complete set of data, so a simple comparison of relative potency of the three treatments is not possible.
However the lines for Deguelin and Mixture are nearly parallel, and can be set equal with little change in deviance. Thus the relative potency of R to D or M depends on dose level, but that of D to M does not. The "discrepant observations" 11, 16 and 17, when removed, allow a common slope to be fitted. The simpler model is achieved at the expense of, not three observations, but 142. Surely this is bad modelling practice, even if one is interested only in the relative potencies at high dose levels.

Examination of the full probit model for the full data shows a bad fit of the model at points 10 and 11. Curvature of the Deguelin points on the probit scale is quite noticeable. A complementary log-log link gives a considerable improvement, with a deviance of 16.09, instead of 20.13 for the 6-parameter model, and this is unchanged if the slopes for M and D are set equal. On the CLL scale there are no large residuals.

Surely it is better to present a model for the whole data which includes a necessary interaction, rather than removing the part of the data which establishes the need for the interaction and presenting a simpler model. To repeat, this is not a criticism of the paper, which I have found stimulating and valuable.

Mr R. Burn (Brighton Polytechnic): I would first like to comment on the use of IRLS for fitting multinomial models in genetics. Loglinear models with linear composite link functions (Thompson and Baker, 1981) which, as Dr Green points out, constitute a special case of his general class of models, are proving to be a convenient way of formulating multinomial models in which the cell probabilities are polynomial functions of the parameters to be estimated. With μ denoting the vector of expected cell counts, the structure of these models is

    μ = Cγ,    γ = exp(ξ),    ξ = Xβ,

where β is usually a vector of log gene frequencies or other genetic parameters. Although Dr Green's direct IRLS approach to such problems, exemplified by the ABO problem, is interesting,

the composite link formulation appears to have at least two advantages. Firstly, there is no need to find analytic derivatives, which could require considerable effort for some problems. Secondly, the C matrix usually has some genetic meaning. In many cases C is an expression of the dominance relationships among the various alleles. For both of these reasons, together with the relative ease with which these models can be fitted in GLIM or GENSTAT, composite link models offer the prospect of a routine practical method for estimating genetic parameters. On the other hand, multinomial models do occur in which the cell probabilities are more complicated functions of the unknown parameters. There are problems, for instance, in which they are rational functions. Such models do not fit so naturally (if at all) into the composite link framework, and Dr Green's direct attack on the likelihood would be a useful alternative.

The other point that I would like to make concerns Dr Green's remark that packages such as GLIM or GENSTAT have no facilities for fitting models with a non-diagonal weight matrix. Although this appears to be true, it is still possible to implement the general algorithm, at least in GENSTAT, by first diagonalizing the matrix A. Since A is assumed positive definite and symmetric, there is an orthogonal matrix P such that PᵀAP = Λ, say, where Λ is the diagonal matrix of eigenvalues of A, which are positive. The IRLS step then becomes: regress Λ⁻¹ũ + D̃β onto the columns of D̃ with weights Λ, where ũ = Pᵀu and D̃ = PᵀD. I have tried this in GENSTAT on two of the examples in the paper, the ABO problem and the grouped continuous multinomial model for the A-level score data, with essentially the same results as the author. I would not suggest that this is necessarily an efficient way of computing IRLS solutions, but it may be convenient for those familiar with GENSTAT.
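Mr Burn's diagonalization device can be sketched in numpy; this is a sketch only, with an illustrative function name, and its correctness check is that the step agrees with the direct non-diagonal computation β + (DᵀAD)⁻¹Dᵀu.

```python
import numpy as np

def irls_step_diagonalized(D, u, A, beta):
    """One IRLS step with a non-diagonal weight matrix A, performed via
    the eigendecomposition P'AP = Lambda so that only diagonal weights
    are needed (Burn's device for GENSTAT)."""
    lam, P = np.linalg.eigh(A)         # A = P diag(lam) P'
    u_t = P.T @ u                      # transformed score, u-tilde
    D_t = P.T @ D                      # transformed derivative matrix, D-tilde
    z = D_t @ beta + u_t / lam         # working variate
    # weighted least squares of z on the columns of D_t with weights Lambda
    return np.linalg.solve(D_t.T @ (lam[:, None] * D_t), D_t.T @ (lam * z))
```

Since D̃ᵀΛD̃ = DᵀAD and D̃ᵀΛz = DᵀADβ + Dᵀu, the diagonalized regression reproduces the non-diagonal step exactly.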
I would like to join the previous discussants in thanking Dr Green for a stimulating paper.

Dr Susan R. Wilson (Australian National University): Tonight's paper offers an overview of relevant literature on an important topic, namely the numerical evaluation of maximum likelihood estimates. The reasons for choosing one type of numerical method rather than another are complex, and some of these are outlined in Section 2.6. One further advantage of iterative least squares procedures is that they are more likely than the alternative methods to have convergence difficulties: often closer inspection of the data will then reveal that the model is not at all appropriate for the data. One example where use of a quasi-Newton procedure led to an unacceptable model being adopted is given in Guerrero and Johnson (1982). The data were being used to predict the probability of female participation in the Mexican labour force as a function of dichotomous variables, locality, age, income and schooling. Guerrero and Johnson considered a model with the Box-Cox power transformation (with parameter λ) of the odds ratio as a linear function of the explanatory variables, thereby extending the logistic regression model (λ = 0). They concluded that λ̂ = −6.6, and that the interaction between age and income was significant. The (correct) implementation of this extended model in GLIM (involving updating the model matrix on each iteration) fails to converge for these data. This enforces closer examination of the logistic regression model fit, which reveals the reason. There is an aberrant observation, corresponding to "factor combination ac" (locality, income). On removal of this observation and re-fitting of the logistic regression model, a satisfactory fit is obtained, now with an interaction between locality and income in the linear part of the model. (Further details I have given elsewhere.) That their final model was inadequate should have been obvious from the parameter estimates and their corresponding standard errors (Guerrero and Johnson, 1982, Table 2).
Also, other procedures, such as the "resistant analysis" described in Section 5.2, and the use of graphical tools for determining the influence of individual observations on the power transformation, would alert one to the aberrant observation. In general, it should not be necessary to go so far in the overall analysis to find such a glaring discrepancy. To summarize, if IRLS fails, go back and examine the data carefully, before considering use of quasi-Newton methods.

Use of iterative least squares procedures for sparse contingency tables, where there are zeros in any of the marginal configurations defined by a log-linear model, is not well understood. In commenting on Brown and Fuchs (1983), Aston and Wilson (1984) describe how the implementation given by equation (4) cannot identify directly that some of the estimated cell frequencies for the model may be identically zero. Hence some of the parameter estimates and their standard errors will not be correct, although the fitted values will be correct. Stabilization of the parameter estimates and their standard errors can be readily achieved by identification of the occurrence of zeros in any of the marginal configurations, and then constraining the corresponding (zero) cells to have estimated cell frequencies of zero exactly.

Finally, I would like to add that the description of GLIM in Section 3.3 is a very limited one. In using GLIM's OWN directive and associated macros, I find the schematic diagram Fig. D1 (ignoring weights) helpful. Using a notation resembling that in tonight's paper, the diagram shows how the GLIM vectors (given in brackets) can be used, via the macros M1, M2, M3 and M4, to implement a generalized IRLS. Then it can be seen that the GLIM notation is sufficiently flexible (not redundant), that by overwriting the GLIM vectors appropriately a wide variety [far wider than

just the composite link function regression mentioned in Section 3.3 and the discussion] of problems can be solved in this framework. Although usually the quantities in Fig. D1 will correspond respectively to A, D, u, y and so on, they will not necessarily. For example, if A is not diagonal then a singular value decomposition can be used to produce a diagonal Ã with D being correspondingly altered to D̃, and so forth. At which stage one abandons such manipulations and opts for writing a one-off program will depend on the individual. Most statisticians find it usually more efficient in terms of their own time to use a readily-available and familiar package. GLIM also has an OFFSET facility which enables iterative proportional fitting types of steps to be implemented, and so an even wider range of (conditional) maximum likelihood problems can be readily solved (further details are given in Adena and Wilson, 1982).

[Fig. D1. Schematic of a generalized IRLS in GLIM via the macros M1-M4, linking the GLIM vectors (shown in brackets): η (%LP), μ (%FV), y (%YV), the variance function v (%VA) and the link derivative dη/dμ (%DR); the score u = (y − μ)/{v(dη/dμ)}, the working variate z = η + A⁻¹u with A⁻¹ = diag{v(dη/dμ)²}, the update β̂ = (DᵀAD)⁻¹DᵀAz, and the deviance (34).]

Dr J. A. Nelder (Rothamsted Experimental Station): First an etymological quibble: why "reweighted"? If you are iterating you can and do change the weight, but you may also be changing the dependent variate, the covariates, etc. I propose IWLS, not IRLS.

I agree with the proposer in finding sequentially defined residuals unsatisfactory; however, cross-validatory residuals can be defined for non-diagonal covariance structures, giving a (1, 1) correspondence between data values and residuals. They will of course be correlated, but so also are such residuals when the data are independent.

Professor Aitkin has commented on the fact that so-called resistant methods are not resistant to changes in the assumption about the link function. The same is true for variance functions; different variance functions will often lead to quite different subsets of points being given low weight. There are particular dangers in applying resistant methods to binary data, for if for a given unit the fitted value p is small then there are two possible residuals, one small and negative if the datum value y = 0 and one large and positive if it is 1. In a substantial data set there will be a few ones, and these are highly informative about the size of p, yet a resistant analysis will "down-weight" them drastically. I believe that "resistant" is a word that promises much more than it can deliver.

Finally, since the speaker has built his paper around an algorithm, could he not give us a classification of the models that can be fitted, characterized algorithmically, i.e. if some or all covariates change per iteration; similarly the dependent variate, weight, etc.? This might be illuminating.
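Dr Nelder's point about the two possible residuals for a binary observation is easily made numerically. A small sketch, using the Pearson form of the residual for concreteness (the function name and the value p = 0.01 are illustrative, not from the paper):

```python
import math

def pearson_residual(y, p):
    """Pearson residual for a single binary observation y in {0, 1}
    with fitted probability p."""
    return (y - p) / math.sqrt(p * (1.0 - p))

# With p = 0.01 a zero gives a small negative residual while a one
# gives a large positive residual, which a resistant fit would then
# downweight despite the one being highly informative about p.
p = 0.01
r0 = pearson_residual(0, p)   # small and negative
r1 = pearson_residual(1, p)   # large and positive
```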

Mr G. J. S. Ross (Rothamsted Experimental Station): Dr Jorgensen has already referred to my use of what McCullagh and Nelder (1983) call "deviance residuals". These are quite simple to calculate, and enable the log likelihood to be expressed directly as a sum of squares when the error distribution is non-Normal (except in the unlikely case where the mode of the log likelihood is other than at η = y). They may be differentiated, and used in any optimization algorithm for minimizing sums of squares, of which the Gauss-Newton method is the simplest. Details of the method are given in Ross (1982).

Direct comparison with IRLS may be made by solving the same problem by both methods from different initial parameter estimates. A graphical comparison in two dimensions is obtained by plotting contours of the log likelihood for the next iteration: large open contours indicate fast convergence. In all comparisons made so far the Gauss-Newton on deviance residuals is an improvement on IRLS. The reason is that the solution locus, when plotted in the space of deviance residuals, is more linear than in data space. Further improvements may be achieved by stable parameter transformations, particularly with general non-linear models, but the improvements would also apply to IRLS. Stable transformations improve linearity and orthogonality within the solution locus.

As a contributor to statistical packages (MLP, GENSTAT) I would claim that it is preferable to use their optimization facilities rather than rely on the IRLS method, especially when the problem is unfamiliar. IRLS needs the correct algebraic specification and good initial estimates, but may still be slow or divergent. The optimization applications require only the model formula and the distribution of errors, there is protection from divergence, and parameter transformations are easily incorporated.
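The device Mr Ross describes, expressing the log likelihood as a sum of squared deviance residuals, can be sketched for Poisson errors. This is a sketch only (the function name is illustrative): each residual is the signed square root of 2{L_i(y_i) − L_i(μ_i)}, so the residual sum of squares is the deviance.

```python
import numpy as np

def poisson_deviance_residuals(y, mu):
    """Deviance residuals for Poisson errors: signed square roots of
    2{L_i(y_i) - L_i(mu_i)}, so that sum(r**2) equals the deviance and
    Gauss-Newton least squares can be applied to the r_i directly."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)  # y log(y/mu), 0 at y=0
    return np.sign(y - mu) * np.sqrt(2.0 * (term - (y - mu)))
```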
I cannot agree with Dr Wilson that the inefficiency of a method is a virtue if it shows data to be inadequate. There is no excuse for failing to interpret results fully.

I thank the author for a paper that has stimulated a useful discussion.

Professor D. M. Titterington (University of Glasgow): One approach to the approximate solution of equations (3) which has not been touched on in the paper is the recursive one. Let V_n⁻¹ denote the matrix DᵀAD based on n observations and suppose the observations are independent. Then V_n⁻¹ − V_{n−1}⁻¹ is a matrix of rank 1 and V_n is easily computed from V_{n−1}. A sequence of recursive estimators for β can then be generated by

    β̂_n = β̂_{n−1} + V_n S(y_n, β̂_{n−1}),    n = 1, 2, ...,

where S(y, β) denotes the score vector for a single observation, y. Admittedly {β̂_n} usually depends on the order in which the observations are dealt with, but the recursion is simple and sometimes reassuring asymptotic properties can be obtained via stochastic approximation theory. The case of linear logistic regression is dealt with by Walker and Duncan (1967); see also Goodwin and Payne (1977, Chapter 7). Titterington (1984) looks at recursive estimation based on incomplete data and also notes an interesting parallel between the general method of scoring and the EM algorithm. The former is defined by, say,

    β_r = β_{r−1} + {nI(β_{r−1})}⁻¹ ∂L(β_{r−1})/∂β,    r = 1, 2, ...,

where I(β) is the information matrix from one observation, typically hard to compute and invert in incomplete-data problems. An alternative iteration is obtained by using, instead of I(β), the much simpler I_c(β), corresponding to a complete observation. This iteration is sometimes exactly the EM algorithm and often very close to it.
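Professor Titterington's rank-1 recursion can be sketched in the simplest Normal-linear case, where the score for one observation is x(y − xᵀβ), the rank-1 increment to V_n⁻¹ is xxᵀ, and the Sherman-Morrison formula updates V_n without refactorization. A sketch only: the function name and the diffuse-start constant v0 are illustrative.

```python
import numpy as np

def recursive_ls(X, y, v0=1e6):
    """Recursive estimation in the Normal linear case:
    V_n^{-1} - V_{n-1}^{-1} = x_n x_n' has rank 1, so V_n is updated by
    Sherman-Morrison; beta is then moved by V_n times the score.
    v0 is a large prior variance standing in for a diffuse start."""
    p = X.shape[1]
    beta = np.zeros(p)
    V = v0 * np.eye(p)
    for x, yi in zip(X, y):
        Vx = V @ x
        V = V - np.outer(Vx, Vx) / (1.0 + x @ Vx)   # rank-1 update of V_n
        beta = beta + V @ x * (yi - x @ beta)       # score step
    return beta
```

With a diffuse start the recursion reproduces ordinary least squares to high accuracy, whatever the order of the observations in this linear case.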

Professor A. C. Atkinson (Imperial College, London): This interesting paper complements both the recent published and unpublished work of Jorgensen and also the book of McCullagh and Nelder (1983). My comments are however addressed mainly to Section 5.2 of the paper.

I remain unclear about the operational difference between robust and resistant procedures. In principle the distinction is clear, but, as Huber comments (1981, p. 7), the procedures are, for all practical purposes, the same. Even so, because of the effect of leverage, (25) or, not equivalently, (26), may not provide what is required. For observations at remote points in the space of the carriers, that is with h_i = x_iᵀ(XᵀX)⁻¹x_i approaching one, the least-squares residuals will be small, regardless of the observed value of the response. Such observations will not be down-weighted and the procedure will not be resistant. Procedures which do allow for the effect of leverage include the limited influence regression of Krasker and Welsch (1982), further discussed by Huber (1983). But there remain some problems of interpretation (Cook and Weisberg, 1983).

If there are outlying observations at points of high leverage, use of the author's robust/resistant procedure may lead to the down-weighting of observations with low leverage, some of which may have large residuals due to the distortion of the fitted model by the outliers. An example of this is provided by Ruppert and Carroll (1980) whose robust method was trimmed least squares. In their analysis of some data on salinity, the conclusion of one analysis was that observations 1, 13, 15 and 17 should be trimmed. I have elsewhere (Atkinson, 1982) contrasted this rather muddy conclusion with the results of a diagnostic analysis which clearly demonstrates that observation 16 has an extreme value of one of the carriers. It has however to be admitted that this value, although not its importance, can be detected by visual inspection of the marginal values of the explanatory variables.
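Professor Atkinson's leverage effect is easy to reproduce numerically. A sketch, with an invented design and responses (the last point is remote in x with a grossly wrong response, yet its least-squares residual stays small):

```python
import numpy as np

def hat_diagonals(X):
    """Leverages h_i = x_i'(X'X)^{-1} x_i, the diagonal of the
    hat matrix H = X(X'X)^{-1}X'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# A remote point in the space of the carriers (high leverage) drags the
# fitted line towards itself, so its residual is small regardless of
# the observed response: Atkinson's caveat about (25) and (26).
x = np.array([0., 1., 2., 3., 20.])
X = np.column_stack([np.ones(5), x])
y = np.array([0., 1., 2., 3., 0.])          # last response is an outlier
h = hat_diagonals(X)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
```

The outlying point has leverage near one and the smallest residual in absolute value, so a residual-based downweighting rule would never touch it.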

Dr R. Gilchrist (Polytechnic of North London): It is a pleasure to add my congratulations to Peter Green on a thought-provoking paper. In the time allowed, I shall confine myself to discussing three aspects of the paper.

Firstly, Dr Green is indeed correct that for "nuisance parameters", a two-stage algorithm can be more successful in achieving convergence than a single-stage updating algorithm. For example, in Scallan et al. (1984), it is shown how such an algorithm can be usefully employed in the exponential family framework to estimate what we call a parametric link function, i.e. where E(Y_i) = μ_i = h(Σ_j x_ij β_j, k), for some known function h. This works well, for example, for the Box-Cox link with unknown exponent or for estimating the shape parameter and asymptote of the generalized logistic curve defined by

    ln(μ_i) = ln(k₁) − k₂ ln(1 + exp(−η_i/k₂)).

Moreover, this two-stage technique is easily incorporated into GLIM and is closely related to direct likelihood inference, as discussed by Aitkin (1982). For example, for one-dimensional k, our technique maximizes the likelihood by moving up the "profile likelihood".

Dr Green is correct in asserting that GLIM 3 is not designed to handle non-diagonal A. However, as several other discussants have noted, it is possible to handle non-diagonal A in GLIM 3. For example, you can use a Cholesky decomposition as in Section 2.5 and incorporate this in the design matrix. Examples of this procedure for some standard models should shortly appear in the GLIM Newsletter (Scallan, 1984); however, this technique is clearly somewhat tricky for the non-specialist. It would not be difficult to change the GLIM code to allow non-diagonal A, although at the cost of some efficiency. However, the problem remains as to what form of A should be made available.

My final comments concern residuals. Firstly, orthogonal residuals have some appeal to me, despite their lack of independence for non-Normal models.
For example, the Givens algorithm (available in GLIM 4) gives the so-called recursive residuals used by Brown et al. (1975). These can be useful for detecting changes in a model over time, even if they do not pinpoint discrepant observations. Secondly, the discussion in Section 5.1 may not make it crystal clear that the standardised residuals of GLIM 3 do not take into account the correlation between an observation yi and its fitted value μ̂i. Thus Gilchrist (1981) and Pregibon (1982) have suggested what Gilchrist refers to as adjusted residuals, which to some extent take into account this correlation. Moreover, Pregibon (1982) shows that it is these residuals which give the score test statistic corresponding to the hypothesis that a given observation is an outlier. (These adjusted residuals are trivially computed in GLIM 3 by dividing GLIM's standardized residuals by √(1 − hi), where h = %WT*%VL/%SC.) Moreover, this technique can be applied in a similar way to the case where the observations are correlated. Finally, Dr Green's consideration of changes in η on the L-scale certainly seems valuable; however, I have the feeling that selected transformations of η − η(β̂) may yield better tests against certain alternative hypotheses.
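The adjusted residuals just described are easy to compute outside GLIM as well. A sketch, assuming the variance and IRLS weight vectors come from an already fitted GLM (the function name and arguments are illustrative, not GLIM syntax):

```python
import numpy as np

def adjusted_residuals(y, mu, var, X, w):
    """Standardised residuals divided by sqrt(1 - h_i), where h_i is the
    leverage from the weight-scaled design matrix: the adjusted residuals
    of Gilchrist (1981) and Pregibon (1982), up to scale."""
    r_std = (y - mu) / np.sqrt(var)
    Q, _ = np.linalg.qr(X * np.sqrt(w)[:, None])  # weighted design
    h = np.sum(Q**2, axis=1)                      # leverages
    return r_std / np.sqrt(1.0 - h)
```

For a Binomial GLM one would take var as the fitted binomial variances and w as the IRLS weights at convergence; since 0 < hi < 1, the adjustment always inflates the standardised residual.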

Mr A. C. Davison (Imperial College, London): I have a few remarks pertaining to the deviance residuals Δi defined at equation (20) of the paper. Suppose that independent observations yi have common continuous distribution function F(y; ηi, α), predictors ηi = ηi(β), and that α is a shape or scale parameter common to all the observations: σ in the Normal linear model, the shape parameter of the gamma or Weibull distributions, and so on. In such a situation one would often work with the signed square roots

rD(yi; ηi, α) = sgn(tmax) Δi^{1/2}

of the deviance residuals Δi, where tmax is the value of t for which the supremum in (20) is attained, and wish to treat them as approximately independent standard Normal variates. In particular, when the rD are ordered and plotted against Normal order statistics, the resulting graph should be close to a straight line of unit slope through the origin if the rD are to be useful in assessing closeness of fit of F(.) to the data. How useful is this graph for a given distribution F(.)? This problem may be studied by considering the function

G(w) = rD(F⁻¹(Φ(w); η, α); η, α),

which may depend on α, as a function of w. For

F(x; η, α) = Φ((x − η)/α), that is, the standard Normal-theory linear model, rD(y; η, α) = (y − η)/α,

F⁻¹(Φ(w); η, α) = η + αw, and so G(w) = w, as we would require of a sensible measure of closeness to Normality. For the Weibull distribution, a Taylor expansion shows that the function G(w) is to second order in w the expression −0.345 + 1.023w + 0.0035w², independently of the value of α: very close to the required straight line for values of w in the usual range, but with a negative intercept. For the generalized Pareto distribution

F(y; η, α) = 1 − (1 − αy/η)^{1/α},

which is exponential when α = 0, a distribution useful in some extreme-value contexts, the function G(.) depends on the value of α but is quite close to the line G(w) = w for the usual values of w and α. Adjustments to bring the observed deviance residuals rD close to the line G(w) = w may sometimes be made; but it may be sufficient to know approximately the size and direction of the allowance the eye should make when inspecting the rD. The implication seems to be that appropriately defined deviance residuals for continuous distributions F(.) often have properties very close to those of standardized residuals in the Normal-theory linear model, and therefore are likely to be easily interpretable and of potentially wide applicability, even for measuring how close to F(.) the distributional form of the data is. The effect of the data actually having distribution H(.) rather than F(.) may be assessed in an obvious way. These remarks do not take into account the likely effect on the rD of their mutual correlation: that may be quite a different story, and may perhaps be tackled by methods analogous to those of Cox and Snell (1968).

Professor T. Lewis (The Open University): Paraphrasing Vanzetti (Frankfurter and Jackson, 1929, p. xi), I wish to congratulate Dr Green on most of his paper. But Professor Aitkin's remarks on the probit example (Section 5.3) need following up. Students wondering about what work to take up and what statisticians do are often advised to look at the journals and the RSS discussion papers.
It would grieve me to think of them taking at face value Dr Green's words: ". . . clearly there is ample evidence that the regression lines are neither the same nor parallel. Interpretation is easier if the lines are parallel. . . . We therefore persevere with the four-parameter model . . ."; in other words, bugger the data! As Dr Green mentioned, the data in Table 4 come originally from Martin (1942). He feels a bit queasy about omitting the three points 11, 16, 17. Martin, the experimenter, who also omitted these points, talked about breaks in the regression lines, but in fact the deviations that bothered him are non-significant. Dr Green defends his omission of the three points by saying Finney did the same. So what? He cites Finney's argument (Finney, 1947, 1952) that only the behaviour at high concentration x is of interest. On this argument, the low-x points 3, 4, 5 should have been omitted along with 11, 16, 17. But they were conveniently retained. In the third edition of his book (Finney, 1971, p. 233), Finney has a new text:

"So far as fair interpretation of Martin's experiment is concerned, the rejection of anomalous observations may be questioned, but argument about this after so many years is unprofitable. Here the three low doses are omitted, solely for the purpose of illustrating numerical procedures."

Had Dr Green stated that his resistant analysis was solely for the purpose of illustrating numerical procedures, all would have been fine. But on the contrary, he claims ". . . that the result . . . is not just a better fit, but a better fit to a simpler and more interpretable model." Professor Finney magisterially distances himself from the earlier analysis, as if it had nothing to do with him. Actually, if we go back to Martin's paper (Martin, 1942, pp. 69-81), we find it immediately followed by a companion paper by Finney (Finney, 1942, pp. 82-94). Mutual acknowledgements in the two papers suggest that Martin was the experimenter, Finney the statistical guru. Would Martin have thrown out his three points if his guru had advised against? It is easy to criticize; the comeback is "Can you do better?" I do not know, but here is an attempt. First note points 8 and 7 as reminders of the magnitude of the standard errors attaching to the observations. Now look at the same-x points 1(R), 15(M), 10(D), at the closer group 3(R), 16(M), 11(D), and then at 17(M), 5(R), remembering that M is a mixture of R and D and not just any old poison. Are the lines obliged to be parallel? Might they not be concurrent? This would make sense, because at the concentration at which Rotenone and Deguelin have the same toxicity the mixture would plausibly also have this toxicity. Using all 17 points, I fitted the 5-parameter model Yr = Φ⁻¹(Pr) = η + βr(x − ξ) (r = 1, 2, 3).
The fitted point of concurrence (ξ̂, η̂) was (0.4, −1.2), the fitted slopes were β̂R = 4.17, β̂M = 2.83, β̂D = 2.26 (β̂R > β̂M > β̂D, O.K.). The deviance on 12 df was about 24, p ≈ 2 per cent: rather high, but the model is credible. It implies of course that for high concentrations Rotenone is more toxic than Deguelin, and for low concentrations the reverse. I know nothing about these things, but is it possible? Well, I thought, if Dr Green can wheel in a leading authority to back up his case, why shouldn't I try? So I went and asked Steven Rose, the professor of biology at the Open University. He said, yes, indeed, it is perfectly possible for a poison A to be more toxic than poison B at high concentrations and less toxic than B at low concentrations; he could think of a variety of biochemical mechanisms that would generate such a phenomenon. As I was disappearing through the door he also, I think, murmured something about biologists often being too hung up on linear models, but I may have misheard him . . .
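A concurrent-lines probit model of this kind can be fitted by direct maximization of the binomial log-likelihood. The sketch below uses wholly synthetic data (Martin's Table 4 is not reproduced here) and a generic optimizer rather than IRLS; all numbers are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic dose-response data: three poisons at log-concentrations x,
# n subjects per batch, y killed.  NOT Martin's data.
x = np.tile([0.0, 0.4, 0.8], 3)
group = np.repeat([0, 1, 2], 3)
n = np.full(9, 50.0)
beta_true = np.array([4.2, 2.8, 2.3])
y = np.round(n * norm.cdf(-1.2 + beta_true[group] * (x - 0.4)))

def negloglik(theta):
    """Binomial negative log-likelihood for Y_r = eta + beta_r (x - xi):
    lines concurrent at the point (xi, eta)."""
    xi, eta, beta = theta[0], theta[1], np.asarray(theta[2:])
    p = np.clip(norm.cdf(eta + beta[group] * (x - xi)), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (n - y) * np.log(1 - p))

start = np.array([0.5, -1.0, 3.0, 3.0, 3.0])
fit = minimize(negloglik, start, method="Nelder-Mead",
               options={"maxiter": 5000})
```

The five parameters are the concurrence point (ξ, η) and the three slopes; a derivative-free method is used here purely for brevity.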

Dr A. J. Lawrance (University of Birmingham): Tonight's paper seems to me to be important, lucid and clever. The section which most interested me was the one dealing with residuals. It is indeed difficult to define widely acceptable residuals or to agree on the properties they should have. Dr Green mentions the importance of uncorrelated residuals for diagnostic determination of model adequacy. Now I must declare my own interest in residuals from non-linear time series. It is a pity that the main ideas in the present paper do not have easier applications to simply dependent data, such as arise from time series. For non-linear autoregressive models the natural definition of residuals is not entirely clear. Peter Lewis and I have taken to defining linear residuals from non-linear models; such residuals are uncorrelated when the model is appropriate but, because of non-linearity, are not independent. The analysis of the higher order dependence properties of uncorrelated residuals is a problem needing further attention. In the time series domain we have worked graphically, and with various correlation functions of the residuals and their squares. Another trouble with most sorts of residuals is that they are omnibus, and not aimed at picking up particular types of departure from the proposed model. With non-linear time series, one may, for instance, consider further sorts of residuals which are sensitive to the directionality of the process, when this is of interest. Methods based on the standard second order properties will overlook this aspect. I also feel that what is needed now in time series is a class of models which can take advantage of the impressive progress made in regression over the last decade. In conclusion, I join with other contributors in congratulating the author; there is certainly no need to transgress the bounds of RSS English in the discussion of this paper.
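The point that uncorrelated residuals may still be dependent is easily illustrated: a conditionally heteroscedastic series has negligible autocorrelation while its squares are clearly autocorrelated. A sketch (the generating model and seed are illustrative only):

```python
import numpy as np

def acf(z, nlags=5):
    """Sample autocorrelations r_1, ..., r_nlags of the series z."""
    z = np.asarray(z, float) - np.mean(z)
    c0 = z @ z / len(z)
    return np.array([z[:-k] @ z[k:] / (len(z) * c0)
                     for k in range(1, nlags + 1)])

# An ARCH-type construction: res_t = e_t * sqrt(1 + 0.5 e_{t-1}^2).
# The series is uncorrelated, but its squares are not.
rng = np.random.default_rng(1)
e = rng.standard_normal(4000)
scale = np.sqrt(1.0 + 0.5 * np.concatenate(([0.0], e[:-1] ** 2)))
res = e * scale
```

Comparing acf(res) with acf(res**2), as the correlation functions of residuals and their squares suggest, exposes the hidden dependence that a second-order check would miss.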

The following contributions were received in writing after the meeting.

Mr J. E. Besag (visiting Carnegie-Mellon University): It is a pleasure to congratulate my colleague Peter Green on his wide-ranging paper. I wish briefly to comment on Section 5.2, concerning resistant variants of least squares and likelihood methods, for it seems there is a danger that the author has added to the confusion he seeks to avoid. Certainly, it may on occasion have been remarked that IRLS updating, as in (27), seeks to minimize Σ w(zi)zi², a claim which is false unless w(z) ∝ |z|⁻ᵖ and p < 2. However, such statements, from which I shall now admit Dr Green once saved me, are entirely incidental, since the minimization of Σ w(zi)zi², in itself, makes little sense as a general criterion for resistant estimation. The rationale for IRLS here is, equivalently, either through the normal equations, as mentioned by Dr Green, or, as I prefer, in terms of successive updating of the weights in weighted least squares (Mosteller and Tukey, 1977, p. 357; Besag, 1981, Section 2). Thus, in the latter framework, one chooses weights, applies these to obtain a weighted least squares fit, calculates corresponding residuals and, in the light of these, forms new weights to apply at the next iteration. Simultaneous minimization is irrelevant: for example, one might even contemplate use of the weight function w(z) = z⁻², though probably not for long. Lastly, I remind Dr Green that the suggestion of using IRLS with generalized linear models to obtain resistant variants of maximum likelihood fits is not a new one: see Besag (1981, Section 2).
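The successive-reweighting view just described translates directly into a loop: fit, compute residuals, re-form the weights, refit. A sketch in that spirit (the Huber-type default weight function and the MAD scale are illustrative choices, not prescriptions from the paper):

```python
import numpy as np

def resistant_fit(X, y, weight=lambda z: 1.0 / np.maximum(np.abs(z), 1.0),
                  n_iter=20):
    """Resistant regression by successive updating of weights in
    weighted least squares: choose weights, fit, compute residuals,
    form new weights, repeat."""
    w = np.ones(len(y))
    for _ in range(n_iter):
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)   # weighted LS fit
        r = y - X @ beta
        s = np.median(np.abs(r)) / 0.6745 + 1e-12    # robust scale
        w = weight(r / s)                            # new weights
    return beta

# One gross outlier barely moves the resistant slope.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 2.0 * x
y[-1] += 50.0
beta_res = resistant_fit(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

No objective function is minimized here; the iteration is defined entirely by the weight-updating rule, which is exactly the point of the framework.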

Professor D. R. Brillinger (University of California, Berkeley): This is a most timely paper. The procedure of iterative reweighting has already found many uses and solved important practical problems, yet its greatest successes would appear to lie ahead. The procedure proves flexible, easy to program, and to have important byproducts. To mention one example of the procedure's power, with a "flick-of-the-wrist" it changes a given estimation method into a robust/resistant one. The first group of scientists to make substantial use of IRLS were quite possibly the seismologists. They continually find themselves dealing with non-linear models and outlying observations. Jeffreys (1936) made use of the procedure in the construction of tables for the travel times of earthquake waves. Bolt (1960) used it in the determination of the location of earthquakes or explosions given data on the arrival times of the waves at various observatories. Incidentally, this last reference introduces the technique of robust/resistant regression some years before statisticians began to study it. Preisler and I (1983, 1984) have employed iterative reweighting together with numerical integration to fit models involving random effects and latent variables. (Two related references are Bock and Lieberman, 1970, and Hinde, 1982.) In the first of our papers we fit a compound Poisson model to multiply-subscripted count data, yijk, given the covariate xijk, with yijk assumed Poisson of mean (αj + βi xijk)ui given the latent variate ui. (There is substantial physical background justifying such a detailed model.) GLIM and iterative reweighting lead quickly to the values of the maximum likelihood estimates. In Section 5.1, Green discusses the definition of residuals in non-linear models. There are also difficulties in defining residuals for random effect and latent variable models. In Brillinger and Preisler (1983) we found "uniform residuals" to be useful in detecting an unsuspected phenomenon.
(Uniform residuals are defined to be the values one obtains on applying the fitted cumulative distribution functions to the individual observations. They provide a definition of residual for any stochastic model.) A problem of quite a different sort that may be handled via the procedure of iterative reweighting is studied in Brillinger and Preisler (1984). The model θ(yij) = φ1(x1ij) + φ2(x2ij) + ei + eij is fit, with the functions θ, φ1, φ2 unknown, with ei a random effect and with eij error. The solution is obtained by numerical integration, iterative reweighting and repeated use of the ACE procedure of Breiman and Friedman (1984). (ACE determines functions θ, φ1, φ2 to maximize the sample correlation between the values θ(yi) and the values φ1(x1i) + φ2(x2i). It seems destined to find many exciting applications in practical statistics.) The option of using weights was added to ACE as an afterthought. This occurrence, and the importance that Green's paper shows IRLS to have, suggest that developers of statistical algorithms should provide weighted versions whenever possible. A word of caution needs to be added, and Green does address this issue. The very simplest of non-linear iterations can lead to oscillating and other non-convergent sequences. (May, 1976, is an easily approached reference to this phenomenon.) Now, because of the finite accuracy of computers, even linear relationships become effectively non-linear. Proofs of theoretical convergence

may not be pertinent. Our experience is that one wants to proceed with the greatest precision allowable. It seems that IRLS may come to dominate much of statistical computation. Green's paper is a very important one.
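The uniform residuals mentioned above are simple to compute: apply each observation's fitted distribution function to the observation itself, and check the results for uniformity. A sketch with a Poisson model (means and seed are illustrative; for discrete data a randomised version is often preferred, but the plain form is shown):

```python
import numpy as np
from scipy import stats

def uniform_residuals(y, cdfs):
    """Uniform residuals: the fitted CDF evaluated at each observation.
    If the model is right these are roughly Uniform(0, 1), whatever the
    stochastic model."""
    return np.array([F(yi) for yi, F in zip(y, cdfs)])

# Illustrative Poisson observations with differing fitted means.
rng = np.random.default_rng(0)
mu = np.array([2.0, 5.0, 10.0, 3.0])
y = rng.poisson(mu)
u = uniform_residuals(y, [stats.poisson(m).cdf for m in mu])
```

Because the construction uses only the fitted distribution functions, it applies unchanged to random-effect and latent-variable models once those are integrated out.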

Dr R. W. Farebrother (University of Manchester): The author assumes that A = BBᵀ is positive definite but does not consider the possibility of small elements on the diagonal of B. In this context Kourouklis and Paige (1981) would recommend the use of

min v'v subject to u + ADβ = ADβ* + Bv

rather than the author's equation (7), which may be rewritten as

min v'v subject to B⁻¹u + BᵀDβ = BᵀDβ* + v.

A similar problem arises in Section 4.1 when f(r) ∝ exp(−c|r|ᵏ), as w(r) = ck|r|ᵏ⁻² and W(r) = (k − 1)w(r) will take very large values if r is small and 1 < k < 2. But the author need not despair of IRLS, as Ekblom (1974) and Sposito and others (1977) have published variants of algorithms (17) and (18) which perform satisfactorily in this context, and Gentleman's (1974) algorithm is designed for use with non-constant weights. Further, we recommend that the early stages of the iterative process should be subjected to small random shocks, as it is possible that the simple iterative procedure will diverge from an apparent solution if subjected to random shocks. See Bohlin's discussion of this point in Wold (1981, pp. 48-50).
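The difficulty with w(r) = |r|^{k−2} blowing up as r → 0 for 1 < k < 2 can be seen, and crudely guarded against, in a few lines. The sketch below is only a generic IRLS for the Lp criterion with a small floor on |r|, a common safeguard; it is not the specific variant of Ekblom or Sposito:

```python
import numpy as np

def lp_regression(X, y, k=1.5, n_iter=50, eps=1e-8):
    """IRLS for minimising sum |y - Xb|^k with 1 < k < 2.  The weight
    w(r) = |r|^(k-2) explodes as r -> 0; flooring |r| at eps is one
    crude safeguard (a sketch, not the cited algorithms)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.maximum(np.abs(r), eps) ** (k - 2.0)
        Xw = X * w[:, None]
        beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return beta
```

For a one-parameter location fit the fixed point can be checked by hand: with y = (0, 0, 3) and k = 1.5 the L1.5 estimate solves 2·b^{1/2} = (3 − b)^{1/2}, i.e. b = 0.6.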

Dr M. Green (Polytechnic of North London): My only objection to Dr Green's excellent paper is his suggestion that IRLS is ". . . easily programmed without the aid of packages . . .". His choice of example in 5.3 is interesting since it provides a perfect counter-example to this suggestion. If we follow Finney a little further we find several alternative models that take into account the fact that the "third" poison is a mixture of the other two. I will consider two such alternative models that could be easily understood by a toxicologist lacking the sophistication in Mathematics and Computing necessary to fit such models by programming the IRLS algorithm. Following Finney these models can be called the Independent Action model and the Similar Action model.

Independent Action Model
If we suppose that the poisons act in different physiological ways we could assume that the probability of survival from a mixture is the product of the probabilities of survival from each poison if administered separately. This leads to a model for survival probability of

Φ(α1 + β1x1) Φ(α2 + β2x2),

where x1 and x2 are log doses for poisons 1 and 2.

Similar Action Model
If we suppose that the poisons act in exactly the same physiological way we could assume that the mixture of poisons at doses d1 and d2 is equivalent to a certain dose of poison 1. Taking ρ as the relative potency of poison 1, this leads to a model for kill probability of

Φ(α + β log(d1 + ρd2)).

Both models are plausible but face the modeller with the unenviable task of translation into an IRLS algorithm. The overwhelming success of packages such as GLIM is that for many models the user has only to understand the Statistics and how to use the package, and facilities are provided for the extension of the class of fittable models relatively easily without recourse to programming an interactive system. The model for Independent Action can be formulated as a simple case of the Composite Link Function models (Thompson and Baker, 1981). The specification and analysis of such models can be automated by use of macros, as described in an unpublished paper by Roger and Colman presented at the GLIM82 conference held at PNL. The model for Similar Action can be analysed simply by use of a macro that calculates the derivatives of the function

α + β log(d1 + ρd2)

with respect to the parameters α, β and ρ. Use of this macro and a standard GLIM gives estimates for α, β and ρ of −1.834, 1.216 and 0.4940. In both cases GLIM automatically provides the framework for estimation and extra information such as the deviance (36.26 for the Similar Action model) without the user having to program such calculations.

Professor R. I. Jennrich (University of California at L.A.): The author is to be congratulated for identifying a basically simple idea that has many important applications. We have all too few of these. By noting the broad spectrum of applications of IRLS he has illustrated the fundamental role of regression in statistical analysis. Carrying this perhaps to an extreme, I have long been a proponent of the following unified field theory for statistics: "Almost all of statistics is linear regression, and most of what is left over is non-linear regression." This proposition arose from extensive work on the development of statistical software, primarily BMDP. The heart of almost every program, from ANOVA to multidimensional scaling, looks a good deal like a regression program. Programs that iterate look like non-linear regression, those that do not like linear regression. Viewing analyses like maximum likelihood and robust estimation from the perspective of IRLS produces not just a convenient computing device, but some basic insights as well. Fitting expected responses to observed responses using IRLS with weights inversely proportional to variances is a very natural thing to do. For the exponential family this is precisely maximum likelihood estimation, as has been noted by many including this discussant. As is demonstrated, IRLS provides an insightful way to view robust estimation.
The weights play a natural role in reducing the effect of outlying observations, and the regression formulation has suggested a number of approximate standard error formulae. With regard to their use as a computing device, the author notes in his abstract that IRLS algorithms are easily programmed without the aid of packages. I agree, but for me one of the most rewarding uses of IRLS is that of formulating problems so they can be analysed by standard non-linear regression programs without the need to program anything more than the function to be fitted and the weights used. The key to success here, as the author points out, is finding a formulation that uses ordinary weights rather than generalized weights. The latter are not accepted by most package programs. I will conclude with a technical comment. The starting point for the discussion of maximum likelihood estimation is the Fisher scoring algorithm

Iβ Δβ = sβ,   (*)

where sβ = ∂L/∂β is the score vector for the parameter vector β, Iβ = Eβ(sβ sβᵀ) is the corresponding information matrix and Δβ is the Fisher scoring step. In the process of factoring (*) to give his basic equation (3), the author assumes that L(β) is a composite function L(η(β)) and that L(η) is a likelihood function. The second assumption is not required, and dropping it leads to a useful generalization. Assume only that L(β) is composite. Let u = ∂L/∂η and A = Eη(uuᵀ). Then Iβ = DᵀAD and sβ = Dᵀu, and (*) becomes formally identical to the author's equation (3). All that is lost are two of the three equations above his equation (3). For example, u may not have expectation zero and hence may not be a score vector. To see what can be gained by this generalization, consider the multinomial likelihood L = Σ yi log pi. Let L(η) = Σ yi log ηi without assuming the ηi sum to 1. Then ui = yi/pi, Aij = δij/pi and Dij = ∂pi/∂βj. A scoring step is obtained by regressing y on D with weights wi = pi⁻¹. This is a more symmetric and, I think, natural formulation of IRLS for the multinomial than that in Section 2.1, and because the weights are now simple, any of a variety of standard non-linear regression programs may be used to carry out the required calculations. Moreover, it is not necessary to eliminate a cell to fit the general theory.

Dr A. Pazman (Mathematical Institute, Slovak Academy of Sciences, Bratislava): An alternative to the measures of discrepancy given in Section 5 could be the Rao distance between η̂ and η(β) in the space of the predictors η. By definition, this distance is given by

D := min ∫_{t'}^{t''} [(dγ/dt)ᵀ E_{γ(t)}{(∂L/∂γ)(∂L/∂γ)ᵀ}(dγ/dt)]^{1/2} dt,   (1*)

where the minimum is taken over all twice differentiable mappings (curves) from t ∈ ⟨t', t''⟩ to γ(t) ∈ Rⁿ which are such that γ(t') = η̂, γ(t'') = η(β). The distance D is invariant under any differentiable reparameterization of η, and in the normal linear case D² is the residual sum of squares. There is no difficulty in computing D if y1, . . ., yn are independent and if L(η) = Σi Li(ηi) (as in equation (2.1)). In this case

D = min ∫_{t'}^{t''} [Σi (dγi/dt)² E_{γ(t)}{(dLi(γi)/dγi)²}]^{1/2} dt.

Hence D is simply the Euclidean distance after the reparameterization

νi(ηi) := ∫^{ηi} [E_γ{(dLi(γ)/dγ)²}]^{1/2} dγ   (i = 1, . . ., n).

For example, if we take the binomial model in Section 1.1 we have η̂i = yi/ni and

E_{ηi}{(dLi/dηi)²} = ni/{ηi(1 − ηi)}.

Thus

νi(ηi) = 2√ni arcsin √ηi,

and

D²(β) = Σi [νi(η̂i) − νi(ηi(β))]² = Σi 4ni [arcsin √(yi/ni) − arcsin √ηi(β)]²,   (2*)

where ηi(β) is given in (A), Section 1.1. It follows from (2*) that

−(1/2) ∂D²(β)/∂βj = Σi 2ni [arcsin √(yi/ni) − arcsin √ηi(β)] {ηi(β)[1 − ηi(β)]}^{-1/2} ∂ηi/∂βj

= Σi {yi − ni ηi(β)}/{ηi(β)[1 − ηi(β)]} ∂ηi/∂βj + o(‖η̂ − η(β)‖)

= ∂L/∂βj + o(‖η̂ − η(β)‖).
Hence for η̂ which is near to the set {η(β): β ∈ Rᵖ}, the ML estimate is almost equal to the estimate obtained by minimizing the Rao distance.

Dr A. N. Pettitt (University of Technology, Loughborough): I was interested to read Dr Green's reference to use of the IRLS method in the linear regression problem in Section 4. I have been working on methods of making estimation techniques robust with censored and grouped observations. One approach is with ranks; see, for example, Pettitt (1983). Another is to use the EM algorithm (Dempster et al., 1977) when applied to the regression problem with normal errors and censored and grouped observations. This gives IRLS as the method of estimation. Similarly,

as the author notes in Section 4.1, when the EM algorithm is applied to the regression problem with "normal/independent" errors, such as the t distribution, the IRLS method again results. If one combines the "normal/independent" errors with censored and grouped observations, then, again, the EM algorithm results in IRLS estimates. In this particular case two expectations in the E stage have to be carried out, and it is important to get the order correct or otherwise IRLS estimates do not result. Explicit results can be found for finite scale mixtures of normal distributions and for the t distribution with even degrees of freedom, therefore giving IRLS estimates which are robust to outliers. Estimates can be found using GLIM, for example. My point, as the author correctly notes, is that when the IRLS estimates result as an application of the EM algorithm, then the theory of the EM algorithm guarantees that the fully iterated values are local maxima or turning points of the likelihood function. Obviously, there is no guarantee of a global maximum unless special circumstances prevail. No such theory exists for the IRLS method in general. It would appear necessary to compute the observed information, not the expected information, at the fully iterated root, in order to check that the root gives a local maximum, and to evaluate the likelihood or quasi-likelihood at the steps of the algorithm in order to check the convergence of the technique. The IRLS technique may result in convergence, but it should be asked to what; otherwise, statisticians are showing an extremely cavalier attitude to the general problem of optimization.

Dr J. H.
Roger (Reading University): The general formulation outlined in this paper includes those models described by Thompson and Baker (1981) as Composite Link functions. Composite Link functions form a technique which allows one to fit any non-linear link function to a model with exponential family error function, using a package such as GLIM, by altering the design matrix at each iteration, and is numerically equivalent to what is described here as the Newton-Raphson method with Fisher scoring. I have used this approach for fitting both Variety × Environment interaction models (Finlay and Wilkinson, 1979) and Probits with adjustments C and D for natural responsiveness (Finney, 1971, p. 126). The major problem encountered is making the iterative process converge when the starting values for the parameters β are far from the optimum. It is necessary to monitor the changes in the log-likelihood at each step and alter the step length appropriately. Such a procedure, while certainly not guaranteeing convergence to a global optimum, has allowed the solution of several practical problems. The choice of the form for the matrix A seems to be of less practical importance. Indeed the use of a suitable parameterization of the model seems to be more crucial. The package GLIM has been used by several workers to solve certain IRLS problems. However, it is not ideal. There is a need for a computing environment which has the array manipulative features of APL, device independent graphics, a powerful macro facility (with editing), text handling and, finally, a simple IRLS procedure.

Dr A. H. Seheult (University of Durham): It is a pleasure to congratulate my colleague on a lucid and valuable paper. My comments concern residuals, particularly when the representation (21) does not hold. Although it may appear natural to define residuals in the observation space, surely this is directly valid only when observations and parameters are structurally related, as in linear models.
In the absence of such a structure, a definition of discrepancy in terms of likelihood, such as deviance in equation (20), seems more natural. Often, however, such measures are interpretable as signed squared residuals in the observation space. In the notation of the paper we would like to write

2{L(η̂) − L(η(β̂))} = Σi Δi,

where Δi = 2{Li(η̂) − Li(η(β̂))} is the deviance associated with yi; this partition is exact when equation (21) holds. One way of partially resolving the difficulties in the general case referred to by Dr Green is to use jackknife techniques. Denote by L̄(η) = n⁻¹ L(η) the average likelihood per observation and define the pseudo-likelihood generated by yi to be

Li⁰(η) = n L̄(η; y) − (n − 1) L̄(η; y(i)),

where y(i) is y without yi. Now define the pseudo-deviance associated with yi to be

Δi⁰ = 2{Li⁰(η̂) − Li⁰(η(β̂))}.

Note that Δi⁰ = Δi when (21) holds. Finally, it is interesting to note that Li⁰(η) ≈ log Pr(yi | y(i)), so that L⁰(η) = Σ Li⁰(η) is the "log-likelihood" used by Besag (1975, 1977) to determine maximum pseudo-likelihood estimates for conditionally specified spatial models involving awkward normalizing factors in the full likelihood. The above definition of Δi⁰ assumes that η̂ and β̂ are proper maximum likelihood estimates, but this is not necessary and they could be replaced by pseudo-estimates.

Mr W. D. Stirling (Massey University, New Zealand): Dr Green's paper extends the IRLS algorithm for generalized linear models to a wide range of other important models. Stirling (1984) independently derived the algorithm for models where responses are independent and involve linear functions of parameters; further applications are described in that paper. In many problems expected second derivatives cannot be easily specified or evaluated. When η = Xβ, Newton-Raphson iterations are an alternative to Fisher's scoring technique and they can also be found by IRLS: a step is given by equation (3) with

A = −∂²L/∂η∂ηᵀ.

Whereas Fisher's scoring technique ensures that A is positive semi-definite, making (4) a valid least squares problem, for Newton-Raphson iterations it may not be positive semi-definite. This, however, is not a problem, since (4) still describes the Newton-Raphson step and can also be easily solved. In particular, many weighted least-squares algorithms, such as Gauss-Jordan and Givens methods, also work with negative weights, allowing (4) to be solved when the responses are independent with A diagonal. Since DᵀAD should be positive semi-definite near the solution, and often in a large region of the parameter space, the algorithms are usually still numerically stable. Gill and Murray (1974) suggest a modification to ensure convergence.
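Both this remark and the earlier comments about non-diagonal A come down to how the step equation DᵀAD Δβ = Dᵀu is solved. A sketch (names illustrative, not GLIM code): use a Cholesky factor A = LLᵀ to reduce to ordinary least squares when A is positive definite, and fall back to the normal equations when, as with Newton-Raphson away from the solution, it is indefinite:

```python
import numpy as np

def irls_step(D, A, u):
    """Solve D'A D (step) = D'u.  If A is positive definite, the
    Cholesky factor A = L L' turns this into ordinary least squares of
    L^{-1}u on L'D; otherwise (e.g. an indefinite -d2L/deta deta' in a
    Newton-Raphson iteration) solve the normal equations directly."""
    try:
        L = np.linalg.cholesky(A)
        X = L.T @ D                    # transformed design
        z = np.linalg.solve(L, u)      # transformed working response
        return np.linalg.lstsq(X, z, rcond=None)[0]
    except np.linalg.LinAlgError:      # A not positive definite
        return np.linalg.solve(D.T @ A @ D, D.T @ u)
```

Either branch returns the same step when both apply, since the least-squares normal equations are XᵀX Δβ = Xᵀz, i.e. DᵀAD Δβ = Dᵀu.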
A consistent estimate of parameter variances is still given by (DᵀAD)⁻¹. The two-part iterations for (β, κ) suggested at the end of Section 3.1 converge very slowly if β̂ and κ̂ are highly correlated. However, if (9) is explicitly soluble for κ, κ can be eliminated by replacing it by its maximum likelihood estimator for fixed η, κ̂(η), thereby reducing the problem to simple IRLS on L(η, κ̂(η)). Finally, I wish to suggest an alternative to the use of ad hoc weight functions in Sections 5.2 and 5.3. Robust/resistant M-estimators for normal linear models are equivalent to maximum likelihood assuming a longer-tailed distribution than the normal. Similar methods can be applied to Poisson and binomial data by fitting negative binomial or beta-binomial distributions with the same means. The extra nuisance parameter can be adjusted to give the desired degree of robustness. An advantage of this approach is that the nuisance parameters can also be estimated from the data, allowing tests of whether robust/resistant fits are needed.

Dr G. Tunnicliffe Wilson (University of Lancaster): This paper will be extremely valuable to those, like myself, who have been aware for a long time of the importance of IRLS but have failed to appreciate the overall structure of the subject. My contribution is confined to a small point in Section 4 on Linear Regression. Most of the densities f which one sees in use for the error term are smooth and symmetric, in which case the weight function w(t) is well defined. It is, however, still quite possible to define w(t) for a smooth asymmetric density provided it is centred on its mode. I have found this useful in a time series context where the residuals in a univariate autoregressive model for a river flow series appeared to have approximately an exponential density. It is pointed out by Dr Green that even for a two-sided exponential density, IRLS cannot be recommended. However, the error model I adopted was the sum of two independent components: e = e1 + e2, where e1 is exponential with mean λ and e2 is Normal(0, σ²).
With λ/σ = 5, say, this gives a reasonable model of the observed error density, and has a plausible interpretation. The resulting density is smoothed by the presence of e₂. It is reasonably tractable, its mode is easily found numerically, and the corresponding weight vector w evaluated. Using only one or two steps of IRLS substantially improves the precision of the autoregressive parameter, and the model fit.

Many statistical packages have linear and non-linear least-squares estimation facilities (including time series modelling) with the option of a fixed weight vector supplied by the user. Internally, they solve the equations (17) with fixed weights. When coupled with calculation and loop facilities this allows the full power of IRLS to be exploited. In GENSTAT, for example, calculation of the above-mentioned weight vector required only three extra lines of code. Dr Green's paper stimulated me to try out this example, and I hope it will encourage many others to use the method of IRLS.
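Dr Tunnicliffe Wilson's scheme, in which a package's fixed-weight least-squares facility is driven in a loop to give full IRLS, can be sketched as follows. This is a minimal stand-in, not a reconstruction of the river-flow example: the weight function used here is the one derived from a standard logistic error density, w(t) = tanh(t/2)/t, and the regression data are invented.

```python
import numpy as np

def w_logistic(t):
    """Weight w(t) = -f'(t)/(t f(t)) for a standard logistic error density:
    tanh(t/2)/t, with the limit 1/2 at t = 0."""
    t = np.where(np.abs(t) < 1e-8, 1e-8, t)   # avoid 0/0 at t = 0
    return np.tanh(t / 2.0) / t

def irls_fixed_weight_loop(X, y, n_iter=500):
    """Repeatedly solve the weighted least-squares equations (17) with the
    weight vector held fixed within each pass, recomputing it between passes."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary LS start
    for _ in range(n_iter):
        w = w_logistic(y - X @ beta)              # weights from residuals
        Xw = w[:, None] * X
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

# Invented data with logistic errors.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([1.0, 2.0]) + rng.logistic(size=40)
beta = irls_fixed_weight_loop(X, y)
```

At a fixed point the weighted score Xᵀ(w ∘ r) = Xᵀ tanh(r/2) vanishes, which is the M-estimating equation for this density.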

Professor C. F. J. Wu (University of Wisconsin, Madison): If the IRLS method is for minimizing an objective function M, be it a log likelihood or not, it is better to use the changes in M for monitoring convergence. This is because the IRLS or a suitable modification yields a monotone sequence of M values, while the behaviour of the sequence of parameter estimates is less predictable. In the context of the EM algorithm, Boyles (1983) gave an example that exhibits convergence in the likelihood values but oscillates in the parameter estimates. The convergence behaviours of the EM algorithm according to these two criteria are quite different (Wu, 1983). If no such M function exists, as in resistant regression, the changes in the left-hand expression of (26) may be used for monitoring convergence. When the iterative estimation of σ is incorporated in Method (I) (17'), more general convergence results are still available from Dempster et al. (1980) since their treatment depends on the assumption that the ψ function comes from the normal/independent family. Incidentally this restriction to the N/I family can be relaxed by applying Zangwill's Theorem as in Wu (1983).

My only question is on the appropriateness of the definition of Δ*_i in the dependent situation, since it may be quite sensitive to the ordering of the components of η. I congratulate the author for a stimulating paper which has greatly extended the utility of the IRLS method.

The author replied later, in writing, as follows.

I am very grateful to all the contributors for their stimulating comments. Together they provide a substantial discussion that says much about the power and flexibility of the IRLS method. I have arranged my replies by topic rather than in order of discussant, in the hope of picking out and concentrating on common themes. My intention is to reply to points with a computing emphasis, and then move on to the more statistical aspects, whilst recognizing that the distinction is necessarily blurred. First, however, let me comment on some of the less technical matters raised.
My motivation for looking at the subject of this paper was similar to that of Dr Jorgensen. When re-reading Nelder and Wedderburn's paper, I too felt that a more elementary presentation must be possible. The generality arose naturally from this re-formulation, and the practical computing details followed from using the method outside generalized linear models, and from attempting to implement resistant alternatives.

Dr Jorgensen and Dr Nelder both criticize my "IRLS" terminology. However, the term has the advantage of sounding like a computational method (which it is) and remaining neutral as to the statistical principles leading to it (which is appropriate as there are several). I agree with Dr Nelder that the first syllable of "reweighted" is redundant, but feel that the term has by now caught on rather widely and would be difficult to change. It also makes for a more desirable acronym, especially when generalized.

I am not sure of the utility of the classification of problems requested by Dr Nelder. As I see it, the key feature leading to IRLS is the specification of a regression model as a composite likelihood function L(η(β)). When the IRLS step is reduced to ordinary unweighted least squares, (6) or (7), which is how it will usually be performed, none of the working covariates or the dependent variable remain fixed between iterations, except in extremely special cases. The useful simplifications seem to be essentially those mentioned in Section 1.2: that of a diagonal information matrix A, and that in which there is an underlying linear predictor Xβ (not necessarily the same as η) so that the working model matrix changes only in a simple way.

I found the historical comments from Professor Brillinger very interesting; yet again we find

that a popular data-analytic technique was fostered outside statistics. With his delightful "unified field theory" and his other remarks, Professor Jennrich has given new insight into the fundamental role of regression in statistical estimation.

As I had expected, a number of contributors to the discussion commented on the use of GLIM for fitting non-standard models by IRLS. Mr Thompson and Mr Burn find the generality of my approach confusing in comparison. It is I think important to retain this generality of description in order to convey the unity of such a wide variety of estimation problems. That is not to imply that in particular problem areas there is no middle ground between this generality and a specific package solution. The user will not necessarily be required to differentiate hundreds of terms. Let me use as an example the multinomial problems with cell probabilities polynomial in the parameters arising in genetics, and mentioned by Mr Thompson, Mr Burn and also Dr Roger. These could be handled directly, but it is more convenient to use the "Poisson trick" and treat the cell frequencies as independent Poisson observations. Thus y ~ Poisson(η), where η = Cγ and γ = exp(Xβ). In the notation of the paper, u is (y/η) - 1, A is diag(1/η), D is C diag(γ)X, and the deviance is

2 Σ_i {y_i log(y_i/η_i) - y_i + η_i}.

(Here the division and exponentiation operate component-wise.) The differentiation is already done, yet a package is not needed. This specification is just that given via composite link functions by Thompson and Baker (1981), but expressed in terms of the score function rather than GLIM terminology.

In a similar vein, I am delighted to see that the other Dr Green and I are in such close agreement: I in turn find that he provides a perfect counter-example to his point about packages.
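The "Poisson trick" quantities quoted above translate directly into a scoring iteration. A minimal sketch, using exactly u = (y/η) - 1, A = diag(1/η) and D = C diag(γ)X; the matrices C and X and the data are invented for illustration.

```python
import numpy as np

def poisson_trick_step(y, C, X, beta):
    """One Fisher-scoring step for y ~ Poisson(eta), eta = C gamma,
    gamma = exp(X beta), using u = (y/eta) - 1, A = diag(1/eta),
    D = C diag(gamma) X, and the deviance 2 sum{y log(y/eta) - y + eta}."""
    gamma = np.exp(X @ beta)
    eta = C @ gamma
    u = y / eta - 1.0                     # score with respect to eta
    A = 1.0 / eta                         # diagonal information for eta
    D = C @ (gamma[:, None] * X)          # d eta / d beta
    step = np.linalg.solve(D.T @ (A[:, None] * D), D.T @ u)
    dev = 2.0 * np.sum(np.where(y > 0, y * np.log(y / eta), 0.0) - y + eta)
    return beta + step, dev

# Invented composite link: four cells built from three Poisson components.
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
beta_true = np.array([0.5, 0.3])
y = C @ np.exp(X @ beta_true)             # data placed exactly on the model
beta = np.zeros(2)
for _ in range(40):
    beta, dev = poisson_trick_step(y, C, X, beta)
```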
He takes the probit example in Section 5.3 a little further, exploiting the information that one poison is a known (1:4) mixture of the others by considering two models suggested by Finney. But the methods he suggests for fitting these two models in GLIM are different, and both are different from that by which probit models are fitted without using this mixture information. In fact all of these models, the more general model for synergistic action given by Finney in which the kill probability is Φ(α + β log(d₁ + ρd₂ + κ√(d₁d₂))), the model with complementary-log-log link used by Professor Aitkin, and Professor Lewis's concurrent line probit model, all have the same structure: u_i is ((y_i/n_i) - η_i)A_ii, where A_ii is n_i/(η_i(1 - η_i)), and the deviance is

2 Σ_i {y_i log(y_i/(n_iη_i)) + (n_i - y_i) log((n_i - y_i)/(n_i(1 - η_i)))}.

The differences lie only in the functional form of η(β) and the matrix D of its derivatives. The unity of structure can be reflected in use of essentially the same program, but not, according to Dr Green, if you use GLIM.

Mr Thompson, Professor Aitkin, Mr Burn, Dr Gilchrist and Dr Roger all raise points about estimation problems that are clearly within the scope of IRLS yet are awkward or impossible to handle in GLIM. I agree with all these points, and assume they are directed towards people with more influence in these matters than I. Considerable ingenuity has been shown by some discussants, including Mr Burn, Dr Gilchrist and Dr Tunnicliffe Wilson, in coding some other difficult problems, including some with non-diagonal A, into GLIM or GENSTAT. Dr Wilson finds my description of GLIM limited, but I think she will find that, apart from her comment about a singular value decomposition for non-diagonal A, her points are contained in the second half of the second paragraph in Section 3.3, admittedly rather briefly.

A number of contributors add some flesh or technical details to the bare bones of the IRLS method as I described it, while others propose alternative procedures.
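The common binomial structure noted above, u_i = ((y_i/n_i) - η_i)A_ii with A_ii = n_i/(η_i(1 - η_i)) and the binomial deviance, can be sketched with a probit choice of η(β); only the lines defining η and D would change for another link. The dose values, sample sizes and parameters below are invented.

```python
import numpy as np
from math import erf, sqrt, pi

Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))   # probit cdf
def phi(x):                                                      # its density
    return np.exp(-0.5 * x * x) / sqrt(2.0 * pi)

def binomial_parts(y, n, eta):
    """Link-free pieces: u_i = ((y_i/n_i) - eta_i) A_ii with
    A_ii = n_i/(eta_i(1 - eta_i)), and the binomial deviance
    (assumes 0 < y < n throughout)."""
    A = n / (eta * (1.0 - eta))
    u = (y / n - eta) * A
    dev = 2.0 * np.sum(y * np.log(y / (n * eta))
                       + (n - y) * np.log((n - y) / (n * (1.0 - eta))))
    return u, A, dev

def probit_fit(y, n, logdose, beta, n_iter=25):
    """Scoring iterations; only eta(beta) and D depend on the probit choice."""
    X = np.column_stack([np.ones_like(logdose), logdose])
    for _ in range(n_iter):
        lp = X @ beta
        eta = Phi(lp)
        D = phi(lp)[:, None] * X                          # d eta / d beta
        u, A, dev = binomial_parts(y, n, eta)
        beta = beta + np.linalg.solve(D.T @ (A[:, None] * D), D.T @ u)
    return beta, dev

# Invented dosage-mortality data placed exactly on a probit model.
logdose = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
n = np.full(5, 50.0)
beta_true = np.array([0.2, 1.0])
y = n * Phi(beta_true[0] + beta_true[1] * logdose)
beta, dev = probit_fit(y, n, logdose, np.zeros(2))
```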
Dr Jorgensen is to be congratulated on a very complete coverage of matters such as choice of step length and alternatives to the information matrix A. The basic IRLS method can indeed be adapted to an even larger class of problems, but the programming effort may be greater. As Dr Wilson comments, there is a point at which one gives up and writes a one-off program (or consigns the problem to a general-purpose optimization package). Mr Stirling also discusses choice of A, including possibilities where it does not remain positive semi-definite. These probably take IRLS out of the reach of those who, like Professor Jennrich, prefer to use standard weighted regression packages. The same is true of the very interesting and highly desirable modifications to the IRLS method via constrained optimization, suggested by Dr Farebrother.

Several points are raised about convergence and divergence, and monitoring the iterations. Professor Brillinger reminds us of the unpleasant oscillatory behaviour that may arise in fitting non-linear models and recommends maximum precision, as some of the non-linearity is rounding error. Dr Roger and Professor Wu both advocate monitoring changes in the likelihood. Dr

Roger thereby adjusts the step lengths. The situation described by Professor Wu is disturbing: should there be convergence in the likelihood, but not the parameter estimates, I would like to know! Dr Pettitt notes the absence of any theory in general about the solution to the iteration. The one point on which one can rely is that when iteration ceases at a finite point in parameter space the likelihood equations (1) are satisfied. Quite clearly, many factors may prevent this being an acceptable solution to the estimation problem. My own approach is to note that most of such difficulties will be identified by properly monitoring all aspects of the iteration. It may make for an untidy screen or print-out, but it seems to me essential to write out the current parameter estimates and log-likelihood on every iteration unless the model is an extremely familiar one.

At the end of Section 3.1, I made some brief remarks about use of a two-part iteration in the presence of nuisance parameters. I have found this approach successful in a variety of situations. Dr Gilchrist describes use of this technique when fitting models with parametric link functions, and claims that it is easy to carry this out in GLIM (only for low-dimensional κ, I would suspect, but this is the usual case). Mr Stirling points out that convergence of this two-part process may be slow if β̂ and κ̂ are highly correlated, and suggests eliminating κ by solving equation (9). To me, this rather begs the question, since the result will generally be to destroy the otherwise simple form for η that led me to term the additional parameters a "nuisance" in the first place.
Dr Farebrother recommends incorporating small random disturbances into the early iterations, and this seems to me to be a desirable and convenient complement to the good practice of running the iterative solution from several different starting values: both help to see whether the fitted maximum may be a global optimum. Mr Thompson would like rules for correct parameterization to obtain quick convergence (the experience of Dr Roger shows that these would be valuable) and Mr Ross's mention of stable parameter transformation seems to provide an answer.

Mr Ross also draws our attention to an important rival to IRLS of which I should have been aware. In a model with an additive unimodal log-likelihood, use of the deviance residuals allows the negative log-likelihood to be expressed as a sum-of-squares and hence minimized by the Gauss-Newton method, or some variation of it. Dr Jorgensen points out that this can still be regarded as IRLS, with a suitable choice of A. In fact, in the notation of the paper, but writing z_i = √Δ_i (whether signed or not), the Gauss-Newton step is to regress z on the columns of the matrix (D_ij u_i/z_i) to obtain the step in β. Thus, remarkably, the programming effort may be even less than with Fisher scoring. The information matrix A is still needed for the asymptotic covariance matrix (DᵀAD)⁻¹, or it may be approximated by diag(u_i²/z_i²). Can the idea be extended to the case of dependent observations?

Professor Titterington suggests the recursive solution of equation (3). This approach seems particularly attractive for very large data sets, where the inherent slight inefficiency does not matter, and especially where observations are acquired sequentially and processed in real time.

Ten contributors in all refer to the discussion of residuals in Section 5.1. I certainly agree with Dr Jorgensen, Mr Thompson and Dr Lawrance that the appropriateness of a definition depends on the use to be made of the residual.
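Mr Ross's Gauss-Newton rival, regressing the signed deviance residuals z on the columns of (D_ij u_i/z_i), can be sketched for a Poisson log-linear model; the counts below are invented. A useful check on the sketch is that, algebraically, Jᵀz = Dᵀu, so the fixed points of the iteration satisfy the score equations.

```python
import numpy as np

def gauss_newton_step(y, X, beta):
    """One step: z_i = sign(y_i - eta_i) sqrt(Delta_i); regress z on the
    columns of (D_ij u_i / z_i).  Poisson log-linear case, so
    eta = exp(X beta), u = y/eta - 1, Delta_i = 2{y log(y/eta) - y + eta}."""
    eta = np.exp(X @ beta)
    u = y / eta - 1.0
    delta = 2.0 * (np.where(y > 0, y * np.log(y / eta), 0.0) - y + eta)
    z = np.sign(y - eta) * np.sqrt(np.maximum(delta, 0.0))
    z_safe = np.where(np.abs(z) < 1e-10, 1e-10, z)    # guard the 0/0 case
    J = (eta[:, None] * X) * (u / z_safe)[:, None]    # D_ij u_i / z_i
    step, *_ = np.linalg.lstsq(J, z, rcond=None)
    return beta + step

# Invented counts; fixed points satisfy the score equations X^T (y - eta) = 0.
X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([2.0, 3.0, 6.0, 14.0])
beta = np.array([np.log(y.mean()), 0.0])
for _ in range(200):
    beta = gauss_newton_step(y, X, beta)
```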
In the present context of parametric regression function and probability model, my aim was to suggest methods for assessing the relative importance of the various contributions to the likelihood equations; these might then also be used to determine weights on those contributions in resistant alternatives to maximum likelihood. Viewing the likelihood equations (1) as a sum over components of the predictor vector η, it is natural to attach residuals to these predictors rather than the observations. Observations are in some contexts somewhat illusory, so that I am not concerned that there is not necessarily a one-to-one correspondence between η and y. The diagnostic use of residuals will involve examining the form of the probability model L(η) to find the data points which are associated with a given "discrepant" predictor η_i.

From this standpoint, I agree with the prevailing opinion that the deviances (20) or their signed square roots, the deviance residuals, are appropriate when the likelihood is additive (21). My only purpose in extending the discussion was to treat the case where A, the information matrix for η, is not diagonal. Dr Nelder and Professor Wu criticize the sequentially defined Δ*_i (23). These have the merit of restoring some additivity to the likelihood equations, but the demerit of losing symmetry between the components of η by depending on the ordering of these components. Where observations are obtained sequentially, this is perhaps acceptable, but in general it may be preferable to retain the symmetry. If "greater than" is replaced by "not equal to" in (23) then the {Δ*_i} reduce to the ordinary deviances {Δ_i} of (20). Mr Thompson and Dr Nelder mention "cross-validatory" residuals, where

Mr Thompson is referring to (20); I would prefer to reserve this term for deviances calculated when the corresponding predictor is omitted from the fitting altogether. Dr Seheult provides an original approach via conditional distributions, again with a cross-validatory flavour, and in which residuals are associated with observations not predictors. But providing that estimates are still obtained by maximizing likelihood not pseudo-likelihood, and that all observations are used in this fitting, Dr Seheult's {Δ_i^0} are equal to {Δ_i} in some cases even when (21) fails. For example, this is true for y ~ N(η, Σ) with Σ known, not diagonal, and also for the multinomial distribution with an appropriate interpretation of "missing" observation. How widely does this connection hold? Dr Seheult's suggestion does not succeed in restoring the truth of equation (22), but does provide a useful new viewpoint and deserves further study.

On other residual matters, Mr Davison has provided additional theory for the deviance residuals, which will aid their interpretation, especially for non-Normal linear regression, and Dr Pazman provides an alternative by using the Rao distance; although the interpretation as Euclidean distance in a transformed predictor space is appealing, I wonder if these are not too complicated to use in practice? I cannot agree with Dr Jorgensen that in the simplest non-trivial situation, y ~ N(η, Σ) with Σ known but not diagonal, the ith residual should be y_i - η̂_i, scaled. Surely, if for example a single observation y_i is suspected of being in error, it is its departure from its conditional mean given the fitted model and the other observations that should be examined? As above, this leads exactly to Dr Seheult's proposal and the ordinary deviances (20).
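The conditional-mean residual advocated here for y ~ N(η, Σ) with known non-diagonal Σ can be sketched directly: standardize y_i by its conditional distribution given the other observations, which (a standard multivariate-normal identity) equals (Σ⁻¹(y - η))_i/√(Σ⁻¹)_ii. The covariance matrix and data below are invented.

```python
import numpy as np

def conditional_residuals(y, eta, Sigma):
    """Residual of y_i against N(E[y_i | y_(i)], var[y_i | y_(i)]) when
    y ~ N(eta, Sigma); equivalently (Q e)_i / sqrt(Q_ii) with Q = Sigma^{-1}
    and e = y - eta."""
    m = len(y)
    r = np.empty(m)
    e = y - eta
    for i in range(m):
        idx = [j for j in range(m) if j != i]         # the other observations
        S_oo = Sigma[np.ix_(idx, idx)]
        s_io = Sigma[i, idx]
        cond_mean = eta[i] + s_io @ np.linalg.solve(S_oo, e[idx])
        cond_var = Sigma[i, i] - s_io @ np.linalg.solve(S_oo, s_io)
        r[i] = (y[i] - cond_mean) / np.sqrt(cond_var)
    return r

# Invented positive-definite covariance and data.
rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
Sigma = B @ B.T + 4.0 * np.eye(4)
eta = np.zeros(4)
y = rng.normal(size=4)
r = conditional_residuals(y, eta, Sigma)
```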
Dr Nelder, Dr Gilchrist and Dr Lawrance refer to correlation between residuals. We seem to have learnt to compensate for this in linear models, and I am unhappy at losing the correspondence between residuals and predictors (or observations). Routine use of Gilchrist's adjusted residuals seems entirely practicable and desirable, and the idea should apply to a wider variety of models. Dr Gilchrist's point about assessing η̂ - η(β̂) directly rather than through L(η̂) - L(η(β̂)) is I think answered by considering the case where the former is large and the latter small: does this not imply that apparent discrepancies do not much affect the likelihood for these observations and hence are not of interest?

Professor Brillinger's use of uniform residuals is most interesting. It is a pity that statisticians' vision is trained to assess a nominal Normal distribution in residual plots, but of course uniform residuals can always be re-transformed onto any other preferred baseline distribution.

Several contributors mention models and situations where residuals remain difficult to define. I entirely agree with Dr Nelder and Mr Thompson about binary data, and would not advocate these methods for this case. I have nothing to offer Professor Brillinger for latent variable models or Dr Lawrance for non-linear time series, except perhaps that the latter at least might be happy to define deviances sequentially.

Several contributors express puzzlement or concern over the description in Section 5.2 of resistant alternatives to maximum likelihood estimation in non-linear models. Such ideas, at least for generalized linear models, are not new; see Besag (1981) and Pregibon (1982). As to the purpose of such methods, I can only repeat my assertion that it is not the aim of resistant data-analysis (in the context of model-fitting) to provide a unique objective solution, but rather to examine whether any doubt should be cast on a model fitted conventionally.
It therefore seems to me eminently reasonable to report a bad fit of, say, a probit model to data from a dosage-mortality experiment, coupled with a considerably better fit to all but a few of the data points.

The "operational difference" between robust and resistant procedures, sought by Professor Atkinson, is surely not eliminated by the fact that at the numerical level the same computations may be carried out in each. In robust regression typically one proceeds with a fair amount of faith in a certain error distribution, but takes out an insurance policy to cover departure from this assumption; the resulting robust procedure is used once, in place of a "classical" method, and the fit is quoted as objective. Resistant methods, on the other hand, may involve the same numerical procedures, but used experimentally at varying levels of resistance, and as a supplement to classical methods. Mr Thompson's intriguing suggestion that resistant methods fit a "model" that is a mixture of the data and intended model is I think true, even if the idea is a little circular. It ought to aid our understanding of such methods, but I do not know of any principle for fitting the mixing weights in practice that leads to the resistant estimating equations (26) or (28).

Professor Aitkin and Mr Stirling make related points about the desirability of replacing "ad hoc" weight functions and tuning constants by prescribed alternative probability distributions and (possibly estimated) additional nuisance parameters. To the extent to which these approaches are

actually different, I wonder if we would not then be over-modelling? It should be recalled, incidentally, that since in the resistant approach the tuning constant used is divided into a standardized residual, it does itself have a standardized scale; thus, for example, with the bi-square function, c = 9, 6 and 3 correspond to mild, moderate and high degrees of resistance, for any set of data.

Mr Besag presents an alternative but equivalent rationale for IRLS in resistant methods, and makes the valuable point that there is no need to iterate to convergence, nor indeed any necessity for ultimate convergence at all. This would clearly require careful monitoring of the changing parameter estimates, which as I have stressed is always advisable. There are cases, for example in Table 6, where the fully iterated solution depends on the initial values used, and some where in addition the fit will not iterate away from the maximum likelihood estimates. There may in general be many solutions to the weighted likelihood equations (28) and some of these will not at all correspond to a "fit" in the ordinary sense. Therefore, we ought perhaps to restrict ourselves to initial estimates that do represent credible models: apart from the exploratory and experimental emphasis, the approach does come close to examining the "robustness" of such models.

Dr Nelder observes that resistant methods are not "resistant" to changes in the link or variance functions. This seems entirely appropriate if we are using a model-based notion of discrepancy. However I certainly agree with Dr Nelder about the risks of using resistant methods with binary data.
For example, slight changes in the way in which the method is formulated lead the resistant fit to the vaso-constriction data presented by Pregibon (1982) to go badly wrong: the slopes become infinite and a "perfect" fit is obtained by down-weighting only three observations.

Professor Atkinson notes the omission of any reference in the paper to high leverage and model fit. This is an important matter, but space was restricted. For the binomial/logistic model, I again refer to Pregibon (1982).

Before leaving the subject of resistance, I must return to the probit analysis example that alarms and grieves Professors Aitkin and Lewis. It is of course rather a small data-set (and rather old) to be subjected to such intense scrutiny. I wonder how many different models must be tried before the choice becomes as subjective as the use of resistant methods? Twelve models or minor variants are suggested in the paper and discussion, including those from Dr Green, and all of these fit considerably better without observations 11, 16 and 17. (If Professor Lewis had estimated his parameters by maximum likelihood, incidentally, he would have found a slightly closer fit; also these estimates are heavily influenced by points 11, 16, 17.)

For a data-set of this size, I would hesitate to abandon an apparently accepted probit transformation. Professor Aitkin's use of the complementary-log-log link yields a reduced deviance overall, but closer inspection of this fit reveals that the almost imperceptible curvature in Deguelin is little altered, while clear curvature and uniformly larger residuals appear for Rotenone. Further, Professor Lewis finds Rotenone to be more toxic than Deguelin for high concentrations, and less toxic for low: presumably, therefore, the poisons work through different biological mechanisms, so why should the line for the mixture be concurrent with the others?
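The remark a little earlier that the bi-square tuning constant operates on an already standardized residual scale, so that c = 9, 6 and 3 mean the same degrees of resistance for any data set, can be illustrated with a minimal sketch; the residual values are invented.

```python
import numpy as np

def bisquare(t, c):
    """Tukey's bi-square weight (1 - (t/c)^2)^2 for |t| <= c, else 0,
    applied to an already standardized residual t."""
    r = np.asarray(t, dtype=float) / c
    return np.where(np.abs(r) <= 1.0, (1.0 - r ** 2) ** 2, 0.0)

t = np.array([0.0, 2.0, 4.0, 8.0, 12.0])                 # standardized residuals
weights = {c: bisquare(t, c) for c in (9.0, 6.0, 3.0)}   # mild/moderate/high
```

Smaller c down-weights a given standardized residual more heavily, and zeroes it entirely beyond |t| = c.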
Among the remaining points with a statistical flavour, I am grateful to a number of contributors for bringing new problems to my attention. These include Dr Pettitt with further examples of the EM algorithm reducing to IRLS, namely with censored or grouped data from Normal/independent distributions; Professor Brillinger describes work with random effects and latent variables models; Mr Stirling's paper includes examples with censored data, and negative-binomial and beta-binomial distributions; and Dr Tunnicliffe Wilson describes experience with asymmetric error distributions in linear regression.

Dr Wilson and Dr Ross are in disagreement over the interpretation of IRLS failing to fit a model. I tend to side with Dr Wilson in asking of what use is an optimal fit of a model that is inappropriate. On the other hand, quite complicated and numerically ill-conditioned models are sometimes appropriate, and we may give up too early. Surely the optimum should still be sought, but the optimization not consigned to a black-box prohibiting visual monitoring of the iteration?

Finally, I am pleased to see that Professor Jennrich shows that any initial formulation is insufficiently general, thus widening even further the range of estimation problems where iteratively reweighted least squares can profitably be applied.

REFERENCES IN THE DISCUSSION

Adena, M. A. and Wilson, S. R. (1982) Generalized Linear Models in Epidemiological Research: Case Control Studies. Sydney: Intstat Foundation.
Aitkin, M. (1981) A note on the regression analysis of censored data. Technometrics, 23, 161-163.
Aitkin, M. (1982) Direct likelihood inference. In GLIM 82: Proceedings of the International Conference on Generalized Linear Models (R. Gilchrist, ed.), pp. 76-86. New York: Springer. (Lecture Notes in Statistics, Vol. 14.)
Aitkin, M. and Clayton, D. (1980) The fitting of exponential, Weibull and extreme value distributions to complex censored survival data using GLIM. Appl. Statist., 29, 156-163.
Aston, C. E. and Wilson, S. R. (1984) A comment on maximum likelihood estimation in sparse contingency tables. (In the press.)
Atkinson, A. C. (1982) Robust and diagnostic regression analyses. Commun. Statist., A11, 2559-2571.
Bennett, S. (1983) Log-logistic regression models for survival data. Appl. Statist., 32, 165-171.
Besag, J. E. (1975) Statistical analysis of non-lattice data. The Statistician, 24, 179-195.
Besag, J. E. (1977) Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64, 616-618.
Besag, J. E. (1981) On resistant techniques and statistical analysis. Biometrika, 68, 463-469.
Bock, R. D. and Lieberman, M. (1970) Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.
Bolt, B. A. (1960) The revision of earthquake epicentres, focal depths and origin times using a high-speed computer. Geophys. J., 3, 433-440.
Boyles, R. A. (1983) On the convergence of the EM algorithm. J. R. Statist. Soc. B, 45, 47-50.
Breiman, L. and Friedman, J. H. (1984) Estimating optimal transformations for multiple regression and correlation. J. Amer. Statist. Ass., to appear.
Brillinger, D. R. and Preisler, H. K. (1984) An exploratory analysis of the Joyner-Boore attenuation data. Bull. Seismol. Soc. America, 74 (in press).
Brown, R. L., Durbin, J. and Evans, J. M. (1975) Techniques for testing the constancy of regression relationships over time. J. R. Statist. Soc. B, 37, 149-192.
Cook, R. D. and Weisberg, S. (1983) Comment on Huber (1983). J. Amer. Statist. Ass., 78, 74-75.
Cox, D. R. and Snell, E. J. (1968) A general definition of residuals. J. R. Statist. Soc. B, 30, 248-275.
Ekblom, H. (1973) Calculation of linear best Lp approximations. BIT, 13, 292-300.
Finlay, K. W. and Wilkinson, G. N. (1963) The analysis of adaptation in a plant breeding programme. Aust. J. Agric. Res., 14, 742-754.
Finney, D. J. (1942) The analysis of toxicity tests on mixtures of poisons. Ann. Appl. Biol., 29, 82-94.
Frankfurter, M. D. and Jackson, G. (eds) (1929) The Letters of Sacco and Vanzetti. London: Constable.
Gilchrist, R. (1981) Calculation of residuals for all GLIM models. GLIM Newsletter, 4, 26-28.
Gill, P. E. and Murray, W. (1974) Newton-type methods for unconstrained and linearly constrained optimization. Math. Programming, 7, 311-350.
Goodwin, G. C. and Payne, R. L. (1977) Dynamic System Identification: Experiment Design and Data Analysis. New York: Academic Press.
Guerrero, V. M. and Johnson, R. A. (1982) Use of the Box-Cox transformation with binary response models. Biometrika, 69, 309-314.
Hinde, J. (1982) Compound Poisson regression models. In GLIM 82: Proceedings of the International Conference on Generalized Linear Models (R. Gilchrist, ed.), pp. 109-121. New York: Springer. (Lecture Notes in Statistics, Vol. 14.)
Huber, P. J. (1981) Robust Statistics. New York: Wiley.
Huber, P. J. (1983) Minimax aspects of bounded-influence regression. J. Amer. Statist. Ass., 78, 66-72.
Jeffreys, H. (1936) On travel times in seismology. Bur. Centr. Seismol. Travaux Sci., 14, 3-86.
Jorgensen, B. (1983b) The delta algorithm and GLIM. Unpublished manuscript.
Kennedy, W. J., Jr and Gentle, J. E. (1980) Statistical Computing. New York: Marcel Dekker.
Kourouklis, S. and Paige, C. C. (1981) A constrained least squares approach to the general Gauss-Markov linear model. J. Amer. Statist. Ass., 76, 620-625.
Krasker, W. S. and Welsch, R. E. (1982) Efficient bounded-influence regression. J. Amer. Statist. Ass., 77, 595-605.
Martin, J. T. (1942) The problem of the evaluation of rotenone-containing plants, VI. Ann. Appl. Biol., 29, 69-81.
May, R. M. (1976) Simple mathematical models with very complicated dynamics. Nature, 261, 459-466.
McCullagh, P. and Nelder, J. A. (1983) Generalized Linear Models. London: Chapman and Hall.
Mosteller, F. and Tukey, J. W. (1977) Data Analysis and Regression. Reading, Mass.: Addison-Wesley.
Pettitt, A. N. (1983) Approximate methods using ranks for regression with censored data. Biometrika, 70, 121-132.
Pregibon, D. (1982b) Score tests in GLIM with applications. In GLIM 82: Proceedings of the International Conference on Generalized Linear Models (R. Gilchrist, ed.), pp. 87-97. New York: Springer. (Lecture Notes in Statistics, Vol. 14.)
Ross, G. J. S. (1982) Least squares optimization of general log-likelihood functions and estimation of separable linear parameters. In Compstat 1982 (H. Caussinus et al., eds), pp. 406-411.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. J. Amer. Statist. Ass., 75, 828-838.
Ryder, M. L., Land, R. B. and Ditchburn, R. (1974) Colour inheritance in Soay, Orkney and Shetland sheep. J. Zool., 173, 477-485.
Scallan, A. (1984) Fitting autoregressive models in GLIM. GLIM Newsletter (to appear).
Scallan, A., Gilchrist, R. and Green, M. (1984) Parametric link functions. Computat. Statist. and Data Anal., to appear.
Sposito, V. A., Kennedy, W. J. and Gentle, J. E. (1977) Lp norm fit of a straight line. Appl. Statist., 26, 114-118.
Stirling, W. D. (1984) Iteratively reweighted least squares for models with a linear part. Appl. Statist., 33, 7-17.
Titterington, D. M. (1984) Recursive parameter estimation using incomplete data. J. R. Statist. Soc. B, 46, in the press.
Walker, S. H. and Duncan, D. B. (1967) Estimation of the probability of an event as a function of several independent variables. Biometrika, 54, 167-179.
Wold, H. (1981) The Fix-point Approach to Interdependent Systems. Amsterdam: North-Holland.
Wu, C. F. J. (1983) On the convergence properties of the EM algorithm. Ann. Statist., 11, 95-103.
