
CHAPTER 14 ST 762, M. DAVIDIAN

14 Estimating equation methods for population-averaged models

14.1 Introduction

In this chapter, we focus on methods for estimation of the parameters in a population-averaged marginal model of the general form discussed in Chapter 13. In particular, we assume that the pairs (Y_i, x_i), i = 1, ..., m, are independent, where each Y_i is (n_i × 1), and we consider the general mean-covariance model

E(Y_i | x_i) = f_i(x_i, β),   var(Y_i | x_i) = V_i(β, ξ, x_i)   (n_i × n_i),   (14.1)

where f_i(x_i, β) is the (n_i × 1) vector with jth element f(x_ij, β). The covariance model V_i(β, ξ, x_i) is taken to have the form

var(Y_i | x_i) = V_i(β, ξ, x_i) = T_i^{1/2}(β, θ, x_i) Γ_i(α, x_i) T_i^{1/2}(β, θ, x_i),   (14.2)

where T_i(β, θ, x_i) is the diagonal matrix whose diagonal elements are the models for var(Y_ij | x_i), e.g., involving a function var(Y_ij | x_i) = σ² g²(β, θ, x_ij) depending on possibly unknown variance parameters θ (as in the previous chapter, we will generally absorb σ² into θ for brevity and just refer to θ as all the variance parameters). The (n_i × n_i) matrix

Γ_i(α, x_i) is a correlation matrix that generally depends on x_i only through the within-individual times or other conditions t_ij at which the observations in Y_i are taken. Here, α is a vector of unknown correlation parameters.
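To make (14.2) concrete, here is a small numerical sketch of assembling V_i from its pieces. The power-of-the-mean variance function and the AR(1)-type working correlation used below are hypothetical choices for illustration only, not part of the general model.

```python
import numpy as np

def cov_model(f_i, theta, t_i, alpha, sigma2=1.0):
    """Assemble V_i = T_i^{1/2} Gamma_i T_i^{1/2} as in (14.2).

    f_i   : (n_i,) fitted means f(x_ij, beta)
    theta : power parameter in the (hypothetical) variance function
            g(beta, theta, x_ij) = f(x_ij, beta)**theta
    t_i   : (n_i,) observation times, used by the working correlation
    alpha : AR(1)-type parameter, Gamma_jk = alpha**|t_j - t_k| (hypothetical)
    """
    g = f_i ** theta                        # g(beta, theta, x_ij)
    T_half = np.diag(np.sqrt(sigma2) * g)   # T_i^{1/2}, diagonal
    lags = np.abs(np.subtract.outer(t_i, t_i))
    Gamma = alpha ** lags                   # working correlation matrix
    return T_half @ Gamma @ T_half

f_i = np.array([1.0, 2.0, 4.0])
V = cov_model(f_i, theta=0.5, t_i=np.array([0.0, 1.0, 2.0]), alpha=0.4, sigma2=2.0)
# Diagonal is var(Y_ij | x_i) = sigma^2 g^2; off-diagonals carry the correlation.
```

The diagonal of V reproduces the marginal variances, while each off-diagonal entry is the product of the two standard deviations and the working correlation at the corresponding time lag.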

The vector of variance and correlation parameters ξ = (θ^T, α^T)^T may be entirely unknown, or it may be that only α is unknown, as in models where the form of the variance function is entirely specified.

As discussed in Chapter 13, the correlation model Γi(α, xi) is likely not to be correct. Rather, it is specified as a “working” model that hopefully captures some of the main features of the overall pattern of correlation. This is acknowledged when inference is carried out under model (14.1) with assumed covariance structure (14.2); we will discuss this explicitly in Section 14.5.

Model (14.1) may be viewed as a multivariate analog to the univariate mean-variance models discussed in Chapters 2–12. Thus, it should come as no surprise that inferential strategies for (14.1) exploit some of the same ideas as in the univariate case. In particular, estimation of β and ξ is typically carried out by solution of linear or quadratic estimating equations that are similar in spirit to those used for univariate response. Of necessity, these equations are more complicated in the multivariate setting, as we will see in Sections 14.2, 14.3, and 14.4, although the basic principles are the same.


The terminology generalized estimating equations (GEEs), first coined by Liang and Zeger (1986), has come to refer broadly to the body of techniques for inference for β and ξ based on solution of appropriate estimating equations.

• As suggested by the title of Liang and Zeger (1986), “Longitudinal analysis using generalized linear models,” this paper cast the idea of posing estimating equations for multivariate response in the context where the response is collected longitudinally for each experimental unit.

• The development was also restricted to responses such as binary data, counts, and so on, and thus to mean models f and models for var(Y_ij | x_ij) that are of the generalized linear model type. However, this restriction is unnecessary; solving estimating equations for any model of the form (14.1) is feasible more generally.

This restriction does explain why, in much of the literature, there are no unknown variance parameters θ, and interest focuses exclusively on estimating correlation parameters α. For example, for binary response, the model for var(Y_ij | x_ij) might be taken to be the usual model f(x_ij, β){1 − f(x_ij, β)} and is assumed to be correctly specified. A “working” correlation model might be postulated depending on unknown α, so that only α remains to be estimated.

• In our development here, we will allow the possibility that the model V i(β, ξ, xi) may involve both unknown variance parameters θ and unknown correlation parameters α. We will note simplifications that would occur in the case where the variance function does not depend on any unknown parameters θ.

There are numerous references that cover the types of estimating equations we are about to discuss. Some of the key references are Prentice (1988), Zhao and Prentice (1990), Prentice and Zhao (1991), and Liang, Zeger, and Qaqish (1992). See also Section 7.5 and Chapter 8 of Diggle, Liang, and Zeger (1995), Chapter 9 of Vonesh and Chinchilli (1997), and Chapter 3 of Fitzmaurice et al. (2009).

14.2 Linear estimating equations for β

Just as with univariate response, it is natural to start by considering the normal likelihood as a basis to derive an estimation method for β in (14.1). Thus, analogous to the case of “known weights” in the univariate case, assume that the matrices var(Y_i | x_i) = V_i, say, are known. Under these conditions, assuming that the Y_i | x_i are normally distributed, the normal loglikelihood has the form

log L = −(1/2) Σ_{i=1}^m [ log |V_i| + {Y_i − f_i(x_i, β)}^T V_i^{−1} {Y_i − f_i(x_i, β)} ].   (14.3)

Writing

X_i(β) = [ f_β^T(x_i1, β) ; … ; f_β^T(x_in_i, β) ]   (n_i × p),

the matrix whose jth row is the transpose of the gradient f_β(x_ij, β) = ∂/∂β f(x_ij, β), taking derivatives of (14.3) with respect to the (p × 1) vector β, and setting the result equal to zero yields

Σ_{i=1}^m X_i^T(β) V_i^{−1} {Y_i − f_i(x_i, β)} = 0,   (14.4)

which follows by using the following standard matrix differentiation results.

• For the quadratic form q = x^T A x with A symmetric, ∂q/∂x = 2Ax. Note that this is a vector.

• The chain rule gives ∂q/∂β = (∂x/∂β)(∂q/∂x). Note that if both x and β are vectors, then ∂x/∂β is a matrix. (See Section 2.4 for a refresher.)

Section 1.4 of Vonesh and Chinchilli (1997) is an excellent source for many useful and sometimes difficult-to-find results on matrices and matrix differentiation.

The equation (14.4) has the form of a multivariate analog to the usual, univariate WLS equation, where, here, the response and mean are now vectors, and the “weights” V_i^{−1} and “gradient” X_i(β) are matrices.

In fact, we may write (14.4) in a way that makes it clear that there is really no fundamental difference in the general forms in the univariate and multivariate cases. Let Y = (Y_1^T, …, Y_m^T)^T be the vector of length N = Σ_{i=1}^m n_i (total number of observations). Define

f(β) = {f_1^T(x_1, β), …, f_m^T(x_m, β)}^T   (N × 1),

V = block diag(V 1,..., V m) (N × N), and

X(β) = [ X_1(β) ; … ; X_m(β) ]   (N × p),

the matrices X_i(β) stacked one atop the other. Note that E(Y | x̃) = f(β) and var(Y | x̃) = V, where we use conditioning on “x̃” as shorthand to denote that conditioning for each component Y_i is with respect to x_i. Note further that we may rewrite (14.4) as

X^T(β) V^{−1} {Y − f(β)} = 0.   (14.5)
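When the mean is linear in β, say f(β) = Xβ, equation (14.5) has the closed-form GLS solution β̂ = (X^T V^{−1} X)^{−1} X^T V^{−1} Y, which gives a quick numerical check on the stacked representation. The simulated design below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stacked design X (N x p), block-diagonal V (N x N), response Y, as in (14.5).
m, n_i, p = 50, 3, 2
X = rng.normal(size=(m * n_i, p))
beta_true = np.array([1.0, -0.5])
Vi = 0.5 * np.eye(n_i) + 0.5            # one (n_i x n_i) block, exchangeable
V = np.kron(np.eye(m), Vi)              # block diag(V_1, ..., V_m)
Y = X @ beta_true + rng.multivariate_normal(np.zeros(m * n_i), V)

# For f(beta) = X beta, X^T V^{-1} (Y - X beta) = 0 has the explicit solution:
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)

# Check: beta_hat makes the estimating function (14.5) exactly zero.
score = X.T @ Vinv @ (Y - X @ beta_hat)
```

For a nonlinear mean, no closed form exists and the equation must be solved iteratively, as discussed below.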


This has the same form as the matrix representation of the usual WLS estimating equations in the univariate case, with the exception that the “weight matrix” V^{−1} in (14.5) is not necessarily diagonal. This is of no real consequence; the form of the equations is exactly that of WLS.

In fact, note that the summands for i = 1,...,m in the equation written in the form of (14.4) are independent. Moreover, regardless of whether the (constant) matrices V i are actually equal to the true var(Y i|xi), the expectation of a summand, conditional on xi, is clearly equal to zero.

• Thus, the estimating equation is unbiased. We would thus expect estimation of β by solving (14.4) to lead to a consistent estimator.

Of course, if the matrices V i are not known but depend on the parameters β and ξ, following the analogy to the linear case, it is natural to consider replacing them by the postulated covariance model in (14.1). This leads to the linear estimating equation

Σ_{i=1}^m X_i^T(β) V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)} = 0,   (14.6)

which may be rewritten, defining V(β, ξ) = block diag{V_1(β, ξ, x_1), …, V_m(β, ξ, x_m)}, as

X^T(β) V^{−1}(β, ξ) {Y − f(β)} = 0.   (14.7)

As in the case where the covariance matrix is known, the summands in (14.6) are independent across i. Hence, we expect that this estimating equation is also unbiased and would lead to a consistent estimator for β. We will discuss this in more detail in Section 14.5.

IN FACT: If the model for the covariance matrix var(Y | x̃) is correct (we of course assume that the model for the mean is correct), note that (14.7) has exactly the form of the optimal linear estimating equation for β as in Section 9.6, except for the detail that the weight matrix V^{−1}(β, ξ) is not diagonal.

• As we will see in Section 14.5, a similar “folklore” result as for the univariate case obtains for the multivariate generalization of GLS that we discuss momentarily. Thus, the same argument used in Section 9.6 to show asymptotic “optimality” among all linear equations in the univariate case would be applicable to the solution to (14.7).

The main point we emphasize now is that we are led to estimating equation (14.6), or equivalently (14.7), by applying the same ideas to the multivariate case that we did in the univariate setting. The equations have the same form as those for generalized least squares (GLS) in the univariate case.


Carrying the analogy further, an obvious strategy for solving (14.4) is suggested, via a three-step algorithm:

(i) Estimate β by β̂^{(0)} and set k = 0. A natural choice for β̂^{(0)} would be the OLS estimator treating all the elements of Y as if they were mutually independent with the same conditional variance; i.e., replacing V in (14.5) by an (N × N) identity matrix. That the OLS estimator is consistent under these conditions follows from the arguments in Section 14.5.

(ii) Estimate ξ (somehow) by ξ̂^{(k)} and form estimated “weight matrices”

V_i^{−1}(β̂^{(k)}, ξ̂^{(k)}, x_i).

(iii) Re-estimate β by solving in β

Σ_{i=1}^m X_i^T(β) V_i^{−1}(β̂^{(k)}, ξ̂^{(k)}, x_i) {Y_i − f_i(x_i, β)} = 0

to obtain β̂^{(k+1)}. Set k = k + 1 and return to (ii).

Iterating C = ∞ times would correspond to solving (14.6) jointly with another equation for ξ.
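The three-step scheme can be sketched in a few lines. In this toy version the mean model f(x, β) = β₁ exp(β₂ x) and the exchangeable V_i are hypothetical choices, and V_i is treated as known so that step (ii) is trivial; step (iii) is solved by Gauss-Newton iteration.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, beta):                     # hypothetical mean model f(x_ij, beta)
    return beta[0] * np.exp(beta[1] * x)

def fgrad(x, beta):                 # rows of X_i(beta): partials of f w.r.t. beta
    e = np.exp(beta[1] * x)
    return np.column_stack([e, beta[0] * x * e])

m, n_i = 200, 4
x = rng.uniform(0.0, 1.0, size=(m, n_i))
beta_true = np.array([2.0, 1.5])
Vi = 0.04 * (0.7 * np.eye(n_i) + 0.3)          # "known" working covariance
Vinv = np.linalg.inv(Vi)
Y = np.array([f(xi, beta_true) + rng.multivariate_normal(np.zeros(n_i), Vi)
              for xi in x])

# Step (i): a crude starting value; step (iii): iterate the GLS update.
beta = np.array([1.8, 1.3])
for _ in range(25):
    num = np.zeros(2)
    den = np.zeros((2, 2))
    for xi, yi in zip(x, Y):
        Xi = fgrad(xi, beta)                   # (n_i x p) gradient matrix
        r = yi - f(xi, beta)
        num += Xi.T @ Vinv @ r                 # summand of equation (14.6)
        den += Xi.T @ Vinv @ Xi
    beta = beta + np.linalg.solve(den, num)    # Gauss-Newton step
```

With V_i unknown, step (ii) would re-estimate ξ between updates, using, e.g., the moment-based or quadratic estimating equation approaches discussed in this chapter.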

IMPLEMENTING STEP (ii): By analogy to the univariate case, a natural approach to estimating ξ would be to use a quadratic estimating equation. We will discuss this in Section 14.3 shortly.

In the early papers on GEEs in the biostatistical literature by Liang and Zeger, estimation of ξ was advocated based on simple, moment-based approaches. Actually, as this early work was in the context of generalized linear model-type problems, this only involved estimation of the correlation parameters α, as in this setting the variance function does not depend on unknown parameters (except perhaps a scale parameter σ²).

For example, consider the exponential correlation model given in (13.28), reparameterized here as

corr(Y_ij, Y_ij′) = α^{|t_ij − t_ij′|},

where Y_ij and Y_ij′ are observed at “times” t_ij and t_ij′. Under this model, Γ_i(α, x_i) depends on the scalar parameter α. Assume that var(Y_ij | x_ij) = σ² g²(β, θ, x_ij) with θ known. Given an estimate β̂^{(k)} at the kth iteration of the above algorithm, the weighted residuals

wr_ij = {Y_ij − f(x_ij, β̂^{(k)})}/g(β̂^{(k)}, θ, x_ij)

have approximate mean 0 and satisfy

E(wr_ij wr_ij′ | x_i) ≈ σ² α^{|t_ij − t_ij′|}.


Taking logarithms of both sides of this expression yields the approximate relationship

log(wr_ij wr_ij′) ≈ log σ² + |t_ij − t_ij′| log α;

thus, the suggestion was to form all pairs of lagged residuals for each i, pool them together, and estimate log α by simple linear regression of the log(wr_ij wr_ij′) on the |t_ij − t_ij′|. The resulting estimator may be exponentiated to yield an estimator for α.
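The mechanics of this regression are easy to demonstrate. The sketch below feeds the regression the exact population moments σ²α^{|lag|} rather than noisy residual products, so it recovers α exactly; with real residuals the products are noisy and can be zero or negative, and such pairs must be dropped (or otherwise handled) before logs are taken. The time grid and parameter values are hypothetical.

```python
import numpy as np

# Under the exponential working correlation, E(wr_ij wr_ij') ~ sigma^2 *
# alpha**|t_ij - t_ij'|, so the log of the (population) product is linear
# in the lag: log sigma^2 + lag * log alpha.
sigma2, alpha = 1.3, 0.6
t = np.array([0.0, 1.0, 2.0, 4.0, 7.0])
lags, moments = [], []
for j in range(len(t)):
    for k in range(j + 1, len(t)):
        d = abs(t[j] - t[k])
        lags.append(d)
        moments.append(sigma2 * alpha ** d)     # E(wr_ij wr_ik | x_i), exact

slope, intercept = np.polyfit(lags, np.log(moments), 1)
alpha_hat = np.exp(slope)                       # exponentiate the slope
sigma2_hat = np.exp(intercept)
```

Because the inputs here are exactly log-linear in the lag, the regression returns the generating values; with actual residual products, the same two lines give the moment-based estimate described above.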

Given our experience in the univariate case, it seems likely that a potentially more efficient way of estimating α could be identified. Moreover, this approach does not seem to take account of the fact that if observations from the same individual are correlated, then residuals are likely to be so, too.

In any event, following estimation of α in the general model with a scale parameter σ² (with no other unknown variance parameters), the obvious multivariate analog for estimation of σ² is

σ̂² = N^{−1} Σ_{i=1}^m {Y_i − f_i(x_i, β̂^{(k)})}^T {T_i^{1/2}(β̂^{(k)}, θ, x_i) Γ_i(α̂^{(k)}, x_i) T_i^{1/2}(β̂^{(k)}, θ, x_i)}^{−1} {Y_i − f_i(x_i, β̂^{(k)})}.   (14.8)

Of course, N might be replaced by N − p. In the next section, we will see that an estimator of this form when a scale parameter is in the model arises naturally from a quadratic estimating equation approach, as is not unexpected.
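In code, (14.8) is a single pooled sum. The sketch below, with hypothetical inputs, checks the sanity case Γ_i = I and g ≡ 1, in which the estimator reduces to the average squared residual.

```python
import numpy as np

def sigma2_hat(resids, g_list, Gamma_list):
    """Pooled scale estimator (14.8):
    N^{-1} sum_i r_i' {T_i^{1/2} Gamma_i T_i^{1/2}}^{-1} r_i,
    with T_i^{1/2} = diag(g_ij); sigma^2 itself is factored out."""
    total, N = 0.0, 0
    for r, g, Gam in zip(resids, g_list, Gamma_list):
        Thalf = np.diag(g)
        W = Thalf @ Gam @ Thalf
        total += r @ np.linalg.solve(W, r)
        N += len(r)
    return total / N

# Special case: Gamma_i = I and g = 1 reduces to the mean squared residual.
resids = [np.array([1.0, -2.0]), np.array([0.5, 0.5, 3.0])]
est = sigma2_hat(resids,
                 [np.ones(2), np.ones(3)],
                 [np.eye(2), np.eye(3)])
```

Here est = (1 + 4 + 0.25 + 0.25 + 9)/5; replacing N by N − p inside the function is the obvious degrees-of-freedom variant.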

14.3 Quadratic estimating equations for variance and correlation parameters

To develop a class of quadratic estimating equations for ξ, it is natural, as we did in the univariate case, to start with the normal loglikelihood. We thus first consider the form of the loglikelihood under the assumption that the conditional distributions of Y i given xi are normal to deduce the form of an estimating equation and then consider generalizations. As before, for simplicity, we suppress explicit mention of a possible scale parameter, absorbing it into θ for convenience.

The normal loglikelihood for model (14.1) is

log L = −(1/2) Σ_{i=1}^m [ log |V_i(β, ξ, x_i)| + {Y_i − f_i(x_i, β)}^T V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)} ].   (14.9)

Assume that θ is (q × 1) and α is (s × 1).

Consider differentiation of (14.9) with respect to the kth (scalar) element of ξ, ξ_k, say, k = 1, …, q + s. This is straightforward using the following well-known matrix differentiation results. Suppose here that V(ξ) is an (n × n) nonsingular matrix depending on a vector ξ.


• If ξ_k is the kth element of ξ, then ∂V(ξ)/∂ξ_k is the (n × n) matrix whose (ℓ, p) element is the partial derivative of the (ℓ, p) element of V(ξ) with respect to ξ_k.

• ∂/∂ξ_k {log |V(ξ)|} = tr[ V^{−1}(ξ) {∂V(ξ)/∂ξ_k} ].

• ∂/∂ξ_k V^{−1}(ξ) = −V^{−1}(ξ) {∂V(ξ)/∂ξ_k} V^{−1}(ξ).

• For the quadratic form q = x^T V(ξ) x, ∂q/∂ξ_k = x^T {∂V(ξ)/∂ξ_k} x. Thus, from the previous result,

∂/∂ξ_k {x^T V^{−1}(ξ) x} = −x^T V^{−1}(ξ) {∂V(ξ)/∂ξ_k} V^{−1}(ξ) x.

Applying these results to (14.9), treating β as fixed, and setting each partial derivative of (14.9) equal to zero, we obtain the following set of (q + s) estimating equations.

(1/2) Σ_{i=1}^m [ {Y_i − f_i(x_i, β)}^T V_i^{−1}(β, ξ, x_i) {∂V_i(β, ξ, x_i)/∂ξ_k} V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)}

− tr{ V_i^{−1}(β, ξ, x_i) ∂V_i(β, ξ, x_i)/∂ξ_k } ] = 0,   k = 1, …, q + s.   (14.10)

As with (14.6), the summands for i = 1,...,m are independent. Moreover, analogous to the univariate case, if V i(β, ξ, xi) is correctly specified, then, at the true values, the conditional expectation of a summand is zero. This may be seen by using the following result:

• If x is a random vector with mean zero and covariance matrix V, and A is a square matrix, then E(x^T A x) = tr{E(xx^T) A} = tr(V A) = tr(A V).

Using this, we have (assuming expectation is under the parameter values β and ξ),

E[ {Y_i − f_i(x_i, β)}^T V_i^{−1}(β, ξ, x_i) {∂V_i(β, ξ, x_i)/∂ξ_k} V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)} | x_i ]

= tr[ V_i^{−1}(β, ξ, x_i) {∂V_i(β, ξ, x_i)/∂ξ_k} V_i^{−1}(β, ξ, x_i) V_i(β, ξ, x_i) ]

= tr[ V_i^{−1}(β, ξ, x_i) ∂V_i(β, ξ, x_i)/∂ξ_k ],

whence unbiasedness of (14.10) follows. Note, of course, that if V_i(β, ξ, x_i) were incorrectly specified, the equation would not be unbiased.
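The trace identity driving this calculation, E(x^T A x) = tr{A E(xx^T)}, holds exactly for any empirical distribution, since x^T A x = tr(A xx^T). The following sketch verifies it by averaging over a finite set of vectors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Check E(x' A x) = tr{A E(xx')} exactly, for x uniform over a finite set.
n, nsamp = 4, 50
xs = rng.normal(size=(nsamp, n))
A = rng.normal(size=(n, n))
A = A + A.T                          # symmetric, as in the quadratic-form results

lhs = np.mean([x @ A @ x for x in xs])     # E(x' A x) under this empirical law
second_moment = xs.T @ xs / nsamp          # E(xx')
rhs = np.trace(A @ second_moment)          # tr(A V) = tr(V A)
```

The agreement is exact up to floating-point error, with no Monte Carlo approximation involved, because the identity is algebraic rather than distributional.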

JOINT ESTIMATING EQUATIONS: Analogous to the univariate case, iterating the three-step algorithm with C = ∞ would correspond to jointly solving in β and ξ the (p + q + s)-dimensional system of estimating equations given as follows.


Σ_{i=1}^m X_i^T(β) V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)} = 0   (14.11)

(1/2) Σ_{i=1}^m [ {Y_i − f_i(x_i, β)}^T V_i^{−1}(β, ξ, x_i) {∂V_i(β, ξ, x_i)/∂ξ_k} V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)}

− tr{ V_i^{−1}(β, ξ, x_i) ∂V_i(β, ξ, x_i)/∂ξ_k } ] = 0,   k = 1, …, q + s.   (14.12)

SCALE PARAMETER: When the covariance model includes a scale parameter σ², say, the equation in (14.12) corresponding to differentiation with respect to σ² takes a simple form. Specifically, if we write var(Y_i | x_i), in a slight abuse of notation, as σ² V_i(β, θ, α, x_i), with ξ = (θ^T, α^T, σ²)^T, then with ξ_{q+s} = σ²,

∂/∂ξ_{q+s} {σ² V_i(β, θ, α, x_i)} = V_i(β, θ, α, x_i),

so that (14.12) in the case k = q + s reduces to

Σ_{i=1}^m [ σ^{−4} {Y_i − f_i(x_i, β)}^T V_i^{−1}(β, θ, α, x_i) {Y_i − f_i(x_i, β)} − n_i σ^{−2} ] = 0,

which is easily seen to lead to an expression for σ̂² of the form in (14.8).

The multiplicative factor of (1/2) in the equations for ξ in (14.12) could be disregarded, but we maintain it for now, as it proves important for the developments of Section 14.6, the motivation for which we now discuss.

ALTERNATIVE FORM OF THE QUADRATIC ESTIMATING EQUATION FOR ξ: Recall that, in the univariate case, we expressed the summand in joint estimating equations in the form

{ gradient of mean function } × { covariance matrix }^{−1} × { response − mean }.   (14.13)

From this representation, we were immediately able to identify generalizations of the equations and determine the “optimal” equation under a set of assumptions on the moments.

• For example, in the case where quadratic equations were motivated by normality, this represen- tation allowed us to replace normal moments by those corresponding to the assumptions.

It is in fact possible to represent estimating equations such as (14.11) and (14.12) in the form (14.13), although it is a little more involved than in the univariate case.

• Writing estimating equations in the form (14.13) has become the standard way to represent such equations and was first made popular in papers by Prentice (1988), Zhao and Prentice (1990), and Prentice and Zhao (1991).


We now consider how this might be accomplished. For definiteness, we first focus only on the quadratic equation for estimating ξ given in (14.12). Rather than try to show that (14.12) may be written in this alternative form directly, we instead start with the idea that one would want to write an equation in the form (14.13) and demonstrate how this would be done. The equivalence between this approach and (14.12) derived from normality is presented in Section 14.6.

We will consider β fixed for now, and consider estimation of ξ. If we are interested in estimating the elements of ξ, which describe an entire covariance structure (variances and correlations, hence variances and covariances), we must consider the variances of each element of a data vector and all pairwise associations among elements of a data vector. Of course, if there are no unknown variance parameters, we need only consider the associations.

In particular, for Y_i = (Y_i1, …, Y_in_i)^T, if we let v_ijk(β, ξ) be the (j, k) element of V_i(β, ξ, x_i) (which is of course equal to the (k, j) element by symmetry), then we know that, if V_i(β, ξ, x_i) is specified correctly,

E[{Y_ij − f(x_ij, β)}² | x_i] = v_ijj(β, ξ),

E[{Y_ij − f(x_ij, β)}{Y_ik − f(x_ik, β)} | x_i] = v_ijk(β, ξ).

In the univariate case, the quadratic estimating equation was based on treating the squared deviations as the “response” and equating them to their “mean,” the variance function. By analogy, it is natural to think that the quadratic equation in the multivariate case should involve a “response” that includes both squared deviations equated to the variance functions and crossproducts of deviations equated to their means (the covariances).

• By symmetry, for Y_i (n_i × 1), there are n_i squared deviations (one for each entry of Y_i) and n_i(n_i − 1)/2 distinct crossproducts (i.e., the number of covariances), for a total of n_i(n_i + 1)/2 distinct terms.

Explicitly, the n_i(n_i + 1)/2 distinct terms are the n_i squared deviations {Y_i1 − f(x_i1, β)}², …, {Y_in_i − f(x_in_i, β)}² and the n_i(n_i − 1)/2 crossproduct terms

{Y_i1 − f(x_i1, β)}{Y_i2 − f(x_i2, β)}, {Y_i1 − f(x_i1, β)}{Y_i3 − f(x_i3, β)}, …, {Y_i,n_i−1 − f(x_i,n_i−1, β)}{Y_in_i − f(x_in_i, β)}.

• Thus, the “response” and “mean” vectors here should be of length n_i(n_i + 1)/2, assuming that the model contains unknown variance parameters. If it does not, then only the n_i(n_i − 1)/2 crossproduct terms would be required.


• Of course, as the quadratic estimating equation (14.12) depends on the quadratic form in {Y_i − f_i(x_i, β)}, it also depends on squared deviations and crossproducts.

To formalize this, define

u_ijk = {Y_ij − f(x_ij, β)}{Y_ik − f(x_ik, β)},   (14.14)

where the notation suppresses the dependence on β for brevity. Then we may collect the distinct u_ijk defined in (14.14) in a vector of length n_i(n_i + 1)/2 (with unknown variance parameters) or n_i(n_i − 1)/2 (with no unknown variance parameters) in some order.

In the former case, we will define

u_i = (u_i11, u_i12, u_i13, …, u_i1n_i, u_i22, u_i23, …, u_i,n_i−1,n_i−1, u_i,n_i−1,n_i, u_in_in_i)^T.

• Recall that for an (n × r) matrix A, vec(A) is defined as the (nr × 1) vector consisting of the r columns of A stacked in the order 1, …, r.

• If furthermore A is (n × n) and symmetric, then vec(A) contains redundant entries. The vech(·) operator yields the column vector containing all the distinct entries of A by stacking the lower-triangular elements; e.g., for n = 3,

A = [ a11 a12 a13 ; a12 a22 a23 ; a13 a23 a33 ]   and   vech(A) = (a11, a12, a13, a22, a23, a33)^T.

• Thus, our convention here is to use the order imposed by the definition

u_i = vech[ {Y_i − f_i(x_i, β)}{Y_i − f_i(x_i, β)}^T ].

If there are no unknown variance parameters in ξ, ui would be defined by deleting the squared components.
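The vech ordering used here is simple to code; u_i of (14.14) is then just vech applied to the residual outer product. A minimal sketch:

```python
import numpy as np

def vech(A):
    """Stack the on-and-below-diagonal entries of A column by column."""
    n = A.shape[0]
    return np.array([A[i, j] for j in range(n) for i in range(j, n)])

A = np.array([[11., 12., 13.],
              [12., 22., 23.],
              [13., 23., 33.]])
# Order matches the text: (a11, a12, a13, a22, a23, a33)'.

r = np.array([1.0, 2.0, -1.0])        # a residual vector Y_i - f_i(x_i, beta)
u_i = vech(np.outer(r, r))            # the n_i(n_i+1)/2 distinct u_ijk of (14.14)
```

Dropping the rows of vech that correspond to diagonal entries gives the n_i(n_i − 1)/2-vector used when there are no unknown variance parameters.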

We may define a corresponding vector

v_i(β, ξ) = {v_i11(β, ξ), v_i12(β, ξ), v_i13(β, ξ), …, v_i22(β, ξ), v_i23(β, ξ), …, v_in_in_i(β, ξ)}^T;

i.e., v_i(β, ξ) = vech{V_i(β, ξ, x_i)}. If u_i contains no squared components (i.e., there are no unknown variance parameters), then the corresponding elements of v_i would be deleted.


Then we have that the “response” vector ui satisfies

E(u_i | x_i) = v_i(β, ξ),

so that v_i(β, ξ) is the “mean” vector.

It is important to recognize in reading the literature that there are variations on the construction we describe here. For example, some authors instead base the equations on the “response” vech(Y_i Y_i^T), and/or may “stack” things in a different order.

Treating vi(β, ξ) as the mean vector and as a function of ξ, we may define the “gradient matrix of the mean” to be

E_i(β, ξ) = ∂/∂ξ v_i(β, ξ).

E_i(β, ξ) will have n_i(n_i + 1)/2 or n_i(n_i − 1)/2 rows, depending on the form of V_i(β, ξ, x_i), and (q + s) columns.

Let the “covariance matrix” of ui be

var(u_i | x_i) = Z_i(β, ξ), say. A little thought about the form of this matrix makes it clear that it is fairly complex. In particular, in order to specify this matrix, we would have to be willing to make assumptions about quantities of the general form

cov(u_ijk, u_iℓp | x_i) = E(u_ijk u_iℓp | x_i) − E(u_ijk | x_i) E(u_iℓp | x_i).   (14.15)

To investigate what this entails and to make the analogy to the univariate case transparent, consider the particular model for var(Y_i | x_i) that involves a scale parameter σ² and additional variance parameters θ such that

var(Y_ij | x_ij) = σ² g²(β, θ, x_ij)

for a variance function g and correlation parameters α, where ξ = (θ^T, α^T, σ²)^T. Define

ǫ_ij = {Y_ij − f(x_ij, β)}/{σ g(β, θ, x_ij)}.

Then, of course, E(ǫ_ij | x_i) = 0 and var(ǫ_ij | x_i) = 1, where expectation here and subsequently is under the parameter values β and ξ. Clearly, the elements of ǫ_i = (ǫ_i1, …, ǫ_in_i)^T are correlated, with correlation matrix equal to Γ_i(α, x_i) (assuming, as we are, that this matrix is correctly specified).

Under this model, for j, k = 1,...,ni,

u_ijj = σ² g²(β, θ, x_ij) ǫ_ij²,   u_ijk = σ² g(β, θ, x_ij) g(β, θ, x_ik) ǫ_ij ǫ_ik.


Thus, of course,

v_ijj = σ² g²(β, θ, x_ij) E(ǫ_ij² | x_i) = σ² g²(β, θ, x_ij),

v_ijk = σ² g(β, θ, x_ij) g(β, θ, x_ik) E(ǫ_ij ǫ_ik | x_i) = σ² g(β, θ, x_ij) g(β, θ, x_ik) corr(ǫ_ij, ǫ_ik | x_i),

where corr(ǫ_ij, ǫ_ik | x_i) is the (j, k) element of the correlation matrix Γ_i(α, x_i).

In general, using the shorthand notation

g_ij = g(β, θ, x_ij),

we may rewrite (14.15) in terms of the ǫ_ij as

cov(u_ijk, u_iℓp | x_i) = σ⁴ g_ij g_ik g_iℓ g_ip {E(ǫ_ij ǫ_ik ǫ_iℓ ǫ_ip | x_i) − E(ǫ_ij ǫ_ik | x_i) E(ǫ_iℓ ǫ_ip | x_i)}.   (14.16)

The representation (14.16) highlights how much more complicated the multivariate case is relative to the univariate! Recall that in the univariate case, the “response” corresponding to a single independent observation is the scalar squared deviation, so that in constructing the “covariance matrix” we needed only to be concerned about the variance of a squared deviation, which boils down to concern about the variance of the square of a standardized error ǫ and hence the need to make an assumption about excess kurtosis.

Here, in contrast, the “response” corresponding to a single independent observation (response vector) is itself a complex vector, so we must be concerned with numerous variances and covariances of the general form (14.16).

To illustrate the complexity, note some special cases of (14.16):

cov(u_ijj, u_ijj | x_i) = var(u_ijj | x_i) = σ⁴ g_ij⁴ var(ǫ_ij² | x_i),

cov(u_ijk, u_ijk | x_i) = var(u_ijk | x_i) = σ⁴ g_ij² g_ik² [ E(ǫ_ij² ǫ_ik² | x_i) − {E(ǫ_ij ǫ_ik | x_i)}² ],

cov(u_ijj, u_iℓℓ | x_i) = σ⁴ g_ij² g_iℓ² {E(ǫ_ij² ǫ_iℓ² | x_i) − 1},

cov(u_ijk, u_ijp | x_i) = σ⁴ g_ij² g_ik g_ip {E(ǫ_ij² ǫ_ik ǫ_ip | x_i) − E(ǫ_ij ǫ_ik | x_i) E(ǫ_ij ǫ_ip | x_i)},

cov(u_ijj, u_ijp | x_i) = σ⁴ g_ij³ g_ip {E(ǫ_ij³ ǫ_ip | x_i) − E(ǫ_ij ǫ_ip | x_i)}.

Of course, (14.16) represents the general case for j ≠ k ≠ ℓ ≠ p.

RESULT: In order to specify the “covariance matrix” Z_i(β, ξ), we must be prepared to make assumptions about numerous higher moments involving the elements of Y_i, or equivalently ǫ_i, up to four-way associations, i.e., E(ǫ_ij ǫ_ik ǫ_iℓ ǫ_ip | x_i).


Note that if we have ni = 1, so that Y i is a scalar, all of this reduces to the situation first discussed in Chapter 5.

ESTIMATING EQUATION: Putting aside for the moment the troublesome issue of specifying Zi(β, ξ), the preceding developments suggest the following approach. Given some assumption on Zi(β, ξ) (thus, some assumption on higher moments of the ǫij), to estimate ξ (treating β fixed for now), one would solve an estimating equation of the form

Σ_{i=1}^m E_i^T(β, ξ) Z_i^{−1}(β, ξ) {u_i − v_i(β, ξ)} = 0.   (14.17)

This, of course, is a (q + s)-dimensional system of equations in (q + s) unknowns.

• This is a quadratic estimating equation in a much more general sense than in the univariate case. A convenient way to think of this is that the squared deviations and crossproducts that make up the “response” here are components of a quadratic form.

• By analogy to the univariate case, it seems likely that the equation will be “optimal” for estimating ξ if Z_i(β, ξ) is chosen correctly, so that var(u_i | x_i) = Z_i(β, ξ).

• In step (ii) of the three-step algorithm, one could implement this by replacing β by β̂^{(k)} everywhere, including in u_i, where the dependence on β has been suppressed here.

• It is straightforward to verify that, in the univariate case ni = 1, so that there is no parameter α, the estimating equation (14.17) reduces to the general quadratic equation for variance parameters in the second “row” of Equation (10.2).

There are two issues to be resolved:

• Clearly, getting Z_i(β, ξ) correct involves major moment assumptions. The chance that we would be able to specify all of the relevant moments correctly seems slim, if not nil. What, then, is a practical strategy for specifying Z_i(β, ξ)?

• How does (14.17) compare to the equation in the form (14.12)? The latter equation may be thought of as a multivariate generalization of “pseudolikelihood” (PL), as it is based on the normal theory likelihood. Intuition and the analogy to the univariate case suggest that choosing Z_i(β, ξ) to correspond to the matrix that would obtain if the Y_i | x_i were multivariate normal should yield (14.12). Is this in fact true?


SPECIFYING Zi(β, ξ): The difficulty associated with specifying Zi(β, ξ) with confidence is widely acknowledged. Thus, the standard approach is to face up to the difficulty and, instead of attempting to “get it right,” adopt a “working assumption” that is likely incorrect but might at least capture some of the predominant features of associations among the elements of ui. For instance, it might be better than simply taking Zi(β, ξ) to be an identity matrix for each i (i.e., no “weighting”).

• A “working assumption” involves adopting, for the purposes of deducing a form for Z_i(β, ξ), a distributional specification for Y_i | x_i, or at least for aspects of the distribution.

• As discussed in Section 14.5, standard errors for the estimator of β may be adjusted to take into account that the working assumption may be incorrect.

Some popular working assumptions are as follows:

(i) Independence working assumption. Take Z_i(β, ξ) to be the covariance matrix for u_i | x_i that would be obtained if all the Y_ij (equivalently, the ǫ_ij) were assumed to be independent across j.

It may be deduced from the general expressions above that this implies that Z_i(β, ξ) will be a diagonal matrix with diagonal elements var(u_ijk | x_i) = σ⁴ g_ij² g_ik² (j ≠ k) and var(u_ijj | x_i) = σ⁴ g_ij⁴ var(ǫ_ij² | x_i). The analyst must specify var(ǫ_ij² | x_i) but no other higher moments of the ǫ_ij.

(ii) Gaussian working assumption. Take Z_i(β, ξ) to be the covariance matrix for u_i | x_i that would be obtained by assuming that the distribution of Y_i | x_i is normal with the first two moments correctly specified according to the assumed mean-covariance model (14.1). Equivalently, assume that ǫ_i | x_i is normal with mean 0 and covariance matrix Γ_i(α, x_i).

It may be shown under this condition that

cov(u_ijk, u_iℓp | x_i) = v_ijℓ v_ikp + v_ijp v_ikℓ = σ⁴ g_ij g_ik g_iℓ g_ip {E(ǫ_ij ǫ_iℓ | x_i) E(ǫ_ik ǫ_ip | x_i) + E(ǫ_ij ǫ_ip | x_i) E(ǫ_ik ǫ_iℓ | x_i)}.   (14.18)

The entries of Zi(β, ξ) may then be determined from the simplified relationship (14.18).

Note that, as E(ǫ_ij ǫ_ik | x_i) is equal to the conditional correlation between ǫ_ij and ǫ_ik, from (14.18) all of the needed entries of Z_i(β, ξ) depend only on the assumed correlation model Γ_i(α, x_i). Moreover, (14.18) also yields

var(u_ijj | x_i) = 2σ⁴ g_ij⁴,

where the “2” is as expected.
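Under the Gaussian working assumption, every entry of Z_i follows from V_i via (14.18). The sketch below builds Z_i for a hypothetical V_i in the vech ordering and checks the properties noted in the text: symmetry, var(u_ijj | x_i) = 2 v_ijj², and positive semidefiniteness (as befits a genuine covariance matrix).

```python
import numpy as np

def gaussian_Z(V):
    """Z_i under the Gaussian working assumption, via (14.18):
    cov(u_ijk, u_ilp) = v_ijl v_ikp + v_ijp v_ikl, entries in vech order."""
    n = V.shape[0]
    pairs = [(i, j) for j in range(n) for i in range(j, n)]  # vech ordering
    d = len(pairs)
    Z = np.empty((d, d))
    for a, (j, k) in enumerate(pairs):
        for b, (l, p) in enumerate(pairs):
            Z[a, b] = V[j, l] * V[k, p] + V[j, p] * V[k, l]
    return Z

V = np.array([[2.0, 0.8, 0.4],    # a hypothetical positive definite V_i
              [0.8, 1.5, 0.6],
              [0.4, 0.6, 1.0]])
Z = gaussian_Z(V)
```

Note that nothing beyond V_i itself, hence nothing beyond the assumed variance and correlation models, is needed to fill in Z_i under this working assumption.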


Although these choices may indeed be misspecifications, the hope is that they will produce estimators “closer to” being “optimal” than simply ignoring the pattern of association among the elements of u_i altogether.

RELATIONSHIP TO PSEUDOLIKELIHOOD: As we have mentioned, the “PL” estimating equation (14.12), deduced from the normality assumption, i.e.,

(1/2) Σ_{i=1}^m [ {Y_i − f_i(x_i, β)}^T V_i^{−1}(β, ξ, x_i) {∂V_i(β, ξ, x_i)/∂ξ_k} V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)}

− tr{ V_i^{−1}(β, ξ, x_i) ∂V_i(β, ξ, x_i)/∂ξ_k } ] = 0,   k = 1, …, q + s,   (14.19)

is quadratic, as it depends on a quadratic form in {Y_i − f_i(x_i, β)}. A quadratic form may of course be written as a linear combination of squared deviations and crossproducts; e.g.,

{Y_i − f_i(x_i, β)}^T V_i^{−1}(β, ξ, x_i) {Y_i − f_i(x_i, β)} = Σ_{j=1}^{n_i} Σ_{k=1}^{n_i} {Y_ij − f(x_ij, β)}{Y_ik − f(x_ik, β)} v^{ijk}(β, ξ),

where v^{ijk}(β, ξ) is the (j, k) element of V_i^{−1}(β, ξ, x_i). That is, the relevant quadratic form in the summand of the PL equation (14.19) is a linear combination of the elements of u_i.
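The expansion of the quadratic form into crossproducts weighted by the entries of V_i^{−1} is a one-line numerical check; the inputs below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# r' V^{-1} r  =  sum_{j,k} r_j r_k v^{jk},  with v^{jk} the (j,k) entry of
# V^{-1}: the PL quadratic form is a linear combination of the u_ijk.
n = 4
B = rng.normal(size=(n, n))
V = B @ B.T + n * np.eye(n)     # a positive definite "covariance" matrix
r = rng.normal(size=n)          # residual vector Y_i - f_i(x_i, beta)

Vinv = np.linalg.inv(V)
qform = r @ Vinv @ r
expansion = sum(r[j] * r[k] * Vinv[j, k] for j in range(n) for k in range(n))
```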

• Of course, the summands of the estimating equation in (14.17),

Σ_{i=1}^m E_i^T(β, ξ) Z_i^{−1}(β, ξ) {u_i − v_i(β, ξ)} = 0,   (14.20)

are also linear combinations of the elements of u_i.

• If the distributions of Y_i | x_i really are normal, and we choose Z_i(β, ξ) according to the Gaussian working assumption, then, by analogy to the developments in previous chapters (e.g., Section 9.6), (14.20) must be the (asymptotically) “optimal” quadratic estimating equation, as the “weight matrix” Z_i(β, ξ) is correctly specified.

• But the quadratic PL equation (14.19) is also the normal theory ML estimating equation for ξ. Thus, it is also (asymptotically) “optimal” if the data really are normally distributed. Moreover, it is also quadratic.

• Both equations cannot be the “optimal” quadratic equation and be different; hence, intuition suggests that estimating equations (14.19) and (14.20) must be the same.

It turns out that it is possible to show analytically the equivalence between the quadratic PL equation deduced from normality, (14.19), and the equation (14.20) with Z_i(β, ξ) chosen according to the Gaussian working assumption. We defer demonstration of this to Section 14.6, where the argument is carried out in detail.


RESULT: Using a (quadratic) equation of the form (14.20) to estimate ξ, along with the linear equation for β, it is clear that iteration of the three-step algorithm on page 375 with C = ∞ solves the (p + q + s)-dimensional system of equations

\[
\sum_{i=1}^{m}
\begin{pmatrix} X_i(\beta) & 0 \\ 0 & E_i(\beta,\xi) \end{pmatrix}^T
\begin{pmatrix} V_i(\beta,\xi,x_i) & 0 \\ 0 & Z_i(\beta,\xi) \end{pmatrix}^{-1}
\begin{pmatrix} Y_i - f_i(x_i,\beta) \\ u_i - v_i(\beta,\xi) \end{pmatrix} = 0. \tag{14.21}
\]
Compare these equations to, for example, the univariate equations (6.12) (which incorporate the univariate version of the “Gaussian working assumption” to yield the “2σ^4 g_j^4” term). It is straightforward to observe that, with n_i = 1 for all i and the Gaussian working assumption, (14.21) reduces to (6.12).

Continuing the analogy, writing α̃ = (β^T, ξ^T)^T to represent all the unknown parameters being estimated, it is clear that we can write (14.21) in the form

\[
\sum_{i=1}^{m} D_i^T(\tilde\alpha)\,\tilde V_i^{-1}(\tilde\alpha)\{s_i(\tilde\alpha) - m_i(\tilde\alpha)\} = 0, \tag{14.22}
\]
where

\[
D_i(\tilde\alpha) = \begin{pmatrix} X_i(\beta) & 0 \\ 0 & E_i(\beta,\xi) \end{pmatrix} \quad \{n_i + n_i(n_i+1)/2\} \times (p + q + s),
\]
\[
\tilde V_i(\tilde\alpha) = \begin{pmatrix} V_i(\beta,\xi,x_i) & 0 \\ 0 & Z_i(\beta,\xi) \end{pmatrix} \quad \{n_i + n_i(n_i+1)/2\} \times \{n_i + n_i(n_i+1)/2\},
\]
\[
s_i(\tilde\alpha) = \begin{pmatrix} Y_i \\ u_i(\beta,\xi) \end{pmatrix}, \qquad m_i(\tilde\alpha) = \begin{pmatrix} f_i(x_i,\beta) \\ v_i(\beta,\xi) \end{pmatrix} \quad \{n_i + n_i(n_i+1)/2\} \times 1.
\]
Once the equations are written in the form (14.22), it is clear that, by analogy to the developments in Section 6.4, (14.22) may be solved by a Gauss-Newton updating scheme.

• Because the dimension of α˜ may be large, the full Gauss-Newton updating scheme based on (14.22) may be unwieldy in practice.

• But, as the equations “separate,” just as in the univariate case, one may carry out the update in two steps: for given ξ, update β by solving the β equation (first p rows) and then, for given β, solve the ξ equation (last q + s rows). This, of course, is exactly what is done in steps (ii) and (iii) of the three-step algorithm.
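To make the separation concrete, here is a minimal numpy sketch of the two-step iteration for a hypothetical toy problem; all specifics (the loglinear mean, the working variance σ²f², working independence so that V_i is diagonal and common to all subjects, and the simple moment estimator standing in for the quadratic ξ equation) are illustrative assumptions, not part of the development above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: m subjects observed at the same n_i = 4 times t, with
# mean f(t, beta) = exp(beta1 + beta2 t) and working variance sigma^2 f^2.
m, t = 200, np.array([1.0, 2.0, 3.0, 4.0])
beta_true = np.array([0.5, -0.2])
f = lambda b: np.exp(b[0] + b[1] * t)
Y = f(beta_true) * rng.gamma(shape=4.0, scale=0.25, size=(m, 4))  # E(Y) = f

beta, sigma2 = np.zeros(2), 1.0
for _ in range(50):
    # (ii) update beta for fixed sigma2: one Gauss-Newton step on the linear
    #      estimating equation sum_i X_i' V_i^{-1} {Y_i - f_i} = 0
    fi = f(beta)
    X = np.column_stack([fi, fi * t])   # X_i = d f_i / d beta, shared by all i
    w = 1.0 / (sigma2 * fi ** 2)        # diagonal of V_i^{-1}
    rbar = (Y - fi).mean(axis=0)        # average residual vector over subjects
    beta = beta + np.linalg.solve((X.T * w) @ X, (X.T * w) @ rbar)
    # (iii) update sigma2 for fixed beta: a simple moment estimator standing in
    #       for the quadratic estimating equation for the variance parameter
    sigma2 = np.mean((Y - f(beta)) ** 2 / f(beta) ** 2)
```

Iterating (ii) and (iii) to convergence in this way is exactly the separated solution of the stacked system (14.22) for this toy model.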


14.4 Quadratic estimating equations for β

Just as in the univariate case, it is natural in the multivariate context to think about trying to increase efficiency for estimation of β by extracting information about β from the covariance matrix V i(β, ξ).

As we saw in the univariate case (Chapter 5), one way to obtain a quadratic estimating equation for β and to motivate a general class of such equations is to take the normal loglikelihood as a starting point.

Taking the normal loglikelihood (14.9) as a starting point and differentiating with respect to β, it is straightforward to deduce, by the same matrix differentiation operations leading to the linear equation (14.4) and the quadratic equation (14.19) for ξ, that the resulting equation for β is
\[
\sum_{i=1}^{m}\bigg[ X_i^T(\beta) V_i^{-1}(\beta,\xi,x_i)\{Y_i - f_i(x_i,\beta)\}
+ \tfrac{1}{2}\Big(\!\Big( \{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i)\{\partial/\partial\beta_\ell\, V_i(\beta,\xi,x_i)\}V_i^{-1}(\beta,\xi,x_i)\{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\big\{V_i^{-1}(\beta,\xi,x_i)\,\partial/\partial\beta_\ell V_i(\beta,\xi,x_i)\big\} \Big)\!\Big)\bigg] = 0 \quad (p \times 1), \tag{14.23}
\]
where the double parentheses indicate p terms of the form inside them stacked for ℓ = 1,...,p. Of course, underlying estimating equation (14.23) is the assumption of normality. This equation would be solved jointly with the quadratic equation (14.19) for ξ under the normal theory ML approach.

Alternatively, following the same reasoning as in the previous section for estimation of ξ, it should be clear that a more general quadratic equation is possible. In particular, define vi(β, ξ) and ui as before, and let

\[
B_i(\beta,\xi) = \partial/\partial\beta\; v_i(\beta,\xi) \quad \{n_i(n_i+1)/2 \times p\},
\]
so that B_i(β, ξ) is the “gradient matrix of the mean” E(u_i|x_i) under the model. Then consider the joint estimating equations for β and ξ given by
\[
\sum_{i=1}^{m}
\begin{pmatrix} X_i^T(\beta) & B_i^T(\beta,\xi) \\ 0 & E_i^T(\beta,\xi) \end{pmatrix}
\begin{pmatrix} V_i(\beta,\xi,x_i) & 0 \\ 0 & Z_i(\beta,\xi) \end{pmatrix}^{-1}
\begin{pmatrix} Y_i - f_i(x_i,\beta) \\ u_i - v_i(\beta,\xi) \end{pmatrix} = 0. \tag{14.24}
\]

• Clearly, because of the presence of the “gradient matrix” Bi(β, ξ) in the first p rows of (14.24), this leads to an estimating equation for β that is quadratic.

• If Z_i(β, ξ) were chosen according to the Gaussian working assumption, then, by the same reasoning as at the end of Section 14.3, the resulting equation corresponding to the first p rows of (14.24) should be the “optimal” such equation if normality really holds and, thus, intuitively, should be identical to the normal theory ML equation (14.23). That this is indeed the case follows from arguments like those in Section 14.6.


Note from (14.24) that the off-block-diagonal elements of the “covariance matrix”
\[
\tilde V_i(\beta,\xi) = \begin{pmatrix} V_i(\beta,\xi,x_i) & 0 \\ 0 & Z_i(\beta,\xi) \end{pmatrix}
\]
are all equal to zero. Now, it is easy to show that, under normality of Y_i|x_i, we have
\[
\mathrm{cov}(Y_i, u_i|x_i) = 0.
\]
So, in fact, this observation reinforces the notion that (14.24) with the Gaussian working assumption is the optimal joint equation if normality really holds, as in this case Ṽ_i(β, ξ) is exactly the covariance matrix of the “response” s_i = (Y_i^T, u_i^T)^T.

By analogy to the univariate case, if we believe that the distribution of Y_i|x_i is not normal, we could in principle arrive at a more general equation by specifying the covariance matrix of the “response” s_i = (Y_i^T, u_i^T)^T to embody more realistic assumptions about the necessary moments of s_i. If we specify

\[
\mathrm{var}(u_i|x_i) = Z_i(\beta,\xi) \quad \text{and} \quad \mathrm{cov}(Y_i, u_i|x_i) = C_i(\beta,\xi),
\]
say, we would arrive at the estimating equation
\[
\sum_{i=1}^{m}
\begin{pmatrix} X_i^T(\beta) & B_i^T(\beta,\xi) \\ 0 & E_i^T(\beta,\xi) \end{pmatrix}
\begin{pmatrix} V_i(\beta,\xi,x_i) & C_i(\beta,\xi) \\ C_i^T(\beta,\xi) & Z_i(\beta,\xi) \end{pmatrix}^{-1}
\begin{pmatrix} Y_i - f_i(x_i,\beta) \\ u_i - v_i(\beta,\xi) \end{pmatrix} = 0. \tag{14.25}
\]

• Provided that the assumptions for Zi(β, ξ) and Ci(β, ξ) were correct, we would expect to have constructed the “optimal” such quadratic equation.

• In order to do this, we must not only feel confident in the moment assumptions on var(u_i|x_i) embodied in Z_i(β, ξ). We must also make moment assumptions corresponding to C_i(β, ξ), which involve specifying terms of the form
\[
\mathrm{cov}(Y_{ij}, u_{ik\ell}|x_i) = \sigma^3 g_{ij} g_{ik} g_{i\ell}\, E(\epsilon_{ij}\epsilon_{ik}\epsilon_{i\ell}|x_i).
\]
That is, we need to be willing to specify not only the E(ε_{ij}³|x_i) as in the univariate case, but also the “three-way associations.”

• Of course, the chance that we would be able to specify the matrices Z_i(β, ξ) and C_i(β, ξ) completely correctly in practice is slim. Indeed, specifying V_i(β, ξ, x_i) correctly in itself is difficult enough, as we have already discussed!

Putting this issue aside for the moment, it should be clear that implementation of estimation via solution of equations of the general form in (14.25) may also be carried out via a Gauss-Newton updating scheme, redefining the matrices Di(β, ξ) and V˜ i(β, ξ) in the obvious way.


• Note, however, that it is no longer possible to separate estimation of β from estimation of ξ as in the linear estimating equation case; solution of the equation for β here, even for fixed ξ, is considerably more complicated.

WORKING ASSUMPTIONS: Obviously, as noted above, it is well recognized that correct specification of all of the necessary moments involved in an equation like (14.25) is pretty hopeless in practice. Thus, just as with the quadratic equation for estimation of ξ alone, the practical strategy is to make a “working assumption” about the entire matrix Ṽ_i(β, ξ). This of course involves an assumption on the matrix Z_i(β, ξ) as before, along with one on the matrix C_i(β, ξ).

Popular working assumptions for V˜ i(β, ξ) in this context are as follows, analogous to the previous discussion.

• Independence working assumption. Pretending that the elements of Y_i are mutually independent leads to the choice for Z_i(β, ξ) discussed previously on page 384 and C_i(β, ξ) = 0.

• Gaussian working assumption. Pretending that the distribution of Y_i|x_i is normal leads to the choice for Z_i(β, ξ) given in (14.18) and C_i(β, ξ) = 0.

TERMINOLOGY: Equations like those in (14.21) and (14.25) have come to be referred to as generalized estimating equations of specific types:

• Equations of the form (14.21), which involve solving a linear estimating equation for β jointly with a quadratic one for ξ, have been called GEE-1.

• Equations of the form (14.25), which involve solving a quadratic estimating equation for β jointly with a quadratic one for ξ, have been called GEE-2.

• This terminology was evidently coined in a paper by Liang, Zeger, and Qaqish (1992).

• The unqualified term “GEE” is used both to refer to the general approach of specifying estimating equations for mean-covariance models of the form (14.1) and to the particular case where the linear equation for β in (14.6) is solved along with simple moment-type equations for the elements of ξ.

Some authors insist that the term “GEE” also embodies the notion that “working assumptions” are involved, including those on the correlation matrix of Y_i|x_i, that are likely to be incorrect. These authors thus also imply when they use the term GEE that standard errors for the estimator for β must be “corrected” to account for the possibility that the assumptions may be incorrect.


We will discuss the basis for this in the context of linear estimating equations in the next section. Other authors use “GEE” more loosely and generally.

REMARKS: It should be clear from the preceding development that the same issues involving trade-offs between linear and quadratic estimating equations for β that we discussed for the univariate case extend to the multivariate setting. In fact, because the problem here is more complex, the implications may be even more profound.

• There is a choice between linear and quadratic equations.

• The linear equation is clearly unbiased even if the covariance model V_i(β, ξ, x_i) is misspecified. Such misspecification is more likely in the multivariate case, as the analyst must model not only variance but also correlation structure. The latter is probably more difficult, so most of the focus on misspecification in this context is on incorrect modeling of the correlation matrix Γ_i(α, x_i).

• The quadratic equation for β is obviously also unbiased as long as the covariance model V_i(β, ξ, x_i) is correctly specified. Thus, if the analyst is confident in the form of this matrix, this equation offers an alternative to the linear equation that may increase efficiency. Of course, the optimal such equation would require the analyst to also specify Z_i(β, ξ) and C_i(β, ξ) correctly. If this is the case (unlikely in practice), then the resulting estimator for β will be more efficient than even the optimal linear estimator. If these matrices are not correct, although the estimating equation would remain unbiased, whether or not it offers an improvement in efficiency over the linear equation is no longer clear (by analogy to the results in Chapter 10).

• Of course, the scary thing about quadratic equations for β is that they will not be unbiased if V_i(β, ξ, x_i) is not correctly specified, leading to potentially inconsistent estimation of β.

In the univariate case, all that is required is correct modeling of variance, something in which the analyst may often have great confidence. In the multivariate case, the analyst is generally much less confident, as described above. Thus, the potential for increased efficiency may be greatly offset by fear of inconsistent estimation in the multivariate case.

Thus, it is generally agreed that “GEE-1” estimation based on solving equations of the form (14.21) is the “safer” choice for routine use. It is unusual to see quadratic estimating equations for β used in practice for multivariate response, although they are discussed extensively in the literature because of the theoretical potential for increased efficiency.


14.5 The “folklore” theorem and “robust” covariance matrix

Not surprisingly, the theoretical results for estimation of β via solution of equations of the form (14.21) and (14.25) parallel those for the univariate case. Because there are two “sample sizes” here, m = number of individuals and ni = number of observations on individual i, it is important to qualify this statement by stating explicitly the asymptotic framework in which it is valid:

• In the literature on multivariate response, the usual perspective is that the m individuals have been sampled from a population of interest. From the population-averaged modeling perspective, we model the first two moments of the relevant conditional distribution in order to learn about features of this population.

The natural asymptotic framework under these conditions is one in which m → ∞, so we learn more about the population as m increases (as in the univariate cases discussed earlier), while the n_i are regarded as fixed.

• Regarding n_i as fixed often makes practical sense. In many real situations, the number of observations on each of the experimental units may be rather small; recall the wheezing example discussed in Section 13.4, where at most four measurements were available on each of 300 children.

• Sometimes, however, there may be many observations on each unit. In this case, it may be possible to treat not only m but also the n_i as tending to ∞. This requires slightly more delicacy in specifying how this might happen.

• In fact, a scenario that has sparked a good deal of recent interest is that in which m is not that large but, rather, the n_i are. Clearly, if inference on the population of units is of interest, small m is a liability, as it limits the information available on the population. In this situation, recent work has focused on “small sample corrections” to improve the reliability of inference, e.g., Mancl and DeRouen (2001). Still other work is devoted to a different asymptotic framework in which the n_i → ∞ while m stays fixed; the scope of relevance for this is clearly different from that above.

• In many situations, the n_i may be dictated by design. But note that in some contexts the n_i may in fact themselves be a response; e.g., in the developmental toxicology setting in Example 1.7, the size of the litter may well change in response to increasing toxicity. In everything we have done so far, we have not highlighted this issue; rather, we have implicitly conditioned on the n_i. We will not delve into this issue further here, but be aware that, if circumstances dictate, the n_i themselves may in fact contain important information on scientific questions of interest, and a modeling strategy that does not condition on them may make more sense.


Here, we will focus on the framework in which m → ∞ while the n_i remain fixed.

It should be clear that it would be possible to carry out large sample theory arguments for the estimators for β and ξ in model (14.1) that would be obtained from both (14.21) and (14.25). These arguments would be analogous to those for linear and quadratic equations for β coupled with solution of an additional estimating equation for ξ, as discussed in Chapters 9 and 10.

In this section, we will confine our attention to the properties of estimators for β found by solving the linear GEE estimating equation in (14.21) along with estimation of covariance parameters by solution of an additional equation (e.g., a quadratic equation). The famous results we derive are analogous to the “folklore theorem” for GLS estimators in the univariate case, as we now demonstrate.

As the argument allowing for misspecification of the covariance matrix V i(β, ξ, xi) includes the case where this matrix is in fact correctly specified, we present only the more general argument under misspecification. This argument is analogous to that for GLS in the univariate case with misspecified variance function, given in Section 9.3; as we saw in that case, the results when in fact the variance function is given correctly, presented in Section 9.2, were really just a special case of these.

To formalize the argument, suppose that, although the conditional mean model E(Y i|xi) = f i(xi, β) is correctly specified, the covariance model is misspecified as

\[
\mathrm{var}(Y_i|x_i) = U_i(\beta,\gamma,x_i), \tag{14.26}
\]
where it is most likely that the model depends on β through the components of the mean vector, and γ is a vector of variance and correlation parameters specifying the chosen model. As we have discussed, most often the misspecification would be in the correlation matrix. If there are no additional variance parameters (i.e., the chosen variance function is known as a function of β), then γ would represent correlation parameters in the misspecified correlation model. More generally, if the model additionally contains unknown variance parameters, some elements of γ may be the variance parameters in a correctly or incorrectly specified variance function.

Suppose that the true form of the covariance matrix is, for some V i,

\[
\mathrm{var}(Y_i|x_i) = V_i(\beta_0,\xi_0,x_i), \tag{14.27}
\]
where β_0 and ξ_0 are the true values of these parameters. Define
\[
\delta_i = V_i^{-1/2}(\beta_0,\xi_0,x_i)\{Y_i - f_i(x_i,\beta_0)\},
\]
where V_i^{1/2}(β_0, ξ_0, x_i) is a symmetric square root of V_i(β_0, ξ_0, x_i) and V_i^{−1/2}(β_0, ξ_0, x_i) its inverse, such that V_i^{−1/2}(β_0, ξ_0, x_i) V_i^{1/2}(β_0, ξ_0, x_i) = V_i^{1/2}(β_0, ξ_0, x_i) V_i^{−1/2}(β_0, ξ_0, x_i) = I_{n_i}.
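A symmetric square root of a positive definite matrix may be computed from its eigendecomposition; a small numpy sketch (with a randomly generated positive definite matrix standing in for V_i) verifying the properties just stated:

```python
import numpy as np

rng = np.random.default_rng(2)
ni = 4
A = rng.standard_normal((ni, ni))
V = A @ A.T + ni * np.eye(ni)   # positive definite stand-in for V_i(beta0, xi0, x_i)

# Symmetric square root via the eigendecomposition V = P diag(lam) P'
lam, P = np.linalg.eigh(V)
V_half = P @ np.diag(np.sqrt(lam)) @ P.T           # V^{1/2}, symmetric
V_neghalf = P @ np.diag(1.0 / np.sqrt(lam)) @ P.T  # its inverse, V^{-1/2}

assert np.allclose(V_half @ V_half, V)              # (V^{1/2})^2 = V
assert np.allclose(V_neghalf @ V_half, np.eye(ni))  # V^{-1/2} V^{1/2} = I
assert np.allclose(V_half, V_half.T)                # symmetry
```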


Then it is clear that
\[
E(\delta_i|x_i) = 0, \qquad \mathrm{var}(\delta_i|x_i) = I_{n_i}.
\]

Assuming the incorrect model (14.26), by analogy to the discussion in Section 9.3, suppose there is a value γ* such that, if we estimate γ by solving an appropriate estimating equation, the resulting estimator γ̂ satisfies
\[
m^{1/2}(\hat\gamma - \gamma^*) = O_p(1).
\]

As in the univariate case, γ̂ is likely to have been obtained by solving this estimating equation jointly with the linear GEE equation in (14.21). That is, the resulting estimator β̂ solves
\[
\sum_{i=1}^{m} X_i^T(\hat\beta)\, U_i^{-1}(\hat\beta^*, \hat\gamma, x_i)\{Y_i - f_i(x_i,\hat\beta)\} = 0. \tag{14.28}
\]
Analogous to the arguments in Sections 9.2 and 9.3, we denote the estimator for β in the “weights” as β̂* to allow the possibility of a finite number of iterations of a three-step algorithm and to keep track of its influence.

• Clearly, even with the covariance matrix misspecified, the estimating equation in (14.28) is unbiased, so that β̂ should be consistent under regularity conditions. We also assume, as in the univariate case, that m^{1/2}(β̂* − β_0) = O_p(1).

Expanding (14.28) in a Taylor series in (β̂^T, β̂*^T, γ̂^T)^T about (β_0^T, β_0^T, γ*^T)^T, we obtain

\[
0 \approx C_m^* + (A_{m1}^* + A_{m2}^*)\, m^{1/2}(\hat\beta - \beta_0) + D_m^*\, m^{1/2}(\hat\beta^* - \beta_0) + E_m^*\, m^{1/2}(\hat\gamma - \gamma^*), \tag{14.29}
\]
where
\[
C_m^* = m^{-1/2}\sum_{i=1}^{m} X_i^T(\beta_0)\, U_i^{-1}(\beta_0,\gamma^*,x_i)\, V_i^{1/2}(\beta_0,\xi_0,x_i)\,\delta_i,
\]
\[
A_{m1}^* = m^{-1}\sum_{i=1}^{m} \{\partial/\partial\beta\, X_i^T(\beta_0)\}\, U_i^{-1}(\beta_0,\gamma^*,x_i)\, V_i^{1/2}(\beta_0,\xi_0,x_i)\,\delta_i,
\]
\[
A_{m2}^* = -m^{-1}\sum_{i=1}^{m} X_i^T(\beta_0)\, U_i^{-1}(\beta_0,\gamma^*,x_i)\, X_i(\beta_0),
\]
\[
D_m^* = m^{-1}\sum_{i=1}^{m} X_i^T(\beta_0)\, \{\partial/\partial\beta\, U_i^{-1}(\beta_0,\gamma^*,x_i)\}\, V_i^{1/2}(\beta_0,\xi_0,x_i)\,\delta_i,
\]
\[
E_m^* = m^{-1}\sum_{i=1}^{m} X_i^T(\beta_0)\, \{\partial/\partial\gamma\, U_i^{-1}(\beta_0,\gamma^*,x_i)\}\, V_i^{1/2}(\beta_0,\xi_0,x_i)\,\delta_i.
\]

Clearly, as m → ∞, A_{m1}*, D_m*, and E_m* all converge in probability to 0; thus, as expected from the univariate case, the effects of β̂* and γ̂ used to form the “weight” matrices are negligible.


We thus have, applying these results to (14.29) and rearranging,
\[
m^{1/2}(\hat\beta - \beta_0) \approx -A_{m2}^{*-1} C_m^*.
\]
Writing X_i = X_i(β_0), U_i = U_i(β_0, γ*, x_i), and V_i = V_i(β_0, ξ_0, x_i), we have that
\[
A_{m2}^* \xrightarrow{p} -A, \qquad A = \lim_{m\to\infty} m^{-1}\sum_{i=1}^{m} X_i^T U_i^{-1} X_i,
\]
and
\[
C_m^* \xrightarrow{L} \mathcal{N}(0, B), \qquad B = \lim_{m\to\infty} m^{-1}\sum_{i=1}^{m} X_i^T U_i^{-1} V_i U_i^{-1} X_i.
\]
Combining, we obtain
\[
m^{1/2}(\hat\beta - \beta_0) \xrightarrow{L} \mathcal{N}(0, A^{-1} B A^{-1}),
\]
so that
\[
\hat\beta \;\dot\sim\; \mathcal{N}\left[\beta_0,\; \Big(\sum_{i=1}^{m} X_i^T U_i^{-1} X_i\Big)^{-1} \Big(\sum_{i=1}^{m} X_i^T U_i^{-1} V_i U_i^{-1} X_i\Big) \Big(\sum_{i=1}^{m} X_i^T U_i^{-1} X_i\Big)^{-1}\right]. \tag{14.30}
\]
REMARKS:

• Obviously, if in fact the true model (14.27) and the assumed model (14.26) coincide, so that the covariance structure has been correctly specified, then U_i = V_i, and the covariance matrix in (14.30) reduces to
\[
\Big(\sum_{i=1}^{m} X_i^T V_i^{-1} X_i\Big)^{-1},
\]
which, adopting the definitions on page 373, may be written as
\[
(X^T V^{-1} X)^{-1}.
\]
This, of course, is of the identical form to the corresponding covariance matrix in the univariate case.

• Similarly, defining U analogously to V, the covariance matrix in (14.30) may be written as
\[
(X^T U^{-1} X)^{-1} (X^T U^{-1} V U^{-1} X) (X^T U^{-1} X)^{-1}.
\]

Thus, note that the comparison between the large sample covariance matrices that arise when var(Y_i|x_i) is correctly and incorrectly specified is identical to that in the univariate case, and the same discussion applies.

• In particular, misspecification of the covariance matrix of the response, var(Y i|xi), will result in a loss of efficiency relative to modeling this moment correctly.


CAUTION: This argument requires that m^{1/2}(γ̂ − γ*) = O_p(1) for some γ*. Thus, for example, if we solve the linear estimating equation for β along with the quadratic equation for the covariance parameters, where the model is misspecified, we assume that such a γ* exists.

Recall that it is most likely that misspecification will be in the correlation matrix, so that this is the source of the difficulty. In this case, the components of γ in the misspecified model that are correlation parameters have no real meaning.

Crowder (1995) points out that there may be situations where, if there are no unknown variance parameters, the estimator γ̂ of the correlation parameters for the chosen “working” correlation matrix may not have the well-defined limit in probability required for the above argument to go through! There are configurations where the true correlation matrix is such that, for certain “working” assumptions that are incorrect, γ* may not exist. Details are beyond the scope of our treatment here; however, the overall message is that there is no general large sample argument to guarantee the above “folklore” result; we must assume that there exists a γ* such that γ̂ converges in probability to γ*.

QUADRATIC ESTIMATING EQUATIONS: Arguments similar to those in Chapter 10 may be used to deduce the general form of the approximate joint covariance matrix of estimators for β and ξ using quadratic equations (14.25) under different working assumptions. We do not present this here.

ROBUST SANDWICH COVARIANCE MATRIX: Although modeling the covariance matrix correctly is of course desirable because of the potential for increased precision in estimation of β, as we have discussed previously, it may be difficult to have confidence that this is the case. In particular, correct modeling of the marginal conditional correlation structure is likely to be difficult, so that the specified model is ordinarily viewed in most applications as only a “working” assumption.

Given this, it is usually unreasonable to construct estimated measures of uncertainty such as standard errors or confidence intervals on the basis of the “correct” folklore result. Instead, it is assumed upfront that the specified model is incorrect, and the estimated standard errors are constructed using the general result (14.30).

More specifically, analogous to the discussion in Section 9.4, the large sample approximate covariance matrix we would like to estimate to use as a basis for standard errors is

\[
(X^T U^{-1} X)^{-1} (X^T U^{-1} V U^{-1} X) (X^T U^{-1} X)^{-1}.
\]


This involves the matrix (X^T U^{−1} X)^{−1} that one would construct under the assumption that U_i(β, γ, x_i) is the correct model. Of course, m times this matrix may be estimated by
\[
\hat A^{-1} = \Big\{ m^{-1}\sum_{i=1}^{m} X_i^T(\hat\beta)\, U_i^{-1}(\hat\beta, \hat\gamma, x_i)\, X_i(\hat\beta) \Big\}^{-1}. \tag{14.31}
\]

Defining
\[
\hat R_i = \{Y_i - f_i(x_i,\hat\beta)\}\{Y_i - f_i(x_i,\hat\beta)\}^T,
\]
m^{−1} times the “middle” matrix may be estimated by
\[
\hat B = m^{-1}\sum_{i=1}^{m} X_i^T(\hat\beta)\, U_i^{-1}(\hat\beta,\hat\gamma,x_i)\, \hat R_i\, U_i^{-1}(\hat\beta,\hat\gamma,x_i)\, X_i(\hat\beta). \tag{14.32}
\]

The estimated matrices (14.31) and (14.32) may be combined to obtain an estimator for the covariance matrix of β̂, under the assumption that var(Y_i|x_i) may be incorrectly specified, as
\[
m^{-1} \hat A^{-1} \hat B \hat A^{-1}. \tag{14.33}
\]

(Note that the “m” and “m^{−1}” terms cancel.) The routine use of (14.33) in this context was suggested by Liang and Zeger (1986). Software packages that implement this type of “GEE-1” estimation generally calculate standard errors based on (14.33) by default or allow the user to request this calculation. It is generally considered prudent in practice to use these “robust sandwich” standard errors to protect against the likely misspecification of correlation structure.
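The calculation in (14.31)–(14.33) is purely mechanical once β̂, γ̂, and the working covariance model are in hand. A minimal numpy sketch; the function name and the list-based interface are hypothetical conveniences, not taken from any particular package:

```python
import numpy as np

def robust_sandwich(X_list, U_list, resid_list):
    """Robust (sandwich) covariance estimate for beta-hat, following (14.31)-(14.33).

    X_list[i]     : (n_i x p) gradient matrix X_i evaluated at beta-hat
    U_list[i]     : (n_i x n_i) working covariance U_i(beta-hat, gamma-hat, x_i)
    resid_list[i] : (n_i,) residual vector Y_i - f_i(x_i, beta-hat)
    """
    m = len(X_list)
    p = X_list[0].shape[1]
    A = np.zeros((p, p))   # m^{-1} sum_i X_i' U_i^{-1} X_i
    B = np.zeros((p, p))   # m^{-1} sum_i X_i' U_i^{-1} R_i U_i^{-1} X_i, R_i = r_i r_i'
    for X, U, r in zip(X_list, U_list, resid_list):
        UinvX = np.linalg.solve(U, X)
        A += X.T @ UinvX / m
        g = UinvX.T @ r                # X_i' U_i^{-1} r_i
        B += np.outer(g, g) / m
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / m         # (14.33); the m's cancel as noted above
```

When the working covariance is correct, this estimate should agree closely with the “naive” matrix (Σ X_i^T U_i^{−1} X_i)^{−1}; when it is not, the middle term supplies the protection.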

14.6 Equivalence of “pseudolikelihood” and “GEE” estimating equations for covariance parameters

On page 385, we noted that it is possible to show formally the equivalence between the quadratic PL equation for ξ deduced from normality, (14.19), given by
\[
\frac{1}{2}\sum_{i=1}^{m}\Big[\{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i)\{\partial/\partial\xi_k\, V_i(\beta,\xi,x_i)\}V_i^{-1}(\beta,\xi,x_i)\{Y_i - f_i(x_i,\beta)\} - \mathrm{tr}\big\{V_i^{-1}(\beta,\xi,x_i)\,\partial/\partial\xi_k V_i(\beta,\xi,x_i)\big\}\Big] = 0, \quad k = 1,\ldots,q+s, \tag{14.34}
\]
and the alternative form (14.20) in the case where the matrix Z_i(β, ξ) is chosen according to the Gaussian working assumption, namely,
\[
\sum_{i=1}^{m} E_i^T(\beta,\xi)\, Z_i^{-1}(\beta,\xi)\{u_i - v_i(\beta,\xi)\} = 0. \tag{14.35}
\]
Here,
\[
u_i = \mathrm{vech}\big[\{Y_i - f_i(x_i,\beta)\}\{Y_i - f_i(x_i,\beta)\}^T\big].
\]

In practice, squared terms are deleted if the model contains no unknown variance parameters. We will not note this explicitly in the following argument.

It suffices to show that (14.34) and (14.35) are in fact the same estimating equation under the conditions above; specifically, it suffices to show that their kth rows coincide. The kth row of (14.34) may be written, using the identity for quadratic forms x^T A x = tr(A x x^T), as

\[
\frac{1}{2}\sum_{i=1}^{m} \mathrm{tr}\Big[\{\partial/\partial\xi_k V_i(\beta,\xi,x_i)\}V_i^{-1}(\beta,\xi,x_i)\{Y_i - f_i(x_i,\beta)\}\{Y_i - f_i(x_i,\beta)\}^T V_i^{-1}(\beta,\xi,x_i) - \{\partial/\partial\xi_k V_i(\beta,\xi,x_i)\}V_i^{-1}(\beta,\xi,x_i)\Big] = 0. \tag{14.36}
\]
Noting that E_i^T(β, ξ) has kth row {∂/∂ξ_k v_i(β, ξ)}^T, we may write the kth row of (14.35) as

\[
\sum_{i=1}^{m} \{\partial/\partial\xi_k\, v_i(\beta,\xi)\}^T Z_i^{-1}(\beta,\xi)\{u_i - v_i(\beta,\xi)\} = 0. \tag{14.37}
\]
We may show the result by showing that the ith summand in (14.36) is equal to that in (14.37).

The following results are available in many advanced texts on matrix algebra, such as Chapter 16 of Harville (1997), or in Appendix 4.A of Fuller (1987). For matrices A (a × a), B, C, D,

(i) tr(AB) = {vec(A)}^T{vec(B^T)} = {vec(A^T)}^T{vec(B)}.

(ii) tr(ABD^T C^T) = {vec(A)}^T (B ⊗ C) vec(D), where “⊗” represents the Kronecker product.

(iii) For A symmetric, there is a relationship between vec(A) and vech(A). In particular, there exists a unique matrix Φ of dimension {a² × a(a + 1)/2} such that
\[
\mathrm{vec}(A) = \Phi\, \mathrm{vech}(A).
\]
Clearly, Φ is unique and of full column rank, as there is only one way to write the distinct elements of A in a full, redundant vector.

There also exist many (not unique) linear transformations of vec(A) into vech(A). It should be clear that there cannot be a unique such transformation. Consider a transformation matrix Ψ of dimension {a(a + 1)/2 × a²} such that
\[
\mathrm{vech}(A) = \Psi\, \mathrm{vec}(A).
\]

One particular choice of Ψ is the Moore-Penrose generalized inverse of Φ,
\[
\Psi = (\Phi^T \Phi)^{-1} \Phi^T.
\]

Fuller (1987, page 383) gives the actual form of Φ.
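The matrix Φ in (iii), often called the duplication matrix, and the particular choice Ψ = (Φ^TΦ)^{−1}Φ^T are easy to construct directly. A small numpy sketch (the helper names are ours) verifying both identities:

```python
import numpy as np

def vech(A):
    """Half-vectorization: stack the lower triangle of A column by column."""
    a = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(a)])

def duplication(a):
    """The unique Phi with vec(A) = Phi vech(A) for symmetric (a x a) A."""
    Phi = np.zeros((a * a, a * (a + 1) // 2))
    col = 0
    for j in range(a):
        for i in range(j, a):
            Phi[j * a + i, col] = 1.0   # position of A[i, j] in vec(A)...
            Phi[i * a + j, col] = 1.0   # ...and of its mirror A[j, i]
            col += 1
    return Phi

a = 3
rng = np.random.default_rng(4)
S = rng.standard_normal((a, a)); S = S + S.T       # symmetric test matrix
Phi = duplication(a)
Psi = np.linalg.solve(Phi.T @ Phi, Phi.T)          # (Phi'Phi)^{-1} Phi'

assert np.allclose(Phi @ vech(S), S.flatten(order="F"))  # vec(A) = Phi vech(A)
assert np.allclose(Psi @ S.flatten(order="F"), vech(S))  # vech(A) = Psi vec(A)
```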


Because we are addressing equivalency under normality assumptions, take Y_i|x_i to be normally distributed for the purposes of the argument. Under these conditions, it is possible to show [see, for example, Fuller (1987, Lemma 4.A.1)] that
\[
\mathrm{var}(u_i|x_i) = 2\Psi\{V_i(\beta,\xi,x_i) \otimes V_i(\beta,\xi,x_i)\}\Psi^T. \tag{14.38}
\]

In fact, (14.38) is a compact way of expressing (14.18). Result 4.A.3.1 of Fuller (1987, page 385) then yields that
\[
\{\mathrm{var}(u_i|x_i)\}^{-1} = \frac{1}{2}\Phi^T\{V_i^{-1}(\beta,\xi,x_i) \otimes V_i^{-1}(\beta,\xi,x_i)\}\Phi. \tag{14.39}
\]
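Results (14.38) and (14.39) can be checked numerically; the following numpy sketch builds Φ and Ψ as above for a random positive definite matrix standing in for V_i and verifies that the two expressions are indeed inverses of one another:

```python
import numpy as np

def duplication(a):
    # Phi with vec(A) = Phi vech(A) for symmetric A (see result (iii))
    Phi = np.zeros((a * a, a * (a + 1) // 2))
    col = 0
    for j in range(a):
        for i in range(j, a):
            Phi[j * a + i, col] = 1.0
            Phi[i * a + j, col] = 1.0
            col += 1
    return Phi

a = 3
rng = np.random.default_rng(5)
M = rng.standard_normal((a, a))
V = M @ M.T + a * np.eye(a)                    # positive definite stand-in for V_i
Phi = duplication(a)
Psi = np.linalg.solve(Phi.T @ Phi, Phi.T)      # Moore-Penrose inverse of Phi

var_u = 2.0 * Psi @ np.kron(V, V) @ Psi.T      # right side of (14.38)
inv_claimed = 0.5 * Phi.T @ np.kron(np.linalg.inv(V), np.linalg.inv(V)) @ Phi  # (14.39)

assert np.allclose(var_u @ inv_claimed, np.eye(a * (a + 1) // 2))
```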

Armed with these results, we are now in a position to show the desired correspondence. For brevity, we will suppress the arguments of all matrices and vectors. The estimating equation in (14.36) has two parts:
\[
\frac{1}{2}\mathrm{tr}\{(\partial/\partial\xi_k V_i)V_i^{-1}(Y_i - f_i)(Y_i - f_i)^T V_i^{-1}\} \tag{14.40}
\]
and
\[
-\frac{1}{2}\mathrm{tr}\{(\partial/\partial\xi_k V_i)V_i^{-1}\}. \tag{14.41}
\]

Consider (14.40). By result (ii) on page 397, identifying A = ∂/∂ξ_k V_i, B = V_i^{−1}, D = (Y_i − f_i)(Y_i − f_i)^T, and C = V_i^{−1}, we have
\[
\begin{aligned}
\frac{1}{2}\mathrm{tr}\{(\partial/\partial\xi_k V_i)V_i^{-1}(Y_i - f_i)(Y_i - f_i)^T V_i^{-1}\}
&= \frac{1}{2}\{\mathrm{vec}(\partial/\partial\xi_k V_i)\}^T (V_i^{-1} \otimes V_i^{-1})\,\mathrm{vec}\{(Y_i - f_i)(Y_i - f_i)^T\} \\
&= \frac{1}{2}\{\Phi\,\mathrm{vech}(\partial/\partial\xi_k V_i)\}^T (V_i^{-1} \otimes V_i^{-1})\Phi\, u_i \\
&= \{\mathrm{vech}(\partial/\partial\xi_k V_i)\}^T \Big\{\frac{1}{2}\Phi^T (V_i^{-1} \otimes V_i^{-1})\Phi\Big\} u_i,
\end{aligned} \tag{14.42}
\]
using the definition of u_i.

Now from the definition of v_i, we have
\[
\mathrm{vech}(\partial/\partial\xi_k V_i) = \partial/\partial\xi_k\, \mathrm{vech}(V_i) = \partial/\partial\xi_k\, v_i.
\]

Moreover, the “middle” term in braces in (14.42), by (14.39), equals Z_i^{−1}, as we are doing these calculations under normality. Substituting these developments into (14.42) yields
\[
\{\partial/\partial\xi_k v_i\}^T Z_i^{-1} u_i. \tag{14.43}
\]

Now consider (14.41).


Applying result (ii) on page 397 gives
\[
\begin{aligned}
-\frac{1}{2}\{\mathrm{vec}(\partial/\partial\xi_k V_i)\}^T (V_i^{-1} \otimes V_i^{-1})\,\mathrm{vec}(V_i)
&= -\frac{1}{2}\{\Phi\,\mathrm{vech}(\partial/\partial\xi_k V_i)\}^T (V_i^{-1} \otimes V_i^{-1})\Phi\,\mathrm{vech}(V_i) \\
&= -\{\mathrm{vech}(\partial/\partial\xi_k V_i)\}^T \Big\{\frac{1}{2}\Phi^T (V_i^{-1} \otimes V_i^{-1})\Phi\Big\} v_i \\
&= -\{\partial/\partial\xi_k v_i\}^T Z_i^{-1} v_i.
\end{aligned} \tag{14.44}
\]

Combining (14.43) and (14.44), we obtain that the kth row of the PL summand in (14.36) is in fact equal to the kth row of the GEE summand in (14.37), namely
\[
\{\partial/\partial\xi_k v_i\}^T Z_i^{-1} (u_i - v_i),
\]
as desired.
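The equivalence just demonstrated can also be confirmed numerically for a single summand. A sketch with a hypothetical one-parameter covariance model V(ξ) = ξΓ for fixed Γ, so that ∂V/∂ξ = Γ:

```python
import numpy as np

def vech(A):
    a = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(a)])

def duplication(a):
    Phi = np.zeros((a * a, a * (a + 1) // 2))
    col = 0
    for j in range(a):
        for i in range(j, a):
            Phi[j * a + i, col] = 1.0
            Phi[i * a + j, col] = 1.0
            col += 1
    return Phi

rng = np.random.default_rng(6)
ni = 3
M = rng.standard_normal((ni, ni))
Gamma = M @ M.T + ni * np.eye(ni)     # fixed positive definite Gamma
xi = 1.7
V, dV = xi * Gamma, Gamma             # V(xi) = xi * Gamma, dV/dxi = Gamma
Vinv = np.linalg.inv(V)
r = rng.standard_normal(ni)           # plays the role of Y_i - f_i(x_i, beta)

# kth row of the PL equation (14.36), single summand
pl = 0.5 * (r @ Vinv @ dV @ Vinv @ r - np.trace(Vinv @ dV))

# kth row of the GEE form (14.37): {dv_i/dxi}' Z_i^{-1} (u_i - v_i), with
# Z_i^{-1} given by the Gaussian working assumption, as in (14.39)
Phi = duplication(ni)
Zinv = 0.5 * Phi.T @ np.kron(Vinv, Vinv) @ Phi
gee = vech(dV) @ Zinv @ (vech(np.outer(r, r)) - vech(V))

assert np.isclose(pl, gee)
```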

Of course, it is in fact possible to carry out the argument in the reverse direction, starting from (14.37).

Note that the same type of argument may be applied to the second term in the quadratic estimating equation for β, so that in fact the joint normal ML equations may be written in the “GEE-2” form with the Gaussian working assumption employed.

14.7 Implementation in SAS and R

Software for fitting marginal, population-average models using the GEE-1 and GEE-2 approaches is available. However, most of the popular programs have some limitations. In particular, they implement only the linear estimating equation for β, allowing only unknown correlation parameters in the covariance model (i.e., no unknown variance parameters except for a scale parameter σ²). They also implement the simple moment methods advocated by Liang and Zeger (1986) rather than the quadratic estimating equation methods used in the GEE-1 approach to estimate these parameters. We demonstrate use of SAS proc genmod and the R/Splus function gee() below to carry out this type of analysis.


Implementation of the GEE-1 approach may be carried out in SAS using either of two macros that were originally designed for another purpose. In particular, the glimmix and nlinmix macros were originally targeted for fitting of generalized linear and nonlinear subject-specific models with random effects (generalized linear mixed effects models and nonlinear mixed effects models), respectively, by an approximate technique; this is discussed in Chapter 15. It turns out, as discussed in Chapters 14 and 15 of Littell et al. (2006), that these macros may also be used to fit marginal models in the manner we have discussed. In particular, in this approach, the linear estimating equation for β is solved jointly with the quadratic estimating equation for the covariance parameters under the Gaussian working assumption. As with proc genmod and gee(), only correlation parameters may be estimated; it is not possible to estimate unknown parameters θ in a variance function (except for a scale parameter σ2). SAS proc mixed, which invokes normal theory maximum likelihood for very general linear models, including those for multivariate response, is used to estimate the covariance parameters, where a linear approximation to the nonlinear mean model is used. Iteration between estimation of β and estimation of the covariance parameters continues until convergence. Here, we will demonstrate using the nlinmix macro, which allows for more general mean models than does the glimmix macro.

This sort of analysis may also be carried out using the SAS procedure proc glimmix, not to be confused with the macro, and available in SAS version 9.2 and later. This procedure is meant mainly for fitting generalized linear mixed effects models, discussed in Chapter 15, but can be used to fit marginal models, too.

To illustrate the use of all of these packages, we consider a famous data set first reported by Thall and Vail (1990). A clinical trial was conducted in which 59 people with epilepsy suffering from simple or partial seizures were assigned at random to receive either the anti-epileptic drug progabide or an inert substance (a placebo), in addition to a standard chemotherapy regimen all were taking. Because each individual might be prone to different rates of experiencing seizures, the investigators first tried to get a sense of this by recording the number of seizures suffered by each subject over the 8-week period prior to the start of administration of the assigned treatment. It is common in such studies to record such baseline measurements, so that the effect of treatment for each subject may be measured relative to how that subject behaved before treatment. Following the commencement of treatment, the number of seizures for each subject was counted for each of four consecutive two-week periods. The age of each subject at the start of the study was also recorded, as it was suspected that the age of the subject might be associated with the effect of the treatment somehow. The goal of the analysis was to determine whether or not administration of progabide results in fewer seizure episodes on average.

PAGE 400 CHAPTER 14 ST 762, M. DAVIDIAN

Table 14.1: Seizure counts for 5 subjects assigned to placebo (0) and 5 subjects assigned to progabide (1).

                      Period
   Subject    1    2    3    4   Trt   Baseline   Age
         1    5    3    3    3     0         11    31
         2    3    5    3    3     0         11    30
         3    2    4    0    5     0          6    25
         4    4    4    1    4     0          8    36
         5    7   18    9   21     0         66    22
        29   11   14    9    8     1         76    18
        30    8    7    9    4     1         38    32
        31    0    4    3    0     1         19    20
        32    3    6    1    3     1         10    30
        33    2    6    7    4     1         19    18

The data for the first 5 subjects in each treatment group are summarized in Table 14.1. The response, number of seizures, is in the form of a count, for which the Poisson distribution might be considered an appropriate model.

Many authors who have considered this data set have chosen to treat the baseline seizure count as a covariate rather than as one of the repeated measurements of the response (at time 0); Thall and Vail did this in their original report of analysis of these data. This is partly because the baseline count was collected over 8 weeks, while the post-treatment counts were collected over two weeks. In reality, however, the baseline count is itself a response, so arguably it should be treated as one of the repeated responses, suitably scaled to put it on a two-week basis.

Although this would be a better way to view these data, we follow the original source and fit the model that they did. Thus, we regard the data as comprising repeated measurements (ni = 4 for all i) on each of m = 59 subjects. Because subjects are likely heterogeneous in their underlying propensities for seizures, we expect the marginal distribution of the counts to exhibit overdispersion. Let Yij be the seizure count for subject i at his/her visit to the clinic for the jth period following initiation of treatment, where the corresponding times of measurement are biweekly visits coded as tij = 1, 2, 3, 4 for j = 1,..., 4 = ni. Let δi be the treatment indicator for the ith patient,

δi = 0 for placebo subjects

= 1 for progabide subjects.

Following Thall and Vail (1990), for subject i, let ai be the logarithm of age and bi be the logarithm of the number of baseline seizures experienced in the eight-week pre-treatment period divided by 4 (to place the baseline count on the same basis as the post-treatment counts taken at two-week intervals).


Summary of the means at each time point post-treatment shows that they basically do not change until the last visit period. Accordingly, we consider the model that Thall and Vail did that accommodates this. Define t4ij = 0 if observation j for subject i is prior to the 4th and last visit and t4ij = 1 if observation j is the last one (4th visit). We define xi = (bi, ai, δi, t4i1,...,t4ini)T.

A popular model in these circumstances for the mean number of seizures would be a loglinear model.

Consider the particular such model where the conditional mean for Yij is taken to depend only on xij = (bi, ai, δi, t4ij)T, given by

E(Yij|xi) = exp(β1 + β2bi + β3ai + β4δi + β5t4ij + β6biδi) = f(xij, β), β = (β1,...,β6)T. (14.45)

A natural model for the variance is

var(Yij|xi) = σ2 f(xij, β), (14.46)

where σ2 allows for the possibility of overdispersion.
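For concreteness, (14.45) and (14.46) are easy to code directly. The sketch below (function names hypothetical) evaluates the loglinear mean and the overdispersed Poisson-type variance for a single observation with covariates (logbase, logage, trt, visit4):

```python
import numpy as np

def f_mean(xij, beta):
    # Loglinear mean model (14.45); xij = (b_i, a_i, delta_i, t4_ij)
    b, a, d, t4 = xij
    eta = (beta[0] + beta[1]*b + beta[2]*a + beta[3]*d
           + beta[4]*t4 + beta[5]*b*d)
    return np.exp(eta)

def var_yij(xij, beta, sigma2):
    # Variance model (14.46): sigma^2 times the mean, so sigma^2 > 1
    # represents overdispersion relative to the Poisson
    return sigma2 * f_mean(xij, beta)
```

Note that the last term of the linear predictor is the baseline-by-treatment interaction b_i delta_i, so the effect of baseline seizure rate on the mean is allowed to differ between the two treatment groups.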

In the following programs, for illustration, we consider three “working” correlation assumptions:

• Completely unstructured, so that the correlation pattern is assumed to have no particular form

• Compound symmetry or exchangeability

• Autoregressive of order 1 (as the visits are equally-spaced in time).
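Each of these working assumptions corresponds to a simple matrix form for Γi(α, xi). As a sketch (hypothetical helper; proc genmod and gee() construct these matrices internally from the TYPE= and corstr= specifications):

```python
import numpy as np

def working_corr(n, kind, alpha=None):
    """Working correlation matrix Gamma_i for n equally-spaced observations.

    kind = 'cs'  : compound symmetry/exchangeable, common correlation alpha
    kind = 'ar1' : AR(1), correlation alpha**|j-k| between visits j and k
    kind = 'un'  : unstructured; alpha is the full user-supplied matrix
    """
    if kind == "cs":
        return (1.0 - alpha) * np.eye(n) + alpha * np.ones((n, n))
    if kind == "ar1":
        j, k = np.indices((n, n))
        return alpha ** np.abs(j - k).astype(float)
    if kind == "un":
        return np.asarray(alpha, dtype=float)
    raise ValueError("unknown working correlation type: %s" % kind)
```

For example, an AR(1) lag-1 estimate of 0.4661, as in the fits below, implies a lag-2 correlation of 0.4661^2, approximately 0.217, which is what appears in the printed working correlation matrix.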

Full information on SAS proc genmod and proc glimmix is available in the SAS/STAT documentation, and a discussion of the nlinmix macro may be found in Chapter 15 of Littell et al. (2006); some minimal documentation and examples for nlinmix are also given on the Technical Support pages at http://www.sas.com. Less detailed information on the gee() function is available in R through the help() function (i.e., type help(gee) for a basic accounting of its syntax and operation).

The following programs do not have accompanying detailed descriptions, but they are fairly well documented internally. See the above references for more details and further options. It is important to note that proc genmod, gee(), and proc glimmix (as well as the glimmix macro, which we do not demonstrate here) impose limitations on the form of the mean model, as it must be specified through the standard "link functions" used in popular generalized linear models. This requires that the mean be a function of β through a linear predictor. Similarly, the form of the variance function is restricted to one of the standard generalized linear model variance functions, up to a scale factor. There are ways around these restrictions, but they require specialized programming on the part of the user. The nlinmix macro allows more general mean and variance functions to be specified by the user.


We discuss the glimmix and nlinmix macros and proc glimmix more in Chapter 15.

PROGRAM 14.1 Implementing the linear estimating equation with simple moment methods using SAS proc genmod.

The following program calls proc genmod several times, the first three calls fitting the mean-variance model in (14.45) and (14.46) with the three different “working” correlation models noted above.

PROGRAM STATEMENTS:

/******************************************************************
  Fit a loglinear regression model to the epileptic seizure data
  first reported in a paper by Thall and Vail (1990). These are
  count data, so we use the Poisson mean/variance assumptions.
  This model is fitted with different working correlation matrices.
******************************************************************/

options ls=80 ps=59 nodate; run;

/******************************************************************
  The data look like (first 8 records on first 2 subjects)

  104 5 1 0 11 31
  104 3 2 0 11 31
  104 3 3 0 11 31
  104 3 4 0 11 31
  106 3 1 0 11 30
  106 5 2 0 11 30
  106 3 3 0 11 30
  106 3 4 0 11 30
  .
  .
  .

  column 1  subject
  column 2  number of seizures
  column 3  visit (1--4 biweekly visits)
  column 4  =0 if placebo, =1 if progabide
  column 5  baseline number of seizures in 8 weeks prior to study
  column 6  age
******************************************************************/

data seizure; infile 'seize.dat';
  input subject seize visit trt base age;
run;

/*****************************************************************
  Fit the loglinear regression model using PROC GENMOD and three
  different working correlation matrix assumptions:

  - unstructured
  - compound symmetry (exchangeable)
  - AR(1)

  Subject 207 has what appear to be very unusual data -- for this
  subject, both baseline and study-period numbers of seizures are
  huge, much larger than any other subject. We leave this subject
  in for our analysis; in some published analyses, this subject is
  deleted. See Diggle, Heagerty, Liang, and Zeger (2002) and Thall
  and Vail (1990) for more on this subject.

  We fit a model exactly like the one in Thall and Vail (1990).
  Define

  - logbase = log(base/4)
  - logage  = log(age)
  - trt

  Here, visit4 is an indicator of the last visit and basetrt is the
  "interaction" term for logbase and trt, allowing the possibility
  that how baseline seizures affect the response may be different
  for treated and placebo subjects. Visit4 allows that the mean


  response changed over the study period, but not gradually;
  rather, the change only showed up toward the end of the study.

  The DIST=POISSON option in the model statement specifies that
  the Poisson requirement that mean = variance be used. The
  LINK=LOG option asks for the loglinear model. Other LINK=
  choices are available.

  The REPEATED statement specifies the "working" correlation
  structure to be assumed. The CORRW option in the REPEATED
  statement prints out the estimated working correlation matrix
  under the assumption given in the TYPE= option. The MODELSE
  option specifies that the standard errors printed for the
  elements of betahat are based on assuming the correlation matrix
  is correctly specified. By default, the ones based on the
  "robust" version of the covariance matrix are printed, too.

  The dispersion parameter is estimated rather than being held
  fixed at 1 -- this allows for the possibility of
  "overdispersion".

  The V6CORR option asks that the estimator be computed using the
  method discussed in the notes.
******************************************************************/

data seizure; set seizure;
*  if subject=207 then delete;
  logbase=log(base/4);
  logage=log(age);
  basetrt=logbase*trt;
  if visit<4 then visit4=0;
  if visit=4 then visit4=1;
run;

title "UNSTRUCTURED CORRELATION";
proc genmod data=seizure;
  class subject;
  model seize = logbase logage trt visit4 basetrt / dist=poisson link=log;
  repeated subject=subject / type=un corrw modelse v6corr;
run;

title "EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION";
proc genmod data=seizure;
  class subject;
  model seize = logbase logage trt visit4 basetrt / dist=poisson link=log;
  repeated subject=subject / type=cs corrw modelse v6corr;
run;

title "AR(1) CORRELATION";
proc genmod data=seizure;
  class subject;
  model seize = logbase logage trt visit4 basetrt / dist=poisson link=log;
  repeated subject=subject / type=ar(1) corrw modelse v6corr;
run;

OUTPUT:

UNSTRUCTURED CORRELATION 1 The GENMOD Procedure Model Information Data Set WORK.SEIZURE Distribution Poisson Link Function Log Dependent Variable seize

Number of Observations Read 236 Number of Observations Used 236

Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 111 112 113 114 116 117 118 121 122 123 124 126 128 129 130 135 137 139 141 143 145 147 201 202 203 204 205 206 207 208 209 210 211 213 214 215 217 218 219 220 221 222 225 226 227 228 230 232 234 236 238


Parameter Information Parameter Effect Prm1 Intercept Prm2 logbase Prm3 logage Prm4 trt Prm5 visit4 Prm6 basetrt

Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 230 869.3236 3.7797 Scaled Deviance 230 869.3236 3.7797 Pearson Chi-Square 230 1014.7112 4.4118 Scaled Pearson X2 230 1014.7112 4.4118 Log Likelihood 2994.1327

Algorithm converged.

UNSTRUCTURED CORRELATION 2 The GENMOD Procedure Analysis Of Initial Parameter Estimates Standard Wald 95% Chi- Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq Intercept 1 -2.7576 0.4075 -3.5562 -1.9590 45.80 <.0001 logbase 1 0.9495 0.0436 0.8641 1.0349 475.11 <.0001 logage 1 0.8971 0.1164 0.6688 1.1253 59.35 <.0001 trt 1 -1.3411 0.1567 -1.6483 -1.0339 73.21 <.0001 visit4 1 -0.1611 0.0546 -0.2681 -0.0541 8.71 0.0032 basetrt 1 0.5622 0.0635 0.4378 0.6867 78.40 <.0001 Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed.

GEE Model Information Correlation Structure Unstructured Subject Effect subject (59 levels) Number of Clusters 59 Correlation Matrix Dimension 4 Maximum Cluster Size 4 Minimum Cluster Size 4

Algorithm converged.

Working Correlation Matrix Col1 Col2 Col3 Col4 Row1 1.0000 0.2846 0.2608 0.1564 Row2 0.2846 1.0000 0.6520 0.3480 Row3 0.2608 0.6520 1.0000 0.4618 Row4 0.1564 0.3480 0.4618 1.0000

Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| Intercept -3.0705 0.9449 -4.9225 -1.2184 -3.25 0.0012 logbase 0.9371 0.0929 0.7550 1.1192 10.09 <.0001 logage 1.0009 0.2737 0.4644 1.5374 3.66 0.0003 trt -1.4974 0.4223 -2.3251 -0.6697 -3.55 0.0004 visit4 -0.1560 0.0783 -0.3095 -0.0025 -1.99 0.0464 basetrt 0.6281 0.1699 0.2952 0.9610 3.70 0.0002 UNSTRUCTURED CORRELATION 3 The GENMOD Procedure Analysis Of GEE Parameter Estimates Model-Based Standard Error Estimates Standard 95% Confidence


Parameter Estimate Error Limits Z Pr > |Z| Intercept -3.0705 1.2019 -5.4261 -0.7148 -2.55 0.0106 logbase 0.9371 0.1275 0.6873 1.1869 7.35 <.0001 logage 1.0009 0.3433 0.3281 1.6737 2.92 0.0035 trt -1.4974 0.4617 -2.4023 -0.5925 -3.24 0.0012 visit4 -0.1560 0.0938 -0.3398 0.0278 -1.66 0.0963 basetrt 0.6281 0.1866 0.2624 0.9937 3.37 0.0008 Scale 2.0836 . . . . . NOTE: The scale parameter for GEE estimation was computed as the square root of the normalized Pearson’s chi-square. EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION 4 The GENMOD Procedure Model Information Data Set WORK.SEIZURE Distribution Poisson Link Function Log Dependent Variable seize

Number of Observations Read 236 Number of Observations Used 236

Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 111 112 113 114 116 117 118 121 122 123 124 126 128 129 130 135 137 139 141 143 145 147 201 202 203 204 205 206 207 208 209 210 211 213 214 215 217 218 219 220 221 222 225 226 227 228 230 232 234 236 238

Parameter Information Parameter Effect Prm1 Intercept Prm2 logbase Prm3 logage Prm4 trt Prm5 visit4 Prm6 basetrt

Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 230 869.3236 3.7797 Scaled Deviance 230 869.3236 3.7797 Pearson Chi-Square 230 1014.7112 4.4118 Scaled Pearson X2 230 1014.7112 4.4118 Log Likelihood 2994.1327

Algorithm converged.

EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION 5 The GENMOD Procedure Analysis Of Initial Parameter Estimates Standard Wald 95% Chi- Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq Intercept 1 -2.7576 0.4075 -3.5562 -1.9590 45.80 <.0001 logbase 1 0.9495 0.0436 0.8641 1.0349 475.11 <.0001 logage 1 0.8971 0.1164 0.6688 1.1253 59.35 <.0001 trt 1 -1.3411 0.1567 -1.6483 -1.0339 73.21 <.0001 visit4 1 -0.1611 0.0546 -0.2681 -0.0541 8.71 0.0032 basetrt 1 0.5622 0.0635 0.4378 0.6867 78.40 <.0001 Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed.

GEE Model Information Correlation Structure Exchangeable Subject Effect subject (59 levels) Number of Clusters 59


Correlation Matrix Dimension 4 Maximum Cluster Size 4 Minimum Cluster Size 4

Algorithm converged.

Working Correlation Matrix Col1 Col2 Col3 Col4 Row1 1.0000 0.3582 0.3582 0.3582 Row2 0.3582 1.0000 0.3582 0.3582 Row3 0.3582 0.3582 1.0000 0.3582 Row4 0.3582 0.3582 0.3582 1.0000

Exchangeable Working Correlation Correlation 0.3582063731

Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| Intercept -2.7939 0.9561 -4.6678 -0.9199 -2.92 0.0035 logbase 0.9504 0.0987 0.7569 1.1439 9.63 <.0001 logage 0.9066 0.2772 0.3633 1.4499 3.27 0.0011 trt -1.3386 0.4296 -2.1805 -0.4967 -3.12 0.0018 EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION 6 The GENMOD Procedure Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| visit4 -0.1611 0.0656 -0.2896 -0.0325 -2.46 0.0140 basetrt 0.5633 0.1749 0.2205 0.9060 3.22 0.0013

Analysis Of GEE Parameter Estimates Model-Based Standard Error Estimates Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| Intercept -2.7939 1.2158 -5.1767 -0.4110 -2.30 0.0216 logbase 0.9504 0.1301 0.6954 1.2055 7.30 <.0001 logage 0.9066 0.3475 0.2256 1.5876 2.61 0.0091 trt -1.3386 0.4678 -2.2554 -0.4218 -2.86 0.0042 visit4 -0.1611 0.0908 -0.3391 0.0169 -1.77 0.0761 basetrt 0.5633 0.1895 0.1919 0.9347 2.97 0.0030 Scale 2.0742 . . . . . NOTE: The scale parameter for GEE estimation was computed as the square root of the normalized Pearson’s chi-square. AR(1) CORRELATION 7 The GENMOD Procedure Model Information Data Set WORK.SEIZURE Distribution Poisson Link Function Log Dependent Variable seize

Number of Observations Read 236 Number of Observations Used 236

Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 111 112 113 114 116 117 118 121 122 123 124 126 128 129 130 135 137 139 141 143 145 147 201 202 203 204 205 206 207 208 209 210 211 213 214 215 217 218 219 220 221 222 225 226 227 228 230 232 234 236 238


Parameter Information Parameter Effect Prm1 Intercept Prm2 logbase Prm3 logage Prm4 trt Prm5 visit4 Prm6 basetrt

Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 230 869.3236 3.7797 Scaled Deviance 230 869.3236 3.7797 Pearson Chi-Square 230 1014.7112 4.4118 Scaled Pearson X2 230 1014.7112 4.4118 Log Likelihood 2994.1327

Algorithm converged. AR(1) CORRELATION 8 The GENMOD Procedure Analysis Of Initial Parameter Estimates Standard Wald 95% Chi- Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq Intercept 1 -2.7576 0.4075 -3.5562 -1.9590 45.80 <.0001 logbase 1 0.9495 0.0436 0.8641 1.0349 475.11 <.0001 logage 1 0.8971 0.1164 0.6688 1.1253 59.35 <.0001 trt 1 -1.3411 0.1567 -1.6483 -1.0339 73.21 <.0001 visit4 1 -0.1611 0.0546 -0.2681 -0.0541 8.71 0.0032 basetrt 1 0.5622 0.0635 0.4378 0.6867 78.40 <.0001 Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed.

GEE Model Information Correlation Structure AR(1) Subject Effect subject (59 levels) Number of Clusters 59 Correlation Matrix Dimension 4 Maximum Cluster Size 4 Minimum Cluster Size 4

Algorithm converged.

Working Correlation Matrix Col1 Col2 Col3 Col4 Row1 1.0000 0.4661 0.2173 0.1013 Row2 0.4661 1.0000 0.4661 0.2173 Row3 0.2173 0.4661 1.0000 0.4661 Row4 0.1013 0.2173 0.4661 1.0000

Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| Intercept -3.0526 0.9385 -4.8920 -1.2132 -3.25 0.0011 logbase 0.9445 0.0928 0.7627 1.1263 10.18 <.0001 logage 0.9908 0.2737 0.4543 1.5272 3.62 0.0003 trt -1.4828 0.4168 -2.2998 -0.6659 -3.56 0.0004 visit4 -0.1552 0.0892 -0.3301 0.0197 -1.74 0.0820 basetrt 0.6193 0.1692 0.2876 0.9509 3.66 0.0003 AR(1) CORRELATION 9 The GENMOD Procedure Analysis Of GEE Parameter Estimates Model-Based Standard Error Estimates


Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z| Intercept -3.0526 1.1818 -5.3690 -0.7362 -2.58 0.0098 logbase 0.9445 0.1254 0.6988 1.1902 7.53 <.0001 logage 0.9908 0.3375 0.3292 1.6523 2.94 0.0033 trt -1.4828 0.4545 -2.3737 -0.5920 -3.26 0.0011 visit4 -0.1552 0.0950 -0.3414 0.0311 -1.63 0.1025 basetrt 0.6193 0.1835 0.2595 0.9790 3.37 0.0007 Scale 2.0854 . . . . . NOTE: The scale parameter for GEE estimation was computed as the square root of the normalized Pearson’s chi-square.
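The "Scale" value reported at the end of each of these fits can be reproduced from the note in the output: it is the square root of the Pearson chi-square statistic divided by its degrees of freedom. A sketch (hypothetical helper name), for the Poisson-type variance function used here:

```python
import numpy as np

def genmod_scale(y, mu, p):
    """Square root of the normalized Pearson chi-square, as described
    in the proc genmod output note; variance function = mean here."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    pearson = np.sum((y - mu) ** 2 / mu)
    return np.sqrt(pearson / (y.size - p))
```

For orientation, the initial (independence) fit above has Pearson chi-square 1014.7112 on 230 degrees of freedom, giving a scale of about 2.10; the GEE fits report slightly different values (about 2.07 to 2.09) because they are computed from residuals at the GEE estimates of β.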

PROGRAM 14.2 Implementing the linear estimating equation with simple moment methods using R gee().

The following program calls the function gee() three times to fit the mean-variance model in (14.45) and (14.46) with the three different "working" correlation models noted above. Note that there are some slight differences in the results from those obtained from SAS proc genmod, even though the two programs are supposed to be carrying out the same calculations. These are likely due to differences in the two implementations.

PROGRAM STATEMENTS:

# Fit gee with linear estimating equation for beta and
# moment methods as in Liang and Zeger (1986) for correlation
# parameters using the R function gee()

# load the gee library
library(gee)

# output data set
outfile <- "gee_seizure.Rout"

# read in the data set
thedata <- matrix(scan("seize.dat"),ncol=6,byrow=T)
subj <- thedata[,1]
seize <- thedata[,2]
visit <- thedata[,3]
trt <- thedata[,4]
base <- thedata[,5]
age <- thedata[,6]

# log transform baseline and age
logbase <- log(base/4)
logage <- log(age)

# other variables that could be used in more complicated models
basetrt <- logbase*trt
visit4 <- as.numeric(visit==4)

# calls to gee() function
unfit <- gee(seize ~ logbase+logage+trt+visit4+basetrt,id=subj,family=poisson,
   corstr="unstructured")
cat("UNSTRUCTURED WORKING CORRELATION","\n","\n",file=outfile,append=F)
sink(file=outfile,append=T)
print(summary(unfit))
sink()
cat("\n","\n",file=outfile,append=T)

csfit <- gee(seize ~ logbase+logage+trt+visit4+basetrt,id=subj,family=poisson,
   corstr="exchangeable")
cat("EXCHANGEABLE WORKING CORRELATION","\n","\n",file=outfile,append=T)

sink(file=outfile,append=T)
print(summary(csfit))
sink()
cat("\n","\n",file=outfile,append=T)

# note demonstration of use of user-supplied starting values here
ar1fit <- gee(seize ~ logbase+logage+trt+visit4+basetrt,id=subj,family=poisson,
   corstr="AR-M",Mv=1,b=c(-2.0,1,1,-1,-0.1,0.5))
cat("AR(1) WORKING CORRELATION","\n","\n",file=outfile,append=T)
sink(file=outfile,append=T)
print(summary(ar1fit))
sink()

OUTPUT:

UNSTRUCTURED WORKING CORRELATION

GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-function, version 4.13 modified 98/01/27 (1998) Model: Link: Logarithm Variance to Mean Relation: Poisson Correlation Structure: Unstructured Call: gee(formula = seize ~ logbase + logage + trt + visit4 + basetrt, id = subj, family = poisson, corstr = "unstructured") Summary of Residuals: Min 1Q Median 3Q Max -14.1473158 -2.9156034 -0.5648066 1.6966299 59.7187817

Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) -3.0703916 1.21744843 -2.521989 0.94492946 -3.249334 logbase 0.9370925 0.12911640 7.257734 0.09289341 10.087825 logage 1.0008969 0.34772442 2.878420 0.27372135 3.656627 trt -1.4973787 0.46765660 -3.201877 0.42230224 -3.545751 visit4 -0.1559873 0.09499776 -1.642010 0.07831289 -1.991847 basetrt 0.6280492 0.18899638 3.323076 0.16985140 3.697639 Estimated Scale Parameter: 4.45469 Number of Iterations: 4 Working Correlation [,1] [,2] [,3] [,4] [1,] 1.0000000 0.2846098 0.2608401 0.1563989 [2,] 0.2846098 1.0000000 0.6519718 0.3479656 [3,] 0.2608401 0.6519718 1.0000000 0.4617796 [4,] 0.1563989 0.3479656 0.4617796 1.0000000

EXCHANGEABLE WORKING CORRELATION

GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-function, version 4.13 modified 98/01/27 (1998) Model: Link: Logarithm Variance to Mean Relation: Poisson Correlation Structure: Exchangeable Call: gee(formula = seize ~ logbase + logage + trt + visit4 + basetrt, id = subj, family = poisson, corstr = "exchangeable") Summary of Residuals: Min 1Q Median 3Q Max -14.2804674 -2.9024753 -0.4940558 1.7757319 59.8699344

Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) -2.7933674 1.22877062 -2.273303 0.95596337 -2.922044 logbase 0.9504074 0.13152324 7.226156 0.09869915 9.629338 logage 0.9064447 0.35118728 2.581086 0.27715935 3.270482 trt -1.3386276 0.47277353 -2.831435 0.42951137 -3.116629 visit4 -0.1610871 0.09219968 -1.747155 0.06558188 -2.456275 basetrt 0.5632558 0.19152488 2.940902 0.17485489 3.221276


Estimated Scale Parameter: 4.414352 Number of Iterations: 2 Working Correlation [,1] [,2] [,3] [,4] [1,] 1.0000000 0.3551212 0.3551212 0.3551212 [2,] 0.3551212 1.0000000 0.3551212 0.3551212 [3,] 0.3551212 0.3551212 1.0000000 0.3551212 [4,] 0.3551212 0.3551212 0.3551212 1.0000000

AR(1) WORKING CORRELATION

GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA gee S-function, version 4.13 modified 98/01/27 (1998) Model: Link: Logarithm Variance to Mean Relation: Poisson Correlation Structure: AR-M , M = 1 Call: gee(formula = seize ~ logbase + logage + trt + visit4 + basetrt, id = subj, b = c(-2, 1, 1, -1, -0.1, 0.5), family = poisson, corstr = "AR-M", Mv = 1) Summary of Residuals: Min 1Q Median 3Q Max -14.1147621 -2.9201518 -0.5479669 1.6786673 59.6736888

Coefficients: Estimate Naive S.E. Naive z Robust S.E. Robust z (Intercept) -3.0525967 1.19717709 -2.549829 0.93848186 -3.252697 logbase 0.9444762 0.12699946 7.436852 0.09276876 10.180973 logage 0.9907881 0.34191424 2.897768 0.27370217 3.619950 trt -1.4828350 0.46042681 -3.220566 0.41682111 -3.557485 visit4 -0.1551820 0.09626332 -1.612057 0.08922172 -1.739285 basetrt 0.6192620 0.18592401 3.330727 0.16921026 3.659719 Estimated Scale Parameter: 4.462371 Number of Iterations: 5 Working Correlation [,1] [,2] [,3] [,4] [1,] 1.0000000 0.4661711 0.2173155 0.1013062 [2,] 0.4661711 1.0000000 0.4661711 0.2173155 [3,] 0.2173155 0.4661711 1.0000000 0.4661711 [4,] 0.1013062 0.2173155 0.4661711 1.0000000
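The "Naive" and "Robust" standard errors in this output correspond to the model-based and the empirical (sandwich) covariance estimates for the estimator of β. A sketch of the sandwich computation (hypothetical helper names; quantities are evaluated at the final estimates):

```python
import numpy as np

def sandwich_cov(D_list, Vinv_list, resid_list):
    """Empirical (robust) covariance estimate A^{-1} B A^{-1}, where
    A = sum_i D_i' Vinv_i D_i            (model-based, "naive" piece)
    B = sum_i D_i' Vinv_i r_i r_i' Vinv_i D_i,  r_i = y_i - f_i(betahat)."""
    p = D_list[0].shape[1]
    A = np.zeros((p, p))
    B = np.zeros((p, p))
    for D, W, r in zip(D_list, Vinv_list, resid_list):
        U = D.T @ W          # (p, n_i)
        A += U @ D
        u = U @ r            # (p,)
        B += np.outer(u, u)
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv
```

When the working covariance model is correctly specified the two versions agree asymptotically; the robust version remains valid when the working correlation is wrong, which is why the "Robust S.E." column is usually the one reported.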

PROGRAM 14.3 Implementing the linear estimating equation with the quadratic estimating equation for the correlation parameters using SAS macro nlinmix.

The following program invokes the macro nlinmix to implement the three "working" correlation assumptions. We use the most recent version of the macro available on the SAS technical support web site. Comparison of the results to those from Programs 14.1 and 14.2 shows that use of the more sophisticated quadratic estimating equation with the Gaussian working assumption to estimate the correlation parameters in each case yields similar, but different, estimates for β.
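The "Gaussian working assumption" means that the covariance parameters are updated by optimizing a normal-theory objective in the residuals rather than by simple moment formulas. As a sketch (hypothetical helper name), the objective minimized over the covariance parameters at each iteration, with the residuals held fixed, is of the form:

```python
import numpy as np

def neg2_gaussian_objective(resid_list, V_list):
    """-2 x Gaussian log-likelihood (up to an additive constant) of the
    residuals r_i = y_i - f_i(betahat) under working covariances V_i:
    sum_i { log det V_i + r_i' V_i^{-1} r_i }."""
    total = 0.0
    for r, V in zip(resid_list, V_list):
        sign, logdet = np.linalg.slogdet(V)
        total += logdet + r @ np.linalg.solve(V, r)
    return total
```

In the nlinmix implementation, proc mixed carries out this minimization for the working covariance structure named on the repeated statement, using the residuals from the current linearization.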


The nlinmix macro requires starting values for β, unlike proc genmod and gee(), which use standard methods for deriving starting values in generalized linear models to obtain these internally, as they restrict attention to such models. Starting values here may be obtained in the usual ways we have discussed for nonlinear models, perhaps by “pooling” the data from all m units together and treating them as mutually independent. For mean-variance models of the generalized linear model type, starting values can be obtained from a preliminary fit (e.g., using proc genmod) of the model assuming all observations are independent, which requires no starting values from the user.
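As a sketch of the preliminary independence fit just described, here is a standard Poisson iteratively reweighted least squares (IRLS) computation treating all observations as mutually independent (hypothetical helper name; proc genmod performs an equivalent calculation internally):

```python
import numpy as np

def poisson_irls(X, y, maxit=100, tol=1e-10):
    """Poisson regression with log link, all observations treated as
    independent; the resulting fit supplies starting values for beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(maxit):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        W = mu                           # IRLS weights: var = mean
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

For an intercept-only model this converges to the log of the sample mean, which is a quick way to verify the iteration.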

The user should be aware that the .log file generated when SAS is run contains a log of the intermediate calculations over each iteration. This may be examined in the event convergence is difficult to achieve. For brevity, this file is not reproduced for the example here.

PROGRAM STATEMENTS:

/* Fit gee with linear estimating equation for beta and quadratic equation for covariance parameters with gaussian working assumption using macro NLINMIX */

/* This program uses the newest version of the macro available
   on the sas web site */

options ps=59 ls=80 nodate; run;

/* Include the SAS macro code -- the user should of course
   substitute the appropriate path */

%inc "nlinmix_macro.sas";

/* Read in the data -- Seizure data from Thall and Vail (1990)

   The data look like (first 8 records on first 2 subjects)

   104 5 1 0 11 31
   104 3 2 0 11 31
   104 3 3 0 11 31
   104 3 4 0 11 31
   106 3 1 0 11 30
   106 5 2 0 11 30
   106 3 3 0 11 30
   106 3 4 0 11 30
   .
   .
   .

   column 1  subject
   column 2  number of seizures
   column 3  visit (1--4 biweekly visits)
   column 4  =0 if placebo, =1 if progabide
   column 5  baseline number of seizures in 8 weeks prior to study
   column 6  age
*/

data seizure; infile 'seize.dat';
  input subject seize visit trt base age;
run;

data seizure; set seizure;
*  if subject=207 then delete;
  logbase=log(base/4);
  logage=log(age);
  basetrt=logbase*trt;
  if visit<4 then visit4=0;
  if visit=4 then visit4=1;
run;

title "Seizure data of Thall and Vail (1990)";
/*proc print; run;*/


/* Invoke the NLINMIX macro

   - use the variable name "predv" to define the mean function
   - if you do not wish to include analytical derivatives (d_b1
     through d_b6 here) SAS will calculate them automatically
   - the parms statement gives starting values
   - the stmts portion basically contains elements of a call to
     proc mixed with the linearized version of the mean model. We
     use the repeated statement to specify the working correlation
     structure
   - the procopt statement allows us to specify "empirical" which
     will produce the robust sandwich standard errors
*/

title2 "Unstructured working correlation";
%nlinmix(data=seizure,
  model=%str(
    predv = exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
  ),
  derivs=%str(
    d_b1 = exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b2 = logbase*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b3 = logage*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b4 = trt*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b5 = visit4*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b6 = basetrt*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    wt=1/predv;
  ),
  parms=%str(b1=-2.0 b2=1 b3=1 b4=-1 b5=-0.1 b6=0.5),
  stmts=%str(
    class subject;
    model pseudo_seize = d_b1 d_b2 d_b3 d_b4 d_b5 d_b6 / noint notest solution;
    repeated / subject=subject type=un rcorr;
    weight wt;
  ),
  expand=zero,
  procopt=%str(empirical method=ml)
)

title2 "Exchangeable working correlation";
%nlinmix(data=seizure,
  model=%str(
    predv = exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
  ),
  derivs=%str(
    d_b1 = exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b2 = logbase*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b3 = logage*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b4 = trt*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b5 = visit4*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b6 = basetrt*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    wt=1/predv;
  ),
  parms=%str(b1=-2.0 b2=1 b3=1 b4=-1 b5=-0.1 b6=0.5),
  stmts=%str(
    class subject;
    model pseudo_seize = d_b1 d_b2 d_b3 d_b4 d_b5 d_b6 / noint notest solution;
    repeated / subject=subject type=cs rcorr;
    weight wt;
  ),
  expand=zero,
  procopt=%str(empirical method=ml)
)

title2 "AR(1) working correlation matrix";

%nlinmix(data=seizure,
  model=%str(
    predv = exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
  ),
  derivs=%str(
    d_b1 = exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b2 = logbase*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b3 = logage*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b4 = trt*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b5 = visit4*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    d_b6 = basetrt*exp(b1+b2*logbase+b3*logage+b4*trt+b5*visit4+b6*basetrt);
    wt=1/predv;
  ),
  parms=%str(b1=-2.0 b2=1 b3=1 b4=-1 b5=-0.1 b6=0.5),
  stmts=%str(
    class subject;
    model pseudo_seize = d_b1 d_b2 d_b3 d_b4 d_b5 d_b6 / noint notest solution;
    repeated / subject=subject type=ar(1) rcorr;
    weight wt;
  ),
  expand=zero,
  procopt=%str(empirical method=ml)
)

OUTPUT:

Seizure data of Thall and Vail (1990) 1 Unstructured working correlation The Mixed Procedure Model Information Data Set WORK._NLINMIX Dependent Variable pseudo_seize Weight Variable wt Covariance Structure Unstructured Subject Effect subject Estimation Method ML Residual Variance Method None Fixed Effects SE Method Empirical Degrees of Freedom Method Between-Within

Class Level Information Class Levels Values subject 59 101 102 103 104 106 107 108 110 111 112 113 114 116 117 118 121 122 123 124 126 128 129 130 135 137 139 141 143 145 147 201 202 203 204 205 206 207 208 209 210 211 213 214 215 217 218 219 220 221 222 225 226 227 228 230 232 234 236 238

Dimensions Covariance Parameters 10 Columns in X 6 Columns in Z 0 Subjects 59 Max Obs Per Subject 4

Number of Observations Number of Observations Read 236 Number of Observations Used 236 Number of Observations Not Used 0

Iteration History Iteration Evaluations -2 Log Like Criterion 0 1 1412.07968151 1 2 1342.07974235 0.00000424 2 1 1342.07780093 0.00000000 Seizure data of Thall and Vail (1990) 2 Unstructured working correlation The Mixed Procedure Convergence criteria met.

Estimated R Correlation Matrix for subject 101/Weighted by wt


Row Col1 Col2 Col3 Col4 1 1.0000 0.3250 0.2295 0.2584 2 0.3250 1.0000 0.5006 0.4962 3 0.2295 0.5006 1.0000 0.4947 4 0.2584 0.4962 0.4947 1.0000

Covariance Parameter Estimates Cov Parm Subject Estimate UN(1,1) subject 3.1138 UN(2,1) subject 1.2302 UN(2,2) subject 4.6019 UN(3,1) subject 1.1119 UN(3,2) subject 2.9477 UN(3,3) subject 7.5349 UN(4,1) subject 0.6923 UN(4,2) subject 1.6164 UN(4,3) subject 2.0618 UN(4,4) subject 2.3058

Fit Statistics -2 Log Likelihood 1342.1 AIC (smaller is better) 1374.1 AICC (smaller is better) 1376.6 BIC (smaller is better) 1407.3

Null Model Likelihood Ratio Test DF Chi-Square Pr > ChiSq 9 70.00 <.0001

Solution for Fixed Effects Standard Effect Estimate Error DF t Value Pr > |t| d_b1 -3.1773 0.9421 59 -3.37 0.0013 d_b2 0.9219 0.08483 59 10.87 <.0001 d_b3 1.0506 0.2775 59 3.79 0.0004 Seizure data of Thall and Vail (1990) 3 Unstructured working correlation

The Mixed Procedure Solution for Fixed Effects Standard Effect Estimate Error DF t Value Pr > |t| d_b4 -1.6754 0.4193 59 -4.00 0.0002 d_b5 -0.1633 0.06303 59 -2.59 0.0120 d_b6 0.6890 0.1693 59 4.07 0.0001 Seizure data of Thall and Vail (1990) 4 Exchangeable working correlation The Mixed Procedure Model Information Data Set WORK._NLINMIX Dependent Variable pseudo_seize Weight Variable wt Covariance Structure Compound Symmetry Subject Effect subject Estimation Method ML Residual Variance Method Profile Fixed Effects SE Method Empirical Degrees of Freedom Method Between-Within

Class Level Information
    Class      Levels    Values
    subject        59    101 102 103 104 106 107 108 110 111 112 113 114 116 117
                         118 121 122 123 124 126 128 129 130 135 137 139 141 143
                         145 147 201 202 203 204 205 206 207 208 209 210 211 213
                         214 215 217 218 219 220 221 222 225 226 227 228 230 232
                         234 236 238

Dimensions
    Covariance Parameters         2
    Columns in X                  6
    Columns in Z                  0
    Subjects                     59
    Max Obs Per Subject           4

Number of Observations
    Number of Observations Read          236
    Number of Observations Used          236
    Number of Observations Not Used        0

Iteration History
    Iteration    Evaluations      -2 Log Like       Criterion
            0              1    1412.13918577
            1              2    1376.70189404      0.00000000

Seizure data of Thall and Vail (1990)                          5
Exchangeable working correlation

The Mixed Procedure
Convergence criteria met.

Estimated R Correlation Matrix for subject 101/Weighted by wt
    Row      Col1      Col2      Col3      Col4
      1    1.0000    0.3582    0.3582    0.3582
      2    0.3582    1.0000    0.3582    0.3582
      3    0.3582    0.3582    1.0000    0.3582
      4    0.3582    0.3582    0.3582    1.0000

Covariance Parameter Estimates
    Cov Parm    Subject    Estimate
    CS          subject      1.5411
    Residual                 2.7611

Fit Statistics
    -2 Log Likelihood            1376.7
    AIC (smaller is better)      1392.7
    AICC (smaller is better)     1393.3
    BIC (smaller is better)      1409.3

Null Model Likelihood Ratio Test
    DF    Chi-Square    Pr > ChiSq
     1         35.44        <.0001

Solution for Fixed Effects
                          Standard
    Effect    Estimate       Error     DF    t Value    Pr > |t|
    d_b1       -2.7939      0.9561    171      -2.92      0.0039
    d_b2        0.9504     0.09873    171       9.63      <.0001
    d_b3        0.9066      0.2772    171       3.27      0.0013
    d_b4       -1.3386      0.4296    171      -3.12      0.0021
    d_b5       -0.1611     0.06558    171      -2.46      0.0150
    d_b6        0.5633      0.1749    171       3.22      0.0015

Seizure data of Thall and Vail (1990)                          6
AR(1) working correlation matrix


The Mixed Procedure

Model Information
    Data Set                     WORK._NLINMIX
    Dependent Variable           pseudo_seize
    Weight Variable              wt
    Covariance Structure         Autoregressive
    Subject Effect               subject
    Estimation Method            ML
    Residual Variance Method     Profile
    Fixed Effects SE Method      Empirical
    Degrees of Freedom Method    Between-Within

Class Level Information
    Class      Levels    Values
    subject        59    101 102 103 104 106 107 108 110 111 112 113 114 116 117
                         118 121 122 123 124 126 128 129 130 135 137 139 141 143
                         145 147 201 202 203 204 205 206 207 208 209 210 211 213
                         214 215 217 218 219 220 221 222 225 226 227 228 230 232
                         234 236 238

Dimensions
    Covariance Parameters         2
    Columns in X                  6
    Columns in Z                  0
    Subjects                     59
    Max Obs Per Subject           4

Number of Observations
    Number of Observations Read          236
    Number of Observations Used          236
    Number of Observations Not Used        0

Iteration History
    Iteration    Evaluations      -2 Log Like       Criterion
            0              1    1411.84577442
            1              2    1378.41267102      0.00001238
            2              1    1378.40681354      0.00000000

Seizure data of Thall and Vail (1990)                          7
AR(1) working correlation matrix

The Mixed Procedure
Convergence criteria met.

Estimated R Correlation Matrix for subject 101/Weighted by wt
    Row      Col1      Col2      Col3      Col4
      1    1.0000    0.3757    0.1411   0.05302
      2    0.3757    1.0000    0.3757    0.1411
      3    0.1411    0.3757    1.0000    0.3757
      4   0.05302    0.1411    0.3757    1.0000

Covariance Parameter Estimates
    Cov Parm    Subject    Estimate
    AR(1)       subject      0.3757
    Residual                 4.2126

Fit Statistics
    -2 Log Likelihood            1378.4
    AIC (smaller is better)      1394.4
    AICC (smaller is better)     1395.0
    BIC (smaller is better)      1411.0


Null Model Likelihood Ratio Test
    DF    Chi-Square    Pr > ChiSq
     1         33.44        <.0001

Solution for Fixed Effects
                          Standard
    Effect    Estimate       Error     DF    t Value    Pr > |t|
    d_b1       -2.9791      0.9375    171      -3.18      0.0018
    d_b2        0.9454     0.09338    171      10.12      <.0001
    d_b3        0.9677      0.2733    171       3.54      0.0005
    d_b4       -1.4493      0.4183    171      -3.46      0.0007
    d_b5       -0.1566     0.08350    171      -1.88      0.0624
    d_b6        0.6057      0.1701    171       3.56      0.0005

PROGRAM 14.4 Implementing the linear estimating equation with the quadratic estimating equation for the correlation parameters using SAS proc glimmix.

The following program calls proc glimmix to implement the three “working” correlation assumptions.

PROGRAM STATEMENTS:

/*****************************************************************
  Fit the loglinear regression model using PROC GLIMMIX and three
  different working correlation matrix assumptions:

  - unstructured
  - compound symmetry (exchangeable)
  - AR(1)

  Subject 207 has what appear to be very unusual data -- for this
  subject, both baseline and study-period numbers of seizures are
  huge, much larger than for any other subject.  We leave this
  subject in for our analysis; in some published analyses, this
  subject is deleted.  See Diggle, Heagerty, Liang, and Zeger (2002)
  and Thall and Vail (1990) for more on this subject.

  We fit a modified model exactly like the one in Thall and Vail
  (1990).  Define

  - logbase = log(base/4)
  - logage = log(age)
  - trt

  Here, visit4 is an indicator of the last visit, and basetrt is the
  "interaction" term for logbase and trt, allowing the possibility
  that how baseline seizures affect the response may be different
  for treated and placebo subjects.  Visit4 allows that the mean
  response changed over the study period, but not gradually; rather,
  the change only showed up toward the end of the study.

  The DIST=POISSON option in the model statement specifies that the
  Poisson requirement that mean = variance be used.  The LINK=LOG
  option asks for the loglinear model.  Other LINK= choices are
  available.

  PROC GLIMMIX is really meant for fitting generalized linear mixed
  effects models, discussed in Chapter 15.  But it can also fit
  marginal models with no random effects, which we are demonstrating
  here.  To carry this out, the syntax for specifying the "working"
  correlation matrix is different from that in PROC GENMOD.  Instead
  of using a REPEATED statement, one uses a RANDOM statement with
  the RESIDUAL option and a CLASS variable that identifies ordered
  time points, as demonstrated below.  If no ordered variable is
  needed, as in the compound symmetric case, one can use the keyword
  _RESIDUAL_ instead.

  The EMPIRICAL option asks for the "robust sandwich" standard
  errors to be computed.
******************************************************************/

data seizure; set seizure;
*  if subject=207 then delete;
  logbase=log(base/4);
  logage=log(age);
  basetrt=logbase*trt;
  if visit<4 then visit4=0;
  if visit=4 then visit4=1;
run;

title "UNSTRUCTURED CORRELATION";
proc glimmix data=seizure empirical;
  class subject visit;
  model seize = logbase logage trt visit4 basetrt / solution
        dist = poisson link = log;
  random visit / subject=subject type=un residual;
run;

title "EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION";
proc glimmix data=seizure empirical;
  class subject visit;
  model seize = logbase logage trt visit4 basetrt / solution
        dist = poisson link = log;
  random _residual_ / subject=subject type=cs;
run;

title "AR(1) CORRELATION";
proc glimmix data=seizure empirical;
  class subject visit;
  model seize = logbase logage trt visit4 basetrt / solution
        dist = poisson link = log;
  random visit / subject=subject type=ar(1) residual;
run;
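The EMPIRICAL option requests the robust "sandwich" covariance matrix for the fixed effects estimator, of the form A⁻¹BA⁻¹ with "bread" A and "meat" B summed over subjects. As a rough sketch of this computation (in Python rather than SAS; the function name `sandwich_cov` and the argument layout are ours, not part of any procedure), with D_i the gradient of the mean for subject i, V_i the working covariance, and r_i the residual vector:

```python
import numpy as np

def sandwich_cov(D_list, Vinv_list, resid_list):
    """Robust 'sandwich' covariance for a GEE-type estimator of beta:
    A^{-1} B A^{-1}, where A = sum_i D_i' V_i^{-1} D_i (the 'bread') and
    B = sum_i (D_i' V_i^{-1} r_i)(D_i' V_i^{-1} r_i)' (the 'meat')."""
    p = D_list[0].shape[1]
    A = np.zeros((p, p))
    B = np.zeros((p, p))
    for D, Vinv, r in zip(D_list, Vinv_list, resid_list):
        A += D.T @ Vinv @ D        # subject i's contribution to the bread
        u = D.T @ Vinv @ r         # subject i's estimating function
        B += np.outer(u, u)        # subject i's contribution to the meat
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv
```

The point of the sandwich form is that B is estimated from the observed per-subject residuals, so the resulting standard errors remain valid even when the working covariance V_i is misspecified.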

OUTPUT:

UNSTRUCTURED CORRELATION                                       1

The GLIMMIX Procedure

Model Information
    Data Set                      WORK.SEIZURE
    Response Variable             seize
    Response Distribution         Poisson
    Link Function                 Log
    Variance Function             Default
    Variance Matrix Blocked By    subject
    Estimation Technique          Residual PL
    Degrees of Freedom Method     Between-Within
    Fixed Effects SE Adjustment   Sandwich - Classical

Class Level Information
    Class      Levels    Values
    subject        59    101 102 103 104 106 107 108 110 111 112 113 114 116 117
                         118 121 122 123 124 126 128 129 130 135 137 139 141 143
                         145 147 201 202 203 204 205 206 207 208 209 210 211 213
                         214 215 217 218 219 220 221 222 225 226 227 228 230 232
                         234 236 238
    visit           4    1 2 3 4

Number of Observations Read         236
Number of Observations Used         236

Dimensions
    R-side Cov. Parameters       10
    Columns in X                  6
    Columns in Z per Subject      0
    Subjects (Blocks in V)       59
    Max Obs per Subject           4

Optimization Information
    Optimization Technique        Dual Quasi-Newton
    Parameters in Optimization    10
    Lower Boundaries              4
    Upper Boundaries              0
    Fixed Effects                 Profiled
    Starting From                 Data


UNSTRUCTURED CORRELATION                                       2

The GLIMMIX Procedure

Iteration History
    Iteration  Restarts  Subiterations    Objective Function      Change    Max Gradient
            0         0            200          -400.5062683  1.97318315         0.01183
            1         0             21          -2.831790507  0.81798083        1.322E-6
            2         0             14          314.85673342  0.67956463        0.003809
            3         0             13           502.5702045  0.43641336        0.000402
            4         0             13           560.9177951  0.15423072        0.000076
            5         0             12          567.30403756  0.02115574        2.582E-6
            6         0             11          567.75133255  0.00189222         0.00058
            7         0             10          567.79921326  0.00019732        0.000421
            8         0              2          567.80490214  0.00001064        0.000062
            9         0              1          567.80485352  0.00000584        0.000084
           10         0              1          567.80484791  0.00000407        0.000038
           11         0              1          567.80490418  0.00000288        0.000066
           12         0              1          567.80487398  0.00000203        0.000027
           13         0              1          567.80492617  0.00000194         0.00007
           14         0              1          567.80491018  0.00000137        0.000023
           15         0              1          567.80494572  0.00000096        0.000055
           16         0              1          567.80493754  0.00000113         0.00002
           17         0              1          567.80496256  0.00000084        0.000052
           18         0              1          567.80495371  0.00000103        0.000017
           19         0              1          567.80497536  0.00000071        0.000044

Did not converge.

Covariance Parameter Estimates
                                        Standard
    Cov Parm    Subject    Estimate       Error
    UN(1,1)     subject      3.2901           .
    UN(2,1)     subject      1.4062           .
    UN(2,2)     subject      4.7779           .
    UN(3,1)     subject      1.2880           .
    UN(3,2)     subject      3.1237           .
    UN(3,3)     subject      7.7110           .
    UN(4,1)     subject      0.8362           .
    UN(4,2)     subject      1.7602           .
    UN(4,3)     subject      2.2056           .
    UN(4,4)     subject      2.4561           .

EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION                   3

The GLIMMIX Procedure

Model Information
    Data Set                      WORK.SEIZURE
    Response Variable             seize
    Response Distribution         Poisson
    Link Function                 Log
    Variance Function             Default
    Variance Matrix Blocked By    subject
    Estimation Technique          Residual PL
    Degrees of Freedom Method     Between-Within
    Fixed Effects SE Adjustment   Sandwich - Classical

Class Level Information
    Class      Levels    Values
    subject        59    101 102 103 104 106 107 108 110 111 112 113 114 116 117
                         118 121 122 123 124 126 128 129 130 135 137 139 141 143
                         145 147 201 202 203 204 205 206 207 208 209 210 211 213
                         214 215 217 218 219 220 221 222 225 226 227 228 230 232
                         234 236 238
    visit           4    1 2 3 4

Number of Observations Read         236
Number of Observations Used         236

Dimensions
    R-side Cov. Parameters        2
    Columns in X                  6
    Columns in Z per Subject      0
    Subjects (Blocks in V)       59
    Max Obs per Subject           4

Optimization Information
    Optimization Technique        Dual Quasi-Newton
    Parameters in Optimization    1
    Lower Boundaries              0
    Upper Boundaries              0
    Fixed Effects                 Profiled
    Residual Variance             Profiled
    Starting From                 Data

EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION                   4

The GLIMMIX Procedure

Iteration History
    Iteration  Restarts  Subiterations    Objective Function      Change    Max Gradient
            0         0              8           -380.084293  1.98333815        8.044E-7
            1         0              3          23.359461727  0.29058491        2.362E-7
            2         0              3          348.55736658  0.29793806        5.187E-7
            3         0              2          537.65162916  0.13664934        0.000121
            4         0              2          590.48253951  0.01386331        3.234E-7
            5         0              2          593.93524845  0.00036166        6.545E-9
            6         0              1          593.95496556  0.00001317        1.445E-8
            7         0              0            593.954995  0.00000000        2.777E-6

Convergence criterion (PCONV=1.11022E-8) satisfied.

Fit Statistics
    -2 Res Log Pseudo-Likelihood     593.95
    Generalized Chi-Square           638.79
    Gener. Chi-Square / DF             2.78

Covariance Parameter Estimates
                                        Standard
    Cov Parm    Subject    Estimate       Error
    CS          subject      1.7430      0.4747
    Residual                 2.7774      0.2961

Solutions for Fixed Effects
                             Standard
    Effect       Estimate       Error     DF    t Value    Pr > |t|
    Intercept     -2.7984      0.9578     54      -2.92      0.0051
    logbase        0.9505     0.09902     54       9.60      <.0001
    logage         0.9078      0.2776     54       3.27      0.0019
    trt           -1.3383      0.4301     54      -3.11      0.0030
    visit4        -0.1611     0.06558    176      -2.46      0.0150
    basetrt        0.5634      0.1750     54       3.22      0.0022

Type III Tests of Fixed Effects
               Num    Den
    Effect      DF     DF    F Value    Pr > F
    logbase      1     54      92.15    <.0001
    logage       1     54      10.70    0.0019
    trt          1     54       9.68    0.0030
    visit4       1    176       6.03    0.0150

EXCHANGEABLE (COMPOUND SYMMETRY) CORRELATION                   5

The GLIMMIX Procedure

Type III Tests of Fixed Effects
               Num    Den
    Effect      DF     DF    F Value    Pr > F
    basetrt      1     54      10.36    0.0022

AR(1) CORRELATION                                              6

The GLIMMIX Procedure


Model Information
    Data Set                      WORK.SEIZURE
    Response Variable             seize
    Response Distribution         Poisson
    Link Function                 Log
    Variance Function             Default
    Variance Matrix Blocked By    subject
    Estimation Technique          Residual PL
    Degrees of Freedom Method     Between-Within
    Fixed Effects SE Adjustment   Sandwich - Classical

Class Level Information
    Class      Levels    Values
    subject        59    101 102 103 104 106 107 108 110 111 112 113 114 116 117
                         118 121 122 123 124 126 128 129 130 135 137 139 141 143
                         145 147 201 202 203 204 205 206 207 208 209 210 211 213
                         214 215 217 218 219 220 221 222 225 226 227 228 230 232
                         234 236 238
    visit           4    1 2 3 4

Number of Observations Read         236
Number of Observations Used         236

Dimensions
    R-side Cov. Parameters        2
    Columns in X                  6
    Columns in Z per Subject      0
    Subjects (Blocks in V)       59
    Max Obs per Subject           4

Optimization Information
    Optimization Technique        Dual Quasi-Newton
    Parameters in Optimization    1
    Lower Boundaries              1
    Upper Boundaries              1
    Fixed Effects                 Profiled
    Residual Variance             Profiled
    Starting From                 Data

AR(1) CORRELATION                                              7

The GLIMMIX Procedure

Iteration History
    Iteration  Restarts  Subiterations    Objective Function      Change    Max Gradient
            0         0              6          -380.0697849  0.84458225            1E-6
            1         0              3          23.534226501  0.44096715         1.82E-6
            2         0              3          349.72039778  0.38294694        1.426E-7
            3         0              2          541.49567685  0.15364034         1.08E-6
            4         0              1          596.67703013  0.01401373        6.843E-6
            5         0              1          600.49102646  0.00028385        8.146E-6
            6         0              1          600.51244016  0.00000326        1.485E-8
            7         0              0          600.51247746  0.00000000        8.225E-7

Convergence criterion (PCONV=1.11022E-8) satisfied.

Fit Statistics
    -2 Res Log Pseudo-Likelihood      600.51
    Generalized Chi-Square           1007.17
    Gener. Chi-Square / DF              4.38

Covariance Parameter Estimates
                                        Standard
    Cov Parm    Subject    Estimate       Error
    AR(1)       subject      0.3936     0.06067
    Residual                 4.3790      0.4533

Solutions for Fixed Effects
                             Standard
    Effect       Estimate       Error     DF    t Value    Pr > |t|
    Intercept     -2.9928      0.9376     54      -3.19      0.0024
    logbase        0.9452     0.09325     54      10.14      <.0001
    logage         0.9721      0.2734     54       3.56      0.0008
    trt           -1.4557      0.4180     54      -3.48      0.0010
    visit4        -0.1564     0.08461    176      -1.85      0.0663
    basetrt        0.6083      0.1699     54       3.58      0.0007

Type III Tests of Fixed Effects
               Num    Den
    Effect      DF     DF    F Value    Pr > F
    logbase      1     54     102.74    <.0001
    logage       1     54      12.65    0.0008
    trt          1     54      12.13    0.0010
    visit4       1    176       3.41    0.0663

AR(1) CORRELATION                                              8

The GLIMMIX Procedure

Type III Tests of Fixed Effects
               Num    Den
    Effect      DF     DF    F Value    Pr > F
    basetrt      1     54      12.81    0.0007

Note that the fit assuming an unstructured working correlation did not converge; apparently, the number of covariance parameters, coupled with the nonlinearity of the mean function, causes difficulty for the optimization in proc glimmix.
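The contrast in the number of parameters each working structure carries is easy to verify. As a quick sketch (in Python rather than SAS; the helper names `n_cov_parms` and `ar1_corr` are ours), with n = 4 observations per subject the unstructured model has n(n+1)/2 = 10 covariance parameters, while compound symmetry and AR(1) each have only 2 (one correlation parameter plus the residual variance), matching the "Covariance Parameters" entries in the Dimensions tables above:

```python
import numpy as np

def n_cov_parms(structure, n):
    """Number of covariance parameters for a working structure with
    n observations per subject (variance parameters included)."""
    if structure == "un":              # unstructured: all variances and covariances
        return n * (n + 1) // 2
    if structure in ("cs", "ar1"):     # one correlation parameter + residual variance
        return 2
    raise ValueError(structure)

def ar1_corr(rho, n):
    """AR(1) working correlation matrix: (j,k) element is rho**|j-k|."""
    j = np.arange(n)
    return rho ** np.abs(np.subtract.outer(j, j))

print(n_cov_parms("un", 4))   # 10
print(n_cov_parms("cs", 4))   # 2
print(ar1_corr(0.3757, 4))    # approximately the fitted AR(1) matrix shown earlier
```

With rho = 0.3757, `ar1_corr` reproduces (up to rounding of the reported estimate) the fitted AR(1) correlation matrix printed in the output above.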

For the compound symmetric fit, the fitted correlation matrix may be obtained from the Covariance Parameter Estimates section; the common correlation parameter estimate is 1.7430/(1.7430 + 2.7774) = 0.3856, and the estimate of the variance σ² is 1.7430 + 2.7774 = 4.52, with square root 2.126. These may be compared with the values from nlinmix of 0.3582 and 1.5411 + 2.7611 = 4.302 (square root 2.074). For the AR(1) fit, the estimates of α and σ² may be read directly from the default output as 0.3936 and 4.379; the square root of the estimate of σ² is 2.093. The corresponding estimates from nlinmix are 0.3757 and 4.2126 (square root 2.052). Overall, the estimates from nlinmix and proc glimmix compare reasonably well to each other and to those obtained using the simpler method of moments estimation of covariance parameters in proc genmod and gee().
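The compound-symmetry arithmetic above can be written out explicitly. As a small sketch (in Python; the function name `cs_to_corr_var` is ours), the reported CS and Residual covariance parameters convert to the implied common correlation and total variance as follows:

```python
def cs_to_corr_var(cs, residual):
    """Convert compound-symmetry covariance parameters to the implied
    common correlation and total variance sigma^2."""
    total_var = cs + residual       # sigma^2 = CS + Residual
    corr = cs / total_var           # common correlation = CS / (CS + Residual)
    return corr, total_var

# proc glimmix fit: CS = 1.7430, Residual = 2.7774
corr_g, var_g = cs_to_corr_var(1.7430, 2.7774)
print(round(corr_g, 4), round(var_g, 4))   # 0.3856 4.5204

# nlinmix fit: CS = 1.5411, Residual = 2.7611
corr_n, var_n = cs_to_corr_var(1.5411, 2.7611)
print(round(corr_n, 4), round(var_n, 4))   # 0.3582 4.3022
```

These are exactly the values 0.3856 and 0.3582 quoted in the comparison above.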
