Introduction to Generalized Estimating Equations and Their Applications in Small Cluster Randomized Trials

Fan Li

BIOSTAT 900 Seminar

November 11, 2016

Overview

Outline
- Background
- The Generalized Estimating Equations (GEEs)
- Improved Small-sample Inference
- Take-home Message

Key questions: How do GEEs work? How can small-sample inference be improved, especially in cluster randomized trial (CRT) applications?

Cluster Randomized Trials (CRTs)

- Randomize clusters of subjects rather than independent subjects (convenience, ethics, contamination, etc.)
- Intervention is administered at the cluster level; outcomes are measured at the individual level
- Outcomes from subjects within the same cluster exhibit greater correlation than those from different clusters
- The intraclass correlation coefficient (ICC) typically ranges from 0.01 to 0.1
- Interest lies in evaluating the intervention effect
- Often a small number of clusters with large cluster sizes

The Stop Colorectal Cancer (STOP CRC) Study

- Studies the effect of an intervention to improve cancer screening
- 26 health clinics ($n = 26$) are allocated to either usual care or intervention (1-to-1 ratio)
- Usual care: opportunistic/occasional screening
- Intervention: an automated program for mailing testing kits with instructions to patients
- The clinics contain variable numbers of patients ($m_i$): min 461 / median 1426 / max 3302
- Primary outcome: the patient-level binary outcome ($y_{ij}$), completion status of the screening test within a year of study initiation
- Baseline clinic- and individual-level covariates are available
- Inference: what is the estimand?

The Estimand from Conditional Models

- $y_i = (y_{i1}, \ldots, y_{im_i})^T$: the collection of outcomes from clinic $i$
- $X_i$: the "design" matrix (including intercept, treatment variable, and baseline covariates) of cluster $i$
- The generalized linear mixed model: $g(E(y_i \mid X_i, b_i)) = X_i \beta + 1_{m_i} b_i$
  - $g(\cdot)$: a smooth, invertible link function
  - $\beta$: the regression coefficients (including the intervention effect)
  - $b_i \sim N(0, \sigma_b^2)$: Gaussian random effects
- The estimand, defined by a component of $\beta$, typically has a cluster-specific (conditional) interpretation
- $y_{ij} \mid x_{ij}, b_i$ is assumed to follow an exponential family model (likelihood)
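As a concrete instance for STOP CRC (a sketch; $z_i$ denotes the clinic-level treatment indicator, notation not in the original slides), the conditional model with a logit link is

$$\operatorname{logit} \Pr(y_{ij} = 1 \mid z_i, b_i) = \beta_0 + \beta_1 z_i + b_i, \qquad b_i \sim N(0, \sigma_b^2),$$

so $e^{\beta_1}$ is the odds ratio comparing intervention to usual care for patients within a clinic sharing the same random effect $b_i$, hence the cluster-specific interpretation.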

The Estimand from Marginal Models

Recall the basics of GLMs:

- Now let $y_{ij} \mid x_{ij}$ follow an exponential family model with mean $\mu_{ij} = E(y_{ij} \mid x_{ij})$ and variance $\nu_{ij} = h(\mu_{ij})/\phi$ (the mean-variance relationship)
- $\phi$: the dispersion parameter
- Let $\mu_i = E(y_i \mid X_i) = (\mu_{i1}, \ldots, \mu_{im_i})^T$
- Use $g(\mu_i) = X_i \beta$, allowing for non-zero correlation between components of $y_i$; the mean is parameterized by a marginal intervention effect (marginal w.r.t. what?)
- Population-average intervention effect: a more straightforward interpretation
- To make inferences on $\beta$, how do we describe the correlation between components of $y_i$?
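For comparison with the conditional model (again a sketch, reusing the hypothetical treatment indicator $z_i$), the marginal analog with a logit link is

$$\operatorname{logit} \Pr(y_{ij} = 1 \mid z_i) = \beta_0^* + \beta_1^* z_i,$$

where $e^{\beta_1^*}$ is the population-averaged odds ratio, marginal over clinics. For the non-linear logit link, $\beta_1^* \neq \beta_1$ in general (non-collapsibility), so the conditional and marginal estimands answer different questions.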

Generalized Estimating Equations (GEEs)

- Define $R_i(\alpha)$ as the $m_i \times m_i$ "working" correlation matrix of $y_i$, with $\alpha$ an unknown nuisance parameter; then the "working" covariance is $V_i = \operatorname{var}(y_i \mid X_i) = A_i^{1/2} R_i(\alpha) A_i^{1/2} / \phi$
- $A_i = \operatorname{diag}(h(\mu_{i1}), \ldots, h(\mu_{im_i}))$
- The GEEs are defined as

$$\sum_{i=1}^{n} U_i(\beta, \phi, \alpha) = \sum_{i=1}^{n} D_i^T V_i^{-1} (y_i - \mu_i) = 0,$$

where $D_i = \partial \mu_i / \partial \beta^T$

- Only the first two moments of $y_{ij}$ are assumed: a quasi-likelihood score equation (Wedderburn, 1974)
- It is also the efficient score equation from a semi-parametric restricted model (Tsiatis, 2006)
- Iterative algorithms (Newton-Raphson) are used for estimation; see the sketch below
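A minimal numpy sketch of evaluating the GEE score, assuming a logit link (so $A_i = \operatorname{diag}(\mu_{ij}(1 - \mu_{ij}))$ and $D_i = A_i X_i$) and an exchangeable working correlation; `y_list` and `X_list` are hypothetical per-cluster arrays, and $\phi$ is fixed at 1 as for the binomial family:

```python
import numpy as np

def gee_score(beta, y_list, X_list, alpha):
    """Evaluate U(beta) = sum_i D_i' V_i^{-1} (y_i - mu_i) for a marginal
    logit model with exchangeable working correlation (illustrative sketch)."""
    U = np.zeros(len(beta))
    for y, X in zip(y_list, X_list):
        m = len(y)
        mu = 1.0 / (1.0 + np.exp(-X @ beta))                     # inverse logit
        A = np.diag(mu * (1.0 - mu))                             # variance function h(mu)
        R = (1.0 - alpha) * np.eye(m) + alpha * np.ones((m, m))  # exchangeable R_i(alpha)
        V = np.sqrt(A) @ R @ np.sqrt(A)                          # working covariance, phi = 1
        D = A @ X                                                # dmu/dbeta' for the logit link
        U += D.T @ np.linalg.solve(V, y - mu)
    return U
```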

Dealing with Nuisances

- The working covariance $V_i$ contains the nuisance parameters $\alpha$ and $\phi$
- It is possible to "profile out" these nuisances within the iterative procedure
- Moment-based estimators for the nuisances: given a current estimate $\hat\beta$, the Pearson residuals are $\hat r_{ij} = (y_{ij} - \hat\mu_{ij}) / \hat\nu_{ij}^{1/2}$, which are typically used to estimate

$$\hat\phi = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \hat r_{ij}^2 / (N - p),$$

where $N = \sum_{i=1}^{n} m_i$ and $p$ is the dimension of $\beta$

- What about $\alpha$?

Dealing with Nuisances - Cont'd

Choices of the working correlation structure:

- If $R_i = I$, the independence structure: no nuisances involved
- If $\operatorname{corr}(y_{ij}, y_{ij'}) = \alpha$ for $j \neq j'$, we end up with the exchangeable structure. The nuisance can be estimated by

$$\hat\alpha = \hat\phi \sum_{i=1}^{n} \sum_{j > j'} \hat r_{ij} \hat r_{ij'} \Big/ \left\{ \sum_{i=1}^{n} m_i (m_i - 1)/2 - p \right\}$$

- If $R_i$ is assumed unstructured, then (loosely speaking)

$$\hat R(\hat\alpha) = \frac{\hat\phi}{n} \sum_{i=1}^{n} \hat A_i^{-1/2} (y_i - \hat\mu_i)(y_i - \hat\mu_i)^T \hat A_i^{-1/2}$$

- Other types of correlation structures are available
- In CRT applications, the exchangeable structure is often assumed; a sketch of the moment estimators follows
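A numpy sketch of the moment-based nuisance estimators above, assuming the Pearson residuals have already been computed per cluster (the function name and inputs are illustrative):

```python
import numpy as np

def estimate_nuisances(resid_list, p):
    """Moment estimates of the dispersion phi and the exchangeable
    correlation alpha from per-cluster Pearson residuals; p = dim(beta)."""
    N = sum(len(r) for r in resid_list)
    phi_hat = sum(np.sum(r ** 2) for r in resid_list) / (N - p)
    # sum_{j > j'} r_j r_{j'} = ((sum_j r_j)^2 - sum_j r_j^2) / 2, per cluster
    cross = sum((np.sum(r) ** 2 - np.sum(r ** 2)) / 2.0 for r in resid_list)
    pairs = sum(len(r) * (len(r) - 1) / 2.0 for r in resid_list)
    alpha_hat = phi_hat * cross / (pairs - p)
    return phi_hat, alpha_hat
```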

Modified Newton's Algorithm

Initiate $\hat\beta$ → compute $\hat\phi(\hat\beta)$ and $\hat\alpha(\hat\beta, \hat\phi(\hat\beta))$ → update $\hat\beta$ by Newton's method → repeat the last two steps until convergence. Essentially we are solving for $\beta$ iteratively from

$$0 = \sum_{i=1}^{n} U_i\big(\beta, \hat\alpha(\beta, \hat\phi(\beta))\big) = \sum_{i=1}^{n} D_i^T(\beta) V_i^{-1}\big(\beta, \hat\alpha(\beta, \hat\phi(\beta))\big) \big(y_i - \mu_i(\beta)\big),$$

with a working assumption on $R_i(\alpha)$. Why are these efforts worthwhile? It turns out that, under mild assumptions, the final solution $\hat\beta$ is consistent even if $R_i(\alpha)$ is misspecified.
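Putting the pieces together, a self-contained sketch of the full iterative procedure for the logit-link, exchangeable-correlation case (note that a constant $\phi$ cancels out of the estimating equation for $\beta$, so it is omitted from $V_i$ here; all names are illustrative):

```python
import numpy as np

def fit_gee(y_list, X_list, n_iter=25, tol=1e-8):
    """Fisher-scoring GEE fit for a marginal logit model with exchangeable
    working correlation, profiling out (phi, alpha) at each iteration."""
    p = X_list[0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        # --- moment estimates of the nuisances at the current beta ---
        resid = []
        for y, X in zip(y_list, X_list):
            mu = 1 / (1 + np.exp(-X @ beta))
            resid.append((y - mu) / np.sqrt(mu * (1 - mu)))  # Pearson residuals
        N = sum(len(r) for r in resid)
        phi = sum(np.sum(r ** 2) for r in resid) / (N - p)
        cross = sum((np.sum(r) ** 2 - np.sum(r ** 2)) / 2 for r in resid)
        pairs = sum(len(r) * (len(r) - 1) / 2 for r in resid)
        alpha = phi * cross / (pairs - p)
        # --- one Fisher-scoring (Newton-type) update of beta ---
        U, J = np.zeros(p), np.zeros((p, p))
        for y, X in zip(y_list, X_list):
            m = len(y)
            mu = 1 / (1 + np.exp(-X @ beta))
            A = np.diag(mu * (1 - mu))
            R = (1 - alpha) * np.eye(m) + alpha * np.ones((m, m))
            V = np.sqrt(A) @ R @ np.sqrt(A)
            D = A @ X
            U += D.T @ np.linalg.solve(V, y - mu)
            J += D.T @ np.linalg.solve(V, D)
        step = np.linalg.solve(J, U)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, phi, alpha
```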

Asymptotics

- (A.1) Sufficient moments of the components of $X_i$ and $y_i$ exist
- (A.2) $\hat\phi$ is root-$n$ consistent given $\beta$
- (A.3) $\hat\alpha$ is root-$n$ consistent given $\beta$ and $\phi$
- (A.4) $|\partial \hat\alpha(\beta, \phi) / \partial \phi| = O_p(1)$

Under (A.1)-(A.4), $\hat\beta$ is consistent for the truth $\beta_0$ and asymptotically normal with the sandwich variance

$$V_{\text{sand}} = \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} \left( \sum_{i=1}^{n} D_i^T V_i^{-1} \operatorname{cov}(y_i) V_i^{-1} D_i \right) \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1}$$

- (A.2)-(A.4) are usually fulfilled by the moment-based estimators for the nuisances
- If $R_i(\alpha)$ is correctly specified, $V_{\text{sand}}$ equals the model-based variance $\left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1}$
- $V_{\text{sand}}$ doesn't depend on the nuisances as long as (A.2) and (A.3) hold; it is also known as the robust or empirical variance estimator

Asymptotics - Cont'd

- The proof is centered on the classical theory of unbiased estimating equations (van der Vaart, 1998), which simply uses a Taylor expansion
- The key is to realize that $\hat\beta$ is asymptotically linear:

$$\sqrt{n} (\hat\beta - \beta_0) = \left( \lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} U_i(\beta_0) + o_p(1)$$

- $V_{\text{sand}}$ then comes from a simple application of the CLT
- Plug-in estimator $\hat V_{\text{sand}}$:
  - replace $\operatorname{cov}(y_i)$ by $\hat e_i \hat e_i^T$ with $\hat e_i = y_i - \hat\mu_i$
  - replace $\beta$ by $\hat\beta$
- Rule of thumb: need at least $n = 50$ clusters for this to work well
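A numpy sketch of the plug-in estimator $\hat V_{\text{sand}}$ for the same logit/exchangeable setting as before (a constant $\phi$ scales the "bread" and "meat" in offsetting ways, so it cancels from the sandwich and is again omitted):

```python
import numpy as np

def sandwich_variance(beta, y_list, X_list, alpha):
    """Plug-in robust sandwich covariance of beta-hat for a logit-link GEE
    with exchangeable working correlation (illustrative sketch)."""
    p = len(beta)
    bread = np.zeros((p, p))   # sum_i D_i' V_i^{-1} D_i
    meat = np.zeros((p, p))    # sum_i D_i' V_i^{-1} e_i e_i' V_i^{-1} D_i
    for y, X in zip(y_list, X_list):
        m = len(y)
        mu = 1 / (1 + np.exp(-X @ beta))
        A = np.diag(mu * (1 - mu))
        R = (1 - alpha) * np.eye(m) + alpha * np.ones((m, m))
        V = np.sqrt(A) @ R @ np.sqrt(A)
        D = A @ X
        Vinv_D = np.linalg.solve(V, D)   # V_i^{-1} D_i
        e = y - mu                       # residuals: plug-in estimate of cov(y_i)
        bread += D.T @ Vinv_D
        meat += Vinv_D.T @ np.outer(e, e) @ Vinv_D
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv
```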

Small Sample Performance

Recall that STOP CRC only has 26 clinics (n = 26)

- The plug-in estimator uses the residuals $\hat e_i$ to estimate $\operatorname{cov}(y_i)$
- These residuals tend to be too small when $n$ is small
- $\hat V_{\text{sand}}$ would therefore be expected to underestimate the covariance of $\hat\beta$
- How do we correct for this bias?


- An immediate solution is resampling
- One could use a cluster bootstrap (resampling clinics with replacement) or a delete-$s$ jackknife covariance estimator
- Computational questions: how many bootstrap replicates are sufficient? What is the optimal $s$ in jackknifing?
- Such non-closed-form corrections are difficult to translate into practice

Deriving $V_{\text{sand,MD}}$

- A bias-corrected covariance estimator was proposed by Mancl and DeRouen (2001)
- The idea is to reduce the bias of the residual estimator $\hat e_i \hat e_i^T$
- Let $e_i = e_i(\beta) = y_i - \mu_i$; we can write for each $i$

$$\hat e_i \approx e_i + \frac{\partial e_i}{\partial \beta^T} (\hat\beta - \beta) = e_i - D_i (\hat\beta - \beta),$$

where we recall that $D_i = \partial \mu_i / \partial \beta^T = -\partial e_i / \partial \beta^T$

- The second moment of $\hat e_i$:

$$E(\hat e_i \hat e_i^T) \approx \operatorname{cov}(y_i) - E\left[ e_i (\hat\beta - \beta)^T D_i^T \right] - E\left[ D_i (\hat\beta - \beta) e_i^T \right] + E\left[ D_i (\hat\beta - \beta)(\hat\beta - \beta)^T D_i^T \right]$$

Deriving $V_{\text{sand,MD}}$ - Cont'd

Recall the asymptotic linearity of $\hat\beta$, so we have

$$(\hat\beta - \beta_0) \approx \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} \sum_{i=1}^{n} D_i^T V_i^{-1} e_i$$

Define $H_{il} = D_i \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} D_l^T V_l^{-1}$

- $H_{ii}$ is the block-diagonal element of a projection matrix
- $H_{ii}$ is the leverage of the $i$th cluster/clinic
- The elements of $H_{il}$ lie between zero and one, and are usually close to zero
- Further, since $E(e_i e_l^T) = 0$ for $i \neq l$, we have

$$E\left[ D_i (\hat\beta - \beta) e_i^T \right] = H_{ii} \operatorname{cov}(y_i), \qquad E\left[ e_i (\hat\beta - \beta)^T D_i^T \right] = \operatorname{cov}(y_i) H_{ii}^T,$$
$$E\left[ D_i (\hat\beta - \beta)(\hat\beta - \beta)^T D_i^T \right] = \sum_{l=1}^{n} H_{il} \operatorname{cov}(y_l) H_{il}^T$$

Deriving $V_{\text{sand,MD}}$ - Cont'd

Summing up these terms, we get

$$E(\hat e_i \hat e_i^T) \approx (I_i - H_{ii}) \operatorname{cov}(y_i) (I_i - H_{ii})^T + \sum_{l \neq i} H_{il} \operatorname{cov}(y_l) H_{il}^T \approx (I_i - H_{ii}) \operatorname{cov}(y_i) (I_i - H_{ii})^T$$

- $I_i$ is the identity matrix of dimension $m_i$, and the last term is assumed small because the $H_{il}$'s are close to zero
- Inverting the approximation gives $\operatorname{cov}(y_i) \approx (I_i - H_{ii})^{-1} \hat e_i \hat e_i^T (I_i - H_{ii}^T)^{-1}$
- Consequently, the MD bias-corrected robust sandwich variance estimator takes the form

$$\hat V_{\text{sand,MD}} = \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} \left( \sum_{i=1}^{n} D_i^T V_i^{-1} (I_i - H_{ii})^{-1} \hat e_i \hat e_i^T (I_i - H_{ii}^T)^{-1} V_i^{-1} D_i \right) \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1}$$

- This correction inflates $\hat V_{\text{sand}}$
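A numpy sketch of the MD correction, taking as given per-cluster derivatives $D_i$, working covariances $V_i$, and residuals $\hat e_i$ (the function and argument names are illustrative):

```python
import numpy as np

def md_correction(D_list, V_list, e_list):
    """Mancl-DeRouen bias-corrected sandwich estimator from per-cluster
    derivatives D_i, working covariances V_i, and residuals e_i (sketch)."""
    p = D_list[0].shape[1]
    bread = np.zeros((p, p))
    for D, V in zip(D_list, V_list):
        bread += D.T @ np.linalg.solve(V, D)
    bread_inv = np.linalg.inv(bread)
    meat = np.zeros((p, p))
    for D, V, e in zip(D_list, V_list, e_list):
        m = len(e)
        H_ii = D @ bread_inv @ D.T @ np.linalg.inv(V)   # cluster leverage
        e_adj = np.linalg.solve(np.eye(m) - H_ii, e)    # (I - H_ii)^{-1} e_i
        Vinv_D = np.linalg.solve(V, D)
        meat += Vinv_D.T @ np.outer(e_adj, e_adj) @ Vinv_D
    return bread_inv @ meat @ bread_inv
```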

Possible Improvement

- In theory, the MD bias-corrected robust variance can be improved by also incorporating the off-diagonal blocks $H_{il}$
- Specifically, write the stacked residuals as $\hat e = (\hat e_1^T, \ldots, \hat e_n^T)^T$ and the projection matrix $H = D \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} D^T V^{-1}$, where
  - $D = (D_1^T, \ldots, D_n^T)^T$
  - $V = \operatorname{blockdiag}(V_1, \ldots, V_n)$
  - $y = (y_1^T, \ldots, y_n^T)^T$
- We can show $E(\hat e \hat e^T) = (I - H) \operatorname{cov}(y) (I - H)^T$, which may promise a more accurate correction
- Any practical issues? Numerical problems arise due to the near-singularity of $I - H$

Heuristics for $V_{\text{sand,KC}}$

- Kauermann and Carroll (2001) proposed an alternative correction, extending the bias-corrected sandwich estimator for linear regression
- Under the regression model $Y_i = x_i^T \beta + e_i$ with $e_i \sim N(0, \sigma_i^2)$, suppose the parameter of interest is a scalar $\theta = c^T \beta$
- Write $X$ as the design matrix, with projection matrix $H = X (X^T X)^{-1} X^T$, and define $a_i = c^T (X^T X)^{-1} x_i$
- The sandwich estimator

$$\hat V_{\text{sand}} = c^T (X^T X)^{-1} \left( \sum_i x_i x_i^T \hat e_i^2 \right) (X^T X)^{-1} c := \sum_i a_i^2 \hat e_i^2$$

consistently estimates the variance of the least-squares projection $\hat\theta = c^T \hat\beta$, which is $\sigma^2 \sum_i a_i^2 = \sigma^2 c^T (X^T X)^{-1} c$

Heuristics for $V_{\text{sand,KC}}$ - Cont'd

However, under homoskedasticity, where $\sigma_i^2 = \sigma^2$, the expectation of the squared residual (using $\hat e = (I - H) y$) is

$$E(\hat e_i^2) = \sigma^2 (1 - h_{ii}),$$

where $h_{ii}$ is the leverage of observation $i$ (between 0 and 1). Indeed,

$$E(\hat V_{\text{sand}}) = \sigma^2 \sum_i a_i^2 - \underbrace{\sigma^2 \sum_i a_i^2 h_{ii}}_{\text{bias term}}$$

A simple fix is to replace $\hat e_i$ in $\hat V_{\text{sand}}$ by the leverage-adjusted residuals $\tilde e_i = (1 - h_{ii})^{-1/2} \hat e_i$

Heuristics for $V_{\text{sand,KC}}$ - Cont'd

- Realize that the bias of the GEE sandwich estimator stems from the bias of the plug-in estimator $\hat e_i \hat e_i^T$ for $\operatorname{cov}(y_i)$
- A similar fix is to use the cluster-leverage-adjusted residuals $\tilde e_i = (I_i - H_{ii})^{-1/2} \hat e_i$
- The resulting KC bias-corrected sandwich estimator is

$$\hat V_{\text{sand,KC}} = \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1} \left( \sum_{i=1}^{n} D_i^T V_i^{-1} (I_i - H_{ii})^{-1/2} \hat e_i \hat e_i^T (I_i - H_{ii}^T)^{-1/2} V_i^{-1} D_i \right) \left( \sum_{i=1}^{n} D_i^T V_i^{-1} D_i \right)^{-1}$$

- It turns out that $\hat V_{\text{sand,KC}}$ removes the first-order bias of $\hat V_{\text{sand}}$ if $R_i(\alpha)$ is correct, and it even works well under misspecification
- The corrected variance lies between $\hat V_{\text{sand}}$ and $\hat V_{\text{sand,MD}}$
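A numpy/scipy sketch of the KC correction in the same style as the MD sketch above; `scipy.linalg.sqrtm` is used for the matrix square root since $I_i - H_{ii}$ is not symmetric in general (names are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def kc_correction(D_list, V_list, e_list):
    """Kauermann-Carroll bias-corrected sandwich estimator from per-cluster
    derivatives D_i, working covariances V_i, and residuals e_i (sketch)."""
    p = D_list[0].shape[1]
    bread = np.zeros((p, p))
    for D, V in zip(D_list, V_list):
        bread += D.T @ np.linalg.solve(V, D)
    bread_inv = np.linalg.inv(bread)
    meat = np.zeros((p, p))
    for D, V, e in zip(D_list, V_list, e_list):
        m = len(e)
        H_ii = D @ bread_inv @ D.T @ np.linalg.inv(V)   # cluster leverage
        # (I - H_ii)^{-1/2} e_i; real part taken since the root is real here
        e_adj = np.linalg.solve(np.real(sqrtm(np.eye(m) - H_ii)), e)
        Vinv_D = np.linalg.solve(V, D)
        meat += Vinv_D.T @ np.outer(e_adj, e_adj) @ Vinv_D
    return bread_inv @ meat @ bread_inv
```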

Revisit STOP CRC

- How do we test the intervention effect for STOP CRC with $n = 26$? What decisions should we make?
- Two additional complications:
  - Large variation in clinic sizes: the coefficient of variation is cv = 0.485
  - Wald $t$-test or Wald $z$-test?
- These decisions are evaluated by simulation studies

Revisit STOP CRC - Cont'd

- A well-described simulation study (Li and Redden, 2014)
  - Simulates correlated binary outcomes from a marginal beta-binomial model
  - The model is parameterized by the ICC, set to the commonly reported values 0.01 and 0.05
  - Assumes 10, 20, and 30 clusters/clinics, with average cluster sizes ranging from 10 to 150
- Main findings:
  - The Wald $t$-test with $\hat V_{\text{sand,KC}}$ remains valid as long as cv < 0.6
  - The Wald $z$-test with $\hat V_{\text{sand,MD}}$ is only valid when $n \geq 20$, while the corresponding Wald $t$-test is conservative when $n \leq 20$
  - When cv > 0.6, a different bias-corrected variance by Fay and Graubard (2001), combined with the Wald $t$-test, is recommended
- For STOP CRC, both $\hat V_{\text{sand,KC}}$ ($t$-test) and $\hat V_{\text{sand,MD}}$ ($z$-test) are worth pursuing
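For practice, an end-to-end sketch using statsmodels' GEE implementation on simulated STOP CRC-like data; the data-generating numbers below are invented for illustration and are not the study data, and the `bias_reduced` option (where supported by the installed statsmodels version) corresponds to the Mancl-DeRouen correction:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in for STOP CRC-style data: 26 clinics, 1-to-1 allocation,
# variable clinic sizes, binary screening-completion outcome.
rng = np.random.default_rng(0)
rows = []
for i in range(26):
    treat = i % 2                        # clinic-level treatment indicator
    m = rng.integers(461, 3303)          # clinic size in the observed range
    b = rng.normal(0, 0.3)               # clinic-level heterogeneity (induces ICC)
    prob = 1 / (1 + np.exp(-(-1.0 + 0.4 * treat + b)))
    y = rng.binomial(1, prob, size=m)
    rows.append(pd.DataFrame({"clinic": i, "treat": treat, "y": y}))
df = pd.concat(rows, ignore_index=True)

# Marginal logistic model with exchangeable working correlation
model = sm.GEE.from_formula(
    "y ~ treat", groups="clinic", data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
# Bias-corrected (Mancl-DeRouen) standard errors, if available:
print(result.standard_errors(cov_type="bias_reduced"))
```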

Take-home Message

- What are GEEs? What is the general rule of thumb for their asymptotics?
- Intuition for the bias-corrected sandwich estimators, and the rule of thumb for CRT applications
- The simulations above dispense with covariate adjustment; would that impact the recommendations?
