Introduction to the Generalized Estimating Equations and its Applications in Small Cluster Randomized Trials
Fan Li
BIOSTAT 900 Seminar
November 11, 2016
Overview

Outline:
- Background
- The Generalized Estimating Equations (GEEs)
- Improved small-sample inference
- Take-home message

Key questions: How do GEEs work? How can we improve small-sample inference, especially in cluster randomized trial (CRT) applications?
Cluster Randomized Trials (CRTs)

- Randomize clusters of subjects rather than independent subjects (convenience, ethics, contamination, etc.)
- The intervention is administered at the cluster level; outcomes are measured at the individual level
- Outcomes from subjects within the same cluster exhibit greater correlation than those from different clusters
- The intraclass correlation coefficient (ICC) typically ranges from 0.01 to 0.1
- Interest lies in evaluating the intervention effect
- Often a small number of clusters with large cluster sizes
The Stop Colorectal Cancer (STOP CRC) Study

- Studies the effect of an intervention to improve cancer screening
- 26 health clinics ($n = 26$) are allocated to either usual care or intervention (1-to-1 ratio)
- Usual care: opportunistic/occasional screening
- Intervention: an automated program for mailing testing kits with instructions to patients
- The clinics contain variable numbers of patients ($m_i$): min 461 / median 1426 / max 3302
- Primary outcome: a patient-level binary outcome ($y_{ij}$), the completion status of the screening test within a year of study initiation
- Baseline clinic- and individual-level covariates are available
- Inference: what is the estimand?
The Estimand from Conditional Models

- $y_i = (y_{i1}, \ldots, y_{im_i})^T$: the collection of outcomes from clinic $i$
- $X_i$: "design" matrix of cluster $i$ (including intercept, treatment variable, and baseline covariates)
- The generalized linear mixed model
  $$g(E(y_i \mid X_i, b_i)) = X_i\beta + 1_{m_i} b_i$$
  - $g(\cdot)$: a smooth, invertible link function
  - $\beta$: the regression coefficients (including the intervention effect)
  - $b_i \sim N(0, \sigma_b^2)$: Gaussian random effects
- The estimand, defined by a component of $\beta$, typically has a cluster-specific (conditional) interpretation
- $y_{ij} \mid x_{ij}, b_i$ is assumed to follow an exponential family model (likelihood)
The Estimand from Marginal Models

- Recall the basics of GLMs
- Now let $y_{ij} \mid x_{ij}$ follow an exponential family model with mean $\mu_{ij} = E(y_{ij} \mid x_{ij})$ and variance $\nu_{ij} = h(\mu_{ij})/\phi$
  - $h(\cdot)$: the mean-variance relationship
  - $\phi$: dispersion parameter
- Let $\mu_i = E(y_i \mid X_i) = (\mu_{i1}, \ldots, \mu_{im_i})^T$
- Use $g(\mu_i) = X_i\beta$, allowing for non-zero covariance among the components of $y_i$; $\beta$ parameterizes a marginal intervention effect (marginal with respect to what?)
- Population-average intervention effect: a more straightforward interpretation
- To make inferences on $\beta$, how do we describe the correlation between components of $y_i$?
Generalized Estimating Equations (GEEs)

- Let $R_i(\alpha)$ be the $m_i \times m_i$ "working" correlation matrix of $y_i$, with $\alpha$ an unknown nuisance parameter; then the "working" covariance is $V_i = \mathrm{var}(y_i \mid X_i) = A_i^{1/2} R_i(\alpha) A_i^{1/2}/\phi$, where $A_i = \mathrm{diag}(h(\mu_{i1}), \ldots, h(\mu_{im_i}))$
- The GEEs are defined as
  $$\sum_{i=1}^n U_i(\beta, \phi, \alpha) = \sum_{i=1}^n D_i^T V_i^{-1}(y_i - \mu_i) = 0,$$
  where $D_i = \partial\mu_i/\partial\beta^T$
- Only the first two moments of $y_{ij}$ are assumed: a quasi-likelihood score equation (Wedderburn, 1974)
- The efficient score equation from a semiparametric restricted moment model (Tsiatis, 2006)
- Estimation proceeds by iterative algorithms (Newton-Raphson)
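As a concrete illustration, the following minimal numpy sketch evaluates the GEE score above for a logistic marginal model under an independence working correlation; the data and all names are illustrative, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three toy clusters of size 4 with a cluster-level treatment (illustrative)
X_list = [np.column_stack([np.ones(4), np.full(4, t)]) for t in (0, 1, 0)]
y_list = [rng.integers(0, 2, size=4).astype(float) for _ in X_list]

def gee_score(beta, X_list, y_list):
    """GEE score sum_i D_i' V_i^{-1} (y_i - mu_i) for a logit link with
    an independence working correlation and phi = 1."""
    U = np.zeros(len(beta))
    for X, y in zip(X_list, y_list):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))   # marginal mean, logit link
        A = np.diag(mu * (1.0 - mu))           # variance function h(mu)
        D = A @ X                              # d mu / d beta'
        V = A                                  # working covariance (independence)
        U += D.T @ np.linalg.solve(V, y - mu)
    return U

U0 = gee_score(np.zeros(2), X_list, y_list)
print(U0)
```

With the canonical link and an independence working structure, the score reduces to $X^T(y - \mu)$, which is a quick sanity check on the sketch.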
Dealing with Nuisances

- The working covariance $V_i$ contains the nuisance parameters $\alpha$ and $\phi$
- It is possible to "profile out" these nuisances within the iterative procedure
- Moment-based estimators for the nuisances: given a current estimate $\hat\beta$, the Pearson residuals are $\hat r_{ij} = (y_{ij} - \hat\mu_{ij})/\hat\nu_{ij}^{1/2}$, which are typically used to estimate
  $$\hat\phi = \sum_{i=1}^n \sum_{j=1}^{m_i} \hat r_{ij}^2/(N - p),$$
  where $N = \sum_{i=1}^n m_i$ and $p$ is the dimension of $\beta$
- What about $\alpha$?
Dealing with Nuisances - Cont'd

Choices of the working correlation structure:

- If $R_i = I$, the independence structure: no nuisances involved
- If $\mathrm{corr}(y_{ij}, y_{ij'}) = \alpha$ for $j \neq j'$, we end up with the exchangeable structure. The nuisance can be estimated by
  $$\hat\alpha = \hat\phi \sum_{i=1}^n \sum_{j > j'} \hat r_{ij}\hat r_{ij'} \Big/ \left\{\sum_{i=1}^n m_i(m_i - 1)/2 - p\right\}$$
- If $R_i$ is assumed unstructured, then (loosely speaking)
  $$\hat R_i(\hat\alpha) = \frac{\hat\phi}{n} \sum_{i=1}^n \hat A_i^{-1/2}(y_i - \hat\mu_i)(y_i - \hat\mu_i)^T \hat A_i^{-1/2}$$
- Other types of correlation structures are available
- In CRT applications, the exchangeable structure is often assumed
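The moment estimators for $\phi$ and the exchangeable $\alpha$ can be evaluated directly; below is a small numpy sketch that follows the slide formulas literally, with made-up Pearson residuals (all values illustrative; note a negative estimate of $\alpha$ can occur with toy residuals).

```python
import numpy as np

# Toy Pearson residuals for two clusters (illustrative values); p = dim(beta)
r_list = [np.array([0.5, -1.2, 0.8]), np.array([1.1, -0.3, -0.6, 0.9])]
p = 2
N = sum(len(r) for r in r_list)                 # total number of observations

# Dispersion: phi_hat = sum_ij r_ij^2 / (N - p)
phi_hat = sum((r ** 2).sum() for r in r_list) / (N - p)

# Exchangeable alpha_hat: sum of within-cluster cross-products r_ij r_ij' (j > j'),
# computed via the identity sum_{j>j'} r_j r_j' = ((sum r)^2 - sum r^2) / 2
cross = sum((r.sum() ** 2 - (r ** 2).sum()) / 2.0 for r in r_list)
pairs = sum(len(r) * (len(r) - 1) // 2 for r in r_list)
alpha_hat = phi_hat * cross / (pairs - p)
print(phi_hat, alpha_hat)
```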
Modified Newton's Algorithm

- Initialize $\hat\beta$ → compute $\hat\phi(\hat\beta)$ and $\hat\alpha(\hat\beta, \hat\phi(\hat\beta))$ → update $\hat\beta$ by Newton's method → repeat the last two steps until convergence
- Essentially, we solve for $\beta$ iteratively from
  $$0 = \sum_{i=1}^n U_i\big(\beta, \hat\alpha(\beta, \hat\phi(\beta))\big) = \sum_{i=1}^n D_i^T(\beta)\, V_i^{-1}\big(\beta, \hat\alpha(\beta, \hat\phi(\beta))\big)\,(y_i - \mu_i(\beta)),$$
  with a working assumption on $R_i(\alpha)$
- Why are these efforts worthwhile? It turns out that, under mild assumptions, the final solution $\hat\beta$ is consistent even if $R_i(\alpha)$ is misspecified
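A compact end-to-end sketch of this alternating iteration, for a logistic GEE with an exchangeable working correlation on simulated data with a cluster-level treatment (all data and names illustrative; the clipping of alpha is just a guard for the toy example, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy CRT: 10 clusters of size 8, cluster-level binary treatment (illustrative)
m, n, p = 8, 10, 2
trt = np.repeat([0, 1], n // 2)
X_list = [np.column_stack([np.ones(m), np.full(m, float(t))]) for t in trt]
y_list = [rng.binomial(1, 0.4 + 0.2 * t, size=m).astype(float) for t in trt]

beta = np.zeros(p)
for _ in range(200):
    # Step 1: moment updates of the nuisances given the current beta
    mu_list = [1.0 / (1.0 + np.exp(-X @ beta)) for X in X_list]
    r_list = [(y - mu) / np.sqrt(mu * (1 - mu)) for y, mu in zip(y_list, mu_list)]
    phi = sum((r ** 2).sum() for r in r_list) / (n * m - p)
    cross = sum((r.sum() ** 2 - (r ** 2).sum()) / 2.0 for r in r_list)
    alpha = phi * cross / (n * m * (m - 1) / 2 - p)
    alpha = min(max(alpha, 0.0), 0.95)        # guard: keep R(alpha) positive definite
    R = (1 - alpha) * np.eye(m) + alpha * np.ones((m, m))
    # Step 2: one Fisher-scoring (Newton-type) step on the GEE score
    U, J = np.zeros(p), np.zeros((p, p))
    for X, y, mu in zip(X_list, y_list, mu_list):
        v = mu * (1 - mu)                     # h(mu) for binary outcomes
        D = v[:, None] * X                    # d mu / d beta^T
        V = np.sqrt(v)[:, None] * R * np.sqrt(v)[None, :]
        Vinv = np.linalg.inv(V)
        U += D.T @ Vinv @ (y - mu)
        J += D.T @ Vinv @ D
    step = np.linalg.solve(J, U)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("beta_hat:", beta, "alpha_hat:", alpha)
```

With equal cluster sizes and a purely cluster-level covariate, the exchangeable GEE solution matches the arm-level sample proportions exactly (the working correlation cancels), which is a useful check on the sketch.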
Asymptotics

- (A.1) Sufficient moments of the components of $X_i$ and $y_i$ exist
- (A.2) $\hat\phi$ is root-n consistent given $\beta$
- (A.3) $\hat\alpha$ is root-n consistent given $\beta$ and $\phi$
- (A.4) $|\partial\hat\alpha(\beta, \phi)/\partial\phi| = O_p(1)$

Under (A.1)-(A.4), $\hat\beta$ is consistent for the truth $\beta_0$ and asymptotically normal with the sandwich covariance matrix
$$V_{\text{sand}} = \left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1} \left(\sum_{i=1}^n D_i^T V_i^{-1} \mathrm{cov}(y_i) V_i^{-1} D_i\right) \left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1}$$

- (A.2)-(A.4) are usually fulfilled by the moment-based estimators for the nuisances
- If $R_i(\alpha)$ is correctly specified, $V_{\text{sand}}$ equals the model-based variance $\left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1}$
- $V_{\text{sand}}$ does not depend on the nuisances as long as (A.2) and (A.3) hold; it is known as the robust or empirical variance estimator
Asymptotics - Cont'd

- The proof is centered on the classical theory of unbiased estimating equations (van der Vaart, 1998), which simply uses a Taylor expansion
- The key is to realize that $\hat\beta$ is asymptotically linear, with
  $$\sqrt{n}(\hat\beta - \beta_0) = \left(\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^n U_i(\beta_0) + o_p(1)$$
- $V_{\text{sand}}$ then comes from a simple application of the CLT
- Plug-in estimator $\hat V_{\text{sand}}$:
  - replace $\mathrm{cov}(y_i)$ by $\hat e_i\hat e_i^T$ with $\hat e_i = y_i - \hat\mu_i$
  - replace $\beta$ by $\hat\beta$
- Rule of thumb: need at least $n = 50$ clusters for the asymptotics to work
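For a linear marginal model with an independence working structure, $\hat\beta$ is just OLS and the plug-in sandwich $\hat V_{\text{sand}}$ has a closed form; a minimal numpy sketch on simulated clustered data (all names and numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy clustered data: identity link + independence working structure,
# so beta_hat is OLS and the plug-in sandwich is in closed form
n, m = 6, 4
X_list = [np.column_stack([np.ones(m), rng.normal(size=m)]) for _ in range(n)]
y_list = [X @ np.array([1.0, 0.5]) + rng.normal() + 0.5 * rng.normal(size=m)
          for X in X_list]                  # shared draw induces within-cluster correlation

Xall, yall = np.vstack(X_list), np.concatenate(y_list)
bread = np.linalg.inv(Xall.T @ Xall)        # (sum_i D_i' V_i^{-1} D_i)^{-1}
beta_hat = bread @ Xall.T @ yall

meat = np.zeros((2, 2))
for X, y in zip(X_list, y_list):
    e_hat = y - X @ beta_hat                # cluster residual vector
    meat += X.T @ np.outer(e_hat, e_hat) @ X  # plug-in for D_i' V_i^{-1} cov(y_i) V_i^{-1} D_i
V_sand = bread @ meat @ bread
print(np.diag(V_sand))
```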
Small Sample Performance

- Recall that STOP CRC has only 26 clinics ($n = 26$)
- The plug-in estimator uses the residuals $\hat e_i$ to estimate $\mathrm{cov}(y_i)$
- These residuals tend to be too small when $n$ is small
- $\hat V_{\text{sand}}$ would therefore be expected to underestimate the covariance of $\hat\beta$
- How do we correct for this bias?
Resampling

- An immediate solution is resampling
- One could use cluster bootstrapping (sampling clinics with replacement) or a delete-s jackknife covariance estimator
- Practical concerns: computation; a sufficient number of bootstrap replicates; the optimal s in jackknifing
- Corrections without a closed form are difficult to translate into practice
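A sketch of the cluster (clinic-level) bootstrap for a simple cluster-summary effect estimate, resampling clinics with replacement within each arm; the data are simulated and all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy cluster-level summaries: 26 clinics, 1:1 allocation, screening rates
trt = np.repeat([0, 1], 13)
rates = rng.normal(0.4 + 0.1 * trt, 0.05)

def effect(t, r):
    # difference in arm-level mean screening rates
    return r[t == 1].mean() - r[t == 0].mean()

ctrl, trted = np.where(trt == 0)[0], np.where(trt == 1)[0]
B = 2000
boot = np.empty(B)
for b in range(B):
    # resample clinics with replacement, separately within each arm
    idx = np.concatenate([rng.choice(ctrl, size=ctrl.size),
                          rng.choice(trted, size=trted.size)])
    boot[b] = effect(trt[idx], rates[idx])
se_boot = boot.std(ddof=1)
print("bootstrap SE:", se_boot)
```

In a real analysis each bootstrap replicate would refit the GEE on the resampled clinics, which is exactly the computational burden the slide flags.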
Deriving $V_{\text{sand,MD}}$

- A bias-corrected covariance estimator was proposed by Mancl and DeRouen (2001)
- Idea: reduce the bias of the residual estimator $\hat e_i\hat e_i^T$
- Let $e_i = e_i(\beta) = y_i - \mu_i$; for each $i$ we can write
  $$\hat e_i \approx e_i + \frac{\partial e_i}{\partial\beta^T}(\hat\beta - \beta) = e_i - D_i(\hat\beta - \beta),$$
  where we recall that $D_i = \partial\mu_i/\partial\beta^T = -\partial e_i/\partial\beta^T$
- The second moment of $\hat e_i$:
  $$E(\hat e_i\hat e_i^T) \approx \mathrm{cov}(y_i) - E\big[e_i(\hat\beta - \beta)^T D_i^T\big] - E\big[D_i(\hat\beta - \beta)e_i^T\big] + E\big[D_i(\hat\beta - \beta)(\hat\beta - \beta)^T D_i^T\big]$$
Deriving $V_{\text{sand,MD}}$

- Recall the asymptotic linearity of $\hat\beta$, so we have
  $$\hat\beta - \beta_0 \approx \left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1} \sum_{i=1}^n D_i^T V_i^{-1} e_i$$
- Define $H_{il} = D_i\left(\sum_{k=1}^n D_k^T V_k^{-1} D_k\right)^{-1} D_l^T V_l^{-1}$
  - $H_{ii}$ is the $i$th block diagonal element of a projection matrix: the leverage of the $i$th cluster/clinic
  - The entries of $H_{il}$ lie between zero and one, and are usually close to zero
- Further, since $E(e_ie_l^T) = 0$ for $i \neq l$, we have
  $$E\big[D_i(\hat\beta - \beta)e_i^T\big] = H_{ii}\mathrm{cov}(y_i), \qquad E\big[e_i(\hat\beta - \beta)^T D_i^T\big] = \mathrm{cov}(y_i)H_{ii}^T,$$
  $$E\big[D_i(\hat\beta - \beta)(\hat\beta - \beta)^T D_i^T\big] = \sum_{l=1}^n H_{il}\mathrm{cov}(y_l)H_{il}^T$$
Deriving $V_{\text{sand,MD}}$

- Summing up these terms, we get
  $$E(\hat e_i\hat e_i^T) \approx (I_i - H_{ii})\mathrm{cov}(y_i)(I_i - H_{ii})^T + \sum_{l\neq i} H_{il}\mathrm{cov}(y_l)H_{il}^T \approx (I_i - H_{ii})\mathrm{cov}(y_i)(I_i - H_{ii})^T$$
- $I_i$ is the identity matrix of dimension $m_i$; the latter term is assumed small because the $H_{il}$ are close to zero
- Hence $\mathrm{cov}(y_i) \approx (I_i - H_{ii})^{-1}\hat e_i\hat e_i^T(I_i - H_{ii}^T)^{-1}$
- Consequently, the MD bias-corrected robust sandwich variance estimator takes the form
  $$\hat V_{\text{sand,MD}} = \left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1}\left(\sum_{i=1}^n D_i^T V_i^{-1}(I_i - H_{ii})^{-1}\hat e_i\hat e_i^T(I_i - H_{ii}^T)^{-1} V_i^{-1} D_i\right)\left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1}$$
- This inflates $\hat V_{\text{sand}}$
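In the linear, independence-working-structure case the MD correction is easy to compute directly; a numpy sketch comparing it with the uncorrected plug-in sandwich (simulated data; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4
X_list = [np.column_stack([np.ones(m), rng.normal(size=m)]) for _ in range(n)]
y_list = [X @ np.array([1.0, 0.5]) + rng.normal() + 0.5 * rng.normal(size=m)
          for X in X_list]                      # shared draw induces within-cluster correlation

Xall, yall = np.vstack(X_list), np.concatenate(y_list)
bread = np.linalg.inv(Xall.T @ Xall)            # here D_i = X_i and V_i = I
beta_hat = bread @ Xall.T @ yall

meat, meat_md = np.zeros((2, 2)), np.zeros((2, 2))
for X, y in zip(X_list, y_list):
    e = y - X @ beta_hat
    Hii = X @ bread @ X.T                       # cluster leverage H_ii
    e_md = np.linalg.solve(np.eye(m) - Hii, e)  # (I - H_ii)^{-1} e_hat
    meat += X.T @ np.outer(e, e) @ X
    meat_md += X.T @ np.outer(e_md, e_md) @ X
V_sand = bread @ meat @ bread
V_md = bread @ meat_md @ bread
print(np.diag(V_sand), np.diag(V_md))
```

Because $H_{ii}$ is symmetric here, $(I - H_{ii})^{-1}$ has eigenvalues at least one, so the adjusted residuals are never shorter than the raw ones.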
Possible Improvement

- In theory, the MD bias correction can be improved by incorporating the off-diagonal blocks $H_{il}$
- Specifically, write the stacked residuals $\hat e = (\hat e_1^T, \ldots, \hat e_n^T)^T$ and the projection matrix $H = D\left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1} D^T V^{-1}$, where
  - $D = (D_1^T, \ldots, D_n^T)^T$
  - $V = \mathrm{block\ diag}(V_1, \ldots, V_n)$
  - $y = (y_1^T, \ldots, y_n^T)^T$
- We can show $E(\hat e\hat e^T) = (I - H)\mathrm{cov}(y)(I - H)^T$, which may promise a more accurate correction
- Any practical issues? Numeric problems arise due to the near-singularity of $I - H$
Heuristics for $V_{\text{sand,KC}}$

- Kauermann and Carroll (2001) proposed an alternative correction by extending the bias-corrected sandwich estimator for linear regression
- Consider the heteroscedastic regression model $Y_i = x_i^T\beta + e_i$ with $e_i \sim N(0, \sigma_i^2)$, and suppose the parameter of interest is a scalar $\theta = c^T\beta$
- Write $X$ for the design matrix, $H = X(X^TX)^{-1}X^T$ for the projection matrix, and define $a_i = c^T(X^TX)^{-1}x_i$
- The sandwich estimator
  $$\hat V_{\text{sand}} = c^T(X^TX)^{-1}\left(\sum_i x_ix_i^T\hat e_i^2\right)(X^TX)^{-1}c := \sum_i a_i^2\hat e_i^2$$
  consistently estimates the variance of the least-squares projection $\hat\theta = c^T\hat\beta$, which is $\sigma^2\sum_i a_i^2 = \sigma^2 c^T(X^TX)^{-1}c$ under homoscedasticity
Heuristics for $V_{\text{sand,KC}}$

- However, under homoscedasticity where $\sigma_i^2 = \sigma^2$, since $\mathrm{cov}(\hat e) = \sigma^2(I - H)$ the expectation is
  $$E(\hat e_i^2) = \sigma^2(1 - h_{ii}),$$
  where $h_{ii}$ is the leverage of observation $i$ (between 0 and 1)
- Indeed,
  $$E(\hat V_{\text{sand}}) = \sigma^2\sum_i a_i^2 - \underbrace{\sigma^2\sum_i a_i^2 h_{ii}}_{\text{bias term}}$$
- A simple fix is to replace $\hat e_i$ in $\hat V_{\text{sand}}$ by the leverage-adjusted residuals $\tilde e_i = (1 - h_{ii})^{-1/2}\hat e_i$
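The downward bias and its fix can be checked numerically: with $\mathrm{cov}(\hat e) = \sigma^2(I - H)$, the identity $E(\hat e_i^2) = \sigma^2(1 - h_{ii})$ is exact. A small numpy sketch (illustrative design matrix and coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs = 12
X = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                              # leverages h_ii

# Under homoscedasticity, cov(e_hat) = sigma^2 (I - H)(I - H)^T = sigma^2 (I - H)
# by idempotence, so E[e_hat_i^2] = sigma^2 (1 - h_ii): residuals shrink toward zero.
sigma2 = 2.0
cov_ehat = sigma2 * (np.eye(n_obs) - H) @ (np.eye(n_obs) - H).T

# Leverage-adjusted (HC2-style) residuals remove this downward bias
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n_obs)
e_hat = y - H @ y
e_tilde = e_hat / np.sqrt(1 - h)
```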
Heuristics for $V_{\text{sand,KC}}$

- The bias of the GEE sandwich estimator likewise stems from the bias of the plug-in estimator $\hat e_i\hat e_i^T$ for $\mathrm{cov}(y_i)$
- A similar fix is to use the cluster-leverage-adjusted residuals $\tilde e_i = (I_i - H_{ii})^{-1/2}\hat e_i$
- The resulting KC bias-corrected sandwich estimator is
  $$\hat V_{\text{sand,KC}} = \left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1}\left(\sum_{i=1}^n D_i^T V_i^{-1}(I_i - H_{ii})^{-1/2}\hat e_i\hat e_i^T(I_i - H_{ii}^T)^{-1/2} V_i^{-1} D_i\right)\left(\sum_{i=1}^n D_i^T V_i^{-1} D_i\right)^{-1}$$
- It turns out that $\hat V_{\text{sand,KC}}$ removes the first-order bias of $\hat V_{\text{sand}}$ if $R_i(\alpha)$ is correctly specified, and it even works well under misspecification
- The corrected variance lies between $\hat V_{\text{sand}}$ and $\hat V_{\text{sand,MD}}$
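In the same linear, independence-working-structure setting, $H_{ii}$ is symmetric, so $(I - H_{ii})^{-1/2}$ can be taken via an eigendecomposition; a sketch comparing the KC correction with the uncorrected sandwich (simulated data; all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy clustered data: 6 clusters of size 4 (illustrative)
n, m = 6, 4
X_list = [np.column_stack([np.ones(m), rng.normal(size=m)]) for _ in range(n)]
y_list = [X @ np.array([1.0, 0.5]) + rng.normal() + 0.5 * rng.normal(size=m)
          for X in X_list]

Xall, yall = np.vstack(X_list), np.concatenate(y_list)
bread = np.linalg.inv(Xall.T @ Xall)        # identity link, V_i = I: D_i = X_i
beta_hat = bread @ Xall.T @ yall

def inv_sqrt_sym(M):
    """M^{-1/2} for a symmetric positive definite M, via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

meat, meat_kc = np.zeros((2, 2)), np.zeros((2, 2))
for X, y in zip(X_list, y_list):
    e = y - X @ beta_hat
    Hii = X @ bread @ X.T                   # cluster leverage (symmetric here)
    e_tilde = inv_sqrt_sym(np.eye(m) - Hii) @ e   # (I - H_ii)^{-1/2} e_hat
    meat += X.T @ np.outer(e, e) @ X
    meat_kc += X.T @ np.outer(e_tilde, e_tilde) @ X
V_sand = bread @ meat @ bread
V_kc = bread @ meat_kc @ bread
print(np.diag(V_sand), np.diag(V_kc))
```

The half-power adjustment inflates residuals less aggressively than the MD full-power adjustment, consistent with the slide's remark that KC sits between the two.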
Revisit STOP CRC

- How do we test the intervention effect for STOP CRC with $n = 26$? What decisions should we make?
- Two additional complications:
  - Large variation in clinic sizes: the coefficient of variation is cv = 0.485
  - Wald t-test or Wald z-test?
- These decisions are evaluated by simulation studies
Revisit STOP CRC

- A well-described simulation study (Li and Redden, 2014):
  - Simulate correlated binary outcomes from a marginal beta-binomial model
  - The model is parameterized by the ICC, set to the commonly reported values 0.01 and 0.05
  - Assume 10, 20, and 30 clusters/clinics, with average cluster sizes ranging from 10 to 150
- Main findings:
  - The Wald t-test with $\hat V_{\text{sand,KC}}$ remains valid as long as cv < 0.6
  - The Wald z-test with $\hat V_{\text{sand,MD}}$ is only valid when $n \geq 20$, while the Wald t-test is conservative when $n \leq 20$
  - When cv > 0.6, a different bias-corrected variance by Fay and Graubard (2001) with the Wald t-test is recommended
- For STOP CRC, both $\hat V_{\text{sand,KC}}$ (t-test) and $\hat V_{\text{sand,MD}}$ (z-test) are worth pursuing
Take-home Message

- What are GEEs? What is the general rule of thumb for their asymptotics?
- Intuition for the bias-corrected sandwich estimators and the rule of thumb for CRT applications
- The above simulations dispense with covariate adjustment; would that impact the recommendations?