The Use of Post-Intervention Data from Waitlist Controls to Improve Estimation of Treatment Effect in Longitudinal Randomized Controlled Trials


THE USE OF POST-INTERVENTION DATA FROM WAITLIST CONTROLS TO IMPROVE ESTIMATION OF TREATMENT EFFECT IN LONGITUDINAL RANDOMIZED CONTROLLED TRIALS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

Kimberly A. Walters, B.S.

*****

The Ohio State University

2008

Dissertation Committee:

Joseph S. Verducci, Adviser

Haikady N. Nagaraja

William I. Notz

Approved by

Adviser

Graduate Program in Biostatistics

© Copyright by Kimberly A. Walters 2008

ABSTRACT

In medicine and public health research, the randomized delayed-intervention controlled trial (RDICT), also known as a wait-listed or stepped wedge design, is commonly used to study overt, slow-acting treatments in comparison to a control condition over time. Ten RDICT designs are specified as generalizations of the motivating example, a longitudinal psychology study of a psychoeducational intervention for children with bipolar disorder. These designs vary according to number of observation occasions, time between observations, and length of delay before the control group receives treatment.

Two estimators of fixed effects in separate linear mixed effects (LME) models, $\hat{\theta}_2$ and $\hat{\theta}_1$, are proposed to measure treatment effect based on data from an RDICT design. The LME models have a piecewise linear mean structure, allowing phases for treatment, placebo, and leveling-off effects. The treatment effect is traditionally conceptualized as the difference in slopes between the immediate treatment (IT) and pre-intervention control groups, which we call $\theta_1$.

Alternately, in an RDICT design, the treatment effect can be the change in slope post-intervention in the delayed-treatment (DT) control group, called $\theta_0$. The full model, which allows these treatment effects to differ, produces the standard estimator, $\hat{\theta}_1$. A reduced model, nested within the full one, forces the inter and intra treatment effects to be identical and produces the novel estimator, $\hat{\theta}_2$.

A simulation study was conducted to observe the relative efficiency of $\hat{\theta}_2$ to $\hat{\theta}_1$ as it varies over the 10 RDICT designs and 8 scenarios, which differ in size of treatment effect, intraclass correlation, and sample allocation to DT group.

The best-performing and recommended RDICT design, called H2.5 with a DT:IT allocation ratio of 2:1, achieved a relative efficiency of 1.3 when the group-specific treatment effects are identical. The H2.5 design has the longest overall calendar duration of the 10 designs considered and is an extension of the design used in the motivating example study of childhood mood disorders.

This is dedicated to everyone suffering from invisible disabilities.

∗ ∗ ∗

“This spiritualist, this statistician, what are you anyway?” - Thomas Pynchon

ACKNOWLEDGMENTS

The writing of this dissertation is the culmination of a long journey, beginning with a few tentative steps, made possible with the belief and nurturing of my mother Gerda, sister Michelle, and father Robyn. Along the way, I found support and understanding from a very special group of women very nearly in my shoes, each on her own adventure of transformation and achievement. Equally essential to my success were the many dear friends who made my path gentler by holding my hand and bolstering my spirits, especially John and Kylene. I am also indebted to my health care providers Drs. Mary Kiacz, Lee Cohen and Kitty Soldano, who cared for my whole wellness; my employer-mentors Drs. Mary Fristad, Ram Tiwari, Tom Bishop and Chris Holloman, who gave me a chance as patient role models; and my eternally supportive adviser Dr. Joe Verducci, who showed me the business of ethical statistical practice and helped me negotiate my invisible wall.

VITA

November 14, 1974 ...... Born Maine, USA

1996 ...... B.S. Chemistry Rensselaer Polytechnic Institute

2002-2004 ...... Graduate Teaching Assistant OSU Statistics

2004-2006 ...... Graduate Research Assistant OSU Psychiatry

2006-2008 ...... Statistical Consultant OSU Statistical Consulting Service

2007 ...... Cancer Research Training Fellow National Cancer Institute

PUBLICATIONS

Research Publications

Y. Li; R. Tiwari; K. Walters; J. Zou “A Weighted-Least-Squares Estimation Approach to Comparing Trends in Age-Adjusted Cancer Rates Across Overlapping Regions”. Statistics in Medicine, in submission.

M. Fristad; J. Verducci; K. Walters; M. Young “The Impact of Multi-Family Psychoeducation Groups (MFPG) in Treating Children Aged 8-12 with Mood Disorders”. Archives of General Psychiatry, in revision.

FIELDS OF STUDY

Major Field: Biostatistics

Studies in:

Longitudinal Data: Prof. Joseph Verducci

Behavioral Interventions: Prof. Mary Fristad

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
    1.1 Domain
    1.2 Contribution to the Field
    1.3 Motivation
        1.3.1 Motivating Example
        1.3.2 RCT: The Standard Design
        1.3.3 RDICT: The Modified Design
        1.3.4 Self or Other as Control
        1.3.5 Overt Interventions
    1.4 Treatment Effect
        1.4.1 Study Objectives
        1.4.2 Definition
        1.4.3 Theory and Concept
        1.4.4 Operationalization
    1.5 Research Questions
        1.5.1 Model Assumptions
        1.5.2 Design Issues

2. Literature Review
    2.1 Introduction
    2.2 Repeated Measures
    2.3 Cross-sectional versus Longitudinal Studies
    2.4 Types of Treatment Effects
        2.4.1 The Effect in Cause and Effect
        2.4.2 Characteristics of Effect
        2.4.3 Characteristics of Condition
    2.5 Quasi-Experimental Designs
        2.5.1 No Controls
        2.5.2 Switching Replications
    2.6 Experimental Designs
        2.6.1 Cross-Over Studies
        2.6.2 Randomized Controlled Trials
        2.6.3 Longitudinal Designs
    2.7 “Classic” Wait-listed Design
    2.8 Models
        2.8.1 Independent Observations
        2.8.2 Correlated Data
        2.8.3 Marginal and Mixed Models
        2.8.4 Linear Mixed Models
    2.9 Estimation
        2.9.1 Likelihood Functions
        2.9.2 REML
    2.10 Inference
        2.10.1 Likelihood Ratio Test
        2.10.2 Wald Test
    2.11 Evaluation of Estimators
        2.11.1 A Sample Size Calculation
    2.12 Missing Data
    2.13 Summary

3. Models and Data
    3.1 Introduction
    3.2 MFPG: Study Description
        3.2.1 Types of Condition and Intervention
        3.2.2 Study Design
    3.3 Data Structure
        3.3.1 Outcome and Design
        3.3.2 Exploratory Data Analysis
        3.3.3 Missingness
    3.4 Rationale for Model
        3.4.1 Conceptual Motivation: Dynamic Modeling
        3.4.2 Linearization Phases
        3.4.3 Elements of the Model
        3.4.4 Choice of Random Effects
    3.5 Model for MFPG Data
        3.5.1 Time Convention
        3.5.2 Delayed Treatment Group
        3.5.3 RCT: Both Groups Without DI
        3.5.4 An Entrance-Centered Model for RDICT
        3.5.5 A Treatment-Centered Model for RDICT
        3.5.6 Inference
    3.6 Narrowing the Universe of Models
        3.6.1 Models: Mean Structure
        3.6.2 Models: Relative Treatment and Placebo Effects
        3.6.3 Models: Variance Components

4. Simulation Study
    4.1 Introduction
    4.2 Objective
    4.3 Notation
    4.4 Narrowing the Universe of Designs
        4.4.1 Observation Times: Resolution h of Unit Time
        4.4.2 Length of Delay d in Waitlist Control Group
        4.4.3 Summary of Constraints and Simulation Designs
    4.5 Protocol
        4.5.1 Software
        4.5.2 Scenarios
        4.5.3 Starting Seeds for Random Number Generation
        4.5.4 Level of Dependence Between Simulated Datasets
        4.5.5 Scenarios: Sample Size n
        4.5.6 Scenarios: Size of Treatment Effect λ
        4.5.7 Scenarios: Covariance Structure ρ
        4.5.8 Number of Simulations M
        4.5.9 Results Stored From Each Run
        4.5.10 Summary Measures of Performance
        4.5.11 Criteria for Comparison
    4.6 Results
        4.6.1 Results for MFPG Design and Extension
        4.6.2 Theoretical Standard Error
        4.6.3 Model-Based versus Empirical Standard Errors

5. Conclusions
    5.1 Distinguishing Features of the 10 Designs
    5.2 Standard Errors over the Designs
        5.2.1 Intra Estimator
        5.2.2 Inter Estimator
        5.2.3 Combined Estimator
    5.3 Relative Efficiencies over the Designs
        5.3.1 Balanced Sample Allocation
        5.3.2 Unbalanced Sample Allocation
        5.3.3 Other Allocation Plans
    5.4 Unanswered Questions
    5.5 Recommendations

Bibliography

LIST OF TABLES

Table

3.1 Within-Subject Correlation in MFPG

3.2 Attrition Rates in MFPG

4.1 Notation for Time Variables

4.2 Notation for Simulation Parameters

4.3 Summary of Simulation Designs

4.4 Possible Design Observation Times

4.5 Results for MFPG Design

4.6 Results for Enhanced MFPG Design

LIST OF FIGURES

Figure

1.1 Interpretations of Treatment Effect

1.2 Conceptual Plot of Treatment Effect

3.1 Average Response Behavior by Treatment Group for Entire Sample

3.2 Example Response Behavior in Immediate Treatment Group

3.3 Example Response Behavior in Delayed Treatment Group

3.4 Between-Time Scatter Plots for Detrended Data

3.5 Response Behavior by Missingness Patterns

3.6 Conceptual Plot of Response with Delayed Intervention

3.7 Conceptual Plot of Linearized Response with Delayed Intervention

3.8 Average Response Behavior by Treatment Group for Treated Subset

3.9 Simulation Mean Profiles

4.1 Treatment Period Divided Into Thirds

4.2 Treatment Period Divided Into Halves

4.3 Illustration of RDICT Designs Considered

4.4 Theoretical Standard Error for $\hat{\theta}_?$

4.5 Theoretical Standard Error for $\hat{\theta}_2$

4.6 Theoretical, Model-Based, and Empirical Standard Errors

4.7 Theoretical, Model-Based, and Empirical Relative Efficiencies

5.1 Design Properties: Calendar Duration

5.2 Simulation Results: Relative Efficiencies

5.3 Theoretical Standard Errors Over Full Range of Group Allocation

5.4 Theoretical Relative Efficiencies Over Full Range of Group Allocation

5.5 Comparison of Theoretical Relative Efficiencies Over Full Range of Group Allocation

CHAPTER 1

INTRODUCTION

1.1 Domain

This dissertation addresses longitudinal studies of overt, slow-acting behavioral-type interventions in human populations with chronic pathology where the number of subjects n well exceeds the number of time points m (n ≫ m), with probable attrition and possible spontaneous remission. The statistical advantages of a wait-listed design, called the randomized delayed-intervention controlled trial (RDICT), are considered.

1.2 Contribution to the Field

While post-intervention observations for the control group are generally disregarded in a wait-listed design [2], the present research investigates possible statistical advantages to utilization of these data in estimation of the treatment effect (TE).

1.3 Motivation

In medical studies involving human subjects, investigators may wish to offer a proposed treatment to all participants for ethical and practical reasons. In standard clinical trials, treatment is withheld from a subset of the study participants, who form a control group for comparison and who receive a placebo or the best-available treatment.

If treatment is given to the control group after a delay, as in the randomized delayed-intervention controlled trial (RDICT), both groups experience the effect of treatment, sooner or later. The motivation for the current research is to take advantage of this fact in estimation of treatment effect for such a wait-listed design.

1.3.1 Motivating Example

The motivating example is the Multi-Family Psychoeducation Group (MFPG) study [8], conducted by Dr. Mary Fristad between 2002 and 2006 at The Ohio State University (OSU) Medical Center, of n = 165 children diagnosed with bipolar mood disorder. The participants were observed m = 4 times over 18 months after random assignment to receive treatment at baseline or after a delay of one year. This way, all the severely impaired children enrolled in the study were offered a potentially beneficial intervention, and participants in the control group had an incentive to return for follow-up.

1.3.2 RCT: The Standard Design

The gold standard [7, 10] for evaluating evidence-based treatments in medical and public health research is the randomized controlled trial (RCT). The main ingredient of an RCT is the random assignment of subjects to either a treatment or control group. The presence of such a control group enables investigators to attribute any difference between the two groups, the so-called treatment effect, to the intervention.

A desirable feature of longitudinal, as opposed to cross-sectional, studies is repeated observations on the same subject over time, which facilitates measurement of change. Longitudinal RCTs, also known as parallel designs, are superior to observational studies or quasi-experiments in establishing a causal relationship between an intervention and a health benefit [18].

1.3.3 RDICT: The Modified Design

The randomized delayed-intervention controlled trial (RDICT), also called the wait-listed or stepped wedge design, as illustrated by the example study on bipolar children [8] presented in more detail hereafter, is an extension of the standard longitudinal RCT. The only difference is that subjects assigned to the control group receive treatment after a fixed, pre-specified delay period. There are clear ethical and practical advantages to the RDICT study design. It is the aim of this research to explore the statistical advantages of RDICT and provide guidelines for its use in practice.

1.3.4 Self or Other as Control

In crossover designs or no-control quasi-experiments [21], a person is compared to him- or herself under different conditions, often using a pre-post difference score.

The control arm of the RDICT stands on its own as a so-called interrupted time series design and allows this type of self-as-control analysis. In this case, the outcome trajectories for a subject are compared prior to and following the delayed intervention, and each participant serves as his or her own control. On the other hand, in the standard RCT design, the treatment group is compared to a wholly distinct, initially equivalent (via randomization) control group, which serves as the other.

1.3.5 Overt Interventions

The delayed-intervention design is ideal for measuring the treatment effect of an overt intervention for an undesirable medical pathology or social condition. Overt treatments are those to which participants cannot be blinded, such as psychological or physical therapy, surgical operation, or community education, because their delivery is obvious to the participant.

With the standard RCT design, since the treatment delivery is overt to the participant, control group members may become demoralized and drop out, knowing they will never receive intervention. On the other hand, the RDICT design is appealing for this sort of intervention since all the subjects receive treatment at some point during the course of the study.

Even when participants in a study cannot be masked to their treatment group membership, it is crucial that researchers performing outcome evaluation are nevertheless blinded, as is usual in the double-blinded study design.

1.4 Treatment Effect

Since the main objective of most medical intervention studies is to measure the effect of a treatment, it is of primary concern to define and operationalize a meaningful concept of treatment effect (TE).

1.4.1 Study Objectives

The research goal in both study designs is generally to estimate the effect of treatment, requiring a comparison between the treatment and control groups. The usual aim of a randomized controlled trial is two-fold: first, to determine whether there is a treatment effect via hypothesis testing, and, second, to estimate the size of that effect via point estimation. When using linear models, a common test of a nonzero treatment effect is the likelihood ratio test, which compares a full model including the treatment effect to a reduced model with no such effect. The power of this test equals the probability of rejecting the null hypothesis of no treatment effect given a certain nonzero effect.
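The notion of power described above can be illustrated by Monte Carlo simulation. The sketch below is a deliberate simplification and not the model used in this dissertation: it assumes independent Gaussian errors (no random effects), no intercept, a common slope `gamma`, and a slope difference `theta` for the treated group. The function names, default values, and design (20 subjects per group at times 0, 0.5, and 1) are all hypothetical choices made for illustration.

```python
import math
import random

CHI2_1DF_95 = 3.841  # approximate 0.95 quantile of chi-square with 1 df


def lrt_statistic(t, g, y):
    """LRT of H0: theta = 0 in y = gamma*t + theta*(g*t) + error (no intercept)."""
    n = len(y)
    x1 = t
    x2 = [gi * ti for gi, ti in zip(g, t)]
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    # Reduced model: one common slope for both groups.
    gamma0 = s1y / s11
    rss0 = sum((c - gamma0 * a) ** 2 for a, c in zip(x1, y))
    # Full model: solve the 2x2 least-squares normal equations.
    det = s11 * s22 - s12 * s12
    gamma1 = (s22 * s1y - s12 * s2y) / det
    theta1 = (s11 * s2y - s12 * s1y) / det
    rss1 = sum((c - gamma1 * a - theta1 * b) ** 2
               for a, b, c in zip(x1, x2, y))
    # For Gaussian linear models, -2 log(likelihood ratio) = n * log(RSS0 / RSS1).
    return n * math.log(rss0 / rss1)


def simulated_power(theta, gamma=-0.5, sigma=1.0, n_per_group=20,
                    times=(0.0, 0.5, 1.0), n_sims=500, seed=1):
    """Proportion of simulated datasets in which the LRT rejects H0 at the 5% level."""
    rng = random.Random(seed)
    t = [tj for _ in range(2 * n_per_group) for tj in times]
    g = [grp for grp in (0, 1) for _ in range(n_per_group) for _ in times]
    rejections = 0
    for _ in range(n_sims):
        y = [gamma * ti + theta * gi * ti + rng.gauss(0.0, sigma)
             for ti, gi in zip(t, g)]
        if lrt_statistic(t, g, y) > CHI2_1DF_95:
            rejections += 1
    return rejections / n_sims
```

Calling `simulated_power(0.0)` estimates the type I error rate, while calling it with a nonzero `theta` estimates power at that effect size. A mixed-model version would replace the least-squares fits with fits that account for within-subject correlation.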

1.4.2 Definition

One way to formalize treatment effect in a longitudinal study is to compare the rates of change, i.e., slopes or angles, in an outcome measurement between the treatment and control groups. Alternately, treatment effect can be conceptualized as a comparison of the extents of ultimate improvement rather than the rates of getting there, as shown in Figure 1.2. These are equivalent if the time required to achieve full effect of treatment is set to 1 unit, as is done in models to follow. The relationships between slopes, angles, and absolute differences are illustrated in Figure 1.1.

The concept of a treatment effect naturally decomposes into two main ingredients. The first ingredient is an observed change in the main outcome over time, anticipated to be in the direction of improvement. The second ingredient is the attribution of this change to the treatment, as isolated from other environmental or random influences.

Whether the effect is seen as additive or multiplicative depends on whether the change over time is modeled as a linear or nonlinear process. In either case, the second ingredient above defines treatment effect as the change in the treated group above and beyond the change in the controls. This definition requires a comparison between the two groups.

[Figure 1.1 plot: Symptom Severity (vertical axis) versus Time (horizontal axis); mean profiles labeled CONTROL and TREATMENT; annotations φ, ψ, γ, γ + θ, λE = tan(φ), E = tan(φ + ψ), and (1 − λ)E.]

Figure 1.1: Interplay of Various Conceptualizations of Treatment Effect (TE). The green and red lines represent the mean profiles for the treatment and control groups, respectively. E is the overall improvement in the treatment group over the TE period, which is a single unit of time by definition, implying a group-specific slope of γ + θ = −E. The placebo effect (PE) is represented by the angle φ or by the slope in the control group, γ = −λE. TE can be represented by the angle ψ or the difference in group slopes, θ = −(1 − λ)E.
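As a small numerical illustration of the relations stated in the caption of Figure 1.1, the following sketch converts an overall improvement E and a placebo fraction λ into the slopes γ and θ and the angles φ and ψ. The function name and the returned dictionary layout are hypothetical; only the formulas γ = −λE, θ = −(1 − λ)E, λE = tan(φ), and E = tan(φ + ψ) come from the figure.

```python
import math


def decompose_effect(E, lam):
    """Split overall improvement E into placebo and treatment parts.

    E   : overall improvement in the treatment group over one unit of time
    lam : fraction of E attributed to the placebo effect (0 <= lam <= 1)
    """
    gamma = -lam * E            # slope in the control group (placebo effect)
    theta = -(1.0 - lam) * E    # difference in group slopes (treatment effect)
    phi = math.atan(lam * E)    # angle representing the placebo effect
    psi = math.atan(E) - phi    # angle representing the treatment effect
    return {"gamma": gamma, "theta": theta, "phi": phi, "psi": psi}
```

For example, `decompose_effect(1.0, 0.5)` assigns half of the improvement to each effect on the slope scale, while the two angles differ because the tangent is nonlinear.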

In the presence of waitlist controls, it may be possible and even desirable to incorporate the post-intervention outcomes of the waitlist controls into the estimate of treatment effect. The feasibility of this incorporation depends on how similarly the two groups respond following intervention. Since the control group may experience a placebo effect, there is not as much room for improvement on the outcome scale as in the primary treatment group. In the ideal case considered here, the delayed intervention group will improve at the same rate as the primary group and level off at the same destination outcome value, as shown later in Figure 3.6.

1.4.3 Theory and Concept

In physical experiments, researchers impose conditions on a system and then measure variables of interest. As with medical research, effects of the experimental conditions may be instantaneous or gradual, temporary or permanent. One factor that distinguishes human experiments from physical is the well-known placebo effect. The conscious expectation of change by a study participant may have an effect, in absence of any other treatment. Mere recruitment into and participation in a study can incite the placebo effect. Such is the power of the human mind!

Ideally, to determine treatment effect, one would employ a time machine. A particular subject would be given a proposed treatment and observed sufficiently over time from delivery. Then, going back in time to the point of delivery, the subject would be observed withholding treatment. That way, all the internal and external influences would be identical, with the exception of treatment. Any difference would be clearly the effect of treatment.

Even in this ideal case, there would remain the question of whether treatment effect varies according to individual characteristics. Often, the aim is to determine the average effect over an entire population. Random sampling from that population is desirable to approach the true overall mean in estimation.

Often in clinical trials, the population of interest is all persons with a particular diagnostic criterion. A simple random sample of this population is near impossible, since a complete sampling frame includes those with undiagnosed illness. In addition, some diseases make participation in a study very difficult. Convenience samples are common in practice and introduce questions of generalizability.

Lacking a time machine, the best feasible alternative is to sample two groups which equally represent the population of interest and let only one group receive intervention. The other group represents what would have happened to the first group without intervention, taking the time machine trip to simulate the impossible counterfactual. Random allocation of recruited participants to the two groups achieves the goal that they equally represent the population, with large enough sample sizes.

This research is concerned with overt treatments, which cannot be concealed or bluffed to the subjects. In this case, the second, or control group, will not experience the same placebo effect as if taking a sugar pill, when ignorant of treatment group status. An effect is still possible, however, and cannot be ruled out in any case. Controls remain a necessity.

1.4.4 Operationalization

[Figure 1.2 plot: Symptom Severity (vertical axis) versus Time (horizontal axis); mean profiles labeled CONTROL and TREATMENT, with the gap between them labeled TREATMENT EFFECT.]

Figure 1.2: Possible Conceptualization of Treatment Effect. The green and red lines represent hypothetical continuous mean profiles for the treatment and control groups, respectively, over time from entry into a randomized controlled trial. The treatment effect (TE) may be defined as the ultimate difference in outcomes between the groups following stabilization of placebo and treatment effects. Alternately, TE can be a function of the rates to reach these endpoints in outcome.

For a linear model with first-order time as the principal covariate, the treatment effect can be operationalized as the difference in slopes, either between two groups or at a change-point within a group. In a nonlinear setting, with non-Gaussian outcomes or a nonlinear relationship with time, a difference or ratio between group parameters is established, ideally with a convenient interpretation. As seen in Figure 1.2, the treatment effect may simply be the difference in endpoints between a control and treatment group.
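The two slope-based operationalizations in this section can be written out as a pair of simplified mean models. The display below is a sketch under stated assumptions rather than the model developed later: the intercept $\beta_0$, the group indicator $g$ (1 for treatment, 0 for control), and the change-point notation $(t - d)_+$ are introduced here for illustration, leveling-off phases are omitted, and $\theta_1$, $\theta_0$, and $d$ follow the usage in the abstract and in Section 1.5.2.

```latex
\begin{align*}
  % Between-group comparison: difference in slopes between two groups
  E[Y \mid t, g] &= \beta_0 + \gamma\, t + \theta_1\, g\, t \\
  % Within-group comparison: change in slope at the change-point d
  E[Y \mid t] &= \beta_0 + \gamma\, t + \theta_0\, (t - d)_+ ,
  \qquad (t - d)_+ = \max(0,\; t - d)
\end{align*}
```

In the first line the treatment effect is the extra slope in the treated group; in the second it is the extra slope that appears in a single group once the delayed intervention begins at time $d$.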

1.5 Research Questions

The research questions for this dissertation concern a comparison of two possible treatment effect estimators, $\hat{\theta}_1$ and $\hat{\theta}_2$, both of which are functions of data observed in a randomized delayed-intervention controlled trial (RDICT). The first estimator, $\hat{\theta}_1$, represents the standard between-group comparison, while the second, $\hat{\theta}_2$, incorporates the post-intervention data from the control group, thus combining between- and within-group comparisons into a single estimator.

This comparison of estimators translates into a practical question for medical researchers. Is there a statistical advantage, beyond the ethical and practical advantages, to utilizing the RDICT design instead of the standard RCT design when studying behavioral interventions? Since the answer may depend on model and design parameters, simulation studies were conducted to examine various scenarios.

The relative efficiency of the two estimators was the primary evaluation criterion for comparison. In addition, power and mean squared error (MSE) were considered.
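To make these criteria concrete, the sketch below computes an empirical relative efficiency and MSE from replicate estimates such as those produced by a simulation. It assumes the convention that the efficiency of the novel estimator relative to the standard one is the ratio of the standard estimator's variance to the novel estimator's variance, so that values above 1 favor the novel estimator; the function and argument names are hypothetical.

```python
import statistics


def relative_efficiency(est_standard, est_novel):
    """Variance of the standard estimator divided by variance of the novel one."""
    return statistics.variance(est_standard) / statistics.variance(est_novel)


def mean_squared_error(estimates, true_value):
    """Average squared deviation of the estimates from the true parameter value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)
```

Here each argument is a list of estimates, one per simulated dataset, and `true_value` is the parameter value used to generate the data.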

1.5.1 Model Assumptions

Within the linear model developed here, two model features were varied over the scenarios for simulation. The first modification, controlled by λ, was the size of the treatment effect relative to the size of the placebo effect. Secondly, the correlation coefficient, ρ, determined the relative sizes of between- and within-subject variation for the exchangeable covariance structure.
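As a brief illustration of how ρ ties together the two variance components, the sketch below builds an exchangeable (compound symmetry) covariance matrix for m repeated observations on one subject. It assumes the usual random-intercept interpretation, in which ρ is the between-subject variance divided by the total variance; the function names are hypothetical.

```python
def intraclass_correlation(var_between, var_within):
    """Share of total variance due to between-subject variation."""
    return var_between / (var_between + var_within)


def exchangeable_covariance(m, total_var, rho):
    """m x m matrix with total_var on the diagonal and rho * total_var elsewhere."""
    return [[total_var if i == j else rho * total_var for j in range(m)]
            for i in range(m)]
```

For instance, a between-subject variance of 1 and a within-subject variance of 3 give ρ = 0.25, so any two observations on the same subject share one quarter of the total variance as covariance.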

1.5.2 Design Issues

Important features of the RDICT design were controlled. They were the number m and spacing h of time points, the delay d for the control group, and sample sizes $n_k$ for each group k = 0, 1. It is of interest whether a particular design accentuates any benefit of the RDICT-specific estimator, $\hat{\theta}_2$.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

The purpose of this chapter is to provide an overview of the history of longitudinal methodology, including modeling and design considerations, with randomized controlled trials as a starting point. A review of literature from the fields of statistics and medicine, as well as social research including psychology, provides the foundation for the current research question. The choices of design and model depend on the context of the medical disorder and type of intervention.

2.2 Repeated Measures

As Paul Albert stated in his 1999 article [1], there are three reasons to measure the same experimental unit over time. The first reason is to estimate measurement error or instrument consistency. It is unavoidable that time elapses between subsequent measurements since they cannot be made simultaneously. The second reason is for data safety monitoring purposes. As seen in recent years, for instance in the noteworthy example of a large-scale hormone replacement therapy trial, studies are subject to early termination when clear benefit or detriment is identified. The third reason, relevant to this research, is to evaluate change over time, such as the gradual effect of a slow-acting treatment, including maintenance or loss of effect during follow-up.

In considering the broad problem of measuring and explaining change, Ian Plewis [17] classified change models into linear, nonlinear, and stepwise, as a function of time, age, group membership, or other covariates.

2.3 Cross-sectional versus Longitudinal Studies

Cross-sectional observations are taken at a single point in history and capture an association between variables in a particular slice of time. Longitudinal observations are taken over time, generally on the same experimental units, although there are epidemiological studies that have repeated cross-sectional observations. Longitudinal studies allow investigators to track subjects over time and establish any effect after an intervention.

Diggle et al. [6] demonstrated that cross-sectional associations may disguise longitudinal effects in the presence of cohort differences. Suppose $\beta_C$ is the slope for the cross-sectional regression relationship between the outcome and predictor variables over the sample at time 1 and that $\beta_L$ is the longitudinal slope for the change in outcome within each individual as the predictor changes over time.

When cross-sectional studies are used to draw conclusions about development over time, the assumption that $\beta_C = \beta_L$ is made. This assumption is only applicable when there are little or no cohort effects, for example, when different birth cohorts pass through the same outcome levels at comparable ages.
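The distinction between the two slopes can be made explicit with a simple pair of regression equations. The display below is a sketch in the spirit of the formulation in [6], with indexing chosen here for illustration: $Y_{ij}$ and $x_{ij}$ denote the outcome and predictor for subject $i$ at occasion $j$, $\varepsilon_{ij}$ is an error term, and the intercept is omitted for brevity.

```latex
\begin{align*}
  % Cross-sectional relationship at the first occasion
  Y_{i1} &= \beta_C\, x_{i1} + \varepsilon_{i1} \\
  % Repeated observations separate the cross-sectional and longitudinal slopes
  Y_{ij} &= \beta_C\, x_{i1} + \beta_L\, (x_{ij} - x_{i1}) + \varepsilon_{ij} \\
  % Subtracting the first line from the second isolates the longitudinal slope
  Y_{ij} - Y_{i1} &= \beta_L\, (x_{ij} - x_{i1}) + (\varepsilon_{ij} - \varepsilon_{i1})
\end{align*}
```

The last line shows why within-subject change identifies $\beta_L$ without requiring $\beta_C = \beta_L$: the baseline term, and with it any cohort difference in level, cancels in the subtraction.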

When two groups are evaluated at different points in time, their differences can reflect a cohort effect, especially with young patients who are still developing. This is relevant to the question of whether delivery time affects treatment effect in wait-listed designs.

Relevant to the present research, Diggle et al. [6] noted that “each person can be thought of as serving as his or her own control” in a longitudinal study. This is similar to the idea of using the baseline, or pre, observation as a covariate in analysis of covariance (ANCOVA) with pre-post data.

A persistent question with repeated measures is the relative amounts of variation between subjects and within subjects. Longitudinal studies become more advantageous as the between-subject variation increases relative to the within-subject variation [6], since comparing each subject with him- or herself removes the between-subject component from the comparison.

2.4 Types of Treatment Effects

There are different scenarios requiring treatment, and additionally there are different ways in which the effect of a treatment may manifest. In order to estimate treatment effects, it is a useful exercise to systematically categorize their types to distinguish appropriate methods.

2.4.1 The Effect in Cause and Effect

The effect of treatment is defined relative to the natural progression of the disease condition in the absence of intervention, but perhaps in the presence of a placebo.

Shadish, Cook, and Campbell [21] state that “a central task for all cause-probing research is to create reasonable approximations to this physically impossible counterfactual.” The counterfactual in a longitudinal study is what would have happened to subjects had they not been exposed to treatment. For example, it is desirable to know the “treatment-free estimate of rate of change per time interval” when studying “spontaneous linear changes.”

2.4.2 Characteristics of Effect

Shadish et al. [21] categorize effects by their form, permanence, and immediacy. Although they present this taxonomy in the context of a so-called interrupted time series, it has more general application.

The forms of treatment effect may include level, trend, variance, and cyclical properties. The first two forms listed correspond to intercept and slope, respectively, in a simple linear model of time. In the presence of random assignment to treatment groups, it is expected that there are no intercept differences at baseline. The specific subforms corresponding to a treatment effect in the form of trend or drift are highly dependent on the model assumed for the mean structure, whether linear in time or not.

An effect is labeled according to its permanence as continuous if the effect is persistent or discontinuous if it dissipates over time. Continuous effects may be present only when an active treatment is present, such as blood pressure medicine or birth control pills. It may be that a continuous effect explores a simple state space, such as a vasectomy, which effects an essentially irreversible switch from fertile to sterile.

A continuous effect may require sustained treatment, such as daily medication for a chronic condition, or may be the result of a one-time exposure, such as a childhood vaccine or educational program.

On the other hand, the desired effect may be discontinuous, requiring only a finite course of treatment, such as antibiotics to cure a bacterial infection, reestablishing a natural and self-sustaining equilibrium of biological flora. Lastly, the discontinuous efficacy of a treatment may peak and then diminish after some time, such as for some vaccinations.

An effect is called, according to its immediacy, immediate or delayed depending on the time period between introduction and effect of a treatment. Immediate effects are perhaps simplest to observe since no followup is required. Delayed effects may complicate study in that experimental units must be monitored over time, the amount of delay may be unknown or inconsistent, and humans are harder to keep track of than laboratory mice.

There are some practical limitations to combinations of these characteristics. For example, with form and permanence, a continuous trend effect most likely has a limit, namely at zero symptoms or in a normal range for healthy humans. A discontinuous effect may result from removal of treatment, ceiling or floor effects, or a need for a booster treatment.

2.4.3 Characteristics of Condition

An undesirable health or social condition may be classiﬁed according to its permanence, just as with the eﬀect taxonomy. The condition may be acute and temporary, like a rash or cold, or chronic and potentially progressive, such as pain or dementia.

There is an interplay between effect permanence and condition permanence. A temporary condition only demands a discontinuous effect. A chronic condition requires a continuous effect, either through sustained active exposure to a treatment or via a permanent intervention. Whether treatment is continued or discontinued depends on whether the condition being treated is chronic or acute and whether a permanent cure is possible.

2.5 Quasi-Experimental Designs

Studies without randomization, without comparison groups, or with nonequivalent control groups are called quasi-experimental.

2.5.1 No Controls

In Shadish et al.’s repeated-treatment design in Chapter 4 [21], treatment is delivered, removed, and then reintroduced to a single group of participants at a later occasion. This design is practical only with discontinuous effects. It is of note that the authors consider the “most interpretable outcome of this design” to be the case where the treatment effect is similar on both exposures to the treatment and in the opposite direction from the change upon removal.

2.5.2 Switching Replications

In Shadish et al.’s switching replications design in Chapter 5 [21], the investigator initially gives treatment to one of two nonequivalent groups and “administers treatment at a later date to the group that initially served as a no-treatment control.”

The effect in the second group, a modified replication, may not be identical to that in the first due to the different context. The permanence of the treatment effect in the first group determines whether this group can be used as a control for the second.

The authors claim “the design is still useful even if the initial treatment continues to have an impact, especially if the control group catches up to the treatment group once the control group receives treatment.”

2.6 Experimental Designs

Studies utilizing randomization to treatment groups are called experimental.

2.6.1 Cross-Over Studies

Cross-over designs utilize randomization and expose subjects to multiple treatments in succession. They are useful in studying chronic conditions and temporary interventions. One main challenge with cross-over studies is the carryover eﬀects between one period of treatment and the next. They are not appropriate for interventions with a long-lasting or permanent eﬀect.

2.6.2 Randomized Controlled Trials

Randomization or random allocation of study participants to treatment status allows practical and statistical assumptions regarding the equivalence, comparability, and independence of the groups in order to reduce bias [18]. Controls approximate the desired counterfactual and establish the natural disease history or placebo eﬀects in the context of the study. The outcome data for the control and treatment groups are compared to establish an eﬀect for the novel proposed intervention.

The randomized controlled trial (RCT) came into popular use after World War II as the power of the placebo effect was widely recognized around 1955 [10]. According to Kaptchuk, “the placebo became the emblem for all the healing occurring in the disguised ‘no-treatment’ arm of an RCT,” including “nature taking its course; regression to the mean; routine medical and nursing care; regimens such as rest, diet, exercise, and relaxation; easing of anxiety by diagnosis and treatment; the patient-doctor relationship; classic conditioning and learnt behaviours; the expectation of relief and the imagination; and the will and belief of both the patient and the practitioner.”

2.6.3 Longitudinal Designs

In discussing the practical problems with longitudinal designs, Shadish et al. [21] note “it is not always ethical to withhold treatments from participants for long periods of time, and the use of longitudinal observations on no-treatment or wait-list control-group participants is rare because such participants often simply obtain treatment elsewhere.”

2.7 “Classic” Wait-listed Design

C. Hendricks Brown et al. [2] propose an extension to the wait-listed design in their 2006 article on youth suicide prevention trials. They state that “data from the second phase cannot be used to assess intervention impact because there is no control group to compare over that time period.” It is taken for granted that delayed treatment is given solely to satisfy the community partners.

This declaration is provocative in the context of this research, which aims to utilize the second phase data, i.e., post-intervention observations in the wait-listed (or delayed intervention) group. The challenge is to make a case that these data need not be discarded and ignored. Rather, the so-called second phase data may provide a statistical advantage to the RDICT (also called wait-listed) design.

The wait-listed design is a special case of the stepped wedge trial design as discussed by Celia Brown and Richard Lilford [3]. The stepped wedge design allows more than two randomized groups, each corresponding to a later time for introduction of treatment.

The RDICT design essentially contains two experiments, a within-subjects design and a between-subjects design. The latter, standard RCT, compares the immediate treatment (IT) and delayed treatment (DT) groups. The former quasi-experiment compares the DT group to itself before and after intervention. In an overview of behavioral experimental design, the German psychologist Joachim Krauth [11] warns against the former, conceding “one cannot rule out that within-subjects designs may produce the same results as between-subjects designs if certain assumptions are valid, which are diﬃcult to check”.

Krauth continues, “If we keep in mind that, in general, only appropriately performed between-subjects designs admit a causal interpretation, it is obvious which results can be trusted when using both kinds of designs for the same problem.” The RDICT design may present an opportunity for both results to be observed in harmony within a single study to strengthen the evidence for or against a treatment effect.

2.8 Models

The linear mixed model established by Laird and Ware in 1982 [12] is built upon a history of simpler models presented here.

2.8.1 Independent Observations

In both longitudinal and cross-sectional studies, a basic modeling approach is the linear model (2.1), which regresses a response variable for each of $N$ units on one or many ($p$) covariates, resulting in the ordinary least squares (OLS) estimate, $\hat{\beta}_{OLS} = (X^T X)^{-1} X^T y$, which minimizes the residual sum of squares, $RSS = (y - X\hat{\beta})^T (y - X\hat{\beta})$, without any distributional assumptions. In this model, $y$ is the outcome vector, $X$ is the design matrix, $\beta = (\beta_1, \ldots, \beta_p)^T$ are the regression parameters, and $\epsilon$ are the random measurement errors.

\[ \underset{N \times 1}{y} = \underset{N \times p}{X} \, \underset{p \times 1}{\beta} + \underset{N \times 1}{\epsilon}, \tag{2.1} \]

\[ \underset{N \times 1}{y} = [y_1^T, \ldots, y_n^T]^T, \qquad \underset{N \times p}{X} = [X_1^T, \ldots, X_n^T]^T. \tag{2.2} \]

During this notation development, in order to anticipate extension of this simple model, suppose the $N$ units are partitioned into $n$ clusters of size $m_i$ so that $N = \sum_{i=1}^{n} m_i$. In longitudinal medical trials, the clustering mechanism reflects $i = 1, \ldots, n$ human subjects observed at $j = 1, \ldots, m_i$ times each, where often $m_i \equiv m = N/n$. An effort will be made to use this notation throughout this document.

The individual-level outcomes $y_i = (y_{i1}, \ldots, y_{im})^T$ and covariates $X_i = [x_{i1}, \ldots, x_{im}]^T$ (of dimension $m \times p$) are the building blocks of the entire-study variables (2.2), where $x_{ij}$ is the $p \times 1$ vector of covariates for subject $i$ at time $j$. The error term $\epsilon$ is similarly built from the individual errors $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{im})^T$.

Under the assumption of independent error terms with distribution $\epsilon \sim MVN(0, \sigma^2 I)$, the OLS estimator $\hat{\beta}_{OLS}$ is also the maximum likelihood estimator (MLE). This assumption implies $Y_i \overset{iid}{\sim} MVN(X_i\beta, \sigma^2 I)$.
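As a small illustration of the OLS estimate in this subsection, the following sketch fits a straight line in time to a single toy outcome profile. It is only the special case $p = 2$ (intercept and slope) of $(X^T X)^{-1} X^T y$, written in closed form; the function names and the data values are invented for illustration and are not taken from any study.

```python
# Minimal OLS sketch for y = b0 + b1 * t + error (p = 2), standard library only.
# The toy data below are invented for illustration; they are not study data.

def ols_line(t, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return intercept, slope

def rss(t, y, intercept, slope):
    """Residual sum of squares for a candidate line."""
    return sum((yi - intercept - slope * ti) ** 2 for ti, yi in zip(t, y))

times = [0, 1, 2, 3]                 # four observation occasions
outcome = [30.0, 26.0, 25.0, 21.0]   # hypothetical severity scores

b0, b1 = ols_line(times, outcome)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}, RSS = {rss(times, outcome, b0, b1):.2f}")
```

Any other line, for instance one with a slightly perturbed slope, yields a larger residual sum of squares on the same data.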

2.8.2 Correlated Data

Longitudinal studies vary considerably according to the relative values n and m as well as the variable types. Cases where there are many observations (large m) for few subjects (n) may inspire a time series or functional data approach. Link functions are used in generalized linear models for discrete outcomes, and survival analysis can be employed for outcomes that indicate an event occurrence.

The main challenge of longitudinal data is the nonzero correlation of outcome variables within a subject, and the OLS solution is hence insufficient since it assumes independence of the measurement error terms in $\epsilon$. The impact of this within-subject correlation, generally in the positive direction, depends on whether the comparison of interest is between multiple groups at the same time or between different times within an individual.

A general rule of thumb [6] is that positive within-subject correlation increases the variance in estimation of diﬀerences between groups and decreases it for comparisons within a subject. Clearly, increases in variance for an estimator diminish the power of any corresponding tests, requiring larger sample sizes. The intuition behind this general rule is that highly correlated observations within a subject leave the remaining bulk of variation between the subjects, making it harder to summarize groups of individuals.
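The rule of thumb can be checked with two textbook variance formulas. The sketch below assumes an exchangeable correlation $\rho$ with common variance $\sigma^2$, two groups of $n$ subjects with $m$ observations each, a between-group contrast formed from subject means, and a within-subject contrast formed from the change between two occasions. These simplifications and the function names are mine, chosen for illustration.

```python
# Variance of two simple contrasts under exchangeable correlation (illustrative).

def var_between_group_difference(sigma2, rho, n, m):
    """Variance of the difference in group means, each group mean averaging
    the m-observation subject means of n subjects."""
    return 2.0 * sigma2 * (1.0 + (m - 1) * rho) / (n * m)

def var_within_subject_change(sigma2, rho, n):
    """Variance of the average change between two occasions over n subjects."""
    return 2.0 * sigma2 * (1.0 - rho) / n

for rho in (0.0, 0.4, 0.8):
    between = var_between_group_difference(sigma2=100.0, rho=rho, n=50, m=4)
    within = var_within_subject_change(sigma2=100.0, rho=rho, n=50)
    print(f"rho = {rho:.1f}: between-group variance = {between:.2f}, "
          f"within-subject variance = {within:.2f}")
```

As $\rho$ grows, the first variance increases and the second decreases, which is the pattern described above.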

The standard linear model (2.1) is extended to the longitudinal context in

\[ y_i = X_i\beta + \epsilon_i \quad \text{for independent units } i = 1, \ldots, n, \tag{2.3} \]

where $Y_i \overset{ind}{\sim} MVN(X_i\beta, V_i)$, the variance matrices $V_i$ are a function of the covariance parameter $\alpha$, and a common simplifying assumption dictates $V_i \equiv \sigma^2 V_0$.

The weighted least squares solution, $\hat{\beta}_{WLS}$, minimizes $RSS = (y - X\hat{\beta})^T W (y - X\hat{\beta})$, where $W$ is some symmetric weight matrix. The choice of $W$ is an important part of estimating robust (or empirical) standard errors for $\hat{\beta}$, and $W^{-1}$ is called the working variance matrix.

For short evenly-spaced time sequences with minimal missingness, estimation of $\beta$ and of $Var(\hat{\beta})$ via WLS, weighted least squares, is robust to misspecification of the covariance structure. Unless inference regarding $\alpha$ is desired, simple correlation assumptions suffice [6].

Combining the outcomes of distinct subjects in (2.3) by vertically stacking entries of $Y_i$ and $X_i$ to compose $Y$ and $X$, respectively, gives

\[ Y \sim MVN(X\beta, H \equiv \sigma^2 V), \tag{2.4} \]

where the unscaled covariance matrix $H \equiv I_{n \times n} \otimes V_i$, and the scaled covariance matrix $V \equiv I_{n \times n} \otimes V_0$ under the common individual covariance ($V_i \equiv \sigma^2 V_0$) assumption with $\sigma^2$ as the common within-subject variance.

The general and two specific weighted least squares estimates for $\beta$ are

\[ \hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W y \]

\[ \hat{\beta}_{OLS} = (X^T X)^{-1} X^T y \]

\[ \hat{\beta}_{GLS} = (X^T V^{-1} X)^{-1} X^T V^{-1} y, \]

where the weight matrices are the identity in $\hat{\beta}_{OLS}$ and the inverted, scaled variance matrix in $\hat{\beta}_{GLS}$, the generalized least squares (GLS) estimator. Note that $\hat{\beta}_{OLS}$ is the same OLS estimate referred to as $\hat{\beta}$ above.

The variance of the WLS estimator is

\[ R_W \equiv Var(\hat{\beta}_{WLS}) = \{(X^T W X)^{-1} X^T W\} \, H \, \{W X (X^T W X)^{-1}\} \]

and reduces in the special GLS case to the formulation in equation (2.5).

\[ Var(\hat{\beta}_{GLS}) = (X^T H^{-1} X)^{-1} \tag{2.5} \]

Diggle et al. [6] compare the efficiency of $\hat{\beta}_{GLS}$ relative to $\hat{\beta}_{OLS}$ in the cases of compound symmetry, exponential correlation, and a crossover design. The more $X_i$ and $V_i$ vary between subjects, the less suitable is the OLS estimate.

The various approaches to longitudinal data analysis differ primarily in their handling of $V$, the variance of the outcome random vector. Imbedded in the general model above are several assumptions that may not be true in some study designs, e.g., an equal number ($m$) of observations per subject, identical within-subject covariance ($V_i = \sigma^2 V_0$), and between-subject independence (the block diagonal form of $V$).

2.8.3 Marginal and Mixed Models

The two main linear models employed in longitudinal data analysis are marginal models and linear mixed eﬀects (LME) models. They diﬀer in distribution and correlation structure assumptions, estimation methods, parameter magnitude and interpretation, and separate versus joint modeling of the linear predictor (Xiβ) and variance components.

The primary goals of most longitudinal analyses are consistent and eﬃcient estimators for the regression parameter (β) and variance (V ) as well as robust estimates of their corresponding standard errors for the purpose of valid inference. In addition, the analysis should address potential bias introduced by missing observations, a particular concern in longitudinal trials. The choice of model necessarily depends on the aims of the study, speciﬁcally its hypotheses and their scope.

Marginal models are quite flexible in modeling the structure of the variation (V) in the response variable while LME models assume particular structures to this variation by invoking and conditioning on a latent random variable by subject. In some cases, these models for V are equivalent in some sense. For example, a random intercept LME model imposes the same structure to V as does the marginal model with an exchangeable correlation. The difference is that, with LME, the variation is partitioned into two parts, and inference is made conditional on one of these.

In marginal models, the population average of the response variable is regressed over a selection of covariates of interest. The models essentially integrate over all other attributes of the entire population for each level of the explanatory variable.

The parameters in a marginal model reﬂect the eﬀect of the chosen covariates on the population average, which is often the interest in epidemiological or other public health applications.

Diggle, Heagerty, Liang, and Zeger [6] categorize longitudinal methods that do not collapse repeated measurements into a single summary statistic into three types. The marginal analysis models estimate the parameters determining the population means, β, and variances, α, separately. Random effect models assume the regression coefficients vary randomly in the population. Transition models allow the distribution of an observation at a particular measurement occasion to depend on the previous outcome values and current covariates.

Diggle et al. [6] suggest that the choice between marginal and linear mixed effects models may depend on the relative sizes of variation between and within subjects. When there is a large variability among subjects, within-subject comparisons are more precise than between-subject (or group) contrasts, which may benefit from LME models. If, on the other hand, there is little variation between subjects, marginal models may be appropriate.

Marginal models answer questions regarding the trend of a population mean over time rather than the evolution of individual profiles. In the case of an outcome with a Gaussian distribution and a linear contrast of interest, these are equivalent since expectation is a linear operator.

2.8.4 Linear Mixed Models

In linear mixed effects models, a latent variable based on some unobservable covariates is assumed to distinguish the heterogeneity of the population. The model parameters then explain the regression of the outcome on the observed covariates of interest, within an individual stratum of the population sharing the same latent property. The marginal models, on the other hand, overlook this sometimes inexplicable heterogeneity by averaging it out over the entire population.

Marginal models are used when investigators are interested only in the effect of population-specific covariates on an entire population. LME models assess covariate effects within unidentified subsets of individuals who share an unmeasured or unmeasurable property. The hierarchical nature of mixed models makes them analogous to a Bayesian approach [5].

In the LME model (2.6), the random variation in (2.3) is divided into individual random effects and measurement error. The $Z_i$ are generally a submatrix of the full design matrix $X_i$, and $q \le p$, where $q$ and $p$ are the numbers of random ($b_i$) and fixed ($\beta$) effects, respectively. It is commonly assumed that $\Sigma_i = \sigma^2 I_m$, or that the observations within a subject are independent given $b_i$.

\[ y_i = X_i\beta + Z_i b_i + \epsilon_i, \tag{2.6} \]

where $b_i \overset{iid}{\sim} N(0, D)$ independent of $\epsilon_i \overset{iid}{\sim} N(0, \Sigma_i)$.

The distributional assumptions in (2.6) imply $Y_i \overset{ind}{\sim} MVN(X_i\beta, V_i \equiv Z_i D Z_i^T + \Sigma_i)$.
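To make the implied marginal covariance concrete, the sketch below builds $V_i = Z_i D Z_i^T + \Sigma_i$ for the simplest case, a random intercept model, where $Z_i$ is a column of ones, $D$ is a single variance (called `tau2` here), and $\Sigma_i = \sigma^2 I_m$. The variance values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
# Marginal covariance implied by a random intercept LME model (illustrative).

def random_intercept_cov(m, tau2, sigma2):
    """V_i = Z D Z^T + sigma2 * I with Z a column of ones and D = [tau2]."""
    return [[tau2 + (sigma2 if j == k else 0.0) for k in range(m)]
            for j in range(m)]

def to_correlation(v):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    m = len(v)
    return [[v[j][k] / (v[j][j] * v[k][k]) ** 0.5 for k in range(m)]
            for j in range(m)]

# Hypothetical variance components: between-subject 40, within-subject 60.
V = random_intercept_cov(m=4, tau2=40.0, sigma2=60.0)
R = to_correlation(V)

for row in R:
    print("  ".join(f"{value:.2f}" for value in row))
```

Every off-diagonal correlation equals `tau2 / (tau2 + sigma2)`, so the random intercept model reproduces the exchangeable (compound symmetry) structure of the corresponding marginal model.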

2.9 Estimation

Of primary concern in statistical analysis is to produce “good” estimates of the mean structure parameters. Often, one or several parameters are of key interest and have an interpretable meaning with regard to the research objective, e.g., determining the eﬀect of a proposed treatment. Maximum likelihood methods are presented here for mean and covariance parameters in linear mixed models.

2.9.1 Likelihood Functions

Under assumption (2.4), the likelihood function depends on the parameters ($\beta$, $V_0$, $\sigma^2$). Substituting $\hat{\beta}$, as a function of $V_0$, for $\beta$ allows for the solution $\hat{\sigma}^2 = RSS(V_0)/(nm)$ and reduction via profiling of the log likelihood function, up to an additive constant, to

\[ l(V_0) = -\frac{n}{2}\{m \log RSS(V_0) + \log(|V_0|)\}, \]

where $RSS(V_0) = \sum_i (y_i - X_i\hat{\beta})^T V_0^{-1} (y_i - X_i\hat{\beta})$. Demidenko [5] derived a considerably simplified variance-profiled log likelihood function for a balanced observation random intercept linear mixed model.

Often, assumptions about the mean structure affect the saturation of $X$. A fully saturated model estimates a separate mean for each possible covariate condition. Diggle et al. [6] note that misspecification of $X$ will bias the estimates of $\sigma^2$ and $V_0$. REML, restricted maximum likelihood, circumvents this bias.

In marginal models, it is often necessary to model $\beta$ and $\alpha$ separately. When $V_i(\alpha)$ is known, the likelihood of $\beta$ is maximized by the standard MLE (2.7), which maximizes (2.8), where $\theta = (\beta^T, \alpha^T)^T$. An alternate approach, restricted maximum likelihood (REML), is often employed in order to estimate the variance components parameter $\alpha$.

\[ \hat{\beta} = \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} \sum_{i=1}^{n} X_i^T V_i^{-1} y_i \tag{2.7} \]

\[ L_{ML}(\theta) = \prod_{i=1}^{n} (2\pi)^{-m/2} \, |V_i|^{-1/2} \exp\left\{ -\frac{1}{2} (y_i - X_i\beta)^T V_i(\alpha)^{-1} (y_i - X_i\beta) \right\} \tag{2.8} \]

2.9.2 REML

Restricted maximum likelihood estimation (REML) transforms the response vector into $U_{(nm-p) \times 1} = A^T Y$, called error contrasts, which have a distribution depending only on the variance components and not on the fixed effects. The matrix $A$ has $(nm - p)$ linearly independent columns, each orthogonal to the columns of the design matrix $X$. The resulting restricted likelihood (2.9) depends only on $\alpha$. Alternatively, the REML (joint) likelihood function (2.10) can be solved simultaneously for $\theta$ to get REML estimators for $\beta$ and $\alpha$. In practice, MLE and REML estimators differ drastically only in models including a large number ($p$) of explanatory variables.

\[ L(\alpha) \propto \left| \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right|^{-1/2} L_{ML}(\hat{\beta}(\alpha), \alpha) \tag{2.9} \]

\[ L_{REML}(\theta) = \left| \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right|^{-1/2} L_{ML}(\theta) \tag{2.10} \]

The likelihoods are optimized using iterative methods, such as procedures based on Newton-Raphson. The parameter space is $\Theta = \Theta_\beta \times \Theta_\alpha$, where $\alpha$ determines $V_i$ and $\theta$ is an element of the space. Note that both $D$ and $\Sigma_i$ must be positive semi-definite in the mixed effects model, resulting in a more restrictive parameter space than for the marginal model, where this requirement applies only to $V_i$.

REML estimation decomposes the outcome vector $Y$ into two parts, one depending only on $\beta$ and the other only on $\alpha$. For example, $Y$ can be broken into its orthogonal projection onto the column-space of $X$ and the OLS residual vector. The REML procedure results in $\tilde{\sigma}^2 = RSS(V_0)/(nm - p)$ and a new reduced log likelihood of

\[ l^*(V_0) = l(V_0) - \frac{1}{2}\log(|X^T V^{-1} X|). \]

The difference between ML and REML estimates is most pronounced when $p$ is large or $V$ is near-singular.
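The two variance estimates differ only in their divisors, $nm$ for ML and $nm - p$ for REML, which the following sketch makes explicit. The residual sums of squares and the number of fixed effects ($p = 6$) are hypothetical values chosen for illustration, not results from any fitted model.

```python
# ML versus REML estimates of sigma^2 for a given RSS(V0) (illustrative values).

def sigma2_ml(rss, n, m):
    """ML estimate: RSS divided by the total number of observations nm."""
    return rss / (n * m)

def sigma2_reml(rss, n, m, p):
    """REML estimate: RSS divided by nm minus the number of fixed effects p."""
    return rss / (n * m - p)

# Large study: n = 165 subjects, m = 4 occasions, p = 6 fixed effects (assumed).
large_ratio = sigma2_reml(66000.0, 165, 4, 6) / sigma2_ml(66000.0, 165, 4)

# Small study with the same p: n = 5 subjects, m = 4 occasions.
small_ratio = sigma2_reml(1400.0, 5, 4, 6) / sigma2_ml(1400.0, 5, 4)

print(f"REML/ML ratio, large study: {large_ratio:.4f}")
print(f"REML/ML ratio, small study: {small_ratio:.4f}")
```

The ratio of the two estimates is $nm/(nm - p)$, so it is close to one when $p$ is small relative to $nm$ and grows as $p$ approaches $nm$.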

2.10 Inference

Hypothesis testing for linear mixed models is achieved using various tests, including likelihood ratio tests (LRT) and Wald tests.

2.10.1 Likelihood Ratio Test

The likelihood ratio test (LRT) statistic for comparing a reduced model nested within a full model equals $-2 \log(L_{reduced}/L_{full})$. Under the null hypothesis of the reduced model, the LRT statistic asymptotically follows a chi-squared distribution with degrees of freedom equal to the number of parameters constrained in the reduction between models.
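A minimal sketch of the test follows for the case in which the reduced model imposes a single constraint, so that the reference distribution has one degree of freedom. The log likelihood values are hypothetical, and the sketch assumes they come from ML rather than REML fits, since REML likelihoods are not comparable between models with different fixed effects.

```python
import math

# Likelihood ratio test for nested models differing by one parameter (df = 1).

def lrt_statistic(loglik_reduced, loglik_full):
    """-2 log(L_reduced / L_full), computed from maximized log likelihoods."""
    return -2.0 * (loglik_reduced - loglik_full)

def chi2_upper_tail_df1(x):
    """P(chi-squared with 1 df > x), using chi2_1 = Z^2 for standard normal Z."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))

# Hypothetical maximized log likelihoods (ML fits, not REML); not study results.
loglik_reduced = -2051.3
loglik_full = -2049.1

stat = lrt_statistic(loglik_reduced, loglik_full)
p_value = chi2_upper_tail_df1(stat)
print(f"LRT statistic = {stat:.2f}, p-value = {p_value:.4f}")
```

For reductions that constrain more than one parameter, the one-degree-of-freedom tail function above would need to be replaced by the general chi-squared survival function.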

2.10.2 Wald Test

Inference for the fixed effects involves the hypotheses (2.11) and the asymptotic distribution of a test statistic (2.12), where $K_{s \times p}$ is full-rank and $Q_n^2 \overset{d}{\longrightarrow} \chi^2_s(0)$ as $n \to \infty$ under $H_0$. This Wald test underestimates the standard error of $\hat{\beta}$ by ignoring variability introduced by estimating $\alpha$. Verbeke and Molenberghs [24] discuss approximate tests (t and F) to compensate for this bias.

\[ H_0 : K\beta = 0 \qquad H_a : K\beta \ne 0 \tag{2.11} \]

\[ Q_n^2 = [K(\hat{\beta} - \beta)]^T \left[ K \left( \sum_{i=1}^{n} X_i^T V_i^{-1}(\hat{\alpha}) X_i \right)^{-1} K^T \right]^{-1} [K(\hat{\beta} - \beta)] \tag{2.12} \]

The so-called sandwich (or empirical variance) estimator (2.13) for $Var(\hat{\beta})$ has been shown to be robust to misspecification of the correlation structure but not to incorrect linear predictors of the mean or missing data, where $r_i = y_i - X_i\hat{\beta}$ and $\hat{V}_i = r_i r_i^T$.

\[ Var(\hat{\beta}) = \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} \left( \sum_{i=1}^{n} X_i^T V_i^{-1} [r_i r_i^T] V_i^{-1} X_i \right) \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} \tag{2.13} \]
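The sandwich form in (2.13) is easiest to see in a scalar case. The sketch below assumes a single covariate with no intercept ($p = 1$), a working independence matrix $V_i = I$, and $K = 1$, so that the "bread" and "meat" are plain sums. It then forms a Wald-type statistic using the sandwich variance in place of the model-based variance in (2.12), which is a substitution made here for illustration. The three toy subjects are invented.

```python
import math

# Scalar sandwich variance and Wald-type test under working independence.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sandwich_wald(x_by_subject, y_by_subject):
    """Return (beta_hat, robust variance, Wald statistic, p-value) for p = 1."""
    # Bread: sum over subjects of x_i^T x_i (V_i = I by assumption).
    bread = sum(dot(x, x) for x in x_by_subject)
    beta_hat = sum(dot(x, y) for x, y in zip(x_by_subject, y_by_subject)) / bread
    # Meat: sum over subjects of x_i^T r_i r_i^T x_i = (x_i^T r_i)^2.
    meat = 0.0
    for x, y in zip(x_by_subject, y_by_subject):
        resid = [yj - beta_hat * xj for xj, yj in zip(x, y)]
        meat += dot(x, resid) ** 2
    var_robust = meat / bread ** 2
    q = beta_hat ** 2 / var_robust                 # tests H0: beta = 0
    p_value = math.erfc(math.sqrt(q / 2.0))        # chi-squared, 1 df
    return beta_hat, var_robust, q, p_value

# Invented toy data: three subjects observed at times 0, 1, 2, 3.
x_data = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
y_data = [[1, -2, -5, -7], [0, -1, -3, -8], [2, -3, -4, -5]]

beta_hat, var_robust, q, p_value = sandwich_wald(x_data, y_data)
print(f"beta_hat = {beta_hat:.3f}, robust SE = {math.sqrt(var_robust):.3f}, Q = {q:.1f}")
```

With only three subjects the empirical variance is itself very noisy, so this is a demonstration of the algebra rather than of a usable analysis.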

2.11 Evaluation of Estimators

In order to compare the two estimators for the RDICT design, it is necessary to establish a metric for the performance of the estimator. Besides minimizing bias, increasing efficiency and maximizing power to detect practically significant treatment effects are desirable. In addition, confidence interval coverage and mean squared error (MSE) are meaningful assessment measures for a particular method [4].

There are a multitude of criteria used to evaluate the strength of a design. Tu et al. [22] derive the power function for linear mixed models based on the asymptotic variance of the parameter estimator. The power function is used by other authors as well [14, 9].

Winkens et al. [25] use the c-optimality criterion of minimizing Var(cT βˆ) where the contrast vector c isolates the parameter of interest. Several authors [20, 9] have used a variance inflation factor (VIF) or “design effect”, which depends on the intraclass correlation coefficient (ICC), ρ. Ouwens et al. [15] utilize the D-optimality criterion of minimizing det{Var(βˆ)}.

Winkens et al. [25] further utilize the c-optimality criterion to introduce the relative eﬃciency of two designs, which is the ratio of c-optimal variances.

2.11.1 A Sample Size Calculation

Although I devise and employ a crude sample size calculation for the simulation studies in section 4.5.5, a more elaborate one from the literature is presented below.

The per-group sample size required [6] to detect a between-group difference of $d$, in the slopes for the fixed and common covariate $x_{m \times 1}$, between two groups with power $P = 1 - Q$ and type I error $\alpha$ is

\[ n = \frac{2(z_\alpha + z_Q)^2}{d^2} \cdot \frac{\sigma^2 (1 - \rho)}{m s_x^2}, \]

where $s_x^2 = Var(x_{m \times 1})$ and $\rho$ is the correlation between any two distinct observations within a subject. Note that $n$ decreases as $\rho$ increases with this statistical question.

The second factor above is $\sigma^2$ times the $(2,2)$ element of $(X_i^T R_i^{-1} X_i)^{-1}$ for exchangeable correlation and can be generalized, where $R_i = Corr(y_i)$.
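The sample size formula can be evaluated directly, as in the sketch below. Two conventions are assumptions on my part: $z_p$ is taken as the upper $p$ quantile of the standard normal (a one-sided test, so a two-sided test would pass `alpha / 2`), and $s_x^2$ is the divide-by-$m$ variance of the observation times. The inputs ($\sigma^2 = 100$, $\rho = 0.4$, $d = 2$) are hypothetical and are not estimates from any study.

```python
import math
from statistics import NormalDist, pvariance

# Per-group sample size for detecting a difference d in slopes (illustrative).

def upper_quantile(p):
    """Upper p quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - p)

def per_group_n(d, sigma2, rho, x, alpha=0.05, power=0.80):
    """Evaluate n = [2 (z_alpha + z_Q)^2 / d^2] * [sigma2 (1 - rho) / (m s_x^2)]."""
    m = len(x)
    s2x = pvariance(x)            # divide-by-m variance of the covariate
    q = 1.0 - power
    first = 2.0 * (upper_quantile(alpha) + upper_quantile(q)) ** 2 / d ** 2
    second = sigma2 * (1.0 - rho) / (m * s2x)
    return first * second

# Hypothetical inputs: four equally spaced occasions, slope difference of 2.
n_required = per_group_n(d=2.0, sigma2=100.0, rho=0.4, x=[0, 1, 2, 3])
n_rounded = math.ceil(n_required)
print(f"required subjects per group: {n_required:.2f}, rounded up to {n_rounded}")
```

Evaluating the function over a grid of $\rho$ values shows the required $n$ shrinking as the within-subject correlation increases, as noted above.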

In a problem where group differences are of primary interest, change over time may be disregarded by collapsing observations within a subject to an average. The sample size then increases with ρ, since the variance of an average increases with the covariances between the terms being summed. Diggle et al. [6] demonstrate this reversal with continuous and binary outcomes.

2.12 Missing Data

Missing data mechanisms are categorized as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The ignorability of these mechanisms depends on the method of estimation used. Two likelihood based approaches to modeling missing data are selection and pattern mixture models, which diﬀer in the factorization of a joint likelihood for the response vector and missingness pattern.

2.13 Summary

The linear mixed effects (LME) model is a flexible and useful tool for fitting longitudinal data from a randomized controlled trial. In the next chapter, an LME model is fit to the data from the motivating example and generalized for the simulation study in the following chapter.

CHAPTER 3

MODELS AND DATA

3.1 Introduction

In this chapter, the analysis of the MFPG dataset is presented to motivate a general model framework for the simulation study in the next chapter.

3.2 MFPG: Study Description

The NIMH-funded Multi-Family Psychoeducation Group (MFPG) study [8], conducted by Dr. Mary Fristad of The Ohio State University (OSU), was a randomized controlled trial to assess the eﬀect of a psychoeducational intervention by following the mood severity of n = 165 children at m = 4 evenly-spaced times over 18 months, as seen in Figure 3.1. To qualify for inclusion in MFPG, the children aged 8-11 years were diagnosed with either bipolar or major depressive disorder at baseline. In each case, a primary informant also participated in interviews and treatment. Families were enrolled quarterly in 11 cycles of 15 families.

3.2.1 Types of Condition and Intervention

The intervention consisted of 8 weekly group sessions for the children and caregivers run by trained psychologists. The session sizes varied from 2 to 8 participants, depending on attrition for each enrollment cycle. Of the 15 families enrolled in each cycle, 7 were randomly assigned to receive treatment immediately (k = 1) and the remaining 8 were scheduled to receive treatment after 1 year as waitlist controls (k = 0), so that overall n0 = 87 and n1 = 78.

[Figure 3.1 plot: mean MSI (vertical axis, 15 to 35) against observation time (0 to 3), with separate series labeled IMMEDIATE and DELAYED.]

Figure 3.1: Mean Outcome Evolution by Treatment Group over Time for Entire (Intent-to-Treat) Sample. The green diamonds and red squares represent the means in mood severity index (MSI) for the immediate treatment (IT) and delayed treatment (DT) groups, respectively. The error bars represent point-wise 95% confidence intervals for the group mean at each observation occasion. Note that intervention was delivered at observation times 0 and 2, respectively, for the IT and DT groups.

According to the taxonomy developed in section 2.4, the condition of a mood disorder is generally chronic and potentially progressive, getting worse over time without treatment. A chronic condition requires a continuous treatment effect to maintain healthy symptom ranges. The form of the treatment effect is basically a trend from pathology toward normal followed by leveling-off, possibly in the fashion of a sigmoid curve.

The psychoeducational intervention is permanent and irreversible, since a person cannot undo exposure to a course of learning. The effect of this type of intervention is gradual, in that it takes some time for the full treatment effect to develop, and it is continuous, although like any knowledge, it may wear off in the long run. Refresher courses may be beneficial.

[Figure 3.2 plot, titled "Primary Treatment": a 4 × 4 grid of per-participant panels showing MSI (0 to 60) against time (0.0 to 3.0).]

Figure 3.2: Sample Outcome Profiles in Immediate Treatment Group. Observed individual profiles in mood severity index (MSI) are shown in separate panels for a random subset of participants from the immediate treatment group. Lines that do not extend for the full time indicate a loss to followup. Note that intervention for this group occurred at observation time 0.

[Figure 3.3 plot, titled "Delayed Intervention": a 4 × 4 grid of per-participant panels showing MSI (0 to 60) against time (0.0 to 3.0).]

Figure 3.3: Sample Outcome Profiles in Delayed Treatment Group. Observed individual profiles in mood severity index (MSI) are shown in separate panels for a random subset of participants from the delayed treatment group. Lines that do not extend for the full time indicate a loss to followup. Note that intervention for this group occurred at observation time 2.

3.2.2 Study Design

The particular randomized delayed-intervention controlled trial (RDICT) design chosen specified observation of a main outcome, or indicator of disease severity, for each individual at m = 4 equally-spaced timepoints, indexed by j = 0, 1, 2, 3. The delay, d, for this RDICT design was 2 observation timepoints.

Approximately half of the participants, n1 = 78, with group indicator k = 1, were selected via randomization to receive the overt treatment immediately following their baseline measurements, j = 0. The remaining half of study participants, n0 = 87, with k = 0, acted as controls until they received the delayed intervention following the third timepoint, j = d ≡ 2.

Since the control group was observed following treatment, specifically at the fourth time corresponding to j = 3, observations taken on this group may or may not contribute useful information in the estimation of treatment effect.

3.3 Data Structure

The continuous outcome variable for MFPG was mood severity index (MSI), which has a range of 0 to 133. The MSI range for a healthy child is generally 0 to 10, while a score above 35 signiﬁes severe pathology. The approximate ranges 10 to 20 and 20 to 35 in MSI represent the symptom-only and diagnosis categories, respectively. The outcome of interest, MSI, combined scores from instruments evaluating symptoms of depression and mania in the child, as reported by the caregiver.

3.3.1 Outcome and Design

For each of the n = 165 study participants, the MSI values were assessed via psychological interviews at each of m = 4 observation occasions, evenly spaced by six-month intervals. This study is technically a group or cluster randomized trial (GRT), although that feature is disregarded in the present research.

3.3.2 Exploratory Data Analysis

Observed individual profiles over time for a random subset of participants by treatment group are shown in Figures 3.2 and 3.3. The sample mean evolution by treatment group with pointwise confidence intervals is demonstrated in Figure 3.1. To assess the extent of intra-subject dependence, the sample correlations calculated from pairwise complete detrended data are presented in Table 3.1 and Figure 3.4. There is fairly strong and stable correlation that does not diminish over time, suggesting an exchangeable correlation structure.

[Figure 3.4 plot, titled "Detrended Outcomes": a 4 × 4 scatterplot matrix for Time 0 through Time 3, with axes running from −20 to 40 and the sample correlations 0.31, 0.39, 0.36, 0.49, 0.51, and 0.48 displayed above the diagonal.]

Figure 3.4: Within-Subject Correlation for Detrended MFPG Data. The group- and time-specific means were subtracted from the observed mood severity index (MSI) outcomes. Histograms and pair-wise scatter plots are shown along and below the main diagonal, respectively. The sample correlations are displayed above the main diagonal in font sizes proportional to their magnitude.

          Time 0   Time 1   Time 2   Time 3
Time 0    1        0.31     0.39     0.36
Time 1             1        0.49     0.51
Time 2                      1        0.48
Time 3                               1

Table 3.1: Sample Correlations for Pairwise Complete Detrended MFPG Data. These values were used to assess the suitability of the random intercept model as well as the strength of the intraclass correlation.

3.3.3 Missingness

A major consideration in any longitudinal study is the eﬀect of missing observations on the validity of any models used or inferences made. Since the dropout patterns in this study were essentially only monotone, missingness is summarized by the attrition shown in Table 3.2. Overall, 28% (22 of 78) of the immediate treatment and 39% (34 of 87) of the delayed treatment participants were lost to followup.

Group             Time 0    Time 1    Time 2    Time 3   Total
Immediate (k=1)   78 (8)    70 (9)    61 (5)    56       78
Delayed (k=0)     87 (13)   74 (13)   61 (8)    53       87

Table 3.2: Number of Participants Remaining by Time and Treatment Group. For each time point, the number of subjects remaining in the MFPG study is listed by treatment group. In parentheses is the number of dropouts following that time period.

The response behavior was compared for these four dropout patterns in Figure 3.5 to see if data were missing completely at random (MCAR), i.e., if the outcome response was similar regardless of last interview time. It is especially clear that, in the delayed treatment (DT) group, the MSI for participants leaving the study differed greatly according to the time of their last interview. In particular, the 13 DT group participants who dropped out immediately after the second observation (j = 1) had much lower mean MSI scores at that time than did those who remained in the study longer.

[Figure 3.5: two panels, Immediate Treatment and Delayed Treatment, each plotting Average MSI (20 to 40) against Time (0 to 3).]

Figure 3.5: Outcome Profiles by Last Observation Time and Treatment Group. The observed mean profiles in mood severity index (MSI) are shown for the 4 monotone missing data patterns in each group, immediate and delayed treatment. The single dot represents participants lost to followup after observation time 0. The solid, dashed, and dotted lines represent, respectively, those who dropped out after time 1 or 2 and study completers.

3.4 Rationale for Model

With m = 4 timepoints and 2 treatment groups, the 8 different means could be modeled free of constraint, as in a repeated measures analysis of variance (RM-ANOVA). On the other hand, certain segments may be fit to a straight line. The piecewise linear mean structure is motivated by a smooth dynamic model curve, since the outcome levels are presumably changing continuously over time. The variance components are motivated by conjecture and the observed sample correlations. Given the sample correlations in Table 3.1, an exchangeable correlation structure seems appropriate, since the off-diagonal values, varying from .31 to .51, do not clearly decrease with time lag separation.
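The claim that the correlations do not clearly decrease with lag can be checked with simple arithmetic on the Table 3.1 values. The following is only an illustrative sketch; the function name `mean_by_lag` is hypothetical and not part of the original analysis.

```python
# Off-diagonal sample correlations from Table 3.1, keyed by (earlier, later) occasion.
corr = {
    (0, 1): 0.31, (0, 2): 0.39, (0, 3): 0.36,
    (1, 2): 0.49, (1, 3): 0.51,
    (2, 3): 0.48,
}

def mean_by_lag(corr):
    """Average the correlations that share the same lag (later - earlier)."""
    by_lag = {}
    for (a, b), r in corr.items():
        by_lag.setdefault(b - a, []).append(r)
    return {lag: sum(v) / len(v) for lag, v in sorted(by_lag.items())}

lag_means = mean_by_lag(corr)
overall = sum(corr.values()) / len(corr)

for lag, r in lag_means.items():
    print(f"lag {lag}: mean correlation {r:.3f}")
print(f"overall: {overall:.3f}")
```

The lag-1 and lag-2 averages are about 0.43 and 0.45, and the single lag-3 value is 0.36, so there is no steady decline of the kind an autoregressive structure would produce.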

3.4.1 Conceptual Motivation: Dynamic Modeling

An alternative to the piecewise linear model of outcome response, the perhaps more realistic dynamic model represents the outcome as a continuous function over time with parameters according to times of treatment seeking and commencement of intervention, as shown in Figure 3.6. It is proposed that individuals suﬀering from a psychiatric or other medical problem may seek treatment during a crisis or peak in the evolution of their symptoms.

By this reasoning, if participants are recruited as they seek treatment, the disease symptom trajectory would reach a maximum with slope of zero at entry into the study. After this point, there is a natural decline in symptoms away from the crisis maximum, which may be called spontaneous remission, even in the absence of treatment. The hypothesized treatment effect is reflected in a more rapid symptom decline in the treatment group.

[Figure 3.6: plot of Symptom Severity versus Time (0 to 3), with the PRIMARY TREATMENT and DELAYED INTERVENTION times marked.]

Figure 3.6: Possible Conceptualization of Delayed Intervention Response. The green line represents a hypothetical continuous response in the immediate treatment (IT) group. The IT response gradually improves after receiving intervention (indicated by vertical line at time 0) following a sigmoid curve and then levels off. The red line represents a hypothetical dynamic response in the delayed treatment (DT) group. The DT response initially improves due to placebo effect and levels off prior to treatment at time 2 (indicated by vertical line). Following intervention, the DT response gradually decreases, experiencing the treatment effect, and levels off at the same final outcome value as the IT group.

The rapid improvement in the disorder symptomology cannot be sustained indefinitely, since at some point the level will presumably enter the healthy range of a person with no diagnosis. One possible model feature to capture this attenuation of treatment effect would specify a changepoint, or change in concavity or deceleration, in the symptom trajectory toward and approaching a stable level on the outcome scale. It is reasonable to dictate this changepoint at a specified elapsed time from intervention.

[Figure 3.7: plot of Symptom Severity versus Time (0 to 3) with points at the observation occasions and the PRIMARY TREATMENT and DELAYED INTERVENTION times marked.]

Figure 3.7: Possible Conceptualization of Linearized Delayed Intervention Response. The dotted lines represent the hypothetical dynamic model from Figure 3.6. No matter the underlying model, observation points may capture an approximately linear profile, as illustrated.

If the model in Figure 3.6 were explored using the m = 4 RDICT design in MFPG, a mean profile such as that shown in Figure 3.7 may result, which is not at all unlike the actual mean profile for the treated subset (n = 116) of MFPG data in Figure 3.8.

3.4.2 Linearization Phases

It is sometimes useful to “break up the curvilinear growth trajectories into separate linear components ... to compare growth rates during two different periods.” [19] The piecewise linear form of the model divides the time scale into four distinct linear phases using three changepoints. The phases are 1) placebo effect, 2) placebo leveling, 3) treatment effect, and 4) treatment leveling.

[Figure 3.8: plot of Mean MSI (15 to 35) versus Time (0 to 3) for the IMMEDIATE and DELAYED groups.]

Figure 3.8: Mean Outcome Evolution by Treatment Group for Treated Subset. The green diamonds and red squares represent the means in mood severity index (MSI) over time for the immediate treatment (IT) and delayed treatment (DT) groups, respectively. The error bars represent the point-wise 95% confidence intervals for the mean at each observation occasion. Note that the intervention was delivered at observation times 0 and 2, respectively, for the IT and DT groups.

A key feature of a model is $h_{TE}$, the duration of phase 3. An educated guess at $h_{TE}$ provides guidance to researchers in designing a study. It is the length of time required to observe full or near-full treatment impact, prior to the leveling off in phase 4.

The first two phases potentially occur after entry into the study but in the absence of and prior to treatment. Phase 1 is observed only in the delayed-intervention control group. Phase 2 is observed depending on whether the delay for intervention exceeds the time to placebo leveling or not. None of the designs, real or proposed, presented in this research investigates phase 2 of the model.

The last two phases occur following intervention, with or without a delay. Phase 3, observed for all subjects under the RDICT design, represents the initial treatment impact. The final phase (phase 4) is a leveling off, attenuation, or loss of the initial treatment impact, which may or may not be captured by the study design depending on the length of followup.

The changepoint between phases 1 and 2 occurs when the spontaneous remission or placebo effect wears off. The second changepoint, between phases 2 and 3, occurs at the administration of treatment, marking the beginning of treatment impact. The final changepoint, between phases 3 and 4, occurs once the treatment has done its work and participants either stabilize, continue slight improvement, or regress into pathology.

3.4.3 Elements of the Model

For the MFPG data, based on the mean profiles in Figure 3.1, the immediate treatment group experiences phases 3 and 4, treatment effect and leveling, while the control group experiences phases 1 and 3, placebo and treatment effects. It is assumed that the delay, d = 2 time periods, for intervention in the latter group was not long enough for the placebo effect to wear off, so the second phase of the model was not invoked by the MFPG design.

Both treatment groups are thus piecewise linear with a hinge at j = 2 or 12 months. For the control group, the trajectory was assumed to be linear in two pieces, one before and one after treatment delivery at 12 months, after a delay of d = 2 observation times. Any improvement before the treatment may be considered spontaneous or placebo-like.

The immediate treatment group could potentially follow a straight line over the entire duration of the study, 18 months, but this is unlikely. At some point, it would be expected that the outcome would level off after maximum treatment effect is exhausted. In MFPG, the attenuation or leveling-off changepoint was after $h_{TE}$ = 12 months or h = 2 timepoints. This may have been anticipated but was more likely determined based on preliminary or pilot study results.

An additional possible constraint to this piecewise linear model with 4 distinct slopes requires the post-treatment slopes (prior to stabilization) to be equal, resulting in a model with only 3 slopes. This constraint is key to the present research, which asks whether treatment eﬀect varies with the timing of delivery. If not, the delayed intervention in RDICT may provide useful information in estimation of the TE.

3.4.4 Choice of Random Eﬀects

The natural grouping variable for random effects in MFPG is the individual and perhaps the small group sessions formed for treatment delivery. The repeated time measurements, level 1, are nested within participant, level 2, and participants are nested within session, level 3. Considered for inclusion in the linear mixed effects model were a random intercept and slope for participant as well as a random intercept by session. Decisions regarding inclusion of a random effect are made via comparison of estimated standard deviations for those effects. Venables and Ripley in Chapter 10 [23] also suggest a comparison of models with and without random effects.

3.5 Model for MFPG Data

Consider symptom severity data of the form $y_{ijk}$, indexed by treatment group k = 0, 1, participant number i = 1, ..., $n_k$, and occasion j = 0, ..., m − 1. These data represent m repeated observations (over time) on $n_k$ subjects in each of the groups k. The k = 1 group is the immediate treatment group, while k = 0 references the control group.

3.5.1 Time Convention

The time variable, $t_j$ or $t_{jk}$, is indexed by the measurement occasion j ∈ 0 : (m − 1), or by the occasion and the group k ∈ {0, 1} if necessary for distinction. So far, there are two time scales for the study: the calendar time of days and months, and the observation time integers j ∈ 0 : (m − 1), which map onto the former time scale. There are a few special integers, the hinge h ∈ 1 : (m − 1) and the delay d ∈ 1 : (m − 2), which lie on the latter time scale.

To standardize the statistical modeling, a third time scale was constructed. In subsequent modeling, the scale of the time variable, $t_j$ or $t_{jk}$, is such that a single unit of time equals the treatment effect time period, which was $h_{TE}$ = 12 months in MFPG. In this design, the unit of time is divided into halves, or h = 2 time periods, by observation points that are $h_{TE}/h$ = 6 months apart, so that $h_{TE}$ calendar time elapses between observations j = 0 and j = h.

Setting time at study entrance to $t_0 \equiv 0$ for both groups, as in so-called “entrance-centered” models, then dictates $t_h \equiv 1$ by the convention above. For MFPG, this forces the m = 4 time points $t = (t_j : j \in 0 : 3)$ to be (0, 1/2, 1, 3/2).
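The mapping between the three time scales can be written out explicitly. The sketch below is illustrative only; the helper names `scaled_times` and `calendar_months` are hypothetical, and exact fractions are used so that thirds are represented without rounding.

```python
from fractions import Fraction

def scaled_times(m, h):
    """Entrance-centered model times t_j = j / h, where one unit equals h_TE.

    m -- number of observation occasions (j = 0, ..., m - 1)
    h -- number of observation periods per treatment effect period
    """
    return [Fraction(j, h) for j in range(m)]

def calendar_months(m, h, h_te_months):
    """Calendar time of each occasion, given h_TE expressed in months."""
    return [j * Fraction(h_te_months, h) for j in range(m)]

# MFPG: m = 4 occasions, h = 2, h_TE = 12 months.
mfpg_times = scaled_times(m=4, h=2)
mfpg_months = calendar_months(m=4, h=2, h_te_months=12)
print([str(t) for t in mfpg_times])
print([int(x) for x in mfpg_months])
```

For MFPG this reproduces the model times (0, 1/2, 1, 3/2) and the six-month spacing of the interviews.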

3.5.2 Delayed Treatment Group

The waitlist controls form their own quasi-experiment, what Shadish et al. [21] called an interrupted time series design. A simple model for the $n_0$ = 87 participants in this group (see Table 3.2) is shown below (3.1) for i = 1, ..., $n_0$, j = 0, ..., 3, assuming $b_{i0} \overset{iid}{\sim} N(0, \tau^2)$ independently of $\epsilon_{ij0} \overset{iid}{\sim} N(0, \sigma^2)$.

\[ y_{ij0} = b_{i0} + \alpha_0 + \gamma_1 t_j + \theta_0 (t_j - t_d)^+ + \epsilon_{ij0} \qquad (3.1) \]

This random intercept model is essentially segmented linear with a changepoint after d = 2 observations. The pre-intervention slope $\gamma_1$ reflects any placebo effect. The random intercepts $b_{i0}$ reflect individual variation, while the $\epsilon_{ij0}$ reflect measurement error, for example, the psychometric properties of the instrument. Without debating the external validity of $\theta_0$ as a measure of treatment effect (TE), we fit this model to obtain the following estimates: $\hat{\theta}_0$ = −6.97 (5.72), $\hat{\sigma}$ = 12.14, and $\hat{\tau}$ = 11.71.
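As a brief worked step, the random intercept model (3.1) implies a constant within-subject correlation, which is why it pairs naturally with the exchangeable structure suggested by Table 3.1. The derivation below is a sketch under the model's stated independence assumptions; the numerical value simply plugs in the estimates reported above.

```latex
\begin{aligned}
\mathrm{Var}(y_{ij0}) &= \tau^2 + \sigma^2, \\
\mathrm{Cov}(y_{ij0}, y_{ij'0}) &= \tau^2 \quad (j \neq j'), \\
\mathrm{Corr}(y_{ij0}, y_{ij'0}) &= \frac{\tau^2}{\tau^2 + \sigma^2}
  \approx \frac{11.71^2}{11.71^2 + 12.14^2} \approx 0.48 .
\end{aligned}
```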

3.5.3 RCT: Both Groups Without DI

A model incorporating data from both groups but ignoring the delayed treatment (DT) is presented below (3.2) for i = 1, ..., $n_k$, j = 0, ..., 2, k = 0, 1, assuming $b_{ik} \overset{iid}{\sim} N(0, \tau^2)$ independently of $\epsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$. Note that the last (j = 3) observation is dropped since this is the post-treatment observation for the control group. It is also dropped for the immediate treatment group since the last time point is post-attenuation and hence does not inform directly regarding the TE (but would increase the degrees of freedom).

\[ y_{ijk} = b_{ik} + \alpha_k + (\gamma_1 + I_{[k=1]} \theta_1) t_j + \epsilon_{ijk} \qquad (3.2) \]

shown separately by group as

\[ y_{ij0} = b_{i0} + \alpha_0 + \gamma_1 t_j + \epsilon_{ij0} \]

\[ y_{ij1} = b_{i1} + \alpha_1 + (\gamma_1 + \theta_1) t_j + \epsilon_{ij1} \]

We fit this RCT model, ignoring the post-DI observation time, to obtain the following estimates: $\hat{\theta}_1$ = −6.55 (3.10), $\hat{\sigma}$ = 12.50, and $\hat{\tau}$ = 9.95.

3.5.4 An Entrance-Centered Model for RDICT

An entrance-centered model, where $t_0 \equiv 0$ and $t_h \equiv 1$ with h = 2 (for the halving of $h_{TE}$) and d = 2, is an extension of (3.2) and is shown separately by group in the models below for i = 1, ..., $n_k$, j = 0, ..., 3, k = 0, 1. This model encompasses the full dataset, combining the quasi-experiment and RCT approaches to treatment effect estimation.

\[ y_{ij0} = b_{i0} + \alpha_0 + \gamma_1 t_j + \theta_0 (t_j - t_d)^+ + \epsilon_{ij0} \]

\[ y_{ij1} = b_{i1} + \alpha_1 + (\gamma_1 + \theta_1) t_j + \gamma_2 (t_j - t_h)^+ + \epsilon_{ij1} \]

We fit this EC-Full model to obtain the following estimates: $\hat{\theta}_1$ = −6.41 (3.05), $\hat{\theta}_0$ = −6.99 (5.79), $\hat{\sigma}$ = 12.30, and $\hat{\tau}$ = 10.38. This model is “full” in the sense that it still estimates the two measures of treatment effect separately. In order to force these to be a single parameter, the time variables were shifted in the treatment-centered model that follows. The treatment-centered model is also helpful when d ≠ h.

3.5.5 A Treatment-Centered Model for RDICT

In order to constrain the parameters of the full entrance-centered (EC-Full) model so that there is a single parameter θ for the treatment effect in both groups, the times for the delayed treatment (DT) group were shifted down by $t_d$ so that $t_{d0} \equiv 0$. For MFPG, d = 2 and $t_d$ = 1, so the new time values for the DT group were $t_0 = (t_{jk} : j \in \{0, \ldots, 3\}, k = 0) = (-1, -1/2, 0, 1/2)$.

The reduced treatment-centered (TC-Reduced) family of linear mixed effects (LME) models with a random intercept has

\[ y_{ijk} = b_{ik} + \alpha_k + \gamma_1 t_{jk} + \theta_2 (t_{jk})^+ + \gamma_2 (t_{jk} - t_h)^+ + \epsilon_{ijk} \]

assuming $b_{ik} \overset{iid}{\sim} N(0, \tau^2)$ independently of $\epsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2)$; i = 1, ..., $n_k$; j = 0, ..., m − 1; k = 0, 1.

We fit this TC-Reduced model to obtain the following estimates for the MFPG dataset: $\hat{\theta}_2$ = −6.41 (3.05), $\hat{\sigma}$ = 12.29, and $\hat{\tau}$ = 10.39.

In matrix notation, the TC-Reduced model family has

\[ y_{ik} = X_{ik} \beta + z_{ik} b_{ik} + \epsilon_{ik}, \]

where the parameter is

\[ \beta = (\alpha_0 \;\; \alpha_1 \;\; \gamma_1 \;\; \theta_2 \;\; \gamma_2)^T . \]
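To make the matrix form concrete, the fixed-effects design matrix for one subject can be built directly from the treatment-centered times and the two hinge terms. This is a minimal sketch under the MFPG settings (m = 4, h = 2, d = 2); the function names are hypothetical. The rows it produces agree with the MFPG design matrices printed in Section 4.5.5.

```python
def positive_part(x):
    """Hinge function (x)^+ = max(x, 0)."""
    return max(x, 0.0)

def tc_reduced_design(group, m=4, h=2, d=2):
    """Fixed-effects design matrix for one subject under the TC-Reduced model.

    Columns follow beta = (alpha_0, alpha_1, gamma_1, theta_2, gamma_2).
    group -- 0 for delayed treatment (DT), 1 for immediate treatment (IT)
    """
    shift = d if group == 0 else 0        # DT times are shifted down by t_d
    rows = []
    for j in range(m):
        t = (j - shift) / h               # treatment-centered time t_jk
        rows.append([
            1.0 if group == 0 else 0.0,   # alpha_0 indicator
            1.0 if group == 1 else 0.0,   # alpha_1 indicator
            t,                            # multiplies gamma_1
            positive_part(t),             # multiplies theta_2: (t_jk)^+
            positive_part(t - 1.0),       # multiplies gamma_2: (t_jk - t_h)^+, t_h = 1
        ])
    return rows

X_dt = tc_reduced_design(group=0)
X_it = tc_reduced_design(group=1)
for row in X_dt + X_it:
    print(row)
```

Because the DT times are negative before the delayed intervention, the hinge column for $\theta_2$ is zero until treatment is delivered, which is how a single parameter serves both groups.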

A full treatment-centered (TC-Full) equivalent to the full entrance-centered model in Section 3.5.4 adds a parameter to this TC-Reduced model by replacing $\theta_2$ with the group-specific treatment effect parameters $\theta_1$ and $\theta_0$.

3.5.6 Inference

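Recalling equations (2.7) and (2.13), the standard estimators [6] for the mean and its variance are presented here again.

\[ \hat{\beta} = \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} \sum_{i=1}^{n} X_i^T V_i^{-1} y_i \]

\[ \mathrm{Var}(\hat{\beta}) = \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} \left( \sum_{i=1}^{n} X_i^T V_i^{-1} [r_i r_i^T] V_i^{-1} X_i \right) \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} \]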

3.6 Narrowing the Universe of Models

Although the true model may be a nonlinear curve with one trajectory for the untreated population and another trajectory for the treated population, it may perhaps be adequately approximated by a piecewise linear mean structure, with various slopes depending on treatment status and time since treatment.

3.6.1 Models: Mean Structure

The mean structure is very simple, once certain assumptions are established. We assumed the placebo effect period $h_{PE}$ equals the treatment effect period $h_{TE}$ for convenience, so that both effects occur over h observation periods from entrance to the study. Furthermore, we never considered a design that has a delay d beyond the placebo effect h. This would unnecessarily extend the total length of the study and hence increase the cost and attrition.

[Figure 3.9: four panels, RCT (h=3), RDICT (h=3,d=1), RDICT (h=2,d=1), and RDICT (h=3,d=3), each plotting Mean Outcome against Time (0 to 2).]

Figure 3.9: Example Mean Profiles for Simulation. Each plot shows an example scenario for a randomized controlled (RCT) or delay-intervention controlled (RDICT) trial, as indicated. The green and red lines represent the expected mean outcomes over time for the immediate treatment (IT) and control groups, respectively. The RCT design shows placebo effect (PE) in the control group and the treatment effect (TE) in the IT group, with h = 3 indicating the frequency of observations in a unit of time. For the 3 RDICT designs presented, h = 2, 3 for resolution and d = 1, 3 to specify the delay in treatment (represented by a vertical line) for the delayed treatment (DT) controls. The dotted red lines represent the RCT control group behavior (in absence of DT).

For the immediate treatment group, there is an improvement over h observation periods followed by an indefinite plateau. For the control group, there is a lesser placebo improvement until either there is a flat period (in RCT) after h time periods or there is increased improvement (in RDICT) after d time periods, when delayed intervention is administered. In all cases considered here, d ≤ h. In the latter RDICT case, at observation time j = d, the mean profile trends downward in a parallel fashion to that of the immediate treatment group until it reaches the same destination plateau.

Example mean profiles are shown in Figure 3.9.

3.6.2 Models: Relative Treatment and Placebo Eﬀects

The improvement in the immediate treatment group was arbitrarily set to 10 units decreasing over the single unit of time corresponding to the treatment effect period, $h_{TE}$. The parameter λ ∈ [0, 1] indicates how many units of improvement (proportionally to immediate group improvement) constitute the placebo effect over the placebo effect time period, $h_{PE}$, which was assumed to be equivalent to $h_{TE}$ and hence also represented by a single unit of time in our model. As a consequence, the post-intervention slope equals −10, while the pre-intervention slope equals −10λ, resulting in a treatment effect of θ = −(1 − λ)10. According to the TC-Reduced model for the MFPG data, $\hat{\lambda}$ = 0.38, recalling λ is the proportion of the total effect due to spontaneous remission.
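The relationship between λ and the slopes is simple arithmetic, sketched below with hypothetical helper names. The check against MFPG is an assumption on our part: it combines the reported treatment effect estimate of −6.41 with the total change of −10.28 quoted for the EC-Full model in Section 3.6.3, which may not be exactly how $\hat{\lambda}$ was computed.

```python
E = 10.0  # total IT improvement over one unit of time (set arbitrarily in the text)

def slopes_from_lambda(lam, total=E):
    """Return (pre-intervention slope, treatment effect, post-intervention slope)."""
    placebo_slope = -lam * total               # gamma_1 = -10 * lambda
    treatment_effect = -(1.0 - lam) * total    # theta = -(1 - lambda) * 10
    return placebo_slope, treatment_effect, placebo_slope + treatment_effect

def lambda_from_estimates(total_change, treatment_effect):
    """Proportion of the total change not attributed to treatment."""
    return 1.0 - treatment_effect / total_change

for lam in (0.2, 0.4):
    print(lam, slopes_from_lambda(lam))

# Reported MFPG figures (assumed pairing): theta-hat = -6.41, total change = -10.28.
lambda_hat = lambda_from_estimates(-10.28, -6.41)
print(round(lambda_hat, 2))
```

With λ = .2 and .4 the treatment effects are −8 and −6, and the MFPG figures give a proportion of about 0.38, in line with the value stated above.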

3.6.3 Models: Variance Components

The between-subject random intercept standard deviation τ was set to 10 to mimic the real data example, where $\hat{\tau}$ = 10.38 was on the order of the change in the immediate treatment group over the full treatment effect period $h_{TE}$, which was $\hat{\gamma}_1 + \hat{\theta}_{inter}$ = −10.28 according to the EC-Full model. The within-subject residual standard deviation σ was then controlled via ρ, the intraclass correlation coefficient, where $\rho \equiv \tau^2 / (\tau^2 + \sigma^2)$. For example, $\hat{\rho}$ = 0.42 under the EC-Full model for the real dataset.

Having established a framework for the generating model, we consider the design space for the simulation study in the next chapter.
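Solving the definition of ρ for σ shows how the residual standard deviation follows from τ and a chosen intraclass correlation. The sketch below uses hypothetical function names; the ρ values of .4 and .7 are those listed for the simulation in Table 4.2.

```python
import math

def sigma_from_icc(tau, rho):
    """Residual SD implied by random-intercept SD tau and ICC rho.

    Solves rho = tau^2 / (tau^2 + sigma^2) for sigma.
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return tau * math.sqrt((1.0 - rho) / rho)

def icc(tau, sigma):
    """Intraclass correlation for a random intercept model."""
    return tau ** 2 / (tau ** 2 + sigma ** 2)

for rho in (0.4, 0.7):
    print(rho, round(sigma_from_icc(10.0, rho), 2))

# EC-Full estimates quoted in the text: tau-hat = 10.38, sigma-hat = 12.30.
print(round(icc(10.38, 12.30), 2))
```

With τ = 10, the two ICC settings correspond to σ of roughly 12.25 and 6.55, and the EC-Full estimates reproduce the quoted $\hat{\rho}$ of 0.42.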

CHAPTER 4

SIMULATION STUDY

4.1 Introduction

In this chapter, the design space and evaluation criteria are established, and the simulation study results are presented for various conditions determined by parameters of the model.

4.2 Objective

As waitlist (delayed intervention) designs are commonly used in the practice of medical investigation, it is the aim of this dissertation to establish recommendations to intervention researchers regarding methodology for statistical analysis of data from such a design. Of particular interest is whether it is preferable to ignore or incorporate post-intervention data from the control group in estimation of treatment effect (TE).

The TC-Full model estimates treatment effect for the immediate treatment (IT) group and disregards post-intervention changes in the delayed treatment (DT) group. The TC-Reduced model estimates a combined treatment effect common to both groups. In order to identify conditions for increased efficiency of the combined TE estimator, this simulation study determines and compares the relative efficiencies of the TC-Reduced ($\hat{\theta}_2$) and TC-Full ($\hat{\theta}_1$) estimators of TE for 10 specific randomized delayed intervention controlled trial (RDICT) designs in a variety of scenarios.

Symbol       Scale      Description                        Values
$h_{TE}$     calendar   TE duration                        expert-specified
$h_{PE}$     calendar   PE duration                        $h_{PE} = h_{TE}$
h            count      no. obs. segments in TE dur.       (2, 3)
d            occasion   obs. no. of DT (beyond entry)      (1, 2, 3)
m            count      no. repeated meas. (per subj.)     (4, 5, 6)
$m_{pd}$     count      no. post-delay obs. in DT group    m − (d + 1)

Table 4.1: Notation for Variables on Comparable Time Scales. The time-related variables are summarized on 3 time scales: calendar time, counts, and occasion of observation. Abbreviations are made for treatment effect (TE), placebo effect (PE), and delayed treatment (DT).

4.3 Notation

The notation for simulation study parameters is shown in Tables 4.1 and 4.2, for time-related and other variables, respectively. The significance and specification of these variables will be discussed forthwith.

4.4 Narrowing the Universe of Designs

As with selection of model families, the choice of designs under consideration started with the RDICT design utilized in the real study (MFPG). Deviations from this starting point were in the number of observation times, time between observations, number of subjects per group, and, to address the crucial feature of the RDICT design, the length of delay for treatment in the control group.

Symbol       Scale        Description                         Values
$\tau^2$     outcome^2    betw.-subject intercept variance    10^2
$\sigma^2$   outcome^2    within-subject error variance       set via (τ, ρ)
ρ            [0, 1]       ICC: ρ = τ^2/(τ^2 + σ^2)            (.4, .7)
λ            [0, 1]       PE/(PE + TE)                        (.2, .4)
M            Z+           number of runs (per scenario)       M = 400
n            Z+           sample size (per run)               n = 134
$p_d$        (0, 1)       proportion of sample in DT group    (1/2, 2/3)

Table 4.2: Notation for Simulation Study Parameters. For each parameter in the simulation study, the scale, description, and possible values are given. Abbreviations are made for treatment effect (TE), placebo effect (PE), delayed treatment (DT), and intraclass correlation coefficient (ICC).

Name    h   d   m   $m_{pd}$   $m_{pd}$/d
H1.4    2   1   4   2          2
H2.4    2   2   4   1          0.5
H2.5    2   2   5   2          1
T1.4    3   1   4   2          2
T1.5    3   1   5   3          3
T2.4    3   2   4   1          0.5
T2.5    3   2   5   2          1
T2.6    3   2   6   3          1.5
T3.5    3   3   5   1          0.33
T3.6    3   3   6   2          0.67

Table 4.3: The 10 RDICT Designs for Simulation Study. The distinguishing features of each design are the total observations (m), the delay (d), and the frequency of observation (h).

h     j = 0   j = 1   j = 2   j = 3   j = 4   j = 5
2     0       1/2     1       3/2     2       -
3     0       1/3     2/3     1       4/3     5/3

Table 4.4: Time Scale for Observation Occasions in Model Based on Graduation of Treatment Period. The times of the jth observation are shown in scaled time units for the two values of h.

4.4.1 Observation Times: Resolution h of Unit Time

The length of time, $h_{TE}$, corresponding to the time for full or near-full treatment effect to occur is specified by the researcher or subject expert. The length of this treatment period is set to equal 1 unit on the time scale used in the models and designs, whether it be a month or a year. In designs studied here, the treatment period is divided into h ∈ {2, 3} time periods, and both treatment groups are observed on the identical m occasions.

The graduation of the treatment period $h_{TE}$ was limited to either two or three observation periods. A coarser graduation (h = 1), the pre-post observation scheme, does not allow any verification or investigation of the assumed treatment effect process over this time. Intermediate observations may improve estimation of the treatment period for future applications or help establish client expectations for translation into practice. A finer graduation (h > 3) starts to cost too much toward the total number of observations m and perhaps captures redundant information.

4.4.2 Length of Delay d in Waitlist Control Group

We assume the delayed intervention is to be introduced to the control group at an observation time j = d ∈ 1 : (m − 2), rather than in between measurement occasions. In all the models considered here, the placebo effect period $h_{PE}$ equals the treatment effect period $h_{TE}$ for simplification.

In order not to unnecessarily extend the total length of the study, the delay does not exceed $h_{PE}$, which equals 1 on the model time scale at measurement occasion j = h. As a result, the delayed intervention at j = d occurs before or at j = h, as illustrated in Figures 4.1 and 4.2 for h = 3 or 2, respectively.

4.4.3 Summary of Constraints and Simulation Designs

The constraints on the number of observations, graduation of treatment period, and length of delay are summarized in the list below:

• m ∈ {4, 5, 6}

• h ∈ {2, 3}

• d ∈ {1 : h}

• mpd ≡ m − (d + 1) > 0

• max(h + 1, d + 2) ≤ m ≤ (d + 1) + h.

The total number of repeated measurements m was chosen to be 4, 5, or 6 in our designs. These are the commonly employed m values [13, 25]. The graduation (h) of the treatment period into halves or thirds has been justified above, as well as the constraint that the delay does not exceed this period. The delay is at least one so that the control group has at least one time period without treatment in order to fulfill their comparison role.

[Figure 4.1: plot of Mean Outcome versus Time (0.00 to 1.33) with intervention times marked for Design T1, Design T2, and Design T3.]

Figure 4.1: RDICT Designs With Treatment Period Divided Into Thirds. Expected mean profiles are shown for randomized delayed-intervention controlled trials (RDICT) with unit time graduation of h = 3 and delays in treatment of d = 1, 2, 3. Dashed vertical lines represent the possible intervention times, and the corresponding design names are T1, T2, and T3. Thin green and red lines demonstrate the mean profiles for the immediate treatment and untreated controls, respectively.

That the number of post-delay observations $m_{pd}$ is at least 1 constrains the number of observations m from below such that m ≥ d + 2, as does the requirement to follow the immediate treatment group for the full treatment period, m ≥ h + 1. The largest m considered would follow the delayed intervention group for the delay (d) plus the full treatment period (h), bounding it from above, although most of our designs do not reach this maximum.

[Figure 4.2: plot of Mean Outcome versus Time (0.0 to 1.5) with intervention times marked for Design H1 and Design H2.]

Figure 4.2: RDICT Designs With Treatment Period Divided Into Halves. Expected mean profiles are shown for randomized delayed-intervention controlled trials (RDICT) with unit time graduation of h = 2 and delays in treatment of d = 1, 2. Dashed vertical lines represent the possible intervention times, and the corresponding design names are H1 and H2. Thin green and red lines demonstrate the mean profiles for the immediate treatment and untreated controls, respectively.

For each of the 3 m values, the 5 valid (h, d) combinations, i.e., (2,1), (2,2), (3,1), (3,2), and (3,3), were considered according to the above constraints. Of the 15 possible designs, 10 were valid according to the prescribed design criteria.

The nomenclature and salient features of the ten RDICT designs considered are illustrated in Figure 4.3 and Table 4.3. The number of observations m varied from 4 to 6, and the length of the study participation varied from 1 to 2. Additionally, the time of delayed intervention ranged from 1/3 to 1.
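The ten designs follow mechanically from the constraints listed in Section 4.4.3. The sketch below enumerates them; the function name and dictionary keys are hypothetical, and the naming rule (H or T, then d, then m) is taken from the caption of Figure 4.3.

```python
from fractions import Fraction

def valid_designs(m_values=(4, 5, 6), h_values=(2, 3)):
    """Enumerate (h, d, m) triples satisfying the listed constraints."""
    designs = []
    for h in h_values:
        for d in range(1, h + 1):                      # 1 <= d <= h
            for m in m_values:
                if max(h + 1, d + 2) <= m <= (d + 1) + h:
                    prefix = {2: "H", 3: "T"}[h]       # halves or thirds
                    designs.append({
                        "name": f"{prefix}{d}.{m}",
                        "h": h, "d": d, "m": m,
                        "m_pd": m - (d + 1),           # post-delay observations
                        "length": Fraction(m - 1, h),  # time of last observation
                        "delay_time": Fraction(d, h),  # time of delayed intervention
                    })
    return designs

designs = valid_designs()
for row in designs:
    print(row["name"], row["h"], row["d"], row["m"], row["m_pd"],
          row["length"], row["delay_time"])
```

The enumeration yields the ten rows of Table 4.3, with study lengths between 1 and 2 time units and delayed intervention times between 1/3 and 1.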

4.5 Protocol

This section delineates the steps in the simulation study and provides justiﬁcation for each decision.

4.5.1 Software

Computing (ISBN 3-900051-07-0). There are many useful references for programming in this language [23, 16].

4.5.2 Scenarios

A single scenario is deﬁned by the parameters listed below, which are explained in Tables 4.1 and 4.2. The variables that vary between scenarios are indicated by an asterisk (∗).

• DESIGN: (h∗, d∗, m∗), n, pd∗

• MODEL: λ∗, τ, ρ∗

[Figure 4.3: timeline chart with the ten designs (H1.4, H2.4, H2.5, T1.4, T1.5, T2.4, T2.5, T2.6, T3.5, T3.6) on the vertical axis, labeled Design, and Time (0.0 to 2.0) on the horizontal axis.]

Figure 4.3: RDICT Designs Considered in Simulation Study. The abscissa is time scaled such that the eﬀect of treatment elapses over a single unit. The termination of treatment eﬀect for the immediate treatment group is indicated by a vertical dotted line. The 10 design names are listed along the vertical axis and are composed of a letter H or T, corresponding to h = 2, 3, a number for the delay d, and a number for m following a period. For each design, a horizontal line represents the length of the study with tick marks at the observation occasions. The letter D indicates delivery of the delayed intervention, and the number m is shown to the right of each design line.

4.5.3 Starting Seeds for Random Number Generation

The initial seed was set to 43201 and increased by the count of random numbers generated following each iteration under a new scenario. There are M × n × (m + 1) random numbers generated to create M independent simulated datasets for the runs of a single scenario.
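The count M × n × (m + 1) arises because each simulated subject needs one random intercept and m error terms. The sketch below illustrates this bookkeeping only. It uses Python's `random` module rather than the generator used in the dissertation, it omits the mean structure, and the seed update shown is one assumed reading of "increased by the count of random numbers generated".

```python
import random

def draws_per_scenario(M, n, m):
    """Random numbers needed: one intercept plus m errors per subject per run."""
    return M * n * (m + 1)

def simulate_random_parts(n, m, tau, sigma, rng):
    """Draw b_i + e_ij for n subjects and m occasions (mean structure omitted)."""
    data = []
    for _ in range(n):
        b = rng.gauss(0.0, tau)                        # random intercept b_i
        data.append([b + rng.gauss(0.0, sigma) for _ in range(m)])
    return data

seed = 43201                       # initial seed reported in the text
rng = random.Random(seed)
one_run = simulate_random_parts(n=134, m=4, tau=10.0, sigma=12.0, rng=rng)

used = draws_per_scenario(M=400, n=134, m=4)
next_seed = seed + used            # assumed reading of the seed update rule
print(len(one_run), len(one_run[0]), used, next_seed)
```

For a design with m = 4, a scenario of M = 400 runs with n = 134 subjects consumes 268,000 random numbers under this accounting.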

4.5.4 Level of Dependence Between Simulated Datasets

According to Burton et al. [4], “moderately independent simulations use the same set of simulated independent data sets to compare a variety of statistical methods for the same scenario, but a diﬀerent set of data sets is generated for each scenario investigated.” By this reasoning, the same M datasets were analyzed by the two methods, TC-Full and TC-Reduced, in a single scenario. New and independent data were then generated for subsequent scenarios.

4.5.5 Scenarios: Sample Size n

The sample size n was determined based on an estimate of variance calculated from the MFPG design matrices below and estimates from the TC-Reduced model from section 3.5.5.

 1 0 −1 0 0   0 1 0 0 0   1 0 −0.5 0 0   0 1 0.5 0.5 0  Xi0 =   and Xi1 =    1 0 0 0 0   0 1 1 1 0  1 0 0.5 0.5 0 0 1 1.5 1.5 0.5

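The equation (4.1) was derived for the balanced design scenario ($n_0 = n_1 = n_{half} \equiv n/2$) by applying the simplifying assumption for GLS (2.5) to $\mathrm{Var}(\hat{\beta})$ in (2.13).

$$
\mathrm{Var}(\hat{\beta}) = \left( \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \right)^{-1} = [n_{half}]^{-1} \left( X_{i0}^T V_i^{-1} X_{i0} + X_{i1}^T V_i^{-1} X_{i1} \right)^{-1} \qquad (4.1)
$$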

Substituting $(\sigma^2 I_m + \tau^2 J_m)$ for $V_i$ using σ = 12 and τ = 10, which approximates the real data estimates and a subset of simulation scenarios, it was possible to solve for $n_{half}$ based on the desired $SE(\hat{\theta})$. In the simulation studies, the true TE θ equals either 8 or 6 depending on λ, so the desired $SE(\hat{\theta})$ of 3 is reasonable and dictates $n_{half} = 67$.

For this reason, the sample size n = 134 is used throughout this simulation study.
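As a rough numerical check of (4.1), the following Python sketch plugs the two H2.4 design matrices from this section into the formula with σ = 12, τ = 10, and n_half = 67. The function name `var_beta_hat` and the choice of the fourth column as the treatment-effect slope θ are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Design matrices for one subject in each group of the H2.4 design,
# transcribed from section 4.5.5.
X_i0 = np.array([[1, 0, -1.0, 0.0, 0.0],
                 [1, 0, -0.5, 0.0, 0.0],
                 [1, 0,  0.0, 0.0, 0.0],
                 [1, 0,  0.5, 0.5, 0.0]])
X_i1 = np.array([[0, 1, 0.0, 0.0, 0.0],
                 [0, 1, 0.5, 0.5, 0.0],
                 [0, 1, 1.0, 1.0, 0.0],
                 [0, 1, 1.5, 1.5, 0.5]])


def var_beta_hat(n_half: int, sigma: float, tau: float) -> np.ndarray:
    """Covariance of the fixed-effect estimates under (4.1), balanced allocation."""
    m = X_i0.shape[0]
    # Random-intercept covariance: sigma^2 * I_m + tau^2 * J_m
    V = sigma**2 * np.eye(m) + tau**2 * np.ones((m, m))
    V_inv = np.linalg.inv(V)
    info = X_i0.T @ V_inv @ X_i0 + X_i1.T @ V_inv @ X_i1
    return np.linalg.inv(info) / n_half


THETA_COL = 3  # assumption: the 4th column carries the treatment slope theta

cov = var_beta_hat(n_half=67, sigma=12.0, tau=10.0)
se_theta = float(np.sqrt(cov[THETA_COL, THETA_COL]))
print(round(se_theta, 2))
```

Under these assumptions the printed standard error comes out just under 3, in line with the target value that motivated n_half = 67.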

4.5.6 Scenarios: Size of Treatment Eﬀect λ

The overall improvement in the immediate treatment group, E in Figure 1.1, over the full treatment effect period $h_{TE}$ is considered to be the additive effects of the placebo (PE) and treatment (TE). Holding E constant, the size of TE depends on that of PE. To vary the size of the treatment effect, the control parameter λ was introduced as the proportion of E due to PE, i.e., $\lambda = \frac{PE}{PE + TE} = \frac{\gamma_1}{\gamma_1 + \theta}$ in terms of model parameters. The values for λ in the simulations were .2 and .4 to investigate whether size of TE affects the relative performance of the estimators. Recall that this proportion was approximately .4 in MFPG.

The value for E=10 was chosen to mirror that seen in MFPG. Although arbitrary in some senses, the choice of E must be considered in comparison to the variance parameters. In addition, if TE were measured in terms of the angle ψ, it is worth noting that ψ is a function of λ and E.
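The relationship between λ, E, and the true treatment effect can be made concrete with a short Python sketch. The helper name `split_effect` is hypothetical; the arithmetic simply restates λ = PE / (PE + TE) with E = PE + TE.

```python
def split_effect(E: float, lam: float):
    """Split the overall improvement E into placebo (PE) and treatment (TE) parts."""
    pe = lam * E          # proportion lambda of E is attributed to placebo
    te = (1.0 - lam) * E  # the remainder is the true treatment effect
    return pe, te


for lam in (0.2, 0.4):
    pe, te = split_effect(E=10.0, lam=lam)
    print(f"lambda={lam}: PE={pe:.1f}, TE={te:.1f}")
```

With E = 10 this gives TE = 8 for λ = .2 and TE = 6 for λ = .4, the two true treatment effects used in the simulations.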

4.5.7 Scenarios: Covariance Structure ρ

There are only two variance parameters in a random intercept linear mixed model, $\sigma^2$ for the error term and $\tau^2$ for the random effect. They can be related through the intraclass correlation coefficient (ICC), which is equal to the correlation between two repeated measurements on the same individual, as follows.

$$
\rho = \frac{\tau^2}{\tau^2 + \sigma^2}
$$

As ρ approaches 1, the between-subject variation dominates the within-subject variation, and by the rule of thumb in section 2.8.2, estimation of within-subject comparisons, such as slope, should be more efficient than between-group comparisons. It is not clear how ρ will affect the ability to detect a treatment effect, which is a difference between group slopes.

The value τ was set to 10 for simulations, which is equal to E, the overall change in outcome for the immediate treatment group, as was the case in the MFPG dataset. The values τ and σ determine how deeply the signal of the treatment effect is “buried” in the noise of variation and consequently how difficult it is to detect.
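Because τ is held at 10, each value of ρ implies a value of σ through the ICC formula. The sketch below, with hypothetical helper names, inverts the formula and also evaluates the ICC for the rounded pair σ = 12, τ = 10 used in the sample size calculation.

```python
import math


def icc(tau: float, sigma: float) -> float:
    """Intraclass correlation for a random-intercept model."""
    return tau**2 / (tau**2 + sigma**2)


def sigma_from_icc(rho: float, tau: float) -> float:
    """Solve rho = tau^2 / (tau^2 + sigma^2) for sigma."""
    return tau * math.sqrt((1.0 - rho) / rho)


for rho in (0.4, 0.7):
    print(f"rho={rho}: sigma={sigma_from_icc(rho, tau=10.0):.2f}")
print(f"ICC for tau=10, sigma=12: {icc(10.0, 12.0):.3f}")
```

For τ = 10 this gives σ of about 12.2 when ρ = .4 and about 6.5 when ρ = .7, so σ = 12 corresponds to an ICC of roughly .41.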

4.5.8 Number of Simulations M

Based on $SE(\hat{\theta}) = 3$ and a specified permissible deviation from the true parameter value of less than 0.3, M = 400 simulation runs per scenario were chosen [4].
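A formula commonly used for this purpose, and the one that appears to be intended by the citation, is B = (Z σ / δ)², where Z is the standard normal quantile for a two-sided level α, σ is the standard deviation of the estimate, and δ is the permissible deviation. The sketch below evaluates it; the level α = 0.05 is an assumption, since the text does not state it, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def simulations_needed(sd: float, delta: float, alpha: float = 0.05) -> int:
    """Smallest whole number of runs B with B >= (z * sd / delta)^2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    return math.ceil((z * sd / delta) ** 2)


required = simulations_needed(sd=3.0, delta=0.3)
print(required)
```

Under these assumptions the requirement is a little under 400 runs, which is consistent with rounding up to M = 400.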

4.5.9 Results Stored From Each Run

From each run l ∈ {1 : M} for a particular scenario and design combination, the following values were computed and retained in a saved text ﬁle for potential future use, where ? ∈ {2, 1}.

• estimate: $\hat{\theta}_{?,l}$

• standard error: $SE(\hat{\theta}_{?,l})$

• parameter significance: p-value$(\hat{\theta}_{?,l})$

4.5.10 Summary Measures of Performance

Based on the results of the M runs for a particular scenario and design, the following summary measures were computed, where ? ∈ {2, 1}.

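• average estimate:

$$\bar{\hat{\theta}}_{?} = M^{-1} \sum_{l} \hat{\theta}_{?,l}$$

• average standard error (model-based):

$$SE_{?} = M^{-1} \sum_{l} SE(\hat{\theta}_{?,l})$$

• sample standard deviation (empirical standard error):

$$SD^2(\hat{\theta}_{?}) = (M - 1)^{-1} \sum_{l} \left( \hat{\theta}_{?,l} - \bar{\hat{\theta}}_{?} \right)^2$$

• proportion of significant p-values:

$$\mathrm{power}(\hat{\theta}_{?}) = M^{-1} \sum_{l} I\{ \text{p-value}(\hat{\theta}_{?,l}) < .05 \}$$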
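The four summary measures above map directly onto array operations. The following Python sketch assumes the per-run estimates, standard errors, and p-values for one estimator have been read into arrays; the function name, dictionary keys, and the three-run example data are hypothetical.

```python
import numpy as np


def summarize_runs(estimates, std_errors, p_values, alpha=0.05):
    """Summary measures over the M runs of one scenario for one estimator."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    p = np.asarray(p_values, dtype=float)
    return {
        "mean_estimate": float(est.mean()),   # average estimate
        "mean_se": float(se.mean()),          # model-based standard error
        "sd": float(est.std(ddof=1)),         # empirical SE, divisor M - 1
        "power": float(np.mean(p < alpha)),   # share of significant runs
    }


# Tiny made-up example with M = 3 runs (not data from the study).
summary = summarize_runs(estimates=[7.0, 8.0, 9.0],
                         std_errors=[3.0, 3.0, 3.0],
                         p_values=[0.02, 0.01, 0.20])
print(summary)
```

Note the `ddof=1` argument, which gives the (M − 1) divisor used in the definition of the sample standard deviation.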

4.5.11 Criteria for Comparison

The following values were computed to compare the two methods of TE estimation, where ? ∈ {2, 1}.

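• bias:

$$\mathrm{Bias}_{?} = \bar{\hat{\theta}}_{?} - \theta_{?}$$

• relative efficiency (based on SE):

$$RE_{SE} = \frac{SE_1^2}{SE_2^2}$$

• relative efficiency (based on SD):

$$RE_{SD} = \frac{SD^2(\hat{\theta}_1)}{SD^2(\hat{\theta}_2)}$$

• mean squared error (based on SE):

$$MSE_{?,SE} = \mathrm{Bias}_{?}^2 + SE_{?}^2$$

• mean squared error (based on SD):

$$MSE_{?,SD} = \mathrm{Bias}_{?}^2 + SD^2(\hat{\theta}_{?})$$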
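The comparison criteria can be computed from the per-estimator summaries in a few lines. The sketch below is illustrative only: the function name, the input layout (a small dictionary per estimator), and the numbers in the example are hypothetical and are not results from the study.

```python
def comparison_criteria(est2: dict, est1: dict, theta_true: float) -> dict:
    """Bias, MSE, and relative efficiency of estimator 2 relative to estimator 1.

    Each input dict holds 'mean_estimate', 'mean_se', and 'sd' for one estimator.
    """
    out = {}
    for label, s in (("2", est2), ("1", est1)):
        bias = s["mean_estimate"] - theta_true
        out[f"bias_{label}"] = bias
        out[f"mse_se_{label}"] = bias**2 + s["mean_se"]**2
        out[f"mse_sd_{label}"] = bias**2 + s["sd"]**2
    # Ratios above 1 favour estimator 2 (the combined estimator).
    out["re_se"] = est1["mean_se"]**2 / est2["mean_se"]**2
    out["re_sd"] = est1["sd"]**2 / est2["sd"]**2
    return out


# Made-up summaries for illustration.
combined = {"mean_estimate": 8.1, "mean_se": 3.0, "sd": 3.1}
inter = {"mean_estimate": 7.9, "mean_se": 3.3, "sd": 3.2}
criteria = comparison_criteria(combined, inter, theta_true=8.0)
print({k: round(v, 3) for k, v in criteria.items()})
```

Keeping the SE-based and SD-based versions side by side makes it easy to see whether the model-based and empirical standard errors lead to the same conclusion.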

pd     ρ     λ     RE      B2      B1      MSE2    MSE1    p(θ2)   p(θ1)
0.5    0.4   0.2   1.000   -0.12   -0.12   8.96    8.96    0.770   0.770
0.5    0.4   0.4   1.000   -0.05   -0.05   8.92    8.91    0.535   0.533
0.5    0.7   0.2   1.000   -0.06   -0.06   2.57    2.57    0.998   0.998
0.5    0.7   0.4   1.000    0.07    0.07   2.57    2.57    0.960   0.960
0.67   0.4   0.2   1.067    0.09    0.08   9.47    10.10   0.733   0.715
0.67   0.4   0.4   1.067    0.08    0.11   9.55    10.19   0.470   0.455
0.67   0.7   0.2   1.067    0.01   -0.03   2.73    2.91    0.995   0.998
0.67   0.7   0.4   1.067   -0.01    0.00   2.72    2.91    0.948   0.933

Table 4.5: Results for MFPG Design H2.4. The ﬁrst 3 columns identify the scenario of {pd, ρ, λ}. Both the relative eﬃciency (RE) and MSE values are based on the model-based standard errors. The bias (B) and power (p) values are also included for both estimators, combined (2) and inter (1).

4.6 Results

Representative raw results are presented in tables and plots in this section. First is a discussion of the distinction between the empirical and model-based standard errors in relation to the theoretical standard error.

4.6.1 Results for MFPG Design and Extension

A subset of the summary and comparison measures are presented in Table 4.5 for the 8 scenarios of {pd, ρ, λ} under the H2.4 design, which was employed in the motivating MFPG example. The measures presented are RESE, Bias?, MSE?,SE, and power(θ?) for ? ∈ {2, 1}.

An extension to the H2.4 design of MFPG simply adds another observation to the study for both groups, resulting in design H2.5. The same measures reported for the MFPG design are shown for this enhanced H2.5 design in Table 4.6.

pd     ρ     λ     RE      B2      B1      MSE2    MSE1    p(θ2)   p(θ1)
0.5    0.4   0.2   1.050   -0.29   -0.30   7.91    8.31    0.850   0.830
0.5    0.4   0.4   1.050    0.08    0.09   7.81    8.20    0.575   0.550
0.5    0.7   0.2   1.050   -0.07   -0.07   2.23    2.34    1.000   1.000
0.5    0.7   0.4   1.050    0.01    0.03   2.22    2.33    0.983   0.973
0.67   0.4   0.2   1.311   -0.17   -0.19   7.11    9.32    0.870   0.775
0.67   0.4   0.4   1.311   -0.04   -0.02   7.08    9.28    0.633   0.530
0.67   0.7   0.2   1.311    0.04    0.04   2.02    2.65    1.000   0.995
0.67   0.7   0.4   1.311    0.02    0.04   2.02    2.65    0.988   0.960

Table 4.6: Results for Enhanced MFPG Design H2.5. The ﬁrst 3 columns identify the scenario of {pd, ρ, λ}. Both the relative eﬃciency (RE) and MSE values are based on the model-based standard errors. The bias (B) and power (p) values are also included for both estimators, combined (2) and inter (1).
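Since the tabulated MSE values are model-based, the relative efficiency in each row can be approximately recovered from the other columns as (MSE1 − B1²) / (MSE2 − B2²). The following Python sketch performs this consistency check for one row from each of Tables 4.5 and 4.6; the function name is hypothetical, and small discrepancies are expected because the table entries are rounded.

```python
def implied_re(bias2: float, bias1: float, mse2: float, mse1: float) -> float:
    """Back out RE_SE from the reported bias and model-based MSE columns."""
    se2_sq = mse2 - bias2**2  # squared model-based SE of the combined estimator
    se1_sq = mse1 - bias1**2  # squared model-based SE of the inter estimator
    return se1_sq / se2_sq


# (B2, B1, MSE2, MSE1, reported RE) for pd = 0.67, rho = 0.4, lambda = 0.2
rows = {
    "Table 4.5 (H2.4)": (0.09, 0.08, 9.47, 10.10, 1.067),
    "Table 4.6 (H2.5)": (-0.17, -0.19, 7.11, 9.32, 1.311),
}
for label, (b2, b1, mse2, mse1, reported) in rows.items():
    print(f"{label}: implied RE = {implied_re(b2, b1, mse2, mse1):.3f}, "
          f"reported RE = {reported}")
```

For both rows the implied value agrees with the reported RE to about three decimal places.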

4.6.2 Theoretical Standard Error

When the individual variances $V_i$ are known, as in a simulation study, the estimator variance (and standard error) is a function only of the design and variance matrices and does not depend on the parameter value itself. As a consequence, the theoretical standard error (SE) for the estimators can be calculated for each design and {ρ, pd} combination, independent of λ.

The theoretical SE values for $\hat{\theta}_2$, $\hat{\theta}_1$, and $\hat{\theta}_0$ are shown for each {ρ, pd} pair in Figure 4.4. Interestingly, while the theoretical SE profiles are very similar for $\hat{\theta}_2$ and $\hat{\theta}_1$, they are quite different for $\hat{\theta}_0$.

While the theoretical SE depends on {ρ, pd}, as seen more clearly in Figure 4.5 for $\hat{\theta}_{combined}$, the theoretical relative efficiency (RE) does not depend on ρ, but only on pd. With only a few exceptions, the SE is lower for pd = .67 than for pd = .5. The shape of the theoretical RE over the 10 designs for each pd is shown in Figure 4.7.

4.6.3 Model-Based versus Empirical Standard Errors

Although the model-based standard error ($SE_{?}$) relies on simulated realizations of the outcome, it is nearly identical to the theoretical SE and is more stable than the empirical standard error ($SD(\hat{\theta}_{?})$). The three are compared for $\hat{\theta}_{combined}$ in Figure 4.6.

The noisiness of the empirical standard error relative to the model-based and theoretical standard errors can be seen more clearly in the squared ratio scale of the relative efficiencies in Figure 4.7. Note that the $SD(\hat{\theta}_{?})$ values plotted for each {ρ, pd} combination are an average for the two corresponding scenarios, each with a different λ.

Since the empirical standard error does not deviate from the model-based SE in a systematic fashion, the model-based SE is considered for drawing conclusions about the designs and scenarios. The model-based relative efficiency, just as the theoretical RE, does not depend on ρ or λ, but varies with design and pd as illustrated in Figure 5.2.

[Figure 4.4 plot omitted: four panels for pd ∈ {0.5, 0.67} and ρ ∈ {0.4, 0.7}; horizontal axis: design; vertical axis: theoretical standard error; series: combined, inter, intra.]

Figure 4.4: Theoretical Standard Errors by Estimator. The theoretical standard errors (SE) of the estimators are plotted over the 10 designs for each of the 4 combinations of pd = .5,.67 and ρ = .4,.7. The separate lines in each panel represent the combined, inter, and intra estimators, with colors blue, pink, and green, respectively. Note that the SE values for the combined (2) and inter (1) estimators are quite similar, while the intra (0) values are apart.

[Figure 4.5 plot omitted: two panels for ρ ∈ {0.4, 0.7}; horizontal axis: design; vertical axis: theoretical SE (combined); series: pd = 0.5 and pd = 0.67.]

Figure 4.5: Theoretical Standard Error for Combined Estimator. The theoretical standard error (SE) of the combined estimator is plotted over the 10 designs for ρ = .4,.7. The separate lines in each panel represent the SE values for pd = .5,.67, with colors blue and pink, respectively.

[Figure 4.6 plot omitted: four panels for pd ∈ {0.5, 0.67} and ρ ∈ {0.4, 0.7}; horizontal axis: design; vertical axis: standard error (combined estimator); series: sd, se, true.]

Figure 4.6: Theoretical, Model-Based, and Empirical Standard Errors. The standard error (SE) of the combined estimator is plotted over the 10 designs for each of the 4 combinations of pd = .5,.67 and ρ = .4,.7. The separate lines in each panel represent the theoretical (true), model-based (se), and empirical (sd) derivations for SE, with colors green, pink, and blue, respectively. Note that the theoretical and model-based SE values are virtually indistinguishable.

[Figure 4.7 plot omitted: four panels for pd ∈ {0.5, 0.67} and ρ ∈ {0.4, 0.7}; horizontal axis: design; vertical axis: relative efficiency; series: sd, se, true.]

Figure 4.7: Theoretical, Model-Based, and Empirical Relative Eﬃciencies. The relative eﬃciency (RE) of the combined to inter estimator is plotted over the 10 designs for each of the 4 combinations of pd = .5,.67 and ρ = .4,.7. The separate lines in each panel represent the theoretical (true), model-based (se), and empirical (sd) derivations for RE, with colors green, pink, and blue, respectively. Note that the theoretical and model-based RE values are virtually indistinguishable.

CHAPTER 5

CONCLUSIONS

This chapter discusses the results from the simulation study and presents recommendations for researchers as well as further questions to be answered with future research.

5.1 Distinguishing Features of the 10 Designs

In order to decipher the profiles of relative efficiencies over the designs, plots of design features are presented in Figure 5.1 in terms of calendar durations, which seem more influential than observation counts. The distinction between counts and duration is salient only because the calendar time between observations is longer for the h = 2 designs. In fact, the H2.5 design has the longest total study duration, a full 2 units of scaled time, even though designs T2.6 and T3.6 have more observation occasions.

5.2 Standard Errors over the Designs

Although the primary interest of this study is the relative size of the standard errors (SE) for $\hat{\theta}_2$ and $\hat{\theta}_1$, we examine here the effect of design on the absolute sizes of the standard errors for these estimators as well as $\hat{\theta}_0$ from the TC-Full model to gain insight into the design differences.

[Figure 5.1 plot omitted: horizontal axis: design; vertical axis: calendar duration, 0.5 to 2.0; series: delay, post-delay, study.]

Figure 5.1: A Comparison of Calendar Durations by Design. The lengths of calendar duration are plotted over the 10 designs. The separate lines represent the total (study), pre-delay (delay), and post-delay durations, with colors green, blue, and pink, respectively.

In this discussion, larger SE values are called worse and lower ones are called better. Furthermore, the inequalities refer to the size of SE. The rankings of the designs by theoretical SE are seen in Figure 4.4 and listed for each estimator of treatment effect at the end of this section. The rank orders are stable over ρ and pd for the inter and intra estimators, while the combined estimator rankings are slightly affected by pd.

5.2.1 Intra Estimator

The SE values for $\hat{\theta}_0$, which estimates the change in slope following intervention in the control group, are seen in Figure 4.4. The rank order of SE values by design is preserved over the {pd, ρ} combinations.

The worst four designs are T2.4 > T1.4 > T3.5 > T1.5. Looking at Figure 5.1, designs T1.4 and T1.5 have the shortest delay, and designs T2.4 and T3.5 have the shortest post-delay observation period. Since both of these pieces contribute to the estimation of $\theta_0$, short duration in either phase is a detriment to the precision of estimation.

The three best designs are H2.5 < T2.6 < T3.6. Design H2.5 has the longest study duration, followed by designs T2.6 and T3.6. In H2.5 and T2.6, the post-delay observation period is equal to the maximum allowed, which is 1 scaled unit. In H2.5 and T3.6, the delay period is the maximum allowed, which is also equal to 1 scaled unit.

The MFPG design H2.4 is ranked sixth best for precision of $\hat{\theta}_0$.

5.2.2 Inter Estimator

The SE values for $\hat{\theta}_1$, which estimates the difference in slopes between the immediate treatment and pre-intervention control groups, are seen in Figure 4.4. The rank order of SE values by design is preserved over the {pd, ρ} combinations.

The worst three designs are T1.4 > T1.5 > H1.4. These designs have the shortest delay periods, which limits the information gathered about the pre-intervention controls.

The best four designs are T3.6 < T3.5 < H2.5 < H2.4. These designs have the largest allowable delay of 1 scaled unit of time. The best two designs, T3.6 and T3.5, have an additional observation during this time period, compared with the next two, H2.5 and H2.4.

5.2.3 Combined Estimator

The SE values for $\hat{\theta}_2$, which combines the intra and inter estimates, are seen in Figure 4.5. The rank order of SE values by design is preserved over ρ, but the best two designs are reversed between the pd values. The rank order for the combined SE by design is identical to that for inter with the exception of the ordering of the best three designs, H2.5, T3.5, and T3.6.

When pd = 1/2, the best three designs are T3.6 < H2.5 < T3.5. The top two designs, T3.6 and H2.5, are the best designs for the inter and intra SE values, respectively. While the T3.5 design is good for inter SE, it is third worst for intra SE.

When pd = 2/3, the best three designs are H2.5 < T3.6 < T3.5. Compared with pd = .5, a greater proportion of the sample is allocated to the delayed treatment group. It is no surprise that the best design for intra SE, H2.5, nudges ahead into the top spot.

Note that the MFPG design H2.4 is ranked fourth best for precision of $\hat{\theta}_1$ and $\hat{\theta}_2$.

Estimator SE Rankings

$\hat{\theta}_0$: H2.5 < T2.6 ≈ T3.6 < T2.5 < H1.4 ≈ H2.4 < T1.5 ≈ T3.5 < T1.4 ≈ T2.4

$\hat{\theta}_1$: T3.6 < T3.5 < H2.5 < H2.4 < T2.6 < T2.5 < T2.4 < H1.4 < T1.5 < T1.4

$\hat{\theta}_2$, pd = 1/2: T3.6 < H2.5 < T3.5 < H2.4 < T2.6 < T2.5 < T2.4 < H1.4 < T1.5 < T1.4

$\hat{\theta}_2$, pd = 2/3: H2.5 < T3.6 < T3.5 < H2.4 < T2.6 < T2.5 < T2.4 < H1.4 < T1.5 < T1.4

5.3 Relative Eﬃciencies over the Designs

Since both estimators of fixed effects considered here are unbiased according to linear model theory (Gauss-Markov Theorem), the primary distinction between them is captured by the relative efficiency (RE) of $\hat{\theta}_2$ to $\hat{\theta}_1$. In this discussion, larger RE values are called better and lower ones are called worse. Furthermore, the inequalities refer to the size of RE. The rankings of the designs by model-based RE are seen in Figure 5.2 and listed below. The rank orders are stable within pd values, with the exception of a few switches of nearly-equal RE designs for the different ρ.

Allocation RE Rankings

pd = 1/2: T1.4 ≈ T2.4 > H2.5 > T1.5 ≈ T3.5 > T2.5 > T2.6 > T3.6 > H2.4 > H1.4

pd = 2/3: H2.5 > T2.6 ≈ T3.6 > H2.4 ≈ H1.4 > T2.5 > T3.5 > T1.5 > T1.4 ≈ T2.4

5.3.1 Balanced Sample Allocation

When pd = 1/2, the best three designs are T1.4 ≈ T2.4 > H2.5 with a maximum improvement in efficiency of 6.8%. The first two have the shortest study duration of all 10 designs. The T1.4 design has the worst or next to worst standard error (SE) for each of the 3 estimators. Both of the best two designs benefit from combining the inter and intra estimators of TE.

The H2.5 design has the longest study duration of all 10 designs and some of the best SE values across the 3 estimators. The top 5 designs, excluding H2.5, have the shortest lengths of either the delay or post-delay period, equal to 0.33 scaled time units, as seen in Figure 5.1. The remaining designs have RE values close to 1, with 1.4% improvement or less.

5.3.2 Unbalanced Sample Allocation

When pd = 2/3, the best three designs are H2.5 > T2.6 ≈ T3.6 with a maximum improvement in efficiency of 31.1%, followed by H2.4 ≈ H1.4 at 6.7%. It is not surprising that the top 3 ranked designs for RE are the same as those for the intra SE in the case that more subjects are assigned to the delayed treatment group, from which the intra estimator draws information. The H2.5 design is best for intra precision and third for inter precision. The T2.6 design is second best for intra precision but only fifth best for inter precision.

These top 3 designs also have the longest study durations, from 1.67 to 2 scaled time units. The next two best designs, H1.4 and H2.4, have the next longest study length of 1.5 scaled time units. Of the three designs with the next longest duration, T2.5 has balanced pre- and post-intervention time periods and comes in at rank 6 with a 3.2% improvement in efficiency. The four worst designs have the shortest study durations, from 1 to 1.33 scaled time units, with improvements of less than 1% in estimator efficiency.

[Figure 5.2 plot omitted: horizontal axis: design; vertical axis: model-based relative efficiency, 1.0 to 1.3; series: pd = 0.5 and pd = 0.67.]

Figure 5.2: Model-Based Relative Efficiencies. The relative efficiency (RE) of the combined to inter estimator is plotted over the 10 designs. The separate lines represent the RE values for pd = .5,.67, with colors blue and pink, respectively.

5.3.3 Other Allocation Plans

In this simulation study, the only values of pd considered were 1/2 and 2/3. In order to explore other apportionments of subjects to the treatment groups, theoretical standard errors and relative efficiencies of $\hat{\theta}_2$ to $\hat{\theta}_1$ are presented in Figures 5.3 and 5.4 for pd from 0.05 to 0.95.

The standard errors for $\hat{\theta}_1$ are concave up over pd with a minimum at or right of center. Those for $\hat{\theta}_2$ achieve a minimum at or near pd = 1 and are more level on the right end.

In general, the relative efficiency for a design improves as pd increases from 1/2 to 2/3, allocating more subjects to the delayed treatment group, with a few exceptions. As seen in Figure 5.2, there are two designs, T1.4 and T2.4, where the relative efficiency (RE) is greater for the pd = 1/2 allocation, and two more designs where the RE is essentially equal, T1.5 and T3.5. In order to see the relative shapes of the curves in Figure 5.4, the trend curves are presented on the same plot with a restricted range in Figure 5.5 for a representative subset of the 10 designs.

The designs vary according to how heavy the allocation to delayed treatment must be before the RE starts increasing steeply. This relative benefit is always balanced against the desire for a smaller absolute standard error (SE). Intuitively, the extremely high RE values at the right end of the pd range would be tempered by unequal treatment effects.
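To illustrate how a theoretical standard error can be traced over allocation proportions, the following Python sketch extends (4.1) to unequal group sizes for the H2.4 design only, weighting each group's information by its sample size. Several things here are assumptions rather than the study's actual computation: the unequal-allocation form of the formula, the reading of $X_{i0}$ as the delayed-treatment group, the fourth column as the treatment slope, the values σ = 12 and τ = 10, and all function and variable names.

```python
import numpy as np

# H2.4 design matrices from section 4.5.5; X_DT is assumed to be X_i0
# (delayed treatment) and X_IT is assumed to be X_i1 (immediate treatment).
X_DT = np.array([[1, 0, -1.0, 0.0, 0.0],
                 [1, 0, -0.5, 0.0, 0.0],
                 [1, 0,  0.0, 0.0, 0.0],
                 [1, 0,  0.5, 0.5, 0.0]])
X_IT = np.array([[0, 1, 0.0, 0.0, 0.0],
                 [0, 1, 0.5, 0.5, 0.0],
                 [0, 1, 1.0, 1.0, 0.0],
                 [0, 1, 1.5, 1.5, 0.5]])


def se_theta_combined(p_d, n=134, sigma=12.0, tau=10.0, theta_col=3):
    """Theoretical SE of the treatment slope when a share p_d of n is in DT."""
    m = X_DT.shape[0]
    V_inv = np.linalg.inv(sigma**2 * np.eye(m) + tau**2 * np.ones((m, m)))
    n_dt = p_d * n          # expected DT group size (not rounded)
    n_it = (1.0 - p_d) * n  # expected IT group size
    info = n_dt * (X_DT.T @ V_inv @ X_DT) + n_it * (X_IT.T @ V_inv @ X_IT)
    return float(np.sqrt(np.linalg.inv(info)[theta_col, theta_col]))


grid = np.round(np.arange(0.05, 0.96, 0.05), 2)
curve = {float(p): se_theta_combined(p) for p in grid}
for p, se in curve.items():
    print(f"p_d={p:.2f}  SE={se:.2f}")
```

At pd = 0.5 this reduces to the balanced case of (4.1) with n_half = 67. The sketch covers a single design and a single variance setting, so it indicates the method rather than reproducing Figure 5.3.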

[Figure 5.3 plot omitted: one panel per design (H1.4, H2.4, H2.5, T1.4, T1.5, T2.4, T2.5, T2.6, T3.5, T3.6); horizontal axis: proportion DT; vertical axis: theoretical standard error.]

Figure 5.3: Theoretical Standard Errors for Allocation Proportions. The standard errors (SE) for the combined $\hat{\theta}_2$ and inter $\hat{\theta}_1$ estimators are plotted over the pd values ∈ [.05,.95] for the 10 designs in blue and pink, respectively. These curves are for the ρ = .4 scenario and exhibit the same relative behavior as for the ρ = .7 scenario.

[Figure 5.4 plot omitted: one panel per design (H1.4, H2.4, H2.5, T1.4, T1.5, T2.4, T2.5, T2.6, T3.5, T3.6); horizontal axis: proportion DT; vertical axis: relative efficiency.]

Figure 5.4: Theoretical Relative Eﬃciencies for Allocation Proportions. The relative eﬃciency (RE) of the combined to inter estimator is plotted over the pd values ∈ [.05,.95] for the 10 designs.

[Figure 5.5 plot omitted: horizontal axis: proportion DT; vertical axis: relative efficiency, 1.0 to 1.8; series: H2.4, H2.5, T1.4, T2.5, T2.6, T3.5.]

Figure 5.5: Comparison of Theoretical Relative Efficiencies for Allocation Proportions. The relative efficiency (RE) of the combined to inter estimator is plotted over the pd values ∈ [.05,.95] for a representative subset of the designs.

5.4 Unanswered Questions

There are further research questions not addressed in the current dissertation. They are presented and briefly considered in this section.

• Model: Unequal Treatment Effects. The expectation is that the benefits to efficiency seen in this study would persist when the treatment effects are similar but not identical in the immediate and delayed treatment groups. However, it is not known whether response to treatment is comparable between delivery at study entrance and following a delay. Real studies with a stepped-wedge design could be conducted to examine the effect of delay length on TE. More extensive generating models could be considered to determine the value of the combined estimator, which would be biased, under conditions of unequal treatment effect (TE). The post-delay TE following the placebo effect (PE) can be considered a sort of cohort effect, especially with young patients, who grow considerably over the 12 month delay of MFPG, for example.

• Model: Role of Differential Attrition Rates. The benefit to efficiency introduced by the combined treatment effect estimator may be accentuated or diminished by differential attrition rates. In RCTs and even in wait-listed designs, researchers expect higher dropout rates in the control group.

• Model: Misspecified Treatment Effect Period. The choice of $h_{TE}$ by the researcher is a more or less educated guess. The success in capturing the treatment effect in both groups is affected by misspecification of $h_{TE}$. In addition, the spacing of observations may be unnecessarily close if $h_{TE}$ is underestimated.

• Model: Alternate Estimator for Treatment Effect. A strategic post-hoc weighted average of the intra $\hat{\theta}_0$ and inter $\hat{\theta}_1$ estimators from the TC-Full model, based on their detectable separation or likelihood of equality, is conceivable.

• Design: Optimal Spacing of Observations. There has been much research into the optimal spacing of observations in longitudinal studies, and the optimal spacings often deviate slightly from equal spacing.

• Design: Fewer Observations in IT Group. In order to save such resources as time, effort, and money, a design which follows the delayed treatment (DT) group longer than the immediate treatment (IT) group may offer the same benefits to efficiency of estimating treatment effect.

5.5 Recommendations

When an investigator in intervention research wishes to study a slow-acting, overt treatment, the investigator should first specify the treatment effect period, $h_{TE}$, and then consider whether any placebo effect would be achieved during this same time.

I recommend dividing the TE period into halves (h = 2), since no benefit to standard error (for either estimator) or advantage to the combined estimator is gained by more frequent observation, under the linear improvement model.

If the researcher is able to extend the length of the study to 2 times the TE period, or 5 total observation occasions, I recommend disproportionate allocation to the delayed treatment (DT) control group and a delay equal to the TE period. This RDICT design is called H2.5 with pd = 2/3 according to present terminology. In this case, assuming equal treatment effects in both groups, the combined estimator of TE, $\hat{\theta}_2$, delivers 33% improved efficiency over the traditional estimator, $\hat{\theta}_1$.

BIBLIOGRAPHY

[1] Albert, Paul. Longitudinal Data Analysis (Repeated Measures) in Clinical Trials. Statistics in Medicine, 18(13):1707–1732, 1999.

[2] Brown, C. Hendricks, Peter A. Wyman, Jing Guo, and Juan Peña. Dynamic Wait-listed Designs for Randomized Trials: New Designs for Prevention of Youth Suicide. Clinical Trials, 3(3):259–271, 2006.

[3] Brown, Celia A., and Richard J. Lilford. The Stepped Wedge Trial Design: A Systematic Review. BMC Medical Research Methodology, 6:54, 2006.

[4] Burton, Andrea, Douglas G. Altman, Patrick Royston, and Roger L. Holder. The Design of Simulation Studies in Medical Statistics. Statistics in Medicine, 25(24):4279–4292, 2006.

[5] Demidenko, Eugene. Mixed Models: Theory and Applications. Hoboken, New Jersey: John Wiley & Sons, Inc., 2004.

[6] Diggle, Peter J., Patrick Heagerty, Kung-Yee Liang, and Scott L. Zeger. Analysis of Longitudinal Data. Oxford University Press, Second edition, 2002.

[7] Fisher, Ronald A. The Design of Experiments. London: Hafner, 1935.

[8] Fristad, Mary A., Joseph S. Verducci, Kimberly A. Walters, and Matthew E. Young. The Impact of Multi-family Psychoeducation Groups (MFPG) in Treating Children Aged 8-12 with Mood Disorders. Archives of General Psychiatry, in revision 2008.

[9] Hussey, Michael A., and James P. Hughes. Design and Analysis of Stepped Wedge Cluster Randomized Trials. Contemporary Clinical Trials, 28:182–191, 2007.

[10] Kaptchuk, Ted J. Powerful Placebo: The Dark Side of the Randomized Controlled Trial. Lancet, 351:1722–1725, 1998.

[11] Krauth, Joachim. Experimental Design: A Handbook and Dictionary for Medical and Behavioral Research. Number 14 in Techniques in the Behavioral and Neural Sciences. New York: Elsevier, 2000.

[12] Laird, Nan M., and James H. Ware. Random-eﬀects Models for Longitudinal Data. Biometrics, 38:963–974, 1982.

[13] Maxwell, Scott E. Longitudinal Designs in Randomized Group Comparisons: When Will Intermediate Observations Increase Statistical Power? Psychological Methods, 3(3):275–290, 1998.

[14] Muller, Keith E., Lloyd J. Edwards, Sean L. Simpson, and Douglas J. Taylor. Statistical Tests with Accurate Size and Power for Balanced Linear Mixed Models. Statistics in Medicine, 26(19):3639–3660, 2007.

[15] Ouwens, Mario J. N. M., Frans E. S. Tan, and Martijn P. F. Berger. Maximin D-optimal Designs for Longitudinal Mixed Effects Models. Biometrics, 58(4):735–741, 2002.

[16] Pinheiro, José C., and Douglas M. Bates. Mixed-effects Models in S and S-PLUS. New York: Springer Verlag, 2000.

[17] Plewis, Ian. Analysing Change: Measurement and Explanation using Longitudinal Data. New York: J. Wiley, 1985.

[18] Pocock, S. J. Allocation of Patients to Treatment in Clinical Trials. Biometrics, 35:183–197, 1979.

[19] Raudenbush, Stephen W., and Anthony S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, California: Sage Publications, Inc., 2002.

[20] Reise, Steven P., and Naihua Duan. Multilevel Modeling: Methodological Advances, Issues, and Applications. Mahwah, N.J.: Lawrence Erlbaum Associates, 2003.

[21] Shadish, William R., Thomas D. Cook, and Donald T. Campbell. Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin, 2002.

[22] Tu, X. M., J. Kowalski, J. Zhang, K. G. Lynch, and P. Crits-Christoph. Power Analyses for Longitudinal Trials and Other Clustered Designs. Statistics in Medicine, 23(18):2799–2815, 2004.

[23] Venables, William N., and Brian D. Ripley. Modern Applied Statistics with S. New York: Springer, Fourth edition, 2002.

[24] Verbeke, Geert, and Geert Molenberghs. Linear Mixed Models for Longitudinal Data. Springer, 2000.

[25] Winkens, Bjorn, Hubert J. A. Schouten, Gerard J. P. van Breukelen, and Martijn P. F. Berger. Optimal Time-points in Clinical Trials with Linearly Divergent Treatment Eﬀects. Statistics in Medicine, 24:3743–3756, 2005.