Subgroup Identification and Variable Selection from Randomized Clinical Trial Data
by
Jared C. Foster
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Biostatistics) in The University of Michigan 2013
Doctoral Committee:
Professor Bin Nan, Co-chair
Professor Jeremy M.G. Taylor, Co-chair
Assistant Professor Lu Wang
Professor Ji Zhu

© Jared C. Foster 2013. All Rights Reserved.

For my wonderful family and beautiful fiancée.
I couldn’t have done this without you.
ACKNOWLEDGMENTS
Thank you to Professors Jeremy Taylor, Bin Nan, Ji Zhu and Lu Wang, my dissertation committee, for all the very helpful questions and comments you have given me. Beyond this, I’d like to thank Jeremy and Bin, my co-advisors, for all the hours they spent meeting with me and all the useful advice they have given me. Bin, you introduced me to the incredibly fascinating field of variable selection, for which I am truly grateful. Jeremy, you introduced me to subgroup analysis and personalized medicine, which, combined with variable selection, has become my true passion. Additionally, despite my best efforts to become your most annoying student, you have been incredibly patient, thoughtfully responding to my many emails and discussing many ideas and issues with me whenever I randomly drop by your office.
You guys are awesome, and I am truly blessed to have had the opportunity to learn from both of you.
Thank you also to my office mates, past and present, for putting up with me and my love of conversation. Yong-Seok Park, Rebecca Andridge and Connie Lee were incredibly helpful when I was preparing for the qualifying exam. Anna Conlon,
John Rice, Jincheng Shen, Daniel Muenz and Oliver Lee, you have been very patient with me, and I have very much enjoyed our many fascinating conversations. Edward
Kennedy, Phil Boonstra and Laura Fernandes, thank you also for putting up with me, and for helping me to grow through our many philosophical and spiritual discussions.
To my parents, Dr. James and Mrs. Janet Foster, and Jenna and Jeff Spielmann, my sister and brother-in-law, thank you so much for all of your support throughout my life, especially these past seven years. You guys are amazing, and I love you all very much.
Finally, to the love of my life, Grace, thank you so much for all you have put up with these past seven years. You have been incredibly loving, patient and supportive, and have inspired me to become a better man. I definitely could not have done any of this without you. I can’t wait to spend the rest of my life with you!
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER

1. Introduction

2. Variable selection in monotone single-index models via the adaptive LASSO
   2.1 Introduction
   2.2 Estimation for monotone single-index models
       2.2.1 A smooth monotone function estimate for η with fixed β
       2.2.2 Estimation for β with fixed η
       2.2.3 Algorithm
       2.2.4 Asymptotics
       2.2.5 Bootstrap standard errors
   2.3 Simulations
       2.3.1 Examples
   2.4 Example data
   2.5 Discussion

3. Selection of simple rules for treatment assignment using patient information
   3.1 Introduction
   3.2 Identifying subgroups using the average value function
       3.2.1 Nonparametric estimation of g and h
       3.2.2 Selecting a subgroup for fixed g and h
       3.2.3 Evaluation of the region $\hat{A}$
       3.2.4 Selection of δ
   3.3 Simulations
   3.4 Application to randomized clinical trial data
   3.5 Discussion

4. Permutation testing for treatment-covariate interactions
   4.1 Introduction
   4.2 Permutation tests
       4.2.1 Review of permutation tests
       4.2.2 Modified permutation methods
   4.3 Simulations
       4.3.1 Simple linear model example
       4.3.2 Complex model example
   4.4 Application to data from a randomized clinical trial
   4.5 Discussion

5. Discussion and future work

APPENDIX

A. Asymptotics for Chapter 1

BIBLIOGRAPHY
LIST OF FIGURES

2.1 Average $\hat\eta_m$ values from 100 simulations
2.2 Estimates of the function $\hat\eta(\cdot)$ from Eli Lilly data. Index values in the plotted data are $\hat\beta^T x$, where $\hat\beta$ comes from the constrained approach, and treatment effect estimates are the $Z$ values from the Virtual Twins procedure. Points to the right of the vertical dotted line would be considered “enhanced” based on this analysis.
3.1 Plot of True $g(x)$ for Cases 1-4
3.2 Histogram of True $g(x)$ for Cases 1-4
3.3 Histogram of $\hat{g}(x)$ for TROPHY Data
4.1 Rejection rates ($\alpha = 0.05$) for simple linear model simulations
4.2 Histograms of p-values for simple linear scenario 1 ($\beta = 0.5$, $\gamma = 0.25$, $\theta = 0.35$)
4.3 Histograms of p-values for simple linear scenario 2 ($\beta = 0.5$, $\gamma = 0.25$, $\theta = 0$)
4.4 Histograms of p-values for simple linear scenario 3 ($\beta = 0.25$, $\gamma = 0.5$, $\theta = 0.35$)
4.5 Histograms of p-values for simple linear scenario 4 ($\beta = 0.25$, $\gamma = 0.5$, $\theta = 0$)
4.6 Rejection rates ($\alpha = 0.05$) for complex model simulations
4.7 Histograms of p-values for complex scenario 1 ($g(x) = 15\,I(x_1 > 0, x_2 > 0)$)
4.8 Histograms of p-values for complex scenario 2 ($g(x_i) = 5$)
4.9 Histograms of p-values for complex scenario 3 ($g(x_i) = 5 + e_i$)
4.10 Histograms of p-values for complex scenario 4 ($g(x) = 0$)
LIST OF TABLES

2.1 Simulation results: variable selection performance
2.2 Performance of standard error estimates
2.3 Estimates for Eli Lilly data
3.1 Simulation study results: subgroup identification performance
3.2 Simulation study results: $Q(\hat{A})$ estimation performance
4.1 P-values for TROPHY data
CHAPTER 1
Introduction
In recent years, there has been a great deal of interest in using a patient’s specific information to make more informed treatment decisions. This practice, often referred to as personalized medicine, is motivated by the fact that few treatments are equally effective for all individuals in a population. A treatment may be quite effective for some subset of a population, but mildly effective, ineffective, or even harmful for others. Because of this, there is great interest in identifying which individuals in a population, if any, will respond well to a treatment. More specifically, it is necessary to identify which characteristics, if any, lead to this enhanced response if one wishes to pursue a personalized treatment.
A generally accepted approach to identifying such characteristics is to pre-define a small number of subregions of the covariate space before looking at the data, and then evaluate them. However, one may not always know which subgroups to consider, and increasing the number of subgroups considered also increases the risk of false positive
findings, which is already a well-known danger in subgroup analysis. One way to reduce this risk of false positives is to employ a multiple testing approach, such as a
Bonferroni correction. This can effectively reduce false positives, but also decreases
the power to detect true subgroups.
An alternative approach, which will be the main focus of this dissertation, is to pre-define a statistical procedure for identifying subgroups [Ruberg et al., 2010].
These procedures can eliminate the need for a priori specification of subgroups, and may help to reduce the risk of false findings, though some risk still remains. Such procedures have been proposed by many authors, including Friedman and Fisher
[1999], Negassa et al. [2005], Su et al. [2008, 2009], Brinkley et al. [2010], Cai et al.
[2011], Foster et al. [2011], Lipkovich et al. [2011], Qian and Murphy [2011], Zhao et al. [2011], Imai and Ratkovic [2012] and Zhang et al. [2012]. When developing such a procedure, it is common to first consider what each subject’s potential outcome value would be, given each of the treatment options [Zhang et al., 2012]. That is, when two treatment options (say treatment 1 and treatment 0) are present, it is common to consider what each subject’s potential outcome would be given treatment
1 and what it would be given treatment 0. One general way to identify subgroups is to first estimate these two potential outcomes, and then take the difference, which is an estimate of the treatment effect. One can then investigate the relationship between these estimated treatment effects and the covariates to obtain subgroups.
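To make the first of these strategies concrete, the following is a minimal sketch, assuming numpy arrays y (the response), t (a 0/1 treatment indicator) and X (the covariate matrix) from a randomized trial. The use of random forests and the helper name are illustrative assumptions, not the specific estimators developed in this dissertation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def estimated_treatment_effects(y, t, X, seed=0):
    """Estimate individual treatment effects by differencing predicted
    potential outcomes from arm-specific outcome models."""
    m1 = RandomForestRegressor(random_state=seed).fit(X[t == 1], y[t == 1])
    m0 = RandomForestRegressor(random_state=seed).fit(X[t == 0], y[t == 0])
    # Predicted potential outcome under each treatment, for every subject.
    return m1.predict(X) - m0.predict(X)
```

These per-subject estimates can then be regressed on, or partitioned by, the covariates to suggest subgroups.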
Alternatively, one could attempt to identify subgroups by maximizing the expected response under a class of treatment regimes [Gunter et al., 2007, Zhang et al., 2012].
The treatment regimes in such a case will generally be defined based on the covariates, and will be of the form $x_i \in A \Rightarrow T_i = 1$, $x_i \notin A \Rightarrow T_i = 0$, where $x_i$ and $T_i$ are the covariate vector and treatment indicator for subject $i$ and $A$ is some region of the covariate space. Thus, with this approach, the “optimal” subgroup is the region $A$ which maximizes the expected response when only individuals in $A$ receive treatment 1 and only individuals in region $A^c$ receive treatment 0.
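As a rough illustration of this second strategy, the sketch below scores a single candidate region $A$ by its empirical value in a randomized trial: the average response that would result under the regime just described. The inputs and helper name are assumptions for illustration; this plug-in estimator is simpler than the criterion developed in Chapter 3.

```python
import numpy as np

def empirical_value(y, t, in_A):
    """Empirical expected response under the regime: treat iff x is in A.
    Assumes both (treated, in A) and (control, not in A) groups are nonempty;
    in_A is a boolean numpy array."""
    # Contribution from subjects who would be assigned treatment 1 ...
    v_in = y[(t == 1) & in_A].mean() * in_A.mean()
    # ... and from subjects who would be assigned treatment 0,
    # each weighted by the fraction of the sample in that region.
    v_out = y[(t == 0) & ~in_A].mean() * (~in_A).mean()
    return v_in + v_out
```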
Note that, in the setting of personalized medicine, the ultimate goal is to use the identified subregion to define “rules”, which can be used in the future to make more informed treatment decisions. Moreover, these treatment decisions will nearly always be made by someone such as a physician or nurse practitioner, who may have only a modest understanding of statistics. Thus, an issue which should be considered before employing any subgroup identification procedure is the potential form and complexity of the resulting subgroup(s), both of which can vary considerably depending on which approach is taken. A very complex subgroup, which depends on some, or perhaps all, of the available covariates, will often be quite good at identifying truly enhanced responders, but such subgroups may lack “nice” interpretability. In addition, the dependence on a large number of covariates means a large amount of information must be collected before a treatment decision can be made, which could lead to slower, more expensive, or more invasive (due to unnecessary procedures performed to collect information) treatment decisions than are necessary.
This could potentially limit the chances of such a subgroup being used in practice. In contrast, a very simple subgroup, perhaps depending on only one or two covariates, may less accurately identify enhanced responders, but will be easier to interpret, and will likely see more real-world use. Given this tradeoff between classification accuracy and interpretability, the most “ideal” subgroups may be those which are of only moderate complexity. To this end, we will focus on the identification of simple subgroups, which depend on only a few covariates, and have a very simple form. In many cases, such subgroups may be able to classify enhanced responders well, while still being nicely interpretable, and will thus be more likely to see real-world use.
Because a large number of covariates may be available, identifying “useful” subgroups which depend on only a few covariates will generally require some form of variable selection. One simple way to do this is to use a regression tree, which splits the data into a number of regions of the covariate space (i.e. subgroups) [Negassa et al., 2005, Su et al., 2008, 2009, Foster et al., 2011, Lipkovich et al., 2011]. These regions contain individuals who are similar with regard to the response, and they are generally defined using only a subset of the available covariates. Alternatively, one could consider a more model-based approach to selecting covariates. In particular, one could consider some form of penalized regression. Potential penalty functions include the LASSO [Tibshirani, 1996], smoothly clipped absolute deviation (SCAD) [Fan and Li, 2001] and adaptive LASSO [Zou, 2006] penalties. These penalty functions are designed to force the regression parameter estimates which correspond to “useless” variables to zero, thereby removing these variables from the model. Thus, predicted values of the response of interest will be a function of a linear combination of a potentially small number of covariates.
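As a minimal sketch of the tree-based route, one might fit a shallow regression tree to estimated treatment effects (such as those from the earlier sketch), so that each root-to-leaf path is a candidate subgroup defined by at most a couple of covariates. The depth and leaf-size settings are illustrative choices, not values recommended in this dissertation.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

def simple_subgroup_rules(X, effects, feature_names):
    """Fit a depth-2 tree to estimated treatment effects; each root-to-leaf
    path is a simple candidate subgroup rule using at most two covariates."""
    tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=30)
    tree.fit(X, effects)
    return export_text(tree, feature_names=feature_names)
```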
As previously mentioned, most subgroup analysis carries with it some potential for false positive findings. Pre-defining a subgroup identification approach can help to reduce false positives, but in our experience, these methods also have a tendency to identify subgroups even when no truly enhanced subgroup exists. To further reduce the potential for false findings, one may wish to evaluate the “usefulness” of a subgroup once it is identified, perhaps by performing a hypothesis test or computing some type of “enhancement” metric [Foster et al., 2011]. Before such a hypothesis test can be implemented, one must consider what “null” means in the setting of subgroup analysis. Note that “meaningful” subgroups arise when the treatment effect differs as covariate values differ; that is, subgroups arise when treatment-by-covariate interactions exist. Thus, in this setting, one could define “null” data as data in which the treatment effect is constant with respect to the covariates, so that no treatment-by-covariate interactions exist. Alternatively, if one has a specific subgroup to evaluate, “null” data could be defined as data for which the effect of treatment in the chosen subgroup is no different from that for the entire population. In this dissertation, we will consider the more general definition that no treatment-by-covariate interactions exist.
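For reference, a generic permutation-test skeleton is sketched below. The statistic stat_fn is a placeholder, and note that permuting the treatment labels, as this naive version does, actually imposes the stronger null of no treatment effect whatsoever; the modified permutation schemes of Chapter 4 are designed to target the interaction-only null described above.

```python
import numpy as np

def permutation_pvalue(stat_fn, y, t, X, n_perm=1000, seed=0):
    """P-value from a naive label-permutation test. stat_fn(y, t, X) returns
    a scalar; larger values indicate stronger treatment-covariate interaction."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(y, t, X)
    perm = np.array([stat_fn(y, rng.permutation(t), X) for _ in range(n_perm)])
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (1 + np.sum(perm >= observed)) / (n_perm + 1)
```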
In this dissertation, we will consider a number of methods which use randomized clinical trial data to identify simple subgroups of enhanced treatment effect, which should depend on only a small number of covariates. Using randomized clinical trial data allows us to avoid potential problems with confounded relationships, and thus have more confidence that the identified subgroup contains “truly” enhanced individuals. Moreover, randomized clinical trial data generally contains a large number of subjects, which is advantageous, as large samples are generally required if one wishes to accurately identify subgroups.
In Chapter 2, we consider the use of adaptive LASSO-penalized monotone single- index models to identify subgroups of enhanced treatment effect. A single-index model assumes that the outcome of interest is an unknown function of a linear combination of covariates. By forcing this unknown function to be monotone, we are able to nicely describe the estimated effect of each covariate on the response of interest. In addition, by penalizing the regression parameter, the resulting model will generally depend on only a relatively small number of covariates, making it easier to interpret.
In Chapter 3, we propose a two-stage subgroup identification procedure, which can be viewed as a model-based alternative to Virtual Twins [Foster et al., 2011]. In the first stage of this procedure, we use nonparametric regression to obtain treatment effect estimates for each subject. From these estimates, we define a criterion [Sutton and Barto, 1998, Gunter et al., 2007], which is subsequently used to systematically evaluate many subgroups of a simple, pre-specified form. The identified subgroup is that which has the best value of the evaluation criterion. In this chapter, we also consider the use of an enhancement metric [Foster et al., 2011] to evaluate the utility of identified subgroups.
In Chapter 4, we propose a number of permutation-based methods for obtaining p-values for treatment-by-covariate interactions, which can be used to test whether or not an identified subgroup is truly enhanced. These methods are used to obtain p-values for some of the enhancement metric estimates discussed in Chapter 3. All methods in this dissertation are evaluated in simulation studies, and illustrated using randomized clinical trial data.
In Chapter 5, we present an overall discussion of the proposed methods, and consider a number of potential extensions and modifications of these methods.

CHAPTER 2

Variable selection in monotone single-index models via the adaptive LASSO
2.1 Introduction
Linear regression is a simple and commonly-used technique for assessing relationships of the form $y = \beta^T x + \epsilon$ between an outcome of interest, y, and a set of covariates, $x_1, \ldots, x_p$; however, in many cases, a more general model may be desirable. As noted by Härdle et al. [1993], one particularly useful and more general variation of the linear regression formulation is the single-index model

$$y_i = \eta(\beta^T x_i) + \epsilon_i, \qquad (2.1)$$

where the $x_i$'s are subject-specific covariate vectors, $\beta = (\beta_1, \ldots, \beta_p)^T$, $y_i \in \mathbb{R}$, η is an unknown function, $\epsilon_1, \ldots, \epsilon_n$ are iid errors with mean zero and variance $\sigma^2$, and the $\epsilon_i$'s and $x_i$'s are independent. To ensure identifiability, no intercept is included, and $\beta_1$ is assumed to be equal to 1. These models are able to capture important features in high-dimensional data, while avoiding the difficulties associated with high dimensionality, as the dimension is reduced from many covariates to a univariate index [Yu and Ruppert, 2002]. Single-index models have applications in a number of fields, including discrete choice analysis in econometrics and dose-response models in biometrics [Härdle et al., 1993].
There is a rich literature on estimation of β and η, including Härdle et al. [1993], Yu and Ruppert [2002], Ichimura [1993], Carroll et al. [1997], Xia et al. [2002], and Xia and Härdle [2006], among many others. Additionally, variable selection for single-index models was considered by Kong and Xia [2007], who proposed the separated cross-validation method, and Liang et al. [2010], who applied the smoothly clipped absolute deviation (SCAD) approach to partially linear single-index models. However, little consideration has been given to such problems for monotone single-index models, where η is required to be non-decreasing (or non-increasing). In the case of linear models, a great many authors, including Tibshirani [1996], Fan and Li [2001], Zou [2006] and Zou and Zhang [2009], have considered variable selection via penalized least squares, which allows for simultaneous selection of variables and estimation of regression parameters. Several penalty functions, including the SCAD [Fan and Li, 2001], the adaptive LASSO [Zou, 2006] and the adaptive elastic-net [Zou and Zhang, 2009], have been shown to possess favorable theoretical properties, including the oracle properties; that is, consistency of selection and asymptotic normality, with the asymptotic covariance matrix being the same as that which would be obtained if the true underlying model were known. Hence, for large samples, oracle procedures perform as well as if the true underlying model were known in advance. Furthermore, Liang et al. [2010] established the oracle properties of the SCAD for partially linear single-index models. Given the desirable properties of the SCAD, adaptive LASSO and adaptive elastic-net approaches, it is natural to consider the extension of these methods to monotone single-index models. Unlike the adaptive LASSO and adaptive elastic-net, which present convex optimization problems, the SCAD optimization problem is non-convex, and thus more computationally demanding [Hastie et al., 2009].
In addition, the adaptive elastic-net and SCAD methods require the selection of two tuning parameters, whereas the adaptive LASSO requires the selection of a single tuning parameter. Therefore, for convenience, computational efficiency, and because covariates in our example data are not highly correlated (high correlation being the setting in which the adaptive elastic-net is especially advantageous), we consider adaptive LASSO penalized least-squares estimation of β in monotone single-index models.
The assumption of monotonicity and the desire to select a subset of the covariates are motivated in part by the randomized clinical trial data considered in Foster et al. [2011]. A monotonicity assumption is often reasonable, and may improve prediction and reduce model complexity, while also allowing for more straightforward inference. Foster et al. [2011] consider methods for subgroup identification in randomized clinical trial data. In such cases, should a subgroup be identified, it is desirable that this subgroup be easily described, and depend on only a small number of covariates. Application of the methods proposed in this paper results in estimates $\hat\eta$ and $\hat\beta$, such that $\hat y_i = \hat\eta(\hat\beta^T x_i)$, where $\hat\eta$ is monotone and $\hat\beta$ generally includes a number of zero values. Using this model, one can classify individuals with $\hat y$'s beyond some predefined threshold, $c$, as being in the subgroup. Then, because of the monotonicity of $\hat\eta$, the predefined threshold can be converted into an equivalent threshold, $c_0$, on $\hat\beta^T x$, and the impact of the chosen covariates on subgroup membership can be easily described. Without the assumption of monotonicity, the subgroup may be a collection of several disjoint subregions of the covariate space, making each covariate's impact on subgroup membership more difficult to ascertain.
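This threshold conversion is straightforward to carry out numerically. The grid-search sketch below assumes a non-decreasing fitted function eta_hat (any callable) and the observed index values; the helper name and grid size are illustrative, not part of the proposed method.

```python
import numpy as np

def index_threshold(eta_hat, c, index_values, n_grid=1000):
    """Smallest index value t with eta_hat(t) >= c, so that the subgroup
    {yhat >= c} is equivalently {betahat' x >= c0} by monotonicity."""
    grid = np.linspace(index_values.min(), index_values.max(), n_grid)
    fitted = np.array([eta_hat(t) for t in grid])
    above = grid[fitted >= c]
    return above.min() if above.size else np.inf
```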
The remaining sections of this chapter are as follows. In Section 2, we consider
penalized least-squares estimation for monotone single-index models, briefly discuss
asymptotics, and discuss a method to obtain standard error estimates for β. In
Section 3, we present the results of a simulation study implemented to assess the
performance of the adaptive LASSO penalized single-index models. In Section 4, we
briefly discuss the application of this method to the randomized clinical trial data,
and in Section 5, we give concluding remarks.
2.2 Estimation for monotone single-index models
Our estimation procedure iterates between estimation of β and η until convergence. Given some η, the penalized least-squares estimator of β can be found by minimizing
$$Q(\beta) = \sum_{i=1}^{n} \left\{ y_i - \eta(\beta^T x_i) \right\}^2 + \lambda_n \sum_{j=2}^{p} w_j |\beta_j|, \qquad (2.2)$$
where $w_j$, $j = 2, \ldots, p$, are known weights and the covariates $x_i$ are standardized to have mean zero and variance 1. Due to the identifiability constraint specified in model (2.1), $\beta_1$ is not penalized. Following Zou [2006], we choose $w_j = |\hat\beta_{\mathrm{init},j}|^{-\gamma}$ for $\gamma > 0$, where $\hat\beta_{\mathrm{init}}$ is an $n^\alpha$-consistent estimator of β, with $0 < \alpha \le 1/2$. We use linear ordinary least squares (OLS) estimates for $\hat\beta_{\mathrm{init}}$, as under the assumptions of Theorem 2.1 in Li and Duan [1989], these are shown to be $\sqrt{n}$-consistent up to a multiplicative scalar. Once obtained, $\hat\beta_{\mathrm{init}}$ is rescaled by $\hat\beta_{1,\mathrm{init}}$. Alternatively, weights could be defined using the unpenalized single-index model estimates of β.
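A minimal sketch of this weight construction, assuming a design matrix X with standardized columns and no intercept, might look as follows; γ = 1 is a common default, though the text requires only γ > 0. The helper name is illustrative.

```python
import numpy as np

def adaptive_lasso_weights(y, X, gamma=1.0):
    """Weights w_j = |beta_init_j|^(-gamma) from a linear OLS fit, with the
    estimate rescaled by its first coordinate so that beta_1 = 1."""
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta_init = beta_init / beta_init[0]
    # beta_1 is fixed at 1 and unpenalized, so weights cover j = 2, ..., p only.
    return np.abs(beta_init[1:]) ** (-gamma)
```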
For a given β, without considering the monotonicity constraint, η can be estimated at some point t using the Nadaraya-Watson kernel-weighted average:
$$\hat\eta(t; \beta, y, X, h) = \frac{\sum_j y_j K\!\left(\frac{t - \beta^T x_j}{h}\right)}{\sum_j K\!\left(\frac{t - \beta^T x_j}{h}\right)}, \qquad (2.3)$$
where X is the covariate matrix, K is a fixed kernel function and h is a bandwidth.
Note that, when β is known, $\hat\eta$ is determined by h, so a value of h must be chosen. We consider kernel functions which are symmetric probability densities. For numerical stability, we hold (2.3) fixed for all t outside the range of the $\beta^T x_i$'s. That is, $\hat\eta(t) = \hat\eta(\min_i(\beta^T x_i))$ if $t < \min_i(\beta^T x_i)$ and $\hat\eta(t) = \hat\eta(\max_i(\beta^T x_i))$ if $t > \max_i(\beta^T x_i)$.
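For concreteness, a minimal implementation of (2.3) with a Gaussian kernel, including the boundary convention just described, might look as follows. The Gaussian kernel is an assumption for illustration; the text requires only a symmetric probability density.

```python
import numpy as np

def nw_estimate(t, beta, y, X, h):
    """Nadaraya-Watson estimate of eta at point t, held fixed outside the
    range of the observed index values beta' x_i."""
    u = X @ beta                      # observed index values
    t = np.clip(t, u.min(), u.max())  # boundary convention from the text
    w = np.exp(-0.5 * ((t - u) / h) ** 2)  # Gaussian kernel weights
    return np.sum(w * y) / np.sum(w)
```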
Combining (2.2) and (2.3), the adaptive LASSO estimator for β is obtained by minimizing
$$\hat Q(\beta, h) = {\sum_i}' \left\{ y_i - \hat\eta(\beta^T x_i; \beta, y, X, h) \right\}^2 + \lambda_n \sum_{j=2}^{p} w_j |\beta_j|, \qquad (2.4)$$

with respect to β, where ${\sum_i}'$ denotes summation over i such that the denominator in the kernel estimator is not too close to zero. Details can be found in Härdle et al. [1993]. With the inclusion of the penalty term in (2.4), $\hat\beta$ becomes a function of $\lambda_n$, so in addition to h, a value of $\lambda_n$ must be chosen if $\hat\beta$ is to be obtained. Throughout this chapter, we refer to the method of estimating β and η without a monotonicity constraint, using objective function (2.4), as the unconstrained approach.
2.2.1 A smooth monotone function estimate for η with fixed β
There are a variety of ways to obtain smooth monotone regression function estimates, including quadratic B-splines [He and Shi, 1998], I-splines [Ramsay, 1988], empirical distribution tilting [Hall and Huang, 2001], the scatterplot smoothing approach of Friedman and Tibshirani [1984], and the kernel-based approach of Mukerjee [1988] and Mammen [1991]. We consider the kernel-based method of the latter two papers, which we briefly describe below.
Assume β is known. The proposed monotone estimator $\hat\eta_m$ requires two steps:

Isotonization. This step involves the application of the pooled adjacent violators algorithm (PAVA) [Barlow et al., 1972]. Using the pairs $(\beta^T x_i, y_i)$, ordered by increasing $\beta^T x_i$, as data, PAVA produces monotone estimates $\hat m_1, \ldots, \hat m_n$, which are averages of the $y_j$'s near $i$ (unless the $y$'s are already monotone, in which case $\hat m_i = y_i$), and which are not necessarily smooth [Friedman and Tibshirani, 1984].

Smoothing. Apply the kernel estimator (2.3) with $y_i$ replaced by $\hat m_i$ for all $i$ to estimate η. That is, $\hat\eta_m(t) = \hat\eta(t; \beta, \hat m, X, h)$.

Since $\hat m_1, \ldots, \hat m_n$ are monotone, the resulting function estimate is monotone in t. It is worth noting that this may not necessarily be the case for other smoothing methods, such as local linear regression.
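A minimal sketch of this two-step estimator, using scikit-learn's IsotonicRegression for the PAVA step (assuming a non-decreasing η) and the Gaussian-kernel smoother from the earlier sketch, is given below. Because the isotonized values are monotone in the index, their kernel-weighted average is monotone in t.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def monotone_eta(t, beta, y, X, h):
    """Two-step monotone estimate: PAVA isotonization, then kernel smoothing."""
    u = X @ beta
    order = np.argsort(u)
    # Step 1: PAVA on (index, response) pairs ordered by increasing index.
    m_hat = IsotonicRegression().fit_transform(u[order], y[order])
    # Step 2: smooth the isotonized values with the kernel estimator (2.3).
    t = np.clip(t, u.min(), u.max())
    w = np.exp(-0.5 * ((t - u[order]) / h) ** 2)
    return np.sum(w * m_hat) / np.sum(w)
```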
As previously mentioned, a bandwidth is needed to estimate η, and can be found using cross-validation; however, our algorithm requires estimation of both η and its derivative $\eta'$, so care must be taken. In particular, to ensure good algorithmic convergence, it is crucial that $\hat\eta'$ be smooth, but to obtain a smooth estimate of $\eta'$, it is often necessary to oversmooth η. Thus, we restrict the range of potential bandwidths in our cross-validation. Specifically, h is restricted to lie between $0.1 \cdot \mathrm{sd}(X\beta)$ and $\mathrm{sd}(X\beta)$, as values in this range were found to perform well in our simulations.
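A leave-one-out cross-validation sketch over this restricted bandwidth range is given below; for simplicity it cross-validates the unconstrained kernel estimate rather than $\hat\eta_m$, which is an assumption of this illustration, as are the helper name and grid size.

```python
import numpy as np

def select_bandwidth(beta, y, X, n_grid=20):
    """Leave-one-out CV for h, restricted to [0.1 * sd(X beta), sd(X beta)]."""
    u = X @ beta
    grid = np.linspace(0.1 * np.std(u), np.std(u), n_grid)
    n = len(y)
    scores = []
    for h in grid:
        sq_errs = []
        for i in range(n):
            keep = np.arange(n) != i
            # Leave-one-out Nadaraya-Watson prediction at u[i].
            w = np.exp(-0.5 * ((u[i] - u[keep]) / h) ** 2)
            sq_errs.append((y[i] - np.sum(w * y[keep]) / np.sum(w)) ** 2)
        scores.append(np.mean(sq_errs))
    return grid[int(np.argmin(scores))]
```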
2.2.2 Estimation for β with fixed η
The shooting algorithm proposed by Fu [1998] has been shown to perform well in solving LASSO penalized least-squares problems for linear models [Friedman et al., 2007]. Therefore, we consider the application of this algorithm to LASSO problems for the single-index model. One way to achieve this is to employ a linear approximation via a Taylor series expansion of $\eta(\beta^T x_i)$ about $\beta_0^T x_i$, where $\beta_0$ is known. We define the linear approximation as follows: