Subgroup Identification and Variable Selection from Randomized Clinical
Total Page:16
File Type:pdf, Size:1020Kb
Subgroup Identification and Variable Selection from Randomized Clinical Trial Data by Jared C. Foster A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Biostatistics) in The University of Michigan 2013 Doctoral Committee: Professor Bin Nan, Co-chair Professor Jeremy M.G. Taylor, Co-chair Assistant Professor Lu Wang Professor Ji Zhu c Jared C. Foster 2013 All Rights Reserved For my wonderful family and beautiful fianc´e. I couldn't have done this without you. ii ACKNOWLEDGMENTS Thank you to Professors Jeremy Taylor, Bin Nan, Ji Zhu and Lu Wang, my dissertation committee, for all the very helpful questions and comments you have given me. Beyond this, I'd like to thank Jeremy and Bin, my co-advisors, for all the hours they spent meeting with me and all the useful advice they have given me. Bin, you introduced me to the incredibly fascinating field of variable selection, for which I am truly grateful. Jeremy, you introduced me to subgroup analysis and personalized medicine, which, combined with variable selection, has become my true passion. Additionally, despite my best efforts to become your most annoying student, you have been incredibly patient, thoughtfully responding to my many emails and discussing many ideas and issues with me whenever I randomly drop by your office. You guys are awesome, and I am truly blessed to have had the opportunity to learn from both of you. Thank you also to my office mates, past and present, for putting up with me and my love of conversation. Yong-Seok Park, Rebecca Andridge and Connie Lee were incredibly helpful when I was preparing for the qualifying exam. Anna Conlon, John Rice, Jincheng Shen, Daniel Muenz and Oliver Lee, you have been very patient with me, and I have very much enjoyed our many fascinating conversations. Edward Kennedy, Phil Boonstra and Laura Fernandes, thank you also for putting up with me, and for helping me to grow through our many philosophical and spiritual discussions. iii To my parents, Dr. James and Mrs. Janet Foster, and Jenna and Jeff Spielmann, my sister and brother-in-law, thank you so much for all of your support throughout my life, especially these past seven years. You guys are amazing, and I love you all very much. Finally, to the love of my life, Grace, thank you so much for all you have put up with these past seven years. You have been incredibly loving, patient and supportive, and have inspired me to become a better man. I definitely could not have done any of this without you. I can't wait to spend the rest of my life with you! iv TABLE OF CONTENTS DEDICATION :::::::::::::::::::::::::::::::::: ii ACKNOWLEDGMENTS ::::::::::::::::::::::::::: iii LIST OF FIGURES ::::::::::::::::::::::::::::::: vii LIST OF TABLES :::::::::::::::::::::::::::::::: ix CHAPTER 1. Introduction ..............................1 2. Variable selection in monotone single-index models via the adaptive LASSO ............................7 2.1 Introduction . .7 2.2 Estimation for monotone single index models . 10 2.2.1 A smooth monotone function estimate for η with fixed β ......................... 11 2.2.2 Estimation for β with fixed η ............. 13 2.2.3 Algorithm . 14 2.2.4 Asymptotics . 16 2.2.5 Bootstrap standard errors . 17 2.3 Simulations . 18 2.3.1 Examples . 19 2.4 Example data . 23 2.5 Discussion . 27 3. Selection of simple rules for treatment assignment using pa- tient information ............................ 29 3.1 Introduction . 29 3.2 Identifying subgroups using the average value function . 31 v 3.2.1 Nonparametric estimation of g and h ........ 32 3.2.2 Selecting a subgroup for fixed g and h ........ 33 3.2.3 Evaluation of the region A^ .............. 36 3.2.4 Selection of δ ...................... 38 3.3 Simulations . 39 3.4 Application to randomized clinical trial data . 45 3.5 Discussion . 50 4. Permutation testing for treatment-covariate interactions .. 53 4.1 Introduction . 53 4.2 Permutation tests . 56 4.2.1 Review of permutation tests . 56 4.2.2 Modified permutation methods . 57 4.3 Simulations . 62 4.3.1 Simple linear model example . 63 4.3.2 Complex model example . 67 4.4 Application to data from a randomized clinical trial . 70 4.5 Discussion . 72 5. Discussion and future work ..................... 83 APPENDIX :::::::::::::::::::::::::::::::::::: 87 A. Asymptotics for Chapter 1 ..................... 88 BIBLIOGRAPHY :::::::::::::::::::::::::::::::: 93 vi LIST OF FIGURES Figure 2.1 Averageη ^m values from 100 simulations . 20 2.2 Estimates of functionη ^(·) from Eli Lilly data. Index values in the T plotted data are β^ x, where β^ comes from the constrained approach, and treatment effect estimates are the Z values from Virtual Twins procedure. Those points to the right of the vertical dotted line would be considered \enhanced" based on this analysis. 25 3.1 Plot of True g(x) for Cases 1-4 . 45 3.2 Histogram of True g(x) for Cases 1-4 . 46 3.3 Histogram ofg ^(x) for TROPHY Data . 50 4.1 Rejection rates (α = 0:05) for simple linear model simulations . 66 4.2 Histograms of p-values for simple linear scenario 1 (β = 0:5, γ = 0:25, θ = 0:35)................................. 74 4.3 Histograms of p-values for simple linear scenario 2 (β = 0:5, γ = 0:25, θ =0) .................................. 75 4.4 Histograms of p-values for simple linear scenario 3 (β = 0:25, γ = 0:5, θ = 0:35)................................. 76 4.5 Histograms of p-values for simple linear scenario 4 (β = 0:25, γ = 0:5, θ =0) .................................. 77 4.6 Rejection rates (α = 0:05) for complex model simulations . 78 4.7 Histograms of p-values for complex scenario 1 (g(x) = 15I(x1 > 0; x2 > 0))................................ 79 vii 4.8 Histograms of p-values for complex scenario 2 (g(xi) = 5) . 80 4.9 Histograms of p-values for complex scenario 3 (g(xi) = 5 + ei)... 81 4.10 Histograms of p-values for complex scenario 3 (g(x) = 0) . 82 viii LIST OF TABLES Table 2.1 Simulation results: variable selection performance . 21 2.2 Performance of standard error estimates . 23 2.3 Estimates for Eli Lilly data . 26 3.1 Simulation study results: subgroup identification performance . 44 3.2 Simulation study results: Q(A^) estimation performance . 47 4.1 P-values for TROPHY data . 71 ix CHAPTER 1 Introduction In recent years, there has been a great deal of interest in using a patient's specific information to make more informed treatment decisions. This practice, often referred to as personalized medicine, is motived by the fact that few treatments are equally effective for all individuals in a population. A treatment may be quite effective for some subset of a population, but mildly effective, ineffective, or even harmful for others. Because of this, there is great interest in identifying which individuals in a population, if any, will respond well to a treatment. More specifically, it is necessary to identify which characteristics, if any, lead to this enhanced response if one wishes to pursue a personalized treatment. A generally accepted approach to identifying such characteristics is to pre-define a small number of subregions of the covariate space before looking at the data, and then evaluate them. However, one may not always know which subgroups to consider, and increasing the number of subgroups considered also increases the risk of false positive findings, which is already a well-known danger in subgroup analysis. One way to reduce this risk of false positives is to employ a multiple testing approach, such as a Bonferroni correction. This can effectively reduce false positives, but also decreases 1 2 the power to detect true subgroups. An alternative approach, which will be the main focus of this dissertation, is to pre-define a statistical procedure for identifying subgroups [Ruberg et al., 2010]. These procedures can eliminate the need for a priori specification of subgroups, and may help to reduce the risk of false findings, though some risk still remains. Such procedures have been proposed by many authors, including Friedman and Fisher [1999], Negassa et al. [2005], Su et al. [2008, 2009], Brinkley et al. [2010], Cai et al. [2011], Foster et al. [2011], Lipkovich et al. [2011], Qian and Murphy [2011], Zhao et al. [2011], Imai and Ratkovic [2012] and Zhang et al. [2012]. When developing such a procedure, it is common to first consider what each subject's potential outcome value would be, given each of the treatment options [Zhang et al., 2012]. That is, when two treatment options (say treatment 1 and treatment 0) are present, it is common to consider what each subject's potential outcome would be given treatment 1 and what it would be given treatment 0. One general way to identify subgroups is to first estimate these two potential outcomes, and then take the difference, which is an estimate of the treatment effect. One can then investigate the relationship between these estimated treatment effects and the covariates to obtain subgroups. Alternatively, one could attempt to identify subgroups by maximizing the expected response under a class of treatment regimes [Gunter et al., 2007, Zhang et al., 2012]. The treatment regimes in such a case will generally be defined based on the covariates, and will be of the form xi 2 A ) Ti = 1, xi 2= A ) Ti = 0, where xi and Ti are the covariate vector and treatment indicator for subject i and A is some region of the covariate space. Thus, with this approach, the \optimal" subgroup is the region A which maximizes the expected response when only individuals in A will receive treatment 1 and only individuals in region Ac will receive treatment 0.