Additive Cox Proportional Hazards Models for Next-Generation Sequencing Data
Huda Mohammed Alshanbari
The University of Leeds
Department of Statistics

Submitted in accordance with the requirements for the degree of Doctor of Philosophy
November 2017

The candidate confirms that the work submitted is her own, except where work which has formed part of jointly authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated below. The candidate confirms that appropriate credit has been given within the thesis where reference has been made to the work of others.

This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

©2017 The University of Leeds and Huda Mohammed Alshanbari

Acknowledgements

I would like to express my great thanks to my supervisors, Dr Stuart Barber and Dr Arief Gusnanto, for all the support, enthusiasm, guidance, helpful conversations and advice throughout my PhD study, and for their insights and comments on my thesis. Indeed, words cannot express my deep gratitude; it has been a great pleasure to work with them. Many thanks to all the administrative staff of the School of Mathematics, and to Tim Hainsworth, who has always been ready to help me with computing problems.

This thesis is dedicated to the memory of my father, Mohammed Alshanbari. I miss him every day; may Allah forgive him and grant him his highest paradise (Ameen). A great thank you to my beloved mother for her emotional support, encouragement and spiritual support throughout my life. I am profoundly grateful to my husband Khalid Alrafaay and my daughter Saja, who bring joy to my life.

Throughout my PhD, I was financially supported by Princess Nourah Bint Abdulrahman University.
Abstract

Eighty-nine Non-Small Cell Lung Cancer (NSCLC) patients experience chromosomal rearrangements called Copy Number Alterations (CNAs), in which cells have an abnormal number of copies of one or more regions of their genome; these genetic alterations are known to drive cancer development. An important aim of this thesis is to propose a way to combine the clinical covariates, as fixed predictors, with CNA genomic-windows, as smoothing terms, using the penalized additive Cox Proportional Hazards (PH) model. Most of the previously proposed prediction methods assume that the CNA genomic-windows and the clinical covariates have linear effects. However, continuous covariates can affect the hazard via more complicated nonlinear functional forms. A Cox PH model with continuous covariates is therefore likely to be misspecified, because it does not fit the correct functional form for those covariates. Some reported work on combining clinical covariates with high-dimensional genomic data for clinical genomic prediction is based on the standard Cox PH model, and most of it focuses on applying variable selection to high-dimensional CNA genomic data. Our main interest is to propose a variable selection procedure that selects important nonlinear effects from the CNA genomic-windows. Two different approaches to feature selection are presented: discrete and shrinkage. Discrete feature selection is based on penalized univariate variable selection, which identifies the subset of CNA genomic-windows that have the strongest effects on survival time. Feature selection by shrinkage works by adding a second penalty to the penalized partial log-likelihood, which penalizes the smoothing coefficients in the model; as a result, some of the smoothing coefficients are set to zero.
For the NSCLC dataset, we find that the size of the tumor and the spread of cancer into the lymph nodes are significant factors that increase the patients' hazard of death, and the estimated smooth log hazard ratio curves show that some of the significant CNA genomic-windows across the genome are associated with a higher or lower hazard of death.

Contents

Acknowledgements
Abstract
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Objective of Thesis
  1.2 Outline of Thesis
2 Estimation of Copy Number Alteration in Lung Cancer Data
  2.1 Introduction
  2.2 Description of Non-Small Cell Lung Cancer Dataset
    2.2.1 Clinical Characteristics
    2.2.2 Next-Generation Sequencing for Copy Number Alteration
  2.3 Analysis of Copy Number Alteration
    2.3.1 GC Correction
    2.3.2 A Smooth Segmentation Approach
    2.3.3 Genome-wide Normalization
    2.3.4 Contamination Correction
  2.4 Real Data Normalization
  2.5 Conclusion
3 Survival Analysis
  3.1 Introduction
  3.2 Cox Proportional Hazards (PH) Models
  3.3 Estimation of The Model Parameters
    3.3.1 The Full Likelihood Function
    3.3.2 The Partial Likelihood Function
  3.4 Breslow Estimator
  3.5 Checking the Proportional Hazard Assumption
  3.6 Residual for Cox PH model
  3.7 The Cox PH model for Clinical Characteristic Data
  3.8 Conclusion
4 Generalized Additive Model
  4.1 Introduction
  4.2 Generalized Additive Model Overview
  4.3 Spline Basis and Penalties
    4.3.1 Radial Basis Functions
    4.3.2 Cubic Splines
  4.4 Penalized Iteratively Re-weighted Least Square Estimation (PIRLS)
  4.5 Selecting the Smoothing Parameter
  4.6 Choosing the Number of Knots
    4.6.1 Myopic algorithm
    4.6.2 Full-search Algorithm
  4.7 P-value
  4.8 Logistic Regression with GAM for Clinical Characteristic of Lung Cancer Dataset
    4.8.1 Modelling
  4.9 Conclusion
5 Additive Cox PH Model
  5.1 Introduction
  5.2 Extending The Cox PH Model
  5.3 The Smoothing Log Hazard Ratio
  5.4 Simulating Survival Time from Standard Parametric Distributions
    5.4.1 Generating Survival Times to Simulate Cox PH models
  5.5 Simulation Study
    5.5.1 Generating Survival Times to Simulate Additive Cox PH Models
    5.5.2 Simulating Additive Cox Model for One Smooth Term
    5.5.3 Simulating Additive Cox Model for Two Smoothing Terms
    5.5.4 Simulating Additive Cox Model with One Smoothing Term and One Categorical Variable
  5.6 Test for Covariates Effects
    5.6.1 Test for the Spline Effect Parameters
    5.6.2 Simulation Under the Null Hypothesis of the Tests in the Penalized Additive Cox PH Model
  5.7 Estimating the Smoothing Parameters and the Number of Knots
  5.8 Model Diagnostics
    5.8.1 Estimation of h0(t) and S(t|x)
    5.8.2 Cox-Snell Residuals
  5.9 Real Data Analysis
  5.10 Conclusion
6 Variable Selection for Penalized Additive Cox PH Model
  6.1 Introduction
  6.2 Selecting Variables by Testing Individual Covariates
    6.2.1 Univariate Selection Based on Standard Cox PH Model
    6.2.2 Univariate Selection Based on Penalized Additive Cox PH Model
  6.3 Dependencies of Significant CNA genomic-windows
  6.4 Forward Stepwise Selection
    6.4.1 Cluster Analysis
  6.5 Conclusion
7 Shrinkage Penalty Variable Selection
  7.1 Introduction
  7.2 Shrinkage Method
    7.2.1 Double Penalty Approach
    7.2.2 Shrinkage Approach
  7.3 Simulation Study
    7.3.1 Simulation Setting
    7.3.2 Simulation Results
  7.4 Choosing the Optimal Value of
    7.4.1 Simulation Study
    7.4.2 Simulation Study Using Real Data
  7.5 Real Data Analysis
    7.5.1 Fit the Shrinkage Penalized Additive Cox PH Model with One Smoothing Parameter
  7.6 Forward Stepwise Selection Using the Shrinkage Approach
  7.7 A Comparison between the Discrete Feature Selection and Shrinkage Feature Selection
  7.8 Conclusion
8 Conclusion and Future Work
  8.1 Conclusion
  8.2 Future Work
    8.2.1 Selecting the Optimal Values of the Smoothing Parameters
Bibliography
Appendices
A Significant CNA genomics windows
  A.1 Significant CNA Genomics Windows in the Penalized Univariate Selection
  A.2 The Common significant CNA Genomic-Windows for both Univariate Variable Selection and Penalized Univariate Variable Selection

List of Figures

2.1 Histogram of ratio and smoothed ratio
2.2 The fit of the mixture distribution to the smoothed ratio
2.3 Copy number ratio with smoothed-segmentation line along genome
2.4 Copy number ratio with normalized and segmented DNAcopy
2.5 Copy number ratios for chromosome three
3.1 Plots of scaled Schoenfeld residuals
3.2 Plots of scaled Schoenfeld residuals
3.3 Plots of the martingale residuals versus Age with smooth curve the black solid line.
3.4 Plots of the estimated of Cox-Snell residual
3.5 Plots of the estimated cumulative hazard function
4.1 The plot of the smooth function of Age in model 20
5.1 The plot of the estimated log hazard ratio f̂(x) versus x. The gray lines are the estimated log hazard ratio f̂(x) for the 500 simulations, the black line indicates the true curve f(x) = sin(x), and the red dashed line is the average of f̂(x) for the 500 simulations.
5.2 The plot of the mean square error versus x, the dotted vertical lines indicate the locations of the knots.