Penalised Maximum Likelihood Estimation for Multi-State Models
Total Page:16
File Type:pdf, Size:1020Kb
Penalised maximum likelihood estimation for multi-state models Robson Jose´ Mariano Machado A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London. Department of Statistical Science University College London October 17, 2018 2 I, Robson Jose´ Mariano Machado, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work. Abstract Multi-state models can be used to analyse processes where change of status over time is of interest. In medical research, processes are commonly defined by a set of living states and a dead state. Transition times between living states are often in- terval censored. In this case, models are usually formulated in a Markov processes framework. The likelihood function is then constructed using transition probabili- ties. Models are specified using proportional hazards for the effect of covariates on transition intensities. Time-dependency is usually defined by parametric models, which can represent a strong model assumption. Semiparametric hazards specifica- tion with splines is a more flexible method for modelling time-dependency in multi- state models. Penalised maximum likelihood is used to estimate these models. Se- lecting the optimal amount of smoothing is challenging as the problem involves multiple penalties. This thesis aims to develop methods to estimate multi-state models with splines for interval-censored data. We propose a penalised likelihood method to estimate multi-state models that allow for parametric and semiparametric hazards specifications. The estimation is based on a scoring algorithm, and a grid search method to estimate the smoothing parameters. This method is shown using an application to ageing research. Furthermore, we extend the proposed method by developing a computationally more efficient method to estimate multi-state models with splines. For this extension, the estimation is based on a scoring algorithm, and an automatic smoothing parameters selection. The extended method is illustrated with two data analyses and a simulation study. Acknowledgements I would like to thank the Conselho Nacional de Desenvolvimento Cient´ıfico e Tec- nologico´ - Brazil for funding this research. I would like to express my sincere grati- tude to Dr Ardo van den Hout for his thorough supervision of my PhD. In particular, I am thankful for all the opportunities he provided, for motivating me and for his patience. I would like to thank Dr Giampiero Marra for providing good directions to this research. I am thankful to Dr Aidan O’Keeffe and Dr Andrew Titman for their valuable comments that significantly improved this thesis. I thank Dr Boo Jo- hansson for permission to use the OCTO data, a connection that was made possible by Dr Graciela Muniz-Terrera to whom I am also thankful. In addition, I wish to thank the team of researchers behind the ELSA study. I would like to thank my friends and colleagues from University College London for the time spent together and exchange of ideas. Finally, I am thankful to my family and friends for their support in general and throughout my academic path, especially to my parents for making me independent. Special thanks to Verena for making everything easier during the development of this research, for providing support in difficult moments and for celebrating the good ones. Contents 1 Introduction 14 1.1 Research aim and overview . 14 1.2 Special features of multi-state processes . 16 1.2.1 Model structure . 17 1.2.2 Observational patterns . 18 1.3 Literature review . 19 1.4 Survival analysis . 22 1.4.1 The exponential distribution . 25 1.4.2 The Gompertz distribution . 25 1.4.3 The Weibull distribution . 26 1.4.4 The lognormal distribution . 27 1.5 Simulation of multi-state processes . 29 1.6 Examples . 29 1.6.1 Cardiac allograft vasculopathy data . 29 1.6.2 English longitudinal study of ageing data . 31 1.6.3 Origins of variance in the oldest-old data . 32 1.7 Discussion . 33 2 Parametric multi-state models 35 2.1 Continuous-time Markov processes . 35 2.2 Model representation . 38 2.3 Piecewise-constant hazards approximation . 40 2.4 Maximum likelihood estimation . 41 Contents 6 2.4.1 Likelihood function of the model . 41 2.4.2 Scoring algorithm . 43 2.5 Confidence intervals . 45 2.6 Model selection . 46 2.7 Model validation . 47 2.8 Applications . 48 2.8.1 Origins of variance in the oldest-old data . 48 2.8.2 Cardiac allograft vasculopathy data . 51 2.9 Discussion . 56 3 Multi-state models with splines 57 3.1 Introduction . 57 3.1.1 Smoothing methods . 58 3.1.2 Cubic regression splines . 59 3.1.3 P-splines . 61 3.2 Model representation . 63 3.3 Penalised maximum likelihood estimation . 65 3.3.1 Penalised log-likelihood function . 65 3.3.2 Parameter estimation . 66 3.3.3 Smoothing parameter estimation . 68 3.4 Prediction . 69 3.5 Application to the English longitudinal study of ageing data . 69 3.5.1 Predicting cognitive function . 76 3.6 Discussion . 78 4 Automatic smoothing for multi-state models 81 4.1 Background for automatic smoothing . 81 4.2 Penalised maximum likelihood estimation . 85 4.2.1 Model representation . 85 4.2.2 Piecewise-constant hazards . 85 4.2.3 Penalised log-likelihood function . 86 Contents 7 4.2.4 Parameter estimation . 86 4.2.5 Smoothing parameters estimation . 88 4.2.6 Summary of the algorithm . 89 4.2.7 Confidence intervals . 89 4.3 Simulation study . 90 4.4 Applications . 92 4.4.1 Origins of variance in the oldest-old data . 93 4.4.2 Cardiac allograft vasculopathy data . 98 4.5 Discussion . 101 5 Conclusions and future work 103 Appendices 107 A Code for the R software 107 Bibliography 116 List of Figures 1.1 A two-state survival model . 17 1.2 An unidirectional model . 17 1.3 An illness-death model . 18 1.4 A competing risks model . 18 1.5 The trajectory of an illness-death process. Dashed vertical lines represent the follow-up times. Solid horizontal lines represent the state occupancy over time . 19 1.6 Gompertz hazard functions for scale and shape parameters (l;q) = (1;−0:8);(0:1;0:15) and (0:01;0:45) ................ 26 1.7 Weibull hazard function for scale parameter l = 1 and the shape parameters g = 0:5; 1; 2 and 3 . 27 1.8 Lognormal hazard functions for m = 1 and s 2 = 0:75; 1; and 1:75 . 28 1.9 Four-state model for disease progression after transplant for the CAV data . 30 1.10 Five-state model for longitudinal data in ELSA on number of words remembered in a recall . 32 1.11 Four-state model for longitudinal data in OCTO . 33 2.1 Histogram of age at baseline transformed by minus 80 in the OCTO data. The variable t represents age minus 80 . 49 2.2 Estimated Gompertz hazards (solid lines) for women, with 95% confidence intervals (dashed lines). The confidence intervals are obtained by simulation with b = 1000 replications . 51 List of Figures 9 2.3 Comparison of model-based survival from states 1, 2, and 3 with Kaplan-Meier curves. Model-based survival: grey lines for indi- viduals, blue lines for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals. Fre- quencies for baseline state along vertical axes . 52 2.4 Illness-death without recovery model for disease progression after transplant for the CAV data . 53 2.5 Histogram of time since transplant in the CAV data . 54 2.6 Estimated Gompertz hazards for subjects with IHD and with donor age of 26 (solid lines), with 95% confidence intervals (dashed lines). The confidence intervals are obtained by simulation with b = 1000 replications . 55 2.7 Comparison of model-based survival from states 1 with Kaplan- Meier curves. Model-based survival: grey lines for individuals, blue line for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals . 56 3.1 The left hand panel illustrates one basis function, B4(x), for a cu- bic regression spline. On the right hand panel, the various curves of medium thickness show the basis functions, Bi(x), of a cubic re- gression spline, each multiplied by its coefficients a j. These scaled basis functions are summed to get the smooth curve illustrated by the thick continuous curve . 61 3.2 Illustrations of B-splines basis functions of degrees (a) one, (b) two, and (c) three . 62 3.3 Illustration of smooth curves made up of B-spline basis functions. On the left hand panel, the dashed curves show B-splines basis func- tions with m = 1 multiplied by their associated coefficients. The thick solid smooth curve is the sum of the scaled basis functions. The right hand panel shows the same, but for B-splines basis func- tion with m = 2............................ 63 List of Figures 10 3.4 Five-state model for longitudinal data in ELSA on number of words remembered in a recall . 70 3.5 AIC results and fitted hazard transitions for men with ten or more year of education. In (a) and (c), the AIC results for fixed l2 = 10 −3 and fixed l1 = 10 , respectively. In (b) and (d), the estimated hazards for 2 ! 3 and 3 ! 4, respectively. Solid line for P-splines I and dotted line for Gompertz. In (e) and (f), the AIC results for model P-splines II and fitted hazard for 3 ! 4, respectively. Time denotes age transformed by subtracting 49 years .