Penalised maximum likelihood estimation for multi-state models

Robson Jose´ Mariano Machado

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London.

Department of Statistical Science University College London

October 17, 2018 2

I, Robson Jose´ Mariano Machado, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work. Abstract

Multi-state models can be used to analyse processes where change of status over time is of interest. In medical research, processes are commonly defined by a set of living states and a dead state. Transition times between living states are often in- terval censored. In this case, models are usually formulated in a Markov processes framework. The likelihood function is then constructed using transition probabili- ties. Models are specified using proportional hazards for the effect of covariates on transition intensities. Time-dependency is usually defined by parametric models, which can represent a strong model assumption. Semiparametric hazards specifica- tion with splines is a more flexible method for modelling time-dependency in multi- state models. Penalised maximum likelihood is used to estimate these models. Se- lecting the optimal amount of smoothing is challenging as the problem involves multiple penalties. This thesis aims to develop methods to estimate multi-state models with splines for interval-censored data. We propose a penalised likelihood method to estimate multi-state models that allow for parametric and semiparametric hazards specifications. The estimation is based on a scoring algorithm, and a grid search method to estimate the smoothing parameters. This method is shown using an application to ageing research. Furthermore, we extend the proposed method by developing a computationally more efficient method to estimate multi-state models with splines. For this extension, the estimation is based on a scoring algorithm, and an automatic smoothing parameters selection. The extended method is illustrated with two data analyses and a simulation study. Acknowledgements

I would like to thank the Conselho Nacional de Desenvolvimento Cient´ıfico e Tec- nologico´ - Brazil for funding this research. I would like to express my sincere grati- tude to Dr Ardo van den Hout for his thorough supervision of my PhD. In particular, I am thankful for all the opportunities he provided, for motivating me and for his patience. I would like to thank Dr Giampiero Marra for providing good directions to this research. I am thankful to Dr Aidan O’Keeffe and Dr Andrew Titman for their valuable comments that significantly improved this thesis. I thank Dr Boo Jo- hansson for permission to use the OCTO data, a connection that was made possible by Dr Graciela Muniz-Terrera to whom I am also thankful. In addition, I wish to thank the team of researchers behind the ELSA study. I would like to thank my friends and colleagues from University College London for the time spent together and exchange of ideas. Finally, I am thankful to my family and friends for their support in general and throughout my academic path, especially to my parents for making me independent. Special thanks to Verena for making everything easier during the development of this research, for providing support in difficult moments and for celebrating the good ones. Contents

1 Introduction 14 1.1 Research aim and overview ...... 14 1.2 Special features of multi-state processes ...... 16 1.2.1 Model structure ...... 17 1.2.2 Observational patterns ...... 18 1.3 Literature review ...... 19 1.4 Survival analysis ...... 22 1.4.1 The ...... 25 1.4.2 The Gompertz distribution ...... 25 1.4.3 The ...... 26 1.4.4 The lognormal distribution ...... 27 1.5 Simulation of multi-state processes ...... 29 1.6 Examples ...... 29 1.6.1 Cardiac allograft vasculopathy data ...... 29 1.6.2 English longitudinal study of ageing data ...... 31 1.6.3 Origins of in the oldest-old data ...... 32 1.7 Discussion ...... 33

2 Parametric multi-state models 35 2.1 Continuous-time Markov processes ...... 35 2.2 Model representation ...... 38 2.3 Piecewise-constant hazards approximation ...... 40 2.4 Maximum likelihood estimation ...... 41 Contents 6

2.4.1 Likelihood function of the model ...... 41 2.4.2 Scoring algorithm ...... 43 2.5 Confidence intervals ...... 45 2.6 Model selection ...... 46 2.7 Model validation ...... 47 2.8 Applications ...... 48 2.8.1 Origins of variance in the oldest-old data ...... 48 2.8.2 Cardiac allograft vasculopathy data ...... 51 2.9 Discussion ...... 56

3 Multi-state models with splines 57 3.1 Introduction ...... 57 3.1.1 Smoothing methods ...... 58 3.1.2 Cubic regression splines ...... 59 3.1.3 P-splines ...... 61 3.2 Model representation ...... 63 3.3 Penalised maximum likelihood estimation ...... 65 3.3.1 Penalised log-likelihood function ...... 65 3.3.2 Parameter estimation ...... 66 3.3.3 Smoothing parameter estimation ...... 68 3.4 Prediction ...... 69 3.5 Application to the English longitudinal study of ageing data . . . . . 69 3.5.1 Predicting cognitive function ...... 76 3.6 Discussion ...... 78

4 Automatic smoothing for multi-state models 81 4.1 Background for automatic smoothing ...... 81 4.2 Penalised maximum likelihood estimation ...... 85 4.2.1 Model representation ...... 85 4.2.2 Piecewise-constant hazards ...... 85 4.2.3 Penalised log-likelihood function ...... 86 Contents 7

4.2.4 Parameter estimation ...... 86 4.2.5 Smoothing parameters estimation ...... 88 4.2.6 Summary of the algorithm ...... 89 4.2.7 Confidence intervals ...... 89 4.3 Simulation study ...... 90 4.4 Applications ...... 92 4.4.1 Origins of variance in the oldest-old data ...... 93 4.4.2 Cardiac allograft vasculopathy data ...... 98 4.5 Discussion ...... 101

5 Conclusions and future work 103

Appendices 107

A Code for the R software 107

Bibliography 116 List of Figures

1.1 A two-state survival model ...... 17 1.2 An unidirectional model ...... 17 1.3 An illness-death model ...... 18 1.4 A competing risks model ...... 18 1.5 The trajectory of an illness-death process. Dashed vertical lines represent the follow-up times. Solid horizontal lines represent the state occupancy over time ...... 19 1.6 Gompertz hazard functions for scale and shape parameters (λ,θ) = (1,−0.8),(0.1,0.15) and (0.01,0.45) ...... 26 1.7 Weibull hazard function for λ = 1 and the shape parameters γ = 0.5, 1, 2 and 3 ...... 27 1.8 Lognormal hazard functions for µ = 1 and σ 2 = 0.75, 1, and 1.75 . 28 1.9 Four-state model for disease progression after transplant for the CAV data ...... 30 1.10 Five-state model for longitudinal data in ELSA on number of words remembered in a recall ...... 32 1.11 Four-state model for longitudinal data in OCTO ...... 33

2.1 Histogram of age at baseline transformed by minus 80 in the OCTO data. The variable t represents age minus 80 ...... 49 2.2 Estimated Gompertz hazards (solid lines) for women, with 95% confidence intervals (dashed lines). The confidence intervals are obtained by simulation with b = 1000 replications ...... 51 List of Figures 9

2.3 Comparison of model-based survival from states 1, 2, and 3 with Kaplan-Meier curves. Model-based survival: grey lines for indi- viduals, blue lines for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals. Fre- quencies for baseline state along vertical axes ...... 52 2.4 Illness-death without recovery model for disease progression after transplant for the CAV data ...... 53 2.5 Histogram of time since transplant in the CAV data ...... 54 2.6 Estimated Gompertz hazards for subjects with IHD and with donor age of 26 (solid lines), with 95% confidence intervals (dashed lines). The confidence intervals are obtained by simulation with b = 1000 replications ...... 55 2.7 Comparison of model-based survival from states 1 with Kaplan- Meier curves. Model-based survival: grey lines for individuals, blue line for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals ...... 56

3.1 The left hand panel illustrates one basis function, B4(x), for a cu- bic regression spline. On the right hand panel, the various curves

of medium thickness show the basis functions, Bi(x), of a cubic re-

gression spline, each multiplied by its coefficients α j. These scaled basis functions are summed to get the smooth curve illustrated by the thick continuous curve ...... 61 3.2 Illustrations of B-splines basis functions of degrees (a) one, (b) two, and (c) three ...... 62 3.3 Illustration of smooth curves made up of B-spline basis functions. On the left hand panel, the dashed curves show B-splines basis func- tions with m = 1 multiplied by their associated coefficients. The thick solid smooth curve is the sum of the scaled basis functions. The right hand panel shows the same, but for B-splines basis func- tion with m = 2...... 63 List of Figures 10

3.4 Five-state model for longitudinal data in ELSA on number of words remembered in a recall ...... 70

3.5 AIC results and fitted hazard transitions for men with ten or more

year of education. In (a) and (c), the AIC results for fixed λ2 = 10 −3 and fixed λ1 = 10 , respectively. In (b) and (d), the estimated hazards for 2 → 3 and 3 → 4, respectively. Solid line for P-splines I and dotted line for Gompertz. In (e) and (f), the AIC results for model P-splines II and fitted hazard for 3 → 4, respectively. Time denotes age transformed by subtracting 49 years ...... 74

3.6 Comparison of model-based survival from states 1, 2, 3, and 4 with Kaplan-Meier curves. Model-based survival: grey lines for indi- viduals, smooth black line for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence bands. Frequencies for baseline state along vertical axes ...... 77

3.7 For the P-splines II model, estimated ten-year transition probabil- ities for men aged 60 with ten or more years of education, and in state 3 at baseline. Solid line for transition (with B = 1000) and dashed lines for 95% confidence bands ...... 78

4.1 An unidirectional three-state model ...... 82

4.2 Simulation study: true (black lines), estimated (grey lines) and mean estimated (red lines) hazards for the illness-death model for 100 replications ...... 91

4.3 Four-state model for longitudinal data in OCTO ...... 94

4.4 Histogram of age transformed by minus 80 in the OCTO data . . . . 95

4.5 Estimated smooth hazards (solid lines) for women, with 95% con- fidence intervals (dashed lines) ...... 96 List of Figures 11

4.6 Comparison of model-based survival from states 1, 2, and 3 with Kaplan-Meier curves. Model-based survival: grey lines for indi- viduals, smooth blue lines for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals. Frequencies for baseline state along vertical axes ...... 97 4.7 Illness-death without recovery model for disease progression after transplant ...... 98 4.8 Histogram of time since transplant in the CAV data ...... 99 4.9 Estimated smooth hazards for subjects with IHD and with donor age of 26 (solid lines), with 95% confidence intervals (dashed lines) 100 4.10 Comparison of model-based survival with Kaplan-Meier curves for the Gompertz model (left-hand side) and spline model (right-hand side). Model-based survival: grey lines for individuals, blue lines for the mean of the individual curves. Kaplan-Meier in black lines with 95% confidence intervals ...... 101 List of Tables

1.1 State table for the CAV data: number of times each pair of states was observed at successive observation times. The three living states are defined by CAV severity ...... 31

1.2 State table for the OCTO data: number of times each pair of states was observed at successive observation times. The three living states are defined by grades of cognitive function ...... 34

2.1 State table for the OCTO data for the history of State 3: number of times each pair of states was observed at successive observation times. The three living states are defined by grades of cognitive function ...... 48

2.2 Parameter estimates for the four-state for the OCTO data. Estimated standard errors in parentheses. Time scale t is age in years minus 80 49

2.3 State table for the CAV data: number of times each pair of states was observed at successive observation times. The two living states are defined by CAV severity ...... 53

2.4 Parameter estimates for the illness-death without recovery for the CAV data. Estimated standard errors in parentheses. The variable t represents time since baseline ...... 54

3.1 State table for the ELSA data: number of times each pair of states was observed at successive observation times. The four living states are defined by number of words remembered ...... 70 List of Tables 13

3.2 Comparison between models for the ELSA data with N = 1000, where -2LL stands for -2 times the (penalised) loglikelihood func- tion evaluated at its maximum. The variable t denotes age trans- formed by subtracting 49 years ...... 73 3.3 Results for sex, education and time for the five-state P-splines II model for the ELSA data. Estimated standard errors in parentheses. The variable t denotes age transformed by subtracting 49 years . . . 75

4.1 Simulation study to investigate the performance of the multi- state models with splines for modelling time-dependent processes. Mean, bias and estimated standard errors (eSE) for R = 100 repli- cations. Absolute bias less than x is denoted by dxe ...... 93 Chapter 1

Introduction

1.1 Research aim and overview

Multi-state models are a broadly applicable approach to analysing longitudinal data, where change of status over time is of interest. In medical research, these statuses are usually defined by the severity of a disease or condition of subjects in a follow- up study. Some standard examples include, dementia research in which interest lies in the decline of cognitive function, post-transplantation chronic disease studies, which aim to investigate disease progression after transplant, and human immun- odeficiency virus (HIV), where modelling the transition of subjects across disease stages is of interest. Data deriving from these medical conditions share similar char- acteristics inherent to how data are obtained. Specifically, time of change in a med- ical condition cannot always be observed exactly, instead changes are recorded at pre-specified follow-up times. In this case, transition times are said to be interval- censored. On the other hand, if a dead state is defined, time of death is usually known exactly. Multi-state models for data with these characteristics are the focus of this thesis.

Multi-state models for interval-censored data are commonly formulated in a Markov processes framework (Kalbfleisch and Lawless, 1985). The Markov prop- erty states that the future of the process only depends on the current state. Models can be specified through the transition intensities (also called hazards), which rep- resent the instantaneous risks of moving across states. To facilitate estimation, a 1.1. Research aim and overview 15 time homogeneous Markov process is usually assumed (Kalbfleisch and Lawless, 1985; Jackson, 2011). In this case, the hazards are assumed to be constant over time. For a wide range of applications, the risks of moving across states depend on the current state and on time. In this case, a non-homogeneous Markov assumption is assumed to model the multi-state process. Several time-dependent models can be fitted with parametric specifications (Van den Hout, 2017). However, the functional forms underlying the hazards are often unknown and parametric models can be too restrictive.

Flexible multi-state models can be obtained with smooth nonparametric haz- ard functions specification. In this case, the hazards are specified in terms of splines function basis. Splines are polynomial functions, which allow for flexible modelling De Boor (1978). In particular, they enable modelling time-dependent hazards with- out making strong model assumptions. For each hazard function, an extra parameter is employed to control model smoothness. These parameters are commonly called smoothing parameters. In the frequentist context, penalised maximum likelihood is used to estimate the models. Estimation of multi-state models with splines can be carried out in two challenging steps. First, for fixed values of smoothing parame- ters, penalised maximum likelihood estimation is used to obtain estimates for the model parameters. Second, given the estimates of model parameters in the first step, we aim to define a stable and efficient method to estimate the optimal values for the smoothing parameters. These two steps are iterated until a convergence criterion is met. At the present time, methods available cannot fully address the problem of estimating flexible multi-state models in the presence of interval censoring.

This thesis aims to develop a new efficient method for estimating multi-state models with splines in the presence of censoring. The focus is on observation schemes in which the transition times between living states are interval-censored and times into the dead (or absorbing) state are known exactly (or right-censored).

In the remainder of this chapter, some special features of multi-state modelling are presented. This includes a description of basic concepts of survival analysis that are used throughout this thesis. In addition, the literature on techniques for 1.2. Special features of multi-state processes 16 parametric and semiparametric multi-state models is reviewed. This also includes a review of smoothing methods useful for the development of methods in this thesis. Finally, we introduce data used to illustrate methods studied and developed in this thesis. Chapter 2 follows this introduction by presenting the theory of multi-state mod- elling. Specifically, we discuss parametric model specification and estimation. A detailed description of a scoring algorithm to obtain the maximum likelihood esti- mates is discussed. The method is then illustrated through an application to data for post-heart transplantation patients, and data for decline of cognitive function. Chapter 3 then extends parametric multi-state models by allowing for transition-specific hazards specification with splines. Firstly, we provide an in- troduction to smoothing methods that will be used for the remainder of the thesis. In particular, we present cubic regression splines and P-splines. We then present our method for specification and estimation of multi-state models with splines, that uses a scoring algorithm for estimating the model parameters, and a grid search for selecting the optimal amount of smoothing. This method is then illustrated with an application to ageing research. Chapter 4 proposes an unifying and efficient framework for estimating multi- state models with splines in the presence of censoring. A simulation study is carried out to analyse the performance of the method presented. Subsequently, the method is illustrated with an application to data used in Chapter 2. The final chapter provides a discussion of the methods and applications devel- oped in this thesis, and indicates areas for future research.

1.2 Special features of multi-state processes

In this section, a number of different multi-state models are presented in order to illustrate some essential features of these models. In particular, we discuss some model structures and observational patterns commonly found in medical research. The discussion presented in this section is mainly based on the papers by Andersen and Keiding (2002) and Commenges (2002). 1.2. Special features of multi-state processes 17 1.2.1 Model structure Graphically, multi-state models can be illustrated using diagrams with boxes rep- resenting the states and with arrows between the states representing the possible transitions. A state is said an absorbing state if further transitions cannot occur from that state. A transient state is a state that is not absorbing, meaning that at least one transition is possible from that state.

Survival model A simple multi-state model can be defined for survival data in which individuals are followed-up until the event death (or censoring) occurs. The two-state model is then defined by the transient State 1 (alive) and the absorbing State 2 (death). This model is illustrated in Figure 1.1.

Figure 1.1: A two-state survival model

Unidirectional model Unidirectional models extend the survival model by defining a set of transient states before an absorbing State D (Figure 1.2) (Titman, 2008). For this model, individuals start at a transient state and can move forward to the next state until the absorbing state. Cook and Lawless (2002) use these models for the analysis of repeated events data.

Figure 1.2: An unidirectional model

The illness-death model The illness-death model is defined by a disease-free state (State 1), from which subjects can move into the dead state (State 3), or move into a diseased state (State 2). Once in State 2, subjects can move back to State 1 or move into State 3, see Figure 1.3. 1.2. Special features of multi-state processes 18

Figure 1.3: An illness-death model

Competing risks model In the context of medical research, competing risks models are defined by one transient state (State 1: alive) and a number, k, of absorbing states, State h, for h = 1,...,k corresponding to “death from cause h”. Figure 1.4 illustrates the model for k = 2.

Figure 1.4: A competing risks model

1.2.2 Observational patterns As discussed in Commenges (2002), the observations are often incomplete in the sense that it is not possible to observe the whole population of interest, and it is not possible to observe the process continuously over time. The first problem can be addressed by drawing a representative sample of the population. The second problem leads to what is called censored observations. Commenges (2002) provides a thorough discussion on how to approach different types of censoring in multi-state processes. We next describe right and interval-censoring as these are the focus of this thesis. Consider first that a multi-state process is observed in continuous time from time t0 until time t0 + c. If the state of the process at time t0 + c is an absorbing 1.3. Literature review 19

Dead

Illness

Healthy

0 1 2 3 4 6

Figure 1.5: The trajectory of an illness-death process. Dashed vertical lines represent the follow-up times. Solid horizontal lines represent the state occupancy over time state, then the process is observed completely. If not, then some transition times are said to be right-censored. For many applications, a multi-state process is observed at only a finite number of times. In this case, the exact time of transitions are only known to have occurred within a time interval. In this case, the transition times are said to be interval cen- sored. This type of censoring is common in cohort studies, where states are recorded at fixed visit times. If a dead state is defined, the time of death is usually known exactly. In this case, we have a mix of discrete and continuous time observations. This thesis focus on methods for these pattern of observations. Figure 1.5 illustrates the trajectory of an illness-death process over time. The vertical dashed lines repre- sent the follow-up times at which state occupancy is recorded. The horizontal lines represent the state occupancy over time. The process moves back and forth between the living states, subsequently moves from the illness state into the dead state.

1.3 Literature review For the analysis of time-homogeneous multi-state processes, Kalbfleisch and Law- less (1985) developed a general procedure for obtaining maximum likelihood esti- mates of the model parameters. The method is based on a scoring algorithm that uses an approximation to the second order derivatives of the log-likelihood func- tion. Kay (1986) proposes a similar method that calculates the first and second or- 1.3. Literature review 20 der derivatives of some particular multi-state processes. Kay also provides methods for hypothesis testing and model diagnostics. Satten and Longini (1996) developed a method for fitting these models when states are subject to measurement errors. Jackson (2011) presented the msm package in R for analysing time homogeneous multi-state processes for interval-censored data.

For time-dependent multi-state processes, Kalbfleisch and Lawless (1985) sug- gested using two methods. The first method uses piecewise-constant hazards to approach time-dependent processes. In this case, the hazards are constant within specified intervals, but can change for different intervals, see also Kay (1986). For many applications, it is not reasonable to assume that the underlying hazards are piecewise-constant functions. Also, the total number of parameters in the model can become large with the number of transition specific-hazards. The second method fo- cuses on a special case in which the non-homogeneity is due to a time-varying mul- tiplicative change in the matrix of transition intensities. In this case, it is possible to find a transformation function so that the resulting process is time homogeneous. Hubbard et al. (2008) investigated this method and proposed a flexible approach for estimating the transformation functions. Titman (2011) uses a numerical approx- imation to calculate the transition probabilities at the level of the corresponding differential equations. A disadvantage of this method is that estimation can become computationally expensive if many time-varying covariates are included. Jackson (2011) suggests using a piecewise-constant approximation for parametric models, which is employed to obtain the transition probabilities for the likelihood function. Van den Hout (2017) provides a general method for estimating multi-state models for interval-censored data. Focus is given to time-dependent parametric models, such as, Gompertz and Weibull distributions. Given a piecewise-constant approx- imation to the parametric hazards, a scoring algorithm is used for estimating the models. Further literature on time-dependent multi-state models includes Omar et al. (1995), Van den Hout and Matthews (2008a) and Van den Hout and Matthews (2008b).

Even though multi-state models with splines for known transition times are out 1.3. Literature review 21 of the scope of this research, a literature review of these models can be important for the development of this work. Fahrmeir and Klinger (1998) approached these models in a counting process framework. Transition intensities are specified in terms of spline functions. Estimation is based on a backfitting scheme with internal smoothing parameter selection by AIC optimisation. The application is to human sleep data with a discrete set of sleep states. Measurements are made every 30 seconds for a group of 30 patients. A Bayesian approach to this model and data is given in Kneib and Hennerfeind (2008). In this case, the inferential procedures developed allow for simultaneous estimation of model and smoothing parameters. Sennhenn-Reulen and Kneib (2016) developed an estimation procedure based on a structured lasso penalisation for multi-state models. The aim of their research is to identify covariate effect coefficient equal to zero. Baseline transition intensities are specified with piecewise-constant models, or unspecified and equal across all transitions. The R package Flexsurv is mainly designed for modelling time-to- event data (Jackson, 2016). It can fit any parametric time-to-event distribution and the spline models of (Royston and Parmar, 2002). This package can also be used for fitting some multi-state model by writing data in a survival format. Models can be specified with a rage of parametric and nonparametric shapes.

In the context of multi-state models with splines for interval-censored data, a penalised approach for an unidirectional three-state model is used in Joly and Com- menges (1999). Estimation is performed with an algorithm which uses derivatives of the penalised log-likelihood. The smoothing parameters are selected using a grid search with cross-validation. Joly et al. (2002) use the same approach for an illness- death without recovery model. Joly et al. (2009) further extend their method for a five-state model without recovery. These methods require explicit expressions for the transition probabilities. Calculating those formulae can be intractable for more complex models, such as, models with more than four states and backward transi- tions. In addition, their methods can be computationally intractable for models with multiple smoothing parameters, as the grid search method requires models to be fitted for all possible combination of smoothing parameters values given in the grid. 1.4. Survival analysis 22

The method proposed in Titman (2011) allows for nonparametric hazard specifica- tions with B-splines; however, the log-likelihood is maximised without penalisation. If a large set of B-splines basis is used to model a transition-specific hazard, models can become unidentifiable.

Efficient smoothing parameter selection methods are important for practical multi-state modelling with splines. We next discuss some relevant literature for automatic smoothing parameter estimation that will be useful for defining an au- tomatic and efficient algorithm to estimate multi-state models with splines. Gu and Wahba (1991) developed an efficient Newton method for multiple smoothing parameter selection. Wood (2000) extends their method for multiple smoothing parameter selection in generalized ridge regression problems. Wood (2004) pro- vides a stable and efficient method that improves the method developed in Wood (2000). These methods are widely used in generalised additive models. However, they can be further extended for more general settings. Radice et al. (2016) pro- poses a method for multiple smoothing parameter estimation for copula regression. Their method is general, but lead to numerical instability for some applications. Marra et al. (2017) developed a more general and stable method that can be esti- mate multiple smoothing parameters using only the gradient and Hessian (or Fisher information matrix). This method is general and can be used in a variety of settings.

1.4 Survival analysis

Some concepts of time-to-event analysis can be used and extended to multi-state modelling. In this section, we describe important features of these concepts that will be useful for the development of this thesis. Time-to-event analysis aims to describe the analysis of data in the form of a well-defined time origin until the occurrence of some particular event. In medical research, the time origin will often correspond to beginning of a treatment and the event of interest can be relief of pain, the recurrence of symptoms, or death. If the event is death of a patient, then time-to- event data are commonly called survival data. In this thesis, we focus on examples that include a dead state, and we shall refer to time-to-event data as survival data. 1.4. Survival analysis 23

This section is partly based on Collett (2015). Let T be the random variable associated with the survival time of patients, which can take any non-negative values. The distribution function of T is given by

Z t F(t) = P(T < t) = f (u)du, (1.1) 0 where f (t) represents the density function of T. Equation (1.1) repre- sents the probability that survival is less than some value t. The survival function, S(t), is defined to be the probability that the survival time is greater than or equal to t,

S(t) = P(T ≥ t) = 1 − F(t), (1.2) where F(t) is as defined in (1.1). The hazard function is widely used to express the risk or hazard of an event occurring at some time t. This function is obtained from the probability that an individual dies at time t, conditional on them having survived to that time. Formally, the hazard function, h(t), is given by

P(t ≤ T ≤ t + δt|T ≥ t) h(t) = lim . (1.3) δt→0 δt Equations (1.1), (1.2) and (1.3) provide that the hazard function can be ex- pressed as f (t) h(t) = . (1.4) S(t) The cumulative hazard function is used to express the cumulative risk of an event occuring by time t. Formally, the cumulative hazard function, H(t), is given by Z t H(t) = h(u)du. (1.5) 0 d From Equation (1.4), it follows that h(t) = − dt {logS(t)}, which implies that S(t) = exp{−H(t)}. Therefore, the survival function can also be expressed in terms of the cumulative hazard function. 1.4. Survival analysis 24

Empirical non-parametric methods are useful to summarise the survival times for individuals in a particular group. Suppose that for a single sample of survival times none of the observations is censored. The empirical survival function is given by Number of individuals with survival times ≥ t Sb(t) = . (1.6) Number of individuals in the data The empirical survival function is equal to unity for values of t before the first death time, and zero after the final death time. The estimated survivor function, Sb(t), is assumed to be constant between two adjacent death times, and so a plot of Sb(t) against t is a step-function. The function decreases immediately after each observed survival time.

Suppose that there are n individuals with survival times at t1,...,tn. These ob- servations can be right-censored and more than one death can occur at a given time.

Suppose that there are r deaths among the individuals, where r ≤ n. Let t(1),...,t(r) be the r ordered death times, n j the number of individuals who are alive just before t( j), including those who are about to die, and d j the number of individuals who die at t( j). The Kaplan-Meier estimate of the survival function is given by

k   n j − d j Sb(t) = ∏ (1.7) j=1 n j

for t(k) ≤ t < t(k+1), k = 1,...,r, with Sb(t) = 1 for t < t(1), and where t(r+1) is taken to be ∞. If the largest observation is a censored survival time, t∗, then Sb(t) is undefined ∗ for t > t . If the largest observed survival time, t(r), is an uncensored observation, nr = dr, and so Sb(t) = 0 for t ≥ t(r). A plot of the Kaplan-Meier estimate of the survival function is a step-function, in which the estimated survival probabilities are constant between adjacent death times and decrease at each death time. This thesis uses the Kaplan-Meier estimate of the survival function for model validation. 1.4. Survival analysis 25 1.4.1 The exponential distribution

The probability density function of a random variable T that has an exponential distribution with parameter λ > 0 is given by

f (t) = λ exp−λt, (1.8) for 0 ≤ t < ∞. The survival function is then given by

Z t S(t) = 1 − f (t)dt 0 S(t) = exp(−λt). (1.9)

d Using the relation h(t) = − dt log(S(t)), the hazard function is given by

h(t) = λ, (1.10) for t ≥ 0. Therefore, if the hazard is constant over time. For many applications, it may be more reasonable to consider time-dependent hazard functions.

1.4.2 The Gompertz distribution

The probability density function of a random variable T that has a Gompertz distri- bution with parameters λ > 0 and θ ∈ R is given by

λ  f (t) = λeθt exp (1 − eθt) , (1.11) θ for 0 ≤ t < ∞. The survival function of the Gompertz distribution is given by

λ  S(t) = exp (1 − eθt) . (1.12) θ

The hazard function of the Gompertz is given by

h(t) = λeθt, (1.13) 1.4. Survival analysis 26

λ =

1.0 0.01 θ = 0.45 0.8 0.6

λ = 0.1 θ = 0.15 0.4 Hazard function 0.2

λ = 1

θ = − 0.8 0.0 0 2 4 6 8 10 Time

Figure 1.6: Gompertz hazard functions for scale and shape parameters (λ,θ) = (1,−0.8),(0.1,0.15) and (0.01,0.45)

for t ≥ 0. The parameters θ and λ are the shape and scale parameters, respec- tively. For the case in which θ = 0, the hazard function is constant and equal to λ, and the survival times then have an exponential distribution. Positive values of θ lead to increasing hazards, while negative values of θ lead to decreasing hazards. As an illustration, the hazard function for Gompertz distributions with parameters (λ,θ) = (1,−0.8),(0.1,0.15) and (0.01,0.45) are shown in Figure 1.6.

1.4.3 The Weibull distribution

The probability density function of a random variable T that has a Weibull distribu- tion, with parameters λ > 0 and γ > 0 is given by

f (t) = λγγ−1 exp(−λtγ ), (1.14) for 0 ≤ t < ∞. The survival function of the Weibull distribution is given by

S(t) = exp(−λtγ ). (1.15) 1.4. Survival analysis 27

γ = 5 3

γ = 2 4 3 2 Hazard function

γ = 1 1

γ = 0.5 0

0.0 0.5 1.0 1.5 2.0 Time

Figure 1.7: Weibull hazard function for scale parameter λ = 1 and the shape parameters γ = 0.5, 1, 2 and 3

The corresponding hazard function is given by

h(t) = λγtγ−1 (1.16) for t ≥ 0. For the case in which γ = 1, the hazard function is constant over time, and the survival times have an exponential distribution. The shape of the hazard function is controlled by the parameter γ, which is then defined as the , while λ is a scale parameter. A plot of the Weibull hazard function for distributions with λ = 1 and γ = 0.5,1,2 and 3 is shown in Figure 1.7.

1.4.4 The lognormal distribution

A random variable T has a lognormal distribution, with parameters µ and σ, if the random variable logT has a with mean µ and variance σ 2. The probability density function of T is given by

1 f (t) = √ t−1 exp−(logt − µ)2/2σ 2 , (1.17) σ 2π 1.4. Survival analysis 28 0.5 0.4

σ = 0.3 0.75

0.2 σ = 1 Hazard function

0.1 σ = 1.75 0.0 0 2 4 6 8 10 Time

Figure 1.8: Lognormal hazard functions for µ = 1 and σ 2 = 0.75, 1, and 1.75

for 0 ≤ t < ∞ and σ. The survival function of the lognormal distribution is

logt − µ  S(t) = 1 − Φ , (1.18) σ where Φ(·) is the standard normal distribution function, given by

1 Z z Φ(z) = √ exp(−u2/2)du. (1.19) 2π −∞

The hazard function can be found from the relation h(t) = f (t)/S(t). This function is zero at t = 0, increases to a maximum and then decreases to zero as t tends to infinity. The hazard and survival functions are given in terms of integrals limits, which makes the use of this model difficult for applications. However, the flexibility of this model is attractive and they will be useful in Chapter 4 to simulate transition times from a distribution function that leads to a non-linear shape of haz- ard functions. For illustration, the hazard function for lognormal distributions with parameters µ = 1 and σ 2 = 0.75,1, and 1.75 are shown in Figure 1.6. 1.5. Simulation of multi-state processes 29 1.5 Simulation of multi-state processes

In this section, we describe how to simulate multi-state processes using the inversion method (Bender et al., 2005). For a fixed transition from state r to state s, r 6= s, the random variable T represents the time to event . If the cumulative distribution function for leaving state r to state s is given by F(t) = 1 − S(t), then U = F(T) has a uniform distribution on the interval [0,1], denoted by U ∼ U[0,1]. Moreover, if a random variable has distribution U ∼ U[0,1] then 1 −U ∼ U[0,1] as well. It means that S(T) ∼ U[0,1]. Therefore, if the function H(t) can be inverted, the time to event from state r to state s can expressed as

−1 > T = H [−log(U)exp(β rsz)] (1.20) where U ∼ U[0,1]. Suppose there are m − 1 competing intensities for leaving state r. The time to next event can be obtained by taking T = min(T1,...,Tm−1), where Ti is obtained applying (1.20). For the calculation of subsequent event times, we use that the survival function conditional on entering the state r at time t0 > 0 is given by

S(t|t0) = P(T > t|T > t0) S(t) = . (1.21) S(t0)

1.6 Examples

This section further discusses special features of multi-state processes through ex- amples. Data from these examples will then be used to illustrate the statistical methods presented in subsequent chapters.

1.6.1 Cardiac allograft vasculopathy data Cardiac allograft vasculopathy (CAV) is a narrowing of the arterial walls and one of the main cause of death in heart transplantation patients. The data are a series of approximately yearly angiographic examinations of heart transplant recipients. 1.6. Examples 30

Figure 1.9: Four-state model for disease progression after transplant for the CAV data

The data come from Papworth Hospital U.K. The data contain 3217 rows which are grouped by 622 patients and ordered by years after transplant. For the CAV data, baseline means years since the beginning of the study.

Of interest is the onset of CAV after transplantation. The state at each follow- up time is a grade of CAV which can be normal, mild, moderate or severe. Dead is the absorbing state and time of death is known within one day. Three living states are defined by CAV severity: State 1, 2, and 3, for Normal, Mild/Moderate, and Severe, respectively. An additional State 4 is defined as the Dead state, see Figure 1.9.

The interval-censored multi-state process is summarised by the frequencies in Table 1.1. The code 99 represents right-censored observations. As it can be seen, some backward transitions are recorded. However, the narrowing process is assumed to be biologically irreversible. There are essentially two ways to approach this problem. First, it is possible to fit multi-state models with misclassification of states (Jackson et al., 2003). Such an approach is out of the scope of the current research. Second, the data can be redefined for the history of CAV. In this case, the states are classified as Healthy (1) if the patient has not developed the disease, Mild/Moderate (2) if the patient has developed mild/moderate or Severe (3) if the patient has developed severe CAV and Dead (4) if the patient has died. Then, the data are consistent with the model in Figure 1.9.

The CAV data include several covariates which may be used to analyse pa- tient’s progression across states. These include recipient age at examination, donor age, the sexes of recipient and donor, and primary diagnosis of ischaemic heart dis- ease (IHD). Sharples et al. (2003) showed in an analyse of these data that IHD and 1.6. Examples 31

Table 1.1: State table for the CAV data: number of times each pair of states was observed at successive observation times. The three living states are defined by CAVseverity

To From 1 2 3 4 99 1 1367 204 44 148 276 2 46 134 54 48 69 3 4 13 107 55 26

donor age are major risk factors of disease onset.

1.6.2 English longitudinal study of ageing data

The English Longitudinal Study of Ageing (ELSA) baseline (1998-2001) is a rep- resentative sample of the English population aged 50 and older. ELSA contains information on health, economic position, and quality of life. Data from ELSA can be obtained via the Economic and Social Data Service (www.esds.ac.uk). There are 11932 individuals in the ELSA baseline. The following description of these data can also be found in Van den Hout (2017). Of interest is the change of cognitive function in older population. For the analysis in this thesis, a random sample of size N = 1000 is taken from ELSA. Of these 1000 individuals, 205 died during the follow-up with age at death available. Because ELSA data are publicly available, measures have been taken by the data provider to prevent identification of the individuals. One of those measures is the censoring of ages above 90 years. In the sampling of the subset of N = 1000, indi- viduals who were 90 years or older at baseline were ignored, where baseline means entry in the study. The sample has 544 women and 456 men. Highest educational qualification is dichotomised according to years of formal education: fewer than ten versus ten or more. There are 558 individuals with fewer than ten years of education. The focus is on the number of words remembered in a delayed recall from a list of ten. The score on this test is equal to the number of words remembered, i.e., score ∈ {0,1,2,··· ,10}. It is of interest to explore the effect of age and gender on cognitive change over time when controlling for education. Four living states are 1.6. Examples 32

Figure 1.10: Five-state model for longitudinal data in ELSA on number of words remem- bered in a recall defined by the number of words an individual can remember: State 1, 2, 3, and 4, for the number of words {7,8,9,10}, {6,5}{4,3,2}, and {1,0}, respectively. An additional State 5 is defined as the dead state, see Figure 1.10.

1.6.3 Origins of variance in the oldest-old data

The origins of variance in the oldest-old (OCTO) study included dizygotic (DZ) and monozygotic (MZ) twin pairs aged 80 years of age and older. The sample was selected from older adults in the population-based Swedish Twin Registry. Older adults participating in the study were tested in their residence by nurses. Informed consent was obtained from each participant. Five cycles of longitudinal data were collected at two year intervals. The initial sample consisted of 702 individuals (351 same-sex pairs). The final analysis included 694 participants. Of interest is the effects of age on cognitive function. The mini-mental state examination (MMSE) is used to assess global cognitive functioning of participants at each time point (Folstein et al., 1975). The state at each follow-up time is a clas- sification of respondents in terms of MMSE, which can be normal MMSE (27 ≤ MMSE ≤ 30), mild MMSE impairment (23≤ MMSE ≤ 26) and severe MMSE im- pairment (MMSE ≤ 22). The MMSE is not used to determine clinical diagnosis but as suggestive of mild cognitive impairment and dementia. These MMSE severity are used to define three living states: State 1, 2, and 3, for No cognitive impair- ment, Mild cognitive impairment, and Severe cognitive impairment, respectively. An additional State 3 is defined as the Dead state, see Figure 1.11. The interval-censored multi-state process is summarised by the frequencies in Table 1.2. Even though there are some backwards transitions from State 3, severe 1.7. Discussion 33

State 1 State 2 State 3 No cognitive Mild cognitive Severe cognitive impairment impairment impairment

Dead

Figure 1.11: Four-state model for longitudinal data in OCTO cognitive impairment is assumed to be irreversible. Here, the OCTO data are rede- fined for the history of severe cognitive impairment. In this case, if a participant has moved into State 3, they are only allowed to stay in State 3 or move into the dead state, see Figure 1.11. The data include several covariates which may be used to analyse patient’s decline of cognitive function and death. These include education, sex, and socioe- conomic status (SES). Robitaille et al. (2018) investigate the effect of age and these covariates on the transitions between cognitive states and life.

1.7 Discussion To conclude, estimation of time-dependent multi-state models for interval- censoring can be challenging. Most parametric models are specified with restrictive functional forms, such as, Gompertz and Weibull. Flexible parametric distributions can also be used, but obtaining the derivatives of the log-likelihood function for those models can be intractable. For the examples in Section 1.6, there is no clear information on good parametric shapes. Nonparametric hazards specification with splines provides a flexible approach for modelling time-dependent multi-state pro- cess. The methods developed so far focus on particular cases and are not feasible for many applications. 1.7. Discussion 34

Table 1.2: State table for the OCTO data: number of times each pair of states was observed at successive observation times. The three living states are defined by grades of cognitive function

To From 1 2 3 4 1 715 133 82 233 2 67 106 116 110 3 8 21 274 319 Chapter 2

Parametric multi-state models

This chapter introduces parametric multi-state models for interval-censored data. The general formulation builds up within a Markov processes framework, and mod- els are specified through hazard functions. Maximum likelihood is used for estima- tion. Model selection and model validation are briefly discussed as they will be useful for comparing and validating models throughout this thesis. The method is illustrated with the CAV and OCTO data sets. These applications show the versatil- ity of multi-state modelling for longitudinal data, but also highlight some limitations of parametric model specifications. The aim of this chapter is to present the general theory of multi-state models. All methods presented are standard and can be found in textbooks (Cox, 2017; Van den Hout, 2017). The applications are original, as the CAV and OCTO data sets have not been analysed with the formatting proposed in this chapter.

2.1 Continuous-time Markov processes

A stochastic process is a collection of random variables {Y(t)|t ∈ U}, where U is an index set that can take discrete or continuous values. The state space, S , is the set of possible values of Y(t), which can also be discrete or continuous. The focus here is on the case in which U is continuous and S is discrete, where the former is a set of model states and Y(t) denotes the state of S occupied by the system at time t. The content presented in this section follows the presentation in Cox (2017).

Given the time points t1,...tn, it is of interest to examine the joint distribu- 2.1. Continuous-time Markov processes 36 tion of Y1,...,Yn, where Yj = Y(t j) for j = 1,...,n. Commonly, {Y(t)|t ∈ U} is assumed to be a Markov process, which means that the future state of the process only depends on the current state. Formally, a continuous-time Markov process on the discrete states S is defined through a set of probabilities, prs(t), such that,

 prs(t,u) = P Y(u +t) = s|Y(u) = r , (2.1) for r,s ∈ S , u ≥ 0 and t ≥ 0. The Markov process {Y(t)|t ∈ U} is time- homogeneous if the probability (2.1) only depends on the initial state, that is,

 prs(t) = P Y(t) = s|Y(0) = r . (2.2)

The process is for now assumed to be time-homogeneous. The probabilities in (2.2) must satisfy

0 ≤ prs(t) ≤ 1, (2.3)

prk(t) = ∑ prs(u)psk(t − u)(t > u), (2.4) s

∑ prs(t) = 1. (2.5) s

The matrix P(t) which contains these probabilities is called the transition prob- ability matrix. Equation (2.4) is the Chapman-Kolmogorov equation for a time- homogeneous Markov process and can be written in matrix form as P(t) = P(u)P(t − u), with P(0) = I. This is useful for computing transition probabilities over a long-term time interval.

In applications, models are specified through the transition rates over a small time interval. The transition intensities (or hazards) from state r to state s are given by PY(t + ∆t) = s|Y(t) = r qrs = lim , (2.6) ∆t→0 ∆t 0 for r 6= s. Notice that the qrss are constants. The matrix Q with off-diagonal entries qrs and diagonal entries qrr = −∑s6=r qrs is called the generator matrix. If qrr = 0, 2.1. Continuous-time Markov processes 37 the state r is called absorbing. As described in Section 1.6, data analysed in this thesis will include an absorbing state, which represents the dead state.

As an example, the generator matrix for the ELSA data described in Section 1.6.2 is given by

  −(q12 + q15) q12 0 0 q15    q −(q + q + q ) q 0 q   21 21 23 25 23 25    0 q −(q + q + q ) q q .  32 32 34 35 34 35    0 0 q43 −(q43 + q45) q45   0 0 0 0 0

The rate at which individuals move, e.g., from state 1 into state 3 is zero, q13 = 0, because this transition is not allowed by the process. In order to move into state 3 from state 1 subjects must go first through state 2. The dead state 5 is the only absorbing state for this application.

For a given generator matrix, Q, a Markov process is uniquely defined (Cox, 2017). The link between a generator matrix and its probability matrix is established by the forward and backward equations, which are given by

P0(t) = P(t)Q, (2.7)

P0(t) = QP(t), (2.8) respectively. Given the initial condition P(0) = I, the solution to the differential equations in (2.7) and (2.8) is

P(t) = exp(Qt) (2.9) ∞ tk = ∑ Qk , (2.10) k=0 k! where Q0 = I. Because Q is finite, the series (2.10) is convergent and (2.9) is the unique solution of both backward and forward equations. Moler and Van Loan (2003) discuss a range of methods to calculate the exponential of a matrix. Here, the 2.2. Model representation 38 transition probability matrix P(t) are computed using the eigen-decomposition of Q.

Let b1,...,bk represent the eigenvalues of Q and A the matrix with the eigenvectors as columns. For distinct eigenvalues, the matrix A is invertible, and the eigenvalue −1 decomposition is Q = Adiag(b1,...,bk)A . The transition probability matrix P(t) for elapsed time t is given by

  P(t) = A diag eb1t,...,ebkt A−1.

In this thesis, multi-state processes are Markov processes with transition prob- abilities depending on the current state and possibly on covariates. For many applications, the risks of moving across states depend on the current state and on time. In this case, a non-homogeneous Markov assumption is assumed to model the multi-state process. The generator matrix is then a function of time, which means that the matrix Q(t) can vary over time. We next describe how to specify and approach time-dependent multi-state models.

2.2 Model representation

The hazard functions as defined in (2.6) can be generalised to the situation where the hazards depend on time and on the values of p explanatory variables, x1,...,xp. The > set of values of the explanatory variable are denoted by the vector x = (x1,...,xp) .

Let qrs.0(t) represent the hazard function for an individual for whom the values of all the explanatory variables that make up the vector x is zero. This function is called the baseline hazard function. A time-dependent hazard regression model can be written in the form

>  qrs(t) = qrs.0(t)exp β rsx , (2.11)

> where β rs = (βrs.1,...,βrs.p) is the vector of coefficients of the p explanatory vari- ables acting on transition from state r to state s. We focus on the case where the explanatory variables are recorded at the time origin of the study. It is straight- forward to extend the model to the situation where the values of the explanatory 2.2. Model representation 39 variables change over time (Van den Hout, 2017).

Equation (2.11) shows that transition-specific time dependency can be intro- duced via baseline hazards. As presented in Section 1.4, examples of parametric functional forms for qrs.0(t) include

exponential: qrs.0(t) = αrs αrs > 0 (2.12)

τrs−1 Weibull: qrs.0(t) = αrsτrst αrs,τrs > 0 (2.13)

Gompertz: qrs.0(t) = αrs exp(ξrst) αrs > 0 . (2.14)

The exponential model is the simplest parametric hazard specification, which does not allow for time-dependent modelling. The Weibull and Gompertz specifi- cations are useful to model monotonic upward or downward trends over time.

Whilst it is straightforward to specify time-dependent models for the hazards, calculating the transition probabilities of time-dependent multi-state processes are complicated. The forward and backwards equations in (2.7) and (2.8), respectively, are derived for processes for which Q is constant. For time-dependent multi-state processes, those equations can be extended as follows. Define a family of matrices P(t,u), for u > t, with elements P(Y(u) = j|Y(t) = i). The forward and backward equations are given by

∂P(t,u) = P(t,u)Q(u), (2.15) ∂u ∂P(t,u) − = Q(t)P(t,u), (2.16) ∂t respectively. Equations (2.15) and (2.16) are called the Kolmogorov differential equations. Finding a solution or approximated solution for these equation is crucial for practical multi-state modelling. Titman (2011) uses a numerical approximation to solve these differential equations. In this thesis, time-dependency is approached by using a piecewise-constant approximation to the hazards. 2.3. Piecewise-constant hazards approximation 40 2.3 Piecewise-constant hazards approximation

This section describes a piecewise-constant approximation to take into account time-dependency of the hazards. Given a specified time-interval (t1,t2], we aim to find an approximate value for the transition probability P(t1,t2), which is then employed in estimation. We next present two methods that can be used within the framework that will be developed in this thesis.

The first method uses the follow-up times in the data to define the grid for the piecewise-constant approximation for the individual likelihood contributions.

In this case, for a given time-interval (t1,t2] the transition probability P(t1,t2) is estimated by exp((t2 − t1)Q(t1)). This approximation performs well in estimation depending on the volatility of the process and on the study design. For more volatile processes, the follow-up times have to be more frequent to capture the risk changes over time.

Alternatively, it is possible to impose a fixed grid to the piecewise-constant approximation as described in Van den Hout and Matthews (2008a). For a given time interval (t1,t2], the transition probability P(t1,t2) can be calculated by imposing a grid for the piecewise-constant approximation, which is the same for all individual likelihood contributions. In this case, time intervals in the data are embedded in the grid. For example, say the grid is defined by u1,...,uM. For the observed time interval (t1,t2], determine j1 and j2 such that u j1 < t1 ≤ u j1+1 and u j2 < t2 ≤ u j2+1.

The transition matrix for (t1,t2] is then defined by

   P(t1,t2) = P t1,u j1+1 P u j1+1,u j1+2 × · · · × P u j2 ,t2 ,

using generator matrices Q(u j1 ),Q(u j1+1),...,Q(u j2 ), respectively.

Van den Hout (2017) compares the performance of the piecewise-constant ap- proximation and the exact method that solves the forward and backward differential equations as in (2.15) and (2.16). The simulation is carried out for the illness-death model with Gompertz hazards. The grid for the piecewise-constant approximation is defined by the data. Two study designs are investigated: states are observed yearly 2.4. Maximum likelihood estimation 41 and at the points 0, 4, 8, 9, 10, 11, 12. For the first case, the results of both meth- ods are very similar. For the second case, the piecewise-constant approximation led to slightly biased results for the scale parameter, but similar results for the shape parameter. It is expected given the 4-year interval times. Therefore, the piecewise- constant approximation leads to satisfactory results as long as the time between observations is not too long relative to the volatility of the multi-state process. For multi-state processes in which the sampling times are covariate dependent (e.g. individuals with a pre-condition are sampled more often), the approximation bias may lead to bias of the covariate effect. For such cases, imposing a grid to the piecewise-constant approximation can improve the estimation. Also, it is possible to include an interval-censored state in between two distant observation times (Van den Hout, 2017).

2.4 Maximum likelihood estimation

> Let θ = (θ1,...,θq) be the parameters vector of a given multi-state model. These parameters can be estimated using the method of maximum likelihood. We first obtain the likelihood function of the sample data. Then we illustrate a scoring al- gorithm, which is used to obtain the maximum likelihood estimates of the model parameters.

2.4.1 Likelihood function of the model

Given a multi-state model, maximum likelihood inference can be used to estimate model parameters. In the presence of interval censoring, the likelihood function is constructed using transition probabilities.

Let the state space be S = {1,2,..,D}, with D the dead state. Consider a series of states Y1,...,Yn observed at times t1,...,tn, respectively. The inference is conditional on the first observed state. For Y2,...,Yn, the distribution is

P(Yn = yn,...,Y2 = y2|Y1 = y1,θ ,t,X), (2.17)

> > where θ = (θ1,...,θq) is the vector with the model parameters, t = (t1,...,tn) , and 2.4. Maximum likelihood estimation 42 the n× p matrix X contains the values of the p covariates at each of the n time points. A conditional first-order Markov assumption is used to define the distribution (2.17) of Y2,...,Yn as

n  ∏ P Yj = y j|Yj−1 = y j−1,θ ,t j−1,x j−1 , j=2

th where x j−1 is the ( j − 1) row in X.

Next consider an individual i with observed values y1,...,yn−1 ∈ S \D, and a last observation yn which is either a value in S or a code for right-censoring. The n likelihood contribution for this individual is Li = ∏ j=2 Li j, where

   P Yj = y j|Yj−1 = y j−1,θ ,t j−1,x j−1 for j = 2,...,n − 1 Li j = (2.18)  C(yn|yn−1) for j = n .

If a living state at tn is observed, then C(yn|yn−1) = P(Yn = yn|Yn−1 = yn−1), where part of the conditioning is ignored in the notation. If the state is right censored at tn, D−1 then C(yn|yn−1) = ∑s=1 P(Yn = s|Yn−1 = yn−1). If the state at tn is D, then known time of death is taken into account by defining

D−1 C(yn|yn−1) = ∑ P(Yn = s|Yn−1 = yn−1)qsD(tn). (2.19) s=1

Given N individuals, the log-likelihood function is given by

N N ni `(θ ) = ∑ logLi = ∑ ∑ logLi j, (2.20) i=1 i=1 j=2 where ni is the number of observation times for individual i, which can include a right-censored observation.

The above definition of the likelihood function can also be found in Jackson (2011). Including time-dependency, as defined by the models in Section 2.2, does not affect the basic structure of the likelihood function. Similar expressions of the likelihood function can be found in Kalbfleisch and Lawless (1985), Kay (1986), 2.4. Maximum likelihood estimation 43

Gentleman et al. (1994), and Van den Hout (2017).

The log-likelihood function in (2.20) can be maximised by using a general- purpose optimiser or a scoring algorithm. The latter is illustrated in the next section and will be used for the analysis in this chapter.

2.4.2 Scoring algorithm

The maximum likelihood estimates of the vector of parameters θ in the multi-state model (2.11) can be found by maximising the log-likelihood function (2.20) using numerical methods. This maximisation can be achieved using a scoring algorithm (Van den Hout, 2017), and it is described below.

Kalbfleisch and Lawless (1985) proposed a scoring algorithm for maximis- ing the log-likelihood function of time homogeneous multi-state models. Given a piecewise-constant approximation to the time-dependency in the hazard model (2.11), their formulae can be used to maximise the likelihood function (2.20). In particular, those formulae are applied to the constituent intervals with constant haz- ards in the likelihood function. Notice that, in this case, we obtain an approximation to the log-likelihood function in (2.20), which is still the logarithm of a likelihood function.

> As in Section 2.4.1, let θ = (θ1,...,θq) be the vector of model parameters. For a given time interval (t1,t2], the first-order derivatives of the transition probability matrix, ∂P(t1,t2)/∂θk, is given in terms of ∂Q(t)/∂θk. The latter is straightfor- ward to derive for most parametric models, such as, Weibull and Gompertz hazard models. The likelihood contributions for exact death times and right-censoring are straightforward to deal with, as they are made up of transition probabilities.

To specify the scoring algorithm, the derivative of a transition matrix is pre- sented first. Given a piecewise-constant approximation to the hazards, the likeli- hood contribution for an observed time interval (t1,t2] is defined using a constant > generator matrix Q = Q(t1). For the eigenvalues of Q given by b = (b1,...,bD) , define B = diag(b). Given matrix A with the eigenvectors as columns, the eigen- value decomposition is Q = ABA−1. The transition probability matrix P(t) = 2.4. Maximum likelihood estimation 44

P(t1,t2) for elapsed time t = t2 −t1 is given by

  P(t) = A diag eb1t,...,ebDt A−1.

As described in Kalbfleisch and Lawless (1985), the derivative of P(t) can be obtained as ∂ −1 P(t) = AVkA , ∂θk where Vk is the D × D matrix with (l,m) entry

 (k)  glm [exp(blt) − exp(bmt)]/(bl − bm) l 6= m    (k)  gll t exp(blt) l = m,

(k) (k) −1 where glm is the (l,m) entry in G = A∂Q/∂θkA . For the parametric time-dependent hazard models in Section 2.2, matrix

∂Q/∂θk is straightforward to derive. The scoring algorithm can now be defined as follows. Let g(θ ) be the q × 1 vector of first-order derivatives of the log-likelihood function in (2.20). This quan- tity is called the gradient vector. The kth entry of g(θ ) is given by

N ni ∂ ∑ ∑ logLi j . (2.21) i=1 j=2 ∂θk

Let I (θ ) be the q × q matrix of expected negative second-order derivatives of the log-likelihood. This quantity is given by

 ∂ 2`(θ )  I (θ ) = IE − . (2.22) ∂θθθ∂θθ >

The matrix I (θ ) is called the Fisher information matrix. This matrix can be ap- proximated by defining the q × q matrix M(θ ) with (k,l) entry

N ni ∂ ∂ ∑ ∑ logLi j logLi j . (2.23) i=1 j=2 ∂θk ∂θl 2.5. Confidence intervals 45

The scoring algorithm provides that an estimate of the vector θ at the (v+1)th cycle of the iterative procedure, θ (v+1), is

−1 θ (v+1) = θ (v) + Mθ (v) gθ (v).

−1 for v = 1,2,3..., where gθ (v)) is the gradient vector and Mθ (v) is the inverse of the (approximated) Fisher information matrix, both evaluated at θ (v). The pro- cess iterates until the relative differences in the values of the parameter estimates (v+1) (v) satisfies max1≤k≤q |θ − θ | < δ for a suitable small positive value. The asymptotic covariance matrix of the maximum likelihood estimate θb is equal to the inverse of the Fisher information matrix I (θ )−1, which can be ap- proximated by I (θb)−1. Hence, after convergence, the covariance matrix of the maximum likelihood estimate θb is estimated by M(θb)−1 (Van den Hout, 2017).

The standard error of θbi = (θb)i is given by

−1 1/2 SE(θbi) ≈ (M(θ ) )ii , (2.24) where i = 1,...,q. The algorithm is implemented by the author in R in such a way that it is easy to vary transition-specific choices for parametric shapes.

2.5 Confidence intervals

The distribution of the maximum likelihood estimator can be used to construct con- fidence intervals for the estimate θb and functions of them, such as the hazards and probability matrix (Wood, 2006). Let Vθ represent the covariance matrix of θb at convergence. From large sample theory, samples of the estimate θb can be drawn from N(θb,Vθ ). Confidence intervals for functions of the model parameters can be constructed as follows:

Step 1: Draw b vectors from N(θb,Vθ ).

Step 2: Calculate the value of the function of interest at each simulated value.

Step 3: Using the simulated values of the function, calculate the lower (ς/2) and 2.6. Model selection 46

upper (1 − ς/2), quantiles.

The parameter ς is usually set to 0.05. In this thesis, we approximate the covariance matrix Vθ by the inverse of the matrix M as defined in (2.23). Sampling from a K-variate normal distribution N(µ,Σ) is possible by using the Cholesky decomposition Σ = LL>. First K draws are taken independently from the standard normal and collected in the K × 1 vector z. A multivariate draw from N(µ,Σ) is then given by µ + Lz.

2.6 Model selection In multi-state models various hazards specifications can be used. Model selection is commonly used to select the best model among them. Model fitting can always be improved by adding more parameters. However, parsimonious models are easier to estimate and better for prediction. Model selection methods aim to find the balance between goodness-of-fit and parsimony. Ruppert et al. (2003) and Wood (2006) provide descriptions of several methods for model selection. Their method can be used to compare nested multi-state models, i.e., models with the same structure, but with different specifications. This thesis uses the Akaike Information Crite- rion (Akaike, 1998). This quantity is also know as AIC. This section follows the description presented in Ruppert et al. (2003) and Wood (2006). The AIC selects models based on their fit to new data. The AIC is derived from the Kullback-Leibler (K-L) discrepancy. Suppose that the model density is fθ (y), and that f0(y) is the true density, where θ denotes the parameter (or the vector of parameters ) of fθ . The K-L discrepancy is given by

Z K( fθ , f0) = {log[ f0(y)] − log[ fθ (y)]} f0(y)dy. (2.25)

Equation (2.25) provides a measure of how well fθ matches the truth. If θb is the maximum likelihood estimate of θ, then K( fθ , f0) could be used to provide a mea- sure of how well the the model fθ is expected to fit a new set of data, not used to estimate θb. It can be shown that E[K( fθ , f0)] ≈ −`(θb) + q, where `(θb) is the log- likelihood function evaluated at the maximum likelihood estimate, and q represents 2.7. Model validation 47 the number of model parameters. This measure can be used for model comparison. The Akaike Information Criterion is then defined as

AIC = −2`(θb) + 2q. (2.26)

Models with lower AIC values are considered to be closer to the true model, than models with higher AIC values. The term q penalises models with more parameters than necessary, counteracting the tendency of the likelihood to models with a large number of parameters.

2.7 Model validation

For multi-state models, the Markov assumption and parametric hazards specifica- tion are used to facilitate inference. Titman and Sharples (2010) reviewed diag- nostics methods for multi-state models in the presence of interval censoring. We describe below the method that will be used in this thesis for model validation and model comparison.

A simple diagnostic method compares the model predictions of the entry time into an absorbing state with the Kaplan-Meier estimates (see Section 1.4). Suppose that all individuals start in the same state at baseline and progress to an absorbing state, and that the assumptions in the multi-state model are correct. Then, there should be close agreement between the empirical survival curve and the survival curve implied by the fitted multi-state model. In this case, survival time is in relation to the dead state. The pointwise 95% confidence intervals of the Kaplan-Meier are used as a benchmark to decide if any disagreement is within allowed bands. If the time scale is age, then age has to be transformed to time since beginning of the study. In this case, we can compare all individual survival curves. Notice that this method only checks one part of the model, the survival times from a specified state. However, it gives relevant information about model fitting. This method is used to assess model fit throughout this thesis. 2.8. Applications 48

Table 2.1: State table for the OCTO data for the history of State 3: number of times each pair of states was observed at successive observation times. The three living states are defined by grades of cognitive function

To From 1 2 3 4 1 710 130 79 229 2 66 98 100 104 3 0 0 339 329

2.8 Applications

The methods presented in this chapter are illustrated with applications to the OCTO and CAV data sets. In what follows, model estimation is undertaken using the scor- > ing algorithm presented in Section 2.4.2. Let θ = (θ1,...,θq) be the vector with model parameters, where q depends on the application. The convergence criterion (v+1) (v) −6 for the algorithm is to stop at iteration v+1 when max1≤k≤q |θ −θ | < 10 . For both data sets, models are specified with Gompertz hazards because this is a common assumption in applied research (Marioni et al., 2012; Robitaille et al., 2018) and can be used with the msm package.

2.8.1 Origins of variance in the oldest-old data

We fit a multi-model for the OCTO data defined as in Figure 1.11. Even though in- dividuals in state 3 (severe cognitive impairment) can only move into state 4 (dead), some backward observations from state 3 are recorded. This is due to measurement error. It is possible to fit multi-state models with misclassification of states (Jack- son et al., 2003). However, such an approach is out of the scope of this research. We define the OCTO data for the history of state 3, which means that once indi- viduals move into state 3, they are only allowed to move into state 4. The resulting interval-censored multi-state process is summarised by the frequencies in Table 2.1. Let t represent age minus 80. This is necessary to avoid numerical problems with the scoring algorithm. Because time of death is known, rather than being in- terval censored, the likelihood contribution of individuals observed in state r < 4 at ∗ 3 ∗ ∗ time t and dead at time t > t are given by ∑s=1 P(Y(t ) = s|Y(t) = r)qs4(t ). As 2.8. Applications 49

Frequency 0 100 300 500 0 5 10 15 20 25 t

Figure 2.1: Histogram of age at baseline transformed by minus 80 in the OCTO data. The variable t represents age minus 80

Table 2.2: Parameter estimates for the four-state for the OCTO data. Estimated standard errors in parentheses. Time scale t is age in years minus 80

Intercept t sex α12.0 -2.157 (0.163) α12.1 0.115 (0.024) βL -0.325 (0.100) α14.0 -3.185 (0.220) α14.1 0.128 (0.032) βD -0.334 (0.091) α21.0 -1.383 (0.254) α21.1 -0.020 (0.047) α23.0 -0.967 (0.162) α23.1 0.056 (0.023) α24.0 -3.775 (0.719) α24.1 0.177 (0.078) α34.0 -1.463 (0.130) α34.1 0.060 (0.013)

described in Section 2.3, transition probabilities for the likelihood function are cal- culated by using a piecewise-constant approximation to the hazards. For the OCTO data, the mean length of follow-up times is 1.994 years with standard deviation of 1.098 and 1.986. Assuming that change in transition intensities in relation to the frequency of observation can be assessed in intervals of approximately 2 years, we can use the data to define the grid for the piecewise-constant approximation.

The proportional hazards model with Gompertz specification and dependence 2.8. Applications 50 on the covariate sex is given by

qrs(t) = exp(αrs.0 + αrs.1t + βrssex), (2.27) where (r,s) ∈ {(1,2),(1,4),(2,1),(2,3),(2,4),(3,4)} and sex is 0/1 for men/women. For the transition between the living states, the constraints on the coefficients for sex are β12 = β23 = βL, except for transition from 2 to 1 where β21 = 0. For the transitions into the dead state, the constraints are β14 = β24 = β34 = βD. This model has a total of 14 parameters and AIC = 5342.837.

The estimated hazards (solid lines) for women and 95% confidence intervals (dashed lines) are presented in Figure 4.5. The confidence intervals are obtained by simulation with b = 1000 replications. The risks of moving across states are increasing throughout the length of the study, except for the transition from 2 to 1 which is decreasing. The confidence intervals are fairly wide after approximately 15 years due to the dependence on the slope parameter as t increases. Figure 2.1 illustrates the histogram of age at baseline minus 80.

Table 2.2 presents the parameter estimates for model (2.27) for the OCTO data. The effect of sex for the living states is very close to the effect of sex for the transition into the dead state, βbL = −0.325 and βbD = −0.334.

Figure 2.3 depicts the baseline-specific survival as estimated by the model and as described by the Kaplan-Meier curves. Individual survival curves (in grey) are shifted to the years since baseline so that we can compare them and their mean to the Kaplan-Meier curve. This is necessary because individuals have different ages at baseline. For survival given baseline states 2 and 3, there is some discrepancy between the model-based mean survival and the Kaplan-Meier curve, but overall the fit is reasonably good. Although this is not a proper goodness-of-fit test, the comparison shows that the model is able to capture the attrition due to death during the follow-up. 2.8. Applications 51 1.5 1.5 1.0 1.0 0.5 0.5 Hazard 1 to 2 Hazard 1 to 4 0.0 0.0 0 5 10 15 20 25 0 5 10 15 20 25 Age − 80 Age − 80 1.5 1.5 1.0 1.0 0.5 0.5 Hazard 2 to 1 Hazard 2 to 3 0.0 0.0 0 5 10 15 20 25 0 5 10 15 20 25 Age − 80 Age − 80 1.5 1.5 1.0 1.0 0.5 0.5 Hazard 2 to 4 Hazard 3 to 4 0.0 0.0 0 5 10 15 20 25 0 5 10 15 20 25 Age − 80 Age − 80

Figure 2.2: Estimated Gompertz hazards (solid lines) for women, with 95% confidence intervals (dashed lines). The confidence intervals are obtained by simulation with b = 1000 replications

2.8.2 Cardiac allograft vasculopathy data

For the CAV data as described in Section 1.6.1, some backward transitions are recorded from states 3 and 2. However, the process is biologically irreversible and of particular interest is the onset of CAV. In order to investigate this, an illness-death without recovery model can be defined. The states are classified as Healthy (1) if the patient has not developed the disease, CAV (2) if the patient has developed moderate or severe CAV and Dead (3) if the patient has died, see Figure 2.4. Also, diagnosis of ischaemic heart disease (IHD) and donor age are known to be major risk factors of disease onset (Titman, 2011). Only data until 15 years are considered, since after 2.8. Applications 52 1.0 0.8 0.6 0.4 0.2 Survival from 1 (N=394) 0.0

0 2 4 6 8 10 1.0 0.8 0.6 0.4 0.2 Survival from 2 (N=146) 0.0

0 2 4 6 8 10 1.0 0.8 0.6 0.4 0.2 Survival from 3 (N=152) 0.0

0 2 4 6 8 10 Years since baseline

Figure 2.3: Comparison of model-based survival from states 1, 2, and 3 with Kaplan-Meier curves. Model-based survival: grey lines for individuals, blue lines for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals. Frequencies for baseline state along vertical axes this time data are scarce and we cannot estimate the model for times beyond that time. Titman (2011) used a similar formatting of the CAV data. Table 2.3 gives the number of times each pair of states was observed at successive observation times. 2.8. Applications 53

Table 2.3: State table for the CAV data: number of times each pair of states was observed at successive observation times. The two living states are defined by CAV severity

To From 1 2 3 99 1 1336 224 139 252 2 0 412 110 105

Health CAV

Dead

Figure 2.4: Illness-death without recovery model for disease progression after transplant for the CAV data

We fit a progressive three-state model for the CAV data defined as in Figure 2.4. Because time of death is known within one day, rather than being interval censored, the likelihood contribution of individuals observed in state r < 3 at time ∗ 2 ∗ ∗ t and are dead at time t > t are given by ∑s=1 P(Y(t ) = s|Y(t) = r)qs3(t ). As described in Section 2.3, transition probabilities for the likelihood function are cal- culated using a piecewise-constant approximation to the hazards. For the CAV data, the mean length of follow-up times is 1.622 years with standard deviation of 0.972 and median 1.258. Assuming that change in transition intensities in relation to the frequency of observation can be assessed in intervals of approximately one year, we can use the data to define the grid for the piecewise-constant approximation. Let t represent time since baseline. The proportional Gompertz hazard model is specified with dependence on donor age (dage) and primary diagnosis of ischaemic heart disease (IHD):

qrs(t) = exp(αrs.0 + αrs.1t + β1dage + β2IHD), (2.28) where (r,s) ∈ {(1,2),(1,3),(2,3)}. This model has 8 parameters and AIC = 3190.752. The estimated Gompertz hazards for subjects with IHD and donor age of 26 2.8. Applications 54 700 600 500 400 Frequency 300 200 100 0

0 5 10 15 Time since transplant (years)

Figure 2.5: Histogram of time since transplant in the CAV data

Table 2.4: Parameter estimates for the illness-death without recovery for the CAV data. Estimated standard errors in parentheses. The variable t represents time since baseline

Intercept t Covariates α12.0 -3.166 (0.175) α12.1 0.110 (0.022) β1 0.014 (0.004) α13.0 -3.682 (0.204) α13.1 -0.195 (0.064) β2 0.296 (0.089) α23.0 -3.585 (0.268) α23.1 0.101 (0.032)

(solid lines) and 95% confidence intervals (dashed lines) are presented in Figure 2.6. The confidence intervals are calculated using simulation with b = 1000, as described in Section 2.5. The risks of moving from state 1 (Healthy) to state 2 (CAV), and from state 2 (CAV) to state 3 (Dead) increase over the years. The risk of going from state 1 to state 3 (Dead) is very low and decreasing over the years. Similarly to what happened in the application to the OCTO data (Section 2.8.1), the confidence intervals become wider towards the end of the study. The histogram of time since transplant in Figure 4.8 shows that data become scarce after approximately 10 years. Table 2.4 illustrates the parameter estimates for model (2.28) for the CAV data.

For the covariates effects, βb1 = 0.018 and βb2 = 0.277 indicating that donor age and IHD increase the risks of disease progression and death. 2.8. Applications 55 1.0 0.8 0.6 0.4 Hazard 1 to 2 0.2

0 5 10 15 Time since transplant (years) 0.15 0.10 0.05 Hazard 1 to 3

0.00 0 5 10 15 Time since transplant (years) 1.0 0.8 0.6 0.4 Hazard 2 to 3 0.2

0.0 0 5 10 15 Time since transplant (years)

Figure 2.6: Estimated Gompertz hazards for subjects with IHD and with donor age of 26 (solid lines), with 95% confidence intervals (dashed lines). The confidence intervals are obtained by simulation with b = 1000 replications

Figure 2.7 shows baseline survival as estimated by the model and as described by the Kaplan-Meier curves. The model predicts the survival reasonably well up to ten years. However, there is some discrepancy between the estimate survival curve and the Kaplan-Meier curve after ten years, which is an indication of lack of fit. 2.9. Discussion 56 1.0 0.8 0.6 Survival 0.4 0.2 0.0 0 5 10 15 Time since transplant (years)

Figure 2.7: Comparison of model-based survival from states 1 with Kaplan-Meier curves. Model-based survival: grey lines for individuals, blue line for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals

2.9 Discussion This chapter presented the general formulation of parametric multi-state models for interval-censored data. We show in an application to the OCTO data that a Gom- pertz hazard specification results in a satisfactory model fit. In the application with the CAV data, we demonstrate that such parametric model assumption can be too restrictive. This thesis aims to investigate how model fit can be improved by using flexible hazards specification with splines, without defining a specific parametric form. Even though there are a wide range of more flexible model, deciding on a specific model can be difficult. The next chapter introduces the multi-state models with splines, which aim to overcome restrictive hazards models such as Gompertz and Weibull. Chapter 3

Multi-state models with splines

This chapter begins with a brief introduction to smoothing methods. This will in- clude a detailed description of cubic regression splines and P-splines basis func- tions. We then show how to specify transition-specific hazard functions with splines. A penalised maximum likelihood method is developed to estimate model parameters. The method uses a scoring algorithm to maximise the penalised log- likelihood function and a grid search to select the optimum amount of smoothing. The chapter then concludes with an application to the ELSA data, where paramet- ric and semi-parametric transition-specific hazards specifications are explored to model the multi-state processes. A substantial part of the material in this chapter has been published in in Medicine as a paper entitled Flexible multistate models for interval-censored data: Specification, estimation, and an application to ageing research (Machado and van den Hout, 2018). We acknowledge that part of the analysis of the ELSA data is also in Van den Hout (2017).

3.1 Introduction

The application to the CAV data in Chapter 2 shows that parametric hazards speci- fication can lead to poor model fitting. In this section, we present an introduction to smoothing methods, which are a general technique to estimate nonparametric func- tions. These methods allow for flexible modelling without making strong assump- tion about the functional forms underlying the data. We describe how to approach the simple case of estimating a univariate nonparametric function. The methods 3.1. Introduction 58 are subsequently extended to specify and estimate multi-state models with splines. This sections follows the description presented in Ruppert et al. (2003) and Wood (2006).

3.1.1 Smoothing methods

Consider the problem of estimating a function g from a set of n data points (xi,yi) for i = 1,...,n, where g is continuous on [x1,xn] ∈ R with absolutely continuous first derivatives. The nonparametric function f can be used as a predictor

yi = f (xi) + εi, (3.1)

where εi are the error terms and IE (εi) = 0. In practice, f is represented in terms of spline basis functions. Splines are curves made up of sections of polynomials joined together so that they are contin- uous in value, as well as in first and second derivatives. The points at which the sections join are known as the knots of the spline. The locations of the knots must be chosen. Typically the knots would either be evenly spaced through the range of the observed x values, or placed at quantiles of the distribution of unique x values.

The problem of estimating the nonparametric function f with splines can be > formulated as follows. Let B(x) = [B1(x),...,Bq(x)] be the vector of spline basis > functions evaluated at x and α = (α1,...,αq) the parameter vector so that the nonparametric function can be written as f (x) = α >B(x).A penalised spline is > defined as αb B(x), where αb is the minimiser of

n > 2 > ∑{yi − α B(xi)} + λαα Sα, (3.2) i=1 for some symmetric positive semidefinite matrix S and scalar λ > 0. The matrix S is called the penalty matrix. The scalar λ is known as the smoothing parameter, which is used to control the function smoothness. Various definitions of spline basis functions can be found in Wood (2006). This thesis uses cubic regression splines and P-splines, which are introduced in Sections 3.1.2 and 3.1.3, respectively. The 3.1. Introduction 59 smoothing method discussed in this section can be generalised to more complex settings such as generalised additive models (Wood, 2006).

3.1.2 Cubic regression splines

Cubic regression splines are parametrised in terms of the values of polynomial func- tions at the knots. In particular, the knots can be placed considering the percentiles of the distributions of unique x values. This means that more knots are placed where there is more data. This is appealing for multi-state modelling because multi-state processes commonly become scarce towards the end of study. If knots are placed where there is no data, models cannot be estimated.

Consider defining a cubic spline function, f (x), with k knots, x1,...,xk. Let 00 f (x j) = α j and f (x j) = δ j for j = 1,...,k. Then the spline can be written as

− + − + f (x) = a j (x)α j + a j (x)α j+1 + c j (x)δ j + c j (x)δ j+1, (3.3)

− + − + for the interval [x j,x j+1]. The basis function a j (x),a j (x),c j (x) and c j (x) are de- fined as

− a j (x) = (x j+1 − x)/h j, + a j (x) = (x − x j)/h j, − 3 c j (x) = [(x j+1 − x) /h j − h j(x j+1 − x)]/6, + 3 c j (x) = [(x − x j) /h j − h j(x − x j)]/6,

where h j = x j+1 − x j. Define the matrices G and D as

Di,i = 1/hi,

Di,i+1 = −1/hi − 1/hi+1,

Di,i+2 = 1/hi+1, 3.1. Introduction 60 for i = 1,...,k − 2, and,

Gi,i+1 = hi+1/6,

Gi+1,i = hi+1/6, for i = 1,...,k − 3. The conditions that the spline must have continuous second derivatives, at x j, and should have zero second derivative at x1 and xk, imply that

Gδ − = Dα, (3.4)

− > − −1 where δ = (δ2,...,δk−1) . Defining F = G D, and

  0    − F = F ,   0 where 0 is a row of zeros, we have that δ = Fα. Hence, the spline can be re-written in terms of α as

− + − + f (x) = a j (x)α j + a j (x)α j+1 + c j (x)F jα + c j (x)F jα, (3.5)

for x j ≤ x ≤ x j+1. Equation (3.5) can be further written as

k f (x) = ∑ αiBi(x), (3.6) i=1 where Bi(x) are given implicitly by (3.5). Therefore, given a set of x values, at which to evaluate the spline, it is possible to obtain a model matrix mapping α to evaluate the spline. It can be shown that

Z xk f 00(x)2dx = α >D>G−1Dα, (3.7) x1 that is, S = D>G−1D is the penalty matrix (Wood, 2000) . Figure 3.1 illustrates 3.1. Introduction 61 1.0 1.0 0.8 0.8 0.6 0.6 ) ) x x ( 4 ( f 0.4 0.4 B 0.2 0.2 0.0 0.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x x

Figure 3.1: The left hand panel illustrates one basis function, B4(x), for a cubic regression spline. On the right hand panel, the various curves of medium thickness show the basis functions, Bi(x), of a cubic regression spline, each multiplied by its coefficients α j. These scaled basis functions are summed to get the smooth curve illustrated by the thick continuous curve

how a smooth function can be represented in terms of cubic regression spline basis functions. The cubic regression splines as defined here are implemented in the package mgcv in R (Wood, 2007).

3.1.3 P-splines

Simpler cubic splines can be represented by a B-splines basis. The B-splines ba- sis functions are appealing because these functions are strictly local. Each basis function is only non-zero over the intervals between m + 3 adjacent knots, where m is the order of the basis (e.g., m = 2 for cubic splines). The k + m + 1 knots, x1 < x2 < ... < xk+m+1, define a k parameter B-spline basis, where the interval over th which the spline is to be evaluated lies within [xm+2,xk]. An (m + 1) order spline can be represented as k m f (x) = ∑ Bi (x)αi, (3.8) i=1 where the B-spline basis functions are most conveniently defined recursively as fol- lows: m x − xi m−1 xi+m+2 − x m−1 Bi (x) = Bi (x) + Bi+1 (x), xi+m+1 − xi xi+m+2 − xi+1 3.1. Introduction 62

(a) 1.0 0.8 0.6

0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0

(b) 1.0 0.8 0.6

0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0

(c) 1.0 0.8 0.6

0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0

Figure 3.2: Illustrations of B-splines basis functions of degrees (a) one, (b) two, and (c) three

for i = 1,...,k, and  1 x ≤ x < x −1  i i+1 Bi (x) = . 0 otherwise

Figure 3.2 shows the B-splines basis of degree 1, 2 and 3 for the case of seven evenly spaced knots. Figure 3.3 illustrates how a smooth curve can be represented in terms of B-splines basis functions of degree two and three.

P-splines are a smoother using B-splines basis functions, defined on evenly spaced knots, with a difference penalty applied directly to the parameters, αi, to control function smoothness. To illustrate how this works, suppose we decide to penalise the squared difference between adjacent αi values. Then the penalty would be k−1 2 2 2 2 P = ∑ (αi+1 − αi) = α1 − 2α1α2 + 2α2 − 2α2α3 + ... + αk , i=1 3.2. Model representation 63 1.0 1.0 0.8 0.8 0.6 0.6 ) ) x x ( ( f f 0.4 0.4 0.2 0.2 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x x

Figure 3.3: Illustration of smooth curves made up of B-spline basis functions. On the left hand panel, the dashed curves show B-splines basis functions with m = 1 mul- tiplied by their associated coefficients. The thick solid smooth curve is the sum of the scaled basis functions. The right hand panel shows the same, but for B-splines basis function with m = 2 which can be written in matrix form as

  1 −1 0 ··     −1 2 −1 ··   >   P = α  0 −1 2 ··α. (3.9)      ·····   ·····

Equation 3.9 can be written as P = α >D>Dα, where the matrix D is the difference operator matrix (Eilers and Marx, 1996). P-splines are straightforward to set up and use, and allow for good flexibility as any order of penalty can be combined with any order of B-spline basis. Their main disadvantage is that the simplicity is somewhat diminished if uneven knot spacing is required (Wood, 2006).

3.2 Model representation

We showed in Section 2.2 that it is straightforward to define parametric time- dependent multi-state models. The specification of multi-state models with splines can be done in a similar way, as spline models can be seen as a type of parametric models. 3.2. Model representation 64

Recall that a time-dependent hazard regression model for transition intensities combines baseline hazards with log-linear regression. Time-dependent models can be defined by using proportional hazards model for transition r to s, r 6= s, as follows

>  qrs(t) = qrs.0(t)exp β rsx , (3.10)

> where qrs.0(t) is the baseline hazard function, x = (x1,...,xp) is a covariate vec- > tor and β rs = (βrs.1,...,βrs.p) is vector of unknown parameters. In Chapter 2, examples of parametric shapes for qrs.0(t) are presented. We next describe the non- parametric specification of qrs.0(t) with splines. Each baseline hazard can be ap- proximated by the exponential of a linear combination of Krs spline base functions

Bk(t) and regression coefficients αrs.k ∈ R as follows

Krs ! qrs.0(t) = exp ∑ αrs.kBk(t) . (3.11) k=1

Let the number of spline basis functions be large. Define the vector of co- > efficients by α rs = (αrs.1,...,αrs.Krs ) for r 6= s. Each qrs.0(t) is associated to a penalty matrix, which is quadratic in the basis coefficients and measures the com- plexity of qrs.0(t). For each transition r → s, the smoothing penalty can be written > as λrsα rsSrsα rs, where Srs is a matrix of known coefficients. The quantities λrs are called smoothing parameters. They control the trade-off between model fit and model smoothness. Large values for the smoothing parameters, λrs → ∞, lead to a log-linear estimate of qrs.0, while λrs = 0 results in an unpenalised regression spline estimate (Wood, 2006). With relation to the number of knots, Krs = 10 is usually enough and larger number of splines basis functions will not change the estimated functions.

In this chapter, we focus on the use of P-splines for the basis functions Bk(t). However, the method is implemented in a way that is easy to employ other spline definitions and corresponding penalties. Flexible multi-state models can be defined by P-splines and a combination of parametric hazards specification presented in Section 2.2. Applications of P-splines to multi-state models can also be found in 3.3. Penalised maximum likelihood estimation 65

Kneib and Hennerfeind (2008).

3.3 Penalised maximum likelihood estimation

Multi-state models with splines can be fitted to a set of longitudinal data using penalised maximum likelihood estimation. In this section, we present the penalised log-likelihood function, and a scoring algorithm for finding the penalised likelihood estimates of the model parameters.

3.3.1 Penalised log-likelihood function

For semi-parametric multi-state models, at least one baseline hazard function is specified with P-splines. If the model for a transition is specified with P-splines, we define a large set of equidistant knots. To control the smoothness of the estimated curve, a penalty based on finite differences of the coefficient of adjacent P-splines is imposed on the log-likelihood function. Let β , α and ξ represent the vector of parameters associated to the parametric, non-parametric and covariates components of a multi-state model, respectively. Let θ = (β >,α >,ξ >)> be the full set of pa- rameters and `(θ ) be the log-likelihood function of a semi-parametric multi-state model. This function is constructed as described in Section 2.4.1. The penalised log-likelihood function is given by

s 1 > > `p(θ ) = `(θ ) − ∑ λ jα j D j D jα j 2 j=1 1 = `(θ ) − θ >S θ , (3.12) 2 λ

> where s is defined as the number of transitions with P-splines, α j = (α j1,...,α jK) is assumed to have the same dimension for each hazard approached with splines,

D j is the matrix representation of the difference operator of adjacent P-splines, λ = > (λ1,...,λs) is the vector of smoothing parameters and Sλ is the penalty matrix. Sλ > is a block diagonal matrix with blocks λ jD j D j for penalising P-splines parameters and zeros elsewhere (Gray, 1992). 3.3. Penalised maximum likelihood estimation 66 3.3.2 Parameter estimation

The penalised log-likelihood function (3.12) can be maximised using a numerical method that consists of two parts. First, given a grid of values for the smoothing parameters vector, we aim to find an estimate of the model parameters for each vec- tor in the grid. Second, we select the smoothing parameters vector that minimises an information criterion such as the AIC. We next describe how to perform the first part of the method, that is, how to maximise the penalised log-likelihood function in (3.12) for fixed smoothing parameters vector.

Given a piecewise-constant approximation to the time-dependency in the haz- ard model (3.10), a scoring algorithm can be used to maximise the penalised log- likelihood function (3.12) for a fixed value of the smoothing parameter vector λ . In particular, the scoring algorithm presented in Section 2.4.2 can be extended and used to find the penalised maximum likelihood estimates.

Let gp(θ ) be the q × 1 vector of first-order derivatives of the penalised log- likelihood function in (3.12). This quantity is called the penalised gradient and is given by

gp(θ ) = ∂`p(θ )/∂θθ (3.13)

= ∂`(θ )/∂θθ − Sλ θ (3.14)

= g(θ ) − Sλ θ (3.15)

where g(θ ) is the gradient vector as defined in (2.21). Let I p(θ ) be the q × q ma- trix of the expected negative second-order derivatives of the log-likelihood function. This matrix is known as penalised Fisher information and is given by

h 2 >i I p(θ ) = IE −∂ `p(θ )/∂θθθ∂θθ (3.16)

h 2 > i = IE −∂ `(θ )/∂θθθ∂θθ + Sλ (3.17)

= I + Sλ , (3.18) where I (θ ) is the Fisher information matrix as defined in (2.22). 3.3. Penalised maximum likelihood estimation 67

The scoring algorithm provides that an estimate of the vector θ at the (v+1)th cycle of the iterative procedure, θ (v+1), is

(v+1) (v) (v)−1 (v) θ = θ + I p θ gp θ . (3.19)

(v) (v)−1 for v = 1,2,3..., where gp θ is the penalised gradient vector and I p θ is the inverse of the penalised Fisher information matrix, both evaluated at θ (v). The process iterates until the relative differences in the values of the parameter estimates q (v) (v+1) satisfies ∑k=1 |θk − θk | < δ for a suitable small positive value δ.

The asymptotic covariance matrix of the penalised maximum likelihood esti- −1 mate θb is given by the inverse of the penalised Fisher information matrix I p(θ ) −1 (Gray, 1992), which can be approximated by I p(θb) . Thus, the standard error of

θbi = (θb)i is given by −1 1/2 SE(θbi) ≈ (I p(θb) )ii , (3.20) for i = 1,...,q.

As discussed in Section 2.4.2, it is not possible to calculate the Fisher informa- tion matrix, and it is approximated by the q × q matrix M(θ ) with (k,l) entry

N ni ∂ ∂ ∑ ∑ logLi j logLi j. (3.21) i=1 j=2 ∂θk ∂θl

Hence, the penalised Fisher information matrix in (3.18) is approximated by

Mp(θ ) = M(θ ) + Sλ . (3.22)

The matrix I p is then replaced by Mp in equations (3.19) and (3.20).

The algorithm is implemented by the author in R in such a way that it is straightforward to vary transition-specific choices for parametric shapes. An ex- ample of such a model is explored in the application, where Gompertz and spline models are used to specify the hazards. 3.3. Penalised maximum likelihood estimation 68 3.3.3 Smoothing parameter estimation

Estimating the optimal value for the smoothing parameters λ is crucial for fitting models with splines (Gu and Wahba, 1991). Given a grid of smoothing parameter vectors, the optimum value can be defined as the one with the smallest AIC. The AIC definition presented in Section 2.6 can be extended for semi-parametric models as follows,

AIC(λ ) = −2`p(θ ) + 2d f , (3.23)

where d f is a measure of model complexity, and `p(θ ) is the penalised log- likelihood function. The quantity d f is called the degrees of freedom. For para- metric models, the degrees of freedom are equal to the number of independent pa- rameters in the model. For semi-parametric models with splines, the degrees of freedom can be defined as

−1 d f (λ ) = tr[I (I + Sλ ) ], (3.24)

where I is the Fisher information matrix and Sλ is the penalty function (Gray, 1992). A similar definition of degrees of freedom is given in Commenges et al. (2007). If the vector of smoothing parameters is zero, λ = 0, there is no penalty and the degrees of freedom is given by

−1 d f (λ ) = tr[I (I + Sλ ) ] = tr[I (I )−1] = tr[I] = q, where q is the number of parameters in the model, and I is the q×q identity matrix. Therefore, for d f = q the AIC definition in (3.23) coincides with the definition provided in Section 2.6. If the smoothing parameters vector is not zero, then the degrees of freedom is less than the total number of parameters, d f < q.

In this thesis, we approximate the matrix I by the matrix M as defined in 3.4. Prediction 69

Section 2.4.2. This implies that we use an approximation to the degrees of freedom.

3.4 Prediction

Once a multi-state model is fitted using a parametric and semi-parametric hazard model, estimated model parameters can be used for prediction. Typically this con- cerns computing transition matrices as a function of the penalised maximum likeli- hood estimate. The covariance of a function of model parameters can be estimated by Monte Carlo simulation or by using the multivariate delta method. Because tran- sition probabilities are restricted to [0,1], we have to calculate the standard error of e.g., log(p/(1 − p)) or log(−log(p)), and back transform (Titman, 2011). This thesis focuses on the simulation method.

Let Vbθ denote the estimated covariance matrix of the penalised maximum like- −1 lihood estimate, θb, defined as Mp(θb) in (3.22). Notice that Vbθ takes into account the choice of the smoothing parameters, λb. Of interest is the estimation of P(t1,t2) for arbitrary t1 and t2 > t1. In the case of a time-dependent model, let the grid for the piecewise-constant approximation be defined by u j+1 = u j + h for j = 1,...,M such that u1 = t1 and uM = t2. Given this grid, matrix P(t1,t2) is estimated by

P(u1,u2) × · · · × P(uM−1,uM). (b) For Monte Carlo simulation, parameter vectors θ are drawn from N(θb,Vbθ ), (b) for b = 1,...,B, and for each sampled θ , P(t1,t2) is calculated. Summary statis- tics such as mean and covariance can be derived easily from the B realisations of

P(t1,t2).

3.5 Application to the English longitudinal study of ageing data

The method is illustrated with an application to the ELSA data. As described in Section 1.6.2, the ELSA data are a random sample of size N = 1000 individuals. Of these 1000 individuals, 205 died during the follow-up with age at death available. The sample has 544 women and 456 men. Highest educational qualification is dichotomised for the current analysis according to years of formal education: fewer 3.5. Application to the English longitudinal study of ageing data 70

Figure 3.4: Five-state model for longitudinal data in ELSA on number of words remem- bered in a recall

Table 3.1: State table for the ELSA data: number of times each pair of states was observed at successive observation times. The four living states are defined by number of words remembered

To From 10-7 words 6-5 words 4-2 words 1-0 words Dead 10-7 words 164 150 49 12 8 6-5 words 156 440 303 48 40 4-2 words 52 336 616 151 85 1-0 words 11 35 114 149 72 than ten versus ten or more.

This application focuses on the number of words remembered in a delayed recall from a list of ten. A five-state model is defined for the number of words remembered and death, see Figure 3.4. The four living states are defined as: State 1, 2, 3, and 4, for the number of words {7,8,9,10}, {6,5}{4,3,2}, and {1,0}, respectively. The state 5 is defined as the dead state. The statistical modelling in this section aims to explore the effect of age and gender on cognitive change over time when controlling for education. The interval-censored multi-state process is summarised by the frequencies in Table 3.1. Note that the sum of the transitions into the dead state is equal to the number of deaths in the sample, i.e., 205. Table 3.1 also shows that the process is mainly progressive in the sense that the main trend over time is towards the higher states.

In what follows, model estimation is undertaken by using the scoring algo- > rithm. Let θ = (θ1,...,θq) be the vector with model parameters, where q depends on the chosen model. The convergence criterion for the algorithm is to stop at itera- q (v) (v+1) −6 tion v+1 when ∑k=1 |θk −θk | < 10 . Because time of death is known, rather 3.5. Application to the English longitudinal study of ageing data 71 than being interval censored, the likelihood contribution of individuals observed in state r < 5 at time t and dead at time t∗ > t are given as in (2.19) with D = 5.

Model selection is bottom-up starting with the time-homogeneous exponential hazard model given by

 qrs(t) = exp βrs.0 , (3.25) for the transitions r → s depicted in Figure 1.10. This intercept-only model with 10 parameters has AIC = 8109.5. Convergence of the scoring algorithm was reached after 14 iterations, using starting values βrs.0 = −3 for all the parameters.

As described in Section 2.3, time-dependent models are estimated by using a piecewise-constant approximation to the hazards. For the ELSA data, the mean length of follow-up times is 2.178 years with standard deviation of 0.855 and the median is 2 years. Assuming that change in transition intensities in relation to the frequency of observation can be assessed in intervals of approximately 2 years, we use the data to define the grid for the piecewise-constant approximation rather than imposing a fixed grid to calculate the likelihood contributions. For the process at hand, age is the most suitable time scale. Age in the ELSA data is transformed by subtracting 49 years. This results in 1 being the minimal age in the sample.

The first extension is a Gompertz models given by

 qrs(t) = exp βrs.0 + ξrst , (3.26) where the effect of time is allowed to be different for all transitions. This model has 20 parameters and AIC = 7784.2.

Even though the sample size is not small, Table 3.1 shows that mortality in- formation is limited because only about 20% of the individuals end up in the dead state during follow-up. A more parsimonious model Gompertz model is given by

 qrs(t) = exp βrs.0 + ξrst , (3.27) 3.5. Application to the English longitudinal study of ageing data 72 where ξ21 = ξ32 = ξ43 = ξB and ξ15 = ξ25 = ξ35 = ξ45 = ξD. That is, the effect of time is the same for all backwards transitions and for transitions into the dead state. This model has 15 parameters, and needs 16 scoring iterations when using starting values βrs.0 = −3 and ξrs = 0 for all the relevant r,s-combinations. The model has AIC = 7780.5. In what follows, model (3.25) is extended by adding parameters with parameter equality constraints. Subsequently, covariate information is added for the transitions of interest, i.e., those transitions that represent a decline in cognitive function. For this, model (3.27) is extended to

 qrs(t) = exp βrs.0 + ξrst + βrs.1sex + βrs.2education , (3.28) where sex is 0/1 for women/men, and education is 0/1 for fewer than ten years/ten years of more of education. For the transitions into the dead state, the constraints on the coefficients for sex are β15.1 = β25.1 = β35.1 = β45.1, and for education are

βr5.2 = 0 for r = 1,2,3,4. This model has 22 parameters, needs 16 iterations, and has AIC = 7680.3. It is worthwhile to investigate alternative time-dependent models. First, in model (3.28), the Gompertz baseline models for the transitions into the dead state are replaced by Weibull models. Starting values for the transitions into the dead state are βr5.0 = −10, τ15 = exp(0.5), and for the remaining parameters the values are as given above. This yields AIC = 7688.7 after 20 iterations. Next, all baseline hazards definitions in model (3.28) are replaced by Weibull models, which results in AIC = 7729.5 after 28 iterations. Alternatively, model (3.28) is defined with Gompertz baseline models for the transitions into the dead state and Weibull models for progression through the living states. This yields AIC = 7719.7 after 25 iterations. Semi-parametric models with P-splines can be used to model non-linear func- tional forms and to check shapes specified by parametric models. Because the focus of our investigation is the decline of cognitive function, which is mostly associated with individuals in states 3 and 4, we replace the Gompertz hazard for transition 3.5. Application to the English longitudinal study of ageing data 73

Table 3.2: Comparison between models for the ELSA data with N = 1000, where -2LL stands for -2 times the (penalised) loglikelihood function evaluated at its maxi- mum. The variable t denotes age transformed by subtracting 49 years

Model Baseline hazards #Parameters -2LL AIC Intercept-only Exponential 10 8089.5 8109.5 t Gompertz no constraints 20 7744.2 7784.2 t Gompertz 15 7750.5 7780.5 t, sex, education Gompertz 22 7636.3 7680.3 t, sex, education Gompertz for living 22 7644.7 7688.7 and Weibull for death t, sex, education Weibull 22 7685.5 7729.5 t, sex, education Weibull for living 22 7675.7 7719.7 and Gompertz for death t, sex, education P-splines I 38 7626.0 7678.2 for 2 → 3 and 3 → 4 t, sex, education P-splines II for 3 → 4 30 7630.9 7678.2

2 → 3 and 3 → 4 in model (3.28) by

K ! q23(t) = exp ∑ α23.kBk(t) + β23.1sex + β23.2education k=1 (3.29) K ! q34(t) = exp ∑ α34.kBk(t) + β34.1sex + β34.2education . k=1

For this model, the number of P-splines bases for both hazard functions is K =

10 and the vector of smoothing parameters is λ = (λ1,λ2). The initial grid is given by all pairs of combinations of log10 λ1 = (−3,−2,−1,0,1,2,3) and log10 λ2 = (−3,−2,−1,0,1,2,3). A possible graphical representation of the AIC results is to plot its values when one smoothing parameter is fixed. Figure 3.5(c) illustrates the −3 resulting AIC for different values of λ2 with fixed λ1 = 10 . The value which minimises the AIC is λ2 = 10. It happens for all values of λ1. The search for the optimal values of λ1 is less straightforward as λ1 → ∞. Figure 3.5(a) shows the AIC for several values of λ1 with fixed λ2 = 10. The AIC decreases quickly for small values of λ2; however, it gets approximately constant for large values. This result indicates that the functional form of the hazard for transitions 2 → 3 is log-linear. Because both AIC and parameter estimates do not change much for 7 sufficiently large values of λ1, it is possible to set λ1 = 10 . In this case, the best 3.5. Application to the English longitudinal study of ageing data 74

(a) (b) ) 10

, 1 λ ( Intensity AIC 0.2 0.4 0.6 7678−3 7682−2 7686 −1 0 1 2 3 4 5 6 7 8 9 10 11 0 10 20 30 40

log10(λ1) Time (years)

(c) (d) ) 2 λ , 3 − 10 ( Intensity AIC 7688 7692 7696 0.0 0.2 0.4 −3 −2 −1 0 1 2 3 0 10 20 30 40

log10(λ2) Time (years)

(e) (f) ) λ ( AIC Intensity 0.1 0.3 0.5 7678−3 7682 7686 −2 −1 0 1 2 3 0 10 20 30 40

log10(λ) Time (years)

Figure 3.5: AIC results and fitted hazard transitions for men with ten or more year of ed- −3 ucation. In (a) and (c), the AIC results for fixed λ2 = 10 and fixed λ1 = 10 , respectively. In (b) and (d), the estimated hazards for 2 → 3 and 3 → 4, respec- tively. Solid line for P-splines I and dotted line for Gompertz. In (e) and (f), the AIC results for model P-splines II and fitted hazard for 3 → 4, respectively. Time denotes age transformed by subtracting 49 years model (P-splines I) according to the AIC is obtained with smoothing parameter > λb = (107,10). This model has 30 parameters and 26.1 degrees of freedom.

The fitted hazards for transition 2 → 3 for the Gompertz (3.28) and P-splines I (3.29), for men with ten or more years of education are illustrated in Figure 3.5(b). The functional forms of both models are very similar for this transition; however, the functional forms for transition 3 → 4 are quite different, as indicated in Figure 3.5(d). Model (3.29) has AIC = 7678.2 indicating that it performs slightly better than the Gompertz model with AIC = 7680.3. Therefore, for prediction purposes the Gompertz models may be preferable for prediction beyond the time range in the data. The fitted hazards in Figures 3.5(b) and (d) show that an increase of age is 3.5. Application to the English longitudinal study of ageing data 75

Table 3.3: Results for sex, education and time for the five-state P-splines II model for the ELSA data. Estimated standard errors in parentheses. The variable t denotes age transformed by subtracting 49 years

sex education t β12.1 0.552 (0.138) β12.2 -0.281 (0.146) ξ12 0.030 (0.010) β23.1 0.178 (0.101) β23.2 -0.836 (0.103) ξB -0.031 (0.006) β34.1 0.141 (0.145) β34.2 -0.445 (0.160) ξD 0.042 (0.009) βD 0.477 (0.151)

associated with higher risk of moving from state 2 to state 3 and from state 3 to state 4. The functional form of hazard for transition 2 → 3 in model (3.29) indicates that a Gompertz specification can be reasonable for this transition. Therefore, in model (3.28), only the hazard for transition 3 → 4 is specified with P-splines:

K ! q34(t) = exp ∑ α34.kBk(t) + β34.1sex + β34.2education . (3.30) k=1

The number of P-splines bases is K = 10 and the grid search is made on the values log10 λ = (−3,−1,0,1,3). The resulting AIC values are illustrated in Figure 3.5(e). The minimum AIC with value 7678.2 is obtained at λ = 10. That is the same AIC value as for model (3.29); however, the degrees of freedom is slightly smaller d f = 23.65. As model (3.30) (P-splines II) is easier to estimate if compared to model (3.29), it is considered the best model among all illustrated in this thesis. Table 3.2 summarises the comparison of the investigated models. Figure 3.5(f) illustrates the fitted hazard for transition 3 → 4 in model (3.30) for men with ten or more years of education. As expected, there is an increase in the risk of progression to a decline of cognitive function over the years. Table 3.3 shows the estimates for the covariate effect parameters in the P- splines II model (3.30). Most of the point estimates are as expected. For example, the effect of getting older is associated with decline of cognitive function, i.e., ξb12 >

0, and with a decreasing hazard of remembering more words, i.e., ξbB < 0. For transitions 1 → 2, 2 → 3, and 3 → 4 more years of education is associated with a 3.5. Application to the English longitudinal study of ageing data 76 lower risk of moving. The effect of being a male patient is associated with higher risks of going into the dead state, βbD = 0.477. This correspond to hazard ratio of exp(βbD) = 1.611, which represents a 61% higher hazard of death.

Model validation is hampered by the interval censoring of the transitions be- tween the living states. But given that death times are available, it make sense to compare survival as estimated by the model with Kaplan-Meier curves (Gentleman et al., 1994). Of course, this will only check part of the fitted model. Figure 3.6 depicts baseline-specific survival as estimated by the model and as described by the Kaplan-Meier curves. Individual survival curves (in grey) are shifted to the years since baseline so that we can compare them and their mean to the Kaplan-Meier curves. This is necessary because individuals have different ages at baseline. The shape of the Kaplan-Meier curves is due to fact that time of death is rounded to the nearest integer. Even though time of death is not known precisely, we assume that time of transition into the dead state is known exactly. For survival given base- line state 3, there is some discrepancy between model-based mean survival and the Kaplan-Meier curve, but overall the fit is reasonably good. Although this is not a proper goodness-of-fit test, the comparison shows that the model is able to capture the attrition due to death during the follow-up.

3.5.1 Predicting cognitive function

Although parameters for the transition intensities help to understand the estimated model, interpretation is more straightforward when transition probabilities are con- sidered. Firstly, consider a short time interval for which we assume that the intensi- ties are constant. For men aged 60 with ten or more years of education, the two-year transition probabilities are estimated at

  0.330 0.488 0.154 0.010 0.018    0.171 0.531 0.253 0.023 0.022      Pb (t1 = 11,t2 = 13) =  0.083 0.391 0.429 0.066 0.031 , (3.31)      0.034 0.219 0.410 0.291 0.046    0 0 0 0 1 3.5. Application to the English longitudinal study of ageing data 77 Survival from 1 (N=99) 0.5 0.7 0.9 0.5 0.7 0.9 Survival from 2 (N=296) 0 2 4 6 8 10 0 2 4 6 8 10 0.5 0.7 0.9 0.5 0.7 0.9 Survival from 3 (N=456) Survival from 4 (N=149) 0 2 4 6 8 10 0 2 4 6 8 10 Years since baseline Years since baseline

Figure 3.6: Comparison of model-based survival from states 1, 2, 3, and 4 with Kaplan- Meier curves. Model-based survival: grey lines for individuals, smooth black line for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence bands. Frequencies for baseline state along vertical axes

where t denotes age transformed by subtracting 49 years. The diagonal entries in this matrix dominate. But there are some large off-diagonal entries as well. For example, if a man aged 60 is in state 3, then he has a 39% chance of being in state 2 two years later. This high chance is an illustration of the noise of the process under investigation: it is quite likely that a 60 year old man moves between states 2 and 3 within the next two years.

Next we illustrate the estimation of standard errors and 95% confidence in- tervals for transition probabilities. Using simulation with B = 1000, the estimated standard errors of matrix (4.22) is

  0.038 0.029 0.016 0.004 0.011    0.013 0.021 0.019 0.009 0.006     .  0.007 0.019 0.024 0.024 0.006    0.004 0.019 0.024 0.037 0.012 3.6. Discussion 78

From 3 to 3 From 3 to 4 From 3 to 5 Transition probability Transition 0.0 0.2 0.4 0.6 0.80.0 1.0 0.2 0.4 0.6 0.80.0 1.0 0.2 0.4 0.6 0.8 1.0

60 62 64 66 68 70 60 62 64 66 68 70 60 62 64 66 68 70 Age Age Age

Figure 3.7: For the P-splines II model, estimated ten-year transition probabilities for men aged 60 with ten or more years of education, and in state 3 at baseline. Solid line for transition probabilities (with B = 1000) and dashed lines for 95% con- fidence bands

The 95% confidence intervals for the first row are given by

(0.263,0.408), (0.425,0.540), (0.120,0.185), (0.004,0.021), and (0.011,0.049).

Next, ten-year transition probabilities are estimated for men aged 60 with ten or more years of education. The grid is defined by h = 1/2 years. The estimation is shown in Figure 3.7. Figure 3.7 concurs with expectations. For example, given the progressive trend of the process, it is to be expected that a probability of being in state 3 decreases over time, as moving to states 4, and 5 becomes more likely due to increased age.

3.6 Discussion

Specification and estimation of continuous-time multi-state models are presented and shown to be a flexible framework for statistical modelling of time-dependent processes. By defining transition-specific with parametric and semi-parametric 3.6. Discussion 79 hazard models, a wide range of multi-state processes can be investigated. Pe- nalised maximum likelihood estimation is undertaken by a scoring algorithm using a piecewise-constant approximation to time-dependent hazards. The Akaike infor- mation criterion is used to select the optimal value for the smoothing parameters.

The Markov processes formulation to semi-parametric multi-state models ex- tends the method described in Joly and Commenges (1999) and Joly et al. (2002). This is an important methodology to medical statistics as backward transitions occur naturally in many applications (Abner et al., 2012; Marioni et al., 2012). Further- more, using the piecewise-constant approximation is an alternative to the method introduced by Titman (2011) which handles the time-dependency by using numer- ical solutions to the non-linear differential equations defined directly by the time- dependency of the Markov process. As stated by Titman, computation using the non-linear differential equations can become prohibitively slow when adding con- tinuous covariates. This is not a problem when using the piecewise-constant ap- proximation and the scoring algorithm. To address the problem with continuous covariates, the method in Titman (2011) uses an approximation to the full likeli- hood. Therefore, It would be interesting to compare both methods to investigate the degree of approximation and relative computation speeds between the methods.

The piecewise-constant approximation can be determined by the observation times in the data. Alternatively, a fixed grid can be used that is imposed for all like- lihood contributions. This method is computationally more extensive as it requires a greater number of eigenvalue decompositions to calculate transition probabilities for intervals defined by the grid. Those two methods to define a piecewise-constant approximation will differ depending on the study design and volatility of the process of interest (Van den Hout, 2017).

A semi-Markov assumption could be more reasonable for modelling the ELSA data; however, fitting semi-Markov models with interval-censored data is compli- cated given the number of living states and backward transitions (Commenges, 2002). Nonetheless, Figure 3.6 shows that the fit is reasonably good.

The scoring algorithm is implemented in R in such a way that it is easy to 3.6. Discussion 80 vary transition-specific choices for parametric and semi-parametric shapes. An ex- ample of such a model is explored in the application, where P-splines are used for transitions 2 → 3 and 3 → 4, and Gompertz hazards are defined for the other transi- tions. The eigenvalue decomposition in the algorithm is computed with the function eigen in R, which uses the LAPACK routine (Anderson et al., 1999). P-spline bases are computed using the code in the appendix in Eilers and Marx (1996). There is some overlap between the method presented in this chapter and the msm package (Jackson, 2011). The last is a platform to analyse time-homogeneous multi-state processes with interval-censored transition times. It is possible to fit some time-dependent models with msm, such as the Gompertz and splines mod- els. However, this package cannot fit models with penalised splines nor with some commonly used parametric specifications such as the Weibull model. If prediction of a time-dependent process beyond the time range in the data is of interest, hazard models with P-splines can be used to validate the parametric choices which underlie the prediction. This was illustrated in the application with the ELSA data in which age range is from 50 to 90 years. If risk factors are the main focus of the research, P-splines can be used to capture non-parametric shapes of time-dependency. The choice of the type of spline is not essential. P-splines were used in this chapter, but any other spline function with a first-order derivative can be handled within the current framework. The same holds for parametric shapes other than the Gompertz and the Weibull. The specification and estimation of a continuous- time survival model is very general and does not pose restrictions on the number of states, scale of covariates, or number of transitions. Chapter 4

Automatic smoothing for multi-state models

The method presented in the last chapter can become intractable for many appli- cations, as computational time increases quickly with the number of hazards ap- proached with splines. This chapter focus on developing an efficient and automatic method for estimating multi-state models with splines. Automatic smoothing meth- ods are pivotal for practical modelling with splines. This chapter begins with a description of a commonly used method for multiple smoothing parameters estima- tion, and its limitation for multi-state models. Subsequently, we develop a penalised maximum likelihood method for automatic smoothing in multi-state models. The method is illustrated with applications to the CAV and OCTO data sets. A small simulation study is used to assess the performance of the method. A substantial part of the material in this chapter has been submitted to arXiv as a paper entitled Pe- nalised maximum likelihood estimation in multistate models for interval-censored data (Machado et al., 2018).

4.1 Background for automatic smoothing

Chapter 3 shows that it is straightforward to specify parametric and semi-parametric hazard in multi-state models. For each transition-specific hazards approached with splines, a smoothing parameter is employed for estimation. Hence, the problem of estimating multi-state models with splines involves estimating multiple smoothing 4.1. Background for automatic smoothing 82

Figure 4.1: An unidirectional three-state model parameters. Automatic smoothing parameters selection methods for generalised additive models (GAM) are well established (Ruppert et al., 2003; Wood, 2006). Those methods can be extended for different settings, e.g., semi-parametric cop- ula regression models (Radice et al., 2016). In this section, we illustrate how the existing GAM framework could be applied to multi-state models, and discuss the limitations of such methods for multi-state modelling. For ease of presentation, the method is illustrated for the unidirectional three-state model in Figure 4.1. The details of most results in this section are presented in Section 4.2. The following presentation is based on Wood (2000) and Radice et al. (2016).

Let n represent the number of follow-up times in the data. The linear predictors for transitions 1 → 2 and 2 → 3 are given by

K > η12.i = xi β 12 + ∑ α12.kB12.k(ti) (4.1) k=1 K > η23.i = xi β 23 + ∑ α23.kB23.k(ti) (4.2) k=1

> > where xi = (xi1,...,xip) is a vector of covariates, β 12 = (β12.1,...,β12.p) and > β 23 = (β23.1,...,β23.p) are parameter vectors. The splines parameters can be writ- > > ten as α 12 = (α12.1,...,α12.K) and α 23 = (α23.1,...,α23.K) , where K represents the number of splines basis functions. The linear predictors can further be written as

> > η12.i = xi β 12 + B12(ti) α 12 (4.3) > > η23.i = xi β 23 + B23(ti) α 23, (4.4)

> > where B12(ti) = (B12.1(ti),...,B12.K(ti)) and B23(ti) = (B23.1(ti),...,B23.K(ti)) . > > > > > > After defining, X1i = (xi ,B12(ti) ) and X2i = (xi ,B23(ti) ) , we have that η1i = 4.1. Background for automatic smoothing 83

> > > > > > > > X1iθ 1 and η2i = X2iθ 2, where θ 1 = (β 12,α 12) and θ 2 = (β 23,α 23) .

> > > > > Let θ = (β 12,α 12,β 23,α 23) be the full set of parameters. The penalised log-likelihood is given by

1 ` (θ ) = `(θ ) − θ >S θ , (4.5) p 2 λ where `(θ ) is the log-likelihood function and Sλ = diag(0,λ1S1,0,λ2S2) is the penalty matrix.

For the estimation of the model parameters, given a value for the vector of smoothing parameters, λb, we aim to find an update for the parameter vector θ [a]. > > > [a] Let us define X = (X1,...,Xn) , where Xi = diag X1i,X2i , the vector d as a vector with ith element given by

[a]  [a] [a]  di = ∂`(θ )i/∂η12.i,∂`(θ )i/∂η21.i , (4.6)

[a] [a] and W a block diagonal matrix made up of 2 × 2 matrices Wi given by

 2 [a] 2 [a]  ∂ `(θ )i ∂ `(θ )i   ∂η12.i∂η12.i ∂η12.i∂η23.i   . (4.7)  2 [a] 2 [a]   ∂ `(θ )i ∂ `(θ )i  ∂η23.i∂η12.i ∂η23.i∂η23.i

We then have that the penalised gradient and penalised Hessian at θ [a] are respec- tively given by

g[a] = X>d[a] − S θ [a] (4.8) p λb [a] = −X>W[a]X − S . (4.9) Hp λb

[a+1] [a] Applying a first-order Taylor expansion to gp around θ , setting the resulting expression to zero, we find that a new fit θ [a+1] is given by

 −1 θ [a+1] = X>W[a]X + S X>W[a]z[a], (4.10) λb 4.1. Background for automatic smoothing 84 where z[a] = (W[a])−1d[a] + Xθ [a] is the pseudo-data (Wood, 2000). The derivation of this result is presented with more details in Section 4.2. The pseudo-data plays the role of the response vector in the GAM framework.

The smoothing parameter vector can be estimated by minimising the Un- Biased Risk Estimator given by

1/2 2 V (λ ) = ||W (z − Aλ z)|| /nˇ − 1 + 2tr(Aλ )/nˇ, (4.11)

√ √ > −1 > where Aλ = WX X WX + Sλ X W is the influence matrix, andn ˇ = 2 × n (Craven and Wahba, 1978). Equation (4.11) can be minimised using a method developed by Wood (2004), which is based on Newton’s method and can evaluate the components of V (λ ) and their first and second order derivatives.

[a] [a] In the derivation above, the Wi can be approximated by the Mi matrix de- fined as  [a] [a] [a] [a]  ∂`(θ )i ∂`(θ )i ∂`(θ )i ∂`(θ )i    ∂η12.i ∂η12.i ∂η12.i ∂η23.i   , (4.12)  [a] [a] [a] [a]  ∂`(θ )i ∂`(θ )i ∂`(θ )i ∂`(θ )i  ∂η23.i ∂η12.i ∂η23.i ∂η23.i for i = 1,...,n.

The above penalised maximum likelihood estimation for multi-state models is based on the pseudo-data z = (W)−1d + Xθ . The main limitation of this method is that the weight matrix W must be positive definite. For highly flexible models, not all the n weight matrices contained in W = diag(W1,...,Wn) need to be positive definite. Also, the matrix W is as large as there are data and transitions in a multi- state process. Then computational time and stability become serious issues for most applications.

Marra et al. (2017) propose a smoothing parameter estimation based on a parametrisation of z that uses H and g as a whole instead of the n components that make them up. For some applications, the Hessian H may not be positive definite but these would occur considerably less frequently than when working with the n weight matrices that make it up. 4.2. Penalised maximum likelihood estimation 85

We next present a penalised maximum likelihood estimation for multi-state models with splines, that is based on the framework developed by Marra et al. (2017).

4.2 Penalised maximum likelihood estimation

This section presents an automatic and efficient method to estimate multi-state mod- els with splines in the presence of interval censoring. We recall model specification and the piecewise-constant approximation to the time-dependency. The method presented in Marra et al. (2017) is then adapted for multi-state models with splines.

4.2.1 Model representation

Time-dependent models can be defined by using a proportional hazards model for transition r to s, r 6= s

>  qrs(t) = qrs.0(t)exp β rsx , (4.13)

> where qrs.0(t) is an unspecified baseline hazard function, x = (x1,...,xp) is a co- > variate vector and β rs = (βrs.1,...,βrs.p) is a vector of unknown parameters. The nonparametric specification of qrs.0(t) with splines is given by

Krs ! qrs.0(t) = exp ∑ αrs.kBk(t) , (4.14) k=1 where Krs is the number of spline base functions Bk(t), and αrs.k ∈ R are regression coefficients.

4.2.2 Piecewise-constant hazards

Similarly to Chapter 3, time-dependency of the hazards are taken into account by using a piecewise-constant approximation. If the follow-up times vary across indi- viduals, the individual-specific follow-up times can be used to define the piecewise- constant approximation for the individual likelihood contributions. This implies that  a transition probability such P Yj = y j|Yj−1 = y j−1 is derived by using Q(t j−1) to estimate P(t j−1,t j) by exp((t j − t j−1)Q(t j−1)). It is also possible to impose a 4.2. Penalised maximum likelihood estimation 86

fixed grid to the piecewise-constant approximation as described in Van den Hout and Matthews (2008a). For most applications, both methods lead to similar results and the method described in this Section is preferable as it is less computationally extensive (Van den Hout, 2017).

4.2.3 Penalised log-likelihood function For each hazard, let the number of splines basis dimension be large enough to al- > low for flexible modelling. Let θ = (θ1,...,θq) be the full set of parameters and `(θ ) be the logarithm of the likelihood function. The smoothness of the model is controlled by adding a smoothness penalty to the log-likelihood function. The penalised log-likelihood function is

1 ` (θ ) = `(θ ) − θ >S θ , (4.15) p 2 λ where Sλ is the penalty matrix. This is a block diagonal matrix with blocks λrsSrs for penalising splines parameters of transition r to s, where Srs is a matrix of known coefficients.

4.2.4 Parameter estimation Given a piecewise-constant approximation to the time-dependency in the hazard model (4.13), a scoring algorithm can be used to maximise the penalised log- likelihood function (4.15), see Section 3.3.2. For a given multi-state model, if more than one hazard is specified with splines, then estimation of λ by direct grid search can be computationally burdensome. As discussed in Section 4.1, there are methods available for automatic smooth- ing parameters estimation within the penalised likelihood framework (Wood, 2006; Radice et al., 2016). For their method, the derivatives of the penalised log-likelihood function have to be split into the derivatives with relation to the linear predictors, and the derivatives of the linear predictor with relation to the model parameters. The direct use of their methods in multi-state models leads to large sparse matrices that are difficult to deal with. In this section, we present the method for automatic smoothing developed by 4.2. Penalised maximum likelihood estimation 87

Marra et al. (2017), which uses the gradient and the Hessian (or Fisher information matrix) as a whole instead of components that make them up. The method consists of two parts. First, given a value for the smoothing parameters, we aim to find an estimate of the model parameters. Second, we use such an estimate to find an update for the smoothing parameters. We next describe how to perform the first part of the method.

[a] [a] [a] [a] [a] Let gp = g − Sλ θ and H p = H − Sλ represent the penalised gradi- ent and negative of the penalised Hessian matrix at iteration a, respectively, where [a] [a] 2 > g = `(θ )/ θ | [a] = `(θ )/ θ θ | [a] ∂ θ ∂θθ θ =θ and H ∂ θ ∂θθ∂θθ θ =θ . A first-order Taylor [a+1] [a] expansion of gp about the current fit θ is given by

[a+1] [a] [a] [a+1] [a] gp ≈ gp + H p (θ − θ ), (4.16) where g[a+1] = g[a] − S θ [a] and [a] = [a] − S . Let us define [a] = − [a]. p λb H p H λb I H A new fit θ [a+1] is obtained by taking the right-hand side of equation (4.16) to be zero

  0 = g[a] + − [a] − S (θ [a+1] − θ [a]) p I λb   g[a] = [a] + S (θ [a+1] − θ [a]) p I λb   g[a] − S θ [a] = [a] + S θ [a+1] − [a]θ [a] − S θ [a] λb I λb I λb   [a] + S θ [a+1] = g[a] + [a]θ [a] I λb I  −1 p p p −1  θ [a+1] = [a] + S [a] [a]θ [a] + [a] g[a] . I λb I I I

Therefore, for fixed value of λb the new fit for the parameter estimator can be ex- pressed as  −1 p θ [a+1] = [a] + S [a]z[a], (4.17) I λb I p p −1 where z[a] = I [a]θ [a] + ε [a] is a q × 1 vector, with ε [a] = I [a] g[a] also a q × 1 vector. The quantities z is called the pseudo-data, which plays the role of the response vector for GAM. 4.2. Penalised maximum likelihood estimation 88

This parametrisation of the model-parameter estimators allows for a well founded formulation of the smoothing parameter selection that is presented in Sec- tion 4.2.5 (Marra et al., 2017). As discussed in Kalbfleisch and Lawless (1985), calculating the second derivatives of the probability matrix can be difficult to calcu- late. We use the approximation to the Fisher information matrix that involves only the first order derivatives of the penalised log-likelihood function as in (2.23).

4.2.5 Smoothing parameters estimation

The penalised maximum likelihood approach described in Section 4.2.4 can only estimate model parameters, θ , for fixed vector of smoothing parameters, λ . In this Section, we described the automatic smoothing parameter selection presented in Marra et al. (2017).

From likelihood theory, ε ∼ N (0,I) and z ∼ N (µ z,I), where I is the identity √ matrix, µ z = I θ and θ is the true parameter vector. The predicted value vector √ √ √ for z is µ = θ = A z, where A = ( + S )−1 . The smoothing bz I b λb λb I I λb I parameter vector is estimated to minimise

  ||µ − µ ||2 = ||(z − ε) − A z||2 E z cz E λb   = ||(z − A z) − ε||2 E λb   = ||z − A z||2 + ||ε||2 − 2ε >(z − A z) E λb λb   = ||z − A z||2 + ε >ε − 2ε >(µ + ε) + 2ε >A (µ + ε) E λb z λb z   = ||z − A z||2 − ε >ε − 2ε >µ + 2ε >A µ + 2ε >A ε . E λb z λb z λb

Notice that (ε >ε) = ( ε2) = q, (ε >µ ) = (ε >)µ = 0 and (ε >A µ ) = E E ∑ i E z E z E λb z (ε >)A µ = 0. Using that ε >A ε is a scalar, and that a scalar is its own trace, E λb z λb we obtain that

(tr(ε >A ε)) = (tr(A εεεε >)) = tr(A (εεεε >)) = tr(A I) = tr(A ). E λb E λb λb E λb λb 4.2. Penalised maximum likelihood estimation 89

These calculations can also be found in Wood (2006). Hence, it follows that

  ||µ − µ ||2 = ||z − A z||2 − q + 2tr(A ) (4.18) E z cz E λb λb where q is the number of parameters. The derivation of (4.18) is presented with a minor mistake in Marra et al. (2017), where the constant q does not represent the total number of parameters. Calculating the expectation on the right hand side of (4.18) is not straight- forward. In practice, λ is estimated by minimising the Un-Biased Risk Estimator (UBRE; Craven and Wahba (1978))

2 V (λ ) = ||z − Aλ z|| − q + 2tr(Aλ ). (4.19)

Equation (4.19) can be minimised using the automatic smoothing parameter selec- tion method developed by (Wood, 2004) or by using a general-purpose optimiser.

4.2.6 Summary of the algorithm The methods described in Sections 4.2.4 and 4.2.5 are iterated until the parameter [a+1] [a] estimator satisfies max1≤k≤q |θ −θ | < δ for a suitable small positive value δ (Radice et al., 2016). The two steps of the algorithm are as follows:

Step 1: For fixed smoothing parameters λ [a], find an estimate of θ :

[a+1] θ = argmax `p(θ). θ

Step 2: Given the estimate θ [a+1], find an estimate of λ using equation (4.19):

[a+1] λ = argmin V (λ ). λ

4.2.7 Confidence intervals The distribution of the penalised maximum likelihood estimator can be used to con- struct confidence intervals for the estimate θb and functions of them, such as the 4.3. Simulation study 90 hazards and probability matrix (Wood, 2006). Let Vθ represent the covariance ma- trix of θb at convergence. From large sample theory, samples of the estimate θb can be drawn from N(θb,Vθ ). Confidence intervals for functions of the model parame- ters can be constructed as follows:

Step 1: Draw b vectors from N(θb,Vθ ).

Step 2: Calculate the value of the function of interest at each simulated value.

Step 3: Using the simulated values of the function, calculate the lower (ς/2) and upper (1 − ς), quantiles.

The parameter ς is usually set to 0.05. In this thesis, we approximate the covariance matrix Vθ by the inverse of the matrix Mp as defined in (3.22). The standard errors reflect the choice of theta, but do not account for the way theta was chosen, i.e., the same standard error would occur if theta were taken to be fixed and known from the outset.

4.3 Simulation study

We perform a small simulation study to analyse the performance of the method pre- sented in Section 4.2 for estimating multi-state models with splines. The simulation is described for an illness-death model without recovery with a log-normal distri- bution with parameters µ = 1.25 and σ = 1 for transition 1 to 2, an exponential distribution with rate exp(−2.5) for transition 1 to 3, and a Gompertz distribution with rate exp(−2.5) and shape 0.1 for transition 2 to 3.

Let Trs = Trs|u represent the time to the event s conditional on being in state r at time u > 0. If state at u is 1, then the time of transition to the next state can be obtained by taking T = min{T12,T13}. If T = T12 then, the next state is 2, otherwise if T = T13 then the next state is 3. If the state is 2, then the time of the next state is

T23. The event times T12 and T13 are simulated using the functions rgengamma() and rgompertz(), respectively, in the package flexsurv (Jackson, 2016). The tran- sition times T23 can be simulated by sampling from uniform distribution and using the inversion method as described in Section 1.5. 4.3. Simulation study 91 Hazards 1 to 2 0.0 0.2 0.4 0.6 0 5 10 15 Time (years) Hazards 1 to 3 0.00 0.10 0.20 0.30 0 5 10 15 Time (years) Hazards 2 to 3 0.0 0.2 0.4 0 5 10 15 Time (years) Figure 4.2: Simulation study: true (black lines), estimated (grey lines) and mean estimated (red lines) hazards for the illness-death model for 100 replications

We perform R = 100 replications. The sample size is N = 200 individuals. The time scale is years since baseline, i.e., time since the beginning of the study. The follow-up times are yearly and the length of the study is 15 years. This leads to interval-censored transition times for transitions 1 to 2 and known time of transitions into the dead state.

The package mgcv in R (Wood, 2007) is used to set the design and penalty matrices. The number of knots for each hazard is K = 10, hence the model has a total of 30 parameters. We use cubic regression splines, in which case the knots are placed using the percentiles of the observation times. Therefore, knots placement is different for every sample. The multi-state model with splines is then estimated using the procedure described in Section 4.2. 4.4. Applications 92

Figure 4.2 illustrates the true (black lines), estimated (grey lines) and mean estimated (red lines) hazards for the illness-death model for 100 replications. Some estimated hazards seem to over- or under-estimated the true hazards; however, the mean estimated hazards are very close to the true hazard curves. The discrepancy between the hazards towards the end of the study is due to scarcity of data after 10 years. Also, the bias for hazard for transition 1 to 2 at early follow-up times is due to the lack of transitions from state 1 to state 2 in the first year of the study. Overall, the method satisfactorily estimates the nonlinear trend underlying the hazard for transition 1 to 2.

Table 4.1 presents the results of the simulation in terms of transition probabili- ties. For each defined time interval, it shows the transition probabilities for the true model, the mean estimated transition probabilities, the bias and the estimated stan- dard errors (eSE) (Burton et al., 2006). The results show that the multi-state model with splines can estimate well transition probabilities across all time intervals.

The findings from the simulation results are twofold. First, they indicate that the proposed method is able to estimate nonlinear hazards in the presence of interval censoring. Second, they show that the piecewise-constant approximation to the transition probabilities provides satisfactory results, as we are able to recover the true curves and transition probabilities. Of course, the small number of replications prevent us from being able to assess the coverage of confidence intervals; however, the simulation results shows the good performance of the method with relation to recovering the true model.

4.4 Applications

The methods presented in this chapter are illustrated with applications to the OCTO and CAV data sets. In what follows, model estimation is undertaken using the scor- > ing algorithm summarised in Section 4.2.6. Let θ = (θ1,...,θq) be the vector with model parameters, where q depends on the application. The convergence criterion [a] [a+1] −6 for the algorithm is to stop at iteration a + 1 when max1≤k≤q |θk − θk | < 10 . We use penalised cubic regression splines for the basis function. The knots are 4.4. Applications 93

Table 4.1: Simulation study to investigate the performance of the multi-state models with splines for modelling time-dependent processes. Mean, bias and estimated stan- dard errors (eSE) for R = 100 replications. Absolute bias less than x is denoted by dxe

Transition probabilities True Mean Bias eSE p11(0,5) 0.241 0.232 -0.090 0.03 p12(0,5) 0.387 0.402 0.015 0.033 p13(0,5) 0.372 0.366 -0.006 0.028 p22(0,5) 0.589 0.611 0.022 0.057 p23(0,5) 0.411 0.389 -0.022 0.057 p11(0,10) 0.065 0.065 d0.001e 0.015 p12(0,10) 0.232 0.239 0.008 0.023 p13(0,10) 0.703 0.695 -0.008 0.028 p22(0,10) 0.246 0.263 0.018 0.037 p23(0,10) 0.754 0.737 -0.018 0.037 p11(5,10) 0.269 0.281 -0.012 0.052 p12(5,10) 0.291 0.284 0.007 0.037 p13(5,10) 0.440 0.435 -0.005 0.052 p22(5,10) 0.417 0.430 0.013 0.037 p23(5,10) 0.583 0.570 -0.013 0.037 placed considering the percentiles of the observation times. This means that more knots are placed where data are plentiful and fewer knots where data are scarce. This is a key factor for fitting multi-state models with splines. Because multi-state data can become scarce close to the end of study, there might not be enough information to estimate some basis coefficients. Fitting multi-state models with P-splines might not be possible for some applications as it requires the knots to be equally spaced. In this case, some knots can be placed where there is no data. The design and penalty matrices are set up using the package mgcv in R. The smoothing parameters are estimated using the general-purpose optimiser optim in R.

4.4.1 Origins of variance in the oldest-old data

We showed in Section 2.8.1 that a Gompertz hazards specification seems to be rea- sonable for the OCTO data. In this section, these data are analysed with a multi- model with splines. The multi-state process is illustrated in Figure 4.3. We aim to check the suitability of the parametric hazards specification in Section 2.8.1, as 4.4. Applications 94

State 1 State 2 State 3 No cognitive Mild cognitive Severe cognitive impairment impairment impairment

Dead

Figure 4.3: Four-state model for longitudinal data in OCTO well as verify whether the modelling can be further improved with nonparametric hazards.

Recall that for the OCTO data, the mean length of follow-up times is 1.994 years with standard deviation of 1.098 and median 1.986. Assuming that change in transition intensities in relation to the frequency of observation can be assessed in intervals of approximately 2 years, we can use the data to define the grid for the piecewise-constant approximation.

Let t represent age minus 80. Because time of death is known, rather than being interval censored, the likelihood contribution of individuals observed in state r < 4 ∗ 3 ∗ ∗ at time t and dead at time t > t are given by ∑s=1 P(Y(t ) = s|Y(t) = r)qs4(t ). Similarly to the application in Section 2.8.1, the proportional hazard model with splines and dependence on the covariate sex is given by

10 ! qrs(t) = exp ∑ αrs.kBk(t) + βrssex , (4.20) k=1 where (r,s) ∈ {(1,2),(1,4),(2,1),(2,3),(2,4),(3,4)}, Bk(t) are cubic regression splines, and sex is 0/1 for men/women. For the transition between the living states, d the constraints on the coefficients for sex are β12.1 = β23.1 = βL, except for transition from 2 to 1 with β21.1 = 0. For the transitions into the dead state, the constraints are d β14.1 = β24.1 = β34.1 = βD. As indicated in (4.20), the hazards are modelled with 10 knots each, hence the total number of parameters is 62. The resulting model has AIC = 5340.946, −2`(θ ) = 5306.396 , and d f = 17.275. The comparison with the parametric model in (2.27), which has AIC = 5342.837, −2`(θ ) = 5314.837, and 14 parameters, indi- 4.4. Applications 95

Frequency 0 100 300 500 0 5 10 15 20 25 t

Figure 4.4: Histogram of age transformed by minus 80 in the OCTO data cates that the nonparametric hazards specification in (4.20) leads to a more flexible model, which slightly improves model fitting.

The estimated smooth hazards for women (solid lines) and 95% confidence intervals (dashed lines) are presented in Figure 4.5. For transition from state 1 to 2, the hazard is increasing and has a nonlinear shape. For transition from state 2 to 1, the hazard slightly increases up to approximately 5 years, but decreases af- terwards. The shapes of these hazards can be difficult to model with parametric specifications. For the other transitions, the risks of moving across states are in- creasing throughout the length of the study, with fairly log-linear shapes that can be approached with parametric models. The confidence intervals are fairly wide after approximately 15 years. That is because data become scarce after 15 years and our smooth non-parametric hazard model is a strictly local approach. Figure 2.1 illustrates the histogram of age minus 80. The vector of smoothing parameters is estimated at λb = (327.57, 640.20, 216.54, 3935634.85, 500128.59, 2118.19)>.

The covariate effects are estimated at βcL = −0.323 (0.100) and βcD = −0.338 (0.092), indicating that being a woman decrease the risks of moving across 4.4. Applications 96 1.2 1.2 0.8 0.8 0.4 0.4 Estimated hazard 1 to 2 Estimated hazard 1 to 4 0.0 0.0 0 5 10 15 20 25 0 5 10 15 20 25

Age − 80 Age − 80 1.2 1.2 0.8 0.8 0.4 0.4 Estimated hazard 2 to 1 Estimated hazard 2 to 3 0.0 0.0 0 5 10 15 20 25 0 5 10 15 20 25

Age − 80 Age − 80 1.2 1.2 0.8 0.8 0.4 0.4 Estimated hazard 2 to 4 Estimated hazard 3 to 4 0.0 0.0 0 5 10 15 20 25 0 5 10 15 20 25

Age − 80 Age − 80

Figure 4.5: Estimated smooth hazards (solid lines) for women, with 95% confidence inter- vals (dashed lines) cognitive states and moving into the dead state, respectively. Notice that for the parametric model in (2.27), the effect of the covariate sex for living states and into the dead state are estimated at −0.325 and −0.334, respectively. Therefore, the baseline hazard specification does not seem to influence the covariate estimates. Figure 3.6 depicts baseline-specific survival as estimated by the model and as described by the Kaplan-Meier curves. Individual survival curves (in grey) are shifted to the years since baseline so that we can compare them and their mean to the Kaplan-Meier curve. This is necessary because individuals have different ages at baseline. For survival given baseline state 3, there is small discrepancy between model-based mean survival and the Kaplan-Meier curve, but overall the fit seems to 4.4. Applications 97 1.0 0.8 0.6 0.4 0.2 Survival from 1 (N=394) 0.0

0 2 4 6 8 10 1.0 0.8 0.6 0.4 0.2 Survival from 2 (N=146) 0.0

0 2 4 6 8 10 1.0 0.8 0.6 0.4 0.2 Survival from 3 (N=152) 0.0

0 2 4 6 8 10 Years since baseline

Figure 4.6: Comparison of model-based survival from states 1, 2, and 3 with Kaplan-Meier curves. Model-based survival: grey lines for individuals, smooth blue lines for the mean of the individual survival curves. Kaplan-Meier in black lines with 95% confidence intervals. Frequencies for baseline state along vertical axes

be good. Although this is not a proper goodness-of-fit test, the comparison shows that the model is able to capture the attrition due to death during the follow-up. Furthermore, the predictions for the nonparametric model (4.20) are better than the ones obtained for model (2.27), see Figure 2.3 . 4.4. Applications 98

Health CAV

Dead

Figure 4.7: Illness-death without recovery model for disease progression after transplant

4.4.2 Cardiac allograft vasculopathy data

As illustrated in Section 2.8.2, Gompertz hazards specification for analysing the CAV data results in poor model fitting. In this section, we analyse these data with nonparametric hazards models. The multi-state process is illustrated in Figure 4.7. We aim to verify whether flexible model specifications with splines can improve model fitting. As discussed in Section 2.8.2, the mean length of follow-up times is 1.622 years with standard deviation of 0.972 and median 1.258. Assuming that change in transition intensities in relation to the frequency of observation can be assessed in intervals of approximately 1 year, we can use the data to define the grid for the piecewise-constant approximation. Let t represent time since baseline. Since time of death is known within one day, rather than being interval censored, the likelihood contribution of indi- viduals observed in state r < 3 at time t and dead at time t∗ > t are given by 2 ∗ ∗ ∑s=1 P(Y(t ) = s|Y(t) = r)qs3(t ). The proportional hazard model with splines is specified with dependence on donor age (dage) and primary diagnosis of ischaemic heart disease (IHD):

7 ! qrs(t) = exp ∑ αrs.kBk(t) + β1dage + β2IHD , (4.21) k=1 where (r,s) ∈ {(1,2),(1,3),(2,3)} and Bk(t) are cubic regression splines. As indicated in (4.21), the hazards are modelled with 7 knots each, hence the total number of parameters is 23. On a computer with 30 cores and using parallel computing (Analytics and Weston, 2014a,b) , model 4.21 takes less than 22 minutes to be estimated. 4.4. Applications 99 700 600 500 400 Frequency 300 200 100 0

0 5 10 15 Time since transplant (years)

Figure 4.8: Histogram of time since transplant in the CAV data

> The vector of smoothing parameters is λ = (λ12,λ13,λ23) . This model has AIC = 2931.715, −2`(θ ) = 2903.483, and d f = 13.846. Compared to model (2.28), which has AIC = 2932.953, −2`(θ ) = 2924.953, and 8 parameters, model (4.21) is more complex, considerably improves the likelihood, and slightly im- proves the AIC.

The estimated smooth hazards for subjects with IHD and donor age of 26 (solid lines) and 95% confidence intervals (dashed lines) are presented in Figure 4.9. The vector of smoothing parameters is estimated at λb = (14.292,55.036,743629)>. The risk of moving from state 1 (healthy) to state 2 (CAV) increases until approximately 8 years after transplant, but decreases afterwards. The risk of going from state 1 to state 3 (dead) is very low and almost constant until approximately 10 years since transplant, but increases pretty steep afterwards. The transition intensity from state 2 to state 3 is quite volatile and upwards until 10 years after transplant and decreasing afterwards. The confidence intervals are fairly wide after approximately 10 years. That is because data become scarce after that time.

For the parametric part of the model, βb1 = 0.014 (0.004) and βb2 = 0.2 (0.096) indicating that donor age and IHD increase the risks of disease progression and 4.4. Applications 100 0.5 0.4 0.3 0.2 Hazard 1 to 2 0.1 0.0 0 5 10 15 Time since transplant (years) 0.8 0.6 0.4 Hazard 1 to 3 0.2 0.0 0 5 10 15 Time since transplant (years) 0.8 0.6 0.4 Hazard 2 to 3 0.2

0 5 10 15 Time since transplant (years)

Figure 4.9: Estimated smooth hazards for subjects with IHD and with donor age of 26 (solid lines), with 95% confidence intervals (dashed lines) death. The model (2.28) has very similar estimates for covariates dage and IHD, 0.018 and 0.277, respectively. Hence, as in the application to the CAV data, the baseline hazard specification is robust with relation to specification of time- dependency.

Although estimated hazards gives insightful information about the risks of moving across states, interpretation is more straightforward when transition proba- bilities are considered. For subject with IHD and with donor age of 26, the five-year transition probabilities are estimated at

  0.502 (0.446,0.556) 0.318 (0.274,0.365) 0.180 (0.149,0.212)   Pb(0,5) =  0 0.699 (0.616,0.773) 0.301 (0.227,0.384) ,   0 0 1 with 95% confidence interval (in brackets) obtained using b = 1000 simulations as 4.5. Discussion 101 1.0 1.0 0.8 0.8 0.6 0.6 Survival Survival 0.4 0.4 0.2 0.2 0.0 0.0 0 5 10 15 0 5 10 15 Time since transplant (years) Time since transplant (years)

Figure 4.10: Comparison of model-based survival with Kaplan-Meier curves for the Gom- pertz model (left-hand side) and spline model (right-hand side). Model-based survival: grey lines for individuals, blue lines for the mean of the individual curves. Kaplan-Meier in black lines with 95% confidence intervals in Section 2.5. A transition probability can be interpreted as follows. A subject with IHD and donor age of 26 has a 29% chance of being in the CAV five years later. Model validation for multi-state models can be carried out by comparing model prediction of the entry time into the dead state with the Kaplan-Meier curve esti- mates (Titman and Sharples, 2010). Figure 4.10 depicts baseline-specific survival as estimated by the models (2.28) and (4.21) (on the left and right hand side, re- spectively) and as described by the Kaplan-Meier curves. For the Gompertz model in (2.28), the fit is reasonably good up to 10 years, but after that the model fails to predict survival. The multi-state model with splines predicts the survival reasonably accurately throughout the years.

4.5 Discussion

This chapter presents a practical and unifying framework for estimating multi-state models with splines for interval-censored data. This novel methodology improves both accuracy of the parameter estimates and computational speed, if compared to grid search methods such as the one presented in Chapter 3. The new estimation procedure is made possible by rewriting the optimisation problem using a penalised general likelihood estimation (Marra et al., 2017). 4.5. Discussion 102

The method is illustrated with two applications. We aim to illustrate the feasi- bility of the method and its usage for flexible time-dependent modelling. In partic- ular, the application to the OCTO data shows that the method is able to efficiently estimate the hazard functions for a complex multi-state processes with a backward transition. The application to the CAV data illustrates the flexibility of the method to model nonlinear time dependency. The small simulation study and application show the importance of this method for flexible modelling of time-dependent processes. Even though an assessment of bias is not possible due to the small number of replications, the simulation study shows that the method can recover nonlinear, log-liner and linear hazards. Further- more, we illustrate with an application that the method can give insightful informa- tion on the functional form underlying the hazards. The automatic smoothing parameters estimation as described in Marra et al. (2017) requires the Hessian or the Fisher information for estimation. We show through a simulation and applications that an approximation to the Fisher informa- tion matrix, which only uses the first order derivatives of the log-likelihood, per- forms well on estimation. This is relevant for interval-censored data as calculating the second derivatives of the transition probabilities can be intractable. The msm package (Jackson, 2011) is designed to model time-homogeneous multi-state models. However, it is possible to fit some time-dependent models, such as Gompertz and splines (without penalties) models. In this case, time-dependency is also approached by using a piecewise-constant approximation to the hazards. Therefore, this research can also be seen as a generalisation of the msm package, which allows for flexible modelling of the time-dependency. The Gompertz hazards specification is common in many applications due to its simplicity and straightforward use with the msm package. We show through a model validation method that such restrictive model specifications can lead to poor model fit. As shown in Figure 4.10, the multi-state model with splines can improve considerably model fit by allowing for flexible hazards specification. Chapter 5

Conclusions and future work

Multi-state models are commonly used in medical research where patients’ health status over time is of interest. In the presence of interval censoring, models can be formulated in a Markov processes framework. Models are specified through propor- tional hazards functions. For time-dependent processes, deciding on a reasonable shape underlying the hazards is not straightforward. Semi-parametric hazards spec- ifications, which allow for flexible modelling of time dependency, are alternatives to parametric models. However, estimating multi-state models with splines is chal- lenging as the problem involves multiple smoothing parameters selection.

This thesis presented novel penalised maximum likelihood methods to estimate multi-state models with splines for interval-censored data. In particular, we have fo- cused on estimating semi-parametric hazards functions with splines. Special atten- tion has been given to cubic regression splines and P-splines, but in principle any other splines basis could have been employed. The principal contribution of this thesis has been to develop an unifying and efficient framework to estimate semi- parametric multi-state models. This new method could be used to model nonlinear time-dependency in multi-state processes or to check parametric model assump- tions.

In Chapter 1, we have discussed various multi-state processes and some of their special features, such as model structure and censoring. We also provided a litera- ture review of the existing methods to analyse these processes. We concluded that current methods cannot fully address the problem of estimating multi-state models 104 with splines.

Chapter 2 then explored a standard method for fitting parametric multi-state models. This method uses a scoring algorithm to maximise the log-likelihood func- tion, for a given piecewise-constant approximation to the hazards. To assess model fit, we used the comparison of the overall model survival function with the Kaplan- Meier empirical estimates. The application of these methods to the OCTO data showed that Gompertz hazards specification seems to lead to a reasonably good model fit. However, such restrictive model assumptions for analysing the CAV data resulted in lack of fit. We then concluded that a method to estimate more flexible multi-state models should be developed to deal with time-dependency.

For this purpose, in Chapter 3, we developed a penalised maximum likeli- hood method to estimate flexible multi-state models with splines. Specifically, the method allows for parametric and semi-parametric hazards specifications. We showed that a piecewise-constant approximation can be used to calculate the tran- sition probabilities for the likelihood function, and then a scoring algorithm can be used to optimise the penalised likelihood function. The main contribution of this method has been to provide a general framework to specify and estimate multi-state models with splines. We then applied the method to the ELSA data to investigate the decline of cognitive function in older population. Even though time of death is is rounded to the nearest integer, we assumed that time of transition into the dead state is known exactly. This application showed that the risks of moving forward across cognitive states are increasing. The method, however, is based on a grid search for estimating the optimal amount of smoothing. Such approach can become impractical for applications where several hazards are modelled with splines.

In Chapter 4, we extended the method presented in Chapter 3 by developing an efficient algorithm based on penalised maximum likelihood and automatic search of the smoothing parameters to estimate multi-state models for interval-censored data. This method improves both estimation time and, by providing more precise smoothing parameter estimates, accuracy of the model parameter estimates. The method is a general tool that, in principle, allows for flexible modelling of multi- 105 state processes with any structure, and it is the main contribution of this thesis.

The method has been illustrated with analysis of two data sets and a simulation study. The application to the OCTO data showed that the method can elegantly es- timate the smooth hazards of a considerably complicated process with a backward transition. This allowed us to verify that a Gompertz hazard specification can be a reasonable model assumption. In relation to the application to the CAV data, the estimated smooth hazards showed that the risks of moving between states follow a nonlinear trend over time. We verified through a model diagnostic that such flexi- ble hazards specification with splines improves the model fitting for the CAV data. Therefore, the method enabled us to improve the knowledge of the process of inter- est. Finally, the simulation study allowed us to verify that the method can recover non-linear and linear hazards functions.

The piecewise-constant approximation provides a general framework to cal- culate transition probabilities of any multi-state process. However, the method re- quires a much greater number of numerical evaluations as computation of eigenval- ues decomposition is required for all pair of observation times. Alternatively, the method presented in Titman (2011) could also be used to calculate transition proba- bilities. Theoretical properties such as conditions for consistency and identifiability should be studied.

The estimation time is improved by using parallel computing to calculate the score vector and Fisher information matrix. The time to estimate model (4.21) with a computer with 30 cores is just below 22 minutes. Computational time could be further improved by using the package in Rcpp (Eddelbuettel et al., 2011).

To conclude, the methods presented in this thesis can be further extended to es- timate multi-state models for different data structures. One possible extension is to allow for misclassification of states (Jackson et al., 2003). This poses extra difficulty for estimation as computation of the gradient or information is more challenging (Lystig and Hughes, 2002). For processes with time-varying covariates, the hazards are specified in terms of functions of the covariates. Similarly to the specification of the baseline hazards, the functional forms underlying time-varying covariates are 106 often unknown. Therefore, these functions could be specified by smooth functions. Semiparametric hazards specification with splines of the covariate effects can be useful to provide insightful information about the process of interest. The work by Joly and Commenges (1999) on penalised semi-Markov models could be extended in two ways. First, an automatic smoothing parameter selection method can be used to improve precision of parameter estimation. Second, the method could be extended for penalized semi-Markov models with backward transitions. Appendix A

Code for the R software

The method developed in this thesis are implemented in the software R (R Core Team, 2016). In this appendix we provide the code for fitting the three-state model for the CAV data, as defined in Section 4.4.2.

#It is necessary to load the following libraries library(msm) library(splines) library(mgcv) library(flexsurv)

#The CAV data is assigned to the variable ‘‘Data’’. #The following is the CAV data for an individual.

PTNUM years state dage pdiag 1 100002 0.000000 1 21 1 2 100002 1.002740 1 21 1 3 100002 2.002740 2 21 1 4 100002 3.093151 2 21 1 5 100002 4.000000 2 21 1 6 100002 4.997260 2 21 1 7 100002 5.854795 3 21 1

# The following construct the smoothing basis. # The argument ‘‘bs’’ represent spline basis. #In this case, ‘‘cr’’ stands for cubic regression splines. smoother = smooth.construct(s(years, bs = "cr"), Data, NULL)

#Get the design matrix

X = smoother$X 108

#Get the dimension of the smoother n.dim = smoother$bs.dim

#Get the penalty matrix pen = output <- matrix(unlist(smoother$S), ncol = n.dim, byrow = TRUE)

#Define the number of parameter for each transition and #the number of covariates. npar12 = n.dim npar13 = n.dim npar23 = n.dim ncov = 2

#This is useful for coding n1 = npar12 n2 = n1 + npar13 n3 = n2 + npar23 n4 = n3 + ncov npar = n4

#Vector of initial values par.0 = c(rep(-3, n3), rep(0,2))

#number of states and dead state s.number = 3 dead.state = 3

#Split the data into individuals n = split(Data, Data$PTNUM) N = length(n)

#Define elements for the scoring iter=1 max.iter=100 diff= rep(Inf,length(par)) digits = 3 count = 0 Max = 1000 abstol = 1e-6 gamma = c(2, 2, 2) sp = exp(gamma)

#The next function is the generator matrix and its eigen decomposition 109

#Data.Ind.i is the data for ith individual #X.i smoother for ith individual #j is the jth observation

Q.matrix = function(par, Data.Ind.i, j, X.i){ dage = Data.Ind.i[j-1,4] pdiag = Data.Ind.i[j-1,5] Q = matrix(0,s.number,s.number) Q[1,2] = exp(X.i[j-1,]%*%par[1:n1] + par[n3+1]*dage + par[n3+2]*pdiag) Q[1,3] = exp(X.i[j-1,]%*%par[(n1+1):n2] + par[n3+1]*dage + par[n3+2]*pdiag) Q[1,1] = -(Q[1,2] + Q[1,3]) Q[2,3] = exp(X.i[j-1,]%*%par[(n2+1):n3] + par[n3+1]*dage + par[n3+2]*pdiag) Q[2,2] = -Q[2,3] r = eigen(Q, symmetric = FALSE) A = r$vectors d = r$values inv.A = solve(A) Q.result = list("A" = A, "d" = d, "inv" = inv.A, "Q" = Q) return(Q.result) } #Function for calculating the probability transition for a follow-up P = function(j, Data.Ind.i, A, d, inv.A){ t = exp(d*(Data.Ind.i[j,2] - Data.Ind.i[j-1,2])) P = A%*%diag(t)%*%inv.A return(P) } #Function that returns a likelihood contribution for a #interval-censored observation time Pt = function(j, Data.Ind.i, Q){ r = eigen(Q, symmetric = FALSE) A = r$vectors x = r$values t = exp(x*(Data.Ind.i[j,2] - Data.Ind.i[j-1,2])) pt.aux = A%*%diag(t)%*%solve(A) pt = pt.aux[Data.Ind.i[j-1,3], Data.Ind.i[j,3]] return(pt) } #Function that returns a likelihood contribution for #death times

Ps = function(s, Data.Ind.i, Q){ m = nrow(Data.Ind.i) r = eigen(Q, symmetric = FALSE) 110

A = r$vectors x = r$values t = exp(x*(Data.Ind.i[m,2] - Data.Ind.i[m-1,2])) ps.aux = A%*%diag(t)%*%solve(A) ps = ps.aux[Data.Ind.i[m-1,3], s]*Q[s,s.number] return(ps) } #Function to calculate the trace of a function trace = function(A){ s = 0 for (i in 1:(nrow(A))){ s = s + A[i,i] } return(s) } #Matrix square root square.root = function(S){ d = eigen(S, symmetric = TRUE) rS = d$vectors%*%diag(d$valuesˆ0.5)%*%t(d$vectors) } #Derivative of the Q matrix deriv.Q = function(par,Data.Ind.i, j, u, Q, X.i){ dage = Data.Ind.i[j-1,4] pdiag = Data.Ind.i[j-1,5] dQ = matrix(0, s.number, s.number) if(u<=n1){ dQ[1,2] = X.i[j-1,u]*Q[1,2] dQ[1,1] = -dQ[1,2] } if(u>n1 & u<=n2){ dQ[1,3] = X.i[j-1,u-n1]*Q[1,3] dQ[1,1] = -dQ[1,3] } if(u>n2 & u<=n3){ dQ[2,3] = X.i[j-1,u-n2]*Q[2,3] dQ[2,2] = -dQ[2,3] } if(u==n3+1){ dQ[1,2] = dage*Q[1,2] dQ[1,3] = dage*Q[1,3] dQ[1,1] = -(dQ[1,2]+dQ[1,3]) dQ[2,3] = dage*Q[2,3] dQ[2,2] = -dQ[2,3]

} if(u==n3+2){ dQ[1,2] = pdiag*Q[1,2] 111

dQ[1,3] = pdiag*Q[1,3] dQ[1,1] = -(dQ[1,2]+dQ[1,3]) dQ[2,3] = pdiag*Q[2,3] dQ[2,2] = -dQ[2,3]

} return(dQ) }

#Derivative of the probability transition matrix deriv.P = function(j, Data.Ind.i, dQ.u, s.number, A, d, inv.A){ t = (Data.Ind.i[j,2] - Data.Ind.i[j-1,2]) G.u = inv.A%*%dQ.u%*%A V.u = matrix(0,s.number, s.number) for(l in 1:s.number){ for(m in 1:s.number){ if(l==m){ V.u[l,l] = G.u[l,l]*t*exp(d[l]*t) }else{ V.u[l,m] = G.u[l,m]*(exp(d[l]*t) - exp(d[m]*t))/(d[l] - d[m]) } } } dP.u = A%*%V.u%*%inv.A return(dP.u) }

#Function to calculate the gradient and the M matrix scoring.msm = function(N, par){ M = matrix(0,length(par), length(par)) S = matrix(0,length(par),1) for(u in 1:length(par)){ for(v in 1:length(par)){ if(u <= v){ aux = scoring.uv(u,v,N,par) M[v,u] = aux[[2]] M[u,v] = aux[[2]] if(u==v){ S[u] = aux[[1]] } }

} } result = list("S" = S, "M" = M) return(result) } 112 dP.u[Data.Ind.i[j-1,3], Data.Ind.i[j,3]] dP.v[Data.Ind.i[j-1,3], Data.Ind.i[j,3]] dP.u[Data.Ind.i[j-1,3], Data.Ind.i[j,3]] M.uv = M.uv* + (1/(P.aux[Data.Ind.i[j-1,3],* Data.Ind.i[j,3]])ˆ2) score.u.aux = score.u.aux + (1/P.aux[Data.Ind.i[j-1,3], Data.Ind.i[j,3]]) * denom = 0 decomp.Q = Q.matrix(par,A Data.Ind.i, = j, decomp.Q$A d X.i) = decomp.Q$d inv.A = decomp.Q$inv Q = decomp.Q$Q P.aux = P(j,dQ.u Data.Ind.i, = A, deriv.Q(par,dQ.v d, Data.Ind.i, = inv.A) j, deriv.Q(par,dP.u u, Data.Ind.i, = Q, j, deriv.P(j,dP.v X.i) v, Data.Ind.i, = Q, dQ.u, deriv.P(j,if(Data.Ind.i[j,3]!=s.number){ X.i) s.number, Data.Ind.i, A, dQ.v, d, s.number, inv.A) A, d, inv.A) }else{ Data.Ind.i = n[[i]] X.i = X[(1:nrow(Data.Ind.i)),]X #get = data X[-(1:nrow(Data.Ind.i)),]for(j associated # in to Delete 2:nrow(Data.Ind.i)){ subject data i associated to subject i M.uv = 0 score.u.aux = 0 for(i in 1:N){ #Calculates the u#and entry the of uvscoring.uv the entry = gradient of function(u, the v, M N, matrix par){ 113 Q[s, s.number] Q[s, s.number] * Q[s, s.number] * * dQ.v[s, s.number] dQ.u[s, s.number] * * num.v * num.u * + P.aux[Data.Ind.i[j-1,3], s] denom = denom + P.aux[Data.Ind.i[j-1,3], s] num.v = num.v + dP.v[Data.Ind.i[j-1,3], s] + P.aux[Data.Ind.i[j-1,3], s] num.u = num.u + dP.u[Data.Ind.i[j-1,3], s] } M.uv = M.uv + (1/denomˆ2) score.u.aux = score.u.aux + num.u/denom num.u = 0 num.v = 0 for(s in 1:(s.number-1)){ } } } result = list("score.u"return(result) = score.u.aux, "M.uv" = M.uv) } 114

#Set up the penalty matrix spl.S = function(n1, n2, n3, n4, npar,pen){ Pen = list() Pen[[1]] = Pen[[2]] = Pen[[3]] = Pen[[4]] = matrix(0, npar, npar) Pen[[1]][1:n1, 1:n1] = pen Pen[[2]][(n1+1):n2,(n1+1):n2] = pen Pen[[3]][(n2+1):n3,(n2+1):n3] = pen return(Pen) }

#Estimate the model parameter for given #smoothing parameters estimate.parameters = function(par.0, sp){ count=0 Pen = spl.S(n1, n2, n3, n4, npar,pen) Pen = sp[1]*Pen[[1]] + sp[2]*Pen[[2]] + sp[3]*Pen[[3]] Max.inner = 50 while(count < Max.inner){ aux = scoring.msm(N,par.0) S = aux[[1]] Sp = S - Pen%*%par.0 I = aux[[2]] Ip = I + Pen sqrt.I = square.root(I) inv.sqrt.I = solve(sqrt.I) e = inv.sqrt.I%*%S z = sqrt.I%*%par.0 + e inv.pen.I = solve(I + Pen) par.h = inv.pen.I%*%sqrt.I%*%z count = count + 1 if(max(abs(par.0 - par.h)) < abstol) break par.0 = par.h } est.parameters.result = list("par" = par.h, "I" = I, "Ip"=Ip, "inv.pen.I" = inv.pen.I, "sqrt.I" = sqrt.I, "z" = z, "count"=count) return(est.parameters.result) } 115

#Set up the UBRE #gamma represents log(sp) UBRE = function(gamma){ Pen = spl.S(n1, n2, n3, n4, npar, pen) sp = exp(gamma) Pen = sp[1]*Pen[[1]] + sp[2]*Pen[[2]] + sp[3]*Pen[[3]] A = sqrt.I%*%solve(I + Pen)%*%sqrt.I ubre = t(z - A%*%z)%*%(z - A%*%z) + 2*trace(A) - npar return(ubre) }

#Minimise the UBRE estimate.lambda = function(gamma){ lambda.update = optim(gamma, UBRE, method = "BFGS", control = list(maxit = 100000), hessian = TRUE) lambda = lambda.update$par return(lambda) }

#The following while runs until the convergence criteria is achieved # It is the summary of the scoring algorithm while(count < Max){ find.est = estimate.parameters(par.0, sp) par = find.est[[1]] I = find.est[[2]] inv.pen.I = find.est[[4]] sqrt.I = find.est[[5]] z = find.est[[6]] count = count + 1 if(max(abs(par - par.0)) < abstol) break est.gamma = estimate.lambda(gamma) gamma = est.gamma sp = exp(gamma) par.0 = par } Bibliography

E. L. Abner, R. J. Kryscio, G. E. Cooper, D. W. Fardo, G. A. Jicha, M. S. Men- diondo, P. T. Nelson, C. D. Smith, L. J. Van Eldik, L. Wan, et al. Mild cognitive impairment: Statistical models of transition using longitudinal clinical data. In- ternational Journal of Alzheimer’s Disease, 2012:291920, 2012.

H. Akaike. Information theory and an extension of the maximum likelihood princi- ple. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.

R. Analytics and S. Weston. doParallel: Foreach parallel adaptor for the par- allel package, 2014a. URL http://CRAN.R-project.org/package= doParallel. R package version 1.0.8.

R. Analytics and S. Weston. foreach: Foreach looping construct for R, 2014b. URL http://CRAN.R-project.org/package=foreach. R package version 1.4.2.

P. K. Andersen and N. Keiding. Multi-state models for event history analysis. Sta- tistical Methods in Medical Research, 11(2):91–115, 2002.

E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ guide. Third Edition. Philadelphia: SIAM, 1999.

R. Bender, T. Augustin, and M. Blettner. Generating survival times to simulate cox proportional hazards models. Statistics in Medicine, 24(11):1713–1723, 2005.

A. Burton, D. G. Altman, P. Royston, and R. L. Holder. The design of simulation studies in medical statistics. Statistics in Medicine, 25(24):4279–4292, 2006. BIBLIOGRAPHY 117

D. Collett. Modelling survival data in medical research. CRC press, 2015.

D. Commenges. Inference for multi-state models from interval-censored data. Sta- tistical Methods in Medical Research, 11(2):167–182, 2002.

D. Commenges, P. Joly, A. Gegout-Petit,´ and B. Liquet. Choice between semi- parametric estimators of markov and non-markov multi-state models from coars- ened observations. Scandinavian Journal of Statistics, 34(1):33–52, 2007.

R. J. Cook and J. F. Lawless. Analysis of repeated events. Statistical Methods in Medical Research, 11(2):141–166, 2002.

D. R. Cox. The theory of stochastic processes. Routledge, 2017.

P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377–403, 1978.

C. De Boor. A practical guide to splines, volume 27. Springer-Verlag New York, 1978.

D. Eddelbuettel, R. Franc¸ois, J. Allaire, K. Ushey, Q. Kou, N. Russel, J. Chambers, and D. Bates. Rcpp: Seamless r and c++ integration. Journal of Statistical Software, 40(8):1–18, 2011.

P. H. Eilers and B. D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11(2):89–102, 1996.

L. Fahrmeir and A. Klinger. A nonparametric multiplicative hazard model for event history analysis. Biometrika, 85(3):581–592, 1998.

M. F. Folstein, S. E. Folstein, and P. R. McHugh. Mini-mental state: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–198, 1975.

R. Gentleman, J. Lawless, J. Lindsey, and P. Yan. Multi-state Markov models for analysing incomplete disease history data with illustrations for HIV disease. Statistics in Medicine, 13(8):805–821, 1994. BIBLIOGRAPHY 118

R. J. Gray. Flexible methods for analyzing survival data using splines, with applica- tions to breast cancer prognosis. Journal of the American Statistical Association, 87(420):942–951, 1992.

C. Gu and G. Wahba. Minimizing GCV/GML scores with multiple smoothing pa- rameters via the Newton method. SIAM Journal on Scientific and Statistical Computing, 12(2):383–398, 1991.

R. A. Hubbard, L. Inoue, and J. Fann. Modeling nonhomogeneous markov pro- cesses via time transformation. Biometrics, 64(3):843–850, 2008.

C. H. Jackson. Multi-state models for panel data: The msm package for R. Journal of Statistical Software, 38(8):1–29, 2011.

C. H. Jackson. Flexsurv: A platform for parametric survival modelling in R. Journal of Statistical Software, 70(8):1–33, 2016.

C. H. Jackson, L. D. Sharples, S. G. Thompson, S. W. Duffy, and E. Couto. Multi- state markov models for disease progression with classification error. Journal of the Royal Statistical Society: Series D (The Statistician), 52(2):193–209, 2003.

P. Joly and D. Commenges. A penalized likelihood approach for a progressive three- state model with censored and truncated Data: Application to AIDS. Biometrics, 55(3):887–890, 1999.

P. Joly, D. Commenges, C. Helmer, and L. Letenneur. A penalized likelihood ap- proach for an illness–death model with interval-censored data: Application to age-specific incidence of dementia. Biostatistics, 3(3):433–443, 2002.

P. Joly, C. Durand, C. Helmer, and D. Commenges. Estimating life expectancy of demented and institutionalized subjects from interval-censored observations of a multi-state model. Statistical Modelling, 9(4):345–360, 2009.

J. Kalbfleisch and J. F. Lawless. The analysis of panel data under a Markov assump- tion. Journal of the American Statistical Association, 80(392):863–871, 1985. BIBLIOGRAPHY 119

R. Kay. A Markov model for analysing cancer markers and disease states in survival studies. Biometrics, 80:855–865, 1986.

T. Kneib and A. Hennerfeind. Bayesian semi-parametric multi-state models. Sta- tistical Modelling, 8(2):169–198, 2008.

T. C. Lystig and J. P. Hughes. Exact computation of the observed information matrix for hidden markov models. Journal of Computational and Graphical Statistics, 11(3):678–689, 2002.

R. J. Machado and A. van den Hout. Flexible multistate models for interval- censored data: Specification, estimation, and an application to ageing research. Statistics in Medicine, 37(10):1636–1649, 2018.

R. J. M. Machado, A. Van den Hout, and G. Marra. Penalised maximum likeli- hood estimation in multistate models for interval-censored data. arXiv preprint arXiv:1801.06375, 2018.

R. E. Marioni, M. J. Valenzuela, A. Van den Hout, C. Brayne, F. E. Matthews, et al. Active cognitive lifestyle is associated with positive cognitive health transitions and compression of morbidity from age sixty-five. PLoS One, 7(12):e50940, 2012.

G. Marra, R. Radice, T. Barnighausen,¨ S. N. Wood, and M. E. McGovern. A Simul- taneous Equation Approach to Estimating HIV Prevalence With Nonignorable Missing Responses. Journal of the American Statistical Association, 112(518): 484–496, 2017.

C. Moler and C. Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM review, 45(1):3–49, 2003.

R. Z. Omar, N. Stallard, and J. Whitehead. A parametric multistate model for the analysis of carcinogenicity experiments. Lifetime Data Analysis, 1(4):327–346, 1995. BIBLIOGRAPHY 120

R Core Team. R: A Language and Environment for Statistical Computing.R Foundation for Statistical Computing, Vienna, Austria, 2016. URL https: //www.R-project.org/.

R. Radice, G. Marra, and M. Wojtys.´ Copula regression spline models for binary outcomes. Statistics and Computing, 26(5):981–995, 2016.

A. Robitaille, A. van den Hout, R. J. Machado, D. A. Bennett, I. Cukiˇ c,´ I. J. Deary, S. M. Hofer, E. O. Hoogendijk, M. Huisman, B. Johansson, et al. Transitions across cognitive states and death among older adults in relation to education: A multistate survival model using data from six longitudinal studies. Alzheimer’s & Dementia, 2018.

P. Royston and M. K. Parmar. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prog- nostic modelling and estimation of treatment effects. Statistics in Medicine, 21 (15):2175–2197, 2002.

D. Ruppert, M. P. Wand, and R. J. Carroll. Semiparametric regression. cambridge series in statistical and probabilistic mathematics 12. Cambridge: Cambridge Univ. Press. Mathematical Reviews (MathSciNet): MR1998720, 2003.

G. A. Satten and I. M. Longini. Markov chains with measurement error: estimating thetrue’course of a marker of the progression of human immunodeficiency virus disease. Applied Statistics, pages 275–309, 1996.

H. Sennhenn-Reulen and T. Kneib. Structured fusion lasso penalized multi-state models. Statistics in Medicine, 35(25):4637–4659, 2016.

L. D. Sharples, C. H. Jackson, J. Parameshwar, J. Wallwork, and S. R. Large. Diagnostic accuracy of coronary angiography and risk factors for post–heart- transplant cardiac allograft vasculopathy. Transplantation, 76(4):679–682, 2003.

A. C. Titman. Model diagnostics in multi-state models of biological systems. PhD thesis, University of Cambridge, 2008. BIBLIOGRAPHY 121

A. C. Titman. Flexible nonhomogeneous Markov models for panel observed data. Biometrics, 67(3):780–787, 2011.

A. C. Titman and L. D. Sharples. Model diagnostics for multi-state models. Statis- tical Methods in Medical Research, 19(6):621–651, 2010.

A. Van den Hout. Multi-state survival models for interval-censored data. Boca Raton: CRC/Chapman & Hall, 2017.

A. Van den Hout and F. E. Matthews. Multi-state analysis of cognitive ability data: A piecewise-constant model and a Weibull model. Statistics in Medicine, 27(26): 5440–5455, 2008a.

A. Van den Hout and F. E. Matthews. A piecewise-constant markov model and the effects of study design on the estimation of life expectancies in health and ill health. Statistical Methods in Medical Research, 2008b.

S. Wood. Generalized additive models: An introduction with R. CRC press, 2006.

S. N. Wood. Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society. Series B, Statistical Methodol- ogy, pages 413–428, 2000.

S. N. Wood. Stable and efficient multiple smoothing parameter estimation for gen- eralized additive models. Journal of the American Statistical Association, 99 (467), 2004.

S. N. Wood. The mgcv package. www. r-project. org, 2007.