Statistical Methods in Micro-Simulation Modeling: Calibration and Predictive Accuracy

by Stavroula Chrysanthopoulou
B.Sc., Athens University of Economics and Business, 2003
Sc.M., University of Athens, 2007

A Dissertation submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Biostatistics at Brown University

Providence, Rhode Island
May 2014

© Copyright 2014 by Stavroula Chrysanthopoulou

This dissertation by Stavroula Chrysanthopoulou is accepted in its present form by the Department of Biostatistics, School of Public Health, as satisfying the dissertation requirement for the degree of Doctor of Philosophy.

Date Constantine Gatsonis, PhD (Advisor)

Recommended to the Graduate Council

Date Carolyn Rutter, PhD (Reader)

Date Xi Luo, PhD (Reader)

Date Matthew Harrison, PhD (Reader)

Approved by the Graduate Council

Date Peter Weber, Dean of the Graduate School

Curriculum Vitæ

Stavroula Chrysanthopoulou was born on May 2, 1980, in Athens, Greece.

She received her BSc degree from Athens University of Economics and Business (AUEB) in September 2003, and her MSc degree in Biostatistics from the University of Athens (UOA) in February 2007.

In September 2008 she was admitted to the PhD program in Biostatistics at Brown University, where she received her second MSc degree in Biostatistics in 2010. She successfully defended her PhD dissertation, entitled “Statistical Methods in Micro-Simulation Modeling: Calibration and Predictive Accuracy”, on September 13, 2013.

During her five-year career as a PhD candidate, she was appointed as a teaching assistant in the following courses, offered by the Department of Biostatistics at Brown University:

• Introduction to Biostatistics (Fall semester, 2008)

• Applied Regression Models (Spring semester, 2009)

• Analysis of Lifetime Data (Spring semester, 2012)

She presented a poster entitled “Relationship between breast biopsies and family history of breast cancer” at the Brown University Public Health Research Day, in the Spring semester of 2010.

She also presented part of her dissertation work as an invited speaker in the “Micro-simulation Models for Health Policy: Advances and Applications” session, at the Joint Statistical Meetings (JSM) 2013 conference in Montreal, Canada.

She has several years of working experience as:

⇒ 2003-2005: Consulting Biostatistician, mainly involved in the design and conduct of statistical analyses for biomedical papers.

⇒ 2005-2008: Statistical Consultant at Agilis SA-Statistics and Informatics, involved in research on methods for official statistics in projects conducted by the European Statistical Service (Eurostat).

Her research interests are focused on statistical methods for complex predictive models, such as the Micro-simulation Models (MSMs) used in medical decision making, as well as on High Performance Computing (HPC) techniques for complex statistical computations using the open source statistical package R.

Acknowledgements

The five years of my life as a PhD candidate were full of valuable experiences, exceptional opportunities to improve myself both as a scientist and as a human being, and of course a lot of challenging moments. In this beautiful “journey” I was blessed by God to be surrounded by very important people, without whose support I would never have been able to achieve my goal.

First and foremost I would like to thank my advisor, Professor Constantine Gatsonis, for his willingness to work with me in this very interesting field, and for his continuing support and guidance that helped me to overcome all the obstacles and conduct this important research. His intelligence, ethos, and integrity render him the perfect role model for young scientists. I also want to express my gratitude to Dr Carolyn Rutter for her valuable feedback as an expert in micro-simulation modeling, as well as for the exceptional opportunities she provided me with to present my work and exchange opinions with experts in the field. I would also like to thank Dr Matthew Harrison for his felicitous comments and insight that helped me to improve the Empirical calibration method, as well as to better organize and carry out the daunting task of calibrating a micro-simulation model. Thanks also to Dr Xi Luo for serving as a reader on my thesis committee.

I am also grateful to the people in the Brown Center for Computation and Visualization support group, especially Mark Howison and Aaron Shen, for always being very responsive and effective in helping me with the implementation of exhaustive parallel processes in R. I also thank Dr Samir Soneji for his assistance in estimating Cumulative Incidence Functions from the National Health Interview Survey data.

I also thank all the faculty, staff, and students of the Brown School of Public Health. In particular, I want to thank all my professors from the Biostatistics department, the staff of the Center of Statistical Sciences (CSS), and my classmates. Special thanks go to Denise Arver and Elizabeth Clark for always being very responsive and considerate.

Besides the people in the academic environment, I was also blessed to have a beautiful family and some wonderful friends who were always there for me in all the ups and downs of my career as a PhD candidate. To all these people I owe a great deal of my achievement.

I have no words to express how blessed I am for growing up in a very loving and caring family who always believed in and supported me. I want to thank my father for the first nine, love-filled years of my life, as well as for being my good angel since the day he passed away. There is no way to thank my wonderful mother enough, for dedicating her life to my brother and me, and for holding both parental roles very successfully for the past twenty-four years of my life. She has been, without exaggeration, the best mother ever! I owe her all the good (if any) elements of my personality and a large portion of the success in my life until now. For all these reasons I will always be very grateful and proud of being her daughter.

I would also like to thank my brother Vassilios, for always being a good example for me and undertaking a large portion of the burden as the protector of our family after the loss of our father. I am also grateful to my brother’s family, his wife Ioanna Andreopoulou, who I consider a true sister, and my two little “Princesses” Katerina and Antonia, for the positive effect they have on me.

God has indeed been very generous with me by sending invaluable friends into my life. I would first like to thank Dr Jessica Jalbert, Dr Dhiraj Catoor and Dr Sinan Karaveli for considerably helping me settle in here in Providence. Special thanks also go to the Perdikakis family, the parents Ann and Costas, and the children Rhea and Damon Ray, Giana, and Dean for their support, caring and love. I am very grateful for meeting and being part of this amazing family.

Last but not least I would like to express my gratitude to my dear friend Nektaria for her continuing support, her kindness and thoughtfulness, and most importantly for the great honor she did me of having me baptize her firstborn, Anna.

Unfortunately, due to space constraints, I have to finalize my list by thanking from the bottom of my heart all the aforementioned people as well as many other valuable friends, relatives and important persons in my life. Truly and deeply thankful for their positive effect on my life, I dedicate my accomplishment to them all.

Abstract of “Statistical Methods in Micro-Simulation Modeling: Calibration and Predictive Accuracy” by Stavroula Chrysanthopoulou, Ph.D., Brown University, May 2014

This thesis presents research on statistical methods for the development and evaluation of micro-simulation models (MSMs). We developed a streamlined, continuous-time MSM that describes the natural history of lung cancer, and used it as a tool for the implementation and comparison of methods for calibration and assessment of predictive accuracy. We performed a comparative analysis of two calibration methods. The first employs Bayesian reasoning to incorporate prior beliefs on model parameters, and information from various sources about lung cancer, to derive posterior distributions for the calibrated parameters. The second is an Empirical method, which combines searches of the multi-dimensional parameter space using a Latin Hypercube design with goodness-of-fit measures to specify parameter values that provide good fit to observed data. Furthermore, we studied the ability of MSMs to predict times to events, and suggested metrics based on concordance statistics and hypothesis tests for survival data. We conducted a simulation study to compare the performance of MSMs in terms of their predictive accuracy. The entire methodology was implemented in R 3.0.1. Development of an MSM in open source statistical software enhances transparency and facilitates research on the statistical properties of the model. Due to the complexity of MSMs, the use of High Performance Computing techniques in R is essential to their implementation. The analysis of the two calibration methods showed that they result in extensively overlapping sets of values for the calibrated MSM parameters and MSM outputs. However, the Bayesian method performs better in the prediction of rare events, while the Empirical method proved more efficient in terms of computational burden. The assessment of predictive accuracy showed that, among the methods suggested here, hypothesis tests outperform concordance statistics, since they proved more sensitive in detecting differences between predictions obtained by the MSM and actual individual-level data.

To my beloved family.

Contents

Abstract

1 Introduction
  1.1 Micro-Simulation Models (MSMs)
    1.1.1 Overview
    1.1.2 Applications in health care research
    1.1.3 Development of an MSM
  1.2 Thesis Outline

2 Micro-simulation model describing the natural history of lung cancer
  2.1 Background
  2.2 Model description
    2.2.1 Model components
    2.2.2 Simulation Algorithm
    2.2.3 Software
  2.3 Application
    2.3.1 Ad-hoc values for model parameters
    2.3.2 MSM output - Examples
  2.4 Discussion

3 Calibration methods in MSMs - a comparative analysis
  3.1 Background
    3.1.1 Calibration vs estimation in …
    3.1.2 Calibration methods for MSMs
    3.1.3 Assessing calibration results
  3.2 Methods
    3.2.1 Notation
    3.2.2 Bayesian Calibration Method
    3.2.3 Empirical Calibration Method
    3.2.4 Calibration outputs: interpretation and use
  3.3 High Performance Computing in R
    3.3.1 Software for MSMs
    3.3.2 Example: computational burden of two MSM calibration methods
    3.3.3 Parallel Computing
    3.3.4 Code architecture
    3.3.5 Algorithm efficiency: Bayesian vs Empirical Calibration
    3.3.6 Concluding remarks
  3.4 Comparative Analysis
    3.4.1 Input Data
    3.4.2 MSM parameters to calibrate
    3.4.3 Calibration Targets
    3.4.4 Simulation Study
    3.4.5 Terms of comparison
  3.5 Results
    3.5.1 Parameters
    3.5.2 Predictions
  3.6 Calibration Methods Refinement
  3.7 Discussion

4 Assessing the predictive accuracy of MSMs
  4.1 Background
    4.1.1 Assessment of MSMs
    4.1.2 Predictive accuracy of MSMs
  4.2 Methods
    4.2.1 Notation
    4.2.2 Concordance statistics
    4.2.3 Hypothesis testing
    4.2.4 Simulation Study
  4.3 Results
    4.3.1 Single run of the MSM
    4.3.2 Multiple runs of the MSM
  4.4 Discussion

5 Conclusions

References

List of Tables

2.1 MSM simulation algorithm
2.2 MSM ad-hoc parameter estimates: Onset of the first malignant cell
2.3 SEER data on lung cancer at diagnosis
2.4 MSM ad-hoc parameter estimates: Lung cancer progression
2.5 Predicted times to events: Males - Non smokers
2.6 Predicted times to events: Females - Non smokers
2.7 Predicted times to events: Males - Current smokers
2.8 Predicted times to events: Females - Current smokers
2.9 Predicted times to events: Males - Former smokers, quitting smoking at age 40
2.10 Predicted times to events: Males - Former smokers, quitting smoking at age 50
2.11 Predicted times to events: Males - Former smokers, quitting smoking at age 60
2.12 Predicted times to events: Females - Former smokers, quitting smoking at age 40
2.13 Predicted times to events: Females - Former smokers, quitting smoking at age 50
2.14 Predicted times to events: Females - Former smokers, quitting smoking at age 60

3.1 Code efficiency
3.2 Reference population age distribution
3.3 Observed lung cancer incidence rates
3.4 Calibration
3.5 Number of microsimulations
3.6 Summary Statistics - parameters
3.7 Summary statistics - predictions
3.8 Assessing MSM predictions
3.9 Discrepancy - predictions
3.10 Summary statistics - Box plots
3.11 Summary Statistics - parameters (sub-analysis)
3.12 Summary statistics - predictions (sub-analysis)
3.13 Discrepancy - predictions (sub-analysis)

4.1 Assessment (toy.1, V=1)
4.2 Assessment (toy.2, V=1)
4.3 Assessment (toy.1, V=200, 400, 600, 800, 1000)
4.4 Assessment (toy.2, V=200, 400, 600, 800, 1000)

List of Figures

2.1 Markov State diagram of the lung cancer MSM
2.2 Lung cancer mortality: Non-smokers
2.3 Lung cancer mortality: Current smokers
2.4 Lung cancer mortality: Former smokers

3.1 LHS implementation (N=5)
3.2 LHS implementation (N=20)
3.3 Micro-simulation size
3.4 Density Plots - parameters
3.5 Mahalanobis distances - parameters
3.6 Bayesian method: Contours of calibrated parameters
3.7 Empirical method: Contours of calibrated parameters
3.8 Density plots - predictions (internal validation)
3.9 Density plots - predictions (external validation)
3.10 Mahalanobis distances - predictions
3.11 Calibration plots
3.12 Box plots
3.13 Density Plots - parameters (sub-analysis)
3.14 Bayesian method (sub-analysis): Contours of calibrated parameters
3.15 Empirical method (sub-analysis): Contours of calibrated parameters
3.16 Density plots - predictions - sub (internal validation)
3.17 Density plots - predictions - sub (external validation)
3.18 MH algorithm flow chart
3.19 Bayesian Calibration flow chart

4.1 KM curves - Observed vs Predicted survival (toy.1, V=1)
4.2 KM curves - Observed vs Predicted survival (toy.2, V=1)
4.3 KM curves - Observed vs Predicted survival (toy.1, V=200)
4.4 KM curves - Observed vs Predicted survival (toy.1, V=400)
4.5 KM curves - Observed vs Predicted survival (toy.1, V=600)
4.6 KM curves - Observed vs Predicted survival (toy.1, V=800)
4.7 KM curves - Observed vs Predicted survival (toy.1, V=1000)
4.8 KM curves - Observed vs Predicted survival (toy.2, V=200)
4.9 KM curves - Observed vs Predicted survival (toy.2, V=400)
4.10 KM curves - Observed vs Predicted survival (toy.2, V=600)
4.11 KM curves - Observed vs Predicted survival (toy.2, V=800)
4.12 KM curves - Observed vs Predicted survival (toy.2, V=1000)

Chapter 1

Introduction

Comparative Effectiveness Research (CER), a novel research framework aimed at developing broad-based comparative evidence on the outcomes of diagnostic and therapeutic procedures, has recently attracted significant scientific attention. An important component of CER is the development of new methodologies for empirical and modeling studies that generate information appropriate for health policy decisions. Within this context, a class of predictive models, the micro-simulation models (MSMs), has attracted considerable attention among researchers. MSMs use information from various sources of data and clinical expertise to simulate individual disease trajectories, i.e., trajectories that describe events associated with the development of the target disease. The summarized results from these individual trajectories are used to make predictions about the long-term effects of a health policy intervention on a given population.

Micro-simulation models have been widely used in several fields. However, the systematic investigation of their statistical properties is only recently getting under way. The main objective of this thesis is to address two of the key elements in the development and evaluation of an MSM, namely, model calibration and prediction, from a statistical point of view. To this end we first develop a streamlined micro-simulation model that describes the natural history of lung cancer, and use it as a tool to explore the statistical aspects of calibration and prediction for MSMs.

The thesis is divided into five chapters. The first chapter provides an introduction and overview of the thesis. The second chapter focuses on the development of a streamlined, continuous-time MSM that describes the natural history of lung cancer in the absence of screening and treatment interventions. This MSM serves as a tool for the study of the statistical properties of MSMs in subsequent chapters. In particular, the third chapter provides a comparative analysis of two calibration methods, a Bayesian and an Empirical one, with application to this MSM for lung cancer. The fourth chapter discusses the assessment of the predictive accuracy of an MSM, using the lung cancer model. The dissertation concludes with a fifth chapter which summarizes the main findings and conclusions, and outlines plans for future work on the study of the statistical properties of MSMs.

1.1 Micro-Simulation Models (MSMs)

1.1.1 Overview

Micro-simulation models (MSMs) are complex models designed to simulate individual-level data by Monte Carlo simulation, typically of an underlying Markov process. The first applications of MSMs were in social policy in the late 1950s (Orcutt (1957)). In recent years, MSMs have come to be used extensively in health policy and medical decision making. In health policy problems, MSMs are used to describe the natural history of a disease in individual members of a cohort, usually in conjunction with the effect of some intervention. To this end MSMs use mathematical equations with stochastic assumptions to describe in detail complex observed and latent characteristics of the underlying process. The inherent intricacy of MSMs posed serious time and cost constraints on their development and implementation, especially during the first years of their use. However, the advances in scientific computing in recent years have contributed considerably to the improvement and expansion of methodologies and applications of MSMs in general, and in medical decision making in particular.

1.1.2 Applications in health care research

Rutter et al. (2011) provide a comprehensive review of micro-simulation models used to predict health outcomes. The review highlights the usefulness of MSMs and their continuously expanding role in medical decision making. It also indicates the key steps in the development of a new MSM and discusses the essential checks of the validity of the model. Finally, the review points to the need for additional research on the statistical properties of MSMs, especially the incorporation and characterization of model uncertainty.

Another very important application of MSMs is in the context of Comparative Effectiveness Research (CER), a rapidly growing area of research aimed at improving health outcomes while reducing related costs. CER has recently attracted a great deal of attention in the medical and scientific community. According to the U.S. Department of Health and Human Services (HHS) (109), CER is defined as:

“the conduct and synthesis of systematic research comparing different interventions and strategies to prevent, diagnose, treat and monitor health conditions. The purpose of this research is to inform patients, providers and decision-makers, responding to their expressed needs, about which interventions are most effective for which patients under specific circumstances. To provide this information, CER must assess a comprehensive array of health-related outcomes for diverse patient populations. Defined interventions compared may include medications, procedures, medical and assistive devices and technologies, behavioral change strategies, and delivery system interventions. This research necessitates the development, expansion, and use of a variety of data sources and methods to assess comparative effectiveness.”

Tunis et al. (112) provide a comprehensive introduction to CER in the context of the recently enacted USA health care reform, and discuss the statistical challenges in carrying out this research. The authors highlight the need for sufficient, credible, relevant and timely evidence in the conduct of CER, and emphasize that “the primary purpose of CER is to help health-care decision makers make informed decisions at the level of individual care for patients and clinicians, and at the level of policy determinations for payers and other policymakers”. The conduct of CER comprises a great variety of novel and existing methods in medical research, all of which can be classified into five broad categories, i.e., systematic reviews, decision modeling, retrospective analysis, prospective observational studies and experimental studies.

A key example of the use of CER in medical decision making, mentioned both in the Tunis et al. (112) paper and in the commentary by Gatsonis (27), is the evaluation of diagnostic modalities for cancer. Both papers indicate the necessity of individual-level information to assist decisions. However, this type of information can prove very costly, time-consuming, or even totally impracticable due to the complexity of the health-care setting. Therefore micro-simulation has risen to prominence as a promising tool that can make projections about the impact of interventions (such as screening) when applied to population cohorts, and inform health policies and medical decision making. A characteristic example of the application of new modeling techniques in Medical Decision Making (MDM), including micro-simulation modeling, is the research conducted by the Cancer Intervention and Surveillance Modeling Network (CISNET) of NCI (http://cisnet.cancer.gov). The CISNET group is a consortium of NCI-sponsored investigators whose research interests focus on the development and application of advanced statistical modeling. Its main objective is to use advanced modeling techniques to better understand the effects of cancer control interventions (prevention, screening, treatment, etc.) on individuals as well as on population trends (incidence and mortality rates). The CISNET consortium currently comprises five large groups focusing their research on five different types of cancer: breast, colorectal, esophageal, lung and prostate cancer. Models developed to describe each of these types of cancer can be used to guide health research and priorities.

The complexity of an MSM can make its development a daunting task. However, a valid MSM can be useful to many stakeholders. In particular, it can be used to inform patients, providers and decision-makers and assist them in deciding on the most effective and efficient intervention under certain circumstances. Despite their complexity, MSMs hold some very “attractive” features that distinguish them from other tools for the conduct of CER. First, MSMs are designed to describe and evaluate complicated processes when analytical formulas are not available. The models focus on making predictions about individual patient trajectories rather than describing the average patient. This, as already mentioned, is a key element of any statistical tool used for the conduct of CER, which is essentially patient-centered.

In addition, MSMs provide an easy way of representing time-dependent transition probabilities between major states of the disease course while, at the same time, they facilitate the explicit incorporation of different sources of uncertainty intrinsic to the system (stochastic, parameter, structural, etc.). Furthermore, they compile and sometimes even reconcile contradictory facts about the disease process derived from different sources (e.g., experimental studies, observational studies, expert opinions, etc.). MSMs also provide short- or long-term predictions about the course of a disease and the effect of interventions (e.g., screening schedule, treatment, etc.) on a population. In the case of simulating results from longitudinal studies, MSM-based projections can be available well in advance of the actual study conclusion. Finally, MSMs can be used to produce large pseudo-samples, a very important feature especially in cases where the conduct of large, well-designed studies (e.g., large-scale clinical trials) is precluded by time and/or cost constraints or even ethical considerations.

An example of the application of MSMs in health care is their wide use to evaluate and compare cancer screening programs. In this setting an MSM is used to describe the main stages of the natural history of the specific type of cancer and to model the effect of screening on several aspects of a patient's lifetime (e.g., survival time, quality of life, etc.). In many instances, the course of cancer can be divided into five main stages: the disease-free state, the onset of the malignancy (local state), the involvement of detectable lymph node metastases (regional state), the involvement of distant metastases, and death either from cancer or from other causes. Modelers may be interested in all or only some of these stages. Several papers have studied each of these disease states separately and have tried to fit complex mathematical models to real data (41; 40; 43; 72; 75; 15; 61; 26; 33; 58; 59; 70; 102; 103). These models aim to combine information from the biological process of the disease with observed outcomes and to describe the entire phenomenon in as much detail as possible. Micro-simulation modeling can be used to combine all the models that describe the essential parts of a disease process, and to use the Monte Carlo method to simulate individual patients' trajectories.
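The Monte Carlo simulation of individual trajectories through such disease stages can be sketched as follows. This is an illustrative toy model only, not the lung cancer MSM developed in this thesis: the state names mirror the five stages above, but the exponential sojourn-time assumption and the mean sojourn times are hypothetical values chosen purely for demonstration.

```python
import random

# Five hypothetical disease states, ordered by progression.
STATES = ["disease-free", "local", "regional", "distant", "death"]

# Assumed mean sojourn times (years) in each transient state; these are
# illustrative numbers, not estimates from any fitted model.
MEAN_SOJOURN = {"disease-free": 60.0, "local": 3.0, "regional": 2.0, "distant": 1.0}

def simulate_trajectory(rng):
    """Simulate one individual's trajectory as a list of (state, entry_age) pairs."""
    age = 0.0
    trajectory = [("disease-free", 0.0)]
    for i, state in enumerate(STATES[:-1]):
        # Draw an exponential sojourn time in the current state
        # (a continuous-time Markov assumption).
        age += rng.expovariate(1.0 / MEAN_SOJOURN[state])
        trajectory.append((STATES[i + 1], age))
    return trajectory

rng = random.Random(42)
cohort = [simulate_trajectory(rng) for _ in range(1000)]
# Individual trajectories are then summarized into cohort-level predictions,
# e.g. the mean age at death across the simulated cohort.
mean_age_at_death = sum(t[-1][1] for t in cohort) / len(cohort)
```

In a real MSM the transition rules would also depend on covariates (gender, age, smoking history) and would allow competing risks, such as death from other causes pre-empting cancer progression.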

1.1.3 Development of an MSM

The development of a micro-simulation model is a complex undertaking involving, like any other statistical predictive model, three major building blocks, namely model specification, calibration, and assessment.

Model specification refers to defining the structure of the model that will be used to describe, analyze, and/or simulate the phenomenon of interest, including the nature of the model (e.g., regression, Markov, etc.), as well as the set of rules and assumptions imposed. For a new MSM describing the natural history of a disease in particular, model specification entails identification of the major distinct states of the disease as well as stipulation of the transition rules among them, including the relevant mathematical equations and distributions used to describe the underlying stochastic process.

Calibration is the process of determining values for the model parameters so that the model provides a good fit to available data about the phenomenon of interest. In the context of micro-simulation modeling, calibration is analogous to the parameter estimation performed in ordinary statistical models (e.g., GLMs).

Assessment pertains to the model's predictive performance, comprising the model's overall performance and its discrimination ability (105). Overall performance can be expressed as the percentage of explained variation in the system (R2 statistics) as well as the proximity between observed and predicted quantities of interest (GoF statistics). Discrimination, on the other hand, is the model's ability to correctly classify subjects (e.g., patients) with different characteristics based on the individual predictions about the outcome of interest. The goal of this thesis is to explore these building blocks through the development of a new, streamlined MSM describing the natural history of lung cancer.
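For time-to-event outputs, discrimination is often quantified with a concordance (C) statistic: the fraction of comparable pairs of subjects whose predicted event times are ordered the same way as their observed times. The sketch below is a minimal, hypothetical version that ignores censoring; the survival-data metrics developed later in the thesis handle censored observations more carefully.

```python
from itertools import combinations

def concordance(observed, predicted):
    """Fraction of comparable pairs whose predicted times are ordered like
    the observed times; ties in prediction count as half. Censoring is ignored."""
    concordant, pairs = 0.0, 0
    for (o_i, p_i), (o_j, p_j) in combinations(zip(observed, predicted), 2):
        if o_i == o_j:
            continue  # tied observed times are not a comparable pair
        pairs += 1
        if (p_i - p_j) * (o_i - o_j) > 0:
            concordant += 1.0   # pair ordered consistently
        elif p_i == p_j:
            concordant += 0.5   # tied prediction: half credit
    return concordant / pairs

# Perfectly ordered predictions give C = 1; a random ordering hovers near 0.5.
obs = [2.0, 5.0, 1.0, 8.0]
pred = [2.1, 4.9, 1.2, 7.5]
print(concordance(obs, pred))  # 1.0: every pair is ordered correctly
```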

The main purpose of an MSM is to predict individual trajectories for the phenomenon it describes (in medical decision making, disease trajectories). These individual trajectories yield point estimates of several quantities of interest (outputs), including times to events (e.g., time to the development of lung cancer), binary responses (e.g., death from lung cancer), or even estimates of continuous quantities (e.g., tumor diameter at diagnosis).

As in any other type of statistical analysis, it is important to accompany point estimates with measures of variability, so as to give an idea of their precision. In order to do so in the context of micro-simulation modeling, it is very important to understand all possible sources of uncertainty inherent in the model, and to find a way to incorporate them into the model estimates. Rutter et al. (92) identify the following sources of uncertainty in MSMs:

• population heterogeneity: differences between individuals in the population of interest, with a significant effect on the observed outcomes

• parameter uncertainty: variability due to the estimation of unknown model parameters

• selection uncertainty: incorporation of information based solely on a small portion of studies from the pool of available studies on the specific topic

• sampling variability: variability owing to the fact that the calibration data are summary statistics estimated from a finite sample from the population of interest

• stochastic uncertainty: variability due to the random number generation procedure followed in the Monte Carlo approach for the evaluation and implementation of the MSM

• structural uncertainty: variability caused by the ignorance about the exact procedure of the phenomenon described by the MSM and related to the model assumptions (incertitude about the functional form of the model)

All the methods presented in this thesis take into account the problem of the identification and characterization of MSM uncertainty.

1.2 Thesis Outline

The remainder of the thesis is divided into four chapters. Chapter 2 presents the development of a streamlined, continuous-time micro-simulation model (MSM) that describes the natural history of lung cancer in the absence of screening and treatment components. The chapter begins with an extensive literature review on the subject of lung cancer history modeling and surveys the use of MSMs in this area. The chapter continues with the determination of the major distinct stages of the disease and a description of the set of rules and assumptions governing the MSM.

We kept the number of covariate classes to a minimum in order to achieve a manageable level of model complexity. Therefore, the set of covariates in the model comprises the gender, age, smoking history (age at starting and quitting smoking), and smoking habits (smoking intensity, based on the average number of cigarettes smoked per day) of each individual. Published results on several stages of the course of lung cancer are used for an ad-hoc specification of the model parameters. The chapter also describes in detail the simulation algorithm followed for the implementation of the model, and illustrates the MSM's performance by running the model for several characteristic, real-life scenarios and comparing the MSM predictions to knowledge attained in the field (i.e., lung cancer research). The main objective in building this MSM is for it to serve as a tool for the comparative evaluation of statistical methodologies for model calibration, validation, and assessment of predictive accuracy described in subsequent parts of the thesis.

Chapter 3 discusses the calibration of an MSM. Here, the literature review focuses on methods used for the calibration of MSMs in medical decision making specifically. The main objective of this chapter is to provide a comparative analysis of two calibration methods for MSMs. To this end a simulation study is designed and conducted, the results of which form the basis of the comparative analysis.

The first method is the Bayesian calibration approach developed by Rutter et al. (90) and implemented on an MSM for colorectal cancer. The second method is a new empirical calibration method. The idea underlying this method is to combine some of the best modeling practices currently applied for the empirical calibration of several types of MSMs, including algorithms for searching the multidimensional parameter space, goodness-of-fit (GoF) statistics to assess the model's overall performance, convergence criteria, stopping rules, etc. A key component of the new method is the incorporation of the broadly used Latin Hypercube Sampling (LHS) design in the search algorithm, for a more efficient (compared to simple random sampling) exploration of the multidimensional parameter space of a (usually rather involved) MSM.
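The LHS design can be sketched in a few lines. This is a minimal illustration written in Python (the thesis implementation is in R); the function name, parameter bounds and sample size are hypothetical:

```python
import numpy as np

def latin_hypercube(n, bounds, seed=None):
    """Draw an n x d Latin Hypercube sample over the given parameter bounds.

    Each parameter range is split into n equal strata and exactly one point
    is drawn from every stratum, so each 1-D margin is covered evenly --
    unlike a simple random sample, which can leave large regions unexplored.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)          # shape (d, 2)
    d = bounds.shape[0]
    # One uniform draw per stratum, per dimension, on (0, 1).
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    # Shuffle the strata independently in every dimension.
    for j in range(d):
        u[:, j] = u[rng.permutation(n), j]
    return bounds[:, 0] + u * (bounds[:, 1] - bounds[:, 0])

# Hypothetical 3-dimensional parameter space for a calibration search:
design = latin_hypercube(100, [(0.0, 1.0), (1e-7, 1e-5), (0.5, 2.0)], seed=7)
```

Each column of `design`, viewed on its own, hits all 100 strata of its range exactly once, which is what makes the search of a high-dimensional parameter space more efficient than simple random sampling.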

Both the Bayesian and the empirical calibration methods are implemented on the continuous-time MSM for the natural history of lung cancer described in Chapter 2. The comparison of the models uses both qualitative criteria (e.g., efficiency, practicality, interpretation of calibration results) and quantitative measures of the model's overall performance (GoF statistics), including both internal and external validation. Internal validation pertains to assessing the model's performance using exactly the same data that were used during the calibration procedure, whereas external validation uses different data. In addition, graphical means of assessing the model's performance are also provided. The results from this comparison are used for recommendations regarding the use of these two, as well as similar, approaches in practice.

Although MSMs are very widely used, to our knowledge no systematic work has yet been carried out on the assessment of an MSM's predictive accuracy. The fourth chapter is concerned with the assessment of the predictive accuracy of a "well" calibrated MSM. Micro-simulation models are considered here as a special type of predictive survival model, since they predict actual survival times, unlike other broadly used survival models, which predict hazard rates or ratios (e.g., Cox Proportional Hazards and Accelerated Failure Time models). The extensive literature review aims at identifying measures of predictive accuracy used in the context of survival modeling that could also be applied to the assessment of an MSM.

Two broadly used methodologies are applied to the two calibrated MSMs resulting from Chapter 3, namely concordance statistics and methods aimed at comparing predicted with observed survival curves. These approaches are adapted to the particularities of MSMs. The chapter compares the two methodologies, summarizes findings from a simulation study, and concludes with suggestions about useful statistics for the assessment of the predictive accuracy of an MSM.

In the last chapter of this thesis (Chapter 5) we summarize the main findings and outline future work related to our research.

Chapter 2

Micro-simulation model describing the natural history of lung cancer

In this chapter we develop a new, streamlined, continuous-time micro-simulation model (MSM) that describes the natural history of lung cancer in the absence of any screening or treatment component. This is a predictive model that simulates individual patient trajectories given a certain set of covariates, namely age, gender and smoking history. The model structure is in line with existing methods and combines findings from several sources related to lung cancer research. This new MSM predicts the course of lung cancer for each individual, from the initiation of the first malignant cell, through tumor progression to the regional and distant stages, until death from lung cancer (or some other cause) or the end of the prediction period. The main goal is for the model to serve as a tool to explore, in subsequent chapters, some properties of MSMs from a statistical point of view. In particular, the research focus is on model calibration and assessment of the model's predictive accuracy. The model is developed using the open source statistical software R (version 3.0.1), in order to enhance its transparency and to explore the potential of this software for the development of MSMs in general.

The chapter begins with background information regarding MSMs currently used to describe the natural history of lung cancer. The main part of the chapter is dedicated to the description of the new, streamlined MSM for lung cancer that we develop here. The second section describes the main model components, namely the distinct disease states, the set of transition rules between them, the distributions and mathematical equations describing the particularities of the process, as well as an account of the model parameters. Thereafter, we present in detail the simulation algorithm followed to predict individual trajectories. The next section explains the process followed for the determination of ad-hoc values for the model parameters, together with a brief description of the data used for this purpose. Model performance is exemplified by running the MSM under hypothetical scenarios, i.e., for different individual baseline characteristics including smoking habits. The chapter concludes with a discussion of the model's overall performance, advances and shortcomings, as well as future work on this topic.

2.1 Background

Micro-simulation models (MSMs) are complex models designed to simulate individual-level data using Monte Carlo methods. Several micro-simulation models have been developed to describe the natural history of lung cancer. Two of the most comprehensive and widely used are the Lung Cancer Policy Model (LCPM) developed by McMahon (70), and the MIcro-simulation SCreening ANalysis (MISCAN) model by Habbema et al. (38). Other, simplified versions of MSMs for lung cancer can also be found in the literature (Goldwasser (33), Hazelton et al. (40), etc.). The LCPM is a discrete-time epidemiological MSM that combines information related to multiple stages of lung cancer, mainly based on epidemiological models. The MISCAN model, on the other hand, is a continuous-time MSM that in addition takes into account the biology of the tumor cells (a latent process). Noteworthy is the fact that all the MSMs that have been developed to describe the course of lung cancer take smoking history and smoking habits into account in the prediction of lung cancer risk and mortality.

McMahon et al. (71) and Shi et al. (98) present two representative applications of the aforementioned models in medical decision making. The first paper presents the application of the Lung Cancer Policy Model (LCPM) to assess the long-term effectiveness of lung cancer screening in the Mayo CT study, an extended, single-arm study aiming to evaluate the effect of helical CT screening for lung cancer on current and former smokers. Here, the LCPM micro-simulation model is used to simulate the end results of interest for pseudo-individuals of a hypothetical control arm, i.e. in the absence of any screening program.

The second paper refers to the application of the MISCAN micro-simulation model for lung cancer to explore a number of hypotheses that could potentially explain the controversial finding of the Mayo Lung Project (MLP), namely an increase in lung cancer survival from the time of diagnosis without a corresponding reduction in lung cancer mortality. In this case, the authors modify the MISCAN model parameters so as to simulate pseudo-individuals under the different tested scenarios that could possibly explain that controversial finding, such as over-diagnosis, screening sensitivity, and population heterogeneity. They subsequently fit each model to real data from the MLP randomized trial and compare its goodness of fit (GoF) to that of the simplest model, i.e., the one in which the model parameters related to the hypotheses of interest are set to their neutral values. For instance, a parameter for indolent cancers is introduced into the model to account for a possible effect of over-diagnosis. Only a notable improvement of the GoF measure (deviance) could strongly support the validity of the scenario under consideration. For example, if the model with the indolent cancer parameter does not decrease the deviance obtained from the simpler model, then the micro-simulation result does not support over-diagnosis as the reason for the controversial finding of the Mayo Lung Project.

Noteworthy in both papers is the fact that results from the MSM application are presented only as point estimates of the quantities of interest, lacking any measure of precision. This is very typical of studies involving micro-simulation modeling.

2.2 Model description

We have developed a new, streamlined, continuous-time MSM that describes the natural history of lung cancer in the absence of any screening or treatment component. This is a Markov model in the sense that it satisfies the Markovian property, i.e., the transition to any subsequent state depends exclusively upon the state in which the process currently resides.

The Markov state diagram in figure 2.1 depicts the five distinct states of the model, i.e., the disease-free state (S0), the onset of the first malignant cell (local state, S1), the beginning of the regional stage (lymph node involvement, S2), the distant stage (involvement of distant metastases, S3), and eventually the death state (S4). In the same figure, hij denotes the hazard rate characterizing the transition from state i to state j.

Death can be attributed to either lung cancer or other causes. In order to consider that a lung cancer death occurred, the individual has to move from state S3 to S4. That is, the model assumes that death from lung cancer can occur only after the tumor is already in distant stage.

Figure 2.1: Markov State diagram of the lung cancer MSM

The model essentially consists of the absorbing state of death (S4) and four "tunnel" states. The "tunnel" states are consecutive states stipulating the specific course of the phenomenon described in the Markov state diagram (101). According to the Markov state diagram presented in figure 2.1, from a disease-free state at some time point the first malignant cell initiates (local stage) and proliferates up to the point of lymph node involvement (transition to the regional stage). The tumor progresses from this stage to the involvement of distant metastases, and eventually causes death from lung cancer unless death from some other cause precedes it. As already mentioned, a key model assumption is that it is very unlikely to observe death from lung cancer without prior involvement of distant metastases.

The development and course of lung cancer over a person's lifetime according to this model is stipulated by a set of transition rules described in detail hereafter. Estimates of the model parameters are obtained from a thorough literature review on the topic covering various sources (e.g., RCTs, case-control and cohort studies, meta-analyses, expert opinions). These estimates are used in the present chapter as ad-hoc values for working examples of the MSM's performance, while in subsequent chapters they serve as guidance for the specification of plausible values for the MSM parameters. Simulations at the individual level are carried out using the Monte Carlo method. In particular, this approach involves the generation of a large number of individual trajectories, resulting in many independent and identically distributed natural histories in each covariate class. These trajectories are summarized so as to obtain the predicted quantities of interest, e.g., lung cancer incidence and mortality rates, overall and by covariate group.

2.2.1 Model components

Onset of the first malignant cell

We model the onset of the first malignant cell using the exact solutions, for the expression of the hazard rates and the survival probabilities, of the biological two-stage clonal expansion (TSCE) model (75). For piecewise constant parameters, the hazard function for the development of the first malignant cell is (44):

h(t) = νµX · (e^((γ+2B)t) − 1) / (γ + B · (e^((γ+2B)t) + 1))    (2.1)

where, for piecewise constant parameters, γ and B are determined using the following parameterization:

γ = α − β − µ  and  B = (1/2) · (−γ + √(γ² + 4αµ))

where X is the total number of normal cells, ν is the normal cell initiation rate, α is the division rate of initiated cells, β is the apoptosis rate (death or differentiation) of initiated cells, and µ is the malignant conversion rate of initiated cells.

Following equation 2.1, the cumulative hazard function is:

H(t) = (νµX / (γ + B)) · [ −t + (1/B) · log(γ + B + B · e^((γ+2B)t)) ]    (2.2)

Previous empirical data analyses with the TSCE model, exploring the dose-response relationship of smoking with lung cancer incidence, indicated that power laws are good approximations to this relationship (40; 41). In the same studies, X = 10^7 has been provided as a plausible figure for the total number of normal stem cells. Furthermore, the TSCE multistage model allows tests for differences in the initiation, promotion and malignant conversion rates of the course of lung cancer between population subgroups. Previous analyses of lung cancer incidence data in the Nurses' Health Study (NHS) and the Health Professionals Follow-up Study (HPFS) revealed a significant difference in tobacco-induced promotion and malignant conversion rates between males and females (72).

We incorporate these findings about the effect of smoking on the onset of the first malignant cell in our model. In particular, if q(t) denotes the smoking intensity at age t, expressed as the average number of cigarettes smoked per day, its effect on the α and γ rates is described by the following power-law relationships:

α = α0 · (1 + α1 · q(t)^a2)  and  γ = γ0 · (1 + α1 · q(t)^a2)

where γ0 and α0 are the coefficients for non-smokers. To account for differences between men and women, as well as between current smoking habits, we assume different hazards (functions of age t) corresponding to all possible combinations of gender (male/female) and smoking status (never/former/current smoker). For each individual, the time period from birth (t = 0) to the onset of the first malignant cell can be split into k intervals within which the hazard rate is constant and depends on the person's smoking status (smoking or not) within that interval. For simplicity we assume only up to two possible change points in a lifetime: the age at starting (τ1) and the age at quitting (τ2) smoking, where relevant.

The survival function S(t) for the development of lung cancer is:

S(t) = exp{−H(t)} = exp{ −∫_0^t h(x) dx }    (2.3)

Depending on the smoking status of each person we discern the following three possible scenarios:

• Never smoker:

S(t) = exp{ −∫_0^t h(x) dx }    (2.4)

• Current smoker:

S(t) = exp{ −∫_0^τ1 h(x) dx − ∫_τ1^t h(x) dx }    (2.5)

• Former smoker:

S(t) = exp{ −∫_0^τ1 h(x) dx − ∫_τ1^τ2 h(x) dx − ∫_τ2^t h(x) dx }    (2.6)

Tumor growth

Several studies have shown an inverse correlation between the tumor growth rate and its size; that is, the tumor growth rate is usually not constant but decreases steadily. According to these studies, the Gompertz function provides a good approximation of tumor growth for most cancer types and describes this process more accurately than, e.g., the exponential distribution (15). The Gompertz model suggests the proliferation of tumor cells by a modified exponential process in which successive doubling times occur at increasingly longer time intervals (61). Hence the Gompertz function stipulates a shorter pre-clinical period than the exponential model, and longer survival after diagnosis.

The model assumes a Gompertzian (61) tumor growth, i.e., the tumor volume at age t satisfies:

V(t) / V0 = exp{ (s/m) · (1 − e^(−mt)) }    (2.7)

where V0 and V(t) represent the initial tumor volume (the volume of the first malignant cell) and the tumor volume at age t, respectively, and m, s are the location and scale parameters of the Gompertz function.

The hazard rate of the Gompertz distribution as a function of time t is (26):

r(t) = s · e^(−mt)    (2.8)

The time at which the tumor has reached volume V(t) can be found using the inverse of the Gompertz function:

t = −(1/m) · log[ 1 − (m/s) · log(V(t)/V0) ]    (2.9)

For this equation to be defined, the values of the Gompertz parameters (m, s) should be chosen so that:

1 − (m/s) · log(V(t)/V0) > 0 for all V(t) ≤ Vmax  ⇒  s > m · log(Vmax/V0)    (2.10)

This limitation is very important, especially in the specification of the model parameters, either as ad-hoc values or in a regular calibration setting.

Moreover, assuming spherical tumor growth (i.e., symmetric in all directions), the tumor volume at age t is a function of its diameter at that age, d(t), and is calculated using the sphere volume formula:

V(t) = (π/6) · [d(t)]^3    (2.11)

The tumor volume limits are stipulated by the minimum and maximum possible diameters. The minimum diameter (the diameter of one cancerous cell) is set to d0 = 0.01 mm (70; 29; 15), while the maximum diameter (the tumor diameter that causes death) is set to dmax = 13 cm (15).

In order to keep the model parameterization to a minimum, so that the model is more flexible and easily handled for the purposes of subsequent analyses (calibration and assessment), we assume the same Gompertz distribution for all tumors irrespective of lung cancer type.
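As a quick sanity check, equations 2.7, 2.9 and 2.11 can be wired together in a short script. The Gompertz parameter values below are illustrative placeholders (not calibrated values), chosen only to satisfy the constraint in eq. 2.10; the thesis implementation is in R, so this Python sketch is merely an analogue:

```python
import numpy as np

d0, dmax = 0.01, 130.0              # diameters in mm (13 cm = 130 mm)
V0 = (np.pi / 6.0) * d0 ** 3        # eq. 2.11: volume of one malignant cell
Vmax = (np.pi / 6.0) * dmax ** 3

m, s = 0.03, 1.2                    # hypothetical Gompertz parameters
assert s > m * np.log(Vmax / V0)    # the constraint of eq. 2.10 (~0.85 here)

def volume(t):
    """Tumor volume at time t after onset (eq. 2.7)."""
    return V0 * np.exp((s / m) * (1.0 - np.exp(-m * t)))

def time_to_volume(V):
    """Inverse Gompertz function (eq. 2.9): time to reach volume V."""
    return -(1.0 / m) * np.log(1.0 - (m / s) * np.log(V / V0))
```

The round trip volume(time_to_volume(V)) returns V, and time_to_volume is defined all the way up to Vmax, which is exactly what the constraint in eq. 2.10 guarantees.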

Disease progression

Disease progression of an existing lung cancer can occur via nodal involvement and distant metastases (70). Current MSMs for lung cancer (70; 33) adopt, in their disease progression components, methodologies developed to describe the progression of breast cancer (26; 86; 59; 110).

Previous studies (59; 58; 102; 103) have shown that, given a Gompertzian tumor growth, the distribution of tumor volumes at specific time points can be adequately described using the log-Normal distribution. In particular, let (Vreg, Treg), (Vdist, Tdist), and (Vdiagn, Tdiagn) be the pairs of tumor volume and age at the beginning of the regional and distant stages, and at the time of diagnosis (clinical detection), respectively. We use the distributions logNormal(µreg, σ²reg), logNormal(µdist, σ²dist), and logNormal(µdiagn, σ²diagn) to simulate the tumor volumes Vreg, Vdist, and Vdiagn, respectively.

In addition, the simulated tumor volumes are subject to the following restrictions:

V0 < Vreg < Vdist < Vmax and V0 < Vdiagn < Vmax (2.12)

Given the tumor volume and its growth rate we can find the time (age) at which the tumor reaches that specific volume. The tumor progression according to the MSM for lung cancer proposed here relies on several key assumptions. First of all, there is a positive correlation between tumor size and the probability of symptomatic detection, i.e., the larger the tumor, the higher the probability that it is clinically detected. The local stage begins when the first malignant cell develops. The transition from the regional to the distant stage is defined to occur when distant metastatic disease first becomes detectable by usual clinical care. In addition, the transition to the distant stage presupposes a tumor already at the regional stage, which in turn develops only after the transition to the local stage. Finally, another very important assumption implied by this model is that there are no large differences in the growth rates and the tumor size and stage distributions across different covariate classes (age-gender-smoking status groups).

The disease progression model also implies that no symptomatic detection is possible due to lymph node involvement or benign lesions, while patients with symptom-detected distant metastases are by assumption M1 (according to the TNM staging system (76)) with probability equal to 1. Furthermore, the conditional distribution of the tumor stage given its size at clinical diagnosis is considered multinomial. When defining the ad-hoc values for the model parameters we use the frequencies of local, regional and distant cancers by size at diagnosis observed in the SEER data, presented in Table 2.3. According to this table there are no large differences between males and females; hence we assume the same tumor volume distributions for the two genders and fit the overall size information.

Survival

Competing risks

In a multi-state model such as the MSM for the natural history of lung cancer presented here, calculation of the survival probabilities is a rather complicated task due to the presence of competing risks. The competing risks issue arises when individuals are subject to risk factors that can cause two or more mutually exclusive events (54). Smoking, for instance, is strongly associated with both lung cancer and other-cause death. Hence, when modeling lung cancer mortality while taking into account risk factors such as age, smoking habits, etc., death from other causes is the competing risk, since it precludes death from lung cancer.

A significant amount of work has been done on the problem of competing risks, a concise summary of which can be found in Moeschberger and Klein (1995). The usual practice is to assume independence among the competing risks and use a conventional non-parametric (e.g., Kaplan-Meier estimator) or semi-parametric (e.g., Cox Proportional Hazards model) method to estimate the survival probabilities. In cases where the independence assumption is not valid, more complicated methods should be applied. The reason is that simple Kaplan-Meier estimators of the net survival probabilities by cause of death are not enough to describe the mortality rates in this setting. Crude probabilities, which express the probability of death from a specific cause after adjusting for the other causes of death, should be used instead. One way of expressing these crude probabilities is by using a cause-specific sub-distribution function, i.e., the Cumulative Incidence Function (CIF).

In the natural history model for lung cancer each person faces the risk of dying from lung cancer (the main event of interest) or of dying from some other cause (the competing risk). In order to express the lung cancer survival probability accounting for the competing risk, we employ the CIF techniques described in Gray (1988), and Fine and Gray (1999), which have also been incorporated in the R package "cmprsk". In particular, let Yi be the number of individuals at risk, li the number of those who died from lung cancer, and oi the number of those who have died from some other cause by time ti, where t1 < t2 < ... denote the distinct times at which a competing risk occurs. In this setting, li + oi is the total number of individuals experiencing any of the competing risks (here death from any cause) at time ti.

The CIF in this case is defined as:

ĈIF(t) = Σ_{ti ≤ t} { Π_{j=1}^{i−1} [1 − (lj + oj)/Yj] } · (li / Yi),  if t1 ≤ t;  ĈIF(t) = 0, otherwise    (2.13)

Note here that, for t1 ≤ t, ĈIF(t) = Σ_{ti ≤ t} (li/Yi) · Ŝ(ti−), where Ŝ(ti−) is the Kaplan-Meier estimator of all-cause survival evaluated just before time ti. Hence, the CIF estimates the probability that the event of interest (death from lung cancer) will occur before time t, and before the occurrence of any competing risk (death from other causes).

Note that, as already mentioned, a very important assumption is that death from lung cancer is unlikely to occur without the prior detection of distant metastases (symptomatic or not). We compute the CIF using combined information from the NHIS and SEER data (section 2.3.1).

Other cause mortality

Given the main covariates of interest, namely age, gender, smoking status (current, former or never smoker), and smoking intensity expressed as the average number of cigarettes smoked per day, we use the non-parametric estimates of the CIF obtained from the observed NHIS data. The MSM simulations depend on the strong assumption that the death patterns observed in these data do not change dramatically over time, and hence are also relevant to the prediction period we are interested in.

Lung cancer mortality

Using the SEER data we obtain non-parametric estimates of the CIF given the individual's characteristics at the time of clinical (symptomatic) detection of lung cancer. In particular, the CIF estimates are grouped by age (5-year age bins), gender, and tumor size (tumor diameter: ≤2 cm, 2-5 cm, and >5 cm) at diagnosis. Given these estimates we can simulate the time to death from lung cancer after the symptomatic detection of lung cancer using an inverse CIF search approach.

2.2.2 Simulation Algorithm

In this section we describe in detail the algorithm we follow in order to run a single micro-simulation, i.e., to predict the lung cancer trajectory of an individual with certain baseline characteristics.

Simulate baseline characteristics

For each person we either have access to, or simulate, some baseline characteristics that are used as input to the model to make predictions. In particular, for each sample for which predictions regarding the course of lung cancer are to be made, we have either the individual records or some information about the distribution of the main covariates of interest, i.e., age, gender and smoking history. The smoking history includes the ages at starting and quitting smoking (where relevant) as well as the smoking intensity, expressed as the average number of cigarettes smoked per day. Given the form of the available information (individual records or overall sample distributions), we simulate the baseline characteristics using the bootstrap method (randomly drawing with replacement from the available data). The set of baseline characteristics stipulates the covariate class g to which each individual belongs.
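The bootstrap draw of baseline covariate sets can be sketched as follows; the toy records and the function name are hypothetical, and the thesis implementation is in R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy records (age, gender, smoking status); in practice these would be
# the available individual records or draws from covariate distributions.
records = [(62, "M", "current"), (55, "F", "never"),
           (70, "M", "former"), (48, "F", "current")]

def bootstrap_baseline(records, n, rng):
    """Draw n baseline covariate sets with replacement (the bootstrap)."""
    idx = rng.integers(0, len(records), size=n)
    return [records[i] for i in idx]

cohort = bootstrap_baseline(records, 1000, rng)
```

Each element of `cohort` is a covariate set g that then drives one micro-simulation.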

Time to death from other causes

→ Draw uo1 ∼ Unif(0, 1) and uo2 ∼ Unif(0, 1).

→ Compare uo1 to the non-parametric estimate ĈIF_g(t) from the NHIS data and find the estimate closest to uo1 in order to specify the time interval during which death from other causes can occur for this person. That is, for {t : min |uo1 − ĈIF(t)|}, we assume that death from a cause other than lung cancer for that person may occur within [t, min{ti : ti > t}].

→ Use uo2 to assign the specific time point (age) at which death occurs within the pre-specified [t, min{ti : ti > t}] interval (key assumption: the time at which death from other causes occurs is uniformly distributed within the pre-specified interval).

Time to the onset of the first malignant cell

Given the baseline covariates we simulate the time (age) to the onset of the first malignant cell (Tmal) based on the exact formulas for the hazard function according to the TSCE model, as described in section 2.2.1. In particular:

→ Draw um1 ∼ Unif(0, 1).

→ Use numerical integration to find the age t such that S(t) = um1, i.e., t = S^(−1)(um1),¹

where S(t) is the survival function (eq. 2.3) and h(t) is the respective hazard rate (eq. 2.1). Depending on the smoking status, the survival probability is given by equations (2.4)-(2.6). For each patient we either have the detailed smoking history, i.e., the exact ages τ1 and τ2, or we can estimate the average ages at starting and quitting smoking from available data (e.g., McMahon (2005)).

¹ Given S(t), we use the "uniroot" function in R to solve exp{−∫_0^t h(x)dx} − um1 = 0 for t, where t is the age at the onset of the first malignant cell in years.
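The inversion step can be sketched as follows. The thesis uses R's uniroot; bisection plays the same role in this Python analogue, and the TSCE parameter values are hypothetical placeholders rather than the estimates taken from Hazelton et al. (40):

```python
import numpy as np

# Hypothetical TSCE parameter values for a never smoker (smoking would
# modify alpha and gamma through the power-law relationships).
nu, mu, X = 1e-7, 1e-7, 1e7      # initiation rate, conversion rate, stem cells
alpha, beta = 10.0, 9.0          # division and apoptosis rates

gamma = alpha - beta - mu
B = 0.5 * (-gamma + np.sqrt(gamma ** 2 + 4.0 * alpha * mu))

def hazard(t):
    """TSCE hazard for the onset of the first malignant cell (eq. 2.1)."""
    e = np.exp((gamma + 2.0 * B) * t)
    return nu * mu * X * (e - 1.0) / (gamma + B * (e + 1.0))

def survival(t, n=4000):
    """S(t) = exp{-H(t)}, with H(t) integrated by the trapezoid rule (eq. 2.3)."""
    x = np.linspace(0.0, t, n)
    y = hazard(x)
    H = 0.5 * np.sum((y[1:] + y[:-1]) * np.diff(x))
    return np.exp(-H)

def draw_onset_age(u, t_max=120.0, tol=1e-6):
    """Invert S(t) = u by bisection (the thesis uses R's uniroot instead)."""
    if survival(t_max) > u:          # onset does not occur by t_max
        return np.inf
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if survival(mid) > u else (lo, mid)
    return 0.5 * (lo + hi)
```

Because S(t) is strictly decreasing, the bisection always converges; a draw u ∼ Unif(0, 1) with S(t_max) > u corresponds to an individual whose first malignant cell never initiates within the prediction horizon.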

Disease progression

Assuming the same parameters for the tumor growth, volume and stage at diagnosis across the covariate classes of interest, we simulate the tumor progression as follows:

→ Draw Vreg ∼ logNormal(µreg, σ²reg) and Vdist ∼ logNormal(µdist, σ²dist).

→ Repeat the previous step until drawing the first pair (Vreg, Vdist) with V0 < Vreg < Vdist < Vmax.

→ Draw Vdiagn ∼ logNormal(µdiagn, σ²diagn) with V0 < Vdiagn < Vmax.

→ Calculate the tumor diameters dreg, ddist, and ddiagn using the sphere volume formula (eq. 2.11).

→ Find the times (ages) treg, tdist, and tdiagn using the inverse Gompertz function (eq. 2.9).

→ Simulate the ages at the beginning of the regional (Treg) and distant (Tdist) stages, as well as the age at diagnosis, given the age at the onset of the first malignant cell (Tmal), as:

– Treg = Tmal + treg

– Tdist = Tmal + tdist

– Tdiagn = Tmal + tdiagn

→ Find the tumor stage at diagnosis by comparing Vdiagn to Vreg and Vdist (or, alternatively, Tdiagn to Treg and Tdist).²
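A minimal sketch of the rejection-sampling steps above, in Python rather than the thesis's R; the log-Normal parameters are hypothetical values on the log-volume (mm³) scale:

```python
import numpy as np

rng = np.random.default_rng(3)

d0, dmax = 0.01, 130.0                   # diameters in mm
V0 = (np.pi / 6.0) * d0 ** 3             # sphere volume formula (eq. 2.11)
Vmax = (np.pi / 6.0) * dmax ** 3

# Hypothetical log-Normal parameters (the thesis calibrates these).
MU_REG, SD_REG = 6.0, 1.0
MU_DIST, SD_DIST = 8.0, 1.0
MU_DIAG, SD_DIAG = 7.0, 1.5

def draw_progression(rng):
    """Rejection-sample tumor volumes satisfying the ordering of eq. 2.12."""
    while True:                          # first pair with V0 < Vreg < Vdist < Vmax
        v_reg = rng.lognormal(MU_REG, SD_REG)
        v_dist = rng.lognormal(MU_DIST, SD_DIST)
        if V0 < v_reg < v_dist < Vmax:
            break
    while True:                          # V0 < Vdiagn < Vmax
        v_diag = rng.lognormal(MU_DIAG, SD_DIAG)
        if V0 < v_diag < Vmax:
            break
    # Stage at diagnosis: compare Vdiagn with Vreg and Vdist.
    if v_diag < v_reg:
        stage = "local"
    elif v_diag < v_dist:
        stage = "regional"
    else:
        stage = "distant"
    return v_reg, v_dist, v_diag, stage

v_reg, v_dist, v_diag, stage = draw_progression(rng)
```

The accepted volumes would then be converted to diameters via eq. 2.11 and to ages via the inverse Gompertz function (eq. 2.9), exactly as in the bulleted steps.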

Time to death from lung cancer

Given the age (Tdiagn), tumor size (ddiagn) and tumor stage at diagnosis, we can simulate the time to death from lung cancer using the non-parametric estimates ĈIF(t, g) obtained for the CIF from the SEER data, as follows:

→ Draw ul1 ∼ Unif(0, 1) and ul2 ∼ Unif(0, 1).

→ Compare ul1 to the non-parametric estimate ĈIF(t, g) from the SEER data, and find the estimate closest to ul1 in order to specify the time interval during which death from lung cancer can occur for this person.

→ Use ul2 to assign the specific time point (age) at which death occurs within the pre-specified time interval³ (key assumption: the time at which death from lung cancer occurs is uniformly distributed within the pre-specified interval).

Comparing the simulated times resulting from the aforementioned simulation procedure, we "tell the story" for each individual with certain characteristics regarding the development and course of lung cancer in his or her lifetime. This "story" is the predicted individual trajectory resulting from one complete micro-simulation. Table 2.1 recapitulates the main steps of the simulation algorithm followed in order to predict the trajectory of an individual with certain baseline characteristics.

² The decision about the quantities compared for the specification of the tumor stage at diagnosis may be very important when, for example, improvement of the algorithm's efficiency is a key issue, as is the case with the MSM's calibration (chapter 3).

³ The length of the pre-specified time intervals varies, and is related to the discontinuities in the non-parametric estimate of the CIF.

Table 2.1: Continuous time MSM for lung cancer: simulation algorithm to predict the lung cancer trajectory of an individual.

1. Simulate baseline characteristics g=(age, gender, smoking history).1

2. Simulate age to death (Td other) from a cause other than lung cancer given age, gender, and smoking status.

3. Simulate age to the onset of the first malignant cell (Tmal), given gender, smoking status, smoking history (age at starting and quitting smoking), and smoking intensity.

4. Simulate ages at the beginning of the regional (Treg) and the distant stage (Tdist) given the tumor growth rate.

5. Simulate age (Tdiagn) and tumor diameter (ddiagn) at diagnosis. Find tumor stage by comparing Tdiagn with Treg and Tdist.

6. Simulate age to death from lung cancer (Td lung) given the simulated individual's characteristics at diagnosis (Tdiagn and tumor stage).

7. Compare the simulated ages Td other, Tmal, Treg, Tdist, Tdiagn, and Td lung to "tell" a story for the specific individual with covariate set g, i.e., to predict that individual's trajectory.

1 Smoking history includes: smoking status (never, former or current smoker), and smoking intensity (average number of cigarettes smoked per day)

2.2.3 Software

To enhance transparency, the model is developed in the open source statistical software R (version 2.15.2). A comprehensive R code describes the model structure (the set of transition rules and assumptions). Given the model parameters (either ad-hoc or calibrated values), for an individual with specific characteristics (a set of covariate values) the model stipulates the times of the transitions to each state. Combining all the simulated times together gives the predicted trajectory of this specific individual with regard to the development of lung cancer.

Handling random numbers

The implementation of the large number of simulations required for the evaluation of a complex process using micro-simulation modeling necessitates special consideration and treatment of the massive quantity of random numbers generated. For this purpose we use the methodology described in Leydold and J. (2005) for generating independent streams of random numbers for stochastic simulations, which was motivated by the work on the object-oriented random number generator (RNG) with streams and substreams presented in L'Ecuyer et al. (2002). The adoption of this methodology, among other things, guarantees the generation of "statistically independent" streams, i.e., independent random numbers despite the enormous quantity of random numbers produced, thus avoiding unintended correlations between the several parts of the simulation algorithm. For the implementation of this methodology, we use the built-in functions of the "rlecuyer" R package.
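The same idea of dedicating one independent stream to each part of the simulation algorithm can be illustrated outside R. The sketch below uses NumPy's `SeedSequence.spawn` mechanism, which is analogous to (but not the same API as) the L'Ecuyer streams provided by `rlecuyer`; the stream names are assumptions for illustration:

```python
import numpy as np

# One root seed for the whole experiment; spawn() derives statistically
# independent child streams (same idea as L'Ecuyer-style streams/substreams).
root = np.random.SeedSequence(20130913)
child_seeds = root.spawn(3)

# Dedicate one generator per part of the simulation algorithm, so that e.g.
# adding draws in the mortality part never shifts the tumor-growth part.
rng_mortality, rng_onset, rng_growth = (np.random.default_rng(s)
                                        for s in child_seeds)

sample = rng_onset.normal(size=5)  # draws from the 'onset' stream only
```

Because spawning is deterministic given the root seed, the whole experiment is reproducible while the streams remain mutually independent.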

2.3 Application

2.3.1 Ad-hoc values for model parameters

The MSM for lung cancer we propose here comprises a set of parameters representing both latent and observable variables, as well as parameters describing the distribution of certain characteristics of the underlying process. Typically, the stipulation of MSM parameters involves extensive calibration procedures (chapter II). The goal of this section is simply to exemplify the model's performance by running the MSM under hypothetical scenarios. For this purpose we use ad-hoc point estimates for the model parameters. We describe below the determination of those ad-hoc values, which can be used as model inputs to run micro-simulations and predict individual trajectories of lung cancer patients.

Onset of the first malignant cell

Several studies have tried to elucidate the biological process of lung carcinogenesis by fitting the TSCE model to real data (75; 64; 41; 40; 72). As ad-hoc values for the TSCE model parameters we use the point estimates reported in Hazelton et al. (40), resulting from the analysis of the second Cancer Prevention Study (CPS II). Table 2.2 provides the complete list of parameters related to the specification of the age at the onset of the first malignant cell, gives the ad-hoc values (point estimates along with 95% CIs) used for some of these, and indicates the type and order of calculations used for the determination of the rest.

Tumor growth and disease progression

The ad-hoc values for the location and scale parameters of the logNormal distribution describing tumor volume at clinical detection come from the Koscielny et al. (1985) study. This paper studies the initiation of distant metastasis in breast cancer. In particular, it compares two different patterns of tumor growth, an exponential and a Gompertzian one, with respect to their fit to available data on distributions of tumor volume at diagnosis, as well as tumor doubling times. Results from this paper agree with findings from previous studies (103) indicating that tumor growth in humans is better described by the Gompertz function than by an exponential curve.

The relationship between the Gompertz distribution parameters (m, s) describing the tumor growth results from the restriction related to the definition of the inverse Gompertz function (eq. 2.9) for the specification of the age t at which the tumor reaches size V(t). According to this:

1 − (m/s)·log(V(t)/V0) > 0  ⇒  s > m·log(V(t)/V0), ∀ V(t)  ⇒  s > m·log(Vmax/V0)   (2.14)

Given the tumor volume at diagnosis (Vdiagn) we can calculate the age (Tdiagn) at which the tumor reached this volume, again using (2.9). The doubling time as a function of the age Tdiagn is:

DT = −(1/m)·log[1 − (m/s)·log(2)·exp(m·Tdiagn)]   (2.15)

For m = 0.00042 and s = 31·m, the doubling time is close to the observed one recorded in previous studies (70), while (2.14) is satisfied. Finally, the logNormal location and scale parameters for Vreg and Vdist are specified so as to reproduce distributions of tumor stage at diagnosis by size similar to what has been observed in the SEER data (Table 2.3).
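Constraint (2.14) and the doubling time (2.15) are easy to check numerically for the ad-hoc values m = 0.00042 and s = 31·m. The Python sketch below (the dissertation's computations are in R) takes Vmax/V0 from the diameters d0 = 0.01 mm and dmax = 130 mm listed in Table 2.4; the function name is an illustrative assumption:

```python
import math

m = 0.00042
s = 31 * m

# Volume ratio V_max / V_0 from the diameters in Table 2.4; with
# v = (pi/6) d^3 the pi/6 factor cancels in the ratio.
d0, dmax = 0.01, 130.0            # mm
v_ratio = (dmax / d0) ** 3

# Constraint (2.14): s must exceed m * log(V_max / V_0).
assert s > m * math.log(v_ratio)

def doubling_time(t_diagn, m=m, s=s):
    """Tumor volume doubling time (eq. 2.15) as a function of age at diagnosis."""
    return -(1.0 / m) * math.log(1.0 - (m / s) * math.log(2.0)
                                 * math.exp(m * t_diagn))

dt = doubling_time(70.0)          # doubling time for diagnosis at age 70
```

Note that the doubling time grows with Tdiagn, consistent with the decelerating Gompertz growth curve.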

Mortality data

Estimates of lung cancer and other-cause mortality rates are based on data from two major sources: the National Health Interview Survey (NHIS) and the Surveillance, Epidemiology, and End Results (SEER) data, respectively.

Both databases are representative of the US population and constitute the main source of information about baseline characteristics, health risk factors, as well as incidence and mortality rates in the entire population. The NHIS is a national cross-sectional survey that has monitored national health patterns since 1957. NHIS collects data on several demographic characteristics, risk factors, and health statuses of the US population. It also provides information about the age and cause of death. From the large pool of available NHIS data we worked with the Integrated Health Interview Series (IHIS) harmonized set of data. The IHIS variables are given consistent codes and have been thoroughly documented to facilitate cross-temporal comparisons. The SEER program provides information on cancer statistics in an effort to reduce the burden of cancer among the US population. In particular, SEER data record information on cancer incidence and mortality rates by certain demographic characteristics for a geographic sample representing 28 percent of the US population, since 1973.

We based our estimates of lung cancer incidence and mortality on SEER data covering the interval from 1973 to 2008 and on IHIS data from 1986 to 2004.

The model is structured so as to predict the main events of interest, i.e., lung cancer incidence and mortality, based on the gender, age, and smoking history of a person, including the average ages at starting and quitting smoking as well as the average smoking intensity. The NHIS data only provide information about the age, gender, smoking, and, when relevant, cause of death. Information about smoking, in particular, includes, for current smokers at the time of the study, the number ("heaviest amount") of cigarettes smoked per day, grouped in four categories, namely "less than 15", "15-24", "25-34", and "35 or more" cigarettes. On the other hand, the SEER data also record the age and gender, while in addition they provide information regarding the age, tumor size, and stage at clinical diagnosis as well as the age and cause of death. Therefore we need a way to combine the information coming from these two datasets, both representative of the US population, in order to simulate the time and cause of death given the age, gender, smoking history, and tumor stage at diagnosis. As already mentioned, the cause of death is classified as lung cancer or other cause.

Table 2.2: Ad-hoc values and calculations for the MSM parameters related to the onset of the first malignant cell.

Parameter   Type         Males                                 Females
X           fixed        10^7                                  10^7
v0          fixed        7.16·10^-8 (4.6·10^-8, 1.21·10^-7)    1.07·10^-7 (6.97·10^-8, 1.62·10^-7)
α0          fixed        7.7 (6.45, 12.99)                     15.82 (13.39, 42.12)
γ0          fixed        0.09 (0.071, 0.106)                   0.071 (0.055, 0.088)
v1          fixed        0.00 (0.00, 1.76)                     0.02 (0.00, 12.5)
α1          fixed        0.6 (0.43, 0.91)                      0.5 (0.27, 0.86)
α2          fixed        0.22 (0.12, 0.30)                     0.32 (0.14, 0.40)
v           calculated   v0·(1 − v1)
γ           calculated   γ0·(1 + α1·[q(t)]^α2)
α           calculated   α0·(1 + α1·[q(t)]^α2)
µ0          calculated   v0
µ           calculated   µ0
β0          calculated   α − µ − γ
Point estimates (with 95% CIs) are extracted from the analysis of the CPS II study data (40).
Hazard function: h(t) = [ν·µ·X·(e^((γ+2B)t) − 1)] / [γ + B·(e^((γ+2B)t) + 1)], where B = (1/2)·(−γ + √(γ² + 4αµ)).
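The TSCE hazard in the footnote of Table 2.2 can be evaluated directly. Below is a hedged Python sketch (the model itself is in R) using the male ad-hoc values; treating a never smoker as q(t) = 0, so that γ = γ0 and α = α0, and ν = µ = v0 because v1 = 0, is an assumption made for illustration:

```python
import math

def tsce_hazard(t, nu, mu, X, gamma, alpha):
    """TSCE hazard h(t) as given in the footnote of Table 2.2."""
    B = 0.5 * (-gamma + math.sqrt(gamma ** 2 + 4.0 * alpha * mu))
    e = math.exp((gamma + 2.0 * B) * t)
    return (nu * mu * X * (e - 1.0)) / (gamma + B * (e + 1.0))

# Male ad-hoc values; never smoker assumed to imply q(t) = 0,
# hence gamma = gamma0, alpha = alpha0, and nu = mu = v0 (v1 = 0).
v0, X, gamma0, alpha0 = 7.16e-8, 1e7, 0.09, 7.7
h70 = tsce_hazard(70.0, nu=v0, mu=v0, X=X, gamma=gamma0, alpha=alpha0)
```

Under these values the hazard increases with age, as expected for the onset of the first malignant cell.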

Table 2.3: Tumor stage by size at diagnosis (SEER data).

Overall
  Size     local        regional     distant       Total
  ≤ 2cm    6031 (48%)   2705 (21%)   3868 (31%)    12604
  2-5cm    7050 (24%)   8348 (29%)   13894 (47%)   29292
  ≥ 5cm    1387 (9%)    4803 (29%)   10112 (62%)   16302

Males
  ≤ 2cm    2518 (44%)   1238 (22%)   1957 (34%)    5713
  2-5cm    3445 (23%)   4352 (29%)   7228 (48%)    15025
  ≥ 5cm    810 (8%)     2921 (31%)   5857 (61%)    9588

Females
  ≤ 2cm    3513 (51%)   1467 (21%)   1911 (28%)    6891
  2-5cm    3605 (25%)   3996 (28%)   6701 (47%)    14302
  ≥ 5cm    577 (9%)     1882 (28%)   4255 (63%)    6714

Table 2.4 provides a complete list of the ad-hoc values for the MSM parameters related to tumor growth and disease progression. As already mentioned in the description of the simulation procedure for the specific parts of the model, non-parametric estimates of the CIF from the NHIS and SEER data are used as fixed model inputs.

Table 2.4: Ad-hoc values and calculations for the model parameters related to lung cancer progression.

Quantity                                                  Value
Tumor growth
  Diameter of one malignant cell*                         d0 = 0.01 mm
  Maximum tumor diameter*                                 dmax = 130 mm
  Tumor volume of diameter d**                            v = (π/6)·d³
  Parameters of the Gompertz distribution**               m = 0.00042, s = 31·m
Disease progression**
  Parameters of the logNormal distribution for tumor      µreg = 1.1, σreg = 1.1
  volume at the beginning of the regional stage
  Parameters of the logNormal distribution for tumor      µdist = 2.8, σdist = 2.8
  volume at the beginning of the distant stage
  Parameters of the logNormal distribution for tumor      µdiagn = 3.91, σdiagn = 3.91
  volume at diagnosis
* Values stipulated from the lung cancer literature.
** Values specified by the modeler to match data.
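The volume formula v = (π/6)·d³ of Table 2.4 is applied in both directions when converting between simulated tumor volumes and the diameters reported at diagnosis. A small Python sketch (the helper names are illustrative; the model itself is in R):

```python
import math

def volume_from_diameter(d):
    """Tumor volume v = (pi/6) * d^3 of a sphere with diameter d (Table 2.4)."""
    return (math.pi / 6.0) * d ** 3

def diameter_from_volume(v):
    """Inverse mapping: diameter of the sphere with volume v."""
    return (6.0 * v / math.pi) ** (1.0 / 3.0)

v = volume_from_diameter(20.0)   # a 2 cm tumor, working in mm
```

The round trip recovers the original diameter, so either quantity can serve as the model's internal state.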

2.3.2 MSM output - Examples

In this section we present predictions from multiple runs of the MSM under different scenarios. The focus is on lung cancer incidence and mortality for people who are 65 years old at the beginning of the prediction period, which covers the entire remaining lifespan. We compare MSM outputs between males and females, for never, former, and current smokers separately. For current smokers we also compare results for three different average smoking intensities, i.e., 10, 30, and 50 cigarettes per day. Furthermore, for former smokers we include comparisons for different ages at quitting smoking, i.e., 40, 50, and 60 years old. Current and former smokers are assumed to have started smoking at the age of 20 years.

For each of these cases, we present the distributions (mean, standard deviation, median, quartiles, minimum and maximum values) of the ages at the major states in the lung cancer course, namely the age (T mal) at the onset of the first malignant cell, which also indicates the beginning of the local stage, the ages at the beginning of the regional (T reg) and distant (T dist) stages, the age at diagnosis (T diagn), and the age of death from lung cancer (T death). These age distributions pertain to people for whom the model predicted development of, and death from, lung cancer. We indicatively present the distributions for the aforementioned characteristic scenarios, highlighting the effect of gender and smoking on the development of and death from lung cancer. Lung cancer mortality is depicted using survival curves. In addition, we report estimates of the probabilities of lung cancer death (P̂d) and diagnosis (P̂diagn).

All the results presented in this section are based on sets of 100,000 micro-simulations for each scenario.
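Because each reported probability is a proportion over 100,000 independent micro-simulations, its Monte Carlo standard error follows from the binomial variance. A short sketch (assuming, as the tables suggest, that P̂d is simply the fraction of simulated individuals who die of lung cancer; the helper name is illustrative):

```python
import math

def mc_proportion_se(p_hat, n):
    """Monte Carlo standard error of a proportion estimated from n runs."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)

# e.g. the never-smoker male estimate P_d = 0.218% over 100,000 runs
se = mc_proportion_se(0.00218, 100_000)
```

With n = 100,000, even a probability as small as 0.2% is estimated with a standard error of roughly 0.015 percentage points, so the male/female contrasts reported below are well beyond simulation noise.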

Never smokers

Tables 2.5 and 2.6 compare lung cancer mortality (P̂d) and the distributions of times to each of the main lung cancer states of the MSM developed in this chapter, between males and females who have never smoked. According to these tables, men have a higher (almost double) probability of dying from lung cancer (0.218%) than women (0.120%). Overall, the distributions of the predicted times are very similar for the two genders, although slightly shifted to earlier ages for women. That is to say, for those cases for which the model predicted death from lung cancer, all the events of main interest in the lung cancer course occurred at younger ages for the women than for the men in the examples. These findings are in agreement with recent findings on lung cancer incidence and mortality in never smokers (116), indicating that women are more likely than men to have non-smoking-associated lung cancer. Figure 2.2 confirms the small difference in lung cancer survival between the two genders.

Table 2.5: Male, 65 years old, never smoker (P̂d = 0.218%).
           Mean ± SD      Min     Q1      Median  Q3      Max
T mal      66.98 ± 8.50   44.42   60.59   66.93   73.88   85.38
T reg      74.49 ± 8.62   50.53   68.14   74.42   81.39   93.44
T dist     75.79 ± 8.53   52.31   69.51   75.52   82.59   94.48
T diagn    76.76 ± 8.08   51.88   70.58   76.88   83.39   92.26
T death    78.48 ± 7.49   65.15   72.47   78.56   84.63   92.77

Table 2.6: Female, 65 years old, never smoker (P̂d = 0.120%).
           Mean ± SD      Min     Q1      Median  Q3      Max
T mal      64.67 ± 10.32  37.44   58.48   65.26   72.72   83.55
T reg      72.22 ± 10.20  44.78   65.63   72.57   80.63   91.14
T dist     73.51 ± 10.33  47.25   66.65   73.83   81.59   92.31
T diagn    74.65 ± 10.51  45.26   68.85   75.26   82.54   92.12
T death    78.25 ± 7.62   65.15   71.74   78.40   84.52   92.37

Figure 2.2: MSM predicted lung cancer survival for non-smokers, 65 years old.

Current smokers

In all the working examples for current smokers we examine different scenarios (depending on the average smoking intensity, i.e., 10, 30, and 50 cigarettes per day) for a person 65 years old who started smoking at the age of 20. Tables 2.7 and 2.8 present the results for a male and a female, respectively. As expected (99; 47), we observe overall higher proportions of predicted lung cancer deaths for males than for females. These proportions also increase with smoking intensity, namely the heavier the smoker, the more probable the development of and death from lung cancer. In addition, the entire course of the lung cancer is shifted towards earlier ages as the average smoking intensity increases, i.e., the onset of the local, regional, and distant stages, as well as the diagnosis and finally the death from lung cancer, occur at younger ages for heavy smokers.

Table 2.7: Male, 65 years old, current smoker, started smoking at age 20.

           Mean ± SD      Min     Q1      Median  Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 6.91%)
T mal      65.87 ± 9.05   33.96   59.20   65.14   72.21   92.10
T reg      73.37 ± 9.08   41.07   66.61   72.65   79.83   98.56
T dist     74.71 ± 9.07   42.24   68.01   74.01   80.98   100.80
T diagn    75.97 ± 9.12   44.46   69.22   75.33   82.56   99.43
T death    78.93 ± 8.40   65.01   71.95   78.28   85.25   99.84
Average smoking intensity: 30 cigarettes per day (P̂d = 8.75%)
T mal      63.98 ± 9.19   32.32   57.39   62.84   70.04   92.49
T reg      71.47 ± 9.23   38.02   64.86   70.35   77.60   100.60
T dist     72.82 ± 9.20   41.94   66.20   71.69   78.80   101.40
T diagn    74.08 ± 9.22   44.10   67.63   73.07   80.28   99.39
T death    77.41 ± 8.39   65.00   70.35   76.18   83.45   99.82
Average smoking intensity: 50 cigarettes per day (P̂d = 9.30%)
T mal      62.99 ± 9.25   34.19   56.67   61.85   68.98   93.90
T reg      70.49 ± 9.26   40.15   64.10   69.35   76.47   102.60
T dist     71.83 ± 9.27   42.23   65.51   70.64   77.80   102.60
T diagn    73.15 ± 9.39   44.13   66.87   72.10   79.28   99.56
T death    76.85 ± 8.34   65.00   69.79   75.50   82.62   99.66

No large differences are noted in the time distributions between males and females of the same smoking intensity group (Tables 2.7 and 2.8).

Plots in figure 2.3 verify the difference in lung cancer survival between the two genders. In addition, according to these plots, the survival probability decreases with increasing average smoking intensity from 10 to 30 cigarettes/day, while it remains almost unchanged between 30 and 50 cigarettes/day.

Table 2.8: Female, 65 years old, current smoker, started smoking at age 20.

           Mean ± SD      Min     Q1      Median  Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 3.53%)
T mal      66.59 ± 9.34   33.07   60.00   66.25   73.23   92.44
T reg      74.09 ± 9.34   40.89   67.42   73.74   80.90   99.66
T dist     75.42 ± 9.35   42.32   68.81   75.08   82.13   101.40
T diagn    76.60 ± 9.40   44.10   69.95   76.43   83.30   98.99
T death    79.74 ± 8.43   65.01   72.91   79.17   86.05   99.63
Average smoking intensity: 30 cigarettes per day (P̂d = 5.27%)
T mal      63.85 ± 9.54   34.13   57.25   63.00   70.37   94.28
T reg      71.35 ± 9.56   41.20   64.75   70.56   77.88   102.40
T dist     72.71 ± 9.57   42.23   66.09   71.84   79.19   103.30
T diagn    73.99 ± 9.76   44.14   67.39   73.34   80.71   99.68
T death    77.66 ± 8.40   65.00   70.59   76.54   83.61   99.83
Average smoking intensity: 50 cigarettes per day (P̂d = 5.95%)
T mal      62.72 ± 9.96   31.79   56.21   61.77   69.35   92.33
T reg      70.24 ± 9.95   39.15   63.70   69.20   76.86   99.04
T dist     71.59 ± 9.95   41.15   65.02   70.57   78.14   101.10
T diagn    72.81 ± 10.2   44.14   66.42   71.98   79.48   99.55
T death    77.23 ± 8.43   65.00   70.12   75.85   83.28   99.81

Figure 2.3: MSM predicted lung cancer survival for current smokers, 65 years old.

Former smokers

In the examples for former smokers, we investigate the effect of smoking intensity (10, 30, and 50 cigarettes smoked per day on average) and of the age at quitting smoking (40, 50, and 60 years old) on the lung cancer course of a male (tables 2.9 to 2.11) and a female (tables 2.12 to 2.14), each 65 years old and having started smoking at the age of 20. As in the case of current smokers, the predicted proportions of lung cancer deaths are higher for men compared to women in the same smoking category. This tendency is verified in several observational studies on lung cancer (99; 47): men with exactly the same characteristics are in general more susceptible to lung cancer than women. Furthermore, we observe a positive correlation between the predicted probability of death from lung cancer and the duration of smoking. This correlation is more pronounced in heavier smokers (higher average smoking intensity). Similar patterns hold for women. No large differences were found in the predicted times to the main events of interest between males and females with the same characteristics.

Noteworthy is the fact that lung cancer survival for people (men or women) who smoked for only 20 years (i.e., started and quit smoking at ages 20 and 40, respectively) is very similar to lung cancer survival for non-smokers. Furthermore, the negative effect of smoking on lung cancer survival is more prominent for longer durations of smoking.

Survival plots in figure 2.4 confirm the similarity of the survival curves of former smokers who started and quit smoking at ages 20 and 40, respectively, with those of non-smokers. These plots also confirm the deteriorating effect smoking has on lung cancer survival, which becomes more pronounced as the total number of years of smoking, as well as the average number of cigarettes smoked per day, increases.

Table 2.9: Male, 65 years old, former smoker, starting and quitting smoking at 20 and 40 years old respectively.

           Mean ± SD      Min     Q1      Q2      Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 0.23%)
T mal      63.92 ± 12.47  32.49   57.78   65.11   72.95   84.82
T reg      71.45 ± 12.37  39.94   65.82   72.56   80.55   92.14
T dist     72.76 ± 12.49  40.62   66.47   74.38   81.86   93.17
T diagn    73.56 ± 12.30  44.21   67.43   75.59   82.16   93.54
T death    77.22 ± 8.10   65.08   69.55   76.70   83.85   93.78
Average smoking intensity: 30 cigarettes per day (P̂d = 0.26%)
T mal      61.23 ± 14.11  30.96   53.80   64.36   71.72   84.44
T reg      68.76 ± 14.14  39.16   62.01   72.13   79.02   92.74
T dist     70.04 ± 14.03  40.43   63.49   73.24   80.93   93.26
T diagn    71.08 ± 14.25  44.22   65.14   74.06   81.68   92.78
T death    76.83 ± 8.15   65.06   69.49   75.85   83.53   93.17
Average smoking intensity: 50 cigarettes per day (P̂d = 0.27%)
T mal      59.01 ± 14.81  33.45   39.80   61.99   70.86   85.32
T reg      66.51 ± 14.75  39.52   47.74   69.47   78.45   92.67
T dist     67.79 ± 14.80  41.58   49.02   70.95   79.82   94.05
T diagn    68.81 ± 14.88  44.10   51.97   72.72   80.40   92.44
T death    75.74 ± 7.91   65.02   68.53   74.51   81.66   92.99

Table 2.10: Male, 65 years old, former smoker, starting and quitting smoking at 20 and 50 years old respectively.

           Mean ± SD      Min     Q1      Q2      Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 0.37%)
T mal      56.27 ± 12.93  35.60   46.00   49.98   68.16   84.39
T reg      63.75 ± 12.86  42.85   53.53   58.59   75.43   92.62
T dist     65.10 ± 12.81  44.38   54.96   59.60   76.88   93.01
T diagn    65.66 ± 13.30  44.60   54.37   64.74   77.83   92.42
T death    75.65 ± 6.97   65.07   70.01   75.24   80.08   92.48
Average smoking intensity: 30 cigarettes per day (P̂d = 0.54%)
T mal      52.38 ± 11.84  32.84   44.76   48.53   59.58   84.69
T reg      59.93 ± 11.93  40.85   52.08   55.98   66.87   92.34
T dist     61.30 ± 11.86  41.40   53.60   57.12   68.30   95.15
T diagn    61.92 ± 12.41  44.43   52.54   57.46   69.80   92.89
T death    74.78 ± 6.52   65.04   69.67   74.41   78.70   93.11
Average smoking intensity: 50 cigarettes per day (P̂d = 0.70%)
T mal      50.05 ± 10.65  32.26   43.84   47.21   49.86   84.42
T reg      57.75 ± 10.68  40.57   51.40   54.82   58.59   93.35
T dist     58.99 ± 10.68  40.79   52.52   56.33   59.25   95.33
T diagn    59.29 ± 11.28  44.22   51.38   55.80   64.72   91.19
T death    74.17 ± 5.91   65.01   69.40   74.03   78.01   92.26

Table 2.11: Male, 65 years old, former smoker, starting and quitting smoking at 20 and 60 years old respectively.

           Mean ± SD      Min     Q1      Q2      Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 2.04%)
T mal      56.45 ± 5.74   30.93   54.19   56.79   58.72   84.85
T reg      64.00 ± 5.84   38.41   61.56   64.29   66.24   92.80
T dist     65.33 ± 5.78   41.39   62.96   65.57   67.57   94.42
T diagn    66.84 ± 6.21   44.09   64.79   66.89   69.37   92.79
T death    71.48 ± 6.08   65.01   67.23   69.33   73.28   93.08
Average smoking intensity: 30 cigarettes per day (P̂d = 3.33%)
T mal      55.57 ± 5.45   31.83   53.43   56.20   58.27   84.73
T reg      63.11 ± 5.48   40.94   60.89   63.62   65.73   93.42
T dist     64.46 ± 5.49   41.27   62.26   65.04   67.02   95.02
T diagn    66.04 ± 6.11   44.26   64.38   66.55   69.04   93.36
T death    71.05 ± 5.75   65.00   67.02   69.21   72.68   93.47
Average smoking intensity: 50 cigarettes per day (P̂d = 3.93%)
T mal      55.07 ± 5.77   33.56   52.85   55.89   58.08   85.28
T reg      62.60 ± 5.80   40.81   60.32   63.34   65.58   91.77
T dist     63.95 ± 5.78   43.24   61.61   64.66   66.94   94.25
T diagn    65.53 ± 6.55   44.13   64.04   66.37   68.87   92.21
T death    71.14 ± 5.78   65.00   67.02   69.19   72.99   92.37

Table 2.12: Female, 65 years old, former smoker, starting and quitting smoking at 20 and 40 years old respectively.

           Mean ± SD      Min     Q1      Q2      Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 0.17%)
T mal      62.70 ± 12.12  34.02   56.86   63.35   72.07   82.97
T reg      70.25 ± 12.17  40.38   63.76   71.31   79.84   91.54
T dist     71.56 ± 12.20  44.28   64.84   72.03   81.12   92.89
T diagn    72.23 ± 12.27  44.40   66.72   73.99   82.43   92.56
T death    76.93 ± 7.85   65.09   69.67   75.81   84.00   93.45
Average smoking intensity: 30 cigarettes per day (P̂d = 0.19%)
T mal      60.38 ± 13.62  33.13   53.55   62.48   71.32   85.05
T reg      67.93 ± 13.52  39.81   60.69   69.60   78.71   91.77
T dist     69.21 ± 13.45  42.34   62.65   70.98   80.07   93.86
T diagn    70.42 ± 13.57  44.33   64.65   72.39   81.60   91.65
T death    76.23 ± 7.92   65.16   69.34   74.61   83.36   92.25
Average smoking intensity: 50 cigarettes per day (P̂d = 0.24%)
T mal      55.47 ± 15.15  31.80   38.86   58.37   69.26   82.28
T reg      63.10 ± 14.99  40.25   46.75   65.66   76.62   89.58
T dist     64.44 ± 15.01  41.03   48.12   67.09   77.73   90.86
T diagn    65.51 ± 15.37  44.12   48.10   68.82   78.87   91.66
T death    74.28 ± 7.45   65.04   67.78   72.33   80.72   91.91

Table 2.13: Female, 65 years old, former smoker, starting and quitting smoking at 20 and 50 years old respectively.

           Mean ± SD      Min     Q1      Q2      Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 0.24%)
T mal      56.67 ± 12.29  32.71   46.66   54.35   66.48   85.00
T reg      64.28 ± 12.24  40.34   54.11   61.94   73.96   92.43
T dist     65.67 ± 12.24  41.13   55.42   63.54   75.32   94.96
T diagn    66.21 ± 12.82  44.24   54.70   65.43   76.50   92.51
T death    75.72 ± 6.82   65.11   70.37   75.19   79.83   92.61
Average smoking intensity: 30 cigarettes per day (P̂d = 0.37%)
T mal      52.59 ± 12.07  32.50   44.02   48.34   61.33   85.63
T reg      60.10 ± 12.02  40.52   51.48   55.92   68.72   93.58
T dist     61.45 ± 12.08  40.93   53.02   57.03   70.03   94.06
T diagn    61.97 ± 12.69  44.09   52.13   57.28   72.62   93.68
T death    74.43 ± 6.51   65.00   68.81   74.06   78.27   93.68
Average smoking intensity: 50 cigarettes per day (P̂d = 0.58%)
T mal      50.30 ± 11.44  34.03   42.40   47.11   54.07   84.17
T reg      57.88 ± 11.45  40.30   49.91   54.73   61.67   92.65
T dist     59.14 ± 11.44  42.00   51.28   56.00   63.38   93.55
T diagn    59.68 ± 12.12  44.12   50.60   55.71   65.86   93.61
T death    73.65 ± 6.17   65.03   68.67   73.24   77.43   93.72

Table 2.14: Female, 65 years old, former smoker, starting and quitting smoking at 20 and 60 years old respectively.

           Mean ± SD      Min     Q1      Q2      Q3      Max
Average smoking intensity: 10 cigarettes per day (P̂d = 1.01%)
T mal      56.64 ± 6.41   35.88   54.16   56.70   58.84   85.77
T reg      64.14 ± 6.39   43.36   61.33   64.21   66.39   91.84
T dist     65.49 ± 6.42   44.03   62.83   65.53   67.63   93.95
T diagn    66.85 ± 6.90   44.46   64.59   66.89   69.59   92.70
T death    71.63 ± 6.11   65.00   67.28   69.60   73.80   93.24
Average smoking intensity: 30 cigarettes per day (P̂d = 1.94%)
T mal      55.09 ± 6.13   32.88   53.07   55.93   58.24   83.79
T reg      62.61 ± 6.13   41.27   60.30   63.37   65.69   91.65
T dist     63.92 ± 6.12   42.08   61.81   64.76   66.99   92.46
T diagn    65.44 ± 6.89   44.15   63.58   66.45   68.90   92.70
T death    71.27 ± 5.60   65.01   67.34   69.49   73.30   93.07
Average smoking intensity: 50 cigarettes per day (P̂d = 2.49%)
T mal      54.38 ± 6.33   29.45   51.98   55.50   57.86   83.55
T reg      61.98 ± 6.38   36.55   59.36   63.06   65.40   93.28
T dist     63.29 ± 6.34   37.82   60.85   64.36   66.71   93.38
T diagn    64.74 ± 7.13   44.14   62.34   65.98   68.62   91.61
T death    71.34 ± 5.79   65.00   67.01   69.50   73.57   92.41

Figure 2.4: MSM predicted lung cancer survival for former smokers, 65 years old, starting smoking at age 20.

2.4 Discussion

The main purpose of the natural history MSM for lung cancer developed in this thesis is to serve as a tool for exploring the statistical properties of micro-simulation models in general. To this end we developed a simplified yet valid model that follows current practices in micro-simulation modeling while adequately describing the natural history of the disease.

The MSM aims at combining some of the best practices currently followed in this domain, while remaining simple enough to serve as an efficient tool for the exploration of the statistical properties of this type of model. It is a continuous time MSM, namely one in which events can take place at any time point. Depending on the degree of discretization, this assumption is often more reasonable than the very restrictive one of fixed time lengths imposed by a discrete time MSM.

Furthermore, it combines some of the most widely used models for the description of several distinctive stages of the natural history of lung cancer, including both biological and epidemiological models. More specifically, it uses the biological Two Stage Clonal Expansion (TSCE) model (75) to describe the risk for the onset of the first malignant cell. In particular, the model employs the exact solutions for the expression of the hazard rates and the survival probabilities. Moolgavkar and Luebeck (74) comment on the inaccuracy of the approximations, which can lead to serious data misinterpretation, and emphasize the need to use the exact solutions instead.

Moreover, the model employs the Gompertz function to simulate tumor growth. Several studies have shown that this function fits available data well; hence it is preferable for simulating this process compared to other curves found in the literature (e.g., exponential). Finally, the model breaks down the time from the local stage to death (lag time) into three sub-intervals (local to regional, regional to distant, distant to death) instead of assuming, e.g., a fixed or Gamma-distributed lag time (40; 72). This approach enables a more detailed representation of the natural course of lung cancer, and hence a more accurate prediction of the times to the events of interest.

However, perhaps the most attractive feature of this MSM is that it is developed in R. The development in an open source statistical package enhances the transparency of the model and facilitates research on the statistical properties of MSMs in general.

Using ad-hoc estimates for the model parameters (as described in section 4), we make predictions for hypothetical scenarios by running multiple micro-simulations for each case. The results are plausible compared to what was expected based on relevant studies and reports about lung cancer. We note, for example, higher lung cancer mortality in men compared to women, as well as a positive correlation of the negative effects of smoking on the course of lung cancer with the total smoking duration and intensity.

These examples are provided only as an indication that the model performs reasonably well. Large deviations from the truth are attributed to the inadequacy of the ad-hoc values for the MSM parameters to reproduce real numbers. Thorough calibration exercises are necessary to achieve proximity between MSM predictions and real data. This is one of the main objectives of the next chapter, where we perform a thorough calibration and validation of this MSM using real data.

Some of the most serious limitations of this MSM are that it does not involve any screening or treatment component, and that it does not take into account the detection of benign lesions. Moreover, the complexity of the model was kept to a minimum because the main objective of this chapter was to develop a streamlined MSM that sufficiently describes the natural history of lung cancer while serving as a handy tool for the exploration of the statistical properties of MSMs in general.

Improvements of the MSM with respect to those limitations were beyond the scope of this thesis. However, the working examples demonstrate the potential of the model to be used in real life scenarios. In this perspective, future work includes enhancement of the MSM performance by increasing the level of complexity and incorporating additional components, e.g., screening and treatment. Another immediate goal is to refine the R code and publish it as a library in the CRAN repository of the open source statistical software R. This will enhance the transparency of the model, and will give many potential users the opportunity to use it, either as a tool for further research and development of statistical methods related to micro-simulation models, or in order to simulate entire populations or subgroups of patients and assist, for instance, decision making in lung cancer research.

Chapter 3

Calibration methods in MSMs - a comparative analysis

The second chapter of this thesis pertains to the calibration of micro-simulation models. The main goal is to provide a comparative analysis of two different approaches to this problem, a Bayesian and an Empirical one. The Bayesian calibration adapts the methodology described in Rutter et al. (90). The Empirical method aims at combining broadly applied practices for empirically calibrating MSM parameters (92). Both methods are implemented for the calibration of the streamlined MSM for the natural history of lung cancer developed in the previous chapter. The entire procedure is conducted using the open source statistical software R (version 3.0.1). The comparative analysis comprises graphical, qualitative, and quantitative discrepancy measures of the results the two methods produce. This is a first attempt at a thorough comparison between two calibration methods in the context of micro-simulation modeling, with focus on the statistical aspects of these procedures. The chapter results in suggestions about the best method, under certain circumstances, based on the overall assessment of the calibration results according to these measures.

The chapter begins with the description of the two calibration methods that will be implemented. It continues with a detailed discussion of the serious computational restrictions related to the implementation of the calibration of the MSM in R. Emphasis is put on the need for HPC techniques to deal with the particularities of the involved code. This is followed by the description of the simulation study designed for the purposes of the comparative analysis, along with detailed results from this analysis. The chapter concludes with some general remarks about the performance of the two calibration methods with respect to both MSM validation and the computational requirements and restrictions imposed by each method. We comment on the advantages and disadvantages of the two methods, and we refer to future work related to this chapter.

3.1 Background

3.1.1 Calibration vs estimation in statistical theory

Calibration pertains to the specification of model parameters to fit observed quantities of interest. The term has many instances in the statistical literature and is closely related to the development of stochastic predictive models. Calibration is also used in the context of fitting complex deterministic mathematical models (Kennedy and O'Hagan (2001); Campbell (2006)). The terms "calibration", "estimation", and "model fitting" are often used interchangeably in the modeling literature (Vanni et al. (2011)). In the context of ordinary statistical modeling (e.g., generalized linear models), calibration is considered an "inverse prediction" problem: simply stated, the question is, for a new value of the response variable, what set of values for the predictor variables in the model could result in the quantity of interest with high probability. Moreover, this model parameter specification usually refers to point estimation rather than a characterization of the distribution of the model parameters. In this thesis, we consider calibration as a "model tuning" procedure aiming at specifying those sets of model parameter values which, when used as model inputs, can predict with a desired amount of accuracy the pre-specified target summaries from the available data.

In the context of the specification of MSM parameters, calibration seems more relevant than estimation. This is because in micro-simulation modeling it is possible for more than one set of parameter values to reproduce results close to the observed quantities of interest. In addition, some of the model parameters represent latent variables (i.e., unobserved quantities); hence, model identification problems may arise. Therefore, purely analytical estimation procedures aimed at finding the single optimal set of parameter values that best fits the observed data (e.g., MLE) are not useful for this specific problem. Identifying multiple plausible sets of parameter values is instead preferable, since these sets can provide an idea of the underlying correlation structure of the model parameters. In addition, those sets can be used to capture and express the model parameter uncertainty in the produced outputs.

According to Vanni et al. (2011) the goal of a calibration process is manifold and includes specification of unobserved/unobservable model parameters, of parameters that are observed with some level of uncertainty, of the correlation among the model parameters (both observable and unobservable), as well as approximation of the joint distribution of the model parameters. This last goal can be achieved only if the result of the calibration process is more than one combination of parameter values. The set of all plausible combinations of values can be used as an approximation of both the marginal and the underlying joint distributions of the MSM's parameters and outputs. This result of the calibration procedure is extremely useful in the context of MSMing. Unlike typical statistical models, e.g., generalized linear models, where the output variable is directly expressed as a function of the model's parameters and covariates, in MSMs there is usually no closed form expression of the relation among the model input, output, and parameters. On the contrary, it is very difficult to identify and quantify the correlation mechanisms that govern the model's structure, because of the complicated relationships dominating the process described by the MSM. This complexity also often gives rise to identifiability problems.

3.1.2 Calibration methods for MSMs

Vanni et al. (2011) provide a systematic overview of the calibration procedure that should be followed in the development of economic evaluation mathematical models in general. According to this paper, the calibration procedure comprises decisions on seven essential steps: the model parameters to calibrate, the calibration targets, the goodness of fit (GOF) measure(s), the search strategy over the space of possible parameter values, the convergence criteria, the stopping rule for the calibration process, as well as the integration, presentation, and use of the model calibration results.

Several methods have been proposed in the literature specifically for the calibration of MSMs in medical decision making. Stout et al. (106) classify the model parameter estimation methods currently used in cancer simulation models in two broad categories: purely analytical methods and calibration methods. This classification is also relevant in the context of micro-simulation modeling calibration. Purely analytic methods refer to direct estimation of the model parameters (e.g., MLE (3; 22; 87; 75)) without reference to model fit. On the contrary, calibration methods derive model parameters from an efficient search of the parameter space and can be further categorized into undirected and directed methods. Undirected methods involve an exhaustive grid search (65; 59) of the parameter space or a grid search using some sampling design (e.g., random sampling (23; 4; 115; 50), Latin Hypercube Sampling (LHS) (49; 5; 69; 95), etc.). Directed methods, on the other hand, aim at finding the optimum set of parameter values using, for example, the Nelder-Mead (77; 11; 18) or some other optimization algorithm (118; 107; 53; 52). In addition to the two aforementioned broad calibration categories, Bayesian (90; 117; 14) calibration methods are also often used in micro-simulation modeling. We could further split the various calibration methods of complex models into empirical and theoretical. The characteristics that actually differentiate these two categories lie in the nature of the searching strategy, the convergence criteria, the stopping rule, as well as in the interpretation of the produced results.
In an empirical method, for instance, the searching strategy usually involves some sort of random sampling within the multivariate parameter space, the convergence criteria and the stopping rules are usually arbitrary (sometimes even based on convenience), while the interpretation of the results (set(s) of values for the calibrated model parameters) is often abstruse. Theoretical methods, on the other hand, involve structured searching strategies and stopping rules (e.g., optimization algorithms, the Gibbs sampler, etc.), while the interpretation of the results is easier and based on a sound theoretical background (e.g., the joint posterior distribution of the calibrated parameters in Bayesian calibration).

3.1.3 Assessing calibration results

Calibration methods aim at producing models that fit observed data well. Hence, the evaluation of a calibration method is closely related to model validation. There are several methods, qualitative and quantitative, to assess the performance of a predictive model. Within the scope of MSMing, to our knowledge, no systematic work has been carried out yet on the assessment of a calibrated model. Usually the performance of a calibrated MSM is evaluated only with plots that compare the MSM predictions with the respective observed data (4; 94; 23). In such situations, the conclusions about the quality and adequacy of the MSM are arbitrary and entail a certain amount of subjectivity. Plots should rather be used in conjunction with measures (GoF statistics) that quantify the deviation of the MSM outputs from the observed quantities of interest. The most popular among the quantitative measures applied for MSM validation is the chi-square GoF statistic. Bayesian calibration methods provide additional means for assessing the overall performance of a calibrated MSM, i.e., comparison of the observed quantities of interest (calibration targets) with the corresponding posterior predictive distributions. Other GOF measures employing, e.g., profile likelihoods are also suggested in the literature (18).

3.2 Methods

3.2.1 Notation

In this section, we introduce some notation that will be used throughout the remainder of this document.

M(θ) : micro-simulation model

θ = [θ1, θ2, ..., θK]^T : vector of model parameters

Z = [Za, Zg, Zs, Zd]^T : vector of covariates (baseline characteristics) with age (Za), gender (Zg), smoking status (Zs), and smoking intensity (Zd), as average number of cigarettes smoked per day

Y = [Y1, Y2, ..., YJ] : vector of data, i.e., summary statistics found in the literature that describe quantities of interest in the natural history model

π(θ) : joint prior distribution of θ

π(θk) : prior distribution of θk

h(θ|Y, Z) : joint posterior distribution of θ

f(Y|g(θ), Z) : data distribution; it depends on a function g(·) of the model parameters θ and the model covariates (note: the functional form of g(·) is unknown and hard to specify)

h(θk|θ(−k), Y, Z) : full conditional for parameter θk given θ(−k), Y and Z

θ(−k) : the θ vector excluding the θk component (k = 1, 2, ..., K, where K is the total number of MSM parameters)

M̃m(θ, SN) : MSM predictions after running the model m times in total on the input sample SN of size N, given θ

3.2.2 Bayesian Calibration Method

The first method applies Bayesian reasoning to the calibration of MSMs. The goal is to incorporate, in a sound way, both prior beliefs about the MSM parameters and observed data found in the literature on lung cancer natural history into the MSM calibration procedure. To this end we apply the Bayesian calibration method described in detail in Rutter et al. (90), aimed at drawing values from the joint posterior distribution h(θ|Y) of the model parameters. This method essentially involves a sufficiently large number of Gibbs sampler iterations that result in draws from the full conditional distributions h(θk|θ(−k), Y, Z). Due to the model's complexity, the algorithm also involves approximate Metropolis-Hastings (MH) steps embedded within each Gibbs sampler iteration, in order to draw from the full conditional distributions, whose forms are unknown.

In particular, within each Gibbs sampler step, we implement multiple iterations of a random-walk Metropolis-Hastings algorithm. Given a symmetric jumping distribution, the MH algorithm accepts a new value θk* for θk with transition probability:

a(θk, θk*) = min(rk(θk, θk*), 1)   if πk(θk) ∏_{j=1}^{J} fj(yj|g(θ)) > 0,
a(θk, θk*) = 1                     if πk(θk) ∏_{j=1}^{J} fj(yj|g(θ)) = 0.      (3.1)

Assuming that the micro-simulation model M(θ) and the data distributions f(Y|g(θ)) are correctly specified, we use M(θ) to simulate M draws from fj(Yj|gj(θ)), where j indicates the j-th covariate class. We use maximum likelihood estimation (MLE) to estimate the data distribution parameters (e.g., for Binomial and Poisson counts the estimate of gj(θ) is the average ĝj(θ) = (1/M) ∑_{i=1}^{M} Ỹij). We then use these estimates to calculate the transition probability function â(θ, θ*) based on:

rk(θk, θk*) = [πk(θk*) ∏_{j=1}^{J} fj(yj|ĝj(θk*, θ(−k)))] / [πk(θk) ∏_{j=1}^{J} fj(yj|ĝj(θ))]      (3.2)

The Bayesian calibration method results in a V×K matrix of calibrated values, denoted ΘBayes, whose rows represent a random sample from the joint posterior distribution h(θ|Y) of the MSM parameters. This sample is used to provide estimates of both the posterior distributions of the calibrated MSM parameters and the posterior predictive distributions of the quantities of interest.
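The structure of this Metropolis-within-Gibbs scheme can be sketched as follows. This is an illustrative Python sketch (the actual calibration runs the lung cancer MSM in R): the toy model, Normal data distribution, uniform priors, targets, and tuning constants are all hypothetical stand-ins. What the sketch does show is the mechanism of the approximate MH step: for each candidate, ĝ(θ) is estimated from M model runs and the component update is accepted or rejected with a ratio of the form (3.2).

```python
import math
import random

random.seed(42)

# Hypothetical stand-in for the MSM M(theta): one run returns the J = 2
# summary outputs; the real model is the lung cancer natural history MSM.
def run_msm(theta):
    return [random.gauss(10 * theta[0], 0.5), random.gauss(5 * theta[1], 0.5)]

def g_hat(theta, M=10):
    # MLE-type estimate of the data-distribution parameters: average of M runs
    sims = [run_msm(theta) for _ in range(M)]
    return [sum(s[j] for s in sims) / M for j in range(2)]

def log_lik(y, g):
    # hypothetical Normal data distribution with unit variance
    return sum(-0.5 * (yj - gj) ** 2 for yj, gj in zip(y, g))

def log_prior(theta):
    # uniform prior on (0, 2) for each component
    return 0.0 if all(0 < t < 2 for t in theta) else -math.inf

def calibrate(y, theta0, iters=200, step=0.1):
    theta, draws = list(theta0), []
    for _ in range(iters):
        for k in range(len(theta)):           # Gibbs sweep over components
            prop = list(theta)
            prop[k] += random.gauss(0, step)  # symmetric random-walk proposal
            log_r = (log_prior(prop) + log_lik(y, g_hat(prop))
                     - log_prior(theta) - log_lik(y, g_hat(theta)))
            if math.log(random.random()) < log_r:   # approximate MH accept/reject
                theta = prop
        draws.append(list(theta))
    return draws

targets = [10.0, 5.0]                  # toy calibration targets (match at theta = [1, 1])
chain = calibrate(targets, [0.5, 0.5])
post = chain[100:]                     # discard burn-in
means = [sum(t[k] for t in post) / len(post) for k in range(2)]
```

Because ĝ(θ) is re-estimated from a finite number of model runs at every step, the acceptance ratio is itself noisy; this is exactly the "approximate" nature of the MH step noted above.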

3.2.3 Empirical Calibration Method

Several empirical calibration methods for micro-simulation models have been suggested in the literature (section 3.1.2). Most of them comprise some type of sampling for searching the multidimensional parameter space, stipulation of a proximity measure between observed and predicted quantities of interest, and selection of a set of parameter vectors satisfying pre-specified convergence criteria. In many cases the result of an empirical calibration procedure is a set of parameter vectors rather than a single parameter vector for the calibrated model.

For the development of our generic empirical calibration method we focus on two key elements of the procedure: the searching algorithm and the convergence criteria. In this section we describe how we combine popular practices found in the MSMing literature to create a generic Empirical calibration method that will be compared with the Bayesian method previously described.

When the model dimensionality permits, it is possible to use extensive grid search algorithms to search the parameter space (65). For models comprising many parameters (as is usually the case in micro-simulation modeling), random sampling rather than extensive grid search is preferred for searching the parameter space (23; 4; 115; 50). Alternatively, a more efficient sampling scheme, namely Latin Hypercube Sampling (LHS) (69; 104), can be used to sample from the multidimensional parameter space.

The LHS method was introduced by McKay et al. (69) as an extremely efficient sampling scheme that outperforms both simple random and stratified sampling. LHS and its variations (16) increase the efficiency of the algorithm while preventing the introduction of bias and reducing the effect of extreme values on the resulting estimates. Another very attractive feature of the LHS is that it allows for characterizing the uncertainty and conducting sensitivity analysis of complex deterministic or stochastic models.

The method has been applied in several instances of model calibration in medical research (49). In Blower and Dowlatabadi (5) we find an application of the LHS to a deterministic complex model as a technique to explore the effect of uncertainty in the parameter values on the predicted outcomes. Another very interesting application of the LHS is found in Cronin et al. (13), where the method is used in conjunction with a response surface analysis as an efficient way to explore the parameter space and investigate the relationship between the parameter values and the respective model outputs.

The second very important feature of the calibration procedure we focus on is the specification of the convergence criteria used to identify acceptable parameter sets. The most commonly used discrepancy measures in the context of calibrating complex models are chi-square and likelihood statistics (115). These two are also the most typical measures used for the overall assessment of the calibrated model fit. It is noteworthy, however, that in many instances of empirically calibrated complex models, the assessment of the overall model fit is completely arbitrary and based solely on graphical comparisons between observed and predicted quantities of interest (94; 23; 4).

Latin Hypercube Sample

Before continuing to the description of the Empirical Calibration method, we first discuss the particularities of the Latin Hypercube Sampling (LHS) design.

Let θk ∈ Rk, where Rk is the range of plausible values for θk. We divide Rk into N equiprobable intervals (according to the pre-specified distribution we assume for each θk), and we assign the integers 1 through N to them. We create a sequence of K vectors, each of which is a random permutation of the integers 1, 2, ..., N. For each θk we randomly draw a value from the indexed interval according to the K vectors of random permutations previously created; alternatively, the middle point of each interval could be used. The result of this procedure is an M_{N×K} matrix whose columns are the K vectors of random values for the model parameters. The m_{ik} element of this matrix corresponds to the value extracted from the i-th indexed interval of the θk variable, and the i-th row of the matrix is a sample point of values from the parameter space. The M_{N×K} matrix is the Latin Hypercube Sample extracted from a single implementation of this sampling design.

In the Empirical calibration method we implement the LHS design as a more efficient searching algorithm for the multi-dimensional parameter space than simple random sampling. The goal of a single implementation of this design is to collect a sample of NLHS values for the K parameters (where NLHS is the size of the LHS design). To this end, the range of each parameter is divided into NLHS equiprobable (according to the pre-specified underlying distribution) intervals. For each parameter we create a different permutation of the NLHS intervals, and we subsequently draw a value from each corresponding interval, following the underlying distribution. In particular, we utilize the 'maximinLHS' R function (lhs library), which aims to optimize the collected sample by maximizing the distance between the design points. The set of NLHS points (i.e., vectors of parameter values) is the sample extracted from the multivariate parameter space using the LHS design.
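A single implementation of the LHS design described above can be sketched as follows. The sketch is in Python for illustration (the thesis itself uses the 'maximinLHS' function from the R lhs library) and assumes uniform marginals, so "equiprobable intervals" reduce to equal-width intervals; the ranges are the two from the example below.

```python
import random

random.seed(1)

def latin_hypercube(n, ranges, midpoint=False):
    # One implementation of the LHS design: n points over len(ranges) parameters.
    # Each range is split into n equal-probability intervals (uniform marginals
    # assumed here); an independent random permutation per parameter pairs the
    # intervals across parameters, and one value is drawn from (or centered in)
    # each interval.
    k = len(ranges)
    perms = [random.sample(range(n), n) for _ in range(k)]
    sample = []
    for i in range(n):
        point = []
        for j, (lo, hi) in enumerate(ranges):
            cell = perms[j][i]                    # interval index for this point
            width = (hi - lo) / n
            u = 0.5 if midpoint else random.random()
            point.append(lo + (cell + u) * width)
        sample.append(point)
    return sample

# the two parameter ranges used in the example of Figures 3.1-3.2
ranges = [(0.00001, 0.0016), (0.0001, 8)]
pts = latin_hypercube(5, ranges)
```

The defining LHS property holds by construction: projected onto any single parameter, the n points occupy each of the n marginal intervals exactly once.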

Figure 3.1: Single implementation of LHS of size NLHS=5 for extracting values from a 2-dimensional parameter space (θ1 and θ2).

Figure 3.2: Single implementation of LHS of size NLHS=5 for extracting values from a 2-dimensional parameter space (θ1 and θ2).

Figures 3.1 and 3.2 present examples of the application of the LHS design in two dimensions. In each of these examples the LHS is used to extract a sample from the bivariate space stipulated by two of the MSM parameters to be calibrated, i.e., θ1 = m ∈ [0.00001, 0.0016] and θ2 = mdiagn ∈ [0.0001, 8]. The grid indicates the partition of the bivariate space based on equiprobable marginal intervals for each parameter, and the dots in each graph represent the set of points of the Latin Hypercube sample. The figures depict four samples for different sizes (NLHS = 5 and 20) and different extracted points from the individual intervals (center vs. random).

A limitation of the LHS is that a single implementation of this design can only result in a restricted number (NLHS) of vectors of parameter values, rendering it inefficient for searching the multi-dimensional parameter space of an MSM. To overcome this obstacle we suggest the recurrent implementation of the aforementioned design, in order to collect a large enough sample for the purposes of the calibration procedure.

Description

The second method combines some basic concepts of empirical calibration procedures found in the MSM literature, which are based on random search of the parameter space. It further suggests the adoption of the LHS design as a more efficient tool for searching the multi-dimensional parameter space. In particular, this empirical method implements the LHS design multiple times to extract a large number of sets of parameter values. This sample is then checked for "acceptable" sets, i.e., for sets of parameter values that produce model outputs close to the observed ones. The goal of this method is to eventually collect a sample representative of the underlying population of all the "acceptable" (according to some convergence criteria) sets of parameter values.

Let Y ∼ f(Y|Λ) be the data of interest (calibration data). We implement the LHS design L times in total. Since each repetition of the LHS provides NLHS sets of parameter values (where NLHS is the size of the LHS design), this empirical calibration method essentially results in Nemp = NLHS × L sets of parameter values in total. For each set of parameter values we run the MSM M(θ) a sufficient number of times, M, and we calculate estimates of the data distribution parameters ĝj(θ) = (1/M) ∑_{m=1}^{M} Ỹmj (as in the Bayesian calibration method). Given these estimates we calculate the log-likelihood as:

l(ĝ(θ)|Y) = ∑_{j=1}^{J} lj(ĝj(θ)|Yj)

We want to test the null hypothesis H0 : Λ = Λ0, where Λ0 is the vector of calibration targets, versus the alternative H1 : Λ ≠ Λ0. For this test we use the deviance statistic:

D = −2 [l(ĝ(θ)|Y) − l(Λ̂s|Y)] = −2 ∑_{j=1}^{J} [lj(ĝj(θ)|yj) − lj(λ̂sj|yj)]      (3.3)

where l(Λ̂s|Y) is the likelihood of the saturated model.

Under H0 the deviance statistic D follows a chi-square distribution with ν degrees of freedom, one for each tested mean in the calibration target vector. Among the sets of θ values for which H0 is not rejected, we randomly draw V (to match the Bayesian procedure) vectors of parameter values. Hence, the result of the empirical calibration method is again a V×K matrix of calibrated values, denoted ΘEmp, whose rows represent a random sample from the population of all "acceptable" sets of parameter values according to the pre-specified convergence criterion (here, the population of parameter values yielding sufficiently high log-likelihood given the calibration data). These calibrated values can be used in a way analogous to the Bayesian calibration results, to provide estimates of the empirical distributions of the calibrated MSM parameters, as well as the empirical distributions of the predicted quantities of interest.
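The acceptance step of the Empirical method can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: a hypothetical two-parameter model with Poisson-type targets, random search in place of repeated LHS for brevity, and the deviance (3.3) compared against a chi-square critical value (5.99 is the 0.95 quantile with 2 degrees of freedom, one per target).

```python
import math
import random

random.seed(7)

# Hypothetical model: theta maps to the means of J = 2 Poisson-type targets
def run_msm(theta):
    return [random.gauss(100 * theta[0], 1), random.gauss(40 * theta[1], 1)]

def g_hat(theta, M=10):
    sims = [run_msm(theta) for _ in range(M)]
    return [sum(s[j] for s in sims) / M for j in range(2)]

def poisson_loglik(y, mu):
    return sum(yj * math.log(mj) - mj - math.lgamma(yj + 1) for yj, mj in zip(y, mu))

def deviance(y, mu):
    # D = -2 [ l(ghat | y) - l(saturated | y) ]; the saturated model sets mu = y
    return -2 * (poisson_loglik(y, mu) - poisson_loglik(y, y))

targets = [100.0, 40.0]   # calibration targets Lambda_0
CHI2_CRIT = 5.99          # chi-square 0.95 quantile with 2 degrees of freedom

accepted = []
for _ in range(500):      # random search of the parameter space (LHS in practice)
    theta = [random.uniform(0.5, 1.5), random.uniform(0.5, 1.5)]
    if deviance(targets, g_hat(theta)) < CHI2_CRIT:
        accepted.append(theta)   # H0 not rejected: keep this parameter set
```

Drawing V rows at random from `accepted` then yields the ΘEmp matrix described above.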

3.2.4 Calibration outputs: interpretation and use

An important aspect of the calibration of an MSM is what the anticipated outputs of this procedure should be. To answer this question we have to consider both the conceptual aspect of the problem and real-life practice. In the comparative analysis presented here, we suggest that the result of both methods be a collection of parameter vectors rather than a simple point estimate for each MSM parameter. This type of calibration output is preferable, especially in the case of complex MSMs, for several reasons.

First of all, the nature of the problem itself dictates this form of calibration output. As already mentioned (section 3.1.1), the MSM's complexity renders the parameter specification a calibration rather than an estimation problem. It is possible for more than one set of parameter values to produce equivalent outputs, i.e., predictions "close" to what has been observed. Therefore, we wish to collect a sample of these equivalent sets rather than find the single set that maximizes some convergence criterion. Second, the matrix of calibrated values can reveal interesting relationships between the MSM parameters, which usually represent unobservable (latent) variables. Understanding these relationships may also be useful for improving the model's structure, so that the MSM better describes the underlying process and, therefore, has enhanced predictive ability. Third, by using a matrix of calibrated values rather than point estimates of the MSM parameters, we are able to capture a major source of MSM uncertainty, i.e., parameter uncertainty, and convey the effect it has on the final results.

The Bayesian method results in the ΘBayes matrix of calibrated values representing a sample from the joint posterior distribution of the MSM parameters given the data

(calibration targets). The Empirical method results in the ΘEmp matrix, essentially comprising a sample of vectors from the joint distribution of the "acceptable" parameter values, namely those fulfilling the convergence criteria. In both cases the matrix of calibrated parameter values can be used to fulfill the aforementioned purposes of presenting the MSM characteristics (joint and marginal distributions of the model parameters) as well as model predictions of the quantities of interest. In particular, for each one of these sets of parameter values (i.e., for each row of the

ΘBayes or ΘEmp matrix) we run the model M times and we summarize the results in order to estimate the quantities of interest, given a specific input sample SN. We denote by Ỹ = M̃M(Θ, SN) the predictions from a calibrated MSM with matrix Θ of calibrated parameter values and input sample SN. Averages and other summary statistics can be used as point estimates, while measures of variability provide an indication of the model uncertainty, including sampling variability and parameter uncertainty.
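The use of a calibrated Θ matrix to produce predictions can be sketched as follows: run the model M times for each row, average within rows, then summarize across rows. The one-parameter toy model and the stand-in Θ matrix below are hypothetical; the sketch only illustrates how the across-row spread conveys parameter uncertainty on top of sampling variability.

```python
import random
import statistics

random.seed(3)

# hypothetical single micro-simulation output for a given parameter vector
def run_msm(theta):
    return random.gauss(10 * theta[0], 1.0)

def predict(theta_matrix, M=10):
    # For each calibrated parameter vector (row), run the model M times and
    # average; the spread across rows then reflects parameter uncertainty on
    # top of sampling variability.
    per_row = []
    for theta in theta_matrix:
        runs = [run_msm(theta) for _ in range(M)]
        per_row.append(sum(runs) / M)
    return per_row

# stand-in for V = 50 rows of a calibrated Theta matrix with K = 1 parameter
theta_cal = [[random.gauss(1.0, 0.05)] for _ in range(50)]
preds = predict(theta_cal)
point_estimate = statistics.mean(preds)   # point estimate of the quantity of interest
uncertainty = statistics.stdev(preds)     # variability conveyed by the calibrated sample
```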

3.3 High Performance Computing in R

3.3.1 Software for MSMs

There is a wide range of programming languages for the development of MSMs. Kopec et al. (2010), in their comprehensive review of the quality of MSMs used in Medicine, provide a list of programming languages and existing toolkits currently used for the implementation of MSMs. Java, C#, and C++ are very popular languages for the development of such models. Other toolkits, such as TreeAge, also appear in the MSM literature. There are also some MSMs (MicMac (24), JAMSIM (66)) that embed the R statistical programming language, though only to provide the user with the enhanced statistical and graphical capabilities of R for post-simulation processing. This means that R is only involved in the analysis of the MSM outputs rather than the actual micro-simulations.

For reasons explained here, the streamlined MSM for the natural history of lung cancer is written in R. To our knowledge, this is the first attempt to develop and implement an MSM exclusively in R. The R open source statistical software is widely used by statisticians from the entire statistical spectrum. The implementation of an MSM in R not only allows the wide use of this new, very attractive technology in medical decision making, even by people not very familiar with the field, but also enhances the transparency of the model and facilitates the research and development of statistical methods related to this technology.

The release of the code, e.g., in the form of a library in the open source statistical software R, is a very attractive feature, especially to model developers, who can actually read the code and thus understand the particularities and compare the structure of similar MSMs. Researchers unfamiliar with the technical details of an MSM, on the other hand, who intend to use the model as a tool in medical decision making, e.g., to simulate and make predictions for large cohorts, are highly interested in being able to simulate and compare different scenarios. This is another, perhaps more powerful, aspect of the model's transparency related to the release of freely available source code.

The streamlined MSM that describes the natural history of lung cancer can provide a handy tool for exploring the statistical properties of MSMs in general.

Although the idea of writing an MSM in R is very exciting and attractive, the implementation can prove a daunting task. Even the term "micro-simulation" modeling suggests extensive computations and rather time consuming processes. A simple implementation of the model, e.g., to make predictions for a single person or even for a relatively small sample of persons (as in the case of the tables presented in the first chapter), although not instantaneous, is definitely a feasible and relatively easy task to carry out. However, the development of such a model from scratch requires, among others, calibration and overall assessment of the model (goodness of fit tests, validation, etc.), namely processes that can prove hard to design and implement and extremely time consuming to run. In the following paragraphs we attempt to give an idea of what the computational burden, in terms of the required running times, for such processes can be. To this end, we provide as an example our experience from the implementation of the two calibration methods for the comparative analysis described in this chapter.

3.3.2 Example: computational burden of two MSM calibration methods

The objective of the second chapter is the calibration of the streamlined MSM for the natural history of lung cancer with two different methods, a Bayesian and an Empirical one. Trying to keep this problem as simple as possible, we focus our interest on only four MSM parameters. As outlined in the description of the two methods, each calibration procedure aims at identifying the most suitable vectors among a total of 100,000 candidate vectors of parameter values.

The Empirical calibration method entails the simultaneous check of the values in a candidate parameter vector. In our case, the whole procedure involves testing 100,000 vectors of parameter values in total. In addition, each vector drawn from the multi-dimensional parameter space is totally independent of the others. The Bayesian calibration method is somewhat more complicated in that it requires sequential checks of parameter values: each parameter chain update depends on the parameter values suggested in the previous step. In our case, the Bayesian calibration method entails testing 4 × 100,000 parameter values in total. Therefore, the architecture of the Bayesian calibration method allows parallelization of the process only to a much more restricted extent.

As described in the simulation study section, in our example each calibration update entails running the micro-simulation model M=10 times on a sample of N=5000 people. That is, checking one combination of parameter values (Empirical calibration) or one parameter value (Bayesian calibration) requires 50,000 micro-simulations in total. Hence, for 100,000 updates of all (four) MSM parameters, we need 50,000 · 100,000 = 5·10^9 and 50,000 · 100,000 · 4 = 2·10^10 micro-simulations for the Empirical and the Bayesian calibration method, respectively. Given these numbers we realize how time consuming the implementation of just a single calibration method can be, let alone a comparative analysis between two of them.
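The bookkeeping above can be checked directly; the sketch simply restates the counts given in the text:

```python
M, N = 10, 5000              # model repetitions and input-sample size per check
sims_per_check = M * N       # 50,000 micro-simulations to check one candidate

updates = 100_000
empirical_total = sims_per_check * updates      # Empirical method: 5e9 simulations
bayesian_total = sims_per_check * updates * 4   # Bayesian method: 2e10 (per-parameter updates)
```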

These numbers ascertain that the calibration of an MSM falls into the "embarrassingly parallel" category of computational problems (89), meaning that the entire task can be split into numerous, completely independent, repeated computations, each of which can be executed by a separate processor in parallel. Hence, instead of "endless" running times, an "embarrassingly parallel" procedure, such as the calibration of an MSM, can be approached using high performance computing (HPC) techniques and run within plausible times. A closer look at table 3.1, which presents the required times to run M·N micro-simulations under different settings, verifies that in the absence of HPC the calibration of an MSM is simply impossible.

3.3.3 Parallel Computing

In order to overcome the time limitations posed by the extensive computations involved in calibrating an MSM, we harness the idea of parallel computing. This can be achieved by distributing the independent computations simultaneously to multiple computer clusters (nodes) that we have set up for this purpose. These clusters may comprise a single machine with one or more processors, or multiple machines connected by a communications network. Hence we distinguish between two major types of parallelization, implicit and explicit, depending on the composition of the computer clusters used. It is crucial to decide upon the type of parallelization (available in R) to work with, so as to maximize the benefit from the advanced high performance computing techniques developed for this statistical software.

Tierney (2008) describes the notions of implicit and explicit parallel computing within the R context. According to this paper, implicit parallelization pertains basically to exploiting multiple processors of one machine as well as internal R functions to speed up calculations (e.g., vectorized arithmetic operations, 'apply'-like functions, etc.). It essentially takes advantage of the parallelism inherent in the program. This method does not require any special intervention (set-up) from the user, hence it is much easier to implement and can prove very beneficial, especially for large vectors (e.g., n > 2000). Nevertheless, as can be seen from the example provided in table 3.1, implicit parallelization can only provide the researcher with a limited ability to improve the efficiency of an R program, and is definitely not the solution to the extremely time consuming algorithms of the MSM calibration process.

Explicit parallelization, on the other hand, provides the user with the ability to set up computer clusters (multiple computers with multiple processors each) so as to distribute the independent computations across a wider range of resources in parallel. Hence, explicit parallelization can substantially improve the efficiency of algorithms involving "embarrassingly parallel" computations, as in the case of calibrating an MSM. This type of parallelization requires more work and a certain amount of computer science knowledge to set up the cluster and distribute the algorithm accordingly.

After having improved the required time for one micro-simulation, by using the most efficient (to the extent of our knowledge) built-in R functions, the next step is to take advantage of high performance computing techniques so as to carry out the calibration computations within realistic time intervals. Schmidberger et al. (2009) provide a comprehensive account of R packages with advanced techniques for performing parallel computing in R. According to this paper, the two R packages that stand out as best serving the implementation of parallel computing on computer clusters are 'snow' and 'Rmpi'.

For the purposes of the comparative analysis we will mainly be using the 'snow' library to set up computer clusters using the Message-Passing-Interface (MPI) low-level communication mechanism. This R library has intermediate and high level functions for parallel computing. For the calibration purposes we make use of the high level ones, which are basically parallel versions of the 'apply'-like R functions. By using the possibilities the R 'snow' package offers for parallel computing, we can overcome R's single-threaded nature and spread the computational burden across multiple machines and CPUs (McCallum and Weston (2012)). Information about the 'snow' built-in functions can be found in the relevant R documentation for this package, while examples of the implementation of parallel computing in R using the snow package can be found in Tierney (2008) and Matloff (2013).
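As a minimal sketch of this high-level interface (the 2-worker local socket cluster is an arbitrary choice for illustration; the makeCluster()/parSapply()/stopCluster() calls shown are available both in 'snow' and in the base 'parallel' package, which absorbed snow's high-level functions):

```r
library(parallel)

# One independent unit of work: a stand-in for a single micro-simulation.
one_task <- function(i) mean(rnorm(1e4, mean = i))

cl <- makeCluster(2, type = "PSOCK")   # 2 local socket workers (assumed available)
res <- parSapply(cl, 1:8, one_task)    # distribute 8 independent tasks to workers
stopCluster(cl)                        # always release the workers
length(res)                            # one result per task
```

For an MPI-backed cluster the setup call changes (e.g., type = "MPI" with 'Rmpi' installed) but the high-level apply-style calls stay the same, which is what makes the approach attractive for embarrassingly parallel calibration runs.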

3.3.4 Code architecture

Another very important decision to be made, regarding the problem of improving the efficiency of the calibration methods, is which chunk of the code should be parallelized. Obviously, the sequential nature of the Bayesian calibration method leaves much smaller scope for parallelization than the Empirical one, with its random, undirected search approach (independent draws of values from the multidimensional parameter space). This is also the case when comparing the efficiency of undirected with any directed search method, due to the sequential nature of the latter, since each step in a directed method depends on the result of the preceding one.

The sequential nature of the Bayesian method drives the decision, for more efficient results, to perform in parallel the M·N=50,000 micro-simulations involved in each parameter update. The independent draws, on the other hand, of vectors from the multi-dimensional parameter space, involved in the Empirical calibration, allow for a greater extent of parallelization, which is only restricted by the size of the induced tables in comparison to the respective memory limits². In our case, we take advantage of the architecture of the Empirical calibration method to further parallelize the testing of 20 parameter vectors, i.e., M·N·NLHS=50·1000·20=10^6 micro-simulations in total (where NLHS is the size of the Latin Hypercube Sample). Thereafter, in order to test 100,000 parameter vectors, we need to repeat this procedure 100,000/20=5,000 times in total.

However, in order to make the most out of the implementation of parallelization, we have to make sure that the R code for predicting one trajectory (one micro-simulation) is as efficient as possible. Hence, there is one more step before we move forward to the implementation of parallel computing, i.e., to improve the efficiency of the R algorithm for a single micro-simulation. A very helpful tool for this task is R's built-in profiler, the Rprof() function (in the 'utils' package), which, together with summaryRprof(), enables a relatively easy profiling of the execution of R expressions.

²There are several methodologies, and respective packages developed in R, that harness high performance computing techniques to deal with large-memory, or even out-of-memory, data problems (Eddelbuettel (2013)).
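A sketch of this profiling workflow follows (the profiled function is a deliberately inefficient stand-in for a single micro-simulation, not the actual MSM code):

```r
# Rprof() samples the call stack while the expression runs; summaryRprof()
# then reports where the time was spent, by function.
slow_trajectory <- function(n) {
  x <- numeric(0)
  for (i in 1:n) x <- c(x, rnorm(1))   # growing a vector: a classic bottleneck
  x
}
Rprof("profile.out")                   # start writing profiling samples
invisible(slow_trajectory(30000))
Rprof(NULL)                            # stop profiling
head(summaryRprof("profile.out")$by.self)
```

The `$by.self` table points to the functions in which the most time is spent directly, which is usually where replacing an inefficient structure pays off most.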

Tables in the Appendix, with R profiling results, indicate the degree of improvement we achieved in our code by simply replacing time consuming R structures with their more efficient counterparts. More specifically, by simply replacing 'data.frame' with 'list' in all R code instances we managed to make the program almost twice as fast (the total running time dropped from 5.86 to 2.88 time units). By also replacing the approximate integration of the hazard function for the onset of the first malignant cell with the respective definite cumulative hazard function, we achieved a further 22.2% reduction (from 2.88 to 2.24 time units).
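The kind of difference this substitution makes can be illustrated with a toy benchmark (timings vary by machine; the structures below are illustrative, not the MSM's actual ones):

```r
# Repeated element assignment into a data.frame column versus a plain list:
# each data.frame update triggers far more copying overhead.
n <- 5000
t_df <- system.time({
  df <- data.frame(a = numeric(n))
  for (i in 1:n) df$a[i] <- i
})["elapsed"]
t_list <- system.time({
  lst <- list(a = numeric(n))
  for (i in 1:n) lst$a[i] <- i
})["elapsed"]
c(data.frame = t_df, list = t_list)   # the list version is markedly faster
identical(df$a, lst$a)                # identical contents either way
```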

We have described, so far, the gain in running time we achieved by optimizing the R code internally, i.e., by performing R profiling and improving the code's efficiency accordingly. This process involved work on the efficiency of the code's architecture, i.e., removing unnecessary computations/parts of the code, replacing loops with vectorized R functions, etc. Furthermore, we replaced complicated, time consuming R functions and structures with more efficient counterparts (e.g., saving results from a function in 'list' instead of 'data.frame' format). In this way we were able to substantially reduce the time required to run 50,000 micro-simulations, from 5532.77 secs (≈1.5 hours) to 2114.92 secs (≈35 minutes). However, even with this significant improvement, running 50,000·100,000=5·10^9 micro-simulations to calibrate just one of the MSM parameters (Bayesian method), or sets of parameter values, requires the prohibitive time of 6.7 years. After optimizing the efficiency of the algorithm inside the R code, we turned our attention to HPC techniques, in order to perform parallel computing in R. We focused on the particularities of each calibration method, implementing relevant techniques accordingly, to reduce the respective computational burden to the minimum.

3.3.5 Algorithm efficiency: Bayesian vs Empirical Calibration

To better understand the gain in the computational burden, as well as to compare the two methods in terms of their efficiency, we calculate the running time required to calibrate all four MSM parameters with each method. As already mentioned, in order to implement the Bayesian calibration method and obtain one chain of 100,000 values for the joint posterior of the calibrated MSM parameters, we need to repeat the set of M·N=50,000 micro-simulations 4·100,000 times in total. Hence the required time for the Bayesian calibration is 4.26·100,000·4 secs ≈ 19.7 days (table 3.1), if we use a cluster of 64 nodes (8 computers with 8 CPUs each). An analogous task with the Empirical calibration method requires testing 100,000 vectors from the parameter space in total. By further parallelizing the process and simultaneously computing, e.g., 20·50·1000 = 10^6 micro-simulations, we can calibrate the four MSM parameters much faster than with the Bayesian method³. According to table 3.1, the time required to run that many micro-simulations in parallel is 105.4 seconds ≈ 1.7 minutes. To complete the Empirical calibration procedure, we have to run this set of micro-simulations 100,000/20=5,000 times in total. Hence the time required to calibrate the four MSM parameters with the Empirical method is ≈ 6.1 days. Depending on the available resources we can achieve further reduction in the running times. In our case, for example, if we further split the Empirical calibration process into three independent pieces, we can receive the results from this method in ≈ 2 days (i.e., almost 10 times faster than with the Bayesian method). Consequently, we realize that the architecture of this method allows for parallelization to a significant extent, with a corresponding reduction in the required time, unlike the Bayesian calibration method, or any directed search method. Hence an Empirical method for calibrating an MSM can prove much more practical (efficient) than a Bayesian one, with respect to the required running time.

³Actually, depending on the HPC techniques we use and the capacity of the computer clusters, we can further parallelize this problem and achieve an even larger reduction in the time required to run this procedure.
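The running-time arithmetic above can be checked directly (a back-of-the-envelope sketch using the per-batch timings from table 3.1):

```r
# Bayesian: 4.26 secs per batch of 50,000 micro-simulations on 64 nodes,
# repeated 4*100,000 times (one MH step per parameter per chain update).
bayes_days <- 4.26 * 100000 * 4 / 86400
# Empirical: 105.4 secs per parallel batch of 20 parameter vectors,
# repeated 100,000/20 = 5,000 times.
emp_days <- 105.4 * (100000 / 20) / 86400
round(c(bayesian = bayes_days, empirical = emp_days), 1)   # ~19.7 and ~6.1 days
```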

Table 3.1 describes improvements in the efficiency of an embarrassingly parallel algorithm involving M·N micro-simulations. This 'journey' begins from the completely trivial case, i.e., performing the computations on a single machine without interfering with R's single-threaded nature and without taking care of the time consuming R functions and structures. From this starting point, the process of improving the efficiency of the algorithm 'travels' through the notion of implicit parallelism to eventually reach the optimum solution, using explicit parallelism on properly set up computer clusters (a network of multiple computers with multiple processors each) and the relevant HPC R techniques/packages/toolkits. The ultimate gain from this process, e.g., in order to perform M·N=50·1000=50,000 micro-simulations, can reach the impressive figure of three orders of magnitude (5532.77/4.26≈1299). The implementation of HPC for parallel computing with R was actually what made it feasible to calibrate the MSM for lung cancer using this open source statistical software.

3.3.6 Concluding remarks

The problem of calibrating an MSM in R falls in the category of "embarrassingly parallel computations" and necessitates the use of high performance computing. In the previous paragraphs we explained the computational considerations imposed by the calibration of an MSM in R, using as an example the implementation of the two calibration methods described in this chapter. According to this example, the Empirical calibration method is much more efficient than the Bayesian one.

Type   Nodes    M     N    Time (secs)   Reduction Ratio*   Notes
-        1     50   1000      5532.77          -            no parallel computing
-        1    100   1000     11065.61          -            or profiling
-        1     20   2500      5534.07          -
-        1     40   2500     11077.26          -
SOCK     1     50   1000      2114.92        2.62           no parallel computing,
SOCK     1    100   1000      4229.86        2.62           after profiling
SOCK     1     20   2500      2115.41        2.62
SOCK     1     40   2500      4234.31        2.62
SOCK    10     50   1000        16.55      334.31           implicit parallel
SOCK    10    100   1000        34.82      317.79           computing, after profiling
SOCK    10     20   2500        17.24      321.00
SOCK    10     40   2500        34.25      323.42
SOCK    32     50   1000        15.97      346.45           implicit parallel
SOCK    32    100   1000        34.06      324.89           computing, after profiling
SOCK    32     20   2500        16.11      343.52
SOCK    32     40   2500        34.37      322.29
SOCK    50     50   1000        16.02      345.37           implicit parallel
SOCK    50    100   1000        33.78      327.58           computing, after profiling
SOCK    50     20   2500        16.06      344.59
SOCK    50     40   2500        34.58      320.34
SOCK   100     50   1000        15.85      349.29           implicit parallel
SOCK   100    100   1000        34.20      323.56           computing, after profiling
SOCK   100     20   2500        16.10      343.73
SOCK   100     40   2500        34.41      321.92
MPI     32     50   1000         7.45      742.65           explicit parallel
MPI     32    100   1000        12.93      855.81           computing, after profiling
MPI     32     20   2500         6.19      894.03
MPI     32     40   2500        12.91      858.04
MPI     64     50   1000         4.26     1298.77           explicit parallel
MPI     64    100   1000         7.93     1395.41           computing, after profiling
MPI     64   1000   1000       105.40          -
MPI     64     20   2500         3.31     1671.92
MPI     64     40   2500         8.17     1355.85

SOCK: sockets; MPI: Message Passing Interface
* Ratio of reduction in running time compared to no processing (no parallel computing or profiling)

Table 3.1: Algorithm efficiency: Required time (in seconds) to run M·N micro-simulations using different computing capacities.

It can actually run about 10 times faster. This relative efficiency between the two methods is also applicable when comparing undirected with directed search algorithms for calibration, due to the conceptual similarities they bear with the Empirical and the Bayesian calibration method, respectively.

The most impressive finding from this exercise was the ultimate gain achieved in performing a set of parallel computations, which reaches three orders of magnitude compared to the initial time, namely the time required before any work on the architecture of the R code or any type of parallelization. The running times reported in this section exemplify the imperative need for HPC methods in order to render the development of an MSM in R feasible, with all the beneficial effects such an attempt will have on the overall research in the area.

3.4 Comparative Analysis

The main objective of this chapter is the comparison between an empirical and a Bayesian approach to the calibration problem of micro-simulation models (MSMs) in medical decision making (MDM). The streamlined MSM, developed in the first chapter, is used as a tool for the implementation of both calibration methods, described in section 3.2 of the thesis. To our knowledge, this is the first attempt at a comprehensive and systematic comparison of two calibration methods in the context of micro-simulation modeling. In the following paragraphs we describe the study design for the quantitative and qualitative comparison of the two methods.

3.4.1 Input Data

The MSM for the natural history of lung cancer (described in Chapter 2) takes into account three baseline characteristics, namely age, gender and smoking habits, in order to predict a person's trajectory. The smoking habits comprise the smoking status of the person at the beginning of the prediction period, i.e., current, former or never smoker, as well as, when relevant, the smoking intensity, expressed as the average number of cigarettes smoked per day. In order to keep the dimensionality of the problem at an easily manageable level, we restrict our interest to male current smokers.

We combine information found in the US 1980 census data and other relevant statistics (Statistical Abstract of the US, 1980) in order to simulate the baseline characteristics of a large sample representative of the US population. This large sample will be the "pool" from which several sub-samples will be drawn and used as input to the MSM for the purposes of both model calibration and assessment. Assuming that the entry year is 1980, we predict 26 years ahead and calibrate the MSM to the observed lung cancer incidence rates reported in the SEER 2002-2006 data.

We simulate the age distribution based on information found in the US 1980 census about males who are current smokers. Given the age group, we simulate the smoking intensity for each individual following the distribution of the average number of cigarettes smoked per day, as reported in the Statistical Abstract of the US, 1980. Because these tables report the smoking intensity in groups (i.e., <15, 15-24, 25-34, and >34 cigarettes/day), we first draw the smoking intensity category given age, and then randomly draw an integer from the selected group, assuming a uniform distribution for the smoking intensity within that group. This integer eventually expresses the average number of cigarettes smoked per day for the particular individual.
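This two-stage draw can be sketched as follows (the category probabilities and the upper bound of 60 cigarettes/day for the open-ended group are assumptions made only for illustration, not the published 1980 figures):

```r
# Draw an average smoking intensity (cigarettes/day): first the reported
# category, then a uniform integer within the selected category's range.
draw_intensity <- function(p = c(0.25, 0.40, 0.25, 0.10)) {
  limits <- list(c(1, 14), c(15, 24), c(25, 34), c(35, 60))  # <15, 15-24, 25-34, >34
  k <- sample(4, size = 1, prob = p)                 # smoking intensity category
  sample(limits[[k]][1]:limits[[k]][2], size = 1)    # uniform within the group
}
set.seed(1)
vals <- replicate(1000, draw_intensity())
range(vals)
```

In the actual study the probabilities p would be conditional on the individual's simulated age group.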

Following this procedure we simulate a large sample of NL=100,000 individuals representative of the 1980 US reference population. This will be our simulated "true" population for which predictions about lung cancer incidence are to be made using the MSM. For the purposes of both model calibration and validation, two sub-samples will be drawn from this simulated population. In particular, we randomly draw two sub-samples of size n=5,000 each. The first one will be the input to the MSM for the implementation of the two calibration methods; we refer to that sample as the "calibration input" (smpl.C5000). The second one, referred to as the "validation input", will be used for validating the calibration results (smpl.V5000). Furthermore, other sub-samples will also be randomly drawn from the same, NL=100,000, simulated population, to serve other purposes of the comparative analysis presented in this chapter (e.g., samples to produce calibration plots, etc.).
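The sub-sampling step itself is straightforward (a sketch; 'pool' is a hypothetical data frame of simulated baseline characteristics, and drawing the two sub-samples disjointly is an assumption made here for illustration):

```r
# Draw the calibration and validation inputs from the simulated "pool".
set.seed(1980)
pool <- data.frame(id  = 1:100000,
                   age = sample(17:90, 100000, replace = TRUE))  # toy ages
idx_c  <- sample(nrow(pool), 5000)                    # calibration input rows
smpl_C <- pool[idx_c, ]
idx_v  <- sample(setdiff(seq_len(nrow(pool)), idx_c), 5000)
smpl_V <- pool[idx_v, ]                               # validation input rows
c(calibration = nrow(smpl_C), validation = nrow(smpl_V))
```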

Table 3.2 presents the age distributions of the samples used as input for the comparative analysis of the two calibration methods. We denote with smpl100,000 the sample of 100,000 people used for the calculations of the calibration targets (see section 3.4.3).

Age (years)   US 1980 (smpl100,000)   Calibration (smpl.C5000)   Validation (smpl.V5000)
17-39               53672                    530                        554
40-44                7078                     79                         63
45-49                6827                     65                         70
50-54                7122                     68                         80
55-59                6699                     72                         53
60-64                5876                     60                         54
65-69                4833                     48                         55
70-74                3503                     43                         25
75-79                2233                     22                         23
80-84                1284                      8                         16
>85                   873                      5                          7

Table 3.2: Age distributions of the samples (input data) used for the comparative analysis of the two calibration methods.

We denote with smpl.C5000 the sample of size N=1,000 that was used as input for both calibration methods, as well as for the internal validation of the results, and with smpl.V5000 the sample used for the external validation of the calibrated models. All these samples are representative of the US 1980 population, i.e., the age and smoking intensity distributions of these samples resemble the corresponding observed data about male current smokers reported in the 1980 US census and the Statistical Abstract of the US from the same year.

3.4.2 MSM parameters to calibrate

The streamlined MSM for the natural history of lung cancer, which we developed in the first chapter, involves numerous parameters describing different parts of the model. In order to be able to run the procedures in plausible times, instead of performing an exhaustive calibration we run a restricted one, focusing our interest on only four MSM parameters. All the rest are kept fixed, according to known relationships found in the literature or plausible assumptions that simplify the calibration problem (Chapter 2). In particular:

• we keep the MSM parameters pertaining to the onset of the first malignant cell fixed to the quantities found in the literature about males, current smokers (Table 2.2).

• from the Gompertz(m,s) distribution for the tumor growth, we only calibrate m, assuming s=31·m (section 2.3.1)

• from the log-Normal distributions of the disease progression part of the MSM, we only calibrate the location parameters (i.e., mdiagn, mreg and mdist), assuming that location=scale (i.e., means = standard deviations)

• our prior beliefs (i.e., prior distributions and plausible intervals for the MSM parameters to calibrate) are in accordance with findings in the literature of the natural history of lung cancer (section 2.3.1)

3.4.3 Calibration Targets

In order to keep the calibration problem as simple as possible, we only calibrate our model to lung cancer incidence by age group. As reference points we use the observed rates in the SEER 2002-2006 data, so as to reproduce plausible numbers (Table 3.3). The calibration exercise relies on the strong assumption that the lung cancer incidence rates, conditional on gender and smoking status, remain unchanged throughout the 26-year prediction period (from 1980 to 2006) and are close to the rates reported in the SEER 2002-2006 data. Another problem when calibrating the lung cancer natural history model is the occurrence of rare events, especially at ages less than 55 years. To overcome this problem, we combine the eleven 5-year age groups presented in the SEER data into three, i.e., <60, 60-80, and >80 years old. In this way we are able to observe all the lung cancer incidence rates even when we use as input a sample of people of moderate size (e.g., n=500). We assume that the lung cancer cases yj follow a Poisson(λj) distribution, where λj is the rate of the j-th age group, expressed as the number of cases per 100,000 person-years (PYs).
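The Poisson assumption for the three coarse age groups amounts to a one-line simulation sketch (the λ values used here are the combined target rates from Table 3.3; the per-100,000-PY scaling is implicit):

```r
# Calibration targets y_j modeled as Poisson(lambda_j) counts per coarse
# age group, with lambda_j expressed as cases per 100,000 person-years.
lambda <- c("<60" = 41, "[60,80)" = 391, ">80" = 464)
set.seed(11)
y <- rpois(length(lambda), lambda)
rbind(lambda = lambda, simulated = y)
```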

Age group   Observed   Predicted      Combined group   Observed   Predicted
<40            5.2      11 ± 4.48
[40, 45)      10.0      15 ± 3.30
[45, 50)      26.3      21 ± 2.83     <60                 41.9     41 ± 4.25
[50, 55)      56.7      49 ± 4.18
[55, 60)     111.3     107 ± 6.46
[60, 65)     208.4     192 ± 9.01
[65, 70)     329.3     392 ± 16.32    [60, 80)           387.4    391 ± 15.96
[70, 75)     455.7     481 ± 18.62
[75, 80)     556.2     498 ± 19.89
[80, 85)     554.5     504 ± 19.93    >80                498.1    464 ± 19.36
>85          441.7     425 ± 18.79

Table 3.3: Observed (2002-2006 SEER data) and predicted (M=100, N=100,000, θfix) lung cancer incidence rates (cases/100,000 person·years) by age group.

We ran an ad hoc analysis to identify combinations of parameter values that give plausible predictions, i.e., predictions close to the observed quantities (SEER data 2002-2006). For this purpose we implemented the MSM on the simulated US 1980 population of N=100,000 male current smokers. Given their simulated baseline characteristics, i.e., age and smoking intensity, we predicted twenty-six years ahead, that is, we predicted lung cancer incidence in 2006. We implemented the model M=100 times in order to increase the accuracy of our predictions. At the end of the prediction period, we combined the results (predicted lung cancer cases per 100,000 person-years) by age group. Following the results of this ad hoc analysis, we identified a set of values θfix = [θ1^c, θ2^c, θ3^c, θ4^c]T = [0.00038, 2, 1.1, 2.8]T, for which the MSM predicts lung cancer incidence rates per age group close to the observed quantities (SEER data). Table 3.3 presents the predicted lung cancer incidence rates for these fixed parameter values. We set these rates, Yclbr = [y1−clbr, y2−clbr, y3−clbr]T = [41, 391, 464]T, to be the calibration targets for each of the two calibration methods.

The reason for choosing Yclbr as the calibration targets, rather than the respective observed rates in the SEER data, is that we wanted to control for the effect that the input sample, as well as the structure of the model, would have on the MSM predictions. In this way, any deviations of the predictions from the reference points would be attributed to a greater extent to the real underlying differences between the two calibration methods, rather than to factors that are nuisance for the purposes of this comparative analysis.

We repeated a similar procedure for θfix, running the model M=2,000 times in total, this time using the "calibration input" sample. We again predicted the lung cancer incidence twenty-six years ahead and combined the results by age group, thus resulting in the vector of rates Yfix = [y1−fix, y2−fix, y3−fix]T = [50, 353, 452]T. We use this vector later on to validate the results from the two calibration methods. The reason is that the output of an MSM depends, to some extent, on the input sample; therefore, even for the same θfix and the same total number of micro-simulations (i.e., M·N=10^7), the output can be slightly different (Yclbr vs Yfix).

Using the notation introduced in section 3.2.1, we define Yclbr = M̃100(θfix, smpl100,000) and Yfix = M̃2000(θfix, smpl.C5000) to be the two reference points against which the predictions from the two calibrated MSMs will be compared. Consequently, we have three reference points when comparing the results from the two methods: θfix for the calibrated parameters, as well as Yclbr and Yfix for the predicted lung cancer incidence rates.

3.4.4 Simulation Study

The ultimate goal of this chapter is the quantitative and qualitative comparison of the two calibration methods for MSMs, the Bayesian and the Empirical one. To this end we design a simulation study that allows for comparisons of multiple aspects of the calibration procedure. The simulation study pertains to the implementation of both methods to calibrate the parameters of the streamlined MSM for the natural history of lung cancer. In particular, we calibrate all four (θ1, θ2, θ3, θ4) MSM parameters, and compare the results from the two methods using both qualitative and quantitative measures, as well as graphical means.

Methods comparability

In order to ensure comparability of the two methods we have to calibrate the same set of MSM parameters θ, to the same calibration targets Yclbr using the same input data (”calibration sample”). In addition, the prior information about θ in the Bayesian calibration method has to be consistent with the plausible intervals assumed in the empirical calibration method while the estimation of the model’s outputs of interest should be based on the same number of embedded micro-simulation runs (simulation study size).

The results from each calibration method, i.e., point estimates for the MSM parameters and predicted outputs, are influenced by the several sources of uncertainty (chapter 1) inherent in the model. Failure to recognize this problem, and to take precautions to control for it, may produce misleading results and, consequently, erroneous conclusions from the comparative analysis. Structural uncertainty cannot be examined in our case, since both methods are implemented for the calibration of exactly the same MSM. We account for selection uncertainty and sampling variability (both related to the calibration data) by setting the same calibration targets. Moreover, we account for the effect of simulation (Monte Carlo) variability by implementing the MSM multiple times on the same input sample, and taking point estimates (means, standard deviations) of the outputs of interest (calibration targets or individual trajectories). Parameter uncertainty, on the other hand, is an integral part of the calibration method itself, and is captured by the determination of distributions, rather than point estimates, of the resulting calibrated MSM parameters.

Characteristic            Bayesian                                Empirical
Parameters to calibrate   θ=[θ1, θ2, θ3, θ4]T                     θ=[θ1, θ2, θ3, θ4]T
Calibration targets       Yclbr (relatively easy to combine       Yclbr (when there is more than one
                          more than one source of information)    calibration target, a rule to
                                                                  combine them must be specified)
GoF measure               log-likelihood (inherent in the         Deviance
                          approximate MH algorithm)
Convergence               Trace plots, convergence diagnostics;   χ2 test (α=5%), stopping rule
                          V sets of values for θ from the
                          converged sets
Result                    Random draws from the joint posterior   Random draws from the empirical
                          distribution of the model               joint distribution of the
                          parameters θ                            "acceptable" values for θ

Table 3.4: Implementation of the two calibration methods on the MSM for lung cancer, according to the seven-step approach presented in Vanni et al. (2011).

This characteristic provides an additional means of comparison between the two methods. Furthermore, we use the same sample of baseline characteristics (calibration input) for the implementation of the two calibration methods, in order to eliminate the effect of population heterogeneity on the results of the comparative analysis.

Finally, the calibration results from the two methods are integrated following exactly the same procedure to describe the distributions of the calibrated parameters and the MSM outputs. Table 3.4 juxtaposes the implementation of the two calibration methods from the point of view of the seven-step approach presented in Vanni et al. (2011).

Simulation study size

The accuracy of the calibration results depends heavily on the total number of micro-simulations involved in the computations. As already mentioned, we focus our interest on the calibration of an MSM that describes the natural history of lung cancer for male current smokers. In order to account for the effect of simulation variability, we implement the MSM multiple times (M) on the input data (the calibration sample of baseline characteristics). Each time, the model predicts n trajectories 26 years ahead, one for each person in the input sample. We summarize the results at the end of the prediction period, i.e., we calculate the lung cancer rates per age group. This procedure results in M predictions per age group. As a point estimate of the predicted lung cancer incidence rates we use the averages of the M predicted values by age group. The accuracy of the calibration results is highly related to the size of the simulation study, i.e., the total of M·n micro-simulations involved in the calculations.
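The summarization step can be sketched as follows (the Poisson counts below are toy stand-ins for the model's predicted rates; the λ values are the calibration targets, used here only to generate illustrative output):

```r
# M model runs produce M predicted rates per age group; we summarize them
# with the mean (point estimate) and sd (simulation variability) by group.
set.seed(3)
M <- 10
pred <- matrix(rpois(M * 3, lambda = c(41, 391, 464)), nrow = M, byrow = TRUE)
colnames(pred) <- c("<60", "[60,80)", ">80")
rate_mean <- colMeans(pred)          # point estimates per age group
rate_sd   <- apply(pred, 2, sd)      # Monte Carlo variability per age group
round(rbind(mean = rate_mean, sd = rate_sd), 1)
```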

                          M
n      Age group    10             20             30             50             100
500    <60           34 ± 38.37     35 ± 47.18     44 ± 68.96     56 ± 78.40     38 ± 57.71
500    [60, 80)     370 ± 202.63   364 ± 208.72   382 ± 240.71   365 ± 218.90   361 ± 216.48
500    >80          369 ± 316.64   434 ± 331.91   460 ± 259.59   466 ± 309.12   486 ± 315.44
1000   <60           41 ± 35.32     43 ± 44.23     41 ± 41.53     43 ± 49.18     43 ± 45.30
1000   [60, 80)     415 ± 160.57   392 ± 171.92   401 ± 160.01   393 ± 156.06   405 ± 155.36
1000   >80          480 ± 218.80   430 ± 174.45   496 ± 195.51   464 ± 185.99   435 ± 172.29
2500   <60           47 ± 29.72     41 ± 24.63     41 ± 25.41     40 ± 24.75     41 ± 24.78
2500   [60, 80)     386 ± 101.47   372 ± 103.92   406 ± 100.36   395 ± 103.44   390 ± 101.50
2500   >80          466 ± 115.75   472 ± 123.32   492 ± 123.07   465 ± 120.89   476 ± 121.27
5000   <60           44 ± 23.72     40 ± 15.80     41 ± 19.01     43 ± 19.60     40 ± 19.16
5000   [60, 80)     398 ± 82.34    397 ± 67.61    393 ± 73.56    402 ± 69.93    409 ± 73.82
5000   >80          454 ± 71.26    456 ± 97.57    444 ± 79.38    478 ± 93.72    464 ± 89.67

M: total number of micro-simulations per individual; n: input sample size

Table 3.5: Predicted lung cancer incidence rates (cases/100,000 person·years) per age group, for different study sizes (M·n). Calibration targets: Yclbr = [y1−clbr, y2−clbr, y3−clbr]T = [41, 391, 464]T.

A key issue in the study design is the choice of the M·n combination of total micro-simulations involved in the calculations of each calibration method. There is a trade-off between the achieved accuracy of the predictions and the required running time. Our goal was to identify a combination that provides accurate predictions within plausible running times. To this end, we investigated different M·n combinations in order to specify the one that best serves the purposes of the simulation study. We randomly extracted sub-samples of size n=500, 1000, 2500, and 5000 from the N=100,000 simulated 1980 US population. These samples were subsequently used as input to predict lung cancer incidence rates 26 years ahead, implementing the model M=10, 20, 30, 50, and 100 times respectively. Table 3.5 presents the predicted lung cancer incidence rates (average ± sd) per age group for each scenario. Figure 3.3 provides a graphical representation of table 3.5. According to this table, the combination of M=10 and n=5000 seems adequate to produce sufficient results in plausible times⁴. The focus, when making this decision, was both on model accuracy (bias and variability of the MSM predictions) and on the total required running time.

⁴The required running time for M·n=50,000 micro-simulations is close to 5 secs using 64 cores (8 nodes with 8 CPUs each).

Figure 3.3: Predicted (mean±sd) lung cancer incidence rates (cases/100,000 person·years) by age group, for different M·n combinations, given fixed MSM parameter values (θfix=[0.00038, 2, 1.1, 2.8]T).

Implementation

We use both the Bayesian (3.2.2) and the Empirical (3.2.3) method to calibrate all four MSM parameters θ=[θ1, θ2, θ3, θ4]T. We calibrate the MSM to three targets (Yclbr = [y1−clbr, y2−clbr, y3−clbr]T = [41, 391, 464]T), i.e., the predicted lung cancer incidence rates per age group for fixed values of the MSM parameters θfix=[0.00038, 2, 1.1, 2.8]T (Table 3.3). For each θk we use a Truncated Normal distribution (TN(µθk, sdθk), with µθk = sdθk) to specify either the prior for the Bayesian method or the distribution of plausible parameter values for the Empirical method. In particular, we set:

• m=θ1 ∼ TN(µ(θ1) = sd(θ1) = 0.0008, L(θ1) = 0.00001, U(θ1) = 0.0016)

• mdiagn=θ2 ∼ TN(µ(θ2) = sd(θ2) = 4, L(θ2) = 0.0001, U(θ2) = 8)

• mreg=θ3 ∼ TN(µ(θ3) = sd(θ3) = 2.2, L(θ3) = 0.0001, U(θ3) = 4.4)

• mdist=θ4 ∼ TN(µ(θ4) = sd(θ4) = 5.6, L(θ4) = 0.0001, U(θ4) = 11.2)

Suppose that we apply the Bayesian method in order to calibrate only θ1=m and θ2=mdiagn, and produce a chain of length B=100,000 for each parameter. To implement the Gibbs sampler with the embedded approximate MH algorithm, we follow the steps:

1. Set θ1^0, θ2^0 (starting values), and keep θ3 = θ3^c = 1.1 and θ4 = θ4^c = 2.8 fixed. Denote by θ^0 = [θ1^0, θ2^0, θ3^c, θ4^c]T the vector with the starting values for the MSM parameters.

(a) given θ^0, run the micro-simulation model M(θ) on the calibration sample (n=1000) to predict individual trajectories 26 years ahead, and calculate the predicted lung cancer cases ỹmj by age group j.

(b) repeat step (a) M=50 times (m=1, 2, ..., M), resulting in M predicted lung cancer incidence counts per age group. These ỹmj counts are considered random draws from Poisson(λj) distributions.

(c) calculate the likelihood ∏_{j=1}^{J} fj(yj−clbr | λj = gj(θ^0)). These λj are functions of the MSM parameters. Due to the model's complexity the form of g(·) is hard to derive, therefore we approximate these quantities using the respective MLEs (sample means), hence λ̂j = ĝj(θ) = (1/M) ∑_{m=1}^{M} ỹmj.

∗ 2. propose a new value θ1

∗ ∗ 0 c c T 3. repeat steps (a) through (c) for θ = [θ1, θ2, θ3, θ4]

∗ QJ ∗ ∗ π(θ1 ) j=1 fj (yj |gˆj (θ )) ∗ 4. calculate the ratio r1(θ1, θ1) = 0 QJ 0 and accept θk with proba- π(θ1) j=1 fj (yj |gˆj (θ )) 0 ∗ bility α(θ1, θ1) (section 3.2.2)   θ∗, if we accept θ∗ 0  1 1 5. set θ1 = 0  θ1, otherwise

∗ 6. propose a new value θ2

∗ 0 ∗ c c T 7. repeat steps (a) through (c) for θ = [θ1, θ2, θ3, θ4]

∗ QJ ∗ ∗ π(θ2 ) j=1 fj (yj |gˆj (θ )) ∗ 8. calculate the ratio r2(θ2, θ2) = 0 QJ 0 and accept θk with proba- π(θ2) j=1 fj (yj |gˆj (θ )) 0 ∗ bility α(θ2, θ2) (section 3.2.2)   θ∗, if we accept θ∗ 0  2 1 9. set θ2 = 0  θ2, otherwise

The resulting [θ1^(0), θ2^(0)] values from the aforementioned process constitute one update for the chains of the calibrated parameter values. We repeat steps (1) through (9) B = 100,000 times. From the total of B = 100,000 values, we collect for each chain V = 1,000 values by selecting every 50th iteration from the last 50,000 values. The resulting V = 1,000 vectors comprise a sample representative of the joint posterior distribution of the MSM parameters and all together correspond to the ΘBayes matrix of calibrated parameter values of the MSM for lung cancer.

We follow an analogous procedure to calibrate one, or any combination of two, three, or all four parameters of the MSM. Figures 3.18 and 3.19 depict in flow charts the implementation of the Bayesian method to calibrate θ1.
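Steps (1)-(9) can be sketched as follows. The MSM itself is replaced by a toy stand-in (`run_msm`), since the real model is not reproduced here; a symmetric random-walk proposal is assumed (section 3.2.2's proposal mechanism is not reproduced), the stored likelihood estimate for the current value is a pseudo-marginal-style simplification, and the chain length defaults are kept tiny for illustration rather than B = 100,000:

```python
import numpy as np
from scipy.stats import poisson, truncnorm

rng = np.random.default_rng(42)
Y_CLBR = np.array([41, 391, 464])      # calibration targets per age group
THETA_FIXED = [1.1, 2.8]               # theta3, theta4 kept at their fixed values

def make_tn(mu, lo, hi):
    # TN with mean = sd, as in the prior specification
    return truncnorm((lo - mu) / mu, (hi - mu) / mu, loc=mu, scale=mu)

PRIORS = [make_tn(0.0008, 0.00001, 0.0016),   # pi(theta1)
          make_tn(4.0, 0.0001, 8.0)]          # pi(theta2)
STEP = [0.0002, 1.0]                          # random-walk scales (a tuning choice)

def run_msm(theta, M=50):
    """Toy stand-in for the micro-simulation model M(theta): M replicated
    vectors of predicted counts per age group (Poisson noise around an
    arbitrary smooth function of theta -- purely illustrative)."""
    lam = Y_CLBR * (theta[0] / 0.00038) ** 0.3 * (theta[1] / 2.0) ** 0.1
    return rng.poisson(lam, size=(M, 3))

def log_target(theta12):
    """log prior + log of the approximated likelihood prod_j f_j(y_j | ghat_j)."""
    lp = sum(p.logpdf(t) for p, t in zip(PRIORS, theta12))
    if not np.isfinite(lp):                   # proposal outside the TN support
        return -np.inf
    lam_hat = run_msm(list(theta12) + THETA_FIXED).mean(axis=0)  # MLEs ghat_j
    return lp + poisson.logpmf(Y_CLBR, np.maximum(lam_hat, 1e-9)).sum()

def gibbs(B=40, burn=20, thin=5):
    theta = np.array([0.0004, 2.0])           # starting values
    lp = log_target(theta)
    chain = []
    for b in range(B):
        for k in (0, 1):                      # one MH step per coordinate
            prop = theta.copy()
            prop[k] += STEP[k] * rng.standard_normal()   # symmetric proposal
            lp_prop = log_target(prop)
            if np.log(rng.uniform()) < lp_prop - lp:     # accept with prob. alpha
                theta, lp = prop, lp_prop
        if b >= burn and (b - burn) % thin == 0:         # burn-in and thinning
            chain.append(theta.copy())
    return np.array(chain)
```

The returned array plays the role of the ΘBayes matrix: each retained row is one post-burn-in, thinned draw of [θ1, θ2].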

For the Empirical calibration method, we implement an LHS of moderate size (NLHS = 10), L = 10,000 times, thus resulting in Nemp = NLHS · L = 100,000 vectors of parameter values in total. For each vector of parameter values we implement the micro-simulation model M = 10 times, and we calculate the corresponding predicted lung cancer incidence rates per age group. As in the Bayesian calibration method, we assume that the calibration targets yj (lung cancer cases per age group j, j = 1, 2, 3) come from Poisson distributions, i.e., yj ∼ Poisson(λj = gj(θ)). Since the form g(·) is hard to derive, we use the M = 10 draws predicted by the model to estimate the parameters of these Poisson distributions, i.e., λ̂j = ĝj(θ) = (1/M) ∑_{m=1}^{M} ŷmj. Here, the deviance statistic follows a chi-square distribution with ν = 3 d.f.; hence, stipulating a 5% level of statistical significance, we select those sets satisfying Di < 7.81, thus obtaining the "acceptable" sets of parameter values.

Among those we randomly extract V = 1,000 vectors (with replacement if necessary, i.e., if the procedure results in fewer than 1,000 "acceptable" vectors). The resulting vectors comprise a sample representative of the joint distribution of the "acceptable" MSM parameters, according to the Empirical calibration criteria, and all together correspond to the ΘEmp matrix of values for the calibrated parameters of the lung cancer MSM (section 3.2.3).
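A sketch of the Empirical selection step follows: LHS draws on the unit cube mapped through the truncated-normal quantile functions, and the Poisson deviance screened against the χ²₃ cutoff. The helper names are illustrative, and a real run would feed each row through the MSM to obtain the λ̂j before screening:

```python
import numpy as np
from scipy.stats import qmc, truncnorm, chi2

# mean = sd and [L, U] bounds of the four plausible-value distributions
MU = [0.0008, 4.0, 2.2, 5.6]
LO = [0.00001, 0.0001, 0.0001, 0.0001]
HI = [0.0016, 8.0, 4.4, 11.2]

def lhs_parameters(n, seed=None):
    """One Latin hypercube of n parameter vectors: uniform LHS on [0,1]^4,
    mapped through the truncated-normal quantile functions."""
    u = qmc.LatinHypercube(d=4, seed=seed).random(n)
    cols = [truncnorm.ppf(u[:, k], (LO[k] - MU[k]) / MU[k],
                          (HI[k] - MU[k]) / MU[k], loc=MU[k], scale=MU[k])
            for k in range(4)]
    return np.column_stack(cols)

def poisson_deviance(y, lam_hat):
    """Deviance of the targets y against the model-based Poisson means."""
    lam_hat = np.maximum(lam_hat, 1e-12)
    return 2.0 * np.sum(y * np.log(y / lam_hat) - (y - lam_hat))

CUTOFF = chi2.ppf(0.95, df=3)   # ~7.81: keep parameter sets with D_i < CUTOFF
```

A parameter vector θi would then be kept when `poisson_deviance(y_clbr, lam_hat_i) < CUTOFF`, where `lam_hat_i` are the means of the M = 10 model runs at θi.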

Figures 3.1 and 3.2 provide a graphical representation of the mechanism for the extraction of NLHS vectors of values from a two-dimensional parameter space, for different NLHS sizes. Two sets of graphs are presented in each figure: in the left graph the extracted value is the center of each selected interval, whilst in the right one the value is randomly chosen from the respective interval.

3.4.5 Terms of comparison

The results from both methods are sets of values describing the joint distribution of the calibrated MSM parameters. The resulting sets represent random draws from the joint posterior distribution (Bayesian method) or the empirical joint distribution of the values satisfying the convergence criteria (Empirical method). For the purposes of this comparative analysis, each method results in V = 1000 vectors from the multivariate parameter space.

We use these results to make predictions for the quantities of interest, i.e., lung cancer incidence rates by age group. In particular, for each vector of parameter values we implement the MSM multiple (M=50) times, and we produce point estimates (means) of the respective quantities. The resulting (V=1000) mean incidence rates for each age group represent random draws from the posterior or the empirical predictive distribution. Depending on the input sample, the predictions can be used for the purposes of internal (smpl.C5000) or external validation (smpl.V5000).

We compare the two methods using qualitative and quantitative measures as well as graphical representations of the results produced.

In particular we provide:

1. Density plots (parameters and predictions)
We compare the density plots of the marginal distributions of the calibrated MSM parameters, as well as the distributions of the predicted calibration data (lung cancer incidence rates by age group). We use the Kullback-Leibler distance (60) to assess the relative entropy between the probability distributions resulting from the two methods, with respect to either each calibrated parameter or the predictions by age group. Low values of this distance indicate similarity of the two distributions, provided that they do not present large differences in overall shape (e.g., different higher-order moments). We also apply the Kolmogorov-Smirnov test to check whether results from the two methods come from the same underlying distribution. When the null hypothesis is not rejected (similar results from the two methods), we include in the graph the respective p-value. In the density plots for the calibrated MSM parameters we also include the respective prior distributions. For the predictions obtained from each calibrated MSM, we present two different sets of results, one for the internal and the other for the external validation of the calibrated MSM, using as input the calibration (smpl.C5000) and the validation (smpl.V5000) samples respectively.
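The distributional comparisons above can be sketched as follows; the histogram-based symmetric KL estimator is one simple plug-in choice (the dissertation does not specify its estimator), and the two-sample KS test comes from scipy:

```python
import numpy as np
from scipy.stats import ks_2samp

def symmetric_kl(x, y, bins=50, eps=1e-12):
    """Symmetrized Kullback-Leibler distance between two samples,
    estimated via histograms on a common grid (a simple plug-in estimator)."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, edges = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=edges)
    p = (p + eps) / (p + eps).sum()       # smooth empty bins, then normalize
    q = (q + eps) / (q + eps).sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(3)
bayes_draws = rng.normal(0.0, 1.0, 1000)   # placeholders for one parameter's
emp_draws = rng.normal(0.1, 1.0, 1000)     # Bayesian / Empirical samples
kl = symmetric_kl(bayes_draws, emp_draws)
stat, pval = ks_2samp(bayes_draws, emp_draws)  # report pval when H0 not rejected
```

In practice the two argument arrays would be the corresponding columns of ΘBayes and ΘEmp (or the predicted rates by age group).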

2. Correlation and contour plots (parameters)
We also provide correlation (scatter) plots of all pairs of calibrated MSM parameters, as well as contour plots to identify high-density points in the resulting bivariate distributions. The scatter plots are accompanied by the Pearson correlation coefficient.

3. Calibration and box plots (predictions)
To provide additional means of comparing the two methods, we use the calibration results (sets of values for the calibrated MSM parameters) to predict lung cancer incidence rates based on different samples of baseline characteristics (input data). In particular, we extracted 20 different samples in total, each of size n = 5000, representative of the 1980 US population of male current smokers. Each sample includes individual-level data on age and smoking intensity. For each one of these 20 samples we apply the model M = 50 times to predict lung cancer incidence rates by age group. We use the sample mean as a point estimate of the predicted quantity by age group. Repeating this process for each set of values for the calibrated MSM parameters results in 1000 predicted rates for each age group.

Using these estimates we produce calibration and box plots to compare the two methods. In the calibration plots we plot the point estimates of the predictions from the Bayesian method versus the respective ones from the Empirical method, for each one of the 20 different samples used as input data. If the two methods produce similar results, the points in this plot should be scattered along the x = y line. We also juxtapose the box plots of the predictions from each calibration method, for each one of the 20 samples, by age group. The extent of overlap between the respective box plots indicates the equivalence of the results produced by each method.

4. Discrepancy measures
We also provide four quantitative (two univariate and two multivariate) measures of discrepancy to compare the predictions from the two methods, namely the mean absolute (MAD) and mean squared (MSD) deviations, as well as the Euclidean and Mahalanobis distances.

The univariate measures of discrepancy are defined as:

MAD = \frac{1}{V} \sum_{v=1}^{V} \frac{|\bar{y}_{vj} - y_j|}{y_j}    (3.4)

MSD = \frac{1}{V} \sum_{v=1}^{V} \left( \frac{\bar{y}_{vj} - y_j}{y_j} \right)^2    (3.5)

where ȳvj are the point estimates of the lung cancer incidence rate for the j-th age group given the v-th vector of MSM parameters, and yj is the j-th component of the vector used as reference point.
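Equations (3.4)-(3.5) translate directly into code; the sketch below (array names are illustrative) computes both measures per age group from a V × J matrix of point-estimate predictions:

```python
import numpy as np

def mad_msd(y_bar, y_ref):
    """Weighted mean absolute / mean squared deviations, eqs. (3.4)-(3.5).
    y_bar: (V, J) point-estimate predictions (V parameter vectors, J age groups);
    y_ref: length-J reference vector (Yclbr or Yfix)."""
    rel = (y_bar - y_ref) / y_ref          # deviations weighted by target size
    return np.abs(rel).mean(axis=0), (rel ** 2).mean(axis=0)

y_ref = np.array([41.0, 391.0, 464.0])
y_bar = y_ref * np.array([[1.1], [0.9]])   # two toy prediction vectors
mad, msd = mad_msd(y_bar, y_ref)           # one MAD and one MSD per age group
```

With the two toy rows at +10% and -10% of the targets, each age group's MAD is 0.1 and each MSD is 0.01.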

The multivariate distances, on the other hand, given M-dimensional vectors x and a constant vector c (center), are defined as:

D_M = \sqrt{(x - c)^T \, S^{-1} \, (x - c)}    (3.6)

where c represents the center of the multidimensional space. In the Euclidean distance S is the identity matrix, while in the Mahalanobis distance S is the respective covariance matrix of the x vectors.
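Equation (3.6) with the two choices of S can be sketched as (the function name and toy data are illustrative):

```python
import numpy as np

def multivariate_distances(X, c):
    """Distances of each row of X from center c per eq. (3.6):
    Euclidean (S = identity) and Mahalanobis (S = sample covariance of X)."""
    d = X - c
    euclid = np.sqrt((d ** 2).sum(axis=1))
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    mahal = np.sqrt(np.einsum("ij,jk,ik->i", d, S_inv, d))
    return euclid, mahal

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))      # e.g., 200 vectors of calibrated parameters
e_dist, m_dist = multivariate_distances(X, np.zeros(3))
```

Dividing by the covariance structure is what downweights deviations along high-variance or highly correlated directions, which is the behavior discussed below.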

In the case of the calibrated parameters, this statistic measures the distance of each x vector of MSM parameter values from the c=θfix vector of fixed values assumed in the simulation study. When it comes to MSM predictions, these distances measure the deviation of each vector of predictions from the vector used as reference point (Yclbr or Yfix).

Multivariate distances are useful in conjunction with the univariate ones, since they provide an idea of the combined deviation of the MSM predictions from the reference points (here, lung cancer incidence rates per age group). Furthermore, the Mahalanobis distance adds objectivity to the comparison of the results from the two MSMs, since it weighs the relevant deviation based on the underlying covariance matrix. For instance, the distance of a vector x with large variance, as well as the distances of two highly correlated vectors (x1 and x2) from c, are downweighted (and vice versa). Hence, the final results are not distorted by potentially high correlations or different orders of magnitude between the involved quantities of interest.

Both mean deviations regarding the MSM predictions are weighted based on the size of the respective lung cancer incidence rates by age group. We use two reference points for these calculations. The first one is the set of calibration targets Yclbr = M̃100(θfix, smpl100,000), i.e., the set of MSM predictions for θfix, smpl100,000 and M = 100. The second is the set of MSM predictions Yfix = M̃2000(θfix, smpl.C5000), i.e., the model's output again for θfix, but using the calibration input (smpl.C5000) and running the model M = 2000 times in total. As already mentioned (section 3.4.3), the reasoning behind the second comparison is that, even if the calibration procedure resulted in the vector θfix, there would be a deviation between the MSM's predictions and the calibration targets, even for the same number of total micro-simulations (10^7), if instead of smpl100,000 we used the calibration input (smpl.C5000). This deviation has to do with two sources of uncertainty inherent in the MSM (Chapter 1), namely, population heterogeneity (different composition of the two input samples) and stochastic uncertainty.

By comparing the MSM predictions resulting from each calibration method using the calibration input sample (smpl.C5000) with the model's output for θfix using exactly the same input sample, we control for the effect of population heterogeneity in the final results. Therefore, any deviations between the MSM predictions and Yfix can be attributed, to a greater extent, to the real underlying differences between the two calibration methods, rather than being distorted by population heterogeneity.

3.5 Results

3.5.1 Parameters

Table 3.6 and figure 3.4 compare the marginal distributions of the calibrated parameters from each method. There is considerable overlap between the results from the two calibration methods. This overlap is more prominent in the case of θ3 = mreg and θ4 = mdist, where the Kolmogorov-Smirnov test cannot reject the null hypothesis that the respective pairs of distributions represent the same underlying populations at α = 0.1%. In these two cases, also noteworthy is the proximity between the marginal distributions of the calibrated parameters and the assumed priors, indicating a potential identifiability problem for these two MSM parameters. Furthermore, the relative entropy, assessed by the (symmetric) Kullback-Leibler distance, is very close to 0.5 for all MSM parameters except θ1 = m.

Both methods include the fixed values assumed in the simulation study (θfix = [0.00038, 2, 1.1, 2.8]^T) within the range of the calibrated MSM parameters. This is an indication that both methods produce reasonable results. However, the marginal distributions are centered away from these fixed values. In most cases, the respective fixed parameter value lies outside the Interquartile Range (IQR), the only exceptions being θ1 = m for the Empirical and θ2 = mdiagn for the Bayesian method.

Contour plots (figures 3.6, 3.7) reveal bivariate associations between the calibrated parameters. The underlying patterns are similar for the two methods. There is a strong correlation between θ1 and θ2 in both methods. Furthermore, changes in θ3 and θ4 do not seem to considerably affect the respective θ1 values (points on the respective plots gather around an imaginary line perpendicular to the θ1 axis). The three parameters θ2, θ3, θ4 seem totally unrelated to each other.

Identifying highly correlated parameters can prove very helpful for the further development and improvement of the MSM, as well as an extremely interesting discovery for experts investigating the described phenomenon (here, lung cancer). With respect to the development of the MSM, strong correlations may indicate redundant parameters and suggest a more parsimonious version of the model, obtained by expressing some parameters as functions of others with which they are highly correlated. Regarding the true process described by the MSM, strong correlations may reveal relationships between the underlying mechanisms, previously unknown or disregarded by the experts, and hence advance the overall research on the phenomenon through new interesting paths.

The two multidimensional discrepancy measures lead to contradictory conclusions. According to the Euclidean distance, the values for the calibrated MSM parameters resulting from the Bayesian method are closer to the fixed values specified in the simulation study (θfix) than those from the Empirical method. Noteworthy, however, is the fact that, although the univariate (figure 3.4, table 3.6) and bivariate analyses (figures 3.6-3.7), as well as the Euclidean distance (figure 3.5), suggest some discrepancies between the calibrated values, when considering the multidimensional parameter space centered at θfix, the Mahalanobis distances (figure 3.5) indicate quite similar results between the two methods. As already mentioned (section 3.4.5), conclusions based on the Euclidean distance may be misleading, since this measure can be distorted by several factors, e.g., high correlations or different orders of magnitude between the involved quantities of interest. In the case of the calibrated MSM parameters, there is a high correlation between two of them (θ1 = m and θ2 = mdiagn), while θ1 differs by almost four orders of magnitude from each of the other MSM parameters.

Figure 3.4: Density plots, Kullback-Leibler distance, and Kolmogorov-Smirnov p-value, comparing the marginal distributions of the calibrated MSM parameters between the two calibration methods.

Figure 3.5: Distributions of multidimensional distances of the calibrated MSM parameters from the fixed values assumed in the simulation study (θfix).

θ1=m
Method     Min        Q1         Median     Mean       Q3         Max        Fixed value  Deviation (±SD) (Pk*) (%)
Bayesian   2.14·10−4  3.15·10−4  3.40·10−4  3.38·10−4  3.64·10−4  4.60·10−4  3.8·10−4     4·10−5 (3.9·10−5) (88) (11)
Empirical  2.78·10−4  3.71·10−4  3.97·10−4  3.97·10−4  4.20·10−4  5.00·10−4  3.8·10−4     −2·10−5 (3.7·10−5) (33) (4.47)

θ2=mdiagn
Bayesian   1.49·10−3  1.52   2.65   2.87   3.94   7.95   2    −0.87 (1.77) (36) (43.25)
Empirical  7.42·10−3  2.77   4.25   4.36   6.10   7.98   2    −2.36 (2.02) (13) (117.8)

θ3=mreg
Bayesian   0.019      1.37   2.16   2.22   3.06   4.40   1.1  −1.12 (1.09) (18) (101.5)
Empirical  0.013      1.33   2.21   2.25   3.18   4.39   1.1  −1.15 (1.11) (18) (104.7)

θ4=mdist
Bayesian   0.071      3.59   5.62   5.76   8.02   11.18  2.8  −2.96 (2.72) (16) (105.5)
Empirical  0.0011     3.24   5.73   5.63   8.13   11.20  2.8  −2.83 (3.00) (22) (101.2)

* Pk: Percentile of the predictive distribution the fixed value corresponds to.

Table 3.6: Summary statistics of the calibrated MSM parameters.

Figure 3.6: Contour plots depicting the bivariate parameter distributions of the Bayesian calibrated MSM. Contours drawn at α=0.95, 0.5 and 0.05 of the bivariate distribution.

Figure 3.7: Contour plots depicting the bivariate parameter distributions of the Empirically calibrated MSM. Contours drawn at α=0.95, 0.5 and 0.05 of the bivariate distribution.

3.5.2 Predictions

The marginal distributions of the predicted lung cancer incidence rates from both methods include the respective calibration targets in their ranges (figures 3.8, 3.9). Moreover, predictions from the Bayesian MSM include the calibration targets in their IQRs for both internal and external validation, the only exception being the "60-80yrs" age group (table 3.7). On the contrary, calibration targets lie outside the respective IQRs of the predictions from the Empirically calibrated MSM, the only exception being the ">80yrs" age group in the external validation case. Consequently, although there is a large overlap between the two methods regarding the ranges of the predicted lung cancer incidence rates by age group (table 3.7), the respective distributions are very different (KS-test p-value < 0.001 in every age group). Predicted values from the Bayesian calibrated model are more dispersed than those from the Empirical one, while the bias of the methods varies across age groups and type of validation. However, both calibrated models overall predict lung cancer incidence better in the ">80yrs" group, i.e., the group with more cases in it.

As already described in section 3.4.4, the predictions from each calibrated model result from running the model M = 50 times for each of the V = 1000 calibrated parameter vectors (Θ matrices), given a specific input sample SN. In the case of the internal validation SN = smpl.C5000, i.e., the sample used in the calibration procedure, while in the external validation SN = smpl.V5000, i.e., another sample of the same size N = 5000. Both input samples are extracted from the simulated 1980 US population (N = 100,000). We calculated the MAD and MSD discrepancy measures for the calibrated MSMs under four different scenarios, depending on the input sample used (internal and external validation) as well as the reference point (Yclbr or Yfix, section 3.4.3). Table 3.8 depicts the predictions involved in the calculations of the MAD and MSD discrepancy measures presented in table 3.9. Note here that, when comparing the MSMs' results with Yclbr, the predictions involved in the calculations resulted from different MSM input samples. However, when Yfix is the reference point, in the internal validation the predictions refer to the same input sample (smpl.C5000), while in the external validation they refer to samples of the same size (N = 5000).

According to the overall MSD and MAD values (table 3.9) when comparing the predictions to the calibration targets (Yclbr), it is unclear which method outperforms the other. However, noteworthy is the fact that, when looking at the results by age group, the Bayesian calibrated MSM predicts lung cancer incidence better than the Empirically calibrated one for younger people ("<60yrs"), i.e., for the group with fewer observed cases in it. This finding holds for both internal and external validation, and indicates that the Bayesian method results in a set of values for the model parameters that, when used as MSM input, leads to better predictions of rare events.

When it comes to deviations from Yfix, the Empirically calibrated MSM overall results in smaller discrepancies than the Bayesian one. This finding, in conjunction with the note that predictions in this case refer to input samples that are either the same (internal validation) or of the same size (external validation), suggests that the Bayesian method is probably more robust to the sample of baseline characteristics used as input in the calibration procedure.

To better understand this, recall that θfix is a vector of ad-hoc values for the model parameters, and therefore independent of the input samples used for the predictions. The matrices ΘBAYES and ΘEMP, on the other hand, depend on the input sample (smpl.C5000) used in the calibration procedure. Furthermore, the predictions obtained by the MSM depend on the structure of the model, which remains unchanged, the parameter values, and the input sample used. According to Table 3.8, the predictions obtained from each model depend on the matrices of calibrated values Θ. In addition, in the internal validation case, these predictions also depend on the input sample (smpl.C5000) used in the calibration procedure, while in the external validation case they depend on a slightly different input sample of the same size (N = 5000) from the same reference population (smpl.V5000).

Therefore, the proximity between the MSM predictions and Yfix provides an indication of how strongly the results of each calibration method (ΘBAYES and ΘEMP) depend on the input sample used in the calibration procedure. The stronger this relationship is, namely the closer the MSM predictions are to the reference vector Yfix, the less "robust" the method is to the input sample used in the calibration procedure.

Looking at the multivariate version of the aforementioned four sets of comparisons and the respective discrepancy measures (figure 3.10), we get a clearer idea of the combined deviation of the MSM predictions from the reference vectors. According to the Euclidean distance, predictions from the Empirically calibrated MSM are considerably closer to the reference vectors than those from the Bayesian model in all cases (internal and external validation). This finding was expected because, according to the respective univariate distributions (table 3.7, figures 3.8-3.9), predictions from the Bayesian MSM are much more dispersed than those from the Empirical model in the "60-80yrs" and ">80yrs" age groups. Furthermore, although predictions in the "<60yrs" group are less dispersed and centered around the calibration target, this is not reflected in the Euclidean distance, since this measure does not take into account the relative magnitudes of the quantities of interest.

The Mahalanobis distances change the overall conclusions considerably. According to this measure, the Bayesian calibrated MSM seems to perform equally well in all instances, and only marginally better than the Empirically calibrated model when comparing predictions with Yfix in the external validation case. This finding essentially reflects the fact that the superiority of the Bayesian MSM in the "<60yrs" age group is essentially offset by the better predictions of the Empirical MSM in the other two age groups, as indicated by the univariate discrepancy measures applied for the internal validation of the model (table 3.9). On the contrary, in accordance with the univariate analysis, the Mahalanobis distance suggests that the predictions from the Empirical MSM are closer to Yfix than the respective ones from the Bayesian model.

The calibration graphs (figure 3.11) plot the average predicted values by age group for each one of twenty different samples (of size N = 5000 each) used as input in the MSM. As expected, these numbers lie on a straight line, denoting that the implementation of the two calibrated MSMs on the same input results in analogous outcomes.

The box plots (figure 3.12) and the respective summary statistics (table 3.10) are in accordance with the conclusions from the density plots, i.e., they indicate that, overall, the Empirical method leads to less dispersed predictions. Noteworthy is also the fact that, looking at the medians, the predictions from the Empirical MSM are consistently higher than those from the Bayesian model. However, the Bayesian calibrated MSM tends to make more accurate predictions (medians closer to the respective calibration targets) for the "<60yrs" and ">80yrs" age groups.

Figure 3.8: INTERNAL VALIDATION: Density plots depicting the marginal distributions of the predicted lung cancer incidence rates (cases/100,000 person·years) by age group, compared to calibration targets Yclbr = M̃100(θfix, smpl100,000) and Yfix = M̃2000(θfix, smpl.C5000). [KL-dist: Kullback-Leibler distance]

Figure 3.9: EXTERNAL VALIDATION: Density plots depicting the marginal distributions of the predicted lung cancer incidence rates (cases/100,000 person·years) by age group, compared to calibration targets Yclbr = M̃100(θfix, smpl100,000) and Yfix = M̃2000(θfix, smpl.C5000). [KL-dist: Kullback-Leibler distance]

                    INTERNAL Validation          EXTERNAL Validation
Summary statistics  Bayesian      Empirical      Bayesian      Empirical

< 60 years old
Min                 14.05         31.22          15.24         30.48
Q1                  32.70         44.97          32.40         44.02
Median              39.87         49.8           39.55         48.93
Mean±Sd             39.36±9.19    49.8±6.61      38.8±9.05     49.0±6.42
Q3                  45.94         54.7           45.35         53.7
Max                 66.32         69.7           64.44         68.9
Target value        41
Bias (%)            1.64 (4)      −8.79 (21.4)   2.20 (5.4)    −8.00 (19.5)

60-80 years old
Min                 208.9         313.6          212.5         307.1
Q1                  308.1         358.1          301.2         350.5
Median              342.2         373.6          335.2         365.7
Mean±Sd             336.6±40.45   372.9±19.77    329.4±40.46   365.1±19.72
Q3                  369.2         387.6          361.1         380.0
Max                 426.1         423.1          425.1         415.1
Target value        391
Bias (%)            54.4 (13.9)   18.1 (4.63)    61.6 (15.8)   25.9 (6.6)

> 80 years old
Min                 370.6         383.8          361.6         389.3
Q1                  433.4         458.9          423.2         447.9
Median              458.8         476.4          449.5         467.2
Mean±Sd             465.0±41.72   476.1±26.70    453.8±40.46   465.7±26.37
Q3                  494.5         495.2          482.5         483.5
Max                 622.9         562.0          568.0         556.3
Target value        464
Bias (%)            −1.0 (0.2)    −12.1 (2.6)    10.2 (2.2)    −1.7 (0.4)

Bias(%): deviation of the mean from the target value of the calibration procedure

Table 3.7: Summary statistics of the predicted lung cancer incidence rates by age group, obtained by implementing the MSM on both the calibration and the validation input samples.

Figure 3.10: Mahalanobis distance distributions of the calibrated MSM predictions from Yclbr and Yfix (internal and external validation).

Figure 3.11: Calibration plots.

Figure 3.12: Box plots.

Internal Validation
  Reference points:       Yclbr = M̃100(θfix, smpl100,000);  Yfix = M̃2000(θfix, smpl.C5000)
  Bayesian Calibration:   M̃50(ΘBayes, smpl.C5000)
  Empirical Calibration:  M̃50(ΘEmp, smpl.C5000)

External Validation
  Reference points:       Yclbr = M̃100(θfix, smpl100,000);  Yfix = M̃2000(θfix, smpl.C5000)
  Bayesian Calibration:   M̃50(ΘBayes, smpl.V5000)
  Empirical Calibration:  M̃50(ΘEmp, smpl.V5000)

Table 3.8: Comparisons: predictions vs. reference points involved in the calculations of the MAD and MSD discrepancy measures for the two calibrated MSMs (table 3.9).

3.6 Calibration Methods Refinement

Another very important finding is that, when applying the Pearson χ² goodness-of-fit (GoF) test, only 34.5% and 59.7% of the predictions from the Bayesian calibrated MSM "pass" the test at the 95% and 99% levels respectively. The corresponding percentages for the Empirically calibrated MSM are much higher, i.e., 77.8% and 98.8% respectively. Analogous findings hold in the case of the external validation of the models, with the percentages of predictions satisfying the GoF test being 31.4% and 54.3% for the Bayesian method, and 73.1% and 96.8% for the Empirical one. This interesting observation motivated a complementary sub-analysis, based on N = 100 random draws from the sets of calibrated parameter values (along with their predictions) "passing" the 95% GoF test from each method.

The results from this supplementary analysis are somewhat different from those of the main analysis. The most prominent differences, as expected, relate to the performance of the Bayesian calibrated model. The distributions of the calibrated parameters, as well as the predictions resulting from this model, are much less dispersed compared to the full analysis. The posterior distribution of θ2 = mdiagn is now centered around the respective fixed value. Predictions are improved for the "60-80yrs" age group. The Bayesian calibrated MSM still performs better when it comes to rare events, while now the overall performance of this model is better than that of the Empirical one (table 3.13) when predictions are compared with the calibration targets.

Internal Validation
                 Bayesian Calibration                      Empirical Calibration
           <60 yrs  60-80 yrs  >80 yrs  Overall    <60 yrs  60-80 yrs  >80 yrs  Overall
Yclbr MAD  0.0400   0.1392     0.0220   0.0600     0.2145   0.0464     0.0261   0.0957
      MSD  0.0518   0.0306     0.0081   0.0300     0.0720   0.0047     0.0040   0.0269
Yfix  MAD  0.2128   0.0465     0.0288   0.0960     0.0041   0.0563     0.0534   0.0379
      MSD  0.0790   0.0153     0.0093   0.0345     0.0175   0.0063     0.0063   0.0100

External Validation
           <60 yrs  60-80 yrs  >80 yrs  Overall    <60 yrs  60-80 yrs  >80 yrs  Overall
Yclbr MAD  0.0563   0.1575     0.0221   0.0777     0.1951   0.0663     0.0037   0.0884
      MSD  0.0516   0.0355     0.0082   0.0318     0.0626   0.0069     0.0032   0.0242
Yfix  MAD  0.2240   0.0668     0.0039   0.0982     0.0200   0.0342     0.0303   0.0282
      MSD  0.0830   0.0176     0.0082   0.0362     0.0169   0.0043     0.0043   0.0085

Table 3.9: Measures of discrepancy to assess the overall MSM predictive performance: MADs and MSDs of the models' predictions from Yclbr = M̃100(θfix, smpl100,000) and Yfix = M̃2000(θfix, smpl.C5000). [Bold numbers indicate the method with the smaller discrepancy.]

Method     Min    Q1     Median  Q3     Max
< 60 yrs
Bayesian   16.3   32.9   40.1    46.2   65.3
Empirical  32.2   45.1   50.1    55.0   68.8
60-80 yrs
Bayesian   215.7  297.7  330.3   355.9  420.5
Empirical  304.5  345.6  360.1   374.7  410.7
> 80 yrs
Bayesian   363.0  436.0  463.2   496.7  584.3
Empirical  408.6  461.2  480.2   497.4  548.8

Table 3.10: Mean values of the main summary statistics (minimum, maximum and quartiles) of the predicted lung cancer incidence rates by age group, for 20 different MSM input samples (figure 3.12).

Figure 3.13: Sub-analysis: Density plots comparing the marginal distributions of the calibrated MSM parameters between the two calibration methods.

θ1=m
Method     Min        Q1         Median     Mean       Q3         Max        Fixed value  Deviation (±SD) (Pk*) (%)
Bayesian   3.02·10−4  3.45·10−4  3.65·10−4  3.65·10−4  3.81·10−4  4.50·10−4  3.8·10−4     1.5·10−5 (2.64·10−5) (72) (3.9)
Empirical  3.15·10−4  3.71·10−4  3.94·10−4  3.95·10−4  4.21·10−4  4.72·10−4  3.8·10−4     −1.51·10−5 (3.8·10−5) (30) (3.97)

θ2=mdiagn
Bayesian   0.1058     1.12   1.83   1.74   2.26   3.93   2    0.26 (0.82) (59) (13)
Empirical  7.42·10−3  2.74   4.60   4.41   6.33   7.98   2    −2.41 (2.24) (16) (120.5)

θ3=mreg
Bayesian   0.402      1.41   2.06   2.16   2.92   4.27   1.1  −1.06 (0.96) (13) (96.4)
Empirical  0.316      1.37   2.14   2.21   3.09   4.35   1.1  −1.11 (1.09) (16) (101)

θ4=mdist
Bayesian   0.157      3.76   5.70   5.58   7.68   10.7   2.8  −2.78 (2.74) (17) (99.3)
Empirical  0.094      2.82   5.90   5.49   7.83   11.0   2.8  −2.69 (3.00) (25) (96.1)

* Pk: Percentile of the predictive distribution the fixed value corresponds to.

Table 3.11: Sub-analysis: Summary statistics of the calibrated MSM parameters.

Figure 3.14: Sub-analysis: Contour plots depicting the bivariate parameter distributions of the Bayesian calibrated MSM. Contours drawn at α=0.95, 0.5 and 0.05 of the bivariate distribution.

123 Figure 3.15: Sub-analysis: Contour plots depicting the bivariate parameter distribu- tions of the Empirically calibrated MSM. Contours drawn at α=0.95, 0.5 and 0.05 of the bivariate distribution.

Figure 3.16: INTERNAL VALIDATION (sub-analysis): Density plots depicting the marginal distributions of the predicted lung cancer incidence rates (cases/100,000 person·years) by age group, compared to calibration targets Yclbr=Mf100(θfix, smpl100,000), and Yfix=Mf2000(θfix, smpl.C5000).

Figure 3.17: EXTERNAL VALIDATION (sub-analysis): Density plots depicting the marginal distributions of the predicted lung cancer incidence rates (cases/100,000 person·years) by age group, compared to calibration targets Yclbr=Mf100(θfix, smpl100,000), and Yfix=Mf2000(θfix, smpl.C5000).

                    INTERNAL Validation          EXTERNAL Validation
Summary statistics  Bayesian      Empirical      Bayesian      Empirical

< 60 years old
Min                 37.24         37.24          34.73         37.11
Q1                  42.96         45.31          41.99         44.73
Median              45.74         50.04          45.08         48.18
Mean±Sd             46.35±5.09    49.9±6.27      45.61±4.85    49.0±5.77
Q3                  50.00         54.4           48.32         53.2
Max                 59.83         61.9           61.06         60.7
Target value        41
Bias                −5.35         −8.9           −4.61         −8.04
(%)                 (13)          (21.7)         (11.2)        (19.6)

60-80 years old
Min                 344.5         344.9          339.7         342.8
Q1                  357.7         362.5          355.6         355.7
Median              367.2         374.9          362.6         366.3
Mean±Sd             368.3±13.71   364.6±13.4     329.4±40.46   368.1±15.53
Q3                  379.0         387.8          372.4         379.8
Max                 401.6         409.2          398.4         402.7
Target value        391
Bias                22.7          15.6           26.7          22.9
(%)                 (5.8)         (3.99)         (6.8)         (5.9)

> 80 years old
Min                 421.6         433.5          428.3         424.5
Q1                  460.2         460.3          460.1         459.2
Median              481.6         480.0          480.0         472.1
Mean±Sd             480.1±24.67   476.8±20.77    478.3±23.68   472.0±17.22
Q3                  498.2         495.0          496.0         482.6
Max                 522.2         514.8          521.7         518.2
Target value        464
Bias                −16.1         −12.8          −14.3         −8.0
(%)                 (3.5)         (2.8)          (3.1)         (1.7)

Bias(%): deviation of the mean from the target value of the calibration procedure

Table 3.12: Sub-analysis: Summary statistics of the predicted lung cancer incidence rates by age group, obtained by implementing the MSM on both the calibration and validation input samples.

Internal Validation

                 Bayesian Calibration                      Empirical Calibration
          <60 yrs   60-80 yrs   >80 yrs   Overall    <60 yrs   60-80 yrs   >80 yrs   Overall
yclbr MAD  0.1305    0.0582      0.0347    0.0745     0.2158    0.0400      0.0276    0.0945
      MSD  0.0323    0.0046      0.0040    0.0136     0.0698    0.0034      0.0027    0.0253
yfix  MAD  0.0730    0.0432      0.0623    0.0595     0.0029    0.0633      0.0548    0.0403
      MSD  0.0156    0.0034      0.0068    0.0086     0.0156    0.0062      0.0051    0.0089

External Validation

                 Bayesian Calibration                      Empirical Calibration
          <60 yrs   60-80 yrs   >80 yrs   Overall    <60 yrs   60-80 yrs   >80 yrs   Overall
yclbr MAD  0.1125    0.0683      0.0307    0.0705     0.1961    0.0584      0.0173    0.0906
      MSD  0.0266    0.0058      0.0035    0.0119     0.0581    0.0049      0.0017    0.0216
yfix  MAD  0.0877    0.0320      0.0581    0.0593     0.0192    0.0429      0.0443    0.0355
      MSD  0.0170    0.0024      0.0061    0.0085     0.0136    0.0038      0.0034    0.0069

Table 3.13: Sub-analysis: Measures of discrepancy to assess the overall predictive performance of the MSM. MADs and MSDs of the model's predictions from yclbr=MSM(smpl100,000, θfix) and yfix=MSM(smpl.C5000, θfix).

3.7 Discussion

In this chapter we presented a comparative analysis of two calibration methods for micro-simulation modeling. We implemented both methods in the free statistical software R, discussed the computational considerations, and compared the results of the two calibrated MSMs.

The comparative analysis showed that the Empirical calibration method is far more computationally efficient, as it can be orders of magnitude faster than the Bayesian one. This finding also extends to the comparison of undirected with directed calibration methods, due to the structural similarities those methods respectively bear with the Empirical and the Bayesian methods presented in this chapter. Furthermore, this chapter emphasizes the imperative need for high-performance computing (HPC) techniques when calibrating any complicated predictive model, including MSMs.

The two methods produced very similar results with respect to the distributions of the calibrated MSM parameters, yielded analogous correlation structures, and raised the same identifiability issues.

Predictions from the calibrated MSMs differ somewhat between the two methods. The Bayesian MSM results in more dispersed predictions than the Empirical model, although there are indications that it predicts rare events better. In addition, the Bayesian method seems to be more robust to the input sample used in the calibration procedure.

Finally, the supplementary analysis reveals a remarkable improvement in the results from the Bayesian MSM. This finding is suggestive of two things. First, more work should be done on the collection of the parameter vectors from the Bayesian calibration method (e.g., length of converged chains, sampling rule for each one of them, etc.). Second, the performance of the MSM can be considerably improved if the Bayesian calibration method is followed by an additional step that further refines the collection of the final sets of vectors for the calibrated parameters. As the supplementary analysis has shown, such an improvement could be achieved if, for example, we choose a subset of vectors for the MSM parameters that provides a good fit of the model to observed data, according to some GoF criterion.

Future work will be directed towards a more detailed calibration of the streamlined MSM for lung cancer developed in Chapter 2. We will aim at a complete calibration of the MSM, so as to be able to predict individual trajectories for all possible combinations of gender (male/female) and smoking status (never/former/current smokers). Furthermore, we envisage extending the two calibration methods to account for multiple calibration targets, i.e., to incorporate diverse information from different stages of the natural history of lung cancer.

Figure 3.18: Flow chart of the implementation of the approximate MH algorithm of the Bayesian method to calibrate θ1.

Figure 3.19: Flow chart of the implementation of the Bayesian method to calibrate θ1. $\left[A(\theta^k) = \pi(\theta^k) \cdot \prod_{j=1}^{J} \hat{f}_j(y_j \mid \lambda_j)\right]$

Chapter 4

Assessing the predictive accuracy of MSMs

This chapter of the thesis is concerned with the assessment of the predictive accuracy of MSMs, a quality characteristic that has not yet been studied in the literature. The main outcome of interest for this assessment is the individual predicted time to event; thus our approach is based on techniques applied in survival modeling. We propose a set of available concordance indices, typically used for the assessment of the predictive accuracy of survival models. In addition, we study the ability of MSMs to predict times to events, and suggest the use of hypothesis testing to compare observed with predicted survival distributions. We implement the suggested methods in order to assess and compare the predictive accuracy of the two calibrated MSMs resulting from the previous chapter, and we make recommendations on those that can best capture the predictive quality of an MSM.

The chapter begins with background information on methods used for the assessment of the predictive accuracy of complex models in general, as well as survival models in particular. It continues with the description of the methods suggested for the assessment of the predictive accuracy of an MSM. We further describe the simulation study conducted in order to compare the performance of the suggested methods. For the purposes of this study, we applied the methods to each of the two calibrated MSMs resulting from Chapter 3. A detailed analysis of the simulation results follows, accompanied by suggestions on the most appropriate method to be used under certain circumstances. The chapter concludes with future work in the field.

4.1 Background

4.1.1 Assessment of MSMs

An integral part of the development of a new MSM, as of any predictive model, is the assessment of the model's predictive accuracy (92; 105). Having discussed in detail the two major building blocks in the development of an MSM, i.e., model specification and calibration, this chapter is concerned with this property of the model. Assessment of complex models in general comprises the notions of model validation (internal and external), sensitivity analysis, characterization of uncertainty, and predictive accuracy (92; 105).

The development of an MSM is typically accompanied by a validation analysis. For example, model validation may use empirical approaches (118; 3; 4; 65; 23), chi-square (94; 18; 70) and likelihood statistics (94), as well as posterior estimates of model parameters and posterior predictive distributions of model outcomes (90). Validation has been discussed in detail in the previous chapter.

The assessment of uncertainty in MSMs, as in any other complex model, is also of central concern, with a wide range of relevant references, from brief introductions to the problem of measuring uncertainty in complex decision analysis models (83) to the development and implementation of complicated methods. Such methods include Bayesian approaches for characterizing uncertainty with emphasis on model structure (12; 88), expressions of patient heterogeneity and parameter uncertainty (48; 55), applications of Probabilistic Sensitivity Analysis (PSA) (17; 7; 80; 81), etc.

In contrast to the assessment of uncertainty, the assessment of the predictive accuracy of an MSM has not received systematic attention in the literature. However, the assessment of this quality characteristic is essential since, as subsequently noted, one of the most important goals of MSMs is to accurately predict intervention effects at the individual level and, consequently, in homogeneous sub-groups of patients. The study, implementation, and recommendation of statistical measures for assessing the predictive accuracy of MSMs is the main objective of this chapter.

4.1.2 Predictive accuracy of MSMs

Micro-simulation models are broadly used to simulate entire populations with specific characteristics and, often, under different hypothetical scenarios (interventions) (91). The ultimate goal is to use these MSMs to make projections about the possible evolution of the disease or even, when relevant, about the effect of an intervention on the population, so as to inform health policy decisions (92).

However, there are also examples in the literature where individual-level data are used to populate MSMs in order to test additional hypotheses or to enhance the validity of the main findings of a study. McMahon et al. (2008), for instance, populate the Lung Cancer Policy Model with individual-level data from the Mayo CT screening single-arm trial, in order to simulate both the observed screening arm as well as the missing control arm. They aimed in this way to compare original findings from the Mayo CT study with estimates of lung cancer incidence and mortality from a hypothetical control arm with perfectly matched baseline characteristics.

Henderson et al. (2001), on the other hand, emphasize the importance of accurate point estimates, especially of predicted survival times, mentioning, among others, the effect this accuracy may have on administering the most efficient treatment, the saving of valuable resources, and the guidance of personal decisions regarding the remaining lifespan of each individual. They also refer to other practical needs and pressures imposed by the relevant health system, which can be vitally assisted by informed decisions based on accurate survival time predictions. These arguments coincide with one of the main goals of comparative effectiveness research (CER), namely the development of adequate methodology to study differences in treatment response between sub-groups of patients, as well as the enhancement of informed medical decisions on an individual basis (112; 25). Micro-simulation modeling comprises an essential tool for predicting intervention effects on individuals and, consequently, on homogeneous subgroups, hence can be an integral part of the conduct of CER studies.

The aforementioned examples of the use of MSMs to inform health decisions point out the need for methods to assess the predictive accuracy of MSMs. Perhaps one of the most important reasons for the lack of relevant research is that, although very important, the prediction of accurate individual trajectories is a very complicated task, the intricacy of which increases with the number of individual-level characteristics involved. In this chapter we suggest methods from the literature that could be used for the assessment of the predictive accuracy of this type of model. The simulation study we conducted exemplifies the necessity of these methods for comparing two similar, "well" calibrated MSMs.

Predictive accuracy pertains to the ability of a model to correctly predict individual outcomes. Steyerberg et al. (105) provide an overview of traditional and novel measures for assessing the performance of prediction models in general. The authors categorize methods into three broad categories, namely measures of explained variation (R2-statistics), other quadratic scores of the proximity between predictions and actual outcomes (GoF statistics such as MSE, deviance, Brier score, etc.), and measures of the model's discrimination ability (C-statistics, ROC curves).

Measures of explained variation (R2-statistics), although very interesting, are hard to derive in the context of MSMs. Such an attempt would require systematic work on identifying all sources of uncertainty inherent in an MSM, as well as propagating this uncertainty to the model outcomes. Research on this topic is part of the future work related to this thesis. We also discussed GoF statistics in the previous chapter, in the context of the calibration of an MSM. In that setting, we are mostly interested in comparing the overall summary statistics predicted by the model with the actual data (calibration data) found in the lung cancer literature, to determine a "well" calibrated MSM.

In this chapter we focus on the accuracy of individual MSM predictions. The reason is that it is possible for a "good" MSM, according to some overall GoF measures, to perform poorly when it comes to individual predictions. The streamlined MSM, for example, may predict a lung cancer incidence rate very close to the calibration target for a specific age group. However, the individuals for whom the MSM predicted lung cancer may differ considerably from those who actually did develop lung cancer.

Depending on the outcome of interest (e.g., continuous, ordinal, binary or survival data), as well as the type of the model's predictions (e.g., prediction of the actual outcome, risk score, survival probability, etc.), the predictive performance of an MSM can be assessed using a variety of statistical measures. Since MSMs are designed to predict individual patient trajectories, and in order to exploit the most comprehensive predicted information, in this chapter we naturally consider MSMs as a special type of survival model.

Assessing the predictive ability of survival models is a more complicated task than in models for binary outcomes, such as logistic regression models. The complexity in survival data analysis is due to the presence of censored observations, for which the information about the event of interest is missing. The only thing known for these observations is that, up to the censoring time, the subject had not experienced the event of interest. The assessment of the performance of survival models usually entails comparison of the predicted risks (rather than the predicted survival times) with the observed outcomes, usually given a set of covariates. The reason for this is that predicted survival times are not readily available from this type of model.

Several measures for the assessment of the predictive accuracy of a survival model have been suggested in the literature (46; 42; 100; 2; 9; 32; 93). An important class of measures is that of concordance statistics (C-statistics), which focus on discrimination, namely the desired property of the model to correctly classify subjects, given a set of covariates, based on the predicted risk (57; 46).

The most widely used index, due to its simplicity, is the C-index proposed by Harrell et al. (1996). Pencina and D'Agostino (2004) study the statistical properties of C and show the relationship between this index and the modified Kendall's τ. Similar indices were studied by Gonen and Heller (2005), for the evaluation of Cox proportional hazards models, and by Uno et al. (2011). The latter is applicable to any type of survival model that provides an explicit form of the predicted risk as a function of the model parameters and covariates.

A common characteristic of the C-statistics proposed for the assessment of a survival model is that they are all based on comparisons between the actual survival status and a predicted risk score, a closed-form expression of which is obtained from the model. The main reason is that actual predicted survival times are not readily available from these commonly used survival models; they rather require some further processing of the predicted risk, entailing a certain amount of subjectivity in the final prediction. Furthermore, most of these models (proportional hazards and accelerated failure time) imply a one-to-one correspondence between the predicted risk and the expected survival times; therefore, these two quantities can be used interchangeably to express a concordance relationship between observed and predicted outcomes.

Unlike most of the broadly used survival models, MSMs can predict times to events and censoring status given the baseline characteristics of each individual, rather than simple risk scores at specific time points. Therefore, assessing MSM predictive accuracy should not rely solely on concordance measures, because in this way a significant portion of the predicted information (the actual predicted survival times) is ignored. Investigators should rather use discrimination in conjunction with other measures quantifying the proximity between predictions and actual outcomes on an individual basis. Following this reasoning, we suggest here comparisons between the predicted and the observed survival functions as a supplement to concordance statistics for assessing the predictive accuracy of an MSM.

We note here that assessment of the predictive performance of commonly used survival models (e.g., Cox proportional hazards) is also possible through comparison of the observed with the predicted survival. However, a key issue in this assessment is the methodology used for the estimation of the predicted survival from those models (79; 78; 36; 45; 100; 32), especially when the model incorporates time-dependent covariates. Since no predictions are readily available from these models, the predicted survival is subject to additional assumptions (modeling mechanism) beyond those stipulated in the model specification procedure. Therefore, assessment1 of the predictive accuracy of such models depends not only on the model itself, but also on the method used for obtaining the predicted survival. On the contrary, prediction of survival times is usually an integral part of the output of an MSM (as is the case with our streamlined MSM); therefore assessment of the predictive performance is straightforward, and refers directly to the model itself and not to some other external estimation procedure.

1 A review of methods used for the assessment of the predictive performance of risk prediction models can be found in Gerds et al. (2008)

A variety of statistics for comparing survival functions is available in the literature. They include tests based on the comparison of weighted Kaplan-Meier estimates of the survival functions, such as the log-rank test (21), and tests based on the weighted differences of the Nelson-Aalen estimates of the hazard rate, such as the tests by Gehan (1965), Breslow (1970), and Tarone and Ware (1977). These tests, although very popular, are not very powerful at detecting differences in crossing-hazards situations. A class of statistics that has been proposed to amend this shortcoming includes the Renyi-type and Cramer-von Mises statistics. A detailed account of the statistics used in this chapter for the comparison of two survival curves can be found in the book by Klein and Moeschberger (2003).

In the following sections we describe in detail the statistics proposed for the assessment of the predictive accuracy of an MSM, as well as the conduct of a simulation study for the comparison of those methods in an MSM setting.

4.2 Methods

4.2.1 Notation

In order to describe the statistics suggested in this chapter for the assessment of the predictive accuracy of an MSM, we have to introduce some special notation.

Let $X_1, X_2, \ldots, X_N$ and $\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_N$ be the observed and the predicted event times respectively, and $Z_1, Z_2, \ldots, Z_N$ the $p \times 1$ vectors of covariates in a sample of N individuals. In our case, where the objective is to predict individual trajectories using the MSM for lung cancer, the covariates comprise the age, gender and smoking history of each individual. Let also $T_i$ be the actual survival time and $D_i$ the corresponding censoring variable, i.e., the time at which subject i is censored. We assume that D is independent of T and Z. Let $\{(T_i, Z_i, D_i),\ i = 1, \ldots, N\}$ be N independent copies of $\{(T, Z, D)\}$. For each individual i we only observe $(X_i, Z_i, \Delta_i)$, where $X_i = \min(T_i, D_i)$ and

$$\Delta_i = \begin{cases} 1, & \text{if } X_i = T_i \\ 0, & \text{otherwise.} \end{cases}$$

Furthermore, when comparing the survival between two samples, $t_1, t_2, \ldots, t_K$ denote the distinct event times in the pooled sample, $Y_{kj}$ the number of individuals at risk, and $q_{kj}$ the number of events, observed in sample j at time $t_k$, where $k = 1, 2, \ldots, K$. In addition, $Y_k = \sum_{j=1}^{2} Y_{kj}$ and $q_k = \sum_{j=1}^{2} q_{kj}$ are the total number of individuals at risk and the total number of events respectively, in the pooled sample at time $t_k$. Following this notation, the Kaplan-Meier estimator of the survival function, for example in the pooled sample, is:

  1, if tk < t1 Sˆ(t) = (4.1)  Q (1 − qk ), otherwise  tk≤t Yk

while the Nelson-Aalen estimator of the cumulative hazard is:

  0, if tk < t1 He(t) = (4.2)  P qk , otherwise  tk≤t Yk

4.2.2 Concordance statistics

Definition. Let $(X_1, T_1), \ldots, (X_N, T_N)$ be a sample of bivariate, continuous observations. The concordance (C) index for a pair of them, say $(X_1, T_1)$ and $(X_2, T_2)$, is defined in general as (84):

$$C = \mathrm{pr}(T_1 > T_2 \mid X_1 > X_2) \qquad (4.3)$$

The concordance index has been widely used for the assessment of the predictive accuracy of regression models for survival data. In this setting the C-index can take either of the following two forms:

$$C = \mathrm{pr}(g(Z_1) > g(Z_2) \mid T_1 < T_2) \qquad (4.4)$$

or

$$C = \mathrm{pr}(T_1 < T_2 \mid g(Z_1) > g(Z_2)) \qquad (4.5)$$

where $T_i$ denotes the actual survival time and $g(Z_i)$ is some expression of the risk for the ith individual as a function of the vector $Z_i$ of covariates.

In the first case (eq. 4.4), the concordance probability is defined conditionally on the true value and can be considered an expression of the model's sensitivity (i.e., the probability that the model correctly classifies the observations given the "truth"). The second form of the concordance probability is defined conditionally on the test value and is analogous to the predictive value of a diagnostic test, in that it expresses the probability of having a certain ordering in the observed times given what the model predicts for these specific data. Most of the C-statistics for survival models are developed to estimate the conditional probability presented in equation 4.4 (39; 113), while estimates of the other conditional probability are also discussed in the literature (34).

The concordance index can be used to quantify one of the key aspects of predictive accuracy, namely the discrimination ability of a model (105). It takes values between 0.5 and 1. A C-index equal to 1 indicates perfect discrimination ability, while values of the index closer to 0.5 indicate poor discrimination ability of the model.

142 Harrell’s index

Perhaps the most well-known, easy to compute and, therefore, broadly used measure of the discrimination ability of a survival model is Harrell's C-statistic (39). Consider all the different pairs of subjects (i, j), i ≠ j; a pair is concordant if the ordering of the predicted times agrees with the ordering of the observed times (e.g., $\hat{X}_i > \hat{X}_j$ and $X_i > X_j$). The overall C-index suggested by Harrell et al. (39) is defined as the proportion of all usable pairs in the sample that are concordant. Every pair of subjects, at least one of whom had experienced the event of interest, is usable. This index provides an estimate of the concordance probability (eq. 4.4) as:

$$\hat{C}_H = \frac{\sum_{i \ne j} \Delta_i\, I(X_i < X_j)\, I(\hat{X}_i < \hat{X}_j)}{\sum_{i \ne j} \Delta_i\, I(X_i < X_j)} \qquad (4.6)$$
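Equation 4.6 admits a direct pairwise implementation. The sketch below is an illustrative Python version (the thesis used R); the names X, Xhat, and delta mirror the notation above but are otherwise assumptions.

```python
# Minimal sketch of Harrell's C (eq. 4.6): X observed times, Xhat predicted
# times, delta event indicators (1 = event observed). O(N^2) pairwise loop.
def harrell_c(X, Xhat, delta):
    conc = usable = 0
    n = len(X)
    for i in range(n):
        for j in range(n):
            # a pair (i, j) is usable if subject i had the event and X_i < X_j
            if i != j and delta[i] == 1 and X[i] < X[j]:
                usable += 1
                if Xhat[i] < Xhat[j]:   # predicted ordering agrees: concordant
                    conc += 1
    return conc / usable
```

With perfectly ordered predictions the index is 1; with completely reversed predictions it is 0 (values below 0.5 indicate discrimination worse than chance).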

Uno’s index

Uno et al. (113) focus on the estimation of a truncated version of the concordance probability (eq. 4.4), i.e.:

$$C = \mathrm{pr}(g(Z_1) < g(Z_2) \mid T_1 > T_2,\ T_1 < \tau) \qquad (4.7)$$

where τ is a pre-specified time point, the only restriction being that it should be greater than the shortest observed censoring time. The truncation is introduced to address the problem of the unstable estimation of the tail of the survival function.

Uno et al. employ an "inverse probability weighting" technique (10) and propose a non-parametric estimate of the concordance probability. The most important feature of Uno's C-statistic is that, unlike Harrell's index, it does not depend on the study-specific censoring distribution. Using a simulation study, Uno et al. (2011) show that this index is in general robust to the choice of τ and performs better than, or at least as well as, Harrell's index most of the time.

4.2.3 Hypothesis testing

The second set of methods proposed in this chapter for the assessment of the predictive accuracy of an MSM comprises statistical tests for the comparison of the predicted with the observed survival curve. In particular, we compute the log-rank statistic, a Renyi-type statistic, and two different versions of a Cramer-von Mises-type statistic. Each of these statistics is used to test the null hypothesis H0 that there is no difference in the survival distributions between the two samples (observed versus predicted data).

Log-Rank statistic

We first apply the well-known and broadly used log-rank test (85), which, following the notation previously introduced, employs the statistic:

$$Z = \frac{\sum_{k=1}^{K} \left( q_{k1} - Y_{k1}\,\dfrac{q_k}{Y_k} \right)}{\sqrt{\displaystyle\sum_{k=1}^{K} \frac{Y_{k1}}{Y_k} \left(1 - \frac{Y_{k1}}{Y_k}\right) \left(\frac{Y_k - q_k}{Y_k - 1}\right) q_k}} \qquad (4.8)$$

which under H0 has a standard normal distribution. The main limitation of this test is that it does not perform very well in crossing-hazards situations.
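Equation 4.8 can be computed directly from two samples of (time, event) pairs. The sketch below is an illustrative Python version under the notation of Section 4.2.1 (the thesis implementation was in R); the function name and input layout are assumptions.

```python
import math

# Sketch of the log-rank statistic Z (eq. 4.8). t1/d1 and t2/d2 are the times
# and event indicators of the two samples (e.g., observed vs. MSM-predicted).
def logrank_z(t1, d1, t2, d2):
    event_times = sorted({t for t, d in zip(t1 + t2, d1 + d2) if d == 1})
    num = var = 0.0
    for tk in event_times:
        y1 = sum(1 for t in t1 if t >= tk)          # Y_k1: at risk, sample 1
        y2 = sum(1 for t in t2 if t >= tk)          # Y_k2: at risk, sample 2
        yk = y1 + y2
        q1 = sum(1 for t, d in zip(t1, d1) if t == tk and d == 1)
        qk = q1 + sum(1 for t, d in zip(t2, d2) if t == tk and d == 1)
        num += q1 - y1 * qk / yk                    # observed minus expected
        if yk > 1:                                   # variance term of eq. 4.8
            var += (y1 / yk) * (1 - y1 / yk) * ((yk - qk) / (yk - 1)) * qk
    return num / math.sqrt(var)
```

Swapping the two samples negates Z, as expected from the symmetry of the observed-minus-expected sum.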

Renyi type tests

The Renyi-type statistics aim at comparing two (or more) survival distributions in a way analogous to the Kolmogorov-Smirnov test for uncensored data (54). These statistics are more powerful at detecting differences in crossing-hazards situations. In our case, we implement the "log-rank" version of this test. The statistic used for testing the null hypothesis is:

$$Q = \frac{\sup\{|Z(t)|,\ t \le \tau\}}{\sigma(\tau)} \qquad (4.9)$$

with

$$Z(t_\alpha) = \sum_{t_k \le t_\alpha} \left( q_{k1} - Y_{k1}\,\frac{q_k}{Y_k} \right), \qquad \alpha = 1, \ldots, K \qquad (4.10)$$

and

$$\sigma^2(\tau) = \sum_{t_k \le \tau} \left(\frac{Y_{k1}}{Y_k}\right) \left(\frac{Y_{k2}}{Y_k}\right) \left(\frac{Y_k - q_k}{Y_k - 1}\right) q_k \qquad (4.11)$$

where τ is the largest $t_k$ for which $Y_{k1}, Y_{k2} > 0$.

Under the null hypothesis, the statistic Q can be approximated by the distribution of $\sup\{|B(x)|,\ 0 \le x \le 1\}$, where B is a standard Brownian motion process. Critical values of Q can be found in the relevant tables. Taking the supremum of the absolute deviations makes the test more powerful than the simple log-rank test at detecting (existing) differences between two crossing survival curves.
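The Renyi statistic reuses the same per-time summands as the log-rank statistic, but tracks the running supremum of the partial sums (eq. 4.10) instead of only their final value. A hedged Python sketch (names and input layout are illustrative; the thesis used R):

```python
import math

# Sketch of the Renyi ("supremum log-rank") statistic Q (eqs. 4.9-4.11).
# t1/d1 and t2/d2 are times and event indicators of the two samples.
def renyi_q(t1, d1, t2, d2):
    event_times = sorted({t for t, d in zip(t1 + t2, d1 + d2) if d == 1})
    z = var = sup = 0.0
    for tk in event_times:
        y1 = sum(1 for t in t1 if t >= tk)
        y2 = sum(1 for t in t2 if t >= tk)
        if y1 == 0 or y2 == 0:       # past tau: both samples must be at risk
            break
        yk = y1 + y2
        e1 = sum(1 for t, d in zip(t1, d1) if t == tk and d == 1)
        qk = e1 + sum(1 for t, d in zip(t2, d2) if t == tk and d == 1)
        z += e1 - y1 * qk / yk                                      # eq. 4.10
        var += (y1 / yk) * (y2 / yk) * ((yk - qk) / (yk - 1)) * qk  # eq. 4.11
        sup = max(sup, abs(z))       # running supremum of |Z(t)|
    return sup / math.sqrt(var)
```

When |Z(t)| is maximal at the last event time, Q coincides with |Z| from the plain log-rank test; in crossing-hazards settings the interior supremum is what recovers power.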

Cramer-von Mises tests

The last two statistics used for the comparison between observed and predicted survival belong to the Cramer-von Mises type of statistics, which are also analogues of the Kolmogorov-Smirnov test for comparing two cumulative distribution functions (54). Both statistics depend on the weighted squared differences between the Nelson-Aalen estimates of the respective cumulative hazard functions. The first statistic used for this type of test is defined as:

$$Q_1 = \frac{1}{\sigma^2(\tau)} \sum_{t_k \le \tau} \left[ \tilde{H}_1(t_k) - \tilde{H}_2(t_k) \right]^2 \left[ \sigma^2(t_k) - \sigma^2(t_{k-1}) \right] \qquad (4.12)$$

with $t_0 = 0$, and the summation calculated over the distinct death times up to time τ, which is the largest $t_k$ for which $Y_{k1}, Y_{k2} > 0$, i.e., the latest death time at which there are still subjects at risk in both samples. Furthermore, $\tilde{H}_j(t_k)$ (j=1,2 for the two samples, observed and predicted) is the Nelson-Aalen estimator of the cumulative hazard function (Section 4.2.1), with estimated variance:

$$\sigma_j^2(t) = \sum_{t_k \le t} \frac{q_{kj}}{Y_{kj}(Y_{kj} - 1)}, \qquad j = 1, 2 \qquad (4.13)$$

The $Q_1$ statistic is based on the difference between $\tilde{H}_1(t)$ and $\tilde{H}_2(t)$, the variance of which is estimated as:

$$\sigma^2(t) = \sigma_1^2(t) + \sigma_2^2(t) \qquad (4.14)$$

The statistic of the alternative version of the Cramer-von Mises test applied in this chapter is defined as:

$$Q_2 = n \sum_{t_k \le \tau} \left[ \frac{\tilde{H}_1(t_k) - \tilde{H}_2(t_k)}{1 + n\sigma^2(t_k)} \right]^2 \left[ A(t_k) - A(t_{k-1}) \right] \qquad (4.15)$$

where

$$A(t) = \frac{n\sigma^2(t)}{1 + n\sigma^2(t)}$$

Under the null hypothesis, $Q_1$ and $Q_2$ approximately have the same distributions as

$$R_1 = \int_0^1 [B(x)]^2\, dx \qquad \text{and} \qquad R_2 = \int_0^{A(\tau)} [B^0(x)]^2\, dx,$$

respectively, where B(x) is a standard Brownian motion process and $B^0(x)$ is a Brownian bridge process. The critical values of these two processes are also provided in the relevant tables.

Note that there is some loss of power when using either of the two Cramer-von Mises tests compared to the log-rank test (97). However, $Q_1$ performs almost equally well when the hazard rates of the two samples are proportional, while $Q_2$ performs better than the other tests in the case of large early differences when the hazard rates cross.
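The $Q_1$ statistic of equations 4.12-4.14 can be accumulated in a single pass over the pooled death times. The sketch below is an illustrative Python version (the thesis analyses were in R); function and variable names are assumptions, and it presumes at least one death time at which both samples contribute variance.

```python
# Sketch of the first Cramer-von Mises statistic Q1 (eqs. 4.12-4.14), built
# on Nelson-Aalen estimates; t1/d1 and t2/d2 are (time, event) samples.
def cramer_von_mises_q1(t1, d1, t2, d2):
    event_times = sorted({t for t, d in zip(t1 + t2, d1 + d2) if d == 1})
    h1 = h2 = v1 = v2 = prev_var = q1_sum = 0.0
    for tk in event_times:
        y1 = sum(1 for t in t1 if t >= tk)
        y2 = sum(1 for t in t2 if t >= tk)
        if y1 == 0 or y2 == 0:           # tau: last time both samples at risk
            break
        e1 = sum(1 for t, d in zip(t1, d1) if t == tk and d == 1)
        e2 = sum(1 for t, d in zip(t2, d2) if t == tk and d == 1)
        h1 += e1 / y1                     # Nelson-Aalen increment, sample 1
        h2 += e2 / y2                     # Nelson-Aalen increment, sample 2
        if y1 > 1:
            v1 += e1 / (y1 * (y1 - 1))    # eq. 4.13, sample 1
        if y2 > 1:
            v2 += e2 / (y2 * (y2 - 1))    # eq. 4.13, sample 2
        var = v1 + v2                     # eq. 4.14
        q1_sum += (h1 - h2) ** 2 * (var - prev_var)   # eq. 4.12 summand
        prev_var = var
    return q1_sum / prev_var              # divide by sigma^2(tau)
```

Identical samples give $Q_1 = 0$, since every Nelson-Aalen difference vanishes.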

4.2.4 Simulation Study

The purpose of the simulation study conducted in this chapter is to implement and compare the alternative approaches, note their differences, and make suggestions about the most suitable ones to be used for the assessment of the predictive accuracy of an MSM. To this end, the methods were used to assess and compare the predictive accuracy of the two calibrated MSMs obtained in Chapter 3. These two MSMs have exactly the same structure and were calibrated to the same targets using two different calibration methods, a Bayesian and an Empirical one. The two methods resulted in different MSMs with respect to the sets of values for the calibrated parameters.

As input we used a sample of N=5000 men (smpl.15000), current smokers, randomly drawn from the 1980 US population (smpl100,000, Chapter 3). Note here that this sample is different from the one used for the implementation of the two calibration methods (smpl.C5000). As in Chapter 2, the baseline characteristics taken into account for predicting trajectories are age and smoking intensity, expressed as the average number of cigarettes smoked per day for each individual.

For the assessment of the predictive accuracy of the MSM we need to know the truth, namely if and when each person developed lung cancer. In the absence of real data on the time of development of lung cancer in the group used in the simulation study, we simulated the truth. Specifically, we use two simplified "toy" models, which, given only age, predict time to death and time to lung cancer diagnosis for each individual. The first simplified model (truth model 1, toy.1) uses exponential distributions to predict these two time points, while the second one (truth model 2, toy.2) uses Gumbel distributions. The simulated truth about the censoring status is obtained from the comparison of the two predicted times for each individual. For instance, if the predicted time to death is larger than the predicted age at lung cancer diagnosis, the prediction indicates that this person had the event; otherwise the person is censored at the age of death. The parameters of the exponential and Gumbel distributions involved in these simulations were chosen ad hoc, so that the overall lung cancer incidence rates by age group (i.e., <60, 60-80, and >80 years old) approximate those reported in the 2002-2006 SEER data.

We apply these two "toy" models to the input data (smpl.11000) in order to simulate the truth about the age at the development of lung cancer for each individual. Subsequently, the same sample is used as input to each of the two calibrated MSMs resulting from Chapter 3, in order to also predict lung cancer incidence. The comparisons between the predictions and the simulated "truth" provide an indication of the adequacy of each proposed method to assess the predictive performance of an MSM.
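The competing-times logic of the toy.1 (exponential) truth model can be sketched as follows. This is an illustrative Python translation (the thesis used R); the rate parameters and function name are placeholders, not the ad hoc values actually fitted to the 2002-2006 SEER targets.

```python
import random

# Hedged sketch of the toy.1 truth simulation: exponential draws for time to
# death and time to lung cancer diagnosis; the event indicator is 1 when the
# diagnosis time precedes death, 0 (censored at death) otherwise.
# rate_death and rate_cancer are illustrative placeholders.
def simulate_truth(n, rate_death=0.02, rate_cancer=0.01, seed=0):
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        t_death = rng.expovariate(rate_death)     # time to death
        t_cancer = rng.expovariate(rate_cancer)   # time to lung cancer diagnosis
        event = 1 if t_cancer < t_death else 0    # diagnosed before dying?
        records.append((min(t_cancer, t_death), event))
    return records
```

With independent exponentials, the long-run event fraction is rate_cancer / (rate_cancer + rate_death) (one third under the placeholder rates above); tuning these rates per age group is what aligns the simulated incidence with the calibration targets.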

As indicated in Section 3.2.4, the result of each calibration method is a set of V=1000 vectors of the four MSM parameters calibrated in the previous chapter. A single run of the MSM pertains to implementing the model once, in order to make predictions (one trajectory for each individual) for the input sample of interest, given one vector of parameter values. In the tables we present summary results of the model's performance for different numbers V of parameter vectors (i.e., V=200, 400, 600, 800, and all 1000). In this way we are also able to investigate the effect that the total number of micro-simulations has on the final conclusions from the applied statistics.

4.3 Results

4.3.1 Single run of the MSM

For V=1 we present Kaplan-Meier curves of the predicted against the observed survival functions. We also provide estimates of the suggested measures for assessing the MSM's predictive accuracy. Test statistics are accompanied by the respective p-values.
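The Kaplan-Meier curves compared throughout this section can be computed directly from the (time, status) pairs; a minimal stdlib Python sketch of the estimator (the dissertation's computations are done in R):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate from right-censored (time, event)
    data; returns (time, S(t)) pairs at the observed event times."""
    data = sorted(zip(times, events))
    n = len(data)
    s, at_risk, curve = 1.0, n, []
    i = 0
    while i < n:
        t = data[i][0]
        d = sum(1 for tt, e in data if tt == t and e == 1)  # events at t
        c = sum(1 for tt, e in data if tt == t)             # leaving risk set at t
        if d > 0:
            s *= 1.0 - d / at_risk
            curve.append((t, s))
        at_risk -= c
        i += c
    return curve
```

Applying this to both the predicted and the simulated "observed" data yields the pairs of curves plotted in the figures of this section.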

The results from the implementation of the assessment methods on the MSMs, using only one vector of calibrated parameter values, indicate that the simulated "observed" lung cancer survival based on the first toy model (toy.1, exponential distributions) is very close to the predictions from both models (Figures 4.1 and 4.2), although the survival function resulting from the predictions of the Bayesian calibrated MSM crosses the observed survival.

Table 4.1: Assessment of the predictive accuracy of the two calibrated MSMs: Predicted versus simulated (from toy.1 model) survival.

                                  Calibrated MSM
Method                         Bayesian     Empirical
Harrell's index                 0.779        0.754
Uno's index     τ = 100         0.641        0.568
                τ = 80          0.733        0.710
Log-rank        χ²              7.313        3.013
                (p-value)       (0.00685)    (0.0826)
Renyi test      Q               4.03         2.11
                (p-value)       (<0.01)      (0.06)
Cramer-von      Q1              0.654        2.26
Mises           (p-value)       (>0.01)      (<0.025)
                Q2              1.66         0.326
                (p-value)       (<0.02)      (>0.1)

Figure 4.1: Kaplan-Meier curves of the predicted versus the observed (simulated by the first toy model) survival.

The proximity between the predicted and the observed survival is also verified by most of the statistics applied for the assessment of the model (Table 4.1). The C-statistics are similar for the two models, with slightly higher values for the Bayesian model. Also, the log-rank, Renyi type, and Cramer-von Mises (Q2) tests all reject the null hypothesis for the predictions from the Bayesian model but do not reject it for those from the Empirically calibrated MSM at α = 5%. However, we draw the opposite conclusions when looking at the Q1 statistic, according to which the observed survival is similar to the one predicted by the Bayesian model but differs from the one predicted by the Empirical MSM. The reason for this is probably that, as already mentioned, Q2 performs better than the other tests in cases like this, namely when the hazard rates cross and we observe relatively large, early differences between them.
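As an illustration of how a concordance index such as Harrell's C is computed from censored data, here is a rough stdlib Python sketch (the dissertation's estimates come from its R implementation; this O(n²) version is for exposition only):

```python
def harrell_c(times, events, predicted_times):
    """Harrell's concordance index for right-censored data: among usable
    pairs (the subject with the earlier observed time had the event), the
    proportion whose predictions are ordered like the observed times."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # a pair is usable only if subject i has the earlier time AND an event
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5  # tied predictions count one half
    return concordant / usable
```

Because only the relative ranks of the predictions enter the calculation, two models with very different predicted magnitudes can still produce nearly identical C values, which is exactly the behavior noted in this section.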

When it comes to the comparison of the predictions with the simulated truth from the second toy model (Figure 4.2), the observed survival is very close to the one predicted by the Bayesian model, but differs considerably from the survival predicted by the Empirically calibrated MSM. This difference apparently cannot be captured by either of the C-statistics applied, since the respective estimates are very close for the two models (Table 4.2). On the contrary, the difference is reflected in the results from all the statistical tests (log-rank, Renyi type, and Cramer-von Mises). None of these tests rejects the null hypothesis for the Bayesian model, while they all reject it for the Empirically calibrated model, at least at the α = 5% significance level.
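The log-rank comparison of an observed with a predicted survival curve reduces to the standard two-sample statistic; a self-contained Python sketch (illustrative only, not the dissertation's R code):

```python
def logrank(times1, events1, times2, events2):
    """Two-sample log-rank chi-square statistic (1 df) comparing, e.g.,
    predicted with 'observed' survival data."""
    data = ([(t, e, 0) for t, e in zip(times1, events1)] +
            [(t, e, 1) for t, e in zip(times2, events2)])
    event_times = sorted({t for t, e, _ in data if e == 1})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for tt, _, _ in data if tt >= t)              # at risk, total
        n1 = sum(1 for tt, _, g in data if tt >= t and g == 0)  # at risk, group 1
        d = sum(1 for tt, e, _ in data if tt == t and e == 1)   # events at t
        d1 = sum(1 for tt, e, g in data if tt == t and e == 1 and g == 0)
        o_minus_e += d1 - d * n1 / n          # observed minus expected events
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var               # assumes at least one usable event time
```

Because the statistic sums the observed-minus-expected differences over all event times, early and late differences of opposite sign can cancel, which is why crossing survival curves call for the supremum-type (Renyi) and Cramer-von Mises variants as well.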

Figure 4.2: Kaplan-Meier curves of the predicted versus the observed (simulated by the second toy model) survival.

Table 4.2: Assessment of the predictive accuracy of the two calibrated MSMs: Predicted versus simulated (from toy.2 model) survival.

                                  Calibrated MSM
Method                         Bayesian     Empirical
Harrell's index                 0.799        0.796
Uno's index     τ = 100         0.762        0.719
                τ = 80          0.807        0.790
Log-rank        χ²              0.027        18.52
                (p-value)       (0.869)      (<0.0001)
Renyi test      Q               1.894        4.317
                (p-value)       (0.110)      (<0.01)
Cramer-von      Q1              0.724        2.853
Mises           (p-value)       (>0.01)      (<0.01)
                Q2              0.325        1.318
                (p-value)       (>0.01)      (<0.02)

4.3.2 Multiple runs of the MSM

We also assessed the predictive accuracy by running each of the two calibrated MSMs multiple times, i.e., for multiple vectors V of values for the calibrated parameters. In particular, we ran each MSM for five different cases, namely for V=200, 400, 600, 800, and 1000 vectors of parameter values, in order to also investigate the effect the total number of MSM runs has on the results of this assessment. We compare predictions with the simulated truth from both toy models. For each case we provide Kaplan-Meier estimates of the predicted versus the observed survival probabilities. We further provide summary statistics describing the results from the application of each statistical method for the assessment of the predictive accuracy of the model. In particular, we report means and standard deviations of the concordance statistics (Harrell's and Uno's index) from the V implementations of each of these measures on the MSM predictions. Furthermore, for the statistics comparing the observed with the predicted survival, we report the percentage of times the test did not reject H0 at α = 5%, i.e., the hypothesis that the predicted survival is the same as the "observed" (simulated) one.
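These per-run summaries can be assembled as in the following small Python sketch; the inputs (one C-index estimate and one p-value per run) are hypothetical, and the dissertation's actual aggregation is done in R.

```python
import math

def summarize_runs(c_indices, p_values, alpha=0.05):
    """Aggregate V per-run results: mean and sample sd of the concordance
    estimates, plus the percentage of runs in which the test did NOT
    reject H0 at significance level alpha."""
    V = len(c_indices)
    mean = sum(c_indices) / V
    sd = math.sqrt(sum((c - mean) ** 2 for c in c_indices) / (V - 1))
    non_reject = 100.0 * sum(1 for p in p_values if p >= alpha) / len(p_values)
    return mean, sd, non_reject
```

These are exactly the three kinds of entries (mean±sd and non-rejection percentages) reported in Tables 4.3 and 4.4.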

According to the produced graphs (Kaplan-Meier curves in Figures 4.3 to 4.12), as well as the respective tables with the summary statistics from the implementation of the suggested assessment methods (Tables 4.3 and 4.4), the total number V of MSM runs does not appear to affect the final conclusions. Apparently, even V=400 runs appear adequate for drawing safe conclusions about the predictive accuracy of the two MSMs calibrated in the previous chapters.

The true survival, simulated by the first toy model, lies within the range of the predictions from both MSMs for all five cases (i.e., for V=200, 400, 600, 800, and 1000). This means that, overall, the individual predictions from the two models are very close to the observed survival resulting from the first toy model. This proximity between the two survival curves is reflected in the summary statistics of all the methods suggested in this chapter (Table 4.3).

The estimates of Harrell's and Uno's index are almost identical for the two models. The results from the applied tests are also very close for the two MSMs, with a small difference in the rate of non-rejection of H0 in favor of the Bayesian calibrated MSM according to the first three tests. However, when looking at the Cramer-von Mises Q2 test, the difference between the non-rejection rates is bigger and reversed, namely in favor of the Empirically calibrated MSM. This finding is in line with the characteristics of this specific test. As already mentioned, Q2 performs well when there is a large early difference in the hazard rates. The Kaplan-Meier plots reveal much more dispersed predicted survival curves at earlier time points for the Bayesian compared to the Empirically calibrated MSM; consequently, the difference between predicted and observed survival is larger at those points for the Bayesian MSM. This difference is reflected in the results from the implementation of the Q2 test.

Figure 4.3: Kaplan-Meier curves of the predicted (for V=200 vectors of the calibrated MSM parameters) versus the observed (simulated by the first toy model) survival.

Figure 4.4: Kaplan-Meier curves of the predicted (for V=400 vectors of the calibrated MSM parameters) versus the observed (simulated by the first toy model) survival.

Figure 4.5: Kaplan-Meier curves of the predicted (for V=600 vectors of the calibrated MSM parameters) versus the observed (simulated by the first toy model) survival.

Figure 4.6: Kaplan-Meier curves of the predicted (for V=800 vectors of the calibrated MSM parameters) versus the observed (simulated by the first toy model) survival.

Figure 4.7: Kaplan-Meier curves of the predicted (for V=1000 vectors of the calibrated MSM parameters) versus the observed (simulated by the first toy model) survival.

In the second example we compare the predictions with the "true" survival simulated using the second toy model. According to Figures 4.8 to 4.12, the observed survival curve lies, albeit marginally, within the range of the predicted survival curves from the Bayesian calibrated MSM. This is not the case for the Empirically calibrated MSM, for which a considerable part of the observed survival curve lies above the range of the predicted ones. This is an example of a possible scenario where two "well" calibrated MSMs, i.e., two MSMs almost equivalent according to some overall GoF measures, differ considerably when it comes to the individual predicted trajectories.

The estimates of both C-statistics are almost identical for the two models, thus indicating that a concordance index cannot adequately capture the differences between the predicted and the observed survival noted in the Kaplan-Meier curves. On the contrary, the results from all the statistical tests comparing the two survival functions are very different between the two models, indicating that the Bayesian calibrated MSM is more accurate than the Empirically calibrated one. The difference between the

Table 4.3: Assessment of the predictive accuracy of the two calibrated MSMs compared to the simulated truth from toy model 1: Summary statistics of the estimates of six different predictive accuracy measures.

Bayesian Calibrated MSM

          C-statistic (mean±sd)*               Test (%)**
   V     Harrell          Uno             Log-Rank   Renyi   Cramer-von Mises
                                            (Z)       (Q)      (Q1)     (Q2)
  200    0.7808±0.0099    0.6746±0.0605    80.50     80.00    49.50    79.00
  400    0.7806±0.0095    0.6740±0.0560    79.75     82.75    52.75    83.00
  600    0.7804±0.0096    0.6740±0.0559    80.33     82.17    52.50    82.83
  800    0.7801±0.0096    0.6740±0.0555    80.50     83.25    50.88    84.88
 1000    0.7802±0.0095    0.6741±0.0557    80.00     82.40    53.60    84.10

Empirically Calibrated MSM

          C-statistic (mean±sd)*               Test (%)**
   V     Harrell          Uno             Log-Rank   Renyi   Cramer-von Mises
                                            (Z)       (Q)      (Q1)     (Q2)
  200    0.7804±0.0092    0.6683±0.0587    73.50     73.00    49.00    98.50
  400    0.7794±0.0090    0.6730±0.0567    71.25     71.00    48.00    98.50
  600    0.7791±0.0089    0.6729±0.0555    71.50     73.17    45.50    99.00
  800    0.7787±0.0090    0.6722±0.0546    71.13     73.25    45.13    99.13
 1000    0.7787±0.0090    0.6718±0.0548    71.30     73.50    45.00    98.60

* Means and standard deviations of the C-index estimates from the V implementations.
** Percentage of times, in the V implementations, that the test did not reject the H0 at α = 5%.

two models is more prominent when looking at the results from the log-rank test, and smaller based on the results from the implementation of the Cramer-von Mises Q1 test.

Figure 4.8: Kaplan-Meier curves of the predicted (for V=200 vectors of the calibrated MSM parameters) versus the observed (simulated by the second toy model) survival.

Figure 4.9: Kaplan-Meier curves of the predicted (for V=400 vectors of the calibrated MSM parameters) versus the observed (simulated by the second toy model) survival.

Figure 4.10: Kaplan-Meier curves of the predicted (for V=600 vectors of the calibrated MSM parameters) versus the observed (simulated by the second toy model) survival.

Figure 4.11: Kaplan-Meier curves of the predicted (for V=800 vectors of the calibrated MSM parameters) versus the observed (simulated by the second toy model) survival.

Figure 4.12: Kaplan-Meier curves of the predicted (for V=1000 vectors of the calibrated MSM parameters) versus the observed (simulated by the second toy model) survival.

4.4 Discussion

Given that MSMs can usually predict, among other outcomes, actual survival time and censoring status for each individual, we consider them a special type of survival prediction model. In this chapter we implement two concordance indices broadly used for assessing the predictive accuracy of survival models. Furthermore, we suggest and implement four different hypothesis tests (the log-rank test, a Renyi type test, and two Cramer-von Mises tests) as alternative methods to assess the predictive accuracy of an MSM. These tests compare the observed with the predicted survival curve.

It is important to note here that the suggested hypothesis testing methods account for the effect of censoring in a competing risks setting, as is the case in the prediction of lung cancer incidence and mortality given smoking. The MSM takes into account the presence of competing risks when modeling mortality and consequently the KM

Table 4.4: Assessment of the predictive accuracy of the two calibrated MSMs compared to the simulated truth from toy model 2: Summary statistics of the estimates of six different predictive accuracy measures.

Bayesian Calibrated MSM

          C-statistic (mean±sd)*               Test (%)**
   V     Harrell          Uno             Log-Rank   Renyi   Cramer-von Mises
                                            (Z)       (Q)      (Q1)     (Q2)
  200    0.7943±0.0083    0.7298±0.0307    29.50     27.00    29.00    62.50
  400    0.7946±0.0079    0.7295±0.0306    26.00     24.25    25.25    58.00
  600    0.7944±0.0078    0.7304±0.0308    26.33     24.00    26.17    58.17
  800    0.7942±0.0076    0.7301±0.0307    24.75     22.50    24.25    57.25
 1000    0.7943±0.0077    0.7305±0.0308    26.10     24.10    25.50    59.50

Empirically Calibrated MSM

          C-statistic (mean±sd)*               Test (%)**
   V     Harrell          Uno             Log-Rank   Renyi   Cramer-von Mises
                                            (Z)       (Q)      (Q1)     (Q2)
  200    0.7932±0.0079    0.7308±0.0296     3.00      9.00    12.50    29.00
  400    0.7927±0.0081    0.7300±0.0323     3.50      8.00    15.75    27.75
  600    0.7928±0.0081    0.7292±0.0322     2.50      7.50    15.83    27.50
  800    0.7930±0.0080    0.7293±0.0323     2.38      7.63    16.13    28.63
 1000    0.7928±0.0080    0.7292±0.0319     2.00      7.10    15.90    28.90

* Means and standard deviations of the C-index estimates from the V implementations.
** Percentage of times, in the V implementations, that the test did not reject the H0 at α = 5%.

curve of the predicted survival times is adjusted accordingly. In the simulation study we compared the predictions obtained by the MSM with the simulated truth, namely a hypothetical observed KM curve that has been adjusted for the competing risks problem. In practice, when implementing the hypothesis tests, it is advisable to adjust the observed survival in order to account for the presence of competing risks, so as to avoid bias in the survival estimates of the event of interest (54).

Summarizing the main findings from the simulation study, we first note that a single implementation of the MSM, for a randomly selected vector of parameter values (V=1), is not sufficient for comparing the predictive accuracy of two similar MSMs. Furthermore, as already indicated in section 3.2.4, MSM outputs based on more than one set of calibrated values for the model parameters allow for conveying parameter uncertainty in the final results. For these reasons, multiple runs of the model are recommended instead. Based on the results presented in this chapter, V=400 runs of the model are deemed adequate for drawing safe conclusions about the relative predictive accuracy of the two models.

In addition, concordance indices, although useful for measuring the overall discrimination ability of a model, may sometimes fail to capture differences between distinct observed and predicted survival times. The reason is that concordance indices are based on the relative ranks of the observed and the predicted values rather than their actual magnitudes. The estimates of the two C-statistics applied in the simulation study are almost identical for the two models in all cases, and are thus non-informative about the observed discrepancies, especially between the MSM predictions and the simulated "truth" from the second toy model. In the context of micro-simulation modeling, other statistical measures, such as estimates of the mean squared error of the individual predictions, are preferable for signifying this characteristic of an MSM.
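Such an MSEP-type measure could be sketched as follows; this is an illustrative Python fragment only, since the exact quantities to be compared (individual times to event, or survival probabilities) are a design choice left for future work.

```python
def msep(observed, predicted):
    """Mean squared error of prediction over individuals: the average
    squared difference between each predicted and observed quantity
    (e.g., times to event, or survival probabilities)."""
    n = len(observed)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
```

Unlike a rank-based concordance index, this measure is sensitive to the magnitude of the individual discrepancies, which is precisely the information the C-statistics above discard.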

In this chapter we also investigated the performance of several hypothesis tests for survival data. These tests aim at comparing observed and predicted survival distributions, and can provide an indication of the predictive accuracy of the model with respect to the overall survival estimates for the event of interest.

The simulation study showed that the hypothesis tests lead to the same conclusions when there are relatively large differences between the observed and the predicted survival, as in the comparisons with the simulated truth from the second toy model, where all tests indicated the same MSM to be more accurate. However, for less prominent differences it is possible for the tests to reach contradictory conclusions. The reason lies in the specifics of each test, namely which differences (earlier or later) each test weighs more heavily in the calculations, as well as whether or not it performs well in crossing-hazards situations. In such a case it is unclear whether the individual predictions from one MSM are more accurate than the respective ones from the other. Therefore, further investigation is required, and the final conclusions will also depend on the type of differences we are most interested in detecting.

Furthermore, the log-rank and Renyi type tests lead to similar results about the predictive accuracy of the two models. However, the log-rank test proved somewhat more sensitive in detecting the more prominent differences between the observed and the predicted survival curves, compared to the Renyi type test.
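The Renyi type test differs from the ordinary log-rank test only in taking the supremum of the running observed-minus-expected process over the event times, rather than its end value, which prevents early and late differences of opposite sign from cancelling. A rough, self-contained Python sketch (illustrative, not the dissertation's R implementation):

```python
import math

def renyi_statistic(times1, events1, times2, events2):
    """Renyi (supremum) version of the two-sample log-rank test:
    Q = sup_t |W(t)| / sqrt(Var(W(tau)))."""
    data = ([(t, e, 0) for t, e in zip(times1, events1)] +
            [(t, e, 1) for t, e in zip(times2, events2)])
    event_times = sorted({t for t, e, _ in data if e == 1})
    w, var, sup_w = 0.0, 0.0, 0.0
    for t in event_times:
        n = sum(1 for tt, _, _ in data if tt >= t)
        n1 = sum(1 for tt, _, g in data if tt >= t and g == 0)
        d = sum(1 for tt, e, _ in data if tt == t and e == 1)
        d1 = sum(1 for tt, e, g in data if tt == t and e == 1 and g == 0)
        w += d1 - d * n1 / n                  # running observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        sup_w = max(sup_w, abs(w))            # track the supremum of |W(t)|
    return sup_w / math.sqrt(var)
```

When the two hazard functions cross, |W(t)| can be large at intermediate times even though W(tau) is near zero, so the supremum statistic retains power where the end-point log-rank statistic loses it.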

In future work of high priority we will apply the suggested methods to assess the predictive accuracy of the two calibrated MSMs using real data from the National Lung Screening Trial (NLST) (28; 1). This is a large-scale, randomized, multicenter study aimed at comparing the effect of two different screening tests, i.e., low-dose helical computed tomography (CT) and chest radiography, on the lung cancer mortality of current and heavy smokers. Another very interesting application will be the comparison of two structurally different, yet comparable, MSMs using the methods suggested in this chapter. Special attention and additional work are required on the correct incorporation of between-subject variability in the assessment, as well as on the expansion of the methods to base assessment results on multiple outcomes of interest. Finally, another interesting objective for further research is the consideration of censoring in MSM individual predictions, as well as in the assessment of the predictive performance of this type of model.

Moreover, a very interesting objective for further research is the construction and use of a predictive accuracy measure focused on the predictions obtained for each specific individual. Such a measure would be based on the mean squared differences of the individual predictions (MSEP) from the observed data (35; 36). These squared differences could refer to estimates of predicted versus observed survival probabilities, or to times to events, for each individual.

Chapter 5

Conclusions

The main objective of this thesis was to study statistical methods for the development and evaluation of micro-simulation models. In this chapter we summarize the findings, as well as future work related to this research.

We began the work for this dissertation by developing an MSM that describes the natural history of lung cancer. This model was then used as a tool for the implementation and comparison of a Bayesian and an Empirical calibration method, aimed at specifying sets of MSM parameter values that provide a good fit to the observed quantities of interest. Finally, we adapted tools from survival data analysis to evaluate the predictive accuracy of a calibrated MSM.

The streamlined MSM, developed in Chapter 2, combines some of the best practices followed in modeling the natural history of lung cancer and can be used for valid predictions about the course of the disease. The development of this MSM in open-source statistical software (R 3.0.1) enhances the transparency of the model, facilitates research on the statistical properties of MSMs in general, and promotes the improvement and expansion of the model to describe the course of lung cancer in more detail, with the collaboration of scientists from several fields.

The comparative analysis presented in Chapter 3 showed that both calibration methods produce extensively overlapping results, with respect both to the sets of values for the calibrated parameters and to the predictions obtained by each model. However, only the Bayesian calibration method provides a sound theoretical background for the incorporation of prior beliefs in the model and for the interpretation of the results from this procedure. The ultimate goal of this method is to draw values from the joint posterior distribution of the MSM parameters.

Furthermore, the Bayesian method results in an MSM that performs better in the prediction of rare events compared to the Empirical one. The predictions from the Empirically calibrated MSM, on the other hand, are less dispersed. In addition, the Empirical method is more efficient with respect to the computational time required for the entire calibration.

The Bayesian approach, when focused on estimation, may not serve the purpose of model calibration. In fact, the performance of the Bayesian calibration method can be considerably improved by adding to the procedure a "refinement" step, aimed at selecting the subset of parameter values that provides a better fit of the MSM to the observed data, according to some pre-specified GoF measure.

Finally, Chapter 3 emphasizes the imperative need to use High Performance Computing techniques in order to undertake a rather complicated task, such as the calibration of an MSM, in R. This is because the implementation of a calibration procedure involves multitudinous independent micro-simulations, which can be carried out in parallel, thus reducing the total required running time. R facilitates parallel processing via specially designed libraries that can set up and distribute the task to large computer clusters.
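The embarrassingly parallel structure of the calibration can be illustrated with Python's standard multiprocessing module (the dissertation itself uses R's parallel libraries; `one_micro_simulation` is a hypothetical placeholder for a single model run):

```python
from multiprocessing import Pool

def one_micro_simulation(theta):
    """Hypothetical placeholder for one independent micro-simulation of
    the MSM, given a single candidate parameter vector theta."""
    return sum(theta)  # stand-in for the model's output for this vector

def run_in_parallel(parameter_vectors, workers=2):
    """Distribute the independent micro-simulations over worker
    processes, mirroring what R's parallel libraries do on a cluster."""
    with Pool(processes=workers) as pool:
        return pool.map(one_micro_simulation, parameter_vectors)
```

Because the micro-simulations share no state, the total running time scales down roughly with the number of available workers.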

According to the simulation study conducted in Chapter 4, concordance statistics, although useful for assessing the overall discrimination ability of an MSM, may not capture differences between observed and predicted survival. The accuracy of an MSM, with respect to overall predicted survival, can be better assessed by applying hypothesis tests used in survival analysis to compare observed with predicted survival curves. These tests account for the effect of censoring in a competing risks setting, as in the case of the survival estimates of lung cancer incidence and mortality given smoking. All the tests suggested in this chapter lead to the same conclusion when the predictions obtained by the MSM are very different from the respective actual observations. Furthermore, the log-rank test proved more sensitive than the other tests in detecting the more prominent differences.

We intend to continue and extend our work in a number of important directions. First, we plan to extend the original MSM in order to incorporate more detailed information, as well as screening and treatment components, thus making it comparable to existing models of lung cancer. We also plan the publication of the MSM in the form of a library in the CRAN package repository of the R statistical software.

We used the two methods presented in Chapter 3 to calibrate the MSM on data from male current smokers. We plan to perform a complete calibration of this MSM, that is, to calibrate the parameters so that the model will be able to predict individual trajectories within narrower subgroups defined by covariates beyond gender and smoking status. Furthermore, we will expand the methods so as to account for multiple calibration targets.

We also intend to apply the methods suggested in Chapter 4 for the assessment of the predictive accuracy of the MSM using actual data from the NLST study. It would also be informative to study how measures of predictive performance can be used to compare two completely different models, such as two structurally different MSMs for lung cancer. More research is also required on the expansion of the methods so as to account for multiple outcomes of interest, as well as for the incorporation of between-subject variability in the calculations.

Another very interesting topic for further consideration would be the construction of a predictive accuracy measure, focusing on discrepancies of the individual predictions from the observed data. This measure would be an estimate of the mean squared error of the MSM predictions (MSEP). The quantities involved in the calculation could be estimates of the survival probabilities for each particular individual, as well as times to event or censoring.

Bibliography

[1] Aberle, D. R., Adams, A. M., Berg, C. D., Black, W. C., Clapp, J. D., Fagerstrom, R. M., Gareen, I. F., Gatsonis, C., Marcus, P. M., Sicks, J. D., and the National Lung Screening Trial Research Team (2011), "Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening," New England Journal of Medicine, 365, 395–409.

[2] Antolini, L., Boracchi, P., and Biganzoli, E. (2005), "A time-dependent discrimination index for survival data," Statistics in Medicine, 24, 3927–3944.

[3] Baker, R. (1998), "Use of a mathematical model to evaluate breast cancer screening policy," Health Care Management Science, 1, 103–113.

[4] Berry, D. A., Inoue, L., Shen, Y., Venier, J., Cohen, D., Bondy, M., Theriault, R., and Munsell, M. F. (2006), “Chapter 6: Modeling the Impact of Treatment and Screening on U.S. Breast Cancer Mortality: A Bayesian Approach,” JNCI Monographs, 2006, 30–36.

[5] Blower, S. M. and Dowlatabadi, H. (1994), "Sensitivity and Uncertainty Analysis of Complex Models of Disease Transmission: An HIV Model, as an Example," International Statistical Review / Revue Internationale de Statistique, 62, 229–243.

[6] Breslow, N. (1970), "Generalized Kruskal-Wallis Test for Comparing K Samples Subject to Unequal Patterns of Censorship," Biometrika, 57, 579–594.

[7] Briggs, A. H., O’Brien, B. J., and Blackhouse, G. (2002), “Thinking outside the box: Recent advances in the analysis and presentation of uncertainty in cost-effectiveness studies,” Annual Review of Public Health, 23, 377–401.

[8] Campbell, K. (2006), "Statistical calibration of computer simulations," Reliability Engineering & System Safety, 91, 1358–1363.

[9] Chen, H. C., Kodell, R. L., Cheng, K. F., and Chen, J. J. (2012), "Assessment of performance of survival prediction models for cancer prognosis," BMC Medical Research Methodology, 12.

[10] Cheng, S. C., Wei, L. J., and Ying, Z. (1995), “Analysis of transformation models with censored data,” Biometrika, 82, 835–845.

[11] Chia, Y. L., Salzman, P., Plevritis, S. K., and Glynn, P. W. (2004), “Simulation-based parameter estimation for complex models: a breast cancer natural history modelling illustration,” Statistical Methods in Medical Research, 13, 507–524.

[12] Clyde, M. and George, E. I. (2004), “Model uncertainty,” Statistical Science, 19, 81–94.

[13] Cronin, K. A., Legler, J. M., and Etzioni, R. D. (1998), "Assessing uncertainty in micro simulation modelling with application to cancer screening interventions," Statistics in Medicine, 17, 2509–2523.

[14] De Angelis, D., Sweeting, M., Ades, A. E., Hickman, M., Hope, V., and Ramsay, M. (2009), "An evidence synthesis approach to estimating Hepatitis C Prevalence in England and Wales."

[15] Detterbeck, F. C. and Gibson, C. J. (2008), "Turning gray: The natural history of lung cancer over time," Journal of Thoracic Oncology, 3, 781–792.

[16] Deutsch, J. L. and Deutsch, C. V. (2012), “Latin hypercube sampling with multidimensional uniformity,” Journal of Statistical Planning and Inference, 142, 763–772.

[17] Doubilet, P., Begg, C. B., Weinstein, M. C., Braun, P., and McNeil, B. J. (1985), “Probabilistic Sensitivity Analysis Using Monte Carlo Simulation,” Medical Decision Making, 5, 157–177.

[18] Draisma, G., Boer, R., Otto, S. J., van der Cruijsen, I. W., Damhuis, R. A. M., ... Schröder, F. H., and de Koning, H. J. (2003), "Lead Times and Overdetection Due to Prostate-Specific Antigen Screening: Estimates From the European Randomized Study of Screening for Prostate Cancer," Journal of the National Cancer Institute, 95, 868–878.

[19] Eddelbuettel, D. (2013), "CRAN Task View: High-Performance and Parallel Computing with R," http://cran.r-project.org/web/views/HighPerformanceComputing.html, [Online; Retrieved: 15-March-2013].

[20] Fine, J. P. and Gray, R. J. (1999), “A proportional hazards model for the subdistribution of a competing risk,” J Am Stat Assoc, 94, 496–509.

[21] Fleming, T. R. and Harrington, D. P. (1981), "A Class of Hypothesis Tests for One and Two Sample Censored Survival Data," Communications in Statistics Part A-Theory and Methods, 10, 763–794.

[22] Foy, M., Spitz, M. R., Kimmel, M., and Gorlova, O. Y. (2011), "A smoking-based carcinogenesis model for lung cancer risk prediction," International Journal of Cancer.

[23] Fryback, D. G., Stout, N. K., Rosenberg, M. A., Trentham-Dietz, A., Kuruchittham, V., and Remington, P. L. (2006), "Chapter 7: The Wisconsin Breast Cancer Epidemiology Simulation Model," JNCI Monographs, 2006, 37–47.

[24] Gampe, J. and Zinn, S. (2009), "The Microsimulation tool of the MicMac project," 2nd General Conference of the International Microsimulation Association, (Ottawa, Canada).

[25] Garber, A. M. and Tunis, S. R. (2009), "Does Comparative-Effectiveness Research Threaten Personalized Medicine?," New England Journal of Medicine, 360, 1925–1927.

[26] Garg, M. L., Rao, B. R., and Redmond, C. K. (1970), “Maximum-Likelihood Estimation of the Parameters of the Gompertz Survival Function,” Journal of the Royal Statistical Society. Series C (Applied Statistics), 19, 152–159.

[27] Gatsonis, C. (2010), “The promise and realities of comparative effectiveness research,” Statistics in Medicine, 29, 1977–1981.

[28] Gatsonis, C. A. and Team, N. L. S. T. R. (2011), “The National Lung Screening Trial: Overview and Study Design,” Radiology, 258, 243–253.

[29] Geddes, D. M. (1979), “The natural history of lung cancer: a review based on rates of tumour growth,” Br J Dis Chest, 73, 1–17.

[30] Gehan, E. A. (1965), “A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples,” Biometrika, 52, 203–223.

[31] Gerds, T. A., Cai, T. X., and Schumacher, M. (2008), “The performance of risk prediction models,” Biometrical Journal, 50, 457–479.

[32] Gerds, T. A., Kattan, M. W., Schumacher, M., and Yu, C. (2013), "Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring," Statistics in Medicine, 32, 2173–2184.

[33] Goldwasser, D. L. (2009), “Parameter estimation in mathematical models of lung cancer [doctoral thesis],” Ph.D. thesis.

[34] Gonen, M. and Heller, G. (2005), “Concordance probability and discriminatory power in proportional hazards regression,” Biometrika, 92, 965–970.

[35] Gorfine, M., Hsu, L., Zucker, D. M., and Parmigiani, G. (2013), “Calibrated predictions for multivariate competing risks models,” Lifetime Data Anal.

[36] Graf, E., Schmoor, C., Sauerbrei, W., and Schumacher, M. (1999), "Assessment and comparison of prognostic classification schemes for survival data," Statistics in Medicine, 18, 2529–2545.

[37] Gray, R. J. (1988), “A Class of K-Sample Tests for Comparing the Cumulative Incidence of a Competing Risk,” Annals of Statistics, 16, 1141–1154.

[38] Habbema, J. D. F., van Oortmarssen, G. J., Lubbe, J. T. N., and van der Maas, P. J. (1985), “The MISCAN simulation program for the evaluation of screening for disease,” Computer Methods and Programs in Biomedicine, 20, 79–93.

[39] Harrell, F. E., Lee, K. L., and Mark, D. B. (1996), “Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors,” Statistics in Medicine, 15, 361–387.

[40] Hazelton, W. D., Clements, M. S., and Moolgavkar, S. H. (2005), "Multistage carcinogenesis and lung cancer mortality in three cohorts," Cancer Epidemiology Biomarkers & Prevention, 14, 1171–1181.

[41] Hazelton, W. D., Luebeck, E. G., Heidenreich, W. E., and Moolgavkar, S. H. (2001), "Analysis of a historical cohort of Chinese tin miners with arsenic, radon, cigarette smoke, and pipe smoke exposures using the biologically based two-stage clonal expansion model," Radiation Research, 156, 78–94.

[42] Heagerty, P. J. and Zheng, Y. Y. (2005), “Survival model predictive accuracy and ROC curves,” Biometrics, 61, 92–105.

[43] Heidenreich, W. F., Jacob, P., and Paretzke, H. G. (1997), “Exact solutions of the clonal expansion model and their application to the incidence of solid tumors of atomic bomb survivors,” Radiation and Environmental Biophysics, 36, 45–58.

[44] Heidenreich, W. F., Luebeck, E. G., and Moolgavkar, S. H. (1997), “Some properties of the hazard function of the two-mutation clonal expansion model,” Risk Analysis, 17, 391–399.

[45] Henderson, R., Jones, M., and Stare, J. (2001), “Accuracy of point predictions in survival analysis,” Statistics in Medicine, 20, 3083–3096.

[46] Hielscher, T., Zucknick, M., Werft, W., and Benner, A. (2010), "On the prognostic value of survival models with application to gene expression signatures," Statistics in Medicine, 29, 818–829.

[47] Howlader, N., Noone, A., Krapcho, M., Neyman, N., Aminou, R., Waldron, W., Altekruse, S. F., Kosary, C., Ruhl, J., Tatalovich, Z., Cho, H., Mariotto, A., Eisner, M., Lewis, D., Chen, H., Feuer, E., and Cronin, K. (2012), "SEER Cancer Statistics Review, 1975–2009 (Vintage 2009 Populations)," National Cancer Institute, Bethesda, MD; posted to the SEER web site, April 2012.

[48] Hunink, M. G. M., Koerkamp, B. G., Weinstein, M. C., Stijnen, T., and Heijenbrok-Kal, M. H. (2010), "Uncertainty and Patient Heterogeneity in Medical Decision Models," Medical Decision Making, 30, 194–205.

[49] Jit, M., Choi, Y. H., and Edmunds, W. J. (2008), "Economic evaluation of human papillomavirus vaccination in the United Kingdom," BMJ (Clinical research ed.), 337, a769.

[50] Karnon, J., Goyder, E., Tappenden, P., McPhie, S., Towers, I., Brazier, J., et al. (2007), "A review and critique of modelling in prioritising and designing screening programmes," Health Technology Assessment, 11.

[51] Kennedy, M. C. and O'Hagan, A. (2001), "Bayesian calibration of computer models," Journal of the Royal Statistical Society Series B-Statistical Methodology, 63, 425–450.

[52] Kim, J. J., Kuntz, K. M., Stout, N. K., Mahmud, S., Villa, L. L., Franco, E. L., and Goldie, S. J. (2007), “Multiparameter Calibration of a Natural History Model of Cervical Cancer,” American Journal of Epidemiology, 166, 137–150.

[53] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983), “Optimization by Simulated Annealing,” Science, 220, 671–680.

[54] Klein, J. P. and Moeschberger, M. L. (2003), Survival Analysis: Techniques for Censored and Truncated Data, New York: Springer.

[55] Koerkamp, B. G., Stijnen, T., Weinstein, M. C., and Hunink, M. G. M. (2011), “The Combined Analysis of Uncertainty and Patient Heterogeneity in Medical Decision Models,” Medical Decision Making, 31, 650–661.

[56] Kopec, J. A., Fines, P., Manuel, D. G., Buckeridge, D. L., Flanagan, W. M., Oderkirk, J., Abrahamowicz, M., Harper, S., Sharif, B., Okhmatovskaia, A., Sayre, E. C., Rahman, M. M., and Wolfson, M. C. (2010), "Validation of population-based disease simulation models: a review of concepts and methods," BMC Public Health, 10.

[57] Korn, E. L. and Simon, R. (1990), "Measures of explained variation for survival data," Statistics in Medicine, 9, 487–503.

[58] Koscielny, S., Tubiana, M., Le, M. G., Valleron, A. J., Mouriesse, H., Contesso, G., and Sarrazin, D. (1984), "Breast Cancer: Relationship between the Size of the Primary Tumor and the Probability of Metastatic Dissemination," British Journal of Cancer, 49, 709–715.

[59] Koscielny, S., Tubiana, M., and Valleron, A. J. (1985), "A simulation model of the natural history of human breast cancer," British Journal of Cancer, 52, 515–524.

[60] Kullback, S. and Leibler, R. A. (1951), “On Information and Sufficiency,” Annals of Mathematical Statistics, 22, 79–86.

[61] Laird, A. K. (1964), “Dynamics of Tumor Growth,” British Journal of Cancer, 18, 490–502.

[62] L'Ecuyer, P., Simard, R., Chen, E. J., and Kelton, W. D. (2002), "An object-oriented random-number package with many long streams and substreams," Operations Research, 50, 1073–1075.

[63] L'Ecuyer, P. and Leydold, J. (2005), "rstream: Streams of Random Numbers for Stochastic Simulation," R News, 5, 16–20.

[64] Luebeck, E. G., Heidenreich, W. F., Hazelton, W. D., Paretzke, H. G., and Moolgavkar, S. H. (1999), "Biologically based analysis of the data for the Colorado uranium miners cohort: Age, dose and dose-rate effects," Radiation Research, 152, 339–351.

[65] Mandelblatt, J., Schechter, C. B., Lawrence, W., Yi, B., and Cullen, J. (2006), "Chapter 8: The SPECTRUM Population Model of the Impact of Screening and Treatment on U.S. Breast Cancer Trends From 1975 to 2000: Principles and Practice of the Model Methods," JNCI Monographs, 2006, 47–55.

[66] Mannion, O., Lay-Yee, R., Wrapson, W., Davis, P., and Pearson, J. (2012), "JAMSIM: a Microsimulation Modelling Policy Tool," JASSS - The Journal of Artificial Societies and Social Simulation, 15.

[67] Matloff, N. (2013), "Programming on Parallel Machines," http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf, [Online; Retrieved: 13-March-2013].

[68] McCallum, Q. E. and Weston, S. (2012), Parallel R, Sebastopol, CA: O'Reilly Media.

[69] McKay, M. D., Beckman, R. J., and Conover, W. J. (2000), “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code,” Technometrics, 42, 55–61.

[70] McMahon, P. M. (2005), "Policy assessment of medical imaging utilization: methods and applications," Ph.D. thesis.

[71] McMahon, P. M., Kong, C. Y., Johnson, B. E., Weinstein, M. C., Weeks, J. C., Kuntz, K. M., Shepard, J. A. O., Swensen, S. J., and Gazelle, G. S. (2008), “Estimating long-term effectiveness of lung cancer screening in the Mayo CT screening study,” Radiology, 248, 278–287.

[72] Meza, R., Hazelton, W. D., Colditz, G. A., and Moolgavkar, S. H. (2008), "Analysis of lung cancer incidence in the nurses' health and the health professionals' follow-up studies using a multistage carcinogenesis model," Cancer Causes & Control, 19, 317–328.

[73] Moeschberger, M. L. and Klein, J. P. (1995), "Statistical methods for dependent competing risks," Lifetime Data Analysis, 1, 195–204.

[74] Moolgavkar, S. H. and Luebeck, E. G. (2003), "Multistage carcinogenesis and the incidence of human cancer," Genes Chromosomes Cancer, 38, 302–306.

[75] Moolgavkar, S. H. and Luebeck, G. (1990), "Two-Event Model for Carcinogenesis: Biological, Mathematical, and Statistical Considerations," Risk Analysis, 10, 323–341.

[76] Mountain, C. F. (1997), “Revisions in the International System for Staging Lung Cancer,” Chest, 111, 1710–1717.

[77] Nelder, J. A. and Mead, R. (1965), "A Simplex Method for Function Minimization," The Computer Journal, 7, 308–313.

[78] Nielsen, B. (1997), "Expected survival in the Cox model," Scandinavian Journal of Statistics, 24, 275–287.

[79] Nieto, F. J. and Coresh, J. (1996), "Adjusting survival curves for confounders: A review and a new method," American Journal of Epidemiology, 143, 1059–1068.

[80] Oakley, J. E. and O’Hagan, A. (2004), “Probabilistic sensitivity analysis of complex models: a Bayesian approach,” Journal of the Royal Statistical Society Series B-Statistical Methodology, 66, 751–769.

[81] O’Hagan, A., Stevenson, M., and Madan, J. (2007), “Monte Carlo probabilistic sensitivity analysis for patient level simulation models: Efficient estimation of mean and variance using ANOVA,” Health Economics, 16, 1009–1023.

[82] Orcutt, G. H. (1957), "A New Type of Socio-Economic System," Review of Economics and Statistics, 39, 116–123.

[83] Parmigiani, G. (2002), “Measuring uncertainty in complex decision analysis models,” Statistical Methods in Medical Research, 11, 513–537.

[84] Pencina, M. J. and D'Agostino, R. B. (2004), "Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation," Statistics in Medicine, 23, 2109–2123.

[85] Peto, R. and Peto, J. (1972), "Asymptotically Efficient Rank Invariant Test Procedures," Journal of the Royal Statistical Society Series A (General), 135, 185–207.

[86] Plevritis, S. K., Salzman, P., Sigal, B. M., and Glynn, P. W. (2007), "A natural history model of stage progression applied to breast cancer," Statistics in Medicine, 26, 581–595.

[87] Plevritis, S. K., Sigal, B. M., Salzman, P., Rosenberg, J., and Glynn, P. (2006), “Chapter 12: A Stochastic Simulation Model of U.S. Breast Cancer Mortality Trends From 1975 to 2000,” JNCI Monographs, 2006, 86–95.

[88] Poole, D. and Raftery, A. E. (2000), “Inference for deterministic simulation models: The Bayesian melding approach,” J Am Stat Assoc, 95, 1244–1255.

[89] Rossini, A. J., Tierney, L., and Li, N. (2007), "Simple parallel statistical computing in R," Journal of Computational and Graphical Statistics, 16, 399–420.

[90] Rutter, C. M., Miglioretti, D. L., and Savarino, J. E. (2009), "Bayesian Calibration of Microsimulation Models," J Am Stat Assoc, 104, 1338–1350.

[91] Rutter, C. M. and Savarino, J. E. (2010), "An Evidence-Based Microsimulation Model for Colorectal Cancer: Validation and Application," Cancer Epidemiology Biomarkers and Prevention, 19, 1992–2002.

[92] Rutter, C. M., Zaslavsky, A. M., and Feuer, E. J. (2011), "Dynamic Microsimulation Models for Health Outcomes," Medical Decision Making, 31, 10–18.

[93] Saha-Chaudhuri, P. and Heagerty, P. J. (2013), “Non-parametric estimation of a time-dependent predictive accuracy curve,” Biostatistics, 14, 42–59.

[94] Salomon, J. A., Weinstein, M. C., Hammitt, J. K., and Goldie, S. J. (2002), "Empirically calibrated model of hepatitis C virus infection in the United States," American Journal of Epidemiology, 156, 761–773.

[95] Santner, T. J., Williams, B. J., and Notz, W. (2003), The Design and Analysis of Computer Experiments, Springer Series in Statistics, New York: Springer.

[96] Schmidberger, M., Morgan, M., Eddelbuettel, D., Yu, H., Tierney, L., and Mansmann, U. (2009), "State of the Art in Parallel Computing with R," Journal of Statistical Software, 31, 1–27.

[97] Schumacher, M. (1984), "Two-Sample Tests of Cramér-von Mises- and Kolmogorov-Smirnov-Type for Randomly Censored Data," International Statistical Review, 52, 263–281.

[98] Shi, L., Tian, H., McCarthy, W., Berman, B., Wu, S., and Boer, R. (2011), "Exploring the uncertainties of early detection results: model-based interpretation of the Mayo Lung Project," BMC Cancer, 11, 92.

[99] Siegel, R., Naishadham, D., and Jemal, A. (2012), “Cancer statistics, 2012,” CA Cancer J Clin, 62, 10–29.

[100] Simon, R. M., Subramanian, J., Li, M. C., and Menezes, S. (2011), "Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data," Briefings in Bioinformatics, 12, 203–214.

[101] Sonnenberg, F. A. and Beck, J. R. (1993), "Markov Models in Medical Decision-Making: A Practical Guide," Medical Decision Making, 13, 322–338.

[102] Spratt, J. S. and Spratt, T. L. (1964), "Rates of Growth of Pulmonary Metastases and Host Survival," Annals of Surgery, 159, 161–171.

[103] Steel, G. G. (1977), Growth kinetics of tumours : cell population kinetics in relation to the growth and treatment of cancer, Oxford: Clarendon Press.

[104] Stein, M. (1987), "Large Sample Properties of Simulations Using Latin Hypercube Sampling," Technometrics, 29, 143–151.

[105] Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J., and Kattan, M. W. (2010), "Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures," Epidemiology, 21, 128–138.

[106] Stout, N. K., Knudsen, A. B., Kong, C. Y., McMahon, P. M., and Gazelle, G. S. (2009), “Calibration Methods Used in Cancer Simulation Models and Suggested Reporting Guidelines,” Pharmacoeconomics, 27, 533–545.

[107] Tan, S. Y. G. L., van Oortmarssen, G. J., de Koning, H. J., Boer, R., and Habbema, J. D. F. (2006), “Chapter 9: The MISCAN-Fadia Continuous Tumor Growth Model for Breast Cancer,” JNCI Monographs, 2006, 56–65.

[108] Tarone, R. E. and Ware, J. (1977), “Distribution-Free Tests for Equality of Survival Distributions,” Biometrika, 64, 156–160.

[109] Department of Health and Human Services (2009), "Draft definition of Comparative Effectiveness Research for the Federal Coordinating Council," http://www.hhs.gov/recovery/programs/cer/draftdefinition.html.

[110] Thames, H. D., Buchholz, T. A., and Smith, C. D. (1999), “Frequency of first metastatic events in breast cancer: Implications for sequencing of systemic and local-regional treatment,” Journal of Clinical Oncology, 17, 2649–2658.

[111] Tierney, L. (2008), Implicit and Explicit Parallel Computing in R, Physica- Verlag HD, chap. 4, pp. 43–51.

[112] Tunis, S. R., Benner, J., and McClellan, M. (2010), “Comparative effectiveness research: Policy context, methods development and research infrastructure,” Statistics in Medicine, 29, 1963–1976.

[113] Uno, H., Cai, T., Pencina, M. J., D'Agostino, R. B., and Wei, L. J. (2011), "On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data," Statistics in Medicine, 30, 1105–1117.

[114] Vanni, T., Karnon, J., Madan, J., White, R. G., Edmunds, W. J., Foss, A. M., and Legood, R. (2011), "Calibrating models in economic evaluation: a seven-step approach," Pharmacoeconomics, 29, 35–49.

[115] Vanni, T., Legood, R., Franco, E. L., Villa, L. L., Luz, P. M., and Schwartsmann, G. (2011), "Economic evaluation of strategies for managing women with equivocal cytological results in Brazil," International Journal of Cancer, 129, 671–679.

[116] Wakelee, H. A., Chang, E. T., Gomez, S. L., Keegan, T. H., Feskanich, D., Clarke, C. A., Holmberg, L., Yong, L. C., Kolonel, L. N., Gould, M. K., and West, D. W. (2007), “Lung cancer incidence in never smokers,” Journal of Clinical Oncology, 25, 472–478.

[117] Welton, N. J. and Ades, A. E. (2005), “A model of toxoplasmosis incidence in the UK: evidence synthesis and consistency of evidence,” Journal of the Royal Statistical Society: Series C (Applied Statistics), 54, 385–404.

[118] Yamaguchi, N., Tamura, Y., Sobue, T., Akiba, S., Ohtaki, M., Baba, Y., Mizuno, S., and Watanabe, S. (1991), "Evaluation of Cancer Prevention Strategies by Computerized Simulation Model: An Approach to Lung Cancer," Cancer Causes & Control, 2, 147–155.