
Analytics for Improved Cancer Screening and Treatment

by

John Silberholz

B.S. Mathematics and B.S. Computer Science, University of Maryland (2010)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Sloan School of Management, August 10, 2015

Certified by: Dimitris Bertsimas, Boeing Leaders for Global Operations Professor, Co-Director, Operations Research Center, Thesis Supervisor

Accepted by: Patrick Jaillet, Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science, Co-Director, Operations Research Center

Submitted to the Sloan School of Management on August 10, 2015, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research

Abstract

Cancer is a leading cause of death both in the United States and worldwide. In this thesis we use machine learning and optimization to identify effective treatments for advanced cancers and to identify effective screening strategies for detecting early-stage disease.

In Part I, we propose a methodology for designing combination drug therapies for advanced cancer, evaluating our approach using advanced gastric cancer. First, we build a database of 414 clinical trials testing regimens for this cancer, extracting information about patient demographics, study characteristics, chemotherapy regimens tested, and outcomes. We use this database to build statistical models to predict trial efficacy and toxicity outcomes. We propose models that use machine learning and optimization to suggest regimens to be tested in Phase II and III clinical trials, evaluating our suggestions with both simulated outcomes and the outcomes of clinical trials testing similar regimens.

In Part II, we evaluate how well the methodology from Part I generalizes to advanced breast cancer. We build a database of 1,490 clinical trials testing drug therapies for breast cancer, train statistical models to predict trial efficacy and toxicity outcomes, and suggest combination drug therapies to be tested in Phase II and III studies. In this work we model differences in drug effects based on the receptor status of patients in a clinical trial, and we evaluate whether combining clinical trial databases of different cancers can improve clinical trial toxicity predictions.

In Part III, we propose a methodology for decision making when multiple mathematical models have been proposed for a phenomenon of interest, using our approach to identify effective population screening strategies for prostate cancer. We implement three published mathematical models of prostate cancer screening strategy outcomes, using optimization to identify strategies that all models find to be effective.

Thesis Supervisor: Dimitris Bertsimas
Title: Boeing Leaders for Global Operations Professor, Co-Director, Operations Research Center

Acknowledgments

I would first like to gratefully acknowledge my advisor Dimitris Bertsimas for his role in the work presented in this dissertation and also for helping me mature as a researcher during my time at MIT. By adding me to his team studying the synthesis of clinical trial outcomes during the first month of my PhD, Dimitris introduced me to the research questions addressed in the first two chapters of my thesis and, more broadly, to the area of healthcare analytics, the focus of my entire dissertation. Dimitris’s unrelenting focus on research questions with real-world impact has helped cement in my mind the sorts of problems I should pursue as a researcher, something that will benefit me throughout my research career.

I would also like to acknowledge the guidance of Vivek Farias and Retsef Levi. By serving on both my general exam committee and my thesis committee, Vivek and Retsef provided thoughtful feedback about my research that had a significant impact on the material present in this dissertation.

I would like to acknowledge my collaborators on the work described in this thesis. It was a pleasure working with Stephen Relyea and Allison O’Hair on the study of gastric cancer clinical trials, which was awarded the 2013 Pierskalla Award from the INFORMS Health Applications Society. The study of breast cancer clinical trials would not have been possible without assistance in the literature review and data extraction from Allison O’Hair and from a number of undergraduate research assistants from both MIT and Wellesley: Emily Chen, Michael Chen, Shahrin Islam, Siva Nagarajan, David Sukhin, Pei Tao, Roza Trilesskaya, Victoria Wang, Mimi Williams, and Joanna Yeh. Finally, it has been a pleasure to collaborate with Brown Professor Tom Trikalinos on the prostate cancer screening work, and I would also like to thank MIT librarian Courtney Crummett for her assistance in the literature review for that thesis chapter.

Throughout my PhD I have enjoyed spending time with and learning from fellow ORC students. In particular I wanted to thank Nataly Youssef and Swati Gupta for the great times as my friends and study buddies; Iain Dunning and Swati Gupta for an enjoyable and productive research collaboration on the MQLib project; and Allison O’Hair, Iain Dunning, Angie King, Velibor Mišić, and Nataly Youssef for a successful collaboration in developing the Analytics Edge MOOC.

Lastly, I wanted to thank my family for their support. My parents have always encouraged my interest in math and programming, and my father introduced me to operations research in high school by connecting me with Bruce Golden, his advisor for his master’s degree. Thanks also to my in-laws for their kind words and support while I worked through my PhD. Finally, many thanks to my wife Sarah for all she has done to help me through the PhD, from serving as a sounding board for research ideas to providing feedback on manuscripts to celebrating successes to providing support through difficulties.

Contents

1 Introduction
  1.1 Models to Design Drug Regimens for Advanced Cancer
  1.2 Optimizing Screening Strategies Using Multiple Mathematical Models

2 Designing Chemotherapy Regimens for Advanced Gastric Cancer
  2.1 Introduction
  2.2 Clinical Trial Database
    2.2.1 Manual Data Collection
    2.2.2 An Overall Toxicity Score
  2.3 Statistical Models Predicting Clinical Trial Outcomes
    2.3.1 Data and Variables
    2.3.2 Statistical Models
    2.3.3 Statistical Model Results
    2.3.4 Identifying Unpromising Clinical Trials Before They Are Run
  2.4 Design of Chemotherapy Regimens
    2.4.1 Phase II Regimen Optimization Model
    2.4.2 Phase III Regimen Selection Model
    2.4.3 Evaluation Techniques
    2.4.4 Optimization Results
  2.5 Discussion and Future Work

3 Designing Drug Therapies for Advanced Breast Cancer
  3.1 Introduction
  3.2 Breast Cancer Clinical Trial Database
    3.2.1 Manual Data Collection
  3.3 Statistical Models for Breast Cancer Outcomes
    3.3.1 Results
    3.3.2 Identifying Unpromising Trials
  3.4 Combined Databases for Toxicity Prediction
    3.4.1 Statistical Model Results
  3.5 Design of Chemotherapy Regimens
  3.6 Discussion

4 Identifying Effective Screening Strategies for Prostate Cancer
  4.1 Introduction
  4.2 Methods
    4.2.1 Model Parameters
    4.2.2 Quality of Life
    4.2.3 Screening Strategies
  4.3 Results
    4.3.1 Sensitivity Analysis
  4.4 Discussion

A Data Preprocessing for Gastric Cancer Database
  A.1 Performance Status
  A.2 Grade 4 Blood Toxicities
  A.3 Proportion of Patients with a DLT

B Coefficients of Models Predicting Gastric Cancer Outcomes

C Pseudocode of Evaluation Techniques for Proposed Regimens

D Data Preprocessing for Breast Cancer Database
  D.1 Imputation of Demographic Data
  D.2 Imputation of Toxicity Outcomes

E Drugs Appearing in Breast Cancer Clinical Trial Database

F Literature Review Details

G Ranges Used in Sensitivity Analysis

H Building an Efficient Frontier of Screening Strategies

List of Figures

2-1 The results of the 372 clinical trial arms in our database with non-missing median OS and DLT proportion. The size of a point is proportional to the number of patients in that clinical trial arm.

2-2 Sequential out-of-sample prediction accuracy of survival models calculated over 4-year sliding windows ending in the date shown, reported as the coefficient of determination (R²).

2-3 Four-year sliding-window sequential out-of-sample classification accuracy of toxicity models, reported as the area under the curve (AUC) for predicting whether a trial will have high toxicity (DLT proportion > 0.5).

2-4 Performance of the ridge regression model for toxicity at flagging high-toxicity trials among the 397 trial arms in the testing set.

2-5 Performance of the ridge regression model for median OS at flagging trials that do not achieve top-quartile efficacy among the 397 trial arms in the testing set.

2-6 Median OS and DLT proportion for chemotherapy regimens that could potentially be selected for a Phase III trial experimental arm by our models. The left figure plots outcomes of the Phase II study that tested each regimen, and the right figure plots predicted outcomes for the regimen in a typical Phase III clinical trial patient population. The red points show the top five trials according to the data reported in the Phase II trials, and the green points show the top five trials according to the prediction models.

2-7 Comparison between Phase III experimental arm suggestions (n = 27) from our models and from current practice according to the simulation metric. Each point represents a simulation of our approach, and highlighted points are averages across simulations. Points to the left of the y-axis improved over current practice in the proportion of trials with unacceptably high toxicity, and points above the x-axis improved over current practice in average median OS.

2-8 Comparison between Phase III experimental arm suggestions (n = 27) from our models and from current practice according to the matching metric.

3-1 The results of the 887 clinical trial arms in our database with non-missing median OS and DLT proportion. The size of a point is proportional to the number of patients in that clinical trial arm.

3-2 The number of times each drug was used in a breast cancer clinical trial arm and in a gastric cancer clinical trial arm.

3-3 Four-year sliding-window R² value of the statistical models’ predictions of median OS.

3-4 Four-year sliding-window AUC of the statistical models’ predictions of the proportion of patients experiencing a DLT, labeling trials as having unacceptably high toxicity if more than 50% of patients experience a DLT.

3-5 Performance of the statistical model for toxicity at flagging high-toxicity arms among the 1,678 arms in the testing set.

3-6 Performance of the statistical model for efficacy at flagging trial arms that fail to reach the top quartile of efficacy among the 1,678 arms in the testing set.

3-7 Receiver operating characteristic curve for out-of-sample toxicity predictions of the 331 gastric cancer clinical trial arms published since 1998 with extractable toxicity data.

3-8 Receiver operating characteristic curve for out-of-sample toxicity predictions of the 883 breast cancer clinical trial arms published since 1998 with extractable toxicity data.

3-9 Receiver operating characteristic curve for out-of-sample toxicity predictions of the 35 gastric cancer clinical trial arms and 22 breast cancer clinical trial arms published since 1998 for which a drug in the study had been tested in no more than two prior clinical trial arms for that cancer and had been tested in the other cancer.

3-10 Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the simulation metric.

3-11 Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the matching metric.

4-1 Average and most pessimistic assessments of identified and expert-generated screening strategies.

4-2 The 64 screening strategies on the efficient frontier trading off average and most pessimistic QALE assessment, with the strategy optimizing the pessimistic assessment highlighted in red.

4-3 Average and most pessimistic assessments of identified and expert-generated screening strategies, using the most pessimistic QALE decrements from [88].

4-4 The 112 screening strategies on the efficient frontier trading off average and most pessimistic QALE assessment, using the most pessimistic QALE decrements from [88]. The strategy optimizing the pessimistic assessment is highlighted in red.

C-1 Pseudocode of the simulation metric procedure.

C-2 Pseudocode of the matching metric procedure.

F-1 Literature review to identify decision analyses.

List of Tables

2.1 A summary of the models developed and evaluated in this chapter.

2.2 Definitions of some common chemotherapy clinical trial terms.

2.3 Patient demographic, study characteristic, and outcome variables extracted from gastric cancer clinical trials. These variables, together with the drug variables, were entered into a database.

2.4 Root mean square prediction error (RMSE) and R² for the cross-validation set (“X-Val”), for out-of-sample predictions (“OOS”), and for bootstrapped out-of-sample predictions (“Bootstrap”) for survival predictions in the most recent 4-year window of data (March 2008–March 2012), which includes 132 out-of-sample predictions.

2.5 AUC for the cross-validation set (“X-Val”), for out-of-sample predictions (“OOS”), and for bootstrapped out-of-sample predictions (“Bootstrap”) for toxicity predictions in the most recent 4-year window of data (March 2008–March 2012), which includes 119 out-of-sample predictions. Of these, 24 (20.1%) actually had high toxicity.

2.6 Comparison between Phase III experimental arm suggestions (n = 27) from our models and from current practice according to the simulation metric, reporting for each optimization model with parameters Γ and t the change in average median OS and proportion of trials with unacceptably high DLT compared to current practice.

2.7 Comparison between Phase III experimental arm suggestions (n = 27) from our models and from current practice according to the matching metric, reporting for each optimization model with parameter t the proportion of model suggestions that match all drugs and all drug classes of the matched future trial, as well as the average change in median OS and proportion of trials with an unacceptably high DLT rate compared to current practice.

3.1 Average value, range, and reporting rate of patient demographic, study characteristic, and outcome variables extracted from breast cancer clinical trials.

3.2 Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the simulation metric, reporting for each optimization model with parameters Γ and t the change in average median OS and proportion of trials with unacceptably high DLT compared to current practice.

3.3 Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the matching metric, reporting for each optimization model with parameter t the proportion of model suggestions that match all drugs and all drug classes of the matched future trial, as well as the average change in median OS and proportion of trials with unacceptably high DLT compared to current practice.

A.1 Correlation of estimates of total Grade 3/4 toxicity to the true value.

B.1 Coefficients of drug dosage variables in the ridge regression models predicting efficacy and toxicity outcomes.

B.2 Coefficients of the non-drug variables in the ridge regression models predicting efficacy and toxicity outcomes.

E.1 All drugs tested in at least one clinical trial arm in the breast cancer clinical trial database.

Chapter 1

Introduction

Cancer is a leading cause of death, causing 8.2 million deaths worldwide in 2012, and 14.1 million new cases of cancer were diagnosed that year [165]. This leads to a large global economic burden: the total cost in the first year after diagnosis of all new cancer cases in 2010 was an estimated $290 billion, and that number will increase to an estimated $458 billion for new cancer cases diagnosed in 2030 [30].

Given the enormous global health impact and cost of cancer, it is unsurprising that screening and treatment for cancer have received significant attention in operations research (OR) and related communities. OR researchers have studied optimal population cancer screening strategies for decades, modeling strategies for a wide variety of cancers including breast cancer [19, 20, 81, 107, 123, 144, 154], cervical cancer [120, 125], and colorectal cancer [47, 85, 117]. Research into cancer treatment has considered many angles, including decision making in pharmaceutical research and development [73, 145, 160], clinical trial design [7, 27, 28], radiation therapy planning [32, 42, 146, 158, 159], optimizing chemotherapy dosing schedules [5], sequential decision making in cancer treatment [105, 151], and individualized mortality prediction [116, 149, 180].

This dissertation breaks new ground by applying OR techniques to cancer screening and treatment in two new ways. First, we develop a framework for learning from large databases of clinical trials that test drug therapies for cancer, providing patients and oncologists with information about the therapies that have been tested to

date, developing statistical models that could be used by clinical trial planners to prioritize which therapies are tested in new trials, and using optimization to design novel combination drug regimens. Additionally, we develop an approach for identifying population screening strategies for cancer that perform well when evaluated by mathematical models of screening effectiveness that have structural differences; this robustness to model structure could improve decision makers’ confidence in the quality of the identified strategies.

1.1 Models to Design Drug Regimens for Advanced Cancer

Advanced cancer occurs when the disease has spread within a patient’s body; for most forms of cancer, patients’ long-term prognosis is poor when they reach this disease stage, and treatment focuses on extending life and improving quality of life. Combination drug therapy, in which patients are treated with multiple drugs, is a popular treatment option for advanced cancers, as evidence from randomized controlled trials suggests that these therapies extend patient survival compared to single-agent drug therapies [40, 56, 67, 172]. Given the large number of drugs available to treat each advanced cancer and the many dosing schedules that could be used for each drug, there is an enormous number of feasible combination drug regimens for each advanced cancer.

Oncology researchers use clinical trials to evaluate the effectiveness of new therapies: a Phase I study evaluates the safety of a proposed therapy, a Phase II study further evaluates safety and also evaluates efficacy, and a Phase III clinical trial performs a large, randomized comparison of a therapy against a standard regimen. Clinical trials are expensive, costing on average more than $10 million for Phase II studies and $20 million for Phase III trials [153]; these costs are typically incurred either by a pharmaceutical company or the government.

Given the significant costs of running trials and the large number of potential combination drug therapies that could be tested, researchers use a number of techniques to identify promising drug regimens. Clinical trials often cite in vitro or animal experimentation as motivations for testing a new drug regimen, and other tools such as molecular simulation [45] and virtual clinical trials [108] are also available to help design new regimens. Researchers further rely on previously published randomized controlled trials in a systematic way through meta-analyses, which test a hypothesis by combining the results of all published randomized trials that evaluate that hypothesis. While meta-analyses can provide broad guidance in clinical trial design, for instance showing that combination chemotherapy regimens for advanced gastric cancer yield longer survival than single-agent regimens [172] or that anthracycline-containing combination chemotherapy regimens for advanced breast cancer yield longer survival than non-anthracycline-containing combination chemotherapy regimens [67], they cannot be used to predict the efficacy of a specific untested drug regimen and do not incorporate evidence from any non-randomized clinical studies. Researchers also rely on evidence from the literature collected in a non-systematic way, often referencing a successful Phase II study as motivation for testing a particular regimen.

In this dissertation, we present a framework for constructing and learning from large databases of clinical trials that test combination drug therapies. Like meta-analysis, our approach relies on previously published clinical trials in a systematic way. Unlike meta-analysis, our approach incorporates information from non-randomized studies, can be used to predict efficacy and toxicity outcomes for proposed therapies, and can be used to design totally new combination drug regimens. The framework includes the following steps:

∙ Clinical trial database: We first perform a literature review to identify all Phase II and III clinical trials that test drug regimens for the cancer of interest, extracting information about the patients in the study, the study itself, the drugs being tested, and efficacy and toxicity outcomes of interest. This database could be used by cancer patients and their doctors to observe the outcomes of the full set of drug therapies that have been tested to date for a particular cancer.

∙ Statistical models for efficacy and toxicity outcomes: We train statistical models using the results of previously published clinical trials and use these models to predict survival and toxicity outcomes of new clinical trials, evaluating regimens whose drugs have individually been tested before, but potentially in different combinations or dosages. These models could be used by clinical trial planners to prioritize a portfolio of proposed clinical trials, focusing on the ones most likely to exhibit high efficacy and acceptable toxicity.

∙ Optimization models to design new combination drug therapies: We propose and evaluate tools for suggesting novel drug therapies to be tested in Phase II studies and for selecting previously tested regimens to be further evaluated in Phase III clinical trials. Our methodology balances the dual objectives of exploring novel drug therapies and testing treatments predicted to be highly effective, and it could be used by clinical trial planners to identify promising therapies for testing.
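The three steps above can be sketched end to end on synthetic data. Everything in the sketch is an illustrative assumption rather than the thesis’s actual specification: the drug names are hypothetical, the feature encoding is a simple dose vector, the models are closed-form ridge regressions, and exhaustive search stands in for the optimization models.

```python
# Toy end-to-end sketch of the three-step framework on synthetic data.
# Assumptions (not from the thesis): hypothetical drug names, a dose-vector
# feature encoding, and exhaustive search in place of optimization models.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
DRUGS = ["drugA", "drugB", "drugC", "drugD", "drugE"]  # hypothetical names

# Step 1: mock "clinical trial database". Each row is one trial arm:
# per-drug dose intensities (0 = drug absent) plus a demographic covariate.
n = 200
doses = rng.uniform(0, 1, (n, 5)) * (rng.random((n, 5)) < 0.4)
age = rng.uniform(55, 70, (n, 1))
X = np.hstack([doses, age])
os_months = 7 + doses @ np.array([2.0, 1.5, 1.0, 0.5, -0.5]) + rng.normal(0, 1, n)
dlt_prop = np.clip(0.2 + doses @ np.array([0.10, 0.15, 0.05, 0.20, 0.25])
                   + rng.normal(0, 0.05, n), 0, 1)

# Step 2: ridge regression models predicting efficacy and toxicity.
def ridge_fit(X, y, alpha=1.0):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # bias column
    return np.linalg.solve(Xb.T @ Xb + alpha * np.eye(Xb.shape[1]), Xb.T @ y)

def ridge_predict(w, x):
    return np.append(x, 1.0) @ w

w_eff = ridge_fit(X, os_months)
w_tox = ridge_fit(X, dlt_prop)

# Step 3: choose the combination with the best predicted median OS,
# subject to a predicted-toxicity constraint (DLT proportion <= 0.5).
def best_regimen(max_drugs=3, dose=0.8, tox_limit=0.5, patient_age=62.0):
    best, best_os = None, -np.inf
    for k in range(1, max_drugs + 1):
        for combo in combinations(range(len(DRUGS)), k):
            x = np.zeros(len(DRUGS) + 1)
            x[list(combo)] = dose
            x[-1] = patient_age
            if ridge_predict(w_tox, x) <= tox_limit and ridge_predict(w_eff, x) > best_os:
                best, best_os = combo, ridge_predict(w_eff, x)
    return [DRUGS[i] for i in best], best_os

regimen, predicted_os = best_regimen()
print(regimen, round(predicted_os, 2))
```

The sketch only shows how the predictions from Step 2 become the objective and constraint of Step 3; the thesis’s candidate set and objectives are richer, covering dosages, the novelty of combinations, and robustness parameters.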

In Chapter 2, we present this framework in detail, demonstrating that it is feasible by building a clinical trial database, statistical models, and optimization models for advanced gastric cancer drug therapies. In Chapter 3, we test the degree to which the framework generalizes to advanced breast cancer, again performing all three steps of the framework for that disease. We further investigate whether data from multiple cancers can be combined to improve the quality of toxicity predictions.

1.2 Optimizing Screening Strategies Using Multiple Mathematical Models

Early detection of cancer via screening has the potential to improve long-term prognosis, and the U.S. Preventive Services Task Force (USPSTF) recommends population screening for breast cancer [169], cervical cancer [128], and colorectal cancer [168], additionally encouraging screening in high-risk populations for lung cancer [130].

Many population screening strategies are possible for a given cancer, which vary based on the age to start and stop screening, the interval between consecutive screening tests, and the age-specific modality or intensity of screening. Given the infeasibility of comparing a large number of strategies with a randomized controlled trial (RCT), coupled with the large enrollment and long follow-up needed for a screening RCT, mathematical modeling is a popular tool for evaluating competing screening strategies. Many models of cancer screening and treatment have been published in the OR and medical literatures.
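To make the size of this strategy space concrete, the toy enumeration below (the age grids and intervals are hypothetical choices, not taken from this chapter) counts strategies that vary only the start age, stop age, and screening interval; even this coarse grid yields 75 candidates, far more than an RCT could compare head-to-head.

```python
# Hypothetical grids; real strategy spaces also vary screening modality and
# age-specific intensity, so they grow far faster than this.
from itertools import product

start_ages = range(40, 61, 5)   # 40, 45, 50, 55, 60
stop_ages = range(65, 86, 5)    # 65, 70, 75, 80, 85
intervals = [1, 2, 4]           # years between screening tests

strategies = [(a0, a1, dt)
              for a0, a1, dt in product(start_ages, stop_ages, intervals)
              if a0 < a1]
print(len(strategies))  # -> 75
```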

There is considerable uncertainty in mathematical models for cancer screening regarding the natural history of cancer in the human body, the ability of screening tests to detect cancer, and the impact of subsequent treatment of screen-detected cancers. Uncertainty about parameters in mathematical models, such as the age-specific cancer incidence rate or the rate at which untreated cancer advances in stage, has received considerable attention in the modeling literature. Sensitivity analysis and probabilistic sensitivity analysis [59] can be used to determine the uncertainty in model outcomes based on parameter uncertainty, while methodologies such as stochastic optimization [29], robust optimization [24], and robust control [184] can be used to make decisions that perform well across a range of possible parameter values.

In this dissertation, we instead focus on the structural uncertainty of models, which captures uncertainty due to the simplifications and assumptions made as part of the modeling process [31]. Examples of structural uncertainty include the probability distribution selected to model the dwelling time before a cancer advances in stage or whether a stage-shift or cure model is used to capture screening benefit [174]. Traditionally, structural uncertainty is quantified using a comparative modeling approach, in which multiple models with different structures are used to evaluate proposed health interventions. If models with different structures yield the same or similar conclusions about the effectiveness of a health intervention, this could increase the confidence of a decision maker in the accuracy of those conclusions. Comparative modeling approaches have been used to identify effective screening strategies for many cancers, including colorectal cancer [178], breast cancer [124], and lung cancer [53]. The results of the models in a comparative modeling study can be combined through model averaging, in which the average or weighted average of model outcomes is used to assess a health intervention [31], though in practice results are often presented separately for each model, with differences discussed in the text of the study [53, 124, 178].

In this dissertation, we present a framework for identifying health interventions that perform well across all models in a comparative modeling study, using mathematical optimization to identify effective interventions from a large set of competing strategies. We extend previous work by considering not only the average of models’ assessments of strategies but also the most pessimistic assessment, giving decision makers tunable control over the degree of conservatism to take when identifying effective interventions.

In Chapter 4, we present this framework in detail and use it to identify prostate cancer screening strategies that perform well across three structurally different models of screening for that disease. We show that strategies found to be effective by one mathematical model in a comparative modeling setting may not be found effective by all models, highlighting the importance of considering all models when evaluating screening strategies. We further show that our approach can mitigate some weaknesses of unweighted model averaging, which can favor overly aggressive screening strategies. Finally, we discuss how our framework could be applied in a broad array of comparative modeling settings, including evaluating strategies to screen for other cancers, to screen for other diseases, to prevent infectious diseases, and to control chronic diseases.
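As a toy illustration of this tunable conservatism (the QALE numbers and the simple scoring rule are hypothetical, not the chapter’s actual optimization), suppose three models each assess three candidate strategies, and a weight lam interpolates between pure model averaging (lam = 1) and the purely pessimistic maximin criterion (lam = 0):

```python
# Rows: three structurally different models; columns: candidate strategies.
# All QALE values are made up for illustration.
import numpy as np

qale = np.array([
    [20.2, 21.0, 20.4],
    [20.1, 21.2, 20.3],
    [20.0, 18.5, 19.9],   # one pessimistic model dislikes strategy 1
])

def pick_strategy(qale, lam):
    """lam = 1: average assessment; lam = 0: most pessimistic assessment."""
    score = lam * qale.mean(axis=0) + (1 - lam) * qale.min(axis=0)
    return int(np.argmax(score))

print(pick_strategy(qale, 1.0))  # -> 1 (averaging favors the aggressive strategy)
print(pick_strategy(qale, 0.0))  # -> 0 (maximin rejects it)
```

This mirrors the behavior discussed in Chapter 4: unweighted averaging can favor a strategy that one model assesses poorly, while the pessimistic criterion guards against exactly that case.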

Chapter 2

Designing Chemotherapy Regimens for Advanced Gastric Cancer

2.1 Introduction

Cancer is a leading cause of death worldwide, accounting for 8.2 million deaths in 2012. This number is projected to increase, with an estimated 13.1 million deaths in 2030 [177]. The prognosis for many solid-tumor cancers is grim unless they are caught at an early stage, when the tumor is contained and can still be surgically removed. At the time of diagnosis, the tumor is often sufficiently advanced that it has metastasized to other organs and can no longer be surgically removed, leaving drug therapy or best supportive care as the best treatment options.

A key goal of oncology research for advanced cancer is to identify novel chemotherapy regimens that yield better clinical outcomes than currently available treatments [139, 148]. Phase II clinical trials are used to evaluate the efficacy of novel regimens, with a focus on exploring treatments that have never been previously tested for a disease; in this chapter we found that 89.4% of Phase II trials for advanced gastric cancer test a new chemotherapy regimen. While some trials evaluate a new drug for a particular cancer, the majority (84.6% for gastric cancer) instead test novel combinations of existing drugs in different dosages and schedules; in this chapter we focus primarily on this type of chemotherapy regimen. The most effective regimens identified in Phase II trials are then evaluated in Phase III studies, which are large randomized controlled trials comparing one or more experimental regimens against a control group treated with the best available standard chemotherapy regimen [70]. Treatments that perform well against standard treatments in Phase III trials may then be considered new standard regimens for advanced cancer; identifying such regimens and thereby improving the set of treatments available to patients is a key goal of oncology research for advanced cancer [139, 148].

Finding novel, effective chemotherapy treatments for advanced cancer is challenging in part because the most effective chemotherapy regimens often contain more than one drug. Meta-analyses for a number of cancers have demonstrated efficacy gains of combination chemotherapy regimens over single-agent treatments [56, 172], and in this chapter we found that 80% of all chemotherapy clinical trials for advanced gastric cancer have tested multi-drug treatments. As a result of the large number of different chemotherapy drugs, there are a huge number of potential drug combinations that could be investigated in a new Phase II trial, especially when considering different dosages and dosing schedules for each drug. However, testing any chemotherapy regimen in a clinical trial is expensive, costing on average more than $10 million for Phase II studies and $20 million for Phase III trials [153]; these costs are often incurred either by pharmaceutical companies or by the government. Furthermore, even after a Phase II study has been run testing a new regimen, it can be difficult to determine whether this regimen is a good candidate for testing in a larger Phase III study because Phase II trials often enroll patient populations that are not representative of typical advanced cancer patients [70]. For these reasons, it is a challenge for researchers to identify effective new combination chemotherapy regimens.

Our aspiration in this chapter is to propose an approach that could serve as a method for selecting the chemotherapy regimens to be tested in Phase II and III clinical trials. Because Phase III clinical trials are used to test the most promising chemotherapy regimens to date and can directly affect standard clinical practice, our central objective in this chapter is to design tools that can improve the quality of the chemotherapy regimens tested in Phase III trials compared to current practice. The

key contributions of the chapter are:

Clinical Trial Database We developed a database containing information about the patient demographics, study characteristics, chemotherapy regimens tested, and outcomes of all Phase II and III clinical trials for advanced gastric cancer from papers published in the period 1979–2012 (Section 2.2). Surprisingly, and to the best of our knowledge, such a database did not exist prior to this study.

Statistical Models Predicting Clinical Trial Outcomes We train statistical models using the results of previous randomized and non-randomized clinical trials (Section 2.3). We use these models to predict survival and toxicity outcomes of new clinical trials evaluating regimens whose drugs have individually been tested before, but potentially in different combinations or dosages. To our knowledge, no previous papers have employed statistical models for the prediction of clinical trial outcomes of arbitrary drug combinations and performed an out-of-sample evaluation of the predictions.

Design of Chemotherapy Regimens We propose and evaluate tools for suggesting novel chemotherapy regimens to be tested in Phase II studies and for selecting previously tested regimens to be further evaluated in Phase III clinical trials (Section 2.4). Our methodology balances the dual objectives of exploring novel chemotherapy regimens and testing treatments predicted to be highly effective. To our knowledge, this is the first use of statistical models and optimization to design novel chemotherapy regimens based on the results of previous clinical trials.

We summarize the models developed and evaluated in this chapter in Table 2.1. In Section 2.4.4 we approximate the quality of our suggested chemotherapy regimens using both simulated clinical trial outcomes and the true outcomes of similar clinical trials in our database. In Section 2.5, we discuss the next step in evaluating our models: using clinical trials to evaluate the quality of the chemotherapy regimens we suggest.

Model: Prediction of clinical trial efficacy and toxicity outcomes
Approach: Statistical models trained on a large database of previous clinical trials (Section 2.3.2)
Evaluation techniques: Sequential out-of-sample 푅², root mean square error, and area under the curve (Section 2.3.3), as well as evaluation of whether models could aid planners in avoiding unpromising trials (Section 2.3.4)

Model: Design of novel chemotherapy regimens for evaluation in Phase II studies
Approach: Integer optimization using our statistical models to select novel chemotherapy regimens with high predicted efficacy and acceptable predicted toxicity (Section 2.4.1)
Evaluation techniques: The simulation and matching metrics, which use both simulation and the outcomes of true clinical trials to compare our suggested chemotherapy regimens against those selected in current clinical practice (Section 2.4.4)

Model: Selection of previously tested chemotherapy regimens for further evaluation in Phase III clinical trials
Approach: Using our statistical models to identify previously tested regimens with high predicted efficacy and acceptable predicted toxicity (Section 2.4.2)
Evaluation techniques: The simulation and matching metrics (Section 2.4.4)

Table 2.1: A summary of the models developed and evaluated in this chapter.

The approach we propose in this chapter is related to both patient-level clinical prediction rules and meta-regressions, though it differs in several important ways. Medical practitioners and researchers in the fields of data mining and machine learning have a rich history of predicting clinical outcomes. For instance, techniques for prediction of patient survival range from simple approaches like logistic regression to more sophisticated ones such as artificial neural networks and decision trees [137]. Most commonly, these prediction models are trained on individual patient records and used to predict the clinical outcome of an unseen patient, often yielding impressive out-of-sample predictions [37, 57, 95, 101, 116]. Areas of particular promise involve incorporating biomarker and genetic information into individualized chemotherapy outcome predictions [63, 141]. Individualized predictions represent a useful tool to patients choosing between multiple treatment options [171, 182, 183], and when trained on clinical trial outcomes for a particular treatment can be used to identify promising

patient populations to test that treatment on [181] or to identify if that treatment is promising for a Phase III clinical trial [54]. However, such models do not enable predictions of outcomes for patients treated with previously unseen chemotherapy regimens, limiting their usefulness in designing novel chemotherapy regimens.

The technique of meta-regression involves building models using the effect size of randomized trials as the dependent variable and using independent variables such as patient demographics and information about the chemotherapy regimen in a particular trial. These models are used to complement meta-analyses, explaining statistical heterogeneity between the effect sizes computed from randomized clinical trials [162]. Though in structure meta-regressions are similar to the prediction models we build, representing trial outcomes as a function of trial properties, they are used to explain differences between existing randomized trials, and study authors generally do not evaluate the out-of-sample predictiveness of the models. Like meta-analyses, meta-regressions are performed on a small subset of the clinical trials for a given disease, often containing just a few drug combinations. Even when a wide range of drug combinations are considered, meta-regressions typically do not contain enough drug-related variables to be useful in proposing new trials. For instance, [94] uses only three variables to describe the drug combination in the clinical trial; new combination chemotherapy trials could not be proposed using the results of this meta-regression. Finally, meta-analyses and meta-regressions are typically performed on a subset of published randomized controlled studies, while our approach uses data from both randomized and non-randomized studies.

Because approaches such as patient-level clinical prediction rules, meta-analysis, and meta-regression cannot be readily used to design new chemotherapy regimens to be tested in clinical trials, other methods, collectively termed preclinical models, are instead used in the treatment design process. Following commonly accepted principles for designing combination chemotherapy regimens [140], researchers seek to combine drugs that are effective as single agents and that show synergistic behavior when combined; drugs that cause the same toxic effects and that have the same patterns of resistance are not combined in suggested regimens. Molecular simulation is a

well-developed methodology for identifying synergism in drug combinations [45], and virtual clinical trials, which rely on pharmacodynamic and pharmacokinetic models to analyze different drug combinations, can also be used to suggest new treatments [108]. Animal studies and in vitro experimentation can be used to further evaluate novel chemotherapy regimens; results of these preclinical studies are often cited as motivations for Phase II studies of combination chemotherapy regimens [43, 100, 115]. A key limitation of current preclinical models is that most do not incorporate treatment outcomes from actual patients, while the new models we propose in this chapter leverage patient outcomes reported in previous clinical trials.

We believe the data-driven approaches we propose in this chapter would complement existing preclinical models. For instance, an in vitro experiment could be performed to evaluate the anti-tumor activity of a combination chemotherapy regimen suggested by our model from Section 2.4.1. Such experimentation could be used to evaluate and refine chemotherapy regimens suggested by our models, which would be especially important when our model suggests a regimen that combines drugs that have never been tested together in a prior clinical trial. Because our models cannot accurately predict outcomes of clinical trials testing new drugs, existing preclinical models would need to be used to design therapies incorporating these new drugs. On the other hand, our statistical models could be used to predict the efficacy and toxicity of chemotherapy regimens designed using other preclinical models, identifying the most and least promising suggested regimens.

In this chapter, we evaluate our proposed approach on gastric cancer. Not only is this cancer important — gastric cancer is the third leading cause of cancer death in the world [165] — but there is no single chemotherapy regimen widely considered to be the standard or best treatment for this cancer [135, 172, 176], and researchers frequently perform clinical trials testing new chemotherapy regimens for this cancer. We believe, however, that our approach has potential to help in selecting regimens to test in trials for many other diseases; we discuss the application of the approach to advanced breast cancer in Chapter 3 and additional potential applications in Section 2.5.

Arm: A group or subgroup of patients in a trial that receives a specific treatment.
Controlled trial: A type of trial in which an experimental treatment is compared to a standard treatment.
Cycle: The length of time between repeats of a dosing schedule in a chemotherapy treatment.
Exclusion criteria: The factors that make a person ineligible to participate in a clinical trial.
Inclusion criteria: The factors that allow a person to participate in a clinical trial.
Phase I study: A clinical study focused on identifying safe dosages for an experimental treatment.
Phase I/II study: A study that combines a Phase I and Phase II investigation.
Phase II study: A clinical study that explores the efficacy and toxicity of an experimental treatment.
Phase III trial: A randomized controlled trial that compares an experimental treatment with an established therapy.
Randomized trial: A type of trial in which patients are randomly assigned to one of several arms.
Sequential treatment: A treatment regimen in which patients transition from one treatment to another after a pre-specified number of treatment cycles.

Table 2.2: Definitions of some common chemotherapy clinical trial terms.

2.2 Clinical Trial Database

In this section, we describe the inclusion/exclusion rules we used and the data we collected to build our database. Definitions of some of the common clinical trial terms we use are given in Table 2.2. In this study, we seek to include a wide range of clinical trials, subject to the following inclusion criteria: (1) Phase I/II, Phase II or Phase III clinical trials for advanced or metastatic gastric cancer,1 (2) trials published no later than March 2012, the study cutoff date, (3) trials published in the English language. Notably, these

1Clinical trials for gastric cancer often contain patients with cancer of the gastroesophageal junction or the esophagus due to the similarities between these three types of cancer. We include studies as long as all patients have one of these three forms of cancer.

criteria include non-randomized clinical trials, unlike meta-analyses, which only include randomized studies. While including non-randomized trials provides us with a significantly larger set of clinical trial outcomes and the ability to generate predictions for a broader range of chemotherapy drug combinations, this comes at the price of needing to control for differences in demographics and other factors between different clinical trials.

Exclusion criteria were: (1) trials testing sequential treatments, (2) trials that involve the application of radiotherapy,2 (3) trials that apply chemotherapy for earlier stages of cancer, when the disease can still be cured, and (4) trials to treat gastrointestinal stromal tumors, a related form of cancer.

To locate candidate papers for our database, we performed searches on PubMed, the Cochrane Central Register of Controlled Trials, and the Cochrane Database of Systematic Reviews. In the Cochrane systems, we searched for either MeSH term “Stomach Neoplasms” or MeSH term “Esophageal Neoplasms” with the qualifier “Drug Therapy.” In PubMed, we searched for a combination of the following keywords in the title: “gastr*” or “stomach”; “advanced” or “metastatic”; and “phase” or “randomized trial” or “randomised trial”. A single individual reviewed these search results, and these searches yielded 350 clinical trials that met the inclusion criteria for this study.

After this search through medical databases, we further expanded our set of papers by searching through the references of papers that met our inclusion criteria. This reference search yielded 64 additional papers that met the inclusion criteria for this study. In total, our literature review yielded 414 clinical trials testing 495 treatment arms that we deemed appropriate for our approach. Since there are often multiple papers published regarding the same clinical trial, we verified that each clinical trial included was unique.

2Radiotherapy is not recommended for metastatic gastric cancer patients [135], and through PubMed and Cochrane searches for stomach neoplasms and radiotherapy, we only found three clinical trials using radiotherapy for metastatic gastric cancer.

2.2.1 Manual Data Collection

A single individual manually extracted data from clinical trial papers and entered extracted data values into a database. Values not reported in the clinical trial report were marked as such in the database. We extracted clinical trial outcome measures of interest that capture the efficacy and toxicity of each treatment. Several measures of treatment efficacy (e.g. tumor response rate, median time until tumor progression, median survival time) are commonly reported in clinical trials. A review of the primary objectives of the Phase III trials in our database indicated that for the majority of these trials (60%), the primary objective was to demonstrate improvement in the median overall survival (OS) — the length of time from enrollment in the study until death — of patients in the treatment group. As a result, this is the metric we have chosen as our measure of efficacy.3 To capture the toxic effects of treatment, we also extracted the fraction of patients experiencing any toxicity at Grade 3 or 4, designating severe, life-threatening, or disabling toxicities [133]. For each drug in a given trial’s chemotherapy treatment, the drug name, dosage level for each application, number of applications per cycle, and cycle length were collected. We also extracted many covariates that have been previously investigated for their effects on response rate or overall survival in prior chemotherapy clinical trials for advanced gastric cancer. To limit the number of missing values in our database, we limited ourselves to variables that are widely reported in clinical trials. These variables are summarized in Table 2.3. 
We chose not to collect many less commonly reported covariates that have also been investigated for their effects on response and survival in other studies, including cancer extent, histology, a patient’s history of prior adjuvant therapy and surgery, and further details of patients’ initial conditions, such as their baseline bilirubin levels or body surface areas [11, 21, 103, 111]. However, the variables we do collect enable us

3The full survival distribution of all patients, which enables the computation of metrics such as 6-month and 1-year survival rates, was available for only 348/495 (70.3%) of treatment arms. Meanwhile, the median OS was available for 463/495 (93.5%) of treatment arms. Given the broader reporting of median OS coupled with the established use of median OS as a primary endpoint in Phase III trials, we have chosen this metric as our central efficacy measure.

to control for potential sources of endogeneity, in which patient and physician decision rules in selecting treatments might limit the generalizability of model results. For example, we collect performance status, a factor used by physicians in selecting treatments [135]. Although other factors, such as comorbidities and patient preferences for toxicities, are important in treatment decisions for the general population [135], clinical trials uniformly exclude patients with severe comorbidities, and toxicity preferences do not affect actual survival or toxicity outcomes. The only other treatment decision we do not account for in our models is that patients with HER2-positive cancers should be treated with the drug trastuzumab [135], while this treatment is ineffective in other patients. We address this issue by excluding trastuzumab from the combination chemotherapy regimen suggestions we make in Section 2.4. In Table 2.3, we record the patient demographics we collected as well as trial outcomes. We note that the set of toxicities reported varies across trials, and that the database contains a total of 9,592 toxicity entries, averaging 19.4 reported toxicities per trial arm.

2.2.2 An Overall Toxicity Score

As described in Section 2.2.1, we extracted the proportion of patients in a trial who experience each individual toxicity at Grade 3 or 4. In this section, we present a methodology for combining these individual toxicity proportions into a clinically relevant score that captures the overall toxicity of a treatment. The motivation for an overall toxicity score is that there are 370 different possible adverse events from cancer treatments [133]. Instead of building a model for each of these toxicities, some of which are significantly more severe than others, we use an overall toxicity score. To gain insight into the rules that clinical decision makers apply in deciding whether a treatment has an acceptable level of toxicity, we refer to guidelines established in Phase I clinical trials. The primary goal of these early studies is to assess drugs for safety and tolerability on small populations and to determine an acceptable dosage level to use in later trials [74]. These trials enroll patients at increasing dosage levels until the toxicity becomes unacceptable. The “Patients and Methods” sections

Patient demographics (Avg. value; Range; % reported):
- Fraction male: 0.72; 0.29–1.00; 97.8
- Fraction of patients with prior palliative chemotherapy: 0.14; 0.00–1.00; 98.2
- Median age (years): 59.60; 46–80; 99.0
- Mean performance status1: 0.88; 0.11–3.00; 84.2
- Fraction of patients with primary tumor in the stomach: 0.89; 0.00–1.00; 94.9
- Fraction of patients with primary tumor in the gastroesophageal junction: 0.08; 0.00–1.00; 94.3
- Fraction of study authors from each country (11 different countries in at least 10 trial arms)2: country dependent; 0.00–1.00; 95.8
- Fraction of study authors from an Asian country2: 0.42; 0.00–1.00; 95.8

Study characteristics:
- Number of patients: 53.9; 11–521; 100.0
- Publication year: 2003; 1979–2012; 100.0

Outcomes:
- Median overall survival (months): 9.2; 1.8–22.6; 93.5
- Incidence of every Grade 3/4 or Grade 4 toxicity3: toxicity dependent

1 The mean Eastern Cooperative Oncology Group (ECOG) performance status of patients in a clinical trial, on a scale from 0 (fully active) to 5 (dead). See Appendix A.1 for details.
2 The studies that did not report this variable instead reported affiliated institutions without linking authors to institutions. The proportion of authors from an Asian country serves as a proxy to identify study populations with patients of Asian descent, who are known to have different treatment outcomes than other populations.
3 See Appendix A.2 for details on data preprocessing for blood toxicities.

Table 2.3: Patient demographic, study characteristic, and outcome variables extracted from gastric cancer clinical trials. These variables, together with the drug variables, were inputted into a database.
of Phase I trials specify a set of so-called dose-limiting toxicities (DLTs). If a patient experiences any one of the toxicities in this set at the specified grade, he or she is said to have experienced a DLT. When the proportion of patients with a DLT exceeds a

35 pre-determined threshold, the toxicity is considered “too high,” and a lower dose is indicated for future trials. From these Phase I trials, we can learn the toxicities and grades that clinical trial designers consider the most clinically relevant and design a composite toxicity score to represent the fraction of patients with at least one DLT during treatment. Based on a review of the 20 clinical trials meeting our inclusion criteria that also presented a Phase I study (so-called combined Phase I/II trials), we identified the following set of DLTs to include in the calculation of our composite toxicity score:

∙ Any Grade 3 or Grade 4 non-blood toxicity, excluding alopecia, nau- sea, and vomiting. 18 of 20 trials stated that all Grade 3/4 non- blood toxicities are DLTs, except some specified toxicities. Alopecia was excluded in all 18 trials and nausea/vomiting were excluded in 12 (67%). The next most frequently excluded toxicity was anorexia, which was excluded in 5 trials (28%).

∙ Any Grade 4 blood toxicity. Of the 20 trials reviewed, 17 (85%) defined Grade 4 neutropenia as a DLT, 16 (80%) defined Grade4 thrombocytopenia as a DLT, 7 (35%) defined Grade 4 leukopenia as a DLT, and 4 (20%) defined Grade 4 anemia as a DLT. Only one trial defined Grade 3 blood toxicities as DLTs, so we chose to exclude this level of blood toxicity from our definition of DLT.

The threshold for the proportion of patients with a DLT that constitutes an unaccept- able level of toxicity ranges from 33% to 67% over the set of Phase I trials considered, indicating the degree of variability among decision makers regarding where the thresh- old should be set for deciding when a trial is “too toxic.” In this chapter we use the median value of 0.5 to identify trials with an unacceptably high proportion of pa- tients experiencing a DLT. Details on the computation of the proportion of patients experiencing a DLT are presented in Appendix A.3. The 372 clinical trial arms in the dataset with non-missing median OS and DLT proportion are plotted in Figure 2-1. This figure shows that a typical clinical trial in our database has a median OSbe-

36 20 Number of Patients 15 20

50

10 100

200

5 500 Median Overall Survival Median Overall

0

0.00 0.25 0.50 0.75 1.00 Proportion of Patients with a DLT

Figure 2-1: The results of the 372 clinical trial arms in our database with non-missing median OS and DLT proportion. The size of a point is proportional to the number of patients in that clinical trial arm.

tween 5 months and 15 months, and a proportion of patients with a DLT between 0 and 0.75.

2.3 Statistical Models Predicting Clinical Trial Outcomes

This section describes the development and testing of statistical models that predict the outcomes of clinical trials. These models are capable of taking a proposed clinical trial involving chemotherapy drugs that have been seen previously in different combinations and generating predictions of patient outcomes. In contrast with meta-analysis and meta-regression, whose primary aim is the synthesis of existing trials, our objective is accurate prediction on unseen future trials (out-of-sample prediction).

2.3.1 Data and Variables

We used the data we extracted from published clinical trials described in Table 2.3 together with data about the drug therapy tested in each trial arm to develop the statistical models. This data can be classified into four categories: patient demographics, study characteristics, chemotherapy treatment, and trial outcomes. One challenge of developing statistical models using data from different clinical trials comes from the patient demographic data. The patient populations can vary significantly from one trial to the next. For instance, some clinical trials enroll healthier patients than others, making it difficult to determine whether differences in outcomes across trials are actually due to different treatments or only differences in the patients. To account for this, we include as independent variables in our model all of the patient demographic and study characteristic variables listed in Table 2.3. The reporting frequencies for each of these variables are given in Table 2.3, and missing values are replaced by their variable means or estimates described in Appendix A.1 before model building. In total, we included 20 patient demographic and study characteristic variables in our models.4 For each treatment protocol we also define a set of variables to capture the chemotherapy drugs used and their dosage schedules. There exists considerable variation in dosage schedules across chemotherapy trials. For instance, consider two different trials that both use the common drug 5-fluorouracil5: in the first, it is administered at 3,000 푚푔/푚² once a week, and in the second, at 200 푚푔/푚² once a day. To allow for the possibility that these different schedules might lead to different survival and toxicity outcomes, we define variables that describe not only whether or not the drug is used (a binary variable), but we also define variables for both the instantaneous and average dosages for each drug in a given treatment. The instantaneous

4Variables include the fraction of patients who are male, the fraction of patients with prior palliative chemotherapy, the median patient age, the mean ECOG performance status of patients, the fraction of patients with a primary tumor in the stomach, the fraction of patients with a primary tumor in the gastroesophageal junction, the fraction of study authors from each country (11 total variables), the fraction of study authors from an Asian country, the number of patients in the study, and the study’s publication year. 5[122] and [163]

dose is defined as the dose of drug 푑 administered each day 푑 is given to patients, and the average dose of a drug 푑 is defined as the average dose of 푑 delivered each week. We do not encode information about loading dosages, which are only given during the first cycle of a chemotherapy regimen, and we use the average instantaneous dosage if a drug is given at different dosages on different days during a cycle. In total, we included 72 drugs in our models, which are listed in Appendix B. As a result, we included 216 drug-related independent variables in our models. Lastly, for every clinical trial arm we define outcome variables to be the median overall survival and the combined toxicity score defined in Section 2.2.2. Trial arms without an outcome variable (and for which we cannot replace the value by estimates described in Appendices A.2 and A.3) are removed prior to building or testing the corresponding models.
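The drug-variable encoding described above — a binary "used" indicator, an instantaneous dose, and an average weekly dose per drug — can be sketched as follows. This is a minimal illustration: the drug list, regimen representation, and function name are assumptions, not the thesis's actual data pipeline (which covers 72 drugs, loading dosages, and variable-dose schedules).

```python
# Sketch of the three-variables-per-drug encoding (used / instantaneous
# dose / average weekly dose). The DRUGS list and regimen format below
# are illustrative; the thesis encodes 72 drugs (216 variables in total).

DRUGS = ["5-fluorouracil", "cisplatin", "docetaxel"]

def encode_regimen(regimen, cycle_length_weeks):
    """regimen: dict mapping drug name -> (dose per administration in
    mg/m^2, number of administrations per cycle)."""
    features = {}
    for drug in DRUGS:
        dose, n_admin = regimen.get(drug, (0.0, 0))
        features[f"{drug}_used"] = 1 if n_admin > 0 else 0
        features[f"{drug}_instantaneous"] = dose      # dose on each dosing day
        features[f"{drug}_avg_weekly"] = dose * n_admin / cycle_length_weeks
    return features

# Example: 5-fluorouracil at 200 mg/m^2 daily over a 4-week cycle
# (28 administrations) gives an average weekly dose of 1400 mg/m^2.
x = encode_regimen({"5-fluorouracil": (200.0, 28)}, cycle_length_weeks=4.0)
```

Under this encoding, the two 5-fluorouracil schedules contrasted in the text (3,000 mg/m² weekly vs. 200 mg/m² daily) receive different instantaneous doses even though their average weekly doses are similar, which is exactly the distinction the two dose variables are meant to preserve.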

2.3.2 Statistical Models

We implement and test several statistical learning techniques to develop models that predict clinical trial outcomes. Information extracted from results of previously published clinical trials serves as the training database from which the model parameters are learned. Then, given a vector of inputs corresponding to patient characteristics and chemotherapy treatment variables for a newly proposed trial, the models produce predictions of the outcomes for the new trial. The first class of models we consider are regularized linear regression models. If we let x represent a mean-centered, unit-variance vector of inputs for a proposed trial (i.e. patient, study, and treatment variables) and 푦 represent a particular outcome measure we would like to predict (median OS or DLT proportion), then this class of models assumes a relationship of the form 푦 = 훽′x + 훽0 + 휖, for some unknown vector of coefficients 훽, intercept 훽0, and error term 휖. We assume that the noise terms 휖푖 are independent with variance of the form 푉(휖푖) = 휎²/푛푖, where 휎 is an unknown constant and 푛푖 is the number of patients in trial arm 푖. We adjust for this expected heteroskedasticity by assigning weight 푤푖 = 푛푖/푛¯ to each trial arm 푖 for all linear models, where 푛¯ is the average number of patients in a clinical trial

arm. It is well known that in settings with a relatively small ratio of data samples to predictor variables, regularized models help to reduce the variability in the model parameters. We obtain estimates of the regression coefficients 훽̂ and 훽̂0 by minimizing the following objective:

    min_{훽̂,훽̂0} ∑_{푖=1}^{푁} 푤푖(훽̂′x푖 + 훽̂0 − 푦푖)² + 휆‖훽̂‖푝^푝,        (2.1)

where 푁 is the number of observations in the training set and 휆 is a regularization parameter that limits the complexity of the model and prevents overfitting to the training data, thereby improving prediction accuracy on future unseen trials. We choose the value of 휆 from among a set of 50 candidates through 10-fold cross-validation on the training set.6 The choice of norm 푝 leads to two different algorithms. Setting 푝 = 2 yields the more traditional ridge regression algorithm [90], popular historically for its computational simplicity. More recently, the choice of 푝 = 1, known as the lasso, has gained popularity due to its tendency to induce sparsity in the solution [164]. We present results for both variants below, as well as results for unregularized linear regression models. The use of regularized linear models provides significant advantages over more sophisticated models in terms of simplicity, ease of interpretation, and resistance to overfitting. Nevertheless, there is a risk that they will miss significant nonlinear effects and interactions in the data. Therefore, we also implement and test two additional techniques which are better suited to handle nonlinear relationships: random forests (RF) and support vector machines (SVM). For random forests [34], we use the nominal values recommended by [86] for the number of trees to grow (500) and minimum node size (5). The number of variable candidates to sample at each split is chosen through 10-fold cross-validation on the training set.7 For SVM, following the approach of [93],

6 Candidate values of 휆 are exponentially spaced between 휆푚푎푥/10⁴ and 휆푚푎푥. We take 휆푚푎푥 to be the smallest value for which all fitted coefficients 훽̂ are (numerically) zero. 7 Candidate values are chosen from among exponentially spaced values ([1.5⁻⁴푣/3], [1.5⁻³푣/3], . . . , [1.5²푣/3]), where 푣 is the total number of input variables and [·] denotes rounding to the nearest integer.

we adopt the radial basis function kernel and select the regularization parameter 퐶 and kernel parameter 훾 through 10-fold cross validation on the training set.8 All models were built and evaluated with the statistical language R version 3.0.1 [143] using packages glmnet [69], randomForest [119], and e1071 [127].
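The weighted, regularized objective (2.1) with 푝 = 2 (ridge) can be sketched numerically. This is an illustrative translation only: the thesis fits these models with the R package glmnet, whereas the sketch below is a self-contained Python/NumPy implementation on synthetic data, with the intercept left unpenalized and 휆 chosen from exponentially spaced candidates by 10-fold cross-validation, as described in the text.

```python
# Illustrative sketch of objective (2.1) with p = 2 (weighted ridge):
#   min sum_i w_i (beta'x_i + beta_0 - y_i)^2 + lambda * ||beta||_2^2,
# solved in closed form, with lambda chosen by 10-fold cross-validation.
# Synthetic data; the thesis itself uses R/glmnet on real trial features.
import numpy as np

def fit_weighted_ridge(X, y, w, lam):
    """Closed-form weighted ridge; the intercept is not penalized."""
    Xc = np.hstack([np.ones((len(y), 1)), X])   # prepend intercept column
    penalty = lam * np.eye(Xc.shape[1])
    penalty[0, 0] = 0.0                         # leave the intercept unpenalized
    beta = np.linalg.solve(Xc.T @ np.diag(w) @ Xc + penalty, Xc.T @ np.diag(w) @ y)
    return beta[0], beta[1:]                    # (intercept, coefficients)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                  # standardized trial features
y = 2.0 * X[:, 0] + rng.normal(size=200)        # e.g. median OS (synthetic)
n_i = rng.integers(20, 200, size=200)           # patients per trial arm
w = n_i / n_i.mean()                            # w_i = n_i / n-bar, as in the text

# Choose lambda from 50 exponentially spaced candidates by 10-fold CV.
folds = np.arange(200) % 10
best_lam, best_err = None, np.inf
for lam in np.logspace(-4, 3, 50):
    err = 0.0
    for k in range(10):
        tr, te = folds != k, folds == k
        b0, b = fit_weighted_ridge(X[tr], y[tr], w[tr], lam)
        err += np.sum(w[te] * (X[te] @ b + b0 - y[te]) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
b0, b = fit_weighted_ridge(X, y, w, best_lam)
```

For 푝 = 1 (the lasso) no closed form exists, which is one reason coordinate-descent solvers such as glmnet are used in practice.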

2.3.3 Statistical Model Results

Following the methodology of Section 2.2, we collected and extracted data from a set of 414 published journal articles from 1979–2012 describing the treatment methods and patient outcomes for a total of 495 treatment arms of gastric cancer clinical trials. To compare our statistical models and evaluate their ability to predict well on unseen trials, we implement a sequential testing methodology. We begin by sorting all of the clinical trials in order of their publication date. We then only use the data from prior published trials to predict the patient outcomes for each clinical trial arm. Note that we never use data from another arm of the same clinical trial to predict any clinical trial arm. This chronological approach to testing evaluates our model's capability to do exactly what will be required of it in practice: predict a future trial outcome using only the data available from the past. Following this procedure, we develop models to predict the median overall survival as well as the overall toxicity score. We begin our sequential testing 20% of the way through the set of 495 total treatment arms, setting aside the first 98 arms to be used solely for model building so that our training set is large enough for the first prediction. Of the remaining 397 arms, we first remove those for which the outcome is not available, leaving 383 arms for survival and 338 for toxicity. We then predict outcomes only for those arms using drugs that have been seen at least once in previous trials (albeit possibly in different combinations and dosages). This provides us with 347 data points to evaluate the survival models and 307 to evaluate the toxicity models. The survival models are evaluated by calculating the root mean square error (RMSE) between the predicted and actual trial outcomes. They are compared against

8 Candidate values are chosen from an exponentially spaced 2-D grid of candidates (C = 2^-5, 2^-3, ..., 2^15; γ = 2^-15, 2^-13, ..., 2^3).

[Figure 2-2 plot: lines for the Lasso, Linear, RF, Ridge, and SVM models; y-axis: Survival: 4-Year Sequential R²; x-axis: Year, 2000–2012]

Figure 2-2: Sequential out-of-sample prediction accuracy of survival models calculated over 4-year sliding windows ending in the date shown, reported as the coefficient of determination (R²).

a naive predictor (labeled “Baseline”), which ignores all trial details and reports the average of previously observed outcomes as its prediction. This is a standard baseline method used in evaluating sequential prediction models. Model performance is presented in terms of the coefficient of determination (R²) of our prediction models relative to this baseline. For each 4-year period of time we compute the RMSE and R² of each model. To assess statistical fluctuation of these quantities, for each prediction we additionally train 40 models with bootstrap-resampled versions of the training set, and for each 4-year period we report the mean, 2.5% quantile, and 97.5% quantile of the RMSE and R² obtained when randomly sampling one of the 40 bootstrap model predictions for each of the predictions made during that 4-year period. Figure 2-2 displays the 4-year sliding-window statistical fluctuation of the out-of-sample R² value, and Table 2.4 displays the values of the RMSE and R² over the most recent 4-year window of sequential testing, both for the cross-validation results and the out-of-sample predictions.
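The chronological evaluation and the windowed statistics described above can be sketched as follows. The field names, the mean-of-prior-outcomes baseline "model", and the exact form of the baseline-relative R² are illustrative assumptions:

```python
import math, random

def sequential_predictions(arms, fit, predict, burn_in):
    """Predict each arm using a model trained only on earlier-published arms.
    arms: list of dicts with 'date', 'x' (features), 'y' (outcome)."""
    arms = sorted(arms, key=lambda a: a["date"])
    preds = []
    for i in range(burn_in, len(arms)):
        train = arms[:i]                          # strictly earlier trials only
        model = fit([a["x"] for a in train], [a["y"] for a in train])
        preds.append((arms[i]["date"], predict(model, arms[i]["x"])))
    return preds

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def r_squared(pred, baseline_pred, actual):
    """R^2 of a model measured relative to the naive baseline's squared error."""
    ss_model = sum((p - a) ** 2 for p, a in zip(pred, actual))
    ss_base = sum((b - a) ** 2 for b, a in zip(baseline_pred, actual))
    return 1.0 - ss_model / ss_base

def bootstrap_interval(pred_sets, actual, stat, n_draws=1000, seed=0):
    """For each prediction, sample one of its bootstrap models' predictions,
    then report the 2.5%/97.5% quantiles of the statistic over all draws."""
    rng = random.Random(seed)
    vals = sorted(stat([p[rng.randrange(len(p))] for p in pred_sets], actual)
                  for _ in range(n_draws))
    return vals[int(0.025 * n_draws)], vals[int(0.975 * n_draws)]

# Toy run: the naive baseline predicts the mean of all prior outcomes.
fit_mean = lambda X, y: sum(y) / len(y)
predict_mean = lambda model, x: model
arms = [{"date": d, "x": None, "y": float(d)} for d in range(2000, 2010)]
baseline_preds = sequential_predictions(arms, fit_mean, predict_mean, burn_in=2)
```

In the toy run, the first prediction (for the arm dated 2002) is the mean of the two earlier outcomes; any scikit-learn-style regressor could be substituted for the baseline via `fit` and `predict`.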

To evaluate the toxicity models, recall from the discussion of Section 2.2.2 that the toxicity of a treatment is considered manageable as long as the proportion of

         |    Root Mean Square Error (RMSE)     |   Coefficient of Determination (R²)
Model    | X-Val |  OOS  | Bootstrap            | X-Val |  OOS  | Bootstrap
Baseline |   —   | 3.75  | 3.76 [3.73, 3.78]    |   —   | 0.00  | 0.00 [-0.01, 0.01]
Linear   |   —   | 2.72  | 3.16 [2.67, 4.58]    |   —   | 0.48  | 0.27 [-0.49, 0.50]
Ridge    | 2.02  | 2.50  | 2.57 [2.47, 2.67]    | 0.49  | 0.56  | 0.53 [0.50, 0.57]
Lasso    | 2.00  | 2.44  | 2.66 [2.50, 2.83]    | 0.50  | 0.58  | 0.50 [0.43, 0.56]
RF       | 2.05  | 2.61  | 2.68 [2.59, 2.77]    | 0.48  | 0.52  | 0.49 [0.45, 0.53]
SVM      | 2.00  | 2.60  | 2.65 [2.54, 2.77]    | 0.50  | 0.52  | 0.50 [0.46, 0.54]

Table 2.4: Root mean square prediction error (RMSE) and R² for the cross-validation set (“X-Val”), for out-of-sample predictions (“OOS”), and for bootstrapped out-of-sample predictions (“Bootstrap”) for survival predictions in the most recent 4-year window of data (March 2008–March 2012), which includes 132 out-of-sample predictions.

patients experiencing a dose-limiting toxicity (DLT) is less than a fixed threshold — a typical value used in Phase I studies for this threshold is 0.5. Thus, we evaluate our toxicity models on their ability to distinguish between trials with “high toxicity” (DLT proportion > 0.5) and those with “acceptable toxicity” (DLT proportion ≤ 0.5). The metric we adopt for this assessment is the area under the receiver operating characteristic curve (AUC). The AUC can be naturally interpreted as the probability that our models will correctly distinguish between a randomly chosen unseen trial arm with high toxicity and a randomly chosen unseen trial arm with acceptable toxicity. As was the case for survival, we calculate the AUC for each model over a 4-year sliding window and assess statistical fluctuation using bootstrapping, with the results shown in Figure 2-3 and Table 2.5.

We see in Figures 2-2 and 2-3 that the models for survival and toxicity all show a trend of improving prediction quality over time, which indicates our models are becoming more powerful as additional data are added to the training set. The decrease in the AUC of the toxicity model toward the end of the testing period might be attributable to the large number of new drugs tested in gastric cancer in recent years — 8% of trial arms from 2007–2009 evaluated a drug that had been tested in fewer than three previous arms, while 21% of arms from 2010–2012 tested such a drug. We see that ridge regression, lasso, SVM, and RF all attain similar performance when predicting both survival and toxicity, with sequential R² of more than 0.5 for recent survival

[Figure 2-3 plot: lines for the Lasso, Linear, RF, Ridge, and SVM models; y-axis: Toxicity: 4-Year Sequential AUC, 0.5–1.0; x-axis: Year, 2000–2012]

Figure 2-3: Four-year sliding-window sequential out-of-sample classification accuracy of toxicity models, reported as the area under the curve (AUC) for predicting whether a trial will have high toxicity (DLT proportion > 0.5).
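The AUC under the interpretation given in the text can be computed directly as a pairwise ranking probability, with ties counted as one half (a standard convention, assumed here); a sketch:

```python
# AUC as the probability that a randomly chosen high-toxicity arm receives a
# higher predicted toxicity than a randomly chosen acceptable-toxicity arm.
def auc(scores, labels):
    """scores: predicted DLT proportions; labels: True if actual DLT > 0.5."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a reversed ranking 0.0, and an uninformative one 0.5.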

         |       Area Under the Curve (AUC)
Model    | X-Val |  OOS  | Bootstrap
Baseline |   —   | 0.43  | 0.49 [0.36, 0.61]
Linear   |   —   | 0.82  | 0.77 [0.70, 0.84]
Ridge    | 0.76  | 0.83  | 0.81 [0.77, 0.85]
Lasso    | 0.77  | 0.82  | 0.81 [0.76, 0.85]
RF       | 0.78  | 0.81  | 0.80 [0.75, 0.84]
SVM      | 0.78  | 0.87  | 0.84 [0.80, 0.89]

Table 2.5: AUC for the cross-validation set (“X-Val”), for out-of-sample predictions (“OOS”), and for bootstrapped out-of-sample predictions (“Bootstrap”) for toxicity predictions in the most recent 4-year window of data (March 2008–March 2012), which includes 119 out-of-sample predictions. Of these, 24 (20.1%) actually had high toxicity.

predictions and AUC of more than 0.8 for recent toxicity predictions. For both prediction tasks, statistical fluctuations overlap for these four models' performances in the final 48-month window. Especially for earlier predictions, the unregularized linear model is not competitive, likely due to overfitting to the training set.

As a result of this performance assessment, we identified the regularized linear models as the best candidates for inclusion in our optimization models, as they have good prediction quality, are the least computationally intensive, and are the simplest of the models we evaluated. We conducted additional testing to determine whether the explicit inclusion of pairwise interaction terms between variables improved the ridge regression models for survival and toxicity in a significant way. We found that out-of-sample results were not significantly improved by the addition of drug/drug, drug/demographic, or drug/trial information interaction terms, and therefore chose to proceed with the simpler models without interaction terms. The lack of improved out-of-sample performance from interaction terms may be due to an insufficient sample size to identify interaction effects or to nonlinear interactions that could not be captured by the regularized linear models. We ultimately selected the ridge regression models to carry forward into the optimization.

While we rely on a naive baseline throughout this section to evaluate our prediction models, it would be challenging to improve on this baseline: clinical trial authors do not publish predictions of trial survival and efficacy outcomes, so we cannot compare our predictions to oncologists' predictions. In Section 2.3.4 we evaluate whether these models could be used by clinical trial planners to identify unpromising clinical trials before they are run, and in Section 2.4 we evaluate whether our prediction models could help us design effective combination chemotherapy regimens.

2.3.4 Identifying Unpromising Clinical Trials Before They Are Run

One application of statistical models for predicting a trial’s efficacy and toxicity is to identify and eliminate or modify unpromising proposed trials before they are run.

[Figure 2-4 plot: curve; y-axis: Prop. High Toxicity Trials Flagged; x-axis: Prop. Low Toxicity Trials Flagged; both axes 0.00–1.00]

Figure 2-4: Performance of the ridge regression model for toxicity at flagging high-toxicity trials among the 397 trial arms in the testing set.
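The cutoff-based flagging rule evaluated in Figure 2-4 can be sketched as follows; the field names are illustrative, and the 0.5 threshold for "high toxicity" follows the text:

```python
# Flag arms whose predicted DLT proportion meets or exceeds a cutoff c_DLT,
# then report what fraction of truly high-toxicity and truly low-toxicity
# arms were flagged at that cutoff (the two axes of Figure 2-4).
def flag_rates(predicted_dlt, actual_dlt, c_dlt, high=0.5):
    flagged = [p >= c_dlt for p in predicted_dlt]
    hi = [f for f, a in zip(flagged, actual_dlt) if a > high]
    lo = [f for f, a in zip(flagged, actual_dlt) if a <= high]
    return sum(hi) / len(hi), sum(lo) / len(lo)
```

Sweeping `c_dlt` from 1 down to 0 traces out the curve in Figure 2-4.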

Such a tool could assist clinical trial planners in deciding whether to run a proposed trial. First, clinical trial planners might use the models predicting toxicity to avoid clinical trials predicted to have a high DLT rate or to adjust the dosages of the drugs being tested. The ridge regression model for the proportion of patients with a DLT could be used to rank trials based on their predicted DLT proportion, and trials with predicted values exceeding some cutoff c_DLT could be flagged. Figure 2-4 plots the proportion of trials with a high DLT proportion (more than 50%) and the proportion of trials with a low DLT proportion that are flagged for various cutoffs c_DLT across all 397 studies in the statistical model testing set (studies published since 1997). Overall, the ridge regression model achieves an AUC of 0.75 in predicting whether a trial will have a high DLT rate. Further, 10% of all trials with a high DLT rate can be flagged while only flagging 0.4% of trials with a low DLT rate, and 20% of all trials with a high DLT rate can be flagged while only flagging 1.9% of trials with a low DLT rate.

Planners might also use the models predicting median OS to identify clinical trials predicted not to attain a high efficacy compared to recent trials in a similar patient

[Figure 2-5 plot: curve; y-axis: Prop. Low Efficacy Trials Flagged; x-axis: Prop. High Efficacy Trials Flagged; both axes 0.00–1.00]

Figure 2-5: Performance of the ridge regression model for median OS at flagging trials that do not achieve top-quartile efficacy among the 397 trial arms in the testing set.

population. We stratify trials based on their patient demographics9 and define a study to have high efficacy if it exceeds the 75th percentile of median OS values reported in trial arms within its stratum in the past four years. The ridge regression model for the median OS could be used to rank trials based on the ratio between the predicted median OS and the stratum-specific cutoff for high efficacy, and the trials with a ratio below some cutoff c_OS could be flagged as being unlikely to achieve high efficacy. Figure 2-5 plots the proportion of flagged trials that did and did not achieve the stratum-specific cutoff for high efficacy for various cutoffs c_OS across the 397 studies in the testing set. Overall, the ridge regression model achieves an AUC of 0.72 in predicting whether a trial will not achieve high efficacy. Further, 10% of all trials that do not achieve high efficacy can be flagged while only flagging 0.8% of trials with high efficacy, and

9 The first stratum consists of trials for which at least half of patients had received prior palliative chemotherapy. These 55 arms in the test set had an average median OS of 7.6 months. The remaining four strata all consist of trial arms with fewer than half of the patients with prior palliative chemotherapy (or that value not reported) and varying patient health levels as measured by the average ECOG performance status in the trial. We include a stratum for arms with good performance status (average value 0 to 0.5; these 57 arms in the test set had an average median OS of 11.8 months), with medium performance status (average value 0.5 to 1.0; these 173 arms in the test set had an average median OS of 10.0 months), with poor performance status (average value of 1.0 or greater; these 77 arms in the test set had an average median OS of 9.0 months), and with unreported performance status (these 35 arms in the test set had an average median OS of 9.3 months).

20% of trials that do not achieve high efficacy can be flagged while only flagging 3.1% of trials with high efficacy.
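The stratum-specific high-efficacy rule can be sketched as follows; the exact percentile convention (here a simple non-interpolating index) is an assumption:

```python
# A trial is "high efficacy" if its median OS exceeds the 75th percentile of
# median OS values reported by same-stratum arms over the prior four years
# (stratum assignment follows footnote 9).
def percentile_75(values):
    vals = sorted(values)
    idx = int(0.75 * (len(vals) - 1))      # simple non-interpolating rule
    return vals[idx]

def is_high_efficacy(os_months, stratum_history):
    """stratum_history: median OS of same-stratum arms from the past 4 years."""
    return os_months > percentile_75(stratum_history)
```

The ratio of a trial's predicted median OS to `percentile_75(stratum_history)` is then compared against the cutoff c_OS to decide whether to flag the trial.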

2.4 Design of Chemotherapy Regimens

This section describes an approach for designing novel chemotherapy regimens to be tested in Phase II studies via mixed integer optimization, building on the extracted data and the statistical models we developed in Sections 2.2 and 2.3. Further, we present a methodology that leverages the statistical models from Section 2.3 to select the best-performing regimen already tested in a Phase II study for further evaluation in a Phase III trial. Finally, we evaluate the quality of the regimens we suggest for Phase II and Phase III trials against those actually selected by oncology researchers using two evaluation approaches: the simulation metric and the matching metric.

2.4.1 Phase II Regimen Optimization Model

Given the current data from clinical trials and the current predictive models that we have constructed, we would like to select the next best regimen to test in a Phase II clinical trial. Following the objectives for designing clinical trials laid out in Section 2.1, we seek to identify trials that have high efficacy, that have acceptable toxicity, and that test novel treatments.

To identify regimens with high efficacy, we include the predicted median OS of patients in the trial in the objective of our optimization model. Our reasoning is that for the majority of Phase III trials in our database that clearly stated a primary objective, the objective was to demonstrate improvement in the overall survival (OS) of patients in the treatment group.10

To limit our suggestions to regimens with acceptable toxicity, we add a constraint to our optimization model bounding the predicted proportion of patients experiencing a DLT to not exceed some constant t. No Phase III studies in our database listed

10 Out of the 20 Phase III trials in our database with a clearly stated primary objective, 12 of them listed OS as a primary objective.

a toxicity outcome as a primary objective, validating our choice to address toxicity using a constraint instead of as part of the objective. Even in cases where our model predicts that a regimen will be acceptably non-toxic, a Phase I study would still be necessary to ensure patient safety, potentially resulting in changes to the dosage levels suggested by our models.

We seek novel treatments using three approaches. First, we require that all regimens (the combination of drug and dose selections) suggested by our models have never been previously tested in a clinical trial; hence, our model always suggests novel regimens. Second, we require that new drugs are tested in a clinical trial as soon as they are available, ensuring we evaluate new drugs as quickly as possible. Finally, our models assign higher weight to regimens containing drugs that have not been extensively tested. Motivated by the standard deviation of the sample mean,11 we assign weight u_d = t_d^(-1/2) to drug d, where t_d is the number of times drug d has previously been tested in a clinical trial, defining u to be the vector of all such weights. In the model (1), we increment the objective by Γu_d if drug d is selected for testing in the regimen, where Γ is a parameter that controls the aggressiveness of the exploration.

Our mathematical model includes decision variables for the chemotherapy variables described in Section 2.3.1. We define three variables for each drug, corresponding to the chemotherapy treatment variables used in the statistical models: a binary indicator variable b_d to indicate whether drug d is or is not part of the trial (b_d = 1 if and only if drug d is part of the optimal chemotherapy regimen), a continuous variable i_d to indicate the instantaneous dose of drug d that should be administered in a single session, and a continuous variable a_d to indicate the average dose of drug d that should be delivered each week. We define x in our optimization model to be the demographic and trial information variables for which the chemotherapy regimen is being selected; these values are treated as constants in the optimization process.

11 Recall that sd((∑_{i=1}^{t} X_i)/t) = σt^(-1/2) when the X_i are IID random variables with standard deviation σ. If we view each X_i as the observed impact of some drug d on the efficacy or toxicity of a regimen that tests it, then the standard deviation of the sample-mean estimate of that drug's effect scales with t_d^(-1/2), where t_d is the number of times the drug has been tested.

We use the ridge regression models from Section 2.3.2 to parameterize the optimization model. Let the model for overall survival (OS) be denoted by (β̂^b_OS)′b + (β̂^i_OS)′i + (β̂^a_OS)′a + (β̂^x_OS)′x, with drug variables b, instantaneous dose variables i, average dose variables a, demographic and study characteristic constants x, and coefficients corresponding to each set of variables indicated with superscripts. Similarly, we have a model for the proportion of patients with a DLT, which we denote by (β̂^b_DLT)′b + (β̂^i_DLT)′i + (β̂^a_DLT)′a + (β̂^x_DLT)′x. Note that these models are all linear in the variables.

We can then select the drug therapy to test in the next clinical trial using the following mixed integer optimization model:

max_{b,i,a}   (β̂^b_OS + Γu)′b + (β̂^i_OS)′i + (β̂^a_OS)′a + (β̂^x_OS)′x        (1)

subject to   (β̂^b_DLT)′b + (β̂^i_DLT)′i + (β̂^a_DLT)′a + (β̂^x_DLT)′x ≤ t,    (1a)

             ∑_{d=1}^{n} b_d ≤ N,                                              (1b)

             Ab ≤ c,                                                           (1c)

             (b, i, a) ∉ P,                                                    (1d)

             (b_d, i_d, a_d) ∈ Ω_d,   d = 1, ..., n,                           (1e)

             b_d ∈ {0, 1},   d = 1, ..., n.                                    (1f)

The objective of (1) maximizes the predicted overall survival of the selected chemotherapy regimen plus some constant Γ times u_d, the weight capturing how often drug d has previously been tested, for each drug d in the regimen. This “exploration constant” Γ controls how much weight is assigned to exploring drugs that have not been extensively tested in the training set; a large Γ would value exploration of new drugs over identifying a combination with high predicted efficacy, while Γ = 0 optimizes the efficacy of the regimen with no consideration for exploration. We experiment with a number of Γ values in Section 2.4.4.

Constraint (1a) bounds the predicted toxicity by a constant t. This constant value can be defined based on common values used in Phase I/II trials or can be varied to suggest trials with a range of predicted toxicities. In Section 2.4.4, we present results from varying the toxicity limit t. Constraint (1b) limits the total number of drugs in the selected trial to N, which can be varied to select trials with different numbers of drugs. We limit suggested drug combinations to contain no more than N = 3 drugs, which encompasses 89.1% of our database. We chose not to select a limit of N = 4 or higher both because the average number of drugs tested in combinations in our database is 2.3 and because all preferred regimens in the National Comprehensive Cancer Network (NCCN) guidelines for gastric cancer contain three or fewer drugs [135].

We also include constraints (1c) to constrain the drug combinations that can be selected. In our models, we require a new drug to be included if it has never been evaluated in a previous clinical trial and we incorporate generally accepted guidelines for selecting combination chemotherapy regimens [74, 140, 142].12 As discussed in Section 2.2.1, we also eliminate the drug trastuzumab because it is only indicated for the subpopulation of HER2-positive patients. We leave research into effective treatments for this subpopulation as future work. Additional requirements could be added to constraints (1c), though we do not do so in this chapter. Such additional constraints may be necessary due to known toxicities and properties of the drugs, or these constraints can be used to add preferences of the business or research group running the clinical trial. For example, a pharmaceutical company may want to require a new drug they have developed and only tested a few times to be used in the trial. In this case, the optimal solution will be the best drug combination containing the necessary drug.

Constraints (1d) force our selected regimen to differ from the set P of all regimens

12 We limit the combinations to contain no more than one drug from any drug class. There are 23 classes of drugs used in total in our database, using the classes defined by [74]. The most common classes are: platinum-based, , , , , alkylating agents, and chemoprotectants. We disallow pairs of drug classes from being used together if the pairing appears no more than once in our database and is discouraged in the guidelines for selecting regimens. The following pairs of classes were disallowed from being used together: anthracycline/, alkylating agent/, taxane/topoisomerase II inhibitor, antimetabolite/protein kinase, and camptothecin/topoisomerase II inhibitor. If a chemoprotectant drug is used, it must be used with a drug from the class that is not .

previously tested in the training set. Constraints (1e) limit the instantaneous and average dose of drug d to belong to a feasible set Ω_d. This forces i_d and a_d to equal 0 when b_d = 0 and to match the instantaneous and average dosages of drug d in some clinical trial in the full database when b_d = 1. These constraints force the dosage for a particular drug to be realistic. Lastly, constraints (1f) define b to be a binary vector of decision variables. The optimization model was implemented in C++ and solved using the Gurobi optimizer [82].
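A simplified brute-force sketch of model (1) follows: it enumerates drug subsets of size at most N with dosages drawn from previously tested values (constraint (1e)), skips previously tested regimens (constraint (1d)), enforces the predicted-DLT bound (1a), and maximizes predicted OS plus the exploration bonus Γ·t_d^(-1/2). In the thesis the problem is solved as a mixed integer program with Gurobi; the per-drug coefficient encoding and the omission of the class-compatibility constraints (1c) are simplifying assumptions here.

```python
from itertools import combinations, product

def best_regimen(drugs, doses, beta_os, beta_dlt, times_tested,
                 tested_regimens, t, gamma, N):
    """drugs: names; doses[d]: list of (inst, avg) dose pairs previously seen
    for drug d; beta_os/beta_dlt: per-drug (b, i, a) coefficient triples.
    Assumes times_tested[d] >= 1; in the thesis, never-tested drugs are
    instead forced into the next trial."""
    best, best_val = None, float("-inf")
    for k in range(1, N + 1):
        for combo in combinations(drugs, k):
            for dosing in product(*(doses[d] for d in combo)):
                reg = tuple(sorted(zip(combo, dosing)))
                if reg in tested_regimens:          # constraint (1d)
                    continue
                os_val = dlt = 0.0
                for d, (inst, avg) in reg:
                    bb, bi, ba = beta_os[d]
                    os_val += (bb + bi * inst + ba * avg
                               + gamma * times_tested[d] ** -0.5)  # exploration
                    cb, ci, ca = beta_dlt[d]
                    dlt += cb + ci * inst + ca * avg
                if dlt <= t and os_val > best_val:  # constraint (1a)
                    best, best_val = reg, os_val
    return best, best_val

# Tiny demo: two drugs, one historical dosage each; the A+B combination
# violates the toxicity bound, so drug A alone wins.
demo_reg, demo_val = best_regimen(
    ["A", "B"], {"A": [(1.0, 1.0)], "B": [(1.0, 1.0)]},
    {"A": (5.0, 0.0, 0.0), "B": (3.0, 0.0, 0.0)},
    {"A": (0.3, 0.0, 0.0), "B": (0.3, 0.0, 0.0)},
    {"A": 1, "B": 4}, set(), t=0.5, gamma=0.0, N=2)
```

Enumeration is exponential in the number of drugs; the MIP formulation in (1) is what makes the full 72-drug problem tractable.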

2.4.2 Phase III Regimen Selection Model

Phase III trials are large randomized controlled trials that evaluate the most promising regimens tested in previous Phase II studies, and treatments that perform well against historical controls in Phase III trials may then be considered new standard therapies for advanced cancer. A relatively small number of Phase III trials are run (7% of trials in our database are Phase III), both because a Phase III trial is only run when a therapy is shown to be particularly effective in a Phase II study and because Phase III trials enroll more patients than Phase II studies and are therefore more expensive. Due both to the impact of Phase III results on clinical standards and to the relatively small number of these trials run, it is especially important that we select high-quality regimens to test in the experimental arms of these clinical trials.

One option for selecting a regimen to test in a Phase III trial would be to identify the prior Phase II study with DLT proportion not exceeding parameter t that achieved the highest median OS, selecting the regimen tested in that Phase II study. However, Phase II studies are often performed using patient populations that are not representative of the patients tested in Phase III trials and standard clinical practice [70]. As a result, we instead use the statistical models from Section 2.3 to select the regimen to be tested in the experimental arms of Phase III clinical trials. Using the demographic and trial variables x from the new Phase III trial being designed, we select the regimen previously tested in a Phase II study that has the highest predicted median OS in population x, limiting to regimens with predicted DLT proportion not

exceeding a toxicity limit parameter t. Each of the prior Phase II studies is already in the training set of the statistical model being used to select Phase III regimens, but using the prediction model enables us to control for the patient population of trials and for variability in trial outcomes due to chance. The experimental arms of Phase III clinical trials seek novel chemotherapy regimens, so we limit our selected regimens to those whose set of drugs does not exactly match the set of drugs tested in any arm of a prior Phase III trial.
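The Phase III selection rule can be sketched as follows; `predict_os` and `predict_dlt` stand in for the ridge regression models, and the regimen encoding is an illustrative assumption:

```python
# Among regimens already tested in Phase II studies, pick the one with the
# highest predicted median OS for the new trial's population x, subject to
# predicted DLT <= t and to the drug set not matching any prior Phase III arm.
def select_phase3_regimen(phase2_regimens, phase3_drug_sets,
                          predict_os, predict_dlt, x, t):
    best, best_os = None, float("-inf")
    for reg in phase2_regimens:                    # reg: tuple of (drug, dose)
        if frozenset(d for d, _ in reg) in phase3_drug_sets:
            continue                               # novelty requirement
        if predict_dlt(reg, x) > t:
            continue                               # toxicity limit
        os_hat = predict_os(reg, x)
        if os_hat > best_os:
            best, best_os = reg, os_hat
    return best

# Tiny demo: A is excluded as a prior Phase III drug set, C exceeds the
# toxicity limit, so B is selected.
demo = select_phase3_regimen(
    [(("A", 1),), (("B", 1),), (("C", 1),)],
    {frozenset({"A"})},
    lambda reg, x: {"A": 10, "B": 9, "C": 8}[reg[0][0]],
    lambda reg, x: {"A": 0.2, "B": 0.2, "C": 0.6}[reg[0][0]],
    None, 0.5)
```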

The effects of using prediction models instead of published clinical trial outcomes are displayed in Figure 2-6. The figure on the left shows, for each regimen tested in a Phase II study in our database that does not match the drug combination tested in a Phase III clinical trial, the proportion of patients experiencing a DLT and the median overall survival, as they were reported in the Phase II study. The figure on the right shows the predicted performance of each drug regimen, using the average of trial and demographic variables across all Phase III trials in our database. The red points show the five best trials according to the data, and the green points show the five best trials according to the prediction models (we define the best trials here as the ones with the highest overall survival, subject to a DLT proportion of no more than 0.5). These figures show that our method will often suggest different regimens than we would select by just using the data published in the individual trials. These differences occur because our method takes into account patient and trial characteristics, controlling for trials run in particularly healthy populations (where strong efficacy outcomes may be due to demographics) and for trials with fewer enrolled patients (where strong results are more likely to be due to chance). For instance, the trials indicated in red in Figure 2-6 were smaller on average than the trials in green (37 vs. 51 patients), which may indicate why our models had more confidence in the quality of the trials labeled in green. We investigate how the regimens suggested by our models compare to those selected in current clinical practice in Sections 2.4.3–2.4.4.

[Figure 2-6 plots: two scatter plots; left: Median OS (months) vs. DLT (proportion); right: Adjusted Median OS (months) vs. Adjusted DLT (proportion)]

Figure 2-6: Median OS and DLT proportion for chemotherapy regimens that could potentially be selected for a Phase III trial experimental arm by our models. The left figure plots outcomes of the Phase II study that tested each regimen, and the right figure plots predicted outcomes for the regimen in a typical Phase III clinical trial patient population. The red points show the top five trials according to the data reported in the Phase II trials, and the green points show the top five trials according to the prediction models.

2.4.3 Evaluation Techniques

Evaluating the quality of chemotherapy regimens suggested by our optimization model against those selected for clinical trials by oncologists is an inherently difficult task. The only definitive way to evaluate a chemotherapy regimen in a given population is through a clinical trial, and the only definitive way to evaluate the performance of some regimen A suggested by our models against some regimen B designed without the models is through running a randomized controlled trial in a population, testing A against B. While ultimately clinical trial evidence is needed to evaluate the effectiveness of the proposed models, it is important to first perform preclinical evaluations of the proposed approach to determine if it shows promise in improving the results of clinical trials compared to current practice. As described in Section 2.1, preclinical evaluations estimate the quality of an approach before testing it in humans and are used extensively in the design of chemotherapy regimens for advanced cancer. Preclinical

evaluation cannot be used to conclusively confirm the effectiveness of a new therapy, but it can be used to rule out unpromising approaches. For instance, a new drug that has a high cell kill rate in vitro may or may not be effective at treating cancer in the human body, but a new drug that performs poorly in preclinical testing would likely not be tested in humans. We similarly use preclinical evaluation to determine if our proposed models show promise, which could help in deciding whether clinical evaluation is appropriate. To perform preclinical evaluation of our proposed models, we use two different techniques to approximate the quality of suggestions from our model compared to those made in real clinical trials: the simulation metric and the matching metric.

Simulation Metric

Through clinical trials, oncologists learn about the efficacy and toxicity of regimens they test and use this information when designing further chemotherapy regimens. To evaluate the ability of our approach to learn through time, we simulate the efficacy and toxicity outcomes of clinical trials that test the chemotherapy regimens proposed by our models. To simulate clinical trial outcomes, we first select plausible coefficient

vectors β*_OS and β*_DLT for our survival and toxicity prediction models, which we use to represent the ground-truth impact of the explanatory variables from Section 2.2 on efficacy and toxicity. We then simulate the median OS and the proportion of patients with a DLT in a proposed clinical trial using β*_OS and β*_DLT, respectively, in both cases adding normally distributed noise with variance based on the number of patients in the clinical trial. To compare the regimens selected by our models with toxicity limit t and exploration parameter Γ against the regimens tested in clinical trials in current practice, we start 20% of the way through the clinical trial database (in 1997), selecting the regimen to be tested in each Phase II study using the optimization model from Section 2.4.1 and selecting the regimen to be used in each Phase III trial experimental arm using the procedure described in Section 2.4.2. For each trial, both the regimen selected by our models and the regimen run in current practice are evaluated using β*_OS and β*_DLT. The outcomes for the regimen selected by our models, plus normally distributed noise, are added to the training set and can be used to design subsequent chemotherapy regimens. We evaluate each parameter set (t, Γ) using 40 different sets of coefficients β*_OS and β*_DLT. Each set of coefficients is obtained by drawing a bootstrap sample of the entire database of clinical trials and training ridge regression models for the efficacy and toxicity outcomes as described in Section 2.3; because these coefficients are obtained using bootstrap resampling of true clinical trial outcomes, they are plausible according to clinical trial data. Details of the simulation metric procedure are provided in Appendix C.

As with many procedures developed to simulate complicated real-world systems, the simulation metric may make a biased evaluation of our proposed model. First, the approach simulates outcomes using the same linear model specification used by the ridge regression model from Section 2.3, which means that the statistical models used to make decisions are assumed to be structurally accurate. Model performance might be worse if there were a mismatch between the structure of the ridge regression model and the true structure of the relationship between the covariates and outcomes. Further, while we include a number of constraints in the optimization models to prevent infeasible regimens from being suggested, there is no guarantee that all chemotherapy regimens in the feasible set of the optimization model are indeed biologically, legally, or practically feasible. This could lead the optimization model to obtain a strong evaluation for a suggestion that is actually infeasible, which would favorably bias the evaluation of our approach. Due to the potential biases in the simulation metric, results indicating our models improve over current practice might merit further study in a clinical trial setting, while results indicating our models do not improve over current practice would suggest that no further evaluation is warranted.
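One simulated outcome under this procedure might look as follows; the 1/√n scaling of the noise standard deviation and the feature encoding are assumptions about the "variance based on the number of patients" described above:

```python
import random

def simulate_outcome(beta_star, features, n_patients, sigma, rng):
    """One simulated trial outcome: the ground-truth linear score of the
    regimen/population features plus Gaussian noise whose standard deviation
    shrinks as the number of enrolled patients grows."""
    truth = sum(b * f for b, f in zip(beta_star, features))
    return truth + rng.gauss(0.0, sigma / n_patients ** 0.5)

# Noise-free check: with sigma = 0 the simulated outcome equals the truth.
sample = simulate_outcome([1.0, 2.0], [3.0, 4.0], 50, 0.0, random.Random(0))
```

In the full procedure this would be called once per proposed trial with one of the 40 bootstrap-derived coefficient vectors playing the role of `beta_star`.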

Matching Metric

While the simulation metric enables us to evaluate our ability to learn through time by providing feedback about the chemotherapy regimens selected by our optimization model, a shortcoming of this technique is that it relies on simulated outcomes instead of actual clinical trial outcomes, introducing a number of potential biases. As a result,

we also evaluate our model's suggested regimens for Phase III trial experimental arms using the results of similar clinical trials that were run in practice. To compare the regimens selected by our models with toxicity limit t against the regimens tested in clinical trials in current practice, we start 20% of the way through the clinical trial database (in 1997), selecting the regimen to be used in each Phase III trial experimental arm using the procedure described in Section 2.4.2. For each Phase III experimental arm, the regimen selected by our models is evaluated using the clinical trial in the database testing the most similar chemotherapy regimen, taking into account how well the drug classes, drugs, and dosages match13 and limiting to chronologically future clinical trials. Meanwhile, the regimens tested in the actual Phase III experimental arms are evaluated using the outcomes of those trials. Details of the matching metric procedure are provided in Appendix C.

A key benefit of the matching metric as defined is that it does not use the statistical models developed in this chapter in any way to evaluate proposed regimens, not even to adjust for the population in the matched clinical trial. As a result, however, the matching metric may make a biased evaluation of proposed chemotherapy regimens because it evaluates a chemotherapy regimen using the outcomes of a trial run in a different population. If the matched trial used to evaluate a proposed chemotherapy regimen is run in a healthier population than the population for which the regimen was selected, favorable bias would be introduced to the evaluation. Correspondingly, if the patients in the matched trial are less healthy, unfavorable bias is introduced. Due to the potential bias in the matching metric's evaluation of our model, any improvements over current practice indicated by the matching metric would require confirmatory testing in a clinical trial setting.

¹³For each drug in our suggested regimen, we assess a penalty of 0 if the same drug is tested at the same dosage in the future regimen, a penalty of 1 if the same drug is tested at a different dosage, a penalty of 10 if a different drug from the same drug class is tested, and a penalty of 100 if no drugs from the same drug class are tested in the future regimen. The penalties are summed for each drug in our suggested regimen, and the future clinical trial with the smallest penalty score is considered the best match.
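The footnote's penalty scheme can be written down directly. This is a sketch: the representation of a regimen as (drug, drug class, dosage) tuples and the drug names in the usage example are illustrative, not taken from the database.

```python
def match_penalty(suggested, candidate):
    """Total penalty between a suggested regimen and a future trial's regimen.
    Each regimen is a list of (drug, drug_class, dosage) tuples, following the
    0 / 1 / 10 / 100 per-drug penalty scheme described in the footnote."""
    total = 0
    cand_exact = {(d, dose) for d, c, dose in candidate}
    cand_names = {d for d, c, dose in candidate}
    cand_classes = {c for d, c, dose in candidate}
    for drug, drug_class, dose in suggested:
        if (drug, dose) in cand_exact:
            total += 0      # same drug, same dosage
        elif drug in cand_names:
            total += 1      # same drug, different dosage
        elif drug_class in cand_classes:
            total += 10     # different drug from the same class
        else:
            total += 100    # no drug from the same class
    return total

def best_match(suggested, future_trials):
    """Return the future trial regimen minimizing the total penalty."""
    return min(future_trials, key=lambda t: match_penalty(suggested, t))
```

For example, a suggested cisplatin + 5-FU regimen matched against a trial testing cisplatin at the same dose plus capecitabine (same fluoropyrimidine class) would score 0 + 10 = 10.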


Figure 2-7: Comparison between Phase III experimental arm suggestions (n=27) from our models and from current practice according to the simulation metric. Each point represents a simulation of our approach and highlighted points are averages across simulations. Points to the left of the y-axis improved over current practice in the proportion of trials with unacceptably high toxicity, and points above the x-axis improved over current practice in average median OS.

2.4.4 Optimization Results

To compare our models' suggested regimens with those of oncologists, we sequentially computed the simulation metric for 40 sets of simulated coefficients (훽*_OS, 훽*_DLT), testing all nine combinations of parameter settings 푡 ∈ {0.3, 0.4, 0.5} and Γ ∈ {0, 2, 4}. As detailed in Appendix C, for each set of parameter values (푡, Γ), we obtain 40 vectors of overall survival values for the 27 Phase III experimental arms for both current practice (y_OS^CP) and for the proposed models (y_OS^Mod). Additionally, we obtain 40 vectors of "high toxicity" indicator values for the Phase III experimental arms both for current practice (y_DLT^CP) and for the proposed models (y_DLT^Mod).

Figure 2-7 and Table 2.6 compare the performance of the regimens selected by this chapter's models for the 27 Phase III experimental arms against the 27 Phase III experimental arms selected by oncologists, reporting the average differences in outcomes across the 27 arms, 1′(y_OS^Mod − y_OS^CP)/27 and 1′(y_DLT^Mod − y_DLT^CP)/27. Figure 2-7 plots each individual bootstrap replicate and the average comparative performance across replicates, and Table 2.6 additionally reports 95% bootstrap percentile confidence intervals [51] of comparative performance, computed using the R boot package [39].

Γ    푡     ∆ OS (95% CI)     ∆ DLT (95% CI)
0    0.3   2.8 [0.9, 5.0]    -0.06 [-0.15, 0.04]
0    0.4   3.1 [1.8, 4.9]    -0.06 [-0.15, 0.04]
0    0.5   3.2 [1.6, 6.3]    -0.01 [-0.18, 0.18]
2    0.3   3.6 [1.6, 5.9]    -0.05 [-0.15, 0.07]
2    0.4   4.3 [1.8, 5.6]    -0.05 [-0.15, 0.11]
2    0.5   4.5 [2.1, 6.7]     0.02 [-0.18, 0.26]
4    0.3   3.7 [0.7, 6.0]    -0.06 [-0.18, 0.07]
4    0.4   4.4 [2.0, 6.3]    -0.04 [-0.15, 0.07]
4    0.5   4.7 [2.6, 6.5]     0.02 [-0.18, 0.33]

Table 2.6: Comparison between Phase III experimental arm suggestions (n=27) from our models and from current practice according to the simulation metric, reporting for each optimization model with parameters Γ and 푡 the change in average median OS and proportion of trials with unacceptably high DLT compared to current practice.

In all 360 simulation runs, the average simulated median OS of the regimens suggested by the proposed models was higher than the average simulated median OS of the regimens tested in current practice, with the average differences ranging from 2.8 months (95% CI 0.9, 5.0) for the model with (푡, Γ) = (0.3, 0) to 4.7 months (95% CI 2.6, 6.5) for the model with (푡, Γ) = (0.5, 4). The simulated proportion of trials with a high DLT rate did not significantly differ from current practice for any model. Although on average models with high Γ values have better simulated survival outcomes and models with restrictive toxicity limits have worse simulated survival outcomes and better simulated toxicity outcomes, the bootstrap 95% confidence intervals for all these differences include 0.
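A minimal sketch of the reported comparison, assuming the outcome vectors are plain lists: the average difference 1′(y^Mod − y^CP)/27 and a bootstrap percentile confidence interval, computed here in pure Python rather than with the R boot package.

```python
import random

def mean_difference(y_model, y_cp):
    """Average outcome difference across arms: 1'(y_model - y_cp) / n."""
    n = len(y_model)
    return sum(m - c for m, c in zip(y_model, y_cp)) / n

def bootstrap_percentile_ci(y_model, y_cp, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap percentile CI for the mean difference, resampling
    arms with replacement (a plain-Python stand-in for R's boot.ci)."""
    rng = random.Random(seed)
    n = len(y_model)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(mean_difference([y_model[i] for i in idx],
                                     [y_cp[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The percentile interval simply reads off the empirical 2.5th and 97.5th percentiles of the resampled statistic.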

The regimens selected by our proposed models for Phase II studies and Phase III trial experimental arms are qualitatively different from the ones tested in current practice, which may explain some of the differences in simulated performance. Due to the nature of the formulation of optimization problem (1), nearly all regimens selected by the proposed models for Phase II studies (100%) and Phase III experimental arms (98%) contain exactly three drugs. While this is similar to the average number of drugs tested in Phase II study arms (2.3) and Phase III trial experimental arms (2.4), these studies test many other combination sizes (only 31% of Phase II study arms and 37% of Phase III trial experimental arms test three-drug combinations). Given that there are benefits to regimens with fewer drugs (e.g. ease of administration and lower cost) and regimens with more drugs (e.g. better combating drug resistance in the cancer), oncology researchers may prefer to test a range of regimen sizes. For Phase III trial experimental arms, the proposed models selected regimens with newer drugs (newest drug tested in a median of 5 previous trials) than the regimens tested in practice (newest drug tested in a median of 22 previous trials). Given the significant cost of Phase III trials, this could represent risk aversion on the part of clinical trial planners that is not captured in the procedure from Section 2.4.2. By design, 100% of regimens suggested by our models for Phase III experimental arms had never been tested before (even in different dosages), similar to the 89% rate seen in clinical practice. In contrast, the proportion of suggested regimens testing new combinations (ignoring dosages) for Phase II studies was 14% in the proposed models with Γ = 0, 25% in the proposed models with Γ = 2, 37% in the proposed models with Γ = 4, and 41% in current practice. This suggests the proposed models with Γ = 0 and Γ = 2 may spend more effort optimizing dosages within a combination and less effort exploring new combinations compared to clinical practice.

푡     Match Drugs   Match Classes   ∆ OS   ∆ DLT
0.3   56%           82%             2.3    0.02
0.4   59%           85%             3.5    0.05
0.5   59%           89%             3.9    0.12

Table 2.7: Comparison between Phase III experimental arm suggestions (푛 = 27) from our models and from current practice according to the matching metric, reporting for each optimization model with parameter 푡 the proportion of model suggestions that match all drugs and all drug classes of the matched future trial as well as the average change in median OS and proportion of trials with an unacceptably high DLT rate compared to current practice.

To further compare our models' suggested regimens for Phase III experimental arms with those of oncologists, we used the matching metric to evaluate our suggested regimens obtained using each toxicity limit 푡 ∈ {0.3, 0.4, 0.5}. As detailed in Appendix C, for each parameter value 푡 we obtain overall survival values for the 27 Phase III experimental arms for both current practice (y_OS^CP) and for the proposed models (y_OS^Mod) as well as "high toxicity" indicator values for the Phase III experimental arms both for current practice (y_DLT^CP) and for the proposed models (y_DLT^Mod).

Figure 2-8: Comparison between Phase III experimental arm suggestions (푛 = 27) from our models and from current practice according to the matching metric.

Using the matching metric, Figure 2-8 compares the performance of the suggested regimens for Phase III experimental arms from the three models with 푡 ∈ {0.3, 0.4, 0.5} against the regimens selected by oncologists. The plot uses the same axes as Figure 2-7 and captures statistical fluctuations with 2-standard deviation ellipses fitted by bootstrap resampling the matching metric results for the 27 Phase III experimental arms. According to the matching metric, the regimens suggested by the proposed models with 푡 = 0.3, 0.4, and 0.5 had on average 2.3, 3.5, and 3.9 months higher median OS than current practice and 0.02, 0.05, and 0.12 proportion higher rate of trials with unacceptably high toxicity, respectively. The ellipses for all three models include only positive changes in survival but positive and negative changes in proportion of trials with unacceptably high toxicity.

2.5 Discussion and Future Work

In this chapter, we built a database of clinical trials for gastric cancer, built statistical models to predict out-of-sample efficacy and toxicity of clinical trials, and designed models that propose combination chemotherapy regimens to be tested in clinical trials. Out-of-sample evaluation suggests that the statistical models could be used by clinical trial planners to identify 10–20% of the trials with high toxicity or that fail to achieve high efficacy, in both cases with a small number of misclassifications. Further, two preclinical evaluation techniques indicate that the models presented in this chapter might improve the efficacy of the regimens selected for testing in Phase III clinical trial experimental arms without major changes in toxicity outcomes.

A key limitation of the results presented in this chapter is that we do not run clinical trials to evaluate our suggested chemotherapy regimens; instead, we evaluate our suggestions using simulated results or the results of clinical trials testing similar chemotherapy regimens. As detailed in Section 2.4.3, such preclinical evaluation of proposed approaches is important, as it can be used to eliminate unpromising approaches without needing to run a clinical trial. Given that both the simulation and matching metrics indicated that the proposed approach could improve the efficacy of the regimens tested in Phase III trials, we believe it merits further evaluation in a clinical trial setting. We believe the technique should first be evaluated using a single-arm clinical trial testing a regimen designed by our optimization models from Section 2.4.1. Key outcomes of this clinical trial would be the acceptability of our tools to clinical trial decision makers and the efficacy and toxicity outcomes of the trial. If this initial evaluation is deemed a success, we could then perform a randomized controlled trial comparing a regimen designed by our approach against a regimen designed using standard clinical trial design methodology.

Another limitation of our methodology is that clinical studies with negative results such as disappointing efficacy outcomes are less likely to be published in the medical literature, a phenomenon called publication bias [62]. This means the clinical trial arms in the database from Section 2.2 may be biased toward studies with

more positive outcomes, in turn causing the statistical models from Section 2.3 to make overly optimistic predictions for clinical trial outcomes. If publication bias is more concentrated in clinical trials testing some sorts of therapies than others, then this bias may cause the optimization models from Section 2.4 to prefer these sorts of therapies over others.

We believe the models presented in this chapter have the potential to significantly improve the quality of chemotherapy regimens tested in clinical trials. However, there are a number of promising future directions that have the potential to strengthen the results presented in this chapter. One opportunity would be to integrate preclinical models, such as in vitro experimentation and molecular simulation studies, into our optimization framework. This integration could take the form of adding constraints to eliminate drug combinations found to be biologically infeasible and adding interaction terms between pairs of drugs found to be synergistic or antagonistic in preclinical models. Another avenue of future work is to use more sophisticated techniques such as the Knowledge Gradient algorithm [68] or Q-learning [55] when exploring the space of combination chemotherapy regimens using Phase II studies.

While in this chapter we focused on designing combination chemotherapy regimens for gastric cancer, we believe data-driven tools leveraging databases of clinical trial results could prove useful in other settings. Combination therapy is used to treat many other cancers, and in Chapter 3 we evaluate the degree to which the results in this chapter extend to advanced breast cancer. Further, combination therapy is used to treat other diseases such as hypertension and diabetes, so our models could be applied to design therapies for these diseases.
Models trained on a subset of clinical trial results for a specific patient subpopulation, such as HER2-positive patients, could be used to design specialized regimens for these subgroups. Other applications might include improving clinical prognostic models for individual cancer patients, identifying the best available therapies for a particular disease from amongst all those tested in clinical trials, and comparing treatments that have never been compared in a randomized setting.

Chapter 3

Designing Drug Therapies for Advanced Breast Cancer

3.1 Introduction

Breast cancer is the most common non-skin cancer among women in the United States and the second-leading cause of cancer death in that population [156]. Among women worldwide, it is both the most common non-skin cancer and the leading cause of cancer death [165]. Drug therapies are the main treatments for advanced breast cancer [134], and meta-analysis evidence suggests multi-drug therapies yield better survival outcomes than monotherapies [67, 40]. Given the significant health burden posed by breast cancer, techniques to identify promising combination drug regimens for this disease are of interest.

A number of techniques are used to design combination drug therapies for Phase II studies of patients with advanced breast cancer. Preclinical models that do not rely on the results of prior clinical trials, such as animal studies and in vitro experimentation, are often cited as the motivation for combination therapies tested in Phase II studies [3, 4, 10]. Further, authors often cite the results of previous clinical trials both for breast cancer [2, 6, 14] and other cancers [9, 13, 18] as motivation for testing a combination therapy. Meta-analyses, which synthesize the results of prior randomized controlled trials, are also used to select one broad class of therapies

(e.g. anthracycline-containing regimens) instead of another (e.g. non-anthracycline-containing regimens) for testing [1, 8, 132]. In Chapter 2, we proposed a new methodology for synthesizing published clinical trials that can be used to design combination drug therapies, evaluating the approach for advanced gastric cancer.

Advanced breast cancer provides a good opportunity to evaluate the generalizability of the statistical and optimization models proposed in Chapter 2 due to the differences between breast cancer and gastric cancer. Due to differences in cancer biology, the most effective drugs for treating the two cancers differ, and only four of the 10 most frequently used drugs in gastric cancer clinical trials are also top-10 drugs in breast cancer clinical trials (see Section 3.2 for details). Further, efficacy outcomes qualitatively differ between the cancers: Clinical trials for advanced gastric cancer have an average median overall survival of 9.2 months, while clinical trials for advanced breast cancer have an average median overall survival of 17.6 months.

In this chapter, we investigate whether the statistical and optimization models developed in Chapter 2 show promise for the design of combination drug regimens for advanced breast cancer. Key contributions parallel those from Chapter 2:

Breast Cancer Clinical Trial Database  We build a database of clinical trials that test drug therapies for advanced breast cancer (Section 3.2). This database contains information about patient demographics, study characteristics, drug regimens tested, and efficacy and toxicity outcomes from 1,490 clinical trials describing 2,097 treatment arms, published 1968–2013.

Predicting Breast Cancer Clinical Trial Outcomes  Given information about the patient demographics, study characteristics, and chemotherapy regimens tested in a clinical trial, we use statistical models to predict efficacy and toxicity outcomes of that trial (Section 3.3). We evaluate whether these models can be used by clinical trial planners to avoid unpromising proposed clinical trials.

Designing Breast Cancer Chemotherapy Regimens  We evaluate models for designing novel combination drug therapies for breast cancer to be tested in Phase II studies as well as models for selecting the most promising previously tested regimen for further testing in a Phase III trial (Section 3.5).

We extend the analysis from Chapter 2 in two major ways, taking advantage of cancer receptor information and the availability of data from multiple cancers when making predictions.

Accounting for Cancer Receptor Status  Several well-known proteins affect the prognosis and treatment options for breast cancer patients. The estrogen receptor is a protein present in roughly 75% of breast cancers, and the progesterone receptor is also present in the majority of these cancers [61]. Estrogen receptors trigger cancer cell growth when they encounter estrogen. Hormonal drugs can be used to treat cancers with these receptors either by blocking production of estrogen, as is done by aromatase inhibitors (AIs), or by blocking estrogen from binding to the receptors themselves, as is done by selective estrogen receptor modulators (SERMs) and estrogen receptor downregulators (ERDs). The human epidermal growth factor receptor 2 (HER2) protein is overexpressed in roughly 25% of breast cancers [38], which causes more aggressive cancer growth and spread. Several drugs such as trastuzumab and lapatinib can be used to treat cancer with HER2 overexpression. In Section 3.3, we use interaction terms to model receptor-specific efficacy and toxicity effects of drugs, and in Section 3.5 we take receptor status into account when designing combination chemotherapy regimens.

Toxicity Predictions Using Data from Multiple Cancers  Phase I clinical studies, which are used primarily to assess the toxicity of a chemotherapy regimen [70], are often performed across patients with a range of cancers [84, 170]. As a result, it seems natural to predict the toxicity of a chemotherapy regimen using published toxicity outcomes from clinical trials for multiple cancers instead of limiting to data from a single cancer as we did in Chapter 2; this would be expected to especially improve toxicity predictions for cancers with relatively few published clinical trials. In Section 3.4, we predict toxicity outcomes using data from clinical trials for both gastric and breast cancers.

3.2 Breast Cancer Clinical Trial Database

The first step in building a database of outcomes for advanced breast cancer clinical trials was searching the published medical literature to identify papers meeting our criteria for inclusion in the study. We included English-language clinical trials testing drug therapies for advanced or metastatic breast cancer. To ensure a more homogeneous mix of studies that could be analyzed together, we used a number of exclusion criteria:

∙ Phase I or IV clinical trials

∙ Trials including non-drug therapies such as surgery or radiotherapy

∙ Trials reporting that fewer than half of patients had metastatic disease

∙ Trials testing a drug designed to treat side effects of cancer (e.g. therapy for painful bone metastases) or side effects of chemotherapy (e.g. granulocyte colony stimulating factor treatment for low neutrophil counts)

∙ Trials testing sequential therapies, in which patients transition from one treatment to another after a pre-specified number of treatment cycles

To find candidate papers for inclusion in the study, we queried the Cochrane Central Register of Controlled Trials in October 2012, searching for studies with MeSH term "Breast Neoplasms" and qualifier "drug therapy." Additionally, we searched Pubmed in October 2012 with the following query:

"breast" in title AND ("advanced" OR "metastatic") in title AND ("trial" OR "phase") in title

In July 2013, we expanded our query by further searching Pubmed with the query:

(Breast Neoplasms/drug therapy[MAJR] OR Breast Neoplasms/drug therapy[MeSH Terms]) AND (Clinical Trial[ptyp] OR Phase[Title])

To determine whether results should be included in the study, team members reviewed the titles and abstracts of search results, when necessary also accessing the main text of the article. In total, 1,490 clinical trials describing 2,097 treatment arms were included as a result of the literature review.

3.2.1 Manual Data Collection

Next, we built a database containing the detailed information from each clinical trial arm we identified. For clinical trials testing multiple treatments, we added multiple observations to our database, one for each treatment tested.

To capture key prognostic factors for metastatic breast cancer, we included a number of demographic characteristics of patients in each clinical trial arm. Disease stage has a large impact on disease outcomes [156], so we include the proportion of patients in each arm with advanced, non-metastatic disease (the remaining patients have metastatic disease). Prior therapy is an important prognostic factor [97], so we extracted the proportion of patients with prior chemotherapy for breast cancer and the proportion of patients with prior chemotherapy for advanced breast cancer. The location of metastatic disease has a large effect on cancer survival [96, 97, 110], so we include the proportion of patients in each clinical trial arm with visceral involvement (lung, liver, or brain), as these patients have the worst prognosis. Receptor status affects disease prognosis [52, 97, 110, 157], so we include the proportion of patients in each arm with positive receptor status (either estrogen receptor or progesterone receptor positive cancer) as well as the proportion of patients in each arm with HER2-overexpressing cancer. Finally, we include several other commonly reported pieces of demographic information: the proportion of patients in each clinical trial arm who are male, the median age of patients in each arm (mean age when the median is not reported), the average ECOG performance status of patients in each arm, and the proportion of patients in the arm who are premenopausal women.

We collect several other variables from each clinical trial arm. To capture the size of each study, we collect the number of patients receiving the study drugs in each clinical trial arm. To capture changes in advanced breast cancer care through time,

we collect the year each study was published. To capture region-specific differences between studies, we collect the proportion of study authors who are from an Asian country, as well as the proportion of study authors who are from each country, limiting to countries for which at least 10 clinical trial arms have been published with the plurality of authors on the study report from that country.

For each drug tested in a clinical trial arm, we extract several pieces of information, following the approach from Chapter 2. The instantaneous dosage is the amount of the drug scheduled to be given each day (if the dosage is different on different days of a chemotherapy cycle, then the average of the daily dosages is used). The average dosage is the average amount of the drug that is delivered per week of therapy. For instance, if a drug is given at dosage 20 푚푔/푚² on days 1, 8, and 15 of a 28-day chemotherapy cycle, then its instantaneous dosage would be extracted as 20 푚푔/푚² and its average dosage would be extracted as 15 푚푔/푚². In cases where a higher loading dose is provided at the beginning of treatment, this loading dose is not extracted. In cases where different dosage levels were used for different patients in the clinical trial arm, the dosage level used with the largest number of patients is encoded.
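The two dosage definitions can be sketched as follows; this is a simplified reading that ignores loading doses and per-patient dosage variation.

```python
def instantaneous_dosage(daily_doses):
    """Average of the per-administration dosages (e.g. mg/m^2 per
    treatment day within one chemotherapy cycle)."""
    return sum(daily_doses) / len(daily_doses)

def average_dosage(daily_doses, cycle_days):
    """Average amount of drug delivered per week of therapy over one cycle."""
    weeks = cycle_days / 7.0
    return sum(daily_doses) / weeks
```

Applied to the worked example in the text, 20 푚푔/푚² on days 1, 8, and 15 of a 28-day cycle gives an instantaneous dosage of 20 and an average dosage of 60/4 = 15.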

Finally, we extract information about the outcomes of each clinical trial arm. To measure the efficacy of the arm, we extract the median overall survival of patients. To measure the toxicity of an arm, we extract the proportion of patients experiencing a dose-limiting toxicity (DLT), following Chapter 2 by defining a DLT as any Grade 4 blood toxicity or any Grade 3/4 non-blood toxicity, excluding alopecia, nausea, and vomiting. Toxicity grades are defined using the NCI CTCAE v3.0 guidelines [166]. Where necessary, Grade 4 blood toxicities are imputed from reported Grade 3/4 values, as detailed in Appendix D. Toxicity values are not extracted in cases where toxicities are reported as the proportion of chemotherapy cycles for which the toxicity occurred instead of the proportion of patients experiencing the toxicity, and they are also not reported in cases where the toxicity grade is not clear from the clinical trial report.
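The DLT definition above can be expressed as a small predicate; the toxicity names in the usage example are illustrative.

```python
# Toxicities excluded from the DLT definition in the text.
EXCLUDED_TOXICITIES = {"alopecia", "nausea", "vomiting"}

def is_dlt(toxicity, grade, is_blood):
    """Dose-limiting toxicity per the chapter's definition: any Grade 4
    blood toxicity, or any Grade 3/4 non-blood toxicity other than
    alopecia, nausea, and vomiting."""
    if toxicity.lower() in EXCLUDED_TOXICITIES:
        return False
    if is_blood:
        return grade == 4
    return grade >= 3
```

A trial arm's DLT proportion would then be the fraction of patients experiencing at least one toxicity for which this predicate is true.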

Table 3.1 summarizes the extracted demographic, study characteristic, and outcome variables across the 2,097 treatment arms in the database. Missing demographic values were imputed, as described in Appendix D.

Variable                                                       Avg. Value   Range         % Rep.
Patient Demographics
  Fraction of patients with advanced, non-metastatic disease   0.02         0.00 – 0.50   76.5
  Fraction of patients with prior chemotherapy                 0.65         0.00 – 1.00   83.6
  Fraction of patients with prior palliative chemotherapy      0.33         0.00 – 1.00   73.4
  Fraction of patients with visceral disease                   0.58         0.00 – 1.00   60.1
  Fraction of patients with positive steroid hormone           0.69         0.00 – 1.00   57.5
    receptor status
  Fraction of patients with HER2 overexpression                0.39         0.00 – 1.00   19.3
  Fraction of patients who are male                            0.00         0.00 – 0.08   80.8
  Median age (years)                                           55.97        36 – 77       92.2
  Mean ECOG performance status                                 0.70         0.04 – 2.27   50.8
  Fraction of patients who are premenopausal females           0.18         0.00 – 1.00   50.0
Study Characteristics
  Number of patients                                           64.8         4 – 681       100.0
  Publication year                                             1997         1968 – 2013   100.0
  Fraction of study authors from an Asian country              0.06         0.00 – 1.00   100.0
  Fraction of study authors from each country                  Country      0.00 – 1.00   100.0
    (27 different variables)                                   Dependent
Outcomes
  Median overall survival (months)                             17.6         2.0 – 72.0    61.7
  Fraction of patients with a dose-limiting toxicity           0.31         0.00 – 1.00   64.9

Table 3.1: Average value, range, and reporting rate of patient demographic, study characteristic, and outcome variables extracted from breast cancer clinical trials.

Figure 3-1 displays the median overall survival and proportion of patients with a DLT across the 887 clinical trial arms that report both values, demonstrating the high variability in clinical trial efficacy and toxicity outcomes.

Figure 3-1: The results of the 887 clinical trial arms in our database with non-missing median OS and DLT proportion. The size of a point is proportional to the number of patients in that clinical trial arm.

Figure 3-2 displays the number of breast cancer and gastric cancer clinical trial arms that used each drug. Most drugs were used only a small number of times for both cancers: of the 257 drugs tested across the two cancers, 188 (73%) were tested five or fewer times in both cancers. Though several drugs, like 5-fluorouracil, cisplatin, and docetaxel, were tested broadly for both types of cancer, many were tested predominantly in one of the cancers. Several drugs, like S-1, irinotecan, and oxaliplatin, were primarily tested in gastric cancer. Many of the remaining drugs (most notably cyclophosphamide) were tested almost exclusively for breast cancer: 17 drugs (7%) were tested more than 20 times in breast cancer but five or fewer times in gastric cancer. Notably, no hormonal therapies were tested in any gastric cancer clinical trials. The set of all 237 drugs tested in at least one arm in the breast cancer clinical trial database is available in Appendix E.

Figure 3-2: The number of times each drug was used in a breast cancer clinical trial arm and in a gastric cancer clinical trial arm.

3.3 Statistical Models for Breast Cancer Outcomes

We next use statistical models to predict the efficacy and toxicity outcomes of breast cancer clinical trials, with a goal of high out-of-sample prediction accuracy. We predict the outcomes of a clinical trial arm based on the patient demographics of the arm, the study characteristics, and the drug dosages used in the arm. The models used for the prediction are trained on chronologically prior clinical trials.

For each clinical trial arm, we use all variables defined in Section 3.2.1. In particular, we use the 10 demographic variables defined there and the 30 study characteristic variables (including 27 variables indicating the proportion of authors from different countries). Additionally, we define three variables for each of the 237 drugs tested at least once in a breast cancer clinical trial arm: a binary indicator for whether that drug was used in the current arm and the instantaneous and average dosages as defined in Section 3.2.1, encoding value 0 for both if the drug is not used.

Some drugs are expected to have different effects based on a patient's receptor

status, so we introduce interaction terms as independent variables to capture these receptor-specific effects. Hormonal drugs are only beneficial for patients with positive steroid hormone receptor status, so for each of the 32 hormonal drugs that were tested in at least one breast cancer clinical trial arm, we include interaction terms between the drug's binary, average, and instantaneous dose variables and the proportion of patients in the arm with positive steroid hormone receptor status. Drugs that target the HER2 receptor are expected to be more beneficial for patients with HER2 overexpression, so for each of the six drugs tested in at least one breast cancer clinical trial arm that target the HER2 receptor (afatinib, lapatinib, neratinib, pertuzumab, trastuzumab, and trastuzumab emtansine), we included interaction terms between the proportion of patients with HER2 overexpression and the binary, average, and instantaneous dosages for those drugs.
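A sketch of the interaction-term construction, using a hypothetical feature-dictionary representation and only a couple of drugs from each group for brevity (the actual models use all 32 hormonal and all six HER2-targeting drugs):

```python
HORMONAL = ["tamoxifen", "letrozole"]         # illustrative subset of the 32
HER2_TARGETED = ["trastuzumab", "lapatinib"]  # illustrative subset of the 6

def add_receptor_interactions(arm):
    """Add dosage-receptor interaction terms to a trial arm's feature dict.
    For each hormonal drug, multiply its binary/instantaneous/average dose
    features by the fraction of hormone-receptor-positive patients; likewise
    for HER2-targeting drugs and the HER2 overexpression fraction."""
    feats = dict(arm["features"])
    for drug in HORMONAL:
        for suffix in ("bin", "inst", "avg"):
            key = f"{drug}_{suffix}"
            feats[f"{key}_x_hr"] = feats.get(key, 0.0) * arm["frac_hr_positive"]
    for drug in HER2_TARGETED:
        for suffix in ("bin", "inst", "avg"):
            key = f"{drug}_{suffix}"
            feats[f"{key}_x_her2"] = feats.get(key, 0.0) * arm["frac_her2"]
    return feats
```

Because each interaction multiplies a dose feature by a receptor-positive fraction, the term vanishes whenever the drug is absent from the regimen.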

Due to the large number of independent variables (865: 10 for demographics, 30 for study characteristics, 711 for drugs and dosages, and 114 for dosage-receptor status interactions) relative to the number of clinical trial arms in our database (2,097), we use ridge regression to avoid model overfitting [90]. Ridge regression models are simple linear models that balance the quality of the model fit (measured by the sum of the squared residuals) with the model complexity (measured by the 2-norm of the fitted coefficients) using a regularization parameter 휆. In Chapter 2, we showed that ridge regression models predicting median overall survival and proportion of patients with a dose-limiting toxicity yielded equivalent out-of-sample performance to more complicated prediction models, including random forest models [34] and support vector machines with radial basis functions [33, 48]. To account for higher variance in reported outcomes for clinical trial arms with fewer patients, we assign each observation 푖 weight 푤푖 = 푛푖/푛̄, where 푛푖 is the number of patients in the arm and 푛̄ is the average number of patients across all trial arms.
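The weighted ridge fit has the closed form 훽 = (X′WX + 휆I)⁻¹X′Wy with W = diag(푤푖). A self-contained pure-Python sketch follows; the thesis itself uses R's glmnet, and this simplified version omits the intercept and assumes features are already standardized.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def weighted_ridge(X, y, weights, lam):
    """Closed-form weighted ridge: beta = (X'WX + lam*I)^-1 X'Wy, where the
    weights (w_i = n_i / n_bar in the text) make larger arms count more.
    Simplified: no intercept, features assumed standardized."""
    n, p = len(X), len(X[0])
    A = [[lam if j == k else 0.0 for k in range(p)] for j in range(p)]
    b = [0.0] * p
    for i in range(n):
        for j in range(p):
            b[j] += weights[i] * X[i][j] * y[i]
            for k in range(p):
                A[j][k] += weights[i] * X[i][j] * X[i][k]
    return solve(A, b)
```

With 휆 = 0 this reduces to weighted least squares; increasing 휆 shrinks the coefficients toward zero.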

We set aside the earliest 20% of clinical trial arms (419 arms), which span the years 1968–1988, to be used solely for training models. For each of the remaining 1,678 clinical trial arms, which span the years 1988–2013, we train a ridge regression model using the trials chronologically prior to the one being predicted using the


Figure 3-3: Four-year sliding-window R² value of the statistical models' predictions of median OS.

R glmnet package [69], normalizing all training-set variables to have zero mean and unit variance before fitting models and using 10-fold cross-validation to select the regularization parameter from an exponentially spaced grid of 50 candidate values spanning λ_max/10⁴ to λ_max, where λ_max is the smallest regularization parameter for which the fitted coefficients are numerically zero.
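Such an exponentially spaced grid can be generated as below; this is a sketch in the spirit of glmnet's default lambda path, not its exact implementation, and the value of λ_max is assumed given.

```python
import math

def lambda_grid(lam_max, n=50, span=1e4):
    # n exponentially spaced candidates from lam_max/span up to lam_max:
    # equal ratios between consecutive values (a geometric sequence).
    return [lam_max * math.exp(-math.log(span) * (n - 1 - k) / (n - 1))
            for k in range(n)]

grid = lambda_grid(100.0)  # 50 values from 0.01 up to 100.0
```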

3.3.1 Results

We evaluate the statistical models' out-of-sample predictions on all trial arms for which the relevant outcome is reported and all drugs being tested have previously been tested in at least one trial arm. Figure 3-3 displays the 4-year sliding-window R² value of the predictions of median overall survival. For instance, the value plotted at 2010 represents the R² value of all out-of-sample predictions made from 2006–2010. To calculate the R², the baseline prediction for each observation is the average median OS value across that observation's training set. The black line in the figure indicates the model trained with all variables including the dosage-receptor status interaction terms, while the gray line indicates the model trained with all variables except the


Figure 3-4: Four-year sliding-window AUC of the statistical models' predictions of the proportion of patients experiencing a DLT, labeling trials as having unacceptably high toxicity if more than 50% of patients experience a DLT.

dosage-receptor status interaction terms. The model trained with interaction terms obtains slightly better performance than the one trained without interactions, obtaining an out-of-sample R² value of 0.59 in the final four years of the dataset compared to the value of 0.58 for the model with no interaction terms. During this time period, 65% of trials testing a HER2-targeting drug had improved predictions in the model with interaction terms compared to the model with no interaction terms, and 58% of trials testing a hormone-targeting drug had improved predictions. The rather modest impact was expected, as overall 82% of the trial arms testing hormonal drugs that reported steroid hormone receptor status rates had at least 80% of patients with positive receptor status, and 81% of the trial arms testing HER2-targeting drugs that reported HER2 overexpression rates had at least 80% of patients with HER2 overexpression. As a result, dosage-receptor interaction terms often take similar values in trials testing receptor-specific drugs.
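The R² computation described above differs from the textbook version only in its baseline: each observation is compared against its own training-set mean rather than the test-set mean. A minimal sketch with invented numbers:

```python
def out_of_sample_r2(y_true, y_pred, y_baseline):
    # R^2 relative to a per-observation baseline prediction (here, the
    # mean outcome of that observation's own training set), rather than
    # the usual test-set mean.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_base = sum((t - b) ** 2 for t, b in zip(y_true, y_baseline))
    return 1.0 - ss_res / ss_base

# hypothetical median OS values (months) for three held-out arms
r2 = out_of_sample_r2(y_true=[10.0, 12.0, 14.0],
                      y_pred=[11.0, 12.0, 13.0],
                      y_baseline=[12.0, 12.0, 12.0])  # training-set means
```

A positive value means the model beats the naive "predict the historical average" rule on the held-out arms.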

Figure 3-4 displays the 4-year sliding-window performance of the toxicity model, as measured by the area under the ROC curve (AUC) for predicting whether more than 50% of patients in a trial will experience a DLT. This cutoff was used for computing

the AUC as it is a common value used to define the maximum tolerated dose in Phase I studies, as demonstrated in Chapter 2. The black line indicates models trained with all variables including the dosage-receptor interaction terms, while the gray line indicates models trained with all variables except the interaction terms. The models with and without interactions obtained nearly identical performance, with an AUC of 0.80 in the final four years of out-of-sample predictions. For both toxicity and survival predictions, models show strong performance in the late 1990s and weaker performance for much of the next decade. This might be partially explained by an increased number of new drugs being tested after 2000 — 13% of trial arms from 1996–2000 evaluated a drug that had been tested in fewer than three previous arms, while 17% of trial arms since 2001 tested such a drug. Decreases in the R² of survival predictions may also be partially explained by changes in the distribution of this outcome through time — the median OS of trial arms published from 1996–2000 had a standard deviation of 7.4 months, while the median OS of trial arms published since 2001 had a standard deviation of 8.4 months.
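The AUC at the 50%-DLT cutoff can be computed with the standard rank-based formulation; the function and data below are illustrative, not the thesis code:

```python
def auc(labels, scores):
    # Rank-based AUC: probability that a randomly chosen positive
    # (high-toxicity) arm scores above a randomly chosen negative arm,
    # counting ties as 1/2.
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical observed DLT rates and model-predicted DLT proportions
dlt_rates = [0.20, 0.60, 0.40, 0.70, 0.30]
predicted = [0.25, 0.55, 0.45, 0.60, 0.20]
labels = [r > 0.5 for r in dlt_rates]   # unacceptably toxic if >50% DLT
score = auc(labels, predicted)
```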

3.3.2 Identifying Unpromising Trials

We next evaluate whether the statistical models for toxicity and efficacy outcomes can be used by clinical trial planners to flag unpromising proposed clinical trials before they are run. For each of the 1,678 clinical trial arms in the statistical model test set (all arms since 1988), we use the model for toxicity to rank trial arms by their predicted DLT proportion, flagging all arms exceeding some limit c_DLT. Figure 3-5 plots the proportion of high-toxicity (DLT rate exceeding 50%) and acceptable-toxicity trial arms that are flagged with different c_DLT limits. The toxicity model achieves an AUC of 0.77 in predicting whether a trial will have a high toxicity rate. The model can flag 10% of high-toxicity trials while only flagging 1.5% of acceptable-toxicity trials, and it can flag 20% of high-toxicity trials while only flagging 4.3% of acceptable-toxicity trials. For each test-set trial arm, we follow Chapter 2 in defining strata of arms with


Figure 3-5: Performance of the statistical model for toxicity at flagging high-toxicity arms among the 1,678 arms in the testing set.


Figure 3-6: Performance of the statistical model for efficacy at flagging trial arms that fail to reach the top quartile of efficacy among the 1,678 arms in the testing set.

similar patient demographics,1 defining a trial arm as achieving high efficacy if its median OS exceeds the 75th percentile of median OS values of trials within the past four years in its strata. We use the model for efficacy to rank trial arms by the ratio of their predicted median OS value to their strata-specific high efficacy value, flagging all arms that fall below some limit c_OS. Figure 3-6 plots the proportion of flagged trials that do and do not achieve their strata-specific high efficacy value with different c_OS limits. The efficacy model achieves an AUC of 0.78 in predicting whether a trial will fail to reach the high efficacy cutoff. The model can flag 10% of trial arms that don't reach the high-efficacy cutoff while only flagging 1.2% of arms that reach the cutoff, and it can flag 20% of arms that don't reach the high-efficacy cutoff while only flagging 1.8% of arms that reach the cutoff.
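The flagging rule above is a simple threshold sweep; one point on curves like Figures 3-5 and 3-6 can be computed as below (function and data are invented for illustration):

```python
def flag_rates(pred_scores, is_bad, c):
    # Flag every arm whose predicted score exceeds the limit c; report
    # the fraction of bad arms and of acceptable arms that get flagged.
    bad = [s for s, b in zip(pred_scores, is_bad) if b]
    ok = [s for s, b in zip(pred_scores, is_bad) if not b]
    return (sum(s > c for s in bad) / len(bad),
            sum(s > c for s in ok) / len(ok))

# hypothetical predicted DLT proportions and true high-toxicity labels
scores = [0.62, 0.55, 0.48, 0.30, 0.20, 0.58]
labels = [True, True, False, False, False, True]
rates = flag_rates(scores, labels, c=0.50)
```

Sweeping `c` over all observed scores traces out the full trade-off curve between catching bad trials and falsely flagging acceptable ones.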

3.4 Combined Databases for Toxicity Prediction

Though the efficacy of a chemotherapy regimen may differ dramatically when given to patients with different cancers, its toxicity should be similar across patients with different cancers, leading many Phase I clinical trials to test combinations across patients with multiple types of cancer. In this section, we explore whether combining toxicity data from breast and gastric cancers can improve the toxicity predictions compared to predictions made with cancer-specific data only. We build a database that combines all clinical trial arms from both the breast cancer database from Section 3.2 and the gastric cancer database from Chapter 2, defining independent variables for our statistical models only when their underlying data is present in both databases. This includes four demographic variables: the

1The first strata is trial arms for which at least 50% of patients have received prior palliative chemotherapy; the 480 arms in the test set with this property achieved average median OS of 13.6 months. The remaining four strata all have either fewer than 50% of patients who have received prior palliative chemotherapy or do not report this value, and they differentiate clinical trial arms based on the average ECOG performance status of patients, which measures their general healthiness. We include strata for trials with the most healthy patients (these 258 arms with average performance status 0 to 0.5 have average median OS 23.6 months), with moderately healthy patients (these 358 arms with average performance status 0.5 to 1.0 have average median OS 19.8 months), with the least healthy patients (these 97 arms with average performance status of 1.0 or greater have average median OS of 14.5 months), and with unreported performance status (these 485 arms have average median OS of 19.5 months).


Figure 3-7: Receiver operating characteristic curve for out-of-sample toxicity predic- tions of the 331 gastric cancer clinical trial arms published since 1998 with extractable toxicity data.

fraction of patients with prior palliative chemotherapy, the fraction of patients who are male, the median age of patients, and the mean ECOG performance status of patients. Further, we include 33 study characteristic variables: the number of patients in the clinical trial arm, the publication year of the study, the fraction of study authors from an Asian country, and 30 variables indicating the fraction of study authors from different countries, limiting to commonly appearing countries as we did in Section 3.2. Finally, we include a binary, instantaneous, and average dosage variable for each of the 257 drugs tested at least once in a breast or gastric cancer clinical trial arm. As dosage-receptor interaction terms had no impact on toxicity prediction in Section 3.3, we do not include them as independent variables in this analysis.

As in Section 3.3, we perform a sequential evaluation, in which we train a new ridge regression model for each prediction made, using the chronologically prior clinical trial arms as the training set. Parameters for the ridge regression model are selected through cross-validation.
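The sequential (walk-forward) evaluation can be sketched as follows; `fit` and `predict` here are toy stand-ins (a running mean) rather than the actual ridge models:

```python
def sequential_predictions(arms, fit, predict):
    # Walk-forward evaluation: for each arm (in chronological order),
    # train on all strictly earlier arms, then predict the held-out arm.
    arms = sorted(arms, key=lambda a: a["year"])
    return [predict(fit(arms[:k]), arms[k]) for k in range(1, len(arms))]

# toy stand-ins for model fitting/prediction: predict the training mean
fit = lambda train: sum(a["dlt"] for a in train) / len(train)
predict = lambda model, arm: model

arms = [{"year": 1999, "dlt": 0.1}, {"year": 2001, "dlt": 0.3},
        {"year": 2004, "dlt": 0.2}, {"year": 2007, "dlt": 0.4}]
preds = sequential_predictions(arms, fit, predict)
```

Each prediction only ever uses information published before the arm being predicted, mimicking what a trial planner could actually have known at the time.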


Figure 3-8: Receiver operating characteristic curve for out-of-sample toxicity predic- tions of the 883 breast cancer clinical trial arms published since 1998 with extractable toxicity data.

3.4.1 Statistical Model Results

Figure 3-7 displays the ROC curve for toxicity predictions of the 331 gastric cancer clinical trial arms since 1998 with extractable toxicity data, where a trial arm is labeled as unacceptably toxic if at least 50% of its patients experience a DLT. The out-of-sample performance of the model using data from both the gastric cancer database and the breast cancer database is plotted in black, while the out-of-sample performance of the model using only the gastric cancer database is plotted in gray. The AUC of the combined model, 0.79, exceeds the 0.75 AUC of the model trained with only gastric cancer clinical trials, though this difference is not statistically significant (p = 0.12). Figure 3-8 displays the ROC curve for the toxicity predictions of the 883 breast cancer clinical trial arms since 1998 with extractable toxicity data, again plotting the results for the combined prediction model in black and the model trained using only data from breast cancer clinical trial arms in gray. Both models achieved an AUC of 0.76 and did not significantly differ (p = 0.89). As expected, the combined toxicity model yielded the most benefit in clinical trial


Figure 3-9: Receiver operating characteristic curve for out-of-sample toxicity predictions of the 35 gastric cancer clinical trial arms and 22 breast cancer clinical trial arms published since 1998 for which a drug in the study had been tested in no more than two prior clinical trial arms for that cancer and had been tested in the other cancer.

arms testing relatively new drugs for one cancer that had been tested in the other cancer. There were 35 gastric cancer clinical trial arms and 22 breast cancer clinical trial arms with a drug that had been tested no more than twice for that cancer but that had been tested in the other cancer. Figure 3-9 plots the AUC of the combined and single-cancer models among these 57 clinical trial arms. The combined model obtained an AUC of 0.81 while the single-cancer models obtained an AUC of 0.62, significantly lower (p = 0.01). In light of Figure 3-2, it is unsurprising that most of the benefit of the combined models was for gastric cancer predictions. Cancers with even fewer published clinical trials could stand to benefit even more from combined prediction models for toxicity.

3.5 Design of Chemotherapy Regimens

Given the out-of-sample R² of efficacy predictions and AUC of toxicity predictions observed in Sections 3.3–3.4, it is natural to investigate whether these models can be

used to identify high-quality combination drug therapies for testing in Phase II and III clinical trials. In this section, we follow the procedures from Chapter 2 to design and evaluate drug therapies proposed for clinical trials, sequentially computing the regimens we would have tested in each clinical trial arm in the breast cancer database given that arm's patient demographics and study characteristics.

Following Chapter 2, we use optimization to design novel chemotherapy regimens for testing in Phase II clinical trials. When designing a chemotherapy regimen, we first train statistical models for predicting clinical trial efficacy and toxicity outcomes, using ridge regression models with a training set of chronologically prior clinical trial arms as described in Section 3.3. The efficacy model yields coefficient vectors β̂_OS^b−, β̂_OS^i−, and β̂_OS^a−, corresponding to the binary, average, and instantaneous dosages of each drug, respectively, as well as coefficient vector β̂_OS^x corresponding to demographic and study characteristic variables. This model additionally yields dosage-steroid hormone receptor status interaction coefficients β̂_OS^bα, β̂_OS^iα, and β̂_OS^aα, as well as dosage-HER2 overexpression interaction coefficients β̂_OS^bγ, β̂_OS^iγ, and β̂_OS^aγ. Each of these vectors contains the value 0 for any drug not interacted with the relevant receptor variable and otherwise contains the value of the interaction term's coefficient. When designing a chemotherapy regimen for a clinical trial arm with proportion α of patients with positive hormone receptor status and proportion γ of patients with HER2 overexpression, we define aggregate coefficient vectors β̂_OS^b = β̂_OS^b− + αβ̂_OS^bα + γβ̂_OS^bγ, β̂_OS^i = β̂_OS^i− + αβ̂_OS^iα + γβ̂_OS^iγ, and β̂_OS^a = β̂_OS^a− + αβ̂_OS^aα + γβ̂_OS^aγ. Correspondingly, the toxicity model yields coefficient vectors β̂_DLT^b−, β̂_DLT^i−, β̂_DLT^a−, β̂_DLT^x, β̂_DLT^bα, β̂_DLT^iα, β̂_DLT^aα, β̂_DLT^bγ, β̂_DLT^iγ, and β̂_DLT^aγ, with aggregate coefficient vectors β̂_DLT^b, β̂_DLT^i, and β̂_DLT^a. These coefficients, together with vector x specifying the patient demographics and study characteristics for the trial arm being optimized, parameterize the optimization model for selecting a new combination drug therapy:

max_{b,i,a}  (β̂_OS^b + Γu)'b + (β̂_OS^i)'i + (β̂_OS^a)'a + (β̂_OS^x)'x        (2)

subject to   (β̂_DLT^b)'b + (β̂_DLT^i)'i + (β̂_DLT^a)'a + (β̂_DLT^x)'x ≤ t,    (2a)

             Σ_{d=1}^{n} b_d ≤ N,                                             (2b)

             Ab ≤ c,                                                          (2c)

             (b, i, a) ∉ P,                                                   (2d)

             (b_d, i_d, a_d) ∈ Ω_d,  d = 1, ..., n,                           (2e)

             b_d ∈ {0, 1},  d = 1, ..., n.                                    (2f)

As in Chapter 2, u is a vector with elements u_d = t_d^(−1/2) for each drug d, where t_d is the number of times drug d has previously been tested in a clinical trial; together with specified exploration parameter Γ, the term Γu'b in the objective encourages the exploration of drugs that have not been extensively tested in prior clinical trials. Constraint (2a) limits the predicted toxicity not to exceed a specified limit t, and constraint (2b) limits the total number of drugs tested not to exceed a limit N (N = 3 in all experiments in this chapter). Constraints (2c) require that a new drug be tested in a clinical trial as soon as it is available and also limit the drug class combinations that can be tested in a clinical trial, as was done in Chapter 2. Further, if fewer than half of patients in a clinical trial arm have HER2-overexpressing cancer, then the HER2-specific drugs trastuzumab and trastuzumab emtansine are disallowed. Constraints (2d) require that the selected regimen differs from all previously tested drug therapies, and constraints (2e) require that the binary, average, and instantaneous dosages of any selected drug must match the dosages used in some clinical trial arm in the breast cancer database. The optimization model was implemented and solved in R using the lpSolveAPI package [112].
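For small instances, the structure of model (2) can be illustrated by brute-force enumeration rather than integer programming. The sketch below keeps only the binary-dose terms, the exploration bonus Γ/√t_d, and the toxicity and cardinality constraints; the drug names, coefficients, and test counts are invented.

```python
from itertools import combinations
from math import sqrt

# Invented per-drug OS benefits, DLT contributions, and prior test counts
beta_os = {"A": 2.1, "B": 1.4, "C": 0.9, "D": 1.8}
beta_dlt = {"A": 0.15, "B": 0.10, "C": 0.05, "D": 0.25}
times_tested = {"A": 25, "B": 9, "C": 1, "D": 16}

def best_regimen(gamma=0.05, t=0.30, N=3):
    # Enumerate all regimens of at most N drugs; keep the best objective
    # value subject to the predicted-toxicity limit (constraint (2a)).
    best, best_val = None, float("-inf")
    for k in range(1, N + 1):
        for combo in combinations(sorted(beta_os), k):
            if sum(beta_dlt[d] for d in combo) > t + 1e-9:  # toxicity limit
                continue
            # objective: predicted OS plus exploration bonus gamma/sqrt(t_d)
            val = sum(beta_os[d] + gamma / sqrt(times_tested[d])
                      for d in combo)
            if val > best_val:
                best, best_val = combo, val
    return best, best_val

regimen, value = best_regimen()
```

With n drugs the search space grows combinatorially, which is why the thesis solves (2) as an integer program via lpSolveAPI instead of enumerating.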

We select the drug therapies to test in the experimental arms of Phase III clinical trials using statistical models trained using the methodology in Section 3.3. We evaluate each regimen previously tested in a clinical trial arm, limiting to drug combinations that have not previously been tested (ignoring dosage) in a Phase III clinical trial arm. Using receptor status proportions α and γ and demographic and study characteristic variables x from the experimental arm we are designing, we evaluate

each candidate regimen using the statistical models, selecting the one with the highest predicted efficacy and limiting predicted toxicity not to exceed the specified limit t.

Following Chapter 2, we use the simulation and matching metrics to evaluate the effectiveness of the proposed approaches. Briefly, the simulation metric obtains efficacy and toxicity model coefficients β*_OS and β*_DLT by training ridge regression models on a bootstrapped version of the full clinical trial database; models are trained using patient demographic, study characteristic, drug, and drug-receptor interaction variables. These coefficients are used to simulate the outcomes of clinical trial arms, and observed clinical trial outcomes are simulated as noisy observations of the true underlying values indicated by β*_OS and β*_DLT. To avoid extreme simulated values, we limit the simulated impact of each drug in a combination therapy on the median OS.2 While the simulation metric enables us to learn from earlier suggested regimens when designing newer ones, its limitations include that it does not use actual clinical trial outcomes to evaluate suggestions, assumes our statistical models are structurally accurate, and assumes that all regimens designed by optimization model (2) are feasible. As in Chapter 2, we also evaluate suggested regimens using the matching metric, in which suggested regimens are evaluated using the most similar regimen tested in the future. While this evaluation approach has the benefit of evaluating our suggested regimens using real clinical trial outcomes, it has the drawbacks of not enabling us to learn from trials we design and of the matching clinical trial having been run in a different patient population from the one being designed. Despite the limitations of the simulation and matching metrics, we use them to determine if our models have promise at improving the quality of the regimens tested in Phase III clinical trial experimental arms — if these metrics show no improvement then there is

2Let S be the set of all valid dosing schedules of any drug that is available at the time of the simulated trial arm. For each drug d at instantaneous dosage i and average dosage a in set S, the simulated incremental survival benefit in the linear survival model with coefficient vector β*_OS is β*_OS,d^b− + αβ*_OS,d^bα + γβ*_OS,d^bγ + iβ*_OS,d^i− + αiβ*_OS,d^iα + γiβ*_OS,d^iγ + aβ*_OS,d^a− + αaβ*_OS,d^aα + γaβ*_OS,d^aγ, where α is the proportion of patients in the arm with positive hormone receptor status and γ is the proportion of patients in the arm with HER2 overexpression. Let I be the set of all incremental survival benefits of the dosing schedules in S. We limit the simulated incremental survival benefit of any drug in a proposed regimen not to exceed the 95th percentile of values in I and not to fall below the 5th percentile of values in I.


Figure 3-10: Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the simulation metric.

no reason to further test the methodology, while if they do show promise there could be merit to further evaluation in a clinical trial setting. Beginning 20% of the way through our breast cancer clinical trial database, we used our models to design all Phase II studies and Phase III experimental arms, obtaining efficacy and toxicity outcomes using the simulation metric. Figure 3-10 and Table 3.2 display the average difference between the median OS and proportion of trials with a high DLT rate for the 193 Phase III experimental arms selected by our model and the 193 Phase III experimental arms selected in current practice. Each point indicates one of the 33 simulation runs performed at each parameter setting, and the outlined points are the averages across simulation runs for a parameter setting. The shape of a point indicates the toxicity limit t used when selecting regimens for Phase II and III trials, and the color of a point indicates the exploration parameter Γ used in the simulation run. All 297 simulation runs averaged a survival improvement compared to current practice, with an average improvement of 7.2 months and no significant differences as Γ was varied in the set {0, 0.05, 0.1}. 290 simulation runs (98%) decreased the proportion of trial arms with an unacceptably high DLT rate, with an average reduction in that rate of 0.18 for models with t = 0.3, 0.17 for models

Γ      t     ∆ OS   ∆ DLT
0      0.3   7.2    -0.18
0      0.4   7.9    -0.17
0      0.5   7.8    -0.12
0.05   0.3   7.4    -0.18
0.05   0.4   7.2    -0.17
0.05   0.5   7.3    -0.12
0.10   0.3   7.1    -0.18
0.10   0.4   6.3    -0.17
0.10   0.5   6.2    -0.12

Table 3.2: Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the simulation metric, reporting for each optimization model with parameters Γ and t the change in average median OS and proportion of trials with unacceptably high DLT compared to current practice.

t     Match Drugs   Match Classes   ∆ OS   ∆ DLT
0.3   30%           57%             1.4    -0.16
0.4   32%           56%             1.5    -0.12
0.5   33%           55%             1.7    -0.10

Table 3.3: Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the matching metric, reporting for each optimization model with parameter t the proportion of model suggestions that match all drugs and all drug classes of the matched future trial, as well as the average change in median OS and proportion of trials with unacceptably high DLT compared to current practice.

with t = 0.4, and 0.12 for models with t = 0.5.

Beginning 20% of the way through our breast cancer clinical trial database, we used our models to design all Phase III experimental arms, estimating efficacy and toxicity outcomes using the matching metric. Figure 3-11 displays efficacy and toxicity outcomes relative to current practice as assessed by the matching metric across the 193 Phase III experimental arms considered. Each point displays the results for the model with a specified toxicity limit t, and all models were run with Γ = 0. Statistical uncertainty is captured by bootstrap resampling the results for the 193 arms evaluated, and the 95% confidence ellipse of these bootstrap results is plotted for each parameter setting in Figure 3-11. All models averaged a survival improvement

87 3

2 t

● 0.3 ● 0.4 1 0.5

Current Practice 0 ●

−1 Average Survival Improvement (months) Survival Improvement Average

−0.2 −0.1 0.0 0.1 Increase in High DLT Rate (proportion)

Figure 3-11: Comparison between Phase III experimental arm suggestions (n = 193) from our models and from current practice according to the matching metric.

over the Phase III experimental arms, ranging from 1.4–1.7 months, and all models averaged a decrease in the proportion of trials with unacceptably high DLT rates, with the decrease ranging from 0.10–0.16. Roughly 32% of evaluations were made with a clinical trial arm that matched all drugs in the proposed regimen, and roughly 56% of evaluations were made with a clinical trial arm that matched all drug classes from the proposed regimen.
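The bootstrap-resampling step used to quantify uncertainty around per-arm improvements can be sketched as a percentile bootstrap; the improvement values and seed below are illustrative, not thesis data.

```python
import random

def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample the per-arm improvements with
    # replacement and take empirical quantiles of the resampled means.
    rng = random.Random(seed)
    means = sorted(sum(rng.choice(values) for _ in values) / len(values)
                   for _ in range(n_boot))
    return (means[int(alpha / 2 * n_boot)],
            means[int((1 - alpha / 2) * n_boot) - 1])

# hypothetical per-arm survival improvements (months)
improvements = [1.4, 1.5, 1.7, 1.6, 1.3] * 10
lo, hi = bootstrap_mean_ci(improvements)
```

Resampling pairs of (efficacy, toxicity) outcomes instead of a single column would yield the two-dimensional cloud from which a confidence ellipse like those in Figure 3-11 is drawn.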

3.6 Discussion

In this chapter, we evaluated how well the results from Chapter 2 generalize from advanced gastric cancer to advanced breast cancer. We confirm that statistical models for the efficacy and toxicity of advanced breast cancer clinical trial arms achieve similar out-of-sample predictive accuracy to the performance observed in Chapter 2 for advanced gastric cancer clinical trials and could potentially be useful to clinical trial planners seeking to flag unpromising proposed trials. We further find that adding drug-receptor interaction terms yields a small improvement in predictive accuracy for statistical models predicting clinical trial efficacy and that statistical models trained

with both gastric and breast cancer clinical trial arms can obtain better out-of-sample toxicity predictions than a model trained using only gastric cancer trials. The simulation and matching metrics indicate that our proposed models for designing drug therapies may simultaneously improve the efficacy and toxicity outcomes of Phase III experimental arms.

As in Chapter 2, the experiments in this chapter have several limitations, and further experimentation is needed to confirm the benefit of the proposed models. First, the database of advanced breast cancer clinical trial arms used in this work may suffer from publication bias, in which researchers selectively publish their promising results but do not publish their unpromising results [62]. Such a bias would cause the statistical models developed in Section 3.3 to be overoptimistic about the impact of drug therapies, and uneven publication bias across different types of therapies might bias the sorts of drug therapies favored by the models in Section 3.5. Further, the simulation and matching metrics used to evaluate the quality of suggested drug therapies in Section 3.5 both have a number of limitations that may lead to positive bias in their evaluation of the proposed therapies. The only way to conclusively evaluate the impact of the proposed models would be evaluation in a clinical trial setting.

Chapter 4

Identifying Effective Screening Strategies for Prostate Cancer

4.1 Introduction

Prostate cancer is the most common non-skin cancer among United States men and the second-leading cause of cancer death in this population [155]. Worldwide, it is the second most common non-skin cancer among men and the fourth leading cause of cancer death in this population [165]. Routine screening for prostate cancer with the Prostate-Specific Antigen (PSA) blood test is a topic of controversy due to uncertainty about the survival benefits of screening and due to the risk of overdetection, in which disease that would never grow to be clinically meaningful is screen detected and treated. The United States Preventive Services Task Force (USPSTF) does not recommend routine PSA-based screening among asymptomatic men [129], while the American Cancer Society (ACS), American Society for Clinical Oncology (ASCO), and American Urological Association (AUA) recommend shared decision making to determine whether to screen for the disease [175, 22, 41]. Evidence from randomized controlled trials is inconclusive, with the European Randomized Study of Screening for Prostate Cancer (ERSPC) showing a mortality advantage from screening [152] and the United States Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial showing no mortality improvement from screening [15].

Though randomized controlled trials (RCTs) represent the gold standard in evaluating the efficacy of medical interventions in many settings, it is challenging to evaluate the effectiveness of competing population screening strategies for cancer using RCTs like the ERSPC or PLCO. First, patients often do not follow their allocated screening strategy in such trials, making it challenging to compare screening strategies. For instance, patients in the control arm of the PLCO were directed not to perform routine PSA screening, yet 40% performed a PSA test in the first year of the trial and 52% performed a test in the sixth year [15]. Further, given the serial nature of screening for cancer, many screening strategies are possible with different start and end ages, intra-screening intervals, and screening intensities. Given that it is infeasible to test more than a few of these strategies in an RCT, trials cannot be used to find the best population screening strategy among the many available. Finally, because less than 1% of the male population dies of prostate cancer by age 75 [165] and the disease often develops slowly, RCTs investigating the mortality impact of prostate cancer screening must have large enrollment and long follow-up times, leading to high costs for the trials.

Given the challenges in evaluating screening strategies using RCTs, modeling is an important tool for evaluating strategies [65]. A number of works have proposed models that could be used to evaluate competing screening strategies for prostate cancer, typically modeling the natural history of the disease, the effectiveness of a specified screening strategy at detecting the disease, and post-detection outcomes for patients [78, 88, 109, 113, 147, 167].

Due to challenges in observing the natural history of cancer and the impact of screening on long-term health outcomes, mathematical models for cancer screening have significant uncertainty. One such form of uncertainty is parameter uncertainty, which is addressed in nearly all modeling studies using sensitivity analysis or probabilistic sensitivity analysis [59]. A less commonly addressed form of uncertainty is structural uncertainty, which is caused by the simplifications and scientific judgments made as part of the modeling process [31]. Structural uncertainty is addressed using comparative modeling, in which multiple mathematical models with different

structures are used to evaluate the same phenomenon and differences between model outcomes are explored in a systematic way [99]. Comparative modeling has previously been used to quantify the role of PSA screening in U.S. prostate cancer mortality [64], to estimate lead time and overdiagnosis due to PSA screening [60], and to evaluate the impact of control arm contamination on perceived screening efficacy in the PLCO trial [80]. To our knowledge, comparative modeling has not been used to evaluate competing screening strategies for prostate cancer, though it has been used to evaluate screening strategies for other cancers, including colorectal cancer [178], breast cancer [124], and lung cancer [53].

In comparative modeling studies of screening strategies, every competing screening strategy is evaluated by each mathematical model, and relatively few techniques exist to combine the evaluations of multiple models. In a systematic review of methods for characterizing structural uncertainty, [31] found that most methods for combining results from structurally distinct models, such as model selection [23, 25, 114, 136], Bayesian model averaging [46, 91, 173], and non-Bayesian model averaging [44], come from the statistics literature and require an observable outcome that can be used to assess the quality of each model's predictions. When using mathematical models to assess screening strategies, there are typically multiple outcomes of interest (e.g., how well model outcomes match observed population incidence rates and how well they match observed population mortality rates), so combining results from multiple models has been limited to taking (potentially weighted) averages of model outcomes. In practice, comparative modeling for screening strategies has been carried out by using each model to evaluate a set of competing strategies and building model-specific efficient frontiers of strategies that trade off screening benefit with intensity; typically no attempt is made to combine results across models. In cases where models' evaluations of strategies differ, such as in [178] where the SimCRC model found all colonoscopy screening strategies starting at age 40 to be efficient while the MISCAN model found only a subset of these strategies to be efficient, these discrepancies are discussed in the text of the paper. Typically the number of screening strategies evaluated (145 in [178], 20 in [124], and 576 in [53]) is small compared to the enormous number of

population screening strategies that are possible. In this chapter, we propose a comparative modeling methodology that considers not only the average of models' assessments of strategies but also the most pessimistic assessment of a given strategy across the models, giving decision makers tunable control over the degree of conservatism to take when identifying effective interventions. Using mathematical optimization, we search a large set of implementable screening strategies instead of a smaller number of hand-selected strategies. We evaluate the proposed methodology using decision analyses for prostate cancer screening identified through a systematic review of the medical literature.

4.2 Methods

Using a systematic literature review detailed in Appendix F, we identified three decision models published since 2010 that can estimate quality-adjusted life expectancy (QALE) in the general population for PSA-based screening strategies with age-specific PSA thresholds for biopsy. The selected decision analyses included two Markov models [167, 179] and a microsimulation model [78].

The model from [179] is a Markov model; the state encodes both a man's age and health state: no prostate cancer, prostate cancer (a state encapsulating organ-confined prostate cancer, lymph node-positive prostate cancer, and metastatic prostate cancer), death from prostate cancer, and death from all other causes. Patients' PSA levels are not explicitly modeled through time, and each PSA test result is independent of previous test results. All patients are modeled to receive radical prostatectomy as a treatment for screen-detected prostate cancer. We re-implemented this model based on its description in [179].

The model from [167] is a Markov model similar to the model from [179], but it explicitly models patients' PSA levels through time. The state space for the model contains not only a patient's age and cancer state (no prostate cancer, non-metastatic prostate cancer, metastatic prostate cancer, and death) but also their current PSA level. All patients are modeled to receive radical prostatectomy as a treatment for

screen-detected prostate cancer. We re-implemented this model based on its description in [167], confirming the accuracy of the re-implementation using source code provided by the authors.

The model from [78] is a microsimulation model developed as part of the Cancer Intervention and Surveillance Modeling Network (CISNET). A patient's prostate cancer health state is defined by both the stage (loco-regional or distant) and grade (Gleason 2–7 or 8–10) of their cancer, and their PSA level at each age is simulated jointly with their health state. Patients with detected loco-regional cancer select active surveillance, radical prostatectomy, or radiation based on treatment patterns in the United States. The authors of [78] provided us with the full source code for their model.
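To make the Markov-model mechanics concrete, the sketch below advances a cohort through one illustrative annual transition matrix. The four states mirror the description of the model from [179], but the transition probabilities are placeholders invented for illustration, not the published parameter values.

```python
# Illustrative Markov cohort update; probabilities are NOT from [179].
import numpy as np

states = ["no cancer", "prostate cancer", "PCa death", "other death"]
P = np.array([
    [0.97, 0.01, 0.00, 0.02],  # no cancer -> each state (rows sum to 1)
    [0.00, 0.93, 0.03, 0.04],  # prostate cancer -> each state
    [0.00, 0.00, 1.00, 0.00],  # death states are absorbing
    [0.00, 0.00, 0.00, 1.00],
])

dist = np.array([1.0, 0.0, 0.0, 0.0])  # cohort starts cancer-free
for _ in range(10):                    # ten annual transitions
    dist = dist @ P

print(dict(zip(states, dist.round(4))))
```

In the actual models the transition probabilities are age-dependent and, in [167], the state additionally encodes the patient's PSA level, but each time step has this same matrix-vector form.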

4.2.1 Model Parameters

Each model uses a number of parameters, many of whose values are difficult to know with certainty. [167] and [179] model age-specific PSA levels using PSA test results from 11,872 men who underwent PSA testing in Olmsted County, Minnesota, and parameterize disease progression and mortality using various sources including the Mayo Clinical Radical Prostatectomy Registry, published clinical studies [12, 35, 72, 150], and SEER. [78] model PSA levels through time using data from the control arm of the Prostate Cancer Prevention Trial [66, 79, 161], selecting values for disease progression parameters that cause model-projected disease incidence to match observed incidence in the United States. We perform one-way sensitivity analysis over key parameters identified in the respective studies, as detailed in Appendix G.

4.2.2 Quality of Life

Early detection of prostate cancer via screening is a two-edged sword, with potential health benefits from extended life and decreased incidence of metastatic disease but harms from increased biopsy rates and unnecessary treatment due to overdiagnosis.

As a result, in this chapter we study the net quality-of-life impacts of prostate cancer screening. The three models included in this chapter take different approaches to reporting this outcome of interest. [167] and [179] explicitly model QALE, assigning quality-of-life decrements for biopsies, treatment, metastatic disease, and early death due to cancer. Meanwhile, [78] reports a number of metrics that capture quality of life, including the number of screening tests, false positive tests, overdiagnoses, lives saved, and months of life saved. As is common in comparative modeling, we updated all models to ensure they share the same definition of the outcome of interest. We adopted the detailed QALE specification from [88], assigning quality-of-life decrements for screening attendance, biopsy, cancer diagnosis, radiation therapy, radical prostatectomy, active surveillance, palliative therapy, and terminal illness. We use sensitivity ranges from [88] for each QALE decrement in the one-way sensitivity analysis in this chapter, as detailed in Appendix G. As a further sensitivity analysis, we optimize screening strategies using the most pessimistic sensitivity range value for each QALE decrement, using the lowest sensitivity range value for the QALE decrements due to palliative therapy and terminal illness and the highest sensitivity range value for each remaining QALE decrement.

4.2.3 Screening Strategies

Many PSA-based screening strategies for prostate cancer are possible, with different start and end ages, inter-screening intervals, and age-specific PSA levels that trigger a confirmatory biopsy. To select a screening strategy that is easy to implement, we consider all annual and biennial screening strategies with age-specific PSA cutoffs (in ng/mL) from the set {0.5, 1.0, 1.5, ..., 6.0}, fixed PSA cutoffs for 5-year age ranges, and PSA cutoffs that are non-decreasing in a patient's age. In total, there are more than 5.4 million such screening strategies, which motivates the use of mathematical optimization instead of enumeration to identify effective strategies. In a comparative modeling setting, we seek to identify screening strategies that perform well according to all models included in the analysis, as this increases the

confidence of the modeler that the identified strategies are of high quality. In our case, we seek to identify screening strategies that maximally improve QALE compared to a strategy of not screening. Given a screening strategy s and models' evaluations of that strategy's improvement over not screening, m1(s), m2(s), and m3(s), we define the average assessment of the models to be (m1(s) + m2(s) + m3(s))/3 and the most pessimistic assessment of the models to be min{m1(s), m2(s), m3(s)}. Depending on a modeler's preferences, they might seek to optimize either of these objectives or some combination of the two. In this chapter, we compute an efficient frontier of strategies trading off the average assessment and most pessimistic assessment of the strategies. Due to the large number of feasible screening strategies, we use a constrained local search procedure to construct the efficient frontier of strategies (see Appendix H for details). To determine the utility of the comparative modeling framework, we also optimize screening strategies using the incremental QALE compared to not screening according to each individual mathematical model, as is typically done in comparative modeling studies. This represents the best screening strategy according to a single mathematical model. Finally, we compare the screening strategies to five expert-generated strategies from [147] and 17 expert-generated strategies from [78].
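The two assessments defined above can be sketched directly. The convex combination below is one illustrative way to realize the tunable conservatism between them, not the thesis's exact optimization objective; the evaluation values are those reported for one frontier strategy in Section 4.3.

```python
# Combining three models' QALE gains (months vs. not screening) for one strategy.

def average_assessment(evals):
    """Unweighted mean of the models' assessments."""
    return sum(evals) / len(evals)

def pessimistic_assessment(evals):
    """Worst-case assessment across the models."""
    return min(evals)

def combined_objective(evals, lam):
    """lam = 1 recovers the average; lam = 0 recovers the worst case."""
    return lam * average_assessment(evals) + (1 - lam) * pessimistic_assessment(evals)

evals = [0.5, 1.7, 5.4]  # assessments by the models from [78], [167], [179]
print(pessimistic_assessment(evals))        # 0.5
print(round(average_assessment(evals), 2))  # 2.53
```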

4.3 Results

Figure 4-1 displays the average and most pessimistic assessments of each strategy in the efficient frontier trading off these two quantities (black dots), the screening strategies optimized using a single decision model (red dots), and the expert-generated strategies (gray dots). The strategy with the highest pessimistic assessment across the three models yielded 0.5 months of incremental QALE compared to not screening according to the model from [78], 1.7 incremental months according to the model from [167], and 5.4 incremental months according to the model from [179]. Meanwhile, the strategy with the highest average assessment across the models yielded 0.1 months of incremental QALE according to the model from [78], 4.7 incremental

[Figure 4-1 appears here: a scatter plot of QALE change in months, plotting each strategy's most pessimistic assessment (horizontal axis) against its average assessment (vertical axis), with separate markers for expert, multi-model, and single-model strategies.]

Figure 4-1: Average and most pessimistic assessments of identified and expert- generated screening strategies.

months according to the model from [167], and 6.1 incremental months according to the model from [179]. All screening strategies on the efficient frontier, as well as the strategy optimized using only model [78] and the strategy optimized using only model [179], improve over not screening according to all three models. However, the strategy optimized according to model [167] yields a decrease of 0.2 months of QALE according to the model from [78]. All expert-generated strategies improved over not screening according to all three models, though none were on the efficient frontier. For each expert-generated strategy, there was a strategy on the efficient frontier with at least as good of a pessimistic assessment and at least 0.5 more incremental months of QALE according to the average assessment. Further, for each expert-generated strategy, there was a strategy on the efficient frontier with at least as good of an average assessment and at least 0.05 more incremental months of QALE according to the pessimistic assessment.

Figure 4-2 displays all 64 screening strategies on the efficient frontier, plotting a transparent line connecting all age-specific biopsy thresholds specified in each strategy. 57 of the strategies are annual strategies, and the remaining seven strategies are biennial strategies. The strategy optimizing the most pessimistic assessment of the three models is plotted in red with a point for each screening test labeled with


Figure 4-2: The 64 screening strategies on the efficient frontier trading off average and most pessimistic QALE assessment, with the strategy optimizing the pessimistic assessment highlighted in red.


Figure 4-3: Average and most pessimistic assessments of identified and expert-generated screening strategies, using the most pessimistic QALE decrements from [88].

the specified biopsy threshold. Under this strategy, patients perform biennial PSA screening at even ages from 45–74, using biopsy threshold 1.5 ng/mL from ages 45–54, 2.5 ng/mL from ages 55–59, 3.5 ng/mL from ages 60–64, 4.0 ng/mL from ages 65–69, and 6.0 ng/mL from ages 70–74. The other policies on the efficient frontier are more aggressive, starting at earlier ages, ending at later ages, and generally employing lower biopsy thresholds. All screening policies show a pattern of increasing biopsy threshold as age increases.

4.3.1 Sensitivity Analysis

The first sensitivity analysis uses the most pessimistic QALE decrements in the sensitivity ranges from [88]. Figure 4-3 displays the average and most pessimistic assessments of the strategies on the efficient frontier trading off these two quantities (black dots), the screening strategies optimized using a single decision model (red dots), and the expert-generated strategies (gray dots). The strategy with the highest pessimistic assessment across the three models yielded 0.2 months of incremental QALE according to the model from [78], 0.2 months according to the model from [167], and 3.0 months according to the model from [179]. Meanwhile, the strategy with the highest


Figure 4-4: The 112 screening strategies on the efficient frontier trading off average and most pessimistic QALE assessment, using the most pessimistic QALE decrements from [88]. The strategy optimizing the pessimistic assessment is highlighted in red.

average assessment across the models yielded a 0.5 month decrease in incremental QALE according to the model from [78], 2.4 incremental months according to the model from [167], and 4.0 incremental months according to the model from [179]. 73 of the screening strategies on the efficient frontier improved QALE compared to not screening according to all three models, while the remaining 39 strategies were non-improving according to the model from [78].

Figure 4-4 displays all 112 screening strategies on the efficient frontier optimized using the most pessimistic QALE values from [88]. 88 of the strategies are annual strategies, and the remaining 24 strategies are biennial strategies. Under the best strategy according to the most pessimistic assessment of the three models (plotted in red), patients perform biennial screening from ages 50–69, using biopsy threshold

3.5 ng/mL from ages 50–54, 5.0 ng/mL from ages 55–59, and 6.0 ng/mL from ages 60–69.

In one-way sensitivity analysis testing, we assessed each strategy plotted in Figure 4-1 under the 80 sensitivity scenarios specified in Appendix G. Several strategies were found to improve over not screening in all scenarios: all 22 expert-generated strategies, 53 of the 64 strategies on the efficient frontier including the 41 most conservative strategies, and the strategy optimized according to the model from [78]. The remaining 11 strategies on the efficient frontier as well as the strategies optimized according to the models from [167] and [179] all performed worse than not screening at all for at least one model with a single parameter adjusted to an endpoint of its sensitivity range.

4.4 Discussion

In this chapter, we proposed a methodology to identify screening strategies that perform well across multiple mathematical models, which could be used to systematically identify strategies that are broadly supported in a comparative modeling framework. We demonstrate this methodology using three models for prostate cancer screening effectiveness derived from a systematic literature review, identifying screening strategies that all three models find to improve QALE compared to not screening, even with all QALE decrements simultaneously set to the most pessimistic value in their sensitivity range. Screening strategies obtained using only one model may not perform well when evaluated with another model; strategies optimized using the model from [167] were assessed as decreasing QALE compared to not screening at all when evaluated using the model from [78].

In this chapter we used mathematical optimization to identify screening strategies that form an efficient frontier trading off the average and most pessimistic assessment of multiple mathematical models, and we demonstrate that none of the 22 expert-generated strategies derived from the prostate cancer screening literature fall on this efficient frontier. This suggests that mathematical optimization is a promising approach to find the best possible screening strategy according to a model or set of models when there are a large number of possible strategies.

After setting all QALE decrements to their most pessimistic values, the screening strategy that best improves the unweighted model average outcome is considered to be 0.6 months of QALE worse than not screening at all by the model from [78], and several nearby points on the efficient frontier were similarly found to be non-improving for some scenarios in one-way sensitivity analysis. This decreases confidence that these screening strategies improve over not screening and motivates this chapter's approach of also considering the most pessimistic assessment when optimizing screening strategies.

A limitation of this chapter is that [167] and [179] did not compare model predictions for observable quantities like prostate cancer incidence or mortality against population or clinical trial data to validate their modeling assumptions. This could explain the significant differences between these models' assessments of strategies and the assessment by the model from [78], which validated the incidence and mortality outputs of the model using population data as well as data from the ERSPC [78, 79]. Though we form efficient frontiers between an unweighted average of model assessments and the most pessimistic assessment in this chapter, decision makers could choose to instead compute an efficient frontier between a weighted average of model assessments and the most pessimistic assessment, selecting model-specific weights to approximate confidence in each model's outputs.

Though in this chapter we provided an example of our methodology for comparative modeling of prostate cancer screening strategies, the same approach could be applied to any medical decision making problem for which multiple models have been proposed to select from a large number of competing strategies. First, the approach could be applied more broadly for cancer screening, as the sequential nature of cancer screening ensures there are a large number of possible screening strategies for any cancer and multiple models to assess these strategies have been published for many cancers including colorectal cancer [178], breast cancer [124], and lung cancer [53]. The approach could also be used to assess the impact of screening followed

by treatment in diseases other than cancer, as multiple models have been proposed to assess the health impacts of screening for diseases such as HIV [76, 92] and type 2 diabetes [77, 89, 102]. Beyond screening, the methodology could be used to optimize disease prevention strategies for diseases for which multiple models to assess the efficacy of strategies have been proposed; for instance, the approach could be used to identify effective population vaccination strategies for hepatitis B [75, 106] or to optimally allocate resources between different HIV prevention modalities in a resource-limited setting [49, 121]. Finally, the approach could be used to identify effective strategies for controlling chronic diseases for which multiple models of effectiveness have been proposed, such as determining when to switch therapy when treating HIV [50, 185] or determining how to control the blood glucose levels of diabetic patients [26, 58, 71, 118, 131].

Appendix A

Data Preprocessing for Gastric Cancer Database

We performed a series of data preprocessing steps to standardize the data collected from clinical trials. Many of these steps involved imputation of some value y from a closely related value x. In all such cases, we consider linear imputation via the linear regression equation y = β₀ + β₁x + ε and quadratic imputation via the linear regression equation y = β₀ + β₁x² + ε, selecting the model that obtains the lowest sum of squared residuals.
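A minimal sketch of this model-selection rule follows, using synthetic data; the variable names and data are illustrative and are not taken from the trial database.

```python
# Fit y = b0 + b1*x (linear) and y = b0 + b1*x^2 (quadratic) by least
# squares and keep whichever model has the smaller sum of squared residuals.
import numpy as np

def fit_and_ssr(x, y, quadratic=False):
    X = np.column_stack([np.ones_like(x), x ** 2 if quadratic else x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = float(np.sum((y - X @ beta) ** 2))
    return beta, ssr

def select_imputation_model(x, y):
    beta_lin, ssr_lin = fit_and_ssr(x, y)
    beta_quad, ssr_quad = fit_and_ssr(x, y, quadratic=True)
    if ssr_lin <= ssr_quad:
        return "linear", lambda v: beta_lin[0] + beta_lin[1] * v
    return "quadratic", lambda v: beta_quad[0] + beta_quad[1] * v ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 0.1 + 0.8 * x ** 2 + rng.normal(0.0, 0.02, 200)  # truly quadratic relationship
name, impute = select_imputation_model(x, y)
print(name)  # quadratic
```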

A.1 Performance Status

Performance status is a measure of an individual's overall quality of life and well-being. It is reported in our database of clinical trials predominantly using the Eastern Cooperative Oncology Group (ECOG) scale [138], and less often using the Karnofsky performance status (KPS) scale [104]. The ECOG scale runs from 0–5, where 0 is a patient who is fully active and 5 represents death. Among the 495 treatment arms evaluated in this work, 423 (85.5%) reported performance status with the ECOG scale, 71 (14.3%) reported with the KPS scale, and 1 (0.2%) did not report performance status. We include mean ECOG score as a variable in our prediction models. In 94 treatment arms, the proportion of patients with ECOG score 0 and 1 was

reported as a combined value. To compute the weighted performance score for these arms, we first obtain an estimate of the proportion of patients with ECOG score 0 and score 1, based on the proportion of patients with score 0 or 1. This estimation is done by taking the n = 302 trial arms with full ECOG breakdown and nonzero p₀ + p₁ and fitting linear and quadratic imputation models to estimate p₀/(p₀ + p₁) from p₀ + p₁; the quadratic model was selected because it had the lowest sum of squared residuals and achieved an R² value of 0.26. For the 18 treatment arms with the proportion of patients with each KPS score reported, we perform a conversion from the KPS scale to the ECOG scale based on data in [36]. For treatment arms reporting performance status with other data groupings, the combined score is marked as unavailable. In total, 100 trial arms (20.2%) were assigned a missing score.

A.2 Grade 4 Blood Toxicities

We need to compute the proportion of patients with each Grade 4 blood toxicity to compute the proportion of patients with a DLT, as defined in Section 2.2.2. Many treatment arms report the proportion of patients with a Grade 3/4 blood toxicity but do not provide the proportion of patients specifically with a Grade 4 blood toxicity. For neutropenia, thrombocytopenia, anemia, and lymphopenia, we built linear and quadratic imputation models to predict the Grade 4 toxicity from the Grade 3/4 toxicity, training on arms reporting both values. The models were trained using 217, 311, 254, and 8 treatment arms, respectively. The quadratic imputation model was most effective in predicting Grade 4 neutropenia and the linear models were the most effective for the remaining three imputations, with the models obtaining R² values of 0.88, 0.67, 0.28, and 0.04, respectively. The models were used to impute the proportion of patients with a Grade 4 toxicity in 119, 93, 101, and 3 treatment arms, respectively. Two of the most common blood toxicities, leukopenia and neutropenia, are often not reported together due to their similarity (neutrophils are the most common type of leukocyte). Because neutropenia is more frequently reported than leukopenia in

clinical trial reports, we chose this as the common measure for these two toxicities. We trained quadratic and linear imputation models using the proportion of patients experiencing Grade 3/4 leukopenia to predict the proportion of patients experiencing Grade 4 neutropenia, training on the 142 arms that reported both proportions. The linear model was the most effective with R² = 0.75, and we used that model to convert data from 99 treatment arms that reported leukopenia toxicity data but not neutropenia. Overall, we used some form of imputation to compute Grade 4 blood toxicities in 230 treatment arms (46.5%). As a sensitivity analysis to evaluate the effects of imputing these dependent variables, we sequentially evaluated ridge regression models limited to the 265 treatment arms for which no imputation was performed on the dependent variables. The ridge regression model predicting the proportion of patients with a DLT had a test-set sequential AUC of 0.82 (bootstrap 95% CI [0.76, 0.87]) on the last four years of the limited dataset and did not significantly differ from the AUC of 0.83 (bootstrap 95% CI [0.77, 0.85]) on the last four years of the full dataset.

A.3 Proportion of Patients with a DLT

The fraction of patients with at least one DLT during treatment cannot be calculated directly from the individual toxicity proportions reported. For instance, in a clinical trial in which 20% of patients had Grade 4 neutropenia and 30% of patients had Grade 3/4 diarrhea, the proportion of patients with a DLT might range from 30% to 50%. Here we compare approaches for computing the proportion of patients experiencing at least one DLT. We consider five options for combining the toxicities:

∙ Max Approach: Label a trial’s toxicity as the proportion of patients with the most frequently occurring DLT. This is a lower bound on the true proportion of patients with a DLT.

∙ Independent Approach: Assume all DLTs in a trial occurred independently of one another, and use this to compute the expected proportion of patients with any DLTs.

∙ Sum Approach: Label a trial’s toxicity as the sum of the proportion of patients with each DLT. This is an upper bound on the true proportion of patients with a DLT.

∙ Grouped Independent Approach: Define groups of toxicities, using the 20 broad anatomical/pathophysiological categories defined by the NCI-CTCAE v3 toxicity reporting criteria [133]. Assign each toxicity group a “group score” that is the incidence of the most frequently occurring DLT in that group. Then, compute a toxicity score for the trial by assuming toxicities from each group occur independently, with probability equal to the group score.

∙ Grouped Sum Approach: Using the same groupings as in the Grouped Independent Approach, compute a toxicity score for the trial as the sum of the group scores.
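The Grouped Independent Approach from the list above can be sketched as follows; the toxicity rates and group assignments here are illustrative, not the NCI-CTCAE v3 category definitions.

```python
# Take the worst DLT rate within each toxicity group (the "group score"),
# then combine the group scores assuming the groups occur independently.

def grouped_independent(toxicities, groups):
    """toxicities: {toxicity: proportion of patients}; groups: {toxicity: group}."""
    group_scores = {}
    for name, p in toxicities.items():
        g = groups[name]
        group_scores[g] = max(group_scores.get(g, 0.0), p)
    prob_no_dlt = 1.0
    for score in group_scores.values():
        prob_no_dlt *= 1.0 - score
    return 1.0 - prob_no_dlt

tox = {"neutropenia": 0.20, "leukopenia": 0.15, "diarrhea": 0.30}
grp = {"neutropenia": "blood", "leukopenia": "blood", "diarrhea": "GI"}
# blood group score = max(0.20, 0.15) = 0.20; combined = 1 - 0.8 * 0.7 = 0.44
print(round(grouped_independent(tox, grp), 2))  # 0.44
```

Note that the result (44%) lies between the Max Approach's lower bound (30%) and the Sum Approach's upper bound (65% in this example).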

We evaluate how each of these five approaches does at estimating the proportion of patients with Grade 3/4 toxicities in clinical trials that report this value given the individual Grade 3/4 toxicities. Because there is a strong similarity between the set of Grade 3/4 toxicities and the set of DLTs, we believe this metric is a good approximation of how well the approaches will approximate the proportion of patients with a DLT. 40 trial arms (8.1%) report this value, though we can only compute the combined metric for 36 of them due to missing toxicity data. The quality of each combination approach is obtained by taking the correlation between that approach's results and the combined Grade 3/4 toxicities. As reported in Table A.1, all five combination approaches provide reasonable estimates for the combined toxicity value, though in general grouped metrics outperformed non-grouped metrics. The best approach is the "grouped independent approach," because it allows the best approximation of the combined Grade 3/4 toxicities. We use this approach to compute the final proportion of patients experiencing a DLT.

Combination Approach    Correlation
Grouped Independent     0.893
Independent             0.875
Max                     0.867
Grouped Sum             0.855
Sum                     0.820

Table A.1: Correlation of estimates of total Grade 3/4 toxicity to the true value.

If one or more of the toxicities for a trial arm are mentioned in the text but their values cannot be extracted (e.g. if toxicities are not reported by patient or toxicity grades are not provided), then the proportion of patients experiencing a DLT for that trial arm is marked as unavailable. This is the case for 104 (21.0%) of trial arms in the database.

Appendix B

Coefficients of Models Predicting Gastric Cancer Outcomes

Table B.1 contains the coefficients of the drug dosage variables in the ridge regression models trained using the entire gastric cancer clinical trial database. Columns β̂^b_OS, β̂^i_OS, and β̂^a_OS represent the coefficients for the binary, instantaneous, and average dosages of each drug in the model predicting median overall survival, while columns β̂^b_DLT, β̂^i_DLT, and β̂^a_DLT are the same coefficients in the model predicting the proportion of patients with a dose-limiting toxicity. Though the models were trained using normalized versions of these variables, the coefficients in this table have been de-normalized.

Drug  Unit  β̂^b_OS  β̂^i_OS  β̂^a_OS  β̂^b_DLT  β̂^i_DLT  β̂^a_DLT
9-Aminocamptothecin  µg/m²  -2.83E-02  -4.72E-05  -9.90E-05  -9.29E-04  -1.55E-06  -3.25E-06
Actinomycin  mg  0.00E+00  0.00E+00  0.00E+00  2.45E-04  4.88E-04  6.83E-03
BBR 3438  mg/m²  -1.82E-01  -3.63E-03  -1.02E-01  1.03E-02  2.06E-04  5.75E-03
Bevacizumab  mg/kg  6.24E-01  6.63E-02  1.36E+00  -4.48E-03  5.77E-04  1.53E-02
BMS-182248-01  mg/m²  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
BOF-A2  mg  -1.22E-01  -3.05E-04  -6.10E-04  1.09E-03  2.72E-06  5.43E-06
—  mg/m²  -4.00E-02  -3.16E-02  -1.79E-01  1.33E-02  1.02E-02  5.29E-02
Bryostatin-1  µg/m²  -4.10E-02  -1.02E-03  -9.55E-03  0.00E+00  0.00E+00  0.00E+00
Caelyx  mg/m²  3.45E-01  4.37E-03  1.57E-01  -8.98E-04  -8.53E-05  -1.77E-03
Capecitabine  mg/m²  2.42E-01  2.84E-04  4.97E-04  -4.45E-03  -8.63E-06  2.24E-06
—  AUC  -5.05E-02  2.59E-03  -1.93E-02  -9.06E-03  3.47E-04  -2.12E-02
—  mg/m²  -2.10E-01  -4.94E-03  -5.26E-02  0.00E+00  0.00E+00  0.00E+00
Cetuximab  mg/m²  1.82E-01  4.33E-05  2.60E-03  2.14E-03  -1.97E-05  6.92E-05
Cisplatin  mg/m²  7.01E-01  1.03E-02  -1.64E-02  2.00E-02  1.05E-03  2.49E-03
Cyclophosphamide  mg/m²  0.00E+00  0.00E+00  0.00E+00  2.44E-04  4.07E-07  5.71E-06
—  mg/m²  1.91E-01  5.79E-03  4.06E-02  2.41E-04  7.31E-06  5.13E-05
DHA-paclitaxel  mg/m²  1.14E-01  1.04E-04  2.18E-03  0.00E+00  0.00E+00  0.00E+00
Diaziquone  mg/m²  -2.37E-02  -5.93E-04  -1.25E-02  0.00E+00  0.00E+00  0.00E+00
Docetaxel  mg/m²  1.13E+00  5.60E-03  9.16E-02  5.47E-02  1.59E-03  1.81E-02
—  mg/m²  8.30E-02  -2.03E-04  -7.25E-05  -8.43E-03  -1.03E-05  -3.41E-05
Doxorubicin  mg/m²  3.22E-02  -4.05E-03  2.68E-01  2.51E-02  2.19E-04  2.03E-02
—  mg/m²  2.89E-01  8.36E-03  7.13E-03  1.18E-02  8.33E-05  1.07E-03
Erlotinib  mg  -9.49E-02  -6.32E-04  -6.33E-04  -9.85E-03  -6.57E-05  -6.56E-05
Esorubicin  mg/m²  -5.26E-02  -1.50E-03  -3.16E-02  0.00E+00  0.00E+00  0.00E+00
—  mg/m²  2.61E-01  1.96E-04  -1.05E-02  2.48E-02  2.20E-04  9.36E-04
Everolimus  mg  2.82E-01  2.82E-02  2.82E-02  9.99E-04  1.00E-04  1.00E-04
Flavopiridol  mg/m²  -7.19E-02  -1.44E-03  -6.71E-03  6.36E-03  1.27E-04  5.94E-04
Fluorouracil  mg/m²  2.17E-01  3.23E-04  1.50E-04  -2.93E-03  1.14E-05  1.04E-04
—  mg/m²  -7.02E-02  -7.02E-04  -1.47E-02  0.00E+00  0.00E+00  0.00E+00
Gefitinib  mg  -1.43E-01  -5.72E-04  -5.73E-04  -1.83E-02  -7.33E-05  -7.33E-05
—  mg/m²  1.88E-01  1.33E-04  1.24E-03  -2.30E-03  -3.19E-06  -2.98E-05
Heptaplatin  mg/m²  -6.39E-02  -2.24E-04  -6.11E-03  -4.04E-03  -9.10E-06  -1.11E-04
IFN  MU  -2.34E-02  -2.60E-03  -5.94E-03  5.42E-03  6.02E-04  1.84E-03
Iproplatin  mg/m²  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
Irinotecan  mg/m²  8.68E-01  6.09E-04  4.99E-02  4.30E-02  3.23E-04  3.45E-03
Irofulven  mg/kg  -9.93E-02  -2.21E-01  -2.32E+00  2.58E-03  5.74E-03  6.03E-02
—  mg/m²  0.00E+00  0.00E+00  0.00E+00  4.72E-03  3.96E-04  3.54E-03
Lapatinib  mg  -1.39E-01  -9.27E-05  -9.28E-05  0.00E+00  0.00E+00  0.00E+00
Leucovorin  mg/m²  1.04E+00  1.04E-03  -4.68E-03  3.16E-03  -7.82E-05  5.16E-04
Levoleucovorin  mg/m²  5.86E-01  1.63E-03  1.90E-02  -9.52E-03  -6.84E-05  -5.71E-04
Lovastatin  mg/kg  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
Matuzumab  mg  -9.54E-02  -1.19E-04  -8.36E-04  4.51E-03  5.63E-06  3.94E-05
Merbarone  mg/m²  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
—  mg/m²  -2.58E-01  -2.75E-05  -9.65E-04  -6.21E-04  1.01E-05  2.65E-04
methyl-CCNU  mg/m²  2.41E-01  1.61E-03  1.12E-01  0.00E+00  0.00E+00  0.00E+00
Mitomycin  mg/m²  5.26E-01  6.45E-03  -4.65E-01  -1.43E-02  -2.30E-03  3.44E-03
—  mg/m²  -1.67E-01  -1.34E-02  -2.60E-01  -6.74E-03  -3.15E-04  -1.14E-02
NK105  mg/m²  6.47E-01  4.31E-03  9.06E-02  4.84E-03  3.23E-05  6.78E-04
OSI-7904L  mg/m²  1.53E-01  1.28E-02  2.69E-01  0.00E+00  0.00E+00  0.00E+00
Oxaliplatin  mg/m²  3.55E-01  6.57E-03  3.21E-02  -1.21E-04  1.50E-04  1.69E-03
Paclitaxel  mg/m²  8.99E-01  4.29E-03  8.67E-02  5.80E-04  9.47E-05  1.34E-03
PALA  mg/m²  -6.72E-02  -2.69E-04  -1.88E-03  5.42E-03  2.17E-05  1.52E-04
Pegamotecan  mg/m²  1.77E-01  2.53E-05  5.31E-04  3.56E-03  5.08E-07  1.07E-05
—  mg/m²  -1.18E-01  -2.36E-04  -4.95E-03  2.92E-03  5.84E-06  1.23E-04
—  mg  -8.15E-02  -4.08E-03  -5.70E-02  6.68E-05  3.33E-06  4.61E-05
Piroxantrone  mg/m²  -6.69E-02  -4.46E-04  -9.37E-03  -7.44E-04  -4.96E-06  -1.04E-04
PN401  g  -1.23E-01  -2.57E-03  -2.39E-02  -4.35E-03  -9.04E-05  -8.44E-04
Pravastatin  mg  -8.56E-02  -2.14E-03  -2.14E-03  5.46E-03  1.37E-04  1.37E-04
—  mg/m²  -1.95E-01  -6.46E-02  -1.36E+00  7.94E-03  2.92E-03  6.13E-02
S-1  mg/m²  7.51E-01  1.44E-02  1.80E-02  -1.71E-02  9.74E-07  2.89E-06
Saracatinib  mg  6.03E-02  3.45E-04  3.45E-04  1.55E-03  8.85E-06  8.85E-06
Sorafenib  mg  2.55E-01  3.18E-04  3.18E-04  -2.55E-03  -3.18E-06  -3.19E-06
Sunitinib  mg  8.03E-02  1.61E-03  2.41E-03  1.01E-02  2.03E-04  3.04E-04
Thioguanine  mg/m²  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
—  mg/m²  4.73E-04  3.32E-04  1.35E-03  0.00E+00  0.00E+00  0.00E+00
Trastuzumab  mg/kg  9.17E-01  1.53E-01  3.21E+00  -1.20E-02  -2.00E-03  -4.20E-02
Triazinate  mg/m²  1.50E-01  7.63E-04  1.32E-02  5.71E-03  2.84E-05  5.53E-04
Trimetrexate  mg/m²  7.59E-03  -4.45E-03  -2.33E-02  1.19E-02  3.67E-04  2.58E-03
UFT  mg/m²  2.54E-01  6.14E-04  -2.70E-04  1.20E-02  3.26E-05  4.84E-05
—  mg/m²  0.00E+00  0.00E+00  0.00E+00  2.44E-04  1.74E-04  2.45E-03
—  mg/m²  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00  0.00E+00
—  mg/m²  1.86E-01  8.89E-03  1.14E-01  2.89E-03  7.71E-05  -1.39E-03

Table B.1: Coefficients of drug dosage variables in the ridge regression models predicting efficacy and toxicity outcomes.

In several cases we converted between different dosing units so all dosages reported for a particular drug matched in unit. For the trials using cisplatin, 5-fluorouracil, leucovorin, or levoleucovorin and reporting dosages in unit mg, we converted to mg/m2 by dividing by 2, a typical body surface area in m2. For trials using cytarabine or 5-fluorouracil and reporting dosages in unit mg/kg, we converted to mg/m2 by multiplying by 82, a typical body weight in kilograms, and dividing by 2.

Table B.2 contains the coefficients for each non-drug variable in the ridge regression models predicting median OS (β̂_OS) and the proportion of patients experiencing a DLT (β̂_DLT). Models were trained using the entire gastric cancer clinical trial database. The intercept term for the model predicting median OS is 6.04 months, and the intercept term for the model predicting DLT is 0.22.
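The dose-unit conversions just described are simple arithmetic; the helper below is a sketch (the function and its name are ours), hard-coding the typical 2 m2 body surface area and 82 kg body weight assumed in the text.

```python
# Sketch of the dose-unit conversions described above. The body surface
# area (2 m^2) and weight (82 kg) are the typical values assumed in the
# text, not patient-specific measurements.
TYPICAL_BSA_M2 = 2.0
TYPICAL_WEIGHT_KG = 82.0

def to_mg_per_m2(dose, unit):
    """Convert a reported dose to mg/m^2."""
    if unit == "mg/m2":
        return dose
    if unit == "mg":                # total dose -> divide by body surface area
        return dose / TYPICAL_BSA_M2
    if unit == "mg/kg":             # per-kg dose -> total mg -> per m^2
        return dose * TYPICAL_WEIGHT_KG / TYPICAL_BSA_M2
    raise ValueError(f"unsupported unit: {unit}")
```

For example, a 100 mg cisplatin dose maps to 50 mg/m2, and a 1 mg/kg dose maps to 41 mg/m2.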

Variable                       Unit                β̂_OS     β̂_DLT
Male                           Proportion           0.2906   0.0185
Prior palliative chemotherapy  Proportion          -2.1137  -0.0197
Median age                     Year                -0.0003  -0.0002
Mean performance status        Performance Status  -1.1918  -0.0353
Primary tumor in stomach       Proportion          -0.6780  -0.0378
Primary tumor in GEJ           Proportion          -0.2066   0.1427
Authors from USA               Proportion          -0.3970   0.0769
Authors from Germany           Proportion           0.6626  -0.0151
Authors from UK                Proportion           0.4848   0.0136
Authors from Japan             Proportion           0.8799   0.0079
Authors from South Korea       Proportion           0.1193   0.0233
Authors from France            Proportion           0.4466  -0.0063
Authors from Italy             Proportion           0.7057  -0.0217
Authors from Spain             Proportion          -0.0671   0.0305
Authors from Taiwan            Proportion          -0.2651   0.0438
Authors from China             Proportion           0.5208  -0.0709
Authors from Netherlands       Proportion           0.5004  -0.0395
Authors from Asia              Proportion           1.1485   0.0080
Trial size                     Patients            -0.0023   0.0002
Publication year               Year                 0.0677   0.0012

Table B.2: Coefficients of the non-drug variables in the ridge regression models predicting efficacy and toxicity outcomes.

Appendix C

Pseudocode of Evaluation Techniques for Proposed Regimens

Figure C-1 provides the full pseudocode for the simulation metric evaluation technique, and Figure C-2 provides the full pseudocode for the matching metric evaluation technique.

Input: Clinical trial database X (rows in chronological order) containing q arms (the first r := ⌈q/5⌉ − 1 will be used as a training set), corresponding trial weight vector w, and corresponding Phase III study indicator p^III. Vectors of binary, instantaneous, and average dose variables and demographic/study characteristic variables for trial k are indicated by b^k, i^k, a^k, and x^k, respectively. Further input: outcome vectors y_OS and y_DLT and parameters t and Γ.

X^boot, y_OS^boot, y_DLT^boot, w^boot ← bootstrap-resampled versions of the q arms in X, y_OS, y_DLT, and w
β*_OS, β*_DLT, r_OS, r_DLT ← coefficients and residuals of ridge regression models with parameters selected via cross-validation, trained with X^boot, y_OS^boot, y_DLT^boot, and w^boot
σ²_OS ← r_OS′ diag(w^boot) r_OS / (q − f − 1), where f is the ridge regression degrees of freedom [87]
σ²_DLT ← r_DLT′ diag(w^boot) r_DLT / (q − f − 1)
P ← ⋃_{k=1}^{r} {(b^k, i^k, a^k)};  P^III ← {b^k | 1 ≤ k ≤ r, p^III_k = 1}
X^train ← X_{1,...,r};  y_OS^train ← y_OS,{1,...,r};  y_DLT^train ← y_DLT,{1,...,r};  w^train ← w_{1,...,r}
y_OS^CP ← [];  y_OS^Mod ← [];  y_DLT^CP ← [];  y_DLT^Mod ← []
for k = r + 1 to q do
    β̂_OS, β̂_DLT ← coefficients of ridge regression models trained using X^train, y_OS^train, y_DLT^train, and w^train
    if trial k is a control arm of a Phase III trial then
        b* ← b^k;  i* ← i^k;  a* ← a^k;  P^III ← P^III ∪ {b^k}
    else if trial k is an experimental arm of a Phase III trial then
        P̃ ← {(b, i, a) ∈ P ∖ P^III | β̂_DLT^b′ b + β̂_DLT^i′ i + β̂_DLT^a′ a + β̂_DLT^x′ x^k ≤ t}
        b*, i*, a* ← argmax_{(b,i,a)∈P̃} β̂_OS^b′ b + β̂_OS^i′ i + β̂_OS^a′ a + β̂_OS^x′ x^k, with ties broken randomly
        P^III ← P^III ∪ {b*}
    else if trial k is an arm of a Phase II study then
        Solve optimization model (1) using coefficients β̂_OS and β̂_DLT, vector u computed using X^train, and patient demographic variables x^k, obtaining optimal binary, instantaneous, and average dosage variable values b*, i*, and a*, respectively. Use parameter Γ in the objective, t in constraint (1a), N = 3 in constraint (1b), set P in constraint (1d), and sets Ω_d derived from X for constraints (1e).
    end if
    P ← P ∪ {(b*, i*, a*)}
    Append X^train with a row derived from b*, i*, a*, and x^k, and append w^train with w_k
    Append y_OS^train with a sample from N(β*_OS^b′ b* + β*_OS^i′ i* + β*_OS^a′ a* + β*_OS^x′ x^k, σ²_OS / w_k)
    Append y_DLT^train with a sample from N(β*_DLT^b′ b* + β*_DLT^i′ i* + β*_DLT^a′ a* + β*_DLT^x′ x^k, σ²_DLT / w_k)
    if trial k is the experimental arm of a Phase III trial then
        Append y_OS^CP with β*_OS^b′ b^k + β*_OS^i′ i^k + β*_OS^a′ a^k + β*_OS^x′ x^k
        Append y_OS^Mod with β*_OS^b′ b* + β*_OS^i′ i* + β*_OS^a′ a* + β*_OS^x′ x^k
        Append y_DLT^CP with 1{β*_DLT^b′ b^k + β*_DLT^i′ i^k + β*_DLT^a′ a^k + β*_DLT^x′ x^k ≥ 0.5}
        Append y_DLT^Mod with 1{β*_DLT^b′ b* + β*_DLT^i′ i* + β*_DLT^a′ a* + β*_DLT^x′ x^k ≥ 0.5}
    end if
end for
Output: y_OS^CP, y_OS^Mod, y_DLT^CP, and y_DLT^Mod

Figure C-1: Pseudocode of the simulation metric procedure.
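The heart of Figure C-1 is its simulated outcome generator: a weighted ridge model fit to a bootstrap resample of the database serves as ground truth, and each proposed arm's outcome is drawn from a normal distribution whose variance shrinks with the trial weight. The numpy sketch below illustrates just that piece; the data, ridge penalty, and shapes are synthetic stand-ins, not the thesis's actual pipeline.

```python
# Sketch of the simulation metric's outcome generator: a ridge model fit to
# a bootstrap resample acts as "ground truth", and a simulated outcome for a
# proposed arm is drawn from N(x' beta*, sigma^2 / w_k). Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
q, p, alpha = 200, 10, 1.0
X = rng.normal(size=(q, p))                      # stand-in trial features
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=q)
w = rng.uniform(0.5, 2.0, size=q)                # trial weights

boot = rng.integers(0, q, size=q)                # bootstrap resample of arms
Xb, yb, wb = X[boot], y[boot], w[boot]
W = np.diag(wb)
H = Xb.T @ W @ Xb + alpha * np.eye(p)
beta_star = np.linalg.solve(H, Xb.T @ (wb * yb)) # weighted ridge coefficients

f = np.trace(Xb @ np.linalg.solve(H, Xb.T @ W))  # ridge degrees of freedom
resid = yb - Xb @ beta_star
sigma2 = resid @ W @ resid / (q - f - 1)         # residual variance estimate

def simulate_outcome(x_k, w_k):
    """Draw one simulated outcome for a proposed trial arm."""
    return rng.normal(x_k @ beta_star, np.sqrt(sigma2 / w_k))
```

Dividing the variance by the trial weight mirrors the pseudocode: larger, more heavily weighted trials produce less noisy simulated outcomes.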

Input: Clinical trial database X (rows in chronological order) containing q arms (the first r := ⌈q/5⌉ − 1 will be used as a training set), corresponding trial weight vector w from Section 2.3.2, and corresponding Phase III study indicator p^III. Vectors of binary, instantaneous, and average dose variables and demographic/study characteristic variables for trial k are indicated by b^k, i^k, a^k, and x^k, respectively. The vector of drug class indicators [74] for trial k is indicated by c^k. Further input includes outcome vectors y_OS and y_DLT and parameter t.

P^III ← {b^k | 1 ≤ k ≤ r, p^III_k = 1}
y_OS^CP ← [];  y_OS^Mod ← [];  y_DLT^CP ← [];  y_DLT^Mod ← []
for k = r + 1 to q do
    if trial k is a control arm of a Phase III trial then
        P^III ← P^III ∪ {b^k}
    else if trial k is an experimental arm of a Phase III trial then
        X^train ← X_{1,...,k−1};  y_OS^train ← y_OS,{1,...,k−1};  y_DLT^train ← y_DLT,{1,...,k−1};  w^train ← w_{1,...,k−1}
        β̂_OS, β̂_DLT ← coefficients of ridge regression models with parameters selected via cross-validation (see Section 2.3), trained using X^train, y_OS^train, y_DLT^train, and w^train
        P ← ⋃_{j=1}^{k−1} {(b^j, i^j, a^j)}
        P̃ ← {(b, i, a) ∈ P ∖ P^III | β̂_DLT^b′ b + β̂_DLT^i′ i + β̂_DLT^a′ a + β̂_DLT^x′ x^k ≤ t}
        b*, i*, a* ← argmax_{(b,i,a)∈P̃} β̂_OS^b′ b + β̂_OS^i′ i + β̂_OS^a′ a + β̂_OS^x′ x^k, with ties broken randomly
        P^III ← P^III ∪ {b*}
        c* ← vector of drug class indicators [74] from b*
        F ← argmax_{k≤j≤q} 90 c*′ c^j + 9 b*′ b^j + b*′ [1{a*_1 = a^j_1 and i*_1 = i^j_1} ... 1{a*_n = a^j_n and i*_n = i^j_n}]
        Append y_OS^CP with y_OS,k and append y_DLT^CP with 1{y_DLT,k ≥ 0.5}
        Append y_OS^Mod with Σ_{j∈F} y_OS,j / |F| and append y_DLT^Mod with Σ_{j∈F} 1{y_DLT,j ≥ 0.5} / |F|
    end if
end for
Output: y_OS^CP, y_OS^Mod, y_DLT^CP, and y_DLT^Mod

Figure C-2: Pseudocode of the matching metric procedure.
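The matching step in Figure C-2 scores database arms against the model-selected regimen with lexicographically scaled weights: 90 per shared drug class, 9 per shared drug, and 1 per shared drug whose instantaneous and average dosing also match, so class agreement always dominates drug agreement, which dominates dose agreement. A sketch of that score (function name and array encoding are ours):

```python
# Sketch of the matching metric's similarity score between the selected
# regimen (*) and a database arm (j). Inputs are indicator/dose vectors:
# c = drug-class indicators, b = drug indicators, i/a = inst./avg. doses.
import numpy as np

def match_score(c_s, c_j, b_s, b_j, i_s, i_j, a_s, a_j):
    dose_match = (np.asarray(i_s) == np.asarray(i_j)) & \
                 (np.asarray(a_s) == np.asarray(a_j))
    return (90 * np.dot(c_s, c_j)          # shared drug classes
            + 9 * np.dot(b_s, b_j)         # shared drugs
            + np.dot(b_s, dose_match.astype(int)))  # shared drugs, same dosing
```

For instance, two identical two-drug arms in one shared class score 90 + 18 + 2 = 110.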

Appendix D

Data Preprocessing for Breast Cancer Database

D.1 Imputation of Demographic Data

As displayed in Table 3.1, each of the 10 demographic variables extracted from breast cancer clinical trials is missing from some study arms. Six of the demographic variables were imputed by setting all missing values equal to the average value across the training set:

∙ The 492 clinical trial arms with missing data for the proportion of patients with advanced, non-metastatic disease were imputed to mean value 0.01.

∙ The 837 clinical trial arms with missing data for the proportion of patients with visceral disease were imputed to mean value 0.58.

∙ The 1,693 clinical trial arms with missing data for the proportion of patients with HER2 overexpression were imputed to median value 0.21.

∙ The 402 clinical trial arms with missing data for the proportion of patients who are male were imputed to mean value 0.00.

∙ The 164 clinical trial arms with missing data for the median age of patients were imputed to mean value 56.

∙ The 1,048 clinical trial arms with missing data for the proportion of patients who are postmenopausal females were imputed to mean value 0.18.

Due to the similarity between the proportion of patients with prior chemotherapy and the proportion of patients with prior palliative chemotherapy, these two outcomes are jointly imputed. To do so, we trained models to predict the proportion of patients

with prior palliative chemotherapy in trial arm i, PP_i, using the proportion of patients with prior chemotherapy in trial arm i, P_i. We trained both a linear model of the form PP_i = βP_i + ε and a quadratic model of the form PP_i = βP_i² + ε, training both models with the 1,316 clinical trial arms reporting both proportions. We selected the quadratic model due to its superior R² value (0.83 vs. 0.74). We used the selected model to impute the proportion of patients with prior palliative chemotherapy in the 437 clinical trial arms that do not report that value but that do report the proportion of patients with prior chemotherapy. We used the same model to impute the proportion of patients with prior chemotherapy in the 223 clinical trial arms that do not report that value but that do report the proportion of patients with prior palliative chemotherapy. Finally, we impute the two proportions in the 121 clinical trial arms that report neither using the mean reported value, assigning a value of 0.33 for the proportion of patients with prior palliative chemotherapy and a value of 0.65 for the proportion of patients with prior chemotherapy.

To impute the proportion of patients with positive steroid hormone receptor (HR) status — which means estrogen receptor (ER) positivity, progesterone receptor (PR) positivity, or both — we use ER and PR status information if it is reported. In 118 clinical trial arms, the proportions of patients who were ER+, PR+, and HR+ were all reported; we use these clinical trial arms to train a number of models to be used for imputation. First, we impute the HR status in the clinical trial arms for which the ER and PR status are both reported. We trained both a linear model

of the form HR_i = β₀ + β₁ER_i + β₂PR_i + ε and a quadratic model of the form HR_i = β₀ + β₁ER_i² + β₂PR_i² + ε, selecting the linear model due to its superior R² value (0.85 vs. 0.77). We used this model to impute the proportion of patients with positive HR status from the 239 trial arms with ER and PR status reported but no

HR status reported. To impute when only ER status was reported, we trained both

a linear model of the form HR_i = β₀ + β₁ER_i + ε and a quadratic model of the form HR_i = β₀ + β₁ER_i² + ε, selecting the linear model due to its superior R² value (0.84 vs. 0.77). We used this model to impute the proportion of patients with positive HR status from the 462 trial arms with ER status reported but no HR status or PR status reported. To impute when only PR status was reported, we trained both a

linear model of the form HR_i = β₀ + β₁PR_i + ε and a quadratic model of the form HR_i = β₀ + β₁PR_i² + ε, selecting the linear model due to its superior R² value (0.70 vs. 0.59). We used this model to impute the proportion of patients with positive HR status from the 2 trial arms with PR status reported but no HR status or ER status reported. Finally, we assigned the median HR positivity proportion, 0.68, to the 891 trial arms with neither ER nor PR status reported.
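The linear-versus-quadratic selection used repeatedly in this appendix follows one recipe: fit both candidate forms and keep whichever has the higher in-sample R². The sketch below illustrates it for the no-intercept PP_i = βP_i form from the first paragraph of this section; the data are synthetic and the function names are ours.

```python
# Sketch of the imputation-model selection: fit a no-intercept linear model
# PP = b*P + e and quadratic model PP = b*P^2 + e, keep the higher-R^2 fit.
import numpy as np

def r_squared(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def select_model(p, pp):
    best_name, best_r2, best_fn = None, -np.inf, None
    for name, x in (("linear", p), ("quadratic", p ** 2)):
        b = (x @ pp) / (x @ x)          # least squares through the origin
        r2 = r_squared(pp, b * x)
        if r2 > best_r2:
            def fn(new, b=b, square=(name == "quadratic")):
                return b * (new ** 2 if square else new)
            best_name, best_r2, best_fn = name, r2, fn
    return best_name, best_fn

p = np.linspace(0.1, 1.0, 50)
name, impute = select_model(p, 0.9 * p ** 2)   # truly quadratic relationship
```

With this synthetic, truly quadratic relationship, the quadratic form wins and `impute` applies it to arms reporting only the predictor.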

When reporting ECOG performance status across a patient population, studies often report the proportion of patients who had a performance status of either 0 or 1, 푡. To enable computation of the average performance status of patients in these trials, we impute the proportion of the patients with performance status 0 or 1 who had performance status 0, 푝. Using the 948 clinical trial arms that reported both the proportion of patients with ECOG status 0 and the proportion with ECOG status 1,

we trained both a linear model of the form p_i = β₀ + β₁t_i + ε and a quadratic model of the form p_i = β₀ + β₁t_i² + ε, selecting the linear model due to its superior R² value (0.21 vs. 0.20). We used this model to impute the proportion of patients with ECOG status 0 and 1 in the 190 trials that only reported these values in a combined manner. Further, in the 34 clinical trial arms that only reported performance status using the Karnofsky Performance Status (KPS) scale, providing a complete breakdown of the proportion of patients with each score, we imputed the proportion of patients with each ECOG performance status using the mapping from [36]. Finally, we assigned the mean value of 0.70 in the 1,031 trial arms that did not fully report the average ECOG performance status even after these imputation steps.

D.2 Imputation of Toxicity Outcomes

The definition of DLT used in this chapter only includes grade 4 blood toxicities, but many clinical trials report the proportion of patients experiencing either a grade 3 or grade 4 blood toxicity, without breaking down that proportion into its constituent grades. As a result, we use imputation to obtain the proportion with grade 4, 푓, from the proportion with grade 3/4, 푡, for each of the five blood toxicities reported in breast cancer clinical trials — neutropenia, leukopenia, thrombocytopenia, anemia,

and lymphopenia. For each toxicity, we train both a linear model f_i = β₀ + β₁t_i + ε and a quadratic model f_i = β₀ + β₁t_i² + ε. Models were trained using 632, 592, 848, 708, and 22 trial arms, respectively. The linear models achieved R² values of 0.81, 0.64, 0.65, 0.30, and 0.81, respectively, while the quadratic models achieved R² values of 0.90, 0.73, 0.66, 0.30, and 0.85, respectively; the linear model was selected for imputation of the anemia toxicity outcomes and the quadratic models were selected for the remaining outcomes. We used the five models to impute outcomes in 204, 167, 178, 178, and 7 trial arms, respectively.

As in Chapter 2, we impute the proportion of patients with grade 4 neutropenia, n, from the proportion of patients with grade 4 leukopenia, l, due to the similarities between these two toxicity outcomes, discounting the proportion experiencing leukopenia in all further computations. To do so, we use the 312 trial arms that report both values to train a linear model, n_i = β₀ + β₁l_i + ε, and a quadratic model, n_i = β₀ + β₁l_i² + ε. We selected the quadratic model due to its superior R² value (0.72 vs. 0.72). We used this model to impute the proportion of patients with grade 4 neutropenia from the proportion of patients with grade 4 leukopenia in the 364 clinical trial arms that only reported the leukopenia value.

We use the “Grouped Independent” approach from Chapter 2 to estimate the proportion of patients in a clinical trial who experienced a DLT. For each of the 20 categories of toxicities defined in the NCI-CTCAE v3 toxicity reporting criteria [166], we compute the proportion of patients in a clinical trial experiencing a DLT in that category, 푡1, 푡2, . . . , 푡20. We then estimate the proportion of patients in the trial with

a DLT using the formula DLT = 1 − ∏_{i=1}^{20} (1 − t_i).
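A direct transcription of this estimate (the function name is ours): under the independence assumption, the probability of avoiding a DLT is the product of avoiding one in every category.

```python
# "Grouped Independent" DLT estimate: category-level DLT rates are combined
# as if toxicity categories occur independently.
def grouped_independent_dlt(category_rates):
    no_dlt = 1.0
    for t in category_rates:
        no_dlt *= 1.0 - t
    return 1.0 - no_dlt
```

For example, two categories with DLT rates 0.10 and 0.20 give an overall estimate of 1 − 0.9 × 0.8 = 0.28.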

Appendix E

Drugs Appearing in Breast Cancer Clinical Trial Database

Table E.1 lists all 237 drugs used at least once in the breast cancer clinical trial database.

9-nitro-camptothecin Acivicin Adecatumumab Adozelesin Afatinib Aflibercept Amonafide Angiozyme Anguidine AS1402 Axitinib BCG Bestrabucil Bevacizumab Bisantrene BMS-182248-01 Bortezomib Brequinar Bromocriptine Buserelin Cabergoline Capecitabine Carboplatin Carminomycin CB-3717 Cemadotin Centchroman Cetuximab CG-603 Chlorozotocin CI-941 CI-973 Ciclosporin Cisplatin Clomiphene

Corynebacterium parvum CT-2103 Cyclocytidine Cyclophosphamide Cytarabine Cytembena Cytomel Dasatinib DENSPM DES DFMO DHA Dianhydrogalactitol Didemnin B Didox Docetaxel Dolastatin-10 Doxifluridine Doxorubicin Echinomycin Edatrexate Elliptinium Eniluracil Enzastaurin Epirubicin Erlotinib Esorubicin Etanercept Ethinyl Etoposide Everolimus Fazarabine Fluorouracil Formestane Ganitumab Gefitinib Gemcitabine Gusperimus Imatinib Iniparib Interferon Iododoxorubicin Iproplatin Irinotecan Isotretinoin Ixabepilone KRN8602 L-alanosine Lanreotide Lapatinib LCM Leucovorin acetate Levamisole Levoleucovorin Liarozole Lymphoblastoid interferon Marimastat Maytansine Medroxyprogesterone acetate Melatonin Menogaril Methenolone enanthate Methotrexate

methyl-CCNU methyl-GAG Mitolactol Mitoxantrone Motesanib Nab-paclitaxel Nafazatrom decanoate Neratinib Non-pegylated liposomal doxorubicin Novobiocin Octreotide Octreotide pamoate Orantinib Orathecin Oxaliplatin Oxylone acetate Paclitaxel PALA Pazopanib PCNU Pegylated liposomal doxorubicin Pemetrexed Peplomycin Peptichemio Perifosine Perillyl Pertuzumab Piperazinedione Pirarubicin Piritrexim Piroxantrone Prednisone Premarin Pyrazofurin Pyrazoloacridine Raltitrexed Razoxane Rebeccamycin Recombinant interleukin-2 Recombinant tumor necrosis factor Rubidazone S-1 Sagopilone Saracatinib Sorafenib Spirogermanium Streptozocin Sulofenur Sunitinib Suramin Tanespimycin TAS-108 Tauromustine Temsirolimus Tesmilifene Testololactone propionate Thioguanine THP-adriamycin Tibilone Topotecan Trastuzumab Trastuzumab emtansine Triazinate Triciribine

Trifluoperazine Trimetrexate mesylate Triptorelin UFT Valspodar Vandetanib Vapreotide Vincristine Vindesine Vinfosiltine Vinorelbine Vorozole VX-710 ZD-0473

Table E.1: All drugs tested in at least one clinical trial arm in the breast cancer clinical trial database.

Appendix F

Literature Review Details

We performed a literature review to identify studies presenting mathematical models that can evaluate PSA-based prostate cancer screening strategies. We included mathematical models that were published in 2010 or later and that can evaluate the lifetime effectiveness of screening strategies with different PSA screening frequencies and age-specific PSA levels that trigger a confirmatory biopsy. Lifetime effectiveness could be measured in terms of life expectancy gained or quality-adjusted life expectancy (QALE) gained compared to not screening. Included articles were limited to English-language manuscripts and decision analyses for a general population of screening-age men. In October 2013, we searched PubMed using the following query:

("Early Detection of Cancer"[Mesh] OR "early diagnosis"[Mesh] OR "Prostatic Neoplasms/Diagnosis"[Mesh] OR "Mass Screening"[Mesh] OR screening) AND ("decision analysis" OR "decision analyses" OR "Decision Support Techniques"[Mesh] OR "Decision Trees"[Mesh] OR "decision trees" OR "Cost-Benefit Analysis"[Mesh] OR "cost-benefit analysis" OR "Markov Chains"[Mesh] OR "Computer Simulation"[Mesh] OR "computer simulation" OR simulate OR simulation[all fields] OR simulating OR "Monte Carlo Method"[Mesh] OR "monte carlo method" OR markov)

AND ("prostate cancer" OR "Prostatic Neoplasms"[Majr])

Figure F-1: Literature review to identify decision analyses. Flow of the review: abstracts reviewed (n = 311); articles reviewed (n = 29); articles excluded (n = 23), comprising not a decision analysis for PSA-based screening (n = 9), not for a general population (n = 2), cannot evaluate age-specific PSA cutoffs (n = 7), cannot evaluate varying screening intervals (n = 2), and no lifetime outcomes (n = 3); articles included (n = 6), describing 3 decision analyses.

A single reviewer evaluated 311 abstracts and 29 full-text articles, identifying six papers describing three decision analyses as appropriate for inclusion (see Figure F-1). In December 2014, we additionally searched the Tufts Cost-Effectiveness Analysis Registry for the term “prostate,” which returned 68 search results. Three results were retrieved, but none met the inclusion criteria without having already been identified in the PubMed search.

Appendix G

Ranges Used in Sensitivity Analysis

To perform sensitivity analysis, we varied parameters based on sensitivity ranges used in the papers describing each model. For the model from [167], we defined sensitivity ranges as in that paper:

∙ 푑푡: The rate of other-cause (non-prostate cancer) mortality at age 푡 was varied ±20% from the base-case parameter values from [16, 98].

∙ 푤푡: The prostate cancer incidence rate for a man at age 푡 was varied using sensitivity ranges from [35]. For patients aged 40–49, the sensitivity range was defined as [0.00020, 0.00501]; for patients aged 50–59, the sensitivity range was defined as [0.00151, 0.00491]; for patients aged 60–69, the sensitivity range was defined as [0.00243, 0.00852]; for patients aged 70–79, the sensitivity range was defined as [0.00522, 0.01510]; and for patients aged 80 or more, the sensitivity range was defined as [0.00712, 0.01100].

∙ 푏푡: The annual probability of metastasis among patients with detected cancer treated with radical prostatectomy was varied ±20% from the base-case parameter value of 0.006 derived from the Mayo Clinic Radical Prostatectomy Repository.

∙ 푒푡: The annual probability of metastasis among patients with undetected cancer was varied ±20% from the base-case parameter value of 0.069 from [72, 150].

∙ 푧푡: The annual probability of dying from prostate cancer among men aged 푡 with metastatic disease was varied using the sensitivity range [0.07, 0.37] from [126, 17] around the base-case values of 0.074 for patients aged 40–64 and 0.070 for patients aged 75 and older [98].

∙ 푓: The probability of a biopsy detecting cancer in a patient with prostate cancer was varied ±20% from its base-case value of 0.8 from [83].

The sensitivity analyses in [179] varied QALE decrements derived from the litera- ture and cost parameters; as a result, none of the parameters varied in that sensitivity analysis are included in this work. Given the similarities of the model in that paper

to the model from [167], we used identical sensitivity ranges for the 푑푡, 푤푡, and 푓 parameters, additionally varying the following parameters:

∙ 푏푡: The annual probability of a man of age 푡 with detected prostate cancer treated with radical prostatectomy dying of the disease was varied ±20% from its base-case value of 0.0067 for men aged 40–64 and 0.0092 for men aged 65 and older [98].

∙ 푒푡: The annual probability of a man of age 푡 with undetected prostate cancer dying of the disease was varied ±20% from its base-case value of 0.033 from [12].

The model from [78] performs sensitivity analysis by providing 100 sets of the following parameters that were fitted to calibrate the model to observed population health outcomes:

∙ grade.onset.rate: A rate controlling how quickly patients experience prostate cancer onset.

∙ grade.metastasis.rate: A rate controlling how quickly patients with undetected prostate cancer experience metastasis.

∙ grade.clinical.rate.baseline: A rate controlling how quickly a patient’s cancer is clinically detected.

∙ grade.clinical.rate.distant: A rate controlling how quickly a patient’s cancer is clinically detected after metastasis.

∙ low.grade.slope: A parameter controlling the likelihood that a patient who de- veloped cancer has a low-grade cancer.

In the one-way sensitivity analysis performed in this work, we varied each of these five parameters to the maximum and minimum value in the 100 sets of sensitivity parameters used in [78]. Each model in this work uses the QALE definition from [88], and we use the sensitivity ranges for the parameters defined in that work:

∙ Screening attendance: The utility estimate for the week following screening was varied in range [0.99, 1.00] from base estimate 0.99.

∙ Biopsy: The utility estimate for the three weeks following biopsy was varied in range [0.87, 0.94] from base estimate 0.90.

∙ Cancer diagnosis: The utility estimate for the month following cancer diagnosis was varied in range [0.75, 0.85] from base estimate 0.80.

∙ Radiation therapy: The utility estimate for the first two months after radiation therapy was varied in range [0.71, 0.91] from base estimate 0.73, and the utility estimate for the next 10 months after radiation therapy was varied in range [0.61, 0.88] from base estimate 0.78.

∙ Radical prostatectomy: The utility estimate for the first two months after radical prostatectomy was varied in range [0.56, 0.90] from base estimate 0.67, and the utility estimate for the next 10 months after radical prostatectomy was varied in range [0.70, 0.91] from base estimate 0.77.

∙ Active surveillance: The utility estimate for the first seven years of active surveillance was varied in range [0.85, 1.00] from base estimate 0.97.

∙ Postrecovery period: The utility estimate for years 1–10 following radical prostatectomy or radiation therapy was varied in range [0.93, 1.00] from base estimate 0.95.

∙ Palliative therapy: The utility estimate during 30 months of palliative therapy was varied in range [0.24, 0.86] from base estimate 0.60.

∙ Terminal illness: The utility estimate during six months of terminal illness was varied in range [0.24, 0.40] from base estimate 0.40.
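All of the one-way analyses in this appendix share the same mechanics: each parameter is moved to its low and high value while the rest stay at base case. A generic sketch follows; the `evaluate` function and parameter names are placeholders, not any specific model above.

```python
# Generic one-way sensitivity sweep: re-evaluate a model with each parameter
# pushed to its low and high value while all other parameters stay at base.
def one_way_sensitivity(evaluate, base, ranges):
    results = {}
    for name, (lo, hi) in ranges.items():
        for label, value in (("low", lo), ("high", hi)):
            results[(name, label)] = evaluate({**base, name: value})
    return results
```

For five parameters this yields ten model evaluations, one per extreme of each sensitivity range.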

Appendix H

Building an Efficient Frontier of Screening Strategies

Given a screening strategy s, let A(s) be the average assessment of the strategy across all mathematical models and let P(s) be the pessimistic assessment of the strategy across all mathematical models. To construct an efficient frontier of strategies trading off the average and most pessimistic assessment, we use mathematical optimization to maximize the objective function λA(s) + (1 − λ)P(s) for λ ∈ {0, 0.1, 0.2, ..., 1.0} over annual screening strategies and over biennial screening strategies, optimizing a total of 22 times. From the set of all screening strategies encountered during the optimization process (not just the final values identified through optimization), we construct an efficient frontier trading off the average and pessimistic assessments.
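Once the 22 optimization runs have produced a pool of candidate strategies, extracting the non-dominated set is mechanical. A sketch in which each strategy is reduced to its (A(s), P(s)) pair (function name ours):

```python
# Extract the efficient frontier from (average, pessimistic) assessment pairs;
# both objectives are maximized, so a pair is kept only if no other pair is
# at least as good on both and strictly better on one.
def efficient_frontier(assessments):
    frontier, best_pessimistic = [], float("-inf")
    for avg, pess in sorted(assessments, reverse=True):  # by average, descending
        if pess > best_pessimistic:                      # improves pessimistic value
            frontier.append((avg, pess))
            best_pessimistic = pess
    return frontier
```

Sorting by the average assessment and keeping only points that improve the pessimistic assessment yields the frontier in a single pass.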

The key step in constructing the efficient frontier is solving max_{s∈S} λA(s) + (1 − λ)P(s), where S is the set of all feasible screening strategies. We consider strategies with age-specific PSA cutoffs limited to 0.5, 1.0, 1.5, ..., 6.0 ng/mL, fixed cutoffs within 5-year age ranges, and cutoffs that are non-decreasing in a patient’s age. We consider screening from ages 40 through 99, so there are 5.4 × 10^6 possible screening strategies; as a result, it would be time-consuming to use enumeration to identify the strategy with the highest average incremental QALE compared to not screening. Instead, we use constrained iterated local search to identify a locally optimal strategy that cannot be improved by changing a single age-specific PSA threshold.

The central step in the iterated local search is the local search, which changes the PSA cutoff to the best possible value for a single age range, adjusting others to ensure that PSA cutoffs are non-decreasing in age. Consider a screening strategy for which annual screening is performed for ages 45–69, with cutoff 2.0 ng/mL for ages 45–49, 3.0 ng/mL for ages 50–54, and 5.0 ng/mL for ages 55–59, 60–64, and 65–69. We can write this screening strategy compactly as (2, 3, 5, 5, 5), with each value in the vector representing the cutoff for a 5-year period. If we apply local search to the cutoff for ages 55–59, then we will consider changing the cutoff for that age range to each value in {0.5, 1.0, 1.5, ..., 6.0} ng/mL, adjusting other cutoffs as necessary to ensure all cutoffs are non-decreasing in age. For instance, if the cutoff for ages 55–59 were set to 2.5 ng/mL, then the cutoff for ages 50–54 would also need to be decreased to 2.5 ng/mL in order to maintain non-decreasing cutoffs, yielding final screening strategy (2, 2.5, 2.5, 5, 5). The set of all possible screening strategies considered by a local search on age range 55–59 is:

(0.5, 0.5, 0.5, 5, 5) (1, 1, 1, 5, 5) (1.5, 1.5, 1.5, 5, 5) (2, 2, 2, 5, 5) (2, 2.5, 2.5, 5, 5) (2, 3, 3, 5, 5) (2, 3, 3.5, 5, 5) (2, 3, 4, 5, 5) (2, 3, 4.5, 5, 5) (2, 3, 5, 5, 5) (2, 3, 5.5, 5.5, 5.5) (2, 3, 6, 6, 6)
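This neighborhood can be generated mechanically; the sketch below reproduces the twelve strategies listed above, representing a strategy as a tuple of per-age-range cutoffs as in the text (function name ours).

```python
# Local-search neighborhood for one age range: try every candidate cutoff,
# then clip earlier ranges down and later ranges up so cutoffs remain
# non-decreasing in age.
CUTOFFS = [0.5 * k for k in range(1, 13)]        # 0.5, 1.0, ..., 6.0 ng/mL

def neighborhood(strategy, idx):
    out = []
    for v in CUTOFFS:
        s = list(strategy)
        s[idx] = v
        for j in range(idx - 1, -1, -1):         # earlier cutoffs must be <= v
            s[j] = min(s[j], v)
        for j in range(idx + 1, len(s)):         # later cutoffs must not decrease
            s[j] = max(s[j], s[j - 1])
        out.append(tuple(s))
    return out
```

Calling `neighborhood((2, 3, 5, 5, 5), 2)` yields exactly the twelve candidate strategies enumerated above, including (2, 2.5, 2.5, 5, 5) and (2, 3, 5.5, 5.5, 5.5).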

Among these strategies, the one resulting in the largest objective value 휆퐴(푠) + (1 − 휆)푃 (푠) is the one selected by the local search. The iterated local search begins with a strategy of never screening for prostate

cancer. The procedure repeatedly loops through a random permutation of the age ranges, performing local search on an age range if it is within 5 years of an age range for which the current strategy screens with PSA. The procedure terminates when no single age-specific threshold can be changed to improve the overall quality of the screening strategy.

Bibliography

[1] Abdulmuthalib, Idral Darwis, Nugroho Prayogo, and Sutjipto. First-line chemotherapy of advanced or metastatic breast cancer (MBC) with docetaxel and doxorubicin in Indonesia: results from a phase II trial. Medical Journal of Indonesia, 14(1):20–25, 2005.

[2] V. Adamo, V. Lorusso, R. Rossello, B. Adamo, G. Ferraro, D. Lorusso, G. Condemi, D. Priolo, L. Di Lullo, A. Paglia, S. Pisconti, G. Scambia, and G. Ferrandina. Pegylated liposomal doxorubicin and gemcitabine in the front-line treatment of recurrent/metastatic breast cancer: a multicentre phase II study. British Journal of Cancer, 98(12):1916–1921, 2008.

[3] S. Agelaki, E. Karyda, Ch. Kouroussis, A. Ardavanis, K. Kalbakis, K. Malas, N. Malamos, A. Alexopoulos, E. Tselepatiotis, and V. Georgoulias. Gemcitabine plus irinotecan in breast cancer patients pretreated with taxanes and anthracyclines: A multicenter phase II study. Oncology, 64(4):477–478, 2003.

[4] Biagio Agostara, Vittorio Gebbia, Antonio Testa, Maria Pia Cusimano, Nicolò Gebbia, and Angela Maria Callari. Mitomycin C and vinorelbine as second line chemotherapy for metastatic breast carcinoma. Tumori, 80(1):33–36, 1994.

[5] Zvia Agur, Refael Hassin, and Sigal Levy. Optimizing chemotherapy scheduling using local search heuristics. Operations Research, 54(5):829–846, 2006.

[6] Jin-Hee Ahn, Sung-Bae Kim, Tae-Won Kim, Sei-Hyun Ahn, Sun-Mi Kim, Jeong-Mi Park, Jung-Shin Lee, Yoon-Koo Kang, and Woo Kun Kim. Capecitabine and vinorelbine in patients with metastatic breast cancer previously treated with anthracycline and taxane. Journal of Korean Medical Science, 19(4):547–553, 2004.

[7] Vishal Ahuja and John Birge. Response-adaptive designs for clinical trials: Simultaneous learning from multiple patients. European Journal of Operational Research, 2015. To appear.

[8] T. Aihara, Y. Takatsuka, K. Itoh, Y. Sasaki, N. Katsumata, T. Watanabe, S. Noguchi, N. Horikoshi, T. Tabei, H. Sonoo, S. Huraki, and H. Inaji. Phase II study of concurrent administration of doxorubicin and docetaxel as first-line chemotherapy for metastatic breast cancer. Oncology, 64(2):124–130, 2003.

[9] M. Airoldi, L. Cattel, R. Passera, F. Pedani, L. Delprino, and C. Micari. Gemcitabine and oxaliplatin in patients with metastatic breast cancer resistant to or pretreated with both anthracyclines and taxanes. American Journal of Clinical Oncology, 29(5):490–494, 2006.

[10] M. Airoldi, F. Pedani, S. Marchionatti, C. Bumma, L. Cattel, V. Tagini, and V. Recalenda. Clinical data and pharmacokinetics of a docetaxel-vinorelbine combination in anthracycline resistant/relapsed metastatic breast cancer. Acta Oncologica, 42(3):186–194, 2003.

[11] Jaffer A. Ajani, Wuilbert Rodriguez, Gyorgy Bodoky, Vladimir Moiseyenko, Mikhail Lichinitser, Vera Gorbunova, Ihor Vynnychenko, August Garin, Istvan Lang, and Silvia Falcon. Multicenter phase III comparison of cisplatin/S-1 with cisplatin/infusional fluorouracil in advanced gastric or gastroesophageal adenocarcinoma study: The FLAGS trial. Journal of Clinical Oncology, 28(9):1547–1553, 2010.

[12] Peter Albertsen, James Hanley, and Judith Fine. 20-year outcomes following conservative management of clinically localized prostate cancer. Journal of the American Medical Association, 293(17):2095–2101, 2005.

[13] A. Alexopoulos, D. Tryfonopoulos, M. V. Karamouzis, G. Gerasimidis, I. Karydas, K. Kandilis, J. Stavrakakis, H. Stavrinides, C. Georganta, A. Ardavanis, and G. Rigatos. Evidence for in vivo synergism between docetaxel and gemcitabine in patients with metastatic breast cancer. Annals of Oncology, 15(1):95–99, 2004.

[14] E. Andreopoulou, D. Gaiotti, E. Kim, M. Volm, R. Oratz, R. Freedberg, A. Downey, C. Vogel, S. Chia, and F. Muggia. Feasibility and cardiac safety of pegylated liposomal doxorubicin plus trastuzumab in heavily pretreated patients with recurrent HER2-overexpressing metastatic breast cancer. Clinical Breast Cancer, 7(9):690–696, 2007.

[15] Gerald L. Andriole, E. David Crawford, Robert L. Grubb, Saundra S. Buys, David Chia, Timothy R. Church, Mona N. Fouad, Edward P. Gelmann, Paul A. Kvale, Douglas J. Reding, Joel L. Weissfeld, Lance A. Yokochi, Barbara O’Brien, Jonathan D. Clapp, Joshua M. Rathmell, Thomas L. Riley, Richard B. Hayes, Barnett S. Kramer, Grant Izmirlian, Anthony B. Miller, Paul F. Pinsky, Philip C. Prorok, John K. Gohagan, and Christine D. Berg. Mortality results from a randomized prostate-cancer screening trial. New England Journal of Medicine, 360(13):1310–1319, 2009.

[16] E. Arias. United States life tables, 2006. National Vital Statistics Reports, 58(21):1–40, 2010.

[17] Gunnar Aus, David Robinson, Johan Rosell, Gabriel Sandblom, and Eberhard Varenhorst. Survival in prostate carcinoma—outcomes from a prospective, population-based cohort of 8887 men with up to 15 years of follow-up. Cancer, 103(5):943–951, 2005.

[18] A. Awada, J. Albanell, P.A. Canney, L.Y. Dirix, T. Gil, F. Cardoso, P. Gascon, M.J. Piccart, and J. Baselga. Bortezomib/docetaxel combination therapy in patients with anthracycline-pretreated advanced/metastatic breast cancer: a phase I/II dose-escalation study. British Journal of Cancer, 98(9):1500–1507, 2008.

[19] Turgay Ayer, Oguzhan Alagoz, and Natasha K. Stout. OR Forum — a POMDP approach to personalize mammography screening decisions. Operations Research, 60(5):1019–1034, 2012.

[20] Rose D. Baker. Use of a mathematical model to evaluate breast cancer screening policy. Health Care Management Science, 1(2):103–113, 1998.

[21] Yung-Jue Bang, Eric Van Cutsem, Andrea Feyereislova, Hyun C. Chung, Lin Shen, Akira Sawaki, Florian Lordick, Atsushi Ohtsu, Yasushi Omuro, Taroh Satoh, Giuseppe Aprile, Evgeny Kulikov, Julie Hill, Michaela Lehle, Josef Rüschoff, and Yoon-Koo Kang. Trastuzumab in combination with chemotherapy versus chemotherapy alone for treatment of HER2-positive advanced gastric or gastro-oesophageal junction cancer (ToGA): a phase 3, open-label, randomised controlled trial. Lancet, 376:687–697, 2010.

[22] Ethan Basch, Thomas K. Oliver, Andrew Vickers, Ian Thompson, Philip Kantoff, Howard Parnes, D. Andrew Loblaw, Bruce Roth, James Williams, and Robert K. Nam. Screening for prostate cancer with prostate-specific antigen testing: American Society of Clinical Oncology provisional clinical opinion. Journal of Clinical Oncology, 30(24):3020–3025, 2012.

[23] J.M. Bates and C.W.J. Granger. The combination of forecasts. Operational Research Quarterly, 20(4):451–468, 1969.

[24] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, 2009.

[25] James O. Berger and Luis R. Pericchi. The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122, 1996.

[26] R.N. Bergman, Y.Z. Ider, C.R. Bowden, and C. Cobelli. Quantitative estimation of insulin sensitivity. American Journal of Physiology - Endocrinology and Metabolism, 236(6):E667–E677, 1979.

[27] D.A. Berry. Modified two-armed bandit strategies for certain clinical trials. Journal of the American Statistical Association, 73(362):339–345, 1978.

[28] Dimitris Bertsimas, Mac Johnson, and Nathan Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63(4):868–876, 2015.

[29] John R. Birge and François Louveaux. Introduction to Stochastic Programming, volume 2. Springer, 2011.

[30] D.E. Bloom, E.T. Cafiero, E. Jané-Llopis, S. Abrahams-Gessel, L.R. Bloom, S. Fathima, A.B. Feigl, T. Gaziano, M. Mowafi, A. Pandya, K. Prettner, L. Rosenberg, B. Seligman, A.Z. Stein, and C. Weinstein. The global economic burden of noncommunicable diseases. Technical report, World Economic Forum, Geneva, 2011.

[31] Laura Bojke, Karl Claxton, Mark Sculpher, and Stephen Palmer. Characterizing structural uncertainty in decision analytic models: A review and application of methods. Value in Health, 12(5):739–749, 2009.

[32] Thomas Bortfeld, Timothy C. Y. Chan, Alexei Trofimov, and John N. Tsitsiklis. Robust management of motion uncertainty in intensity-modulated radiation therapy. Operations Research, 56(6):1461–1473, 2008.

[33] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM.

[34] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[35] Lukas Bubendorf, Alain Schöpfer, Urs Wagner, Guido Sauter, Holger Moch, Niels Willi, Thomas Gasser, and Michael Mihatsch. Metastatic patterns of prostate cancer: An autopsy study of 1,589 patients. Human Pathology, 31(5):578–583, 2000.

[36] G. Buccheri, D. Ferrigno, and M. Tamburini. Karnofsky and ECOG performance status scoring in lung cancer: A prospective, longitudinal study of 536 patients from a single institution. European Journal of Cancer, 32(7):1135–1141, 1996.

[37] H.B. Burke. Artificial neural networks improve the accuracy of cancer survival prediction. Cancer, 79(4):857–862, 1997.

[38] Adele Caldarella, Emanuele Crocetti, Simonetta Bianchi, Vania Vezzosi, Carmelo Urso, Mauro Biancalani, and Marco Zappa. Female breast cancer status according to ER, PR and HER2 expression: A population based analysis. Pathology & Oncology Research, 17(3):753–758, 2011.

[39] Angelo Canty and Brian Ripley. boot: Bootstrap R (S-Plus) Functions, 2014. R package version 1.3-13.

[40] Sue Carrick, Sharon Parker, Charlene E. Thornton, Davina Ghersi, John Simes, and Nicholas Wilcken. Single agent versus combination chemotherapy for metastatic breast cancer. Cochrane Database of Systematic Reviews, 2:CD003372, 2009.

[41] H. Ballentine Carter, Peter C. Albertsen, Michael J. Barry, Ruth Etzioni, Stephen J. Freedland, Kirsten Lynn Greene, Lars Holmberg, Philip Kantoff, Badrinath R. Konety, Mohammad Hassan Murad, David F. Penson, and Anthony L. Zietman. Early detection of prostate cancer: AUA guideline. The Journal of Urology, 190(2):419–426, 2013.

[42] Timothy C. Y. Chan, Tim Craig, Taewoo Lee, and Michael B. Sharpe. Generalized inverse multiobjective optimization with application to cancer therapy. Operations Research, 62(3):680–695, 2014.

[43] Y. Chao, C.P. Li, T.Y. Chao, W.C. Su, R.K. Hsieh, M.F. Wu, K.H. Yeh, W.Y. Kao, L.T. Chen, and A.L. Cheng. An open, multi-centre, phase II clinical trial to evaluate the efficacy and safety of paclitaxel, UFT, and leucovorin in patients with advanced gastric cancer. British Journal of Cancer, 95(2):159–163, 2006.

[44] Chris Chatfield. Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society, Series A, 158(3):419–466, 1995.

[45] Ting-Chao Chou. Theoretical basis, experimental design, and computerized simulation of synergism and antagonism in drug combination studies. Pharmacological Reviews, 58(3):621–681, 2006.

[46] Robert T. Clemen. Combining forecasts: a review and annotated bibliography. International Journal of Forecasting, 5(4):559–583, 1989.

[47] Robert T. Clemen and Christopher J. Lacke. Analysis of colorectal cancer screening regimens. Health Care Management Science, 4(4):257–267, 2001.

[48] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[49] Ide Cremin, Ramzi Alsallaq, Mark Dybul, Peter Piot, Geoffrey Garnett, and Timothy B. Hallett. The new role of antiretrovirals in combination HIV prevention: a mathematical modelling analysis. AIDS, 27(3):447–458, 2013.

[50] Rebecca M. D'Amato, Richard T. D'Aquila, and Lawrence M. Wein. Management of antiretroviral therapy for HIV infection: modelling when to change therapy. Antiviral Therapy, 3(3):147–158, 1998.

[51] A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge, 1997.

[52] Shaheenah Dawood, Kristine Broglio, Aman U. Buzdar, Gabriel N. Hortobagyi, and Sharon H. Giordano. Prognosis of women with metastatic breast cancer by HER2 status and trastuzumab treatment: An institutional-based review. Journal of Clinical Oncology, 28(1):92–98, 2010.

[53] Harry J. de Koning, Rafael Meza, Sylvia K. Plevritis, Kevin ten Haaf, Vidit N. Munshi, Jihyoun Jeon, Saadet Ayca Erdogan, Chung Yin Kong, Summer S. Han, Joost van Rosmalen, Sung Eun Choi, Paul F. Pinsky, Amy Berrington de Gonzalez, Christine D. Berg, William C. Black, Martin C. Tammemägi, William D. Hazelton, Eric J. Feuer, and Pamela M. McMahon. Benefits and harms of computed tomography lung cancer screening strategies: A comparative modeling study for the U.S. Preventive Services Task Force. Annals of Internal Medicine, 160(5):311–320, 2014.

[54] Filip De Ridder. Predicting the outcome of phase III trials using phase II data: A case study of clinical trial simulation in late stage drug development. Basic and Clinical Pharmacology and Toxicology, 96:235–241, 2005.

[55] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. AAAI-98 Proceedings, pages 761–768, 1998.

[56] Catherine Delbaldo, Stefan Michiels, Nathalie Syz, Jean-Charles Soria, Thierry Le Chevalier, and Jean-Pierre Pignon. Benefits of adding a drug to a single-agent or a 2-agent chemotherapy regimen in advanced non-small-cell lung cancer: A meta-analysis. Journal of the American Medical Association, 292(4):470–484, 2004.

[57] Dursun Delen, Glenn Walker, and Amit Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine, 34(2):113–127, 2005.

[58] M. Derouich and A. Boutayeb. The effect of physical exercise on the dynamics of glucose and insulin. Journal of Biomechanics, 35(7):911–917, 2002.

[59] P. Doubilet, C.B. Begg, M.C. Weinstein, P. Braun, and B.J. McNeil. Probabilistic sensitivity analysis using Monte Carlo simulation. A practical approach. Medical Decision Making, 5(2):157–177, 1984.

[60] Gerrit Draisma, Ruth Etzioni, Alex Tsodikov, Angela Mariotto, Elisabeth Wever, Roman Gulati, Eric Feuer, and Harry de Koning. Lead time and overdiagnosis in prostate-specific antigen screening: Importance of methods and context. Journal of the National Cancer Institute, 101(6):374–383, 2009.

[61] Lisa K Dunnwald, Mary Anne Rossing, and Christopher I Li. Hormone receptor status, tumor characteristics, and prognosis: a prospective cohort of breast cancer patients. Breast Cancer Research, 9(1):R6, 2007.

[62] P.J. Easterbrook, R. Gopalan, J.A. Berlin, and D.R. Matthews. Publication bias in clinical research. The Lancet, 337(8746):867–872, 1991.

[63] Thomas Efferth and Manfred Volm. Pharmacogenetics for individualized cancer chemotherapy. Pharmacology and Therapeutics, 107:155–176, 2005.

[64] R. Etzioni, A. Tsodikov, A. Mariotto, A. Szabo, S. Falcon, J. Wegelin, D. DiTommaso, K. Karnofski, R. Gulati, D. F. Penson, and E. Feuer. Quantifying the role of PSA screening in the US prostate cancer mortality decline. Cancer Causes and Control, 19(2):175–181, 2008.

[65] Ruth Etzioni and Roman Gulati. Response: Reading between the lines of cancer screening trials: Using modeling to understand the evidence. Medical Care, 51(4):304–306, 2013.

[66] Ruth D. Etzioni, Nadia Howlander, Pamela A. Shaw, Donna P. Ankerst, David F. Penson, Phyllis J. Goodman, and Ian M. Thompson. Long-term effects of finasteride on prostate specific antigen levels: results from the prostate cancer prevention trial. Journal of Urology, 174(3):877–881, 2005.

[67] R. Fossati, C. Confalonieri, V. Torri, E. Ghislandi, A. Penna, V. Pistotti, A. Tinazzi, and A. Liberati. Cytotoxic and hormonal treatment for metastatic breast cancer: A systematic review of published randomized trials involving 31,510 women. Journal of Clinical Oncology, 16(10):3439–3460, 1998.

[68] Peter I. Frazier, Warren B. Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.

[69] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[70] Lawrence M. Friedman, Curt D. Furberg, and David L. DeMets. Fundamentals of Clinical Trials, volume 4. Springer, 2010.

[71] Andrea De Gaetano and Ovide Arino. Mathematical modelling of the intravenous glucose tolerance test. Journal of Mathematical Biology, 40(2):136–168, 2000.

[72] Khurshid R. Ghani, Ken Grigor, David N. Tulloch, Prasad R. Bollina, and S. Alan McNeill. Trends in reporting Gleason score 1991 to 2001: changes in the pathologist’s practice. European Urology, 47(2):196–201, 2005.

[73] John Gittins, Kevin Glazebrook, and Richard Weber. Multi-Armed Bandit Allocation Indices, volume 2. Wiley, 2011.

[74] David E. Golan, Armen H. Tashjian Jr., Ehrin J. Armstrong, and April W. Armstrong, editors. Principles of Pharmacology: The Pathophysiologic Basis of Drug Therapy. Lippincott Williams and Wilkins, second edition, 2008.

[75] Susan T. Goldstein, Fangjun Zhou, Stephen C. Hadler, Beth P. Bell, Eric E. Mast, and Harold S. Margolis. A mathematical model to estimate global hepatitis B disease burden and vaccination impact. International Journal of Epidemiology, 34(6):1329–1339, 2005.

[76] Reuben M. Granich, Charles F. Gilks, Christopher Dye, Kevin M. De Cock, and Brian G. Williams. Universal voluntary HIV testing with immediate antiretroviral therapy as a strategy for elimination of HIV transmission: a mathematical model. The Lancet, 373(9657):48–57, 2009.

[77] The CDC Diabetes Cost-Effectiveness Study Group. The cost-effectiveness of screening for type 2 diabetes. The Journal of the American Medical Association, 280(20):1998, 1998.

[78] Roman Gulati, John L. Gore, and Ruth Etzioni. Comparative effectiveness of alternative PSA-based prostate cancer screening strategies. Annals of Internal Medicine, 158(3):145–153, 2013.

[79] Roman Gulati, Lurdes Inoue, Jeffrey Katcher, William Hazelton, and Ruth Etzioni. Calibrating disease progression models using population data: a critical precursor to policy development in cancer control. Biostatistics, 11(4):707–719, 2010.

[80] Roman Gulati, Alex Tsodikov, Elisabeth M. Wever, Angela B. Mariotto, Eveline A. M. Heijnsdijk, Jeffrey Katcher, Harry J. de Koning, and Ruth Etzioni. The impact of PLCO control arm contamination on perceived PSA screening efficacy. Cancer Causes and Control, 23(6):827–835, 2012.

[81] Evrim D. Güneş, Stephen E. Chick, and O. Zeynep Akşin. Breast cancer screening services: Trade-offs in quality, capacity, outreach, and centralization. Health Care Management Science, 7(4):291–303, 2004.

[82] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015.

[83] Gabriel Haas, Nicolas Delongchamps, Richard Jones, Vishal Chandan, Angel Serio, Andrew Vickers, Mary Jumbelic, Gregory Threatte, Rus Korets, Hans Lilja, and Gustavo de la Roza. Needle biopsies on autopsy prostates: Sensitivity of cancer detection based on true prevalence. Journal of the National Cancer Institute, 99(19):1484–1489, 2007.

[84] J.D. Hainsworth and F.A. Greco. Paclitaxel administered by 1-hour infusion. Preliminary results of a phase I/II trial comparing two schedules. Cancer, 74(4):1377–1382, 1994.

[85] P. R. Harper and S. K. Jones. Mathematical models for the early detection and treatment of colorectal cancer. Health Care Management Science, 8(2):101–109, 2005.

[86] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). Springer-Verlag, 2008.

[87] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.

[88] Eveline A.M. Heijnsdijk, Elisabeth M. Wever, Anssi Auvinen, Jonas Hugosson, Stefano Ciatto, Vera Nelen, Maciej Kwiatkowski, Arnauld Villers, Alvaro Páez, Sue M. Moss, Marco Zappa, Teuvo L.J. Tammela, Tuukka Mäkinen, Sigrid Carlsson, Ida J. Korfage, Marie-Louise Essink-Bot, Suzie J. Otto, Gerrit Draisma, Chris H. Bangma, Monique J. Roobol, Fritz H. Schröder, and Harry J. de Koning. Quality-of-life effects of prostate-specific antigen screening. New England Journal of Medicine, 367(7):595–605, 2012.

[89] Thomas J. Hoerger, Russell Harris, Katherine A. Hicks, Katrina Donahue, Stephen Sorensen, and Michael Engelgau. Screening for type 2 diabetes mellitus: A cost-effectiveness analysis. Annals of Internal Medicine, 140(9):689–699, 2004.

[90] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[91] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–401, 1999.

[92] Jan A. C. Hontelez, Mark N. Lurie, Till Bärnighausen, Roel Bakker, Rob Baltussen, Frank Tanser, Timothy B. Hallett, Marie-Louise Newell, and Sake J. de Vlas. Elimination of HIV in South Africa through expanded access to antiretroviral therapy: A model comparison study. PLOS Medicine, 10(10):e1001534, 2013.

[93] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification, 2003.

[94] Chiun Hsu, Ying-Chun Shen, Chia-Chi Cheng, Ann-Lii Cheng, Fu-Chang Hu, and Kun-Huei Yeh. Geographic difference in safety and efficacy of systemic chemotherapy for advanced gastric or gastroesophageal carcinoma: a meta-analysis and meta-regression. Gastric Cancer, 15:265–280, 2012.

[95] Arti Hurria, Kayo Togawa, Supriya G. Mohile, Cynthia Owusu, Heidi D. Klepin, Cary P. Gross, Stuart M. Lichtman, Ajeet Gajra, Smita Bhatia, Vani Katheria, Shira Klapper, Kurt Hansen, Rupal Ramani, Mark Lachs, F. Lennie Wong, and William P. Tew. Predicting chemotherapy toxicity in older adults with cancer: A prospective multicenter study. Journal of Clinical Oncology, 29(25):3457–3465, 2011.

[96] A. Imkampe, S. Bendall, and T. Bates. The significance of the site of recurrence to subsequent breast cancer survival. European Journal of Surgical Oncology, 33(4):420–423, 2007.

[97] Amelia Insa, Ana Lluch, Felipe Prosper, Isabel Marugan, Angel Martinez-Agullo, and Javier Garcia-Conde. Prognostic factors predicting survival from first recurrence in patients with metastatic breast cancer: analysis of 439 patients. Breast Cancer Research and Treatment, 56(1):67–78, 1999.

[98] National Cancer Institute. Surveillance, Epidemiology, and End Results, 2008.

[99] Cancer Intervention and Surveillance Modeling Network. Comparative modeling, 2015. http://cisnet.cancer.gov/modeling/comparative.html.

[100] H. Iwase, M. Shimada, T. Tsuzuki, K. Ina, M. Sugihara, J. Haruta, M. Shinoda, T. Kumada, and H. Goto. A phase II multi-center study of triple therapy with paclitaxel, S-1 and cisplatin in patients with advanced gastric cancer. Oncology, 80:76–83, 2011.

[101] Miles F. Jefferson, Neil Pendleton, Sam B. Lucas, and Michael A. Horan. Comparison of a genetic algorithm neural network with logistic regression for predicting outcome after surgery for patients with nonsmall cell lung carcinoma. Cancer, 79(7):1338–1342, 1997.

[102] Richard Kahn, Peter Alperin, David Eddy, Knut Borch-Johnsen, John Buse, Justin Feigelman, Edward Gregg, Rury R Holman, M Sue Kirkman, Michael Stern, Jaakko Tuomilehto, and Nick J Wareham. Age at initiation and frequency of screening to detect type 2 diabetes: a cost-effectiveness analysis. The Lancet, 375(9723):1365–1374, 2010.

[103] Y.-K. Kang, W.-K. Kang, D.-B. Shin, J. Chen, J. Xiong, J. Wang, M. Lichinitser, Z. Guan, R. Khasanov, L. Zheng, M. Philco-Salas, T. Suarez, J. Santamaria, G. Forster, and P. I. McCloud. Capecitabine/cisplatin versus 5-fluorouracil/cisplatin as first-line therapy in patients with advanced gastric cancer: a randomised phase III noninferiority trial. Annals of Oncology, 20:666–673, 2009.

[104] David A. Karnofsky. The clinical evaluation of chemotherapeutic agents in cancer. Evaluation of chemotherapeutic agents, 1949.

[105] Jonathan Karnon and Jackie Brown. Selecting a decision model for economic evaluation: a case study and review. Health Care Management Science, 1(2):133–140, 1998.

[106] Sun-Young Kim, Joshua A. Salomon, and Sue J. Goldie. Economic evaluation of hepatitis B vaccination in low-income countries: using cost-effectiveness affordability curves. Bulletin of the World Health Organization, 85(11):833–842, 2007.

[107] R. L. A. Kirch and M. Klein. Surveillance schedules for medical examinations. Management Science, 20(10):1403–1409, 1974.

[108] Marina Kleiman, Yael Sagi, Naamah Bloch, and Zvia Agur. Use of virtual patient populations for rescuing discontinued drug candidates and for reducing the number of patients in clinical trials. ATLA, 37(Supplement 1):39–45, 2009.

[109] T. Kobayashi, R. Goto, K. Ito, and K. Mitsumori. Prostate cancer screening strategies with re-screening interval determined by individual baseline prostate-specific antigen values are cost-effective. European Journal of Surgical Oncology, 33(6):783–789, 2007.

[110] P.G. Koenders, L.V.A.M. Beex, P.W.C. Kloopenborg, A.G.H. Smals, and Th.J. Benraad. Human breast cancer: survival from first metastasis. Breast Cancer Research and Treatment, 21(3):173–180, 1992.

[111] Wasaburo Koizumi, Hiroyuki Narahara, Takuo Hara, Akinori Takagane, Toshikazu Akiya, Masakazu Takagi, Kosei Miyashita, Takashi Nishizaki, Osamu Kobayashi, Wataru Takiyama, Yasushi Toh, Takashi Nagaie, Seiichi Takagi, Yoshitaka Yamamura, Kimihiko Yanaoka, Hiroyuki Orita, and Masahiro Takeuchi. S-1 plus cisplatin versus S-1 alone for first-line treatment of advanced gastric cancer (SPIRITS trial): a phase III trial. Lancet Oncology, 9:215–221, 2008.

[112] Kjell Konis. lpSolveAPI: R Interface for lp_solve version 5.5.2.0, 2014. R package version 5.5.2.0-14.

[113] Murray Krahn, John Mahoney, Mark Eckman, John Trachtenberg, Stephen Pauker, and Allan Detsky. Screening for prostate cancer: A decision analytic view. Journal of the American Medical Association, 272(10):773–780, 1994.

[114] P.W. Laud and J.G. Ibrahim. Predictive model selection. Journal of the Royal Statistical Society, Series B, 57(1):247–262, 1995.

[115] Kyung Hee Lee, Myung Soo Hyun, Hoon-Kyo Kim, Hyung Min Jin, Jinmo Yang, Hong Suk Song, Young Rok Do, Hun Mo Ryoo, Joo Seop Chung, Dae Young Zang, Ho-Yeong Lim, Jong Youl Jin, Chang Yeol Yim, Hee Sook Park, Jun Suk Kim, Chang Hak Sohn, and Soon Nam Lee. Randomized, multicenter, phase III trial of heptaplatin 1-hour infusion and 5-fluorouracil combination chemotherapy comparing with cisplatin and 5-fluorouracil combination chemotherapy in patients with advanced gastric cancer. Cancer Research and Treatment, 41(1):12–18, 2009.

[116] Y.-J. Lee, O.L. Mangasarian, and W.H. Wolberg. Survival-time classification of breast cancer patients. Computational Optimization and Applications, 25(1):151–166, 2003.

[117] Moshe Leshno, Zamir Halpern, and Nadir Arber. Cost-effectiveness of colorectal cancer screening in the average risk population. Health Care Management Science, 6(3):165–174, 2003.

[118] Jiaxu Li, Yang Kuang, and Bingtuan Li. Analysis of IVGTT glucose-insulin interaction models with time delay. Discrete and Continuous Dynamical Systems–Series B, 1(1):103–124, 2001.

[119] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

[120] Thomas L. Lincoln and George H. Weiss. A statistical evaluation of recurrent medical examinations. Operations Research, 12(2):187–205, 1964.

[121] Elisa F. Long and Robert R. Stavert. Portfolios of biomedical HIV interventions in South Africa: A cost-effectiveness analysis. Journal of General Internal Medicine, 28(10):1294–1301, 2013.

[122] Manfred P. Lutz, Hansjochen Wilke, D.J. Theo Wagener, Udo Vanhoefer, Krzysztof Jeziorski, Susanna Hegewisch-Becker, Leopold Balleisen, Eric Joossens, Rob L. Jansen, Muriel Debois, Ullrich Bethe, Michel Praet, Jacques Wils, and Eric Van Cutsem. Weekly infusional high-dose fluorouracil (HD-FU), HD-FU plus folinic acid (HD-FU/FA), or HD-FU/FA plus biweekly cisplatin in advanced gastric cancer: Randomized phase II trial 40953 of the European Organisation for Research and Treatment of Cancer Gastrointestinal Group and the Arbeitsgemeinschaft Internistische Onkologie. Journal of Clinical Oncology, 25(18):2580–2585, 2007.

[123] Lisa M. Maillart, Julie Simmons Ivy, Scott Ransom, and Kathleen Diehl. Assessing dynamic breast cancer screening policies. Operations Research, 56(6):1411–1427, 2008.

[124] Jeanne S. Mandelblatt, Kathleen A. Cronin, Stephanie Bailey, Donald A. Berry, Harry J. de Koning, Gerrit Draisma, Hui Huang, Sandra J. Lee, Mark Munsell, Sylvia K. Plevritis, Peter Ravdin, Clyde B. Schechter, Bronislava Sigal, Michael A. Stoto, Natasha K. Stout, Nicolien T. van Ravesteyn, John Venier, Marvin Zelen, and Eric J. Feuer. Effects of mammography screening under different screening schedules: Model estimates of potential benefits and harms. Annals of Internal Medicine, 151(10):738–747, 2009.

[125] Laura A. McLay, Christodoulos Foufoulides, and Jason R. W. Merrick. Using simulation-optimization to construct screening strategies for cervical cancer. Health Care Management Science, 13(4):294–318, 2010.

[126] Edward M. Messing, Judith Manola, Jorge Yao, Maureen Kiernan, David Crawford, George Wilding, P. Anthony di Sant'Agnese, and Donald Trump. Immediate versus deferred androgen deprivation treatment in patients with node-positive prostate cancer after radical prostatectomy and pelvic lymphadenectomy. The Lancet Oncology, 7(6):472–479, 2006.

[127] David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2012. R package version 1.6-1.

[128] Virginia A. Moyer. Screening for cervical cancer: U.S. Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 156(12):880– 891, 2012.

[129] Virginia A. Moyer. Screening for prostate cancer: U.S. Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 157(2):120–134, 2012.

[130] Virginia A. Moyer. Screening for lung cancer: U.S. Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 160(5):330–338, 2014.

[131] Amitava Mukhopadhyay, Andrea De Gaetano, and Ovide Arino. Modeling the intra-venous glucose tolerance test: A global study for a single-distributed-delay model. Discrete and Continuous Dynamical Systems–Series B, 4(2):407–417, 2004.

[132] J.M. Nabholtz, J.R. Mackey, M. Smylie, A. Paterson, D.R. Noël, T. Al-Tweigeri, K. Tonkin, S. North, N. Azli, and A. Riva. Phase II study of docetaxel, doxorubicin, and cyclophosphamide as first-line chemotherapy for metastatic breast cancer. Journal of Clinical Oncology, 19(2):314–321, 2001.

[133] National Cancer Institute. Common terminology criteria for adverse events v3.0 (CTCAE), August 2006.

[134] National Comprehensive Cancer Network. NCCN Clinical Practice Guidelines in Oncology: Breast Cancer, 3.2015 edition, 2015.

[135] National Comprehensive Cancer Network. NCCN Clinical Practice Guidelines in Oncology: Gastric Cancer, 3.2015 edition, 2015.

[136] Anthony O’Hagan. Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society, Series B, 57(1):99–138, 1995.

[137] Lucila Ohno-Machado. Modeling medical prognosis: Survival analysis techniques. Journal of Biomedical Informatics, 34:428–439, 2001.

[138] Martin M. Oken, Richard H. Creech, Douglass C. Tormey, John Horton, Thomas E. Davis, Eleanor T. McFadden, and Paul P. Carbone. Toxicity and response criteria of the Eastern Cooperative Oncology Group. American Journal of Clinical Oncology, 5(6):649–656, 1982.

[139] Beth Overmoyer. Combination chemotherapy for metastatic breast cancer: Reaching for the cure. Journal of Clinical Oncology, 21(4):580–582, 2003.

[140] Ray Page and Chris Takimoto. Cancer Management: A Multidisciplinary Approach: Medical, Surgical and Radiation Oncology, chapter Principles of Chemotherapy. PRR Inc., 2002.

[141] John H. Phan, Richard A. Moffitt, Todd H. Stokes, Jian Liu, Andrew N. Young, Shuming Nie, and May D. Wang. Convergence of biomarkers, bioinformatics and nanotechnology for individualized cancer treatment. Trends in Biotechnology, 27(6):350–358, 2009.

[142] William B. Pratt. The Anticancer Drugs. Oxford University Press, 1994.

[143] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.

[144] Marion S. Rauner, Walter J. Gutjahr, Kurt Heidenberger, Joachim Wagner, and Joseph Pasia. Dynamic policy modeling for chronic diseases: Metaheuristic-based identification of Pareto-optimal screening strategies. Operations Research, 58(5):1269–1286, 2010.

[145] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[146] H. Edwin Romeijn, Ravindra K. Ahuja, James F. Dempsey, and Arvind Kumar. A new linear programming approach to radiation therapy treatment planning problems. Operations Research, 54(2):201–216, 2006.

[147] Kevin S. Ross, H. Ballentine Carter, Jay D. Pearson, and Harry A. Guess. Comparative efficiency of prostate-specific antigen screening strategies for prostate cancer detection. Journal of the American Medical Association, 284(11):1399–1405, 2000.

[148] A. D. Roth. Chemotherapy in gastric cancer: a never ending saga. Annals of Oncology, 14:175–177, 2003.

[149] Young U. Ryu, R. Chandrasekaran, and Varghese Jacob. Prognosis using an isotonic prediction technique. Management Science, 50(6):777–785, 2004.

[150] Peter T. Scardino, J. Robert Beck, and Brian J. Miles. Conservative management of prostate cancer. New England Journal of Medicine, 330(25):1831, 1994.

[151] Alexander Scherrer, Ilka Schwidde, Andreas Dinges, Patrick Rüdiger, Sherko Kümmel, and Karl-Heinz Küfer. Breast cancer therapy planning — a novel support concept for a sequential decision making problem. Health Care Management Science, 2014. To appear.

[152] Fritz H. Schröder, Jonas Hugosson, Monique J. Roobol, Teuvo L.J. Tammela, Stefano Ciatto, Vera Nelen, Maciej Kwiatkowski, Marcos Lujan, Hans Lilja, Marco Zappa, Louis J. Denis, Franz Recker, Antonio Berenguer, Liisa Määttänen, Chris H. Bangma, Gunnar Aus, Arnauld Villers, Xavier Rebillard, Theodorus van der Kwast, Bert G. Blijenberg, Sue M. Moss, Harry J. de Koning, and Anssi Auvinen. Screening and prostate-cancer mortality in a randomized European study. New England Journal of Medicine, 360(13):1320–1328, 2009.

[153] Aylin Sertkaya, Anna Birkenbach, Ayesha Berlind, and John Eyraud. Examination of clinical trial costs and barriers for drug development. Technical Report HHSP23337007T, U.S. Department of Health and Human Services, 2014.

[154] Michael Shwartz. A mathematical model used to analyze breast cancer screening strategies. Operations Research, 26(6):937–955, 1978.

[155] Rebecca Siegel, Jiemin Ma, Zhaohui Zou, and Ahmedin Jemal. Cancer statistics, 2014. CA: A Cancer Journal for Clinicians, 64(1):9–29, 2014.

[156] Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. Cancer statistics, 2015. CA: A Cancer Journal for Clinicians, 65(1):5–29, 2015.

[157] Dennis J. Slamon, Gary M. Clark, Steven G. Wong, Wendy J. Levin, Axel Ullrich, and William L. McGuire. Human breast cancer: Correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science, 235(4785):177–182, 1987.

[158] David Sonderman and Philip G. Abrahamson. Radiotherapy treatment design using mathematical programming models. Operations Research, 33(4):705–725, 1985.

[159] Z. Caner Taşkın, J. Cole Smith, H. Edwin Romeijn, and James F. Dempsey. Optimal multileaf collimator leaf sequencing in IMRT treatment planning. Operations Research, 58(3):674–690, 2010.

[160] Michael A. Talias. Optimal decision indices for R&D project evaluation in the pharmaceutical industry: Pearson index versus Gittins index. European Journal of Operational Research, 177(2):1105–1112, 2007.

[161] Ian M. Thompson, Phyllis J. Goodman, Catherine M. Tangen, M. Scott Lucia, Gary J. Miller, Leslie G. Ford, Michael M. Lieber, R. Duane Cespedes, James N. Atkins, Scott M. Lippman, Susie M. Carlin, Anne Ryan, Connie M. Szczepanek, John J. Crowley, and Charles A. Coltman Jr. The influence of finasteride on the development of prostate cancer. New England Journal of Medicine, 349(3):215–224, 2003.

[162] Simon Thompson and Julian Higgins. How should meta-regression analyses be undertaken and interpreted? Statistics in Medicine, 21:1559–1573, 2002.

[163] Peter C. Thuss-Patience, Albrecht Kretzschmar, Michael Repp, Dorothea Kingreen, Dirk Hennesser, Simone Micheel, Daniel Pink, Christian Scholz, Bernd Dörken, and Peter Reichardt. Docetaxel and continuous-infusion fluorouracil versus epirubicin, cisplatin, and fluorouracil for advanced gastric adenocarcinoma: A randomized phase II study. Journal of Clinical Oncology, 23(3):494–501, 2005.

[164] Robert J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[165] Lindsey A. Torre, Freddie Bray, Rebecca L. Siegel, Jacques Ferlay, Joannie Lortet-Tieulent, and Ahmedin Jemal. Global cancer statistics, 2012. CA: A Cancer Journal for Clinicians, 65(2):87–108, 2015.

[166] Andy Trotti, A. Dimitrios Colevas, Ann Setser, Valerie Rusch, David Jaques, Volker Budach, Corey Langer, Barbara Murphy, Richard Cumberlin, C. Norman Coleman, and Philip Rubin. CTCAE v3.0: Development of a comprehensive grading system for the adverse effects of cancer treatment. Seminars in Radiation Oncology, 13(3):176–181, 2003.

[167] Daniel J. Underwood, Jingyu Zhang, Brian T. Denton, Nilay D. Shah, and Brant A. Inman. Simulation optimization of PSA-threshold based prostate cancer screening policies. Health Care Management Science, 15(4):293–309, 2012.

[168] U.S. Preventive Services Task Force. Screening for colorectal cancer: U.S. Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 149(9):627–637, 2008.

[169] U.S. Preventive Services Task Force. Screening for breast cancer: U.S. Preventive Services Task Force recommendation statement. Annals of Internal Medicine, 151(10):716–726, 2009.

[170] V. Valero. Combination docetaxel/cyclophosphamide in patients with advanced solid tumors. Oncology (Williston Park, N.Y.), 11(8 Suppl 8):34–36, 1997.

[171] Laura J. van’t Veer and René Bernards. Enabling personalized cancer medicine through analysis of gene-expression patterns. Nature, 452(7187):564–570, 2008.

[172] Anna D. Wagner, Wilfried Grothe, Johannes Haerting, Gerhard Kleber, Axel Grothey, and Wolfgang E. Fleig. Chemotherapy in advanced gastric cancer: A systematic review and meta-analysis based on aggregate data. Journal of Clinical Oncology, 24(18):2903–2909, 2006.

[173] Duolao Wang, Wenyang Zhang, and Ameet Bakhai. Comparison of Bayesian model averaging and stepwise methods for model selection in logistic regression. Statistics in Medicine, 23(22):3451–3467, 2004.

[174] Elisabeth M. Wever, Gerrit Draisma, Eveline A. M. Heijnsdijk, and Harry J. de Koning. How does early detection by screening affect disease progression? Modeling estimated benefits in prostate cancer screening. Medical Decision Making, 31(4):550–558, 2011.

[175] Andrew Wolf, Richard Wender, Ruth Etzioni, Ian Thompson, Anthony D'Amico, Robert Volk, Durado Brooks, Chiranjeev Dash, Idris Guessous, Kimberly Andrews, Carol DeSantis, and Robert Smith. American Cancer Society guideline for the early detection of prostate cancer: Update 2010. CA: A Cancer Journal for Clinicians, 60(2):70–98, 2010.

[176] R. Wong and D. Cunningham. Optimising treatment regimens for the management of advanced gastric cancer. Annals of Oncology, 20:605–608, 2009.

[177] World Health Organization. Fact Sheets: Cancer, February 2012.

[178] Ann G. Zauber, Iris Lansdorp-Vogelaar, Amy B. Knudsen, Janneke Wilschut, Marjolein van Ballegooijen, and Karen M. Kuntz. Evaluating test strategies for colorectal cancer screening: A decision analysis for the U.S. Preventive Services Task Force. Annals of Internal Medicine, 149(9):659–669, 2008.

[179] Jingyu Zhang, Brian Denton, Hari Balasubramanian, Nilay Shah, and Brant Inman. Optimization of PSA screening policies: A comparison of the patient and social perspectives. Medical Decision Making, 32(2):337–349, 2012.

[180] Shengfan Zhang, Julie S. Ivy, James R. Wilson, Kathleen M. Diehl, and Bonnie C. Yankaskas. Competing risks analysis in mortality estimation for breast cancer patients from independent risk groups. Health Care Management Science, 17(3):259–269, 2014.

[181] Lihui Zhao, Lu Tian, Tianxi Cai, Brian Claggett, and L.J. Wei. Effectively selecting a target population for a future comparative study. Harvard University Biostatistics Working Paper Series, 2011.

[182] Yingqi Zhao, Donglin Zeng, A. John Rush, and Michael R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

[183] Yufan Zhao, Michael R. Kosorok, and Donglin Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26):3294–3315, 2009.

[184] Kemin Zhou. Essentials of Robust Control, volume 1. Prentice Hall, 1998.

[185] R. Zurakowski and D. Wodarz. Treatment interruptions to decrease risk of resistance emerging during therapy switching in HIV treatment. In Proceedings of the 46th IEEE Conference on Decision and Control, pages 5174–5179, 2007.
