REINFORCEMENT LEARNING

FOR

CLINICAL TRIAL DATA

Dan Lizotte, Lacey Gunter, Eric Laber, Susan Murphy
Department of

it’s not what you think

I say: “Reinforcement Learning”
You think: “Ah yes, Reinforcement Learning...”

[Figure: the textbook picture of RL: a small MDP diagram with a handful of states, two actions (a0, a1), and numeric rewards on the transitions]

No. Well, yes, but your pre-conceived notions may lead you astray: “grid-world” and “mountain car” intuitions may not apply.

a different picture

[Figure: a trajectory in this setting: a block of baseline observations F0 ... FN, then action A0 (e.g. A0 = 1), a new block of observations G0 ... GM, then action A1 (e.g. A1 = B), then observations H1 ... HK, then action A2 (e.g. A2 = Δ), and so on; the observation and action sets differ from stage to stage]

unusual things about this setting

More (different?) generality than is assumed in toy problems
State includes past history: observations, actions, and rewards
  s = s_k = (o_0, a_0, r_0, o_1, a_1, r_1, ..., o_k),  s_k ∈ S

The set of permissible actions depends on the current state: A = A(s)

familiar things about this setting

Our objective is to maximize the expected sum of future rewards

We will accomplish this by learning and using a Q function, computing expectations and maximizations appropriately

Q*(s, a) = E_{s'}[ r(s, a, s') + max_{a'} Q*(s', a') ]

We then recover the optimal policy from Q*
This is off-policy: the data were collected using a random policy.

the data: STAR*D

Sequenced Treatment Alternatives to Relieve Depression: an NIMH-funded study involving ~4000 patients, intended to discover effective, patient-tailored treatment strategies (i.e. a good policy) for treating clinical depression. STAR*D has many layers of complexity.
Actions: treatments for depression: drugs, cognitive therapy
Observations: demographic information, depression assessments, ...
Rewards: unclear. We must combine data about therapeutic benefits, side-effects, costs, and unforeseen things.

STAR*D terminology

Level: a period of time over which a patient adheres (theoretically) to one of a Level-specific set of possible treatments. There are 4(-ish) Levels in STAR*D that a patient may participate in, in sequence, hence the maximum depth is 4(-ish). Observations are made at intervals during each Level: different measures of symptoms and side-effects.
Remission: if a patient achieves a sufficiently good score on a measure of clinical depression (QIDS), he or she may go to follow-up. Otherwise, the patient is supposed to proceed to the next Level.

more STAR*D terminology

Treatment Preference: when moving from one Level to the next, a patient provides a treatment preference.
“Augment”: the patient prefers to add a drug/therapy to the drug/therapy the patient is currently receiving.
“Switch”: the patient prefers to discontinue the current treatment and substitute a new one.
These state variables define the possible action sets: when transitioning to a new Level (for whatever reason), the patient is randomized among the set of treatments that are consistent with the patient’s preference. (A small sketch of these preference-dependent action sets follows the flowchart below.)

[Figure: the STAR*D treatment flowchart]
Level 1 (max 12 weeks): CIT. If QIDS ≤ 5, go to follow-up; if QIDS > 5, proceed to Level 2.
Level 2 (max 12 weeks): SER, BUP, or VEN (preference to switch); CIT+BUS or CIT+BUP (preference to augment). If QIDS ≤ 5, go to follow-up; if QIDS > 5, proceed to Level 3.
Level 3 (max 12 weeks): MIRT or NTP (preference to switch); L2+Li or L2+THY (preference to augment). If QIDS ≤ 5, go to follow-up; if QIDS > 5, proceed to Level 4.
Level 4 (mandatory 12 weeks): TCP or MIRT+VEN, then follow-up.
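The preference-dependent action sets A(s) in the flowchart are simple to write down explicitly. Below is a minimal sketch in Python; the function name, the dictionary encoding, and the exact treatment labels are mine, taken from the flowchart above rather than from the STAR*D protocol documents.

```python
# Minimal sketch of the state-dependent action sets A(s) implied by the
# STAR*D flowchart above.  Names and encoding are illustrative only.

# Treatments a patient can be randomized among, by (Level, stated preference).
ACTION_SETS = {
    (2, "switch"):  ["SER", "BUP", "VEN"],
    (2, "augment"): ["CIT+BUS", "CIT+BUP"],
    (3, "switch"):  ["MIRT", "NTP"],
    (3, "augment"): ["L2+Li", "L2+THY"],
    (4, None):      ["TCP", "MIRT+VEN"],   # no preference elicited at Level 4
}

def permissible_actions(level, preference=None):
    """Return A(s): the treatments consistent with the Level being entered
    and the patient's stated preference."""
    if level == 1:
        return ["CIT"]                      # everyone starts on citalopram
    key = (level, None if level == 4 else preference)
    return ACTION_SETS[key]

# Example: a patient entering Level 3 who prefers to augment
print(permissible_actions(3, "augment"))    # ['L2+Li', 'L2+THY']
```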

Example STAR*D Patients Starting at Level 2

[Figure: QIDS scores for six example patients, plotted over weeks 0 to 36, with each patient’s time spent in Level 2, Level 3, and Level 4 marked; the trajectories differ widely in length and shape]

weird things about this setting

More (different?) generality than is usually assumed in an MDP
State includes past history: observations, actions, and rewards
  s = s_k = (o_0, a_0, r_0, o_1, a_1, r_1, ..., o_k),  s_k ∈ S

The set of permissible actions depends on the current state: A = A(s)

weird things about this setting

More (different?) generality than is usually assumed in an MDP
State includes past history: observations, actions, and rewards
  s = s_k = (o_0, a_0, r_0, o_1, a_1, r_1, ..., o_k),  s_k ∈ S
  Which treatment did the patient receive previously? How well did they do on that treatment?
The set of permissible actions depends on the current state: A = A(s)
  Did the patient consent to augment or switch treatment?
The various o and a may have different domains

familiar things about this setting

Our objective is to maximize the expected sum of future rewards

We will accomplish this by learning and using a Q function, computing expectations and maximizations appropriately

Q*(s, a) = E_{s'}[ r(s, a, s') + max_{a'} Q*(s', a') ]

We then recover the optimal policy from Q*
This is off-policy: the data were collected using a random policy.

familiar things about this setting

Our objective is to maximize the expected sum of future rewards
  Reward will be some sort of cost-benefit analysis of treatment
We will accomplish this by learning and using a Q function, computing expectations and maximizations appropriately

Q*(s, a) = E_{s'}[ r(s, a, s') + max_{a'} Q*(s', a') ]
  Short horizon: we can compute Q* exactly (except for approximation error)
  Q* at the end states depends only on r...
We then recover the optimal policy from Q*
This is off-policy: the data were collected using a random policy.
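Because the horizon is short and the data are a fixed batch, this computation can be written as a backward recursion over stages. The sketch below is illustrative, not the actual STAR*D analysis: the data layout, the simple linear feature map, and the assumption that every patient completes all T stages are all mine.

```python
import numpy as np

# Minimal sketch of finite-horizon batch Q-learning by backward induction.
# Stage t has X[t]: (n, d) state features, A[t]: (n,) integer action indices,
# and R[t]: (n,) rewards observed after taking A[t] in state X[t].

def design(X, A, n_actions):
    """phi(s, a): state features crossed with a one-hot encoding of the action."""
    onehot = np.eye(n_actions)[A]                               # (n, n_actions)
    return np.einsum('ni,nj->nij', onehot, X).reshape(len(X), -1)

def fit_q_backward(X, A, R, n_actions):
    """Fit a linear Q_t for t = T-1, ..., 0 by backward recursion; return weights."""
    T = len(X)
    weights = [None] * T
    target = R[T - 1]                      # Q at the final stage depends only on r
    for t in reversed(range(T)):
        Phi = design(X[t], A[t], n_actions)
        w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
        weights[t] = w
        if t > 0:
            # Bellman backup: target for stage t-1 is r_{t-1} + max_a' Qhat_t(s_t, a')
            q_next = np.stack([design(X[t], np.full(len(X[t]), a), n_actions) @ w
                               for a in range(n_actions)], axis=1)
            target = R[t - 1] + q_next.max(axis=1)
    return weights

def greedy_action(w, x, n_actions):
    """Recover the learned policy at one stage: the argmax over a of Qhat(x, a)."""
    Phi = design(np.tile(x, (n_actions, 1)), np.arange(n_actions), n_actions)
    return int(np.argmax(Phi @ w))
```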

this domain is easy because

The domain is episodic
Horizons are short (i.e. 2, 3, 4, ...)
Q-learning-type analysis can be done exactly, i.e. online methods (in the stochastic-gradient sense) are unnecessary

this domain is hard because

We only have 1292 interesting trajectories
There are dozens (hundreds?) of interesting features
The types of observations are unusual for RL, e.g. variable-length vectors, where the length is meaningful
Many observations are missing and/or lies
We need bias that is both useful and defensible
  How sensitive are analyses w.r.t. bias?
  By ‘bias’ I mean everything that goes into choosing a class of Q functions, including variable selection, choice of regressor, etc.

the truth about STAR*D

STAR*D is a remarkable resource
STAR*D is a mess
  The action sets are complicated
  There is an enormous amount of missing data
  Some scheduled appointments just don’t happen
  Only limited data were collected during follow-up
  Many patients just disappear: side effects, didn’t like the doctor, felt better, ...

this domain is dangerous because

Opportunities for ‘attribution bias’ are absolutely rampant
Suppose: reward = negative QIDS score at end of study, and we will not use any data from patients who disappeared (there are missing state and reward variables)
After the analysis, exotic-drug-X looks really great
  Either exotic-drug-X works really well!
  ...or exotic-drug-X has horrendous side effects in patients with severe depression, so that group disappears
Eliminating incomplete cases induces bias

types of missing data

MCAR - Missing Completely At Random
  Missingness does not depend on any of the data values
  __M_ OF TH__E _ATA ARE _IS_IN_   (letters, 25% missing)
MAR - Missing At Random
  Missingness can depend on the observed data
  SO_E O_ THESE DA_A A_E MISSI_G   (letters after vowels, 25% missing)
If the data are not MCAR, then assuming the missing data are i.i.d. (or equivalently, “throwing out” missing data) induces bias
However, if we use a richer model, we can correct for this. A bigram model will give correct results in this case.
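To make the “throwing out missing data induces bias” point concrete, here is a small illustrative simulation of my own (not from the talk): the complete-case mean is fine under MCAR but biased under MAR, because missingness depends on an observed covariate that is correlated with the outcome.

```python
import numpy as np

# Illustrative simulation: complete-case analysis is unbiased under MCAR
# but biased under MAR missingness.
rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)             # always observed
y = 2.0 * x + rng.normal(size=n)   # outcome; its true mean is 0

# MCAR: drop 25% of y completely at random -- missingness ignores the data.
mcar_missing = rng.random(n) < 0.25

# MAR: drop y more often when the *observed* x is large.
mar_missing = rng.random(n) < np.where(x > 0, 0.45, 0.05)

print("true mean of y:            %+.3f" % y.mean())
print("complete-case mean (MCAR): %+.3f" % y[~mcar_missing].mean())  # close to 0
print("complete-case mean (MAR):  %+.3f" % y[~mar_missing].mean())   # biased low
```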

types of missing data

NMAR - Not Missing At Random
  Missingness depends on the missing value
  S_ME _F TH_S_ D_T_ _R_ MISS_NG   (vowels, 75% missing)
Okay, so now we’re hosed.
Another feature could fix this problem, like “this letter is a vowel”:
  S_ME _F TH_S_ D_T_ _R_ MISS_NG
  ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
  This would make the data MAR
Side information could at least help, like “these words are English”; a richer model could provide us with almost the same benefit.

dealing with missing data

When data are not MCAR or MAR (i.e. NMAR), you can’t tell from the observed data alone.
The better our model can predict the missing values, the more bias we will remove.
Conventional wisdom is to use as rich a model as is feasible to model the values of the missing portion of our data given the observed portion.
When we really don’t know how the variables relate, the “General Location Model” is a popular choice.

the General Location Model

Basically an LDA model. Don’t know Linear Discriminant Analysis? I will explain.
Each exemplar is a vector X = [W1, W2, ..., Wp, Z1, Z2, ..., Zk]; the W are discrete, the Z are continuous
Each possible configuration of the Ws is given a probability; define W to be a single discrete variable taking a value for each such configuration
The Z are multivariate normal with mean μ(W) and shared Σ
The marginal distribution of W is multinomial
The marginal distribution of Z is a mixture of MVNs

the General Location Model cartoon

X = [ W, Z1, Z2 ] W ∈ { 1, 2, 3 }

P(W) = [ 0.4 0.4 0.2 ]

μ(1) = [ -1.0  0.0 ]   μ(2) = [ 1.0  0.0 ]   μ(3) = [ 0.0  -1.0 ]
Σ = [ 1.0 0.5 ]
    [ 0.5 1.0 ]
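For concreteness, here is a minimal sketch that draws exemplars from this cartoon model and computes the conditional distribution of a missing Z2 given an observed (W, Z1). The parameter values are copied from the slide; the function names are mine. This “missing given observed” conditional is exactly the kind of quantity the imputation step below needs.

```python
import numpy as np

# The cartoon general location model from the slide:
#   W ~ Multinomial(p),  Z | W = w ~ N(mu[w], Sigma), with a shared Sigma.
p     = np.array([0.4, 0.4, 0.2])
mu    = {1: np.array([-1.0,  0.0]),
         2: np.array([ 1.0,  0.0]),
         3: np.array([ 0.0, -1.0])}
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

rng = np.random.default_rng(0)

def sample(n):
    """Draw n exemplars X = [W, Z1, Z2] from the model."""
    w = rng.choice([1, 2, 3], size=n, p=p)
    z = np.stack([rng.multivariate_normal(mu[wi], Sigma) for wi in w])
    return w, z

def z2_given_w_z1(w, z1):
    """Conditional mean and variance of a missing Z2 given observed W = w, Z1 = z1
    (standard bivariate-normal conditioning with the shared Sigma)."""
    m1, m2 = mu[w]
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    return m2 + s12 / s11 * (z1 - m1), s22 - s12 ** 2 / s11

w, z = sample(3)
print(w, z.round(2))              # a few draws from the model
print(z2_given_w_z1(1, 0.5))      # conditional mean 0.75, variance 0.75
```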

why are we doing this again?

We want to model the missing data given the observed data.
No problem! Given any pattern of missingness, we can compute the distribution of the missing data given the observed data...
...if we knew the parameters.

why are we doing this again?

But we don’t know the parameters. Curses!
We’ll learn them. Great!
We can’t; we have missing data. Curses!
Those of you who are in-the-know realize what I’m about to do...

building the model

We will run MCMC.
0) Make up an initial set of parameters, along with priors on them
1) Sample missing data given observed data and parameters
2) Sample parameters given the complete data you just made up
3) Repeat
What do we get? A bunch of parameter sets, and a bunch of “imputed” data sets.
These data sets have the missing bits drawn from P(missing | observed), with the parameters integrated out --- it’s like *magic*!
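The loop above is a data-augmentation (Gibbs) sampler. Here is a toy, fully runnable sketch of its shape, using a deliberately simple model (a normal mean with a conjugate prior and MCAR-missing values) in place of the general location model; all names and numbers are illustrative.

```python
import numpy as np

# Toy data-augmentation (Gibbs) sampler with the same shape as the loop above:
#   y_i ~ N(theta, 1), theta ~ N(0, 10^2), and some y_i are missing.
rng = np.random.default_rng(0)

y = rng.normal(3.0, 1.0, size=200)           # the "complete" data
missing = rng.random(200) < 0.3              # entries we failed to observe
y_obs = np.where(missing, np.nan, y)

theta = 0.0                                   # 0) made-up initial parameter
imputations, draws = [], []
for sweep in range(2000):
    # 1) sample missing data given observed data and current parameters
    y_filled = y_obs.copy()
    y_filled[missing] = rng.normal(theta, 1.0, size=missing.sum())
    # 2) sample parameters given the completed data (conjugate normal update)
    n = len(y_filled)
    post_var = 1.0 / (n / 1.0 + 1.0 / 10.0**2)
    post_mean = post_var * y_filled.sum()
    theta = rng.normal(post_mean, np.sqrt(post_var))
    # 3) repeat; keep a thinned set of imputed data sets after burn-in
    if sweep >= 1000 and sweep % 200 == 0:
        imputations.append(y_filled.copy())
        draws.append(theta)

print("posterior mean of theta ~ %.2f" % np.mean(draws))
print("kept", len(imputations), "imputed data sets")
```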

why all this trouble?

We want to use methods, like Q-learning, that rely on complete input data.
We can’t just throw out the incomplete exemplars; we know the data are not MCAR.
Instead, we do “Bayesian multiple imputation”: get m imputed data sets, and run whatever we want on each one.
How do we combine results? Right now we average them to estimate the expected Q function, where the expectation is over the missing data. We might want the MAP Q-function; this is future work.
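Combining the results is mechanical. A minimal sketch of the averaging step is below; `fit_q` is a placeholder for whatever complete-data Q-learning routine is being run (for instance, the backward-induction sketch earlier), not a routine from the talk.

```python
import numpy as np

def q_hat_from_multiple_imputation(imputed_datasets, fit_q):
    """Fit a Q-function on each imputed data set, then return a function that
    averages their predictions -- an estimate of the expected Q function,
    where the expectation is over the missing data."""
    q_functions = [fit_q(d) for d in imputed_datasets]
    def q_bar(state, action):
        return np.mean([q(state, action) for q in q_functions])
    return q_bar
```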

back to STAR*D

≈ 150 variables; many are missing, and we *know* the data are not MCAR
Compare:
  Complete Case Analysis (a.k.a. “throw out incomplete exemplars”)
  Bayesian multiple imputation
Assess confidence using the bootstrap. Repeat 1000 times: draw N exemplars from the training set with replacement; for each bootstrapped dataset, learn your Q function; then look at the variability among these 1000 Q functions. (A sketch of this voting procedure follows the figures below.)

[Figure: Q̂ from CCA for the Level 2 switch treatments (SER, VEN, BUP): estimated value vs. initial QIDS score, with voting results from the 1000 bootstrapped datasets]

[Figure: Q̂_obs (from Bayesian multiple imputation) for the Level 2 switch treatments: estimated value vs. initial QIDS score, with voting results from the 1000 bootstrapped datasets]

[Figure: the CCA and multiple-imputation plots shown side by side]
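Here is a minimal sketch of the bootstrap “voting” procedure behind these figures; `data`, `fit_q`, and the grid of query states are placeholders for the STAR*D-specific pieces, not code from the talk.

```python
import numpy as np

def bootstrap_votes(data, fit_q, actions, states, n_boot=1000, seed=0):
    """For each bootstrap resample of the patients, refit Qhat and record the
    argmax action at each query state.  Returns a (len(states), len(actions))
    table of vote counts.  `data` is an array with one row per patient and
    `fit_q` returns a callable q(state, action); both are placeholders."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(states), len(actions)), dtype=int)
    n = len(data)
    for _ in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]   # draw n rows with replacement
        q = fit_q(resample)
        for i, s in enumerate(states):
            best = np.argmax([q(s, a) for a in actions])
            votes[i, best] += 1
    return votes
```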

CCA analysis is confident, and wrong.

Bayesian multiple imputation still provides information but better reflects uncertainty

Clinical significance? We’re not sure yet.

summary

We’re doing RL, but not your garden-variety RL
Easy because of the short-horizon, batch scenario
Hard because of:
  Limited data
  Complicated data
  Missing data and attribution bias
  Ambiguous rewards
We are also looking at CATIE, a schizophrenia study

Thanks to

John Rush and the STAR*D team
Lacey Gunter
Eric Laber
Susan Murphy

See also:

J. Pineau, M.G. Bellemare, A. J. Rush, A. Ghizaru, S.A. Murphy (2007). Constructing evidence-based treatment strategies using methods from computer science. Drug and Alcohol Dependence, 88, Supplement 2:S52-S60.

S.A. Murphy, L.M. Collins, A.J. Rush (2007). Customizing Treatment to the Patient: Adaptive Treatment Strategies (Editorial). Drug and Alcohol Dependence, 88(2):S1-S72.