REINFORCEMENT LEARNING

FOR

CLINICAL TRIAL DATA

Dan Lizotte, Lacey Gunter, Eric Laber, Susan Murphy
Department of

it’s not what you think

I say: “Reinforcement Learning”
You think: “Ah yes, Reinforcement Learning...”

[Figure: the textbook picture of RL: a small MDP diagram with a handful of states, two actions (a0, a1), and numeric rewards on the transitions]

No. Well, yes, but your pre-conceived notions may lead you astray: “grid-world” and “mountain car” intuitions may not apply.

a different picture

[Figure: a trajectory in this setting: a block of baseline observations F0 ... FN, then action A0 (e.g. A0 = 1), a new block of observations G0 ... GM, then action A1 (e.g. A1 = B), then observations H1 ... HK, then action A2 (e.g. A2 = Δ), and so on; the observation and action sets differ from stage to stage]

unusual things about this setting

More (different?) generality than is assumed in toy problems
State includes past history: observations, actions, and rewards
  s = s_k = (o_0, a_0, r_0, o_1, a_1, r_1, ..., o_k),  s_k ∈ S

The set of permissible actions depends on the current state: A = A(s)

familiar things about this setting

Our objective is to maximize the expected sum of future rewards

We will accomplish this by learning and using a Q function, computing expectations and maximizations appropriately

Q*(s, a) = E_{s'}[ r(s, a, s') + max_{a'} Q*(s', a') ]

We then recover the optimal policy from Q*
This is off-policy: the data were collected using a random policy.

the data: STAR*D

Sequenced Treatment Alternatives to Relieve Depression: an NIMH-funded study involving ~4000 patients, intended to discover effective, patient-tailored treatment strategies (i.e. a good policy) for treating clinical depression. STAR*D has many layers of complexity.
Actions: treatments for depression: drugs, cognitive therapy
Observations: demographic information, depression assessments, ...
Rewards: unclear. We must combine data about therapeutic benefits, side-effects, costs, and unforeseen things.

STAR*D terminology

Level: a period of time over which a patient adheres (theoretically) to one of a Level-specific set of possible treatments. There are 4(-ish) Levels in STAR*D that a patient may participate in, in sequence, hence the maximum depth is 4(-ish). Observations are made at intervals during each Level: different measures of symptoms and side-effects.
Remission: if a patient achieves a sufficiently good score on a measure of clinical depression (QIDS), he or she may go to follow-up. Otherwise, the patient is supposed to proceed to the next Level.

more STAR*D terminology

Treatment Preference: when moving from one Level to the next, a patient provides a treatment preference.
“Augment”: the patient prefers to add a drug/therapy to the drug/therapy the patient is currently receiving.
“Switch”: the patient prefers to discontinue the current treatment and substitute a new one.
These state variables define the possible action sets: when transitioning to a new Level (for whatever reason), the patient is randomized among the set of treatments that are consistent with the patient’s preference. (A small sketch of these preference-dependent action sets follows the flowchart below.)

[Figure: the STAR*D treatment flowchart]
Level 1 (max 12 weeks): CIT. If QIDS ≤ 5, go to follow-up; if QIDS > 5, proceed to Level 2.
Level 2 (max 12 weeks): SER, BUP, or VEN (preference to switch); CIT+BUS or CIT+BUP (preference to augment). If QIDS ≤ 5, go to follow-up; if QIDS > 5, proceed to Level 3.
Level 3 (max 12 weeks): MIRT or NTP (preference to switch); L2+Li or L2+THY (preference to augment). If QIDS ≤ 5, go to follow-up; if QIDS > 5, proceed to Level 4.
Level 4 (mandatory 12 weeks): TCP or MIRT+VEN, then follow-up.
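The preference-dependent action sets A(s) in the flowchart are simple to write down explicitly. Below is a minimal sketch in Python; the function name, the dictionary encoding, and the exact treatment labels are mine, taken from the flowchart above rather than from the STAR*D protocol documents.

```python
# Minimal sketch of the state-dependent action sets A(s) implied by the
# STAR*D flowchart above.  Names and encoding are illustrative only.

# Treatments a patient can be randomized among, by (Level, stated preference).
ACTION_SETS = {
    (2, "switch"):  ["SER", "BUP", "VEN"],
    (2, "augment"): ["CIT+BUS", "CIT+BUP"],
    (3, "switch"):  ["MIRT", "NTP"],
    (3, "augment"): ["L2+Li", "L2+THY"],
    (4, None):      ["TCP", "MIRT+VEN"],   # no preference elicited at Level 4
}

def permissible_actions(level, preference=None):
    """Return A(s): the treatments consistent with the Level being entered
    and the patient's stated preference."""
    if level == 1:
        return ["CIT"]                      # everyone starts on citalopram
    key = (level, None if level == 4 else preference)
    return ACTION_SETS[key]

# Example: a patient entering Level 3 who prefers to augment
print(permissible_actions(3, "augment"))    # ['L2+Li', 'L2+THY']
```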

Example STAR*D Patients Starting at Level 2

[Figure: QIDS scores for six example patients, plotted over weeks 0 to 36, with each patient’s time spent in Level 2, Level 3, and Level 4 marked; the trajectories differ widely in length and shape]

weird things about this setting

More (different?) generality than is usually assumed in an MDP
State includes past history: observations, actions, and rewards
  s = s_k = (o_0, a_0, r_0, o_1, a_1, r_1, ..., o_k),  s_k ∈ S

The set of permissible actions depends on the current state: A = A(s)

weird things about this setting

More (different?) generality than is usually assumed in an MDP
State includes past history: observations, actions, and rewards
  s = s_k = (o_0, a_0, r_0, o_1, a_1, r_1, ..., o_k),  s_k ∈ S
  Which treatment did the patient receive previously? How well did they do on that treatment?
The set of permissible actions depends on the current state: A = A(s)
  Did the patient consent to augment or switch treatment?
The various o and a may have different domains

familiar things about this setting

Our objective is to maximize the expected sum of future rewards

We will accomplish this by learning and using a Q function, computing expectations and maximizations appropriately

Q*(s, a) = E_{s'}[ r(s, a, s') + max_{a'} Q*(s', a') ]

We then recover the optimal policy from Q*
This is off-policy: the data were collected using a random policy.

familiar things about this setting

Our objective is to maximize the expected sum of future rewards
  Reward will be some sort of cost-benefit analysis of treatment
We will accomplish this by learning and using a Q function, computing expectations and maximizations appropriately

Q*(s, a) = E_{s'}[ r(s, a, s') + max_{a'} Q*(s', a') ]
  Short horizon: we can compute Q* exactly (except for approximation error)
  Q* at the end states depends only on r...
We then recover the optimal policy from Q*
This is off-policy: the data were collected using a random policy.
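Because the horizon is short and the data are a fixed batch, this computation can be written as a backward recursion over stages. The sketch below is illustrative, not the actual STAR*D analysis: the data layout, the simple linear feature map, and the assumption that every patient completes all T stages are all mine.

```python
import numpy as np

# Minimal sketch of finite-horizon batch Q-learning by backward induction.
# Stage t has X[t]: (n, d) state features, A[t]: (n,) integer action indices,
# and R[t]: (n,) rewards observed after taking A[t] in state X[t].

def design(X, A, n_actions):
    """phi(s, a): state features crossed with a one-hot encoding of the action."""
    onehot = np.eye(n_actions)[A]                               # (n, n_actions)
    return np.einsum('ni,nj->nij', onehot, X).reshape(len(X), -1)

def fit_q_backward(X, A, R, n_actions):
    """Fit a linear Q_t for t = T-1, ..., 0 by backward recursion; return weights."""
    T = len(X)
    weights = [None] * T
    target = R[T - 1]                      # Q at the final stage depends only on r
    for t in reversed(range(T)):
        Phi = design(X[t], A[t], n_actions)
        w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
        weights[t] = w
        if t > 0:
            # Bellman backup: target for stage t-1 is r_{t-1} + max_a' Qhat_t(s_t, a')
            q_next = np.stack([design(X[t], np.full(len(X[t]), a), n_actions) @ w
                               for a in range(n_actions)], axis=1)
            target = R[t - 1] + q_next.max(axis=1)
    return weights

def greedy_action(w, x, n_actions):
    """Recover the learned policy at one stage: the argmax over a of Qhat(x, a)."""
    Phi = design(np.tile(x, (n_actions, 1)), np.arange(n_actions), n_actions)
    return int(np.argmax(Phi @ w))
```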

this domain is easy because

The domain is episodic
Horizons are short (i.e. 2, 3, 4, ...)
Q-learning-type analysis can be done exactly, i.e. online methods (in the stochastic-gradient sense) are unnecessary

this domain is hard because

We only have 1292 interesting trajectories
There are dozens (hundreds?) of interesting features
The types of observations are unusual for RL, e.g. variable-length vectors, where the length is meaningful
Many observations are missing and/or lies
We need bias that is both useful and defensible
  How sensitive are analyses w.r.t. bias?
  By ‘bias’ I mean everything that goes into choosing a class of Q functions, including variable selection, choice of regressor, etc.

the truth about STAR*D

STAR*D is a remarkable resource
STAR*D is a mess
  The action sets are complicated
  There is an enormous amount of missing data
  Some scheduled appointments just don’t happen
  Only limited data were collected during follow-up
  Many patients just disappear: side effects, didn’t like the doctor, felt better, ...

this domain is dangerous because

Opportunities for ‘attribution bias’ are absolutely rampant
Suppose: reward = negative QIDS score at end of study, and we will not use any data from patients who disappeared (there are missing state and reward variables)
After the analysis, exotic-drug-X looks really great
  Either exotic-drug-X works really well!
  ...or exotic-drug-X has horrendous side effects in patients with severe depression, so that group disappears
Eliminating incomplete cases induces bias

types of missing data

MCAR - Missing Completely At Random
  Missingness does not depend on any of the data values
  __M_ OF TH__E _ATA ARE _IS_IN_   (letters, 25% missing)
MAR - Missing At Random
  Missingness can depend on the observed data
  SO_E O_ THESE DA_A A_E MISSI_G   (letters after vowels, 25% missing)
If the data are not MCAR, then assuming the missing data are i.i.d. (or equivalently, “throwing out” missing data) induces bias
However, if we use a richer model, we can correct for this. A bigram model will give correct results in this case.
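To make the “throwing out missing data induces bias” point concrete, here is a small illustrative simulation of my own (not from the talk): the complete-case mean is fine under MCAR but biased under MAR, because missingness depends on an observed covariate that is correlated with the outcome.

```python
import numpy as np

# Illustrative simulation: complete-case analysis is unbiased under MCAR
# but biased under MAR missingness.
rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)             # always observed
y = 2.0 * x + rng.normal(size=n)   # outcome; its true mean is 0

# MCAR: drop 25% of y completely at random -- missingness ignores the data.
mcar_missing = rng.random(n) < 0.25

# MAR: drop y more often when the *observed* x is large.
mar_missing = rng.random(n) < np.where(x > 0, 0.45, 0.05)

print("true mean of y:            %+.3f" % y.mean())
print("complete-case mean (MCAR): %+.3f" % y[~mcar_missing].mean())  # close to 0
print("complete-case mean (MAR):  %+.3f" % y[~mar_missing].mean())   # biased low
```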

types of missing data

NMAR - Not Missing At Random
  Missingness depends on the missing value
  S_ME _F TH_S_ D_T_ _R_ MISS_NG   (vowels, 75% missing)
Okay, so now we’re hosed.
Another feature could fix this problem, like “this letter is a vowel”:
  S_ME _F TH_S_ D_T_ _R_ MISS_NG
  ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
  This would make the data MAR
Side information could at least help, like “these words are English”; a richer model could provide us with almost the same benefit.

dealing with missing data

When data are not MCAR or MAR (i.e. NMAR), you can’t tell from the observed data alone.
The better our model can predict the missing values, the more bias we will remove.
Conventional wisdom is to use as rich a model as is feasible to model the values of the missing portion of our data given the observed portion.
When we really don’t know how the variables relate, the “General Location Model” is a popular choice.

the General Location Model

Basically an LDA model. Don’t know Linear Discriminant Analysis? I will explain.
Each exemplar is a vector X = [W1, W2, ..., Wp, Z1, Z2, ..., Zk]; the W are discrete, the Z are continuous
Each possible configuration of the Ws is given a probability; define W to be a single discrete variable taking a value for each such configuration
The Z are multivariate normal with mean μ(W) and shared Σ
The marginal distribution of W is multinomial
The marginal distribution of Z is a mixture of MVNs

the General Location Model cartoon

X = [ W, Z1, Z2 ] W ∈ { 1, 2, 3 }

P(W) = [ 0.4 0.4 0.2 ]

μ(1) = [ -1.0  0.0 ]   μ(2) = [ 1.0  0.0 ]   μ(3) = [ 0.0  -1.0 ]
Σ = [ 1.0 0.5 ]
    [ 0.5 1.0 ]
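For concreteness, here is a minimal sketch that draws exemplars from this cartoon model and computes the conditional distribution of a missing Z2 given an observed (W, Z1). The parameter values are copied from the slide; the function names are mine. This “missing given observed” conditional is exactly the kind of quantity the imputation step below needs.

```python
import numpy as np

# The cartoon general location model from the slide:
#   W ~ Multinomial(p),  Z | W = w ~ N(mu[w], Sigma), with a shared Sigma.
p     = np.array([0.4, 0.4, 0.2])
mu    = {1: np.array([-1.0,  0.0]),
         2: np.array([ 1.0,  0.0]),
         3: np.array([ 0.0, -1.0])}
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

rng = np.random.default_rng(0)

def sample(n):
    """Draw n exemplars X = [W, Z1, Z2] from the model."""
    w = rng.choice([1, 2, 3], size=n, p=p)
    z = np.stack([rng.multivariate_normal(mu[wi], Sigma) for wi in w])
    return w, z

def z2_given_w_z1(w, z1):
    """Conditional mean and variance of a missing Z2 given observed W = w, Z1 = z1
    (standard bivariate-normal conditioning with the shared Sigma)."""
    m1, m2 = mu[w]
    s11, s12, s22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]
    return m2 + s12 / s11 * (z1 - m1), s22 - s12 ** 2 / s11

w, z = sample(3)
print(w, z.round(2))              # a few draws from the model
print(z2_given_w_z1(1, 0.5))      # conditional mean 0.75, variance 0.75
```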

why are we doing this again?

We want to model the missing data given the observed data.
No problem! Given any pattern of missingness, we can compute the distribution of the missing data given the observed data...
...if we knew the parameters.

why are we doing this again?

But we don’t know the parameters. Curses!
We’ll learn them. Great!
We can’t; we have missing data. Curses!
Those of you who are in-the-know realize what I’m about to do...

building the model

We will run MCMC.
0) Make up an initial set of parameters, along with priors on them
1) Sample missing data given observed data and parameters
2) Sample parameters given the complete data you just made up
3) Repeat
What do we get? A bunch of parameter sets, and a bunch of “imputed” data sets.
These data sets have the missing bits drawn from P(missing | observed), with the parameters integrated out --- it’s like *magic*!
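The loop above is a data-augmentation (Gibbs) sampler. Here is a toy, fully runnable sketch of its shape, using a deliberately simple model (a normal mean with a conjugate prior and MCAR-missing values) in place of the general location model; all names and numbers are illustrative.

```python
import numpy as np

# Toy data-augmentation (Gibbs) sampler with the same shape as the loop above:
#   y_i ~ N(theta, 1), theta ~ N(0, 10^2), and some y_i are missing.
rng = np.random.default_rng(0)

y = rng.normal(3.0, 1.0, size=200)           # the "complete" data
missing = rng.random(200) < 0.3              # entries we failed to observe
y_obs = np.where(missing, np.nan, y)

theta = 0.0                                   # 0) made-up initial parameter
imputations, draws = [], []
for sweep in range(2000):
    # 1) sample missing data given observed data and current parameters
    y_filled = y_obs.copy()
    y_filled[missing] = rng.normal(theta, 1.0, size=missing.sum())
    # 2) sample parameters given the completed data (conjugate normal update)
    n = len(y_filled)
    post_var = 1.0 / (n / 1.0 + 1.0 / 10.0**2)
    post_mean = post_var * y_filled.sum()
    theta = rng.normal(post_mean, np.sqrt(post_var))
    # 3) repeat; keep a thinned set of imputed data sets after burn-in
    if sweep >= 1000 and sweep % 200 == 0:
        imputations.append(y_filled.copy())
        draws.append(theta)

print("posterior mean of theta ~ %.2f" % np.mean(draws))
print("kept", len(imputations), "imputed data sets")
```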

why all this trouble?

We want to use methods, like Q-learning, that rely on complete input data.
We can’t just throw out the incomplete exemplars; we know the data are not MCAR.
Instead, we do “Bayesian multiple imputation”: get m imputed data sets, and run whatever we want on each one.
How do we combine results? Right now we average them to estimate the expected Q function, where the expectation is over the missing data. We might want the MAP Q-function; this is future work.
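Combining the results is mechanical. A minimal sketch of the averaging step is below; `fit_q` is a placeholder for whatever complete-data Q-learning routine is being run (for instance, the backward-induction sketch earlier), not a routine from the talk.

```python
import numpy as np

def q_hat_from_multiple_imputation(imputed_datasets, fit_q):
    """Fit a Q-function on each imputed data set, then return a function that
    averages their predictions -- an estimate of the expected Q function,
    where the expectation is over the missing data."""
    q_functions = [fit_q(d) for d in imputed_datasets]
    def q_bar(state, action):
        return np.mean([q(state, action) for q in q_functions])
    return q_bar
```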

back to STAR*D

≈ 150 variables; many are missing, and we *know* the data are not MCAR
Compare:
  Complete Case Analysis (a.k.a. “throw out incomplete exemplars”)
  Bayesian multiple imputation
Assess confidence using the bootstrap. Repeat 1000 times: draw N exemplars from the training set with replacement; for each bootstrapped dataset, learn your Q function; then look at the variability among these 1000 Q functions. (A sketch of this voting procedure follows the figures below.)

[Figure: Q̂ from CCA for the Level 2 switch treatments (SER, VEN, BUP): estimated value vs. initial QIDS score, with voting results from the 1000 bootstrapped datasets]

[Figure: Q̂_obs (from Bayesian multiple imputation) for the Level 2 switch treatments: estimated value vs. initial QIDS score, with voting results from the 1000 bootstrapped datasets]

[Figure: the CCA and multiple-imputation plots shown side by side]
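Here is a minimal sketch of the bootstrap “voting” procedure behind these figures; `data`, `fit_q`, and the grid of query states are placeholders for the STAR*D-specific pieces, not code from the talk.

```python
import numpy as np

def bootstrap_votes(data, fit_q, actions, states, n_boot=1000, seed=0):
    """For each bootstrap resample of the patients, refit Qhat and record the
    argmax action at each query state.  Returns a (len(states), len(actions))
    table of vote counts.  `data` is an array with one row per patient and
    `fit_q` returns a callable q(state, action); both are placeholders."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((len(states), len(actions)), dtype=int)
    n = len(data)
    for _ in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]   # draw n rows with replacement
        q = fit_q(resample)
        for i, s in enumerate(states):
            best = np.argmax([q(s, a) for a in actions])
            votes[i, best] += 1
    return votes
```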

CCA analysis is confident, and wrong.

Bayesian multiple imputation still provides information but better reflects uncertainty

Clinical significance? We’re not sure yet.

summary

We’re doing RL, but not your garden-variety RL
Easy because of the short-horizon, batch scenario
Hard because of:
  Limited data
  Complicated data
  Missing data and attribution bias
  Ambiguous rewards
We are also looking at CATIE, a schizophrenia study

Thanks to

John Rush and the STAR*D team
Lacey Gunter
Eric Laber
Susan Murphy

See also:

J. Pineau, M.G. Bellemare, A. J. Rush, A. Ghizaru, S.A. Murphy (2007). Constructing evidence-based treatment strategies using methods from computer science. Drug and Alcohol Dependence, 88, Supplement 2:S52-S60.

S.A. Murphy, L.M. Collins, A.J. Rush (2007). Customizing Treatment to the Patient: Adaptive Treatment Strategies (Editorial). Drug and Alcohol Dependence, 88(2):S1-S72.