
Lecture 7: Q-Learning for Optimal Dynamic Treatment Strategies

Introduction

Recall Definition of Q-function

- In a K-stage SMART, the data for an individual consist of

$$X_1, A_1, R_1, X_2, A_2, R_2, \ldots, R_K,$$

where $X_k$ denotes the features/intermediate outcomes collected prior to stage k, $A_k$ is the treatment assigned at stage k, and $R_k$ is the observed reward.

- By definition, the Q-function at stage k, $Q_k(A_k, H_k)$, is the expected optimal reward increment given the current state $H_k$ (which includes all information collected by stage k) and the treatment assignment $A_k$.

- Thus, if we know $Q_k(A_k, H_k)$, then the optimal DTR at stage k is
$$D_k^*(H_k) = \operatorname{argmax}_{a_k} Q_k(a_k, H_k).$$

Bellman Equation for Q-function

- For k < K, the equation is
$$Q_k(a_k, h_k) = E\!\left[ R_k + \max_{a_{k+1}} Q_{k+1}(a_{k+1}, H_{k+1}) \,\middle|\, A_k = a_k,\; H_k = h_k \right].$$

- For k = K,
$$Q_K(a_K, h_K) = E\!\left[ R_K \mid A_K = a_K,\; H_K = h_K \right].$$

- These relationships become the essential regressions used in Q-learning.

Q-Learning at Single Stage Decision

- When K = 1, Q-learning is simply a learning procedure to estimate the expectation of R given (A, H) (we omit the stage subscript).

- This can be carried out using a variety of regression models, either parametric or nonparametric.

- The models should include interactions between A and H.

- Regularization can also be imposed when the dimensionality of H is high.

More Q-Learning at Single Stage Decision

- Statistical learning algorithms can be used to estimate $E[R \mid A, H]$.

- Commonly used learning methods for regression include random forests, support vector regression (SVR, including $\epsilon$-SVR), and neural networks.

- The optimal treatment strategy is
$$\operatorname{argmax}_{a} E[R \mid A = a, H = h] \quad \text{for each } h;$$
in particular, when $A \in \{-1, 1\}$, it becomes
$$\operatorname{sign}\left\{ E[R \mid A = 1, H = h] - E[R \mid A = -1, H = h] \right\}.$$
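To make the single-stage procedure concrete, here is a minimal sketch in Python. It assumes simulated data, a randomized treatment coded $\{-1, 1\}$, and a linear working model with treatment-by-covariate interactions; all variable names and the data-generating model are hypothetical, not from the lecture.

```python
# Minimal single-stage Q-learning sketch on simulated data.
# Assumed setup: linear model for E[R | A, H] with A-by-H interactions,
# randomized treatment A in {-1, 1}.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 3
H = rng.normal(size=(n, p))                 # baseline covariates
A = rng.choice([-1, 1], size=n)             # randomized treatment
# reward with a treatment-by-covariate interaction (true rule: sign of H[:, 0])
R = H[:, 0] + A * H[:, 0] + rng.normal(scale=0.5, size=n)

def design(H, A):
    """Design matrix with main effects of H, A, and A-by-H interactions."""
    return np.column_stack([H, A, A[:, None] * H])

q_model = LinearRegression().fit(design(H, A), R)

# estimated rule: sign of the contrast E[R | A = 1, H] - E[R | A = -1, H]
contrast = (q_model.predict(design(H, np.ones(n)))
            - q_model.predict(design(H, -np.ones(n))))
d_hat = np.sign(contrast)
```

With a parametric model the rule reduces to the sign of the fitted contrast; a more flexible learner would instead compare predictions with A set to each treatment value.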

Q-Learning for Multiple Stage Decisions

Q-Learning Algorithm

- Learn $Q_K(a_K, h_K)$ using regression of $R_K$ given $(A_K, H_K)$.

- Compute the pseudo-outcome $\tilde{R}_{K-1} = R_{K-1} + \max_{a_K} Q_K(a_K, H_K)$, then learn $Q_{K-1}(a_{K-1}, h_{K-1})$ using regression of $\tilde{R}_{K-1}$ given $(A_{K-1}, H_{K-1})$.

- Repeat for stages $K-2, \ldots, 1$.

- The optimal DTRs: $D_k^*(h_k) = \operatorname{argmax}_{a_k} Q_k(a_k, h_k)$ (see the sketch of the recursion below).
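The backward recursion can be written compactly. Below is a hedged sketch, assuming each stage's data are stored as arrays $(H_k, A_k, R_k)$ for the same subjects, a binary treatment coded $\{-1, 1\}$, and linear working models with treatment-by-history interactions; any regression learner with fit/predict could be substituted.

```python
# Sketch of backward Q-learning for K stages.
# Assumed data layout: data[k] = (H_k, A_k, R_k), with H_k an (n, p_k) array.
import numpy as np
from sklearn.linear_model import LinearRegression

def design(H, A):
    return np.column_stack([H, A, A[:, None] * H])

def q_learning(data, treatments=(-1, 1)):
    K = len(data)
    q_models = [None] * K
    pseudo = 0.0                                  # carried back from the later stage
    for k in reversed(range(K)):
        H, A, R = data[k]
        y = R + pseudo                            # R_k + max_a Q_{k+1}
        model = LinearRegression().fit(design(H, A), y)
        q_models[k] = model
        # pseudo-outcome passed back to stage k-1: max over treatments of Q_k
        preds = [model.predict(design(H, np.full(len(A), a))) for a in treatments]
        pseudo = np.max(np.vstack(preds), axis=0)
    return q_models   # argmax over a of q_models[k] gives the stage-k rule
```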

Q-Learning: Two-Stage Example

Data: $(H_1, A_1, R_1, H_2, A_2, R_2)$, where H: state; A: treatment; R: reward.
Goal: maximize $R_1 + R_2$ by estimating the best treatment at each stage.
Q-learning uses backward-induction logic:

- Compare the expected outcomes from the second-stage regression model under the two treatments. Pick the treatment with the larger expected outcome.

- Imputation: create a pseudo second-stage outcome $\hat{R}_2$ as the maximum across the two treatments from the above regression model.

- Fit a first-stage regression model with outcome $R_1 + \hat{R}_2$. Pick the stage-1 treatment maximizing the predicted value for a given set of baseline variables.

Q-Learning Algorithm

- At stage 2 (no future stages), we fit
$$Q_2(H_2, A_2) = E[R_2 \mid H_2, A_2],$$
then estimate
$$D_2^* = \operatorname{argmax}_{a \in \{-1,1\}} Q_2(H_2, a).$$

- At stage 1, we obtain each individual's optimal future reward as
$$R_1^* = R_1 + \max_{a \in \{-1,1\}} Q_2(H_2, a),$$
so we estimate $Q_1(H_1, A_1)$ as $E[R_1^* \mid H_1, A_1]$.

- We obtain
$$D_1^* = \operatorname{argmax}_{a \in \{-1,1\}} Q_1(H_1, a).$$

Regression Models in Q-Learning

- The regression can use parametric models with regularization (e.g., LASSO).

- It can also be machine-learning based: support vector regression, random forests, neural networks/deep learning.

- Treatment-by-covariate interactions should always be included in the regression; with flexible learners, including the treatment as an input feature allows such interactions to be captured (see the sketch below).
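As an illustration of the machine-learning route, here is a minimal single-stage sketch using a random forest. The data, learner, and settings are illustrative choices, not the lecture's; the treatment indicator is included as an input feature so the fitted regression can pick up treatment-by-covariate interactions.

```python
# Single-stage Q-learning with a flexible learner (random forest).
# Simulated, hypothetical data; any regressor with fit/predict would do.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 5
H = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)
R = H[:, 0] + A * (H[:, 0] - H[:, 1]) + rng.normal(scale=0.5, size=n)

# fit E[R | A, H] with (H, A) as inputs
q_fit = RandomForestRegressor(n_estimators=200, random_state=0)
q_fit.fit(np.column_stack([H, A]), R)

# recommend the treatment with the larger predicted reward
pred_pos = q_fit.predict(np.column_stack([H, np.ones(n)]))
pred_neg = q_fit.predict(np.column_stack([H, -np.ones(n)]))
d_hat = np.sign(pred_pos - pred_neg)
```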

Pros and Cons of Q-Learning

Pros:

- Each step is a standard regression fit.
- It makes use of all the subjects.

Cons:

- Regression models may be misspecified.
- The objective function is for model fitting, not directly for value maximization.

Can we directly search for the optimal DTRs without fitting regression models?

Theoretical Challenges

Theoretical Results

- When the true optimal DTR is implied by the class of regression models used in Q-learning, the derived DTRs from Q-learning are Fisher consistent.

- Furthermore, the value loss of the estimated DTRs relative to the optimal DTRs,
$$V(D^*) - V(\hat{D}),$$
can be bounded by the difference in least-squares losses between $D^*$ and $\hat{D}$.

- Therefore, the convergence rate in terms of the value loss can be controlled by the convergence rate of the residual sum of squares in the usual regression.

Challenges

- Consider two-stage Q-learning. At the second stage, we fit the regression model
$$R_2 = \alpha_2 X_2 + \beta_2 A_2 X_2 + \epsilon_2.$$

- At the first stage, we fit another regression,
$$\hat{Q}_1 = \alpha_1 X_1 + \beta_1 A_1 X_1 + \epsilon_1,$$
where $\hat{Q}_1 \equiv R_1 + \hat{\alpha}_2 X_2 + |\hat{\beta}_2 X_2|$ (with $A_2 \in \{-1, 1\}$, maximizing the fitted second-stage model over $a_2$ yields $\hat{\alpha}_2 X_2 + |\hat{\beta}_2 X_2|$).

- Challenge: $\hat{Q}_1$ may not be normally distributed even if all of the original variables are multivariate normally distributed (see the simulation sketch below).
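A quick simulation, with arbitrary hypothetical coefficient values, illustrates the point: the absolute-value term inherited from the stage-2 maximization makes the pseudo-outcome skewed even though every ingredient is normal.

```python
# The pseudo-outcome R1 + a2*X2 + |b2*X2| is not normal even when R1 and X2
# are; the |.| term from maximizing over A2 in {-1, 1} induces skewness.
# Coefficient values are arbitrary, for illustration only.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
n = 100_000
X2 = rng.normal(size=n)
R1 = rng.normal(size=n)
a2, b2 = 0.5, 1.0

pseudo = R1 + a2 * X2 + np.abs(b2 * X2)
print("skewness of pseudo-outcome:", skew(pseudo))   # clearly nonzero
print("skewness of R1:", skew(R1))                   # approximately zero
```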

Further Challenge

- Further challenge: if $P(\beta_2 X_2 = 0) > 0$, i.e., some individuals do not respond to the second-stage treatment, then inference for $\beta_1$ from Q-learning is non-standard.

- Local regularity does not hold.
- Resampling methods do not yield correct inference.
- Solutions: thresholding, regularization, most conservative inference, split-sample inference, m-out-of-n bootstrap, ...

Semiparametric Q-learning

Semiparametric modelling for Q-function

- As we can see from Q-learning, only the Q-function really matters for learning optimal treatment rules.
- Even more, only the contrast of the Q-function matters.
- Semiparametric Q-learning proposes models for the Q-function contrasts (blip functions).
- The advantages of this approach are:
  - requiring fewer assumptions on the Q-function,
  - allowing shared parameters across stages.

Q-contrast function

- At stage k, the Q-contrast function is defined, for a given state $H_k$ and treatment $a_k$, as
$$c(H_k, a_k) = Q_k(H_k, a_k) - Q_k(H_k, a_{0k}),$$
where $a_{0k}$ denotes a reference treatment at stage k.

- Since $Q_k(H_k, a_k)$ is the expected reward increment (return) for a subject with state $H_k$ treated with $a_k$, assuming the optimal treatments are received in all future stages, the contrast function equals the expected difference in the maximal long-term reward between two subjects, one receiving treatment $a_k$ and the other receiving treatment $a_{0k}$.

- Two common choices of $a_{0k}$: if $a_{0k}$ is a fixed treatment (e.g., the standard of care at stage k), the Q-contrast function is called a blip function; if $a_{0k}$ is chosen to be the optimal treatment rule at stage k, i.e., $a_{0k} = D_k^*(H_k)$, then the contrast is the negative of what is called a regret function.

Semiparametric models for Q-contrast function

- If we know $Q_k(H_k, a_k)$, then clearly $D_k^*(H_k) = \operatorname{argmax}_{a_k} c(H_k, a_k)$.

- Semiparametric methods directly model these contrast functions using some parameter $\theta$:
$$c(H_k, a_k) = c(H_k, a_k; \theta).$$

- $\theta$ includes both stage-specific parameters and parameters shared across stages.
- A linear function including the interaction between $H_k$ and $a_k$ is commonly used (the main effect of $H_k$ is not needed because of the contrast).
- A semiparametric model is also possible, where $\theta$ contains infinite-dimensional parameters.
- Let
$$G_k(\theta) = \sum_{j=k}^{K} Y_j + \sum_{j=k}^{K} \left\{ c(H_j, D_j^*(H_j); \theta) - c(H_j, A_j; \theta) \right\}$$
be the observed total reward plus the reward loss due to the non-optimal treatments actually received.

G-estimation for Q-contrast function

- G-estimation was originally developed in causal inference for longitudinal data, where the focus was on average treatment effects (structural nested mean models).
- G-estimation can be extended to learn optimal treatment rules based on the Q-contrast function.
- It is based on solving the equations associated with the following estimating functions:
$$U_k = \sum_{j=k}^{K} \left\{ G_j(\theta) - E[G_j(\theta) \mid H_j] \right\} \left\{ S_j(A_j) - E[S_j(A_j) \mid H_j] \right\},$$

where $S_j(A_j)$ is any integrable function of $A_j$.
- A few points about the estimating functions:
  - $E[G_j(\theta) \mid H_j]$ and $E[S_j(A_j) \mid H_j]$ can be modelled using another set of parameters.
  - When either model is correct, the estimating function has mean zero and thus leads to consistent estimation of $\theta$; this property is called double robustness.

Rationale of G-estimation

- The estimating functions are obtained from semiparametric efficiency theory; they turn out to be influence functions for $\theta$ when either model is correct.
- In semiparametric efficiency theory, $\theta$ is the parameter of interest, and all other parameters or distributions are treated as nuisance parameters and assumed to be fully nonparametric.
- The influence function has mean zero.
- A side note: semiparametric efficiency theory is important for semiparametric inference, causal inference, and more recently for high-dimensional inference. Removing the influence of nuisance-parameter estimation has been a useful technique for inference in complicated models.

Implementation of G-estimation

- We first treat all parameters as stage-specific and apply a backward algorithm to solve the equations for $U_k$, $k = 1, \ldots, K$.
- Backward recursive computation is needed because we must estimate $D_j^*$ for $j > k$ when using $U_k$ for estimation.
- When some parameters are shared across stages, we can pool their estimators from the stage-specific estimations based on their covariance matrix.
- Homework: what is the most efficient pooled estimator of $\theta$ if $\hat{\theta}_1, \ldots, \hat{\theta}_m$ all converge to $\theta$ and their joint asymptotic covariance matrix is $\Sigma$?

- If we only keep the $k$-th term in $U_k$ in the G-estimation, the resulting method is called A-learning; A-learning is often less efficient. A single-stage sketch of this type of estimating equation is given below.
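As a concrete illustration, here is a hedged sketch of the single-stage case (which coincides with A-learning). It assumes a linear blip $c(H, a; \theta) = a\,\theta^\top \tilde{H}$ with treatment $A \in \{0, 1\}$, a known randomization probability $\pi = 0.5$, $S(A) = A \tilde{H}$, and a linear working model $\beta^\top \tilde{H}$ for the treatment-free mean $E[R - A\,\theta^\top \tilde{H} \mid H]$. Because the H-measurable part of $G(\theta)$ cancels in $G(\theta) - E[G(\theta) \mid H]$, both estimating equations become linear in $(\theta, \beta)$ and can be solved directly. All data and coefficients are simulated and hypothetical.

```python
# Single-stage G-estimation / A-learning sketch with a linear blip
# c(H, a; theta) = a * theta'h (h includes an intercept), A in {0, 1},
# known randomization probability pi = 0.5, and a linear working model
# beta'h for the treatment-free mean E[R - A * theta'h | H].
import numpy as np

rng = np.random.default_rng(1)
n = 2000
H = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + covariate
A = rng.binomial(1, 0.5, size=n)                         # randomized treatment
theta_true = np.array([0.5, -1.0])                       # true blip parameters
R = H @ np.array([1.0, 0.3]) + A * (H @ theta_true) + rng.normal(scale=0.5, size=n)

pi = 0.5
W = (A - pi)[:, None] * H        # S(A) - E[S(A)|H] with S(A) = A * h
AH = A[:, None] * H

# theta-equation:  sum_i W_i (R_i - A_i theta'h_i - beta'h_i) = 0
# beta-equation:   sum_i h_i (R_i - A_i theta'h_i - beta'h_i) = 0
M = np.block([[W.T @ AH, W.T @ H],
              [H.T @ AH, H.T @ H]])
b = np.concatenate([W.T @ R, H.T @ R])
theta_hat, beta_hat = np.split(np.linalg.solve(M, b), 2)

# estimated rule: treat (A = 1) when the estimated blip theta'h is positive
d_hat = (H @ theta_hat > 0).astype(int)
print("theta_hat:", theta_hat)    # should be close to theta_true
```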

Marginal structural models for estimating optimal rules

- These methods directly model the decision function for the treatment rule and construct marginal estimating equations for estimation.
- Consider a single-stage decision and assume $D(H) = \operatorname{sign}(g(H; \theta))$.
- A general procedure for estimation is as follows:
  - For a given rule parameter $\theta$, estimate its associated value using the inverse probability weighting method or its augmentation:
$$\hat{V}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{I(A_i g(H_i;\theta) > 0)\, R_i}{\pi(A_i \mid H_i)} - \frac{I(A_i g(H_i;\theta) > 0) - \pi(A_i \mid H_i)}{\pi(A_i \mid H_i)}\, E\!\left[ R_i(\operatorname{sign}(g(H_i;\theta))) \mid H_i \right] \right].$$

- We maximize $\hat{V}(\theta)$ over $\theta$ to estimate the rule parameter (a sketch is given below).
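A minimal sketch of this value-search idea, assuming a single stage, a randomized treatment with known $\pi = 1/2$, a one-covariate linear rule $\operatorname{sign}(\theta_0 + \theta_1 X)$, the plain IPW estimator (the augmentation term is omitted), and a crude grid search over normalized rule parameters. All data and names are simulated and hypothetical.

```python
# Maximize the IPW value estimate over the rule D(H) = sign(theta0 + theta1 * X).
# Simulated data; pi is known because treatment is randomized.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)
R = 1 + X + A * X + rng.normal(scale=0.5, size=n)   # optimal rule: sign(X)
pi = 0.5                                            # P(A = a | H)

def ipw_value(theta0, theta1):
    """IPW estimate of the value of the rule sign(theta0 + theta1 * X)."""
    d = np.sign(theta0 + theta1 * X)
    follow = (A * d > 0)                            # treatment agrees with the rule
    return np.mean(follow * R / pi)

# the value is scale-invariant in (theta0, theta1), so search the unit circle
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
best = max(((np.cos(t), np.sin(t)) for t in angles), key=lambda th: ipw_value(*th))
print("estimated rule parameters:", best)           # roughly proportional to (0, 1)
```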

A few notes for MSM

- $\pi(A_i \mid H_i)$ and $E[R_i(\operatorname{sign}(g(H_i;\theta))) \mid H_i]$ can be estimated using statistical models, and $\hat{V}(\theta)$ enjoys double robustness.
- The extension to multiple stages is similar but more complicated.
- Like SNMMs, MSMs can be viewed as extending their original use for average causal effects to personalized treatment effects, sometimes called heterogeneous treatment effects (HTEs).
