
Lecture 7: Q-Learning for Optimal Dynamic Treatment Strategies

Introduction

Recall Definition of Q-function

- In a K-stage SMART, the data for an individual consist of

$$X_1, A_1, R_1, X_2, A_2, R_2, \ldots, R_K,$$

where $X_k$ denotes the features/intermediate outcomes collected prior to stage k, $A_k$ is the treatment assigned at stage k, and $R_k$ is the observed reward.

- By definition, the Q-function at stage k, $Q_k(A_k, H_k)$, is the expected optimal reward increment given the current state $H_k$ (which includes all information collected by stage k) and the treatment assignment $A_k$.

- Thus, if we know $Q_k(A_k, H_k)$, then the optimal DTR at stage k is
$$D_k^*(H_k) = \operatorname{argmax}_{a_k} Q_k(a_k, H_k).$$

Bellman Equation for Q-function

- For k < K, the equation is
$$Q_k(a_k, h_k) = E\!\left[ R_k + \max_{a_{k+1}} Q_{k+1}(a_{k+1}, H_{k+1}) \,\middle|\, A_k = a_k,\; H_k = h_k \right].$$

- For k = K,
$$Q_K(a_K, h_K) = E\!\left[ R_K \mid A_K = a_K,\; H_K = h_K \right].$$

- These relationships become the essential regressions used in Q-learning.

Q-Learning at Single Stage Decision

- When K = 1, Q-learning is simply a learning procedure to estimate the expectation of R given (A, H) (we omit the stage subscript).

- This can be carried out using a variety of regression models, either parametric or nonparametric.

- The models should include interactions between A and H.

- Regularization can also be imposed when the dimensionality of H is high.

More Q-Learning at Single Stage Decision

- Statistical learning algorithms can be used to estimate $E[R \mid A, H]$.

- Commonly used learning methods for regression include random forests, support vector regression (SVR, including $\epsilon$-SVR), and neural networks.

- The optimal treatment strategy is
$$\operatorname{argmax}_{a} E[R \mid A = a, H = h] \quad \text{for each } h;$$
in particular, when $A \in \{-1, 1\}$, it becomes
$$\operatorname{sign}\left\{ E[R \mid A = 1, H = h] - E[R \mid A = -1, H = h] \right\}.$$
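To make the single-stage procedure concrete, here is a minimal sketch in Python. It assumes simulated data, a randomized treatment coded $\{-1, 1\}$, and a linear working model with treatment-by-covariate interactions; all variable names and the data-generating model are hypothetical, not from the lecture.

```python
# Minimal single-stage Q-learning sketch on simulated data.
# Assumed setup: linear model for E[R | A, H] with A-by-H interactions,
# randomized treatment A in {-1, 1}.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 3
H = rng.normal(size=(n, p))                 # baseline covariates
A = rng.choice([-1, 1], size=n)             # randomized treatment
# reward with a treatment-by-covariate interaction (true rule: sign of H[:, 0])
R = H[:, 0] + A * H[:, 0] + rng.normal(scale=0.5, size=n)

def design(H, A):
    """Design matrix with main effects of H, A, and A-by-H interactions."""
    return np.column_stack([H, A, A[:, None] * H])

q_model = LinearRegression().fit(design(H, A), R)

# estimated rule: sign of the contrast E[R | A = 1, H] - E[R | A = -1, H]
contrast = (q_model.predict(design(H, np.ones(n)))
            - q_model.predict(design(H, -np.ones(n))))
d_hat = np.sign(contrast)
```

With a parametric model the rule reduces to the sign of the fitted contrast; a more flexible learner would instead compare predictions with A set to each treatment value.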

Q-Learning for Multiple Stage Decisions

Q-Learning Algorithm

- Learn $Q_K(a_K, h_K)$ using regression of $R_K$ given $(A_K, H_K)$.

- Compute the pseudo-outcome $\tilde{R}_{K-1} = R_{K-1} + \max_{a_K} Q_K(a_K, H_K)$, then learn $Q_{K-1}(a_{K-1}, h_{K-1})$ using regression of $\tilde{R}_{K-1}$ given $(A_{K-1}, H_{K-1})$.

- Repeat for stages $K-2, \ldots, 1$.

- The optimal DTRs: $D_k^*(h_k) = \operatorname{argmax}_{a_k} Q_k(a_k, h_k)$ (see the sketch of the recursion below).
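The backward recursion can be written compactly. Below is a hedged sketch, assuming each stage's data are stored as arrays $(H_k, A_k, R_k)$ for the same subjects, a binary treatment coded $\{-1, 1\}$, and linear working models with treatment-by-history interactions; any regression learner with fit/predict could be substituted.

```python
# Sketch of backward Q-learning for K stages.
# Assumed data layout: data[k] = (H_k, A_k, R_k), with H_k an (n, p_k) array.
import numpy as np
from sklearn.linear_model import LinearRegression

def design(H, A):
    return np.column_stack([H, A, A[:, None] * H])

def q_learning(data, treatments=(-1, 1)):
    K = len(data)
    q_models = [None] * K
    pseudo = 0.0                                  # carried back from the later stage
    for k in reversed(range(K)):
        H, A, R = data[k]
        y = R + pseudo                            # R_k + max_a Q_{k+1}
        model = LinearRegression().fit(design(H, A), y)
        q_models[k] = model
        # pseudo-outcome passed back to stage k-1: max over treatments of Q_k
        preds = [model.predict(design(H, np.full(len(A), a))) for a in treatments]
        pseudo = np.max(np.vstack(preds), axis=0)
    return q_models   # argmax over a of q_models[k] gives the stage-k rule
```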

Q-Learning: Two-Stage Example

Data: $(H_1, A_1, R_1, H_2, A_2, R_2)$, where H: state; A: treatment; R: reward.
Goal: maximize $R_1 + R_2$ by estimating the best treatment at each stage.
Q-learning uses backward-induction logic:

- Compare the expected outcomes from the second-stage regression model under the two treatments. Pick the treatment with the larger expected outcome.

- Imputation: create a pseudo second-stage outcome $\hat{R}_2$ as the maximum across the two treatments from the above regression model.

- Fit a first-stage regression model with outcome $R_1 + \hat{R}_2$. Pick the stage-1 treatment maximizing the predicted value for a given set of baseline variables.

Q-Learning Algorithm

- At stage 2 (no future stages), we fit
$$Q_2(H_2, A_2) = E[R_2 \mid H_2, A_2],$$
then estimate
$$D_2^* = \operatorname{argmax}_{a \in \{-1,1\}} Q_2(H_2, a).$$

- At stage 1, we obtain each individual's optimal future reward as
$$R_1^* = R_1 + \max_{a \in \{-1,1\}} Q_2(H_2, a),$$
so we estimate $Q_1(H_1, A_1)$ as $E[R_1^* \mid H_1, A_1]$.

- We obtain
$$D_1^* = \operatorname{argmax}_{a \in \{-1,1\}} Q_1(H_1, a).$$

Regression Models in Q-Learning

- The regression can use parametric models with regularization (e.g., LASSO).

- It can also be machine-learning based: support vector regression, random forests, neural networks/deep learning.

- Treatment-by-covariate interactions should always be included in the regression; with flexible learners, including the treatment as an input feature allows such interactions to be captured (see the sketch below).
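As an illustration of the machine-learning route, here is a minimal single-stage sketch using a random forest. The data, learner, and settings are illustrative choices, not the lecture's; the treatment indicator is included as an input feature so the fitted regression can pick up treatment-by-covariate interactions.

```python
# Single-stage Q-learning with a flexible learner (random forest).
# Simulated, hypothetical data; any regressor with fit/predict would do.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 5
H = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)
R = H[:, 0] + A * (H[:, 0] - H[:, 1]) + rng.normal(scale=0.5, size=n)

# fit E[R | A, H] with (H, A) as inputs
q_fit = RandomForestRegressor(n_estimators=200, random_state=0)
q_fit.fit(np.column_stack([H, A]), R)

# recommend the treatment with the larger predicted reward
pred_pos = q_fit.predict(np.column_stack([H, np.ones(n)]))
pred_neg = q_fit.predict(np.column_stack([H, -np.ones(n)]))
d_hat = np.sign(pred_pos - pred_neg)
```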

Pros and Cons of Q-Learning

Pros:

- Each step is a standard regression fit.
- It makes use of all the subjects.

Cons:

- Regression models may be misspecified.
- The objective function is for model fitting, not directly for value maximization.

Can we directly search for the optimal DTRs without fitting regression models?

Theoretical Challenges

Theoretical Results

- When the true optimal DTR is implied by the class of regression models used in Q-learning, the derived DTRs from Q-learning are Fisher consistent.

- Furthermore, the value loss of the estimated DTRs relative to the optimal DTRs,
$$V(D^*) - V(\hat{D}),$$
can be bounded by the difference in least-squares losses between $D^*$ and $\hat{D}$.

- Therefore, the convergence rate in terms of the value loss can be controlled by the convergence rate of the residual sum of squares in the usual regression.

Challenges

- Consider two-stage Q-learning. At the second stage, we fit the regression model
$$R_2 = \alpha_2 X_2 + \beta_2 A_2 X_2 + \epsilon_2.$$

- At the first stage, we fit another regression,
$$\hat{Q}_1 = \alpha_1 X_1 + \beta_1 A_1 X_1 + \epsilon_1,$$
where $\hat{Q}_1 \equiv R_1 + \hat{\alpha}_2 X_2 + |\hat{\beta}_2 X_2|$ (with $A_2 \in \{-1, 1\}$, maximizing the fitted second-stage model over $a_2$ yields $\hat{\alpha}_2 X_2 + |\hat{\beta}_2 X_2|$).

- Challenge: $\hat{Q}_1$ may not be normally distributed even if all of the original variables are multivariate normally distributed (see the simulation sketch below).
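A quick simulation, with arbitrary hypothetical coefficient values, illustrates the point: the absolute-value term inherited from the stage-2 maximization makes the pseudo-outcome skewed even though every ingredient is normal.

```python
# The pseudo-outcome R1 + a2*X2 + |b2*X2| is not normal even when R1 and X2
# are; the |.| term from maximizing over A2 in {-1, 1} induces skewness.
# Coefficient values are arbitrary, for illustration only.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
n = 100_000
X2 = rng.normal(size=n)
R1 = rng.normal(size=n)
a2, b2 = 0.5, 1.0

pseudo = R1 + a2 * X2 + np.abs(b2 * X2)
print("skewness of pseudo-outcome:", skew(pseudo))   # clearly nonzero
print("skewness of R1:", skew(R1))                   # approximately zero
```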

Further Challenge

- Further challenge: if $P(\beta_2 X_2 = 0) > 0$, i.e., some individuals do not respond to the second-stage treatment, then inference for $\beta_1$ from Q-learning is non-standard.

- Local regularity does not hold.
- Resampling methods do not yield correct inference.
- Solutions: thresholding, regularization, most conservative inference, split-sample inference, m-out-of-n bootstrap, ...

Semiparametric Q-learning

Semiparametric modelling for Q-function

- As we can see from Q-learning, only the Q-function really matters for learning optimal treatment rules.
- Even more, only the contrast of the Q-function matters.
- Semiparametric Q-learning proposes models for the Q-function contrasts (blip functions).
- The advantages of this approach are:
  - requiring fewer assumptions on the Q-function,
  - allowing shared parameters across stages.

Q-contrast function

- At stage k, the Q-contrast function is defined, for a given state $H_k$ and treatment $a_k$, as
$$c(H_k, a_k) = Q_k(H_k, a_k) - Q_k(H_k, a_{0k}),$$
where $a_{0k}$ denotes a reference treatment at stage k.

- Since $Q_k(H_k, a_k)$ is the expected reward increment (return) for a subject with state $H_k$ treated with $a_k$, assuming the optimal treatments are received in all future stages, the contrast function equals the expected difference in the maximal long-term reward between two subjects, one receiving treatment $a_k$ and the other receiving treatment $a_{0k}$.

- Two common choices of $a_{0k}$: if $a_{0k}$ is a fixed treatment (e.g., the standard of care at stage k), the Q-contrast function is called a blip function; if $a_{0k}$ is chosen to be the optimal treatment rule at stage k, i.e., $a_{0k} = D_k^*(H_k)$, then the contrast is the negative of what is called a regret function.

Semiparametric models for Q-contrast function

- If we know $Q_k(H_k, a_k)$, then clearly $D_k^*(H_k) = \operatorname{argmax}_{a_k} c(H_k, a_k)$.

- Semiparametric methods directly model these contrast functions using some parameter $\theta$:
$$c(H_k, a_k) = c(H_k, a_k; \theta).$$

- $\theta$ includes both stage-specific parameters and parameters shared across stages.
- A linear function including the interaction between $H_k$ and $a_k$ is commonly used (the main effect of $H_k$ is not needed because of the contrast).
- A semiparametric model is also possible, where $\theta$ contains infinite-dimensional parameters.
- Let
$$G_k(\theta) = \sum_{j=k}^{K} Y_j + \sum_{j=k}^{K} \left\{ c(H_j, D_j^*(H_j); \theta) - c(H_j, A_j; \theta) \right\}$$
be the observed total reward plus the reward loss due to the non-optimal treatments actually received.

G-estimation for Q-contrast function

- G-estimation was originally developed in causal inference for longitudinal data, where the focus was on average treatment effects (structural nested mean models).
- G-estimation can be extended to learn optimal treatment rules based on the Q-contrast function.
- It is based on solving the equations associated with the following estimating functions:
$$U_k = \sum_{j=k}^{K} \left\{ G_j(\theta) - E[G_j(\theta) \mid H_j] \right\} \left\{ S_j(A_j) - E[S_j(A_j) \mid H_j] \right\},$$

where $S_j(A_j)$ is any integrable function of $A_j$.
- A few points about the estimating functions:
  - $E[G_j(\theta) \mid H_j]$ and $E[S_j(A_j) \mid H_j]$ can be modelled using another set of parameters.
  - When either model is correct, the estimating function has mean zero and thus leads to consistent estimation of $\theta$; this property is called double robustness.

Rationale of G-estimation

- The estimating functions are obtained from semiparametric efficiency theory; they turn out to be influence functions for $\theta$ when either model is correct.
- In semiparametric efficiency theory, $\theta$ is the parameter of interest, and all other parameters or distributions are treated as nuisance parameters and assumed to be fully nonparametric.
- The influence function has mean zero.
- A side note: semiparametric efficiency theory is important for semiparametric inference, causal inference, and more recently for high-dimensional inference. Removing the influence of nuisance-parameter estimation has been a useful technique for inference in complicated models.

Implementation of G-estimation

- We first treat all parameters as stage-specific and apply a backward algorithm to solve the equations for $U_k$, $k = 1, \ldots, K$.
- Backward recursive computation is needed because we must estimate $D_j^*$ for $j > k$ when using $U_k$ for estimation.
- When some parameters are shared across stages, we can pool their estimators from the stage-specific estimations based on their covariance matrix.
- Homework: what is the most efficient pooled estimator of $\theta$ if $\hat{\theta}_1, \ldots, \hat{\theta}_m$ all converge to $\theta$ and their joint asymptotic covariance matrix is $\Sigma$?

- If we only keep the $k$-th term in $U_k$ in the G-estimation, the resulting method is called A-learning; A-learning is often less efficient. A single-stage sketch of this type of estimating equation is given below.
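As a concrete illustration, here is a hedged sketch of the single-stage case (which coincides with A-learning). It assumes a linear blip $c(H, a; \theta) = a\,\theta^\top \tilde{H}$ with treatment $A \in \{0, 1\}$, a known randomization probability $\pi = 0.5$, $S(A) = A \tilde{H}$, and a linear working model $\beta^\top \tilde{H}$ for the treatment-free mean $E[R - A\,\theta^\top \tilde{H} \mid H]$. Because the H-measurable part of $G(\theta)$ cancels in $G(\theta) - E[G(\theta) \mid H]$, both estimating equations become linear in $(\theta, \beta)$ and can be solved directly. All data and coefficients are simulated and hypothetical.

```python
# Single-stage G-estimation / A-learning sketch with a linear blip
# c(H, a; theta) = a * theta'h (h includes an intercept), A in {0, 1},
# known randomization probability pi = 0.5, and a linear working model
# beta'h for the treatment-free mean E[R - A * theta'h | H].
import numpy as np

rng = np.random.default_rng(1)
n = 2000
H = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + covariate
A = rng.binomial(1, 0.5, size=n)                         # randomized treatment
theta_true = np.array([0.5, -1.0])                       # true blip parameters
R = H @ np.array([1.0, 0.3]) + A * (H @ theta_true) + rng.normal(scale=0.5, size=n)

pi = 0.5
W = (A - pi)[:, None] * H        # S(A) - E[S(A)|H] with S(A) = A * h
AH = A[:, None] * H

# theta-equation:  sum_i W_i (R_i - A_i theta'h_i - beta'h_i) = 0
# beta-equation:   sum_i h_i (R_i - A_i theta'h_i - beta'h_i) = 0
M = np.block([[W.T @ AH, W.T @ H],
              [H.T @ AH, H.T @ H]])
b = np.concatenate([W.T @ R, H.T @ R])
theta_hat, beta_hat = np.split(np.linalg.solve(M, b), 2)

# estimated rule: treat (A = 1) when the estimated blip theta'h is positive
d_hat = (H @ theta_hat > 0).astype(int)
print("theta_hat:", theta_hat)    # should be close to theta_true
```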

Marginal structural models for estimating optimal rules

- These methods directly model the decision function for the treatment rule and construct marginal estimating equations for estimation.
- Consider a single-stage decision and assume $D(H) = \operatorname{sign}(g(H; \theta))$.
- A general procedure for estimation is as follows:
  - For a given rule parameter $\theta$, estimate its associated value using the inverse probability weighting method or its augmentation:
$$\hat{V}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{I(A_i g(H_i;\theta) > 0)\, R_i}{\pi(A_i \mid H_i)} - \frac{I(A_i g(H_i;\theta) > 0) - \pi(A_i \mid H_i)}{\pi(A_i \mid H_i)}\, E\!\left[ R_i(\operatorname{sign}(g(H_i;\theta))) \mid H_i \right] \right].$$

- We maximize $\hat{V}(\theta)$ over $\theta$ to estimate the rule parameter (a sketch is given below).
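A minimal sketch of this value-search idea, assuming a single stage, a randomized treatment with known $\pi = 1/2$, a one-covariate linear rule $\operatorname{sign}(\theta_0 + \theta_1 X)$, the plain IPW estimator (the augmentation term is omitted), and a crude grid search over normalized rule parameters. All data and names are simulated and hypothetical.

```python
# Maximize the IPW value estimate over the rule D(H) = sign(theta0 + theta1 * X).
# Simulated data; pi is known because treatment is randomized.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)
R = 1 + X + A * X + rng.normal(scale=0.5, size=n)   # optimal rule: sign(X)
pi = 0.5                                            # P(A = a | H)

def ipw_value(theta0, theta1):
    """IPW estimate of the value of the rule sign(theta0 + theta1 * X)."""
    d = np.sign(theta0 + theta1 * X)
    follow = (A * d > 0)                            # treatment agrees with the rule
    return np.mean(follow * R / pi)

# the value is scale-invariant in (theta0, theta1), so search the unit circle
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
best = max(((np.cos(t), np.sin(t)) for t in angles), key=lambda th: ipw_value(*th))
print("estimated rule parameters:", best)           # roughly proportional to (0, 1)
```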

A few notes for MSM

- $\pi(A_i \mid H_i)$ and $E[R_i(\operatorname{sign}(g(H_i;\theta))) \mid H_i]$ can be estimated using statistical models, and $\hat{V}(\theta)$ enjoys double robustness.
- The extension to multiple stages is similar but more complicated.
- Like SNMMs, MSMs can be viewed as extending their original use for average causal effects to personalized treatment effects, sometimes called heterogeneous treatment effects (HTEs).
