Continuous State-Space Models for Optimal Sepsis Treatment - A Deep Reinforcement Learning Approach
Aniruddh Raghu [email protected]
Matthieu Komorowski [email protected]
Leo Anthony Celi [email protected]
Peter Szolovits [email protected]
Marzyeh Ghassemi [email protected]
Computer Science and Artificial Intelligence Lab, MIT, Cambridge, MA

Abstract

Sepsis is a leading cause of mortality in intensive care units (ICUs) and costs hospitals billions annually. Treating a septic patient is highly challenging, because individual patients respond very differently to medical interventions and there is no universally agreed-upon treatment for sepsis. Understanding more about a patient's physiological state at a given time could hold the key to effective treatment policies. In this work, we propose a new approach to deduce optimal treatment policies for septic patients by using continuous state-space models and deep reinforcement learning. Learning treatment policies over continuous spaces is important, because we retain more of the patient's physiological information. Our model is able to learn clinically interpretable treatment policies, similar in important aspects to the treatment policies of physicians. Evaluating our algorithm on past ICU patient data, we find that our model could reduce patient mortality in the hospital by up to 3.6% over observed clinical policies, from a baseline mortality of 13.7%. The learned treatment policies could be used to aid intensive care clinicians in medical decision making and improve the likelihood of patient survival.

1. Introduction

Sepsis (severe infections with organ failure) is a dangerous condition that costs hospitals billions of pounds in the UK alone (Vincent et al., 2006), and is a leading cause of patient mortality (Cohen et al., 2006). The clinicians' task of deciding treatment type and dosage for individual patients is highly challenging. Besides antibiotics and infection source control, a cornerstone of the management of severe infections is the administration of intravenous fluids to correct hypovolemia. This may be followed by the administration of vasopressors to counteract sepsis-induced vasodilation. Varying fluid and vasopressor treatment strategies have been shown to lead to extreme variations in patient mortality, which demonstrates how critical these decisions are (Waechter et al., 2014). While international efforts attempt to provide general guidance for treating sepsis, physicians at the bedside still lack efficient tools to provide individualized, real-time decision support (Rhodes et al., 2017). As a consequence, individual clinicians vary treatment in many ways, e.g., the amount and type of fluids used, the timing of initiation and dosing of vasopressors, which antibiotics are given, and whether to administer corticosteroids. In this work, we propose a data-driven approach to discover optimal sepsis treatment strategies. We use deep reinforcement learning (RL) algorithms to identify how best to treat septic patients in the intensive care unit (ICU) to improve their chances of survival.
While RL has been used successfully in complex decision-making tasks (Mnih et al., 2015; Silver et al., 2016), its application to clinical models has thus far been limited by data availability (Nemati et al., 2016) and the inherent difficulty of defining clinical state and action spaces (Prasad et al., 2017; Komorowski et al., 2016). Nevertheless, RL algorithms have many desirable properties for the problem of deducing high-quality treatments. Their intrinsic design for sparse reward signals makes them well suited to handling the stochasticity in patient responses to medical interventions and the delayed indications of treatment efficacy. Importantly, RL algorithms also allow us to infer optimal strategies from suboptimal training examples.

In this work, we demonstrate how to surmount the modeling challenges present in the medical environment and use RL to successfully deduce optimal treatment policies for septic patients.[1] We focus on continuous state-space modeling, represent a patient's physiological state at a point in time as a continuous vector (using either raw physiological data or sparse latent state representations), and find optimal actions with Deep Q-learning (Mnih et al., 2015). Motivating this approach is the fact that physiological data collected from ICU patients provide very rich representations of a patient's physical state, allowing for the discovery of interpretable and high-quality policies. In particular, we:

1. Propose deep reinforcement learning models with continuous state spaces, improving on earlier work with discretized models.
2. Identify treatment policies that could improve patient outcomes, potentially reducing patient mortality in the hospital by 1.8-3.6%, from a baseline mortality of 13.7%.
3. Investigate the learned policies for clinical interpretability and potential use as a clinical decision support tool.

[1] Either patients who develop sepsis in their ICU stay, or those who are already septic at the start of their stay.

2. Background and Related Work

In this section we outline the important reinforcement learning algorithms used in the paper and motivate our approach in comparison to prior work.

2.1 Reinforcement Learning

Reinforcement learning (RL) models time-varying state spaces with a Markov Decision Process (MDP), in which at every timestep $t$ an agent observes the current state of the environment $s_t$, takes an action $a_t$ from the allowable set of actions $A = \{1, \ldots, M\}$, receives a reward $r_t$, and then transitions to a new state $s_{t+1}$. The agent selects actions at each timestep that maximize its expected discounted future reward, or return, defined as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $\gamma$ captures the tradeoff between immediate and future rewards, and $T$ is the terminal timestep. The optimal action-value function $Q^*(s, a)$ is the maximum discounted expected reward obtained after executing action $a$ in state $s$; that is, performing $a$ in state $s$ and proceeding optimally from this point onwards. More concretely, $Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$, also known as the policy, is a mapping from states to actions. The optimal value function is defined as $V^*(s) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, \pi]$, where we act according to $\pi$ throughout.

In Q-learning, the optimal action-value function is estimated using the Bellman equation, $Q^*(s, a) = \mathbb{E}_{s' \sim T(s' \mid s, a)}[r + \gamma \max_{a'} Q^*(s', a') \mid s_t = s, a_t = a]$, where $T(s' \mid s, a)$ is the state transition distribution.
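To make the return and the Bellman backup concrete, the following is a minimal tabular sketch, not the model used in this paper (which operates over continuous states with a neural network); the state indices, table sizes, learning rate, and transition values are illustrative assumptions.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # R_t for t = 0: sum over t' of gamma^(t' - t) * r_{t'}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # One off-policy Q-learning step on a sampled transition (s, a, r, s').
    # The Bellman target bootstraps with max_{a'} Q(s', a'), regardless of
    # which action the behaviour policy actually took next.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Toy usage with made-up dimensions: 100 discrete states, 25 actions.
Q = np.zeros((100, 25))
q_learning_update(Q, s=3, a=7, r=0.0, s_next=4, done=False)
```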
Learning proceeds either with value iteration (Sutton and Barto, 1998) or by directly approximating $Q^*(s, a)$ with a function approximator (such as a neural network) trained via stochastic gradient descent. Note that Q-learning is an off-policy algorithm: the optimal action-value function is learned from samples $(s, a, r, s')$ that are generated while exploring the state space. An alternative to Q-learning is the SARSA algorithm (Rummery and Niranjan, 1994), an on-policy method that learns $Q^{\pi}(s, a)$, the action-value function obtained when taking action $a$ in state $s$ at time $t$ and then proceeding according to policy $\pi$ afterwards.

In this work, the state $s_t$ is a patient's physiological state, either in raw form (as discussed in Section 3.2) or as a latent representation. The action space $A$ has size 25 and is discretized over doses of vasopressors and IV fluids, two drugs commonly given to septic patients, detailed further in Section 3.3. The reward $r_t$ is $\pm R_{\max}$ at terminal timesteps and zero otherwise, with the positive reward issued when a patient survives. At every timestep, the agent is trained to take the action $a_t$ with the highest Q-value, aiming to increase the chance of patient survival (an illustrative sketch of this action and reward encoding is given at the end of this section).

2.2 Reinforcement Learning in Health

Much prior work in clinical machine learning has focused on supervised learning techniques for diagnosis (Esteva et al., 2017) and risk stratification (Razavian et al., 2015). The incorporation of time in a supervised setting can be implicit in the feature-space construction (Hug and Szolovits, 2009; Joshi and Szolovits, 2012), or captured with multiple models for different timepoints (Fialho et al., 2013; Ghassemi et al., 2014). We prefer RL over supervised learning for sepsis treatment because there is no agreed-upon ground truth for a "good" treatment strategy in the medical literature (Marik, 2015). Importantly, RL algorithms also allow us to infer optimal strategies from training examples that do not represent optimal behavior. RL is well suited to identifying ideal sepsis treatment strategies because clinicians face a sparse, time-delayed reward signal in septic patients, and the optimal treatment strategy may differ across patients.

Nemati et al. (2016) applied deep RL techniques to model ICU heparin dosing as a Partially Observable Markov Decision Process (POMDP), using both discriminative Hidden Markov Models and Q-networks to discover the optimal policy. Their investigation was made more challenging by the relatively small amount of available data. Shortreed et al. (2011) learned optimal treatment policies for schizophrenic patients, and quantified the uncertainty around the expected outcome for patients who followed the policies. Prasad et al. (2017) use off-policy reinforcement learning algorithms to determine ICU strategies for mechanical ventilation administration.
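As a concrete illustration of the action and reward encoding described in Section 2.1, the snippet below sketches one plausible way to map (IV fluid dose, vasopressor dose) pairs onto 25 discrete actions and to assign the terminal reward. The bin edges, the value of R_max, and the helper names are assumptions for illustration only; the actual discretization used in this work is defined in Section 3.3.

```python
import numpy as np

# Hypothetical per-drug dose bins (5 each): bin 0 means "no drug given",
# bins 1-4 are increasing dose ranges. These edges are placeholders, not
# the paper's actual bin boundaries.
IV_FLUID_EDGES = [0.0, 50.0, 180.0, 530.0]    # mL over the timestep (assumed)
VASOPRESSOR_EDGES = [0.0, 0.08, 0.22, 0.45]   # mcg/kg/min (assumed)
R_MAX = 15.0                                  # terminal reward magnitude (assumed)

def dose_to_bin(dose, edges):
    # Map a continuous dose to a bin in {0, ..., 4}; 0 means no drug given.
    if dose <= 0.0:
        return 0
    return 1 + int(np.searchsorted(edges[1:], dose))

def encode_action(iv_fluid_dose, vasopressor_dose):
    # Combine the two 5-level dose bins into a single action in {0, ..., 24}.
    return 5 * dose_to_bin(iv_fluid_dose, IV_FLUID_EDGES) \
             + dose_to_bin(vasopressor_dose, VASOPRESSOR_EDGES)

def reward(is_terminal, survived):
    # +/- R_max at the terminal timestep, zero otherwise.
    if not is_terminal:
        return 0.0
    return R_MAX if survived else -R_MAX

action = encode_action(iv_fluid_dose=250.0, vasopressor_dose=0.1)  # one of 25 actions
```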