
Transfer Learning in Deep Reinforcement Learning: A Survey

Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou

Zhuangdi Zhu and Jiayu Zhou are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48823. E-mail: [email protected], [email protected]. Kaixiang Lin is with the Amazon Alexa AI. E-mail: [email protected].

Abstract—Reinforcement Learning (RL) is a key technique to address sequential decision-making problems and is crucial to realize advanced artificial intelligence. Recent years have witnessed remarkable progress in RL by virtue of the fast development of deep neural networks. Along with the promising prospects of RL in numerous domains, such as robotics and game-playing, transfer learning has arisen as an important technique to tackle various challenges faced by RL, by transferring knowledge from external expertise to accelerate the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible RL backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the RL perspective and explore their potential challenges as well as open questions that await future research progress.

Index Terms—Transfer Learning, Reinforcement Learning, Deep Reinforcement Learning, Survey.

1 INTRODUCTION

Reinforcement Learning (RL) is an effective framework to solve sequential decision-making tasks, where a learning agent interacts with the environment to improve its performance through trial and error [1]. Originated from cybernetics and thriving in Computer Science, RL has been widely applied to tackle challenging tasks which were previously intractable [2, 3].

As a pioneering technique for realizing advanced artificial intelligence, traditional RL was mostly designed for tabular cases, which provided principled solutions to simple tasks but faced difficulties when handling highly complex domains, e.g. tasks with 3D environments. Over the recent years, an integrated framework, where an RL agent is built upon deep neural networks, has been developed to address more challenging tasks. The combination of deep learning with RL is hence referred to as Deep Reinforcement Learning (DRL) [4], which aims to address complex domains that were otherwise unresolvable by building deep, powerful function approximators. DRL has achieved notable success in applications such as robotics control [5, 6] and game playing [7]. It also has promising prospects in domains such as health informatics [8], electricity networks [9], and intelligent transportation systems [10, 11], to name just a few.

Besides its remarkable advancement, RL still faces intriguing difficulties induced by the exploration-exploitation dilemma [1]. Specifically, for practical RL, the environment dynamics are usually unknown, and the agent cannot exploit its knowledge about the environment to improve its performance until enough interaction experiences are collected via exploration. Due to partial observability, sparse feedback, and the high dimension of state and action spaces, acquiring sufficient interaction samples can be prohibitive, and may even incur safety concerns in domains such as autonomous driving and health informatics, where the consequences of wrong decisions can be too severe. These challenges have motivated various efforts to improve the current RL procedure. As a result, Transfer Learning (TL), a technique that utilizes external expertise from other tasks to benefit the learning process of the target task, has become a crucial topic in RL.

TL techniques have been extensively studied in the supervised learning domain [12], whereas they remain an emerging topic in RL. In fact, TL under the framework of RL can be more complicated in that the knowledge needs to transfer in the context of a Markov Decision Process (MDP). Moreover, due to the delicate components of an MDP, the expert knowledge may take different forms, which need to transfer in different ways. Noticing that previous efforts on summarizing TL for RL have not covered its most recent advancement [13, 14], in this survey we make a comprehensive investigation of Transfer Learning in Deep Reinforcement Learning. Especially, we build a systematic framework to categorize the state-of-the-art TL techniques into different sub-topics, review their theories and applications, and analyze their inter-connections.

The rest of this survey is organized as follows: In Section 2, we introduce the preliminaries of RL and its key algorithms, including those recently designed based on deep neural networks. Next, we clarify the definition of TL in the context of RL and discuss its relevant research topics (Section 2.4). In Section 3, we provide a framework to categorize TL approaches from multiple perspectives, analyze their fundamental differences, and summarize their evaluation metrics (Section 3.3). In Section 4, we elaborate on different TL approaches in the context of DRL, organized by the format of transferred knowledge, such as reward shaping (Section 4.1), learning from demonstrations (Section 4.2), or learning from teacher policies (Section 4.3). We also investigate TL approaches by the way that knowledge transfer occurs, such as inter-task mapping (Section 4.4) or learning transferrable representations (Section 4.5). We discuss the recent applications of TL in the context of DRL in Section 5 and provide some future perspectives and open questions in Section 6.

2 DEEP REINFORCEMENT LEARNING AND TRANSFER LEARNING

In this section, we provide a brief overview of the recent development in RL and the definitions of some key terminologies. Next, we provide categorizations to organize different TL approaches, then point out some of the other topics in the context of RL, which are relevant to TL but will not be elaborated in this survey.

Remark 1. Without loss of clarity, for the rest of this survey, we refer to MDPs, domains, and tasks equivalently.

2.1 Reinforcement Learning Preliminaries

A typical RL problem can be considered as training an agent to interact with an environment that follows a Markov Decision Process (MDP) [15]. For each interaction with the MDP, the agent starts with an initial state and performs an action accordingly, which yields a reward to guide the agent's actions. Once the action is taken, the MDP transits to the next state by following the underlying transition dynamics of the MDP. The agent accumulates the time-discounted rewards along with its interactions with the MDP. A subsequence of interactions is referred to as an episode. For MDPs with infinite horizons, one can assume that there are absorbing states, such that any action taken upon an absorbing state will only lead to itself and yields zero rewards. All the above-mentioned components of the MDP can be represented using a tuple M = (µ0, S, A, T, γ, R, S0), in which:

• µ0 is the set of initial states.
• S is the state space.
• A is the action space.
• T : S × A × S → R is the transition probability distribution, where T(s'|s, a) specifies the probability of the state transitioning to s' upon taking action a from state s.
• R : S × A × S → R is the reward distribution, where R(s, a, s') is the reward an agent can get by taking action a from state s with the next state being s'.
• γ is a discount factor, with γ ∈ (0, 1].
• S0 is the set of absorbing states.

An RL agent behaves in M by following its policy π, which is a mapping from states to actions: π : S → A. For stochastic policies, π(a|s) denotes the probability for the agent to take action a from state s. Given an MDP M and a policy π, one can derive a value function V^π_M(s), which is defined over the state space:

V^π_M(s) = E[r_0 + γ r_1 + γ² r_2 + ... ; π, s],

where r_i = R(s_i, a_i, s_{i+1}) is the reward that an agent receives by taking action a_i in the i-th state s_i, with the next state transiting to s_{i+1}. The expectation E is taken over s_0 ∼ µ_0, a_i ∼ π(·|s_i), s_{i+1} ∼ T(·|s_i, a_i). The value function estimates the quality of being in state s, by evaluating the expected rewards that an agent can get from s, given that the agent follows policy π in the environment M afterward. Similar to the value function, each policy also carries a Q-function, which is defined over the state-action space to estimate the quality of taking action a from state s:

Q^π_M(s, a) = E_{s'∼T(·|s,a)} [R(s, a, s') + γ V^π_M(s')].

The objective for an RL agent is to learn an optimal policy π*_M to maximize the expectation of accumulated rewards, so that ∀s ∈ S, π*_M(s) = argmax_{a∈A} Q*_M(s, a), where Q*_M(s, a) = sup_π Q^π_M(s, a).

2.2 Reinforcement Learning Algorithms

In this section, we review the key RL algorithms developed over the recent years, which provide cornerstones for the TL approaches discussed in this survey.

Prediction and Control: any RL problem can be disassembled into two subtasks: prediction and control [1]. In the prediction phase, the quality of the current policy is evaluated. In the control phase, which is also referred to as the policy improvement phase, the learning policy is adjusted based on evaluation results from the prediction step. Policies can be improved by iteratively conducting these two steps, which is therefore called policy iteration.

Policy iteration can be model-free, which means that the target policy is optimized without requiring knowledge of the MDP transition dynamics. Traditional model-free RL includes Monte-Carlo methods, which use samples of episodes to estimate the value of each state based on complete episodes starting from that state. Monte-Carlo methods can be on-policy if the samples are collected by following the target policy, or off-policy if the episodic samples are collected by following a behavior policy that is different from the target policy.

Temporal-Difference Learning, or TD-learning for short, is an alternative to Monte-Carlo for solving the prediction problem. The key idea behind TD-learning is to learn the state quality function by bootstrapping. It can also be extended to solve the control problem, so that both the value function and the policy can get improved simultaneously. TD-learning is one of the most widely used RL paradigms due to its simplicity and general applicability. Examples of on-policy TD-learning algorithms include SARSA [16], Expected SARSA [17], Actor-Critic [18], and its deep neural extension named A3C [19]. Off-policy TD-learning approaches include SAC [20] for continuous state-action spaces, and Q-learning [21] for discrete state-action spaces, along with its variants built on deep neural networks, such as DQN [22], Double-DQN [22], Rainbow [23], etc.

TD-learning approaches, such as Q-learning, focus more on estimating the state-action value functions. Policy Gradient, on the other hand, is a mechanism that emphasizes direct optimization of a parametrizable policy. Traditional policy-gradient approaches include REINFORCE [24]. Recent years have witnessed the joint presence of TD-learning and policy-gradient approaches, mostly ascribed to the rapid development of deep neural networks. Representative algorithms along this line include Trust Region Policy Optimization (TRPO) [25], Proximal Policy Optimization (PPO) [26], Deterministic Policy Gradient (DPG) [27] and its extensions, such as DDPG [28] and Twin Delayed DDPG [29].
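To make the prediction-and-control loop concrete, the following minimal sketch (ours, not from the survey) implements tabular Q-learning, the off-policy TD-control algorithm referenced above. The environment interface (reset()/step() returning a (next_state, reward, done) tuple with integer-indexed states and actions) is an assumption for illustration only.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular off-policy TD control (Q-learning) sketch.

    Assumes `env` exposes reset() -> state and
    step(action) -> (next_state, reward, done) with integer states/actions.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy for exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD target bootstraps on the greedy value of the next state
            td_target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q  # greedy policy: pi(s) = argmax_a Q[s, a]
```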

2.3 Transfer Learning in the Context of Reinforcement Learning

Let Ms = {Ms | Ms ∈ Ms} be a set of source domains, which provides prior knowledge Ds that is accessible by the target domain Mt, such that by leveraging the information from Ds, the target agent learns better in the target domain Mt, compared with not utilizing it. We use Ms ∈ Ms to refer to a single source domain. For the simplest case, knowledge can transfer between two agents within the same domain, which results in |Ms| = 1, Ms = Mt. We provide a more concrete description of TL from the RL perspective as the following:

Remark 2. [Transfer Learning in the Context of Reinforcement Learning] Given a set of source domains Ms = {Ms | Ms ∈ Ms} and a target domain Mt, Transfer Learning aims to learn an optimal policy π* for the target domain, by leveraging exterior information Ds from Ms as well as interior information Dt from Mt, s.t.:

π* = argmax_π E_{s∼µ_0^t, a∼π} [Q^π_{Mt}(s, a)],

where π = φ(Ds ∼ Ms, Dt ∼ Mt) : S^t → A^t is a function mapping from states to actions for the target domain Mt, learned based on information from both Dt and Ds.

In the above definition, we use φ(D) to denote the learned policy based on information D. Especially in the context of DRL, the policy π is learned using deep neural networks. One can consider regular RL without transfer learning as a special case of the above definition by treating Ds = ∅, so that a policy π is learned purely on the feedback provided by the target domain, i.e. π = φ(Dt).

2.4 Related Topics

In addition to TL, other efforts have been made to benefit RL by leveraging different forms of supervision, usually under different problem settings. In this section, we briefly discuss other techniques that are relevant to TL, by analyzing the differences, as well as the connections between TL and these relevant techniques, which we hope can further clarify the scope of this survey.

Imitation Learning, also known as Apprenticeship Learning, aims to train a policy to mimic the behavior of an expert policy, given that only a few demonstrations from that expert are accessible. It is considered as an alternative to RL for solving sequential decision-making problems when the environment feedback is unavailable [30, 31, 32]. There are currently two main paradigms for imitation learning. The first one is Behavior Cloning, in which a policy is trained in a supervised-learning manner, without access to any reinforcement learning signal [33]. The second one is Inverse Reinforcement Learning, in which the goal of imitation learning is to recover a reward function of the domain that can explain the behavior of the expert demonstrator [34]. Imitation Learning is closely related to TL and has been adapted as a TL approach called Learning from Demonstrations (LfD), which will be elaborated in Section 4.2. What distinguishes LfD from the classic Imitation Learning approaches is that LfD still interacts with the domain to access reward signals, in the hope of improving the target policy assisted by a few expert demonstrations, rather than recovering the ground-truth reward functions or the expert behavior. LfD can be more effective than IL when the expert demonstrations are actually sub-optimal [35, 36].

Lifelong Learning, or Continual Learning, refers to the ability to learn multiple tasks that are temporally or spatially related, given a sequence of non-stationary information. The key to acquiring Lifelong Learning is a tradeoff between obtaining new information over time and retaining the previously learned knowledge across new tasks. Lifelong Learning is a technique that is applicable to both supervised learning [37] and RL [38, 39], and is also closely related to the topic of Meta Learning [40]. Lifelong Learning can be a more challenging task compared to TL, mainly because it requires an agent to transfer knowledge across a sequence of dynamically-changing tasks which cannot be foreseen, rather than performing knowledge transfer among a fixed group of tasks. Moreover, the ability of automatic task detection can also be a requirement for Lifelong Learning [41], whereas for TL the agent is usually notified of the emergence of a new task.

Hierarchical Reinforcement Learning (HRL) has been proposed to resolve real-world tasks that are hierarchical. Different from traditional RL, in an HRL setting, the action space is grouped into different granularities to form higher-level macro actions. Accordingly, the learning task is also decomposed into hierarchically dependent subgoals. The most well-known HRL frameworks include Feudal learning [42], the Options framework [43], Hierarchical Abstract Machines [44], and MAXQ [45]. Given the higher-level abstraction on tasks, actions, and state spaces, HRL can facilitate knowledge transfer across similar domains. In this survey, however, we focus on discussing approaches of TL for general RL tasks rather than HRL.

Multi-Agent Reinforcement Learning (MARL) has strong connections with Game Theory [46]. Different from single-agent RL, MARL considers an MDP with multiple agents acting simultaneously in the environment. It aims to solve problems that were difficult or infeasible to be addressed by a single RL agent [47]. The interactive mode for multiple agents can either be independent, cooperative, competitive, or even a hybrid setting [48]. Approaches of knowledge transfer for MARL fall into two classes: inter-agent transfer and intra-agent transfer. We refer readers to [49] for a more comprehensive survey under this problem setting. Different from their perspective, this survey emphasizes general TL approaches for the single-agent scenario, although approaches mentioned in this survey may also be applicable to multi-agent MDPs.
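The skeleton below is our own illustrative sketch of the setting in Remark 2, not an algorithm from the survey: the target policy is a function φ(Ds, Dt) of exterior knowledge and interior target-domain data. The callbacks `init_policy` and `update_policy` are hypothetical placeholders for a concrete TL method (e.g. reward shaping or learning from demonstrations), and the environment interface is assumed.

```python
from typing import Any, Callable, List, Sequence, Tuple

Transition = Tuple[Any, Any, float, Any]  # (s, a, r, s')

def transfer_learn(source_knowledge: Sequence[Any],
                   target_env,
                   init_policy: Callable,
                   update_policy: Callable,
                   n_iterations: int = 100):
    """Schematic TL loop: pi = phi(D_s, D_t).

    `source_knowledge` stands for D_s (demonstrations, teacher policies, ...),
    while D_t is collected online from the target domain.
    """
    policy = init_policy(source_knowledge)      # e.g. offline pre-training on D_s
    target_data: List[Transition] = []          # D_t, gathered by interaction
    for _ in range(n_iterations):
        s, done = target_env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = target_env.step(a)
            target_data.append((s, a, r, s_next))
            s = s_next
        # online update uses both the exterior and the interior information
        policy = update_policy(policy, source_knowledge, target_data)
    return policy
```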

3 ANALYZING TRANSFER LEARNING FROM MULTIPLE PERSPECTIVES

In this section, we provide multi-perspective criteria to analyze TL approaches in the context of RL and introduce metrics for evaluations.

3.1 Categorization of Transfer Learning Approaches

We point out that TL approaches can be categorized by answering the following key questions:

1) What knowledge is transferred: Knowledge from the source domain can take different forms of supervision, such as a set of expert experiences [50], the action probability distribution of an expert policy [51], or even a potential function that estimates the quality of state and action pairs in the source or target MDP [52]. The divergence in knowledge representations and granularities fundamentally decides the way that TL is performed. The quality of the transferred knowledge, e.g. whether it comes from an oracle policy [53] or is provided by a sub-optimal teacher [36], also affects the way TL methods are designed.

2) What RL frameworks are compatible with the TL approach: We can rephrase this question into other forms, e.g., is the TL approach policy-agnostic, or does it only apply to certain types of RL backbones, such as the Temporal Difference (TD) methods? Answers to this question are closely related to the format of the transferred knowledge. For example, transferring knowledge in the form of expert demonstrations is usually policy-agnostic (see Section 4.2), while policy distillation, as will be discussed in Section 4.3, may not be suitable for RL algorithms such as DQN, which does not explicitly learn a policy function.

3) What is the difference between the source and the target domain: As discussed in Section 2.3, the source domain Ms is the place where the prior knowledge comes from, and the target domain Mt is where the knowledge is transferred to. Some TL approaches are suitable for the scenario where Ms and Mt are equivalent, whereas others are designed to transfer knowledge between different domains. For example, in video gaming tasks where observations are RGB pixels, Ms and Mt may share the same action space (A) but differ in their observation spaces (S). For other problem settings, such as goal-conditioned RL [54], the two domains may differ only by the reward distribution: Rs ≠ Rt. Such domain difference induces difficulties in transfer learning and affects how much knowledge can transfer.

4) What information is available in the target domain: While the cost of accessing knowledge from source domains is usually considered cheaper, it can be prohibitive for the learning agent to access the target domain, or the learning agent can only have a very limited number of environment interactions due to a high sampling cost. Examples for this scenario include learning an auto-driving agent after training it in simulated platforms [55], or training a navigation robot using simulated image inputs before adapting it to real environments [56]. The accessibility of information in the target domain can affect the way that TL approaches are designed.

5) How sample-efficient the TL approach is: This question is related to question 4 regarding the accessibility of the target domain. Compared with training from scratch, TL enables the learning agent with better initial performance, which usually needs fewer interactions with the target domain to converge to a good policy, guided by the transferred knowledge. Based on the number of interactions needed to enable TL, we can categorize TL techniques into the following classes: (i) Zero-shot transfer, which learns an agent that is directly applicable to the target domain without requiring any interactions with it; (ii) Few-shot transfer, which only requires a few samples (interactions) from the target domain; and (iii) Sample-efficient transfer, where an agent can benefit from TL to learn faster with fewer interactions and is therefore still more sample-efficient compared to RL without any transfer learning.

6) What are the goals of TL: We can answer this question by analyzing two aspects of a TL approach: (i) the evaluation metrics and (ii) the objective function. Evaluation metrics can vary from the asymptotic performance to the training iterations used to reach a certain performance threshold, which implies the different emphasis of the TL approach. On the other hand, TL approaches may optimize towards various objective functions augmented with different regularizations, which usually hinges on the format of the transferred knowledge. For example, maximizing the policy entropy can be combined with the maximum-return learning objective in order to encourage exploration when the transferred knowledge consists of imperfect demonstrations [57].

3.2 Case Analysis of Transfer Learning

In this section, we use HalfCheetah¹, one of the standard RL benchmarks for solving physical locomotion tasks, as a running example to illustrate how transfer learning can be performed between the source and the target domain. As shown in Figure 1, the objective of HalfCheetah is to train a two-leg agent to run as fast as possible without losing control of itself.

1. https://gym.openai.com/envs/HalfCheetah-v2/

Fig. 1: An illustration of the HalfCheetah domain. The learning agent aims to move forward as fast as possible without losing its balance.

3.2.1 Potential Domain Differences:

During TL, the differences between the source and target domain may reside in any component that forms an MDP. The source domain and the target domain can be different in any of the following aspects:

• S (State space): domains can be made different by extending or constraining the available positions for the agent to move.
• A (Action space): can be adjusted by changing the range of available torques for the thigh, shin, or foot of the agent.
• R (Reward function): a task can be simplified by using only the distance moved forward as rewards, or be complicated by using the scale of accelerated velocity in each direction as extra penalty costs.
• T (Transition dynamics): two domains can differ by following different physical rules, leading to different transition probabilities given the same state-action pairs.
• µ0 (Initial states): the source and target domains may have different initial states, specifying where and with what posture the agent can start moving.
• τ (Trajectories): the source and target domains may allow a different number of steps for the agent to move before a task is done.

3.2.2 Transferrable Knowledge:

We list the following transferrable knowledge, assuming that the source and target domains are variants of the HalfCheetah benchmark, although other forms of knowledge transfer may also be feasible:

• Demonstrated trajectories: the target agent can learn from the behavior of a pre-trained expert, e.g. a sequence of running demonstrations.
• Model dynamics: the learning agent may access an approximation model of the physical dynamics, which is learned from the source domain but also applicable in the target domain. The agent can therefore perform dynamic programming based on the physical rules, running as fast as possible while avoiding losing its control due to the accelerated velocity.
• Teacher policies: an expert policy may be consulted by the learning agent, which outputs the probability of taking different actions upon a given state example.
• Teacher value functions: besides the teacher policy, the learning agent may also refer to the value function derived by a teacher policy, which implies what state-actions are good or bad from the teacher's point of view.

3.3 Evaluation Metrics

We enumerate the following representative metrics for evaluating TL approaches, some of which have also been summarized in prior work [58], [13]:

• Jumpstart Performance (jp): the initial performance (returns) of the agent.
• Asymptotic Performance (ap): the ultimate performance (returns) of the agent.
• Accumulated Rewards (ar): the area under the learning curve of the agent.
• Transfer Ratio (tr): the ratio between the ap of the agent with TL and the ap of the agent without TL.
• Time to Threshold (tt): the learning time (iterations) needed for the target agent to reach a certain performance threshold.
• Performance with Fixed Training Epochs (pe): the performance achieved by the target agent after a specific number of training iterations.
• Performance Sensitivity (ps): the variance in returns using different hyper-parameter settings.

The above criteria mainly focus on the learning process of the target agent. In addition, we introduce the following metrics from the perspective of the transferred knowledge, which, although commensurately important for evaluation, have not been explicitly discussed by prior art:

• Necessary Knowledge Amount (nka): the necessary amount of knowledge required for TL in order to achieve certain performance thresholds. Examples along this line include the number of designed source tasks, the number of expert policies, or the number of demonstrated interactions required to enable knowledge transfer.
• Necessary Knowledge Quality (nkq): the necessary quality of the knowledge required to enable effective TL. This metric helps in answering questions such as (i) does the TL approach rely on near-oracle knowledge from the source domain, such as expert-level demonstrations/policies, or (ii) is the TL technique feasible even given sub-optimal knowledge?

Metrics from the perspective of transferred knowledge are harder to standardize, because TL approaches differ in various perspectives, including the forms of transferred knowledge, the RL frameworks utilized to enable such transfer, and the difference between the source and the target domains. Comparing TL approaches from just one viewpoint may lead to biased evaluations. However, we believe that explicating these knowledge-related metrics will help in designing more generalizable and efficient TL approaches.

In general, most of the abovementioned metrics can be considered as evaluating two abilities of a TL approach: Mastery and Generalization. Mastery refers to how well the learned agent can ultimately perform in the target domain, while Generalization refers to the ability of the learning agent to quickly adapt to the target domain assisted by the transferred knowledge. Metrics such as ap, ar, and tr evaluate the ability of Mastery, whereas metrics such as jp, ps, nka, and nkq emphasize more the ability of Generalization. A metric such as tt can measure either the Mastery ability or the Generalization ability, depending on the choice of threshold: tt with a threshold approaching the optimum emphasizes more the Mastery, while a lower threshold may focus on the Generalization ability. Equivalently, pe can also focus on either side depending on the choice of the number of training epochs.
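The helper below (our own sketch, not part of the survey) shows how several of the above metrics can be computed from two learning curves, one with and one without transfer. The array names and the assumption that both curves share the same evaluation schedule are ours.

```python
import numpy as np

def transfer_metrics(returns_tl, returns_scratch, threshold):
    """Compute jp, ap, ar, tr, and tt (Section 3.3) from per-iteration returns."""
    returns_tl = np.asarray(returns_tl, dtype=float)
    returns_scratch = np.asarray(returns_scratch, dtype=float)

    jumpstart = returns_tl[0] - returns_scratch[0]            # jp
    asymptotic = returns_tl[-1]                               # ap
    accumulated = np.trapz(returns_tl)                        # ar: area under the curve
    transfer_ratio = returns_tl[-1] / returns_scratch[-1]     # tr
    # tt: first iteration at which the transfer agent reaches the threshold
    above = np.nonzero(returns_tl >= threshold)[0]
    time_to_threshold = int(above[0]) if len(above) else None
    return dict(jp=jumpstart, ap=asymptotic, ar=accumulated,
                tr=transfer_ratio, tt=time_to_threshold)
```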

Fig. 2: An overview of different TL approaches, organized by the format of transferred knowledge.

4 TRANSFER LEARNING APPROACHES

In this section, we elaborate on various TL approaches and organize them into different sub-topics, mostly by answering the question of "what knowledge is transferred". For each type of TL approach, we investigate them by following the other criteria mentioned in Section 3. We start with the Reward Shaping approach (Section 4.1), which is generally applicable to different RL algorithms while requiring minimal changes to the underlying RL framework, and overlaps with the other TL approaches discussed in this chapter. We also provide an overview of the different TL approaches discussed in this survey in Figure 2.

4.1 Reward Shaping

Reward Shaping (RS) is a technique that leverages exterior knowledge to reconstruct the reward distributions of the target domain to guide the agent's policy learning. More specifically, in addition to the environment reward signals, RS learns a reward-shaping function F : S × S × A → R to render auxiliary rewards, provided that the additional rewards contain external knowledge to guide the agent towards better action selections. Intuitively, an RS strategy will assign higher rewards to more beneficial state-actions, which can navigate the agent to desired trajectories. As a result, the agent will learn its policy using the newly shaped rewards R' = R + F, which means that RS has altered the target domain with a different reward function:

M = (S, A, T, γ, R) → M' = (S, A, T, γ, R').

Along the line of RS, Potential-based Reward Shaping (PBRS) is one of the most classical approaches. [52] proposed PBRS to form a shaping function F as the difference between two potential functions (Φ(·)):

F(s, a, s') = γΦ(s') − Φ(s),

where the potential function Φ(·) comes from the knowledge of expertise and evaluates the quality of a given state. The structure of the potential difference addresses a cycle dilemma mentioned in [59], in which an agent can get positive rewards by following a sequence of states which forms a cycle {s1, s2, s3, ..., sn, s1} with F(s1, a1, s2) + F(s2, a2, s3) + ··· + F(sn, an, s1) > 0. Potential-based reward shaping avoids this issue by making any state cycle meaningless, with Σ_{i=1}^{n−1} F(si, ai, si+1) ≤ −F(sn, an, s1) ≤ 0. It has been proved that, without further restrictions on the underlying MDP or the shaping function F, PBRS is sufficient and necessary to preserve the policy invariance. Moreover, the optimal Q-functions in the original and transformed MDPs are related by the potential function:

Q*_{M'}(s, a) = Q*_M(s, a) − Φ(s),   (1)

which draws a connection between potential-based reward shaping and advantage-based learning approaches [60].

The idea of PBRS was extended by [61], which formulated the potential as a function over both the state and the action space. This approach is called Potential Based state-action Advice (PBA). The potential function Φ(s, a) therefore evaluates how beneficial an action a is to take from state s:

F(s, a, s', a') = γΦ(s', a') − Φ(s, a).   (2)

One limitation of PBA is that it requires on-policy learning, which can be sample-inefficient, as in Equation (2) a' is the action to take upon the next state s', to which s transitions by following the learning policy. Similar to Equation (1), the optimal Q-functions in both MDPs are connected by the difference of potentials: Q*_{M'}(s, a) = Q*_M(s, a) − Φ(s, a). Once the optimal policy in M' is learned, the optimal policy in M can be recovered:

π*_M(s) = argmax_{a∈A} (Q*_{M'}(s, a) + Φ(s, a)).

Traditional RS approaches assumed a static potential function, until [62] proposed a Dynamic Potential Based (DPB) approach which makes the potential a function of both states and time: F(s, t, s', t') = γΦ(s', t') − Φ(s, t). They proved that this dynamic approach can still maintain policy invariance: Q*_{M'}(s, a) = Q*_M(s, a) − Φ(s, t), where t is the current timestep. [63] later introduced a way to incorporate any prior knowledge into a dynamic potential function structure, which is called Dynamic Value-Function Advice (DPBA). The underlying rationale of DPBA is that, given any extra reward function R+ from prior knowledge, in order to add this extra reward to the original reward function, the potential function should satisfy:

γΦ(s', a') − Φ(s, a) = F(s, a) = R+(s, a).

If Φ is not static but learned as an extra state-action value function over time, then the Bellman equation for Φ is:

Φ^π(s, a) = r^Φ(s, a) + γΦ(s', a').

The shaping reward F(s, a) is therefore the negation of r^Φ(s, a): F(s, a) = γΦ(s', a') − Φ(s, a) = −r^Φ(s, a). This leads to the approach of using the negation of R+ as the immediate reward to train an extra state-action value function Φ and the policy simultaneously, with r^Φ(s, a) = −R+(s, a). Φ will be updated by a residual term δ(Φ):

Φ(s, a) ← Φ(s, a) + βδ(Φ),

where δ(Φ) = −R+(s, a) + γΦ(s', a') − Φ(s, a), and β is the learning rate. Accordingly, the dynamic potential-based shaping function F becomes:

F_t(s, a) = γΦ_{t+1}(s', a') − Φ_t(s, a).

The advantage of DPBA is that it provides a framework to allow arbitrary knowledge to be shaped as auxiliary rewards.

Efforts along this line mainly focus on designing different shaping functions F(s, a), while little work has addressed the question of what knowledge can be used to derive this potential function. One work by [64] proposed to use RS to transfer an expert policy from the source domain (Ms) to the target domain (Mt). This approach assumed the existence of two mapping functions, MS and MA, which can transform the state and action from the source to the target domain. Then the augmented reward is simply πs((MS(s), MA(a))), which is the probability that the mapped state and action will be taken by the expert policy in the source domain. Another work used demonstrated state-action samples from an expert policy to shape rewards [65]. Learning the augmented reward involves a discriminator, which is trained to distinguish samples generated by an expert policy from samples generated by the target policy. The loss of the discriminator is applied to shape rewards to incentivize the learning agent to mimic the expert behavior. This work is a combination of two TL approaches: RS and Learning from Demonstrations, the latter of which will be elaborated in Section 4.2.

Besides the single-agent and model-free RL scheme, there have been efforts to apply RS to multi-agent RL [66], model-based RL [67], and hierarchical RL [68]. Especially, [66] extended the idea of RS to multi-agent systems, showing that the Nash Equilibria of the underlying stochastic game are unchanged under a potential-based reward shaping structure. [67] applied RS to model-based RL, where the potential function is learned based on the free space assumption, an approach to model transition dynamics in the environment. [68] integrated RS into MAXQ, which is a hierarchical RL algorithm framework, by augmenting the extra reward onto the completion function of MAXQ [45].

The RS approaches discussed so far are built upon a consensus that the source information for shaping the reward comes externally, which coincides with the notion of knowledge transfer. Some work on RS also considers the scenario where the augmented reward comes intrinsically, such as the Belief Reward Shaping proposed by [69], which utilized a Bayesian reward shaping framework to generate a potential value that decays with experience, where the potential value comes from the critic itself.

The above RS approaches are summarized in Table 1. In general, most RS approaches follow the potential-based RS principle that has been developed systematically: from the classical PBRS, which is built on a static potential shaping function of states; to PBA, which generates the potential as a function of both states and actions; and DPB, which learns a dynamic potential function of states and time; to the state-of-the-art DPBA, which involves a dynamic potential function of states and actions learned as an extra state-action value function in parallel with the environment value function. As an effective TL paradigm, RS has been widely applied to fields including robot training [70], spoken dialogue systems [71], and question answering [72]. It provides a feasible framework for transferring knowledge as the augmented reward and is generally applicable to various RL algorithms. How to integrate RS with other TL approaches, such as Learning from Demonstrations (Section 4.2) and Policy Transfer (Section 4.3), to build the potential function for shaping will be an intriguing question for ongoing research.

Methods | MDP Difference | Format of shaping reward | Knowledge source
PBRS | Ms = Mt | F = γΦ(s') − Φ(s) | –
PBA | Ms = Mt | F = γΦ(s', a') − Φ(s, a) | –
DPB | Ms = Mt | F = γΦ(s', t') − Φ(s, t) | –
DPBA | Ms = Mt | F_t = γΦ_{t+1}(s', a') − Φ_t(s, a), Φ learned as an extra Q function | –
[64] | Ss ≠ St, As ≠ At | F_t = γΦ_{t+1}(s', a') − Φ_t(s, a) | π_s
[65] | Ms = Mt | F_t = γΦ_{t+1}(s', a') − Φ_t(s, a) | D_E
TABLE 1: A comparison of reward shaping approaches. "–" denotes that the information is not revealed in the paper.
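The snippet below is our own minimal sketch of potential-based reward shaping as described above, not code from the cited works. The `potential` callable stands for whatever prior knowledge supplies Φ(·); policy invariance holds regardless of its quality.

```python
def pbrs_reward(r_env, s, s_next, potential, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r_env + gamma * Phi(s') - Phi(s).

    `potential` is any state-value estimate coming from prior knowledge
    (e.g. a teacher's value function). Absorbing states carry zero potential.
    """
    phi_next = 0.0 if done else potential(s_next)
    return r_env + gamma * phi_next - potential(s)

# usage sketch inside a training loop:
# shaped_r = pbrs_reward(r, s, s_next, potential=teacher_value, done=done)
# agent.update(s, a, shaped_r, s_next)
```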

Demonstration data have been applied in the Policy Iteration framework by [82]. Later, [78] introduced the Direct Policy Iteration with Demonstrations (DPID) algorithm. This approach samples complete demonstrated rollouts DE from an expert policy πE, in combination with the self-generated rollouts Dπ gathered from the learning agent. Dπ ∪ DE are used to learn a Monte-Carlo estimation of the Q-value, Q̂, from which a learning policy can be derived greedily: π(s) = argmax_{a∈A} Q̂(s, a). This policy π is further regularized by a loss function L(π, πE) to minimize its discrepancy from the expert policy decisions:

L(π, πE) = (1/NE) Σ_{i=1}^{NE} 1{πE(si) ≠ π(si)},

where NE is the number of expert demonstration samples, and 1(·) is an indicator function.

Another work along this line includes the Approximate Policy Iteration with Demonstration (APID) algorithm, which was proposed by [50] and extended by [79]. Different from DPID, where both DE and Dπ were used for value estimation, the APID algorithm applied only Dπ to approximate the Q-function. The expert demonstrations DE are used to learn the value function, which, given any state si, renders expert actions πE(si) with higher Q-value margins compared with other actions that are not shown in DE:

Q(si, πE(si)) − max_{a∈A\πE(si)} Q(si, a) ≥ 1 − ξi.

The term ξi is used to account for the case of imperfect demonstrations. This value shaping idea is instantiated as an augmented hinge loss to be minimized during the policy evaluation step:

Q ← argmin_Q f(Q), where f(Q) = L^π(Q) + (α/NE) Σ_{i=1}^{NE} [1 − (Q(si, πE(si)) − max_{a∈A\πE(si)} Q(si, a))]_+,

in which [z]_+ = max{0, z} is the hinge loss, and L^π(Q) is the Q-function loss induced by an empirical norm of the optimal Bellman residual:

L^π(Q) = E_{(s,a)∼Dπ} ||T^π Q(s, a) − Q(s, a)||,

where T^π Q(s, a) = R(s, a) + γ E_{s'∼p(·|s,a)}[Q(s', π(s'))] is the Bellman contraction operator. [79] further extended the work of APID with a different evaluation loss:

L^π = E_{(s,a)∼Dπ} ||T* Q(s, a) − Q(s, a)||,

where T* Q(s, a) = R(s, a) + γ E_{s'∼p(·|s,a)}[max_{a'} Q(s', a')]. Their work theoretically converges to the optimal Q-function compared with APID, as L^π minimizes the Optimal Bellman Residual instead of the empirical norm.

In addition to policy iteration, the following two approaches integrate demonstration data into the TD-learning framework, such as Q-learning. Specifically, [76] proposed the Deep Q-learning from Demonstration (DQfD) algorithm, which maintains two separate replay buffers to store demonstrated data and self-generated data, respectively, so that expert demonstrations can always be sampled with a certain probability. Their work leverages the refined priority replay mechanism [83], where the probability of sampling a transition i is based on its priority pi with a temperature parameter α:

P(i) = p_i^α / Σ_k p_k^α.

Another work under the Q-learning framework was proposed by [80]. Their work, dubbed LfDS, draws a close connection to the Reward Shaping technique in Section 4.1. It builds the potential function based on a set of expert demonstrations, and the potential value of a given state-action pair is measured by the highest similarity between the given pair and the expert experiences. This augmented reward assigns more credit to state-actions that are more similar to expert demonstrations, which can eventually encourage the agent towards expert-like behavior.

Besides Q-learning, recent work has integrated LfD into the policy-gradient framework [30, 36, 65, 77, 81]. A representative work along this line is Generative Adversarial Imitation Learning (GAIL), proposed by [30]. GAIL introduced the notion of occupancy measure dπ, which is the stationary state-action distribution derived from a policy π. Based on this notion, a new reward function is designed such that maximizing the accumulated new rewards encourages minimizing the distribution divergence between the occupancy measure of the current policy π and the expert policy πE. Specifically, the new reward is learned by adversarial training [53]: a discriminator D is trained to distinguish interactions sampled from the current policy π and the expert policy πE:

J_D = max_{D: S×A→(0,1)} E_{dπ} log[1 − D(s, a)] + E_{dE} log[D(s, a)].

Since πE is unknown, its state-action distribution dE is estimated based on the given expert demonstrations DE. It has been proved that, for an optimized discriminator, its output satisfies D(s, a) = dE / (dπ + dE). The output of the discriminator is used as the new reward to encourage distribution matching, with r'(s, a) = −log(1 − D(s, a)). The RL process is naturally altered to perform distribution matching by optimizing the following minimax objective:

min_π max_D J(π, D) := E_{dπ} log[1 − D(s, a)] + E_{dE} log[D(s, a)].
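The PyTorch sketch below (ours, under the assumptions stated in the comments) illustrates the adversarial reward construction used by GAIL-style approaches: a discriminator is pushed toward 1 on expert state-action pairs and 0 on agent pairs, and its output is converted into the shaping reward r'(s, a) = −log(1 − D(s, a)). States and actions are assumed to be flat float tensors; network sizes and names are illustrative only.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a) in (0, 1): probability that a state-action pair is expert-like."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_step(disc, opt, agent_s, agent_a, expert_s, expert_a):
    """One adversarial update: D -> 1 on expert data, D -> 0 on agent data."""
    bce = nn.BCELoss()
    d_agent, d_expert = disc(agent_s, agent_a), disc(expert_s, expert_a)
    loss = bce(d_agent, torch.zeros_like(d_agent)) + \
           bce(d_expert, torch.ones_like(d_expert))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def shaped_reward(disc, s, a, eps=1e-8):
    """r'(s, a) = -log(1 - D(s, a)); higher for expert-like behavior."""
    with torch.no_grad():
        d = disc(s, a)
    return -torch.log(1.0 - d + eps)
```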

Although GAIL is more related to Imitation Learning than to LfD, its philosophy of using expert demonstrations for distribution matching has inspired other LfD algorithms. For example, [81] extended GAIL with an algorithm called POfD, which combines the discriminator reward with the environment reward, so that the agent is trained to maximize the accumulated environment rewards (RL objective) as well as performing distribution matching (imitation learning objective):

max_θ E_{dπ}[r(s, a)] − λ D_JS[dπ || dE].   (3)

They further proved that optimizing Equation (3) is the same as a dynamic reward-shaping mechanism (Section 4.1):

max_θ E_{dπ}[r'(s, a)],

where r'(s, a) = r(s, a) − λ log(D_w(s, a)) is the shaped reward.

Both GAIL and POfD operate under an on-policy RL framework. To further improve the sample efficiency of TL, some off-policy algorithms have been proposed, such as DDPGfD [65], which is built upon the DDPG framework. DDPGfD shares a similar idea with DQfD in that both use a second replay buffer for storing demonstrated data, and each demonstrated sample holds a sampling priority pi. For a demonstrated sample, its priority pi is augmented with a constant bias εD > 0 in order to encourage more frequent sampling of expert demonstrations:

p_i = δ_i² + λ ||∇_a Q(s_i, a_i | θ^Q)||² + ε + ε_D,

where δi is the TD-residual for transition i, ||∇_a Q(s_i, a_i | θ^Q)||² is the loss applied to the actor, and ε is a small positive constant to ensure all transitions are sampled with some probability.

Another work also adopted the DDPG framework to learn from demonstrations [77]. Their approach differs from DDPGfD in that its objective function is augmented with a Behavior Cloning Loss to encourage imitating the provided demonstrations:

L_BC = Σ_{i=1}^{|DE|} ||π(s_i | θ_π) − a_i||².

To further address the issue of suboptimal demonstrations, in [77] the form of the Behavior Cloning Loss is altered based on the critic output, so that only demonstration actions with higher Q-values lead to the loss penalty:

L_BC = Σ_{i=1}^{|DE|} ||π(s_i | θ_π) − a_i||² 1[Q(s_i, a_i) > Q(s_i, π(s_i))].

There are several challenges faced by LfD, one of which is imperfect demonstrations. Previous approaches usually presume near-oracle demonstrations. However, demonstrations can also be biased estimations of the environment or even come from a sub-optimal policy [36]. Current solutions to imperfect demonstrations include altering the objective function. For example, [50] leveraged the hinge-loss function to allow occasional violations of the property that Q(si, πE(si)) − max_{a∈A\πE(si)} Q(si, a) ≥ 1. Some other work uses regularizations on the objective to alleviate overfitting on biased data [76, 83]. A different strategy to confront the sub-optimality is to leverage those sub-optimal demonstrations only to boost the initial learning stage. Specifically, in the same spirit as GAIL, [36] proposed Self-Adaptive Imitation Learning (SAIL), which learns from sub-optimal demonstrations using generative adversarial training while gradually selecting self-generated trajectories with high quality to replace less superior demonstrations.

Another challenge faced by LfD is overfitting: demonstrations may be provided in limited numbers, which results in the learning agent lacking guidance on states that are unseen in the demonstration dataset. This challenge is aggravated in MDPs with sparse reward feedback, as the learning agent cannot obtain much supervision information from the environment either. This challenge is also closely related to the covariate drift issue [84], which is commonly confronted by approaches of behavior cloning. Current efforts to address this challenge include encouraging exploration by using an entropy-regularized objective [57], decaying the effect of demonstration guidance by softening its regularization on policy learning over time [35], and introducing disagreement regularizations by training an ensemble of policies based on the given demonstrations, where the variance among policies serves as a cost (negative reward) function [85].

We summarize the above-discussed approaches in Table 2. In general, demonstration data can help in both offline pre-training for better initialization and online RL for efficient exploration. During the RL learning phase, demonstration data can be used together with self-generated data to encourage expert-like behaviors (DDPGfD, DQfD), to shape value functions (APID), or to guide the policy update in the form of an auxiliary objective function (DPID, GAIL, POfD). The current RL frameworks used for LfD include policy iteration, Q-learning, and policy gradient. Developing more general LfD approaches that are agnostic to RL frameworks and can learn from sub-optimal or limited demonstrations will be the next focus for this research domain.

Methods | Optimality Guarantee | Format of transferred demonstrations | RL framework
DQfD | – | Cached transitions in the replay buffer | DQN
LfDS | – | Reward shaping function | DQN
GAIL | – | Reward shaping function: −λ log(1 − D(s, a)) | TRPO
POfD | – | Reward shaping function: r(s, a) − λ log(1 − D(s, a)) | TRPO, PPO
DDPGfD | – | Increasing sampling priority | DDPG
[77] | – | Increasing sampling priority and behavior cloning loss | DDPG
DPID | – | Indicator binary loss: L(si) = 1{πE(si) ≠ π(si)} | API
APID | – | Hinge loss on the margin loss: [L(Q, π, πE)]_+ | API
APID extended | – | Margin loss: L(Q, π, πE) | API
SAIL | – | Reward shaping function: r(s, a) − λ log(1 − D(s, a)) | DDPG

TABLE 2: A comparison of learning from demonstration approaches.
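The class below is our simplified sketch (not the authors' implementation) of the dual-buffer prioritized sampling idea behind DQfD/DDPGfD: demonstration transitions keep a constant priority bonus, playing the role of εD above, so that they continue to be sampled alongside self-generated data with probability P(i) ∝ p_i^α. All names and default values are illustrative assumptions.

```python
import numpy as np

class DemoReplayBuffer:
    """Replay buffer mixing demonstration and self-generated transitions."""
    def __init__(self, alpha=0.6, demo_bonus=1.0):
        self.data, self.priority, self.is_demo = [], [], []
        self.alpha, self.demo_bonus = alpha, demo_bonus

    def add(self, transition, td_error=1.0, is_demo=False):
        self.data.append(transition)
        self.is_demo.append(is_demo)
        self.priority.append(self._priority(td_error, is_demo))

    def _priority(self, td_error, is_demo):
        # demonstrations receive a constant bonus so they never starve
        return abs(td_error) + (self.demo_bonus if is_demo else 1e-3)

    def update(self, index, td_error):
        self.priority[index] = self._priority(td_error, self.is_demo[index])

    def sample(self, batch_size):
        p = np.asarray(self.priority) ** self.alpha
        probs = p / p.sum()                       # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx]
```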

4.3 Policy Transfer

In this section, we review TL approaches of Policy Transfer, where the external knowledge takes the form of pretrained policies from one or multiple source domains. Work discussed in this section is built upon a many-to-one problem setting, which we formalize as below:

Problem Setting. (Policy Transfer) A set of teacher policies πE1, πE2, ..., πEK are trained on a set of source domains M1, M2, ..., MK, respectively. A student policy π is learned for a target domain by leveraging knowledge from {πEi}_{i=1}^K.

For the one-to-one scenario, which contains only one teacher policy, one can consider it as a special case of the above problem setting with K = 1. Next, we categorize recent work on policy transfer into two techniques: policy distillation and policy reuse.

4.3.1 Transfer Learning via Policy Distillation

The term knowledge distillation was proposed by [86] as an approach of knowledge ensemble from multiple teacher models into a single student model. This technique was later extended from the field of supervised learning to RL. Since the student model is usually shallower than the teacher model and can perform across multiple teacher tasks, policy distillation is also considered as an effective approach for model compression [87] and multi-task RL [88].

The idea of knowledge distillation has been applied to the field of RL to enable policy distillation. Conventional policy distillation approaches transfer the teacher policy in a supervised learning paradigm [88, 89]. Specifically, a student policy is learned by minimizing the divergence of action distributions between the teacher policy πE and the student policy πθ, which is denoted as H^×(πE(τ_t) | πθ(τ_t)):

min_θ E_{τ∼πE} [ Σ_{t=1}^{|τ|} ∇_θ H^×(πE(τ_t) | πθ(τ_t)) ].

The above expectation is taken over trajectories sampled from the teacher policy πE, which therefore makes this approach teacher distillation. A representative example of work along this line is [88], in which N teacher policies are learned for N source tasks separately, and each teacher yields a dataset D^E = {s_i, q_i}_{i=0}^N consisting of observations (states) s and vectors of the corresponding Q-values q, such that q_i = [Q(s_i, a_1), Q(s_i, a_2), ... | a_j ∈ A]. Teacher policies are further distilled to a single student agent πθ by minimizing the KL-divergence between each teacher policy πEi(a|s) and the student policy πθ, approximated using the dataset D^E:

min_θ D_KL(π^E | πθ) ≈ Σ_{i=1}^{|D^E|} softmax(q_i^E / τ) ln( softmax(q_i^E) / softmax(q_i^θ) ).

An alternative policy distillation approach is called student distillation [51, 90], which is similar to teacher distillation, except that during the optimization step, the expectation is taken over trajectories sampled from the student policy instead of the teacher policy:

min_θ E_{τ∼πθ} [ Σ_{t=1}^{|τ|} ∇_θ H^×(πE(τ_t) | πθ(τ_t)) ].

[51] provides a nice summarization of the related work on both kinds of distillation approaches. While it is feasible to combine both [84], we observe that more recent work focuses on student distillation, which empirically shows better exploration ability compared to teacher distillation, especially when the teacher policy is deterministic.

From a different perspective, there are two approaches to distilling the knowledge from teacher policies to a student: (1) minimizing the cross-entropy loss between the teacher and student policy distributions over actions [90, 91]; and (2) maximizing the probability that the teacher policy will visit trajectories generated by the student, i.e. max_θ P(τ ∼ πE | τ ∼ πθ) [92, 93]. One example of approach (1) is the Actor-mimic algorithm [90]. This algorithm distills the knowledge of expert agents into the student by minimizing the cross-entropy between the student policy πθ and each teacher policy πEi over actions:

L^i(θ) = − Σ_{a∈A_{Ei}} πEi(a|s) log πθ(a|s),

where each teacher agent is learned based on DQN, whose policy is therefore derived from the Boltzmann distribution over the Q-function output:

πEi(a|s) = exp(τ^{-1} Q_{Ei}(s, a)) / Σ_{a'∈A_{Ei}} exp(τ^{-1} Q_{Ei}(s, a')).

An instantiation of approach (2) is the Distral algorithm [92], in which a centroid policy πθ is trained based on K teacher policies, with each teacher policy learned in a source domain Mi = {Si, Ai, Ti, γ, Ri}, in the hope that knowledge in each teacher πEi can be distilled to the centroid and get transferred to student policies. It assumes that both the transition dynamics Ti and reward distributions Ri are different across the source MDPs. A distilled policy (student) is learned to perform in different domains by maximizing max_θ Σ_{i=1}^K J(πθ, πEi), where

J(πθ, πEi) = E_{(s_t,a_t)∼πθ} [ Σ_{t≥0} γ^t ( r_i(a_t, s_t) + (α/β) log πEi(a_t|s_t) − (1/β) log πθ(a_t|s_t) ) ],

in which both log πEi(a_t|s_t) and −log πθ(a_t|s_t) are used as augmented rewards. Therefore, the above approach also draws a close connection to Reward Shaping (Section 4.1). In effect, the log πEi(a_t|s_t) term guides the learning policy πθ to yield actions that are more likely to be generated by the teacher policy, whereas the entropy term −log πθ(a_t|s_t) serves as a bonus reward for exploration. A similar approach was proposed by [91], which only uses the cross-entropy between teacher and student policies, λ H(πE(a_t|s_t) || πθ(a_t|s_t)), to reshape rewards. Moreover, they adopted a dynamically fading coefficient to alleviate the effect of the augmented reward so that the student policy becomes independent of the teachers after a certain number of optimization iterations.

fading coefficient to alleviate the effect of the augmented across multiple tasks. Next, a task (reward) mapper wi is reward so that the student policy becomes independent of learned, based on which the Q-function can be derived: the teachers after certain optimization iterations. π T Qi (s, a) = ψ(s, a) wi. 4.3.2 Transfer Learning via Policy Reuse [95] proved that the loss of GPI is bounded by the difference In addition to policy distillation, another policy transfer between the source and the target tasks. In addition to approach is Policy Reuse, which directly reuses policies from policy-reuse, their approach involves learning a shared source tasks to build the target policy. representation ψ(s, a), which is also a form of transferred The notion of policy reuse was proposed by [94], which knowledge and will be elaborated more in Section 4.5.2. directly learns expert policies based on a probability distri- We summarize the abovementioend policy transfer ap- bution P , where the probability of each policy to be used proaches in Table 3. In general, policy transfer can be realized during training is related to the expected performance gain by knowledge distillation, which can be either optimized of that policy in the target domain, denoted as Wi: from the student’s perspecive (student distillation), or from the teacher’s perspective (teacher distillation) Alternatively, exp (tW ) P (π ) = i , teacher policies can also be directly reused to update the Ei PK j=0 exp (tWj) target policy. All approaches discussed so far presumed one or multiple expert policies, which are always at the disposal where t is a dynamic temperature parameter that increases of the learning agent. Questions such as How to leverage over time. Under a Q-learning framework, the Q-function of imperfect policies for knowledge transfer, or How to refer to teacher their target policy is learned in an iterative scheme: during policies within a budget, are still open to be resolved by future every learning episode, W is evaluated for each expert policy i research along this line. πEi , and W0 is obtained for the learning policy, from which a reuse probability P is derived. Next, a behavior policy is sampled from this probability P . If an expert is sampled as 4.4 Inter-Task Mapping the behavior policy, the Q-function of the learning policy In this section, we review TL approaches that utilize mapping is updated by following the behavior policy in an -greedy functions between the source and the target domains to assist fashion. Otherwise, if the learning policy itself is selected as knowledge transfer. Research in this domain can be analyzed the behavior policy, then a fully greedy Q-learning update from two perspectives: (1) which domain does the mapping is performed. After each training episode, both Wi and function apply to, and (2) how is the mapped representation the temperature t for calculating the reuse probability is utilized. Most work discussed in this section shares a common updated accordingly. One limitation of this approach is that assumption as below: the Wi, i.e. the expected return of each expert policy on the target task, needs to be evaluated frequently. This work was Assumption. [Existence of Domain Mapping] A one-to- one mapping exists between the source domain Ms = implemented in a tabular case, leaving the scalability issue s s s s s s s unresolved. (µ0, S , A , T , γ , R , S0 ) and the target domain Mt = (µt , St, At, T t, γt, Rt, St) More recent work by [95] extended the Policy Improvement 0 0 . 
More recent work by [95] extended the Policy Improvement theorem [96] from one to multiple policies, which is named Generalized Policy Improvement. We restate its main theorem as follows:

Theorem. [Generalized Policy Improvement (GPI)] Let $\pi_1, \pi_2, \dots, \pi_n$ be $n$ decision policies and let $\hat{Q}^{\pi_1}, \hat{Q}^{\pi_2}, \dots, \hat{Q}^{\pi_n}$ be approximations of their action-value functions, such that $\big|Q^{\pi_i}(s, a) - \hat{Q}^{\pi_i}(s, a)\big| \le \epsilon$ for all $s \in S$, $a \in A$, and $i \in \{1, 2, \dots, n\}$. Define $\pi(s) = \arg\max_{a} \max_{i} \hat{Q}^{\pi_i}(s, a)$; then
$$Q^{\pi}(s, a) \ge \max_{i} Q^{\pi_i}(s, a) - \frac{2}{1-\gamma}\,\epsilon$$
for any $s \in S$ and $a \in A$, where $Q^{\pi}$ is the action-value function of $\pi$.

Based on this theorem, a policy improvement approach can be naturally derived by greedily choosing, for a given state, the action that renders the highest Q-value among all policies. Another work along this line is [95], in which an expert policy $\pi_{E_i}$ is also trained on a different source domain $M_i$ with reward function $R_i$, so that $Q^{\pi}_{M_0}(s, a) \ne Q^{\pi}_{M_i}(s, a)$. To efficiently evaluate the Q-functions of different source policies in the target MDP, a disentangled representation $\psi(s, a)$ over the states and actions is learned based on neural networks and is generalized across multiple tasks. Next, a task (reward) mapper $w_i$ is learned, based on which the Q-function can be derived:
$$Q^{\pi}_i(s, a) = \psi(s, a)^{\top} w_i.$$
[95] proved that the loss of GPI is bounded by the difference between the source and the target tasks. In addition to policy reuse, their approach involves learning a shared representation $\psi(s, a)$, which is also a form of transferred knowledge and will be elaborated more in Section 4.5.2.
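As a concrete illustration of GPI, the sketch below (hypothetical names; it assumes the approximate action-value functions of the source policies are available as arrays indexed by state and action) selects actions greedily with respect to the maximum over all policies' Q-estimates:

```python
import numpy as np

def gpi_action(q_values, state):
    """Generalized Policy Improvement action selection.

    q_values: array of shape (n_policies, n_states, n_actions) holding the
              approximate action-value functions Q^{pi_i} of the source policies.
    Returns the action maximizing max_i Q^{pi_i}(state, a)."""
    q_max = q_values[:, state, :].max(axis=0)   # element-wise max over policies
    return int(np.argmax(q_max))
```

When combined with successor features, each slice q_values[i] can be obtained cheaply as $\psi^{\pi_i}(s, a)^{\top} w$ for the target task's reward mapper $w$, which is how [95] couples GPI with representation transfer.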

We summarize the abovementioned policy transfer approaches in Table 3. In general, policy transfer can be realized by knowledge distillation, which can be optimized either from the student's perspective (student distillation) or from the teacher's perspective (teacher distillation). Alternatively, teacher policies can also be directly reused to update the target policy. All approaches discussed so far presume one or multiple expert policies that are always at the disposal of the learning agent. Questions such as How to leverage imperfect policies for knowledge transfer, or How to refer to teacher policies within a budget, are still open to be resolved by future research along this line.

4.4 Inter-Task Mapping

In this section, we review TL approaches that utilize mapping functions between the source and the target domains to assist knowledge transfer. Research in this domain can be analyzed from two perspectives: (1) which domain the mapping function applies to, and (2) how the mapped representation is utilized. Most work discussed in this section shares a common assumption as below:

Assumption. [Existence of Domain Mapping] A one-to-one mapping exists between the source domain $M_s = (\mu_0^s, S^s, A^s, T^s, \gamma^s, R^s, S_0^s)$ and the target domain $M_t = (\mu_0^t, S^t, A^t, T^t, \gamma^t, R^t, S_0^t)$.

Earlier work along this line requires a given mapping function [58, 97]. One example is [58], which assumes that each target state (action) has a unique correspondence in the source domain, and two mapping functions $X_S$ and $X_A$ are provided over the state space and the action space, respectively, so that $X_S(S^t) \to S^s$ and $X_A(A^t) \to A^s$. Based on $X_S$ and $X_A$, a mapping function over the Q-values, $M(Q^s) \to Q^t$, can be derived accordingly. Another work is done by [97], which transfers advice as the knowledge between two domains. In their settings, the advice comes from a human expert who provides the mapping function over the Q-values in the source domain and transfers it to the learning policy for the target domain. This advice encourages the learning agent to prefer certain good actions over others, which equivalently provides a relative ranking of actions in the new task.

Later research tackles the inter-task mapping problem by automatically learning a mapping function [98, 99, 100]. Most work learns a mapping function over the state space or a subset of the state space. In these works, state representations are usually divided into agent-specific and task-specific representations, denoted as $s_{agent}$ and $s_{env}$, respectively. In [98] and [99], the mapping function is learned on the agent-specific sub-state, and the mapped representation is applied to reshape the immediate reward.

Citation | Transfer Approach | MDP Difference | RL Framework | Metrics
[88] | Distillation | S, A | DQN | ap
[89] | Distillation | S, A | DQN | ap, ps
[90] | Distillation | S, A | Soft Q-learning | ap, ar, ps
[92] | Distillation | S, A | A3C | ap, pe, tt
[94] | Reuse | R | Tabular Q-learning | ap
[95] | Reuse | R | DQN | ap, ar

TABLE 3: A comparison of policy transfer approaches.

For [98], the invariant feature space mapped from $s_{agent}$ can be applied across agents who have distinct action spaces but share some morphological similarity. Specifically, they assume that both agents have been trained on the same proxy task, based on which the mapping function is learned. The mapping function is learned using an encoder-decoder neural network structure [101] in order to preserve as much information about the source domain as possible. While transferring knowledge from the source agent to the target agent on a new task, the environment reward is augmented with a shaped reward term to encourage the target agent to imitate the source agent on the embedded feature space:
$$r'(s, \cdot) = \alpha \big\| f(s^{s}_{agent}; \theta_f) - g(s^{t}_{agent}; \theta_g) \big\|,$$
where $f(s^{s}_{agent})$ is the embedded agent-specific state in the source domain, and $g(s^{t}_{agent})$ is its counterpart in the target domain.
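A minimal sketch of this kind of feature-space reward shaping is given below (hypothetical names; f_source and g_target stand for the already-trained source and target embedding networks, abstracted here as plain functions; the sign convention makes a smaller feature distance yield a larger bonus):

```python
import numpy as np

def shaping_reward(f_source, g_target, s_agent_source, s_agent_target, alpha=1.0):
    """Reward bonus encouraging the target agent to visit states whose embedding
    is close to the source agent's embedding on the shared (morphologically
    invariant) feature space."""
    diff = f_source(s_agent_source) - g_target(s_agent_target)
    return -alpha * float(np.linalg.norm(diff))   # smaller distance -> larger bonus
```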
[100] applied the Unsupervised Manifold Alignment (UMA) approach [102] to automatically learn the state mapping between tasks. In their approach, trajectories are collected from both the source and the target domain to learn a mapping between states. While applying policy-gradient learning, trajectories from $M_t$ are first mapped back to the source domain, $\xi_t \to \xi_s$; then an expert policy in the source domain is applied to each initial state of those trajectories to generate near-optimal trajectories $\tilde{\xi}_s$, which are further mapped to the target domain, $\tilde{\xi}_s \to \tilde{\xi}_t$. The deviation between $\tilde{\xi}_t$ and $\xi_t$ is used as a loss to be minimized in order to improve the target policy. Similar ideas of using UMA to assist transfer by inter-task mapping can also be found in [103] and [104].

In addition to approaches that utilize mappings over states or actions, [105] proposed to learn an inter-task mapping over the transition dynamics space $S \times A \times S$. Their work assumes that the source and target domains differ in the dimensionality of the transition space. Triplet transitions from both the source domain, $\langle s_s, a_s, s'_s \rangle$, and the target domain, $\langle s_t, a_t, s'_t \rangle$, are mapped to a latent space $Z$. Given the feature representation in $Z$ with higher dimensionality, a similarity measure can be applied to find correspondences between source and target task triplets. Triplet pairs with the highest similarity in this feature space $Z$ are used to learn a mapping function $\mathcal{X}$: $\langle s_t, a_t, s'_t \rangle = \mathcal{X}(\langle s_s, a_s, s'_s \rangle)$. After the transition mapping, states sampled from the expert policy in the source domain can be leveraged to render beneficial states in the target domain, which assists the target agent in learning with a better initialization performance. A similar idea of mapping transition dynamics can be found in [106], which, however, requires a stronger assumption on the similarity of the transition probabilities and the state representations between the source and the target domains.

As summarized in Table 4, for TL approaches that utilize an inter-task mapping, the mapped knowledge can be (a subset of) the state space [98, 99], the Q-function [58], or (representations of) the state-action-state transitions [105]. In addition to being directly applicable in the target domain [105], the mapped representation can also be used as an augmented shaping reward [98, 99] or a loss objective [100] in order to guide the agent's learning in the target domain.

4.5 Representation Transfer

In this section, we review TL approaches in which the transferred knowledge consists of feature representations, such as representations learned for the value-function or Q-function. Approaches discussed in this section are developed based on the powerful approximation ability of deep neural networks and are built upon the following consensual assumption:

Assumption. [Existence of a Task-Invariant Subspace] The state space ($S$), action space ($A$), or even reward space ($R$) can be disentangled into orthogonal sub-spaces, some of which are task-invariant and are shared by both the source and target domains, such that knowledge can be transferred between domains on the universal sub-space.

We organize recent work along this line into two subtopics: i) approaches that directly reuse representations from the source domain (Section 4.5.1), and ii) approaches that learn to disentangle the source domain representations into independent sub-feature representations, some of which lie on the universal feature space shared by both the source and the target domains (Section 4.5.2).

4.5.1 Reusing Representations

A representative work of reusing representations is [107], which proposed the progressive neural network structure to enable knowledge transfer across multiple RL tasks in a progressive way. A progressive network is composed of multiple columns, where each column is a policy network for training one specific task. It starts with a single column for training the first task, and the number of columns then increases with the number of new tasks. While training on a new task, the neuron weights of the previous columns are frozen, and representations from those frozen columns are fed to the new column via lateral connections to assist in learning the new task. This process can be mathematically generalized as follows:
$$h^{(k)}_i = f\Big(W^{(k)}_i h^{(k)}_{i-1} + \sum_{j<k} U^{(k:j)}_i h^{(j)}_{i-1}\Big),$$
where $h^{(k)}_i$ is the $i$-th hidden layer for task (column) $k$, $W^{(k)}_i$ is the associated weight matrix, and $U^{(k:j)}_i$ are the lateral connections from layer $i-1$ of previous tasks to the current layer of task $k$.
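The column-wise forward pass above can be sketched as follows (a minimal illustration with fully connected layers; the names, the dictionary-based indexing of the lateral matrices, and the ReLU nonlinearity are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def progressive_forward(x, columns, laterals):
    """Forward pass of a progressive network.

    columns[k]:          list of weight matrices [W_1^(k), ..., W_L^(k)] of column k;
                         earlier columns are assumed frozen.
    laterals[(k, i, j)]: lateral matrix U_i^(k:j) connecting layer i-1 of frozen
                         column j (j < k) to layer i of column k.
    Returns the top-layer activation of the newest column."""
    activations = []                           # activations[k][i]: layer-i output of column k
    for k, weights in enumerate(columns):
        h, per_layer = x, [x]                  # index 0 holds the network input
        for i, W in enumerate(weights, start=1):
            pre = W @ h
            for j in range(k):                 # lateral input from frozen columns
                pre = pre + laterals[(k, i, j)] @ activations[j][i - 1]
            h = relu(pre)
            per_layer.append(h)
        activations.append(per_layer)
    return activations[-1][-1]
```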

Citation | Algorithm | MDP Difference | Mapping Function | Usage of Mapping
[58] | SARSA | S^s ≠ S^t, A^s ≠ A^t | M(Q^s) → Q^t | Q-value reuse
[97] | Q-learning | A^s ≠ A^t, R^s ≠ R^t | M(Q^s) → advice | Relative Q ranking
[98] | Generally applicable | S^s ≠ S^t | M(s^t) → r' | Reward shaping
[99] | SARSA(λ) | S^s ≠ S^t, R^s ≠ R^t | M(s^t) → r' | Reward shaping
[100] | Fitted Value Iteration | S^s ≠ S^t | M(s^s) → s^t | Penalty loss on state deviation from expert policy
[106] | Fitted Q Iteration | S^s × A^s ≠ S^t × A^t | M(⟨s_s, a_s, s'_s⟩) → ⟨s_t, a_t, s'_t⟩ | Reduce random exploration
[105] | No constraint | S^s × A^s ≠ S^t × A^t | M(⟨s_s, a_s, s'_s⟩) → ⟨s_t, a_t, s'_t⟩ | Reduce random exploration

TABLE 4: A comparison of inter-task mapping approaches.

Although the progressive network is an effective multi-task approach, it comes with the cost of a giant network structure, as the network grows proportionally with the number of incoming tasks. A later framework called PathNet was proposed by [108], which alleviates this issue by using a network of fixed size. PathNet contains pathways, which are subsets of neurons whose weights contain the knowledge of previous tasks and are frozen during training on new tasks. The population of pathways is evolved using a tournament selection genetic algorithm [109].

Another approach of reusing representations for TL is modular networks [110, 111, 112]. For example, [110] proposed to decompose the policy network into a task-specific module and an agent-specific module. Specifically, let $\pi$ be a policy performed by any agent (robot) $r$ over the task $M_k$, expressed as a function $\phi$ over states $s$; it can be decomposed into two sub-modules $g_k$ and $f_r$:
$$\pi(s) := \phi(s_{env}, s_{agent}) = f_r\big(g_k(s_{env}), s_{agent}\big),$$
where $f_r$ is the agent-specific module while $g_k$ is the task-specific module. Their central idea is that the task-specific module can be applied to different agents performing the same task, which serves as the transferred knowledge. Accordingly, the agent-specific module can be applied to different tasks for the same agent.

A model-based approach along this line is [112], which learns a model to map the state observation $s$ to a latent representation $z$. Accordingly, the transition probability is modeled on the latent space instead of the original state space, i.e. $\hat{z}_{t+1} = f_{\theta}(z_t, a_t)$, where $\theta$ is the parameter of the transition model, $z_t$ is the latent representation of the state observation, and $a_t$ is the action accompanying that state. Next, a reward module learns the value-function as well as the policy from the latent space $z$ using an actor-critic framework. One potential benefit of this latent representation is that knowledge can be transferred across tasks that have different rewards but share the same transition dynamics, in which case the dynamics module can be directly applied to the target domain.
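As an illustration of the modular decomposition of [110] described above, the following sketch (hypothetical names; the modules are stand-ins for trained neural networks) composes a task-specific module with an agent-specific module to form a policy:

```python
def modular_policy(s_env, s_agent, task_module, agent_module):
    """Compose pi(s) = f_r(g_k(s_env), s_agent).

    task_module:  g_k, maps the task-specific observation to a task embedding;
                  reusable across different agents performing the same task.
    agent_module: f_r, maps (task embedding, agent-specific state) to an action;
                  reusable across different tasks for the same agent."""
    return agent_module(task_module(s_env), s_agent)
```

Swapping task_module while keeping agent_module fixed transfers the agent to a new task, and swapping agent_module transfers the task knowledge to a new agent.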
4.5.2 Disentangling Representations

Methods discussed in this section mostly focus on learning a disentangled representation. Specifically, we elaborate on TL approaches that are derived from two techniques: Successor Representations (SR) and Universal Value Function Approximators (UVFA).

Successor Representations (SR) are an approach to decouple the state features of a domain from its reward distributions. They enable knowledge transfer across multiple domains $\mathcal{M} = \{M_1, M_2, \dots, M_K\}$, so long as the only difference among them is their reward distributions: $R_i \ne R_j$. SR was originally derived from neuroscience, until [113] proposed to leverage it as a generalization mechanism for state representations in the RL domain.

Different from the V-value or Q-value, which describe states as dependent on the reward distribution of the MDP, SR features a state based on the occupancy measure of its successor states. More concretely, the occupancy measure is the unnormalized distribution of states or state-action pairs that an agent will encounter when following policy $\pi$ in the MDP [30]. Specifically, SR decomposes the value-function of any policy into two independent components, $\psi$ and $R$:
$$V^{\pi}(s) = \sum_{s'} \psi(s, s')\, w(s'),$$
where $w(s')$ is a reward mapping function which maps states to scalar rewards, and $\psi$ is the SR, which describes any state $s$ by the occupancy measure of the states occurring in the future when following $\pi$:
$$\psi(s, s') = \mathbb{E}_{\pi}\Big[\sum_{i=t}^{\infty} \gamma^{\,i-t}\, \mathbb{1}[S_i = s'] \,\Big|\, S_t = s\Big],$$
where $\mathbb{1}[S_i = s']$ is an indicator function.

The successor nature of SR makes it learnable using any TD-learning algorithm. Especially, [113] proved the feasibility of learning such a representation in the tabular case, in which the state transitions can be described using a matrix. SR was later extended by [95] from three perspectives: (i) the feature domain of SR is extended from states to state-action pairs; (ii) deep neural networks are used as function approximators to represent the SR $\psi^{\pi}(s, a)$ and the reward mapper $w$; and (iii) the Generalized Policy Improvement (GPI) algorithm is introduced to accelerate policy transfer for multiple tasks facilitated by the SR framework (see Section 4.3.2 for more details about GPI). These extensions, however, are built upon a stronger assumption about the MDP:

Assumption. [Linearity of Reward Distributions] The reward functions of all tasks can be computed as a linear combination of a fixed set of features:
$$r(s, a, s') = \phi(s, a, s')^{\top} w, \qquad (4)$$
where $\phi(s, a, s') \in \mathbb{R}^d$ denotes the latent representation of the state transition, and $w \in \mathbb{R}^d$ is the task-specific reward mapper.
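The tabular form of SR is simple to learn with a TD update. A minimal sketch is shown below (hypothetical names; it assumes a small discrete MDP and transitions generated by a fixed policy); once $\psi$ is learned, swapping in a new reward vector $w$ re-evaluates the policy on a new task without relearning the state dynamics:

```python
import numpy as np

def td_update_sr(psi, s, s_next, gamma=0.99, lr=0.1):
    """One TD update of the tabular successor representation psi (|S| x |S|):
    psi(s, .) <- psi(s, .) + lr * (1[s] + gamma * psi(s_next, .) - psi(s, .))."""
    one_hot = np.zeros(psi.shape[1])
    one_hot[s] = 1.0
    psi[s] += lr * (one_hot + gamma * psi[s_next] - psi[s])
    return psi

def value_from_sr(psi, w):
    """V(s) = sum_{s'} psi(s, s') * w(s') for a task-specific reward vector w."""
    return psi @ w
```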

Based on this assumption, SR can be decoupled from the rewards when evaluating the Q-function of any policy $\pi$ in a task $M_i$ with a reward function $R_i$:
$$\begin{aligned} Q^{\pi}_i(s, a) &= \mathbb{E}^{\pi}\big[r^i_{t+1} + \gamma r^i_{t+2} + \dots \mid S_t = s, A_t = a\big] \\ &= \mathbb{E}^{\pi}\big[\phi_{t+1}^{\top} w_i + \gamma\, \phi_{t+2}^{\top} w_i + \dots \mid S_t = s, A_t = a\big] \\ &= \psi^{\pi}(s, a)^{\top} w_i. \end{aligned} \qquad (5)$$
The advantage of SR is that, once $\psi^{\pi}(s, a)$ is known in $M_s$, one can quickly obtain the performance evaluation of the same policy in $M_t$ by replacing $w_s$ with $w_t$: $Q^{\pi}_{M_t} = \psi^{\pi}(s, a)^{\top} w_t$.

Similar ideas of learning SR with a TD algorithm on a latent representation $\phi(s, a, s')$ can also be found in [114, 115]. Specifically, the work of [114] was developed based on an assumption that is weaker than Equation (4): instead of requiring linearly-decoupled rewards, the latent space $\phi(s, a, s')$ is learned in an encoder-decoder structure to ensure that the information loss is minimized when mapping states to the latent space. This structure, therefore, comes with the extra cost of learning a decoder $f_d$ to reconstruct the state: $f_d(\phi(s_t)) \approx s_t$.

An intriguing question faced by the SR approach is: is there a way that evades the linearity assumption about reward functions and still enables learning the SR without extra modular cost? An extended work of SR [116] answered this question affirmatively, which proved that the reward functions do not necessarily have to follow the linear structure, yet at the cost of a looser performance lower-bound when applying the GPI approach for policy improvement. Especially, rather than learning a reward-agnostic latent feature $\phi(s, a, s') \in \mathbb{R}^d$ for multiple tasks, [116] aims to learn a matrix $\phi(s, a, s') \in \mathbb{R}^{D \times d}$ to interpret the basis functions of the latent space instead, where $D$ is the number of seen tasks. Assuming $k$ out of the $D$ tasks are linearly independent, this matrix forms basis functions for the latent space. Therefore, for any unseen task $M_i$, its latent features can be built as a linear combination of these basis functions, and so can its reward function $r_i(s, a, s')$. Based on the idea of learning basis functions for a task's latent space, they proposed that learning $\phi(s, a, s')$ can be approximated as learning $\mathbf{r}(s, a, s')$ directly, where $\mathbf{r}(s, a, s') \in \mathbb{R}^D$ is a vector of reward functions for each seen task:
$$\mathbf{r}(s, a, s') = \big[r_1(s, a, s');\; r_2(s, a, s');\; \dots;\; r_D(s, a, s')\big].$$
Accordingly, learning $\psi(s, a)$ for any policy $\pi_i$ in $M_i$ becomes equivalent to learning a collection of Q-functions:
$$\tilde{\psi}^{\pi_i}(s, a) = \big[Q_1^{\pi_i}(s, a),\; Q_2^{\pi_i}(s, a),\; \dots,\; Q_D^{\pi_i}(s, a)\big].$$
A similar idea of using reward functions as features to represent unseen tasks is also proposed by [117], which, however, assumes $\psi$ and $w$ to be observable quantities from the environment.
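The reward-mapper view in Equations (4) and (5) can be illustrated with a few lines of linear algebra. In the sketch below (hypothetical names; it assumes the transition features $\phi$ and the successor features $\psi$ of a policy are already available), the task-specific mapper $w$ is fit from observed rewards by least squares, after which the same $\psi$ evaluates the policy on any task by swapping in that task's $w$:

```python
import numpy as np

def fit_reward_mapper(phi, rewards):
    """Least-squares fit of w in r(s, a, s') ~ phi(s, a, s')^T w  (Equation 4).
    phi: (N, d) matrix of transition features; rewards: (N,) observed rewards."""
    w, *_ = np.linalg.lstsq(phi, rewards, rcond=None)
    return w

def q_from_successor_features(psi_sa, w):
    """Q^pi(s, a) = psi^pi(s, a)^T w  (Equation 5); psi_sa has shape (d,)."""
    return float(psi_sa @ w)
```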
Universal Value Function Approximators (UVFA) are an alternative approach of learning disentangled state representations [54]. Same as SR, UVFA allows transfer learning for multiple tasks that differ only by their reward functions (goals). Different from SR, which focuses on learning a reward-agnostic state representation, UVFA aims to find a function approximator that is generalized over both states and goals. The UVFA framework is built on a specific problem setting:

Problem Setting. (Goal-Conditional RL) Task goals are defined in terms of states, e.g. given the state space $S$ and the goal space $G$, it satisfies that $G \subseteq S$.

One instantiation of this problem setting can be an agent exploring different locations in a maze, where the goals are described as certain locations inside the maze. Under this problem setting, a UVFA module can be decoupled into a state embedding $\phi(s)$ and a goal embedding $\psi(g)$ by applying the technique of matrix factorization [118] to a reward matrix describing the goal-conditional task.

One merit of UVFA resides in its transferrable embedding $\phi(s)$ across tasks that only differ by goals. Another is its ability of continual learning when the set of goals keeps expanding over time. On the other hand, a key challenge of UVFA is that applying the matrix factorization is time-consuming, which makes it a practical concern when performing matrix factorization on complex environments with a large state space $|S|$. Even with the learned embedding networks, the third stage of fine-tuning these networks via end-to-end training is still necessary. The authors refer to the OptSpace tool for matrix factorization [119].

UVFA has been connected to SR by [116], in which a set of independent rewards (tasks) themselves can be used as features for state representations. Another extended work that combines UVFA with SR is called the Universal Successor Feature Approximator (USFA), proposed by [120]. Following the same linearity assumption about rewards as in Equation (4), USFA is proposed as a function over a triplet of the state, the action, and a policy embedding $z$:
$$\phi(s, a, z): S \times A \times \mathbb{R}^k \to \mathbb{R}^d,$$
where $z$ is the output of a policy-encoding mapping $z = e(\pi): S \times A \to \mathbb{R}^k$. Based on USFA, the Q-function of any policy $\pi$ for a task specified by $w$ can be formulated as the product of a reward-agnostic Universal Successor Feature (USF) $\psi$ and a reward mapper $w$:
$$Q(s, a, w, z) = \psi(s, a, z)^{\top} w.$$
The above Q-function representation is distinct from Equation (5), as $\psi(s, a, z)$ is generalized over multiple policies, each denoted by $z$. Facilitated by the disentangled rewards and policy generalization, [120] further introduced a generalized TD-error as a function over tasks $w$ and policies $z$, which allows them to approximate the Q-function of any policy on any task using a TD-algorithm.
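A sketch of how such a USFA-style module could be used at decision time is given below (hypothetical names; usf stands in for a trained network $\psi(s, a, z)$ and candidate_zs for a set of policy embeddings): the Q-value for a task $w$ is the inner product $\psi(s, a, z)^{\top} w$, and taking the maximum over candidate embeddings amounts to GPI over the encoded policies.

```python
import numpy as np

def usfa_q(usf, s, a, z, w):
    """Q(s, a, w, z) = psi(s, a, z)^T w, with psi given by the USF network."""
    return float(usf(s, a, z) @ w)

def usfa_gpi_action(usf, s, actions, candidate_zs, w):
    """Act greedily with respect to the max over candidate policy embeddings z."""
    best_a, best_q = None, -np.inf
    for a in actions:
        q = max(usfa_q(usf, s, a, z, w) for z in candidate_zs)
        if q > best_q:
            best_a, best_q = a, q
    return best_a
```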

4.5.3 Discussion

We provide a summary of the discussed work in this section in Table 5. In general, representation transfer can facilitate transfer learning in many ways, and work along this line usually shares certain assumptions about some task-invariant property. Most of them assume that tasks differ only in terms of their reward distributions while sharing the same state (or action, or transition) probabilities. Other stronger assumptions include (i) decoupling dynamics, rewards [95], or policies [120] from the Q-function representations, and (ii) the feasibility of defining tasks in terms of states [120]. Based on those assumptions, approaches such as TD-algorithms [116] or matrix factorization [54] become applicable to learn such disentangled representations. To further exploit the effectiveness of the disentangled structure, we think that generalization approaches, which allow changing dynamics or state distributions, are important future work that is worth more attention in this domain.

As an intriguing research topic, there are unresolved questions along the line of representation transfer. One is how to handle drastic changes of reward functions between domains. As discussed in [121], good policies in one MDP may perform poorly in another, due to the fact that beneficial states or actions in $M_s$ may become detrimental in $M_t$ with totally different reward functions. Especially, as discussed in the GPI work [95], the performance lower-bound is determined by the reward-function discrepancy when transferring knowledge across different tasks. Learning a set of basis functions [116] to represent unseen tasks (reward functions), or decoupling policies from the Q-function representation [120], may serve as a good start to address this issue, as they propose a generalized latent space from which different tasks (reward functions) can be interpreted. However, the limitation of this work is that it is not clear how many and what kind of sub-tasks need to be learned to make the latent space generalizable enough to interpret unseen tasks.

Another question is how to generalize the representation framework to allow transfer learning across domains with different dynamics (or state-action spaces). A learned SR might not be transferrable to an MDP with different transition dynamics, as the distribution of the occupancy measure over successor states no longer holds even when following the same policy. Potential solutions may include model-based approaches that approximate the dynamics directly, or training a latent representation space for states using multiple tasks with different dynamics for better generalization [122]. Alternatively, TL mechanisms from the supervised learning domain, such as meta-learning, which enables the ability of fast adaptation to new tasks [40], or importance sampling [123], which can compensate for prior distribution changes [12], might also shed light on this question.
5 APPLICATIONS

In this section, we summarize practical applications of RL-based TL techniques:

Robotics learning is an important topic in the RL domain. [124] provided a comprehensive summary of applying RL techniques to robotics learning. Under the RL framework, a classical TL approach for facilitating robotics learning is robotics learning from demonstrations, where expert demonstrations from humans or other robots are leveraged to teach the learning robot. [125] provided a nice summarization of approaches in this topic. Later there emerged a scheme of collaborative robotic training [126]. By collaborative training, knowledge from different robots is transferred by sharing their policies and episodic demonstrations with each other. A recent instantiation of this approach can be found in [127]. Their approach can be considered as a policy transfer across multiple robot agents under the DQN framework, which shares the demonstrations in a pool and performs policy updates asynchronously. Recent work on robotic reinforcement learning with TL approaches emphasizes more the ability of fast and robust adaptation to unseen tasks. A typical approach to achieve this property is to design and select multiple source domains for robust training, so that a generalized policy trained on those source tasks can be quickly transferred to target domains. Examples include the EPOpt approach proposed by [128], which is a combination of policy transfer via a source-domain ensemble and learning from limited demonstrations for fast adaptation to the target task. Another application can be found in [129], in which robust agent policies are trained with a large number of synthetic demonstrations from a simulator to handle dynamic environments. Another idea for fast adaptation is to learn latent representations from observations in the source domain that are generally applicable to the target domain, e.g. training robots using simulated 2D image inputs and applying the robot in real 3D environments. Work along this line includes [130], which learns the latent representation using 3D CAD models, and [131, 132], which are derived based on the Generative Adversarial Network. Another example is DARLA [133], which is a zero-shot transfer approach to learn disentangled representations that are robust against domain shifts.

Game Playing is one of the most representative testbeds for TL and RL algorithms. Both the complexity and the diversity of games for evaluating TL approaches have evolved over the recent decades, from classical testbeds such as grid-world games to more complex game settings such as online strategy games or video games with pixel RGB inputs. A representative TL application in game playing is AlphaGo, which is an algorithm for learning online chessboard games using both TL and RL techniques [2]. AlphaGo is first trained offline using expert demonstrations and then learns to optimize its policy using Monte-Carlo Tree Search. Its successor, AlphaGo Master [3], even beat the world No. 1 ranked human player. In addition to online chessboard games, TL approaches have also performed well in video game playing. State-of-the-art video game platforms include MineCraft, Atari, and StarCraft. Especially, [134] designed new RL tasks under the MineCraft platform for a better comparison of different RL algorithms. We refer readers to [135] for a survey of AI for real-time strategy (RTS) games on the StarCraft platform, with a dataset available from [136]. Moreover, [137] provided a comprehensive survey on DL applications in video game playing, which also covers TL and RL strategies from certain perspectives. A large portion of the TL approaches reviewed in this survey have been applied to Atari [138] and the other above-mentioned game platforms. Especially, OpenAI trained a Dota2 agent that can surpass human experts [139]. We summarize the game applications of TL approaches mentioned in this survey in Table 6.
Natural Language Processing (NLP) research has evolved rapidly along with the advancement of DL and RL. There is an increasing trend of addressing NLP problems by leveraging RL techniques. Applications of RL on NLP range widely, from Question Answering (QA) [140], Dialogue Systems [141], and Machine Translation [142], to integrations of NLP and Computer Vision tasks, such as Visual Question Answering (VQA) [143], Image Caption [144], etc.

Citation | Representations Format | Assumptions | MDP Difference | Learner | Metrics
[107] | Lateral connections to previously learned network modules | N/A | S, A | A3C | ap, ps
[108] | Selected neural paths | N/A | S, A | A3C | ap
[110] | Task (agent)-specific network module | Disentangled state representation | S, A | Policy Gradient | ap
[112] | Dynamic transitions module learned on latent representations of the state space | N/A | S, A | A3C | ap, pe
[95] | SF | Reward function can be linearly decoupled | R | DQN | ap, ar
[114] | Encoder-decoder learned SF | N/A | R | DQN | pe, ps
[116] | Encoder-decoder learned SF | Rewards can be represented by a set of basis functions | R | Q(λ) | ap, pe
[54] | Matrix-factorized UF | Goals are defined in terms of states | R | Tabular Q-learning | ap, pe, ps
[120] | Policy-encoded UF | Reward function can be linearly decoupled; | R | ε-greedy Q-learning | ap, pe

TABLE 5: A comparison of TL approaches that transfer representations.

Many of these NLP applications have implicitly applied TL approaches, including learning from demonstrations, policy transfer, or reward shaping, in order to better tailor these RL techniques as NLP solutions, which were previously dominated by their supervised-learning counterparts [145]. Examples in this field include applying expert demonstrations to build RL solutions for Spoken Dialogue Systems [146] and VQA [143]; building shaped rewards for Sequence Generation [147], Spoken Dialogue Systems [71], QA [72, 148], and Image Caption [144]; or transferring policies for Structured Prediction [149] and VQA [150], etc. We summarize information of the above-mentioned applications in Table 7.

Health Informatics is another domain that has benefited from the advancement of RL. RL techniques have been applied to solve many healthcare tasks, including dynamic treatment regimes [151, 152], automatic medical diagnosis [153, 154], health resource scheduling [155, 156], and drug discovery and development [157, 158], etc. An overview of recent achievements of RL techniques in the domain of health informatics is provided by [159]. Despite the emergence of RL applications to address healthcare problems, only a limited number of them have utilized TL approaches, although we do observe some applications that leverage prior knowledge to improve the RL procedure. Specifically, [160] utilized Q-learning for drug delivery individualization. They integrated the prior knowledge of the dose-response characteristics into their Q-learning framework and leveraged this prior knowledge to avoid unnecessary exploration. Some work has considered reusing representations for speeding up the decision-making process [161, 162]. For example, [161] proposed to highlight both the individual variability and the common policy model structure for individual HIV treatment. [162] applied a DQN framework for prescribing effective HIV treatments, in which they learned a latent representation to estimate the uncertainty when transferring a pretrained policy to unseen domains. [163] considered the possibility of applying human-involved interactive RL training for health informatics. We consider TL combined with RL a promising integration to be applied in the domain of health informatics, which can further improve the learning effectiveness and sample efficiency, especially given the difficulty of accessing large amounts of clinical data.
By aggregating the problem effective HIV treatments, in which they learned a latent setting as well as the assumptions of the TL approaches representation to estimate the uncertainty when transferring discussed in this survey, we hereby introduce a framework a pertained policy to the unseen domains. [163] considered that theoretically enables transfer learning across different do- K the possibility of applying human-involved interactive RL mains. More specifically, for a set of domains M = {Mi}i=1 i i i i training for health informatics. We consider TL combined with K ≥ 2, and ∀Mi ∈ M, Mi = (S , A , T , R , ··· ), with RL a promising integration to be applied in the domain it is feasible to transfer knowledge across these domains if of health informatics, which can further improve the learning there exist: 17 Game Citation TL Approach Atari, 3D Maze [58] Transferrable Citation Application TL Approach representation [146] Spoken Dialogue System LFD Atari, Mujoco [81] LFD [143] VQA LFD Atari [77] LFD [147] Sequence Generation LFD, Reward Shaping Go [2] LFD [72] QA Reward Shaping Keepaway [58] Mapping [148] QA LFD, Reward Shaping RoboCup [97] Mapping [149] Structured Prediction Policy Transfer Atari [88] Policy Transfer [150] Grounded Dialog Generation Policy Distillation Atari [89] Policy Transfer [144] Image Caption Reward Shaping Atari [90] Policy Transfer 3D Maze [92] Policy Transfer Dota2 [139] Reward Shaping TABLE 7: Applications of TL approaches in Natural Language Processing. TABLE 6: TL Applications in Game Playing.

By aggregating the problem settings as well as the assumptions of the TL approaches discussed in this survey, we hereby introduce a framework that theoretically enables transfer learning across different domains. More specifically, for a set of domains $\mathcal{M} = \{M_i\}_{i=1}^{K}$ with $K \ge 2$, and $\forall M_i \in \mathcal{M}$, $M_i = (S^i, A^i, T^i, R^i, \cdots)$, it is feasible to transfer knowledge across these domains if there exist:

• A set of invertible mapping functions $\{g_i\}_{i=1}^{K}$, each of which maps the state representation of a certain domain to a consensual latent space $Z_S$, i.e.: $\forall i \in \{1, \dots, K\}$, $\exists\, g_i: S^i \to Z_S$ and $g_i^{-1}: Z_S \to S^i$.
• A set of invertible mapping functions $\{f_i\}_{i=1}^{K}$, each of which maps the action representation of a certain domain to a consensual latent space $Z_A$, i.e.: $\forall i \in \{1, \dots, K\}$, $\exists\, f_i: A^i \to Z_A$ and $f_i^{-1}: Z_A \to A^i$.
• A policy $\pi: Z_S \to Z_A$ and a constant $\epsilon$, such that $\pi$ is $\epsilon$-optimal on the common feature space for any considered domain $M_i \in \mathcal{M}$, i.e.:
$$\exists\, \pi: Z_S \to Z_A,\ \epsilon \ge 0,\ \text{s.t.}\ \forall M_i \in \mathcal{M},\ \big|V^{\pi}_{M_i} - V^{*}_{M_i}\big| \le \epsilon,$$
where $V^{*}_{M_i}$ denotes the value of the optimal policy for domain $M_i$.
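As a toy illustration of this framework (the names are entirely hypothetical; g_i, f_i_inverse, and shared_policy stand for the learned state mapping, the inverse action mapping, and the shared latent-space policy), acting in any domain amounts to routing the domain-specific state through the shared latent spaces:

```python
def act_in_domain(state_i, g_i, f_i_inverse, shared_policy):
    """Map a domain-specific state into the shared latent state space, act with
    the shared policy, and map the latent action back into the domain."""
    z_s = g_i(state_i)              # state -> consensual latent state space Z_S
    z_a = shared_policy(z_s)        # shared epsilon-optimal policy on (Z_S, Z_A)
    return f_i_inverse(z_a)         # latent action -> domain-specific action
```

Under this view, transfer reduces to learning the per-domain mappings $g_i$ and $f_i$, while the policy $\pi$ on the latent spaces is shared across all domains.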
Evaluating Transferability: Evaluation metrics have been proposed to evaluate TL approaches from different but complementary perspectives, although no single metric can summarize the efficacy of a TL approach. Designing a set of generalized, novel metrics would be beneficial for the development of TL in the DRL domain. In addition to the current benchmarks, such as OpenAI Gym, which is designed purely for evaluating RL approaches, a unified benchmark to evaluate TL performance is also worth research and engineering efforts.

Framework-agnostic Transfer: Most contemporary TL approaches are designed for certain RL frameworks. For example, some TL methods are applicable to RL algorithms designed for discrete action spaces (such as DQfD), while others may only be feasible given a continuous action space. One fundamental cause of these framework-dependent TL methods is the diversified development of RL algorithms. We expect that a more unified RL framework would in turn contribute to the standardization of TL approaches in this field.

Interpretability: Deep learning and end-to-end systems have made network representations a black box, making it difficult to interpret or debug the model's representations or decisions. As a result, there have been efforts in defining and evaluating the interpretability of approaches in the supervised-learning domain [174, 175, 176]. The merits of interpretability are manifold, including enabling disentangled representations, building explainable models, and facilitating human-computer interactions, etc. In the meantime, interpretable TL approaches for the RL domain, especially those with explainable representations or policy decisions, can also be beneficial to many applied fields, such as robotics and finance. Moreover, interpretability can also help in avoiding catastrophic decision-making for tasks such as auto-driving or healthcare decisions. Although there have been efforts towards explainable TL approaches for RL tasks [177, 178], there is no clear definition of interpretable TL in the context of RL, nor a framework to evaluate the interpretability of TL approaches. We believe that the standardization of interpretable TL for RL will be a topic that is worth more research efforts in the near future.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, p. 354, 2017.
[4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017.
[5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
[6] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 421–436, 2018.
[7] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
[8] M. R. Kosorok and E. E. Moodie, Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. SIAM, 2015, vol. 21.
[9] M. Glavic, R. Fonteneau, and D. Ernst, "Reinforcement learning for electric power system decision and control: Past considerations and perspectives," IFAC-PapersOnLine, vol. 50, no. 1, pp. 6918–6927, 2017.

[10] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, [25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and “Multiagent reinforcement learning for integrated net- P. Moritz, “Trust region policy optimization,” in In- work of adaptive traffic signal controllers (marlin-atsc): ternational conference on machine learning, 2015, pp. 1889– methodology and large-scale application on downtown 1897. toronto,” IEEE Transactions on Intelligent Transportation [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and Systems, vol. 14, no. 3, pp. 1140–1150, 2013. O. Klimov, “Proximal policy optimization algorithms,” [11] H. Wei, G. Zheng, H. Yao, and Z. Li, “Intellilight: A arXiv preprint arXiv:1707.06347, 2017. reinforcement learning approach for intelligent traffic [27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, light control,” in Proceedings of the 24th ACM SIGKDD and M. Riedmiller, “Deterministic policy gradient International Conference on Knowledge Discovery & Data algorithms,” 2014. Mining. ACM, 2018, pp. 2496–2505. [28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, [12] S. J. Pan and Q. Yang, “A survey on transfer learning,” Y. Tassa, D. Silver, and D. Wierstra, “Continuous con- IEEE Transactions on knowledge and data engineering, trol with deep reinforcement learning,” arXiv preprint vol. 22, no. 10, pp. 1345–1359, 2009. arXiv:1509.02971, 2015. [13] M. E. Taylor and P. Stone, “Transfer learning for [29] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing reinforcement learning domains: A survey,” Journal function approximation error in actor-critic methods,” of Machine Learning Research, vol. 10, no. Jul, pp. 1633– arXiv preprint arXiv:1802.09477, 2018. 1685, 2009. [30] J. Ho and S. Ermon, “Generative adversarial imitation [14] A. Lazaric, “Transfer in reinforcement learning: a learning,” in Advances in neural information processing framework and a survey,” in Reinforcement Learning. systems, 2016, pp. 4565–4573. Springer, 2012, pp. 143–173. [31] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Off-policy [15] R. Bellman, “A markovian decision process,” Journal of imitation learning from observations,” in Advances in mathematics and mechanics, pp. 679–684, 1957. Neural Information Processing Systems, vol. 33, 2020, pp. [16] G. A. Rummery and M. Niranjan, On-line Q-learning 12 402–12 413. using connectionist systems. University of Cambridge, [32] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and Department of Engineering Cambridge, England, 1994, J. Tompson, “Discriminator-actor-critic: Addressing vol. 37. sample inefficiency and reward bias in adversarial [17] H. Van Seijen, H. Van Hasselt, S. Whiteson, and imitation learning,” arXiv preprint arXiv:1809.02925, M. Wiering, “A theoretical and empirical analysis of 2018. expected sarsa,” in 2009 IEEE Symposium on Adap- [33] D. A. Pomerleau, “Efficient training of artificial neural tive Dynamic Programming and Reinforcement Learning. networks for autonomous navigation,” Neural Compu- IEEE, 2009, pp. 177–184. tation, vol. 3, no. 1, pp. 88–97, 1991. [18] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algo- [34] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse rithms,” in Advances in neural information processing reinforcement learning.” in Icml, vol. 1, 2000, p. 2. systems, 2000, pp. 1008–1014. [35] M. Jing, X. Ma, W. Huang, F. Sun, C. Yang, B. Fang, [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, and H. Liu, “Reinforcement learning from imperfect T. Harley, D. Silver, and K. 
Kavukcuoglu, “Asyn- demonstrations under soft expert guidance.” in AAAI, chronous methods for deep reinforcement learning,” 2020, pp. 5109–5116. in International conference on machine learning, 2016, pp. [36] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Learning sparse 1928–1937. rewarded tasks from sub-optimal demonstrations,” [20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, arXiv preprint arXiv:2004.00530, 2020. “Soft actor-critic: Off-policy maximum entropy deep [37] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and reinforcement learning with a stochastic actor,” in S. Wermter, “Continual lifelong learning with neural International Conference on Machine Learning. PMLR, networks: A review,” Neural Networks, 2019. 2018, pp. 1861–1870. [38] R. S. Sutton, A. Koop, and D. Silver, “On the role of [21] C. J. Watkins and P. Dayan, “Q-learning,” Machine tracking in stationary environments,” in Proceedings learning, vol. 8, no. 3-4, pp. 279–292, 1992. of the 24th international conference on Machine learning. [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, ACM, 2007, pp. 871–878. J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, [39] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, A. K. Fidjeland, G. Ostrovski et al., “Human-level I. Mordatch, and P. Abbeel, “Continuous adaptation control through deep reinforcement learning,” Nature, via meta-learning in nonstationary and competitive vol. 518, no. 7540, p. 529, 2015. environments,” ICLR, 2018. [23] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, [40] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, meta-learning for fast adaptation of deep networks,” and D. Silver, “Rainbow: Combining improvements in Proceedings of the 34th International Conference on in deep reinforcement learning,” in Proceedings of the Machine Learning-Volume 70. JMLR. org, 2017, pp. AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 1126–1135. 2018. [41] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learn- [24] R. J. Williams, “Simple statistical gradient-following ing to detect unseen object classes by between-class algorithms for connectionist reinforcement learning,” attribute transfer,” in 2009 IEEE Conference on Computer Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992. Vision and Pattern Recognition. IEEE, 2009, pp. 951–958. 19

[42] P. Dayan and G. E. Hinton, “Feudal reinforcement using reinforcement learning and shaping.” in ICML, learning,” in Advances in neural information processing vol. 98, 1998, pp. 463–471. systems, 1993, pp. 271–278. [60] R. J. Williams and L. C. Baird, “Tight performance [43] R. S. Sutton, D. Precup, and S. Singh, “Between mdps bounds on greedy policies based on imperfect value and semi-mdps: A framework for temporal abstraction functions,” Citeseer, Tech. Rep., 1993. in reinforcement learning,” Artificial intelligence, vol. [61] E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled 112, no. 1-2, pp. 181–211, 1999. methods for advising reinforcement learning agents,” [44] R. Parr and S. J. Russell, “Reinforcement learning in Proceedings of the 20th International Conference on with hierarchies of machines,” in Advances in neural Machine Learning (ICML-03), 2003, pp. 792–799. information processing systems, 1998, pp. 1043–1049. [62] S. M. Devlin and D. Kudenko, “Dynamic potential- [45] T. G. Dietterich, “Hierarchical reinforcement learning based reward shaping,” in Proceedings of the 11th Inter- with the maxq value function decomposition,” Journal national Conference on Autonomous Agents and Multiagent of artificial intelligence research, vol. 13, pp. 227–303, 2000. Systems. IFAAMAS, 2012, pp. 433–440. [46] R. B. Myerson, Game theory. Harvard university press, [63] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowe,´ 2013. “Expressing arbitrary reward functions as potential- [47] L. Bu, R. Babu, B. De Schutter et al., “A comprehensive based advice,” in Twenty-Ninth AAAI Conference on survey of multiagent reinforcement learning,” IEEE Artificial Intelligence, 2015. Transactions on Systems, Man, and Cybernetics, Part C [64] T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowe,´ (Applications and Reviews), vol. 38, no. 2, pp. 156–172, “Policy transfer using reward shaping,” in Proceedings of 2008. the 2015 International Conference on Autonomous Agents [48] M. Tan, “Multi-agent reinforcement learning: Indepen- and Multiagent Systems. International Foundation for dent vs. cooperative agents,” in Proceedings of the tenth Autonomous Agents and Multiagent Systems, 2015, international conference on machine learning, 1993, pp. pp. 181–188. 330–337. [65] M. Vecerˇ ´ık, T. Hester, J. Scholz, F. Wang, O. Pietquin, [49] F. L. Da Silva and A. H. R. Costa, “A survey on transfer B. Piot, N. Heess, T. Rothorl,¨ T. Lampe, and M. Ried- learning for multiagent reinforcement learning sys- miller, “Leveraging demonstrations for deep rein- tems,” Journal of Artificial Intelligence Research, vol. 64, forcement learning on robotics problems with sparse pp. 645–703, 2019. rewards,” arXiv preprint arXiv:1707.08817, 2017. [50] B. Kim, A.-m. Farahmand, J. Pineau, and D. Precup, [66] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, “Learning from limited demonstrations,” in Advances “Potential-based difference rewards for multiagent in Neural Information Processing Systems, 2013, pp. 2859– reinforcement learning,” in Proceedings of the 2014 inter- 2867. national conference on Autonomous agents and multi-agent [51] W. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, systems. International Foundation for Autonomous G. Swirszcz, and M. Jaderberg, “Distilling policy Agents and Multiagent Systems, 2014, pp. 165–172. distillation,” in The 22nd International Conference on [67] M. Grzes and D. Kudenko, “Learning shaping rewards Artificial Intelligence and Statistics, 2019. in model-based reinforcement learning,” in Proc. 
AA- [52] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance MAS 2009 Workshop on Adaptive Learning Agents, vol. under reward transformations: Theory and application 115, 2009. to reward shaping,” in ICML, vol. 99, 1999, pp. 278–287. [68] Y. Gao and F. Toni, “Potential based reward shaping for [53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, hierarchical reinforcement learning,” in Twenty-Fourth D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, International Joint Conference on Artificial Intelligence, “Generative adversarial nets,” in Advances in neural 2015. information processing systems, 2014, pp. 2672–2680. [69] O. Marom and B. Rosman, “Belief reward shaping [54] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Uni- in reinforcement learning,” in Thirty-Second AAAI versal value function approximators,” in International Conference on Artificial Intelligence, 2018. Conference on Machine Learning, 2015, pp. 1312–1320. [70] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Vil- [55] C. Finn and S. Levine, “Meta-learning: from few-shot lasenor-Pineda,˜ “Dynamic reward shaping: Training learning to rapid reinforcement learning,” in ICML, a robot by voice,” in Advances in Artificial Intelligence – 2019. IBERAMIA 2010. Berlin, Heidelberg: Springer Berlin [56] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, Heidelberg, 2010, pp. 483–492. S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning [71] P.-H. Su, D. Vandyke, M. Gasic, N. Mrksic, T.-H. agile locomotion for quadruped robots,” arXiv preprint Wen, and S. Young, “Reward shaping with recur- arXiv:1804.10332, 2018. rent neural networks for speeding up on-line policy [57] Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell et al., “Re- learning in spoken dialogue systems,” arXiv preprint inforcement learning from imperfect demonstrations,” arXiv:1508.03391, 2015. arXiv preprint arXiv:1802.05313, 2018. [72] X. V. Lin, R. Socher, and C. Xiong, “Multi-hop knowl- [58] M. E. Taylor, P. Stone, and Y. Liu, “Transfer learning via edge graph reasoning with reward shaping,” arXiv inter-task mappings for temporal difference learning,” preprint arXiv:1808.10568, 2018. Journal of Machine Learning Research, vol. 8, no. Sep, pp. [73] F. Liu, Z. Ling, T. Mu, and H. Su, “State 2125–2167, 2007. alignment-based imitation learning,” arXiv preprint [59] J. Randløv and P. Alstrøm, “Learning to drive a bicycle arXiv:1911.10947, 2019. 20

[74] X. Zhang and H. Ma, “Pretraining deep actor-critic [91] S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Do- reinforcement learning algorithms with expert demon- ersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zis- strations,” arXiv preprint arXiv:1801.10459, 2018. serman, K. Simonyan et al., “Kickstarting deep rein- [75] S. Schaal, “Learning from demonstration,” in Advances forcement learning,” arXiv preprint arXiv:1803.03835, in neural information processing systems, 1997, pp. 1040– 2018. 1046. [92] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirk- [76] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, patrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband Robust multitask reinforcement learning,” in Advances et al., “Deep q-learning from demonstrations,” in Thirty- in Neural Information Processing Systems, 2017, pp. 4496– Second AAAI Conference on Artificial Intelligence, 2018. 4506. [77] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, [93] J. Schulman, X. Chen, and P. Abbeel, “Equivalence and P. Abbeel, “Overcoming exploration in reinforce- between policy gradients and soft q-learning,” arXiv ment learning with demonstrations,” in 2018 IEEE preprint arXiv:1704.06440, 2017. International Conference on Robotics and Automation [94] F. Fernandez´ and M. Veloso, “Probabilistic policy reuse (ICRA). IEEE, 2018, pp. 6292–6299. in a reinforcement learning agent,” in Proceedings of the [78] J. Chemali and A. Lazaric, “Direct policy iteration with fifth international joint conference on Autonomous agents demonstrations,” in Twenty-Fourth International Joint and multiagent systems. ACM, 2006, pp. 720–727. Conference on Artificial Intelligence, 2015. [95] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, [79] B. Piot, M. Geist, and O. Pietquin, “Boosted bellman H. P. van Hasselt, and D. Silver, “Successor features residual minimization handling expert demonstra- for transfer in reinforcement learning,” in Advances tions,” in Joint European Conference on Machine Learning in neural information processing systems, 2017, pp. 4055– and Knowledge Discovery in Databases. Springer, 2014, 4065. pp. 549–564. [96] R. Bellman, “Dynamic programming,” Science, vol. 153, [80] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. no. 3731, pp. 34–37, 1966. Taylor, and A. Nowe,´ “Reinforcement learning from [97] L. Torrey, T. Walker, J. Shavlik, and R. Maclin, “Using demonstration through shaping,” in Twenty-Fourth advice to transfer knowledge acquired in one reinforce- International Joint Conference on Artificial Intelligence, ment learning task to another,” in European Conference 2015. on Machine Learning. Springer, 2005, pp. 412–424. [81] B. Kang, Z. Jie, and J. Feng, “Policy optimization with [98] A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine, demonstrations,” in International Conference on Machine “Learning invariant feature spaces to transfer skills Learning, 2018, pp. 2474–2483. with reinforcement learning,” International Conference [82] D. P. Bertsekas, “Approximate policy iteration: A on Learning Representations (ICLR), 2017. survey and some new methods,” Journal of Control [99] G. Konidaris and A. Barto, “Autonomous shaping: Theory and Applications, vol. 9, no. 3, pp. 310–335, 2011. Knowledge transfer in reinforcement learning,” in [83] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, Proceedings of the 23rd international conference on Machine “Prioritized experience replay,” in ICLR, 2016. learning. 