<<

1 Transfer Learning in Deep : A Survey

Zhuangdi Zhu, Kaixiang Lin, and Jiayu Zhou

Abstract—Reinforcement Learning (RL) is a key technique to address sequential decision-making problems and is crucial to realize advanced artificial intelligence. Recent years have witnessed remarkable progress in RL by virtue of the fast development of deep neural networks. Along with the promising prospects of RL in numerous domains, such as robotics and game-playing, transfer learning has arisen as an important technique to tackle various challenges faced by RL, by transferring knowledge from external expertise to accelerate the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible RL backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the RL perspective and explore their potential challenges as well as open questions that await future research progress.

Index Terms—Transfer Learning, Reinforcement Learning, , Survey. !

1 INTRODUCTION

Einforcement Learning (RL) is an effective framework even incur safety concerns for domains such as automatic- R to solve sequential decision-making tasks, where a driving and health informatics, where the consequences of learning agent interacts with the environment to improve wrong decisions can be too high to take. The abovementioned its performance through trial and error [1]. Originated challenges have motivated various efforts to improve the from cybernetics and thriving in Computer Science, RL has current RL procedure. As a result, Transfer Learning (TL), been widely applied to tackle challenging tasks which were which is a technique to utilize external expertise from other previously intractable [2, 3]. tasks to benefit the learning process of the target task, As a pioneering technique for realizing advanced arti- becomes a crucial topic in RL. ficial intelligence, traditional RL was mostly designed for TL techniques have been extensively studied in the super- tabular cases, which provided principled solutions to simple vised learning domain [12], whereas it is an emerging topic tasks but faced difficulties when handling highly complex in RL. In fact, TL under the framework of RL can be more domains, e.g. tasks with 3D environments. Over the recent complicated in that the knowledge needs to transfer in the years, an integrated framework, where an RL agent is built context of a Markov Decision Process (MDP). Moreover, due upon deep neural networks, has been developed to address to the delicate components of an MDP, the expert knowledge more challenging tasks. The combination of deep learning may take different forms, which need to transfer in different with RL is hence referred to as Deep Reinforcement Learning ways. Noticing that previous efforts on summarizing TL for (DRL) [4], which aims to address complex domains that RL have not covered its most recent advancement [13, 14], were otherwise unresolvable by building deep, powerful in this survey, we make a comprehensive investigation function approximators. DRL has achieved notable success of Transfer Learning in Deep Reinforcement Learning. Es- in applications such as robotics control [5, 6] and game pecially, we build a systematic framework to categorize arXiv:2009.07888v4 [cs.LG] 4 Mar 2021 playing [7]. It also has a promising prospects in domains such the state-of-the-art TL techniques into different sub-topics, as health informatics [8], electricity networks [9], intelligent review their theories and applications, and analyze their transportation systems[10, 11], to name just a few. inter-connections. Besides its remarkable advancement, RL still faces in- The rest of this survey is organized as follows: In section 2, triguing difficulties induced by the exploration-exploitation we introduce the preliminaries of RL and its key algorithms, dilemma [1]. Specifically, for practical RL, the environment including those recently designed based on deep neural dynamics are usually unknown, and the agent cannot exploit networks. Next, we clarify the definition of TL in the context its knowledge about the environment to improve its perfor- of RL, and discuss its relevant research topics (Section mance until enough interaction experiences are collected via 2.4). In Section 3, we provide a framework to categorize exploration. Due to partial observability, sparse feedbacks, TL approaches from multiple perspectives, analyze their and the high-dimension in state and action spaces, acquiring fundamental differences, and summarize their evaluation sufficient interaction samples can be prohibitive, which may metrics (Section 3.3). In Section 4, we elaborate on different TL approaches in the context of DRL, organized by the format • Zhuangdi Zhu and Jiayu Zhou are with the Department of Computer of transferred knowledge, such as reward shaping (Section Science and Engineering, Michigan State University, East Lansing, MI, 4.1), learning from demonstrations (Section 4.2), or learning 48823. E-mail: [email protected], [email protected] from teacher policies (Section 4.3). We also investigate TL • Kaixiang Lin is with the Amazon Alexa AI. E-mail: [email protected] approaches by the way that knowledge transfer occurs, such 2

as inter-task mapping (Section 4.4), or learning transferrable expected rewards that an agent can get from s, given that representations (Section 4.5), etc. We discuss the recent the agent follows policy π in the environment M afterward. applications of TL in the context of DRL in Section 5 and Similar to the value-function, each policy also carries a Q- provide some future perspectives and open questions in function, which is defined over the state-action space to Section 6. estimate the quality of taking action a from state s: π E 0 π 0 QM(s, a) = s0∼T (·|s,a) [R(s, a, s ) + γVM(s )] . 2 DEEP REINFORCEMENT LEARNINGAND TRANS- FER EARNING The objective for an RL agent is to learn an optimal L ∗ policy πM to maximize the expectation of accumulated In this section, we provide a brief overview of the recent ∗ ∗ rewards, so that: ∀s ∈ S, πM(s) = arg max QM(s, a), development in RL and the definitions of some key ter- a∈A ∗ π minologies. Next, we provide categorizations to organize where QM(s, a) = sup QM(s, a). π different TL approaches, then point out some of the other topics in the context of RL, which are relevant to TL but will not be elaborated in this survey. 2.2 Reinforcement Learning Algorithms In this section, we review the key RL algorithms developed Remark 1. Without losing clarify, for the rest of this survey, we over the recent years, which provide cornerstones for the TL refer to MDPs, domains, and tasks equivalently. approaches discussed in this survey. Prediction and Control: any RL problem can be disas- 2.1 Reinforcement Learning Preliminaries sembled into two subtasks: prediction and control [1]. In the A typical RL problem can be considered as training an prediction phase, the quality of the current policy is being agent to interact with an environment that follows a Markov evaluated. In the control phase, which is also referred to as Decision Process (MPD) [15]. For each interaction with the the policy improvement phase, the learning policy is adjusted MDP, the agent starts with an initial state and performs an based on evaluation results from the prediction step. Policies action accordingly, which yields a reward to guide the agent can be improved by iteratively conducting these two steps, actions. Once the action is taken, the MDP transits to the which is therefore called policy iteration. next state by following the underlying transition dynamics of Policy iterations can be model-free, which means that the the MDP. The agent accumulates the time-discounted rewards target policy is optimized without requiring knowledge along with its interactions with the MDP. A subsequence of the MDP transition dynamics. Traditional model-free of interactions is referred to as an episode. For MDPs with RL includes Monte-Carlo methods, which uses samples of infinite horizons, one can assume that there are absorbing episodes to estimate the value of each state based on complete states, such that any action taken upon an absorbing state episodes starting from that state. Monte-Carlo methods can will only lead to itself and yields zero rewards. All above- be on-policy if the samples are collected by following the mentioned components in the MDP can be represented using target policy, or off-policy if the episodic samples are collected a tuple M = (µ0, S, A, T , γ, R, S0), in which: by following a behavior policy that is different from the target policy. • µ0 is the set of initial states. Temporal-Difference Learning, or TD-learning for short, • S is the state space. is an alternative to Monte-Carlo for solving the prediction • A is the action space. problem. The key idea behind TD-learning is to learn the • T : S × A × S → R is the transition probability 0 state quality function by bootstrapping. It can also be extended distribution, where T (s |s, a) specifies the probability 0 to solve the control problem so that both value function and of the state transitioning to s upon taking action a policy can get improved simultaneously. TD-learning is one from state s. of the most widely used RL paradigms due to its simplicity • R : S × A × S → R is the reward distribution, where 0 and general applicability. Examples of on-policy TD-learning R(s, a, s ) is the reward an agent can get by taking 0 algorithms include SARSA [16], Expected SARSA [17], Actor- action a from state s with the next state being s . Critic [18], and its deep neural extension named A3C [19]. • γ is a discounted factor, with γ ∈ (0, 1]. The off-policy TD-learning approaches include SAC [20] • S0 is the set of absorbing states. for continuous state-action spaces, and Q-learning [21] for An RL agent behaves in M by following its policy π, discrete state-action spaces, along with its variants built on which is a mapping from states to actions: π : S → A . For deep-neural networks, such as DQN [22], Double-DQN [22], stochastic policies, π(a|s) denotes the probability for agent Rainbow [23], etc. to take action a from state s. Given an MDP M and a policy TD-learning approaches, such as Q-learning, focus more π π, one can derive a value function VM(s), which is defined on estimating the state-action value functions. Policy Gra- over the state space: dient, on the other hand, is a mechanism that emphasizes π  2  on direct optimization of a parametrizable policy. Traditional V (s) = E r0 + γr1 + γ r2 + ... ; π, s , M policy-gradient approaches include REINFORCE [24]. Recent where ri = R(si, ai, si+1) is the reward that an agent years have witnessed the joint presence of TD-learning receives by taking action ai in the i-th state si, and the and policy-gradient approaches, mostly ascribed to the next state transits to si+1. The expectation E is taken over rapid development of deep neural networks. Representative s0 ∼ µ0, ai ∼ π(·|si), si+1 ∼ T (·|si, ai). The value-function algorithms along this line include Trust region policy opti- estimates the quality of being in state s, by evaluating the mization (TRPO) [25], Proximal Policy optimization (PPO) [26], 3

Deterministic policy gradient (DPG) [27] and its extensions, still interacts with the domain to access reward signals, in such as DDPG [28] and Twin Delayed DDPG [29]. the hope of improving the target policy assisted by a few expert demonstrations, rather than recovering the ground- 2.3 Transfer Learning in the Context of Reinforcement truth reward functions or the expert behavior. LfD can be Learning more effective than IL when the expert demonstrations are actually sub-optimal [35, 36]. Let Ms = {Ms|Ms ∈ Ms} be a set of source domains, Lifelong Learning, or Continual Learning, refers to the which provides prior knowledge Ds that is accessible by the ability to learn multiple tasks that are temporally or spatially target domain Mt, such that by leveraging the information related, given a sequence of non-stationary information. The from Ds, the target agent learns better in the target domain key to acquiring Lifelong Learning is a tradeoff between Mt, compared with not utilizing it. We use Ms ∈ Ms to refer to a single source domain. For the simplistic case, obtaining new information over time and retaining the knowledge can transfer between two agents within the same previously learned knowledge across new tasks. Lifelong Learning is a technique that is applicable to both supervised domain, which results in |Ms| = 1, Ms = Mt. We provide a more concrete description of TL from the RL perspective learning [37] and RL [38, 39], and is also closely related to the as the following: topic of Meta Learning [40]. Lifelong Learning can be a more challenging task compared to TL, mainly because that it Remark 2. [Transfer Learning in the Context of Rein- requires an agent to transfer knowledge across a sequence of forcement Learning] Given a set of source domains Ms = dynamically-changing tasks which cannot be foreseen, rather {Ms|Ms ∈ Ms} and a target domain Mt, Transfer Learning than performing knowledge transfer among a fixed group of ∗ aims to learn an optimal policy π for the target domain, by tasks. Moreover, the ability of automatic task detection can leveraging exterior information Ds from Ms as well as interior also be a requirement for Lifelong Learning [41], whereas for information Dt from Mt, s.t.: TL the agent is usually notified of the emergence of a new ∗ E π π = arg max s∼µt ,a∼π[QM(s, a)], task. π 0 t t Hierarchical Reinforcement Learning (HRL) has been where π = φ(Ds ∼ Ms, Dt ∼ Mt): S → A is a function proposed to resolve real-world tasks that are hierarchical. mapping from the states to actions for the target domain Mt, Different from traditional RL, in an HRL setting, the action learned based on information from both Dt and Ds. space is grouped into different granularities to form higher- In the above definition, we use φ(D) to denote the learned level macro actions. Accordingly, the learning task is also policy based on information D. Especially in the context of decomposed into hierarchically dependent subgoals. Most DRL, the policy π is learned using deep neural networks. well-known HRL frameworks include Feudal learning [42], One can consider regular RL without transfer learning as a Options framework[43], Hierarchical Abstract Machines [44], and special case of the above definition by treating Ds = ∅, so MAXQ [45]. Given the higher-level abstraction on tasks, that a policy π is learned purely on the feedback provided actions, and state spaces, HRL can facilitate knowledge by the target domain, i.e. π = φ(Dt). transfer across similar domains. In this survey, however, we focus on discussing approaches of TL for general RL 2.4 Related Topics tasks rather than HRL. Multi-Agent Reinforcement Learning (MARL) has strong In addition to TL, other efforts have been made to benefit RL connections with Game Theory [46]. Different from single- by leveraging different forms of supervision, usually under agent RL, MARL considers an MDP with multiple agents different problem settings. In this section, we briefly discuss acting simultaneously in the environment. It aims to solve other techniques that are relevant to TL, by analyzing the problems that were difficult or infeasible to be addressed differences, as well as the connections between TL and these by a single RL agent [47]. The interactive mode for multiple relevant techniques, which we hope can further clarify the agents can either be independent, cooperative, competitive, scope of this survey. or even a hybrid setting [48]. Approaches of knowledge Imitation Learning, also known as Apprenticeship Learn- transfer for MARL fall into two classes: inter-agent transfer ing, aims to train a policy to mimic the behavior of an and intra-agent transfer. We refer users to [49] for a more expert policy, given that only a few demonstrations from comprehensive survey under this problem setting. Different that expert are accessible. It is considered as an alternative from their perspective, this survey emphasizes the general TL to RL to solve sequential decision-making problems when approaches for a single agent scenario, although approaches the environment feedbacks are unavailable [30, 31, 32]. There mentioned in this survey may also be applicable to multi- are currently two main paradigms for imitation learning. agent MPDs. The first one is Behavior Cloning, in which a policy is trained in a supervise-learning manner, without access to any reinforcement learning signal [33]. The second one is Inverse Reinforcement Learning, in which the goal of imitation learning is to recover a reward function of the domain that can 3 ANALYZING TRANSFER LEARNINGFROM MULTI- explain the behavior of the expert demonstrator [34]. Imitation PLE PERSPECTIVES Learning is closely related to TL and has been adapted as a TL approach called Learning from Demonstrations (LfD), which In this section, we provide multi-perspective criteria to will be elaborated in Section 4.2. What distinguishes LfD analyze TL approaches in the context of RL and introduce and the classic Imitation Learning approaches is that LfD metrics for evaluations. 4

3.1 Categorization of Transfer Learning Approaches bility of a target domain. Compared with training We point out that TL approaches can be categorized by from scratch, TL enables the learning agent with answering the following key questions: better initial performance, which usually needs few interactions with the target domain to converge to a 1) What knowledge is transferred: Knowledge from good policy, guided by the transferred knowledge. the source domain can take different forms of su- Based on the number of interactions needed to pervisions, such as a set of expert experiences [50], enable TL, we can categorize TL techniques into the action probability distribution of an expert pol- the following classes: (i) Zero-shot transfer, which icy [51], or even a potential function that estimates learns an agent that is directly applicable to the the quality of state and action pairs in the source or target domain without requiring any interactions target MDP [52]. The divergence in knowledge rep- with it; (ii) Few-shot transfer, which only requires a resentations and granularities fundamentally decides few samples (interactions) from the target domain; the way that TL is performed. The quality of the and (iii) Sample-efficient transfer, where an agent can transferred knowledge, e.g. whether it comes from benefit from TL to learn faster with fewer interactions an oracle policy [53] or is provided by a sub-optimal and is therefore still more sample efficient compared teacher [36], also affects the way TL methods are to RL without any transfer learning. designed. 6) What are the goals of TL: We can answer this ques- 2) What RL frameworks are compatible with the TL tion by analyzing two aspects of a TL approach: (i) approach: We can rephrase this question into other the evaluation metrics and (i) the objective function. forms, e.g., is the TL approach policy-agnostic, or does it Evaluation metrics can vary from the asymptotic only apply to certain types of RL backbones, such as the performance to the training iterations used to reach a Temporal Difference (TD) methods? Answers to this certain performance threshold, which implies the question are closely related to the format of the different emphasis of the TL approach. On the transferred knowledge. For example, transferring other hand, TL approaches may optimize towards knowledge in the form of expert demonstrations various objective functions augmented with different are usually policy-agnostic (see Section 4.2), while regularizations, which is usually hinged on the policy distillation, as will be discussed in Section 4.3, format of the transferred knowledge. For example, may not be suitable for RL algorithms such as DQN, maximizing the policy entropy can be combined which does not explicitly learn a policy function. with the maximum-return learning objective in order 3) What is the difference between the source and the to encourage explorations when the transferred target domain: As discussed in Section 2.3, the knowledge is imperfect demonstrations [57]. source domain Ms is the place where the prior knowledge comes from, and the target domain Mt 3.2 Case Analysis of Transfer Learning is where the knowledge is transferred to. Some TL In this section, we use HalfCheetah1, one of the standard approaches are suitable for the scenario where Ms RL benchmarks for solving physical locomotion tasks, as a and Mt are equivalent, whereas others are designed to transfer knowledge between different domains. running example to illustrate how transfer learning can be For example, in video gaming tasks where observa- performed between the source and the target domain. As shown in Figure 1, the objective of HalfCheetah is to train a tions are RGB pixels, Ms and Mt may share the same action space (A) but differs in their observation two-leg agent to run as fast as possible without losing control spaces (S). For other problem settings, such as the of itself. goal-conditioned RL [54], the two domains may 3.2.1 Potential Domain Differences: differ only by the reward distribution: Rs 6= Rt. Such domain difference induces difficulties in trans- During TL, the differences between the source and target fer learning and affects how much knowledge can domain may reside in any component that forms an MDP. transfer. The source domain and the target domain can be different in 4) What information is available in the target do- any of the following aspects: main: While the cost of accessing knowledge from • S (State-space): domains can be made different by source domains is usually considered cheaper, it can extending or constraining the available positions for be prohibitive for the learning agent to access the the agent to move. target domain, or the learning agent can only have a • A (Action-space) can be adjusted by changing the very limited number of environment interactions range of available torques for the thigh, shin, or foot due to a high sampling cost. Examples for this of the agent. scenario include learning an auto-driving agent after • R (Reward function): a task can be simplified by training it in simulated platforms [55], or training using only the distance moved forward as rewards or a navigation robot using simulated image inputs be perplexed by using the scale of accelerated velocity before adapting it to real environments [56]. The in each direction as extra penalty costs. accessibility of information in the target domain can • T (Transition dynamics): two domains can differ by affect the way that TL approaches are designed. following different physical rules, leading to different 5) How sample-efficient the TL approach is: This ques- tion is related to question 4 regarding the accessi- 1. https://gym.openai.com/envs/HalfCheetah-v2/ 5

• Jumpstart Performance (jp): the initial performance (returns) of the agent. • Asymptotic Performance (ap): the ultimate performance (returns) of the agent. • Accumulated Rewards (ar): the area under the learning curve of the agent. • Transfer Ratio (tr): the ratio between ap of the agent with TL and ap of the agent without TL. • Time to Threshold (tt): the learning time (iterations) needed for the target agent to reach certain perfor- mance threshold. • Performance with Fixed Training Epochs (pe): the perfor- mance achieved by the target agent after a specific number of training iterations. • Performance Sensitivity (ps): the variance in returns using different hyper-parameter settings. Fig. 1: An illustration of the HalfCheetah domain. The The above criteria mainly focus on the learning process learning agent aims to move forward as fast as possible of the target agent. In addition, we introduce the follow- without losing its balance. ing metrics from the perspective of transferred knowledge, which, although commensurately important for evaluation, have not been explicitly discussed by prior art: transition probabilities given the same state-action pairs. • Necessary Knowledge Amount (nka): i.e. the necessary • µ0 (Initial states): the source and target domains may amount of the knowledge required for TL in order have different initial states, specifying where and with to achieve certain performance thresholds. Examples what posture the agent can start moving. along this line include the number of designed source • τ (Trajectories): the source and target domains may tasks, the number of expert policies, or the number of allow a different number of steps for the agent to demonstrated interactions required to enable knowl- move before a task is done. edge transfer. • Necessary Knowledge Quality (nkq): the necessary quality of the knowledge required to enable effective TL. This 3.2.2 Transferrable Knowledge: metric helps in answering questions such as (i) Does We list the following transferrable knowledge, assuming that the TL approach rely on near-oracle knowledge from the source and target domains are variants of the HalfCheetah the source domain, such as expert-level demonstra- benchmark, although other forms of knowledge transfer may tions/policies, or (ii) is the TL technique feasible even also be feasible: given sub-optimal knowledge? Metrics from the perspective of transferred knowledge are • Demonstrated trajectories: the target agent can learn harder to standardize because TL approaches differ in from the behavior of a pre-trained expert, e.g. a various perspectives, including the forms of transferred sequence of running demonstrations. knowledge, the RL frameworks utilized to enable such • Model dynamics: the learning agent may access an ap- transfer, and the difference between the source and the target proximation model of the physical dynamics, which domains. It may lead to biased evaluations by comparing TL is learned from the source domain but also applicable approaches from just one viewpoint. However, we believe in the target domain. The agent can therefore perform that explicating these knowledge-related metrics will help in dynamic-programing based on the physical rules, designing more generalizable and efficient TL approaches. running as fast as possible while avoid losing its In general, most of the abovementioned metrics can be control due to the accelerated velocity. considered as evaluating two abilities of a TL approach: the • Teacher policies: an expert policy may be consulted Mastery and Generalization. Mastery refers to how well the by the learning agent, which outputs the probability learned agent can ultimately perform in the target domain, of taking different actions upon a given state example. while Generalization refers to the ability of the learning agent • Teacher value functions: besides teacher policy, the to quickly adapt to the target domain assisted by the trans- learning agent may also refer to the value function ferred knowledge. Metrics, such as ap, ar and tr, evaluates derived by a teacher policy, which implies what state- the ability of Mastery, whereas metrics of jp, ps, nka and nkq actions are good or bad from the teacher’s point of emphasizes more on the ability of Generalization. Metrics such view. as tt, for example, can measure either the Mastery ability or the Generalization ability, depending on the choice of different 3.3 Evaluation metrics thresholds. tt with a threshold approaching the optimal emphasizes more on the Mastery, while a lower threshold We enumerate the following representative metrics for may focus on the Generalization ability. Equivalently, pe can evaluating TL approaches, some of which have also been also focus on either side depending on the choice of the summarized in prior work [58],[13]: number of training epochs. 6

Fig. 2: An overview of different TL approaches, organized by the format of transferred knowledge.

4 TRANSFER LEARNING APPROACHES forms a cycle: {s1, s2, s3, . . . , sn, s1} with F (s1, a1, s2) + In this section, we elaborate on various TL approaches and F (s2, a2, s3)+···+F (sn, an, s1) > 0. Potential-based reward organize them into different sub-topics, mostly by answering shaping avoids this issue by making any state cycle mean- Pn−1 the question of “what knowledge is transferred”. For each type ingless, with i=1 F (si, ai, si+1) ≤ −F (sn, an, s1) ≤ 0. of TL approach, we investigate them by following the other It has been proved that, without further restrictions on the criteria mentioned in Section 3. We start with the Reward underlying MDP or the shaping function F , PBRS is sufficient Shaping approach (Section 4.1), which is generally applicable and necessary to preserve the policy invariance. Moreover, to different RL algorithms while requiring minimal changes the optimal Q function in the original and transformed MDP to the underline RL framework, and overlaps with the other are related by the potential function: TL approaches discussed in this chapter. We also provide an ∗ ∗ QM0 (s, a) = QM(s, a) − Φ(s), (1) overview of different TL approaches discussed in this survey in Figure 2. which draws a connection between potential based reward- shaping and advantage-based learning approaches [60]. 4.1 Reward Shaping The idea of PBRS was extended to [61], which formulated the potential as a function over both the state and the Reward Shaping (RS) is a technique that leverages the action space. This approach is called Potential Based state- exterior knowledge to reconstruct the reward distributions of action Advice (PBA). The potential function Φ(s, a) therefore the target domain to guide the agent’s policy learning. More evaluates how beneficial an action a is to take from state s: specifically, in addition to the environment reward signals, 0 0 0 0 RS learns a reward-shaping function F : S × S × A → R F (s, a, s , a ) = γΦ(s , a ) − Φ(s, a). (2) to render auxiliary rewards, provided that the additional One limitation of PBA is that it requires on-policy learning, rewards contain external knowledge to guide the agent for which can be sample-inefficient, as in Equation (2), a0 is the better action selections. Intuitively, an RS strategy will assign action to take upon state s is transitioning to s0 by following higher rewards to more beneficial state-actions, which can the learning policy. Similar to Equation (1), the optimal Q navigate the agent to desired trajectories. As a result, the functions in both MDPs are connected by the difference agent will learn its policy using the newly shaped rewards ∗ ∗ 0 0 of potentials: Q 0 (s, a) = Q (s, a) − Φ(s, a). Once the R : R = R + F, which means that RS has altered the target M M optimal policy in M0 is learned, the optimal policy in M can domain with a different reward function: be recovered: 0 0 M = (S, A, T , γ, R)) → M = (S, A, T , γ, R ). π ∗ πM(s) = arg max (QM(s, a) − Φ(s, a)). Along the line of RS, Potential based Reward Shaping (PBRS) a∈A is one of the most classical approaches. [52] proposed PBRS Traditional RS approaches assumed a static potential to form a shaping function F as the difference between two function, until [62] proposed a Dynamic Potential Based (DPB) potential functions (Φ(·)): approach which makes the potential a function of both F (s, t, s0, t0) = γΦ(s0, t0) − Φ(s, t). F (s, a, s0) = γΦ(s0) − Φ(s), states and time: They proved that this dynamic approach can still maintain policy ∗ ∗ where the potential function Φ(·) comes from the knowledge invariance: QM0 (s, a) = QM(s, a) − Φ(s, t),where t is the of expertise and evaluates the quality of a given state. current tilmestep. [63] later introduced a way to incorporate The structure of the potential difference addresses a cycle any prior knowledge into a dynamic potential function dilemma mentioned in [59], in which an agent can get structure, which is called Dynamic Value-Function Advice positive rewards by following a sequence of states which (DPBA). The underline rationale of DPBA is that, given any 7

extra reward function R+ from prior knowledge, in order augmented reward comes intrinsically, such as the Belief to add this extra reward to the original reward function, the Reward Shaping proposed by [69], which utilized a Bayesian potential function should satisfy: reward shaping framework to generate the potential value that decays with experience, where the potential value comes γΦ(s0, a0) − Φ(s, a) = F (s, a) = R+(s, a). from the critic itself. If Φ is not static but learned as an extra state-action value The above RS approaches are summarized in Table 1. In function overtime, then the Bellman equation for Φ is : general, most RS approaches follow the potential based RS principle that has been developed systematically: from the π Φ 0 0 Φ (s, a) = r (s, a) + γΦ(s , a ). classical PBRS which is built on a static potential shaping function of states, to PBA which generates the potential as a The shaping rewards F (s, a) is therefore the negation of function of both states and actions, and DPB which learns a rΦ(s, a) : F (s, a) = γΦ(s0, a0) − Φ(s, a) = −rΦ(s, a). This dynamic potential function of states and time, to the state-of- leads to the approach of using the negation of R+ as the im- the-art DPBA, which involves a dynamic potential function of mediate reward to train an extra state-action value function states and actions to be learned as an extra state-action value Φ and the policy simultaneously, with rΦ(s, a) = −R+(s, a). function in parallel with the environment value function. As Φ will be updated by a residual term δ(Φ): an effective TL paradigm, RS has been widely applied to Φ(s, a) ← Φ(s, a) + βδ(Φ), fields including robot training [70], spoken dialogue systems

+ 0 0 [71], and question answering [72]. It provides a feasible where δ(Φ) = −R (s, a) + γΦ(s , a ) − Φ(s, a), and β is the framework for transferring knowledge as the augmented learning rate. Accordingly, the dynamic potential function F reward and is generally applicable to various RS algorithms. becomes: How to integrate RS with other TL approaches, such as 0 0 Learning from demonstrations (Section 4.2) and Policy Transfer Ft(s, a) = γΦt+1(s , a ) − Φt(s, a). (Section 4.3) to build the potential function for shaping will The advantage of DPBA is that it provides a framework to be an intriguing question for the ongoing research. allow arbitrary knowledge to be shaped as auxiliary rewards. Efforts along this line mainly focus on designing different 4.2 Learning from Demonstrations shaping functions F (s, a), while little work has addressed the question of what knowledge can be used to derive this potential In this section, we review TL techniques in which the trans- function. One work by [64] proposed to use RS to transfer ferred knowledge takes the form of external demonstrations. an expert policy from the source domain (Ms) to the target The demonstrations may come from different sources with domain (Mt). This approach assumed the existence of two different qualities: it can be provided by a human expert, mapping functions, MS and MA, which can transform the a previously learned expert policy, or even a suboptimal state and action from the source to the target domain. Then policy. For the following discussion, we use DE to denote the augmented reward is just πs((MS(s), MA(a))), which a set of demonstrations, and each element in DE is a tuple 0 is the probability that the mapped state and action will be of transition: i.e. (s, a, s , r) ∈ DE. Efforts along this line taken by the expert policy in the source domain. Another mostly address a specific TL scenario, i.e. the source and the work used demonstrated state-action samples from an expert target MDPs are the same: Ms = Mt, although there has policy to shape rewards [65]. Learning the augmented reward been work that learns from demonstrations generated in a involves a discriminator, which is trained to distinguish sam- different domain [73]. ples generated by an expert policy from samples generated In general, learning from demonstrations (LfD) is a technique by the target policy. The loss of the discriminator is applied to assist RL by utilizing provided demonstrations for more to shape rewards to incentivize the learning agent to mimic efficient exploration. Knowledge conveyed in demonstra- the expert behavior. This work is a combination of two TL tions encourages agents to explore states which can benefit approaches: RS and Learning from Demonstrations, the later of its policy learning. Depending on when the demonstrations which will be elaborated in Section 4.2. are used for knowledge transfer, approaches can be orga- Besides the single-agent and model-free RL scheme, there nized into offline methods and on-line methods. For offline have been efforts to apply RS to multi-agent RL [66], model- approaches, demonstrations are used for pre-training RL based RL [67], and hierarchical RL [68]. Especially, [66] components before the RL learning step. RL components extended the idea of RS to multi-agent systems, showing such as the value function V (s) [74], the policy π [2], or that the Nash Equilibria of the underlying stochastic game is even the model of transition dynamics [75], are initialized unchanged under a potential-based reward shaping structure. by using these demonstrations. For the [67] applied RS to model-based RL, where the potential online approach, demonstrations are directly used in the RL function is learned based on the free space assumption, an stage to guide agent actions for efficient explorations [76]. approach to model transition dynamics in the environment. Most work discussed in this section follows the online trans- [68] integrated RS to MAXQ, which is a hierarchical RL fer paradigm or combines the offline pre-training with the algorithm framework, by augmenting the extra reward onto online RL learning [77]. Depending on what RL frameworks the completion function of the MAXQ [45]. are used to enable knowledge transfer, work in this domain RS approaches discussed so far are built upon a consensus can be categorized into different branches: some adopts the that the source information for shaping the reward comes policy-iteration framework [50, 78, 79], others follow a Q- externally, which coincides the notion of knowledge transfer. learning framework [76, 80], while more recent work follows Some work of RS also considers the scenario where the the policy-gradient framework [36, 65, 77, 81]. 8 Methods MDP Difference Format of shaping reward Knowledge source 0 PBRS Ms = Mt F = γΦ(s ) − Φ(s)  0 0 PBA Ms = Mt F = γΦ(s , a ) − Φ(s, a)  0 0 DPB Ms = Mt F = γΦ(s , t ) − Φ(s, t)  0 0 DPBA Ms = Mt Ft = γΦt+1(s , a ) − Φt(s, a) ,  Φ learned as an extra Q function 0 0 [64] Ss 6= St, As 6= At Ft = γΦt+1(s , a ) − Φt(s, a) πs 0 0 [65] Ms = Mt Ft = γΦt+1(s , a ) − Φt(s, a) DE

TABLE 1: A comparison of reward shaping approaches.  denotes that the information is not revealed in the paper.

Demonstration data have been applied in the Policy function compared with APID, as Lπ is minimizing the Iterations framework by [82]. Later, [78] introduced the Optimal Bellman Residual instead of the empirical norm. Direct Policy Iteration with Demonstrations (DPID) algorithm. In addition to policy iteration, the following two ap- This approach samples complete demonstrated rollouts DE proaches integrate demonstration data into the TD-learning from an expert policy πE, in combination with the self- framework, such as Q-learning. Specifically, [76] proposed generated rollouts Dπ gathered from the learning agent. the Deep Q-learning from Demonstration (DQfD) algorithm, Dπ ∪ DE are used to learn a Monte-Carlo estimation of the which maintains two separate replay buffers to store demon- Q-value: Qˆ, from which a learning policy can be derived strated data and self-generated data, respectively, so that greedily: π(s) = arg maxQˆ(s, a). This policy π is further expert demonstrations can always be sampled with a certain a∈A probability. Their work leverages the refined priority replay regularized by a loss function L(s, πE) to minimize its mechanism [83] where the probability of sampling a transi- discrepancy from the expert policy decision: tion i is based on its priority pi with a temperature parameter pα NE i α: P (i) = P α . 1 X k pk L(π, πE) = 1{πE(si) 6= π(si)}, NE Another work under the Q-learning framework was i=1 proposed by [80]. Their work, dubbed as LfDS, draws a where NE is the number of expert demonstration samples, close connection to the Reward Shaping technique in Section and 1(·) is an indicator function. 4.1. It builds the potential function based on a set of expert Another work along this line includes the Approximate demonstrations, and the potential value of a given state- Policy Iteration with Demonstration (APID) algorithm, which action pair is measured by the highest similarity between was proposed by [50] and extended by [79]. Different from the given pair and the expert experiences. This augmented DPID where both DE and Dπ were used for value estimation, reward assigns more credits to state-actions that are more the APID algorithm applied only Dπ to approximate on the similar to expert demonstrations, which can eventually Q function. The expert demonstrations DE are used to learn encourage the agent for expert-like behavior. the value function, which, given any state si, renders expert Besides Q-learning, recent work has integrated LfD actions πE(si) with higher Q-value margins compared with into the policy gradient framework [30, 36, 65, 77, 81]. A other actions that are not shown in DE: representative work along this line is Generative Adversarial Imitation Learning (GAIL), proposed by [30]. GAIL introduced Q(si, πE(si)) − max Q(si, a) ≥ 1 − ξi. a∈A\πE (si) the notion of occupancy measure dπ, which is the stationary state-action distributions derived from a policy π. Based The term ξi is used to account for the case of imperfect on this notion, a new reward function is designed such demonstrations. This value shaping idea is instantiated as that maximizing the accumulated new rewards encourages an augmented hinge-loss to be minimized during the policy minimizing the distribution divergence between the occu- evaluation step: pancy measure of the current policy π and the expert policy π Q ← arg min f(Q), where f(Q) = E. Specifically, the new reward is learned by adversarial Q training [53]: a discriminator D is trained to distinguish  π α   interactions sampled from the current policy π and the expert L (Q) + 1 − (Q(si, πE(si)) − max Q(si, a)) + , NE a∈A\πE (si) policy πE:   π in which z = max{0, z} is the hinge loss, and L (Q) is E E + JD = max dπ log[1 − D(s, a)] + dE log[D(s, a)] the Q-function loss induced by an empirical norm of the D:S×A→(0,1) optimal bellman residual: Since πE is unknown, its state-action distribution dE is es- π E π L = (s,a)∼Dπ kT Q(s, a) − Q(s, a)k, timated based on the given expert demonstrations DE. It has been proved that, for a optimized discriminator, its output π E 0 0 where T Q(s, a) = R(s, a) + γ s0∼p(.|s,a)[Q(s , π(s ))] is satisfies D(s, a) = dπ . The output of the discriminator dπ +dE the bellman contracting operator. [79] further extended the is used as new rewards to encourage distribution matching, work of APID with a different evaluation loss: with r0(s, a) = − log(1−D(s, a)). The RL process is naturally π ∗ altered to perform distribution matching by optimizing the L = E(s,a)∼D kT Q(s, a) − Q(s, a)k, π following minimax objective: ∗ 0 0 where T Q(s, a) = R(s, a) + γEs0∼p(.|s,a)[maxQ(s , a )]. a0 max min J(π, D) : = Ed log[1 − D(s, a)] + Ed log[D(s, a)]. Their work theoretically convergence to the optimal Q π D π E 9

Although GAIL is more related to Imitation Learning on biased data [76, 83]. A different strategy to confront the than LfD, its philosophy of using expert demonstrations sub-optimality is to leverage those sub-optimal demonstra- for distribution matching has inspired other LfD algorithms. tions only to boost the initial learning stage. Specifically, For example, [81] extended GAIL with an algorithm called in the same spirit of GAIL,[36] proposed Self-Adaptive POfD, which combines the discriminator reward with the Imitation Learning (SAIL), which learns from sub-optimal environment reward, so that the the agent is trained to maxi- demonstrations using generative adversarial training while mize the accumulated environment rewards (RL objective) as gradually selecting self-generated trajectories with high well as performing distribution matching (imitation learning qualities to replace less superior demonstrations. objective): Another challenge faced by LfD is overfitting: demonstra- tions may be provided in limited numbers, which results in max = Ed [r(s, a)] − λDJS[dπ||dE]. (3) θ π the learning agent lacking guidance on states that are unseen in the demonstration dataset. This challenge is aggravated They further proved that optimizing Equation 3 is same in MDPs with sparse reward feedbacks, as the learning as a dynamic reward-shaping mechanism (Section 4.1): agent cannot obtain much supervision information from E 0 max = dπ [r (s, a)], the environment either. This challenge is also closely related θ to the covariate drift issue [84] which is commonly confronted 0 where r (s, a) = r(s, a) − λ log(Dw(s, a)) is the shaped by approaches of behavior cloning. Current efforts to address reward. this challenge include encouraging explorations by using an Both GAIL and POfD are under an on-policy RL frame- entropy-regularized objective [57], decaying the effects of work. To further improve the sample efficiency of TL, demonstration guidance by softening its regularization on some off-policy algorithms have been proposed, such as policy learning over time [35], and introducing disagreement DDPGfD [65] which is built upon the DDPG framework. regularizations by training an ensemble of policies based on DDPGfD shares a similar idea as DQfD in that they both use the given demonstrations, where the variance among policies a second replay buffer for storing demonstrated data, and serves as a cost (negative reward) function [85]. each demonstrated sample holds a sampling priority pi. For We summarize the above-discussed approaches in Table 2. a demonstrated sample, its priority pi is augmented with a In general, demonstration data can help in both offline pre- constant bias D > 0 in order to encourage more frequent training for better initialization and online RL for efficient sampling of expert demonstrations: exploration. During the RL learning phase, demonstration data can be used together with self-generated data to p = δ2 + λk∇ Q(s , a |θQ)k2 +  +  , i i a i i D encourage expert-like behaviors (DDPGfD, DQFD), to shape where δi is the TD-residual for transition i, value functions (APID), or to guide the policy update in the Q 2 k∇aQ(si, ai|θ )k is the loss applied to the actor, and  is a form of an auxiliary objective function (PID,GAIL, POfD). The small positive constant to ensure all transitions are sampled current RL framework used for LfD includes policy iteration, with some probability. Q-learning, and policy gradient. Developing more general Another work also adopted the DDPG framework to LfD approaches that are agnostic to RL frameworks and can learn from demonstrations [77]. Their approach differs from learn from sub-optimal or limited demonstrations would be DDPGfD in that its objective function is augmented with the next focus for this research domain. a Behavior Cloning Loss to encourage imitating on provided demonstrations: 4.3 Policy Transfer |DE | Policy Transfer X 2 In this section, we review TL approaches of , LBC = ||π(si|θπ) − ai|| . where the external knowledge takes the form of pretrained i=1 policies from one or multiple source domains. Work dis- To further address the issue of suboptimal demonstrations, cussed in this section is built upon a many-to-one problem in [77] the form of Behavior Cloning Loss is altered based on setting, which we formularize as below: the critic output, so that only demonstration actions with Problem Setting. (Policy Transfer) A set of teacher policies higher Q values will lead to the loss penalty: πE1 , πE2 , . . . , πEK are trained on a set of source domains |DE | M1, M2,..., MK , respectively. A student policy π is learned X 2 1 K LBC = kπ(si|θπ) − aik [Q(si, ai) > Q(si, π(si))]. for a target domain by leveraging knowledge from {πEi }i=1. i=1 For the one-to-one scenario, which contains only one There are several challenges faced by LfD, one of which teacher policy, one can consider it as a special case of the is the imperfect demonstrations. Previous approaches usually above problem setting with K = 1. Next, we categorize presume near-oracle demonstrations. However, demonstra- recent work of policy transfer into two techniques: policy tions can also be biased estimations of the environment distillation and policy reuse. or even from a sub-optimal policy [36]. Current solutions to imperfect demonstrations include altering the objective 4.3.1 Transfer Learning via Policy Distillation function. For example, [50] leveraged the hinge-loss func- The term knowledge distillation was proposed by [86] as an tion to allow occasional violations of the property that approach of knowledge ensemble from multiple teacher Q(si, πE(si)) − max Q(si, a) ≥ 1. Some other work a∈A\πE (si) models into a single student model. This technique is later uses regularizations on the objective to alleviate overfitting extended from the field of supervised-learning to RL. Since 10 Methods Optimality Guarantee Format of transferred demonstrations RL framework DQfD  Cached transitions in the replay buffer DQN LfDS  Reward shaping function DQN GAIL  Reward shaping function: −λ log(1 − D(s, a)) TRPO POfD  Reward shaping function: TRPO,PPO r(s, a) − λ log(1 − D(s, a)) DDPGfD  Increasing sampling priority DDPG [77]  Increasing sampling priority and behavior DDPG cloning loss DPID  Indicator binary-loss : L(si) = 1{πE (si) 6= API π(si)    APID Hinge loss on the marginal-loss: L(Q, π, πE ) + API APID extend  Marginal-loss: L(Q, π, πE ) API SAIL  Reward shaping function: r(s, a) − λ log(1 − D(s, a)) DDPG

TABLE 2: A comparison of learning from demonstration approaches.

the student model is usually shallower than the teacher From a different perspective, there are two approaches of model and can perform across multiple teacher tasks, policy distilling the knowledge from teacher policies to a student: distillation is also considered as an effective approach of (1) minimizing the cross-entropy loss between the teacher model compression [87] and multi-task RL [88]. and student policy distributions over actions [90, 91]; and The idea of knowledge distillation has been applied to (2) maximizing the probability that the teacher policy will the field of RL to enable policy distillation. Conventional visit trajectories generated by the student , i.e. maxθ P (τ ∼ policy distillation approaches transfer the teacher policy in a πE|τ ∼ πθ) [92, 93]. One example of approach (1) is the Actor- supervised learning paradigm [88, 89]. Specifically, a student mimic algorithm [90]. This algorithm distills the knowledge policy is learned by minimizing the divergence of action of expert agents into the student by minimizing the cross distributions between the teacher policy πE and student entropy between the student policy πθ and each teacher × policy πθ, which is denoted as H (πE(τt)|πθ(τt)): policy πEi over actions: i X  |τ|  L (θ) = π (a|s) log (a|s), Ei πθ X × E a∈AE min τ∼πE  ∇θH (πE(τt)|πθ(τt)) . i θ t=1 where each teacher agent is learned based on DQN, whose The above expectation is taken over trajectories sampled from policy is therefore derived from the Boltzmann distributions Q the teacher policy πE, which therefore makes this approach over the -function output: −1 teacher distillation τ QE (s,a) . A representative example of work along e i this line is [88], in which N teacher policies are learned πEi (a|s) = −1 0 . P τ QE (s,a ) a0∈A e i for N source tasks separately, and each teacher yields a Ei E N dataset D = {si, qi}i=0 consisting of observations (states) An instantiation of approach (2) is the Distral algorithm s and vectors of the corresponding q-values q, such that [92], in which a centroid policy πθ is trained based on qi = [Q(si, a1),Q(si, a2), ...|aj ∈ A]. Teacher policies are K teacher policies, with each teacher policy learned in a further distilled to a single student agent πθ by minimizing source domain Mi = {Si, Ai, Ti, γ, Ri)}, in the hope that the KL-Divergence between each teacher policy πEi (a|s) and knowledge in each teacher πE can be distilled to the centroid E i the student policy πθ, approximated using the dataset D : and get transferred to student policies. It assumes that both

E the transition dynamics Ti and reward distributions Ri are |D |  E   E  E X qi softmax(qi ) different across different source MDPs. A distilled policy min DKL(π |πθ) ≈ softmax ln . θ τ softmax(qθ) (student) is learned to perform in different domains by i=1 i PK maximizing maxθ i=1 J(πθ, πEi ), where An alternative policy distillation approach is called stu- h XE X t dent distillation [51, 90], which is similar to teacher distillation, J(πθ, πEi ) = (st,at)∼πθ γ (ri(at, st)+ except that during the optimization step, the expectation t t≥0 is taken over trajectories sampled from the student policy α 1 i log πθ(at|st) − log(πE (at|st))) , instead of the teacher policy: β β i

 |τ|  in which both log πθ(at|st) and πθ are used as augmented E X × rewards. Therefore, the above approach also draws a close min τ∼πθ  ∇θH (πE(τt)|πθ(τt)) . θ t=1 connection to Reward Shaping (Section 4.1). In effect, the log πθ(at|st) term guides the learning policy πθ to yield [51] provides a nice summarization of the related work actions that are more likely to be generated by the teacher

on both kinds of distillation approaches. While it is feasible policy, whereas the entropy term − log(πEi (at|st) serves to combine both [84], we observe that more recent work as a bonus reward for exploration. A similar approach focuses on student distillation, which empirically shows was proposed by [91] which only uses the cross-entropy better exploration ability compared to teacher distillation, between teacher and student policy λH(πE(at|st)||πθ(at|st)) especially when the teacher policy is deterministic. to reshape rewards. Moreover, they adopted a dynamically 11

fading coefficient to alleviate the effect of the augmented across multiple tasks. Next, a task (reward) mapper wi is reward so that the student policy becomes independent of learned, based on which the Q-function can be derived: the teachers after certain optimization iterations. π T Qi (s, a) = ψ(s, a) wi. 4.3.2 Transfer Learning via Policy Reuse [95] proved that the loss of GPI is bounded by the difference In addition to policy distillation, another policy transfer between the source and the target tasks. In addition to approach is Policy Reuse, which directly reuses policies from policy-reuse, their approach involves learning a shared source tasks to build the target policy. representation ψ(s, a), which is also a form of transferred The notion of policy reuse was proposed by [94], which knowledge and will be elaborated more in Section 4.5.2. directly learns expert policies based on a probability distri- We summarize the abovementioend policy transfer ap- bution P , where the probability of each policy to be used proaches in Table 3. In general, policy transfer can be realized during training is related to the expected performance gain by knowledge distillation, which can be either optimized of that policy in the target domain, denoted as Wi: from the student’s perspecive (student distillation), or from the teacher’s perspective (teacher distillation) Alternatively, exp (tW ) P (π ) = i , teacher policies can also be directly reused to update the Ei PK j=0 exp (tWj) target policy. All approaches discussed so far presumed one or multiple expert policies, which are always at the disposal where t is a dynamic temperature parameter that increases of the learning agent. Questions such as How to leverage over time. Under a Q-learning framework, the Q-function of imperfect policies for knowledge transfer, or How to refer to teacher their target policy is learned in an iterative scheme: during policies within a budget, are still open to be resolved by future every learning episode, W is evaluated for each expert policy i research along this line. πEi , and W0 is obtained for the learning policy, from which a reuse probability P is derived. Next, a behavior policy is sampled from this probability P . If an expert is sampled as 4.4 Inter-Task Mapping the behavior policy, the Q-function of the learning policy In this section, we review TL approaches that utilize mapping is updated by following the behavior policy in an -greedy functions between the source and the target domains to assist fashion. Otherwise, if the learning policy itself is selected as knowledge transfer. Research in this domain can be analyzed the behavior policy, then a fully greedy Q-learning update from two perspectives: (1) which domain does the mapping is performed. After each training episode, both Wi and function apply to, and (2) how is the mapped representation the temperature t for calculating the reuse probability is utilized. Most work discussed in this section shares a common updated accordingly. One limitation of this approach is that assumption as below: the Wi, i.e. the expected return of each expert policy on the target task, needs to be evaluated frequently. This work was Assumption. [Existence of Domain Mapping] A one-to- one mapping exists between the source domain Ms = implemented in a tabular case, leaving the scalability issue s s s s s s s unresolved. (µ0, S , A , T , γ , R , S0 ) and the target domain Mt = (µt , St, At, T t, γt, Rt, St) More recent work by [95] extended the Policy Improvement 0 0 . theorem [96] from one to multiple policies, which is named Earlier work along this line requires a given mapping as Generalized Policy Improvement. We refer its main theorem function [58, 97]. One examples is [58] which assumes that as follows: each target state (action) has a unique correspondence in X ,X Theorem. [Generalized Policy Improvement (GPI)] the source domain, and two mapping functions S A Let π1, π2, . . . , πn be n decision policies and let are provided over the state space and the action space, t s t s Qˆπ1 , Qˆπ2 ,..., Qˆπn be the approximations of their action- respectively, so that XS(S ) → S , XA(A ) → A . Based on X and X , a mapping function over the Q-values value functions, s.t: Qπi (s, a) − Qˆπi (s, a) ≤  ∀s ∈ S A M(Qs) → Qt can be derived accordingly. Another work S, a ∈ A, and i ∈ {1, 2, . . . , n}. Define π(s) = is done by [97] which transfers advice as the knowledge ˆπi arg max maxQ (s, a), then: between two domains. In their settings, the advice comes a i 2 from a human expert who provides the mapping function Qπ(s, a) ≥ maxQπi (s, a) −  i 1 − γ over the Q-values in the source domain and transfers it to the learning policy for the target domain. This advice encourages for any s ∈ S and a ∈ A, where Qπ is the action-value function the learning agent to prefer certain good actions over others, of π. which equivalently provides a relative ranking of actions in Based on this theorem, a policy improvement approach the new task. can be naturally derived by greedily choosing the action More later research tackles the inter-task mapping prob- which renders the highest Q value among all policies lem by automatically learning a mapping function [98, 99, 100]. for a given state. Another work along this line is [95], Most work learns a mapping function over the state space

in which an expert policy πEi is also trained on a dif- or a subset of the state space. In their work, state representa- ferent source domain Mi with reward function Ri, so tions are usually divided into agent-specific and task-specific Qπ (s, a) 6= Qπ (s, a) Q s s that M0 Mi . To efficiently evaluate the - representations, denoted as agent and env, respectively. functions of different source policies in the target MDP, a In [98] and [99], the mapping function is learned on the disentangled representation ψ(s, a) over the states and ac- agent-specific sub state, and the mapped representation is tions is learned based on neural networks and is generalized applied to reshape the immediate reward. For [98], the 12 Citation Transfer Approach MDP Difference RL framework Metrics [88] Distillation S, A DQN ap [89] Distillation S, A DQN ap, ps [90] Distillation S, A Soft Q-learning ap, ar, ps [92] Distillation S, A A3C ap, pe, tt [94] Reuse R Tabular Q-learning ap [95] Reuse R DQN ap, ar

TABLE 3: A comparison of policy transfer approaches.

invariant feature space mapped from sagent can be applied As summarized in Table 4, for TL approaches that utilize across agents who have distinct action space but share some an inter-task mapping, the mapped knowledge can be (a morphological similarity. Specifically, they assume that both subset of) the state space [98, 99], the Q function [58], or agents have been trained on the same proxy task, based (representations of) the state-action-sate transitions [105]. on which the mapping function is learned. The mapping In addition to being directly applicable in the target do- function is learned using an encoder-decoder neural network main [105], the mapped representation can also be used as an structure [101] in order to reserve as much information about augmented shaping reward [98, 99] or a loss objective [100] the source domain as possible. While transferring knowledge in order to guide the agent learning in the target domain. from the source agent to the target agent on a new task, the environment reward is augmented with an shaped reward term to encourage the target agent to imitate the source agent 4.5 Representation Transfer on the embedded feature space: In this section, we review TL approaches in which the transferred knowledge are feature representations, such as 0 s t r (s, ·) = α f(sagent; θf ) − g(sagent; θg) , representations learned for the value-function or Q-function. Approaches discussed in this section are developed based on s where f(sagent) is the agent-specific state in the source the powerful approximation ability of deep neural networks t domain, and g(sagent) is for the target domain. and are built upon the following consensual assumption: [100] applied the Unsupervised Manifold Alignment (UMA) Assumption. [Existence of Task-Invariance Subspace] approach [102] to automatically learn the state mapping The state space (S), action space (A), or even reward space between tasks. In their approach, trajectories are collected (R) can be disentangled into orthogonal sub-spaces, some of which from both the source and the target domain to learn a are task-invariant and are shared by both the source and target mapping between states. While applying policy gradient domains, such that knowledge can be transferred between domains learning, trajectories from M are first mapped back to the t on the universal sub-space. source: ξt → ξs, then an expert policy in the source domain is applied to each initial state of those trajectories to generate We organize recent work along this line into two ∼ near-optimal trajectories ξs, which are further mapped to subtopics: i) approaches that directly reuse representations ∼ ∼ ∼ from the source domain (Section 4.5.1), and ii) approaches the target domain: ξ → ξ . The deviation between ξ and ξ s t t t that learn to disentangle the source domain representations are used as a loss to be minimized in order to improve the into independent sub-feature representations, some of which target policy. Similar ideas of using UMA to assist transfer are on the universal feature space shared by both the source by inter-task mapping can also be found in [103] and [104]. and the target domains (Section 4.5.2). In addition to approaches that utilizes mapping over states or actions, [105] proposed to learn an inter-task 4.5.1 Reusing Representations mapping over the transition dynamics space: S × A × S. Their work assumes that the source and target domains A representative work of reusing representations is [107], are different in terms of the transition space dimensionality. which proposed the progressive neural network structure to Triplet transitions from both the source domain hss, as, s0si enable knowledge transfer across multiple RL tasks in a and the target domain < st, at, s0t > are mapped to a progressive way. A progressive network is composed of latent space Z. Given the feature representation in Z with multiple columns, where each column is a policy network for higher dimensionality, a similarity measure can be applied training one specific task. It starts with one single column to find a correspondence between the source and target for training the first task, and then the number of columns task triplets. Triplet pairs with the highest similarity in increases with the number of new tasks. While training on a this feature space Z are used to learn a mapping func- new task, neuron weights on the previous columns are frozen, tion X : hst, at, s0ti = X (hss, as, s0si). After the transition and representations from those frozen tasks are applied to the mapping, states sampled from the expert policy in the new column via a collateral connection to assist in learning source domain can be leveraged to render beneficial states the new task. This process can be mathematically generalized in the target domain, which assist the target agent learning as follows: with a better initialization performance. A similar idea of (k)  (k) (k) X (k:j) (j)  h = f W h + U h . mapping transition dynamics can be found in [106], which, i i i−1 j

however, requires stronger assumption on the similarity (k) (k) of the transition probability and the state representations where hi is the i-th hidden layer for task (column) k, Wi (k:j) between the source and the target domains. is the associated weight matrix, and Ui are the lateral 13 Citation Algorithm MDP Difference Mapping Function Usage of Mapping

[58] SARSA St 6= St, M(Qs) → Qt Q value reuse As 6= At [97] Q-learning As 6= At, M(Qs) → advice Relative Q ranking Rs 6= Rt 0 [98] Generally Applicable Ss 6= St M(st) → r Reward shaping 0 [99] SARSA(λ) Ss 6= St Rs 6= Rt M(st) → r Reward shaping [100] Fitted Value Iteration Ss 6= St M(ss) → st Penalty loss on state deviation from expert policy 0 0  [106] Fitted Q Iteration Ss × As 6= St × At M (ss, as, ss) → (st, at, st) Reduce random exploration 0 0  [105] No constraint Ss × As 6= St × At M (ss, as, ss) → (st, at, st) Reduce random exploration

TABLE 4: A comparison of inter-task mapping approaches.

connections from layer i − 1 of previous tasks to the current Successor Representations (SR) is an approach to de- layer of task k. couple the state features of a domain from its reward Although progressive network is an effective multi-task distributions. It enables knowledge transfer across multiple approach, it comes with a cost of giant network structure, domains: M = {M1, M2,..., MK }, so long as the only as the network grows proportionally with the number of difference among them is the reward distributions: Ri 6= Rj. incoming tasks. A later framework called PathNet is proposed SR was originally derived from neuroscience, until [113] by [108] which alleviates this issue by using a network proposed to leverage it as a generalization mechanism for with a fixed size. PathNet contains pathways, which are state representations in the RL domain. subsets of neurons whose weights contain the knowledge of Different from the v-value or Q-value that describes previous tasks and are frozen during training on new tasks. states as dependent on the reward distribution of the MDP, The population of pathway is evolved using a tournament SR features a state based on the occupancy measure of its selection genetic algorithm [109]. successor states. More concretely, the occupancy measure is Another approach of reusing representations for TL the unnormalized distribution of states or state-action pairs is modular networks [110, 111, 112]. For example, [110] that an agent will encounter when following policy π in the proposed to decompose the policy network into a task- MDP [30]. Specifically, SR decomposes the value-function of specific module and agent-specific module. Specifically, let π any policy into two independent components, ψ and R: be a policy performed by any agent (robot) r over the task X V π(s) = ψ(s, s0)w(s0), Mk as a function φ over states s, it can be decomposed into s0 two sub-modules gk and fr: where w(s0) is a reward mapping function which maps states π(s) := φ(s , s ) = f (g (s ), s ), env agent r k env agent to scalar rewards, and ψ is the SR which describes any state where fr is the agent-specific module while gk is the task- s as the occupancy measure of the future occurred states specific module. Their central idea is that the task-specific when following π: module can be applied to different agents performing " ∞ # 0 X i−t 0 the same task, which serves as a transferred knowledge. ψ(s, s ) = Eπ γ 1[Si = s ]|St = s , Accordingly, the agent-specific module can be applied to i=t different tasks for the same agent. 0 with 1[S = s ] = 1 as an indicator function. A model-based approach along this line is [112], which The successor nature of SR makes it learnable using learns a model to map the state observation s to a latent- any TD-learning algorithms. Especially, [113] proved the representation z. Accordingly, the transition probability is feasibility of learning such representation in a tabular case, modeled on the latent space instead of the original state in which the state transitions can be described using a space, i.e. zˆ = f (z , a ), where θ is the parameter of t+1 θ t t matrix. SR was later extended by [95] from three perspectives: the transition model, z is the latent-representation of the t (i) the feature domain of SR is extended from states to state observation, and a is the action accompanying that t state-action pairs; (ii) deep neural networks are used as state. Next, a reward module learns the value-function as function approximators to represent the SR ψπ(s, a) and well as the policy from the latent space z using an actor-critic the reward mapper w; (iii) Generalized policy improvement framework. One potential benefit of this latent representation (GPI) algorithm is introduced to accelerate policy transfer for is that knowledge can be transferred across tasks that have multi-tasks facilitated by the SR framework (See Section 4.3.2 different rewards but share the same transition dynamics, in for more details about GPI). These extensions, however, are which case the dynamics module can be directly applied to built upon a stronger assumption about the MDP: the target domain. Assumption. [Linearity of Reward Distributions] The reward 4.5.2 Disentangling Representations functions of all tasks can be computed as a linear combination of a Methods discussed in this section mostly focus on learning a fixed set of features: disentangled representation. Specifically, we elaborate on TL r(s, a, s0) = φ(s, a, s0)>w, (4) approaches that are derived from two techniques: Successor Representation (SR) and Universal Value Function Approximat- where φ(s, a, s0) ∈ Rd denotes the latent representation of the ing (UVFA). state transition, and w ∈ Rd is the task-specific reward mapper. 14

Based on this assumption, SR can be decoupled from the goals. The UVFA framework is built on a specific problem rewards when evaluating the Q-function of any policy π in a setting: task M with a reward function R : i i Problem Setting. (Goal Conditional RL) Task goals are defined π Eπ i i Qi (s, a) = [rt+1 + γrt+2 + ... |St = s, At = a] in terms of states, e.g. given the state space S and the goal space Eπ > > G, it satisfies that G ⊆ S. = [φt+1wi + γφt+2wi + ... |St = s, At = a] π > = ψ (s, a) wi. (5) One instantiation of this problem setting can be an agent exploring different locations in a maze, where the goals are The advantage of SR is that, when the knowledge of described as certain locations inside the maze. Under this π ψ (s, a) in Ms is observed, one can quickly get the perfor- problem setting, a UVFA module can be decoupled into mance evaluation of the same policy in Mt by replacing ws a state embedding φ(s) and a goal embedding ψ(g), by w Qπ = ψπ(s, a)w . with t: Mt t applying the technique of matrix factorization [118] to a Similar ideas of learning SR as a TD algorithm on a reward matrix describing the goal-conditional task. latent representation φ(s, a, s0) can also be found in [114, 115]. One merit of UVFA resides in its transferrable embedding Specifically, the work of [114] was developed based on an φ(s) across tasks which only differ by goals. Another is assumption which is weaker than Equation (4): Instead its ability of continual learning when the set of goals keep of requiring linearly-decoupled rewards, the latent space expanding over time. On the other hand, a key challenge φ(s, a, s0) is learned in an encoder-decoder structure to of UVFA is that applying the matrix factorization is time- ensure that the information loss is minimized when mapping consuming, which makes it a practical concern when per- states to the latent space. This structure, therefore, comes forming matrix factorization on complex environments with with an extra cost of learning a decoder fd to reconstruct the large state space |S|. Even with the learned embedding state: fd(φ(st)) ≈ st. networks, the third stage of fine-tuning these networks via An intriguing question faced by the SR approach is: Is end-to-end training is still necessary. Authors of the paper there a way that evades the linearity assumption about reward refer to the OptSpac tool for matrix factorization [119]. functions and still enables learning the SR without extra modular UVFA has been connected to SR by [116], in which a set cost? An extended work of SR [116] answered this question of independent rewards (tasks) themselves can be used as affirmatively, which proved that the reward functions does features for state representations. Another extended work not necessarily have to follow the linear structure, yet that combines UVFA with SR is called Universal Successor at the cost of a looser performance lower-bound while Feature Approximator (USFA), proposed by [120]. Following applying the GPI approach for policy improvement. Espe- the same linearity assumption about rewards as in Equation cially, rather than learning a reward-agnostic latent feature (4) , USFA is proposed as a function over a triplet of the state, φ(s, a, s0) ∈ Rd for multiple tasks, [116] aims to learn a action, and a policy embedding z: φ(s, a, s0) ∈ RD×d matrix to interpret the basis functions k d of the latent space instead, where D is the number of seen φ(s, a, z): S × A × R → R , k D tasks. Assuming out of tasks are linearly independent, where z is the output of a policy-encoding mapping z = e(π): k this matrix forms basis functions for the latent space. S × A → Rk. Based on USFA, the Q-function of any policy π M Therefore, for any unseen task i, its latent features can for a task specified by w can be formularized as the product be built as a linear combination of these basis functions, of a reward-agnostic Universal Successor Feature (USF) ψ and r (s, a, s0) so as its reward functions i . Based on the idea a reward mapper w: of learning basis-functions for a task’s latent space, they proposed that learning φ(s, a, s0) can be approximated as Q(s, a, w, z) = ψ(s, a, z)>w. learning r(s, a, s0) directly, where r(s, a, s0) ∈ RD is a vector of reward functions for each seen task: The above Q-function representation is distinct from Equa- tion (5), as the ψ(s, a, z) is generalized over multiple policies, 0  0 0 0  r(s, a, s ) = r1(s, a, s ); r2(s, a, s ), . . . , rD(s, a, s ) . with each denoted by z. Facilitated by the disentangled rewards and policy generalization, [120] further introduced ψ(s, a) π M Accordingly, learning for any policy i in i a generalized TD-error as a function over tasks w and policy becomes equivalent to learning a collection of Q-functions: z, which allows them to approximate the Q-function of any ∼πi  πi πi πi  policy on any task using a TD-algorithm. ψ (s, a) = Q1 (s, a),Q2 (s, a),...,QD (s, a) . A similar idea of using reward functions as features to 4.5.3 Discussion represent unseen tasks is also proposed by [117], which, We provide a summary of the discussed work in this section however, assumes the ψ and w as observable quantities in Table 5. In general, representation transfer can facilitate from the environment. transfer learning in many ways, and work along this line Universal Function Approximation (UVFA) is an alter- usually shares certain assumptions about some task-invariant native approach of learning disentangled state representa- property. Most of them assume that tasks are different tions [54]. Same as SR, UVFA allows transfer learning for only in terms of their reward distributions while sharing multiple tasks which differ only by their reward functions the same states (or actions or transitions) probabilities. (goals). Different from SR which focuses on learning a Other stronger assumptions include (i) decoupled dynamics, reward-agnostic state representation, UVFA aims to find a rewards [95], or policies [120] from the Q-function represen- function approximator that is generalized for both states and tations, and (ii) the feasibility of defining tasks in terms of 15

states [120]. Based on those assumptions, approaches such Recent work of reinforcement robotics learning with TL as TD-algorithms [116] or matrix-factorization [54] become approaches emphasizes more on the ability of fast and robust applicable to learn such disentangled representations. To adaptation to unseen tasks. A typical approach to achieve further exploit the effectiveness of disentangled structure, we this property is to design and select multiple source domains think that generalization approaches, which allow changing for robust training so that a generalized policy trained dynamics or state distributions, are important future work on those source tasks can be quickly transferred to target that is worth more attention in this domain. domains. Examples include the EPOPT approach proposed As an intriguing research topic, there are unresolved by [128], which is a combination of policy transfer via source questions along the line of representation transfer. One is domain ensemble and learning from limited demonstrations how to handle drastic changes of reward functions between for fast adaptation to the target task. Another application domains. As discussed in [121], good policies in one MDP can be found in [129], in which robust agent policies are may perform poorly in another due to the fact that beneficial trained by a large number of synthetic demonstrations from states or actions in Ms may become detrimental in Mt with a simulator to handles dynamic environments. Another totally different reward functions. Especially, as discussed idea for fast adaptation is to learn latent representations in the GPI work [95], the performance lower-bound is deter- from observations in the source domain that are generally mined by the reward function discrepancy while transferring applicable to the target domain, e.g. training robots using knowledge across different tasks. Learning a set of basis simulated 2D image inputs and applying the robot in real 3D functions [116] to represent unseen tasks (reward functions), environments. Work along this line includes [130], which or decoupling policies from Q function representation [120] learns the latent representation using 3D CAD models, may serve as a good start to address this issue, as they and [131, 132] which are derived based on the Generative- propose a generalized latent space, from which different Adversarial Network. Another example is DARLA [133], tasks (reward functions) can be interpreted. However, the which is a zero-shot transfer approach to learn disentangled limitation of this work is that it is not clear how many and representations that are against domain shifts. what kind of sub-tasks need to be learned to make the latent Game Playing is one of the most representative testbeds space generalizable enough to interpret unseen tasks. for TL and RL algorithms. Both the complexity and diversity Another question is how to generalize the representation of games for evaluating TL approaches have evolved over framework to allow transfer learning across domains with the recent decades, from classical testbeds such as grid- different dynamics (or state-action spaces). A learned SR world games to more complex game settings such as online- might not be transferrable to an MDP with different tran- strategy games or video games with pixel GRB inputs. A sition dynamics, as the distribution of occupancy measure representative TL application in game playing is AlphaGo, for successor states no longer holds even when following which is an algorithm for learning the online chessboard the same policy. Potential solutions may include model- games using both TL and RL techniques [2]. AlphaGo is based approaches that approximate the dynamics directly or first trained offline using expert demonstrations and then training a latent representation space for states using multiple learns to optimize its policy using Monte-Carlo Tree Search. tasks with different dynamics for better generalization [122]. Its successor, AlphaGo Master [3], even beat the world No.1 Alternatively, TL mechanisms from the supervised learning ranked human player. domain, such as meta-learning, which enables the ability In addition to online chessboard games, TL approaches of fast adaptation to new tasks [40], or importance sam- have also performed well in video game playing. State-of- pling [123], which can compensate for the prior distribution the-art video game platforms include MineCraft, Atari, and changes [12], might also shed lights on this question. Starcraft. Especially, [134] designed new RL tasks under the MineCraft platform for a better comparison of different RL algorithms. We refer readers to [135] for a survey of AI for 5 APPLICATIONS real-time strategy (RTS) games on the Starcraft platform, with In this section, we summarize practical applications of RL a dataset available from [136]. Moreover, [137] provided a based TL techniques: comprehensive survey on DL applications in video game Robotics learning is an important topic in the RL domain. playing, which also covers TL and RL strategies from certain [124] provided a comprehensive summary of applying RL perspectives. A large portion of TL approaches reviewed techniques to robotics learning. Under the RL framework, in this survey have been applied to the Atari [138] and a classical TL approach for facilitating robotics learning is other game above-mentioned platforms. Especially, OpenAI robotics learning from demonstrations, where expert demonstra- trained an Dota2 agent that can surpass human experts [139]. tions from humans or other robots are leveraged to teach We summarize the game applications of TL approaches the learning robot. [125] provided a nice summarization of mentioned in this survey in Table 6. approaches in this topic. Natural Language Processing (NLP) research has Later there emerged a scheme of collaborative robotic train- evolved rapidly along with the advancement of DL and ing [126]. By collaborative training, knowledge from different RL. There is an increasing trend of addressing NLP problems robots is transferred by sharing their policies and episodic by leveraging RL techniques. Applications of RL on NLP demonstrations with each other. A recent instantiation of range widely, from Question Answering (QA) [140], Dialogue this approach can be found in [127]. Their approach can be systems [141], Machine Translation [142], to an integration of considered as a policy transfer across multiple robot agents NLP and tasks, such as Visual Question An- under the DQN framework, which shares the demonstrations swering (VQA) [143], Image Caption [144], etc. Many of these in a pool and performing policy updates asynchronously. NLP applications have implicitly applied TL approaches, 16 Citation Representations Format Assumptions MDP Difference Learner Metrics [107] Lateral connections to N/A S, A A3C ap, ps previously learned net- work modules [108] Selected neural paths N/A S, A A3C ap [110] Task(agent)-specific net- Disentangled state rep- S, A Policy Gradient ap work module resentation [112] Dynamic transitions N/A S, A A3C ap, pe module learned on latent representations of the state space [95] SF Reward function can be R DQN ap, ar linearly decoupled [114] Encoder-decoder N/A R DQN pe, ps learned SF [116] Encoder-decoder Rewards can be repre- R Q(λ) ap, pe learned SF sented by set of basis functions [54] Matrix-factorized UF Goals are defined in R Tabular Q-learning ap, pe,ps terms of states [120] Policy-encoded UF Reward function can be R -greedy Q-learning ap, pe linearly decoupled;

TABLE 5: A comparison of TL approaches that transfer representations.

including learning from demonstrations, policy transfer, or effectiveness and sample efficiency, especially given the reward shaping, in order to better tailor these RL techniques difficulty of accessing large amounts of clinical data. as NLP solutions which were previously dominated by Others: Besides the above-mentioned topics, RL has the supervise-learning counterparts [145]. Examples in this also been utilized in many real-life applications. For ex- filed include applying expert demonstration to build RL ample, applications in the Transportation Systems have solutions for Spoken Dialogue Systems [146], VQA [143]; or adopted RL for addressing traffic congestion issues with building shaped rewards for Sequence Generation [147], Spoken better traffic signal scheduling and transportation resource alloca- Dialogue Systems [71],QA [72, 148], and Image Caption [144], tion [10, 11, 164, 165]. We refer readers to [166] for a review or transferring policies for Structured Prediction [149] and of RL applications on traffic signal controls. RL and Deep RL VQA [150], etc. We summarize information of the above- are also effective solutions to problems in Finance, including mentioned applications in Table 7. portfolio management [167, 168], asset allocation [169], trading optimization [170], etc. Another application is the Electricity Health Informatics is another domain that has bene- Systems, especially the intelligent electricity networks, which fited from the advancement of RL. RL techniques have can benefit from RL techniques for improved power-delivery been applied to solve many healthcare tasks, including decisions [171, 172] and active resource management [173]. dynamic treatment regimes [151, 152], automatic medical di- We refer readers to [9] for a comprehensive survey of agnosis [153, 154], health resource scheduling [155, 156], and RL techniques for electric power system applications. We drug discovery and development,[157, 158], etc. An overview of believe that given a plethora of theoretical and algorithm recent achievements of RL techniques in the domain of health breakthroughs in TL and DRL research, it is promising to informatics is provided by [159]. Despite the emergence of RL embrace a hybrid framework of TL and DRL techniques for applications to address healthcare problems, only a limited further accelerating the development of these domains. number of them have utilized TL approaches, although we do observe some applications that leverage prior knowledge Q to improve the RL procedure. Specifically, [160] utilized - 6 FUTURE PERSPECTIVES learning for drug delivery individualization. They integrated the prior knowledge of the dose-response characteristics In this section, we propose some future research directions into their Q-learning framework and leveraged this prior for TL in the RL domain: knowledge to avoid unnecessary exploration. Some work Modeling Transferability: A key question for TL is, has considered reusing representations for speeding up whether or to what extent can the knowledge for solving one task the decision-making process [161, 162]. For example, [161] help in solving another? Answering this question contributes proposed to highlight both the individual variability and to realizing automatic TL, including the source task selection, the common policy model structure for the individual HIV the mapping function design, disentangling representations, treatment. [162] applied a DQN framework for prescribing avoiding negative transfer, etc. By aggregating the problem effective HIV treatments, in which they learned a latent setting as well as the assumptions of the TL approaches representation to estimate the uncertainty when transferring discussed in this survey, we hereby introduce a framework a pertained policy to the unseen domains. [163] considered that theoretically enables transfer learning across different do- K the possibility of applying human-involved interactive RL mains. More specifically, for a set of domains M = {Mi}i=1 i i i i training for health informatics. We consider TL combined with K ≥ 2, and ∀Mi ∈ M, Mi = (S , A , T , R , ··· ), with RL a promising integration to be applied in the domain it is feasible to transfer knowledge across these domains if of health informatics, which can further improve the learning there exist: 17 Game Citation TL Approach Atari, 3D Maze [58] Transferrable Citation Application TL Approach representation [146] Spoken Dialogue System LFD Atari, Mujoco [81] LFD [143] VQA LFD Atari [77] LFD [147] Sequence Generation LFD, Reward Shaping Go [2] LFD [72] QA Reward Shaping Keepaway [58] Mapping [148] QA LFD, Reward Shaping RoboCup [97] Mapping [149] Structured Prediction Policy Transfer Atari [88] Policy Transfer [150] Grounded Dialog Generation Policy Distillation Atari [89] Policy Transfer [144] Image Caption Reward Shaping Atari [90] Policy Transfer 3D Maze [92] Policy Transfer Dota2 [139] Reward Shaping TABLE 7: Applications of TL approaches in Natural Language Processing. TABLE 6: TL Applications in Game Playing.

K • A set of invertible mapping functions {gi}i=1, each interpretability can also help in avoiding catastrophic decision- of which maps the state representation of certain making for tasks such as auto-driving or healthcare decisions. domain to a consensual latent space ZS , i.e.: ∀ i ∈ Although there has been efforts towards explainable TL i −1 i {K}, ∃ gi : S → ZS , and gi : ZS → S . approaches for RL tasks [177, 178], there is no clear definition K • A set of invertible mapping functions {fi}i=1, each for interpretable TL in the context of RL, nor a framework of which maps the action representation of certain to evaluate the interpretability of TL approaches. We believe domain to a consensual latent space ZA, i.e.: ∀ i ∈ that the standardization of interpretable TL for RL will be a i −1 i {K}, ∃ fi : A → ZA, and fi : ZA → A . topic that is worth more research efforts in the near future. • A policy π : ZS → ZA and a constant , such that π is -optimal on the common feature space for any considered domain Mi ∈ M. i.e.: REFERENCES π ∗ [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An ∃ π : ZS → ZA,  ≥ 0, s.t. ∀Mi ∈ M, VMi − VMi ≤ , introduction. MIT press, 2018. V ∗ M [2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, where Mi is the optimal policy for domain i. G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, Evaluating Transferability: Evaluation metrics have been V. Panneershelvam, M. Lanctot et al., “Mastering the proposed to evaluate TL approaches from different but game of go with deep neural networks and tree search,” complementary perspectives, although no single metric can nature, vol. 529, no. 7587, p. 484, 2016. summarize the efficacy of a TL approach. Designing a set of [3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, generalized, novel metrics is beneficial for the development A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, of TL in DRL domain. In addition to the current benchmarks, A. Bolton et al., “Mastering the game of go without such as OpenAI gym that is designed purely for evaluating human knowledge,” Nature, vol. 550, no. 7676, p. 354, RL approaches, a unified benchmark to evaluate the TL 2017. performance is also worth research and engineering efforts. [4] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and Framework-agnostic Transfer: Most contemporary TL A. A. Bharath, “A brief survey of deep reinforcement approaches are designed for certain RL frameworks. For learning,” arXiv preprint arXiv:1708.05866, 2017. example, some TL methods are applicable to RL algorithms [5] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to- designed for the discrete-action space (such as DQfD), while end training of deep visuomotor policies,” The Journal others may only be feasible given a continuous action space. of Research, vol. 17, no. 1, pp. 1334– One fundamental cause of these framework-dependent TL 1373, 2016. methods is the diversified development of RL algorithms. [6] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and We expect that a more unified RL framework would in turn D. Quillen, “Learning hand-eye coordination for contribute to the standardization of TL approaches in this robotic grasping with deep learning and large-scale field. data collection,” The International Journal of Robotics Interpretability: Deep learning and end-to-end systems Research, vol. 37, no. 4-5, pp. 421–436, 2018. have made network representation a black-box, making it [7] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, difficult to interpret or debug the model’s representations or “The arcade learning environment: An evaluation plat- decisions. As a result, there have been efforts in defining and form for general agents,” Journal of Artificial Intelligence evaluating the interpretability of approaches in the supervise- Research, vol. 47, pp. 253–279, 2013. learning domain [174, 175, 176]. The merits of interpretability [8] M. R. Kosorok and E. E. Moodie, Adaptive Treat- are manifolds, including enabling disentangled represen- mentStrategies in Practice: Planning Trials and Analyzing tations, building explainable models, facilitating human- Data for Personalized Medicine. SIAM, 2015, vol. 21. computer-interactions, etc. At the meantime, interpretable TL [9] M. Glavic, R. Fonteneau, and D. Ernst, “Reinforce- approaches for the RL domain, especially with explainable ment learning for electric power system decision and representations or policy decisions, can also be beneficial to control: Past considerations and perspectives,” IFAC- many applied fields, such as robotics and finance. Moreover, PapersOnLine, vol. 50, no. 1, pp. 6918–6927, 2017. 18

[10] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, [25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and “Multiagent reinforcement learning for integrated net- P. Moritz, “Trust region policy optimization,” in In- work of adaptive traffic signal controllers (marlin-atsc): ternational conference on machine learning, 2015, pp. 1889– methodology and large-scale application on downtown 1897. toronto,” IEEE Transactions on Intelligent Transportation [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and Systems, vol. 14, no. 3, pp. 1140–1150, 2013. O. Klimov, “Proximal policy optimization algorithms,” [11] H. Wei, G. Zheng, H. Yao, and Z. Li, “Intellilight: A arXiv preprint arXiv:1707.06347, 2017. reinforcement learning approach for intelligent traffic [27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, light control,” in Proceedings of the 24th ACM SIGKDD and M. Riedmiller, “Deterministic policy gradient International Conference on Knowledge Discovery & Data algorithms,” 2014. Mining. ACM, 2018, pp. 2496–2505. [28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, [12] S. J. Pan and Q. Yang, “A survey on transfer learning,” Y. Tassa, D. Silver, and D. Wierstra, “Continuous con- IEEE Transactions on knowledge and data engineering, trol with deep reinforcement learning,” arXiv preprint vol. 22, no. 10, pp. 1345–1359, 2009. arXiv:1509.02971, 2015. [13] M. E. Taylor and P. Stone, “Transfer learning for [29] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing reinforcement learning domains: A survey,” Journal function approximation error in actor-critic methods,” of Machine Learning Research, vol. 10, no. Jul, pp. 1633– arXiv preprint arXiv:1802.09477, 2018. 1685, 2009. [30] J. Ho and S. Ermon, “Generative adversarial imitation [14] A. Lazaric, “Transfer in reinforcement learning: a learning,” in Advances in neural information processing framework and a survey,” in Reinforcement Learning. systems, 2016, pp. 4565–4573. Springer, 2012, pp. 143–173. [31] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Off-policy [15] R. Bellman, “A markovian decision process,” Journal of imitation learning from observations,” in Advances in mathematics and mechanics, pp. 679–684, 1957. Neural Information Processing Systems, vol. 33, 2020, pp. [16] G. A. Rummery and M. Niranjan, On-line Q-learning 12 402–12 413. using connectionist systems. University of Cambridge, [32] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and Department of Engineering Cambridge, England, 1994, J. Tompson, “Discriminator-actor-critic: Addressing vol. 37. sample inefficiency and reward bias in adversarial [17] H. Van Seijen, H. Van Hasselt, S. Whiteson, and imitation learning,” arXiv preprint arXiv:1809.02925, M. Wiering, “A theoretical and empirical analysis of 2018. expected sarsa,” in 2009 IEEE Symposium on Adap- [33] D. A. Pomerleau, “Efficient training of artificial neural tive Dynamic Programming and Reinforcement Learning. networks for autonomous navigation,” Neural Compu- IEEE, 2009, pp. 177–184. tation, vol. 3, no. 1, pp. 88–97, 1991. [18] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algo- [34] A. Y. Ng, S. J. Russell et al., “Algorithms for inverse rithms,” in Advances in neural information processing reinforcement learning.” in Icml, vol. 1, 2000, p. 2. systems, 2000, pp. 1008–1014. [35] M. Jing, X. Ma, W. Huang, F. Sun, C. Yang, B. Fang, [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, and H. Liu, “Reinforcement learning from imperfect T. Harley, D. Silver, and K. Kavukcuoglu, “Asyn- demonstrations under soft expert guidance.” in AAAI, chronous methods for deep reinforcement learning,” 2020, pp. 5109–5116. in International conference on machine learning, 2016, pp. [36] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Learning sparse 1928–1937. rewarded tasks from sub-optimal demonstrations,” [20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, arXiv preprint arXiv:2004.00530, 2020. “Soft actor-critic: Off-policy maximum entropy deep [37] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and reinforcement learning with a stochastic actor,” in S. Wermter, “Continual lifelong learning with neural International Conference on Machine Learning. PMLR, networks: A review,” Neural Networks, 2019. 2018, pp. 1861–1870. [38] R. S. Sutton, A. Koop, and D. Silver, “On the role of [21] C. J. Watkins and P. Dayan, “Q-learning,” Machine tracking in stationary environments,” in Proceedings learning, vol. 8, no. 3-4, pp. 279–292, 1992. of the 24th international conference on Machine learning. [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, ACM, 2007, pp. 871–878. J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, [39] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, A. K. Fidjeland, G. Ostrovski et al., “Human-level I. Mordatch, and P. Abbeel, “Continuous adaptation control through deep reinforcement learning,” Nature, via meta-learning in nonstationary and competitive vol. 518, no. 7540, p. 529, 2015. environments,” ICLR, 2018. [23] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, [40] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, meta-learning for fast adaptation of deep networks,” and D. Silver, “Rainbow: Combining improvements in Proceedings of the 34th International Conference on in deep reinforcement learning,” in Proceedings of the Machine Learning-Volume 70. JMLR. org, 2017, pp. AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 1126–1135. 2018. [41] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learn- [24] R. J. Williams, “Simple statistical gradient-following ing to detect unseen object classes by between-class algorithms for connectionist reinforcement learning,” attribute transfer,” in 2009 IEEE Conference on Computer Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992. Vision and Pattern Recognition. IEEE, 2009, pp. 951–958. 19

[42] P. Dayan and G. E. Hinton, “Feudal reinforcement using reinforcement learning and shaping.” in ICML, learning,” in Advances in neural information processing vol. 98, 1998, pp. 463–471. systems, 1993, pp. 271–278. [60] R. J. Williams and L. C. Baird, “Tight performance [43] R. S. Sutton, D. Precup, and S. Singh, “Between mdps bounds on greedy policies based on imperfect value and semi-mdps: A framework for temporal abstraction functions,” Citeseer, Tech. Rep., 1993. in reinforcement learning,” Artificial intelligence, vol. [61] E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled 112, no. 1-2, pp. 181–211, 1999. methods for advising reinforcement learning agents,” [44] R. Parr and S. J. Russell, “Reinforcement learning in Proceedings of the 20th International Conference on with hierarchies of machines,” in Advances in neural Machine Learning (ICML-03), 2003, pp. 792–799. information processing systems, 1998, pp. 1043–1049. [62] S. M. Devlin and D. Kudenko, “Dynamic potential- [45] T. G. Dietterich, “Hierarchical reinforcement learning based reward shaping,” in Proceedings of the 11th Inter- with the maxq value function decomposition,” Journal national Conference on Autonomous Agents and Multiagent of artificial intelligence research, vol. 13, pp. 227–303, 2000. Systems. IFAAMAS, 2012, pp. 433–440. [46] R. B. Myerson, Game theory. Harvard university press, [63] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowe,´ 2013. “Expressing arbitrary reward functions as potential- [47] L. Bu, R. Babu, B. De Schutter et al., “A comprehensive based advice,” in Twenty-Ninth AAAI Conference on survey of multiagent reinforcement learning,” IEEE Artificial Intelligence, 2015. Transactions on Systems, Man, and Cybernetics, Part C [64] T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowe,´ (Applications and Reviews), vol. 38, no. 2, pp. 156–172, “Policy transfer using reward shaping,” in Proceedings of 2008. the 2015 International Conference on Autonomous Agents [48] M. Tan, “Multi-agent reinforcement learning: Indepen- and Multiagent Systems. International Foundation for dent vs. cooperative agents,” in Proceedings of the tenth Autonomous Agents and Multiagent Systems, 2015, international conference on machine learning, 1993, pp. pp. 181–188. 330–337. [65] M. Vecerˇ ´ık, T. Hester, J. Scholz, F. Wang, O. Pietquin, [49] F. L. Da Silva and A. H. R. Costa, “A survey on transfer B. Piot, N. Heess, T. Rothorl,¨ T. Lampe, and M. Ried- learning for multiagent reinforcement learning sys- miller, “Leveraging demonstrations for deep rein- tems,” Journal of Artificial Intelligence Research, vol. 64, forcement learning on robotics problems with sparse pp. 645–703, 2019. rewards,” arXiv preprint arXiv:1707.08817, 2017. [50] B. Kim, A.-m. Farahmand, J. Pineau, and D. Precup, [66] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, “Learning from limited demonstrations,” in Advances “Potential-based difference rewards for multiagent in Neural Information Processing Systems, 2013, pp. 2859– reinforcement learning,” in Proceedings of the 2014 inter- 2867. national conference on Autonomous agents and multi-agent [51] W. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, systems. International Foundation for Autonomous G. Swirszcz, and M. Jaderberg, “Distilling policy Agents and Multiagent Systems, 2014, pp. 165–172. distillation,” in The 22nd International Conference on [67] M. Grzes and D. Kudenko, “Learning shaping rewards Artificial Intelligence and Statistics, 2019. in model-based reinforcement learning,” in Proc. AA- [52] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance MAS 2009 Workshop on Adaptive Learning Agents, vol. under reward transformations: Theory and application 115, 2009. to reward shaping,” in ICML, vol. 99, 1999, pp. 278–287. [68] Y. Gao and F. Toni, “Potential based reward shaping for [53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, hierarchical reinforcement learning,” in Twenty-Fourth D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, International Joint Conference on Artificial Intelligence, “Generative adversarial nets,” in Advances in neural 2015. information processing systems, 2014, pp. 2672–2680. [69] O. Marom and B. Rosman, “Belief reward shaping [54] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Uni- in reinforcement learning,” in Thirty-Second AAAI versal value function approximators,” in International Conference on Artificial Intelligence, 2018. Conference on Machine Learning, 2015, pp. 1312–1320. [70] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Vil- [55] C. Finn and S. Levine, “Meta-learning: from few-shot lasenor-Pineda,˜ “Dynamic reward shaping: Training learning to rapid reinforcement learning,” in ICML, a robot by voice,” in Advances in Artificial Intelligence – 2019. IBERAMIA 2010. Berlin, Heidelberg: Springer Berlin [56] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, Heidelberg, 2010, pp. 483–492. S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning [71] P.-H. Su, D. Vandyke, M. Gasic, N. Mrksic, T.-H. agile locomotion for quadruped robots,” arXiv preprint Wen, and S. Young, “Reward shaping with recur- arXiv:1804.10332, 2018. rent neural networks for speeding up on-line policy [57] Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell et al., “Re- learning in spoken dialogue systems,” arXiv preprint inforcement learning from imperfect demonstrations,” arXiv:1508.03391, 2015. arXiv preprint arXiv:1802.05313, 2018. [72] X. V. Lin, R. Socher, and C. Xiong, “Multi-hop knowl- [58] M. E. Taylor, P. Stone, and Y. Liu, “Transfer learning via edge graph reasoning with reward shaping,” arXiv inter-task mappings for temporal difference learning,” preprint arXiv:1808.10568, 2018. Journal of Machine Learning Research, vol. 8, no. Sep, pp. [73] F. Liu, Z. Ling, T. Mu, and H. Su, “State 2125–2167, 2007. alignment-based imitation learning,” arXiv preprint [59] J. Randløv and P. Alstrøm, “Learning to drive a bicycle arXiv:1911.10947, 2019. 20

[74] X. Zhang and H. Ma, “Pretraining deep actor-critic [91] S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Do- reinforcement learning algorithms with expert demon- ersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zis- strations,” arXiv preprint arXiv:1801.10459, 2018. serman, K. Simonyan et al., “Kickstarting deep rein- [75] S. Schaal, “Learning from demonstration,” in Advances forcement learning,” arXiv preprint arXiv:1803.03835, in neural information processing systems, 1997, pp. 1040– 2018. 1046. [92] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirk- [76] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, patrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband Robust multitask reinforcement learning,” in Advances et al., “Deep q-learning from demonstrations,” in Thirty- in Neural Information Processing Systems, 2017, pp. 4496– Second AAAI Conference on Artificial Intelligence, 2018. 4506. [77] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, [93] J. Schulman, X. Chen, and P. Abbeel, “Equivalence and P. Abbeel, “Overcoming exploration in reinforce- between policy gradients and soft q-learning,” arXiv ment learning with demonstrations,” in 2018 IEEE preprint arXiv:1704.06440, 2017. International Conference on Robotics and Automation [94] F. Fernandez´ and M. Veloso, “Probabilistic policy reuse (ICRA). IEEE, 2018, pp. 6292–6299. in a reinforcement learning agent,” in Proceedings of the [78] J. Chemali and A. Lazaric, “Direct policy iteration with fifth international joint conference on Autonomous agents demonstrations,” in Twenty-Fourth International Joint and multiagent systems. ACM, 2006, pp. 720–727. Conference on Artificial Intelligence, 2015. [95] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, [79] B. Piot, M. Geist, and O. Pietquin, “Boosted bellman H. P. van Hasselt, and D. Silver, “Successor features residual minimization handling expert demonstra- for transfer in reinforcement learning,” in Advances tions,” in Joint European Conference on Machine Learning in neural information processing systems, 2017, pp. 4055– and Knowledge Discovery in Databases. Springer, 2014, 4065. pp. 549–564. [96] R. Bellman, “Dynamic programming,” Science, vol. 153, [80] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. no. 3731, pp. 34–37, 1966. Taylor, and A. Nowe,´ “Reinforcement learning from [97] L. Torrey, T. Walker, J. Shavlik, and R. Maclin, “Using demonstration through shaping,” in Twenty-Fourth advice to transfer knowledge acquired in one reinforce- International Joint Conference on Artificial Intelligence, ment learning task to another,” in European Conference 2015. on Machine Learning. Springer, 2005, pp. 412–424. [81] B. Kang, Z. Jie, and J. Feng, “Policy optimization with [98] A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine, demonstrations,” in International Conference on Machine “Learning invariant feature spaces to transfer skills Learning, 2018, pp. 2474–2483. with reinforcement learning,” International Conference [82] D. P. Bertsekas, “Approximate policy iteration: A on Learning Representations (ICLR), 2017. survey and some new methods,” Journal of Control [99] G. Konidaris and A. Barto, “Autonomous shaping: Theory and Applications, vol. 9, no. 3, pp. 310–335, 2011. Knowledge transfer in reinforcement learning,” in [83] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, Proceedings of the 23rd international conference on Machine “Prioritized experience replay,” in ICLR, 2016. learning. ACM, 2006, pp. 489–496. [84] S. Ross, G. Gordon, and D. Bagnell, “A reduction of [100] H. B. Ammar and M. E. Taylor, “Reinforcement imitation learning and structured prediction to no- learning transfer via common subspaces,” in regret online learning,” in Proceedings of the fourteenth Proceedings of the 11th International Conference international conference on artificial intelligence and statis- on Adaptive and Learning Agents, ser. ALA’11. tics, 2011, pp. 627–635. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 21– [85] K. Brantley, W. Sun, and M. Henaff, “Disagreement- 36. [Online]. Available: http://dx.doi.org/10.1007/ regularized imitation learning,” in International Confer- 978-3-642-28499-1 2 ence on Learning Representations, 2019. [101] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: [86] G. Hinton, O. Vinyals, and J. Dean, “Distilling the A deep convolutional encoder-decoder architecture knowledge in a neural network,” Deep Learning and for image segmentation,” IEEE transactions on pattern Representation Learning Workshop, NIPS, 2014. analysis and machine intelligence, vol. 39, no. 12, pp. [87] A. Polino, R. Pascanu, and D. Alistarh, “Model 2481–2495, 2017. compression via distillation and quantization,” arXiv [102] C. Wang and S. Mahadevan, “Manifold alignment preprint arXiv:1802.05668, 2018. without correspondence,” in Twenty-First International [88] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, Joint Conference on Artificial Intelligence, 2009. G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, [103] B. Bocsi, L. Csato,´ and J. Peters, “Alignment-based K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” transfer learning for robot models,” in The 2013 Inter- arXiv preprint arXiv:1511.06295, 2015. national Joint Conference on Neural Networks (IJCNN). [89] H. Yin and S. J. Pan, “Knowledge transfer for deep IEEE, 2013, pp. 1–7. reinforcement learning with hierarchical experience [104] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor, replay,” in Thirty-First AAAI Conference on Artificial “Unsupervised cross-domain transfer in policy gradient Intelligence, 2017. reinforcement learning via manifold alignment,” in [90] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor- Twenty-Ninth AAAI Conference on Artificial Intelligence, mimic: Deep multitask and transfer reinforcement 2015. learning,” arXiv preprint arXiv:1511.06342, 2015. [105] H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and 21

G. Weiss, “Reinforcement learning transfer via sparse [121] L. Lehnert, S. Tellex, and M. L. Littman, “Advan- coding,” in Proceedings of the 11th International Con- tages and limitations of using successor features for ference on Autonomous Agents and Multiagent Systems- transfer in reinforcement learning,” arXiv preprint Volume 1. International Foundation for Autonomous arXiv:1708.00102, 2017. Agents and Multiagent Systems, 2012, pp. 383–390. [122] J. C. Petangoda, S. Pascual-Diaz, V. Adam, P. Vrancx, [106] A. Lazaric, M. Restelli, and A. Bonarini, “Transfer and J. Grau-Moya, “Disentangled skill embed- of samples in batch reinforcement learning,” in Pro- dings for reinforcement learning,” arXiv preprint ceedings of the 25th international conference on Machine arXiv:1906.09223, 2019. learning. ACM, 2008, pp. 544–551. [123] B. Zadrozny, “Learning and evaluating classifiers [107] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, under sample selection bias,” in Proceedings of the J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and twenty-first international conference on Machine learning. R. Hadsell, “Progressive neural networks,” arXiv ACM, 2004, p. 114. preprint arXiv:1606.04671, 2016. [124] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement [108] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, learning in robotics: A survey,” The International Journal A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013. Evolution channels gradient descent in super neural [125] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, networks,” arXiv preprint arXiv:1701.08734, 2017. “A survey of robot learning from demonstration,” [109] I. Harvey, “The microbial genetic algorithm,” in Euro- Robotics and autonomous systems, vol. 57, no. 5, pp. 469– pean Conference on Artificial Life. Springer, 2009, pp. 483, 2009. 126–133. [126] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A [110] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, survey of research on cloud robotics and automation,” “Learning modular neural network policies for multi- IEEE Transactions on automation science and engineering, task and multi-robot transfer,” in 2017 IEEE Interna- vol. 12, no. 2, pp. 398–409, 2015. tional Conference on Robotics and Automation (ICRA). [127] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep IEEE, 2017, pp. 2169–2176. reinforcement learning for robotic manipulation with [111] J. Andreas, D. Klein, and S. Levine, “Modular multitask asynchronous off-policy updates,” in 2017 IEEE inter- reinforcement learning with policy sketches,” in Pro- national conference on robotics and automation (ICRA). ceedings of the 34th International Conference on Machine IEEE, 2017, pp. 3389–3396. Learning-Volume 70. JMLR. org, 2017, pp. 166–175. [128] A. Rajeswaran, S. Ghotra, B. Ravindran, and S. Levine, [112] A. Zhang, H. Satija, and J. Pineau, “Decoupling dy- “Epopt: Learning robust neural network policies using namics and reward for transfer learning,” arXiv preprint model ensembles,” arXiv preprint arXiv:1610.01283, arXiv:1804.10689, 2018. 2016. [113] P. Dayan, “Improving generalization for temporal dif- [129] W. Yu, J. Tan, C. K. Liu, and G. Turk, “Preparing for ference learning: The successor representation,” Neural the unknown: Learning a universal policy with online Computation, vol. 5, no. 4, pp. 613–624, 1993. system identification,” arXiv preprint arXiv:1702.02453, [114] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gersh- 2017. man, “Deep successor reinforcement learning,” arXiv [130] F. Sadeghi and S. Levine, “Cad2rl: Real single-image preprint arXiv:1606.02396, 2016. flight without a single real image,” arXiv preprint [115] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Bur- arXiv:1611.04201, 2016. gard, “Deep reinforcement learning with successor [131] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, features for navigation across similar environments,” M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Kono- in 2017 IEEE/RSJ International Conference on Intelligent lige et al., “Using simulation and domain adaptation to Robots and Systems (IROS). IEEE, 2017, pp. 2371–2378. improve efficiency of deep robotic grasping,” in 2018 [116] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, IEEE International Conference on Robotics and Automation M. Hessel, D. Mankowitz, A. Zˇ ´ıdek, and R. Munos, (ICRA). IEEE, 2018, pp. 4243–4250. “Transfer in deep reinforcement learning using suc- [132] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, “A cessor features and generalised policy improvement,” data-efficient framework for training and sim-to-real arXiv preprint arXiv:1901.10964, 2019. transfer of navigation policies,” in 2019 International [117] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern, Conference on Robotics and Automation (ICRA). IEEE, “Transfer in variable-reward hierarchical reinforcement 2019, pp. 782–788. learning,” Machine Learning, vol. 73, no. 3, p. 289, 2008. [133] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, [118] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, techniques for recommender systems,” Computer, no. 8, “Darla: Improving zero-shot transfer in reinforcement pp. 30–37, 2009. learning,” in Proceedings of the 34th International Confer- [119] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix ence on Machine Learning-Volume 70. JMLR. org, 2017, completion from a few entries,” IEEE transactions on pp. 1480–1490. information theory, vol. 56, no. 6, pp. 2980–2998, 2010. [134] J. Oh, V. Chockalingam, S. Singh, and H. Lee, “Control [120] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, of memory, active perception, and action in minecraft,” H. van Hasselt, D. Silver, and T. Schaul, “Univer- arXiv preprint arXiv:1605.09128, 2016. sal successor features approximators,” arXiv preprint [135] S. Ontanon,´ G. Synnaeve, A. Uriarte, F. Richoux, arXiv:1812.07626, 2018. D. Churchill, and M. Preuss, “A survey of real-time 22

strategy game ai research and competition in starcraft,” model,” in Advances in Neural Information Processing IEEE Transactions on Computational Intelligence and AI in Systems, 2017, pp. 314–324. games, vol. 5, no. 4, pp. 293–311, 2013. [151] E. B. Laber, D. J. Lizotte, M. Qian, W. E. Pelham, and [136] Z. Lin, J. Gehring, V. Khalidov, and G. Synnaeve, S. A. Murphy, “Dynamic treatment regimes: Techni- “Stardata: A starcraft ai research dataset,” in Thirteenth cal challenges and applications,” Electronic journal of Artificial Intelligence and Interactive Digital Entertainment statistics, vol. 8, no. 1, p. 1225, 2014. Conference, 2017. [152] M. Tenenbaum, A. Fern, L. Getoor, M. Littman, V. Man- [137] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep asinghka, S. Natarajan, D. Page, J. Shrager, Y. Singer, learning for video game playing,” IEEE Transactions on and P. Tadepalli, “Personalizing cancer therapy via Games, 2019. machine learning,” in Workshops of NIPS, 2010. [138] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, [153] A. Alansary, O. Oktay, Y. Li, L. Le Folgoc, B. Hou, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing G. Vaillant, K. Kamnitsas, A. Vlontzos, B. Glocker, atari with deep reinforcement learning,” arXiv preprint B. Kainz et al., “Evaluating reinforcement learning arXiv:1312.5602, 2013. agents for anatomical landmark detection,” Medical [139] OpenAI. (2019) Dotal2 blog. [Online]. Available: image analysis, vol. 53, pp. 156–164, 2019. https://openai.com/blog/openai-five/ [154] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, [140] H. Chen, X. Liu, D. Yin, and J. Tang, “A survey on A. Wimmer, and T. Chen, “Multimodal image regis- dialogue systems: Recent advances and new frontiers,” tration with deep context reinforcement learning,” in Acm Sigkdd Explorations Newsletter, vol. 19, no. 2, pp. International Conference on Medical Image Computing and 25–35, 2017. Computer-Assisted Intervention. Springer, 2017, pp. 240– [141] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, 248. “Reinforcement learning for spoken dialogue systems,” [155] Z. Huang, W. M. van der Aalst, X. Lu, and H. Duan, in Advances in Neural Information Processing Systems, “Reinforcement learning based resource allocation in 2000, pp. 956–962. business process management,” Data & Knowledge [142] B. Zoph and Q. V. Le, “Neural architecture Engineering, vol. 70, no. 1, pp. 127–145, 2011. search with reinforcement learning,” arXiv preprint [156] T. S. M. T. Gomes, “Reinforcement learning for primary arXiv:1611.01578, 2016. care e appointment scheduling,” 2017. [143] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and [157] A. Serrano, B. Imbernon,´ H. Perez-S´ anchez,´ J. M. Ce- K. Saenko, “Learning to reason: End-to-end module cilia, A. Bueno-Crespo, and J. L. Abellan,´ “Accelerating networks for visual question answering,” in Proceedings drugs discovery with deep reinforcement learning: An of the IEEE International Conference on Computer Vision, early approach,” in Proceedings of the 47th International 2017, pp. 804–813. Conference on Parallel Processing Companion. ACM, [144] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, 2018, p. 6. “Deep reinforcement learning-based image captioning [158] M. Popova, O. Isayev, and A. Tropsha, “Deep rein- with embedding reward,” in Proceedings of the IEEE forcement learning for de novo drug design,” Science Conference on Computer Vision and Pattern Recognition, advances, vol. 4, no. 7, p. eaap7885, 2018. 2017, pp. 290–298. [159] C. Yu, J. Liu, and S. Nemati, “Reinforcement learning in [145] T. Young, D. Hazarika, S. Poria, and E. Cambria, healthcare: A survey,” arXiv preprint arXiv:1908.08796, “Recent trends in deep learning based natural language 2019. processing,” ieee Computational intelligenCe magazine, [160] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. vol. 13, no. 3, pp. 55–75, 2018. Jacobs, J. M. Zurada, and M. E. Brier, “Incorporating [146] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, prior knowledge into q-learning for drug delivery “Learning to compose neural networks for question individualization,” in Fourth International Conference on answering,” arXiv preprint arXiv:1601.01705, 2016. Machine Learning and Applications (ICMLA’05). IEEE, [147] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, 2005, pp. 6–pp. J. Pineau, A. Courville, and Y. Bengio, “An actor- [161] V. N. Marivate, J. Chemali, E. Brunskill, and critic algorithm for sequence prediction,” arXiv preprint M. Littman, “Quantifying uncertainty in batch person- arXiv:1607.07086, 2016. alized sequential decision making,” in Workshops at the [148] F. Godin, A. Kumar, and A. Mittal, “Learning when not Twenty-Eighth AAAI Conference on Artificial Intelligence, to answer: a ternary reward structure for reinforcement 2014. learning based question answering,” in Proceedings [162] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi- of the 2019 Conference of the North American Chapter Velez, “Robust and efficient transfer learning with of the Association for Computational Linguistics: Human hidden parameter markov decision processes,” in Language Technologies, Volume 2 (Industry Papers), 2019, Advances in Neural Information Processing Systems, 2017, pp. 122–129. pp. 6250–6261. [149] K.-W. Chang, A. Krishnamurthy, A. Agarwal, J. Lang- [163] A. Holzinger, “Interactive machine learning for health ford, and H. Daume´ III, “Learning to search better than informatics: when do we need the human-in-the-loop?” your teacher,” 2015. Brain Informatics, vol. 3, no. 2, pp. 119–131, 2016. [150] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra, [164] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing “Best of both worlds: Transferring knowledge from via deep reinforcement learning,” IEEE/CAA Journal of discriminative learning to a generative visual dialog Automatica Sinica, vol. 3, no. 3, pp. 247–254, 2016. 23

[165] K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large- scale fleet management via multi-agent deep reinforce- ment learning,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 1774–1783. [166] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, “A survey on reinforcement learning models and algorithms for traffic signal control,” ACM Computing Surveys (CSUR), vol. 50, no. 3, p. 34, 2017. [167] J. Moody, L. Wu, Y. Liao, and M. Saffell, “Performance functions and reinforcement learning for trading sys- tems and portfolios,” Journal of Forecasting, vol. 17, no. 5-6, pp. 441–470, 1998. [168] Z. Jiang and J. Liang, “Cryptocurrency portfolio man- agement with deep reinforcement learning,” in 2017 Intelligent Systems Conference (IntelliSys). IEEE, 2017, pp. 905–913. [169] R. Neuneier, “Enhancing q-learning for optimal asset allocation,” in Advances in neural information processing systems, 1998, pp. 936–942. [170] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal rep- resentation and trading,” IEEE transactions on neural networks and learning systems, vol. 28, no. 3, pp. 653–664, 2016. [171] G. Dalal, E. Gilboa, and S. Mannor, “Hierarchical decision making in electricity grid management,” in International Conference on Machine Learning, 2016, pp. 2197–2206. [172] F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuska,ˇ and R. Belmans, “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2149–2159, 2016. [173] Z. Wen, D. O’Neill, and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Transactions on Smart Grid, vol. 6, no. 5, pp. 2312– 2324, 2015. [174] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining explanations: An overview of interpretability of machine learning,” in 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA). IEEE, 2018, pp. 80–89. [175] Q.-s. Zhang and S.-C. Zhu, “Visual interpretability for deep learning: a survey,” Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 27–39, 2018. [176] Y. Dong, H. Su, J. Zhu, and B. Zhang, “Improving interpretability of deep neural networks with semantic information,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4306– 4314. [177] Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” in Advances in Neural Information Processing Systems, 2017, pp. 3812–3822. [178] R. Ramakrishnan and J. Shah, “Towards interpretable explanations for transfer learning in sequential tasks,” in 2016 AAAI Spring Symposium Series, 2016.