
Adversarial Policy Training against Deep Reinforcement Learning

Xian Wu∗, Wenbo Guo∗, Hua Wei∗, Xinyu Xing
The Pennsylvania State University
{xkw5132, wzg13, hzw77, xxing}@ist.psu.edu

Abstract

Reinforcement learning is a set of goal-oriented learning algorithms, through which an agent learns to behave in an environment by performing certain actions and observing the reward it gets from those actions. Integrated with deep neural networks, it becomes deep reinforcement learning, a new paradigm of learning methods. Recently, deep reinforcement learning has demonstrated great potential in many applications, such as playing video games, mastering the game of Go, and even performing autonomous piloting. However, coming together with these great successes are adversarial attacks, in which an adversary could force a well-trained agent to behave abnormally by tampering with the input to the agent's policy network or by training an adversarial agent to exploit the weakness of the victim.

In this work, we show that existing adversarial attacks against reinforcement learning either work in an impractical setting or perform less effectively when launched in a two-agent competitive game. Motivated by this, we propose a new method to train adversarial agents. Technically speaking, our approach extends the Proximal Policy Optimization (PPO) algorithm and then utilizes an explainable AI technique to guide an attacker in training an adversarial agent. In comparison with the adversarial agent trained by the state-of-the-art technique, we show that our adversarial agent exhibits a much stronger capability in exploiting the weakness of victim agents. Besides, we demonstrate that our adversarial attack introduces less variation in the training process and exhibits less sensitivity to the selection of initial states.

1 Introduction

With the recent breakthroughs of deep neural networks (DNNs) in problems like computer vision, machine translation, and time series prediction, we have witnessed a great advance in the area of reinforcement learning (RL). By integrating deep neural networks into reinforcement learning algorithms, the machine learning community has designed various deep reinforcement learning algorithms [29, 43, 53] and demonstrated their great success in a variety of applications, ranging from defeating world champions of Go [45] to mastering a wide variety of Atari games [30].

Different from conventional learning paradigms, deep reinforcement learning (DRL) refers to goal-oriented algorithms, through which one could train an agent to learn how to attain a complex objective or, in other words, maximize the reward it can collect over many steps (actions). Like a dog incentivized by petting and intimidation, reinforcement learning algorithms penalize the agent when it takes the wrong actions and reward it when it takes the right ones.

In light of the promising results in many reinforcement learning tasks, researchers have recently devoted their energies to investigating the security risks of reinforcement learning algorithms. For example, early research has proposed various methods to manipulate the environment that an agent interacts with (e.g., [4, 18, 21]). The rationale behind this kind of attack is as follows. In a reinforcement learning task, an agent usually takes as input the observation of the environment. By manipulating the environment, an attacker could influence the agent's observation as well as its decision (action), and thus mislead the agent into behaving abnormally (e.g., by subtly changing some pixel values of the sky in the Super Mario game, or by injecting noise into the background canvas of the Pong game).

In many recent research works, attacks through environment manipulation have demonstrated great success in preventing a well-trained agent from completing a certain task (e.g., [18, 19]). However, such attacks are not practical in the real world. For example, in the application of online video games, the input of a pre-trained agent is the snapshot of the current game scene. From the attackers' perspective, it is difficult to hack into the game server, obtain the permission to manipulate the environment, influence arbitrary pixels in that input image, and thus launch an adversarial attack as expected. As a result, recent research proposes a new method to attack a well-trained agent [10].

∗Equal Contribution. Different from attacks through environment manipulation, the new attack is designed specifically for the two-agent com- agent to take the action that could influence the action of the petitive game – where two participant agents compete with opponent agent the most. each other – and the goal of this attack is to fail one well- In this paper, we do not claim that our proposed technique trained agent in the game by manipulating the behaviors of is the first method for attacking reinforcement learning. How- the other. In comparison with the environment manipulation ever, we argue that this is the first work that can effectively methods, the new attack against RL is more practical because, exploit the weakness of victim agents without the manipula- to trigger the weakness of the victim agent, this attack does tion of the environment. Using MuJoCo [50] and roboschool not assume full control over the environment nor that over Pong [33] games, we show that our method has a stronger the observation of the victim agent. Rather, it assumes only capability of attacking a victim agent than the state-of-the-art the free access of the adversarial agent (i.e., the agent that the method [10] (an average of 60% vs. 50% winning rate for attacker trains to compete with his opponent’s agent). MuJoCo game and 100% vs. 90% for the Pong game). In In [10], researchers have already shown that the method of addition, we demonstrate that, in comparison with the state- attacking through an adversarial agent could be used for an of-the-art method of training an adversarial policy [10], our alternative, practical approach to attack a well-trained agent proposed method could construct an adversarial agent with a in reinforcement learning tasks. However, as we will demon- 50% winning rate in fewer training cycles (11 million vs. 20 strate in Section6, this newly proposed attack usually ex- million iterations for MuJoCo game, and 1.0 million vs. 1.3 hibits a relatively low success rate of failing the opponent (or million iterations for Pong game). Last but not least, we also in other words victim) agent.1 This is because the attack is show that using our proposed method to train an adversarial a simple application of the state-of-the-art Proximal Policy agent, it usually introduces fewer variations in the training Optimization (PPO) algorithm [43] and, by design, the PPO process. We argue this is a very beneficial characteristic algorithm does not train an agent for exploiting the weakness because this could make our algorithm less sensitive to the of the opponent agent. selection of initial states. We released the game environment, Inspired by this discovery, we propose a new technique to victim agents, source code, and our adversarial agents. 2 train an adversarial agent and thus exploit the weakness of In summary, the paper makes the following contributions. the opponent (victim) agent. First, we arm the adversarial agent with the ability to observe the attention of the victim • We design a new practical attack mechanism that trains agent while it plays with our adversarial agent. By using this an adversarial agent to exploit the weakness of the oppo- attention, the adversarial agent can easily figure out at which nent in an effective and efficient fashion. time step the opponent agent pays more attention to the ad- versary. Second, under the guidance of the victim’s attention, • We demonstrate that an explainable AI technique can the adversary subtly varies its actions. 
With this practice, as be used to facilitate the search of the adversarial policy we will show and elaborate in Section4 and5, the adversarial network and thus the construction of the corresponding agent could trick a well-trained opponent agent into taking adversarial agents. sub-optimal actions and thus influence the corresponding re- • We evaluate our proposed attack by using representative ward that the opponent is supposed to receive. simulated robotics games – MuJoCo and roboschool Technically speaking, to develop the attack method men- Pong – and compare our evaluation results with that tioned above, we first approximate the policy network as well obtained from the state-of-the-art attack mechanism [10]. as the state-transition model of the opponent agent. Using the approximated network and model, we can determine the The rest of this paper is organized as follows. Section2 attention of the opponent agent by using an explainable AI describes the problem scope and assumption of this research. technique. Besides, we can predict the action of the opponent Section3 describes the background of deep reinforcement agent when our adversarial agent takes a specific action. learning. Section4 and5 specifies how we design our attack With the predicted action in hand, our attack method then mechanism to train adversarial agents. Section6 summa- extends the PPO algorithm by introducing a weighted term rizes the evaluation results of our proposed attack mechanism. into its objective function. As we will specify in Section5, Section7 provides the discussion of related work, followed the newly introduced term measures the action deviation of by the discussion of some related issues and future work in the opponent agent with and without the influence of our ad- Section8. Finally, we conclude the work in Section9. versarial agent. The weight is the output of the explainable AI technique, which indicates by how much the opponent agent pays its attention to the adversarial agent. By maximizing 2 Problem Statement and Assumption the weighted deviation together with the advantage function in the objective function of PPO, we can train an adversarial Problem statement. Reinforcement learning refers to a set of algorithms that address the sequential decision-making 1Note that the paper uses “victim agent” and “opponent agent” inter- changeably. 2https://github.com/psuwuxian/rl_attack Observation State of the playing the game, the game developer could collect the game (state) environment episodes and retrain the master agent accordingly. However, of the agent Adversarial Opponent he cannot pull out the master agent and carry out retraining Agent agent agent immediately (or in other words right after each round of its Policy Policy play). On the one hand, this is due to the fact that, training a Policy network … … … of the agent network network game agent with an RL algorithm generally requires a long of attacker of opp. 0.51.30.2 0.9 … agent 1.3 1.7 2.1 0.1 1.3 1.7 2.1 0.1 agent period of episode accumulation to receive a high winning rate (e.g., the task of training the OpenAI’s hide-and-seek game (a) Single agent game. (b) Two-agent game. agent accumulates hundreds of millions of episodes [35]). On the other hand, this is because, even if the game developer Figure 1: The illustration of reinforcement learning tasks. retrains the master agent based on a large amount of game episodes that he collects, he still needs to figure out a way to problem in complex scenarios. 
As is depicted in Figure1, a preserve the generalizability of its master agent. As we will game is formalized as an RL learning task, in which an agent demonstrate in Section6, after retraining the master agent observes and interacts with the game environment through using the episodes the master agent gathers when interacting a series of actions. In this process of interaction, the agent with the adversarial agent, the master agent could capture the collects the reward for each of the actions it takes. Using the capability of defeating the adversary. However, it loses its reward as a feedback signal, the agent could be aware of how ability to defeat ordinary game agents. well it performs at each time step. It should also be noted that this work is very different The goal of RL is to learn an optimal policy, which guides from many existing works, which assume an attacker has the the agent to take actions more effective and thus to maximize privilege to manipulate the environment freely or, in other the amount of the reward it could gather from the environment. words, change the pixels in the snapshot that the victim agent In the setting of deep reinforcement learning, as is shown in observes (e.g., [18, 40]). We believe the removal of this as- Figure1, the policy learned is typically a deep neural network, sumption is crucial and could make an adversarial attack more which takes as the input the observation of the environment practical. To illustrate this argument, we again take for ex- (i.e., the current snapshot of the game) and outputs the ac- ample the aforementioned online games. In these examples, tions that the agent would take (i.e., left/right and up/down the game environment refers to the game scenes created by movements, etc.). In Section3, we will describe more details the game engine and the agents in the game. The activities of about how to model a reinforcement learning problem and directly manipulating the environment (game scenes) mean thus resolve an optimal policy for the agent involved. that an adversary breaks into the game server or engine, alters As is demonstrated in Figure 1a, the game is formalized as the game code related to the game scenes, and thus influences a reinforcement learning problem in which the environment the environment that the agents interacts with. Technically, involves only a single agent. However, in many reinforcement this inevitably introduces the efforts of the successful iden- learning tasks, an environment could contain two agents com- tification and exploitation of a software vulnerability on the peting with each other while interacting with the environment game server. In practice, having such a capability typically (see Figure 1b). Recently, such two-agent competitive games implies tens of thousands of hours of effort from professional driven by reinforcement learning have received great attention hackers, and cannot always guarantee the return of their ef- and reinforcement learning algorithms have demonstrated a forts because of the defense mechanisms enabled in computer great potential [32, 45]. In this work, we, therefore, focus systems. 
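To make the two-agent setting of Figure 1b concrete, the sketch below shows the interaction loop it implies: both agents receive their own observation of the shared environment and act through their own policy networks, the victim policy stays fixed, and the attacker only records the victim's observations and actions (the access assumed in this section). The environment interface and the `RandomPolicy` stand-in are illustrative assumptions, not the paper's released code.

```python
import numpy as np

class RandomPolicy:
    """Stand-in for a policy network: maps an observation to an action vector."""
    def __init__(self, act_dim, seed=0):
        self.act_dim = act_dim
        self.rng = np.random.default_rng(seed)

    def act(self, obs):
        return self.rng.uniform(-1.0, 1.0, size=self.act_dim)

def rollout(env, adv_policy, victim_policy):
    """Play one episode; only the adversarial policy is trainable, the victim is fixed."""
    obs_adv, obs_victim = env.reset()
    trajectory, done = [], False
    while not done:
        a_adv = adv_policy.act(obs_adv)           # action of the attacker's agent
        a_victim = victim_policy.act(obs_victim)  # action of the fixed opponent
        (obs_adv, obs_victim), (r_adv, _), done = env.step(a_adv, a_victim)
        # the victim's observation and action are visible to the attacker (Section 2)
        trajectory.append((obs_victim, a_victim, a_adv, r_adv))
    return trajectory
```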
With the removal of the assumption commonly made our problem in the two-agent competitive environment, de- in previous works, we make the adversarial attack more cost- veloping practical methods to train an adversarial policy for efficient because, instead of putting efforts on breaking into one agent to beat the other and win corresponding two-agent game server without the guarantee of success, an attacker only games. To be more specific, in this work, we fix one agent needs to train an adversarial policy to control his own agent and train the other with the goal of having the trained agent and thus influences its opponent. build up the ability to exploit the weakness of that fixed other. As is illustrated in Figure 1b, similar to the single-agent Assumption. It should be noted that, in our problem, we do game driven by reinforcement learning, in the setting of a two- not assume the victim agent adapts its policy based on its agent game, both of the agents take as input the observation opponent immediately. With this assumption, we simulate of the same environment, and then output the actions through a real-world scenario, where a game developer deploys an their own policy networks. In this work, when designing online game with an offline-trained master agent controlling methods to train an adversarial agent, we do not assume that its play with participants (e.g., playing a two-party Texas hold an attacker has access to the opponent agent’s policy network ’em or a GO game), and the goal of an adversary is to figure nor its state transition model. Rather, we assume that the out a way to defeat that master agent, rule the game, and attacker knows the observation of the opponent agent as well thus gather maximum rewards for fun or for profits. When as the action that the opponent takes. We believe this assump- tion is reasonable and practical because, as we mentioned that could maximize the expectation of the total rewards over above, both the attacker’s agent and the opponent agent take a sequence of actions generated through the policy. Mathemat- the observation from the same environment, and the action ically, this could be accomplished by maximizing state-value took by agents can be easily observed from the environment function Vπ(s) defined as as well. For example, the opponent agent’s policy network a a Vπ(s) = π(a s)(R + γ P Vπ(s0)), (1) outputs an upward movement, which the adversarial agent ∑ s ∑ ss0 a A | s S could easily observe from the change of the environment. ∈ 0∈ or the action-value function Qπ(s,a) defined as 3 Background of Reinforcement Learning a a Qπ(s,a) = Rs + γ P π(a0 s0)Qπ(s0,a0). (2) ∑ ss0 ∑ | s0 S a0 A Recently, many reinforcement learning algorithms have been ∈ ∈ proposed to train an agent interacting with an environment, In reinforcement learning, the state-value function Vπ(s) ranging from Q-learning based algorithms (e.g., [31, 53]) to represents how good is a state for an agent to be in. It is equal policy optimization algorithms (e.g., [22,29,41,43]). Among to the expected total reward for an agent starting from state s. all the learning algorithms, proximal policy optimization The value of this function depends on the policy π, by which (PPO) [43] is the one that has been broadly adopted in the the agent picks actions to perform. Slightly different from two-agent competitive games. 
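For reference, the two value functions in Equations (1) and (2) are the standard Bellman-expectation forms, written here with the MDP notation defined in Section 3.1:

```latex
V_{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\Big(R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a}\, V_{\pi}(s')\Big) \tag{1}

Q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S} P_{ss'}^{a} \sum_{a' \in A} \pi(a' \mid s')\, Q_{\pi}(s',a') \tag{2}
```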
For example, teams from Ope- Vπ(s), the action-value function Qπ(s,a) is an indication for nAI utilize this algorithm to play Hide-and-Seek [35] and how good it is for an agent to pick action a while being in world-famous game Dota2 [32]. In this work, we design our state s. By maximizing either of these functions above, one method of training an adversarial policy by extending the could obtain an optimal policy π for the agent to collect the ∗ PPO learning algorithm. In this section, we briefly describe maximum amount of rewards from the environment. how to model a reinforcement learning problem, and then discuss how the PPO algorithm is designed to resolve the 3.2 Resolving an RL problem reinforcement learning problem. Deep Q-learning. To find an optimal policy for an agent 3.1 Modeling an RL Problem to maximize its total reward, one method is to utilize deep Q-learning, which takes a state s and approximates the Q- Given a reinforcement learning problem, it is common to value for each action based on that state (i.e., Qπ(s,a)). With model the problem as a Markov Decision Process (MDP) this approximation, although the agent cannot extract the which contains the following components: policy explicitly, it could still maximize its reward by taking the action with the highest Q-value. As is shown in recent (t) (t) • a finite set of states S, where each state s (s S) research, such a method demonstrates a great success in many (∈0) represents the state of the agent at the time t and s is applications, such as playing GO [45] and mastering a wide the initial state; variety of Atari games [30]. However, since deep Q-learning • a finite action set A, where each action a(t) (a(t) A) usually calculates all possible actions in a discrete action refers to the action of the agent at the time t; ∈ space, it has been barely adopted to two-agent games with continuous action space, including simulation games, like • a state transition model P: S A S, where Pa = MuJoCo and RoboSchool, and real-world strategy games, ss0 (t+1) (t) (t) × → P[s = s0 s = s,a = a] denotes the probability such as StarCraft and Dota. As a result, the policy gradient | that the agent transits from state s to s0 by taking action approach is typically adopted. a; Policy Gradient Algorithm. Policy gradient refers to the

a techniques that directly parametrize the policy as a function • a reward function R : S A R, where Rs = π s a a s θ . At the time t, this function takes as input (t+1) (t) (t) × → θ( , ) = P( , ) E[r s = s,a = a] represents the expected reward the state s(t) and| outputs the action a(t). In recent research | (t) (t+1) if the agent takes action a at state s ; here r indi- article [29], researchers modeled the policy π as a deep neural cates the reward that the agent will receive at the time network (e.g., multilayer perceptron [55] or recurrent neural t + 1 after taking the action; networks [58]), and named the DNN as the policy network. • a scalar discount factor γ [0,1], which is usually mul- To learn a policy network for an agent, the policy tiplied by future rewards∈ as discovered by the agent in gradient algorithm defines an objective function J(θ) = ∞ t (t) order to dampen the effect of rewards upon the agent’s Es(0),a(0),... π [∑t=0 γ r ] which represents the expectation of ∼ θ choice of an action. the total discounted rewards. By maximizing this objective function, one could obtain the parameters θ and thus the op- As is mentioned above, the ultimate goal of reinforcement timal policy. In order to compute parameters θ, the policy learning is to train the agent to find a policy π(a s):(S A) gradient algorithm computes the gradient of the objective | → ⇡ To solve equation (4), the actor-critic framework approxi- mates V (s) through a deep neural network V (s) parameter- … π v

… 2 θ µ N(µ, σ ) ⇡ ized by v and then utilizes this approximated Vπθ (s) to deduce Qπ (s,a). As is specified in [29], this neural network can be ⇡ (a s) θ Shared ✓ | learned together with the policy network πθ through either parameter Monte-Carlo methods or Temporal-Difference methods [42]. Proximal Policy Optimization (PPO) Algorithm. Using the actor-critic framework to train an agent, recent research … … Vv(s) indicates that the actor usually experiences enormous variabil- ity in the training which influences the performance of the trained agent [41]. To stabilize actor training, recent research Figure 2: The neural network architecture involved in the proposes the PPO algorithm [16, 43], which introduces a new PPO algorithm. Note that the two networks share parameters objective function called “Clipped surrogate objective func- with each other. tion”. With this new objective function, the policy change could be restricted in a small range. function with respect to parameters (i.e., OθJ(θ)) and then As is discussed in [41], the original mathematical form of iteratively apply stochastic gradient-ascend to reach a local clipped surrogate objective function is maximum in J(θ). According to the Policy Gradient The- (t) (t) πθ(a s ) (t) (t) maximize t t [ A (a ,s )], orem [22], for any differentiable policy π (s,a), the policy θ E(a( ),s( )) π (t|) (t) πθ θ ∼ θold π (a s ) old θold | (5) gradient can be written as (t) (t) s.t. Es(t) π [DKL(πθold ( s ) πθ( s ))] δ, ∼ θold ·| || ·| ≤ OθJ(θ) = Eπθ [Oθlogπθ(s,a)Qπθ (s,a)], (3) where πθold is the old policy. DKL(p q) refers to the KL- || (t) (t) divergence between distribution p and q [24]. Aπ (a ,s ) where πθ(s,a) is the policy network and Qπθ (s,a) denotes the θold action-value function of the corresponding MDP. As we can refers to the advantage function in Equation (4). By solving easily observe from the equation above, to solve this equation, Equation (5), the new policy πθ can be obtained. As is discussed in [43], solving Equation (5) is computa- we need to know function Qπθ (s,a). In the policy gradient al- tionally expensive because it requires a second-order approxi- gorithm, the action-value function Qπθ (s,a) is approximated by a deep neural network Qw(s,a), which can be learned to- mation of the KL divergence and computing Hessian matrices. gether with the policy network. However, this design has a To address this problem, Schulman et al. [43] proposed the limitation. In each iteration of the training process, an agent PPO objective function, which replaces the KL-constrained has to compute the reward at the end of the episode, and then objective in Equation (5) by a clipped objective function average all actions. Therefore, an agent inevitably concludes (t) (t) (t) (t) maximizeθ E(a(t),s(t)) π [min(clip(ρ ,1 ε,1 + ε)A ,ρ A )], all the actions taken were good, if it receives a high reward, ∼ θold − (t) (t) (6) even if some were really bad. To address this problem, one (t) πθ(a s ) (t) (t) (t) ρ = | , A = Aπ (a ,s ). π (a(t) s(t)) θold straightforward approach is to enlarge the training sample θold | batch. Unfortunately, this could incur slow learning and the Here, clip(ρ(t),1 ε,1 + ε) denotes clipping ρ(t) to the range agent has to take even longer time to converge. of [1 ε,1 + ε] and− ε is a hyper-parameter. During the train- Actor-Critic Framework. 
To improve the policy gradient ing process,− in addition to updating the actor by solving the algorithm mentioned above, recent research introduces an optimization function in Equation (6), the PPO algorithm actor-critic framework, which defines a critic and an actor. iteratively updates the action-value function Qw(s,a) as well Through an action-value function Qπ (s,a), the critic mea- θ as the state-value function Vv(s) (i.e., the critic) by using the sures how good the action taken is. Through a policy network Temporal-Difference method.3 In Figure2, we show the net- πθ, the actor controls how the agent behaves. With both of work structure used in the PPO algorithm. As we can observe these, we can rewrite the policy gradient as from the figure, the network structure contains two deep neu- ral networks, one for approximating the state-value function OθJ(θ) = Eπθ [Oθlogπθ(s,a)Aπθ (s)], (4) Vv(s) and the other for modeling the policy network πθ. It Aπ (s,a) = Qπ (s,a) Vπ (s). θ θ − θ should be noted that the implementation of PPO algorithm does not introduce an additional neural network to approxi- Here, Aπθ (s,a) is an advantage function, which measures the difference between the Q value for action a in state s and mate action-value function Qw(s,a) but to deduce it through the average value of that state [12]. Through this advantage the state-value function Vv(s). function, we can know the improvement over the average 3While the Monte-Carlo method is also available for the training, due the action taken at that state. In other words, this function to the performance concern, the standard implementation of PPO considers calculates the extra reward the agent gets if it takes this action. only the Temporal-Difference method. o 1 Optimal action Optimal action 1.3 .  .  o5 0.0    o6    …  o7  (t) 2.1   s WIN  o8   .   .  0.0    o   13   3.0 oˆ1 Suboptimal action Suboptimal action .  .  0.0 oˆ5    oˆ6  …   (t) 0.0  oˆ7  sˆ LOSS    oˆ8   .  0.0  .     oˆ   13 t1 t2 … tk   Figure 3: The overview of our proposed attack. The upper part on a grey canvas demonstrates a game episode where the policy network of the opponent agent outputs the optimal actions and the opponent agent (in purple) wins the game. The lower part shows an episode in which the adversarial agent (in blue) subtly manipulates the environment through its actions, forces the opponent agents to choose a sequence of sub-optimal actions, and thus defeats the opponent. The arrow tied to the purple paddle indicates the action the opponent agent takes. At each time step, the adversarial agent only introduces an imperceptible change to the environment and therefore the scenes (or in other words states) on the grey canvas are nearly as same as those on the white canvas (i.e, s(t) sˆ(t) ε where ε is a small number restricting the action change of the adversarial). The feature vector passing to the networksk − indicatesk ≤ the observation of the opponent agent. It is converted from the states of the opponent agent s(t) (t) ands ˆ . The features in the red box (o5 o8 ando ˆ5 oˆ8) represent those corresponding to the adversarial action. ··· ···
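To make the clipped surrogate objective of Equation (6) concrete, here is a minimal PyTorch-style sketch. The tensor names and the default ε = 0.2 are illustrative assumptions, not values taken from the paper.

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective of Equation (6); a scalar to be maximized."""
    ratio = torch.exp(log_prob_new - log_prob_old)      # rho^(t) = pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(rho^(t), 1 - eps, 1 + eps)
    # the element-wise minimum keeps each policy update inside a small trust region
    return torch.min(ratio * advantages, clipped * advantages).mean()
```

In practice this objective is maximized with stochastic gradient ascent (or its negative minimized with a standard optimizer), while the critic V_v(s) is updated with Temporal-Difference targets as described above.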

Different from the previous actor-critic algorithms, which plement our attack method at a high level. update actor by conducting stochastic gradient-ascend4 using the approximated policy gradient of Equation (4), the PPO algorithm can guarantee a monotonic improvement of the 4.1 Basic idea of the proposed attack total rewards when updating the policy network (i.e., J(θ) ≥ Admittedly, it is possible to design a simple reward function J(θold)). With this property, the trained agent could not only for an adversarial agent to beat its opponent. However, the reach to the convergence faster but, more importantly, demon- reward function design is usually game-specific, and it is chal- strate more accurate and more stable performance than the lenging to design a universal solution. As such, we follow previous actor-critic algorithms. To the best of our knowledge, a different strategy to fulfill our objective as follows. In a PPO is the state-of-art algorithm for training a policy network two-agent competitive game, one could train an agent to take for the agent in the two-agent competitive games. As such, we an optimal action at each state via selfplay [3]. Therefore, design our attack by extending this PPO training algorithm. as is depicted in Figure3, to influence a well-trained agent, one method is to maximize the deviation of the actions taken 4 Technical Overview by that agent and thus make the agent output a suboptimal action (i.e., given the same/similar environment observation, Recall that we attack a well-trained agent by training a pow- an agent takes an action which is very different from the one erful adversarial agent. To achieve this, as is mentioned in it is supposed to take). With this practice, from the adver- Section2, we do not assume that an attacker has access to the sary’s viewpoint, he can downgrade the opponent agent’s performance and thus reduce its winning rate. policy network of the opponent agent πv nor its state-transition model Pv . Rather, we assume the attacker could obtain the To maximize the action deviation, an adversary would ss0 (t) inevitably vary the observation of the victim agent. As is observation and action of the opponent (i.e., the state sv and mentioned above, a suboptimal action means that, given the (t) action av of the opponent agent at each time step t). In this same or similar observation, the action of the agent is very section, we first specify the basic idea of our attack method. different from the one it is supposed to take. Therefore, as we Then, we briefly describe how to utilize the aforementioned will specify in the following, when maximizing the deviation states and actions to extend the PPO algorithm and thus im- of an opponent action, we need to ensure the minimal change of the environment observation. 4Note that the performance of stochastic gradient-ascend highly depends on the step size and it cannot guarantee to increase the objective function Recall that we do not assume an adversary has the privilege monotonically. to manipulate the environment, and, in a two-agent compet- o13 High to Low rollouts (i.e., different from the actions that the opponent is ... supposed to take). As we can observe from the equation o8 o... above, the loss term contains two components. The design 5 of the first component ensures that, when launching attacks, ... o1 an adversary introduces only minimal variations to the ob- t t t ...... t 1 2 3 Time step K servation of the opponent agent. 
The design of the second component forces the opponent agent to take a suboptimal Figure 4: A heatmap indicating the input feature impor- (t+1) (t+1) tance of the opponent policy network of Roboschool Pong action aˆv but not the optimal action av , and thus trig- ger the drop of its winning rate. It should be noted that we game. The highlighted features (o5 o8) represent those corresponding to the adversarial action.··· The heapmap is gen- compute both the action difference and observation difference erated by using the output of explainable AI techniques. by using a norm, the output of which is a singular. As such, when we can combine the observation and action differences itive game, the action of the adversarial agent is converted in a linear fashion. as part of the environment observation of its opponent agent. As is mentioned in Section2, neither the opponent policy ss0 Take the example shown in Figure3. The opponent observa- network πv nor its state-transition model pv is available for tion is depicted as a feature vector, within which some of the our method. Without the state-transition model, we cannot (t+1) features represent the adversarial actions. As such, we can predict the observation of the opponent agent oˆv at the subtly manipulate the action of the adversarial agent and thus time step t + 1, when our adversarial agent takes an action change the features indicating the adversarial action. With at the time step t and subtly varies the observation of the this, we can change the input to the opponent’s policy network opponent at the time step t + 1. Without the access to the (t+1) and indirectly deviate the action of the opponent agent. policy network, even if oˆv is given, we still cannot predict (t+1) (t+1) However, as is shown in Figure4, by performing a sensitiv- the action of the opponent agent aˆv = πv(oˆv ) at the ity check for the policy network against the input features over time step t + 1. This imposes the challenge of computing the time, we note that the opponent’s policy network takes the loss term Lad in Equation (7). importance of the input features differently over time. There- To tackle the challenge, our method approximates the op- fore, intuition suggests that the best strategy is to perform the ponent policy network as well as its state-transition model corresponding feature manipulation only at the time when by using two individual deep neural networks. By definition, the opponent policy network pays sufficient attention to the the state-transition model outputs the predicted observation features corresponding to the adversarial actions. To achieve (t+1) of the opponent ov at the time step t + 1. It takes as in- this, as we will specify below, we utilize an explanation AI (t) put the observation of the opponent ov , the action of the technique to examine the the victim policy network’ feature (t) (t) importance at each time step. With this, we can pinpoint the adversarial agent aα , and that of the opponent agent av at time frame when the victim policy network pays its attention the time step t. As we specify in Section5, we train both of to the adversarial action, and thus employ an adjustable hyper- the neural networks by using trajectory rollouts. It should be parameter to control the level of action deviation adjustment. noted that, to train the surrogate model, the attack needs to access victim observation and action, which is a legitimate assumption (See Section2). 
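The sketch below shows how the action-deviation term of Equation (7) (and its l1 instantiation in Equation (9)) can be combined with the PPO objective as L_ppo + λ · L_ad. The predicted victim observation and action come from the surrogate models introduced in Section 5, and, for brevity, a single λ is applied to the whole batch, whereas Algorithm 1 weights each time step individually; the function names are illustrative.

```python
import torch

def action_deviation_objective(o_hat_v, o_v, a_hat_v, a_v):
    """Equations (7)/(9): reward deviation of the victim's action while
    penalizing deviation of its observation (l1 norms)."""
    obs_diff = torch.norm(o_hat_v - o_v, p=1, dim=-1)
    act_diff = torch.norm(a_hat_v - a_v, p=1, dim=-1)
    return (act_diff - obs_diff).mean()

def extended_objective(ppo_term, o_hat_v, o_v, a_hat_v, a_v, lam):
    """L_ppo + lambda * L_ad; lambda is set per time step by the explanation method (Section 5.4)."""
    return ppo_term + lam * action_deviation_objective(o_hat_v, o_v, a_hat_v, a_v)
```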
However, we also admit that the 4.2 More details proposed attack would become harder when this information As is stated above, we design our attack in two steps – ‚ is not available. This is because the attacker needs extra effort deviating the actions of the opponent agent with a minimal to infer such information and then train the surrogate model change to its observation, and ƒ adjusting the weight of the with the approximated victim observation and action. action deviation of the opponent agent based on the influence Adjusting weights of action deviation. As is mentioned of the adversarial actions upon the opponent. In the following, above, the opponent/victim agent weights the action of the we specify how we implement this two-step design. adversarial differently over time when deciding its own action Deviating opponent actions. To deviate the action of the through its policy network. As a result, when leveraging the opponent, we extend the PPO loss function LPPO mentioned action of the adversarial to influence the environment obser- in Section3. To be specific, we introduce into the PPO loss vation and thus the action of the opponent agent, we adjust function a new loss term the weight of the action deviation based on by how much the (t+1) (t+1) (t+1) (t+1) victim agent pays attention to the action of the adversarial. Lad = maximize ( oˆv ov + aˆv av ), (7) θ −k − k k − k To achieve this, when optimizing the extended loss function (t+1) (t+1) where θ represents the parameters in πα. oˆv and aˆv Lppo + Lad, we introduce a hyperparameter λ, indicating the indicate the different observation and action taken by the op- importance of our newly added term Lad. With this, we can ponent agent if, at the time step t, the adversarial agent takes rewrite the extended loss function as Lppo + λ Lad. To max- an action different from the ones indicated by the trajectory imize this loss function, we can adjust the weight· assigned to the new term (i.e., Lad) based on the weight that the oppo- lem that our method targets. Then, we specify the design nent/victim agent pays attentions to the adversarial. of our loss term. Finally, we discuss how we extend our In this work, we utilize an explanation AI technique to loss function through explainable AI and present our learning measure the weight that the victim agent pays attention to algorithm as a whole. the adversarial action. As is shown in Figure3, the actions of the adversarial are part of the observation of the victim 5.1 Problem definition agent. They are encoded as part of the features passing to the victim’s policy network. In Figure3, we can easily observe Following the early research [44], we also formulate a two- that a policy network is a deep neural network. Over the time, agent competitive game as a two-agent MDP, represented by the observation feature vector passing to the network varies. M =< S,(Aα,Av),P ,(Rα,Rv),γ >. Here, S denotes the state Using an explanation AI technique at each time step against set. Aα and Av are the action sets for adversarial and oppo- the victim policy network, we can measure by how much the nent agents, respectively. P represents a joint state transition policy network pays attention to the features corresponding function P : S Aα Av ∆(S). The reward function can × × → to the action of the adversarial. be represented as Ri : S Aα Av R; i α,v . 
× × → ∈ { } Intuition suggests that, to obtain an optimal effect upon As is mentioned in Section3, the state transition is a the deviation of the opponent, the adversarial agent should stochastic process. Therefore, we use ∆(S) to represent a manipulate its actions at the time when the opponent pays its probability distribution on S, from which the state at each attention to the adversary. Otherwise, the action manipulation time step can be sampled. Note that using the PPO algorithm of the adversarial agent will introduce minimal influence upon for training agents in a two-agent competitive game, we can- the action of the opponent agent. Following this intuition, we not obtain the state S and the state transitions function P in assign the value for λ at each time step t based on the output an explicit form. From the game environment, each of the of an explainable AI technique. More specifically, we assign agents can get only its own observation Oi; i α,v . ∈ { } a higher value to the weight λ when the opponent pays more In this paper, we assume that the opponent agent follows a attention to the adversarial agent. Otherwise, we assign a fixed stochastic policy πv. Holding this assumption, our prob- relatively low value on the weight to minimize the impact lem can be viewed as a single-agent MDP for the adversarial of our newly added term. For more details of our weight agent, denoted by Mα =< S,Aα,Pα,Rα,γ >. Here, the state- assignment, readers could refer to Section5. transition model Pα is unknown, and S is equivalent to the Over the past years, there are many techniques in the field observation of the adversarial agent Oα. Under this problem of explanation AI research, ranging from black-box meth- definition, the goal of this work is to identify an adversarial ods (e.g., [9, 39]) to white-box approaches (e.g., [46–48]). policy πα that can guide the corresponding agent to beat its Among all these explanation AI techniques, we choose opponent in the single-agent MDP. gradient-based interpretation methods, serving as the way to weight the influence of the adversarial actions upon oppo- 5.2 Expected reward maximization nent’s policy network. The rationales behind our choice is as follow. In comparison with other explanation AI methods, As is described in Section4, we extend the PPO loss function such as some black-box methods [39] which need to perform when designing our proposed method. As is introduced in the intensive data sampling before deriving explanation, gradient- early section, the PPO loss function can be written as based methods are computationally efficient. In the context (t) (t) (t) (t) maximizeθ (t) (t) [min(clip(ρ ,1 ε,1 + ε)A ,ρ A )], E(a ,o ) πold of deep reinforcement learning, the observation of the oppo- α α ∼ α − (t) (t) (t) (8) nent/victim agent o changes over time rapidly and we need (t) πα(aα oα ) (t) (t) (t) v ρ = | , A = A old ((aα ,oα )). old (t) (t) πα to adjust the hyperparameter λ at each time step. In this work, πα (aα oα ) | we rely upon gradient-based methods, which can minimize old Here, πα and πα denotes the old and new policy of the ad- the computation needed for weight adjustment. Considering (t) that past research [1] indicates different gradient-based ex- versarial agent, respectively. oα is the observation of the adversarial agent at the time step t. It encloses the action planation methods provide different accuracy in explanation, (t) we thoroughly evaluate by how much the choice a particular of the opponent agent av . 
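For reference, the two-agent MDP of Section 5.1 and its reduction to a single-agent MDP for the adversary (with the victim policy π_v held fixed) can be written compactly as:

```latex
M = \langle S, (A_{\alpha}, A_{v}), P, (R_{\alpha}, R_{v}), \gamma \rangle, \qquad
P : S \times A_{\alpha} \times A_{v} \to \Delta(S), \qquad
R_{i} : S \times A_{\alpha} \times A_{v} \to \mathbb{R},\; i \in \{\alpha, v\}

M_{\alpha} = \langle S, A_{\alpha}, P_{\alpha}, R_{\alpha}, \gamma \rangle, \qquad
\text{with the goal}\quad \max_{\pi_{\alpha}} \; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R_{\alpha}\big(s^{(t)}, a_{\alpha}^{(t)}\big)\Big].
```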
Following the standard PPO al- gradient based method would influence the performance of gorithm, we use a neural network Vα(s) to approximate the (t) our attack. We show our evaluation results in Section6. state-value function, and thus obtain the advantage A at the time step t. In this work, the model architectures of the state- value function and the policy network are as same as those in 5 Technical Detail the PPO algorithm (see Figure2). By solving the objective function above, we could find an adversarial policy πα, with In this section, we provide more details about our proposed which the corresponding adversarial agent could maximize ∞ (t) (t) (t) method. More specifically, we first formally define the prob- the expected total reward: ∑0 γ Rα(s ,aα ). (t+1) 5.3 Action deviation maximization aˆv . In addition, as is mentioned earlier, our attack relies upon the capability of knowing how a victim agent weights Recall that we extend the PPO loss function by introducing a the importance of the adversarial actions. To do that, as new loss term we will elaborate in Section 5.4, we leverage gradient-based (t+1) (t+1) (t+1) (t+1) Lad = maximizeθ( aˆv av 1 oˆv ov 1). (9) explanation AI techniques, which need to take as input the k − k − k − k policy network of the victim agent. As such, in addition to the As is shown above, we choose l1 norm distance as the dif- state transition approximation, we use a deep neural network ference measure instead of l2 norm. This is because l1 norm F to approximate the policy network of the opponent agent. encourages a larger difference than l2 norm, especially when In this work, to learn the victim’s policy network, we fol- Ov is of a high dimensionality [2]. As we will empirically low existing imitation learning methods [52] and design the show in Section6, an adversarial agent trained with the l1 following objective function norm usually demonstrates a stronger capability of beating (t) (t) + 2 (12) opponent agents than that trained with l norm. Lop = minimizeθ f ( F(ov ;θ f ) av εa) . 2 k | − | − k2 State transition approximation. To predict the observation Here, θ f represents the parameters of the deep neural net- of the opponent agent at the time step t + 1, we utilize a deep work F. As we can observe from the equation above, we neural network to approximate the state-transition model of also use the approximated l∞ loss to train F. Similar to the the opponent agent. As is mentioned in Section4, the deep (t) (t) (t) (t) (t) method above, we also collect the training samples (ov ,av ) neural network takes as input (ov ,av ,aα ), and predicts through trajectory rollouts and then apply the ADAM algorithm (t+1) ov (i.e., the observation of the opponent agent at the time to minimize the loss. As we will empirically illustrate in step t +1). In this work, we train this neural network by using Section6, the network trained with l∞ norm usually exhibits the following equation better performance than those trained with l2 and l1.
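A minimal sketch of how the two surrogate models could be fitted from trajectory rollouts following Equations (11) and (12) is shown below. The forward signatures of `H` and `F`, the batch format, and the ADAM hyper-parameters are placeholders, not the authors' released implementation.

```python
import torch

def clipped_approx_loss(pred, target, eps):
    """||(|pred - target| - eps)^+||_2^2, the approximated l_inf objective of Eqs. (11)/(12)."""
    slack = torch.clamp(torch.abs(pred - target) - eps, min=0.0)
    return (slack ** 2).sum(dim=-1).mean()

def fit_surrogates(H, F, batch, eps_s=0.05, eps_a=0.05, lr=1e-3, steps=100):
    """Jointly fit the state-transition model H and the victim-policy model F from rollouts."""
    opt = torch.optim.Adam(list(H.parameters()) + list(F.parameters()), lr=lr)
    o_v, a_v, a_adv, o_v_next = batch  # victim obs/action, adversarial action, next victim obs
    for _ in range(steps):
        opt.zero_grad()
        l_st = clipped_approx_loss(H(o_v, a_v, a_adv), o_v_next, eps_s)  # Eq. (11)
        l_op = clipped_approx_loss(F(o_v), a_v, eps_a)                   # Eq. (12)
        (l_st + l_op).backward()
        opt.step()
    return H, F
```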

(t) (t) (t) (t+1) Note that, in MDP, both the state transition and the policy argmin H(ov ,av ,a ;θh) ov ∞ , (10) θh k α − k network should be in the form of stochastic. This means that the most typical way of approximating and π should be where θh denotes the parameters of the neural network H. It P density estimation [14]. In this work, we, however, conduct should be noted that ∞ is non-differentiable. Therefore, we adopt the approximationk · k technique introduced in [7], and point estimations to reduce the computational cost. As we will use the alternative objective function show in Section6, while point estimate ignores the variance of the original distribution and may introduce a bias, our (t) (t) (t) (t+1) + 2 (11) Lst = minimizeθh ( H(ov ,av ,aα ) ov εs) 2 , attack is still able to achieve decent performance in terms of k | − | − k beating the opponent in the two-agent competitive game. to train the approximated state-transition model H. In the + After obtaining the approximated models H and F, we equation above, ( ) is equivalent to max( ,0). εs is a (t+1) · · can predict the observation of the opponent agent oˆv hyperparameter, which controls the maximum l∞ between (t) (t) (t) (t+1) (t) (t) (t) (t+1) H o a a a H o a a o through ( v , v , ˆα ), and its action ˆv through ( v , v , α ) and v . To solve this objective function, (t) (t) (t) (t) (t) (t) (t+1) F(H(o ,a ,aˆ )). With these predictions, we can rewrite we collect the ground truth training data (o ,a ,a ,o ) v v α v v α v Equation (9) as by using trajectory rollouts. Then, we utilize the ADAM op- (t) (t) (t) (t+1) timization method [20] to minimize this objection function. Lad = maximizeθ( F(H(ov ,av ,aˆα )) av 1 k − k (13) More specifically, as is shown in Algorithm1 (step 7), the (t) (t) (t) (t+1) H(ov ,av ,aˆ ) ov 1). state-transition model is trained jointly with the policy net- − k α − k work of the adversarial agent. At each iteration, we first (t) Here, it should be noted that aˆα is the new action derived collect a set of trajectories by using current adversarial policy from the adversarial policy πα. to play against the opponent agent. The information contained in the collect trajectories includes the opponent agent’s ac- tions and observations. Using these actions and observations 5.4 Hyperparameter adjustment as the ground truth, we can update the surrogate networks As is mentioned in Section4, we introduce a hyperparameter by minimizing the loss functions above. It should be noted to balance the weight of the newly added loss term. In this that, while the state transition model H should be obtained work, we automatically adjust λ by using an explainable AI based on the old adversarial policy, we predict the state tran- technique. More specifically, by using the gradient saliency sition under the new adversarial policy. We argue this does methods (e.g., [46]) at the time step t, we first compute g(t) = not introduce negative effect to our training process because (t) O (t) F(ov ) which indicates the importance of each element the PPO objective function guarantees a minor change in the ov 5 (t) adversarial policy at each iteration. in the opponent agent’s observation. In this equation, F(ov ) Opponent policy network approximation. As is shown in 5Note that we do not have the access to the opponent policy network and, (t+1) Equation (9), computing action deviation requires av and therefore, we compute the gradient on the basis of its approximation F. (t) denotes the action of the opponent agent av predicted by F. 
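The per-time-step weight computation described above can be sketched as follows. Here `F` is the approximated victim policy network and `adv_mask` marks the observation dimensions that encode the adversarial action; summing the output of F before differentiating yields the row sums of the gradient matrix g. The exact perturbation inside the l∞ norm follows Equation (14); the subtraction used here is one plausible reading of it and should be treated as an assumption.

```python
import torch

def importance_weight(F, o_v, adv_mask):
    """Compute lambda^(t) = 1 / (1 + I^(t)) from gradient saliency (Section 5.4, Eq. (14))."""
    o = o_v.clone().requires_grad_(True)
    # g~: per-input-feature saliency, i.e., the gradient of the summed outputs of F
    g = torch.autograd.grad(F(o).sum(), o)[0]
    perturbation = g * adv_mask  # keep only the saliency on the adversarial-action features
    with torch.no_grad():
        # I^(t): how much the (approximated) victim action changes when the salient
        # adversarial-action features are perturbed
        I = torch.norm(F(o_v) - F(o_v - perturbation), p=float('inf'))
    return (1.0 / (1.0 + I)).item()  # per Eq. (14): a larger I yields a smaller lambda
```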
Algorithm 1: Adversarial policy training algorithm. (t) p 1 (t) q 1 Supposing ov R × and F(ov ) R × , the gradi- 1 Input: the adversarial agent’s policy πα parameterized by ∈ ∈ (t) p q (t) θ , the adversarial agent’s value function network ent g R × is a matrix, in which each element gi j = α ∈ (t) Vα with parameter vα, the state transition model H with (t) F(ov ) j indicates the importance of the i-th element in O(ov )i parameter θh, the opponent’s policy approximation (t) (t) model F with parameter θ f , and the pretrained ov to the j-th element in F(ov ). To assess the overall impor- (t) (t) opponent agent’s policy πv. tance of each element in ov to F(ov ), we sum the elements (0) (0) (0) (0) 2 Initialization: Initialize θ , θ , θ , and v . in each row of g(t) and transform it into a normalized vector α h f α 3 for k 0 1 2 K do (t) (t) (t) p 1 = , , ,..., g˜ = ∑ j=1:q g . Here, g˜ R × indicates the importance k i j ∈ 4 Collect a set of trajectories D = τi by using adversarial (t) (t) k { } of the i-th element in ov to F(ov ). policy πα to play against the opponent agent πv, where k After obtaining g˜(t), we then calculate the importance of i = 1,2,...., D and each trajectory contains T time step. | | i(t) the adversarial agent’s action to the opponent agent’s action at 5 Obtain the reward of the time t in each trajectory τi: rα . the time t. Recall that the observation of the opponent agent 6 Compute the estimated advantage of each time in each i(t) (t) trajectory: A based on the current value function V k : ov contains three components – environment, the action α i(t) i(t) i(t+1) i(t) A = r + γV k (o ) V k (o ). of the opponent agent, and that of the adversarial agent – α α α − α α and we focus only on the action of the adversarial agent. 7 Update the state transition approximation function H and Therefore, we eliminate the feature importance tied to the the opponent policy approximation function F using the environment and the action of the opponent agent. To do this, current trajectories according to the following objective (t) function we first perform an element-wise multiplication between g˜ 1 T θk+1 = argmin L , and a mask M Rp 1. Then, we borrow the idea of an early h θh Dk T ∑ ∑ st × | | τ Dk t=0 ∈ ∈ (15) research work [9], through which we compute λ as follows 1 T θk+1 = argmin L . f θ f Dk T ∑ ∑ op τ Dk t=0 | | ∈ (t) (t) (t) (t) (t) 1 I = F(ov ) F(ov (g˜ M)) ∞ , λ = . (14) i(t) i(t) k (t) 8 Based on the updated ov , av in D , and F k+1 , compute k − k 1 + I θ f the penalty term for each time t in each trajectory i: λi(t) (t) according to Equation (14). Here, the vector ov is a vector, indicating the observation at time t. M is a vector with the same dimensionality as the vec- 9 Update the policy by maximizing the following objective (t) (t) function tor ov . In ov , if the corresponding observation dimensions Dk T indicate the actions of adversarial agent, we assign 1 to the 1 | | θk+1 = argmax L + λi(t)L . (16) α θα Dk T ∑ ∑ ppo ad corresponding element in M. Otherwise, we assign 0 accord- | | i=1 t=0 ingly. For example, assuming the kth (k + N)th dimensions (t) ∼ 10 Update the value function by minimizing the following of ov indicate the actions of the adversarial agent. Then, we objective function assign 1 to the kth to (k + N)th dimensions of M, and the rest (t) 6 Dk T is assigned to 0. In this work, we normalize λ to 0 1 . k+1 1 i(t) i(t) i(t+1) 2 [ , ] v = argmin (V k (o ) (r + γV k (o ))) . 
α vα Dk T ∑ ∑ α α − α α α From this equation, we can easily discover that, the higher | | i=0 t=0 (17) (t) value of I indicates a lower importance score, resulting in 11 end (t) a lower value of λ . In Algorithm1, we illustrate how to 12 Output: the well trained adversarial policy network πα. combine λ with our extended loss function, and thus train an adversarial agent with the ability to attack its opponent. 6.1 Experiment setup 6 Evaluation In our experiment, we choose the game “You Should Not Pass” in the MuJoCo game zoo [50], which has recently been In this section, we evaluate our proposed attack technique adopted to demonstrate the effectiveness of a state-of-the-art from various aspects, compare it with the state-of-the-art adversarial attack [10]. As we will specify in the consecu- method, and demonstrate its effectiveness and efficiency by tive session, by using this game, we not only evaluate the using representative two-agent competitive games. Below, key components of our proposed design but, more impor- we first present our experiment setup. Then, we discuss the tantly, compare the effectiveness of our proposed technique design of our experiment, followed by our experiment results. with that of the state-of-the-art method [10]. In addition to the MuJoCo game, we demonstrate our method on the ro- 6Normalization could capture temporal changes and prevent the influence boschool Pong game [33]. Together with the MuJoCo game, of its extreme values upon the PPO learning process. we quantify by how much our proposed method outperforms the state-of-the-art technique [10]. We believe the games of our choice are representative for the following three reasons. First, both games provide us with the interface to train agents using reinforcement learning algorithms, giving us the free- dom to develop our attack method. Second, as is discussed in Section2, our attack targets competitive games in which re- inforcement learning algorithms are commonly used to train agents. Both games of our choice are commonly used in academia for evaluating reinforcement learning algorithms (a) MuJoCo. (b) Roboschool Pong. in two-agent settings (e.g., [3]) and attack methods in ad- Figure 5: The illustration of the selected games. versarial learning (e.g., [10]). Third, we design our attack based on the PPO algorithm. When we choose games, we 7 need to ensure, the PPO algorithm should be the one most the adversarial one. In this work, the policy networks of commonly used for the games of our choice. Both MuJoCo opponent agents are all modeled as multilayer perceptrons, and Roboschool hold this selection criterion. In the following, which are trained through a self-play mechanism [3] because we briefly introduce both of these games, the opponent agents this neural architecture has been broadly used by previous in both games, and the evaluation metric. research [3, 10, 33] and already demonstrated the best perfor- mance in both MujoCo and Pong game. To be more specific, MuJoCo. In this game, two agents (i.e., players) are first for the MuJoCo game, we used the pre-trained policy net- initialized to face each other. As is illustrated in Figure 5a, work released in the “agent zoo” [3] as the opponent policy the blue humanoid robot then starts to run towards the finish network. For the roboschool Pong game, we first trained a line (indicated by the red line in Figure 5a). 
In this process, policy network through the self-play mechanism by using the the red humanoid robot in the figure attempts to block the blue PPO algorithm. Then we treated it as the opponent policy robot from reaching the line right behind it. By design, the network. We specify the architectures of these two opponent blue robot could win the game only if it reaches the finish line. policy networks in the Appendix. Otherwise, the other robot wins. When playing this game, Evaluation metric. Different from supervised learning algo- both robots observe the game environment, the current status rithms, many reinforcement learning algorithms typically do of themselves (e.g., the position and velocity of their body), not involve a data set collected offline for training an agent. and that of their opponent. Based on the observation, they Instead, they usually expose a learning agent to interact with both utilize a policy network to decide their actions (i.e., the the environment for many iterations. In each iteration, the direction and velocity of the next movement). The game ends learning agent collects trajectories by using its policy net- when the winning condition is triggered. At that time, the work learned from the last iteration, update its current policy winner receives a reward, whereas the loser gets penalized. network with the new trajectories, and proceed to the next Roboschool Pong. As is depicted in Figure 5b, the Pong iteration. In our experiment, we follow the metric commonly game features two paddles and a ball. The reinforcement used for evaluating reinforcement learning, measuring the learning agents control the movement of the paddles through winning rate of the adversarial agent at each iteration. Given policy networks. At the beginning of the game, one agent the property of the competitive game, by subtracting the win- serves the ball, and the other returns the serve. In each ning rate of the adversarial agent, we can easily obtain that of round of the game, an agent can claim a win only if its oppo- the opponent. The higher the winning rate for an adversarial nent fails to return the ball or violates the rule of the game agent is, the more powerful the adversarial agent is in terms (e.g., successively hit the ball twice). If a single round of the of exploiting the weakness of its opponent. game runs out of time, a timeout will be triggered and the game will conclude a tie. In this game, the observation of an 6.2 Experiment design agent contains the agent itself, the opponent agent, and the position and velocity of the ball. Based on the observation, We design our experiment from two different perspectives. through its policy network, the agent can take an action in- One is to evaluate some components of our proposed tech- dicated by the direction and velocity of its next movement. nique, and the other is to quantify the overall performance of When playing this game, agents will receive a reward or be our proposed method. In the following, we describe the detail penalized based on the performance of the agent. of each of our experiment designs. Experiment I. Recall that we utilize gradient-based explain- Opponent agents. Following the work proposed in [10], re- able AI techniques to guide the selection of the hyperparam- garding the MuJoCo game, we treat the blue humanoid robot eter λ. To understand the contribution of the explanation as the opponent agent and the red one as our adversarial agent. 
6.2 Experiment design

We design our experiment from two different perspectives. One is to evaluate individual components of our proposed technique, and the other is to quantify the overall performance of our proposed method. In the following, we describe each of our experiment designs in detail.

Experiment I. Recall that we utilize gradient-based explainable AI techniques to guide the selection of the hyperparameter λ. To understand the contribution of the explanation component in our loss function, we first design an experiment in which we set the hyperparameter λ to different constant values, run our learning algorithm under each setting on the MuJoCo game, and compare the performance of the trained agent under each constant value with the one obtained through our explanation-based method. With respect to the explanation-based method, we choose different gradient-based explainable AI techniques to serve as the explanation component and compare the corresponding performance of the adversarial agent under each choice. More specifically, the gradient-based explainable AI methods in our choice set include the vanilla gradient [46], the integrated gradient [48], and the smooth gradient [47] (these attribution methods are sketched right after the experiment descriptions below). In addition to these well-recognized gradient-based methods, our choice set includes a random explanation approach as a baseline, which assigns feature importance scores randomly.

Experiment II. We also design an experiment to validate the choice of our distance measure. As is mentioned in Section 5, we carefully design the distance measure indicated by Equations (11), (12), and (13). To ensure that our choice of distance measure truly benefits the agent trained by our proposed method, we replace the corresponding distance measures with the l1 and l2 norms, respectively, and compare the performance of the trained agent under each of these setups.

Experiment III. We further design an experiment to examine whether the approximated opponent policy network involved in our technique imposes any risk of downgrading our agent's performance. As is mentioned in Section 4, to derive an explanation and thus guide the adjustment of the hyperparameter λ, we approximate the policy network of the opponent. Since this approximation is based on point estimation, it inevitably incurs errors and thus potentially influences the performance of the adversarial agent trained by our method. To test its impact upon the adversarial agent's performance, we replace the approximated policy network with the actual policy network of the opponent agent, run the proposed learning algorithm, and compare the performance of the corresponding agent with the one obtained through our method.

Experiment IV. Using the state-of-the-art attack method [10] as our baseline, we also design an experiment to evaluate our proposed method. To be specific, we use both methods to train adversarial agents and then apply them in the MuJoCo game and the Pong game. In each of the games, we then compare the winning rate of the adversarial agents across the number of iterations involved in the training process. This is similar to the setup proposed in early research [10].

Experiment V. Finally, we investigate a simple adversarial training approach to safeguard victim models against the proposed attack. More specifically, we play the victim agent against the adversarial agent trained by our attack in the corresponding game environment and collect the game episodes. With these episodes, we then utilize our proposed learning algorithm (Algorithm 1) to retrain the victim agent. Similar to the experiment above, we compare the winning rate of the retrained victim agent against the adversarial agent across the number of iterations involved in the retraining process. In addition, we employ the retrained victim agent to play against an agent trained with self-play methods. With this setup, we emulate a scenario where a robustified agent plays against a regular (non-robustified) game agent. Through this, we study whether the retrained victim agent retains its generalizability in the competitive game; in other words, whether a victim agent still performs well when playing against a regular agent even after we retrain it with adversarial training.
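As referenced in Experiment I above, the three gradient-based attribution methods can be sketched as follows for a differentiable policy network, here assumed to be a PyTorch module that maps an observation tensor to an action tensor. How the resulting attributions are aggregated into the importance scores used by our method is not shown.

    import torch

    def vanilla_grad(policy, obs):
        # Saliency of the policy output with respect to the observation [46].
        obs = obs.clone().detach().requires_grad_(True)
        policy(obs).sum().backward()
        return obs.grad.detach()

    def smooth_grad(policy, obs, n=25, sigma=0.1):
        # SmoothGrad [47]: average vanilla gradients over noisy copies of obs.
        grads = [vanilla_grad(policy, obs + sigma * torch.randn_like(obs))
                 for _ in range(n)]
        return torch.stack(grads).mean(dim=0)

    def integrated_grad(policy, obs, baseline=None, steps=32):
        # Integrated gradients [48]: accumulate gradients along a straight
        # path from a baseline (all-zero by default) to the observation.
        baseline = torch.zeros_like(obs) if baseline is None else baseline
        total = torch.zeros_like(obs)
        for k in range(1, steps + 1):
            point = baseline + (k / steps) * (obs - baseline)
            total += vanilla_grad(policy, point)
        return (obs - baseline) * total / steps

The random-explanation baseline used in Experiment I would simply replace these attributions with torch.rand_like(obs).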
Additional experiment notes. It should be noted that, when running any learning algorithm to train adversarial agents and perform the aforementioned experiments, we go beyond the suggestion mentioned in [10] and increase the number of different initial states from 5 to 8 for each agent training. With this setup, we can not only obtain the average performance of each learning algorithm but also further reduce the influence of randomness. It should also be noted that, when training an agent, we cut off the training process after it reaches 20 million iterations for the MuJoCo game and 4 million iterations for the Pong game. This is because our empirical evidence indicates that, after these numbers of iterations, the performance of the adversarial agent (the winning rate) converges. Our method involves multiple hyper-parameters. In our experiment, we conduct a sensitivity test for the main hyper-parameters: the explanation method, λ, the distance measure, and η (see the Appendix). We find that our attack is robust to all of these hyper-parameters except the distance measure. We present our choice of distance measure in Section 5 and validate it in Experiment II. Regarding the hyper-parameters inherited from our baseline [10], we apply the default choices in [10] for a fair comparison. In the Appendix, we specify the choices of the other hyper-parameters that are not varied in the sensitivity test and how we decide them. A video demonstration of our adversarial agents is available at https://tinyurl.com/vsnp5jr.
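The note above about initial states corresponds, in code, to a simple aggregation over independent runs. The sketch below, with a hypothetical train_adversary(seed) stand-in for one full training run that returns a winning-rate curve, produces the averaged curve and the min/max band plotted in Figures 6 and 7.

    import numpy as np

    def evaluate_over_seeds(train_adversary, seeds=range(8)):
        # One full attack run per initial state; each run is assumed to
        # return the winning-rate curve logged over training iterations.
        curves = np.stack([np.asarray(train_adversary(s)) for s in seeds])
        return {
            "mean": curves.mean(axis=0),  # darker solid line in the figures
            "min": curves.min(axis=0),    # lower edge of the shaded band
            "max": curves.max(axis=0),    # upper edge of the shaded band
        }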

Figure 6: The winning rates of our adversarial agent trained with different hyperparameter selection strategies and distance measures: (a) constant λ (λ = 0.01, 0.05, 0.08); (b) explainable AI techniques (Grad, SmoothGrad, InteGrad); (c) random λ; (d) distance measures. The darker solid lines are the average winning rates of the corresponding agents; the lighter shadows represent the variation between the maximal and minimal winning rates.

6.3 Experiment result

Here, we present the experiment results and analyze the reasons behind our findings.

Comparison of hyperparameter selection strategies. Figure 6a shows the performance of the adversarial agent trained with different hyperparameter selection strategies. As we can observe from the figure, when the hyperparameter λ is set to a constant (i.e., the red, green, and yellow lines in Figure 6a), the winning rate of the adversarial agent converges at about 50% on average, which is comparable to the performance of the adversarial agent trained by the state-of-the-art method [10] (indicated by "Baseline" in Figure 6a). However, when using an explainable AI technique to adjust this hyperparameter over time, we can easily observe about a 10% improvement in the winning rate (about 60% vs. 50%). This aligns with the rationale behind our design: the distribution of the trajectories used for agent training is very unstable, and it is generally difficult to find a constant value suitable for all possible distributions.

As is shown in Figure 6b, we also discover that, although the explanation methods in our choice set provide different fidelity [1], when integrated as a component of our attack they do not change the effectiveness of the attack. The adversarial agent trained with each of the three explanation methods demonstrates a winning rate of about 60%. This indicates that the choice of explanation method has nearly no influence upon the performance of our attack. In addition, we observe that, when using a randomly generated explanation to adjust the hyperparameter λ, the adversarial agent achieves only about a 40% winning rate (see Figure 6c). From a different angle, this implies the importance of the explainable AI techniques to our attack.

Comparison of distance measures. Figure 6d shows the performance comparison of the adversarial agents trained under different distance measures. As we can observe from the figure, the adversarial agent trained under the l2 norm demonstrates the worst winning rate, which is even lower than that observed from the baseline method. The reason behind this observation is as follows. The observations and actions in the MuJoCo game are of high dimensionality. When a minimization or maximization problem involves high-dimensional input, the l2 norm is typically not able to impose a strong penalty, and thus the model trained with such a distance measure usually exhibits poor performance.

From Figure 6d, we can also observe that the proposed method under the l1 norm demonstrates better performance than the baseline approach as well as the setup with the l2 norm. However, it is still slightly below the performance observed with our carefully selected distance measure. While this observation could be used as an argument to support the selection of our distance measure, we do not claim that the l∞ norm cannot be replaced with the l1 norm; rather, we argue that they are interchangeable. The performance difference is subtle and, presumably in a different game, the adversarial agent trained under the l1 norm might demonstrate a slightly higher winning rate.
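To make the comparison concrete, the deviation penalty being swapped in Experiment II can be sketched as follows; the exact terms it plugs into are those of Equations (11)-(13), which are not reproduced here, so the function below is an illustration rather than our actual loss.

    import numpy as np

    def deviation_penalty(pred, actual, norm="linf"):
        # Gap between the predicted and the observed opponent behavior,
        # measured under the norm being compared in Experiment II.
        diff = np.asarray(pred, dtype=np.float64) - np.asarray(actual, dtype=np.float64)
        if norm == "l1":
            return np.abs(diff).sum()
        if norm == "l2":
            return np.sqrt((diff ** 2).sum())
        if norm == "linf":
            return np.abs(diff).max()
        raise ValueError("unknown norm: " + norm)

Under a fixed per-coordinate error, the l1 value grows linearly with the dimensionality while the l2 value grows only with its square root, which is one way to read the weak-penalty argument above.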
Comparison of our attack with the baseline method. Figure 7 shows the comparison of our method with the baseline approach across the two games. First, we can see that, when using the baseline approach for the MuJoCo game, the winning rate of the adversarial agent converges just slightly above 50%. (Note that the adversarial agent trained by this baseline approach does not demonstrate the same winning rate as is stated in [10]. We believe this is caused by the choice of initial states. Existing research [17] has shown that DRL algorithms are sensitive to the choice of initial states. As such, the standard way to evaluate a DRL algorithm is to run it multiple times with different initial states and report the statistics of the results. In this work, we follow this standard process and run each method with eight randomly selected initial states.) This implies that the adversarial agent trained by the baseline method imposes only a minimal risk to its opponent. We believe the reason is as follows. As is mentioned in the section above, the baseline is a simple application of the PPO algorithm, which is not designed specifically for training an agent to exploit the weakness of its opponent. As a result, when used to train an adversarial agent in a game, the algorithm may not be able to find a policy that significantly pulls down the winning rate of the opponent or, from the perspective of the adversary, significantly increases the winning rate of his own agent.

From Figure 7, we can also observe that the adversarial agent trained by our method demonstrates a significant improvement. For the MuJoCo and the Pong game, our adversarial agent converges at winning rates of 60% and 100%, respectively. This indicates that the action deviation term could better guide our algorithm in searching the adversarial policy subspace and identifying the policy that exploits the weakness of the opponent most effectively.

In addition to the improvement in the average winning rates, our proposed method, to some extent, also improves the efficiency of the training process. As we can infer from Figure 7, to train an adversarial agent with a certain winning rate, our method usually takes fewer iterations than the baseline approach. Take the MuJoCo game for example. To train an adversarial agent with a 50% winning rate, the baseline takes about 20 million iterations, whereas our method takes only about 11 million iterations. We argue this is beneficial because reinforcement learning is known to be computationally heavy and, with the capability of reducing the training iterations, one could obtain an adversarial agent more efficiently.
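The iteration counts quoted above are read off the averaged curves in Figure 7; a small helper of the following form (names and inputs hypothetical) makes that reading explicit.

    import numpy as np

    def iterations_to_reach(win_rate_curve, iterations, target=0.5):
        # First logged iteration at which the averaged winning rate
        # reaches the target (e.g., 0.5 for the 50% comparison above).
        curve = np.asarray(win_rate_curve, dtype=np.float64)
        hits = np.flatnonzero(curve >= target)
        return None if hits.size == 0 else int(np.asarray(iterations)[hits[0]])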




Figure 7: Our attack vs. the baseline approach [10] in two different games: (a) MuJoCo winning rate; (b) MuJoCo standard error; (c) Roboschool winning rate; (d) Roboschool standard error. Note that "Our W-B" represents our attack in the white-box setting, where the approximated policy network of the victim agent is replaced with its actual policy network.

Comparison of black-box approximation with white-box prior. Figure 7 also illustrates the performance of the adversarial agents trained with the approximated opponent policy as well as with the policy actually used by the opponent. As is indicated by the lines marked as "Our W-B" in the figure, the performance observed in the white-box setting and with our approximated approach is approximately the same. This indicates that, while our point estimate inevitably introduces approximation errors, for both games used in our evaluation these errors are not amplified to a level that could jeopardize the performance of our adversarial agent.

Finally, from Figure 7, we can observe that our method exhibits less variation in the winning rates (i.e., a smaller shaded area) than the baseline approach when the initial state varies. This implies that our proposed method is less sensitive to the initial state of the training process. Similar to the training-efficiency property above, this is a critical characteristic: because reinforcement learning is also known to be sensitive to the initial random state, with our method one does not need to find a good initial state to obtain an adversarial agent with decent performance.

Comparison of adversary-retrained agents with regular agents. Figure 8 depicts the winning rate of the victim agent against our adversarial agent after we retrained it by using the method proposed to train the adversarial agent. As we can observe in the figure, this adversarial training approach significantly improves the robustness of the victim agent. The retrained victim agent demonstrates winning rates of more than 95% in both the MuJoCo and the Roboschool Pong game. This indicates that a simple adversarial training approach could be used as an adversary-resistant method to robustify a game agent. However, in Table 1, we also note that, when using the retrained victim agent to play against a regular agent (i.e., an agent trained through self-play), the robustified agent does not demonstrate a sufficient capability to beat the regular agent. This implies that, though simple adversarial training improves agent robustness, it cannot help a victim agent obtain sufficient generalizability. We suspect this is caused by the composition of the retraining episodes. Specifically, if we retrain the victim agent with episodes of it playing against both an adversarial agent and a regular agent, the retrained agent should not only pick up adversarial robustness but also preserve its generalizability.

Figure 8: The winning rate of the adversary-retrained victim agent against our adversarial agent in two different games: (a) MuJoCo; (b) Roboschool Pong.

Table 1: The winning rate of the adversary-retrained victim agent against the corresponding regular agent in two different games. Note that after retraining the victim agents, we test them for one hundred episodes.

    Game             Min     Max     Mean    Std
    MuJoCo           6.0%    25.0%   16.3%   6.2%
    Roboschool Pong  40.0%   44.0%   41.4%   1.4%
7 Related Work

There is a large body of research on adversarial attacks against deep neural networks (e.g., [7, 8, 11, 13, 27, 36, 49]). Recently, this interest has been extended to deep reinforcement learning (e.g., [18, 19, 40]). From the technical perspective, these previous works can be categorized into (1) attacking reinforcement learning through trojan backdoors, (2) attacking reinforcement learning through an adversarial environment, and (3) attacking reinforcement learning through an adversarial agent. In the following, we summarize the existing works and highlight the key differences between these works and ours.

Trojan backdoors. A trojan backdoor attack refers to a hidden pattern implanted in a deep neural network [8, 13, 27]. When activated, it forces the infected deep neural network to misclassify the contaminated inputs into a wrong class. Recently, such attacks have been introduced to the context of deep reinforcement learning. For example, in recent works [21, 56], researchers demonstrate that an attacker could follow the approach below to insert trojan backdoors into the policy network of a trained agent. First, the attacker injects a trigger into the environment. Then, he runs the victim agent in that manipulated environment, collecting the contaminated training trajectories. By assigning high rewards to these contaminated trajectories, the attacker could train a trojan-implanted policy network for the victim agent. As is shown in [21, 56], when a trigger is presented to the trojan-implanted agent, the agent generally exhibits undesired behaviors.

When launching a trojan backdoor attack, an adversary not only has to be involved in the training process of the victim agent but also has to obtain control over the environment that the agent interacts with. In our work, we assume neither involvement in the training process nor the freedom to change the environment when attacking an agent. As is mentioned in Section 2, we assume an adversary can only control the attacking agent and observe the actions of the victim agent. As such, our proposed attack is orthogonal to trojan attacks and more realistic in the physical world.

Adversarial environment. Over the past years, many research works have discovered that deep neural networks are vulnerable to adversarial attacks [7, 11, 36, 49], in which an attacker subtly perturbs a data input to a deep neural network (e.g., an image) and thus forces that network to misclassify the perturbed data into the wrong class.
Recently, such adversarial attacks have been extended to and launched against deep reinforcement learning or, more precisely, the policy network of a trained agent. In a pioneering research work [18], Huang et al. leverage the idea of adversarial learning to manipulate the environment at each time step and thus the observation passed to the policy network. They demonstrate that, using this approach, the perturbed environment could easily fail a game agent – making it exhibit poor performance – regardless of whether it is trained by deep Q-learning or actor-critic algorithms.

Following the steps of Huang and his colleagues, recent research [23, 25, 40] designed and developed new approaches to improve the efficiency of this attack. For example, Kos et al. [23] suggest performing environment manipulation only at the time steps when the output of the value function exceeds a certain threshold. Russo et al. [40] model the selection of attacking time steps as a Markov decision process. By solving this Markov decision process, an attacker could identify the optimal time steps to launch attacks and thus minimize his effort on environment manipulation.

Going beyond efficiency improvements, recent research also proposes methods to launch the aforementioned attack in black-box settings. Rather than assuming an attacker has free access to the internal weights, the training algorithms, and the training trajectories of the corresponding policy network, the black-box setting restricts an attacker's access to only the input and output of the policy network. Under this setup, Huang et al. [18] improve their adversarial attack. More specifically, they trained a surrogate policy network by using different training trajectory rollouts and algorithms. Then, they utilized that network to construct an adversarial environment (observations). Through a series of experiments, they showed that the adversarial environment derived from the surrogate network can still be useful for attacking the original policy network. In addition to this black-box approach, recent research proposes many other methods to generate an adversarial environment in black-box settings (e.g., [4, 54, 59]). Similar to the work proposed in [18], they also demonstrated that a trained agent could be attacked by an adversarial environment, even if the attacker does not have prior knowledge about its policy network.

Different from the works mentioned above, our attack does not craft an adversarial environment but manipulates the actions of the adversarial agent through its policy network, with the goal of failing the opponent agent. This is a more practical setup because, in the real world, an attacker can only control its own agent and does not have the freedom to change the environment that the victim agent interacts with (e.g., changing the color of the sky in the Super Mario game).

Adversarial agent. In terms of the problem setup, the work most relevant to ours is the attack through adversarial agents. Different from the attacks mentioned above, this kind of attack can be launched without the requirement of changing the environment and/or accessing the training process of victim agents. In early research, Gleave et al. [10] propose a training method that learns the policy network of the adversarial agent by directly playing against the opponent agent. Technically speaking, they first treat the opponent agent as part of the observation of the adversarial agent and then simply train the adversarial agent by using the PPO algorithm. In [10], Gleave and his colleagues show that, by using their learning method to train an adversarial agent for the MuJoCo game [3], an attacker could make that adversarial agent defeat the opponent agent in the two-agent competitive setting.

However, as is discussed in the section above, the method proposed in [10] demonstrates only a low success rate in attacking opponent agents, because it is a simple application of the PPO algorithm, which provides little guidance for the adversarial agent to identify the weakness of the opponent agent. In this work, we propose a new method to guide the construction of an adversarial policy network. Technically, it not only extends the objective function of the PPO algorithm but, more importantly, utilizes explainable AI techniques to find the weakness of the opponent agent. As we demonstrated in Section 6, the adversarial agent trained through our method significantly outperforms that trained through the method in [10].

8 Discussion and Future Work

In this section, we discuss some related issues of our proposed method and our future plan.

Multiple agents. In this work, we develop our attack against two-agent competitive games, whose real-world applications include real-time strategy games (e.g., StarCraft II and Dota 2), online board games (e.g., Go, Poker), etc. In the future, we plan to extend our work to multi-agent environments, where multiple participants collaborate and/or compete with each other. To achieve this, we will explore solutions to the following challenges. First, different from our game setting or typical single-agent reinforcement learning settings, which can be modeled as a Markov Decision Process [41, 53], multi-agent reinforcement learning games require re-defining the game model as either a Markov game or an extensive-form game, with totally different value functions and action-value functions [38, 57]. Under these new game settings, the PPO algorithm is no longer a standard learning method for training an agent. Since we design our method based on the PPO algorithm, migrating our attack to multi-agent settings might require non-trivial modification or even a completely new design. Second, even though recent research proposes adaptive methods [34] to extend the PPO algorithm to a multi-agent setting, it is still challenging to integrate our proposed attack into a multi-agent game. On the one hand, recent research [57] demonstrates that a multi-agent environment introduces non-stationarity and more intense variance into the game environment, which inevitably makes the training of our adversarial agent more difficult. On the other hand, the integration of our method inevitably introduces intensive computation and increases the difficulty of tuning hyperparameters. For example, in a multi-agent environment, agents compete with each other, which means we would have to modify the aforementioned loss term to deviate the actions of each pair of agents. Under this setup, the number of loss terms grows as n^2, where n is the total number of agents in the environment. Assume n is larger than 5; then we can expect a final loss function with more than 25 loss terms, which makes the optimization of that loss function hard to resolve and the hyperparameter tuning relatively difficult.

Defense and detection. Researchers have proposed several defense and detection mechanisms for reinforcement learning. With respect to defense, many research works extend the idea of adversarial training [11, 51]. For example, the works proposed in [5, 28, 37] utilize the technique proposed in [18] to generate adversarial samples and then leverage these samples to retrain deep Q-networks or policy networks with the goal of improving their robustness. The work proposed in [6] introduces random noise to the weights of a deep Q-network during the training process and demonstrates that the trained network can be robust against the adversarial sample attack proposed in [4]. Regarding detection, there have been two existing works [15, 26]. They build independent neural networks to identify adversarial samples to the policy network, and demonstrate great success in pinpointing adversarial attacks against reinforcement learning. However, existing defense and detection mechanisms are designed for attacks through environment manipulation.
Thus, they cannot be easily adopted or extended to defeat our attack. As we show in Section 6, the victim agent robustified by adversarial training loses its generalizability, and we suspect this is caused by how the retraining trajectories are composed. As part of future work, we plan to verify this hypothesis by retraining the victim agent on two sets of game episodes. One is from the victim agent's interactions with the corresponding regular agent; the other is from the victim agent's interactions with the adversarial agent learned through our proposed approach. We will also vary the proportion of adversarial and regular episodes and observe the changes in the retrained victim agent's robustness and generalizability.

Transferability. Following the efforts of exploiting reinforcement learning through an adversarial environment, recent research has extended its interest to studying the transferability of adversarial environments (e.g., [18]). More specifically, for the same reinforcement learning task, researchers have shown that the adversarial environment crafted for one particular policy network can be easily transferred to a different policy network, misleading the corresponding agent into behaving in an undesired manner. As part of our future work, we plan to explore the transferability of our adversarial policy. We will examine whether an adversarial policy network trained against one particular opponent agent could also be used to defeat other agents trained differently but serving the same reinforcement learning task.

9 Conclusion
When launching an attack against an opponent agent in a reinforcement learning problem, an adversary usually has full control over his own agent (the adversarial agent) as well as the freedom to passively observe the actions and observations of his opponent. However, it is very common that the adversary has no access to the policy network of the opponent agent, nor the capability to arbitrarily manipulate the input to that network (i.e., its observation). In this practical scenario, it is usually difficult to train an adversarial agent effectively and efficiently with existing techniques, because the algorithms applied to this problem either make strong assumptions or lack the ability to exploit the weakness of the target agent. In this work, we carefully extend a state-of-the-art reinforcement learning algorithm to guide the training of the adversarial agent in the two-agent competitive game setting. The empirical evidence demonstrates that an adversarial agent can be trained effectively and efficiently, exhibiting a stronger capability in exploiting the weakness of the opponent agent than those trained with existing techniques. With all these discoveries and analyses, we conclude that attacking reinforcement learning can be achieved in a practical scenario and demonstrated in an effective and efficient fashion.

Acknowledgments

We would like to thank our shepherd Lujo Bauer and the anonymous reviewers for their helpful feedback. This project was supported in part by NSF grant CNS-1718459, by ONR grant N00014-20-1-2008, and by the Amazon Research Award.
References

[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Proc. of NeurIPS, 2018.
[2] Martin Arjovsky, Soumith Chintala, et al. Wasserstein generative adversarial networks. In Proc. of ICML, 2017.
[3] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In Proc. of ICLR, 2018.
[4] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In Proc. of MLDM, 2017.
[5] Vahid Behzadan and Arslan Munir. Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344, 2017.
[6] Vahid Behzadan and Arslan Munir. Mitigation of policy manipulation attacks on deep q-networks with parameter-space noise. In Proc. of SAFECOMP, 2018.
[7] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Proc. of S&P, 2017.
[8] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
[9] Piotr Dabkowski and Yarin Gal. Real time image saliency for black box classifiers. In Proc. of NeurIPS, 2017.
[10] Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, et al. Adversarial policies: Attacking deep reinforcement learning. In Proc. of ICLR, 2020.
[11] Ian J Goodfellow, Jonathon Shlens, et al. Explaining and harnessing adversarial examples. In Proc. of ICLR, 2015.
[12] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004.
[13] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. In Proc. of NeurIPS Workshop, 2017.
[14] Arthur Guez, David Silver, and Peter Dayan. Efficient bayes-adaptive reinforcement learning using sample-based search. In Proc. of NeurIPS, 2012.
[15] Aaron Havens, Zhanhong Jiang, and Soumik Sarkar. Online robust policy learning in the presence of unknown adversaries. In Proc. of NeurIPS, 2018.
[16] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, SM Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. In Proc. of NeurIPS, 2017.
[17] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proc. of AAAI, 2018.
[18] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. In Proc. of ICLR Workshop, 2017.
[19] Yonghong Huang and Shih-han Wang. Adversarial manipulation of reinforcement learning policies in autonomous agents. In Proc. of IJCNN, 2018.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] Panagiota Kiourti, Kacper Wardega, Susmit Jha, and Wenchao Li. Trojdrl: Trojan attacks on deep reinforcement learning agents. arXiv preprint arXiv:1903.06638, 2019.
[22] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Proc. of NeurIPS, 2000.
[23] Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. In Proc. of ICLR Workshop, 2017.
[24] Solomon Kullback. Information theory and statistics. Courier Corporation, 1997.
[25] Yen-Chen Lin, Zhang-Wei Hong, Yuan-Hong Liao, Meng-Li Shih, et al. Tactics of adversarial attack on deep reinforcement learning agents. In Proc. of IJCAI, 2017.
[26] Yen-Chen Lin, Ming-Yu Liu, Min Sun, and Jia-Bin Huang. Detecting adversarial attacks on neural network policies with visual foresight. arXiv preprint arXiv:1710.00814, 2017.
[27] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Proc. of NDSS, 2018.
[28] Ajay Mandlekar, Yuke Zhu, Animesh Garg, et al. Adversarially robust policy learning: Active construction of physically-plausible perturbations. In Proc. of IROS, 2017.
[29] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. of ICML, 2016.
[30] Volodymyr Mnih, Koray Kavukcuoglu, et al. Playing atari with deep reinforcement learning. In Proc. of NeurIPS Deep Learning Workshop, 2013.
[31] Volodymyr Mnih, Koray Kavukcuoglu, et al. Human-level control through deep reinforcement learning. Nature, 2015.
[32] OpenAI. OpenAI at The International 2017. https://openai.com/the-international/, 2017.
[33] OpenAI. Roboschool: open-source software for robot simulation. https://openai.com/blog/roboschool/, 2017.
[34] OpenAI. OpenAI Five. https://openai.com/blog/openai-five/, 2018.
[35] OpenAI. Emergent tool use from multi-agent interaction. https://openai.com/blog/emergent-tool-use/, 2019.
[36] Nicolas Papernot, Patrick McDaniel, Somesh Jha, et al. The limitations of deep learning in adversarial settings. In Proc. of Euro S&P, 2016.
[37] Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. In Proc. of AAMAS, 2018.
[38] Tabish Rashid, Mikayel Samvelyan, et al. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proc. of ICML, 2018.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In Proc. of KDD, 2016.
[40] Alessio Russo and Alexandre Proutiere. Optimal attacks on reinforcement learning policies. arXiv preprint arXiv:1907.13548, 2019.
[41] John Schulman, Sergey Levine, et al. Trust region policy optimization. In Proc. of ICML, 2015.
[42] John Schulman, Philipp Moritz, et al. High-dimensional continuous control using generalized advantage estimation. In Proc. of ICLR, 2016.
[43] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[44] Lloyd S Shapley. Stochastic games. Proc. of the National Academy of Sciences, 1953.
[45] David Silver, Aja Huang, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
[46] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proc. of ICLR, 2013.
[47] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
[48] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proc. of ICML, 2017.
[49] Christian Szegedy, Wojciech Zaremba, et al. Intriguing properties of neural networks. In Proc. of ICLR, 2015.
[50] Emanuel Todorov, Tom Erez, et al. Mujoco: A physics engine for model-based control. In Proc. of IROS, 2012.
[51] Florian Tramèr, Alexey Kurakin, et al. Ensemble adversarial training: Attacks and defenses. In Proc. of ICLR, 2018.
[52] Florian Tramèr, Fan Zhang, et al. Stealing machine learning models via prediction apis. In Proc. of USENIX Security Symposium, 2016.
[53] Hado Van Hasselt, Arthur Guez, et al. Deep reinforcement learning with double q-learning. In Proc. of AAAI, 2016.
[54] Chaowei Xiao, Xinlei Pan, et al. Characterizing attacks on deep reinforcement learning. arXiv preprint arXiv:1907.09470, 2019.
[55] Tianbing Xu, Qiang Liu, Liang Zhao, and Jian Peng. Learning to explore via meta-policy gradient. In Proc. of ICML, 2018.
[56] Zhaoyuan Yang, Naresh Iyer, Johan Reimann, and Nurali Virani. Design of intentional backdoors in sequential models. arXiv preprint arXiv:1902.09972, 2019.
[57] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635, 2019.
[58] Marvin Zhang, Zoe McCarthy, Chelsea Finn, Sergey Levine, and Pieter Abbeel. Learning deep neural network policies with continuous memory states. In Proc. of ICRA, 2016.
[59] Yiren Zhao, Ilia Shumailov, et al. Blackbox attacks on reinforcement learning agents using approximated temporal information. arXiv preprint arXiv:1909.02918, 2019.

Appendix

Victim policies. The network architectures of the victim policies in the MuJoCo game and the Roboschool Pong game are MLP-380-128-128-17 [3] and MLP-13-64-64-2, respectively.

Hyper-parameters of the baseline. The baseline has two sets of hyper-parameters: the adversarial policy/value network architectures and the hyper-parameters of the PPO algorithm. For the MuJoCo game, we directly used the default choices in [10]. For the Roboschool Pong game, we set the adversarial policy network and its value network to MLP-13-64-64-2 and MLP-13-64-64-1, respectively, and used the same set of PPO hyper-parameters as for the MuJoCo game.

Hyper-parameters of our method. Here, we specify the hyper-parameters that are not varied in the sensitivity test. First, we applied the choices of [10] for those inherited from [10] (i.e., the policy/value network architectures and the PPO hyper-parameters). In addition, our attack has four hyper-parameters: H, F, εs, and εa. We set εs and εa to widely used empirical values [7], and H and F similar to the policy network architectures. Specifically, for the MuJoCo game, we set εs = 1, εa = 0.05, H: MLP-414-40-64-380, and F: MLP-380-64-64-17. For the Roboschool Pong game, we set εs = 0.01, εa = 0.05, H: MLP-17-40-16-13, and F: MLP-13-64-64-2.

Effectiveness of the l2 norm on the Pong game. In Figure 6d, we show that the solution developed on the l2 norm is worse than those developed on l1, l∞, and our baseline. We argue that this is because the l2 norm is not suitable for high-dimensional input. To validate this, we run a similar experiment on the Roboschool Pong game. Different from the MuJoCo game, here the agent takes a low-dimensional input (13 features). In Figure 9, we show that, for the Pong game, the solution developed on the l2 norm is just as good as our final solution, which utilizes l1. This confirms our argument: the l2 norm is not suitable for situations with high-dimensional inputs, where l1 or l∞ is a better fit.

Figure 9: Comparison of our attack and the attack with the l2 norm.

Additional parameter sensitivity test. In our experiments, we set equal weights for the action difference term and the observation difference term in Eqn. (9). Here, we vary the relative weight between the two terms and observe its influence upon our attack performance. Specifically, we introduce a weight η on the observation difference term (i.e., η‖ô_v^(t+1) − o_v^(t+1)‖_1 + ‖â_v^(t+1) − a_v^(t+1)‖_1) and train adversarial agents with η ∈ {1, 2, 3, 4}. Figure 10 shows the winning rate of the adversarial agent on the two selected games. The results show that subtly varying η imposes only a negligible influence upon the performance of the adversarial agents trained by our attack.

Figure 10: The performance of our attack with different η (η = 1, 2, 3, 4): (a) MuJoCo; (b) Roboschool Pong.
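For readers reproducing these architectures, the "MLP-a-b-...-z" notation above denotes a fully connected network whose layer widths are the listed numbers. A minimal PyTorch sketch is given below; the tanh activation is an assumption, as the appendix does not restate the nonlinearity.

    import torch.nn as nn

    def mlp_from_spec(spec):
        # Build a fully connected network from the "MLP-a-b-...-z" notation,
        # e.g. "MLP-380-128-128-17" for the MuJoCo victim policy.
        sizes = [int(s) for s in spec.split("-")[1:]]
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:          # no activation on the output layer
                layers.append(nn.Tanh())
        return nn.Sequential(*layers)

    victim_mujoco = mlp_from_spec("MLP-380-128-128-17")
    victim_pong = mlp_from_spec("MLP-13-64-64-2")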