Improved robustness of policies upon conversion to spiking neuronal network platforms applied to ATARI games

Anonymous Authors

Abstract

Various implementations of deep reinforcement learning (RL) have demonstrated excellent performance on tasks that can be solved by a trained policy, but they are not without drawbacks. Deep RL suffers from high sensitivity to noisy and missing input and to adversarial attacks. To mitigate these deficiencies of deep RL solutions, we suggest involving spiking neural networks (SNNs). Previous work has shown that standard neural networks trained for image classification can be converted to SNNs with negligible deterioration in performance. In this paper, we convert Q-learning ReLU-networks (ReLU-N) trained using reinforcement learning into SNNs. We provide a proof of concept for the conversion of ReLU-N to SNN, demonstrating improved robustness to occlusion and better generalization than the original ReLU-N. Moreover, we show promising initial results with converting full-scale deep Q-networks to SNNs, paving the way for future research.

1. Introduction

Recent advancements in deep reinforcement learning (RL) have achieved astonishing results, surpassing human performance on various ATARI games (Mnih et al., 2015; Hasselt et al., 2016; Wang et al., 2016). However, deep RL is susceptible to adversarial attacks, similarly to other deep networks (Huang et al., 2017). This vulnerability stems from the fact that deep RL uses gradient descent to train the agent. Another consequence of the gradient descent algorithm is that the trained agent learns to focus on a few sensitive areas; when these areas are occluded or perturbed, the performance of the RL agent deteriorates. Moreover, there is evidence that the policies learned by the networks in deep RL algorithms do not generalize well, and the performance of the agent deteriorates when it encounters a state that it has not seen before, even if it is similar to other states (Witty et al., 2018).

Biological systems tend to be very noisy by nature (Richardson & Gerstner, 2006; Stein et al., 2005), but they can still operate well even under harsh conditions that affect their internal state and input. Spiking neural networks (SNNs) are considered to be closer to biological neurons due to their event-based nature; they are often termed the third generation of neural networks (Maass, 1996). A spike is the quantification of the neuron's internal and external processes and is always equal to other spikes. Therefore, the individual neuron can serve as a small bottleneck that can sustain low intermittent noise and avoid propagating it further. Moreover, spiking neurons as a group in a network can damp the noise even further due to their collective effect and their architectural connectivity (Hazan & Manevitz, 2012). However, SNNs are typically harder to train with gradient-based methods due to the non-differentiable nature of the spikes (Pfeiffer & Pfeil, 2018).

Much of the recent work with SNNs has focused on implementing methods similar to backpropagation (Huh & Sejnowski, 2018; Wu et al., 2018) or on using biologically inspired learning rules such as spike-timing-dependent plasticity (STDP) to train the network (Bengio et al., 2015; Diehl & Cook, 2015; Gilra & Gerstner, 2018; Ferré et al., 2018). One of the benefits of SNNs is their potential to be more energy efficient and faster than rectified linear unit networks (ReLU-N), particularly on dedicated neuromorphic hardware (Martí et al., 2016).

Using SNNs in an RL environment seems almost natural, since many animals learn to perform certain tasks using a variation of semi-supervised and reinforcement learning. Moreover, there is evidence that biological neurons also learn using evaluative feedback from neuromodulators such as dopamine (Wang et al., 2018), e.g., in the postulated dopamine reward prediction-error signal (Schultz, 2016).

However, since spiking neurons are fundamentally different from artificial neurons, it is not clear whether SNNs are as capable as ReLU-Ns in reinforcement learning domains. This raises the following questions: Do SNNs have the capability to represent the same functions as ReLU-N? More specifically, can SNNs represent complex policies that successfully play Atari games? If so, do they have any advantages in handling noisy inputs?

We answer these questions by demonstrating that ReLU-networks trained using existing reinforcement learning algorithms can be converted to SNNs with similar performance on the reinforcement learning task of playing the Atari Breakout game. Furthermore, we show that such converted SNNs are more robust than the original ReLU-Ns. Finally, we demonstrate that full-sized deep Q-networks (DQN) (Mnih et al., 2015) can also be converted to SNNs while maintaining better-than-human performance, paving the way for future research in robustness and RL with SNNs.

2. Background

2.1. Arcade learning environment

The Arcade Learning Environment (ALE) (Bellemare et al., 2013) is a platform that enables researchers to test their algorithms on over 50 Atari 2600 games. The agent sees the environment through image frames of the game, interacts with the environment through 18 possible actions, and receives feedback in the form of the change in the game score. The games were designed for humans and thus are free from experimenter bias. They span many different genres that require the agent/algorithm to generalize well over various tasks, difficulty levels, and timescales. ALE has therefore become a popular test-bed for reinforcement learning.

Figure 1. Screenshot of the Atari 2600 Breakout game.

Breakout: We demonstrate our results on the game of Breakout, which is similar to the popular game Pong. The player controls a paddle at the bottom of the screen, while rows of colored bricks occupy the upper part of the screen. A ball bounces between the bricks and the player-controlled paddle. If the ball hits a brick, the brick breaks and the game score increases; if the ball falls below the paddle, the player loses a life. The game starts with five lives, and the player/agent is supposed to break all the bricks before running out of lives. Figure 1 shows a frame of the game.

2.2. Deep Q-networks

Reinforcement learning algorithms train a policy π to maximize the expected cumulative reward received over time. Formally, this process is modelled as a Markov decision process (MDP). Given a state space S and an action space A, the agent starts in an initial state s_0 ∈ S drawn from a set of possible start states S_0 ⊆ S. At each time step t, starting from t = 0, the agent takes an action a_t to transition from s_t to s_{t+1}. The probability of transitioning from state s to state s' by taking action a is given by the transition function P(s, a, s'). The reward function R(s, a) defines the expected reward received by the agent after taking action a in state s.

A policy π is defined as the conditional distribution of actions given the state, π(s, a) = Pr(A_t = a | S_t = s). The Q-value or action-value of a state-action pair for a given policy, q^π(s, a), is the expected return following policy π after taking action a in state s:

q^\pi(s, a) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \,\Big|\, S_t = s, A_t = a, \pi\Big]    (1)

where γ is the discount factor. The action-value function satisfies a Bellman equation that can be written as:

q^\pi(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} q^\pi(s_{t+1}, a_{t+1})    (2)

Many widely used reinforcement learning algorithms first approximate the Q-value and then select the policy that maximizes the Q-value at each step to maximize returns (Sutton & Barto, 2018). Deep Q-networks (DQN) (Mnih et al., 2015) are one such algorithm; they use deep artificial neural networks to approximate the Q-value. The network can learn policies from only the pixels of the screen and the game score, and has been shown to surpass human performance on many of the Atari 2600 games.

Figure 2. Architecture of deep Q-networks, following Mnih et al. (2015); ReLU nonlinear units are emphasized by red circles.
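To make Eqs. (1)-(2) concrete, the following is a minimal PyTorch sketch of the one-step Bellman target that drives DQN-style gradient-descent updates. It is an illustration rather than the training code used in this work; all tensor and function names are ours.

```python
import torch

def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step Bellman target r_t + gamma * max_a' Q(s_{t+1}, a'), cf. Eq. (2).

    reward:        (batch,) rewards r_t
    next_q_values: (batch, n_actions) Q-values of s_{t+1} from the target network
    done:          (batch,) 1.0 if the episode ended at step t, else 0.0
    """
    max_next_q = next_q_values.max(dim=1).values        # max over a' of Q(s', a')
    return reward + gamma * (1.0 - done) * max_next_q   # no bootstrapping past terminal states

# Example: the TD error that the gradient-descent updates minimize (dummy tensors).
q_pred = torch.randn(32, 4)                  # Q(s_t, .) from the online network
action = torch.randint(0, 4, (32,))          # actions actually taken
target = q_learning_target(torch.rand(32), torch.randn(32, 4), torch.zeros(32))
td_error = target - q_pred.gather(1, action.unsqueeze(1)).squeeze(1)
loss = td_error.pow(2).mean()
```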

2.3. Spiking neurons

SNNs may use any of various neuron models (W. Gerstner, 2002; Tuckwell, 1988). For our experiments, we use four variants of spiking neurons. We use the following notation to describe them: τ is the membrane time constant, V is the membrane voltage, v_rest is the resting membrane potential, and V_thresh is the neuron threshold.

1. Integrate-and-fire (IF) neuron: The IF neuron is the simplest spiking neuron model. The neuron simply integrates its input until the membrane potential V exceeds the voltage threshold V_thresh and a spike is generated. Once the spike is generated, the membrane potential is reset to V_reset.

\tau \frac{dv}{dt} = \sum_{i=1}^{n} W_i \cdot \text{Input}_i    (3)

2. Subtractive integrate-and-fire (SubIF) neuron: The SubIF neuron (Cassidy et al., 2013; Diehl et al., 2016; Rueckauer et al., 2017) behaves like the IF neuron with one small change: when the membrane potential exceeds the threshold, the neuron emits a spike and resets its membrane voltage to V_reset + (V − V_thresh). By carrying over the overshoot voltage, the neuron "remembers" the excess voltage from the last spike and is more easily excited by the next incoming inputs. This reduces the information lost when spiking in SNNs converted from ReLU-N.

3. Leaky integrate-and-fire (LIF) neuron: The LIF neuron behaves similarly to the IF neuron, but at every time step during which its membrane potential is above the resting potential, the neuron leaks a constant amount of current.

\tau \frac{dv}{dt} = (v_{rest} - V_{t-1}) + \sum_{i=1}^{n} W_i \cdot \text{Input}_i    (4)

4. Stochastic leaky integrate-and-fire neuron: The stochastic LIF neuron is based on the LIF neuron, but it may spike even when its membrane potential is below the threshold, with probability proportional to its membrane potential (escape noise). The escape noise σ is given by:

\sigma = \begin{cases} \frac{1}{\tau_\sigma} \exp\big(\beta_\sigma (V_t - V_{thresh})\big), & \text{if this value is less than 1} \\ 1, & \text{otherwise} \end{cases}    (5)

where τ_σ and β_σ are constant positive parameters. For this paper, we set both τ_σ and β_σ to 1.

For all the spiking models listed above, after every spike the neuron enters a refractory period during which it is unable to spike or integrate input. For this paper, we ignore the refractory period to simplify the conversion from artificial neurons. For a complete list of the parameters used for the LIF and stochastic LIF neurons, see supplementary materials Table 2.
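The discrete-time updates behind Eqs. (3)-(5) can be sketched as follows. This is a simplified NumPy illustration; the function names and parameter values are illustrative (the values actually used are listed in supplementary Table 2), and the refractory period is ignored, as in our conversion.

```python
import numpy as np

def lif_step(v, input_current, v_rest=-65.0, v_thresh=-52.0, v_reset=-65.0,
             tau=100.0, dt=1.0, leak=True):
    """One Euler step of the (leaky) integrate-and-fire dynamics, Eqs. (3)-(4).

    input_current is the already-summed weighted input, sum_i W_i * Input_i.
    With leak=False this reduces to the plain IF neuron.
    """
    dv = input_current if not leak else (v_rest - v) + input_current
    v = v + (dt / tau) * dv
    spiked = v >= v_thresh
    v = np.where(spiked, v_reset, v)          # hard reset used by the IF/LIF neurons
    return v, spiked

def sub_if_reset(v, v_thresh=-52.0, v_reset=-65.0):
    """Subtractive reset of the SubIF neuron: keep the overshoot above threshold."""
    return v_reset + (v - v_thresh)

def escape_noise(v, v_thresh=-52.0, tau_sigma=1.0, beta_sigma=1.0):
    """Spike probability of the stochastic LIF neuron, Eq. (5)."""
    p = (1.0 / tau_sigma) * np.exp(beta_sigma * (v - v_thresh))
    return np.minimum(p, 1.0)

# Example: a stochastic LIF neuron can spike even below threshold.
v = np.array([-60.0, -55.0, -52.5])
spikes = np.random.rand(v.size) < escape_noise(v)
```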
The escape noise (σ) is described here: 153 converting ReLu nonlinearity to spiking network. 154 155 ( 1/τ exp(β (V − V )) if less than 1 156 σ = σ σ t threshold 4. Methods 157 1 otherwise (5) We trained each network using the DQN algorithm (Mnih 158 et al., 2015). We started by testing our methods on shal- 159 low ReLU-Ns with one hidden and then move on to where τ and β are constant positive parameters. For this 160 σ σ full-sized deep Q-networks with the same architecture as paper, we set both τ and β to 1. 161 σ σ Mnih et al.(2015). We trained the smaller networks using 162 For all the spiking models listed above, after every spike, a replay memory size of 200000 and initial replay memory 163 the neuron enters a refractory period during which they size of 50000. We trained the network over 30000 episodes. 164 Improve robustness of RL Policies using Spiking Neural Networks

The rest of the hyper-parameters are the same as in Mnih et al. (2015). For a complete list of the hyper-parameters used, see supplementary materials Table 1.

The trained ReLU-Ns are then converted to SNNs. For the converted SNN, the firing frequency of the spiking neurons in the output layer is proportional to the Q-value of the corresponding action.

We simulate spiking neurons using the PyTorch-based open source library BindsNET (Hazan et al., 2018). Testing SNN-based agents in the ALE is a computationally heavy task; we use BindsNET because it allows users to leverage GPUs to simulate the SNN and speed up testing.

4.1. Network architecture

Typically, the network used to train on Atari games with the DQN algorithm consists of multiple convolutional layers followed by fully connected layers (Mnih et al., 2015). However, to reduce the complexity of the network and the number of parameters, we choose a shallow fully connected network with one hidden layer for our initial experiments.

The ReLU-N consists of an 80x80 pixel input followed by a fully connected hidden layer with 1000 ReLU neurons. The output layer is a fully connected layer with 4 neurons that estimate the optimal action-value of each of the 4 possible actions in the Breakout game.

Figure 4. Network architecture: the input to the network consists of an 80x80 image produced by preprocessing the frames of the game. The hidden layer consists of 1000 neurons, followed by the output layer. The size of the output layer is equal to the number of possible actions for the game.

Figure 4 shows the network architecture of the shallow SNN. It is similar to the shallow ReLU-N, except that the neurons are replaced by spiking neurons and the ReLU nonlinearity is removed.
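For concreteness, the following is a minimal PyTorch sketch of the shallow ReLU-N described above (an 80x80 input flattened to 6400 values, one hidden layer of 1000 ReLU units, and 4 output Q-values). Class and variable names are illustrative and are not taken from our implementation.

```python
import torch
import torch.nn as nn

class ShallowQNetwork(nn.Module):
    """80x80 input -> 1000 ReLU hidden units -> 4 action-value outputs."""
    def __init__(self, n_actions: int = 4):
        super().__init__()
        self.hidden = nn.Linear(80 * 80, 1000)
        self.out = nn.Linear(1000, n_actions)

    def forward(self, x):
        x = x.view(x.size(0), -1)        # flatten the 80x80 preprocessed state
        h = torch.relu(self.hidden(x))
        return self.out(h)               # Q-value estimate for each action

q_net = ShallowQNetwork()
state = torch.rand(1, 80, 80)            # a preprocessed game state
q_values = q_net(state)                  # shape (1, 4)
```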
4.2. Experiments

The ReLU-N can be converted to an SNN by replacing the ReLU neurons with spiking neurons. However, this straightforward conversion usually results in very little spiking activity in the network. The network therefore needs to run for a large number of time steps on a given input to generate enough meaningful activity for a good estimate of the Q-values. To expedite the process and increase the spiking activity, the weights need to be scaled up. Generally, the weights of deeper layers need to be scaled more than the weights of shallower layers. We can treat the weight scales as parameters that need to be adjusted for a constant run-time per input. All the weights of the same layer are scaled by the same factor, thus preserving the learned filters. Many different methods can be used to search for the weight-scale parameters: while Rueckauer et al. (2017) showed one way of normalizing the weights, we also employed other methods such as grid search and particle swarm optimization (Clerc, 2012) to search for the optimal parameters. For our experiments, we run the SNN for 500 time steps on each input.
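The following NumPy sketch illustrates the conversion just described: the trained weights are reused, each layer is multiplied by its own scale factor, the input is rate-coded, and the output spike counts accumulated over a fixed number of time steps stand in for the Q-values. Our experiments use BindsNET rather than this toy simulator, and the scale values shown are placeholders for those found by grid search or particle swarm optimization.

```python
import numpy as np

def run_converted_snn(state, w_hidden, w_out, scales=(60.0, 80.0),
                      v_thresh=1.0, timesteps=500):
    """Rate-coded simulation of the converted shallow SNN.

    state:    flattened 80x80 input with values in [0, 1]
    w_hidden: (6400, 1000) weights copied from the trained ReLU-N
    w_out:    (1000, 4) weights copied from the trained ReLU-N
    scales:   per-layer weight-scale factors (placeholders; tuned per network)
    Returns the output spike counts, used in place of the Q-values.
    """
    v1 = np.zeros(w_hidden.shape[1])
    v2 = np.zeros(w_out.shape[1])
    out_spikes = np.zeros(w_out.shape[1])
    for _ in range(timesteps):
        # Bernoulli rate coding of the input pixels.
        in_spikes = (np.random.rand(state.size) < state).astype(float)
        v1 += scales[0] * (in_spikes @ w_hidden)
        s1 = v1 >= v_thresh
        v1[s1] = 0.0                                  # hard reset after a spike
        v2 += scales[1] * (s1.astype(float) @ w_out)
        s2 = v2 >= v_thresh
        v2[s2] = 0.0
        out_spikes += s2
    return out_spikes                                  # argmax gives the greedy action

# Hypothetical usage: action = int(np.argmax(run_converted_snn(state, w_hidden, w_out)))
```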

Binary input

The first part of our experiments uses binary pixel inputs. Each state consists of an 80x80 image of binary pixels; the frames from the Gym environment are pre-processed to create the state.

Each frame from the Gym environment is cropped to remove the text above the screen that displays the score and the number of lives left. The image is then re-sized to 80x80 and converted to a binary image. The previous frame is subtracted from the current frame while clamping all negative values to 0. We then add the most recent four such difference frames to create a state for the RL environment. Thus, a state is an 80x80 binary image containing the movement information of the last four frames. From this image, however, it is not possible to detect the direction of the movement. This, we believe, restricts the performance of the agent on the game.

Grayscale input

The binary input does not contain information about the direction of the ball's movement, which we believe can confuse the agent. To alleviate this problem, we weight each frame according to time and add them to create the state. The most recent frame has the highest weight, and the least recent frame has the lowest weight. At time t, the state is the sum of the most recent four frames:

S_t = F_t \cdot 1 + F_{t-1} \cdot 0.75 + F_{t-2} \cdot 0.5 + F_{t-3} \cdot 0.25    (6)

where S_t and F_t are the state and the frame at time t, respectively.
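A sketch of the two preprocessing schemes is given below, assuming OpenCV for resizing. The crop boundary and helper names are illustrative and may differ from the exact values we used.

```python
import numpy as np
import cv2

def to_binary_frame(rgb_frame):
    """Crop the score area, resize to 80x80, and binarize (crop row is illustrative)."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    cropped = gray[32:, :]                      # drop the score / lives text at the top
    small = cv2.resize(cropped, (80, 80))
    return (small > 0).astype(np.float32)

def binary_state(frames):
    """Sum of the last four clamped difference frames (binary input)."""
    diffs = [np.clip(frames[i] - frames[i - 1], 0, None) for i in range(-4, 0)]
    return np.clip(sum(diffs), 0, 1)

def grayscale_state(frames):
    """Time-weighted sum of the last four frames, Eq. (6)."""
    weights = (0.25, 0.5, 0.75, 1.0)            # oldest ... most recent
    return sum(w * f for w, f in zip(weights, frames[-4:]))
```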

Figure 5. Performance of the networks for binary and grayscale inputs: (a) 0.05 epsilon-greedy, binary input; (b) greedy, binary input; (c) 0.05 epsilon-greedy, grayscale input; (d) greedy, grayscale input. Each plot shows the reward distribution over 100 episodes.

Robustness

Recent work has shown that deep Q-networks are vulnerable to white-box and black-box adversarial attacks (Huang et al., 2017). Witty et al. (2018) also showed that policies learned by the DQN algorithm generalize poorly to states of the game that the agent has not seen during training. To test the robustness of the SNN against the ReLU-N, we measure the performance of each network when a 3-pixel-thick horizontal bar spanning the entire width of the input is occluded. The thickness of the occlusion bar corresponds to the thickness of the paddle on the screen after preprocessing. We tested the performance for every possible position of the bar on the screen; the position of the occlusion bar does not change during an episode. This is a challenging task, since the bar sometimes partially or wholly occludes the position of the ball or the paddle; however, it tests the robustness and generalization of the policies represented by both networks.
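The occlusion protocol can be summarized by the following sketch: for each possible vertical position, a 3-pixel-high bar is zeroed out in every state before it reaches the agent, and the reward statistics over 100 episodes are recorded. The env/agent interface shown here is hypothetical and only indicates the structure of the experiment.

```python
import numpy as np

def occlude(state, bar_top, bar_height=3):
    """Zero out a horizontal bar spanning the full width of the 80x80 state."""
    occluded = state.copy()
    occluded[bar_top:bar_top + bar_height, :] = 0.0
    return occluded

def occlusion_sweep(env, agent, n_positions=77, episodes=100):
    """Mean and std of reward for every possible position of the occlusion bar."""
    results = []
    for bar_top in range(n_positions):                 # 80 - 3 = 77 positions
        rewards = []
        for _ in range(episodes):
            state, total, done = env.reset(), 0.0, False
            while not done:
                action = agent.act(occlude(state, bar_top))   # 0.05 epsilon-greedy agent
                state, reward, done = env.step(action)        # hypothetical interface
                total += reward
            rewards.append(total)
        results.append((np.mean(rewards), np.std(rewards)))
    return results
```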

5. Results

The experiments below show the results of 100 episodes on two different inputs (binary or grayscale) using two policies (greedy and 0.05 epsilon-greedy). We tested the shallow SNN using LIF neurons and stochastic LIF neurons; we refer to the SNN using LIF neurons as SNN and to the SNN using stochastic LIF neurons as stochastic SNN.

Binary input

The results demonstrate that SNNs are capable of representing policies that perform even better than the ReLU-N they are converted from. Figure 5 shows the performance of the ReLU-N against the performance of the SNN and the stochastic SNN for binary input. We can see that the stochastic SNN performs better on average than the ReLU-N it is converted from. The optimal parameters for these binary-input spiking neural networks were found using grid search.

Grayscale input

The results show that the performance for the grayscale input is higher than for the binary input for both networks (SNN and ReLU-N), as shown in Figure 5. Table 1 summarizes the best performance for each input method and network. We also see that the standard deviation of the rewards gained by the SNN is lower and the behavior is less random than for the binary input, due to proper weight normalization. We employed particle swarm optimization (Clerc, 2012) to search for the optimal weight-scaling parameters. The scale of each of the two layers was treated as a parameter, so the dimension of the search space is 2; the swarm size was set to 13. The stochastic LIF network has a smoother performance surface over the parameter space than the LIF network, which suggests that the stochastic LIF network is more robust to changes in the scaling of its weights. The escape noise of the stochastic LIF neuron could be tuned to improve performance further; we leave that to future work.

Input                        ReLU-N         SNN            Stochastic SNN
0.05 epsilon-greedy policy
  Binary                     5.77 ± 3.07    6.21 ± 1.74    7.12 ± 2.47
  Grayscale                  6.55 ± 1.53    7.28 ± 1.79    7.5  ± 2.16
Greedy policy
  Binary                     6.0  ± 0       5.25 ± 2.13    7.58 ± 1.87
  Grayscale                  9.32 ± 0.63    10.05 ± 0.68   8.0  ± 2.37

Table 1. Best performance achieved for different inputs and networks. Each value represents an average over 100 episodes.
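As an illustration of the weight-scale search, the sketch below performs a minimal grid search over the two per-layer scale factors. The evaluate callback (which would build the converted SNN with the given scales and return its average reward over a fixed number of evaluation episodes) and the example ranges are placeholders; for the grayscale networks we used particle swarm optimization instead.

```python
import itertools
import numpy as np

def grid_search_scales(evaluate, hidden_scales, output_scales):
    """Return the (hidden, output) weight-scale pair with the best average reward.

    evaluate(h, o) is assumed to build the converted SNN with the two scale
    factors and return its mean reward over a fixed number of episodes.
    """
    best_score, best_pair = -np.inf, None
    for h, o in itertools.product(hidden_scales, output_scales):
        score = evaluate(h, o)
        if score > best_score:
            best_score, best_pair = score, (h, o)
    return best_pair, best_score

# Hypothetical usage with placeholder ranges:
# best, score = grid_search_scales(evaluate, np.linspace(10, 100, 10), np.linspace(10, 100, 10))
```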
Robustness

Figure 6. Performance of ReLU-N and SNN in the robustness test (pixel-wise robustness, ReLU-N vs. SNN). The x-axis represents the position of the bottom-most occluded pixels of the 3-pixel-thick horizontal occlusion bar; the y-axis represents the average reward. The standard deviation of the reward distribution is shown by the shaded region. The two critical areas are marked by the black bars A and B at the bottom of the plot: A marks the area near the paddle, while B marks the region of the screen occupied by the brick wall.

Figure 6 shows the performance of the ReLU-N and SNN on the robustness task. The x-axis represents the vertical position of the bottom-most occluded pixels; thus, as we move from left to right on the plot, the 3-pixel-thick occlusion bar moves from the bottom to the top of the screen. Figure 6 shows the results of 77 experiments, one for each possible position of the horizontal occlusion bar. Each experiment was run for 100 episodes using a 0.05 epsilon-greedy policy.

Our experiments on robustness show that SNNs are much more robust to occlusions of the input than ReLU-Ns, even though they share the same weights. The ReLU-N trained using backpropagation is very sensitive to occlusions and perturbations at a few places in the input; when these areas are occluded, the ReLU-N performs poorly. One such area is near the bottom of the screen (marked in Figure 6 by A). Occlusion in this area results in a drastic decrease in the performance of the ReLU-N. This is understandable, as this area contains the paddle and also shows the position of the ball just before it hits the paddle or falls below the screen. Surprisingly, occluding the neighborhood of area A has a much smaller negative impact on the performance of the SNN compared to the ReLU-N. Once the paddle is visible, the SNN shows no significant loss in performance.

Another sensitive area for the ReLU-N corresponds to the position of the brick wall, marked by B in Figure 6. We see that occluding some of the positions in this area results in a sharp drop in performance for the ReLU-N. This can be explained by the nature of the gradient descent updates: since the score changes when the ball hits the bricks, and the backpropagation loss calculated using the TD-error is highest when the score changes, the filters of the network learn to discriminate these areas. Thus, when these areas are occluded, the performance drops. Interestingly, these sudden drops in performance are not observed in the SNN. This suggests that the SNN is more robust to occlusions in the input than the ReLU-N it is converted from. We also see that the SNN performs better than the ReLU-N in most of the experiments and has a lower standard deviation in the reward distribution.

For a detailed list of results for all positions of the occlusion bar, see supplementary materials, Table 3.

6. Deep Q-networks

Figure 7. Performance of the deep Q-network vs. the deep spiking network (DQN vs. SNN). Each plot shows the reward distribution over 100 episodes using a 0.05 epsilon-greedy policy.

To demonstrate that our approach is applicable to state-of-the-art, large-scale networks, we trained the deep Q-network (Mnih et al., 2015) and converted the weights to an SNN with a similar network architecture (see Figure 3). Since converting the DQN to an SNN requires searching over a larger number of parameters, we used the parameter normalization method of Rueckauer et al. (2017). This approach shows reasonable performance, although it could clearly be improved with a systematic parameter optimization method. The deep Q-SNN was tested using subtractive-IF neurons. We used the OpenAI Baselines implementation of DQN to train the network (Dhariwal et al., 2017). We show that the DQN can be converted to a spiking Q-network without significant loss in performance; see Figure 7 for the full distribution of rewards for the two networks. At the present stage of the work, we did not conduct robustness tests for these networks; we leave a systematic robustness study and comparison to future work.
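The parameter normalization we used follows the activation-based scheme of Rueckauer et al. (2017). The sketch below shows the core idea on a stack of fully connected layers: each layer's weights are rescaled by the ratio of the previous layer's activation maximum to its own, so that activations map onto firing rates below the spiking neuron's saturation. This is a simplified illustration under our own naming, not the exact implementation applied to the convolutional DQN.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def normalize_layers(layers, sample_inputs, percentile=99.9):
    """Rescale each layer by the ratio of robust activation maxima (two passes)."""
    # Pass 1: record a robust activation maximum for every layer of the original net.
    # (For the final Q-value layer the ReLU would be skipped in practice.)
    lambdas, x = [], sample_inputs
    for layer in layers:
        x = torch.relu(layer(x))
        lambdas.append(torch.quantile(x.flatten(), percentile / 100.0).item())
    # Pass 2: w_l <- w_l * lambda_{l-1} / lambda_l and b_l <- b_l / lambda_l.
    prev = 1.0
    for layer, lam in zip(layers, lambdas):
        layer.weight.mul_(prev / lam)
        if layer.bias is not None:
            layer.bias.mul_(1.0 / lam)
        prev = lam
    return layers

# Hypothetical usage on a fully connected stack:
# normalize_layers([net.hidden, net.out], sample_states.view(sample_states.size(0), -1))
```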
In 378 neural networks with weights and activations constrained some cases, SNNs perform better than ReLU-N on previ- to+ 1 or-1. arXiv preprint arXiv:1602.02830, 2016. 379 ously unseen states. These results, combined with other 380 benefits of SNNs, such as energy efficiency on neuromor- Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, 381 phic hardware, make SNNs an ideal framework for rein- M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and 382 forcement learning tasks when resources are limited and the Zhokhov, P. Openai baselines. https://github. 383 environment is noisy. com/openai/baselines, 2017. 384 Improve robustness of RL Policies using Spiking Neural Networks

Diehl, P. and Cook, M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience, 9:99, 2015.

Diehl, P. U., Neil, D., Binas, J., Cook, M., Liu, S., and Pfeiffer, M. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, 2015.

Diehl, P. U., Pedroni, B. U., Cassidy, A., Merolla, P., Neftci, E., and Zarrella, G. TrueHappiness: Neuromorphic emotion recognition on TrueNorth. In 2016 International Joint Conference on Neural Networks (IJCNN), pp. 4278–4285, 2016.

Ferré, P., Mamalet, F., and Thorpe, S. J. Unsupervised feature learning with winner-takes-all based STDP. Frontiers in Computational Neuroscience, 12:24, 2018.

Gilra, A. and Gerstner, W. Non-linear motor control by local learning in spiking neural networks. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1773–1782. PMLR, 2018.

Hasselt, H. v., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094–2100. AAAI Press, 2016.

Hazan, H. and Manevitz, L. M. Topological constraints and robustness in liquid state machines. Expert Systems with Applications, 39(2):1597–1606, 2012.

Hazan, H., Saunders, D. J., Khan, H., Patel, D., Sanghavi, D. T., Siegelmann, H. T., and Kozma, R. BindsNET: A machine learning-oriented spiking neural networks library in Python. Frontiers in Neuroinformatics, 12:89, 2018.

Huang, S. H., Papernot, N., Goodfellow, I. J., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. CoRR, abs/1702.02284, 2017.

Huh, D. and Sejnowski, T. J. Gradient descent for spiking neural networks. In Advances in Neural Information Processing Systems 31, pp. 1440–1450. Curran Associates, Inc., 2018.

Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10:1659–1671, 1996.

Martí, D., Rigotti, M., Seok, M., and Fusi, S. Energy-efficient neuromorphic classifiers. Neural Computation, 28:2011–2044, 2016.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Pérez-Carrasco, J. A., Zhao, B., Serrano, C., Acha, B., Serrano-Gotarredona, T., Chen, S., and Linares-Barranco, B. Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing: Application to feedforward ConvNets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2706–2719, 2013.

Pfeiffer, M. and Pfeil, T. Deep learning with spiking neurons: Opportunities and challenges. Frontiers in Neuroscience, 12:774, 2018.

Richardson, M. J. E. and Gerstner, W. Statistics of subthreshold neuronal voltage fluctuations due to conductance-based synaptic shot noise. Chaos: An Interdisciplinary Journal of Nonlinear Science, 16(2):026106, 2006.

Rueckauer, B., Lungu, I.-A., Hu, Y., and Pfeiffer, M. Theory and tools for the conversion of analog to spiking convolutional neural networks. arXiv preprint arXiv:1612.04052, 2016.

Rueckauer, B., Lungu, I.-A., Hu, Y., Pfeiffer, M., and Liu, S.-C. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience, 11:682, 2017.

Schultz, W. Dopamine reward prediction-error signalling: a two-component response. Nature Reviews Neuroscience, 17:183–195, 2016.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Stein, R. B., Gossen, E. R., and Jones, K. E. Neuronal variability: noise or part of the signal? Nature Reviews Neuroscience, 6:389, 2005.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Tuckwell, H. C. Introduction to Theoretical Neurobiology, volume 2 of Cambridge Studies in Mathematical Biology. Cambridge University Press, 1988.

W. Gerstner, W. M. K. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.

Wang, J. X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J. Z., Hassabis, D., and Botvinick, M. M. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience, 21:1–9, 2018.

Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1995–2003. PMLR, 2016.

Witty, S., Lee, J. K., Tosch, E., Atrey, A., Littman, M., and Jensen, D. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868, 2018.

Wu, Y., Deng, L., Li, G., Zhu, J., and Shi, L. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12:331, 2018.