
Human-level control through deep reinforcement learning
Liia Butler

But first... A quote

"The question of whether machines can think... is about as relevant as the question of whether submarines can swim"

Edsger W. Dijkstra

Overview

1. Introduction

2. Reinforcement learning

3. Deep neural networks

4.

5. Algorithm breakdown

6. Evaluation and conclusions

Introduction

Deep Q-network (DQN)

- A reinforcement learning agent plus deep neural networks
- Goal: generality
  - How little do we have to know to be intelligent?
  - Can we solve a wide range of challenging tasks?
- Only pixels and the game score as input

Reinforcement Learning

- A theory of how software agents may optimize their control of an environment
- Inspired by psychological and neuroscientific perspectives on animal behavior
- One of the three types of machine learning

http://en.proft.me/media/science/ml_types.png

Space Invaders

Deep Neural Networks

- A type of artificial neural network architecture used in machine learning
- Artificial neural network: a network of highly connected nodes (processing elements) that work together on a specific problem, much like a biological nervous system
- Multiple layers of nodes, each a more abstract representation than the last
- Extract high-level representations from raw data
- DQN uses a "deep convolutional network"
- Input: an 84 x 84 x 4 image produced by the preprocessing map
- Three convolutional layers
- Two fully connected layers (see the sketch below)
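As a rough illustration of the layer structure listed above, here is a minimal PyTorch sketch (the framework choice is mine, not the paper's; filter counts, kernel sizes, and strides follow the Nature paper's description of the network):

```python
# Sketch of the DQN convolutional network: 4 stacked 84x84 frames in,
# one Q-value per action out. Illustrative only.
import torch
import torch.nn as nn

def build_dqn(n_actions: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
        nn.ReLU(),
        nn.Flatten(),                                # -> 3136 features
        nn.Linear(64 * 7 * 7, 512),                  # first fully connected layer
        nn.ReLU(),
        nn.Linear(512, n_actions),                   # output: one Q-value per action
    )

if __name__ == "__main__":
    net = build_dqn(n_actions=6)            # e.g. roughly the Space Invaders action set
    frames = torch.zeros(1, 4, 84, 84)      # a batch of one stacked, preprocessed state
    print(net(frames).shape)                # torch.Size([1, 6])
```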

http://www.nature.com/nature/journal/v518/n7540/images/nature14236-f4.jpg
http://www.nature.com/nature/journal/v518/n7540/carousel/nature14236-f1.jpg

Markov Decision Process

- State
- Action
- Reward
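For reference, the standard definitions behind these quantities in the paper's formulation (not shown on the slide) can be written out, with γ the reward discount factor and T the time-step at which the game terminates:

```latex
% Discounted return from time t
R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t} \, r_{t'}

% Optimal action-value function: best expected return after
% seeing state s and taking action a, over all policies \pi
Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s,\; a_t = a,\; \pi \right]
```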

http://cse-wiki.unl.edu/wiki/images/5/58/ReinforJpeg.jpg

What these mean for DQN

- State: what is going on? The goal was to be universal, so the state is represented by screen pixels
- Action: what can we do? E.g. moving in a direction, pressing buttons
- Reward: what is our motivation? Points, lives, etc.
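Concretely, one time-step of interaction can be packaged as a (state, action, reward) record. A minimal sketch, where the 4x84x84 state shape follows the paper's preprocessing and the field names are mine:

```python
# Illustrative only: one interaction step as DQN sees it.
from collections import namedtuple
import numpy as np

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

state      = np.zeros((4, 84, 84), dtype=np.uint8)   # last 4 preprocessed screens
next_state = np.zeros((4, 84, 84), dtype=np.uint8)

step = Transition(state=state,
                  action=2,        # e.g. the index of "move left" in the game's action set
                  reward=10.0,     # change in game score at this step
                  next_state=next_state,
                  done=False)      # did the episode (life/game) end?
print(step.action, step.reward)
```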

http://www.retrogamer.net/wp-content/uploads/2014/07/Top-10-Atari-Jaguar-Games-616x410.png

How is DQN going to do this?

- Preprocessing: reduce input dimensionality; take the maximum pixel value over consecutive frames to remove flickering (sketched in code after the key below)
- ε-greedy policy: how the action is chosen
- Bellman equation: optimal control of the environment via the action-value function
- A function approximator to estimate the action-value function, trained with the Q-learning gradient
- Experience replay: building a data set from the agent's experience

Algorithm Breakdown

Key:
D = replay memory (the data set of experiences)
N = number of experience tuples the replay memory holds
Q = the "quality" (action-value) function
θ = the weights of the Q-function approximator
M = number of episodes
s = sequence of observations (the state)
x = observation/image
Φ = preprocessing function applied to the sequence
T = time-step at which the game terminates
ε = exploration probability in the ε-greedy policy
a = action
y = target
r = reward
γ = reward discount factor
C = number of steps between updates of the target network (the target copy of Q is refreshed every C steps)
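A minimal sketch of the preprocessing bullet above, assuming raw 210x160 RGB Atari frames; it takes the per-pixel max over two consecutive frames to remove flickering, keeps the luminance, and does a crude nearest-neighbour resize to 84x84 (the paper rescales properly; this is only illustrative):

```python
# Hedged sketch of DQN-style preprocessing.
import numpy as np

def preprocess(frame: np.ndarray, prev_frame: np.ndarray) -> np.ndarray:
    """frame, prev_frame: raw (210, 160, 3) uint8 Atari screens."""
    # 1. Per-pixel max over the current and previous frame removes flickering
    #    caused by sprites that are drawn only on alternate frames.
    merged = np.maximum(frame, prev_frame).astype(np.float32)
    # 2. Keep only the luminance channel.
    lum = 0.299 * merged[..., 0] + 0.587 * merged[..., 1] + 0.114 * merged[..., 2]
    # 3. Nearest-neighbour downsample to 84x84 (crude stand-in for proper rescaling).
    rows = np.linspace(0, lum.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, lum.shape[1] - 1, 84).astype(int)
    return lum[np.ix_(rows, cols)].astype(np.uint8)

# The network input is the last 4 preprocessed screens stacked together:
frames = [np.zeros((210, 160, 3), dtype=np.uint8) for _ in range(5)]
state = np.stack([preprocess(frames[i + 1], frames[i]) for i in range(4)])
print(state.shape)  # (4, 84, 84)
```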

ε-greedy policy

How to choose the action 'a' at time 't'

- Exploration: with probability ε, pick a random action
- Exploitation: otherwise, pick the best action according to the Q value
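A minimal sketch of that choice; `q_values` stands in for the network's output Q(s, a) for each action, and the helper name is mine:

```python
# ε-greedy action selection: explore with probability ε, otherwise exploit
# by taking the action with the highest estimated Q-value.
import random
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # exploration: random action
    return int(np.argmax(q_values))              # exploitation: current best action

# Example with 6 actions; the paper anneals ε from 1.0 down to 0.1 during training.
action = epsilon_greedy(np.array([0.1, 0.5, -0.2, 0.0, 0.3, 0.05]), epsilon=0.1)
print(action)
```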

Experience Replay

1. Take an action
2. Store the transition in memory D
3. Sample a random minibatch of transitions from D
4. Optimize: perform a gradient descent step using the target 'y' and the Q-network
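A minimal sketch of the replay memory used in steps 2 and 3 (the class and field names are mine, not the paper's code): keep the last N transitions and sample a random minibatch for each update, which breaks the correlation between consecutive experiences.

```python
# Sketch of the replay memory D.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayMemory:
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)          # oldest experiences fall out

    def store(self, transition: Transition) -> None:  # step 2: store transition in D
        self.buffer.append(transition)

    def sample(self, batch_size: int):                # step 3: random minibatch from D
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=1_000_000)             # N = 1e6 in the paper
for t in range(64):
    memory.store(Transition(state=t, action=0, reward=0.0, next_state=t + 1, done=False))
minibatch = memory.sample(batch_size=32)              # fed to the gradient step (step 4)
print(len(minibatch))
```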

Optimizing the Q-Network

- The Bellman equation
- The loss function we get from it
- From this, the Q-learning gradient
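The slide names these three equations without reproducing them in text; the standard forms from the paper are reconstructed below, where θ_i are the network weights at iteration i and θ_i^- are the older weights used to compute the target y:

```latex
% Bellman optimality equation for the action-value function
Q^{*}(s, a) = \mathbb{E}_{s'}\!\left[ r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \right]

% Loss at iteration i, with (s, a, r, s') drawn uniformly from the replay memory D
L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\!\left[
    \Big( \underbrace{r + \gamma \max_{a'} Q(s', a'; \theta_i^{-})}_{\text{target } y}
          - Q(s, a; \theta_i) \Big)^{2} \right]

% Differentiating gives the Q-learning gradient used for the update
\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')}\!\left[
    \Big( r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \Big)
    \nabla_{\theta_i} Q(s, a; \theta_i) \right]
```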

Breakout!

Evaluation and Conclusions

- Agents vs. professional human game testers
- The agent acts at 10 Hz (an action every 0.1 seconds, i.e. every 6th frame)
- At 60 Hz (every 0.017 seconds, i.e. every frame), only 6 games showed more than 5% better performance
- Humans played under controlled conditions
- Out of the 49 games: 29 at human level or above, 20 below

http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f2.jpg 29 out of 49

20 out of 49

http://www.nature.com/nature/journal/v518/n7540/images_article/nature14236-f3.jpg

Questions and Discussion

- What do you think are some non-gaming applications of deep reinforcement learning?
- Do you think comparing against the "professional human game tester" is a sufficient evaluation? Is there a better way?
- Should we even pursue a general AI, or are we better off with domain-specific AIs?
- Are there other consequences besides it beating your high score? (Have we doomed society?)