Bridging and Creativity: Implementing Reinforcement Learning in Processing

Jieliang Luo (Media Arts & Technology) and Sam Green, University of California, Santa Barbara

SIGGRAPH Asia 2018, Tokyo, Japan, December 6th, 2018

Course Agenda

• What is Reinforcement Learning (10 mins) ▪ Introduce the core concepts of reinforcement learning

• A Brief Survey of Artworks in Machine Learning (5 mins)

• Why the Processing Community (5 mins)

• Q-Learning (35 mins) ▪ Explain a fundamental reinforcement learning algorithm

• Implementing Tabular Q-Learning in P5.js (40 mins) ▪ Discuss how to create a reinforcement learning environment ▪ Show how to implement the tabular Q-learning algorithm in P5.js

• Questions & Answers (5 mins)

What is Reinforcement Learning

• A branch of machine learning

• Learns through trial & error, rewards & punishment

• Draws from psychology, neuroscience, computer science, optimization

Reinforcement Learning Framework

[Diagram: the agent-environment loop. The agent sends actions to the environment; the environment returns observations and rewards to the agent.]
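This loop can be sketched in a few lines of JavaScript; the agent and env objects here are hypothetical placeholders, not code from the course:

// Minimal sketch of the agent-environment loop (hypothetical agent/env objects).
function runEpisode(agent, env) {
  let observation = env.reset();  // start a new episode
  let totalReward = 0;
  let done = false;
  while (!done) {
    const action = agent.act(observation);  // the agent chooses an action
    const result = env.step(action);        // the environment responds
    observation = result.observation;       // new observation for the agent
    totalReward += result.reward;           // reward signal
    done = result.done;                     // has the episode ended?
  }
  return totalReward;
}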

Video examples:
• https://www.youtube.com/watch?v=V1eYniJ0Rnk
• https://www.youtube.com/watch?v=ZhsEKTo7V04
• https://www.youtube.com/watch?v=XYoS68yJVmw

Google DeepDream

The same image before (left) and after (right) applying DeepDream. The network had been trained to perceive dogs.

Mordvintsev, et al. Inceptionism: Going Deeper into Neural Networks. 2015

Generative Adversarial Networks (GANs)

Portraits of Imaginary People, Mike Tyka

Goodfellow, et al. Generative Adversarial Nets. 2014

Recurrent Neural Networks (RNN), NIPS 2017 Creativity Art Gallery

Drawing Operations Unit: Generation 2, Sougwen Chung

Lipton, et al. A Critical Review of Recurrent Neural Networks for Sequence Learning. 2015

Visualizing Deep Neural Networks, NIPS 2017 Creativity Art Gallery

Decomposition of Human Portraits, Jieliang Luo & Sam Green

Zeiler, et al. Visualizing and Understanding Convolutional Networks. 2014

Reinforcement Learning, IEEE VIS 2018 Art Program

Autodesk Research, 2018

Sutton, et al. Reinforcement Learning: An Introduction. 1998

Challenges for Media Artists

1) Using current RL libraries requires an in-depth understanding of RL.


2) Designing RL training environments requires an in-depth understanding of RL concepts (observations, actions, reward functions).

Brockman, et al. OpenAI Gym. 2016

How does an agent learn what to do?

Secret: two slot machines, A and B, each pay out according to a hidden reward distribution.

Machine A: +10 (10%), +7 (25%), +3 (25%), +2 (25%), -5 (15%)
Machine B: +25 (5%), +10 (20%), +0 (50%), -5 (20%), -15 (5%)

Idea: Pull the lever many times and analyze the results
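A quick sketch of that idea in JavaScript; the distribution below is machine A's from the slide, while sampleMachine and the variable names are made up for illustration:

// Sketch: estimate a machine's value by pulling its lever many times.
const machineA = [
  { reward: 10, prob: 0.10 },
  { reward: 7,  prob: 0.25 },
  { reward: 3,  prob: 0.25 },
  { reward: 2,  prob: 0.25 },
  { reward: -5, prob: 0.15 },
];

// Draw one sample from the machine's secret reward distribution.
function sampleMachine(machine) {
  let r = Math.random();
  for (const outcome of machine) {
    if (r < outcome.prob) return outcome.reward;
    r -= outcome.prob;
  }
  return machine[machine.length - 1].reward; // guard against rounding error
}

// Pull the lever many times and average the results.
let total = 0;
const pulls = 100000;
for (let i = 0; i < pulls; i++) total += sampleMachine(machineA);
console.log(total / pulls); // approaches the expected value, about 3.25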

[Histogram: frequency of each sampled reward from machine A (+7, +3, +2, -5).]

Use the average result to approximate the value

Time step   Reward     Average
t = 1       r1 = 7     7/1 = 7
t = 2       r2 = -5    (7-5)/2 = 1
t = 3       r3 = 7     (7-5+7)/3 = 3
t = 4       r4 = 7     (7-5+7+7)/4 = 4
…           …          … → 3.25

Approximate expected values for both machines

Machine A: average value 3.25        Machine B: average value 1.5

How to efficiently track the average?

The naive approach must recalculate the numerator at every time step. What if there are 100,000 steps?!

Use the moving average technique for efficiency

updated estimate = current estimate + step size × (sampled reward − current estimate)

The difference (sampled reward − current estimate) is the error, and the step size satisfies 0 < step size < 1.
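A sketch of that update in JavaScript; with a step size of 1/n it reproduces the exact running average from the table, while a constant step size between 0 and 1 weights recent rewards more heavily:

// Sketch: track the estimate incrementally, with no sum over past rewards.
let estimate = 0;  // current estimate
let n = 0;         // number of samples seen so far

function updateEstimate(sampledReward) {
  n += 1;
  const error = sampledReward - estimate;  // error term
  estimate += (1 / n) * error;             // step size 1/n gives the exact average
  return estimate;
}

[7, -5, 7, 7].forEach(updateEstimate);
console.log(estimate); // 4, matching the table above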

Markov Decision Process

• Framework in which an agent attempts to extract rewards from an environment.

• Maximum reward collection is achieved by taking many good actions.

• Sacrificial actions may pay off.

Example MDP: Cat in a building

Environment

[Slide: a cat (the agent) in a building whose rooms contain potential "rewards".]

MDP viewed as graph

[Slide: the same environment drawn as a graph. States 0-6 are nodes, the actions L, R, and F are edges between them, and some states are marked Terminal; reaching one resets the episode. The agent is shown at state 0.]

Rewards given at transition

[Slide: taking action F from state 0 moves the agent to state 2 and pays a reward of +1 at the transition.]

Rewards may be stochastic

[Slide: the agent takes the same action F again, but this time the transition pays +0.]

Goal is to collect maximum reward

[Slides: one trajectory through the graph collects +0, then +10, then -3 at the terminal state, for a total reward of 7.]

Different paths lead to different total rewards

[Slides: another trajectory collects +3, then +11, then -2 at the terminal state, for a total reward of 12.]

How to assign values to state-action pairs?

Agent can learn this by experience

Calculate "rewards to go" from each state-action combination.

[Slides: an example episode. From state 0 the agent takes action L and receives +2, from state 1 it takes action R and receives -5, and from state 2 it takes action F and receives +1, reaching the terminal state.]

Value of action in state:
State 0, action L: 2 - 5 + 1 = -2
State 1, action R: -5 + 1 = -4
State 2, action F: +1

Repeated experience leads to better estimates

[Slide: a second episode along the same path receives +1, -6, and +2.]

Value of action in state:
State 0, action L: 1 - 6 + 2 = -3
State 1, action R: -6 + 2 = -4
State 2, action F: +2

Use averaging to track values over multiple episodes

Averaging the returns from Episode 1 and Episode 2:
State 0, action L: (-2 + -3) / 2 = -2.5
State 1, action R: (-4 - 4) / 2 = -4
State 2, action F: (1 + 2) / 2 = 1.5
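A sketch of this procedure in JavaScript; the two episodes are the ones above, and the bookkeeping (sums, counts, string keys) is an illustrative choice, not the course's code:

// Sketch: compute "rewards to go" for each step of an episode and
// average them across episodes to estimate state-action values.
const episodes = [
  [ { s: 0, a: "L", r: 2 }, { s: 1, a: "R", r: -5 }, { s: 2, a: "F", r: 1 } ],
  [ { s: 0, a: "L", r: 1 }, { s: 1, a: "R", r: -6 }, { s: 2, a: "F", r: 2 } ],
];

const sums = {};    // total return observed for each state-action pair
const counts = {};  // number of visits to each state-action pair

for (const episode of episodes) {
  let rewardToGo = 0;
  // Walk backwards so each step's return is its reward plus everything after it.
  for (let i = episode.length - 1; i >= 0; i--) {
    const { s, a, r } = episode[i];
    rewardToGo += r;
    const key = s + "," + a;
    sums[key] = (sums[key] || 0) + rewardToGo;
    counts[key] = (counts[key] || 0) + 1;
  }
}

for (const key in sums) {
  console.log(key, sums[key] / counts[key]); // e.g. "0,L" -> -2.5, "2,F" -> 1.5
}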

Sample many episodes to learn all Q-values

[Table: the estimated Q-value for each state-action pair, e.g. Q(0, L) = -1.0, Q(0, R) = 4.25, Q(0, F) = 3.0.]

After trying all possible paths many times, the Q-values will converge.

Use Q-values to determine actions to take (policy)

The trained policy tells the agent which action to take: in each state, choose the action with the highest Q-value.

Exploration vs. Exploitation

• The current approach requires exhaustive exploration to discover the Q-values ▪ Not feasible for a large number of states.

• The ε-greedy approach balances learning more about known good paths and discovering novel ones.

• ε-greedy: choose a random action (explore) with probability ε, choose the greedy action (exploit) with probability 1 − ε.

• Often ε is set to .05.
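A sketch of ε-greedy action selection in JavaScript, assuming the Q-table is stored as qTable[state][action] with numerically indexed actions (an illustrative layout, not the course's):

// Sketch: epsilon-greedy action selection over a Q-table.
const epsilon = 0.05;

function chooseAction(state, qTable) {
  const values = qTable[state];
  if (Math.random() <= epsilon) {
    // Explore: pick a random action.
    return Math.floor(Math.random() * values.length);
  }
  // Exploit: pick the greedy action, argmax over Q(state, a).
  return values.indexOf(Math.max(...values));
}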

ε-greedy example (ε = .05)

[Table: the Q-table with one row per state and columns for actions L, R, F, all entries set to 0.]

Set all Q-values to 0. Set ε, e.g. ε = .05.

Start an episode. Generate a random number, for example rnd = .02.

If rnd <= ε then pick a random action, else pick argmax_a Q(0, a).

.02 < .05, so pick a random action, e.g. F. That moves the cat to state 2. Generate the next random number…

[Table: after the episode, the Q-table holds Q(0, F) = -1; all other entries are still 0.]

Calculate returns after the episode ends. Start a new episode and repeat.

Start an episode. Generate a random number, for example rnd = .21.

If rnd <= ε then pick a random action, else pick argmax_a Q(0, a).

.21 > .05, so pick argmax_a Q(0, a) = L. That moves the cat to state 1. Generate the next random number…

Discounting future rewards can improve performance

• Often it is useful to value rewards sooner rather than later.

• E.g. You have a choice between $100k now or $1M in 50 years?

• The $1M is discounted by you because of the uncertainty of the future.

• The discount factor γ is multiplied by future rewards during the Q-value calculation.

• The discount compounds exponentially: a reward k steps in the future is multiplied by γ^k.

Reward to go (aka return) = r1 + r2 + r3 + …
Discounted return = r1 + γ·r2 + γ²·r3 + …

Discounting example
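As a small worked sketch of discounting (the reward sequence reuses the samples from the earlier table; the helper name is made up):

// Sketch: compute the discounted return of a reward sequence.
function discountedReturn(rewards, gamma) {
  let g = 0;
  // Walk backwards: G = r1 + gamma*r2 + gamma^2*r3 + ...
  for (let i = rewards.length - 1; i >= 0; i--) {
    g = rewards[i] + gamma * g;
  }
  return g;
}

console.log(discountedReturn([7, -5, 7, 7], 1.0)); // undiscounted return: 16
console.log(discountedReturn([7, -5, 7, 7], 0.9)); // discounted return: about 13.27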

The Tabular Q-Learning algorithm combines these tricks (moving-average updates, ε-greedy exploration, and discounting):

Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a))

Q-Learning in P5.js

• Create an RL environment
• Implement a Tabular Q-Learning algorithm

Grid in Processing, Luo & Green

Reinforcement Learning Environments

• CartPole, OpenAI Gym
• Tennis, Unity ML
• Collect Box in Unreal, Luo & Green
• KUKA Grasp, pyBullet

Core Concepts in RL Environments

• Observations (States) & Actions

• Starting State & Terminal State

• Step Function ▪ Observation, Reward, Terminal Status, Other Info = step(Action)

• Reward Function ▪ Reward = reward(Observation, Action)

Observations in RL Environments

CartPole, OpenAI Gym
  Observation            Min     Max
  Cart Position          -4.8    4.8
  Cart Velocity          -Inf    Inf
  Pole Angle             -24°    24°
  Pole Velocity At Tip   -Inf    Inf

Grid in Processing, Luo & Green
  Observation      Min          Max
  Grey Position    Position 0   Position N

Actions in RL Environments

CartPole, OpenAI Gym
  Num   Action
  0     Push cart to the left
  1     Push cart to the right

Grid in Processing, Luo & Green
  Num   Action
  0     Push square to the left
  1     Push square to the right

Starting State & Terminal State

Starting State

CartPole: all observations are assigned a uniform random value between ±0.05.

Grid in Processing: the agent starts at state 0.


Terminal State

CartPole:
• Pole Angle is more than ±12°
• Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
• Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials

Grid in Processing:
• The agent reaches state N

Reward Function

CartPole: reward is 1 for every step taken, including the termination step.

Grid in Processing: reward is 0 for every step taken, excluding the termination step; reward is 1 for the termination step.

Step Function

Grid in Processing

observation, reward, done, info = step_function(action)

Observation: any state between 0 and N

Reward: 0 or 1

Done: true or false

Info: none
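A sketch of what one call to the step function could look like in P5.js; the grid object and the returned fields mirror the description above, but the exact return shape (a plain object) is an assumption:

// Sketch: one interaction with the grid environment.
const action = grid.sample();      // randomly select an action (0 or 1)
const result = grid.step(action);  // advance the environment by one step

console.log(result.observation);   // a state between 0 and N
console.log(result.reward);        // 0 or 1
console.log(result.done);          // true once the terminal state is reached
console.log(result.info);          // none / null for this environment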

Basic Structure of an RL Env in P5.js

class Grid{

Grid( ){ } – initialize parameters

reset( ){ } – reset agent

step( ){ } – agent interacts with the environment

reward( ){ } – compute the reward for the current step

render( ){ } – visualize the environment

sample( ){ } – randomly select an action

}
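A minimal sketch of how this skeleton might be filled in for the one-dimensional grid; the state layout, reward scheme, and return shape follow the earlier slides, while every implementation detail below is an illustrative assumption rather than the course's actual code:

// Sketch: a minimal Grid environment following the skeleton above.
class Grid {
  constructor(n) {          // initialize parameters
    this.n = n;             // index of the terminal state
    this.state = 0;
  }

  reset() {                 // reset agent to the starting state (state 0)
    this.state = 0;
    return this.state;
  }

  reward(observation, action) {  // reward function: 1 only at the terminal step
    return observation === this.n ? 1 : 0;
  }

  step(action) {            // agent interacts with the environment
    if (action === 1) this.state = Math.min(this.n, this.state + 1);  // push right
    else this.state = Math.max(0, this.state - 1);                    // push left
    return {
      observation: this.state,
      reward: this.reward(this.state, action),
      done: this.state === this.n,
      info: null,
    };
  }

  render() {                // visualize the environment with p5.js drawing calls
    background(220);
    const w = width / (this.n + 1);
    for (let i = 0; i <= this.n; i++) {
      fill(i === this.state ? 100 : 255);  // the grey square marks the agent
      rect(i * w, height / 2 - w / 2, w, w);
    }
  }

  sample() {                // randomly select an action: 0 (left) or 1 (right)
    return Math.floor(Math.random() * 2);
  }
}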

Live Coding Session I

• Set up the starter code in P5.js ▪ P5js.org ▪ rodgerluo.com/rl_p5.html

• Test the grid env

Implementation of Tabular Q-Learning


q_table = build_q_table(N_STATES, ACTIONS)
while cur_episode < max_episodes:
    state = env.reset()
    while not done:
        action = chooseAction(state, q_table)
        reward, new_state = env.step(action)
        state = new_state
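A sketch of how this loop might look in JavaScript with the Q-learning update filled in (Live Coding Session II implements the formula itself); it assumes the Grid class sketched earlier, and all constants and helper names are illustrative:

// Sketch: tabular Q-learning training loop.
const N_STATES = 6;        // index of the terminal state
const ACTIONS = [0, 1];    // 0 = left, 1 = right
const alpha = 0.1;         // step size
const gamma = 0.9;         // discount factor
const epsilon = 0.05;      // exploration rate
const maxEpisodes = 100;

// Build a Q-table of zeros: qTable[state][action].
function buildQTable(nStates, actions) {
  const table = [];
  for (let s = 0; s <= nStates; s++) table.push(actions.map(() => 0));
  return table;
}

// Epsilon-greedy action selection.
function chooseAction(state, qTable) {
  if (Math.random() <= epsilon) return Math.floor(Math.random() * ACTIONS.length);
  return qTable[state].indexOf(Math.max(...qTable[state]));
}

const env = new Grid(N_STATES);
const qTable = buildQTable(N_STATES, ACTIONS);

for (let episode = 0; episode < maxEpisodes; episode++) {
  let state = env.reset();
  let done = false;
  while (!done) {
    const action = chooseAction(state, qTable);
    const { observation: newState, reward, done: isDone } = env.step(action);
    // Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    const target = reward + gamma * Math.max(...qTable[newState]);
    qTable[state][action] += alpha * (target - qTable[state][action]);
    state = newState;
    done = isDone;
  }
}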

Live Coding Session II

• Implement the Q-Learning formula

What's Next

• A reinforcement learning library for p5.js ▪ Tabular Q-Learning ▪ Policy Gradient ▪ Actor-Critic

• A collection of training environments

• Collaborations with artists

Andreas Schlegel, 2018, with contributions by Ong Kian Peng, Chong Ming En, Deon Chan Chee Kian, Soh Kheng Jin Eugene

Thank you!

Jieliang (Rodger) Luo: [email protected]

Sam Green: [email protected]

Please help us by completing the questionnaire! rodgerluo.com/rl_p5.html
