Bridging Reinforcement Learning and Creativity: Implementing Reinforcement Learning in Processing
Jieliang Luo, Media Arts & Technology
Sam Green, Computer Science
University of California, Santa Barbara
SIGGRAPH Asia 2018, Tokyo, Japan, December 6th, 2018

Course Agenda
• What is Reinforcement Learning (10 mins)
  ▪ Introduce the core concepts of reinforcement learning
• A Brief Survey of Artworks in Deep Learning (5 mins)
• Why Processing Community (5 mins)
• A Q-Learning Algorithm (35 mins)
  ▪ Explain a fundamental reinforcement learning algorithm
• Implementing Tabular Q-Learning in P5.js (40 mins)
  ▪ Discuss how to create a reinforcement learning environment
  ▪ Show how to implement the tabular Q-learning algorithm in P5.js
• Questions & Answers (5 mins)
What is Reinforcement Learning
• Branch of machine learning
• Learns through trial & error, rewards & punishment
• Draws from psychology, neuroscience, computer science, optimization
Reinforcement Learning Framework
[Diagram: the agent-environment loop. The agent acts on the environment; the environment returns observations and rewards.]
Example videos:
• https://www.youtube.com/watch?v=V1eYniJ0Rnk
• https://www.youtube.com/watch?v=ZhsEKTo7V04
• https://www.youtube.com/watch?v=XYoS68yJVmw

Google DeepDream
The same image before (left) and after (right) applying DeepDream. The network had been trained to perceive dogs.
Mordvintsev, et al. Inceptionism: Going Deeper into Neural Networks. 2015.

Generative Adversarial Networks (GANs)
Portraits of Imaginary People, Mike Tyka
Goodfellow, et al. Generative Adversarial Nets. 2014.

NIPS 2017 Creativity Art Gallery

Recurrent Neural Network (RNN)
Drawing Operations Unit: Generation 2, Sougwen Chung
Lipton, et al. A Critical Review of Recurrent Neural Networks for Sequence Learning. 2015.

NIPS 2017 Creativity Art Gallery

Visualizing Deep Neural Network
Decomposition of Human Portraits, Jieliang Luo & Sam Green
Zeiler, et al. Visualizing and Understanding Convolutional Networks. 2014.

IEEE VIS 2018 Art Program

Reinforcement Learning
Autodesk Research, 2018
Sutton, et al. Reinforcement Learning: An Introduction. 1998.

Challenges for Media Artists
1) Using current RL libraries requires an in-depth understanding of RL algorithms.
2) Designing RL training environments requires an in-depth understanding of RL concepts (observations, actions, reward functions).
Brockman, et al. OpenAI Gym. 2016.

How does an agent learn what to do?
Two slot machines with secret payout distributions:

Machine A: 10% → +10, 25% → +7, 25% → +3, 25% → +2, 15% → −5
Machine B: 5% → +25, 20% → +10, 50% → +0, 20% → −5, 5% → −15

Idea: Pull the lever many times and analyze the results
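From the machine's side, the value the agent is trying to discover is just the probability-weighted sum of rewards. A quick sketch of that check (the distributions are as read off the slide; `expectedValue` is an illustrative helper name):

```javascript
// Payout distributions from the slide: [probability, reward] pairs.
const machineA = [[0.10, 10], [0.25, 7], [0.25, 3], [0.25, 2], [0.15, -5]];
const machineB = [[0.05, 25], [0.20, 10], [0.50, 0], [0.20, -5], [0.05, -15]];

// Expected value: sum of probability-weighted rewards.
function expectedValue(machine) {
  return machine.reduce((sum, [p, r]) => sum + p * r, 0);
}

console.log(expectedValue(machineA)); // ≈ 3.25
console.log(expectedValue(machineB)); // ≈ 1.5
```

The agent cannot see these distributions, which is why it must estimate the same quantities by sampling.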
[Figure: frequency histogram of machine A's sampled rewards: +7, +3, +2, −5.]

Use the average result to approximate the value
Time step | Reward  | Average
t = 1     | r1 = 7  | 7/1 = 7
t = 2     | r2 = −5 | (7−5)/2 = 1
t = 3     | r3 = 7  | (7−5+7)/3 = 3
t = 4     | r4 = 7  | (7−5+7+7)/4 = 4
…         | …       | … → 3.25

Approximate expected values for both machines
Machine A: average value 3.25
Machine B: average value 1.5

How to efficiently track the average?
Time step | Reward  | Average
t = 1     | r1 = 7  | 7/1 = 7
t = 2     | r2 = −5 | (7−5)/2 = 1
t = 3     | r3 = 7  | (7−5+7)/3 = 3
t = 4     | r4 = 7  | (7−5+7+7)/4 = 4

Problem: the numerator must be recalculated at each time step. What if there are 100,000 steps?!

Use Moving Average technique for efficiency
updated estimate = current estimate + step size × (sampled reward − current estimate)

where (sampled reward − current estimate) is the error, and 0 < step size < 1.
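The moving-average update can be sketched in a few lines of JavaScript (a minimal sketch; `updateEstimate` is an illustrative name, and with step size 1/k the update reproduces the plain sample average):

```javascript
// Incremental (moving) average: no need to re-sum all rewards each step.
// estimate <- estimate + stepSize * (reward - estimate)
function updateEstimate(estimate, reward, stepSize) {
  return estimate + stepSize * (reward - estimate);
}

// With stepSize = 1/k this tracks the sample average from the table:
let q = 0;
const rewards = [7, -5, 7, 7]; // samples from machine A, as on the slide
rewards.forEach((r, i) => {
  q = updateEstimate(q, r, 1 / (i + 1));
});
console.log(q); // ≈ 4, same as (7 - 5 + 7 + 7) / 4
```

A small constant step size (e.g. 0.1) instead of 1/k weights recent rewards more heavily, which is useful when the environment changes over time.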
Markov Decision Process
• A framework in which an agent attempts to extract rewards from an environment.
• Maximum reward collection is achieved by taking many good actions.
• Sacrificial actions may pay off.

Example MDP: Cat in building
[Figure: the environment is a building; the cat agent moves through rooms containing potential "rewards". The same MDP viewed as a graph of numbered states connected by actions L, R, and F, with a Reset transition and a Terminal state.]

Rewards are given at each transition.

Rewards may be stochastic: a transition that pays +1 on one visit may pay +0 on another.

Goal is to collect maximum reward
Example episode: the agent collects +0, then +10 (total 10), then −3 at the terminal state, ending with a total reward of 7.

Different paths lead to different total rewards

Another path: +3, then +11 (total 14), then −2 at the terminal state, ending with a total reward of 12.

How to assign values to state-action pairs?

Agent can learn this by experience

Calculate "rewards to go" from each state-action combination.
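The "rewards to go" computation walks an episode backwards, summing each reward with everything that follows it. A minimal sketch (`rewardsToGo` is an illustrative helper name):

```javascript
// "Rewards to go": for each step of an episode, the sum of that step's
// reward plus all rewards that follow it until the episode ends.
function rewardsToGo(rewards) {
  const returns = new Array(rewards.length);
  let runningSum = 0;
  // Walk the episode backwards, accumulating the tail sum.
  for (let t = rewards.length - 1; t >= 0; t--) {
    runningSum += rewards[t];
    returns[t] = runningSum;
  }
  return returns;
}

// Episode from the slides: +2 (state 0, L), -5 (state 1, R), +1 (state 2, F)
console.log(rewardsToGo([2, -5, 1])); // [-2, -4, 1]
```

The backwards pass makes the computation linear in episode length, instead of re-summing the tail for every step.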
Calculate "rewards to go" from each visited state. For an episode with rewards +2 (action L in state 0), −5 (action R in state 1), and +1 (action F in state 2, reaching the terminal state):

Value of action in state:
State 0, action L: 2 − 5 + 1 = −2
State 1, action R: −5 + 1 = −4
State 2, action F: +1

Repeated experience leads to better estimates

A second episode along the same path, with rewards +1, −6, and +2:

State 0, action L: 1 − 6 + 2 = −3
State 1, action R: −6 + 2 = −4
State 2, action F: +2

Use averaging to track values over multiple episodes

Episode 1: State 0, action L: −2; State 1, action R: −4; State 2, action F: +1
Episode 2: State 0, action L: −3; State 1, action R: −4; State 2, action F: +2
Averages: State 0, action L: (−2 − 3)/2 = −2.5; State 1, action R: (−4 − 4)/2 = −4; State 2, action F: (1 + 2)/2 = 1.5

Sample many episodes to learn all Q-values
State | L    | R    | F
0     | −1.0 | 4.25 | 3.0
1     | 3.0  | 2.5  |
2     | −1.0 | −5.0 |
3     | 2.0  |      |
4     | 3.5  |      |
5     | −1   |      |

After trying all possible paths many times, the Q-values will converge.

Use Q-values to determine actions to take (policy)

The trained policy tells the agent which action to take: in each state, choose the action with the highest Q-value.

Exploration vs. Exploitation
• The current approach requires exhaustive exploration to discover Q-values.
  ▪ Not feasible for a large number of states.
• The ε-greedy approach balances learning more about known good paths and discovering novel approaches.
• ε-greedy: choose a random action (explore) with probability ε; choose the greedy action (exploit) with probability 1 − ε.
• Often ε is set to 0.05.

ε-greedy example (ε = 0.05)
1) Set all Q-values to 0. Set ε, e.g. ε = 0.05.
2) Start an episode. Generate a random number, for example rnd = 0.02. If rnd ≤ ε then pick a random action, else pick argmax_a Q(0, a).
3) Here 0.02 < 0.05, so pick a random action, e.g. F. That moves the cat to state 2. Generate the next random number, and so on.
4) Calculate returns after the episode ends; in this example Q(0, F) becomes −1. Start a new episode and repeat.
5) In the new episode, rnd = 0.21. Since 0.21 > 0.05, pick argmax_a Q(0, a) = L. That moves the cat to state 1. Continue generating a random number at each step.

Discounting future rewards can improve performance

• Often it is useful to value rewards sooner rather than later.
• E.g., would you rather have $100k now or $1M in 50 years?
• The $1M is discounted by you because of the uncertainty of the future.
• The discounting factor γ is multiplied by future rewards during the Q-value calculation.
• The discount is applied exponentially: a reward k steps in the future is weighted by γ^k.

Reward to go (aka return): G = r1 + r2 + r3 + …
Discounted return: G = r1 + γ·r2 + γ²·r3 + …

The Tabular Q-Learning algorithm combines these tricks:
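Putting the moving average, ε-greedy exploration, and discounting together gives the standard tabular Q-learning update rule (Watkins' Q-learning; α is the step size and γ the discount factor):

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
```

The bracketed term is the same "error" as in the moving-average update, with the sampled reward replaced by the discounted one-step target.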
Q-Learning in P5.js
• Create an RL environment
• Implement a Tabular Q-Learning algorithm
Grid in Processing, Luo & Green
Reinforcement Learning Environments
• CartPole, OpenAI Gym
• Tennis, Unity ML
• Collect Box in Unreal, Luo & Green
• KUKA Grasp, pyBullet
Core Concepts in RL Environments
• Observations (States) & Actions
• Starting State & Terminal State
• Step Function
  ▪ Observation, Reward, Terminal Status, Other Info = step(Action)
• Reward Function
  ▪ Reward = reward(Observation, Action)
Observations in RL Environments
CartPole, OpenAI Gym:

Observation          | Min  | Max
Cart Position        | −4.8 | 4.8
Cart Velocity        | −Inf | Inf
Pole Angle           | −24° | 24°
Pole Velocity At Tip | −Inf | Inf

Grid in Processing, Luo & Green:

Observation   | Min        | Max
Grey Position | Position 0 | Position N
Actions in RL Environments
CartPole:

Num | Action
0   | Push cart to the left
1   | Push cart to the right

Grid in Processing:

Num | Action
0   | Push square to the left
1   | Push square to the right
Starting State & Terminal State
Starting State

CartPole: all observations are assigned a uniform random value between ±0.05.
Grid in Processing: at state 0.
Terminal State

CartPole:
• Pole Angle is more than ±12°
• Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
• Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.

Grid in Processing:
• At state N
Reward Function
CartPole: reward is 1 for every step taken, including the termination step.
Grid in Processing: reward is 0 for every step taken, excluding the termination step; reward is 1 for the termination step.
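The Grid reward function is simple enough to state directly in code (a sketch; `gridReward` and its parameters are hypothetical names, where N is the terminal state index):

```javascript
// Grid reward: 0 for every step, 1 only on the termination step
// (when the agent reaches state N).
function gridReward(observation, N) {
  return observation === N ? 1 : 0;
}

console.log(gridReward(5, 5)); // 1 (termination step)
console.log(gridReward(2, 5)); // 0 (any other step)
```

This sparse reward is what makes the grid a useful teaching environment: the agent sees no feedback until it reaches the goal.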
Step Function
Grid in Processing
observation, reward, done, info = step_function(action)
Observation: any state between 0 and N
Reward: 0 or 1
Done: true or false
Info: none
Basic Structure of an RL Env in P5.js
class Grid {
  constructor() { }  // initialize parameters
  reset() { }        // reset the agent
  step() { }         // agent interacts with the environment
  reward() { }       // compute the reward
  render() { }       // visualize the environment
  sample() { }       // randomly select an action
}
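A minimal working version of this skeleton might look like the following (a sketch, not the course's actual code: a 1-D grid of positions 0..n, actions 0 = left and 1 = right, matching the step and reward functions described above):

```javascript
// Minimal 1-D grid environment following the skeleton above.
class Grid {
  constructor(n) {       // initialize parameters
    this.n = n;          // index of the terminal (goal) state
    this.state = 0;      // agent position
    this.done = false;
  }
  reset() {              // reset the agent to the starting state
    this.state = 0;
    this.done = false;
    return this.state;
  }
  reward(observation) {  // 1 only on the termination step
    return observation === this.n ? 1 : 0;
  }
  step(action) {         // agent interacts with the environment
    const move = action === 1 ? 1 : -1;  // 0 = left, 1 = right
    this.state = Math.min(this.n, Math.max(0, this.state + move));
    this.done = this.state === this.n;
    // observation, reward, done, info
    return [this.state, this.reward(this.state), this.done, null];
  }
  render() { }           // p5.js drawing calls (rect, fill, ...) would go here
  sample() {             // randomly select an action
    return Math.floor(Math.random() * 2);
  }
}
```

In the p5.js version, `render()` would redraw the grid each frame; it is left empty here so the sketch also runs outside the browser.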
Live Coding Session I
• Set up starter code in P5.js
  ▪ P5js.org
  ▪ rodgerluo.com/rl_p5.html
• Test the grid env
Implementation of Tabular Q-Learning
q_table = build_q_table(N_STATES, ACTIONS)
while cur_episode < max_episodes:
    state = env.reset()
    while not done:
        action = chooseAction(state, q_table)
        reward, new_state = env.step(action)
        state = new_state
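Filling in the pseudocode above with the Q-learning update, a complete JavaScript sketch might look like this. It is a hedged illustration, not the course's exact code: the environment is the minimal 1-D grid from earlier, and all names and constants are illustrative.

```javascript
// Tabular Q-learning on a minimal 1-D grid (illustrative sketch).
const N_STATES = 5;       // states 0..5; state 5 is the terminal goal
const ACTIONS = 2;        // 0 = left, 1 = right
const ALPHA = 0.1;        // step size
const GAMMA = 0.9;        // discount factor
const EPSILON = 0.05;     // exploration rate
const MAX_EPISODES = 500;

// Environment step: next state, its reward (1 only at the goal), done flag.
function step(state, action) {
  const next = Math.min(N_STATES, Math.max(0, state + (action === 1 ? 1 : -1)));
  return { state: next, reward: next === N_STATES ? 1 : 0, done: next === N_STATES };
}

function buildQTable(nStates, nActions) {
  return Array.from({ length: nStates + 1 }, () => new Array(nActions).fill(0));
}

// Argmax with random tie-breaking, so the untrained agent still moves around.
function argmax(row) {
  const best = Math.max(...row);
  const candidates = row.flatMap((v, i) => (v === best ? [i] : []));
  return candidates[Math.floor(Math.random() * candidates.length)];
}

// ε-greedy action selection.
function chooseAction(state, qTable) {
  if (Math.random() <= EPSILON) return Math.floor(Math.random() * ACTIONS);
  return argmax(qTable[state]);
}

const qTable = buildQTable(N_STATES, ACTIONS);
for (let episode = 0; episode < MAX_EPISODES; episode++) {
  let state = 0;          // env.reset()
  let done = false;
  while (!done) {
    const action = chooseAction(state, qTable);
    const { state: newState, reward, done: isDone } = step(state, action);
    // Q-learning update: Q(s,a) += α [ r + γ max_a' Q(s',a') − Q(s,a) ]
    const target = reward + (isDone ? 0 : GAMMA * Math.max(...qTable[newState]));
    qTable[state][action] += ALPHA * (target - qTable[state][action]);
    state = newState;
    done = isDone;
  }
}
// After training, the greedy policy prefers "right" in every state.
```

Note the one design choice beyond the pseudocode: bootstrapping is zeroed out on the terminal step, so terminal Q-values converge to the final reward itself.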
Live Coding Session II
• Implement the Q-Learning formula
What's Next
• A reinforcement learning library for p5.js
  ▪ Tabular Q-Learning
  ▪ Policy Gradient
  ▪ Actor-Critic
• A collection of training environments
• Collaborations with artists
Andreas Schlegel, 2018, with contributions by Ong Kian Peng, Chong Ming En, Deon Chan Chee Kian, Soh Kheng Jin Eugene
Thank you!
Jieliang (Rodger) Luo: [email protected]
Sam Green: [email protected]
Please help us finish the questionnaire! rodgerluo.com/rl_p5.html