Bridging Reinforcement Learning and Creativity: Implementing Reinforcement Learning in Processing
Bridging Reinforcement Learning and Creativity: Implementing Reinforcement Learning in Processing
Jieliang Luo, Media Arts & Technology
Sam Green, Computer Science
University of California, Santa Barbara
SIGGRAPH Asia 2018, Tokyo, Japan, December 6th, 2018

Course Agenda
• What is Reinforcement Learning (10 mins)
▪ Introduce the core concepts of reinforcement learning
• A Brief Survey of Artworks in Deep Learning (5 mins)
• Why the Processing Community (5 mins)
• A Q-Learning Algorithm (35 mins)
▪ Explain a fundamental reinforcement learning algorithm
• Implementing Tabular Q-Learning in P5.js (40 mins)
▪ Discuss how to create a reinforcement learning environment
▪ Show how to implement the tabular Q-learning algorithm in P5.js
• Questions & Answers (5 mins)

[Figure: psychology pictures.]

What is Reinforcement Learning
• A branch of machine learning
• Learns through trial & error, rewards & punishment
• Draws from psychology, neuroscience, computer science, and optimization

Reinforcement Learning Framework
[Figure: the agent-environment interaction loop, with the Agent acting on the Environment.]

Example videos:
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.youtube.com/watch?v=ZhsEKTo7V04
https://www.youtube.com/watch?v=XYoS68yJVmw

Google DeepDream
The same image before (left) and after (right) applying DeepDream; the network had been trained to perceive dogs.
Mordvintsev, et al. Inceptionism: Going Deeper into Neural Networks. 2015

Generative Adversarial Networks (GANs)
Portraits of Imaginary People, Mike Tyka. NIPS 2017 Creativity Art Gallery.
Goodfellow, et al. Generative Adversarial Nets. 2014

Recurrent Neural Network (RNN)
Drawing Operations Unit: Generation 2, Sougwen Chung. NIPS 2017 Creativity Art Gallery.
Lipton, et al. A Critical Review of Recurrent Neural Networks for Sequence Learning. 2015

Visualizing Deep Neural Network Decomposition of Human Portraits, Jieliang Luo & Sam Green. IEEE VIS 2018 Art Program.
Zeiler, et al. Visualizing and Understanding Convolutional Networks. 2014

Reinforcement Learning
Autodesk Research, 2018
Sutton, et al. Reinforcement Learning: An Introduction. 1998

Challenges for Media Artists
1) Using current RL libraries requires an in-depth understanding of RL algorithms.
2) Designing RL training environments requires an in-depth understanding of RL concepts (observations, actions, reward functions).
Brockman, et al. OpenAI Gym. 2016

How does an agent learn what to do?
Consider two slot machines, A and B, each with a secret reward distribution:
Machine A pays +10 with probability 10%, +7 with 25%, +3 with 25%, +2 with 25%, and -5 with 15%.
Machine B pays +25 with probability 5%, +10 with 20%, +0 with 50%, -5 with 20%, and -15 with 5%.

Idea: pull the lever many times and analyze the results.
[Figure: frequency histogram of the rewards observed on machine A (+7, +3, +2, -5).]

Use the average result to approximate the value:

Time step   Reward     Average
t = 1       r1 = 7     7/1 = 7
t = 2       r2 = -5    (7-5)/2 = 1
t = 3       r3 = 7     (7-5+7)/3 = 3
t = 4       r4 = 7     (7-5+7+7)/4 = 4
...         ...        ... -> 3.25

Approximate expected values for both machines:
Machine A: average value 3.25
Machine B: average value 1.5

How do we efficiently track the average? Computed this way, the numerator must be recalculated at every step. What if there are 100,000 steps?!

Use the moving-average technique for efficiency:

    updated estimate = current estimate + step size * (sampled reward - current estimate)

where (sampled reward - current estimate) is the error and 0 < step size < 1.
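To make the update rule concrete, here is a small JavaScript sketch of the incremental average (the function and variable names are our own, not taken from the course code). With a constant step size the estimate behaves as a moving average that favors recent rewards; with a step size of 1/t it reproduces the exact running average in the table above.

    // Incremental update: fold one new sample into an estimate without
    // storing or re-summing all of the previous rewards.
    // updated = current + stepSize * (reward - current)
    function updateEstimate(current, reward, stepSize) {
      const error = reward - current;        // how far the sample is from the current guess
      return current + stepSize * error;     // move the guess a fraction of the way toward the sample
    }

    // Example: estimate the value of machine A from the samples 7, -5, 7, 7.
    let estimate = 0;
    const rewards = [7, -5, 7, 7];
    for (let t = 1; t <= rewards.length; t++) {
      estimate = updateEstimate(estimate, rewards[t - 1], 1 / t);  // step size 1/t = exact average
    }
    console.log(estimate); // 4, matching the t = 4 row of the table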
Markov Decision Process
• A framework in which an agent attempts to extract rewards from its environment.
• Maximum reward collection is achieved by making many good actions.
• Sacrificial actions may pay off.

Example MDP: Cat in a building
[Figure: the building environment with numbered rooms; the cat is the agent, and potential "rewards" are placed around the rooms. The available actions are L, R, and F.]

MDP viewed as a graph
[Figure: the same environment drawn as a graph of numbered states connected by the actions L, R, and F, with a Terminal state that resets the episode.]

Rewards are given at transitions.
[Figure: taking action F from one state yields +1 on the transition.]

Rewards may be stochastic.
[Figure: taking the same action F from the same state yields +0 on another attempt.]

The goal is to collect the maximum total reward. One episode through the graph:
• first transition: +0 (total rewards: 0)
• second transition: +10 (total rewards: 10)
• final transition into the Terminal state: -3 (total rewards: 7)

Different paths lead to different total rewards. Another episode:
• first transition: +3 (total rewards: 3)
• second transition: +11 (total rewards: 14)
• final transition into the Terminal state: -2 (total rewards: 12)

How do we assign values to state-action pairs? The agent can learn this by experience: calculate the "rewards to go" from each visited state-action combination. Suppose one episode produced
• state 0, action L: reward +2
• state 1, action R: reward -5
• state 2, action F: reward +1 (reaching the Terminal state)

Value of each action in each visited state (rewards to go):
State 0, action L: 2 - 5 + 1 = -2
State 1, action R: -5 + 1 = -4
State 2, action F: +1
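The rewards-to-go calculation can be done mechanically by walking backwards through a recorded episode and accumulating the rewards that follow each step. Below is a minimal JavaScript sketch; the {state, action, reward} representation is one we chose for illustration, and the numbers are the ones from the episode above.

    // One recorded episode: at each step the agent was in a state, took an
    // action, and received a reward. These numbers match the episode above.
    const episode = [
      { state: 0, action: 'L', reward:  2 },
      { state: 1, action: 'R', reward: -5 },
      { state: 2, action: 'F', reward:  1 },  // reaches the Terminal state
    ];

    // Walk backwards, summing the reward received at each step and everything after it.
    function rewardsToGo(episode) {
      const returns = new Array(episode.length);
      let total = 0;
      for (let i = episode.length - 1; i >= 0; i--) {
        total += episode[i].reward;
        returns[i] = total;
      }
      return returns;
    }

    console.log(rewardsToGo(episode)); // [-2, -4, 1], the values computed above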
Repeated experience leads to better estimates. Suppose a second episode produced
• state 0, action L: reward +1
• state 1, action R: reward -6
• state 2, action F: reward +2 (reaching the Terminal state)

Its rewards to go:
State 0, action L: 1 - 6 + 2 = -3
State 1, action R: -6 + 2 = -4
State 2, action F: +2

Use averaging to track values over multiple episodes:

                      Episode 1   Episode 2   Average
State 0, action L     -2          -3          (-2 - 3)/2 = -2.5
State 1, action R     -4          -4          (-4 - 4)/2 = -4
State 2, action F     +1          +2          (1 + 2)/2 = 1.5

Sample many episodes to learn all the Q-values.
[Figure: the learned Q-table for the cat environment, one row per state and one column per action (L, R, F); for example Q(0, L) = -1.0, Q(0, R) = 4.25, Q(0, F) = 3.0.]
After trying all possible paths many times, the Q-values will converge.

Use the Q-values to determine which actions to take (the policy). The trained policy tells the agent which action to take: in each state, pick the action with the largest Q-value.

Exploration vs. Exploitation
• The current approach requires exhaustive exploration to discover the Q-values.
▪ Not feasible for a large number of states.
• The ε-greedy approach balances learning more about known good paths and discovering novel approaches.
• ε-greedy: choose a random action (explore) with probability ε; choose the greedy action (exploit) with probability 1 - ε.
• Often ε is set to 0.05.

ε-greedy example (ε = 0.05)
• Set all Q-values to 0 and set ε, e.g. ε = 0.05.
• Start an episode. Generate a random number, for example rnd = 0.02. If rnd <= ε, pick a random action; otherwise pick argmax_a Q(0, a).
• Here 0.02 < 0.05, so pick a random action, e.g. F. That moves the cat to state 2. Generate the next random number...
• When the episode ends, calculate the returns and update the Q-values.
[Figure: the Q-table after the first episode's update; for example Q(0, F) has become -1.]
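Putting these pieces together (ε-greedy action selection, rewards to go, and incremental averaging) gives a complete tabular learning loop of the kind the course builds in P5.js. The sketch below is our own minimal plain-JavaScript illustration, not the course's code: the corridor environment, the constants, and all names are stand-ins we invented, and the loop could be called from a p5.js setup() or run directly.

    // Tabular Monte Carlo learning with ε-greedy exploration.
    // The environment is a made-up 4-state corridor, NOT the cat-in-building
    // environment from the slides: F moves the agent one state to the right
    // and pays +1 on the step that reaches the terminal state; L and R leave
    // the agent where it is and pay -1.
    const ACTIONS = ['L', 'R', 'F'];
    const NUM_STATES = 4;                 // states 0..2 plus terminal state 3
    const TERMINAL = NUM_STATES - 1;
    const EPSILON = 0.05;                 // exploration rate, as on the slides

    // Q-table: one estimate and one visit count per state-action pair.
    const Q = [], visits = [];
    for (let s = 0; s < NUM_STATES; s++) {
      Q.push({ L: 0, R: 0, F: 0 });
      visits.push({ L: 0, R: 0, F: 0 });
    }

    function step(state, action) {
      if (action === 'F') {
        const next = state + 1;
        return { next, reward: next === TERMINAL ? 1 : 0, done: next === TERMINAL };
      }
      return { next: state, reward: -1, done: false };
    }

    // ε-greedy: explore with probability ε, otherwise take the best-known action.
    function chooseAction(state) {
      if (Math.random() < EPSILON) {
        return ACTIONS[Math.floor(Math.random() * ACTIONS.length)];
      }
      return ACTIONS.reduce((best, a) => (Q[state][a] > Q[state][best] ? a : best), 'L');
    }

    // Run one episode, then fold each step's rewards-to-go into the running averages.
    function runEpisode() {
      const trajectory = [];
      let state = 0, done = false;
      while (!done && trajectory.length < 200) {     // cap length so episodes always end
        const action = chooseAction(state);
        const { next, reward, done: finished } = step(state, action);
        trajectory.push({ state, action, reward });
        state = next;
        done = finished;
      }
      let toGo = 0;
      for (let i = trajectory.length - 1; i >= 0; i--) {
        const { state: s, action: a, reward } = trajectory[i];
        toGo += reward;
        visits[s][a] += 1;
        Q[s][a] += (toGo - Q[s][a]) / visits[s][a];  // incremental average from earlier
      }
    }

    for (let e = 0; e < 1000; e++) runEpisode();
    console.log(Q);  // after training, F should have the highest value in states 0..2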