Bridging Reinforcement Learning and Creativity: Implementing Reinforcement Learning in Processing

Jieliang Luo, Media Arts & Technology
Sam Green, Computer Science
University of California, Santa Barbara
SIGGRAPH Asia 2018, Tokyo, Japan, December 6th, 2018

Course Agenda
• What is Reinforcement Learning (10 mins)
  ▪ Introduce the core concepts of reinforcement learning
• A Brief Survey of Artworks in Deep Learning (5 mins)
• Why Processing Community (5 mins)
• A Q-Learning Algorithm (35 mins)
  ▪ Explain a fundamental reinforcement learning algorithm
• Implementing Tabular Q-Learning in P5.js (40 mins)
  ▪ Discuss how to create a reinforcement learning environment
  ▪ Show how to implement the tabular Q-learning algorithm in P5.js
• Questions & Answers (5 mins)

Psychology Pictures [image slide]

What is Reinforcement Learning
• A branch of machine learning
• Learns through trial & error, rewards & punishment
• Draws from psychology, neuroscience, computer science, and optimization

Reinforcement Learning Framework
• An agent interacting with an environment [diagram]
• Video examples:
  ▪ https://www.youtube.com/watch?v=V1eYniJ0Rnk
  ▪ https://www.youtube.com/watch?v=ZhsEKTo7V04
  ▪ https://www.youtube.com/watch?v=XYoS68yJVmw

A Brief Survey of Artworks in Deep Learning
• Google DeepDream: the same image before (left) and after (right) applying DeepDream, the network having been trained to perceive dogs. (Mordvintsev, et al. Inceptionism: Going Deeper into Neural Networks. 2015)
• Generative Adversarial Networks (GANs): Portraits of Imaginary People, Mike Tyka. (Goodfellow, et al. Generative Adversarial Nets. 2014)
• Recurrent Neural Network (RNN): Drawing Operations Unit: Generation 2, Sougwen Chung, NIPS 2017 Creativity Art Gallery. (Lipton, et al. A Critical Review of Recurrent Neural Networks for Sequence Learning. 2015)
• Visualizing Deep Neural Network Decomposition of Human Portraits, Jieliang Luo & Sam Green, NIPS 2017 Creativity Art Gallery. (Zeiler, et al. Visualizing and Understanding Convolutional Networks. 2014)
• Reinforcement Learning: Autodesk Research, 2018, IEEE VIS 2018 Art Program. (Sutton, et al. Reinforcement Learning: An Introduction. 1998)

Challenges for Media Artists
1) Using current RL libraries requires an in-depth understanding of RL algorithms.
2) Designing RL training environments requires an in-depth understanding of RL concepts: observations, actions, and the reward function. (Brockman, et al. OpenAI Gym. 2016)
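To make the second challenge concrete, here is a minimal, hypothetical sketch of what a training environment can look like in plain JavaScript that could sit alongside a p5.js sketch. The reset/step interface returning an observation, a reward, and a done flag loosely mirrors the OpenAI Gym convention cited above; the class name, room layout, toy dynamics, and rewards are all invented for illustration and are not code from the course.

```javascript
// Hypothetical Gym-style environment sketch (invented names and dynamics,
// loosely inspired by the "cat in a building" example used later in the course).
class CatEnv {
  constructor() {
    this.nStates = 7;                  // rooms 0..6, room 6 treated as terminal
    this.actions = ["L", "R", "F"];
    this.state = 0;
  }

  // Put the agent back in the starting room and return the first observation.
  reset() {
    this.state = 0;
    return this.state;
  }

  // Apply an action and return { observation, reward, done }.
  step(action) {
    const next = this.nextState(this.state, action);
    const reward = next === this.nStates - 1 ? 1 : 0;  // toy reward function
    const done = next === this.nStates - 1;
    this.state = next;
    return { observation: next, reward: reward, done: done };
  }

  // Toy dynamics: "F" moves one room ahead, "L" and "R" stay put in this sketch.
  nextState(state, action) {
    return action === "F" ? Math.min(state + 1, this.nStates - 1) : state;
  }
}
```

A p5.js sketch could then drive such an object from draw(), calling step() once per frame, rendering the returned observation, and calling reset() when done is true.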
How does an agent learn what to do?

Example: two slot machines, A and B, each with a secret payout distribution.
• Machine A: 10% chance of +10, 25% of +7, 25% of +3, 25% of +2, 15% of -5
• Machine B: 5% chance of +25, 20% of +10, 50% of +0, 20% of -5, 5% of -15

Idea: pull the lever many times and analyze the results.
[Histogram: frequency of each observed reward (+7, +3, +2, -5) for machine A]

Use the average result to approximate the value (machine A):
Time step | Reward  | Average
t = 1     | r1 = 7  | 7/1 = 7
t = 2     | r2 = -5 | (7-5)/2 = 1
t = 3     | r3 = 7  | (7-5+7)/3 = 3
t = 4     | r4 = 7  | (7-5+7+7)/4 = 4
...       | ...     | ... -> 3.25

Approximate expected values for both machines:
• Machine A: average value 3.25
• Machine B: average value 1.5

How to efficiently track the average? Computing it this way means recalculating the numerator at every time step. What if there are 100,000 steps?!

Use the moving-average technique for efficiency:
    updated estimate = current estimate + step size × (sampled reward - current estimate)
where (sampled reward - current estimate) is the error and 0 < step size < 1. (Choosing the step size as 1/n, where n is the number of samples so far, reproduces the exact running average.)

Markov Decision Process
• Framework in which an agent attempts to extract rewards from an environment.
• Maximum reward collection is achieved by making many good actions.
• Sacrificial actions may pay off.

Example MDP: cat in a building
[Diagram: a building whose rooms are states numbered 0 through 6, with the agent (a cat) in the environment, potential "rewards" placed in various rooms, and the actions L, R, and F. Viewed as a graph, the rooms are nodes, the actions are edges, and some rooms are terminal states that reset the episode.]

• Rewards are given at transitions between states.
• Rewards may be stochastic: the same transition can pay +1 on one visit and +0 on another.
• The goal is to collect the maximum total reward. Along one path the agent collects +0, +10, and -3 for a total reward of 7; along another it collects +3, +11, and -2 for a total of 12. Different paths lead to different total rewards.

How to assign values to state-action pairs?

Agent can learn this by experience
• Calculate the "rewards to go" from each visited state-action combination.
• Example episode: from state 0 the cat takes L and receives +2, from state 1 it takes R and receives -5, and from state 2 it takes F, receives +1, and reaches a terminal state.
• Value of each action in each state (rewards to go):
  ▪ State 0, action L: 2 - 5 + 1 = -2
  ▪ State 1, action R: -5 + 1 = -4
  ▪ State 2, action F: +1
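The rewards-to-go calculation above is easy to express in code. Below is a small, hypothetical JavaScript helper (not taken from the course materials) that walks backward through the rewards collected in one episode and returns the reward to go from each step; applied to the example episode it reproduces -2, -4, and +1.

```javascript
// Hypothetical helper (illustrative, not the course's code): given the rewards
// collected at each step of one episode, return the undiscounted "reward to go"
// from each step, i.e. that reward plus everything received afterwards.
function rewardsToGo(rewards) {
  const toGo = new Array(rewards.length).fill(0);
  let runningSum = 0;
  // Walk backward so each entry needs only one addition.
  for (let t = rewards.length - 1; t >= 0; t--) {
    runningSum += rewards[t];
    toGo[t] = runningSum;
  }
  return toGo;
}

// The episode above: L gave +2, R gave -5, F gave +1.
console.log(rewardsToGo([2, -5, 1])); // -> [-2, -4, 1]
```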
Repeated experience leads to better estimates
• Run more episodes and again calculate the "rewards to go" from each visited state.
• Second episode: from state 0 the cat takes L and receives +1, from state 1 it takes R and receives -6, and from state 2 it takes F, receives +2, and reaches a terminal state.
• Value of each action in each state (rewards to go):
  ▪ State 0, action L: 1 - 6 + 2 = -3
  ▪ State 1, action R: -6 + 2 = -4
  ▪ State 2, action F: +2

Use averaging to track values over multiple episodes
• Episode 1: State 0, action L: -2; State 1, action R: -4; State 2, action F: +1
• Episode 2: State 0, action L: -3; State 1, action R: -4; State 2, action F: +2
• Averages:
  ▪ State 0, action L: (-2 + -3)/2 = -2.5
  ▪ State 1, action R: (-4 + -4)/2 = -4
  ▪ State 2, action F: (1 + 2)/2 = 1.5

Sample many episodes to learn all Q-values
[Q-table: one row per state, one column per action (L, R, F); for example Q(0, L) = -1.0, Q(0, R) = 4.25, Q(0, F) = 3.0, with an entry for every state-action pair.]
After trying all possible paths many times, the Q-values will converge.

Use Q-values to determine actions to take (the policy)
The trained policy tells the agent which action to take: in each state, choose the action with the highest Q-value.

Exploration vs. Exploitation
• The current approach requires exhaustive exploration to discover the Q-values.
  ▪ Not feasible for a large number of states.
• The ε-greedy approach balances learning more about known good paths against discovering novel approaches.
• ε-greedy: choose a random action (explore) with probability ε; choose the greedy action (exploit) with probability 1 - ε.
• Often ε is set to 0.05.

ε-greedy example (ε = 0.05)
• Set all Q-values to 0 and set ε, e.g. ε = 0.05.
• Start an episode. Generate a random number, for example rnd = 0.02.
• If rnd <= ε, pick a random action; otherwise pick argmax_a Q(0, a).
• Here 0.02 < 0.05, so pick a random action, e.g. F. That moves the cat to state 2. Generate another random number and continue.
• Calculate the returns after the episode ends.
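As a rough sketch of how these pieces might fit together in plain JavaScript alongside a p5.js sketch, the snippet below implements ε-greedy action selection over a Q-table and nudges each visited state-action pair toward its observed return using the moving-average rule introduced earlier. It follows the calculate-returns-after-the-episode approach described above; the table layout, function names, and the fixed step size of 0.1 are assumptions for illustration, not the course's own P5.js implementation.

```javascript
// Hypothetical tabular value-learning helpers (illustrative only).
const ACTIONS = ["L", "R", "F"];
const N_STATES = 7;
const EPSILON = 0.05;   // exploration rate, as on the slides
const STEP_SIZE = 0.1;  // assumed constant step size, 0 < step size < 1

// Q-table: one row per state, one entry per action, all initialised to 0.
const Q = Array.from({ length: N_STATES }, () =>
  Object.fromEntries(ACTIONS.map(a => [a, 0]))
);

// Epsilon-greedy selection: explore with probability EPSILON,
// otherwise exploit by taking the action with the highest Q-value.
function chooseAction(state) {
  if (Math.random() <= EPSILON) {
    return ACTIONS[Math.floor(Math.random() * ACTIONS.length)];
  }
  return ACTIONS.reduce((best, a) => (Q[state][a] > Q[state][best] ? a : best));
}

// After an episode ends, update each visited (state, action) pair toward its
// observed reward to go, using the moving-average rule:
//   Q(s, a) <- Q(s, a) + step size * (return - Q(s, a))
function updateFromEpisode(visited, returns) {
  for (let t = 0; t < visited.length; t++) {
    const { state, action } = visited[t];
    Q[state][action] += STEP_SIZE * (returns[t] - Q[state][action]);
  }
}
```

A full training loop would repeatedly call chooseAction, step the environment, record the visited state-action pairs and rewards, compute the returns with something like the rewardsToGo helper above once the episode ends, and pass them to updateFromEpisode, repeating for many episodes until the Q-values converge.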
