Towards Learning in Probabilistic Action Selection: Markov Systems and Markov Decision Processes

Manuela Veloso
Carnegie Mellon University, Computer Science Department
15-889 Planning – Fall 2001
(Remember the examples on the board.)

Different Aspects of "Machine Learning"

• Supervised learning
  – Classification, concept learning
  – Learning from labeled data
  – Function approximation
• Unsupervised learning
  – Data is not labeled
  – Data needs to be grouped, clustered
  – We need a distance metric
• Control and action model learning
  – Learning to select actions efficiently
  – Feedback: goal achievement, failure, reward
  – Search control learning, reinforcement learning

Search Control Learning

• Improve search efficiency and plan quality
• Learn heuristics, e.g.:

    (if (and (goal (enjoy weather))
             (goal (save energy))
             (state (weather fair)))
        (then (select action RIDE-BIKE)))

Learning Opportunities in Planning

• Learning to improve planning efficiency
• Learning the domain model
• Learning to improve plan quality
• Learning a universal plan
• Which action model, which planning algorithm, which heuristic control is the most efficient for a given task?

Reinforcement Learning

• A variety of algorithms to address:
  – learning the model
  – converging to the optimal plan

Discounted Rewards

• "Reward" today versus future (promised) reward
• $100K + $100K + $100K + ... is INFINITE!
• Future rewards are not worth as much as current rewards.
• Assume a discount factor, say γ.
• $100K + γ·$100K + γ²·$100K + ... CONVERGES!

Markov Systems with Rewards

• Finite set of n states, s_i
• Probabilistic state transition matrix P, of size n × n, with entries p_{ij}
• "Goal achievement": a reward for each state, a vector of length n with entries r_i
• Discount factor γ
• Process:
  – Start at state s_i
  – Receive immediate reward r_i
  – Move randomly to a new state according to the probability transition matrix
  – Future rewards (of the next state) are discounted by γ

Solving a Markov System with Rewards

• V*(s_i) = expected discounted sum of future rewards starting in state s_i
• V*(s_i) = r_i + γ [ p_{i1} V*(s_1) + p_{i2} V*(s_2) + ... + p_{in} V*(s_n) ]

Value Iteration to Solve a Markov System with Rewards

• V^1(s_i) = expected discounted sum of future rewards starting in state s_i for one step
• V^2(s_i) = expected discounted sum of future rewards starting in state s_i for two steps
• ...
• V^k(s_i) = expected discounted sum of future rewards starting in state s_i for k steps
• As k → ∞, V^k(s_i) → V*(s_i)
• Stop when the difference between the k+1 and k values is smaller than some small threshold ε.

Markov Decision Processes

• Finite set of states, s_1, ..., s_n
• Finite set of actions, a_1, ..., a_m
• Probabilistic state-action transitions: p^k_{ij} = Prob(next = s_j | current = s_i and take action a_k)
• Reward for each state, r_1, ..., r_n
• Process:
  – Start in state s_i
  – Receive immediate reward r_i
  – Choose action a_k ∈ A
  – Change to state s_j with probability p^k_{ij}
  – Discount future rewards

Deterministic Example

[Figure: a deterministic state-transition graph with a goal state G; labelled transitions carry reward 10. Reward on unlabelled transitions is zero.]
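As a concrete illustration of the value-iteration procedure for a Markov system with rewards described above, here is a minimal Python sketch. The three-state transition matrix, rewards, and discount factor at the bottom are made-up placeholders for illustration, not values from the lecture.

    # Value iteration for a Markov system with rewards (no actions):
    #   V*(s_i) = r_i + gamma * sum_j p_ij * V*(s_j)

    def solve_markov_system(P, r, gamma, epsilon=1e-6):
        """Iterate V <- r + gamma * P V until successive value vectors differ by less than epsilon."""
        n = len(r)
        V = [0.0] * n  # V^0: no steps of reward yet
        while True:
            V_new = [r[i] + gamma * sum(P[i][j] * V[j] for j in range(n))
                     for i in range(n)]
            if max(abs(V_new[i] - V[i]) for i in range(n)) < epsilon:
                return V_new
            V = V_new

    # Hypothetical 3-state system (numbers are illustrative only):
    P = [[0.5, 0.5, 0.0],
         [0.1, 0.0, 0.9],
         [0.0, 0.2, 0.8]]
    r = [1.0, 0.0, 10.0]
    print(solve_markov_system(P, r, gamma=0.9))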
Nondeterministic Example

[Figure: a four-state MDP with states S1 (Unemployed), S2 (Industry), S3 (Grad School), and S4 (Academia), actions R and D, and transition probabilities of 0.1, 0.9, and 1.0 labelling the arcs.]

Solving an MDP

• A policy is a mapping from states to actions.
• Optimal policy: for every state, there is no other action that gets a higher sum of discounted future rewards.
• For every MDP there exists an optimal policy.
• Solving an MDP is finding an optimal policy.
• A specific policy converts an MDP into a plain Markov system with rewards.

Policy Iteration

• Start with some policy π_0(s_i).
• Such a policy transforms the MDP into a plain Markov system with rewards.
• Compute the values of the states according to the current policy.
• Update the policy: π_1(s_i) = argmax_a { r_i + γ Σ_j p^a_{ij} V^{π_0}(s_j) }
• Keep computing.
• Stop when π_{k+1} = π_k.

Value Iteration

• V*(s_i) = expected discounted future rewards, if we start from s_i and follow the optimal policy.
• Compute V* with value iteration:
  – V^k(s_i) = maximum possible future sum of rewards starting from state s_i for k steps.
• Bellman's Equation: V^{n+1}(s_i) = max_k { r_i + γ Σ_{j=1..N} p^k_{ij} V^n(s_j) }
• Dynamic programming
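To make the MDP algorithms above concrete, here is a minimal Python sketch of value iteration driven by Bellman's equation, followed by extraction of the greedy policy. The two-state, two-action MDP at the bottom is a made-up illustration, not the example from the slides.

    # Value iteration for an MDP, using Bellman's equation:
    #   V^{n+1}(s_i) = max_k { r_i + gamma * sum_j p^k_ij * V^n(s_j) }
    # P[k][i][j] = probability of moving from state i to state j under action k.

    def value_iteration(P, r, gamma, epsilon=1e-6):
        n_actions, n_states = len(P), len(r)
        V = [0.0] * n_states
        while True:
            V_new = [max(r[i] + gamma * sum(P[k][i][j] * V[j] for j in range(n_states))
                         for k in range(n_actions))
                     for i in range(n_states)]
            if max(abs(V_new[i] - V[i]) for i in range(n_states)) < epsilon:
                return V_new
            V = V_new

    def greedy_policy(P, r, gamma, V):
        """For each state, pick the action maximizing r_i + gamma * sum_j p^k_ij * V(s_j)."""
        n_actions, n_states = len(P), len(r)
        return [max(range(n_actions),
                    key=lambda k: r[i] + gamma * sum(P[k][i][j] * V[j]
                                                     for j in range(n_states)))
                for i in range(n_states)]

    # Hypothetical 2-state, 2-action MDP (all numbers are illustrative only):
    P = [[[0.8, 0.2],   # action a_1
          [0.3, 0.7]],
         [[0.1, 0.9],   # action a_2
          [0.9, 0.1]]]
    r = [0.0, 1.0]
    V = value_iteration(P, r, gamma=0.9)
    print(V, greedy_policy(P, r, 0.9, V))

Policy iteration, by contrast, alternates two steps: evaluating a fixed policy (which reduces to the plain Markov-system solver sketched earlier) and applying the same argmax update to obtain a new policy, stopping when the policy no longer changes.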
