Towards Learning in Probabilistic Action Selection: Markov Systems and Markov Decision Processes
Manuela Veloso
Carnegie Mellon University, Computer Science Department
15-889 Planning - Fall 2001
Remember the examples on the board.

Different Aspects of "Machine Learning"
• Supervised learning
  – Classification - concept learning
  – Learning from labeled data
  – Function approximation
• Unsupervised learning
  – Data is not labeled
  – Data needs to be grouped, clustered
  – We need a distance metric
• Control and action model learning
  – Learning to select actions efficiently
  – Feedback: goal achievement, failure, reward
  – Search control learning, reinforcement learning

Search Control Learning
• Improve search efficiency, plan quality
• Learn heuristics, e.g. a control rule such as:
  (if (and (goal (enjoy weather))
           (goal (save energy))
           (state (weather fair)))
      (then (select action RIDE-BIKE)))

Learning Opportunities in Planning
• Learning to improve planning efficiency
• Learning the domain model
• Learning to improve plan quality
• Learning a universal plan
• Which action model, which planning algorithm, which heuristic control is the most efficient for a given task?

Reinforcement Learning
• A variety of algorithms to address:
  – learning the model
  – converging to the optimal plan.

Discounted Rewards
• "Reward" today versus future (promised) reward
• $100K + $100K + $100K + ... INFINITE!
• Future rewards are not worth as much as current ones.
• Assume a discount factor, say γ.
• $100K + γ·$100K + γ²·$100K + ... CONVERGES!

Markov Systems with Rewards
• Finite set of n states - a vector of length n - s_i
• Probabilistic state matrix, P - n × n - p_{ij}
• "Goal achievement" - a reward for each state, a vector of length n - r_i
• Discount factor - γ
• Process:
  – Start at state s_i
  – Receive immediate reward r_i
  – Move randomly to a new state according to the probability transition matrix
  – Future rewards (of the next state) are discounted by γ

Solving a Markov System with Rewards
• V*(s_i) - expected discounted sum of future rewards starting in state s_i
• V*(s_i) = r_i + γ [ p_{i1} V*(s_1) + p_{i2} V*(s_2) + ... + p_{in} V*(s_n) ]

Value Iteration to Solve a Markov System with Rewards
• V^1(s_i) - expected discounted sum of future rewards starting in state s_i for one step.
• V^2(s_i) - expected discounted sum of future rewards starting in state s_i for two steps.
• ...
• V^k(s_i) - expected discounted sum of future rewards starting in state s_i for k steps.
• As k → ∞, V^k(s_i) → V*(s_i)
• Stop when the difference between the k+1 and k values is smaller than some ε.
(A small code sketch of this iteration appears after the Deterministic Example below.)

Markov Decision Processes
• Finite set of states, s_1, ..., s_n
• Finite set of actions, a_1, ..., a_m
• Probabilistic state-action transitions:
  p^k_{ij} = Prob(next = s_j | current = s_i and take action a_k)
• Reward for each state, r_1, ..., r_n
• Process:
  – Start in state s_i
  – Receive immediate reward r_i
  – Choose action a_k ∈ A
  – Change to state s_j with probability p^k_{ij}
  – Discount future rewards

Deterministic Example
[Slide figure: a deterministic state-transition diagram with a goal state G; several transitions are labelled with reward 10.]
(Reward on unlabelled transitions is zero.)
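As a concrete illustration of the value-iteration recurrence for a plain Markov system with rewards, here is a minimal Python/NumPy sketch. The 3-state system, rewards, discount factor, and tolerance eps below are made-up placeholders, not the example from the slides.

import numpy as np

# Hypothetical 3-state Markov system with rewards (placeholder numbers).
P = np.array([[0.5, 0.5, 0.0],   # p_ij: probability of moving from s_i to s_j
              [0.2, 0.2, 0.6],
              [0.0, 0.1, 0.9]])
r = np.array([10.0, 0.0, -1.0])  # r_i: immediate reward received in state s_i
gamma = 0.9                      # discount factor γ
eps = 1e-6                       # stopping tolerance (the ε on the slide)

V = np.zeros(len(r))             # V^0: zero steps, zero future reward
while True:
    V_next = r + gamma * P @ V   # V^{k+1}(s_i) = r_i + γ Σ_j p_ij V^k(s_j)
    if np.max(np.abs(V_next - V)) < eps:
        break
    V = V_next
print(V)                         # approximates V*(s_i) for each state

Each sweep is one matrix-vector product, and the discounting by γ < 1 is what makes the successive estimates converge.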
Nondeterministic Example
[Slide figure: a four-state MDP with states S1 (Unemployed), S2 (Industry), S3 (Grad School), and S4 (Academia); each state offers actions D and R, and the arcs carry transition probabilities of 0.1, 0.9, and 1.0.]

Solving an MDP
• A policy is a mapping from states to actions.
• Optimal policy - for every state, there is no other action that gets a higher sum of discounted future rewards.
• For every MDP there exists an optimal policy.
• Solving an MDP is finding an optimal policy.
• A specific policy converts an MDP into a plain Markov system with rewards.

Policy Iteration
• Start with some policy π_0(s_i).
• Such a policy transforms the MDP into a plain Markov system with rewards.
• Compute the values of the states according to the current policy.
• Update the policy:
  π_1(s_i) = argmax_a { r_i + γ Σ_j p^a_{ij} V^{π_0}(s_j) }
• Keep alternating evaluation and update.
• Stop when π_{k+1} = π_k.

Value Iteration
• V*(s_i) = expected discounted future rewards, if we start from s_i and we follow the optimal policy.
• Compute V* with value iteration:
  – V^k(s_i) = maximum possible future sum of rewards starting from state s_i for k steps.
• Bellman's Equation:
  V^{n+1}(s_i) = max_k { r_i + γ Σ_{j=1}^{N} p^k_{ij} V^n(s_j) }
• Dynamic programming
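A corresponding sketch for the MDP case: each sweep applies the Bellman backup above, and a greedy argmax over actions reads off a policy at the end, which is the same improvement step used in policy iteration. The 3-state, 2-action model below is a made-up placeholder, not the nondeterministic example above.

import numpy as np

# Hypothetical MDP with 3 states and 2 actions (placeholder numbers).
# P[k, i, j] = p^k_ij: probability of reaching s_j from s_i under action a_k.
P = np.array([[[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]],
              [[0.1, 0.9, 0.0],
               [0.1, 0.0, 0.9],
               [0.1, 0.0, 0.9]]])
r = np.array([0.0, 1.0, 10.0])   # r_i: reward for being in state s_i
gamma = 0.9
eps = 1e-6

V = np.zeros(len(r))
while True:
    # Bellman backup: V^{n+1}(s_i) = max_k { r_i + γ Σ_j p^k_ij V^n(s_j) }
    Q = r + gamma * P @ V        # Q[k, i]: value of being in s_i and taking action a_k
    V_next = Q.max(axis=0)
    if np.max(np.abs(V_next - V)) < eps:
        break
    V = V_next

policy = Q.argmax(axis=0)        # greedy action index for each state
print(V, policy)

Alternatively, one can fix a policy, evaluate it exactly (the MDP restricted to that policy is a plain Markov system with rewards, as the slides note), and alternate that evaluation with the argmax improvement step; that alternation is policy iteration.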