
2.193 Decision-Making in Large-Scale Systems                MIT, Spring 2005
February 15                                                 Scribe: Ruggiero Cavallo

Lecture Note 5

1 Introduction to average-cost problems

In this lecture, we would like to find a policy u minimizing the average cost over an infinite horizon. In particular, we are interested in solving:

\[
\min_{u} \; \mathbb{E}\!\left[\lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x\right]
\]

Inspired by the finite-horizon and discounted-cost cases, we might want to define and compute

\[
J_u(x) = \mathbb{E}\!\left[\lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x\right],
\qquad
J^*(x) = \min_u J_u(x)
\]

However, for most cases of interest, we are going to have:

\[
J_u(x) = \lambda_u \quad \forall x,
\qquad
J^*(x) = \lambda^* \quad \forall x
\]

Figure 1: The average-cost from each state will be the same in most cases.

In other words, the optimal average cost is the same for each state. But if the average cost is the same for every state, that value alone does not give us enough information to find an optimal policy. In the discounted case, by contrast, we can obtain an optimal policy directly from the cost-to-go function:

\[
u^*(x) = \arg\min_a \left[ g_a(x) + \alpha \sum_y P_a(x, y)\, J^*(y) \right]
\]

However, if J*(x) = λ* for all x, then:

\[
u^*(x) = \arg\min_a \left[ g_a(x) + \alpha \sum_y P_a(x, y)\, \lambda^* \right]
       = \arg\min_a \left[ g_a(x) + \alpha \lambda^* \right]
       = \arg\min_a g_a(x),
\]
i.e., the myopic policy that minimizes only the one-stage cost, which is in general not optimal for the average-cost problem.
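For a fixed policy u whose induced chain is irreducible, the common value λ_u can be computed directly as λ_u = Σ_x π_u(x) g_u(x), where π_u is the stationary distribution of P_u. A minimal numerical sketch (the matrix P_u and cost vector g_u below are hypothetical placeholders, not data from these notes):

    import numpy as np

    def average_cost(P_u, g_u):
        """Average cost of a fixed policy: lambda_u = pi_u . g_u,
        where pi_u is the stationary distribution of P_u (assumed irreducible)."""
        n = P_u.shape[0]
        # Solve pi (I - P_u) = 0 together with sum(pi) = 1.
        A = np.vstack([(np.eye(n) - P_u).T, np.ones(n)])
        b = np.concatenate([np.zeros(n), [1.0]])
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return float(pi @ g_u)

    # Hypothetical 3-state chain under some policy u
    P_u = np.array([[0.5, 0.5, 0.0],
                    [0.1, 0.6, 0.3],
                    [0.2, 0.0, 0.8]])
    g_u = np.array([1.0, 4.0, 2.0])
    print(average_cost(P_u, g_u))  # the same value regardless of the start state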

2 Background on Markov chains

Definition 1 A Markov chain is described by (S, P), where S is a finite state space and P is a transition probability matrix.

In the deterministic case, we have a graph, e.g.,

Figure 2: Simple Markov chain example

\[
P(x, y) = \Pr(x_{t+1} = y \mid x_t = x)
\]

So when we fix a policy, we’re essentially looking at a Markov chain.

Definition 2 Two states x, y in a Markov chain communicate if ∃ k_1, k_2 such that P^{k_1}(x, y) > 0 and P^{k_2}(y, x) > 0.

If two states communicate, then their average cost will be the same.

Definition 3 A Markov chain state x is recurrent if ∃ k_1, k_2, ..., k_∞ such that for each k_i, P^{k_i}(x, x) > ε > 0.

Definition 4 A Markov chain state x is transient if it is not recurrent.

Definition 5 A Markov chain is unichain if all recurrent states communicate. It is called irreducible if it is unichain and all states are recurrent.

In figure 2 above, states 1, 2, and 3 all communicate with each other, but state 4 does not communicate with any other state. States 1, 2, and 3 are recurrent, while state 4 is transient. The Markov chain represented in figure 2 is thus unichain, but not irreducible.
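For a finite chain, recurrence, transience, and the unichain/irreducible properties can be checked mechanically from the reachability structure of P: a state is recurrent exactly when every state reachable from it can also reach it back. A minimal sketch (the helper names below are illustrative, not standard library routines):

    import numpy as np

    def reachable(P):
        """reach[x, y] = True iff y can be reached from x in one or more steps."""
        n = P.shape[0]
        step = (P > 0).astype(int)
        reach = step.copy()
        for _ in range(n):                       # crude transitive closure
            reach = ((reach + reach @ step) > 0).astype(int)
        return reach.astype(bool)

    def classify(P):
        """Return (recurrent flags, unichain?, irreducible?) for transition matrix P."""
        n = P.shape[0]
        reach = reachable(P)
        # x is recurrent iff every state reachable from x can also reach x back
        recurrent = [all((not reach[x, y]) or reach[y, x] for y in range(n))
                     for x in range(n)]
        rec = [x for x in range(n) if recurrent[x]]
        unichain = all(reach[x, y] for x in rec for y in rec if x != y)
        return recurrent, unichain, unichain and all(recurrent)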

A sufficient condition for J*(x) = λ* ∀x is the following:

1. ∃ a unichain optimal policy, or,

2. ∀ x, y, ∃ a policy u and k > 0 such that P_u^k(x, y) > 0.

Note: to see why condition 2 suffices, suppose J*(x) > J*(y). Starting in state x, follow a policy u that eventually reaches y, and then behave optimally from y. Since the cost incurred over the finite time before reaching y does not affect the long-run average, this policy achieves average cost J*(y) from x, so J*(x) ≤ J*(y), a contradiction. Thus J*(x) = J*(y) ∀x, y.

3 Bellman’s equation for average-cost problems

We are going to consider approximating the average-cost problem by a series of finite-horizon problems, with increasing horizon.

\[
J_u(x, T) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x\right]
\]
\[
J_u(x, T) \approx \lambda_u T + h_u(x) + o(T) \qquad (*)
\]

Implicit in the above is the choice that h_u(1) = 0. The idea is that h_u(z) represents the cost incurred, net of the average cost λ_u accrued per period, as you move from state z to state 1 for the first time. The o(T) term at the end covers the difference resulting from potentially not completing a full cycle by time T.

In order to derive Bellman's equation, we conjecture that J_u(x, T) satisfies (*), and that J*(x, T) also satisfies a similar expression:

\[
J^*(x, T) \approx h^*(x) + \lambda^* T + o(T)
\]

We know that J*(x, T) satisfies:

\[
J^*(x, T+1) = \min_a \left[ g_a(x) + \sum_y P_a(x, y)\, J^*(y, T) \right].
\]
Substituting the approximation J*(y, T) ≈ h*(y) + λ*T + o(T) on the right-hand side gives
\[
J^*(x, T+1) \approx \min_a \left[ g_a(x) + \sum_y P_a(x, y)\bigl(h^*(y) + \lambda^* T + o(T)\bigr) \right]
= \min_a \left[ g_a(x) + \sum_y P_a(x, y)\, h^*(y) \right] + \lambda^* T + o(T).
\]

Based on this intuition, comparing with J*(x, T + 1) ≈ h*(x) + λ*(T + 1) + o(T) and cancelling the λ*T terms, we conjecture that Bellman's equation for the average-cost case is given by:

\[
\lambda^* + h^*(x) = \min_a \left[ g_a(x) + \sum_y P_a(x, y)\, h^*(y) \right]
\]

We define (where 1 = a vector of ones):

\[
T_u h = g_u + P_u h,
\qquad
T h = \min_u T_u h,
\]
so that Bellman's equation can be written as λ*·1 + h* = T h*.

Propositions:

1. (monotonicity) ∀ h ≤ h̄ (componentwise), Th ≤ T h̄.

2. (offset) T(h + k·1) = Th + k·1 for any scalar k.
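Numerically, a solution (λ*, h*) can be approximated by relative value iteration, which repeatedly applies T and renormalizes the iterate at a reference state so that it stays bounded. A minimal sketch, assuming a unichain MDP described by hypothetical arrays g[a, x] (one-stage costs) and P[a, x, y] (transition probabilities), with state index 0 playing the role of the reference state 1 in the notes:

    import numpy as np

    def relative_value_iteration(g, P, iters=1000, tol=1e-9):
        """Approximate (lambda*, h*) for the average-cost Bellman equation.
        g[a, x]: one-stage cost;  P[a, x, y]: transition probabilities."""
        h = np.zeros(g.shape[1])
        lam = 0.0
        for _ in range(iters):
            Th = np.min(g + np.einsum('axy,y->ax', P, h), axis=0)  # (Th)(x)
            lam_new, h_new = Th[0] - h[0], Th - Th[0]              # keep h(0) = 0
            done = np.max(np.abs(h_new - h)) < tol
            lam, h = lam_new, h_new
            if done:
                break
        return lam, h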

Remarks:

• Bellman's equation doesn't always have a solution; if different states have different optimal average costs, a solution does not exist.

• If a solution exists, then there are infinitely many solutions.

\[
\lambda^* \mathbf{1} + h^* = T h^*
\;\Rightarrow\;
\lambda^* \mathbf{1} + (h^* + k\mathbf{1}) = T h^* + k\mathbf{1} = T(h^* + k\mathbf{1}) \quad \text{for any scalar } k,
\]

where h* is called the differential cost function. All solutions differ only by an additive constant (a multiple of 1), and they all lead to the same optimal policy.

Theorem 1 Suppose that Bellman's equation has a solution (λ*, h*). Then λ* is the optimal average cost. Moreover, if u* satisfies T_{u*} h* = T h* (i.e., u* is greedy with respect to h*), then λ_{u*} = λ* (u* is an optimal policy).
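In the same array convention as the sketch above, a policy that is greedy with respect to a computed h* can be read off in one step; by Theorem 1 it attains the optimal average cost:

    import numpy as np

    def greedy_policy(g, P, h):
        """u*(x) = argmin_a [ g_a(x) + sum_y P_a(x, y) h(y) ]."""
        Q = g + np.einsum('axy,y->ax', P, h)   # Q[a, x]
        return np.argmin(Q, axis=0)            # one action index per state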

Theorem 2 If the optimal average cost is the same for all states, then Bellman’s equation has a solution.

4 Relationship with discounted-cost problems

In the last part of this lecture, we will show that discounted-cost problems with large discount factors can be used to approach optimal average-cost policies.

Figure 3: Discounted-cost

First recall that, for a fixed policy u, the discounted cost is
\[
J_u(x) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \alpha^t g_u(x_t) \,\middle|\, x_0 = x\right]
\]

Define a fictitious state 0, and consider a new MDP such that

\[
\bar{P}_a(x, y) =
\begin{cases}
\alpha P_a(x, y) & x \neq 0,\ y \neq 0, \\
1 - \alpha       & x \neq 0,\ y = 0, \\
1                & x = y = 0.
\end{cases}
\]
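As a concrete construction (a small sketch; the fictitious state is appended as the last index rather than index 0, purely for convenience):

    import numpy as np

    def augment(P_a, alpha):
        """Build Pbar_a: scale original transitions by alpha and send the
        remaining 1 - alpha probability to an absorbing fictitious state."""
        n = P_a.shape[0]
        Pbar = np.zeros((n + 1, n + 1))
        Pbar[:n, :n] = alpha * P_a      # alpha * P_a(x, y) between original states
        Pbar[:n, n] = 1.0 - alpha       # jump to the fictitious state
        Pbar[n, n] = 1.0                # fictitious state is absorbing
        return Pbar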

Starting from an arbitrary state x_0 ≠ 0, the state evolution in this MDP is identical to that in the original one, until x_k = 0 for the first time. Let T be the first time state 0 is reached: T = inf{k ≥ 0 : x_k = 0}. Since x_0 ≠ 0, T ≥ 1 and T is a geometric random variable with P(T = k) = (1 − α)α^{k−1} for k ≥ 1, and J_u can also be seen as the expected total cost in the new MDP up until the (random) time horizon T. Using our approximation for finite-horizon costs, J_u(x, T) ≈ λ_u T + h_u(x) + o(T), we have
\[
\begin{aligned}
J_u(x) &= \mathbb{E}\!\left[\sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x\right] \\
&= \sum_{T=1}^{\infty} (1-\alpha)\alpha^{T-1}\, \mathbb{E}\!\left[\sum_{t=0}^{T-1} g_u(x_t) \,\middle|\, x_0 = x\right] \\
&= \sum_{T=1}^{\infty} (1-\alpha)\alpha^{T-1}\, J_u(x, T) \\
&\approx \sum_{T=1}^{\infty} (1-\alpha)\alpha^{T-1}\, \bigl(\lambda_u T + h_u(x) + o(T)\bigr) \\
&= \frac{\lambda_u}{1-\alpha} + h_u(x) + O(1-\alpha),
\end{aligned}
\]
where the last line uses the fact that a geometric random variable with parameter 1 − α has mean 1/(1 − α).

To summarize:
\[
J_u \approx \frac{\lambda_u}{1-\alpha}\,\mathbf{1} + h_u + O(1-\alpha),
\qquad
J^* \approx \frac{\lambda^*}{1-\alpha}\,\mathbf{1} + h^* + O(1-\alpha).
\]

As α goes to 1, note that the term λ_u/(1 − α) dominates the discounted cost of each policy u. Hence, if α is large enough, by solving the discounted-cost problem we are essentially finding the optimal average-cost policy. Optimizing the discounted cost with a large discount factor, as a way of solving average-cost problems exactly or approximately, is often preferred in practice because the theory holds more generally (for instance, we do not need to worry about having unichain policies) and the numerical methods are often better behaved.
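This relationship is easy to check numerically for a fixed policy u: the discounted cost solves (I − αP_u) J_u = g_u, and (1 − α)J_u(x) should approach the state-independent value λ_u as α → 1. A small sketch with hypothetical inputs:

    import numpy as np

    # Hypothetical fixed policy: transition matrix and one-stage costs
    P_u = np.array([[0.5, 0.5, 0.0],
                    [0.1, 0.6, 0.3],
                    [0.2, 0.0, 0.8]])
    g_u = np.array([1.0, 4.0, 2.0])

    def discounted_cost(P_u, g_u, alpha):
        """J_u = (I - alpha * P_u)^{-1} g_u for a fixed policy u."""
        n = P_u.shape[0]
        return np.linalg.solve(np.eye(n) - alpha * P_u, g_u)

    for alpha in (0.9, 0.99, 0.999):
        J = discounted_cost(P_u, g_u, alpha)
        print(alpha, (1 - alpha) * J)   # entries converge to the common lambda_u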
