Global Concavity and Optimization in a Class of Dynamic Discrete Choice Models
Yiding Feng*¹  Ekaterina Khmelnitskaya*²  Denis Nekipelov*²
Abstract

Discrete choice models with unobserved heterogeneity are commonly used Econometric models for dynamic Economic behavior which have been adopted in practice to predict the behavior of individuals and firms, from schooling and job choices to strategic decisions in market competition. These models feature optimizing agents who choose among a finite set of options in a sequence of periods and receive choice-specific payoffs that depend both on variables that are observed by the agent and recorded in the data and on variables that are only observed by the agent but not recorded in the data. Existing work in Econometrics assumes that optimizing agents are fully rational and requires finding a functional fixed point to find the optimal policy. We show that in an important class of discrete choice models the value function is globally concave in the policy. That means that simple algorithms that do not require fixed point computation, such as the policy gradient algorithm, globally converge to the optimal policy. This finding can both be used to relax the behavioral assumption regarding the optimizing agents and to facilitate Econometric analysis of dynamic behavior. In particular, we demonstrate significant computational advantages of using a simple implementation of the policy gradient algorithm over existing "nested fixed point" algorithms used in Econometrics.

1. Introduction

The dynamic discrete choice model with unobserved heterogeneity is, arguably, the most popular model currently used for Econometric analysis of the dynamic behavior of individuals and firms in Economics and Marketing (e.g. see surveys in (Eckstein and Wolpin, 1989), (Dubé et al., 2002), (Abbring and Heckman, 2007), (Aguirregabiria and Mira, 2010)). In this model, pioneered in Rust (1987), the agent chooses between a discrete set of options in a sequence of discrete time periods to maximize the expected cumulative discounted payoff. The reward in each period is a function of the state variable, which follows a Markov process and is observed in the data, and also a function of an idiosyncratic random variable that is only observed by the agent but is not reported in the data. The unobserved idiosyncratic component is designed to reflect the heterogeneity of agents that may value the same choice differently.

Despite significant empirical success in the prediction of dynamic economic behavior under uncertainty, dynamic discrete choice models frequently lead to seemingly unrealistic optimization problems that economic agents need to solve. For instance, (Hendel and Nevo, 2006a) features an elaborate functional fixed point problem with constraints, which is computationally intensive, especially in continuous state spaces, for consumers to buy laundry detergent in the supermarket.

At the same time, the rich literature on Markov Decision Processes (cf. Sutton and Barto, 2018) has developed several effective optimization algorithms, such as the policy gradient algorithm and its variants, that do not require solving for a functional fixed point. The high-level takeaway message "use policy gradients" for dynamic discrete choice models has been discussed in the machine learning community. For example, Ermon et al. (2015) and Ho et al. (2016) justify the advantage of the policy gradient method by showing that Dynamic Discrete Choice models are equivalent to Maximum Entropy IRL models under some conditions with an

*Equal contribution. ¹Department of Computer Science, Northwestern University, Evanston, IL, USA. ²Department of Economics, University of Virginia, Charlottesville, VA, USA. Correspondence to: Y.F.
(and possibly running out of) the good tomorrow. Assumption 2.5 suggests monotonicity of the preferences of the decision maker over states. That is, higher values of the state are associated with higher utility, which would be true in all of the situations described in Assumption 2.4: the larger the difference between the maximum mileage and the current mileage, the cheaper it is to maintain the bus; the larger the salary/the higher the inventory, the better off the person is.

Consider a myopic policy δ̄(s) = (exp(r(s)) + 1)⁻¹ which uses threshold π̄(s) = r(s). This policy corresponds to the agent optimizing the immediate reward without considering how current actions impact future rewards. Under Assumption 2.4 and Assumption 2.5, the threshold for the optimal policy is at least the threshold of the myopic policy, i.e., π̃(s) ≥ π̄(s). Hence, Lemma 2.1 holds.

Lemma 2.1. The optimal policy δ̃ chooses action 0 with weakly lower probability than the myopic policy δ̄ in all states s ∈ S, i.e., δ̃(s) ≤ δ̄(s).

2.3. MDP in Economics and Policy Gradient

Our motivation in this paper comes from empirical work in Economics and Marketing where the optimizing agents are consumers or small firms who make dynamic decisions while observing the current state s and the reward r(s, a) for their choice a. These agents often have limited computational power, making it difficult for them to solve the Bellman equation to find the optimal policy. They also may have only sample access to the distribution of the Markov transition, which further complicates the computation of the optimal policy. In this context we contrast the value function iteration method, which is based on solving the fixed point problem induced by the Bellman equation, with the policy gradient method.

Value function iteration. In value function iteration, e.g., as discussed in Jaksch et al. (2010) and Haskell et al. (2016), the exact expectation in the Bellman equation (1) is replaced by an empirical estimate, and then functional iteration uses the empirical Bellman equation to find the fixed point, i.e., the optimal policy. Under certain assumptions on MDPs, one can establish convergence guarantees for value function iteration, e.g., Jaksch et al. (2010); Haskell et al. (2016). However, running these iterations may require significant computational power, which may not be practical when the optimizing agents are consumers or small firms.

Policy gradient. The policy gradient method parametrizes the policy δ_θ and updates the parameter in the direction of the gradient, i.e., θ ← θ + α ∇_θ V_{δ_θ}. Though the individuals considered in the Economic MDP models may not compute the exact gradient with respect to a policy due to having only sample access to the Markov transition, previous work has provided approaches to produce an unbiased estimator of the gradient. For example, REINFORCE (Williams, 1992) updates the policy by θ ← θ + α R ∇_θ log(δ_θ(a|s)), where R is the long-term payoff on the path. Notice that this updating rule is simple compared with value function iteration. The caveat of the policy gradient approach is the lack of a global convergence guarantee for a generic MDP. In this paper we show that such a guarantee can be provided for the specific class of MDPs that we consider.

3. Warm-up: Local Concavity of the Value Function at the Optimal Policy

To understand the convergence of the policy gradient, in this section we introduce our main technique and show that concavity of the value function with respect to policies is satisfied in a fixed neighborhood around the optimal policy. We rely on the special structure of the value function induced by random shocks, which essentially "smooth" it, making it differentiable. We then use Bellman equation (2) to compute strong Fréchet functional derivatives of the value functions and argue that the respective second derivative is negative at the optimal policy. We use this approach in Section 4 to show global concavity of the value function with respect to policies.

By ∆ we denote the convex compact set that contains all continuous functions δ : S → [0, 1] such that 0 ≤ δ(·) ≤ δ̄(·). The Bellman equation (2) defines the functional V_δ(·). Recall that the Fréchet derivative of the functional V_δ(·), which maps the bounded linear space ∆ into the space of all continuous bounded functions of s, at a given δ(·) is a bounded linear functional DV_δ(·) such that for all continuous h(·) with ‖h‖₂ ≤ H̄:

V_{δ+h}(·) − V_δ(·) = DV_δ(·) h(·) + o(‖h‖₂).

When the functional DV_δ(·) is itself Fréchet differentiable, we refer to its Fréchet derivative as the second Fréchet derivative of V_δ(·) and denote it D²V_δ(·).

Theorem 3.1. The value function V_δ is twice Fréchet differentiable with respect to δ at the choice probability δ̃ corresponding to the optimal policy, and its second Fréchet derivative is negative at δ̃ in all states s, i.e., D²V_{δ̃}(s) ≤ 0.

We sketch the proof idea of Theorem 3.1 and defer its formal proof to Supplementary Material. Starting with the Bellman equation (2) for the value function, the Fréchet derivative of the value function is the fixed point of the following Bellman equation

DV_δ(s) = (log(1 − δ(s)) − log(δ(s)) − r(s)) + β (E_{s′}[V_δ(s′)|s, 0] − E_{s′}[V_δ(s′)|s, 1]) + β E_{s′}[DV_δ(s′)|s],   (3)

and

D²V_δ(s) = − 1/(δ(s)(1 − δ(s))) − 2β E_{s′}[DV_δ(s′)|s, 1] + 2β E_{s′}[DV_δ(s′)|s, 0] + β E_{s′}[D²V_δ(s′)|s].   (4)

A necessary condition for the optimum yielding δ̃ is DV_{δ̃}(s) = 0 for all states s. As a result, equation (4) implies that the second Fréchet derivative is negative for all states, i.e., D²V_{δ̃}(s) ≤ 0.

The Bellman equation (4) of the second Fréchet derivative suggests that D²V_δ(s) ≤ 0 for all states s if

2β (E_{s′}[DV_δ(s′)|s, 0] − E_{s′}[DV_δ(s′)|s, 1]) ≤ 1/(δ(s)(1 − δ(s))).   (5)

The first term in inequality (5) is always positive for all policies in ∆, but the second term can be arbitrarily small. In the next section, we introduce a natural smoothness assumption on the MDP (i.e., Lipschitz MDP) and show that the local concavity can be extended to global concavity, which implies that the policy gradient algorithm for our problem converges globally under this assumption.

4. Global Concavity of the Value Function

In this section, we introduce the notion of the Lipschitz Markov decision process and the Lipschitz policy space. We then restrict our attention to this subclass of MDPs. Our main result shows that the optimal policy belongs to the Lipschitz policy space and that the policy gradient globally converges in that space. We defer all proofs of the results in this section to Supplementary Material.

4.1. Lipschitz Markov Decision Process

A Lipschitz Markov decision process has the property that for two state-action pairs that are close with respect to the Euclidean metric in S, their immediate rewards r and Markovian transitions P should be close with respect to the Kantorovich or L1-Wasserstein metric. The Kantorovich metric is, arguably, the most common metric used in the analysis of MDPs (cf. Hinderer, 2005; Rachelson and Lagoudakis, 2010; Pirotta et al., 2015).

Definition 4.1 (Kantorovich metric). For any two probability measures p, q, the Kantorovich metric between them is

K(p, q) = sup_f { ∫_X f d(p − q) : f is 1-Lipschitz continuous }.

Definition 4.2 (Lipschitz MDP). A Markov decision process is (L_r, L_p)-Lipschitz if

|r(s) − r(s′)| ≤ L_r |s − s′|  ∀s, s′ ∈ S,
K(p(·|s, a), p(·|s′, a′)) ≤ L_p (|s − s′| + |a − a′|)  ∀s, s′ ∈ S, a, a′ ∈ A.

4.2. Characterization of the Optimal Policy

Our result in Section 3 demonstrates that the second Fréchet derivative of V_δ with respect to δ is negative for a given policy δ when inequality (5) holds. To bound the second term of (5) from below, i.e., E_{s′}[DV_δ(s′)|s, 0] − E_{s′}[DV_δ(s′)|s, 1], it is sufficient to show that the Fréchet derivative DV_δ(·) is Lipschitz-continuous. Even though we already assume that the Markov transition is Lipschitz, it is still possible that DV_δ is not Lipschitz: Bellman equation (3) for DV_δ depends on the policy δ(s) via log(1 − δ(s)) − log(δ(s)), which can be non-Lipschitz in the state s for general policies δ. Therefore, to guarantee Lipschitzness of the Fréchet derivative of the value function it is necessary to restrict attention to the space of Lipschitz policies. In this subsection, we show that this restriction is meaningful since the optimal policy is Lipschitz.

Theorem 4.1. Given an (L_r, L_p)-Lipschitz MDP, the optimal policy δ̃ satisfies

|log((1 − δ̃(s))/δ̃(s)) − log((1 − δ̃(s†))/δ̃(s†))| ≤ (L_r + 2βR_max L_p/(1 − β)) |s − s†|

for all states s, s† ∈ S, where R_max = max_{s∈S} r(s) is the maximum of the immediate reward r over S.

4.3. Concavity of the Value Function with respect to Lipschitz Policies

In this subsection, we present our main result showing global concavity of the value function for our specific class of Lipschitz MDPs with unobserved heterogeneity over the space of Lipschitz policies.

Definition 4.3. Given an (L_r, L_p)-Lipschitz MDP, define its Lipschitz policy space ∆ as

∆ = { δ : δ(s) ≤ δ̄(s) ∀s ∈ S, and |log((1 − δ(s))/δ(s)) − log((1 − δ(s†))/δ(s†))| ≤ (L_r + 2βR_max L_p/(1 − β)) |s − s†| ∀s, s† ∈ S },

where δ̄ is the myopic policy.

Theorem 4.1 and Lemma 2.1 imply that the optimal policy δ̃ lies in this Lipschitz policy space ∆ for any Lipschitz MDP.

Definition 4.4 (Condition for global convergence). We say that an (L_r, L_p)-Lipschitz MDP satisfies the sufficient condition for global convergence if

(2βL_p/(1 − 2βL_p)) (2L_r + 4βR_max L_p/(1 − β)) ≤ (exp(R_min) + 1)²/exp(R_min)  and  2βL_p < 1.   (6)

Definition 4.4 is a general theoretical condition for global convergence for the dynamic discrete choice model under unobserved heterogeneity. It mainly characterizes the relation between the Lipschitz smoothness, the range of the reward, and the upper bound on the discount factor. For a specific MDP (i.e., the Rust model studied in the experiment section), directly using Definition 4.4 will ensure convergence for a discount factor less than 0.5 without any additional restrictions on the utility model. However, the guarantee can be improved with some additional knowledge of the structure of the model, or of the optimal policy. In the experiment section (Section 5), since the optimal policy in the Rust model is monotone, by further restricting to the policy space of all monotone policies, our experiments suggest that convergence happens even for a discount factor equal to 0.9 or 0.99.

With the condition for global convergence (Definition 4.4), we develop the main result of our paper.

Theorem 4.2. Given an (L_r, L_p)-Lipschitz MDP which satisfies the condition for global convergence (6), the value function V_δ is concave with respect to the policy δ in the Lipschitz policy space ∆, i.e., D²V_δ(s) ≤ 0 for all s ∈ S, δ ∈ ∆.

The main technique used in the proof of Theorem 4.2 is to develop a Lipschitz smoothness bound on the Fréchet derivative of the respective value function DV_δ(·) by introducing a contracting mapping and applying the fixed point theorem to it. We defer its formal proof to Supplementary Material.

4.4. The Rate of Global Convergence of the Policy Gradient Algorithm

In this subsection, we establish the rate of global convergence of a simple version of the policy gradient algorithm assuming oracle access to the Fréchet derivative of the value function. While this analysis provides only a theoretical guarantee, as discussed in Section 2.3, in practice the individuals are able to produce an unbiased estimator of the exact gradient. As a result, the practical application of the policy gradient algorithm would only need to adjust for the impact of stochastic noise in the estimator.

Since we assume that individuals know the immediate reward function r, the algorithm can be initialized at the myopic policy δ̄ with threshold π̄(s) = r(s), which is in the Lipschitz policy space ∆. From Lemma 2.1 it follows that the myopic policy is pointwise in S greater than the optimal policy, i.e., δ̃(s) ≤ δ̄(s). Consider the policy δ̲ with threshold π̲(s) = r(s) + (β/(1 − β)) R_max − (β/2) R_min. Note that Bellman equation (2) implies that V(s) is between R_min and R_max for all states s. Thus, the policy δ̲ pointwise bounds the optimal policy δ̃ from below, i.e., δ̲(s) ≤ δ̃(s). Our convergence rate result applies to the policy gradient within the bounded Lipschitz policy set ∆̂.

Definition 4.5. Given an (L_r, L_p)-Lipschitz MDP, define its bounded Lipschitz policy space ∆̂ as

∆̂ = { δ : δ̲(s) ≤ δ(s) ≤ δ̄(s) ∀s ∈ S, and |log((1 − δ(s))/δ(s)) − log((1 − δ(s†))/δ(s†))| ≤ (L_r + 2βR_max L_p/(1 − β)) |s − s†| ∀s, s† ∈ S }.

For simplicity of notation, we introduce constants m and M which depend only on β, R_min, R_max, L_r and L_p, and whose exact expressions are deferred to the supplementary material for this paper.

Theorem 4.3. Given an (L_r, L_p)-Lipschitz MDP which satisfies the condition for global convergence (6) and the constants m and M defined above, for any step size α ≤ 1/M, the policy gradient initialized at the myopic policy δ̄ and updating as δ ← δ + α ∇_δ V_δ in the bounded Lipschitz policy space ∆̂ produces after k iterations a policy δ^(k) satisfying

V_{δ̃}(s) − V_{δ^(k)}(s) ≤ (1 − αm)^k / (2 (exp(R_min) + 1)²)  at all s ∈ S.

Our analysis follows the standard steps establishing convergence of the conventional gradient descent algorithm, which bounds the second Fréchet derivative of the value function V_δ with respect to the policy δ from above and from below by m and M respectively. The formal proof is deferred to Supplementary Material.

5. Empirical Application

To demonstrate the performance of the algorithm, we use the data from Rust (1987), which became the standard benchmark for the Econometric analysis of MDPs. Even the most recent Econometric papers on single-agent dynamic decision-making use this setup to showcase their results (e.g. Arcidiacono and Miller, 2011; Aguirregabiria and Magesan, 2016; Müller and Reich, 2018).

The paper estimates the cost associated with maintaining and replacing bus engines using data from the maintenance records of the Madison Metropolitan Bus Company over the course of 10 years (December 1974 to May 1985). The data contains monthly observations on the mileage of each bus as well as the dates of major maintenance events (such as bus engine replacement).

Table 1. Parameter values from Rust (1987).

  Parameter                  Value
  RC                         11.7257
  θ1                         0.001 × 2.45569
  (θ20, θ21, θ22, θ23)       (0.0937, 0.4475, 0.4459, 0.0127)
  β                          0.99

We use the "lazy projection" method to guarantee that the search stays in the Lipschitz policy space. The policy space is parametrized by the vector of thresholds (π1, …, πN).
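The threshold parametrization and its myopic initialization can be sketched as follows. The state grid and cost parameter here are hypothetical, and the sign convention tying the probability δ(s) = L(π(s)) to a particular action is an assumption of this sketch; the paper's own convention is stated via δ̄(s) = (exp(r(s)) + 1)⁻¹.

```python
import numpy as np

def L(x):
    # logistic function mapping a threshold to a choice probability
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical discretized state grid and a Rust-style linear operating cost
# c(s) = theta1 * s; these numbers are illustrative, not the paper's estimates.
theta1 = 0.0025
states = np.linspace(0.0, 400.0, 9)   # (s_1, ..., s_N)
r_keep = -theta1 * states             # immediate reward of the "keep" action

# Myopic initialization: thresholds equal the immediate reward, pi_bar(s) = r(s)
pi = r_keep.copy()
delta = L(pi)                         # choice probabilities implied by the thresholds
```

Because the immediate reward is monotone in the state, the myopic thresholds, and hence the implied choice probabilities, form a monotone vector, which is the property the lazy projection step later maintains.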
Rust (1987) assumes that the engine replacement decisions follow an optimal stopping policy derived from solving a dynamic discrete choice model of the type that we described earlier. Using this assumption and the data, he estimates the cost of operating a bus as a function of the running mileage as well as the cost of replacing the bus engine. We use his estimates of the parameters of the return function and the state transition probabilities (bus mileage) to demonstrate convergence of the gradient descent algorithm.

In (Rust, 1987) the state s_t is the running total mileage of the bus accumulated by the end of period t. The immediate reward is specified as a function of the running mileage as:

r(s_t, a, θ1) = −RC + ε_t1, if a = 1;  −c(s_t, θ1) + ε_t0, if a = 0,

where RC is the cost of replacing the engine and c(s_t, θ1) is the cost of operating a bus that has s_t miles.³ Following Rust (1987), we take c(s_t, θ1) = θ1 s_t. Further, as in the original paper, we discretize the mileage into an even grid of 2,571 intervals. Given the observed monthly mileage, (Rust, 1987) assumes that transitions on the grid can only be of increments 0, 1, 2, and 3. Therefore, the transition process for the discretized mileage is fully specified by just four parameters θ2j = Pr[s_{t+1} = s_t + j | s_t, a = 0], j = 0, 1, 2, 3. Table 1 describes the parameter values that we use directly from Rust (1987).

We use the gradient descent algorithm to update the policy threshold π: ε_t1 + π ≥ 0 ⇒ a = 1, where a = 1 denotes the decision to replace the engine. We set the learning rate using the RMSprop method.⁴

The thresholds (π1, …, πN) correspond to the discretized state space (s1, …, sN). The algorithm is initialized at the myopic policy, i.e. π1^(0) = u(s1), …, πN^(0) = u(sN). At step k the algorithm updates the thresholds to the values πi^(k*) = πi^(k−1) − α D_{δ^(k−1)}V(si) L(πi^(k−1))(1 − L(πi^(k−1))), where L(·) is the logistic function and the policy is δj^(k) = L(πj^(k)) for j = 1, …, N. To make the "lazy projection," the updated values πi^(k*) are adjusted to the closest monotone set of values π1^(k) ≤ π2^(k) ≤ … ≤ πN^(k). The algorithm terminates at the step k where the norm max_i |DV_{δ^(k)}(si)| ≤ τ for a given tolerance τ.⁵ The formal definition of the lazy projection is Algorithm 1.

Algorithm 1 "Lazy projection". Input: (π1, …, πN): thresholds corresponding to the discretized state space (s1, …, sN); L(·): logistic function; policy δj = L(πj); α: step size; τ: termination tolerance
1: π1^(0) ← u(s1), …, πN^(0) ← u(sN)  // Initialize π at the myopic policy
2: while max_i |DV_{δ^(k)}(si)| > τ do
3:   πi^(k*) ← πi^(k−1) − α D_{δ^(k−1)}V_{δ^(k−1)}(si) L(πi^(k−1))(1 − L(πi^(k−1))) for all i ∈ [N]
4:   (π1^(k), …, πN^(k)) ← the closest monotone thresholds to (π1^(k*), …, πN^(k*))  // Lazy projection
5: end while
6: return (π1^(k), …, πN^(k))

Figure 1 demonstrates the convergence properties of the considered version of the policy gradient algorithm. We used the "oracle" versions of the gradient and the value function that were obtained by solving the corresponding Bellman equations. We initialized the algorithm using the myopic threshold π̄(s) = −RC + c(s, θ1), with the convergence criterion based on the value max_i |DV_δ(si)|.⁶

³ Although at first glance the setup in (Rust, 1987) does not directly correspond to the model described in the theoretical section of the paper (which is a more classic presentation for models of this type in the Economics literature), it can be recast in a way so that the two models are aligned. The reason we did not do that is that we wanted to preserve the original notation of Rust's paper.
⁴ We use standard parameter values for the RMSprop method: β = 0.1, ν = 0.001 and ε = 10⁻⁸. The performance of the method was very similar to that when we used ADAM to update the threshold values.
⁵ To optimize the performance of the method it is also possible to consider a mixed norm of the form max_i |π^(k)(si) − π^(k−1)(si)| + λ max_i |DV_{δ^(k)}(si)| ≤ τ for some calibrated weight λ. This choice would control both the rate of decay of the gradient and the advancement of the algorithm in adjusting the thresholds.
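Algorithm 1 can be sketched compactly in code. The Fréchet-derivative oracle DV_δ(s_i) is passed in as a function argument `dV` (in the experiments it comes from solving the corresponding Bellman equations), the monotone "lazy" projection is implemented with the standard pool-adjacent-violators routine, and for simplicity the sketch uses a constant step size α rather than RMSprop.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pava_increasing(y):
    """Pool-adjacent-violators: closest (in L2) nondecreasing sequence to y."""
    out = []  # merged blocks, each stored as [mean, size]
    for v in map(float, y):
        out.append([v, 1])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, n2 = out.pop()
            m1, n1 = out.pop()
            out.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    res = []
    for m, n in out:
        res.extend([m] * n)
    return np.array(res)

def lazy_projection(pi0, dV, alpha=0.1, tol=1e-3, max_iter=500):
    """Sketch of Algorithm 1: gradient step on the thresholds through the
    logistic link, then projection onto monotone threshold vectors.
    `dV(pi)` stands in for the oracle DV_delta(s_i) at the current policy."""
    pi = np.array(pi0, dtype=float)
    for _ in range(max_iter):
        g = dV(pi)
        if np.max(np.abs(g)) <= tol:       # termination on the gradient norm
            break
        d = logistic(pi)
        pi_star = pi - alpha * g * d * (1.0 - d)   # chain rule through L(.)
        pi = pava_increasing(pi_star)              # lazy projection
    return pi
```

With a toy quadratic surrogate for the derivative, e.g. dV(π) = π − π*, the iterates converge to a monotone target π*; with the oracle derivative from the Bellman equations the routine reproduces the update in Algorithm 1 up to the learning-rate schedule.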
In the original model in (Rust, 1987), the discount factor used when estimating the parameters of the cost function was very close to 1. However, the performance of the algorithm improves drastically when the discount factor is reduced. This feature is closely related to the Hadamard stability of the solution of the Bellman equation (e.g. observed in Bajari et al., 2013) and is not algorithm-specific. In all of the follow-up analysis by the same author (e.g. Rust, 1996) the discount factor is set to the more moderate values of .99 or .9, indicating that these performance issues were indeed observed with the settings in (Rust, 1987). Figure 1 (resp. Figure 3) illustrates the performance of the algorithm for the case where the discount factor is set to 0.99 (resp. 0.9).⁷ For the same convergence criterion, the algorithm converges much faster.

⁶ The particular tolerance value used was 0.03 for illustrative purposes.
⁷ When we reduce the cost of replacing the engine along with the discount factor, which ensures that there is significant variation in the threshold values across states, convergence is improved even further.

Acknowledgements

Denis Nekipelov was supported in part by NSF grant CCF-1563708. The authors thank anonymous reviewers for their insightful discussions and suggestions.

References

Abbring, J. H. and Heckman, J. J. (2007). Econometric evaluation of social programs, part iii: Distributional treatment effects, dynamic treatment effects, dynamic discrete choice, and general equilibrium policy evaluation. Handbook of Econometrics, 6:5145–5303.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.

Aguirregabiria, V. and Magesan, A. (2016). Solution and estimation of dynamic discrete choice structural models using Euler equations. Available at SSRN 2860973.

Aguirregabiria, V. and Mira, P. (2007). Sequential estimation of dynamic discrete games. Econometrica, 75(1):1–53.

Aguirregabiria, V. and Mira, P. (2010). Dynamic discrete choice structural models: A survey. Journal of Econometrics, 156(1):38–67.

Arcidiacono, P. and Miller, R. A. (2011). Conditional choice probability estimation of dynamic discrete choice models with unobserved heterogeneity. Econometrica, 79(6):1823–1867.

Bajari, P., Hong, H., and Nekipelov, D. (2013). Game theory and econometrics: A survey of some recent research. In Advances in Economics and Econometrics, 10th World Congress, volume 3, pages 3–52.

Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.

Card, D. (2001). Estimating the return to schooling: Progress on some persistent econometric problems. Econometrica, 69(5):1127–1160.

Dubé, J.-P., Chintagunta, P., Petrin, A., Bronnenberg, B., Goettler, R., Seetharaman, P., Sudhir, K., Thomadsen, R., and Zhao, Y. (2002). Structural applications of the discrete choice model. Marketing Letters, 13(3):207–220.

Eckstein, Z. and Wolpin, K. I. (1989). The specification and estimation of dynamic stochastic discrete choice models: A survey. The Journal of Human Resources, 24(4):562–598.

Ermon, S., Xue, Y., Toth, R., Dilkina, B., Bernstein, R., Damoulas, T., Clark, P., DeGloria, S., Mude, A., Barrett, C., et al. (2015). Learning large-scale dynamic discrete choice models of spatio-temporal preferences with application to migratory pastoralism in East Africa. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1466–1475.

Haskell, W. B., Jain, R., and Kalathil, D. (2016). Empirical dynamic programming. Mathematics of Operations Research, 41(2):402–429.

Hendel, I. and Nevo, A. (2006a). Measuring the implications of sales and consumer inventory behavior. Econometrica, 74(6):1637–1673.

Hendel, I. and Nevo, A. (2006b). Measuring the implications of sales and consumer inventory behavior. Econometrica, 74(6):1637–1673.

Hinderer, K. (2005). Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22.

Ho, J., Gupta, J., and Ermon, S. (2016). Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pages 2760–2769.
Figure 1. Convergence of gradient descent, discount factor β = 0.99
Figure 2. Performance of the norm max_i |DV_δ(s_i)| and the second derivative max_i |D²V_δ(s_i)|, discount factor β = 0.99
Figure 3. Convergence of gradient descent, discount factor β = 0.9
Figure 4. Performance of the norm max_i |DV_δ(s_i)| and the second derivative max_i |D²V_δ(s_i)|, discount factor β = 0.9
Hotz, V. J. and Miller, R. A. (1993). Conditional choice probabilities and the estimation of dynamic models. The Review of Economic Studies, 60(3):497–529. Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
Jansch-Porto, J. P., Hu, B., and Dullerud, G. (2020). Convergence guarantees of policy optimization methods for Markovian jump linear systems. arXiv preprint arXiv:2002.04090.

Müller, P. and Reich, G. (2018). Structural estimation using parametric mathematical programming with equilibrium constraints and homotopy path continuation. Available at SSRN 3303999.

Nekipelov, D., Syrgkanis, V., and Tardos, E. (2015). Econometrics for learning agents. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pages 1–18. ACM.

Pirotta, M., Restelli, M., and Bascetta, L. (2015). Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283.
Rachelson, E. and Lagoudakis, M. G. (2010). On the locality of action domination in sequential decision making.

Rust, J. (1987). Optimal replacement of GMC bus engines: An empirical model of Harold Zurcher. Econometrica: Journal of the Econometric Society, pages 999–1033.
Rust, J. (1996). Numerical dynamic programming in eco- nomics. Handbook of computational economics, 1:619– 729. Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn- ing: An introduction. MIT press.
Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Ma- chine learning, 8(3-4):229–256.