Global Concavity and Optimization in a Class of Dynamic Discrete Choice Models
Yiding Feng *1, Ekaterina Khmelnitskaya *2, Denis Nekipelov *2

*Equal contribution. 1Department of Computer Science, Northwestern University, Evanston, IL, USA. 2Department of Economics, University of Virginia, Charlottesville, VA, USA. Correspondence to: Y.F. <[email protected]>, E.K. <[email protected]>, D.N. <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Discrete choice models with unobserved heterogeneity are commonly used Econometric models for dynamic Economic behavior which have been adopted in practice to predict the behavior of individuals and firms, from schooling and job choices to strategic decisions in market competition. These models feature optimizing agents who choose among a finite set of options in a sequence of periods and receive choice-specific payoffs that depend both on variables that are observed by the agent and recorded in the data and on variables that are only observed by the agent but not recorded in the data. Existing work in Econometrics assumes that optimizing agents are fully rational and requires finding a functional fixed point to find the optimal policy. We show that in an important class of discrete choice models the value function is globally concave in the policy. This means that simple algorithms that do not require fixed point computation, such as the policy gradient algorithm, globally converge to the optimal policy. This finding can both be used to relax behavioral assumptions regarding the optimizing agents and to facilitate Econometric analysis of dynamic behavior. In particular, we demonstrate significant computational advantages in using a simple implementation of the policy gradient algorithm over existing "nested fixed point" algorithms used in Econometrics.

1. Introduction

The dynamic discrete choice model with unobserved heterogeneity is, arguably, the most popular model currently used for Econometric analysis of the dynamic behavior of individuals and firms in Economics and Marketing (e.g., see surveys in Eckstein and Wolpin (1989), Dubé et al. (2002), Abbring and Heckman (2007), Aguirregabiria and Mira (2010)). In this model, pioneered in Rust (1987), the agent chooses between a discrete set of options in a sequence of discrete time periods to maximize the expected cumulative discounted payoff. The reward in each period is a function of the state variable, which follows a Markov process and is observed in the data, and of an idiosyncratic random variable that is only observed by the agent and is not reported in the data. The unobserved idiosyncratic component is designed to reflect heterogeneity among agents that may value the same choice differently.

Despite significant empirical success in predicting dynamic economic behavior under uncertainty, dynamic discrete choice models frequently lead to seemingly unrealistic optimization problems that economic agents need to solve. For instance, the model in Hendel and Nevo (2006a) requires consumers buying laundry detergent in the supermarket to solve an elaborate constrained functional fixed point problem, which is computationally intensive, especially in continuous state spaces.

At the same time, the rich literature on Markov Decision Processes (cf. Sutton and Barto, 2018) has developed several effective optimization algorithms, such as the policy gradient algorithm and its variants, that do not require solving for a functional fixed point. The high-level takeaway message "use policy gradients" for dynamic discrete choice models has been discussed in the machine learning community.
For example, Ermon et al. (2015) and Ho et al. (2016) justify the advantage of the policy gradient method by showing that Dynamic Discrete Choice models are equivalent to Maximum Entropy IRL models under some conditions, with an algorithm close to policy gradient. However, the drawback of the policy gradient method, and of these earlier justifications, is that the value function in a generic Markov Decision problem is not concave in the policy. This means that gradient-based algorithms have no guarantee of global convergence for a generic MDP. There is a recent line of work that studies the global convergence of the policy gradient algorithm under different specific models and assumptions (e.g., Fazel et al., 2018; Agarwal et al., 2019; Bhandari and Russo, 2019; Wang et al., 2019; Jansch-Porto et al., 2020).

In this paper our main goal is to resolve the dichotomy in the empirical social science literature whereby the rationality of consumers requires them to be able to solve a computationally intensive functional fixed point problem. Our main theoretical contribution is the proof that, in the class of dynamic discrete choice models with unobserved heterogeneity, under certain smoothness assumptions, the value function of the optimizing agent is globally concave in the policy. This implies that a large set of policy gradient algorithms, which place only modest computational power requirements on the optimizing agents, have a fast convergence guarantee in the class of dynamic discrete choice models we consider. The importance of this result is twofold.

First, it offers the promise that the seemingly complicated dynamic optimization problems faced by consumers can be solved by relatively simple algorithms that do not require fixed point computation or functional optimization.¹ This means that policy gradient-style methods have an important behavioral interpretation. As a result, consumer behavior following policy gradient can serve as a behavioral assumption for estimating consumer preferences from data, one that is more natural for consumer choice settings than other assumptions that have been used in the past for the estimation of preferences (e.g., $\varepsilon$-regret learning in Nekipelov et al., 2015).

Second, and more importantly, our result showing fast convergence of the policy gradient algorithm makes it an attractive alternative to the search for the functional fixed point in this class of problems. While the goal of the Econometric analysis of data from dynamically optimizing consumers is to estimate consumer preferences by maximizing the likelihood function, this requires sequentially solving the dynamic optimization problem for each value of the utility parameters along the parameter search path. Existing work in Economics prescribes using fixed point iterations for the value function to solve the dynamic optimization problem (see Rust, 1987; Aguirregabiria and Mira, 2007). Replacing the fixed point iterations with the policy gradient method significantly speeds up the maximization of the likelihood function.

This makes the policy gradient algorithm our recommended approach for use in Econometric analysis, and establishes the practical relevance of many newer reinforcement learning algorithms from a behavioral perspective for the social sciences.

¹A conceptual advantage of policy gradient for the problems studied in Econometrics can be thought of as follows. The literature on Econometric models of dynamic Economic behavior considers the optimization of individuals or small firms for whom the transition probability of the MDP is unknown and must be learned. Due to this uncertainty, value function iteration / policy iteration methods require the construction of an empirical MDP (which might be impractical for individuals or small firms), while the stochastic policy gradient does not require estimation of such a probability.

2. Preliminaries

In this section, we introduce the concepts of the Markov decision process (MDP) with choice-specific payoff heterogeneity, the conditional choice probability (CCP) representation, and the policy gradient algorithm.

2.1. Markov Decision Process

A discrete-time Markov decision process (MDP) with choice-specific heterogeneity is defined as a tuple $\langle S, A, r, F, P, \beta \rangle$, where $S$ is a compact convex state space with $\mathrm{diam}(S) \leq \bar{S} < \infty$, $A$ is the set of actions, $r : S \times A \to \mathbb{R}_+$ is the reward function, such that $r(s,a)$ is the immediate non-negative reward for the state-action pair $(s,a)$, the choice-specific shocks $\varepsilon_{t,a}$ are independent random variables with distribution $F$, $P$ is a Markov transition model where $p(s' \mid s, a)$ defines the transition density between state $s$ and $s'$ under action $a$, and $\beta \in [0,1)$ is the discount factor for future payoff.
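To fix ideas, the following minimal Python sketch (not from the original paper) encodes these primitives for a one-dimensional state space and a binary action set, anticipating Assumption 2.1 below. All identifiers (DynamicDiscreteChoiceMDP, reward, draw_shocks, draw_next_state) and the example reward, shock, and transition specifications are hypothetical and serve only to make the notation concrete.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable


@dataclass
class DynamicDiscreteChoiceMDP:
    """Primitives of the MDP with choice-specific payoff heterogeneity.

    Hypothetical encoding of the tuple <S, A, r, F, P, beta>: the state space S
    is an interval [s_lo, s_hi] (compact and convex in R), the action space A is
    binary {0, 1}, and the shocks eps_{t,a} are i.i.d. draws from F.
    """
    s_lo: float                 # lower bound of the state space S
    s_hi: float                 # upper bound of the state space S
    reward: Callable[[float, int], float]                      # r(s, a), non-negative
    draw_shocks: Callable[[np.random.Generator], np.ndarray]   # one draw (eps_0, eps_1) ~ F
    draw_next_state: Callable[[float, int, np.random.Generator], float]  # s' ~ p(. | s, a)
    beta: float                 # discount factor, beta in [0, 1)


# Hypothetical primitives, for illustration only.
example_mdp = DynamicDiscreteChoiceMDP(
    s_lo=0.0,
    s_hi=10.0,
    reward=lambda s, a: 1.0 + 0.5 * s if a == 1 else 0.2 * s,
    draw_shocks=lambda rng: rng.gumbel(size=2),   # extreme-value shocks, a common choice
    draw_next_state=lambda s, a, rng: float(np.clip(0.9 * s + a + rng.normal(scale=0.5),
                                                    0.0, 10.0)),
    beta=0.95,
)
```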
We assume that the random variables $\varepsilon_{t,a}$ are observed by the optimizing agent and not recorded in the data. These variables reflect idiosyncratic differences in the preferences of different optimizing agents over choices. In the following discussion we refer to these variables as "random choice-specific shocks."

In each period $t = 1, 2, \ldots$, nature realizes the current state $s_t$ based on the Markov transition $P$ given the state-action pair $(s_{t-1}, a_{t-1})$ in the previous period $t-1$, and the choice-specific shocks $\varepsilon_t = \{\varepsilon_{t,a}\}_{a \in A}$ are drawn i.i.d. from the distribution $F$. The optimizing agent chooses an action $a \in A$, and her current period payoff is the sum of the immediate reward and the choice-specific shock, i.e., $r(s,a) + \varepsilon_{t,a}$. Given initial state $s_1$, the agent's long-term payoff is
$$\mathbb{E}_{\varepsilon_1, s_2, \varepsilon_2, \ldots}\left[\sum_{t=1}^{\infty} \beta^{t-1}\big(r(s_t, a_t) + \varepsilon_{t,a_t}\big)\right].$$
This expression makes it clear that the random shocks play a crucial role in this model by allowing us to define the ex ante value function of the optimizing agent, which reflects the expected reward from the agent's choices before the agent observes the realization of $\varepsilon_t$. When the distribution of the shocks $\varepsilon_t$ is sufficiently smooth (differentiable), the corresponding ex ante value function is smooth (differentiable) as well. This allows us to characterize the impact of the agent's policy on the expected value by considering functional derivatives of the value function with respect to the policy.

In the remainder of the paper, we rely on the following assumptions.

Assumption 2.1. The state space $S$ is compact in $\mathbb{R}$ and the action space $A$ is binary, i.e., $A = \{0, 1\}$.
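To illustrate how a policy gradient procedure operates on this model under Assumption 2.1, the following minimal Python sketch (again hypothetical, reusing the illustrative primitives above) runs plain Monte Carlo gradient ascent with a logistic choice probability $\sigma(\theta_0 + \theta_1 s)$ standing in for the agent's policy and a truncated horizon for simulation. The concavity result of this paper concerns the value function as a function of the policy itself, so this parametric sketch only illustrates the mechanics of gradient-based search without any fixed point computation.

```python
def rollout(mdp, theta, s1, horizon, rng):
    """Simulate one trajectory under the logistic policy; return the discounted
    payoff and the accumulated score grad log pi_theta(a_t | s_t)."""
    total, score, s = 0.0, np.zeros(2), s1
    for t in range(horizon):                                  # truncated horizon for simulation
        z = theta[0] + theta[1] * s
        p1 = 0.5 * (1.0 + np.tanh(0.5 * z))                   # pi_theta(a = 1 | s), stable sigmoid
        a = int(rng.random() < p1)
        eps = mdp.draw_shocks(rng)                            # choice-specific shocks (eps_0, eps_1)
        score += (a - p1) * np.array([1.0, s])                # grad log pi_theta(a | s), logistic policy
        total += mdp.beta ** t * (mdp.reward(s, a) + eps[a])  # discounted reward plus shock
        s = mdp.draw_next_state(s, a, rng)
    return total, score


def policy_gradient(mdp, s1, n_iter=100, n_paths=32, horizon=50, lr=0.001, seed=0):
    """Plain Monte Carlo policy gradient (REINFORCE-style) for the binary-choice MDP."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(n_iter):
        grad = np.zeros(2)
        for _ in range(n_paths):
            payoff, score = rollout(mdp, theta, s1, horizon, rng)
            grad += payoff * score                            # score-function (likelihood-ratio) estimator
        theta += lr * grad / n_paths                          # gradient ascent on the expected payoff
    return theta


theta_hat = policy_gradient(example_mdp, s1=1.0)              # estimated policy parameters
```

In practice one would add variance reduction (a baseline or reward-to-go) and tune the step size; the point of the sketch is only that the agent updates her policy directly from simulated payoffs, with no fixed point computation over the value function.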