
Entry for Daniel McFadden in the New Palgrave Dictionary of Economics

1. Introduction

Daniel L. McFadden, the E. Morris Cox Professor of Economics at the University of California at Berkeley, was the 2000 co-recipient of the Nobel Prize in Economics, awarded “for his development of theory and methods of analyzing discrete choice.” 1 McFadden was born in North Carolina, USA in 1937, received a B.S. in Physics from the University of Minnesota (with highest honors) in 1956, and received a Ph.D. degree in Economics from Minnesota in 1962. His academic career began as a postdoctoral fellow at the University of Pittsburgh. In 1963 he was appointed assistant professor of economics at the University of California at Berkeley, and he was tenured in 1966. He has also held tenured appointments at Yale (as Research Professor in 1977) and at the Massachusetts Institute of Technology (from 1978 to 1991). In 1990 he was awarded the E. Morris Cox chair at the University of California at Berkeley, where he has also served as Department Chair and as Director of the Econometrics Laboratory.

2. Research Contributions

McFadden is best known for his fundamental contributions to the theory and econometric methods for analyzing discrete choice. Building on a highly abstract, axiomatic literature on probabilistic choice theory that originated in mathematical psychology, due to Thurstone (1927), Luce (1959), Block and Marschak (1960), and others, McFadden developed the econometric methodology for estimating the utility functions underlying probabilistic choice theory. McFadden’s primary contribution was to provide the econometric tools that permitted widespread practical empirical application of discrete choice models, in economics and other disciplines.

According to his autobiography, 2 “In 1964, I was working with a graduate student, Phoebe Cottingham, who had data on freeway routing decisions of the California Department of Transportation, and was looking for a way to analyze these data to study institutional decision-making behavior. I worked out for her an econometric model based on an axiomatic theory of choice behavior developed by the psychologist Duncan Luce. Drawing upon the work of Thurstone and Marschak, I was able to show how this model linked to the economic theory of choice behavior.

1 The prize was split with James J. Heckman, awarded “for his development of theory and methods for analyzing selective samples.” Quotes are from the Nobel prize summary at nobelprize.org.

2 From the Nobel prize web site, nobelprize.org.


These developments, now called the multinomial logit model and the random utility model for choice behavior, have turned out to be widely useful in economics and other social sciences. They are used, for example, to study travel modes, choice of occupation, brand of automobile purchase, and decisions on marriage and number of children.”

This understates the huge impact that the discrete choice literature has had on the social sciences, and is characteristic of McFadden’s modesty. Thousands of papers applying his techniques have been published since his path-breaking papers, “Conditional Logit Analysis of Qualitative Choice Behavior” (1973) and “The Revealed Preferences of a Government Bureaucracy: Empirical Evidence” (1976). In December 2005, a search of the term “discrete choice” using the Google search engine yielded 10,200,000 entries, and a search on the Google Scholar search engine (which limits the search to academic articles) returned 759,000 items.

Besides the discrete choice literature itself, McFadden’s work has spawned a number of related literatures in econometrics, theory, and industrial organization that are among the most active and productive parts of the economic literature in the present day. This includes work in game theory and industrial organization (e.g. the work on discrete choice and product differentiation of Anderson, De Palma and Thisse (1992), the estimation of discrete games of incomplete information, Bajari, Hong, Krainer and Nekipelov (2005), and discrete choice modeling in the empirical industrial organization literature, Berry, Levinsohn and Pakes (1995) and Goldberg (1995)), the econometric literature on semiparametric estimation of discrete choice models (Manski (1985), McFadden and Train (2000)), the literature on discrete/continuous choice models and its connection to durable goods and energy demand modeling (Dagsvik (1994), Dubin and McFadden (1984), Hannemann (1984)), the econometric literature on choice based and stratified sampling (Cosslett (1981), Manski and McFadden (1981)), the econometric literature on “simulation estimation” (Lerman and Manski (1981), McFadden (1994), Hajivassiliou and Ruud (1994), Pakes and Pollard (1989)), and the work on structural estimation of dynamic discrete choice models and extensions thereof (Dagsvik (1983), Eckstein and Wolpin (1989), Heckman (1981), Rust (1994)). These are only some of the fields that have been hugely influenced by McFadden’s contributions to discrete choice and econometrics: given space constraints I have not attempted to survey other fields that have benefited from McFadden’s contributions (e.g. production economics, McFadden (1978)).

In order to give the reader an appreciation for the elegance and generality of McFadden’s contributions, I will provide a brief synopsis of the theory and econometrics of discrete choice, following the treatment in McFadden’s (1981) paper “Econometric Models of Probabilistic Choice.” The underlying theory is superficially rather simple: an agent chooses a single alternative d from a finite set D(x) of mutually exclusive alternatives to maximize a well defined utility function. Agents’ choices as well as the choice set D(x) may vary across agents depending on values of a vector x that can reflect state-dependent or agent-dependent factors or choice constraints, similar to the way a consumer budget set depends on quantities such as income and prices in standard (continuous) consumer theory. The vector x can include characteristics (or “attributes”) of the alternatives in the choice set D(x). The vector x can also include agent characteristics such as income, age, sex, education and so forth. There is great flexibility in specifying how alternative-specific and agent-specific characteristics affect decisions, and McFadden was one of the first to appreciate the huge potential probabilistic choice theory offered for empirical work.

In order to appreciate McFadden’s contribution, it is useful to briefly summarize the elements of the theory on which he built, most of which originated in the literature on mathematical psychology. Fundamental to this literature is the concept of a choice probability P(d|x, D(x)), which represents the probability that an agent chooses a particular element d ∈ D(x). Psychologists emphasized the seemingly random nature of subjects’ decisions in experiments. The earliest cited work on probabilistic choice is Thurstone (1927), who described subject choices as a “discriminal process” and used the normal distribution to model the impact of random factors affecting individual decisions. Thurstone’s formula for the choice probability when there are only two possible alternatives is now known in economics as the binomial probit model. The mathematical psychology literature also gave considerable attention to how the choice probability depends on the choice set D(x), since a major focus of the theory was to explain the behavior of subjects in laboratory experiments where choice sets can be controlled. The initial work was relatively abstract and axiomatic, and attention focused on determining what restrictions, if any, are placed on choice probability functions (over and above satisfying the ordinary laws of probability) by the random utility maximization (RUM) model. More precisely, what are the necessary and sufficient conditions for a choice probability P(d|x, D(x)) to be consistent with random utility maximization, where P is given by

$$P(d \mid x, D(x)) = \Pr\left\{\tilde u_d \ge \tilde u_{d'},\ d' \in D(x)\right\}, \qquad (2.1)$$

and $\{\tilde u_d \mid d \in D(x)\}$ is a collection of random variables representing the random utility values of the alternatives in D(x)? This is the discrete choice analog of the integrability problem in consumer theory, i.e. what are the necessary and sufficient conditions for a system of demand equations to be derivable from some underlying utility function?

Duncan Luce’s (1959) book Individual Choice Behavior introduced the axiom of independence from irrelevant alternatives (IIA). Luce showed that if this axiom holds, choice probabilities must have a multinomial logit (MNL) representation. That is, there exist functions {u(x, d)|d ∈ D(x)} such that

$$P(d \mid x, D(x)) = \frac{\exp\{u(x,d)\}}{\sum_{d' \in D(x)} \exp\{u(x,d')\}}. \qquad (2.2)$$

The IIA axiom states that the odds of choosing alternative d over alternative d′ are not changed if the choice set is enlarged. That is, if d, d′ ∈ D(x) ⊂ E(x), then

$$\frac{P(d \mid x, D(x))}{P(d' \mid x, D(x))} = \frac{P(d \mid x, E(x))}{P(d' \mid x, E(x))}. \qquad (2.3)$$

The IIA axiom is a strong restriction that may not always be empirically plausible, for reasons noted in Debreu’s (1960) review of Luce’s book. McFadden introduced an alternative example, which he called the “red bus/blue bus problem,” that illustrates potential problems with using the MNL to forecast how agents will respond to changes in their choice sets. 3 Block and Marschak (1960) did not impose the IIA axiom, and derived a general necessary condition for a choice probability to be consistent with random utility maximization, i.e. conditions for {P(d|x, D(x))|d ∈ D(x)} to have the representation in equation (2.1) for some collection of random variables $\{\tilde u_d \mid d \in D(x)\}$. Falmagne (1978) showed the Block-Marschak condition was also a sufficient condition for the random utility representation to hold. 4

3 Consider a commuter who initially has only two alternatives for commuting to work: walking (d = w) and taking the bus (d = b), so D(x) = {w, b}. Suppose the individual is indifferent between walking and taking the bus; then u(x, w) = u(x, b), and we see from the logit formula that the IIA axiom implies P(w|x) = P(b|x) = 1/2. Now suppose we introduce a third “irrelevant” alternative, a “red bus,” that is in every way identical to the existing bus alternative, the “blue bus.” That is, imagine that a red and a blue bus are always both waiting at the bus stop, and the commuter can always walk as well. Denoting this new third alternative d = r, we have u(x, r) = u(x, b) = u(x, w): the commuter is indifferent between taking the red bus and the blue bus, and continues to be indifferent between walking and taking the bus. The IIA axiom then predicts that the choice probabilities in this situation will be P(r|x) = P(b|x) = P(w|x) = 1/3. However, Debreu argued that this is not a plausible prediction of the impact of the new alternative: the existence of the new red bus alternative should not affect the probability of walking, so we should continue to have P(w|x) = 1/2 when the new, “irrelevant” red bus alternative is introduced, and since the commuter is indifferent between taking a red or blue bus, we should have P(b|x) = P(r|x) = 1/4. Thus, Debreu argued that Luce’s IIA axiom implies an intuitively implausible prediction of the impact of introducing a new alternative into an agent’s choice set, at least in situations where the new alternative is essentially identical to one of the existing alternatives.
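To make the footnote’s arithmetic concrete, here is a minimal numerical sketch (the zero utilities simply encode the assumed indifference; everything else is illustrative):

```python
import numpy as np

def mnl_probs(utilities):
    """Multinomial logit choice probabilities, equation (2.2)."""
    expu = np.exp(utilities - np.max(utilities))  # subtract max for numerical stability
    return expu / expu.sum()

# Commuter indifferent among all alternatives: u = 0 for each one.
print(mnl_probs(np.zeros(2)))  # {walk, blue bus}          -> [0.5, 0.5]
print(mnl_probs(np.zeros(3)))  # {walk, blue bus, red bus} -> [1/3, 1/3, 1/3]
# Debreu's plausible answer (1/2, 1/4, 1/4) is unattainable under IIA: no single
# set of utilities yields (1/2, 1/2) on the two-alternative set and (1/2, 1/4, 1/4)
# on the enlarged set, because IIA fixes the walk/blue-bus odds.
```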

4 More precisely, Falmagne’s Theorem states that a system of choice probabilities can be derived from some random utility model if and only if the Block-Marschak polynomials are nonnegative. This is the analog of the Slutsky conditions in standard consumer theory, i.e. that a demand system x(p, y) can be derived from a utility function if and only if it is homogeneous of degree 0 in (p, y), satisfies y = p′x(p, y), and the Slutsky matrix corresponding to x(p, y) is symmetric and negative semidefinite. I refer the reader to Block and Marschak (1960) or Falmagne (1978) for the definition of the Block-Marschak polynomials.

McFadden’s contribution to this literature was to recognize how to operationalize the random utility interpretation in an empirically tractable way. In particular, he derived the “converse” of Luce’s representation theorem, that is, he discovered a random utility interpretation of the MNL model. His other fundamental contribution was to solve an analog of the revealed preference problem: using data on the actual choices and states of a sample of agents $\{(d_i, x_i)\}_{i=1}^{N}$, he showed how it was possible to “reconstruct” their underlying random utility function. Further, he introduced a new class of multivariate distributions, the generalized extreme value (GEV) family, derived tractable formulas for the implied choice probabilities including the nested multinomial logit model, and showed that these choice probabilities do not satisfy the IIA axiom and thus relax some of the empirically implausible restrictions implied by IIA.

McFadden studied a more general specification of the random utility model in which an agent’s utility function U(x, z, d, θ) depends on variables x that the econometrician can observe, as well as variables z that the econometrician cannot observe. In addition, the utility is assumed to depend on a vector of parameters θ that are known by the agent but not by the econometrician. Under these assumptions, the solution to the revealed preference problem is equivalent to finding an estimator for θ. 5 McFadden suggested the method of maximum likelihood using the likelihood function L(θ) given by

$$L(\theta) = \prod_{i=1}^{N} P(d_i \mid x_i, D(x_i), \theta), \qquad (2.4)$$

under the assumption that the observations (d_i, x_i) are independently distributed across different agents i. McFadden showed that under appropriate regularity conditions, the maximum likelihood estimator θ̂ (the value of θ that maximizes L(θ)) is consistent and asymptotically normal, and thus provides a means for making inferences about agents’ underlying preferences. However, the maximum likelihood approach only became feasible once computationally tractable formulas were available for the choice probabilities implied by various random utility models. Deriving such formulas was perhaps McFadden’s most important contribution to the discrete choice literature.
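To illustrate how (2.4) is used in practice, here is a minimal simulation sketch (all data, and the linear-in-parameters utility u(x, d, θ) = v(x, d)′θ discussed later in the text, are assumptions of this example rather than McFadden’s original application):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, J, K = 500, 3, 2                        # agents, alternatives, parameters
V = rng.normal(size=(N, J, K))             # hypothetical attribute vectors v(x_i, d)
theta_true = np.array([1.0, -0.5])

def choice_probs(theta):
    u = V @ theta                          # systematic utilities, shape (N, J)
    u -= u.max(axis=1, keepdims=True)      # for numerical stability
    eu = np.exp(u)
    return eu / eu.sum(axis=1, keepdims=True)   # MNL formula

# Simulate observed choices from the model: the extreme value unobservables
# are Gumbel draws, so argmax of u + eps follows the MNL probabilities.
y = np.argmax(V @ theta_true + rng.gumbel(size=(N, J)), axis=1)

def neg_log_L(theta):                      # -log L(theta), equation (2.4)
    return -np.log(choice_probs(theta)[np.arange(N), y]).sum()

theta_hat = minimize(neg_log_L, np.zeros(K)).x
print(theta_hat)                           # close to theta_true for large N
```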

Assume that agents’ utility functions have the following additively separable representation

$$U(x, z, d, \theta) = u(x, d, \theta) + v(z, d). \qquad (2.5)$$

5 The conceptual distinction between z and θ, both of which are unobserved by the econometrician, is that θ is assumed to be common across agents whereas the vector z can differ from agent to agent. Thus, it is feasible to consider the problem of estimating θ by pooling data on choices made by different agents with the same θ but different idiosyncratic values of z.

Define ε(d) ≡ v(z, d). It follows that an assumption on the distribution of the random vector z implies a distribution for the random vector ε ≡ {ε(d)|d ∈ D(x)}. McFadden’s approach was to make assumptions directly about the distribution of ε, rather than making assumptions about the distribution of z and deriving the implied distribution of ε. Standard assumptions for the distribution of ε include the multivariate normal, which yields the multinomial probit variant of the discrete choice model. Unfortunately, in problems with more than two alternatives (the case that Thurstone studied), the multinomial probit model becomes intractable as the number of alternatives grows. The reason is that in order to derive the conditional choice probabilities, one must do numerical integrations whose dimension equals |D(x)|, the number of elements in the choice set. In general this multivariate integration is computationally infeasible when |D(x)| is larger than 5 or 6, using standard quadrature methods on modern computers.

McFadden introduced an alternative assumption for the distribution of ε, namely the multivariate extreme value distribution given by

$$F(z \mid x) = \Pr\{\varepsilon_d \le z_d,\ d \in D(x)\} = \prod_{d \in D(x)} \exp\left\{-\exp\{-(z_d - \mu_d)/\sigma\}\right\}, \qquad (2.6)$$

and showed (when the location parameters µ_d are normalized to 0) that the corresponding random utility model produces choice probabilities given by the multinomial logit formula

$$P(d \mid x, \theta) = \frac{\exp\{u(x,d,\theta)/\sigma\}}{\sum_{d' \in D(x)} \exp\{u(x,d',\theta)/\sigma\}}.$$

The reason why this should be true is not at all evident at first sight. However, it turns out that a key to the tractability of the logit formula is an important property of the multivariate extreme value distribution, namely, that it is max-stable: if ε_1 and ε_2 are extreme value random variables, then max(ε_1, ε_2) is also an extreme value random variable. 6 Define the Social Surplus function

$$S(\{u(x,d,\theta)\}_{d \in D(x)} \mid x) = E\left\{\max_{d \in D(x)}\left[u(x,d,\theta) + \varepsilon(d)\right]\right\}. \qquad (2.7)$$

This is the expected maximum utility, where the expectation is taken over the random utility components ε, and it can be viewed as the analog of the indirect utility function in standard (continuous)

6 Another way to say this is that the extreme value family is closed under maximization, which is analogous to the way the class of stable distributions is closed under addition.

consumer theory. 7 It turns out that the partial derivative of S with respect to u(x, d, θ) is P(d|x, θ):

$$\begin{aligned}
\frac{\partial}{\partial u(x,d,\theta)}\, S(\{u(x,d,\theta)\}_{d \in D(x)} \mid x)
&= \frac{\partial}{\partial u(x,d,\theta)}\, E\left\{\max_{d' \in D(x)}\left[u(x,d',\theta) + \varepsilon(d')\right]\right\} \\
&= E\left\{\frac{\partial}{\partial u(x,d,\theta)} \max_{d' \in D(x)}\left[u(x,d',\theta) + \varepsilon(d')\right]\right\} \\
&= \Pr\left\{d = \operatorname*{argmax}_{d' \in D(x)}\left[u(x,d',\theta) + \varepsilon(d')\right]\right\} \equiv P(d \mid x, \theta). \qquad (2.8)
\end{aligned}$$

The result in equation (2.8) is what McFadden (1981) has called the Williams-Daly-Zachary Theorem. It provides an explicit formula for choice probabilities derived from a random utility model, and it can be regarded as the analog of Roy’s Identity in standard continuous choice consumer theory.

The max-stable property of the multivariate extreme value distribution results in a closed-form expression for the Social Surplus function. Normalizing the location parameters µ_d = 0 for the random terms ε_d, it is not difficult to show that

$$S(\{u(x,d,\theta)\}_{d \in D(x)} \mid x) = \sigma\gamma + \sigma \log\left[\sum_{d \in D(x)} \exp\{u(x,d,\theta)/\sigma\}\right],$$

where $\gamma = \lim_{n\to\infty}\left[\sum_{k=1}^{n} \tfrac{1}{k} - \log(n)\right] = .5772\ldots$ is Euler’s constant. 8 Applying the Williams-Daly-Zachary Theorem, we have

$$\frac{\partial}{\partial u(x,d,\theta)}\, S(\{u(x,d,\theta)\}_{d \in D(x)} \mid x)
= \sigma \frac{\partial}{\partial u(x,d,\theta)} \log\left[\sum_{d' \in D(x)} \exp\{u(x,d',\theta)/\sigma\}\right]
= \frac{\exp\{u(x,d,\theta)/\sigma\}}{\sum_{d' \in D(x)} \exp\{u(x,d',\theta)/\sigma\}}. \qquad (2.9)$$

7 The term “Social Surplus” is probably motivated by the interpretation of each ε as indexing a different “type” of consumer, so that the expected maximized utility can be interpreted as a “social welfare function” when the distribution F(ε|x) is reinterpreted as the distribution of types in the population.

8 To derive this, note that if (ε_1, ε_2) are two independent random variables with distributions F_1(x) and F_2(x), respectively, then the distribution of max(ε_1, ε_2) is F_1(x)F_2(x). In the case where (ε_1, ε_2) are two independent extreme value random variables with common scale parameter σ and location parameters (µ_1, µ_2), then F_1(x)F_2(x) = exp{−exp{−(x − µ_1)/σ}} exp{−exp{−(x − µ_2)/σ}} = exp{−exp{−(x − µ)/σ}}, where µ = σ log[exp{µ_1/σ} + exp{µ_2/σ}]. Note that the mean of an extreme value distribution with location parameter µ and scale parameter σ is (µ + σγ), where γ is Euler’s constant.

This is McFadden’s key result: the MNL choice probability is implied by a random utility model when the random utilities have extreme value distributions. It leads to the insight that the IIA property is a consequence of the statistical independence of the random utilities. In particular, even if the observed attributes of two alternatives d and d′ are identical (which implies u(x, d, θ) = u(x, d′, θ)), the statistical independence of the unobservable components ε(d) and ε(d′) implies that alternatives d and d′ are not perfect substitutes even when their observed characteristics are identical. In many cases this is not problematic: individuals may have different idiosyncratic perceptions and preferences for two different items that have the same observed attributes. However, as in the “red bus/blue bus” example or the concert ticket example discussed by Debreu (1960), there are cases where it is plausible to believe that the observed attributes provide a sufficiently good description of an agent’s perception of the desirability of two alternatives. In such cases, the hypothesis that choices are also affected by additive, independent unobservables ε(d) provides a poor representation of an agent’s decisions. What is required in such cases is a random utility model with the property that the degree of correlation in the unobserved components of utility ε(d) and ε(d′) for two alternatives d, d′ ∈ D(x) is a function of the degree of closeness in the observed attributes. This type of dependence can be captured by a random coefficient probit model. 9
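A quick Monte Carlo check of these two results, with hypothetical utilities and σ = 1 (a sketch, not McFadden’s derivation): simulated argmax frequencies should reproduce the MNL formula, and the simulated expected maximum should reproduce the log-sum Social Surplus formula.

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 0.5, -0.2])              # hypothetical systematic utilities
U = u + rng.gumbel(size=(1_000_000, 3))     # i.i.d. extreme value unobservables

freq = np.bincount(np.argmax(U, axis=1), minlength=3) / len(U)
mnl = np.exp(u) / np.exp(u).sum()
print(freq, mnl)                            # frequencies match the MNL formula

gamma = 0.5772156649                        # Euler's constant
print(U.max(axis=1).mean(),                 # simulated E[max_d u_d + eps(d)]
      gamma + np.log(np.exp(u).sum()))      # log-sum formula with sigma = 1
```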

McFadden (1981) introduced the generalized extreme value (GEV) family of distributions. This family relaxes the independence assumption of the extreme value specification while still yielding tractable expressions for choice probabilities. The GEV distribution is given by

$$F(z \mid x) = \Pr\{\varepsilon_d \le z_d,\ d \in D(x)\} = \exp\left\{-H(\exp\{-z_1\}, \ldots, \exp\{-z_{|D(x)|}\}, x, D(x))\right\},$$

for any function H(z, x, D(x)) satisfying certain consistency properties. 10 McFadden showed that the Social Surplus function for the GEV family is given by

$$S(\{u(x,d,\theta)\}_{d \in D(x)} \mid x) = \gamma + \log\left[H(\exp\{u(x,1,\theta)\}, \ldots, \exp\{u(x,|D(x)|,\theta)\}, x, D(x))\right],$$

9 This is a random utility model of the form U(x, z, d, θ) = x_d′(θ + z), where x_d is a k × 1 vector of observed attributes of alternative d, θ is a k × 1 vector of utility weights representing the mean weights individuals in the population assign to the various attributes in x_d, and z ∼ N(0, Ω) is a k × 1 normally distributed random vector representing agent-specific deviations in their weighting of the attributes relative to the population average values θ. Under the random coefficients probit specification of the random utility model, when x_d = x_d′, alternatives d and d′ are in fact perfect substitutes for each other, and this model is able to provide the intuitively plausible prediction of the effect of introducing an irrelevant alternative (the red bus) in the red bus/blue bus problem. See, e.g. Hausman and Wise (1978).

10 Specifically, H(z, x, D(x)) must 1) be linear homogeneous in z, 2) satisfy lim_{z→∞} H(z, x, D(x)) = ∞, 3) have nonpositive even and nonnegative odd mixed partial derivatives in z, and 4) satisfy the following marginalization condition: if D(x) ⊂ E(x), and we let z_{|E(x)|} denote a vector with as many components as E(x) and (z_{|D(x)|}, 0_{|E(x)−D(x)|}) denote a vector with |E(x)| components taking the values z_d for d ∈ D(x) and 0 for d ∈ E(x) − D(x) = E(x) ∩ D(x)^c, then H(z_{|D(x)|}, x, D(x)) = H((z_{|D(x)|}, 0_{|E(x)−D(x)|}), x, E(x)). This last property ensures that the marginal distributions of a GEV distribution are also in the GEV family.

so by the Williams-Daly-Zachary Theorem, the implied choice probabilities are given by

$$P(d \mid x, \theta) = \frac{\exp\{u(x,d,\theta)\}\, H_d(\exp\{u(x,1,\theta)\}, \ldots, \exp\{u(x,|D(x)|,\theta)\}, x, D(x))}{H(\exp\{u(x,1,\theta)\}, \ldots, \exp\{u(x,|D(x)|,\theta)\}, x, D(x))},$$

where H_d(z, x, D(x)) = ∂H(z, x, D(x))/∂z_d. A prominent subclass of GEV distributions is given by H functions of the form

$$H(z, x, D(x)) = \sum_{i=1}^{n} \left[\sum_{d \in D_i(x)} z_d^{1/\sigma_i}\right]^{\sigma_i},$$

where {D_1(x), . . . , D_n(x)} is a partition of the full choice set D(x). This class of GEV distributions yields (two level) nested multinomial logit (NMNL) choice probabilities

$$P(d \mid x, \theta) = P(d \mid x, D_i(x), \theta)\, P(D_i(x) \mid x, D(x), \theta),$$

where d ∈ Di(x) and

$$P(d \mid x, D_i(x), \theta) = \frac{\exp\{u(x,d,\theta)/\sigma_i\}}{\sum_{d' \in D_i(x)} \exp\{u(x,d',\theta)/\sigma_i\}},
\qquad
P(D_i(x) \mid x, D(x), \theta) = \frac{\exp\{S_i(x)\}}{\sum_{j=1}^{n} \exp\{S_j(x)\}},$$

and

$$S_i(x) = \sigma_i \log\left[\sum_{d \in D_i(x)} \exp\{u(x,d,\theta)/\sigma_i\}\right]$$

is the Social Surplus function for the subset of choices D_i(x). The nested logit model has an interpretation as a two-stage decision process (or two-level “decision tree”). In the first stage, the agent chooses one of the n partition elements D_i(x) with probabilities determined by an ordinary logit model in which the Social Surplus values for each partition element, S_i(x), play the role of utilities. This is reasonable since S_i(x) represents the expected maximum utility of choosing an alternative d ∈ D_i(x). Then in the second stage, the agent chooses an alternative d ∈ D_i(x) according to an MNL model with utilities u(x, d, θ)/σ_i (or alternatively, a random utility model with error terms σ_i ε(d), d ∈ D_i(x)). McFadden called σ_i a “similarity parameter” since it plays the role of the scale parameter for the extreme value errors ε_d, which are independently distributed conditional on being in subset D_i(x) of D(x). As σ_i → 0, the extreme value unobservables within each partition element D_i(x) play a diminishing role and S_i(x) → max{u(x, d, θ)|d ∈ D_i(x)}, so the choice of a partition element D_i(x) is governed by an “upper level” MNL model with the utility for each partition element equal to the maximum utility over the alternatives in it. The nested logit model does not suffer from the IIA property (at least globally, although IIA does hold “locally” for alternatives d within the same partition element D_i(x)). In particular, one can specify a nested logit model that avoids the “red bus/blue bus problem” and thus results in intuitively plausible predictions of the effect of introducing an “irrelevant alternative.” 11

The NMNL model has been applied in numerous empirical studies, especially to study demand when there is an extremely large number of alternatives, such as modeling consumer choice of automobiles (e.g. Berkovec (1985), Goldberg (1995)). In many of these consumer choice problems there is a natural partitioning of the choice set in terms of product classes (e.g. luxury, compact, intermediate, sport-utility, and other classes in the case of autos). The nesting avoids the problems with the IIA property and results in more reasonable implied estimates of demand elasticities compared to those obtained using the MNL model. In fact, Dagsvik (1994) has shown that the class of random utility models with GEV distributed utilities is “dense” in the class of all random utility models, in the sense that the choice probabilities implied by any random utility model can be approximated arbitrarily closely by a RUM in the GEV class. However, a limitation of nested logit models is that they imply a highly structured pattern of correlation in the unobservables, induced by the econometrician’s specification of how the overall choice set D(x) is to be partitioned and the number of levels in the nested logit “tree.” Even though the NMNL model can be nested to arbitrarily many levels to achieve additional flexibility, it is desirable to have a method where patterns of correlation in unobservables can be estimated from the data rather than being imposed by the analyst. Further, even though McFadden and Train (2000) recognize Dagsvik’s (1994) finding as a “powerful theoretical result,” they conclude that “its practical econometric application is limited by the difficulty of specifying, estimating, and testing the consistency of relatively abstract generalized Extreme Value RUM” (McFadden and Train (2000), p. 452).

As noted above, the random coefficients probit model has many attractive features: it allows a flexibly specified covariance matrix representing correlation between unobservable components of utilities, and it avoids many of the undesirable features implied by the IIA property of the MNL model in a somewhat more direct and intuitive fashion than is possible via the GEV family.

11 For the commuter’s problem discussed earlier, let D(x) = {w, r, b} (i.e. walk, take the red bus, or take the blue bus). Let D_1(x) = {w} and D_2(x) = {r, b}. As previously, we assume that u(x, w, θ) = u(x, b, θ) = u(x, r, θ). Further, let σ_1 = 1 and σ_2 = 0. Then it is not hard to see that for the nested logit model P(w|x, θ) = 1/2 and P(D_2(x)|x, D(x), θ) = 1/2. Conditional on choosing to go by bus, the individual is indifferent between going by red or blue bus, so P(r|x, D_2(x), θ) = P(b|x, D_2(x), θ) = 1/2. Thus, P(r|x, D(x), θ) = P(b|x, D(x), θ) = 1/4, and it follows that this nested logit model yields the intuitively plausible solution to the red bus/blue bus problem.
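The calculation in this footnote can be replicated with a short sketch of the two-level nested logit formulas above (σ₂ is set to a small positive number to approximate the σ₂ → 0 limit, and the common utility is normalized to zero):

```python
import numpy as np

def nested_logit(u, nests, sigmas):
    """Two-level nested MNL choice probabilities over a partitioned choice set."""
    S = [s * np.log(np.sum(np.exp(u[D] / s))) for D, s in zip(nests, sigmas)]
    top = np.exp(S) / np.sum(np.exp(S))       # P(D_i | x): upper-level MNL over S_i
    probs = np.zeros(len(u))
    for p_i, D, s in zip(top, nests, sigmas):
        within = np.exp(u[D] / s)             # lower-level MNL within nest D_i
        probs[D] = p_i * within / within.sum()
    return probs

u = np.zeros(3)                               # indifference: walk, red bus, blue bus
print(nested_logit(u, [np.array([0]), np.array([1, 2])], [1.0, 1e-6]))
# -> [0.5, 0.25, 0.25], the intuitively plausible answer
```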

However as noted above, the multinomial probit model is intractable for applications with more than 4 or 5 alternatives due to the “curse of dimensionality” of the numerical integrations required, at least using deterministic numerical integration methods such as Gaussian quadrature. One of McFadden’s most important contributions was his (1989) Econometrica paper that introduced the method of simulated moments (MSM). This was a major breakthrough that introduced a new econometric method that made it feasible to estimate the parameters of multinomial probit models with arbitrarily large numbers of alternatives.

The basic idea underlying McFadden’s contribution is to use Monte Carlo integration to approximate the probit choice probabilities. While this idea had been proposed earlier by Lerman and Manski (1981), it was never developed into a practical, widespread estimation method because “it requires an impractical number of Monte Carlo draws to estimate small choice probabilities and their derivatives with acceptable precision” (McFadden (1989), p. 997). McFadden’s brilliant insight was that it is not necessary to have extremely accurate (and thus computationally time intensive) Monte Carlo estimates of choice probabilities in order to obtain an estimator for the parameters of a multinomial probit model that is consistent and asymptotically normal and performs well in finite samples. The noise from Monte Carlo simulations can be treated in the same way as random sampling error and will thus “average out” in large samples. In particular, his MSM estimator has good asymptotic properties even when only a single Monte Carlo draw is used to estimate each agent’s choice probability.

The key idea behind MSM is to formulate it as a method of moments estimator using an orthogonality condition and an appropriate set of instrumental variables. The key orthogonality condition underlying the MSM estimator is the same as for the method of moments (MM) estimator, namely, that the expected value of an agent’s decision equals the choice probability when θ = θ*. Let $\vec d_i$ be a |D(x_i)| × 1 vector of 0’s and 1’s with the property that if agent i with observed characteristics x_i chooses a particular alternative from the choice set D(x_i), then the corresponding component of the vector $\vec d_i$ equals 1, and equals 0 otherwise. Let P(x_i, θ) be the corresponding |D(x_i)| × 1 “stacked” vector of choice probabilities, i.e. P_d(x_i, θ) = P(d|x_i, θ), where P_d(x_i, θ) is the d-th component of the vector P(x_i, θ). If the random utility model is correctly specified, then at the true parameter vector θ* we have

$$E\{\vec d_i - P(x_i, \theta^*)\} = 0.$$

We can regard the vector η = $\vec d_i$ − P(x_i, θ) as a mean zero “error term,” and we can construct a |D(x)| × K matrix of instrumental variables Z_i satisfying E{η′Z} = 0 (for example, the elements of the matrix Z_i can be constructed from various powers and cross products of the components of the vector x_i). If it were possible to evaluate the choice probabilities, and hence the vector P(x_i, θ), it would be possible to estimate θ using the minimum distance estimator

$$\hat\theta_{mm} = \operatorname*{argmin}_{\theta} \sum_{i=1}^{N} [\vec d_i - P(x_i,\theta)]'\, Z_i Z_i'\, [\vec d_i - P(x_i,\theta)]. \qquad (2.10)$$

However, in cases such as the multinomial probit model, it is not feasible to evaluate P(x_i, θ) when there are more than 5 or 6 alternatives in D(x). So consider an alternative, computationally feasible version of the minimum distance estimator where P(x_i, θ) is replaced by a Monte Carlo estimator $\hat P_S(x_i, \theta)$ based on S independent and identically distributed draws {ε_1, . . . , ε_S} from the distribution of unobservables F(ε|x_i) in the random utility model. Thus, the d-th component of $\hat P_S(x_i, \theta)$ is given by

$$\hat P_d(x_i,\theta) = \frac{1}{S}\sum_{s=1}^{S} I\left\{d = \operatorname*{argmax}_{d' \in D(x_i)}\left[u(x_i,d',\theta) + \varepsilon_s(d')\right]\right\}. \qquad (2.11)$$

For any fixed θ, the Monte Carlo estimator $\hat P_S(x_i, \theta)$ is an unbiased estimator of P(x_i, θ),

$$E\{\hat P_S(x_i, \theta) - P(x_i, \theta)\} = 0,$$

and thus the fundamental orthogonality condition continues to hold when P(x_i, θ) is replaced by $\hat P_S(x_i, \theta)$, i.e.

$$E\left\{Z_i'\,[\vec d_i - \hat P_S(x_i, \theta^*)]\right\} = 0.$$

Based on this insight, McFadden introduced the method of simulated moments estimator

$$\hat\theta_{msm} = \operatorname*{argmin}_{\theta} \sum_{i=1}^{N} [\vec d_i - \hat P_S(x_i,\theta)]'\, Z_i Z_i'\, [\vec d_i - \hat P_S(x_i,\theta)], \qquad (2.12)$$

and showed that it is a consistent and asymptotically normal estimator of the true parameter vector θ*. The errors in the Monte Carlo estimate $\hat P_S(x_i, \theta)$ of P(x_i, θ) are conceptually similar to sampling errors, and thus tend to average out over the number of observations N, becoming negligible as N → ∞. The cost of using the noisier simulation estimator of P(x_i, θ) is that $\hat\theta_{msm}$ will have a larger asymptotic variance-covariance matrix than the method of moments estimator $\hat\theta_{mm}$. However, McFadden showed this cost is small: the asymptotic variance-covariance matrix of $\hat\theta_{msm}$ is only (1 + 1/S) times as large as that of the ordinary method of moments estimator $\hat\theta_{mm}$, where S is the number of Monte Carlo simulation draws. In particular, this implies that the variance of $\hat\theta_{msm}$ is only twice as large as that of $\hat\theta_{mm}$ when only a single Monte Carlo draw is used to estimate the choice probabilities. Since the savings in computation time from being able to use only a few Monte Carlo draws per observation to estimate $\hat P_S(x_i, \theta)$ are huge, McFadden’s result made it possible to estimate a broad new class of econometric models that were previously believed to be infeasible due to the computational demands of providing accurate estimates of P(x_i, θ).
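A minimal sketch of the crude frequency simulator (2.11) for a multinomial probit (the utilities and the error covariance below are hypothetical) shows why even a handful of draws is usable inside MSM:

```python
import numpy as np

rng = np.random.default_rng(0)
J, S = 8, 10                      # 8 alternatives: quadrature would be very costly
u = rng.normal(size=J)            # hypothetical systematic utilities u(x, d, theta)
Omega = np.eye(J)                 # probit error covariance; correlations are allowed

eps = rng.multivariate_normal(np.zeros(J), Omega, size=S)   # S draws from F(eps|x)
P_hat = np.bincount(np.argmax(u + eps, axis=1), minlength=J) / S
print(P_hat)   # unbiased but noisy for small S; in MSM this noise averages out
               # across observations at an efficiency cost of only (1 + 1/S)
```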

The idea behind the MSM estimator is quite general and can be applied in many other settings besides the multinomial probit model. McFadden’s work helped to spawn a large literature on “simulation estimation” that developed rapidly during the 1990s and resulted in computationally feasible estimators for a large new class of econometric models that were previously considered computationally infeasible. However, there are even better simulation estimators for the multinomial probit model, which generally outperform the MSM estimator in terms of lower asymptotic variance and better finite sample performance, and which are easier to compute. One problem with the “crude frequency simulator” $\hat P_S(x_i, \theta)$ in equation (2.11) is that it is a discontinuous and “locally flat” function of the parameters θ, and thus the MSM criterion function in (2.12) is difficult to optimize. Hajivassiliou and McFadden (1998) introduced the method of simulated scores (MSS), which is based on Monte Carlo methods for simulating the scores of the likelihood function for a multinomial probit model and a wide class of other limited dependent variable models, such as Tobit and other types of censored regression models. 12 Because it simulates the score of the likelihood rather than using a method of moments criterion, which does not generally lead to full asymptotic efficiency, the MSS estimator is more efficient than the MSM estimator. Also, the MSS is based on a smooth simulator (i.e. a method of simulation that results in an estimation criterion that is a continuously differentiable function of the parameters θ), so the MSS estimator is much easier to compute than the MSM estimator (2.12) based on the crude frequency simulator (2.11). Based on numerous Monte Carlo studies and empirical applications, MSS (and a closely related simulated maximum likelihood estimator based on the “Geweke-Hajivassiliou-Keane” (GHK) smooth simulator) are now regarded as the estimation methods of choice for a wide class of limited dependent variable models that are commonly encountered in empirical applications.

Despite these computational breakthroughs, the MNL model remains one of the most tractable functional forms for estimating discrete choice models. When the utility functions are specified

12 In the case of a discrete choice model, the score for the ith observation is ∂/∂θ log(P(d_i|x_i, θ)).

to be linear-in-parameters, u(x, d, θ) = v(x, d)′θ, where v(x, d) is a vector of interactions of characteristics of alternative d and characteristics of the agent, the likelihood function L(θ) is a concave function of θ, which makes it easy to compute the maximum likelihood estimator θ̂. 13 However, as noted above, the MNL model is considered undesirable in many cases due to the IIA property. McFadden (1984) showed that the MNL model can serve as a “universal approximator” of any set of choice probabilities. That is, any conditional choice probability P(d|x) can be represented as an MNL model with utilities given by

$$u(x, d) = \log P(d \mid x).$$
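Substituting these utilities into the MNL formula (2.2) recovers P(d|x) exactly, since the choice probabilities sum to one:

$$\frac{\exp\{\log P(d \mid x)\}}{\sum_{d' \in D(x)} \exp\{\log P(d' \mid x)\}} = \frac{P(d \mid x)}{\sum_{d' \in D(x)} P(d' \mid x)} = P(d \mid x).$$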

McFadden called this universal representation the “mother logit model.” However, it is not the case that mother logit can legitimately “rationalize” any set of choice probabilities. The mother logit model is based on a “pseudo utility function” u(x, d) = log(P(d|x)) that is an implicit function of a particular underlying choice set D(x). The approach does not allow us to predict how choices will change if the choice sets change, unless it is based on choice probabilities that are explicit functions of the choice set, P(d|x, D(x)). Moreover, unless the choice probabilities also satisfy the Block-Marschak necessary conditions, the mother logit model will not correspond to a valid random utility model.

McFadden and Train (2000), in a paper that won the Sir Richard Stone Prize for the best empirical paper published in the Journal of Applied Econometrics, showed that a computationally tractable class of choice probabilities, mixed MNL models, constitutes a valid class of random utility models whose implied choice probabilities can approximate the choice probabilities implied by virtually any random utility model. 14 A mixed MNL model has choice probabilities of the form

$$P(d \mid x, \theta) = \int \left[\frac{\exp\{u(x,d,\alpha)\}}{\sum_{d' \in D(x)} \exp\{u(x,d',\alpha)\}}\right] G(d\alpha \mid \theta). \qquad (2.13)$$

There are several possible random utility interpretations of the mixed logit model. One interpretation is that the α vector represents “unobserved heterogeneity” in the preference parameters in the population, so the relevant choice probability is marginalized using the population distribution G(α|θ) of the α parameters. The other interpretation is that α is similar to the vector ε,

13 In fact, standard hill-climbing algorithms can compute the global maximum θ̂ in polynomial time (as a function of the dimension K of the θ vector). In cases where the likelihood function is not concave, computer scientists have shown that the problem of finding a global optimum requires exponential time in the worst case.

14 The main restriction on the set of allowable random utility models in their approximation result is that there be zero probability of a tie, i.e. zero probability that the agent is indifferent between multiple alternatives in the choice set.

i.e. it represents information that agents observe and which affects their choices (similar to ε) but which is unobserved by the econometrician, except that the components ε(d) of ε enter the utility function additively separably, whereas the variables α are allowed to enter in a non-additively separable fashion, and the random vectors α and ε are statistically independent. It is easy to see that under either interpretation, the mixed logit model will not satisfy the IIA property, and thus is not subject to its undesirable implications. McFadden and Train proposed several alternative ways to estimate mixed logit models, including maximum simulated likelihood and MSM. In each case, Monte Carlo integration is used to approximate the integral in equation (2.13) with respect to G(α|θ). Both of these estimators are smooth functions of the parameters θ, and both benefit from the computational tractability of the MNL while at the same time having the flexibility to approximate virtually any type of random utility model. The intuition behind McFadden and Train’s approximation theorem is that a mixed logit model can be regarded as a certain type of neural network using the MNL model as the underlying “squashing function.” Neural networks are known to have the ability to approximate arbitrary types of functions and enjoy certain optimality properties: the number of parameters (i.e. the dimension of the α vector) needed to approximate arbitrary choice probabilities grows only linearly in the number of included covariates x. 15
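A sketch of the Monte Carlo integration step for (2.13), assuming (purely for illustration) a linear utility u(x, d, α) = x_d′α with normally distributed random coefficients α:

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, R = 4, 2, 5_000
X = rng.normal(size=(J, K))                  # hypothetical attribute vectors x_d
mu, L = np.array([1.0, -0.5]), np.eye(K)     # theta = (mean, Cholesky factor) of G

alphas = mu + rng.normal(size=(R, K)) @ L.T  # R draws of alpha from G(alpha | theta)
u = alphas @ X.T                             # utilities for each draw, shape (R, J)
u -= u.max(axis=1, keepdims=True)
P = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)   # MNL probabilities per draw
print(P.mean(axis=0))   # simulated mixed MNL probabilities; smooth in theta, so
                        # suitable for maximum simulated likelihood or MSM
```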

This brief survey of McFadden’s contributions to the discrete choice literature has revealed the immense practical benefits of his ability to link theory and econometrics, innovations that led to a vast empirical literature and widespread applications of discrete choice models. Beginning with his initial discovery, i.e. his demonstration that Luce’s MNL choice probabilities result from a random utility model with multivariate extreme value distributed unobservables, McFadden has made a series of fundamental contributions that have enabled researchers to circumvent the problematic implications of the IIA property of the MNL model, providing computationally tractable methods for estimating ever wider and more flexible classes of random utility and limited dependent variable models in econometrics.

15 Other approximation methods, such as series estimators formed as tensor products of bases that are univariate functions of each of the components of x, require a much larger number of coefficients to provide a comparable approximation, and the number of such coefficients grows exponentially fast with the dimension of the x vector.

3. Extensions

McFadden’s research continues at an undiminished pace, providing important theoretical and applied contributions. A recent example is his innovative paper with Jenkins et al. (2004), “The Browser War - Econometric Analysis of Markov Perfect Equilibrium in Markets with Network Effects.” This paper formulates a dynamic model of competition between the two main internet browsers, Microsoft’s Internet Explorer and Netscape, and uses the model to quantify the damages that resulted from aggressive competitive tactics on the part of Microsoft, which were judged anticompetitive and illegal in the landmark case U.S. vs. Microsoft in 2002. The model allows for the possibility of network externalities, for example where a consumer’s utility of using a given browser may be an increasing function of the browser’s market share. The analysis concludes that Microsoft’s illegal exclusionary contracts with internet service providers, which excluded Netscape, were only a minor part of the explanation of Netscape’s decline: the majority of the lost market share (and thus damages to Netscape) was due to “Microsoft’s tying of Internet Explorer to the Windows operating system, and the arrangements under which it was difficult or inconvenient for OEM’s to preinstall another browser” (Jenkins et al. (2004), p. 45).

Unfortunately, there is insufficient space to survey all of McFadden’s other equally interesting and important work. However, I do wish to devote the remaining space to two examples of how McFadden’s work has helped to spawn several new literatures that appear likely to be among the most active and vibrant areas of future applied work in econometrics. One area is the estimation of static and dynamic discrete games of incomplete information. This is a very natural extension of the standard discrete choice model, which can be viewed as a “game against nature.” Consider, for example, a two player game where player a has observed characteristics x_a and a choice set D_a(x_a), and player b has observed characteristics x_b and a choice set D_b(x_b). Assume the two players move simultaneously, in order to maximize the expected values of the utility functions u_a(x_a, d_a, d_b, θ_a) + ε_a(d_a) (for player a) and u_b(x_b, d_a, d_b, θ_b) + ε_b(d_b) (for player b). The utility functions of both players depend on vectors ε_a and ε_b which are private information (i.e. player a knows ε_a but not ε_b, and vice versa for player b). If it is common knowledge that ε_a and ε_b have extreme value distributions, the Bayesian Nash equilibrium of this game can be defined in

terms of a pair of equilibrium choice probabilities (P_a(d_a|x_a, x_b, θ), P_b(d_b|x_a, x_b, θ)) that satisfy the following equations

$$P_a(d_a \mid x_a, x_b, \theta) = \frac{\exp\{Eu_a(x_a, x_b, d_a, \theta)\}}{\sum_{d' \in D_a(x_a)} \exp\{Eu_a(x_a, x_b, d', \theta)\}},
\qquad
P_b(d_b \mid x_a, x_b, \theta) = \frac{\exp\{Eu_b(x_a, x_b, d_b, \theta)\}}{\sum_{d' \in D_b(x_b)} \exp\{Eu_b(x_a, x_b, d', \theta)\}}, \qquad (3.1)$$

where

$$Eu_a(x_a, x_b, d_a, \theta) = \sum_{d_b \in D_b(x_b)} u_a(x_a, d_a, d_b, \theta_a)\, P_b(d_b \mid x_a, x_b, \theta),
\qquad
Eu_b(x_a, x_b, d_b, \theta) = \sum_{d_a \in D_a(x_a)} u_b(x_b, d_a, d_b, \theta_b)\, P_a(d_a \mid x_a, x_b, \theta).$$

The Brouwer fixed point theorem implies that at least one equilibrium of this game always exists. If we observe N independent games played by these two types of players, a and b, with observed outcomes $\{(d_a^i, d_b^i, x_a^i, x_b^i)\}_{i=1}^{N}$, then we can estimate the parameter vector θ = (θ_a, θ_b) by maximizing the likelihood function L(θ) given by

$$L(\theta) = \prod_{i=1}^{N} P_a(d_a^i \mid x_a^i, x_b^i, \theta)\, P_b(d_b^i \mid x_a^i, x_b^i, \theta). \qquad (3.2)$$

We see that the equilibrium choice probabilities in (3.1) are a direct generalization of the MNL probabilities that McFadden derived in a single agent “game against nature,” and the likelihood function (3.2) is a direct generalization of the likelihood McFadden developed to estimate the parameters θ in the MNL model (see (2.4)). This line of extension of the single agent discrete choice techniques that McFadden developed is one of the current “frontier areas” in applied econometrics (see, e.g. Bajari et al. (2005)).
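A sketch of computing the equilibrium probabilities in (3.1) for a hypothetical 2 × 2 game by iterating the logit best-response map (Brouwer’s theorem guarantees that a fixed point exists; simple iteration usually, though not always, finds one):

```python
import numpy as np

rng = np.random.default_rng(0)
Ua = rng.normal(size=(2, 2))      # Ua[da, db] = u_a(x_a, d_a, d_b, theta_a)
Ub = rng.normal(size=(2, 2))      # Ub[da, db] = u_b(x_b, d_a, d_b, theta_b)

def logit(v):
    e = np.exp(v - v.max())
    return e / e.sum()

Pa, Pb = np.full(2, 0.5), np.full(2, 0.5)
for _ in range(500):              # iterate the logit best-response map
    Eua = Ua @ Pb                 # Eu_a(d_a) = sum_db u_a(d_a, d_b) P_b(d_b)
    Eub = Ub.T @ Pa               # Eu_b(d_b) = sum_da u_b(d_a, d_b) P_a(d_a)
    Pa, Pb = logit(Eua), logit(Eub)
print(Pa, Pb)                     # an (approximate) equilibrium of (3.1)
```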

Another area where McFadden’s work has been very influential is the literature on dynamic discrete choice models. This literature originated in the 1980s and maintains the single agent focus of most of McFadden’s work, but extends choice from the static context that McFadden analyzed to situations where agents make repeated choices over time in order to maximize a dynamic or intertemporal objective function. For example, Dagsvik (1983) formulated a beautiful extension of the static discrete choice model to a discrete choice in continuous time setting where utilities are

viewed as realizations of continuous time stochastic processes. Let P (dt|xt, θ) be the probability that an agent with observed characteristics xt chooses alternative dt ∈ D(xt) at time t. Then the natural continuous time extension of the random utility model is

$$P(d_t \mid x_t, \theta) = \Pr\left\{\tilde U_t(x_t, d_t, \theta) \ge \tilde U_t(x_t, d', \theta),\ d' \in D(x_t)\right\},$$

where $\{\tilde U_t(x_t, d, \theta)\}$ is interpreted as a random utility process, i.e. a stochastic process indexed by the time variable t. Dagsvik showed that a class of stochastic processes known as multivariate extremal processes are the natural continuous time extension of the extreme value error components in McFadden’s original work, resulting in marginal (i.e. time t) choice probabilities that have the MNL form. Further, Dagsvik showed that the (discrete) stochastic process for the optimal choices of such a decision maker forms a continuous time Markov chain.

A related line of extension of McFadden’s work has been to link it with discrete time sequential decision making models and the method of dynamic programming. In this theory, an agent selects an alternative dt ∈ D(xt) at each time t to maximize the expected value of a time-separable discounted objective function. The solution to the problem is a decision rule, i.e. a sequence of functions (δ0, . . . , δT ) that solves

$$(\delta_0, \ldots, \delta_T) = \operatorname*{argmax} E\left\{\sum_{t=0}^{T} \beta^t \left[u_t(x_t, d_t, \theta) + \varepsilon_t(d_t)\right]\right\},$$

where β ∈ (0, 1) is an intertemporal discount factor, d_t = δ_t(I_t), and I_t is the information available to the decision maker at time t. If we assume that the observed state variables evolve according to a controlled Markov process with transition probability p(x_{t+1}|x_t, d_t, θ), and the components ε_t are interpreted as unobserved state variables which are IID (independent and identically distributed) and independent of the observed state variables {x_t}, then the agent’s choice probability at time t is given by

$$P_t(d_t \mid x_t, \theta) = \Pr\{d_t = \delta_t(I_t)\} = \Pr\left\{v_t(x_t, d_t) + \varepsilon_t(d_t) \ge v_t(x_t, d') + \varepsilon_t(d'),\ d' \in D(x_t)\right\},$$

where v_t(x, d) is an expected value function given by

$$v_t(x, d, \theta) = u_t(x, d, \theta) + \beta \int_{x'} S\bigl(\{v_{t+1}(x', d', \theta) \mid d' \in D(x')\} \bigm| x'\bigr)\, p(dx' \mid x, d, \theta),$$

where S is the same Social Surplus function that plays a key role in the derivation of the static discrete choice model (see (2.7)). In particular, if {ε_t} is an IID extreme value process, and if T = ∞ and utilities are time invariant, the Markov decision problem can be shown to be stationary, with a time invariant decision rule d_t = δ(x_t, ε_t) that results in a dynamic generalization of the MNL model

$$P(d \mid x, \theta) = \Pr\{d = \delta(x, \varepsilon) \mid x\} = \frac{\exp\{v(x,d,\theta)/\sigma\}}{\sum_{d' \in D(x)} \exp\{v(x,d',\theta)/\sigma\}},$$

where v(x, d) is the unique fixed point of a contraction mapping v = Γ(v) defined by

$$v(x, d) = \Gamma(v)(x, d) \equiv u(x, d, \theta) + \beta \int_{x'} \sigma \log\left[\sum_{d' \in D(x')} \exp\{v(x', d', \theta)/\sigma\}\right] p(dx' \mid x, d).$$

There are also dynamic extensions of the probit model. See Eckstein and Wolpin (1989) and Rust (1994) for surveys of the literature on dynamic extensions of discrete choice models.
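To close this section, here is a minimal sketch of solving the stationary dynamic MNL model by iterating the contraction mapping Γ for a hypothetical two-state, two-action problem:

```python
import numpy as np

nX, nD, beta, sigma = 2, 2, 0.95, 1.0
u = np.array([[1.0, 0.0],         # u[x, d]: hypothetical per-period utilities
              [0.0, 1.0]])
p = np.full((nX, nD, nX), 0.5)    # p[x, d, x']: transition probabilities

v = np.zeros((nX, nD))
for _ in range(1000):             # value iteration: v <- Gamma(v)
    logsum = sigma * np.log(np.exp(v / sigma).sum(axis=1))   # Social Surplus S(x')
    v = u + beta * p @ logsum     # contraction mapping Gamma, as defined above

P = np.exp(v / sigma) / np.exp(v / sigma).sum(axis=1, keepdims=True)
print(P)                          # dynamic MNL choice probabilities P(d | x)
```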

4. References

Anderson, S.P., De Palma, A. and J.F. Thisse (1992) Discrete Choice Theory of Product Differentiation Cambridge, MIT Press.
Bajari, P., Hong, H., Krainer, J. and D. Nekipelov (2005) “Estimating Static Models of Strategic Interactions” manuscript, University of Michigan.
Berkovec, J. (1985) “New Car Sales and Used Car Stocks: A Model of the Automobile Market” RAND Journal of Economics 16 195–214.
Berry, S., Levinsohn, J. and A. Pakes (1995) “Automobile Prices in Market Equilibrium” Econometrica 63 841–890.
Block, H. and J. Marschak (1960) “Random Orderings and Stochastic Theories of Response” in I. Olkin (ed.) Contributions to Probability and Statistics Stanford, Stanford University Press.
Cosslett, S.R. (1981) “Efficient Estimation of Discrete-Choice Models” in C.F. Manski and D. McFadden (eds.) op. cit. 51–113.
Dagsvik, J.K. (1983) “Discrete dynamic choice: An extension of the choice models of Luce and Thurstone” Journal of Mathematical Psychology 27 1–43.
Dagsvik, J.K. (1994) “Discrete and continuous choice, max-stable processes and independence from irrelevant attributes” Econometrica 62 1179–1205.
Dagsvik, J.K. (1995) “How large is the class of generalized extreme value models?” Journal of Mathematical Psychology 39 90–98.
Daly, A. and S. Zachary (1979) “Improved Multiple Choice Models” in D. Hensher and Q. Dalvi (eds.) Identifying and Measuring the Determinants of Mode Choice London, Teakfield.
Debreu, G. (1960) “Review of R.D. Luce Individual Choice Behavior” American Economic Review 50 186–188.
Dubin, J. and D. McFadden (1984) “An Econometric Analysis of Residential Electric Appliance Holdings and Consumption” Econometrica 52 345–362.
Eckstein, Z. and K. Wolpin (1989) “The Specification and Estimation of Dynamic Stochastic Discrete Choice Models” Journal of Human Resources 24 562–598.
Falmagne, J.C. (1978) “A Representation Theorem for Finite Random Scale Systems” Journal of Mathematical Psychology 18 52–72.

Goldberg, P. (1995) “Product Differentiation and Oligopoly in International Markets: The Case of the U.S. Automobile Industry” Econometrica 63 891–951.
Hajivassiliou, V. and D.L. McFadden (1998) “The Method of Simulated Scores for the Estimation of LDV Models” Econometrica 66 863–896.
Hajivassiliou, V.A. and P.A. Ruud (1994) “Classical Estimation Methods for LDV Models Using Simulation” in R.F. Engle and D.L. McFadden (eds.) Handbook of Econometrics Volume IV Amsterdam, Elsevier.
Hannemann, M. (1984) “Discrete/Continuous Models of Consumer Demand” Econometrica 52 541–562.
Hausman, J. and D. Wise (1978) “A Conditional Probit Model of Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences” Econometrica 46 403–426.
Heckman, J.J. (1981) “Statistical Models for Discrete Panel Data” in C.F. Manski and D. McFadden (eds.) op. cit. 114–178.
Jenkins, M., Liu, P., Matzkin, R. and D. McFadden (2004) “The Browser War - Econometric Analysis of Markov Perfect Equilibrium in Markets with Network Effects” manuscript, Department of Economics, University of California at Berkeley.
Luce, R.D. (1959) Individual Choice Behavior: A Theoretical Analysis New York, Wiley.
Manski, C.F. and D. McFadden (eds.) (1981) Structural Analysis of Discrete Data with Econometric Applications MIT Press.
Manski, C.F. (1985) “Semiparametric Estimation of Discrete Response: Asymptotics of the Maximum Score Estimator” Journal of Econometrics 27 303–333.
Marschak, J. (1960) “Binary Choice Constraints and Random Utility Indicators” in K. Arrow, S. Karlin and P. Suppes (eds.) Mathematical Methods in the Social Sciences Stanford, Stanford University Press.
McFadden, D. (1973) “Conditional Logit Analysis of Qualitative Choice Behavior” in P. Zarembka (ed.) Frontiers of Econometrics New York, Academic Press.
McFadden, D. (1974) “The Measurement of Urban Travel Demand” Journal of Public Economics 3 303–328.
McFadden, D. (1976) “The Revealed Preferences of a Government Bureaucracy: Empirical Evidence” Bell Journal of Economics and Management Science 7 55–72.
McFadden, D. (1978) “Cost, Revenue, and Profit Functions” in M. Fuss and D. McFadden (eds.) Production Economics: A Dual Approach to Theory and Applications Volume 1, Amsterdam, North Holland.
McFadden, D. (1981) “Econometric Models of Probabilistic Choice” in C.F. Manski and D. McFadden (eds.) op. cit. 198–272.
McFadden, D. (1984) “Econometric Analysis of Qualitative Response Models” in Z. Griliches and M. Intriligator (eds.) Handbook of Econometrics Volume II Amsterdam, North Holland.

McFadden, D. (1989) “A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration” Econometrica 57 995–1026.
McFadden, D. and K. Train (2000) “Mixed MNL Models of Discrete Response” Journal of Applied Econometrics 15 447–470.
Pakes, A. and D. Pollard (1989) “Simulation and the Asymptotics of Optimization Estimators” Econometrica 57 1027–1057.
Rust, J. (1994) “Structural Estimation of Markov Decision Processes” in R.F. Engle and D.L. McFadden (eds.) Handbook of Econometrics Volume IV Amsterdam, Elsevier.
Thurstone, L.L. (1927) “Psychophysical Analysis” American Journal of Psychology 38 368–389.