Lecture: Discrete Choice

October 26, 2019

1 Motivation and overview

Topics include

• Discrete choice and differentiated commodities

• Empirical models for auctions

Motivation

• A common theme in these topics is the interplay of economic theory and econometrics.

• We use economic theory to formulate the models. Empirical models derived from economic theory are often called structural models.

• A major concern will be what can be estimated realistically, i.e. with available data.

• Because the models will involve functional form assumptions and assumptions on distributions, it will be important to discuss nonparametric identification, i.e. the question whether functional forms and distributions can be recovered from the available data.

• If nonparametric identification can be established, we can use either a parametric or non/semi-parametric estimation procedure.

• Another issue is computation: structural models can involve high-dimensional integrals or depend on functions that can only be computed recursively or as a fixed point.

• Important question: Why use models derived from economic theory in empirical research?

– Guidance in the specification (and thereby identification) of the model that relates outcomes to determining variables.
– More important: computation of counterfactuals.

2 Discrete Choice and Differentiated Commodities

We compare two types of problems

• Discrete choice of differentiated commodity.

– Consider a consumer who wants to buy a car.
– There are a finite number of car types that differ in attributes such as size, performance, style etc.
– The consumer buys only one car; a car is indivisible.

• Continuous choice of homogeneous commodity.

– In traditional demand analysis income is allocated to the purchase of quantities of homogeneous commodities.
– These commodities have a unit price.
– Quantities vary continuously.

Two important differences

• Discrete choice deals with indivisible goods, continuous choice with divisible goods.

• Discrete choice deals with differentiated goods and continuous choice with homogeneous goods.

• Of course, continuous choice could also be for differentiated commodities.

• We could make discrete choice continuous by considering

– Repeated choice by one consumer: the fraction of choices resulting in a particular car type is continuous.
– Choice by a group of consumers: the fraction of the group that chooses a particular car type.
– Both have their problems.

2.1 Additive random utility model

• Consider a consumer who chooses between alternatives $i = 1, \ldots, I$. We omit the subscript for the consumer.

• ui is the utility associated with choice i.

• For $u_i$ we specify a simple model, the Additive Random Utility Model (ARUM):

$$u_i = -v_i + \varepsilon_i$$

• $-v_i$ is the mean utility of alternative $i$. Later we make it depend on observed (and unobserved) characteristics of the alternative and of the agent. For now, think of $v_i$ as the price of alternative $i$ (the minus sign simplifies the notation in the sequel).

• εi is a random variable with E(εi) = 0.

• Interpretations

– Optimization error.
– Unobserved attributes of alternative $i$, e.g. the style of a car.
– Unobserved attributes of the agent.
– Interactions of these two.

• The interpretation is important if we want to use the ARUM as a model of consumer demand for differentiated products. If we include the unobserved alternative and individual attributes in $v_i$, only optimization error remains (but it may be needed to fit the data; without it the choice of alternative is predictable). For now we allow all interpretations. Repeated choices by the same agent would help in settling the interpretation.

• The interpretation also affects the nature of the joint distribution of $\varepsilon_1, \ldots, \varepsilon_I$. If there are unobserved agent attributes, these random variables will be correlated. We assume

$$\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_I \end{pmatrix} \sim F(\varepsilon_1, \ldots, \varepsilon_I)$$

with $F$ left unspecified for now. We assume that $F$ is independent of $v_1, \ldots, v_I$, but we will consider cases in which this does not hold later. The joint pdf is $f(\varepsilon_1, \ldots, \varepsilon_I)$.

2.1.1 ARUM and choice

• The agent chooses alternative $i$ if and only if

$$u_i > u_j \text{ for all } j \ne i \iff -v_i + \varepsilon_i > -v_j + \varepsilon_j \text{ for all } j \ne i$$

• We compute the probability that $i$ is chosen, $p_i(v)$, from the joint distribution of $\varepsilon_1, \ldots, \varepsilon_I$:

$$p_i(v) = \Pr(\varepsilon_1 < v_1 - v_i + \varepsilon_i, \ldots, \varepsilon_I < v_I - v_i + \varepsilon_i)$$
$$= \int_{-\infty}^{\infty} \int_{-\infty}^{v_1 - v_i + \varepsilon_i} \cdots \int_{-\infty}^{v_I - v_i + \varepsilon_i} f(\varepsilon_1, \ldots, \varepsilon_I)\, d\varepsilon_1 \cdots d\varepsilon_I\, d\varepsilon_i$$
$$= \int_{-\infty}^{\infty} \frac{\partial F}{\partial \varepsilon_i}(v_1 - v_i + \varepsilon_i, \ldots, \varepsilon_i, \ldots, v_I - v_i + \varepsilon_i)\, d\varepsilon_i$$

• For $I = 3$:

$$p_1(v) = \Pr(\varepsilon_2 < v_2 - v_1 + \varepsilon_1, \; \varepsilon_3 < v_3 - v_1 + \varepsilon_1)$$
$$p_2(v) = \Pr(\varepsilon_1 < v_1 - v_2 + \varepsilon_2, \; \varepsilon_3 < v_3 - v_2 + \varepsilon_2)$$
$$p_3(v) = \Pr(\varepsilon_1 < v_1 - v_3 + \varepsilon_3, \; \varepsilon_2 < v_2 - v_3 + \varepsilon_3)$$

or, if we define $\eta_1 = \varepsilon_2 - \varepsilon_1$ and $\eta_2 = \varepsilon_3 - \varepsilon_1$ (see figure 1),

$$p_1(v) = \Pr(\eta_1 < v_2 - v_1, \; \eta_2 < v_3 - v_1)$$
$$p_2(v) = \Pr(\eta_1 > v_2 - v_1, \; \eta_2 < v_3 - v_2 + \eta_1)$$
$$p_3(v) = \Pr(\eta_2 > v_3 - v_1, \; \eta_1 < v_2 - v_3 + \eta_2)$$

• Note that the choice probabilities can be expressed as events involving the random variables $\varepsilon_2 - \varepsilon_1$ and $\varepsilon_3 - \varepsilon_1$ and the average utility differences $v_2 - v_1$ and $v_3 - v_1$, i.e. they involve only utility comparisons with alternative 1, which is chosen as the reference alternative.

2.1.2 Multinomial logit

• The choice probabilities have a closed form only for special distributions of $\varepsilon_1, \ldots, \varepsilon_I$.

• Consider the case that the $\varepsilon_i$ are independent and identically distributed with an Extreme Value distribution with pdf

$$f(\varepsilon_i) = e^{-\varepsilon_i} e^{-e^{-\varepsilon_i}}, \quad -\infty < \varepsilon_i < \infty$$

and cdf

$$F(\varepsilon_i) = e^{-e^{-\varepsilon_i}}$$

We have $E(\varepsilon_i) = \gamma \approx 0.5772$, Euler's constant. Also $\text{Var}(\varepsilon_i) = \pi^2/6$.

• For the choice probabilities (using the final expression above)

$$p_i(v) = \int_{-\infty}^{\infty} e^{-\varepsilon_i} e^{-e^{-\varepsilon_i}} \prod_{j \ne i} e^{-e^{-(v_j - v_i + \varepsilon_i)}}\, d\varepsilon_i = \int_{-\infty}^{\infty} e^{-\varepsilon_i} e^{-e^{-\varepsilon_i}\left(1 + \sum_{j \ne i} e^{v_i - v_j}\right)}\, d\varepsilon_i$$

Define

$$\lambda_i = \ln\left(1 + \sum_{j \ne i} e^{v_i - v_j}\right)$$

so that

$$p_i(v) = \int_{-\infty}^{\infty} e^{-\varepsilon_i} e^{-e^{-(\varepsilon_i - \lambda_i)}}\, d\varepsilon_i$$

Change of variables to $\eta_i = \varepsilon_i - \lambda_i$, so that $\varepsilon_i = \eta_i + \lambda_i$ and

$$p_i(v) = e^{-\lambda_i} \int_{-\infty}^{\infty} e^{-\eta_i} e^{-e^{-\eta_i}}\, d\eta_i = e^{-\lambda_i} = \frac{1}{1 + \sum_{j \ne i} e^{v_i - v_j}} = \frac{e^{-v_i}}{\sum_{j=1}^I e^{-v_j}}$$

• The discrete choice model with these choice probabilities is called the Multinomial Logit (MNL) model.
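As a quick sanity check on this closed form, here is a minimal Python sketch (numpy assumed; the utilities are illustrative numbers) that simulates the ARUM with i.i.d. extreme value errors and compares the simulated choice frequencies to the MNL formula.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([1.0, 0.5, 0.2])        # illustrative average disutilities v_i
R = 200_000                          # number of simulated agents

# Simulate the ARUM: u_i = -v_i + eps_i with eps_i i.i.d. standard extreme value (Gumbel)
eps = rng.gumbel(size=(R, len(v)))
choices = np.argmax(-v + eps, axis=1)
freq = np.bincount(choices, minlength=len(v)) / R

# Closed-form MNL probabilities: p_i = exp(-v_i) / sum_j exp(-v_j)
p = np.exp(-v) / np.exp(-v).sum()

print("simulated:  ", freq.round(3))
print("closed form:", p.round(3))    # the two should agree up to simulation noise
```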

Properties of the MNL model

• The MNL model is the most popular model of discrete choice because the choice probabilities have a closed-form expression.

• MNL with a scale parameter: in the ARUM replace $\varepsilon_i$ by $\tilde\varepsilon_i = \sigma \varepsilon_i$. The choice probabilities are

$$p_i(v) = \frac{e^{-v_i/\sigma}}{\sum_{j=1}^I e^{-v_j/\sigma}} = \frac{1}{\sum_{j=1}^I e^{-(v_j - v_i)/\sigma}}$$

– If $\sigma \to \infty$ then $p_i(v) \to \frac{1}{I}$, i.e. the choice probabilities are independent of the average utilities $v_1, \ldots, v_I$.
– If $\sigma \to 0$, then

$$p_i(v) = \begin{cases} 1 & \text{if } -v_i > -v_j \text{ for all } j \ne i \\ 0 & \text{if } -v_j > -v_i \text{ for some } j \ne i \end{cases}$$

i.e. the alternative with the highest average utility is chosen.

– If we introduce a scale parameter only for alternative $i$: if $\sigma_i \to \infty$, the choice probabilities are independent of the average utility of $i$, and if $\sigma_i \to 0$, alternative $i$ is either never or always chosen, depending on whether its average utility is negative or positive, respectively.
– The effect of $v_i$ is proportional to $\frac{1}{\sigma_i}$; e.g. if $v_i$ is log price, the elasticities with respect to this price are small if $\sigma_i$ is large.

• It has a property that limits its usefulness (at least without changes): Independence of Irrelevant Alternatives (IIA),

$$\frac{p_i(v)}{p_j(v)} = e^{-(v_i - v_j)}$$

i.e. the ratio of any two choice probabilities depends only on the average utilities of these two alternatives, and not on the average utilities of the other alternatives.

IIA and substitution of alternatives

• This imposes restrictions on the substitution between alternatives, if an alternative becomes less attractive, e.g. due to a price increase, or if a new alternative is introduced, e.g. a new car model.

• New alternative. We assume that initially there are two alternative car models, 1 = sedan and 2 = SUV, that have the same average utility, $-v_1 = -v_2$, so that

$$p_1(v) = \frac{e^{-v_1}}{e^{-v_1} + e^{-v_2}} = \frac{1}{2} = p_2(v)$$

All SUVs are green. A new type of SUV is introduced, the red SUV (alternative 3). Obviously $-v_2 = -v_3$, so that

$$p_1(v) = \frac{e^{-v_1}}{e^{-v_1} + e^{-v_2} + e^{-v_3}} = \frac{1}{3}$$

Conclusion: The market share (the fraction of consumers that choose the alternative) of the sedan has gone down due to the introduction of a close substitute of the other alternative, the (green) SUV. One would expect that the red SUV would only reduce the market share of the green SUV. The reason is that although the red and green SUV have the same average utility, they have independent random utility components.

• Elasticities: We have

$$\frac{\partial p_i}{\partial v_i}(v) = -p_i(v)(1 - p_i(v)), \qquad \frac{\partial p_i}{\partial v_j}(v) = p_i(v)\, p_j(v)$$

or

$$\frac{\partial \ln p_i}{\partial v_i}(v) = -(1 - p_i(v)), \qquad \frac{\partial \ln p_i}{\partial v_j}(v) = p_j(v)$$

If vi is log(price), we conclude that the price elasticity of alternative i only depends on the market share of the alternative whose price is increased.

• Implication for pricing if the supplier is a monopolist: if the number of consumers is $m$, the demand curve is $m \cdot p_i(v)$, and with $v_i$ the price of alternative $i$ and $c_i$ the constant marginal cost of an extra unit, the profit is

$$\pi_i = (v_i - c_i)\, m \cdot p_i(v)$$

The first order condition gives

$$v_i = c_i - \frac{p_i(v)}{\frac{\partial p_i}{\partial v_i}(v)} = c_i + \frac{1}{1 - p_i(v)}$$

Hence the mark-up over marginal cost depends only on the market share, and it increases with the market share. The implication is that in niche markets without close competitors the mark-ups will be small, which is contrary to intuition.
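The derivative formulas and the mark-up rule above can be checked numerically. A minimal sketch, with illustrative values of $v$:

```python
import numpy as np

v = np.array([1.0, 1.2, 2.5])        # illustrative v_i (think: log prices)
p = np.exp(-v) / np.exp(-v).sum()    # MNL market shares

# Own- and cross-derivatives of the choice probabilities w.r.t. v
own = -p * (1 - p)                   # dp_i/dv_i
cross = np.outer(p, p)               # dp_i/dv_j for i != j (diagonal not used)

# Monopoly mark-up from the FOC: v_i - c_i = 1/(1 - p_i)
markup = 1 / (1 - p)
print("shares: ", p.round(3))
print("markups:", markup.round(3))   # the largest share gets the largest mark-up
```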

2.2 Specification of the average utilities and estimation

• We introduce a second subscript for the agents, i.e. we have subscript i = 1,...,I for the I alternatives and t = 1,...,T for T agents.

• The choice set may differ across agents, i.e. agent $t$ may only choose between $I_t$ alternatives. The expressions can easily be modified to account for this.

• The ARUM model is now

$$u_{it} = -v_{it} + \varepsilon_{it}$$

with $\varepsilon_{it}$ i.i.d. Extreme Value for the MNL model.

• The average utility depends on

– Attributes of the alternatives, e.g. price, quality etc., denoted by $z_i$.
– Attributes of the agent, e.g. income, age etc. if the agent is an individual, denoted by $x_t$.

• General specification of the average utilities:

$$-v_{it} = \beta' z_i + \gamma_i' x_t$$

with $z_i$ a $K$-vector of alternative attributes and $x_t$ an $L$-vector of agent attributes. We interpret $\beta_k$ as the marginal utility of attribute $z_{ik}$.

• Note that $\beta$ is the same for all alternatives but $z_i$ is alternative specific, while $\gamma_i$ is alternative specific and $x_t$ is the same for all alternatives.

• The choice probabilities do not change if we add the same constant to the average utilities of all alternatives. Hence $\gamma_i + c$ is observationally equivalent to $\gamma_i$, i.e. it gives the same choice probabilities, so that the $\gamma_i$, $i = 1, \ldots, I$, are identified only up to a vector of constants. We normalize by setting $\gamma_1 = 0$.

• A special choice for $z_i$ is a vector of alternative specific dummies

$$d_{ij} = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{if } j \ne i \end{cases}$$

so that (I use $\alpha_j$, $j = 1, \ldots, I$, as parameters for this special case)

$$\beta' z_i = \sum_{j=1}^I \alpha_j d_{ij}$$

Because $\sum_{j=1}^I d_{ij} = 1$, we can add a constant to the $\alpha_j$ without changing the choice probabilities. Again we normalize by setting $\alpha_1 = 0$.

• An important question is what parameters can be identified with different types of data.

– Aggregate fractions or market share data, i.e. fractions that choose each alternative. If we have a model with alternative specific dummies, there is a one-to-one correspondence between these parameters and the choice probabilities:

$$\ln p_i(v) - \ln p_1(v) = \alpha_i, \quad i = 2, \ldots, I$$

if 1 is the reference alternative ($\alpha_1 = 0$). We obtain an estimator if we replace the probabilities by observed fractions.

– Aggregate fractions or market shares by subgroup. As an example, assume that we have aggregate fractions by income category and that there are $T$ categories. We denote the corresponding choice probabilities by $p_i(v_t)$ with $t = 1, \ldots, T$. We have $T(I-1)$ independent choice probabilities. If we again use alternative specific dummies and in addition a set of income dummies, we have for the average utility

$$\sum_{j=2}^I \alpha_j d_{ij} + \sum_{t=2}^T \gamma_{it} x_t$$

with $\gamma_{1t} = 0$ for $t = 2, \ldots, T$, and $t = 1$ the reference income category (why not a complete set of dummies?). We have $(I-1) + (T-1)(I-1) = T(I-1)$ parameters, so that there is again a one-to-one mapping from choice probabilities to parameters (find this mapping). We can impose restrictions on the functional form, e.g. by replacing the dummies by income category averages and by specifying the average utilities as

$$\sum_{j=2}^I \alpha_j d_{ij} + \gamma_i x_t$$

with $(I-1) + (I-1) = 2(I-1)$ parameters for $T(I-1)$ fractions. If $T > 2$ we can in principle add $(T-2)(I-1)$ parameters. The temptation is to add alternative attributes $z_i$ in addition to the alternative specific dummies, i.e.

$$\sum_{j=2}^I \alpha_j d_{ij} + \beta' z_i + \gamma_i' x_t$$

Note that $\beta$ is then only identified by restricting the effect of income. We could also have added a polynomial in $x_t$ or interactions $d_{ij} x_t$, i.e. make the alternative specific constant dependent on $x_t$. Conclusion: We cannot identify the effect of more than $I-1$ alternative attributes without making functional form assumptions.

– Individual data. If one of the $x_t$ is continuous, it is as if we have infinitely many subgroups, so it seems that we can have both alternative specific dummies and alternative attributes. However, there is only identification by functional form restriction. Consider $I = 2$ with choice probabilities

$$p_2(x_t) = \frac{e^{\alpha_2 + v_2(x_t)}}{1 + e^{\alpha_2 + v_2(x_t)}}, \qquad p_1(x_t) = \frac{1}{1 + e^{\alpha_2 + v_2(x_t)}}$$

Then

$$\alpha_2 + v_2(x_t) = \ln \frac{p_2(x_t)}{p_1(x_t)}$$

and we can obtain $\alpha_2$ only by a normalization such as $v_2(x_0) = 0$. Again, alternative specific dummies are the most unrestrictive specification for the effect of alternative attributes.

• So with individual data the preferred specifications for the average utilities are

$$\sum_{j=2}^I \alpha_j d_{ij} + \gamma_i' x_t$$

or, imposing restrictions on the attribute effects,

$$\beta' z_i + \gamma_i' x_t$$

• An important extension is to allow the $\beta$'s in the last specification to depend on $x_t$, i.e.

$$\beta = \beta_0 + \beta_1 x_t$$

with $\beta_1$ a $K \times L$ matrix of coefficients. The average utility is

$$\beta_0' z_i + x_t' \beta_1' z_i$$

Note that I have omitted $\gamma_i' x_t$ because the current specification restricts the $\gamma_i$ (the same remarks apply as for the case of alternative specific dummies versus the restricted $\beta' z_i$). Note that the coefficient vector on $x_t$ is $\beta_1' z_i$. This involves $KL$ parameters, which should be compared to the $(I-1)L$ parameters in the unrestricted $\gamma_2, \ldots, \gamma_I$. We already imposed $K \le I - 1$.

• The resulting MNL probabilities are

$$p_i(x_t; \beta) = \frac{e^{\beta_0' z_i + x_t' \beta_1' z_i}}{\sum_{j=1}^I e^{\beta_0' z_j + x_t' \beta_1' z_j}}$$

Assume that there are two attributes: quality $z_{i1}$ with $\beta_{01} > 0$, and log price $z_{i2}$ with $\beta_{02} < 0$. Remember that the cross price effects in the MNL model are proportional to the product of the market shares of the product whose price is raised and the product under consideration. Let $x_t$ be education and let the marginal utility of quality be positively related to education, i.e. $\beta_{11} > 0$ (and $\beta_{12} = 0$, as the marginal utility of price does not depend on education). A high priced, high quality alternative is mainly chosen by agents with a high level of education, who will choose low price, low quality alternatives less often. If we concentrate on individuals with a high level of education, then if the price of such an alternative is raised, they will substitute toward other high price, high quality alternatives. If a new low price, low quality alternative is introduced, this will not affect their choices much.

• Conclusion: If the variation in the marginal utilities of attributes can be explained by agent characteristics $x_t$, then we can counter the effect of IIA by interacting the attributes with $x_t$. Note that this amounts to segmenting the market according to $x_t$ into groups that value attributes similarly, so that substitution occurs between products with similar attributes. In general it is hard to explain taste variation by observables (the work of marketing researchers), so we should also allow for unobservable taste differences. For economists, getting the substitution pattern right is enough.

3 Estimation of parameters of MNL model with individual data

• If the choice probabilities are

$$p_i(x_t; \beta) = \frac{e^{\beta_0' z_i + x_t' \beta_1' z_i}}{\sum_{j=1}^I e^{\beta_0' z_j + x_t' \beta_1' z_j}}$$

• The data are a random sample y1t, . . . , yIt, xt with yit = 1 if t chooses i and 0 if not.

• We also need z1, . . . , zI usually obtained from product descriptions etc.

• We have

$$f(y_{1t}, \ldots, y_{It} | x_t, z_1, \ldots, z_I; \beta) = \prod_{i=1}^I p_i(x_t; \beta)^{y_{it}}$$

so that the log likelihood is

$$\ln L(\beta) = \sum_{t=1}^T \sum_{i=1}^I y_{it} \ln p_i(x_t; \beta)$$

• The MLE is computed in the usual way and has the usual properties.
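As an illustration of the computation, here is a minimal sketch in Python (numpy/scipy assumed) that generates data from an MNL model with alternative attributes only ($-v_i = \beta' z_i$, a simplification of the specification above) and maximizes the log likelihood numerically; all numbers are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
I, T, K = 4, 2000, 2
z = rng.normal(size=(I, K))              # alternative attributes z_i
beta_true = np.array([1.0, -0.5])

# Generate choices from the ARUM: u_it = beta'z_i + eps_it, eps i.i.d. extreme value
util = z @ beta_true + rng.gumbel(size=(T, I))
y = util.argmax(axis=1)                  # index of the chosen alternative per agent

def neg_loglik(beta):
    s = z @ beta                         # mean utilities, shape (I,)
    log_p = s - np.log(np.exp(s).sum())  # log MNL probabilities
    return -log_p[y].sum()               # minus sum of y_it * log p_i(x_t)

res = minimize(neg_loglik, x0=np.zeros(K), method="BFGS")
print(res.x)                             # should be close to beta_true
```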

3.1 Dealing with IIA: Correlation among random utility components

3.1.1 Solution 1: Nested Multinomial Logit (NMNL)

• The key problem with IIA is that alternatives that are close, i.e. are close substitutes, have independent random components.

• A solution is to make the random components of similar alternatives dependent.

• Consider an example with 4 alternatives. Choose the joint cdf of $\varepsilon_1, \ldots, \varepsilon_4$ as

$$F(\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4) = \exp\left[-\left(e^{-\varepsilon_1/\sigma_1} + e^{-\varepsilon_2/\sigma_1}\right)^{\sigma_1} - \left(e^{-\varepsilon_3/\sigma_2} + e^{-\varepsilon_4/\sigma_2}\right)^{\sigma_2}\right]$$

Note that $\varepsilon_1, \varepsilon_2$ are dependent, as are $\varepsilon_3, \varepsilon_4$. We have $\sigma_1 \approx 1 - \rho(\varepsilon_1, \varepsilon_2)$.

• The choice probabilities are

$$p_{12}(v) = p_1(v) + p_2(v) = \frac{\left(e^{-v_1/\sigma_1} + e^{-v_2/\sigma_1}\right)^{\sigma_1}}{\left(e^{-v_1/\sigma_1} + e^{-v_2/\sigma_1}\right)^{\sigma_1} + \left(e^{-v_3/\sigma_2} + e^{-v_4/\sigma_2}\right)^{\sigma_2}}$$

and

$$p_{1|1,2}(v) = \frac{e^{-v_1/\sigma_1}}{e^{-v_1/\sigma_1} + e^{-v_2/\sigma_1}}$$

which has the MNL form. Hence

$$p_1(v) = p_{1|1,2}(v)\, p_{12}(v)$$

This ARUM model is called the Nested Multinomial Logit (NMNL) model.

• Reconsider the car type choice example with alternatives 1 (sedan), 2 (green SUV) and 3 (red SUV), where the two SUVs form a nest with parameter $\sigma$ (a numerical sketch of this example follows at the end of this subsection). Then

$$p_{2,3}(v) = \frac{2^\sigma e^{-v_2}}{2^\sigma e^{-v_2} + e^{-v_1}}$$

and

$$p_{2|2,3}(v) = \frac{1}{2}$$

If $v_1 = v_2$, then

$$p_{2,3}(v) = \frac{2^\sigma}{2^\sigma + 1}$$

Hence, if $\sigma \downarrow 0$, then $p_{2,3}(v) \to \frac{1}{2}$, which is the same as before the introduction of the red SUV.

• Specification of the average utilities ($I = 4$):

$$-v_{1t} = \alpha' z_1 + \beta' w_1 + \gamma_1' x_t$$
$$-v_{2t} = \alpha' z_1 + \beta' w_2 + (\gamma_1 + \delta_1)' x_t$$
$$-v_{3t} = \alpha' z_2 + \beta' w_3 + \gamma_2' x_t$$
$$-v_{4t} = \alpha' z_2 + \beta' w_4 + (\gamma_2 + \delta_2)' x_t$$

with $z_1, z_2$ attributes of the nests and $w_1, w_2, w_3, w_4$ attributes of the alternatives.

• Choice probabilities

– Conditional choice probability:

$$p_{1|1,2}(x_t; \beta, \delta, \sigma) = \frac{e^{\beta' w_1/\sigma_1}}{e^{\beta' w_1/\sigma_1} + e^{\beta' w_2/\sigma_1 + \delta_1' x_t/\sigma_1}}$$

Note that $\alpha' z_1/\sigma_1$ and $\gamma_1' x_t/\sigma_1$ cancel.

– From this MNL choice probability we can obtain MLEs of $\beta/\sigma_1$ and $\delta_1/\sigma_1$.

– Marginal choice probability:

$$p_{12}(x_t; \alpha, \beta, \gamma, \delta, \sigma) = \frac{e^{\alpha' z_1 + \gamma_1' x_t + \sigma_1 \ln\left(e^{\beta' w_1/\sigma_1} + e^{\beta' w_2/\sigma_1 + \delta_1' x_t/\sigma_1}\right)}}{e^{\alpha' z_1 + \gamma_1' x_t + \sigma_1 \ln\left(e^{\beta' w_1/\sigma_1} + e^{\beta' w_2/\sigma_1 + \delta_1' x_t/\sigma_1}\right)} + e^{\alpha' z_2 + \gamma_2' x_t + \sigma_2 \ln\left(e^{\beta' w_3/\sigma_2} + e^{\beta' w_4/\sigma_2 + \delta_2' x_t/\sigma_2}\right)}}$$

We need to normalize, by setting e.g. $\gamma_1 = 0$.

– This is again an MNL choice probability, but with an additional explanatory variable, usually called the inclusive value. Its coefficient measures the dependence (the smaller, the more dependent) of the alternatives in the nest. It can be shown that

$$E[\max\{u_{1t}, u_{2t}\}] = \alpha' z_1 + \gamma_1' x_t + \sigma_1 \ln\left(e^{\beta' w_1/\sigma_1} + e^{\beta' w_2/\sigma_1 + \delta_1' x_t/\sigma_1}\right)$$

• The expressions for the marginal and conditional choice probabilities suggest a two-step estimator that gives consistent but inefficient estimators of the parameters.

• A problem with NMNL is that one has to impose the nests, and this choice may not be obvious.
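As referenced in the SUV example above, here is a minimal numerical sketch of the NMNL probabilities for three alternatives with $\{2, 3\}$ in one nest (plain Python/numpy; the utilities are illustrative):

```python
import numpy as np

def nested_shares(v1, v2, v3, sigma):
    """Shares with alternative 1 alone and {2, 3} in a nest with parameter sigma."""
    incl = sigma * np.log(np.exp(-v2 / sigma) + np.exp(-v3 / sigma))  # log inclusive value
    p_nest = np.exp(incl) / (np.exp(incl) + np.exp(-v1))              # Pr(choose nest {2,3})
    p2_given_nest = np.exp(-v2 / sigma) / (np.exp(-v2 / sigma) + np.exp(-v3 / sigma))
    return 1 - p_nest, p_nest * p2_given_nest, p_nest * (1 - p2_given_nest)

for sigma in [1.0, 0.5, 0.01]:          # sigma = 1 is plain MNL (shares 1/3 each)
    print(sigma, np.round(nested_shares(1.0, 1.0, 1.0, sigma), 3))
# As sigma -> 0 the sedan share returns to 1/2: the red SUV only cannibalizes the green SUV.
```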

3.1.2 Solution 2: Multinomial Probit (MNP), see Section 4.

3.1.3 Solution 3: Random coefficients (mixed logit)

• We noted earlier that we can create realistic substitution between alternatives, when attractiveness changes or new alternatives are introduced, by letting the marginal utilities of the attributes of the alternatives, i.e. the $\beta$ coefficients, depend on observed characteristics of the agents.

• In general: If an agent places a high value on a particular attribute, the probability that he/she chooses an alternative that has much of that attribute is relatively large (compared to alternatives that have little of the attribute). Hence, if an alternative with much of the attribute becomes less attractive, e.g. because its price increases, then the agent will substitute to other alternatives that have much of the attribute.

• The conclusion is that we should subdivide the population on the basis of the marginal utility of attributes.

• For this we can use observable agent characteristics, e.g. income, education etc. This may not explain much of the taste variation in the population, and for that reason we should also consider dependence on unobservables.

• If we only allow for unobservable factors, then we obtain the random coefficient logit or mixed logit model:

$$\beta = \bar\beta + \nu$$

with $\bar\beta$ the average marginal utility in the population and $\nu$ a mean 0 random variable that captures taste variation. The joint distribution of $\nu$ has pdf $g(\nu; \lambda)$ with $\lambda$ a vector of parameters. A popular choice is the multivariate normal or lognormal distribution.

• The resulting choice probabilities are

$$p_i(x_t; \theta) = \int \cdots \int \frac{e^{\bar\beta' z_i + \nu' z_i + \gamma_i' x_t}}{\sum_{j=1}^I e^{\bar\beta' z_j + \nu' z_j + \gamma_j' x_t}}\, g(\nu; \lambda)\, d\nu$$

• The integral can be computed by numerical integration or by simulation, with $\nu_r$ a draw from $g(\nu; \lambda)$:

$$\hat p_i(x_t; \theta) = \frac{1}{R} \sum_{r=1}^R \frac{e^{\bar\beta' z_i + \nu_r' z_i + \gamma_i' x_t}}{\sum_{j=1}^I e^{\bar\beta' z_j + \nu_r' z_j + \gamma_j' x_t}}$$

and these can be used in MSM or in simulated ML.
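A minimal sketch of this simulator (numpy assumed; $g(\nu; \lambda)$ taken to be a diagonal normal, and agent attributes $x_t$ omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
I, K, R = 3, 2, 500
z = rng.normal(size=(I, K))             # alternative attributes z_i
beta_bar = np.array([1.0, -0.5])        # average marginal utilities
lam = np.array([0.5, 0.3])              # std. dev. of the taste shocks nu (lambda)

def mixed_logit_probs(z, beta_bar, lam, R, rng):
    nu = rng.normal(size=(R, len(beta_bar))) * lam  # nu_r ~ g(nu; lambda) = N(0, diag(lam^2))
    s = (beta_bar + nu) @ z.T                       # utilities, shape (R, I)
    p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # MNL probs per draw
    return p.mean(axis=0)                           # average over the R draws

print(mixed_logit_probs(z, beta_bar, lam, R, rng).round(3))
```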

4 Multinomial Probit (MNP)

4.1 Setup

• If nests are not obvious it may be better to let the data decide on the correlation structure of the random components.

• Consider the ARUM

$$u_i = -v_i + \varepsilon_i, \quad i = 1, \ldots, I$$

with

$$\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_I \end{pmatrix} \sim N(0, \Sigma)$$

• Define the $(I-1)$-vectors

$$\eta^{(i)} = \begin{pmatrix} \varepsilon_1 - \varepsilon_i \\ \vdots \\ \varepsilon_I - \varepsilon_i \end{pmatrix}, \qquad v^{(i)} = \begin{pmatrix} v_1 - v_i \\ \vdots \\ v_I - v_i \end{pmatrix}$$

(the $i$-th difference, which is identically 0, is omitted).

• Note that

$$\eta^{(i)} = A_i \varepsilon$$

with $A_i$ the $(I-1) \times I$ matrix obtained by inserting a column of $-1$'s as the $i$-th column of the $(I-1)$-dimensional identity matrix, so that

$$\eta^{(i)} \sim N(0, \tilde\Sigma), \qquad \tilde\Sigma = A_i \Sigma A_i'$$

• The choice probability is an orthant probability of the multivariate normal distribution:

$$p_i(v) = \Pr(\eta^{(i)} < v^{(i)}) = \Phi(v^{(i)}; \tilde\Sigma)$$

• Example for $I = 3$, with $\eta_1 = \varepsilon_2 - \varepsilon_1$, $\eta_2 = \varepsilon_3 - \varepsilon_1$, $\tilde v_1 = v_2 - v_1$ and $\tilde v_2 = v_3 - v_1$ (a numerical check follows at the end of this subsection):

$$p_1(v) = \Pr(\varepsilon_2 - \varepsilon_1 < v_2 - v_1, \; \varepsilon_3 - \varepsilon_1 < v_3 - v_1) = \Pr(\eta_1 < \tilde v_1, \; \eta_2 < \tilde v_2)$$
$$p_2(v) = \Pr(\eta_1 > \tilde v_1, \; \eta_2 < \tilde v_2 - \tilde v_1 + \eta_1)$$
$$p_3(v) = \Pr(\eta_2 > \tilde v_2, \; \eta_1 < \tilde v_1 - \tilde v_2 + \eta_2)$$

• Hence we can identify

– $\tilde v_1 = v_2 - v_1$ and $\tilde v_2 = v_3 - v_1$;
– the variance matrix of $\eta_1, \eta_2$.

• Because we can divide all inequalities by any positive constant, we divide by the standard deviation of $\eta_1$, or equivalently we set it equal to 1, or in terms of the original parameters

$$\sigma_{11} + \sigma_{22} - 2\sigma_{12} = 1$$

• The other two components of $\tilde\Sigma$ are

$$\tilde\sigma_{12} = \sigma_{23} - \sigma_{12} - \sigma_{13} + \sigma_{11}$$
$$\tilde\sigma_{22} = \sigma_{33} + \sigma_{11} - 2\sigma_{13}$$

• We have 3 equations (identified variances/covariances) with 6 unknowns. Hence, we must fix 3 parameters.

• Option 1: $\sigma_{11} = \sigma_{22} = \sigma_{33} = 1$, i.e. the random components have the same variance.

• Option 2: $\sigma_{12} = \sigma_{13} = \sigma_{23} = 0$, i.e. the random components are independent. This is not attractive. Reconsider the example in which a red SUV was introduced. Then for alternative 1 (sedan)

$$p_{1|1,2}(v) = \Pr(\varepsilon_2 < v_2 - v_1 + \varepsilon_1)$$

and

$$p_{1|1,2,3}(v) = \Pr(\varepsilon_2 < v_2 - v_1 + \varepsilon_1, \; \varepsilon_3 < v_3 - v_1 + \varepsilon_1)$$

Given $\varepsilon_1$, under independence

$$\Pr(\varepsilon_2 < v_2 - v_1 + \varepsilon_1, \; \varepsilon_3 < v_3 - v_1 + \varepsilon_1) = \Pr(\varepsilon_2 < v_2 - v_1 + \varepsilon_1)\Pr(\varepsilon_3 < v_3 - v_1 + \varepsilon_1) < \Pr(\varepsilon_2 < v_2 - v_1 + \varepsilon_1)$$

If we integrate with respect to the pdf of $\varepsilon_1$, we have

$$p_{1|1,2,3}(v) < p_{1|1,2}(v)$$

i.e. the market share of the sedan decreases. Conclusion: It is the independence and not the extreme value distribution that causes the problem.
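As a numerical check on the $I = 3$ orthant-probability formula in this subsection, the following sketch (illustrative $\Sigma$ and $v$; scipy assumed) computes $p_1(v) = \Phi(\tilde v; \tilde\Sigma)$ with the bivariate normal cdf:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative covariance of (eps_1, eps_2, eps_3); A maps eps to eta^(1)
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
A = np.array([[-1.0, 1.0, 0.0],        # eta_1 = eps_2 - eps_1
              [-1.0, 0.0, 1.0]])       # eta_2 = eps_3 - eps_1
Sigma_tilde = A @ Sigma @ A.T

v = np.array([0.5, 1.0, 1.5])          # illustrative average disutilities
v_tilde = np.array([v[1] - v[0], v[2] - v[0]])

# p_1 = Pr(eta_1 < v2 - v1, eta_2 < v3 - v1) = Phi(v_tilde; Sigma_tilde)
p1 = multivariate_normal(mean=np.zeros(2), cov=Sigma_tilde).cdf(v_tilde)
print(round(float(p1), 4))
```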

4.2 Estimation of MNP model

• We specify the average utility as in the MNL model, e.g.

$$-v_{it} = \beta' z_i + \gamma_i' x_t$$

with the normalization $\gamma_1 = 0$.

• The data are as in the MNL case.

• The conditional density that enters the likelihood is

$$f(y_{1t}, \ldots, y_{It} | x_t, z_1, \ldots, z_I; \beta, \gamma) = \prod_{i=1}^I \Phi\left(\beta'(z_1 - z_i) + (\gamma_1 - \gamma_i)' x_t, \ldots, \beta'(z_I - z_i) + (\gamma_I - \gamma_i)' x_t; \tilde\Sigma\right)^{y_{it}}$$

• The identified parameters were discussed earlier.

• The likelihood involves the computation of an $(I-1)$-dimensional normal cdf, i.e. an $(I-1)$-dimensional numerical integral.

• Consider $T = 1000$ and $I = 10$, with 5 alternative attributes and 10 agent attributes.

• The number of parameters is $\frac{1}{2}I(I-1) - 1$ (for $\tilde\Sigma$) plus 5 (for $\beta$) plus $10(I-1)$ (for the $\gamma_i$'s), i.e. $\frac{1}{2}I(I-1) + 4 + 10(I-1) = 139$.

• Most numerical search algorithms require numerical first partial derivatives

$$\frac{\partial \ln L}{\partial \theta_j} \approx \frac{\ln L(\theta + \delta e_j) - \ln L(\theta - \delta e_j)}{2\delta}$$

with $e_j$ the $j$-th unit vector and $\delta$ a small number.

• The derivative requires $2 \times 139 = 278$ log likelihood evaluations. Hence one iteration of the algorithm requires about $279 \times 1000 = 279{,}000$ nine-dimensional numerical integrals (278 evaluations plus one at $\theta$ itself, each involving $T = 1000$ integrals). With 20 iterations we need 5,580,000 numerical integrals. If one integral requires 1 second, the estimation will require 1,550 hours, or almost 65 days!

4.3 Simulation estimation of MNP

• An alternative to numerical integration is simulation. The choice probability is

$$\Phi\left(\beta'(z_1 - z_i) + (\gamma_1 - \gamma_i)' x_t, \ldots, \beta'(z_I - z_i) + (\gamma_I - \gamma_i)' x_t; \tilde\Sigma\right)$$
$$= \Pr\left(\eta_1 \le \beta'(z_1 - z_i) + (\gamma_1 - \gamma_i)' x_t, \ldots, \eta_{I-1} \le \beta'(z_I - z_i) + (\gamma_I - \gamma_i)' x_t\right)$$

• Simulation algorithm

1. Draw $R$ vectors

$$\begin{pmatrix} \eta_{1r} \\ \vdots \\ \eta_{I-1,r} \end{pmatrix}, \quad r = 1, \ldots, R$$

from $N(0, \tilde\Sigma)$. Do this once.

2. Simulate an estimate of the choice probability for observation $t$:

$$\hat p_i(x_t; \beta, \gamma) = \frac{1}{R} \sum_{r=1}^R I\left(\eta_{1r} \le \beta'(z_1 - z_i) + (\gamma_1 - \gamma_i)' x_t, \ldots, \eta_{I-1,r} \le \beta'(z_I - z_i) + (\gamma_I - \gamma_i)' x_t\right)$$

3. Simulate the log likelihood

$$\widehat{\ln L}(\beta, \gamma) = \sum_{t=1}^T \sum_{i=1}^I y_{it} \ln \hat p_i(x_t; \beta, \gamma)$$

and maximize it with respect to the parameters.
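A minimal sketch of this crude frequency simulator for one orthant probability (numpy assumed; illustrative $\tilde\Sigma$). It also makes issues (i) and (ii) below visible: the simulator is a step function of its arguments and can be exactly 0.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma_tilde = np.array([[1.0, 0.5],
                        [0.5, 1.5]])   # illustrative covariance of eta
R = 1000
L = np.linalg.cholesky(Sigma_tilde)
eta = rng.normal(size=(R, 2)) @ L.T    # R draws from N(0, Sigma_tilde), drawn once (step 1)

def freq_simulator(c):
    """Crude frequency simulator of Pr(eta_1 <= c_1, eta_2 <= c_2) (step 2)."""
    return np.mean((eta[:, 0] <= c[0]) & (eta[:, 1] <= c[1]))

print(freq_simulator(np.array([0.2, -0.1])))
# Note: as a function of c (hence of the parameters) this is discontinuous,
# and for c far in the lower tail it returns exactly 0.
```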

4.3.1 Issues with this simple simulation approach

(i) The simulated probability $\hat p_i(x_t; \beta, \gamma)$ may be 0 (a problem with the log).

(ii) Because the simulated probability is a frequency, it is a discontinuous function of the parameters. This makes numerical maximization of the log likelihood difficult.

(iii) The simulated MLE is not consistent unless the number of simulations increases with the number of observations. The reason:

$$\hat p_i(x_t; \beta, \gamma) = p_i(x_t; \beta, \gamma) + u_{it}$$

with $u_{it}$ the simulation error, $E(u_{it}) = 0$ and $\text{Var}(u_{it}) = \frac{1}{R} p_i(x_t; \beta, \gamma)(1 - p_i(x_t; \beta, \gamma))$. Remember that the MLE is consistent because the average log likelihood satisfies

$$\frac{1}{T} \ln L(\beta, \gamma) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I y_{it} \ln p_i(x_t; \beta, \gamma) \xrightarrow{p} E_x\left[\sum_{i=1}^I p_i(x; \beta_0, \gamma_0) \ln p_i(x; \beta, \gamma)\right]$$

and the limit is uniquely maximized at the population values $\beta_0, \gamma_0$ of the parameters. With simulated probabilities this becomes

$$\frac{1}{T} \widehat{\ln L}(\beta, \gamma) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I y_{it} \ln(p_i(x_t; \beta, \gamma) + u_{it}) \xrightarrow{p} E_{x,u}\left[\sum_{i=1}^I p_i(x; \beta_0, \gamma_0) \ln(p_i(x; \beta, \gamma) + u_i)\right]$$

Only if $u_i$ is very small, i.e. if $R$ is very large, will the maximum of the limit be at the population parameters: the simulated MLE is biased, even in large samples.

4.3.2 Solutions

• Problems (i) and (ii) require smarter simulation. In the Appendix, we describe the Geweke-Hajivassiliou-Keane simulator that gives simulated choice probabilities that are continuous in the parameters and never 0 or 1.

• For problem (iii) we change the way we estimate the model: instead of ML we use the Generalized Method of Moments (GMM).

• The starting point is the conditional moment restriction ($E_0$ means that the expectation is over the population distribution)

$$E_0[y_{it} - p_i(x_t; \beta_0, \gamma_0) | x_t] = 0$$

so that for any function $w(x_t)$ we have the unconditional moment restriction

$$E_0[(y_{it} - p_i(x_t; \beta_0, \gamma_0))\, w(x_t)] = 0$$

The $w(x_t)$ are called the instruments.

• The corresponding sample moment condition is

$$m_T(\beta, \gamma) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I (y_{it} - p_i(x_t; \beta, \gamma))\, w(x_t) = \frac{1}{T} \sum_{t=1}^T m(x_t; \beta, \gamma)$$

and the GMM estimator minimizes

$$m_T(\beta, \gamma)'\, V_T^{-1}\, m_T(\beta, \gamma)$$

with $V_T^{-1}$ a weighting matrix (chosen by you).

• How to choose $w(x_t)$?

– Optimal (smallest variance of the GMM estimator) instruments are, with $\theta = (\beta' \; \gamma')'$ and the conditional moment function $cm(y_{it}, x_t; \theta) = y_{it} - p_i(x_t; \theta)$, based on

$$E\left[\frac{\partial cm}{\partial \theta}(x_t; \theta) \,\middle|\, x_t\right] = -\frac{\partial p_i}{\partial \theta}(x_t; \theta)$$

This requires the choice probabilities.

– One approach is to simulate the instruments (independently of the simulation of the choice probabilities), e.g. use GHK and take derivatives.

– Another approach is to approximate the optimal instruments by polynomials in $x$ and $z$.

• In simulated GMM, or the Method of Simulated Moments (MSM), we replace the choice probabilities by their simulators:

$$\widehat m_T(\theta) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I (y_{it} - \hat p_i(x_t; \theta))\, w(x_t) = \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I (y_{it} - p_i(x_t; \theta))\, w(x_t) - \frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I w(x_t)\, u_{it}$$

• If the simulation is unbiased, then $E(u_{it}|x_t) = 0$, with the expectation over the simulations. Hence, by the Law of Large Numbers,

$$\frac{1}{T} \sum_{t=1}^T \sum_{i=1}^I w(x_t)\, u_{it} \xrightarrow{p} 0$$

so that

$$\widehat m_T(\theta) \xrightarrow{p} m_T(\theta)$$

Intuition: The simulation errors average out in estimation.

• Conclusion: The MSM estimator is consistent as $T \to \infty$, even for finite $R$.

• The effect of simulation is only on the variance of the MSM estimator. If we use $R$ draws, then the variance of the MSM estimator $\hat\theta_R$ is

$$V(\hat\theta_R) = \left(1 + \frac{1}{R}\right) V(\hat\theta_\infty)$$

• In general the simulation errors average out of the estimating equations, i.e. the equation to which the estimator is the solution, if it is linear in the simulated function.
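A minimal sketch of the MSM criterion (numpy assumed). The simulator `simulate_probs` and the instrument array `w` are hypothetical placeholders for the objects defined above:

```python
import numpy as np

def msm_criterion(theta, y, x, w, simulate_probs, V_inv):
    """GMM criterion m_T' V^{-1} m_T with simulated choice probabilities.

    y: (T, I) choice indicators; w: (T, I, K) instruments w(x_t);
    simulate_probs: hypothetical user-supplied simulator returning (T, I)."""
    p_hat = simulate_probs(theta, x)                   # \hat p_i(x_t; theta)
    resid = y - p_hat                                  # y_it - \hat p_i(x_t; theta)
    m = np.einsum("ti,tik->k", resid, w) / y.shape[0]  # m_T(theta), a K-vector
    return m @ V_inv @ m

# Because the moments are linear in \hat p, the simulation errors average out
# over t, which is why MSM is consistent even for a finite number of draws R.
```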

4.3.3 Appendix: Geweke-Hajivassiliou-Keane (GHK) simulator

• Write $\tilde\Sigma$ as the product of a lower triangular matrix and its transpose (Choleski decomposition):

$$\tilde\Sigma = \Delta \Delta'$$

with

$$\Delta = \begin{pmatrix} \delta_{11} & 0 & \cdots & 0 \\ \delta_{21} & \delta_{22} & & \vdots \\ \vdots & & \ddots & 0 \\ \delta_{I-1,1} & \cdots & \cdots & \delta_{I-1,I-1} \end{pmatrix}$$

• Hence we have ($\stackrel{d}{=}$ means "is distributed as")

$$\eta \stackrel{d}{=} \Delta \zeta, \qquad \zeta \sim N(0, I)$$

because $\text{Var}(\eta) = E(\eta \eta') = E[\Delta \zeta \zeta' \Delta'] = \Delta \Delta' = \tilde\Sigma$.

• In the sequel we consider $I = 4$ and the computation of $p_{1t}(\beta, \gamma)$. We define, consistent with the likelihood expression above,

$$c_i = \beta'(z_1 - z_{i+1}) + (\gamma_1 - \gamma_{i+1})' x_t, \quad i = 1, 2, 3$$

• The integration region for the computation of $p_{1t}(\beta, \gamma)$ is

$$\eta_1 \le c_1, \qquad \eta_2 \le c_2, \qquad \eta_3 \le c_3$$

or

$$\zeta_1 \le \frac{c_1}{\delta_{11}}, \qquad \zeta_2 \le \frac{c_2 - \delta_{21}\zeta_1}{\delta_{22}}, \qquad \zeta_3 \le \frac{c_3 - \delta_{31}\zeta_1 - \delta_{32}\zeta_2}{\delta_{33}}$$

Hence

$$p_1(x_t; \beta, \gamma) = \int_{-\infty}^{c_1/\delta_{11}} \int_{-\infty}^{(c_2 - \delta_{21}\zeta_1)/\delta_{22}} \int_{-\infty}^{(c_3 - \delta_{31}\zeta_1 - \delta_{32}\zeta_2)/\delta_{33}} \phi(\zeta_1)\,\phi(\zeta_2)\,\phi(\zeta_3)\, d\zeta_3\, d\zeta_2\, d\zeta_1$$

We can also write the integral using an indicator function and, multiplying and dividing by $\Phi\left(\frac{c_1}{\delta_{11}}\right) \Phi\left(\frac{c_2 - \delta_{21}\zeta_1}{\delta_{22}}\right) \Phi\left(\frac{c_3 - \delta_{31}\zeta_1 - \delta_{32}\zeta_2}{\delta_{33}}\right)$, as an integral against truncated normal densities:

$$p_1(x_t; \beta, \gamma) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \Phi\left(\frac{c_1}{\delta_{11}}\right) \Phi\left(\frac{c_2 - \delta_{21}\zeta_1}{\delta_{22}}\right) \Phi\left(\frac{c_3 - \delta_{31}\zeta_1 - \delta_{32}\zeta_2}{\delta_{33}}\right) \phi\left(\zeta_1 \,\middle|\, \zeta_1 \le \frac{c_1}{\delta_{11}}\right) \phi\left(\zeta_2 \,\middle|\, \zeta_2 \le \frac{c_2 - \delta_{21}\zeta_1}{\delta_{22}}\right) d\zeta_2\, d\zeta_1$$

where you should note that on the support of the truncated distributions the indicator is always 1.

• Simulation scheme

1. Obtain a draw from $\phi\left(\zeta_1 \middle| \zeta_1 \le \frac{c_1}{\delta_{11}}\right)$ (see below how) and compute $\frac{c_2 - \delta_{21}\zeta_1}{\delta_{22}}$. Use this result to obtain a draw from $\phi\left(\zeta_2 \middle| \zeta_2 \le \frac{c_2 - \delta_{21}\zeta_1}{\delta_{22}}\right)$ and compute $\frac{c_3 - \delta_{31}\zeta_1 - \delta_{32}\zeta_2}{\delta_{33}}$. Do this step $R$ times.

2. Compute the simulated choice probability

$$\hat p_1(x_t; \beta, \gamma) = \frac{1}{R} \sum_{r=1}^R \Phi\left(\frac{c_1}{\delta_{11}}\right) \Phi\left(\frac{c_2 - \delta_{21}\zeta_{1r}}{\delta_{22}}\right) \Phi\left(\frac{c_3 - \delta_{31}\zeta_{1r} - \delta_{32}\zeta_{2r}}{\delta_{33}}\right)$$

• Note that $\hat p_1(x_t; \beta, \gamma)$ is continuous in the parameters and strictly between 0 and 1.

• How to draw from $\phi\left(\zeta_1 \middle| \zeta_1 \le \frac{c_1}{\delta_{11}}\right)$?

– Use the fact that if $X$ has a uniform distribution on $[0, 1]$, then for any cdf $F$, $Y = F^{-1}(X) \sim F$, because $\Pr(Y \le y) = \Pr(F^{-1}(X) \le y) = \Pr(X \le F(y)) = F(y)$.

– The cdf of the distribution of $\zeta_1$ given $\zeta_1 \le \frac{c_1}{\delta_{11}}$ is

$$\frac{\Phi(\zeta_1)}{\Phi\left(\frac{c_1}{\delta_{11}}\right)}, \qquad \zeta_1 \le \frac{c_1}{\delta_{11}}$$

– The inverse cdf is

$$\Phi^{-1}\left(u\, \Phi\left(\frac{c_1}{\delta_{11}}\right)\right)$$

– Hence if $U$ is a draw from the uniform $[0, 1]$ distribution, then we compute $\zeta_1$ as

$$\zeta_1 = \Phi^{-1}\left(U\, \Phi\left(\frac{c_1}{\delta_{11}}\right)\right)$$
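A minimal sketch of the GHK recursion above for a general $(I-1)$-dimensional orthant probability (numpy/scipy assumed; illustrative $\tilde\Sigma$ and $c$). It implements steps 1 and 2 of the simulation scheme, using the inverse-cdf method for the truncated normal draws:

```python
import numpy as np
from scipy.stats import norm

def ghk(c, Delta, R, rng):
    """GHK simulator of Pr(eta <= c) for eta ~ N(0, Delta @ Delta.T), Delta lower triangular."""
    m = len(c)
    prob = np.ones(R)
    zeta = np.zeros((R, m))
    for j in range(m):
        # Upper bound for zeta_j given earlier draws: (c_j - sum_{k<j} delta_jk zeta_k) / delta_jj
        upper = (c[j] - zeta[:, :j] @ Delta[j, :j]) / Delta[j, j]
        Phi_upper = norm.cdf(upper)
        prob *= Phi_upper                     # factor Phi(...) in the GHK product
        u = rng.uniform(size=R)
        zeta[:, j] = norm.ppf(u * Phi_upper)  # draw from phi truncated to (-inf, upper]
    return prob.mean()                        # continuous in the parameters, never 0 or 1

rng = np.random.default_rng(4)
Sigma_tilde = np.array([[1.0, 0.4, 0.2],
                        [0.4, 1.2, 0.3],
                        [0.2, 0.3, 1.5]])
Delta = np.linalg.cholesky(Sigma_tilde)       # Choleski factor
print(ghk(np.array([0.3, -0.2, 0.5]), Delta, R=5000, rng=rng))
```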
