IDENTIFICATION OF SEMIPARAMETRIC DISCRETE CHOICE MODELS

by

T. SCOTT THOMPSON

Discussion Paper No. 249, September 1989

Center for Economic Research, Department of Economics, University of Minnesota, Minneapolis, MN 55455

Keywords: semiparametric models, identification, discrete choice, random utility, isotonic functions, index models.

Headnote: The question of model identification is analyzed for the semiparametric random utility model of discrete choice. Attention is focused on settings where agents face a common choice between a set of J+1 alternatives, but where actual choices are only partially observed.

Necessary conditions are derived for the setting where the only data on actual choices consists of a binary indicator for one of the alternatives.

Sufficient conditions are developed in this setting for a linear in parameters specification of indirect utility. It is found that relative to the parametric case, only a mild continuity restriction on the distribution of regressors is needed in the semiparametric model. Under these circumstances all of the choice probabilities are identified, even though actual choices are only partially observed. It is shown that methods that rely only on the index structure of the model require substantially stronger prior restrictions on the parameters for identification when the number of alternatives is large. Finally, results on the model with partial observability of choices are used to analyze the special case of full observability.


1. INTRODUCTION

The random utility model has become the canonical model in econometrics for situations where agents face a choice between a finite collection of discrete, unordered alternatives. In this model, for each alternative it is assumed that the indirect utility (conditional on choosing that alternative) is additively separable into a function of observables (the "regressors"), known up to a parametric family, and an unobserved disturbance (the "error" term). The actual choice is assumed to be the alternative for which the conditional indirect utility is at a maximum.

This paper examines the question of model identification under the assumption that the vector of errors is independent from observed determinants of indirect utility. We concentrate on the "semiparametric" version of this model in which no other restrictions are placed on the distribution of error terms. It is fruitful in this setting to consider the distribution function for the error terms to be a parameter of the model.

Hence our identification results are informative about the error distribution in addition to the usual parameters of the observable part of indirect utility.

The primary focus of our investigation is on models with partial observability of choices. Specifically, the majority of the paper is devoted to the problem of identifiability of the model when the only data available about actual choices is a binary indicator for whether or not one of the alternatives was chosen. The results for this case are extended in the last section to cover other settings, however. Full observability of choices is handled as a special case of the results there.

We also consider the question of identifiability through the index structure of the model. The random utility model of discrete choice is an "index" model because the conditional distribution of choices depends upon the observables through an index function. The particular relationship between the index function and the choice probabilities is determined by the error distribution function. We examine identification when one ignores this relationship and relies exclusively on the index property for identifying restrictions.

Relationship to Previous Studies.

Our analysis extends previous work in several directions. Of obvious importance is the relaxation of parametric assumptions. Often it is assumed that the collection of error terms is statistically independent from the observable determinants of indirect utility. In parametric versions of the model, which are the only versions that have been extensively used in applications, the vector of errors is also assumed to have a distribution that lies in a particular parametric family. Usually the Gaussian distributions or a generalized extreme value family of distributions are assumed, in which case the model respectively takes the familiar probit or nested logit structure. Extensive analysis of these models is contained in Daganzo (1983), Maddala (1983), McFadden (1974, 1976, 1981, 1982, 1984) and in numerous econometrics textbooks and monographs on discrete choice.

We also extend previous work on identification under partial observability. Poirier (1980) considered identification of bivariate normal probit models in an observational setting corresponding to our binary indicator for one of the possible choices. We extend this work by relaxing the normality assumption, by allowing for more general forms of partial observability, and by allowing for an arbitrary (finite) number of alternatives in the choice set.

More recently, a large number of estimators have been proposed for cases where the distribution of the errors is a priori unrestricted. We extend this literature primarily by allowing for multiple alternatives and partial observability. The semiparametric model has been considered by Cosslett (1983), Han (1984), Klein and Spady (1987), Manski (1975, 1985, 1988) and Matzkin (1988, 1990). Virtually all of this work analyzes only the binary choice setting. The exception is Manski (1975), who allowed for multiple alternatives, but who made the strong assumption that the error terms are independent and identically distributed across alternatives.

Our results on identification through index restrictions complement the growing body of literature on estimation of index models. The papers by Gallant (1987), Ichimura (1987), Ichimura and Lee (1990), Powell, Stock and Stoker (1987), and Stoker (1986, 1990) explore various approaches to estimation of general index models, of which discrete choice models form an important special case. Klein and Spady consider discrete choice models in particular, but their estimator only exploits the index structure of the model. Our results suggest that these estimators do not exploit all of the identifying restrictions on the parameters implicit in discrete choice models.

On the other hand, we maintain the assumption that the errors are independent from observables. Limited forms of interdependence between the error distribution and the regressors are allowed in much of the literature on parametric models, and in the papers by Manski for semiparametric models.

Manski (1988) provides an extensive analysis of identification for the binary choice model under alternative assumptions about the joint distribution of observables and error terms.

We also avoid the complications introduced by individual variation in the set of feasible alternatives. While important, the added generality from this extension substantially complicates our analysis. We therefore leave it as a topic for future research.

Findings.

When the only data that is available about actual choices is a binary indicator for whether or not the first alternative is chosen, it is necessary to normalize the vector of indirect utility functions against arbitrary permutations and rescalings of its components. Additional restrictions will be needed if the distribution of the regressors has a limited support.

Generic positive proofs of identification will often require a continuity condition on the distribution of the regressors when nothing is known about the error distribution. However, we show that in the linear in parameters model this "almost necessary" condition is the only new condition that needs to be added to the assumptions typically employed in parametric models in order to obtain a positive identification proof. Hence identification in semiparametric models requires only relatively weak (and verifiable) additional conditions beyond those needed for identification in parametric models.

We find that if the parameters of the conditional choice probability function are identified in our partial information setting, then so are all of the conditional choice probabilities. This holds even though one cannot observe the actual choice except when the first alternative is chosen.

Our exploration of identification through index restrictions specifies the nature of the additional restrictions that are needed when one exploits only the index structure of the model. We find that the loss in identifying power from ignoring the full structure of the model is substantial whenever the number of alternatives is large.

Finally, we extend the results for observability of only one alternative to the case where one has full observability within a subset of the alternatives. It is found that if one can observe binary indicators for two or more of the possible choices then the relative scale of all of the indirect utility functions is identified. When the actual choice is fully observable then (as in parametric versions of the model) all of the parameters are identified up to a (common) positive scalar transformation.

Organization of the Paper.

Section 2 lays out a formal specification for the semiparametric random utility model with partial observability, develops notation, and provides some alternative interpretations for partial observability versions of the model. Section 3 analyzes various necessary conditions for identification.

Section 4 discusses identification of the error distribution. Sufficient conditions for identification in the linear in parameters version of the model are covered in Section 5. Section 6 shows that additional restrictions are needed on the parameters when one exploits only the index structure of the model for identifying restrictions. Section 7 shows that under the conditions of Section 5 all of the choice probabilities are identified. Section 8 extends the results from Section 5 to cases where additional information about actual choices is available. An Appendix provides proofs for the main results in the text.

2. THE SEMIPARAMETRIC RANDOM UTILITY MODEL OF DISCRETE CHOICE

Here we provide a formal specification for the semiparametric random utility model. We establish notation and define the formal statistical model to be analyzed in the remainder of the paper. We also consider some alternative interpretations of this statistical model provided by economic settings other than the traditional random utility model.

The Random Utility Model of Discrete Choice.

We assume that each member of a population of rational economic agents is faced with the problem of making a choice from a common finite set of J+1 alternatives labelled 0, ..., J. Each agent is (partially) characterized by a vector x ∈ ℝ^L of observed (functions of) the characteristics of the agent and of the observed characteristics of the alternatives that are of relevance to the agent's decision problem. The observed choice c corresponds to the maximal element of an unobserved (J+1)×1 vector of (conditional indirect) utilities taking the form

(2.1)   Ū = V̄(x,β) + ν̄,

where V̄ is a known function, β is an unknown parameter from a space B, and ν̄ is a vector of unobserved components of utility. That is, c = j if and only if Ū^(j+1) > Ū^(i) for each i ≤ J+1, i ≠ j+1. It will also be convenient to define the random variables y_j ≡ 1{ c = j }, j = 0, ..., J.

The distribution of characteristics, utilities and choices in the population is determined by regarding Ū, x, ν̄, c and y₀, ..., y_J as random variables on an unspecified probability space with probability measure P. We assume that x and ν̄ are independent under P. The joint distribution of x and ν̄ determines the probabilities of all observable events. We assume that ties between the Ū^(j) occur with zero probability under P, so that the choice c is uniquely determined by the preceding stochastic structure with probability one.

The Random Utility Model with Partial Observability.

Through most of this paper we will concentrate on the setting where the actual choice c is not observed. Instead the data consist of i.i.d. observations on the random variables (x, y₀). Therefore, we assume that the only information about the actual choice that is available is an indicator for whether or not the first alternative was chosen.

There are several reasons for focusing on the partial observability setting. First, there exist settings in which choices are not fully observed. For example, economists are often able to determine whether or not a person has chosen to participate in the labor force but are not able to observe which of several alternative activities were chosen.

Second, the analysis of identification depends only on the characteristics of the formal statistical model under analysis. Other settings of choice exist that do not fall into the utility maximization framework outlined above, but which generate a statistical model equivalent to that of the random utility model with partial observability. Hence the present results have implications for other models as well. Some of the alternative interpretations of our statistical model are mentioned below.

Third, results on identification for the partial observability model imply similar results for identification when the actual choice is fully observed. Since the labelling of the alternatives is arbitrary, identification of the full observability model can be accomplished by considering it to be a collection of J+1 partial observability models linked by restrictions on the parameters across the models. This approach is developed further in Section 8.

Choice behavior is completely determined by the ordinal properties of utility, so there is no hope of completely identifying a distribution for Ū. Therefore it is convenient to transform the random utility model to a representation in terms of utility differences. Let

(2.2)   U = [ Ū^(1) − Ū^(2), ..., Ū^(1) − Ū^(J+1) ]′,
        V(x,b) = [ V̄^(1)(x,b) − V̄^(2)(x,b), ..., V̄^(1)(x,b) − V̄^(J+1)(x,b) ]′,
        ν = [ ν̄^(2) − ν̄^(1), ..., ν̄^(J+1) − ν̄^(1) ]′.

Then U = V(x,β) − ν gives the J×1 vector of utility differences relative to the first alternative. So c = 0 if and only if U ≥ 0.
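For instance, with J = 2 (three alternatives) the definitions in (2.2) give U = ( Ū^(1) − Ū^(2), Ū^(1) − Ū^(3) )′, and c = 0 exactly when both components of U are non-negative, that is, when the first alternative yields the (weakly) highest utility.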

Identification.

The most that one can learn from observations on y₀ and x is their joint distribution. We assume that the marginal distribution of x is uninformative about β or F₀. So any information about these parameters implied by the data must be contained in the distribution of y₀ conditional on x. Since y₀ is a binary indicator, its conditional distribution is completely specified by the conditional choice probabilities

(2.3)   φ₀(x) ≡ Pr( U ≥ 0 | x ) = Pr( ν ≤ V(x,β) | x ) = F₀( V(x,β) )  (a.s.),

where F₀ denotes the distribution function of ν. The study of identification will concentrate on whether or not (2.3) continues to hold when one substitutes alternative values for F₀ and/or β.

Suppose that F₀ is known a priori to lie in some space ℱ of distribution functions. We adopt the following standard definition.

Definition: Identification. (F₀,β) is identified relative to (F,b) if and only if

(2.4)   Pr( F(V(x,b)) = F₀(V(x,β)) ) < 1.

β is identified relative to b if and only if (F₀,β) is identified relative to (F,b) for each F ∈ ℱ. Similarly, F₀ is identified relative to F if and only if (F₀,β) is identified relative to (F,b) for each b ∈ B.

Hereafter when we say that F₀ or β are identified, we shall mean that they are identified relative to every other member of their respective parameter spaces ℱ and B. Notice that identification of F₀ does not imply identification of β, or conversely. Identification will generally depend upon the specifications for V, B and ℱ, upon the nature of F₀ and β, and upon the distribution of x.

Identification under our definition is necessary, but not sufficient, for the existence of consistent estimators. Generally, one needs to establish that F(V(x,b)) is a smooth functional of (F,b) in some appropriate sense in order to establish consistency for an estimator of (F₀,β).
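
A simple illustration of the definition: in the binary case J = 1 with V(x,b) = bx, fix any a > 0 and set b = aβ and F(t) = F₀(t/a). Then F(V(x,b)) = F₀(V(x,β)) with probability one, so the probability in (2.4) equals one and (F₀,β) is not identified relative to (F,b). Without further restrictions on ℱ or B, then, β can be identified at best up to scale; Section 3 develops this observation systematically.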

Parameterizations of the Semiparametric Model.

We shall assume that ℱ is the space of all cumulative distribution functions corresponding to probability measures on ℝ^J. For technical reasons it is often convenient to ensure that ℱ is compact. This can be achieved by redefining ℱ to be the space of all cumulative distribution functions corresponding to probability measures on the extended Euclidean space ℝ̄^J. All of the results in this paper continue to hold on the larger parameter space provided that we maintain the assumption that F₀ corresponds to a proper probability distribution on ℝ^J. For economy of notation we shall not explicitly present the more general results. Several footnotes identify the specific modifications that are needed to accommodate this generalization.

Whenever possible we shall leave the specific forms of V and of the parameter space B unspecified. It should be understood that B is a topological space. In applications B will usually be a compact metric space. There is no need, in general, for B to admit a finite-dimensional parameterization, however. Therefore many of our results will apply to models in which no finite-dimensional parameterizations are imposed.

On the other hand, one can only obtain positive identification results if one is able to place some structure on V and B. We will frequently adopt the following form, which is linear in the parameters β.

(C1)   V(x,b) = bx. B is a set of J×L real matrices.

This form is by far the most commonly used specification in applied work estimating random utility models. Usually restrictions are placed on the elements of B in the form of exclusion restrictions, equality constraints or inequality constraints. Since we must have β ∈ B, restrictions of this sort presume the availability of comparable prior information about β.

Example 1: McFadden (1974) proposed a model equivalent to the specification (C1) with B defined by

(2.5)   B = { I_J ⊗ β′ : β ∈ ℝ^K, ‖β‖ = 1 },

where L = J×K and I_J is the J×J identity matrix. Here b^(j) = [ 0 | ... | β′ | ... | 0 ], where each zero is a 1×K row vector and where β′ appears as the j'th 1×K subvector. This implies V^(j)(x,b) = β′x_j, where x_j is the j'th K×1 subvector of x. This specification is appropriate when no purely personal characteristics of the decision makers are observed.
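
To make (2.5) concrete, take J = 2 and K = 2, so that L = 4. Each b ∈ B then has the form

    b = I₂ ⊗ β′ = [ β₁, β₂, 0, 0; 0, 0, β₁, β₂ ]  (rows separated by semicolons),  β₁² + β₂² = 1,

so that V^(1)(x,b) = β′x₁ and V^(2)(x,b) = β′x₂, where x₁ and x₂ denote the two K×1 subvectors of x.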

Equation (2.5) imposes (J−1)×K exclusion restrictions on each row of β, (J−1)×K equality constraints across the rows of β, and a single scale normalization on β. We shall see that while certain restrictions on β are necessary, it is not necessary in general to provide such an explicit set of restrictions in order to ensure identification of β. Therefore we will usually avoid imposing explicit restrictions of this form on B.

Alternative Interpretations of the Model.

We noted above that alternative interpretations of the preceding statistical model are available. Rather than starting from the random utility model, one can take U, V, ν and y₀ as primitive elements of a general latent variable model of binary response under partial observability. Parametric versions of this model are analyzed by Poirier (1980) and McFadden (1984).

Under the alternative interpretation, it is assumed that there are J unobserved binary responses corresponding to the J events { U^(j) ≥ 0 }, but that we only observe a positive response if the J unobserved responses agree. In other words, y₀ = 1{ U ≥ 0 }.

For example, y₀ could be a binary decision made by a committee of J members for which each member has a veto. In this case, the j'th member votes affirmatively if U^(j) ≥ 0, but due to the veto an affirmative decision is only observed under unanimity.

Feinstein (1989) provides another example in a study of nuclear reactor safety monitoring. Here U^(1) ≥ 0 whenever a safety violation occurs, U^(2) ≥ 0 whenever an inspection occurs, and y₀ = 1 if and only if a violation is detected by the inspection. If the inspection process is perfect then y₀ = 1 if and only if U ≥ 0.

Obviously there are many other problems that have this general form.

3. NECESSARY CONDITIONS FOR IDENTIFICATION

Here we develop a number of negative results concerning identification of β and F₀. That is, we establish conditions under which there exist values of (F,b) ∈ ℱ × B that are observationally equivalent to (F₀,β), but where F ≠ F₀ or b ≠ β. Since (F₀,β) cannot be identified unless none of these conditions hold, we implicitly establish necessary conditions for identification.

The results in this section all elaborate on the following fundamental proposition.

Proposition 1: Suppose that there exists a function H: ℝ^J → ℝ^J and a b ∈ B such that

(3.1)   V(x,b) = H(V(x,β))  (a.s.), and

(3.2)   ν ≤ V(x,β)  ⟺  H(ν) ≤ H(V(x,β))  (a.s.).

Then (F₀,β) is not identified relative to (F,b), where F is the distribution function for H(ν).

Proposition 1 gives basic conditions under which (F₀,β) is not identified relative to (F,b). So under the stated conditions β is not identified unless b = β. Similarly, unless H(ν) = ν (a.s.) then F ≠ F₀, and F₀ is not identified.

We next develop several results that imply necessary conditions for identification by establishing specific conditions under which Proposition 1 applies. These are presented in the form of several corollaries to Proposition 1. Several of the corollaries refer to the following class of functions, which is the smallest class for which (3.2) must always hold.

Definition: Isotonic Function. A function H: ℝ^J → ℝ^J is isotonic if for each u, ν ∈ ℝ^J,

(3.3)   u ≤ ν  ⟺  H(u) ≤ H(ν).

The term 'isotonic' means 'order-preserving': isotonic functions preserve the canonical partial ordering on ℝ^J. The composition G ∘ H of any two isotonic functions G and H is isotonic. Isotonic functions must be strictly increasing in each argument. Since the definition implies that u = ν ⟺ H(u) = H(ν), isotonic functions are one-to-one. Hence if H is isotonic and onto then H⁻¹ exists and is also isotonic. This holds in particular for all linear isotonic functions. We shall say that a J×J matrix is isotonic if and only if it defines a linear isotonic function.

Proposition 1, Corollary 1: Suppose that there exists an isotonic function H and a b ∈ B, b ≠ β, such that

(3.4)   V(e,b) = H(V(e,β)) for each e ∈ ℝ^L.

Then β is not identified. If also Pr( ν ≠ H(ν) ) > 0 then F₀ is not identified.

Proposition 1, Corollary 2: If V(x,b) = V(x,β) (a.s.) for some b ∈ B, b ≠ β, then β is not identified.

Proposition 1, Corollary 3: Suppose there exists a J×J matrix C ≠ 0 and a constant vector γ such that CV(x,β) = γ (a.s.). Suppose also that there exists an isotonic function G, and a b ∈ B, b ≠ β, such that (3.4) holds for H = G + C. Then β is not identified. If also Pr( ν ≠ G(ν) + Cν ) > 0 then F₀ is not identified.

Proposition 1, Corollary 4: Suppose that

(3.5)   the support of x is the finite set {x₁, ..., x_n},

(3.6)   V(x_s, ·) is continuous at β, s = 1, ..., n, and

(3.7)   V^(j)(x_s, β) ≠ V^(j)(x_t, β), s, t ≤ n, s ≠ t, j ≤ J.

Then β is not identified. If also

(3.8)   Pr( V(x,β) = V(x,b) ) < 1 for every b in a neighborhood of β, b ≠ β, or

(3.9)   Pr( ν^(j) = V^(j)(x_s, β) ) < 1 for some j ≤ J, s ≤ n,

then F₀ is not identified.

Proposition 1, Corollary 5: Suppose there exists a J×1 vector c with exactly one strictly positive element c^(i) such that c′U ≥ 0 (a.s.). Without loss of generality let c^(i) = 1. Let H be the J×J matrix that coincides with the identity except for its i'th row, which is given by H_ii = 0 and H_ij = −c^(j) for j ≠ i. If there exists a b ∈ B, b ≠ β, such that (3.4) holds for this H then β is not identified. If Pr( c′ν = 0 ) < 1 then F₀ is not identified either.

Corollary 1 shows that restrictions on V or B are needed to normalize against isotonic transformations of V(·,β). That is, the following condition is necessary for identification of β.

(C2)   For each b ∈ B, if there exists an isotonic H such that V(e,b) = H(V(e,β)) for every e ∈ ℝ^L, then b = β.

The practical significance of this result depends directly on the nature of the class of isotonic functions. For example, the identity function is isotonic. Then Corollary 1 establishes the trivial result that β is not identified if the functions V(·,β) and V(·,b) coincide for some b ≠ β. This indicates that a more parsimonious parameterization of V exists. We shall not consider it further.

The class of isotonic functions is not trivial. All functions of the form H(ν) = ν + γ (γ a constant vector) are included. So are functions satisfying H(ν)^(j) = h_j(ν^(j)) for some collection h_j, j ≤ J, of strictly increasing one-to-one and onto functions. In particular, linear scaling functions of the form H(ν) = Dν, where D is a positive-definite diagonal matrix, are included in the class. The permutation functions H(ν) = Pν for some J×J permutation matrix P are isotonic. More complicated functions can be constructed from these examples by forming the composition of two or more isotonic functions.
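
Two small illustrations for J = 2 may help fix ideas. The map H(ν) = (2ν^(2), 3ν^(1))′, a rescaling composed with a permutation, satisfies (3.3): u ≤ ν holds if and only if u^(1) ≤ ν^(1) and u^(2) ≤ ν^(2), which is precisely the componentwise comparison of H(u) and H(ν). By contrast, H(ν) = (ν^(1) + ν^(2), ν^(2))′ is one-to-one and satisfies u ≤ ν ⟹ H(u) ≤ H(ν), but it is not isotonic: for u = (1, −1)′ and ν = (0, 0)′ one has H(u) ≤ H(ν) while u ≤ ν fails, so the equivalence required by (3.3) breaks down.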

From these examples Corollary 1 establishes the well-known fact that scale and location normalizations must be placed on the indirect utility function to ensure identification of β. A location normalization is required because a shift in the centering of the F₀ distribution cannot be distinguished from a change in a constant term in V. Hence the model specification does not provide a natural choice for the origin of ℝ^J. Similarly, deformation (or rescaling) of any one of the coordinate axes cannot be distinguished from a monotone transformation of the corresponding component of V. So scale and/or shape restrictions are needed on each component of V.

Corollary 1 also demonstrates the need to place restrictions that guard against permutations of the coordinates of V. This restriction is not needed when actual choices are observable. In the partial observability setting considered here, however, permitting permutations of the coordinates leads to a failure of identification. Since the distribution of observables is presumed to embody information on actual choices only through the binary variable y₀, one cannot distinguish between the remaining alternatives when the first alternative is not chosen. Hence any arbitrary relabeling of the other choices, corresponding to a permutation of the coordinates of V, is fully consistent with the partial observability specification of the random utility model.

The need for scale and permutation normalizations arises from the fact that choices are only partially observed, rather than from the model's semiparametric specification. Scale and permutation normalizations are also necessary in parametric versions of the random utility model with partial observability.

See, for example, Poirier (1980), who considered the case where F₀ is assumed to be bivariate normal.

The failure of identification in Corollary 1 occurs regardless of the stochastic structure of the observables. The conditions of Proposition 1, however, make clear that identification can also fail if the support of V(x,β) is too small. Corollaries 2-5 identify specific conditions under which this occurs. In particular, Corollaries 2-4 identify cases in which V(x,b) almost surely equals an isotonic transformation of V(x,β). The conditions of these Corollaries may hold even though (3.4) is not satisfied for any isotonic function H.

Corollary 2 extends Corollary 1 to cover cases where the functions V(·,b) and V(·,β) differ but nevertheless coincide on the support of x. In this case the variation in x is insufficient to identify β from other members of B. This is equivalent to the failure of identification associated with extreme multicollinearity in the linear regression model. As in the case of multicollinearity, the only solution is to impose further restrictions on B.

Corollary 3 shows that β may not be identified if the support of V(x,β) is contained in a proper affine subspace of ℝ^J. Identification fails because V(x,b) almost surely equals an isotonic transformation of V(x,β). Since the function G + C need not be isotonic, the Corollary shows that in this case restrictions on β (and B) may be needed beyond the normalizations against isotonic transformations identified in Corollary 1.

On the other hand, if CV(·,β) is identically zero then H(V(·,β)) coincides with G(V(·,β)). This case is covered by Corollary 1, since G is isotonic. So Corollary 3, like Corollary 2, contributes something new only when the degeneracy in the support of V(x,β) is caused by insufficient variation in x. As in the cases considered by Corollary 2, the only solution is to obtain additional restrictions on B.

The literature on parametric random utility models typically makes assumptions that ensure that the support of V(x,β) is not contained in any proper affine subspace of ℝ^J. This has obscured the fact that these models are also susceptible to failures of identification of the sort described by Corollary 3. Consider the following example, which is equivalent to a version of the model considered by Poirier (1980).

Example 2: Let J = L = 2. Restrict ℱ to be the space of all bivariate normal distribution functions (with arbitrary mean vectors and covariance matrices). Suppose that (C1) holds and that B is the set of all 2×2 matrices of the form

(3.10)   b = [ 1, b₁₂; b₂₁, 1 ],   b₁₂ ≥ 0,  b₂₁ ≤ 0.

Let β = I. Suppose also that x₂ = −x₁ (a.s.) and that x₁ has an everywhere positive Lebesgue density. Let

(3.11)   G = [ .5, 0; 0, 1.5 ]   and   C = [ .5, .5; −.5, −.5 ].

Let H = G + C. Then

(3.12)   H = [ 1, .5; −.5, 1 ],   and   b = Hβ = [ 1, .5; −.5, 1 ] ∈ B.

Letting F = F₀ ∘ G⁻¹, one can verify directly that F(bx) = F₀(βx) (a.s.), so (F₀,β) is not identified relative to (F,b). Furthermore F ∈ ℱ, since G⁻¹ is linear and since ℱ is invariant under linear transformations of the underlying bivariate normal random vectors. So neither β nor F₀ is identified.
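
The verification is immediate given (3.11): since x₂ = −x₁ (a.s.), Cx = 0 almost surely, so bx = Hx = Gx + Cx = Gx (a.s.). Hence F(bx) = F₀(G⁻¹(Gx)) = F₀(x) = F₀(βx) (a.s.), which is exactly the observational equivalence asserted above.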

The failure of identification in the preceding example is partially due to the degenerate distribution of x. Even so, B has been sufficiently restricted to ensure that there is no b ∈ B, b ≠ β, for which V(x,b) = V(x,β) (a.s.). So β would be identified in this example if U were directly observed. Therefore the example shows that when x has a degenerate distribution, identification of β requires more restrictions on B than are required in the linear regression model.

Corollary 4 shows that identification will often fail if x has finite support. Under Conditions (3.5)-(3.7) it is possible to find a b ∈ B close to β such that V(x,b) almost surely equals a nonlinear rescaling of V(x,β). Since nonlinear rescalings are isotonic, it is then easy to verify the conditions of Proposition 1. The failure of identification analyzed by Corollary 4 is local in the sense that β may still be identified relative to values of b that are sufficiently far from β.

Corollary 4 can be regarded as an extension of previous results for parametric versions of the random utility model. Poirier (1980) noted that in parametric versions of the model a necessary condition for identification is for the support of x to have a cardinality at least as great as the number of free parameters in the model. No finite-dimensional parameterization exists for the semiparametric random utility model, however. The implicit number of (scalar) free parameters in this model is infinite. Therefore Poirier's observation suggests that identification might fail in the semiparametric model if the support of x has finite cardinality. The Corollary shows that this is indeed the case for certain smooth parameterizations.

Corollary 5 shows that if the support of U is bounded in certain directions then there may exist linear functions H that are not isotonic but which nevertheless satisfy the conditions of Proposition 1. This problem does not arise in most parametric models, since these models typically assume that the support of ν (hence of U) covers ℝ^J.

Notice that c′U ≥ 0 if and only if c′V(x,β) ≥ c′ν. Since one can rarely say anything about the support of ν a priori, the only way to guarantee that the conditions of the Corollary do not hold is to establish that Pr( c′V(x,β) ≥ α ) < 1 for each vector c of the form specified in the Corollary and for every real number α. This is equivalent to establishing that the support of V(x,β) is not contained in certain halfspaces of ℝ^J that intersect the non-negative orthant in non-trivial ways. Alternatively, one can place further restrictions on B to ensure that (3.4) does not hold for any H of the form specified in the Corollary.

Identification fails in the manner of Corollary 5 because the underlying structure of preferences is degenerate. To see this, consider the case where the conditions of the Corollary are satisfied for a vector c with exactly one element equal to 1, one element equal to −1, and the remaining elements all zero. For example, suppose that c^(1) = −c^(2) = 1. Then c′U ≥ 0 (a.s.) if and only if

(3.13)   U^(1) − U^(2) = Ū^(3) − Ū^(2) ≥ 0  (a.s.).

That is, the third alternative is preferred to the second with probability one. In this case the first alternative is preferred to the second alternative if and only if it is also preferred to the third alternative, so an equivalent model is obtained when Ū^(2) is replaced by Ū^(3). The transformation H has exactly this effect: H^(1)U = U^(2) = Ū^(1) − Ū^(3), while the remaining components of U are unchanged. Clearly strong prior information about β will be needed to sort out the choice behavior in this case.

21 An "Almost Necessary" Condition.

Unlike Corollary 1, the conditions in Corollaries 2-5 may or may not hold depending upon the particular stochastic structure of the random utility model. In each case identification may still be achieved if sufficient prior information is available about β to allow certain restrictions to be imposed on B. Unfortunately, the particular restrictions that are needed will depend upon the specific nature of the parameterization of V and B, on the true value β and on the distribution of x. Hence these Corollaries do not immediately suggest additional generic necessary conditions for identification.

The common feature of the conditions in Corollaries 3-5, however, is a restricted support for V(x,β). The conditions of these Corollaries all fail if the following holds.

(C3)   Pr( V(x,β) ∈ E ) > 0 for every open set E ⊂ ℝ^J.

This continuity condition ensures that the support of V(x,β) coincides with ℝ^J.

Condition (C3) certainly is not necessary for identification. It is "almost necessary," however, in the sense that weaker conditions will require restrictions on V, β and x that do not hold in general, but which must be verified on a case-by-case basis. So positive, generic proofs of identification can generally be obtained only under conditions that imply (C3). The discussion in Section 4 will provide some additional reasons for imposing (C3).

Virtually all previous work on estimation of semiparametric models of discrete choice assumes Condition (C3) directly, or else assumes conditions that imply (C3).

22 Linear Transformations and Identification.

When does a function H satisfy the conditions of Proposition 1? In particular, for which H does (3.2) hold? The Corollaries to Proposition 1 provide partial answers to these questions by providing specific examples. Here we address a closely related question: What properties must H possess if it satisfies equation (3.2)?

We have no general answer to this question. Instead, the following propositions provide a specific answer for the case where H is restricted to be linear.

Proposition 2: Suppose that H: ℝ^J → ℝ^J is linear and that F₀ = F ∘ H for some F ∈ ℱ. Then H is isotonic and F = F₀ ∘ H⁻¹.

To see the significance of this Proposition, suppose that V(x,b) = HV(x,β) (a.s.) for some linear transformation H. If also (F₀,β) is not identified relative to (F,b) then we must have F(HV(x,β)) = F₀(V(x,β)) (a.s.). Then if the support of V(x,β) is large enough (if Condition (C3) holds, for example) we must have F₀ = F ∘ H. Proposition 2 establishes that the last condition holds only if H is isotonic, in which case F = F₀ ∘ H⁻¹ provides an explicit expression for F. Therefore, under these conditions the only linear transformations that can satisfy (3.2) are isotonic. It follows that if Conditions (C2) and (C3) hold then V(x,β) is identified relative to every linear transformation of itself.

On the other hand, a more complete characterization of isotonic functions will be needed in order to implement the normalizations specified by Condition (C2). We have already identified a number of specific examples of isotonic transformations. It turns out that these examples exhaust all possibilities when attention is restricted to linear isotonic functions. Proposition 3 formalizes this result.

Proposition 3: A linear transformation H is isotonic if and only if there exist square matrices P and D such that H = PD, where P is a permutation matrix and D is a positive-definite diagonal matrix. Equivalently, H is isotonic if and only if there is exactly one strictly positive element in each row and in each column of H, with the remaining elements all equal to zero.
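
For example, with J = 2 the matrix H = [ 0, 2; 0.5, 0 ] is isotonic: it factors as H = PD with P = [ 0, 1; 1, 0 ] the permutation matrix that interchanges the two coordinates and D = diag(0.5, 2), and it has exactly one strictly positive element in each row and each column. It acts by Hν = (2ν^(2), 0.5ν^(1))′, a rescaling followed by a permutation.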

Proposition 3 shows that scale and permutation normalizations are sufficient to exclude isotonic transformations of V(x,β) (other than the identity) whenever V is linear in the parameters. No other normalizations on β will be necessary in many applications.

4. IDENTIFICATION OF THE ERROR DISTRIBUTION

Distribution functions satisfy certain monotonicity, additivity and continuity properties. Given these restrictions, it may be possible to completely determine F₀ from its values on some subset of ℝ^J. For example, F₀ is completely determined by the values it takes on its support. So F₀ may be identified if V(x,β) is identified and has a rich enough distribution. The following provides the first positive identification result along these lines.

Proposition 4: Let V₀ be the support of V(x,β), let int(V₀) denote its interior and let S₀ be the support of F₀. If β is identified and S₀ ⊂ int(V₀) then F₀ is identified.

If β is identified but it is not the case that S₀ ⊂ int(V₀) then F₀ generally will not be identified. For example, if V(x,β) has a degenerate distribution so that int(V₀) is empty, but S₀ = ℝ^J, then there will be open subsets of ℝ^J on which F₀ can be bounded in various ways, but not enough to identify an individual distribution function. The proof of the Proposition reveals that F₀ always is identified relative to those F ∈ ℱ that do not coincide with F₀ on int(V₀), however. We are unaware of more general criteria on S₀ and V₀ that ensure identification of F₀ relative to every other element of ℱ.

Practical use of the Proposition requires prior knowledge of the supports of F₀ and V(x,β). Typically, only weak a priori restrictions are available on the support of F₀. Often all one knows is that F₀ corresponds to a proper probability distribution. Then conditions are needed that guarantee that the support of V(x,β) is all of ℝ^J. Therefore Condition (C3) will usually be needed to establish a generic, positive proof of identification of F₀.

5. SUFFICIENT CONDITIONS FOR IDENTIFICATION:

THE LINEAR IN PARAMETERS MODEL

This section develops a set of specific conditions under which β and F₀ are identified. A complete set of sufficient conditions requires more specific structure on the parameterization of the semiparametric random utility model than we have assumed so far. Therefore attention will be restricted to the linear in parameters specification (C1). Proposition 5 presents the main result.

Proposition 5: Suppose that in addition to Condition (C1) the following conditions hold:

(C4)   The support of x is not contained in any proper affine subspace of ℝ^L.

(C5)   βx = β₁x₁ + β₂x₂, where β₁ is a nonsingular J×J submatrix of β and Pr( x₁ ∈ E | x₂ ) > 0 almost surely for every open set E ⊂ ℝ^J.

If also F(V(x,b)) = F₀(V(x,β)) (a.s.) for some (F,b) ∈ ℱ × B, then b = Hβ and F = F₀ ∘ H⁻¹ for some isotonic matrix H.

Proposition 5, Corollary: Suppose Conditions (C1), (C4) and (C5) hold. Suppose also that

(C6)   For each b ∈ B, if b = PDβ for some permutation matrix P and some positive-definite diagonal matrix D then P = D = I.

Then (F₀,β) is identified.

The remainder of this section provides some interpretation for these sufficient conditions.

It is well-known that linear independence of the components of x is necessary for identification in parametric models satisfying Condition (C1). Condition (C4) imposes the stronger requirement that these components are affinely independent as well. This means that x cannot include a constant term, either explicitly or through some nontrivial linear combination. In particular, x cannot include any characteristics that are purely alternative specific (the price of an alternative, for example), since there is no variation in these characteristics across individual decision-makers.

Closer examination reveals that the exclusion of constants from x in the semiparametric random utility model is equivalent to the location normalization imposed on F₀ in virtually all parametric implementations of the model. In parametric models the location normalization is provided by some centering restriction on F₀. Typically the univariate marginal distributions of F₀ are required to be symmetric about zero. In the semiparametric setting F₀ is unrestricted. Given any location measures (such as the medians) for the random variables ν^(j), the semiparametric model can be regarded as having an implicit constant term, with the coefficient in V^(j) provided by the location measure for ν^(j). Condition (C4) specifies that the components of x cannot be collinear with this implicit constant term.

Constant terms can be reintroduced to the semiparametric model if a location normalization is imposed on F₀. Alternatively, given Condition (C4), one can use location measures on the univariate marginal distributions of F₀ to provide estimates of purely alternative specific effects on the choice probabilities.

When the marginal distributions of F₀ are symmetric, all reasonable location measures provide identical estimates of these effects. In the semiparametric setting, however, one does not know that F₀ is symmetric, so different location measures may provide different measures of alternative-specific effects. We choose to leave F₀ free and impose the location normalization through Condition (C4) in order to avoid choosing among these different measures.

The other normalizations are provided by Condition (C6), which excludes nontrivial isotonic transformations of β from the parameter space. Given the linear specification, Proposition 3 shows that only scale and permutation normalizations on the rows of β are needed to satisfy this condition.

The need for these restrictions has already been established, but their implementation has not. Consider Example 1. The definition of B in (2.5) normalizes against permutations by imposing (J−1)×K exclusion restrictions on each row of b. Scale normalizations are provided by the normalization ‖β‖ = 1 together with the (J−1)×K equality constraints across the rows of b.

In general, scale normalizations on B can be imposed by requiring ‖b^(j)‖ = 1 for each j. While exclusion restrictions can certainly be used to normalize against permutations, equality constraints are not necessary. Certain inequality constraints across the rows of β will work equally well. For example, if one knows a priori that β₁₁ < β₂₁ < ... < β_J1 then the restrictions b₁₁ ≤ b₂₁ ≤ ... ≤ b_J1 can be used to normalize against permutations.

Often, however, economic theory provides no a priori justification for imposing constraints that normalize against permutations. Fortunately, it is rarely necessary to provide an explicit a priori normalization. Poirier (1980) notes that while permutation normalizations are needed for global identification, they are unnecessary for local identification. So there is no need to impose these normalizations explicitly when employing local search algorithms to compute estimates of β. Furthermore, inferential exercises involving β sometimes are invariant to permutations of its rows. Hence the permutation normalization will often be ignored in practice.

We have seen that Conditions (C1), (C4) and (C6) are each equivalent to conditions routinely assumed for parametric implementations of the random utility model. Therefore the new restrictions needed for Proposition 5 in order to establish identification for the semiparametric model are all contained in Condition (C5). This condition imposes additional structure jointly on β and on the distribution of x. Conditions (C4) and (C5) together imply (C3). Hence (C5) is used in part to avoid some of the identification difficulties identified in the Corollaries to Proposition 1.

Condition (C5) can be analyzed in several parts. First, β must include a J×J nonsingular submatrix β₁, which implies that β has full row rank. Second, there must be a decomposition of x into subvectors x₁ and x₂ such that the support of x₁ covers ℝ^J conditional on x₂. Third, the elements of x that comprise x₁ must correspond to the columns of β that comprise β₁.

Taken as a whole, these requirements are somewhat restrictive. Clearly any discrete components of x must be included in x₂. While the condition does not require that x₁ (conditionally) possesses a Lebesgue density, it is not sufficient for each component of x₁ to be absolutely continuous. In particular, the condition requires that there not exist any function g such that g(x) = 0 almost surely, unless g can be written as a function of x₂ alone. Hence no elements of x₁ can be deterministically related to other components of x through interaction or polynomial expansion terms. So the condition restricts somewhat the ability of the linear in parameters specification to include specifications that are nonlinear in the data.
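
To illustrate, let J = 1 and x = (z₁, z₂, z₁z₂)′, where z₁ and z₂ are absolutely continuous scalar regressors. Every component of x is then absolutely continuous, yet each component is degenerate conditional on the other two (for example, the first equals the third divided by the second whenever the second is nonzero), so no single component can play the role of x₁ in Condition (C5). A specification that enters the interaction z₁z₂ as a separate regressor therefore violates the condition.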

The condition does not require a priori knowledge of which submatrix of β comprises β₁. For example, if x possesses an everywhere positive Lebesgue density on ℝ^L then (C5) will be satisfied for one of the J×1 subvectors of x provided that β has full row rank. Often, however, not all subvectors of x will satisfy the restrictions needed on x₁. Then the condition implies rank restrictions on certain submatrices of β. Since the condition places no restrictions on values of b ∈ B other than the true value β, there is no need to impose these rank restrictions in the definition of the parameter space B.

Condition (C5) is not necessary for identification. In the case J = 1, for example, there exist cases where F₀ and β are identified despite the fact that no component of x is continuous conditional on the other components. We are not aware of any generally applicable alternative condition, however. It does not appear that the condition can be substantially weakened except in special cases involving stronger prior information about F₀ or β. On the other hand, Condition (C5) corresponds closely to other conditions commonly imposed in the literature on semiparametric estimation of nonlinear regression models. For example, if J = 1 then Condition (C5) is equivalent to Manski's (1988b) Condition (X3) for identification of binary response models.

6. IDENTIFICATION VIA INDEX RESTRICTIONS

A conditional regression model is said to have the "index" property if the distribution of y conditional on x depends on x only through some parametric "index" function V(x,β). The semiparametric random utility model with partial observability is embedded in this class of models, since the conditional choice probabilities for the first alternative are given by φ₀(x) = F₀(V(x,β)). Notice, however, that no structure on F₀ is assumed in general index models, while the random utility model requires that F₀ is a distribution function.

This section considers the loss of identifying information (if any) that occurs when one ignores the fact that F₀ is a distribution function. The motivation for this exercise is provided by the growing collection of "index" estimators that have been proposed for the semiparametric random utility model. These estimators have the property that they are consistent for β (under conditions resembling those of Proposition 5) even when F₀ is not a distribution function. Hence they implicitly ignore the identifying power available in the restriction F₀ ∈ ℱ.

Examples of "index" estimators for the case J = 1 include Ichimura (1987), Powell, Stock and Stoker (1987), Stoker (1987, 1990), and Klein and Spady (1987). The last applies only in the special case where y is a Bernoulli random variable conditional on x. Ichimura and Lee (1990) proposed an estimator that applies to index models when J ≥ 1. We shall generically call these estimators "index estimators."

Comparison of the identification results in Section 5 to those in the preceding papers shows that relying exclusively on the index property for identification results in a substantial loss of identifying power. This is because, given any invertible function H: ℝ^J → ℝ^J, one can always write φ₀(x) = F(H(V(x,β))), where F = F₀ ∘ H⁻¹. We have seen that this F is a distribution function if and only if H is isotonic. If one ignores the restriction F₀ ∈ ℱ, however, then V(x,β) is clearly not identified relative to any invertible transformation of itself.

Significantly stronger prior information is needed to identify V(x,β) relative to arbitrary invertible transformations than is needed to identify it only relative to isotonic transformations. To see this, consider the linear in parameters model. Suppose that β is not identified relative to b = Hβ for some nonsingular matrix H. If b ∈ B and β has full row rank then β is not identified unless B is restricted sufficiently to ensure that H = I. So one can quantify the loss in identifying power that results from ignoring F₀ ∈ ℱ by comparing the additional restrictions needed on H to ensure H = I when H is known to be isotonic with those needed when H is known only to be nonsingular.

Now a matrix H is nonsingular if and only if

(6.1)   H = PLSDU,

where P is a J×J permutation matrix, L and U are respectively lower and upper triangular J×J matrices with ones along the main diagonal, D is a J×J positive-definite diagonal matrix, and S is a J×J diagonal "sign" matrix whose diagonal elements each equal plus or minus one. Proposition 3 establishes that H is isotonic if and only if

(6.2)   L = S = U = I.

But it is easy to establish that H = I if and only if (6.2) holds and also

(6.3)   P = D = I.

So (6.2) represents the additional restrictions needed on each element of B in order to ensure that β is identified when one ignores that F₀ ∈ ℱ. If the restriction F₀ ∈ ℱ is ignored then one needs J² a priori restrictions on β (corresponding to the free parameters of L, S and U) in order to completely identify β. Of these, J will be sign restrictions. So when J = 1 the cost from ignoring F₀ ∈ ℱ is a single sign restriction on β. For J ≥ 2, equality restrictions will be needed on β beyond the innocuous restrictions needed to normalize against scale and permutation transformations. In these cases one needs substantially stronger prior information in order to successfully identify β using index restrictions than is needed to identify β using the full structure of the random utility model. Since the off-diagonal elements of L and U can vary continuously, failure to impose these additional constraints will lead to a failure of local and global identification.
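
For instance, with J = 2 take H = [ 1, 0; 1, 1 ]. In the decomposition (6.1) this matrix has P = S = D = U = I but L ≠ I, and it is not isotonic, since its second row contains two positive elements. Writing φ₀(x) = F(HV(x,β)) with F = F₀ ∘ H⁻¹ leaves the index property intact, so an estimator that exploits only the index structure cannot distinguish b = Hβ from β whenever b ∈ B; it is the requirement that F be a distribution function that rules this H out.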

It is true that the index estimators are robust to deviations from the random utility model in the form of heteroskedastic disturbances for ν. In order for the index property to hold, however, the distribution of ν conditional on x can depend on x only through V(x,β). It is questionable whether this limited form of robustness compensates for the possibility of misspecification that arises when unnecessarily strong, and perhaps incorrect, restrictions are placed on β.

7. IDENTIFICATION OF THE CHOICE PROBABILITIES

So far we have concentrated on identification of F₀ and β. In this section we establish that identification of (F₀,β) implies identification of each of the conditional choice probability functions

(7.1)   φ_j(e) ≡ Pr( c = j | x = e ),   j = 0, ..., J,

relative to the collection of choice probability functions that satisfy all of the restrictions implicit in the random utility model.

The claimed identification follows from the following argument. One can derive several equivalent conditions that imply that the second alternative is chosen conditional on x = e:

(7.2)   c = 1  ⟺  Ū^(j) ≤ Ū^(2) for j = 1, ..., J+1
               ⟺  ν̄^(j) − ν̄^(2) ≤ V̄^(2)(e,β) − V̄^(j)(e,β) for j = 1, ..., J+1
               ⟺  ν^(1) ≥ V^(1)(e,β) and ν^(j) − ν^(1) ≤ V^(j)(e,β) − V^(1)(e,β) for j = 2, ..., J.

For each j (1 ≤ j ≤ J) form the matrix Γ_j by replacing each element of the j'th column of a J×J identity matrix with −1. (Let Γ₀ be the identity matrix itself.) Then the last condition in (7.2) is equivalent to Γ₁ν ≤ Γ₁V(e,β). Similarly c = j ⟺ Γ_jν ≤ Γ_jV(e,β). Hence

(7.3)   φ_j(e) = ∫ 1{ Γ_jν ≤ Γ_jV(e,β) } dF₀(ν).

So knowledge of F₀ and β is sufficient to derive all of the conditional choice probabilities.

The situation is illustrated in Figure 1 for the case J = 2. Given x = e, the point V(e,β) partitions ℝ² into the three (almost) disjoint regions { ν : Γ_jν ≤ Γ_jV(e,β) }, j = 0, 1, 2, which are labelled accordingly in the Figure. Alternative j is chosen if and only if the unobserved vector ν takes a value in region j. Clearly the probabilities of these events can be determined from knowledge of β and F₀.

[Figure 1: the point V(e,β) partitions ℝ² into three regions, labelled 0, 1 and 2.]
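
Equation (7.3) lends itself to a direct numerical check. The following sketch (an illustration only: the parameter values and the Gaussian choice for F₀ are assumptions made here, not part of the model) approximates the three region probabilities for J = 2 by simulation and confirms that they sum to one:

    import numpy as np

    rng = np.random.default_rng(0)
    J = 2

    def gamma(j, J):
        # Gamma_j: the J x J identity with its j'th column replaced by -1's
        # (Gamma_0 is the identity itself).
        G = np.eye(J)
        if j > 0:
            G[:, j - 1] = -1.0
        return G

    beta = np.eye(2)                 # an assumed true parameter (J x L, with L = 2)
    e = np.array([0.5, -0.2])        # a fixed regressor value x = e
    V = beta @ e                     # the index V(e, beta)

    nu = rng.standard_normal((200_000, J))   # draws from the assumed F_0
    probs = [np.mean(np.all(nu @ gamma(j, J).T <= gamma(j, J) @ V, axis=1))
             for j in range(J + 1)]
    print(probs, sum(probs))         # the three regions exhaust the plane, so ~1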

The result holds despite the partial observability nature of the model.

At first glance the result is surprising, since intuition suggests that little can be learned about the preferences between alternatives 1 through J when one can only observe whether or not alternative 0 was chosen. Alternatively, the result can be interpreted as suggesting that unreasonably strong prior information is needed to completely identify (F₀,β) when actual choice behavior is only partially observed.

The result also implies that when one can observe some or all of the other y_j, 1 ≤ j ≤ J, then the semiparametric random utility model is over-identified, given the sufficient conditions in Section 5. Alternatively, conditions that are not sufficient for identification given partial observability may be sufficient when one can observe more about the actual choices. This last possibility is explored further in the next section.

8. IDENTIFICATION UNDER GENERAL OBSERVABILITY CONDITIONS

Here we extend the analysis of the previous sections by assuming that additional information is available on actual choices beyond that contained in y₀. In particular, we assume that one can observe each of the random variables x, y₀, ..., y_M, 1 ≤ M ≤ J. Hence we assume that the actual choice is observed if one of the first M+1 alternatives is chosen, but that no other information on the actual choice is observed. Full observability is the special case where M = J or M = J−1. For M ≤ J−2 this setting corresponds to data where two or more of the possible outcomes are aggregated into an "other" category by the data collection process.

The analysis proceeds by noting that each of the subproblems where only x and y_j is observed is equivalent to the problem studied in the previous sections. Hence the identification results that have already been developed are easily extended to cover the more general case.

Let the matrices Γ_j, 0 ≤ j ≤ J, be defined as before. For each j let F_j denote the distribution function of Γ_jν under P and let V_j = Γ_jV. Then for each alternative j we have

(8.1)   φ_j(x) = Pr( Γ_jU ≥ 0 | x ) = Pr( Γ_jν ≤ V_j(x,β) | x ) = F_j( V_j(x,β) )  (a.s.).

This generalizes equation (2.3) to show that all of the conditional choice probabilities have exactly the same structure provided that the appropriate transformation of U, V and ν is taken.

Suppose that it has been established that for each collection of distribution functions F. e ~ and each b e B, for 0 S j S M, J

(8.2) F (V (x,b» - F (V (x,Q» .. V (x,b) - H (V (x,Q» (a.s.) jj jj JJ j jj JJ for some collection H of isotonic functions. That is, each of the random j vectors V.(x,P) is individually identified up to an isotonic transformation. J Then the constraints across the choice probability functions implied by the definitions of the V and F provide additional identifying power for P j j and F . o For example, in the linear in parameters model we have

V_j(x,b) = Γ_j bx = V(x, b_j), where b_j = Γ_j b. Let β_j = Γ_j β. Suppose that β_j is not identified relative to b_j, so that b_j = H_j β_j for some isotonic matrix H_j. The latter holds if and only if b = Γ_j⁻¹ H_j Γ_j β. (In particular, for j = 0 we have b = H_0 β, since Γ_0 = I.) If this holds for j ≤ M and if β has full row rank, then Γ_j⁻¹ H_j Γ_j = H_0 for each j ≤ M. This gives linear restrictions on the H_j that can be used to substantially reduce the set of isotonic matrices in which H_0 can lie.²⁴ Proposition 6 and its Corollary use this approach to generalize the identification results of Proposition 5.
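The force of these restrictions is easy to see numerically. In the sketch below (again using the hypothetical Γ_j construction from the earlier sketches), conjugating a diagonal isotonic matrix with unequal entries by Γ_j introduces negative entries, so it cannot remain isotonic, while a common scalar rescaling aI is left unchanged; Lemma B in the Appendix characterizes exactly which isotonic matrices survive the conjugation.

    import numpy as np

    J, j = 3, 2
    G = np.eye(J); G[:, j - 1] -= 1.0; G[j - 1, j - 1] = -1.0   # Gamma_j = Gamma_j^{-1}

    H_scalar = 2.0 * np.eye(J)              # isotonic: a positive scalar rescaling
    H_diag = np.diag([1.0, 2.0, 3.0])       # isotonic, but not a scalar matrix

    print(G @ H_scalar @ G)                 # equals 2I, still isotonic
    print(G @ H_diag @ G)                   # has a negative entry, hence not isotonic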

Proposition 6: Suppose that Conditions (C1), (C4) and (C5) hold. Suppose that for some collection of distribution functions F_j ∈ 𝔉 and some b ∈ B we have F_j(V_j(x,b)) = F_j(V_j(x,β)) (a.s.) for 0 ≤ j ≤ M (M ≥ 1). Then there is an isotonic matrix H such that b = Hβ and F_j = F_j ∘ H⁻¹, 0 ≤ j ≤ M. H has the form H = aP, where a > 0 is a scalar and where P is a permutation matrix for which the first M rows coincide with the identity.

Proposition 6, Corollary: Suppose that the conditions of Proposition 6 are met for M = J−1 or M = J. Then there is a scalar a > 0 such that b = aβ and F_j = F_j ∘ (aI)⁻¹, j = 0, ..., J.

Proposition 6 establishes conditions under which β is identified up to a positive scalar transformation and a permutation of the rows corresponding to the unobserved y_j. As in the case where only y_0 is observed, one must normalize against permutations of the last J−M rows of β. The need arises because for j, k > M, the events { c = j } and { c = k } are observationally equivalent.

Intuition suggests that there should be approximately a one-to-one correspondence between the number of the y_j that are not observed and the number of scale normalizations needed on the model. The Proposition shows that this intuition is incorrect. Only a single scale normalization is needed to identify the scale of β when any two of the y_j are observed. So when M = 0 one needs J scale normalizations, but when M ≥ 1 a single normalization suffices.

This apparent anomaly can be explained as follows. Write U*^(1), ..., U*^(J+1) for the actual utilities of alternatives 0 through J, so that the utility differences are U^(j) = U*^(1) − U*^(j+1). Consider the rescaling of the utility differences to DU, where D = diag(a_1, ..., a_J) with a_j > 0, j ≤ J. Since

(8.3)  U^(j) = U*^(1) − U*^(j+1) ≥ 0  ⟺  (DU)^(j) = a_j U*^(1) − a_j U*^(j+1) ≥ 0,

the probability that the first alternative is preferred to each of the others is not affected.

But consider any other pair of alternatives (i, j). Given U we know that alternative j is preferred to alternative i if and only if

(8.4)  (Γ_j U)^(i) = U*^(j+1) − U*^(i+1) ≥ 0.

In order for the transformation from U to DU to leave the implied ranking of these alternatives unchanged we must have (Γ_j U)^(i) ≥ 0 ⟺ (Γ_j DU)^(i) ≥ 0. But the latter inequality can be written

(8.5)  (Γ_j DU)^(i) = (a_i − a_j)U*^(1) + a_j U*^(j+1) − a_i U*^(i+1) ≥ 0.

In order for (8.4) and (8.5) to be equivalent we must have a_i = a_j. Since y_j = 1 if and only if alternative j is preferred to all of the others, the conditional distribution of y_j is unaffected by the rescaling if and only if all of the a_j coincide.

In retrospect this fact is less surprising. A rescaling of the utility differences U is consistent with a rescaling of the actual utilities U* if and only if each component of U* is scaled by the same factor. This shows that the extreme case M = 0 of partial observability is the anomalous case. The relative scale of the rows of β usually will be identified.
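A numerical illustration of the anomaly (all particular values hypothetical): with unequal scale factors, the rescaling DU never changes the sign pattern of U, and hence never changes y_0, but it frequently reverses the ranking between two of the remaining alternatives.

    import numpy as np

    rng = np.random.default_rng(2)
    J = 3
    D = np.diag([1.0, 5.0, 1.0])            # unequal scale factors a_1, a_2, a_3
    flips = 0
    for _ in range(10000):
        u = rng.standard_normal(J + 1)      # actual utilities of alternatives 0, ..., J
        U = u[0] - u[1:]                    # utility differences in favor of alternative 0
        # y_0 depends only on whether U >= 0, which D cannot change:
        assert np.all(U >= 0) == np.all(D @ U >= 0)
        # but the best alternative among 1, ..., J (the smallest difference) can change:
        flips += int(np.argmin(U) != np.argmin(D @ U))
    print(flips, "of 10000 rankings among alternatives 1, ..., J reversed")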

Department of Economics, University of Minnesota, Minneapolis, MN 55455.

APPENDIX

This Appendix provides proofs for each of the propositions and corollaries in the text. Several of the proofs refer to the following lemmas.

Lemma A: Suppose that F is a proper distribution function on R^J. Then F(t+s) = F(t) for every t ∈ R^J only if s = 0.

Proof: Suppose that s ≠ 0. If F(t) = F(t+s) for all t ∈ R^J, then it must also be true that F(t) = F(t−s) for all t ∈ R^J. (Set t = t̄ − s for any arbitrary point t̄ ∈ R^J.) So there is no loss of generality in assuming that s^(1) < 0. Choose some t_0 such that F(t_0) > 0, and construct a sequence t_i by iterating according to t_{i+1} = t_i + s. Then F(t_i) = F(t_0) for all i, and

(A.1)  0 < F(t_0) = F(t_i) ≤ ∫ 1{ v^(1) ≤ t_i^(1) } dF(v)

for each i. But since F is a proper probability distribution the limit of the right hand side of (A.1) as i → ∞ is zero. Since (A.1) implies a contradiction, the supposition that s ≠ 0 must be false. ■
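To see the argument at work in a concrete case, suppose (purely for illustration) that F is the standard bivariate normal distribution function and s = (−1, 0), so s^(1) < 0: along the sequence t_i = t_0 + i·s the values F(t_i) collapse to zero, so F(t+s) = F(t) cannot hold at every t.

    import numpy as np
    from scipy.stats import multivariate_normal

    F = multivariate_normal(mean=[0.0, 0.0]).cdf   # a proper distribution function on R^2
    s = np.array([-1.0, 0.0])                      # s != 0 with s^(1) < 0
    t0 = np.array([1.0, 1.0])                      # a point with F(t0) > 0
    for i in [0, 1, 5, 10, 20]:
        print(i, F(t0 + i * s))                    # tends to zero as i grows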

Lemma B: Fix j ≤ J and let Γ_j be the corresponding matrix defined in Section 7. Suppose that G and H are isotonic matrices for which G = Γ_j⁻¹ H Γ_j. Then H_jj > 0 and there is a permutation matrix P with P_jj = 1 such that G = H = H_jj P.

Proof: By Proposition 3 there is exactly one strictly positive element in each row and in each column of G and H. For each i let h_i denote the strictly positive element in the i'th row of H. One can verify directly that Γ_j⁻¹ = Γ_j, so G = Γ_j H Γ_j. Then direct matrix multiplication shows that for each i, k:

(A.2)  G_ik =   −H_jk         if i = j, k ≠ j;
                h_j           if i = j, k = j;
                h_j − h_i     if i ≠ j, k = j;
                H_ik − H_jk   if i ≠ j, k ≠ j.

Since H_jk and G_jk must both be non-negative we have G_jk = H_jk = 0 whenever k ≠ j. Therefore every element in the j'th row of G and H except for G_jj and H_jj is zero. Since row j must have one positive element, G_jj > 0 and H_jj > 0. Then by definition of h_j we have G_jj = h_j = H_jj.

Furthermore, since there can be at most one strictly positive element in each column of G and H, we have G_ij = H_ij = 0 for i ≠ j. Since 0 = G_ij = h_j − h_i for i ≠ j, it follows that for each i, h_i = h_j = H_jj. This establishes that all of the strictly positive elements of H take the same value H_jj. By Proposition 3, H = PD for some permutation matrix P, where D is a diagonal matrix whose diagonal elements are given by the h_i. So D = H_jj I and H = P(H_jj I) = H_jj P. Furthermore, since H_jj = h_j > 0 we must have P_jj = 1.

To complete the proof we need to establish that G = H. We have already shown that G_jk = H_jk = 0 whenever k ≠ j, that G_ij = H_ij = 0 whenever i ≠ j, and that G_jj = H_jj. So the only remaining detail is to show that G_ik = H_ik whenever i ≠ j and k ≠ j. But since H_jk = 0 whenever k ≠ j, for these elements we have G_ik = H_ik − H_jk = H_ik. ■
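The case-by-case formula (A.2) can also be confirmed by direct computation. The sketch below builds a random isotonic matrix H = PD, conjugates it by the hypothetical Γ_j used in the earlier sketches, and checks all four cases at once.

    import numpy as np

    rng = np.random.default_rng(3)
    J, j = 4, 2                                   # j is 1-indexed, as in the text
    Gj = np.eye(J); Gj[:, j - 1] -= 1.0; Gj[j - 1, j - 1] = -1.0

    P = np.eye(J)[rng.permutation(J)]             # a random permutation matrix
    H = P @ np.diag(rng.uniform(0.5, 2.0, J))     # isotonic: H = PD
    h = H.sum(axis=1)                             # h_i = the positive element of row i

    G = Gj @ H @ Gj                               # Gamma_j^{-1} H Gamma_j
    expected = H - H[j - 1, :]                    # generic case: G_ik = H_ik - H_jk
    expected[:, j - 1] = h[j - 1] - h             # column j:     G_ij = h_j - h_i
    expected[j - 1, :] = -H[j - 1, :]             # row j:        G_jk = -H_jk
    expected[j - 1, j - 1] = h[j - 1]             # corner:       G_jj = h_j
    assert np.allclose(G, expected)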

We now prove each of the propositions and corollaries in the text.

Proof of Proposition 1: Under the stated conditions

(A.3)  F(V(x,b)) = Pr( H(v) ≤ H(V(x,β)) | x ) = Pr( v ≤ V(x,β) | x ) = F_0(V(x,β)),

almost surely. ■

Proof of Corollary 1: Equation (3.1) holds by assumption. Apply the

definition of an isotonic function to verify (3.2). ■

Proof of Corollary 2: Since the identity function is isotonic, apply

Proposition 1 with H = I. ■

Proof of Corollary 3: By assumption we have

V(x,b) = G(V(x,β)) + CV(x,β). Since the last term is almost surely zero,

(3.1) and (3.2) hold for H = G + C. ■

Proof of Corollary 4: Conditions (3.5)–(3.7) imply that for each b in some open neighborhood of β, (iii) continues to hold, and also

(A.4)

for each j ≤ J and each s, t. Fix any such b ≠ β. Then for each j ≤ J there are infinitely many strictly increasing functions h_j: R → R such that

(A.5)

For each j ≤ J fix one of these functions and let H: R^J → R^J be defined by the identities H^(j)(v) = h_j(v^(j)), j ≤ J. It is easy to verify that H is isotonic, and that (3.1) and (3.2) hold for this H. If (3.8) or (3.9) hold then it is possible to choose the h_j such that H does not coincide with the identity on the support of v. ■

Proof of Corollary 5: Under the stated conditions U ≥ 0 ⟺ HU ≥ 0

(a.s.). Then (3.2) follows from the linearity of H and the definition of U.

Equation (3.1) holds by assumption. F_0 is not identified relative to F, the distribution function for Hv. But Hv = v if and only if c'v = 0. ■

Proof of Proposition 2: We first prove that H is nonsingular. Let s be any vector such that Hs = 0. Then for each t ∈ R^J we have

(A.6)  F_0(t+s) = F(H(t+s)) = F(Ht) = F_0(t).

Then s = 0 by Lemma A, so H is nonsingular.

We next prove that u ≤ v ⟹ Hu ≤ Hv. If this is not true then there exist distinct points u, v ∈ R^J such that u ≤ v and H^(j)u > H^(j)v for some j. Since H is nonsingular, without loss of generality assume that u < v. For i = 1, 2, 3, ... let t_i = u + i(v−u). Clearly t_i → ∞.²⁵ Furthermore, since Ht_i = Hu + i(Hv − Hu), H^(j)t_i → −∞. So²⁶

(A.7)  F_0(t_i) = F(Ht_i) = ∫_[−∞, Ht_i] dF ≤ ∫_{R^J} 1{ v^(j) ≤ H^(j)t_i } dF(v) → 0,

where the limit follows because the sequence of sets { v: v^(j) ≤ H^(j)t_i } converges from above to the empty set and because distribution functions are continuous from above.²⁷

Since t_i → ∞, it follows from the monotonicity of distribution functions that F_0(t) = 0 for all t < ∞. But then F_0 cannot be a proper distribution function, a contradiction. Hence u ≤ v ⟹ Hu ≤ Hv.

To prove the converse, note that the conditions of the Proposition are unchanged if one replaces H with H⁻¹ and interchanges F with F_0. Therefore the preceding argument also establishes that u ≤ v ⟹ H⁻¹u ≤ H⁻¹v. Replace u with Hu and v with Hv to establish the needed converse. ■

Proof of Proposition 3: Let ℋ be the set of matrices of the form given in the Proposition. Then H ∈ ℋ if and only if its columns define a set of vectors that can be rescaled to coincide with the canonical set of basis vectors for R^J. Hence H ∈ ℋ if and only if the convex cone { Hz: z ≥ 0 } generated by its columns coincides with the nonnegative orthant, that is, if and only if z ≥ 0 ⟺ Hz ≥ 0. But then (set z = v − u) ℋ consists of those matrices for which u ≤ v ⟺ Hu ≤ Hv. ■
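A numerical check of this characterization (hypothetical matrices throughout): a matrix of the form H = PD, with P a permutation and D a positive diagonal, preserves the coordinatewise order in both directions, while a nonnegative matrix not of that form does not.

    import numpy as np

    rng = np.random.default_rng(4)
    J = 3
    H = np.eye(J)[[2, 0, 1]] @ np.diag([0.5, 2.0, 1.0])  # rescaled basis vectors as columns
    M = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])                      # nonnegative, but not of that form

    def order_equivalent(A, trials=10000):
        # checks u <= v  <=>  Au <= Av on random pairs
        for _ in range(trials):
            u, v = rng.standard_normal(J), rng.standard_normal(J)
            if np.all(u <= v) != np.all(A @ u <= A @ v):
                return False
        return True

    print(order_equivalent(H))   # True
    print(order_equivalent(M))   # False: Mu <= Mv can hold even though u <= v fails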

Proof of Proposition 4: Suppose F(V(x,b)) = F_0(V(x,β)) almost surely. Given identification of β we must have V(x,b) = V(x,β), hence F(V(x,β)) = F_0(V(x,β)) almost surely. This implies that F = F_0 almost everywhere on V_0. Since distribution functions are continuous from above, we must have F = F_0 on int(V_0), hence on S_0, hence everywhere. ■

Proof of Proposition 5: By hypothesis we have F(bx) = F_0(βx) almost surely. Let bx = b_1 x_1 + b_2 x_2 be the decomposition of bx conformable with the decomposition in Condition (C5). Since β_1 is non-singular, there exists a unique J×J matrix H such that b_1 = Hβ_1. Hence bx = Hβx + Cx_2, where C = b_2 − Hβ_2. So

(A.8)  F( Hβx + Cx_2 ) = F_0(βx)  (a.s.).

Condition (C5) implies that the support of the distribution of βx conditional on x_2 covers R^J almost surely. Then since distribution functions are continuous from above,

(A.9)  F( Ht + Cx_2 ) = F_0(t) for every t ∈ R^J  (a.s.).

Let s be an arbitrary point in the support of Cx_2. Then F_0 = F_s ∘ H, where F_s(t) ≡ F(t+s) is a distribution function. An application of Proposition 2 proves that H is isotonic. Therefore H⁻¹ exists, and

(A.10)  F(Ht) = F_0( t + Dx_2 ) for every t ∈ R^J  (a.s.),

where D = −H⁻¹C. (To see this, replace each t in (A.9) with t − H⁻¹Cx_2.) This establishes that F_0( t + Dx_2 ) is almost surely constant for every t ∈ R^J.

Let K = L−J be the dimension of x_2. Condition (C4) implies that the support of x_2 is not contained in any proper affine subspace of R^K. Therefore there exists a collection of K+1 affinely independent vectors c_i in the support of x_2. For each of these vectors let s_i = Dc_i. Then for each i, j ≤ K+1 and every t ∈ R^J, we have F_0( t + s_i ) = F_0( t + s_j ). Since t is arbitrary we may replace it with t − s_j. Hence for each i, j ≤ K+1 and every t ∈ R^J, F_0[ t + (s_i − s_j) ] = F_0(t). Lemma A establishes that each of the vectors s_i − s_j = 0. So D(c_i − c_{K+1}) = 0 for i = 1, ..., K. Since the vectors (c_i − c_{K+1}) span R^K, conclude that D = 0.

We have established that D = −H⁻¹C = 0, so C = b_2 − Hβ_2 = 0. Therefore b_2 = Hβ_2. Since also b_1 = Hβ_1, we have b = Hβ. That F = F_0 ∘ H⁻¹ follows immediately from b = Hβ, given that the support of βx covers R^J and that H is nonsingular.

The Corollary follows immediately from Propositions 3 and 5. ■

Proof of Proposition 6: Each of the matrices Γ_j is nonsingular. Therefore the Conditions (C1), (C4) and (C5) continue to hold when V is replaced with V_j. So by Proposition 5, F_j(V_j(x,b)) = F_j(V_j(x,β)) implies that Γ_j b = H_j Γ_j β, or equivalently that b = Γ_j⁻¹ H_j Γ_j β, for some isotonic matrix H_j. Since (C5) implies that β has full row rank, we must have Γ_j⁻¹ H_j Γ_j = H_0 for each j ≤ M. Lemma B establishes that Γ_j⁻¹ H_j Γ_j = H_0 only if H_0 = H_j = aP, where P_jj = 1 and a is the common j'th diagonal element of H_0 and H_j. The conclusion follows immediately when these conditions hold simultaneously for each j ≤ M.

The Corollary follows from the fact that the only J×J permutation matrix that coincides with the identity on J−1 of its columns is the identity itself. ■

FOOTNOTES

1 This research was supported in part by a summer research fellowship from the Graduate School of the University of Minnesota. Some of the results are based upon material in the author's dissertation at the University of Wisconsin. The author wishes to thank Arthur Goldberger, Hidehiko Ichimura, Lung-Fei Lee, Charles Manski, James Powell, Christopher Sims, and Harald Uhlig for helpful comments. They are absolved from any responsibility for errors committed in this paper.

2 If Z is any collection of vectors or vector-valued functions then Z^(j) denotes the collection of j'th coordinates of the elements of Z.

3 In general 1{·} denotes the binary indicator function for the event in braces.

4 The canonical partial ordering on R^J is employed throughout this paper. For vectors v and t in R^J the inequalities v ≤ t and v < t are equivalent to the corresponding coordinatewise inequalities. That is, v ≤ t is equivalent to v^(j) ≤ t^(j) for every j, and v < t is equivalent to v^(j) < t^(j) for every j. Similarly, v ≥ t (resp. v > t) if and only if t ≤ v (resp. t < v).

5 This extension of 𝔉 is composed of all distribution functions corresponding to measures μ on R̄^J that satisfy μ(R̄^J) = 1. Some of these correspond to improper probability measures on R^J. By extending the domain of functions in 𝔉 to the compact set R̄^J, we ensure that the parameter space is compact when endowed with a topology corresponding to the weak topology on the underlying probability measures.

6 McFadden did not impose the constraint ‖β‖ = 1, but instead placed a scale normalization on the univariate marginal distributions of v.

7 The propositions and corollaries are proved in the Appendix.

8 A vector-valued function H is strictly increasing in each argument if u ≤ v and u ≠ v implies H(u) ≤ H(v) and H(u) ≠ H(v). That is, if for some j u^(j) < v^(j), then there is a k such that H^(k)(u) < H^(k)(v).

9 Poirier imposed the scale normalizations by restricting the marginal distributions of F_0. This is equivalent to our approach of leaving F_0 unrestricted and imposing the scale normalizations on V.

10 This fact justifies the use of the term "normalization" to discuss the necessary restrictions on V and B in the preceding discussion.

11 The Corollary is easily extended to cover cases where C(V(x,β)) = 0 (a.s.) for some nonlinear function C. Thus generally identification may fail if the support of V(x,β) is contained in any nonlinear manifold of dimension less than J. The more general result is useful when the parameterizations of V and B admit nonlinear transformations of V(x,β).

12 In fact, for the linear in parameters model Corollary 2 is obtained as the special case of Corollary 3 in which G is the identity and C = 0.

13 The constant terms in b normalize the scale of V(x,b), while the sign restrictions normalize against permutations.

14 These values are chosen for convenience. The example can be modified to accommodate any values of β_12 and β_21 for which rank(β) = 2 and β ∈ B.

15 Poirier attributes verification of this result to Takeshi Amemiya.

16 This analogy should not be taken too far, however. In the model considered by Poirier there are no constraints on the parameter space involving more than one parameter at a time. In the semiparametric setting there are numerous inequality constraints linking the infinity of implicit scalar parameters. So it is not clear whether or not an infinite support for x is necessary in the semiparametric model.

17 The H defined in the Corollary cannot be isotonic since there are no strictly positive elements in H^(i).

18 The proof of the Proposition requires that F_0 be a proper probability distribution on R^J. This condition must be added to the Proposition when 𝔉 is extended to include improper probability distributions.

19 That is, there does not exist a vector c ∈ R^L and a scalar c_0 such that P{ c'x = c_0 } = 1.

20 The linear independence restriction can be partially relaxed if there are known equality constraints on the columns of β.

21 For example, if x includes only personal characteristics of the decision maker then there may not be any justification for placing any constraints on β.

22 That is, one can always find an open neighborhood of β that does not include any permutation of the rows of β (except for β itself).

23 Note, however, that a restriction of the form β_ij = 1 implies a scale normalization and a sign restriction. It also normalizes against permutations that exchange row i with each row k for which it is known a priori that β_kj ≠ 1. This implies that it is somewhat misleading to simply count constraint equations when interpreting these results.

24 Lemma B in the Appendix characterizes the nature of these restrictions.

25 That is, each element of t_i diverges to ∞.

26 For s, t ∈ R̄^J, the notations [s,t], (s,t], etc. shall denote intervals of R^J. That is, [s,t] is the Cartesian product across j = 1, ..., J of the intervals [s^(j), t^(j)], etc.

27 A function on R̄^J is continuous from above if and only if it is right-continuous separately in each argument.

REFERENCES

Cosslett, Stephen R. (1983): "Distribution-free maximum likelihood estimator

of the binary choice model," Econometrica 51, 765-782.

Daganzo, Carlos (1979): Multinomial Probit: The Theory and Its Application to

Demand Forecasting. New York: Academic Press.

Feinstein, Jonathan S. (1989): "The Safety Regulation of U.S. Nuclear Power

Plants: Violations, Inspections, and Abnormal Occurrences," Journal of

Political Economy 97, 115-154.

Gallant, A. Ronald (1987): "Identification and consistency in

semi-nonparametric regression," in Truman F. Bewley (ed.), Advances in

Econometrics: Fifth World Congress, Volume I. New York: Cambridge

University Press.

Han, Aaron K. (1987): "Non-parametric analysis of a generalized regression

model: The maximum rank correlation estimator," Journal of Econometrics

35, 303-316.

Ichimura, Hidehiko (1987): "Estimation of Single Index Models," Ph.D.

dissertation, Department of Economics, Massachusetts Institute of

Technology, Cambridge.

Ichimura, Hidehiko and Lung-Fei Lee (1990): "Semiparametric Estimation of

Multiple Index Models: Single Equation Estimation," in William A.

Barnett, James Powell and George Tauchen (eds.), Nonparametric and

Semiparametric Methods in Econometrics and Statistics. New York:

Cambridge University Press.

Klein, Roger W. and Richard H. Spady (1987): "An Efficient Semiparametric

Estimator for Discrete Choice Models," Economics Research Group, Bell

Communications Research, Morristown, New Jersey.

Maddala, G.S. (1983): Limited-Dependent and Qualitative Variables in

Econometrics. New York: Cambridge University Press.

Manski, Charles F. (1975): "Maximum score estimation of the stochastic

utility model of choice," Journal of Econometrics 3, 205-228.

(1985): "Semi-parametric analysis of discrete response: asymptotic·

properties of the maximum score estimator," Journal of Econometrics 27,

313-333.

(1988): "Identification of Binary Response Models," Journal of the

American Statistical Association 83, 729-738.

Matzkin, Rosa L. (1988): "A nonparametric and distribution-free estimator for

the binary choice and the threshold crossing models," Cowles Foundation

Discussion Paper No. 889, Yale University, New Haven.

(1990): "A nonparametric maximum rank correlation estimator," in

William A. Barnett, James Powell and George Tauchen (eds.),

Nonparametric and Semiparametric Methods in Econometrics and

Statistics, New York: Cambridge University Press.

McFadden, Daniel (1974): "Conditional logit analysis of qualitative choice

behavior," in Paul Zarembka (ed.), Frontiers in Econometrics, New York:

Academic Press.

(1976): "Quantal Choice Analysis: A Survey," The Annals of Economic

and Social Measurement 5, 363-390.

(1981): "Econometric Models of Probabilistic Choice," in Charles F.

Manski and Daniel McFadden (eds.), Structural Analysis of Discrete Data

with Econometric Applications, Cambridge: M.I.T. Press.

(1982): "Qualitative Response Models," in Werner Hildenbrand (ed.),

Advances in Econometrics, Cambridge: Cambridge University Press.

(1984): "Econometric Analysis of Qualitative Response Models," in Zvi

Griliches and Michael D. Intriligator (eds.), Handbook of Econometrics,

Volume 2, Amsterdam: North-Holland.

Poirier, Dale J. (1980): "Partial Observability in Bivariate Probit Models,"

Journal of Econometrics 12, 209-217.

Powell, James L., James H. Stock and Thomas M. Stoker, "Semiparametric

Estimation of Index Coefficients," Alfred P. Sloan School of Management

Working Paper 1793-86, Massachusetts Institute of Technology.

Robertson, Tim, F.T. Wright and R.L. Dykstra (1988): Order Restricted

Statistical Inference, Chichester: John Wiley and Sons.

Stoker, Thomas M. (1986): "Consistent Estimation of Scaled Coefficients,"

Econometrica 54, 1461-1481.

(1990): "Equivalence of Direct, Indirect and Slope Estimators of

Average Derivatives," in William A. Barnett, James Powell and George

Tauchen (eds.), Nonparametric and Semiparametric Methods in

Econometrics and Statistics, New York: Cambridge University Press.
