Penalized maximum likelihood estimation of finite mixture models∗

Sofya Budanova†

November 12, 2016

Abstract

Economic models often resort to finite mixtures to accommodate unobserved heterogeneity. In practice, the number of components in the mixture is rarely known. If too many components are included in the estimation, then the parameters of the estimated model are not point-identified and lie on the boundary of the parameter space. This invalidates the classic results on maximum likelihood estimation. Nonetheless, the parsimonious model, which corresponds to a particular subset of the identified set, can be point-identified. I propose a method to estimate finite mixture models with an unknown number of components by maximizing a penalized likelihood function, where the penalty is applied to the mixing coefficients. The resulting Order-Selection-Consistent Estimator (OSCE) consistently estimates the true number of components in the mixture, and achieves the oracle efficiency for the parameters of the parsimonious model. This paper extends the literature on penalized estimation to the case of non-identified model parameters. Further, numerical simulations illustrate the performance of the proposed method in practice. Finally, the method is applied to the experimental data from Cornand and Heinemann [2014] to determine the composition of subjects' types associated with their level of rationality in a coordination game.

Keywords: finite mixtures, penalized maximum likelihood, SCAD, MCP, non-identification.

JEL Classification: C13, C18, C52.

∗The latest version can be found at http://sites.northwestern.edu/sbg354/

†Department of Economics, Northwestern University (e-mail: [email protected]). I would like to thank Ivan Canay and Elie Tamer for being great mentors and for their support. This paper has benefited a lot from the comments of Joel Horowitz, Gaston Illanes, Charles Manski, Alexander Torgovitsky, and many participants of the econometrics workshop at Northwestern University. I am also very grateful to Davide Cianciaruso, Sergey Gitlin, Eric Mbakop, Shruti Sinha, Yi Sun, and the other members of the EMG.

1 Introduction

Finite mixtures are used to model situations where data comes from a distribution that can be represented as a convex combination of finitely many other densities. That is, the density of the data generating process in a finite mixture model can be written as

$$h(x; \varphi) \equiv \pi_1 f(x; \theta_1) + \pi_2 f(x; \theta_2) + \ldots + \pi_K f(x; \theta_K),$$

where the densities f(x; θ) belong to some family of distributions which can be indexed by a p-dimensional parameter θ (p is assumed to be finite), K is the order of the mixture and denotes the number of distinct components, and the πk are the mixing coefficients, which are strictly positive and sum up to 1, i.e. $\sum_{k=1}^{K} \pi_k = 1$. Finite mixtures are an element of many economic models. They help to model heterogeneity of agents, workers, consumers, economic regimes, etc. If the order of the mixture is known, ways to estimate the model parameters are well studied. In particular, the theory of efficient likelihood estimation (see Lehmann and Casella [1998]) can be used to show the existence of a consistent estimator of the model parameters and establish its asymptotic normality. For an extensive survey of the literature on finite mixtures see McLachlan and Peel [2004].

However, in many applications there is no ex-ante information on the number of components. This gives rise to one of two issues: if the estimated model has fewer components than the truth, the model will be misspecified, whilst if the estimated model contains more components than the true one, then the parameters of the estimated model will no longer be point-identified. This paper considers the latter situation; in particular, it is assumed that although the researcher does not know the value of K, he has an idea of its upper bound, K̄ (see Assumption (M1) in Section 4.1.1), and estimates a K̄-component mixture:

π1f(x; θ1) + π2f(x; θ2) + ... + πK¯ f(x; θK¯ ).

In this large model some of the mixing coefficients πk are zero, and hence this is a situation with parameters lying on the boundary of the parameter space. While this usually leads to a nonstandard asymptotic distribution of the maximum likelihood estimator (MLE), a bigger issue in this situation is the loss of point-identification of the vector of this model's parameters, which is due to the multiplicity of ways in which the true K-component mixture can be written as a K̄-component mixture (see Example 1). It is a problem because point-identification of the parameters is the crucial assumption for establishing consistency and asymptotic normality of the maximum likelihood estimator (see Section 6.3 in Lehmann and Casella [1998]). Redner [1981] showed how a generalized version of consistency of the MLE can be established in such a situation by proving its convergence to the identified set (see Definition 1). Inference in this context is discussed, for example, in Liu and Shao [2003] and Chen et al. [2014].

Despite the parameter vector of the large model not being identified as a whole, there exists an object of particular interest that is point-identified in this situation. This object is called the parsimonious model (see Definition 2), and it is the mixture with the smallest number of distinct components, all of which have non-zero mixing coefficients. The parsimonious model has a tight connection with the true K-component mixture: the number of non-zero mixing coefficients in the parsimonious model is exactly K, and the parameters of the components that correspond to the non-zero weights coincide with the component parameters of the true K-component mixture.

This paper focuses on the estimation of the parsimonious model.

Note that once the parsimonious model becomes the object of interest, the vector of the model parameters can be partitioned into two subvectors: 1) (θ1, π1), the parameters of the parsimonious model (this way π1 is a vector of length K and contains strictly positive elements, and (θ1, π1) = ϕ), and 2) (π2, θ2), where π2 = 0 and θ2 is completely non-identified. This means that the parsimonious model is represented by a subset of the identified set (for an example, see Figure 1).

This paper proposes the Order-Selection-Consistent Estimator (OSCE) – an estimator of the model parameters that converges to the parsimonious part of the identified set and, moreover, with probability approaching one, has exactly K non-zero elements in the subvector of estimators of the mixing coefficients (in other words, π̂2 = 0). The OSCE is defined as the maximizer of the penalized log-likelihood function (see Equation 4), where the penalty is put on the mixing coefficients.1 The reason for considering this particular penalization is as follows. Penalizing mixing coefficients guarantees that all the points in the parsimonious subset of the identified set yield the same value of the objective function, and that this value is larger than at any other point from the identified set (see Figure 4). This makes it possible to discriminate between different parts of the identified set and achieve convergence to the parsimonious subset. Theorem 1 states the conditions that ensure the existence of the OSCE and its convergence to the parsimonious part of the identified set (at the $n^{-1/2}$ rate), and Theorem 2 proves that the OSCE consistently estimates K and achieves the so-called oracle efficiency for the identifiable parameters of the parsimonious model (namely, θ1 and π1). In other words, the fact that K is estimated consistently makes it possible to estimate the parameters of the parsimonious model as efficiently as in the case with K being known.

1The idea to use a penalty function applied to the mixing coefficients comes from the vast literature on the simultaneous selection of the relevant regressors and estimation of the model parameters in linear regression models (e.g. Tibshirani [1996], Fan and Li [2001], Knight and Fu [2000], Zhang [2010a], Zou [2006], Huang et al. [2008]). Discussion of the connection of the current paper to this literature is provided in Section 3.

The main step of the proof of Theorem 1 consists in showing that the objective function evaluated at any point on the boundary of an $n^{-1/2}$-neighborhood of the parsimonious subset of the identified set is smaller than the objective function evaluated at any point in this subset. Since the parameters of the model are not point-identified, the standard proof of a similar statement from Lehmann and Casella [1998] (for the MLE) or Fan and Li [2001] (for the penalized maximum likelihood estimator with point-identified model parameters) cannot be applied directly. In fact, it is the penalty part of the objective function that plays a crucial role in restoring the validity of this step, by ensuring the aforementioned relation for those points on the boundary for which the standard argument would fail (see Section 4 for a more detailed discussion).

To implement the OSCE in practice I propose a data-driven way to select the tuning parameter for the penalty function, and modify the standard EM algorithm, in order to accommodate the penalty term of the objective function (which depends on the mixing coefficients).

To illustrate the applicability of my method, I use the data from Cornand and Heinemann's experiment. They test a coordination game studied theoretically in Morris and Shin [2002]. Coordination games are important as their analysis has implications for social welfare and thus might lead to policy recommendations. However, the results in Morris and Shin [2002] rely on the players using the equilibrium strategy. Playing the Nash equilibrium in this game requires an infinite level of reasoning, and the experimental literature that studied similar games (e.g. Nagel [1995]) discovered that subjects often exhibit bounded reasoning behavior. In fact, as Cornand and Heinemann find, the subjects in their experiment played, on average, a strategy consistent with level-2 rationality. Nevertheless, since social welfare is a non-linear function of the players' rationality level, knowing the average rationality is not sufficient, and an analysis of the composition of types is required. To address this question, I model the situation as a finite mixture of types characterized by different levels of rationality and apply my method to estimate the composition of the types.

The two main findings of my analysis are the following: 1) the subjects are of two types, with one type being level-1 (i.e. the type that completely ignores the coordination motive), and the other type corresponding to a higher rationality level; 2) as the coordination motive increases, a smaller proportion of participants exhibits the level-1 behavior. The latter finding is consistent with the results in Alaoui and Penta [2015], whose experimental evidence suggests that the level of rationality is chosen by players endogenously, depending on the parameters of the game.

1.1 Related Literature

Finite mixtures are a workhorse in economic models that need to account for heterogeneity. For instance, in industrial organization, marketing, and labor economics, they help to account for unobserved (finite) heterogeneity of agents in dynamic models (see Keane and Wolpin [1997] for an example, and Aguirregabiria and Mira [2010] for a comprehensive survey on the estimation of discrete choice models in these applications).

Experimental economics is another area where mixtures naturally arise. For instance, Bosch-Domènech et al. [2010] and Chen et al. [2014] estimate a mixture model in the context of a beauty contest game. As it turns out, in games where playing a Nash equilibrium requires an infinite level of reasoning, people often deviate from the equilibrium strategy, and a whole range of other actions is observed. If in an experiment participants are heterogeneous in their rationality levels, then the observed actions can be modeled as draws from a finite mixture of components, each of which corresponds to a given level of rationality. In these papers, inference on the mixing coefficients is typically the focus of the analysis.

Other areas of economics which utilize finite mixtures include, but are not limited to, estimation of models of regime switching (Hamilton [2010], Haas and Paolella [2012]), data misclassification, games with multiple Nash equilibria (e.g. Berry and Tamer [2006]), and some versions of GARCH models (e.g. Broda et al. [2013]).

The non-identification problem, which arises when a mixture with more components than in the true model is estimated, has been recognized in the literature. There are multiple ways of inferring the order of the mixture, which include Bayesian methods, optimization of different criteria, and consecutive testing of hypotheses about the true number of components.

One possible approach to establishing the order K of a mixture would be to perform consecutive testing of hypotheses about a particular value of K. First, this raises the multiple testing problem (with the number of tests not known a priori), and second, the distribution of the test statistic in this situation is non-standard (once again, due to non-identification of the model parameters and some parameters being on the boundary). Hartigan [1985] showed that the LR statistic is not asymptotically distributed as $\frac{1}{2}\chi^2_1$. The distribution of the LR is derived under different assumptions in Liu and Shao [2003], Sen and Ghosh [1985], Dacunha-Castelle et al. [1999], and, more recently, Kasahara and Shimotsu [2015].

Another block of existing methods follows a two-step procedure, in which, in the first step, mixture models with different numbers of components are estimated through the optimization of some objective function, and, in the second step, the fitted model that maximizes a particular criterion is selected as the final estimator. Very often the likelihood function is chosen as the objective function of the first step, and the criterion from the second step usually has the following structure: it is equal to the value of the maximum likelihood for a particular fitted model minus a penalty term proportional to the complexity of the model (represented by the number of components or the total number of parameters) and to a slowly growing function of the sample size. Keribin [2000] proves that under some conditions the optimization of such an information criterion results in a consistent estimation of the true order of the mixture. Other methods that belong to this category include Chen and Kalbfleisch [1996] (a distance between an estimated parametric model and the empirical cdf of the data, or a nonparametric density estimator, is the first-step objective, and the second-step criterion includes a term proportional to the logarithm of the mixing coefficients), Woo and Sriram [2006] (a penalized Hellinger distance to the nonparametric kernel estimator of the density of the data is used as the criterion), and James et al. [2001] (the Kullback-Leibler distance is minimized at the first step).

The nature of the second-step criteria in all these methods is such that one-step optimization is impossible, and so they all require a full grid search over the number of components. This creates a significant computational burden. My method, by contrast, achieves order-selection consistency in just one step, which involves estimation of the model with the largest possible number of components K̄. Thus, my method is much more computationally efficient than both the consecutive hypothesis testing and the two-step penalization procedures.

To the best of my knowledge, there are only two other papers that use a similar penalization idea to estimate the parameters of a mixture model with an unknown number of components. Chen and Khalili [2008] use a SCAD-penalized maximum likelihood estimator for mixture models with components that depend on a scalar parameter θ. This assumption makes it possible to impose a natural order on the components, and then a penalty can be applied to the distance between the parameters of consecutive components. Chen and Khalili [2008] showed that this procedure produces a consistent estimator of the number of components in the mixture. The method I propose in this paper can be applied to families of distributions that depend on a multidimensional parameter.

Another approach is to apply a penalty to the mixing coefficients, as it does not require imposing restrictions on the dimensionality of the component parameters. Huang et al. [2013] use this approach for the estimation of Gaussian mixtures with equal variances. My method is not limited to mixtures over the family of normals and, in fact, can be applied to any family of distributions that guarantees identification of the parsimonious model (see the remarks at the end of Section 4.1.1 for more details).

It is worth pointing out that while assuming a parametric form of the density might be viewed as quite a strong restriction, going for a fully non-parametric setting and an unknown number of components simultaneously, without additional assumptions, would not allow for point-identification.

Depending on the structure of the data or the assumptions made, either partial or full nonparametric identification (for the case of panel data) can be shown; see Henry et al. [2014] and Kasahara and Shimotsu [2009], respectively. Nonparametric modeling of the components is also possible if the order of the mixture is assumed to be known, which is done, for example, in Bonhomme et al. [2015].

Finally, the case of mixture models with an unknown number of components is not covered by Andrews and Cheng [2012]. In that paper, a crucial assumption is that some subset of the model parameters is always identified. In the case of mixtures, when some πk is equal to 0, the corresponding θk is not identified, but also, for some values of θk, the corresponding πk is not identified (see Example 1).

The rest of the paper is organized as follows. Section 2 formally introduces the model and notation, while the OSCE is presented in Section 3. Discussion of the assumptions on the model and the penalty function, together with the main theoretical results, is provided in Section 4. Section 5 covers the practical implementation of the method (choice of the tuning parameter and the modified EM algorithm) and studies the performance of the OSCE in simulations. Section 6 is devoted to the application of the method to the data from the experiment in Cornand and Heinemann [2014]. Section 7 concludes.

2 Model and Notation

This section introduces formal notation for the model parameters and describes the parameter space.

The data X comes from the following distribution:

$$h(x; \varphi) = \left(1 - \sum_{k=2}^{K} \pi_k\right) f(x; \theta_1) + \pi_2 f(x; \theta_2) + \ldots + \pi_K f(x; \theta_K), \qquad (1)$$

where the densities f(·; θk), for k = 1, ..., K, belong to a parametric family of distributions $\mathcal{F}_\Theta$, the component parameters are such that θk ≠ θj for j ≠ k, and all the mixing coefficients, πk for k = 2, ..., K and $(1 - \sum_{k=2}^{K} \pi_k)$, are strictly positive. The vector of parameters of the K-component true density is denoted by ϕ, where ϕ = (θ1, ..., θK, π2, ..., πK).2 Note that the parameters of the density in (1) are not point-identified. The reason for this is that the same density can be obtained by simply relabeling the coefficients. For example, if the density of X is

$$h(x; \varphi) = \pi_1 f(x; \theta_1) + \pi_2 f(x; \theta_2),$$

then it can also be written as

$$h(x; \varphi') = \pi_1' f(x; \theta_1') + \pi_2' f(x; \theta_2'),$$

where $\pi_1' = \pi_2$, $\theta_1' = \theta_2$, $\pi_2' = \pi_1$, and $\theta_2' = \theta_1$. However, this is not problematic, as all the parameter values corresponding to the same density can be obtained through a permutation of the components.

2Throughout, all parameter vectors are taken to be row vectors.

When the component parameter θk is one-dimensional, the relabeling issue can be resolved by imposing a (theoretical) restriction on the parameter space of the coefficients, namely θ1 ≤ θ2 ≤ ... ≤ θK. In the case with multidimensional θk, this is more challenging. Following McLachlan and Peel [2004] (Section 1.14), I impose the restriction π1 ≥ π2 ≥ ... ≥ πK and, henceforth, ignore the relabeling problem.
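To make the relabeling point concrete, here is a minimal Python sketch (not part of the original paper; the parameter values and the unit-variance normal family are illustrative assumptions) that evaluates a two-component mixture density under two labelings of its components and checks that the densities coincide.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, thetas, pis):
    """Finite mixture of unit-variance normals: sum_k pi_k * phi(x; theta_k, 1)."""
    x = np.asarray(x, dtype=float)
    return sum(p * norm.pdf(x, loc=t, scale=1.0) for t, p in zip(thetas, pis))

x_grid = np.linspace(-4.0, 4.0, 201)

# Original labeling: (theta_1, pi_1) = (0, 0.7), (theta_2, pi_2) = (2, 0.3)
h1 = mixture_density(x_grid, thetas=[0.0, 2.0], pis=[0.7, 0.3])
# Relabeled version: the two components are swapped
h2 = mixture_density(x_grid, thetas=[2.0, 0.0], pis=[0.3, 0.7])

print(np.allclose(h1, h2))  # True: both labelings yield the same density
```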

The researcher does not know K but knows K¯ , an upper bound on K (Assumption (M1) below), and estimates a finite mixture with K¯ components:

$$g(x; \psi) = \left(1 - \sum_{k=2}^{\bar K} \pi_k\right) f(x; \theta_1) + \pi_2 f(x; \theta_2) + \ldots + \pi_{\bar K} f(x; \theta_{\bar K}), \qquad (2)$$

where the mixing coefficients πk are now nonnegative. The parameter vector of this large model is denoted by ψ ≡ (θ, π), where $\theta = (\theta_1, \ldots, \theta_{\bar K})$ is the vector of the parameters of the individual components and $\pi = (\pi_2, \ldots, \pi_{\bar K})$ parameterizes the mixing coefficients.3 For convenience, I sometimes use the notation $\pi_1 \equiv 1 - \sum_{k=2}^{\bar K} \pi_k$ for the first mixing coefficient. For an arbitrary $\tilde K \geq 2$ define

$$\Pi_{\tilde K} = \left\{ (\pi_2, \ldots, \pi_{\tilde K}) : \left(1 - \sum_{k=2}^{\tilde K} \pi_k, \, \pi_2, \ldots, \pi_{\tilde K}\right) \in \Delta^{\tilde K - 1}, \; 1 - \sum_{k=2}^{\tilde K} \pi_k \geq \pi_2 \geq \ldots \geq \pi_{\tilde K} \right\}, \qquad (3)$$

where $\Delta^{\tilde K - 1}$ is the $(\tilde K - 1)$-simplex. For $\tilde K = \bar K$ this expression defines the parameter space for π. The family of distributions to which the individual components belong is indexed by $\Theta \subset \mathbb{R}^q$, and hence θk ∈ Θ for each k = 1, ..., K̄. The parameter space for ψ is denoted by Ψ and is defined as

$$\Psi = \Theta^{\bar K} \times \Pi_{\bar K}.$$

Ψ is a subset of $\mathbb{R}^d$, where $d = \bar K q + \bar K - 1$ denotes the dimensionality of ψ. I also refer to Φ as

3 Although there are K¯ mixing coefficients, the value of π1 is entirely determined by the values of the last K¯ − 1 mixing coefficients.

9 the parameter space in which ϕ, the vector of parameters of the model with known K, lies, where

$$\Phi = \Theta^{K} \times \Pi_{K}.$$

Here, $\Pi_K$ is defined as in (3) for $\tilde K = K$, the true order of the mixture. The parameter space Φ is a subset of $\mathbb{R}^{d_\varphi}$, where $d_\varphi = Kq + K - 1$. Observe that in the estimated model (2) some of the mixing coefficients are 0, and hence ψ lies on the boundary of the parameter space. This fact itself would lead to a nonstandard limiting distribution of the maximum likelihood estimator (for example, see Moran [1971]). However, a more severe issue in this situation is the following: because the true order of the mixture K is smaller than K̄, the parameter vector ψ is not point-identified, as the next example illustrates.

Example 1. Let φ(x; θ, σ²) denote the density of a normal distribution with mean θ and variance σ².

For the purpose of this example, suppose that the true data generating process is h(x; ϕ) = φ(x; θ1, 1), but the researcher estimates a mixture of two normal components with (known) unit variances. For simplicity, assume that the mean of the first component, θ1, is known and equal to $\theta_1^0$, so that the only unknown parameters are the mixing coefficient π2 and the mean of the second component θ2. That is, the estimated model takes the following form:

$$g(x; \psi) = (1 - \pi_2)\,\phi(x; \theta_1^0, 1) + \pi_2\,\phi(x; \theta_2, 1).$$

However, there are multiple ways of representing the true density as a mixture of two components:

• $\phi(x; \theta_1^0, 1) = (1 - \pi_2)\,\phi(x; \theta_1^0, 1) + \pi_2\,\phi(x; \theta_1^0, 1)$ for any $\pi_2 \in [0, 1/2]$; or

• $\phi(x; \theta_1^0, 1) = 1 \cdot \phi(x; \theta_1^0, 1) + 0 \cdot \phi(x; \theta_2, 1)$ for any $\theta_2 \in \Theta$.

This consideration implies that all the points ψ = (θ2, π2) such that $\pi_2(\theta_2 - \theta_1^0) = 0$ are observationally equivalent, i.e., they yield the same density function.
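The observational equivalence in Example 1 is easy to verify numerically. The following sketch (illustrative only; it fixes $\theta_1^0 = 0$, a value chosen purely for the example) checks that points with $\pi_2 = 0$ or $\theta_2 = \theta_1^0$ reproduce the true density, while a point outside the identified set does not.

```python
import numpy as np
from scipy.stats import norm

theta1_0 = 0.0                     # illustrative value of theta_1^0
x = np.linspace(-4.0, 4.0, 201)
truth = norm.pdf(x, loc=theta1_0)  # true density phi(x; theta_1^0, 1)

def g(x, theta2, pi2):
    """Estimated two-component model from Example 1."""
    return (1.0 - pi2) * norm.pdf(x, loc=theta1_0) + pi2 * norm.pdf(x, loc=theta2)

print(np.allclose(g(x, theta2=1.5, pi2=0.0), truth))       # pi_2 = 0: True
print(np.allclose(g(x, theta2=theta1_0, pi2=0.3), truth))  # theta_2 = theta_1^0: True
print(np.allclose(g(x, theta2=1.5, pi2=0.3), truth))       # neither condition holds: False
```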

Next, I provide definitions for the central concepts of my analysis.

Definition 1. The identified set $\tilde\Psi_0$ is the collection of all points in the parameter space Ψ that correspond to the true distribution, i.e.

$$\tilde\Psi_0 = \left\{ \tilde\psi \in \Psi : g(x; \tilde\psi) = h(x; \varphi) \right\}.$$

Lack of point-identification of ψ is an issue because it immediately invalidates the results on consistency and asymptotic efficiency of the MLE. Indeed, since there are multiple points which yield the same density, the conceptual question is for which of these points the consistency of the estimator should be established. By generalizing the concept of consistency, Redner [1981], and subsequently Feng and McCulloch [1996], proved that the MLE converges to the identified set, in the sense that the Euclidean distance between the MLE of the parameter vector and the identified set goes to 0 in probability (see Section 3.1 below for additional details).

While all the points from the identified set produce the same density, some of the subsets of the identified set correspond to having a mixture with several components being identical (e.g., in Figures 2 and 1a each point from the vertical part of the identified set corresponds to a two-component mixture with the two components being the same). However, one can identify a subset of the identified set that corresponds to a mixture for which all the components associated with non-zero mixing coefficients are different from each other. I refer to this subset as the parsimonious subset of the identified set, as it corresponds to the parsimonious model. Definition 2 below formalizes this notion.

Definition 2. The parsimonious model is a mixture model with parameters $\psi^p = (\theta^p, \pi^p)$, where $\theta^p = (\theta_1^p, \ldots, \theta_{\bar K}^p)$ and $\pi^p = (\pi_2^p, \ldots, \pi_{K^p}^p, 0, \ldots, 0)$, i.e.,

$$g(x; \psi^p) = \sum_{k=1}^{K^p} \pi_k^p f(x; \theta_k^p) + \sum_{k=K^p+1}^{\bar K} 0 \cdot f(x; \theta_k^p),$$

such that:

• $g(x; \psi^p) = h(x; \varphi)$;

• $\pi_k^p > 0$, $k = 1, \ldots, K^p$; and

• $\theta_k^p \neq \theta_j^p$ for $k \neq j$, $j, k = 1, \ldots, K^p$.

It is important to note that the parsimonious model is formally a K̄-component mixture, with exactly $K^p$ non-zero mixing coefficients. This immediately implies that not all the parameters of the parsimonious model are point-identified: since the last $\bar K - K^p$ mixing coefficients are zeros, this leaves the values of the corresponding component parameters unrestricted.

The main question regarding the parsimonious model is whether there is a unique $K^p$ for which the conditions from Definition 2 can be satisfied. It is worth pointing out that, since K̄ is finite, there always exists the smallest $\tilde K$ for which the true density can be written as

$$g(x; \tilde\psi) = \sum_{k=1}^{\tilde K} \tilde\pi_k f(x; \tilde\theta_k) + \sum_{k=\tilde K+1}^{\bar K} 0 \cdot f(x; \tilde\theta_k),$$

for some $\tilde\psi \in \Psi$. However, this fact alone does not guarantee that $K^p$ is unique, as the following example illustrates.

Example 2 (Ahmad and Al-Hussaini [1982]). Let $f_B(\cdot\,; \alpha, \beta)$ denote the density of a Beta distribution on [0, 1] with parameters α and β, i.e.,

$$f_B(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1}(1-x)^{\beta-1}.$$

Let the data X come from the distribution

$$h(x; \varphi) = f_B(x; 2, 1),$$

and let the estimated model be

$$g(x; \psi) = \pi_1 f_B(x; \alpha_1, \beta_1) + \pi_2 f_B(x; \alpha_2, \beta_2).$$

Clearly, $K^p = 1$, $\pi_1^p = 1$, $\pi_2^p = 0$, $\theta_1^p = (2, 1)$, and any $\theta_2^p \in \Theta$ satisfy the conditions listed in Definition 2 of the parsimonious model. However, $K^p = 2$, $\theta_1^p = (3, 1)$, $\theta_2^p = (2, 2)$, and $(\pi_1^p, \pi_2^p) = (2/3, 1/3)$ also satisfy these conditions, as

$$\frac{2}{3} f_B(x; 3, 1) + \frac{1}{3} f_B(x; 2, 2) = \frac{2}{3}\,\frac{1}{B(3, 1)}\, x^{3-1}(1-x)^{1-1} + \frac{1}{3}\,\frac{1}{B(2, 2)}\, x^{2-1}(1-x)^{2-1}$$
$$= 2x^2 + 2x(1-x) = \frac{1}{1/2}\, x^{2-1}(1-x)^{1-1} = 1 \cdot f_B(x; 2, 1) = h(x; \varphi).$$

To sum up, in this situation $K^p$ is not point-identified.
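The identity used in this example can be checked numerically. The following sketch (illustrative; it relies on scipy's Beta density) confirms that the two-component Beta mixture with weights (2/3, 1/3) coincides with $f_B(x; 2, 1)$ on (0, 1).

```python
import numpy as np
from scipy.stats import beta

x = np.linspace(0.001, 0.999, 500)
mixture = (2.0 / 3.0) * beta.pdf(x, 3, 1) + (1.0 / 3.0) * beta.pdf(x, 2, 2)
single = beta.pdf(x, 2, 1)
print(np.allclose(mixture, single))  # True: the two representations coincide
```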

However, whenever $K^p$ is unique, it must be equal to K. The following Proposition 1 states a condition under which the parsimonious model is unique, in the sense that $K^p$, as well as the vector $(\theta_1^p, \ldots, \theta_{K^p}^p, \pi_2^p, \ldots, \pi_{K^p}^p)$, are point-identified.

Proposition 1. Under Assumption (M3) from Section 4.1.1 below, the parsimonious model, i.e. $K^p$ and $(\theta_1^p, \ldots, \theta_{K^p}^p, \pi_2^p, \ldots, \pi_{K^p}^p)$, is point-identified. Moreover, $K^p = K$ and $(\theta_1^p, \ldots, \theta_{K^p}^p, \pi_2^p, \ldots, \pi_{K^p}^p) = \varphi$.

Proof. See Appendix B.1.

This paper concentrates on the estimation of the parsimonious model. Given the identification result above, the parsimonious model is characterized by the following subset of the identified set:

$$\Psi_0 = \left\{ \tilde\psi = (\tilde\theta_1, \tilde\theta_2, \tilde\pi_1, \tilde\pi_2) \in \Psi : (\tilde\theta_1, \tilde\pi_1) = \varphi, \; \tilde\pi_2 = 0, \; \tilde\theta_2 \in \Theta^{(\bar K - K)} \right\},$$

where π1 contains the first K − 1 elements of the vector of mixing coefficients π, θ1 contains the first Kq elements of θ, π2 consists of the last K̄ − K elements of π, and θ2 contains the last (K̄ − K)q elements of θ.

All points in the parsimonious subset of the identified set correspond to a mixture with exactly K distinct components. Observe that this subset is not a singleton, since the parameters of the last K̄ − K components are unrestricted. The following example illustrates this notation.

Example 1 (Continued). In this example, K = 1, $\varphi = \theta_1^0$ is known, K̄ = 2, and ψ = (θ2, π2). The identified set is $\tilde\Psi_0 = \{(\theta_2, \pi_2) \in \Theta \times [0, 1/2] : \pi_2(\theta_2 - \theta_1^0) = 0\}$, whereas its parsimonious subset is $\Psi_0 = \{(\theta_2, \pi_2) \in \Theta \times [0, 1/2] : \pi_2 = 0\}$. The identified set and its parsimonious subset are illustrated in Figures 1a and 1b, respectively.

(a) In bold is $\tilde\Psi_0$, the identified set. (b) In bold is $\Psi_0$, the parsimonious subset of $\tilde\Psi_0$. (Axes: $\theta_2$ horizontal, $\pi_2$ vertical.)

Figure 1: The true data generating process is $h(x; \varphi) = \phi(x; \theta_1^0, 1)$. The estimated model is $g(x; \psi) = (1 - \pi_2)\,\phi(x; \theta_1^0, 1) + \pi_2\,\phi(x; \theta_2, 1)$; $\theta_1^0$ is known.

3 Estimation

The model considered in this paper is fully parametric, and hence it seems natural to adopt the maximum likelihood approach to estimation of the model parameters. However, this does not yield desirable results, since the parameters of the estimated model are not point-identified. Point-identification of the model parameters is the crucial assumption required to establish consistency and the asymptotic distribution of the MLE (see Section 6.3 in Lehmann and Casella [1998]). In the situation of no point-identification, Redner [1981] proved, in the quotient space, that the MLE converges to the identified set. A discussion of this result can be found in Section 3.1 below.

The goal of this paper, instead, is to propose an estimator of the model parameters that converges to a specific subset of the identified set, namely, its parsimonious subset. All the points in this subset have π2 = 0. Hence, the estimator of ψ needs to be "pushed" towards that subset. An analogous problem arises in the context of selecting relevant regressors in a linear regression setting, which is described in Section 3.2. This paper employs a similar tool to achieve the main goal, and the estimator that I propose is defined in Section 3.3.

3.1 MLE

When the model is fully parametric, i.e., the density of the data is known up to a finite-dimensional parameter, the maximization of the likelihood function is the standard approach to estimation of the model parameters. Specifically, the following function is maximized with respect to the vector $\tilde\psi$:

$$L_n(X; \tilde\psi) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{\bar K} \tilde\pi_k f(x_i; \tilde\theta_k) \right).$$

Even if the parameters of the model were point-identified, i.e., if the likelihood above were estimated with K̄ = K, the estimation would not be completely straightforward. This is due to the fact that the MLE, defined as the global maximizer of the log-likelihood function, sometimes may be inconsistent or may not even exist. In fact, finite mixtures provide a prominent example of a situation in which the MLE does not exist.

Example 3 (Kiefer and Wolfowitz [1956], Section 6). Let the data come from the following mixture of normal densities with different means and variances:

$$h(x; \varphi) = (1 - \pi_2)\,\frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + \pi_2\,\frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}.$$

The log-likelihood function is a sum of n logarithms of this density evaluated at the sample observations, and it can be seen that, for example, if $\tilde\mu_1 = x_1$, $\tilde\pi_2 = 1/2$, and $\tilde\sigma_1^2 \to 0$, the first term in the log-likelihood goes to infinity, while the others are bounded, so that the resulting sum is unbounded and, hence, the global maximum of the likelihood function does not exist.
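The unboundedness of the likelihood in this example is easy to see numerically. The sketch below (illustrative; the data, sample size, and the values of the remaining parameters are arbitrary assumptions) sets $\tilde\mu_1$ equal to the first observation and shrinks $\tilde\sigma_1$, which drives the log-likelihood up without bound.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # illustrative data from a standard normal

def loglik(x, mu1, sigma1, mu2=0.0, sigma2=1.0, pi2=0.5):
    """Log-likelihood of the two-component normal mixture at the given parameters."""
    dens = (1 - pi2) * norm.pdf(x, mu1, sigma1) + pi2 * norm.pdf(x, mu2, sigma2)
    return np.sum(np.log(dens))

for s in [1.0, 0.1, 0.01, 0.001]:
    # mu_1 is set to the first observation; the log-likelihood grows as sigma_1 shrinks
    print(s, loglik(x, mu1=x[0], sigma1=s))
```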

Remark 1. The conditions that ensure the existence of the global maximizer of the likelihood function, as well as its consistency, are in general technically involved. It is important to remember that the goal per se is not to study the properties of the global maximizer of the likelihood function, but rather to find a consistent and asymptotically efficient estimator of the model parameters. For this purpose, it is possible to establish the existence of a local maximizer of the log-likelihood function4 that is consistent and asymptotically efficient, in the sense that its asymptotic variance is equal to the inverse of the information matrix evaluated at the truth. Of course, if there is more than one critical point, the question arises as to which one has to be chosen as the estimator of the model parameters. It was established that the critical point of the log-likelihood that is closest to the truth possesses the desired properties. Nevertheless, this result is limited to the theoretical understanding of the problem, as in practice the true value of the parameters is unknown and is, actually, the object of the estimation. A more practically useful result says that the critical point that is closest to any consistent estimator of the model parameters can be used (see the discussion in Section 6.4 of Lehmann and Casella [1998]). This implies that a "good" starting point must be chosen for the numerical maximization algorithms, e.g., obtained through method of moments estimation or a k-means clustering algorithm. For more details, see Section 2.12 of McLachlan and Peel [2004].

4A local maximizer corresponds to a critical point of the log-likelihood function.

Whenever the parameters of the model are not point-identified, and hence the identified set is not a singleton, the classic consistency result for the MLE or a local maximizer of the likelihood cannot be established. However, Redner [1981] showed that a generalized version of consistency can be obtained. Specifically, he proved the almost sure convergence of the MLE to the identified set in the quotient space topology.

In the spirit of the discussion above, Feng and McCulloch [1996], among other results, showed (under other conditions) that there exists a local maximizer of the likelihood function that converges to the identified set in probability, in the sense that the Euclidean distance between this local maximizer of the likelihood and the identified set goes to 0 in probability.5 Figure 2 illustrates this generalized consistency result by plotting the MLEs of the parameters of the model from Example 1 for different samples.

While the results by Redner [1981] and Feng and McCulloch [1996] are intuitive, they do not provide the means to conduct inference on any of the model parameters. In this context, inference on the model parameters can be done directly, without the estimation step, by inverting a profiled likelihood ratio test statistic (see Liu and Shao [2003] and Chen, Ponomareva, and Tamer [2014]).

Finally, another issue that is characteristic of the situation with an unknown number of components is the parameter-on-the-boundary issue. As was established by Moran [1971], even under point-identification, this leads to a nonstandard asymptotic distribution of the MLE.

The goal of this paper is to obtain an estimator that, in contrast to the MLE, converges to the parsimonious subset of the identified set. As was mentioned above, the parsimonious subset is characterized by π2 = 0. The way to achieve the stated goal would be to push its estimator π̂2 towards 0. The literature on simultaneous selection and estimation of linear regression models addresses a similar question, albeit in a different context, as is discussed in the following subsection.

5The Euclidean distance between the estimator $\hat\psi$ and the identified set $\tilde\Psi_0$ is defined as $\inf_{\tilde\psi \in \tilde\Psi_0} \|\hat\psi - \tilde\psi\|$.

Figure 2: Results from the estimation of a 2-component mixture via maximum likelihood. 100 observations were generated from N(0, 1), and a mixture of two components was estimated: $(1 - \pi_2)\,\phi(x; 0, 1) + \pi_2\,\phi(x; \theta_2, 1)$. The MLEs of (θ2, π2) for ten different samples are plotted on the graph. In bold blue is the identified set.

3.2 Selection of relevant regressors: An overview

In practice, researchers often have data sets that contain many variables. While economic theory predicts that some of those regressors should have an impact on the dependent variable, many other variables may or may not be included as controls. Including too many variables leads to inflation of standard errors, while not including all relevant regressors might lead to omitted variable bias.

The literature (e.g., Tibshirani [1996], Fan and Li [2001]) has proposed estimators such that the estimators of the coefficients on irrelevant regressors (i.e., those with true value 0) are exactly equal to 0 with probability approaching 1, for large enough sample sizes. A common element of these methods is the addition of a specific penalty term to the objective function. Namely, the following objective function is maximized:

$$Q_n(Y, X, \beta) = -(Y - X\beta)^T (Y - X\beta) - n \sum_{k=1}^{K} p_{\lambda_n}(\beta_k),$$

where Y is an n × 1 vector, X is an n × K matrix of regressors, β is a K × 1 vector of regression coefficients, and $p_{\lambda_n}(\cdot)$ is a penalty function.

Fan and Li [2001] state that a penalty function has to yield an estimator that possesses three properties: (1) unbiasedness (i.e., the estimator of non-zero coefficients is asymptotically unbiased), (2) sparsity (i.e., with probability approaching 1, all zero coefficients are estimated as exactly 0), and (3) continuity (i.e., the estimator is a continuous function of the data). Despite its convenience for numerical optimization algorithms, the least absolute shrinkage and selection operator (LASSO) penalty introduced by Tibshirani [1996] – one of the most popular penalty functions in the economic literature – generally does not result in an estimator with all the described properties.6 Other choices of the penalty function include the SCAD (smoothly clipped absolute deviation) penalty (Fan and Li [2001]), the MCP (minimax concave penalty, Zhang [2010a]), the adaptive LASSO penalty (Zou [2006]), and the bridge penalty (Knight and Fu [2000], Huang et al. [2008]). Many of these penalties result in an estimator that satisfies the properties listed by Fan and Li.

The LASSO penalty function is proportional to the absolute value of the coefficient:

$$p_{\lambda_n}^{LASSO}(t) = \lambda_n |t|.$$

Both the SCAD and the MCP penalties have the property that they are constant for large values of their argument. The SCAD penalty is an even function and for nonnegative values of its argument t is defined in the following way:

$$p_{\lambda_n}^{SCAD}(t) = \lambda_n t\, I(t \leq \lambda_n) - \frac{(a\lambda_n - t)^2}{2(a-1)}\, I(\lambda_n < t \leq a\lambda_n) + \frac{a+1}{2}\,\lambda_n^2\, I(\lambda_n < t),$$

where λn determines the magnitude of the penalty and is required to depend on the sample size n, and a is usually set equal to 3.7 (see Fan and Li[2001] for the details).

The MCP penalty is also an even function and for nonnegative values of its argument t is defined as

$$p_{\lambda_n}^{MCP}(t) = \left(\lambda_n t - \frac{t^2}{2\gamma}\right) I(t \leq \lambda_n\gamma) + \frac{\lambda_n^2\,\gamma}{2}\, I(t > \lambda_n\gamma),$$

6To be more precise, there exist conditions under which LASSO can yield an estimator with these three properties, but they are rather technical and hard to satisfy. For more details, see Knight and Fu[2000].

where, as before, λn depends on the sample size and γ parameterizes the family of minimax concave penalties. The three penalty functions are depicted in Figure 3.

Figure 3: LASSO, SCAD, and MCP penalty functions (γ for MCP is set equal to a for SCAD).
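For reference, the three penalty functions can be implemented in a few lines. The following Python sketch (not part of the paper; function names and the default values a = γ = 3.7 are illustrative) codes the LASSO, SCAD, and MCP penalties as defined above.

```python
import numpy as np

def lasso_penalty(t, lam):
    """LASSO penalty: proportional to |t|."""
    return lam * np.abs(t)

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty for t >= 0 (extended to t < 0 as an even function)."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (a + 1) / 2 * lam**2 - (a * lam - t) ** 2 / (2 * (a - 1)),
            (a + 1) / 2 * lam**2,
        ),
    )

def mcp_penalty(t, lam, gamma=3.7):
    """MCP penalty for t >= 0 (extended to t < 0 as an even function)."""
    t = np.abs(t)
    return np.where(t <= gamma * lam, lam * t - t**2 / (2 * gamma), gamma * lam**2 / 2)

# SCAD and MCP level off for large t, while LASSO keeps growing linearly
lam = 0.5
for t in [0.1, 0.5, 2.0, 5.0]:
    print(t, lasso_penalty(t, lam), float(scad_penalty(t, lam)), float(mcp_penalty(t, lam)))
```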

All the penalty functions introduced in this literature share one feature: they are non-differentiable at the origin. This non-differentiability is what allows the estimators of those βk whose true value is 0 to be set exactly equal to 0.

Also, all these penalty functions involve a tuning parameter that depends on the sample size. In general, including a penalty in the objective function distorts the first-order conditions. The penalty is small around 0 and increasing, which allows the estimator to be "pushed" towards 0. However, this "push" might be too large for the estimators of the non-zero coefficients, potentially making them inconsistent. The role of the tuning parameter is to balance these two effects: on the one hand, it should guarantee that the estimators of zero coefficients are distorted enough to become exactly equal to 0; on the other, it should ensure that the estimators of non-zero coefficients are not so distorted as to lose their asymptotic properties. However, for LASSO no value of the tuning parameter can balance these two effects, and that is why the LASSO-based estimator does not simultaneously satisfy the three properties of Fan and Li [2001]. For the current setting, the consequences of this feature of LASSO are even more drastic (see also the discussion after Theorem 1).

19 3.3 OSCE

In this paper I propose to maximize a penalized log-likelihood function, where the mixing coefficients are the object of the penalization:

$$Q_n(X; \psi) = L_n(X; \psi) - n \sum_{k=2}^{\bar K} p_{\lambda_n}(\pi_k), \qquad (4)$$

where $p_{\lambda_n}(\cdot)$ is a penalty function that depends on a tuning parameter $\lambda_n$.

Observe that the first mixing coefficient, $\pi_1 = 1 - \sum_{k=2}^{\bar K} \pi_k$, is not penalized, as the mixture always has at least one component: given the restriction on the parameter space that the mixing coefficients are nonincreasing, the first component always has to be present.
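As an illustration of the objective in equation (4), the following Python sketch (not from the paper; the unit-variance normal components, the SCAD penalty, and all numerical values are assumptions made for the example) evaluates $Q_n$ for a two-component mixture and shows that a point in the parsimonious subset attains a higher value of the penalized objective than another point of the identified set with a redundant positive mixing coefficient, in line with Figure 4.

```python
import numpy as np
from scipy.stats import norm

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty, as defined in Section 3.2."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(t <= a * lam,
                 (a + 1) / 2 * lam**2 - (a * lam - t) ** 2 / (2 * (a - 1)),
                 (a + 1) / 2 * lam**2),
    )

def penalized_loglik(x, thetas, pis_tail, lam):
    """Q_n(X; psi) = L_n(X; psi) - n * sum_{k>=2} p_lambda(pi_k); pi_1 is unpenalized."""
    pis_tail = np.asarray(pis_tail, dtype=float)      # (pi_2, ..., pi_Kbar)
    pis = np.concatenate(([1.0 - pis_tail.sum()], pis_tail))
    dens = sum(p * norm.pdf(x, loc=t, scale=1.0) for p, t in zip(pis, thetas))
    return np.sum(np.log(dens)) - len(x) * np.sum(scad_penalty(pis_tail, lam))

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # true model: a single N(0, 1) component, so K = 1, Kbar = 2

# A point in the parsimonious subset (pi_2 = 0) versus another point of the identified
# set (theta_2 = theta_1, pi_2 > 0): both give the same likelihood, but the penalty
# makes the parsimonious point attain the higher value of Q_n.
print(penalized_loglik(x, thetas=[0.0, 2.0], pis_tail=[0.0], lam=0.2))
print(penalized_loglik(x, thetas=[0.0, 0.0], pis_tail=[0.4], lam=0.2))
```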

The intention behind choosing this particular penalization is twofold. First, to have an objective function that achieves the same value at all the points from the parsimonious subset of the identified set. Second, to ensure that its value on the parsimonious subset is strictly larger than at any other point in the identified set. This is what makes it possible to discriminate the parsimonious subset, which is the object of interest in this paper, from all other subsets of the identified set. Figure 4 illustrates this claim.

Then, the Order-Selection-Consistent Estimator (OSCE) is defined as a local maximizer of $Q_n(X; \tilde\psi)$ with respect to $\tilde\psi \in \Psi$, which is a $\sqrt{n}$-consistent estimator of the parsimonious model (in the sense defined in the statement of Theorem 1).

As I show in Sections 4.3 and 4.4, it is possible to apply the technique from the regressor selection literature to the estimation of finite mixture models. However, there are significant differences that are caused by the lack of point-identification of the mixture model parameters.

4 Theoretical Properties of the OSCE

This section covers asymptotic properties of the OSCE. First, I introduce the assumptions on the model and the penalty function, and compare them to those used in the literature. Next, I demonstrate the existence of a local maximizer in the neighborhood of the parsimonious subset of the identified set, together with other asymptotic properties of the OSCE. The last subsection discusses how Assumption (M1) can be weakened.

(a) Level sets of a particular realization of the log-likelihood function. (b) Level sets of the penalized log-likelihood function for the same sample.

Figure 4: Level sets of a particular realization of the log-likelihood (in (a)) and the penalized log-likelihood (in (b)) functions for the model in Example 1. The sample size is n = 100, the data generating process is h(x; ϕ) = φ(x; 0, 1), and the estimated model is g(x; ψ) = (1 − π2)φ(x; 0, 1) + π2φ(x; θ2, 1). In blue is the identified set.

4.1 Assumptions

4.1.1 Assumptions on the model

(M1) The number K¯ is such that K¯ > K.

(M2) $\{x_i\}_{i=1}^{n}$ are i.i.d. with density h(x; ϕ) with respect to some measure µ. g(x; ψ) has a support that does not depend on ψ. Ψ is a compact subspace of $\mathbb{R}^d$, and ϕ is an interior point of Φ.

(M3) The set $\mathcal{F}_\Theta \equiv \{f(x; \theta) : \theta \in \Theta,\ x \in \mathbb{R}^q\}$ is linearly independent over the field of real numbers.

(M4) The first and second logarithmic derivatives of g(x; ψ) satisfy the equations

$$E_\psi\left[\frac{\partial \log g(x; \psi)}{\partial \psi_j}\right] = 0, \quad \text{for } j = 1, \ldots, d,$$

$$I_{jk}(\psi) \equiv E_\psi\left[\frac{\partial \log g(x; \psi)}{\partial \psi_j}\,\frac{\partial \log g(x; \psi)}{\partial \psi_k}\right] = E_\psi\left[-\frac{\partial^2 \log g(x; \psi)}{\partial \psi_j \partial \psi_k}\right], \quad \text{for } j, k = 1, \ldots, d,$$

and, for some $C_1 > 0$,

$$\gamma^{\max}(I(\psi_0)) \leq C_1 < +\infty$$

for all $\psi_0 \in \Psi_0$, where $\gamma^{\max}(A)$ denotes the largest eigenvalue of the matrix A.

(M5) There exist functions $N_{kl}(\cdot)$, for $k, l = 1, \ldots, d$, such that for all $\psi_0 \in \Psi_0$ and µ-almost all x

$$\left|\frac{\partial^2 \log g(x; \psi_0)}{\partial \psi_k \partial \psi_l}\right| \leq N_{kl}(x),$$

and $E_{\psi_0}[N_{kl}(x)] \leq C_2 < +\infty$ for some $C_2 > 0$.

(M6) There exists $\omega_\varepsilon \equiv \cup_{\psi_0 \in \Psi_0} B_\varepsilon(\psi_0)$, a neighborhood of $\Psi_0$ (in Ψ), where

$$B_\varepsilon(\psi) \equiv \left\{\psi' \in \Psi : \|\psi' - \psi\| \leq \varepsilon\right\}$$

denotes the ε-ball (in Ψ) around a given point ψ,7 such that for µ-almost all x the density g(x; ψ) admits all the third partial derivatives with respect to ψ for all $\psi \in \omega_\varepsilon$, and there exist functions $M_{jkl}(\cdot)$ such that for all $\psi \in \omega_\varepsilon$

$$\left|\frac{\partial^3 \log g(x; \psi)}{\partial \psi_j \partial \psi_k \partial \psi_l}\right| \leq M_{jkl}(x),$$

where $E_{\psi_0}[M_{jkl}(x)] \leq C_3 < +\infty$ for all $\psi_0 \in \Psi_0$.

(M7) For every $\psi' \in \omega_\varepsilon \setminus \tilde\Psi_0$, the information matrix satisfies $(\psi' - \psi^{\psi'})^T I(\psi^{\psi'})(\psi' - \psi^{\psi'}) > 0$, where $\psi^{\psi'} = (\theta^{\psi'}, \pi^{\psi'})$ is the point in $\Psi_0$ that is the closest to $\psi'$ in the Euclidean distance.8

Assumption (M2) is standard for the literature on maximum likelihood estimation. Assumptions (M4)-(M6) are a slightly modified version of the corresponding classic assumptions. The modifications are meant to accommodate the lack of point-identification, in that the stated conditions have to be satisfied at all points in the parsimonious subset rather than at just one point – the truth.

7That is, $B_\varepsilon(\psi)$ is the intersection of a standard ball in $\mathbb{R}^d$ with the parameter space Ψ.

8In other words, $\psi^{\psi'} \in \Psi_0$ is such that $\|\psi' - \psi^{\psi'}\| \leq \|\psi' - \tilde\psi\|$ for all $\tilde\psi \in \Psi_0$.

One more assumption, which is usually made, states point-identification of the model parameters ψ, which does not hold in the current setting. Instead, (M3) is assumed, which guarantees identification of the parsimonious model, a result that was shown in Proposition 1. This assumption ensures that no linear combination of $f(\cdot\,; \theta_l)$, for some $\{\theta_l\}_{l=1}^{L}$, can be represented as a different linear combination of $f(\cdot\,; \tilde\theta_m)$, for some $\{\tilde\theta_m\}_{m=1}^{M}$, and hence there is a unique model with distinct components entering the mixture with non-zero coefficients.9 (M3) is a high-level assumption; details on how it can be verified for particular families of distributions can be found, for example, in Rao [1992]. It has been shown to hold for many families of distributions, in particular, the family of multidimensional normals, the family of generalized Cauchy distributions, the family of negative binomial distributions (all in Yakowitz and Spragins [1968]), the family of gamma distributions (Teicher [1963]), the family of Laplace distributions (Holzmann et al. [2006]), and many others. (M3) is the assumption that allows this paper to concentrate on the estimation of the parameters of the parsimonious model.

Another classic assumption states the positive-definiteness of the information matrix at the truth, which, again, does not hold for the finite mixtures under consideration. Following Feng and McCulloch [1996], I instead assume that (M7) holds. The latter assumption states only that the information matrix is positive definite along a particular direction. It turns out that this is precisely the direction that is crucial for establishing the existence of a consistent estimator of the parsimonious model. Appendix A provides an illustration of this assumption using Example 1.

4.2 Assumptions on the penalty function

The following assumptions on the penalty function are standard for the penalization literature (e.g. see Fan and Li [2001]), except for Assumption (P3). This condition is required for the results in this paper, as it makes it possible to show the convergence of the proposed estimator to the parsimonious subset (see Theorem 1). It is not very restrictive, as most of the penalty functions introduced in Section 3.2 satisfy it.

(P1) The penalty function $p_{\lambda_n}(\cdot)$ is a non-decreasing concave function, and is twice continuously differentiable at all π > 0 with the exception of at most a few points. Additionally, $p_{\lambda_n}(0) = 0$ and it is not differentiable at 0.

(P2) $b_n \equiv \max\left\{p'_{\lambda_n}(\pi_j) : \pi_j \neq 0\right\} = O_p(1/\sqrt{n})$.

(P3) $p_{\lambda_n}(\pi) = \lambda_n \pi + O_p(n^{-1})$ for all $\pi \leq \lambda_n$.

9It is unique up to a permutation of components, and up to the parameters of the components entering the mixture with 0 mixing coefficients.

(a) Derivatives of LASSO, SCAD, and MCP. (b) Second derivatives of LASSO, SCAD, and MCP.

Figure 5: First (a) and second (b) derivatives of LASSO, SCAD, and MCP penalty functions (γ for MCP is set equal to a for SCAD).

(P4) $\lim_{n\to\infty} \max\left\{|p''_{\lambda_n}(\pi_j)| : \pi_j \neq 0\right\} = 0$.

(P5) $\liminf_{n\to\infty} c_n > 0$, where $c_n \equiv \liminf_{\pi \to 0} p'_{\lambda_n}(\pi)/\lambda_n$.

Next, I discuss under which conditions Assumptions (P1)-(P5) are satisfied for the MCP, SCAD, and LASSO penalties.

P1. All three functions are concave and are twice continuously differentiable at all π > 0 except for π = λn (SCAD) and π = aλn (SCAD and MCP). All three functions are equal to 0 at the origin and are continuous but non-differentiable at 0.

P2. $b_n^{LASSO} = \lambda_n$, and so $b_n^{LASSO} = O_p(n^{-1/2})$ if $\lambda_n = O_p(n^{-1/2})$. For the SCAD and MCP functions, if $\lambda_n \to 0$, then for n large enough, $b_n^{SCAD} = b_n^{MCP} = 0$ (see Figure 5a).

P3. For LASSO, $p_{\lambda_n}(\pi) = \lambda_n\pi$ for all π, so Assumption (P3) is trivially satisfied. For SCAD, $p_{\lambda_n}(\pi) = \lambda_n\pi$ if $\pi \leq \lambda_n$, so the assumption is satisfied. Finally, for MCP, for $\pi \leq \lambda_n$, $p_{\lambda_n}(\pi) = \lambda_n\pi - \pi^2/(2a) = \lambda_n\pi + O_p(1/n)$ if π is $O_p(n^{-1/2})$, which is possible, for example, if $\lambda_n\sqrt{n} \to \infty$.

P4. The LASSO penalty is a linear function, so its second derivative is 0; hence, (P4) is satisfied for LASSO. If $\lambda_n \to 0$, then for n large enough the second derivatives of the SCAD and MCP penalties are 0 (see Figure 5b), so the assumption holds for them too.

P5. Finally, $c_n^{LASSO} = c_n^{SCAD} = c_n^{MCP} = 1$, and so $\liminf_{n\to\infty} c_n > 0$ for all three penalty functions.
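A quick numerical check of this discussion (a sketch, not part of the paper; the value π = 0.3 and the sequence of λn values are arbitrary) evaluates the analytic first derivatives of SCAD and MCP at a fixed non-zero mixing coefficient: they become exactly 0 once λn is small enough, consistent with (P2) and (P4), while the LASSO derivative remains equal to λn.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative of the SCAD penalty at t > 0."""
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1) * (t > lam)

def mcp_deriv(t, lam, gamma=3.7):
    """First derivative of the MCP penalty at t > 0."""
    return np.maximum(lam - t / gamma, 0.0)

pi = 0.3  # a fixed non-zero mixing coefficient
for lam in [0.5, 0.1, 0.05, 0.01]:
    print(f"lambda_n={lam}: SCAD'={scad_deriv(pi, lam):.4f}, "
          f"MCP'={mcp_deriv(pi, lam):.4f}, LASSO'={lam}")
```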

4.3 Existence of the OSCE

This subsection presents the result about the existence of the OSCE, defined as a $\sqrt{n}$-consistent local maximizer of the penalized likelihood function. As the parsimonious model is represented in the parameter space by a subset rather than a point, consistency should be understood as the convergence of the Euclidean distance between this local maximizer and the parsimonious subset to zero (in probability). The concept of $\sqrt{n}$-consistency strengthens the consistency property by requiring the Euclidean distance between the estimator and the parsimonious subset to be of order $O_p(n^{-1/2})$.

Theorem 1 (Existence of a $\sqrt{n}$-consistent local maximizer). Under the regularity conditions (M1)-(M7) and (P1)-(P5), if additionally $\lambda_n\sqrt{n} \to \infty$, then there exists a local maximizer $\hat\psi = (\hat\theta, \hat\pi)$ of $Q_n(X; \tilde\psi)$, such that there exists a point $\psi_0(\hat\psi) \in \Psi_0$ for which $\|\hat\psi - \psi_0(\hat\psi)\| = O_p(n^{-1/2})$.

It is worth pointing out that Theorem 1 states the existence of a local maximizer of the penalized likelihood function $Q_n(X; \psi)$ that converges to the parsimonious subset of the identified set $\Psi_0$ (as opposed to the result of Feng and McCulloch [1996], who establish the existence of a local maximizer of the non-penalized likelihood function in the neighborhood of the whole identified set $\tilde\Psi_0$). Moreover, the distance between this maximizer and the parsimonious subset of the identified set is shown to be of order $n^{-1/2}$.

While this theorem does not provide tools to make inference on the model parameters, it is of central importance. First, consistency of the estimator is an important property per se. Second, this theorem states that $\hat\psi$ lies in a neighborhood of the parsimonious subset that has a diameter proportional to $1/\sqrt{n}$. Establishing this rate, at which the neighborhood shrinks to 0, is an important preparation step for deriving the asymptotic distribution of the model parameters.

In general, this theorem is in line with similar statements in the existing literature on penalized estimation that was mentioned before. An important difference lies in the additional requirement that λn go to 0 slower than $n^{-1/2}$. Although this condition is used in the literature, it is always required only for deriving results about the asymptotic distribution of the model parameters. The reason I require this condition to hold even for the consistency result is the lack of point-identification of the mixture model parameters. While Appendix B.2 presents the formal proof of Theorem 1, below I provide its intuition, which shows the key challenges in proving the result, as well as the role that this additional assumption plays.

When the model parameters are point-identified, the proof of existence of a local maximizer in an $n^{-1/2}$-neighborhood of the truth comprises the following steps (e.g. see Lehmann and Casella [1998]): (1) an $n^{-1/2}$-neighborhood of the truth is chosen; (2) the value of the objective function at a point from the boundary of this neighborhood is compared to its value at the truth, by taking a Taylor expansion of the difference around the truth; (3) it is shown that this difference is strictly negative, because positive-definiteness of the information matrix implies that the second-order term of the expansion is strictly negative and dominates all other terms; (4) it is concluded that the objective function has a local maximum in the neighborhood, since the former inequality can be shown to hold uniformly for all points on the boundary. When parameters are point-identified and penalized maximum likelihood is considered (e.g., Fan and Li [2001]), Step (3) involves showing that the second-order term of the log-likelihood expansion also dominates the terms coming from the expansion of the difference in penalties.

In this paper the parameters of the model are not point-identified, and so several adjustments have to be made. First, in Step (1) I consider a neighborhood of the parsimonious subset. Recall that the objective function was chosen in such a way that all points in the parsimonious subset yield the same value of the objective function. Therefore, the objective function evaluated at the boundary of this neighborhood can be compared to its value at any point in the parsimonious set. In Step (2), I take the expansion of the difference between the penalized log-likelihood at a point from the boundary and at the point in the parsimonious subset that is closest to it. While the information matrix evaluated at any point from the parsimonious subset is only positive semidefinite, Assumption (M7) guarantees that it is strictly positive along the particular direction chosen for the expansion. These adjustments are in the spirit of Feng and McCulloch [1996], with the difference that they considered non-penalized estimation and showed convergence to the entire identified set.

Although the adjustments mentioned in the previous paragraph are important, their necessity has been recognized before. The novel challenge that Theorem 1 faces lies in connecting Step (4) to Step (3). Step (4) requires that the dominance of the second-order term of the expansion over all other terms is uniform across all points from the boundary. Even though Assumption (M7) guarantees it pointwise, uniformity does not follow. The reason is that there are boundary points of the neighborhood of the parsimonious subset that lie in other parts of the identified set. Consequently, the information matrix at the points closest to them will be 0 along the chosen direction (see Appendix A for details). Nevertheless, I show that all the points from the boundary belong to one of two sets: a set in which the difference between the objective function evaluated at any point from this set and at the parsimonious subset is negative and uniformly bounded away from 0 (due to the dominance of the second-order term); and a set for which the desired result follows from the fact that the difference in the penalties is negative and large enough to dominate all other terms. In the latter set, the penalty term plays a crucial role. This is in stark contrast with the prior literature on penalized estimation, which, for the purpose of establishing consistency of the estimator, only needs to show that the penalty does not excessively distort the likelihood part. In my case, the additional assumption on the rate of the tuning parameter λn is necessary to guarantee that the penalty is large enough to dominate the rest. Again, the difference in approaches is caused by the presence of points in the neighborhood of the parsimonious model at which the value of the likelihood function is the same as on the parsimonious subset. This is why the penalty term plays a larger role in establishing this result in my case than in the case when the parameters are point-identified. An illustration of my approach to proving Theorem 1 can be found in Appendix A.1.10

4.4 Order selection consistency and oracle efficiency

Note that Theorem 1 only states that the OSCE converges to the parsimonious subset of the identified set. In other words, the identifiable parameters of the parsimonious model are estimated consistently: $(\hat\theta_1, \hat\pi_1) \xrightarrow{p} \varphi$ and $\hat\pi_2 \xrightarrow{p} 0$. This, however, does not imply that K, the true order of the mixture, is estimated consistently. For instance, in the setting of Example 1, all points in the neighborhood of the parsimonious subset (apart from those having π̂2 exactly equal to 0) correspond to a 2-component density. Even though, as the sample size increases, the estimator gets closer to the parsimonious subset, the estimated mixture can still have both mixing coefficients positive, and therefore the order of the mixture will not be estimated consistently.

10Aragam and Zhou [2014] use the penalized likelihood approach to estimate a Bayesian network from observational data. While there is more than one network consistent with the data (i.e., parameters are not point-identified), they show that the penalized maximum likelihood estimator is consistent for the sparsest network (i.e., the one with the smallest number of edges). There is an interesting analogy with my result, as the OSCE is an estimator that converges to the parsimonious subset. Nonetheless, an important distinction from my paper is that in Aragam and Zhou [2014] all the points from the identified set are isolated from each other (i.e., the parameter vector is locally identified). This allows them to prove the existence of a local maximizer of the penalized likelihood function in a neighborhood of each such point following standard arguments.

The following theorem states that situations like the one described do not happen: with probabil- ity arbitrarily close to one, there exists a sample size large enough such that the OSCE corresponds to a mixture that has exactly K non-zero mixing coefficients. This fact allows to derive the asymp- totic distribution for other identifiable parameters of the parsimonious model. As a matter of fact, the OSCE achieves the same asymptotic efficiency for (θˆ1, πˆ1) as would have been achieved by ϕˆ, an efficient consistent local maximizer of the (non-penalized) log-likelihood function in the case with the known order of the mixture, K. The latter property is what is referred to in the literature as the oracle efficiency.

Theorem 2 (Oracle efficiency and model selection consistency). Under the same assumption as √ in Theorem1 (i.e., (M1)-(M7), (P2)-(P5), and λn n → +∞), with probability tending to 1 the

−1/2 n -consistent local maximizer ψˆ ≡ (θˆ1, θˆ2, πˆ1, πˆ2) from Theorem1 must satisfy:

1. (Order selection consistency) πˆ2 = 0.

2. (Oracle efficiency)

√ n h ˆ i o d n (I(ϕ) + Σλn ) (θ1, πˆ1) − ϕ + bλn → N (0, I(ϕ)) , where

• I(ϕ) is the information matrix for the model with known K;

• Σ = diag 0, ..., 0, p00 (π ) , ..., p00 (π ) (with dimensionality d × d ); λn λn 2 λn K ϕ ϕ

• b = 0, ..., 0, p0 (π ) , ..., p0 (π ) (with dimensionality 1 × d ). λn λn 2 λn K ϕ

Note that Assumption (P2) implies that bλn → 0, and Assumption (P4) ensures that Σλn converges to the zero matrix. The latter implies that the estimator of (θ1, π1) is as asymptotically efficient as an efficient estimator of parameters of the model with known K.

As was discussed in the end of Section 4.2, for SCAD and MCP (P2) and (P4) are satisfied whenever λn goes to 0. Moreover, under this condition, for large enough sample size, bλn and

28 Σλn are exactly zeros. So, in addition to oracle efficiency, these two penalty functions produce asymptotically unbiased estimators.

The LASSO penalty, on the other hand, does not satisfy the conditions of either theorem, as √ Assumption (P2) for LASSO requires λn n to be bounded in probability, which contradicts the √ condition that λn n → ∞. To sum up, the part of the OSCE that corresponds to the estimators of zero mixing coefficients

is superefficient (specifically, it becomes exactly equal to 0 for large n), which allows to consistently

estimate the true order of the mixture and achieve oracle efficiency for the other identified parameters

of the parsimonious model. Moreover, having the degenerate distribution for estimators of the zero

mixing coefficients allows to avoid the parameter-on-the-boundary problem, as the other identifiable

parameters of the parsimonious model, (θ1, π1), lie inside the corresponding parameter space.

4.5 Growing K¯

Assumption (M1) might be considered too restrictive, as a researcher might want to include more

components into the estimated model for larger sample sizes. Assumption (M1) can be weakened

in the following sense:

¯ ¯ ¯ 4 (M1’) Let Kn depend on the sample size, such that Kn → ∞, Kn/n → 0.

By strengthening some of the conditions in Section 4.1.1 (see Appendix B.4), the following result

can be shown:

0 p Theorem 1 (Existence of a n/K¯n−consistent local maximizer). Under the Assumptions (M1’), p (M2’),(M3), (M4’)-(M6’), (M7), and (P1’)-(P5’), if additionally λn n/K¯n → ∞, then there exists ˆ  ˆ  ˆ a local maximizer ψn = θn, πˆn of Qn(X; ψen), such that there exists a point ψn,0(ψn) ∈ Ψn,0 for 1/2 which kψˆn − ψn,0(ψˆn)k = Op (K¯n/n) .

Since K is fixed for every n, the dimension of ϕ does not change, and hence, a statement similar

to Theorem2 can be stated:

Theorem 20 (Oracle efficiency and model selection consistency). Under the same assumption

0 p as in Theorem1 , with probability tending to 1 the n/K¯n-consistent local maximizer ψˆn ≡

0 (θˆ1, θˆn,2, πˆ1, πˆn,2) from Theorem1 must satisfy:

29 1. (Order selection consistency) πˆ2,n = 0.

2. (Oracle efficiency)

√ n h ˆ i o d n (I(ϕ) + Σλn ) (θ1, πˆ1) − ϕ + bλn → N (0, I(ϕ)) , where

• I(ϕ) is the information matrix for the model with known K;

• Σ = diag 0, ..., 0, p00 (π ) , ..., p00 (π ) (with dimensionality d × d ); λn λn 2 λn K ϕ ϕ

• b = 0, ..., 0, p0 (π ) , ..., p0 (π ) (with dimensionality 1 × d ). λn λn 2 λn K ϕ

Extending the results to the case when the true order K also changes with the sample size is interesting, but not straightforward. The reason is that the existing literature assumes that the smallest eigenvalue of the information matrix of the true model is bounded away from zero uniformly over all n. This does not hold in the setting of mixtures, as when K grows to infinity, the smallest non-zero mixing coefficient must go to 0, which forces the smallest eigenvalue of the information matrix to go to 0 as well (this is due to the fact that some of the diagonal elements of the information matrix are proportional to the square of the mixing coefficients, see Lemma1).

5 Empirical performance of the method

5.1 Choice of the tuning parameter λn

One of the crucial conditions in Theorem1 and2 puts a restriction on the rate at which the tuning parameter λn goes to 0. This restriction, however, does not specify the constant that should multiply that rate. Hence, an important question is how to select λn in practice. In the literature,

λn is usually chosen as the optimizer of some criterion, such as cross-validation, generalized cross- validation (GCV), etc. As Wang et al.[2007] show, λn that minimizes GCV criterion for regressor selection problem in the context of linear models leads to overselection of the regressors (with

11 positive probability), and suggest to pick λn by minimizing a BIC-type criterion instead. I follow

11It is called BIC-type criterion after BIC – the Bayesian Information Criterion, also known as Schwarz information criterion. The reason lies in the rate at which the second term in the criterion function goes to 0 was the sample size

30 their suggestion and adapt their BIC-type criterion for the current setting.

The BIC criterion function for the penalized likelihood estimation of mixture models is defined as n 1 X 1 log n BIC(λ) = log g(x; ψˆ(λ)) − Kˆ (λ) , (5) n 2 n i=1

where Kˆ (λ) denotes the number of non-zero mixing coefficients in the ψˆ(λ), which is defined as

 K¯  ˆ  ˜ X ψ(λ) = argmaxψ˜∈Ψ Ln X; ψ − n pλ(πk) . k=2

Throughout the simulations and the application, I maximize BIC(·) with respect to λ in order to determine which value of the tuning parameter should be selected, i.e.

BIC λn = argmaxλ BIC(λ).

BIC Appendix B.5 provides a formal justification as to why this procedure results in λn that satisfies the rate condition from Theorem1 and2.

5.2 The modified EM algorithm

Finite mixture models are usually estimated via the EM algorithm (for a description see Hastie,

Tibshirani, and Friedman[2001]). It is constructed to accommodate maximization of the log- likelihood function. I, instead, need to provide a way to implement a maximization of the penalized log-likelihood function. For that, I propose a modified version of this algorithm, which is described in details below.

n In my setting I observe a sample X = {xi}i=1, that potentially comes from K models, so that xi is sampled from a distribution with pdf f(x; θk) with probability πk, k = 1, ..., K and, as before, PK π1 = 1 − k=2 πk. The log-likelihood contains log of a weighted sum of densities and hence the closed form solutions for the optimal values of the parameters is not available, which makes the direct maximization of the objective function quite difficult.

Following the standard steps of the EM algorithm for the finite mixtures models, denote by γik increases. BIC criterion is characterized by log n/n rate, which Wang et al.[2007] suggest to use for the problem of −1 determining λn. The GCV criterion has a similar structure with the rate of the second term being instead n .

31 probability that observation xi comes from a model k (given value of parameters ψ). Then the modified EM algorithm is described by the following steps:

1. Take initial guess for ψ(0).

2. On each lth step do:

(l) • E-step: compute γik for all i = 1, ..., n and k = 1, ..., K using the previous step values of the parameters ψ(l): π(l−1)f(x , θ(l−1)) γ(l) = k i k ik PK (l−1) (l−1) j=1 πj f(xi, θj )

(l) PK (l) Notice, that γi1 = 1 − k=2 γik .

• M-step: update the values of the parameters by solving the following system of the first

order conditions:

– For parameters θk:

n 0 (l) X (l) f (xi, θ ) γ k = 0, k = 1, ..., K ik (l) i=1 f(xi, θk )

– For parameters πk:

n (l) Pn (l) (l) 1 P γ 1 (1 − γ − ... − γ ) i=1 ik − i=1 i2 i,K − p0 (π(l)) = 0, k = 2, ..., K. n (l) n (l) (l) λ k πk 1 − π2 − ... − πK 3. Iterate Steps 2 until convergence.

The system of equations for updating values of the mixing coefficients πk does not have a simple solution, as opposed to the case when a non-penalized log-likelihood function is maximized. In that (l) Pn (l) case the updated values of the mixing coefficients are simply πk = i=1 γik /n. To solve the system for penalized log-likelihood optimization, I follow the suggestion by Fan and

Li[2001]: the penalty function is approximated by a quadratic polynomial when πk 6= 0, and then the Newton-Raphson algorithm can be implemented to find the solution of the system of equations.

It should be pointed out, that if during some iteration some πk becomes small enough (i.e.

−6 πk < 10 ), its value at the end of this iterations is set up to 0. This is important part of using a quadratic approximation to the penalty function. There are several papers that suggest other

32 ways of approximating the penalty function to get a faster algorithms (see, for example, Zou and

Li[2008] and Zhang[2010b]), but I stick to the reported one as it has performed quite well in the simulations.

Finally, it can be noticed, that one way to think of penalty functions is as of priors on the corresponding coefficients. The SCAD penalty corresponds to an improper prior (although it is not a traditional Bayesian prior, as it depends on λn, which changes with the sample size). This way the proposed modified EM-algorithm falls into the general framework of EM algorithms and thus should converge to a maximizer of the objective function.

5.3 Simulations

This section present empirical study of the performance of the proposed method, and compares it to those of the existing methods. In first subsection I estimate 10 mixture models from Ishwaran et al.[2001] and Chen and Khalili[2008], in the second subsection I investigate how well the method performs when there are many components relative to the sample size, and in the last subsection I look at models with multidimensional parameters of the components.

In simulations I used both the MCP and the SCAD penalties, and with the results being almost identical, I report in the tables only those obtained under the MCP. The tuning parameter λn was picked by minimizing BIC2,λ.

5.3.1 Settings from Ishwaran et al.[2001] and Chen and Khalili[2008]

In this section I performed simulations to compare the performance of the proposed method to those analyzed in Chen and Khalili[2008]. I ran the simulations for ten models of univariate mixtures with the same but unknown variance. In each setting a mixture of 15 normal components as estimated with unknown means and an unknown but same for all the components variance.

Table1 summarizes information about dgp and sample sizes used in these simulations, whereas

Table2 shows selection probabilities for the best method among those considered in Chen and

Khalili[2008], the method proposed by Chen and Khalili[2008] and the method proposed in this paper (OSCE). The former was not necessarily the same method for all the settings, and this is mentioned in Table2. The details on these methods can be found in Chen and Khalili[2008].

Recall that all the results before were derived under the standard asymptotics when n goes to

33 infinity. In other words, I assumed that the true values of the component parameters are bounded away from each other, and the true values of the mixing coefficients of the components that are present in the mixture are bounded away from 0. However, in some of the settings from Chen and

Khalili[2008] for given small sample sizes the derived approximations might not work very well.

For example, while for Models 1 and 2 OSCE (as well as the method from Chen and Khalili[2008]) performs very well and determines the correct order in 90% of all the simulations, for Model 3 its performance is much worse. However, as can be seen in Table3, as the sample size increases the proposed method selects the correct order with higher and higher probability.

In other models OSCE sometimes performs considerably better than the one from Chen and

Khalili[2008] (e.g. Models 5 and 9), sometimes worse (e.g. Models 6, 8, 10, which for the small sample size used in simulations seem to be close to the non-identification region). Also, the fact that

Chen and Khalili[2008] picks larger number of components in difficult situations might be partially explained by the fact that they pick the tuning parameter λn as the one minimizing a cross-validation criterion, however, as was shown in Wang et al.[2007], this leads to an overselection of components with positive probability.

Table 1: Models of mixtures of univariate normal components (each with variance 1)

θ0 π0 Model Sample size 0 0 0 0 0 0 0 0 0 0 0 0 0 0 θ1 θ2 θ3 θ4 θ5 θ6 θ7 π1 π2 π3 π4 π5 π6 π7 1 100 0 3 1/3 2/3 2 100 0 3 1/2 1/2 3 100 0 1.8 1/2 1/2 4 100 0 3 6 9 1/4 1/4 1/4 1/4 5 100 0 1.5 3 6 1/4 1/4 1/4 1/4 6 100 0 1.5 3 4.5 1/4 1/4 1/4 1/4 7 400 0 3 6 9 12 15 18 1/7 1/7 1/7 1/7 1/7 1/7 1/7 8 400 0 1.5 3 4.5 6 7.5 9 1/7 1/7 1/7 1/7 1/7 1/7 1/7 9 400 0 1.5 3 4.5 6 9.5 12.5 1/7 1/7 1/7 1/7 1/7 1/7 1/7 10 400 0 1.5 3 4.5 9 10.5 12 1/7 1/7 1/7 1/7 1/7 1/7 1/7

5.3.2 Simulations with large number of components

In this section I estimated a mixture with large number of components (100) when the data was generated from a mixture of 50 normal components with variances 1 and means ranging from 0 to

147. The sample size was picked to be equal to 100 (as many as the number of components and

34 Table 2: Estimation of simulated models of mixtures of homogeneous univariate normal components

Model Kˆ Best from KC KC OSCE Model Kˆ Best from KC KC OSCE 1 0.02∗ 0.06 0.02 3 0.00∗ 0.00 0.00 1 2 0.92∗ 0.90 0.91 4 0.30∗ 0.00 0.00 3 0.06∗ 0.04 0.07 7 5 0.21∗ 0.00 0.00 1 0.08∗ 0.03 0.03 6 0.10∗ 0.07 0.00 2 2 0.92∗ 0.90 0.91 7 0.33∗ 0.85 0.97 3 0.00∗ 0.07 0.06 2 0.03∗∗ 0.00 0.13 1 0.11∗∗ 0.28 0.68 3 0.13∗∗ 0.00 0.15 3 2 0.57∗∗ 0.63 0.30 8 4 0.32∗∗ 0.01 0.43 3 0.19∗∗ 0.09 0.02 5 0.38∗∗ 0.18 0.18 ∗∗ * – GWCR, a bayesian approach from Ishwaran et al.[2001] 6 0.13 0.50 0.08 ∗∗ ** – lassoing cdf method, see Chen and Khalili[2008] for details 7 0.01 0.27 0.03 3 0.14∗ 0.00 0.01 Model Kˆ Best from KC∗ KC OSCE 4 0.46∗ 0.01 0.07 1 0.00 0.00 0.09 9 5 0.31∗ 0.40 0.11 2 0.06 0.00 0.00 6 0.05∗ 0.51 0.30 4 3 0.17 0.10 0.09 7 0.02∗ 0.08 0.50 4 0.70 0.82 0.82 2 0.02∗∗∗ 0.06 0.01 5 0.06 0.07 0.00 3 0.37∗∗∗ 0.06 0.02 1 0.00 0.00 0.20 10 4 0.47∗∗∗ 0.06 0.64 2 0.29 0.07 0.11 5 0.13∗∗∗ 0.51 0.10 5 3 0.25 0.60 0.20 6 0.01∗∗∗ 0.23 0.11 4 0.31 0.31 0.48 7 0.01∗∗∗ 0.00 0.07 5 0.14 0.01 0.00 * – AIC, see Ishwaran et al.[2001] 1 0.00 0.00 0.80 ** – KL, see James et al.[2001] 2 0.15 0.05 0.02 *** – GWCR, see Ishwaran et al.[2001] 6 3 0.50 0.65 0.1 4 0.27 0.28 0.05 5 0.07 0.01 0.02 * – KL, see James et al.[2001]

Order selection probabilities in simulations of models from Table1. Columns 3 and 4 are taken from Chen and Khalili [2008], and Column 4 corresponds to the method proposed by Chen and Khalili[2008], while Column 3 – to the best method among others considered in Chen and Khalili[2008] for the particular setting. Column 5 is the OSCE. In bold is the true order of the mixtures, and the highest selection probability for different methods. The last column is based on 1000 simulations.

35 Table 3: Estimating a mixture of two normal components (Model 3) for different sample sizes

Sample size 100 200 400 1000 2000 Kˆ 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Frequency 0.68 0.30 0.02 0.61 0.37 0.01 0.49 0.50 0.01 0.17 0.82 0.01 0.00 1.00 0.00

Model 3 (Table1) was estimated using OSCE for different sample sizes. In bold is the true number of components and the number of components selected for each sample size most often. smaller than the number of parameters in the estimated model), 500, and 1000. The results are in

Table4.

While for n = 100, the method underselects the number of components (with 43 being the most frequent choice), for n = 500 the correct number of components is selected in already 53% of the simulations (with another 33% resulting in the selection of 49 components), and for n = 1000 the correct number of components is picked in 96% of the estimated samples.

Table 4: Estimating a mixture of many homogeneous normal components for different sample sizes

Sample size 100 500 1000 Kˆ 39 40 41 42 43 44 45 46 47 48 49 50 51 49 50 Frequency 0.08 0.10 0.12 0.13 0.14 0.12 0.10 0.08 0.02 0.09 0.33 0.53 0.02 0.04 0.96

The data was generated from a mixture of 50 normal components with variance 1 and means 0, 3, 6, ..., 147. A mixture of 100 homogeneous normal components with unknown variance was estimated using OSCE. Reported are the frequencies of the numbers of components selected in simulations for different sample sizes.

5.4 Simulations for the case of multidimensional component parameters

As was mentioned in the beginning, the method from Chen and Khalili[2008] is developed for the situations in which θk are one-dimensional. OSCE, on the other hand, does not restrict θk (apart from asking it to have finite dimensionality). In this section I estimated a few models of mixtures of normals with both means and variances being unknown and without any restrictions placed on them (in Subsection 5.3.1 the variances of the components were unknown, but it was known they all were equal to each other, hence, it was easy to fit those settings in the framework from Chen and Khalili[2008]). The setting from Table5 cannot be accommodated into theirs framework, as it includes estimation of mixture models with different means and different variances. OSCE, however, performs very well.

36 Table 5: Estimation of simulated mixtures of normals with different means and variances

Mixing coefficients Means Variances 2 2 π1 π2 µ1 µ2 σ1 σ2 True values, ψ 0.500 0.500 3 0 0.5 2 Average values across simulations 0.503 0.497 2.987 −0.031 0.498 1.932 St.dev.’s across simulations 0.066 0.066 0.110 0.284 0.124 0.509

The data was generated from a mixture of two normal components, with true values of the parameters given in Row 1. A mixture of 15 components with unknown means and variances was estimated. OSCE picked 2 components in 100% of the simulations. The sample size was 200. Total number of simulations was 1000. Here πk 2 are the mixing coefficients and θk = (µk, σk) are component parameters.

6 Empirical Application

In this section I consider the experiment in Cornand and Heinemann[2014], in which subjects played a coordination game in the spirit of Morris and Shin[2002]. The two-person game studied by Cornand and Heinemann features incomplete information: there is an unknown state of nature, and each player receives both a private and a public signal about it (signals are of equal precision).

A player’s payoff depends both on how far his action is from the true state and on how much it differs from the opponent’s action. A player’s equilibrium strategy in this game is a particular linear combination of the public and his private signal. In the absence of coordination motives, the equilibrium strategy would prescribe equal weights on both signals. Instead, in the presence of coordination motives, the weight on the public signal is larger than 1/2. This occurs because the public signal is a better predictor of the other player’s action than the private signal. Morris and

Shin[2002] argued that this might cause welfare losses, and hence, in some situations, it might be undesirable to increase the precision of the public signal.

However, the evidence from the experiment in Cornand and Heinemann[2014] suggests that the subjects do not play the equilibrium strategy, as the weight they put on the public signal, on average, is not as high as theory predicts. Therefore, the coordination motive might not lead to adverse effects on social welfare. However, since social welfare is a non-linear function of the individual weights on the public signal, knowledge of the average weight does not suffice to predict the welfare effects of changes in the precision of public information: the researcher needs to know the exact composition of types in the population. On these grounds, I extend the analysis of Cornand and Heinemann by assuming that the subjects were heterogeneous in their level of rationality and,

37 by means of the OSCE, I estimate the composition of types.12 In the first subsection below I describe the setting in Cornand and Heinemann in greater detail. Then, the second subsection covers the estimation procedure and the findings of my analysis.

6.1 The experiment of Cornand and Heinemann[2014]

The subjects in Cornand and Heinemann played the following two-player game. There is a state of nature θ, unknown to players, which is the realization of a random variable distributed uniformly on some interval [θ, θ] (the interval is the common knowledge among players). Player i observes two noisy signals about θ: (1) his private signal xi; and (2) a public signal y. Conditional on θ, all signals xi, xj, and y are drawn independently from a uniform distribution with support [θ −ε, θ +ε]. Player i’s ex-post payoff is given by

2 2 ui(ai, θ) = U0 − (1 − r)(ai − θ) − r(ai − aj) ,

where ai and aj are the actions of the player and his counterpart, respectively (and U0 is a constant which ensures that payoffs are nonnegative). Thus, the payoff of each player depends both on how close his action is to the true state of nature θ and on the distance from his counterpart’s action aj. A player’s equilibrium action is a convex combination of the two signals he observes:

∗ ∗ ∗ 1−r ai = γ xi + (1 − γ )y, where γ = 2−r . The experiment in Cornand and Heinemann was conducted for three different values of r, the parameter that controls the relative importance of the coordination motive. There were 18 sessions in total, with 6 sessions for each of the following three values of r: 0.25, 0.5, 0.75. There were 16 participants in each session, and in each round they were randomly (and anonymously) split into

8 pairs. Each session consisted of 30 rounds, and in every round a new realization of θ was drawn and new signals were realized. At the end of each round, players were informed about the value of

θ in that round, all the signals, the other player’s action, and their payoffs. The interval [θ, θ] was equal to [50, 450], and the signal precision parameter ε was set equal to 10.

Cornand and Heinemann estimated the following regression: ait − yt = c + γ(xit − yt) + uit. Their main finding is that, while the subjects put a larger weight on the public signal than on the

12I downloaded the data from the experiment in Cornand and Heinemann[2014] from Frank Heinemann’s webpage: http://www.macroeconomics.tu-berlin.de/fileadmin/fg124/heinemann/downloads/online_appendix_ch_3.zip.

38 private one, the weight of the public signal was still smaller than the equilibrium weight.

There are several theories that explain these sorts of deviations from equilibrium strategies. One of them is level-k depth of reasoning, which was introduced in Nagel[1995] for a beauty contest game. Cornand and Heinemann appeal to this theory to explain the difference between the weight on the public signal that the equilibrium would predict and the actual weight that, on average, subjects put on it.

The logic behind the theory of level-k rationality is that in games where playing the Nash equilib- rium requires infinitely many iterations of reasoning, players might actually interrupt their reasoning process at some iteration. By considering actions of players with limited levels of reasoning, the level-k model predicts how players’ strategies depend on their levels of rationality. More precisely, level-0 players pick their action randomly and uniformly from the set of all possible actions; level-1 players believe that they play against level-0 players, and choose a best response according to this belief; more generally, a level-k player’s action is the best response to an action by a level-(k − 1) opponent.13 The Nash equilibrium strategy corresponds to the limit of the level-k strategy as k goes to ∞.

Cornand and Heinemann found that the weight that subjects assigned to the public signal, on average, corresponds roughly to level-2 reasoning. In this paper, I extend their analysis by estimating the proportions of each type of players. Indeed, this level-k model can be analyzed with the help of a finite mixture model, in which different components correspond to players of different levels of rationality. Observe that, since not necessarily all types of players were present in the sample, the order of this mixture is unknown a priori.

6.2 Estimation of a mixture model using the data from Cornand and Heine- mann[2014]

The weights that a player assign to the private and public signals depend on his level of rationality.

Table6 summarizes how the weight on the private signal varies with the level of rationality and the value of the coordination parameter r.

13In an alternative model of bounded rationality, Camerer, Ho, and Chong[2004] allow level- k players to believe that their opponents’ types are distributed over level-0 through level-(k − 1). While this model has certain appealing features (e.g., the beliefs about the distribution of their opponents formed by more sophisticated player are closer to the true distribution of types), it adds additional layers of complexity that are unnecessary for the scope of my analysis.

39 Table 6: γk for different values of r and k.

r γ1 γ2 γ3 γ4 γ∞ 0.25 0.5 0.4375 0.4297 0.4287 0.4286 0.5 0.5 0.375 0.344 0.336 1/3 0.75 0.5 0.312 0.242 0.216 0.2

The weight on the private signal (γk) as a function of r, the relative strength of the coordination motive, and the level of rationality k.

I use the OSCE to estimate the proportions of players of each level of rationality that were present in the sample. I model the data generating process the following way: for a given value of r, conditional on the signals he receives, player i, if he is of level-k, chooses action ai = γk(r)xi +

(1 − γk(r))y + uik. Here, the error term uik accounts for all possible errors that players make while computing their optimal response (i.e. rounding errors, miscalculation errors, etc.). I assume that

2 uik is drawn from a normal distribution with mean 0 and the variance σk. The variances of the different components are not restricted to be the same in order to capture potential differences in the size of an error a player makes depending on his rationality level.

Since the level of rationality of each particular subject is unknown, the distribution of ui – the

2 error that player i makes – can be modeled as a finite mixture. Let φ(·; 0, σk) denote the density of a normal distribution with mean 0 and variance σk. Then, the log-likelihood of the data can be written as

n  K¯  n  K¯  X X X X 2 Ln(Z; ψ) = log  πkf(zi; θk) = log  πkφ ai − yi − γk(xi − yi); 0, σk  . i=1 k=1 i=1 k=1

n 2 2 Here Z = {(ai, xi, yi)}i=1, and ψ = (π, θ), where θ = (σ1, ..., σK¯ ), since γk are known (see Table6).

In my main analysis I use only the last round of each session. This is due to the fact that the players’ choices most likely are not independent across the rounds of play. Therefore, the assumption that uit are independent both across the players and rounds might be unrealistic. I consider the last round, because all the learning should have taken place by the last round. Consequently, it seems natural to assume that the last round is best at reflecting each player’s level of rationality.

I perform the main analysis for r = 0.5 and r = 0.75 only. The reason for this can be seen in

40 the first line of Table6: the difference between γk for k = 3, 4, and infinity is rather small and is unlikely to be reliably detected in the sample of n = 96 observations. For the remaining two values of r, I set the value of K¯ equal to 5 (also due to the fact that the difference between the strategies of different types for large values of k becomes very small). Specifically, I included players with levels of rationality from 1 to 4 and players with an infinite level of reasoning, that is, those who play the

Nash equilibrium. The tuning parameter λn was picked based on the BIC-type criterion introduced in (5).

My method can handle situations both in which it is unknown how many different types are in the population, and in which it is unknown a priori which types are there (for example, when there is only level-1 type and level-4 type). This is in contrast with the existing literature on estimation of mixture models in games where players are potentially heterogeneous in their rationality level

(see e.g., Stahl and Wilson[1995], Bosch-Domènech et al.[2010], Ho, Camerer, and Weigelt[1998],

Camerer et al.[2004]).

Table7 summarizes my findings. First, not all types were present in the sample. For r = 0.5 the subjects were of two types: level-1 (around 64% of participants) and level-∞ (around 36%). For r = 0.75, which represents a larger relative strength of the coordination motive, only around 17% of participants were level-1, and 83% were level-2.

This is an interesting finding that is consistent with the results in Alaoui and Penta[2015]. That paper allows the players’ depth of reasoning to be endogenously determined by the parameters of the game. Their experimental evidence suggests that players indeed increase their level of rationality when the benefits from doing so increase relative to the cost. Even though in the data I use in this paper the subjects of Cornand and Heinemann’s experiment who took part in sessions with r = 0.5 and sessions with r = 0.75 were different people, assuming that both samples were representative of the whole population allows me to conclude that my findings are in line with Alaoui and Penta

[2015].

2 Finally, the estimates of σk, the variance of the error that a level-k type makes, increase with k, which corroborates the idea that it is likely harder to compute the strategy corresponding to a higher level of rationality (for instance, the strategy of a level-1 player is simply the arithmetic average of his two signals).

AppendixC provides a robustness check (e.g., the error is modeled to follow a Laplace distribu-

41 tion instead of a normal).

Table 7: Estimation of the game using the last round for different values of r

2 2 2 2 2 r γ1 πˆ1 σˆ1 γ2 πˆ2 σˆ2 γ3 πˆ3 σˆ3 γ4 πˆ4 σˆ4 γ∞ πˆ∞ σˆ∞ 0.5 0.5 0.64 0.25 0.375 0 - 0.344 0 - 0.336 0 - 1/3 0.36 4.67 0.75 0.5 0.17 0.016 0.312 0.83 4.74 0.242 0 - 0.216 0 - 0.2 0 -

7 Concluding remarks

This paper introduces the Order-Selection-Consistent Estimator, defined as a local maximizer of a penalized likelihood function. I showed that the OSCE consistently estimates the true order of the mixture as well as the identified parameters of the parsimonious model. Moreover, it achieves the oracle efficiency for non-zero mixing coefficients and the parameters of the corresponding compo- nents.

This paper also extends the existing literature on penalized estimation, by suggesting how a penalty function can be used to discriminate different parts of the identified set. It also shows how the results on consistency of a local maximizer of the likelihood function can be established when the model parameters are not point-identified.

While almost all the results in this paper were derived under standard asymptotic approxima- tions (i.e., the sample size going to infinity), the analysis leaves unanswered the question about how well the proposed estimator behaves in finite samples. The existing literature documents issues with uniformity of penalized estimators in other contexts (e.g., Pötscher and Leeb[2009]). Another potential problem is coming from components of the mixture being close to each other. While some simulations suggest that in this situation the OSCE may incorrectly determine the number of components in small samples, the question remains of whether this has adverse effects on the estimation of other components. Another situation worth investigating is the one in which the true order of the mixture K grows to infinity.

42 π2 π2

ε ε

ψ0 ψ00 0 00 π2 π2 ψψ0 ψψ00 0 0 0 0 0 θ2 θ1 θ2 θ1 θ2 (a) (b)

Figure 6: A point from ωε, the neighborhood of Ψ0, that (a) does not belong to the identified set Ψe 0, or (b) belongs to Ψe 0, and the point in Ψ0 that is the closest to it.

Appendix

A Discussion of Assumption (M7)

This subsection illustrates Assumption (M7) on the Example1.

0 0 0 0 0 Example 1 (continued). Consider a point ψ = (θ2, π2) inside the neighborhood ωε. Let θ2 6= θ1. 0 0 Then ψ 6∈ Ψe 0 (see Figure 6a), and the condition from Assumption (M7) has to be satisfied at ψ . 0 ψ0 ψ0 0 Indeed, the point in Ψ0 that is the closest to ψ is ψ = (θ2 , 0) = (θ2, 0), and the information matrix at this point is equal to

    0 0 0 0  ψ0  I ψ =   0  =   .  (f(x;θψ )−f(x;θ0))2    0 E 2 1  0 0 2 g(x;ψψ0 )2 0 exp (θ2 − θ1) − 1

While I(ψψ0 ) is only positive semidefinite, it still is positive definite along the direction ψ0 −ψψ0 :

0 ψ0  ψ0  0 ψ0 T 0 2   0 0 2  (ψ − ψ )I ψ (ψ − ψ ) = (π2) exp (θ2 − θ1) − 1 > 0,

0 0 0 since π2 > 0 and θ2 6= θ1. 0 Observe that the only point at which I(·) is not positive definite along any direction is (θ1, 0), 0 0 0 0 00 as at this point I(θ1, 0) = ( 0 0 ) . The point (θ1, 0) is the closest to points ψ in ωε such that 00 0 00 ψ = (θ1, π2 ) (see Figure 6b), that is, only to points that belong to Ψe 0.

43 A.1 Illustration of the key step in the proof of Theorem1

−1/2 This subsection illustrates how the set of the points from ∂ωn, the boundary of the n -neighborhood of Ψ0, can be split into two subsets. To this purpose, consider a slight modification of the setting in Example1. Specifically, assume that now θ1 is not known.

Example 2. The data is generated from h(x; ϕ) = φ(x; θ1, 1), and the estimated model is g(x; ψ) =

(1 − π2)φ(x; θ1, 1) + π2φ(x; θ2, 1), i.e. ψ = (θ1, θ2, π2), so θ1 is not known and is estimated together with θ2 and π2.

Figure7 depicts the parameter space Ψ, the identified set Ψe 0, its parsimonious subset Ψ0, and √ ∂ωn, its neighborhood of radius C/ n.

The information matrix evaluated at all points from Ψ0 has rank 2, with the exception of

0 0 (θ1, θ1, 0), at which the rank of the information matrix is 1. While it seems natural to split the boundary points in two sets based on which point from Ψ0 is the closest to them, it is not the most convenient way to proceed. Instead, I suggest to split all the points based on the direction from them to the corresponding closest point from Ψ0. This idea is presented on Figure8, and is formally described on Step 2 of the proof of Theorem

1 in Appendix B.2 below.

It turns out that for all points in set depicted on Figure 8a the quadratic form defined by the information matrix evaluated at the corresponding point is positive definite along all included direc- tions, and while for the boundary points depicted on Figure 8c this quadratic form might be 0, the penalty term is large enough to ensure that the objective function at those points is smaller than at the parsimonious subset.

B Proofs

B.1 Proof of Proposition1

Proof. The proof is similar to the one in Yakowitz and Spragins[1968].

Assume there exist K, π , ..., π , θ , ..., θ ¯ , where π > 0 for all j = 1, ..., K, π + ... + π = 1, e e1 eKe e1 eK ej e e1 eKe

44 π2

ψ00

√C n

0 θ1

θ0 − √C 0 θ0 + √C 1 n θ1 1 n

θ0 1 ∂ωn

Ψ θ2 0

Figure 7: Illustration of Example2. In bold blue is Ψ0; in dashed blue is the other part of the √ 0 identified set Ψe 0, in red is ∂ωn (the boundary of the C/ n-neighborhood of Ψ0). θ1 is the true value of θ1.

π2 π2

0 θ 1 √C n 0 θ1 d √C 0 n θ1 0 0 (θ1 , θ1 , 0) 0 Ψ 0 θ θ2 0 1 θ1 θ2=const (a) (b) π2 π2

0 θ 1 √C n 0 θ1 d √C 0 n θ1 0 0 (θ1 , θ1 , 0) 0 Ψ 0 θ θ2 0 1 θ1 θ2=const (c) (d)

Figure 8: Illustration how the points from ∂ωn are split into two subsets: (a) over which the second order term of the Taylor expansion is uniformly bounded away from zero and (c) over which the penalty difference dominates all other terms. Figures (b) and (d) show the section of (a) and (c) respectively by plane θ2 = const.

45 and θej ∈ Θ, θej1 6= θej2 whenever j1 6= j2, j1, j2 = 1, ..., Ke, such that

Kp K¯ Ke K¯ X p p X p X X πkf(x; θk) + 0 · f(x; θk) = πejf(x; θej) + 0 · f(x; θek). p k=1 k=K +1 j=1 k=Ke+1

p Kp Ke e Ke Let {θk}k=1 ∪ {θek}k=1 = {θek}k=1. Then, moving all the terms in the above expression to the left-hand side results in the following expression:

Ke X e πelf(x; θel) = 0, k=1

where πej can be 0, negative, or positive (and are between −1 and 1). e e But this implies that the linear combination of f(x; θek), k = 1, ..., Ke, with coefficients πek is e equal to 0. However, by Assumption (M3), it is possible only if πek = 0 for all k = 1, ..., Ke. Since none of πp, .., πp , π , ..., π was 0, it is possible only if 1) Kp = K, and 2) {θ }K = {θ }Ke , and 1 Kp e1 eKe e k k=1 ek k=1 p p p 3) there exists {k , ..., k p }, a permutation of {1, ..., K }, such that π = π . In other words, K 1 K j ekj p p p p and (θ1, ..., θKp , π2, ..., πKp ) are identified. Now, the true K-component model can be written as

K K¯ X X h(x; ϕ) = πkf(x; θk) + 0 · f(x; θk), k=1 k=K+1 where πk > 0, for all k = 1, ..., K, and θk 6= θj for all k 6= j, k, j = 1, ..., K. But since the

p p p p p parsimonious model is point-identified, it means that K = K, and (θ1, ..., θKp , π2, ..., πKp ) = ϕ, which completes the proof.

B.2 Proof of Theorem1

Proof. Below I will suppress X from the notation for Qn(X, ψ) and will simply write it as Qn(ψ). √ √ For C > 0, consider ωn, a C/ n neighborhood of Ψ0, i.e. the union of C/ n neighborhoods of all point from Ψ0. Assume that ωn ⊂ ωε from (M6) (which can be ensured by picking n large enough).

Consider the behavior of the penalized likelihood at ∂ωn, the boundary of this neighborhood. The

46 goal is to show that for C large enough with probability tending to 1

h 0  ψ0 i sup Qn ψ − Qn ψ < 0, 0 ψ ∈∂ωn

where ψψ0 is defined as in Assumption (M7). It will imply that there exists a maximizer ψˆ of

Qn(ψ) inside this neighborhood ωn, or, in other words, that there exists a point ψ0(ψˆ) ∈ Ψ0 such √ that kψˆ − ψ0(ψˆ)k ≤ C/ n (with probability tending to 1.)

Observe that Ψ0 can be characterized the following way:

0 0 Ψ0 = {ψ = (θ1, θ2, π1, π2) ∈ Ψ: θ1 = θ1, π1 = π1, π2 = 0}.

0 0 0 0 0 ψ0 Now consider point ψ = (θ1, θ2, π1, π2) ∈ ∂ωn, and a point ψ – the one from Ψ0 that is 0 ψ0 0 ψ0 0 0 0 closest to ψ . The fact that ψ is the closest to ψ implies that ψ = (θ1, θ2, π1, 0) (because 0 ψ0 0 0 0 0 0 0 0 ψ0 ψ − ψ = (θ1 − θ1, 0, π1 − π1, π2) is orthogonal to Ψ0). Since ψ ∈ ∂ωn, it means that ψ − ψ √ can be written as (C/ n)u, where u ≡ (uθ1 , uθ2 , uπ1 , uπ2 ) is a vector of the unit length such that 0 ψ0 0 ψ0 0 uθ2 ≡ θ2 − θ2 = 0, and uπ2 ≡ π2 − π2 = π2 − 0 ≥ 0 (as no mixing coefficient can be negative).

0  ψ0  Step 1. Consider Qn (ψ ) − Qn ψ :

0  ψ0  h 0  ψ0 i Qn ψ − Qn ψ = Ln ψ − Ln ψ

K¯ X h 0   ψ0 i − n pλn πk − pλn πk . k=2

By taking the Taylor expansion of the likelihood part of the objective function around ψψ0 ,

the expression above can be rearranged as the following:

ψ0 2 2 ψ0  0  C ∂L (ψ ) C ∂ L (ψ ) Q ψ0 − Q ψψ = √ n u + u n uT + R u, ψ0 n n n ∂ψT 2n ∂ψ∂ψT n K¯ X h 0   ψ0 i − n pλn πk − pλn πk , k=2

47 0 where Rn(u, ψ ) is the remainder of the Taylor expansion.

To establish the order of the terms from the likelihood expansion, note that:

  0  2   0  2 ψ d ψ ! ∂Ln ψ P ∂Ln ψ E 0T E   ψ0    ∂ψ   ∂ψk  ∂L ψ k=1 n √ P  ≥ ε n ≤ = ∂ψ0T ε2n ε2n

" # d n ψ0 ψ0 P E P ∂ log g(xi,ψ ) ∂ log g(xj ,ψ ) ∂ψk ∂ψk k=1 i,j=1 = ε2n " 2# d n  ψ0  P P E ∂ log g(xi,ψ ) ∂ψk k=1 i=1 = ε2n d d P ψ0 P I n Ikk(ψ ) n γk = k=1 = k=1 ε2n ε2n ndC C0 ≤ 1 ≤ 1 , ε2n ε2

where the third line holds due to xi, xj being independent for i 6= j (Assumption (M2)), the

0 last line follows from Assumption (M4), and C1 = dC1. Together with the Cauchy-Schwartz- Bunyakovsky inequality, this implies that

 ψ0  0T ψ0 0T √ k∂Ln ψ /∂ψ uk ≤ k∂Ln(ψ )/∂ψ kkuk = Op( n),

uniformly in ψ ∈ Ψ0.

Consider the second term of the expansion. It can be rewritten as

 0    0   2 ∂2L ψψ 2 ∂2L ψψ 2 C n C 1 n  0  C  0  u uT = u + I ψψ uT − uI ψψ uT. 2n ∂ψ∂ψT 2 n ∂ψ∂ψT  2

 ψ0  −1 h 2  ψ0  Ti By Assumptions (M4) and (M2), In ψ = (n) E −∂ Ln ψ /∂ψ∂ψ . Existence of the third derivatives of log g(x; ψ) (Assumption (M6)) ensures that the second derivatives are continuous with respect to ψ, which together with compactness of Θ and Assumption (M5)

48 guarantees that all the conditions of Jennrich’s Uniform Law of Large Numbers are satisfied,

and hence,  0  ∂2L ψψ 1 n  0  + I ψψ = o (1), n ∂ψ∂ψT p

uniformly over Ψ0.

h −1 2  ψ0  T  ψ0 i T 2 −1 2  ψ0  T  ψ0  14 Since u n ∂ Ln ψ /∂ψ∂ψ + I ψ u ≤ kuk n ∂ Ln ψ /∂ψ∂ψ + I ψ , 2 C2  ψ0  T this implies that the second term of the expansion is equal to − 2 uIn ψ u +op(1), where

the last term is uniform in ψ ∈ Ψ0.

Finally, consider the remainder. It follows from Assumption (M6) that

  d 3 C3 X ∂ L ψe R u, ψ0 = u u u , n 3/2 j k l 6n ∂ψj∂ψk∂ψl j,k,l=1

0 where ψe is a point between ψ0 and ψψ .

Assumption (M6) and compactness of ωε guarantee that all the conditions of Jennrich’s Uni- form Law of Large Numbers are satisfied, and hence,

3 " 3 # 1 ∂ Ln(ψe) ∂ log g(x; ψe) →p E n ∂ψj∂ψk∂ψl ∂ψj∂ψk∂ψl

uniformly in ψe, and so

0 √ Rn u, ψ = Op(1/ n) = op(1),

ψ0 uniformly in ψ ∈ Ψ0.

h 0  ψ0 i 0 Step 2. Recall, the goal is to show that supψ ∈∂ωn Qn (ψ ) − Qn ψ < 0. In the situation with point-identification of the parameters it is done by showing that the second term of the

expansion is strictly negative and dominates all other terms. In the current situation it would

be required to show that the second term of the expansion is negative and bounded from below

14 Here k · k2 denotes the operator norm corresponding to l2 norm for vectors. For a n × k matrix A, this norm is kAxk2 k defined as: kAk2 = sup , where x ∈ and the vector norms are the l2 norm. x6=0 kxk2 R

49 (in absolute value) uniformly over all the points from the neighborhood of the boundary of

Ψ0. However, this is simply not true: some of the points from ∂ωn will belong to Ψe 0, and

so the information matrix evaluated at the corresponding points from Ψ0 will be 0 along the desired direction (see the example from AppendixA.

As the second term of the expansion cannot be made dominant uniformly for all points in

0 Ψ0, split all the points ψ ∈ ∂ωn into two subsets according to the following conditions for √ u = (ψ0 − ψψ0 )/(C/ n):

√ 1 • u ∈ U1: the set of all u such that max uπk ≤ d, for some positive d < , j=K+1,...,K¯ (K¯ −K)2+(K¯ −K) 0 ψ0 where, recall, uπ = π − π ;

• u ∈ U2: the set of all u such that there exists at least one k from K + 1 to K¯ such that

uπk > d.

Clearly, the intersection of U1 and U2 is empty, while the union of U1 and U2 is equal to set √ of all possible vectors u = (ψ0 − ψψ0 )/(C/ n).

Step 3. Now, consider U1 and look at inf uI (ψ) uT. ψ∈Ψ0,u∈U1

I (ψ) is a continuous function of ψ, and hence the quadratic form from the expression above

ψ0 is continuous function of ψ and u. Moreover, Ψ0 × U1 is a compact space, and hence, the quadratic form achieves the infimum on this set.

Note that inf uI (ψ) uT = γmin (I (ψ)), and so, as was shown above, inf γmin (I (ψ)) = u∈U1∪U2 ψ∈Ψ0 0. What is achieved by splitting the set of all possible u into two parts is that no linear

combination of the eigenvectors of I (ψ) that correspond to its 0 eigenvalues is in U1 for any

value of ψ ∈ Ψ0, and hence the infimum above should be strictly positive.

More strictly, since the infumum is achieved on Ψ0 × U1, there is ψe ∈ Ψ0 and ue ∈ U1 such that

T   T inf uI (ψ)) u = uIe ψe ue . ψ∈Ψ0,u∈U1

  Assume it is equal to 0. Then 1) I ψe has at least one eigenvalue equal to 0, and 2) ue is a

50   linear combination of eigenvectors corresponding to the zero eigenvalues of I ψe . However,

U1 does not contain any such vector for any possible value of ψ ∈ Ψ0 (see Lemma1), hence the infimum cannot be 0, which is a contradiction. It is possible to conclude that the infimum

is strictly positive.

Step 4. Getting back to the Taylor expansion of the objective function, recall that pλn (·) ≥ 0 and

pλn (0) = 0. Hence

K K¯ 0  ψ0  h 0  ψ0 i X h 0   ψ0 i X  0   Qn ψ − Qn ψ = Ln ψ − Ln ψ − n pλn πk − pλn πk − n pλn πk − 0 k=2 k=K+1 K h 0  ψ0 i X  0   ≤ Ln ψ − Ln ψ − n pλn πk − pλn (πk) , k=2

ψ0 where the last line takes into account that πk = πk for k = 2, ..., K.

Taking the Taylor expansion of the penalty function part in addition to the expansion of the

likelihood part performed above gives the following:

K 0  ψ0  h 0  ψ0 i X h 0   ψ0 i Qn ψ − Qn ψ ≤ Ln ψ − Ln ψ − n pλn πk − pλn πk k=2 C ∂L (ψψ0 ) C2 ∂2L (ψψ0 ) = √ n u + u n uT + R u, ψ0 n ∂ψT 2n ∂ψ∂ψT n K  2  X C 0  ψ0  C 00  ψ0  2 − n √ p π uπ + p π u (1 + op(1)) , n λn k k 2n λn k πk k=2

Determine the order of the terms from the expansion of the penalty function.

Since the penalty function is concave, p00 (·) is negative and λn

K d X  0  X − p00 πψ u2 (1+o (1)) ≤ 2 max |p00 (π )| : π 6= 0 u2 = 2 max |p00 (π )| : π 6= 0 = o (1), λn k πk p λn j j j λn j j p k=2 j=1

by Assumption (P4).

51 The first term from the Taylor expansion of the penalty function,

K K √ √ X 0 ψ0  0 X − pλn (πk )uπk ≤ max pλn (πj): πj 6= 0 (−uπk ) ≤ bnK/ d ≤ bn K, k=2 k=2

Pd 2 where the penultimate inequality follows from the fact that kuk = j=1 uj = 1.

Hence, the penalty function terms can be bounded from above by

√ √ 2 C K nbn + C op(1),

√ where by Assumption (P2), bn = Op(1/ n), hence the whole expression can be bounded by DC, for some positive constant D (note, this D does not depend on ψψ0 ).

The order of the first and third terms of the expansion of the likelihood function was established

above and the sum of those two terms can be bounded from above by AC, for some positive

ψ0 constant A that does not depend on ψ . Finally, for all points u ∈ U1, there exists a constant B0 > 0 such that inf uI (ψ) uT ≥ B0, as was established on the previous step. Hence, ψ∈Ψ0,u∈U1 the second term of the expansion can be bounded from below by BC2, where B > 0 does not

depend on ψψ0 .

Putting everything together,

h 0  ψ0 i 2 sup Qn ψ − Qn ψ ≤ AC − BC + DC < 0, u∈U1

where the last inequality holds since C can be chosen large enough for the expression in the

middle to be negative (i.e. C > (A + D)/B).

T Step 5. Consider now the remaining points, i.e. u ∈ U2. For these points inf uI (ψ) u = 0, ψ∈Ψ0,u∈U2 hence, the same analysis as for points from U1 will not yield the desired result, and so a different approach is required.

For that, go back to the Taylor expansion of the objective function, but unlike in the previous

case, keep the penalty terms corresponding to the mixing coefficients for the last K¯ − K

components:

52 K K¯ 0  ψ0  h 0  ψ0 i X h 0   ψ0 i X  0   Qn ψ − Qn ψ = Ln ψ − Ln ψ − n pλn πk − pλn πk − n pλn πk − 0 k=2 k=K+1 ψ0 2 C ∂L (ψ ) C  0  = √ n u − uI ψψ uT + o (1) + R u, ψ0 n ∂ψT 2 p n K K¯ X C  0  X − n √ p0 πψ u + o (1) − n p π0  n λn k πk p λn k k=2 k=K+1 ¯ ψ0 K K C ∂Ln(ψ ) X C  0  X ≤ √ u + o (1) − n √ p0 πψ u − n p π0  , n ∂ψT p n λn k πk λn k k=2 k=K+1

 0  where the inequality holds due to uI ψψ uT ≥ 0.

0 0 ψ0 Now, for each ψ such that u = ψ − ψ ∈ U2, there exists a least one j ∈ {K + 1, ..., K¯ } √ such that uπj > d. The penalty function is monotone, hence, pλn (πj) > pλn (dC/ n), and for all other πk, pλn (πk) ≥ 0. Using this fact and the order of terms from the Taylor expansion derived above, the supremum of the difference in the objective function evaluated at some point from the boundary of ωn and the corresponding point from Ψ0 can be bounded by

K¯ h 0  ψ0 i X 0  sup Qn ψ − Qn ψ ≤ AC + DC + op(1) − n inf pλn πk u∈U2 u∈U2 k=K+1  C  < AC + DC + o (1) − np d√ p λn n C = (A + D)C + o (1) − nλ d√ + O (1) p n n p √ = (A + D)C + op(1) + Op(1) − λn ndC

< 0,

for n large enough. The third line holds due to Assumption (P3) and because for n large enough √ (and depending on C and d), Cd/ n < λn, as is guaranteed by the condition from Theorem √ √ 1 that λn n → +∞. The last line holds also due to the assumption that λn n → +∞, while the other terms are constant.

53 0 Step 6. Finally, putting everything together: the analysis of points ψ ∈ ∂ωn such that the corre-

sponding u ∈ U1 allows to conclude that for any δ > 0 there exists N1 such that for all n ≥ N 1 ( ) h 0  ψ0 i δ P sup Qn ψ − Qn ψ < 0 > 1 − , 0 ψ :u∈U1 2

0 and the analysis of points ψ ∈ ∂ωn such that the corresponding u ∈ U2 allows to conclude

that for any δ > 0 there exists N2 such that for all n ≥ N2

( ) h 0  ψ0 i δ P sup Qn ψ − Qn ψ < 0 ) > 1 − . 0 ψ :u∈U2 2

Hence, for all n ≥ max{N1,N2},

( ) h 0  ψ0 i δ 15 P sup Qn ψ − Qn ψ < 0 > 2(1 − ) − 1 = 1 − δ. 0 ψ ∈∂ωn 2

Thus, with probability tending to 1, for n large enough, there exists ψˆ, a local maximizer of

−1/2 Qn(ψ) in a n -neighborhood of Ψ0, and that completes the proof.

ψ0 T Lemma 1. For any ψ ∈ Ψ0 and for any u ∈ U1, uI(ψ )u > 0

ψ0 ψ0 T 0 Proof. Consider I(ψ ). Assumption (M7) states that uI(ψ )u > 0 for all ψ ∈ ωε apart from

those that also belong to Ψe 0. 0 0 Let ψ be from the intersection of ωn and Ψe 0. This can happen only if some elements of θ2 are

equal to some elements of θ1 and the sum of the mixing coefficients on all identical components

is equal to the true value of the mixing coefficient that corresponds to this element of θ1 (and, of

0 course, θ1 = θ1). Before proceeding, compute the information matrix of the estimated model. Below, for sim-

th plicity, g will denote the density of the mixture, fk will be used to denote the density of the k

0 00 2 2 component f(x; θk), and fk and fk – to denote its first (∂f(x; θk)/∂θk) and second (∂ f(x; θk)/∂θk) 15Since for any two measurable sets A and BP (A ∩ B) ≥ P (A) + P (B) − 1)

54 0 00 derivatives with respect to θk correspondingly (in the situation of multidimensional θk, fk and fk stand, correspondingly, for the gradient and the hessian of fk with respect to θk). Finally, to save PK space, π1 will be used in place of 1 − k=2 πk. Then the elements of the hessian of the mixture density are:

2 ∂ log g (fk − f1)(fj − f1) ¯ • = − 2 , for j, k = 2, ..., K; ∂πk∂πj g

2 00 2 0 2 ∂ log g πkfk g − πk(fk) ¯ • 2 = 2 , for k = 2, ..., K; ∂θk g 2 π π f 0 f 0 ∂ log g j k k j ¯ • = − 2 , for j, k = 1, ..., K and j 6= k; ∂θk∂θj g

2 0 0 ∂ log g fkg − πkfk(fk − f1) ¯ • = 2 , for k = 2, ..., K; ∂πk∂θk g 2 (f − f )π f 0 ∂ log g k 1 j j ¯ • = − 2 , for j, k = 2, ..., K and j 6= k; ∂πk∂θj g

2 0 0 ∂ log g f1g + π1f1(fk − f1) ¯ • = − 2 , for k = 2, ..., K. ∂πk∂θ1 g

R 0 R 00 R Note that fk = fk = 0, since fk is a density and hence fk = 1, since the order of differentiation and integration can be changed for the densities. This implies that

(f − f )(f − f ) • I = E j 1 k 1 , for j, k = 2, ..., K¯ ; πj πk g2 f 0f 0  • I = π π E j k , for j, k = 1, ..., K¯ ; θj θk k j g2 (f − f )f 0  • I = π E j 1 k , for j = 2, ..., K¯ , k = 1, ..., K¯ . πj θk k g2

To see what is happening to the information matrix evaluated at ψψ0 and along which directions it will be 0, distinguish the following two situations:

• Some components from the last K¯ − K ones are equal to the first component:

θ0 = θ0 = ... = θ0 = θ , for some m ≤ K¯ − K and some K + 1 ≤ k < .... < k ≤ K¯ k1 k2 km 1 1 m and π0 + π0 + π0 + ... + π0 = π (π0 ≥ π0 ≥ ... ≥ π0 ≥ 0) 1 k1 k2 km 1 k1 k2 km

0 ψ0 0 For a point ψ like that, the point from Ψ0 that is closest to it will be ψ = (θ1, θ2, π1, 0), and the entries corresponding to θ0 , θ0 , ..., θ0 will all be equal to θ . Hence, the information ma- k1 k2 km 1 ψ0 trix evaluated at ψ will have the entries in the rows corresponding to πk1 , πk2 , ..., πkm equal

55 to 0 (since those terms involve f(x; θ0 ) − f(x; θ ) ≡ 0, as can be seen from the computation ki 1 of the information matrix in the beginning of this proof).

ψ0 T Hence, uI(ψ )u would be equal to 0 along vectors of the form (0, 0, 0, uπ2 ), where uπ2 has non-zeros on the places corresponding to π0 , π0 , ..., π0 and zeros everywhere else. Since k1 k2 km

kuk = 1, kuπ2 k = 1.

Assume there exists such u inside U1. For any u ∈ U1, max uπj ≤ d. But e j=K+1,...,K¯

K¯ X kuk = ku k = u2 ≤ d2(K¯ − K) < (K¯ − K)/ (K¯ − K)2 + (K¯ − K) < 1, e eπ2 eπj j=K+1

which is a contradiction. Thus, ue cannot be inside U1.

• Some components from the last K¯ −K ones are equal to one of the components from ¯ ¯ second to K: θk1 = θk2 = ... = θkm = θl for some m ≤ K−K, K+1 ≤ k1 < .... < km ≤ K, and some l from 2 to K, π0 + π0 + π0 + ... + π0 = π (π0 ≥ π0 ≥ ... ≥ π0 ≥ 0) l k1 k2 km l k1 k2 km

0 ψ0 0 For a point ψ like that, the point from Ψ0 that is closest to it will be ψ = (θ1, θ2, π1, 0), and the entries corresponding to θ0 , θ0 , ..., θ0 will all be equal to θ . Hence, the information k1 k2 km l ψ0 matrix evaluated at ψ will have the rows corresponding to πk1 , πk2 , ..., πkm equal to the row corresponding to π (since those terms involve f(x; θ0 ) − f(x; θ ) ≡ f(x; θ ) − f(x; θ )). l ki 1 l 1

ψ0 T Hence, uI(ψ )u would be equal to 0 along vectors of the form (0, 0, uπ1 , uπ2 ), where uπ1

has a non-zero entry corresponding to πl, and zeros at all other places, and uπ2 has non-zeros

at the places corresponding to πk1 , πk2 , ..., πkm , and zeros everywhere else. Moreover, the non-

zero element of uπ1 has to be equal to minus the sum of the elements of uπ2 . And since

kuk = 1, kuπk = 1.

Assume there exists such u inside U1. Again, for any u ∈ U1, max uπj ≤ d. But e j=K+1,...,K¯ K¯ |u | = P u ≤ d(K¯ − K), and so eπl eπk k=K+1

K¯ X (K¯ − K)2 + (K¯ − K) kuk = u2 + u2 ≤ d2(K¯ − K)2 + d2(K¯ − K) < = 1, e eπl eπj (K¯ − K)2 + (K¯ − K) j=K+1

56 which is a contradiction. Thus, any such ue cannot be inside U1.

0 In general, if ψ ∈ ∂ωn ∩ Ψe, any feasible combination of these two cases can take place (i.e. some of the last K¯ − K components are equal to the first one, some others are equal to some

2 ≤ j1 ≤ K, some others are equal to some j1 < j2 ≤ K, etc.) However, there is a finite number of combinations of these two cases, and the smallest value of d is actually required for the situation when all components from K + 1 to K¯ are equal to some component l from 2 to K (in this case u is the largest and hence kuk is the largest as well). But any positive value of d smaller than eπl e  ¯ 2 ¯ −1/2 (K − K) + (K − K)) will ensure that no such ue can be in U1.

B.3 Proof of Theorem 2

Recall the following notation: ϕ = (θ1, π1) and ψ can be reparametrized as (ϕ, θ2, π2). To prove Theorem2, the following lemma is required:

√ Lemma 2. Let Assumptions (M2)-(M7) and (P5) hold, and additionally let λn n → ∞. Then,

0 0 0 0 0 with probability tending to 1, for any constant C > 0 and any ψ = (θ1, θ2, π1, π2) ∈ ωn, where ψ˜ √ ωn = {ψ˜ ∈ Ψ: kψ˜ − ψ k ≤ C/ n},

0 0 0 0 0 0 Qn(θ1, θ2, π1, 0) = max √ Qn(θ1, θ2, π1, π˜2). π˜2:0≤kπ˜2k≤C/ n

0 0 0 0 Remark 2. Qn(θ1, θ2, π1, 0) does not depend on θ2.

Proof. Consider $\psi^0 = (\theta_1^0, \theta_2^0, \pi_1^0, \pi_2^0)$ from $\omega_n$, an $n^{-1/2}$-neighborhood of $\Psi_0$. Consider another point $\psi'' = (\theta_1^0, \theta_2^0, \pi_1^0, \tilde\pi_2)$, where $\tilde\pi_2$ is such that $(\pi_1^0, \tilde\pi_2) \in \Delta^{\bar K - 1}$ and $\|\tilde\pi_2\| \le C/\sqrt{n}$.

To prove the lemma, show that the derivative of $Q_n(\cdot)$ with respect to each element of $\pi_2$, evaluated at $\psi'' = (\theta_1^0, \theta_2^0, \pi_1^0, \tilde\pi_2)$, is negative for non-zero values of the elements of $\tilde\pi_2$, for any $\psi''$ and any $C > 0$.

For that, take the Taylor expansion of this derivative of $Q_n(\cdot)$ around the point from $\Psi_0$ that is closest to $\psi''$. Note that this point is defined as $\psi^{\psi''} = (\theta_1^{\psi^0}, \theta_2^{\psi^0}, \pi_1^{\psi^0}, 0)$, and so it is equal to $\psi^{\psi^0}$.

For any $k = K+1, \ldots, \bar K$ the following holds:
\begin{align*}
\frac{\partial Q_n(\psi'')}{\partial \pi_k} &\equiv \frac{\partial L_n(\psi'')}{\partial \pi_k} - n\, p_{\lambda_n}'(\tilde\pi_k) \\
&= \frac{\partial L_n(\psi^{\psi^0})}{\partial \pi_k} + \sum_{i=1}^{d} \frac{\partial^2 L_n(\psi^{\psi^0})}{\partial \pi_k\, \partial \psi_i}\,\big(\psi''_i - \psi^{\psi^0}_i\big) \\
&\quad + \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} \frac{\partial^3 L_n(\psi^*)}{\partial \pi_k\, \partial \psi_i\, \partial \psi_j}\,\big(\psi''_i - \psi^{\psi^0}_i\big)\big(\psi''_j - \psi^{\psi^0}_j\big) - n\, p_{\lambda_n}'(\tilde\pi_k),
\end{align*}
where $\psi^*$ is a point between $\psi''$ and $\psi^{\psi^0}$.

As was established in the proof of Theorem 1, the orders of the terms of the expansion are the following:

• $\dfrac{\partial L_n(\psi^{\psi^0})}{\partial \pi_k} = O_p(\sqrt{n})$

• $\dfrac{\partial^2 L_n(\psi^{\psi^0})}{\partial \pi_k\, \partial \psi_i} = O_p(n)$

• $\dfrac{\partial^3 L_n(\psi^*)}{\partial \pi_k\, \partial \psi_i\, \partial \psi_j} = O_p(n)$

Given that $|\psi''_i - \psi^{\psi^0}_i| \le \min\{C, \tilde C\}\, n^{-1/2}$, i.e. $\psi''_i - \psi^{\psi^0}_i = O_p\big(n^{-1/2}\big)$, this means that
\begin{align*}
\frac{\partial Q_n(\psi'')}{\partial \pi_k} &= O_p(\sqrt{n}) + O_p(\sqrt{n}) + O_p(1) - n\, p_{\lambda_n}'(\tilde\pi_k) = O_p(\sqrt{n}) - n\, p_{\lambda_n}'(\tilde\pi_k) \\
&= \lambda_n n\left[ O_p\!\left(\frac{1}{\lambda_n\sqrt{n}}\right) - \frac{p_{\lambda_n}'(\tilde\pi_k)}{\lambda_n} \right].
\end{align*}
But by Assumption (P5), $\liminf_{n\to\infty} c_n = \liminf_{n\to\infty} \liminf_{\pi\to 0} p_{\lambda_n}'(\pi)/\lambda_n > 0$, while $\sqrt{n}\lambda_n \to \infty$, so the first term in the parentheses is of order $o_p(1)$, while the second one is bounded away from zero with probability approaching 1. Hence, for $n$ large enough, the sign of the derivative $\partial Q_n(\psi'')/\partial \pi_k$ is determined by the second term, which is negative, so for all $k = K+1, \ldots, \bar K$,
\[
\frac{\partial Q_n(\psi'')}{\partial \pi_k} < 0 \quad \text{for any } 0 < \tilde\pi_k \le \frac{\tilde C}{\sqrt{n}}.
\]

This completes the proof.
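To see the role of Assumption (P5) in this argument for concrete penalties, here is a minimal sketch (Python, illustrative only; the derivative formulas are the standard SCAD and MCP ones from Fan and Li [2001] and Zhang [2010a], and the sample values are arbitrary). Near zero, $p_\lambda'(\pi)$ stays close to $\lambda$, so $n\,p_{\lambda_n}'(\tilde\pi_k)$ dominates the $O_p(\sqrt{n})$ score term in the expansion above; away from zero both derivatives are exactly zero for $n$ large enough, which is also the property used in Step 1 of the proof of Theorem 3 below.

```python
import numpy as np

def scad_deriv(pi, lam, a=3.7):
    """Derivative of the SCAD penalty of Fan and Li [2001] (a = 3.7 is their default)."""
    pi = np.abs(pi)
    mid = np.maximum(a * lam - pi, 0.0) / (a - 1)          # derivative on (lam, a*lam]
    return np.where(pi <= lam, lam, np.where(pi <= a * lam, mid, 0.0))

def mcp_deriv(pi, lam, a=3.0):
    """Derivative of the MCP penalty of Zhang [2010a]."""
    pi = np.abs(pi)
    return np.where(pi <= a * lam, lam - pi / a, 0.0)

n = 10_000
lam = np.log(n) / np.sqrt(n)     # the reference rate used in Step 1 of Theorem 3
pi_small = 1.0 / np.sqrt(n)      # a mixing weight shrinking inside the n^{-1/2} ball
pi_large = 0.5                   # a mixing weight bounded away from zero

# Near zero: p'_lambda(pi)/lambda stays close to 1 (bounded away from zero), so
# n * p'_lambda(pi) is of order n*lambda, which dominates the O_p(sqrt(n)) score term.
print(scad_deriv(pi_small, lam) / lam, mcp_deriv(pi_small, lam) / lam)

# Away from zero: both derivatives are exactly 0 for n large enough, so the penalized
# first-order condition for the non-zero mixing weights coincides with the unpenalized one.
print(scad_deriv(pi_large, lam), mcp_deriv(pi_large, lam))
```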

This lemma allows us to prove Theorem 2.

Proof. Below are the steps of the proof:

Step 1. By Theorem 1, $\hat\psi$ is a local maximizer of the objective function in an $n^{-1/2}$-neighborhood of $\Psi_0$.

Step 2. Next, consider $Q_n(\varphi, \theta_2, 0)$ as a function of $\varphi$. Since $\pi_2 = 0$, $Q_n(\varphi, \theta_2, \pi_2)$ does not depend on $\theta_2$, and so $\theta_2$ will be dropped from the arguments and the function written as $Q_n(\varphi, 0)$.

Let $\hat\varphi$ be the local maximizer of $Q_n(\varphi, 0)$. Show that $\hat\psi = (\hat\varphi, \theta_2, 0)$ is the local maximizer of $Q_n(\psi)$ in the $n^{-1/2}$-neighborhood of $\Psi_0$ for any $\theta_2$ such that $\hat\psi$ lies in this neighborhood of $\Psi_0$.

For all $\psi = (\varphi, \theta_2, \pi_2) \in \omega_\varepsilon$, where $\varepsilon = Cn^{-1/2}$ for some constant $C > 0$, the following holds:
\begin{align*}
Q_n(\varphi, \theta_2, \pi_2) - Q_n(\hat\varphi, \theta_2, 0) &\equiv \left[Q_n(\varphi, \theta_2, \pi_2) - Q_n(\varphi, \theta_2, 0)\right] + \left[Q_n(\varphi, \theta_2, 0) - Q_n(\hat\varphi, \theta_2, 0)\right] \\
&= \left[Q_n(\varphi, \theta_2, \pi_2) - Q_n(\varphi, 0)\right] + \left[Q_n(\varphi, 0) - Q_n(\hat\varphi, 0)\right] \\
&\le \left[Q_n(\varphi, \theta_2, \pi_2) - Q_n(\varphi, 0)\right],
\end{align*}
since $Q_n(\varphi, 0) - Q_n(\hat\varphi, 0) \le 0$. By Lemma 2, $Q_n(\varphi, \theta_2, \pi_2) - Q_n(\varphi, 0) < 0$ with probability going to 1, which implies that $\hat\psi = (\hat\varphi, \theta_2, 0)$; hence $P(\hat K = K) \to 1$, so the OSCE is indeed an order-selection-consistent estimator. Also note that $\hat\varphi$ does not depend on $\theta_2$.

Step 3. Since $\hat\varphi$ is a local maximum of $Q_n(\varphi, 0)$, and $\varphi = (\theta_1, \pi_1)$ is an interior point of $\Phi$ by Assumption (M2), the first-order condition should be satisfied at $\hat\varphi$:
\[
\left.\frac{\partial Q_n(\varphi, \theta_2, \pi_2)}{\partial \varphi_j}\right|_{\psi = (\hat\varphi, \theta_2, 0)} = 0, \quad \text{for } j = 1, \ldots, d_\varphi,
\]
where, recall, $d_\varphi = \dim(\varphi) = Kp + K - 1$.

Step 4. Using the fact that $Q_n(\varphi, \theta_2, 0)$ does not depend on $\theta_2$ and taking the Taylor expansion of the first-order condition around $\varphi$, I get:
\begin{align*}
0 &= \left.\frac{\partial Q_n(\varphi, \theta_2, \pi_2)}{\partial \varphi_j}\right|_{\psi = (\hat\varphi, \theta_2, 0)} = \left.\frac{\partial L_n(\varphi, \theta_2, \pi_2)}{\partial \varphi_j}\right|_{\psi = (\hat\varphi, \theta_2, 0)} - n\, p_{\lambda_n}'(\hat\pi_j)\, I\{pK + 1 \le j \le d_\varphi\} \\
&= \frac{\partial L_n(\varphi, 0)}{\partial \varphi_j} + \sum_{i=1}^{d_\varphi} \frac{\partial^2 L_n(\varphi, 0)}{\partial \varphi_j\, \partial \varphi_i}\,(\hat\varphi_i - \varphi_i) + o_p(1)\sum_{i=1}^{d_\varphi}(\hat\varphi_i - \varphi_i) \\
&\quad - n\left[ p_{\lambda_n}'(\pi_j) + p_{\lambda_n}''(\pi_j)(\hat\pi_j - \pi_j) + o_p(1)(\hat\pi_j - \pi_j) \right] I\{pK + 1 \le j \le d_\varphi\}, \quad j = 1, \ldots, d_\varphi.
\end{align*}
Here $I\{pK + 1 \le j \le d_\varphi\}$ appears because the term coming from the penalty function is present only for the elements of $\varphi$ from the $(pK+1)$-th to the $d_\varphi$-th, i.e. only for the components of $\pi_1$.

Rewriting this equation in matrix form, regrouping the terms, and recalling that $\|\hat\varphi - \varphi\| = O_p(1/\sqrt{n})$, I obtain:
\[
\frac{1}{n}\frac{\partial L_n(\varphi, 0)}{\partial \varphi} + b_{\lambda_n} + \left(\frac{1}{n}\frac{\partial^2 L_n(\varphi, 0)}{\partial \varphi\, \partial \varphi^T} + \Sigma_{\lambda_n}\right)(\hat\varphi - \varphi) + o_p\!\left(\frac{1}{\sqrt{n}}\right) = 0,
\]
and
\[
\sqrt{n}\left[\left(\frac{1}{n}\frac{\partial^2 L_n(\varphi, 0)}{\partial \varphi\, \partial \varphi^T} + \Sigma_{\lambda_n}\right)(\hat\varphi - \varphi^0) + b_{\lambda_n}\right] = \frac{1}{\sqrt{n}}\frac{\partial L_n(\varphi, 0)}{\partial \varphi} + o_p(1).
\]

Step 5. By the CLT,
\[
\frac{1}{\sqrt{n}}\frac{\partial L_n(\varphi, 0)}{\partial \varphi} \to_d N\big(0, I(\varphi)\big),
\]
and by the LLN,
\[
\frac{1}{n}\frac{\partial^2 L_n(\varphi, 0)}{\partial \varphi\, \partial \varphi^T} = -I(\varphi) + o_p(1),
\]
which, together with Slutsky's theorem, leads to the final result.

B.4 Conditions for Theorem 1′

As $\bar K_n$ changes with the sample size, so does the density $g(x; \psi_n)$, which is indicated by the subscript $n$ on the function $g$:
\[
g_n(x; \psi_n) = \pi_1 f(x; \theta_1) + \ldots + \pi_{\bar K_n} f(x; \theta_{\bar K_n}).
\]

The subscript n will be used on all the objects that change with the sample size (i.e., the

identified set, the parsimonious subset, etc.).

n (M2’) {xi}i=1 are i.i.d. with density h(x; ϕ) with respect to some measure µ. gn(x; ψn) has a support

dn that does not depend on ψn and n. Ψn is a compact subspace of R , and ϕ is an interior point of Φ.

(M4’) The first and second logarithmic derivatives of gn(x; ψn) satisfy the equations:

  ∂ log gn(x; ψn) Eψn = 0, for j = 1, ..., dn, ∂ψn,j

   2  ∂ log gn(x; ψ) ∂ log gn(x; ψn) ∂ log gn(x; ψn) Ijk(ψn) ≡ Eψn = Eψ − , for k, l = 1, ..., dn, ∂ψn,k ∂ψn,l ∂ψn,j∂ψn,k

and, for some C1 > 0, max γ (I(ψ0)) ≤ C1 < +∞,

max for all ψ0 ∈ Ψ0 and all n, where γ (A) denotes the largest eigenvalue of the matrix A.

(M5’) There exist constants C4 > 0 and C5 > 0, such that for all n, for all k, l = 1, ..., dn, and for

all ψn,0 ∈ Ψn,0 and µ-almost all x

( 2 2) ∂ log gn(x; ψn,0) Eψn,0 ≤ C4, ∂ψk,n∂ψl,n

and

( 2) ∂ log gn(x; ψn,0) ∂ log gn(x; ψn,0) Eψn,0 ≤ C5. ∂ψk,n ∂ψl,n

61 (M6’) There exists ωn,ε ≡ ∪ψn,0∈Ψn,0 Bn,ε(ψn,0), a neighborhood of Ψn,0 (in Ψn), where

 0 0 Bn,ε (ψn) ≡ ψn ∈ Ψn : ψn − ψn ≤ ε

denotes the ε-ball (in Ψn) around a given point ψn, such that for µ-almost all x the density

gn(x; ψn) admits all the third partial derivatives with respect to ψn for all ψn ∈ ωn,ε and

there exist functions Mjkl(·) such that for all n, and all ψn ∈ ωn,ε

3 ∂ log gn(x; ψn) ≤ Mjkl(x), ∂ψn,j∂ψn,k∂ψn,l

2 where Eψn,0 Mjkl(x) ≤ C6 < +∞ for all ψn,0 ∈ Ψn,0.

¯ (P3’) pλn (π) = λnπ + Op(Kn/n) for all π ≤ λn.

B.5 Choice of λn

Theorem 3. Assume that Assumptions (M1)–(M7) and (P1)–(P5) hold, and let the penalty function used in the objective function in (4) be SCAD or MCP. Additionally, assume that there exists $G(\cdot)$ such that $g(x; \psi) < G(x)$ for all $\psi \in \omega_\varepsilon$ and $E|G(x)| < \infty$. Finally, assume that for any $\lambda$, $\hat\psi(\lambda)$, defined as a local maximizer with respect to $\tilde\psi \in \Psi$ of
\[
L_n(\tilde\psi) - n\sum_{k=2}^{\bar K} p_\lambda(\tilde\pi_k),
\]
is consistent in the sense of Redner [1981] and Feng and McCulloch [1996].

Then
\[
\lambda_n^{BIC} = \operatorname*{argmax}_\lambda \, BIC(\lambda)
\]
satisfies the rate conditions from Theorems 1 and 2, i.e. $\lambda_n^{BIC} \to 0$ and $\lambda_n^{BIC}\sqrt{n} \to \infty$.

The reason to assume that $\hat\psi(\lambda)$ is a consistent local maximizer is to avoid potential issues when the global maximizer does not exist or is not consistent. In estimation, this can be ensured by picking good starting values for the EM algorithm (see Remark 1).
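To illustrate how $\lambda_n^{BIC}$ can be computed in practice, here is a minimal sketch (Python, illustrative only; `fit_penalized_mixture` is a hypothetical placeholder for whatever penalized-EM routine is actually used, cf. Remark 1, and the criterion is written in the same form as $BIC_{\tilde K}$ in the proof below): evaluate $BIC(\lambda)$ on a grid and keep the maximizer.

```python
import numpy as np

def bic_criterion(mean_loglik, k_hat, n):
    """BIC written as in the proof below: mean log-likelihood minus (1/2)*K_hat*log(n)/n."""
    return mean_loglik - 0.5 * k_hat * np.log(n) / n

def select_lambda_bic(x, lambda_grid, fit_penalized_mixture):
    """Grid search for the BIC-maximizing tuning parameter.

    `fit_penalized_mixture(x, lam)` is a stand-in (not part of the paper's code)
    assumed to return the estimated mixing weights and the mean log-likelihood
    of the penalized-EM fit, started from good initial values (Remark 1).
    """
    n = len(x)
    best_lam, best_crit = None, -np.inf
    for lam in lambda_grid:
        pi_hat, mean_loglik = fit_penalized_mixture(x, lam)
        k_hat = int(np.sum(np.asarray(pi_hat) > 0))   # number of retained components
        crit = bic_criterion(mean_loglik, k_hat, n)
        if crit > best_crit:
            best_lam, best_crit = lam, crit
    return best_lam

# Example grid centered on the reference rate log(n)/sqrt(n) used in Step 1 of the proof:
# lambda_grid = np.log(n) / np.sqrt(n) * np.geomspace(0.1, 10.0, 25)
```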

Proof. The BIC criterion is a complicated function of $\lambda$, and a closed-form expression for $\lambda_n^{BIC}$, a maximizer of the proposed criterion, cannot be obtained. Hence, a direct analysis of the rate at which $\lambda_n^{BIC}$ goes to 0 is impossible. The approach that I will undertake here is indirect and is an adjusted version of the proof from Wang et al. [2007] about the selection of the tuning parameter in SCAD-penalized regressions.

The proof consists of two steps. First, I show that the BIC criterion evaluated at some λn that satisfies the rate conditions, with probability approaching one, will be equal to the BIC function

computed for the model with known K. Then, I show that for all values of λ that lead to the

selection of either too few or too many components (as compared to K), BIC(λ) cannot achieve

this level with probability tending to one.

Let $\psi_{\tilde K}$ denote the vector of parameters of a $\tilde K$-component mixture, and let
\[
BIC_{\tilde K} = \frac{1}{n}\sum_{i=1}^{n} \log g(x_i; \hat\psi_{\tilde K}) - \frac{1}{2}\,\tilde K\,\frac{\log n}{n},
\]
i.e. the value of BIC for the model estimated with $\tilde K$ components, where $\hat\psi_{\tilde K}$ is the consistent local maximizer of the (non-penalized) log-likelihood function for a mixture with $\tilde K$ components.

Let $\dim \pi$ denote the number of non-zero elements in the vector $\pi$. Observe that for any $\lambda$ such that $\dim \hat\pi(\lambda) = \tilde K$, $BIC(\lambda) \le BIC_{\tilde K}$. Also define $BIC_{true} = BIC_K$.

Step 1. Pick $\lambda_n$ to be $\log n / \sqrt{n}$. This $\lambda_n$ satisfies the rate conditions: $\lambda_n = \log n / \sqrt{n} \to 0$, and $\lambda_n\sqrt{n} = \log n \to +\infty$.

Then, Theorem 2 guarantees that for this $\lambda_n$, for $n$ large enough, $\hat\pi_2(\lambda_n) = 0$, and $\hat\varphi(\lambda_n)$ is a consistent local maximizer of $Q_n(\varphi, \theta_2, 0)$ with respect to $\varphi$ and does not depend on a particular value of $\theta_2$. In other words, $g(x; \hat\psi(\lambda_n)) = h(x; \hat\varphi(\lambda_n))$. Consider the first-order condition with respect to $\varphi$ from the proof of Theorem 2:
\[
\frac{\partial L_n(\hat\varphi(\lambda_n), 0)}{\partial \varphi} - \nabla p_{\lambda_n}(\hat\varphi) = 0.
\]

For the MCP and the SCAD penalty, for $n$ large enough,
\[
\max\{p_{\lambda_n}'(\pi_j^0) : \pi_j^0 \neq 0\} = 0,
\]

and so this first-order condition reduces to the one in non-penalized maximum likelihood estimation of $\varphi$ in the model with $K$ known:
\[
\frac{\partial L_n(\hat\varphi)}{\partial \varphi} = 0.
\]

In other words, for $n$ large enough, $\hat\varphi(\lambda_n) = \hat\varphi$ with probability approaching 1, and so $g(x; \hat\psi(\lambda_n)) = h(x; \hat\varphi)$ with probability approaching 1. Finally, for $n$ large enough, $\hat K(\lambda_n) = K$. Combining the two results, it is easy to see that
\[
P\left(BIC_{\lambda_n} = BIC_{true}\right) \to 1.
\]

Step 2. The second step can be further divided into two subcases. The first considers $\lambda$ for which $\hat\pi(\lambda)$ contains fewer than $K$ non-zero elements, and the second tackles those values of the tuning parameter that lead to overselection of components, i.e. for which $\hat\pi(\lambda)$ has more than $K$ non-zero elements:

• Underselection of components: I will follow the steps of the proof by Keribin[2000]

(Theorem 2.1) in this part.

Let $\lambda$ be such that $\dim \hat\pi(\lambda) = \tilde K$, where $\tilde K < K$. Observe that $BIC_\lambda \le BIC_{\tilde K}$. Now, consider all $\tilde K$-component mixtures and let $FM_{\tilde K}$ denote the class of all such mixtures. In other words,
\[
FM_{\tilde K} = \left\{ g(x; \psi_{\tilde K}) : \psi_{\tilde K} \equiv (\theta_{\tilde K}, \pi_{\tilde K}),\ \theta_{\tilde K} \in \Theta^{\tilde K},\ \pi_{\tilde K} \in \Pi_{\tilde K} \right\}.
\]

Define the Kullback–Leibler divergence between the true parsimonious density $h(x; \varphi)$ and $FM_{\tilde K}$ as the infimum of the KL divergence between $h$ and individual elements of $FM_{\tilde K}$:
\[
KL(h, FM_{\tilde K}) = \inf_{g \in FM_{\tilde K}} KL(h, g) = \inf_{g \in FM_{\tilde K}} E_h\!\left[\log\frac{h}{g}\right].
\]
Since $\Theta^{\tilde K} \times \Pi_{\tilde K}$ is a compact set, and the Kullback–Leibler divergence is a continuous function of $\xi$, $KL(h, FM_{\tilde K})$ achieves its minimum at some $\tilde g_{\tilde K} \equiv g(x; \tilde\psi_{\tilde K}) \in FM_{\tilde K}$. Moreover,
\[
KL(h, FM_{\tilde K}) = KL(h, \tilde g_{\tilde K}) > 0
\]
due to Assumption (M3).
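To illustrate why this divergence is strictly positive when too few components are used, here is a small Monte Carlo check (Python, illustrative only; the two-component normal mixture and its moment-matched one-component approximation are arbitrary choices of mine, not the paper's model): the estimate of $E_h[\log(h/g)]$ comes out strictly positive.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# "True" parsimonious density h: a two-component normal mixture (arbitrary parameters).
def h(x):
    return 0.6 * normal_pdf(x, -1.0, 1.0) + 0.4 * normal_pdf(x, 2.0, 1.0)

# Underfitted approximation g: the single normal matching the mean and variance of h,
# which is the KL-minimizing one-component Gaussian.
mean_h = 0.6 * (-1.0) + 0.4 * 2.0                              # = 0.2
var_h = 0.6 * (1.0 + 1.0) + 0.4 * (1.0 + 4.0) - mean_h ** 2    # = 3.16
def g(x):
    return normal_pdf(x, mean_h, np.sqrt(var_h))

# Monte Carlo estimate of KL(h, g) = E_h[log(h/g)] using draws from h.
n_sim = 200_000
from_first = rng.random(n_sim) < 0.6
draws = np.where(from_first, rng.normal(-1.0, 1.0, n_sim), rng.normal(2.0, 1.0, n_sim))
print(np.mean(np.log(h(draws) / g(draws))))   # strictly positive: the underfitted class cannot reach KL zero
```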

The goal is to show that
\[
\max_{g \in FM_{\tilde K}} \frac{1}{n} L_n(\psi_{\tilde K}) - \frac{1}{n} L_n(\hat\varphi) < 0.
\]
To do that, start by considering $\max_{g \in FM_{\tilde K}} L_n(\psi_{\tilde K}) - L_n(\varphi)$. First, observe that $\max_{g(x;\psi_{\tilde K}) \in FM_{\tilde K}} L_n(\psi_{\tilde K}) \ge L_n(\tilde\psi_{\tilde K})$, and so
\[
\max_{g(x;\psi_{\tilde K}) \in FM_{\tilde K}} L_n(\psi_{\tilde K}) - L_n(\varphi) \ge L_n(\tilde\psi_{\tilde K}) - L_n(\varphi).
\]

By the Law of Large Numbers,
\begin{align*}
\liminf_{n \to +\infty} \left[\frac{1}{n}\max_{g(x;\psi_{\tilde K}) \in FM_{\tilde K}} L_n(\psi_{\tilde K}) - \frac{1}{n}L_n(\varphi)\right] &\ge \lim_{n \to +\infty} \frac{1}{n}\left[L_n(\tilde\psi_{\tilde K}) - L_n(\varphi)\right] \\
&= \lim_{n \to +\infty} \frac{1}{n}\sum_{i=1}^{n} \log\frac{g(x_i; \tilde\psi_{\tilde K})}{h(x_i; \varphi)} \\
&= -E_h\!\left[\log\frac{h}{\tilde g_{\tilde K}}\right] = -KL(h, \tilde g_{\tilde K}),
\end{align*}
with probability approaching 1.

Next, find an upper bound on
\[
\limsup_{n\to+\infty}\left[\frac{1}{n}\max_{g(x;\psi_{\tilde K}) \in FM_{\tilde K}} L_n(\psi_{\tilde K}) - \frac{1}{n}L_n(\varphi)\right].
\]
Due to the compactness of $\Theta^{\tilde K} \times \Pi_{\tilde K}$, there exist $N_\delta$ ($< +\infty$) balls with diameter $\delta/2$ and centers at $\xi_i$, $i = 1, \ldots, N_\delta$, which cover this parameter space.

By following the proof of Keribin [2000], it can be shown that, due to the additional assumption that $g(x; \psi) < G(x)$ for all $\psi \in \omega_\varepsilon$ and $E|G(x)| < \infty$,
\[
\limsup_{n\to+\infty}\left[\frac{1}{n}\max_{g(x;\psi_{\tilde K}) \in FM_{\tilde K}} L_n(\psi_{\tilde K}) - \frac{1}{n}L_n(\varphi)\right] \le -KL(h, \tilde g_{\tilde K}).
\]
Hence,
\[
\lim_{n\to\infty}\left[\frac{1}{n}\max_{g(x;\psi_{\tilde K}) \in FM_{\tilde K}} L_n(\psi_{\tilde K}) - \frac{1}{n}L_n(\varphi)\right] = -KL(h, \tilde g_{\tilde K}).
\]
Finally, putting everything together:

\begin{align*}
BIC_\lambda - BIC_{true} \le BIC_{\tilde K} - BIC_{true} &= \frac{1}{n}\left[L_n(\hat\psi_{\tilde K}) - L_n(\hat\varphi)\right] - \left(\tilde K - K\right)\frac{\log n}{2n} \\
&= \frac{1}{n}\left[L_n(\hat\psi_{\tilde K}) - L_n(\varphi)\right] - \frac{1}{n}\left[L_n(\hat\varphi) - L_n(\varphi)\right] - \left(\tilde K - K\right)\frac{\log n}{2n} \\
&\le \frac{1}{n}\left[L_n(\hat\psi_{\tilde K}) - L_n(\varphi)\right] - \left(\tilde K - K\right)\frac{\log n}{2n} \\
&\le -KL(h, FM_{\tilde K}) + o_p(1) - \left(\tilde K - K\right)\frac{\log n}{2n} < 0 \quad \text{with probability approaching 1},
\end{align*}
where the third line follows from $L_n(\hat\varphi) \ge L_n(\varphi)$, and the final inequality is due to the fact that $\tilde K - K$ is finite, $\log n / n \to 0$, and $KL(h, FM_{\tilde K}) > 0$.

The above is true for any $\lambda$ such that $\dim \hat\pi(\lambda) < K$, and since there is a finite number of $\tilde K < K$, the above inequality holds uniformly, and so
\[
P\!\left(\sup_{\lambda:\, \dim \hat\pi(\lambda) < K} BIC_\lambda < BIC_{true}\right) \to 1.
\]

• Overselection of components:

Next, let $\lambda$ be such that $\dim \hat\pi(\lambda) = \tilde K$, where $\tilde K > K$. Again, $BIC_\lambda \le BIC_{\tilde K}$, so concentrate on a finite mixture model with $\tilde K$ components estimated by maximization of the likelihood function.

It follows from the Feng and McCulloch [1996] result that $\hat\psi_{\tilde K} = \operatorname{argmax} L_n(\psi_{\tilde K})$ is a $\sqrt{n}$-consistent maximizer, in the sense that it lies in an $n^{-1/2}$-neighborhood of the identified set $\tilde\Psi_{0,\tilde K}$. Let $\tilde\psi_{\tilde K}$ be the point from $\tilde\Psi_{0,\tilde K}$ closest to it. Take a Taylor expansion of the likelihood function around this point:
\[
L_n\!\left(\hat\psi_{\tilde K}\right) - L_n\!\left(\tilde\psi_{\tilde K}\right) = O_p(n^{-1/2})\,\frac{\partial L_n\!\left(\tilde\psi_{\tilde K}\right)}{\partial \psi_{\tilde K}^T}\, u + O_p(n^{-1})\, u\,\frac{\partial^2 L_n\!\left(\tilde\psi_{\tilde K}\right)}{\partial \psi_{\tilde K}\, \partial \psi_{\tilde K}^T}\, u^T + R_n\!\left(u, \hat\psi_{\tilde K}\right) = O_p(1).
\]

Now, take a Taylor expansion of $L_n(\hat\varphi)$ around $\varphi$:
\[
L_n(\hat\varphi) - L_n(\varphi) = O_p(n^{-1/2})\,\frac{\partial L_n(\varphi)}{\partial \varphi^T}\, u + O_p(n^{-1})\, u\,\frac{\partial^2 L_n(\varphi)}{\partial \varphi\, \partial \varphi^T}\, u^T + R_n(u, \hat\varphi) = O_p(1).
\]

Combining these two Taylor expansions allows one to conclude that
\[
L_n\!\left(\hat\psi_{\tilde K}\right) - L_n(\hat\varphi) = O_p(1),
\]
and hence
\begin{align*}
n\left(BIC_\lambda - BIC_{true}\right) \le n\left(BIC_{\tilde K} - BIC_{true}\right) &= \left[L_n\!\left(\hat\psi_{\tilde K}\right) - L_n(\hat\varphi)\right] - \left(\tilde K - K\right)\frac{\log n}{2} \\
&= O_p(1) - \left(\tilde K - K\right)\frac{\log n}{2} < 0
\end{align*}
with probability approaching 1, because $\tilde K > K$ and $\log n \to +\infty$.

The above is true for any $\lambda$ such that $\dim \hat\pi(\lambda) > K$, and since there is a finite number of $\tilde K > K$ but no larger than $\bar K$, the above inequality holds uniformly, and so
\[
P\!\left(\sup_{\lambda:\, \dim \hat\pi(\lambda) > K} BIC_\lambda < BIC_{true}\right) \to 1,
\]
which completes the second step.

These two steps show that whichever $\lambda_n^{BIC}$ maximizes the BIC criterion, it cannot be one that leads to under- or overselection of the components in the mixture, which completes the proof.

C Application: different specifications

Table 7 shows the results of estimating the model from Section 6.2 with
\[
f(z; \theta) = f_L\!\left(a_i - y_i - \gamma_k(x_i - y_i);\, 0, \sigma_k\right),
\]
where $f_L(\cdot\,; \mu, \sigma)$ denotes the pdf of a Laplace distribution with location parameter $\mu$ and scale parameter $\sigma$, i.e.,
\[
f_L(x; 0, \sigma_k) = \frac{1}{2\sigma_k}\exp\!\left\{-\frac{|x|}{\sigma_k}\right\}.
\]
The variance of the Laplace distribution with scale parameter $\sigma$ is $2\sigma^2$.
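As a minimal numerical check of this specification (Python, illustrative only; the scale value is an arbitrary choice), the snippet below evaluates the Laplace density above and confirms the variance $2\sigma^2$ by simulation.

```python
import numpy as np

def laplace_pdf(x, mu=0.0, sigma=1.0):
    """Laplace density with location mu and scale sigma, as in the display above."""
    return np.exp(-np.abs(x - mu) / sigma) / (2.0 * sigma)

rng = np.random.default_rng(0)
sigma_k = 0.5                            # an arbitrary scale, for illustration
draws = rng.laplace(loc=0.0, scale=sigma_k, size=1_000_000)

print(laplace_pdf(0.0, sigma=sigma_k))   # density at the mode, 1/(2*sigma) = 1.0
print(draws.var(), 2 * sigma_k ** 2)     # simulated variance vs. theoretical 2*sigma^2
```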

Table 8: Estimation of the coordination game for different values of r

  r     γ1     πˆ1    σˆ1    γ2     πˆ2    σˆ2    γ3     πˆ3   σˆ3   γ4     πˆ4   σˆ4   γ∞    πˆ∞    σˆ∞
  0.5   0.5    0.62   0.38   0.375  0      -      0.344  0     -     0.336  0     -     1/3   0.38   1.57
  0.75  0.5    0.175  0.03   0.312  0.485  1.82   0.242  0     -     0.216  0     -     0.2   0.34   1.52

References

Victor Aguirregabiria and Pedro Mira. Dynamic discrete choice structural models: A survey. Journal

of Econometrics, 156(1):38–67, 2010.

Khalaf E Ahmad and Essam K Al-Hussaini. Remarks on the non-identifiability of mixtures of

distributions. Annals of the Institute of Statistical Mathematics, 34(1):543–544, 1982.

Larbi Alaoui and Antonio Penta. Endogenous depth of reasoning. The Review of Economic Studies,

page rdv052, 2015.

Donald WK Andrews and Xu Cheng. Estimation and inference with weak, semi-strong, and strong

identification. Econometrica, pages 2153–2211, 2012.

Bryon Aragam and Qing Zhou. Concave penalized estimation of sparse Gaussian Bayesian networks. arXiv preprint arXiv:1401.0852, 2014.

Steven Berry and Elie Tamer. Identification in models of oligopoly entry. 2006.

Stéphane Bonhomme, Koen Jochmans, and Jean-Marc Robin. Non-parametric estimation of fi-

nite mixtures from repeated measurements. Journal of the Royal Statistical Society: Series B

(Statistical Methodology), pages n/a–n/a, 2015. ISSN 1467-9868. doi: 10.1111/rssb.12110. URL

http://dx.doi.org/10.1111/rssb.12110.

Antoni Bosch-Domènech, José G Montalvo, Rosemarie Nagel, and Albert Satorra. A finite mixture

analysis of beauty-contest data using generalized beta distributions. Experimental economics, 13

(4):461–475, 2010.

Simon A Broda, Markus Haas, Jochen Krause, Marc S Paolella, and Sven C Steude. Stable mixture GARCH models. Journal of Econometrics, 172(2):292–306, 2013.

Colin F Camerer, Teck-Hua Ho, and Juin-Kuan Chong. A cognitive hierarchy model of games. The

Quarterly Journal of Economics, pages 861–898, 2004.

Jiahua Chen and J. D. Kalbfleisch. Penalized minimum-distance estimates in finite mixture models.

The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 24(2):167–175, 1996.

Jiahua Chen and Abbas Khalili. Order selection in finite mixture models with a nonsmooth penalty.

Journal of the American Statistical Association, 103(484), 2008.

Xiaohong Chen, Maria Ponomareva, and Elie Tamer. Likelihood inference in some finite mixture

models. Journal of Econometrics, 182(1):87–99, 2014.

Camille Cornand and Frank Heinemann. Measuring agents’ reaction to private and public informa-

tion in games with strategic complementarities. Experimental Economics, 17(1):61–77, 2014.

Didier Dacunha-Castelle and Elisabeth Gassiat. Testing the order of a model using locally conic parametrization: population mixtures and stationary ARMA processes. The Annals of Statistics, 27(4):1178–1209, 1999.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

Ziding D Feng and Charles E McCulloch. Using bootstrap likelihood ratios in finite mixture models.

Journal of the Royal Statistical Society. Series B (Methodological), pages 609–617, 1996.

Markus Haas and Marc S Paolella. Mixture and regime-switching GARCH models. Handbook of Volatility Models and Their Applications, pages 71–102, 2012.

James D Hamilton. Regime switching models. In Macroeconometrics and Time Series Analysis,

pages 202–209. Springer, 2010.

JA Hartigan. A failure of likelihood asymptotics for normal mixtures. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, volume 2, pages 807–810. Wadsworth, Belmont, CA, 1985.

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The elements of statistical learning, volume 1. Springer, New York, 2001.

Marc Henry, Yuichi Kitamura, and Bernard Salanié. Partial identification of finite mixtures in

econometric models. Quantitative Economics, 5(1):123–144, 2014.

Teck-Hua Ho, Colin Camerer, and Keith Weigelt. Iterated dominance and iterated best response in experimental "p-beauty contests". The American Economic Review, 88(4):947–969, 1998.

Hajo Holzmann, Axel Munk, and Tilmann Gneiting. Identifiability of finite mixtures of elliptical

distributions. Scandinavian journal of statistics, 33(4):753–763, 2006.

Jian Huang, Joel L Horowitz, and Shuangge Ma. Asymptotic properties of bridge estimators in

sparse high-dimensional regression models. The Annals of Statistics, pages 587–613, 2008.

Tao Huang, Heng Peng, and Kun Zhang. Model selection for Gaussian mixture models. arXiv preprint arXiv:1301.3558, 2013.

Hemant Ishwaran, Lancelot F James, and Jiayang Sun. Bayesian model selection in finite mixtures

by marginal density decompositions. Journal of the American Statistical Association, 96(456),

2001.

Lancelot F James, Carey E Priebe, and David J Marchette. Consistent estimation of mixture complexity. Annals of Statistics, pages 1281–1296, 2001.

Hiroyuki Kasahara and Katsumi Shimotsu. Nonparametric identification of finite mixture models

of dynamic discrete choices. Econometrica, pages 135–175, 2009.

Hiroyuki Kasahara and Katsumi Shimotsu. Testing the number of components in normal mixture

regression models. Journal of the American Statistical Association, 110(512):1632–1645, 2015. doi:

10.1080/01621459.2014.986272. URL http://dx.doi.org/10.1080/01621459.2014.986272.

Michael P Keane and Kenneth I Wolpin. The career decisions of young men. Journal of political

Economy, 105(3):473–522, 1997.

Christine Keribin. Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A, pages 49–66, 2000.

J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence

of infinitely many incidental parameters. Ann. Math. Statist., 27(4):887–906, 12 1956. doi:

10.1214/aoms/1177728066. URL http://dx.doi.org/10.1214/aoms/1177728066.

Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. Annals of statistics, pages

1356–1378, 2000.

Erich Leo Lehmann and George Casella. Theory of point estimation, volume 31. Springer Science & Business Media, 1998.

Xin Liu and Yongzhao Shao. Asymptotics for likelihood ratio tests under loss of identifiability.

Annals of Statistics, pages 807–832, 2003.

Geoffrey McLachlan and David Peel. Finite mixture models. Wiley, 2004.

Patrick AP Moran. Maximum-likelihood estimation in non-standard conditions. In Mathemati-

cal Proceedings of the Cambridge Philosophical Society, volume 70, pages 441–450. Cambridge

University Press, 1971.

Stephen Morris and Hyun Song Shin. Social value of public information. The American Economic

Review, 92(5):1521–1534, 2002.

Rosemarie Nagel. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313–1326, 1995.

Benedikt M Pötscher and Hannes Leeb. On the distribution of penalized maximum likelihood estimators: The lasso, SCAD, and thresholding. Journal of Multivariate Analysis, 100(9):2065–2082, 2009.

BLS Prakasa Rao. Identifiability in stochastic models: characterization of probability distributions.

Academic Press, 1992.

Richard Redner. Note on the consistency of the maximum likelihood estimate for nonidentifiable

distributions. The Annals of Statistics, 9(1):225–228, 1981.

Pranab Kumar Sen and Jayanta Kumar Ghosh. On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, 1985.

Dale O Stahl and Paul W Wilson. On players’ models of other players: Theory and experimental

evidence. Games and Economic Behavior, 10(1):218–254, 1995.

Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 1265–1269, 1963.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical

Society. Series B (Methodological), pages 267–288, 1996.

Hansheng Wang, Runze Li, and Chih-Ling Tsai. Tuning parameter selectors for the smoothly clipped

absolute deviation method. Biometrika, 94(3):553–568, 2007.

Mi-Ja Woo and T. N. Sriram. Robust estimation of mixture complexity. Journal of the American Statistical Association, 101(476):1475–1486, 2006.

Sidney J Yakowitz and John D Spragins. On the identifiability of finite mixtures. The Annals of

Mathematical Statistics, pages 209–214, 1968.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of

Statistics, pages 894–942, 2010a.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010b.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association,

101(476):1418–1429, 2006.

Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals

of statistics, 36(4):1509, 2008.
