
Local M-Estimation with Discontinuous Criterion for Dependent and Limited Observations


MYUNG HWAN SEO AND TAISUKE OTSU

Abstract. This paper examines local M-estimators and their asymptotic properties under sets of high-level assumptions, which define three classes of local M-estimators. The conditions are sufficiently general to cover the minimum volume predictive region, the conditional maximum score estimator for a panel discrete choice model, and many other widely used estimators in statistics and econometrics. Specifically, they allow for a discontinuous criterion function of weakly dependent observations, which is localized by kernel smoothing and contains a nuisance parameter whose dimension may grow to infinity. Furthermore, the localization can occur around parameter values rather than a fixed point, and the observations may take limited values, which leads to set estimators. Our theory produces three different nonparametric cube root rates and enables valid inference for the local M-estimators, building on novel maximal inequalities for weakly dependent data. Our results include the standard cube root asymptotics as a special case. To illustrate the usefulness of our results, we verify our conditions for various examples such as the Hough transform estimator with diminishing bandwidth, the dynamic maximum score estimator, and many others.

1. Introduction

There is a class of estimation problems in statistics where a point (or set-valued) estimator is obtained by maximizing a discontinuous and possibly localized criterion function. As a prototype, consider the estimation of a simplified version of the minimum volume predictive region for y given x = c (Polonik and Yao, 2000). Let I{·} be the indicator function, K(·) be a kernel function, and h_n be a bandwidth. For a prespecified significance level α, the estimator [θ̂ ± ν̂] is obtained by the following M-estimation:

    max_θ Σ_{t=1}^n I{−ν̂ ≤ y_t − θ ≤ ν̂} K((x_t − c)/h_n),    (1)

where ν̂ = inf{ν : sup_θ Σ_{t=1}^n I{−ν ≤ y_t − θ ≤ ν} K((x_t − c)/h_n) / Σ_{t=1}^n K((x_t − c)/h_n) ≥ α}. This estimation exhibits several highly nonregular features, such as the discontinuity of the criterion function, local smoothing by the kernel component K((x_t − c)/h_n), and serial dependence in the data, which have

Key words and phrases. Cube root asymptotics, Maximal inequality, Mixing process, Partial identification, Parameter-dependent localization. The authors would like to thank Aureo de Paula, Marine Carrasco, Javier Hidalgo, Dennis Kristensen, Benedikt Pötscher, Peter Robinson, Kyungchul Song, Yoon-Jae Whang, and seminar and conference participants at Cambridge, CIREQ in Montreal, CORE in Louvain, CREATES in Aarhus, IHS in Vienna, ISNPS in Cádiz, LSE, Surrey, UCL, Vienna, and York for helpful comments. Financial support from the ERC Consolidator Grant (SNP 615882) is gratefully acknowledged (Otsu).

prevented a full-blown asymptotic analysis of the M-estimator θ̂; only its consistency has been reported in the literature.

This type of M-estimation has numerous applications. Since Chernoff's (1964) study on estimation of the mode, many papers have raised such estimation problems, e.g., the shorth (Andrews et al., 1972), least median of squares (Rousseeuw, 1984), nonparametric monotone density estimation (Prakasa Rao, 1969), and maximum score estimation (Manski, 1975). These classical examples are studied in a seminal work by Kim and Pollard (1990), which explained elegantly how this type of estimation problem induces the so-called cube root phenomena (i.e., the estimators converge at the cube root rate to some non-normal distributions instead of the familiar square root rate to normals) in a unified framework by means of empirical process theory. See also van der Vaart and Wellner (1996) and Kosorok (2008) for a general theory of M-estimation based on empirical process theory. However, these works do not cover the estimation problem in (1) due to, among other things, their focus on cross-sectional data. It should be emphasized that this is not a pathological example. We provide several other relevant examples in Section 3 and the supplementary materials, which include the well-known Honoré and Kyriazidou (2000) estimator for the dynamic panel discrete choice model. Furthermore, we propose a new binary choice model with random coefficients and a localized maximum score estimator in Section 3.2. This paper covers a broader class of M-estimators than the above examples suggest. The baseline scenario above (called local M-estimation due to the localization at x = c) is generalized in two directions. First, we accommodate not only variables taking limited values (e.g., interval-valued data), which typically lead to estimation of a set rather than a point, but also nuisance parameters with growing dimension.
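To fix ideas, the prototype (1) can be computed by direct grid search. The following minimal sketch uses a Gaussian kernel and simulated data; the function names, grids, and tuning choices are illustrative assumptions of ours, not part of Polonik and Yao's (2000) procedure.

```python
import numpy as np

def mvp_region(y, x, c, h, alpha, theta_grid, nu_grid):
    """Grid-search sketch of the minimum volume predictive region in (1).

    nu_hat is the smallest half-width nu (on nu_grid, ascending) such that
    some interval [theta - nu, theta + nu] captures kernel-weighted mass
    >= alpha; theta_hat then maximizes the weighted count at nu_hat."""
    w = np.exp(-0.5 * ((x - c) / h) ** 2)        # Gaussian kernel weights
    total = w.sum()

    def coverage(theta, nu):
        return w[np.abs(y - theta) <= nu].sum() / total

    nu_hat = None
    for nu in nu_grid:
        if max(coverage(th, nu) for th in theta_grid) >= alpha:
            nu_hat = nu
            break
    obj = [coverage(th, nu_hat) for th in theta_grid]
    theta_hat = theta_grid[int(np.argmax(obj))]
    return theta_hat, nu_hat

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = x + 0.2 * rng.standard_normal(2000)          # y centered at x
theta_hat, nu_hat = mvp_region(y, x, c=0.5, h=0.1, alpha=0.8,
                               theta_grid=np.linspace(-1, 2, 61),
                               nu_grid=np.linspace(0.05, 1.0, 20))
```

With these choices θ̂ should land near 0.5 and ν̂ near the 80% half-width of the local noise; note the criterion is a step function of θ, which is exactly the discontinuity the theory below must handle.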
Set estimation problems due to limited data are also known as partial identification problems in the econometrics literature (e.g., Manski and Tamer, 2002). It is also novel to accommodate high-dimensional nuisance parameters in M-estimation with a discontinuous criterion. Second, we allow the localization to depend on the parameter θ instead of a prespecified value c. For instance, the criterion may take the form Σ_{t=1}^n I{|y_t − θ| ≤ h_n} with h_n → 0. Relevant examples include mode estimation (Chernoff, 1964, and Lee, 1989) and the Hough transform estimator in image analysis (Goldenshluger and Zeevi, 2004). Henceforth, we call this case parameter-dependent local M-estimation. Parameter-dependence brings some new features to our asymptotic analysis, but in a different way from a classical example of parameter-dependency on the support (e.g., maximum likelihood for Uniform[0, θ]).

The contribution of this paper is to develop a general asymptotic theory for such M-estimation with discontinuous and possibly localized criterions. Our theoretical results cover all the examples above and can be used to establish limit laws for point estimators and convergence rates for set estimators. To this end, we develop certain maximal inequalities, which enable us to obtain the nonparametric cube root rates (nh_n)^{1/3}, {nh_n/log(nh_n)}^{1/3}, and (nh_n²)^{1/3} for local M-estimation, the limited variable case, and the parameter-dependent case, respectively. These inequalities are extended to establish stochastic equicontinuity of normalized processes of the criterions so that an argmax theorem delivers

limit laws of the M-estimators. It is worth noting that all the conditions are characterized through moment conditions that can be easily verified, as illustrated in the examples. Thus, our results can be applied without prior knowledge of empirical process theory. It is often not trivial to verify entropy conditions such as uniform manageability in Kim and Pollard (1990). Particularly for dependent data, the covering or bracketing numbers often need to be calculated using a norm that hinges on the mixing coefficients and the distribution of the data (e.g., the L_{2,β}-norm in Doukhan, Massart and Rio, 1995).

Another contribution is that we allow for weakly dependent data, which is associated with exponential mixing decay rates of absolutely regular processes. In some applications, the cube root asymptotics has been extended to time series data, such as Anevski and Hössjer (2006) for monotone density estimation, Zinde-Walsh (2002) for least median of squares, de Jong and Woutersen (2011) for maximum score, and Koo and Seo (2015) for break point estimation under misspecification. However, it is not clear whether they are able to handle the general class of estimation problems in this paper.

The paper is organized as follows. Section 2 develops the asymptotic theory for local M-estimation, and Section 3 provides some examples. In Section 4, we generalize the asymptotic theory to the cases of limited variables (Section 4.1) and parameter-dependent localization (Section 4.2). Section 5 concludes. All proofs and some additional examples are contained in the supplementary material.
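Before proceeding, here is the parameter-dependent case mentioned above, the criterion Σ_{t=1}^n I{|y_t − θ| ≤ h_n}, in executable form: a minimal Chernoff-type mode estimator by grid search (the bandwidth, grid, and data generating process below are illustrative choices of ours).

```python
import numpy as np

def mode_estimator(y, h, theta_grid):
    # Parameter-dependent local criterion: count observations in [theta-h, theta+h]
    counts = [(np.abs(y - th) <= h).sum() for th in theta_grid]
    return theta_grid[int(np.argmax(counts))]

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=20_000)
theta_hat = mode_estimator(y, h=0.3, theta_grid=np.linspace(-1.0, 5.0, 121))
```

The localization window [θ − h_n, θ + h_n] moves with the parameter itself, which is what distinguishes this case from localization at a fixed point c and yields the (nh_n²)^{1/3} rate.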

2. Local M-estimation with discontinuous criterion

Let us consider the M-estimator θ̂ that maximizes

    P_n f_{n,θ} = (1/n) Σ_{t=1}^n f_{n,θ}(z_t),

where {f_{n,θ} : θ ∈ Θ} is a sequence of criterions indexed by the parameter θ ∈ Θ ⊆ R^d, and {z_t} is a strictly stationary sequence of random variables with marginal distribution P. We introduce a set of conditions for f_{n,θ} which induces a possibly localized counterpart of Kim and Pollard's (1990) cube root asymptotics. Their cube root asymptotics can be viewed as a special case of ours, where f_{n,θ} = f_θ for all n. Let Pf = ∫ f dP for a function f, |·| be the Euclidean norm of a vector, and ‖·‖_2 be the L_2(P)-norm of a random variable. The class of criterions of our interest is characterized as follows.

Assumption M. For a sequence {h_n} of positive numbers with nh_n → ∞, {f_{n,θ} : θ ∈ Θ} satisfies the following conditions.

(i): {h_n f_{n,θ} : θ ∈ Θ} is a class of uniformly bounded functions. Also, lim_{n→∞} P f_{n,θ} is uniquely maximized at θ_0, and P f_{n,θ} is twice continuously differentiable at θ_0 for all n large enough and admits the expansion

    P(f_{n,θ} − f_{n,θ_0}) = (1/2)(θ − θ_0)′V(θ − θ_0) + o(|θ − θ_0|²) + o((nh_n)^{−2/3}),    (2)

for a negative definite matrix V.

(ii): There exist positive constants C and C′ such that

    |θ_1 − θ_2| ≤ C h_n^{1/2} ‖f_{n,θ_1} − f_{n,θ_2}‖_2,

for all n large enough and all θ_1, θ_2 ∈ {θ ∈ Θ : |θ − θ_0| ≤ C′}.

(iii): There exists a positive constant C″ such that

    P sup_{θ∈Θ: |θ−θ′|<ε} h_n |f_{n,θ} − f_{n,θ′}| ≤ C″ ε,    (3)

for all n large enough, ε > 0 small enough, and θ′ in a neighborhood of θ_0.

Although we are primarily interested in the case h_n → 0, we do not exclude the case h_n = 1. When h_n → 0, the sequence {h_n} is usually a bandwidth sequence for localization. Although we cover Kim and Pollard's setup as a special case, our conditions appear somewhat different from theirs. In fact, our conditions consist of directly verifiable moment conditions without resorting to notions of empirical process theory such as uniform manageability.

Assumption M (i) contains boundedness, point identification of θ_0, and a quadratic approximation of P f_{n,θ} at θ = θ_0. Boundedness of {h_n f_{n,θ} : θ ∈ Θ} is a major requirement, but it is satisfied for all examples in this paper and in Kim and Pollard (1990). In the next section, we relax the assumption of point identification of θ_0. When the criterion f_{n,θ} involves a kernel estimate of a nonparametric component, it typically takes the form f_{n,θ}(z) = (1/h_n) K((x − c)/h_n) m(y, x, θ) for z = (y, x) (see (1) and the examples in Section 3).

Despite the discontinuity of f_{n,θ}, the population criterion P f_{n,θ} is smooth and approximated by a quadratic function as in (2). This distinguishes our estimation problem from that of a change-point in regression, which also involves a discontinuous criterion function but whose change-point estimator is super-consistent (see, e.g., Chan, 1993), unless the estimating equation is misspecified, as in the split point estimator for decision trees (Bühlmann and Yu, 2002, and Banerjee and McKeague, 2007). Assumption M (ii) is used to relate the L_2(P)-norm ‖f_{n,θ} − f_{n,θ_0}‖_2 for the criterions to the Euclidean norm |θ − θ_0| for the parameters. This condition, which is implicit in Kim and Pollard (1990, Condition (v)) under independent observations, is often verified in the course of checking the expansion in (2). Assumption M (iii), an envelope condition for the class {f_{n,θ} − f_{n,θ′} : |θ − θ′| < ε}, plays a key role for the cube root asymptotics. It should be noted that for the familiar square root asymptotics, the upper bound in (3) is of order ε² instead of ε. This assumption merges three conditions in Kim and Pollard (1990): their envelope conditions ((vi) and (vii)) and uniform manageability of the class {f_{n,θ} − f_{n,θ_0} : |θ − θ_0| < ε}. It is often the case that verifying the envelope condition for arbitrary θ′ in a neighborhood of θ_0 is not more demanding than that for θ_0.
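The order-ε bound in Assumption M (iii) can be checked numerically in a simple case. The sketch below takes the maximum score type class f_θ(x) = I{x′θ ≥ 0} with d = 2 and standard normal x; the envelope expectation is then the probability of a pair of wedges of angular width of order ε around the boundary x′θ_0 = 0, so the ratio to ε should stabilize near 2/π. (We measure ε as an angle, a simplification of ours.)

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.standard_normal((200_000, 2))
phi = np.arctan2(x[:, 1], x[:, 0])          # angle of x; take theta_0 = (1, 0)

def envelope_prob(eps):
    # P sup_{|theta - theta_0| < eps} |I{x'theta >= 0} - I{x'theta_0 >= 0}|:
    # the sup equals 1 exactly when the angle of x lies within eps of the
    # boundary directions +-pi/2, a wedge pair of total probability 2*eps/pi.
    dist = np.minimum(np.abs(phi - np.pi / 2), np.abs(phi + np.pi / 2))
    return (dist <= eps).mean()

ratios = [envelope_prob(e) / e for e in (0.05, 0.1, 0.2)]   # each ~ 2/pi
```

The linear (rather than quadratic) dependence on ε seen here is precisely what separates cube root from square root asymptotics.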

We now study the limiting behavior of the M-estimator, which is precisely defined as any θ̂ satisfying

    P_n f_{n,θ̂} ≥ sup_{θ∈Θ} P_n f_{n,θ} − o_p((nh_n)^{−2/3}).    (4)

The first step is to establish weak consistency θ̂ →_p θ_0, which is rather standard and is usually shown by establishing the uniform convergence sup_{θ∈Θ} |P_n f_{n,θ} − P f_{n,θ}| →_p 0. Thus, in this section we simply assume the consistency of θ̂ (see Section 3 and the supplementary materials for some illustrations).

The next step is to derive the convergence rate of θ̂. A key ingredient for this step is to obtain the modulus of continuity of the centered empirical process {G_n h_n^{1/2}(f_{n,θ} − f_{n,θ_0}) : θ ∈ Θ} by a certain maximal inequality, where G_n f = √n(P_n f − P f) for a function f. If f_{n,θ} does not vary with n and {z_t} is independent, then several maximal inequalities are available in the literature (e.g., Kim and Pollard, 1990, p. 199). If f_{n,θ} varies with n and/or {z_t} is dependent, then to the best of our knowledge there is no maximal inequality which can be applied to the class of functions in Assumption M. Our first task is thus to establish such a maximal inequality.

To proceed, we now characterize the dependence structure of the data. Among several notions of dependence, this paper focuses on absolutely regular processes (see Doukhan, Massart and Rio, 1995, for details on the empirical process theory of absolutely regular processes). Let F_{−∞}^0 and F_m^∞ be the σ-fields generated by {..., z_{−1}, z_0} and {z_m, z_{m+1}, ...}, respectively. Define the β-mixing coefficient

    β_m = (1/2) sup Σ_{(i,j)∈I×J} |P{A_i ∩ B_j} − P{A_i}P{B_j}|,

where the supremum is taken over all finite partitions {A_i}_{i∈I} and {B_j}_{j∈J} that are F_{−∞}^0- and F_m^∞-measurable, respectively. Throughout the paper, we maintain the following assumption on the data {z_t}.

Assumption D. {z_t} is a strictly stationary and absolutely regular process with β-mixing coefficients {β_m} such that β_m = O(ρ^m) for some 0 < ρ < 1.
This assumption obviously covers the case of independent observations, and says that the mixing coefficients β_m should decay at an exponential rate.¹ For example, various Markov, GARCH, and stochastic volatility models satisfy this assumption (Carrasco and Chen, 2002). The maximal inequality for the empirical process G_n h_n^{1/2}(f_{n,θ} − f_{n,θ_0}) is presented as follows.

Lemma M. Under Assumptions M and D, there exist positive constants C and C′ such that

    P sup_{θ∈Θ: |θ−θ_0|<δ} |G_n h_n^{1/2}(f_{n,θ} − f_{n,θ_0})| ≤ C δ^{1/2},

for all n large enough and δ ∈ [(nh_n)^{−1/2}, C′]. This lemma implies the following result.

¹ Indeed, polynomial decay rates of β_m are often associated with strong dependence and long-memory-type behavior in sample statistics. See Chen, Hansen and Carrasco (2010) and references therein. In this case, the asymptotic analysis for the M-estimator would become very different.
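As a concrete instance of Assumption D, a stationary Gaussian AR(1) process is absolutely regular with geometrically decaying mixing coefficients (cf. Carrasco and Chen, 2002). The sketch below simulates one and checks that its autocorrelations decay like ρ^m; the autocorrelation is only a rough proxy for, not a computation of, β_m.

```python
import numpy as np

# Stationary Gaussian AR(1): z_t = rho * z_{t-1} + e_t, e_t ~ N(0, 1).
rng = np.random.default_rng(2)
rho, n = 0.7, 100_000
z = np.empty(n)
z[0] = rng.standard_normal() / np.sqrt(1 - rho ** 2)   # stationary initial draw
for t in range(1, n):
    z[t] = rho * z[t - 1] + rng.standard_normal()

def acf(z, m):
    # sample autocorrelation at lag m
    zc = z - z.mean()
    return float((zc[:-m] * zc[m:]).mean() / zc.var())

corrs = [acf(z, m) for m in (1, 2, 5)]   # should track rho, rho^2, rho^5
```

Exponential decay of dependence at every lag is what makes the maximal inequalities below attainable; with only polynomial decay (see footnote 1) the analysis would change substantially.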

Lemma 1. Under Assumptions M and D, for each ε > 0, there exist random variables {R_n} of order O_p(1) and a positive constant C such that

    |P_n(f_{n,θ} − f_{n,θ_0}) − P(f_{n,θ} − f_{n,θ_0})| ≤ ε|θ − θ_0|² + (nh_n)^{−2/3} R_n²,

for all θ ∈ Θ satisfying (nh_n)^{−1/3} ≤ |θ − θ_0| ≤ C.

We now derive the convergence rate of θ̂. Suppose |θ̂ − θ_0| ≥ (nh_n)^{−1/3}. Then we can take a positive constant c such that

    o_p((nh_n)^{−2/3}) ≤ P_n(f_{n,θ̂} − f_{n,θ_0}) ≤ P(f_{n,θ̂} − f_{n,θ_0}) + ε|θ̂ − θ_0|² + (nh_n)^{−2/3} R_n²
        ≤ (−c + ε)|θ̂ − θ_0|² + o(|θ̂ − θ_0|²) + O_p((nh_n)^{−2/3}),

for each ε > 0, where the first inequality follows from (4), the second from Lemma 1, and the third from Assumption M (i). Taking ε small enough to satisfy c − ε > 0 yields the convergence rate θ̂ − θ_0 = O_p((nh_n)^{−1/3}).

Given the convergence rate of θ̂, the final step is to derive the limiting distribution. To this end, we apply a continuous mapping theorem for an argmax element (e.g., Kim and Pollard, 1990, Theorem 2.7). A key ingredient for this argument is to establish weak convergence of the centered and normalized empirical process

    Z_n(s) = n^{1/6} h_n^{2/3} G_n(f_{n,θ_0+s(nh_n)^{−1/3}} − f_{n,θ_0}),

for |s| ≤ K with any K > 0. Weak convergence of the process Z_n may be characterized by its finite dimensional convergence and tightness (or stochastic equicontinuity). If f_{n,θ} does not vary with n and {z_t} is independent, as in Kim and Pollard (1990), a classical central limit theorem combined with the Cramér-Wold device implies finite dimensional convergence, and a maximal inequality on a suitably regularized class of functions guarantees tightness of the process of criterion functions. We adapt this approach to our local M-estimation with possibly dependent observations satisfying Assumption D. Let β(·) be a function such that β(t) = β_{[t]} if t ≥ 1 and β(t) = 1 otherwise, and let β^{−1}(·) be the càdlàg inverse of β(·). Also let Q_g(u) be the inverse function of the tail probability function x ↦ P{|g(z_t)| > x}. For finite dimensional convergence, we adapt Rio's (1997, Corollary 1) central limit theorem for α-mixing arrays to our setup.

Lemma C. Suppose Assumption D holds true, P g_n = 0, and

    sup_n ∫_0^1 β^{−1}(u) Q_{g_n}(u)² du < ∞.    (5)

Then Σ = lim_{n→∞} Var(G_n g_n) exists and G_n g_n →_d N(0, Σ).

The finite dimensional convergence of Z_n follows from Lemma C by setting g_n as any finite dimensional projection of the process {g_{n,s} − P g_{n,s} : s}, with g_{n,s} = n^{1/6} h_n^{2/3} (f_{n,θ_0+s(nh_n)^{−1/3}} − f_{n,θ_0}).

6 The requirement in (5) is an adaptation of the Lindeberg-type condition in Rio (1997, Corollary

1). The condition (5) excludes polynomial decay of β_m. It should be noted that for criterions satisfying Assumption M, the (2+δ)-th moments P|g_{n,s}|^{2+δ} for δ > 0 typically diverge because those criterions usually involve indicator functions. Thus we cannot apply central limit theorems for mixing sequences with higher than second moments. To verify (5), the following lemma is often useful.

Lemma 2. Suppose Assumptions M and D hold true, and there is a positive constant c such that P{|g_{n,s}| ≥ c} ≤ c(nh_n²)^{−1/3} for all n large enough and each s. Then (5) holds true.

In our examples, g_{n,s} is zero or close to zero with high probability, so the condition P{|g_{n,s}| ≥ c} ≤ c(nh_n²)^{−1/3} is easily satisfied. See Section 3.1 for an illustration. To establish tightness of the normalized process Z_n, we show the following maximal inequality.

Lemma M′. Suppose Assumption D holds true. Consider a sequence of classes of functions G_n = {g_{n,s} : |s| ≤ K} for some K > 0, with envelope functions G_n. Suppose there is a positive constant C such that

    P sup_{s: |s−s′|<ε} |g_{n,s} − g_{n,s′}| ≤ C ε,    (6)

for all n large enough, |s′| ≤ K, and ε > 0 small enough.² Also assume that there exist 0 ≤ λ < 1/2 and C′ > 0 such that G_n ≤ C′ n^λ and ‖G_n‖_2 ≤ C′ for all n large enough. Then for any δ > 0, there exist ε > 0 and a positive integer N_ε such that

    P sup_{|s−s′|<ε} |G_n(g_{n,s} − g_{n,s′})| ≤ δ,

for all n ≥ N_ε. Tightness of the process Z_n follows from Lemma M′ by setting g_{n,s} = n^{1/6} h_n^{2/3} (f_{n,θ_0+s(nh_n)^{−1/3}} − f_{n,θ_0}). Note that the condition (6) is satisfied by Assumption M (iii). Compared to Lemma M, which is used to derive the convergence rate of θ̂, Lemma M′ is applied only to establish tightness of the process Z_n. Therefore, we do not need an exact decay rate on the right hand side of the maximal inequality.³

Based on finite dimensional convergence and tightness of Z_n, shown by Lemmas C and M′ respectively, we establish weak convergence of Z_n. Then a continuous mapping theorem for an argmax element (Kim and Pollard, 1990, Theorem 2.7) yields the limiting distribution of the M-estimator θ̂. The main theorem of this section is presented as follows.

Theorem 1. Suppose that Assumptions M and D hold, θ̂ defined in (4) converges in probability to θ_0 ∈ int Θ, and (5) holds with g_{n,s} − P g_{n,s} for each s, where g_{n,s} = n^{1/6} h_n^{2/3} (f_{n,θ_0+s(nh_n)^{−1/3}} − f_{n,θ_0}).

² The upper bound in (6) can be relaxed to ε^{1/p} for some 1 ≤ p < ∞. However, it is typically satisfied with p = 1 for the examples we consider.
³ In particular, the process Z_n itself does not satisfy Assumption M (ii).

Then

    (nh_n)^{1/3} (θ̂ − θ_0) →_d arg max_s Z(s),    (7)

where Z(s) is a Gaussian process with continuous sample paths, expected value s′Vs/2, and covariance kernel H(s_1, s_2) = lim_{n→∞} Σ_{t=−n}^{n} Cov(g_{n,s_1}(z_0), g_{n,s_2}(z_t)) < ∞.

This theorem can be considered as an extension of the main theorem of Kim and Pollard (1990) to the cases where the criterion function f_{n,θ} can vary with the sample size n and/or the observations {z_t} can obey a dependent process. To the best of our knowledge, the cube root (nonparametric) convergence rate (nh_n)^{1/3} with h_n → 0 is new in the literature. It is interesting to note that, similar to standard nonparametric estimation, nh_n still plays the role of the "effective sample size."

2.1. Nuisance parameters and inference. It is often the case that the criterion function contains some nuisance parameters which can be estimated at rates faster than (nh_n)^{1/3} (e.g., ν̂ in (1)). For the rest of this section, let θ̂ and θ̃ satisfy

    P_n f_{n,θ̂,ν̂} ≥ sup_{θ∈Θ} P_n f_{n,θ,ν̂} − o_p((nh_n)^{−2/3}),
    P_n f_{n,θ̃,ν_0} ≥ sup_{θ∈Θ} P_n f_{n,θ,ν_0} − o_p((nh_n)^{−2/3}),

respectively, where ν_0 is a vector of nuisance parameters and ν̂ is its estimator satisfying ν̂ − ν_0 = o_p((nh_n)^{−1/3}). Theorem 1 is extended as follows.

Theorem 2. Suppose Assumption D holds true. Let {f_{n,θ,ν_0} : θ ∈ Θ} satisfy Assumption M, and let {f_{n,θ,ν} : θ ∈ Θ, ν ∈ Λ} satisfy Assumption M (iii). Suppose there exists some negative definite matrix V_1 such that

    P(f_{n,θ,ν} − f_{n,θ_0,ν_0}) = (1/2)(θ − θ_0)′V_1(θ − θ_0) + o(|θ − θ_0|²) + O(|ν − ν_0|²) + o((nh_n)^{−2/3}),    (8)

for all θ and ν in neighborhoods of θ_0 and ν_0, respectively. Then θ̂ = θ̃ + o_p((nh_n)^{−1/3}). Additionally, if (5) holds with (g_{n,s} − P g_{n,s}) for each s, where g_{n,s} = n^{1/6} h_n^{2/3} (f_{n,θ_0+s(nh_n)^{−1/3},ν_0} − f_{n,θ_0,ν_0}), then

    (nh_n)^{1/3} (θ̂ − θ_0) →_d arg max_s Z(s),

where Z(s) is a Gaussian process with continuous sample paths, expected value s′V_1 s/2, and covariance kernel H(s_1, s_2) = lim_{n→∞} Σ_{t=−n}^{n} Cov(g_{n,s_1}(z_0), g_{n,s_2}(z_t)) < ∞.

Once we show that the M-estimator has a proper limiting distribution, Politis, Romano and Wolf (1999, Theorem 3.3.1) justifies the use of subsampling to construct confidence intervals and conduct inference. Since our mixing condition in Assumption D satisfies the requirement of their theorem, subsampling inference based on s consecutive observations, with s → ∞ and s/n → 0, is asymptotically valid. See Politis, Romano and Wolf (1999, Section 3.6) for a discussion of data-dependent choices of s. Another candidate for conducting inference based on the M-estimator is the bootstrap. However, even for independent observations, it is known that the naive nonparametric bootstrap is typically invalid

under the cube root asymptotics (Abrevaya and Huang, 2005, and Sen, Banerjee and Woodroofe, 2010). It is beyond the scope of this paper to investigate bootstrap inference in our setup.
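The subsampling recipe can be sketched generically as follows. This is an illustrative simplification of ours, not the paper's procedure: it uses disjoint consecutive blocks and, for a stable standalone demonstration, a √n-rate median example; under the cube root asymptotics above one would instead pass rate(n) = (nh_n)^{1/3} and the actual M-estimator.

```python
import numpy as np

def subsample_ci(z, estimate, rate, s, level=0.95):
    # Subsampling CI in the spirit of Politis, Romano and Wolf (1999):
    # approximate the law of rate(n)*(theta_hat - theta_0) by the empirical
    # distribution of rate(s)*(theta_hat_block - theta_hat_full) over
    # consecutive blocks of length s (disjoint blocks for brevity).
    n = len(z)
    theta_n = estimate(z)
    stats = np.array([rate(s) * (estimate(z[j:j + s]) - theta_n)
                      for j in range(0, n - s + 1, s)])
    a = (1 - level) / 2
    lo_q, hi_q = np.quantile(stats, [a, 1 - a])
    return theta_n - hi_q / rate(n), theta_n - lo_q / rate(n)

rng = np.random.default_rng(3)
z = 1.0 + rng.standard_normal(8000)                  # true median = 1.0
lo, hi = subsample_ci(z, estimate=np.median, rate=lambda n: n ** 0.5, s=200)
```

Because only the convergence rate and the existence of a limit law are required, the same code applies unchanged to cube-root-rate estimators, where the naive bootstrap fails.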

3. Examples

We provide various examples to demonstrate the usefulness of the asymptotic theory in Section 2, as well as to illustrate how to verify the regularity conditions. Three examples are studied in this section; three additional examples, on dynamic maximum score, dynamic least median of squares, and monotone density estimation, are delegated to the Supplement for the sake of space.

3.1. Dynamic panel discrete choice. Consider a dynamic panel data model with a binary dependent variable:

    P{y_i0 = 1 | x_i, α_i} = F_0(x_i, α_i),
    P{y_it = 1 | x_i, α_i, y_i0, ..., y_{it−1}} = F(x_it′β_0 + γ_0 y_{it−1} + α_i),

for i = 1, ..., n and t = 1, 2, 3, where y_it is binary, x_it is a k-vector, and both F_0 and F are unknown functions. We observe {y_it, x_it} but do not observe α_i. Honoré and Kyriazidou (2000) proposed the conditional maximum score estimator for (β_0, γ_0):

    (β̂, γ̂) = arg max_{β,γ} Σ_{i=1}^n K((x_i2 − x_i3)/b_n) (y_i2 − y_i1) sgn{(x_i2 − x_i1)′β + (y_i3 − y_i0)γ},    (9)

where K is a kernel function and b_n is a bandwidth. In this case, nonparametric smoothing is introduced to deal with the unknown link function F. Honoré and Kyriazidou (2000) obtained consistency of this estimator, but its convergence rate and limiting distribution are unknown. Since the criterion for the estimator varies with the sample size due to the bandwidth b_n, the cube root asymptotic theory of Kim and Pollard (1990) is not applicable. Here we show that Theorem 1 can be applied to answer these open questions.

Let z = (z_1′, z_2, z_3′)′ with z_1 = x_2 − x_3, z_2 = y_2 − y_1, and z_3 = ((x_2 − x_1)′, y_3 − y_0)′. Also define x_21 = x_2 − x_1. The criterion function of the estimator θ̂ = (β̂′, γ̂)′ in (9) is written as

    f_{n,θ}(z) = b_n^{−k} K(b_n^{−1} z_1) z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)}
               = e_n(z) (I{z_3′θ ≥ 0} − I{z_3′θ_0 ≥ 0}),    (10)

where e_n(z) = 2 b_n^{−k} K(b_n^{−1} z_1) z_2. Based on Honoré and Kyriazidou (2000, Theorem 4), we impose the following assumptions.

(a): {z_i}_{i=1}^n is an iid sample. z_1 has a bounded density which is continuously differentiable at zero. The conditional density of z_1 | z_2 ≠ 0, z_3 is positive in a neighborhood of zero, and P{z_2 ≠ 0 | z_3} > 0 for almost every z_3. The support of x_21 conditional on z_1 in a neighborhood of zero is not contained in any proper linear subspace of R^k. There exists at least one j ∈ {1, ..., k} such that β_0^{(j)} ≠ 0 and x_21^{(j)} | x_21^{(−j)}, z_1, where x_21^{(−j)} = (x_21^{(1)}, ..., x_21^{(j−1)}, x_21^{(j+1)}, ..., x_21^{(k)}),

has everywhere positive conditional density for almost every x_21^{(−j)} and almost every z_1 in a neighborhood of zero. E[z_2 | z_3, z_1 = 0] is differentiable in z_3. E[z_2 sgn((β_0′, γ_0)z_3) | z_1] is continuously differentiable at z_1 = 0. F is strictly increasing.

(b): K is a bounded symmetric density function with ∫ |s_j s_j′| K(s) ds < ∞ for any j, j′ ∈ {1, ..., k}. As n → ∞, it holds that nb_n^k / ln n → ∞ and nb_n^{k+3} → 0.
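To make the estimator in (9) concrete, the following simulation sketch evaluates the conditional maximum score criterion for k = 1, fixing β at its (normalized) true value and grid-searching over γ only. The logistic link, bandwidth, grid, and data generating process are illustrative assumptions of ours, not those of Honoré and Kyriazidou (2000).

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta0, gamma0 = 20_000, 1.0, 0.5
alpha_i = rng.standard_normal(n)                 # unobserved fixed effects
x = rng.standard_normal((n, 4))                  # scalar x_it for t = 0,...,3

# Generate y_it from a dynamic logit with fixed effects (F = logistic).
y = np.empty((n, 4), dtype=int)
y[:, 0] = (x[:, 0] * beta0 + alpha_i + rng.logistic(size=n) > 0).astype(int)
for t in range(1, 4):
    latent = x[:, t] * beta0 + gamma0 * y[:, t - 1] + alpha_i + rng.logistic(size=n)
    y[:, t] = (latent > 0).astype(int)

b_n = 0.5
w = np.exp(-0.5 * ((x[:, 2] - x[:, 3]) / b_n) ** 2)   # K((x_i2 - x_i3)/b_n)

def criterion(gamma, beta=beta0):
    # sample analogue of the summand in (9)
    s = np.sign((x[:, 2] - x[:, 1]) * beta + (y[:, 3] - y[:, 0]) * gamma)
    return (w * (y[:, 2] - y[:, 1]) * s).mean()

gamma_grid = np.linspace(-1.5, 1.5, 25)
objective = np.array([criterion(g) for g in gamma_grid])
gamma_hat = gamma_grid[int(np.argmax(objective))]
```

As n grows, γ̂ concentrates around γ_0 at the (nb_n^k)^{1/3} rate established below; at moderate n the grid maximizer of this discontinuous criterion remains visibly noisy, which is the practical face of the cube root phenomenon.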

We verify that {f_{n,θ}} satisfies Assumption M with h_n = b_n^k. We first check Assumption M (ii). By the definition of z_2 = y_2 − y_1 (which can take −1, 0, or 1) and the change of variables a = b_n^{−1} z_1, we obtain

    E[e_n(z)² | z_3] = 4 h_n^{−1} ∫ K(a)² p_1(b_n a | z_2 ≠ 0, z_3) da · P{z_2 ≠ 0 | z_3},

almost surely for all n, where p_1 is the conditional density of z_1 given z_2 ≠ 0 and z_3. Thus under (a), h_n E[e_n(z)² | z_3] > c almost surely for some c > 0. Pick any θ_1 and θ_2, and note that

    h_n^{1/2} ‖f_{n,θ_1} − f_{n,θ_2}‖_2 = {P(h_n E[e_n(z)² | z_3] |I{z_3′θ_1 ≥ 0} − I{z_3′θ_2 ≥ 0}|)}^{1/2}
        ≥ c^{1/2} P|I{z_3′θ_1 ≥ 0} − I{z_3′θ_2 ≥ 0}|
        ≥ c_1 |θ_1 − θ_2|,

for some c_1 > 0, where the last inequality follows from the same argument as in the maximum score example in Section B.1 of the supplementary material, using (a). Similarly, Assumption M (iii) is verified as

    h_n P sup_{|θ−ϑ|<ε} |f_{n,θ} − f_{n,ϑ}| ≤ C_1 P sup_{|θ−ϑ|<ε} |I{z_3′θ ≥ 0} − I{z_3′ϑ ≥ 0}| ≤ C_2 ε,

for some positive constants C_1 and C_2, all ϑ in a neighborhood of θ_0, and all n large enough. We now verify Assumption M (i). Since h_n f_{n,θ} is clearly bounded, it is enough to verify (2). The change of variables a = b_n^{−1} z_1 and (b) imply

    P f_{n,θ} = ∫ K(a) E[z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)} | z_1 = b_n a] p_1(b_n a) da
        = p_1(0) E[z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)} | z_1 = 0]
          + b_n² ∫ K(a) a′ [∂²{E[z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)} | z_1 = t] p_1(t)} / ∂t∂t′]|_{t=t_a} a da,

where t_a is a point on the line joining b_n a and 0, and the second equality follows from the dominated convergence and mean value theorems. Since b_n² = o((nb_n^k)^{−2/3}) by (b), the second term is negligible. Thus, for the condition in (2), it is enough to derive a second order expansion of E[z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)} | z_1 = 0]. Let Z_θ = {z_3 : I{z_3′θ ≥ 0} ≠ I{z_3′θ_0 ≥ 0}}. Honoré and Kyriazidou (2000, p. 872) showed that

    E[z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)} | z_1 = 0] = −2 ∫_{Z_θ} |E[z_2 | z_1 = 0, z_3]| dF_{z_3|z_1=0} < 0,

for all θ ≠ θ_0 on the unit sphere, and that sgn(E[z_2 | z_3, z_1 = 0]) = sgn(z_3′θ_0). Therefore, by applying the same argument as Kim and Pollard (1990, pp. 214-215), we obtain ∂E[z_2 sgn(z_3′θ) | z_1 = 0]/∂θ |_{θ=θ_0} = 0 and

    ∂²E[z_2 {sgn(z_3′θ) − sgn(z_3′θ_0)} | z_1 = 0]/∂θ∂θ′ |_{θ=θ_0} = −2 ∫ I{z_3′θ_0 = 0} κ̇(z_3)′θ_0 z_3 z_3′ p_3(z_3 | z_1 = 0) dμ_{θ_0},

where κ̇(z_3) = ∂E[z_2 | z_3, z_1 = 0]/∂z_3, p_3 is the conditional density of z_3 given z_1 = 0, and μ_{θ_0} is the surface measure on the boundary of {z_3 : z_3′θ_0 ≥ 0}. Combining these results, the condition in (2) is satisfied with the negative definite matrix

    V = −2 p_1(0) ∫ I{z_3′θ_0 = 0} κ̇(z_3)′θ_0 z_3 z_3′ p_3(z_3 | z_1 = 0) dμ_{θ_0}.    (11)

We now verify the condition of Lemma 2 in order to apply the central limit theorem in Lemma C. In this example, the normalized criterion is written as

    g_{n,s}(z) = n^{1/6} b_n^{2k/3} e_n(z) A_{n,s}(z_3),

where e_n(z) = 2 b_n^{−k} K(b_n^{−1} z_1) z_2 and A_{n,s}(z_3) = I{z_3′(θ_0 + s n^{−1/3} b_n^{−k/3}) ≥ 0} − I{z_3′θ_0 ≥ 0}. Since |z_2| ≤ 1 and A_{n,s}(z_3) takes only the values 0 and ±1, it holds that

    P{|g_{n,s}| ≥ c} ≤ P{|K(b_n^{−1} z_1)| ≥ 2^{−1} c n^{−1/6} b_n^{k/3} | |A_{n,s}(z_3)| = 1} P{|A_{n,s}(z_3)| = 1}
        ≤ C n^{1/6} b_n^{−k/3} E[|K(b_n^{−1} z_1)| | |A_{n,s}(z_3)| = 1] (nb_n^k)^{−1/3}
        ≤ C′ (nb_n^{2k})^{−1/3},

for some C, C′ > 0, where the second inequality follows from the Markov inequality and the fact that P{|A_{n,s}(z_3)| = 1} is proportional to (nb_n^k)^{−1/3}, and the last inequality follows from a change of variables and the boundedness of the conditional density of z_1 given |A_{n,s}(z_3)| = 1 (by (a)). Since h_n = b_n^k in this example, we can apply Lemma 2 to conclude that the condition (5) holds true.

Since the criterion (10) satisfies Assumption M and the Lindeberg-type condition (5), Theorem 1 implies the limiting distribution of Honoré and Kyriazidou's (2000) estimator as in (7). The matrix V is given in (11). The kernel H is obtained in the same manner as Kim and

Pollard (1990). That is, decompose z_3 into rθ_0 + z̄_3 with z̄_3 orthogonal to θ_0. Then it holds that H(s_1, s_2) = L(s_1) + L(s_2) − L(s_1 − s_2), where

    L(s) = 4 p_1(0) ∫ |z̄_3′s| p_3(0, z̄_3 | z_1 = 0) dz̄_3.

3.2. Random coefficient binary choice. As a new example which can be covered by our asymptotic theory, let us consider the regression model with a random coefficient, y_t = x_t′θ(w_t) + u_t. We observe x_t ∈ R^d, w_t ∈ R^k, and the sign of y_t. We wish to estimate θ_0 = θ(c) at some given c ∈ R^k.⁴ In this setup, we can consider a localized version of the maximum score estimator,

    θ̂ = arg max_{θ∈S} Σ_{t=1}^n K((w_t − c)/b_n) [I{y_t ≥ 0, x_t′θ ≥ 0} + I{y_t < 0, x_t′θ < 0}],

where S is the surface of the unit sphere in R^d. We impose the following assumptions. Let h(x, u) = I{x′θ_0 + u ≥ 0} − I{x′θ_0 + u < 0}.

(a): {x_t, w_t, u_t} satisfies Assumption D. The density p(x, w) of (x_t, w_t) is continuous at all x and at w = c. The conditional distribution of x | w = c has compact support and a continuously differentiable conditional density. The angular component of x | w = c, considered as a random variable on S, has a bounded and continuous density, and the density for the

orthogonal angle to θ_0 is bounded away from zero.

(b): Assume that |θ_0| = 1, median(u | x, w = c) = 0, the function κ(x, w) = E[h(x_t, u_t) | x_t = x, w_t = w] is continuous at all x and at w = c, κ(x, c) is non-negative for x′θ_0 ≥ 0 and non-positive for x′θ_0 < 0 and is continuously differentiable in x, and P{x′θ_0 = 0, (∂κ(x, c)/∂x)′θ_0 p(x, c) > 0 | w = c} > 0.

(c): K is a bounded symmetric density function with ∫ s² K(s) ds < ∞. As n → ∞, it holds that nb_n^{k₀} → ∞ for some k₀ > k.

Note that the criterion function is written as

    f_{n,θ}(x, w, u) = (1/h_n) K((w − c)/h_n^{1/k}) h(x, u) [I{x′θ ≥ 0} − I{x′θ_0 ≥ 0}],

where h_n = b_n^k. We can see that θ̂ = arg max_{θ∈S} P_n f_{n,θ} and θ_0 = arg max_{θ∈S} lim_{n→∞} P f_{n,θ}. Existence and uniqueness of θ_0 are guaranteed by the change of variables and (b) (see Manski, 1985). Also, the uniform law of large numbers for absolutely regular processes by Nobel and Dembo (1993, Theorem 1) implies sup_{θ∈S} |P_n f_{n,θ} − P f_{n,θ}| →_p 0. Therefore, θ̂ is consistent for θ_0.

We next compute the expected value and covariance kernel of the limit process (i.e., V and H in Theorem 1). Due to strict stationarity (in Assumption D), we can apply the same argument as Kim and Pollard (1990, pp. 214-215) to derive the second derivative

    V = lim_{n→∞} ∂²P f_{n,θ}/∂θ∂θ′ |_{θ=θ_0} = −∫ I{x′θ_0 = 0} ((∂κ(x, c)/∂x)′θ_0) p(x, c) x x′ dσ(x),

where σ is the surface measure on the boundary of the set {x : x′θ_0 ≥ 0}. The matrix V is negative definite under the last condition of (b). Now pick any s_1 and s_2, and define q_{n,t} = f_{n,θ_0+(nh_n)^{−1/3}s_1}(x_t, w_t, u_t) − f_{n,θ_0+(nh_n)^{−1/3}s_2}(x_t, w_t, u_t). The covariance kernel is written as H(s_1, s_2) =

$^4$Gautier and Kitamura (2013) studied identification and estimation of the random coefficient binary choice model, where $\theta_t = \theta(w_t)$ is unobservable. Here we study the model where heterogeneity in the slope is caused by the observables $w_t$.

$\frac{1}{2}\{ L(s_1,0) + L(0,s_2) - L(s_1,s_2) \}$, where
\[
L(s_1,s_2) = \lim_{n\to\infty} (nh_n)^{4/3} \mathrm{Var}(P_n q_{n,t}) = \lim_{n\to\infty} (nh_n)^{1/3} \Big\{ \mathrm{Var}(q_{n,t}) + 2\sum_{m=1}^{\infty} \mathrm{Cov}(q_{n,t}, q_{n,t+m}) \Big\}.
\]
The limit of $(nh_n)^{1/3}\mathrm{Var}(q_{n,t})$ is obtained in the same manner as in Kim and Pollard (1990, p. 215). For the covariance, the $\alpha$-mixing inequality implies
\[
|\mathrm{Cov}(q_{n,t}, q_{n,t+m})| \le C \alpha_m^{1-2/p} \| q_{n,t} \|_p^2 = O(\rho^m)\, O\big( (nh_n)^{-2/(3p)} h_n^{2(1-p)/p} \big),
\]
for some $C > 0$ and $p > 2$, where the equality follows from the change of variables and Assumption D. Also, by the change of variables, $|\mathrm{Cov}(q_{n,t}, q_{n,t+m})| = |P q_{n,t} q_{n,t+m} - (P q_{n,t})^2| = O((nh_n)^{-2/3})$. By using these bounds (note: if $0 \le a \le \min\{b,c\}$ then $a \le b^{\ell} c^{1-\ell}$ for any $\ell \in [0,1]$), we obtain
\[
(nh_n)^{1/3} \sum_{m=1}^{\infty} |\mathrm{Cov}(q_{n,t}, q_{n,t+m})| \le C' (nh_n)^{-1/3 + \frac{2\ell}{3} - \frac{2\ell}{3p}} h_n^{-\frac{2(p-1)\ell}{p}} \sum_{m=1}^{\infty} \rho^{\ell m},
\]
for any $\ell \in (0,1]$. Thus, by taking $\ell$ sufficiently small, we obtain $\lim_{n\to\infty} (nh_n)^{1/3} \sum_{m=1}^{\infty} \mathrm{Cov}(q_{n,t}, q_{n,t+m}) = 0$ due to $n b_n^{k_0} \to \infty$.

We now verify that $\{f_{n,\theta} : \theta \in S\}$ satisfies Assumption M with $h_n = b_n^k$. Assumption M (i) is already verified. By the change of variables and Jensen's inequality (also note that $h(x,u)^2 = 1$), there exists a positive constant $C$ such that

\[
h_n^{1/2} \| f_{n,\theta_1} - f_{n,\theta_2} \|_2 = \sqrt{ \iint K(s)^2 \, \big| I\{x'\theta_1 \ge 0\} - I\{x'\theta_2 \ge 0\} \big| \, p(x, c + s b_n) \, dx \, ds } \ge C E\big[ \big| I\{x'\theta_1 \ge 0\} - I\{x'\theta_2 \ge 0\} \big| \mid w = c \big] = C P\{ x'\theta_1 \ge 0 > x'\theta_2 \ \text{or} \ x'\theta_2 \ge 0 > x'\theta_1 \mid w = c \},
\]
for all $\theta_1, \theta_2 \in S$ and all $n$ large enough. Since the right hand side is the conditional probability of a pair of wedge-shaped regions with an angle of order $|\theta_1 - \theta_2|$, the last condition in (a) implies Assumption M (ii). For Assumption M (iii), pick any $\varepsilon > 0$; then there exists a positive constant $C'$ such that
\[
P \sup_{\theta \in \Theta : |\theta - \vartheta| < \varepsilon} h_n | f_{n,\theta} - f_{n,\vartheta} |^2 = \iint K(s)^2 \sup_{\theta \in \Theta : |\theta - \vartheta| < \varepsilon} \big| I\{x'\theta \ge 0\} - I\{x'\vartheta \ge 0\} \big|^2 \, p(x, c + s b_n) \, dx \, ds \le C' E\Big[ \sup_{\theta \in \Theta : |\theta - \vartheta| < \varepsilon} \big| I\{x'\theta \ge 0\} - I\{x'\vartheta \ge 0\} \big|^2 \mid w = c \Big],
\]
for all $\vartheta$ in a neighborhood of $\theta_0$ and $n$ large enough. Again, the right hand side is the conditional probability of a pair of wedge-shaped regions with an angle of order $\varepsilon$. Thus the last condition in (a) also guarantees Assumption M (iii). Since $\{f_{n,\theta} : \theta \in S\}$ satisfies Assumption M, Theorem 1 implies the limiting distribution of $(nh_n)^{1/3}(\hat{\theta} - \theta_0)$ for the random coefficient model.
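For concreteness, the localized maximum score criterion above can be maximized by a simple random search over the unit sphere. The following Python sketch is our own illustration, not part of the paper; the Gaussian kernel, the random-direction search, and all names are illustrative assumptions.

```python
import numpy as np

def local_max_score(y, x, w, c, b_n, n_dir=500, seed=0):
    """Localized maximum score: maximize
    sum_t K((w_t - c)/b_n) [1{y_t >= 0, x_t'theta >= 0} + 1{y_t < 0, x_t'theta < 0}]
    over theta on the unit sphere S, here by random search over directions."""
    rng = np.random.default_rng(seed)
    kw = np.exp(-0.5 * ((w - c) / b_n) ** 2)  # Gaussian kernel weights (a choice)
    d = x.shape[1]
    thetas = rng.standard_normal((n_dir, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # points on S
    def score(theta):
        s = x @ theta
        return np.sum(kw * ((y >= 0) & (s >= 0))) + np.sum(kw * ((y < 0) & (s < 0)))
    scores = [score(th) for th in thetas]
    return thetas[int(np.argmax(scores))]
```

In practice a deterministic grid over angles, or a local refinement around the best random direction, can replace the crude random search.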

3.3. Minimum volume predictive region. As an illustration of Theorem 2, we now consider the example in (1), the minimum volume predictor for a strictly stationary process proposed by Polonik and Yao (2000). Suppose we are interested in predicting $y \in \mathbf{R}$ from $x \in \mathbf{R}$ based on the observations $\{y_t, x_t\}$. The minimum volume predictor of $y$ at $x = c$ in the class $\mathcal{I}$ of intervals of $\mathbf{R}$ at level $\alpha \in [0,1]$ is defined as
\[
\hat{I} = \arg\min_{S \in \mathcal{I}} \mu(S) \quad \text{s.t.} \quad \hat{P}(S) \ge \alpha,
\]
where $\mu$ is the Lebesgue measure and $\hat{P}(S) = \sum_{t=1}^{n} I\{y_t \in S\} K\big(\frac{x_t - c}{h_n}\big) \big/ \sum_{t=1}^{n} K\big(\frac{x_t - c}{h_n}\big)$ is the kernel estimator of the conditional probability $P\{y_t \in S \mid x_t = c\}$. Since $\hat{I}$ is an interval, it can be written as $\hat{I} = [\hat{\theta} - \hat{\nu}, \hat{\theta} + \hat{\nu}]$, where
\[
\hat{\theta} = \arg\max_{\theta} \hat{P}([\theta - \hat{\nu}, \theta + \hat{\nu}]), \qquad \hat{\nu} = \inf\big\{ \nu : \sup_{\theta} \hat{P}([\theta - \nu, \theta + \nu]) \ge \alpha \big\}.
\]
To study the asymptotic property of $\hat{I}$, we impose the following assumptions.

(a): $\{y_t, x_t\}$ satisfies Assumption D. $I_0 = [\theta_0 - \nu_0, \theta_0 + \nu_0]$ is the unique shortest interval such that $P\{y_t \in I_0 \mid x_t = c\} \ge \alpha$. The conditional density $\gamma_{y|x=c}$ of $y_t$ given $x_t = c$ is bounded and strictly positive at $\theta_0 \pm \nu_0$, and its derivative satisfies $\dot{\gamma}_{y|x=c}(\theta_0 - \nu_0) - \dot{\gamma}_{y|x=c}(\theta_0 + \nu_0) > 0$.

(b): $K$ is bounded and symmetric, and satisfies $\lim_{|a| \to \infty} |a| K(a) = 0$. As $n \to \infty$, $nh_n \to \infty$ and $nh_n^4 \to 0$.

For notational convenience, assume $\theta_0 = 0$ and $\nu_0 = 1$. We first derive the convergence rate of $\hat{\nu}$. Note that $\hat{\nu} = \inf\{ \nu : \sup_{\theta} \hat{g}([\theta - \nu, \theta + \nu]) \ge \alpha \hat{\gamma}(c) \}$, where $\hat{g}(S) = \frac{1}{nh_n} \sum_{t=1}^{n} I\{y_t \in S\} K\big(\frac{x_t - c}{h_n}\big)$ and $\hat{\gamma}(c) = \frac{1}{nh_n} \sum_{t=1}^{n} K\big(\frac{x_t - c}{h_n}\big)$. By applying Lemma M' and a central limit theorem, we can obtain the uniform convergence rate
\[
\max\Big\{ |\hat{\gamma}(c) - \gamma(c)|, \ \sup_{\theta,\nu} \big| \hat{g}([\theta - \nu, \theta + \nu]) - P\{y_t \in [\theta - \nu, \theta + \nu] \mid x_t = c\} \gamma(c) \big| \Big\} = O_p\big( (nh_n)^{-1/2} + h_n^2 \big).
\]
Thus the same argument as in Kim and Pollard (1990, pp. 207-208) yields $\hat{\nu} - 1 = O_p((nh_n)^{-1/2} + h_n^2)$. Let $\hat{\theta} = \arg\max_{\theta} \hat{g}([\theta - \hat{\nu}, \theta + \hat{\nu}])$. Consistency follows from uniqueness of $(\theta_0, \nu_0)$ in (a) and the uniform convergence
\[
\sup_{\theta} \big| \hat{g}([\theta - \hat{\nu}, \theta + \hat{\nu}]) - P\{y_t \in [\theta - 1, \theta + 1] \mid x_t = c\} \gamma(c) \big| \stackrel{p}{\to} 0,
\]
which is obtained by applying Nobel and Dembo (1993, Theorem 1).

Now let $z = (y, x)'$ and
\[
f_{n,\theta,\nu}(z) = \frac{1}{h_n} K\Big(\frac{x - c}{h_n}\Big) \big[ I\{y \in [\theta - \nu, \theta + \nu]\} - I\{y \in [-\nu, \nu]\} \big].
\]

Note that $\hat{\theta} = \arg\max_{\theta} P_n f_{n,\theta,\hat{\nu}}$. We apply Theorem 2 to obtain the convergence rate of $\hat{\theta}$. For the condition in (8), observe that
\[
P(f_{n,\theta,\nu} - f_{n,0,1}) = P(f_{n,\theta,\nu} - f_{n,0,\nu}) + P(f_{n,0,\nu} - f_{n,0,1}) = \tfrac{1}{2}\{ \dot{\gamma}_{y|x}(1|c) - \dot{\gamma}_{y|x}(-1|c) \} \gamma_x(c) \theta^2 + \{ \dot{\gamma}_{y|x}(1|c) + \dot{\gamma}_{y|x}(-1|c) \} \gamma_x(c) \theta (\nu - 1) + o(\theta^2 + |\nu - 1|^2) + O(h_n^2).
\]

The condition (8) holds with $V_1 = \{ \dot{\gamma}_{y|x}(1|c) - \dot{\gamma}_{y|x}(-1|c) \} \gamma_x(c)$. Assumption M (iii) for $\{f_{n,\theta,\nu} : \theta \in \mathbf{R}, \nu \in \mathbf{R}\}$ is verified in the same manner as in Section B.2 of the supplementary material. It remains to verify Assumption M (ii) for the class $\{f_{n,\theta,1} : \theta \in \mathbf{R}\}$. Pick any $\theta_1$ and $\theta_2$. Some expansions yield
\[
h_n \| f_{n,\theta_1,1} - f_{n,\theta_2,1} \|_2^2 = \int K(a)^2 \left\{ \begin{array}{c} |\Gamma_{y|x}(\theta_2 + 1 \mid x = c + ah_n) - \Gamma_{y|x}(\theta_1 + 1 \mid x = c + ah_n)| \\ + |\Gamma_{y|x}(\theta_2 - 1 \mid x = c + ah_n) - \Gamma_{y|x}(\theta_1 - 1 \mid x = c + ah_n)| \end{array} \right\} \gamma_x(c + ah_n) \, da = \int K(a)^2 \{ \gamma_{y|x}(\dot{\theta} + 1 \mid x = c + ah_n) + \gamma_{y|x}(\ddot{\theta} - 1 \mid x = c + ah_n) \} \gamma_x(c + ah_n) \, da \, |\theta_1 - \theta_2|,
\]
where $\Gamma_{y|x}$ is the conditional distribution function of $y$ given $x$, and $\dot{\theta}$ and $\ddot{\theta}$ are points between $\theta_1$ and $\theta_2$. By (a), Assumption M (ii) is satisfied. Therefore, we can conclude that $\hat{\nu} - \nu_0 = O_p((nh_n)^{-1/2} + h_n^2)$ and $\hat{\theta} - \theta_0 = O_p((nh_n)^{-1/3} + h_n)$. This result confirms positively the conjecture of Polonik and Yao (2000, Remark 3b) on the exact convergence rate of $\hat{I}$.
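The minimum volume predictor $\hat{I} = [\hat{\theta} - \hat{\nu}, \hat{\theta} + \hat{\nu}]$ can be computed by a plain grid search over centers and radii. The sketch below is our own illustration under assumed choices (Gaussian kernel, grid sizes, and names are not from Polonik and Yao):

```python
import numpy as np

def min_volume_interval(y, x, c, h_n, alpha=0.9, grid=80):
    """Polonik-Yao-type minimum volume predictor at x = c: the shortest
    interval [theta - nu, theta + nu] whose kernel-weighted conditional
    coverage reaches alpha, found by grid search over centers and radii."""
    kw = np.exp(-0.5 * ((x - c) / h_n) ** 2)  # Gaussian kernel weights (a choice)
    kw = kw / kw.sum()
    centers = np.linspace(y.min(), y.max(), grid)
    radii = np.linspace(0.0, (y.max() - y.min()) / 2.0, grid)
    for nu in radii:  # smallest radius first: mimics the inf in the definition
        covs = [kw[(y >= th - nu) & (y <= th + nu)].sum() for th in centers]
        if max(covs) >= alpha:
            return centers[int(np.argmax(covs))], nu
    return centers[grid // 2], radii[-1]
```

With $y \mid x = c$ approximately $N(0,1)$ and $\alpha = 0.9$, the returned radius should be close to the normal quantile $1.645$, consistent with the uniqueness condition in (a).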

4. Generalizations

In this section, we consider two generalizations of the asymptotic theory in Section 2. The first is to allow for data taking limited values, such as interval-valued regressors, and the second is to allow the localization to depend on the parameter values.

4.1. Limited variables. We consider the case where some of the variables take limited values.

Thus, we relax the assumption of point identification of $\theta_0$, the maximizer of the limiting population criterion $\lim_{n\to\infty} P f_{n,\theta}$, and consider the case where the limiting criterion is maximized at any element of a set $\Theta_I \subset \Theta$. The set $\Theta_I$ is called the identified set. In order to estimate $\Theta_I$, we consider a collection of approximate maximizers of the sample criterion function $P_n f_{n,\theta}$, that is,
\[
\hat{\Theta} = \Big\{ \theta \in \Theta : \max_{\theta \in \Theta} P_n f_{n,\theta} - P_n f_{n,\theta} \le \hat{c}(nh_n)^{-1/2} \Big\},
\]
i.e., the level set based on the criterion $f_{n,\theta}$ within a cutoff value $\hat{c}(nh_n)^{-1/2}$ of the maximum. This section studies the convergence rate of $\hat{\Theta}$ to $\Theta_I$ under the Hausdorff distance defined below. We assume that $\Theta_I$ is convex. Then the projection $\pi_\theta = \arg\min_{\theta' \in \Theta_I} |\theta' - \theta|$ of $\theta \in \Theta$ on $\Theta_I$ is uniquely defined. To deal with the partially identified case, we modify Assumption M as follows.

Assumption S. For a sequence $\{h_n\}$ of positive numbers satisfying $nh_n \to \infty$, $\{f_{n,\theta} : \theta \in \Theta\}$ satisfies the following conditions.

(i): $\{h_n f_{n,\theta} : \theta \in \Theta\}$ is a class of uniformly bounded functions. Also, $\lim_{n\to\infty} P f_{n,\theta}$ is maximized at any $\theta$ in a bounded convex set $\Theta_I$. There exist positive constants $c$ and $c'$ such that
\[
P(f_{n,\pi_\theta} - f_{n,\theta}) \ge c |\theta - \pi_\theta|^2 + o(|\theta - \pi_\theta|^2) + o((nh_n)^{-2/3}), \tag{12}
\]
for all $n$ large enough and all $\theta \in \{\theta \in \Theta : 0 < |\theta - \pi_\theta| \le c'\}$.

(ii): There exist positive constants $C$ and $C'$ such that
\[
|\theta - \pi_\theta| \le C h_n^{1/2} \| f_{n,\theta} - f_{n,\pi_\theta} \|_2,
\]
for all $n$ large enough and all $\theta \in \{\theta \in \Theta : 0 < |\theta - \pi_\theta| \le C'\}$.

(iii): There exists a positive constant $C''$ such that
\[
P \sup_{\theta \in \Theta : 0 < |\theta - \pi_\theta| < \varepsilon} h_n | f_{n,\theta} - f_{n,\pi_\theta} |^2 \le C'' \varepsilon,
\]
for all $n$ large enough and all $\varepsilon > 0$ small enough.

We allow $h_n = 1$ for the case without a bandwidth in the criterion function. Similar comments to those on Assumption M apply. The main difference is that the conditions are imposed on the contrast $f_{n,\theta} - f_{n,\pi_\theta}$ using the projection $\pi_\theta$. Assumption S (i) contains boundedness and expansion conditions. The inequality in (12) can be verified by a one-sided Taylor expansion based on the directional derivative. Assumptions S (ii) and (iii) play similar roles to their counterparts in Assumption M and are verified by similar arguments to the point identified case. We establish the following maximal inequality for criteria satisfying Assumption S. Let $r_n = nh_n / \log(nh_n)$.

Lemma MS. Under Assumptions D and S, there exist positive constants $C$ and $C' < \infty$ such that
\[
P \sup_{\theta \in \Theta : 0 < |\theta - \pi_\theta| < \delta} \big| \mathbb{G}_n h_n^{1/2} (f_{n,\theta} - f_{n,\pi_\theta}) \big| \le C \delta^{1/2} (\log(1/\delta))^{1/2},
\]
for all $n$ large enough and $\delta \in [r_n^{-1/2}, C']$.

Compared to Lemma M, the additional log term on the right hand side is due to the fact that the supremum is taken over the $\delta$-tube (or manifold) around $\Theta_I$ instead of the $\delta$-ball around $\theta_0$, which increases the entropy. This maximal inequality is applied to obtain the following analog of Lemma 1.

Lemma 3. Under Assumptions D and S, for each $\varepsilon > 0$, there exist random variables $\{R_n\}$ of order $O_p(1)$ and a positive constant $C$ such that
\[
\big| P_n(f_{\theta} - f_{\pi_\theta}) - P(f_{\theta} - f_{\pi_\theta}) \big| \le \varepsilon |\theta - \pi_\theta|^2 + r_n^{-2/3} R_n^2,
\]
for all $\theta \in \{\theta \in \Theta : r_n^{-1/3} \le |\theta - \pi_\theta| \le C\}$.

We now establish the convergence rate of the set estimator $\hat{\Theta}$ to $\Theta_I$. Let $\rho(A,B) = \sup_{a \in A} \inf_{b \in B} |a - b|$ and $H(A,B) = \max\{\rho(A,B), \rho(B,A)\}$ be the Hausdorff distance of sets $A, B \subset \mathbf{R}^d$. Based on Lemmas MS and 3, the asymptotic property of the set estimator $\hat{\Theta}$ is obtained as follows.

Theorem 3. Suppose Assumption D holds true. Let $\{f_{n,\theta} : \theta \in \Theta\}$ satisfy Assumption S, and $\{h_n^{1/2} f_{n,\theta} : \theta \in \Theta_I\}$ be a $P$-Donsker class. Assume $H(\hat{\Theta}, \Theta_I) \stackrel{p}{\to} 0$ and $\hat{c} = o_p((nh_n)^{1/2})$. Then
\[
\rho(\hat{\Theta}, \Theta_I) = O_p\big( \hat{c}^{1/2} (nh_n)^{-1/4} + r_n^{-1/3} \big).
\]

Furthermore, if $\hat{c} \to \infty$, then $P\{\Theta_I \subset \hat{\Theta}\} \to 1$ and
\[
H(\hat{\Theta}, \Theta_I) = O_p\big( \hat{c}^{1/2} (nh_n)^{-1/4} \big).
\]
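The level-set construction of $\hat{\Theta}$ and the Hausdorff distance are straightforward to compute once the criterion has been evaluated on a grid. A minimal sketch, with hypothetical names and one-dimensional point clouds standing in for the sets:

```python
import numpy as np

def level_set(theta_grid, crit, c_hat, n, h_n=1.0):
    """Set estimator: all theta whose criterion value is within
    c_hat * (n * h_n)^(-1/2) of the maximum."""
    gap = crit.max() - crit
    return theta_grid[gap <= c_hat * (n * h_n) ** -0.5]

def directed(A, B):
    """rho(A, B) = sup_{a in A} inf_{b in B} |a - b| for 1-d point clouds."""
    return max(min(abs(a - b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B) = max{rho(A, B), rho(B, A)}."""
    return max(directed(A, B), directed(B, A))
```

Note the asymmetry of `directed`: with $A \subset B$ we have $\rho(A,B) = 0$ while $\rho(B,A)$ can be large, which mirrors the two parts of Theorem 3.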

Note that $\rho$ is asymmetric in its arguments. The first part of this theorem says $\rho(\hat{\Theta}, \Theta_I) = O_p(\hat{c}^{1/2}(nh_n)^{-1/4} + r_n^{-1/3})$. On the other hand, in the second part, we show $P\{\Theta_I \subset \hat{\Theta}\} \to 1$ (i.e., $\rho(\Theta_I, \hat{\Theta})$ can converge to zero at an arbitrary rate) as far as $\hat{c} \to \infty$. For example, we may set $\hat{c} = \log(nh_n)$. These results are combined to imply the convergence rate $H(\hat{\Theta}, \Theta_I) = O_p(\hat{c}^{1/2}(nh_n)^{-1/4})$ under the Hausdorff distance. The cube root term of order $r_n^{-1/3}$ in the rate of $\rho(\hat{\Theta}, \Theta_I)$ is dominated by the term of order $\hat{c}^{1/2}(nh_n)^{-1/4}$.

We next consider the case where the criterion function contains nuisance parameters. In particular, we allow the dimension $k_n$ of the nuisance parameters $\nu$ to grow as the sample size increases. For instance, the nuisance parameters might be coefficients in sieve estimation. It is important to allow the growing dimension of $\nu$ to cover Manski and Tamer's (2002) set estimator, where the criterion contains a nonparametric estimate and its transform by the indicator. The rest of this subsection considers the set estimator

\[
\hat{\Theta} = \Big\{ \theta \in \Theta : \max_{\theta \in \Theta} P_n f_{n,\theta,\hat{\nu}} - P_n f_{n,\theta,\hat{\nu}} \le \hat{c}(nh_n)^{-1/2} \Big\},
\]
with some preliminary estimator $\hat{\nu}$ and cutoff value $\hat{c}$. To derive the convergence rate of $\hat{\Theta}$, we establish a maximal inequality over a sequence of sets of functions that are indexed by parameters with increasing dimension. Let $g_{n,s} = h_n^{1/2}(f_{n,\theta,\nu} - f_{n,\theta,\nu_0})$ with $s = (\theta', \nu')'$, and consider a sequence of classes of functions $\mathcal{G}_n = \{ g_{n,s} : |\theta - \pi_\theta| \le K_1, |\nu - \nu_0| \le a_n K_2 \}$ for some $K_1, K_2 > 0$ with the envelope function $G_n = \sup_{\mathcal{G}_n} |g_{n,s}|$. The maximal inequality in Lemma MS is modified as follows.

Lemma MS'. Suppose Assumption D holds true. Suppose there exists a positive constant $C$ such that
\[
P \sup_{\theta \in \Theta} \sup_{|\nu - \nu_0| \le \varepsilon} | g_{n,s} |^2 \le C \sqrt{k_n} \, \varepsilon, \tag{13}
\]
\[
\sup_{\theta \in \Theta} \sup_{|\nu - \nu_0| \le \varepsilon} \big\{ |\nu - \nu_0| - C \| g_{n,s} \|_2 \big\} \le 0, \tag{14}
\]
for all $n$ large enough and all $\varepsilon$ small enough. Also assume that there exist $0 \le \kappa < 1/4$ and $C' > 0$ such that $G_n \le C' n^{\kappa}$ and $\| G_n \|_2 \le C'$ for all $n$ large enough. Then there exists $K_3 > 0$ such that
\[
P \sup_{g_{n,s} \in \mathcal{G}_n} | \mathbb{G}_n g_{n,s} | \le K_3 a_n^{1/2} k_n^{3/4} \sqrt{ \log(k_n a_n^{-1}) },
\]
for all $n$ large enough.

The increasing dimension $k_n$ of the nuisance parameter $\nu$ affects the upper bound via two routes. First, it increases the size of the envelope by a factor of $\sqrt{k_n}$, which in turn increases the entropy of the space. Second, it also demands us to consider an inflated class of functions to apply the more fundamental maximal inequality by Doukhan, Massart and Rio (1995), which relies on the so-called $\| \cdot \|_{2,\beta}$ norm. Note that the envelope condition in (13) allows for step functions containing some nonparametric estimates. Based on this lemma, the convergence rate of the set estimator $\hat{\Theta}$ is characterized as follows.

Theorem 4. Suppose Assumption D holds true. Let $\{f_{n,\theta,\nu_0} : \theta \in \Theta\}$ satisfy Assumption S and $\{h_n^{1/2} f_{n,\theta,\nu_0} : \theta \in \Theta_I\}$ be a $P$-Donsker class. Assume $\rho(\hat{\Theta}, \Theta_I) \stackrel{p}{\to} 0$, $\hat{c} = o_p((nh_n)^{1/2})$, $k_n \to \infty$, and $|\hat{\nu} - \nu_0| = o_p(a_n)$ for some $\{a_n\}$ such that $h_n / a_n \to \infty$. Furthermore, suppose there exist some $\varepsilon > 0$ and neighborhoods $\{\theta \in \Theta : |\theta - \pi_\theta| < \varepsilon\}$ and $\{\nu : |\nu - \nu_0| \le \varepsilon\}$ on which $h_n^{1/2}(f_{n,\theta,\nu} - f_{n,\theta,\nu_0})$ satisfies the conditions (13) and (14) in Lemma MS' and
\[
P(f_{n,\theta,\nu} - f_{n,\theta,\nu_0}) - P(f_{n,\pi_\theta,\nu} - f_{n,\pi_\theta,\nu_0}) = o(|\theta - \pi_\theta|^2) + O(|\nu - \nu_0|^2 + r_n^{-2/3}). \tag{15}
\]
Then
\[
\rho(\hat{\Theta}, \Theta_I) = O_p\big( \hat{c}^{1/2}(nh_n)^{-1/4} + r_n^{-1/3} + (nh_n a_n^{-1})^{-1/4} (\log k_n)^{1/2} \big) + o(a_n). \tag{16}
\]
Furthermore, if $\hat{c} \to \infty$, then $P\{\Theta_I \subset \hat{\Theta}\} \to 1$ and
\[
H(\hat{\Theta}, \Theta_I) = O_p\big( \hat{c}^{1/2}(nh_n)^{-1/4} + (nh_n)^{-1/4} a_n^{1/4} k_n^{3/8} (\log n)^{1/4} \big) + o(a_n). \tag{17}
\]

Compared to Theorem 3, we have two extra terms in the convergence rate of $H(\hat{\Theta}, \Theta_I)$ due to the (nonparametric) estimation of $\nu_0$. However, they can be shown to be dominated by the first term under standard conditions. Suppose that $k_n^4 \log k_n / n \to 0$ and the preliminary estimator $\hat{\nu}$ satisfies $|\hat{\nu} - \nu_0| = O_p(n^{-1/2}(k_n \log k_n)^{1/2})$, which is often the case in sieve estimation (see, e.g., Chen, 2007). Then $a_n = n^{-1/2}(k_n \log k_n)^{1/2}$ and $a_n^{1/4} k_n^{3/8} \to 0$. Now set $\hat{c} = \log n$, which makes the first term dominate the second and the last terms.

4.1.1. Example: Binary choice with interval regressor. As an illustration of partially identified models, we consider a binary choice model with an interval regressor studied by Manski and Tamer (2002). More precisely, let $y = I\{x'\theta_0 + w + u \ge 0\}$, where $x$ is a vector of observable regressors, $w$ is an unobservable regressor, and $u$ is an unobservable error term satisfying $P\{u \le 0 \mid x, w\} = \alpha$ (we set $\alpha = .5$ to simplify the notation). Instead of $w$, we observe the interval $[w_l, w_u]$ such that $P\{w_l \le w \le w_u\} = 1$. Here we normalize the coefficient of $w$ in the determination of $y$ to equal one. In this setup, the parameter $\theta_0$ is partially identified and its identified set is written as (Manski and Tamer, 2002, Proposition 2)

\[
\Theta_I = \big\{ \theta \in \Theta : P\{ x'\theta + w_u < 0 \le x'\theta_0 + w_l \ \text{or} \ x'\theta_0 + w_u < 0 \le x'\theta + w_l \} = 0 \big\}.
\]

Let $\tilde{x} = (x', w_l, w_u)'$ and $q_{\hat{\nu}}(\tilde{x})$ be an estimator for $P\{y = 1 \mid \tilde{x}\}$ with the estimated parameters $\hat{\nu}$. Suppose $P\{y = 1 \mid \tilde{x}\} = q_{\nu_0}(\tilde{x})$. By exploring the maximum score approach, Manski and Tamer (2002) developed the set estimator for $\Theta_I$, that is,
\[
\hat{\Theta} = \Big\{ \theta \in \Theta : \max_{\theta \in \Theta} S_n(\theta) - S_n(\theta) \le \epsilon_n \Big\}, \tag{18}
\]
where
\[
S_n(\theta) = P_n (y - .5) \big[ I\{q_{\hat{\nu}}(\tilde{x}) > .5\} \operatorname{sgn}(x'\theta + w_u) + I\{q_{\hat{\nu}}(\tilde{x}) \le .5\} \operatorname{sgn}(x'\theta + w_l) \big].
\]
Manski and Tamer (2002) established the consistency of $\hat{\Theta}$ to $\Theta_I$ under the Hausdorff distance. To establish the consistency, they assumed the cutoff value $\epsilon_n$ is bounded from below by the (almost sure) decay rate of $\sup_{\theta \in \Theta} |S_n(\theta) - S(\theta)|$, where $S(\theta)$ is the limiting object of $S_n(\theta)$. As Manski and Tamer (2002, Footnote 3) argued, characterization of this decay rate is a complex task because

$S_n(\theta)$ is a step function and $I\{q_{\hat{\nu}}(\tilde{x}) > .5\}$ is a step function transform of the nonparametric estimate of $P\{y = 1 \mid \tilde{x}\}$. Therefore, it has been an open question. Obtaining the lower bound rate of $\epsilon_n$ is important because we wish to minimize the volume of the estimator $\hat{\Theta}$ without losing its asymptotic validity. By applying Theorem 4, we can explicitly characterize the decay rate for the lower bound of $\epsilon_n$ and establish the convergence rate of this estimator. A little algebra shows that the set estimator in (18) can be written as

\[
\hat{\Theta} = \Big\{ \theta \in \Theta : \max_{\theta \in \Theta} P_n f_{\theta,\hat{\nu}} - P_n f_{\theta,\hat{\nu}} \le \hat{c} n^{-1/2} \Big\},
\]
where $z = (x', w, w_l, w_u, u)'$, $h(x,w,u) = I\{x'\theta_0 + w + u \ge 0\} - I\{x'\theta_0 + w + u < 0\}$, and
\[
f_{\theta,\nu}(z) = h(x,w,u) \big[ I\{x'\theta + w_u \ge 0, q_\nu(\tilde{x}) > .5\} - I\{x'\theta + w_l < 0, q_\nu(\tilde{x}) \le .5\} \big]. \tag{19}
\]
We impose the following assumptions. Let $\partial\Theta_I$ be the boundary of $\Theta_I$, $\kappa_u(\tilde{x}) = (2 q_{\nu_0}(\tilde{x}) - 1) I\{q_{\nu_0}(\tilde{x}) > .5\}$, and $\kappa_l(\tilde{x}) = (1 - 2 q_{\nu_0}(\tilde{x})) I\{q_{\nu_0}(\tilde{x}) \le .5\}$.

(a): $\{x_t, w_t, w_{lt}, w_{ut}, u_t\}$ satisfies Assumption D. $x \mid w_u$ has a bounded and continuous conditional density $p(\cdot \mid w_u)$ for almost every $w_u$. There exists an element $x_j$ of $x$ whose conditional density $p(x_j \mid w_u)$ is bounded away from zero over the support of $w_u$. The same condition holds for $x \mid w_l$. The conditional densities of $w_u \mid w_l, x$ and $w_l \mid w_u, x$ are bounded. $q_\nu(\cdot)$ is continuously differentiable at $\nu_0$ a.s. and the derivative is bounded for almost every $\tilde{x}$.

(b): For each $\theta \in \partial\Theta_I$, $\kappa_u(\tilde{x})$ is non-negative for $x'\theta + w_u \ge 0$, $\kappa_l(\tilde{x})$ is non-positive for $x'\theta + w_l \le 0$, $\kappa_u(\tilde{x})$ and $\kappa_l(\tilde{x})$ are continuously differentiable, and it holds that
\[
P\{ x'\theta + w_u = 0, \ q_{\nu_0}(\tilde{x}) > .5, \ (\theta' \partial \kappa_u(\tilde{x}) / \partial x)\, p(x \mid w_l, w_u) > 0 \} > 0, \ \text{or}
\]
\[
P\{ x'\theta + w_l = 0, \ q_{\nu_0}(\tilde{x}) \le .5, \ (\theta' \partial \kappa_l(\tilde{x}) / \partial x)\, p(x \mid w_l, w_u) > 0 \} > 0.
\]
(c): There exist some $\varepsilon, C > 0$ such that $|q_\nu(\tilde{x}) - q_{\nu_0}(\tilde{x})| \le C |\nu - \nu_0| \sqrt{k_n}$ for any $|\nu - \nu_0| < \varepsilon$.

To apply Theorem 4, we verify that $\{f_{\theta,\nu_0} : \theta \in \Theta\}$ satisfies Assumption S with $h_n = 1$. We first check Assumption S (i). This class is clearly bounded. From Manski and Tamer (2002, Lemma 1

and Corollary (a)), $P f_{\theta,\nu_0}$ is maximized at any $\theta \in \Theta_I$ and $\Theta_I$ is a bounded convex set. By applying the argument in Kim and Pollard (1990, pp. 214-215), the second directional derivative at $\theta \in \partial\Theta_I$ in the orthogonal direction outward from $\Theta_I$ is
\[
-2 P \int I\{x'\theta = -w_u\} \Big( \theta' \frac{\partial \kappa_u(\tilde{x})}{\partial x} \Big) p(x \mid w_l, w_u) (x'\theta)^2 \, d\sigma_u - 2 P \int I\{x'\theta = -w_l\} \Big( \theta' \frac{\partial \kappa_l(\tilde{x})}{\partial x} \Big) p(x \mid w_l, w_u) (x'\theta)^2 \, d\sigma_l,
\]
where $\sigma_u$ and $\sigma_l$ are the surface measures on the boundaries of the sets $\{x : x'\pi_\theta + w_u \ge 0\}$ and $\{x : x'\pi_\theta + w_l \ge 0\}$, respectively. Since this matrix is negative definite by (b), Assumption S (i) is verified. We next check Assumption S (ii). By $h(x,w,u)^2 = 1$, observe that
\[
\| f_{\theta,\nu_0} - f_{\pi_\theta,\nu_0} \|_2 \ge \sqrt{2} \min\left\{ \begin{array}{c} P\{ x'\theta \ge -w_u > x'\pi_\theta \ \text{or} \ x'\theta < -w_u \le x'\pi_\theta, \ q_{\nu_0}(\tilde{x}) > .5 \}^{1/2}, \\ P\{ x'\theta \ge -w_l > x'\pi_\theta \ \text{or} \ x'\theta < -w_l \le x'\pi_\theta, \ q_{\nu_0}(\tilde{x}) \le .5 \}^{1/2} \end{array} \right\}.
\]
Since the right hand side involves the probabilities of pairs of wedge-shaped regions with angles of order $|\theta - \pi_\theta|$, (a) and (b) guarantee Assumption S (ii). For Assumption S (iii), since the indicator contrasts are bounded by $1$, we obtain
\[
P \sup_{\theta \in \Theta : 0 < |\theta - \pi_\theta| < \varepsilon} | f_{\theta,\nu_0} - f_{\pi_\theta,\nu_0} |^2 \le P \sup_{\theta \in \Theta : 0 < |\theta - \pi_\theta| < \varepsilon} I\{ x'\theta \ge -w_u > x'\pi_\theta \ \text{or} \ x'\theta < -w_u \le x'\pi_\theta \} + P \sup_{\theta \in \Theta : 0 < |\theta - \pi_\theta| < \varepsilon} I\{ x'\theta \ge -w_l > x'\pi_\theta \ \text{or} \ x'\theta < -w_l \le x'\pi_\theta \},
\]
for each $\varepsilon > 0$. Again, the right hand side is the sum of the probabilities of pairs of wedge-shaped regions with angles of order $\varepsilon$. Thus, (a) also guarantees Assumption S (iii).

Next, we verify (13) and (14). Let $I_\nu(\tilde{x}) = I\{ q_\nu(\tilde{x}) > .5 \ge q_{\nu_0}(\tilde{x}) \ \text{or} \ q_\nu(\tilde{x}) \le .5 < q_{\nu_0}(\tilde{x}) \}$, so that $|f_{\theta,\nu}(z) - f_{\theta,\nu_0}(z)| \le I_\nu(\tilde{x})$. Observe that
\[
P \sup_{|\nu - \nu_0| < \varepsilon} I\{ q_\nu(\tilde{x}) > .5 \ge q_{\nu_0}(\tilde{x}) \} = P \sup_{|\nu - \nu_0| < \varepsilon} I\{ q_\nu(\tilde{x}) - q_{\nu_0}(\tilde{x}) > .5 - q_{\nu_0}(\tilde{x}) \ge 0 \} \le C P \sup_{|\nu - \nu_0| < \varepsilon} | q_\nu(\tilde{x}) - q_{\nu_0}(\tilde{x}) | \le C \sqrt{k_n} \, \varepsilon,
\]
where the first inequality is due to the boundedness of the conditional density of $q_{\nu_0}(\tilde{x})$ and the second to Condition (c). This verifies (13), and (14) is verified in the same manner as Assumption S (ii) in the preceding paragraph. Finally, for (15), note that

\[
\big| P(f_{\theta,\nu} - f_{\theta,\nu_0}) - P(f_{\pi_\theta,\nu} - f_{\pi_\theta,\nu_0}) \big| \le P\, I\{ x'\theta \ge -w_u > x'\pi_\theta \ \text{or} \ x'\theta < -w_u \le x'\pi_\theta \} I_\nu(\tilde{x}) + P\, I\{ x'\theta \ge -w_l > x'\pi_\theta \ \text{or} \ x'\theta < -w_l \le x'\pi_\theta \} I_\nu(\tilde{x}), \tag{21}
\]
for each $\theta \in \{\theta \in \Theta : |\theta - \pi_\theta| < \varepsilon\}$ and $\nu$ in a neighborhood of $\nu_0$. For the first term of (21), the law of iterated expectations and an expansion of $q_\nu(\tilde{x})$ around $\nu_0$ based on (a) imply
\[
P\, I\{ x'\theta \ge -w_u > x'\pi_\theta \ \text{or} \ x'\theta < -w_u \le x'\pi_\theta \} I_\nu(\tilde{x}) = O(|\theta - \pi_\theta| \, |\nu - \nu_0|) = o(|\theta - \pi_\theta|^2) + O(|\nu - \nu_0|^2).
\]
The same argument applies to the second term, so (15) is satisfied.

Manski and Tamer (2002) proved the consistency of $\hat{\Theta}$ to $\Theta_I$ in terms of the Hausdorff distance. We provide a sharper lower bound on their tuning parameter $\epsilon_n$, which in our notation is $\hat{c} n^{-1/2}$ with $\hat{c} \to \infty$. For example, if we set $\hat{c} = \log n$, then the convergence rate becomes $H(\hat{\Theta}, \Theta_I) = O_p(n^{-1/4}(\log n)^{1/2})$. We basically verify the high level assumption of Chernozhukov, Hong and Tamer (2007, Condition C.2) in the cube root context. However, we mention that in the above setup the criterion contains nuisance parameters with increasing dimension, so the result in Chernozhukov, Hong and Tamer (2007) is not directly applicable. Furthermore, our result enables us to construct the confidence set by subsampling as described by Chernozhukov, Hong and Tamer (2007). Specifically, the maximal inequality in Lemma MS' and the assumption that $\{h_n^{1/2} f_{n,\theta,\nu_0} : \theta \in \Theta_I\}$ is $P$-Donsker are sufficient to verify their Conditions C.4 and C.5.
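A minimal numerical sketch of the modified maximum score set estimator (18) with the cutoff $\epsilon_n = \hat{c}\, n^{-1/2}$, $\hat{c} = \log n$, suggested by Theorem 4. The grid search and all names are our illustrative assumptions; in the test the true $q_{\nu_0}$ is plugged in instead of a nonparametric estimate.

```python
import numpy as np

def mt_set_estimator(y, x, wl, wu, q_hat, theta_grid, c_hat=None):
    """Manski-Tamer-type set estimator: keep every theta whose score S_n(theta)
    is within c_hat * n^(-1/2) of the maximum over the grid."""
    n = len(y)
    if c_hat is None:
        c_hat = np.log(n)  # the cutoff choice discussed in the text
    above = q_hat > 0.5  # indicator transform of the estimated P(y=1|x~)
    def S(theta):
        su = np.sign(x @ theta + wu)
        sl = np.sign(x @ theta + wl)
        return np.mean((y - 0.5) * np.where(above, su, sl))
    Sn = np.array([S(th) for th in theta_grid])
    keep = Sn.max() - Sn <= c_hat / np.sqrt(n)
    return np.asarray(theta_grid)[keep]
```

Because the cutoff shrinks at rate $n^{-1/2}$ up to the log factor, the returned set covers the identified set with probability approaching one while its Hausdorff diameter shrinks at the rate given in the text.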

4.2. Parameter-dependent local M-estimation. We consider a setup where the localization of the criterion depends on the parameter values. A leading example is mode estimation. Chernoff (1964) studied the asymptotic property of the mode estimator that maximizes $(nh)^{-1} \sum_{t=1}^{n} I\{|y_t - \theta| \le h\}$ with respect to $\theta$ for some fixed $h$, which was extended to the regression case by Lee (1989). Lee (1989) established consistency of the mode regression estimator and conjectured the cube root convergence rate. To estimate the mode consistently for a broader family of distributions, however, we need to treat $h$ as a bandwidth parameter and let $h \to 0$, as in Yao, Lindsay and Li (2012) for example. This parameter-dependent localization alters Assumption M (iii). That is, it increases the size (in terms of the $L_2$-norm) of the envelope of the class $\{ h^{-1}(I\{|y_t - \theta| \le h\} - I\{|y_t - \theta_0| \le h\}) : |\theta - \theta_0| \le \varepsilon \}$. To deal with such situations, we modify Assumption M (iii) as follows.

Assumption M (iii'). There exists a positive constant $C''$ such that
\[
P \sup_{\theta \in \Theta : |\theta - \theta'| < \varepsilon} h_n^2 | f_{n,\theta} - f_{n,\theta'} |^2 \le C'' \varepsilon,
\]
for all $n$ large enough, $\varepsilon > 0$ small enough, and $\theta'$ in a neighborhood of $\theta_0$.

Under this assumption, Lemma M in Section 2 is modified as follows.

Lemma M1. Under Assumption M (i), (ii), and (iii'), there exist positive constants $C$ and $C'$ such that
\[
P \sup_{|\theta - \theta_0| < \delta} \big| \mathbb{G}_n h_n^{1/2} (f_{n,\theta} - f_{n,\theta_0}) \big| \le C h_n^{-1/2} \delta^{1/2},
\]
for all $n$ large enough and $\delta \in [(nh_n^2)^{-1/2}, C']$.

Parameter dependency arises in different contexts as well and may well lead to different types of non-standard distributions. For instance, the maximum likelihood estimator for Uniform$[0,\theta]$ yields super consistency (see Hirano and Porter, 2003, for a general discussion). This contrast is similar to the difference between estimation of a change point in regression and mode regression. Once we have obtained Lemma M1, the remaining steps are similar to those in Section 2 with "$h_n$" replaced by "$h_n^2$". Here we present the final result without the nuisance parameter $\nu$ for the sake of expositional simplicity.
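For concreteness, a Chernoff-type mode estimator with a shrinking window can be sketched as follows. This is our own illustration; restricting the search to the observed data points is a simplifying assumption (the count criterion is piecewise constant, so this loses little).

```python
import numpy as np

def mode_estimator(y, h_n):
    """Chernoff-type mode estimator: maximize (n h_n)^(-1) sum_t 1{|y_t - theta| <= h_n}
    over theta, treating h_n as a bandwidth that shrinks with n."""
    grid = np.sort(y)  # search over data points (a simplification)
    counts = np.array([np.count_nonzero(np.abs(y - th) <= h_n) for th in grid])
    return grid[int(np.argmax(counts))]
```

Under the conditions above, the resulting estimator converges at the $(nh_n^2)^{1/3}$ rate of Theorem 5 rather than the usual cube root rate with fixed $h$.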

Theorem 5. Let $\{f_{n,\theta} : \theta \in \Theta\}$ satisfy Assumption M (i), (ii), and (iii'). Additionally, if (5) holds with $(g_{n,s} - P g_{n,s})$ for each $s$, where $g_{n,s} = n^{1/6} h_n^{4/3} (f_{n,\theta_0 + s(nh_n^2)^{-1/3}} - f_{n,\theta_0})$, then
\[
(nh_n^2)^{1/3} (\hat{\theta} - \theta_0) \stackrel{d}{\to} \arg\max_s Z(s),
\]
where $Z(s)$ is a Gaussian process with continuous sample paths, expected value $s'Vs/2$ and covariance kernel $H(s_1,s_2) = \lim_{n\to\infty} \sum_{t=-n}^{n} \mathrm{Cov}(g_{n,s_1}(z_0), g_{n,s_2}(z_t)) < \infty$.

4.2.1. Example: Hough transform estimator. In the statistics literature on computer vision algorithms, Goldenshluger and Zeevi (2004) investigated the so-called Hough transform estimator for the regression model $y_t = x_t'\beta_0 + u_t$:
\[
\hat{\beta} = \arg\max_{\beta} \sum_{t=1}^{n} I\{ |y_t - x_t'\beta| \le h |x_t| \}, \tag{22}
\]
where $x_t = (1, \tilde{x}_t)'$ for a scalar $\tilde{x}_t$ and $h$ is a fixed tuning constant. Goldenshluger and Zeevi (2004) derived the cube root asymptotics of $\hat{\beta}$ with fixed $h$, and discussed carefully the practical choice of $h$. However, for this estimator $h$ plays the role of a bandwidth, and the analysis for the case of $h_n \to 0$ is a substantial open question (see Goldenshluger and Zeevi, 2004, pp. 1915-6). Here we focus on the Hough transform estimator in (22) with $h = h_n \to 0$ and study its asymptotic property. The estimators by Chernoff (1964) and Lee (1989) with varying $h$ can be analyzed in the same manner. Let us impose the following assumptions.

(a): $\{x_t, u_t\}$ satisfies Assumption D. $x_t$ and $u_t$ are independent. $P|x_t|^3 < \infty$, $P x_t x_t'$ is positive definite, and the distribution of $x_t$ puts zero mass on each hyperplane. The density $\gamma$ of $u_t$ is bounded, continuously differentiable in a neighborhood of zero, symmetric around zero, and strictly unimodal at zero.

(b): As $n \to \infty$, $h_n \to 0$ and $nh_n^5 \to \infty$.

Let $z = (x, u)$. Note that $\hat{\theta} = \hat{\beta} - \beta_0$ is written as $\hat{\theta} = \arg\max_{\theta} P_n f_{n,\theta}$, where
\[
f_{n,\theta}(z) = h_n^{-1} I\{ |u - x'\theta| \le h_n |x| \}.
\]
The consistency of $\hat{\theta}$ follows from the uniform convergence $\sup_{\theta} |P_n f_{n,\theta} - P f_{n,\theta}| \stackrel{p}{\to} 0$ by applying Nobel and Dembo (1993, Theorem 1).

In order to apply Theorem 5, we verify that $\{f_{n,\theta}\}$ satisfies Assumption M (i), (ii), and (iii'). Obviously $h_n f_{n,\theta}$ is bounded. Since $\lim_{n\to\infty} P f_{n,\theta} = 2 P \gamma(x'\theta)|x|$ and $\gamma$ is uniquely maximized at zero (by (a)), $\lim_{n\to\infty} P f_{n,\theta}$ is uniquely maximized at $\theta = 0$. Since $\gamma$ is continuously differentiable in a neighborhood of zero, $P f_{n,\theta}$ is twice continuously differentiable at $\theta = 0$ for all $n$ large enough. Let $\Gamma$ be the distribution function of $u$. An expansion yields
\[
P(f_{n,\theta} - f_{n,0}) = h_n^{-1} P\{ \Gamma(x'\theta + h_n|x|) - \Gamma(h_n|x|) \} - h_n^{-1} P\{ \Gamma(x'\theta - h_n|x|) - \Gamma(-h_n|x|) \} = \ddot{\gamma}(0)\, \theta' P(|x| x x') \theta \{ 1 + O(h_n) \} + o(|\theta|^2),
\]
i.e., the condition in (2) holds with $V = \ddot{\gamma}(0) P(|x| x x')$. Note that $\ddot{\gamma}(0) < 0$ by (a). Therefore, Assumption M (i) is satisfied.

For Assumption M (ii), pick any $\theta_1$ and $\theta_2$ and note that
\[
h_n \| f_{n,\theta_1} - f_{n,\theta_2} \|_2^2 = 2 P\{ \gamma(x'\theta_1) + \gamma(x'\theta_2) \}|x| - 2 h_n^{-1} P\{ |u - x'\theta_1| \le h_n|x|, \ |u - x'\theta_2| \le h_n|x| \} + o(1),
\]
which is bounded from below by a positive constant times $|\theta_1 - \theta_2|$ for all $\theta_1, \theta_2$ close enough to zero, so Assumption M (ii) is satisfied. For Assumption M (iii'), observe that
\[
P \sup_{\theta \in \Theta : |\theta - \vartheta| < \varepsilon} h_n^2 | f_{n,\theta} - f_{n,\vartheta} |^2 \le P \sup_{\theta \in \Theta : |\theta - \vartheta| < \varepsilon} I\{ |u - x'\vartheta| \le h_n|x|, \ |u - x'\theta| > h_n|x| \} + P \sup_{\theta \in \Theta : |\theta - \vartheta| < \varepsilon} I\{ |u - x'\theta| \le h_n|x|, \ |u - x'\vartheta| > h_n|x| \},
\]
for all $\vartheta$ in a neighborhood of $0$. Since the same argument applies to the second term, we focus on the first term (say, $T$). If $\varepsilon \le 2h_n$, then an expansion around $\varepsilon = 0$ implies
\[
T \le P\{ (h_n - \varepsilon)|x| \le u \le h_n|x| \} = P \gamma(h_n|x|)|x| \, \varepsilon + o(\varepsilon).
\]
Also, if $\varepsilon > 2h_n$, then an expansion around $h_n = 0$ implies
\[
T \le P\{ -h_n|x| \le u \le h_n|x| \} \le P \gamma(0)|x| \, \varepsilon + o(h_n).
\]
Therefore, Assumption M (iii') is satisfied. Finally, the covariance kernel is obtained in a similar way to Section 3.1. Let $r_n = (nh_n^2)^{1/3}$ be the convergence rate in this example. The covariance kernel is written as $H(s_1,s_2) = \frac{1}{2}\{ L(s_1,0) + L(0,s_2) - L(s_1,s_2) \}$, where $L(s_1,s_2) = \lim_{n\to\infty} \mathrm{Var}(r_n P_n g_{n,t})$ with $g_{n,t} = f_{n,s_1/r_n} - f_{n,s_2/r_n}$. An

expansion implies $n^{-1} \mathrm{Var}(r_n^2 g_{n,t}) \to 2 \gamma(0) P |x'(s_1 - s_2)|$. We can also see that the covariance term $2 \sum_{m=1}^{\infty} \mathrm{Cov}(r_n g_{n,t}, r_n g_{n,t+m})$ is negligible. Therefore, by Theorem 5, the limiting distribution of the Hough transform estimator with the bandwidth $h_n$ is obtained as
\[
(nh_n^2)^{1/3} (\hat{\beta} - \beta_0) \stackrel{d}{\to} \arg\max_s Z(s),
\]
where $Z(s)$ is a Gaussian process with continuous sample paths, expected value $\ddot{\gamma}(0) s' P(|x| x x') s / 2$, and covariance kernel $H(s_1,s_2) = 2 \gamma(0) P |x'(s_1 - s_2)|$.
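The Hough transform estimator (22) with a shrinking tuning parameter can be computed by grid search over intercept and slope. The following sketch is our illustration only (grid ranges and names are assumptions, not Goldenshluger and Zeevi's implementation):

```python
import numpy as np

def hough_estimator(y, x_tilde, h_n, span=3.0, grid_size=61):
    """Hough transform estimator (22) with shrinking h_n: grid search for the
    line beta = (intercept, slope) covering the most points inside the
    parameter-dependent band |y - x'beta| <= h_n * |x|."""
    X = np.column_stack([np.ones_like(x_tilde), x_tilde])
    norms = np.linalg.norm(X, axis=1)  # |x_t| with x_t = (1, x~_t)'
    b0s = np.linspace(-span, span, grid_size)
    b1s = np.linspace(-span, span, grid_size)
    best_cnt, best = -1, None
    for b0 in b0s:
        for b1 in b1s:
            cnt = np.count_nonzero(np.abs(y - b0 - b1 * x_tilde) <= h_n * norms)
            if cnt > best_cnt:
                best_cnt, best = cnt, np.array([b0, b1])
    return best
```

In line with condition (b), the test below uses $h_n = n^{-1/5}$, which satisfies $h_n \to 0$ and $n h_n^5 \to \infty$.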

5. Conclusion

This paper developed a general asymptotic theory which encompasses a wide class of highly non-regular M-estimation problems. Many of these problems have been left without a proper inference method for a long time. It is worth emphasizing that our theory validates inference based on subsampling for this important class of estimators, including the confidence set construction for set-valued parameters in Manski and Tamer's (2002) binary choice model with an interval regressor. An interesting direction for future research is to develop valid bootstrap methods for these estimators. Naive applications of the standard bootstrap lead to inconsistent inference, as shown by Abrevaya and Huang (2005) and Sen, Banerjee and Woodroofe (2010) among others.

Supplemental material. This paper has an on-line supplemental material, which contains all the proofs of the theorems and lemmas and additional examples.

References

[1] Abrevaya, J. and J. Huang (2005) On the bootstrap of the maximum score estimator, Econometrica, 73, 1175-1204.
[2] Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H. and J. W. Tukey (1972) Robust Estimates of Location, Princeton University Press, Princeton.
[3] Andrews, D. W. K. (1993) An introduction to econometric applications of empirical process theory for dependent random variables, Econometric Reviews, 12, 183-216.
[4] Anevski, D. and O. Hössjer (2006) A general asymptotic scheme for inference under order restrictions, Annals of Statistics, 34, 1874-1930.
[5] Arcones, M. A. and B. Yu (1994) Central limit theorems for empirical and U-processes of stationary mixing sequences, Journal of Theoretical Probability, 7, 47-71.
[6] Banerjee, M. and I. W. McKeague (2007) Confidence sets for split points in decision trees, Annals of Statistics, 35, 543-574.
[7] Belloni, A., Chen, D., Chernozhukov, V. and C. Hansen (2012) Sparse models and methods for optimal instruments with application to eminent domain, Econometrica, 80, 2369-2429.
[8] Bühlmann, P. and B. Yu (2002) Analyzing bagging, Annals of Statistics, 30, 927-961.
[9] Carrasco, M. and X. Chen (2002) Mixing and moment properties of various GARCH and stochastic volatility models, Econometric Theory, 18, 17-39.
[10] Chan, K. S. (1993) Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model, Annals of Statistics, 21, 520-533.
[11] Chen, X. (2007) Large sample sieve estimation of semi-nonparametric models, Handbook of Econometrics, vol. 6B, ch. 76, Elsevier.
[12] Chen, X., Hansen, L. P. and M. Carrasco (2010) Nonlinearity and temporal dependence, Journal of Econometrics, 155, 155-169.
[13] Chernoff, H. (1964) Estimation of the mode, Annals of the Institute of Statistical Mathematics, 16, 31-41.
[14] Chernozhukov, V., Hong, H. and E. Tamer (2007) Estimation and confidence regions for parameter sets in econometric models, Econometrica, 75, 1243-1284.
[15] de Jong, R. M. and T. Woutersen (2011) Dynamic time series binary choice, Econometric Theory, 27, 673-702.
[16] Doukhan, P., Massart, P. and E. Rio (1994) The functional central limit theorem for strongly mixing processes, Annales de l'Institut Henri Poincaré, Probability and Statistics, 30, 63-82.
[17] Doukhan, P., Massart, P. and E. Rio (1995) Invariance principles for absolutely regular empirical processes, Annales de l'Institut Henri Poincaré, Probability and Statistics, 31, 393-427.
[18] Gautier, E. and Y. Kitamura (2013) Nonparametric estimation in random coefficients binary choice models, Econometrica, 81, 581-607.
[19] Goldenshluger, A. and A. Zeevi (2004) The Hough transform estimator, Annals of Statistics, 32, 1908-1932.
[20] Hirano, K. and J. R. Porter (2003) Asymptotic efficiency in parametric structural models with parameter-dependent support, Econometrica, 71, 1307-1338.
[21] Honoré, B. E. and E. Kyriazidou (2000) Panel data discrete choice models with lagged dependent variables, Econometrica, 68, 839-874.
[22] Kim, J. and D. Pollard (1990) Cube root asymptotics, Annals of Statistics, 18, 191-219.
[23] Lee, M.-J. (1989) Mode regression, Journal of Econometrics, 42, 337-349.
[24] Manski, C. F. (1975) Maximum score estimation of the stochastic utility model of choice, Journal of Econometrics, 3, 205-228.
[25] Manski, C. F. (1985) Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator, Journal of Econometrics, 27, 313-333.
[26] Manski, C. F. and E. Tamer (2002) Inference on regressions with interval data on a regressor or outcome, Econometrica, 70, 519-546.
[27] Moon, H. R. (2004) Maximum score estimation of a nonstationary binary choice model, Journal of Econometrics, 122, 385-403.
[28] Nobel, A. and A. Dembo (1993) A note on uniform laws of averages for dependent processes, Statistics and Probability Letters, 17, 169-172.
[29] Politis, D. N., Romano, J. P. and M. Wolf (1999) Subsampling, Springer-Verlag, New York.
[30] Prakasa Rao, B. L. S. (1969) Estimation of a unimodal density, Sankhyā, A, 31, 23-36.
[31] Pollard, D. (1993) The asymptotics of a binary choice model, Technical report.
[32] Polonik, W. and Q. Yao (2000) Conditional minimum volume predictive regions for stochastic processes, Journal of the American Statistical Association, 95, 509-519.
[33] Rio, E. (1997) About the Lindeberg method for strongly mixing sequences, ESAIM: Probability and Statistics, 1, 35-61.
[34] Rousseeuw, P. J. (1984) Least median of squares regression, Journal of the American Statistical Association, 79, 871-880.
[35] Sen, B., Banerjee, M. and M. Woodroofe (2010) Inconsistency of bootstrap: the Grenander estimator, Annals of Statistics, 38, 1953-1977.
[36] van der Vaart, A. W. and J. A. Wellner (1996) Weak Convergence and Empirical Processes, Springer, New York.
[37] Yao, W., Lindsay, B. G. and R. Li (2012) Local modal regression, Journal of Nonparametric Statistics, 24, 647-663.
[38] Zinde-Walsh, V. (2002) Asymptotic theory for some high breakdown point estimators, Econometric Theory, 18, 1172-1196.

Department of Economics, London School of Economics, Houghton Street, London, WC2A 2AE, UK. E-mail address: [email protected]

Department of Economics, London School of Economics, Houghton Street, London, WC2A 2AE, UK. E-mail address: [email protected]

SUPPLEMENT TO "LOCAL M-ESTIMATION WITH DISCONTINUOUS CRITERION FOR DEPENDENT AND LIMITED OBSERVATIONS"

MYUNG HWAN SEO AND TAISUKE OTSU

Abstract. In Section A, we present the proofs of the lemmas and theorems in the paper. Section B provides additional examples: the dynamic maximum score estimator (Section B.1), the dynamic least median of squares estimator (Section B.2), and monotone density estimation for dependent observations (Section B.3).

Appendix A. Proofs of theorems and lemmas

A.1. Notation. We employ the same notation as in the paper. Recall that $Q_g(u)$ is the inverse function of the tail probability function $x \mapsto P\{|g(z_t)| > x\}$,¹ and that $\{\beta_m\}$ is the sequence of $\beta$-mixing coefficients used in Assumption D. Let $\beta(\cdot)$ be the function such that $\beta(t) = \beta_{[t]}$ if $t \ge 1$ and $\beta(t) = 1$ otherwise, and let $\beta^{-1}(\cdot)$ be the càdlàg inverse of $\beta(\cdot)$. The $L_{2,\beta}(P)$-norm is defined as
$$\|g\|_{2,\beta} = \sqrt{\int_0^1 \beta^{-1}(u) Q_g(u)^2 \, du}. \quad (1)$$

A.2. Proof of Lemma M. Pick any $C_0 > 0$, then pick any $n$ satisfying $(nh_n)^{-1/2} \le C_0$ and any $\delta \in [(nh_n)^{-1/2}, C_0]$. Throughout the proof, the positive constants $C_j$ ($j = 1, 2, \ldots$) are independent of $n$ and $\delta$. First, we introduce some notation. Consider the sets defined by different norms:

$$\mathcal{G}^1_{n,\delta} = \{h_n^{1/2}(f_{n,\theta} - f_{n,\theta_0}) : |\theta - \theta_0| < \delta \text{ for } \theta \in \Theta\},$$
$$\mathcal{G}_{n,\delta} = \{h_n^{1/2}(f_{n,\theta} - f_{n,\theta_0}) : \|h_n^{1/2}(f_{n,\theta} - f_{n,\theta_0})\|_{2,\beta} < \delta \text{ for } \theta \in \Theta\},$$
$$\mathcal{G}^2_{n,\delta} = \{h_n^{1/2}(f_{n,\theta} - f_{n,\theta_0}) : \|h_n^{1/2}(f_{n,\theta} - f_{n,\theta_0})\|_2 < \delta \text{ for } \theta \in \Theta\}.$$
For any $g \in \mathcal{G}^1_{n,\delta}$, $g$ is bounded (by Assumption M (i)) and so is $Q_g$. Thus we can always find a function $\hat g$ such that $\|g\|_2^2 \le \|\hat g\|_2^2 \le 2\|g\|_2^2$ and
$$Q_{\hat g}(u) = \sum_{j=1}^m a_j I\{(j-1)/m \le u < j/m\}, \quad (2)$$
satisfying $Q_g \le Q_{\hat g}$, for some positive integer $m$ and sequence of positive constants $\{a_j\}$.

¹ The function $Q_g(u)$, called the quantile function in Doukhan, Massart and Rio (1995), is different from the familiar function $u \mapsto \inf\{x : u \le P\{|g(z_t)| \le x\}\}$ used to define quantiles.

Next, based on the above notation, we derive the set inclusion relationships
$$\mathcal{G}_{n,\delta} \subset \mathcal{G}^2_{n,\delta} \subset \mathcal{G}^1_{n,C_1\delta}, \qquad \mathcal{G}^1_{n,\delta} \subset \mathcal{G}_{n,C_2\delta^{1/2}}, \quad (3)$$
for some positive constants $C_1$ and $C_2$. The relation $\mathcal{G}_{n,\delta} \subset \mathcal{G}^2_{n,\delta}$ follows from $\|\cdot\|_2 \le \|\cdot\|_{2,\beta}$ (Doukhan, Massart and Rio, 1995, Lemma 1). The relation $\mathcal{G}^2_{n,\delta} \subset \mathcal{G}^1_{n,C_1\delta}$ follows from Assumption M (ii). Pick any $g \in \mathcal{G}^1_{n,\delta}$. The relation $\mathcal{G}^1_{n,\delta} \subset \mathcal{G}_{n,C_2\delta^{1/2}}$ follows by
$$\|g\|_{2,\beta}^2 \le \sum_{j=1}^m a_j^2 \int_{(j-1)/m}^{j/m} \beta^{-1}(u)\,du \le \Big\{m\int_0^{1/m}\beta^{-1}(u)\,du\Big\}\int_0^1 Q_{\hat g}(u)^2\,du \le \sup_{0<a\le 1}\Big\{a^{-1}\int_0^a\beta^{-1}(u)\,du\Big\}\,2\|g\|_2^2 \le C_2^2\delta. \quad (4)$$

Third, based on the above set inclusion relationships, we derive some relationships for the bracketing numbers. Let $N_{[]}(\nu, \mathcal{G}, \|\cdot\|)$ be the bracketing number for a class of functions $\mathcal{G}$ with radius $\nu > 0$ and norm $\|\cdot\|$. Note that
$$N_{[]}(\nu, \mathcal{G}_{n,\delta}, \|\cdot\|_{2,\beta}) \le N_{[]}(\nu, \mathcal{G}^1_{n,C_1\delta}, \|\cdot\|_2) \le C_3\Big(\frac{\delta}{\nu}\Big)^{2d},$$
for some positive constant $C_3$, where the first inequality follows from $\mathcal{G}_{n,\delta} \subset \mathcal{G}^1_{n,C_1\delta}$ (by (3)) and $\|\cdot\|_2 \le \|\cdot\|_{2,\beta}$ (Doukhan, Massart and Rio, 1995, Lemma 1), and the second inequality follows from the argument to derive Andrews (1993, eq. (4.7)) based on Assumption M (iii) (called the $L_2$-continuity assumption in Andrews, 1993). Therefore, by the indefinite integral formula $\int\log x\,dx = \text{const.} + x(\log x - 1)$, there exists a positive constant $C_4$ such that
$$\varphi_n(\delta) = \int_0^\delta\sqrt{\log N_{[]}(\nu, \mathcal{G}_{n,\delta}, \|\cdot\|_{2,\beta})}\,d\nu \le C_4\delta. \quad (5)$$
Finally, based on the entropy condition (5), we apply the maximal inequality of Doukhan, Massart and Rio (1995, Theorem 3): there exists a positive constant $C_5$ such that
$$P^*\sup_{g\in\mathcal{G}_{n,\delta}}|\mathbb{G}_n g| \le C_5[1 + \delta^{-1}q_{\mathcal{G}_{n,\delta}}(\min\{1, v_n(\delta)\})]\varphi_n(\delta), \quad (6)$$
where $q_{\mathcal{G}_{n,\delta}}(v) = \sup_{u\le v}Q_{G_{n,\delta}}(u)\sqrt{\int_0^u\beta^{-1}(\tilde u)\,d\tilde u}$ with the envelope function $G_{n,\delta}$ of $\mathcal{G}_{n,\delta}$, and $v_n(\delta)$ is the unique solution of
$$\frac{v_n(\delta)^2}{\int_0^{v_n(\delta)}\beta^{-1}(\tilde u)\,d\tilde u} = \frac{\varphi_n(\delta)^2}{n\delta^2}.$$
Since $\varphi_n(\delta) \le C_4\delta$ from (5), it holds $v_n(\delta) \le C_6 n^{-1}$ for some positive constant $C_6$. Now take some $n_0$ such that $v_{n_0}(\delta) \le 1$, and then pick again any $n \ge n_0$ and $\delta \in [(nh_n)^{-1/2}, C_0]$. We have
$$q_{\mathcal{G}_{n,\delta}}(\min\{1, v_n(\delta)\}) \le C_7 Q_{G_{n,\delta}}(v_n(\delta))\sqrt{v_n(\delta)} \le C_8(nh_n)^{-1/2}, \quad (7)$$
for some positive constants $C_7$ and $C_8$. Therefore, combining (5)-(7), we obtain
$$P^*\sup_{g\in\mathcal{G}_{n,C_2\delta^{1/2}}}|\mathbb{G}_n g| \le C_9\delta^{1/2}, \quad (8)$$
for some positive constant $C_9$. The conclusion follows from the second relation in (3).

A.3. Proof of Lemma 1. Pick any $C > 0$ and $\varepsilon > 0$. Define $A_n = \{\theta\in\Theta : (nh_n)^{-1/3} \le |\theta - \theta_0| \le C\}$ and
$$R_n^2 = (nh_n)^{2/3}\sup_{\theta\in A_n}\{|P_n(f_{n,\theta} - f_{n,\theta_0}) - P(f_{n,\theta} - f_{n,\theta_0})| - \varepsilon|\theta - \theta_0|^2\}.$$
It is enough to show $R_n = O_p(1)$. Letting $A_{n,j} = \{\theta\in\Theta : (j-1)(nh_n)^{-1/3} \le |\theta - \theta_0| < j(nh_n)^{-1/3}\}$, we have
$$P\{R_n > m\} \le P\{|P_n(f_{n,\theta} - f_{n,\theta_0}) - P(f_{n,\theta} - f_{n,\theta_0})| > \varepsilon|\theta - \theta_0|^2 + (nh_n)^{-2/3}m^2 \text{ for some }\theta\in A_n\}$$
$$\le \sum_{j=1}^\infty P\{(nh_n)^{2/3}|P_n(f_{n,\theta} - f_{n,\theta_0}) - P(f_{n,\theta} - f_{n,\theta_0})| > \varepsilon(j-1)^2 + m^2 \text{ for some }\theta\in A_{n,j}\} \le \sum_{j=1}^\infty\frac{C_0\sqrt{j}}{\varepsilon(j-1)^2 + m^2},$$
for all $m > 0$, where the last inequality is due to the Markov inequality and Lemma M. Since the above sum is finite for all $m > 0$ and converges to zero as $m\to\infty$, the conclusion follows.

A.4. Proof of Lemma C. First of all, any $\beta$-mixing process is $\alpha$-mixing with mixing coefficients $\alpha_m \le \beta_m/2$. Thus it is sufficient to check Conditions (a) and (b) of Rio (1997, Corollary 1). Condition (a) is verified under eq. (3) of the paper by Rio (1997, Proposition 1), which guarantees $\mathrm{Var}(\mathbb{G}_n g_n) \le \int_0^1\beta^{-1}(u)Q_{g_n}(u)^2\,du$ for all $n$. Since $\mathrm{Var}(\mathbb{G}_n g_n)$ is bounded (by eq. (3) of the paper) and $\{z_t\}$ is strictly stationary under Assumption D, Condition (b) of Rio (1997, Corollary 1) can be written as
$$\int_0^1\beta^{-1}(u)Q_{g_n}(u)^2\inf\{n^{-1/2}\beta^{-1}(u)Q_{g_n}(u), 1\}\,du \to 0,$$
as $n\to\infty$. Pick any $u\in(0,1)$. Since $\beta^{-1}(u)Q_{g_n}(u)^2$ is non-increasing in $u\in(0,1)$, the condition in eq. (3) of the paper implies $\beta^{-1}(u)Q_{g_n}(u)^2 \le u^{-1}\sup_n\int_0^1\beta^{-1}(\tilde u)Q_{g_n}(\tilde u)^2\,d\tilde u < \infty$ for each $u$, so that $\inf\{n^{-1/2}\beta^{-1}(u)Q_{g_n}(u), 1\}\to 0$ pointwise. The integrand is dominated by $\beta^{-1}(u)Q_{g_n}(u)^2$, whose integral is bounded uniformly in $n$ by eq. (3) of the paper, and hence the displayed integral converges to zero, which verifies Condition (b).

A.5. Proof of Lemma 2. By Assumption M (i), it holds $|g_{n,s}| \le 2C(nh_n^2)^{1/6}$ for all $n$ and $s$, which implies $Q_{g_{n,s}-Pg_{n,s}}(u)^2 \le 16C^2(nh_n^2)^{1/3}$ for all $n$, $s$, and $u\in(0,1)$. By the condition of this lemma, it holds $Q_{g_{n,s}}(u) \le c$ for all $n$ large enough and $u > c(nh_n^2)^{-1/3}$. By the triangle inequality and the definition of $Q_g$,
$$P\{|g_{n,s} - Pg_{n,s}| > Q_{g_{n,s}}(u) + |Pg_{n,s}|\} \le P\{|g_{n,s}| > Q_{g_{n,s}}(u)\} \le P\{|g_{n,s} - Pg_{n,s}| > Q_{g_{n,s}-Pg_{n,s}}(u)\},$$
which implies $Q_{g_{n,s}-Pg_{n,s}}(u) \le Q_{g_{n,s}}(u) + |Pg_{n,s}|$. Thus, for all $n$ large enough, $s$, and $u > c(nh_n^2)^{-1/3}$, it holds
$$Q_{g_{n,s}-Pg_{n,s}}(u)^2 \le c^2 + |Pg_{n,s}|^2 + 2c|Pg_{n,s}|.$$
Combining these bounds, eq. (3) of the paper is verified as
$$\int_0^1\beta^{-1}(u)Q_{g_{n,s}-Pg_{n,s}}(u)^2\,du \le 16C^2(nh_n^2)^{1/3}\int_0^{c(nh_n^2)^{-1/3}}\beta^{-1}(u)\,du + \int_{c(nh_n^2)^{-1/3}}^1\{c^2 + (Pg_{n,s})^2 + 2c|Pg_{n,s}|\}\beta^{-1}(u)\,du < \infty,$$
for all $n$ large enough, where the second inequality follows by Assumptions M (i) and D (which guarantee $\sup_n(nh_n^2)^{1/3}\int_0^{c(nh_n^2)^{-1/3}}\beta^{-1}(u)\,du < \infty$ and $\int_0^1\beta^{-1}(u)\,du < \infty$).

A.6. Proof of Lemma M'. Pick any $K > 0$ and $\delta > 0$. Let $g_{n,s,s'} = g_{n,s} - g_{n,s'}$,
$$\mathcal{G}^K_n = \{g_{n,s,s'} : |s| \le K, |s'| \le K\},$$
$$\mathcal{G}^1_{n,\delta} = \{g_{n,s,s'}\in\mathcal{G}^K_n : |s - s'| < \delta\},$$
$$\mathcal{G}_{n,\delta} = \{g_{n,s,s'}\in\mathcal{G}^K_n : \|g_{n,s,s'}\|_{2,\beta} < \delta\}.$$
Since $g_{n,s}$ satisfies the condition in eq. (4) of the paper, there exists a positive constant $C_1$ such that $\mathcal{G}^1_{n,\delta} \subset \{g_{n,s,s'}\in\mathcal{G}^K_n : \|g_{n,s,s'}\|_2 \le C_1\delta^{1/2}\}$ for all $n$ large enough and all $\delta > 0$ small enough. Also, by the same argument to derive (4), there exists a positive constant $C_2$ such that $\|g_{n,s,s'}\|_{2,\beta} \le C_2\|g_{n,s,s'}\|_2$ for all $n$ large enough, $|s| \le K$, and $|s'| \le K$. The constant $C_2$ depends only on the mixing sequence $\{\beta_m\}$. Combining these results, we obtain the set inclusion relationship
$$\mathcal{G}^1_{n,\delta} \subset \mathcal{G}_{n,C_1C_2\delta^{1/2}}, \quad (9)$$
for all $n$ large enough and all $\delta > 0$ small enough. Also note that the bracketing numbers satisfy
$$N_{[]}(\nu, \mathcal{G}_{n,\delta}, \|\cdot\|_{2,\beta}) \le N_{[]}(\nu, \mathcal{G}^K_n, \|\cdot\|_2) \le C_3\nu^{-d/2},$$
where the first inequality follows from $\mathcal{G}_{n,\delta} \subset \mathcal{G}^K_n$ (by the definitions) and $\|\cdot\|_2 \le \|\cdot\|_{2,\beta}$ (Doukhan, Massart and Rio, 1995, Lemma 1), and the second inequality follows from the argument to derive Andrews (1993, eq. (4.7)) based on eq. (4) of the paper (called the $L_2$-continuity assumption in Andrews, 1993). Thus, there is a function $\varphi(\eta)$ such that $\varphi(\eta)\to 0$ as $\eta\to 0$ and
$$\varphi_n(\eta) = \int_0^\eta\sqrt{\log N_{[]}(\nu, \mathcal{G}_{n,\eta}, \|\cdot\|_{2,\beta})}\,d\nu \le \varphi(\eta),$$
for all $n$ large enough and all $\eta > 0$ small enough. Based on the above entropy condition, we can apply the maximal inequality of Doukhan, Massart and Rio (1995, Theorem 3): there exists a positive constant $C_4$ depending only on the mixing sequence $\{\beta_m\}$ such that
$$P^*\sup_{g\in\mathcal{G}_{n,\eta}}|\mathbb{G}_n g| \le C_4[1 + \eta^{-1}q_{2G_n}(\min\{1, v_n(\eta)\})]\varphi(\eta),$$
for all $n$ large enough and all $\eta > 0$ small enough, where $q_{2G_n}(v) = \sup_{u\le v}Q_{2G_n}(u)\sqrt{\int_0^u\beta^{-1}(\tilde u)\,d\tilde u}$ with the envelope function $2G_n$ of $\mathcal{G}_{n,\eta}$ (note: by the definition of $\mathcal{G}_{n,\eta}$, the envelope $2G_n$ does not depend on $\eta$), and $v_n(\eta)$ is the unique solution of
$$\frac{v_n(\eta)^2}{\int_0^{v_n(\eta)}\beta^{-1}(\tilde u)\,d\tilde u} = \frac{\varphi_n(\eta)^2}{n\eta^2}.$$
Now pick any $\eta > 0$ small enough so that $2C_4\varphi(\eta) < \delta$. Since $\varphi_n(\eta) \le \varphi(\eta)$, there is a positive constant $C_5$ such that $v_n(\eta) \le C_5\varphi(\eta)^2/(n\eta^2)$ for all $n$ large enough and $\eta > 0$ small enough. Since the envelope satisfies $G_n \le Cn^{\gamma}$ with $0 < \gamma < 1/2$ by the definition of $\mathcal{G}_{n,\eta}$, there exists a positive constant $C_6$ such that $q_{2G_n}(\min\{1, v_n(\eta)\}) \le C_6\varphi(\eta)\eta^{-1}n^{\gamma-1/2}$ for all $n$ large enough. Therefore, by setting $\eta = C_1C_2\delta^{1/2}$, we obtain
$$P^*\sup_{g\in\mathcal{G}_{n,C_1C_2\delta^{1/2}}}|\mathbb{G}_n g| \le \delta,$$
for all $n$ large enough. The conclusion follows by (9).

A.7. Proof of Theorem 1. As discussed in the main body, Lemma 1 yields the convergence rate of the estimator. This enables us to consider the centered and normalized process $Z_n(s)$, which can be defined on an arbitrary compact parameter space. Based on the finite dimensional convergence and tightness of $Z_n$, shown by Lemmas C and M' respectively, we establish weak convergence of $Z_n$. Then a continuous mapping theorem for argmax elements (Kim and Pollard, 1990, Theorem 2.7) yields the limiting distribution of the M-estimator $\hat\theta$.

A.8. Proof of Theorem 2. To ease notation, let $\theta_0 = \nu_0 = 0$. First, we show that $\hat\theta = O_p((nh_n)^{-1/3})$. Since $\{f_{n,\theta,\nu}\}$ satisfies Assumption M (iii), we can apply Lemma M' with $g_{n,s} = n^{1/6}h_n^{2/3}(f_{n,\theta,c(nh_n)^{-1/3}} - f_{n,\theta,0})$ for $s = (\theta', c')'$, which implies
$$\sup_{|\theta|\le\epsilon,\,|c|\le\epsilon}|\mathbb{G}_n n^{1/6}h_n^{2/3}(f_{n,\theta,c(nh_n)^{-1/3}} - f_{n,\theta,0})| = O_p(1), \quad (10)$$
for all $\epsilon > 0$. Also, from eq. (6) of the paper and $\hat\nu = o_p((nh_n)^{-1/3})$, we have
$$P(f_{n,\theta,\hat\nu} - f_{n,\theta,0}) - P(f_{n,0,\hat\nu} - f_{n,0,0}) \le 2\epsilon|\theta|^2 + O_p((nh_n)^{-2/3}), \quad (11)$$
for all $\theta$ in a neighborhood of $\theta_0$ and all $\epsilon > 0$. Combining (10), (11), and Lemma 1,
$$P_n(f_{n,\theta,\hat\nu} - f_{n,0,\hat\nu}) = n^{-1/2}\{\mathbb{G}_n(f_{n,\theta,\hat\nu} - f_{n,\theta,0}) + \mathbb{G}_n(f_{n,\theta,0} - f_{n,0,0}) - \mathbb{G}_n(f_{n,0,\hat\nu} - f_{n,0,0})\}$$
$$\qquad + P(f_{n,\theta,\hat\nu} - f_{n,\theta,0}) + P(f_{n,\theta,0} - f_{n,0,0}) - P(f_{n,0,\hat\nu} - f_{n,0,0})$$
$$\le P(f_{n,\theta,0} - f_{n,0,0}) + 2\epsilon|\theta|^2 + O_p((nh_n)^{-2/3}) \le \frac{1}{2}\theta'V_1\theta + 3\epsilon|\theta|^2 + O_p((nh_n)^{-2/3}),$$
for all $\theta$ in a neighborhood of $\theta_0$ and all $\epsilon > 0$, where the last inequality follows from eq. (6) of the paper. From $P_n(f_{n,\hat\theta,\hat\nu} - f_{n,0,\hat\nu}) \ge -o_p((nh_n)^{-2/3})$, the negative definiteness of $V_1$, and $\hat\nu = o_p((nh_n)^{-1/3})$, we can find $c > 0$ such that
$$-o_p((nh_n)^{-2/3}) \le -c|\hat\theta|^2 + |\hat\theta|\,o_p((nh_n)^{-1/3}) + O_p((nh_n)^{-2/3}),$$
which implies $|\hat\theta| = O_p((nh_n)^{-1/3})$.

Next, we show that $\hat\theta - \tilde\theta = o_p((nh_n)^{-1/3})$. By reparametrization,
$$(nh_n)^{1/3}\hat\theta = \arg\max_s\,(nh_n)^{2/3}[(P_n - P)(f_{n,s(nh_n)^{-1/3},\hat\nu} - f_{n,0,\hat\nu}) + P(f_{n,s(nh_n)^{-1/3},\hat\nu} - f_{n,0,\hat\nu})] + o_p(1).$$
By Lemma M' (replacing $\theta$ with $(\theta,\nu)$) and $\hat\nu = o_p((nh_n)^{-1/3})$,
$$(P_n - P)(f_{n,s(nh_n)^{-1/3},\hat\nu} - f_{n,0,0}) - (P_n - P)(f_{n,s(nh_n)^{-1/3},0} - f_{n,0,0}) = o_p((nh_n)^{-2/3}),$$
uniformly in $s$. Also, eq. (6) of the paper implies $P(f_{n,s(nh_n)^{-1/3},\hat\nu} - f_{n,0,\hat\nu}) - P(f_{n,s(nh_n)^{-1/3},0} - f_{n,0,0}) = o_p((nh_n)^{-2/3})$ uniformly in $s$. Given $\hat\theta - \tilde\theta = o_p((nh_n)^{-1/3})$, an application of Theorem 1 to the class $\{f_{n,\theta,\nu_0} : \theta\in\Theta\}$ implies the limiting distribution of $\hat\theta$.

A.9. Proof of Lemma MS. First, we introduce some notation. Let
$$\mathcal{G}_{n,\delta} = \{h_n^{1/2}(f_{n,\theta} - f_{n,\pi_\theta}) : \|h_n^{1/2}(f_{n,\theta} - f_{n,\pi_\theta})\|_{2,\beta} < \delta \text{ for } \theta\in\Theta\},$$
$$\mathcal{G}^1_{n,\delta} = \{h_n^{1/2}(f_{n,\theta} - f_{n,\pi_\theta}) : |\theta - \pi_\theta| < \delta \text{ for } \theta\in\Theta\},$$
$$\mathcal{G}^2_{n,\delta} = \{h_n^{1/2}(f_{n,\theta} - f_{n,\pi_\theta}) : \|h_n^{1/2}(f_{n,\theta} - f_{n,\pi_\theta})\|_2 < \delta \text{ for } \theta\in\Theta\}.$$
For any $g\in\mathcal{G}^1_{n,\delta}$, $g$ is bounded (Assumption S (i)) and so is $Q_g$. Thus we can always find a function $\hat g$ such that $\|g\|_2^2 \le \|\hat g\|_2^2 \le 2\|g\|_2^2$ and
$$Q_{\hat g}(u) = \sum_{j=1}^m a_jI\{(j-1)/m \le u < j/m\}, \quad (12)$$
satisfying $Q_g \le Q_{\hat g}$, for some positive integer $m$ and sequence of positive constants $\{a_j\}$. Let $r_n = \{nh_n/\log(nh_n)\}^{1/2}$. Pick any $C_0 > 0$, then pick any $n$ satisfying $r_n^{-1/2} \le C_0$ and any $\delta\in[r_n^{-1/2}, C_0]$. Throughout the proof, the positive constants $C_j$ ($j = 1, 2, \ldots$) are independent of $n$ and $\delta$.

Next, based on the above notation, we derive some set inclusion relationships. Let $M = 2\sup_{0<x\le 1}\{x^{-1}\int_0^x\beta^{-1}(u)\,du\}^{1/2}$, which is finite under Assumption D. By the same argument to derive (4), it holds $\|g\|_{2,\beta} \le M\|g\|_2$ for any $g\in\mathcal{G}^1_{n,\delta}$. (13) Then we obtain
$$\mathcal{G}^1_{n,\delta} \subset \mathcal{G}^2_{n,C_1\delta^{1/2}} \subset \mathcal{G}_{n,MC_1\delta^{1/2}}, \qquad \mathcal{G}_{n,\delta} \subset \mathcal{G}^2_{n,\delta} \subset \mathcal{G}^1_{n,C_2\delta}, \quad (14)$$
where the relation $\mathcal{G}^1_{n,\delta} \subset \mathcal{G}^2_{n,C_1\delta^{1/2}}$ follows from Assumption S (iii) and the relation $\mathcal{G}^2_{n,\delta} \subset \mathcal{G}^1_{n,C_2\delta}$ follows from Assumption S (ii).

Third, based on the above set inclusion relationships, we derive some relationships for the bracketing numbers. Let $N_{[]}(\nu, \mathcal{G}, \|\cdot\|)$ be the bracketing number for a class of functions $\mathcal{G}$ with radius $\nu > 0$ and norm $\|\cdot\|$. By (13) and the second relation in (14),
$$N_{[]}(\nu, \mathcal{G}_{n,\delta}, \|\cdot\|_{2,\beta}) \le N_{[]}(\nu, \mathcal{G}^1_{n,C_2\delta}, \|\cdot\|_2) \le \frac{C_3}{\nu^{2d}},$$
for some positive constant $C_3$. Note that the upper bound here is different from the point identified case. Therefore, for some positive constant $C_4$, it holds
$$\varphi_n(\delta) = \int_0^\delta\sqrt{\log N_{[]}(\nu, \mathcal{G}_{n,\delta}, \|\cdot\|_{2,\beta})}\,d\nu \le C_4\delta\sqrt{\log\delta^{-1}}. \quad (15)$$

Finally, based on the above entropy condition, we apply the maximal inequality of Doukhan, Massart and Rio (1995, Theorem 3): there exists a positive constant $C_5$ depending only on the mixing sequence $\{\beta_m\}$ such that
$$P^*\sup_{g\in\mathcal{G}_{n,\delta}}|\mathbb{G}_n g| \le C_5[1 + \delta^{-1}q_{\mathcal{G}_{n,\delta}}(\min\{1, v_n(\delta)\})]\varphi_n(\delta),$$
where $q_{\mathcal{G}_{n,\delta}}(v) = \sup_{u\le v}Q_{G_{n,\delta}}(u)\sqrt{\int_0^u\beta^{-1}(\tilde u)\,d\tilde u}$ with the envelope function $G_{n,\delta}$ of $\mathcal{G}_{n,\delta}$ (note: $\mathcal{G}_{n,\delta}$ is a class of bounded functions), and $v_n(\delta)$ is the unique solution of
$$\frac{v_n(\delta)^2}{\int_0^{v_n(\delta)}\beta^{-1}(\tilde u)\,d\tilde u} = \frac{\varphi_n(\delta)^2}{n\delta^2}.$$
Since $\varphi_n(\delta) \le C_4\delta\sqrt{\log\delta^{-1}}$ from (15), it holds $v_n(\delta) \le C_6n^{-1}(\log\delta^{-1})^2 \le C_6n^{-1}\{\log(nh_n)^{1/2}\}^2$ for some positive constant $C_6$. Now take some $n_0$ such that $v_{n_0}(\delta) \le 1$, and then pick again any $n \ge n_0$ and $\delta\in[r_n^{-1/2}, C_0]$. We have
$$q_{\mathcal{G}_{n,\delta}}(\min\{1, v_n(\delta)\}) \le C_7\sqrt{v_n(\delta)}\,Q_{G_{n,\delta}}(v_n(\delta)) \le C_8(nh_n)^{-1/2}\{\log(nh_n)\}^{1/2},$$
for some positive constants $C_7$ and $C_8$. Therefore, combining these bounds, the conclusion follows by
$$P^*\sup_{g\in\mathcal{G}^1_{n,\delta}}|\mathbb{G}_n g| \le P^*\sup_{g\in\mathcal{G}_{n,MC_1\delta^{1/2}}}|\mathbb{G}_n g| \le C_9(\delta\log\delta^{-1})^{1/2},$$
where the first inequality follows from the first relation in (14).

A.10. Proof of Lemma 3. Pick any $C > 0$ and $\varepsilon > 0$. Then define $A_n = \{\theta\in\Theta\setminus\Theta_I : r_n^{-1/3} \le |\theta - \pi_\theta| \le C\}$ and
$$R_n^2 = r_n^{2/3}\sup_{\theta\in A_n}\{|P_n(f_{n,\theta} - f_{n,\pi_\theta}) - P(f_{n,\theta} - f_{n,\pi_\theta})| - \varepsilon|\theta - \pi_\theta|^2\}.$$
It is enough to show $R_n = O_p(1)$. Letting $A_{n,j} = \{\theta\in\Theta : (j-1)r_n^{-1/3} \le |\theta - \pi_\theta| < jr_n^{-1/3}\}$, we have
$$P\{R_n > m\} \le P\{|P_n(f_{n,\theta} - f_{n,\pi_\theta}) - P(f_{n,\theta} - f_{n,\pi_\theta})| > \varepsilon|\theta - \pi_\theta|^2 + r_n^{-2/3}m^2 \text{ for some }\theta\in A_n\}$$
$$\le \sum_{j=1}^\infty P\{r_n^{2/3}|P_n(f_{n,\theta} - f_{n,\pi_\theta}) - P(f_{n,\theta} - f_{n,\pi_\theta})| > \varepsilon(j-1)^2 + m^2 \text{ for some }\theta\in A_{n,j}\} \le \sum_{j=1}^\infty\frac{C_0\sqrt{j}}{\varepsilon(j-1)^2 + m^2},$$
for all $m > 0$, where the last inequality is due to the Markov inequality and Lemma MS. Since the above sum is finite for all $m > 0$ and converges to zero as $m\to\infty$, the conclusion follows.

A.11. Proof of Theorem 3. Pick any $\vartheta\in\hat\Theta$. By the definition of $\hat\Theta$,
$$P_n(f_{n,\vartheta} - f_{n,\pi_\vartheta}) \ge \max_{\theta\in\Theta}P_nf_{n,\theta} - (nh_n)^{-1/2}\hat c - P_nf_{n,\pi_\vartheta} \ge -(nh_n)^{-1/2}\hat c.$$
Now, suppose $H(\vartheta, \Theta_I) = |\vartheta - \pi_\vartheta| > r_n^{-1/3}$. By Lemma 3 and Assumption S (i),
$$P_n(f_{n,\vartheta} - f_{n,\pi_\vartheta}) \le P(f_{n,\vartheta} - f_{n,\pi_\vartheta}) + \varepsilon|\vartheta - \pi_\vartheta|^2 + r_n^{-2/3}R_n^2 \le (-c + \varepsilon)|\vartheta - \pi_\vartheta|^2 + o(|\vartheta - \pi_\vartheta|^2) + O_p(r_n^{-2/3}),$$
for any $\varepsilon > 0$. Note that $c$, $\varepsilon$, and $R_n$ do not depend on $\vartheta$. By taking $\varepsilon$ small enough, the convergence rate of $\rho(\hat\Theta, \Theta_I)$ is obtained as
$$\rho(\hat\Theta, \Theta_I) = \sup_{\vartheta\in\hat\Theta}|\vartheta - \pi_\vartheta| = O_p(\hat c^{1/2}(nh_n)^{-1/4} + r_n^{-1/3}).$$
Furthermore, for the maximizer $\hat\theta$ of $P_nf_{n,\theta}$, it holds $P_n(f_{n,\hat\theta} - f_{n,\pi_{\hat\theta}}) \ge 0$, which implies $|\hat\theta - \pi_{\hat\theta}| = O_p(r_n^{-1/3})$.

For the convergence rate of $\rho(\Theta_I, \hat\Theta)$, we show $P\{\Theta_I\subset\hat\Theta\}\to 1$ for $\hat c\to\infty$, which implies that $\rho(\Theta_I, \hat\Theta)$ can converge at an arbitrarily fast rate. To see this, note that
$$(nh_n)^{1/2}\max_{\theta'\in\Theta_I}\Big(\max_{\theta\in\Theta}P_nf_{n,\theta} - P_nf_{n,\theta'}\Big) \le |\mathbb{G}_nh_n^{1/2}(f_{n,\hat\theta} - f_{n,\pi_{\hat\theta}})| + (nh_n)^{1/2}|P(f_{n,\hat\theta} - f_{n,\pi_{\hat\theta}})| + 2(nh_n)^{1/2}\max_{\theta'\in\Theta_I}|(P_n - P)f_{n,\theta'}|$$
$$= 2\max_{\theta'\in\Theta_I}|h_n^{1/2}\mathbb{G}_nf_{n,\theta'}| + o_p(1), \quad (16)$$
where the inequality follows from the triangle inequality, and the equality follows from Lemmas MS and 3, Assumption S (i), and the rate $|\hat\theta - \pi_{\hat\theta}| = O_p(r_n^{-1/3})$ obtained above. Therefore, since $\{h_n^{1/2}f_{n,\theta} : \theta\in\Theta_I\}$ is $P$-Donsker (Assumption S (i)), it follows that $P\{\Theta_I\subset\hat\Theta\}\to 1$ if $\hat c\to\infty$.

1/2 (nhn) max (max Pnfn,✓ Pnfn,✓0 ) ✓ ⇥I | ✓ ⇥ | 02 2 (f f ) +(nh )1/2 P (f f ) + 2(nh )1/2 max ( f Pf ) Gn n,✓ˆ n,⇡✓ˆ n n,✓ˆ n,⇡✓ˆ n Pn n,✓0 n,✓0 | | | | | ✓ ⇥I | 02 1/2 =2hn max Gnfn,✓0 + op(1), (16) | ✓ ⇥I | 02 where the inequality follows from the triangle inequality and the equality follows from Lemmas 1/3 ˆ MS and 3, Assumption S (i), and the rate ✓ ⇡✓ˆ = Op(rn ) obtained above. Therefore, since 1/2 hn f ,✓ ⇥ is P -Donsker (Assumption S (i)), it follows P ⇥ ⇥ˆ 1ifˆc . { n,✓ 2 I } { I ⇢ }! !1 A.12. Proof of Lemma MS’. To ease notation, let ⌫ = 0. = g = f f : ✓ ⇡ 0 Gn { n,s n,✓,⌫ n,✓,0 | ✓| K , ⌫ a K ,s=(✓ ,⌫ ) 1 | | n 2 0 0 0} First, we introduce some notation. Let

= g : g < , Gn, { n,s k n,sk2, } 1 = g : ✓ ⇡ 0 small enough. Also, by the same Gn k n,sk2 1 } argument to derive (4), there exists a positive constant C such that g C g for all 2 k n,sk2,  2 k n,sk2 n large enough. The constant C depends only on the mixing sequence . Combining these 2 { m} results, we obtain the set inclusion relationship

1 , (17) n, 1/4 1/2 G ⇢Gn,C1C2kn for all n large enough and all >0 small enough. On the other hand, for any 0 small enough, there exists some C3 such that 2 1 n, n,C , Gn,0 ⇢G 0 ⇢G 3 0 due to the fact that (Doukhan, Massart and Rio, 1995, Lemma 1) and eq. (13) of the k·k2  k·k2, paper. And, the bracketing numbers satisfy

1 N[](⌫, , ) N[](⌫, n,C , ). Gn,0 k·k2,  G 3 0 k·k2

9 1 Furthermore, the bracketing number N[](⌫, , ) can be bounded by the covering number of Gn,C30 k·k2 the parameter space, say, N (⌫, [ K ,K ]d [ C ,C ]kn ) following the argument in Andrews ⇥ 1 1 ⇥ 3 0 3 0 (1993, eq. (4.7)) based on eq. (12) of the paper. 1 1/4 1/2 d Now we set = a K so that = . Also set = C C kn and compute N (⌫, [ K ,K ] n 2 Gn, Gn 0 1 2 ⇥ 1 1 ⇥ 1/2 1/4 1/2 1/4 kn 1/2 [ K20 an kn ,K20 an kn ] ), where K20 = C1C2K2 . By direct calculation, this covering number d+kn 2 d pd+kn 1/2 1/4 kn is bounded by (2K1) 2⌫ (2K20 an kn ) . Building on this, we compute the quantities in the maximal inequality⇣ in (6).⌘ First,

0 'n(0)= log N[](⌫, , )d⌫ Gn,0 k·k2, Z0 q1/2 1/4 K20 an kn 3 2 C3 kn (log knan log ⌫)d⌫  0 Z p 1/2 3/4 1 K a k log k an ,  3 n n n q for some C3 and K3, where the last inequality follows from the indefinite integral formula log xdx = const.+x(log x 1). Second, as in the discussion following (6), v ( ) ' ( )2/(n 2) k log k a 1/n, n 0  n 0 0  Rn n n  which can be made smaller than 1 for large n.Then,qG (min 1,vn(0) ) C4n vn(0), and n,0 { }  1 1/2 1/4 then 0 qGn, (min 1,vn(0) ) C5 for 0 = K20 an kn . Putting these together, we canp bound the { } 1/2 3/4 1 right hand side of (6) by C6an kn log knan for some C6. p

A.13. Proof of Theorem 4. To ease notation, let $\nu_0 = 0$. From eq. (14) of the paper, we have
$$P(f_{n,\theta,\hat\nu} - f_{n,\theta,0}) - P(f_{n,\pi_\theta,\hat\nu} - f_{n,\pi_\theta,0}) = o(|\theta - \pi_\theta|^2) + O(|\hat\nu|^2) + O_p(r_n^{-2/3}), \quad (18)$$
for all $\theta$ in a neighborhood of $\Theta_I$. Combining Lemma MS', eq. (14) of the paper, Assumption S (i), and Lemma 3,
$$P_n(f_{n,\theta,\hat\nu} - f_{n,\pi_\theta,\hat\nu}) = n^{-1/2}\{\mathbb{G}_n(f_{n,\theta,\hat\nu} - f_{n,\theta,0}) - \mathbb{G}_n(f_{n,\pi_\theta,\hat\nu} - f_{n,\pi_\theta,0}) + \mathbb{G}_n(f_{n,\theta,0} - f_{n,\pi_\theta,0})\}$$
$$\qquad + P(f_{n,\theta,\hat\nu} - f_{n,\theta,0}) - P(f_{n,\pi_\theta,\hat\nu} - f_{n,\pi_\theta,0}) + P(f_{n,\theta,0} - f_{n,\pi_\theta,0})$$
$$\le O_p((nh_na_n^{-1})^{-1/2}k_n^{3/4}\sqrt{\log n}) + \epsilon|\theta - \pi_\theta|^2 + O_p(r_n^{-2/3}) + P(f_{n,\theta,0} - f_{n,\pi_\theta,0})$$
$$\le -c|\theta - \pi_\theta|^2 + \epsilon|\theta - \pi_\theta|^2 + O_p(|\hat\nu|^2) + O_p(r_n^{-2/3}), \quad (19)$$
for all $\theta$ in a neighborhood of $\Theta_I$ and all $\epsilon > 0$, where the last inequality follows from eq. (6) of the paper. Here, $\sqrt{\log(k_na_n^{-1})}$ in Lemma MS' is bounded by $\sqrt{\log n}$ up to a constant.

Let $\hat\theta = \arg\max_{\theta\in\Theta}P_nf_{n,\theta,\hat\nu}$. If $|\hat\theta - \pi_{\hat\theta}| > a_n^{1/3} + r_n^{-1/3}$, then $P_n(f_{n,\hat\theta,\hat\nu} - f_{n,\pi_{\hat\theta},\hat\nu}) \ge 0$ and thus by (19),
$$|\hat\theta - \pi_{\hat\theta}| \le o(a_n^{1/3}) + O_p(r_n^{-1/3}) + O_p((nh_na_n^{-1})^{-1/4}k_n^{3/8}(\log n)^{1/4}). \quad (20)$$
And for any $\theta'\in\hat\Theta$, if $|\theta' - \pi_{\theta'}| > a_n^{1/3} + r_n^{-1/3}$, then
$$-(nh_n)^{-1/2}\hat c \le \max_{\theta\in\Theta}P_nf_{n,\theta,\hat\nu} - (nh_n)^{-1/2}\hat c - P_nf_{n,\pi_{\theta'},\hat\nu} \le P_nf_{n,\theta',\hat\nu} - P_nf_{n,\pi_{\theta'},\hat\nu},$$
and thus by (19),
$$|\theta' - \pi_{\theta'}| \le o(a_n^{1/3}) + O_p(r_n^{-1/3}) + O_p((nh_na_n^{-1})^{-1/4}k_n^{3/8}(\log n)^{1/4}) + (nh_n)^{-1/4}\hat c^{1/2}.$$
It remains to show that $P\{\Theta_I\subset\hat\Theta\}\to 1$ for $\hat c\to\infty$. Proceeding as in (16), we get
$$(nh_n)^{1/2}\max_{\theta'\in\Theta_I}\Big(\max_{\theta\in\Theta}P_nf_{n,\theta,\hat\nu} - P_nf_{n,\theta',\hat\nu}\Big) \le |\mathbb{G}_nh_n^{1/2}(f_{n,\hat\theta,\hat\nu} - f_{n,\pi_{\hat\theta},\hat\nu})| + (nh_n)^{1/2}|P(f_{n,\hat\theta,\hat\nu} - f_{n,\pi_{\hat\theta},\hat\nu})| + 2(nh_n)^{1/2}\max_{\theta'\in\Theta_I}|(P_n - P)f_{n,\theta',\hat\nu}|$$
$$= 2\max_{\theta'\in\Theta_I}|h_n^{1/2}\mathbb{G}_nf_{n,\theta',\hat\nu}| + o_p(1),$$
where the first term after the inequality is $o_p(1)$ due to Lemmas 3 and MS', and the second due to (18) and Assumption S (i) together with the rate for $\hat\theta$ in (20). Finally, due to Lemma MS' and the class $\{h_n^{1/2}f_{n,\theta} : \theta\in\Theta_I\}$ being $P$-Donsker, we conclude $P\{\Theta_I\subset\hat\Theta\}\to 1$.

² The circumradius of the unit $s$-dimensional hypercube is $\sqrt{s}/2$, or $\sqrt{\sum_{i=1}^sa_i^2}/2$ for a hypercube of side lengths $(a_1,\ldots,a_s)$.

A.14. Proof of Lemma M1. The proof is similar to that of Lemma M except that, for some positive constants $C''$ and $C'''$, we have
$$\mathcal{G}^1_{n,\delta} \subset \mathcal{G}^2_{n,C''h_n^{-1/2}\delta^{1/2}} \subset \mathcal{G}_{n,C'''h_n^{-1/2}\delta^{1/2}},$$
which reflects the component "$h_n$" in Assumption M (iii') instead of "$h_n^{1/2}$" in Assumption M (iii). As a consequence of this change, the upper bound in the maximal inequality becomes $Ch_n^{-1/2}\delta^{1/2}$ instead of $C\delta^{1/2}$. All the other parts remain the same.

A.15. Proof of Theorem 5. The proof is similar to that of Theorem 1 given Lemma M1.

Appendix B. Additional examples

B.1. Dynamic maximum score. As a further application of Theorem 1, consider the maximum score estimator (Manski, 1975) for the regression model $y_t = x_t'\theta_0 + u_t$, that is,
$$\hat\theta = \arg\max_{\theta\in\mathbb{S}}\sum_{t=1}^n[I\{y_t \ge 0, x_t'\theta \ge 0\} + I\{y_t < 0, x_t'\theta < 0\}],$$
where $\mathbb{S}$ is the surface of the unit sphere in $\mathbb{R}^d$. Since $\hat\theta$ is determined only up to scalar multiples, we standardize it to be of unit length. A key insight of this estimator is to exploit a median or quantile restriction on the disturbances of a latent variable model to construct a population criterion that identifies the structural parameters of interest.
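As a quick numerical illustration of the estimator just defined (not part of the formal development), the sketch below maximizes the maximum score objective by crude random search over the unit sphere; the objective is a step function of $\theta$, so gradient methods do not apply. The simulated design, sample size, and number of search directions are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(theta, y, x):
    # Maximum score objective: count of observations where sign(y_t)
    # agrees with sign(x_t' theta)
    return int(np.sum((y >= 0) == (x @ theta >= 0)))

def max_score(y, x, n_draws=4000):
    # Random search over unit-length directions (the criterion is a step
    # function, so we simply evaluate it on many candidate directions)
    draws = rng.normal(size=(n_draws, x.shape[1]))
    draws /= np.linalg.norm(draws, axis=1, keepdims=True)
    vals = [score(th, y, x) for th in draws]
    return draws[int(np.argmax(vals))]

# Simulated latent-variable design with median-zero (logistic) errors
theta0 = np.array([1.0, 1.0]) / np.sqrt(2.0)
x = rng.normal(size=(1500, 2))
y = x @ theta0 + rng.logistic(size=1500)
theta_hat = max_score(y, x)   # unit length by construction
```

With the cube-root rate, the estimate concentrates around $\theta_0$ at order $n^{-1/3}$, which the random-search resolution here easily matches.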

We impose the following assumptions. Let $h(x,u) = I\{x'\theta_0 + u \ge 0\} - I\{x'\theta_0 + u < 0\}$.

(a): $\{x_t, u_t\}$ satisfies Assumption D. $x_t$ has compact support and a continuously differentiable density $p$. The angular component of $x_t$, considered as a random variable on $\mathbb{S}$, has a bounded and continuous density, and the density for the orthogonal angle to $\theta_0$ is bounded away from zero.

(b): Assume that $|\theta_0| = 1$, $\mathrm{median}(u|x) = 0$, the function $\kappa(x) = E[h(x_t, u_t)|x_t = x]$ is non-negative for $x'\theta_0 \ge 0$ and non-positive for $x'\theta_0 < 0$ and is continuously differentiable, and $P\{x'\theta_0 = 0,\ \dot\kappa(x)'\theta_0p(x) > 0\} > 0$.

Except for Assumption D, which allows dependent observations, all assumptions are similar to the ones in Kim and Pollard (1990, Section 6.4). First, note that the criterion function can be written as

$$f_\theta(x,u) = h(x,u)[I\{x'\theta \ge 0\} - I\{x'\theta_0 \ge 0\}].$$
We can see that $\hat\theta = \arg\max_{\theta\in\mathbb{S}}P_nf_\theta$ and $\theta_0 = \arg\max_{\theta\in\mathbb{S}}Pf_\theta$. Existence and uniqueness of $\theta_0$ are guaranteed by (b) (see Manski, 1985). Also, the uniform law of large numbers for absolutely regular processes by Nobel and Dembo (1993, Theorem 1) implies $\sup_{\theta\in\mathbb{S}}|P_nf_\theta - Pf_\theta| \to 0$ in probability. Therefore, $\hat\theta$ is consistent for $\theta_0$.

We now verify that $\{f_\theta : \theta\in\mathbb{S}\}$ satisfies Assumption M with $h_n = 1$. Assumption M (i) is already verified. By Jensen's inequality,
$$\|f_{\theta_1} - f_{\theta_2}\|_2 = \sqrt{P|I\{x'\theta_1 \ge 0\} - I\{x'\theta_2 \ge 0\}|} \ge P\{x'\theta_1 \ge 0 > x'\theta_2 \text{ or } x'\theta_2 \ge 0 > x'\theta_1\},$$
for any $\theta_1, \theta_2\in\mathbb{S}$. Since the right hand side is the probability of a pair of wedge-shaped regions with an angle of order $|\theta_1 - \theta_2|$, the last condition in (a) implies Assumption M (ii). For Assumption M (iii), pick any $\varepsilon > 0$ and observe that
$$P\sup_{\theta\in\Theta:|\theta-\vartheta|<\varepsilon}|f_\theta - f_\vartheta|^2 = P\sup_{\theta\in\Theta:|\theta-\vartheta|<\varepsilon}I\{x'\theta \ge 0 > x'\vartheta \text{ or } x'\vartheta \ge 0 > x'\theta\},$$
for all $\vartheta$ in a neighborhood of $\theta_0$. Again, the right hand side is the probability of a pair of wedge-shaped regions with an angle of order $\varepsilon$. Thus the last condition in (a) also guarantees Assumption

M (iii). This yields the convergence rate of $n^{-1/3}$ for $\hat\theta$ and the stochastic equicontinuity of the empirical process of the rescaled and centered functions $g_{n,s} = n^{1/6}(f_{\theta_0+sn^{-1/3}} - f_{\theta_0})$. For the finite dimensional convergence, we can check the Lindeberg condition in Lemma C (i.e., eq. (5) in the paper) by Lemma 2 as in Section B.2; see (22) below.

We next compute the expected value and covariance kernel of the limit process (i.e., $V$ and $H$ in Theorem 1). Due to strict stationarity (in Assumption D), we can apply the same argument as in Kim and Pollard (1990, pp. 214-215) to derive the second derivative
$$V = \frac{\partial^2Pf_\theta}{\partial\theta\,\partial\theta'}\bigg|_{\theta=\theta_0} = \int I\{x'\theta_0 = 0\}\dot\kappa(x)'\theta_0\,p(x)\,xx'\,d\sigma,$$
where $\sigma$ is the surface measure on the boundary of the set $\{x : x'\theta_0 \ge 0\}$. The matrix $V$ is negative definite under the last condition of (b). Now pick any $s_1$ and $s_2$, and define $q_{n,t} = f_{\theta_0+n^{-1/3}s_1}(x_t,u_t) - f_{\theta_0+n^{-1/3}s_2}(x_t,u_t)$. The covariance kernel is written as $H(s_1,s_2) = \frac{1}{2}\{L(s_1,0) + L(0,s_2) - L(s_1,s_2)\}$, where
$$L(s_1,s_2) = \lim_{n\to\infty}n^{4/3}\mathrm{Var}(P_nq_{n,t}) = \lim_{n\to\infty}n^{1/3}\Big\{\mathrm{Var}(q_{n,t}) + 2\sum_{m=1}^\infty\mathrm{Cov}(q_{n,t}, q_{n,t+m})\Big\}.$$
The limit of $n^{1/3}\mathrm{Var}(q_{n,t})$ is given in Kim and Pollard (1990, p. 215). For $\mathrm{Cov}(q_{n,t}, q_{n,t+m})$, we note that $q_{n,t}$ takes only three values, $-1$, $0$, or $1$. The definition of $\beta_m$ and Assumption D imply
$$|P\{q_{n,t} = j, q_{n,t+m} = k\} - P\{q_{n,t} = j\}P\{q_{n,t+m} = k\}| \le n^{-2/3}\beta_m,$$
for all $n$, $m \ge 1$ and $j,k = -1, 0, 1$, i.e., $\{q_{n,t}\}$ is a $\beta$-mixing array whose mixing coefficients are bounded by $n^{-2/3}\beta_m$. Then $\{q_{n,t}\}$ is an $\alpha$-mixing array whose mixing coefficients are bounded by $2n^{-2/3}\beta_m$ as well. By applying the $\alpha$-mixing covariance inequality, the covariance is bounded as
$$|\mathrm{Cov}(q_{n,t}, q_{n,t+m})| \le C(n^{-2/3}\beta_m)^{1-2/p}\|q_{n,t}\|_p^2,$$
for some $C > 0$ and $p > 2$. Note that
$$\|q_{n,t}\|_p^2 \le [P\{I\{x'(\theta_0 + s_1n^{-1/3}) \ge 0\} \ne I\{x'(\theta_0 + s_2n^{-1/3}) \ge 0\}\}]^{2/p} = O(n^{-2/(3p)}).$$
Combining these results, we get $n^{1/3}\sum_{m=1}^\infty\mathrm{Cov}(q_{n,t}, q_{n,t+m}) \to 0$ as $n\to\infty$. Therefore, the covariance kernel $H$ is the same as in the independent case in Kim and Pollard (1990, p. 215).

Since $\{f_\theta : \theta\in\mathbb{S}\}$ satisfies Assumption M, Theorem 1 implies that even if the data obey a dependence process specified in Assumption D, the maximum score estimator possesses the same limiting distribution as in the independent case.

B.2. Dynamic least median of squares. As another application of Theorem 2, consider the least median of squares estimator for the regression model $y_t = x_t'\beta_0 + u_t$, that is,
$$\hat\beta = \arg\min_\beta\,\mathrm{median}\{(y_1 - x_1'\beta)^2, \ldots, (y_n - x_n'\beta)^2\}.$$
We impose the following assumptions.

(a): $\{x_t, u_t\}$ satisfies Assumption D. $\{x_t\}$ and $\{u_t\}$ are independent. $P|x_t|^2 < \infty$, $Px_tx_t'$ is positive definite, and the distribution of $x_t$ puts zero mass on each hyperplane.

(b): The density $\gamma$ of $u_t$ is bounded, differentiable, and symmetric around zero, and decreases away from zero. $|u_t|$ has the unique median $\nu_0$ and $\dot\gamma(\nu_0) < 0$, where $\dot\gamma$ is the first derivative of $\gamma$.

Except for Assumption D, which allows dependent observations, all assumptions are similar to the ones in Kim and Pollard (1990, Section 6.3). It is known that $\hat\theta = \hat\beta - \beta_0$ can be written as $\hat\theta = \arg\max_\theta P_nf_{\theta,\hat\nu}$, where
$$f_{\theta,\nu}(x,u) = I\{x'\theta - \nu \le u \le x'\theta + \nu\},$$
and $\hat\nu = \inf\{\nu : \sup_\theta P_nf_{\theta,\nu} \ge \tfrac{1}{2}\}$. Let $\nu_0 = 1$ to simplify the notation. Since $\{f_{\theta,\nu} : \theta\in\mathbb{R}^d, \nu\in\mathbb{R}\}$ is a VC subgraph class, Arcones and Yu (1994, Theorem 1) implies the uniform convergence $\sup_{\theta,\nu}|P_nf_{\theta,\nu} - Pf_{\theta,\nu}| = O_p(n^{-1/2})$. Thus, the same argument as in Kim and Pollard (1990, pp. 207-208) yields the convergence rate $\hat\nu - 1 = O_p(n^{-1/2})$.

Now, we verify the conditions in Theorem 2. By expansions, the condition in eq. (8) of the paper is verified as

$$P(f_{\theta,\nu} - f_{0,1}) = P[\{\Gamma(x'\theta + \nu) - \Gamma(\nu)\} - \{\Gamma(x'\theta - \nu) - \Gamma(-\nu)\}] + P[\{\Gamma(\nu) - \Gamma(1)\} - \{\Gamma(-\nu) - \Gamma(-1)\}]$$
$$= \dot\gamma(1)\theta'Pxx'\theta + o(|\theta|^2 + |\nu - 1|^2), \quad (21)$$
where $\Gamma$ is the distribution function of $u_t$. To check Assumption M (iii) for $\{f_{\theta,\nu} : \theta\in\mathbb{R}^d, \nu\in\mathbb{R}\}$, pick any $\varepsilon > 0$ and decompose
$$P\sup_{(\theta,\nu):|(\theta,\nu)-(\theta',\nu')|<\varepsilon}|f_{\theta,\nu} - f_{\theta',\nu'}|^2 \le P\sup_{(\theta,\nu):|(\theta,\nu)-(\theta',\nu')|<\varepsilon}|f_{\theta,\nu} - f_{\theta,\nu'}|^2 + P\sup_{\theta:|\theta-\theta'|<\varepsilon}|f_{\theta,\nu'} - f_{\theta',\nu'}|^2,$$
for $(\theta',\nu')$ in a neighborhood of $(0,\ldots,0,1)$. By similar arguments to (21), these terms are of order $|\nu - \nu'|$ and $|\theta - \theta'|$, respectively, and hence bounded by $C\varepsilon$ with some $C > 0$.

We now verify that $\{f_{\theta,1} : \theta\in\mathbb{R}^d\}$ satisfies Assumption M with $h_n = 1$. By (b), $Pf_{\theta,1}$ is uniquely maximized at $\theta_0 = 0$, so Assumption M (i) is satisfied. Since Assumption M (iii) has already been shown, it remains to verify Assumption M (ii). Some expansions (using the symmetry of $\gamma(\cdot)$) yield
$$\|f_{\theta_1,1} - f_{\theta_2,1}\|_2^2 = P|\{\Gamma(x'\theta_1 + 1) - \Gamma(x'\theta_2 + 1)\} + \{\Gamma(x'\theta_2 - 1) - \Gamma(x'\theta_1 - 1)\}| \ge -(\theta_2 - \theta_1)'P\dot\gamma(1)xx'(\theta_2 - \theta_1) + o(|\theta_2 - \theta_1|^2),$$
i.e., Assumption M (ii) is satisfied under (b), since $\dot\gamma(1) < 0$. Therefore, $\{f_{\theta,1} : \theta\in\mathbb{R}^d\}$ satisfies Assumption M.

We finally derive the finite dimensional convergence through Lemma 2. Let $g_{n,s}(z_t) = n^{1/6}(I\{|x_t'(\theta_0 + sn^{-1/3}) - u_t| \le 1\} - I\{|x_t'\theta_0 - u_t| \le 1\})$. Then for any $c > 0$,
$$P\{|g_{n,s}(z_t)| > c\} \le P\{x_t'(\theta_0 + sn^{-1/3}) \le u_t - 1 \le x_t'\theta_0\} + P\{x_t'(\theta_0 + sn^{-1/3}) \le u_t + 1 \le x_t'\theta_0\}$$
$$\qquad + P\{x_t'\theta_0 \le u_t - 1 \le x_t'(\theta_0 + sn^{-1/3})\} + P\{x_t'\theta_0 \le u_t + 1 \le x_t'(\theta_0 + sn^{-1/3})\}. \quad (22)$$
However, each of these terms is bounded by $O(n^{-1/3}E|x_t's|)$ due to the boundedness of the density of $u_t$, the independence between $x_t$ and $u_t$, and the law of iterated expectations, i.e., $P\{h(x_t,u_t)\in A\} = EP\{h(x_t,u_t)\in A|x_t\}$ for any $h$ and $A$. Thus, by Lemma 2, the central limit theorem in Lemma C applies.

It remains to characterize the covariance kernel $H(s_1,s_2)$ for any $s_1$ and $s_2$. Define $q_{n,t} = f_{\theta_0+n^{-1/3}s_1}(x_t,u_t) - f_{\theta_0+n^{-1/3}s_2}(x_t,u_t)$. Then standard algebra yields
$$H(s_1,s_2) = \frac{1}{2}\{L(s_1,0) + L(0,s_2) - L(s_1,s_2)\},$$
where $L(s_1,s_2) = \lim_{n\to\infty}n^{4/3}\mathrm{Var}(P_nq_{n,t}) = \lim_{n\to\infty}n^{1/3}\{\mathrm{Var}(q_{n,t}) + 2\sum_{m=1}^\infty\mathrm{Cov}(q_{n,t},q_{n,t+m})\}$. The limit of $n^{1/3}\mathrm{Var}(q_{n,t})$ is given by $2\gamma(1)P|x'(s_2 - s_1)|$ by direct algebra as in Kim and Pollard (1990, p. 213). For the covariance $\mathrm{Cov}(q_{n,t},q_{n,t+m})$, note that $q_{n,t}$ can take only three values, $-1$, $0$, or $1$. By the definition of $\beta_m$, Assumption D implies
$$|P\{q_{n,t} = j, q_{n,t+m} = k\} - P\{q_{n,t} = j\}P\{q_{n,t+m} = k\}| \le n^{-2/3}\beta_m,$$
for all $n$, $m \ge 1$ and $j,k = -1, 0, 1$, i.e., $\{q_{n,t}\}$ is a $\beta$-mixing array whose mixing coefficients are bounded by $n^{-2/3}\beta_m$. In turn, this implies that $\{q_{n,t}\}$ is an $\alpha$-mixing array whose mixing coefficients are bounded by $2n^{-2/3}\beta_m$. Thus, by applying the $\alpha$-mixing covariance inequality, the covariance is bounded as
$$|\mathrm{Cov}(q_{n,t},q_{n,t+m})| \le C(n^{-2/3}\beta_m)^{1-2/p}\|q_{n,t}\|_p^2,$$
for some $C > 0$ and $p > 2$. Proceeding as in the bound for (22), we can show that $\|q_{n,t}\|_p^2 = O(n^{-2/(3p)})$. Combining these results, $n^{1/3}\sum_{m=1}^\infty\mathrm{Cov}(q_{n,t},q_{n,t+m}) \to 0$ as $n\to\infty$. Therefore, by Theorem 2, we conclude that $n^{1/3}(\hat\beta - \beta_0)$ converges in distribution to the argmax of $Z(s)$, a Gaussian process with expected value $\dot\gamma(1)s'Pxx's$ and covariance kernel $H$, for which $L(s_1,s_2) = 2\gamma(1)P|x'(s_1 - s_2)|$.
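To make the least median of squares fit concrete (a purely illustrative sketch, separate from the asymptotic argument above), the estimator can be approximated by exact fits to random elemental subsets, in the spirit of Rousseeuw's (1984) resampling algorithm; the sample design, subset count, and random seed here are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def lms(y, x, n_subsets=3000):
    # Approximate the least median of squares fit: solve an exact fit on
    # many random d-point subsets and keep the candidate with the smallest
    # median squared residual.
    n, d = x.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=d, replace=False)
        try:
            beta = np.linalg.solve(x[idx], y[idx])    # exact fit to d points
        except np.linalg.LinAlgError:
            continue                                  # degenerate subset
        crit = float(np.median((y - x @ beta) ** 2))  # median squared residual
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta

beta0 = np.array([1.0, -2.0])
x = rng.normal(size=(1000, 2))
y = x @ beta0 + rng.normal(size=1000)
beta_hat = lms(y, x)
```

The subset search only approximates the exact minimizer, but with symmetric unimodal errors it lands within the cube-root-rate neighborhood of $\beta_0$ described by Theorem 2.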

2/3 P q = j, q = k P q = j P q = k n , | { n,t n,t+m } { n,t } { n,t+m }|  m for all n, m 1 and j, k = 1, 0, 1, i.e., q is a -mixing array whose mixing coecients are { n,t} bounded by n 2/3 . In turn, this implies that q is an ↵-mixing array whose mixing coecients m { n,t} 2/3 are bounded by 2n m. Thus, by applying the ↵-mixing inequality, the covariance is bounded as 2/3 2 Cov(q ,q ) Cn q , n,t n,t+m  m k n,tkp for some C>0 and p>2. Note that proceeding as in the bound for (22) we can show that 2 2/(3p) 1/3 q = O(n ). Combining these results, n 1 Cov(q ,q ) 0 as n . k n,tkp m=1 n,t n,t+m ! !1 Therefore, by Theorem 2, we conclude that n1/3(ˆ ) converges in distribution to the argmax P 0 of Z(s), which is a Gaussian process with expected value ˙ (1)s0Pxx0s and the covariance kernel H, for which L(s ,s )=2(1)P x (s s ) . 1 2 | 0 1 2 | B.3. Monotone density. Preliminary results (Lemmas M, M’, C, and 1) to show Theorem 1 may be applied to establish weak convergence of certain processes. As an example, consider estimation of a decreasing marginal density function of z with support [0, ). We impose Assumption D for z . t 1 { t} The nonparametric maximum likelihood estimator ˆ(c) of the density (c) at a fixed c>0 is given by the left derivative of the concave majorant of the empirical distribution function ˆ. It is known that n1/3(ˆ(c) (c)) can be written as the left derivative of the concave majorant of the process 2/3 ˆ 1/3 ˆ 1/3 Wn(s)=n (c + sn ) (c) (c)sn (Prakasa Rao, 1969). Let f✓(z)=I z c + ✓ { } {  }

15 and be the distribution function of . Decompose

1/6 2/3 1/3 1/3 Wn(s)=n Gn(f 1/3 f0)+n (c + sn ) (c) (c)sn . sn { } 1 2 A Taylor expansion implies convergence of the second term to 2 ˙ (c)s < 0. For the first term 1/6 Zn(s)=n Gn(f 1/3 f0), we can apply Lemmas C and M’ to establish the weak convergence. sn 1/6 Lemma C (setting gn as any finite dimensional projection of the process n (f 1/3 f0):s ) { sn } implies finite dimensional convergence of Zn to projections of a centered Gaussian process with the covariance kernel n 1/3 1/3 1/3 1/3 1/3 H(s1,s2)= lim n 0t(c + s1n ,c+ s2n ) (c + s1n )(c + s2n ) , n { } !1 t= n X where 0t is the joint distribution function of (z0,zt). For tightness of Zn, we apply Lemma M’ by 1/6 setting gn,s = n (f 1/3 f0). The envelope condition is clearly satisfied. The condition in eq. sn (6) of the paper is verified as

2 P sup gn,s gn,s0 s: s s <" | | | 0| 1/3 1/3 1/3 = n P sup I z c + sn I z c + s0n s: s s <" | {  } {  }| | 0| 1/3 1/3 1/3 1/3 1/3 n max (c + sn ) (c +(s ")n ), (c +(s + ")n ) (c + sn )  { } (0)". 

Therefore, by applying Lemmas C and M’, Wn weakly converges to Z, a Gaussian process with 1 2 expected value 2 ˙ (c)s and covariance kernel H. The remaining part follows by the same argument to Kim and Pollard (1990, pp. 216-218) (by replacing their Lemma 4.1 with our Lemma 1). Then we can conclude that n1/3(ˆ(c) (c)) converges in distribution to the derivative of the concave majorant of Z evaluated at 0.

References

[1] Andrews, D. W. K. (1993) An introduction to econometric applications of empirical process theory for dependent random variables, Econometric Reviews, 12, 183-216.
[2] Doukhan, P., Massart, P. and E. Rio (1995) Invariance principles for absolutely regular empirical processes, Annales de l'Institut Henri Poincaré, Probability and Statistics, 31, 393-427.
[3] Kim, J. and D. Pollard (1990) Cube root asymptotics, Annals of Statistics, 18, 191-219.
[4] Rio, E. (1997) About the Lindeberg method for strongly mixing sequences, ESAIM: Probability and Statistics, 1, 35-61.

Department of Economics, London School of Economics, Houghton Street, London, WC2A 2AE, UK. E-mail address: [email protected]

Department of Economics, London School of Economics, Houghton Street, London, WC2A 2AE, UK. E-mail address: [email protected]
