
arXiv:1610.01697v3 [stat.ME] 26 Feb 2020

Central Limit Theory for Combined Cross-Section and Time Series

Jinyong Hahn (UCLA)    Guido Kuersteiner (University of Maryland)    Maurizio Mazzocco (UCLA)

UCLA, Department of Economics, 8283 Bunche Hall, Los Angeles, CA 90095; University of Maryland, Department of Economics, Tydings Hall, College Park, MD 20742.

February 27, 2020

Abstract

Combining cross-section and time series data is a long and well established practice in empirical economics. We develop a central limit theory that explicitly accounts for possible dependence between the two data sets. We focus on common factors as the mechanism behind this dependence. Using our central limit theorem (CLT) we establish the asymptotic properties of parameter estimates of a general class of models based on a combination of cross-sectional and time-series data, recognizing the interdependence between the two data sources in the presence of aggregate shocks. Despite the complicated nature of the analysis required to formulate the joint CLT, it is straightforward to implement the resulting parameter limiting distributions due to a formal similarity of our approximations with the standard Murphy and Topel (1985) formula.

1 Introduction

There is a long tradition in empirical economics of relying on information from a variety of data sources to estimate model parameters. In this paper we focus on a situation where cross-sectional and time-series data are combined. This may be done for a variety of reasons. Some parameters may not be identified in the cross section or the time series alone. Alternatively, parameters estimated from one data source may be used as first-step inputs in the estimation of a second set of parameters based on a different data source. This may be done to reduce the dimension of the parameter space or, more generally, for computational reasons. Data combination generates theoretical challenges even when only cross-sectional data sets or only time-series data sets are combined; see Ridder and Moffitt (2007) for example. We focus on dependence between cross-sectional and time-series data produced by common aggregate factors. Andrews (2005) demonstrates that even randomly sampled cross-sections lead to independently distributed samples only conditionally on common factors, since the factors introduce a possible correlation. This correlation extends inevitably to a time series sample that depends on the same underlying factors. Andrews (2005) only considers cross-sections and panels with fixed T, with sufficient regularity conditions for consistent estimation. Our work in this and in our companion papers Hahn, Kuersteiner and Mazzocco (2016, 2019a, 2019b) extends the analysis in Andrews to situations where consistent estimation is only possible by utilizing an additional time series data set. Examples in our companion papers draw on the literature on the structural estimation of rational expectations models as well as the program evaluation literature. Examples of the former include cross-sectional studies of the consumption based asset pricing model in Runkle (1991), Shea (1995) and Vissing-Jorgensen (2002); a recent example of the latter is Rosenzweig and Udry (2019).
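Andrews's (2005) point, namely that randomly sampled cross-sections are independent only conditionally on the common factors, can be seen in a short simulation. The sketch below is our own illustration, not taken from the paper; the factor and noise variances are arbitrary assumptions.

```python
import numpy as np

# Units share a common shock nu: y_i = nu + eps_i. Conditional on nu the
# draws are i.i.d., but unconditionally the shared shock correlates them:
# corr(y_1, y_2) = sigma_nu^2 / (sigma_nu^2 + sigma_eps^2).
rng = np.random.default_rng(0)
R = 20000                                     # independent replications
sigma_nu, sigma_eps = 1.0, 1.0
nu = rng.normal(0.0, sigma_nu, size=(R, 1))   # one common shock per draw
eps = rng.normal(0.0, sigma_eps, size=(R, 2)) # idiosyncratic noise, 2 units
y = nu + eps

emp_corr = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
theory_corr = sigma_nu**2 / (sigma_nu**2 + sigma_eps**2)
print(emp_corr, theory_corr)  # empirical correlation close to 0.5
```

With σ_ν² = 0 the correlation vanishes and the usual unconditional i.i.d. analysis applies; otherwise inference must condition on the factor, which is what stable convergence formalizes.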
The first contribution of this paper is to develop a central limit theory that explicitly accounts for the dependence between the cross-sectional and time-series data by using the notion of stable convergence. The second contribution is to use the corresponding limit theory to derive the asymptotic distribution of parameter estimators obtained from combining the two data sources. Our analysis is inspired by a number of applied papers and in particular by the discussion in Lee and Wolpin (2006, 2010). Econometric estimation based on the combination of cross-sectional and time-series data is an idea that dates back at least to Tobin (1950). More recently, Heckman and

Sedlacek (1985) and Heckman, Lochner, and Taber (1998) proposed to deal with the estimation of equilibrium models by exploiting such data combination. It is, however, Lee and Wolpin (2006, 2010) who develop the most extensive equilibrium model and estimate it using similar intuition and panel data. To derive the new central limit theorem and the asymptotic distribution of parameter estimates, we extend the model developed in Lee and Wolpin (2006, 2010) to a general setting that involves two submodels. The first submodel includes all the cross-sectional features, whereas the second submodel is composed of all the time-series aspects. The two submodels are linked by a vector of aggregate shocks and by the parameters that govern their dynamics. Given the interplay between the two submodels, the aggregate shocks have complicated effects on the estimation of the parameters of interest. With the objective of creating a framework to perform inference in our general model, we first derive a joint functional stable central limit theorem for cross-sectional and time-series data. The central limit theorem explicitly accounts for the factor-induced dependence between the two samples even when the cross-sectional sample is obtained by random sampling, a special case covered by our theory. We derive the central limit theorem under the condition that the dimension of the cross-sectional data n as well as the dimension of the time series data τ go to infinity. Using our central limit theorem we then derive the asymptotic distribution of the parameter estimators that characterize our general model. To our knowledge, this is the first paper that derives an asymptotic theory that combines cross-sectional and time series data. In order to deal with parameters estimated using two data sets of completely different nature, we adopt the notion of stable convergence.
Stable convergence dates back to Renyi (1963) and was recently used in Kuersteiner and Prucha (2013) in a panel data context to establish joint limiting distributions. Using this concept, we show that the asymptotic distributions of the parameter estimators are a combination of asymptotic distributions from cross-sectional analysis and time-series analysis. Stable convergence has found wide applications in many areas of statistics. In econometrics it has been applied most prominently in the area of high frequency financial econometrics,1 while its use in panel and cross-sectional econometrics is relatively recent and less common. Important references to the high frequency literature include Barndorff-Nielsen, Hansen, Lunde and Shephard (2008), Jacod, Podolskij and Vetter (2010) and Li and Xiu (2016).

1We thank one of the referees for bringing this literature to our attention.

Monographs treating the probability theoretic foundation for this line of research are Jacod and Shiryaev (2003, henceforth JS), and Jacod and Protter (2012). In line with the literature on high frequency financial econometrics, we base our definition of stable convergence in function spaces on JS. In the introduction to their book, JS outline two different strategies for proving central limit theorems. One is what they term the 'martingale method', which dates back at least to Stroock and Varadhan (1979). The other is the approach followed by Billingsley (1968), which consists of establishing tightness, finite dimensional convergence and identification of the finite dimensional limiting distribution. In this paper we follow the second approach, while JS and the cited papers in the high frequency literature rely on the martingale method. The reason for the difference in approach lies in the fact that the primitive parameters central to the martingale method, essentially a set of conditional moments called the triplets of characteristics of the approximating and limiting process in the language of JS, have no analog in the models we study. The data generating processes (DGP) producing our samples are not embedded in limiting diffusion processes. Even if our data generating process could be represented by or approximated with such diffusions, the limits we take assume that the DGP remains fixed. This assumption precludes any notion of convergence of characteristics that is required for the martingale method. Our limit theory has a second connection to the high frequency literature. Barndorff-Nielsen et al. (2008, Proposition 5) and Li and Xiu (2016, Lemma A3) develop methods to deduce joint stable convergence of two random sequences from stable convergence of the first random sequence and conditional convergence in law of the second random sequence.

The assumptions made in these papers, namely that the noise terms have mean zero conditional on the entire time series, are stronger than the assumptions we make in our paper. It is often unnatural to condition on the entire time series of common shocks, which we avoid in this paper. While the formal derivation of the asymptotic distribution may appear complicated, the asymptotic formulae that we produce are straightforward to implement and very similar to the standard Murphy and Topel (1985) formula. We also derive a novel result related to the unit root literature. We show that, when the time-series data are characterized by unit roots, the asymptotic distribution is a combination of a normal distribution and the distribution found in the unit root literature. Therefore, the

asymptotic distribution exhibits mathematical similarities to the inferential problem in predictive regressions, as discussed by Campbell and Yogo (2006). However, the similarity is superficial in that Campbell and Yogo's (2006) result is about an estimator based on a single data source. But, similarly to Campbell and Yogo's analysis, we need to address problems of uniform inference. Phillips (2014) proposes a method of uniform inference for predictive regressions, which we adopt and modify for our own estimation problem in the unit root case. Our results should be of interest to both applied microeconomists and macroeconomists. Data combination is common practice in the macro calibration literature where typically a subset of parameters is determined based on cross-sectional studies. It is also common in structural microeconomics where the focus is more directly on identification issues that cannot be resolved in the cross-section alone. In a companion paper, Hahn, Kuersteiner, and Mazzocco (2016, 2019a), we discuss in detail specific examples from the micro literature. In the companion paper, we also provide a more intuitive analysis of the joint use of cross-sectional and time-series data when aggregate factors are present, whereas in this paper the analysis is more technical and abstract. The remainder of the paper is organized as follows. In Section 2, we introduce the general model. In Section 3, we present the intuition underlying our main result, which is presented in Section 4.

2 Model

We start this section with a stylized economic example that motivates our econometric model. Consider a farmer who chooses how much labor to supply to the cultivation of a plot in a way that maximizes profits. The production function of the plot depends on labor, L_it, and an aggregate shock, ν_t, and takes the following form:

q_it = ν_t exp(ε_it) L_it^(α/σ_ν²),

where E[exp(ε_it)] = 1, ν_t = ρ ν_{t−1} + u_t, and σ_ν² = Var(ν_t) = σ_u²/(1 − ρ²).

The idea behind this specification is that labor is less productive in places with higher variance of the aggregate shocks: in places that have cold winters but also hurricane seasons in the summer,

the incentives to invest in human capital and productivity-improving practices are lower. The gross domestic product (or aggregate state) of the economy, y, in which the farmer operates depends on aggregate investment x and the aggregate shock ν in the following way:

log y_t = β_1 + β_2 log x_t + ν_t,

where the time series of y and x are observed. Our objective is to estimate the parameters of the farmer's production function. Let output produced by the farmer be q and its price p. The farmer maximizes expected profit, i.e., he solves:

max_L π(L) = p_t E[q_t] − w_it L

s.t. E[q_t] = E[ν_t exp(ε_it) L^(α/σ_ν²)] = ν_t L^(α/σ_ν²).

Note that the expectation E[q_t] = E[ν_t exp(ε_it) L^(α/σ_ν²)] is taken with respect to the distribution of ε_it. Therefore, the farmer's problem is max_L p_t ν_t L^(α/σ_ν²) − w_it L. Let φ = α/σ_ν². Then the FOC can be written as p_t ν_t φ L^(φ−1) − w_it = 0. Solving for L, we have

L_t = (w_it / (p_t ν_t φ))^(1/(φ−1)).

The quantity produced by the farmer is therefore

q_it = ν_t^(−1/(φ−1)) exp(ε_it) (w_it / (p_t φ))^(φ/(φ−1)).

By taking logs, we have

log q_it = −(φ/(φ−1)) log φ − (1/(φ−1)) log ν_t + (φ/(φ−1)) log(w_it/p_t) + ε_it.

Now, suppose that the econometrician has panel data consisting of quantity and real wages. Also, suppose that time series data consisting of observations for z_s = (log y_s, log x_s) are available, where we assume that E[ν_s log x_s] = 0 so that ν_s can be understood to be the error term in the OLS regression of log y_s on log x_s. By regressing log q_it on a constant, time dummies, and log real wages using the panel data, and then using the functional relationship among parameters, one can estimate

φ as well as ν_t. Then, by regressing log y_s on log x_s and a constant using the time series data, we can estimate β_1, β_2 and σ_ν². We can therefore estimate σ_ν and α. Combining these estimates, we can estimate the parameters of the production function. This stylized example contains the gist of our motivation, i.e., that there may be economic models whose parameters can be consistently estimated only by combining two sources of data. Below, we present a more formal description of our econometric model.2 We assume that our cross-sectional data consist of {y_i,t : i = 1, ..., n; t = 1, ..., T},3 where the start time of the cross-section or panel, t = 1, is an arbitrary normalization of time. Pure cross-sections are handled by allowing for T = 1. Note that T is fixed and finite throughout our discussion, while our asymptotic approximations are based on n tending to infinity. Our time series data consist of {z_s : s = τ_0 + 1, ..., τ_0 + τ}, where the time series sample size τ tends to infinity jointly with n. The start point of the time series sample is either fixed at an arbitrary time τ_0 ∈ (−∞, ∞) or depends on τ such that τ_0 = −υτ + τ_0,f + T for υ ∈ [0, 1] and τ_0,f a fixed constant. We refer to the case where υ = 1, such that τ_0 → −∞ when the time series sample size τ tends to ∞, as backwards asymptotics. In this case the end of the time series sample is fixed at τ_0,f + T. A hybrid case arises when υ ∈ (0, 1), such that the start point τ_0 of the time series sample extends back in time simultaneously with the last observation in the sample, τ_0 + τ, tending to infinity. For simplicity we refer to both scenarios, υ ∈ (0, 1) and υ = 1, as backwards asymptotics. Backwards asymptotics may be more realistic in cases where recent panel data are augmented with long historical records of time series data. We show that under a mild additional regularity condition both forward and backward asymptotics lead to the same limiting distribution.
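The two-step logic of the stylized farmer example can be mimicked in a small simulation. The sketch below is our own illustration and not code from the paper; all numerical values are arbitrary assumptions, and the shock is given a positive mean so that log ν_t is well defined.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, sigma_u, mu = 0.6, 0.2, 5.0        # assumed shock dynamics
sigma_nu2 = sigma_u**2 / (1 - rho**2)   # Var(nu_t) = 0.0625
alpha = 0.03
phi = alpha / sigma_nu2                 # = 0.48, exponent in the supply curve

def nu_path(T):
    """Stationary AR(1) shock shifted by a constant mean mu > 0."""
    e = np.empty(T)
    e[0] = rng.normal(0.0, np.sqrt(sigma_nu2))
    for t in range(1, T):
        e[t] = rho * e[t - 1] + rng.normal(0.0, sigma_u)
    return mu + e

# Step 2 data: aggregate time series, log y_s = b1 + b2 log x_s + nu_s.
tau, b1, b2 = 5000, 1.0, 0.5
logx = rng.normal(0.0, 1.0, tau)
logy = b1 + b2 * logx + nu_path(tau)
X = np.column_stack([np.ones(tau), logx])
resid = logy - X @ np.linalg.lstsq(X, logy, rcond=None)[0]
sigma_nu2_hat = resid.var()             # Var(nu); mean absorbed by intercept

# Step 1 data: short panel of log output on time dummies and log real wages.
n, T = 5000, 4
nu_cs = nu_path(T)
logwp = rng.normal(0.0, 0.5, (n, T))    # log(w_it / p_t)
c = phi / (phi - 1)                     # slope on the log real wage
logq = -c * np.log(phi) - np.log(nu_cs) / (phi - 1) + c * logwp \
       + rng.normal(0.0, 0.1, (n, T))
D = np.kron(np.ones((n, 1)), np.eye(T)) # time dummies
Z = np.column_stack([D, logwp.reshape(-1)])
c_hat = np.linalg.lstsq(Z, logq.reshape(-1), rcond=None)[0][-1]
phi_hat = c_hat / (c_hat - 1)           # invert c = phi / (phi - 1)

alpha_hat = phi_hat * sigma_nu2_hat     # combine samples: alpha = phi * sigma_nu^2
print(round(alpha_hat, 4))              # approximately 0.03
```

Neither sample alone recovers α: the panel identifies only φ = α/σ_ν², and the time series identifies only σ_ν².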

Conventional asymptotics are covered by setting υ = 0. The vector yi,t includes all information related to the cross-sectional submodel, where i is an index for individuals, households or firms, and t denotes the time period when the cross-sectional unit is observed. The second vector zs contains aggregate data. The technical assumptions for our CLT, detailed in Section 4, do not directly restrict the data,

2The example presented here is a stylized version of models discussed in detail in Hahn, Kuersteiner and Mazzocco (2019a). In that paper we provide explicit formulas and calculations that demonstrate how the results of this paper can be applied to the types of models we cover. Similar, but simpler calculations can be carried out for the example presented here. 3We do not consider models with estimated fixed effects in this paper because we assume that the parameter space is finite dimensional.

nor do they impose restrictions on how the data were sampled. For example, we do not assume that the cross-sectional sample was obtained by randomized sampling, although this is a special case that is covered by our assumptions. Rather than imposing restrictions directly on the data, we postulate that there are two parametrized models that implicitly restrict the data. The function f(y_i,t | β, ν_t, ρ) is used to model y_i,t as a function of cross-sectional parameters β, common shocks ν = [ν_1, ..., ν_T], which are treated as parameters to be estimated, and time series parameters ρ. In the same way, the function g(z_s | β, ρ) restricts the behavior of some time series variables z_s.4 Depending on the exact form of the underlying economic model, the functions f and g may have different interpretations. They could be the log-likelihoods of y_i,t, conditional on ν_t, and of z_s, respectively. In a likelihood setting, f and g impose restrictions on y_i,t and z_s because of the implied martingale properties of the score process. More generally, the functions f and g may be the basis for method of moments (the exactly identified case) or GMM (the overidentified case) estimation. In these situations parameters are identified from the conditions E_C[f(y_i,t | β, ν_t, ρ)] = 0 given the shock ν_t, and E_τ[g(z_s | β, ρ)] = 0. The first expectation, E_C, is understood as being over the cross-section population distribution holding ν = (ν_1, ..., ν_T) fixed, while the second, E_τ, is over the stationary distribution of the time-series data generating process. The conditions follow from martingale assumptions we directly impose on f and g. In our companion paper we discuss examples of economic models that rationalize these assumptions. Whether we are dealing with likelihoods or moment functions, the central limit theorem is directly formulated for the estimating functions that define the parameters.
We use the notation F_n(β, ν, ρ) and G_τ(β, ρ) to denote the criterion functions based on the cross-section and the time series, respectively. When the model specifies a log-likelihood, these functions are defined as

F_n(β, ν, ρ) = (1/n) Σ_{t=1}^T Σ_{i=1}^n f(y_i,t | β, ν_t, ρ) and G_τ(β, ρ) = (1/τ) Σ_{s=τ_0+1}^{τ_0+τ} g(z_s | β, ρ).

When the model specifies moment conditions, we let

h_n(β, ν, ρ) = (1/n) Σ_{t=1}^T Σ_{i=1}^n f(y_i,t | β, ν_t, ρ) and k_τ(β, ρ) = (1/τ) Σ_{s=τ_0+1}^{τ_0+τ} g(z_s | β, ρ).

The GMM or moment based criterion functions are then given

4The function g may naturally arise if ν_t is an unobserved component that can be estimated from the aggregate time series once the parameters β and ρ are known, i.e., if ν_t ≡ ν_t(β, ρ) is a function of (z_t, β, ρ) and the behavior of ν_t is expressed in terms of ρ. Later, we allow for the possibility that g is in fact derived from the conditional density of ν_t given ν_{t−1}, i.e., the possibility that g may depend on both the current and lagged values of z_t. For notational simplicity, we simply write g(z_s | β, ρ) here for now.

by F_n(β, ν, ρ) = −h_n(β, ν, ρ)′ W_n h_n(β, ν, ρ) and G_τ(β, ρ) = −k_τ(β, ρ)′ W_τ k_τ(β, ρ), with W_n and W_τ two almost surely positive definite weight matrices. The use of two separate objective functions is helpful in our context because it enables us to discuss which issues arise if only cross-sectional variables or only time-series variables are used in the estimation.5 We formally justify the use of two data sets by imposing restrictions on the identifiability of parameters through the cross-section and time series criterion functions alone. Let Θ be a compact set constructed from the product Θ = Θ_β × Θ_ν × Θ_ρ, where Θ_β is a compact set that contains the true parameter value β_0, Θ_ν is a compact set that contains the true parameter value ν_0, and Θ_ρ is a compact set that contains the true parameter value ρ_0. We denote the probability limits of the objective functions by F(β, ν, ρ) and G(β, ρ); in other words,

F(β, ν, ρ) = plim_{n→∞} F_n(β, ν, ρ),

G(β, ρ) = plim_{τ→∞} G_τ(β, ρ).

The true or pseudo true parameters are defined as the maximizers of these probability limits

(β(ρ), ν(ρ)) ≡ argmax_{(β,ν) ∈ Θ_β × Θ_ν} F(β, ν, ρ),   (1)

ρ(β) ≡ argmax_{ρ ∈ Θ_ρ} G(β, ρ),   (2)

and we denote by β_0 and ρ_0 the solutions to (1) and (2). The idea that neither F nor G alone is sufficient to identify both parameters is formalized as follows. If the function F is constant in ρ at the parameter values β and ν that maximize it, then ρ is not identified by the criterion F alone. Formally, we state that

max_{(β,ν) ∈ Θ_β × Θ_ν} F(β, ν, ρ) = max_{(β,ν) ∈ Θ_β × Θ_ν} F(β, ν, ρ_0) for all ρ ∈ Θ_ρ.   (3)

It is easy to see that (3) is not a sufficient condition to restrict identification in a desirable way. For example, (3) is satisfied in a setting where F does not depend on ρ at all. In that case the maximizers in (1) also do not depend on ρ and by definition coincide with β_0 and ν_0. To rule out this case we require that ρ_0 is needed to identify β_0 and ν_0. Formally, we impose the condition

5Note that our framework covers the case where the joint distribution of (y_it, z_t) is modeled. Considering the two components separately adds flexibility in that data are not required for all variables in the same period.

that

(β(ρ), ν(ρ)) ≠ (β_0, ν_0) for all ρ ≠ ρ_0.   (4)

Similarly, we impose restrictions on the time series criterion function that ensure that the parameters β and ρ cannot be identified solely as the maximizers of G. Formally, we require that

max_{ρ ∈ Θ_ρ} G(β, ρ) = max_{ρ ∈ Θ_ρ} G(β_0, ρ) for all β ∈ Θ_β,   (5)

ρ(β) ≠ ρ_0 for all β ≠ β_0.

To ensure that the parameters can be identified from a combined cross-sectional and time-series data set, we impose the following condition. Define θ ≡ (β′, ν′)′ and assume that (i) there exists a unique solution to the system of equations:

(∂F(β, ν, ρ)/∂θ′, ∂G(β, ρ)/∂ρ′) = 0,   (6)

and (ii) the solution is given by the true value of the parameters. In summary, our model is characterized by the high level assumptions in (3), (4), (5), and by the assumption that (6) only has one solution at the true parameter values.
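A toy example of conditions (3) through (6), of our own construction rather than the paper's: in the cross-section y_it = ν_t + (βρ) x_it + ε_it only the product βρ is identified, so F alone cannot separate β from ρ; the time series z_s = ρ + u_s identifies ρ but carries no information about β. Solving the stacked equations in (6) recovers all parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, rho0 = 2.0, 0.5
nu0 = np.array([0.3, -0.1, 0.2])          # true common shocks, T = 3
n, T, tau = 4000, 3, 4000

x = rng.normal(0.0, 1.0, (n, T))
y = nu0 + beta0 * rho0 * x + rng.normal(0.0, 1.0, (n, T))
z = rho0 + rng.normal(0.0, 1.0, tau)

# Time-series block of (6): mean(z) - rho = 0.
rho_hat = z.mean()
# Cross-section block of (6): for each t, mean_i(y - nu_t - g x) = 0 and
# mean_{i,t} x (y - nu_t - g x) = 0, where g = beta * rho is all F identifies.
g_hat = (x * (y - y.mean(axis=0))).mean() / (x * (x - x.mean(axis=0))).mean()
nu_hat = y.mean(axis=0) - g_hat * x.mean(axis=0)
# beta is recovered only once the time-series block has pinned down rho.
beta_hat = g_hat / rho_hat
print(np.round([beta_hat, rho_hat], 2))   # near [2.0, 0.5]
```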

3 Asymptotic Inference

Our asymptotic framework is such that standard textbook-level analysis suffices for the discussion of consistency of the estimators. In standard analysis with a single data source, one typically restricts the moment equation to ensure identification, and imposes further restrictions such that the sample analog of the moment function converges uniformly to the population counterpart. Because these arguments are well known, we simply impose as a high-level assumption that our estimators are consistent. The purpose of this section is to provide an overview of our results; a rigorous technical discussion is relegated to Section 4, which may be skipped by a less technically oriented reader.

For expositional purposes, suppose that the time series z_t is such that the log of its conditional probability density function given z_{t−1} is g(z_t | z_{t−1}, ρ). To simplify the exposition in this section we assume that the time series model does not depend on the micro parameter β. To distinguish this scenario, where the time series parameter ρ can be estimated in a first step using only time series data, from the general case, which requires joint estimation of β, ν and ρ, we denote the consistent first stage estimator of ρ by ρ̃. The notation ρ̂ is reserved for the more general joint estimator. We assume that the dimension of the time series data is τ, and that ρ̃ is a regular estimator with an influence function ϕ_s such that

√τ (ρ̃ − ρ) = (1/√τ) Σ_{s=τ_0+1}^{τ_0+τ} ϕ_s + o_p(1)   (7)

with E[ϕ_s] = 0. Here, τ_0 + 1 denotes the beginning of the time series data, which is allowed to differ from the beginning of the panel data. Using ρ̃ from the time series data, we can then consider maximizing the criterion F_n(β, ν, ρ̃) with respect to θ = (β, ν_1, ..., ν_T). Implicit in this representation is the idea that we are given a short panel for estimation of θ = (β, ν_1, ..., ν_T), where T denotes the time series dimension of the panel data. In order to emphasize that T is small, we use the term 'cross-section' for the short panel data set, and adopt asymptotics where T is fixed. The moment equation then is

∂F_n(θ̂, ρ̃)/∂θ = 0,

and the asymptotic distribution of θ̂ is characterized by

√n (θ̂ − θ) = −(∂²F_n(θ, ρ̃)/∂θ∂θ′)^{−1} √n ∂F_n(θ, ρ̃)/∂θ + o_p(1).

Because √n (∂F_n(θ, ρ̃)/∂θ − ∂F_n(θ, ρ)/∂θ) ≈ (∂²F_n(θ, ρ)/∂θ∂ρ′) (√n/√τ) √τ (ρ̃ − ρ), we obtain

√n (θ̂ − θ) = −A^{−1} √n ∂F_n(θ, ρ)/∂θ − A^{−1} B (√n/√τ) ((1/√τ) Σ_{s=τ_0+1}^{τ_0+τ} ϕ_s) + o_p(1)   (8)

with

A ≡ ∂²F(θ, ρ)/∂θ∂θ′,  B ≡ ∂²F(θ, ρ)/∂θ∂ρ′.

We adopt asymptotics where n, τ → ∞ at the same rate, but T is fixed. We stress that a technical difficulty arises because we are conditioning on the factors (ν_1, ..., ν_T). This is accounted for in the limit theory we develop through a convergence concept due to Renyi (1963) called stable convergence, essentially a notion of joint convergence. It can be thought of as convergence conditional on a specified σ-field, in our case the σ-field C generated by (ν_1, ..., ν_T). In simple special cases, and because T is fixed, the asymptotic distribution of (1/√τ) Σ_{s=τ_0+1}^{τ_0+τ} ϕ_s conditional on (ν_1, ..., ν_T) may be equal to the unconditional asymptotic distribution. However, as we show in Section 4, this is not always the case, even when the model is stationary. Renyi (1963) and Aldous and Eagleson (1978) show that the concept of convergence of the distribution conditional on any positive probability event in C and the concept of stable convergence are equivalent. Eagleson (1975) proves a stable CLT by establishing that the conditional characteristic functions converge almost surely. Hall and Heyde's (1980) proof of stable convergence, on the other hand, is based on demonstrating that the characteristic function converges weakly in L¹. As pointed out in Kuersteiner and Prucha (2013), the Hall and Heyde (1980) approach lends itself to proving the martingale CLT under slightly weaker conditions than what Eagleson (1975) requires. While both approaches can be used to demonstrate very similar stable and thus conditional limit laws, neither simplifies to conventional marginal weak convergence except in trivial cases. A related idea might be to derive conditional (on C) limiting results separately for the cross-section and time series dimensions of our estimators. As noted before, such a result in fact amounts to demonstrating stable convergence, in this case for each dimension separately. Regardless, this approach is flawed because it does not deliver joint convergence of the two components. It is evident from (8) that the continuous mapping theorem needs to be applied to derive the asymptotic distribution of θ̂. Because both A and B are C-measurable random variables in the limit, the continuous mapping theorem can only be applied if joint convergence of √n ∂F_n(θ, ρ)/∂θ, τ^{−1/2} Σ_{s=τ_0+1}^{τ_0+τ} ϕ_s and any C-measurable random variable is established.
Joint stable convergence of both components delivers exactly that. Finally, we point out that it is perfectly possible to consistently estimate parameters, in our case (ν_1, ..., ν_T), that remain random in the limit. For related results, see the recent work of Kuersteiner and Prucha (2015). Here, for the purpose of illustration, we consider the simple case where the dependence of the time series component on the factors (ν_1, ..., ν_T) vanishes asymptotically. Let's say that the unconditional distribution is such that

(1/√τ) Σ_{s=τ_0+1}^{τ_0+τ} ϕ_s →d N(0, Ω_ν),

where Ω_ν is a fixed constant that does not depend on (ν_1, ..., ν_T). Let's also assume that

√n ∂F_n(θ, ρ)/∂θ →d N(0, Ω_y)

conditional on (ν_1, ..., ν_T). Unlike in the case of the time series sample, Ω_y generally does depend on (ν_1, ..., ν_T) through the parameter θ.

We note that ∂F_n(θ, ρ)/∂θ is a function of (ν_1, ..., ν_T). If there is overlap between (1, ..., T) and (τ_0 + 1, ..., τ_0 + τ), we need to worry about the asymptotic distribution of (1/√τ) Σ_{s=τ_0+1}^{τ_0+τ} ϕ_s conditional on (ν_1, ..., ν_T). However, because in this example the only connection between y and ϕ is assumed to be through θ, and because T is assumed fixed, the two terms √n ∂F_n(θ, ρ)/∂θ and τ^{−1/2} Σ_{t=τ_0+1}^{τ_0+τ} ϕ_t are expected to be asymptotically independent in the trend stationary case and when Ω_ν does not depend on (ν_1, ..., ν_T). Even in this simple setting, independence between the two samples does not hold exactly, and asymptotic conditional or unconditional independence, as well as joint convergence with C-measurable random variables, needs to be established formally. This is achieved by establishing C-stable convergence in Section 4.2. It follows that

√n (θ̂ − θ) → N(0, A^{−1} Ω_y A^{−1} + κ A^{−1} B Ω_ν B′ A^{−1}),   (9)

where 0 < κ ≡ lim n/τ < ∞. This means that a practitioner would use the square root of

(1/n) [A^{−1} Ω_y A^{−1} + (n/τ) A^{−1} B Ω_ν B′ A^{−1}] = (1/n) A^{−1} Ω_y A^{−1} + (1/τ) A^{−1} B Ω_ν B′ A^{−1}

as the standard errors. This result looks similar to Murphy and Topel's (1985) formula, except that we need to make an adjustment to the second component to address the differences in sample sizes. The assumption that 0 < κ < ∞ is used as a technical device to obtain an asymptotic approximation that accounts for estimation errors stemming both from the cross-section and the time series samples. Simulation results in Hahn et al. (2016, 2019a) for data and sample sizes calibrated to actual macro data show that our approximation provides good control for estimator bias and test size. The knife edge case κ = 0 corresponds to situations where the estimation of time series parameters can be neglected for inference about θ, and where now √n (θ̂ − θ) → N(0, A^{−1} Ω_y A^{−1}). The expansion in (8) also shows that the case κ = ∞ leads to a scenario where uncertainty from the time series dominates, such that the rate of convergence of θ̂ now is √τ rather than √n and where √τ (θ̂ − θ) → N(0, A^{−1} B Ω_ν B′ A^{−1}). However, in what follows we focus on the case most relevant in practice where 0 < κ < ∞. The asymptotic variance formula is such that the noise of the time series estimator ρ̃ can make quite a difference if κ is large, i.e., if the time series sample size τ is small relative to the cross section sample size n. Obviously, this calls for long time series for accurate estimation of the micro parameter β. We also note that time series estimation asymptotically has no impact on micro estimation if B = 0. One scenario where B = 0 is the case where the model is additively separable in θ and ρ, such that F(θ, ρ) = F_1(θ) + F_2(ρ). This confirms the intuition that if ρ does not appear as part of the micro moment f, which is the case in Heckman and Sedlacek (1985) and Heckman, Lochner, and Taber (1998), cross section estimation can be considered separate from time series estimation.

In the more general setting of Section 4, Ω_ν may depend on (ν_1, ..., ν_T). In this case the limiting distribution of the time series component is mixed Gaussian and dependent upon the limiting distribution of the cross-sectional component. This dependence does not vanish asymptotically even in stationary settings. As we show in Section 4, standard inference based on asymptotically pivotal statistics is available even though the limiting distribution of θ̂ is no longer a sum of two independent components.
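In the stationary case with Ω_ν fixed, the variance formula in (9) is straightforward to compute. The sketch below uses made-up values for A, B, Ω_y and Ω_ν; in applications these would be replaced by consistent estimates.

```python
import numpy as np

n, tau = 5000, 200                       # sample sizes; kappa = n / tau = 25
A = np.array([[2.0, 0.3], [0.3, 1.5]])   # Hessian block, d2F / (dtheta dtheta')
B = np.array([[0.5], [0.2]])             # cross block,   d2F / (dtheta drho')
Omega_y = np.array([[1.0, 0.1], [0.1, 0.8]])  # cross-sectional score variance
Omega_nu = np.array([[0.4]])             # variance of the time-series influence function

Ainv = np.linalg.inv(A)
V_cs = Ainv @ Omega_y @ Ainv             # cross-sectional sampling noise
V_ts = Ainv @ B @ Omega_nu @ B.T @ Ainv  # noise inherited from the first-stage rho
V = V_cs / n + V_ts / tau                # Murphy-Topel-style combination
se = np.sqrt(np.diag(V))                 # standard errors for theta-hat
print(se)
```

The second term is scaled by 1/τ rather than 1/n, which is the sample size adjustment relative to the textbook Murphy and Topel (1985) formula; with B = 0 it drops out entirely.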

4 Joint Panel-Time Series Limit Theory

In this section we first establish a generic joint limiting result for a combined panel-time series process and then specialize it to the limiting distributions of parameter estimates under stationarity and non-stationarity. Let (Ω′, ′,P ′) be a probability space with random sequences zt t∞= and G { } −∞ τ0+τ n y ,...,y ∞ . The observed sample z and y ,...,y is a subset of these random { i1 iT }i=1 { t}t=τ0+1 { i1 iT }i=1 y y sequences. The process we analyze consists of a triangular array of panel data ψn,it where ψn,it typically is a function of yit and parameters, observed for i = 1, ..., n and t = 1,...,T. We let n while T is fixed and t = 1 is an arbitrary normalization of time at the beginning of the → ∞ ν ν cross-sectional sample. It also consists of a separate triangular array of time series ψτ,t, where ψτ,t is a function of zt and parameters for t = τ0 +1,...,τ0 + τ. We allow for τ0 to be either fixed with < K τ K < for some bounded K and τ , or to vary as a function of the time −∞ − ≤ 0 ≤ ∞ →∞ series sample size τ. In the latter case we write τ0 = τ0 (τ) and use the short hand notation τ0 when

no confusion arises. The fixed $\tau_0$ scenario corresponds to a situation where a (hypothetical) time series sample is observed into the infinite future. For the case where $\tau_0 = \tau_0(\tau)$, we restrict $\tau_0(\tau)$ to $\tau_0(\tau) = -\upsilon\tau + \tau_{0,f} + T$ for some $\upsilon \in [0,1]$ and where $\tau_{0,f}$ is a fixed constant. This covers the case $\tau_0 = -\tau + T$. The case where $\tau_0$ varies with the time series sample size in the prescribed way can be used to model situations where the asymptotics are carried out `backwards' in time (when $\upsilon = 1$) or where the panel data are located at a fixed fraction of the time series sample as the time series sample size tends to infinity ($0 < \upsilon < 1$).

We develop asymptotic theory for the sums of some generic random vectors $\psi^{y}_{n,it}$ and $\psi^{\nu}_{\tau,t}$. Typically, $\psi^{y}_{n,it}$ and $\psi^{\nu}_{\tau,t}$ are the scores or moment functions of a cross-section and time series criterion function based on observed data $y_{it}$ and $z_t$. Below we introduce general regularity conditions for these generic random vectors $\psi^{y}_{n,it}$ and $\psi^{\nu}_{\tau,t}$. Let $k_{\theta}$ be the dimension of the parameter $\theta$ and $k_{\rho}$ be the dimension of the parameter $\rho$. With some abuse of notation we also denote by $k_{\theta}$ the number of moment conditions used to identify $\theta$ when GMM estimators are used, with a similar

convention applying to $k_{\rho}$. With this notation, $\psi^{y}_{n,it}$ takes values in $\mathbb{R}^{k_{\theta}}$ and $\psi^{\nu}_{\tau,t}$ takes values in $\mathbb{R}^{k_{\rho}}$. We assume that $T \le \tau_0 + \tau$. Throughout we assume that $\left(\psi^{y}_{n,it}, \psi^{\nu}_{\tau,t}\right)$ is a type of a vector mixingale sequence relative to a filtration to be specified below. The concept of mixingales was introduced by Gordin (1969, 1973) and McLeish (1975). We derive the joint limiting distribution

and a related functional central limit theorem for $n^{-1/2}\sum_{t=1}^{T}\sum_{i=1}^{n}\psi^{y}_{n,it}$ and $\tau^{-1/2}\sum_{t=\tau_0+1}^{\tau_0+\tau}\psi^{\nu}_{\tau,t}$. We now form the triangular array of filtrations similarly to Kuersteiner and Prucha (2013). The filtrations are a theoretical construct defined on the probability space in such a way that the observed sample is a strict subset of the random variables that generate the filtrations. We proceed by first collecting information about all of the common shocks, $(\nu_1, \ldots, \nu_T)$, then we pick the initial time-series realization $z_{\min\{1,\tau_0\}}$ in such a way that it predates or coincides with the initial period of the panel data set. We then sequentially add the cross-sectional units for the same time period,

starting from $i = 1$ to $i = n$.$^{6}$ Subsequently the same procedure is repeated by shifting the time index ahead by one period. The process ends once the time index reaches $\max(T, \tau)$. If $\tau > T$, which eventually happens in forward asymptotics but may also arise in backward asymptotics, we enlarge the filtration by adding random variables $y_{it}$ even when $t > T$. These random variables are generated from the same distribution that produces the observed cross-sectional sample but are not actually included in the cross-sectional sample. This enlargement is used mostly for notational convenience. Formally, the filtrations are defined as follows. We use the binary operator $\vee$ to denote the smallest $\sigma$-field that contains the union of two $\sigma$-fields. Setting $\mathcal{C} = \sigma(\nu_1, \ldots, \nu_T)$ we define

$$\mathcal{G}_{\tau n,0} = \mathcal{C} \tag{10}$$
$$\mathcal{G}_{\tau n,i} = \sigma\left(z_{\min(1,\tau_0)}, \{y_{j,\min(1,\tau_0)}\}_{j=1}^{i}\right) \vee \mathcal{C}$$
$$\vdots$$

$$\mathcal{G}_{\tau n,n+i} = \sigma\left(\{y_{j,\min(1,\tau_0)}\}_{j=1}^{n}, z_{\min(1,\tau_0)+1}, z_{\min(1,\tau_0)}, \{y_{j,\min(1,\tau_0)+1}\}_{j=1}^{i}\right) \vee \mathcal{C}$$
$$\vdots$$

$$\mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+i} = \sigma\left(\{y_{j,t-1}, y_{j,t-2}, \ldots, y_{j,\min(1,\tau_0)}\}_{j=1}^{n}, z_t, z_{t-1}, \ldots, z_{\min(1,\tau_0)}, \{y_{j,t}\}_{j=1}^{i}\right) \vee \mathcal{C}.$$
As noted before, the filtration $\mathcal{G}_{\tau n,tn+i}$ is generated by a set of random variables that contain the observed sample as a strict subset. More specifically, we note that the cross-section $y_{j,t}$ is only observed in the sample for a finite number of time periods, while the filtrations grow over the entire expanding time series sample period. We use the convention that $\mathcal{G}_{\tau n,(t-\min(1,\tau_0))n} =$

$\mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n+n}$. This implies that $z_t$ and $y_{1t}$ are added simultaneously to the filtration $\mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}$. Also note that $\mathcal{G}_{\tau n,i}$ predates the time series sample by at least one period. To simplify notation, define the function $q_n(t,i) = (t - \min(1,\tau_0))n + i$ that maps the two-dimensional index $(t,i)$ into the integers and note that for $q = q_n(t,i)$ it follows that $q \in \{0, \ldots, \max(T,\tau)n\}$. The filtrations

$^{6}$The filtrations are constructed based on the ordering of the cross-sectional sample. However, at the cost of slightly stronger moment conditions, the $\sigma$-fields can be constructed in a way that is invariant to reordering of the cross-sectional sample. We refer the interested reader to Kuersteiner and Prucha (2013), in particular Definition 1 and the related discussion, for details. Note that the stronger moment conditions needed for the invariance property hold in situations where $y_{it}$ is cross-sectionally independent conditional on $\sigma(z_t, z_{t-1}, \ldots) \vee \mathcal{C}$. The latter is a leading case in our examples.

are increasing in the sense that $\mathcal{G}_{\tau n,q} \subset \mathcal{G}_{\tau n,q+1}$ for all $q$, $\tau$ and $n$. However, they are not nested in the sense of Hall and Heyde's (1980) Condition (3.21) for two reasons. One is the fact that we are considering what essentially amounts to a panel structure generating the filtration. Kuersteiner and Prucha (2013) provide a detailed discussion of this aspect. A second reason has to do with the possible `backwards' asymptotics adopted in this paper. Even in a pure time series setting, i.e. omitting $y_{i,t}$ from the definitions (10), backwards asymptotics lead to a violation of Hall and Heyde's Condition (3.21) because the definition of $\mathcal{G}_{\tau n,q}$ changes as $q$ is held fixed but $\tau$ increases. The consequence of this is that, unlike in Hall and Heyde (1980), stable convergence cannot be established for the entire probability space, but rather is limited to the invariant $\sigma$-field $\mathcal{C}$. This limitation is the same as in Kuersteiner and Prucha (2013) and also appears in Eagleson (1975), albeit due to very different technical reasons. The fact that the definition of $\mathcal{G}_{\tau n,q}$ changes for fixed $q$ does not pose any problems for the proofs that follow, because the underlying proof strategy explicitly accounts for triangular arrays and does not use Hall and Heyde's Condition (3.21). The central limit theorem we develop needs to establish joint convergence for terms involving

both $\psi^{y}_{n,it}$ and $\psi^{\nu}_{\tau,t}$ with both the time series and the cross-sectional dimension becoming large simultaneously. Let $[a]$ be the largest integer less than or equal to $a$. Joint convergence is achieved by stacking both moment vectors into a single sum that extends over both $t$ and $i$. Let $r \in [0,1]$ and define
$$\tilde{\psi}^{\nu}_{it}(r) \equiv \frac{\psi^{\nu}_{\tau,t}}{\sqrt{\tau}}\, 1\{\tau_0 + 1 \le t \le \tau_0 + [\tau r]\}\, 1\{i = 1\}, \tag{11}$$
which depends on $r$ in a non-trivial way. This dependence will be of particular interest when we specialize our models to the near unit root case. For cross-sectional data define
$$\tilde{\psi}^{y}_{it}(r) \equiv \tilde{\psi}^{y}_{it} \equiv \frac{\psi^{y}_{n,it}}{\sqrt{n}}\, 1\{1 \le t \le T\} \tag{12}$$
where $\tilde{\psi}^{y}_{it}(r) = \tilde{\psi}^{y}_{it}$ is constant as a function of $r \in [0,1]$. In turn, this implies that functional convergence of the component (12) is the same as the finite dimensional limit. It also implies that the limiting process is degenerate (i.e. constant) when viewed as a function of $r$. However, this does not matter in our applications as we are only interested in the sample averages

$$\frac{1}{\sqrt{n}}\sum_{t=1}^{T}\sum_{i=1}^{n}\psi^{y}_{n,it} = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{\psi}^{y}_{it} \equiv X^{y}_{n\tau}.$$
Define the stacked vector $\tilde{\psi}_{it}(r) = \left(\tilde{\psi}^{y}_{it}(r)', \tilde{\psi}^{\nu}_{it}(r)'\right)' \in \mathbb{R}^{k_{\phi}}$ where $\phi = (\theta, \rho)$ and $k_{\phi}$ is the dimension of $\phi$. Consider the stochastic process

$$X_{n\tau}(r) = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{\psi}_{it}(r), \qquad X_{n\tau}(0) = \left(X^{y\prime}_{n\tau}, 0\right)'. \tag{13}$$
We derive a functional central limit theorem which establishes joint convergence between the panel and time series portions of the process $X_{n\tau}(r)$. The result is useful in analyzing both trend stationary and unit root settings. In the latter, we specialize the model to a linear time series setting. The functional CLT is then used to establish proper joint convergence between stochastic integrals and the cross-sectional component of our model.

For the stationary case we are mostly interested in $X_{n\tau}(1)$, where in particular

$$\frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+1}^{\tau_0+\tau}\psi^{\nu}_{\tau,t} = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{\psi}^{\nu}_{it}(1)$$
and we used the fact that $\sum_{i=1}^{n}\tilde{\psi}^{\nu}_{it}(1) = \tilde{\psi}^{\nu}_{1t}(1) = \frac{\psi^{\nu}_{\tau,t}}{\sqrt{\tau}}\, 1\{\tau_0 + 1 \le t \le \tau_0 + \tau\}$ by (11). The limiting distribution of $X_{n\tau}(1)$ is a simple corollary of the functional CLT for $X_{n\tau}(r)$. We note that our treatment differs from Phillips and Moon (1999), who develop functional CLTs for the time series dimension of the panel data set. In our case, since $T$ is fixed and finite, a similar treatment is not applicable. We introduce the following general regularity conditions for generic random vectors $\psi^{y}_{n,it}$ and $\psi^{\nu}_{\tau,t}$. Similarly, the central limit theorem established in this section is for generic random vectors and empirical processes satisfying the regularity conditions. In later sections, these conditions will be specialized to the particular models considered there. To apply the general theory in this section to specific models we will evaluate the score or moment function at the true parameter value. In

those instances $\psi^{y}_{n,it}$ and $\psi^{\nu}_{\tau,t}$ will be interpreted as the score or moment function evaluated at the true parameter value. We use $\|\cdot\|$ to denote the Euclidean norm.

Condition 1 Assume that
i) $\psi^{y}_{n,it}$ is measurable with respect to $\mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+i}$.
ii) $\psi^{\nu}_{\tau,t}$ is measurable with respect to $\mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+i}$ for all $i = 1, \ldots, n$.
iii) for some $\delta > 0$ and $C < \infty$, $\sup_{it} E\left[\|\psi^{y}_{n,it}\|^{2+\delta}\right] \le C$ for all $n \ge 1$.
iv) for some $\delta > 0$ and $C < \infty$, $\sup_{t} E\left[\|\psi^{\nu}_{\tau,t}\|^{2+\delta}\right] \le C$ for all $\tau \ge 1$.
v) $E\left[\psi^{y}_{n,it} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+i-1}\right] = 0$.
vi) $E\left[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n+i}\right] = 0$ for $t > T$ and all $i = 1, \ldots, n$.
vii) $\left\|E\left[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n+i}\right]\right\|_2 \le \vartheta_t$ for $t < 0$ and all $i = 1, \ldots, n$, where
$$\vartheta_t \le C\left(|t|^{1+\delta}\right)^{-1/2}$$
for the same $\delta$ as in (iv) and some bounded constant $C$, and where for a vector of random variables $x = (x_1, \ldots, x_d)$, $\|x\|_2 = \left(\sum_{j=1}^{d} E|x_j|^2\right)^{1/2}$ is the $L_2$ norm.

Remark 1 Conditions 1(i), (iii) and (v) can be justified in a variety of ways. One is the subordinated process theory employed in Andrews (2005), which arises when $y_{it}$ are random draws from a population of outcomes $y$. A sufficient condition for Condition 1(v) to hold is that $E\left[\psi(y \mid \theta, \rho, \nu_t) \mid \mathcal{C}\right] = 0$ holds in the population. This would be the case, for example, if $\psi$ were the correctly specified score for the population distribution of $y_{i,t}$ given $\nu_t$. See Andrews (2005, pp. 1573-1574).

Conditions 1(i), (iii) and (v) impose a martingale property for $\psi^{y}_{n,it}$ both in the time as well as in the cross-section dimension. More specifically, $E\left[\psi^{y}_{n,it} \mid \psi^{y}_{n,i_1 t}, \psi^{y}_{n,i_2 t_2}, \ldots, \psi^{y}_{n,i_k t_k}, \mathcal{C}\right] = 0$ for any collection $(i_1, t), \ldots, (i_k, t_k)$ with $i_r < i$, $t_2, \ldots, t_k < t$ and $i_r \in \{1, \ldots, n\}$. The central limit theorem is established by letting the sums over the sample increase sequentially in line with the information contained in $\mathcal{G}_{\tau n,q}$ and thus mapping the sums over $t$ and $i$ into a single sum over an index set on the real line. This approach preserves the information structure in $\mathcal{G}_{\tau n,q}$ and in particular takes into account that in general $E\left[\psi^{y}_{n,it} \mid \psi^{y}_{n,js}, \mathcal{C}\right] \ne 0$ for $s > t$ and all $j \in \{1, \ldots, n\}$. The score of a correctly specified likelihood is a leading example where the information structure takes the form implied by our conditions. The construction in this paper is in contrast to some of the joint limit theory for panels that is based on mds assumptions in the time direction only and weak convergence of time series averages of cross-sectional aggregates $\breve{\psi}^{y}_{n,t} = n^{-1/2}\sum_{i=1}^{n}\psi^{y}_{n,it}$, driven by $T$ going to infinity. Our theory on the other hand depends on the cross-section sample size $n$ tending to infinity, while $T$ is kept fixed. The pure cross-section case is covered by allowing for $T = 1$. Other examples include cases

where the cross-section is sampled at random conditional on $\mathcal{C}$ and $\psi^{y}_{n,it}$ is a conditional moment function in a rational agent model.

Conditions 1(ii), (iv) and (vi) impose a martingale property for $\psi^{\nu}_{\tau,t}$ in the time dimension. In addition, the condition also implies that $E\left[\psi^{\nu}_{\tau,t} \mid \psi^{y}_{n,is}, \mathcal{C}\right] = 0$ for any $i \in \{1, \ldots, n\}$, $s < t$ and $t > T$. We note that this condition is weaker than assuming independence between $\psi^{\nu}_{\tau,t}$ and $\psi^{y}_{n,is}$, even conditionally on $\mathcal{C}$. While the examples in Hahn, Kuersteiner and Mazzocco (2019) do satisfy such a conditional independence restriction, it is not required for the CLT developed in this paper.

We note that $E\left[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,q_n(t-1,i)}\right] = 0$ for all $i$ only holds for $t > T$ because we condition not only on $z_{t-1}, z_{t-2}, \ldots$ but also on $\nu_1, \ldots, \nu_T$, where the latter may have non-trivial overlap with the former. When $\tau_0$ is fixed, the number of time periods $t$ where $E\left[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,q_n(t-1,i)}\right] \ne 0$ is finite and thus can be neglected asymptotically. On the other hand, when $\tau_0$ varies with $\tau$ there is an asymptotically non-negligible number of time periods where the condition may not hold. To handle this latter case we impose an additional mixingale type condition that is satisfied for typical time series models.

Condition 1(vii) is an additional restriction needed to handle situations where $\tau_0$, the starting point of the time series sample, is allowed to diverge to $-\infty$. We call this situation backward asymptotics. Since it generally is the case that $E\left[\psi^{\nu}_{\tau,t} \mid \mathcal{C}\right] \ne 0$ for $t < 0$, because $\mathcal{C}$ contains information about future realizations of $\psi^{\nu}_{\tau,t}$, we need a condition that limits this dependence as $t \to -\infty$. The following example illustrates that the condition naturally holds in linear time series models.

Example 1 Assume $u_s$ is iid $N(0,1)$ and $z_s = \sum_{j=0}^{\infty} \rho^{j} u_{s-j}$ with $|\rho| < 1$ is the stationary solution to $z_{s+1} = \rho z_s + u_{s+1}$. Use the convention that $\nu_s = z_s$. Then the score of the Gaussian likelihood is $\psi^{\nu}_{\tau,s} = z_s u_{s+1}$. Assuming that $\tau_0 = -\tau + 1$, $T = 1$ and that $z_s$ is independent of $y_{i,t}$ conditional on $\mathcal{C} = \sigma(\nu_1)$, it is sufficient for this example to define $\mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+i} = \sigma(\{z_t, z_{t-1}, \ldots, z_{\tau_0}\}) \vee \sigma(\nu_1)$.$^{7}$ Then

$$\left\|E\left[\psi^{\nu}_{\tau,s} \mid \mathcal{G}_{\tau n,(s-\min(1,\tau_0)-1)n+i}\right]\right\|_2 = O\left(|\rho|^{|s|/2}\right) = o\left(|s|^{-(1+\delta)/2}\right).$$
$^{7}$Detailed derivations are in Section A.1.
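The geometric decay in Example 1 can be checked by simulation through a convenient test moment: for jointly Gaussian variables, Isserlis' theorem gives $E[\psi^{\nu}_{\tau,s}\,\nu_1^2] = 2\rho^{1-2s}\sigma_z^2$ for $s \le 0$, which decays geometrically in $|s|$. The Monte Carlo below is a sketch with illustrative values of $\rho$ and $s$ (not taken from the paper):

```python
import numpy as np

# Monte Carlo check of the decay in Example 1, using the test moment
# E[psi_s * nu_1^2] with psi_s = z_s * u_{s+1} and nu_1 = z_1. By Isserlis'
# theorem this equals 2 * rho^(1-2s) * sigma_z^2 for s <= 0, decaying
# geometrically in |s|. Values of rho and s are illustrative.
rng = np.random.default_rng(0)
rho, s = 0.8, -3
sigma_z2 = 1.0 / (1.0 - rho**2)
exact = 2.0 * rho ** (1 - 2 * s) * sigma_z2     # = 2 * rho^7 * sigma_z^2

R = 1_000_000
z = rng.normal(scale=np.sqrt(sigma_z2), size=R)  # stationary draw of z_s
psi = None
for t in range(s, 1):                  # advance the recursion from s to 1
    u = rng.standard_normal(R)         # u_{t+1}
    if t == s:
        psi = z * u                    # psi_s = z_s * u_{s+1}
    z = rho * z + u                    # z_{t+1} = rho * z_t + u_{t+1}
mc = float(np.mean(psi * z**2))        # estimates E[psi_s * z_1^2]
```

Here `mc` agrees with `exact` up to Monte Carlo error, and `exact` shrinks geometrically as `s` moves further into the past, consistent with Condition 1(vii).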

The following conditions, Condition 2 for the time series sample and Condition 3 below for the cross-section sample, are put in place to ensure that, in combination with Condition 1, the variance of $X_{n\tau}(r)$ converges as $n$ and $\tau$ tend to $\infty$ jointly. Conditions 2 and 3 correspond to Condition 3.19 in Hall and Heyde (1980, Theorem 3.2). We note in particular that the martingale structure in conjunction with the uniform moment bounds imposed in Condition 1 is sufficient to guarantee that cross-covariance terms over different time periods converge to zero.

Condition 2 Assume that:
i) for any $r \in [0,1]$,
$$\frac{1}{\tau}\sum_{t=\tau_0+1}^{\tau_0+[\tau r]} \psi^{\nu}_{\tau,t}\psi^{\nu\prime}_{\tau,t} \stackrel{p}{\to} \Omega_{\nu}(r) \text{ as } \tau \to \infty$$
where $\Omega_{\nu}(r)$ is positive definite a.s. and measurable with respect to $\sigma(\nu_1, \ldots, \nu_T)$ for all $r \in (0,1]$.
ii) The elements of $\Omega_{\nu}(r)$ are bounded continuously differentiable functions of $r \in [0,1]$. The derivatives $\dot{\Omega}_{\nu}(r) = \partial\Omega_{\nu}(r)/\partial r$ are positive definite almost surely.

iii) There is a fixed constant $M < \infty$ such that $\sup_{\|\lambda_{\nu}\|=1,\, \lambda_{\nu} \in \mathbb{R}^{k_{\rho}}} \sup_{t} \lambda_{\nu}' \dot{\Omega}_{\nu}(t) \lambda_{\nu} \le M$ a.s.

Remark 2 Note that by construction $\Omega_{\nu}(0) = 0$.

Condition 2 is weaker than the conditions of Billingsley’s (1968, Theorem 23.1) functional CLT for strictly stationary martingale difference sequences (mds) because we neither assume

strict stationarity nor homoskedasticity. We do not assume that $E\left[\psi^{\nu}_{\tau,t}\psi^{\nu\prime}_{\tau,t}\right]$ is constant. Brown (1971) allows for time varying variances, but uses stopping times to achieve a standard Brownian limit. Even more general treatments with random stopping times are possible; see Gaenssler and Haeussler (1979). On the other hand, here convergence to a Gaussian process (not a standard Wiener process) is pursued with the same methodology (i.e. establishing convergence of finite dimensional distributions and tightness) as in Billingsley, but without assuming homoskedasticity. Related results with heteroskedastic errors in the high frequency literature can be found for example in Jacod and Shiryaev (2003, Theorem IX 7.28). Heteroskedastic errors are explicitly used in Section 5, where $\psi^{\nu}_{\tau,t} = \exp((t-s)\gamma/\tau)\eta_s$. Even if $\eta_s$ is iid$(0, \sigma^2)$ it follows that $\psi^{\nu}_{\tau,t}$ is a heteroskedastic triangular array that depends on $\tau$. It can be shown that the variance kernel $\Omega_{\nu}(r)$ is $\Omega_{\nu}(r) = \sigma^2(1 - \exp(-2r\gamma))/(2\gamma)$ in this case. See equation (61).
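The variance kernel above can be verified numerically. The check below assumes the sign convention $\rho = \exp(-\gamma/\tau)$ with $\gamma > 0$ (the mean reverting side of unity; this convention is our assumption for the sketch), under which the exact finite-$\tau$ variance of $\nu_{[\tau r]} = \sum_{s=1}^{[\tau r]} \rho^{[\tau r]-s}\eta_s$, divided by $\tau$, matches $\sigma^2(1-\exp(-2r\gamma))/(2\gamma)$:

```python
import numpy as np

# Deterministic check of the near unit root variance kernel. Assumed
# convention: rho = exp(-gamma/tau), gamma > 0, eta_s iid (0, sigma^2).
gamma, sigma2, tau, r = 1.0, 1.0, 20_000, 0.5
rho = np.exp(-gamma / tau)
m = int(tau * r)

# Exact variance of nu_m = sum_{s=1}^m rho^(m-s) * eta_s (geometric series).
var_exact = sigma2 * (1 - rho ** (2 * m)) / (1 - rho**2)

# Limiting kernel from the text: sigma^2 * (1 - exp(-2 r gamma)) / (2 gamma).
kernel = sigma2 * (1 - np.exp(-2 * r * gamma)) / (2 * gamma)
rel_err = abs(var_exact / tau - kernel) / kernel
```

For $\tau = 20{,}000$ the relative discrepancy is already far below one percent, illustrating the triangular-array convergence to the kernel.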

Condition 3 Assume that
$$\frac{1}{n}\sum_{i=1}^{n} \psi^{y}_{n,it}\psi^{y\prime}_{n,it} \stackrel{p}{\to} \Omega_{ty}$$
where $\Omega_{ty}$ is positive definite a.s. and measurable with respect to $\sigma(\nu_1, \ldots, \nu_T)$.

Condition 2 holds under a variety of conditions that imply some form of weak dependence of

the process $\psi^{\nu}_{\tau,t}$. These include, in addition to Condition 1(ii) and (iv), mixing or near epoch dependence assumptions on the temporal dependence properties of the process $\psi^{\nu}_{\tau,t}$. Condition 3 holds under appropriate moment bounds and random sampling in the cross-section even if the underlying population distribution is not independent (see Andrews, 2005, for a detailed treatment).
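The random, $\mathcal{C}$-measurable nature of such limits can be made concrete with a toy cross-section, in the spirit of Andrews (2005) but with a construction of our own: for a hypothetical score $\psi_{it} = \nu u_{it}$, the cross-sectional average of squared scores converges to $\nu^2 \operatorname{Var}(u)$, a limit that is random because it depends on the common shock $\nu$:

```python
import numpy as np

# Conditional on the common shock nu, the draws psi_i = nu * u_i are iid and
# (1/n) * sum psi_i^2 -> nu^2 * Var(u): a C-measurable random limit.
rng = np.random.default_rng(1)
n = 500_000

def omega_hat(nu):
    u = rng.standard_normal(n)          # idiosyncratic shocks, Var(u) = 1
    return float(np.mean((nu * u) ** 2))

# Different realizations of the common shock yield different limits.
om_half = omega_hat(nu=0.5)   # close to 0.25
om_two = omega_hat(nu=2.0)    # close to 4.0
```

Unconditionally the draws are dependent through $\nu$; conditionally on $\nu$ they are iid, which is exactly the structure Conditions 1 and 3 exploit.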

4.1 Stable Functional CLT

This section details the probabilistic setting we use to accommodate the results that Jacod and

Shiryaev (2002) (shorthand notation JS) develop for general Polish spaces. Let $(\Omega', \mathcal{G}', P')$ be a probability space with increasing filtrations $\mathcal{G}_{k_n,q} \subset \mathcal{G}_{k_n,q+1} \subset \mathcal{G}'$ for any $q = 1, \ldots, k_n$ and an increasing sequence $k_n \to \infty$ as $n \to \infty$.$^{8}$ Let $D_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}[0,1]$ be the space of functions $[0,1] \to \mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}$ that are right continuous and have left limits (see Billingsley (1968, p.109)) with discontinuities synchronized across all elements in the vectors. Let $\mathcal{C}$ be a sub-sigma field of $\mathcal{G}'$. Let $(\zeta, Z^{n}(\omega, t)) : \Omega' \times [0,1] \to \mathbb{R} \times \mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}$ be random variables or random elements in $\mathbb{R}$ and $D_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}[0,1]$, respectively, defined on the common probability space $(\Omega', \mathcal{G}', P')$, and assume that $\zeta$ is bounded and measurable with respect to $\mathcal{C}$.

As in JS, p.512, let $Z(\omega', x) = x$ be the canonical element on $D_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}[0,1]$ and let $Q(\omega', dx)$ be a version of the distribution of $Z$ conditional on $\mathcal{C}$. Similarly, let $Q_n(\omega', dx)$ be a version of the conditional (on $\mathcal{C}$) distribution of $Z^{n}$. Following JS (Definition VI.1.1 and Theorem VI.1.14) we define the $\sigma$-field generated by all coordinate projections on $D_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}[0,1]$ as $\mathcal{D}_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}$. Then define the joint probability space $(\Omega, \mathcal{G}, P)$ with $\Omega = \Omega' \times D_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}[0,1]$, $\mathcal{G} = \mathcal{G}' \otimes \mathcal{D}_{\mathbb{R}^{k_{\theta}} \times \mathbb{R}^{k_{\rho}}}$ and $P(d\omega', dx) = P'(d\omega') Q(\omega', dx)$. Following JS (p.512, Definition 5.28) we say that $Z^{n}$ converges $\mathcal{C}$-stably to $Z$ if for all bounded, $\mathcal{C}$-measurable $\zeta$ and all bounded continuous functions $f$,
$$E\left[\zeta f(Z^{n})\right] \to E\left[\zeta Q[f(Z)]\right] \tag{14}$$
where $Q[f(Z)]$ is the expectation of $f(Z)$ conditional on $\mathcal{C}$. More specifically, if $W(r)$ is standard Brownian motion, we say that $Z^{n} \Rightarrow W(r)$ $\mathcal{C}$-stably, where the notation means that (14) holds when $Q$ is Wiener measure (for a definition and existence proof see Billingsley (1968, Chapter 2, Section 9)). By JS Proposition VIII.5.33 it follows that $Z^{n}$ converges $\mathcal{C}$-stably iff $Z^{n}$ is tight and $E\left[1_A f(Z^{n})\right]$ converges for all $A \in \mathcal{C}$.

The concept of stable convergence was introduced by Renyi (1963) and has found wide application in probability and statistics. Most relevant to the discussion here are the stable central limit theorem of Hall and Heyde (1980, Theorem 3.2) and Kuersteiner and Prucha (2013), who extend the result in Hall and Heyde (1980, Theorem 3.2) to panel data with fixed $T$. Dedecker and Merlevede (2002) established a related stable functional CLT for strictly stationary martingale differences.

$^{8}$In our case $k_n = \max(T, \tau)n$ where both $n \to \infty$ and $\tau \to \infty$ such that clearly $k_n \to \infty$.

Theorem 1 Assume that Conditions 1, 2 and 3 hold. Then it follows that for $\tilde{\psi}_{it}$ defined in (13), and as $\tau, n \to \infty$ and $T$ fixed,

$$X_{n\tau}(r) \Rightarrow \begin{pmatrix} B_y(1) \\ B_{\nu}(r) \end{pmatrix} \quad (\mathcal{C}\text{-stably})$$
where $B_y(r) = \Omega_y^{1/2} W_y(r)$, $B_{\nu}(r) = \int_0^r \dot{\Omega}_{\nu}(s)^{1/2}\, dW_{\nu}(s)$ and $\Omega(r) = \operatorname{diag}(\Omega_y, \Omega_{\nu}(r))$ is $\mathcal{C}$-measurable, $\dot{\Omega}_{\nu}(s) = \partial\Omega_{\nu}(s)/\partial s$ and $(W_y(r), W_{\nu}(r))$ is a vector of standard $k_{\phi}$-dimensional, mutually independent Brownian processes independent of $\Omega$.

Proof. In Appendix B.

Remark 3 Note that $W_y(r) = W_y(1)$ for each $r \in [0,1]$ by construction. Thus, $W_y(1)$ is simply a vector of standard Gaussian random variables, independent both of $W_{\nu}(r)$ and of any random variable measurable w.r.t. $\mathcal{C}$.

The limiting random variables $B_y(r)$ and $B_{\nu}(r)$ both depend on $\mathcal{C}$ and are thus mutually dependent. However, conditional on $\mathcal{C}$, the limiting random variables are independent because of the mutual independence of $W_y(r)$ and $W_{\nu}(r)$. The representation $B_y(1) = \Omega_y^{1/2} W_y(1)$, where a stable limit is represented as the product of an independent Gaussian random variable and a scale factor that depends on $\mathcal{C}$, is common in the literature on stable convergence. Results similar to the one for $B_{\nu}(r)$ were obtained by Phillips (1987, 1988) for cases where $\dot{\Omega}_{\nu}(s)$ is non-stochastic and has an explicit functional form, notably for near unit root processes, and when convergence is marginal rather than stable. Rootzen (1983) establishes stable convergence but gives a representation of the limiting process in terms of standard Brownian motion obtained by a stopping time transformation. The representation of $B_{\nu}(r)$ in terms of a stochastic integral over the random scale process $\dot{\Omega}_{\nu}(s)$ is obtained by utilizing a technique mentioned in Rootzen (1983, p. 10) but not utilized there, namely establishing finite dimensional convergence using a stable martingale CLT. This technique combined with a tightness argument establishes the characteristic function of the limiting process. The representation for $B_{\nu}(r)$ is then obtained by utilizing isometry properties of the stochastic integral. Rootzen (1983, p.13) similarly utilizes characteristic functions to identify the limiting distribution in the case of standard Brownian motion. Similar representations have been obtained in the high frequency time series literature, see Jacod et al. (2010), Jacod and Protter (2012). Finally, the results of Dedecker and Merlevede (2002) differ from ours in that they only consider asymptotically homoskedastic and strictly stationary processes. In our case, heteroskedasticity is explicitly allowed because of $\dot{\Omega}_{\nu}(s)$.
An important special case of Theorem 1 is the near unit root model discussed in more detail in Section 5. More importantly, our results innovate over the literature by establishing joint convergence between cross-sectional and time series averages that are generally not independent and whose limiting distributions are not independent. This result is obtained by a novel construction that embeds both data sets in a random field. A careful construction of information filtrations $\mathcal{G}_{\tau n,n+i}$ allows us to map the field into a martingale array. Similar techniques were used in Kuersteiner and Prucha (2013) for panels with fixed $T$. In this paper we extend their approach to handle an additional and distinct time series data set and by allowing for both $n$ and $\tau$ to tend to infinity jointly. In addition to the more complicated data structure, we extend Kuersteiner and Prucha (2013) by considering functional central limit theorems. The following corollary is useful for possibly non-linear but trend stationary models.

Corollary 1 Assume that Conditions 1, 2 and 3 hold. Then it follows that for $\tilde{\psi}_{it}$ defined in (13), and as $\tau, n \to \infty$ and $T$ fixed,

$$X_{n\tau}(1) \stackrel{d}{\to} B := \Omega^{1/2} W \quad (\mathcal{C}\text{-stably})$$
where $\Omega = \operatorname{diag}(\Omega_y, \Omega_{\nu}(1))$ is $\mathcal{C}$-measurable and $W = (W_y(1)', W_{\nu}(1)')'$ is a vector of standard $d$-dimensional Gaussian random variables independent of $\Omega$. The variables $\Omega_y$, $\Omega_{\nu}(\cdot)$, $W_y(\cdot)$ and

$W_{\nu}(\cdot)$ are as defined in Theorem 1.

Proof. In Appendix B.
The result of Corollary 1 is equivalent to the statement that $X_{n\tau}(1) \stackrel{d}{\to} N(0, \Omega)$ conditional on positive probability events in $\mathcal{C}$. As noted earlier, no simplification of the technical arguments is possible by conditioning on $\mathcal{C}$ except in the trivial case where $\Omega$ is a fixed constant. Eagleson (1975, Corollary 3), see also Hall and Heyde (1980, p. 59), establishes a simpler result where $X_{n\tau}(1) \stackrel{d}{\to} B$ weakly but not $\mathcal{C}$-stably. Such results could in principle be obtained here as well, but they would not be useful for the analysis in Sections 4.2 and 5 because the limiting distributions of our estimators depend not only on $B$ but also on other $\mathcal{C}$-measurable scaling matrices. Since the continuous mapping theorem requires joint convergence, a weak limit for $B$ alone is not sufficient to establish the results we obtain below. Theorem 1 establishes what Phillips and Moon (1999) call diagonal convergence, a special form of joint convergence.$^{9}$ To see that sequential convergence, where first $n$ or $\tau$ goes to infinity followed by the other index, is generally not useful in our setup, consider the following example.

Assume that $d = k_{\phi}$ is the dimension of the vector $\tilde{\psi}_{it}$. This would hold for just identified moment estimators and likelihood based procedures. Consider the double indexed process

$$X_{n\tau}(1) = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{\psi}_{it}(1). \tag{15}$$
For each $\tau$ fixed, convergence in distribution of $X_{n\tau}$ as $n \to \infty$ follows from the central limit theorem in Kuersteiner and Prucha (2013). Let $X_{\tau}$ denote the ``large $n$, fixed $\tau$'' limit. For each $n$ fixed, convergence in distribution of $X_{n\tau}$ as $\tau \to \infty$ follows from a standard martingale central limit theorem for Markov processes. Let $X_{n}$ be the ``large $\tau$, fixed $n$'' limit. It is worth pointing out that the distributions of both $X_{n}$ and $X_{\tau}$ are unknown because the limits are trivial in one direction.

For example, when $\tau$ is fixed and $n$ tends to infinity, the component $\tau^{-1/2}\sum_{t=\tau_0+1}^{\tau_0+\tau}\psi^{\nu}_{\tau,t}$ trivially

$^{9}$The discussion assumes that $0 < \kappa < \infty$. The cases where $\kappa = 0$ or $\kappa = \infty$ allow for a simpler treatment where either the time series or cross-section sample can be ignored. In those situations considerations of joint convergence play only a minor role.

converges in distribution (it does not change with $n$) but the distribution of $\tau^{-1/2}\sum_{t=\tau_0+1}^{\tau_0+\tau}\psi^{\nu}_{\tau,t}$ is generally unknown. More importantly, application of a conventional CLT for the cross-section alone will fail to account for the dependence between the time series and cross-sectional components. Sequential convergence arguments are thus not recommended even as heuristic justifications of limiting distributions in our setting.

4.2 Trend Stationary Models

Let $\theta = (\beta, \nu_1, \ldots, \nu_T)$ and define the shorthand notation $f_{it}(\theta, \rho) = f(y_{it} \mid \theta, \rho)$, $g_t(\beta, \rho) = g(\nu_t \mid \nu_{t-1}, \beta, \rho)$, $f_{\theta,it}(\theta, \rho) = \partial f_{it}(\theta, \rho)/\partial\theta$ and $g_{\rho,t}(\beta, \rho) = \partial g_t(\beta, \rho)/\partial\rho$. Also let $f_{it} = f_{it}(\theta_0, \rho_0)$, $f_{\theta,it} = f_{\theta,it}(\theta_0, \rho_0)$, $g_t = g_t(\beta_0, \rho_0)$ and $g_{\rho,t} = g_{\rho,t}(\beta_0, \rho_0)$. Depending on whether the estimator under consideration is maximum likelihood or moment based, we assume that either $(f_{\theta,it}, g_{\rho,t})$ or $(f_{it}, g_t)$ satisfy the same assumptions as $(\psi^{y}_{it}, \psi^{\nu}_{\tau,t})$ in Condition 1. We recall that $\nu_t(\beta, \rho)$ is a function of $(z_t, \beta, \rho)$, where $z_t$ are observable macro variables. For the CLT, the process $\nu_t = \nu_t(\rho_0, \beta_0)$ is evaluated at the true parameter values and treated as observed. In applications, $\nu_t$ will be replaced by an estimate, which potentially affects the limiting distribution of $\hat{\rho}$. This dependence is analyzed in a step separate from the CLT. The next step is to use Corollary 1 to derive the joint limiting distribution of estimators for

$\phi = (\theta', \rho')'$. Define $s^{\nu}_{ML}(\beta, \rho) = \tau^{-1/2}\sum_{t=\tau_0+1}^{\tau_0+\tau} \partial g(\nu_t(\beta, \rho) \mid \nu_{t-1}(\beta, \rho), \beta, \rho)/\partial\rho$ and $s^{y}_{ML}(\theta, \rho) = n^{-1/2}\sum_{t=1}^{T}\sum_{i=1}^{n} \partial f(y_{it} \mid \theta, \rho)/\partial\theta$ for maximum likelihood, and
$$s^{\nu}_{M}(\beta, \rho) = -\left(\partial k_{\tau}(\beta, \rho)/\partial\rho\right)' W^{\tau}_{\tau}\, \tau^{-1/2}\sum_{t=\tau_0+1}^{\tau_0+\tau} g(\nu_t(\beta, \rho) \mid \nu_{t-1}(\beta, \rho), \beta, \rho)$$
and $s^{y}_{M}(\theta, \rho) = -\left(\partial h_n(\theta, \rho)/\partial\theta\right)' W^{C}_{n}\, n^{-1/2}\sum_{t=1}^{T}\sum_{i=1}^{n} f(y_{it} \mid \theta, \rho)$ for moment based estimators. We use $s^{\nu}(\beta, \rho)$ and $s^{y}(\theta, \rho)$ generically for arguments that apply to both maximum likelihood and moment based estimators. The estimator $\hat{\phi}$ jointly satisfies the moment restrictions using time series data
$$s^{\nu}(\hat{\beta}, \hat{\rho}) = 0 \tag{16}$$
and cross-sectional data
$$s^{y}(\hat{\theta}, \hat{\rho}) = 0. \tag{17}$$

Defining $s(\phi) = \left(s^{y}(\phi)', s^{\nu}(\phi)'\right)'$, the estimator $\hat{\phi}$ satisfies $s(\hat{\phi}) = 0$. A first order Taylor series expansion around $\phi_0$ is used to obtain the limiting distribution for $\hat{\phi}$. We impose the following additional assumption.

Condition 4 Let $\phi = (\theta', \rho')' \in \mathbb{R}^{k_{\phi}}$, $\theta \in \mathbb{R}^{k_{\theta}}$, and $\rho \in \mathbb{R}^{k_{\rho}}$. Define $D_{n\tau} = \operatorname{diag}\left(n^{-1/2} I_y, \tau^{-1/2} I_{\nu}\right)$, where $I_y$ is an identity matrix of dimension $k_{\theta}$ and $I_{\nu}$ is an identity matrix of dimension $k_{\rho}$. Assume that for some $\varepsilon > 0$,

$$\sup_{\phi: \|\phi - \phi_0\| \le \varepsilon} \left\| D_{n\tau} \frac{\partial s(\phi)}{\partial\phi'} - A(\phi) \right\| = o_p(1)$$
where $A(\phi)$ is $\mathcal{C}$-measurable and $A = A(\phi_0)$ is full rank almost surely. Let $\kappa = \lim n/\tau$,

$$A = \begin{pmatrix} A_{y,\theta} & \sqrt{\kappa}\, A_{y,\rho} \\ \frac{1}{\sqrt{\kappa}}\, A_{\nu,\theta} & A_{\nu,\rho} \end{pmatrix}$$

with $A_{y,\theta} = \operatorname{plim} n^{-1/2}\, \partial s^{y}(\phi_0)/\partial\theta'$, $A_{y,\rho} = \operatorname{plim} n^{-1/2}\, \partial s^{y}(\phi_0)/\partial\rho'$, $A_{\nu,\theta} = \operatorname{plim} \tau^{-1/2}\, \partial s^{\nu}(\phi_0)/\partial\theta'$ and $A_{\nu,\rho} = \operatorname{plim} \tau^{-1/2}\, \partial s^{\nu}(\phi_0)/\partial\rho'$.

Condition 5 For maximum likelihood criteria the following holds:

i) for any $r \in [0,1]$, $\frac{1}{\tau}\sum_{t=\tau_0+1}^{\tau_0+[\tau r]} g_{\rho,t}\, g_{\rho,t}' \stackrel{p}{\to} \Omega_{\nu}(r)$ as $\tau \to \infty$ and where $\Omega_{\nu}(r)$ satisfies the same regularity conditions as in Condition 2(ii).
ii) $\frac{1}{n}\sum_{i=1}^{n} f_{\theta,it}\, f_{\theta,it}' \stackrel{p}{\to} \Omega_{ty}$ for all $t \in [1, \ldots, T]$ and where $\Omega_{ty}$ is positive definite a.s. and measurable with respect to $\sigma(\nu_1, \ldots, \nu_T)$. Let $\Omega_y = \sum_{t=1}^{T} \Omega_{ty}$.

Condition 6 Let $W^{C} = \operatorname{plim}_n W^{C}_n$ and $W^{\tau} = \operatorname{plim}_{\tau} W^{\tau}_{\tau}$ and assume the limits to be positive definite and $\mathcal{C}$-measurable. Define $h(\theta, \rho) = \operatorname{plim}_n h_n(\beta, \nu_t, \rho)$ and $k(\beta, \rho) = \operatorname{plim}_{\tau} k_{\tau}(\beta, \rho)$. For moment based criteria the following holds:

i) $\frac{1}{\tau}\sum_{t=\tau_0+1}^{\tau_0+[\tau r]} g_t\, g_t' \stackrel{p}{\to} \Omega_g(r)$ as $\tau \to \infty$ and where $\Omega_g(r)$ satisfies the same regularity conditions as in Condition 2(ii).
ii) $\frac{1}{n}\sum_{i=1}^{n} f_{it}\, f_{it}' \stackrel{p}{\to} \Omega_{t,f}$ for all $t \in [1, \ldots, T]$. Let $\Omega_f = \sum_{t=1}^{T} \Omega_{t,f}$. Assume that $\Omega_f$ is positive definite a.s. and measurable with respect to $\sigma(\nu_1, \ldots, \nu_T)$.
Assume that for some $\varepsilon > 0$,

iii) $\sup_{\phi: \|\phi - \phi_0\| \le \varepsilon} \left\| \left(\partial k_{\tau}(\beta, \rho)/\partial\rho\right)' W^{\tau}_{\tau} - \left(\partial k(\beta, \rho)/\partial\rho\right)' W^{\tau} \right\| = o_p(1)$,
iv) $\sup_{\phi: \|\phi - \phi_0\| \le \varepsilon} \left\| \left(\partial h_n(\theta, \rho)/\partial\theta\right)' W^{C}_{n} - \left(\partial h(\theta, \rho)/\partial\theta\right)' W^{C} \right\| = o_p(1)$.

The following result establishes the joint limiting distribution of $\hat{\phi}$.

Theorem 2 Assume that Conditions 1, 4, and either 5 with $(\psi^{y}_{it}, \psi^{\nu}_{\tau,t}) = (f_{\theta,it}, g_{\rho,t})$ in the case of likelihood based estimators or 6 with $(\psi^{y}_{it}, \psi^{\nu}_{\tau,t}) = (f_{it}, g_t)$ in the case of moment based estimators hold. Assume that $\hat{\phi} - \phi_0 = o_p(1)$ and that (16) and (17) hold. Then,
$$D_{n\tau}^{-1}\left(\hat{\phi} - \phi_0\right) \stackrel{d}{\to} -A^{-1}\Omega^{1/2}W \quad (\mathcal{C}\text{-stably})$$
where $A$ is full rank almost surely, $\mathcal{C}$-measurable and is defined in Condition 4. The distribution of $\Omega^{1/2}W$ is given in Corollary 1. In particular, $\Omega = \operatorname{diag}(\Omega_y, \Omega_{\nu}(1))$. When the criterion is maximum likelihood, $\Omega_y$ and $\Omega_{\nu}(1)$ are given in Condition 5. When the criterion is moment based, $\Omega_y = \frac{\partial h(\theta_0,\rho_0)'}{\partial\theta} W^{C} \Omega_f W^{C\prime} \frac{\partial h(\theta_0,\rho_0)}{\partial\theta}$ and $\Omega_{\nu}(1) = \frac{\partial k(\beta_0,\rho_0)'}{\partial\rho} W^{\tau} \Omega_g(1) W^{\tau\prime} \frac{\partial k(\beta_0,\rho_0)}{\partial\rho}$ with $\Omega_f$ and $\Omega_g(1)$ defined in Condition 6.

Proof. In Appendix B.

Corollary 2 Under the same conditions as in Theorem 2 it follows that

$$\sqrt{n}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\to} -A^{y,\theta}\Omega_y^{1/2}W_y(1) - \sqrt{\kappa}\, A^{y,\rho}\Omega_{\nu}^{1/2}(1)W_{\nu}(1) \quad (\mathcal{C}\text{-stably}) \tag{18}$$
where

$$A^{y,\theta} = A_{y,\theta}^{-1} + A_{y,\theta}^{-1}A_{y,\rho}\left(A_{\nu,\rho} - A_{\nu,\theta}A_{y,\theta}^{-1}A_{y,\rho}\right)^{-1}A_{\nu,\theta}A_{y,\theta}^{-1}$$
$$A^{y,\rho} = -A_{y,\theta}^{-1}A_{y,\rho}\left(A_{\nu,\rho} - A_{\nu,\theta}A_{y,\theta}^{-1}A_{y,\rho}\right)^{-1}.$$
For
$$\Omega_{\theta} = A^{y,\theta}\Omega_y A^{y,\theta\prime} + \kappa A^{y,\rho}\Omega_{\nu}(1)A^{y,\rho\prime}$$
it follows that
$$\sqrt{n}\,\Omega_{\theta}^{-1/2}\left(\hat{\theta} - \theta_0\right) \stackrel{d}{\to} N(0, I) \quad (\mathcal{C}\text{-stably}). \tag{19}$$
Note that $\Omega_{\theta}$, the asymptotic variance of $\sqrt{n}(\hat{\theta} - \theta_0)$ conditional on $\mathcal{C}$, in general is a random variable, and the asymptotic distribution of $\hat{\theta}$ is mixed normal. However, as in Andrews (2005), the result in (19) can be used to construct an asymptotically pivotal test. For a consistent estimator $\hat{\Omega}_{\theta}$, the statistic $\sqrt{n}\,\hat{\Omega}_{\theta}^{-1/2}\left(R\hat{\theta} - r\right)$ is asymptotically distribution free under the null hypothesis $R\theta - r = 0$, where $R$ is a conforming matrix of dimension $q \times k_{\theta}$ and $r$ a $q \times 1$ vector.
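The matrices $A^{y,\theta}$ and $A^{y,\rho}$ are the $\theta$-rows of $A^{-1}$ (up to the $\sqrt{\kappa}$ scaling), obtained from the partitioned inverse of $A$ in Condition 4. The sketch below verifies the closed forms numerically with hypothetical, randomly generated blocks:

```python
import numpy as np

rng = np.random.default_rng(2)
k_theta, k_rho, kappa = 3, 2, 1.7
inv = np.linalg.inv

# Hypothetical Jacobian blocks; the identity shifts keep them invertible.
A_yt = rng.standard_normal((k_theta, k_theta)) + 3 * np.eye(k_theta)
A_yr = rng.standard_normal((k_theta, k_rho))
A_nt = rng.standard_normal((k_rho, k_theta))
A_nr = rng.standard_normal((k_rho, k_rho)) + 3 * np.eye(k_rho)

# Full matrix A from Condition 4.
A = np.block([[A_yt, np.sqrt(kappa) * A_yr],
              [A_nt / np.sqrt(kappa), A_nr]])
A_inv = inv(A)

# Closed-form blocks from Corollary 2 (Schur-complement / Woodbury forms).
S = inv(A_nr - A_nt @ inv(A_yt) @ A_yr)
A_sup_yt = inv(A_yt) + inv(A_yt) @ A_yr @ S @ A_nt @ inv(A_yt)  # A^{y,theta}
A_sup_yr = -inv(A_yt) @ A_yr @ S                                # A^{y,rho}

# The theta-rows of A^{-1} are (A^{y,theta}, sqrt(kappa) * A^{y,rho}),
# which is exactly how the two Gaussian components enter (18).
ok_left = np.allclose(A_inv[:k_theta, :k_theta], A_sup_yt)
ok_right = np.allclose(A_inv[:k_theta, k_theta:], np.sqrt(kappa) * A_sup_yr)
```

The $\sqrt{\kappa}$ factors cancel inside the Schur complement, which is why $\kappa$ appears in (18) only through the scaling of the time series component.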

5 Unit Root Time Series Models

5.1 Unit Root Problems

When the simple trend stationary paradigm does not apply, the limiting distribution of our estimators may be more complicated. A general treatment is beyond the scope of this paper and likely requires a case by case analysis. In this subsection we consider a simple unit root model where initial conditions can be neglected. We use it to exemplify additional inferential difficulties that arise even in this relatively simple setting. In Section 5.2 we consider a slightly more complex version of the unit root model where initial conditions cannot be ignored. We show that more complicated dependencies between the asymptotic distributions of the cross-section and time series samples emerge. The result is a cautionary tale of the difficulties that may present themselves when nonstationary time series data are combined with cross-sections. We leave the development of inferential methods for this case to future work. We again consider the model in the previous section, except with the twist that (i) $\rho$ is the

AR(1) coefficient in the time series regression of zt on zt 1 with independent error; and (ii) ρ is at − unity. In the same way that led to (8), we obtain

$$\sqrt{n}\left(\hat{\theta}-\theta\right)\approx-A^{-1}\frac{1}{\sqrt{n}}\frac{\partial F_{n}(\theta,\rho)}{\partial\theta}-A^{-1}B\,\frac{\sqrt{n}}{\tau}\,\tau\left(\tilde{\rho}-\rho\right).$$
For simplicity, again assume that the two terms on the right are asymptotically independent. The first term converges in distribution to a normal distribution $N\left(0,A^{-1}\Omega_{y}A^{-1}\right)$, but with $\rho=1$ and i.i.d. AR(1) errors the second term converges to

$$-\xi A^{-1}B\,\frac{W(1)^{2}-1}{2\int_{0}^{1}W(r)^{2}\,dr},$$
where $\xi\equiv\lim\sqrt{n}/\tau$ and $W(\cdot)$ is the standard Wiener process. In contrast to the result in (9) when $\rho$ is away from unity, $\sqrt{n}/\tau$ rather than $n/\tau$ is assumed to converge to a constant. Because $\tilde{\rho}$ is superconsistent under the unit root scenario, from a theoretical point of view the correction term is relevant only in cases where $n$ is much larger than $\tau$, such that $\xi>0$ in the limit. The result is formalized in Section 5.2. The fact that the limiting distribution of $\hat{\theta}$ is no longer Gaussian complicates inference. This discontinuity is mathematically similar to Campbell and Yogo's (2006) observation, which leads

to a question of how uniform inference could be conducted. In principle, the problem here can be analyzed by modifying the proposal in Phillips (2014, Section 4.3). First, construct the $1-\alpha_{1}$ confidence interval for $\rho$ using Mikusheva (2007); call it $[\rho_{L},\rho_{U}]$. Second, compute $\hat{\theta}(\rho)\equiv\arg\max_{\theta}F_{n}(\theta,\rho)$ for $\rho\in[\rho_{L},\rho_{U}]$. Assuming that $\rho$ is fixed, characterize the asymptotic variance $\Sigma(\rho)$, say, of $\sqrt{n}\left(\hat{\theta}(\rho)-\theta(\rho)\right)$, which is asymptotically normal in general. Third, construct the $1-\alpha_{2}$ confidence region, say $CI(\alpha_{2};\rho)$, using asymptotic normality and $\Sigma(\rho)$. Our confidence interval for $\theta$ is then given by $\bigcup_{\rho\in[\rho_{L},\rho_{U}]}CI(\alpha_{2};\rho)$. By Bonferroni, its asymptotic coverage rate is expected to be at least $1-\alpha_{1}-\alpha_{2}$.
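The union-over-$\rho$ step of this Bonferroni construction can be sketched in a few lines. In the Python fragment below all helper names are hypothetical, and the first-stage interval $[\rho_{L},\rho_{U}]$ is taken as given on a grid rather than computed via Mikusheva's procedure:

```python
import numpy as np
from statistics import NormalDist

def bonferroni_ci(rho_grid, theta_of_rho, sigma_of_rho, n, alpha2=0.05):
    """Union of pointwise 1 - alpha2 normal intervals for a scalar theta.

    rho_grid     : grid approximating the first-stage interval [rho_L, rho_U]
    theta_of_rho : rho -> theta_hat(rho), the profiled estimator
    sigma_of_rho : rho -> Sigma(rho)^{1/2}, asymptotic std. dev. of
                   sqrt(n)(theta_hat(rho) - theta(rho))
    """
    z = NormalDist().inv_cdf(1 - alpha2 / 2)    # normal critical value
    lowers, uppers = [], []
    for rho in rho_grid:
        th, s = theta_of_rho(rho), sigma_of_rho(rho)
        lowers.append(th - z * s / np.sqrt(n))
        uppers.append(th + z * s / np.sqrt(n))
    return min(lowers), max(uppers)             # union of the intervals

# toy usage with a stylized profile theta_hat(rho) = 1 + 0.5 (rho - 1)
lo, hi = bonferroni_ci(np.linspace(0.9, 1.0, 11),
                       lambda rho: 1 + 0.5 * (rho - 1),
                       lambda rho: 1.0, n=400)
print(lo, hi)
```

By Bonferroni, combining a $1-\alpha_{1}$ first-stage interval for $\rho$ with the $1-\alpha_{2}$ pointwise intervals gives coverage of at least $1-\alpha_{1}-\alpha_{2}$, as described above.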

5.2 Unit Root Limit Theory

In this section we consider the special case where $\nu_{t}$ follows an autoregressive process of the form $\nu_{t+1}=\rho\nu_{t}+\eta_{t}$. As in Hansen (1992) and Phillips (1987, 1988, 2014), we allow for nearly integrated processes where $\rho=\exp(\gamma/\tau)$ is a scalar parameter localized to unity such that

$$\nu_{\tau,t+1}=\exp\left(\gamma/\tau\right)\nu_{\tau,t}+\eta_{t+1}\tag{20}$$

and the notation $\nu_{\tau,t}$ emphasizes that $\nu_{\tau,t}$ is a sequence of processes indexed by $\tau$. We assume that

$\tau_{0}=0$ is fixed and $\tau^{-1/2}\nu_{\tau,\min(1,\tau_{0})}=V(0)$ a.s., where $V(0)$ is a potentially nondegenerate random variable. In other words, the initial condition

for (20) is $\nu_{\tau,\min(1,\tau_{0})}=\tau^{1/2}V(0)$. We explicitly allow for the case where $V(0)=0$, to model a situation where the initial condition can be ignored. This assumption is similar to, although more parametric than, the specification considered in Kurtz and Protter (1991). We limit our analysis to the case of maximum likelihood criterion functions. Results for moment-based estimators can be developed along the same lines as in Section 4.2, but for ease of exposition we omit the details.

For the unit root version of our model we assume that $\nu_{t}$ is observed in the data and that the only parameter to be estimated from the time series data is $\rho$. Further assuming a Gaussian quasi-likelihood, we note that the score function now is

$$g_{\rho,t}\left(\beta,\rho\right)=\nu_{\tau,t-1}\left(\nu_{\tau,t}-\nu_{\tau,t-1}\rho\right).\tag{21}$$

The estimator $\hat{\rho}$ solving sample moment conditions based on (21) is the conventional OLS estimator given by
$$\hat{\rho}=\frac{\sum_{t=\tau_{0}+1}^{\tau}\nu_{\tau,t-1}\nu_{\tau,t}}{\sum_{t=\tau_{0}+1}^{\tau}\nu_{\tau,t-1}^{2}}.$$
We continue to use the definition for $f_{\theta,it}(\theta,\rho)$ in Section 4.2 but now consider the simplified case where $\theta_{0}=(\beta,V(0))$. We note that in this section, $V(0)$ rather than $\nu_{\tau,\min(1,\tau_{0})}$ is the common shock used in the cross-sectional model. The implicit scaling of $\nu_{\tau,\min(1,\tau_{0})}$ by $\tau^{-1/2}$ is necessary in the cross-sectional specification to maintain a well defined model even as $\tau\rightarrow\infty$.

Consider the joint process $\left(V_{\tau n}(r),s_{ML}^{y}\right)$, where $V_{\tau n}(r)\equiv\tau^{-1/2}\nu_{\tau,[\tau r]}$ and
$$s_{ML}^{y}\equiv s_{ML}\left(\theta_{0},\rho_{0}\right)\equiv\sum_{t=1}^{T}\sum_{i=1}^{n}\frac{f_{\theta,it}}{\sqrt{n}}.$$
Note that
$$\int_{0}^{r}V_{\tau n}(u)\,dW_{\tau n}(u)=\tau^{-1}\sum_{t=\tau_{0}+1}^{\tau_{0}+[\tau r]}\nu_{\tau,t-1}\eta_{t}$$
with $W_{\tau n}(r)\equiv\tau^{-1/2}\sum_{t=\tau_{0}+1}^{\tau_{0}+[\tau r]}\eta_{t}$. We define the limiting process for $V_{\tau n}(r)$ as
$$V_{\gamma,V(0)}(r)=e^{\gamma r}V(0)+\sigma\int_{0}^{r}e^{\gamma(r-s)}\,dW_{\nu}(s)\tag{22}$$
where $W_{\nu}$ is defined in Theorem 1. When $V(0)=0$, Theorem 1 directly implies that $e^{-\gamma[r\tau]/\tau}V_{\tau n}(r)\Rightarrow\sigma\int_{0}^{r}e^{-s\gamma}\,dW_{\nu}(s)$ $\mathcal{C}$-stably, noting that in this case $\Omega_{\nu}(s)=\sigma^{2}\left(1-\exp(-2s\gamma)\right)/2\gamma$ and $\dot{\Omega}_{\nu}(s)^{1/2}=\sigma e^{-s\gamma}$. The familiar result (cf. Phillips 1987) that $V_{\tau n}(r)\Rightarrow\sigma\int_{0}^{r}e^{\gamma(r-s)}\,dW_{\nu}(s)$ then is a consequence of the continuous mapping theorem. The case in (22) where $V(0)$ is a $\mathcal{C}$-measurable random variable now follows from $\mathcal{C}$-stable convergence of $V_{\tau n}(r)$. In this section we establish joint $\mathcal{C}$-stable convergence of the triple $\left(V_{\tau n}(r),s_{ML}^{y},\int_{0}^{r}V_{\tau n}(u)\,dW_{\tau n}(u)\right)$.

Let $\phi=(\theta',\rho)'\in\mathbb{R}^{k_{\phi}}$, $\theta\in\mathbb{R}^{k_{\theta}}$, and $\rho\in\mathbb{R}$. The true parameters are denoted by $\theta_{0}$ and $\rho_{\tau 0}=\exp\left(\gamma_{0}/\tau\right)$ with $\gamma_{0}\in\mathbb{R}$ and both $\theta_{0}$ and $\gamma_{0}$ bounded. We impose the following modified assumptions to account for the specific features of the unit root model.
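The nearly integrated process (20) and the OLS estimator $\hat{\rho}$ just defined are easy to simulate. The sketch below is a minimal illustration with names of our own choosing, assuming i.i.d. standard normal innovations and a zero initial condition:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_near_unit_root(tau, gamma, v0=0.0, sigma=1.0):
    """Simulate nu_{tau,t+1} = exp(gamma/tau) nu_{tau,t} + eta_{t+1}
    with initial condition nu_0 = tau**0.5 * v0, mimicking (20)."""
    rho = np.exp(gamma / tau)
    nu = np.empty(tau + 1)
    nu[0] = np.sqrt(tau) * v0
    eta = sigma * rng.standard_normal(tau)
    for t in range(tau):
        nu[t + 1] = rho * nu[t] + eta[t]
    return nu

def ols_rho(nu):
    """OLS autoregression coefficient based on the score (21)."""
    lag, cur = nu[:-1], nu[1:]
    return (lag @ cur) / (lag @ lag)

nu = simulate_near_unit_root(tau=1000, gamma=-2.0)
print(ols_rho(nu))   # typically close to exp(-2/1000), with O(1/tau) deviations
```

Because $\hat{\rho}-\rho=O_{p}(1/\tau)$ in this design, repeated draws show the superconsistency discussed above.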

Condition 7 Define $\mathcal{C}=\sigma(V(0))$. Define the $\sigma$-fields $\mathcal{G}_{n,(t-\min(1,\tau_{0}))n+i}$ in the same way as in (10), except that here $\tau=\kappa n$ such that dependence on $\tau$ is suppressed, and that $\nu_{t}$ is replaced with $\eta_{t}$, as in

$$\mathcal{G}_{n,(t-\min(1,\tau_{0}))n+i}=\sigma\left(\left(y_{jt-1},y_{jt-2},\ldots,y_{j\min(1,\tau_{0})}\right)_{j=1}^{n},\left(\eta_{t},\eta_{t-1},\ldots,\eta_{\min(1,\tau_{0})}\right),\left(y_{j,t}\right)_{j=1}^{i}\right)\vee\mathcal{C}.$$
Assume that

i) $f_{\theta,it}$ is measurable with respect to $\mathcal{G}_{n,(t-\min(1,\tau_{0}))n+i}$.

ii) $\eta_{t}$ is measurable with respect to $\mathcal{G}_{n,(t-\min(1,\tau_{0}))n+i}$ for all $i=1,\ldots,n$.

iii) for some $\delta>0$ and $C<\infty$, $\sup_{it}E\left[\left\|f_{\theta,it}\right\|^{2+\delta}\right]\leq C$.

iv) for some $\delta>0$ and $C<\infty$, $\sup_{t}E\left[\left\|\eta_{t}\right\|^{2+\delta}\right]\leq C$.

v) $E\left[f_{\theta,it}\mid\mathcal{G}_{n,(t-\min(1,\tau_{0}))n+i-1}\right]=0$.

vi) $E\left[\eta_{t}\mid\mathcal{G}_{n,(t-\min(1,\tau_{0})-1)n+i}\right]=0$ for $t>T$ and all $i\in\{1,\ldots,n\}$.

vii) For any $1>r>s\geq 0$ fixed let $\Omega_{\tau,\eta}^{r,s}=\tau^{-1}\sum_{t=\min(1,\tau_{0})+[\tau s]+1}^{\tau_{0}+[\tau r]}E\left[\eta_{t}^{2}\mid\mathcal{G}_{n,(t-\min(1,\tau_{0}))n}\right]$. Then, $\Omega_{\tau,\eta}^{r,s}\rightarrow_{p}(r-s)\sigma^{2}$.

viii) Assume that $\frac{1}{n}\sum_{i=1}^{n}f_{\theta,it}f_{\theta,it}'\rightarrow_{p}\Omega_{ty}$ where $\Omega_{ty}$ is positive definite a.s. and measurable with respect to $\mathcal{C}$. Let $\Omega_{y}=\sum_{t=1}^{T}\Omega_{ty}$.

Conditions 7(i)-(vi) are the same as Conditions 1(i)-(vi) adapted to the unit root model. Condition 7(vii) replaces Condition 2. It is slightly more primitive in the sense that if $\eta_{t}^{2}$ is homoskedastic, Condition 7(vii) holds automatically and convergence of $\tau^{-1}\sum_{t=\min(1,\tau_{0})+[\tau s]+1}^{\tau_{0}+[\tau r]}\eta_{t}^{2}\rightarrow_{p}(r-s)\sigma^{2}$ follows from an argument given in the proofs rather than being assumed. On the other hand, Condition 7(vii) is somewhat more restrictive than Condition 2 in the sense that it limits heteroskedasticity to be of a form that does not affect the limiting distribution. In other words, we essentially assume $\tau^{-1}\sum_{t=\min(1,\tau_{0})+[\tau s]+1}^{\tau_{0}+[\tau r]}\eta_{t}^{2}$ to be proportional to $r-s$ asymptotically. This assumption is stronger than needed but helps to compare the results with the existing unit root literature.

For Condition 7(viii) we note that typically $\Omega_{ty}(\phi)=E\left[f_{\theta,it}f_{\theta,it}'\right]$ and $\Omega_{ty}=\Omega_{ty}(\phi_{0})$, where

$\phi_{0}=\left(\beta_{0}',V_{0}(0),\rho_{\tau 0}\right)$. Thus, even if $\Omega_{ty}(\cdot)$ is non-stochastic, it follows that $\Omega_{ty}$ is random and measurable with respect to $\mathcal{C}$ because it depends on $V(0)$, which is a random variable measurable w.r.t. $\mathcal{C}$.

The following results are established by modifying arguments in Phillips (1987) and Chan and Wei (1987) to account for $\mathcal{C}$-stable convergence and by applying Theorem 1.

Theorem 3 Assume that Conditions 7 hold. With $\tau_{0}=0$ fixed and as $\tau,n\rightarrow\infty$ with $T$ fixed and $\tau=\kappa n$ for some $\kappa\in(0,\infty)$, it follows that
$$\left(V_{\tau n}(r),s_{ML}^{y},\int_{0}^{s}V_{\tau n}(u)\,dW_{\tau n}(u)\right)\Rightarrow\left(V_{\gamma,V(0)}(r),\Omega_{y}^{1/2}W_{y}(1),\sigma\int_{0}^{s}V_{\gamma,V(0)}(u)\,dW_{\nu}(u)\right)\quad(\mathcal{C}\text{-stably})$$
in the Skorohod topology on $D_{\mathbb{R}^{d}}[0,1]$.

Proof. In Appendix B.

We now employ Theorem 3 to analyze the limiting behavior of $\hat{\theta}$ when the common factors are generated from a linear unit root process. To derive a limiting distribution for $\hat{\phi}$ we impose the following additional assumptions.

Condition 8 Let $\hat{\theta}=\arg\max_{\theta}\sum_{t=1}^{T}\sum_{i=1}^{n}f_{it}\left(y|\theta,\hat{\rho}\right)$. Assume that $\hat{\theta}-\theta_{0}=O_{p}\left(n^{-1/2}\right)$.

Condition 9 Let $\kappa=\lim n/\tau^{2}$. Let
$$A_{y,\theta}(\phi)=\sum_{t=1}^{T}E\left[\partial f_{\theta,it}/\partial\theta'\right],\qquad A_{y,\rho}(\phi)=\sum_{t=1}^{T}E\left[\partial f_{\theta,it}/\partial\rho\right]$$
and define $A^{y}(\phi)=\left[A_{y,\theta}(\phi),\sqrt{\kappa}A_{y,\rho}(\phi)\right]$, where $A^{y}(\phi)$ is a $k_{\theta}\times k_{\phi}$ dimensional matrix of nonrandom functions $\phi\rightarrow\mathbb{R}$. Assume that $A_{y,\theta}(\phi_{0})$ is full rank almost surely. Assume that for some $\varepsilon>0$,
$$\sup_{\phi:\left\|\phi-\phi_{0}\right\|\leq\varepsilon}\left\|D_{n\tau}\frac{\partial\tilde{s}^{y}(\phi)}{\partial\phi'}-A^{y}(\phi)\right\|=o_{p}(1).$$

We make the possibly simplifying assumption that $A^{y}(\phi_{0})$ depends on the factors only through the parameter $\theta$.

Theorem 4 Assume that Conditions 7, 8 and 9 hold. It follows that
$$\sqrt{n}\left(\hat{\theta}-\theta_{0}\right)\rightarrow_{d}A_{y,\theta}^{-1}\Omega_{y}^{1/2}W_{y}(1)-\sqrt{\kappa}A_{y,\theta}^{-1}A_{y,\rho}\left(\int_{0}^{1}V_{\gamma,V(0)}(r)^{2}\,dr\right)^{-1}\int_{0}^{1}\sigma V_{\gamma,V(0)}(r)\,dW_{\nu}(r)\quad(\mathcal{C}\text{-stably}).$$
Proof. In Appendix B.

Note that the term $\left(\int_{0}^{1}V_{\gamma,V(0)}(r)^{2}\,dr\right)^{-1}$ corresponds to $A_{\nu,\rho}^{-1}$ in the stationary case when the time series model does not depend on cross-sectional parameters. The result in Theorem 4 is an example that shows how common factors affecting both time series and cross-section data can lead to non-standard limiting distributions. In this case, the initial condition of the unit root process in the time series dimension causes dependence between the components of the asymptotic distribution of $\hat{\theta}$ because both $\Omega_{y}$ and $V_{\gamma,V(0)}$ in general depend on $V(0)$. Thus, the situation encountered here is generally more difficult than the one considered in Campbell and Yogo (2006) and Phillips (2014). In addition, because the limiting distribution of $\hat{\theta}$ is not mixed asymptotically normal, simple pivotal test statistics as in Andrews (2005) are not readily available, contrary to the stationary case.
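The non-normality of the correction term can be visualized by simulating the classical Dickey-Fuller-type functional $\tau(\hat{\rho}-1)$, whose limit is the ratio of stochastic integrals appearing in Section 5.1. The following sketch is our own illustration, assuming a driftless random walk with i.i.d. standard normal errors:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_root_tstat(tau):
    """One draw of tau*(rho_hat - 1) for z_t = z_{t-1} + u_t, u_t ~ N(0,1),
    where rho_hat is the OLS autoregression coefficient."""
    u = rng.standard_normal(tau)
    z = np.cumsum(u)
    zlag, znow = z[:-1], z[1:]
    rho_hat = (zlag @ znow) / (zlag @ zlag)
    return tau * (rho_hat - 1.0)

draws = np.array([unit_root_tstat(500) for _ in range(2000)])
# The limit (W(1)^2 - 1) / (2 * int_0^1 W(r)^2 dr) is skewed to the left,
# so the simulated draws should have a clearly negative mean.
print(draws.mean())
```

A histogram of `draws` reproduces the familiar skewed unit-root coefficient distribution, which is why normal-based pivotal statistics are unavailable here.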

6 Summary

We develop a new limit theory for combined cross-sectional and time-series data sets. We focus on situations where the two data sets are interdependent because of common factors that affect both. The concept of stable convergence is used to handle this dependence when proving a joint Central Limit Theorem. Our analysis is cast in a generic framework of cross-section and time-series based criterion functions that jointly, but not individually, identify the parameters. Within this framework, we show how our limit theory can be used to derive asymptotic approximations to the distributions of estimators that are based on data from both samples. We explicitly consider the unit root case as an example where limiting expressions arise that are particularly difficult to handle. Our results are expected to be helpful for the econometric analysis of rational expectation models involving individual decision making as well as general equilibrium settings. We investigate these topics, and related implementation issues, in a companion paper.

References

[1] Andrews, D.W.K. (1988): “Laws of Large Numbers for Dependent Non-Identically Distributed Random Variables,” Econometric Theory, Vol. 4, pp. 458-467.

[2] Andrews, D.W.K. (2005): “Cross-Section Regression with Common Shocks,” Econometrica 73, pp. 1551-1585.

[3] Aldous, D.J. and G.K. Eagleson (1978): “On mixing and stability of limit theorems,” The Annals of Probability 6, 325–331.

[4] Atchade, Y.F. (2009): “A strong law of large numbers for martingale arrays,” manuscript.

[5] Barndorff-Nielsen, O.E., P.R. Hansen, A. Lunde and N. Shephard (2008): “Designing realized kernels to measure the ex post variation of equity prices in the presence of noise,” Econometrica 76, pp. 1481-1536.

[6] Billingsley, P. (1968): “Convergence of Probability Measures,” John Wiley and Sons, New York.

[7] Brown, B.M. (1971): “Martingale Central Limit Theorems,” Annals of Mathematical Statistics 42, pp. 59-66.

[8] Campbell, J.Y., and M. Yogo (2006): “Efficient Tests of Stock Return Predictability,” Journal of Financial Economics 81, pp. 27–60.

[9] Chan, N.H. and C.Z. Wei (1987): “Asymptotic Inference for Nearly Nonstationary AR(1) Processes,” Annals of Statistics 15, pp.1050-1063.

[10] de Jong, R.M. (1996): “A strong law of large numbers for triangular mixingale arrays,” Statistics & Probability Letters, Vol. 27, pp. 1-9.

[11] Dedecker, J. and F. Merlevede (2002): “Necessary and Sufficient Conditions for the Conditional Central Limit Theorem,” Annals of Probability 30, pp. 1044-1081.

[12] Durrett, R. (1996): “,” CRC Press, Boca Raton, London, New York.

[13] Eagleson, G.K. (1975): “Martingale convergence to mixtures of infinitely divisible laws,” The Annals of Probability 3, 557–562.

[14] Feigin, P. D. (1985): “Stable convergence of Semimartingales,” Stochastic Processes and their Applications 19, pp. 125-134.

[15] Gordin, M.I. (1969): “The central limit theorem for stationary processes,” Dokl. Akad. Nauk SSSR 188, 739–741.

[16] Gordin, M.I. (1973): Abstracts of Communication, T.1: A-K, International Conference on Probability Theory, Vilnius.

[17] Hahn, J., and G. Kuersteiner (2002): “Asymptotically Unbiased Inference for a Dynamic Panel Model with Fixed Effects When Both n and T are Large,” Econometrica 70, pp. 1639– 57.

[18] Hahn, J., G. Kuersteiner and M. Mazzocco (2016): “Estimation with Aggregate Shocks,” arXiv:1507.04415.

[19] Hahn, J., G. Kuersteiner and M. Mazzocco (2019a): “Estimation with Aggregate Shocks,” forthcoming, The Review of Economic Studies.

[20] Hahn, J., G. Kuersteiner and M. Mazzocco (2019b): “Joint Time Series and Cross-Section Limit Theory under Mixingale Assumptions,” arXiv: 1903.04655.

[21] Hall, P., and C. Heyde (1980): “Martingale Limit Theory and its Applications,” Academic Press, New York.

[22] Hansen, B.E. (1992): “Convergence to Stochastic Integrals for Dependent Heterogeneous Processes,” Econometric Theory 8, pp. 489-500.

[23] Heckman, J.J., L. Lochner, and C. Taber (1998): “Explaining Rising Wage Inequality: Explorations with a Dynamic General Equilibrium Model of Labor Earnings with Heterogeneous Agents,” Review of Economic Dynamics 1, pp. 1-58.

[24] Heckman, J.J., and G. Sedlacek (1985): “Heterogeneity, Aggregation, and Market Wage Functions: An Empirical Model of Self-Selection in the Labor Market,” Journal of Political Economy, 93, pp. 1077-1125.

[25] Hill, J.B. (2010): “A New Moment Bound and Weak Laws for Mixingale Arrays without Memory or Heterogeneity Restrictions, with Applications to Tail Trimmed Arrays,” manuscript.

[26] Jacod, J., M. Podolskij and M. Vetter (2010): “Limit Theorems for Moving Averages of Discretized Processes Plus Noise,” The Annals of Statistics, 38, pp.1478-1545.

[27] Jacod, J. and A.N. Shiryaev (2002): “Limit Theorems for Stochastic Processes. Second Edition,” Springer Verlag, Berlin.

[28] Jacod, J. and P. Protter (2012): “Discretization of Processes,” Springer Verlag, Berlin.

[29] Kallenberg, O. (1997): “Foundations of Modern Probability,” Springer Verlag, Berlin.

[30] Kanaya, S. (2017) “Convergence Rates of Sums of α-Mixing Triangular Arrays: With an Application to Nonparametric Drift Function Estimation of Continuous-Time Processes,” Econometric Theory, Vol.33, 1121-1153.

[31] Kuersteiner, G.M., and I.R. Prucha (2013): “Limit Theory for Panel Data Models with Cross Sectional Dependence and Sequential Exogeneity,” Journal of Econometrics 174, pp. 107-126.

[32] Kuersteiner, G.M., and I.R. Prucha (2015): “Dynamic Spatial Panel Models: Networks, Common Shocks, and Sequential Exogeneity,” CESifo Working Paper No. 5445.

[33] Kurtz, T.G., and P. Protter (1991): “Weak Limit Theorems for Stochastic Integrals and Stochastic Differential Equations,” Annals of Probability 19, pp. 1035-1070.

[34] Lee, D., and K.I. Wolpin (2006): “Intersectoral Labor Mobility and the Growth of the Service Sector,” Econometrica 74, pp. 1-46.

[35] Lee, D., and K.I. Wolpin (2010): “Accounting for Wage and Employment Changes in the U.S. from 1968-2000: A Dynamic Model of Labor Market Equilibrium,” Journal of Econometrics 156, pp. 68–85.

[36] Li, J. and D. Xiu (2016): “Generalized Method of Integrated Moments for High-Frequency Data,” Econometrica 84, 1613-1633.

[37] McLeish, D.L. (1975): “A maximal inequality and dependent strong laws,” Annals of Probability 3, 829–839.

[38] Mikusheva, A. (2007): “Uniform Inference in Autoregressive Models,” Econometrica 75, pp. 1411–1452.

[39] Murphy, K. M. and R. H. Topel (1985): “Estimation and Inference in Two-Step Econometric Models,” Journal of Business and Economic Statistics 3, pp. 370 – 379.

[40] Phillips, P.C.B. (1987): “Towards a unified asymptotic theory for autoregression,” Biometrika 74, pp. 535-547.

[41] Phillips, P.C.B. (1988): “Regression theory for near integrated time series,” Econometrica 56, pp. 1021-1044.

[42] Phillips, P.C.B. and S.N. Durlauf (1986): “Multiple Time Series Regression with Integrated Processes,” Review of Economic Studies 53, pp. 473-495.

[43] Phillips, P.C.B. and H.R. Moon (1999): “Linear Regression Limit Theory for Nonstationary Panel Data,” Econometrica 67, pp. 1057-1111.

[44] Phillips, P.C.B. (2014): “On Confidence Intervals for Autoregressive Roots and Predictive Regression,” Econometrica 82, pp. 1177–1195.

[45] Renyi, A. (1963): “On stable sequences of events,” Sankhya Ser. A, 25, pp. 293-302.

[46] Ridder, G., and R. Moffitt (2007): “The Econometrics of Data Combination,” in Heckman, J.J. and E.E. Leamer, eds. Handbook of Econometrics, Vol 6, Part B. Elsevier B.V.

[47] Rootzen, H. (1983): “Central limit theory for martingales via random change of time,” Essays in honour of Carl Gustav Essen, Eds. Holst & Gut, Uppsala University, 154-190.

[48] Rosenzweig, M.R. and C. Udry (2019): “External Validity in a Stochastic World: Evidence from Low-Income Countries,” forthcoming Review of Economic Studies.

[49] Runkle, D.E. (1991): “Liquidity Constraints and the Permanent-Income Hypothesis: Evidence from Panel Data,” Journal of Monetary Economics 27, pp. 73–98.

[50] Shea, J. (1995): “Union Contracts and the Life-Cycle/Permanent-Income Hypothesis,” American Economic Review 85, pp. 186–200.

[51] Stroock, D.W. and S.R.S. Varadhan (1979): “Multidimensional diffusion processes,” Springer Verlag.

[52] Tobin, J. (1950): “A Statistical Demand Function for Food in the U.S.A.,” Journal of the Royal Statistical Society, Series A. Vol. 113, pp.113-149.

[53] Wooldridge, J.M. and H. White (1988): “Some invariance principles and central limit theorems for dependent heterogeneous processes,” Econometric Theory 4, pp. 210-230.

Appendix

A Details for Examples

A.1 Calculations for Example 1

First note that it follows that

$$E\left[z_{s}u_{s+1}\mid\mathcal{G}_{\tau n,(s-\min(1,\tau_{0}))n+i}\right]=z_{s}E\left[u_{s+1}\mid\mathcal{C}\right],$$
where the equality uses the fact that $z_{s}$ is measurable with respect to $\mathcal{G}_{\tau n,(s-\min(1,\tau_{0}))n+i}$ and that $u_{s+1}$ is independent of $\sigma\left(\left\{z_{t},z_{t-1},\ldots,z_{\tau_{0}}\right\}\right)$. To evaluate $E\left[u_{s+1}\mid\mathcal{C}\right]$ consider the joint distribution of $u_{s+1}$ for $s<0$ and $\nu_{1}=z_{1}$, which is Gaussian,

$$N\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}1&\rho^{|s|}\\\rho^{|s|}&\frac{1}{1-\rho^{2}}\end{pmatrix}\right).$$
This implies that $E\left[u_{s+1}\mid\mathcal{C}\right]=\rho^{|s|}\left(1-\rho^{2}\right)z_{1}$. Evaluating the $L_{2}$ norm of Condition 1(vii) leads to

$$\begin{aligned}
\left\|E\left[\psi_{\tau,s}^{\nu}\mid\mathcal{G}_{\tau n,(s-\min(1,\tau_{0})-1)n+i}\right]\right\|_{2}^{2}
&\leq|\rho|^{|s|}\left(1-\rho^{2}\right)^{2}E\left[\left(z_{s}z_{1}-E\left[z_{s}z_{1}\right]\right)^{2}+\left(E\left[z_{s}z_{1}\right]\right)^{2}\right]\\
&=O\left(|\rho|^{|s|}\right)=o\left(|s|^{-(1+\delta)}\right),
\end{aligned}$$
where the equality follows from evaluating the Gaussian moments of $z_{s}z_{1}$, such that
$$\left\|E\left[\psi_{\tau,s}^{\nu}\mid\mathcal{G}_{\tau n,(s-\min(1,\tau_{0})-1)n+i}\right]\right\|_{2}=O\left(|\rho|^{|s|/2}\right)=o\left(|s|^{-(1+\delta)/2}\right).$$
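The conditional-mean formula used above, $E[u_{s+1}\mid\mathcal{C}]=\rho^{|s|}(1-\rho^{2})z_{1}$, can be checked by a quick Monte Carlo regression. The code below is our own illustration, assuming i.i.d. standard normal innovations and a truncated MA($\infty$) representation of the stationary level $z_{1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, s, n = 0.8, -3, 50_000

# z_1 = sum_{j>=0} rho^j u_{1-j} (truncated at J terms); the innovation
# u_{s+1} with s = -3 is u_{-2}, i.e. the term at lag j = -s = 3.
J = 100
u = rng.standard_normal((n, J + 1))        # column j holds u_{1-j}
z1 = u @ (rho ** np.arange(J + 1))         # truncated MA(inf) sum
u_target = u[:, -s]                        # u_{s+1}

b_hat = (u_target @ z1) / (z1 @ z1)        # OLS slope of u_{s+1} on z_1
print(b_hat)   # should be close to rho**|s| * (1 - rho**2) = 0.18432
```

Since $\mathrm{Cov}(u_{s+1},z_{1})=\rho^{|s|}$ and $\mathrm{Var}(z_{1})=1/(1-\rho^{2})$, the population slope is exactly $\rho^{|s|}(1-\rho^{2})$, matching the displayed conditional expectation.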

B Proofs for Section 4

B.1 Proof of Theorem 1

To prove the functional central limit theorem we follow Billingsley (1968) and Dedecker and Merlevede (2002). Before stating the details of the proof we lay out the general proof strategy. JS (p. 512, Definition 5.28) define stable convergence for sequences $Z_{n}$ defined on a Polish space. By Billingsley (1968), Theorem 15.5, the uniform metric can be used to establish tightness for certain processes that are continuous in the limit. We adopt their definition to our setting, noting that by JS (p. 328, Theorem 1.14), $D_{\mathbb{R}^{k_{\theta}}\times\mathbb{R}^{k_{\rho}}}[0,1]$ equipped with the Skorohod topology is a Polish space. Following Billingsley (1968, p. 120), let $r_{1}<\cdots<r_{k}$ be fixed and define

$$\Delta X_{n\tau}(r_{i})=X_{n\tau}(r_{i})-X_{n\tau}(r_{i-1}).\tag{23}$$

Since there is a one-to-one mapping between $X_{n\tau}(r_{1}),\ldots,X_{n\tau}(r_{k})$ and $X_{n\tau}(r_{1}),\Delta X_{n\tau}(r_{2}),\ldots,\Delta X_{n\tau}(r_{k})$, we establish joint convergence of the latter. The proof proceeds by checking that the conditions of Theorem 1 in Kuersteiner and Prucha (2013) hold. Let $k_{n}=\max(T,\tau)n$ where both $n\rightarrow\infty$ and $\tau\rightarrow\infty$ such that clearly $k_{n}\rightarrow\infty$ (this is a diagonal limit in the terminology of Phillips and Moon, 1999). Let $d=k_{\phi}=k_{\theta}+k_{\rho}$. To handle the fact that $X_{n\tau}\in\mathbb{R}^{d}$ we use Lemmas A.1-A.3 in Phillips and Durlauf (1986). Define $\lambda_{j}=\left(\lambda_{j,y}',\lambda_{j,\nu}'\right)'$ and let $\lambda=\left(\lambda_{1},\ldots,\lambda_{k}\right)\in\mathbb{R}^{dk}$ with $\|\lambda\|=1$. Define $t^{*}=t-\min(1,\tau_{0})$. For each $n$ and $\tau$ define the mapping $q(i,t):\mathbb{N}_{0}^{2}\rightarrow\mathbb{N}_{+}$ as $q(i,t):=t^{*}n+i$ and note that $q(i,t)$ is invertible; in particular, for each $q\in\{1,\ldots,k_{n}\}$ there is a unique pair $(t,i)$ such that $q(i,t)=q$. We often use the shorthand notation $q$ for $q(i,t)$. Let

$$\ddot{\psi}_{q(i,t)}\equiv\sum_{j=1}^{k}\lambda_{j}'\left(\Delta\tilde{\psi}_{it}(r_{j})-E\left[\Delta\tilde{\psi}_{it}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)\tag{24}$$
where

$$\Delta\tilde{\psi}_{it}(r_{j})=\tilde{\psi}_{it}(r_{j})-\tilde{\psi}_{it}(r_{j-1});\qquad\Delta\tilde{\psi}_{it}(r_{1})=\tilde{\psi}_{it}(r_{1}).\tag{25}$$
Note that $\Delta\tilde{\psi}_{it}(r_{j})=\left(\Delta\tilde{\psi}_{it}^{y}(r_{j}),\Delta\tilde{\psi}_{t}^{\nu}(r_{j})\right)'$ with
$$\Delta\tilde{\psi}_{it}^{y}(r_{j})=\begin{cases}\tilde{\psi}_{it}^{y}&\text{for }j=1\\0&\text{otherwise}\end{cases}\tag{26}$$
and
$$\Delta\tilde{\psi}_{it}^{\nu}(r_{j})=\begin{cases}\tilde{\psi}_{\tau,t}^{\nu}(r_{j})&\text{if }[\tau r_{j-1}]<t-\tau_{0}\leq[\tau r_{j}]\text{ and }i=1\\0&\text{otherwise.}\end{cases}\tag{27}$$
Using this notation and noting that $\sum_{t=\min(1,\tau_{0}+1)}^{\max(T,\tau_{0}+\tau)}\sum_{i=1}^{n}\ddot{\psi}_{q(i,t)}=\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}$, we write
$$\lambda_{1}'X_{n\tau}(r_{1})+\sum_{j=2}^{k}\lambda_{j}'\Delta X_{n\tau}(r_{j})=\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}+\sum_{t=\min(1,\tau_{0}+1)}^{\max(T,\tau_{0}+\tau)}\sum_{i=1}^{n}\sum_{j=1}^{k}\lambda_{j}'E\left[\Delta\tilde{\psi}_{it}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right].\tag{28}$$
First analyze the term $\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}$. Note that $\psi_{n,it}^{y}$ is measurable with respect to $\mathcal{G}_{\tau n,t^{*}n+i}$ by construction. Note that by (24), (26) and (27) the individual components of $\ddot{\psi}_{q}$ are either $0$ or equal to $\tilde{\psi}_{it}(r_{j})-E\left[\tilde{\psi}_{it}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]$, respectively. This implies that $\ddot{\psi}_{q}$ is measurable with respect to $\mathcal{G}_{\tau n,q}$, noting in particular that $E\left[\tilde{\psi}_{it}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]$ is measurable w.r.t. $\mathcal{G}_{\tau n,t^{*}n+i-1}$ by the properties of conditional expectations and $\mathcal{G}_{\tau n,t^{*}n+i-1}\subset\mathcal{G}_{\tau n,q}$. By construction, $E\left[\ddot{\psi}_{q}\mid\mathcal{G}_{\tau n,q-1}\right]=0$. This establishes that for $S_{nq}=\sum_{s=1}^{q}\ddot{\psi}_{s}$,
$$\left\{S_{nq},\mathcal{G}_{\tau n,q},1\leq q\leq k_{n},n\geq 1\right\}$$
is a mean zero martingale array with differences $\ddot{\psi}_{q}$.

To establish finite dimensional convergence we follow Kuersteiner and Prucha (2013) in the proof of their Theorem 2. Note that, for any fixed $n$ and given $q$, and thus for a corresponding unique vector $(t,i)$, there exists a unique $j\in\{1,\ldots,k\}$ such that $\tau_{0}+[\tau r_{j-1}]<t\leq\tau_{0}+[\tau r_{j}]$. Then,

$$\begin{aligned}
\ddot{\psi}_{q(i,t)}&=\sum_{l=1}^{k}\lambda_{l}'\left(\Delta\tilde{\psi}_{it}(r_{l})-E\left[\Delta\tilde{\psi}_{it}(r_{l})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)\\
&=\lambda_{1,y}'\left(\tilde{\psi}_{it}^{y}-E\left[\tilde{\psi}_{it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)1\{j=1\}\\
&\quad+\lambda_{j,\nu}'\left(\tilde{\psi}_{\tau,t}^{\nu}(r_{j})-E\left[\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)1\left\{[\tau r_{j-1}]<t-\tau_{0}\leq[\tau r_{j}]\right\}1\{i=1\},
\end{aligned}$$
where all remaining terms in the sum are zero because of (24), (26) and (27). For the subsequent inequalities, fix $q\in\{1,\ldots,k_{n}\}$ (and the corresponding $(t,i)$ and $j$) arbitrarily. Introduce the shorthand notation $1_{j}=1\{j=1\}$ and $1_{ij}=1\left\{[\tau r_{j-1}]<t-\tau_{0}\leq[\tau r_{j}]\right\}1\{i=1\}$. First, note that for $\delta\geq 0$, and by Jensen's inequality applied to the empirical measure $\frac{1}{4}\sum_{i=1}^{4}x_{i}$, we have that

$$\begin{aligned}
\left\|\ddot{\psi}_{q}\right\|^{2+\delta}
&=4^{2+\delta}\left\|\tfrac{1}{4}\lambda_{1,y}'\left(\tilde{\psi}_{it}^{y}-E\left[\tilde{\psi}_{it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)1_{j}+\tfrac{1}{4}\lambda_{j,\nu}'\left(\tilde{\psi}_{\tau,t}^{\nu}(r_{j})-E\left[\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)1_{ij}\right\|^{2+\delta}\\
&\leq 4^{2+\delta}\left(\tfrac{1}{4}\left\|\lambda_{1,y}\right\|^{2+\delta}\left\|\tilde{\psi}_{it}^{y}\right\|^{2+\delta}+\tfrac{1}{4}\left\|\lambda_{1,y}\right\|^{2+\delta}\left\|E\left[\tilde{\psi}_{it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}\right)1_{j}\\
&\quad+4^{2+\delta}\left(\tfrac{1}{4}\left\|\lambda_{j,\nu}\right\|^{2+\delta}\left\|\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\right\|^{2+\delta}+\tfrac{1}{4}\left\|\lambda_{j,\nu}\right\|^{2+\delta}\left\|E\left[\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}\right)1_{ij}\\
&=2^{2+2\delta}\left\|\lambda_{1,y}\right\|^{2+\delta}\left(\left\|\tilde{\psi}_{it}^{y}\right\|^{2+\delta}+\left\|E\left[\tilde{\psi}_{it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}\right)1_{j}\\
&\quad+2^{2+2\delta}\left\|\lambda_{j,\nu}\right\|^{2+\delta}\left(\left\|\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\right\|^{2+\delta}+\left\|E\left[\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}\right)1_{ij}.
\end{aligned}$$

We further use the definitions in (11) such that by Jensen's inequality and for $i=1$ and $t\in[\tau_{0}+1,\tau_{0}+\tau]$,
$$\begin{aligned}
\left\|\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\right\|^{2+\delta}+\left\|E\left[\tilde{\psi}_{\tau,t}^{\nu}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}
&\leq\frac{1}{\tau^{1+\delta/2}}\left(\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}+\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}\right)\\
&\leq\frac{1}{\tau^{1+\delta/2}}\left(\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}+E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right),
\end{aligned}$$
while for $i>1$ or $t\notin[\tau_{0}+1,\tau_{0}+\tau]$, $\tilde{\psi}_{it}^{\nu}=0$.

Similarly, for $t\in[1,\ldots,T]$,
$$\left\|\tilde{\psi}_{it}^{y}\right\|^{2+\delta}+\left\|E\left[\tilde{\psi}_{it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right\|^{2+\delta}\leq\frac{1}{n^{1+\delta/2}}\left(\left\|\psi_{it}^{y}\right\|^{2+\delta}+E\left[\left\|\psi_{it}^{y}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right),$$
while for $t\notin[1,\ldots,T]$, $\tilde{\psi}_{it}^{y}=0$.

Noting that $\|\lambda_{j,y}\|\leq 1$ and $\|\lambda_{j,\nu}\|<1$,
$$E\left[\left\|\ddot{\psi}_{q}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,q-1}\right]\leq\frac{2^{3+2\delta}\,1\{i=1,\ t\in[\tau_{0}+1,\tau_{0}+\tau]\}}{\tau^{1+\delta/2}}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]+\frac{2^{3+2\delta}\,1\{t\in\{1,\ldots,T\}\}}{n^{1+\delta/2}}E\left[\left\|\psi_{it}^{y}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right],\tag{29}$$
where the inequality in (29) holds for $\delta\geq 0$. To establish the limiting distribution of $\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}$ we check that
$$\sum_{q=1}^{k_{n}}E\left[\left\|\ddot{\psi}_{q}\right\|^{2+\delta}\right]\rightarrow 0,\tag{30}$$
$$\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}^{2}\rightarrow_{p}\sum_{t\in\{1,\ldots,T\}}\lambda_{1,y}'\Omega_{yt}\lambda_{1,y}+\sum_{j=1}^{k}\lambda_{j,\nu}'\Omega_{\nu}\left(r_{j}-r_{j-1}\right)\lambda_{j,\nu},\tag{31}$$
and
$$\sup_{n}E\left[\left(\sum_{q=1}^{k_{n}}E\left[\ddot{\psi}_{q}^{2}\mid\mathcal{G}_{\tau n,q-1}\right]\right)^{1+\delta/2}\right]<\infty,\tag{32}$$
which are adapted to the current setting from Conditions (A.26), (A.27) and (A.28) in Kuersteiner and Prucha (2013). These conditions in turn are related to conditions of Hall and Heyde (1980) and are shown by Kuersteiner and Prucha (2013) to be sufficient for their Theorem 1. To show that (30) holds, note that from (29), the law of iterated expectations and Condition 1 it follows that for some constant $C<\infty$,
$$\begin{aligned}
\sum_{q=1}^{k_{n}}E\left[\left\|\ddot{\psi}_{q}\right\|^{2+\delta}\right]&=\sum_{q=1}^{k_{n}}E\left[E\left[\left\|\ddot{\psi}_{q}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,q-1}\right]\right]\\
&\leq\frac{2^{3+2\delta}}{\tau^{1+\delta/2}}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\right]+\frac{2^{3+2\delta}}{n^{1+\delta/2}}\sum_{t=1}^{T}\sum_{i=1}^{n}E\left[\left\|\psi_{it}^{y}\right\|^{2+\delta}\right]\\
&\leq\frac{2^{3+2\delta}\tau C}{\tau^{1+\delta/2}}+\frac{2^{3+2\delta}nTC}{n^{1+\delta/2}}=\frac{2^{3+2\delta}C}{\tau^{\delta/2}}+\frac{2^{3+2\delta}TC}{n^{\delta/2}}\rightarrow 0
\end{aligned}$$
because $2^{3+2\delta}C$ and $T$ are fixed as $\tau,n\rightarrow\infty$.

Next, consider the probability limit of $\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}^{2}$. We have

$$\begin{aligned}
\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}^{2}
&=\frac{1}{\tau}\sum_{j=1}^{k}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}\left(\lambda_{j,\nu}'\left(\psi_{\tau,t}^{\nu}-E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)\right)^{2}1_{ij}&(33)\\
&\quad+\frac{2}{\sqrt{\tau n}}\sum_{t\in\{\tau_{0}+1,\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}\sum_{j=1}^{k}\lambda_{j,\nu}'\left(\psi_{\tau,t}^{\nu}-E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)\left(\psi_{1t}^{y}-E\left[\psi_{1t}^{y}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)'\lambda_{j,y}1_{1j}1_{j}&(34)\\
&\quad+\frac{1}{n}\sum_{t\in\{1,\ldots,T\}}\left(\lambda_{1,y}'\sum_{i=1}^{n}\left(\psi_{n,it}^{y}-E\left[\psi_{n,it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)\right)^{2}.&(35)
\end{aligned}$$
Using

$$\left(\lambda_{j,\nu}'\left(\psi_{\tau,t}^{\nu}-E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)\right)^{2}\leq\left(\lambda_{j,\nu}'\psi_{\tau,t}^{\nu}\right)^{2}+2\left\|\psi_{\tau,t}^{\nu}\right\|\left\|\lambda_{j,\nu}\right\|^{2}\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|+\left\|\lambda_{j,\nu}\right\|^{2}\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{2}$$
and the fact that $E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]=0$ when $t>T$ implies that

$$\begin{aligned}
&\left|\frac{1}{\tau}\sum_{j=1}^{k}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}\left[\left(\lambda_{j,\nu}'\left(\psi_{\tau,t}^{\nu}-E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)\right)^{2}-\left(\lambda_{j,\nu}'\psi_{\tau,t}^{\nu}\right)^{2}\right]1_{ij}\right|\\
&\leq\frac{1}{\tau}\sum_{j=1}^{k}\sum_{t\in\{\tau_{0}+1,\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}\left(2\left\|\psi_{\tau,t}^{\nu}\right\|\left\|\lambda_{j,\nu}\right\|^{2}\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|+\left\|\lambda_{j,\nu}\right\|^{2}\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{2}\right)\tag{36}
\end{aligned}$$
where by Condition 2
$$\frac{1}{\tau}\sum_{j=1}^{k}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}\left(\lambda_{j,\nu}'\psi_{\tau,t}^{\nu}\right)^{2}1\left\{\tau_{0}+[\tau r_{j-1}]<t\leq\tau_{0}+[\tau r_{j}]\right\}\rightarrow_{p}\sum_{j=1}^{k}\lambda_{j,\nu}'\left(\Omega_{\nu}(r_{j})-\Omega_{\nu}(r_{j-1})\right)\lambda_{j,\nu}.$$

To show that the RHS of (36) is $o_{p}(1)$, note that by the Markov inequality and the Cauchy-Schwarz inequality it is enough to show that
$$\frac{1}{\tau}\sum_{j=1}^{k}\left\|\lambda_{j,\nu}\right\|^{2}\sum_{t\in\{\tau_{0}+1,\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}\left(2\sqrt{E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]E\left[\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{2}\right]}+E\left[\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{2}\right]\right)\rightarrow 0.\tag{37}$$

$$\begin{aligned}
&\leq\frac{1}{\tau}\sum_{j=1}^{k}\left\|\lambda_{j,\nu}\right\|^{2}\left(3\sum_{t=0}^{T}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]+\sum_{t=\tau_{0}}^{-1}\left(2\sqrt{E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]}\,\vartheta_{t}+\vartheta_{t}^{2}\right)\right)\\
&\leq O\left(\tau^{-1}\right)+\frac{1}{\tau}\sum_{t=\tau_{0}}^{-1}\left(2\sqrt{E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]}\sqrt{C|t|^{-(1+\delta)}}+C|t|^{-(1+\delta)}\right)\\
&\leq\frac{2\sqrt{C}\sup_{t}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]^{1/2}}{\tau}\sum_{t=\tau_{0}}^{-1}\left(|t|^{-(1+\delta)}\right)^{1/2}+o\left(\tau^{-1}\right)\sum_{t=\tau_{0}}^{-1}|t|^{-(1+\delta)}+O\left(\tau^{-1}\right)\rightarrow 0,
\end{aligned}$$
where $\vartheta_{t}$ denotes the mixingale bound of Condition 1(vii). The first inequality follows from Condition 1(vii). The second inequality follows from the fact that $T$ is fixed and bounded and Condition 1(vii). The final result uses that $\sup_{t}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\right]\leq C<\infty$ by Condition 1(iv) and the fact that $|\tau_{0}|\leq\tau$, $t/\tau\leq 1$ for $t\in[1,\ldots,\tau]$, such that

$$\tau^{-1}\sum_{t=\tau_{0}}^{-1}\left(|t|^{-(1+\delta)}\right)^{1/2}=\tau^{-1}\sum_{t=1}^{|\tau_{0}|}t^{-(1/2+\delta/2)}=\tau^{-1/2}\sum_{t=1}^{|\tau_{0}|}\left(\frac{t}{\tau}\right)^{1/2}t^{-(1+\delta/2)}\leq\tau^{-1/2}\sum_{t=1}^{\infty}t^{-(1+\delta/2)}=O\left(\tau^{-1/2}\right).$$
Note that $\sum_{t=1}^{\infty}t^{-(1+\delta/2)}<\infty$ for any $\delta>0$. Next consider (34), where
$$\begin{aligned}
&E\left[\left|\frac{2}{\sqrt{\tau n}}\sum_{t\in\{\tau_{0}+1,\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}\sum_{j=1}^{k}\lambda_{j,\nu}'\left(\psi_{\tau,t}^{\nu}-E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)\left(\psi_{1t}^{y}-E\left[\psi_{1t}^{y}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)'\lambda_{1,y}1_{1j}1_{j}\right|\right]\\
&\leq\frac{2}{\sqrt{\tau n}}\sum_{t\in\{\tau_{0}+1,\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}\left(E\left[\left(\sum_{j=1}^{k}\lambda_{j,\nu}'\left(\psi_{\tau,t}^{\nu}-E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)\right)^{2}\right]\right)^{1/2}\left(E\left[\left\|\left(\psi_{1t}^{y}-E\left[\psi_{1t}^{y}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)'\lambda_{1,y}\right\|^{2}\right]\right)^{1/2}&(38)\\
&\leq\frac{2\sqrt{k}}{\sqrt{\tau n}}\sum_{t\in\{\tau_{0},\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}2\left(\sup_{t}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]\right)^{1/2}\left(E\left[\left\|\left(\psi_{1t}^{y}-E\left[\psi_{1t}^{y}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)'\lambda_{1,y}\right\|^{2}\right]\right)^{1/2}&(39)\\
&\leq\frac{2^{3}\sqrt{k}\,T}{\sqrt{\tau n}}\left(\sup_{t}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]\right)^{1/2}\left(\sup_{i,t}E\left[\left\|\psi_{n,it}^{y}\right\|^{2}\right]\right)^{1/2}\rightarrow 0&(40)
\end{aligned}$$
where the first inequality in (38) follows from the Cauchy-Schwarz inequality. Then we have in (39), by Condition 1(iii) and the Hölder inequality, that
$$E\left[\left\|\left(\psi_{1t}^{y}-E\left[\psi_{1t}^{y}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right)'\lambda_{1,y}\right\|^{2}\right]\leq 2^{2}E\left[\left\|\psi_{1t}^{y}\right\|^{2}\right]$$
such that (40) follows. We note that (40) goes to zero because of Condition 1(iv) as long as $T/\sqrt{\tau n}\rightarrow 0$. Clearly, this condition holds as long as $T$ is held fixed, but it holds under weaker conditions as well. Next, the limit of (35) is, by Condition 1(v) and Condition 3,
$$\frac{1}{n}\sum_{t\in\{1,\ldots,T\}}\left(\lambda_{1,y}'\sum_{i=1}^{n}\left(\psi_{n,it}^{y}-E\left[\psi_{n,it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)\right)^{2}\rightarrow_{p}\sum_{t\in\{1,\ldots,T\}}\lambda_{1,y}'\Omega_{ty}\lambda_{1,y}.$$
This verifies (31). Finally, for (32) we check that

$$\sup_{n}E\left[\left(\sum_{q=1}^{k_{n}}E\left[\ddot{\psi}_{q}^{2}\mid\mathcal{G}_{\tau n,q-1}\right]\right)^{1+\delta/2}\right]<\infty.\tag{41}$$
First, use (29) with $\delta=0$ to obtain

$$\begin{aligned}
\sum_{q=1}^{k_{n}}E\left[\ddot{\psi}_{q}^{2}\mid\mathcal{G}_{\tau n,q-1}\right]&\leq\frac{2^{3}}{\tau}\sum_{t=\tau_{0}}^{\tau_{0}+\tau}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\\
&\quad+\frac{2^{3}}{n}\sum_{t\in\{1,\ldots,T\}}\sum_{i=1}^{n}E\left[\left\|\psi_{n,it}^{y}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right].\tag{42}
\end{aligned}$$

Applying (42) to (41) and using the Hölder inequality implies
$$\begin{aligned}
E\left[\left(\sum_{q=1}^{k_{n}}E\left[\ddot{\psi}_{q}^{2}\mid\mathcal{G}_{\tau n,q-1}\right]\right)^{1+\delta/2}\right]
&\leq 2^{\delta/2}E\left[\left(\frac{2^{3}}{\tau}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}\right]\\
&\quad+2^{\delta/2}E\left[\left(\frac{2^{3}}{n}\sum_{t\in\{\tau_{0},\ldots,\tau_{0}+\tau\}\cap\{1,\ldots,T\}}\sum_{i=1}^{n}E\left[\left\|\psi_{n,it}^{y}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}\right].
\end{aligned}$$
By Jensen's inequality, we have

$$\left(\frac{1}{\tau}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}\leq\frac{1}{\tau}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}\left(E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}$$
and
$$\left(E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}\leq E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]$$
so that

$$E\left[\left(\frac{2^{3}}{\tau}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}\right]\leq\frac{2^{3+3\delta/2}}{\tau}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}E\left[E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right]\leq 2^{3+3\delta/2}\sup_{t}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\right]<\infty\tag{43}$$
and similarly,

$$E\left[\left(\frac{2^{3}}{n}\sum_{t\in\{1,\ldots,T\}}\sum_{i=1}^{n}E\left[\left\|\psi_{n,it}^{y}\right\|^{2}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]\right)^{1+\delta/2}\right]\leq\frac{2^{3+3\delta/2}\left(Tn\right)^{\delta/2}}{n^{1+\delta/2}}\sum_{t=1}^{T}\sum_{i=1}^{n}E\left[\left\|\psi_{n,it}^{y}\right\|^{2+\delta}\right]\leq 2^{3+3\delta/2}T^{1+\delta/2}\sup_{i,t}E\left[\left\|\psi_{n,it}^{y}\right\|^{2+\delta}\right]<\infty.\tag{44}$$
By combining (43) and (44) we obtain the following bound for (41),

$$E\left[\left(\sum_{q=1}^{k_{n}}E\left[\ddot{\psi}_{q}^{2}\mid\mathcal{G}_{\tau n,q-1}\right]\right)^{1+\delta/2}\right]\leq 2^{3+3\delta/2}\sup_{t}E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2+\delta}\right]+2^{3+3\delta/2}T^{1+\delta/2}\sup_{i,t}E\left[\left\|\psi_{n,it}^{y}\right\|^{2+\delta}\right]<\infty.$$

This establishes that (30), (31) and (32) hold and thus establishes the CLT for $\sum_{q=1}^{k_{n}}\ddot{\psi}_{q}$. It remains to be shown that the second term in (28) can be neglected. Consider

$$\begin{aligned}
\sum_{t=\min(1,\tau_{0}+1)}^{\max(T,\tau_{0}+\tau)}\sum_{i=1}^{n}\sum_{j=1}^{k}\lambda_{j}'E\left[\Delta\tilde{\psi}_{it}(r_{j})\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]
&=\tau^{-1/2}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}\sum_{j=1}^{k}\lambda_{j,\nu}'E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]1\left\{\tau_{0}+[\tau r_{j-1}]<t\leq\tau_{0}+[\tau r_{j}]\right\}\\
&\quad+n^{-1/2}\sum_{t=1}^{T}\sum_{i=1}^{n}\lambda_{1,y}'E\left[\psi_{n,it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right],
\end{aligned}$$
where we defined $r_{0}=0$. Note that

$$E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]=0\ \text{for}\ t>T$$
and
$$E\left[\psi_{n,it}^{y}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]=0.$$
This implies, using the convention that a term is zero if it is a sum over indices from $a$ to $b$ with $a>b$, as well as the fact that $T\leq\tau_{0}+\tau$, that
$$\tau^{-1/2}\sum_{t=\tau_{0}+1}^{\tau_{0}+\tau}\lambda_{\nu}'E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]=\tau^{-1/2}\sum_{t=\tau_{0}+1}^{T}\sum_{j=1}^{k}\lambda_{j,\nu}'E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n+i-1}\right]1\left\{\tau_{0}+[\tau r_{j-1}]<t\leq\tau_{0}+[\tau r_{j}]\right\}.$$
By a similar argument used to show that (37) vanishes, and noting that $T$ is fixed while $\tau\rightarrow\infty$, it follows that, as long as $\tau_{0}\leq T$,
$$\begin{aligned}
E\left[\left|\tau^{-1/2}\sum_{t=\tau_{0}+1}^{T}\lambda_{j,\nu}'E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right|^{1+\delta/2}\right]
&\leq\frac{\left(T-\tau_{0}\right)^{\delta/2}}{\tau^{1/2+\delta/4}}\sum_{t=\tau_{0}+1}^{T}E\left[\left|\lambda_{j,\nu}'E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right|^{1+\delta/2}\right]\\
&\leq\frac{\left(T-\tau_{0}\right)^{\delta/2}}{\tau^{1/2+\delta/4}}\sum_{t=\tau_{0}+1}^{T}E\left[\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{1+\delta/2}\right]\\
&\leq\frac{\left(T-\tau_{0}\right)^{\delta/2}}{\tau^{1/2+\delta/4}}\sum_{t=\tau_{0}+1}^{T}\left(E\left[\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{2}\right]\right)^{1/2+\delta/4},
\end{aligned}$$

where the first inequality uses the triangle inequality and the second and third inequalities are based on versions of Jensen's inequality. We continue to use the convention that a term is zero if it is a sum over indices from $a$ to $b$ with $a>b$. Now consider two cases. When $\tau_{0}\rightarrow-\infty$, use Condition 1(vii) and the facts that $\left(T-\tau_{0}\right)^{\delta/2}/\tau^{\delta/2}\leq 1$ as well as $\left(T-\tau_{0}\right)^{-\delta/2}\leq|\tau_{0}|^{-\delta/2}$:
$$\begin{aligned}
\frac{\left(T-\tau_{0}\right)^{\delta/2}}{\tau^{1/2+\delta/4}}\sum_{t=\tau_{0}}^{T}\left(E\left[\left\|E\left[\psi_{\tau,t}^{\nu}\mid\mathcal{G}_{\tau n,t^{*}n}\right]\right\|^{2}\right]\right)^{1/2+\delta/4}
&\leq\tau^{-(1/2+\delta/4)}\sum_{t=0}^{T}\left(E\left[\left\|\psi_{\tau,t}^{\nu}\right\|^{2}\right]\right)^{1/2+\delta/4}+\tau^{-(1/2+\delta/4)}\sum_{t=\tau_{0}}^{-1}C^{1/2+\delta/4}|t|^{-\left(\frac{1}{2}+\frac{3\delta+\delta^{2}}{4}\right)}\\
&\leq O\left(\tau^{-(1/2+\delta/4)}\right)+\tau^{-(1/2+\delta/4)}|\tau_{0}|^{1/2-(\delta+\delta^{2})/4}C^{1/2+\delta/4}\sum_{t=\tau_{0}}^{-1}|t|^{-(1+\delta/2)}\\
&\leq O\left(\tau^{-(1/2+\delta/4)}\right)+\tau^{-\delta/2}\left(\frac{|\tau_{0}|}{\tau}\right)^{1/2-\delta/4}C^{1/2+\delta/4}\sum_{t=1}^{\infty}t^{-(1+\delta/2)}\\
&=O\left(\tau^{-\delta/2}\right)\rightarrow 0
\end{aligned}$$
since $|\tau_{0}|/\tau\rightarrow\upsilon$ as $\tau\rightarrow\infty$ and $\sum_{t=1}^{\infty}t^{-(1+\delta/2)}<\infty$. The second case arises when $\tau_{0}$ is fixed. Then,

$$\tau^{-(1/2+\delta/4)}(T-\tau_0)^{\delta/2}\sum_{t=\tau_0+1}^{T} E\Big[\big(E\big[|\psi^{\nu}_{\tau,t}|^{2} \mid \mathcal{G}_{\tau n,t^{*}n}\big]\big)^{1/2+\delta/4}\Big] \le \tau^{-1/2-\delta/4}\sup_t E\big[|\psi^{\nu}_{\tau,t}|^{2}\big]^{1/2+\delta/4}(T+|\tau_0|)^{1+\delta/2} \to 0$$
as $\tau \to \infty$. In both cases the Markov inequality then implies that

$$\tau^{-1/2}\sum_{t=\tau_0+1}^{\tau_0+\tau}\sum_{i=1}^{n}\sum_{j=1}^{k}\lambda_j' E\big[\Delta\tilde{\psi}_{it}(r_j) \mid \mathcal{G}_{\tau n,t^{*}n+i-1}\big] = o_p(1)$$
and consequently that

$$\lambda_1' X_{n\tau}(r_1) + \sum_{j=2}^{k}\lambda_j' \Delta X_{n\tau}(r_j) = \sum_{q=1}^{k_n}\ddot{\psi}_q + o_p(1). \quad (45)$$
We have shown that the conditions of Theorem 1 of Kuersteiner and Prucha (2013) hold by establishing (30), (31), (32) and (45). Applying the Cramer-Wold theorem to the vector

$$Y_{nt} = \big(X_{n\tau}(r_1)', \Delta X_{n\tau}(r_2)', \ldots, \Delta X_{n\tau}(r_k)'\big)'$$
it follows from Theorem 1 in Kuersteiner and Prucha (2013) that for all fixed $r_1, \ldots, r_k$ and using the convention that $r_0 = 0$,

$$E[\exp(i\lambda' Y_{nt})] \to E\Bigg[\exp\Bigg(-\frac{1}{2}\Bigg(\sum_{t\in\{1,\ldots,T\}}\lambda_{1,y}'\Omega_{yt}\lambda_{1,y} + \sum_{j=1}^{k}\lambda_{j,\nu}'\big(\Omega_\nu(r_j) - \Omega_\nu(r_{j-1})\big)\lambda_{j,\nu}\Bigg)\Bigg)\Bigg]. \quad (46)$$
When $\Omega_\nu(r) = r\Omega_\nu$ for all $r \in [0,1]$ and some $\Omega_\nu$ positive definite and measurable w.r.t. $\mathcal{C}$, this result simplifies to

$$\sum_{j=1}^{k}\lambda_{j,\nu}'\big(\Omega_\nu(r_j) - \Omega_\nu(r_{j-1})\big)\lambda_{j,\nu} = \sum_{j=1}^{k}\lambda_{j,\nu}'\Omega_\nu\lambda_{j,\nu}(r_j - r_{j-1}).$$
The second step in establishing the functional CLT involves proving tightness of the sequence

$\lambda' X_{n\tau}(r)$. By Lemma A.3 of Phillips and Durlauf (1986) and Proposition 4.1 of Wooldridge and White (1988), see also Billingsley (1968, p. 41), it is enough to establish tightness componentwise. This is implied by establishing tightness for $\lambda' X_{n\tau}(r)$ for all $\lambda \in \mathbb{R}^d$ such that $\lambda'\lambda = 1$. In the following we make use of Theorems 8.3 and 15.5 in Billingsley (1968). Recall the definition

$$X_{n\tau,y}(r) = \frac{1}{\sqrt{n}}\sum_{t=1}^{T}\sum_{i=1}^{n}\psi^{y}_{n,it}, \qquad X_{n\tau,\nu}(r) = \frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+1}^{\tau_0+[\tau r]}\psi^{\nu}_{\tau,t},$$
and note that
$$|\lambda'(X_{n\tau}(s) - X_{n\tau}(t))| \le |\lambda_y'(X_{n\tau,y}(s) - X_{n\tau,y}(t))| + |\lambda_\nu'(X_{n\tau,\nu}(s) - X_{n\tau,\nu}(t))|,$$
where $\lambda_y'(X_{n\tau,y}(s) - X_{n\tau,y}(t)) = 0$ uniformly in $t, s \in [0,1]$ because of the initial condition $X_{n\tau}(0)$ given in (13) and the fact that $X_{n\tau,y}(t)$ is constant as a function of $t$. Thus, to show tightness, we only need to consider $\lambda_\nu' X_{n\tau,\nu}(t)$.

Billingsley (1968, p. 58-59) constructs a continuous approximation to $\lambda' X_{n\tau}(r)$. We denote it as $\lambda_\nu' X^{c}_{n\tau,\nu}(r)$ and define it analogously to Billingsley (1968, eq 8.15) as
$$\lambda_\nu' X^{c}_{n\tau,\nu}(r) = \frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+1}^{\tau_0+[\tau r]}\lambda_\nu'\psi^{\nu}_{\tau,t} + \frac{(\tau r - [\tau r])}{\sqrt{\tau}}\lambda_\nu'\psi^{\nu}_{\tau,[\tau r]+1}.$$
First, note that for $\varepsilon > 0$ and $\sup_{\tau,r}|\tau r - [\tau r]| \le 1$,
$$P\bigg(\max_{r\in[0,1]}\frac{(\tau r - [\tau r])}{\sqrt{\tau}}\big|\lambda_\nu'\psi^{\nu}_{\tau,[\tau r]+1}\big| > \varepsilon\bigg) \le \frac{\sup_t E\big[\big|\psi^{\nu}_{\tau,t}\big|\big]}{\varepsilon\sqrt{\tau}} \to 0$$

and consequently $X^{c}_{n\tau,\nu}(r) = X_{n\tau,\nu}(r) + o_p(1)$ uniformly in $r \in [0,1]$. To establish tightness we need to establish that the 'modulus of continuity'
$$\omega(X^{c}_{n\tau},\delta) = \sup_{|t-s|<\delta}|\lambda'(X^{c}_{n\tau}(s) - X^{c}_{n\tau}(t))|, \quad (47)$$
where $t, s \in [0,1]$, satisfies
$$\lim_{\delta\to 0}\limsup_{n,\tau} P\big(\omega(X^{c}_{n\tau},\delta) \ge \varepsilon\big) = 0.$$
Let $S_{\tau,k} = \frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+1}^{\tau_0+k}\lambda_\nu'\psi^{\nu}_{\tau,t}$. By the inequalities in Billingsley (1968, p. 59) it follows that for $k$ such that $k/\tau \le \delta$,

$$\sup_{t \le s \le t+\delta/2}|\lambda'(X^{c}_{n\tau}(s) - X^{c}_{n\tau}(t))| \le 2\max_{0 \le i \le \tau\delta}|S_{\tau,k+i} - S_{\tau,k}|.$$
By Billingsley (1968, Theorem 8.4) and in particular Billingsley (1999, p. 88) and the comments in Billingsley (1968, p. 59), to establish tightness it is enough to show that
$$\lim_{c\to\infty}\limsup_{\tau\to\infty} c^2 P\Big(\max_{s\le\tau}|S_{\tau,k+s} - S_{\tau,k}| > c\Big) = 0 \quad (48)$$
holds for all $k \in \mathbb{N}$. As for finite dimensional convergence define
$$\ddot{\psi}^{\nu}_{\tau,t} = \frac{1}{\sqrt{\tau}}\Big(\psi^{\nu}_{\tau,t} - E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big)$$
and
$$\ddot{X}_{n\tau,\nu}(r) = \sum_{t=\tau_0+1}^{\tau_0+[\tau r]}\ddot{\psi}^{\nu}_{\tau,t}$$
such that
$$X_{n\tau,\nu}(r) = \ddot{X}_{n\tau,\nu}(r) + \frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+1}^{\tau_0+[\tau r]} E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big].$$
Let $\ddot{S}_{\tau,s} = \sum_{t=\tau_0+1}^{\tau_0+s}\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t}$ and

$$S_{\tau,s} = \sum_{t=\tau_0+1}^{\tau_0+s}\Big(\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t} + \frac{1}{\sqrt{\tau}}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big)$$
as before. Since
$$P\Big(\max_{s\le\tau}|S_{\tau,k+s} - S_{\tau,k}| > c\Big) \le P\Big(\max_{s\le\tau}\big|\ddot{S}_{\tau,k+s} - \ddot{S}_{\tau,k}\big| + \max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{\tau_0+s}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big| > c\Big)$$
$$\le P\Big(\max_{s\le\tau}\big|\ddot{S}_{\tau,k+s} - \ddot{S}_{\tau,k}\big| > \frac{c}{2}\Big) \quad (49)$$
$$+ P\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{\tau_0+s}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big| > \frac{c}{2}\Big). \quad (50)$$
Note that for each $k$ and $\tau_0$ fixed, $M_{\tau,s} = \ddot{S}_{\tau,s+k} - \ddot{S}_{\tau,k}$ and $\mathcal{F}_{\tau,s} = \sigma(z_{\tau_0},\ldots,z_{\tau_0+s+k})$, $\{M_{\tau,s}, \mathcal{F}_{\tau,s}\}$, is a martingale. Note that the filtration $\mathcal{F}_{\tau,s}$ does not depend on $\tau$ when $\tau_0$ is held fixed. We prove an extension of Hall and Heyde (1980, Theorems 2.1 and 2.2) to triangular martingale arrays in Lemma 1 and Corollary 4 in Appendix C.¹⁰ We first consider the term (50) and show that

$$\lim_{c\to\infty}\limsup_{\tau\to\infty} c^2 P\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{\tau_0+s}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big| > \frac{c}{2}\Big) = 0. \quad (51)$$
Using the convention that a term is zero if it is a sum over indices from $a$ to $b$ with $a > b$, note that (50) is bounded by the Markov inequality by
$$P\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{\tau_0+s}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big| > \frac{c}{2}\Big) \le \frac{4}{c^2\tau} E\Big[\max_{s\le\tau}\Big|\sum_{t=\tau_0+k+1}^{\tau_0+s}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big|^2\Big]$$
$$\le \frac{4}{c^2\tau}\sum_{t=\tau_0+k+1}^{\tau_0+\tau} E\Big[\big|E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\big|^2\Big],$$
so
$$c^2 P\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{\tau_0+s}\lambda_\nu' E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\Big| > \frac{c}{2}\Big)$$
$$\le \frac{4}{\tau}\sum_{t=T+1}^{\tau_0+\tau} E\Big[\big|E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\big|^2\Big] + \frac{4}{\tau}\sum_{t=0}^{T} E\big[\big|\psi^{\nu}_{\tau,t}\big|^2\big] + \frac{4}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{-1}\vartheta_t. \quad (52)$$
For the first term in (52), we have

$$\frac{4}{\tau}\sum_{t=T+1}^{\tau_0+\tau} E\Big[\big|E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\big|^2\Big] = 0$$
because $E\big[\psi^{\nu}_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big] = 0$ for $t > T$. The second term satisfies $\frac{4}{\tau}\sum_{t=0}^{T} E\big[|\psi^{\nu}_{\tau,t}|^2\big] \to 0$ as $\tau \to \infty$ because $T$ is finite. For the third term in (52) note that
$$\frac{4}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{-1}\vartheta_t \le \frac{4K}{\sqrt{\tau}}\sum_{t=\tau_0+k+1}^{-1}\big(|t|^{-(1+\delta)}\big)^{1/2} \le \frac{4K\tau^{1/2-\delta/4}}{\sqrt{\tau}}\sum_{t=1}^{\infty} t^{-(1+\delta/4)} = O\big(\tau^{-\delta/4}\big) \to 0.$$
¹⁰There is only a limited literature on laws of large numbers for triangular arrays of martingales. Andrews (1988) and Kanaya (2017) prove weak laws, de Jong (1996) proves a strong law but without proving a maximal inequality. Atchade (2009) and Hill (2010) allow for triangular arrays but only with respect to a fixed filtration that does not depend on sample size.

These considerations imply the desired result (51). With (51), in order to establish (48), it suffices to consider (49) and show that
$$\lim_{c\to\infty}\limsup_{\tau\to\infty} c^2 P\Big(\max_{s\le\tau}\big|\ddot{S}_{\tau,k+s} - \ddot{S}_{\tau,k}\big| > \frac{c}{2}\Big) = 0.$$

Because
$$c^2 P\Big(\max_{s\le\tau}\big|\ddot{S}_{\tau,k+s} - \ddot{S}_{\tau,k}\big| > \frac{c}{2}\Big) \le \frac{4}{\varepsilon} E\Big[\max_{s\le\tau}\big|\ddot{S}_{\tau,k+s} - \ddot{S}_{\tau,k}\big|^2 \cdot 1\Big(\max_{s\le\tau}\big|\ddot{S}_{\tau,k+s} - \ddot{S}_{\tau,k}\big| \ge \frac{c}{2}\Big)\Big]$$
$$= \frac{4}{\varepsilon} E\Big[\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big],$$
it suffices to prove that
$$\lim_{c\to\infty}\limsup_{\tau\to\infty} E\Big[\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t}\Big| \ge c\Big)\Big] = 0. \quad (53)$$
We show in Appendix B.7 that (53) holds as long as $\sup_t E\big[\big|\sqrt{\tau}\lambda_\nu'\ddot{\psi}^{\nu}_{\tau,t}\big|^{2+\delta}\big] < \infty$.

We now identify the limiting distribution using the technique of Rootzen (1983). Tightness together with finite dimensional convergence in distribution in (46), Condition 2 and the fact that

the partition $r_1, \ldots, r_k$ is arbitrary implies that for $\lambda \in \mathbb{R}^d$ with $\lambda = (\lambda_y', \lambda_\nu')'$,
$$E[\exp(i\lambda' X_{n\tau}(r))] \to E\Big[\exp\Big(-\frac{1}{2}\big(\lambda_y'\Omega_y\lambda_y + \lambda_\nu'\Omega_\nu(r)\lambda_\nu\big)\Big)\Big] \quad (54)$$
with $\Omega_y = \sum_{t\in\{1,\ldots,T\}}\Omega_{yt}$. The final step of the argument consists in representing the limiting process in terms of stochastic integrals over isonormal Gaussian processes.¹¹ By the law of iterated expectations and the fact that by Assumptions (2) and (3) the matrices $\Omega_y$ and $\Omega_\nu(r)$ are $\mathcal{C}$-measurable it follows that
$$E\Big[\exp\Big(-\frac{1}{2}\big(\lambda_y'\Omega_y\lambda_y + \lambda_\nu'\Omega_\nu(r)\lambda_\nu\big)\Big)\Big] \quad (55)$$
$$= E\Big[E\Big[\exp\Big(-\frac{1}{2}\lambda_y'\Omega_y\lambda_y\Big)\Big|\mathcal{C}\Big]\, E\Big[\exp\Big(-\frac{1}{2}\lambda_\nu'\Omega_\nu(r)\lambda_\nu\Big)\Big|\mathcal{C}\Big]\Big].$$

Let $W(r) = (W_y(r), W_\nu(r))$ be a vector of mutually independent standard Brownian motion processes in $\mathbb{R}^d$, independent of any $\mathcal{C}$-measurable random variable. We note that the first term on the RHS of (55) satisfies
$$E\Big[\exp\Big(-\frac{1}{2}\lambda_y'\Omega_y\lambda_y\Big)\Big|\mathcal{C}\Big] = E\Big[\exp\Big(i\lambda_y'\Omega_y^{1/2}W_y(1)\Big)\Big|\mathcal{C}\Big] \quad (56)$$
¹¹We are grateful to an anonymous referee for suggesting a simplified method of proof for this step.

by the properties of the standard Gaussian characteristic function. To analyze the second conditional expectation $E\big[\exp\big(-\frac{1}{2}\lambda_\nu'\Omega_\nu(r)\lambda_\nu\big) \mid \mathcal{C}\big]$ note that by Kallenberg (1997, p. 210) it follows from the isometry of the stochastic integral that there exists a standard Brownian process $W_\nu(r)$ such that
$$E\Bigg[\bigg(\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)^2\,\Bigg|\,\mathcal{C}\Bigg] = \int_0^r \lambda_\nu'\dot{\Omega}_\nu(t)\lambda_\nu\,dt = \lambda_\nu'\Omega_\nu(r)\lambda_\nu.$$
By linearity of the stochastic integral, conditional on $\mathcal{C}$, $\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)$ is a centered Gaussian process with conditional (on $\mathcal{C}$) characteristic function
$$E\Big[\exp\Big(-\frac{1}{2}\lambda_\nu'\Omega_\nu(r)\lambda_\nu\Big)\Big|\mathcal{C}\Big] = E\bigg[\exp\bigg(i\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)\,\bigg|\,\mathcal{C}\bigg]. \quad (57)$$
Combining (55), (56) and (57) gives

$$E\Big[\exp\Big(-\frac{1}{2}\big(\lambda_y'\Omega_y\lambda_y + \lambda_\nu'\Omega_\nu(r)\lambda_\nu\big)\Big)\Big] \quad (58)$$
$$= E\bigg[E\Big[\exp\Big(i\lambda_y'\Omega_y^{1/2}W_y(1)\Big)\Big|\mathcal{C}\Big]\, E\bigg[\exp\bigg(i\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)\,\bigg|\,\mathcal{C}\bigg]\bigg].$$
Since by construction $(W_y(r), W_\nu(r))$ are mutually independent conditional on $\mathcal{C}$, it follows that the RHS of (58) can be written as
$$E\bigg[E\Big[\exp\Big(i\lambda_y'\Omega_y^{1/2}W_y(1)\Big)\Big|\mathcal{C}\Big]\, E\bigg[\exp\bigg(i\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)\,\bigg|\,\mathcal{C}\bigg]\bigg]$$
$$= E\bigg[E\bigg[\exp\bigg(i\lambda_y'\Omega_y^{1/2}W_y(1) + i\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)\,\bigg|\,\mathcal{C}\bigg]\bigg]$$
$$= E\bigg[\exp\bigg(i\lambda_y'\Omega_y^{1/2}W_y(1) + i\int_0^r \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)\bigg],$$
where the last equality follows from the law of iterated expectations.
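The limit just derived is mixed normal rather than normal: conditionally on the aggregate factors the scaled sums are Gaussian, but their variances are random. The following Monte Carlo sketch is our own illustration of this point (all names and parameter values are hypothetical, not taken from the paper): it generates cross-section and time-series scores whose conditional variances depend on a common factor and checks that the unconditional scaled sums are heavier-tailed than a Gaussian, as a normal mixture must be.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_scaled_sums(n=500, tau=500, reps=10000):
    """Draw a common factor C, then conditionally Gaussian cross-section
    scores y_i and time-series scores v_t whose variances depend on C.
    Returns the scaled sums (n^{-1/2} sum y_i, tau^{-1/2} sum v_t)."""
    out = np.empty((reps, 2))
    for r in range(reps):
        c = rng.standard_normal()  # common aggregate factor (C-measurable)
        y = np.sqrt(1.0 + c**2) * rng.standard_normal(n)    # cross-section scores
        v = np.sqrt(0.5 + c**2) * rng.standard_normal(tau)  # time-series scores
        out[r] = (y.sum() / np.sqrt(n), v.sum() / np.sqrt(tau))
    return out

draws = joint_scaled_sums()
# Conditionally on C each column is exactly Gaussian; unconditionally the
# mixture over C has excess kurtosis, the signature of a mixed normal limit.
x = draws[:, 0]
kurt = ((x - x.mean()) ** 4).mean() / x.var() ** 2
print(round(kurt, 2))
```

For this hypothetical design the population kurtosis of the first column is $3E[(1+C^2)^2]/E[1+C^2]^2 = 4.5$, well above the Gaussian value of 3.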

B.2 Proof of Corollary 1

We note that finite dimensional convergence established in the proof of Theorem 1 implies that
$$E[\exp(i\lambda' X_{n\tau}(1))] \to E\Big[\exp\Big(-\frac{1}{2}\big(\lambda_y'\Omega_y\lambda_y + \lambda_\nu'\Omega_\nu(1)\lambda_\nu\big)\Big)\Big].$$
We also note that because of (57) it follows that

$$E\bigg[\exp\bigg(i\int_0^1 \lambda_\nu'\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)\bigg)\bigg] = E\Big[\exp\Big(-\frac{1}{2}\lambda_\nu'\Omega_\nu(1)\lambda_\nu\Big)\Big],$$
which shows that $\int_0^1\big(\dot{\Omega}_\nu(t)\big)^{1/2}dW_\nu(t)$ has the same distribution as $\Omega_\nu(1)^{1/2}W_\nu(1)$.

B.3 Proof of Theorem 2

Let $s^y_{it}(\theta,\rho) = f_{\theta,it}(\theta,\rho)$ and $s^\nu_t(\rho,\beta) = g_{\rho,t}(\rho,\beta)$ in the case of maximum likelihood estimation and $s^y_{it}(\theta,\rho) = f_{it}(\theta,\rho)$ and $s^\nu_t(\rho,\beta) = g_t(\rho,\beta)$ in the case of moment based estimation. Using the notation developed before we define
$$\tilde{s}^y_{it}(\theta,\rho) = \begin{cases} s^y_{it}(\theta,\rho)/\sqrt{n} & \text{if } t \in \{1,\ldots,T\} \\ 0 & \text{otherwise} \end{cases}$$
analogously to (12) and
$$\tilde{s}^\nu_{it}(\beta,\rho) = \begin{cases} s^\nu_t(\beta,\rho)/\sqrt{\tau} & \text{if } t \in \{\tau_0+1,\ldots,\tau_0+\tau\} \text{ and } i = 1 \\ 0 & \text{otherwise} \end{cases}$$
analogously to (11). Stack the moment vectors in
$$\tilde{s}_{it}(\phi) := \tilde{s}_{it}(\theta,\rho) = \big(\tilde{s}^y_{it}(\theta,\rho)', \tilde{s}^\nu_{it}(\beta,\rho)'\big)' \quad (59)$$
and define the scaling matrix $D_{n\tau} = \mathrm{diag}\big(n^{-1/2}I_y, \tau^{-1/2}I_\nu\big)$ where $I_y$ is an identity matrix of dimension $k_\theta$ and $I_\nu$ is an identity matrix of dimension $k_\rho$. For the maximum likelihood estimator, the moment conditions (16) and (17) can be directly written as

$$\sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}\big(\hat{\theta},\hat{\rho}\big) = 0.$$
For moment based estimators we have by Conditions 4(i) and (ii) that
$$\sup_{\|\phi-\phi_0\|\le\varepsilon}\bigg\|\big(s^y_M(\theta,\rho)', s^\nu_M(\beta,\rho)'\big)' - \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}(\theta,\rho)\bigg\| = o_p(1).$$
It then follows that for the moment based estimators
$$0 = s\big(\hat{\phi}\big) = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}\big(\hat{\theta},\hat{\rho}\big) + o_p(1).$$
A first order mean value expansion around $\phi_0$, where $\phi = (\theta',\rho')'$ and $\hat{\phi} = \big(\hat{\theta}',\hat{\rho}'\big)'$, leads to
$$o_p(1) = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}(\phi_0) + \Bigg(\sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\frac{\partial\tilde{s}_{it}\big(\bar{\phi}\big)}{\partial\phi'}D_{n\tau}\Bigg)D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big)$$
or
$$D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big) = -\Bigg(\sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\frac{\partial\tilde{s}_{it}\big(\bar{\phi}\big)}{\partial\phi'}D_{n\tau}\Bigg)^{-1}\sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}(\phi_0) + o_p(1),$$
where $\bar{\phi}$ satisfies $\|\bar{\phi}-\phi_0\| \le \|\hat{\phi}-\phi_0\|$ and we note that with some abuse of notation we implicitly allow for $\bar{\phi}$ to differ across rows of $\partial\tilde{s}_{it}\big(\bar{\phi}\big)/\partial\phi'$. Note that

$$\frac{\partial\tilde{s}_{it}\big(\bar{\phi}\big)}{\partial\phi'} = \begin{pmatrix} \partial\tilde{s}^y_{it}(\theta,\rho)/\partial\theta' & \partial\tilde{s}^y_{it}(\theta,\rho)/\partial\rho' \\ \partial\tilde{s}^\nu_{it,\rho}(\beta,\rho)/\partial\theta' & \partial\tilde{s}^\nu_{it,\rho}(\beta,\rho)/\partial\rho' \end{pmatrix},$$
where $\tilde{s}^\nu_{it,\rho}$ denotes moment conditions associated with $\rho$. From Condition 4(iii) and Theorem 1 it follows (note that we make use of the continuous mapping theorem, which is applicable because Theorem 1 establishes stable and thus joint convergence) that
$$D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big) = -A(\phi_0)^{-1}\sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}(\phi_0) + o_p(1).$$
It now follows from the continuous mapping theorem and joint convergence in Corollary 1 that
$$D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big) \xrightarrow{d} -A(\phi_0)^{-1}\Omega^{1/2}W \quad (\mathcal{C}\text{-stably}).$$

B.4 Proof of Corollary 2

Partition

$$A(\phi_0) = \begin{pmatrix} A_{y,\theta} & \sqrt{\kappa}A_{y,\rho} \\ \frac{1}{\sqrt{\kappa}}A_{\nu,\theta} & A_{\nu,\rho} \end{pmatrix}$$
with inverse
$$A(\phi_0)^{-1} = \begin{pmatrix} A_{y,\theta}^{-1} + A_{y,\theta}^{-1}A_{y,\rho}\big(A_{\nu,\rho} - A_{\nu,\theta}A_{y,\theta}^{-1}A_{y,\rho}\big)^{-1}A_{\nu,\theta}A_{y,\theta}^{-1} & -\sqrt{\kappa}A_{y,\theta}^{-1}A_{y,\rho}\big(A_{\nu,\rho} - A_{\nu,\theta}A_{y,\theta}^{-1}A_{y,\rho}\big)^{-1} \\ -\frac{1}{\sqrt{\kappa}}\big(A_{\nu,\rho} - A_{\nu,\theta}A_{y,\theta}^{-1}A_{y,\rho}\big)^{-1}A_{\nu,\theta}A_{y,\theta}^{-1} & \big(A_{\nu,\rho} - A_{\nu,\theta}A_{y,\theta}^{-1}A_{y,\rho}\big)^{-1} \end{pmatrix} = \begin{pmatrix} A^{y,\theta} & \sqrt{\kappa}A^{y,\rho} \\ \frac{1}{\sqrt{\kappa}}A^{\nu,\theta} & A^{\nu,\rho} \end{pmatrix}.$$
It now follows from the continuous mapping theorem and joint convergence in Corollary 1 that

$$D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big) \xrightarrow{d} -A(\phi_0)^{-1}\Omega^{1/2}W \quad (\mathcal{C}\text{-stably}),$$
where the right hand side has a mixed normal distribution,
$$-A(\phi_0)^{-1}\Omega^{1/2}W \sim MN\big(0, A(\phi_0)^{-1}\Omega A(\phi_0)'^{-1}\big)$$
and
$$A(\phi_0)^{-1}\Omega A(\phi_0)'^{-1} = \begin{pmatrix} A^{y,\theta}\Omega_y A^{y,\theta\prime} + \kappa A^{y,\rho}\Omega_\nu(1)A^{y,\rho\prime} & \frac{1}{\sqrt{\kappa}}A^{y,\theta}\Omega_y A^{\nu,\theta\prime} + \sqrt{\kappa}A^{y,\rho}\Omega_\nu(1)A^{\nu,\rho\prime} \\ \frac{1}{\sqrt{\kappa}}A^{\nu,\theta}\Omega_y A^{y,\theta\prime} + \sqrt{\kappa}A^{\nu,\rho}\Omega_\nu(1)A^{y,\rho\prime} & \frac{1}{\kappa}A^{\nu,\theta}\Omega_y A^{\nu,\theta\prime} + A^{\nu,\rho}\Omega_\nu(1)A^{\nu,\rho\prime} \end{pmatrix}.$$
The form of the matrices $\Omega_y$ and $\Omega_\nu$ follows from Condition 5 in the case of the maximum likelihood estimator. For the moment based estimator, $\Omega_y$ and $\Omega_\nu$ follow from Condition 6, the definition of $s^y_M(\theta,\rho)$ and $s^\nu_M(\beta,\rho)$ and Conditions 4(i) and (ii).

B.5 Proof of Theorem 3

We first establish the joint stable convergence of $\big(V_{\tau n}(r), s^y_{ML}\big)$. Recall that
$$\tau^{-1/2}\nu_{\tau,t} = \exp\big((t-\min(1,\tau_0))\gamma/\tau\big)V(0) + \tau^{-1/2}\sum_{s=\min(1,\tau_0)+1}^{t}\exp\big((t-s)\gamma/\tau\big)\eta_s$$
and $V_{\tau n}(r) = \tau^{-1/2}\nu_{\tau,\tau_0+[\tau r]}$. Define $\tilde{V}_{\tau n}(r) = \tau^{-1/2}\sum_{s=\min(1,\tau_0)+1}^{\tau_0+[\tau r]}\exp(-s\gamma/\tau)\eta_s$. It follows that
$$\tau^{-1/2}\nu_{\tau,\tau_0+[\tau r]} = \exp\big(([\tau r]-\min(1,\tau_0))\gamma/\tau\big)V(0) + \exp\big([\tau r]\gamma/\tau\big)\tilde{V}_{\tau n}(r).$$
We establish joint stable convergence of $\big(\tilde{V}_{\tau n}(r), s^y_{ML}\big)$ and use the continuous mapping theorem to deal with the first term in $\tau^{-1/2}\nu_{\tau,[\tau r]}$. By the continuous mapping theorem (see Billingsley (1968, p. 30)), the characterization of stable convergence on $D[0,1]$ (as given in JS, Theorem VIII 5.33(ii)) and an argument used in Kuersteiner and Prucha (2013, p. 119), stable convergence of $\big(\tilde{V}_{\tau n}(r), s^y_{ML}\big)$ implies that
$$\big(\exp([\tau r]\gamma/\tau)\tilde{V}_{\tau n}(r), s^y_{ML}\big)$$
also converges jointly and $\mathcal{C}$-stably. Subsequently, this argument will simply be referred to as the 'continuous mapping theorem'. In addition, $\exp\big(([\tau r]-\min(1,\tau_0))\gamma/\tau\big)V(0) \xrightarrow{p} \exp(r\gamma)V(0)$, which is measurable with respect to $\mathcal{C}$. Together these results imply joint stable convergence of $\big(V_{\tau n}(r), s^y_{ML}\big)$. We thus turn to $\big(\tilde{V}_{\tau n}(r), s^y_{ML}\big)$. To apply Theorem 1 we need to show that $\psi_{\tau,s} = \exp(-s\gamma/\tau)\eta_s$ satisfies Conditions 1 iv) and 2. Since
$$|\exp(-s\gamma/\tau)\eta_s|^{2+\delta} = e^{-\gamma s(2+\delta)/\tau}|\eta_s|^{2+\delta} \le e^{(2+\delta)|\gamma|}|\eta_s|^{2+\delta} \quad (60)$$

τ0+[τr] τ0+[τr] r 1 2 1 2 p 2 τ − (ψτ,s) = τ − exp ( 2γt/τ) ηt σ exp ( 2γt) dt. (61) − → s − t=τ0X+[τs]+1 t=τ0X+[τs]+1 Z 1/2 In this case, Ω (r)= σ2 (1 exp ( 2rγ)) /2γ and Ω˙ (r) = σ exp ( γr). By the relationship ν − − ν − in (58) and Theorem 1 we have that  

r ˜ y sγ 1/2 Vτn (r) ,sML σ e− dWν (s) , Ωy Wy (1) -stably ⇒ 0 C    Z  which implies, by the continuous mapping theorem and -stable convergence, that C r y (r s)γ 1/2 (V (r) ,s ) exp (rγ) V (0) + σ e − dW (s) , Ω W (1) -stably. (62) τn ML ⇒ ν y y C  Z0  r (r s)γ Note that σ 0 e − dWν (s) is the same term as in Phillips (1987) while the limit given in (62) is the same asR in Kurtz and Protter (1991,p.1043). We now square (20) and sum both sides as in Chan and Wei (1987, Equation (2.8) or Phillips, (1987) to write

$$\tau^{-1}\sum_{s=\tau_0+1}^{\tau+\tau_0}\nu_{\tau,s-1}\eta_s = \frac{e^{-\gamma/\tau}}{2}\Big(\tau^{-1}\nu_{\tau,\tau+\tau_0}^2 - \tau^{-1}\nu_{\tau,\tau_0}^2\Big) + \frac{\tau e^{-\gamma/\tau}}{2}\big(1-e^{2\gamma/\tau}\big)\tau^{-2}\sum_{s=\tau_0+1}^{\tau+\tau_0}\nu_{\tau,s-1}^2 - \frac{e^{-\gamma/\tau}}{2}\tau^{-1}\sum_{s=\tau_0+1}^{\tau+\tau_0}\eta_s^2. \quad (63)$$

We note that $e^{-\gamma/\tau} \to 1$ and $\tau e^{-\gamma/\tau}\big(1-e^{2\gamma/\tau}\big) \to -2\gamma$. Furthermore, note that for all $\alpha, \varepsilon > 0$ it follows by the Markov and triangle inequalities and Condition 7iv) that
$$P\Bigg(\tau^{-1}\sum_{t=\tau_0+1}^{\tau+\tau_0}E\big[\eta_t^2 1\big(|\eta_t| > \tau^{1/2}\alpha\big) \mid \mathcal{G}_{n,t^{*}n}\big] > \varepsilon\Bigg) \le \frac{1}{\tau\varepsilon}\sum_{t=\tau_0+1}^{\tau+\tau_0}E\big[\eta_t^2 1\big(|\eta_t| > \tau^{1/2}\alpha\big)\big] \le \frac{\sup_t E\big[|\eta_t|^{2+\delta}\big]}{\alpha^\delta\tau^{\delta/2}} \to 0 \text{ as } \tau\to\infty,$$
¹²See Appendix D for details.

such that Condition 1.3 of Chan and Wei (1987) holds. Let $U_{\tau,k}^2 = \tau^{-1}\sum_{t=\tau_0+1}^{k+\tau_0}E\big[\eta_t^2 \mid \mathcal{G}_{n,t^{*}n}\big]$. Then, by Hölder's and Jensen's inequalities,
$$E\big[\big(U_{\tau,\tau}^2\big)^{1+\delta/2}\big] \le \tau^{-1}\sum_{t=\tau_0+1}^{\tau+\tau_0}E\Big[E\big[|\eta_t|^{2+\delta} \mid \mathcal{G}_{n,t^{*}n}\big]\Big] \le \sup_t E\big[|\eta_t|^{2+\delta}\big] < \infty \quad (64)$$
such that $U_{\tau,\tau}^2$ is uniformly integrable. The bound in (64) also means that by Theorem 2.23 of Hall and Heyde it follows that $E\big[\big|U_{\tau,\tau}^2 - \tau^{-1}\sum_{t=\tau_0+1}^{\tau+\tau_0}\eta_t^2\big|\big] \to 0$ and thus by Condition 7 vii) and by Markov's inequality
$$\tau^{-1}\sum_{t=\tau_0+1}^{\tau+\tau_0}\eta_t^2 \xrightarrow{p} \sigma^2.$$
We also have
$$\tau^{-1}\nu_{\tau,\tau+\tau_0}^2 = V_{\tau n}(1)^2, \quad (65)$$

$$\tau^{-1}\nu_{\tau,\tau_0}^2 \xrightarrow{p} V(0)^2$$
and
$$\tau^{-2}\sum_{s=\tau_0+1}^{\tau+\tau_0}\nu_{\tau,s-1}^2 = \tau^{-1}\sum_{s=1}^{\tau}V_{\tau n}\Big(\frac{s}{\tau}\Big)^2 = \int_0^1 V_{\tau n}(r)^2\,dr$$
such that by the continuous mapping theorem and (62) it follows that
$$\tau^{-1}\sum_{s=\tau_0+1}^{\tau+\tau_0}\nu_{\tau,s-1}\eta_s \Rightarrow \frac{1}{2}V_{\gamma,V(0)}(1)^2 - \frac{1}{2}V(0)^2 - \gamma\int_0^1 V_{\gamma,V(0)}(r)^2\,dr - \frac{\sigma^2}{2}. \quad (66)$$
An application of Itô calculus to $V_{\gamma,V(0)}(r)^2/2$ shows that the RHS of (66) is equal to $\sigma\int_0^1 V_{\gamma,V(0)}\,dW_\nu$, which also appears in Kurtz and Protter (1991, Equation 3.10). However, note that the results in Kurtz and Protter (1991) do not establish stable convergence and thus do not directly apply here. When $V(0) = 0$ these expressions are the same as in Phillips (1987, Equation 8). It then is a further consequence of the continuous mapping theorem that
$$\Big(V_{\tau n}(r), s^y_{ML}, \tau^{-1}\sum_{s=\tau_0+1}^{\tau+\tau_0}\nu_{\tau,s-1}\eta_s\Big) \Rightarrow \Big(V_{\gamma,V(0)}(r), \Omega_y^{1/2}W_y(1), \sigma\int_0^1 V_{\gamma,V(0)}(r)\,dW_\nu(r)\Big) \quad (\mathcal{C}\text{-stably}).$$

B.6 Proof of Theorem 4

For $\tilde{s}_{it}(\phi) = \big(\tilde{s}^y_{it}(\theta,\rho)', \tilde{s}^\nu_{it,\rho}(\rho)'\big)'$ we note that in the case of the unit root model
$$\frac{\partial\tilde{s}_{it}(\phi)}{\partial\phi'} = \begin{pmatrix} \partial\tilde{s}^y_{it}(\theta,\rho)/\partial\theta' & \partial\tilde{s}^y_{it}(\theta,\rho)/\partial\rho' \\ 0 & \partial\tilde{s}^\nu_{it,\rho}(\rho)/\partial\rho' \end{pmatrix}.$$
Defining
$$A^y_{\tau n}(\phi) = \sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\frac{\partial\tilde{s}^y_{it}(\phi)}{\partial\phi'}D_{n\tau}$$
we have, as before, for some $\tilde{\phi}$ with $\|\tilde{\phi}-\phi_0\| \le \|\hat{\phi}-\phi_0\|$, that for

$$A_{\tau n}(\phi) = \begin{pmatrix} A^y_{\tau n}(\phi) \\ \big(0, \; -\tau^{-2}\sum_{t=\tau_0}^{\tau_0+\tau}\nu_{\tau,t}^2\big) \end{pmatrix},$$
we have
$$D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big) = -A_{\tau n}\big(\tilde{\phi}\big)^{-1}\sum_{t=\min(1,\tau_0+1)}^{\max(T,\tau_0+\tau)}\sum_{i=1}^{n}\tilde{s}_{it}(\phi_0).$$
Using the representation
$$\tau^{-2}\sum_{t=\tau_0}^{\tau_0+\tau}\nu_{\tau,t}^2 = \int_0^1 V_{\tau n}(r)^2\,dr,$$
it follows from the continuous mapping theorem and Theorem 3 that
$$\Big(V_{\tau n}(r), s^y_{ML}, A^y_{\tau n}(\phi_0), \int_0^1 V_{\tau n}(r)^2\,dr, \tau^{-1}\sum_{s=\tau_0+1}^{\tau+\tau_0}\nu_{\tau,s-1}\eta_s\Big) \quad (67)$$
$$\Rightarrow \Big(V_{\gamma,V(0)}(r), \Omega_y^{1/2}W_y(1), A^y(\phi_0), \int_0^1 V_{\gamma,V(0)}(r)^2\,dr, \sigma\int_0^1 V_{\gamma,V(0)}\,dW_\nu\Big) \quad (\mathcal{C}\text{-stably}).$$
The partitioned inverse formula implies that

$$A(\phi_0)^{-1} = \begin{pmatrix} A_{y,\theta}^{-1} & A_{y,\theta}^{-1}A_{y,\rho}\Big(\int_0^1 V_{\gamma,V(0)}(r)^2\,dr\Big)^{-1} \\ 0 & -\Big(\int_0^1 V_{\gamma,V(0)}(r)^2\,dr\Big)^{-1} \end{pmatrix}. \quad (68)$$
By Condition 9, (67) and the continuous mapping theorem it follows that
$$D_{n\tau}^{-1}\big(\hat{\phi}-\phi_0\big) \Rightarrow -A(\phi_0)^{-1}\begin{pmatrix} \Omega_y^{1/2}W_y(1) \\ \sigma\int_0^1 V_{\gamma,V(0)}\,dW_\nu \end{pmatrix}. \quad (69)$$
The result now follows immediately from (68) and (69).
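The autoregressive limit theory used in the proofs of Theorems 3 and 4 can be visualized with a small simulation. The sketch below is our own illustration (all parameter values are hypothetical): it simulates the local-to-unity AR(1) $\nu_t = e^{\gamma/\tau}\nu_{t-1} + \eta_t$ with $\nu_0 = 0$, for which (62) implies that $\tau^{-1/2}\nu_\tau$ converges to $\sigma\int_0^1 e^{(1-s)\gamma}dW_\nu(s)$, a centered Gaussian variate with variance $\sigma^2(e^{2\gamma}-1)/(2\gamma)$, and compares the Monte Carlo variance with that limit.

```python
import numpy as np

rng = np.random.default_rng(1)

def scaled_endpoint(gamma=-1.0, sigma=1.0, tau=3000, reps=5000):
    """Simulate nu_t = exp(gamma/tau) * nu_{t-1} + eta_t with nu_0 = 0 and
    return tau^{-1/2} nu_tau across Monte Carlo replications."""
    rho = np.exp(gamma / tau)
    nu = np.zeros(reps)
    for _ in range(tau):
        nu = rho * nu + sigma * rng.standard_normal(reps)
    return nu / np.sqrt(tau)

gamma = -1.0
draws = scaled_endpoint(gamma=gamma)
# Limit variance of the Ornstein-Uhlenbeck endpoint (sigma = 1 here).
limit_var = (np.exp(2 * gamma) - 1) / (2 * gamma)
print(round(draws.var(), 3), round(limit_var, 3))
```

With $\gamma = -1$ the limit variance is $(1-e^{-2})/2 \approx 0.432$, and the simulated variance should match it up to Monte Carlo error.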

B.7 Proof of (53)

We will follow Billingsley (1968, p. 208). Let
$$\xi_{\tau,t} = \lambda_\nu'\big(\psi^\nu_{\tau,t} - E\big[\psi^\nu_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]\big) = \sqrt{\tau}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t},$$
$$\xi^u_{\tau,t} = \xi_{\tau,t}1(|\xi_{\tau,t}| \le u),$$
$$\eta^u_{\tau,t} = \xi^u_{\tau,t} - E\big[\xi^u_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big],$$
$$\delta^u_{\tau,t} = \xi_{\tau,t} - \eta^u_{\tau,t} = \xi_{\tau,t} - \xi^u_{\tau,t} - E\big[\xi_{\tau,t} - \xi^u_{\tau,t} \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0))n+1}\big]$$
and note that the expectation in (53) can be written as
$$E\Big[\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t}\Big| \ge c\Big)\Big] = E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\xi_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\xi_{\tau,t}\Big| \ge c\Big)\Big]. \quad (70)$$

We will use the fact that

$$\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\xi_{\tau,t}\Big|^2 \le 2\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^2 + 2\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2, \quad (71)$$
which also implies that
$$1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\xi_{\tau,t}\Big| \ge c\Big) = 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\xi_{\tau,t}\Big|^2 \ge c^2\Big)$$
$$\le 1\Big(2\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^2 \ge \frac{c^2}{2}\Big) + 1\Big(2\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2 \ge \frac{c^2}{2}\Big)$$
$$= 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big) + 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big). \quad (72)$$

Combining (70), (71) and (72), we obtain
$$E\Big[\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t}\Big| \ge c\Big)\Big]$$
$$\le 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big] \quad (73)$$
$$+ 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big] \quad (74)$$
$$+ 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big] \quad (75)$$
$$+ 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big]. \quad (76)$$

We will bound each of the terms (73) - (76). First, we have
$$(73) = 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big] \le 2\cdot\frac{4}{c^2}E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^4\Big] \le 2\cdot\frac{4}{c^2}\Big(\frac{4}{3}\Big)^4\frac{1}{\tau^2}E\Big[\Big|\sum_{t=k+1}^{k+\tau}\eta^u_{\tau,t}\Big|^4\Big]. \quad (77)$$
We can note that
$$E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^4\Big] \le \Big(\frac{4}{3}\Big)^4\frac{1}{\tau^2}E\Big[\Big|\sum_{t=k+1}^{k+\tau}\eta^u_{\tau,t}\Big|^4\Big]$$
by using Corollary 4. We can then note the fact that by construction, (i) $\eta^u_{\tau,t}$ is a martingale difference and (ii) it is bounded by $2u$, which allows us to follow Billingsley's (1968, p. 207) argument, leading to $\frac{1}{\tau^2}E\big[\big|\sum_{t=k+1}^{k+\tau}\eta^u_{\tau,t}\big|^4\big] \le 6(2u)^4$. Therefore, we have
$$E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^4\Big] \le \Big(\frac{4}{3}\Big)^4\cdot 6(2u)^4. \quad (78)$$
Combining (77) and (78), we obtain
$$(73) \le \frac{2\cdot 4}{c^2}\Big(\frac{4}{3}\Big)^4 6(2u)^4 = \frac{Cu^4}{c^2}, \quad (79)$$
where $C$ denotes a generic finite constant.

2 1 k+s u 1 k+s u c (74) = 2E max δτ,t 1 max δτ,t s τ √τ t=k+1 · s τ √τ t=k+1 ≥ 2 " ≤  ≤ # P P 2 1 k+s u 2E max δτ,t . (80) ≤ s τ √τ t=k+1 " ≤ # P u Using Lemma 1 in Appendix C, and the fact that by construction, δτ,t is a martingale difference, we can conclude that

2 2 1 k+s u 1 k+s u E max δτ,t = E max δτ,t s τ √τ t=k+1 τ s τ t=k+1 " ≤ #  ≤ 

P 1 P 2 4E k+τ δu ≤ τ · t=k+1 τ,t    4 P 2 = k+τ E δu . τ t=k+1 τ,t P h  i Now, note that

u 2 u u 2 E δτ,t = E ξτ,t ξτ,t E ξτ,t ξτ,t τn,(t min(1,τ0)n+1) − − − G − h  i h u  u 2  i E ξτ,t ξτ,t E ξτ,t ξτ,t ≤ − − − h 2 i E ξ ξu   ≤ τ,t − τ,t h i = E (ξ ξ 1( ξ u))2 τ,t − τ,t | τ,t| ≤ = E (ξ 1( ξ >u))2  τ,t | τ,t| = E ξ2 1( ξ >u)  τ,t | τ,t| 1 2+δ 1  2+δ δ E ξτ,t δ sup E ξτ,t , ≤ u ≤ u t     where the first inequality is by Billingsley’s (1968, p. 184) Lemma 1. It follows that

$$E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2\Big] \le \frac{C}{u^\delta}\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big]. \quad (81)$$
Combining (80) and (81), we obtain
$$(74) \le \frac{C}{u^\delta}\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big]. \quad (82)$$

Third, we have
$$(75) = 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big]$$
$$\le 2\Big(E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big|^4\Big]\Big)^{1/2}\Big(E\Big[1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)^2\Big]\Big)^{1/2}$$
$$= 2\Big(\Big(\frac{4}{3}\Big)^4\cdot 6(2u)^4\Big)^{1/2}\Big(P\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big)^{1/2}, \quad (83)$$
where the first inequality is by Cauchy-Schwarz and the last equality is by (78). We further have
$$P\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big) \le \frac{4}{c^2}E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2\Big] \le \frac{4}{c^2}\cdot\frac{C}{u^\delta}\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big] = \frac{C}{c^2 u^\delta}\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big], \quad (84)$$
where the first inequality is by Markov and the second inequality is by (81). Combining (83) and (84), we obtain
$$(75) \le \frac{Cu^{2-\delta/2}}{c}\Big(\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big]\Big)^{1/2}. \quad (85)$$
Fourth, we have

$$(76) = 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\eta^u_{\tau,t}\Big| \ge \frac{c}{2}\Big)\Big] \le 2E\Big[\max_{s\le\tau}\Big|\frac{1}{\sqrt{\tau}}\sum_{t=k+1}^{k+s}\delta^u_{\tau,t}\Big|^2\Big] \le \frac{C}{u^\delta}\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big], \quad (86)$$
where the second inequality is by (81). Combining (79), (82), (85) and (86), we obtain
$$E\Big[\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t}\Big|^2 \cdot 1\Big(\max_{s\le\tau}\Big|\sum_{t=k+1}^{k+s}\lambda_\nu'\ddot{\psi}^\nu_{\tau,t}\Big| \ge c\Big)\Big] \le \frac{Cu^4}{c^2} + \frac{C}{u^\delta}\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big] + \frac{Cu^{2-\delta/2}}{c}\Big(\sup_t E\big[|\xi_{\tau,t}|^{2+\delta}\big]\Big)^{1/2}.$$
In view of (53), it suffices to prove that we can choose $u \to \infty$ as a function of $c$ such that the terms above all converge to zero as $c \to \infty$. This we can do by choosing $u = c^{1/3}$, for example, since then $Cu^4/c^2 = Cc^{-2/3} \to 0$, $Cu^{2-\delta/2}/c = Cc^{-(1+\delta/2)/3} \to 0$ and $C/u^\delta = Cc^{-\delta/3} \to 0$.

C A Maximal Inequality for Triangular Arrays

In this section we extend Hall and Heyde (1980) Theorem 2.1 and Corollary 2.1 to the case of triangular arrays of martingales. Let $\mathcal{F}_{\tau,s}$ be an increasing filtration such that for each $\tau$, $\mathcal{F}_{\tau,s} \subset \mathcal{F}_{\tau,s+1}$. Let $S_{\tau,s}$ be adapted to $\mathcal{F}_{\tau,s}$ and assume that for all $\tau$ and $k > 0$, $E[S_{\tau,s+k} \mid \mathcal{F}_{\tau,s}] = S_{\tau,s}$. Since for $p \ge 1$, $|\cdot|^p$ is convex, it follows by Jensen's inequality for conditional expectations that $E[|S_{\tau,s+k}|^p \mid \mathcal{F}_{\tau,s}] \ge |E[S_{\tau,s+k} \mid \mathcal{F}_{\tau,s}]|^p = |S_{\tau,s}|^p$. Thus, for each $\tau$, $\{|S_{\tau,s}|^p, \mathcal{F}_{\tau,s}\}$ is a submartingale. We say that $\{|S_{\tau,s}|^p, \mathcal{F}_{\tau,s}\}$ is a triangular array of submartingales. If $\{S_{\tau,s}, \mathcal{F}_{\tau,s}\}$ is a triangular array of submartingales then the same holds for $\{|S_{\tau,s}|^p, \mathcal{F}_{\tau,s}\}$. The following Lemma extends Theorem 2.1 of Hall and Heyde (1980) to triangular arrays of submartingales.

Lemma 1 For each $\tau$, let $\{S_{\tau,s}, \mathcal{F}_{\tau,s}\}$ be a submartingale with respect to an increasing filtration $\mathcal{F}_{\tau,s}$. Then for each real $\lambda$ and each $\tau$ it follows that
$$\lambda P\Big(\max_{s\le\tau}S_{\tau,s} > \lambda\Big) \le E\Big[S_{\tau,\tau}1\Big(\max_{s\le\tau}S_{\tau,s} > \lambda\Big)\Big].$$
Proof. The proof closely follows Hall and Heyde, with the necessary modifications. Define the event
$$E_\tau = \Big\{\max_{s\le\tau}S_{\tau,s} > \lambda\Big\} = \bigcup_{i=1}^{\tau}\Big\{S_{\tau,i} > \lambda;\ \max_{1\le j<i}S_{\tau,j} \le \lambda\Big\} = \bigcup_{i=1}^{\tau}E_{\tau,i}.$$
Corollary 2.1 in Hall and Heyde (1980) now follows in the same way. If $\{S_{\tau,s}, \mathcal{F}_{\tau,s}\}$ is a martingale triangular array then $\{|S_{\tau,s}|^p, \mathcal{F}_{\tau,s}\}$ is a submartingale for $p \ge 1$.

Corollary 3 For each $\tau$, let $\{S_{\tau,s}, \mathcal{F}_{\tau,s}\}$ be a triangular array of martingales with respect to an increasing filtration $\mathcal{F}_{\tau,s}$. Then for each real $\lambda$, each $p \ge 1$ and each $\tau$ it follows that
$$\lambda^p P\Big(\max_{s\le\tau}|S_{\tau,s}| > \lambda\Big) \le E[|S_{\tau,\tau}|^p].$$
Proof. Note that $P(\max_{s\le\tau}|S_{\tau,s}| > \lambda) = P(\max_{s\le\tau}|S_{\tau,s}|^p > \lambda^p)$. Then, using the fact that $\{|S_{\tau,s}|^p, \mathcal{F}_{\tau,s}\}$ is a submartingale for $p \ge 1$, apply Lemma 1. Below is a triangular array counterpart of Hall and Heyde's Theorem 2.2.

Corollary 4 For each $\tau$, let $\{S_{\tau,s}, \mathcal{F}_{\tau,s}\}$ be a triangular array of martingales with respect to an increasing filtration $\mathcal{F}_{\tau,s}$. Then for each $p > 1$ and each $\tau$ it follows that
$$E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big] \le \Big(\frac{p}{p-1}\Big)^p E[|S_{\tau,\tau}|^p].$$
Proof. This proof is based on Hall and Heyde (1980, proof of Theorem 2.2). Note that by the layer-cake representation of an integral, we have
$$E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big] = \int_0^\infty P\Big(\max_{s\le\tau}|S_{\tau,s}|^p > t\Big)\,dt = \int_0^\infty P\Big(\max_{s\le\tau}|S_{\tau,s}| > t^{1/p}\Big)\,dt.$$
With the change of variable $x = t^{1/p}$, we get
$$E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big] = p\int_0^\infty x^{p-1}P\Big(\max_{s\le\tau}|S_{\tau,s}| > x\Big)\,dx.$$
Because $|S_{\tau,s}|$ is a submartingale, we can apply Lemma 1 and obtain
$$E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big] \le p\int_0^\infty x^{p-2}E\Big[|S_{\tau,\tau}|1\Big(\max_{s\le\tau}|S_{\tau,s}| > x\Big)\Big]\,dx = pE\Big[|S_{\tau,\tau}|\int_0^{\max_{s\le\tau}|S_{\tau,s}|}x^{p-2}\,dx\Big]$$
$$= \frac{p}{p-1}E\Big[|S_{\tau,\tau}|\Big(\max_{s\le\tau}|S_{\tau,s}|\Big)^{p-1}\Big] \le \frac{p}{p-1}\big(E[|S_{\tau,\tau}|^p]\big)^{1/p}\Big(E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big]\Big)^{1-1/p},$$
where the last inequality is an application of Hölder's inequality with $q = \frac{1}{1-\frac{1}{p}} = \frac{p}{p-1}$. Dividing both sides by $\big(E[\max_{s\le\tau}|S_{\tau,s}|^p]\big)^{1-1/p}$, we get
$$\Big(E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big]\Big)^{1/p} \le \frac{p}{p-1}\big(E[|S_{\tau,\tau}|^p]\big)^{1/p}$$
or
$$E\Big[\max_{s\le\tau}|S_{\tau,s}|^p\Big] \le \Big(\frac{p}{p-1}\Big)^p E[|S_{\tau,\tau}|^p].$$
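Corollary 4 can be sanity-checked numerically. The sketch below is our own illustration with hypothetical values: it simulates a Gaussian random-walk martingale (one fixed row of a triangular array) and verifies the bound $E[\max_{s\le n}|S_s|^p] \le (p/(p-1))^p E[|S_n|^p]$ for $p = 2$.

```python
import numpy as np

rng = np.random.default_rng(2)

def doob_check(p=2.0, n=200, reps=20000):
    """Monte Carlo estimates of both sides of the L^p maximal inequality
    for the martingale S_s = sum_{t<=s} eps_t with iid N(0,1) increments."""
    paths = rng.standard_normal((reps, n)).cumsum(axis=1)
    lhs = (np.abs(paths).max(axis=1) ** p).mean()          # E[max_s |S_s|^p]
    rhs = (p / (p - 1)) ** p * (np.abs(paths[:, -1]) ** p).mean()
    return lhs, rhs

lhs, rhs = doob_check()
print(lhs < rhs)
```

With $p = 2$ the constant is $4$, and the estimated left-hand side sits well below the right-hand side, consistent with the corollary.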

D Proof of (61)

Lemma 2 Assume that Conditions 7, 8 and 9 hold. For $r, s \in [0,1]$ fixed and as $\tau \to \infty$ it follows that
$$\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}\Big((\psi_{\tau,t})^2 - e^{-2\gamma t/\tau}E\big[\eta_t^2 \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big]\Big) \xrightarrow{p} 0.$$

Proof. By Hall and Heyde (1980, Theorem 2.23) we need to show that for all $\varepsilon > 0$
$$\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}e^{-2\gamma t/\tau}E\big[\eta_t^2 1\big(\tau^{-1/2}e^{-\gamma t/\tau}|\eta_t| > \varepsilon\big) \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big] \xrightarrow{p} 0. \quad (87)$$
By Condition 7iv) it follows that for some $\delta > 0$
$$E\Bigg[\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}e^{-2\gamma t/\tau}E\big[\eta_t^2 1\big(\tau^{-1/2}e^{-\gamma t/\tau}|\eta_t| > \varepsilon\big) \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big]\Bigg] \le \tau^{-(1+\delta/2)}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}\frac{e^{-\gamma t(2+\delta)/\tau}}{\varepsilon^\delta}E\big[|\eta_t|^{2+\delta}\big]$$
$$\le \sup_t E\big[|\eta_t|^{2+\delta}\big]\frac{[\tau r]-[\tau s]}{\tau^{1+\delta/2}\varepsilon^\delta}e^{(2+\delta)|\gamma|} \to 0.$$
This establishes (87) by the Markov inequality. Since $\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}e^{-2\gamma t/\tau}E\big[\eta_t^2 \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big]$ is uniformly integrable by (60) and (64), it follows from Hall and Heyde (1980, Theorem 2.23, Eq 2.28) that
$$E\Bigg[\bigg|\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}\Big((\psi_{\tau,t})^2 - e^{-2\gamma t/\tau}E\big[\eta_t^2 \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big]\Big)\bigg|\Bigg] \to 0.$$
The result now follows from the Markov inequality.

Lemma 3 Assume that Conditions 7, 8 and 9 hold. For $r, s \in [0,1]$ fixed and as $\tau \to \infty$ it follows that
$$\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}e^{-2\gamma t/\tau}E\big[\eta_t^2 \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big] \xrightarrow{p} \sigma^2\int_s^r \exp(-2\gamma t)\,dt.$$
Proof. The proof closely follows Chan and Wei (1987, p. 1060-1062) with a few necessary adjustments. Fix $\delta > 0$ and choose $s = t_0 \le t_1 \le \ldots \le t_k = r$ such that
$$\max_{i\le k}\big|e^{-2\gamma t_i} - e^{-2\gamma t_{i-1}}\big| < \delta.$$
This implies
$$\bigg|\int_s^r e^{-2\gamma t}\,dt - \sum_{i=1}^{k}e^{-2\gamma t_i}(t_i - t_{i-1})\bigg| \le \sum_{i=1}^{k}\int_{t_{i-1}}^{t_i}\big|e^{-2\gamma t} - e^{-2\gamma t_i}\big|\,dt \le \delta. \quad (88)$$

Let $I_i = \{l : [\tau t_{i-1}] < l \le [\tau t_i]\}$ and, following Chan and Wei (1987), decompose the difference between $\tau^{-1}\sum_{t=\tau_0+[\tau s]+1}^{\tau_0+[\tau r]}e^{-2\gamma t/\tau}E\big[\eta_t^2 \mid \mathcal{G}_{\tau n,(t-\min(1,\tau_0)-1)n}\big]$ and $\sigma^2\int_s^r e^{-2\gamma t}\,dt$ into terms $I_n$, $II_n$ and $III_n$. For $III_n$ we have that $e^{-2\gamma[\tau t_{i-1}]/\tau} \to e^{-2\gamma t_{i-1}}$ as $\tau \to \infty$. In other words, there exists a $\tau'$ such that for all $\tau \ge \tau'$, $\big|e^{-2\gamma[\tau t_{i-1}]/\tau} - e^{-2\gamma t_{i-1}}\big| \le \delta$ and by (88)
$$|III_n| \le 2\delta.$$
We also have by Condition 7vii) that
$$\tau^{-1}\sum_{l\in I_i}E\big[\eta_l^2 \mid \mathcal{G}_{\tau n,(l-\min(1,\tau_0)-1)n}\big] \to \sigma^2(t_i - t_{i-1})$$
as $\tau \to \infty$, such that by $\max_{i\le k}e^{-2\gamma[\tau t_{i-1}]/\tau} \le e^{2|\gamma|}$,

$$|II_n| \le e^{2|\gamma|}\sum_{i=1}^{k}\bigg|\tau^{-1}\sum_{l\in I_i}E\big[\eta_l^2 \mid \mathcal{G}_{\tau n,(l-\min(1,\tau_0)-1)n}\big] - \sigma^2(t_i - t_{i-1})\bigg| = o_p(1).$$
Finally, there exists a $\tau'$ such that for all $\tau \ge \tau'$ it follows that
$$\max_{i\le k}\max_{l\in I_i}\big|e^{-2\gamma l/\tau} - e^{-2\gamma[\tau t_{i-1}]/\tau}\big| \le \max_{i\le k}\big|e^{-2\gamma[\tau t_i]/\tau} - e^{-2\gamma[\tau t_{i-1}]/\tau}\big|$$
$$\le 2\max_{i\le k}\big|e^{-2\gamma[\tau t_i]/\tau} - e^{-2\gamma t_i}\big| + \max_{i\le k}\big|e^{-2\gamma t_i} - e^{-2\gamma t_{i-1}}\big| \le 2\delta + \delta = 3\delta.$$

We conclude that

$$|I_n| \le 3\delta\sum_{i=1}^{k}\tau^{-1}\sum_{l\in I_i}E\big[\eta_l^2 \mid \mathcal{G}_{\tau n,(l-\min(1,\tau_0)-1)n}\big] = 3\delta\sigma^2(1 + o_p(1)).$$

The remainder of the proof is identical to Chan and Wei (1987, p. 1062).
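The deterministic core of Lemma 3 is a Riemann-sum approximation: once the conditional variances settle at $\sigma^2$, the weighted average $\tau^{-1}\sum e^{-2\gamma t/\tau}$ converges to $\int_s^r e^{-2\gamma t}\,dt$. The following sketch, our own illustration with hypothetical values, checks this numerically.

```python
import numpy as np

def riemann_vs_integral(gamma=-0.8, tau=100000, s=0.2, r=0.9):
    """Compare tau^{-1} * sum over t in (tau*s, tau*r] of exp(-2 gamma t/tau)
    with the limiting integral int_s^r exp(-2 gamma t) dt."""
    t = np.arange(int(tau * s) + 1, int(tau * r) + 1)
    riemann = np.exp(-2.0 * gamma * t / tau).sum() / tau
    integral = (np.exp(-2.0 * gamma * s) - np.exp(-2.0 * gamma * r)) / (2.0 * gamma)
    return riemann, integral

riemann, integral = riemann_vs_integral()
print(round(riemann, 5), round(integral, 5))
```

The discrepancy is of order $1/\tau$, so with $\tau = 10^5$ the two numbers agree to several decimal places.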
