<<

Estimation and Inference on in a Latent High-dimensional Gaussian Vector Autoregressive Model∗

Yanqin Fan,† Fang Han,‡ and Hyeonseok Park§

August 28, 2020

Abstract This paper develops estimation and inference methods for the transition matrices of a latent high-dimensional stationary Gaussian vector autoregressive process when the observed process is an increasing but otherwise unknown transformation of the latent process. Our estimator is based on rank estimators of the large and auto- matrices of the latent process. We derive rates of convergence of our estimator based on which we develop inference for Granger causality. Numerical results demonstrate the efficacy of the proposed methods. Although our focus is on the latent process, by the nature of rank estimators, all the methods developed directly apply to the observable process which is a stationary semiparametric high-dimensional Gaussian copula process. In technical terms, our analysis relies heavily on newly developed exponential inequalities for (degenerate) U- under α-mixing condition.

Keywords: High-dimensional ; Sparse transition matrix; α-mixing; Semiparametric Gaussian copula process; De-biasing inference; Kendall’s tau. JEL Codes: C32, C51

∗We thank Yandi Shen, Kevin Song, participants of the Sixth Seattle-Vancouver Conference, and seminar participants at Penn State University for helpful discussions. Numerical computation in this paper is done via CSDE Simulation Cluster, which has received partial support from an Eunice Kennedy Shriver National Institute of Child Health and Human Development research infrastructure grant, P2C HD042828. †Corresponding author: Department of Economics, University of Washington, Seattle, WA 98195, USA; email: [email protected]. ‡Department of Statistics, University of Washington, Seattle, WA 98195, USA; e-mail: [email protected]. §Department of Economics, University of Washington, Seattle, WA 98195, USA; email: [email protected] 1 Introduction

Estimation and inference in high-dimensional stationary Gaussian vector autoregressive (VAR) models have been considered in recent works such as Negahban and Wainwright(2011), Han et al.(2015), and Basu and Michailidis(2015) under the commonly adopted assumption that the underlying process is observable. In many applications in economics and finance, however, the process following a structural model is unobservable. Instead, a process that is deterministically related to the latent process is observable. For example, in structural asset pricing models, the observed equity price is an increasing transformation (unknown) of the latent firm’s market value; in structural auction models, the observed bid is an increasing transformation (unknown) of the bidder’s private value. This paper considers such a scenario where the latent process follows a high-dimensional stationary Gaussian VAR(p) model for a finite (fixed) p and the observed process is an increasing but otherwise unknown transformation of the latent process. We develop simple methods for the estimation and inference of the transition matrices characterizing the latent process. To simplify notation and exposition, we focus on developing our methods and theory for the latent VAR(1) model, to be described below, and refer to it as the latent Gaussian VAR model. We will present extensions to general VAR(p) model for any finite (fixed) p in Section5. In detail, let d {Zt}t∈Z, with Z standing for the set of integers and Zt in the d-dimensional real space R , denote a high-dimensional latent process following a stationary Gaussian VAR model with transition matrix A:

Zt = AZt−1 + Et, Zt ∼ N(0, Σ0). (1.1)

d Further, let {Xt}t∈Z, with Xt ∈ R , denote an observable process such that

> Xt := f(Zt) = (f1(Zt,1), . . . , fd(Zt,d)) , where f := {f1, . . . , fd} consists of univariate strictly increasing unknown functions. The dimension d ≡ dn, the parameters A and Σ0 are all allowed to change with the sample size n and dn may even be larger than n. The scenario thus falls into the application regime of a high-dimensional triangular array framework, which has been explicitly discussed recently in Han and Wu(2019). For notational compactness, we omit the dependence of all model parameters on the sample size n in the rest of the paper. Following Negahban and Wainwright(2011), Han et al.(2015), and Basu and Michailidis(2015), we impose sparsity constraint on the transition matrix A and aim at estimating, and also recovering n the sparsity pattern of A in model (1.1) based only on observations {Xt}t=1. Existing methods are now inapplicable because we don’t have access to the latent process {Zt}t∈Z. Instead, we propose a n new estimator of A based on {Xt}t=1, bypassing the unknown transformations {f1, . . . , fd} without estimating them. It is formulated in two steps. In the first step, we construct rank-based estimators

1 n of the variance and first autocovariance matrices of {Zt}t∈Z from {Xt}t=1 avoiding the estimation of unknown transformations {f1, . . . , fd}. This is made possible by the nature of rank estimators, which are invariant to increasing transformations, and the relation between the (Pearson) covariance of a bivariate Gaussian distribution and its Kendall’s tau. In the second step, we construct a penalized estimator of A from rank estimators of the variance and first autocovariance matrices of

{Zt}t∈Z using the relation between A and population variance-autocovariance matrices, known as the Yule-Walker equation. To justify the proposed estimator, we first derive its rate of convergence and show its consistency to the truth in a high-dimensional asymptotic regime. Under the adopted assumptions, the latent n process {Zt}t∈Z and the observable process {Xt}t=1 are both α-mixing with geometric decaying rates. This paves the way for theoretical studies. However, compared with regression-based estimators of the variance and first autocovariance matrices in Negahban and Wainwright(2011), Han et al.(2015), and Basu and Michailidis(2015), establishing the asymptotic properties of our rank estimators appears to be much more challenging due to their dependence on large random matrices with elements given by U-statistics of α-mixing processes. The analysis is now possible thanks to a recent progress made in Shen et al.(2019), who laid out the path to establishing exponential inequalities for (degenerate) U-statistics of α-mixing processes. Building on a recently established de-biasing inference framework (Neykov et al., 2018a), we further develop tests for non-Granger causality in model (1.1), which is equivalent to testing non- zeroness of entries of A (L¨utkepohl, 2005). This is via combining the de-biasing strategy with a circular-block-bootstrap (CBB) based variance estimator. The latter has been studied in, e.g, Dehling and Wendler(2010) and Fan et al.(2016), although the analysis here is more demanding due to the complicated form of the target of interest. In addition, when the presence of Granger causality is detected, a sign-consistent estimator of Granger causality is immediate following the strategy in Qiu et al.(2015). Numerical results are presented to illustrate the finite sample behavior of the estimator of A and also the proposed tests. Extensions of the methods and theory to latent VAR(p) process for any finite and fixed p are also provided.

It is worthwhile pointing out that, although our focus is on the latent process {Zt}t∈Z, the estimation and inference methods we develop for the latent process {Zt}t∈Z are also valid for the observable process {Xt}t∈Z. Model (1.1) and the relation Xt = f(Zt) imply that the observable process {Xt}t∈Z follows a stationary semiparametric Gaussian copula VAR model with unknown marginal distributions. Unlike the Gaussian VAR model, the Gaussian copula VAR model does not impose any finite restrictions on the process {Xt}t∈Z and hence is more suitable for modeling potentially fat-tailed time series. Since the observable process {Xt}t∈Z is nonlinear, this paper also contributes to the growing literature on modelling the possibly nonlinear temporal dependence (Hsing and Wu, 2004; Beare, 2010; Patton, 2012, 2013; Fan and Patton, 2014; Wang and

2 Xia, 2015), and in particular, the literature on copula-based nonlinear time series models including both the univariate Gaussian copula Markov chains introduced in Chen and Fan(2006b) and studied further in Chen et al.(2009a) and Chen et al.(2009b), and copula-based multivariate time series models studied in Chen and Fan(2006a), Lee and Long(2009), Min and Czado(2014), R´emillard et al.(2012), and Chen et al.(2018). In contrast to the current paper, the model of Xt (and hence also the dimension d) in all these works is assumed to be fixed, while the high-dimensional triangular array framework in this paper induces more challenges. The rest of this paper is organized as follows. Section2 specifies the latent model and the observable process, introduces our estimator, and establishes its rates of convergence. Section3 develops two de-biasing tests for non-Granger causality and introduces a sign consistent estimator of Granger causality. Section4 presents numerical results. Section5 extends the methods to stationary latent VAR(p) processes for any finite and fixed p. The last section concludes and discusses extensions. Technical proofs are relegated to a series of appendices. We now introduce notations used in this paper. For any positive integer d, let [d] := {1, 2, . . . , d}. > d d×d Let v = (v1, . . . , vd) ∈ R and M = [Mjk] ∈ R be a vector and a matrix of interest. We denote vI and MI,J to be the subvector of v and submatrix of M whose entries are indexed by a set I ⊂ [d] and sets I,J ⊂ [d]. We denote MI∗ and M∗J to be submatrices of M whose rows are indexed by I and whose columns are indexed by J. For any p ∈ (0, ∞], we define kvkp to be Pd the vector Lp norm of v. Let kMk∞ := maxj k=1|Mjk| be the matrix operator ∞-norm, kMk2 be the matrix spectral norm, and kMkmax be the elementwise maximum norm. For a symmetric matrix M, let λmax(M) and λmin(M) denote the largest and smallest eigenvalues of M. For any > univariate function f(·): R → R, define f(v) = (f(v1), . . . , f(vd)) and f(M) := [f(Mjk)]. Let diag(M) be the diagonal matrix with diagonals M11,...,Mdd. Let Id be the d-dimensional identity d matrix. For a random vector Xt ∈ R , let Xt,j denote the j-th element of Xt. For any x ∈ R, denote sign(x) = 1(x > 0) − 1(x < 0), with 1(·) standing for the indicator function. For any two real sequences {an} and {bn}, we write an = O(bn) if there exists an absolute positive constant C such that |an| ≤ C|bn| for any large enough n. We write an  bn if both an = O(bn) and bn = O(an) hold. We write an = o(bn) if for any absolute positive constant C, we have |an| ≤ C|bn| for any large enough n. We write an = OP(bn) and an = oP(bn) if an = O(bn) and an = o(bn) hold stochastically.

2 Estimation Method and Asymptotic Properties

2.1 The Model, Granger Causality, and α-mixing Property

The latent VAR model we study is characterized by (1.1) and Assumption M below. i.i.d Assumption M (Latent Process). (i) diag(Σ0) = Id; (ii) {Et}t∈Z ∼ N(0, ΣE), where ΣE satisfies: 0 < cE < λmin(ΣE) ≤ λmax(ΣE) < CE < ∞ for some absolute constants cE and CE; (iii)

3 The transition matrix A satisfies: kAk2 ≤ CA < 1 for some absolute constant CA and the following sparsity constraint:

( d ) d×d X A ∈ M(s, M) with M(s, M) := M ∈ R : max 1(Mjk 6= 0) ≤ s, kMk∞ ≤ M . (2.1) 1≤j≤d k=1 Here s is a positive integer and M is a positive constant, both of which may depend on d and hence also on n. Assumption M(i) is a normalization condition ensuring identifiability of A and f. It is not essential and is removable if only the sparsity pattern of A is of interest. Elementary matrix inequality gives CA ≤ M. Although it is assumed that CA < 1, there is no such restriction on M, which is allowed to increase to infinity along with d and n. Under Assumption M, it holds that

> 2 > 2 Var(Zt) = Σ0 = ΣE + AΣEA + A ΣE(A ) + ··· , which is also positive definite and the eigenvalues of Σ0 are bounded away from zero and infinity by absolute constants. We will show in Proposition 2.2 below that Assumption M yields that the latent process {Zt}t∈Z is geometrically α-mixing, which implies weak temporal dependence even in high dimensions and is crucial in justifying our proposed methods. The observable process and the sample information are characterized by Assumption S below.

Assumption S (Observable Process and Sample Information). (i) Let {Xt}t∈Z denote the observable process satisfying Xt = f(Zt), where f := {f1, . . . , fd} consists of univariate strictly increasing unknown functions; (ii) Let {Xt : t = 1, 2, . . . , n} denote a length-n segment of {Xt}t∈Z and constitute the accessible.

Viewed as a model for the observable process {Xt}t∈Z, Assumptions M and S(i) define a stationary semiparametric Gaussian copula VAR model. Although the {Zt}t∈Z has finite moments of all orders, the observable process {Xt}t∈Z may not have any finite moments and is thus more suitable for modeling financial time series than a Gaussian VAR model.

For the Gaussian VAR process {Zt}t∈Z, it is a well-known fact that the sequence {Zt,k}t∈Z

Granger causes {Zt,j}t∈Z if and only if the (j, k)-th entry of the transition matrix is nonzero (cf. Corollary 2.2.1 in L¨utkepohl(2005)). The following proposition, which is a direct consequence of

Theorem 6.2 in Qiu et al.(2015), shows that this fact also applies to the observable process {Xt}t∈Z.

Proposition 2.1. Under Model (1.1) with Assumption M(ii) and (iii), and Assumption S(i), we have that the sequence {Xt,k}t∈Z Granger causes {Xt,j}t∈Z if and only if Ajk 6= 0.

We then proceed to characterize an important property of the observable process {Xt}t∈Z, its weak (temporal) dependence. To this end, let’s first introduce the α-mixing measure of dependence.

4 Given a probability space (Ω, F, P), for any two σ-algebras A and B in F, define the α-mixing measure of dependence between A and B as

α(A, B) := sup |P(A ∩ B) − P(A)P(B)|. A∈A,B∈B

For any measurable process denoted as {Wt}t∈Z, we define its α-mixing coefficient as

t ∞ α(n; {Wt}t∈Z) := sup α(G−∞, Gt+n), for each n = 1, 2,..., t∈Z

j ∞ where for arbitrary j ∈ Z, G−∞ := σ(Wt, t ≤ j) and Gj := σ(Wt, t ≥ j) stand for the sigma fields generated by {Wt}t≤j and {Wt}t≥j respectively. With these concepts introduced, we are now ready to present the weak dependence property of

{Xt}t∈Z, i.e., its α-mixing coefficient is uniformly geometrically decaying to 0, which holds even if the dimension d explodes as n → ∞. This proposition hinges on the Gaussian assumption and will be heavily used in building up our theory in the follow-up sections.

Proposition 2.2 (Kolmogorov and Rozanov(1960), explicitly stated as Theorem 3.1 in Han and

Wu(2019)) . Suppose that the latent process {Zt}t∈Z follows the latent Gaussian VAR model (1.1) with Assumption M and the observable process {Xt}t∈Z satisfies Assumption S(i). Then both

{Xt}t∈Z and {Zt}t∈Z are geometrically α-mixing such that

α(n; {Xt}t∈Z) = α(n; {Zt}t∈Z) ≤ κ1 exp(−κ2n), for all n ≥ 1.

q CE Here κ1 = 2 and κ2 = − ln(CA) are two absolute positive constants, where CE, cE, and cE (1−CA) CA are defined in Assumption M (ii) and (iii).

2.2 Estimators

A commonly adopted method to estimate parameters in a latent model is to use the equivalent model for the observable process. In our context, the observable process {Xt}t∈Z is a stationary semiparametric Gaussian copula VAR process with unknown marginal distributions. One seemingly plausible approach is to estimate the transition matrix A and the unknown functions {f1, . . . , fd} jointly by a penalized maximum likelihood estimator (MLE). However, in high dimensions this penalized MLE is theoretically intractable. For example, jointly controlling deviations of the estimator fbj from fj for all j ∈ [d], even assuming the data are i.i.d., is known to be challenging (cf. the discussions in Sections 2.3 and 4.3 in Liu et al.(2012)). Instead, we propose a simple rank-based estimator of the transition matrix A without estimating the unknown functions {f1, . . . , fd}. From a more practical point of view, the Kendall’s-tau-type estimators are also known to be more robust to outliers/noises than likelihood based estimators (cf. the simulations conducted in Section 5.2 in Liu et al.(2012)).

5 > −1 > In detail, the celebrated Yule-Walker formula yields that A = Σ1 Σ0 , where Σ1 := E[ZtZt+1]. This suggests a two-step approach for estimating A, as presented below. > > > Step 1. We construct “consistent estimators” of Σ0 and Σ1. Notice that (Z1 , Z2 ) has the following , ! Σ0 Σ1 Ω = > . Σ1 Σ0

For estimating Σ0 and Σ1, it is then sufficient to estimate Ω. It is well-known – cf. Kruskal(1958, p. 823) – that      ! !>  π X1 − Xf1 X1 − Xf1  Ω = sin  · E sign sign  ,  2 X − X X − X   2 f2 2 f2  | {z } T > > > > > > where (Xf1 , Xf2 ) is an independent copy of (X1 , X2 ) . We propose the following estimator of Ω: π  Ωb = sin Tb , 2 where n−1 n−1  ! !> 2 X X Xi − Xj Xi − Xj Tb = sign sign  (2.2) (n − 1)(n − 2) X − X X − X i=1 j=i+1 i+1 j+1 i+1 j+1 is a high-dimensional matrix of entries formulated as second-order U statistics. We then denote

Σb 0 = Ωb [d],[d] and Σb 1 = Ωb [d],d+[d] to be the estimators of variance and auto-covariance matrices Σ0 and Σ1. > −1 Step 2. Using the formula A = Σ1 Σ0 , we propose to estimate A by the following Dantzig-type linear programming estimator,

d d X X > Ab := argmin |Mjk| s.t. kΣb 0M − Σb 1kmax ≤ λ. (2.3) d×d M∈R j=1 k=1

Here λ is a tuning parameter to encourage sparsity of the output.

2.3 Rates of Convergence

Estimating matrices Σ1, Σ0 via the Kendal’s tau matrix Tb in (2.2) allows us to estimate A without estimating the unknown functions {f1, . . . , fd}. However, in high dimensions, the asymptotic properties of Tb are much harder to establish than the sample covariance matrix used in Negahban and Wainwright(2011), Han et al.(2015), Basu and Michailidis(2015), and more recently in Hecq et al.(2019). Although the asymptotic properties of Tb in a finite-dimensional time series model have been well understood (see, for example, Dehling et al.(2017) and the references therein), to our

6 knowledge, Fan et al.(2016) is the only paper in the literature exploring Tb in a high-dimensional time series triangular array set-up. Their analysis is, however, based on the assumption that the underlying process is φ-mixing. Assuming φ-mixing in our case is unrealistic, as a φ-mixing Gaussian VAR process is m-dependent (cf. Proposition 1 in Section 2.1 of Doukhan(1994)). Instead, our analysis is built upon Proposition 2.2 that a Gaussian VAR process is α-mixing under conditions therein, and uses heavily the newly developed probability tools in Shen et al.(2019) for α-mixing processes. In addition to Assumptions M and S, we impose the following scaling condition on (d, n).

4 Assumption E (i) log(d) log (n)/n = O(1) so that there exists an absolute positive constant K0 4 satisfying log(d) log n ≤ K0n for all large enough n; (ii) log(d)/n = o(1).

Assumption E is very mild, allowing the dimension d to be even exponentially larger than the sample size n. It is added mainly for the purpose of presentational simplicity. With Assumptions M, S, and E, we are now ready to introduce our first theorem, which is nonasymptotic and characterizes the deviation probability of Tb from T.

Theorem 2.1 (Stochastic bound for kTb − Tkmax). Suppose Assumptions M, S, and E hold. Then, for all sufficiently large n, we have   1 −4 −2 −2 kTb − Tkmax ≥ 2z + ≤ 16e d + K2{(n − 1)d} , P n − 2 for any z such that r log(ed) K3 log{(n − 1)d} K4 1 z ≥ Q + √ + √ p n − 1 4 K0 n − 1 4 K0 (n − 1) log(ed) r C log(ed) p 1 + √ + 2C K2 4 K n − 1 (n − 1)2d2 √ 0 √ C K K log{(n − 1)d} C K K 1 √ 2 3 √ 2 4 + 2 2 + p , (2.4) 2 K0 (n − 1) d 2 K0 d2(n − 1) (n − 1) log(ed) in which Q only depends on κ1, κ1 and K0 introduced in Proposition 2.2 and Assumption E, respectively; K2 only depends on cA, cE and CE; K3 only depends on K0; C and K4 are absolute constants. In particular, we have r ! log(ed) kTb − Tkmax = O . P n

Remark 2.2. Because sin(x) is a Lipshitz continuous function, Theorem 2.1 implies that, with probability no smaller than 1 − 0, where

−4 −2 −2 0 := 16e d + K2{(n − 1)d} , (2.5)

7 we have π  1  π  1  kΣb 0 − Σ0kmax ≤ 2z + and kΣb 1 − Σ1kmax ≤ 2z + 2 n − 1 2 n − 1 in which z is chosen to satisfy (2.4). Also, we have r ! r ! log(ed) log(ed) kΣb 0 − Σ0kmax = O and kΣb 1 − Σ1kmax = O . P n P n

Exploiting the proof of Theorem 1 in Han et al.(2015) with Lemmas 1 and 2 therein replaced by the corresponding bounds in Remark 2.2, we immediately have the following theorem that quantifies the deviation of Ab from A.

Theorem 2.3 (Stochastic bounds for kAb − Akmax and kAb − Ak∞). Suppose Assumptions M, S, and E hold. Let the tuning parameter λ in equation (2.3) be chosen such that

π  1  λ ≥ (M + 1) 2z + , 2 n − 2 where z is defined in equation (2.4). Then, for any sufficiently large n, it holds that

−1 −4 −2 −2 P(kAb − Akmax ≥ 2kΣ0 k∞λ) ≤ 16e d + K2{(n − 1)d} and −1 −4 −2 −2 P(kAb − Ak∞ ≥ 4skΣ0 k∞λ) ≤ 16e d + K2{(n − 1)d} .

q log(ed) In particular, when λ is picked to be C n−1 for some large enough absolute constant C, we obtain that r ! r ! −1 log(ed) −1 log(ed) kAb − Akmax = O kΣ k∞M and kAb − Ak∞ = O skΣ k∞M . P 0 n P 0 n

To further illustrate consequences of Theorem 2.3, one can now show that kAb − Akmax = o (1) q P −1 log(ed) p provided that kΣ0 k∞M n = o(1) and λ = C log(ed)/n for some pre-specified large enough q −1 log(ed) constant C. Similarly, one has kAb − Ak∞ = oP(1) when skΣ0 k∞M n = o(1).

Remark 2.4. Consider a particular case, Example 1 in Han et al.(2015). There, the authors −1 proved that kΣ0 k∞ = O(1) as long as Σ0 is strictly diagonal dominant (Horn and Johnson(2012)), satisfying that, for all j ∈ [d],

X δj := |Σ0,jj| − |Σ0,jk| ≥ δ > 0, j6=k

−1 p where δ is some positive absolute constant. Then, the assumption that kΣ0 k∞M log(ed)/n = o(1) p  −1 p is equivalent to M = o n/ log(ed) . Similarly, the assumption that skΣ0 k∞M log(ed)/n =   o(1) implies that sM = o pn/ log(ed) . Both allow the parameters s and M to diverge with n.

8 3 Inference on Granger Causality

In this section, we first construct a test for non-Granger causality of one individual series on another individual series in {Xt}t∈Z using the fact that for any k ∈ [d], j ∈ [d], and j 6= k, {Xt,k}t∈Z

Granger causes {Xt,j}t∈Z if and only if the (j, k)-th entry of the transition matrix, namely Ajk, is nonzero. In the presence of Granger causality, we then construct a sign-consistent estimator of the causality relation. Without loss of generality and also in line with Neykov et al.(2018a), we focus on inference for > Am1 for some m ∈ [d]. Inference for Amj for j = 2, ..., d is obtained by changing e1 := (1, 0,..., 0) > to ej := (0,..., 0, 1, 0,..., 0) below. First, we develop a test for non-Granger causality, i.e., testing | {z } j−1

H0 : Am1 = 0 against H1 : Am1 6= 0. (3.1)

The test is constructed in two steps. In the first step, we follow Neykov et al.(2018a) to construct a de-biased estimator of Am1 and establish its asymptotic normality. In the second step, we construct a consistent estimator of the asymptotic variance of the de-biased estimator under the null hypothesis. This is done via CBB, and requires more delicate analysis given the strategies paved by Dehling and Wendler(2010) and Fan et al.(2016).

3.1 A De-biased Estimator and its Asymptotic Normality

d For simplicity of notation, we write β := Am∗ ∈ R , the m-th row of A. Decompose β as > > β = (θ, γ ) , where θ is the first element of β , i.e., θ = Am1. We construct a de-biased estimator of θ = Am1 via the application of Algorithm 1 in Neykov et al.(2018a). It consists of the following three steps.

> > Step 1 Estimate β by βb = (θ,b γb ) with

βb = argminkvk1 such that kΣb 0v − Σb 1,∗mk∞ ≤ λ, v∈Rd

where λ is a tuning parameter. It is easy to see that βb = Ab m∗, where Ab is defined in (2.3).

−1 Step 2 Estimate w := [Σ0 ]∗1 by

0 wb := argmin kvk1 such that kΣb 0v − e1k∞ ≤ λ , v∈Rd where λ0 is another tuning parameter.

> > Step 3 Estimate θ by solving the estimating equation Sb((x, γb ) ) = 0 with

> Sb(v) = wb (Σb 0v − Σb 1,∗m).

9 It is easy to see that the solution θe to the above equation has the following closed form:

> > w (Σb 0βb(0) − Σb 1,∗m) w (Σb 0βb − Σb 1,∗m) θe = − b = θb− b , (3.2) > > wb [Σb 0]∗1 wb [Σb 0]∗1 > > where βb(0) := (0, γb ) .

Consistency and asymptotic normality of the above de-biased estimator θe are now stated in the following two theorems, with assumptions posed similarly to Neykov et al.(2018a).

Theorem 3.1. Suppose Assumptions M, S, and E hold. Further, suppose that log d M = O(1), kΣ−1k = O(1), and max{s , s} √ = o(1), 0 ∞ w n

Pd 1 p 0 p p where sw := k=1 (wk 6= 0). Let λ  log(ed)/n and λ  log(ed)/n and λ ≥ C log(ed)/n 0 0p 0 and λ ≥ C log(ed)/n for some sufficiently large C,C . Then, we have θe− θ = oP (1).

Theorem 3.2. Suppose all conditions in Theorem 3.1 hold. Let

2 n > o σn := (n − 1)Var w Σb 0β − Σb 1,∗m . (3.3)

2 Assume that σn ≥ C > 0 is bounded away from zero by some absolute constant C > 0 and for all sufficiently large n. Let √ n − 1 Un = (θe− θ). σn Then limn→∞|P(Un ≤ t) − Φ(t)| = 0 for any fixed t ∈ R. Here Φ(·) represents the cumulative distribution function of the standard Gaussian.

3.2 Bootstrap Estimation of the Asymptotic Variance

2 In this section, we use CBB to estimate the asymptotic variance σn in (3.3) under H0. Because > > > our estimator contains Σb 1, our bootstrap method is based on Yt := (Xt , Xt+1) . Note that under

H0, we have θ = 0.

> > > Step 1 Construct {Y1,..., Yn−1}, where Yt = (Xt , Xt+1) . Then, draw CBB samples from

{Y1,..., Yn−1}. In detail, following Fan et al.(2016), setting ` to be some positive integer,

we first extend {Y1,..., Yn−1} by defining Yi+(n−1) = Yi for i ≥ 1, randomly draw a block of ∗ b` ` consecutive observations, and then obtain a bootstrap sample {Yt }t=1 by repeating this b = b(n − 1)/`e times so that for each k = 0, . . . , b − 1, 1 ∗(Y ∗ = Y ,..., Y ∗ = Y ) = P k`+1 j (k+1)` j+`−1 n − 1 ∗ for any j ∈ [n − 1]. Here P is the bootstrap distribution conditional on {Y1,..., Yn−1} and bxe stands for the nearest integer value of x ∈ R.

10 Step 2 Construct the bootstrap version of Ωb ,     π 2  ∗  X ∗ ∗ ∗ ∗ > Ωb = sin  · sign(Yi − Yj )sign(Yi − Yj )  ,  2 b`(b` − 1)   i

∗ ∗ ∗ ∗ and let Σb 0 = Ωb [d],[d] and Σb 1 = Ωb [d],d+[d].

Step 3 Use the bootstrap variance defined as

2∗ ∗n > ∗ ∗ o σbn = (b`)Var wb Σb 0βb(0) − Σb 1,∗m

2 ∗ ∗ for the consistent estimator of σn. Here Var (·) stands for the variance operator under P .

Theorem 3.3. In addition to assumptions in Theorem 3.2, assume further that ` → ∞, `2/n = o(1), 2 2 √ and max{sw, s } = o( n/ log d). Then under H0,

2∗ 2 |σbn − σn| = oP(1).

3.3 Tests for Granger non-Causality

This section develops two tests for Granger non-causality, i.e., for testing H0 in (3.1) based on the following two test statistics, where the latter is a modification to the original Wald-type , √ n − 1 T = θ and T adj := (w>Σ )T . (3.4) n ∗ e n b b 0,∗1 n σbn

Theorems 3.2 and 3.3 combined together yield that Tn is asymptotically standard normal under adj H0. Similar conclusion also applies to the adjusted statistic Tn because wb = 1 + oP(1). However, in > finite samples, wb Σb 0,∗1 is usually not strictly equal to 1. This motivates the adjusted test statistic that is shown to enjoy better finite sample performance in our simulation setups in Section4. adj Based on Tn and Tn , we define the following two tests, n o 1  adj 1 adj Tn,α = |Tn| > z1−α/2 and Tn,α = Tn > z1−α/2 , where z1−α/2 is the (1 − α/2) quantile of the standard . We then have the following corollary, which justifies the validity of the tests, and is also a direct consequence of Theorems 3.2 and 3.3.

Corollary 3.1 (by Theorems 3.2 and 3.3). Suppose all conditions in Theorems 3.2 and 3.3 hold. We then have adj P(Tn,α = 1|H0) = α + o(1) and P(Tn,α = 1|H0) = α + o(1).

11 3.4 Estimation of the Sign of Granger Causality

When a test detects Granger causality, it may be of interest to further infer the sign of Am1.

Following Qiu et al.(2015), define a new estimator of A as Ae = [Ae jk], where

Ae jk := Ab jk1(|Ab jk| ≥ γ) for some threshold level γ. Then, similar to Theorem 6.4 in Qiu et al.(2015), we have the following sign-consistency theorem.

Theorem 3.4 (Theorem 6.4 in Qiu et al.(2015)) . Assume that the conditions in Theorem 2.1 −1 hold. Let γ = 2kΣ0 k∞(M + 1)z. Then, with probability no smaller than 1 − 0, where z is defined in equation (2.4) and 0 is defined in Remark 2.2, we have sign(Ae ) = sign(A), provided that min{(j,k):Ajk>0}|Ajk| ≥ 2γ.

4 Monte-Carlo Simulation

In this section, we investigate the finite sample performance of the proposed estimator Ab and adj tests for non-Granger causality based on Tn and Tn . The data generating process (DGP) used in the simulation is characterized by model (1.1) for the latent process and the transformation f generating the observed process, where the following f is taken from Xue and Zou(2012),

f = {f1, f2, f3, f4, f5, f1, f2, f3, f4, f5,... }, in which 1 f (x) = x, f (x) = exp(x), f (x) = x3, f (x) = , and 1 2 3 4 1 + exp(−x)

f5(x) = f2(x)I(x < −1) + f1(x)1(−1 ≤ x ≤ 1) + (f4(x − 1) + 1)1(x > 1). (4.1)

−1 The matrices A and Σ0 are tri-diagonal matrices such that

• the diagonal elements of A are |ρA| and sub-diagonal and superdiagonal elements are ρA/3, > where ρA is chosen so that ΣE = Σ0 − AΣ0A is a positive-definite matrix;

−1 • Σ0 is the normalized Ξ with diagonal elements taking the value 1, where Ξ is a tri-diagonal matrix with diagonal elements 1 and sub-diagonal and superdiagonal elements 1/3.

Given sample size n, dimension d, and ρA = 0.54, we generate data in two steps:

DGP Step 1 Generate {Z1,..., Zn} recursively. That is, generate Z1 from N(0, Σ0) and given

Zt−1, generate Zt from Zt = AZt−1 + Et, where Et ∼ N(0, ΣE);

12 DGP Step 2 Transform (Z1,..., Zn) to (X1,..., Xn) using Xt = f(Zt), for t = 1, . . . , n.

n n−1 > > > Once the data {Xt}t=1 is generated, we construct {Yt}t=1 , where Yt = (Xt , Xt+1) and p compute Ab using the tuning parameter λ = C log d/(n − 1). Following Neykov et al.(2018a), we choose d = 60, 80. The sample sizes are n = 251, 501, 751. All the results are obtained from 5000 simulation repetitions. To increase computational speed, we used the fast algorithms for the calculation of Kendall’s tau in Pozzi et al.(2012). Especially, we used the implementation by Eleutherios(2012). We report two sets of results below. The first set of results measures the performance of the estimator Ab in terms of its ability to detect the true sparsity pattern in A. The second set of results reports the size and power performance of the two proposed tests for non-Granger causality.

4.1 The Performance of Ab

Following Liu et al.(2012), we use the receiver operating characteristic (ROC) curves to illustrate the ability of the estimator Ab in detecting the sparsity pattern of A. For this, we define the sparsity sets S and S as follows: A Ab

• the pair (i, j) is not an element of the set SA if and only if Aij = 0;

• the pair (i, j) is not an element of the set S if and only if A = 0. Ab bij

Then, following equations (5.2) and (5.3) in Liu et al.(2012), we define the false positive rate (FPR) and the false negative rate (FNR) given the tuning parameter λ as follows. We first calculate the false positive and false negative numbers.

• FP(λ): the number of pairs in S not in S ; Ab A

• FN(λ): the number of pairs in S not in S . A Ab

Similar to equation (5.4) in Liu et al.(2012), we calculate the FNR, the FPR, and true positive rate (TPR) as follows,

FN(λ) FP(λ) FNR(λ) = , FPR(λ) = 2 , TPR(λ) = 1 − FNR(λ), card (SA) d − card (SA) where card(·) outputs the number of elements in the input. Then, we plot the averages of FPR(λ) and TPR(λ) over 5,000 repetitions for given n and d. Here λ is set to be Cplog d/(n − 1), where C ∈ {0.10, 0.11, 0.12, ..., 1.00}. Figure1 shows the ROC plots for n = 251, 501, 751 and d = 80. Several observations are in-line. First, for each pair of (n, d), the true positive rate increases along with the false positive rate. This is as expected since, as more edges are selected, it is also

13 Figure 1: ROC plots when n = 251 and d = 80 (black solid line); n = 501 and d = 80 (blue dashed line); and n = 751 and d = 80 (red dash-dot line).

likely that we will pick more non-zero Aij’s. Secondly, as the sample size increases to 751, the true positive rate has been above 0.9 while the false positive rate is lower than 0.2. This demonstrates the powerfulness of the proposed method in variable selection. Lastly, comparing the three different pairs of (n, d), it could be observed that, as the sample size alone increases, the ROC curves grow higher. This indicates that the proposed method is more capable of correctly selecting the non-zero coefficients as more information is obtained, which is as expected.

4.2 The Size and Power Performance

adj To examine the size and power performance of Tn and Tn , we test the null hypothesis that

A1,j = 0 for j = 2, 3, 10, 20, 30, 40, d. Since A1,2 6= 0 and A1,j = 0 for j = 3, 10, 20, 30, 40, d, rejection probabilities for A1,2 give power and those for A1,3,A1,10,A1,20,A1,30,A1,40,A1,d give size. Following Section 5.2 of Neykov et al.(2018a), we compute the debiased estimator using the tuning parameters 0 q log d 1/3 λ = λ = 0.5 n−1 . For CBB, we set ` = [(n − 1) ] following the discussions in Dehling and Wendler(2010) and compute the bootstrap variance based on 2 , 000 bootstrap samples. Tables1 to6 report the results for nominal sizes 1% , 5%, 10%, 15%, 20%. Several conclusions emerge. First, for all settings, the adjusted test has better size than the unadjusted test; second, the sizes of both tests improve and approach the nominal size as n gets larger; third, the power of both tests increases quickly with the sample size; and lastly the results for d = 60 and d = 80 are qualitatively comparable.

14 5 Estimation and Inference in Latent VAR(p) Model

d Let Zt ∈ R follow the VAR(p) process below for a finite and fixed p:

Zt = A1Zt−1 + A2Zt−2 + ... + ApZt−p + Et, Zt ∼ N(0, Σ0), (5.1) where A1, ..., Ap are unknown transition matrices. In this section, we assume that Assumption M-p below and Assumption S in Section 2.1 are satisfied. We extend the estimation and inference methods developed in the previous sections to the VAR(p) process for a finite and fixed p. To save space, we present our rank-based estimators of

A1, ..., Ap and tests for Granger causality without providing the corresponding rate results for the estimators. i.i.d. Assumption M-p: i) diag(Σ0) = Id; ii) {Et}t∈Z ∼ N(0, ΣE), where 0 < cE < λmin(ΣE) ≤

λmax(ΣE) < CE < ∞ for some absolute constants cE and CE; iii) kAik2≤ ai < 1 by an absolute Pp d×pd constant ai for all i = 1, . . . , p, and i=1 ai < 1; and iv) the matrix A := (A1,..., Ap) ∈ R satisfies:

( pd ) d×pd X A ∈ M(s, M) with M(s, M) := M ∈ R : max 1(Mjk 6= 0) ≤ s, kMk∞ ≤ M , (5.2) 1≤j≤d k=1 where s is a positive integer which may depend on d and p, and M is a positive constant which may also depend on d and p. Assumption M-p reduces to Assumption M in Section 2.1 when p = 1.

For a Gaussian VAR(p) process {Zt}t∈Z, it is well-known that the sequence {Zt,k}t∈Z Granger causes {Zt,j}t∈Z if and only if there exists i ∈ [p] such that the (j, k)-th element of Ai, Ai,jk, is non-zero (cf. Corollary 2.2.1 in L¨utkepohl(2005)). The following proposition holds for the observable process {Xt}t∈Z.

Proposition 5.1. Suppose {Zt}t∈Z follows VAR(p) model (5.1) with finite and fixed p. Under

Assumptions M-p(ii), M-p(iii), and S(i), we obtain that the sequence {Xt,k}t∈Z does not Granger cause {Xt,j}t∈Z if and only if Ai,jk = 0 for all i ∈ [p].

We establish α-mixing property of {Zt}t∈Z stated below in Appendix A.3.

Theorem 5.1. Suppose {Zt}t∈Z follows VAR(p) model (5.1) with finite and fixed p. Under

Assumptions M-p and S, {Zt}t∈Z is geometrically α-mixing with mixing coefficient satisfying

α(n; {Xt}t∈Z) = α(n; {Zt}t∈Z) ≤ γ1 exp(−γ2n), for all n ≥ 1.

Here γ1 and γ2 are positive constants that only depend on cE,CE, a1, . . . , ap, and p.

15 5.1 Estimators of A1, ..., Ap

 > > > > (p+1)d Let Ω = E UtUt , where Ut := (Zt+p,..., Zt ) ∈ R . We set Σ0 := Ω[pd]+d,[pd]+d and

Σ1 := Ω[pd]+d,[d]. Then, we estimate A1,..., Ap via the following two steps.   Step 1. Estimate Ω by Ωb = sin 0.5πTb , where

n−p n−p 2 X X h >i Tb = sin(Yi − Yj) sin(Yi − Yj) , (n − p)(n − p − 1) i=1 j=i+1

> > > (p+1)d in which Yt := (Xt+p,..., Xt ) ∈ R . Let Σc0 := Ωb[pd]+d,[pd]+d and Σb1 := Ωb[pd]+d,[d].

d×pd Step 2. Estimate the stacked matrix A = (A1,..., Ap) ∈ R by

d pd X X > Ab := argmin |Mjk| such that kΣb0M − Σb1kmax ≤ λ, d×pd M∈R j=1 k=1

where λ is a tuning parameter.

5.2 Tests for Granger Causality

We present two tests for Granger non-causality of an individual series {Xt,k}t∈Z on another individual series {Xt,m}t∈Z under the latent VAR(p) model (5.1) for m 6= k, m ∈ [d], and k ∈ [d]. From Proposition 5.1, it is sufficient to consider

H0 : Ai,mk = 0 for all i ∈ [p] against H1 : Ai,mk 6= 0 for some i ∈ [p]. (5.3)

> p Following Section 3, we first construct a de-biased estimator of (A1,mk,...,Ap,mk) ∈ R , then provide a consistent estimator of the asymptotic variance of the de-biased estimator under the null hypothesis via CBB, and finally develops two tests.

5.2.1 De-biased Estimation

For notational compactness, we let

> pd G : = {k, k + d, k + 2d, . . . , k + (p − 1)d}, ej := (0,..., 0, 1, 0,..., 0) ∈ R , | {z } j−1 > pd > p β : = (A1,m∗,..., Ap,m∗) ∈ R , and θ := βG = (A1,mk,...,Ap,mk) ∈ R .

> pd > p pd For any generic vectors v = (v1, . . . , vpd) ∈ R and x = (x1, . . . , xp) ∈ R , we define v(x) ∈ R as follows,  x(j−k)/d+1 when j ∈ G, [v(x)]j = vj otherwise.

16 That is, v(x) is a vector such that the values of v indexed by G are replaced by x, keeping the remaining parts intact. The de-biased estimator θe of θ is constructed via the following three steps.

Step 1. Estimate β by

βb := argminkvk1 such that kΣb0v − Σb1,∗mk∞ ≤ λ, v∈Rpd where λ is a tuning parameter.

−1 pd×p Step 2. Estimate W := [Σ0 ]∗G ∈ R by pd p X X 0 Wc := argmin |Mij| such that kΣb0M − [Ipd]∗Gkmax ≤ λ , pd×p M∈R i=1 j=1

0 pd×p where λ is another tuning parameter. Equivalently, Wc = (wb1,..., wbp) ∈ R , where 0 wbj := argminkvk1 such that kΣb0v − ek+(j−1)pk∞ ≤ λ for each j ∈ [p]. v∈Rpd

p Step 3. Estimate θ by solving the estimating equation Sb(βb(x)) = 0p×1, where x ∈ R and

> Sb(v) = Wc (Σb0v − Σb1,∗m).

The solution θe to the above equation has the following closed form: −1 −1  >  >    >  >   θe = − Wc Σb0,∗G Wc Σb0βb(0p×1) − Σb1,∗m = θb− Wc Σb0,∗G Wc Σb 0βb − Σb1,∗m ,

> p where θb := βbG = (Ab1,mk,..., Abp,mk) ∈ R .

5.2.2 Bootstrap Estimation of the Asymptotic Variance

Similar to Theorem 3.2, we obtain that the de-biased estimator θe is asymptotically normally distributed with the asymptotic variance given by

n >  o Σθ,n := (n − p)Var W Σb0β − Σb1,∗m . (5.4)

We provide the following estimator of the asymptotic variance Σθ,n in equation (5.4) under

H0 : θ = 0p×1.

n−p ∗ b` n−p Step 1. Construct {Yt}t=1 , and draw CBB samples {Yt }t=1 from {Yt}t=1 , where b = [(n − p)/`].   Step 2. Construct Ωb∗ := sin 0.5πTb , where

b` b` 2 X X h i Tb = sin(Y∗ − Y∗) sin(Y∗ − Y∗)> . (b`)(b` − 1) i j i j i=1 j=i+1

∗ ∗ ∗ ∗ Then, we set Σb0 := Ωb[pd]+d,[pd]+d, and Σb1 := Ωb[pd]+d,[d].

17 Step 3. Calculate the bootstrap variance as

∗ ∗ n >  ∗ ∗ o Σbθ,n := (b`)Var Wc Σb0 βb(0p×1) − Σb1,∗m .

5.2.3 Tests for Granger non-causality

Define the following test statistics,

−1 >  ∗  Tn := (n − p)θe Σbθ,n θe and > −1 adj  >   ∗   >  Tn := (n − p) Wc Σb0,∗Gθe Σbθ,n Wc Σb0,∗Gθe .

adj Based on Tn and Tn , we define the following two tests, n o 1  2 adj 1 adj 2 Tn,α = Tn > χp,1−α and Tn,α = Tn > χp,1−α ,

2 where χp,1−α is the (1 − α) quantile of the chi-square distribution with the degree of the freedom p.

Theorem 5.2. For the latent VAR(p) model (5.1) with finite and fixed p, suppose that Assumptions

M-p, S, and E hold. We assume that λmin(Σθ,n) ≥ Cθ > 0 by some absolute constant Cθ. In addition, we assume that ` → ∞, `2/n = o(1),

log(d) M = O(1), kΣ−1k = O(1), and max{s2 , s2} √ = o(1), 0 ∞ W n

Ppd 1 p 0 p where sW := max1≤k≤p j=1 (Wjk 6= 0). Let λ  log(epd)/n, λ  log(epd)/n, λ ≥ Cplog(epd)/n, and λ0 ≥ C0plog(epd)/n for some sufficiently large C and C0. We then have

adj P(Tn,α = 1|H0) = α + o(1) and P(Tn,α = 1|H0) = α + o(1).

6 Concluding Remarks

In this paper, we have constructed a simple rank-based estimator of the transition matrix in a high dimension latent VAR(p) model for any finite order p and established its rates of convergence. We have also developed tests for Granger non-causality of one individual series on another series in the high-dimensional latent model and provided numerical evidence on their finite sample performance. It is worth mentioning that all the methods developed in this paper apply directly to the corresponding high dimension latent VAR(p) model with non-zero , although we are unable to estimate the mean. Several extensions are possible. First, if it is desirable to test Granger non-causality of an increasing number of individual series on another series in the high-dimensional latent model, the idea underlying multiple testing and the method of multiplier bootstrap in Chernozhukov et al.(2013) may be adopted. Second, allowing for both increasing d and increasing p is interesting but technically challenging.

18 References

Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.

Beare, B. K. (2010). Copulas and temporal dependence. Econometrica, 78(1):395–410.

Bradley, R. C. (1983). Approximation theorems for strongly mixing random variables. The Michigan Mathematical Journal, 30(1):69–81.

Bradley, R. C. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2(1):107–144.

Chen, X. and Fan, Y. (2006a). Estimation and of semiparametric copula-based multivariate dynamic models under copula misspecification. Journal of Econometrics, 135(1- 2):125–154.

Chen, X. and Fan, Y. (2006b). Estimation of copula-based semiparametric time series models. Journal of Econometrics, 130(2):307–335.

Chen, X., Huang, Z., and Yi, Y. (2018). On estimation of multivariate semiparametric GARCH filtered copula models. China Center for Economic Research Working Paper Series E2018025. http://img.bimba.pku.edu.cn/resources/file/13/2018/12/31/201812311606304.pdf.

Chen, X., Koenker, R., and Xiao, Z. (2009a). Copula-based nonlinear quantile autoregression. Econometrics Journal, 12(SUPPL. 1):S50–S67.

Chen, X., Wu, W. B., and Yi, Y. (2009b). Efficient estimation of copula-based semiparametric Markov models. The Annals of Statistics, 37(6 B):4214–4253.

Chernozhukov, V., Chetverikov, D., and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819.

Dedecker, J., Doukhan, P., Lang, G., Leon, J. R., Louhichi, S., and Prieur, C. (2007). Weak Dependence: With Examples and Applications. Springer-Verlag, New York.

Dehling, H., Vogel, D., Wendler, M., and Wied, D. (2017). Testing for changes in Kendall’s tau. Econometric Theory, 33(6):1352–1386.

Dehling, H. and Wendler, M. (2010). and the bootstrap for U-statistics of strongly mixing data. Journal of Multivariate Analysis, 101(1):126–137.

19 Doukhan, P. (1994). Mixing: Properties and Examples. Number 85 in Lecture Notes in Statistics. Springer-Verlag, New York.

Ekstr¨om,M. (2014). A general central limit theorem for strong mixing sequences. Statistics and Probability Letters, 94:236–238.

Eleutherios, L. (2012). Weighted Kendall matrix. MATLAB Central File Ex- change. Retrieved Jan 12, 2015. https://www.mathworks.com/matlabcentral/fileexchange/ 27361-weighted-kendall-rank-correlation-matrix.

Fan, J., Han, F., Liu, H., and Vickers, B. (2016). Robust inference of risks of large portfolios. Journal of Econometrics, 194(2):298–308.

Fan, Y. and Patton, A. J. (2014). Copulas in econometrics. Annual Review of Economics, 6(1):179– 200.

Fang, K. T., Kotz, S., and Ng, K. W. (2017). Symmetric Multivariate and Related Distributions. CRC Press.

G´omez,V. (2016). Multivariate Time Series with Linear State Space Structure. Springer International Publishing.

Gon¸calves, S. and White, H. (2002). The bootstrap of the mean for dependent heterogeneous arrays. Econometric Theory, 18(6):1367–1384.

Han, F. and Li, Y. (2019). Moment bounds for Large autocovariance matrices under dependence. Journal of Theoretical Probability (in press).

Han, F., Lu, H., and Liu, H. (2015). A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research, 16:3115–3150.

Han, F. and Wu, W. B. (2019). Probability inequalities for high dimensional time series under a triangular array framework. arXiv preprint arXiv:1907.06577 [math.ST], pages 1–21.

Hecq, A., Margaritella, L., and Smeekes, S. (2019). Granger causality testing in high-dimensional VARs: a post-double-selection procedure. arXiv preprint arXiv:1902.10991 [econ.EM], pages 1–43.

Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press, second edition.

Hsing, T. and Wu, W. B. (2004). On weighted U-statistics for stationary processes. The Annals of Probability, 32(2):1600–1631.

20 Kolmogorov, A. N. and Rozanov, Y. A. (1960). On strong mixing conditions for stationary Gaussian processes. Theory of Probability and Its Applications, 5(2):204–208.

Kruskal, W. H. (1958). Ordinal Measures of Association. Journal of the American Statistical Association, 53(284):814–861.

Lahiri, S. N. (2003). Methods for Dependent Data. Springer-Verlag, New York.

Lee, T. H. and Long, X. (2009). Copula-based multivariate GARCH model with uncorrelated dependent errors. Journal of Econometrics, 150(2):207–218.

Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326.

L¨utkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer-Verlag, Berlin Heidelberg.

Merlev`ede,F., Peligrad, M., and Rio, E. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In Houdre, C., Koltchinskii, V., Mason, D. M., and Peligrad, M., editors, High Dimensional Probability V: The Luminy Volume, volume 5 of Institute of (IMS) Collections, pages 273–292. Institute of Mathematical Statistics, Beachwood, Ohio, USA.

Min, A. and Czado, C. (2014). SCOMDY models based on pair-copula constructions with application to exchange rates. Computational Statistics and Data Analysis, 76:523–535.

Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097.

Neykov, M., Ning, Y., Liu, J. S., and Liu, H. (2018a). A unified theory of confidence regions and testing for high-dimensional . Statistical Science, 33(3):427–443.

Neykov, M., Ning, Y., Liu, J. S., and Liu, H. (2018b). Supplement to ”A unified theory of confidence regions and testing for high-dimensional estimating equations”.

Patton, A. (2013). Chapter 16 - Copula Methods for Forecasting Multivariate Time Series, volume 2 of Handbook of Economic Forecasting. Elsevier.

Patton, A. J. (2012). A review of copula models for economic time series. Journal of Multivariate Analysis, 110:4–18.

Politis, D. and Romano, J. (1992). A circular block-resampling procedure for stationary data. In LePage, R. and Billard, L., editors, Exploring the Limits of Bootstrap, pages 263–270. John Wiley and Sons.

21 P¨otscher, B. M. and Prucha, I. R. (1997). Dynamic Nonlinear Econometric Models: Asymptotic Theory. Springer-Verlag, Berlin Heidelberg.

Pozzi, F., Di Matteo, T., and Aste, T. (2012). weighted correlations. The European Physical Journal B, 85(6):175.

Qiu, H., Xu, S., Han, F., Liu, H., and Caffo, B. (2015). Robust estimation of transition matrices in high dimensional heavy-tailed vector autoregressive processes. 32nd International Conference on Machine Learning, ICML 2015, 3:1843–1851.

R´emillard, B., Papageorgiou, N., and Soustra, F. (2012). Copula-based semiparametric models for multivariate time series. Journal of Multivariate Analysis, 110:30–42.

Shen, Y., Han, F., and Witten, D. (2019). Tail behavior of dependent V-statistics and its applications. arXiv preprint arXiv:1902.02761 [math.ST].

Wang, T. and Xia, Y. (2015). Whittle likelihood estimation of nonlinear autoregressive models with residuals. Journal of the American Statistical Association, 110(511):1083–1099.

Xue, L. and Zou, H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571.

Yoshihara, K.-i. (1976). Limiting behavior of U-statistics for stationary, absolutely regular processes. Zeitschrift f¨urWahrscheinlichkeitstheorie und Verwandte Gebiete, 35(3):237–252.

22 Table 1: The power and size when n = 251, d = 80 and λ = 0.5plog d/(n − 1) 1% 5% 10% 15% 20% adj adj adj adj adj Index Tn Tn Tn Tn Tn Tn Tn Tn Tn Tn (1, 2) 0.4892 0.4300 0.6796 0.6368 0.7632 0.7298 0.8186 0.7934 0.8552 0.8350 (1, 3) 0.0316 0.0202 0.1006 0.0764 0.1634 0.1370 0.2188 0.1920 0.2724 0.2386 (1, 10) 0.0202 0.0122 0.0792 0.0618 0.1386 0.1140 0.1942 0.1628 0.2526 0.2204 (1, 20) 0.0228 0.0164 0.0794 0.0632 0.1400 0.1162 0.1928 0.1658 0.2492 0.2150 (1, 30) 0.0262 0.0180 0.0852 0.0648 0.1476 0.1174 0.2068 0.1754 0.2620 0.2296 (1, 40) 0.0250 0.0174 0.0854 0.0640 0.1534 0.1244 0.2086 0.1802 0.2572 0.2294 (1, 80) 0.0256 0.0180 0.0826 0.0646 0.1436 0.1178 0.1982 0.1712 0.2468 0.2182

Table 2: The power and size when n = 501, d = 80 and λ = 0.5plog d/(n − 1) 1% 5% 10% 15% 20% adj adj adj adj adj Index Tn Tn Tn Tn Tn Tn Tn Tn Tn Tn (1, 2) 0.8454 0.8214 0.9412 0.9302 0.9692 0.9632 0.9816 0.9788 0.9882 0.9856 (1, 3) 0.0202 0.0148 0.0748 0.0604 0.1370 0.1164 0.1922 0.1748 0.2480 0.2244 (1, 10) 0.0212 0.0168 0.0742 0.0610 0.1264 0.1122 0.1778 0.1592 0.2284 0.2064 (1, 20) 0.0192 0.0154 0.0732 0.0602 0.1344 0.1150 0.1862 0.1676 0.2424 0.2176 (1, 30) 0.0180 0.0146 0.0714 0.0580 0.1292 0.1132 0.1824 0.1592 0.2376 0.2160 (1, 40) 0.0196 0.0140 0.0750 0.0616 0.1296 0.1144 0.1824 0.1642 0.2336 0.2120 (1, 80) 0.0192 0.0160 0.0646 0.0550 0.1272 0.1064 0.1856 0.1604 0.2368 0.2152

Table 3: The power and size when n = 751, d = 80 and λ = 0.5plog d/(n − 1) 1% 5% 10% 15% 20% adj adj adj adj adj Index Tn Tn Tn Tn Tn Tn Tn Tn Tn Tn (1, 2) 0.9686 0.9620 0.9914 0.9900 0.9962 0.9956 0.9982 0.9976 0.9988 0.9988 (1, 3) 0.0218 0.0166 0.0698 0.0606 0.1276 0.1132 0.1802 0.1610 0.2308 0.2136 (1, 10) 0.0172 0.0148 0.0626 0.0532 0.1168 0.1056 0.1652 0.1504 0.2134 0.1956 (1, 20) 0.0144 0.0110 0.0630 0.0528 0.1184 0.1028 0.1714 0.1570 0.2258 0.2048 (1, 30) 0.0178 0.0134 0.0658 0.0568 0.1222 0.113 0.1758 0.1602 0.2342 0.2138 (1, 40) 0.0184 0.0132 0.0746 0.0638 0.1304 0.1186 0.1776 0.1614 0.2274 0.2108 (1, 80) 0.0150 0.0122 0.0638 0.0564 0.1268 0.1076 0.1766 0.1626 0.2294 0.2126

23 Table 4: The power and size when n = 251, d = 60 and λ = 0.5plog d/(n − 1) 1% 5% 10% 15% 20% adj adj adj adj adj Index Tn Tn Tn Tn Tn Tn Tn Tn Tn Tn (1, 2) 0.4970 0.4412 0.6958 0.6554 0.7822 0.7516 0.8300 0.8080 0.8646 0.8466 (1, 3) 0.0306 0.0218 0.0970 0.0764 0.1560 0.1330 0.2096 0.1826 0.2626 0.2334 (1, 10) 0.0254 0.0174 0.0800 0.0624 0.1398 0.1152 0.1906 0.1662 0.2390 0.2102 (1, 20) 0.0208 0.0148 0.0782 0.0610 0.1410 0.1150 0.1934 0.1674 0.2496 0.2170 (1, 30) 0.0250 0.0154 0.0792 0.0640 0.1358 0.1144 0.1914 0.1622 0.2428 0.2138 (1, 40) 0.0276 0.0198 0.0826 0.0660 0.1430 0.1214 0.1948 0.1672 0.2474 0.2156 (1, 80) 0.0246 0.0174 0.0804 0.0646 0.1338 0.1106 0.1902 0.1616 0.2402 0.2088

Table 5: The power and size when n = 501, d = 60 and λ = 0.5plog d/(n − 1) 1% 5% 10% 15% 20% adj adj adj adj adj Index Tn Tn Tn Tn Tn Tn Tn Tn Tn Tn (1, 2) 0.8544 0.8300 0.9436 0.9346 0.9674 0.9640 0.9800 0.9754 0.9866 0.9844 (1, 3) 0.0238 0.0184 0.0784 0.0644 0.1382 0.1196 0.1924 0.1730 0.2518 0.2260 (1, 10) 0.0186 0.0134 0.0696 0.0560 0.1226 0.1076 0.1776 0.1550 0.2334 0.2082 (1, 20) 0.0234 0.0176 0.0736 0.0602 0.1316 0.1130 0.1816 0.1630 0.2296 0.2110 (1, 30) 0.0182 0.0136 0.0704 0.0594 0.1262 0.1110 0.1842 0.1624 0.2348 0.2154 (1, 40) 0.0192 0.0162 0.0732 0.0624 0.1344 0.1168 0.1844 0.1660 0.2312 0.2108 (1, 80) 0.0198 0.0142 0.0694 0.0612 0.1222 0.1072 0.1760 0.1552 0.2254 0.2062

Table 6: The power and size when n = 751, d = 60 and λ = 0.5plog d/(n − 1) 1% 5% 10% 15% 20% adj adj adj adj adj Index Tn Tn Tn Tn Tn Tn Tn Tn Tn Tn (1, 2) 0.9740 0.9672 0.9940 0.9926 0.9986 0.9970 0.9994 0.9994 0.9996 0.9996 (1, 3) 0.0174 0.0146 0.0666 0.0578 0.1234 0.1116 0.1770 0.1616 0.2300 0.2136 (1, 10) 0.0164 0.0132 0.0670 0.0570 0.1238 0.1110 0.1776 0.1654 0.2290 0.2108 (1, 20) 0.0154 0.0118 0.0634 0.0546 0.1212 0.1098 0.1746 0.1576 0.2280 0.2114 (1, 30) 0.0172 0.0134 0.0640 0.0550 0.1160 0.1030 0.1720 0.1560 0.2276 0.2078 (1, 40) 0.0164 0.0138 0.0636 0.0536 0.1200 0.1078 0.1780 0.1608 0.2248 0.2118 (1, 80) 0.0142 0.0124 0.0690 0.0584 0.1244 0.1122 0.1768 0.1620 0.2284 0.2116

A Proofs

In the sequel, for any X ∈ R and any p ∈ (0, ∞], we will use kXkp to denote its Lp norm. Let N denote the set of all positive integers.

24 A.1 Proofs in Section2

A.1.1 Notation

Some additional notations are introduced below. For a random variable X, we denote Xf as 2 an independent copy of X. Then, for x, y ∈ R , we define the following Hoeffding decomposition components,

• f(x, y) = sign(x1 − y1)sign(x2 − y2), g(x, y) = f(x, y) − E[f(X, Xf)];

• g1(x) = E[g(x, X)];

• g2(x, y) = f(x, y) − E[f(x, X)] − E[f(X, y)] + E[f(X, Xf)].

For a random process {Zt}t∈Z following VAR(1) model (1.1), we define

> > > 2d Ui = (Zi , Zi+1) ∈ R .

n−1 Let Ui,k denote the k-th element of the random vector Ui. Based on the observation {Ut}t=1 , we define the following V-statistics,

n−1 1 X > Vb = sign(Ui − Uj)sign(Ui − Uj) ; (n − 1)2 i,j=1 ZZ > V = sign(Ui − Uj)sign(Ui − Uj) dFUi dFUj ,

where FUi is the cumulative distribution function of Ui. Then, we can represent the (k, m)-th element Tbkm of Tb and the (k, m)-th element Vbkm of Vb as ! !! 2 X 2 X Ui,k Uj,k Tbkm = sign(Ui,k − Uj,k)sign(Ui,m − Uj,m) = f , , (n − 1)(n − 2) (n − 1)(n − 2) U U i

A.1.2 Proof of Theorem 2.1

The proof technique is based on the proofs of Theorem 4 and Proposition 5 in Shen et al.(2019). Because T = V, we have the following relationship when n ≥ 3, n − 1 1 1 kTb − Tkmax ≤ kVb − Vkmax + kVkmax ≤ 2kVb − Vkmax + . n − 2 n − 2 n − 2 This implies that, when n ≥ 3,

 1    kTb − Tkmax ≥ 2z + ≤ kVb − Vkmax ≥ z . P n − 2 P

25 Similar to the proofs of Theorem 4 and Proposition 5 in Shen et al.(2019), we can conduct Hoeffding decomposition for the (k, m)-element Vbkm − Vkm of Vb − V as follows,

n−1 !! n−1 ! !! 2 X Ui,k 1 X Ui,k Uj,k Vbkm − Vkm = g1 + g2 , . n − 1 U (n − 1)2 U U i=1 i,m i,j=1 i,m j,m

Then, we have   n−1 !! ! n−1 ! !! 1 X Ui,k z 1 X Ui,k Uj,k z P(|Vbkm−Vkm| ≥ z) ≤ P g1 ≥ +P  g2 , ≥  . n − 1 U 4 (n − 1)2 U U 2 i=1 i,m i,j=1 i,m j,m It implies that

2d   X   P kVb − Vkmax ≥ z ≤ P |Vbkm − Vkm| ≥ z k,m=1 2d n−1 !! ! 2d n−1 ! !! ! 1 U z 1 U U z X X i,k X X i,k j,k ≤ P g1 ≥ + P 2 g2 , ≥ . (A.1) n − 1 Ui,m 4 (n − 1) Ui,m Uj,m 2 k,m=1 i=1 k,m=1 i,j=1

The proof consists of the following five steps. Step 1 analyzes the first part of the right-hand side of inequality (A.1), and Steps 2 to 5 analyze the second part. The exact form of z in inequality (A.1) will be provided after each term has been analyzed.
Step 1. Because {U_i = (Z_i^⊤, Z_{i+1}^⊤)^⊤}_{i∈Z} is geometrically α-mixing and g_1(U_{i,{k,m}}) has mean zero with sup_i ‖g_1(U_{i,{k,m}})‖_∞ ≤ 2, Lemma A.2 in Section A.1.4 yields, for any n ≥ 5, d ≥ 1, and c_1 > 0,

P( |[1/(n−1)] Σ_{i=1}^{n−1} g_1(U_{i,{k,m}})| ≥ c_1 √(log(ed)/(n−1)) )
≤ 2 exp( − C_3 (n−1) c_1² log(ed) / [ 4(n−1) + 2c_1(n−1) √(log(ed)/(n−1)) (log(n−1))(log(log(n−1))) ] )
= 2 exp( − C_3 c_1² log(ed) / [ 4 + 2c_1 √( log(ed)(log²(n−1))(log²(log(n−1)))/(n−1) ) ] )
≤ 2 exp( − K_1 [c_1²/(1 + c_1)] log(ed) ),
where the constant C_3 only depends on (κ_1, κ_2) and K_1 only depends on (K_0, κ_1, κ_2). Here, κ_1 and κ_2 are the α-mixing coefficients in Proposition 2.2 and K_0 is the absolute constant defined in Assumption E. The last inequality holds because of Assumption E(i). Therefore, for n ≥ 5 and c_1 ≥ 1,

Σ_{k,m=1}^{2d} P( |[1/(n−1)] Σ_{i=1}^{n−1} g_1(U_{i,{k,m}})| ≥ c_1 √(log(ed)/(n−1)) ) ≤ 8d² exp( − K_1 [c_1²/(1 + c_1)] log(ed) ).

26 Step 2. Because g2(Ui,{k,m}, Uj,{k,m}) is degenerate and centered, using the proof in Proposition 1 and Corollary 7(a) in Shen et al.(2019), we have   n−1 ! !! 1 X Ui,k Uj,k 0 P  g2 , ≥ x + C4t  (n − 1)2 U U i,j=1 i,m j,m ! 2 ! C5(n − 1)y 2 X h i ≤ 2 exp − + (n − 1) J` DMf2 + (n − 1) P(|U1,k| ≥ Mf1) + P(|U1,m| ≥ Mf1) , 1/2 1/2 1/2 A2 + y M2 `=1 where C4 is an absolute positive constant; and C5 only depends on (κ1, κ2); and all other parameters are defined in Shen et al.(2019). According to the discussion on page 5, Table 1, and the proof of Proposition 5 in Shen et al. (2019), we have

D := the uniform upper bound of density functions of Ui,k − Uj,k,  (log(n − 1))4  A1/2 = F (t) σ2 + ,M 1/2 = F (t)1/2(log(n − 1))2,J = 3, 2 n − 1 2 ` δ/(2+δ) 1 1 ! 64κ F (t) X X σ2 = 1 , y = x − , t0 = C t + v + F (t) v , 1 − exp(−κ δ/(2 + δ)) n − 1 6 i i 2 i=0 i=0

2 2 where C6 is an absolute constant, F (t)  log log(Mf1/Mf2) + log log(1/t) for a sufficiently large value of Mf1 and sufficiently small values of Mf2 and t, and v0 and v1 satisfy

v_0² ≤ 2[ P(|U_{1,j}| ≥ M̃_1) + P(|U_{1,k}| ≥ M̃_1) ] + 12 M̃_2 D,

v_1² ≤ [ P(|U_{1,j}| ≥ M̃_1) + P(|U_{1,k}| ≥ M̃_1) ] + 12 M̃_2 D.

Note that F (t) ≥ 0 by construction. Then, we have

2d n−1 ! !! ! 1 U U F (t) X X i,k j,k 0 P 2 g2 , ≥ y + + C4t (n − 1) Ui,m Uj,m n − 1 k,m=1 i,j=1 ! 2 C5(n − 1)y 2 2 2 h i ≤ 8d exp − + 24(n − 1) d DMf2 + 4(n − 1)d (|U1,k| ≥ Mf1) + (|U1,m| ≥ Mf1) . (A.2) 1/2 1/2 1/2 P P A2 + y M2

q log(ed) 1 q log(ed) 1 Let y = c2 for some positive absolute constant c2, and t = √ . Note that t ≤ n−1 4 K0 n−1 2 for sufficiently large n, and t = o(1) under Assumption E. Then, the remaining steps will be as follows. Step 3 will bound the second and third parts of the right-hand side of inequality (A.2). Step 4 will bound the first part of the right-hand side of inequality (A.2). Step 5 will upper bound t0, for sufficiently large n and sufficiently small t. Step 3. Step 3 will bound

24(n−1)²d² D M̃_2 + 4(n−1)d² [ P(|U_{1,k}| ≥ M̃_1) + P(|U_{1,m}| ≥ M̃_1) ].

27 Using the Markov inequality and the fact that Ui,k − Uj,k follows normal distribution, we have

2 2 2 h i 24(n − 1) d DMf2 + 4(n − 1)d P(|U1,k| ≥ Mf1) + P(|U1,m| ≥ Mf1)   2 2 2 E[|U1,k|] E[|U1,m|] ≤24(n − 1) d DMf2 + 4(n − 1)d + Mf1 Mf1 " 2 1/2 2 1/2 # 2 2 2 E[|U1,k| ] E[|U1,m| ] ≤24(n − 1) d DMf2 + 4(n − 1)d + Mf1 Mf1 √ 2 2 1 2 σ ≤24((n − 1)d) M2 + 8((n − 1)d) , p 2 f 2πσ Mf1 2 2 where σ is the uniform lower bound of the variance of Ui,k −Uj,k, and σ is the uniform upper bound of the variance of Ui,k. Because Lemma A.3 implies 0 < cΩ ≤ λmin(Ω) ≤ λmax(Ω) ≤ CΩ < ∞, 2 2 where cΩ and CΩ only depend on (CA, cE,CE), σ and σ are lower and upper bounded by some positive constants which only depend on (CA, cE,CE). Therefore,   2 2 2 h i 1 2 1 24(n − 1) d DMf2 + 4(n − 1)d P(|U1,k| ≥ Mf1) + P(|U1,m| ≥ Mf1) ≤ K2((n − 1)d) Mf2 + , 2 Mf1 1 −4 where K2 only depends on (CA, cE,CE). Then, by choosing Mf2 = = ((n − 1)d) , we have Mf1

24(n−1)²d² D M̃_2 + 4(n−1)d² [ P(|U_{1,k}| ≥ M̃_1) + P(|U_{1,m}| ≥ M̃_1) ] ≤ K_2 ((n−1)d)^{−2}.

Step 4. Step 4 will bound the first part of the right-hand side of inequality (A.2) when q log(ed) 1 q log(ed) y = c2 for some positive constant c2, t = √ , and n is sufficiently large so that t n−1 4 K0 n−1 is sufficiently small under Assumption E. q F (t) log(ed) 1 −m2 Step 4-1. We will show that n−1 = o(t) when t = m1 n−1 and Mf2 = = ((n − 1)d) Mf1 1 for some positive numbers m1 and m2. Note that we set m1 = √ and m2 = 4 in Step 3. For a 4 K0 sufficiently large n and sufficiently small t, we have

2 2   q n−1  F (t) log (2m2 log((n − 1)d)) + log log m1 log(ed)  p t(n − 1) m1 (n − 1) log(ed) 2   q n−1  2 log log m1 log (2m2 log((n − 1)d)) log(ed) = p + p m1 (n − 1) log(ed) m1 (n − 1) log(ed) 2   q n−1  2 log log m1 log (2m2 log((n − 1)d)) 2m2 log((n − 1)d) log(ed) 1 = p + q 2m2 log((n − 1)d) m1 (n − 1) log(ed) n−1 log(ed) m1 log(ed) = o(1)o(1) + o(1)O(1) = o(1).

Therefore, we have F(t)/(n−1) = o(t), and for sufficiently large n and sufficiently small t = [1/(4√K_0)] √(log(ed)/(n−1)),

2   q n−1  2 log log m1 F (t) log (2m2 log((n − 1)d)) log(ed) log((n − 1)d) 1  p + p ≤ K3 p +K4 , t(n − 1) m1 (n − 1) log(ed) m1 (n − 1) log(ed) (n − 1) log(ed) log(ed)

28 where K3 only depends on K0, and K4 is an absolute positive constant. Step 4-2. In this step, we will bound the first part of the right-hand side of inequality (A.2) q log(ed) 1 q log(ed) with y = c2 for some positive constant c2; and t = √ when n is sufficiently large n−1 4 K0 n−1 log ed and n−1 is sufficiently small, which is true under Assumption E. For the first part, we have     ! √ p C (n − 1)y C5c2 n − 1 log(ed) C c log(ed) 5    5 2  exp − 1/2 1/2 = exp − 1/4 = exp − 1/2 3/4 . 1/2  1/2 1/2  log(ed)  1/2    log(ed)  1/2 1/2  log(ed)  1/2  A2 + y M2 A2 + c2 n−1 M2 n−1 A2 + c2 n−1 M2

1/2 3/4  log(ed)  1/2  log(ed)  1/2 Then, Steps 4-2-1 and 4-2-2 combined will show n−1 A2 = O(1) and n−1 M2 = O(1). 1/2  log(ed)  1/2 Step 4-2-1. This step shows that n−1 A2 = O(1). q log(ed) Under Assumption E, for sufficiently large n and sufficiently small t  n−1 , we have

1/2 1/2 2 2 (log(ed)/n) A2  tF (t)  t log (log(Mf1)/ log(Mf2)) + t log (log(1/t)) 2 2 = t log (2m2 log((n − 1)d)) + t log (log(1/t)) √ 2 2 log ed log (2m2 log((n − 1)d)) log (log(1/t)) = m1 √ + n − 1 1/t

≤ K5A,

1/2 1  log(ed)  1/2 where m1 := √ , m2 := 4, and K5A only depends on K0. Therefore, A = O(1). 4 K0 n−1 2 3/4  log(ed)  1/2 Step 4-2-2. This step shows that n−1 M2 = O(1). Under Assumption E, for sufficiently q log(ed) large n and sufficiently small t  n−1 , we have

 3/4 !2 log(ed) 1/2 3 4 2 4 M  t F (t) log (n) = tF (t)(t log (n)) ≤ K5AKe0, n − 1 2 where Ke0 only depends on K0. Therefore, for sufficiently large n and sufficiently small t, we have 3/4  log(ed)  1/2 q n−1 M2 ≤ K5M , where K5M = K5AKe0 only depends on K0. According to Steps 4-2-1 and 4-2-2, we have

log(ed)1/2 A1/2 ≤ K and ((log d/n)3/4M 1/2) ≤ K n − 1 2 5A 2 5M

q log(ed) for sufficiently large n and sufficiently small t  n−1 . Therefore, Steps 4-2-1 and 4-2-2 combined q log(ed) q log(ed) show that, for sufficiently large n, sufficiently small t  n−1 , and y = c2 n−1 where c2 > 0, we have ! ! C (n − 1)y K c log(ed) 8d2 exp − 5 ≤ 8d2 exp − 5 2 , 1/2 1/2 1/2 1/2 A2 + y M2 1 + c2 where K5 only depends on (K0, κ1, κ2).

Step 5. Step 5 will upper bound

t′ = C_6 ( t + Σ_{i=0}^{1} v_i + F(t) Σ_{i=0}^{1} v_i ).
Following the proof of Proposition 5 in Shen et al. (2019), we have

v_0² ≤ 2[ P(|U_{1,j}| ≥ M̃_1) + P(|U_{1,k}| ≥ M̃_1) ] + 12 M̃_2 D,

v_1² ≤ [ P(|U_{1,j}| ≥ M̃_1) + P(|U_{1,k}| ≥ M̃_1) ] + 12 M̃_2 D.

Step 3 has shown that

2 2 2 2 2 {(n − 1)d} (v0 + v1) ≤ 2{(n − 1)d} (v1 + v2) 2 hh i i ≤ 8{(n − 1)d} P(|U1,j| ≥ Mf1) + P(|U1,k| ≥ Mf1) + 6Mf2D −2 ≤ 4K2{(n − 1)d} .

Therefore, 1 1 X −1 X p −2 vi = {(n − 1)d} {(n − 1)d} vi ≤ 2 K2{(n − 1)d} . i=0 i=0

Also, for sufficiently large n and sufficiently small t = [1/(4√K_0)] √(log(ed)/(n−1)),

1 1 X 1 F (t) X F (t) v = t{(n − 1)d} v i d t(n − 1) i i=0 i=0 " # 1 log((n − 1)d) 1 p −1 ≤ K3 + K4 2 K2{(n − 1)d} t d p(n − 1) log(ed) log(ed) √ √ K K log((n − 1)d) K K 1 √2 3 √2 4 = 2 2 + p . 2 K0 (n − 1) d 2 K0 d2(n − 1) (n − 1) log(ed)

Therefore, for sufficiently large n and sufficiently small t = [1/(4√K_0)] √(log(ed)/(n−1)),

1 1 ! r 0 X X 1 log(ed) p −2 t = C6 t + vi + F (t) vi ≤ C6 √ + 2C6 K2((n − 1)d) 4 K0 n − 1 i=0 i=0 √ √ C K K log((n − 1)d) C K K 1 6 √ 2 3 6 √ 2 4 + 2 2 + p . 2 K0 (n − 1) d 2 K0 d2(n − 1) (n − 1) log(ed)

q log(ed) 1 q log(ed) From Steps 3 to 5, when y = c2 for some positive constant c2 and t = √ , for n−1 4 K0 n−1 sufficiently large n and sufficiently small t,  2d n−1 ! !! r X 1 X Ui,k Uj,k log(ed) K3 log((n − 1)d) P  g2 , ≥ c2 + √ (n − 1)2 U U n − 1 4 K (n − 1) k,m=1 i,j=1 i,m j,m 0

30 r K4 1 C4C6 log(ed) p −2 + √ p + √ + 2C4C6 K2((n − 1)d) 4 K0 (n − 1) log(ed) 4 K0 n − 1 √ √ ! C C K K log((n − 1)d) C C K K 1 4 6√ 2 3 4 6√ 2 4 + 2 2 + p 2 K0 (n − 1) d 2 K0 d2(n − 1) (n − 1) log(ed)   2d n−1 ! !! r X 1 X Ui,k Uj,k log(ed) F (t) 0 ≤ P  g2 , ≥ c2 + + C1t  (n − 1)2 U U n − 1 n − 1 k,m=1 i,j=1 i,m j,m ! c2 ≤ 8d2 exp −K log(ed) + K {(n − 1)d}−2. 5 1/2 2 1 + c2 For some positive constant Q to be specified later, let’s set r r log(ed) K3 log((n − 1)d) K4 1 C4C6 log(ed) z = Q + √ + √ p + √ n − 1 4 K0 n − 1 4 K0 (n − 1) log(ed) 4 K0 n − 1 √ √ C C K K log((n − 1)d) C C K K 1 p −2 4 6√ 2 3 4 6√ 2 4 + 2C4C6 K2((n − 1)d) + 2 2 + p . 2 K0 (n − 1) d 2 K0 d2(n − 1) (n − 1) log(ed)

Then, for sufficiently large n and sufficiently small log(ed)/(n−1),

2d   X   P kVb − Vkmax ≥ z ≤ P kVbkm − Vkmk ≥ z k,m=1 2d n−1 !! ! 2d n−1 ! !! ! 1 U z 1 U U z X X i,k X X i,k j,k ≤ P g1 ≥ + P 2 g2 , ≥ n − 1 Ui,m 4 (n − 1) Ui,m Uj,m 2 k,m=1 i=1 k,m=1 i,j=1 2d n−1 !! r ! 2d n−1 ! !! ! 1 U Q log(ed) 1 U U z X X i,k X X i,k j,k ≤ P g1 ≥ + P 2 g2 , ≥ n − 1 Ui,m 4 n − 1 (n − 1) Ui,m Uj,m 2 k,m=1 i=1 k,m=1 i,j=1  (Q/4)2   (Q/2)  ≤ 8d2 exp −K log(ed) + 8d2 exp −K log(ed) + K {(n − 1)d}−2. 1 1 + (Q/4) 5 1 + (Q/2)1/2 2 p Therefore, by choosing sufficiently large Q, we have kVb − Vkmax = OP( log(ed)/n). By picking the value of Q large enough so that (Q/4)2 (Q/2) K ≥ 4 and K ≥ 4, 1 1 + (Q/4) 5 1 + (Q/2)1/2 we have   −4 −2 −2 P kVb − Vkmax ≥ z ≤ 16e d + K2{(n − 1)d} . Lastly, noticing the following relationship n − 1 1 n − 1 1 kTb − Tkmax ≤ kVb − Vkmax + kVkmax ≤ kVb − Vkmax + , n − 2 n − 2 n − 2 n − 2 the proof is finished.

A.1.3 Proof of Theorem 2.3

The proof of Theorem 2.3 immediately follows from the lemma below, which is contained in the proof of Theorem 5.3 in Qiu et al. (2015).

Lemma A.1 (Master theorem). Suppose A ∈ M(s, M). Further suppose ‖Σ̂_0 − Σ_0‖_max ≤ ε_1 and ‖Σ̂_1 − Σ_1‖_max ≤ ε_2 with probability no less than 1 − ε_0. Let λ in equation (2.3) satisfy λ ≥ ε_1 M + ε_2.

Then, with probability no less than 1 − ε_0, it holds that

‖Â − A‖_max ≤ 2‖Σ_0^{−1}‖_∞ [ε_1 M + ε_2] ≤ 2‖Σ_0^{−1}‖_∞ λ and
‖Â − A‖_∞ ≤ 4s‖Σ_0^{−1}‖_∞ [ε_1 M + ε_2] ≤ 4s‖Σ_0^{−1}‖_∞ λ.
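As an informal illustration of how this lemma is used (constants suppressed; the rates below are those suggested by Theorem 2.1, not a restatement of Theorem 2.3): if ε_1 and ε_2 are both of order √(log(ed)/n), then choosing λ of order (M + 1)√(log(ed)/n) yields ‖Â − A‖_max = O_P(‖Σ_0^{−1}‖_∞ √(log(ed)/n)) and ‖Â − A‖_∞ = O_P(s ‖Σ_0^{−1}‖_∞ √(log(ed)/n)).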

A.1.4 Technical lemmas used in Section 2

Lemma A.2 (Theorem 1 in Merlevède et al. (2009)). Let {X_t}_{t∈Z} be a sequence of centered real-valued random variables. Suppose that the sequence satisfies

α(m) ≤ κ_1 exp(−κ_2 m) for all m ≥ 1,

and that there exists a positive M such that sup_{i≥1} ‖X_i‖_∞ ≤ M. Then there exist positive constants C_1 and C_2 depending only on (κ_1, κ_2) such that, for all n ≥ 4 and any t satisfying 0 < t < {C_1 M (log n)(log log n)}^{−1}, we have

log E[ exp( t Σ_{i=1}^{n} X_i ) ] ≤ C_2 t² n M² / ( 1 − C_1 t M (log n)(log log n) ).

In terms of probability, there exists a constant C_3 only depending on (κ_1, κ_2) such that for all n ≥ 4 and x ≥ 0,
P( |Σ_{i=1}^{n} X_i| ≥ x ) ≤ exp( − C_3 x² / ( nM² + Mx(log n)(log log n) ) ).   (A.3)
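The tail bound (A.3) is easy to probe numerically. The following self-contained Python sketch is purely illustrative: the AR(1) construction, sample size, and thresholds are our own choices and are not part of the lemma. It simulates a bounded, geometrically α-mixing sequence and reports empirical tail probabilities of its partial sums, which indeed decay at a sub-exponential rate in x.

    import numpy as np

    rng = np.random.default_rng(0)

    def partial_sums(n=500, reps=2000, phi=0.5):
        # X_t = sign(Z_t), where Z_t is a stationary Gaussian AR(1): bounded (|X_t| <= 1),
        # mean zero by symmetry, and geometrically alpha-mixing.
        sums = np.empty(reps)
        for r in range(reps):
            z = rng.normal(size=n)
            for t in range(1, n):
                z[t] = phi * z[t - 1] + np.sqrt(1.0 - phi**2) * z[t]
            sums[r] = np.sign(z).sum()
        return sums

    S = partial_sums()
    for x in (20, 40, 60, 80):
        print(x, np.mean(np.abs(S) >= x))   # empirical P(|sum_t X_t| >= x)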

Lemma A.3. Suppose that the latent process {Z_t}_{t∈Z} follows the VAR(1) model (1.1) with Assumption M(ii) and (iii). Let Ω_j := Var((Z_t^⊤, ..., Z_{t+j}^⊤)^⊤) for any j ∈ {0, 1, 2, ...}. Then, Ω_j is positive definite and 0 < c_Ω < λ_min(Ω_j) ≤ λ_max(Ω_j) < C_Ω < ∞ for some constants c_Ω and C_Ω that only depend on (C_A, c_E, C_E).

Proof. Let q_{j,min} be the eigenvector corresponding to λ_min(Ω_j). We will prove the lemma in two steps: first for the case j = 0 and then for the case j > 0.
Step 1. When j = 0, Ω_0 = Σ_0 = Σ_{k=0}^{∞} A^k Σ_E (A^k)^⊤. Then,

λ_min(Ω_0) = Σ_{k=0}^{∞} q_{0,min}^⊤ A^k Σ_E (A^k)^⊤ q_{0,min} ≥ q_{0,min}^⊤ Σ_E q_{0,min} ≥ λ_min(Σ_E), and
λ_max(Ω_0) ≤ λ_max(Σ_E) Σ_{k=0}^{∞} ‖(A^⊤)^k‖_2² ≤ λ_max(Σ_E)/(1 − ‖A‖_2²) ≤ λ_max(Σ_E)/(1 − C_A²).

Step 2. We investigate the case j > 0. We can represent Ω_j as follows (cf. Section 1.4 in Gómez (2016)).

32          Id 0d×d ······ 0d×d Zt  Ω0 0d×d ······ 0d×d    AId 0d×d 0d×d ···  E  0d×d ΣE 0d×d ······     t+1   A2 AI 0 ···    0 0 Σ 0 ···  > Ωj = Var  d d×d  Et+2 = L  d×d d×d E d×d  L ,  . . . . .   .   . . . . .   ......   .   ......     .     .    . .   Aj Aj−1 .. AI Et+j  0 . . 0 Σ  d  d×d d×d E | {z } L with       Id 0d×d ··· 0d×d 0d×d 0d×d ··· 0d×d 0d×d 0d×d ··· 0d×d  .   . .   . . . .  0 I 0 .   A 0 .. .   ......   d×d d d×d   d×d    L =  .  +  .  + ··· +   and  . .. ..   .. .. .   ..   . . . 0d×d 0d×d . . .  0d×d 0d×d . 0d×d       .. .. j .. 0d×d . 0d×d Id 0d×d . A 0d×d A 0d×d . 0d×d   Id 0d×d ··· 0d×d  ..   −AId 0d×d .  −1   L =   .  ......  0d×d . . .    . . 0d×d −AId We have j X k 1 −1 kLk2 = kA k2 ≤ and kL k2 ≤ 1 + kAk2. 1 − kAk2 k=0 Therefore,

λ_max(Ω_j) ≤ λ_max(Ω_0) ‖L^⊤‖_2² ≤ λ_max(Ω_0)/(1 − ‖A‖_2)², and
1/λ_min(Ω_j) = λ_max(Ω_j^{−1}) ≤ λ_max(Σ_E^{−1}) ‖L^{−1}‖_2² ≤ λ_max(Σ_E^{−1}) (1 + ‖A‖_2)².
Combining Steps 1 and 2 concludes the proof.
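The eigenvalue bounds in Lemma A.3 can also be checked numerically for any particular stationary VAR(1). The sketch below is illustrative only: the dimension, the coefficient matrix, and the use of scipy are our own choices. It solves the stationary equation Σ_0 = AΣ_0A^⊤ + Σ_E, forms Ω_1, and prints its extreme eigenvalues.

    import numpy as np
    from scipy.linalg import solve_discrete_lyapunov

    rng = np.random.default_rng(1)
    d = 5
    A = 0.4 * rng.standard_normal((d, d)) / np.sqrt(d)   # spectral norm comfortably below 1
    Sigma_E = np.eye(d)

    Sigma0 = solve_discrete_lyapunov(A, Sigma_E)          # Sigma0 = A Sigma0 A' + Sigma_E
    Sigma1 = A @ Sigma0                                   # Cov(Z_{t+1}, Z_t)
    Omega1 = np.block([[Sigma0, Sigma1.T],
                       [Sigma1, Sigma0]])                 # Var((Z_t', Z_{t+1}')')
    eig = np.linalg.eigvalsh(Omega1)
    print(eig.min(), eig.max())                           # bounded away from 0 and infinity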

A.2 Proofs in Section 3

A.2.1 Notation

In addition to the notation introduced in the main text and Section A.1, we define the following. For a generic vector v ∈ R^d, let S(v) := w^⊤ ĥ(v) = w^⊤(Σ̂_0 v − Σ̂_{1,∗m}). We denote β(c) := (c, γ^⊤)^⊤ when we use c in the position of θ. Similarly, we denote β̂(c) := (c, γ̂^⊤)^⊤. For Kendall's tau T, its estimator T̂, and its bootstrap version T̂^∗, which are defined in Subsections 2.2 and 3.2, we denote

• T_0 = T_{[d],[d]}, T_1 = T_{[d],d+[d]}, T̂_0 = T̂_{[d],[d]}, T̂_1 = T̂_{[d],d+[d]}, T̂_0^∗ = T̂_{[d],[d]}^∗, and T̂_1^∗ = T̂_{[d],d+[d]}^∗;

• τ_{0,jk}, τ̂_{0,jk}, and τ̂_{0,jk}^∗ are the (j, k)-th elements of T_0, T̂_0, and T̂_0^∗, respectively;

• τ_{1,jk}, τ̂_{1,jk}, and τ̂_{1,jk}^∗ are the (j, k)-th elements of T_1, T̂_1, and T̂_1^∗, respectively.

Then, Σ_0 = sin((π/2)T_0), Σ_1 = sin((π/2)T_1), Σ̂_0 = sin((π/2)T̂_0), Σ̂_1 = sin((π/2)T̂_1), Σ̂_0^∗ = sin((π/2)T̂_0^∗), and Σ̂_1^∗ = sin((π/2)T̂_1^∗). We define the bootstrap version of the V-statistic, V̂^∗, as
V̂^∗ := [1/(bℓ)²] Σ_{i,j=1}^{bℓ} sign(Y_i^∗ − Y_j^∗) sign(Y_i^∗ − Y_j^∗)^⊤ = [1/(bℓ)²] Σ_{i,j=1}^{bℓ} sign(U_i^∗ − U_j^∗) sign(U_i^∗ − U_j^∗)^⊤,

where Y_t^∗ denotes a bootstrap sample point drawn from {Y_t = (X_t^⊤, X_{t+1}^⊤)^⊤} and U_t^∗ is the latent variable corresponding to Y_t^∗ via Y_t^∗ = f(U_t^∗). For V, its estimator V̂, and the bootstrap estimator V̂^∗, we denote

• V_0 = V_{[d],[d]}, V_1 = V_{[d],d+[d]}, V̂_0 = V̂_{[d],[d]}, V̂_1 = V̂_{[d],d+[d]}, V̂_0^∗ = V̂_{[d],[d]}^∗, and V̂_1^∗ = V̂_{[d],d+[d]}^∗;

• V_{0,jk}, V̂_{0,jk}, and V̂_{0,jk}^∗ are the (j, k)-th elements of V_0, V̂_0, and V̂_0^∗, respectively;

• V_{1,jk}, V̂_{1,jk}, and V̂_{1,jk}^∗ are the (j, k)-th elements of V_1, V̂_1, and V̂_1^∗, respectively.

Then, we have the following relationships: n−1 ! !! n−1 ! !! 1 X Xt,j Xt0,j 1 X Zt,j Zt0,j Vb0,jk − V0,jk = 2 g , = 2 g , , (n − 1) X X 0 (n − 1) Z Z 0 t,t0=1 t,k t ,k t,t0=1 t,k t ,k n−1 ! !! n−1 ! !! 1 X Xt,j Xt0,j 1 X Zt,j Zt0,j Vb1,jk − V1,jk = 2 g , = 2 g , , (n − 1) X X 0 (n − 1) Z Z 0 t,t0=1 t+1,k t +1,k t,t0=1 t+1,k t +1,k b` ∗ ! ∗ !! b` ∗ ! ∗ !! 1 X X X 0 1 X Z Z 0 V ∗ − V = g t,j , t ,j = g t,j , t ,j , and 0,jk 0,jk (b`)2 X∗ X∗ (b`)2 Z∗ Z∗ 1≤t,t0≤b` t,k t0,k 1≤t,t0≤b` t,k t0,k b` ∗ ! ∗ !! b` ∗ ! ∗ !! 1 X X X 0 1 X Z Z 0 V ∗ − V = g t,j , t ,j = g t,j , t ,j , 1,jk 1,jk (b`)2 X∗ X∗ (b`)2 Z∗ Z∗ 1≤t,t0≤b` t+1,k t0+1,k 1≤t,t0≤b` t+1,k t0+1,k

where (X_{t,j}^∗, X_{t,k}^∗)^⊤ and (X_{t,j}^∗, X_{t+1,k}^∗)^⊤ are bootstrap sample points from Y_t = (X_t^⊤, X_{t+1}^⊤)^⊤, and (Z_{t,j}^∗, Z_{t,k}^∗)^⊤ and (Z_{t,j}^∗, Z_{t+1,k}^∗)^⊤ are the corresponding latent bootstrap points, obtained from (X_{t,j}^∗, X_{t,k}^∗)^⊤ and (X_{t,j}^∗, X_{t+1,k}^∗)^⊤ via X_t = f(Z_t). Note that max_{j,k} |V̂_{0,jk} − V_{0,jk}| = O_P(√(log(ed)/n)) and max_{j,k} |V̂_{1,jk} − V_{1,jk}| = O_P(√(log(ed)/n)) by the proof of Theorem 2.1.
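For concreteness, a minimal Python sketch of the circular block bootstrap (CBB) used above is given next. It is illustrative only: the block length ell and the number of blocks b are placeholder choices, and the routine simply resamples rows of the stacked data matrix Y and recomputes the V-statistic on the bootstrap sample.

    import numpy as np

    def cbb_sample(Y, ell, b, rng):
        # Circular block bootstrap: draw b blocks of length ell from the circularly
        # extended rows of Y (an (n-1) x 2d array) and concatenate them.
        m = Y.shape[0]
        starts = rng.integers(0, m, size=b)                     # uniform block starting points
        idx = (starts[:, None] + np.arange(ell)[None, :]) % m   # wrap around the circle
        return Y[idx.reshape(-1)]

    def v_stat(U):
        # Entrywise V-statistic (1/m^2) sum_{i,j} sign(U_i - U_j) sign(U_i - U_j)'.
        m = U.shape[0]
        S = np.sign(U[:, None, :] - U[None, :, :])
        return np.einsum('ijk,ijm->km', S, S) / m**2

    # usage sketch: V_star = v_stat(cbb_sample(Y, ell=10, b=20, rng=np.random.default_rng(0)))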

A.2.2 Verification of Assumptions 1–5 in Neykov et al. (2018a)

In this subsection, we will verify Assumptions 1 to 5 in Neykov et al. (2018a). Assumptions 1, 2, 4, and 5 have been verified in Appendix J in Neykov et al. (2018b). We will verify Assumption 3 by following the proofs of Lemma H.4 in Neykov et al. (2018b), Lemmas 3.3 and 3.6 to 3.8 in Dehling and Wendler (2010), and Theorem 3.1 in Fan et al. (2016).

Lemma A.4 (Assumption 3 in Neykov et al. (2018a)). Suppose all assumptions in Theorems 3.1 and 3.2 hold. Then, Assumption 3 in Neykov et al. (2018a) holds, that is,
√(n−1) S(β)/σ_n →_d N(0, 1).
Proof. In this part, we will follow the proofs of Lemma H.4 in Neykov et al. (2018b), Lemmas 3.6 and 3.8 in Dehling and Wendler (2010), and Theorem 3.1 in Fan et al. (2016). Following the proof of Lemma H.4 in Neykov et al. (2018b), we have

√ √ > n − 1S(β) = n − 1w (Σb 0β − Σb 1,∗m) √ > = n − 1w ((Σb 0 − Σ0)β − (Σb 1,∗m − Σ1,∗m)) √ > n  π   π    π   π o = n − 1w sin Tb 0 − sin T0 β − sin Tb 1,∗m − sin T1,∗m 2 2 2 2   √  X   π   π  X   π   π  = n − 1 w β sin τ − sin τ + w sin τ − sin τ j k 2 b0,jk 2 0,jk j 2 b1,jm 2 1,jm j∈Sw,k∈S j∈Sw    √  X  π  π X  π  π  = n − 1 w β cos τ (τ − τ ) − w cos τ (τ − τ ) j k 2 0,jk 2 b0,jk 0,jk j 2 1,jm 2 b1,jm 1,jm j∈Sw,k∈S j∈Sw  | {z } K1   √  1 X  π   π 2 1 X  π   π 2 − n − 1 w β sin τ (τ − τ ) − w sin τ (τ − τ ) , 2 j k 2 e0,jk 2 b0,jk 0,jk 2 j 2 e1,jm 2 b1,jm 1,jm  j∈Sw,k∈S j∈Sw  | {z } K2

where τ̃_{0,jk} is between τ̂_{0,jk} and τ_{0,jk}, and τ̃_{1,jm} is between τ̂_{1,jm} and τ_{1,jm}. The second equality holds because Σ_0 β − Σ_{1,∗m} = 0. Steps 1 and 2 show that K_2 = o_P(1). We also have K_1/ν_n →_d N(0, 1) for an appropriate value of ν_n, which will be presented in Step 2-2.
Step 1. We will show K_2 = o_P(1) by proving E[K_2²] = o(1) using Lemma A.10 in Section A.2.5, which follows the proof of Lemma 3.6 in Dehling and Wendler (2010). Because (a + b)² ≤ 2(a² + b²), it is sufficient to show that

 2 X π  (n − 1) w β sin τ (τ − τ )2 = o(1) and E  j k 2 e0,jk b0,jk 0,jk   j∈Sw,k∈S

 2 X π  (n − 1) w sin τ (τ − τ )2 = o(1). E  j 2 e1,jm b1,jm 1,jm   j∈Sw For the first one, we have

 2

X π  2 (n − 1)E  wjβk sin τ0,jk (τ0,jk − τ0,jk)   2 e b j∈Sw,k∈S  

X X π  π  2 2 = (n − 1)E  wjβkwj0 βk0 sin τe0,jk sin τe0,j0k0 (τb0,jk − τ0,jk) (τb0,j0k0 − τ0,j0k0 )  0 0 2 2 j∈Sw,k∈S j ∈Sw,k ∈S

35   X X 2 2 ≤ (n − 1)E  |wj||βk||wj0 ||βk0 |(τb0,jk − τ0,jk) (τb0,j0k0 − τ0,j0k0 )  0 0 j∈Sw,k∈S j ∈Sw,k ∈S X X 2 2 ≤ |wj||βk||wj0 ||βk0 |(n − 1)E[(τb0,jk − τ0,jk) (τb0,j0k0 − τ0,j0k0 ) ] 0 0 j∈Sw,k∈S j ∈Sw,k ∈S 1 X X  4 4 ≤ |wj||βk||wj0 ||βk0 |(n − 1) E[(τb0,jk − τ0,jk) ] + E[(τb0,j0k0 − τ0,j0k0 ) ] 2 0 0 j∈Sw,k∈S j ∈Sw,k ∈S X 4 = kwk1kβk1 |wj||βk|(n − 1)E[(τb0,jk − τ0,jk) ], j∈Sw,k∈S where the last inequality comes from |ab| ≤ (a2 + b2)/2. Because (a + b)4 ≤ 8a4 + 8a4, we have

4 E[(τb0,jk − τ0,jk) ] ( n−1 )4 2 X 2 X = g (Z ) + g (Z , Z 0 ) E  n − 1 1 t,{j,k} (n − 1)(n − 2) 2 t,{j,k} t ,{j,k}  t=1 t

 n−1 !4 " !4# 128 X 128 X ≤ g (Z ) + g (Z , Z 0 ) (n − 1)4 E  1 t,{j,k}  (n − 1)4(n − 2)4 E 2 t,{j,k} t ,{j,k} t=1 t

X π  2 2 2 −1 (n − 1)E  wjβk sin τ0,jk (τ0,jk − τ0,jk)   ≤ Ckwk1kβk1O(n ) = o(1). 2 e b j∈Sw,k∈S Similarly, we have  2 X π  (n − 1) w sin τ (τ − τ )2 = o(1). E  j 2 e1,jm b1,jm 1,jm   j∈Sw

In conclusion, we have E[K_2²] = o(1), Var(K_2) = o(1), and K_2 = o_P(1).

Step 2. We will investigate the asymptotic distribution of K1. Using Hoeffding decomposition of τb0,jm − τ0,jm and τb1,jm − τ1,jm, we can decompose K1 as   √  X π  π X π  π  K = n − 1 w β cos τ (τ − τ ) − w cos τ (τ − τ ) 1 j k 2 0,jk 2 b0,jk 0,jk j 2 1,jm 2 b1,jm 1,jm j∈Sw,k∈S j∈Sw 

= K11 + K12, where

 n−1 !! n−1 !! √  X  π  π X Zt,j X  π  π X Zt,j  K11 = n − 1 wj βk cos τ0,jk g1 − wj cos τ1,jm g1 , 2 n − 1 Zt,k 2 n − 1 Zt+1,m j∈Sw,k∈S t=1 j∈Sw t=1 

36  ! !! √  X π  π X Zt,j Zt0,j K12 = n − 1 wjβk cos τ0,jk g2 , 0 2 (n − 1)(n − 2) 0 Zt,k Zt ,k j∈Sw,k∈S t

d 2 2 Steps 2-1 and 2-2 combined show that K12 = oP(1) and K11/νn −→ N(0, 1), where νn = E[K11]. 2 Step 2-1. In this part, we will show K12 = oP(1) by proving E[K12] = o(1). Similar to Step 1, Lemma A.10 in Section A.2.5 implies

 ! !!2 X π  π X Zt,j Zt0,j 2 2 −1 (n−1)E  wjβk cos τ0,jk g2 ,   ≤ Ckwk1kβk1O(n ) 0 2 (n − 1)(n − 2) 0 Zt,k Zt ,k j∈Sw,k∈S t

 ! !!2 X π  π X Zt,j Zt0,j 0 2 −1 (n−1)E  wj cos τ1,jm g2 ,   ≤ C kwk1O(n ), 0 2 (n − 1)(n − 2) 0 Zt+1,m Zt +1,m j∈Sw t

0 where C and C are some constants which only depend on (CA, cE,CE, κ1, κ2). Therefore, we have 2 E[K12] = o(1), Var(K12) = o(1), and K12 = oP(1). d 2 2 Step 2-2. We show K11/νn −→ N(0, 1), where νn = E[K11]. First, we can rewrite K11 as    !! !! 1 n−1  π  Z π  Z  X  X t,j X t,j  K11 = √  wjβkπ cos τ0,jk g1 − wjπ cos τ1,jm g1  . n − 1  2 Zt,k 2 Zt+1,m  t=1 j∈Sw,k∈S j∈Sw  | {z } Gn1(Ut)

Then, {Gn1(Ut)}t∈Z is also geometrically α-mixing because Gn1(Ut) is the function of Ut = > > > (Zt , Zt+1) and Ut is geometrically α-mixing. Also, Gn1(Ut) is bounded under our assumption because −1 |Gn1(Ut)| ≤ 2πkwk1(kβk1 + 1) ≤ 2πkΣ0 k1(M + 1) = O(1).

This implies that ν_n² < C̃ for some absolute positive constant C̃. Also, similar to the proof of Theorem 3.1 in Fan et al. (2016), we have
|σ_n² − E[K_{11}²]| ≤ Var(K_{12} + K_2) + √(E[K_{11}²]) √(Var(K_{12} + K_2)) = o(1) + O(1)·o(1),   (A.4)

which implies σ_n² − E[K_{11}²] = o(1). The assumptions in Theorem 3.2 imply that σ_n² = Var(K_{11} + K_{12} + K_2) > C > 0 for some absolute constant C. This implies that there exist absolute positive constants N, c̃, and C̃ such that c̃ < ν_n² = E[K_{11}²] < C̃ for all n ≥ N. Therefore, we can

use a triangular-array CLT and get the following result (see Theorem 10.2 in Pötscher and Prucha (1997), and also Ekström (2014)):

K_{11}/ν_n →_d N(0, 1), where ν_n² = E[K_{11}²].

Because Steps 1 and 2-1 imply Var(K12) = o(1), Var(K2) = o(1), K12 = oP(1), and K2 = oP(1), we have the desired result.

A.2.3 Proofs of Theorems 3.1 and 3.2

Given Lemma A.4, Theorem 3.1 now holds by checking all conditions in Theorem 1 in Neykov et al. (2018a), and Theorem 3.2 now holds by Theorem 2 in Neykov et al. (2018a).

A.2.4 Proof of Theorem 3.3
Note that under the assumptions in Theorem 3.3, Theorem 2.3 implies ‖β̂ − β‖_1 = O_P(s√(log(ed)/n)), and the proof of Theorem 3.1 implies ‖ŵ − w‖_1 = O_P(s_w√(log(ed)/n)) by verifying Assumption 2 in Neykov et al. (2018a). Given this information, we will prove Theorem 3.3 in two steps. Step 1 will show (bℓ)Var^∗(w^⊤(Σ̂_0^∗ β − Σ̂_{1,∗m}^∗)) − σ_n² = o_P(1). Step 2 will show σ̂_n^{2∗} − (bℓ)Var^∗(w^⊤(Σ̂_0^∗ β − Σ̂_{1,∗m}^∗)) = o_P(1).

| {z∗ } K1 √   b`  X π  hπ i2 X π  hπ i2 − w β sin τ ∗ (τ ∗ − τ ) − w sin τ ∗ (τ ∗ − τ ) , 2 j k 2 ejk 2 b0,jk 0,jk j 2 e1,jm 2 b1,jm 1,jm j∈Sw,k∈S j∈Sw 

| {z∗ } K2 ∗ ∗ ∗ ∗ where τe0,jk is between τb0,jk and τ0,jk, and τe1,jm is between τb1,jm and τ1,jm. We have   √  X π  π X π  π  K∗ = b` w β cos τ (τ ∗ − τ ) − w cos τ (τ ∗ − τ ) 1 j k 2 0,jk 2 b0,jk 0,jk j 2 1,jm 2 b1,jm 1,jm j∈Sw,k∈S j∈Sw    n ∗ ! ∗ ! 1 X  X π  Zt,j X π  Zt,j  ∗ = √ wjβk cos τ0,jk πg1 − wj cos τ1,jm πg1 +K , b` 2 Z∗ 2 Z∗ 12 t=1 j∈Sw,k∈S t,k j∈Sw t+1,m 

| {z∗ } K11 where √ ∗ ! ∗ !! ∗ X π  b` X Zt,j Zt0,j K12 := wjβk cos τ0,jk π g2 ∗ , ∗ 2 (b`)(b` − 1) 0 Z Z 0 j∈Sw,k∈S t

38 √ ∗ ! ∗ !! X π  b` X Zt,j Zt0,j − wj cos τ0,jm π g2 ∗ , ∗ . 2 (b`)(b` − 1) 0 Z Z 0 j∈Sw t

∗ ∗ 2 ∗ ∗ ∗ ∗ Steps 1-1 to 1-3 below combined will show E [(K2 ) ] = oP(1), Var (K12) = oP(1), and Var (K11) − 2 E[K11] = oP(1) by following the proofs of Theorem 2.1 in Dehling and Wendler(2010), Theorem 2.1 in Gon¸calves and White(2002), and Theorem 3.1 in Lahiri(2003). Note that Steps 1-1 to 1-3 imply ∗ ∗ ∗ > ∗ ∗ Var (K11) − (b`)Var (w (Σb 0β − Σb 1,∗m)) = oP(1) because, similar to the proof of Theorem 3.2 in Fan et al.(2016), we have q q ∗ ∗ ∗ > ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ |Var (K11)−(b`)Var (w (Σb 0β−Σb 1,∗m))| ≤ Var (K12+K2 )+ Var (K11) Var (K12 + K2 ) = oP(1).

∗ ∗ 2 Step 1-1. We show that E [(K2 ) ] = oP(1). Note that √   b`  X π  hπ i2 X π  hπ i2 K∗ = w β sin τ (τ ∗ − τ ) − w sin τ (τ ∗ − τ ) . 2 2 j k 2 ejk 2 b0,jk 0,jk j 2 e1,jm 2 b1,jm 1,jm j∈Sw,k∈S j∈Sw 

∗ ∗ 4 Then, similar to the proof of Lemma A.4, it is sufficient to bound (b`)E[E [(τb0,jk − τ0,jk) ] and ∗ ∗ 4 4 4 4 (b`)E[E [(τb1,jm − τ1,jm) ]] uniformly. Because (a + b) ≤ 8a + 8a , we have

∗ ∗ 4 E[E [(τb0,jk − τ0,jk) ]]   4 ( b` ∗ !! ∗ ! ∗ !!) ∗ 2 X Zt,j 2 X Zt,j Zt0,j =E E  g1 ∗ + g2 ∗ , ∗  b` Z b`(b` − 1) Z Z 0 t=1 t,k t

  4 4 ( b` ∗ !!) " "( ∗ ! ∗ !!) ## 128 ∗ X Zt,j 128 ∗ X Zt,j Zt0,j ≤ 4 E E  g1 ∗  + 4 4 E E g2 ∗ , ∗ (b`) Z (b`) (b` − 1) Z Z 0 t=1 t,k t

∗ ! ∗ !! ∗ ! ∗ !! ∗ 2 X Zt,j Zt0,j ∗ 2 X Zt,j Zt0,j U (g2) = g2 , and U (g2) = g2 , . 0,jk (b`)(b` − 1) Z∗ Z∗ 1,jk (b`)(b` − 1) Z∗ Z∗ t

39 ∗ Then, we can rewrite K12 as π X π  √ π X π  √ K∗ = w β cos τ b`U ∗ (g ) − w cos τ b`U ∗ (g ). 12 2 j k 2 0,jk 0,jk 2 2 j 2 0,jm 1,jk 2 j∈Sw,k∈S j∈Sw

Because (a + b)2 ≤ 2(a2 + b2), it is sufficient to show

  2 √ " " √ !2## ∗ X  π  ∗ ∗ X  π  ∗ w β cos τ b`U (g ) = o(1) and w cos τ b`U (g ) = o(1). E E  j k 2 0,jk 0,jk 2   E E j 2 0,jm 1,jm 2 j∈Sw,k∈S j∈Sw From proofs of Lemmas 3.6 and 3.7 in Dehling and Wendler(2010) and Lemma A.8 in Section A.2.5, we have " ! !! ! !!# C X Zi ,j Zi ,j Zi ,j Zi ,j (b`) [ ∗[(U ∗ (g ))2]] ≤ g 1 , 2 g 3 , 4 E E 0,jk 2 2 E 2 2 (b`)(b` − 1) Zi1,k Zi2,k Zi3,k Zi4,k i1,i2,i3,i4 n C0 X ≤ n2 mα (m)2/5 (b`)(b` − 1)2 U m=1 ≤ C00−1,

0 00 where C is an absolute constant, C only depends on (CA, cE,CE), and C only depends on

(CA, cE,CE, κ1, κ2). Therefore, by doing the same calculation as in Lemma A.4, we have   2 X π  √ ∗ w β cos τ b`U ∗ (g ) ≤ 2C00kwk2kβk2O(n−1) = o(1). E E  j k 2 0,jk 0,jk 2   1 1 j∈Sw,k∈S

Using the same approach, we have

  2 X π √ ∗ w cos τ b`U ∗ (g ) = o(1). E E  j 2 0,jm 1,jm 2   j∈Sw

∗ ∗ 2 ∗ ∗ ∗ ∗ Therefore, we have E[E [(K12) ]] = o(1) and it implies that Var (K12) ≤ E (K12) = oP(1). ∗ ∗ ∗ 2 Step 1-3. We will show Var (K11) − E [K11] = oP(1). Define  ! !  X π  Zt,j X π  Zt,j  Gn1(Ut) = wjβk cos τ0,jk πg1 − wj cos τ1,jm πg1 , 2 Zt,k 2 Zt+1,m j∈Sw,k∈S j∈Sw    ∗ ! ∗ ! ∗ ∗  X π  Zt,j X π  Zt,j  G (U ) = wjβ cos τ πg1 − wj cos τ1,jm πg1 , n1 t k 2 0,jk Z∗ 2 Z∗ j∈Sw,k∈S t,k j∈Sw t+1,m 

∗ −1 ∗ ∗ where Ut = f (Yt ). Then, we can rewrite K11 and K11 as

n−1 b` 1 X ∗ 1 X ∗ ∗ K11 = √ Gn1(Ut) and K = √ G (U ). n − 1 11 n1 t t=1 b` t=1

40 Note that {Gn1(Ut)}t∈Z is a triangular array with geometrically α-mixing process because {Ut}t∈Z is a geometrically α-mixing process and Gn1(·) is continuous non-random function. Then, we can show the consistency of bootstrap version of variance following the proofs of Theorem 2.1 in Gon¸calves and White(2002) and Theorem 3.1 in Lahiri(2003). 2∗ ∗ ∗ ∗ Let σbn,MBB be the bootstrap variance of K11 when Gn1(Ut ) is constructed from the moving block 2∗ ∗ ∗ bootstrap (MBB) sample, not the circular block bootstrap (CBB) sample. Let σbn,CBB := Var (K11) ∗ ∗ ∗ be the bootstrap variance of K11 when Gn1(Ut ) is constructed from the circular block bootstrap (CBB) sample. We refer to Chapter 3 of Lahiri(2003) for definitions of MBB and CBB. Then, we finish Step 1-3 by doing the following procedure. 2∗ 2∗ Step 1-3-1. We show that σbn,MBB − σbn,CBB = oP(1) by following the proof of Theorem 3.1 in Lahiri(2003) even though we are dealing with the triangular array because Assumption M implies

Gn1(·) is bounded above by some absolute constant. Step 1-3-2. Because an α-mixing process is the special case of an NED process, we can show 2∗ 2 σbn,MBB − E[K11] = oP(1) by following the proof of Theorem 2.1 in Gon¸calves and White(2002). We skip the proofs of Steps 1-3-1 and 1-3-2 because it can be easily verified by following the proofs of Theorem 3.1 in Lahiri(2003) and Theorem 2.1 in Gon¸calves and White(2002) step by ∗ > ∗ step. In conclusion, the proof of Lemma A.4 and Steps 1-1 to 1-3 imply that (b`)Var (w (Σb 0β − ∗ 2 Σb 1,∗m)) − σn = oP(1). 2∗ ∗ > ∗ ∗ Step 2. we will show that σbn − (b`)Var (w (Σb 0β − Σb 1,∗m)) = oP(1). Note that √ √ √ √ > ∗ ∗ > ∗ ∗ > ∗ ∗ > ∗ b`wb (Σb 0βb(θ)−Σb 1,∗m)− b`w (Σb 0β−Σb 1,∗m) = b`(wb−w) (Σb 0β−Σb 1,∗m)+ b`wb Σb 0(βb(θ)−β).

Steps 2-1 and 2-2 below combined will show that √ √ ∗ > ∗ ∗ 2 ∗ > ∗ 2 E [| b`(wb − w) (Σb 0β − Σb 1,∗m)| ] = oP(1) and E [| b`wb Σb 0(βb(θ) − β)| ] = oP(1). √ ∗ > > ∗ ∗ 2 Step 2-1. We show that E [| b`(wb − w) (Σb 0β − Σb 1,∗m)| ] = oP(1). Because Σ0β − Σ1,∗m = 0, we have √ √ > ∗ ∗ > ∗ ∗ b`(wb − w) (Σb 0β − Σb 1,∗m) = b`(wb − w) ((Σb 0 − Σ0)β − (Σb 1,∗m − Σ1,∗m)).

Therefore, it is sufficient to show that √ √ ∗ > ∗ 2 ∗ > ∗ 2 E [| b`(wb − w) (Σb 0 − Σ0)β| ] = oP(1) and E [( b`(wb − w) (Σb 1,∗m − Σ1,∗m)) ] = oP(1).

Because |sin(x) − sin(y)| ≤ |x − y| for any x, y ∈ R, and b` 1 b` 1 |τ ∗ − τ | ≤ |V ∗ − V | + |V | ≤ |V ∗ − V | + , b0,jk 0,jk (b` − 1) 0,jk 0,jk (b` − 1) 0,jk (b` − 1) 0,jk 0,jk (b` − 1) we have √ > ∗ | b`(wb − w) (Σb 0 − Σ0)β|

41

√ X  π ∗  π  = b` (wj − wj) sin τ − sin τ0,jk βk b 2 b0,jk 2 j∈Sw,k∈S π √ X ≤ b` |w − w ||β ||τ ∗ − τ | 2 bj j w b0,jk 0,jk j∈Sw,k∈S π √ b` X π √ 1 X ≤ b` |w − w ||β ||V ∗ − V | + b` |w − w ||β | 2 (b` − 1) bj j k 0,jk 0,jk 2 (b` − 1) bj j k j∈Sw,k∈S j∈Sw,k∈S π √ b` X π √ 1 ≤ b` |w − w ||β ||V ∗ − V | + b` kw − wk kβk . 2 (b` − 1) bj j k 0,jk 0,jk 2 (b` − 1) b 1 1 j∈Sw,k∈S This implies √ ∗ > ∗ 2 E [| b`(wb − w) (Σb 0 − Σ0)β| ]  2 2 π2  b`  X π2 1 ≤ (b`) ∗ |w − w ||β ||V ∗ − V | + (b`) kw − wk2kβk2. 2 (b` − 1) E  bj j k 0,jk 0,jk   2 (b` − 1)2 b 1 1 j∈Sw,k∈S

First, we have

1 s2 log(ed) (b`) kw − wk2kβk2 = kβk2O w = o (1) (b` − 1)2 b 1 1 1 P n2 P Second, by doing a similar calculation as in Step 1 in Lemma A.4, one has  2 2  b`  X (b`) ∗ |w − w ||β ||V ∗ − V | (b` − 1) E  bj j k 0,jk 0,jk   j∈Sw,k∈S   2  b`   X  ≤(b`) kw − wk kβk |w − w ||β | ∗[(V ∗ − V )2] . (b` − 1) b 1 1 bj j k E 0,jk 0,jk j∈Sw,k∈S 

Note that Lemma A.11 in Section A.2.5 implies

∗ ∗ 2 b`(b` − `)(b` − 2`)(b` − 3`) 2 50` E [(V0,jk − V0,jk) ] − (Vb0,jk − V0,jk) ≤ . (b`)4 b` Therefore, by combining the above three equations, we have √ ∗ > ∗ 2 E [| b`(wb − w) (Σb 0 − Σ0)β| ]   2  b`   X  ≤(b`) kw − wk kβk |w − w ||β | ∗[(V ∗ − V )2] + o (1) (b` − 1) b 1 1 bj j k E 0,jk 0,jk P j∈Sw,k∈S     2 4 b`  X b(b − 1)(b − 2)(b − 3)` 2 ≤(b`) kw − wk1kβk1 |wj − wj||βk| (Vb0,jk − V0,jk) (b` − 1) b b (b`)4 j∈Sw,k∈S    2  b`   X `  + (b`) kw − wk kβk |w − w ||β |50 + o (1) (b` − 1) b 1 1 bj j k b` P j∈Sw,k∈S 

42 4  2 b(b − 1)(b − 2)(b − 3)` b` 2 2 2 = (b`)kw − wk1kβk1(max{|Vb0,jk − V0,jk|}) (b`)4 (b` − 1) b j,k  b` 2 + 50 `kw − wk2kβk2 + o (1) (b` − 1) b 1 1 P s2 log(ed) log(ed) √ s2 log(ed) =O(n)O w kβk2O + O( n)O w kβk2 + o (1) p n 1 P n P n 1 P s2 (log(ed))2  s2 log(ed) =kβk2O w + kβk2O w √ + o (1) 1 P n 1 P n P

=oP(1), because b`  n and `−1 + `2/n = o(1) imply b(b − 1)(b − 2)(b − 3)`4 `  b` 2 = O(1), √ = o(1), and = O(1). (b`)4 n (b` − 1) √ √ ∗ > ∗ 2 ∗ > Similarly, we can show E [( b`(wb−w) (Σb 1,∗m−Σ1,∗m)) ] = oP(1). Therefore, we have E [| b`(wb − w)>(Σb ∗β − Σb ∗ )|2] = o (1). 0 1,∗m P √ ∗ > ∗ 2 Step 2-2. We show E [| b`wb Σb 0(βb(θ) − β)| ] = oP(1). > Because w Σ0(βb(θ) − β) = 0, we have √ √ > ∗ > ∗ > b`wb Σb 0(βb(θ) − β) = b`(wb Σb 0 − w Σ0)(βb(θ) − β) √ √ > ∗ > = b`wb (Σb 0 − Σ0)(βb(θ) − β) + b`(wb − w) Σ0(βb(θ) − β). √ ∗ > ∗ 2 Then, Steps 2-2-1 and 2-2-2 below combined will show E [( b`w (Σb 0 − Σ0)(βb(θ) − β)) ] = o (1) √ b P ∗ > 2 and E [( b`(w − w) Σ0(βb(θ) − β)) ] = o (1). b √ P ∗ > ∗ 2 Step 2-2-1. We show that E [( b`wb (Σb 0 − Σ0)(βb(θ) − β)) ] = oP(1). We can do the same calculation as in Step 2-1 and obtain √ ∗ > ∗ 2 E [( b`wb (Σb 0 − Σ0)(βb(θ) − β)) ] 4  2 b(b − 1)(b − 2)(b − 3)` b` 2 2 2 ≤ (b`)kwk1kβb(θ) − βk1(max{|Vb0,jk − V0,jk|}) (b`)4 (b` − 1) b j,k  b` 2 + 50 `kwk2kβb(θ) − βk2 + o (1) (b` − 1) b 1 1 P  s2 log(ed) s2 log(ed) log(ed) =O(1)O(n) kwk2 + O w O O 1 P n P n P n √  s2 log(ed) s2 log(ed) + O(1)O( n) kwk2 + O w O + o (1) 1 P n P n P

=oP(1). √ ∗ > 2 Step 2-2-2, We show that E [( b`(wb − w) Σ0(βb(θ) − β)) ] = oP(1). This holds because √  2 2 2  ∗ > 2 2 2 sws (log(ed)) | [( b`(w − w) Σ0(βb(θ) − β)) ]| ≤ (b`)kw − wk k(βb(θ) − β)k = O = o (1). E b b 1 1 P n P By combining Steps 1 and 2, we have the desired result.

A.2.5 Technical lemmas used in Section 3

The following lemmas apply results of Dehling and Wendler (2010) to our setting. In the following, we use

1_A as a shorthand for 1(x ∈ A).

Lemma A.5 (Theorem 2 in Bradley (1983)). Suppose R and W are random variables taking their values in Borel spaces S_1 and S_2, respectively, and suppose U is a Uniform[0, 1] random variable independent of (R, W). Suppose N is a positive integer and H = {H_1, H_2, ..., H_N} is a measurable partition of S_2. Then, there exists an S_2-valued random variable W̆ = f(R, W, U), where f is a measurable function from S_1 × S_2 × [0, 1] into S_2, such that

1. W̆ is independent of R;

2. the probability distributions of W̆ and W on S_2 are identical;
3. P(W̆ and W are not elements of the same H_i ∈ H) ≤ (8N)^{1/2} α(σ(R), σ(W)).

The following is Bradley's lemma with S_2 = R², obtained by following the proof of Theorem 3 in Bradley (1983), which presents Bradley's lemma for S_2 = R.

Lemma A.6. Suppose R and W are random variables and W = (W_1, W_2)^⊤ ∈ R². Let us denote

M = max{‖W_1‖_γ, ‖W_2‖_γ}. Then there exists a random variable W̆ such that

1. W˘ is independent of R;

2. the probability distributions of W̆ and W on R² are identical;
3. for any 0 < q ≤ M,

P( ‖W − W̆‖_∞ ≥ q ) ≤ 50 (M/q)^{γ/(γ+1)} α(σ(R), σ(W))^{γ/(γ+1)}.

Proof. Following the proof of Theorem 3 in Bradley(1983), let α = α(σ(R), σ(W )), M = γ 1/(γ+1) max{kW1kγ, kW2kγ}, and x = [(1/α)(M/q) , max{γ, 1}] . Note that x ≥ 1 by construc- tion. Further let m be such that x ≤ m ≤ 2x. We construct the rectangular Hij for −m ≤ i, j ≤ m and Hm+1,m+1 as follows:

Hij = [xi − q/2, xi + q/2) × [xj − q/2, xj + q/2)

S c for −m ≤ i, j ≤ m, and Hm+1,m+1 = −m≤i,j≤m Hij . Then, we can set H as the union S ˘ of Hm+1,m+1 and −m≤i,j≤m Hij. From Lemma A.5 , we have that the random variable W = > (W˘ 1, W˘ 2) is independent of R, has the same distribution as W , and

P(W̆ and W are not elements of the same H_{ij} ∈ H) ≤ 4((2m + 1)² + 1)^{1/2} α.

44 Therefore,

˘ 2 1/2 P(kW − W k∞ ≥ q) ≤ 4((2m + 1) + 1) α + P (|W1| ≥ mq + q/2) + P(|Y2| ≥ mq + q/2) ˘ ˘ + P(|W1| ≥ mq + q/2) + P(|W2| ≥ mq + q/2) 2 1/2 = 4((2m + 1) + 1) α + 2P(|W1| ≥ mq + q/2) + 2P(|W2| ≥ mq + q/2)

≤ 24xα + 2P(|W1| ≥ xq) + 2P(|W2| ≥ xq) kW k γ kW k γ ≤ 24xα + 2 1 γ x−γ + 2 2 γ x−γ q q γ γ M  M  γ+1 γ ≤ 24xα + 4 x−γ ≤ C α γ+1 q γ q γ M  γ+1 γ ≤ 50 α γ+1 , q

where C_γ := 2 + 24[max{γ, 1}]^{1/(γ+1)}. The last inequality comes from the last sentence of the proof of Theorem 3 in Bradley (1983).

Lemma A.7 (P-Lipschitz continuity of f(·, ·)). Suppose all assumptions in Proposition 2.2 hold.

Then, there exists a constant L only depending on (CA, cE,CE) such that

E[ |f(W, R) − f(W′, R)| 1_{‖W − W′‖_∞ ≤ ε} ] ≤ Lε

for every ε > 0, every pair (W^⊤, R^⊤)^⊤ with the common distribution P_{U_{1,{j,k}}, U_{t,{j,k}}} for some t ∈ N or the product measure P_{U_{1,{j,k}}} × P_{U_{1,{j,k}}}, and (W′^⊤, R^⊤)^⊤ also with one of these common distributions. Here, W, W′, and R are in R².

Proof. Because |sign(·)| ≤ 1, we have

|f(W, R) − f(W′, R)| ≤ |sign(W_1 − R_1) − sign(W′_1 − R_1)| + |sign(W_2 − R_2) − sign(W′_2 − R_2)|.

We then have

0 1 0 E[|f(W , R) − f(W , R)| kW −W k∞≤] 0 0 1 0 1 0 ≤E[|sign(W1 − R1) − sign(W1 − R1)| kW −W k∞≤] + E[|sign(W2 − R2) − sign(X2 − R2)| kW −W k∞≤] 0 0 ≤ [|sign(W − R ) − sign(W − R )|1 0 ] + [|sign(W − R ) − sign(W − R )|1 0 ] E 1 1 1 1 |W1−W1|≤ E 2 2 2 2 |W2−R2|≤

≤ 2P(|W_1 − R_1| ≤ ε) + 2P(|W_2 − R_2| ≤ ε).

The last inequality holds because |x − x′| ≤ ε and |x − y| > ε imply |sign(x − y) − sign(x′ − y)| = 0 for any real numbers x, x′, and y. Under the assumptions in Proposition 2.2, P(|W_1 − R_1| ≤ ε) ≤ D_1 ε for some absolute constant D_1 > 0. This is because of the following reasons:

45 • Note that the density of W1 − R1 is the density of U1,j − Ut,j. When the distribution of > > > (W , R ) is from PU1,{j,k} × PU1,{j,k} , W1 and R1 are independent of each other, so that

the density function of W1 − R1 is bounded above under the assumptions. Therefore, let’s > > > consider the case when (W , R ) is from PU1,{j,k},Ut,{j,k} ;

n • Because {Zt}t=1 are jointly Gaussian, the maximum of the values of the density functions of

Z1,j − Zt,j is determined by the variance of Z1,j − Zt,j;

• Under the assumptions in Proposition 2.2, we can show that the variance of Z1,j − Zt,j is uniformly bounded below using Lemma A.3. This implies that the density functions are

bounded above by a constant D1 which only depends on (CA, cE,CE);

• Therefore, P(|W_1 − R_1| ≤ ε) ≤ D_1 ε.

By the same logic, we have P(|W_2 − R_2| ≤ ε) ≤ D_1 ε. Therefore,

E[ |f(W, R) − f(W′, R)| 1_{‖W − W′‖_∞ ≤ ε} ] ≤ 2P(|W_1 − R_1| ≤ ε) + 2P(|W_2 − R_2| ≤ ε) ≤ 4D_1 ε = Lε, where L := 4D_1. That is, f(·, ·) is P-Lipschitz continuous with constant L = 4D_1.
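This P-Lipschitz property is easy to confirm by simulation. The sketch below is illustrative only: the correlation, the perturbation size, and the sample size are arbitrary choices of ours, and the printed quantity should be of order Lε.

    import numpy as np

    rng = np.random.default_rng(2)
    n, rho, eps = 200_000, 0.5, 0.05
    cov = np.array([[1.0, rho], [rho, 1.0]])

    W = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    Wp = W + eps * rng.standard_normal((n, 2))             # a nearby copy of W
    R = rng.multivariate_normal([0.0, 0.0], cov, size=n)

    f = lambda a, b: np.sign(a[:, 0] - b[:, 0]) * np.sign(a[:, 1] - b[:, 1])
    close = np.max(np.abs(W - Wp), axis=1) <= eps           # indicator 1{||W - W'||_inf <= eps}
    print(np.mean(np.abs(f(W, R) - f(Wp, R)) * close))      # empirical E[|f(W,R) - f(W',R)| 1{...}]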

Note A.1. The proof of Lemma 3.3 in Dehling and Wendler (2010) implies that g_1(·) is P-Lipschitz continuous with constant L and g_2(·, ·) is P-Lipschitz continuous with constant 2L.

In the following, we will use W_t to represent (U_{t,j}, U_{t,k})^⊤, and α_U(·) to denote the α-mixing coefficient of the process {U_t = (Z_t^⊤, Z_{t+1}^⊤)^⊤}_{t∈Z}. The next lemma verifies that the conclusion of Lemma 3.3 in Dehling and Wendler (2010) holds in our setting.

Lemma A.8. Suppose all assumptions in Proposition 2.2 hold. Then, for some constants C7 and

C8 only depending on (CA, cE,CE), we have

|E[ g_1(W_{i_1}) g_1(W_{i_2}) g_1(W_{i_3}) g_1(W_{i_4}) ]| ≤ C_7 α_U(m)^{2/5},
|E[ g_2(W_{i_1}, W_{i_2}) g_2(W_{i_3}, W_{i_4}) ]| ≤ C_8 α_U(m)^{2/5},
where m := max{a_2 − a_1, a_4 − a_3} and a_1, ..., a_4 are the increasingly rearranged values of i_1, ..., i_4.

Proof. The proof follows the proof of Lemma 3.3 in Dehling and Wendler(2010) using Lemma

A.6. In this proof, we only consider the case where i_1 < i_2 ≤ i_3 < i_4 with m = i_2 − i_1 = max{i_2 − i_1, i_4 − i_3}. The other cases can be handled similarly. By Lemma A.6, there exists a random vector W̆_{i_1} ∈ R² such that (a) W̆_{i_1} has the same distribution as W_{i_1}, (b) W̆_{i_1} is independent of W_{i_2}, ..., W_{i_4}, and (c) when 0 ≤ ε ≤ 1, we have P(‖W_{i_1} − W̆_{i_1}‖_∞ ≥ ε) ≤ 50 ε^{−2/3} α_U(m)^{2/3}. Then, E[ g_1(W̆_{i_1}) g_1(W_{i_2}) g_1(W_{i_3}) g_1(W_{i_4}) ] = 0. Because |g_1(·)| ≤ 2 and Lemma A.7 implies g_1(·) is P-Lipschitz continuous with constant L, we have

|E[g1(Wi1 )g1(Wi2 )g1(Wi3 )g1(Wi4 )]|

46  ˘  = E [g1(Wi1 ) − g1(Wi1 )]g1(Wi2 )g1(Wi3 )g1(Wi4 ) ˘ ≤8E[|g1(Wi1 ) − g1(Wi1 )|] ˘ 1 ˘ 1 =8E[|g1(Wi1 ) − g1(Wi1 )| ˘ ] + 8E[|g1(Wi1 ) − g1(Wi1 )| ˘ ] kWi1 −Wi1 k∞≥ kWi1 −Wi1 k∞≤ ˘ ≤32P(kWi1 − Wi1 k∞ ≥ ) + 8L −2/3 2/3 ≤1600 αU (m) + 8L,e

where L̃ := max{L, 1}. Picking ε = L̃^{−3/5} α_U(m)^{2/5}, we have

|E[ g_1(W_{i_1}) g_1(W_{i_2}) g_1(W_{i_3}) g_1(W_{i_4}) ]| ≤ C̃_7 L̃^{2/5} α_U(m)^{2/5}, where C̃_7 is some absolute constant. Using the same approach, one could obtain

|E[ g_2(W_{i_1}, W_{i_2}) g_2(W_{i_3}, W_{i_4}) ]| ≤ C̃_8 L̃^{2/5} α_U(m)^{2/5}, where C̃_8 is some absolute constant. This concludes the proof.

Lemma A.9. Suppose all assumptions in Proposition 2.2 hold. Suppose that among i1, i2, . . . , i8, at most two of them are identical. Then, for some constant C which only depends on (CA, cE,CE), we have

|E[ g_2(W_{i_1}, W_{i_2}) g_2(W_{i_3}, W_{i_4}) g_2(W_{i_5}, W_{i_6}) g_2(W_{i_7}, W_{i_8}) ]| ≤ C α_U(m)^{2/5},
where m := max{a_2 − a_1, a_8 − a_7} and a_1, ..., a_8 are the increasingly rearranged values of i_1, ..., i_8.

Proof. The proof is the same as Lemma A.8, and Lemma 3.3 in Dehling and Wendler(2010). Note that m ≥ 1 because we assume that at most two of i’s are identical.

Without loss of generality, we can assume that a1 = i1; a8 = i8; i1 ≤ i2; i3 ≤ i4; i5 ≤ i6; and i7 ≤ i8 because g2(W1, W2) = g2(W2, W1) for any generic vectors W1 and W2. ˘ 2 ˘ Suppose m = a2 − a1 ≥ a8 − a7. From Lemma A.6, we have Wi1 ∈ R such that (a) Wi1 has the ˘ same distribution as Wi1 ; (b) Wi1 is independent of Wi2 ... Wi8 ; and (c) when 0 ≤  ≤ 1, P(kWi1 − ˘ −2/3 2/3 ˘ Wi1 k∞ ≥ ) ≤ 50 αU (m) . Then, we have E[g2(Wi1 , Wi2 )g2(Wi3 , Wi4 )g2(Wi5 , Wi6 )g2(Wi7 , Wi8 )] = 0. Because |g2(·)| ≤ 4 , and Lemma A.7 implies that g2(·) is P -Lipchitz continuous with 2L, we have

|E[g2(Wi1 , Wi2 )g2(Wi3 , Wi4 )g2(Wi5 , Wi6 )g2(Wi7 , Wi8 )]|  ˘  = E [g2(Wi1 , Wi2 ) − g2(Wi1 , Wi2 )]g2(Wi3 , Wi4 )g2(Wi5 , Wi6 )g2(Wi7 , Wi8 ) ˘ ≤64E[|g2(Wi1 , Wi2 ) − g2(Wi1 , Wi2 )|] ˘ 1 ˘ 1 =64E[|g2(Wi1 , Wi2 ) − g2(Wi1 , Wi2 )| ˘ ] + 64E[|g2(Wi1 , Wi2 ) − g2(Wi1 , Wi2 )| ˘ ] kWi1 −Wi1 k∞≥ kWi1 −Wi1 k∞≤ ˘ ≤512P(kWi1 − Wi1 k∞ ≥ ) + 128L −2/3 2/3 ≤25600 αU (m) + 128L,e

where L̃ := max{L, 1}. When we pick ε = L̃^{−3/5} α_U(m)^{2/5}, we have

|E[ g_2(W_{i_1}, W_{i_2}) g_2(W_{i_3}, W_{i_4}) g_2(W_{i_5}, W_{i_6}) g_2(W_{i_7}, W_{i_8}) ]| ≤ 12928 L̃^{2/5} α_U(m)^{2/5}.

This completes the proof.

Lemma A.10. Suppose all assumptions in Proposition 2.2 hold. Let’s denote n 2 X 2 X U(g ) = g (W ) and U(g ) = g (W , W ). 1 n 1 i 2 n(n − 1) 2 i j i=1 i

16 X X X X | [U 4(g )]| = E[g (W , W )g (W , W )g (W , W )g (W , W )] E 2 n4(n − 1)4 2 i1 i2 2 i3 i4 2 i5 i6 2 i7 i8 i1

4 00−2 2 Proof. We only have to present the proof for E[U (g2)] ≤ C because we can prove |E[U (g1)]| ≤ −2 2 0−2 Cn and |E[U (g2)]| ≤ C by following the proof of Lemma 3.6 in Dehling and Wendler(2010). The proof is similar to Lemma 3 in Yoshihara(1976). Note that 16 X | [U 2(g )]| ≤ | [g (W , W )g (W , W )g (W , W )g (W , W )]|. E 2 n4(n − 1)4 E 2 i1 i2 2 i3 i4 2 i5 i6 2 i7 i8 1≤i1,...,i8≤n 6 Case 1. Suppose more than two indices among i1, . . . , i8 are identical. Then, we have at most n 00 possible cases for i1, . . . , i8 . Therefore, for some constant C1 which only depends on (CA, cE,CE), 16 X | [g (W , W )g (W , W )g (W , W )g (W , W )]| ≤ C00n−2, n4(n − 1)4 E 2 i1 i2 2 i3 i4 2 i5 i6 2 i7 i8 1 0≤i1,...,i8≤n, at least three of i’s are identical because |g2(·, ·)| ≤ 4.

Case 2. Suppose at most two indices are identical. Let a1 ≤ a2 ≤ ... ≤ a8 be ordered indices of i1, . . . , i8 and let’s denote b1 := a2 − a1, b3 := a4 − a3, b5 := a6 − a5, and b7 := a8 − a7. Let’s 6 set m = max{b1, b7}. Note that m ≥ 1 by construction. For a fixed m, we have at most 2 × n m number of combinations for the values of a1, . . . , a8. This implies that for a fixed m, we have at 6 most 2 × 8! × n m number of combinations for the values of i1, . . . i8. Here k! the factorial of a number k ∈ N. Therefore, using Lemma A.9 , we have 16 X | [g (W , W )g (W , W )g (W , W )g (W , W )]| n4(n − 1)4 E 2 i1 i2 2 i3 i4 2 i5 i6 2 i7 i8 0≤i1,...,i8≤n, at most two of i’s are identical n 16 00 X 6 2/5 00 −2 ≤ Ce n mαU (m) ≤ C n , n4(n − 1)4 2 2 m=1 00 00 00 00 for some constants Ce2 and C2 where Ce2 only depends on (CA, cE,CE), and C2 only depends on

(CA, cE,CE, κ1, κ2). Therefore, Cases 1 and 2 combined imply the desired result.

Lemma A.11. The CBB bootstrap V-statistics V̂_{0,jk}^∗ and V̂_{1,jk}^∗ and the V-statistics V̂_{0,jk} and V̂_{1,jk} based on {Y_t}_{t=1}^{n−1} satisfy the following relationships:

| E^∗[ (V̂_{0,jk}^∗ − V_{0,jk})² ] − [ bℓ(bℓ − ℓ)(bℓ − 2ℓ)(bℓ − 3ℓ)/(bℓ)⁴ ] (V̂_{0,jk} − V_{0,jk})² | ≤ 50ℓ/(bℓ),

| E^∗[ (V̂_{1,jk}^∗ − V_{1,jk})² ] − [ bℓ(bℓ − ℓ)(bℓ − 2ℓ)(bℓ − 3ℓ)/(bℓ)⁴ ] (V̂_{1,jk} − V_{1,jk})² | ≤ 50ℓ/(bℓ),
where E^∗(·) stands for the expectation operator under P^∗.

Proof. Because the proof is identical for V̂_{0,jk}^∗ and V̂_{1,jk}^∗, we will examine the case of V̂_{0,jk}^∗ only. One can decompose E^∗[(V̂_{0,jk}^∗ − V_{0,jk})²] as

∗ ∗ 2 E [(V0,jk − V0,jk) ] " ∗ ! ∗ !! ∗ ! ∗ !!# 1 X ∗ Zi ,j Zi ,j Zi ,j Zi ,j = E g 1 , 2 g 3 , 4 (b`)4 Z∗ Z∗ Z∗ Z∗ 1≤i1,i2,i3,i4≤b` i1,k i2,k i3,k i4,k " ∗ ! ∗ !! ∗ ! ∗ !!# 1 X ∗ Zi ,j Zi ,j Zi ,j Xi ,j = E g 1 , 2 g 3 , 4 (b`)4 Z∗ Z∗ Z∗ X∗ 1≤i1,i2,i3,i4≤b`, i1,k i2,k i3,k i4,k all i’s are in different blocks " ∗ ! ∗ !! ∗ ! ∗ !!# 1 X ∗ Zi ,j Zi ,j Zi ,j Zi ,j + E g 1 , 2 g 3 , 4 (b`)4 Z∗ Z∗ Z∗ Z∗ 1≤i1,i2,i3,i4≤b`, i1,k i2,k i3,k i4,k at least two of i’s is in the same block 4 b(b − 1)(b − 2)(b − 3)` 2 = (Vb0,jk − V0,jk) (b`)4 " ∗ ! ∗ !! ∗ ! ∗ !!# 1 X ∗ Zi ,j Zi ,j Zi ,j Zi ,j + E g 1 , 2 g 3 , 4 . (b`)4 Z∗ Z∗ Z∗ Z∗ 1≤i1,i2,i3,i4≤b`, i1,k i2,k i3,k i4,k at least two of i’s is in the same block

The last equality holds because there are b(b − 1)(b − 2)(b − 3)ℓ⁴ possibilities that i_1, i_2, i_3, and i_4 lie in four different blocks; and when i_1, i_2, i_3, and i_4 are in different blocks, similar to the proof of Lemma 3.7 in Dehling and Wendler (2010), we have

" ∗ ! ∗ !! ∗ ! ∗ !!# ∗ Zi ,j Zi ,j Zi ,j Zi ,j E g 1 , 2 g 3 , 4 Z∗ Z∗ Z∗ Z∗ i1,k i2,k i3,k i4,k " ! !! ! !!# 1 X Zi1,j Zi2,j Zi3,j Zi4,j = 4 g , g , (n − 1) Zi1,k Zi2,k Zi3,k Zi4,k 1≤i1,i2,i3,i4≤n−1 2 =(Vb0,jk − V0,jk) ,

∗ ! ∗ ! ∗ ! ∗ ! Zi ,j Zi ,j Zi ,j Zi ,j ∗ because 1 , 2 , 3 , and 4 are independent under P . Therefore, we have Z∗ Z∗ Z∗ Z∗ i1,k i2,k i3,k i4,k

4 ∗ ∗ 2 b(b − 1)(b − 2)(b − 3)` 2 E [(V0,jk − V0,jk) ] − (Vb0,jk − V0,jk) (b`)4

49

" ∗ ! ∗ !! ∗ ! ∗ !!# 1 X ∗ Zi ,j Zi ,j Zi ,j Zi ,j = E g 1 , 2 g 3 , 4 (b`)4 Z∗ Z∗ Z∗ Z∗ 1≤i1,i2,i3,i4≤b`, i1,k i2,k i3,k i4,k at least one of i’s in the same block (b`)4 − b(b − 1)(b − 2)(b − 3)`4 6b3`4 − 11b2`4 + 6b2`4 50b3`4 50` ≤4 = 4 ≤ = , (b`)4 (b`)4 (b`)4 b` because |g(·)| ≤ 2.

A.3 Proofs in Section 5

A.3.1 Notation

For any square matrix M, let ρ(M) denote its spectral radius. Let   I −A −A ... −A   d 1 2 p−1 A  .  p 0 I −A .. −A  ! ! A   d×d d 1 p−2 Id 0(p−1)d×d  p−1  .  ep := , ep := , Cp :=  .  , Qp := 0 0 I .. −A  0 I  .   d×d d×d d p−3 (p−1)d×d d  .   . . . . .   ......  A1   0d×d ··· 0d×d 0d×d Id

> > > > pd > > with e1 = e1 = Id and Q1 := Id. We set Zt ≡ (Zt , Zt−1,..., Zt−p+1) ∈ R , Et = (Et , 0d×(p−1)d) ∈ pd R ,     A1 A2 A3 ... Ap a1 a2 a3 . . . ap      Id 0d×d 0d×d ... 0d×d  1 0 0 ... 0      0d×d Id 0d×d ... 0d×d pd×pd  0 1 0 ... 0  p×p Ap =   ∈ R , and Ap :=   ∈ R .  ......   ......   . . . . .   . . . . .      ...... 0d×d . . Id 0d×d 0 . . 1 0

k 2 For a pd by pd matrix Ap, we partition Ap by p blocks where each block is a d by d matrix. We denote by [Ak] the (j, m)-th block of Ak. Let Ak denote a p by p matrix where the (j, m)-th p jm p p k k k k element ak,jm of Ap is defined as ak,jm = k[Ap]jmk2. Then, we can write Ap and Ap as

[Ak] ··· [Ak]    p 11 p 1p ak,11 ··· ak,1p k  . . .  k . . . Ap =  . .. .  and Ap =  . .. .  .     [Ak] ··· [Ak] a ··· a p p1 p pp k,p1 k,pp

k k k > > > > pd We have kApk2 ≤ kApk2 ≤ kAp k2 because for any vectors v = (v1 , v2 ,..., vp ) ∈ R and > p p v = (kv1k2,..., kvpk2) ∈ R where {vj}j=1 is a d × 1 vector, we have

k k k−1 k kApvk2 ≤ kApvk2 ≤ kAp Apvk2 ≤ ... ≤ kAp vk2, and kvk2 = kvk2.

50 A.3.2 Proof of Theorem 5.1

When {Z_t}_{t∈Z} follows the VAR(p) model (5.1), the stacked process {Z_t}_{t∈Z} defined in Section A.3.1 (taking values in R^{pd}) follows the VAR(1) model:

Z_t = A_p Z_{t−1} + E_t, with Σ_0 = Var(Z_t). Under Assumption M-p, the spectral radius of A_p, ρ(A_p), satisfies ρ(A_p) ≤ C_{A_p} < 1 for some absolute constant C_{A_p}. By following the proof of Theorem 3.1 in Han and Wu (2019), the ϱ-mixing coefficient satisfies
ϱ(σ(Z_0), σ(Z_k)) = 1 when k < p, and ϱ(σ(Z_0), σ(Z_k)) ≤ √(λ_max(Σ_0)/λ_min(Σ_0)) ‖A_p^k‖_2 when k ≥ p.

Here λmax(Σ0)/λmin(Σ0) is upper bounded by a positive constant that only depends on (cE,CE, a1, . . . , ap, p) by the following lemma.

Lemma A.12. Suppose {Z_t}_{t∈Z} follows the VAR(p) model (5.1) with finite and fixed p. Under Assumptions M-p and S, the following results hold: for the covariance matrix Ω_j := Var((Z_t^⊤, ..., Z_{t+j}^⊤)^⊤) with j ∈ {0, 1, 2, ...}, we have 0 < c_p ≤ λ_min(Ω_j) ≤ λ_max(Ω_j) ≤ C_p < ∞ for some constants c_p and C_p that only depend on (c_E, C_E, a_1, ..., a_p, p).

Because p is finite and fixed, we only have to consider the case that k ≥ p. Then, s s λmax(Σ0) k λmax(Σ0) k %(σ(Z0), σ(Zk)) ≤ kA k2 ≤ kAp k2 λmin(Σ0) λmin(Σ0)

k k k because kApk2 ≤ kApk2 ≤ kAp k2. When p is finite and fixed, Gelfand’s formula implies that there k k exists a K such that kAp k2 < (ρ(Ap) + ) for all k ≥ K, by choosing some  > 0 such that

ρ(Ap) +  < C < 1 in which C is a positive constant that only depends on (a1, . . . , ap). Then, when k > K = max{K, p}, we have s s λmax(Σ0) k λmax(Σ0) k %(σ(Z0), σ(Zk)) ≤ kAp k2 < (ρ(Ap) + ) . λmin(Σ0) λmin(Σ0)

Therefore, {Z_t}_{t∈Z} is geometrically α-mixing with α(k; {Z_t}_{t∈Z}) ≤ γ_1 exp(−γ_2 k), where γ_1 and γ_2 are positive constants that only depend on (c_E, C_E, a_1, ..., a_p, p). This completes the proof.
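The companion-form embedding driving this proof is straightforward to set up numerically. The sketch below is illustrative only: p, d, and the coefficient matrices are placeholders of our own choosing. It builds the companion matrix A_p from A_1, ..., A_p and checks that its spectral radius is below one, which is the condition behind the geometric mixing rate.

    import numpy as np

    def companion(A_list):
        # Stack A_1, ..., A_p into the pd x pd companion matrix of the VAR(p).
        p, d = len(A_list), A_list[0].shape[0]
        top = np.hstack(A_list)                                        # [A_1 A_2 ... A_p]
        bottom = np.hstack([np.eye(d * (p - 1)), np.zeros((d * (p - 1), d))])
        return np.vstack([top, bottom])

    rng = np.random.default_rng(3)
    d, p = 4, 3
    A_list = [0.2 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(p)]
    Ap = companion(A_list)
    print(np.max(np.abs(np.linalg.eigvals(Ap))))                       # spectral radius rho(A_p); want < 1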

A.3.3 Proof of Theorem 5.2

Because the proof of this theorem is similar to the VAR(1) case, we only sketch it. First, we replace Lemma A.3 and Theorem 2.2 with Lemma A.12 and Theorem 5.1, respectively. Then, we obtain a conclusion similar to that of Section 2 because {Z_t}_{t∈Z} follows a VAR(1) model when

{Z_t}_{t∈Z} follows the VAR(p) model (5.1).
Regarding the inference, our estimator θ̃ and the related quantity Ŝ(v) = Ŵ^⊤(Σ̂_0 v − Σ̂_1) are p-dimensional vectors, not scalars. Therefore, we need appropriate norms and additional inequalities in order to prove the theorem. Regarding this issue, we have the following properties.

(i) Because p is finite and fixed, the choice of norm is not important. In particular, for a square matrix M ∈ R^{p×p}, we have ‖M‖_max ≤ ‖M‖_2 ≤ p‖M‖_max.

(ii) For any random vectors R_t ∈ R^p and F_t ∈ R^p, the Cauchy–Schwarz inequality implies ‖Cov(R_t, F_t)‖_max ≤ √( ‖Var(R_t)‖_max ‖Var(F_t)‖_max ) and ‖Var(R_t)‖_max = max_{j∈[p]} |Var(R_{t,j})|.

(iii) For positive semi-definite matrices M, N ∈ R^{p×p}, we have the following result (cf. the Wielandt–Hoffman inequality): Σ_{i=1}^{p} |σ_i(M) − σ_i(N)|² ≤ ‖M − N‖_2².

These properties imply that we can prove the theorem by following the proofs of the VAR(1) case under the assumptions in Theorem 5.2. First, properties (i) and (ii) imply that, under the max norm, we can apply the lemmas and proofs of the VAR(1) case to the VAR(p) case. Second, property (iii) supports a line of reasoning similar to the logic below equation (A.4). Therefore, we omit the details.

A.3.4 Proof of Lemma A.12

We will prove this lemma as follows. First, we will analyze the case where j = p − 1 in Case 1. Then, we will consider the case where j ≥ p in Case 2. We only need to consider these two cases to prove this lemma because Cauchy interlacing theorem implies that this lemma also holds when 0 ≤ j < p − 1 if it is true when j ≥ p − 1. Case 1. We analyze the case where j = p − 1. Let P be a pd by pd permutation matrix with   0d×d 0d×d ··· 0d×d Id 0 ··· 0 I 0   d×d d×d d 0×d  .   . 0 I 0 0   0×d d 0×d 0×d P =  . . . . .  .  . . . . .       Id 0d×d ... 0d×d 0d×d

> > > > > > > Then, we have (Zt ,..., Zt+p−1) = P(Zt+p−1,..., Zt ) = PZt+p−1 and Ωp−1 = PΣ0P . Be- > −1 cause P = P = P , the matrices Ωp−1 and Σ0 are similar. (cf. two matrices C and D are called similar if there exists an invertible matrix H such that C = H−1DH.) Therefore, we will focus on the eigenvalues of Σ0 instead. P∞ k > k P∞ k k > Because Σ0 = k=0 ApΣE(Ap ) = k=0(Apep) ΣE(Apep) , we have

∞ ∞ X k 2 X k 2 λmax(Σ0) ≤ λmax(ΣE) kApepk2 ≤ λmax(ΣE) kAp k2 ≤ C < ∞, k=0 k=0

52 where C is a positive constant which only depends on (CE, a1, . . . , ap, p). The second last inequality holds because of Assumption M-p(ii) and (iii). Let vmin be the eigenvector of Σ0 corresponding to

λmin(Σ0). We then have

∞ X > k k > λmin(Σ0) = vmin(Apep)ΣE(Apep) vmin k=0 p−1 ! X 1 ≥ v> (Ake )Σ (Ake )> v ≥ . min p p E p p min  −1  p−1  k=0 P k k > λmax k=0(Apep)ΣE(Apep)

Pp−1 k k > >  p−1  Note that k=0(Apep) ΣE(Apep) = Gp(Ip ⊗ΣE)Gp , where Gp = Ipdep Apep ··· Ap ep , and ⊗ is the Kronecker product. The eigenvalues of (Ip ⊗ ΣE) are lower bounded by some positive constant which only depends on cE. Lemma A.13 implies

   .    1 0 ··· 0 0 1 .. 0 0 ··· 0 1 . .  .  0 1 .. .  ..  0 ··· 0 0  G−1 = Q =   ⊗ I − 0 0 0 ⊗ A − ... −   ⊗ A , p p . . .  d .  1 ......  p−1 . .. .. 0 . ..  . . . .   . . 0 1   0 ··· 0 1 0 ··· 0 0 0 ··· 0 0

 −1 −1 Pp−1 Pp−1 Pp−1 k k > and kGp k2 ≤ 1+ k=1kAkk2 ≤ 1+ k=1 ak < ∞. Therefore, λmax k=0(Apep)ΣE(Apep) is upper bounded by some positive constant that only depends on (cE, a1, . . . , ap), and we have

λmin(Σ0) ≥ c > 0 by some constant c that only depends on (cE, a1, . . . , ap). Case 2. We analyze the case where j ≥ p. From           Zt+p−1+k Et+p−1+k Et+p−2+k Et+p Zt+p−1 Z   0   0  0  Z   t+p−2+k  d×d   d×d  k−1  d×d  k  t+p−2  .  =  .  + Ap  .  + ··· + Ap  .  + Ap  .  for k ∈ N,  .   .   .   .   .   .   .   .   .   .  Zt+k 0d×d 0d×d 0d×d Zt we have   Id 0d×d 0d×d ... ········· 0d×d  Z   Z  t  0 I 0 0 ... ······ 0  t    d×d d d×d d×d d×d    Zt+1   ......   Zt+1   .   0 ......   .   .   d×d   .   .   . . .   .     0 . 0 I 0 .. .. 0    Zt+p−1  d×d d×d d d×d d×d Zt+p−1   =  . .    .  Z   [A ] ··· [A ] [A ] I 0 .. ..   E   t+p   p 1p p 12 p 11 d d×d   t+p     .    Zt+p+1  [A2] ··· [A2] [A2] [A ] I 0 ..  Et+p+1  .   p 1p p 12 p 11 p 11 d d×d   .   .   ......   .     . ······ ......    Z   E t+j [Aj−p] ... [Aj−p] [Aj−p] ... [A2] [A1] I j p 1p p 12 p 11 p 11 p 11 d | {z } L

53 From Zt − A1Zt−1 − · · · − ApZt−p = Et, we have     Id 0d×d 0d×d ············ 0d×d   Zt  ......  Zt 0d×d Id 0d×d . . . . .   Zt+1     Zt+1     ......     .  0d×d ......   .   .     .     ......    Z  0d×d . 0d×d Id 0d×d . . .  Z   t+p−1 =    t+p−1 .    .. ..     Et+p  −Ap −Ap−1 ... −A1 Id 0d×d . .   Zt+p        Et+p+1  ..  Zt+p+1   0d×d −Ap −Ap−1 · · · −A1 Id 0d×d .     .     .   .   ......   .    0d×d ......    E   Z j .. t+j 0d×d . 0d×d −Ap −Ap−1 · · · −A1 Id | {z } L−1 −1 Then, it suffices to show that kLk2 and kL k2 are bounded from above by some absolute positive > > > > > constants because Case 1 implies that the eigenvalues of Var((Zt ,..., Zt+p−1, Et+p,..., Et+j) ) are lower and upper bounded by some positive constants that only depend on (cE,CE, a1, . . . , ap, p). We have p −1 X kL k2 ≤ 1 + kAkk2 ≤ 1 + a1 + ... + ap < ∞, and k=1 j−p j−p ∞ X k X k X k kLk2 ≤ 1 + kApk2 + (p − 1) kApk2 ≤ 1 + p kAp k2 ≤ CL < ∞, k=1 k=1 k=1 where CL is a positive constant that only depends on (a1, . . . , ap, p). Therefore, 0 < c ≤ λmin(Ωj) ≤

λmax(Ωj) ≤ C < ∞ by constants c and C which only depend on (cE,CE, a1, . . . , ap, p).

A.3.5 Technical lemmas used in Section 5

Lemma A.13. Let G_p = [e_p, A_p e_p, ..., A_p^{p−1} e_p] ∈ R^{pd×pd}. For any p ∈ N, the inverse G_p^{−1} exists and G_p^{−1} = Q_p, where Q_p is defined in Section A.3.1.

Proof. Throughout the proof, it is noted that we can represent Ap and Qp recursively as follows, ! ! Ap−1 ep−1Ap Qp−1 −Cp−1 Ap = > and Qp = with A1 = A1 and Q1 = Id. ep−1 0d×d 0d×(p−1)d Id

With this information, we will prove this lemma with two steps. In Step 1, we will show that Gp p−1 can be written recursively, and we have Qp−1Ap−1ep−1 = Cp−1. In Step 2, using the result in

Step 1 and the information that Qp can be represented recursively, we will prove this lemma using mathematical induction.

Step 1. We will show that Gp can be written as

p−1 ! Gp−1 Ap−1ep−1 Gp = for any p ≥ 2, 0d×(p−1)d Id

54 p−1 and we have Qp−1Ap−1ep−1 = Cp−1. k−1 pd×d For simplicity of notation, we partition Ap ep ∈ R with k ∈ [p] to two blocks so that ! k−1 Dp,k Ap ep = , where Dp,k is a (p − 1)d by d matrix and Fp,k is a d by d matrix. Then, we have Fp,k !  p−1  Dp,1 ··· Dp,p−1 Dp,p Gp = ep Apep ··· Ap ep = . Fp,1 ··· Fp,p−1 Fp,p

k−1 In order to finish Step 1, it is sufficient to show that (1) Dp,k = Ap−1ep−1 and Fp,k = 0d×d for p−1 k ∈ [p − 1]; and (2) Dp,p = Ap−1ep−1, Fp,p = Id, and Qp−1Dp,p = Cp−1. > Note that we have Dp,k = Ap−1Dp,k−1 +ep−1ApFp,k−1 and Fp,k = ep−1Dp,k−1 with Dp,1 = ep−1 and Fp,1 = 0d×d. This is because ! ! ! k−1 k−2 Ap−1 ep−1Ap Dp,k−1 Ap−1Dp,k−1 + ep−1ApFp,k−1 Ap ep = ApAp ep = > = > . ep−1 0d×d Fp,k−1 ep−1Dp,k−1

With this information, it is possible to show that (1) and (2) are true for any p ≥ 2 using mathematical induction as follows. 0 Base Step. When p = 2, we have D2,1 = e1 = A1e1, F2,1 = 0d×d,

> > D2,2 = A1D2,1+e1A2F2,1 = A1e1, F2,2 = e1 D2,1 = e1 e1 = Id, and Q1D2,2 = IdA1e1 = A1 = C1.

Therefore, (1) and (2) are true when p = 2. Induction Step. Suppose (1) and (2) hold when p = m. Then, the following three steps will show that (1) and (2) are true for p = m + 1. 0 (i) We have Dm+1,1 = em = Amem and Fm+1,1 = 0d×d. j−1 (ii) Suppose we have that Dm+1,j = Am em and Fm+1,j = 0d×d for j ∈ [m − 1]. Then, we have

j−1 j Dm+1,j+1 = AmDm+1,j + emAm+1Fm+1,j = AmAm em = Amem and ! > > j−1   Dm,j Fm+1,j+1 = emDm+1,j = emAm em = 0d×(m−1)d Id×d = Fm,j = 0d×d. Fm,j

k−1 The results of (i) and (ii) combined imply that Dm+1,k = Am em and Fm+1,k = 0d×d for k ∈ [m]. (iii) From (i),(ii), and the assumption in the induction step, we have

m−1 m Dm+1,m+1 = AmDm+1,m + emAm+1Fm+1,m = AmAm em = Amem, ! > > m−1   Dm,m Fm+1,m+1 = emDm+1,m = emAm em = 0d×(m−1)d Id×d = Fm,m = Id, and Fm,m m−1 QmDm+1,m+1 = (QmAm)(Am em) ! ! ! ! 0d×(m−1)d Am Dm,m AmFm,m Am = = = = Cm. Qm−1 0(m−1)d×d Fm,m Qm−1Dm,m Cm−1

55 The results of (i), (ii), and (iii) combined imply that (1) and (2) are also true for p = m + 1. In conclusion, by mathematical induction, we have the desired result. −1 Step 2. We prove Gp = Qp using mathematical induction as follows. −1 Base Step. When p = 1, we have Gp = Qp because G1 = Id and Q1 = Id. −1 Induction Step. Suppose Gp = Qp is true for p = m, where m ≥ 1. Using the result in Step 1, we have ! ! Qm −Cm Gm Dm+1,m+1 Qm+1Gm+1 = 0d×md Id 0d×md Id ! ! Q G Q D − C I 0 = m m m m+1,m+1 m = md md×d . 0d×md Id 0d×md Id

−1 The last equality holds because Step 1 implies QmDm+1,m+1 = Cm. This shows that Gp = Qp is true for p = m + 1. Therefore, by mathematical induction, we have the desired result.
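Lemma A.13 is also easy to verify numerically for small p. The sketch below is illustrative only (placeholder dimensions and coefficients; it re-creates the companion construction of Section A.3.1): it forms G_p = [e_p, A_p e_p, ..., A_p^{p−1} e_p] and Q_p and checks that Q_p G_p is the identity.

    import numpy as np

    rng = np.random.default_rng(4)
    d, p = 3, 4
    A_list = [0.2 * rng.standard_normal((d, d)) for _ in range(p)]

    # companion matrix A_p and e_p = [I_d; 0]
    Ap = np.vstack([np.hstack(A_list),
                    np.hstack([np.eye(d * (p - 1)), np.zeros((d * (p - 1), d))])])
    ep = np.vstack([np.eye(d), np.zeros((d * (p - 1), d))])

    # G_p = [e_p, A_p e_p, ..., A_p^{p-1} e_p]
    cols, M = [], ep
    for _ in range(p):
        cols.append(M)
        M = Ap @ M
    Gp = np.hstack(cols)

    # Q_p: identity with -A_k on the k-th block superdiagonal
    Qp = np.eye(d * p)
    for k in range(1, p):
        for j in range(p - k):
            Qp[j * d:(j + 1) * d, (j + k) * d:(j + k + 1) * d] = -A_list[k - 1]

    print(np.allclose(Qp @ Gp, np.eye(d * p)))   # True: G_p^{-1} = Q_p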
