Reconciling the Gaussian and Whittle Likelihood with an application to estimation in the frequency domain
Suhasini Subba Rao∗ and Junho Yang†
September 30, 2020
Abstract: In time series analysis there is an apparent dichotomy between time and frequency domain methods. The aim of this paper is to draw connections between frequency and time domain methods. Our focus will be on reconciling the Gaussian likelihood and the Whittle likelihood. We derive an exact, interpretable bound between the Gaussian and Whittle likelihood of a second order stationary time series. The derivation is based on obtaining the transformation which is biorthogonal to the discrete Fourier transform of the time series. Such a transformation yields a new decomposition for the inverse of a Toeplitz matrix and enables the representation of the Gaussian likelihood within the frequency domain. We show that the difference between the Gaussian and Whittle likelihood is due to the omission of the best linear predictions outside the domain of observation in the periodogram associated with the Whittle likelihood. Based on this result, we obtain an approximation for the difference between the Gaussian and Whittle likelihoods in terms of the best fitting, finite order autoregressive parameters. These approximations are used to define two new frequency domain quasi-likelihood criteria. We show that these new criteria can yield a better approximation of the spectral divergence criterion, as compared to both the Gaussian and Whittle likelihoods. In simulations, we show that the proposed estimators have satisfactory finite sample properties.
Keywords and phrases: Biorthogonal transforms, discrete Fourier transform, periodogram, quasi-likelihoods and second order stationary time series.
1 Introduction
In his seminal work, Whittle (1951, 1953) introduced the Whittle likelihood as an approximation of the Gaussian likelihood. A decade later, the asymptotic sampling properties of moving average models fitted using the Whittle likelihood were derived in Walker (1964). Subsequently, the
∗Texas A&M University, College Station, TX 77845, U.S.A. †Authors ordered alphabetically
Whittle likelihood has become a popular method for parameter estimation of various stationary time series (both long and short memory) and spatial models. The Whittle likelihood is computationally a very attractive method for estimation. Despite the considerable improvements in technology, interest in the Whittle likelihood has not abated. The Whittle likelihood has gained further traction as a quasi-likelihood (or as an information criterion, see Parzen (1983)) between the periodogram and the spectral density. Several diverse applications of the Whittle likelihood can be found in Fox and Taqqu (1986), Dahlhaus and Künsch (1987) (for spatial processes), Robinson (1995), Dahlhaus (2000), Hurvich and Chen (2000), Giraitis and Robinson (2001), Choudhuri et al. (2004), Abadir et al. (2007), Shao and Wu (2007), Giraitis et al. (2012) (long memory time series and local Whittle methods), Panaretos and Tavakoli (2013), Kirch et al. (2019) (Bayesian spectral methods), and van Delft and Eichler (2020) (functional time series), to name but a few.

Despite its advantages, it is well known that for small samples the Whittle likelihood can give rise to estimators with a substantial bias (see Priestley (1981) and Dahlhaus (1988)). Dahlhaus (1988) shows that the finite sample bias in the periodogram impacts the performance of the Whittle likelihood. Motivated by this discrepancy, Sykulski et al. (2019) proposes the debiased Whittle likelihood, which fits directly to the expectation of the periodogram rather than the limiting spectral density. Alternatively, Dahlhaus (1988) shows that the tapered periodogram is better at capturing the features in the spectral density, such as peaks, than the regular periodogram. He uses this as the basis of the tapered Whittle likelihood. Empirical studies show that the tapered Whittle likelihood yields a smaller bias than the regular Whittle likelihood. As a theoretical justification, Dahlhaus (1988, 1990) uses an alternative asymptotic framework to show that tapering yields a good approximation to the inverse of the Toeplitz matrix. It is worth mentioning that within the time domain, several authors, including Shaman (1975, 1976), Bhansali (1982) and Coursol and Dacunha-Castelle (1982), have studied approximations to the inverse of the Toeplitz matrix. These results can be used to approximate the Gaussian likelihood. However, as far as we are aware, there are no results which explain what is lost when using the Whittle likelihood rather than the Gaussian likelihood. The objective of this paper is to address some of these issues. The benefits of such insights are not only of theoretical interest but also lead to the development of computationally simple frequency domain methods which are comparable with the Gaussian likelihood.

We first recall the definition of the Gaussian and Whittle likelihood. Our aim is to fit a parametric second order stationary model with spectral density $f_\theta(\omega)$ and corresponding autocovariance function $\{c_{f_\theta}(r)\}_{r \in \mathbb{Z}}$ to the observed time series $\{X_t\}_{t=1}^{n}$. The (quasi) log-Gaussian likelihood is proportional to
$L_n(\theta; X_n) = n^{-1}\big( X_n' \Gamma_n(f_\theta)^{-1} X_n + \log|\Gamma_n(f_\theta)| \big) \qquad (1.1)$
where $(\Gamma_n(f_\theta))_{s,t} = c_{f_\theta}(s-t)$, $|A|$ denotes the determinant of the matrix $A$ and $X_n' = (X_1, \dots, X_n)$.
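As a numerical illustration of (1.1) (a sketch that is not part of the paper; the function name, the AR(1) autocovariance and the simulated data are arbitrary choices made purely for the example), the likelihood can be evaluated by building the Toeplitz matrix $\Gamma_n(f_\theta)$ from the model autocovariances:

```python
import numpy as np
from scipy.linalg import toeplitz

def gaussian_likelihood(x, acf):
    """Evaluate L_n(theta; X_n) in (1.1), given the model autocovariances
    acf = (c_{f_theta}(0), ..., c_{f_theta}(n-1))."""
    n = len(x)
    Gamma = toeplitz(acf)                      # (Gamma_n(f_theta))_{s,t} = c_{f_theta}(s - t)
    quad = x @ np.linalg.solve(Gamma, x)       # X_n' Gamma_n(f_theta)^{-1} X_n
    _, logdet = np.linalg.slogdet(Gamma)       # log |Gamma_n(f_theta)|
    return (quad + logdet) / n

# Hypothetical example: AR(1) autocovariance c(r) = phi^|r| / (1 - phi^2), unit innovation variance.
rng = np.random.default_rng(0)
n, phi = 50, 0.7
x = rng.standard_normal(n)                     # any observed data vector
acf = phi ** np.arange(n) / (1 - phi ** 2)
print(gaussian_likelihood(x, acf))
```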
In contrast, the Whittle likelihood is a “spectral divergence” between the periodogram and the candidate spectral density. There are two subtly different methods for defining this contrast: one uses an integral, the other a Riemann sum. In this paper, we focus on the Whittle likelihood defined in terms of the Riemann sum over the fundamental frequencies
$K_n(\theta; X_n) = n^{-1}\sum_{k=1}^{n}\left( \frac{|J_n(\omega_{k,n})|^{2}}{f_\theta(\omega_{k,n})} + \log f_\theta(\omega_{k,n}) \right), \qquad \omega_{k,n} = \frac{2\pi k}{n}, \qquad (1.2)$
where $J_n(\omega_{k,n}) = n^{-1/2}\sum_{t=1}^{n} X_t e^{it\omega_{k,n}}$ is the discrete Fourier transform (DFT) of the observed time series. To compare the Gaussian and Whittle likelihood, we rewrite the Whittle likelihood in matrix form. We define the $n \times n$ circulant matrix $C_n(f_\theta)$ with entries $(C_n(f_\theta))_{s,t} = n^{-1}\sum_{k=1}^{n} f_\theta(\omega_{k,n}) e^{-i(s-t)\omega_{k,n}}$. The Whittle likelihood $K_n(\theta; X_n)$ can be written as
$K_n(\theta; X_n) = n^{-1}\left( X_n' C_n(f_\theta^{-1}) X_n + \sum_{k=1}^{n}\log f_\theta(\omega_{k,n}) \right). \qquad (1.3)$
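As a hedged numerical check (a sketch, not taken from the paper; the AR(1) spectral density, the sample size and the random seed are arbitrary choices), the following code confirms that the Riemann-sum form (1.2) and the matrix form (1.3) give the same value:

```python
import numpy as np

rng = np.random.default_rng(1)
n, phi = 64, 0.7
x = rng.standard_normal(n)                                    # observed series (any data vector)
t = np.arange(1, n + 1)
omega = 2 * np.pi * np.arange(1, n + 1) / n                   # fundamental frequencies omega_{k,n}
f_theta = 1.0 / np.abs(1 - phi * np.exp(-1j * omega)) ** 2    # candidate AR(1) spectral density

# Form (1.2): periodogram |J_n(omega_{k,n})|^2 against the candidate spectral density.
J = np.array([np.sum(x * np.exp(1j * t * w)) for w in omega]) / np.sqrt(n)
K_sum = np.mean(np.abs(J) ** 2 / f_theta + np.log(f_theta))

# Form (1.3): quadratic form with the circulant matrix C_n(f_theta^{-1}).
s = np.arange(n)
C = np.array([[np.mean((1 / f_theta) * np.exp(-1j * (a - b) * omega)) for b in s] for a in s])
K_mat = ((x @ C @ x).real + np.sum(np.log(f_theta))) / n

print(K_sum, K_mat)   # the two evaluations agree up to floating-point error
```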
To obtain an exact expression for $\Gamma_n(f_\theta)^{-1} - C_n(f_\theta^{-1})$ and $X_n'[\Gamma_n(f_\theta)^{-1} - C_n(f_\theta^{-1})]X_n$, we focus on the DFT of the time series. The idea is to obtain the linear transformation of the observed time series $\{X_t\}_{t=1}^{n}$ which is biorthogonal to the regular DFT, $\{J_n(\omega_{k,n})\}_{k=1}^{n}$. The biorthogonal transform, when coupled with the regular DFT, exactly decorrelates the time series. In Section 2.3, we show that the biorthogonal transform corresponding to the regular DFT contains the regular DFT plus the Fourier transform of the best linear predictors of the time series outside the domain of observation. Since this transformation completes the information not found in the regular DFT, we call it the complete DFT. It is common to use the Cholesky decomposition to decompose the inverse of a Toeplitz matrix. An interesting aspect of the biorthogonal transformation is that it provides an alternative decomposition of the inverse of a Toeplitz matrix. In Section 2.4, we show that the complete DFT, together with the regular DFT, allows us to rewrite the Gaussian likelihood within the frequency domain (which, as far as we are aware, is new). Further, it is well known that the Whittle likelihood has a bias due to the boundary effect. By rewriting the Gaussian likelihood within the frequency domain we show that the Gaussian likelihood avoids the boundary effect problem by predicting the time series outside the domain of observation. Precisely, the approximation error between the Gaussian and Whittle likelihood is due to the omission of these linear predictors in the regular DFT. From this result, we observe that the greater the persistence in the time series model (which corresponds to a more peaked spectral density) the larger the loss in approximating the complete DFT with the regular DFT. In order to obtain a better approximation of the Gaussian likelihood in the frequency domain, it is of interest to approximate the difference of the two likelihoods $L_n(\theta; X_n) - K_n(\theta; X_n)$. For autoregressive processes of finite order, we obtain an analytic expression for the difference in the two likelihoods in terms of the AR parameters (see equation (2.18)). For general second order
stationary models, the expression is more complex. In Section 3, we obtain an approximation for $L_n(\theta; X_n) - K_n(\theta; X_n)$ in terms of the infinite order (causal/minimum phase) autoregressive factorisation of $f_\theta(\omega) = \sigma^2 |1 - \sum_{j=1}^{\infty} \phi_j e^{-ij\omega}|^{-2}$. We show that this approximation is the first order term in a series expansion of the inverse of the Toeplitz matrix, $\Gamma_n(f)^{-1}$. More precisely, in Section 3.2, we show that $\Gamma_n(f)^{-1}$ can be expressed in terms of $C_n(f_\theta^{-1})$ plus a polynomial-type series expansion of the AR(∞) coefficients.
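To illustrate the AR(∞) factorisation numerically (a sketch that is not from the paper; the MA(1) model, its AR(∞) coefficients $\phi_j = -(-\theta)^j$, and the truncation point are assumptions made only for this example), one can check that a truncated expansion recovers the spectral density:

```python
import numpy as np

# MA(1) process X_t = e_t + theta * e_{t-1} with unit innovation variance:
# f(w) = |1 + theta e^{-iw}|^2, and its AR(infinity) coefficients are
# phi_j = -(-theta)^j for |theta| < 1 (a standard result, assumed here).
theta, sigma2, J = 0.5, 1.0, 200
omega = np.linspace(0, 2 * np.pi, 512, endpoint=False)

f_direct = sigma2 * np.abs(1 + theta * np.exp(-1j * omega)) ** 2

j = np.arange(1, J + 1)
phi = -(-theta) ** j
ar_transfer = 1 - np.exp(-1j * np.outer(omega, j)) @ phi     # 1 - sum_j phi_j e^{-i j w}
f_ar = sigma2 / np.abs(ar_transfer) ** 2                     # sigma^2 |1 - sum_j phi_j e^{-i j w}|^{-2}

print(np.max(np.abs(f_direct - f_ar)))                       # negligible truncation error
```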
In Section 4, we obtain an approximation for the difference $L_n(\theta; X_n) - K_n(\theta; X_n)$ in terms of a finite order autoregressive process. We use this to define two spectral divergence criteria which are “almost” unbiased estimators of the spectral divergence between the true (underlying) spectral density and the parametric spectral density. We use these criteria to define two new frequency domain estimators. In Section 5, we obtain the asymptotic sampling properties of the new likelihood estimators, including the asymptotic bias and variance. Finally, in Section 6, we illustrate and compare the proposed frequency domain estimators through some simulations. We study the performance of the estimation scheme when the parametric model is both correctly specified and misspecified. The proofs can be found in the Supplementary material. The main proofs can be found in Appendices A, B, D and E. Baxter-type inequalities for derivatives of finite predictors can be found in Appendix C. These results are used to obtain an approximation for the difference between the derivatives of the Gaussian and Whittle likelihood. In Appendix E we derive an expression for the asymptotic bias of the Gaussian and Whittle likelihoods, and of the new frequency domain likelihoods described above. In Appendices F, G and H we present additional simulations.
2 The Gaussian likelihood in the frequency domain
2.1 Preliminaries
In this section, we introduce most of the notation used in the paper; it can be skipped on first reading. To reduce notation, we omit the symbol $X_n$ in the Gaussian and Whittle likelihood. Moreover, since the focus in this paper will be on the first terms in the Gaussian and Whittle likelihoods, we use $L_n(\theta)$ and $K_n(\theta)$ to denote only these terms:
$L_n(\theta) = n^{-1} X_n' \Gamma_n(f_\theta)^{-1} X_n \quad \text{and} \quad K_n(\theta) = n^{-1} X_n' C_n(f_\theta^{-1}) X_n. \qquad (2.1)$
Let $A^{*}$ denote the conjugate transpose of the matrix $A$. We recall that the circulant matrix $C_n(g)$ can be written as $C_n(g) = F_n^{*} \Delta_n(g) F_n$, where $\Delta_n(g) = \mathrm{diag}(g(\omega_{1,n}), \dots, g(\omega_{n,n}))$ (a diagonal matrix) and $F_n$ is the $n \times n$ DFT matrix with entries $(F_n)_{k,t} = n^{-1/2} e^{it\omega_{k,n}}$. We recall that the eigenvalues and the corresponding eigenvectors of any circulant matrix $C_n(g)$ are $\{g(\omega_{k,n})\}_{k=1}^{n}$ and $\{e_{k,n}' = (e^{ik\omega_{1,n}}, \dots, e^{ik\omega_{n,n}})\}_{k=1}^{n}$ respectively.
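A short numerical check of the factorisation $C_n(g) = F_n^{*} \Delta_n(g) F_n$ (a sketch, not part of the paper; the test function $g$ and the dimension are arbitrary choices):

```python
import numpy as np

n = 8
k = np.arange(1, n + 1)
t = np.arange(1, n + 1)
omega = 2 * np.pi * k / n                                     # omega_{k,n} = 2*pi*k/n

g = 1.0 + 0.5 * np.cos(omega)                                 # arbitrary positive function on the grid

F = np.exp(1j * np.outer(omega, t)) / np.sqrt(n)              # (F_n)_{k,t} = n^{-1/2} e^{i t omega_{k,n}}
Delta = np.diag(g)
C_factored = F.conj().T @ Delta @ F                           # F_n^* Delta_n(g) F_n

# Direct definition: (C_n(g))_{s,t} = n^{-1} sum_k g(omega_{k,n}) e^{-i(s-t) omega_{k,n}}
s = np.arange(n)
C_direct = np.array([[np.mean(g * np.exp(-1j * (a - b) * omega)) for b in s] for a in s])

print(np.max(np.abs(C_factored - C_direct)))                  # ~ 1e-15
```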
In general, we assume that $E[X_t] = 0$ (as it makes the derivations cleaner). We use $\{c_f(r)\}_{r \in \mathbb{Z}}$
to denote an autocovariance function and $f(\omega) = \sum_{r \in \mathbb{Z}} c_f(r) e^{ir\omega}$ its corresponding spectral density. Sometimes, it will be necessary to make explicit the true underlying covariance (equivalently the spectral density) of the process. In this case, we use the notation $\mathrm{cov}_f(X_t, X_{t+r}) = E_f[X_t X_{t+r}] = c_f(r)$. Next we define the norms we will use. Suppose $A$ is an $n \times n$ square matrix; let $\|A\|_p = (\sum_{i,j=1}^{n} |a_{i,j}|^p)^{1/p}$ be an entrywise $p$-norm for $p \geq 1$, and let $\|A\|_{\mathrm{spec}}$ denote the spectral norm. Let $\|X\|_{E,p} = (E|X|^p)^{1/p}$, where $X$ is a random variable. For the $2\pi$-periodic square integrable function $g$ with $g(\omega) = \sum_{r \in \mathbb{Z}} g_r e^{ir\omega}$, we use the sub-multiplicative norm $\|g\|_K = \sum_{r \in \mathbb{Z}} (2^K + |r|^K)|g_r|$. Note that if $\sum_{j=0}^{K+2} \sup_\omega |g^{(j)}(\omega)| < \infty$ then $\|g\|_K < \infty$, where $g^{(j)}(\cdot)$ denotes the $j$th derivative of $g$. Suppose $f, g : [0, 2\pi] \to \mathbb{R}$ are bounded functions that are strictly larger than zero and are symmetric about $\pi$. By using the classical factorisation results in Szegő (1921) and Baxter (1962), we can write $f(\cdot) = \sigma_f^2 |\psi_f(\cdot)|^2 = \sigma_f^2 |\phi_f(\cdot)|^{-2}$, where $\phi_f(\omega) = 1 - \sum_{j=1}^{\infty} \phi_j(f) e^{-ij\omega}$ and $\psi_f(\omega) = 1 + \sum_{j=1}^{\infty} \psi_j(f) e^{-ij\omega}$; the terms $\sigma_g$, $\phi_g(\cdot)$ and $\psi_g(\cdot)$ are defined similarly. We use these expansions in Sections 3 and 4, where we require the following notation
$\rho_{n,K}(f) = \sum_{r=n+1}^{\infty} |r^K \phi_r(f)|, \qquad A_K(f,g) = 2\sigma_g^{-2} \|\psi_f\|_0 \|\phi_g\|_0 \|\phi_f\|_K, \qquad \text{and} \qquad C_{f,K} = \frac{3-\varepsilon}{1-\varepsilon} \|\phi_f\|_K^2 \|\psi_f\|_K^2$
for some $0 < \varepsilon < 1$.
For positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \sim b_n$ if there exist $0 < C_1 \leq C_2 < \infty$ such that $C_1 \leq a_n/b_n \leq C_2$ for all $n$. Lastly, Re and Im denote the real and imaginary parts of a complex variable, respectively.
2.2 Motivation
In order to motivate our approach, we first study the difference in the bias of the AR(1) parameter estimator using both the Gaussian and Whittle likelihood. In Figure 1, we plot the bias in the estimator of $\phi$ in the AR(1) model $X_t = \phi X_{t-1} + \varepsilon_t$ for different values of $\phi$ (based on sample size $n = 20$). We observe that the difference between the bias of the two estimators increases as $|\phi|$ approaches one. Further, the Gaussian likelihood clearly has a smaller bias than the Whittle likelihood (which is more pronounced when $|\phi|$ is close to one). Let $\{X_t\}_{t=1}^{n}$ denote the observed time series. Straightforward calculations (based on expressions for $\Gamma_n(f_\phi)^{-1}$ and $C_n(f_\phi^{-1})$) show that the difference between the Gaussian and Whittle likelihoods for an AR(1) model is
$L_n(\phi) - K_n(\phi) = n^{-1}\left( 2\phi X_1 X_n - \phi^2 (X_1^2 + X_n^2) \right). \qquad (2.2)$
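Equation (2.2) can be checked numerically (a sketch, not from the paper; unit innovation variance, the value of $\phi$ and the data vector are arbitrary choices made for the illustration):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
n, phi = 20, 0.8
x = rng.standard_normal(n)                                    # any data vector

# Gaussian quadratic form: Gamma_n(f_phi) built from c(r) = phi^|r| / (1 - phi^2).
Gamma = toeplitz(phi ** np.arange(n) / (1 - phi ** 2))
L_quad = x @ np.linalg.solve(Gamma, x) / n

# Whittle quadratic form: circulant C_n(f_phi^{-1}) with f_phi(w)^{-1} = |1 - phi e^{-iw}|^2.
omega = 2 * np.pi * np.arange(1, n + 1) / n
f_inv = np.abs(1 - phi * np.exp(-1j * omega)) ** 2
s = np.arange(n)
C = np.array([[np.mean(f_inv * np.exp(-1j * (a - b) * omega)) for b in s] for a in s])
K_quad = (x @ C @ x).real / n

closed_form = (2 * phi * x[0] * x[-1] - phi ** 2 * (x[0] ** 2 + x[-1] ** 2)) / n
print(L_quad - K_quad, closed_form)                           # the two values coincide
```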
[Figure 1 appears here: bias ($y$-axis) of the estimator of $\phi$ plotted against $\phi$ ($x$-axis); panel title “Bias of AR(1), Gaussian error”; curves for the Gaussian and Whittle estimators.]
Figure 1: The model $X_t = \phi X_{t-1} + \varepsilon_t$ with independent standard normal errors is simulated. The bias of the estimator of $\phi$ is computed based on sample size $n = 20$ over 1000 replications.
Thus we observe that the closer $|\phi|$ is to one, the larger the expected difference between the likelihoods. Using (2.2) and the Bartlett correction (see Bartlett (1953) and Cox and Snell (1968)), it is possible to obtain an asymptotic expression for the difference in the biases (see also Appendix E.2). Generalisations of this result to higher order AR(p) models may also be possible using the analytic expression for the inverse of the Toeplitz matrix corresponding to an AR(p) model derived in Siddiqui (1958) and Galbraith and Galbraith (1974). However, for more general models, such as the MA(q) or ARMA(p, q) models, using brute force calculations to derive the difference $L_n(\theta) - K_n(\theta)$ and its derivatives is extremely difficult. Furthermore, such results do not offer any insight on how the Gaussian and Whittle likelihood are related, nor what is “lost” when going from the Gaussian likelihood to the Whittle likelihood. In the remainder of this section, we derive an exact expression for the Gaussian likelihood in the frequency domain. Using these derivations, we obtain a simple expression for the difference between the Whittle and Gaussian likelihood for AR(p) models. In subsequent sections, we obtain approximations for this difference for general time series models.
2.3 The biorthogonal transform to the discrete Fourier transform
In order to obtain an exact bound, we start with the Whittle likelihood and recall that the DFT of the time series plays a fundamental role in its formulation. With this in mind, our approach is based on deriving the transformation $\{Z_{k,n}\}_{k=1}^{n} \subset \mathrm{sp}(X_n)$ (where $\mathrm{sp}(X_n)$ denotes the linear space over a complex field spanned by $X_n = \{X_t\}_{t=1}^{n}$), which is biorthogonal to $\{J_n(\omega_{k,n})\}_{k=1}^{n}$. That is, we derive a transformation $\{Z_{k,n}\}_{k=1}^{n}$ which, when coupled with $\{J_n(\omega_{k,n})\}_{k=1}^{n}$, satisfies the following condition
$\mathrm{cov}_f(Z_{k_1,n}, J_n(\omega_{k_2,n})) = f(\omega_{k_1,n})\,\delta_{k_1,k_2}$
where $\delta_{k_1,k_2} = 1$ if $k_1 = k_2$ (and zero otherwise). Since $Z_n = (Z_{1,n}, \dots, Z_{n,n})' \in \mathrm{sp}(X_n)^{n}$, there exists an $n \times n$ complex matrix $U_n$ such that $Z_n = U_n X_n$. Since $(J_n(\omega_{1,n}), \dots, J_n(\omega_{n,n}))' = F_n X_n$, the biorthogonality of $U_n X_n$ and $F_n X_n$ gives $\mathrm{cov}_f(U_n X_n, F_n X_n) = \Delta_n(f)$. The benefit of biorthogonality is that it leads to the following simple identity on the inverse of the variance matrix.
Lemma 2.1 Suppose that $U_n$ and $V_n$ are invertible matrices which are biorthogonal with respect to the variance matrix $\mathrm{var}(X_n)$. That is, $\mathrm{cov}(U_n X_n, V_n X_n) = \Delta_n$, where $\Delta_n$ is a diagonal matrix. Then
$\mathrm{var}(X_n)^{-1} = V_n^{*} \Delta_n^{-1} U_n. \qquad (2.3)$
PROOF. It follows immediately from $\mathrm{cov}(U_n X_n, V_n X_n) = U_n \mathrm{var}(X_n) V_n^{*} = \Delta_n$ and $\mathrm{var}(X_n) = U_n^{-1} \Delta_n (V_n^{*})^{-1}$.
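A quick numerical sanity check of Lemma 2.1 (a sketch, not part of the paper; the covariance matrix, $U_n$ and $\Delta_n$ are generated arbitrarily, and $V_n^{*}$ is constructed so that the biorthogonality condition holds):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)                     # a positive definite var(X_n)

U = rng.standard_normal((n, n))                     # any invertible matrix U_n
Delta = np.diag(rng.uniform(1.0, 2.0, n))           # target diagonal matrix Delta_n

# Choose V_n so that cov(U_n X, V_n X) = U_n Sigma V_n^* = Delta_n, i.e. V_n^* = Sigma^{-1} U_n^{-1} Delta_n.
V_star = np.linalg.solve(Sigma, np.linalg.solve(U, Delta))

print(np.max(np.abs(U @ Sigma @ V_star - Delta)))                                 # biorthogonality holds
print(np.max(np.abs(np.linalg.inv(Sigma) - V_star @ np.linalg.inv(Delta) @ U)))   # identity (2.3), ~ 1e-15
```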
To understand how $U_n X_n$ is related to $F_n X_n$ we rewrite $U_n = F_n + D_n(f)$. We show in the following theorem that $D_n(f)$ has a specific form with an intuitive interpretation. In order to develop these ideas, we use methods from linear prediction. In particular, we define the best linear predictor of $X_\tau$ for $\tau \leq 0$ and $\tau > n$ given $\{X_t\}_{t=1}^{n}$ as
$\widehat{X}_{\tau,n} = \sum_{t=1}^{n} \phi_{t,n}(\tau; f) X_t, \qquad (2.4)$
where $\{\phi_{t,n}(\tau; f)\}_{t=1}^{n}$ are the coefficients which minimize the $L_2$-distance $E_f[X_\tau - \sum_{t=1}^{n} \phi_{t,n}(\tau; f) X_t]^2$. Using this notation we obtain the following theorem.
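The coefficients in (2.4) solve the normal equations $\sum_{t=1}^{n} \phi_{t,n}(\tau; f)\, c_f(s - t) = c_f(\tau - s)$ for $s = 1, \dots, n$ (a standard least-squares fact). The following sketch (not from the paper; the function name and the AR(1) autocovariance are assumed examples) computes them by solving this Toeplitz system and confirms that, for an AR(1) model, the one-step-ahead predictor of $X_{n+1}$ reduces to $\phi X_n$:

```python
import numpy as np
from scipy.linalg import toeplitz

def predictor_coefficients(acf, n, tau):
    """Solve Gamma_n(f) phi = (c_f(tau - t))_{t=1,...,n} for the best linear
    predictor coefficients phi_{t,n}(tau; f) appearing in (2.4)."""
    Gamma = toeplitz(acf(np.arange(n)))                        # (Gamma_n(f))_{s,t} = c_f(s - t)
    rhs = acf(tau - np.arange(1, n + 1))                       # cov(X_tau, X_t) for t = 1, ..., n
    return np.linalg.solve(Gamma, rhs)

# Assumed example: AR(1) with parameter phi0, c(r) = phi0^|r| / (1 - phi0^2).
phi0, n = 0.6, 10
acf = lambda r: phi0 ** np.abs(r) / (1 - phi0 ** 2)

print(np.round(predictor_coefficients(acf, n, tau=n + 1), 6))   # approx (0, ..., 0, phi0): predicting X_{n+1}
print(np.round(predictor_coefficients(acf, n, tau=0), 6))       # approx (phi0, 0, ..., 0): predicting X_0
```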
Theorem 2.1 (The biorthogonal transform) Let $\{X_t\}$ be a second order stationary, zero mean time series with spectral density $f$ which is bounded away from zero and whose autocovariance satisfies $\sum_{r \in \mathbb{Z}} |r\, c_f(r)| < \infty$. Let $\widehat{X}_{\tau,n}$ denote the best linear predictors of $X_\tau$ as defined in (2.4) and $\{\phi_{t,n}(\tau; f)\}_{t=1}^{n}$ the corresponding coefficients. Then