
Spectral Analysis Using Multitaper Whittle Methods with a Lasso Penalty

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Shuhan Tang, M.S.

Graduate Program in Statistics

The Ohio State University

2020

Dissertation Committee:

Peter Craigmile, Advisor
Yunzhang Zhu, Co-advisor
Dena Asta
Yoonkyung Lee

© Copyright by

Shuhan Tang

2020

Abstract

Spectral analysis is the fundamental technique for studying the frequency domain characteristics of time series. Traditional nonparametric estimators of the spectral density function, such as the periodogram, are inconsistent, and the more advanced multitaper spectral estimators are often still too noisy. In this dissertation, an L1 penalized quasi-likelihood Whittle framework based on multitaper spectral estimates is proposed for the spectral analysis of univariate and multivariate stationary time series. The new approach is compatible with various types of basis functions. The use of scale-calibrated universal thresholds and the generalized information criterion (GIC) is suggested to efficiently capture the sparse features in the spectra.

A fast convergence rate of the proposed estimator is established for univariate spectral analysis. Additionally, a sample splitting strategy is introduced to obtain pointwise confidence intervals for a spectrum using the Huber sandwich method. The proposed framework is extended to the estimation of spectral density matrices based on the Cholesky decomposition to ensure Hermitian positive definiteness. Computationally, an alternating direction method of multipliers (ADMM) algorithm and a proximal gradient method are presented for solving the optimization problems. Application of the new methodology is illustrated via simulation and through the spectral analysis of electroencephalogram (EEG) signals.

Keywords: Alternating direction method of multipliers (ADMM); basis expansion; mul- titaper spectral estimates; proximal gradient descent; sample splitting; Whittle likelihood.

Acknowledgments

First of all, I would like to express my deep gratitude to my advisors, Dr. Peter Craigmile and Dr. Yunzhang Zhu, whose insightful guidance and persistent encouragement helped me reach one of the most significant milestones in my life. During my Ph.D. study, I have learned and benefited a lot from their great enthusiasm and dedication to teaching and research. I would also like to extend my sincere appreciation to Dr. Dena Asta and Dr. Yoonkyung Lee for their support on both my candidacy exam committee and the dissertation committee.

I am grateful to all my statistics professors for introducing the glamorous world of statistics to me. I also want to acknowledge Dr. Gary G. Koch and Mrs. Carolyn J. Koch for their generosity and the fellowship. Thank you to the Department of Statistics at Ohio State for making me feel at home in Columbus.

Thanks to my family and friends for their endless love and faithful friendship.

Vita

2011 ...... B.S. Mathematics, Central South University.
2014 ...... M.S. Applied Statistics, Bowling Green State University.
2014–2016 ...... College of Public Health, The Ohio State University.
2017 ...... M.S. Statistics, The Ohio State University.
2016–present ...... Department of Statistics, The Ohio State University.

Publications

Research Publications

S. Tang, P. F. Craigmile, and Y. Zhu. Spectral estimation using multitaper Whittle methods with a lasso penalty. IEEE Transactions on Signal Processing, 67:4992–5003, 2019.

Fields of Study

Major Field: Statistics

Table of Contents

Page

Abstract...... ii

Acknowledgments...... iii

Vita...... iv

List of Tables...... viii

List of Figures...... x

1. Introduction...... 1

1.1 Nonparametric spectral estimation and basis models...... 1
1.2 Penalized multitaper Whittle framework...... 3
1.3 Sample splitting and spectral inference...... 4
1.4 Extension to multivariate spectral analysis...... 5
1.5 Organization of this dissertation...... 6

2. Foundations of spectral analysis for stationary processes...... 7

2.1 Spectral density function...... 7
2.2 Preliminary spectral estimators...... 11
2.2.1 The periodogram...... 11
2.2.2 Multitaper spectral estimators...... 13
2.2.3 Fourier frequencies...... 17
2.3 Basis models for SDFs...... 19
2.3.1 Literature review...... 20
2.3.2 Basis models and least squares approaches...... 21
2.3.3 Wavelet soft-thresholding...... 25
2.3.4 Smoothing splines...... 29
2.4 Conclusion...... 30

3. Multitaper Whittle method with a lasso penalty...... 32

3.1 Penalized multitaper Whittle estimation...... 32
3.2 Computing: an ADMM algorithm...... 38
3.3 Tuning parameter selection...... 42
3.3.1 Scale-calibrated universal threshold...... 42
3.3.2 Generalized information criterion...... 44
3.4 Theory...... 45
3.5 Simulations...... 49
3.5.1 Validating the theoretical rate empirically...... 59
3.6 Spectral estimation for EEG signals...... 60
3.7 Non-orthogonal basis functions...... 63
3.8 Conclusion...... 64

4. Spectral inference with sample splitting...... 65

4.1 Sample splitting for SDFs...... 65
4.1.1 Overview of sample splitting...... 66
4.1.2 Sample splitting for L1 penalized SDF models...... 68
4.2 Quantifying the variability of MT-Whittle estimators...... 71
4.2.1 Naive estimation...... 71
4.2.2 Sandwich variance estimation...... 73
4.2.3 Fast variance approximation...... 77
4.2.4 Summary...... 79
4.3 Proper scoring rules...... 81
4.3.1 Scoring rules and propriety...... 81
4.3.2 Interval scores...... 84
4.4 Simulation studies...... 85
4.4.1 The choice of the number of tapers...... 94
4.5 Application to EEG signals...... 97
4.6 Conclusion...... 99

5. Extensions to spectral analysis of stationary multivariate time series...... 101

5.1 Foundations of multivariate spectral analysis...... 101
5.1.1 Autocovariance matrix of multivariate processes...... 101
5.1.2 Spectral density matrix...... 103
5.1.3 Representing complex-valued spectra...... 106
5.1.4 Cross-periodogram and multitaper spectral estimators...... 107
5.1.5 Covariance between MT estimates of different cross-spectra...... 110
5.1.6 Covariance between real and imaginary parts of an MT estimator...... 115
5.1.7 Basis methods for SDM estimation...... 117
5.2 Penalized multitaper Whittle method for SDM...... 119
5.2.1 MT-Whittle likelihood for bivariate time series...... 122
5.2.2 Penalized bivariate MT-Whittle method...... 126
5.2.3 Computation: proximal gradient method...... 128
5.2.4 Tuning parameter selection...... 133
5.2.5 Sample splitting...... 137
5.3 Simulation...... 140
5.3.1 The VAR(2) process...... 144
5.3.2 The VARMA(2,15000) process...... 148
5.3.3 The VARMA(4,15000) process...... 152
5.3.4 Discussion...... 157
5.4 Spectral analysis of multivariate EEG signals...... 161
5.5 Conclusion...... 164

6. Conclusion and discussion...... 166

6.1 Concluding remarks...... 166
6.2 Discussion and future work...... 167

Bibliography...... 172

List of Tables

Table Page

3.1 Comparison of IRMSEs (with bootstrap standard errors in parentheses) when using different bases for estimating the SDF of the AR(2) process of length N = 2048 by the unpenalized least squares (LS) method and the unpenalized MT-Whittle method with number of tapers K = 10...... 51

3.2 Simulation results with p = 1024 for the four time series of length N = 2048: comparing decibel-scaled bias, standard deviation (SD) and IRMSE of the L1 penalized LS approach and the L1 penalized MT-Whittle approach with LA(8) wavelet bases by different tuning criteria. Models are fitted to raw MT estimates with K = 10 sine tapers. Values in parentheses denoted by '(SE)' are the corresponding standard errors based on 1000 bootstrap resamples. Minimum quantities are bolded in each section, excluding the ones from 'Best' (best possible IRMSE)...... 55

4.1 Recommended number of sine tapers, K, for achieving a close-to-minimal interval score for a given sample size N...... 95

5.1 Simulation results for the VAR(2) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples...... 146

5.2 Results of evaluating the SDM estimates for the VAR(2) process: comparing FIRMSE, SFIRMSE, and the average Whittle score of different methods. Values in parentheses denoted by '(SE)' are standard errors based on 500 bootstrap samples...... 148

5.3 Simulation results for the VARMA(2,15000) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples...... 150

5.4 Results of evaluating the SDM estimates for the VARMA(2,15000) process: comparing FIRMSE, SFIRMSE, and the average Whittle score of different methods. Values in parentheses denoted by '(SE)' are standard errors based on 500 bootstrap samples...... 152

5.5 Simulation results for the VARMA(4,15000) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples...... 154

5.6 Results of evaluating the SDM estimates for the VARMA(4,15000) process: comparing FIRMSE, SFIRMSE, and the average Whittle score of different methods. Values in parentheses denoted by '(SE)' are standard errors based on 500 bootstrap samples...... 157

List of Figures

Figure Page

3.1 Plots of the SDF for three processes on the decibel scale (dB)...... 50

3.2 One replicate of spectrum estimation for the high-order MA process based on the unpenalized LS method with different basis types. For each basis type, the orange line is the LS estimate with p = 32 bases including the intercept. The black line shows the true SDF on the decibel scale (dB)...... 53

3.3 A comparison of the IRMSEs for different L1 penalization methods for estimating the SDF. Here, the heights of the symbols are larger than the widths of the 95% bootstrap confidence intervals for each IRMSE; the horizontal dashed lines indicate the best possible IRMSEs for each method...... 54

3.4 A comparison of spectral estimates for the four different processes. For each process, the blue line is the L1 penalized MT-Whittle estimate, the gray line is the corresponding raw MT estimate with K = 10 tapers, and the red line is the true SDF...... 57

3.5 Plots of bias vs. K on the decibel (dB) scale using the L1 penalized LS method (orange) and the L1 penalized MT-Whittle method (blue) with the universal threshold for the four processes. The vertical bars show 95% bootstrap confidence intervals for each K, and the dotted lines show the location of the minimum...... 58

3.6 Plots of IRMSE vs. K on the decibel (dB) scale using the L1 penalized LS method (orange) and the L1 penalized MT-Whittle method (blue) with the universal threshold for the four processes. The vertical bars show 95% bootstrap confidence intervals for each K, and the dotted lines show the location of the minimum...... 59

3.7 Plots of the ratio $\sum_{j=1}^{M}(R_j - 1)\phi(f_j)\big/\big(\sqrt{M}\log(M)\big)$ against $\log_2(M)$, for $M$ varying from $2^8$ to $2^{15}$. For each process, the solid line is the mean of the ratio, and the dashed lines are the 2.5th and 97.5th percentiles of the ratio, based on 1000 simulations...... 60

3.8 Time series plots of the (a) left and (b) right EEG channels, as well as estimated SDFs for the (c) left and (d) right channels. In (c) and (d) the blue line is the estimate with the universal threshold, and the green line is the estimate with the GIC-based threshold...... 61

4.1 Plots for evaluating the log SDF estimates of the AR(2) process: average absolute bias (first row), IRMSE (second row), average coverage (third row), average interval score (fourth row) against the number of tapers K. Sample size N (256, 512, 1024, 2048) varies in different columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are on different scales across the columns...... 87

4.2 Plots for evaluating the log SDF estimates of the AR(4) process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row) against the number of tapers K. Sample size N (256, 512, 1024, 2048) varies in different columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are on different scales across the columns...... 88

4.3 Plots for evaluating the log SDF estimates of the MA process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row) against the number of tapers K. Sample size N (256, 512, 1024, 2048) varies in different columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are on different scales across the columns...... 89

4.4 Plots for evaluating the log SDF estimates across the frequencies for the three processes of length N = 2048 using K = 15 sine tapers: bias (first row), RMSE (second row), coverage (third row), expected interval score (fourth row). 93

4.5 Plots of four evaluation metrics against sample size N for the three processes using the suggested number of sine tapers, $K = (1/4)N^{8/15}$: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row). The vertical bars show 95% bootstrap confidence intervals at each N...... 96

4.6 Plots of interval estimates for the left and right EEG channels. In each plot, the solid black line shows the point estimates of the SDF. The dashed red lines display the pointwise 95% confidence intervals (CI) constructed based on the sandwich method, while the dashed blue lines exhibit the interval estimates by the fast approximation approach...... 98

5.1 Plots of true spectra (red), and one replicate of raw MT estimates (gray), Krafty's L2 estimates (green), and the proposed L1 MT-Whittle estimates (blue) for the Cholesky components, SDM elements, and MSC of the VAR(2) process...... 145

5.2 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), Krafty's L2 estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VAR(2) process...... 147

5.3 Plots of true spectra (red), and one replicate of raw MT estimates (gray), Krafty's L2 estimates (green), and the proposed L1 MT-Whittle estimates (blue) for the Cholesky components, SDM elements, and the MSC of the VARMA(2,15000) process...... 149

5.4 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), Krafty's L2 estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(2,15000) process...... 151

5.5 Plots of true spectra (red), and one replicate of raw MT estimates (gray), Krafty's L2 estimates (green), and the proposed L1 MT-Whittle estimates (blue) for the Cholesky components, SDM elements, and the MSC of the VARMA(4,15000) process...... 153

5.6 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), Krafty's L2 estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(4,15000) process...... 156

5.7 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for the Cholesky components and SDM elements of the VAR(2) process...... 158

5.8 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for the Cholesky components and SDM elements of the VARMA(4,15000) process...... 159

5.9 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), Krafty's L2 estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(4,15000) process...... 160

5.10 Plots of the estimated Cholesky components and the SDM elements of the two-channel EEG signal...... 163

Chapter 1: Introduction

Spectral analysis is an important area in the study of time series and signal processing, which provides statistical insight into the frequency domain characteristics of a series collected over time (see, e.g., Priestley [1981]). Estimating the spectral density function (SDF), or spectrum, of a time series serves as a primary tool for frequency domain analysis. Its use can be found in many different fields, such as astronomy, cognitive science, earth sciences, electrical engineering, and finance. Examining the SDF allows us to explore periodicities in the data (e.g., Percival and Walden [1993, ch. 10]), provides an alternative way to analyze and estimate the covariance structure of stationary time series (e.g., Percival and Walden [1993, ch. 4]), and can also be used to understand the effect of preprocessing a time series (e.g., Mosley-Thompson et al. [2005]).

1.1 Nonparametric spectral estimation and basis models

There are many nonparametric estimators of the SDF of a stationary time series. These include the periodogram, direct spectral estimators, lag window and overlapping segment averaging spectral estimators, and multitaper (MT) spectral estimators. (See Percival and Walden [1993] for a complete review.) Most of these spectral estimates can be formulated within a unified framework of MT spectral estimators [Walden, 2000]. While many of these estimators are developed to provide an adequate tradeoff between bias and variance, often these nonparametric estimates are still too noisy when a stable estimate of the SDF is required. An alternative strategy is to use a parametric approach. However, model misspecification caused by considering a limited class of models for the SDF can compromise estimation (see Percival and Walden [1993, ch. 9], and references therein).

A popular alternative approach is to consider a semiparametric model for the SDF, in which the log SDF is expressed in terms of a truncated basis expansion, where the number of basis functions is allowed to increase with the sample size. The statistical problem then becomes how to enforce sparsity by selecting the basis functions and estimating the model parameters so that we adequately estimate the SDF, but also have computational efficiency as the sample size increases. Gao [1993, 1997], Moulin [1994] and Walden et al. [1998] enforce sparsity using a penalized least squares (LS) approach for estimating the log SDF with wavelet soft thresholding. In terms of computational complexity, wavelet thresholding methods are typically O(N) for a time series of N regularly sampled values.

A number of approaches enforce smoothness of the SDF via an L2 penalty: Cogburn and Davis [1974], Wahba and Wold [1975] and Wahba [1980] use penalized LS, and Pawitan and O'Sullivan [1994] use a penalized Whittle likelihood, where the Whittle likelihood [Whittle, 1953] is an approximation to the negative log-likelihood for the time series using the periodogram. To enforce sparsity, some of these L2 methods with smoothing splines also use model selection techniques, often in combination with cross-validation, to select the basis functions that are used to model the SDF. Alternatively, one can implement methods such as Wipf and Nagarajan [2008] to enforce sparsity on the basis expansion directly. (On a related topic, the smoothness of spectral estimates can also be tuned with high-resolution approaches introduced in Byrnes et al. [2000] and Georgiou and Lindquist [2003], and using extended frameworks based on the so-called beta and tau families, such as Basu et al. [1998], Ferrante et al. [2012], and Zorzi [2014, 2015a,b]; see Cichocki and Amari [2010] for a general review of such divergences.)

1.2 Penalized multitaper Whittle framework

Our method is also motivated by the need to enforce sparsity while adequately estimating the SDF. In addition, we seek a computationally efficient method that can scale to large data sets. We develop a quasi-likelihood method for estimating SDFs using a Whittle likelihood [Whittle, 1953] based on MT spectral estimates. A quasi-likelihood [Wedderburn, 1974] has similar statistical properties to those of the log likelihood, and can be used for estimation and inference, but does not have to match exactly the log of the joint probability density function of the data (see McCullagh and Nelder [1999, ch. 9] for a comprehensive review). MT estimates [Thomson, 1982] provide a good compromise between bias and variance, and can yield more efficient estimates of the SDF [Walden et al., 1998]. We demonstrate that the addition of a Whittle likelihood method [Whittle, 1953] improves estimation over traditional LS approaches.

We use a lasso penalty [Tibshirani, 1996] to enforce sparsity, deriving two strategies to optimally select the tuning parameter that is key to obtaining estimates of the SDF with low integrated root mean squared error (IRMSE): a "universal threshold" and a generalized information criterion (GIC)-based method (see, e.g., Fan and Tang [2013]). Neither method requires cross-validation to select the tuning parameter, so neither compromises computational or statistical efficiency. Theoretically, we derive the rate of convergence for our proposed spectral estimator under some technical conditions on the model sparsity and the MT spectral estimator.

We introduce a computationally efficient method to estimate the parameters in our model using the alternating direction method of multipliers (ADMM) algorithm (see, e.g., Boyd et al. [2011]). To reach an $\varepsilon$-optimal solution with a time series of length N, our method is $O(N\varepsilon^{-1})$ when using wavelet bases and $O(N^3 + \varepsilon^{-1}N^2)$ for general bases. Although computationally more challenging, our method can be applied to SDF estimation using any collection of basis functions and, as mentioned above, outperforms LS-based methods such as wavelet thresholding in terms of estimation quality.

1.3 Sample splitting and spectral inference

To make statistical inferences based on our L1 penalized MT-Whittle spectral estimates [Tang et al., 2019], we propose a sample splitting strategy and apply the Huber sandwich estimator [Huber, 1967] to construct confidence intervals for the SDF of regularly sampled univariate stationary Gaussian processes. Unlike random splitting methods that assume independence and exchangeability between samples [Dezeure et al., 2015], our splitting scheme takes into account the correlations and spatial inhomogeneity of the MT estimates across the Fourier frequencies [Thomson, 1982]. We illustrate three methods for quantifying the variability of the MT-Whittle estimates: a naive method assuming independence between MT estimates, an estimated sandwich method incorporating the correlation between MT estimates [Thomson, 1982], and a fast sandwich estimation based on an approximation, due to Walden et al. [1998], of the correlation between MT estimates. Our simulation results show a clear advantage of sandwich-based interval estimation over the naive method. The sandwich variance correction greatly ameliorates the undercoverage of the confidence intervals, and we demonstrate that this method achieves a better interval score [e.g., Gneiting and Raftery, 2007] than the naive approach. The fast sandwich estimation is very efficient in approximating the theoretical sandwich intervals when an appropriate number of tapers is used. We provide a discussion and an empirical recommendation, based on the simulation study, on choosing the optimal number of tapers for statistical inference of the SDF. The suggested number of tapers increases as the sample size grows.
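The empirical recommendation takes the form $K \approx (1/4)N^{8/15}$ in the Chapter 4 simulations. A small helper (the rounding convention is ours) makes the rule concrete:

```python
def suggested_tapers(N):
    """Empirical rule from the Chapter 4 simulations: K ~ (1/4) * N^(8/15), rounded."""
    return max(1, round(0.25 * N ** (8 / 15)))

# The rule grows slowly with the sample size:
# suggested_tapers(256) -> 5, suggested_tapers(1024) -> 10, suggested_tapers(2048) -> 15
```

The value for N = 2048 matches the K = 15 sine tapers used for that sample size in the simulation figures.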

1.4 Extension to multivariate spectral analysis

Spectral analysis of a multivariate time series involves the study of the SDFs, also called the auto-spectra, of each component series, and the analysis of the complex-valued cross-spectrum between two component series [see, e.g., Brillinger, 1981]. At each frequency, the spectral density matrix (SDM) is a Hermitian positive definite matrix composed of auto-spectra values on the diagonal and cross-spectra values on the off-diagonals. The idea of our L1 penalized MT-Whittle method can be extended to SDM estimation for stationary multivariate time series. To facilitate such an extension, we examine related statistical properties of multivariate MT spectral estimators [Walden, 2000]. Similar to Dai and Guo [2004], Rosen and Stoffer [2007], and Krafty and Collinge [2013], we apply a Cholesky decomposition to the SDM to ensure Hermitian positive definiteness, where the real and imaginary parts of the Cholesky elements are modeled by basis expansions. We propose to use a proximal gradient descent algorithm [Parikh and Boyd, 2014] for estimating the Cholesky elements simultaneously based on the L1 penalized multivariate MT-Whittle method. The tuning parameters are selected by generalizing the universal threshold and the GIC criterion to the multivariate case. Adapting the proposed sample splitting, we compare the performance of the L1 penalized MT-Whittle method to the L2 penalized Whittle method of Krafty and Collinge [2013] via a simulation study. An MT-Whittle-based scoring rule and the Frobenius norm of the deviation from the true SDM are used for evaluating the SDM estimates.

1.5 Organization of this dissertation

The rest of this dissertation is organized as follows. Chapter 2 reviews the foundations of spectral analysis of univariate stationary time series, the statistical properties of preliminary spectral estimators such as the periodogram and MT estimators, and basis regression models for estimating the SDF. Chapter 3 proposes our main methodology of spectral estimation using a multitaper Whittle likelihood and a lasso penalty for regularly sampled univariate stationary time series, accompanied by the related theory and a computational algorithm. A sample splitting approach is established in Chapter 4 for the sandwich interval estimation of the SDF based on the proposed L1 penalized method, and a discussion of the optimal choice of the number of tapers can be found in the same chapter. In Chapter 5, we expand the scope to the spectral analysis of multivariate stationary time series, demonstrating the statistical properties of multivariate MT spectral estimates and generalizing the L1 penalized MT-Whittle framework to the multivariate case. The utility of our methodology is demonstrated in Chapters 3, 4, and 5 on simulated processes with prominently sharp spectral features and via the spectral analysis of a two-channel electroencephalogram (EEG) example. We provide in Chapter 6 a discussion of further improvements and future work.

Chapter 2: Foundations of spectral analysis for stationary processes

Spectral analysis is a technique that characterizes the frequency domain features of a time series. The technique relies heavily on the estimation of, and inference about, the so-called spectral density function (SDF). In this chapter, we introduce the basic concepts of the SDF and the estimators of the SDF that are commonly used, to facilitate an understanding of the background and properties of SDFs essential for later chapters. Our discussion in this chapter focuses on the SDF of regularly sampled stationary univariate time series. The extension of these concepts to multivariate time series is presented in Chapter 5.

2.1 Spectral density function

To formally define the spectral density function (SDF), we first consider an alternative way to represent a time series in terms of frequency domain components. Suppose that a discrete, real-valued univariate stationary process $\{X_t : t \in \mathbb{Z}\}$ has, without loss of generality, mean zero and is collected at sampling interval $\Delta = 1$. The spectral representation theorem for discrete parameter stationary processes [Priestley, 1981, p. 251] applies: the process $\{X_t\}$ has the spectral representation

$$X_t = \int_{-1/2}^{1/2} e^{i2\pi f t} \, dZ(f), \qquad t = 0, \pm 1, \pm 2, \ldots, \tag{2.1}$$

where $\{Z(f) : f \in [-1/2, 1/2]\}$ is a complex-valued orthogonal process with the following properties [Percival and Walden, 1993, p. 130]:

1. $E\{dZ(f)\} = 0$;

2. $E\{|dZ(f)|^2\} \equiv dS^{(I)}(f) \equiv S(f)\,df$ for $f \in [-1/2, 1/2]$, where $S^{(I)}(f)$ is a right-continuous, non-decreasing and bounded function, called the integrated spectrum or the spectral distribution function. The derivative $S(f)$, if it exists, is called the SDF or the spectrum;

3. $\mathrm{Cov}\{dZ(f), dZ(f')\} = E\{dZ(f)\, dZ^*(f')\} = 0$ for $f \neq f'$ on $[-1/2, 1/2]$, where $dZ^*(f)$ denotes the complex conjugate of $dZ(f)$.

The intuition for (2.1) is the following: in view of Euler's formula $e^{i2\pi ft} = \cos(2\pi ft) + i\sin(2\pi ft)$, the time series $\{X_t\}$, as a function of time $t$, can be decomposed into a linear combination of sine and cosine functions parametrized by various frequencies and amplitudes. If the sine and cosine components at frequency $f$ have amplitude $dZ(f)$, then the squared magnitude, as a random variable, has expectation $dS^{(I)}(f)$.

The existence of the integrated spectrum $S^{(I)}(f)$ follows from the generalization of Bochner's theorem and Herglotz's theorem [see, e.g., Brockwell and Davis, 1991, p. 115-118]. For a mean zero stationary time series $\{X_t\}$, we write the autocovariance function (ACVF) as

$$\gamma_X(\tau) = \mathrm{Cov}(X_{t+\tau}, X_t) = E(X_{t+\tau} X_t^*), \qquad \tau = 0, \pm 1, \pm 2, \ldots.$$

Then the following two theorems apply:

Then the following two theorems apply:

Theorem 1 (Generalization of Bochner's theorem; see Brockwell and Davis [1991], p. 115). A function $\gamma_X(\cdot)$ defined on the integers is the ACVF of a stationary (possibly complex-valued) process if and only if $\gamma_X(\cdot)$ is Hermitian and non-negative definite.

Theorem 2 (Herglotz's theorem; see Brockwell and Davis [1991], p. 117-118). A complex-valued function $\gamma_X(\cdot)$ defined on the integers is non-negative definite if and only if

$$\gamma_X(\tau) = \int_{-1/2}^{1/2} e^{i2\pi f\tau} \, dS^{(I)}(f), \qquad \text{for all integers } \tau,$$

where $S^{(I)}(f)$ is called the integrated spectrum or the spectral distribution function, which is right-continuous, non-decreasing and bounded for $f \in [-1/2, 1/2]$ with $S^{(I)}(-\tfrac{1}{2}) = 0$.

Thus, a well-defined spectral distribution function $S^{(I)}(f)$ always accompanies the stationary process $\{X_t\}$. If the spectral distribution function $S^{(I)}(f)$ is differentiable everywhere on the frequency domain, then we have

$$\gamma_X(\tau) = \int_{-1/2}^{1/2} e^{i2\pi f\tau} S(f)\, df, \qquad \text{for all integers } \tau, \tag{2.2}$$

where $S(f) = dS^{(I)}(f)/df$ is the SDF. Brockwell and Davis [1991] also demonstrated that if the process $\{X_t\}$ has an absolutely summable ACVF (i.e., $\sum_{\tau \in \mathbb{Z}} |\gamma_X(\tau)| < \infty$), then the SDF $S(f)$ exists, and

$$S(f) = \sum_{\tau=-\infty}^{\infty} \gamma_X(\tau) e^{-i2\pi f\tau}, \qquad |f| \le 1/2, \tag{2.3}$$

which is the Fourier transform of the ACVF from the time domain to the frequency domain. Thus, we can see that $\{\gamma_X(\tau)\}$ and $\{S(f)\}$ are a Fourier transform pair.
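As a quick numerical sanity check of the Fourier pair (2.2)-(2.3), consider an MA(1) process $X_t = \epsilon_t + \theta\epsilon_{t-1}$ (an illustrative example of ours, not from the text), whose ACVF vanishes beyond lag 1:

```python
import numpy as np

theta, sigma2 = 0.6, 1.0   # MA(1) coefficient and innovation variance

def acvf(tau):
    """MA(1) ACVF: gamma(0) = sigma2*(1 + theta^2), gamma(+-1) = sigma2*theta, else 0."""
    tau = abs(tau)
    if tau == 0:
        return sigma2 * (1 + theta ** 2)
    if tau == 1:
        return sigma2 * theta
    return 0.0

def sdf(f):
    """S(f) via (2.3): the Fourier transform of the ACVF (only lags -1, 0, 1 survive)."""
    return sum(acvf(t) * np.exp(-2j * np.pi * f * t) for t in (-1, 0, 1)).real

M = 1024
f = (np.arange(M) + 0.5) / M - 0.5            # midpoint grid on [-1/2, 1/2]
S = sdf(f)

# (2.3) agrees with the closed form sigma2 * |1 + theta e^{-i 2 pi f}|^2 ...
assert np.allclose(S, sigma2 * np.abs(1 + theta * np.exp(-2j * np.pi * f)) ** 2)

# ... and inverting via (2.2) recovers the ACVF (the mean over the midpoint
# grid approximates the integral over [-1/2, 1/2]).
for tau in range(4):
    gamma = np.mean(S * np.exp(2j * np.pi * f * tau)).real
    assert abs(gamma - acvf(tau)) < 1e-10
```

Because the MA(1) spectrum is a trigonometric polynomial of degree 1, the midpoint quadrature above recovers the ACVF essentially to machine precision.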

Focusing on the stationary process $\{X_t\}$ with zero mean and absolutely summable ACVF, the corresponding SDF $S(f)$ has the following properties [Percival and Walden, 1993, p. 138]:

1. The SDF is a real-valued function defined on $f \in [-1/2, 1/2]$;

2. The SDF is an even function; i.e., $S(f) = S(-f)$;

3. The SDF is non-negative; i.e., $S(f) \ge 0$;

4. The SDF decomposes the variance of $\{X_t\}$, so that
$$\mathrm{Var}(X_t) = \gamma_X(0) = S^{(I)}(1/2) = \int_{-1/2}^{1/2} S(f)\, df < \infty.$$

The last property expresses an analysis of variance of $\{X_t\}$ over the frequency domain in terms of the power magnitude. Thus, the SDF represents the distributional density of the power contributed by the component sinusoids shown in (2.1).
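Property 4 can likewise be verified numerically. For a first-order autoregressive process (again an illustrative example of ours) with coefficient $\phi$, the SDF is $S(f) = \sigma^2/|1 - \phi e^{-i2\pi f}|^2$ and the variance is $\sigma^2/(1 - \phi^2)$; numerically integrating the SDF recovers the variance:

```python
import numpy as np

phi, sigma2 = 0.7, 1.0

M = 4096
f = (np.arange(M) + 0.5) / M - 0.5                           # midpoint grid on [-1/2, 1/2]
S = sigma2 / np.abs(1 - phi * np.exp(-2j * np.pi * f)) ** 2  # AR(1) SDF

var_from_sdf = np.mean(S)             # midpoint approximation of the integral of S over [-1/2, 1/2]
var_exact = sigma2 / (1 - phi ** 2)   # gamma_X(0) for a stationary AR(1)
assert abs(var_from_sdf - var_exact) < 1e-8
```

The same grid also exhibits properties 1-3: the computed SDF is real, strictly positive here, and symmetric about zero frequency.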

In addition, the definition of SDF, if it exists, can be extended over the real line. For periodic functions such as sinusoids, aliasing happens with period 1/∆, where ∆ = 1 is the sampling interval – the constant time unit spacing between adjacent time points of the time series {Xt}. The value of SDF S(f) at frequency f ∈ [−1/2, 1/2] actually represents the accumulated power densities at frequencies {f + c/∆ : c ∈ Z}. Then by taking

\[
S(f + c/\Delta) = S(f), \quad \text{for all } c \in \mathbb{Z} \text{ and } f \in [-1/2, 1/2],
\]

the SDF can be viewed as periodic over the real line. Due to evenness and periodicity, all information contained in a univariate SDF is conveyed by the frequencies on $[0, f_N]$, with Nyquist frequency $f_N = 1/(2\Delta) = 1/2$ when we assume $\Delta = 1$.

More generally, for a stationary process $\{X_t\}$ with $\mathrm{E}(X_t) \neq 0$, if the mean is known, one can always obtain a zero-mean process by considering the centered process $X_t - \mathrm{E}(X_t)$; if the mean $\mathrm{E}(X_t)$ is unknown, common practice is to replace $X_t$ by $X_t - \bar{X}$ in the spectral calculations above, where $\bar{X}$ denotes the sample mean of the observed series. For further properties of the SDF, see Percival and Walden [1993] and Brillinger [1981].

2.2 Preliminary spectral estimators

When the true model for the stationary time series is unknown, a closed-form expression for the SDF is also unknown. To study the spectral information using data, we now review some nonparametric estimators of the SDF.

2.2.1 The periodogram

A natural and fundamental spectral estimator is the periodogram [Schuster, 1898]. Let $\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$ be the vector of consecutive observations of the previously defined real-valued stationary process $\{X_t\}$, where the start time is $t = 1$ and $N$ is the sample size. Again suppose that $\{X_t\}$ has mean zero and the sampling interval is $\Delta = 1$.

According to Percival [1994], preliminary spectral estimators can be obtained by substituting $\gamma_X(\tau)$ in (2.3) by $\hat{\gamma}_X^{(p)}(\tau)$, where

\[
\hat{\gamma}_X^{(p)}(\tau) =
\begin{cases}
\frac{1}{N} \sum_{t=1}^{N-|\tau|} x_t x_{t+|\tau|}, & \text{for } |\tau| \le N-1; \\
0, & \text{for } |\tau| \ge N.
\end{cases}
\]

Then, the resulting spectral estimator is the Fourier transform of $\{\hat{\gamma}_X^{(p)}(\tau)\}$ [Schuster, 1898],

\[
\hat{S}^{(p)}(f) = \sum_{\tau=-(N-1)}^{N-1} \hat{\gamma}_X^{(p)}(\tau) e^{-i2\pi f\tau}, \quad |f| \le 1/2,
\]

which is called the periodogram. Alternatively, if we define

\[
J_X^{(p)}(f) = \sum_{t=1}^{N} \frac{1}{\sqrt{N}}\, x_t e^{-i2\pi ft}, \quad |f| \le 1/2, \tag{2.4}
\]

then it can be shown that

\[
\begin{aligned}
\hat{S}^{(p)}(f) &= \frac{1}{N} \sum_{\tau=-(N-1)}^{N-1} \sum_{t=1}^{N-|\tau|} x_t x_{t+|\tau|} e^{-i2\pi f\tau}
= \frac{1}{N} \sum_{s=1}^{N} \sum_{t=1}^{N} x_s x_t e^{-i2\pi f(s-t)} \\
&= \sum_{s=1}^{N} \frac{1}{\sqrt{N}}\, x_s e^{-i2\pi fs} \cdot \sum_{t=1}^{N} \frac{1}{\sqrt{N}}\, x_t e^{i2\pi ft}
= J_X^{(p)}(f)\, J_X^{(p)*}(f),
\end{aligned}
\]

where $J_X^{(p)*}(f)$ denotes the complex conjugate of $J_X^{(p)}(f)$. Hence, the periodogram can also be expressed as

\[
\hat{S}^{(p)}(f) = \frac{1}{N} \left| \sum_{t=1}^{N} x_t e^{-i2\pi ft} \right|^2, \quad |f| \le 1/2. \tag{2.5}
\]

Note: Similar to the previous section, if $\mathrm{E}(X_t) \neq 0$ is unknown, all the above formulas will have $x_t$ replaced by $x_t - \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{t=1}^{N} x_t$. Then the quantity $\hat{\gamma}_X^{(p)}(\tau)$ becomes $\frac{1}{N} \sum_{t=1}^{N-|\tau|} (x_t - \bar{x})(x_{t+|\tau|} - \bar{x})$ for $|\tau| \le N-1$, the Fourier transformed process $J_X^{(p)}(f)$ is redefined by $\sum_{t=1}^{N} \frac{1}{\sqrt{N}} (x_t - \bar{x}) e^{-i2\pi ft}$, and so an updated equation (2.5) has the form $\hat{S}^{(p)}(f) = \frac{1}{N} \left| \sum_{t=1}^{N} (x_t - \bar{x}) e^{-i2\pi ft} \right|^2$.

If a stationary time series $\{X_t\}$, not necessarily Gaussian distributed, has a continuous SDF $S(f)$ for $f \in [-1/2, 1/2]$, Percival and Walden [1993, Ch. 6] summarize some key properties of the periodogram:

1. Convergence in mean: $\mathrm{E}\left\{\hat{S}^{(p)}(f)\right\} \to S(f)$ asymptotically as $N \to \infty$;

2. Assuming finiteness of certain high-order moments [see, e.g., Brillinger, 1981, p. 126],

\[
\hat{S}^{(p)}(f) \to_d
\begin{cases}
S(f)\chi_2^2/2, & \text{for } 0 < f < 1/2; \\
S(f)\chi_1^2, & \text{for } f = 0 \text{ or } 1/2,
\end{cases}
\]

asymptotically as $N \to \infty$;

3. Under the conditions of finiteness of certain high-order moments [see, e.g., Brillinger, 1981, p. 26 and p. 125], for $0 < f' < f < 1/2$, $\hat{S}^{(p)}(f)$ and $\hat{S}^{(p)}(f')$ are asymptotically independent.

The periodogram is not an ideal spectral estimator. As pointed out by Percival [1994, p. 317], the 'tragedy of the periodogram' includes:

1. For finite sample sizes, the periodogram $\hat{S}^{(p)}(f)$ can be badly biased;

2. As $N \to \infty$, the variance $\operatorname{Var}\{\hat{S}^{(p)}(f)\}$ does not shrink to zero (except where $S(f) = 0$), and thus the periodogram is not a consistent estimator.

2.2.2 Multitaper spectral estimators

To correct the bias and reduce the variance of the periodogram, more advanced nonparametric estimators of the SDF were developed. These include multitaper (MT) estimators, Welch's overlapping segment averaging (WOSA) estimators [Welch, 1967], and lag-window estimators (see Percival and Walden [1993] for a complete review). As demonstrated by Walden [2000], most of these estimators can be reformulated in the unifying structure of the multitaper (MT) spectral estimators defined below.

For $k \in \{1, \ldots, K\}$, one can taper the observed time series $\mathbf{x} = (x_1, x_2, \ldots, x_N)^T$ using a finite sequence $\{h_{1,k}, \ldots, h_{N,k}\} \in \mathbb{R}^N$, called a data taper. Each resulting tapered time series $\{h_{1,k}x_1, \ldots, h_{N,k}x_N\}$ can be used to form a direct spectral estimator as

\[
\hat{S}_k^{(d)}(f) = \left| \sum_{t=1}^{N} h_{t,k} x_t e^{-i2\pi ft} \right|^2, \quad |f| \le 1/2. \tag{2.6}
\]

Taking the $K$ data tapers to be orthonormal (i.e., $\sum_{t=1}^{N} h_{t,j} h_{t,k} = 1$ if $j = k$ and $0$ if $j \neq k$), the multitaper (MT) estimator is defined as

\[
\hat{S}^{(mt)}(f) = \frac{1}{K} \sum_{k=1}^{K} \hat{S}_k^{(d)}(f), \quad |f| \le 1/2; \tag{2.7}
\]

that is, the MT estimator is an average of $K$ direct spectral estimators (eigenspectra)

with orthonormal tapers. Generally speaking, tapering is applied to correct the bias, and averaging reduces the variance [Percival and Walden, 1993]. Note that the periodogram can be viewed as a special case of the MT estimator: equation (2.5) is obtained by taking $K = 1$ in equation (2.7) with $\{h_{1,1}, \ldots, h_{N,1}\} = \{1/\sqrt{N}, \ldots, 1/\sqrt{N}\}$. The periodogram, direct spectral estimators, and MT estimators all preserve the evenness and non-negativity of the SDF in their estimates.

Walden [2000] shows, under known conditions and in the more general setting of multivariate time series, that MT estimators are consistent if we let the number of tapers $K$ increase with the sample size $N$. For univariate stationary processes, we heuristically present below how MT estimators can mitigate the bias and inconsistency issues of the periodogram.

Proposition 1. The expected value of the periodogram can be expressed as

\[
\mathrm{E}\left\{ \hat{S}^{(p)}(f) \right\} = \int_{-1/2}^{1/2} \mathcal{F}(f - f') S(f')\, df',
\]

with

\[
\mathcal{F}(f) = \frac{\sin^2(N\pi f)}{N \sin^2(\pi f)} = \left| \sum_{t=1}^{N} \frac{1}{\sqrt{N}}\, e^{-i2\pi ft} \right|^2, \tag{2.8}
\]

where the kernel function $\mathcal{F}(f)$, known as the spectral window of the periodogram, is called the Fejér kernel.

Proof.

\[
\begin{aligned}
\mathrm{E}\left\{\hat{S}^{(p)}(f)\right\}
&= \sum_{\tau=-(N-1)}^{N-1} \mathrm{E}\left\{\hat{\gamma}_X^{(p)}(\tau)\right\} e^{-i2\pi f\tau}
= \frac{1}{N} \sum_{\tau=-(N-1)}^{N-1} \sum_{t=1}^{N-|\tau|} \mathrm{E}(x_t x_{t+|\tau|}) e^{-i2\pi f\tau} \\
&= \frac{1}{N} \sum_{\tau=-(N-1)}^{N-1} (N - |\tau|) \gamma_X(\tau) e^{-i2\pi f\tau} \\
&= \frac{1}{N} \sum_{\tau=-(N-1)}^{N-1} (N - |\tau|) \left( \int_{-1/2}^{1/2} e^{i2\pi f'\tau} S(f')\, df' \right) e^{-i2\pi f\tau} \\
&= \int_{-1/2}^{1/2} \left( \frac{1}{N} \sum_{\tau=-(N-1)}^{N-1} (N - |\tau|) e^{-i2\pi(f-f')\tau} \right) S(f')\, df' \\
&= \int_{-1/2}^{1/2} \left( \frac{1}{N} \sum_{s=1}^{N} \sum_{t=1}^{N} e^{-i2\pi(f-f')(s-t)} \right) S(f')\, df' \\
&= \int_{-1/2}^{1/2} \left| \sum_{t=1}^{N} \frac{1}{\sqrt{N}}\, e^{-i2\pi(f-f')t} \right|^2 S(f')\, df'.
\end{aligned}
\]

Additionally,

\[
\begin{aligned}
\left| \sum_{t=1}^{N} \frac{1}{\sqrt{N}}\, e^{-i2\pi ft} \right|^2
&= \frac{1}{N} \cdot \frac{1 - e^{-i2\pi fN}}{1 - e^{-i2\pi f}} \cdot \frac{1 - e^{i2\pi fN}}{1 - e^{i2\pi f}} \\
&= \frac{1}{N} \cdot \frac{e^{i\pi fN} - e^{-i\pi fN}}{e^{i\pi f} - e^{-i\pi f}} \cdot \frac{e^{-i\pi fN} - e^{i\pi fN}}{e^{-i\pi f} - e^{i\pi f}} \\
&= \frac{1}{N} \cdot \frac{\sin(N\pi f)}{\sin(\pi f)} \cdot \frac{\sin(N\pi f)}{\sin(\pi f)}
= \frac{\sin^2(N\pi f)}{N \sin^2(\pi f)}.
\end{aligned}
\]
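The algebra above can be verified numerically; this short check (with an arbitrary $N$ of our choosing) compares the closed form of the Fejér kernel with the squared modulus of the normalized geometric sum:

```python
import numpy as np

N = 16
f = np.linspace(0.01, 0.49, 200)        # avoid f = 0, where the kernel equals N

# Left side: closed form of the Fejer kernel, equation (2.8)
fejer = np.sin(N * np.pi * f) ** 2 / (N * np.sin(np.pi * f) ** 2)

# Right side: squared modulus of the normalized exponential sum
t = np.arange(1, N + 1)
geom = np.abs(np.exp(-1j * 2 * np.pi * np.outer(f, t)) @ np.full(N, N ** -0.5)) ** 2

match = np.allclose(fejer, geom)
```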

Since the Fejér kernel is not a Dirac delta function at 0, its shape leads to two sources of bias for the periodogram: a loss of resolution due to its central lobe, and leakage due to its side lobes. The direct spectral estimator (2.6) uses tapering to reduce leakage; its expectation can be obtained following a similar derivation as above:

\[
\mathrm{E}\left\{\hat{S}_k^{(d)}(f)\right\} = \int_{-1/2}^{1/2} \mathcal{H}_k(f - f') S(f')\, df', \tag{2.9}
\]

with spectral window

\[
\mathcal{H}_k(f) = \left| \sum_{t=1}^{N} h_{t,k} e^{-i2\pi ft} \right|^2.
\]

Averaging across $K$ tapers, the expected value of the MT estimator (2.7) is then

\[
\mathrm{E}\left\{\hat{S}^{(mt)}(f)\right\} = \int_{-1/2}^{1/2} \bar{\mathcal{H}}(f - f') S(f')\, df', \tag{2.10}
\]

where

\[
\bar{\mathcal{H}}(f) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{H}_k(f) = \frac{1}{K} \sum_{k=1}^{K} \left| \sum_{t=1}^{N} h_{t,k} e^{-i2\pi ft} \right|^2.
\]

With properly selected data tapers, the spectral window $\bar{\mathcal{H}}(f)$ can have its side lobes suppressed with a reasonably small increase in the width of the central lobe. A better balance between leakage and loss of resolution thus leads to a reduction in the overall bias.

The variance reduction is a result of the averaging step in the MT estimator. For each individual eigenspectrum (2.6), the asymptotic distribution can be obtained similarly to that of the periodogram. Percival and Walden [1993] state that, as $N \to \infty$,

\[
\hat{S}_k^{(d)}(f) \to_d
\begin{cases}
S(f)\chi_2^2/2, & \text{for } 0 < f < 1/2; \\
S(f)\chi_1^2, & \text{for } f = 0 \text{ or } 1/2.
\end{cases}
\]

Based on the orthonormality of the tapers, under certain mild conditions [see Percival, 1994, p. 336] the eigenspectra are approximately independent, which implies that

\[
\hat{S}^{(mt)}(f) \to_d
\begin{cases}
S(f)\chi_{2K}^2/(2K), & \text{for } 0 < f < 1/2; \\
S(f)\chi_K^2/K, & \text{for } f = 0 \text{ or } 1/2,
\end{cases} \tag{2.11}
\]

asymptotically as $N \to \infty$. Thus, we have

\[
\operatorname{Var}\left\{\hat{S}^{(mt)}(f)\right\} \approx
\begin{cases}
S^2(f)/K, & \text{for } 0 < f < 1/2; \\
2S^2(f)/K, & \text{for } f = 0 \text{ or } 1/2.
\end{cases} \tag{2.12}
\]

As K increases, we lose resolution in the estimated SDF but decrease the variance of the

MT spectral estimator. Letting the number of tapers K increase with sample size N, the

MT estimator can then be shown to be a consistent estimator of the SDF asymptotically

[see McCoy et al., 1998, p. 658].

Though MT estimators are capable of balancing bias and variance, they also induce more correlation between the estimates at different frequencies [Thomson, 1982]. We leave the discussion of the correlation of MT estimates to later chapters.

Besides the common properties of MT estimators mentioned above, other statistical properties follow from the specific choice of tapers used in the spectral estimates. Discrete prolate spheroidal sequences (DPSS) and sine tapers are the most commonly used [Percival and Walden, 1993]. DPSS tapers are designed to reduce the side lobes in the spectral estimate: they solve the time-frequency concentration problem of finding the time-limited sequence that has most of its energy concentrated in a specified frequency band [Percival and Walden, 1993, Ch. 8]. Instead, we use the easily calculated sine tapers [Riedel and Sidorenko, 1995] in this dissertation, which are defined by

\[
h_{t,k} = \left( \frac{2}{N+1} \right)^{1/2} \sin\left\{ \frac{(k+1)\pi t}{N+1} \right\}, \quad k = 1, \ldots, K; \; t = 1, \ldots, N. \tag{2.13}
\]

Sine tapers are designed to reduce the smoothing bias, at some cost in side-lobe reduction. For a given $K$, the sine tapers are concentrated in the frequency band $[-W, W]$ with half bandwidth $W = (K+1)/(2(N+1))$.
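A minimal sketch of sine-taper multitapering, implementing (2.13) as stated (the function names `sine_tapers` and `multitaper` are ours, not from any package), and checking the orthonormality that the MT construction requires:

```python
import numpy as np

def sine_tapers(N, K):
    """Sine tapers as in (2.13); rows are the tapers k = 1, ..., K."""
    t = np.arange(1, N + 1)
    k = np.arange(1, K + 1)[:, None]
    return np.sqrt(2.0 / (N + 1)) * np.sin((k + 1) * np.pi * t / (N + 1))

def multitaper(x, K, freqs):
    """MT estimator (2.7): average of K tapered direct estimators (2.6)."""
    N = len(x)
    H = sine_tapers(N, K)                              # K x N taper matrix
    t = np.arange(1, N + 1)
    E = np.exp(-1j * 2 * np.pi * np.outer(freqs, t))   # F x N complex exponentials
    eigenspectra = np.abs(E @ (H * x).T) ** 2          # F x K direct estimators
    return eigenspectra.mean(axis=1)

rng = np.random.default_rng(1)
N, K = 256, 8
x = rng.standard_normal(N)
H = sine_tapers(N, K)
orthonormal = np.allclose(H @ H.T, np.eye(K))          # tapers are orthonormal
S_mt = multitaper(x, K, np.arange(1, N // 2) / N)
nonneg = bool(np.all(S_mt >= 0))
```

As noted above, the resulting estimate is non-negative by construction, since it averages squared moduli.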

2.2.3 Fourier frequencies

Spectral estimators are typically studied at the Fourier frequencies, which provide a natural grid for conveying information between the time domain and the frequency domain.

Definition 1 (see Brockwell and Davis [1991], p. 331). For an observed time series with sample size $N$ and sampling interval $\Delta = 1$, the set of Fourier frequencies is defined by

\[
\mathcal{F}_N = \left\{ f_j = \frac{j}{N} : -N/2 < j \le N/2, \; j \in \mathbb{Z} \right\},
\]

where $\mathbb{Z}$ denotes the set of all integers.

Several important spectral features are related to Fourier frequencies:

1. Complete information in the observed series $\mathbf{x} = (x_1, \ldots, x_N)^T$ is preserved by the corresponding values of $J_X^{(p)}(f)$ at the Fourier frequencies [Brockwell and Davis, 1991]. Recall from (2.4) that $J_X^{(p)}(f) = \frac{1}{\sqrt{N}} \sum_{s=1}^{N} x_s e^{-i2\pi sf}$ is the Fourier transform of $\mathbf{x}$. An inverse discrete Fourier transform gives

\[
x_t = \frac{1}{N} \sum_{j: f_j \in \mathcal{F}_N} \left( \sum_{s=1}^{N} x_s e^{-i2\pi s f_j} \right) e^{i2\pi t f_j}
= \frac{1}{\sqrt{N}} \sum_{j: f_j \in \mathcal{F}_N} J_X^{(p)}(f_j)\, e^{i2\pi t f_j}, \quad \text{for } t = 1, 2, \ldots, N. \tag{2.14}
\]

Thus, one can write

\[
\mathbf{x} = \sum_{j: f_j \in \mathcal{F}_N} J_X^{(p)}(f_j) \cdot \mathbf{e}_j, \tag{2.15}
\]

where $\mathbf{e}_j = \frac{1}{\sqrt{N}} \left( e^{i2\pi f_j}, e^{i2\pi 2 f_j}, \ldots, e^{i2\pi N f_j} \right)^T$ for $-N/2 < j \le N/2$, $j \in \mathbb{Z}$. The set of vectors $\{\mathbf{e}_j : -N/2 < j \le N/2, \; j \in \mathbb{Z}\}$ forms an orthonormal basis spanning the $N$-dimensional complex vector space $\mathbb{C}^N$ with inner product

\[
\langle \mathbf{e}_j, \mathbf{e}_{j'} \rangle = \frac{1}{N} \sum_{t=1}^{N} e^{i2\pi t (f_j - f_{j'})} =
\begin{cases}
1, & \text{if } j = j'; \\[4pt]
\dfrac{1}{N}\, e^{i2\pi(f_j - f_{j'})}\, \dfrac{1 - e^{i2\pi N (f_j - f_{j'})}}{1 - e^{i2\pi(f_j - f_{j'})}} = 0, & \text{if } j \neq j'.
\end{cases}
\]

2. It follows from (2.15) that the sample variance based on the observed series $\mathbf{x}$ can be

decomposed into periodogram values at the Fourier frequencies:

\[
\begin{aligned}
\hat{\gamma}_X^{(p)}(0) = \frac{1}{N} \|\mathbf{x}\|_2^2
&= \frac{1}{N} \sum_{t=1}^{N} x_t x_t^* \\
&= \frac{1}{N} \sum_{t=1}^{N} \left( \frac{1}{\sqrt{N}} \sum_{j: f_j \in \mathcal{F}_N} J_X^{(p)}(f_j) e^{i2\pi t f_j} \right)
\left( \frac{1}{\sqrt{N}} \sum_{j': f_{j'} \in \mathcal{F}_N} J_X^{(p)*}(f_{j'}) e^{-i2\pi t f_{j'}} \right) \\
&= \frac{1}{N} \sum_{j: f_j \in \mathcal{F}_N} \sum_{j': f_{j'} \in \mathcal{F}_N} J_X^{(p)}(f_j) J_X^{(p)*}(f_{j'}) \cdot \frac{1}{N} \sum_{t=1}^{N} e^{i2\pi t (f_j - f_{j'})} \\
&= \frac{1}{N} \sum_{j: f_j \in \mathcal{F}_N} J_X^{(p)}(f_j) J_X^{(p)*}(f_j)
= \frac{1}{N} \sum_{j: f_j \in \mathcal{F}_N} \hat{S}^{(p)}(f_j).
\end{aligned}
\]

3. Using the Fourier frequencies greatly simplifies the derivation of the asymptotic distribution of the periodogram. For example, the distribution of the periodogram of a Gaussian white noise process can be derived exactly at the Fourier frequencies [see Percival and Walden, 1993, p. 220-222] using the uncorrelatedness and Gaussianity of the real and imaginary parts of $J_X^{(p)}(f)$ at non-negative Fourier frequencies.

4. For large enough $N$, the correlation between periodogram values at different Fourier frequencies is approximately zero [Percival and Walden, 1993, p. 223]. The width of the central lobe of a spectral window is called the resolution bandwidth. The Fejér kernel (2.8) has a half resolution bandwidth of $1/N$. Spaced by $1/N$, the Fourier frequencies are considered to have a small effect on each other's periodogram value. However, for direct spectral estimators with non-trivial tapers, frequency grids other than the Fourier frequencies are required for approximate uncorrelatedness [Percival and Walden, 1993, p. 224].

See Percival and Walden [1993] for more properties at the Fourier frequencies. Although the above properties are presented for the periodogram, using the Fourier frequencies for MT estimates and further smoothing approaches gives us a unifying grid for inference and for fair comparisons with existing methods in the literature.
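The first two properties in the list above are finite-sample identities and can be verified directly; a sketch with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 64
x = rng.standard_normal(N)

# Fourier frequencies f_j = j/N for -N/2 < j <= N/2
j = np.arange(-N // 2 + 1, N // 2 + 1)
fj = j / N
t = np.arange(1, N + 1)
J = np.exp(-1j * 2 * np.pi * np.outer(fj, t)) @ x / np.sqrt(N)   # J_X^{(p)}(f_j)

# Property 1: x is recovered from J at the Fourier frequencies, as in (2.14)
x_rec = np.real(np.exp(1j * 2 * np.pi * np.outer(t, fj)) @ J) / np.sqrt(N)
recovered = np.allclose(x_rec, x)

# Property 2: the sample variance decomposes into periodogram values
decomposed = np.isclose(np.mean(np.abs(J) ** 2), np.sum(x ** 2) / N)
```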

2.3 Basis models for SDFs

Though MT-based estimation tends to outperform periodogram-based estimation in many respects, it is often still too noisy to be useful in practice. As shown in our simulation study in the next chapter, some important features of the true spectrum may still be concealed by the remaining noise in MT estimates. Alternative approaches that further smooth the preliminary spectral estimates are thus in demand. When the expression for the SDF of the true model is unknown, there is always a concern that we may poorly estimate the SDF due to misspecifying a parametric model. In this case, nonparametric or semiparametric models are used to achieve consistent estimators of the SDF. One such popular approach is to model the SDF with a basis expansion, where the number of basis functions is allowed to increase with the sample size.

2.3.1 Literature review

The earliest basis method for estimating a univariate SDF dates back to the early 1970s. Bloomfield [1973] proposed the exponential (EXP) model for the spectral estimation of univariate time series, which was essentially a Fourier expansion of the log-transformed SDF. The coefficients were estimated by maximizing the exact likelihood of the time series, while the Whittle likelihood approach [Whittle, 1953] was mentioned but not implemented. The use of the EXP model has been widely extended. Beran [1993] extended it to a fractional exponential (FEXP) model for long memory processes. Moulines and Soulier [1999], Hurvich and Brodsky [2001], and Narukawa and Matsuda [2011] discussed broadband estimation for the SDF of long memory time series by FEXP models, where 'broadband' refers to estimation over the complete frequency domain rather than a narrow band around frequency zero. In general, James et al. [2013] argue that unpenalized basis models have overfitting issues when the number of basis functions is allowed to increase with the sample size; thus penalized models are often used to regularize the model fit and perform variable selection.

Penalized basis models for spectral estimation were first explored using spline basis functions. Adding an L2 'roughness' penalty to the sum of squared errors, Cogburn and Davis [1974] presented the theory of periodic smoothing splines and its application to SDF estimation. Approaches for smoothing parameter selection and SDF estimation are illustrated by Wahba and Wold [1975] and Wahba [1980]. The extension of L2 penalized methods was made by Pawitan and O'Sullivan [1994] using the Whittle likelihood based on the periodogram. These L2 methods all possess a penalty in the form of an integrated square of a certain order derivative of the estimated function.

Another type of penalized model used for estimating the SDF employs wavelet thresholding. Gao [1993, 1997] and Moulin [1994] discussed wavelet shrinkage for spectral estimation based on the log periodogram. Assuming approximate Gaussianity of the log MT estimates, Walden et al. [1998] expanded the wavelet thresholding framework to obtain improved spectral estimates. Under various thresholding mechanisms, they showed that the model based on MT estimates with sine tapers was able to substantially reduce the integrated root mean squared error of the fitted SDF compared to periodogram-based models.

Most of the literature mentioned above is based on least squares approaches. We next present a general framework for spectral basis regression and review least squares approaches. The discussion of the Whittle likelihood is relegated to the next chapter.

2.3.2 Basis models and least squares approaches

Basis methods for the estimation of an SDF typically assume that the log SDF can be expanded in a set of $p$ basis functions $\{\phi_l(f) : l = 1, \ldots, p\}$ (e.g., Wahba [1980]). For each frequency $f$, letting $\boldsymbol{\phi}(f) = (\phi_1(f), \ldots, \phi_p(f))^T$ and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$, we suppose that

\[
\log S(f) = \sum_{l=1}^{p} \phi_l(f) \beta_l = \boldsymbol{\phi}^T(f) \boldsymbol{\beta}. \tag{2.16}
\]

There are a variety of families of basis functions $\boldsymbol{\phi}(f)$ that can be used for spectral estimation. For example, polynomial and Fourier bases can be used to capture global patterns [Bloomfield, 1973], smoothing splines allow for local and smooth patterns in the SDF [Cogburn and Davis, 1974, Wahba and Wold, 1975], and wavelet bases model local behaviour while capturing second-order effects such as peaks, troughs, and cusps [Gao, 1993, Moulin, 1994, Gao, 1997, Walden et al., 1998]. When the SDF is spatially inhomogeneous, spatially adaptive bases, such as wavelets, have theoretical optimality properties (see, e.g., Donoho and Johnstone [1994, 1995], Sardy et al. [2004]). We can also combine families of basis functions to form dictionaries [see, e.g., Wasserman, 2006, Section 9.8]. For spectral estimation, the chosen basis functions should be capable of delineating the prominent features of the SDF so that a satisfactory fit can be achieved by parsimonious models.

To estimate the regression coefficients $\boldsymbol{\beta}$ in (2.16), least squares (LS) approaches are widely used in the literature. Distributional inference and tuning for LS spectral estimates rely on approximate Gaussianity of the log periodogram or log MT estimators, which may not hold in practice. Since the periodogram is just a special case of the MT estimator with $K = 1$ taper, without loss of generality we summarize below the general idea of LS approaches using MT estimates only.

Recall from (2.11) that $\hat{S}^{(mt)}(f)$ asymptotically follows a scaled chi-squared distribution with $2K$ degrees of freedom for $f \in (0, 1/2)$. Bartlett and Kendall [1946] show that the cumulants of the logarithm of a chi-squared random variable approximate the cumulants of a normal random variable as the chi-squared degrees of freedom increase. The normal approximation "may be safely used" when the chi-squared degrees of freedom are 10 or greater [Bartlett and Kendall, 1946]; however, the convergence is slow. Walden et al. [1998] adapt the normal approximation to the random variable $\log\{\hat{S}^{(mt)}(f)/S(f)\}$ when $K \ge 5$, and state the following.

Proposition 2 (Walden et al. [1998], Section II). For $K \ge 5$, the distribution of the random variable $\log\{\hat{S}^{(mt)}(f)/S(f)\}$ agrees well with a normal distribution with the same mean and variance. Approximately,

\[
\log \hat{S}^{(mt)}(f) \sim N\left( \log S(f) + \psi(K) - \log K, \; \psi'(K) \right), \quad \text{for } 0 < f < 1/2,
\]

where $\psi(\alpha) = d\left(\ln \Gamma(\alpha)\right)/d\alpha$ is called the digamma function, and $\psi'(\alpha) = d\psi(\alpha)/d\alpha$ is called the trigamma function, with gamma function $\Gamma(\alpha) = \int_0^\infty u^{\alpha-1} e^{-u}\, du$.

Note: The argument for Proposition 2 was based on a heuristic application of Bartlett and Kendall [1946] for large $K$, using the first and second cumulants of the random variable $\log\{\hat{S}^{(mt)}(f)/S(f)\}$, namely $\mathrm{E}\left[\log\{\hat{S}^{(mt)}(f)/S(f)\}\right] = \psi(K) - \log K$ and $\operatorname{Var}\left[\log\{\hat{S}^{(mt)}(f)/S(f)\}\right] = \psi'(K)$ for $0 < f < 1/2$, rather than a rigorous proof. In practice, the number of tapers $K$ is usually fixed and may not be very large. Thus, LS estimation based on Proposition 2 could lead to suboptimal solutions, because the log SDF estimators are not close to normally distributed. This issue motivates using the Whittle likelihood as an alternative framework for more reliable spectral estimation, including our L1 penalized MT-Whittle method proposed in the next chapter.

Define $Y(f) = \log \hat{S}^{(mt)}(f) - \psi(K) + \log K$. It then follows from Proposition 2 that

\[
Y(f) \sim N\left( \log S(f), \psi'(K) \right), \quad \text{for } 0 < f < 1/2.
\]

With the log SDF model (2.16), one can write, for $0 < f < 1/2$,

\[
Y(f) = \boldsymbol{\phi}^T(f)\boldsymbol{\beta} + \xi(f), \quad \text{with } \xi(f) \sim N(0, \psi'(K)). \tag{2.17}
\]
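The cumulants underlying Proposition 2 can be checked by Monte Carlo, using `scipy.special` for the digamma and trigamma functions; the sample size and tolerances below are arbitrary choices of ours:

```python
import numpy as np
from scipy.special import digamma, polygamma

# If S_hat(f)/S(f) ~ chi2_{2K}/(2K), then log{S_hat(f)/S(f)} has mean
# psi(K) - log(K) and variance psi'(K), matching Proposition 2.
K = 5
rng = np.random.default_rng(3)
ratio = rng.chisquare(2 * K, size=200_000) / (2 * K)
logs = np.log(ratio)

mean_theory = digamma(K) - np.log(K)
var_theory = polygamma(1, K)                 # trigamma psi'(K)
mean_err = abs(logs.mean() - mean_theory)
var_err = abs(logs.var() - var_theory)
```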

Using the observed series $\mathbf{x}$, we obtain a sample of $Y(f)$ by evaluating the MT spectral estimates $\hat{S}^{(mt)}(f)$ on the set of $M = \lceil N/2 \rceil - 1$ non-zero, non-Nyquist (i.e., not equal to $1/2$) Fourier frequencies

\[
\left\{ f_j = \frac{j}{N} : j = 1, \ldots, M \right\}. \tag{2.18}
\]

Denote by $\mathbf{Y}$ the vector with $j$th element $Y(f_j) = \log \hat{S}^{(mt)}(f_j) - \psi(K) + \log K$ for $j = 1, \ldots, M$. Choosing a specific basis representation, let $\Phi$ denote the $M \times p$ design matrix of basis functions evaluated at the frequencies (2.18), with $j$th row equal to $\boldsymbol{\phi}^T(f_j)$ for $j = 1, \ldots, M$. Then a sampled version of (2.17) is

\[
\mathbf{Y} = \Phi \boldsymbol{\beta} + \boldsymbol{\xi},
\]

where the error vector $\boldsymbol{\xi} = (\xi(f_1), \ldots, \xi(f_M))^T$ is assumed to follow an $M$-dimensional Gaussian distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma_\xi$. The covariance structure of $\boldsymbol{\xi}$ and its effect on wavelet thresholding are discussed in Walden et al. [1998].

The unpenalized LS method finds $\tilde{\boldsymbol{\beta}}^{(LS)} \in \mathbb{R}^p$ that minimizes

\[
\sum_{j=1}^{M} \left( Y(f_j) - \boldsymbol{\phi}^T(f_j)\boldsymbol{\beta} \right)^2 = (\mathbf{Y} - \Phi\boldsymbol{\beta})^T (\mathbf{Y} - \Phi\boldsymbol{\beta}). \tag{2.19}
\]

The closed-form solution is then

\[
\tilde{\boldsymbol{\beta}}^{(LS)} = \left( \Phi^T \Phi \right)^{-} \Phi^T \mathbf{Y},
\]

where $\left( \Phi^T \Phi \right)^{-}$ denotes the generalized inverse of $\Phi^T \Phi$. The fitted SDF is calculated at each frequency using

\[
\tilde{S}^{(LS)}(f) = \exp\left\{ \boldsymbol{\phi}^T(f)\, \tilde{\boldsymbol{\beta}}^{(LS)} \right\}, \quad f \in (0, 1/2).
\]

The unpenalized LS approach is equivalent to maximizing a Gaussian likelihood under the assumption of independence between the MT estimates at the Fourier frequencies. However, the MT estimators at different frequencies are correlated [see Thomson, 1982, for details]: with a locally slowly varying spectrum, for $f$ close to $f'$,

\[
\operatorname{Cov}\left\{ \hat{S}^{(mt)}(f), \hat{S}^{(mt)}(f') \right\} \approx S^2(f) \cdot \frac{1}{K^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \left| \sum_{t=1}^{N} h_{t,k} h_{t,l}\, e^{i2\pi(f'-f)t} \right|^2,
\]

for $0 < f < f' < 1/2$. Thus, we can argue that the LS optimization (2.19) is based on a quasi-likelihood function rather than a true likelihood. See Wedderburn [1974], McCullagh and Nelder [1999, Ch. 9], and the next chapter for more information about quasi-likelihood functions.

A nonparametric basis expansion allows the number of basis functions $p$ to increase with the sample size $M$. The statistical problem then becomes how to enforce smoothness and sparsity, by selecting the basis functions and estimating the model parameters, so that we adequately estimate the SDF without overfitting while retaining computational efficiency as the sample size increases. Popular approaches in the literature for regularizing the model fit include adding penalties of the L1 type and the L2 type. We summarize below representative models of both types and use them for comparison in later chapters.

Typically, L1 penalized approaches use wavelet thresholding (see, e.g., Walden et al. [1998]), while L2 penalized methods use smoothing splines (see, e.g., Cogburn and Davis [1974], Wahba and Wold [1975]).

2.3.3 Wavelet soft-thresholding

The representative L1 penalized approach for spectral estimation uses wavelet soft-thresholding (see, e.g., Walden et al. [1998]).

The first step in wavelet soft-thresholding is the discrete wavelet transform (DWT). For an $M = 2^q$ dimensional vector $\mathbf{Y}$, the DWT matrix is defined as an $M \times M$ matrix $\mathcal{W}_q$ whose rows contain circularly shifted wavelet and scaling filters at different scales. (See Percival and Walden [2000] for a comprehensive review.) The DWT calculates $\tilde{\boldsymbol{\beta}}^{(\mathcal{W})} = \mathcal{W}_q \mathbf{Y}$, with the result in the form

\[
\tilde{\boldsymbol{\beta}}^{(\mathcal{W})} = \left[ \tilde{\beta}_{1,1}^{(\mathcal{W})}, \tilde{\beta}_{1,2}^{(\mathcal{W})}, \ldots, \tilde{\beta}_{1,M/2}^{(\mathcal{W})}, \tilde{\beta}_{2,1}^{(\mathcal{W})}, \tilde{\beta}_{2,2}^{(\mathcal{W})}, \ldots, \tilde{\beta}_{2,M/4}^{(\mathcal{W})}, \ldots\ldots, \tilde{\beta}_{q-1,1}^{(\mathcal{W})}, \tilde{\beta}_{q-1,2}^{(\mathcal{W})}, \tilde{\beta}_{q,1}^{(\mathcal{W})}, \tilde{\alpha}_{q+1}^{(\mathcal{W})} \right]^T,
\]

where $\left\{ \tilde{\beta}_{j',1}^{(\mathcal{W})}, \tilde{\beta}_{j',2}^{(\mathcal{W})}, \ldots, \tilde{\beta}_{j',M/(2^{j'})}^{(\mathcal{W})} \right\}$ is called the set of wavelet coefficients associated with scale $2^{j'-1}$, for levels $j' = 1, \ldots, q$; and $\tilde{\alpha}_{q+1}^{(\mathcal{W})} = \mathbf{1}^T \mathbf{Y} / \sqrt{M}$ is called the scaling coefficient.

The DWT matrix $\mathcal{W}_q$ is not always of interest and thus does not usually need to be computed explicitly. The wavelet coefficients $\tilde{\boldsymbol{\beta}}^{(\mathcal{W})}$ can be obtained by $q$ recursive applications of the pyramid algorithm [Mallat, 1989]. Each iteration of the pyramid algorithm calculates the set of wavelet coefficients for one level, in the order of levels $j'$ from $1$ to $q$. If the iterations stop at level $q_0 < q$, then the result is said to be a $q_0$th-order partial DWT, which obtains

\[
\tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})} = \left[ \tilde{\beta}_{1,1}^{(\mathcal{W}_{q_0})}, \tilde{\beta}_{1,2}^{(\mathcal{W}_{q_0})}, \ldots, \tilde{\beta}_{1,M/2}^{(\mathcal{W}_{q_0})}, \ldots\ldots, \tilde{\beta}_{q_0,1}^{(\mathcal{W}_{q_0})}, \tilde{\beta}_{q_0,2}^{(\mathcal{W}_{q_0})}, \ldots, \tilde{\beta}_{q_0,M/(2^{q_0})}^{(\mathcal{W}_{q_0})}, \right.
\]
\[
\left. \tilde{\alpha}_{q_0+1,1}^{(\mathcal{W}_{q_0})}, \tilde{\alpha}_{q_0+1,2}^{(\mathcal{W}_{q_0})}, \ldots, \tilde{\alpha}_{q_0+1,M/(2^{q_0+1})}^{(\mathcal{W}_{q_0})}, \ldots\ldots, \tilde{\alpha}_{q-1,1}^{(\mathcal{W}_{q_0})}, \tilde{\alpha}_{q-1,2}^{(\mathcal{W}_{q_0})}, \tilde{\alpha}_{q,1}^{(\mathcal{W}_{q_0})}, \tilde{\alpha}_{q+1}^{(\mathcal{W}_{q_0})} \right]^T,
\]

where $\left\{ \tilde{\beta}_{j',1}^{(\mathcal{W}_{q_0})}, \tilde{\beta}_{j',2}^{(\mathcal{W}_{q_0})}, \ldots, \tilde{\beta}_{j',M/(2^{j'})}^{(\mathcal{W}_{q_0})} \right\} = \left\{ \tilde{\beta}_{j',1}^{(\mathcal{W})}, \tilde{\beta}_{j',2}^{(\mathcal{W})}, \ldots, \tilde{\beta}_{j',M/(2^{j'})}^{(\mathcal{W})} \right\}$ are the calculated wavelet coefficients for levels $j' = 1, \ldots, q_0$, and $\left\{ \tilde{\alpha}_{j'',1}^{(\mathcal{W}_{q_0})}, \tilde{\alpha}_{j'',2}^{(\mathcal{W}_{q_0})}, \ldots, \tilde{\alpha}_{j'',M/(2^{j''})}^{(\mathcal{W}_{q_0})} \right\}$ remain as scaling coefficients for levels $j'' = q_0 + 1, \ldots, q$.

Denote by $\mathcal{W}_{q_0}$ the $q_0$th-order partial DWT matrix, an $M \times M$ matrix that can be used to calculate $\tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})}$ via $\tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})} = \mathcal{W}_{q_0} \mathbf{Y}$ [Percival and Walden, 2000]. The first $M - M/2^{q_0}$ rows of the $q_0$th-order partial DWT matrix $\mathcal{W}_{q_0}$ are $M$-dimensional discrete wavelet bases at the finest $q_0$ levels, and the other $M/2^{q_0}$ rows of $\mathcal{W}_{q_0}$ are the associated $M$-dimensional discrete scaling bases at the roughest level. These bases are orthonormal, so that $\mathcal{W}_{q_0}^T \mathcal{W}_{q_0} = I_M$. The partial DWT performs exactly an unpenalized LS approach. With design matrix

$\Phi = \mathcal{W}_{q_0}^T$, the coefficients $\tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})}$ can be expressed in the form of the LS coefficient estimates, with

\[
\tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})} = \mathcal{W}_{q_0} \mathbf{Y} = \left( \mathcal{W}_{q_0} \mathcal{W}_{q_0}^T \right)^{-1} \mathcal{W}_{q_0} \mathbf{Y},
\]

which can be equivalently obtained by minimizing (2.19). Note that the full DWT is a special case of the partial DWT with $q_0 = q$.
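As an illustration of the pyramid algorithm, the following sketch computes a partial DWT with the Haar filter (our choice for simplicity; Walden et al. [1998] use other wavelet filters), and checks that orthonormality preserves the squared norm of the data:

```python
import numpy as np

def haar_pyramid(y, q0):
    """A q0-th order partial DWT via the pyramid algorithm, Haar filter.

    Each iteration splits the current scaling coefficients into a finer-level
    set of wavelet (detail) coefficients and a coarser set of scaling
    coefficients, exactly one level per pass.
    """
    details, approx = [], np.asarray(y, dtype=float)
    for _ in range(q0):
        even, odd = approx[0::2], approx[1::2]
        details.append((odd - even) / np.sqrt(2.0))   # wavelet coefficients
        approx = (even + odd) / np.sqrt(2.0)          # scaling coefficients
    return details, approx

rng = np.random.default_rng(6)
M = 32                                    # M = 2^q with q = 5
y = rng.standard_normal(M)
details, approx = haar_pyramid(y, 2)      # q0 = 2 < q

# Orthonormality of the (partial) DWT preserves the squared norm
energy_ok = np.isclose(sum(np.sum(d ** 2) for d in details) + np.sum(approx ** 2),
                       np.sum(y ** 2))
```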

Upon calculating the wavelet coefficients, the next step is to shrink them. With a threshold $a \ge 0$, the soft-thresholding function is defined as

\[
\mathrm{ST}(x, a) = \operatorname{sign}(x) \max\{|x| - a, 0\}. \tag{2.20}
\]

Walden et al. [1998] apply the universal threshold $a^{\mathrm{univ}} = \sqrt{\psi'(K)} \cdot \sqrt{2\log M}$ for soft-thresholding the wavelet coefficients calculated from the log MT estimates. Recall that $\psi'(K)$ is the approximate variance of $Y(f)$ for $0 < f < 1/2$. With orthonormal bases $\mathcal{W}_{q_0}^T$, $\psi'(K)$ is also the approximate variance of the wavelet coefficients in $\tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})}$.
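A sketch of soft-thresholding with the universal threshold (the function name and noise level are illustrative assumptions of ours): coefficients that carry only noise are shrunk to zero with high probability.

```python
import numpy as np

def soft_threshold(x, a):
    """Equation (2.20): ST(x, a) = sign(x) * max(|x| - a, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

# Hypothetical pure-noise coefficients with variance sigma2: the universal
# threshold sigma * sqrt(2 log p) kills almost all of them.
rng = np.random.default_rng(4)
p, sigma2 = 1024, 0.25
noise_coefs = rng.normal(0.0, np.sqrt(sigma2), size=p)
a_univ = np.sqrt(sigma2) * np.sqrt(2 * np.log(p))

kept = np.count_nonzero(soft_threshold(noise_coefs, a_univ))
frac_killed = 1.0 - kept / p
```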

Universal threshold and minimax threshold. The universal threshold was originally introduced for wavelet thresholding, and can be generalized to the L1 penalized LS method for function estimation with orthogonal bases [Donoho and Johnstone, 1994].

Consider data of the form

\[
y_j = g(f_j) + \xi_j,
\]

with $f_j = j/N$, $j = 1, \ldots, M$, where $\mathbf{g} = \{g(f_j)\}_{j=1}^{M}$ is the noiseless data from the target function $g(f)$, and $\boldsymbol{\xi} = (\xi_j)_{j=1}^{M}$ is white noise with elements independently distributed as $N(0, \sigma_\xi^2)$. When an $M \times M$ orthonormal design matrix $\Phi$ is applied to model $\mathbf{g} = \Phi\boldsymbol{\beta}$, the empirical

coefficients can be calculated by

\[
\tilde{\boldsymbol{\beta}} = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top \mathbf{y} = \Phi^\top \mathbf{y}.
\]

Inversely, $\mathbf{y} = \Phi \tilde{\boldsymbol{\beta}}$. Since $\mathbf{y} = \mathbf{g} + \boldsymbol{\xi}$ and $\Phi$ is an orthogonal matrix, the empirical coefficients are

\[
\tilde{\boldsymbol{\beta}} = \Phi^\top \mathbf{g} + \Phi^\top \boldsymbol{\xi}.
\]

Letting $\boldsymbol{\vartheta} = \Phi^\top \mathbf{g}$ and $\boldsymbol{\iota} = \Phi^\top \boldsymbol{\xi}$, the empirical coefficients can be represented as

\[
\tilde{\boldsymbol{\beta}} = \boldsymbol{\vartheta} + \boldsymbol{\iota}.
\]

In the above equation, $\boldsymbol{\vartheta} = \Phi^\top \mathbf{g}$ is the vector of coefficient estimates based on the noiseless data, and $\boldsymbol{\iota} = \Phi^\top \boldsymbol{\xi}$ is white noise, since $\Phi$ is an orthonormal matrix.

In usual situations, the contribution of the basis coefficients to the noiseless signal is sparse. That is, only a small number of empirical basis coefficients contribute to both the signal and the noise; the other basis coefficients contribute only to the noise, with variance $\sigma_\xi^2$. The choice of the universal threshold is based on the following two facts:

1. For the performance measure defined as $\mathcal{S}_c(\hat{\mathbf{g}}, \mathbf{g}) = M^{-1}\, \mathrm{E}\, \| \hat{\mathbf{g}} - \mathbf{g} \|_{2,M}^2$, the minimax threshold has the smallest risk bound. Based only on the data, the minimax threshold best mimics the performance of the 'oracle'. The universal threshold is a simple alternative to the minimax threshold, and it is also asymptotically optimal. (See Donoho and Johnstone [1994] for details.)

2. For the white noise $\boldsymbol{\iota} = \Phi^\top \boldsymbol{\xi}$, whose elements $\iota_j$ are independent and identically distributed as $N(0, \sigma_\xi^2)$, we have

\[
\Pr\left\{ \max_j |\iota_j| > \sigma_\xi \sqrt{2\log M} \right\} \to 0, \quad \text{as } M \to \infty.
\]

Thus, by soft-thresholding with the universal threshold $a^{\mathrm{univ}} = \sigma_\xi \cdot \sqrt{2\log M}$, the bases with no contribution to the signal will have coefficient estimates shrunk to zero with high probability. The universal threshold referred to in the literature typically takes the value $\sigma_\xi \sqrt{2\log M}$, since the threshold is applied to the full set of DWT bases with $p = M$. However, if a more general design matrix of $p < M$ predictors is used, then it is better to adjust the threshold by replacing $M$ with $p$, based on an argument similar to the second fact above. Returning to wavelet soft-thresholding, with the universal threshold $a^{\mathrm{univ}} = \sqrt{\psi'(K)} \cdot \sqrt{2\log p}$,

the thresholded wavelet coefficients can be obtained as

\[
\hat{\beta}_{j',k'}^{(\mathcal{W}_{q_0})} = \mathrm{ST}\left( \tilde{\beta}_{j',k'}^{(\mathcal{W}_{q_0})}, a^{\mathrm{univ}} \right),
\]

with $k' = 1, \ldots, M/(2^{j'})$ for levels $j' = 1, \ldots, q_0$. (See Walden et al. [1998].)

Taking the inverse partial DWT, one can obtain the fitted log SDF as

\[
\hat{\mathbf{Y}} = \mathcal{W}_{q_0}^{-1} \hat{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})} = \mathcal{W}_{q_0}^T \hat{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})}.
\]

Note: Without thresholding, the result $\mathcal{W}_{q_0}^T \tilde{\boldsymbol{\beta}}^{(\mathcal{W}_{q_0})}$ corresponds to the fitted values from the unpenalized LS method for wavelet basis regression.

The lasso and soft-thresholding. There is a close relationship between soft-thresholding and the lasso. The term 'lasso' stands for 'least absolute shrinkage and selection operator' [Tibshirani, 1996]. With tuning parameter $\omega > 0$, a lasso problem for spectral estimation minimizes the L1 penalized LS criterion

\[
\frac{1}{2} \sum_{j=1}^{M} \left( Y(f_j) - \boldsymbol{\phi}^T(f_j)\boldsymbol{\beta} \right)^2 + \omega \sum_{l=1}^{p} |\beta_l|. \tag{2.21}
\]

Both the lasso and soft-thresholding execute shrinkage, with some coefficient estimates

shrunk exactly to zero. Bühlmann and van de Geer [2011] show that, under an orthonormal design with $p = M$ (i.e., the number of predictors equal to the number of data points), the solution to the lasso problem (2.21) is the soft-thresholded LS estimate with threshold $a = \omega$. For example, the wavelet soft-thresholding in Walden et al. [1998] with the full DWT and universal threshold $a^{\mathrm{univ}}$ is equivalent to solving the lasso problem that minimizes

\[
\frac{1}{2} \left( \mathbf{Y} - \mathcal{W}_q^T \boldsymbol{\beta} \right)^T \left( \mathbf{Y} - \mathcal{W}_q^T \boldsymbol{\beta} \right) + a^{\mathrm{univ}} \|\boldsymbol{\beta}\|_1.
\]
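This equivalence is easy to verify numerically: with an orthonormal design, a proximal-gradient (ISTA) iteration lands on the soft-thresholded LS estimate, which is the lasso minimizer. A sketch under these assumptions (the random orthogonal design is ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
M = 64
Phi, _ = np.linalg.qr(rng.standard_normal((M, M)))   # orthonormal design matrix
y = rng.standard_normal(M)
omega = 0.5

def ST(x, a):
    """Soft-thresholding operator, equation (2.20)."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

# Closed-form lasso solution under an orthonormal design: ST(Phi^T y, omega)
beta_closed = ST(Phi.T @ y, omega)

# Proximal gradient (ISTA) on (1/2)||y - Phi b||^2 + omega ||b||_1, step size 1
beta = np.zeros(M)
for _ in range(200):
    beta = ST(beta - Phi.T @ (Phi @ beta - y), omega)

same_solution = np.allclose(beta, beta_closed)
```

Because $\Phi^\top\Phi = I$, the gradient step maps any iterate straight to $\Phi^\top \mathbf{y}$, so ISTA converges to the closed-form solution immediately.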

2.3.4 Smoothing splines

A variety of spline bases can be chosen for fitting an L2 penalized spectral model (see, e.g., Cogburn and Davis [1974], Wahba and Wold [1975]). As an example, we present below the penalized least squares and periodic smoothing splines used in Cogburn and Davis [1974] for univariate spectral analysis.

Adapting our previous notation from the LS approach for estimating the SDF, the L2 penalized problem is formulated to minimize

\[
\sum_{j=1}^{M} \left\{ Y(f_j) - g(f_j) \right\}^2 + \lambda_S \int_0^{1/2} \left\{ g^{(m)}(f) \right\}^2 df, \tag{2.22}
\]

where $g(f)$ is defined on the subset of the log spectra space such that

\[
\left\{ g : g(f) > 0, \; g(f) = g(f+1), \; g(f) = g(-f), \; \int_0^{1/2} \left\{ g^{(m)}(f) \right\}^2 df < \infty, \; f \in \mathbb{R} \right\}.
\]

In (2.22), we take the integrated squared $m$th derivative of the estimated log SDF as the roughness penalty, and $\lambda_S > 0$ denotes the smoothing parameter.

For univariate spectral analysis, the log SDF is an even periodic function. By Krafty and Collinge [2013], the reproducing kernel of the periodic reproducing kernel Hilbert space [see, e.g., Gu, 2013, Section 2.1] of Cogburn and Davis [1974], with a restriction to ensure that the function is even, is defined as

\[
B(f, f') = \sum_{l=1}^{\infty} (2\pi l)^{-2m} \cos(2\pi l f) \cos(2\pi l f').
\]

Similarly, for odd functions (which is useful for multivariate spectral analysis), the reproducing kernel is $A(f, f') = \sum_{l=1}^{\infty} (2\pi l)^{-2m} \sin(2\pi l f) \sin(2\pi l f')$. Following Chapter 2 of Gu [2013], the closed-form solution in terms of the reproducing kernels can be represented by

M X log Sb(f) = gb(f) = βb0 + βbjB(f, fj), j=1 which is also in the form of basis regression, where the coefficient estimates βb0, βb1,..., βbM can be obtained following the standard procedures (see Gu[2013] for further details).
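The reproducing kernels can be evaluated by truncating their series. The helper below is our own sketch (the truncation point L is an assumption); it also checks the symmetry, periodicity, and evenness that the kernel inherits from its cosine terms:

```python
import numpy as np

def B_kernel(f, fp, m=2, L=2000):
    # Truncated series for the even-part reproducing kernel
    # B(f, f') = sum_{l >= 1} (2*pi*l)^(-2m) * cos(2*pi*l*f) * cos(2*pi*l*f')
    l = np.arange(1, L + 1)
    return np.sum((2.0 * np.pi * l) ** (-2 * m)
                  * np.cos(2.0 * np.pi * l * f) * np.cos(2.0 * np.pi * l * fp))

# Symmetry in its arguments, period-1 in f, and evenness in f
assert np.isclose(B_kernel(0.1, 0.3), B_kernel(0.3, 0.1))
assert np.isclose(B_kernel(0.1, 0.3), B_kernel(1.1, 0.3))
assert np.isclose(B_kernel(0.1, 0.3), B_kernel(-0.1, 0.3))
```

The terms decay like $l^{-2m}$, so a modest truncation already gives near machine-precision values for $m \ge 1$.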

More generally, the same set of bases can be extended to the penalized Whittle case for multivariate spectral estimation, as shown by Krafty and Collinge [2013]; this will be discussed in Chapter 5.

2.4 Conclusion

In this chapter, we introduced the foundations and concepts of univariate spectral analysis, presented the properties of preliminary estimators, and discussed the idea of using basis methods for estimating the SDF. Representative LS approaches with relevant regularization from the literature were summarized. In Chapter 3, we propose a new approach that circumvents the problematic Gaussianity assumption required by LS approaches and achieves sparsity for a wide variety of basis functions.

Chapter 3: Multitaper Whittle method with a lasso penalty

In this chapter, we propose an L1 penalized quasi-likelihood Whittle framework based on multitaper spectral estimates, which performs semiparametric spectral estimation for regularly sampled univariate stationary time series. Our new approach circumvents the problematic Gaussianity assumption required by least squares approaches and achieves sparsity for a wide variety of basis functions. We present an alternating direction method of multipliers (ADMM) algorithm to efficiently solve the optimization problem, and develop universal threshold and generalized information criterion (GIC) strategies for efficient tuning parameter selection that outperform cross-validation methods. Theoretically, a fast convergence rate for the proposed spectral estimator is established. We demonstrate the utility of our methodology on simulated series and on the spectral analysis of electroencephalogram (EEG) data.

3.1 Penalized multitaper Whittle estimation

The Whittle likelihood [Whittle, 1953] is widely used to approximate log-likelihoods for stationary and some nonstationary time series using the naive spectral estimator known as the periodogram. (See Calder and Davis [1997] for an introduction to the Whittle likelihood.) For stationary Gaussian processes, using results for symmetric Toeplitz matrices (e.g., Grenander and Szegő [1958]), the Whittle likelihood is asymptotically equal to two times the negative log-likelihood [Calder and Davis, 1997]. For certain non-Gaussian cases the Whittle likelihood is still a valid approximation [Hannan, 1973]. This is because the Whittle likelihood is equal to the joint asymptotic distribution of the periodogram, assuming independence of the spectral estimates over the frequencies at which they are evaluated [Pawitan and O'Sullivan, 1994]. Since the periodogram often exhibits poor statistical properties, such as bias due to leakage and inconsistency in estimation [Percival and Walden, 1993], we develop a Whittle likelihood based on MT spectral estimates. We explain why this MT-Whittle likelihood is still reasonable for parameter estimation.

Using the time series observations $x = (x_1, \ldots, x_N)^T$, we evaluate the MT spectral estimates $\hat{S}^{(mt)}(f_j)$ on the set of $M = \lceil N/2 \rceil - 1$ non-zero, non-Nyquist (i.e., not equal to 1/2) Fourier frequencies $\{f_j = j/N : j = 1, \ldots, M\}$ defined in (2.18). Choosing a specific basis representation, let $\Phi$ denote the $M \times p$ design matrix of basis functions evaluated at these Fourier frequencies, with row $j = 1, \ldots, M$ equal to $\phi^T(f_j)$. The vector of log SDFs

$$\zeta = (\log S(f_1), \ldots, \log S(f_M))^T \qquad (3.1)$$

evaluated at these frequencies is $\zeta = \Phi\beta$, by (2.16). Modeling the SDF on the log scale brings several benefits to SDF estimation:

1. In the literature, a log transformation is widely used in basis regression methods for SDF estimation (see, e.g., Wahba [1980], Pawitan and O'Sullivan [1994], Walden et al. [1998], Moulines and Soulier [1999], as mentioned in Section 2.3). The log transformation resolves the heteroskedasticity over frequencies. Using (2.12) and the upcoming Proposition 3, we can show that, for $0 < f < 1/2$, asymptotically $\mathrm{Var}\{\hat{S}^{(mt)}(f)\} = \{S(f)\}^2/K$, proportional to $\{S(f)\}^2$, while $\mathrm{Var}\{\log \hat{S}^{(mt)}(f)\} = \psi'(K)$ is a constant, where $\psi'(K)$ is the trigamma function evaluated at K [Walden et al., 1998]. Thus, using a log transformation not only allows us to make a direct comparison with existing methods, but also allows us to develop efficient methods of inference and threshold selection when later incorporating the L1 penalty.

2. Modeling on the log scale avoids the additional positivity constraint that would otherwise need to be enforced at each algorithm iteration when working with the original-scale SDF, so the computation is more efficient.
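The variance-stabilizing effect described in item 1 can be checked numerically. The sketch below is our own illustration: it draws from the limiting distribution $S(f)\,\chi^2_{2K}/(2K)$ and compares the variance on the original and log scales, computing the trigamma value by truncating its series:

```python
import numpy as np

rng = np.random.default_rng(1)
K, S_f = 10, 4.0     # number of tapers and an arbitrary true SDF value (assumed)
# Limiting distribution of the MT estimator: S(f) * chi^2_{2K}/(2K) = Gamma(shape K, scale S(f)/K)
draws = rng.gamma(shape=K, scale=S_f / K, size=200_000)

# Trigamma psi'(K) via its series sum_{n >= 0} (K + n)^{-2}, plus a tail correction
trigamma_K = np.sum(1.0 / (K + np.arange(200_000.0)) ** 2) + 1.0 / (K + 200_000)

print(np.var(draws), S_f ** 2 / K)         # original scale: variance depends on {S(f)}^2
print(np.var(np.log(draws)), trigamma_K)   # log scale: variance is free of S(f)
```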

One could also model the SDF directly as $S(f) = \phi^T(f)\beta$. However, two issues arise with this parameterization. First, although one could add additional constraints $\phi^T(f_j)\beta > 0$ at the Fourier frequencies $\{f_j\}$ defined by (2.18) to the original optimization problem, this would not necessarily guarantee that $\phi^T(f)\beta > 0$ for all $f \in [0, 1/2]$, especially when the fitted values at non-Fourier frequencies are acquired with certain associated continuous basis functions (e.g., continuous wavelets or cosines). Moreover, modeling the original SDF directly results in a non-convex MT-Whittle likelihood function with respect to $\beta$, which leads to a more challenging optimization problem. Due to these considerations, we chose to model the log SDF, which not only takes care of the positivity constraint automatically, but also leads to a tractable convex optimization problem.

To estimate the coefficients in (2.16) for the log SDF, we now introduce a quasi-likelihood

framework based on the MT estimates.

Definition 2. Using MT spectral estimators, the quasi-likelihood function with expression

$$l_W(\Phi\beta) = \sum_{j=1}^{M}\left\{\log S(f_j) + \frac{\hat{S}^{(mt)}(f_j)}{S(f_j)}\right\} \qquad (3.2)$$

is called the MT-Whittle likelihood function.

The MT-Whittle likelihood has the same functional form as the Whittle likelihood based on the periodogram, except we replace the periodogram by the MT estimator. It can be easily shown that we obtain the usual Whittle likelihood using a single (K = 1) rectangular taper defined by $h_{t,1} = 1/\sqrt{N}$ for all $t = 1, \ldots, N$ in the MT spectral estimator; the tapered Whittle likelihood [Dahlhaus, 1988] is also a special case.

To see why the MT-Whittle likelihood is a reasonable quasi-likelihood, we present the following two propositions.

Proposition 3. [Walden, 2000, Section 3.3] Suppose that $\{X_t : t \in \mathbb{Z}\}$ is strictly stationary with all moments existing such that

$$\sum_{\tau_1, \ldots, \tau_{l-1} = -\infty}^{\infty} \left|\mathrm{cum}(X_{t+\tau_1}, \ldots, X_{t+\tau_{l-1}}, X_t)\right| < \infty,$$

for $l = 2, 3, \ldots$, where $\mathrm{cum}(X_{t_1}, \ldots, X_{t_l})$ denotes the joint cumulant function of order l (see, e.g., Brillinger [1981], Sec. 2.3). Also for each N, let $\{h_{t,k} : t = 1, \ldots, N,\ k = 1, \ldots, K\}$ be a set of K orthonormal sine or DPSS data tapers. Then

$$\hat{S}^{(mt)}(f) \to_d S(f)\,\frac{\chi^2_{2K}}{2K}, \quad \text{for } 0 < f < 1/2, \text{ as } N \to \infty,$$

where $\chi^2_{2K}$ denotes a chi-squared random variable (RV) with 2K degrees of freedom.

Walden [2000] provides a proof of this result in the multivariate case; our result has been simplified to the univariate case. This proposition tells us that at each frequency f, our MT spectral estimators have a valid asymptotic scaled chi-squared distribution that depends on the true underlying SDF.
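Proposition 3 can be illustrated numerically. The sketch below is our own illustration (the sine tapers and the white-noise example are assumptions, not taken from the dissertation's simulations); it checks that for white noise, where S(f) = 1, the MT estimator has mean approximately S(f) and variance approximately $\{S(f)\}^2/K$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, f = 512, 8, 0.25
t = np.arange(1, N + 1)
# Orthonormal sine tapers h_{t,k} = sqrt(2/(N+1)) * sin(pi*k*t/(N+1)), k = 1, ..., K
H = np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * np.outer(np.arange(1, K + 1), t) / (N + 1))
phase = np.exp(-1j * 2.0 * np.pi * f * t)

X = rng.standard_normal((2000, N))     # white noise realizations, S(f) = 1
J = X @ (H * phase).T                  # K eigencoefficients at f per realization
est = (np.abs(J) ** 2).mean(axis=1)    # MT spectral estimates

print(np.mean(est))     # close to S(f) = 1
print(np.var(est) * K)  # close to {S(f)}^2 = 1, i.e., Var is approximately {S(f)}^2 / K
```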

In general, MT spectral estimators are correlated over frequencies: by tapering we reduce the sidelobes to decrease the bias, but increase the effective bandwidth of the spectral estimator, which then increases the correlation. Thomson [1982] showed that with a locally slowly varying spectrum, for $0 < f < f' < 1/2$, with f close to f',

$$\mathrm{Cov}\left\{\hat{S}^{(mt)}(f), \hat{S}^{(mt)}(f')\right\} \approx \frac{S^2(f)}{K^2}\sum_{k=1}^{K}\sum_{l=1}^{K}\left|\sum_{t=1}^{N} h_{t,k}\,h_{t,l}\,e^{i2\pi(f'-f)t}\right|^2. \qquad (3.3)$$

However, the next result shows that our MT-Whittle likelihood (3.2) can be reinterpreted as a gamma quasi-likelihood, which ignores these correlations between frequencies.

Proposition 4. The MT-Whittle likelihood (3.2) corresponds to a gamma quasi-likelihood assuming the asymptotic distribution of Proposition 3 at the Fourier frequencies (2.18), and assuming independence between the Fourier frequencies.

Proof. We can rewrite the asymptotic distribution of $Z_j = \hat{S}^{(mt)}(f_j)$ ($j = 1, \ldots, M$) stated in Proposition 3 as $\mathrm{Gamma}(K, S(f_j)/K)$, a parametrization of the gamma typically used in generalized linear models (see, e.g., [McCullagh and Nelder, 1999, Ch. 8]). Here $\mathrm{Gamma}(\nu, \mu/\nu)$ denotes a gamma RV with mean $\mu$, shape parameter $\nu$, and variance $\mu^2/\nu$. This asymptotic probability density function evaluated at $Z_j = z_j$ is

$$p(z_j) = \frac{z_j^{K-1}}{\left[S(f_j)/K\right]^K\,\Gamma(K)}\exp\left(-\frac{z_j}{S(f_j)/K}\right).$$

The proposition follows by assuming independence between the MT spectral estimates over frequencies: the resulting quasi-likelihood, as required, is

$$\sum_{j=1}^{M}\log p(z_j) = M\log\frac{K^K}{\Gamma(K)} + (K-1)\sum_{j=1}^{M}\log z_j - K\sum_{j=1}^{M}\left\{\log S(f_j) + \frac{z_j}{S(f_j)}\right\} = \text{constant} - K\,l_W(\Phi\beta).$$
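The algebraic identity at the end of the proof can be verified numerically. The following sketch is our own check, with arbitrary positive SDF values; it confirms that the summed gamma log-densities equal the stated constant minus $K\,l_W(\Phi\beta)$:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(3)
M, K = 100, 6
S = np.exp(rng.standard_normal(M))       # arbitrary positive "true" SDF values (assumed)
z = rng.gamma(shape=K, scale=S / K)      # MT estimates drawn under the working model

# Gamma log-density with mean S_j and shape K, summed over frequencies
loglik = np.sum((K - 1) * np.log(z) - K * np.log(S / K) - lgamma(K) - K * z / S)

# MT-Whittle likelihood and the constant identified in the proof
l_W = np.sum(np.log(S) + z / S)
const = M * (K * np.log(K) - lgamma(K)) + (K - 1) * np.sum(np.log(z))

assert np.isclose(loglik, const - K * l_W)
```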

In statistical science, it is common to estimate certain parameters of a model (e.g., parameters related to the mean of the distribution, such as $\beta$) using a simplified model compared to the true statistical model (see, e.g., Diggle et al. [2002]). The simplified model is known as the working model. Proposition 4 tells us that when we write down an MT-Whittle likelihood using MT spectral estimates, we define a working model for the spectral estimates, assuming independence across the Fourier frequencies. An extensive literature on estimating $\beta$ in this setting (e.g., Liang and Zeger [1986], Zeger and Liang [1986], Diggle et al. [2002]) indicates that we can consistently estimate the model parameters $\beta$ under this working model, even when independence does not truly hold. Thus Proposition 4 allows us to introduce the following penalized quasi-likelihood framework for valid statistical inference for $\beta$.

We incorporate a lasso-type penalty with the MT-Whittle likelihood (3.2) to enforce sparsity as the number of basis functions, p, increases with sample size N. In Sections 3.5 and 3.6, we let $p = M + 1 = \lceil N/2 \rceil$, where M is the number of non-zero, non-Nyquist Fourier frequencies, although depending on the form of the penalty, p could actually be larger than N. Our optimization problem is then

$$\min_{\beta}\ l_W(\Phi\beta) + \sum_{l=1}^{p}\lambda_l|\beta_l|, \qquad (3.4)$$

where $\lambda_l \ge 0$ denotes the tuning parameter for coefficient $\beta_l$, with $l = 1, \ldots, p$. The penalty $\sum_{l=1}^{p}\lambda_l|\beta_l|$ reduces the effect of noise on the coefficient estimates, in general shrinking the magnitudes of all regression coefficients towards zero. We let $\hat{\beta}$ denote the solution to (3.4), and $\hat{S}(f) = \exp\{\phi^T(f)\hat{\beta}\}$ denote the resulting estimator of the SDF.

Our L1 penalized method performs simultaneous parameter estimation and feature selection [see, e.g., Wasserman, 2006]. With some of the coefficient estimates shrunk exactly to zero, the L1 penalized method identifies a small subset of predictors, which is favored when the true model is considered to be sparse. (See Section 6.2 for a discussion of L2 penalized methods that also carry out feature selection.)

In the next section an algorithm to solve (3.4) is presented. Tuning parameter selection is discussed in Section 3.3.

3.2 Computing: an ADMM algorithm

Sardy et al. [2004] suggested using the interior point algorithm to optimize a penalized non-Gaussian likelihood incorporating wavelet basis expansions, and provided the explicit steps of the algorithm for the Poisson case. No explicit algorithm was given for the gamma distribution that makes up our MT-Whittle likelihood, or for including other basis functions. We use instead the alternating direction method of multipliers (ADMM) algorithm, an approach for solving large-scale nonsmooth convex optimization problems that was introduced by Glowinski and Marroco [1975] and Gabay and Mercier [1976] (see Boyd et al. [2011] for further review). The ADMM algorithm has the advantage that the per-iteration cost is often much lower than that of the interior point algorithm, which makes it an attractive choice when solutions of medium accuracy are sufficient, such as in parameter estimation problems.

The ADMM algorithm proceeds by first introducing new equality constraints to decouple the two terms in the objective function (3.4). Specifically, by introducing the equality constraints

$$\zeta = \Phi\beta \quad \text{and} \quad \eta = \beta, \qquad (3.5)$$

the original problem (3.4) is equivalent to

$$\min_{\zeta, \eta, \beta}\ l_W(\zeta) + \sum_{l=1}^{p}\lambda_l|\eta_l| \quad \text{subject to } \Phi\beta = \zeta \text{ and } \beta = \eta,$$

where $\zeta \in \mathbb{R}^M$ and $\eta = (\eta_1, \eta_2, \ldots, \eta_p)^T \in \mathbb{R}^p$. We use the second equality constraint $\beta = \eta$ to avoid the need to solve a lasso problem at each iteration of the ADMM in the cases where the basis functions are not orthogonal (see Section 3.7 for details).

Following Boyd et al. [2011], the augmented Lagrangian can be written as

$$l_A(\beta, \zeta, \eta, u_1, u_2) = l_W(\zeta) + \sum_{l=1}^{p}\lambda_l|\eta_l| + \frac{\rho}{2}\left\|\Phi\beta - \zeta + u_1\right\|_2^2 + \frac{\rho}{2}\left\|\beta - \eta + u_2\right\|_2^2,$$

where $\rho > 0$ is the penalty parameter. Note that convergence is guaranteed for any positive $\rho$, although a carefully chosen $\rho$ that balances the convergence of the primal and dual residuals often leads to faster convergence.

The ADMM algorithm finds the minimum of the above augmented Lagrangian by alternately updating the primal variables $(\zeta, \eta, \beta)$ and the associated dual variables $(u_1, u_2)$. The (n + 1)-th step of the algorithm updates

$$\begin{aligned}
\beta^{(n+1)} &= \arg\min_{\beta}\ l_A(\beta, \zeta^{(n)}, \eta^{(n)}, u_1^{(n)}, u_2^{(n)});\\
\zeta^{(n+1)} &= \arg\min_{\zeta}\ l_A(\beta^{(n+1)}, \zeta, \eta^{(n)}, u_1^{(n)}, u_2^{(n)});\\
\eta^{(n+1)} &= \arg\min_{\eta}\ l_A(\beta^{(n+1)}, \zeta^{(n+1)}, \eta, u_1^{(n)}, u_2^{(n)});\\
u_1^{(n+1)} &= u_1^{(n)} + (\Phi\beta^{(n+1)} - \zeta^{(n+1)});\\
u_2^{(n+1)} &= u_2^{(n)} + (\beta^{(n+1)} - \eta^{(n+1)}),
\end{aligned}$$

which can be solved as

$$\begin{aligned}
\beta^{(n+1)} &= (\Phi^T\Phi + I_p)^{-1}\left\{\Phi^T(\zeta^{(n)} - u_1^{(n)}) + \eta^{(n)} - u_2^{(n)}\right\};\\
\zeta_j^{(n+1)} &= \arg\min_{\zeta_j}\ \zeta_j + \hat{S}^{(mt)}(f_j)\exp(-\zeta_j) + \frac{\rho}{2}\left\{\phi^T(f_j)\beta^{(n+1)} - \zeta_j + u_{1j}^{(n)}\right\}^2, \quad j = 1, \ldots, M;\\
\eta_l^{(n+1)} &= \mathrm{ST}\left(\beta_l^{(n+1)} + u_{2l}^{(n)},\ \frac{\lambda_l}{\rho}\right), \quad l = 1, \ldots, p;\\
u_1^{(n+1)} &= u_1^{(n)} + \Phi\beta^{(n+1)} - \zeta^{(n+1)};\\
u_2^{(n+1)} &= u_2^{(n)} + \beta^{(n+1)} - \eta^{(n+1)}.
\end{aligned}$$

In this algorithm, $u_{1j}$ and $u_{2l}$ denote respectively the jth component of $u_1$ and the lth component of $u_2$, $\rho > 0$ is a positive penalty parameter, and recall from (2.20) that $\mathrm{ST}(x, a) = \mathrm{Sign}(x)\max(|x| - a, 0)$ is the soft-thresholding function with threshold $a \ge 0$. To obtain $\zeta_j^{(n+1)}$, we take the partial derivative of $l_A(\beta^{(n+1)}, \zeta, \eta^{(n)}, u_1^{(n)}, u_2^{(n)})$ with respect to $\zeta_j$ and set it to zero, which leads to the score equation

$$\rho\zeta_j - \hat{S}^{(mt)}(f_j)e^{-\zeta_j} + 1 - \rho\left\{\phi^T(f_j)\beta^{(n+1)} + u_{1,j}^{(n)}\right\} = 0,$$

which can be solved efficiently using the bisection method. The solution ensures a global minimum, since the second derivative of $l_A(\beta^{(n+1)}, \zeta, \eta^{(n)}, u_1^{(n)}, u_2^{(n)})$ with respect to $\zeta_j$ is $\rho + \hat{S}^{(mt)}(f_j)e^{-\zeta_j}$, which is positive for all $\zeta_j \in \mathbb{R}$ for any $\rho > 0$.

For any $\rho > 0$, the iterates $\beta^{(n)}$ have been shown to converge to the global solution of the original optimization problem (3.4) under some mild conditions on the objective function [Boyd et al., 2011].
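The updates above can be sketched in code. The following Python implementation is a minimal illustration of ours, not the dissertation's software: the function names, the choice $\rho = 1$, the fixed iteration count, and the flat-spectrum test data are all assumptions, and the formal stopping criteria are omitted for brevity.

```python
import numpy as np

def soft_threshold(x, a):
    # ST(x, a) = Sign(x) * max(|x| - a, 0), applied elementwise
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def solve_zeta(s_mt, c, rho, iters=60):
    # Bisection on the score equation rho*zeta - s_mt*exp(-zeta) + 1 - rho*c = 0,
    # where c = phi^T(f_j) beta^(n+1) + u_{1j}^(n); the left side is increasing in zeta.
    lo = np.full_like(c, -30.0)
    hi = np.full_like(c, 30.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g = rho * mid - s_mt * np.exp(-mid) + 1.0 - rho * c
        lo = np.where(g < 0.0, mid, lo)
        hi = np.where(g >= 0.0, mid, hi)
    return 0.5 * (lo + hi)

def admm_mt_whittle(Phi, s_mt, lam, rho=1.0, n_iter=500):
    # Minimize l_W(Phi beta) + lam * sum_{l >= 2} |beta_l|; column 0 is an unpenalized intercept.
    M, p = Phi.shape
    lam_vec = np.full(p, float(lam))
    lam_vec[0] = 0.0
    beta, zeta, eta = np.zeros(p), np.zeros(M), np.zeros(p)
    u1, u2 = np.zeros(M), np.zeros(p)
    L = np.linalg.cholesky(Phi.T @ Phi + np.eye(p))   # factor (Phi^T Phi + I_p) once
    for _ in range(n_iter):
        rhs = Phi.T @ (zeta - u1) + eta - u2
        beta = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        zeta = solve_zeta(s_mt, Phi @ beta + u1, rho)
        eta = soft_threshold(beta + u2, lam_vec / rho)
        u1 = u1 + Phi @ beta - zeta
        u2 = u2 + beta - eta
    return eta   # eta carries the exact zeros produced by soft-thresholding

# Flat-spectrum check: S(f) = 2, so only the intercept should survive the penalty.
rng = np.random.default_rng(0)
M, K = 64, 10
f = np.arange(1, M + 1) / (2.0 * (M + 1))
Phi = np.column_stack([np.ones(M),
                       np.sqrt(2.0) * np.cos(2.0 * np.pi * np.outer(f, np.arange(1, 8)))])
s_mt = 2.0 * rng.gamma(K, 1.0 / K, size=M)
beta_hat = admm_mt_whittle(Phi, s_mt, lam=1.0)
print(np.exp(beta_hat[0]), np.count_nonzero(beta_hat[1:]))
```

In this toy example the fitted intercept recovers the flat spectrum level (so $\exp(\hat{\beta}_1) \approx 2$) while the penalized cosine coefficients are shrunk to exact zeros.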

Following the optimality conditions and stopping criteria in Boyd et al. [2011], we define the primal residual

$$s_{\mathrm{pri}}^{(n+1)} = \begin{pmatrix}\Phi\\ I\end{pmatrix}\beta^{(n+1)} - \begin{pmatrix}\zeta^{(n+1)}\\ \eta^{(n+1)}\end{pmatrix} = \begin{pmatrix}\Phi\beta^{(n+1)} - \zeta^{(n+1)}\\ \beta^{(n+1)} - \eta^{(n+1)}\end{pmatrix},$$

and the dual residual

$$s_{\mathrm{dual}}^{(n+1)} = \rho\begin{pmatrix}\Phi\\ I\end{pmatrix}^T(-I)\begin{pmatrix}\zeta^{(n+1)} - \zeta^{(n)}\\ \eta^{(n+1)} - \eta^{(n)}\end{pmatrix} = -\rho\left[\Phi^T(\zeta^{(n+1)} - \zeta^{(n)}) + (\eta^{(n+1)} - \eta^{(n)})\right].$$

We terminate the algorithm when both residuals are smaller than some prespecified precisions; more specifically, let

$$\epsilon^{\mathrm{pri}} = \sqrt{\left\lceil\frac{N}{2}\right\rceil - 1 + p}\ \epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\max\left\{\left\|\Phi\beta^{(n)}\right\|_2 + \left\|\beta^{(n)}\right\|_2,\ \left\|\zeta^{(n)}\right\|_2 + \left\|\eta^{(n)}\right\|_2\right\},$$
$$\epsilon^{\mathrm{dual}} = \sqrt{p}\ \epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\,\rho\left\|\Phi^T u_1^{(n)} + u_2^{(n)}\right\|_2,$$

where $\epsilon^{\mathrm{abs}} > 0$ is an absolute tolerance and $\epsilon^{\mathrm{rel}} > 0$ is a relative tolerance. The stopping criterion is then set to be $\|s_{\mathrm{pri}}^{(n)}\|_2 \le \epsilon^{\mathrm{pri}}$ and $\|s_{\mathrm{dual}}^{(n)}\|_2 \le \epsilon^{\mathrm{dual}}$. In our numerical studies, setting both tolerances to $10^{-4}$ provided a reasonable balance between fast convergence of the optimization and satisfactory estimation accuracy.

Note: one can also check the convergence of the ADMM algorithm with the Karush-Kuhn-Tucker (KKT) optimality conditions. Based on the discussion in Tibshirani [2013], the KKT conditions can be extended to the minimization of general convex loss functions with L1 penalties. Adapting to our optimization problem (3.4), the KKT conditions are

$$\begin{cases}\left|\Phi_l^T\nabla l_W(\Phi\hat{\beta})\right| \le \lambda_l, & \text{for } l = 1, \ldots, p \text{ with } \hat{\beta}_l = 0;\\[4pt] \Phi_l^T\nabla l_W(\Phi\hat{\beta}) + \lambda_l\cdot\mathrm{Sign}(\hat{\beta}_l) = 0, & \text{for } l = 1, \ldots, p \text{ with } \hat{\beta}_l \ne 0,\end{cases} \qquad (3.6)$$

where $\Phi_l$ is the lth column of the matrix $\Phi$, and the gradient $\nabla l_W(\Phi\beta)$ is an M-vector with jth element $1 - \hat{S}^{(mt)}(f_j)/\exp\{\phi^T(f_j)\beta\}$ for $j = 1, \ldots, M$. The necessary conditions (3.6) for optimality can easily be verified for the application of the ADMM algorithm to the examples shown in the simulations presented in Section 3.5.
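A numerical check of (3.6) can be coded directly. The helper below is our own illustration (hypothetical names); it returns the largest violation of the KKT conditions at a candidate solution, verified here on a one-parameter case whose exact optimizer is known in closed form:

```python
import numpy as np

def kkt_violation(Phi, s_mt, beta_hat, lam_vec):
    # Gradient of l_W at Phi @ beta_hat: jth element 1 - S_mt(f_j) / exp(phi^T(f_j) beta)
    grad = Phi.T @ (1.0 - s_mt / np.exp(Phi @ beta_hat))
    active = beta_hat != 0
    viol_zero = np.maximum(np.abs(grad[~active]) - lam_vec[~active], 0.0)
    viol_active = np.abs(grad[active] + lam_vec[active] * np.sign(beta_hat[active]))
    return max(viol_zero.max(initial=0.0), viol_active.max(initial=0.0))

# With only an unpenalized intercept, the exact optimizer is beta_0 = log(mean(s_mt)),
# at which the gradient, and hence the KKT violation, vanishes.
rng = np.random.default_rng(7)
s_mt = rng.gamma(10, 0.2, size=50)
Phi = np.ones((50, 1))
beta_hat = np.array([np.log(s_mt.mean())])
print(kkt_violation(Phi, s_mt, beta_hat, np.zeros(1)))  # essentially zero
```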

A close inspection of the above ADMM updating scheme shows that it is ideally suited to our optimization problem (3.4), because all the subproblems can be solved efficiently with any basis expansion of log S(f), especially when wavelet basis functions are used. First, note that the $\beta$-update is a linear system for general bases and can be solved in linear time when using wavelets, because $\Phi^T\Phi$ is a diagonal matrix for a wavelet basis, and all matrix-vector multiplications involving $\Phi$ or $\Phi^T$ can be carried out in linear time using the cascade algorithm [see, e.g., Percival and Walden, 2000, Section 4]. Second, the $\zeta$-update decomposes into M independent univariate optimization problems, each of which can be solved efficiently using a bisection method (recall that M is the number of Fourier frequencies used to formulate the MT-Whittle likelihood). For practical purposes, it is often assumed that the number of bisection iterations is a constant. Consequently, the per-iteration cost of the proposed ADMM algorithm is O(M) if wavelet basis functions are used. For a general basis, we perform one Cholesky factorization of $\Phi^T\Phi + I_p$ upfront for the $\beta$-update, which is $O(M^3)$. Given the Cholesky factorization, all subsequent updates have complexity no greater than $O(M^2)$. Therefore, the per-iteration cost when using a general basis is $O(M^2)$. Further acceleration is possible using parallel computing techniques.

With regard to the rate of convergence, the total number of ADMM iterations required to reach an $\epsilon$-optimal solution is $O(\epsilon^{-1})$ [He and Yuan, 2015]. Consequently, the overall computational complexity for the ADMM algorithm to reach an $\epsilon$-optimal solution is $O(\epsilon^{-1}M)$ for a wavelet basis, and $O(\epsilon^{-1}M^2 + M^3)$ for a general basis.

3.3 Tuning parameter selection

The goodness of fit of the model is determined by the selection of the tuning parameters $\lambda_1, \lambda_2, \ldots, \lambda_p$ in (3.4). We assume a common tuning parameter $\lambda$ for the non-intercept bases and $\lambda_1 = 0$ for the intercept $\beta_1$. Traditional methods of selection are based on a Gaussian likelihood assumption. Extending to the non-Gaussian case, Sardy et al. [2004] deduced the rules that a tuning parameter should satisfy for a concave and differentiable non-Gaussian log-likelihood, but only derived an explicit solution for the Poisson case when the interior point algorithm is used. No explicit solution is given for our case. We develop universal threshold and generalized information criterion (GIC) [Fan and Tang, 2013] approaches for tuning parameter selection with the MT-Whittle likelihood.

3.3.1 Scale-calibrated universal threshold

Although the universal threshold was originally introduced for wavelet thresholding by Donoho and Johnstone [1994], in the same paper they state that their "results apply equally well in orthogonal regression". We extend the idea of a universal threshold to our penalized MT-Whittle likelihood problem.

Starting with the unpenalized MT-Whittle likelihood with an orthonormal basis, under the assumptions of Proposition 4 we show in the following that the maximum quasi-likelihood estimator $\hat{\beta}^W$ has the asymptotic distribution $\hat{\beta}^W \sim N(\beta, I_p/K)$.

The derivation relies on writing our asymptotic distribution for the MT spectral estimators as a generalized linear model (GLM), as given in McCullagh and Nelder [1999]. Based on Proposition 3, the asymptotic density of $Z_j = \hat{S}^{(mt)}(f_j)$ evaluated at $z_j$ in exponential family form is

$$p(z_j) = \exp\left\{\frac{z_j\vartheta - b(\vartheta)}{a(\varsigma)} + c(z_j, \varsigma)\right\}, \qquad (3.7)$$

with $\vartheta = -1/\mu_j = -1/S(f_j)$, $a(\varsigma) = 1/\nu = 1/K$, $b(\vartheta) = -\log(-\vartheta) = \log S(f_j)$, and $c(z_j, \varsigma) = K\log(Kz_j) - \log z_j - \log\Gamma(K)$. Then, as $\mathrm{Var}(Z_j) = b''(\vartheta)a(\varsigma) = V(\mu)a(\varsigma) = \mu^2/\nu$, the variance function is $V(\mu) = \mu^2$. Assuming asymptotic independence over frequencies, the Fisher information for the GLM can be computed as

$$I(\beta) = \Phi^T W \Phi,$$

where $W = \mathrm{diag}(w_1, \ldots, w_M)$ with

$$w_j = \frac{1}{V(\mu_j)\,a(\varsigma)\,\{g'(\mu_j)\}^2} = \frac{1}{\mu_j^2\,(1/K)\,(1/\mu_j^2)} = K.$$

Since our model assumes the link function $g(\mu) = \log\mu$, we have that $g'(\mu) = 1/\mu$. Thus, with an orthonormal basis,

$$I(\beta) = K\,\Phi^T I_M \Phi = K\,I_p,$$

and under suitable regularity conditions (see, e.g., McCullagh and Nelder [1999, Appendix A]), as $M \to \infty$,

$$\hat{\beta}^W \to_d N_p(\beta, I(\beta)^{-1}),$$

with $\mathrm{Cov}(\hat{\beta}^W) = I(\beta)^{-1} = I_p/K$.

Next, we use a similar argument to the universal threshold derivation of Donoho and Johnstone [1994]. Let $\xi^W = \hat{\beta}^W - \beta$ denote the noise component of the unpenalized MT-Whittle estimate. Since the components $\xi_l^W$ of $\xi^W$ are independent N(0, 1/K) RVs,

$$P\left(\max_l |\xi_l^W| > \sqrt{1/K}\,\sqrt{2\log p}\right) \to 0, \quad \text{as } p \to \infty.$$

Consulting the ADMM algorithm given in Section 3.2, we note that the tuning parameter only participates in the soft-thresholding step of the algorithm. Choosing the scale-calibrated universal threshold $\lambda^{\mathrm{univ}} = \sqrt{1/K}\,\sqrt{2\log p}$ as the tuning parameter $\lambda$, the noise components of $\beta$ will be shrunk to zero with high probability, leaving only those components that represent the true underlying SDF. This mimics the universal threshold of Donoho and Johnstone [1994], but varies in two distinct ways: (i) we need no estimate of the noise variance: the variance (scale) is determined by the number of tapers K in the MT spectral estimator; as K increases this variance decreases; (ii) our choice of $\lambda$ depends on the initial number of basis functions p, and not the sample size M as in Donoho and Johnstone [1994].

3.3.2 Generalized information criterion

For the L1 penalized MT-Whittle likelihood problem (3.4), the generalized information criterion (GIC) finds

$$\hat{\lambda} = \arg\min_{\lambda}\left\{2K\,l_W(\Phi\beta_\lambda) + c_M\,|p_\lambda|\right\}, \qquad (3.8)$$

where $\beta_\lambda$ is the optimizer of the L1 penalized MT-Whittle likelihood with tuning parameter $\lambda$. In the penalty term for model complexity, $|p_\lambda|$ denotes the number of non-zero elements in $\beta_\lambda$, and $c_M$ is the penalty parameter.

The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are two special cases of GIC, with $c_M = 2$ and $c_M = \log M$ respectively. According to Fan and Tang [2013], AIC has similar performance to cross-validation and typically overfits the statistical model, while BIC consistently selects the true model only when the dimension of the predictor space is fixed.

When the number of predictors, p, increases exponentially as the sample size M increases (i.e., $\log p = O(M^\kappa)$ for some $\kappa > 0$), Fan and Tang [2013] suggest the choice of penalty parameter $c_M = (\log\log M)(\log p)$.
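As an illustration (the likelihood values and nonzero counts below are hypothetical), the GIC of (3.8) with the suggested $c_M$ can be computed and minimized over a grid of candidate fits:

```python
import numpy as np

def gic(l_w_value, df, K, M, p):
    # GIC = 2*K*l_W(Phi beta_lambda) + c_M * |p_lambda|, with the Fan and Tang (2013)
    # penalty c_M = (log log M)(log p); c_M = 2 gives AIC and c_M = log(M) gives BIC.
    c_M = np.log(np.log(M)) * np.log(p)
    return 2.0 * K * l_w_value + c_M * df

# Hypothetical (l_W value, nonzero count) pairs from fits on a lambda grid
fits = [(105.0, 40), (106.0, 12), (110.0, 4), (130.0, 1)]
M, p, K = 1024, 512, 10
scores = [gic(lw, df, K, M, p) for lw, df in fits]
best = int(np.argmin(scores))
print(best, scores)   # GIC trades off fit against sparsity
```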

3.4 Theory

We now derive the rate of convergence for the proposed L1 penalized MT-Whittle likelihood estimator. This allows us to study the consistency of spectrum estimation.

The main result is based on an extension of the arguments used to derive the so-called fast rate for L1 penalized methods (see, e.g., van de Geer [2008], van de Geer and Bühlmann [2009]). There are two main challenges for our estimator. First, typical theoretical results for the L1 penalized problem assume independence among samples, whereas the MT-Whittle likelihood involves sums of dependent RVs. Moreover, to the best of our knowledge, all existing theory for L1 penalized generalized linear models assumes that the canonical link is used (see, e.g., van de Geer [2008]). In contrast, we deal with a situation where the link function is not the canonical link, which makes the log-partition function dependent on random quantities. These features make our theoretical analysis more challenging, and the technical conditions more complicated.

For each $j = 1, \ldots, M$, let

$$R_j = \hat{S}^{(mt)}(f_j)/S(f_j), \qquad (3.9)$$

where, by Proposition 3, $R_j - 1$ asymptotically follows the distribution of $\chi^2_{2K}/2K - 1$ RVs. We further impose the following assumptions.

Assumption (A1). Assume that

$$\log S(f_j) = \langle\phi(f_j), \beta^0\rangle, \quad j = 1, \ldots, M, \qquad (3.10)$$

for some sparse vector $\beta^0 \in \mathbb{R}^p$ and basis functions $\phi(f_j)$ satisfying $\|\phi(f)\|_\infty \le B$ for $f \in [-1/2, 1/2]$ and some constant B, and $\|\Phi_l\|_2 = \sqrt{\sum_{j=1}^{M}\phi_l^2(f_j)} = \sqrt{M}$ for $l = 1, \ldots, p$.

Assumption (A2) (Compatibility condition). Let $S = \{l : \beta_l^0 \ne 0\}$ and $s_0 = |S|$. Assume that for any $v \in \mathbb{R}^p$ with $\|v_{S^c}\|_1 \le 3\|v_S\|_1$,

$$\frac{1}{M}\sum_{j=1}^{M} R_j\left\{\exp(v^T\phi(f_j)) - v^T\phi(f_j) - 1\right\} \ge \min\left\{\frac{c_0}{s_0}\|v_S\|_1^2,\ c_1 M^{\gamma - \frac{1}{2}}\|v_S\|_1\right\}, \qquad (3.11)$$

with probability tending to 1 as $M \to \infty$, for some constants $c_0 > 0$, $c_1 > 0$, $0 < \gamma \le \frac{1}{2}$, and where $R_j$ is defined by (3.9).

Assumption (A1) is the typical sparsity condition imposed for the theoretical analysis of penalized procedures using a sparsity-inducing penalty, while Assumption (A2) is similar to the so-called compatibility condition required for one type of theoretical analysis of the lasso estimator [van de Geer and Bühlmann, 2009]. Note that since we are dealing with non-canonical link functions, the compatibility condition does involve random quantities, which makes it more difficult to verify in practice.

Theorem 3. Under Assumptions (A1) and (A2), and on the event that

$$\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty < \frac{c_1}{3}M^{\gamma + \frac{1}{2}}, \qquad (3.12)$$

we have that

$$\left\|\hat{\beta}(\lambda) - \beta^0\right\|_1 \le \frac{3\lambda s_0}{2c_0 M} \qquad (3.13)$$

for any $\lambda$ satisfying

$$2\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty \le \lambda < \frac{2}{3}c_1 M^{\gamma + \frac{1}{2}}.$$

In particular, when $\lambda = 2\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty$, we have, on the event (3.12), that

$$\left\|\hat{\beta}(\lambda) - \beta^0\right\|_1 \le \frac{3s_0}{c_0 M}\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty, \qquad (3.14)$$

and

$$\sup_{f\in[-\frac{1}{2},\frac{1}{2}]}\left|\log\hat{S}(f) - \log S(f)\right| \le \frac{3Bs_0}{c_0 M}\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty, \qquad (3.15)$$

where $\log\hat{S}(f)$ is the L1 penalized MT-Whittle estimator of the log SDF.

Proof. For simplicity, we omit the dependence of $\hat{\beta}(\lambda)$ on $\lambda$. Then

$$\hat{\beta} \in \arg\min_{\beta}\ \sum_{j=1}^{M}\left\{(\beta - \beta^0)^T\phi(f_j) + R_j\exp\left(-(\beta - \beta^0)^T\phi(f_j)\right)\right\} + \lambda\|\beta\|_1,$$

since $\log S(f_j) = \phi^T(f_j)\beta^0$ for each j. Hence,

$$\sum_{j=1}^{M}\left\{(\hat{\beta} - \beta^0)^T\phi(f_j) + R_j\exp\left(-(\hat{\beta} - \beta^0)^T\phi(f_j)\right)\right\} + \lambda\|\hat{\beta}\|_1 \le \sum_{j=1}^{M}R_j + \lambda\|\beta^0\|_1. \qquad (3.16)$$

Let $\hat{\delta} = \hat{\beta} - \beta^0$. Rearranging terms in (3.16), we obtain that

$$\sum_{j=1}^{M}R_j\left\{\exp(-\hat{\delta}^T\phi(f_j)) + \hat{\delta}^T\phi(f_j) - 1\right\} \le \hat{\delta}^T\left\{\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\} + \lambda\|\beta^0\|_1 - \lambda\|\hat{\beta}\|_1,$$

which, together with the fact that

$$\hat{\delta}^T\left\{\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\} \le \|\hat{\delta}\|_1\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty \le \frac{\lambda}{2}\|\hat{\delta}\|_1,$$

implies that

$$\begin{aligned}
\sum_{j=1}^{M}R_j\left\{\exp(-\hat{\delta}^T\phi(f_j)) + \hat{\delta}^T\phi(f_j) - 1\right\}
&\le \frac{\lambda}{2}\left(\|\hat{\delta}\|_1 + 2\|\beta^0\|_1 - 2\|\hat{\beta}\|_1\right)\\
&= \frac{\lambda}{2}\left(\|\hat{\beta}_S - \beta^0_S\|_1 + \|\hat{\beta}_{S^c}\|_1 + 2\|\beta^0_S\|_1 - 2\|\hat{\beta}_S\|_1 - 2\|\hat{\beta}_{S^c}\|_1\right)\\
&\le \frac{\lambda}{2}\left(3\|\hat{\beta}_S - \beta^0_S\|_1 - \|\hat{\beta}_{S^c}\|_1\right) = \frac{\lambda}{2}\left(3\|\hat{\delta}_S\|_1 - \|\hat{\delta}_{S^c}\|_1\right).
\end{aligned}$$

Note that the LHS is nonnegative, because $e^x \ge x + 1$ for any $x \in \mathbb{R}$. It follows that $\|\hat{\delta}_{S^c}\|_1 \le 3\|\hat{\delta}_S\|_1$. By Assumption (A2), we have that

$$M\min\left\{\frac{c_0}{s_0}\|\hat{\delta}_S\|_1^2,\ c_1 M^{\gamma - \frac{1}{2}}\|\hat{\delta}_S\|_1\right\} \le \frac{\lambda}{2}\left(3\|\hat{\delta}_S\|_1 - \|\hat{\delta}_{S^c}\|_1\right) \le \frac{3\lambda}{2}\|\hat{\delta}_S\|_1.$$

Hence, on the event that $\frac{2}{3}c_1 M^{\gamma + 1/2} > \lambda \ge 2\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty$, we have

$$\left\|\hat{\beta}(\lambda) - \beta^0\right\|_1 \le \frac{3\lambda s_0}{2c_0 M}. \qquad (3.17)$$

Letting $\lambda = 2\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty$, we have, on the event

$$\left\{2\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty < \frac{2}{3}c_1 M^{\frac{1}{2} + \gamma}\right\},$$

that

$$\|\hat{\delta}\|_1 \le \frac{3\lambda s_0}{2c_0 M} = \frac{3s_0}{c_0 M}\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty,$$

and

$$\sup_{f\in[-\frac{1}{2},\frac{1}{2}]}\left|\log\hat{S}(f) - \log S(f)\right| \le \sup_{f\in[-\frac{1}{2},\frac{1}{2}]}\left|(\hat{\beta} - \beta^0)^T\phi(f)\right| \le \|\hat{\beta} - \beta^0\|_1\,\|\phi(f)\|_\infty \le \frac{3Bs_0}{c_0 M}\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty.$$

This completes the proof.

Some remarks are in order. First, unlike existing theoretical analyses for L1 penalized methods, we do not impose any independence assumptions on the "sample" $\{\hat{S}^{(mt)}(f_j) : j = 1, \ldots, M\}$. In this sense, the above result is a deterministic result that holds under the event (3.12). To verify that the event (3.12) happens with probability tending to 1, we conduct a simulation study in Section 3.5.1 and show for orthogonal bases that

$$\frac{1}{\sqrt{M}}\left\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\right\|_\infty \le O_p(\log(M)). \qquad (3.18)$$

In view of this, condition (3.12) holds asymptotically for any $\gamma > 0$. This leads to the following rates of convergence for $\hat{\beta}$ and the log SDF:

$$\left\|\hat{\beta}(\lambda) - \beta^0\right\|_1 = O_p\left(s_0\frac{\log(M)}{\sqrt{M}}\right), \qquad \sup_{f\in[-\frac{1}{2},\frac{1}{2}]}\left|\log\hat{S}(f) - \log S(f)\right| = O_p\left(s_0\frac{\log(M)}{\sqrt{M}}\right).$$

Note that if we further assume that $s_0$ is quite small, that is, the true log SDF has a sparse basis representation, a parametric rate $M^{-1/2}$ can be achieved (up to a log factor). This is in contrast to the slower nonparametric rate for typical one-dimensional smoothing or nonparametric regression problems (see, e.g., Tsybakov [2009]). In summary, our theory suggests that by exploring sparsity, if it is indeed present in the signal, a significant improvement in estimation efficiency can be achieved using the proposed method.

3.5 Simulations

We use simulations to evaluate our L1 penalized MT-Whittle method in comparison to the commonly used L1 penalized LS method. We also investigate the effect of selecting the tuning parameter and assess our theoretical rate presented after the statement of Theorem 3. We use the following processes:

1. AR(2) process: $X_t = \varphi_{1,1}X_{t-1} + \varphi_{1,2}X_{t-2} + \varepsilon_t$ with $\varphi_{1,1} = 0.97\sqrt{2}$, $\varphi_{1,2} = -0.97^2$;

2. AR(4) process: $X_t = \varphi_{2,1}X_{t-1} + \varphi_{2,2}X_{t-2} + \varphi_{2,3}X_{t-3} + \varphi_{2,4}X_{t-4} + \varepsilon_t$ with $\varphi_{2,1} = 2.7607$, $\varphi_{2,2} = -3.8106$, $\varphi_{2,3} = 2.6535$, $\varphi_{2,4} = -0.9238$;

3. High-order MA process: $X_t = \sum_{l=0}^{15000}\theta_l\varepsilon_{t-l}$ with $\theta_0 = 1$, $\theta_1 = \pi/4$, and $\theta_l = \sin(\pi(l-1)/2)/(l-1)$ for $l = 2, 3, \ldots, 15000$.
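For reference, AR spectra and realizations of the kind used above can be generated as follows; this is an illustrative sketch of ours (the burn-in length is an assumption), with a check that the AR(2) spectral peak sits near f = 1/8 as the chosen coefficients imply:

```python
import numpy as np

def ar_sdf(phi, f, sigma2=1.0):
    # SDF of an AR(p) process: S(f) = sigma^2 / |1 - sum_k phi_k e^{-i 2 pi f k}|^2
    k = np.arange(1, len(phi) + 1)
    transfer = 1.0 - np.exp(-1j * 2.0 * np.pi * np.outer(f, k)) @ np.asarray(phi)
    return sigma2 / np.abs(transfer) ** 2

def simulate_ar(phi, N, burn=1000, rng=None):
    # Gaussian AR(p) realization, discarding a burn-in to approach stationarity
    rng = rng or np.random.default_rng()
    phi = np.asarray(phi)
    p = len(phi)
    x = np.zeros(N + burn + p)
    eps = rng.standard_normal(N + burn + p)
    for t in range(p, len(x)):
        x[t] = phi @ x[t - p:t][::-1] + eps[t]
    return x[-N:]

phi_ar2 = [0.97 * np.sqrt(2.0), -0.97 ** 2]   # case 1 coefficients
f = np.arange(1, 1024) / 2048.0               # non-zero, non-Nyquist Fourier frequencies
S = ar_sdf(phi_ar2, f)
x = simulate_ar(phi_ar2, N=2048, rng=np.random.default_rng(10))
print(f[np.argmax(S)])   # close to 1/8, since cos(2*pi*f0) = sqrt(2)/2 at the pole angle
```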

Plots of the true SDFs are shown in Figure 3.1 for each process. These processes have SDFs that exhibit a range of local structures that can be hard to estimate using simple estimators of the SDF. We demonstrate how our estimation method performs when the innovations $\{\varepsilon_t\}$ are N(0, 1) RVs, but we also present the same AR(2) process of case 1 with innovations generated by a shifted exponential distribution with mean 0 and variance 1; we want to investigate how robust our method is to departures from Gaussianity.

[Figure 3.1: Plots of the SDF for the three processes (AR(2), AR(4), and MA) on the decibel scale (dB); frequency on the horizontal axis, SDF (dB) on the vertical axis.]

We use the decibel-scale integrated root mean squared error (IRMSE) to measure how well we estimate the true SDF. For the $M = \lceil N/2 \rceil - 1$ non-zero, non-Nyquist frequencies, letting $\tilde{S}(f_j)$ denote any estimate of the SDF at Fourier frequency $f_j$, the IRMSE is

$$\left\{\frac{1}{M}\sum_{j=1}^{M}\left[10\log_{10}\tilde{S}(f_j) - 10\log_{10}S(f_j)\right]^2\right\}^{1/2}. \qquad (3.19)$$

We present the IRMSE averaged over 1000 realizations of each process.
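Equation (3.19) translates directly into code; the helper below is an illustrative sketch of ours. A perfect estimate gives 0 dB, and a constant factor-of-10 error gives exactly 10 dB:

```python
import numpy as np

def irmse_db(S_est, S_true):
    # Decibel-scale integrated RMSE of (3.19)
    d = 10.0 * np.log10(S_est) - 10.0 * np.log10(S_true)
    return float(np.sqrt(np.mean(d ** 2)))

S_true = np.array([1.0, 2.0, 4.0])
print(irmse_db(S_true, S_true))          # 0.0
print(irmse_db(10.0 * S_true, S_true))   # 10.0
```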

In preliminary simulations, we used a range of different basis function representations to model each SDF: orthogonal polynomial bases, Fourier bases, B-spline bases, wavelet bases, as well as some mixed dictionary bases. Following Walden et al. [1998], we fit our models to mirrored data of frequencies on [0, 1) to capture the evenness and periodic nature of the spectrum, but only summarize on the M frequencies between zero and the Nyquist frequency.

Table 3.1: Comparison of IRMSEs (with bootstrap standard errors in parentheses) when using different bases for estimating the SDF of the AR(2) process of length N = 2048, by the unpenalized least squares (LS) method and the unpenalized MT-Whittle method with number of tapers K = 10.

Number of    LA(8) wavelet basis               Fourier basis                     B-spline basis
bases p      LS              MT-Whittle        LS              MT-Whittle        LS              MT-Whittle
8            2.592 (0.0014)  2.977 (0.0102)    2.037 (0.0021)  2.268 (0.0092)    3.241 (0.0010)  4.152 (0.0170)
16           1.449 (0.0030)  1.548 (0.0062)    1.078 (0.0060)  1.105 (0.0071)    2.266 (0.0018)  2.588 (0.0103)
32           0.892 (0.0068)  0.903 (0.0073)    0.846 (0.0096)  0.842 (0.0097)    1.249 (0.0052)  1.306 (0.0071)
64           0.841 (0.0090)  0.837 (0.0090)    1.068 (0.0089)  1.067 (0.0093)    0.886 (0.0087)  0.884 (0.0089)
128          1.067 (0.0085)  1.066 (0.0089)    1.306 (0.0078)  1.308 (0.0084)    1.066 (0.0089)  1.065 (0.0092)
256          1.287 (0.0078)  1.288 (0.0085)    1.353 (0.0075)  1.360 (0.0082)    1.299 (0.0078)  1.301 (0.0084)

Based on the MT estimates, the IRMSEs of unpenalized models with the LA(8) wavelet basis, the Fourier basis, and the B-spline basis are presented in Table 3.1 as the number of bases increases in powers of 2, where LA(8) denotes the Daubechies least asymmetric wavelet of width 8 (see Daubechies [1992] for details). Results for the polynomial basis were not competitive and thus are not included here. For the simulated AR(2) process of length

N = 2048, the smallest IRMSEs in the preliminary study are bolded in each column of

Table 3.1. For the LA(8) wavelet basis, an IRMSE of 0.837 (with a bootstrap standard error (SE) of 0.009) is achieved by the MT-Whittle approach with p = 64; for the Fourier basis, an

IRMSE of 0.842 (bootstrap SE of 0.010) is given by the MT-Whittle method with p = 32; for the B-spline basis, an IRMSE of 0.884 (bootstrap SE of 0.009) is attained by the MT-Whittle method with p = 64. Thus, we saw that the LA(8) wavelet basis and the Fourier basis perform better than the B-spline basis in terms of the IRMSE for the spatially inhomogeneous SDF of the AR(2) process. With an optimal choice of p, the IRMSE achieved by the Fourier basis is slightly greater than that of the LA(8) wavelet basis, but it remains quite competitive for this process.

When the number of bases p differs from an optimal choice, the unpenalized MT-Whittle method does not always result in a smaller IRMSE than using unpenalized LS.
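For reference, the unpenalized LS baseline in Table 3.1 amounts to regressing the log MT estimates on the chosen basis. A minimal sketch (the helper name is ours, and we omit the mirrored-frequency setup described above):

```python
import numpy as np

def ls_log_sdf_fit(Phi, S_mt):
    """Unpenalized LS fit: regress log MT spectral estimates on the basis.

    Phi  : (M, p) basis matrix evaluated at the Fourier frequencies
    S_mt : (M,) multitaper spectral estimates
    Returns the fitted log SDF at the same frequencies.
    """
    beta, *_ = np.linalg.lstsq(Phi, np.log(S_mt), rcond=None)
    return Phi @ beta
```

The MT-Whittle fits replace this least squares criterion with the Whittle quasi-likelihood of Section 3.3.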

Besides the results in Table 3.1, a dictionary method mixing different types of bases was also examined. According to Wasserman [2006], a dictionary combines several bases of different types for a basis expansion, where the bases need not be orthogonal to each other, and the total number of bases is allowed to be greater than the number of observations (in which case the dictionary is called overcomplete). However, none of the combinations of basis functions we experimented with reduced the IRMSE. Additionally, results obtained for the AR(4) process and the high-order MA process were broadly similar to those for the AR(2) process, but the AR(4) process required more basis functions to achieve an optimal IRMSE, which is not surprising since the true SDF of the AR(4) process (see Figure 3.1) has more sharp features to be captured.

For the MA process, the difference in IRMSE between the LS and the MT-Whittle approaches was not as apparent as in the case of the AR(2) process. The sharp peak before a drastic drop in the MA spectrum is difficult to capture. Plots of the fitted SDF based on one simulation replicate and four types of bases are given in Figure 3.2: none of the bases captures the very sharp peak well when p = 32, as the initial number of basis functions is not large enough to capture the log spectrum.

We found that the wavelet basis based on the discrete wavelet transform with the LA(8) wavelet filter had good performance across all the experimental conditions. We employ the LA(8) wavelet basis from now on as a representative basis in the presentation of the L1 penalized MT-Whittle approach.

[Figure: four panels of SDF (dB) against frequency on (0, 0.5): (a) LA(8) wavelet basis, (b) Fourier basis, (c) B-spline basis, (d) polynomial basis; each shows the true SDF and the LS estimate.]

Figure 3.2: One replicate of spectrum estimation for the high-order MA process based on the unpenalized LS method with different basis types. For each basis type, the orange line is the LS estimate with p = 32 bases including the intercept. The black line shows the true SDF on the decibel scale (dB).

Figure 3.3 shows simulation results for the three processes (one for each panel) comparing our L1 penalized MT-Whittle method (blue circles) to an L1 penalized LS approach (orange triangles). The first three panels are for Gaussian processes; the last panel is for the non-Gaussian process. In each case, N = 2048, we use K = 10 sine tapers for the MT spectral estimate, and we start with the maximum number of basis functions, p = M + 1 = 1024, including the intercept. We also show the effect of changing the method used to select the tuning parameter. While we advocate choosing between the universal threshold (Universal) and the GIC methods, we also compared to cases in which we used cross-validation (CV) to select the tuning parameter, or in which we did not penalize with a tuning parameter (None). We

[Figure: four panels (AR(2), AR(4), MA, Non-Gaussian AR(2)) plotting the IRMSE attained by the tuning methods Universal, GIC, CV, and None, for penalized least squares (orange triangles) and penalized MT-Whittle (blue circles).]

Figure 3.3: A comparison of the IRMSEs for different L1 penalization methods for estimating the SDF. Here, the heights of the plotted symbols are larger than the widths of the 95% bootstrap confidence intervals for each IRMSE; the horizontal dashed lines indicate the best possible IRMSEs for each method.

also calculated the tuning parameter that corresponds to the smallest possible IRMSE; in practice this value of the IRMSE is not known, but it gives us a way to see how close our method is to the optimal value. In each panel the best IRMSEs for each method (least squares, orange; MT-Whittle, blue) are denoted by the horizontal dashed lines. In terms of the uncertainty, the heights of the plotted symbols are larger than the widths of the 95% bootstrap confidence intervals for each IRMSE. The exact numbers in Figure 3.3 are shown in Table 3.2.

Table 3.2: Simulation results with p = 1024 for the four time series of length N = 2048: comparing the decibel-scale bias, standard deviation (SD), and IRMSE of the L1 penalized LS approach and the L1 penalized MT-Whittle approach with LA(8) wavelet bases under different tuning criteria. Models are fitted to raw MT estimates with K = 10 sine tapers. Values in parentheses, denoted '(SE)', are the corresponding standard errors based on 1000 bootstrap resamples. Minimum quantities are bolded in each section, excluding those in the 'Best' rows (the best possible IRMSE).

AR(2) process:

Tuning        L1 penalized LS                                     L1 penalized MT-Whittle
parameter     Bias (SE)       SD (SE)         IRMSE (SE)          Bias (SE)        SD (SE)         IRMSE (SE)
GIC            0.109 (0.0043)  0.720 (0.0032)  0.741 (0.0031)      0.051 (0.0047)  0.692 (0.0033)  0.710 (0.0032)
Universal      0.118 (0.0043)  0.710 (0.0032)  0.733 (0.0031)      0.082 (0.0043)  0.673 (0.0031)  0.692 (0.0031)
Best           0.107 (0.0043)  0.702 (0.0031)  0.723 (0.0030)      0.073 (0.0041)  0.667 (0.0031)  0.683 (0.0030)
CV             0.008 (0.0043)  1.331 (0.0029)  1.338 (0.0029)     -0.204 (0.0043)  1.327 (0.0029)  1.349 (0.0030)
No penalty     0.005 (0.0043)  1.385 (0.0028)  1.391 (0.0028)     -0.209 (0.0043)  1.383 (0.0028)  1.404 (0.0030)

AR(4) process:

Tuning        L1 penalized LS                                     L1 penalized MT-Whittle
parameter     Bias (SE)       SD (SE)         IRMSE (SE)          Bias (SE)        SD (SE)         IRMSE (SE)
GIC            0.114 (0.0044)  0.893 (0.0033)  0.910 (0.0033)      0.054 (0.0045)  0.858 (0.0033)  0.871 (0.0033)
Universal      0.136 (0.0043)  0.926 (0.0035)  0.946 (0.0034)      0.113 (0.0043)  0.876 (0.0034)  0.894 (0.0034)
Best           0.104 (0.0043)  0.875 (0.0032)  0.892 (0.0032)      0.045 (0.0041)  0.842 (0.0032)  0.853 (0.0032)
CV             0.033 (0.0043)  1.343 (0.0028)  1.350 (0.0028)     -0.179 (0.0043)  1.338 (0.0028)  1.356 (0.0029)
No penalty     0.030 (0.0043)  1.393 (0.0027)  1.400 (0.0027)     -0.184 (0.0043)  1.392 (0.0027)  1.409 (0.0029)

MA process:

Tuning        L1 penalized LS                                     L1 penalized MT-Whittle
parameter     Bias (SE)       SD (SE)         IRMSE (SE)          Bias (SE)        SD (SE)         IRMSE (SE)
GIC            0.047 (0.0046)  0.839 (0.0031)  0.853 (0.0031)     -0.022 (0.0046)  0.838 (0.0029)  0.850 (0.0030)
Universal      0.053 (0.0046)  0.843 (0.0032)  0.856 (0.0031)      0.032 (0.0045)  0.853 (0.0032)  0.865 (0.0032)
Best           0.044 (0.0045)  0.822 (0.0030)  0.834 (0.0030)     -0.016 (0.0043)  0.821 (0.0030)  0.832 (0.0030)
CV             0.002 (0.0044)  1.348 (0.0027)  1.354 (0.0027)     -0.210 (0.0043)  1.343 (0.0027)  1.365 (0.0029)
No penalty     0.001 (0.0043)  1.398 (0.0027)  1.404 (0.0027)     -0.213 (0.0043)  1.396 (0.0027)  1.418 (0.0028)

Non-Gaussian AR(2) process:

Tuning        L1 penalized LS                                     L1 penalized MT-Whittle
parameter     Bias (SE)       SD (SE)         IRMSE (SE)          Bias (SE)        SD (SE)         IRMSE (SE)
GIC            0.106 (0.0092)  0.715 (0.0030)  0.776 (0.0036)      0.047 (0.0092)  0.689 (0.0030)  0.747 (0.0036)
Universal      0.115 (0.0091)  0.705 (0.0030)  0.768 (0.0037)      0.078 (0.0091)  0.669 (0.0029)  0.730 (0.0036)
Best           0.103 (0.0091)  0.697 (0.0030)  0.758 (0.0036)      0.069 (0.0086)  0.664 (0.0029)  0.718 (0.0035)
CV             0.005 (0.0091)  1.325 (0.0028)  1.355 (0.0030)     -0.207 (0.0091)  1.321 (0.0028)  1.366 (0.0032)
No penalty     0.002 (0.0091)  1.379 (0.0027)  1.408 (0.0029)     -0.212 (0.0091)  1.377 (0.0027)  1.421 (0.0032)

Some results are coherent across the three processes. Using cross-validation to select the tuning parameter does not perform well compared to the universal and GIC methods, although it performs significantly better than using no penalization. For the AR processes, the L1 penalized MT-Whittle outperforms the L1 penalized LS by between 4.4% and 6.0% in terms of the IRMSE. There was little difference between these two estimators for the MA process. By process, there were slight differences between the IRMSE values obtained using the universal threshold and the GIC to select the tuning parameter; however, both methods yielded IRMSEs that were competitive with the best possible value. Also, for the non-Gaussian AR(2) example, all methods performed similarly relative to one another. The bottom-right panel of Figure 3.3 shows that all penalized approaches have slightly increased IRMSE values when non-Gaussian noise is used, but that our method still outperforms the other approaches. This suggests that our L1 penalized MT-Whittle method is robust to non-Gaussianity.

To further summarize how well our method performs, Figure 3.4 compares, for each time series process, our L1 penalized MT-Whittle estimate to a raw MT spectral estimate. In each panel, we display the simulation of length N = 2048 that yields the median IRMSE. The blue line denotes the L1 penalized MT-Whittle estimate we obtain using the universal threshold with K = 10 data tapers. The gray line denotes our raw MT spectral estimate (again with K = 10) and the red line denotes the true SDF in each case. Although we slightly underestimate the spectral peaks using the L1 method, our method better captures the general spectral features of each process relative to the raw MT estimate. The underestimation of the peaks reduces as N increases: since our estimation method is semiparametric, as N increases, p increases, and we are better able to capture finer spectral features with more basis functions (see also Walden et al. [1998] for an illustration of this fact).

[Figure: four panels (AR(2), AR(4), MA, Non-Gaussian AR(2)) of spectral estimates in dB against frequency on (0, 0.5).]

Figure 3.4: A comparison of spectral estimates for the four different processes. For each process, the blue line is the L1 penalized MT-Whittle estimate, the gray line is the corresponding raw MT estimate with K = 10 tapers, and the red line is the true SDF.

We carried out an additional set of simulations that varied the number of sine tapers K used in the MT-Whittle calculation. Figures 3.5 and 3.6 show, for the four processes, an approximately quadratic relationship between K and the bias of our spectral estimator, and between K and the IRMSE. Note that the minimal IRMSE is always attained at a greater value of K than the minimal bias. We also learned that for very large values of K the IRMSE for the L1 penalized least squares method (orange curve) approached that of the L1 penalized MT-Whittle (blue curve); this is not surprising since, by Proposition 3, as K → ∞, the distribution of the log

[Figure: four panels (AR(2), AR(4), MA, Non-Gaussian AR(2)) of bias (dB) against K for the two methods.]

Figure 3.5: Plots of bias versus K on the decibel (dB) scale using the L1 penalized LS method (orange) and the L1 penalized MT-Whittle method (blue) with the universal threshold for the four processes. The vertical bars show 95% bootstrap confidence intervals for each K, and the dotted lines show the location of the minimum.

MT spectral estimate is better approximated by a normal distribution [Bartlett and Kendall, 1946]. Our selected value of K = 10 tapers provided a good balance between the bias and variance of the estimated SDF for these simulated processes. Also, the L1 penalized MT-Whittle outperformed the L1 penalized Whittle based on the untapered periodogram. See Section 4.4 for further discussion of the optimal number of tapers for inference of the SDF.
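The normal approximation cited above can be illustrated numerically: under the Gamma(K, µ/K) approximation of Proposition 3, the variance of the log MT estimate at a single frequency is the trigamma function ψ′(K), roughly 1/K, so the log estimate both concentrates and becomes more nearly Gaussian as K grows. A hedged Monte Carlo sketch (independent Gamma draws stand in for the MT estimates; correlation across frequencies is ignored):

```python
import numpy as np

rng = np.random.default_rng(0)

def var_log_mt(K, n=200_000):
    """Monte Carlo variance of log(S_hat/S) under the Gamma(K, 1/K) approximation."""
    return float(np.var(np.log(rng.gamma(shape=K, scale=1.0 / K, size=n))))

# Var(log S_hat) = psi'(K): approximately 0.221 at K = 5 and 0.0202 at K = 50.
```

The shrinking variance with K mirrors the bias-variance tradeoff seen in Figures 3.5 and 3.6.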

[Figure: four panels (AR(2), AR(4), MA, Non-Gaussian AR(2)) of IRMSE (dB) against K for the two methods.]

Figure 3.6: Plots of IRMSE versus K on the decibel (dB) scale using the L1 penalized LS method (orange) and the L1 penalized MT-Whittle method (blue) with the universal threshold for the four processes. The vertical bars show 95% bootstrap confidence intervals for each K, and the dotted lines show the location of the minimum.

3.5.1 Empirically validating the theoretical rate

In this section, we empirically verify the conjectured rate in (3.18) by simulation using the four different time series processes defined above. Based on 1000 simulations, we calculated the ratio comparing $\bigl\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\bigr\|_\infty$ with $\sqrt{M}\,\log(M)$ for a wide range of values of M. This ratio should remain bounded as we increase M. Figure 3.7 shows the median (solid line), 2.5th percentile (lower dashed line), and 97.5th percentile (upper dashed line) of this ratio, and demonstrates that our empirical rate indeed matches closely with the conjectured theoretical rate.
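A rough numerical analogue of this check (with independent Gamma(K, 1/K) variates standing in for the ratios $R_j = \widehat S^{(mt)}(f_j)/S(f_j)$, and a small toy cosine basis in place of the bases used in the dissertation; both are our simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10
for M in (2 ** 8, 2 ** 12):
    R = rng.gamma(shape=K, scale=1.0 / K, size=M)          # stand-in for S_hat/S
    f = np.arange(1, M + 1) / (2.0 * M)                    # Fourier frequencies
    Phi = np.cos(2.0 * np.pi * np.outer(f, np.arange(8)))  # toy cosine basis
    ratio = np.abs(Phi.T @ (R - 1.0)).max() / (np.sqrt(M) * np.log(M))
    assert ratio < 1.0  # stays bounded as M grows, consistent with (3.18)
```

The same computation, repeated over 1000 realizations of each process, produces the percentile bands in Figure 3.7.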

[Figure: four panels (AR(2), AR(4), MA, Non-Gaussian AR(2)) of the ratio against $\log_2(M)$.]

Figure 3.7: Plots of the ratio $\bigl\|\sum_{j=1}^{M}(R_j - 1)\,\phi(f_j)\bigr\|_\infty \big/ \bigl(\sqrt{M}\,\log(M)\bigr)$ against $\log_2(M)$, for M varying from $2^8$ to $2^{15}$. For each process, the solid line is the median of the ratio, and the dashed lines are the 2.5th and 97.5th percentiles of the ratio, based on 1000 simulations.

3.6 Spectral estimation for EEG signals

Electroencephalogram (EEG) signals are often used to monitor brain activity and diagnose diseases such as epileptic seizures. We analyze two channels of EEG data collected from the left and right frontal cortex of one male rat. (The data are presented in van Luijtelaar [1997], and were downloaded from http://www.vis.caltech.edu/~rodri.) Quiroga et al. [2002] argue that, genetically, analyzing these series is relevant to the study of human epilepsy. Each channel contains 1000 voltages recorded in millivolts (mV), collected at a sampling rate of 200 Hz. Time series plots of the left and right channels are shown in panels (a) and (b) of Figure 3.8, and hint at strong spectral features in the two series.

[Figure: panels (a) and (b) plot voltage (mV) against time (s) for the two EEG channels; panels (c) and (d) plot the estimated SDFs against frequency from 0 to 30 Hz.]

Figure 3.8: Time series plots of the (a) left and (b) right EEG channels, as well as estimated SDFs for the (c) left and (d) right channels. In (c) and (d) the blue line is the estimate with the universal threshold, and the green line is the estimate with the GIC-based threshold.

In the supplement of Krafty and Collinge [2013], an L2 penalized multivariate Whittle likelihood based on the periodogram is used to estimate the SDF. Using spline bases, and focusing on estimates of the SDF between 0 and 30 Hz, they discover spectral peaks at around 9 and 18 Hz, which indicate a “local synchronization of neurons in both hemispheres of the frontal cortex”.

Using our L1 penalized MT-Whittle approach, we estimate the SDFs for the left and right channels separately. The MT-Whittle approach counteracts the bias due to leakage and mitigates the inconsistency of a periodogram-based approach. Compared with spline bases, modeling the log SDF using a wavelet basis better captures sharp peaks and other possible local features in the SDF. Additionally, the L1 penalty enforces sparsity, keeping only relevant features in the estimated SDF.

Panels (c) and (d) of Figure 3.8 display, respectively, the SDF estimates for the left and right channels using our approach. (We mean-padded the series to a length of N = 1024 prior to spectral estimation.) In each panel the blue line denotes the estimated SDF when we use the universal threshold method, and the green line shows the estimate with the GIC method.

In our estimates, we use MT spectral estimates with K = 5 sine tapers, and we construct our wavelet basis using the LA(8) wavelet filter. Our SDF estimates for both channels contain several sharp turning points, and include features not depicted by the spline models of Krafty and Collinge [2013]. In terms of the tuning parameter selection, the universal-threshold-based and GIC-based estimates are fairly similar to one another (the estimated SDFs are more similar for the left channel). This indicates that both methods are able to capture the interesting local features in the SDFs.
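The mean padding used above, which extends the length-1000 series to N = 1024 with its sample mean, can be sketched as (a hypothetical helper, not the code used in the dissertation):

```python
import numpy as np

def mean_pad(x, n):
    """Pad the series x up to length n with its sample mean."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, np.full(n - x.size, x.mean())])
```

Padding to a power of two keeps the number of Fourier frequencies compatible with the dyadic wavelet basis.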

For the left channel, we estimate prominent spikes at 9 Hz and 16 Hz as well as a noticeable elbow around 5 Hz. There is more energy in the SDF estimates for the right channel compared to the left. In the right channel, we pick up a strong broadband structure: the right channel presents clustered power between 5 Hz and 10 Hz, as well as a summit at 9 Hz and a crest between 13 and 14 Hz.

In conclusion, our L1 penalized MT-Whittle method is useful in revealing spike-wave discharges in EEG signals observed in the frontal cortex of the male rat under study. It would be interesting in the future to see how these methods apply to the spectral analysis of multiple EEG channels in different study populations.

3.7 Non-orthogonal basis functions

The proposed L1 penalized MT-Whittle framework and the corresponding ADMM algorithm can also be used with general non-orthogonal basis functions. When using an orthogonal basis, the second equality constraint β = η in (3.5) is not necessary. However, when using non-orthogonal bases, the second equality constraint is introduced to avoid the need to solve a lasso problem repeatedly at each iteration, as we now demonstrate.

Suppose we include only the first equality constraint Φβ = ζ. Then the augmented Lagrangian for the L1 penalized MT-Whittle likelihood (3.4) becomes
$$
l_A(\beta, \zeta, u_1) = l_W(\zeta) + \sum_{l=1}^{p} \omega_l |\beta_l| + \frac{\rho}{2} \bigl\| \Phi\beta - \zeta + u_1 \bigr\|_2^2, \qquad \rho > 0.
$$
The ADMM updates in this case would be
$$
\begin{aligned}
\beta^{(n+1)} &= \arg\min_{\beta} \left\{ \frac{\rho}{2} \bigl\| \Phi\beta - \zeta^{(n)} + u_1^{(n)} \bigr\|_2^2 + \sum_{l=1}^{p} \omega_l |\beta_l| \right\};\\
\zeta_j^{(n+1)} &= \arg\min_{\zeta_j} \left\{ \zeta_j + \widehat S^{(mt)}(f_j) \exp(-\zeta_j) + \frac{\rho}{2} \bigl( \phi^T(f_j)\beta^{(n+1)} - \zeta_j + u_{1j}^{(n)} \bigr)^2 \right\}, \quad j = 1, \ldots, M;\\
u_1^{(n+1)} &= u_1^{(n)} + \bigl( \Phi\beta^{(n+1)} - \zeta^{(n+1)} \bigr),
\end{aligned}
$$

where, within each iteration, the update of $\beta^{(n+1)}$ requires solving a lasso problem. Solutions for orthogonal bases can be obtained simply by soft-thresholding LS estimates, as discussed in Section 2.3.3. However, when the basis functions are not orthogonal, no closed form for the update of $\beta^{(n+1)}$ is available, and the nested lasso problem requires nested iterations of ADMM or other algorithms (e.g., LARS [Efron et al., 2004] or glmnet [Friedman et al., 2010]) to converge. Including these nested iterations would degrade the computational efficiency of the method.
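For concreteness, the two scalar building blocks of these updates can be sketched as follows: soft-thresholding, which solves the β-update exactly when Φ is orthogonal (Section 2.3.3), and a Newton solver for the separable ζ_j-update, whose objective ζ + s e^{-ζ} + (ρ/2)(a - ζ)² is smooth and convex once a = φ^T(f_j)β^{(n+1)} + u_{1j}^{(n)} is fixed (a sketch under these stated assumptions, not the dissertation's implementation):

```python
import numpy as np

def soft_threshold(z, t):
    """Solve argmin_b 0.5*(b - z)**2 + t*|b| elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def zeta_update(s_mt, a, rho, iters=50):
    """Newton iterations for argmin_z  z + s_mt*exp(-z) + (rho/2)*(a - z)**2."""
    z = np.log(s_mt)  # start at the raw log MT estimate
    for _ in range(iters):
        grad = 1.0 - s_mt * np.exp(-z) - rho * (a - z)
        hess = s_mt * np.exp(-z) + rho  # always positive: the objective is convex
        z -= grad / hess
    return z
```

With an orthogonal basis the β-update reduces to `soft_threshold(Phi.T @ (zeta - u1), weights / rho)`, which is why the Daubechies wavelet bases are computationally attractive here.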

3.8 Conclusion

In this chapter, we presented an L1 penalized quasi-likelihood framework based on MT spectral estimates, using a basis representation to model the SDF of a stationary time series. Our methodology allows the number of basis functions to increase with the sample size, and, through enforcing sparsity, our L1 penalized MT-Whittle estimator performs as well as or better than previous methods for estimating the SDF. Our method extends to broader classes of basis functions and their mixtures, beyond those traditionally used with wavelet thresholding. Simulations demonstrate a clear advantage of the GIC and the calibrated universal threshold over cross-validation for tuning parameter selection, with a significant reduction in the IRMSE. Computationally, the calibrated universal threshold is data-invariant (it relies only on the number of tapers K), whereas the calculation of the GIC is data-dependent; however, both methods are more efficient than cross-validation. The proposed ADMM algorithm is accelerated when parallel computing and orthogonal bases, such as the Daubechies class of wavelets, are employed.

Chapter 4: Spectral inference with sample splitting

In this chapter, we introduce approaches to construct confidence intervals compatible with the L1 penalized MT-Whittle estimation for the spectral analysis of regularly sampled univariate stationary time series. Our method combines the idea of sample splitting with the Huber sandwich estimator to quantify the uncertainty in the MT-Whittle estimator and its correlation between frequencies. Via simulation, the proposed approaches are shown to have much closer-to-nominal coverage rates and better interval scores than a naive method for the interval estimation of different SDFs. With a suggested number of tapers, the performance of the interval estimation improves as the sample size increases. Once again, an application is demonstrated with electroencephalogram (EEG) examples.

4.1 Sample splitting for SDFs

In the literature, the study of basis models for SDFs has mainly focused on point estimation. Not much work has comprehensively investigated the corresponding interval estimation and model uncertainty evaluation problems. For spline-based models, a Bayesian framework (see, e.g., Rosen and Stoffer [2007]) and bootstrap methods (see, e.g., Dai and Guo [2004] and Krafty and Collinge [2013]) can be applied to interval estimation of the SDF. However, for L1 penalized methods such as wavelet soft-thresholding and the proposed L1 penalized MT-Whittle approach, characterizing the distribution of the estimators is not as easy. According to Dezeure et al. [2015], the lasso penalty induces a point mass at zero, and thus resampling methods such as the standard bootstrap cannot recover the distribution of the penalized estimator because of this discontinuity at zero. The inference approach taken by Zhang [2017] does not apply to our situation, since their L1 penalty term is based on the total variation distance rather than a lasso penalty. Some recent studies on the inference of lasso penalized estimators are summarized by Dezeure et al. [2015] in a more general setting than SDF estimation. One important approach is based on sample splitting [see Dezeure et al., 2015, Section 2.1.1], which can be adapted to the generalized linear models used in the MT-Whittle framework.

4.1.1 Overview of sample splitting

Sample splitting, which performs model selection and inference on separate pieces of a dataset, is not a recent concept. Early ideas related to sample splitting and some historical work on cross-validation were reviewed by Stone [1974]. One year later, Cox [1975] published a study on the effect of the splitting proportion on the power of significance tests. In the 1990s, the problem of conducting inference on the same dataset that is also used for model selection was further investigated in the literature. Hurvich and Tsai [1990] showed via Monte Carlo simulations that model selection and inference on the same data result in coverage rates smaller than the nominal level in the linear regression setting, and the sample splitting approach was recommended as a remedy. A summary of the use of sample splitting for regression models can be found in Picard and Berk [1990]. Theoretically, Pötscher [1991] studied the asymptotic properties of M-estimators [Huber, 1992] after model selection is carried out on the same dataset. As pointed out by Faraway [1992], because the uncertainty in model selection is not taken into account, such inference after model selection can

lead to biased and overoptimistic results. These issues were also discussed within a more general notion of model uncertainty (see, e.g., Draper [1994] and Chatfield [1995]). In the regression context, Draper [1994] assessed model uncertainty within a Bayesian framework. With sources of model uncertainty explored for more general statistical methods, Chatfield [1995] discussed the use of Bayesian model averaging and the difficulty of quantifying model uncertainty, and suggested making inference on new data as “a safer way”. See Clyde and George [2004] for a more comprehensive review of the developments of Bayesian methods for model uncertainty.

More recently, Wasserman and Roeder [2009] proposed using sample splitting methods to handle post-model-selection bias in high dimensional models. With a repeated sample splitting approach, Meinshausen et al. [2009] constructed an aggregated p-value for the inference of high dimensional regression models selected by approaches such as the lasso. Under more relaxed assumptions, nonparametric tools for measuring variable importance in high dimensional models using sample splitting were developed by Rinaldo et al. [2019].

According to Faraway [2016], there is a tradeoff in statistical inference between full data analysis and sample splitting: though sample splitting resolves the overoptimistic inference problem of selecting and estimating with the same data, it also reduces the probability of selecting the accurate model and induces more variability in parameter estimation because of the smaller sample sizes in the split subsets. Similar arguments were made by Rinaldo et al. [2019], where the gain in inference robustness at a loss of prediction accuracy under sample splitting is called the inference-prediction tradeoff. Faraway [2016] proposed a heuristic framework to measure the predictive performance of a full data analysis, with components representing the best possible performance, the model selection cost, the parameter estimation cost, and the data reuse cost. Though the explicit expressions for each component were difficult to derive, Faraway [2016] stated that sample splitting has higher model selection and parameter estimation costs and a lower data reuse cost than full data analysis. Using the logarithmic scoring rule (see Section 4.3 for more details), Faraway [2016] compared the predictive performance of different strategies for conducting model selection and inference in separate steps via a simulation study, and showed that a split data analysis is preferable to a full data analysis in general, especially when the model building strategy is complex and cannot be pre-specified.

4.1.2 Sample splitting for L1 penalized SDF models

To implement the sample-splitting-based inference proposed by Dezeure et al. [2015] and Rinaldo et al. [2019] for L1 penalized SDF models, we first need to carefully choose a splitting strategy for data in the frequency domain. In terms of the splitting proportion, Faraway [2016] and Rinaldo et al. [2019] implemented a random 50:50 split under the assumption of independence. However, for time series or regression data with correlated errors, Picard and Berk [1990] and Faraway [1995] argued that a good splitting scheme can differ significantly from the typical random splitting approach. According to Faraway [1995], a random sample split needs to be based on the independence and exchangeability of the samples. Noticing that the SDF of interest is always spatially inhomogeneous and that the MT estimates are correlated across the frequencies, a random split may not be appropriate for SDF estimation in our case. With the same notation as in the previous chapter, we present below a reasonable splitting strategy that avoids severe unevenness between the two pieces of data and strives to preserve the overall pattern of the SDF.

Let $\mathbf{x} = (x_1, \ldots, x_N)^T$ be the vector of N consecutive observations of a regularly sampled, real-valued, stationary univariate time series $\{X_t : t \in \mathbb{Z}\}$. In the frequency domain, denote $\mathbf{Z} = \bigl(\widehat S^{(mt)}(f_1), \ldots, \widehat S^{(mt)}(f_M)\bigr)^T$ as the vector of MT spectral estimates $\widehat S^{(mt)}(f_j)$ evaluated on the set of $M = \lceil N/2 \rceil - 1$ non-zero, non-Nyquist (i.e., not equal to 1/2) Fourier frequencies as defined by (2.18). Recall that $\Phi$ is an $M \times p$ basis matrix with $j$th row equal to $\phi^T(f_j)$, $j = 1, 2, \ldots, M$. The vector (3.1) of log SDFs evaluated at the frequencies (2.18) can be expanded as $\zeta = \Phi\beta$ by (2.16). We use $\mathcal{D}_M = \{\mathbf{Z}, \Phi\}$ to denote the sample of size M in the frequency domain that is used for analyzing the SDF.

Due to the correlation of the MT estimates and the lack of exchangeability between the frequencies (2.18), instead of splitting randomly, we separate the frequencies by odd and even values of the index $j = 1, 2, \ldots, M$ to formulate two subsets that carry closely comparable information about the SDF:
$$
\mathcal{F}_1 = \bigl\{ f_1, f_3, \ldots, f_{2\lceil M/2 \rceil - 1} \bigr\}, \quad \text{and} \tag{4.1}
$$
$$
\mathcal{F}_2 = \bigl\{ f_2, f_4, \ldots, f_{2\lfloor M/2 \rfloor} \bigr\}. \tag{4.2}
$$

Then, the above sample splitting corresponds to two sample vectors of MT estimates:
$$
\mathbf{Z}_1 = \bigl( \widehat S^{(mt)}(f_1), \widehat S^{(mt)}(f_3), \ldots, \widehat S^{(mt)}(f_{2\lceil M/2 \rceil - 1}) \bigr)^T, \quad \text{and} \tag{4.3}
$$
$$
\mathbf{Z}_2 = \bigl( \widehat S^{(mt)}(f_2), \widehat S^{(mt)}(f_4), \ldots, \widehat S^{(mt)}(f_{2\lfloor M/2 \rfloor}) \bigr)^T. \tag{4.4}
$$

Likewise, the basis matrix $\Phi$ can also be split into two sub-matrices: matrix $\Phi_1$ consists of the odd rows of $\Phi$, and matrix $\Phi_2$ is generated by the even rows of $\Phi$. Thus, we have split the data $\mathcal{D}_M$ into two halves. Denote the first half of the sample by $\mathcal{D}_{1,M_1} = \{\Phi_1, \mathbf{Z}_1\}$ with half sample size $M_1 = \lceil M/2 \rceil$, and denote the second half of the sample by $\mathcal{D}_{2,M_2} = \{\Phi_2, \mathbf{Z}_2\}$ with half sample size $M_2 = \lfloor M/2 \rfloor$.
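The split in (4.1)-(4.2) is a simple odd/even partition of the frequency indices; a minimal sketch (the helper name is ours):

```python
import numpy as np

def split_frequencies(M):
    """Split frequency indices j = 1, ..., M into odd (F1) and even (F2) halves."""
    j = np.arange(1, M + 1)
    return j[0::2], j[1::2]   # F1 = {f1, f3, ...}, F2 = {f2, f4, ...}
```

The corresponding rows of Z and Φ then give the two half-samples $\mathcal{D}_{1,M_1}$ and $\mathcal{D}_{2,M_2}$.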

Adapting the idea summarized in Dezeure et al. [2015, Section 3.1], on the first half of the data, $\mathcal{D}_{1,M_1}$, we apply the L1 penalized MT-Whittle approach introduced in Chapter 3 to identify prominent bases whose coefficient estimates are non-zero. Denote the L1 penalized estimate of $\beta$ as $\hat\beta^{I} = (\hat\beta_1^{I}, \ldots, \hat\beta_p^{I})^T$. Let $\mathcal{I}$ be the index set of non-zero elements of $\hat\beta^{I}$; that is, $\mathcal{I} = \bigl\{ l : \hat\beta_l^{I} \neq 0;\ l = 1, \ldots, p \bigr\}$. With the features selected by the first half, we consider

the following model on the second half $\mathcal{D}_{2,M_2}$ of the data: for each frequency f,
$$
\log S(f) = \sum_{l \in \mathcal{I}} \phi_l(f)\, \beta_l = \phi_{\mathcal{I}}^T(f)\, \beta_{\mathcal{I}}, \tag{4.5}
$$
where $\phi_{\mathcal{I}}(f)$ is the vector of the selected basis functions and $\beta_{\mathcal{I}}$ is the corresponding vector of regression coefficients. We preserve only the columns with index $l \in \mathcal{I}$ in $\Phi_2$ and $\Phi$. Denote the resulting $M_2 \times |\mathcal{I}|$ design matrix for the second half of the frequencies $\mathcal{F}_2$ as $\Phi_2$, and the resulting design matrix of the full data at the frequencies (2.18) as $\Phi$, such that $\Phi$ is the $M \times |\mathcal{I}|$ matrix with $j$th row equal to $\phi_{\mathcal{I}}^T(f_j)$ for $j = 1, \ldots, M$, where $|\mathcal{I}|$ is the cardinality of $\mathcal{I}$, i.e., the number of elements in $\mathcal{I}$. Next, we use the unpenalized MT-Whittle likelihood to make inference on the second half of the data with the selected bases, denoted as $\mathcal{D}_{2,M_2} = \{\Phi_2, \mathbf{Z}_2\}$.

The MT-Whittle likelihood based on $\mathcal{D}_{2,M_2}$ is
$$
l_W(\Phi_2 \beta_{\mathcal{I}}) = \sum_{j : f_j \in \mathcal{F}_2} \left\{ \log S(f_j) + \frac{\widehat S^{(mt)}(f_j)}{S(f_j)} \right\}. \tag{4.6}
$$

Let $\hat\beta_{\mathcal{I}}^W$ denote the estimate of $\beta_{\mathcal{I}}$ obtained by minimizing (4.6). Then the fitted SDF is
$$
\widehat S^W(f) = \exp\bigl\{ \phi_{\mathcal{I}}^T(f)\, \hat\beta_{\mathcal{I}}^W \bigr\}, \qquad \text{for } f \in (0, 1/2). \tag{4.7}
$$
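Minimizing (4.6) over $\beta_{\mathcal{I}}$ is a smooth convex problem, since $\log S(f_j) = \phi_{\mathcal{I}}^T(f_j)\beta_{\mathcal{I}}$ is linear in $\beta_{\mathcal{I}}$. A minimal numerical sketch using a general-purpose optimizer (our simplification; a gamma GLM fit via iteratively reweighted least squares would be the more standard route):

```python
import numpy as np
from scipy.optimize import minimize

def whittle_fit(Phi2, Z2):
    """Minimize l_W of (4.6): sum_j [ eta_j + Z2_j * exp(-eta_j) ], eta = Phi2 @ beta."""
    def objective(beta):
        eta = Phi2 @ beta
        return np.sum(eta + Z2 * np.exp(-eta))

    def grad(beta):
        eta = Phi2 @ beta
        return Phi2.T @ (1.0 - Z2 * np.exp(-eta))

    beta0 = np.zeros(Phi2.shape[1])
    return minimize(objective, beta0, jac=grad, method="BFGS").x
```

With an intercept-only design, the minimizer satisfies $e^{\beta} = $ mean of the MT estimates, which provides a simple sanity check.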

The corresponding standard errors and confidence intervals can be constructed based on the

asymptotic distribution of the estimator following the standard procedures for generalized

linear models (see, e.g., McCullagh and Nelder [1999]). By Proposition 4, the MT-Whittle likelihood (4.6) is a quasi-likelihood function. A valid inference of the resulting MT-Whittle

W W estimators, βbI and Sb (f), still relies on a proper incorporation of the correlations of MT estimates between frequencies, as discussed in the following section.

4.2 Quantifying the variability of MT-Whittle estimators

In this section, we present three methods for estimating the variance of the MT-Whittle-based estimator of $\beta_{\mathcal{I}}$. Recall from Section 3.3.1 that the asymptotic distribution of the MT estimator $\hat S^{(mt)}(f)$ in Proposition 3 can be written as $\mathrm{Gamma}(\nu, \mu/\nu)$, with mean $\mu = S(f)$, shape parameter $\nu = K$, and variance $\mu^2/\nu = \{S(f)\}^2/K$. According to McCullagh and Nelder [1999], the asymptotic pdf of $Z_j = \hat S^{(mt)}(f_j)$ evaluated at $z_j$, for the frequencies (2.18), can be expressed in the form (3.7):
$$p(z_j) = \exp\left\{ \frac{z_j \vartheta - b(\vartheta)}{a(\varsigma)} + c(z_j, \varsigma) \right\},$$
where $\vartheta = -1/\mu_j = -1/S(f_j)$, $a(\varsigma) = 1/\nu = 1/K$, $b(\vartheta) = -\log(-\vartheta) = \log S(f_j)$, and $c(z_j, \varsigma) = K \log(K z_j) - \log z_j - \log \Gamma(K)$. Thus, $\mathrm{Var}(Z_j) = b''(\vartheta)\, a(\varsigma) = V(\mu_j)\, a(\varsigma) = \mu_j^2/\nu$, and the variance function is $V(\mu_j) = \mu_j^2$.

4.2.1 Naive variance estimation

When assuming asymptotic independence of the MT estimates between the frequencies (2.18), one can adapt the results presented in Section 3.3.1 to the inference of $\hat\beta^W_{\mathcal{I}}$. Based on the results in McCullagh and Nelder [1999], and assuming independence between the Fourier frequencies, the Fisher information for the GLM (3.7) based on $\mathcal{D}_{2,M_2}$ can be computed as
$$\mathcal{I}^{(naive)}(\beta_{\mathcal{I}}) = \Phi_2^T W_2 \Phi_2,$$
where $W_2 = \mathrm{diag}\{w_j\}_{j : f_j \in \mathcal{F}_2}$ is the diagonal matrix generated by the sequence $\{w_j\}$ for $\{j : f_j \in \mathcal{F}_2\}$, with
$$w_j = \frac{1}{V(\mu_j)\, a(\varsigma)\, \{g'(\mu_j)\}^2} = \frac{1}{\mu_j^2\, (1/K)\, (1/\mu_j^2)} = K,$$
since our model assumes the link function $g(\mu) = \log\mu$, so that $g'(\mu) = 1/\mu$. Thus, a naive estimate of the Fisher information is
$$\mathcal{I}^{(naive)}(\beta_{\mathcal{I}}) = K \Phi_2^T I_{M_2} \Phi_2 = K \Phi_2^T \Phi_2.$$

Further, assuming independence over the Fourier frequencies and under suitable regularity conditions [see, e.g., McCullagh and Nelder, 1999, Appendix A], as $M \to \infty$, the maximum quasi-likelihood estimator $\hat\beta^W_{\mathcal{I}}$ has the asymptotic distribution
$$N_{|\mathcal{I}|}\left( \beta_{\mathcal{I}},\; \mathrm{Cov}^{(naive)}\{\hat\beta^W_{\mathcal{I}}\} \right),$$
with
$$\mathrm{Cov}^{(naive)}\{\hat\beta^W_{\mathcal{I}}\} = \left\{\mathcal{I}^{(naive)}(\beta_{\mathcal{I}})\right\}^{-1} = \left(\Phi_2^T \Phi_2\right)^{-1}/K. \qquad (4.8)$$
If the columns of $\Phi_2$ are orthonormal, then this simplifies to
$$\mathrm{Cov}^{(naive)}\{\hat\beta^W_{\mathcal{I}}\} = I_{|\mathcal{I}|}/K.$$

It follows from (4.8) that, as $M \to \infty$, the vector of fitted log SDFs, $\Phi\hat\beta^W_{\mathcal{I}}$, has the asymptotic distribution
$$N_M\left( \Phi\beta_{\mathcal{I}},\; \frac{1}{K}\, \Phi \left(\Phi_2^T \Phi_2\right)^{-1} \Phi^T \right). \qquad (4.9)$$
Denote $\Omega_j^{(naive)}$ as the $j$th diagonal element of $\Phi \left(\Phi_2^T \Phi_2\right)^{-1} \Phi^T / K$, for $j = 1, \ldots, M$. Then a $100(1-\alpha)\%$ naive confidence interval for the log SDF at $f_j$, $j = 1, \ldots, M$, can be calculated as
$$\left[ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} - z_{\alpha/2}\sqrt{\Omega_j^{(naive)}},\;\; \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} + z_{\alpha/2}\sqrt{\Omega_j^{(naive)}} \right], \qquad (4.10)$$
where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. Thus, the $100(1-\alpha)\%$ naive confidence interval for the SDF at $f_j$, $j = 1, \ldots, M$, based on (4.9) is
$$\left[ \exp\left\{ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} - z_{\alpha/2}\sqrt{\Omega_j^{(naive)}} \right\},\;\; \exp\left\{ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} + z_{\alpha/2}\sqrt{\Omega_j^{(naive)}} \right\} \right]. \qquad (4.11)$$
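As a sketch, the naive interval (4.10) takes only a few lines; the numpy/scipy fragment below is illustrative, with `naive_log_sdf_ci` a hypothetical helper name.

```python
import numpy as np
from scipy.stats import norm

def naive_log_sdf_ci(Phi, Phi2, beta_hat, K, alpha=0.05):
    """Naive 100(1-alpha)% pointwise intervals (4.10) for the log SDF,
    built from Cov = (Phi2^T Phi2)^{-1} / K as in (4.8)-(4.9)."""
    cov_beta = np.linalg.inv(Phi2.T @ Phi2) / K            # (4.8)
    omega = np.einsum("ij,jk,ik->i", Phi, cov_beta, Phi)   # diag(Phi Cov Phi^T)
    fit = Phi @ beta_hat
    half = norm.ppf(1 - alpha / 2) * np.sqrt(omega)
    return fit - half, fit + half

# toy check with an orthonormal design, so Cov = I/K and each omega_j = 1/K
lo, hi = naive_log_sdf_ci(np.eye(4), np.eye(4), np.zeros(4), K=4)
```

Exponentiating the two returned arrays gives the SDF intervals of (4.11).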

However, the covariance structure of $\hat\beta^W_{\mathcal{I}}$ is misspecified by (4.8), which can lead to undercoverage in interval estimation, especially when multiple tapers are used. Recall from the discussion before Proposition 4 and from (3.3) that MT estimates are in general correlated over frequencies [Thomson, 1982]: tapering decreases the bias of the MT estimator but increases the correlation of the estimates between frequencies.

Freedman [2006] discussed the application of the "Huber sandwich estimator" [Huber, 1967] to better gauge the variability of maximum quasi-likelihood estimators for generalized linear models. Next, we adapt the Huber sandwich estimator to our MT-Whittle framework as a potential solution that provides more reliable inference.

4.2.2 Sandwich variance estimation

A general application of the Huber sandwich estimator [Freedman, 2006] incorporates an estimate of the correlation between the MT estimates to re-estimate the covariance structure of $\hat\beta^W_{\mathcal{I}}$ empirically. Based on Freedman [2006], under suitable regularity conditions,
$$\hat\beta^W_{\mathcal{I}} \to_d N_{|\mathcal{I}|}\left( \beta_{\mathcal{I}},\; \mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\} \right), \qquad (4.12)$$
as $M \to \infty$, and the asymptotic covariance of $\hat\beta^W_{\mathcal{I}}$ can be estimated in a sandwich form by
$$\mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\} = J^{-1} A J^{-1}, \qquad (4.13)$$
with the component matrices

$$A = D^T V^{-1}\, \mathrm{Cov}(Z_2)\, V^{-T} D, \quad \text{and} \quad J = D^T V^{-1} D.$$
Here $D = (D_{j,l})_{\{j : f_j \in \mathcal{F}_2\} \times \mathcal{I}}$ is the $M_2 \times |\mathcal{I}|$ matrix with elements
$$D_{j,l} = \partial\mu_j/\partial\beta_l = (\partial\mu_j/\partial\zeta_j)(\partial\zeta_j/\partial\beta_l) = \mu_j \cdot \Phi_{j,l},$$
for $j = 2, 4, \ldots, 2M_2$ and $l \in \mathcal{I}$. In matrix notation, we have
$$D = \mathrm{diag}\{\mu_j\}_{j : f_j \in \mathcal{F}_2}\; \Phi_2.$$

In the above expressions, $V$ is the diagonal matrix generated by the values of the variance function $V(\mu_j)$ at $j = 2, 4, \ldots, 2M_2$, such that
$$V = \mathrm{diag}\{V(\mu_j)\}_{j : f_j \in \mathcal{F}_2} = \mathrm{diag}\{\mu_j^2\}_{j : f_j \in \mathcal{F}_2},$$
and $V^{-T}$ denotes the transpose of the inverse of $V$. Notice that, for our model, the inverse of $J$ is
$$J^{-1} = \left(D^T V^{-1} D\right)^{-1} = \left(\Phi_2^T \Phi_2\right)^{-1}.$$
If the columns of $\Phi_2$ are orthonormal, then it follows that $J^{-1} = I_{|\mathcal{I}|}$, and the covariance of $\hat\beta^W_{\mathcal{I}}$ simplifies to
$$\mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\} = D^T V^{-1}\, \mathrm{Cov}(Z_2)\, V^{-T} D.$$

The matrices $D$ and $V^{-1}$ can be estimated by plugging in the MT-Whittle-based estimates $\hat S^W(f_j)$ of $\mu_j$ at $\{j : f_j \in \mathcal{F}_2\}$, based on the fitted model (4.7). We denote these estimates by
$$\hat D^W = \mathrm{diag}\left\{\hat S^W(f_j)\right\}_{j : f_j \in \mathcal{F}_2} \Phi_2, \quad \text{and} \quad \hat V^W = \mathrm{diag}\left\{\hat S^W(f_j)^2\right\}_{j : f_j \in \mathcal{F}_2}.$$

To complete the estimation of $\mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\}$, we still need an estimate of the covariance matrix $\mathrm{Cov}(Z_2)$, whose element for $f_j, f_{j'} \in \mathcal{F}_2$ is
$$\mathrm{Cov}\left\{\hat S^{(mt)}(f_j), \hat S^{(mt)}(f_{j'})\right\} = \frac{1}{K^2} \sum_{k=1}^K \sum_{k'=1}^K \mathrm{Cov}\left\{\hat S_k^{(mt)}(f_j), \hat S_{k'}^{(mt)}(f_{j'})\right\}, \qquad (4.14)$$

and it follows from Thomson [1982] that, if $\{X_t\}$ is a Gaussian process with zero mean, then for $k, k' = 1, \ldots, K$,
$$\mathrm{Cov}\left\{\hat S_k^{(mt)}(f_j), \hat S_{k'}^{(mt)}(f_{j'})\right\} = \left| \int_{-1/2}^{1/2} S(u)\, H_k(f_j - u)\, H_{k'}^*(f_{j'} - u)\, du \right|^2 + \left| \int_{-1/2}^{1/2} S(u)\, H_k(f_j - u)\, H_{k'}(f_{j'} + u)\, du \right|^2, \qquad (4.15)$$
where
$$H_k(f) = \sum_{t=1}^N h_{t,k}\, e^{-i2\pi f t}, \qquad (4.16)$$
and $H_k^*(f)$ denotes the complex conjugate of $H_k(f)$, with $H_k^*(f) = H_k(-f)$ for $k = 1, \ldots, K$. The derivation of (4.15) is based on the complex version of the Isserlis theorem [Koopmans, 1974, p. 27] and will be presented in a more general form for multivariate spectral analysis in Section 5.1.5.

Meanwhile, we rewrite (4.15) as follows. By (2.2), the ACVF is the inverse Fourier transform of the SDF, such that $\gamma_X(s-t) = \int_{-1/2}^{1/2} e^{i2\pi u(s-t)} S(u)\, du$, which implies that
$$\begin{aligned}
\int_{-1/2}^{1/2} S(u)\, H_k(f - u)\, H_{k'}^*(f' - u)\, du
&= \int_{-1/2}^{1/2} S(u) \sum_{s=1}^N h_{s,k}\, e^{-i2\pi(f-u)s} \sum_{t=1}^N h_{t,k'}\, e^{i2\pi(f'-u)t}\, du \\
&= \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, e^{-i2\pi(fs - f't)} \int_{-1/2}^{1/2} e^{i2\pi u(s-t)} S(u)\, du \\
&= \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, \gamma_X(s-t)\, e^{-i2\pi(fs - f't)},
\end{aligned}$$
and similarly,
$$\int_{-1/2}^{1/2} S(u)\, H_k(f - u)\, H_{k'}(f' + u)\, du = \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, \gamma_X(s-t)\, e^{-i2\pi(fs + f't)}.$$

Thus, the covariance (4.15) of the eigenspectra can alternatively be expressed as
$$\mathrm{Cov}\left\{\hat S_k^{(mt)}(f_j), \hat S_{k'}^{(mt)}(f_{j'})\right\} = \left| \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, \gamma_X(s-t)\, e^{-i2\pi(f_j s - f_{j'} t)} \right|^2 + \left| \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, \gamma_X(s-t)\, e^{-i2\pi(f_j s + f_{j'} t)} \right|^2, \qquad (4.17)$$
for $k, k' \in \{1, 2, \ldots, K\}$.

In practice, both the true SDF $S(f)$ and the true ACVF $\gamma_X(\tau)$ are usually unknown. To obtain an estimate of the covariance matrix $\mathrm{Cov}(Z_2)$ for a Gaussian process with zero mean, we replace the true SDF $S(f)$ in (4.15) and the true ACVF $\gamma_X(\tau)$ in (4.17) by their MT-Whittle-based estimates, as follows.

If the vector of basis functions $\phi(f)$ is defined continuously on $[0, 1/2]$, we can replace the true SDF $S(u)$ in (4.15) by $\hat S^W(u) = \exp\{\phi_{\mathcal{I}}^T(u)\,\hat\beta^W_{\mathcal{I}}\}$ to get
$$\widehat{\mathrm{Cov}}^S\left\{\hat S_k^{(mt)}(f_j), \hat S_{k'}^{(mt)}(f_{j'})\right\} = \left| \int_{-1/2}^{1/2} \hat S^W(u)\, H_k(f_j - u)\, H_{k'}^*(f_{j'} - u)\, du \right|^2 + \left| \int_{-1/2}^{1/2} \hat S^W(u)\, H_k(f_j - u)\, H_{k'}(f_{j'} + u)\, du \right|^2, \qquad (4.18)$$
for $f_j, f_{j'} \in \mathcal{F}_2$, where $H_k(f)$ is given in (4.16), for $k, k' = 1, 2, \ldots, K$.

If the basis functions are only known discretely, one can alternatively estimate $\{\gamma_X(\tau) : \tau = -(N-1), \ldots, N\}$ by taking the inverse discrete Fourier transform of $\hat S^W(f_{j'}) = \exp\{\phi_{\mathcal{I}}^T(f_{j'})\,\hat\beta^W_{\mathcal{I}}\}$ at $f_{j'} = j'/(2N)$, $j' = -(N-1), \ldots, N$, such that
$$\hat\gamma_X^W(\tau) = \frac{1}{2N} \sum_{j'=-(N-1)}^{N} \hat S^W(f_{j'})\, e^{i\pi j'\tau/N}, \qquad (4.19)$$
for $\tau = -(N-1), \ldots, N$. Replacing the true ACVF $\gamma_X(\tau)$ in (4.17) by $\hat\gamma_X^W(\tau)$, one can get
$$\widehat{\mathrm{Cov}}^S\left\{\hat S_k^{(mt)}(f_j), \hat S_{k'}^{(mt)}(f_{j'})\right\} = \left| \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, \hat\gamma_X^W(s-t)\, e^{-i2\pi(f_j s - f_{j'} t)} \right|^2 + \left| \sum_{s=1}^N \sum_{t=1}^N h_{s,k} h_{t,k'}\, \hat\gamma_X^W(s-t)\, e^{-i2\pi(f_j s + f_{j'} t)} \right|^2, \qquad (4.20)$$
for $f_j, f_{j'} \in \mathcal{F}_2$ and $k, k' = 1, 2, \ldots, K$. Then we can obtain an estimate of the covariance matrix $\mathrm{Cov}(Z_2)$, denoted $\widehat{\mathrm{Cov}}^S(Z_2)$, with the element for $f_j, f_{j'} \in \mathcal{F}_2$ calculated by
$$\widehat{\mathrm{Cov}}^S\left\{\hat S^{(mt)}(f_j), \hat S^{(mt)}(f_{j'})\right\} = \frac{1}{K^2} \sum_{k=1}^K \sum_{k'=1}^K \widehat{\mathrm{Cov}}^S\left\{\hat S_k^{(mt)}(f_j), \hat S_{k'}^{(mt)}(f_{j'})\right\}. \qquad (4.21)$$
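The inverse transform (4.19) is simple to implement directly; the numpy sketch below assumes the fitted SDF has already been evaluated on the grid $f_{j'} = j'/(2N)$, and the helper name `acvf_from_sdf` is illustrative.

```python
import numpy as np

def acvf_from_sdf(S_grid, N):
    """Inverse DFT (4.19): recover gamma_X(tau), tau = -(N-1),...,N, from an
    SDF sampled at f_{j'} = j'/(2N) for j' = -(N-1),...,N (2N grid points)."""
    j = np.arange(-(N - 1), N + 1)
    tau = np.arange(-(N - 1), N + 1)
    E = np.exp(1j * np.pi * np.outer(tau, j) / N)   # e^{i pi j' tau / N}
    return (E @ S_grid).real / (2.0 * N)

# white-noise sanity check: a flat unit SDF gives gamma(0) = 1, gamma(tau) = 0 else
gamma = acvf_from_sdf(np.ones(16), N=8)
```

For long series, the same computation can be routed through an FFT, but the direct form above mirrors (4.19) term by term.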

With the estimates $\hat D^W$, $\hat V^W$, and $\widehat{\mathrm{Cov}}^S(Z_2)$, we formulate the sandwich variance estimate for $\hat\beta^W_{\mathcal{I}}$ based on (4.13):
$$\widehat{\mathrm{Cov}}^S\{\hat\beta^W_{\mathcal{I}}\} = \left(\Phi_2^T \Phi_2\right)^{-1} \{\hat D^W\}^T \{\hat V^W\}^{-1}\, \widehat{\mathrm{Cov}}^S(Z_2)\, \{\hat V^W\}^{-T} \hat D^W \left(\Phi_2^T \Phi_2\right)^{-1}. \qquad (4.22)$$
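Assembling (4.22) is mostly matrix algebra; the numpy sketch below accepts any estimate of $\mathrm{Cov}(Z_2)$ as input (the function name `sandwich_cov_beta` is illustrative). As a sanity check, plugging in the naive choice $\mathrm{diag}\{\hat S^W(f_j)^2/K\}$ recovers $(\Phi_2^T\Phi_2)^{-1}/K$, mirroring the identity derived in Section 4.2.4.

```python
import numpy as np

def sandwich_cov_beta(Phi2, S_hat, cov_Z2):
    """Sandwich covariance (4.22): J^{-1} A J^{-1}, with D = diag(S_hat) Phi2
    and V = diag(S_hat^2) evaluated at the fitted SDF values S_hat."""
    D = S_hat[:, None] * Phi2                 # D-hat = diag(S_hat) Phi2
    V_inv = np.diag(1.0 / S_hat**2)           # V-hat^{-1} (diagonal)
    J_inv = np.linalg.inv(Phi2.T @ Phi2)      # J^{-1} = (Phi2^T Phi2)^{-1}
    A = D.T @ V_inv @ cov_Z2 @ V_inv.T @ D
    return J_inv @ A @ J_inv

# sanity check: the naive diag{S^2/K} choice collapses to (Phi2^T Phi2)^{-1}/K
rng = np.random.default_rng(1)
Phi2 = rng.standard_normal((20, 3))
S_hat = np.exp(rng.standard_normal(20))
K = 5
cov_naive = sandwich_cov_beta(Phi2, S_hat, np.diag(S_hat**2 / K))
```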

It follows that the sandwich covariance estimate for the vector of estimated log SDFs is $\widehat{\mathrm{Cov}}^S\{\Phi\hat\beta^W_{\mathcal{I}}\} = \Phi\, \widehat{\mathrm{Cov}}^S\{\hat\beta^W_{\mathcal{I}}\}\, \Phi^T$. Denote $\Omega_j^S$ as the $j$th diagonal element of $\widehat{\mathrm{Cov}}^S\{\Phi\hat\beta^W_{\mathcal{I}}\}$, for $j = 1, \ldots, M$. Then, by (4.12), as $M \to \infty$, an asymptotic $100(1-\alpha)\%$ sandwich-variance-based confidence interval for the log SDF at $f_j$, $j = 1, \ldots, M$, can be calculated as

$$\left[ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} - z_{\alpha/2}\sqrt{\Omega_j^S},\;\; \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} + z_{\alpha/2}\sqrt{\Omega_j^S} \right], \qquad (4.23)$$

where again $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. Taking the exponential of the lower and upper bounds, the $100(1-\alpha)\%$ sandwich-variance-based confidence interval for the SDF at $f_j$, $j = 1, \ldots, M$, can be obtained as
$$\left[ \exp\left\{ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} - z_{\alpha/2}\sqrt{\Omega_j^S} \right\},\;\; \exp\left\{ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} + z_{\alpha/2}\sqrt{\Omega_j^S} \right\} \right]. \qquad (4.24)$$

4.2.3 Fast variance approximation

The sandwich approach (4.13) to estimating the variance is relatively computationally expensive, especially as the number of tapers $K$ and the sample size $N$ grow. To speed up the calculation without compromising too much estimation accuracy, we demonstrate below the use of a variance approximation proposed by Walden et al. [1998] as an alternative to the sandwich variance estimation.

Based on the results in Thomson [1982], if $\{X_t\}$ is a Gaussian process with mean zero, and the SDF $S(f')$ is locally invariant around $f$ for $|f' - f|$ small, then the covariance between the eigenspectra of the MT estimates of the SDF is
$$\mathrm{Cov}\left( \hat S_k^{(mt)}(f), \hat S_{k'}^{(mt)}(f') \right) \approx S(f)\, S(f') \left| \sum_{t=1}^N h_{t,k} h_{t,k'}\, e^{i2\pi(f'-f)t} \right|^2.$$
By (4.14), we then have
$$\mathrm{Cov}\left( \hat S^{(mt)}(f_j), \hat S^{(mt)}(f_{j'}) \right) \approx S(f_j)\, S(f_{j'})\, \frac{1}{K^2} \sum_{k=1}^K \sum_{k'=1}^K \left| \sum_{t=1}^N h_{t,k} h_{t,k'}\, e^{i2\pi(f_{j'}-f_j)t} \right|^2,$$
for $f_j, f_{j'} \in \mathcal{F}_2$. Recall from (3.9) that, letting $R_j = \hat S^{(mt)}(f_j)/S(f_j)$ for $\{j : f_j \in \mathcal{F}_2\}$, it follows that
$$\mathrm{Cov}(R_j, R_{j'}) \approx \frac{1}{K^2} \sum_{k=1}^K \sum_{k'=1}^K \left| \sum_{t=1}^N h_{t,k} h_{t,k'}\, e^{i2\pi(f_{j'}-f_j)t} \right|^2,$$
for $f_j, f_{j'} \in \mathcal{F}_2$.

In Walden et al. [1998], $\log(1 + \mathrm{Cov}(R_j, R_{j'}))$ is plotted as a function of $f_{j'} - f_j$ for $K = 10$ sine tapers. Empirically, Walden et al. [1998] showed that the covariance $\mathrm{Cov}(R_j, R_{j'})$ can be approximated using the simple model
$$\log(1 + \mathrm{Cov}(R_j, R_{j'})) = \begin{cases} \psi'(K)\left(1 - \dfrac{|f_{j'} - f_j| N}{K+1}\right), & \text{if } |f_{j'} - f_j| \le (K+1)/N; \\ 0, & \text{otherwise}, \end{cases} \qquad (4.25)$$
where $\psi'(K)$ is the trigamma function evaluated at $K$ (see Proposition 2). Taking the inverse transformation, we denote the fast approximation to $\mathrm{Cov}(R_j, R_{j'})$ as
$$\mathrm{Cov}^{(fast)}(R_j, R_{j'}) = \begin{cases} \exp\left\{ \psi'(K)\left(1 - \dfrac{|f_{j'} - f_j| N}{K+1}\right) \right\} - 1, & \text{if } |f_{j'} - f_j| \le (K+1)/N; \\ 0, & \text{otherwise}. \end{cases} \qquad (4.26)$$
Multiplying by the fitted SDFs (4.7), a fast approximation of the covariance of the MT estimates between $f_j, f_{j'} \in \mathcal{F}_2$ can be calculated as
$$\widehat{\mathrm{Cov}}^{(fast)}\left( \hat S^{(mt)}(f_j), \hat S^{(mt)}(f_{j'}) \right) = \hat S^W(f_j)\, \hat S^W(f_{j'})\, \mathrm{Cov}^{(fast)}(R_j, R_{j'}).$$
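The approximation (4.26) is cheap to evaluate on a whole frequency grid at once; the numpy/scipy sketch below vectorizes it (the helper name `fast_cov_R` is illustrative, and scipy's `polygamma(1, K)` supplies the trigamma value $\psi'(K)$).

```python
import numpy as np
from scipy.special import polygamma

def fast_cov_R(freqs, K, N):
    """Fast approximation (4.26) to Cov(R_j, R_j') for K sine tapers,
    following Walden et al. [1998]."""
    psi1 = float(polygamma(1, K))                    # trigamma psi'(K)
    d = np.abs(freqs[:, None] - freqs[None, :])      # |f_j' - f_j| matrix
    inside = d <= (K + 1) / N                        # within the taper bandwidth
    return np.where(inside, np.exp(psi1 * (1.0 - d * N / (K + 1))) - 1.0, 0.0)

# frequencies farther apart than (K+1)/N are treated as uncorrelated
cov = fast_cov_R(np.array([0.10, 0.40]), K=10, N=1000)
```

Scaling this matrix elementwise by the outer product of the fitted SDF values gives the fast estimate of $\mathrm{Cov}(Z_2)$ described above.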

Denote $\widehat{\mathrm{Cov}}^{(fast)}(Z_2)$ as the resulting fast estimate of $\mathrm{Cov}(Z_2)$. Replacing the covariance estimate $\widehat{\mathrm{Cov}}^S(Z_2)$ in (4.22) by $\widehat{\mathrm{Cov}}^{(fast)}(Z_2)$, we can obtain a fast approximation to the sandwich variance estimate for $\hat\beta^W_{\mathcal{I}}$ based on (4.13):
$$\widehat{\mathrm{Cov}}^{(fast)}\{\hat\beta^W_{\mathcal{I}}\} = \left(\Phi_2^T \Phi_2\right)^{-1} \{\hat D^W\}^T \{\hat V^W\}^{-1}\, \widehat{\mathrm{Cov}}^{(fast)}(Z_2)\, \{\hat V^W\}^{-T} \hat D^W \left(\Phi_2^T \Phi_2\right)^{-1}. \qquad (4.27)$$

It follows that the fast approximation to the covariance of the vector of estimated log SDFs is $\widehat{\mathrm{Cov}}^{(fast)}\{\Phi\hat\beta^W_{\mathcal{I}}\} = \Phi\, \widehat{\mathrm{Cov}}^{(fast)}\{\hat\beta^W_{\mathcal{I}}\}\, \Phi^T$. Denote $\Omega_j^{(fast)}$ as the $j$th diagonal element of $\widehat{\mathrm{Cov}}^{(fast)}\{\Phi\hat\beta^W_{\mathcal{I}}\}$, for $j = 1, \ldots, M$. Then, by (4.12), as $M \to \infty$, the fast approximation approach calculates an asymptotic $100(1-\alpha)\%$ sandwich-variance-based confidence interval for the log SDF at $f_j$, $j = 1, \ldots, M$, as
$$\left[ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} - z_{\alpha/2}\sqrt{\Omega_j^{(fast)}},\;\; \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} + z_{\alpha/2}\sqrt{\Omega_j^{(fast)}} \right]. \qquad (4.28)$$
Taking the exponential of the lower and upper bounds, the $100(1-\alpha)\%$ confidence interval for the SDF at $f_j$, $j = 1, \ldots, M$, can be approximated as
$$\left[ \exp\left\{ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} - z_{\alpha/2}\sqrt{\Omega_j^{(fast)}} \right\},\;\; \exp\left\{ \phi_{\mathcal{I}}^T(f_j)\hat\beta^W_{\mathcal{I}} + z_{\alpha/2}\sqrt{\Omega_j^{(fast)}} \right\} \right]. \qquad (4.29)$$

4.2.4 Summary

Under suitable regularity conditions [Freedman, 2006], all of the variance estimation methods discussed in Section 4.2 are accompanied by an asymptotic Gaussianity of $\hat\beta^W_{\mathcal{I}}$ that can be written in the form of (4.12):
$$\hat\beta^W_{\mathcal{I}} \to_d N_{|\mathcal{I}|}\left( \beta_{\mathcal{I}},\; \mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\} \right), \quad \text{as } M \to \infty.$$
Hence, a corresponding asymptotic distribution of the fitted log SDF can always be expressed as
$$\Phi\hat\beta^W_{\mathcal{I}} \to_d N_M\left( \Phi\beta_{\mathcal{I}},\; \Phi\, \mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\}\, \Phi^T \right), \quad \text{as } M \to \infty. \qquad (4.30)$$
Based on (4.30), pointwise asymptotic confidence intervals for the SDF can be constructed with any of the estimates of the covariance matrix $\mathrm{Cov}\{\hat\beta^W_{\mathcal{I}}\}$ discussed previously.

Notice that the naive variance estimate (4.8) can be obtained from (4.13) if we simply replace $\mathrm{Cov}(Z_2)$ by $\widehat{\mathrm{Cov}}^{(naive)}(Z_2) = \mathrm{diag}\{\mathrm{Var}(Z_j)\}_{j : f_j \in \mathcal{F}_2} = \mathrm{diag}\{\mu_j^2/K\}_{j : f_j \in \mathcal{F}_2}$, ignoring the correlation between the different frequencies:
$$\begin{aligned}
J^{-1} A J^{-1} &= J^{-1} \left\{ D^T V^{-1}\, \mathrm{diag}\{\mu_j^2/K\}_{j : f_j \in \mathcal{F}_2}\, V^{-T} D \right\} J^{-1} \\
&= J^{-1} \left\{ D^T V^{-1} (V/K) V^{-T} D \right\} J^{-1} \\
&= J^{-1} \{J/K\} J^{-1} \\
&= J^{-1}/K = \left(\Phi_2^T \Phi_2\right)^{-1}/K = \mathrm{Cov}^{(naive)}\{\hat\beta^W_{\mathcal{I}}\}.
\end{aligned}$$

Thus, the naive estimator, the sandwich estimator, and its fast alternative all fit into the sandwich variance estimation structure (4.13) with different choices of $\mathrm{Cov}(Z_2)$. The only difference lies in how $\mathrm{Cov}(Z_2)$ is estimated:

1. The naive method in Section 4.2.1 simply applies $\widehat{\mathrm{Cov}}^{(naive)}(Z_2) = \mathrm{diag}\{\mathrm{Var}(Z_j)\}_{j : f_j \in \mathcal{F}_2}$;

2. The sandwich method in Section 4.2.2 follows (4.14), estimating (4.15) or (4.17);

3. The fast procedure in Section 4.2.3 replaces the cumbersome calculation of (4.15) or (4.17) by scaling the easily computed approximation (4.26).

Note that, under the respective assumptions discussed in each of the previous subsections, the three methods can easily be extended to scenarios without sample splitting.

4.3 Proper scoring rules

In this section, we review the general concepts of proper scoring rules and the foundations of the interval score proposed by Gneiting and Raftery [2007]. The interval score will be applied to evaluate the constructed pointwise confidence intervals for the SDF in our simulation study.

4.3.1 Scoring rules and propriety

According to Gneiting and Raftery [2007], scoring rules are tools that quantify the goodness of a probabilistic forecast through a numerical score computed from the predictive distribution and the event that materializes. The theory and properties of proper scoring rules on a general probability space are formalized by Gneiting and Raftery [2007]. Adopting notation similar to that of Gneiting and Raftery [2007], we summarize some relevant concepts and properties below.

For a general sample space $\Omega$, let $\mathcal{A}$ be a $\sigma$-field of subsets of $\Omega$ (see, e.g., Resnick [2003]). Let $\mathcal{P}$ denote a convex class of probability measures on $(\Omega, \mathcal{A})$. A probability forecast can be any probability measure $P \in \mathcal{P}$. The convex set $\mathcal{P}$ has the property that, if $P, Q \in \mathcal{P}$, then $\delta P + (1-\delta) Q \in \mathcal{P}$ for any $\delta \in [0, 1]$ (see, e.g., Klee [1971]). Denote the extended real line as $\bar{\mathbb{R}} = [-\infty, \infty]$ (i.e., the real line supplemented with positive and negative infinity).

Definition 3 (Gneiting and Raftery [2007], Section 2.1). A function $g : \Omega \to \bar{\mathbb{R}}$ is $\mathcal{P}$-quasi-integrable if it is measurable with respect to $\mathcal{A}$ and is quasi-integrable (the integral exists and takes values in $\bar{\mathbb{R}}$; see Bauer [2001], p. 64) with respect to all $Q \in \mathcal{P}$. A scoring rule is any function $S_c : \mathcal{P} \times \Omega \to \bar{\mathbb{R}}$ such that $S_c(P, \cdot)$ is $\mathcal{P}$-quasi-integrable for all $P \in \mathcal{P}$.

Definition 4 (Gneiting and Raftery [2007], Section 2.1). For $P \in \mathcal{P}$, denote the expected score of the probability forecast $P$ under $Q \in \mathcal{P}$ as $S_c(P, Q) = \int S_c(P, \omega)\, dQ(\omega)$. If
$$S_c(Q, Q) \ge S_c(P, Q), \quad \text{for all } P, Q \in \mathcal{P}, \qquad (4.31)$$
then the scoring rule $S_c$ is said to be proper relative to $\mathcal{P}$. If equality in (4.31) holds if and only if $P = Q$, then the scoring rule $S_c$ is said to be strictly proper relative to $\mathcal{P}$.

Note that property (4.31) is called risk unbiasedness in Lehmann and Casella [1998, p. 157].

Definition 5 (Gneiting and Raftery [2007], Section 2.1). For a scoring rule $S_c : \mathcal{P} \times \Omega \to \bar{\mathbb{R}}$, if $S_c(P, Q) \in \mathbb{R}$ for all $P, Q \in \mathcal{P}$, except possibly that $S_c(P, Q) = -\infty$ for $P \neq Q$, then the scoring rule is regular relative to $\mathcal{P}$.

Definition 6 (Grünwald and Dawid [2004] and Gneiting and Raftery [2007, Section 2.2]). For a proper scoring rule $S_c$ relative to $\mathcal{P}$, the expected score function $G : \mathcal{P} \to \mathbb{R}$ defined by
$$G(P) = \sup_{Q \in \mathcal{P}} S_c(Q, P), \quad P \in \mathcal{P}, \qquad (4.32)$$
is called the information measure or generalized entropy function. If $S_c$ is regular and proper, then
$$D(P, Q) = S_c(Q, Q) - S_c(P, Q), \quad P, Q \in \mathcal{P}, \qquad (4.33)$$
is called the associated divergence function.

Example (logarithmic score). An important scoring rule related to our analysis in later chapters is the logarithmic score [Good, 1952] for continuous variables. Let $Z : \Omega \to \mathbb{R}$ be a continuous random variable. Denote by $\mathcal{P}_Z$ the class of probability distributions for $Z$ such that, for any $P \in \mathcal{P}_Z$, there exists a probability density function $p(z)$ with $p(z) = dP(z)/dz$ on the domain $\mathcal{Z} \subseteq \mathbb{R}$. If the integral $\int \log p(z)\, dQ(z)$ exists in $\bar{\mathbb{R}}$ for any $P, Q \in \mathcal{P}_Z$, we can define the logarithmic score on $\mathcal{P}_Z$ as
$$S_c^{\log}(P, z) = \log p(z), \quad z \in \mathcal{Z}, \qquad (4.34)$$
where $P$ is a distribution forecast and $z$ represents an observation of $Z$. By Definition 3, $S_c^{\log}(P, z)$ is $\mathcal{P}_Z$-quasi-integrable for all $Q \in \mathcal{P}_Z$; thus $S_c^{\log}(P, z)$ is a scoring rule.

It can be shown that $S_c^{\log}(P, z)$ is strictly proper. With the expected score denoted as $S_c^{\log}(P, Q) = \int S_c^{\log}(P, z)\, dQ(z)$ for all $P, Q \in \mathcal{P}_Z$,
$$\begin{aligned}
S_c^{\log}(Q, Q) - S_c^{\log}(P, Q) &= \int S_c^{\log}(Q, z)\, dQ(z) - \int S_c^{\log}(P, z)\, dQ(z) \\
&= \int \log q(z)\, dQ(z) - \int \log p(z)\, dQ(z) \\
&= \int \{ -\log p(z) + \log q(z) \}\, dQ(z) \\
&= \int -\log\frac{p(z)}{q(z)}\, dQ(z) \\
&\ge -\log \int \frac{p(z)}{q(z)}\, dQ(z) \quad \text{(by Jensen's inequality)} \\
&= -\log \int \frac{p(z)}{q(z)}\, q(z)\, dz \\
&= -\log \int p(z)\, dz = -\log \int dP(z) = -\log 1 = 0.
\end{aligned}$$
Thus, $S_c^{\log}(Q, Q) \ge S_c^{\log}(P, Q)$, where equality holds if and only if $P = Q$, by the strict convexity of the negative logarithm. By Definition 4, the scoring rule $S_c^{\log}(P, z)$ is strictly proper relative to $\mathcal{P}_Z$.

The negative information measure (see Definition 6) of the logarithmic score is
$$-S_c^{\log}(P, P) = -\int p(z) \log p(z)\, dz,$$
which is the well-known Shannon entropy [Shannon, 1948]. Meanwhile, for the logarithmic score, the divergence function is
$$D_{KL}(P, Q) = S_c^{\log}(Q, Q) - S_c^{\log}(P, Q) = \int q(z) \log\frac{q(z)}{p(z)}\, dz, \qquad (4.35)$$
which is the Kullback-Leibler (KL) divergence [Kullback and Leibler, 1951] of $P$ from $Q$.
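The nonnegativity shown above can be checked numerically; the scipy sketch below evaluates (4.35) by quadrature for two unit-variance Gaussian distributions (a setting chosen purely for illustration, where the closed form gives $(\mu_p - \mu_q)^2/2 = 0.5$), with `kl_divergence` a hypothetical helper name.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_divergence(p_pdf, q_pdf, lo=-20.0, hi=20.0):
    """Evaluate the divergence (4.35), int q(z) log(q(z)/p(z)) dz, by quadrature."""
    def integrand(z):
        q = q_pdf(z)
        return q * np.log(q / p_pdf(z)) if q > 0 else 0.0
    value, _ = quad(integrand, lo, hi)
    return value

# divergence of the forecast P = N(1,1) from the truth Q = N(0,1);
# for unit-variance Gaussians the closed form is (mu_p - mu_q)^2 / 2 = 0.5
d = kl_divergence(norm(loc=1.0, scale=1.0).pdf, norm(loc=0.0, scale=1.0).pdf)
```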

Strictly proper scoring rules can be used as utility and loss functions in estimation problems. As discussed in Gneiting and Raftery [2007], the optimum score estimator obtained by minimizing a proper score is also known as a minimum contrast estimator [Pfanzagl, 1969], one type of M-estimator [Huber, 1992]. As a special case, maximum likelihood estimation and the quasi-likelihood method can be formulated with the logarithmic score. Moreover, the use of proper scoring rules extends to model evaluation. Following Gneiting and Raftery [2007], an estimation procedure $P$ can be compared with other estimation procedures using the average score over a fixed set of points, that is, by calculating

$$\bar S_c(P) = \frac{1}{n} \sum_{j=1}^n S_c(P(\omega_j), \omega_j), \qquad (4.36)$$
where $\{\omega_1, \omega_2, \ldots, \omega_n\}$ is the set of $n$ reference states. The IRMSE in (3.19) is such an example.

4.3.2 Interval scores

To evaluate the performance of the confidence intervals for the SDF discussed in Section 4.2, we adapt the interval score introduced by Gneiting and Raftery [2007] to our setting. The interval score is more appealing than using the coverage rate alone for interval evaluation, in the sense that the assessment takes into account not only the coverage but also the width of an interval [Gneiting and Raftery, 2007]. For a central $100(1-\alpha)\%$ interval, denote its lower bound (the $(\alpha/2)$th quantile) as $B_L$ and its upper bound (the $(1-\alpha/2)$th quantile) as $B_U$. If the true value $\zeta$ is provided as the reference, then the interval score can be defined as
$$S^{int}_\alpha(B_L, B_U, \zeta) = (B_U - B_L) + \frac{2}{\alpha}(B_L - \zeta)\,\mathbb{1}_{\{\zeta < B_L\}} + \frac{2}{\alpha}(\zeta - B_U)\,\mathbb{1}_{\{\zeta > B_U\}}, \qquad (4.37)$$

where $\mathbb{1}_{\{\zeta < B_L\}}$ and $\mathbb{1}_{\{\zeta > B_U\}}$ are binary indicator functions, equal to 1 if the condition inside the braces holds and 0 otherwise. Notice that the interval score is negatively oriented, rewarding narrower intervals and penalizing failures to capture the true value. Thus, interval estimates with smaller interval scores are preferred. Following the proof in Gneiting and Raftery [2007, Section 6], the negative interval score is a proper scoring rule.
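A direct, vectorized implementation of (4.37) is straightforward; the numpy sketch below uses the illustrative helper name `interval_score`.

```python
import numpy as np

def interval_score(B_L, B_U, zeta, alpha=0.05):
    """Interval score (4.37) for a central 100(1-alpha)% interval [B_L, B_U]
    against the true value zeta; negatively oriented, so smaller is better."""
    B_L, B_U, zeta = np.asarray(B_L), np.asarray(B_U), np.asarray(zeta)
    width = B_U - B_L
    below = (2.0 / alpha) * (B_L - zeta) * (zeta < B_L)   # penalty if zeta < B_L
    above = (2.0 / alpha) * (zeta - B_U) * (zeta > B_U)   # penalty if zeta > B_U
    return width + below + above

covered = interval_score(0.0, 1.0, 0.5)   # truth inside: score equals the width
missed = interval_score(0.0, 1.0, 2.0)    # truth above: width plus (2/alpha)*excess
```

Because the function is elementwise, averaging it over a frequency grid gives the average interval score used later in (4.38).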

4.4 Simulation studies

Our simulation study has three aims: to assess the quality of point estimation and its impact on the interval estimates based on sample splitting, to compare the performance of the different interval estimation approaches, and to provide some insight into choosing an optimal number of tapers for inference about the SDF. We simulate the three Gaussian processes used

in Section 3.5. Sample sizes N = 256, 512, 1024, and 2048 are examined. In each case,

we use sine tapers for the MT spectral estimate, and we generate the initial basis matrix

Φ with the maximum number of LA(8) wavelet basis functions, p = M + 1, including the

intercept. Following the sample splitting strategy proposed in Section 4.1.2, we select the

bases by applying the $L_1$ penalized MT-Whittle method to the first half of the data $\mathcal{D}_{1,M_1}$, with the universal threshold as the tuning parameter. Point estimates and interval estimates of the SDF are then obtained from the second half of the data $\mathcal{D}_{2,M_2}$ with the selected bases. As an essential ingredient of a confidence interval, the unpenalized point estimator $\hat S^W(f)$ in (4.7) is evaluated on absolute bias and the IRMSE. For interval estimates of the SDF, the coverage rate and the average interval score are employed as our measures of performance. For the second half of frequencies $\mathcal{F}_2$ defined in (4.2), let $[\tilde B_L(f_j), \tilde B_U(f_j)]$ denote any interval estimate of the log SDF at frequency $f_j \in \mathcal{F}_2$. The average interval score is then calculated as
$$\frac{1}{\lfloor M/2 \rfloor} \sum_{j : f_j \in \mathcal{F}_2} S^{int}_\alpha\left( \tilde B_L(f_j), \tilde B_U(f_j), \log S(f_j) \right). \qquad (4.38)$$

We present the average absolute bias, the IRMSE, the average coverage, and the mean of the average interval score over 1000 realizations of each process. As in the case of point estimation, the number of tapers $K$ plays an important role in the quality of an interval estimate. A sequence of values of $K$ was examined in each simulated case.

Figures 4.1, 4.2, and 4.3 show the simulation results for estimating the SDFs of the AR(2), AR(4), and high-order MA processes, respectively. In each figure, we display the average absolute bias vs. $K$ (first row), the mean IRMSE vs. $K$ (second row), the average coverage vs. $K$ (third row), and the average interval score vs. $K$ (fourth row), with the sample size $N$ varying from 256, 512, and 1024 to 2048 across the columns. For each evaluation metric, 95% bootstrap confidence intervals are drawn as solid vertical bars at each $K$; these are very narrow in these plots.

Unsurprisingly, the average absolute bias (first rows) and the mean IRMSE (second rows) of the unpenalized SDF estimate $\hat S^W(f)$ still exhibit quadratic relationships with the number of tapers $K$. As the sample size $N$ increases, the entire quadratic curves descend, for both the bias and the IRMSE, across all three processes. The dashed vertical lines in Figures 4.1, 4.2, and 4.3 show the locations of the minima, which indicate that the optimal number of tapers for minimizing the absolute bias or the IRMSE grows as the sample size $N$ increases. The IRMSE requires a slightly greater $K$ than the bias to attain its minimum.

[Figure 4.1 about here: four rows of panels (average absolute bias, mean IRMSE, average coverage, and average interval score, each plotted against K) for N = 256, 512, 1024, and 2048; the interval-score panels compare the naive method, the sandwich with true Cov(Z), the fast approximation, and the theoretical sandwich.]

Figure 4.1: Plots for evaluating the log SDF estimates of the AR(2) process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), and average interval score (fourth row) against the number of tapers K. The sample size N (256, 512, 1024, 2048) varies across the columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are on different scales across the columns.

[Figure 4.2 about here: same panel layout and legend as Figure 4.1, for the AR(4) process.]

Figure 4.2: Plots for evaluating the log SDF estimates of the AR(4) process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), and average interval score (fourth row) against the number of tapers K. The sample size N (256, 512, 1024, 2048) varies across the columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are on different scales across the columns.

[Figure 4.3 about here: same panel layout and legend as Figure 4.1, for the high-order MA process.]

Figure 4.3: Plots for evaluating the log SDF estimates of the MA process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), and average interval score (fourth row) against the number of tapers K. The sample size N (256, 512, 1024, 2048) varies across the columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are on different scales between the columns.

The following interval estimation methods are compared in terms of the coverage rate (third rows) and the interval score (fourth rows) in the plots:

1. The naive method (4.10) in Section 4.2.1 (gray line);

2. The fast approximation (4.28) to the sandwich method in Section 4.2.3 (blue line);

3. The sandwich procedure in Section 4.2.2 with matrices $\hat D^W$ and $\hat V^W$ calculated using the MT-Whittle estimate $\hat S^W(f)$, but applying the true value of $\mathrm{Cov}(Z_2)$ (green line);

4. The confidence interval with the theoretical sandwich covariance in Section 4.2.2, calculated using the true values of the matrices $D$, $V$, and $\mathrm{Cov}(Z_2)$ (red line).

The above methods all use the same MT-Whittle estimate $\hat S^W(f)$ as the point estimate of the SDF. The first three methods all fit into the sandwich structure (4.13) for covariance estimation, and their only difference lies in the estimation of $\mathrm{Cov}(Z_2)$. For example, a comparison of methods 2 and 3 tells how well the fast approximation to $\mathrm{Cov}(Z_2)$ behaves. Method 4 is used as a baseline for evaluating the variance quantification methods. Comparing method 3 to method 4, a deviation in the plot shows the overall impact of the point estimator $\hat S^W(f)$ on the sandwich covariance estimation, beyond the effect of estimating $\mathrm{Cov}(Z_2)$. The plots show a clear advantage of the sandwich method and its fast approximation over the naive method in quantifying the variability of $\hat S^W(f)$: the green and blue lines always attain coverage rates much closer to the nominal 95% level, and noticeably lower interval scores, than the gray lines, at every given $K$. In each plot, the blue line and the green line are quite close to each other, indicating that the fast approximation imitates the sandwich approach quite well.

Recall from previous chapters that, as K increases, we lose resolution in the estimated SDF but decrease the variance of the MT spectral estimator. This tradeoff is reflected directly in the plot of IRMSE vs. K for the point estimate. However, an interval estimate

is a joint outcome of the point estimate SbW(f) and the chosen variance estimation

process. As K increases, the correlation between MT estimates across the Fourier frequencies alters, and we see a more complex relationship of coverage rate vs. K, rather than just a quadratic relationship. As shown in the third rows of Figures 4.1, 4.2, and 4.3, the relationship of coverage vs. K varies between different processes and even has distinct patterns for different sample sizes of the same time series (see, e.g., Figure 4.2 for the AR(4) process).

The value of K that maximizes the coverage shifts irregularly with an increasing sample size for the estimated sandwich method and its fast approximation. While the AR(2) process always favors a small number of tapers for achieving the nominal coverage, the best number of tapers for the MA process is often relatively large. It is hard to determine a universal value of K for the optimal coverage rate across different time series processes. However, as

discussed below, the value of K for the optimal interval score is more tractable.

In terms of choosing a number of tapers K for minimizing the average interval score,

we observe that the relationship of average interval score vs. K behaves quite similarly to the relationship of the IRMSE vs. K: the curves are quadratically shaped; as the sample size increases, the overall curve descends, and the optimal value of K grows. As shown in

Figures 4.1, 4.2, and 4.3, for a given sample size N, the number of tapers of a minimal

IRMSE is usually very close to the value of K required for minimizing the average interval score of the sandwich method. It is also noticed that, when a minimal IRMSE is attained, there is usually little difference between the estimated and theoretical sandwich procedures with respect to the coverage rate and the interval score. A higher IRMSE comes with a greater deviation between different sandwich procedures. As the sample size increases,

the deviations in the coverage rate and interval score shrink between the estimated sandwich approaches (green and blue lines) and the theoretical sandwich method (red line).

To further understand how the interval estimates perform across the frequency domain, we plot pointwise values of the bias, root mean squared error (RMSE), coverage rate, and expected interval score, based on 1000 simulation replicates, across the Fourier frequencies in (0, 1/2). We use sample size N = 2048 and K = 15 sine tapers for the three

Gaussian processes as examples for the presentation. The results are shown in Figure 4.4.

A lack of resolution in the estimated SDF is reflected by the large biases around the peaks, which are more severe for the AR(4) and MA processes than for the AR(2) process. In terms of the evaluation metrics, both the point estimates and the interval estimates perform worse around the peaks than elsewhere. Away from the peaks, there is also a high-frequency region where the SDF features are captured less well: for example, the frequencies in [0.3, 0.4] for the AR processes, and the frequencies in [0.4, 0.5] for the MA process. This is not surprising, due to the remaining leakage of the MT estimates at frequencies close to the Nyquist frequency fN = 1/2 [see, e.g., Percival and Walden, 1993] and the artificial wiggles generated by the wavelet regression at the boundary [see, e.g., Oh et al., 2001, for more discussion of the boundary problem].

Compared to the naive method (gray line) for interval estimation, the sandwich method (green line) and its fast approximation (blue line) gain considerable improvement at almost every Fourier frequency, in terms of higher coverage rates and lower interval scores in Figure 4.4. Applying 15 sine tapers to a time series of length 2048, the results of the sandwich approach and its fast approximation are very close to the theoretical sandwich method (red line) at almost all Fourier frequencies.


Figure 4.4: Plots for evaluating the log SDF estimates across the frequencies for the three processes of length N = 2048 using K = 15 sine tapers: bias (first row), RMSE (second row), coverage (third row), expected interval score (fourth row).

In Figure 4.4, we notice that the interval score follows a pattern similar to the RMSE over the frequencies. This is consistent with the previous simulation, in which the optimal IRMSE and the optimal average interval score were attained at the same, or nearly the same, value of K. For our three time series examples, the quality of the point estimate has a direct impact on the interval score of the interval estimates.

We provide below an empirical rule for choosing the number of tapers K, with the goal of

achieving a reasonably good interval score.

4.4.1 The choice of number of tapers

In the literature, the most relevant result on choosing the number of tapers is given by Riedel and Sidorenko [1995] for minimizing the asymptotic IRMSE of kernel-smoothed MT estimators, where the optimal choice of K is shown to satisfy K ∝ N^{8/15}. The MT-Whittle estimate SbW(f) can be viewed as a type of smoothed MT estimator, and we can heuristically derive the relation between K and N by the following argument. When using K sine tapers, the spectral energy is concentrated in a frequency band with half bandwidth W = (K + 1)/{2(N + 1)} ≈ K/(2N) for large enough N. The fitted log SDF has standard error proportional to (1/K)^{1/2} by (2.12) and (4.13). Adapting the normal reference rule for minimizing the IRMSE of kernel density estimators [see Scott, 1992, p. 144], the optimal bandwidth satisfies W ∝ (1/K)^{1/2}(N/2)^{−1/5}. Therefore, we need

K/(2N) ∝ (1/K)^{1/2}(N/2)^{−1/5}, which implies

K ∝ N^{8/15}. (4.39)

To implement the optimal number of tapers K in practice, we empirically determine a constant multiplier in front of N^{8/15} in (4.39) based on the simulated AR(2), AR(4), and MA processes. As before, estimation and inference were performed with sample splitting.

Our presentation focuses on the optimal K for the estimated sandwich methods. To make a default recommendation of K for all tested processes, we allow the choice of K to achieve the minimal interval score or to inflate the interval score by no more than 10% relative to the minimum. Based on the results in Figures 4.1, 4.2, and 4.3, a reasonable recommendation that achieves a close-to-optimal interval score is

K ≈ (1/4)N^{8/15}. (4.40)

By (4.40), the recommended values of K are listed in Table 4.1 for the sample sizes considered, and are applied in later simulations. According to the fourth rows of Figures 4.1, 4.2, and 4.3, the suggested numbers of tapers work well for the AR(2) and AR(4) processes, and also perform decently for the high-order MA process when the sample size is greater than 512.

Note that the number of tapers K that minimizes the IRMSE may be slightly larger, and if the goal is to achieve the best coverage rate, the optimal choice of K is more process-dependent.

Table 4.1: Recommended number of sine tapers, K, for achieving a close-to-minimal interval score for a given sample size N.

N:  128  256  512  1024  2048  4096
K:    3    5    7    10    15    21
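Rule (4.40) is simple to apply directly. The following minimal Python sketch (the function name is ours, not from the dissertation) reproduces Table 4.1 by rounding (1/4)N^{8/15} to the nearest integer:

```python
def recommended_tapers(N):
    """Recommended number of sine tapers, K ~ (1/4) N^(8/15) as in (4.40),
    rounded to the nearest integer."""
    return int(round(0.25 * N ** (8 / 15)))

# Reproduce Table 4.1 for the sample sizes used in the simulations.
table = {N: recommended_tapers(N) for N in (128, 256, 512, 1024, 2048, 4096)}
print(table)  # {128: 3, 256: 5, 512: 7, 1024: 10, 2048: 15, 4096: 21}
```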

With the recommended K in Table 4.1, we implement sample splitting and the interval

estimations discussed previously for sample sizes N = 256, 512, 1024, 2048, and 4096. For

the three simulated time series, the results of average absolute bias (first row), mean IRMSE

(second row), average coverage (third row) and average interval score (fourth row), with their

95% bootstrap confidence intervals, are presented in Figure 4.5 against the sample size.

Figure 4.5: Plots of the four evaluation metrics against sample size N for the three processes using the suggested number of sine tapers, K = (1/4)N^{8/15}: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row). The vertical bars show 95% bootstrap confidence intervals at each N.

The 95% bootstrap confidence intervals at each N are too narrow to be seen. Figure 4.5 shows that the MT-Whittle estimate SbW(f) has decreasing average absolute bias and IRMSE as the sample size N increases for all three processes. The plots of interval scores vs. N for the

interval estimation methods follow a very similar decreasing pattern as the IRMSE. Still, all sandwich methods outperform the naive interval estimation. With the recommended K, there is very little difference in the interval scores between the estimated sandwich method (green line), the fast sandwich approximation (blue line), and the theoretical sandwich result (red line); and the results of the estimated sandwich methods converge to the theoretical sandwich result as the sample size N increases for the simulated time series. This supports the effectiveness of using the suggested K ≈ (1/4)N^{8/15} for optimizing the IRMSE and the interval score.

In terms of the average coverage rate, the lines of the estimated sandwich intervals and the fast approximation method also approach the line of the theoretical sandwich interval as the sample size increases. The nominal 95% confidence intervals of the sandwich methods raise the coverage to above 80%, and we see an upward trend when extending to time series samples of size 4096. In contrast, the naive approach always has coverage below 50%. By taking into account the correlations of MT estimates between frequencies, the proposed sandwich methods can benefit the practice of SDF inference with improved reliability. We recommend the use of the fast sandwich approximation for computational efficiency.

4.5 Application to EEG signals

As an application of the proposed inference approach based on sample splitting, we revisit the electroencephalogram (EEG) example presented in Chapter 3. Recall that the

data records the left and right front cortex activities of a male rat [van Luijtelaar, 1997], where relevant analysis can help the study of human epilepsy [Quiroga et al., 2002]. Collected at a sampling rate of 200 Hz, the time series in microvolts (µV) are shown in panels (a) and (b) of Figure 3.8.


Figure 4.6: Plots of interval estimates for the left and right EEG channels. In each plot, the solid black line shows the point estimate of the SDF. The dashed red lines display the pointwise 95% confidence intervals (CI) constructed based on the sandwich method, while the dashed blue lines exhibit the interval estimates from the fast approximation approach.

We analyze the SDFs of the left and right channels separately using our sample splitting approach. As in the previous chapter, the time series of both channels are mean padded to a length of N = 1024 prior to analysis, and the LA(8) wavelet basis is used for the SDF estimation. The splitting reduces the sample size for estimation by half. To counteract the loss of resolution and capture SDF features comparable to the previous chapter, we

reduce the number of sine tapers used from K = 5 to K = 4. Recall from the previous section that the optimal number of tapers is process-dependent and may vary greatly if the scoring rule changes. Note that the EEG signals have more complex features than our simulation examples, so a smaller number of tapers helps preserve a higher resolution in the SDF estimation. Following the sample splitting strategy proposed in Section 4.1.2, we apply the L1 penalized MT-Whittle method with a universal threshold to select bases based on the first half of the data. With the selected bases, point estimates of the SDFs are obtained by optimizing the unpenalized MT-Whittle likelihood function, and the sandwich method (4.24) and the fast approximation approach (4.29) are used to construct pointwise

95% confidence intervals of the SDFs at the Fourier frequencies.

We display the point estimates and interval estimates of the SDFs in Figure 4.6. Though the sample size for inference is reduced by splitting, with a smaller number of tapers the point estimates capture features similar to those in the previous chapter. We notice that the point estimates of the SDFs (solid black lines) now have greater magnitudes, since the unpenalized result is used instead of an L1 penalized estimate. More variability is observed at the sharp turning points of the SDFs, where the interval estimates are wider than in other areas. The pointwise 95% confidence intervals constructed by the sandwich approach (red dashed lines) and the intervals based on the fast sandwich approximation (blue dashed lines) are very close to each other, which tells a story consistent with the previous simulation examples.

4.6 Conclusion

This chapter proposed a sample splitting strategy for the statistical inference of L1 penalized basis regression models for SDFs of regularly sampled univariate stationary processes. To implement the interval estimation of the univariate SDF based on the MT-Whittle framework, a sandwich approach was considered to incorporate the correlations of MT estimates between frequencies. With a suggested number of tapers, the sandwich approach outperforms a naive method that assumes independence, in terms of both the coverage rate and the interval score. A fast approximation of the sandwich estimator was provided to accelerate the computation. In the next chapter, we explore the spectral properties of multivariate stationary time series and see how the methods developed previously can be extended to the multivariate case.

Chapter 5: Extensions to spectral analysis of stationary multivariate time series

The spectral analysis of multivariate time series is more complicated than the univariate case. This is in part due to the multi-dimensionality of the time series processes and the related matrix generalization of the spectrum, and also due to the fact that the off-diagonal elements of the spectral density matrix (SDM) are complex-valued. In this chapter, we review the definitions used in the spectral analysis of regularly sampled multivariate stationary processes and examine the statistical properties of the corresponding MT spectral estimators. We propose an extension of the L1 penalized MT-Whittle method to the multivariate case for estimating the SDM. A proximal gradient descent algorithm is also demonstrated for solving the associated optimization problem, where the Cholesky decomposition elements of the Hermitian positive definite SDM, as a function of frequency, are estimated simultaneously.

5.1 Foundations of multivariate spectral analysis

5.1.1 Covariance matrix of multivariate processes

Suppose that {X(t): t ∈ Z} is a real-valued m-dimensional stationary time series with sampling interval ∆ > 0, where X(t) = (X1(t), X2(t), ..., Xm(t))^T. Without loss of generality, assume ∆ = 1 and E{X(t)} = 0. According to Priestley [1981], the covariance matrix of X(t) as a function of time lag τ is defined as

Γ(τ) = E{X(t + τ)XH (t)}, τ = 0, ±1, ±2,..., (5.1)

where X^H(t) denotes the Hermitian transpose of X(t), i.e., the transpose of the elementwise complex conjugate of X(t). For a, b = 1, ..., m, the element in the ath row and bth column of Γ(τ) is

γab(τ) = Cov{Xa(t + τ), Xb(t)} = E{Xa(t + τ)Xb^*(t)}, τ = 0, ±1, ±2, ..., (5.2)

which is called the cross-covariance function between the component series Xa(t) and Xb(t) when a ≠ b. The cross-covariance function degenerates to the ACVF of Xa(t) when a = b [Priestley, 1981]. Notice that, for τ = 0, ±1, ±2, ..., and a, b = 1, ..., m,

γba(τ) = Cov{Xb(t + τ), Xa(t)} = Cov{Xb(t), Xa(t − τ)}
= E{Xb(t)Xa^*(t − τ)} = E{Xa^*(t − τ)Xb(t)}
= γab^*(−τ),

and thus Γ^H(τ) = Γ(−τ). Since the vector process X(t) is assumed to be real-valued, the covariance matrix Γ(τ) is also real-valued by (5.1). In this case, one can write γba(τ) = γab(−τ) for a, b ∈ {1, ..., m}, and Γ(−τ) = Γ^T(τ). However, for a ≠ b, γab(−τ) need not be equal to γab(τ). This means that, for any given τ, the covariance matrix Γ(τ) may not be a symmetric matrix. Additionally, unlike the diagonal elements, the off-diagonal elements of Γ(τ) may not attain their maximum value at τ = 0.

When a ≠ b, the cross-correlation function [Priestley, 1981, Section 9.1] between the component series {Xa(t)} and {Xb(t)} is defined as

ρab(τ) = γab(τ) / {γa(0)γb(0)}^{1/2}, for all integers τ, (5.3)

where γa(τ) and γb(τ) denote the ACVFs of {Xa(t)} and {Xb(t)}, respectively. The cross-correlation coefficient satisfies |ρab(τ)| ≤ 1; that is, |γab(τ)| ≤ {γa(0)γb(0)}^{1/2} for all τ. When

a = b, it degenerates to the auto-correlation function of {Xa(t)}. The m × m matrix ρ(τ)

with elements ρab(τ) in row a, column b, for a, b = 1, ..., m, is defined as the correlation matrix of the process {X(t): t ∈ Z} [Priestley, 1981, Section 9.1]. Based on Brockwell and Davis [1991], some important properties of the covariance matrix

Γ(τ) of a real-valued stationary multivariate process {X(t): t ∈ Z} can be summarized as follows:

1. Γ(τ) is a real-valued matrix at any given time lag τ;

2. Γ(−τ) = Γ^T(τ), i.e., γab(−τ) = γba(τ) for a, b = 1, ..., m, for all τ;

3. |γab(τ)| ≤ {γa(0)γb(0)}^{1/2} for all τ;

4. Σ_{t′,t=1}^{l} c_{t′}^T Γ(t′ − t) c_t = E[{Σ_{t=1}^{l} c_t^T X(t)}^2] ≥ 0 for all l ∈ {1, 2, ...} and all c_1, ..., c_l ∈ R^m, which means that Γ(τ) is positive semi-definite, for all τ.
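These properties are easy to check numerically on sample cross-covariances. The sketch below (our own illustration with a hypothetical bivariate series, not code from the dissertation) estimates Γ(τ) by the biased sample analogue of (5.1) and verifies properties 2 and 3:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
X = rng.standard_normal((N, 2))
X[:, 1] += 0.5 * np.roll(X[:, 0], 1)   # X2(t) depends on X1(t-1): lagged cross-correlation
Xc = X - X.mean(axis=0)                # center, since Gamma assumes a zero-mean process

def gamma_hat(Xc, tau):
    """Biased sample version of Gamma(tau) = E{X(t+tau) X(t)^T} (rows of Xc are times)."""
    N = Xc.shape[0]
    if tau >= 0:
        return Xc[tau:].T @ Xc[:N - tau] / N
    return Xc[:N + tau].T @ Xc[-tau:] / N

G0, G1, Gm1 = gamma_hat(Xc, 0), gamma_hat(Xc, 1), gamma_hat(Xc, -1)
# Property 2: Gamma(-tau) = Gamma(tau)^T, while Gamma(1) itself is not symmetric here.
# Property 3: |gamma_ab(tau)| <= {gamma_a(0) gamma_b(0)}^{1/2}.
```

Note that the lagged dependence makes G1 asymmetric, illustrating that Γ(τ) need not be a symmetric matrix for τ ≠ 0.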

5.1.2 Spectral density matrix

According to Brillinger [1981], when Σ_{τ∈Z} |γab(τ)| < ∞ for all a, b = 1, ..., m, the spectral density matrix (SDM) S(f) exists and is defined as the elementwise Fourier transform of the

covariance matrix Γ(τ) such that

S(f) = Σ_{τ=−∞}^{∞} Γ(τ) e^{−i2πfτ}, |f| ≤ 1/2, (5.4)

with

Γ(τ) = ∫_{−1/2}^{1/2} e^{i2πfτ} S(f) df, for all integers τ. (5.5)

For a, b = 1, ..., m, the element in row a, column b of S(f) is called the cross-spectrum between the component series {Xa(t)} and {Xb(t)}, which can be written as

Sab(f) = Σ_{τ=−∞}^{∞} γab(τ) e^{−i2πfτ}, |f| ≤ 1/2, (5.6)

where

γab(τ) = ∫_{−1/2}^{1/2} e^{i2πfτ} Sab(f) df, for all integers τ. (5.7)

Thus, {Sab(f)} and {γab(τ)} are a Fourier transform pair. Because of the asymmetry of γab(τ), the cross-spectrum Sab(f) is generally complex-valued. For real-valued time series {Xa(t)} and {Xb(t)}, notice that Sab(−f) = Sab^*(f) and Sba(f) = Sab^*(f). Hence, S(−f) = S^T(f) and S(f) = S^H(f). By taking τ = 0 in (5.7), we have that γab(0) = ∫_{−1/2}^{1/2} Sab(f) df, which indicates that the cross-spectrum decomposes the cross-covariance between the two time series at time lag 0. When a = b, the function Sab(f) degenerates to the SDF Sa(f) of {Xa(t)}, which is also called the auto-spectrum in the multivariate case.

We further examine the properties of the cross-spectrum Sab(f) using the spectral representation [Percival and Walden, 1993, p. 130] of the two component series {Xa(t)} and {Xb(t)}. By (2.1), we have

Xa(t) = ∫_{−1/2}^{1/2} e^{i2πft} dZa(f) and Xb(t) = ∫_{−1/2}^{1/2} e^{i2πft} dZb(f), for all integers t,

where, according to Walden [2000], dZa(f) and dZb(f) are individually orthogonal and jointly cross-orthogonal in the sense that

E{dZa(f′) dZb^*(f)} = Sab(f) df, if f′ = f; and 0, otherwise. (5.8)

The individual properties of the two complex-valued orthogonal processes, {Za(f): f ∈

[−1/2, 1/2]} and {Zb(f): f ∈ [−1/2, 1/2]}, are provided in Section 2.1. Between the two

time series {Xa(t): t ∈ Z} and {Xb(t): t ∈ Z}, we can write the cross-covariance as

γab(τ) = E{Xa(t + τ)Xb^*(t)}
= E{ ∫_{−1/2}^{1/2} e^{i2πf(t+τ)} dZa(f) ∫_{−1/2}^{1/2} e^{−i2πf′t} dZb^*(f′) }
= ∫_{−1/2}^{1/2} ∫_{−1/2}^{1/2} e^{i2πf(t+τ)} e^{−i2πf′t} E{dZa(f) dZb^*(f′)},

for all integers τ. For a stationary multivariate time series {X(t): t ∈ Z}, the above expression should be a function of τ only. This happens if and only if E{dZa(f) dZb^*(f′)} = 0 for f ≠ f′ on [−1/2, 1/2], which is exactly the cross-orthogonality between dZa(f) and dZb(f) in (5.8), an essential property across the processes {Xa(t)} and {Xb(t)} for a, b = 1, 2, ..., m. As a result, the cross-covariance function can be written as

γab(τ) = ∫_{−1/2}^{1/2} e^{i2πfτ} E{dZa(f) dZb^*(f)}, for τ = 0, ±1, ±2, ...

Then by (5.7), we have

Sab(f) df = E{dZa(f) dZb^*(f)} = Cov{dZa(f), dZb(f)}, (5.9)

which means that the cross-spectrum Sab(f) carries the cross-covariance between the increments of the orthogonal processes at frequency f for the time series {Xa(t)} and {Xb(t)}.

Therefore, the SDM S(f) can be expressed as

S(f) df = E{dZ(f) dZ^H(f)}, (5.10)

where dZ(f) = (dZ1(f), dZ2(f), ..., dZm(f))^T. By (5.10), one can see that S(f) is not only Hermitian, i.e., S(f) = S^H(f), but also positive semi-definite.

We summarize some important properties of the spectral density matrix S(f), defined on f ∈ [−1/2, 1/2], as follows:

1. S(f) is a complex-valued matrix at a given frequency f;

2. S(−f) = S^T(f), i.e., Sab(−f) = Sba(f) for a, b = 1, ..., m, for all f;

3. S(f) is Hermitian, S(f) = S^H(f), i.e., Sba(f) = Sab^*(f) for a, b = 1, ..., m, for all f;

4. S(f) is a positive semi-definite matrix at a given frequency f;

5. Γ(0) = ∫_{−1/2}^{1/2} S(f) df. Thus the spectral matrix function S(f) over frequencies f decomposes the covariance matrix at lag 0 for the multivariate time series {X(t): t ∈ Z}.

5.1.3 Representing complex-valued spectra

Based on Percival [1994], there are two conventional approaches for representing the complex-valued cross-spectrum Sab(f) by its real-valued components.

The first representation decomposes Sab(f) into the real part Rab(f), called the co-spectrum, and the negative of the imaginary part, Iab(f), called the quadrature spectrum [see, e.g., Percival, 1994], such that

Sab(f) = Rab(f) − iIab(f). (5.11)

Recall that Sab(−f) = Sab^*(f) for all f. Thus we have Rab(−f) = Rab(f) and Iab(−f) = −Iab(f), which implies that the co-spectrum Rab(f) is an even function and the quadrature spectrum Iab(f) is an odd function.

The second representation expresses the cross-spectrum Sab(f) in polar form [see, e.g., Percival, 1994]: when |Sab(f)| ≠ 0,

Sab(f) = Aab(f) e^{iϕab(f)},

where the amplitude Aab(f) = |Sab(f)| = {Rab(f)^2 + Iab(f)^2}^{1/2} is called the cross amplitude spectrum, and the phase ϕab(f) = arctan{−Iab(f)/Rab(f)} is called the phase spectrum. Notice that Aab(f) is a real-valued even function, and ϕab(f) is a real-valued cyclic odd function.
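As a quick numerical illustration (with a made-up cross-spectrum value, not one from the dissertation), the two representations are related as follows; `np.arctan2` is used so that the quadrant of arctan{−Iab/Rab} is preserved:

```python
import numpy as np

S_ab = 3.0 - 4.0j                 # hypothetical cross-spectrum value S_ab(f0)

R_ab = S_ab.real                  # co-spectrum R_ab(f0)
I_ab = -S_ab.imag                 # quadrature spectrum I_ab(f0) (negative of imaginary part)
A_ab = abs(S_ab)                  # cross amplitude spectrum: (R^2 + I^2)^{1/2} = 5.0
phi_ab = np.arctan2(-I_ab, R_ab)  # phase spectrum phi_ab(f0)

# Both representations recover S_ab: R - iI and A e^{i phi}.
```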

Another important quantity is the complex coherency, defined as [see, e.g., Percival, 1994]

wab(f) ≡ Sab(f) / {Sa(f)Sb(f)}^{1/2}, (5.12)

which depends on the cross-spectrum Sab(f), the SDF Sa(f) of {Xa(t)}, and the SDF Sb(f) of {Xb(t)}. By (5.9), the complex coherency is equal to

wab(f) = Cov{dZa(f), dZb(f)} / [Var{dZa(f)} Var{dZb(f)}]^{1/2}.

Thus, the complex coherency wab(f) is a complex-valued correlation coefficient between {dZa(f)} and {dZb(f)} in the frequency domain; that is, the correlation between the random amplitudes of the sinusoids in the spectral representations of {Xa(t)} and {Xb(t)}. According to Percival [1994], the correlation coefficient wab(f) satisfies 0 ≤ |wab(f)|^2 ≤ 1, where the quantity |wab(f)|^2 is called the magnitude squared coherence (MSC) and can be calculated from

|wab(f)|^2 = |Sab(f)|^2 / {Sa(f)Sb(f)} = Aab(f)^2 / {Sa(f)Sb(f)}. (5.13)

The MSC can be viewed as the normalized square of the cross amplitude spectrum, with the

phase information ignored.

5.1.4 Cross-periodogram and multitaper spectral estimators

The preliminary spectral estimators discussed in Section 2.2 can be extended naturally

to estimating the cross-spectrum between two different time series and thus the SDM for

the multivariate process. Suppose the real-valued stationary multivariate process X(t) = (X1(t), X2(t), ..., Xm(t))^T is observed at times t = 1, 2, ..., N.

Similar to the periodogram for estimating the SDF, a naive estimator for the cross-spectrum Sab(f) between the two univariate processes {Xa(t)} and {Xb(t)}, for a, b = 1, ..., m, can be obtained by substituting γab(τ) in Equation (5.6) with

γ̂ab^(p)(τ) = (1/N) Σ_{t=1}^{N−τ} Xa(t + τ) Xb(t), for 0 ≤ τ ≤ N − 1;
γ̂ab^(p)(τ) = (1/N) Σ_{t=1−τ}^{N} Xa(t + τ) Xb(t), for −(N − 1) ≤ τ < 0;
γ̂ab^(p)(τ) = 0, otherwise.

The resulting estimator of the cross-spectrum is

Ŝab^(p)(f) = Σ_{τ=−(N−1)}^{N−1} γ̂ab^(p)(τ) e^{−i2πfτ},

which is called the cross-periodogram.

Let Jl^(p)(f) = Σ_{t=1}^{N} N^{−1/2} Xl(t) e^{−i2πft} for l = 1, 2, ..., m. Then the cross-periodogram can alternatively be expressed as

Ŝab^(p)(f) = { Σ_{t=1}^{N} N^{−1/2} Xa(t) e^{−i2πft} } { Σ_{t=1}^{N} N^{−1/2} Xb(t) e^{−i2πft} }^* = Ja^(p)(f) Jb^(p)*(f). (5.14)

Recall from Section 2.2 that the periodogram has bias and inconsistency issues. Similar undesirable features are present in the cross-periodogram, and things can get even worse when the two time series are misaligned [see, e.g., Percival, 1994, p. 343].

Another issue with the cross-periodogram (5.14) arises when calculating the naive estimator of the MSC (5.13). In that case,

|ŵab^(p)(f)|^2 = |Ŝab^(p)(f)|^2 / {Ŝa^(p)(f) Ŝb^(p)(f)}
= |Ja^(p)(f)|^2 |Jb^(p)(f)|^2 / {Ŝa^(p)(f) Ŝb^(p)(f)}
= Ŝa^(p)(f) Ŝb^(p)(f) / {Ŝa^(p)(f) Ŝb^(p)(f)} = 1.

The result implies that the MSC is always estimated as unity when using the cross-periodogram

Ŝab^(p)(f) and the periodograms Ŝa^(p)(f) and Ŝb^(p)(f). More explanation of the above result can be found in Priestley [1981, p. 708]. Smoothing or multitapering can circumvent this problem [Percival, 1994, p. 344].
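The degeneracy |ŵab^(p)(f)|^2 ≡ 1 is easy to verify numerically. The sketch below (our own illustration) forms the raw cross-periodogram of two independent white noise series, for which the true MSC is 0, and checks that the naive MSC estimate is nevertheless identically one:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 128
xa = rng.standard_normal(N)
xb = rng.standard_normal(N)       # independent of xa, so the true MSC is 0

# J_l^(p)(f_j) = N^{-1/2} sum_t X_l(t) exp(-i 2 pi f_j t) at the Fourier frequencies
Ja = np.fft.fft(xa) / np.sqrt(N)
Jb = np.fft.fft(xb) / np.sqrt(N)

S_ab = Ja * np.conj(Jb)           # cross-periodogram (5.14)
S_a = np.abs(Ja) ** 2             # periodograms of each series
S_b = np.abs(Jb) ** 2

msc = np.abs(S_ab) ** 2 / (S_a * S_b)   # naive MSC estimate: identically 1
```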

According to Percival [1994], multitaper (MT) estimators can be extended to the cross-spectral case to better control the bias and variance in the estimation, and a more reasonable estimate of the MSC can be achieved at the same time. Let {ht,k : k = 1, ..., K, t = 1, ..., N} denote K orthonormal data tapers; i.e., Σ_t ht,k^2 = 1 and Σ_t ht,k ht,k′ = 0 for k ≠ k′. For k = 1, ..., K and l = 1, 2, ..., m, define

Jk,l(f) = Σ_{t=1}^{N} ht,k Xl(t) e^{−i2πft}. (5.15)

Then the kth eigenspectrum for the cross-spectrum Sab(f) is

Ŝk,ab^(mt)(f) = Jk,a(f) Jk,b^*(f). (5.16)

The standard MT estimator Ŝab^(mt)(f) of the cross-spectrum is the average of the K eigenspectra:

Ŝab^(mt)(f) = (1/K) Σ_{k=1}^{K} Ŝk,ab^(mt)(f). (5.17)

Again, the periodogram-based spectral estimator (5.14) is a special case of the MT estimator (5.17) obtained by taking K = 1 and ht,1 = 1/√N for t = 1, ..., N.
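A minimal implementation of (5.15)–(5.17) is sketched below, using sine tapers ht,k = {2/(N+1)}^{1/2} sin{πkt/(N+1)} as the orthonormal taper family (one common choice; the framework allows any orthonormal tapers, and the function names are ours):

```python
import numpy as np

def sine_tapers(N, K):
    """K orthonormal sine tapers, h_{t,k} = sqrt(2/(N+1)) sin(pi k t/(N+1)); K x N array."""
    t = np.arange(1, N + 1)
    k = np.arange(1, K + 1)
    return np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * np.outer(k, t) / (N + 1))

def mt_cross_spectrum(xa, xb, K, freqs):
    """MT cross-spectrum estimate (5.17): average of the K eigenspectra (5.16)."""
    N = len(xa)
    H = sine_tapers(N, K)
    t = np.arange(1, N + 1)
    E = np.exp(-2j * np.pi * np.outer(freqs, t))   # DFT kernel, len(freqs) x N
    S = np.zeros(len(freqs), dtype=complex)
    for k in range(K):
        Jka = E @ (H[k] * xa)                      # J_{k,a}(f), eq. (5.15)
        Jkb = E @ (H[k] * xb)
        S += Jka * np.conj(Jkb)                    # k-th eigenspectrum (5.16)
    return S / K

rng = np.random.default_rng(2)
xa, xb = rng.standard_normal(256), rng.standard_normal(256)
freqs = np.linspace(0.05, 0.45, 9)
S_ab = mt_cross_spectrum(xa, xb, 5, freqs)
S_ba = mt_cross_spectrum(xb, xa, 5, freqs)
S_aa = mt_cross_spectrum(xa, xa, 5, freqs)
# S_ba = conj(S_ab) (Hermitian symmetry), and the auto-spectrum is real and nonnegative.
```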

The MT estimators for the components of the cross-spectrum Sab(f) in Section 5.1.3 can be easily obtained following their definitions. For example, the MT estimator for the MSC (5.13) is

|ŵab^(mt)(f)|^2 = |Ŝab^(mt)(f)|^2 / {Ŝa^(mt)(f) Ŝb^(mt)(f)}.

A comprehensive discussion of the bias and variance of the MT estimator Ŝab^(mt)(f) can be found in Walden [2000]. Walden [2000] provided a unifying orthogonal MT framework for

the SDM estimation, which connects the MT estimators, WOSA estimators [Welch, 1967], and lag-window estimators (see Percival and Walden [1993] for a complete review of lag-window estimators). For k = 1, ..., K, let Jk(f) ≡ (Jk,1(f), Jk,2(f), ..., Jk,m(f))^T be the vector Fourier transform of the tapered vector series {ht,k X(t)}. Thus, for each k,

Jk(f) = Σ_{t=1}^{N} ht,k X(t) e^{−i2πft}. (5.18)

Define the kth eigenspectrum to be Ŝk^(mt)(f) = Jk(f) Jk^H(f), where Jk^H(f) is the Hermitian transpose of Jk(f). Then the standard MT estimator of the SDM is the average of the K eigenspectra matrices, denoted by

Ŝ^(mt)(f) = (1/K) Σ_{k=1}^{K} Ŝk^(mt)(f). (5.19)

Notice that the MT estimator Ŝ^(mt)(f) is the matrix with Ŝab^(mt)(f) as its element in row a, column b, for a, b = 1, ..., m. According to Dai and Guo [2004], to ensure that Ŝ^(mt)(f) is positive definite, we require K ≥ m; i.e., the number of tapers used should be no less than the dimension of the multivariate time series.
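To illustrate (5.18)–(5.19) and the K ≥ m requirement, the sketch below (our own example, with sine tapers as one possible orthonormal family and made-up data) builds the MT estimate of a 2 × 2 SDM at a single frequency and confirms that it is Hermitian and positive definite when K = 3 ≥ m = 2:

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, K = 256, 2, 3                # K >= m is required for positive definiteness
X = rng.standard_normal((N, m))

t = np.arange(1, N + 1)
k = np.arange(1, K + 1)
H = np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * np.outer(k, t) / (N + 1))  # sine tapers, K x N

f = 0.1                            # one frequency in (0, 1/2)
e = np.exp(-2j * np.pi * f * t)
S = np.zeros((m, m), dtype=complex)
for kk in range(K):
    Jk = (H[kk] * e) @ X           # vector DFT J_k(f) of the tapered series, eq. (5.18)
    S += np.outer(Jk, np.conj(Jk)) # k-th eigenspectrum matrix J_k J_k^H
S /= K                             # MT estimate of the SDM, eq. (5.19)
```

With K < m, the estimate would be a sum of fewer than m rank-one matrices and hence singular.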

General statistical properties of the MT estimator Ŝ^(mt)(f), such as the bias, variance, consistency, and asymptotic complex Wishart distribution at each frequency, are demonstrated by Walden [2000]. The corresponding results for the periodogram-based estimator of the SDM can be found in Brillinger [1981]. See Goodman [1963] for some properties and the derivation of the complex Wishart distribution. We will illustrate some distributional properties of the MT estimator in later sections of this chapter.

5.1.5 Covariance between MT estimates of different cross-spectra

Similar to the univariate case discussed in Chapter 4, proper spectral inference for the SDM relies on a correct specification of the covariance structure of the multivariate spectral estimators. We present the following result for the MT estimator of the SDM.

Proposition 5. For 1 ≤ a, a′, b, b′ ≤ m, each pair of elements {Ŝab^(mt)(f), Ŝa′b′^(mt)(f′)} of the MT spectral matrix estimator Ŝ^(mt)(f) at frequencies f, f′ ∈ [0, 1/2] has covariance

Cov{Ŝab^(mt)(f), Ŝa′b′^(mt)(f′)} = (1/K^2) Σ_{k=1}^{K} Σ_{k′=1}^{K} Cov{Ŝk,ab^(mt)(f), Ŝk′,a′b′^(mt)(f′)}.

For a zero-mean stationary multivariate Gaussian process {X(t)}, if the cross-spectra Saa′(f′), Sbb′(f′), Sab′(f′), and Sba′(f′) are locally constant around frequency f, i.e., Saa′(f′) ≈ Saa′(f), Sbb′(f′) ≈ Sbb′(f), Sab′(f′) ≈ Sab′(f), and Sba′(f′) ≈ Sba′(f) when |f′ − f| is small, then

Cov{Ŝk,ab^(mt)(f), Ŝk′,a′b′^(mt)(f′)} ≈ Saa′(f) Sbb′^*(f) |∫_{−1/2}^{1/2} Hk(u) Hk′(f − f′ − u) du|^2
+ Sab′(f) Sba′^*(f) |∫_{−1/2}^{1/2} Hk(u) Hk′(f + f′ − u) du|^2, (5.20)

for k, k′ ∈ {1, 2, ..., K}. Here, Hk(f) = Σ_{t=1}^{N} ht,k e^{−i2πft} is the discrete Fourier transform of the taper {ht,k : t = 1, ..., N}, and Hk^*(f) = Hk(−f) is the complex conjugate of Hk(f), for each k = 1, ..., K.
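The taper transforms Hk(f) in Proposition 5 satisfy ∫_{−1/2}^{1/2} |Hk(f)|^2 df = Σ_t ht,k^2 = 1 by Parseval's theorem, which is a useful sanity check when implementing the covariance expressions numerically. A sketch (our own, using sine tapers as the orthonormal family):

```python
import numpy as np

N, K = 256, 5
t = np.arange(1, N + 1)
k = np.arange(1, K + 1)
H = np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * np.outer(k, t) / (N + 1))  # sine tapers, K x N

nf = 4 * N                         # fine frequency grid f_j = j/nf via zero-padded FFT
Hk_f = np.fft.fft(H, n=nf, axis=1) # H_k(f_j) up to a unit-modulus phase (t starts at 1)
integrals = (np.abs(Hk_f) ** 2).sum(axis=1) / nf  # Riemann sum of |H_k|^2 with df = 1/nf

# Each integral equals sum_t h_{t,k}^2 = 1 for an orthonormal taper.
```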

Proof. For 1 ≤ a, a0, b, b0 ≤ m and 0 ≤ f, f 0 ≤ 1/2, by (5.16), we have

n (mt) (mt) 0 o  ∗ 0 ∗ 0 Cov Sbk,ab (f), Sbk0,a0b0 (f ) = Cov Jk,a(f)Jk,b(f),Jk0,a0 (f )Jk0,b0 (f ) .

When $\{X(t)\}$ is a zero-mean multivariate Gaussian process, the joint distribution of $J_{k,a}(f)$, $J^{*}_{k,b}(f)$, $J_{k',a'}(f')$, and $J^{*}_{k',b'}(f')$ is multivariate complex Gaussian. It follows from the complex version of the Isserlis theorem [see, e.g., Koopmans, 1974, p. 27] that
\begin{align*}
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',a'b'}(f')\big\}
&= \operatorname{Cov}\{J_{k,a}(f), J_{k',a'}(f')\}\,\operatorname{Cov}\{J^{*}_{k,b}(f), J^{*}_{k',b'}(f')\} \\
&\quad + \operatorname{Cov}\{J_{k,a}(f), J^{*}_{k',b'}(f')\}\,\operatorname{Cov}\{J^{*}_{k,b}(f), J_{k',a'}(f')\} \\
&= E\{J_{k,a}(f) J^{*}_{k',a'}(f')\}\, E\{J^{*}_{k,b}(f) J_{k',b'}(f')\} \\
&\quad + E\{J_{k,a}(f) J_{k',b'}(f')\}\, E\{J^{*}_{k,b}(f) J^{*}_{k',a'}(f')\}. \tag{5.21}
\end{align*}

By (5.15), the term $E\{J_{k,a}(f) J^{*}_{k',a'}(f')\}$ can be obtained as
\begin{align*}
E\{J_{k,a}(f) J^{*}_{k',a'}(f')\}
&= E\left\{ \sum_{s=1}^{N} h_{s,k} X_a(s) e^{-i2\pi f s} \sum_{t=1}^{N} h_{t,k'} X_{a'}(t) e^{i2\pi f' t} \right\} \\
&= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, E\{X_a(s) X_{a'}(t)\}\, e^{-i2\pi(fs - f't)} \\
&= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, \gamma_{aa'}(s-t)\, e^{-i2\pi(fs - f't)}.
\end{align*}

Since $\gamma_{aa'}(s-t) = \int_{-1/2}^{1/2} e^{i2\pi u(s-t)} S_{aa'}(u)\,du$ by (5.7), it then follows that
\begin{align*}
E\{J_{k,a}(f) J^{*}_{k',a'}(f')\}
&= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, e^{-i2\pi(fs - f't)} \int_{-1/2}^{1/2} e^{i2\pi u(s-t)} S_{aa'}(u)\,du \\
&= \int_{-1/2}^{1/2} S_{aa'}(u) \sum_{s=1}^{N} h_{s,k}\, e^{-i2\pi(f-u)s} \sum_{t=1}^{N} h_{t,k'}\, e^{i2\pi(f'-u)t}\,du \\
&= \int_{-1/2}^{1/2} S_{aa'}(u)\, H_k(f-u)\, H^{*}_{k'}(f'-u)\,du.
\end{align*}

Following similar arguments, we obtain
\begin{align*}
E\{J^{*}_{k,b}(f) J_{k',b'}(f')\} &= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, \gamma_{bb'}(s-t)\, e^{i2\pi(fs - f't)}, \\
E\{J_{k,a}(f) J_{k',b'}(f')\} &= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, \gamma_{ab'}(s-t)\, e^{-i2\pi(fs + f't)}, \quad \text{and} \\
E\{J^{*}_{k,b}(f) J^{*}_{k',a'}(f')\} &= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, \gamma_{ba'}(s-t)\, e^{i2\pi(fs + f't)}.
\end{align*}

Alternatively, we have
\begin{align*}
E\{J_{k,a}(f) J^{*}_{k',a'}(f')\} &= \int_{-1/2}^{1/2} S_{aa'}(u)\, H_k(f-u)\, H^{*}_{k'}(f'-u)\,du, \\
E\{J^{*}_{k,b}(f) J_{k',b'}(f')\} &= \int_{-1/2}^{1/2} S^{*}_{bb'}(u)\, H^{*}_k(f-u)\, H_{k'}(f'-u)\,du, \\
E\{J_{k,a}(f) J_{k',b'}(f')\} &= \int_{-1/2}^{1/2} S_{ab'}(u)\, H_k(f-u)\, H_{k'}(f'+u)\,du
= \int_{-1/2}^{1/2} S^{*}_{ab'}(u)\, H_k(f+u)\, H_{k'}(f'-u)\,du, \\
E\{J^{*}_{k,b}(f) J^{*}_{k',a'}(f')\} &= \int_{-1/2}^{1/2} S^{*}_{ba'}(u)\, H^{*}_k(f-u)\, H^{*}_{k'}(f'+u)\,du
= \int_{-1/2}^{1/2} S_{ba'}(u)\, H^{*}_k(f+u)\, H^{*}_{k'}(f'-u)\,du.
\end{align*}

Plugging the above results into (5.21), we obtain

\begin{align*}
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',a'b'}(f')\big\}
&= E\{J_{k,a}(f) J^{*}_{k',a'}(f')\}\, E\{J^{*}_{k,b}(f) J_{k',b'}(f')\}
+ E\{J_{k,a}(f) J_{k',b'}(f')\}\, E\{J^{*}_{k,b}(f) J^{*}_{k',a'}(f')\} \\
&= \left\{ \int_{-1/2}^{1/2} S_{aa'}(u) H_k(f-u) H^{*}_{k'}(f'-u)\,du \right\}
\left\{ \int_{-1/2}^{1/2} S^{*}_{bb'}(u) H^{*}_k(f-u) H_{k'}(f'-u)\,du \right\} \\
&\quad + \left\{ \int_{-1/2}^{1/2} S^{*}_{ab'}(u) H_k(f+u) H_{k'}(f'-u)\,du \right\}
\left\{ \int_{-1/2}^{1/2} S_{ba'}(u) H^{*}_k(f+u) H^{*}_{k'}(f'-u)\,du \right\}.
\end{align*}

If we assume that the cross spectra satisfy $S_{aa'}(f') \approx S_{aa'}(f)$, $S_{bb'}(f') \approx S_{bb'}(f)$, $S_{ab'}(f') \approx S_{ab'}(f)$, and $S_{ba'}(f') \approx S_{ba'}(f)$ with $|f'-f|$ small, it follows that
\begin{align*}
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',a'b'}(f')\big\}
&\approx S_{aa'}(f) S^{*}_{bb'}(f) \left| \int_{-1/2}^{1/2} H_k(f-u) H^{*}_{k'}(f'-u)\,du \right|^2
+ S^{*}_{ab'}(f) S_{ba'}(f) \left| \int_{-1/2}^{1/2} H_k(f+u) H_{k'}(f'-u)\,du \right|^2 \\
&= S_{aa'}(f) S^{*}_{bb'}(f) \left| \int_{-1/2}^{1/2} H_k(u) H_{k'}(f'-f-u)\,du \right|^2
+ S^{*}_{ab'}(f) S_{ba'}(f) \left| \int_{-1/2}^{1/2} H_k(u) H_{k'}(f'+f-u)\,du \right|^2.
\end{align*}
Note that $H_k * H_{k'}(\tilde{f}) = \int_{-1/2}^{1/2} H_k(u) H_{k'}(\tilde{f}-u)\,du$ is the convolution of the two taper transforms, where
\begin{align*}
\int_{-1/2}^{1/2} H_k(u) H_{k'}(\tilde{f}-u)\,du
&= \int_{-1/2}^{1/2} \left\{ \sum_{s=1}^{N} h_{s,k}\, e^{-i2\pi u s} \right\}
\left\{ \sum_{t=1}^{N} h_{t,k'}\, e^{-i2\pi(\tilde{f}-u)t} \right\} du \\
&= \sum_{s=1}^{N} \sum_{t=1}^{N} h_{s,k} h_{t,k'}\, e^{-i2\pi \tilde{f} t} \int_{-1/2}^{1/2} e^{-i2\pi u(s-t)}\,du \\
&= \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp(-i2\pi \tilde{f} t),
\end{align*}
where we have used the fact that

\[
\int_{-1/2}^{1/2} e^{-i2\pi u(s-t)}\,du = 0 \quad \text{for } s \ne t.
\]

Therefore, $\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',a'b'}(f')\big\}$ can also be calculated using

\begin{align*}
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',a'b'}(f')\big\}
&\approx S_{aa'}(f) S^{*}_{bb'}(f)\, |H_k * H_{k'}(f'-f)|^2
+ S^{*}_{ab'}(f) S_{ba'}(f)\, |H_k * H_{k'}(f'+f)|^2 \\
&= S_{aa'}(f) S^{*}_{bb'}(f) \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp\{-i2\pi(f'-f)t\} \right|^2 \\
&\quad + S^{*}_{ab'}(f) S_{ba'}(f) \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp\{-i2\pi(f'+f)t\} \right|^2. \tag{5.22}
\end{align*}

As a special case, one can easily obtain the covariance of $\hat{S}^{(mt)}_{ab}(f)$ between two different frequencies from Proposition 5. We have
\[
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)}_{ab}(f')\big\}
= \frac{1}{K^2} \sum_{k=1}^{K} \sum_{k'=1}^{K} \operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',ab}(f')\big\}.
\]

Given that the SDFs $S_a(f')$, $S_b(f')$ and the cross-spectrum $S_{ab}(f')$ are approximately locally constant around $f$ for small $|f'-f|$, we have
\begin{align*}
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',ab}(f')\big\}
&\approx S_a(f) S_b(f) \left| \int_{-1/2}^{1/2} H_k(u) H_{k'}(f'-f-u)\,du \right|^2
+ \{S^{*}_{ab}(f)\}^2 \left| \int_{-1/2}^{1/2} H_k(u) H_{k'}(f'+f-u)\,du \right|^2 \tag{5.23} \\
&= S_a(f) S_b(f) \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp\{-i2\pi(f'-f)t\} \right|^2
+ \{S^{*}_{ab}(f)\}^2 \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp\{-i2\pi(f'+f)t\} \right|^2, \tag{5.24}
\end{align*}
for $k, k' \in \{1, 2, \ldots, K\}$.
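To make the approximation concrete, the per-taper-pair covariance (5.24) can be evaluated numerically. The sketch below is our own illustration in Python (the function names are not from the dissertation's software), using orthonormal sine tapers, one of the taper families allowed in the MT framework:

```python
import numpy as np

def sine_tapers(N, K):
    # Orthonormal sine tapers h_{t,k} = sqrt(2/(N+1)) sin(pi k t/(N+1)),
    # t = 1, ..., N, returned as a (K, N) array.
    t = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * t / (N + 1))
                     for k in range(1, K + 1)])

def mt_cov_approx(Sa, Sb, Sab, f, fp, H):
    # Approximate Cov{S^(mt)_ab(f), S^(mt)_ab(f')} by averaging the
    # per-taper-pair covariances (5.24) over all K^2 pairs (k, k').
    K, N = H.shape
    t = np.arange(1, N + 1)
    cov = 0.0 + 0.0j
    for k in range(K):
        for kp in range(K):
            c1 = np.sum(H[k] * H[kp] * np.exp(-2j * np.pi * (fp - f) * t))
            c2 = np.sum(H[k] * H[kp] * np.exp(-2j * np.pi * (fp + f) * t))
            cov += Sa * Sb * abs(c1) ** 2 + np.conj(Sab) ** 2 * abs(c2) ** 2
    return cov / K ** 2
```

At $f' = f$ away from 0 and 1/2, taper orthonormality makes the cross-taper terms vanish, so the first term reduces to approximately $S_a(f)S_b(f)/K$: the familiar $1/K$ variance reduction of multitapering.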

5.1.6 Covariance between real and imaginary parts of a MT estimator

In this dissertation, we mainly use the real and imaginary parts parametrization, as defined in (5.11), to model the complex-valued cross spectrum. We derive the following results on the covariance between the real and imaginary parts of a MT estimator to facilitate inference for the SDM.

Let $\hat{R}^{(mt)}_{ab}(f)$ and $\hat{I}^{(mt)}_{ab}(f)$ denote respectively the real part and the negative imaginary part of the MT cross-spectral estimator $\hat{S}^{(mt)}_{ab}(f)$, so that $\hat{S}^{(mt)}_{ab}(f) = \hat{R}^{(mt)}_{ab}(f) - i\hat{I}^{(mt)}_{ab}(f)$. By (5.11), $\hat{R}^{(mt)}_{ab}(f)$ and $\hat{I}^{(mt)}_{ab}(f)$ are naturally the MT estimators of the co-spectrum $R_{ab}(f)$ and the quadrature spectrum $I_{ab}(f)$. At any given frequency $f$, we have $\hat{R}^{(mt)}_{ab}(f) = \big\{\hat{S}^{(mt)}_{ab}(f) + \hat{S}^{(mt)*}_{ab}(f)\big\}/2$ and $\hat{I}^{(mt)}_{ab}(f) = i \cdot \big\{\hat{S}^{(mt)}_{ab}(f) - \hat{S}^{(mt)*}_{ab}(f)\big\}/2$. Thus,

\begin{align*}
\operatorname{Cov}\big\{\hat{R}^{(mt)}_{ab}(f), \hat{I}^{(mt)}_{ab}(f)\big\}
&= \operatorname{Cov}\left\{ \frac{\hat{S}^{(mt)}_{ab}(f) + \hat{S}^{(mt)*}_{ab}(f)}{2},\;
i \cdot \frac{\hat{S}^{(mt)}_{ab}(f) - \hat{S}^{(mt)*}_{ab}(f)}{2} \right\} \\
&= -\frac{i}{4} \cdot \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f) + \hat{S}^{(mt)*}_{ab}(f),\;
\hat{S}^{(mt)}_{ab}(f) - \hat{S}^{(mt)*}_{ab}(f)\big\} \\
&= -\frac{i}{4} \Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)}_{ab}(f)\big\}
- \operatorname{Cov}\big\{\hat{S}^{(mt)*}_{ab}(f), \hat{S}^{(mt)*}_{ab}(f)\big\} \\
&\qquad\quad + \operatorname{Cov}\big\{\hat{S}^{(mt)*}_{ab}(f), \hat{S}^{(mt)}_{ab}(f)\big\}
- \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)*}_{ab}(f)\big\} \Big] \\
&= -\frac{i}{4} \Big[ \operatorname{Var}\big\{\hat{S}^{(mt)}_{ab}(f)\big\}
- \operatorname{Var}\big\{\hat{S}^{(mt)*}_{ab}(f)\big\}
+ \operatorname{Cov}\big\{\hat{S}^{(mt)*}_{ab}(f), \hat{S}^{(mt)}_{ab}(f)\big\}
- \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)*}_{ab}(f)\big\} \Big].
\end{align*}

Since $\operatorname{Var}\big\{\hat{S}^{(mt)}_{ab}(f)\big\} = E\big\{\big|\hat{S}^{(mt)}_{ab}(f)\big|^2\big\} - \big|E\big\{\hat{S}^{(mt)}_{ab}(f)\big\}\big|^2 = \operatorname{Var}\big\{\hat{S}^{(mt)*}_{ab}(f)\big\}$, it follows that
\begin{align*}
\operatorname{Cov}\big\{\hat{R}^{(mt)}_{ab}(f), \hat{I}^{(mt)}_{ab}(f)\big\}
&= -\frac{i}{4} \Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)*}_{ab}(f), \hat{S}^{(mt)}_{ab}(f)\big\}
- \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)*}_{ab}(f)\big\} \Big] \\
&= -\frac{i}{4} \Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ba}(f), \hat{S}^{(mt)}_{ab}(f)\big\}
- \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)}_{ba}(f)\big\} \Big] \\
&= -\frac{1}{2} \operatorname{Im}\Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)}_{ba}(f)\big\} \Big],
\end{align*}
where $\operatorname{Im}\{\alpha\}$ denotes the imaginary part of any complex number $\alpha \in \mathbb{C}$. By (5.22), we then have

\begin{align*}
\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',ba}(f)\big\}
&\approx \{S_{ab}(f)\}^2 \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \right|^2
+ S_a(f) S_b(f) \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp\{-i2\pi(2f)t\} \right|^2 \\
&= \big\{ R_{ab}(f)^2 - I_{ab}(f)^2 - 2i \cdot R_{ab}(f) I_{ab}(f) \big\}
\left\{ \sum_{t=1}^{N} h_{t,k} h_{t,k'} \right\}^2
+ S_a(f) S_b(f) \left| \sum_{t=1}^{N} h_{t,k} h_{t,k'} \exp\{-i2\pi(2f)t\} \right|^2,
\end{align*}
for $k, k' \in \{1, 2, \ldots, K\}$. Thus, the imaginary part of $\operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',ba}(f)\big\}$ is
\[
\operatorname{Im}\Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',ba}(f)\big\} \Big]
= -2 R_{ab}(f) I_{ab}(f) \left\{ \sum_{t=1}^{N} h_{t,k} h_{t,k'} \right\}^2,
\]
which equals $0$ if $k' \ne k$, and $-2 R_{ab}(f) I_{ab}(f)$ if $k' = k$. Therefore,

\[
\operatorname{Im}\Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)}_{ab}(f), \hat{S}^{(mt)}_{ba}(f)\big\} \Big]
= \frac{1}{K^2} \sum_{k=1}^{K} \sum_{k'=1}^{K}
\operatorname{Im}\Big[ \operatorname{Cov}\big\{\hat{S}^{(mt)}_{k,ab}(f), \hat{S}^{(mt)}_{k',ba}(f)\big\} \Big]
= -\frac{2}{K} R_{ab}(f) I_{ab}(f).
\]
Hence,
\[
\operatorname{Cov}\big\{\hat{R}^{(mt)}_{ab}(f), \hat{I}^{(mt)}_{ab}(f)\big\} = \frac{1}{K} R_{ab}(f) I_{ab}(f). \tag{5.25}
\]

5.1.7 Basis methods for SDM estimation

Similar to the phenomenon that we observed in the univariate case, the MT estimates of the SDM are often still too noisy. To further smooth the SDM estimates, basis methods have been extended to the multivariate case in the literature. Many such studies focus on using a spline basis. Pawitan [1996] estimated the auto-spectra and the cross-spectrum of a bivariate time series in separate steps: the auto-spectra are estimated first using the L2 penalized Whittle method of Pawitan and O'Sullivan [1994]; then a profile likelihood function of the coherence is formulated by substituting the obtained auto-spectral estimates for the true SDFs in the multivariate Whittle likelihood function (introduced later in Section 5.2). With L2 penalties added to both the real and imaginary parts, Pawitan [1996] obtained estimates of the cross-spectrum by minimizing the penalized profile likelihood.

Extra restrictions on the penalty parameters were imposed by Pawitan [1996] to enforce a positive definite SDM estimate. However, Dai and Guo [2004] argued that the generalization of Pawitan [1996] cannot guarantee positive definiteness of the SDM if the vector time series has dimensionality higher than two. To fix this issue, Dai and Guo [2004] proposed to estimate the Cholesky decomposition components of the SDM, where L2 penalized LS estimates were obtained for the real and imaginary parts with a spline basis. The Cholesky-based methods were further developed by Rosen and Stoffer [2007] and Krafty and Collinge [2013] using the multivariate Whittle likelihood and spline bases. While Rosen and Stoffer [2007] estimated the SDM using a Bayesian approach, Krafty and Collinge [2013] applied a Newton method with a quadratic approximation and constrained the estimates of the SDM to be periodic Hermitian functions.

Alternatively, Holan et al. [2017] proposed an unpenalized multivariate cepstral model as an extension of the exponential model [Bloomfield, 1973] to the multivariate case: a cepstral filter is defined as the power series with matrix-valued coefficients whose matrix exponential is the MA filter of the causal representation of a multivariate time series. Applying the cepstral filter with Fourier bases, the transfer function of the corresponding MA filter can be used to formulate the SDM of the vector time series. Holan et al. [2017] estimated the cepstral filter by optimizing an unpenalized periodogram-based multivariate Whittle likelihood. Recently, Chau and von Sachs [2020] proposed a generalization of wavelet thresholding to curves in the Riemannian manifold of Hermitian positive definite matrices, such as SDM functions. Their wavelet transform operates within this geometric space, and the thresholding procedure has a permutation invariance property, so that permuting the time series components still leads to an equivalent SDM estimate. On the other hand, Cholesky-based methods, such as Dai and Guo [2004], Rosen and Stoffer [2007], and Krafty and Collinge [2013], generally give SDM estimates that change when the time series components are rotated [Chau and von Sachs, 2020]. Similar to wavelet thresholding in the univariate case, the method of Chau and von Sachs [2020] is equivalent to a penalized LS approach.

In the next section, we extend our L1 penalized MT-Whittle method to SDM estimation. The generalized framework is suitable for a broad variety of basis functions, and a proximal gradient descent algorithm is implemented to solve the optimization problem for obtaining the SDM estimates.

5.2 Penalized multitaper Whittle method for SDM

Suppose the real-valued stationary $m$-vector time series $\{X(t) : t \in \mathbb{Z}\}$ is observed at $N$ consecutive time points $t = 1, 2, \ldots, N$. We evaluate the MT spectral estimates $\hat{S}^{(mt)}(f_j)$ on the set of $M = \lceil N/2 \rceil - 1$ non-zero, non-Nyquist (i.e., not equal to 1/2) Fourier frequencies $\{f_j = j/N : j = 1, \ldots, M\}$ defined in (2.18). We now introduce a quasi-likelihood framework based on the multivariate MT estimates.

Definition 7. Using MT estimates of the SDM, the quasi-likelihood function
\[
l_W(S(f)) = \sum_{j=1}^{M} \Big( \log\{\det S(f_j)\} + \operatorname{tr}\big\{S^{-1}(f_j)\, \hat{S}^{(mt)}(f_j)\big\} \Big) \tag{5.26}
\]
is called the multivariate MT-Whittle likelihood function.

The multivariate MT-Whittle likelihood has the same functional form as the multivariate Whittle likelihood [see, e.g., Krafty and Collinge, 2013] based on the periodograms and cross-periodograms, except that we replace the periodogram-based estimator by the MT estimator. The usual multivariate Whittle likelihood can be viewed as the special case of (5.26) obtained by using a single ($K = 1$) rectangular taper $\{h_{t,1} = 1/\sqrt{N} : t = 1, \ldots, N\}$ in the MT estimator of the SDM.

To see why the multivariate MT-Whittle likelihood is a reasonable quasi-likelihood, we

present the following two propositions.

Proposition 6 [Walden, 2000, Section 3.3]. Suppose that $\{X(t) : t \in \mathbb{Z}\}$ is a strictly stationary $m$-vector time series, all moments of the components $X_1(t), X_2(t), \ldots, X_m(t)$ exist, and
\[
\sum_{\tau_1, \ldots, \tau_{l-1} = -\infty}^{\infty}
\big| \operatorname{cum}\{X_{a_1}(t+\tau_1), \ldots, X_{a_{l-1}}(t+\tau_{l-1}), X_{a_l}(t)\} \big| < \infty,
\]
for $a_1, \ldots, a_l = 1, \ldots, m$ and $l = 2, 3, \ldots$, where $\operatorname{cum}\{X_{a_1}(t_1), \ldots, X_{a_l}(t_l)\}$ denotes the joint cumulant function of order $l$. For each $N$, let $\{h_{t,k} : k = 1, \ldots, K;\ t = 1, \ldots, N\}$ be a set of $K$ orthonormal sine or DPSS data tapers with $\sum_t h^2_{t,k} = 1$ and $\sum_t h_{t,k} h_{t,k'} = 0$ for $k \ne k'$. Then, as $N \to \infty$,

\[
J_k(f) \to_d \begin{cases} N^{C}_m(0, S(f)), & \text{for } 0 < |f| < 1/2; \\ N_m(0, S(f)), & \text{for } f = 0, \pm 1/2, \end{cases}
\]
where $N^{C}_m(0, S(f))$ and $N_m(0, S(f))$ denote respectively the complex and real $m$-variate normal distributions with mean 0 and covariance matrix $S(f)$. Moreover, as $N \to \infty$,
\[
\hat{S}^{(mt)}(f) \to_d \begin{cases} (1/K)\, W^{C}_m(K, S(f)), & \text{for } 0 < |f| < 1/2; \\ (1/K)\, W_m(K, S(f)), & \text{for } f = 0, \pm 1/2, \end{cases}
\]
where $W^{C}_m(K, S(f))$ and $W_m(K, S(f))$ denote respectively the complex and real Wishart distributions with $K$ degrees of freedom and covariance matrix $S(f)$.

The proof of Proposition 6 can be found in Walden [2000, Section 3.3]. The proposition suggests that at each frequency $f \in (0, 1/2)$, our MT spectral estimator $\hat{S}^{(mt)}(f)$ has an asymptotic scaled complex Wishart distribution that depends on the true SDM. As we have seen in Proposition 5, the MT estimates at different frequencies are correlated. However, following a similar argument to Proposition 4 for the univariate case, we can interpret the multivariate MT-Whittle likelihood (5.26) as a complex Wishart quasi-likelihood.

Proposition 7. The multivariate MT-Whittle likelihood (5.26) corresponds to a complex Wishart quasi-likelihood assuming the asymptotic distribution of Proposition 6 at the Fourier frequencies (2.18), and assuming independence between the Fourier frequencies.

Proof. By Proposition 6, we have $K\hat{S}^{(mt)}(f_j) \to_d W^{C}_m(K, S(f_j))$ for the Fourier frequencies $f_j$, $j = 1, \ldots, M$ defined in (2.18). Based on Goodman [1963, p. 174], the probability density function of the complex Wishart distribution $W^{C}_m(K, S(f_j))$ can be written as
\[
p\big( K\hat{S}^{(mt)}(f_j) \big)
= \frac{\big[ \det\{K\hat{S}^{(mt)}(f_j)\} \big]^{K-m}}{I_c(S(f_j))}
\exp\Big[ -\operatorname{tr}\big\{ S^{-1}(f_j)\, K\hat{S}^{(mt)}(f_j) \big\} \Big],
\]
where $I_c(S(f)) = \pi^{m(m-1)/2} \prod_{l=1}^{m} \Gamma(K-l+1)\, \{\det S(f)\}^{K}$. Let $\Xi_j = \hat{S}^{(mt)}(f_j)$ for $j = 1, \ldots, M$. Then one can write the probability density function

for the asymptotic distribution of $\Xi_j$ ($j = 1, \ldots, M$) as
\begin{align*}
p(\Xi_j) &\propto \big[ \{\det \Xi_j\}^{K-m} / I_c(S(f_j)) \big]
\exp\big[ -\operatorname{tr}\{S^{-1}(f_j)\, K\Xi_j\} \big] \\
&\propto \{\det \Xi_j\}^{K-m}\, \{\det S(f_j)\}^{-K}
\exp\big[ -K \cdot \operatorname{tr}\{S^{-1}(f_j)\Xi_j\} \big].
\end{align*}

The proposition follows by assuming independence between the MT spectral estimates over frequencies, and noting that the resulting log quasi-likelihood is
\begin{align*}
\sum_{j=1}^{M} \log p(\Xi_j)
&= \text{constant} - K \sum_{j=1}^{M} \log\{\det S(f_j)\} - K \sum_{j=1}^{M} \operatorname{tr}\{S^{-1}(f_j)\Xi_j\} \\
&= \text{constant} - K \sum_{j=1}^{M} \Big( \log\{\det S(f_j)\} + \operatorname{tr}\{S^{-1}(f_j)\Xi_j\} \Big) \\
&= \text{constant} - K\, l_W(S(f)).
\end{align*}

Alternatively, the multivariate MT-Whittle likelihood (5.26) can be expressed as
\begin{align*}
l_W(S(f)) &= \sum_{j=1}^{M} \Big( \log\{\det S(f_j)\} + \operatorname{tr}\big\{S^{-1}(f_j)\hat{S}^{(mt)}(f_j)\big\} \Big) \\
&= \sum_{j=1}^{M} \left( \log\{\det S(f_j)\}
+ \operatorname{tr}\Big\{ S^{-1}(f_j)\, \frac{1}{K}\sum_{k=1}^{K} J_k(f_j) J^{H}_k(f_j) \Big\} \right) \\
&= \sum_{j=1}^{M} \left( \log\{\det S(f_j)\}
+ \frac{1}{K}\sum_{k=1}^{K} J^{H}_k(f_j)\, S^{-1}(f_j)\, J_k(f_j) \right).
\end{align*}
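For concreteness, (5.26) can be evaluated directly with standard linear algebra routines. The following is an illustrative sketch (the arrays and function name are our assumptions, not the dissertation's software):

```python
import numpy as np

def mt_whittle_loglik(S, S_mt):
    # Multivariate MT-Whittle likelihood (5.26): S and S_mt hold the model
    # SDM and its MT estimate at M Fourier frequencies, each of shape
    # (M, m, m) and Hermitian positive definite.
    val = 0.0
    for Sj, Sj_mt in zip(S, S_mt):
        _, logdet = np.linalg.slogdet(Sj)                 # log det S(f_j)
        # solve(Sj, Sj_mt) computes S^{-1}(f_j) S_mt(f_j) without inverting.
        val += logdet + np.trace(np.linalg.solve(Sj, Sj_mt)).real
    return val
```

When the model SDM equals the MT estimate at every frequency, each trace term equals $m$, so the likelihood reduces to $M(m + \sum_j \log\det \hat{S}^{(mt)}(f_j)/M)$-type values, which is convenient for sanity checks.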

5.2.1 MT-Whittle likelihood for bivariate time series

We elaborate on the use of the multivariate MT-Whittle likelihood for bivariate time series. Given a vector time series $\{X(t) = (X_1(t), X_2(t))^T : t \in \mathbb{Z}\}$ with dimensionality $m = 2$, we can write the SDM as
\[
S(f) = \begin{pmatrix} S_1(f) & S_{12}(f) \\ S^{*}_{12}(f) & S_2(f) \end{pmatrix},
\]
whose determinant is
\[
\det S(f) = S_1(f) S_2(f) - |S_{12}(f)|^2, \tag{5.27}
\]
and thus the inverse matrix at each frequency $f$ is
\[
S^{-1}(f) = \frac{1}{S_1(f) S_2(f) - |S_{12}(f)|^2}
\begin{pmatrix} S_2(f) & -S_{12}(f) \\ -S^{*}_{12}(f) & S_1(f) \end{pmatrix}.
\]

The trace term in the multivariate MT-Whittle likelihood can then be calculated as
\begin{align*}
\operatorname{tr}\big\{ S^{-1}(f)\, \hat{S}^{(mt)}(f) \big\}
&= \frac{1}{\det S(f)} \operatorname{tr}\left\{
\begin{pmatrix} S_2(f) & -S_{12}(f) \\ -S^{*}_{12}(f) & S_1(f) \end{pmatrix}
\begin{pmatrix} \hat{S}^{(mt)}_1(f) & \hat{S}^{(mt)}_{12}(f) \\ \hat{S}^{(mt)*}_{12}(f) & \hat{S}^{(mt)}_2(f) \end{pmatrix}
\right\} \tag{5.28} \\
&= \frac{ S_2(f)\hat{S}^{(mt)}_1(f) - S_{12}(f)\hat{S}^{(mt)*}_{12}(f)
- S^{*}_{12}(f)\hat{S}^{(mt)}_{12}(f) + S_1(f)\hat{S}^{(mt)}_2(f) }
{ S_1(f) S_2(f) - |S_{12}(f)|^2 }.
\end{align*}
Plugging the determinant and trace terms into the multivariate MT-Whittle likelihood (5.26), we obtain an expression for the bivariate MT-Whittle likelihood:
\[
l_W(S(f)) = \sum_{j=1}^{M} \left[ \log\big\{ S_1(f_j) S_2(f_j) - |S_{12}(f_j)|^2 \big\}
+ \frac{ S_2(f_j)\hat{S}^{(mt)}_1(f_j) + S_1(f_j)\hat{S}^{(mt)}_2(f_j)
- 2\operatorname{Re}\big\{ S_{12}(f_j)\hat{S}^{(mt)*}_{12}(f_j) \big\} }
{ S_1(f_j) S_2(f_j) - |S_{12}(f_j)|^2 } \right],
\]
where $\operatorname{Re}\{\alpha\}$ denotes the real part of $\alpha \in \mathbb{C}$. The bivariate MT-Whittle likelihood can also be expressed in terms of the SDFs and the complex coherence. By (5.12), we have
\begin{align*}
l_W(S(f)) &= \sum_{j=1}^{M} \left[ \log S_1(f_j) + \log S_2(f_j) + \log\big\{ 1 - |w_{12}(f_j)|^2 \big\}
+ \frac{ \dfrac{\hat{S}^{(mt)}_1(f_j)}{S_1(f_j)} + \dfrac{\hat{S}^{(mt)}_2(f_j)}{S_2(f_j)}
- 2\operatorname{Re}\left\{ w_{12}(f_j)\, \dfrac{\hat{S}^{(mt)*}_{12}(f_j)}{\sqrt{S_1(f_j) S_2(f_j)}} \right\} }
{ 1 - |w_{12}(f_j)|^2 } \right] \\
&= \sum_{j=1}^{M} \left[ \log S_1(f_j) + \log S_2(f_j) + \log\big\{ 1 - |w_{12}(f_j)|^2 \big\}
+ \frac{ \dfrac{\hat{S}^{(mt)}_1(f_j)}{S_1(f_j)} + \dfrac{\hat{S}^{(mt)}_2(f_j)}{S_2(f_j)}
- 2\sqrt{ \dfrac{\hat{S}^{(mt)}_1(f_j)}{S_1(f_j)} }\sqrt{ \dfrac{\hat{S}^{(mt)}_2(f_j)}{S_2(f_j)} }\,
\operatorname{Re}\big\{ w_{12}(f_j)\, \hat{w}^{(mt)*}_{12}(f_j) \big\} }
{ 1 - |w_{12}(f_j)|^2 } \right].
\end{align*}

Similar to Pawitan [1996], we can plug in the estimates of $S_1(f)$ and $S_2(f)$ from the univariate L1 penalized MT-Whittle method, which leads to a profile likelihood function for $S_{12}(f)$ or $w_{12}(f)$. However, as argued in Dai and Guo [2004], it is difficult to obtain a positive definite estimate of the SDM when the SDFs and the cross-spectra are estimated in separate steps, especially when the dimensionality of the time series is greater than two. For better generalizability to higher-dimensional time series, we focus our discussion on positive definite SDM estimation based on the Cholesky decomposition of the SDM.

We proceed with the modified complex Cholesky factorization used by Rosen and Stoffer [2007]. The SDM at frequency $f$ can be written as
\[
S(f) = L(f)\, D(f)\, L^{H}(f),
\]
where $L(f)$ is a complex-valued lower triangular matrix with all 1's on the diagonal, and $D(f)$ is a real-valued diagonal matrix with positive diagonal elements. For a bivariate time series $X(t) = (X_1(t), X_2(t))^T$, for example, we denote
\[
L(f) = \begin{pmatrix} 1 & 0 \\ L_{21}(f) & 1 \end{pmatrix}
\quad \text{and} \quad
D(f) = \begin{pmatrix} D_1(f) & 0 \\ 0 & D_2(f) \end{pmatrix},
\]
where $D_1(f)$ and $D_2(f)$ are real-valued, positive, even functions for $|f| \le 1/2$, and $L_{21}(f)$ is a complex-valued function on $[-1/2, 1/2]$ with its complex conjugate $L^{*}_{21}(f) = L_{21}(-f)$. Letting $T(f) = L^{-1}(f)$ with $T_{21}(f) = -L_{21}(f)$, we then have
\[
T(f) = \begin{pmatrix} 1 & 0 \\ T_{21}(f) & 1 \end{pmatrix}.
\]
Thus $T_{21}(f)$ is also a complex-valued function on $[-1/2, 1/2]$ whose complex conjugate satisfies $T^{*}_{21}(f) = T_{21}(-f)$ for all $f$. It follows that the modified Cholesky decomposition of the inverse of the SDM can be expressed as

\begin{align*}
S^{-1}(f) &= T^{H}(f)\, D^{-1}(f)\, T(f) \\
&= \begin{pmatrix} 1 & T^{*}_{21}(f) \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} D^{-1}_1(f) & 0 \\ 0 & D^{-1}_2(f) \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ T_{21}(f) & 1 \end{pmatrix} \\
&= \begin{pmatrix} D^{-1}_1(f) & D^{-1}_2(f) T^{*}_{21}(f) \\ 0 & D^{-1}_2(f) \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ T_{21}(f) & 1 \end{pmatrix} \\
&= \begin{pmatrix} D^{-1}_1(f) + D^{-1}_2(f) T^{*}_{21}(f) T_{21}(f) & D^{-1}_2(f) T^{*}_{21}(f) \\
D^{-1}_2(f) T_{21}(f) & D^{-1}_2(f) \end{pmatrix}. \tag{5.29}
\end{align*}
Then for the SDM we have
\begin{align*}
S(f) &= \big\{ T^{H}(f)\, D^{-1}(f)\, T(f) \big\}^{-1} \\
&= \frac{1}{D^{-1}_1(f) D^{-1}_2(f)}
\begin{pmatrix} D^{-1}_2(f) & -D^{-1}_2(f) T^{*}_{21}(f) \\
-D^{-1}_2(f) T_{21}(f) & D^{-1}_1(f) + D^{-1}_2(f) T_{21}(f) T^{*}_{21}(f) \end{pmatrix} \\
&= \begin{pmatrix} D_1(f) & -D_1(f) T^{*}_{21}(f) \\
-D_1(f) T_{21}(f) & D_2(f) + D_1(f) T_{21}(f) T^{*}_{21}(f) \end{pmatrix}. \tag{5.30}
\end{align*}

Thus, we have the following relationships for the transformation between the Cholesky components and the original SDM elements:
\begin{align}
S_1(f) &= D_1(f), \nonumber \\
S_{21}(f) &= -D_1(f)\, T_{21}(f), \quad \text{and} \tag{5.31} \\
S_2(f) &= D_2(f) + D_1(f)\, T_{21}(f)\, T^{*}_{21}(f). \nonumber
\end{align}
The determinant of the SDM can be calculated as
\[
\det S(f) = S_1(f) S_2(f) - |S_{12}(f)|^2 = D_1(f)\, D_2(f). \tag{5.32}
\]

Similar to (5.30), we can express the MT estimate of the SDM as
\[
\hat{S}^{(mt)}(f) = \begin{pmatrix}
\hat{D}^{(mt)}_1(f) & -\hat{D}^{(mt)}_1(f)\,\hat{T}^{(mt)*}_{21}(f) \\
-\hat{D}^{(mt)}_1(f)\,\hat{T}^{(mt)}_{21}(f) & \hat{D}^{(mt)}_2(f) + \hat{D}^{(mt)}_1(f)\,\hat{T}^{(mt)}_{21}(f)\,\hat{T}^{(mt)*}_{21}(f)
\end{pmatrix}, \tag{5.33}
\]
where, based on (5.31), the MT estimates of the Cholesky components can be calculated as
\begin{align}
\hat{D}^{(mt)}_1(f) &= \hat{S}^{(mt)}_1(f), \nonumber \\
\hat{T}^{(mt)}_{21}(f) &= -\hat{S}^{(mt)}_{21}(f) \big/ \hat{S}^{(mt)}_1(f), \quad \text{with} \tag{5.34} \\
\hat{D}^{(mt)}_2(f) &= \hat{S}^{(mt)}_2(f) - \big| \hat{S}^{(mt)}_{21}(f) \big|^2 \big/ \hat{S}^{(mt)}_1(f). \nonumber
\end{align}
Applying (5.29) and (5.33), after some algebra, we can obtain the trace term in the bivariate MT-Whittle likelihood as
\[
\operatorname{tr}\big\{ S^{-1}(f_j)\, \hat{S}^{(mt)}(f_j) \big\}
= \frac{\hat{D}^{(mt)}_1(f_j)}{D_1(f_j)} + \frac{\hat{D}^{(mt)}_2(f_j)}{D_2(f_j)}
+ \frac{\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)} \big| T_{21}(f_j) - \hat{T}^{(mt)}_{21}(f_j) \big|^2. \tag{5.35}
\]
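The mappings between a bivariate SDM and its modified Cholesky components are straightforward to compute. A minimal illustrative sketch (the function names are ours, not the dissertation's code) extracts the components via (5.34) and reconstructs the matrix via (5.33):

```python
import numpy as np

def cholesky_components(S):
    # Modified Cholesky components (5.34) of a 2x2 Hermitian PD SDM:
    # D1 = S1, T21 = -S21/S1, D2 = S2 - |S21|^2/S1.
    D1 = S[0, 0].real
    T21 = -S[1, 0] / D1
    D2 = (S[1, 1] - abs(S[1, 0]) ** 2 / D1).real
    return D1, T21, D2

def sdm_from_components(D1, T21, D2):
    # Reconstruction of the SDM following (5.33)/(5.30).
    return np.array([[D1, -D1 * np.conj(T21)],
                     [-D1 * T21, D2 + D1 * abs(T21) ** 2]])
```

Because $D_1$ and $D_2$ are positive whenever the input matrix is Hermitian positive definite, any estimate rebuilt from these components is automatically positive definite, which is the motivation for working on the Cholesky scale.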

It follows from (5.32) and (5.35) that the bivariate MT-Whittle likelihood can be expressed as
\begin{align*}
l_W(S(f)) &= \sum_{j=1}^{M} \Big( \log\{\det S(f_j)\} + \operatorname{tr}\big\{S^{-1}(f_j)\hat{S}^{(mt)}(f_j)\big\} \Big) \\
&= \sum_{j=1}^{M} \left[ \log\{D_1(f_j) D_2(f_j)\}
+ \frac{\hat{D}^{(mt)}_1(f_j)}{D_1(f_j)} + \frac{\hat{D}^{(mt)}_2(f_j)}{D_2(f_j)}
+ \frac{\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)} \big| T_{21}(f_j) - \hat{T}^{(mt)}_{21}(f_j) \big|^2 \right] \\
&= \sum_{j=1}^{M} \left[ \log D_1(f_j) + \frac{\hat{D}^{(mt)}_1(f_j)}{D_1(f_j)}
+ \log D_2(f_j) + \frac{\hat{D}^{(mt)}_2(f_j)}{D_2(f_j)}
+ \frac{\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)} \big| T_{21}(f_j) - \hat{T}^{(mt)}_{21}(f_j) \big|^2 \right].
\end{align*}

Denote the real part of $T_{21}(f)$ by $T_{21R}(f) = \operatorname{Re}\{T_{21}(f)\}$, and the imaginary part of $T_{21}(f)$ by $T_{21I}(f) = \operatorname{Im}\{T_{21}(f)\}$. Then, based on the previous discussion of $T_{21}(f)$, we know that $T_{21R}(f)$ is a real-valued, even function on $[-1/2, 1/2]$, and $T_{21I}(f)$ is a real-valued, odd function on $[-1/2, 1/2]$. Let $\hat{T}^{(mt)}_{21R}(f) = \operatorname{Re}\big\{\hat{T}^{(mt)}_{21}(f)\big\}$ and $\hat{T}^{(mt)}_{21I}(f) = \operatorname{Im}\big\{\hat{T}^{(mt)}_{21}(f)\big\}$. Thus, the bivariate MT-Whittle likelihood can be written in terms of the real-valued component functions $D_1(f)$, $D_2(f)$, $T_{21R}(f)$, and $T_{21I}(f)$ as
\[
l_W(S(f)) = \sum_{j=1}^{M} \left[ \log D_1(f_j) + \frac{\hat{D}^{(mt)}_1(f_j)}{D_1(f_j)}
+ \log D_2(f_j) + \frac{\hat{D}^{(mt)}_2(f_j)}{D_2(f_j)}
+ \frac{\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)} \Big( \big\{ T_{21R}(f_j) - \hat{T}^{(mt)}_{21R}(f_j) \big\}^2
+ \big\{ T_{21I}(f_j) - \hat{T}^{(mt)}_{21I}(f_j) \big\}^2 \Big) \right].
\]

5.2.2 Penalized bivariate MT-Whittle method

We consider a basis method for modeling the Cholesky components. In this section, we employ the same set of basis functions, $\{\phi_l(f) : l = 1, \ldots, p\}$, for the expansions of each Cholesky component of the SDM; the use of different basis functions for each component is left as a future extension. For each frequency $f$, denote $\phi(f) = (\phi_1(f), \ldots, \phi_p(f))^T$.

Letting $\beta_{D_1} = (\beta_{D_1,1}, \ldots, \beta_{D_1,p})^T$, $\beta_{D_2} = (\beta_{D_2,1}, \ldots, \beta_{D_2,p})^T$, $\beta_{T_{21R}} = (\beta_{T_{21R},1}, \ldots, \beta_{T_{21R},p})^T$, and $\beta_{T_{21I}} = (\beta_{T_{21I},1}, \ldots, \beta_{T_{21I},p})^T$, we suppose that
\begin{align}
\log D_1(f) &= \phi^T(f)\beta_{D_1} = \sum_{l=1}^{p} \phi_l(f)\beta_{D_1,l}, \nonumber \\
\log D_2(f) &= \phi^T(f)\beta_{D_2} = \sum_{l=1}^{p} \phi_l(f)\beta_{D_2,l}, \nonumber \\
T_{21R}(f) &= \phi^T(f)\beta_{T_{21R}} = \sum_{l=1}^{p} \phi_l(f)\beta_{T_{21R},l}, \quad \text{and} \tag{5.36} \\
T_{21I}(f) &= \phi^T(f)\beta_{T_{21I}} = \sum_{l=1}^{p} \phi_l(f)\beta_{T_{21I},l}, \nonumber
\end{align}
where a logarithmic transformation is used for modeling $D_1(f)$ and $D_2(f)$ to ensure that the estimated functions are positive. Recall that $D_1(f) = S_1(f)$. As in univariate SDF estimation, the logarithmic transformation relieves the heteroskedasticity over frequencies when estimating $D_1(f)$.

Denote the vector containing all the basis coefficients by $\Theta = (\Theta_1, \Theta_2, \ldots, \Theta_{4p})^T$, where $\Theta_l = \beta_{D_1,l}$, $\Theta_{p+l} = \beta_{D_2,l}$, $\Theta_{2p+l} = \beta_{T_{21R},l}$, and $\Theta_{3p+l} = \beta_{T_{21I},l}$, for $l = 1, 2, \ldots, p$; that is, $\Theta = \big(\beta^T_{D_1}, \beta^T_{D_2}, \beta^T_{T_{21R}}, \beta^T_{T_{21I}}\big)^T$. With the above basis expansions of the Cholesky components, we can write the multivariate MT-Whittle likelihood (5.26) as a function of $\Theta$, denoted $l_W(\Theta)$. As an extension of the L1 penalized MT-Whittle method to the multivariate case, we incorporate lasso-type penalties into the multivariate MT-Whittle likelihood, and formulate the following optimization problem:
\[
\min_{\Theta}\; l_W(\Theta) + g_1(\Theta), \tag{5.37}
\]
with the L1 penalty term
\[
g_1(\Theta) = \sum_{l=1}^{p} \lambda_{D_1,l}|\beta_{D_1,l}| + \sum_{l=1}^{p} \lambda_{D_2,l}|\beta_{D_2,l}|
+ \sum_{l=1}^{p} \lambda_{T_{21R},l}|\beta_{T_{21R},l}| + \sum_{l=1}^{p} \lambda_{T_{21I},l}|\beta_{T_{21I},l}|, \tag{5.38}
\]
where $\lambda_{D_1,l}$, $\lambda_{D_2,l}$, $\lambda_{T_{21R},l}$, and $\lambda_{T_{21I},l}$ are the nonnegative tuning parameters for the coefficients $\beta_{D_1,l}$, $\beta_{D_2,l}$, $\beta_{T_{21R},l}$, and $\beta_{T_{21I},l}$, respectively, for $l = 1, \ldots, p$. Let the vector of tuning parameters be $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_{4p})^T$, where $\lambda_l = \lambda_{D_1,l}$, $\lambda_{p+l} = \lambda_{D_2,l}$, $\lambda_{2p+l} = \lambda_{T_{21R},l}$, and $\lambda_{3p+l} = \lambda_{T_{21I},l}$, for $l = 1, 2, \ldots, p$. Then the penalty term can be written in the more general form
\[
g_1(\Theta) = \sum_{l'=1}^{4p} \lambda_{l'} |\Theta_{l'}|, \tag{5.39}
\]
where $\lambda_{l'} \ge 0$ for $l' = 1, 2, \ldots, 4p$. When a uniform tuning parameter $\lambda \ge 0$ is applied to all basis coefficients, so that $\lambda_{D_1,l} = \lambda_{D_2,l} = \lambda_{T_{21R},l} = \lambda_{T_{21I},l} = \lambda$ for $l = 1, \ldots, p$, a simplified version of the optimization problem is obtained as
\[
\min_{\Theta}\; l_W(\Theta) + \lambda \|\Theta\|_1. \tag{5.40}
\]

Similar to the univariate L1 penalized MT-Whittle method, the lasso penalties are employed to reduce the number of nonzero coefficient estimates: they shrink the magnitudes of all basis coefficients and set some coefficients exactly to zero [see, e.g., Tibshirani, 1996]. Thus, while obtaining penalized coefficient estimates, the L1 penalized multivariate MT-Whittle optimization (5.37) also performs feature selection [see, e.g., Wasserman, 2006].

5.2.3 Computation: proximal gradient method

The objective function in (5.37) contains two terms: the quasi-likelihood $l_W(\Theta)$, which is differentiable with respect to $\Theta$ (shown later in this section), and the penalty term $g_1(\Theta)$, which is a closed proper convex function as defined below. Thus, the optimization problem (5.37) fits naturally into the framework of the proximal gradient method [see, e.g., Parikh and Boyd, 2014, Section 4.2]. The proximal algorithm relies on the notion of the proximal operator of a function, which is defined as follows.

Definition 8 (Parikh and Boyd [2014], Section 1.1). A function $g : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ is said to be a closed proper convex function if its epigraph, defined by
\[
\operatorname{epi} g = \{(y, b) \in \mathbb{R}^n \times \mathbb{R} \mid g(y) \le b\},
\]
is a nonempty closed convex set. The proximal operator of $g$, denoted by $\operatorname{prox}_g : \mathbb{R}^n \to \mathbb{R}^n$, is defined as
\[
\operatorname{prox}_g(v) = \operatorname*{argmin}_{y} \left\{ g(y) + \frac{1}{2}\|y - v\|^2_2 \right\},
\]
where $\|\cdot\|_2$ is the Euclidean norm.

A comprehensive review of the properties of the proximal operator can be found in Parikh and Boyd [2014].

The penalty term (5.39) is based on the function $g_1(y) = \sum_{l'=1}^{4p} \lambda_{l'}|y_{l'}|$ for a general vector $y = (y_1, y_2, \ldots, y_{4p})^T \in \mathbb{R}^{4p}$. By Definition 8, the proximal operator of $\alpha g_1$, where $\alpha > 0$, is
\begin{align}
\operatorname{prox}_{\alpha g_1}(v) &= \operatorname*{argmin}_{y} \left\{ \alpha g_1(y) + \frac{1}{2}\|y - v\|^2_2 \right\} \tag{5.41} \\
&= \operatorname*{argmin}_{y} \left\{ \sum_{l'=1}^{4p} \alpha \lambda_{l'} |y_{l'}| + \frac{1}{2}\|y - v\|^2_2 \right\}, \tag{5.42}
\end{align}
where the $l'$th element of $\operatorname{prox}_{\alpha g_1}(v)$ is
\[
\big\{ \operatorname{prox}_{\alpha g_1}(v) \big\}_{l'} = \operatorname{ST}(v_{l'}, \alpha \lambda_{l'}), \quad \text{for } l' = 1, 2, \ldots, 4p.
\]
In the above expression, recall from (2.20) that $\operatorname{ST}(x, a) = \operatorname{Sign}(x)\max\{|x| - a, 0\}$ is the soft-thresholding function.
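This closed form amounts to elementwise soft-thresholding. A minimal sketch (helper names are our own):

```python
import numpy as np

def soft_threshold(x, a):
    # ST(x, a) = sign(x) * max(|x| - a, 0), applied elementwise.
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def prox_weighted_l1(v, alpha, lam):
    # Proximal operator (5.41)-(5.42) of alpha * g1: the l'-th element
    # of v is soft-thresholded at level alpha * lambda_{l'}.
    return soft_threshold(np.asarray(v, float), alpha * np.asarray(lam, float))
```

Coefficients whose magnitude falls below their threshold are set exactly to zero, which is how the lasso penalty performs feature selection inside the proximal update.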

The proximal gradient algorithm solves the optimization problem (5.37) by applying the proximal operator to update the coefficients $\Theta$. Following Parikh and Boyd [2014, Section 4.2], the proximal gradient algorithm considered in Beck and Teboulle [2009] has the following $(n+1)$th step, as shown in Algorithm 5.1.

Algorithm 5.1: Proximal gradient algorithm for (5.37) at the $(n+1)$th step

Given $\Theta^{(n)}$, $\alpha^{(n-1)}$, and parameter $\rho \in (0, 1)$. Let $\alpha := \alpha^{(n-1)}$. Repeat:

1. Let $\tilde{\eta} := \operatorname{prox}_{\alpha g_1}\big( \Theta^{(n)} - \alpha \nabla l_W(\Theta^{(n)}) \big)$.

2. Break if $l_W(\tilde{\eta}) \le l_W(\Theta^{(n)}) + \nabla l_W(\Theta^{(n)})^T \big( \tilde{\eta} - \Theta^{(n)} \big) + (2\alpha)^{-1} \big\| \tilde{\eta} - \Theta^{(n)} \big\|^2_2$.

3. Update $\alpha := \rho\alpha$.

Return $\alpha^{(n)} := \alpha$, $\Theta^{(n+1)} := \tilde{\eta}$.
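Algorithm 5.1 can be sketched generically for any differentiable loss. For illustration we apply it to a lasso-penalized quadratic loss rather than the MT-Whittle likelihood; the toy data and all names below are our own assumptions:

```python
import numpy as np

def soft_threshold(x, a):
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def prox_gradient(loss, grad, lam, theta0, alpha0=1.0, rho=0.5, n_steps=200):
    # Proximal gradient with the backtracking rule of Algorithm 5.1:
    # shrink alpha until the sufficient-decrease condition of step 2 holds.
    # As in the algorithm, alpha carries over between outer steps.
    theta, alpha = theta0.astype(float), alpha0
    for _ in range(n_steps):
        g = grad(theta)
        while True:
            eta = soft_threshold(theta - alpha * g, alpha * lam)
            d = eta - theta
            if loss(eta) <= loss(theta) + g @ d + (d @ d) / (2.0 * alpha):
                break
            alpha *= rho
        theta = eta
    return theta

# Toy illustration: sparse regression with a uniform penalty level.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
truth = np.array([2.0, -1.0] + [0.0] * 8)
b = A @ truth + 0.01 * rng.normal(size=50)
loss = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
grad = lambda th: A.T @ (A @ th - b)
theta_hat = prox_gradient(loss, grad, np.full(10, 1.0), np.zeros(10))
```

With a quadratic loss the backtracking condition is the standard descent lemma, so the inner loop always terminates once $\alpha$ falls below the reciprocal Lipschitz constant of the gradient.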

In the proximal gradient descent algorithm, we need to repeatedly evaluate the gradient of the multivariate MT-Whittle likelihood, denoted $\nabla l_W(\Theta)$, which can be calculated explicitly as follows. First, the partial derivatives of the multivariate MT-Whittle likelihood function with respect to the Cholesky components can be obtained, for $j = 1, 2, \ldots, M$, as
\begin{align*}
\frac{\partial l_W(\Theta)}{\partial D_1(f_j)} &= \frac{1}{D_1(f_j)} - \frac{\hat{D}^{(mt)}_1(f_j)}{D_1(f_j)^2}; \\
\frac{\partial l_W(\Theta)}{\partial D_2(f_j)} &= \frac{1}{D_2(f_j)}
- \frac{\hat{D}^{(mt)}_2(f_j) + \hat{D}^{(mt)}_1(f_j)\big|T_{21}(f_j) - \hat{T}^{(mt)}_{21}(f_j)\big|^2}{D_2(f_j)^2}; \\
\frac{\partial l_W(\Theta)}{\partial T_{21R}(f_j)} &= \frac{\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)}
\cdot 2\big\{ T_{21R}(f_j) - \hat{T}^{(mt)}_{21R}(f_j) \big\}; \\
\frac{\partial l_W(\Theta)}{\partial T_{21I}(f_j)} &= \frac{\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)}
\cdot 2\big\{ T_{21I}(f_j) - \hat{T}^{(mt)}_{21I}(f_j) \big\}.
\end{align*}

Based on the basis expansions (5.36), we have
\[
D_1(f_j) = \exp\left\{ \sum_{l=1}^{p} \phi_l(f_j)\beta_{D_1,l} \right\}, \quad
D_2(f_j) = \exp\left\{ \sum_{l=1}^{p} \phi_l(f_j)\beta_{D_2,l} \right\}, \quad
T_{21R}(f_j) = \sum_{l=1}^{p} \phi_l(f_j)\beta_{T_{21R},l}, \quad \text{and} \quad
T_{21I}(f_j) = \sum_{l=1}^{p} \phi_l(f_j)\beta_{T_{21I},l}.
\]

Then, at each Fourier frequency $f_j$ for $j = 1, \ldots, M$, the partial derivatives of the Cholesky components with respect to the basis coefficients are
\[
\frac{\partial D_1(f_j)}{\partial \beta_{D_1,l}} = D_1(f_j)\phi_l(f_j), \quad
\frac{\partial D_2(f_j)}{\partial \beta_{D_2,l}} = D_2(f_j)\phi_l(f_j), \quad
\frac{\partial T_{21R}(f_j)}{\partial \beta_{T_{21R},l}} = \phi_l(f_j), \quad \text{and} \quad
\frac{\partial T_{21I}(f_j)}{\partial \beta_{T_{21I},l}} = \phi_l(f_j),
\]
for $l = 1, 2, \ldots, p$. It follows from the chain rule that
\begin{align*}
\frac{\partial l_W(\Theta)}{\partial \beta_{D_1,l}}
&= \sum_{j=1}^{M} \frac{\partial l_W(\Theta)}{\partial D_1(f_j)} \cdot \frac{\partial D_1(f_j)}{\partial \beta_{D_1,l}}
= \sum_{j=1}^{M} \left\{ 1 - \frac{\hat{D}^{(mt)}_1(f_j)}{D_1(f_j)} \right\} \phi_l(f_j), \\
\frac{\partial l_W(\Theta)}{\partial \beta_{D_2,l}}
&= \sum_{j=1}^{M} \frac{\partial l_W(\Theta)}{\partial D_2(f_j)} \cdot \frac{\partial D_2(f_j)}{\partial \beta_{D_2,l}}
= \sum_{j=1}^{M} \left\{ 1 - \frac{\hat{D}^{(mt)}_2(f_j) + \hat{D}^{(mt)}_1(f_j)\big|T_{21}(f_j) - \hat{T}^{(mt)}_{21}(f_j)\big|^2}{D_2(f_j)} \right\} \phi_l(f_j), \\
\frac{\partial l_W(\Theta)}{\partial \beta_{T_{21R},l}}
&= \sum_{j=1}^{M} \frac{\partial l_W(\Theta)}{\partial T_{21R}(f_j)} \cdot \frac{\partial T_{21R}(f_j)}{\partial \beta_{T_{21R},l}}
= \sum_{j=1}^{M} \frac{2\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)} \big\{ T_{21R}(f_j) - \hat{T}^{(mt)}_{21R}(f_j) \big\} \phi_l(f_j), \\
\frac{\partial l_W(\Theta)}{\partial \beta_{T_{21I},l}}
&= \sum_{j=1}^{M} \frac{\partial l_W(\Theta)}{\partial T_{21I}(f_j)} \cdot \frac{\partial T_{21I}(f_j)}{\partial \beta_{T_{21I},l}}
= \sum_{j=1}^{M} \frac{2\hat{D}^{(mt)}_1(f_j)}{D_2(f_j)} \big\{ T_{21I}(f_j) - \hat{T}^{(mt)}_{21I}(f_j) \big\} \phi_l(f_j),
\end{align*}
for $l = 1, 2, \ldots, p$.

Thus, the gradient of the multivariate MT-Whittle likelihood can be obtained as
\[
\nabla l_W(\Theta) = \left( \frac{\partial l_W(\Theta)}{\partial \Theta_1},
\frac{\partial l_W(\Theta)}{\partial \Theta_2}, \ldots,
\frac{\partial l_W(\Theta)}{\partial \Theta_{4p}} \right)^T,
\]
where $\Theta_l = \beta_{D_1,l}$, $\Theta_{p+l} = \beta_{D_2,l}$, $\Theta_{2p+l} = \beta_{T_{21R},l}$, and $\Theta_{3p+l} = \beta_{T_{21I},l}$, for $l = 1, 2, \ldots, p$.
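As a concrete check of these chain-rule expressions, the sketch below implements the bivariate Cholesky-form likelihood and its gradient for given MT estimates of the Cholesky components (all names are illustrative assumptions, not the dissertation's code); the gradient can then be verified against finite differences:

```python
import numpy as np

def biv_mtw_loglik(Theta, Phi, D1_mt, D2_mt, T21_mt):
    # Bivariate MT-Whittle likelihood in Cholesky form, with log D1,
    # log D2, T21R, and T21I expanded in the basis matrix Phi (M x p).
    bD1, bD2, bR, bI = np.split(Theta, 4)
    D1, D2 = np.exp(Phi @ bD1), np.exp(Phi @ bD2)
    T21 = Phi @ bR + 1j * (Phi @ bI)
    return np.sum(np.log(D1) + D1_mt / D1 + np.log(D2)
                  + (D2_mt + D1_mt * np.abs(T21 - T21_mt) ** 2) / D2)

def biv_mtw_grad(Theta, Phi, D1_mt, D2_mt, T21_mt):
    # Gradient assembled from the chain-rule expressions above.
    bD1, bD2, bR, bI = np.split(Theta, 4)
    D1, D2 = np.exp(Phi @ bD1), np.exp(Phi @ bD2)
    T21 = Phi @ bR + 1j * (Phi @ bI)
    r2 = np.abs(T21 - T21_mt) ** 2
    gD1 = Phi.T @ (1.0 - D1_mt / D1)
    gD2 = Phi.T @ (1.0 - (D2_mt + D1_mt * r2) / D2)
    gR = Phi.T @ (2.0 * D1_mt / D2 * (T21.real - T21_mt.real))
    gI = Phi.T @ (2.0 * D1_mt / D2 * (T21.imag - T21_mt.imag))
    return np.concatenate([gD1, gD2, gR, gI])
```

Such a finite-difference check is a cheap safeguard before handing the gradient to the proximal gradient routine.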

Furthermore, based on Parikh and Boyd [2014, Section 4.3], an accelerated version of the proximal gradient algorithm can be adopted, whose $(n+1)$th step is given below in Algorithm 5.2.

Algorithm 5.2: Accelerated proximal gradient algorithm at the $(n+1)$th step

Given $\Theta^{(n)}$, $\alpha^{(n-1)}$, and parameter $\rho \in (0, 1)$.
Let $\alpha := \alpha^{(n-1)}$, and $\vartheta^{(n)} := \Theta^{(n)} + \big( n/(n+3) \big)\big( \Theta^{(n)} - \Theta^{(n-1)} \big)$. Repeat:

1. Let $\tilde{\eta} := \operatorname{prox}_{\alpha g_1}\big( \vartheta^{(n)} - \alpha \nabla l_W(\vartheta^{(n)}) \big)$.

2. Break if $l_W(\tilde{\eta}) \le l_W(\vartheta^{(n)}) + \nabla l_W(\vartheta^{(n)})^T \big( \tilde{\eta} - \vartheta^{(n)} \big) + (2\alpha)^{-1} \big\| \tilde{\eta} - \vartheta^{(n)} \big\|^2_2$.

3. Update $\alpha := \rho\alpha$.

Return $\alpha^{(n)} := \alpha$, $\Theta^{(n+1)} := \tilde{\eta}$.

The proximal gradient descent algorithm is terminated when the change in the objective function $l_W(\Theta) + g_1(\Theta)$ between two adjacent steps is smaller than a prespecified threshold. The algorithm may result in a local minimum instead of a global minimum if the differentiable function $l_W(\Theta)$ is non-convex with respect to $\Theta$. An adjustment of the starting point $\Theta^{(1)}$ or the initial step size $\alpha^{(0)}$ may be helpful in achieving a better local minimum in such scenarios, or may further improve computational efficiency in general [see, e.g., Parikh and Boyd, 2014].

5.2.4 Tuning parameter selection

Similar to the univariate case, we extend the idea of the universal threshold proposed by Donoho and Johnstone [1994] to the L1 penalized multivariate MT-Whittle problem when orthogonal bases are used. In the proximal gradient algorithm, we notice that the tuning parameters $\lambda$ only appear in the calculation of $\operatorname{prox}_{\alpha g_1}$, a soft-thresholding procedure applied after updating the coefficient estimates $\Theta^{(n)}$ in the direction that minimizes $l_W(\Theta)$. Denote the minimizer of the unpenalized function $l_W(\Theta)$ by $\hat{\Theta}^W$. Heuristically, if the coefficient estimates $\hat{\Theta}^W$ follow asymptotic Gaussian distributions with a constant variance, then a universal threshold can shrink the noise in $\hat{\Theta}^W$ to zero with high probability, and preserve the true features of the Cholesky components of the SDM. (See Section 3.3 for the univariate case.) However, we noticed via simulation that the Cholesky components of an SDM often have different ranges, which leads to different variances in the coefficient estimates. The scale difference between estimating diagonal and off-diagonal Cholesky components varies from process to process, regardless of whether a logarithmic transformation is taken on $D_1(f)$ and $D_2(f)$ or not. Since the asymptotic distributions of the estimated Cholesky components are beyond our knowledge at this stage, here we only focus on providing thresholds with reasonable scale levels for practical purposes.

As discussed previously for the bivariate case, the logarithmic transformation of $D_1(f)$ and $D_2(f)$ may help reduce the heteroskedasticity in their estimation. By the relationships in (5.31), we would expect similar scales between the coefficient estimates for $\log D_1(f)$ and $\log D_2(f)$ and the coefficient estimates for the marginal log SDFs. Assuming that the effect of estimating the different Cholesky components on one another is small, we apply the same universal threshold as in the univariate case to the L1 shrinkage of $\log D_1(f)$ and $\log D_2(f)$ with orthonormal bases:
\[
\lambda^{\text{univ}}_{D_1} = \sqrt{1/K} \cdot \sqrt{2\log 4p}, \quad \text{and} \quad
\lambda^{\text{univ}}_{D_2} = \sqrt{1/K} \cdot \sqrt{2\log 4p}, \tag{5.43}
\]
where the number of parameters is adapted to the dimensionality of $\Theta$, $4p$, and recall that $K$ is the number of tapers of the MT estimates. By (5.34), more variability exists in estimating $D_2$ than in estimating $D_1$, so optimal performance might require some adjustment of the universal threshold for $D_2$ in practice.

On the other hand, the real-valued off-diagonal Cholesky components T21R(f) and T21I(f) are modeled on the original scale with basis expansions. By (5.31), they are the real and imaginary parts of the ratio of the cross-spectrum S21(f) over an auto-spectrum S1(f). Since the ranges of both T21R(f) and T21I(f) include negative numbers, a logarithmic transformation cannot be applied directly. For either of the two spectra, denote the vector of

true spectral signals at the Fourier frequencies (2.18) as $\tilde{\mu} = (\tilde{\mu}_1, \ldots, \tilde{\mu}_M)^T$, and correspondingly a vector of observed values as $\tilde{z} = (\tilde{z}_1, \ldots, \tilde{z}_M)^T$. Let $\tilde{\zeta} = (\tilde{\zeta}_1, \ldots, \tilde{\zeta}_M)^T$ represent the zero-mean noise in the observed spectrum such that

$$\tilde{z} = \tilde{\mu} + \tilde{\zeta}.$$

With p = M orthonormal basis functions, we denote the M × M basis matrix by Φ with jth row $\phi^T(f_j) = (\phi_1(f_j), \ldots, \phi_M(f_j))$ for j = 1, ..., M. Assume that the true spectrum can be expressed as $\tilde{\mu} = \Phi\beta_\mu$; then the true basis coefficients $\beta_\mu = (\beta_{\mu,1}, \ldots, \beta_{\mu,M})^T$ satisfy $\beta_\mu = \Phi^T\tilde{\mu}$. Without considering the effects from all other component spectra, the “saturated” model with p = M would retain the observed values $\tilde{z} = \Phi\widehat{\beta}^W_\mu$ as the fitted spectrum, and we have $\widehat{\beta}^W_\mu = \Phi^T\tilde{z}$. Thus, the unpenalized coefficient estimates can be expressed as

$$\widehat{\beta}^W_\mu = \Phi^T\tilde{\mu} + \Phi^T\tilde{\zeta} = \beta_\mu + \widehat{\beta}^W_\zeta,$$

where $\widehat{\beta}^W_\zeta = \{\widehat{\beta}^W_{\zeta,1}, \ldots, \widehat{\beta}^W_{\zeta,M}\}^T = \Phi^T\tilde{\zeta}$ is the noise in the coefficient estimates that we would like to remove. Here, we determine the scale level for thresholding based on a simplified model with a special case of orthonormal bases $\Phi = I_M$, and assume true coefficients $\beta_\mu = 0$. In this case, our goal of thresholding is to turn off the non-zero elements in $\widehat{\beta}^W_\zeta = \Phi^T\tilde{\zeta} = \tilde{\zeta}$. Denote

the log absolute values of the non-zero elements of $\tilde{\zeta}$ by $\tilde{\xi}_j = \log|\tilde{\zeta}_j|$, for $\tilde{\zeta}_j \neq 0$ and j = 1, ..., M. The modulus operation preserves the magnitude and makes the data compatible with a logarithmic transformation, which allows us to build connections to our previously developed theory on the scale-calibrated universal threshold for log transformed spectra in Section 3.3.1. For example, if $T_{21R}(f) = 0$, then $\tilde{\zeta}_j = \widehat{T}^{(mt)}_{21R}(f_j) = -\mathrm{Re}\{\widehat{S}^{(mt)}_{21}(f_j)\}\big/\widehat{S}^{(mt)}_{1}(f_j)$, and it follows that

$$\tilde{\xi}_j = \log|\tilde{\zeta}_j| = \log\left|\mathrm{Re}\{\widehat{S}^{(mt)}_{21}(f_j)\}\right| - \log \widehat{S}^{(mt)}_{1}(f_j), \quad j = 1, \ldots, M.$$

Notice that $P(\tilde{\zeta}_j \neq 0) \to 1$ as $M \to \infty$, for each j, since the asymptotic distribution is continuous. Assume that the $\tilde{\xi}_j$ are independent and asymptotically follow a normal distribution $N(0, \sigma^2_{\tilde{\xi}})$. Analogous to modeling a log transformed spectrum, the scale of a universal threshold for $\tilde{\xi}_j$ is based on the fact that

$$P\left(\max_{j:\,\tilde{\zeta}_j \neq 0} |\tilde{\xi}_j| < \sigma_{\tilde{\xi}}\sqrt{2\log M}\right) \to 1, \quad \text{as } M \to \infty, \qquad (5.44)$$

where $\sigma_{\tilde{\xi}}$ is the asymptotic standard deviation of $\tilde{\xi}_j$, j = 1, ..., M. If the asymptotic variance 1/K is still retained for $\tilde{\xi}_j$, then we have

$$P\left(\max_{j} \exp\{\tilde{\xi}_j\} < \exp\left\{\sqrt{1/K}\,\sqrt{2\log M}\right\}\right) \to 1, \quad \text{as } M \to \infty.$$

Since $\exp\{\tilde{\xi}_j\} = |\tilde{\zeta}_j| = |\widehat{\beta}^W_{\zeta,j}|$ in the current setting, it implies that

$$P\left(\max_{j} |\widehat{\beta}^W_{\zeta,j}| < \exp\left\{\sqrt{1/K}\,\sqrt{2\log M}\right\}\right) \to 1, \quad \text{as } M \to \infty.$$

Thus, one can directly apply the exponential universal threshold for T21R(f) and T21I(f) as

$$\lambda^{\mathrm{univ}}_{T_{21R}} = \exp\left\{\sqrt{1/K}\,\sqrt{2\log 4p}\right\}, \quad \text{and} \quad \lambda^{\mathrm{univ}}_{T_{21I}} = \exp\left\{\sqrt{1/K}\,\sqrt{2\log 4p}\right\}, \qquad (5.45)$$

where, again, the number of parameters is adapted to the dimensionality of Θ. The performance of the scale-calibrated universal thresholds (5.43) and (5.45) for our L1 penalized bivariate MT-Whittle method will be evaluated in the simulations presented in Section 5.3.
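As a concrete sketch of how the scale-calibrated thresholds (5.43) and (5.45) would be computed and applied, the following Python snippet evaluates them and applies the soft-thresholding operator used in the proximal gradient step. The function names (`universal_thresholds`, `soft_threshold`) and the toy coefficient values are illustrative, not part of the dissertation's software.

```python
import numpy as np

def universal_thresholds(K, p):
    """Scale-calibrated universal thresholds (5.43) and (5.45): the
    log-scale components log D1, log D2 use sqrt(1/K) * sqrt(2 log 4p),
    and the original-scale components T21R, T21I use its exponential."""
    lam_log = np.sqrt(1.0 / K) * np.sqrt(2.0 * np.log(4.0 * p))
    return {"D1": lam_log, "D2": lam_log,
            "T21R": np.exp(lam_log), "T21I": np.exp(lam_log)}

def soft_threshold(x, lam):
    """Soft-thresholding operator applied in the proximal gradient step."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Example: K = 15 tapers and p = 1024 basis functions per component.
lams = universal_thresholds(K=15, p=1024)
coefs = np.array([0.05, -0.9, 1.4, 0.0])    # toy coefficient estimates
shrunk = soft_threshold(coefs, lams["D1"])  # small coefficients -> 0
```

With K = 15 and p = 1024 the log-scale threshold is roughly 1.05, so only coefficients exceeding that magnitude survive the shrinkage.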

Another tuning parameter selection method under consideration is an extension of the generalized information criterion (GIC). The GIC is data-adaptive but computationally more expensive than the scale-calibrated universal thresholds, especially when individualized tuning parameters are used for each element of Θ and when the time series is of high dimensionality. Here, we only present the application of the GIC to the simplified version of the penalized MT-Whittle problem (5.40) with a single tuning parameter λ. Based on

Proposition 7, the GIC method applied to (5.40) finds

$$\tilde{\lambda} = \arg\min_{\lambda}\left\{2K\, l_W(\Theta_\lambda) + \tilde{c}_M\, |\tilde{p}_\lambda|\right\}, \qquad (5.46)$$

where $\Theta_\lambda$ is the optimizer of the L1 penalized multivariate MT-Whittle likelihood with tuning parameter λ. In (5.46), $|\tilde{p}_\lambda|$ represents the number of non-zero elements in $\Theta_\lambda$, and $\tilde{c}_M$ is the penalty parameter. For the SDM estimation of a bivariate time series, when the total number of predictors, 4p, increases exponentially as the sample size M increases (i.e., $\log p = O(M^\kappa)$ for some κ > 0), the penalty parameter suggested by Fan and Tang [2013] can be adapted as $\tilde{c}_M = \{\log\log(4M)\}\{\log(4p)\}$. (See Section 3.3 and Fan and Tang [2013] for more information on the selection of the penalty parameter.)
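The GIC search of (5.46) can be sketched as a simple grid search. In the snippet below, `fit_whittle` is a hypothetical function standing in for the penalized MT-Whittle solver, and `toy_fit` is a made-up stand-in used only to exercise the search; everything else follows the formula with the adapted penalty parameter.

```python
import numpy as np

def gic_select(lambdas, fit_whittle, K, M, p):
    """Grid search for lambda by the GIC of (5.46).

    `fit_whittle(lam)` is assumed to return (Theta_lam, l_W(Theta_lam)),
    the L1 penalized coefficient estimate and its unpenalized MT-Whittle
    loss; c_M = {log log(4M)}{log(4p)} follows Fan and Tang (2013)."""
    c_M = np.log(np.log(4.0 * M)) * np.log(4.0 * p)
    best_lam, best_gic = None, np.inf
    for lam in lambdas:
        theta, loss = fit_whittle(lam)
        gic = 2.0 * K * loss + c_M * np.count_nonzero(theta)
        if gic < best_gic:
            best_lam, best_gic = lam, gic
    return best_lam, best_gic

# Toy illustration: lam = 0.1 keeps two coefficients with a small loss,
# lam = 1.0 keeps one coefficient but fits much worse.
toy_fit = lambda lam: (np.array([1.0, 0.7 if lam < 0.5 else 0.0]), lam**2)
best_lam, best_gic = gic_select([0.1, 1.0], toy_fit, K=15, M=1024, p=1024)
```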

5.2.5 Sample splitting

We implement a sample splitting strategy for the inference of the proposed L1 penalized multivariate MT-Whittle approach. An overview of sample splitting is provided in Section

4.1.1. Suppose the regularly sampled, real-valued stationary m-vector time series X(t) is observed at N consecutive time points t = 1, 2, ..., N. Denote $Z = \{\widehat{S}^{(mt)}(f_1), \ldots, \widehat{S}^{(mt)}(f_M)\}$ as the set of MT spectral estimates $\widehat{S}^{(mt)}(f_j)$ evaluated at $M = \lceil N/2\rceil - 1$ non-zero, non-Nyquist (i.e., not equal to 1/2) Fourier frequencies as defined by (2.18). Recall that Φ is an M × p basis matrix with jth row equal to $\phi^T(f_j)$, j = 1, 2, ..., M, for modeling the spectral elements. For example, for a bivariate time series, the vectors of Cholesky components of the SDM evaluated at the frequencies (2.18) can be expanded as

$$\{\log D_1(f_1), \ldots, \log D_1(f_M)\}^T = \Phi\beta_{D_1}, \quad \{\log D_2(f_1), \ldots, \log D_2(f_M)\}^T = \Phi\beta_{D_2},$$
$$\{T_{21R}(f_1), \ldots, T_{21R}(f_M)\}^T = \Phi\beta_{T_{21R}}, \quad \text{and} \quad \{T_{21I}(f_1), \ldots, T_{21I}(f_M)\}^T = \Phi\beta_{T_{21I}}. \qquad (5.47)$$

We use $\tilde{D}_M = \{Z, \Phi\}$ to denote the sample of size M used for the analysis of the SDM.

Similar to the univariate case, we separate the frequencies by odd and even values of the index j = 1, 2, ..., M to formulate two subsets, $\mathcal{F}_1 = \{f_1, f_3, \ldots, f_{2\lceil M/2\rceil - 1}\}$ defined in (4.1) and $\mathcal{F}_2 = \{f_2, f_4, \ldots, f_{2\lfloor M/2\rfloor}\}$ defined in (4.2), which carry comparable information about the

SDM (see Section 4.1.2 for more information). The sample splitting corresponds to two subsets of MT estimates:

$$Z_1 = \left\{\widehat{S}^{(mt)}(f_1), \widehat{S}^{(mt)}(f_3), \ldots, \widehat{S}^{(mt)}(f_{2\lceil M/2\rceil - 1})\right\}, \quad \text{and} \qquad (5.48)$$
$$Z_2 = \left\{\widehat{S}^{(mt)}(f_2), \widehat{S}^{(mt)}(f_4), \ldots, \widehat{S}^{(mt)}(f_{2\lfloor M/2\rfloor})\right\}. \qquad (5.49)$$
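A minimal sketch of the odd/even frequency split of (5.48)-(5.49), assuming the MT estimates are stored one per Fourier frequency; the helper name `split_sample` and the stand-in arrays are ours:

```python
import numpy as np

def split_sample(Z, Phi):
    """Split MT estimates Z and basis matrix Phi (one row per Fourier
    frequency) into the odd-index half (model selection) and the
    even-index half (inference), as in (5.48)-(5.49)."""
    Z = np.asarray(Z)
    odd = np.arange(0, len(Z), 2)   # j = 1, 3, ... in 1-based indexing
    even = np.arange(1, len(Z), 2)  # j = 2, 4, ... in 1-based indexing
    return (Z[odd], Phi[odd]), (Z[even], Phi[even])

M, p = 7, 3
Z = np.arange(M, dtype=float)       # stand-ins for MT estimates
Phi = np.ones((M, p))               # stand-in basis matrix
(Z1, Phi1), (Z2, Phi2) = split_sample(Z, Phi)
# Half sample sizes: M1 = ceil(M/2) = 4 and M2 = floor(M/2) = 3.
```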

Correspondingly, the separated subsets of the Cholesky components can be obtained using

(5.34). At the same time, we conduct the same splitting for the common basis matrix Φ as in the univariate case: matrix Φ1 consists of the odd rows of Φ, and matrix Φ2 is generated

by the even rows of Φ. The data $\tilde{D}_M$ is then split into two halves. The first half of the sample is used for model selection, denoted by $\tilde{D}_{1,M_1} = \{\Phi_1, Z_1\}$, with half sample size $M_1 = \lceil M/2\rceil$. The second half of the sample is used for the SDM inference, denoted by $\tilde{D}_{2,M_2} = \{\Phi_2, Z_2\}$, with half sample size $M_2 = \lfloor M/2\rfloor$. Based on Dezeure et al. [2015,

Section 3.1], we apply our L1 penalized multivariate MT-Whittle approach to the first half of the data, $\tilde{D}_{1,M_1}$. For the spectral analysis of a bivariate time series, we select the bases with non-zero coefficients for log D1(f), log D2(f), T21R(f), and T21I(f). Denote the L1 penalized estimate of Θ as $\widehat{\Theta}^I = \{\widehat{\Theta}^I_1, \ldots, \widehat{\Theta}^I_{4p}\}^T$. Let $\mathcal{I}_\Theta$ be the index set of non-zero elements of $\widehat{\Theta}^I$, that is, $\mathcal{I}_\Theta = \{l' : \widehat{\Theta}^I_{l'} \neq 0;\ l' = 1, \ldots, 4p\}$. Also, we obtain the index sets of non-zero coefficient estimates for log D1(f), log D2(f), T21R(f), and T21I(f) separately as

$$\mathcal{I}_{D_1} = \{l : \widehat{\beta}^I_{D_1,l} \neq 0;\ l = 1, \ldots, p\}, \quad \mathcal{I}_{D_2} = \{l : \widehat{\beta}^I_{D_2,l} \neq 0;\ l = 1, \ldots, p\},$$
$$\mathcal{I}_{T_{21R}} = \{l : \widehat{\beta}^I_{T_{21R},l} \neq 0;\ l = 1, \ldots, p\}, \quad \text{and} \quad \mathcal{I}_{T_{21I}} = \{l : \widehat{\beta}^I_{T_{21I},l} \neq 0;\ l = 1, \ldots, p\}.$$

With these selected features, the following models are considered on the second half $\tilde{D}_{2,M_2}$ of the data: for each frequency f,

$$\log D_1(f) = \phi^T_{\mathcal{I}_{D_1}}(f)\,\beta_{D_1,\mathcal{I}_{D_1}} = \sum_{l\in\mathcal{I}_{D_1}} \phi_l(f)\,\beta_{D_1,l},$$
$$\log D_2(f) = \phi^T_{\mathcal{I}_{D_2}}(f)\,\beta_{D_2,\mathcal{I}_{D_2}} = \sum_{l\in\mathcal{I}_{D_2}} \phi_l(f)\,\beta_{D_2,l},$$
$$T_{21R}(f) = \phi^T_{\mathcal{I}_{T_{21R}}}(f)\,\beta_{T_{21R},\mathcal{I}_{T_{21R}}} = \sum_{l\in\mathcal{I}_{T_{21R}}} \phi_l(f)\,\beta_{T_{21R},l}, \quad \text{and} \qquad (5.50)$$
$$T_{21I}(f) = \phi^T_{\mathcal{I}_{T_{21I}}}(f)\,\beta_{T_{21I},\mathcal{I}_{T_{21I}}} = \sum_{l\in\mathcal{I}_{T_{21I}}} \phi_l(f)\,\beta_{T_{21I},l},$$

where $\phi_{\mathcal{I}_{D_1}}(f)$, $\phi_{\mathcal{I}_{D_2}}(f)$, $\phi_{\mathcal{I}_{T_{21R}}}(f)$, and $\phi_{\mathcal{I}_{T_{21I}}}(f)$ represent the vectors of the selected basis functions for log D1(f), log D2(f), T21R(f), and T21I(f), respectively, and $\beta_{D_1,\mathcal{I}_{D_1}}$, $\beta_{D_2,\mathcal{I}_{D_2}}$, $\beta_{T_{21R},\mathcal{I}_{T_{21R}}}$, and $\beta_{T_{21I},\mathcal{I}_{T_{21I}}}$ are the corresponding regression coefficients. Analogous to univariate spectral analysis, only columns with the selected indexes are preserved for $\Phi_2$ and Φ. For the second half of the frequencies $\mathcal{F}_2$, we denote the remaining $M_2 \times |\mathcal{I}_{D_1}|$ design matrix

for log D1(f) as $\Phi_{2,D_1}$, the remaining $M_2 \times |\mathcal{I}_{D_2}|$ design matrix for log D2(f) as $\Phi_{2,D_2}$, the remaining $M_2 \times |\mathcal{I}_{T_{21R}}|$ design matrix for T21R(f) as $\Phi_{2,T_{21R}}$, and the remaining $M_2 \times |\mathcal{I}_{T_{21I}}|$ design matrix for T21I(f) as $\Phi_{2,T_{21I}}$. The remaining design matrices of the full data at the frequencies (2.18) are denoted as $\Phi_{D_1}$, $\Phi_{D_2}$, $\Phi_{T_{21R}}$, and $\Phi_{T_{21I}}$, respectively, for log D1(f), log D2(f), T21R(f), and T21I(f). Then, we make inference using the unpenalized multivariate MT-Whittle likelihood for the second half of the data with the selected bases, denoted as $D_{2,M_2} = \{\Phi_{2,D_1}, \Phi_{2,D_2}, \Phi_{2,T_{21R}}, \Phi_{2,T_{21I}}, Z_2\}$. Let $\Theta_{\mathcal{I}_\Theta} = \{\Theta_l : l \in \mathcal{I}_\Theta\}$ be the full set of

selected coefficients. The bivariate MT-Whittle likelihood based on $D_{2,M_2}$ is then

$$l_W(\Theta_{\mathcal{I}_\Theta}) = \sum_{j:f_j\in\mathcal{F}_2}\Bigg[\log D_1(f_j) + \frac{\widehat{D}^{(mt)}_1(f_j)}{D_1(f_j)} + \log D_2(f_j) + \frac{\widehat{D}^{(mt)}_2(f_j)}{D_2(f_j)}$$
$$\qquad\qquad + \frac{\widehat{D}^{(mt)}_1(f_j)}{D_2(f_j)}\left(\left\{T_{21R}(f_j) - \widehat{T}^{(mt)}_{21R}(f_j)\right\}^2 + \left\{T_{21I}(f_j) - \widehat{T}^{(mt)}_{21I}(f_j)\right\}^2\right)\Bigg], \qquad (5.51)$$

with basis expansions (5.50). Denote $\widehat{\beta}^W_{D_1,\mathcal{I}_{D_1}}$, $\widehat{\beta}^W_{D_2,\mathcal{I}_{D_2}}$, $\widehat{\beta}^W_{T_{21R},\mathcal{I}_{T_{21R}}}$, and $\widehat{\beta}^W_{T_{21I},\mathcal{I}_{T_{21I}}}$ as the coefficient estimates obtained by minimizing (5.51). It follows that the fitted Cholesky components are

$$\widehat{D}^W_1(f) = \exp\left\{\phi^T_{\mathcal{I}_{D_1}}(f)\,\widehat{\beta}^W_{D_1,\mathcal{I}_{D_1}}\right\}, \quad \widehat{D}^W_2(f) = \exp\left\{\phi^T_{\mathcal{I}_{D_2}}(f)\,\widehat{\beta}^W_{D_2,\mathcal{I}_{D_2}}\right\},$$
$$\widehat{T}^W_{21R}(f) = \phi^T_{\mathcal{I}_{T_{21R}}}(f)\,\widehat{\beta}^W_{T_{21R},\mathcal{I}_{T_{21R}}}, \quad \text{and} \quad \widehat{T}^W_{21I}(f) = \phi^T_{\mathcal{I}_{T_{21I}}}(f)\,\widehat{\beta}^W_{T_{21I},\mathcal{I}_{T_{21I}}}, \qquad (5.52)$$

for f ∈ [0, 1/2]. By the relationships in (5.31), one can obtain the estimates of the SDM

elements for a bivariate time series as

$$\widehat{S}^W_1(f) = \widehat{D}^W_1(f),$$
$$\widehat{S}^W_{21}(f) = -\widehat{D}^W_1(f)\left\{\widehat{T}^W_{21R}(f) + i\,\widehat{T}^W_{21I}(f)\right\}, \quad \text{and} \qquad (5.53)$$
$$\widehat{S}^W_2(f) = \widehat{D}^W_2(f) + \widehat{D}^W_1(f)\left[\left\{\widehat{T}^W_{21R}(f)\right\}^2 + \left\{\widehat{T}^W_{21I}(f)\right\}^2\right],$$

for f ∈ [0, 1/2].
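The map (5.53) from Cholesky components back to SDM elements is easy to sketch, and makes the positive-definiteness guarantee checkable numerically; the function name `sdm_from_cholesky` and the illustrative component values are ours:

```python
import numpy as np

def sdm_from_cholesky(D1, D2, T21R, T21I):
    """Rebuild the bivariate SDM elements from the modified Cholesky
    components via (5.53); D1, D2 > 0 ensure that the resulting matrix
    is Hermitian positive definite at every frequency."""
    S1 = D1
    S21 = -D1 * (T21R + 1j * T21I)
    S2 = D2 + D1 * (T21R**2 + T21I**2)
    return S1, S21, S2

# One frequency, with made-up component values:
S1, S21, S2 = sdm_from_cholesky(D1=2.0, D2=1.0, T21R=0.5, T21I=-0.25)
S = np.array([[S1, np.conj(S21)], [S21, S2]])
# Eigenvalues of the reconstructed Hermitian matrix are all positive.
assert np.all(np.linalg.eigvalsh(S) > 0)
```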

By Proposition 7, the bivariate MT-Whittle likelihood (5.51) is a quasi-likelihood function. Valid inference for the resulting MT-Whittle estimators, $\widehat{\Theta}^W_{\mathcal{I}}$ and $\widehat{S}^W(f)$, still relies on a proper incorporation of the asymptotic properties, such as correlations of MT estimates between frequencies, as discussed in Section 5.1.

5.3 Simulation

To evaluate the performance of our proposed L1 penalized multivariate MT-Whittle method for the SDM estimation, we present a simulation study with the following three bivariate stationary processes:

1. VAR(2) process:

$$X_t = \Psi^{(1)}_1 X_{t-1} + \Psi^{(1)}_2 X_{t-2} + \varepsilon_t,$$

where the autoregressive (AR) coefficients are

$$\Psi^{(1)}_1 = \begin{pmatrix} 0.97\sqrt{2} & -0.5 \\ -0.3 & 0.7 \end{pmatrix}, \quad \text{and} \quad \Psi^{(1)}_2 = \begin{pmatrix} -0.97^2 & 0 \\ 0 & -0.5 \end{pmatrix}.$$

2. VARMA(2,15000) process:

$$X_t = \Psi^{(2)}_1 X_{t-1} + \Psi^{(2)}_2 X_{t-2} + \sum_{l=0}^{15000} \Omega_l\, \varepsilon_{t-l},$$

where the AR coefficients are

$$\Psi^{(2)}_1 = \begin{pmatrix} 0.97\sqrt{2} & -0.6 \\ -0.1 & 0 \end{pmatrix}, \quad \Psi^{(2)}_2 = \begin{pmatrix} -0.97^2 & 0 \\ 0 & 0 \end{pmatrix},$$

and the moving average (MA) coefficients are

$$\Omega_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Omega_1 = \begin{pmatrix} 0 & 0 \\ 0 & \pi/4 \end{pmatrix}, \quad \text{and} \quad \Omega_l = \begin{pmatrix} 0 & 0 \\ 0 & \sin(\pi(l-1)/2)/(l-1) \end{pmatrix}, \qquad (5.54)$$

for l = 2, 3, ..., 15000.

3. VARMA(4,15000) process:

$$X_t = \Psi^{(3)}_1 X_{t-1} + \Psi^{(3)}_2 X_{t-2} + \Psi^{(3)}_3 X_{t-3} + \Psi^{(3)}_4 X_{t-4} + \sum_{l=0}^{15000} \Omega_l\, \varepsilon_{t-l},$$

where the AR coefficients are

$$\Psi^{(3)}_1 = \begin{pmatrix} 2.7607 & -0.6 \\ -0.1 & 0 \end{pmatrix}, \quad \Psi^{(3)}_2 = \begin{pmatrix} -3.8106 & 0 \\ 0 & 0 \end{pmatrix},$$
$$\Psi^{(3)}_3 = \begin{pmatrix} 2.6535 & 0 \\ 0 & 0 \end{pmatrix}, \quad \text{and} \quad \Psi^{(3)}_4 = \begin{pmatrix} -0.9238 & 0 \\ 0 & 0 \end{pmatrix},$$

with the same MA coefficients $\Omega_l$ as defined in (5.54) for l = 0, 1, ..., 15000.

For the three processes, we generate the innovation process $\{\varepsilon_t = (\varepsilon_{1,t}, \varepsilon_{2,t})^T : t \in \mathbb{Z}\}$ as bivariate Gaussian white noise with

$$\varepsilon_t \sim N_2\left(\mathbf{0}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right), \quad \text{for } t \in \mathbb{Z}.$$

The true SDMs of the three time series can be obtained using transfer functions of multivariate ARMA processes [see, e.g., Priestley, 1981, Section 9.4]. Plots of the true SDM elements are shown by the red lines in the second row of Figure 5.1 for the VAR(2) process, the second row of Figure 5.3 for the VARMA(2,15000) process, and the second row of Figure 5.5 for the VARMA(4,15000) process.
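For instance, for a VAR(2) process the transfer-function calculation reduces to inverting $A(f) = I - \Psi_1 e^{-i2\pi f} - \Psi_2 e^{-i4\pi f}$. The sketch below (assuming a unit sampling interval; normalization conventions vary across references, and the function name is ours) computes the true SDM of the VAR(2) simulation process at one frequency:

```python
import numpy as np

def var2_sdm(f, Psi1, Psi2, Sigma):
    """True SDM of a VAR(2) process from its transfer function:
    S(f) = A(f)^{-1} Sigma A(f)^{-H}, with
    A(f) = I - Psi1 exp(-i 2 pi f) - Psi2 exp(-i 4 pi f)."""
    z = np.exp(-2j * np.pi * f)
    Ainv = np.linalg.inv(np.eye(2) - Psi1 * z - Psi2 * z**2)
    return Ainv @ Sigma @ Ainv.conj().T

# AR coefficients and innovation covariance of the VAR(2) process above.
Psi1 = np.array([[0.97 * np.sqrt(2.0), -0.5], [-0.3, 0.7]])
Psi2 = np.array([[-0.97**2, 0.0], [0.0, -0.5]])
Sigma = np.eye(2)

S = var2_sdm(0.125, Psi1, Psi2, Sigma)
# The SDM is Hermitian with real, positive auto-spectra on the diagonal.
assert np.allclose(S, S.conj().T)
assert S[0, 0].real > 0 and S[1, 1].real > 0
```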

We quantify the performance of SDM estimation using the following metrics. First, similar to the evaluation of an SDF estimate, the bias and IRMSE are evaluated for the estimates of each Cholesky component and each SDM element. At the same time, a matrix version of the IRMSE can be obtained based on the Frobenius norm to measure the overall performance of the estimated SDM. We call this measure the FIRMSE. Letting

$\tilde{S}(f_j)$ denote any estimate of the SDM at frequency $f_j \in \mathcal{F}_2$ defined in (4.2), the FIRMSE is

$$\left\{\frac{1}{M_2}\sum_{j:f_j\in\mathcal{F}_2}\left\|\tilde{S}(f_j) - S(f_j)\right\|_F^2\right\}^{1/2}, \qquad (5.55)$$

where, for an m-vector time series, the Frobenius norm of $\tilde{S}(f) - S(f)$ at frequency f equals

$$\left\|\tilde{S}(f) - S(f)\right\|_F = \sqrt{\mathrm{tr}\left[\left\{\tilde{S}(f) - S(f)\right\}^H\left\{\tilde{S}(f) - S(f)\right\}\right]} = \sqrt{\sum_{a=1}^m\sum_{b=1}^m\left|\tilde{S}_{ab}(f) - S_{ab}(f)\right|^2}.$$

When the SDM elements are of vastly different scales, we can replace the Frobenius norm by a standardized version that we call the SFIRMSE:

$$\left\|\tilde{S}(f) - S(f)\right\|_{F,s} = \sqrt{\sum_{a=1}^m\sum_{b=1}^m\left|\frac{\tilde{S}_{ab}(f)}{\sqrt{S_a(f)}\sqrt{S_b(f)}} - \frac{S_{ab}(f)}{\sqrt{S_a(f)}\sqrt{S_b(f)}}\right|^2}.$$
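Both norms are straightforward to compute when the estimated and true SDMs are stored as arrays of matrices; `firmse` and `sfirmse` below are our illustrative implementations of (5.55) and its standardized variant:

```python
import numpy as np

def firmse(S_hat, S_true):
    """FIRMSE of (5.55): root mean squared Frobenius error over the
    frequencies in F2. Inputs have shape (M2, m, m)."""
    fro2 = np.sum(np.abs(S_hat - S_true) ** 2, axis=(1, 2))
    return np.sqrt(np.mean(fro2))

def sfirmse(S_hat, S_true):
    """Standardized FIRMSE: each element is divided by the square roots
    of the corresponding true auto-spectra before the Frobenius norm."""
    d = np.sqrt(np.einsum('jaa->ja', S_true).real)   # sqrt of S_a(f_j)
    scale = d[:, :, None] * d[:, None, :]
    fro2 = np.sum(np.abs((S_hat - S_true) / scale) ** 2, axis=(1, 2))
    return np.sqrt(np.mean(fro2))

# Toy check: three frequencies, 2 x 2 identity SDMs, estimate off by I.
S_true = np.tile(np.eye(2), (3, 1, 1)).astype(complex)
S_hat = 2.0 * S_true
err = firmse(S_hat, S_true)      # each Frobenius norm equals sqrt(2)
err_s = sfirmse(S_hat, S_true)   # identical here: the diagonals are 1
```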

Alternatively, we propose a score-based approach to assess the SDM estimate $\tilde{S}(f)$. The logarithmic score associated with the multivariate MT-Whittle likelihood, which we call the Whittle score, is given by

$$\mathrm{Sc}^{\log}\left(\tilde{S}(f), S(f)\right) = \log\det \tilde{S}(f) + \mathrm{tr}\left\{\tilde{S}^{-1}(f)\,S(f)\right\} - \log\det S(f) - m.$$

A smaller Whittle score is preferred; the minimum of the score $\mathrm{Sc}^{\log}(\tilde{S}(f), S(f))$ is zero, achieved when $\tilde{S}(f) = S(f)$. With sample splitting, the average Whittle score is evaluated on the second half of the frequencies $\mathcal{F}_2$, calculated using

$$\frac{1}{M_2}\sum_{j:f_j\in\mathcal{F}_2} \mathrm{Sc}^{\log}\left(\tilde{S}(f_j), S(f_j)\right).$$
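A sketch of the Whittle score at a single frequency (the function name and the example matrix are ours); the score is zero for a perfect estimate and positive otherwise:

```python
import numpy as np

def whittle_score(S_hat, S_true):
    """Logarithmic Whittle score at one frequency: non-negative, and
    zero exactly when S_hat equals S_true."""
    m = S_true.shape[0]
    _, logdet_h = np.linalg.slogdet(S_hat)
    _, logdet_t = np.linalg.slogdet(S_true)
    tr = np.trace(np.linalg.solve(S_hat, S_true)).real
    return logdet_h + tr - logdet_t - m

S = np.array([[2.0, 0.3], [0.3, 1.0]])
assert abs(whittle_score(S, S)) < 1e-10    # a perfect estimate scores 0
assert whittle_score(np.eye(2), S) > 0     # any mismatch scores > 0
```

Averaging this score over the frequencies in F2 gives the "average Whittle score" reported in the tables below.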

We present the evaluation metrics averaged over 500 realizations of each time series process of length N = 2048. As suggested in Section 4.4, we use K = 15 sine tapers for sample size N = 2048 to obtain the raw MT estimates. In the literature, the state-of-the-art in

Cholesky-based methods for the SDM estimation is the L2 penalized Whittle method by Krafty and Collinge [2013], which is denoted as “L2 Krafty's method” later in this section. Based on sample splitting, we compared our L1 penalized multivariate MT-Whittle method with L2 Krafty's method and the raw MT estimates, to show the strengths and weaknesses of our approach and point out directions for future research. We implemented L2 Krafty's method with sample splitting as follows: separate L2 tuning parameters were assigned to different Cholesky components; the first half of the frequencies $\mathcal{F}_1$ was used to select the tuning parameters based on the generalized maximum likelihood criterion [see, e.g., Wood, 2011]; with the selected tuning parameters, estimates of Θ were then obtained by L2 Krafty's method using only the second half of the frequencies $\mathcal{F}_2$.

Similar to the univariate case, our L1 penalized multivariate MT-Whittle method is suitable for a variety of basis functions. The initial basis matrix Φ was generated with the maximum number of basis functions, p = M + 1 = 1024, including the intercept. Our experiments show that the LA(8) wavelet basis is still relatively efficient in capturing the true SDM features compared to other basis functions. For example, when estimating the SDM of the VAR(2) process using our L1 penalized MT-Whittle method and the scale-calibrated universal thresholds (5.43) and (5.45), the LA(8) wavelet basis achieved an average Whittle score of 0.044 with a bootstrap standard error (SE) smaller than 0.001, while a combination of Fourier bases (cosines for the even functions log D1(f), log D2(f), and T21R(f), and sines for the odd function T21I(f)) attained an average Whittle score of 0.055 with a bootstrap SE smaller than 0.001. As in the previous chapters, we proceed with the LA(8) wavelet basis as being representative for our presentation of the L1 penalized multivariate MT-Whittle method, called “L1 MT-Whittle” later in this section.

To exemplify the behavior of different methods on the SDM estimation, we show in

Figures 5.1, 5.3, and 5.5, the estimated spectra of the Cholesky components (first row), the

SDM elements (second row), and other cross-spectra related quantities (third row), based on one replicate of the simulated processes. The true spectra are shown in red lines, with MT

143 estimates plotted in gray. We can see that, for the simulated processes, the log-transformed

MT estimates of D1(f) and D2(f) have homogeneous variability over the Fourier frequencies,

but the variances of the MT estimates for T21R(f) and T21I(f) are spatially inhomogeneous over

frequencies, where the high frequencies usually have much larger variance than the low

frequencies. The blue lines show the L1 MT-Whittle estimates using the scale-calibrated

universal thresholds (univer.), and the green lines show the results based on L2 Krafty’s

approach. We leave the plots of the GIC-based L1 MT-Whittle method to the discussion section.

Next, we show the VAR(2) example as a scenario where L2 Krafty's method is more efficient; the VARMA(2,15000) process provides a situation where the L1 MT-Whittle method

is superior to the other methods; and the VARMA(4,15000) series gives a case when both

penalized methods run into trouble, and points to a future research direction.

5.3.1 The VAR(2) process

We first present an example where L2 Krafty's method is more efficient. As shown in

Figure 5.1, the true spectra (red lines) of the Cholesky components of the SDM are fairly smooth for the VAR(2) process, and there are a small number of features to be captured.

For the one replicate shown, the L1 MT-Whittle method and L2 Krafty's approach both captured the main features in the diagonal Cholesky components and the SDM elements. However, both methods have some deviations in estimating the off-diagonal Cholesky components at high frequencies, and L2 Krafty's approach tends to overestimate log D2(f) around the peak.

Table 5.1 shows the average bias and the IRMSE for estimating each Cholesky component and SDM element. In terms of the IRMSE, the L1 MT-Whittle method outperforms

(has smaller IRMSE than) the L2 Krafty’s approach in estimating log D2(f), S11(f), S22(f),


Figure 5.1: Plots of true spectra (red), and one replicate of raw MT estimates (gray), L2 Krafty’s estimates (green), and the proposed L1 MT-Whittle estimates (blue) for Cholesky components, SDM elements, and MSC of the VAR(2) process.

and Re{S21}, but is not as efficient in the estimation of the off-diagonal Cholesky components

Re{T21(f)} and Im{T21(f)}. Both methods perform better than the raw MT estimates.

To further investigate the component-wise behavior of the competing estimators, we display the plots of the bias vs. frequency (first and second rows) and the RMSE vs. frequency

(third and fourth rows) in Figure 5.2 for the VAR(2) process. For estimating the Cholesky components, L2 Krafty's method is biased around the peaks of log D1(f) and log D2(f); the bias in an L1 MT-Whittle-based estimator occurs mainly at high frequencies of T21R(f)

Table 5.1: Simulation results for the VAR(2) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples.

Bias:

Method | log D1 | log D2 | Re{T21} | Im{T21} | S11 | S22 | Re{S21} | Im{S21}
L1 MT-Whittle (univer.) | -0.009 (0.001) | -0.002 (0.002) | -0.013 (0.001) | <0.001 (0.002) | -0.145 (0.028) | -0.039 (0.009) | 0.077 (0.015) | -0.056 (0.005)
L1 MT-Whittle (GIC) | -0.003 (0.002) | -0.007 (0.002) | -0.001 (0.001) | -0.006 (0.001) | -0.184 (0.028) | -0.020 (0.009) | 0.098 (0.014) | -0.022 (0.005)
L2 Krafty | -0.009 (0.002) | 0.035 (0.002) | 0.018 (0.002) | -0.013 (0.001) | -0.660 (0.035) | -0.093 (0.011) | 0.196 (0.018) | -0.104 (0.007)
Raw MT estimate | -0.036 (0.001) | -0.105 (0.001) | 0.001 (0.001) | -0.001 (0.001) | -0.054 (0.028) | -0.005 (0.009) | 0.021 (0.015) | -0.002 (0.005)

IRMSE:

Method | log D1 | log D2 | Re{T21} | Im{T21} | S11 | S22 | Re{S21} | Im{S21}
L1 MT-Whittle (univer.) | 0.134 (0.001) | 0.125 (0.002) | 0.097 (0.002) | 0.085 (<0.001) | 3.601 (0.046) | 1.156 (0.013) | 1.906 (0.024) | 0.676 (0.008)
L1 MT-Whittle (GIC) | 0.139 (0.001) | 0.130 (0.001) | 0.074 (0.001) | 0.063 (0.001) | 3.755 (0.042) | 1.247 (0.012) | 2.025 (0.022) | 0.747 (0.007)
L2 Krafty | 0.131 (0.002) | 0.157 (0.001) | 0.068 (0.001) | 0.037 (0.001) | 3.786 (0.064) | 1.217 (0.017) | 2.027 (0.033) | 0.565 (0.010)
Raw MT estimate | 0.266 (0.001) | 0.292 (0.001) | 0.189 (0.001) | 0.188 (0.001) | 4.760 (0.044) | 1.530 (0.013) | 2.480 (0.023) | 0.897 (0.006)

The minimum of each column is bolded.

and T21I (f). Both methods have smaller RMSEs than the MT estimates at most frequencies.

The observed phenomena of the bias vs. frequency and the RMSE vs. frequency in Figure

5.2 agree with the results summarized in Table 5.1.

With respect to the overall quality of the SDM estimation, the FIRMSE of the L1 MT-Whittle method using scale-calibrated universal thresholds is 4.759 (bootstrap SE 0.057), smaller than the FIRMSE of L2 Krafty's method, which is 4.984 (bootstrap SE 0.080), as


Figure 5.2: Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VAR(2) process.

shown in Table 5.2. However, L2 Krafty's method has a lower SFIRMSE and a lower

average Whittle score than the L1 MT-Whittle method.

Table 5.2: Results of evaluating the SDM estimates for the VAR(2) process: comparing the FIRMSE, SFIRMSE, and average Whittle score of different methods. Values in parentheses denoted by '(SE)' are standard errors based on 500 bootstrap samples.

Method | FIRMSE (SE) | SFIRMSE (SE) | Average Whittle score (SE)
L1 MT-Whittle (univer.) | 4.759 (0.057) | 0.293 (0.002) | 0.044 (<0.001)
L1 MT-Whittle (GIC) | 5.012 (0.051) | 0.296 (0.002) | 0.046 (<0.001)
L2 Krafty | 4.984 (0.080) | 0.272 (0.002) | 0.033 (<0.001)
Raw MT estimate | 6.247 (0.055) | 0.517 (0.001) | 0.170 (0.001)

The minimum of each column is bolded.

To improve the performance of the L1 MT-Whittle method for the SDM estimation of processes similar to the VAR(2) example, one main task is to reduce the bias of the estimated spectra at high frequencies. One direction to consider is a mixture of different types of basis functions [see, e.g., Oh et al., 2001] to remove the artificial wiggles generated by wavelets at the boundary. At the same time, since the noise in the MT estimates of T21R and T21I is heteroskedastic over the frequencies, a more spatially adaptive L1 tuning parameter may be required. Some discussion of tuning parameter selection is provided in

Section 5.3.4.

5.3.2 The VARMA(2,15000) process

The spectra of the VARMA(2,15000) process have more complex shapes than the spectra of the VAR(2) process. As shown in Figure 5.3, besides the auto-spectrum S11(f), there are more sharp features in the true spectrum of each Cholesky component (red lines) and each

SDM element. (The true curve of the imaginary part of T21(f) is concealed by the relatively

large-scale MT estimates.) While the L1 MT-Whittle method captures most of the sharp

features in the spectra of each component correctly, L2 Krafty's method induces more

wiggles at low frequencies and misses the peak of log D2(f).


Figure 5.3: Plots of true spectra (red), and one replicate of raw MT esti- mates (gray), L2 Krafty’s estimates (green), and the proposed L1 MT-Whittle estimates (blue) for Cholesky components, SDM elements, and the MSC of the VARMA(2,15000) process.

In Table 5.3, we see that the L1 MT-Whittle method with scale-calibrated universal thresholds (univer.) has smaller IRMSEs than L2 Krafty's method for all the original-scale SDM elements. For the VARMA(2,15000) process, the only disadvantage of the L1 MT-Whittle method lies in the estimation of Re{T21(f)}. To further understand the problem, let us take a look at the plots of bias and RMSE over frequencies in Figure 5.4.

Table 5.3: Simulation results for the VARMA(2,15000) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples.

Bias:

Method | log D1 | log D2 | Re{T21} | Im{T21} | S11 | S22 | Re{S21} | Im{S21}
L1 MT-Whittle (univer.) | 0.004 (0.001) | -0.002 (0.002) | 0.090 (0.001) | 0.020 (<0.001) | -0.437 (0.123) | -0.023 (0.007) | -0.010 (0.017) | 0.002 (0.007)
L1 MT-Whittle (GIC) | 0.009 (0.001) | -0.035 (0.002) | 0.006 (0.002) | 0.011 (0.001) | -0.428 (0.123) | -0.002 (0.007) | 0.040 (0.018) | -0.009 (0.007)
L2 Krafty | 0.010 (0.005) | 0.107 (0.004) | 0.023 (0.004) | 0.018 (0.001) | -3.359 (0.229) | -0.034 (0.015) | 0.302 (0.039) | 0.501 (0.007)
Raw MT estimate | -0.028 (0.001) | -0.096 (0.002) | -0.002 (0.001) | -0.002 (0.001) | -0.136 (0.125) | -0.005 (0.007) | 0.017 (0.017) | <0.001 (0.007)

IRMSE:

Method | log D1 | log D2 | Re{T21} | Im{T21} | S11 | S22 | Re{S21} | Im{S21}
L1 MT-Whittle (univer.) | 0.183 (0.001) | 0.169 (0.001) | 0.230 (0.001) | 0.042 (<0.001) | 19.067 (0.269) | 0.922 (0.008) | 2.519 (0.028) | 1.179 (0.014)
L1 MT-Whittle (GIC) | 0.187 (0.001) | 0.175 (0.001) | 0.138 (0.002) | 0.062 (0.001) | 19.484 (0.265) | 0.976 (0.007) | 2.591 (0.028) | 1.299 (0.013)
L2 Krafty | 0.168 (0.003) | 0.336 (0.003) | 0.141 (0.003) | 0.042 (<0.001) | 21.262 (0.636) | 0.943 (0.015) | 3.664 (0.069) | 1.591 (0.015)
Raw MT estimate | 0.266 (0.001) | 0.289 (0.001) | 0.238 (0.001) | 0.234 (0.001) | 19.779 (0.265) | 1.164 (0.006) | 2.912 (0.025) | 1.351 (0.012)

The minimum of each column is bolded.

It is apparent from Figure 5.4 that L2 Krafty's method has severe bias and RMSE in capturing the peaks of all component spectra related to the SDM. With less bias and RMSE at most frequencies, our L1 MT-Whittle method is more efficient than L2 Krafty's method


Figure 5.4: Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(2,15000) process.

in estimating the majority of SDM components for the VARMA(2,15000) process. The only exception is the real part of T21(f), where the L1 MT-Whittle method again has trouble capturing the right tail of the spectrum. Both methods have a similar weakness in mis-capturing the high frequency region of Im{T21(f)}, but L2 Krafty's method also has a problem describing the low frequency region of Im{T21(f)}.

According to Table 5.4, the L1 MT-Whittle method has a smaller FIRMSE, SFIRMSE, and average Whittle score than L2 Krafty's method and the raw MT estimates. The comparison between universal thresholds and GIC can be found in Section 5.3.4.

Table 5.4: Results of evaluating the SDM estimates for the VARMA(2,15000) process: comparing the FIRMSE, SFIRMSE, and average Whittle score of different methods. Values in parentheses denoted by '(SE)' are standard errors based on 500 bootstrap samples.

Method | FIRMSE (SE) | SFIRMSE (SE) | Average Whittle score (SE)
L1 MT-Whittle (univer.) | 19.524 (0.268) | 0.436 (0.002) | 0.074 (<0.001)
L1 MT-Whittle (GIC) | 19.965 (0.264) | 0.405 (0.002) | 0.074 (0.001)
L2 Krafty | 22.091 (0.626) | 0.594 (0.008) | 0.095 (0.001)
Raw MT estimate | 20.357 (0.264) | 0.530 (0.002) | 0.169 (0.001)

The minimum of each column is bolded.

5.3.3 The VARMA(4,15000) process

Now we consider another scenario where the current L1 MT-Whittle method with universal thresholds still outperforms L2 Krafty's method, but a more data-adaptive L1 tuning parameter selection is preferred. The plots of the true spectra (red lines) of the SDM elements of the VARMA(4,15000) process are exhibited in the second row of Figure 5.5.


Figure 5.5: Plots of true spectra (red), and one replicate of raw MT esti- mates (gray), L2 Krafty’s estimates (green), and the proposed L1 MT-Whittle estimates (blue) for Cholesky components, SDM elements, and the MSC of the VARMA(4,15000) process.

We see in Figure 5.5 that the auto-spectrum S11 now has an extremely high peak, with a maximum value above 20,000. The bimodal feature of the first Cholesky diagonal component log D1(f) dominates the shapes of all original scale SDM elements. Based on the plots of fitted spectra in Figure 5.5, the L2 Krafty's method (green line) exhibits noticeable leakage (see Chapter 2 for more information) when estimating log D1(f) at high frequencies, and introduces more wiggles at the low frequencies of log D2(f). The L1 MT-Whittle method with universal thresholds (blue line) estimates the sharp peak of log D2(f) more accurately than the L2 Krafty's method. Both the L1 MT-Whittle method and the L2 Krafty's method underestimate the sharpness of peaks in the original SDM elements, and fail to capture high frequency features in the off-diagonal Cholesky components Re{T21(f)} and Im{T21(f)}.

Table 5.5 shows that the L1 MT-Whittle method achieves smaller IRMSEs than the L2 Krafty's approach in estimating all original scale SDM elements for the VARMA(4,15000) process. However, higher IRMSEs are observed for the estimates of Re{T21(f)} and Im{T21(f)} using the L1 MT-Whittle method with universal thresholds.

Table 5.5: Simulation results for the VARMA(4,15000) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples.

Bias                      Cholesky components                        SDM elements
Method                    log D1    log D2    Re{T21}   Im{T21}    S11        S22       Re{S21}   Im{S21}
L1 MT-Whittle (univer.)   0.014     0.093     1.000     0.334      -14.797    -0.149    0.984     -0.865
                          (0.002)   (0.002)   (0.004)   (0.003)    (3.324)    (0.037)   (0.301)   (0.167)
L1 MT-Whittle (GIC)       0.013     0.026     0.534     0.329      -14.204    -0.135    1.000     -0.797
                          (0.002)   (0.002)   (0.005)   (0.004)    (3.320)    (0.037)   (0.300)   (0.167)
L2 Krafty                 0.906     0.168     0.795     0.372      -50.869    -0.443    3.357     -2.603
                          (0.024)   (0.003)   (0.012)   (0.007)    (3.870)    (0.043)   (0.338)   (0.197)
Raw MT estimate           -0.011    -0.073    0.005     -0.005     -3.936     -0.035    0.256     -0.197
                          (0.001)   (0.002)   (0.004)   (0.004)    (3.367)    (0.037)   (0.302)   (0.169)

IRMSE                     Cholesky components                        SDM elements
Method                    log D1    log D2    Re{T21}   Im{T21}    S11        S22       Re{S21}   Im{S21}
L1 MT-Whittle (univer.)   0.215     0.234     1.867     0.616      856.406    8.690     70.989    46.041
                          (0.001)   (0.002)   (0.006)   (0.003)    (8.192)    (0.092)   (0.826)   (0.359)
L1 MT-Whittle (GIC)       0.214     0.181     1.171     0.649      852.778    8.658     70.768    45.797
                          (0.001)   (0.001)   (0.007)   (0.006)    (8.174)    (0.092)   (0.825)   (0.366)
L2 Krafty                 1.268     0.311     1.509     0.592      873.509    9.234     74.563    47.124
                          (0.031)   (0.002)   (0.021)   (0.009)    (13.838)   (0.145)   (1.369)   (0.595)
Raw MT estimate           0.276     0.294     0.705     0.695      823.931    8.769     72.092    42.562
                          (0.001)   (0.001)   (0.004)   (0.004)    (7.672)    (0.081)   (0.740)   (0.330)

The minimum of each column is bolded.

Figure 5.6 displays the bias (first and second rows) and the RMSE (third and fourth rows) against frequency for estimating the component spectra. Prominent bias is detected in the L2 Krafty's method for estimating log D1(f) and log D2(f). The bias in estimating log D1(f) arises partially because the L2 Krafty's method models the Cholesky component D1(f) on the original scale, where the smoothing puts a different weight on the deviation of the spectrum at frequencies {f : log D1(f) < 0}. The L2 Krafty's method is constructed from the periodogram, and thus suffers more from leakage than MT-based methods [see, e.g., Percival, 1994]. Figure 5.6 confirms that neither the L2 Krafty's method nor the current L1 MT-Whittle method accurately fits the spectrum at frequencies [0.3, 0.5] for Re{T21(f)} or at frequencies [0.2, 0.5] for Im{T21(f)}.

As shown in Table 5.6, the average Whittle score of the L1 MT-Whittle method with scale-calibrated universal thresholds is 0.180 (bootstrap SE is 0.001), much better (smaller) than that of the L2 Krafty's method, which is 0.670 (bootstrap SE is 0.020). However, due to the unsolved issue in estimating the high frequency region of the off-diagonal spectra, the current scale-calibrated universal thresholds do not improve over the raw MT estimates in terms of the SFIRMSE or the average Whittle score.

The raw MT estimates are too noisy to be useful, so a better estimator should be able to improve these scores. For the SDM estimation of multivariate time series similar to the VARMA(4,15000) process, we found that one way to improve the performance of the L1 MT-Whittle method is to use a more data-adaptive tuning parameter selection, such as the GIC. More discussion is provided in the next subsection.


Figure 5.6: Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(4,15000) process.

Table 5.6: Results of evaluating the SDM estimates for the VARMA(4,15000) process: comparing FIRMSE, SFIRMSE, and average Whittle score of different methods. Values in parentheses denoted by '(SE)' are standard errors based on 500 bootstrap samples.

Method                    FIRMSE (SE)        SFIRMSE (SE)    average Whittle score (SE)
L1 MT-Whittle (univer.)   864.831 (8.279)    0.619 (0.002)   0.180 (0.001)
L1 MT-Whittle (GIC)       861.177 (8.262)    0.556 (0.002)   0.127 (0.001)
L2 Krafty                 882.586 (13.979)   7.724 (0.455)   0.670 (0.020)
Raw MT estimate           832.485 (7.749)    0.567 (0.002)   0.176 (0.001)

The minimum of each column is bolded.

5.3.4 Discussion

This section discusses tuning parameter selection criteria for the L1 MT-Whittle method and directions for further improvement. So far, we have examined the performance of the GIC for selecting a uniform L1 tuning parameter for all Cholesky components. It can be seen from Table 5.6 that, when estimating the SDM of the VARMA(4,15000) process, the GIC yields a significant improvement for the L1 MT-Whittle method, attaining a smaller SFIRMSE and a smaller average Whittle score than the universal thresholds-based method and the raw MT estimates. However, as shown in Tables 5.2 and 5.4, the universal thresholds and the GIC perform very similarly in terms of the average Whittle score when estimating the SDMs of the VAR(2) and VARMA(2,15000) processes.
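To make the GIC selection concrete, the loop below sketches it for an L1 penalized least-squares problem with an orthonormal design, where the lasso solution is a closed-form soft threshold. This is an illustrative stand-in, not the penalized MT-Whittle implementation: the design, working loss, and tuning grid are assumptions, and only the a_n = log(log n) log(p) factor follows Fan and Tang [2013].

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 256, 64

# Orthonormal design: the lasso solution is then soft(X^T y, lam) in closed form.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
beta = np.zeros(p)
beta[:5] = [4.0, -3.0, 2.5, -2.0, 1.5]          # sparse truth
y = X @ beta + 0.5 * rng.standard_normal(n)

def soft(z, lam):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

a_n = np.log(np.log(n)) * np.log(p)             # Fan and Tang (2013) factor
best = None
for lam in np.geomspace(1e-3, 2.0, 60):
    b = soft(X.T @ y, lam)
    rss = np.sum((y - X @ b) ** 2)
    df = np.count_nonzero(b)                    # model size as degrees of freedom
    gic = n * np.log(rss / n) + a_n * df
    if best is None or gic < best[0]:
        best = (gic, lam, b)

_, lam_gic, b_hat = best
print(lam_gic, np.count_nonzero(b_hat))
```

The minimizing tuning value keeps the large coefficients while discarding most noise-only ones; a single uniform tuning parameter plays the same role across all Cholesky components.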

We first investigate the scenario where the GIC performs slightly worse than the universal thresholds in terms of the SFIRMSE and average Whittle score: the VAR(2) process. Figure 5.7 displays one replicate of the L1 MT-Whittle estimates based on the GIC (orange) and one replicate of the L1 MT-Whittle estimates based on universal thresholds (blue). We can see that the GIC preserves more unfavorable wiggles at low frequencies than the scale-calibrated universal thresholds in the estimation of the cross-spectral components, especially for Im{T21(f)} and Im{S21(f)}. For processes with an SDM as smooth and sparse as that of the VAR(2) process, the GIC for the L1 MT-Whittle method with a single tuning parameter leads to overfitting of the SDM.


Figure 5.7: Plots of true spectra (red), and one replicate of raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for Cholesky components and SDM elements of the VAR(2) process.

The VARMA(4,15000) process has denser features to be captured than the VAR(2) process. Figure 5.8 shows that the GIC-based L1 MT-Whittle method deviates less from the true spectra of the off-diagonal Cholesky components than the universal thresholds-based method. According to the plots of bias and RMSE against frequency in Figure 5.9, the GIC reduces the bias and RMSE in estimating log D2(f) and Re{T21(f)} for the L1 MT-Whittle method.


Figure 5.8: Plots of true spectra (red), and one replicate of raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for Cholesky components and SDM elements of the VARMA(4,15000) process.

However, the poor estimation of Im{T21(f)} is not improved. Figure 5.9 shows a substantial inflation in the RMSE at frequencies [0.3, 0.5] for the estimation of Im{T21(f)}, regardless of whether the GIC or the universal threshold is applied. As one can see from the plot of one replicate of MT estimates in Figure 5.8, the MT estimates of Im{T21(f)} at high frequencies have much larger variability than the MT estimates at low frequencies. Since the high frequency region is much noisier for the raw MT estimates, it is not surprising that the underlying true spectra are difficult to estimate for the off-diagonal Cholesky components of the SDM.


Figure 5.9: Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for the Cholesky components and the SDM elements of the VARMA(4,15000) process.

Based on the current simulation results, for the application of the L1 MT-Whittle method, we suggest using the scale-calibrated universal threshold for estimating the SDM of time series similar to the VAR(2) process, and using the GIC for the SDM estimation of processes with denser spectral features like the VARMA(4,15000). To develop a tuning parameter selection strategy that is flexible for both sparse and dense features, one may consider extending the idea in Donoho and Johnstone [1995], where the universal threshold and a data-adaptive thresholding strategy are combined for wavelet thresholding.
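The universal threshold that anchors the first of these options can be sketched in a few lines. The sketch below is the classical Donoho-Johnstone recipe (threshold sigma_hat * sqrt(2 log n) with a MAD noise estimate) applied to toy coefficients; the scale calibration used here for MT-Whittle coefficients is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024
theta = np.zeros(n)
theta[:8] = 10.0                                 # a few large basis coefficients
z = theta + rng.standard_normal(n)               # noisy coefficients, sigma = 1

# Robust noise scale from the (mostly noise) coefficients, then the
# universal threshold sigma_hat * sqrt(2 log n) of Donoho and Johnstone.
sigma_hat = np.median(np.abs(z - np.median(z))) / 0.6745
lam = sigma_hat * np.sqrt(2.0 * np.log(n))

theta_hat = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft thresholding
print(np.count_nonzero(theta_hat))
```

With high probability the pure-noise coefficients all fall below the threshold, so only the genuine features survive, which is the sparsity-capturing behavior exploited for the VAR(2)-like spectra above.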

For further improvement on the L1 tuning parameter selection, a rigorous asymptotic theory needs to be established for the MT estimates of the Cholesky components of the SDM, and thus for the related estimators of the basis coefficients Θ. Due to the heteroskedasticity of the MT estimates of the off-diagonal Cholesky components over frequency, the coefficient estimates may be of different scales. The theoretical tuning parameters need to be calibrated for each individual coefficient estimate, and may vary from process to process. An alternative direction is to use the GIC to select individualized tuning parameters for different Cholesky components, or even for every basis coefficient. However, the selection of multiple tuning parameters would be very computationally expensive, especially for the SDM estimation of high-dimensional time series. An efficient algorithm needs to be developed for the GIC when using multiple tuning parameters.

5.4 Spectral analysis of multivariate EEG signals

In the previous chapters, the auto-spectra of a two-channel electroencephalogram (EEG) signal were analyzed by studying the left and right channels separately. The spectral features across channels were left unexplored. With the proposed L1 penalized multivariate MT-Whittle framework, we can now extend the application to the SDM estimation of multivariate EEG signals, and thus gain a better understanding of the interaction between the EEG channels. Recall the time series presented in panels (a) and (b) of Figure 3.8. The EEG data, in units of microvolts (µV), were collected over 5 seconds from the left and right front cortex of one male rat at a sampling rate of 200 Hz [van Luijtelaar, 1997]. See Chapter 3 for more background information. To be consistent with our previous analysis, the series are mean padded to a length of N = 1024 prior to the spectral estimation. We obtain the SDM estimates based on our L1 penalized multivariate MT-Whittle method. The raw MT estimates are obtained with 5 sine tapers as in Chapter 3, and a wavelet basis is constructed using the LA(8) wavelet filter. Multivariate EEG signals usually have more complex spectral features than the processes in our simulation study, so we use the GIC to obtain a more data-adaptive selection of the L1 tuning parameter.
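The raw MT estimate described above can be sketched as follows, assuming a unit sampling interval and using a toy bivariate series in place of the EEG data; the mean padding and wavelet-basis steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 1024, 5                        # series length and number of sine tapers
x = rng.standard_normal((2, N))       # toy bivariate series (stand-in for the EEG)

# Sine tapers: h_k(t) = sqrt(2/(N+1)) sin(pi k t / (N+1)), t = 1..N, k = 1..K.
t = np.arange(1, N + 1)
tapers = np.sqrt(2.0 / (N + 1)) * np.sin(
    np.pi * np.outer(np.arange(1, K + 1), t) / (N + 1))       # (K, N), unit energy

# Eigenspectra: DFT of each tapered series, per channel and taper.
J = np.fft.rfft(tapers[None, :, :] * x[:, None, :], axis=-1)  # (2, K, N//2+1)

# Raw MT estimate of the SDM: average cross-products over the K tapers.
S = np.einsum('akf,bkf->abf', J, J.conj()) / K                # (2, 2, N//2+1)
print(S.shape)
```

Averaging the taper cross-products makes each S(f) Hermitian with real, nonnegative auto-spectra by construction; these raw estimates are the noisy gray curves that the penalized fit then smooths.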

The estimates of the Cholesky components and the SDM elements are shown in Figure 5.10, where the gray lines represent the raw MT estimates, and the blue lines are the estimates based on our L1 MT-Whittle method. As discussed previously, the raw MT estimates are too noisy to be useful, and the denoising by the L1 multivariate MT-Whittle method is noticeably effective. In the panels for the diagonal Cholesky components log D1 and log D2, the L1 multivariate MT-Whittle method captures similar features to the separate SDF estimation for the two channels using the L1 univariate MT-Whittle approach. The estimates for the SDF S11 on the left channel preserve the features identified for log D1: two spikes at 9 Hz and 16 Hz and an elbow around 5 Hz. Compared with panel (d) of Figure 3.8, the plot of SDF estimates for S22 in Figure 5.10 has more power and more detailed features around the peak at 10 Hz, an effect of aggregating the estimates of the Cholesky components by (5.53). For the estimates of the cross-spectrum S21, we notice a variety of features in the real part: a sharp peak around 8 Hz, two crests between 10 Hz and 20 Hz, and a small hill


Figure 5.10: Plots of the estimated Cholesky components and the SDM elements of the two-channel EEG signal.

around 23 Hz. On the other hand, the imaginary part of S21 possesses a deep trough at 8 Hz and a small ravine at 16 Hz. The phase of S21 is negatively shifted from zero at around 10 Hz and 20 Hz, indicating that brain activities occurring every 0.1 and 0.05 seconds in the left side of the brain are not synchronized with the right side of the brain. The plot of the estimated MSC tells a consistent story, showing values lower than 0.1 at 10 Hz and 20 Hz, which means that the amplitudes of the signals at these frequencies are almost uncorrelated between the left and right sides of the brain.
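The summaries plotted in Figure 5.10 can be recovered from the Cholesky components. The sketch below assumes the bivariate parameterization S(f) = T(f) D(f) T(f)^H with unit lower-triangular T(f), which makes the reconstructed SDM Hermitian positive definite by construction; the toy curves stand in for the fitted components, and equation (5.53) is not reproduced here.

```python
import numpy as np

nf = 513                                        # frequency grid size
g = np.linspace(0, np.pi, nf)

# Toy smoothed Cholesky components standing in for the fitted curves.
logD1 = -5.0 + np.cos(g)
logD2 = -6.0 + 0.5 * np.sin(g)
T21 = 0.3 * np.cos(2 * g) - 0.2j * np.sin(2 * g)

# Aggregate back to SDM elements via S = T D T^H with unit lower-triangular T:
# S11 = D1, S21 = T21 D1, S22 = D2 + |T21|^2 D1.
D1, D2 = np.exp(logD1), np.exp(logD2)
S11 = D1
S21 = T21 * D1
S22 = D2 + np.abs(T21) ** 2 * D1

# Cross-spectral summaries of the kind shown in Figure 5.10.
cross_amp = np.abs(S21)                         # cross amplitude
phase = np.angle(S21)                           # phase spectrum
msc = np.abs(S21) ** 2 / (S11 * S22)            # magnitude squared coherence
print(msc.min(), msc.max())
```

Because the determinant of the reconstructed SDM equals D1 D2 > 0 at every frequency, the MSC automatically lies in [0, 1), no matter how the components are smoothed.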

In summary, the L1 penalized multivariate MT-Whittle framework provides a natural extension of the L1 penalized MT-Whittle method to the SDM estimation of EEG signals. For the bivariate EEG signal presented above, the extended L1 MT-Whittle approach effectively performs denoising and identifies major spectral features within and across EEG channels. Compared with the univariate spectral analysis of EEG signals, the SDM estimation provides more insight into the differences between channels, and enables us to acquire a better knowledge of the interaction between different areas of the brain.

5.5 Conclusion

In this chapter, we examined the statistical properties of MT estimates of the SDM and proposed an extension of the L1 penalized MT-Whittle method to the spectral analysis of multivariate time series. The extended method is based on a modified Cholesky decomposition of the SDM, and thus ensures a Hermitian positive definite SDM estimate. We employed the proximal gradient algorithm for the optimization of the penalized multivariate MT-Whittle likelihood function, and proposed scale-calibrated universal thresholds and the GIC for the L1 tuning parameter selection. The strengths and weaknesses of the L1 penalized MT-Whittle approach were analyzed and discussed through a simulation study of three different bivariate Gaussian processes. Compared with the L2 penalized Whittle method of Krafty and Collinge [2013], the L1 penalized MT-Whittle approach with a wavelet basis is more efficient at capturing the sharp features of the SDM. We also found that data-adaptive tuning using the GIC is favorable when the spectral features of a multivariate time series are dense. We illustrated the application of our method to the spectral analysis of multivariate electroencephalogram (EEG) signals, discovering interesting MSC and phase relationships between the left and right EEG channels. More discussion on further improvement and future research is provided in the next chapter.

Chapter 6: Conclusion and discussion

6.1 Concluding remarks

In this dissertation, we proposed an L1 penalized multitaper (MT) Whittle framework for the spectral analysis of stationary univariate and multivariate time series. The new framework expands the application of the Whittle method [Whittle, 1953] to smoothing a broader class of spectral estimators, MT spectral estimators [Walden, 2000], of the SDF and the SDM. Leading to a reduction in the IRMSE, the L1 penalized MT-Whittle method is shown to be more efficient for SDF estimation than the L1 penalized LS methods in the literature, such as the wavelet thresholding method investigated by Walden et al. [1998]. Additionally, our method is suitable for a large variety of basis functions. The proposed tuning parameter selection criteria for the L1 penalty (the calibrated universal threshold and the GIC) perform better, i.e., achieve a smaller IRMSE, than cross-validation in our simulation study. A fast theoretical convergence rate of the proposed SDF estimator is established. Based on sample splitting, we proposed to construct pointwise interval estimates of the SDF using the Huber sandwich method [Huber, 1967]. By incorporating the correlation of MT estimates between frequencies, the sandwich variance based approach acquires closer-to-nominal coverage and a more favorable interval score [Gneiting and Raftery, 2007] than the naive method that assumes independence. A fast approximation to the sandwich method is provided as a computationally efficient alternative. We conducted a sequence of simulation studies on the tuning of the number of tapers for the MT-Whittle method. Multitapering outperforms the periodogram-based Whittle method, and according to the simulation studies, an empirical number of tapers for optimizing the IRMSE and the average interval score is suggested. In terms of computational algorithms, our method for SDF estimation used an ADMM algorithm [Boyd et al., 2011]. For multivariate spectral analysis, we adapted a proximal gradient descent algorithm [Parikh and Boyd, 2014] for the SDM estimation, where a Cholesky decomposition is employed to preserve the Hermitian positive definiteness of the SDM.
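A minimal version of the proximal gradient iteration is sketched below for an L1 penalized least-squares surrogate: a gradient step on the smooth loss followed by the L1 proximal operator, which is soft thresholding. The dissertation applies the same scheme to the penalized multivariate MT-Whittle likelihood, whose gradient would replace the least-squares gradient used here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 200, 50, 0.1
X = rng.standard_normal((n, p)) / np.sqrt(n)    # columns roughly unit norm
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
b = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ b - y)                    # gradient of 0.5 * ||y - X b||^2
    z = b - step * grad                         # gradient step on the smooth part
    b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # L1 prox

print(np.count_nonzero(b))
```

Each iteration costs only a gradient evaluation and an elementwise threshold, which is what makes the scheme attractive for the larger coefficient vectors arising in SDM estimation.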

6.2 Discussion and future work

There are a number of extensions that we are considering. In Walden et al. [1998], the authors also vary the number of initial basis functions, p, that are chosen before they perform the L1 penalization using least squares. We found in simulation studies that preselecting p can slightly reduce the IRMSE of the SDF estimation. However, the optimal choice of p depends on the underlying statistical process, and it is not known how to select p for a given process. This idea is related to more general methods that can be used to simultaneously select and estimate the coefficients of the basis expansion. For example, using a truncated lasso approach proposed by Shen et al. [2012] could lead to further reductions in IRMSE, but this requires developing a general approach to selecting the truncation parameters in the algorithm. Another extension is linked to the automatic relevance determination approach reformulated by Wipf and Nagarajan [2008], where a local minimum can be attained while achieving sparsity by solving a series of reweighted L1 problems.

Our simulation results in Section 3.5 show that different types of basis functions can differ in their efficiency at capturing the true features of the SDF. It would be beneficial if an automatic algorithm could be developed to adaptively choose the most suitable class of basis functions for estimating the SDF from the time series data collected. An overcomplete dictionary method [see, e.g., Wasserman, 2006] puts together different types of basis functions and allows the total number of bases to be greater than the sample size. Our L1 penalized MT-Whittle method [Tang et al., 2019] is capable of selecting the bases from a mixed dictionary. However, a further study needs to be conducted on how to determine when a mixed dictionary should be used, and what type of mixture can actually improve the performance over a single type of basis for the estimation and inference of an SDF and an SDM. For example, the mixture of polynomial and wavelet bases in Oh et al. [2001] for the penalized LS method can be extended to our penalized MT-Whittle approach, which might help remove the redundant wiggles generated by the wavelets at the boundary of the SDF. The GIC proposed by Fan and Tang [2013] still applies to the tuning parameter selection in the case of mixed bases.
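A mixed dictionary of the kind described can be assembled column-wise; the polynomial-plus-Haar mixture below is an illustrative construction (the LA(8) wavelet basis used elsewhere in this dissertation is not reproduced), and it becomes overcomplete once the number of columns exceeds the series length.

```python
import numpy as np

def haar_mother(u):
    """Haar mother wavelet on [0, 1)."""
    return np.where((u >= 0.0) & (u < 0.5), 1.0,
                    np.where((u >= 0.5) & (u < 1.0), -1.0, 0.0))

def mixed_dictionary(n, poly_degree=3, haar_levels=6):
    """Columns: polynomials up to poly_degree, then dyadic Haar wavelets."""
    u = np.arange(n) / n
    cols = [u ** d for d in range(poly_degree + 1)]
    for j in range(haar_levels):
        for k in range(2 ** j):
            cols.append(2.0 ** (j / 2) * haar_mother(2 ** j * u - k))
    return np.column_stack(cols)

Phi = mixed_dictionary(64)            # 4 polynomial + 63 Haar columns: 67 > 64
print(Phi.shape)
```

With more columns than grid points, ordinary least squares is ill-posed, and it is exactly the L1 penalty that makes selection from such a dictionary well defined.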

Prediction in the time domain can also be made based on the proposed MT-Whittle method. With our SDF estimates, an MT-Whittle-based estimate of the ACVF can be calculated by (4.19). For a univariate stationary Gaussian process, it follows from Brockwell and Davis [1991, Section 5.4] that the predicted values for the time series can be expressed in terms of {X1,...,XN} and the estimated ACVF. The formula for a corresponding mean squared prediction error is also provided by Brockwell and Davis [1991, Section 5.4].
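Since equation (4.19) is not reproduced in this chapter, the sketch below shows the standard discretization assumed to be behind such an ACVF calculation: sample the SDF on the full Fourier grid and invert with an FFT, checked here on an AR(1) spectrum whose ACVF is known in closed form.

```python
import numpy as np

M = 1024
f = np.arange(M) / M                   # full Fourier grid on [0, 1)

# Example SDF: AR(1) with coefficient phi and unit innovation variance,
# S(f) = 1 / |1 - phi exp(-i 2 pi f)|^2, so the true ACVF is phi^|tau| / (1 - phi^2).
phi = 0.6
S = 1.0 / np.abs(1.0 - phi * np.exp(-2j * np.pi * f)) ** 2

# Riemann-sum inverse Fourier transform: s(tau) ~= mean_j S(f_j) e^{i 2 pi f_j tau},
# which is what np.fft.ifft computes (it includes the 1/M factor).
acvf = np.fft.ifft(S).real
print(acvf[:2])
```

The same inversion applied to a fitted MT-Whittle SDF yields the ACVF estimates that feed the Brockwell-Davis prediction formulas.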

When constructing sandwich variance based interval estimates of the SDF, we used the results in Thomson [1982] to calculate the correlation of MT estimates between Fourier frequencies for Gaussian processes. It would be interesting to derive more general results for non-Gaussian processes. One could also run a simulation study to check the robustness of the current method to non-Gaussianity.

In the literature, a multi sample-splitting method [Meinshausen et al., 2009] has also been considered for the inference of lasso penalized models. The method randomly splits the sample into two halves multiple times. For each split, model selection and inference are performed separately on the two halves of the data. The multiple p-values obtained for the parameters are then aggregated with a quantile based method (see Meinshausen et al. [2009] for details), and confidence intervals can be derived from the aggregation of p-values. Relevant nonparametric confidence intervals based on sample splitting are discussed in Rinaldo et al. [2019]. An inference-prediction tradeoff is pointed out by Rinaldo et al. [2019]: sample splitting can help improve the inference accuracy, as discussed in Section 4.1.1; however, since only a portion of the data is employed for model selection, less accurate models may be selected, leading to a reduction in the prediction accuracy. A more thorough understanding of this tradeoff in our spectral scenarios may be gained through further simulation studies.
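The quantile aggregation step of Meinshausen et al. [2009] can be sketched directly; the gamma grid and the toy p-values below are illustrative choices, not values from the paper.

```python
import numpy as np

def aggregate_pvalues(p, gammas=np.linspace(0.05, 1.0, 20)):
    """Quantile aggregation of per-split p-values (Meinshausen et al., 2009)."""
    # Q(gamma) = min(1, empirical gamma-quantile of {p_b / gamma}).
    q = np.array([min(1.0, np.quantile(p / g, g)) for g in gammas])
    # The (1 - log gamma_min) factor corrects for searching over gamma.
    return min(1.0, (1.0 - np.log(gammas.min())) * q.min())

# Toy p-values from B = 50 random splits.
rng = np.random.default_rng(5)
strong = aggregate_pvalues(10 ** rng.uniform(-8, -5, size=50))  # consistently tiny
null = aggregate_pvalues(rng.uniform(0.0, 1.0, size=50))        # uniform under H0
print(strong, null)
```

Consistently small per-split p-values aggregate to a small overall p-value, while uniform (null) p-values are pushed back toward one, which is the behavior that makes the multi-split procedure robust to any single unlucky split.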

Potentially, de-sparsifying the lasso [see, e.g., Dezeure et al., 2015] can be adapted to construct confidence intervals for our L1 penalized MT-Whittle estimator. This bias correction and inference approach for a lasso penalized estimator was proposed by Zhang and Zhang [2014] for penalized LS problems with independent and identically distributed Gaussian noise. van de Geer et al. [2014] generalized the idea to generalized linear models and derived the de-sparsified estimator based on the KKT condition [Karush, 1939, Kuhn and Tucker, 1951]. Based on the asymptotic Gaussian distribution of the de-sparsified estimator, confidence intervals can be constructed accordingly. However, in our case, the MT estimates at Fourier frequencies are correlated. Based on our simulation in Section 4.4, a good interval estimate of the SDF usually needs these correlations to be carefully addressed. It would be interesting to compare the performance of de-sparsifying the lasso and the sample splitting-based method for interval estimation of the SDF.

There are several alternative approaches for modeling the SDM. For example, we can represent the SDM with cepstral filters as in Holan et al. [2017], and the SDM can also be expressed by a principal component decomposition [see, e.g., Brillinger, 1981, Chapter 9]. However, these methods may still suffer from an issue similar to that of the Cholesky-based smoothing approach: if the dimensional components of the time series are permuted, the SDM estimate could vary. The matrix-valued wavelet denoising in Chau and von Sachs [2020] overcomes this issue. Chau and von Sachs [2020] proposed a generalization of the wavelet transform to curves in the Riemannian manifold of Hermitian positive definite matrices. A geometric requirement is imposed on the wavelet thresholding to ensure that the resulting SDM estimator is invariant to the dimensional rotation of the time series components (see Chau and von Sachs [2020, Section 4.3.1] for more details). A future topic of interest is how to extend this SDM estimation method of Chau and von Sachs [2020] from an LS-based method to our multivariate MT-Whittle framework. Theoretically, an asymptotic theory needs to be derived for these new SDM estimators.

For interval estimation of the SDM, Dai and Guo [2004] obtained Bayesian credible intervals for the Cholesky components, and proposed to construct pointwise bootstrap confidence intervals for the SDM elements by regenerating the time series based on their SDM estimate. Alternatively, Krafty and Collinge [2013] simulated replicates of the Fourier transform of the time series according to the estimated SDM, and built confidence intervals for the SDM elements with quantiles of the resampling distribution of the L2 penalized Whittle estimator. Chau et al. [2019] proposed to construct a simultaneous confidence region for the SDM based on the properties of data depth functions [see, e.g., Modarres, 2011] in the Riemannian manifold of Hermitian positive definite matrices. As a comparison, we may extend our sandwich-based interval estimation methods to the multivariate case based on the second order properties of MT estimates presented in Section 5.1.

Other directions of future research include the extension to nonstationary time series. In the literature, spline-based methods have been extended to univariate cases, such as Guo et al. [2003] and Rosen et al. [2012], and to multivariate cases, such as Zhang [2016] and Li and Krafty [2019], for the analysis of time-varying spectra. Another possible direction is the generalization of our L1 penalized MT-Whittle method to the spectral analysis of spatio-temporal processes.

Bibliography

M. S. Bartlett and D. Kendall. The statistical analysis of variance-heterogeneity and the logarithmic transformation. Supplement to the Journal of the Royal Statistical Society, 8:128–138, 1946.
A. Basu, I. R. Harris, N. L. Hjort, and M. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85:549–559, 1998.
H. Bauer. Measure and Integration Theory. Walter de Gruyter, Berlin, Germany, 2001.
A. Beck and M. Teboulle. Gradient-based algorithms with applications to signal-recovery problems. In D. P. Palomar and Y. C. Eldar, editors, Convex Optimization in Signal Processing and Communications, pages 42–88. Cambridge University Press, Cambridge, England, 2009.
J. Beran. Fitting long-memory models by generalized linear regression. Biometrika, 80:817–822, 1993.
P. Bloomfield. An exponential model for the spectrum of a scalar time series. Biometrika, 60:217–226, 1973.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–122, 2011.
D. Brillinger. Time Series: Data Analysis and Theory. Holt, New York, NY, 1981.
P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods (Second Edition). Springer Science & Business Media, New York, NY, 1991.
P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, New York, NY, 2011.
C. Byrnes, T. T. Georgiou, and A. Lindquist. A new approach to spectral estimation: A tunable high-resolution spectral estimator. IEEE Transactions on Signal Processing, 48:3189–3205, 2000.
M. Calder and R. A. Davis. Introduction to Whittle (1953), The Analysis of Multiple Stationary Time Series. In Breakthroughs in Statistics, pages 141–169. Springer, New York, NY, 1997.
C. Chatfield. Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society. Series A (Statistics in Society), 158:419–466, 1995.
J. Chau and R. von Sachs. Intrinsic wavelet regression for curves of Hermitian positive definite matrices. Journal of the American Statistical Association, 2020. DOI: 10.1080/01621459.2019.1700129.
J. Chau, H. Ombao, and R. von Sachs. Intrinsic data depth for Hermitian positive definite matrices. Journal of Computational and Graphical Statistics, 28:427–439, 2019.

A. Cichocki and S. Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12:1532–1568, 2010.
M. Clyde and E. I. George. Model uncertainty. Statistical Science, 19:81–94, 2004.
R. Cogburn and H. T. Davis. Periodic splines and spectral estimation. The Annals of Statistics, 2:1108–1126, 1974.
D. R. Cox. A note on data-splitting for the evaluation of significance levels. Biometrika, 62:441–444, 1975.
R. Dahlhaus. Small sample effects in time series analysis: a new asymptotic theory and a new estimate. The Annals of Statistics, 16:808–841, 1988.
M. Dai and W. Guo. Multivariate spectral analysis using Cholesky decomposition. Biometrika, 91:629–643, 2004.
I. Daubechies. Ten Lectures on Wavelets, volume 61. SIAM, Philadelphia, PA, 1992.
R. Dezeure, P. Bühlmann, L. Meier, and N. Meinshausen. High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30:533–558, 2015.
P. J. Diggle, P. Heagerty, K.-Y. Liang, and S. L. Zeger. Analysis of Longitudinal Data. Oxford University Press, Oxford, England, 2nd edition, 2002.
D. L. Donoho and I. M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90:1200–1224, 1995.
D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1994.
D. Draper. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B, 56:45–97, 1994.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32:407–499, 2004.
Y. Fan and C. Y. Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75:531–552, 2013.
J. J. Faraway. On the cost of data analysis. Journal of Computational and Graphical Statistics, 1:213–229, 1992.
J. J. Faraway. Data splitting strategies for reducing the effect of model selection on inference. Technical report, Citeseer, 1995.
J. J. Faraway. Does data splitting improve prediction? Statistics and Computing, 26:49–60, 2016.
A. Ferrante, M. Pavon, and M. Zorzi. A maximum entropy enhancement for a family of high-resolution spectral estimators. IEEE Transactions on Automatic Control, 57:318–329, 2012.
D. A. Freedman. On the so-called ‘Huber sandwich estimator’ and ‘robust standard errors’. The American Statistician, 60:299–302, 2006.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22, 2010.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2:17–40, 1976.

H.-Y. Gao. Wavelet Estimation of Spectral Densities in Time Series Analysis. PhD thesis, Department of Statistics, University of California, Berkeley, 1993.
H.-Y. Gao. Choice of thresholds for wavelet shrinkage estimate of the spectrum. Journal of Time Series Analysis, 18:231–251, 1997.
T. T. Georgiou and A. Lindquist. Kullback-Leibler approximation of spectral density functions. IEEE Transactions on Information Theory, 49:2910–2917, 2003.
R. Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. Revue Française d’Automatique, Informatique, Recherche Opérationnelle. Analyse Numérique, 9:41–76, 1975.
T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378, 2007.
I. J. Good. Rational decisions. Journal of the Royal Statistical Society, Series B, 14:107–114, 1952.
N. R. Goodman. Statistical analysis based on a certain multivariate complex Gaussian distribution (an introduction). The Annals of Mathematical Statistics, 34:152–177, 1963.
U. Grenander and G. Szegő. Toeplitz Forms and Their Applications. University of California Press, Berkeley, CA, 1958.
P. D. Grünwald and A. P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32:1367–1433, 2004.
C. Gu. Smoothing Spline ANOVA Models. Springer Science & Business Media, New York, NY, 2013.
W. Guo, M. Dai, H. C. Ombao, and R. von Sachs. Smoothing spline ANOVA for time-dependent spectral analysis. Journal of the American Statistical Association, 98:643–652, 2003.
E. J. Hannan. The asymptotic theory of linear time-series models. Journal of Applied Probability, 10:130–145, 1973.
B. He and X. Yuan. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130:567–577, 2015.
S. H. Holan, T. S. McElroy, and G. Wu. The cepstral model for multivariate time series: the vector exponential model. Statistica Sinica, 27:23–42, 2017.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 221–233, Berkeley, Calif., 1967. University of California Press.
P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35:73–101, 1992.
C. M. Hurvich and J. Brodsky. Broadband semiparametric estimation of the memory parameter of a long-memory time series using fractional exponential models. Journal of Time Series Analysis, 22:221–249, 2001.
C. M. Hurvich and C.-L. Tsai. The impact of model selection on inference in linear regression. The American Statistician, 44:214–217, 1990.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning. Springer, New York, NY, 2013.
W. Karush. Minima of functions of several variables with inequalities as side constraints. M.Sc. dissertation, Department of Mathematics, University of Chicago, 1939.
V. Klee. What is a convex set? The American Mathematical Monthly, 78:616–631, 1971.
L. H. Koopmans. The Spectral Analysis of Time Series. Academic Press, New York, NY, 1974.
R. T. Krafty and W. O. Collinge. Penalized multivariate Whittle likelihood for power spectrum estimation. Biometrika, 100:447–458, 2013.
H. Kuhn and A. Tucker. Nonlinear programming. In J. Neyman, editor, Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492. University of California Press, Berkeley, CA, 1951.
S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22:79–86, 1951.
E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer Science & Business Media, New York, NY, 1998.
Z. Li and R. T. Krafty. Adaptive Bayesian time–frequency analysis of multivariate time series. Journal of the American Statistical Association, 114:453–465, 2019.
K.-Y. Liang and S. L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22, 1986.
S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis & Machine Intelligence, 11:674–693, 1989.
E. J. McCoy, A. T. Walden, and D. B. Percival. Multitaper spectral estimation of power law processes. IEEE Transactions on Signal Processing, 46:655–668, 1998.
P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall/CRC Press, New York, NY, 1999.
N. Meinshausen, L. Meier, and P. Bühlmann. P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681, 2009.
R. Modarres. Data depth. In M. Lovric, editor, International Encyclopedia of Statistical Science, pages 334–336. Springer, Berlin, Heidelberg, 2011.
E. Mosley-Thompson, C. R. Readinger, P. Craigmile, L. G. Thompson, and C. A. Calder. Regional sensitivity of Greenland precipitation to NAO variability. Geophysical Research Letters, 32, 2005.
P. Moulin. Wavelet thresholding techniques for power spectrum estimation. IEEE Transactions on Signal Processing, 42:3126–3136, 1994.
E. Moulines and P. Soulier. Broadband log-periodogram regression of time series with long-range dependence. The Annals of Statistics, 27:1415–1439, 1999.
M. Narukawa and Y. Matsuda. Broadband semi-parametric estimation of long-memory time series by fractional exponential models. Journal of Time Series Analysis, 32:175–193, 2011.
H.-S. Oh, P. Naveau, and G. Lee. Polynomial boundary treatment for wavelet regression. Biometrika, 88:291–298, 2001.

N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1:127–239, 2014.
Y. Pawitan. Automatic estimation of the cross-spectrum of a bivariate time series. Biometrika, 83:419–432, 1996.
Y. Pawitan and F. O’Sullivan. Nonparametric spectral density estimation using penalized Whittle likelihood. Journal of the American Statistical Association, 89:600–610, 1994.
D. B. Percival. Spectral analysis of univariate and bivariate time series. In J. L. Stanford and S. B. Vardeman, editors, Statistical Methods for Physical Science, chapter 11, pages 313–348. Academic Press, New York, NY, 1994.
D. B. Percival and A. T. Walden. Spectral Analysis for Physical Applications. Cambridge University Press, Cambridge, England, 1993.
D. B. Percival and A. T. Walden. Wavelet Methods for Time Series Analysis. Cambridge University Press, Cambridge, England, 2000.
J. Pfanzagl. On the measurability and consistency of minimum contrast estimates. Metrika, 14:249–272, 1969.
R. R. Picard and K. N. Berk. Data splitting. The American Statistician, 44:140–147, 1990.
B. M. Pötscher. Effects of model selection on inference. Econometric Theory, 7:163–185, 1991.
M. B. Priestley. Spectral Analysis and Time Series. Academic Press, New York, NY, 1981.
R. Q. Quiroga, T. Kreuz, and P. Grassberger. Event synchronization: a simple and fast method to measure synchronicity and time delay patterns. Physical Review E, 66:1–9, 2002.
S. I. Resnick. A Probability Path. Birkhäuser, Boston, MA, 2003.
K. S. Riedel and A. Sidorenko. Minimum bias multiple taper spectral estimation. IEEE Transactions on Signal Processing, 43:188–195, 1995.
A. Rinaldo, L. Wasserman, and M. G’Sell. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. The Annals of Statistics, 47:3438–3469, 2019.
O. Rosen and D. S. Stoffer. Automatic estimation of multivariate spectra via smoothing splines. Biometrika, 94:335–345, 2007.
O. Rosen, S. Wood, and D. S. Stoffer. AdaptSPEC: Adaptive spectral estimation for nonstationary time series. Journal of the American Statistical Association, 107:1575–1589, 2012.
S. Sardy, A. Antoniadis, and P. Tseng. Automatic smoothing with wavelets for a wide class of distributions. Journal of Computational and Graphical Statistics, 13:399–421, 2004.
A. Schuster. On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena. Terrestrial Magnetism, 3:13–41, 1898.
D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York, NY, 1992.
C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
X. Shen, W. Pan, and Y. Zhu. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107:223–232, 2012.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36:111–147, 1974.
S. Tang, P. F. Craigmile, and Y. Zhu. Spectral estimation using multitaper Whittle methods with a lasso penalty. IEEE Transactions on Signal Processing, 67:4992–5003, 2019.
D. J. Thomson. Spectrum estimation and harmonic analysis. Proceedings of the IEEE, 70:1055–1096, 1982.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58:267–288, 1996.
R. J. Tibshirani. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.
A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, NY, 2009.
S. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42:1166–1202, 2014.
S. A. van de Geer. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36:614–645, 2008.
S. A. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
G. van Luijtelaar. The WAG/Rij Rat Model of Absence Epilepsy: Ten Years of Research: a Compilation of Papers. Nijmegen University Press, Nijmegen, Netherlands, 1997.
G. Wahba. Automatic smoothing of the log periodogram. Journal of the American Statistical Association, 75:122–132, 1980.
G. Wahba and S. Wold. Periodic splines for spectral density estimation: The use of cross validation for determining the degree of smoothing. Communications in Statistics, 4:125–141, 1975.
A. T. Walden. A unified view of multitaper multivariate spectral estimation. Biometrika, 87:767–788, 2000.
A. T. Walden, D. B. Percival, and E. J. McCoy. Spectrum estimation by wavelet thresholding of multitaper estimators. IEEE Transactions on Signal Processing, 46:3153–3165, 1998.
L. Wasserman and K. Roeder. High dimensional variable selection. The Annals of Statistics, 37:2178–2201, 2009.
L. A. Wasserman. All of Nonparametric Statistics. Springer, New York, NY, 2006.
R. W. M. Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61:439–447, 1974.
P. Welch. The use of Fast Fourier Transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15:70–73, 1967.
P. Whittle. Estimation and information in stationary time series. Arkiv för Matematik, 2:423–434, 1953.
D. P. Wipf and S. S. Nagarajan. A new view of automatic relevance determination. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1625–1632. Curran Associates, Inc., Red Hook, NY, 2008.
S. N. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73:3–36, 2011.
S. L. Zeger and K.-Y. Liang. Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42:121–130, 1986.
C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76:217–242, 2014.
L. Zhang. Penalized Regression Methods in Time Series and Functional Data Analysis. PhD thesis, Department of Mathematical and Statistical Sciences, University of Alberta, Canada, 2017.
S. Zhang. Adaptive spectral estimation for nonstationary multivariate time series. Computational Statistics & Data Analysis, 103:330–349, 2016.
M. Zorzi. A new family of high-resolution multivariate spectral estimators. IEEE Transactions on Automatic Control, 59:892–904, 2014.
M. Zorzi. Multivariate spectral estimation based on the concept of optimal prediction. IEEE Transactions on Automatic Control, 60:1647–1652, 2015a.
M. Zorzi. An interpretation of the dual problem of the THREE-like approaches. Automatica, 62:87–92, 2015b.
