Spectral Analysis Using Multitaper Whittle Methods with a Lasso Penalty
Dissertation
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Shuhan Tang, M.S.
Graduate Program in Department of Statistics
The Ohio State University
2020
Dissertation Committee:
Peter Craigmile, Advisor
Yunzhang Zhu, Co-advisor
Dena Asta
Yoonkyung Lee

© Copyright by
Shuhan Tang
2020

Abstract
Spectral analysis is a fundamental technique for studying the frequency domain characteristics of time series. Traditional nonparametric estimators of the spectral density function, such as the periodogram, are inconsistent, and the more advanced multitaper spectral estimators are often still too noisy. In this dissertation, an L1 penalized quasi-likelihood Whittle framework based on multitaper estimates is proposed for the spectral analysis of univariate and multivariate stationary time series. The new approach is compatible with various types of basis functions. The use of scale-calibrated universal thresholds and the generalized information criterion (GIC) is suggested to efficiently capture the sparse features in the spectra.
The fast convergence rate of the proposed estimator is established for univariate spectral analysis. Additionally, a sample splitting strategy is introduced to obtain pointwise interval estimates of a spectrum using the Huber sandwich method. The proposed framework is extended to the estimation of spectral density matrices based on the Cholesky decomposition to ensure Hermitian positive definiteness. Computationally, an alternating direction method of multipliers (ADMM) algorithm and a proximal gradient method are presented for solving the optimization problems. The new methodology is illustrated via simulation and via an application to the spectral analysis of electroencephalogram (EEG) data.
Keywords: Alternating direction method of multipliers (ADMM); basis expansion; multitaper spectral estimates; proximal gradient descent; sample splitting; Whittle likelihood.
Acknowledgments
First of all, I would like to express my deep gratitude to my advisors, Dr. Peter Craigmile and Dr. Yunzhang Zhu, whose insightful guidance and persistent encouragement helped me reach one of the most significant milestones in my life. During my Ph.D. study, I have learned and benefited a lot from their great enthusiasm and dedication to teaching and research. I would also like to extend my sincere appreciation to Dr. Dena Asta and Dr. Yoonkyung Lee for their support on both my candidacy exam committee and the dissertation committee.
I am grateful to all my statistics professors for introducing the glamorous world of statistics to me. I also want to acknowledge Dr. Gary G. Koch and Mrs. Carolyn J. Koch for their generosity and the fellowship. Thank you to the Department of Statistics at Ohio State for making me feel at home in Columbus.
Thanks to my family and friends for their endless love and faithful friendship.
Vita
2011 ...... B.S. Mathematics, Central South University.
2014 ...... M.S. Applied Statistics, Bowling Green State University.
2014-2016 ...... College of Public Health, The Ohio State University.
2017 ...... M.S. Statistics, The Ohio State University.
2016-present ...... Department of Statistics, The Ohio State University.
Publications
Research Publications
S. Tang, P. F. Craigmile, and Y. Zhu. Spectral estimation using multitaper Whittle methods with a lasso penalty. IEEE Transactions on Signal Processing, 67:4992–5003, 2019.
Fields of Study
Major Field: Statistics
Table of Contents
Page
Abstract...... ii
Acknowledgments...... iii
Vita...... iv
List of Tables...... viii
List of Figures...... x
1. Introduction...... 1
1.1 Nonparametric spectral estimation and basis models...... 1
1.2 Penalized multitaper Whittle framework...... 3
1.3 Sample splitting and spectral inference...... 4
1.4 Extension to multivariate spectral analysis...... 5
1.5 Organization of this dissertation...... 6
2. Foundations of spectral analysis for stationary processes...... 7
2.1 Spectral density function...... 7
2.2 Preliminary spectral estimators...... 11
2.2.1 The periodogram...... 11
2.2.2 Multitaper spectral estimators...... 13
2.2.3 Fourier frequencies...... 17
2.3 Basis models for SDFs...... 19
2.3.1 Literature review...... 20
2.3.2 Basis models and least squares approaches...... 21
2.3.3 Wavelet soft-thresholding...... 25
2.3.4 Smoothing splines...... 29
2.4 Conclusion...... 30
3. Multitaper Whittle method with a lasso penalty...... 32
3.1 Penalized multitaper Whittle estimation...... 32
3.2 Computing: an ADMM algorithm...... 38
3.3 Tuning parameter selection...... 42
3.3.1 Scale-calibrated universal threshold...... 42
3.3.2 Generalized information criterion...... 44
3.4 Theory...... 45
3.5 Simulations...... 49
3.5.1 Validating the empirical theoretical rate...... 59
3.6 Spectral estimation for EEG signals...... 60
3.7 Non-orthogonal basis functions...... 63
3.8 Conclusion...... 64
4. Spectral inference with sample splitting...... 65
4.1 Sample splitting for SDFs...... 65
4.1.1 Overview of sample splitting...... 66
4.1.2 Sample splitting for L1 penalized SDF models...... 68
4.2 Quantifying the variability of MT-Whittle estimators...... 71
4.2.1 Naive variance estimation...... 71
4.2.2 Sandwich variance estimation...... 73
4.2.3 Fast variance approximation...... 77
4.2.4 Summary...... 79
4.3 Proper scoring rules...... 81
4.3.1 Scoring rules and property...... 81
4.3.2 Interval scores...... 84
4.4 Simulation studies...... 85
4.4.1 The choice of number of tapers...... 94
4.5 Application to EEG signals...... 97
4.6 Conclusion...... 99
5. Extensions to spectral analysis of stationary multivariate time series...... 101
5.1 Foundations of multivariate spectral analysis...... 101
5.1.1 Covariance matrix of multivariate processes...... 101
5.1.2 Spectral density matrix...... 103
5.1.3 Representing complex-valued spectra...... 106
5.1.4 Cross-periodogram and multitaper spectral estimators...... 107
5.1.5 Covariance between MT estimates of different cross-spectra...... 110
5.1.6 Covariance between real and imaginary parts of an MT estimator...... 115
5.1.7 Basis methods for SDM estimation...... 117
5.2 Penalized multitaper Whittle method for SDM...... 119
5.2.1 MT-Whittle likelihood for bivariate time series...... 122
5.2.2 Penalized bivariate MT-Whittle method...... 126
5.2.3 Computation: proximal gradient method...... 128
5.2.4 Tuning parameter selection...... 133
5.2.5 Sample splitting...... 137
5.3 Simulation...... 140
5.3.1 The VAR(2) process...... 144
5.3.2 The VARMA(2,15000) process...... 148
5.3.3 The VARMA(4,15000) process...... 152
5.3.4 Discussion...... 157
5.4 Spectral analysis of multivariate EEG signals...... 161
5.5 Conclusion...... 164
6. Conclusion and discussion...... 166
6.1 Concluding remarks...... 166
6.2 Discussion and future works...... 167
Bibliography...... 172
List of Tables
Table Page
3.1 Comparison of IRMSEs (with bootstrap standard errors in parentheses) when using different bases for estimating the SDF of the AR(2) process of length N = 2048 by the unpenalized least squares (LS) method and the unpenalized MT-Whittle method with number of tapers K = 10...... 51
3.2 Simulation results with p = 1024 for the four time series of length N = 2048: comparing decibel-scaled bias, standard deviation (SD) and IRMSE of the L1 penalized LS approach and the L1 penalized MT-Whittle approach with LA(8) wavelet bases by different tuning criteria. Models are fitted to raw MT estimates with K = 10 sine tapers. Values in parentheses denoted by ‘(SE)’ are the corresponding standard errors based on 1000 bootstrap resamples. Minimum quantities are bolded in each section, excluding the ones from ‘Best’ (best possible IRMSE)...... 55
4.1 Recommended number of sine tapers, K, for achieving a close to minimal interval score for a given sample size N...... 95
5.1 Simulation results for the VAR(2) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples...... 146
5.2 Results of evaluating the SDM estimates for the VAR(2) process: comparing FIRMSE, SFIRMSE, and average Whittle score of different methods. Values in parentheses denoted by ‘(SE)’ are standard errors based on 500 bootstrap samples...... 148
5.3 Simulation results for the VARMA(2,15000) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples...... 150
5.4 Results of evaluating the SDM estimates for the VARMA(2,15000) process: comparing FIRMSE, SFIRMSE, and average Whittle score of different methods. Values in parentheses denoted by ‘(SE)’ are standard errors based on 500 bootstrap samples...... 152
5.5 Simulation results for the VARMA(4,15000) process: comparing the bias and IRMSE of different methods for estimating the SDM components. Values in parentheses are standard errors based on 500 bootstrap samples...... 154
5.6 Results of evaluating the SDM estimates for the VARMA(4,15000) process: comparing FIRMSE, SFIRMSE, and average Whittle score of different methods. Values in parentheses denoted by ‘(SE)’ are standard errors based on 500 bootstrap samples...... 157
List of Figures
Figure Page
3.1 Plots of the SDF for three processes on the decibel scale (dB)...... 50
3.2 One replicate of spectrum estimation for the high-order MA process based on unpenalized LS method with different basis types. For each basis type, the orange line is the LS estimates with p = 32 bases including the intercept. The black line shows the true SDF on the decibel scale (dB)...... 53
3.3 A comparison of the IRMSEs for different L1 penalization methods for estimating the SDF. Here, the heights of the plotted symbols are larger than the widths of the 95% bootstrap confidence intervals for each IRMSE; the horizontal dashed lines indicate the best possible IRMSEs for each method...... 54
3.4 A comparison of spectral estimates for the four different processes. For each process, the blue line is the L1 penalized MT-Whittle estimates, the gray line is the corresponding raw MT estimates with K = 10 tapers, and the red line is the true SDF...... 57
3.5 Plots of bias vs K on the decibel (dB) scale using the L1 penalized LS method (orange) and the L1 penalized MT-Whittle method (blue) with the universal threshold for the four processes. The vertical bars show 95% bootstrap confidence intervals for each K, and the dotted lines show the location of the minimum...... 58
3.6 Plots of IRMSE vs K on the decibel (dB) scale using the L1 penalized LS method (orange) and the L1 penalized MT-Whittle method (blue) with the universal threshold for the four processes. The vertical bars show 95% bootstrap confidence intervals for each K, and the dotted lines show the location of the minimum...... 59
3.7 Plots of the ratio $\sum_{j=1}^{M}(R_j - 1)\phi(f_j)\big/\sqrt{M\log(M)}$ against $\log_2(M)$, for M varying from $2^8$ to $2^{15}$. For each process, the solid line is the median of the ratio, and the dashed lines are the 2.5th and 97.5th percentiles of the ratio, based on 1000 simulations...... 60
3.8 Time series plots of the (a) left and (b) right EEG channels, as well as estimated SDFs for the (c) left and (d) right channels. In (c) and (d), the blue line is the estimate with the universal threshold, and the green line is the estimate with the GIC-based threshold...... 61
4.1 Plots for evaluating the log SDF estimates of the AR(2) process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row) against the number of tapers K. Sample size N (256, 512, 1024, 2048) varies in different columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are in different scales across the columns...... 87
4.2 Plots for evaluating the log SDF estimates of the AR(4) process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row) against the number of tapers K. Sample size N (256, 512, 1024, 2048) varies in different columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are in different scales across the columns...... 88
4.3 Plots for evaluating the log SDF estimates of the MA process: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row) against the number of tapers K. Sample size N (256, 512, 1024, 2048) varies in different columns. The vertical bars show 95% bootstrap confidence intervals at each K, and the dashed vertical lines show the location of the optimum. Notice that the plots of average interval scores are in different scales between the columns...... 89
4.4 Plots for evaluating the log SDF estimates across the frequencies for the three processes of length N = 2048 using K = 15 sine tapers: bias (first row), RMSE (second row), coverage (third row), expected interval score (fourth row). 93
4.5 Plots of four evaluation metrics against sample size N for the three processes using the suggested number of sine tapers, K = (1/4)N^{8/15}: average absolute bias (first row), mean IRMSE (second row), average coverage (third row), average interval score (fourth row). The vertical bars show 95% bootstrap confidence intervals at each N...... 96
4.6 Plots of interval estimates for the left and right EEG channels. In each plot, the solid black line shows the point estimates of the SDF. The dashed red lines display the pointwise 95% confidence intervals (CI) constructed based on the sandwich method, while the dashed blue lines exhibit the interval estimates by the fast approximation approach...... 98
5.1 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L2 Krafty’s estimates (green), and the proposed L1 MT-Whittle estimates (blue) for Cholesky components, SDM elements, and MSC of the VAR(2) process.. 145
5.2 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VAR(2) process...... 147
5.3 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L2 Krafty’s estimates (green), and the proposed L1 MT-Whittle estimates (blue) for Cholesky components, SDM elements, and the MSC of the VARMA(2,15000) process...... 149
5.4 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(2,15000) process...... 151
5.5 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L2 Krafty’s estimates (green), and the proposed L1 MT-Whittle estimates (blue) for Cholesky components, SDM elements, and the MSC of the VARMA(4,15000) process...... 153
5.6 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(4,15000) process...... 156
5.7 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for Cholesky components and SDM elements of the VAR(2) process...... 158
5.8 Plots of true spectra (red), and one replicate of raw MT estimates (gray), L1 MT-Whittle estimates based on GIC (orange), and L1 MT-Whittle estimates based on universal thresholds (blue) for Cholesky components and SDM elements of the VARMA(4,15000) process...... 159
5.9 Plots of bias (rows 1 and 2) and RMSE (rows 3 and 4) against frequency for raw MT estimates (gray), L2 Krafty's estimates (green), and the L1 MT-Whittle-based estimates (blue) of the Cholesky components and the SDM elements of the VARMA(4,15000) process...... 160
5.10 Plots of the estimated Cholesky components and the SDM elements of the two-channel EEG signal...... 163
Chapter 1: Introduction
Spectral analysis is an important area in the study of time series and signal processing, providing statistical insights into the frequency domain for a series collected over time (see, e.g., Priestley [1981]). Estimating the spectral density function (SDF), or spectrum, of a time series serves as a primary tool for frequency domain analysis. Its use can be found in many different fields, such as astronomy, cognitive science, earth sciences, electrical engineering, and finance. Examining the SDF allows us to explore periodicities in the data (e.g., Percival and Walden [1993, ch. 10]), provides an alternative way to analyze and estimate the covariance structure of stationary time series (e.g., Percival and Walden [1993, ch. 4]), and can also be used to understand the effect of preprocessing a time series (e.g., Mosley-Thompson et al. [2005]).
1.1 Nonparametric spectral estimation and basis models
There are many nonparametric estimators of the SDF of a stationary time series. These include the periodogram, direct spectral estimators, lag window and overlapping segment averaging spectral estimators, and multitaper (MT) spectral estimators (see Percival and Walden [1993] for a complete review). Most of these spectral estimates can be formulated within a unified framework of MT spectral estimators [Walden, 2000]. While many of these estimators are designed to provide an adequate tradeoff between bias and variance, often these nonparametric estimates are still too noisy when a stable estimate of the SDF is required.
An alternative strategy is to use a parametric approach. However, model misspecification caused by considering a limited class of models for the SDF can compromise estimation (see Percival and Walden [1993, ch. 9], and references therein).
A popular alternative approach is to consider a semiparametric model for the SDF, in which the log SDF is expressed in terms of a truncated basis expansion, where the number of basis functions is allowed to increase with the sample size. The statistical problem then becomes how to enforce sparsity by selecting the basis functions and estimating the model parameters so that we adequately estimate the SDF, but also retain computational efficiency as the sample size increases. Gao [1993, 1997], Moulin [1994] and Walden et al. [1998] enforce sparsity using a penalized least squares (LS) approach for estimating the log SDF with wavelet soft thresholding. In terms of computational complexity, wavelet thresholding methods are typically O(N) for a time series of N regularly sampled values.
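To make the wavelet soft-thresholding idea concrete, here is a minimal sketch using the Haar wavelet, which is simpler than the LA(8) wavelets used later in this dissertation. The function names, the Haar basis, and the use of the universal threshold σ√(2 log N) are illustrative assumptions, not the exact procedures of the papers cited above.

```python
import numpy as np

def haar_dwt(x):
    """Full orthonormal Haar DWT of a signal whose length is a power of two.
    Returns the final scaling coefficient and the detail coefficients,
    finest scale first."""
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

def haar_idwt(approx, details):
    """Exact inverse of haar_dwt (the transform is orthonormal)."""
    for d in reversed(details):
        even = (approx + d) / np.sqrt(2.0)
        odd = (approx - d) / np.sqrt(2.0)
        approx = np.empty(even.size + odd.size)
        approx[0::2], approx[1::2] = even, odd
    return approx

def soft_threshold(w, lam):
    """Soft-thresholding, the shrinkage rule induced by an L1 penalty."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def denoise_log_sdf(y, sigma):
    """Soft-threshold the Haar detail coefficients of a noisy log-SDF
    estimate y, using the universal threshold sigma * sqrt(2 log N)."""
    lam = sigma * np.sqrt(2.0 * np.log(y.size))
    approx, details = haar_dwt(y)
    return haar_idwt(approx, [soft_threshold(d, lam) for d in details])
```

Each sweep of the pyramid touches O(N) values, which is the source of the O(N) complexity mentioned above.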
A number of approaches enforce smoothness of the SDF via an L2 penalty: Cogburn and Davis [1974], Wahba and Wold [1975] and Wahba [1980] use penalized LS, and Pawitan and O'Sullivan [1994] use a penalized Whittle likelihood, where the Whittle likelihood [Whittle, 1953] is an approximation to the negative log-likelihood for the time series using the periodogram. To enforce sparsity, some of these L2 methods with smoothing splines also use model selection, often in combination with cross-validation, to select the basis functions that are used to model the SDF. Alternatively, one can implement methods such as Wipf and Nagarajan [2008] to enforce sparsity on the basis expansion directly. (On a related topic, the smoothness of spectral estimates can also be tuned with high-resolution approaches introduced in Byrnes et al. [2000] and Georgiou and Lindquist [2003], and using extended frameworks based on so-called beta and tau divergence families, such as Basu et al. [1998], Ferrante et al. [2012], Zorzi [2014, 2015a,b]; see Cichocki and Amari [2010] for a general review of such divergences.)
1.2 Penalized multitaper Whittle framework
Our method is also motivated by the need to enforce sparsity while adequately estimating the SDF. In addition, we seek a computationally efficient method that scales to large data sets. We develop a quasi-likelihood method for estimating SDFs using a Whittle likelihood [Whittle, 1953] based on MT spectral estimates. A quasi-likelihood function [Wedderburn, 1974] has statistical properties similar to those of the log-likelihood and can be used for statistical inference, but does not have to match exactly the log of the joint probability density function of the data (see McCullagh and Nelder [1999, Ch. 9] for a comprehensive review). MT estimates [Thomson, 1982] provide a good compromise between bias and variance, and can yield more efficient estimates of the SDF [Walden et al., 1998]. We demonstrate that the addition of a Whittle likelihood method [Whittle, 1953] improves estimation over traditional LS approaches.
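To illustrate the MT estimates underlying the quasi-likelihood, here is a minimal sine-taper sketch; the dissertation's simulations use sine tapers, but the function names and the direct (non-FFT) evaluation below are illustrative simplifications.

```python
import numpy as np

def sine_tapers(N, K):
    """The first K sine tapers of length N:
    h_k(t) = sqrt(2/(N+1)) * sin(pi*k*t/(N+1)), t = 1, ..., N.
    Each taper has unit energy, and distinct tapers are orthogonal."""
    t = np.arange(1, N + 1)
    return np.array([np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * t / (N + 1))
                     for k in range(1, K + 1)])

def multitaper_sdf(x, K, freqs):
    """Average the K tapered eigenspectra
    |sum_t h_k(t) x_t exp(-i 2 pi f t)|^2 to form the MT estimate."""
    x = np.asarray(x, dtype=float)
    N = x.size
    t = np.arange(1, N + 1)
    H = sine_tapers(N, K)
    est = np.zeros(len(freqs))
    for k in range(K):
        for j, f in enumerate(freqs):
            J = np.sum(H[k] * x * np.exp(-2j * np.pi * f * t))
            est[j] += np.abs(J) ** 2
    return est / K
```

Averaging over K nearly uncorrelated eigenspectra is what reduces the variance relative to the single-taper periodogram.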
We use a lasso penalty [Tibshirani, 1996] to enforce sparsity, deriving two strategies to optimally select the tuning parameter that is key to obtaining estimates of the SDF with low integrated root mean squared error (IRMSE): a “universal threshold” and a generalized information criterion (GIC)-based method (see, e.g., Fan and Tang [2013]). Neither method requires cross-validation to select the tuning parameter, so neither compromises computational or statistical efficiency. Theoretically, we derive the rate of convergence for our proposed spectral estimator under some technical conditions on the model sparsity and the MT spectral estimator.
We introduce a computationally efficient method to estimate the parameters in our model using the alternating direction method of multipliers (ADMM) algorithm (see, e.g., Boyd et al. [2011]). To reach an ε-optimal solution with a time series of length N, our method is O(Nε⁻¹) when using wavelet bases and O(N³ + ε⁻¹N²) for general bases. Although computationally more challenging, our method can be applied to SDF estimation using any collection of basis functions and, as mentioned above, outperforms LS-based methods such as wavelet thresholding in terms of estimation quality.
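The ADMM updates for a generic lasso problem can be sketched as follows. This follows the standard splitting in Boyd et al. [2011] for ½‖Aβ − b‖² + λ‖β‖₁; it is not the Whittle-likelihood-specific algorithm developed in Chapter 3, and the function name and default settings are illustrative.

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Generic ADMM sketch for min_beta 0.5*||A beta - b||^2 + lam*||beta||_1,
    using the variable splitting beta = z."""
    p = A.shape[1]
    # Factor (A'A + rho*I) once; it is reused by every beta-update.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(p))
    Atb = A.T @ b
    z = np.zeros(p)
    u = np.zeros(p)  # scaled dual variable
    for _ in range(n_iter):
        # beta-update: a ridge-like linear solve.
        beta = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: soft-thresholding, the proximal map of the L1 norm.
        v = beta + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # Dual ascent on the consensus constraint beta = z.
        u = u + beta - z
    return z
```

A quick sanity check: with A the identity, the lasso solution reduces to coordinatewise soft-thresholding of b.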
1.3 Sample splitting and spectral inference
To make statistical inferences based on our L1 penalized MT-Whittle spectral estimates [Tang et al., 2019], we propose a sample splitting strategy and apply the Huber sandwich estimator [Huber, 1967] to construct confidence intervals for the SDF of regularly sampled univariate stationary Gaussian processes. Unlike random splitting methods that assume independence and exchangeability between samples [Dezeure et al., 2015], our splitting scheme takes into account the correlations and spatial inhomogeneity of the MT estimates across the Fourier frequencies [Thomson, 1982]. We illustrate three methods for quantifying the variability of the MT-Whittle estimates: a naive method assuming independence between MT estimates, an estimated sandwich method incorporating the correlation between MT estimates [Thomson, 1982], and a fast sandwich estimation based on an approximation due to Walden et al. [1998] of the correlation between MT estimates. Our simulation results show a clear advantage of the sandwich-based interval estimation over the naive method. The sandwich variance correction greatly ameliorates the undercoverage issue of the confidence intervals, and we demonstrate that this method achieves a better interval score [e.g., Gneiting and Raftery, 2007] than the naive approach. The fast sandwich estimation is very efficient in approximating the theoretical sandwich intervals when an appropriate number of tapers is used. We provide a discussion and an empirical recommendation on choosing the optimal number of tapers for the statistical inference of the SDF based on the simulation study. The suggested number of tapers increases as the sample size grows.
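As a rough illustration of that recommendation, Chapter 4 suggests the rule K = (1/4)N^{8/15} for the number of sine tapers (see Table 4.1 and Figure 4.5). A tiny sketch of the rule follows; rounding to the nearest integer is an assumption about how to discretize it.

```python
# Chapter 4's suggested number of sine tapers, K = (1/4) * N^(8/15);
# the round-to-nearest-integer step is an illustrative assumption.
def recommended_tapers(N):
    return round(0.25 * N ** (8.0 / 15.0))

for N in (256, 512, 1024, 2048):
    print(N, recommended_tapers(N))  # e.g. N = 2048 gives K = 15
```

The N = 2048 value agrees with the K = 15 sine tapers used for the length-2048 simulations in Chapter 4.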
1.4 Extension to multivariate spectral analysis
Spectral analysis of a multivariate time series involves the study of the SDFs, also called the auto-spectra, of each component series, and the analysis of the complex-valued cross-spectrum between pairs of component series [see, e.g., Brillinger, 1981]. At each frequency, the spectral density matrix (SDM) is a Hermitian positive definite matrix whose diagonal elements are auto-spectra values and whose off-diagonal elements are cross-spectra values. The idea of our L1 penalized MT-Whittle method can be extended to SDM estimation for stationary multivariate time series. To facilitate such an extension, we examine related statistical properties of multivariate MT spectral estimators [Walden, 2000]. Similar to Dai and Guo [2004], Rosen and Stoffer [2007], and Krafty and Collinge [2013], we apply a Cholesky decomposition to the SDM to ensure Hermitian positive definiteness, where the real and imaginary parts of the Cholesky elements are modeled by basis expansions. We propose to use a proximal gradient descent algorithm [Parikh and Boyd, 2014] to estimate the Cholesky elements simultaneously based on the L1 penalized multivariate MT-Whittle method. The tuning parameters are selected by generalizing the universal threshold and the GIC criterion to the multivariate case. Adapting the proposed sample splitting, we compare the performance of the L1 penalized MT-Whittle method to the L2 penalized Whittle method of Krafty and Collinge [2013] via a simulation study. An MT-Whittle-based scoring rule and the Frobenius norm of the deviation from the true SDM are used for evaluating the SDM estimates.
1.5 Organization of this dissertation
The rest of this dissertation is organized as follows. Chapter 2 reviews the foundations of spectral analysis of univariate stationary time series, statistical properties of preliminary spectral estimators such as the periodogram and MT estimators, and basis regression models for estimating the SDF. Chapter 3 proposes our main methodology of spectral estimation using the multitaper Whittle likelihood and a lasso penalty for regularly sampled univariate stationary time series, accompanied by the related theory and computational algorithm. A sample splitting approach is established in Chapter 4 for the sandwich interval estimation of the SDF based on the proposed L1 penalized method, and discussions on the optimal choice of the number of tapers can be found in the same chapter. In Chapter 5, we expand the scope to spectral analysis of multivariate stationary time series, demonstrating the statistical properties of multivariate MT spectral estimates and generalizing the L1 penalized MT-Whittle framework to the multivariate case. The utility of our methodology is demonstrated in Chapters 3, 4, and 5 on simulated processes with prominently sharp spectral features and on the spectral analysis of a two-channel electroencephalogram (EEG) example. We provide in Chapter 6 a discussion of further improvements and future work.
Chapter 2: Foundations of spectral analysis for stationary processes
Spectral analysis is a technique that characterizes the frequency domain features of a time series. The technique relies heavily on the estimation of, and inference about, the so-called spectral density function (SDF). In this chapter, we introduce the basic concepts of the SDF and the commonly used estimators of the SDF, to provide the background and properties of SDFs essential for later chapters. Our discussion in this chapter focuses on the SDF of regularly sampled stationary univariate time series. The extension of these concepts to multivariate time series is presented in Chapter 5.
2.1 Spectral density function
To formally define the spectral density function (SDF), we first need an alternative way to represent a time series in terms of frequency domain components. Suppose that a discrete, real-valued univariate stationary process {X_t : t ∈ ℤ} has, without loss of generality, mean zero and is collected at sampling interval Δ = 1. The spectral representation theorem for discrete parameter stationary processes [Priestley, 1981, p. 251] applies: the process {X_t} has the spectral representation

\[ X_t = \int_{-1/2}^{1/2} e^{i2\pi ft} \, dZ(f), \quad t = 0, \pm 1, \pm 2, \ldots, \qquad (2.1) \]

where {Z(f) : f ∈ [−1/2, 1/2]} is a complex-valued orthogonal process with the following properties [Percival and Walden, 1993, p. 130]:

1. E{dZ(f)} = 0;

2. E{|dZ(f)|²} ≡ dS^{(I)}(f) ≡ S(f) df for f ∈ [−1/2, 1/2], where S^{(I)}(f) is a right-continuous, non-decreasing and bounded function, called the integrated spectrum or the spectral distribution function. The derivative S(f), if it exists, is called the SDF or the spectrum;

3. Cov{dZ(f), dZ(f′)} = E{dZ(f) dZ*(f′)} = 0 for f ≠ f′ on [−1/2, 1/2], where dZ*(f) denotes the complex conjugate of dZ(f).
The intuition for (2.1) is the following: in view of Euler's formula e^{i2πft} = cos(2πft) + i sin(2πft), the time series {X_t}, as a function of time t, can be decomposed into a linear combination of sine and cosine functions parametrized by various frequencies and amplitudes. If the sine and cosine components at frequency f have amplitude dZ(f), then the squared magnitude, as a random variable, has expectation dS^{(I)}(f).
The existence of the integrated spectrum S^{(I)}(f) follows from the generalization of Bochner's theorem and from Herglotz's theorem [see, e.g., Brockwell and Davis, 1991, pp. 115-118]. For a mean zero stationary time series {X_t}, we write the autocovariance function (ACVF) as

\[ \gamma_X(\tau) = \mathrm{Cov}(X_{t+\tau}, X_t) = E(X_{t+\tau} X_t^{*}), \quad \tau = 0, \pm 1, \pm 2, \ldots. \]
Then the following two theorems apply:
Theorem 1 (Generalization of Bochner's theorem; see Brockwell and Davis [1991], p. 115). A function γ_X(·) defined on the integers is the ACVF of a stationary (possibly complex-valued) stochastic process if and only if γ_X(·) is Hermitian and non-negative definite.

Theorem 2 (Herglotz's theorem; see Brockwell and Davis [1991], pp. 117-118). A complex-valued function γ_X(·) defined on the integers is non-negative definite if and only if

\[ \gamma_X(\tau) = \int_{-1/2}^{1/2} e^{i2\pi f\tau} \, dS^{(I)}(f), \quad \text{for all integers } \tau, \]

where S^{(I)}(f) is called the integrated spectrum or the spectral distribution function, which is right-continuous, non-decreasing and bounded for f ∈ [−1/2, 1/2], with S^{(I)}(−1/2) = 0.
Thus, a well-defined spectral distribution function S^{(I)}(f) always accompanies the stationary process {X_t}. If the spectral distribution function S^{(I)}(f) is differentiable everywhere on the frequency domain, then we have

\[ \gamma_X(\tau) = \int_{-1/2}^{1/2} e^{i2\pi f\tau} S(f) \, df, \quad \text{for all integers } \tau, \qquad (2.2) \]

where S(f) = dS^{(I)}(f)/df is the SDF. Brockwell and Davis [1991] also demonstrated that if the process {X_t} has an absolutely summable ACVF (i.e., \(\sum_{\tau \in \mathbb{Z}} |\gamma_X(\tau)| < \infty\)), then the SDF S(f) exists, and

\[ S(f) = \sum_{\tau=-\infty}^{\infty} \gamma_X(\tau) e^{-i2\pi f\tau}, \quad |f| \le 1/2, \qquad (2.3) \]

which is the Fourier transform of the ACVF from the time domain to the frequency domain. Thus, we can see that {γ_X(τ)} and {S(f)} are Fourier transform pairs.
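The Fourier-pair relationship can be checked numerically for an MA(1) process, whose ACVF is known in closed form; the process choice and function names below are illustrative.

```python
import numpy as np

# MA(1) process: X_t = e_t + theta * e_{t-1} with Var(e_t) = 1, so that
# gamma_X(0) = 1 + theta^2, gamma_X(+-1) = theta, gamma_X(tau) = 0 otherwise.
theta = 0.6

def sdf(f):
    """S(f) from (2.3): the Fourier transform of the MA(1) ACVF."""
    return (1.0 + theta**2) + 2.0 * theta * np.cos(2.0 * np.pi * f)

# Recover gamma_X(tau) via (2.2) with an equispaced Riemann sum over
# [-1/2, 1/2]; such a sum over a full period is exact for this integrand,
# which is a low-order trigonometric polynomial.
M = 4096
freqs = -0.5 + np.arange(M) / M

def acvf(tau):
    return float(np.real(np.mean(np.exp(2j * np.pi * freqs * tau) * sdf(freqs))))
```

Here acvf(0) recovers γ_X(0) = 1 + θ² = 1.36, acvf(1) recovers θ = 0.6, and acvf(2) recovers 0, confirming that (2.2) inverts (2.3).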
Focusing on the stationary process {X_t} with zero mean and an absolutely summable ACVF, the corresponding SDF S(f) has the following properties [Percival and Walden, 1993, p. 138]:

1. The SDF is a real-valued function defined on f ∈ [−1/2, 1/2];

2. The SDF is an even function; i.e., S(f) = S(−f);

3. The SDF is non-negative; i.e., S(f) ≥ 0;

4. The SDF decomposes the variance of {X_t}, so that

\[ \mathrm{Var}(X_t) = \gamma_X(0) = S^{(I)}(1/2) = \int_{-1/2}^{1/2} S(f) \, df < \infty. \]

The last property expresses an analysis of variance of {X_t} over the frequency domain in terms of the power magnitude. Thus, the SDF represents the distributional density of the power contributed by the component sinusoids shown in (2.1).
In addition, the definition of the SDF, if it exists, can be extended over the real line. For periodic functions such as sinusoids, aliasing occurs with period 1/Δ, where Δ = 1 is the sampling interval, the constant time spacing between adjacent points of the time series {X_t}. The value of the SDF S(f) at frequency f ∈ [−1/2, 1/2] actually represents the accumulated power densities at frequencies {f + c/Δ : c ∈ ℤ}. Then, by taking

\[ S(f + c/\Delta) = S(f), \quad \text{for all } c \in \mathbb{Z} \text{ and } f \in [-1/2, 1/2], \]

the SDF can be viewed as periodic over the real line. Due to the evenness and periodicity, all information contained in a univariate SDF is conveyed by the frequencies on [0, f_N], with Nyquist frequency f_N = 1/(2Δ) = 1/2 when we assume Δ = 1.
More generally, for a stationary process {X_t} with E(X_t) ≠ 0, if the mean is known, one can always obtain a zero-mean process by considering the centered process X_t − E(X_t); if the mean E(X_t) is unknown, common practice is to replace X_t by X_t − X̄ in the spectral calculations above, where X̄ denotes the sample mean of the observed series. For further properties of the SDF, see Percival and Walden [1993] and Brillinger [1981].
2.2 Preliminary spectral estimators
When the true model for the stationary time series is unknown, a closed-form expression for the SDF is unavailable. To study the spectral information using data, we now review some
nonparametric estimators of the SDF.
2.2.1 The periodogram
A natural and fundamental spectral estimator is the periodogram [Schuster, 1898]. Let x = (x_1, x_2, ..., x_N)^T be the vector of consecutive observations of the previously defined real-valued stationary process {X_t}, where the start time is t = 1 and N is the sample size. Again suppose that {X_t} has mean zero and the sampling interval Δ = 1.
According to Percival [1994], preliminary spectral estimators can be obtained by substituting γ_X(τ) in (2.3) by γ̂_X^(p)(τ), where
\[
\hat\gamma_X^{(p)}(\tau) = \begin{cases} \dfrac{1}{N}\displaystyle\sum_{t=1}^{N-|\tau|} x_t\, x_{t+|\tau|}, & \text{for } |\tau| \le N-1; \\[8pt] 0, & \text{for } |\tau| \ge N. \end{cases}
\]
Then, the resulting spectral estimator is the Fourier transform of {γ̂_X^(p)(τ)} [Schuster, 1898],
\[
\hat S^{(p)}(f) = \sum_{\tau=-(N-1)}^{N-1} \hat\gamma_X^{(p)}(\tau)\, e^{-i2\pi f\tau}, \quad |f| \le 1/2,
\]
which is called the periodogram. Alternatively, if we define
\[
J_X^{(p)}(f) = \sum_{t=1}^{N} \frac{1}{\sqrt N}\, x_t\, e^{-i2\pi ft}, \quad |f| \le 1/2, \tag{2.4}
\]
then it can be shown that
\begin{align*}
\hat S^{(p)}(f) &= \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1} \sum_{t=1}^{N-|\tau|} x_t\, x_{t+|\tau|}\, e^{-i2\pi f\tau} \\
&= \frac{1}{N}\sum_{s=1}^{N}\sum_{t=1}^{N} x_s\, x_t\, e^{-i2\pi f(s-t)} \\
&= \sum_{s=1}^{N}\frac{1}{\sqrt N}\, x_s\, e^{-i2\pi fs} \cdot \sum_{t=1}^{N}\frac{1}{\sqrt N}\, x_t\, e^{i2\pi ft} = J_X^{(p)}(f)\, J_X^{(p)*}(f),
\end{align*}
where J_X^(p)*(f) denotes the complex conjugate of J_X^(p)(f). Hence, the periodogram can also be expressed as
\[
\hat S^{(p)}(f) = \frac{1}{N}\left|\sum_{t=1}^{N} x_t\, e^{-i2\pi ft}\right|^2, \quad |f| \le 1/2. \tag{2.5}
\]
Note: Similar to the previous section, if E(X_t) ≠ 0 is unknown, all of the above formulas will have x_t replaced by x_t − x̄, where x̄ = (1/N) Σ_{t=1}^N x_t. Then the quantity γ̂_X^(p)(τ) becomes (1/N) Σ_{t=1}^{N−|τ|} (x_t − x̄)(x_{t+|τ|} − x̄) for |τ| ≤ N − 1, the Fourier transform J_X^(p)(f) is redefined as Σ_{t=1}^N (1/√N)(x_t − x̄) e^{−i2πft}, and the updated equation (2.5) has the form
\[
\hat S^{(p)}(f) = \frac{1}{N}\left|\sum_{t=1}^{N} (x_t - \bar x)\, e^{-i2\pi ft}\right|^2.
\]
If a stationary time series {X_t}, not necessarily white noise or Gaussian distributed, has continuous SDF S(f) for f ∈ [−1/2, 1/2], Percival and Walden [1993, Ch. 6] summarize some key properties of the periodogram:

1. Convergence in mean: E{Ŝ^(p)(f)} → S(f) asymptotically as N → ∞;

2. Assuming finiteness of certain high-order moments [see, e.g., Brillinger, 1981, p. 126],
\[
\hat S^{(p)}(f) \;\to_d\; \begin{cases} S(f)\,\chi_2^2/2, & \text{for } 0 < f < 1/2; \\ S(f)\,\chi_1^2, & \text{for } f = 0 \text{ or } 1/2, \end{cases}
\]
asymptotically as N → ∞;

3. Under the conditions of finiteness of certain high-order moments [see, e.g., Brillinger, 1981, p. 26 and p. 125], for 0 < f′ < f < 1/2, Ŝ^(p)(f) and Ŝ^(p)(f′) are asymptotically independent.
The periodogram is not an ideal spectral estimator. As pointed out by Percival [1994, p. 317], the 'tragedy of the periodogram' includes:

1. For finite sample sizes, the periodogram Ŝ^(p)(f) can be badly biased;

2. As N → ∞, the variance Var{Ŝ^(p)(f)} does not shrink to zero (except when S(f) = 0), and thus the periodogram is not a consistent estimator.
2.2.2 Multitaper spectral estimators
To correct the bias and reduce the variance of the periodogram, more advanced nonparametric estimators of the SDF were developed. These include multitaper (MT) estimators, Welch's overlapping segment averaging (WOSA) estimators [Welch, 1967], and lag-window estimators (see Percival and Walden [1993] for a complete review). As demonstrated by Walden [2000], most of these estimators can be reformulated in the unifying structure of the multitaper (MT) spectral estimators defined below.

For k ∈ {1, ..., K}, one can taper the observed time series x = (x_1, x_2, ..., x_N)^T using a finite sequence {h_{1,k}, ..., h_{N,k}} ∈ R^N, called a data taper. Each resulting tapered time series {h_{1,k}x_1, ..., h_{N,k}x_N} can be used to form a direct spectral estimator as
\[
\hat S_k^{(d)}(f) = \left|\sum_{t=1}^{N} h_{t,k}\, x_t\, e^{-i2\pi ft}\right|^2, \quad |f| \le 1/2. \tag{2.6}
\]
Taking the K data tapers to be orthonormal (i.e., Σ_{t=1}^N h_{t,j} h_{t,k} = 1 if j = k and 0 if j ≠ k), the multitaper (MT) estimator is defined as
\[
\hat S^{(mt)}(f) = \frac{1}{K}\sum_{k=1}^{K} \hat S_k^{(d)}(f), \quad |f| \le 1/2; \tag{2.7}
\]
that is, the MT estimator is an average of K direct spectral estimators (eigenspectra) with orthonormal tapers. Generally speaking, the tapering is usually applied for the purpose of correcting the bias, and the averaging reduces the variance [Percival and Walden, 1993]. Note that the periodogram can be viewed as a special case of the MT estimator: equation (2.5) is obtained by taking K = 1 in equation (2.7) with {h_{1,1}, ..., h_{N,1}} = {1/√N, ..., 1/√N}. The periodogram, direct spectral estimators, and MT estimators all preserve the evenness and non-negativity of the estimated SDF.
Walden [2000] shows, under known conditions and in the more general setting of multivariate time series, that MT estimators are consistent if we let the number of tapers K increase with the sample size N. For univariate stationary processes, we heuristically present below how MT estimators can mitigate the bias and inconsistency issues of the periodogram.
Proposition 1. The expected value of the periodogram can be expressed as
\[
E\{\hat S^{(p)}(f)\} = \int_{-1/2}^{1/2} \mathcal{F}(f - f')\, S(f')\,df',
\]
with
\[
\mathcal{F}(f) = \frac{\sin^2(N\pi f)}{N\sin^2(\pi f)} = \left|\sum_{t=1}^{N} \frac{1}{\sqrt N}\, e^{-i2\pi ft}\right|^2, \tag{2.8}
\]
where the kernel function \(\mathcal{F}(f)\) is known as the spectral window of the periodogram, called the Fejér kernel.
Proof.
\begin{align*}
E\{\hat S^{(p)}(f)\} &= \sum_{\tau=-(N-1)}^{N-1} E\{\hat\gamma_X^{(p)}(\tau)\}\, e^{-i2\pi f\tau}
= \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1} \sum_{t=1}^{N-|\tau|} E(x_t\, x_{t+|\tau|})\, e^{-i2\pi f\tau} \\
&= \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1} (N-|\tau|)\,\gamma_X(\tau)\, e^{-i2\pi f\tau} \\
&= \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1} (N-|\tau|) \left(\int_{-1/2}^{1/2} e^{i2\pi f'\tau}\, S(f')\,df'\right) e^{-i2\pi f\tau} \\
&= \int_{-1/2}^{1/2} \frac{1}{N}\sum_{\tau=-(N-1)}^{N-1} (N-|\tau|)\, e^{-i2\pi (f-f')\tau}\, S(f')\,df' \\
&= \int_{-1/2}^{1/2} \frac{1}{N}\sum_{s=1}^{N}\sum_{t=1}^{N} e^{-i2\pi (f-f')(s-t)}\, S(f')\,df' \\
&= \int_{-1/2}^{1/2} \frac{1}{N}\sum_{s=1}^{N} e^{-i2\pi (f-f')s} \sum_{t=1}^{N} e^{i2\pi (f-f')t}\, S(f')\,df' \\
&= \int_{-1/2}^{1/2} \left|\sum_{t=1}^{N} \frac{1}{\sqrt N}\, e^{-i2\pi (f-f')t}\right|^2 S(f')\,df'.
\end{align*}
Additionally,
\begin{align*}
\left|\sum_{t=1}^{N} \frac{1}{\sqrt N}\, e^{-i2\pi ft}\right|^2
&= \frac{1}{N}\cdot\frac{1-e^{-i2\pi fN}}{1-e^{-i2\pi f}}\cdot\frac{1-e^{i2\pi fN}}{1-e^{i2\pi f}} \\
&= \frac{1}{N}\cdot\frac{e^{i\pi fN}-e^{-i\pi fN}}{e^{i\pi f}-e^{-i\pi f}}\cdot\frac{e^{-i\pi fN}-e^{i\pi fN}}{e^{-i\pi f}-e^{i\pi f}} \\
&= \frac{1}{N}\cdot\frac{\sin(N\pi f)}{\sin(\pi f)}\cdot\frac{\sin(N\pi f)}{\sin(\pi f)}
= \frac{\sin^2(N\pi f)}{N\sin^2(\pi f)}.
\end{align*}
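The closed-form expression for the Fejér kernel can also be checked numerically; the following short sketch (an illustration of ours, not part of the proof) compares both sides of (2.8) on a grid of frequencies:

```python
import numpy as np

# Check the Fejér-kernel identity (2.8): the squared modulus of the
# normalized exponential sum equals sin^2(N pi f) / (N sin^2(pi f)).
N = 32
f = np.linspace(0.013, 0.487, 200)   # avoid f = 0, where the kernel takes the value N

# Left-hand side: |sum_{t=1}^N e^{-i 2 pi f t}|^2 / N, evaluated for each f
lhs = np.abs(
    np.sum(np.exp(-2j * np.pi * np.outer(f, np.arange(1, N + 1))), axis=1)
) ** 2 / N

# Right-hand side: the closed form derived above
rhs = np.sin(N * np.pi * f) ** 2 / (N * np.sin(np.pi * f) ** 2)

assert np.allclose(lhs, rhs)
```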
Since the Fejér kernel is not a Dirac delta function at 0, its shape leads to two sources of bias in the periodogram: a loss of resolution due to its central lobe, and leakage due to its side lobes. The direct spectral estimator (2.6) uses tapering to reduce leakage; its expectation can be obtained following a derivation similar to the one above:
\[
E\{\hat S_k^{(d)}(f)\} = \int_{-1/2}^{1/2} \mathcal{H}_k(f - f')\, S(f')\,df', \tag{2.9}
\]
with spectral window
\[
\mathcal{H}_k(f) = \left|\sum_{t=1}^{N} h_{t,k}\, e^{-i2\pi ft}\right|^2.
\]
Averaging across the K tapers, the expected value of the MT estimator (2.7) is then
\[
E\{\hat S^{(mt)}(f)\} = \int_{-1/2}^{1/2} \bar{\mathcal{H}}(f - f')\, S(f')\,df', \tag{2.10}
\]
where
\[
\bar{\mathcal{H}}(f) = \frac{1}{K}\sum_{k=1}^{K} \mathcal{H}_k(f) = \frac{1}{K}\sum_{k=1}^{K} \left|\sum_{t=1}^{N} h_{t,k}\, e^{-i2\pi ft}\right|^2.
\]
With properly selected data tapers, the spectral window \(\bar{\mathcal{H}}(f)\) can have its side lobes suppressed with a reasonably small increment in the width of the central lobe. Thus, a better balance between leakage and loss of resolution leads to a reduction in the overall bias.
The variance reduction is a result of the averaging step when calculating the MT estimator. For each individual eigenspectrum (2.6), the asymptotic distribution can be obtained in a similar fashion to the periodogram. Percival and Walden [1993] state that, as N → ∞,
\[
\hat S_k^{(d)}(f) \;\to_d\; \begin{cases} S(f)\,\chi_2^2/2, & \text{for } 0 < f < 1/2; \\ S(f)\,\chi_1^2, & \text{for } f = 0 \text{ or } 1/2. \end{cases}
\]
Based on the orthonormality of the tapers of the eigenspectra, under certain mild conditions [see Percival, 1994, p. 336] the eigenspectra are approximately independent, which implies that
\[
\hat S^{(mt)}(f) \;\to_d\; \begin{cases} S(f)\,\chi_{2K}^2/(2K), & \text{for } 0 < f < 1/2; \\ S(f)\,\chi_K^2/K, & \text{for } f = 0 \text{ or } 1/2, \end{cases} \tag{2.11}
\]
asymptotically as N → ∞. Thus, we have
\[
\mathrm{Var}\{\hat S^{(mt)}(f)\} \approx \begin{cases} S^2(f)/K, & \text{for } 0 < f < 1/2; \\ 2S^2(f)/K, & \text{for } f = 0 \text{ or } 1/2. \end{cases} \tag{2.12}
\]
As K increases, we lose resolution in the estimated SDF but decrease the variance of the
MT spectral estimator. Letting the number of tapers K increase with sample size N, the
MT estimator can then be shown to be a consistent estimator of the SDF asymptotically
[see, McCoy et al., 1998, p. 658].
Though MT estimators are capable of balancing bias and variance for spectral estimators, they also induce more correlation between the estimates at different frequencies [Thomson, 1982]. We leave the discussion of the correlation of MT estimates to later chapters.
Besides the common properties of the MT estimators mentioned above, other statistical properties follow from the specific choice of tapers used in the spectral estimates. Discrete prolate spheroidal sequences (DPSS) and sine tapers are most commonly used [Percival and Walden, 1993]. DPSS tapers are designed to reduce the side lobes in the spectral estimate: they solve the time–frequency concentration problem of finding the time-limited sequence that has most of its energy concentrated in a specified frequency band [Percival and Walden, 1993, Ch. 8]. Instead, we use the easily calculated sine tapers [Riedel and Sidorenko, 1995] in this dissertation, which are defined by
\[
h_{t,k} = \left(\frac{2}{N+1}\right)^{1/2} \sin\left(\frac{(k+1)\pi t}{N+1}\right), \quad k = 1, \ldots, K; \; t = 1, \ldots, N. \tag{2.13}
\]
Sine tapers are designed to reduce the smoothing bias, at a compromise in side-lobe reduction. For a given K, the sine tapers are concentrated in the frequency band [−W, W] for a half bandwidth of W = (K + 1)/(2(N + 1)).
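As a computational sketch (our own illustration; the function names are ours), the sine tapers (2.13) and the resulting MT estimator (2.7) can be implemented directly with the FFT, and the orthonormality of the tapers can be verified numerically:

```python
import numpy as np

def sine_tapers(N, K):
    """K orthonormal sine tapers h_{t,k}, t = 1..N, following (2.13); N x K array."""
    t = np.arange(1, N + 1)
    k = np.arange(1, K + 1)
    return np.sqrt(2.0 / (N + 1)) * np.sin(np.outer(t, k + 1) * np.pi / (N + 1))

def multitaper(x, K):
    """Average of K direct (tapered) spectral estimates, at f_j = j/N."""
    N = len(x)
    H = sine_tapers(N, K)
    # Each column of H * x is a tapered series; (2.6) is its squared-modulus DFT
    eigenspectra = np.abs(np.fft.fft(H * x[:, None], axis=0)) ** 2
    return eigenspectra.mean(axis=1)     # the average (2.7)

N, K = 256, 5
H = sine_tapers(N, K)
assert np.allclose(H.T @ H, np.eye(K), atol=1e-10)   # tapers are orthonormal

rng = np.random.default_rng(1)
S_mt = multitaper(rng.standard_normal(N), K)
assert S_mt.shape == (N,) and np.all(S_mt >= 0)      # even, non-negative estimate
```

The orthonormality check reflects the discrete sine transform orthogonality underlying (2.13); averaging the K eigenspectra is exactly the step that trades resolution for variance as discussed above.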
2.2.3 Fourier frequencies
Spectral estimators are typically studied at the Fourier frequencies to deliver information
between the time domain and the frequency domain.
Definition 1 (see Brockwell and Davis [1991], p. 331). For an observed time series with sample size N and sampling interval Δ = 1, the set of Fourier frequencies is defined by
\[
\mathcal{F}_N = \left\{ f_j = \frac{j}{N} : -N/2 < j \le N/2, \; j \in \mathbb{Z} \right\},
\]
where Z denotes the set of all integers.
Several important spectral features are related to Fourier frequencies:
1. Complete information about the observed series x = (x_1, ..., x_N)^T is preserved by the corresponding values of J_X^(p)(f) at the Fourier frequencies [Brockwell and Davis, 1991]. Recall from (2.4) that J_X^(p)(f) = Σ_{s=1}^N (1/√N) x_s e^{−i2πsf} is the Fourier transform of x. An inverse discrete Fourier transform gives
\[
x_t = \frac{1}{N}\sum_{j: f_j \in \mathcal{F}_N} \left(\sum_{s=1}^{N} x_s\, e^{-i2\pi s f_j}\right) e^{i2\pi t f_j}
= \frac{1}{\sqrt N}\sum_{j: f_j \in \mathcal{F}_N} J_X^{(p)}(f_j)\, e^{i2\pi t f_j}, \quad \text{for } t = 1, 2, \ldots, N. \tag{2.14}
\]
Thus, one can write
\[
x = \sum_{j: f_j \in \mathcal{F}_N} J_X^{(p)}(f_j)\, e_j, \tag{2.15}
\]
where e_j = (1/√N)(e^{i2πf_j}, e^{i2π·2f_j}, ..., e^{i2πNf_j})^T for −N/2 < j ≤ N/2, j ∈ Z. The set of vectors {e_j : −N/2 < j ≤ N/2, j ∈ Z} forms an orthonormal basis spanning the N-dimensional complex vector space C^N, with inner product
\[
\langle e_j, e_{j'} \rangle = \frac{1}{N}\sum_{t=1}^{N} e^{i2\pi t(f_j - f_{j'})}
= \begin{cases} 1, & \text{if } j = j'; \\[6pt] \dfrac{1}{N}\, e^{i2\pi(f_j - f_{j'})}\, \dfrac{1 - e^{i2\pi N(f_j - f_{j'})}}{1 - e^{i2\pi(f_j - f_{j'})}} = 0, & \text{if } j \ne j'. \end{cases}
\]

2. It follows from (2.15) that the sample variance of the observed series x can be decomposed into periodogram values at the Fourier frequencies:
\begin{align*}
\hat\gamma_X^{(p)}(0) = \frac{1}{N}\,\|x\|_2^2 &= \frac{1}{N}\sum_{t=1}^{N} x_t\, x_t^* \\
&= \frac{1}{N}\sum_{t=1}^{N} \left(\frac{1}{\sqrt N}\sum_{j: f_j\in\mathcal F_N} J_X^{(p)}(f_j)\, e^{i2\pi t f_j}\right)\left(\frac{1}{\sqrt N}\sum_{j': f_{j'}\in\mathcal F_N} J_X^{(p)*}(f_{j'})\, e^{-i2\pi t f_{j'}}\right) \\
&= \frac{1}{N}\sum_{j: f_j\in\mathcal F_N}\;\sum_{j': f_{j'}\in\mathcal F_N} J_X^{(p)}(f_j)\, J_X^{(p)*}(f_{j'}) \cdot \frac{1}{N}\sum_{t=1}^{N} e^{i2\pi t(f_j - f_{j'})} \\
&= \frac{1}{N}\sum_{j: f_j\in\mathcal F_N} J_X^{(p)}(f_j)\, J_X^{(p)*}(f_j)
= \frac{1}{N}\sum_{j: f_j\in\mathcal F_N} \hat S^{(p)}(f_j).
\end{align*}
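This variance decomposition is easy to verify numerically; the following minimal sketch (an illustration of ours) checks that the biased sample variance equals the average of the periodogram over the N Fourier frequencies:

```python
import numpy as np

# Numerical check of property 2: gamma_hat^(p)(0) = ||x||^2 / N equals the
# average of the periodogram values over the Fourier frequencies f_j = j/N.
rng = np.random.default_rng(7)
N = 100
x = rng.standard_normal(N)

S_p = np.abs(np.fft.fft(x)) ** 2 / N      # periodogram at the Fourier frequencies
gamma0 = np.sum(x * x) / N                # biased sample variance

assert np.isclose(S_p.mean(), gamma0)
```

This is just Parseval's relation for the orthonormal basis {e_j} above, restated in terms of the periodogram.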
3. Applying the Fourier frequencies greatly simplifies the derivation of the asymptotic distribution of the periodogram. For example, the distribution of the periodogram of a Gaussian white noise process can be derived exactly at the Fourier frequencies [see Percival and Walden, 1993, pp. 220–222] from the uncorrelatedness and Gaussianity of the real and imaginary parts of J_X^(p)(f) at the non-negative Fourier frequencies.

4. For large enough N, the correlation between the periodogram values at different Fourier frequencies is approximately zero [Percival and Walden, 1993, p. 223]. The width of the central lobe of a spectral window is called the resolution bandwidth. The Fejér kernel (2.8) has a half resolution bandwidth of 1/N. Being spaced 1/N apart, the Fourier frequencies are considered to have a small effect on each other's periodogram values. However, for direct spectral estimators with non-trivial tapers, frequency grids other than the Fourier frequencies are required for approximate uncorrelatedness [Percival and Walden, 1993, p. 224].
See Percival and Walden [1993] for more properties at the Fourier frequencies. Although the above properties are presented for the periodogram, using the Fourier frequencies for MT estimates and further smoothing approaches allows us to have a unifying grid for inference and for making fair comparisons with existing methods in the literature.
2.3 Basis models for SDFs
Though MT-based estimation tends to outperform periodogram-based estimation in many respects, it is often still too noisy to be useful in practice. As shown in our simulation study in the next chapter, some important features of the true spectrum may still be concealed by the remaining noise in MT estimates. Alternative approaches that further smooth the preliminary spectral estimates are thus in demand. When the expression for the SDF of the true model is unknown, there is always a concern that we poorly estimate the SDF due to misspecifying a parametric model. In this case, nonparametric or semiparametric models are used to achieve consistent estimators of the SDF. One such popular approach is to model the SDF with a basis expansion, where the number of bases is allowed to increase with the sample size.
2.3.1 Literature review
The earliest basis method for estimating a univariate SDF dates back to the early 1970s. Bloomfield [1973] proposed the exponential (EXP) model for the spectral estimation of univariate time series, which was basically a Fourier expansion of the log-transformed SDF. The coefficients were estimated by maximizing the exact likelihood of the time series, while the Whittle likelihood approach [Whittle, 1953] was also mentioned but not implemented. The use of the EXP model has been widely expanded. Beran [1993] extended it to a fractional exponential (FEXP) model for long memory processes. Moulines and Soulier [1999], Hurvich and Brodsky [2001], and Narukawa and Matsuda [2011] discussed broadband estimation of the SDF of long memory time series by FEXP models, where 'broadband' refers to estimation over the complete frequency domain rather than being restricted to a narrow band around frequency zero. In general, James et al. [2013] argue that unpenalized basis models have overfitting issues when the number of bases is allowed to increase with the sample size; thus penalized models are often used to regularize the model fit and perform variable selection.
Penalized basis models for spectral estimation were first explored using spline basis functions. Adding an L2 'roughness' penalty to the sum of squared errors, Cogburn and Davis [1974] presented the theory of periodic smoothing splines and its application to SDF estimation. Approaches for smoothing parameter selection and SDF estimation are illustrated by Wahba and Wold [1975] and Wahba [1980]. The extension of L2 penalized methods was made by Pawitan and O'Sullivan [1994] using the Whittle likelihood based on the periodogram. These L2 methods all possess a penalty in the form of an integrated square of a certain-order derivative of the estimated function.
Another type of penalized model used for estimating the SDF employs wavelet thresholding. Gao [1993, 1997] and Moulin [1994] discussed wavelet shrinkage for spectral estimation based on the log periodogram. Assuming approximate Gaussianity of the log MT estimates, Walden et al. [1998] expanded the wavelet thresholding framework to get improved spectral estimates. Under various thresholding mechanisms, they showed that the model based on the MT estimates with sine tapers was able to substantially reduce the integrated root mean squared error of the fitted SDF compared to the periodogram-based models.
Most of the literature mentioned above is based on least squares approaches. We next present a general framework for spectral basis regression and review least squares approaches. The utility of the Whittle likelihood is relegated to the next chapter.
2.3.2 Basis models and least squares approaches
Basis methods for the estimation of an SDF typically assume that the log SDF can be expanded in a set of p basis functions {φ_l(f) : l = 1, ..., p} (e.g., Wahba [1980]). For each frequency f, letting φ(f) = (φ_1(f), ..., φ_p(f))^T and β = (β_1, ..., β_p)^T, we suppose that
\[
\log S(f) = \sum_{l=1}^{p} \phi_l(f)\,\beta_l = \phi^T(f)\,\beta. \tag{2.16}
\]
There are a variety of options for families of basis functions φ(f) that can be used for spectral estimation. For example, polynomial and Fourier bases can be used to capture global patterns [Bloomfield, 1973], smoothing splines allow for local and smooth patterns in the SDF [Cogburn and Davis, 1974, Wahba and Wold, 1975], and wavelet bases model local behaviour, while capturing second-order effects such as peaks, troughs, and cusps [Gao, 1993, Moulin, 1994, Gao, 1997, Walden et al., 1998]. When the SDF is spatially inhomogeneous, spatially adaptive bases, such as wavelets, have theoretical optimality properties (see, e.g., Donoho and Johnstone [1994, 1995], Sardy et al. [2004]). We can also combine families of basis functions to form dictionaries [see, e.g., Wasserman, 2006, Section 9.8]. For spectral estimation, the basis functions chosen should be capable of delineating the prominent features of the SDF such that a satisfactory fit can be achieved by parsimonious models.
To estimate the regression coefficients β in (2.16), least squares (LS) approaches are widely used in the literature. Distributional inference and tuning for LS spectral estimates always rely on approximate Gaussianity of the log periodogram or log MT estimators, which may not hold in practice. Since the periodogram is just a special case of an MT estimate with K = 1 taper, without loss of generality we summarize the general idea of LS approaches using only MT estimates below.
Recall from (2.11) that, asymptotically, Ŝ^(mt)(f) follows a scaled chi-squared distribution with 2K degrees of freedom for f ∈ (0, 1/2). Bartlett and Kendall [1946] show that the cumulants of the logarithm of a chi-squared random variable approximate the cumulants of a normal random variable as the chi-squared degrees of freedom increase. The approximate normality "may be safely used" when the chi-squared degrees of freedom are 10 or greater [Bartlett and Kendall, 1946]; however, the convergence is slow. Walden et al. [1998] adapt the normal approximation to the random variable log{Ŝ^(mt)(f)/S(f)} when K ≥ 5, and state the following.

Proposition 2 (Walden et al. [1998], Section II). For K ≥ 5, the distribution of the random variable log{Ŝ^(mt)(f)/S(f)} agrees well with a normal distribution with the same mean and variance. Approximately,
\[
\log \hat S^{(mt)}(f) \sim N\big(\log S(f) + \psi(K) - \log K,\; \psi'(K)\big), \quad \text{for } 0 < f < 1/2,
\]
where ψ(α) = d(ln Γ(α))/dα is called the digamma function, and ψ′(α) = dψ(α)/dα is called the trigamma function, with the gamma function Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du.
Note: The argument used for proving Proposition 2 was based on a heuristic application of Bartlett and Kendall [1946] for large K with the first and second cumulants of the random variable log{Ŝ^(mt)(f)/S(f)}, namely E[log{Ŝ^(mt)(f)/S(f)}] = ψ(K) − log K and Var[log{Ŝ^(mt)(f)/S(f)}] = ψ′(K) for 0 < f < 1/2, rather than a rigorous proof. In practice, the number of tapers K is usually fixed and may not be very large. Thus, LS estimation based on Proposition 2 could lead to suboptimal solutions because the log SDF estimators are not close to being normally distributed. This issue motivates using the Whittle likelihood as an alternative framework for more reliable spectral estimation, including our L1 penalized MT-Whittle method proposed in the next chapter.
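The two cumulants just quoted can be checked by simulation; the sketch below (an illustration of ours, using scipy's digamma and trigamma) draws from the limiting χ²_{2K}/(2K) distribution of Ŝ^(mt)(f)/S(f) and compares the sample moments of its logarithm with ψ(K) − log K and ψ′(K):

```python
import numpy as np
from scipy.special import digamma, polygamma

# If S_mt(f)/S(f) ~ chi^2_{2K} / (2K) asymptotically, then log{S_mt(f)/S(f)}
# has mean psi(K) - log(K) and variance psi'(K) (the trigamma function).
K = 5
rng = np.random.default_rng(3)
ratio = rng.chisquare(2 * K, size=200_000) / (2 * K)   # draws of S_mt(f)/S(f)
logratio = np.log(ratio)

assert np.isclose(logratio.mean(), digamma(K) - np.log(K), atol=5e-3)
assert np.isclose(logratio.var(), polygamma(1, K), atol=5e-3)
```

For K = 5 the mean correction ψ(K) − log K ≈ −0.10 is clearly non-negligible, which is why the bias adjustment below is applied before any LS fitting.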
Define Y(f) = log Ŝ^(mt)(f) − ψ(K) + log K. It then follows from Proposition 2 that
\[
Y(f) \sim N\big(\log S(f),\; \psi'(K)\big), \quad \text{for } 0 < f < 1/2.
\]
With the log SDF model (2.16), one can write, for 0 < f < 1/2, that
\[
Y(f) = \phi^T(f)\,\beta + \xi(f), \quad \text{with } \xi(f) \sim N(0, \psi'(K)). \tag{2.17}
\]
Using the observed series x, we obtain a sample of Y(f) by evaluating the MT spectral estimates Ŝ^(mt)(f) on the set of M = ⌈N/2⌉ − 1 non-zero, non-Nyquist (i.e., not equal to 1/2) Fourier frequencies
\[
\left\{ f_j = \frac{j}{N} : j = 1, \ldots, M \right\}. \tag{2.18}
\]
Denote by Y the vector with jth element Y(f_j) = log Ŝ^(mt)(f_j) − ψ(K) + log K for j = 1, ..., M. Choosing a specific basis representation, let Φ denote the M × p design matrix of basis functions evaluated at the frequencies (2.18), with jth row equal to φ^T(f_j) for j = 1, ..., M. Then a sampled version of (2.17) writes
\[
Y = \Phi\beta + \xi,
\]
where the error vector ξ = (ξ(f_1), ..., ξ(f_M))^T is assumed to follow an M-dimensional Gaussian distribution with mean 0 and covariance matrix Σ_ξ. The covariance structure of ξ and its effect on wavelet thresholding are discussed in Walden et al. [1998].
The unpenalized LS method finds β̃^(LS) ∈ R^p that minimizes
\[
\sum_{j=1}^{M} \left(Y(f_j) - \phi^T(f_j)\beta\right)^2 = (Y - \Phi\beta)^T (Y - \Phi\beta). \tag{2.19}
\]
The closed-form solution is then
\[
\tilde\beta^{(LS)} = \left(\Phi^T\Phi\right)^{-} \Phi^T Y,
\]
where (Φ^TΦ)^− denotes the generalized inverse of Φ^TΦ. The fitted SDF is calculated at each frequency using
\[
\tilde S^{(LS)}(f) = \exp\left\{\phi^T(f)\,\tilde\beta^{(LS)}\right\}, \quad f \in (0, 1/2).
\]
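A minimal numerical sketch of this unpenalized LS fit is given below; the cosine basis and the synthetic log spectral estimates Y are illustrative assumptions of ours, not the setup used later in the dissertation:

```python
import numpy as np

# Unpenalized LS fit of the log SDF, equation (2.19): design matrix Phi,
# generalized-inverse solution, and exponentiation to get the fitted SDF.
rng = np.random.default_rng(11)
N = 256
M = -(-N // 2) - 1                  # M = ceil(N/2) - 1 non-zero, non-Nyquist frequencies
f = np.arange(1, M + 1) / N         # f_j = j/N, j = 1, ..., M

p = 8
Phi = np.cos(2 * np.pi * np.outer(f, np.arange(p)))   # M x p cosine basis (illustrative)

true_logS = 1.0 + 0.8 * np.cos(2 * np.pi * f)         # a smooth log SDF in the span of Phi
Y = true_logS + rng.normal(0.0, 0.2, size=M)          # noisy "log MT estimates"

beta_ls = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ Y     # (Phi^T Phi)^- Phi^T Y
S_fit = np.exp(Phi @ beta_ls)                         # fitted SDF on (0, 1/2)

assert S_fit.shape == (M,) and np.all(S_fit > 0)
assert np.mean((Phi @ beta_ls - true_logS) ** 2) < 0.05
```

Exponentiating the fitted log SDF guarantees a positive spectral estimate, one of the practical attractions of modeling log S(f) rather than S(f) itself.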
The unpenalized LS approach is equivalent to maximizing a Gaussian likelihood under the assumption of independence between the MT estimates at the Fourier frequencies. However, the MT estimators at different frequencies are correlated [see Thomson, 1982, for details]: with a locally slowly varying spectrum, for f close to f′,
\[
\mathrm{Cov}\left\{\hat S^{(mt)}(f), \hat S^{(mt)}(f')\right\} \approx S^2(f) \cdot \frac{1}{K^2}\sum_{k=1}^{K}\sum_{l=1}^{K} \left|\sum_{t=1}^{N} h_{t,k}\, h_{t,l}\, e^{i2\pi(f'-f)t}\right|^2,
\]
for 0 < f < f′ < 1/2. Thus, we can argue that the LS optimization (2.19) is based on a quasi-likelihood function rather than a real likelihood. See Wedderburn [1974], McCullagh and Nelder [1999, Ch. 9] and the next chapter for more information about quasi-likelihood functions.
A nonparametric basis expansion allows the number of bases p to increase with the sample size M. The statistical problem then becomes how to enforce smoothness and sparsity by selecting the basis functions and estimating the model parameters so that we adequately estimate the SDF without overfitting, while retaining computational efficiency as the sample size increases. Popular approaches in the literature for regularizing the least squares fit include adding penalties of the L1 type and the L2 type. We summarize below representative models of both types and use them for comparison in later chapters.
Typically, L1 penalized approaches use wavelet thresholding (see e.g., Walden et al. [1998]), while the L2 penalized methods use smoothing splines (see e.g., Cogburn and Davis [1974], Wahba and Wold [1975]).
2.3.3 Wavelet soft-thresholding
The representative L1 penalized approach for spectral estimation uses wavelet soft-thresholding (see e.g., Walden et al. [1998]).
The first step in wavelet soft-thresholding is the discrete wavelet transform (DWT). For an M = 2^q dimensional vector Y, the DWT matrix is defined as an M × M matrix W_q whose rows contain circularly shifted wavelet and scaling filters at different scales. (See Percival and Walden [2000] for a comprehensive review.) The DWT calculates β̃^(W) = W_q Y, with the result in the form
\[
\tilde\beta^{(W)} = \left[\tilde\beta^{(W)}_{1,1}, \tilde\beta^{(W)}_{1,2}, \ldots, \tilde\beta^{(W)}_{1,M/2}, \tilde\beta^{(W)}_{2,1}, \tilde\beta^{(W)}_{2,2}, \ldots, \tilde\beta^{(W)}_{2,M/4}, \ldots\ldots, \tilde\beta^{(W)}_{q-1,1}, \tilde\beta^{(W)}_{q-1,2}, \tilde\beta^{(W)}_{q,1}, \tilde\alpha^{(W)}_{q+1}\right]^T,
\]
where {β̃^(W)_{j′,1}, β̃^(W)_{j′,2}, ..., β̃^(W)_{j′,M/2^{j′}}} is called the set of wavelet coefficients associated with scale 2^{j′−1}, for levels j′ = 1, ..., q; and α̃^(W)_{q+1} = 1^T Y/√M is called the scaling coefficient.
The DWT matrix W_q is not always of interest and thus does not usually need to be computed explicitly. The wavelet coefficients β̃^(W) can be obtained by q recursive applications of the pyramid algorithm [Mallat, 1989]. Each iteration of the pyramid algorithm calculates the set of wavelet coefficients for one level, in the order of levels j′ from 1 to q. If the iterations stop at level q_0 < q, then the result is said to be a q_0th-order partial DWT, which obtains
\[
\tilde\beta^{(W_{q_0})} = \Big[\tilde\beta^{(W_{q_0})}_{1,1}, \tilde\beta^{(W_{q_0})}_{1,2}, \ldots, \tilde\beta^{(W_{q_0})}_{1,M/2}, \ldots\ldots, \tilde\beta^{(W_{q_0})}_{q_0,1}, \tilde\beta^{(W_{q_0})}_{q_0,2}, \ldots, \tilde\beta^{(W_{q_0})}_{q_0,M/2^{q_0}}, \tilde\alpha^{(W_{q_0})}_{q_0+1,1}, \tilde\alpha^{(W_{q_0})}_{q_0+1,2}, \ldots, \tilde\alpha^{(W_{q_0})}_{q_0+1,M/2^{q_0+1}}, \ldots\ldots, \tilde\alpha^{(W_{q_0})}_{q-1,1}, \tilde\alpha^{(W_{q_0})}_{q-1,2}, \tilde\alpha^{(W_{q_0})}_{q,1}, \tilde\alpha^{(W_{q_0})}_{q+1}\Big]^T,
\]
where {β̃^(W_{q_0})_{j′,1}, β̃^(W_{q_0})_{j′,2}, ..., β̃^(W_{q_0})_{j′,M/2^{j′}}} = {β̃^(W)_{j′,1}, β̃^(W)_{j′,2}, ..., β̃^(W)_{j′,M/2^{j′}}} are the calculated wavelet coefficients for levels j′ = 1, ..., q_0, and the {α̃^(W_{q_0})_{j′′,1}, α̃^(W_{q_0})_{j′′,2}, ...} remain as scaling coefficients for levels j′′ = q_0 + 1, ..., q.
Denote by W_{q_0} the q_0th-order partial DWT matrix, an M × M matrix that can be used to calculate β̃^(W_{q_0}) via β̃^(W_{q_0}) = W_{q_0} Y [Percival and Walden, 2000]. The first M − M/2^{q_0} rows of W_{q_0} are M-dimensional discrete wavelet bases at the finest q_0 levels, and the other M/2^{q_0} rows of W_{q_0} are the associated M-dimensional discrete scaling bases at the roughest level. These bases are orthonormal, so that W_{q_0}^T W_{q_0} = I_M.
The partial DWT performs exactly an unpenalized LS approach. With design matrix Φ = W_{q_0}^T, the coefficients β̃^(W_{q_0}) can be expressed in the form of the LS coefficient estimates, with
\[
\tilde\beta^{(W_{q_0})} = W_{q_0} Y = \left(W_{q_0} W_{q_0}^T\right)^{-1} W_{q_0} Y,
\]
which can be equivalently obtained by minimizing (2.19). Note that the full DWT is a special case of the partial DWT with q_0 = q.
Upon calculating the wavelet coefficients, the next step is to shrink the coefficients. With a threshold a ≥ 0, the soft-thresholding function is defined as
\[
\mathrm{ST}(x, a) = \mathrm{Sign}(x)\,\max\{|x| - a, 0\}. \tag{2.20}
\]
Walden et al. [1998] apply the universal threshold a^{univ} = √(ψ′(K)) · √(2 log M) for soft-thresholding the wavelet coefficients calculated from the log MT estimates. Recall that ψ′(K) is the approximate variance of Y(f) for 0 < f < 1/2. With the orthonormal bases W_{q_0}^T, ψ′(K) is also the approximate variance of the wavelet coefficients in β̃^(W_{q_0}).
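The soft-thresholding rule (2.20) and the effect of the universal threshold can be sketched as follows; the toy orthonormal setting (identity basis, so the empirical coefficients are just signal plus noise) is an illustration of ours, not the wavelet setup above:

```python
import numpy as np

def soft_threshold(x, a):
    """ST(x, a) = Sign(x) * max(|x| - a, 0), applied elementwise; see (2.20)."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

M = 1024
sigma = 1.0                                 # known noise standard deviation
rng = np.random.default_rng(5)

coef = np.zeros(M)
coef[:10] = 8.0                             # sparse signal in coefficient space
noisy = coef + rng.normal(0.0, sigma, M)    # empirical coefficients = signal + noise

a_univ = sigma * np.sqrt(2 * np.log(M))     # universal threshold
shrunk = soft_threshold(noisy, a_univ)

# Nearly all pure-noise coefficients are set exactly to zero, while the
# large signal coefficients survive (each shrunk toward zero by a_univ).
assert np.mean(shrunk[10:] == 0.0) > 0.99
assert np.all(shrunk[:10] > 0.0)
```

The same `soft_threshold` function, applied with a^{univ} = √(ψ′(K))·√(2 log M) to the wavelet coefficients of the log MT estimates, reproduces the shrinkage step described above.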
Universal Threshold and Minimax Threshold The universal threshold was originally introduced for wavelet thresholding, and can be generalized to the L1 penalized LS method for function estimation with orthogonal bases [Donoho and Johnstone, 1994].
Consider data of the form
\[
y_j = g(f_j) + \xi_j,
\]
with f_j = j/N, j = 1, ..., M, where g = {g(f_j)}_{j=1}^M is the noiseless data from the target function g(f), and ξ = (ξ_j)_{j=1}^M is a white noise with elements independently distributed as N(0, σ_ξ²).
When an M × M orthonormal design matrix Φ is applied to model g = Φβ, the empirical coefficients can be calculated by
\[
\tilde\beta = \left(\Phi^T\Phi\right)^{-1}\Phi^T y = \Phi^T y.
\]
Inversely, y = Φβ̃. Thus, since Φ is an orthogonal matrix, the empirical coefficients are
\[
\tilde\beta = \Phi^T g + \Phi^T \xi.
\]
Letting ϑ = Φ^T g and ι = Φ^T ξ, the empirical coefficients can be represented as
\[
\tilde\beta = \vartheta + \iota.
\]
In the above equation, ϑ = Φ^T g is the vector of coefficient estimates based on the noiseless data, and ι = Φ^T ξ is a white noise, since Φ is an orthonormal matrix.
In usual situations, the contribution of the basis coefficients to the noiseless signal is sparse. That is, only a small number of empirical basis coefficients contribute to both the signal and the noise; the other basis coefficients contribute only to the noise, with variance σ_ξ². The choice of the universal threshold is based on the following two facts:
1. For the performance measure defined as Sc(ĝ, g) = M^{−1} E‖ĝ − g‖²_{2,M}, the minimax threshold has the smallest risk bound. Based only on the data, the minimax threshold gives the best mimic of the performance of the 'oracle'. The universal threshold is a simple alternative to the minimax threshold, and it is also the asymptotically optimal threshold. (See Donoho and Johnstone [1994] for details.)

2. For the white noise ι = Φ^T ξ, whose elements ι_j are independent and identically distributed as N(0, σ_ξ²), we have
\[
\Pr\left\{\max_j |\iota_j| > \sigma_\xi \sqrt{2\log M}\right\} \to 0, \quad \text{as } M \to \infty.
\]

Thus, by soft-thresholding with the universal threshold a^{univ} = σ_ξ · √(2 log M), the bases with no contribution to the signal will have coefficient estimates that are close to zero with high probability.

The universal threshold referred to in the literature typically takes the value σ_ξ √(2 log M), since the threshold is applied to the full set of DWT bases with p = M. However, if a more general design matrix of p < M predictors is used, then it would be better to adjust the threshold value by replacing M with p, based on an argument similar to the second fact mentioned above.

Back to wavelet soft-thresholding: with the universal threshold a^{univ} = √(ψ′(K)) · √(2 log p),
the thresholded wavelet coefficients can be obtained as
\[
\hat\beta^{(W_{q_0})}_{j',k'} = \mathrm{ST}\left(\tilde\beta^{(W_{q_0})}_{j',k'},\; a^{univ}\right),
\]
with k′ = 1, ..., M/2^{j′} for levels j′ = 1, ..., q_0. (See Walden et al. [1998].)
Taking the inverse partial DWT, one can obtain the fitted log SDF as
\[
\hat Y = W_{q_0}^{-1}\,\hat\beta^{(W_{q_0})} = W_{q_0}^T\,\hat\beta^{(W_{q_0})}.
\]
Note: Without thresholding, the result W_{q_0}^T β̃^(W_{q_0}) corresponds to the fitted values based on the unpenalized LS method for wavelet basis regression.
The lasso and soft-thresholding There is a close relationship between soft-thresholding and the lasso. The term 'lasso' stands for 'least absolute shrinkage and selection operator' [Tibshirani, 1996]. With the tuning parameter ω > 0, a lasso problem for spectral estimation minimizes the L1 penalized LS criterion
\[
\frac{1}{2}\sum_{j=1}^{M} \left(Y(f_j) - \phi^T(f_j)\beta\right)^2 + \omega \sum_{l=1}^{p} |\beta_l|. \tag{2.21}
\]
Both the lasso and soft-thresholding execute shrinkage, with some coefficient estimates shrunk exactly to zero. Bühlmann and van de Geer [2011] show that, under an orthonormal design with p = M (i.e., the number of predictors equal to the number of data points), the solution to the lasso problem (2.21) is the soft-thresholded LS estimate with threshold a = ω. For example, the wavelet soft-thresholding in Walden et al. [1998] with the full DWT and universal threshold a^{univ} is equivalent to solving the lasso problem that minimizes