
Signal Processing 84 (2004) 1255–1266 www.elsevier.com/locate/sigpro

Kernel smoothing of periodograms under Kullback–Leibler discrepancy

Jan Hannig, Thomas C.M. Lee∗

Department of Statistics, Colorado State University, Fort Collins, CO 80523-1877, USA

Received 5 June 2003; received in revised form 31 March 2004

Abstract

Kernel smoothing of the periodogram is a popular nonparametric method for spectral density estimation. Most important in the implementation of this method is the choice of the bandwidth, or span, for smoothing. One idealized way of choosing the bandwidth is to choose it as the one that minimizes the Kullback–Leibler (KL) discrepancy between the smoothed estimate and the true spectrum. However, this method fails in practice, as the KL discrepancy is an unknown quantity. This paper introduces an estimator for this discrepancy, so that the bandwidth that minimizes the unknown discrepancy can be empirically approximated via the minimization of this estimator. It is shown that this discrepancy estimator is consistent. Numerical results also suggest that this empirical choice of bandwidth often outperforms some other commonly used bandwidth choices. The same idea is also applied to choose the bandwidth for log-periodogram smoothing. © 2004 Elsevier B.V. All rights reserved.

Keywords: Bandwidth selection; Kullback–Leibler discrepancy; Periodogram and log-periodogram smoothing; Relative entropy; Spectral density estimation

1. Introduction

Smoothing the periodogram or the log-periodogram is a popular method for performing nonparametric spectral density estimation. Various approaches have been proposed. These include spline smoothing (e.g., [9,13,19]), kernel smoothing (e.g., [4,10,11,18]) and wavelet techniques (e.g., [5,15,20]). The approach that this paper considers is kernel smoothing. Some appealing features of this approach are that it is simple to use, easy to understand and straightforward to interpret.

One important component of the kernel smoothing approach is the choice of the bandwidth for smoothing. This paper proposes a bandwidth selection method that attempts to locate the bandwidth that minimizes the unknown Kullback–Leibler (KL) discrepancy between the smoothed estimate and the true spectrum. The idea behind the proposed method is as follows. An estimator for the unknown KL discrepancy is first constructed for the class of spectrum estimates (i.e., kernel-smoothed periodograms) that this paper considers. It is shown that this discrepancy estimator is consistent under a commonly used model for periodograms. Then the bandwidth that minimizes this discrepancy estimator is chosen as the final bandwidth. As mentioned in [14], the rationale is that the bandwidth that minimizes the discrepancy estimator should also approximately minimize the unknown discrepancy.

∗ Corresponding author. Tel.: +1-970-491-5269; fax: +1-970-491-7895.
E-mail addresses: [email protected] (J. Hannig), [email protected] (T.C.M. Lee).

0165-1684/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2004.04.007

Also, Chow [3, pp. 293–294] argues that the KL discrepancy is a better distance measure than the more commonly used squared distance measure. This point will be elaborated more later.

The rest of this paper is organized as follows. Section 2 provides some background material on the periodogram and the KL discrepancy. The proposed KL discrepancy estimator, together with some related previous work, are presented in Section 3. Section 4 briefly discusses the smoothing of the log-periodogram. Some empirical and theoretical properties of the proposed method are reported in Sections 5 and 6, respectively. Conclusions are offered in Section 7, while technical details are deferred to the appendix.

2. Background

2.1. Periodogram smoothing

Suppose that $\{x_t\}$ is a real-valued, zero mean stationary process with unknown spectral density $f$, and that a finite-sized realization $x_0,\dots,x_{2n-1}$ of $\{x_t\}$ is observed. The goal is to estimate $f$ by using those observed $x_t$'s. The periodogram is defined as
\[
I(\omega)=\frac{1}{2\pi\times 2n}\left|\sum_{t=0}^{2n-1}x_t\exp(-\mathrm{i}\omega t)\right|^2,\qquad
\mathrm{i}=\sqrt{-1},\ \omega\in[0,2\pi).
\]
To simplify notation, write $\omega_j=2\pi j/(2n)$, $f_j=f(\omega_j)$ and $I_j=I(\omega_j)$. Since the spectral density $f$ is symmetric about $\omega=\pi$, in the sequel the focus will be on $f_j$ for $j=0,\dots,n-1$. Also, as $f$ is periodic with period $2\pi$, one has $f_{-j}=f_j$ and $I_{-j}=I_j$ for $j=1,\dots,n-1$.

A frequently adopted model for $I_j$ is (e.g., see [4,10,15,17])
\[
I_j=f_j\varepsilon_j,\qquad j=0,\dots,n-1, \tag{1}
\]
where the $\varepsilon_j$'s are independent standard exponential random variables. Therefore $E(I_j)=f_j$ and $\mathrm{var}(I_j)=f_j^2$. Due to its unacceptably large variance, $I_j$ is seldom used as an estimate of $f_j$.

One possible way of obtaining better estimates for $f_j$ is to smooth the $I_j$'s. This paper considers the following kernel estimator for $f_j$:
\[
\hat f_j=\sum_{m=-n}^{2n-1}K_h(\omega_m-\omega_j)I_m\Big/\sum_{l=-n}^{2n-1}K_h(\omega_l-\omega_j),\qquad
j=0,\dots,n-1. \tag{2}
\]
In the above $K_h(\cdot)=\frac{1}{h}K(\frac{\cdot}{h})$, where the kernel function $K$ is (usually taken as) a symmetric probability density function and the bandwidth $h$ is a nonnegative smoothing parameter that controls the amount of smoothing. Note that $\hat f_j$ is a function of $h$ but, for simplicity, this dependence is suppressed from its notation. It is well known that the choice of $h$ is much more crucial than the choice of $K$ (e.g., see [22]). Also, in most other kernel smoothing problems the limits of the two summations in (2) are $0$ and $n-1$. However, since in the present setting boundary effects can be handled by periodic smoothing, the limits are changed from $0$ and $n-1$ to $-n$ and $2n-1$, respectively.

The estimator $\hat f_j$ can also be interpreted as a weighted average of the $I_j$'s, because one could write
\[
\hat f_j=\sum_{m=-n}^{2n-1}W_{m-j}I_m
\quad\text{with}\quad
W_{m-j}=\frac{K_h(\omega_m-\omega_j)}{\sum_{l=-n}^{2n-1}K_h(\omega_l-\omega_j)}. \tag{3}
\]
Notice that the weights $W_m$'s sum to unity.
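For concreteness, the periodogram and the normalized-kernel estimate (2)–(3) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the Epanechnikov kernel is one possible choice of $K$, and the circular frequency distance below is one way of implementing the periodic smoothing described above.

```python
import numpy as np

def periodogram(x):
    """I(w_j) = |sum_t x_t exp(-i w_j t)|^2 / (2*pi*2n) at the Fourier
    frequencies w_j = 2*pi*j/(2n); len(x) plays the role of 2n."""
    two_n = len(x)
    return np.abs(np.fft.fft(x)) ** 2 / (2 * np.pi * two_n)

def epanechnikov(u):
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def kernel_smooth(I, h, K=epanechnikov):
    """Eqs. (2)/(3): f_hat_j = sum_m W_{m-j} I_m with weights W_{m-j}
    summing to one; wrapping distances around the circle handles the
    boundary, mimicking the extended summation limits -n,...,2n-1."""
    two_n = len(I)
    n = two_n // 2
    w = 2.0 * np.pi * np.arange(two_n) / two_n
    f_hat = np.empty(n)
    for j in range(n):
        d = np.angle(np.exp(1j * (w - w[j])))  # periodic frequency distance
        W = K(d / h) / h
        W /= W.sum()                           # weights sum to unity
        f_hat[j] = W @ I
    return f_hat
```

For Gaussian white noise the true spectrum is the constant $1/(2\pi)\approx 0.159$, which the smoothed estimate should roughly track, while the raw periodogram ordinates fluctuate with standard deviation equal to their mean.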

2.2. Why KL discrepancy?

The KL discrepancy for measuring the distance between two probability density functions (pdfs) $g_1(x)$ and $g_2(x)$ is defined as
\[
d(g_1,g_2)=\int g_1(x)\log\frac{g_1(x)}{g_2(x)}\,\mathrm{d}x
\]
(e.g., see [2]). Note that $d(g_1,g_2)\neq d(g_2,g_1)$.
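When $g_1$ and $g_2$ are exponential densities, as they will be under model (1), $d(g_1,g_2)$ has the closed form $r-\log r-1$ with $r=\mu_1/\mu_2$ the ratio of the means. The snippet below (a sketch, with arbitrary illustrative means) checks this against direct numerical integration, and also illustrates the asymmetry $d(g_1,g_2)\neq d(g_2,g_1)$.

```python
import numpy as np

def kl_exponential_numeric(mu1, mu2, upper=200.0, num=400000):
    """d(g1, g2) = int g1 log(g1/g2) dx for exponential pdfs with
    means mu1 and mu2, computed by a plain Riemann sum."""
    x = np.linspace(1e-8, upper, num)
    dx = x[1] - x[0]
    g1 = np.exp(-x / mu1) / mu1
    g2 = np.exp(-x / mu2) / mu2
    return np.sum(g1 * np.log(g1 / g2)) * dx

mu1, mu2 = 2.5, 1.7            # illustrative means, not from the paper
r = mu1 / mu2
closed_form = r - np.log(r) - 1.0     # closely agrees with the integral
numeric = kl_exponential_numeric(mu1, mu2)
```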
In order to use $d(g_1,g_2)$ for comparing a true spectrum $f$ and an estimate $\hat f$, one needs to compare them frequency by frequency. At frequency $\omega_j$, the pdf $g_f(x)$ corresponding to $f$ is, under model (1), an exponential distribution with mean $f_j$; that is, $g_f(x)=\exp(-x/f_j)/f_j$, $x>0$. For $\hat f$, a natural candidate for the corresponding pdf is exponential with mean $\hat f_j$. Denote this pdf as $g_{\hat f}(x)$, and thus $g_{\hat f}(x)=\exp(-x/\hat f_j)/\hat f_j$, $x>0$. We choose to measure the distance between $f$ and $\hat f$ at frequency $\omega_j$ with
\[
d(g_{\hat f},g_f)=\int \log\frac{\exp(-x/\hat f_j)/\hat f_j}{\exp(-x/f_j)/f_j}\,
\frac{\exp(-x/\hat f_j)}{\hat f_j}\,\mathrm{d}x
=\frac{\hat f_j}{f_j}-\log\frac{\hat f_j}{f_j}-1.
\]
One could also use $d(g_f,g_{\hat f})$ instead of $d(g_{\hat f},g_f)$, but for the reason given in Appendix A we choose $d(g_{\hat f},g_f)$. Therefore our overall KL discrepancy for measuring the distance between the whole spectrum $f$ and $\hat f$ is
\[
\Delta_{\mathrm{KL}}(\hat f,f)=\frac{1}{n}\sum_{j=0}^{n-1}
\left\{\frac{\hat f_j}{f_j}-\log\frac{\hat f_j}{f_j}-1\right\}.
\]
Our target is to choose the $h$ that minimizes $\Delta_{\mathrm{KL}}(\hat f,f)$. Notice that $\Delta_{\mathrm{KL}}(\hat f,f)$ is an unknown quantity, so direct minimization is not possible.

A more frequently used discrepancy measure, based on the $L_2$ norm, is
\[
\Delta_{L_2}(f,\hat f)=\frac{1}{n}\sum_{j=0}^{n-1}(f_j-\hat f_j)^2.
\]
However, as discussed in [3], $\Delta_{\mathrm{KL}}(\hat f,f)$ is often a more relevant measure than $\Delta_{L_2}(f,\hat f)$, because the former considers the distance for the whole distribution while the latter only considers the mean. Also, since $\Delta_{\mathrm{KL}}(\hat f,f)$ has the same form as the theoretical deviance, it utilizes the asymptotic likelihood of the periodogram. Therefore, bandwidth selection methods that are targeting $\Delta_{\mathrm{KL}}(\hat f,f)$ (or $E\{\Delta_{\mathrm{KL}}(\hat f,f)\}$) seem to be more desirable than those that are targeting $\Delta_{L_2}(f,\hat f)$ (or $E\{\Delta_{L_2}(f,\hat f)\}$).

3. Estimating the KL discrepancy

This section presents the main contribution of this paper, namely, a consistent estimator for $\Delta_{\mathrm{KL}}(\hat f,f)$. This estimator is denoted as $\hat\lambda_{h,k}$, and it is given by
\[
\hat\lambda_{h,k}=\frac{1}{n}\sum_{j=0}^{n-1}
\left[\frac{2k\bigl(\hat f_j-\sum_{m=-k}^{k}W_mI_{j+m}\bigr)}{\sum_{m=-k}^{k}I_{j+m}}
+\sum_{m=-k}^{k}W_m-\log\frac{\hat f_j}{I_j}+\gamma-1\right],
\]
where $\gamma\approx 0.577216$ is Euler's constant and $k$ is a pre-specified positive integer parameter whose value is chosen independent of $n$. It will be shown in Theorem 1 that its effect is asymptotically negligible. We propose choosing $h$ as the minimizer of $\hat\lambda_{h,k}$. Empirical properties of our estimator are reported in Section 5, while its consistency is established in Section 6.

3.1. Construction of $\hat\lambda_{h,k}$

This subsection outlines the construction of $\hat\lambda_{h,k}$. The goal is to seek an unbiased estimator for $\Delta_{\mathrm{KL}}(\hat f,f)$. First realize that estimating $\Delta_{\mathrm{KL}}(\hat f,f)$ is equivalent to estimating the two quantities $\log(\hat f_j/f_j)$ and $\hat f_j/f_j$ for all $j$. We estimate the former by $\log(\hat f_j/I_j)-\gamma$, as $E\{\log(\hat f_j/I_j)-\gamma\}=E\{\log(\hat f_j/f_j)\}$. The latter, due to the presence of $1/f_j$, poses a bigger challenge, because under model (1) $E(1/I_j)=\infty$, and so $\hat f_j/I_j$ cannot be used as a building block for estimating $\hat f_j/f_j$. To overcome this difficulty, we make use of the fact that if $f$ is locally smooth, then $f_{j-k},\dots,f_{j+k}$ are (approximately) identical for small $k$. This implies that the periodogram ordinates $I_{j-k},\dots,I_{j+k}$ are approximately independent and identically distributed as exponential with mean $f_j$. Therefore we have $E\{1/(\sum_{m=-k}^{k}I_{j+m})\}\approx 1/(2kf_j)$, and hence $1/f_j$ can be estimated by $2k/\sum_{m=-k}^{k}I_{j+m}$. Thus, there are two major advantages of introducing $k$: (1) it overcomes the problem of the infinite expectation $E(1/I_j)=\infty$, and (2) it can be treated as a device for controlling the bias and variance of our estimator for $1/f_j$.

To proceed we decompose $\hat f_j/f_j$ into two parts:
\[
\frac{\hat f_j}{f_j}=\sum_{m=-k}^{k}W_m\,\frac{I_{j+m}}{f_j}
+\frac{\hat f_j-\sum_{m=-k}^{k}W_mI_{j+m}}{f_j}.
\]
We estimate the first part by its expectation, i.e., $\sum_{m=-k}^{k}W_m$.

Since the numerator of the second part and $\sum_{m=-k}^{k}I_{j+m}$ are independent, and since $E(2k/\sum_{m=-k}^{k}I_{j+m})\approx 1/f_j$, we estimate the second part by $2k(\hat f_j-\sum_{m=-k}^{k}W_mI_{j+m})/(\sum_{m=-k}^{k}I_{j+m})$. Finally, by replacing $\log(\hat f_j/f_j)$ and $\hat f_j/f_j$ in the expression of $\Delta_{\mathrm{KL}}(\hat f,f)$ with these estimates, we obtain $\hat\lambda_{h,k}$.

3.2. Related work

A cross-validatory bandwidth selection criterion called CVLL has been studied by [1,7,8]. In particular, it is shown in [8] that CVLL is asymptotically equivalent to $E\{\Delta_{\mathrm{KL}}(f,\hat f)\}$ (not $E\{\Delta_{\mathrm{KL}}(\hat f,f)\}$). However, it seems that $\Delta_{\mathrm{KL}}(\hat f,f)$ (or $\Delta_{\mathrm{KL}}(f,\hat f)$) is to be preferred to $E\{\Delta_{\mathrm{KL}}(\hat f,f)\}$ (or $E\{\Delta_{\mathrm{KL}}(f,\hat f)\}$), because $\Delta_{\mathrm{KL}}(\hat f,f)$ measures the distance between $f$ and $\hat f$ for the data set at hand, rather than for the average over all possible data sets.

More recently, [16] propose a generalized cross-validation (GCV)-based bandwidth selection method that also targets $\Delta_{\mathrm{KL}}(f,\hat f)$. The idea is to estimate $\Delta_{\mathrm{KL}}(f,\hat f)$ by cross-validating a gamma deviance function. However, no theoretical results are presented.
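The criterion assembled in Section 3.1 translates directly into code. The sketch below reuses the circular Epanechnikov weights from the earlier periodogram snippet (an illustrative choice, not prescribed by the paper) and evaluates $\hat\lambda_{h,k}$ on a full periodogram of length $2n$, picking $h$ by grid search.

```python
import numpy as np

GAMMA = 0.5772156649015329     # Euler's constant

def lambda_hat(I, h, k=5):
    """KL-discrepancy estimate lambda_hat_{h,k} for the kernel-smoothed
    periodogram at bandwidth h (a sketch; periodic indexing plays the
    role of the extended summation limits -n,...,2n-1)."""
    two_n = len(I)
    n = two_n // 2
    w = 2.0 * np.pi * np.arange(two_n) / two_n
    total = 0.0
    for j in range(n):
        d = np.angle(np.exp(1j * (w - w[j])))          # circular distance
        W = 0.75 * (1.0 - (d / h) ** 2) * (np.abs(d / h) <= 1.0)
        W /= W.sum()                                   # weights sum to one
        f_hat_j = W @ I
        idx = np.arange(j - k, j + k + 1) % two_n      # window I_{j-k..j+k}
        S = I[idx].sum()
        total += (2.0 * k * (f_hat_j - W[idx] @ I[idx]) / S
                  + W[idx].sum()
                  - np.log(f_hat_j / I[j]) + GAMMA - 1.0)
    return total / n

def select_bandwidth(I, grid, k=5):
    """Choose h as the minimizer of lambda_hat over a candidate grid."""
    return min(grid, key=lambda h: lambda_hat(I, h, k))
```

In practice the grid of candidate bandwidths would be refined around the minimizer; here it is kept coarse purely for illustration.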
4. Log-periodogram smoothing

The above idea of KL discrepancy estimation can also be applied to choose the bandwidth for log-periodogram smoothing. The first step is to transform the multiplicative model (1) into an additive model by taking a logarithmic transform:
\[
y_j=\log I_j+\gamma=\log f_j+\varepsilon'_j,\qquad j=0,\dots,n-1,
\]
where the $\varepsilon'_j$'s are independent zero mean random variables with probability density function
\[
p(x)=\exp\{x-\gamma-\exp(x-\gamma)\}.
\]
Let $g=\log f$ be the log-spectrum and $\hat g$ be its kernel smoothing estimator; i.e., $\hat g_j=\sum_{m=-n}^{2n-1}W_{m-j}y_m$. It is straightforward to derive the following KL discrepancy for $\hat g$ and $g$:
\[
\Delta_{\mathrm{KL}}(\hat g,g)=\frac{1}{n}\sum_{j=0}^{n-1}
\{g_j-\hat g_j+\exp(\hat g_j-g_j)-1\}.
\]
Using the same technique as in Section 3.1, we derived the following estimator $\hat\lambda'_{h,k}$ for $\Delta_{\mathrm{KL}}(\hat g,g)$:
\[
\hat\lambda'_{h,k}=\frac{1}{n}\sum_{j=0}^{n-1}
\left[y_j-\hat g_j+C\exp\Bigl(\hat g_j-\sum_{m=-k}^{k}c_my_{j+m}\Bigr)-1\right],
\]
where
\[
c_j=\frac{1}{2k+1}+W_j-\sum_{m=-k}^{k}\frac{W_m}{2k+1}
\quad\text{and}\quad
C=\prod_{m=-k}^{k}\frac{\Gamma(1+W_m)}{\Gamma(1+W_m-c_m)}.
\]
It can be shown, along the line of Theorems 1 and 2 below, that $\hat\lambda'_{h,k}$ is consistent for $\Delta_{\mathrm{KL}}(\hat g,g)$.
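The log-periodogram criterion $\hat\lambda'_{h,k}$ of Section 4 can be sketched in the same style. As before, the circular Epanechnikov weights are an illustrative assumption, and the Gamma-function constant $C$ is computed on the log scale for numerical stability.

```python
import numpy as np
from math import lgamma

GAMMA = 0.5772156649015329     # Euler's constant

def lambda_hat_log(I, h, k=5):
    """Criterion for log-periodogram smoothing (a sketch): y_j = log I_j
    + gamma, g_hat from normalized circular kernel weights, and the
    bias-correcting constant C = prod Gamma(1+W_m)/Gamma(1+W_m-c_m)."""
    two_n = len(I)
    n = two_n // 2
    y = np.log(I) + GAMMA
    w = 2.0 * np.pi * np.arange(two_n) / two_n
    total = 0.0
    for j in range(n):
        d = np.angle(np.exp(1j * (w - w[j])))
        W = 0.75 * (1.0 - (d / h) ** 2) * (np.abs(d / h) <= 1.0)
        W /= W.sum()
        g_hat_j = W @ y
        idx = np.arange(j - k, j + k + 1) % two_n
        Wloc = W[idx]
        c = 1.0 / (2 * k + 1) + Wloc - Wloc.sum() / (2 * k + 1)
        # log C, accumulated with log-gamma to avoid over/underflow
        logC = sum(lgamma(1.0 + Wm) - lgamma(1.0 + Wm - cm)
                   for Wm, cm in zip(Wloc, c))
        total += (y[j] - g_hat_j
                  + np.exp(logC + g_hat_j - c @ y[idx]) - 1.0)
    return total / n
```

Note that the local coefficients $c_m$ sum to one by construction, which is what makes the exponential term comparable to $\exp(\hat g_j - g_j)$.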

Fig. 1. Boxplots of various log R_KL values. In each panel the boxplots correspond, from left to right, to the proposed method and the CVLL, GCV, SES and Rˆ(p) criteria.

5. Finite sample properties

A numerical experiment was conducted for comparing the practical performance of the proposed bandwidth selection method with four other methods found in the literature. These other methods are (i) the CVLL and (ii) the GCV methods discussed in Section 3.2; (iii) the SES criterion, which appeared in an unpublished report of Palmer (see [7]); and (iv) the $\hat R(p)$ criterion proposed by Lee [10]. The SES criterion uses cross-validation to estimate $E\{\Delta_{L_2}(f,\hat f)\}$, while $\hat R(p)$ is an unbiased estimator for $E\{\Delta_{L_2}(f,\hat f)\}$. Although these five methods are not targeting the same discrepancy measure, a direct comparison of them would still be interesting. Throughout the whole experiment, $k$ is set to 5 for the proposed method.

5.1. Setup

Four test examples and three different sample sizes were used. The three sample sizes were $n=200$, $400$ and $800$, and the four test examples were from the ARMA$(\alpha,\beta)$ model
\[
x_t+a_1x_{t-1}+\dots+a_\alpha x_{t-\alpha}
=\epsilon_t+b_1\epsilon_{t-1}+\dots+b_\beta\epsilon_{t-\beta},\qquad
\epsilon_t\sim N(0,1),
\]
with coefficients given by: Example 1 was AR(3) with $a_1=-1.5$, $a_2=0.7$ and $a_3=-0.1$; Example 2 was AR(3) with $a_1=0.9$, $a_2=0.8$ and $a_3=0.6$; Example 3 was MA(3) with $b_1=0.9$, $b_2=0.8$ and $b_3=0.6$; and Example 4 was MA(4) with $b_1=-0.3$, $b_2=-0.6$, $b_3=-0.3$ and $b_4=0.6$. These test examples have previously been used by many authors; e.g., see [4,7,10,17,19]. The kernel function used was $K(x)=\frac34(1-x^2)$, $x\in[-1,1]$; it is the optimal kernel of order (0,2) derived in [6].

For each of the 12 combinations of test example and sample size, 500 independent series were simulated, and the corresponding periodograms were computed. Then, for each of these generated periodograms, the above five methods were applied to compute the corresponding bandwidths. In addition, two practically unobtainable bandwidths were also computed. They are $h_{\mathrm{KL}}$, the bandwidth that minimizes $\Delta_{\mathrm{KL}}(\hat f,f)$, and $h_{L_2}$, the bandwidth that minimizes $\Delta_{L_2}(f,\hat f)$. Finally, the following two ratios were computed for every bandwidth $h$ that was selected by any one of the five methods:
\[
R_{\mathrm{KL}}=\frac{\Delta_{\mathrm{KL}}(\hat f_h,f)}{\Delta_{\mathrm{KL}}(\hat f_{h_{\mathrm{KL}}},f)}
\quad\text{and}\quad
R_{L_2}=\frac{\Delta_{L_2}(f,\hat f_h)}{\Delta_{L_2}(f,\hat f_{h_{L_2}})}.
\]
The first ratio $R_{\mathrm{KL}}$ is used to evaluate the performance of $h$ with respect to $\Delta_{\mathrm{KL}}(\hat f,f)$: the smaller its value, the better the performance. The second ratio $R_{L_2}$ has a similar interpretation, but is for $\Delta_{L_2}(f,\hat f)$. Since CVLL is targeting $E\{\Delta_{\mathrm{KL}}(f,\hat f)\}$ but not $E\{\Delta_{\mathrm{KL}}(\hat f,f)\}$, we have also computed a similar third ratio that is based on $\Delta_{\mathrm{KL}}(f,\hat f)$. However, as this third ratio gives the same empirical conclusions as $R_{\mathrm{KL}}$, the corresponding results will not be reported here.

For those cases that are associated with $n=400$, boxplots of the log of the $R_{\mathrm{KL}}$ and $R_{L_2}$ values are given in Figs. 1 and 2. Boxplots for other cases are somewhat similar and are hence omitted. We also performed paired Wilcoxon tests to test the significance of the difference between the median $R_{\mathrm{KL}}$ (and $R_{L_2}$) values of any two methods. The significance level used was 1%, and the relative rankings, with 1 being the best, are listed in Tables 1 and 2. While ranking the methods in this manner is not perfectly legitimate, it does provide an indicator of the relative merits of the methods, and it has also been used in other studies; e.g., see [12,13,21].
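The simulation side of this setup can be reproduced in miniature. The sketch below simulates one of the test series, evaluates the true ARMA spectral density, and computes $\Delta_{\mathrm{KL}}$; the burn-in length and the white-noise check are illustrative choices, not values from the paper.

```python
import numpy as np

def simulate_arma(a, b, length, burn=500, seed=None):
    """x_t + a1 x_{t-1} + ... = e_t + b1 e_{t-1} + ..., e_t ~ N(0,1),
    discarding an initial burn-in segment."""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    e = rng.standard_normal(length + burn + q)
    x = np.zeros(length + burn)
    for t in range(length + burn):
        ma = e[t + q] + sum(b[i] * e[t + q - 1 - i] for i in range(q))
        ar = sum(a[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        x[t] = ma - ar
    return x[burn:]

def arma_spectrum(a, b, w):
    """True density f(w) = |B(e^{-iw})|^2 / (2 pi |A(e^{-iw})|^2)."""
    A = 1 + sum(ak * np.exp(-1j * (i + 1) * w) for i, ak in enumerate(a))
    B = 1 + sum(bk * np.exp(-1j * (i + 1) * w) for i, bk in enumerate(b))
    return np.abs(B) ** 2 / (2 * np.pi * np.abs(A) ** 2)

def delta_kl(f_hat, f):
    """Delta_KL(f_hat, f) = mean of f_hat/f - log(f_hat/f) - 1."""
    r = f_hat / f
    return np.mean(r - np.log(r) - 1.0)

# Example 1 of the paper: AR(3) with a = (-1.5, 0.7, -0.1)
x = simulate_arma([-1.5, 0.7, -0.1], [], 800, seed=1)
```

Combining these pieces with the smoother and criterion sketched earlier, the ratio $R_{\mathrm{KL}}$ for a selected bandwidth is simply `delta_kl` at that bandwidth divided by its minimum over a fine grid approximating the oracle $h_{\mathrm{KL}}$.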
5.2. Empirical conclusions

With $R_{\mathrm{KL}}$, the averaged Wilcoxon test rankings for the proposed method and the CVLL, GCV, SES and $\hat R(p)$ criteria are 1.50, 2.54, 1.96, 4.75 and 4.25, respectively, while with $R_{L_2}$ the averaged rankings

Fig. 2. Similar to Fig. 1 but for log R_L2.

Table 1
Wilcoxon rankings for comparing the R_KL values

                     n = 200                          n = 400                          n = 800
Example   proposed  CVLL  GCV  SES  Rˆ(p)   proposed  CVLL  GCV  SES  Rˆ(p)   proposed  CVLL  GCV  SES  Rˆ(p)
1         1.5       3     1.5  5    4       2.5       2.5   1    5    4       2.5       2.5   1    5    4
2         1         3     2    5    4       1.5       3     1.5  5    4       2         2     2    5    4
3         1         2.5   2.5  4.5  4.5     1         2.5   2.5  4.5  4.5     2         2     2    4.5  4.5
4         1         2.5   2.5  4.5  4.5     1         2.5   2.5  4.5  4.5     1         2.5   2.5  4.5  4.5

Table 2
Wilcoxon rankings for comparing the R_L2 values

                     n = 200                          n = 400                          n = 800
Example   proposed  CVLL  GCV  SES  Rˆ(p)   proposed  CVLL  GCV  SES  Rˆ(p)   proposed  CVLL  GCV  SES  Rˆ(p)
1         1.5       3     1.5  5    4       2         2     2    5    4       2         2     2    5    4
2         1         2.5   2.5  5    4       1         3.5   3.5  3.5  3.5     1.5       1.5   4    4    4
3         4.5       4.5   2.5  2.5  1       4.5       4.5   3    1.5  1.5     4         5     3    1.5  1.5
4         3.5       5     3.5  2    1       5         4     3    1.5  1.5     5         4     3    1.5  1.5

are 2.96, 3.46, 2.79, 3.17 and 2.63, respectively. Therefore, there is some evidence that the proposed method outperformed the other methods if the target discrepancy is $\Delta_{\mathrm{KL}}(\hat f,f)$. It is interesting to see that, even when the target discrepancy is $\Delta_{L_2}(f,\hat f)$, the proposed method gave or shared the best results for Examples 1 and 2. Also, the widths of the boxplots in Figs. 1 and 2 seem to suggest that the three KL-based methods gave more stable performances than the two $L_2$-based methods. This is perhaps partially due to the fact that the KL discrepancy is more stable than the $L_2$ measure under heteroscedasticity.

5.3. Log-periodogram smoothing

We have also empirically investigated the use of the minimizer of $\hat\lambda'_{h,k}$ as the bandwidth for log-periodogram smoothing. We have compared this choice of bandwidth with other bandwidth selection methods that target the $L_2$ measure $E\sum_j(g_j-\hat g_j)^2/n$. These other methods include cross-validation and the unbiased risk estimation approach of Lee [10]. Unlike the case for periodogram smoothing, the use of $\hat\lambda'_{h,k}$ only demonstrates a minor improvement over the other $L_2$-based methods. This is probably due to the fact that $E\,\Delta_{\mathrm{KL}}(\hat g,g)$ and $E\sum_j(g_j-\hat g_j)^2/n$ are asymptotically equivalent.

6. Theoretical properties

We studied the theoretical properties of our estimator, under model (1), in terms of $\Phi=\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)$. We summarize our results in two theorems. The first theorem provides nonasymptotic bounds for $E(\Phi)$ and $\mathrm{var}(\Phi)$, which can be used to show that, under some regularity conditions, both $E(\Phi)$ and $\mathrm{var}(\Phi)$ go to zero as $n\to\infty$. However, such a result is not very useful unless $\Delta_{\mathrm{KL}}(\hat f,f)$ goes to zero at a rate slower than $\Phi$ does. Our second theorem establishes the conditions under which $\Delta_{\mathrm{KL}}(\hat f,f)$ decays slower than $\Phi$; that is, it enables us to show that $\hat\lambda_{h,k}$ is consistent for $\Delta_{\mathrm{KL}}(\hat f,f)$. The proofs of the theorems are given in Appendices B and C. Although these results are established under the approximation model (1), we believe that they do provide valuable insights into the performance of the proposed estimator.

Theorem 1. Suppose that $f$ is bounded away from $0$ and $\infty$, the kernel $K$ is unimodal, symmetric and square integrable, and $k\geq 1$. Then, under model (1),
\[
|E(\Phi)|\leq \max_j(f_j)\,w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right) \tag{4}
\]
and
\[
\mathrm{var}(\Phi)\leq \frac{C_1}{n}+\frac{C_2k}{n}\,w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right)
+\frac{C_3k^3}{n^3h^2}+\frac{C_4}{n^2h}, \tag{5}
\]
where
\[
C_1\leq 30\,\frac{\max_j(f_j)^2}{\min_j(f_j)^2}+\frac{\pi^2}{2},\qquad
C_2\leq 49\,\frac{\max_j(f_j)^2}{\min_j(f_j)^2},
\]
$C_3$ and $C_4$ are constants depending only on the kernel $K$, and $w$ is the modulus of continuity defined in (B.4).

One should notice that, from (4) and (5), the effect of the pre-chosen $k$ on $E(\Phi)$ and $\mathrm{var}(\Phi)$ vanishes as $n$ goes to $\infty$.

Theorem 2. In addition to the assumptions stated in Theorem 1, we assume that $f$ is twice differentiable, the second derivative $f''$ is bounded, $f''(x)$ does not vanish on some interval, and there is a nonincreasing sequence of $h_n$ such that $h_n\to 0$ and $h_n=Cn^{-\alpha}$ for some constant $C$ and $\alpha\in(0,\frac18)\cup(\frac12,\infty)$. Then
\[
\frac{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)}{\Delta_{\mathrm{KL}}(\hat f,f)}\to 0
\quad\text{in probability.} \tag{6}
\]

7. Conclusions

In this paper a bandwidth selection method for periodogram smoothing is proposed. The proposed method aims to choose the bandwidth that minimizes the KL discrepancy between the estimated and the true spectra. Some theoretical results are presented for supporting the use of this method. Results from numerical experiments seem to suggest that the proposed method is superior to four other existing bandwidth selection methods when the KL discrepancy is under consideration.

Appendix A. Why $d(g_{\hat f},g_f)$ but not $d(g_f,g_{\hat f})$?

Using $d(g_{\hat f},g_f)$ to measure the distance between $f_j$ and $\hat f_j$ gives $\Delta_{\mathrm{KL}}(\hat f,f)$ as the overall KL discrepancy between $f$ and $\hat f$, while using $d(g_f,g_{\hat f})$ gives
\[
\Delta_{\mathrm{KL}}(f,\hat f)=\frac{1}{n}\sum_{j=0}^{n-1}
\left\{\frac{f_j}{\hat f_j}-\log\frac{f_j}{\hat f_j}-1\right\}.
\]
Now using the Taylor series approximation $y-\log y-1\approx(y-1)^2/2$ for $y$ close to $1$, we have
\[
\Delta_{\mathrm{KL}}(\hat f,f)\approx\frac{1}{2n}\sum_{j=0}^{n-1}
\left(\frac{\hat f_j-f_j}{f_j}\right)^2
\quad\text{and}\quad
\Delta_{\mathrm{KL}}(f,\hat f)\approx\frac{1}{2n}\sum_{j=0}^{n-1}
\left(\frac{\hat f_j-f_j}{\hat f_j}\right)^2.
\]
Our belief is that $\Delta_{\mathrm{KL}}(\hat f,f)$ is the better measure to use, because in the above approximation it uses a fixed quantity, the denominator term $f_j$, to adjust for the variance of $\hat f_j-f_j$ at different $\omega_j$, while $\Delta_{\mathrm{KL}}(f,\hat f)$ uses a random quantity $\hat f_j$. Also, our numerical work shows that the behaviors of $\Delta_{\mathrm{KL}}(\hat f,f)$ and $\Delta_{\mathrm{KL}}(f,\hat f)$ are remarkably similar for $h$ that are not far from the optimal value.

Appendix B. Proof of Theorem 1

To simplify notation, denote $\hat f_j^k=\sum_{m=-k}^{k}W_mI_{j+m}$. Also, decompose $\Phi$ as $\Phi=\frac{1}{n}\sum_{j=0}^{n-1}\Phi_j$, where
\[
\Phi_j=\log I_j-(\log f_j-\gamma)
+\frac{2k(\hat f_j-\hat f_j^k)}{\sum_{m=-k}^{k}I_{j+m}}
+\sum_{m=-k}^{k}W_m-\frac{\hat f_j}{f_j}.
\]

Proof of (4). By taking the expectation of $\Phi_j$ and using the fact that the observations are independent, we get
\[
E(\Phi_j)=E(\hat f_j-\hat f_j^k)\,E\!\left(\frac{2k}{\sum_{m=-k}^{k}I_{j+m}}\right)
+\sum_{m=-k}^{k}W_m-\frac{E(\hat f_j)}{f_j}. \tag{B.1}
\]
Denote $x_j=\min(f_{j-k},\dots,f_{j+k})$ and $y_j=\max(f_{j-k},\dots,f_{j+k})$, so that
\[
\frac{1}{y_j}\leq E\!\left(\frac{2k}{\sum_{m=-k}^{k}I_{j+m}}\right)\leq\frac{1}{x_j}. \tag{B.2}
\]
Combining (B.2) and (B.1) we conclude
\[
|E(\Phi_j)|\leq E(\hat f_j)\,w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right)
\leq \max_j(f_j)\,w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right), \tag{B.3}
\]
where $w$ is the modulus of continuity defined for any continuous function $g$ by
\[
w(g,\delta)=\sup\{|g(x)-g(y)|:\ |x-y|\leq\delta\}. \tag{B.4}
\]
Eq. (4) now follows from averaging (B.3) over $j$.

Proof of (5). We first split the term $\Phi_j=X_j-Y_j+Z_j+\sum_{m=-k}^{k}W_m$, where
\[
X_j=\log I_j-(\log f_j-\gamma),\qquad
Y_j=\frac{\hat f_j}{f_j},\qquad
Z_j=\frac{2k(\hat f_j-\hat f_j^k)}{\sum_{m=-k}^{k}I_{j+m}}.
\]
Firstly,
\[
\mathrm{var}\!\left(\frac{1}{n}\sum_j\Phi_j\right)
\leq 3\,\mathrm{var}\!\left(\frac{1}{n}\sum_jX_j\right)
+3\,\mathrm{var}\!\left(\frac{1}{n}\sum_jY_j\right)
+3\,\mathrm{var}\!\left(\frac{1}{n}\sum_jZ_j\right) \tag{B.5}
\]
and, under model (1),
\[
\mathrm{var}\!\left(\frac{1}{n}\sum_jX_j\right)=\frac{\pi^2}{6n}. \tag{B.6}
\]
Secondly,
\[
\mathrm{var}\!\left(\frac{1}{n}\sum_jY_j\right)
=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}
\mathrm{cov}\!\left(\frac{\hat f_i}{f_i},\frac{\hat f_j}{f_j}\right)
=\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}
\frac{\mathrm{cov}\bigl(\sum_{p=-n}^{2n-1}W_{p-i}I_p,\ \sum_{q=-n}^{2n-1}W_{q-j}I_q\bigr)}{f_if_j}
\leq\frac{1}{n^2\min_j(f_j)^2}\sum_{i=1}^{n}\sum_{j=1}^{n}
\sum_{p=-n}^{2n-1}W_{p-i}W_{p-j}\,\mathrm{var}(I_p)
\leq\frac{3\max_j(f_j)^2}{n\min_j(f_j)^2}. \tag{B.7}
\]
Further notice that
\[
W_j\approx\frac{\pi K(\pi j/(nh))}{nh}\leq\frac{\pi K(0)}{nh},\qquad
\sum_{m=-\infty}^{\infty}W_m^2\approx\frac{\pi\int K^2(\omega)\,\mathrm{d}\omega}{nh}. \tag{B.9}
\]
Thirdly, fix $i\leq j$. In order to estimate the last term in (B.5) we need to calculate $\mathrm{cov}(Z_i,Z_j)$. Define the index sets
\[
I_{i,j,k}=\{x\in\mathbb{Z}:\ -k\leq x\leq\min(k,\,j-i-k-1)\},
\]
\[
J_{i,j,k}=\{x\in\mathbb{Z}:\ -\min(k,\,j-i-k-1)\leq x\leq k\},
\]
\[
K_{i,j,k}=\{x\in\mathbb{Z}:\ -n\leq x\leq 2n-1,\ |x-i|>k,\ |x-j|>k\},
\]
and calculate
\[
\mathrm{cov}(Z_i,Z_j)
=\mathrm{cov}\!\left(\frac{2k\sum_{m\in J_{i,j,k}}W_{j-i+m}I_{j+m}}{\sum_{m=-k}^{k}I_{i+m}},\ 
\frac{2k\sum_{m\in I_{i,j,k}}W_{i-j+m}I_{i+m}}{\sum_{m=-k}^{k}I_{j+m}}\right)
+\sum_{p\in K_{i,j,k}}\sum_{q\in K_{i,j,k}}
\mathrm{cov}\!\left(\frac{2kW_{p-i}I_p}{\sum_{m=-k}^{k}I_{i+m}},\ 
\frac{2kW_{q-j}I_q}{\sum_{m=-k}^{k}I_{j+m}}\right). \tag{B.8}
\]
Notice that
\[
\frac{2k\sum_{m\in J_{i,j,k}}W_{j-i+m}I_{j+m}}{\sum_{m=-k}^{k}I_{j+m}}
\leq 2k\max_{m\in J_{i,j,k}}W_{j-i+m}
=2kW_{\max(k+1,\,j-i-k)},
\]
where the last equality follows from the fact that the smoothing kernel $K$ is symmetric and unimodal. Thus the first term in (B.8) is bounded by $4k^2W^2_{\max(k+1,\,|j-i|-k)}$, an inequality that remains valid also for $i>j$ by symmetry. Therefore there are constants $C_3$, $C_4$ depending only on the kernel $K$ such that
\[
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}4k^2W^2_{\max(k+1,|j-i|-k)}
\leq\frac{(2k)^3}{n}W^2_{k+1}
+\frac{1}{n^2}\sum_{i=1}^{n}\sum_{m=-\infty}^{\infty}W_m^2
\leq\frac{C_3k^3}{n^3h^2}+\frac{C_4}{n^2h}. \tag{B.10}
\]
To estimate the second term on the right-hand side of (B.8), first recall (B.2) and conclude that
\[
\mathrm{var}\!\left(\frac{2k}{\sum_{m=-k}^{k}I_{j+m}}\right)
\leq\frac{1}{x_j^2}+\frac{1}{(2k+1)x_j^2}-\frac{1}{y_j^2}
\leq\frac{1}{\min_j(f_j)^2}
\left\{\frac{1}{2k+1}+2w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right)\right\},
\]
and realize that $|\mathrm{cov}(X,Y)|\leq\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}$ for any random variables $X$, $Y$. Then fix $p,q\in K_{i,j,k}$ and decompose, by conditioning,
\[
\mathrm{cov}\!\left(\frac{2kW_{p-i}I_p}{\sum_{m=-k}^{k}I_{i+m}},\ 
\frac{2kW_{q-j}I_q}{\sum_{m=-k}^{k}I_{j+m}}\right)
=E\bigl\{\mathrm{cov}(\cdot,\cdot\mid I_p,I_q)\bigr\}
+\mathrm{cov}\bigl(E(\cdot\mid I_p,I_q),\,E(\cdot\mid I_p,I_q)\bigr). \tag{B.11}
\]
If $|i-j|>2k$ the conditional covariance vanishes, while if $|i-j|\leq 2k$ it is bounded, via the variance bound above, by
\[
\frac{2\max_j(f_j)^2}{\min_j(f_j)^2}
\left\{\frac{1}{2k+1}+2w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right)\right\}.
\]
Since there are no more than $(4k+1)n$ pairs of $(i,j)$ satisfying $|i-j|\leq 2k$, we conclude
\[
\frac{1}{n^2}\sum_{i}\sum_{j}\sum_{p,q\in K_{i,j,k}}
E\bigl\{\mathrm{cov}(\cdot,\cdot\mid I_p,I_q)\bigr\}
\leq\frac{2\max_j(f_j)^2}{\min_j(f_j)^2}
\left\{\frac{2}{n}+\frac{8k+2}{n}\,w\!\left(\frac{1}{f},\frac{2\pi k}{n}\right)\right\}. \tag{B.12}
\]
Finally, estimate the second part of (B.11). If $p\neq q$,
\[
\mathrm{cov}\!\left(E\!\left(\frac{2kW_{p-i}I_p}{\sum_mI_{i+m}}\,\middle|\,I_p,I_q\right),\ 
E\!\left(\frac{2kW_{q-j}I_q}{\sum_mI_{j+m}}\,\middle|\,I_p,I_q\right)\right)=0,
\]
while if $p=q$ the covariance is bounded by $W_{p-i}W_{p-j}f_p^2/(x_ix_j)$. From here we conclude
\[
\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{p\in K_{i,j,k}}
\mathrm{cov}\bigl(E(\cdot\mid I_p),\,E(\cdot\mid I_p)\bigr)
\leq\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}
\frac{\max_j(f_j)^2}{\min_j(f_j)^2}\sum_{p}W_{p-i}W_{p-j}
\leq\frac{3\max_j(f_j)^2}{n\min_j(f_j)^2}. \tag{B.13}
\]
Combining (B.5)–(B.8) and (B.10)–(B.13) we obtain (5).

Appendix C. Proof of Theorem 2
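Before the formal argument, the role of the exponent condition $\alpha\in(0,\tfrac18)\cup(\tfrac12,\infty)$ in Theorem 2 can be seen from a back-of-the-envelope rate comparison (a sketch only; the precise statements are (C.1) and (C.7)):

```latex
% E{Delta_KL} decays no faster than h^4 + (nh)^{-1}, while the estimation
% error of \hat\lambda_{h,k} is O_P(n^{-1/2}).  With h_n = C n^{-\alpha}:
\[
h_n^4 = C^4 n^{-4\alpha},\qquad (n h_n)^{-1} = C^{-1} n^{\alpha-1}.
\]
% Relative consistency (6) requires n^{-1/2} = o(n^{-4\alpha} + n^{\alpha-1}),
% i.e. at least one of the two terms must dominate n^{-1/2}:
\[
4\alpha < \tfrac12 \iff \alpha < \tfrac18,
\qquad\text{or}\qquad
1-\alpha < \tfrac12 \iff \alpha > \tfrac12 ,
\]
% which is exactly the excluded middle in the assumption of Theorem 2.
```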

We will first show that under our assumptions
\[
E\{\Delta_{\mathrm{KL}}(\hat f,f)\}\geq D_1h^4+\frac{D_2}{nh}\quad\text{eventually.} \tag{C.1}
\]
We need two inequalities to prove (C.1). Define $l(y)=(y-1)-\log(y)$, and hence $\Delta_{\mathrm{KL}}(\hat f,f)=\frac{1}{n}\sum_{j=0}^{n-1}l(\hat f_j/f_j)$. By applying the Taylor approximation $l(y)\approx\frac12(y-1)^2$ to $l(\hat f_j/f_j)$ (as in Appendix A) and using the assumption that $f$ is bounded away from $0$ and $\infty$, we obtain our first inequality:
\[
C\left(\frac{\hat f_j-f_j}{f_j}\right)^2\leq l\!\left(\frac{\hat f_j}{f_j}\right)
\quad\text{with}\quad C=\frac{1}{2}\,\frac{\min_j(f_j)}{\max_j(f_j)}. \tag{C.2}
\]
We begin deriving the second inequality by calculating
\[
E(\hat f_j-f_j)^2=\mathrm{var}(\hat f_j)+\{E(\hat f_j)-f_j\}^2. \tag{C.3}
\]
If $f''(\omega_j)\neq 0$ then there is $D(\omega_j)$ such that eventually
\[
\{E(\hat f_j)-f_j\}^2=\left\{\sum_{m=-n}^{2n-1}W_{m-j}f_m-f_j\right\}^2
\geq D(\omega_j)h^4 \tag{C.4}
\]
and, by (B.9),
\[
\mathrm{var}(\hat f_j)=\sum_{m=-n}^{2n-1}W_{m-j}^2\,\mathrm{var}(I_m)
\geq\frac{\min_j(f_j)^2C_4}{nh}. \tag{C.5}
\]
Thus by combining (C.2), (C.3), averaged (C.4) and (C.5) we get (C.1).

Similar calculations as in the proof of (5) show that
\[
\mathrm{var}\{\Delta_{\mathrm{KL}}(\hat f,f)\}=O\!\left(\frac{1}{n}\right). \tag{C.6}
\]
Also recall that, since $f$ is bounded, relations (4) and (5) imply
\[
E\{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)\}^2=O\!\left(\frac{1}{n}\right). \tag{C.7}
\]
Combining (C.1), (C.6) and (C.7) we see that, under the conditions stated in Theorem 2, $n^{-1/2}=o(h_n^4+(nh_n)^{-1})$. Therefore, we can find a sequence $r_n$ such that
\[
E\{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)\}^2=o(r_n^2),\qquad
\mathrm{var}\{\Delta_{\mathrm{KL}}(\hat f,f)\}=o(r_n^2)
\quad\text{and}\quad
r_n=o\bigl[E\{\Delta_{\mathrm{KL}}(\hat f,f)\}\bigr]. \tag{C.8}
\]
Fix $\epsilon>0$ and calculate
\[
P\!\left\{\frac{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)}{\Delta_{\mathrm{KL}}(\hat f,f)}>\epsilon\right\}
\leq P\!\left\{\frac{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)}{r_n}>\epsilon\right\}
+P\{\Delta_{\mathrm{KL}}(\hat f,f)<r_n\}.
\]
By (C.8) and the Markov and Chebyshev inequalities we get
\[
P\!\left\{\frac{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)}{r_n}>\epsilon\right\}
\leq\frac{E\{\hat\lambda_{h,k}-\Delta_{\mathrm{KL}}(\hat f,f)\}^2}{\epsilon^2r_n^2}\to 0
\]
and
\[
P\{\Delta_{\mathrm{KL}}(\hat f,f)<r_n\}
\leq P\bigl[\bigl|\Delta_{\mathrm{KL}}(\hat f,f)-E\{\Delta_{\mathrm{KL}}(\hat f,f)\}\bigr|
\geq E\{\Delta_{\mathrm{KL}}(\hat f,f)\}-r_n\bigr]
\leq\frac{\mathrm{var}\{\Delta_{\mathrm{KL}}(\hat f,f)\}}
{\bigl[E\{\Delta_{\mathrm{KL}}(\hat f,f)\}-r_n\bigr]^2}\to 0.
\]
This finishes the proof of (6).

References

[1] K. Beltrão, P. Bloomfield, Determining the bandwidth of a kernel spectrum estimate, J. Time Ser. Anal. 8 (1987) 21–39.
[2] K.P. Burnham, D.R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach, Springer, New York, 1998.
[3] G.C. Chow, Econometrics, McGraw-Hill, New York, 1983.
[4] J. Fan, E. Kreutzberger, Automatic local smoothing for spectral density estimation, Scand. J. Statist. 25 (1998) 359–369.
[5] H.-Y. Gao, Choice of thresholds for wavelet shrinkage estimate of the spectrum, J. Time Ser. Anal. 18 (1997) 231–251.
[6] T. Gasser, H.-G. Müller, V. Mammitzsch, Kernels for nonparametric curve estimation, J. Roy. Statist. Soc. Ser. B 47 (1985) 238–252.
[7] C.M. Hurvich, Data-driven choice of a spectrum estimate: extending the applicability of cross-validation methods, J. Amer. Statist. Assoc. 80 (1985) 933–940.
[8] C.M. Hurvich, K. Beltrão, Cross-validatory choice of a spectrum estimate and its connections with AIC, J. Time Ser. Anal. 11 (1990) 121–137.

[9] C. Kooperberg, C.J. Stone, Y.K. Truong, Logspline estimation of a possibly mixed spectral distribution, J. Time Ser. Anal. 16 (1995) 359–388.
[10] T.C.M. Lee, A simple span selector for periodogram smoothing, Biometrika 84 (1997) 965–969.
[11] T.C.M. Lee, A stabilized bandwidth selection method for kernel smoothing of the periodogram, Signal Processing 81 (2001) 419–430.
[12] T.C.M. Lee, Tree-based wavelet regression for correlated data using the minimum description length principle, Austral. New Zealand J. Statist. 44 (2002) 23–39.
[13] T.C.M. Lee, T.F. Wong, Nonparametric log-spectrum estimation using disconnected regression splines and genetic algorithms, Signal Processing 83 (2003) 79–90.
[14] H. Linhart, W. Zucchini, Model Selection, Wiley, New York, 1986.
[15] P. Moulin, Wavelet thresholding techniques for power spectrum estimation, IEEE Trans. Signal Process. 42 (1994) 3126–3136.
[16] H.C. Ombao, J.A. Raz, R.L. Strawderman, R. von Sachs, A simple generalised crossvalidation method of span selection for periodogram smoothing, Biometrika 88 (2001) 1186–1192.
[17] Y. Pawitan, F. O'Sullivan, Nonparametric spectral density estimation using penalized Whittle likelihood, J. Amer. Statist. Assoc. 89 (1994) 600–610.
[18] K.S. Riedel, A. Sidorenko, Adaptive smoothing of the log-spectrum with multiple tapering, IEEE Trans. Signal Process. 44 (1996) 1794–1800.
[19] G. Wahba, Automatic smoothing of the log periodogram, J. Amer. Statist. Assoc. 75 (1980) 122–132.
[20] A.T. Walden, D.B. Percival, E.J. McCoy, Spectrum estimation by wavelet thresholding of multitaper estimators, IEEE Trans. Signal Process. 46 (1998) 3153–3165.
[21] M.P. Wand, A comparison of regression spline smoothing procedures, Comput. Statist. 15 (2000) 443–462.
[22] M.P. Wand, M.C. Jones, Kernel Smoothing, Chapman & Hall, London, 1995.