Computationally Intensive Methods for Spectrum Estimation
by Joshua Pohlkamp-Hartt

A thesis submitted to the Department of Mathematics & Statistics in conformity with the requirements for the degree of Doctor of Philosophy

Queen's University
Kingston, Ontario, Canada
April 2016

Copyright © Joshua Pohlkamp-Hartt, April 2016

Abstract

Spectrum estimation is an essential technique for analyzing time series data. A leading method in the field of spectrum estimation is the multitaper method. The multitaper method has been applied to many scientific fields and has led to the development of new methods for detecting signals and modeling periodic data. Within these methods there are open problems concerning parameter selection, signal detection rates, and signal estimation. The focus of this thesis is to address these problems by using techniques from statistical learning theory.

This thesis presents three theoretical contributions for improving methods related to the multitaper spectrum estimation method: (1) two hypothesis testing procedures for evaluating the choice of the time-bandwidth parameter, NW, and the number of tapers, K, for the multitaper method, (2) a bootstrapping procedure for improving the signal detection rates of the F-test for line components, and (3) cross-validation, boosting, and bootstrapping methods for improving the performance of the inverse Fourier transform periodic data estimation method resulting from the F-test.

We additionally present two applied contributions: (1) a new atrial signal extraction method for electrocardiogram data, and (2) four new methods for analyzing, modeling, and reporting on hockey game play at the Major Junior level.

Acknowledgments

I would like to thank:
• My supervisors Glen and David for their support and guidance.
• My colleagues Dave, Aaron, Carly, Charlotte, Karim, Michael, and Wes for the insightful conversations and frank debates.
• My friends Natalie, Justin, Liisa and Rory for their love, affection, and patience.
• My family for their never-ending confidence boosts. I love you mom, dad, and Noah.

All of your support is highly appreciated and I am humbled and grateful to know such smart people. A great thanks to the Kingston Frontenacs for their willingness to be involved in this research. Additionally, thank you to the data collectors that worked on this project; without you my ideas would not have come to realization. Finally, thank you to Kingston and Queen's for making this last decade of learning as fun as it was.

Statement of Originality

The contents of this thesis are original except where references are explicitly given. The methods and results in Chapters 5 and 7 were collaborative work done with David Riegert.

Table of Contents

Abstract
Acknowledgments
Statement of Originality
Table of Contents
List of Tables
List of Figures
Glossary
Chapter 1: Introduction
Chapter 2: Background and Literature Review
2.1 Hypothesis Testing
2.2 Time Series Analysis
2.3 Spectrum Estimation
2.4 Multitaper Method
2.5 Signal Detection and the F-test
2.6 Bootstrap Methods
2.7 Cross-validation
2.8 Gradient Boosting
2.9 Methods For Data Analysis
Chapter 3: Sphericity Tests for Parameter Selection
3.1 Introduction
3.2 Naive Sphericity Test
3.3 Bagged Sphericity Test
3.4 Simulations and Comparison
3.5 Conclusions on Tests
Chapter 4: Bootstrapping the F-test
4.1 Introduction
4.2 Practical Limitations of the harmonic F-test
4.3 Testing Procedure
4.4 Rejection Regions and Variance of the Bootstrapped Statistic
4.5 Comparison to the harmonic F-test
4.6 Conclusions on Simulations
Chapter 5: Periodic Data Reconstruction Methods
5.1 Introduction
5.2 Inverse Fourier Transform Signal Synthesis
5.3 Interpolation and Prediction
5.4 Significance Level Determination (Finding α)
5.5 Boosting Residual Signals
5.6 Bootstrapped Signal Synthesis
5.7 Data Analysis and Comparison
5.8 Conclusions on Techniques
Chapter 6: Extracting Atrial Signals
6.1 Introduction
6.2 The Problem
6.3 Advanced Principal Components Analysis
6.4 Data Study
6.5 Conclusion
Chapter 7: Modeling Major Junior Hockey
7.1 Introduction
7.2 Current State of Statistics in Hockey
7.3 Neutral Zone Play
7.4 Optimizing Line Selection
7.5 In-game Player Monitoring
7.6 Predicting Future Trends in Game Play
7.7 Data Analysis: Kingston Frontenacs
7.8 Conclusions and Discussion
Chapter 8: Concluding Remarks
Bibliography

List of Tables

4.1 Empirical cut-off values, $\hat{\phi}_{2,2K-2}(p)$, for the re-sampled harmonic F-test (SN = 1) with M = 20.
4.2 F-statistic values for the harmonic F-test.
5.1 t-test evaluating the distributions of the mean squared errors of our interpolation methods. The null hypothesis is that Method A has a lower mean mean squared error than Method B, $H_0 : \mu_{MSE_A} < \mu_{MSE_B}$. In this table, the diagonal entries give the mean and sample size used for each method; the off-diagonal entries give the t-statistic and p-value associated with the hypothesis comparing Method A and Method B.
5.2 Average computational costs of each interpolation method.
5.3 t-test evaluating the distributions of the mean squared errors of our prediction methods. The null hypothesis is that Method A has a lower mean mean squared error than Method B, $H_0 : \mu_{MSE_A} < \mu_{MSE_B}$. In this table, the diagonal entries give the mean and sample size used for each method; the off-diagonal entries give the t-statistic and p-value associated with the hypothesis comparing Method A and Method B.
5.4 Average computational costs of each prediction method.
6.1 Comparison of atrial extraction methods used on MIT-BIH Normal Sinus Rhythm data. We tested the null hypothesis that the proposed method had equal or worse performance for QRS peak power reduction and F-wave (4.7–5 Hz) power retention. The run time is given for each method.
7.1 Variables used in statistical modeling of hockey.
7.2 Summary of logistic regression model for goal production, including p-values for the hypothesis $H_0 : \beta = 0$.
7.3 Summary of player END hypothesis tests.

List of Figures

2.1 Triangular window and fast Fourier transform for 512 points of a sinusoid centred in frequency, from [83].
2.2 Slepian sequences of order (0, 1, 2, 3) in the time domain with NW = 4, N = 1000. NW = 4 is a common choice in time series literature and is practical for demonstrative purposes.
2.3 Example of a Shewhart Control Chart from the QCC package in R.
2.4 Example of an EWMA Control Chart with λ = 0.2 from the QCC package in R.
2.5 Coefficient regions for regularized regression for two dimensions, from [51].
2.6 Constrained regression regions with relation to least squares estimate, from [51].
3.1 Comparison of non-spherical and spherical distributed complex-valued residuals.
3.2 Part of the spectrum showing three test signals for the sphericity test, NW = 4, K = 7, N = 1000.
3.3 Naive sphericity test of simulated evenly spaced five-pronged sinusoids in noise for NW = [2, 10] and K = [2, 20]. The darker shaded regions represent higher p-values for the sphericity test. This plot uses linear smoothing between the parameter values evaluated to demonstrate the change from one parameter choice to the next.
3.4 Bagged sphericity test with O = 50 for simulated evenly spaced five-pronged sinusoids in noise for NW = [2, 10] and K = [2, 20]. The darker shaded regions represent higher p-values for the sphericity test.
This plot uses linear smoothing between the parameter values evaluated to demonstrate the change from one parameter choice to the next.
3.5 Proportion of parameter selections of the naive sphericity test for 1000 repetitions of simulated evenly spaced five-pronged sinusoids in noise for NW = [2, 10] and K = [2, 20]. All parameter choices not listed were not selected.
3.6 Proportion of parameter selections of the bagged sphericity test with O = 50 for 1000 repetitions of simulated evenly spaced five-pronged sinusoids in noise for NW = [2, 10] and K = [2, 20]. All parameter choices not listed were not selected.
3.7 Effect of number of runs, O, on maximum p-value for the bagged sphericity test.
3.8 Effect of number of runs, O, on computational time for the bagged sphericity test.
3.9 Variance in the maximum p-value parameter choice from 1000 tests with O = 10. All parameter choices not listed were not selected.
3.10 Effect on the choice of NW from wrongly specifying the noise process variance. The true variance is labeled as the blue line and the theoretically acceptable choices are highlighted by the red band.
3.11 Effect on the maximum p-value for the bagged sphericity test due to wrongly specifying the noise process variance. The true variance is labeled as the blue line.
3.12 Comparison of the sphericity tests' performance under differing proportions of non-Gaussian noise. The simulated series tested with standard Gaussian noise are on the left while the simulated series tested with non-Gaussian noise are on the right. The performance of the leftmost tests is in line with the previous analysis in this chapter.