Topics in Functional Data Analysis and Machine Learning Predictive Inference
Total Page:16
File Type:pdf, Size:1020Kb
Iowa State University Capstones, Theses and Graduate Theses and Dissertations Dissertations 2019 Topics in functional data analysis and machine learning predictive inference Haozhe Zhang Iowa State University Follow this and additional works at: https://lib.dr.iastate.edu/etd Part of the Statistics and Probability Commons Recommended Citation Zhang, Haozhe, "Topics in functional data analysis and machine learning predictive inference" (2019). Graduate Theses and Dissertations. 17626. https://lib.dr.iastate.edu/etd/17626 This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. Topics in functional data analysis and machine learning predictive inference by Haozhe Zhang A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Major: Statistics Program of Study Committee: Yehua Li, Co-major Professor Dan Nettleton, Co-major Professor Ulrike Genschel Lily Wang Zhengyuan Zhu The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred. Iowa State University Ames, Iowa 2019 Copyright c Haozhe Zhang, 2019. All rights reserved. ii DEDICATION I would like to dedicate this dissertation to my parents, Baiming Zhang and Jiany- ing Fan, and my significant other, Shuyi Zhang, for their endless love and unwavering support. iii TABLE OF CONTENTS Page LIST OF TABLES . vi LIST OF FIGURES . vii ACKNOWLEDGEMENTS . xv ABSTRACT . xvi CHAPTER 1. GENERAL INTRODUCTION . .1 1.1 Spatially Correlated Functional Data . .1 1.2 Functional Modeling of Crowdsourced Growth Data . .3 1.3 Prediction Intervals for Random Forests . .4 1.4 Dissertation Organization . .6 1.5 Role of Authors . .6 CHAPTER 2. SPATIALLY DEPENDENT FUNCTIONAL DATA: COVARIANCE ESTIMATION, PRINCIPAL COMPONENT ANALYSIS, AND KRIGING . .7 2.1 Introduction . .8 2.1.1 Literature Review . .8 2.1.2 Motivating Data Examples . .9 2.1.3 Our Contributions . 11 2.2 Model and Assumptions . 13 2.2.1 Random field modeling for spatially dependent functional data . 13 2.2.2 Sampling scheme for spatial locations and observation times . 14 2.3 Estimation method . 15 2.3.1 Estimation of the spatio-temporal covariance function . 16 2.3.2 Estimation of the functional principal components . 18 2.3.3 Estimation of the spatial covariance and correlation functions . 19 2.3.4 Covariance estimation for the functional nugget effect . 19 2.3.5 Variance estimation for the measurement errors . 20 iv 2.4 Theoretical Properties . 20 2.5 Implementation . 25 2.5.1 Positive semidefinite adjustment for the spatial covariance functions 25 2.5.2 Choosing the number of B-spline knots . 26 2.5.3 Estimation of the mean function . 27 2.6 Kriging of spatially dependent functional data . 27 2.7 Simulation studies . 29 2.8 Data analysis . 34 2.8.1 Analysis of the London housing price data . 34 2.8.2 Analysis of the Zillow real estate data . 37 2.9 Discussion . 39 2.10 Supplemental Material . 40 2.10.1 Notations . 41 2.10.2 Technical Lemmas . 42 2.10.3 Proofs of the Main Theorems . 55 2.10.4 Supporting Figures for the Simulation Studies . 69 2.10.5 Supporting Figures for Analysis of London Housing Price Data . 71 2.10.6 Supporting Figures for Analysis of Zillow Price-rent Ratio Data . 73 CHAPTER 3. ESTIMATING PLANT GROWTH CURVES AND DERIVATIVES BY MODELING CROWDSOURCED IMAGE-BASED DATA . 77 3.1 Introduction . 78 3.2 Field Experiment, Crowdsourcing Design, and Data . 82 3.3 Model . 83 3.4 Estimation . 88 3.4.1 Robust Estimation of Shape-Constrained Mean Functions . 88 3.4.2 Robust Estimation of Covariance Functions and Variances . 90 3.4.3 Estimating the Functional Principal Components . 92 3.4.4 Estimating the Functional Principal Component Scores . 93 3.5 Algorithm . 94 3.6 Analysis of Maize Growth Data . 96 3.7 Simulation Study . 104 3.8 Discussion . 106 3.9 Appendix A: Supporting Figures for Analysis of Maize Growth Data . 108 3.10 Appendix B: Supporting Figures for Simulation Study . 110 v CHAPTER 4. RANDOM FOREST PREDICTION INTERVALS . 118 4.1 Introduction . 118 4.2 Constructing Random Forest Prediction Intervals . 122 4.2.1 The Random Forest Algorithm . 122 4.2.2 Random Forest Weights . 125 4.2.3 Out-of-bag Prediction Intervals . 126 4.3 Asymptotic Properties of OOB Prediction Intervals . 128 4.4 Alternative Random Forest Intervals . 131 4.4.1 Split Conformal Prediction Intervals . 132 4.4.2 Quantile Regression Forest . 133 4.4.3 Confidence Intervals . 134 4.5 Simulation Study . 135 4.5.1 Evaluating Type I and II coverage rates . 137 4.5.2 Evaluating Type III and IV coverage rates . 138 4.6 Data Analysis . 147 4.7 Concluding Remarks . 159 4.8 Acknowledgements . 164 4.9 Appendix: Proofs of Main Theorems . 164 4.9.1 Proofs of Theorem 1 and Corollary 1 . 165 4.9.2 Proofs of Theorem 2 and Corollary 2 . 168 CHAPTER 5. GENERAL CONCLUSION . 169 5.1 Summary . 169 5.2 Future Work . 171 BIBLIOGRAPHY . 173 vi LIST OF TABLES Page Table 2.1 Simulation results on the mean and standard deviation of inte- grated square errors for functional principal components estimated by sFPCA and iFPCA........................... 33 Table 2.2 Kriging results in the simulation study: mean and standard devia- tion of integrated squared errors for sFPCA and iFPCA+CoKriging. 33 Table 3.1 Simulation results on the mean and standard deviation of inte- grated squared errors (ISE) for mean functions, FPCs, growth curves, and derivatives estimated by the proposed and naive methods. 107 Table 4.1 Name, n = total number of observations (excluding observations with missing values), and p = number of predictor variables for 60 datasets. 155 vii LIST OF FIGURES Page Figure 2.1 London house price data. (a) Locations of houses in the Greater London area; (b) trajectory of the house prices and the estimated mean function (dashed line). 10 Figure 2.2 (a) The locations of 234 neighborhoods in the San Francisco Bay Area; (b) trajectories of home price-to-rent ratios, observed monthly from October 2010 to August 2018 in the 234 neighborhoods. 11 Figure 2.3 Estimation results of sFPCA under Scenario A. Panels (a) - (h) con- tain summaries of the functional estimators, as described in the labels. In each panel, the solid line is the true function; the dashed line is the mean of the functional estimator; and the shaded area illustrates the bands of pointwise 5% and 95% percentiles. Panel (i) contains the boxplots of vb1, vb2, vb3, wbnug,1, and wbnug,2....... 30 Figure 2.4 Estimation results of iFPCA under Scenario A. In each panel, the solid line is the true function; the dashed line is the mean of the functional estimator; and the shaded area illustrates the bands of pointwise 5% and 95% percentiles. 32 Figure 2.5 Results on the London housing price data: (a) the contour plot of Wb (t1, t2); (b) the contour plot of Lb (t1, t2), covariance function of the functional nugget effect; (c) the first two eigenfunctions of Wb (·, ·); (d) the first three eigenfunctions of Lb (·, ·); (e) the estimated spatial correlation function rb1(·) and its positive semi-definite ad- justment re1(·); (f) the estimated spatial correlation function rb2(·) and its positive semi-definite adjustment re2(·)............. 35 Figure 2.6 Sensitivity Analysis on the London housing price data. The red lines are the estimated first two eigenfunctions of W(·, ·) by using the whole dataset, while the green dashed lines and blue dotted lines are the estimated first two eigenfunctions of W(·, ·) by us- ing the data of homes on the northern and southern sides of River Thames. 36 viii Figure 2.7 Results on the Zillow price-to-rent ratio data: (a) contour plot of Wb (t1, t2); (b) the first two eigenfunctions (c) the estimated spatial correlation function rb1(·) and its positive semi-definite adjustment re1(·); (d) rb2(·) and re2(·).......................... 38 Figure 2.8 Mean estimation results for the simulation studies. In each panel, the solid line is the true mean function, the dashed curve is the mean of mb(t), and the shaded area illustrates the confidence band sformed by the pointwise 5% and 95% percentiles. 69 Figure 2.9 Estimation results of iFPCA under Scenario B. 69 Figure 2.10 Estimation results of sFPCA under Scenario B. The upper panel shows the estimation results of principal component functions, while the lower panel shows the estimation results of spatial covariance functions. In each panel, the solid line is the true function; the dashed line is the mean of the functional estimator; and the shaded area illustrates the bands of pointwise 5% and 95% percentiles. 70 Figure 2.11 Estimation results of Rb(u, 0, 0) for London Property Transaction Price Data: contour plots Rb(u, ·, ·) standardized by kR(u, ·, ·)k1 = RR 2 jRb(u, t1, t2)jdt1dt2/jTj , at u = 0, 1, 2, 3, 4, and 5. 71 Figure 2.12 Spatial-temporal pattern of London housing price data: (a) loca- tions of homes (dot points represent homes on the north side of River Thames, and triangular points represent homes on the south side); (b) histogram of the number of transactions per house; (c) histogram of the distance between two homes; (d) histogram of transaction dates.