Copula-Based Analysis of Dependent Data with Censoring and Zero Inflation
by Fuyuan Li

B.S. in Telecommunication Engineering, May 2012, Beijing University of Technology
M.S. in Statistics, May 2014, The George Washington University

A Dissertation submitted to The Faculty of The Columbian College of Arts and Sciences of The George Washington University in partial satisfaction of the requirements for the degree of Doctor of Philosophy

January 10, 2019

Dissertation directed by
Huixia J. Wang
Professor of Statistics

The Columbian College of Arts and Sciences of The George Washington University certifies that Fuyuan Li has passed the Final Examination for the degree of Doctor of Philosophy as of December 7, 2018. This is the final and approved form of the dissertation.

Dissertation Research Committee:
Huixia J. Wang, Professor of Statistics, Dissertation Director
Tapan K. Nayak, Professor of Statistics, Committee Member
Reza Modarres, Professor of Statistics, Committee Member

© Copyright 2019 by Fuyuan Li. All rights reserved.

Acknowledgments

This work would not have been possible without the financial support of the National Science Foundation grant DMS-1525692, and the King Abdullah University of Science and Technology Office of Sponsored Research award OSR-2015-CRG4-2582. I am grateful to all of those with whom I have had the pleasure to work during this and other related projects. Each of the members of my Dissertation Committee has provided me extensive personal and professional guidance and taught me a great deal about both scientific research and life in general. I would especially like to thank Dr. Huixia Judy Wang, the chairman of my committee. As my teacher and mentor, she has taught me more than I could ever give her credit for here. She has shown me, by her example, what a good scientist (and person) should be. I am also indebted to Dr.
Yanlin Tang, who has been supportive of my career goals and who worked actively to provide me with the protected academic time to pursue those goals. The most important support in the pursuit of this project came from my family. I would like to thank my parents, whose love and guidance are with me in whatever I pursue. They are my ultimate role models.

Abstract

The analysis of time series data with detection limits is challenging due to the high-dimensional integral involved in the likelihood. To address this computational challenge, various methods have been developed, but most of them rely on restrictive parametric distributional assumptions. In the first project, we propose a semiparametric method for analyzing censored time series, where the temporal dependence is captured by a parametric copula while the marginal distributions are estimated nonparametrically. Utilizing the properties of copula modeling, we develop a new copula-based sequential sampling algorithm, which provides a convenient way to calculate the censored likelihood. Even without full parametric distributional assumptions, the proposed method still allows us to easily compute the conditional quantiles of the censored response at a future time point, and thus construct both point and interval predictions. We establish the asymptotic properties of the proposed pseudo maximum likelihood estimator, and demonstrate through simulation and the analysis of a water quality data set that the proposed method is more flexible and leads to more accurate predictions than Gaussian-based methods for non-normal data. In the second project, we focus on the analysis of multi-site precipitation data that are zero-inflated.
We consider an alternative three-part copula-based model to analyze precipitation at multiple locations, where copula functions are used to capture the dependence among locations, and the marginal distribution is characterized through the first two parts of the model.

Table of Contents

Acknowledgments
Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Copula
  1.2 Verification of the Markov Property
  1.3 Organization
2 Copula-Based Semiparametric Estimation for Markov Models with Censoring
  2.1 Introduction
  2.2 Proposed Method
    2.2.1 Copula-Based Markov Model
    2.2.2 Copula-Based Semiparametric Estimator
    2.2.3 Computation
    2.2.4 Estimation of Conditional Quantiles
    2.2.5 Selection of Copula Functions
    2.2.6 Selection of Copula Based on the Goodness of Fit Test
  2.3 Simulation Study
    2.3.1 Estimation of the Copula Parameter
    2.3.2 Investigation of Clayton Copula
    2.3.3 Estimation of Conditional Quantiles
    2.3.4 Selection of Copulas
    2.3.5 Selection of Copula Based on the Goodness of Fit Test
  2.4 Large Sample Properties of the Estimator
  2.5 Technical Proofs
    2.5.1 Lemma 1
    2.5.2 Proof of Theorem 1 (Consistency of q̂)
    2.5.3 Proof of Theorem 1 (Asymptotic Normality)
  2.6 Analysis of a Water Quality Data
  2.7 Conclusion
3 Copula-Based Analysis of Multisite Daily Precipitation
  3.1 Introduction
  3.2 Proposed Method
    3.2.1 Notation
    3.2.2 Three-Part Model
    3.2.3 Parameter Estimation
    3.2.4 Prediction at New Time for Existing Locations
    3.2.5 Interpolation at New Locations
  3.3 Simulation
    3.3.1 Simulation Design
    3.3.2 Estimation of Matérn Parameters
    3.3.3 Prediction at New Time for Existing Locations
    3.3.4 Interpolation at New Locations
  3.4 Analysis of Chicago Precipitation Reanalysis Data
    3.4.1 Preliminary Analysis of Chicago Precipitation Data
    3.4.2 Prediction at New Time for Existing Locations
    3.4.3 Interpolation at New Locations
  3.5 Conclusion
4 Conclusion and Discussion
  4.1 Concluding Remarks
  4.2 Limitations and Future Works
Bibliography
A Copula-Based Semiparametric Estimation for Markov Models with Censoring

List of Figures

2.1 Boxplots of the omniscient estimator and the CopC* estimator based on the true marginal distribution G*(·), across different censoring proportions in Case 3 with Clayton(q = 2) copula and t_3 marginals. Omni: the omniscient estimator based on the nonparametric estimator Ĝ_n.
2.2 Violin plots of the true and estimated quantile Q_q(Y_{n+1} | I_n) for n = 2000 at q = 0.5 and 0.9 across 500 simulations in Cases 1-3 with 40% censoring and q_0 corresponding to t = 0.3. Omni, CopC and Naive: the copula-based estimations by assuming the correct copula function and using the omniscient, proposed and naive estimators of q_0, respectively; CopC2: the counterpart of CopC with selected copula; GIM: the Gaussian-based imputation method from Park et al. (2007).
2.3 (a) Observed time series of dissolved ammonia Y_t in Middle Susquehanna from 1988 to 2014, where d = 0.02 is the detection limit; (b) the Q-Q plot of log-transformed ammonia above log(d); (c) the estimated conditional quantiles of Y_{n+1} (curve with solid circles) and the 95% pointwise confidence band from the proposed method, and the estimated conditional quantiles of Y*_{n+1} from the Gaussian-based imputation method of Park et al. (2007) (curve with open circles).
2.4 Histograms of the scaled estimated conditional probability from the proposed copula method (CopC) and the Gaussian-based imputation method (GIM) from Park et al. (2007) for the cross validation study.
3.1 Boxplots of l̂ for JF and JC methods
3.2 Computing time for JF and JC methods when the number of locations increases.
3.3 Multivariate rank histograms for prediction at a new time obtained from the five methods.
3.4 ROC curves and AUCs using misspecified link functions in Model (3.1) and Model (3.3).
3.5 Stations for Chicago precipitation data
3.6 ROC curves of the regression models for rain occurrence at station Aurora, with link function h_o^L (left) and h_o^P (right).
3.7 ROC curves of the regression models for rain occurrence, with link function h_o^L (left) and h_o^P (right), using data from all locations under the common parameter assumption.
3.8 (i) Observed vs Fitted plots for rain amount at station Chicago Midway Airport, with link function h_a^i (left) and h_a^l (right).
3.9 (ii) Q-Q plots for rain amount at station Chicago Midway Airport, with link function h_a^i (left) and h_a^l (right).
3.10 (iii) Q-Q plots for observed against simulated rain amount at station Chicago Midway Airport, with link function h_a^i (left) and h_a^l (right).
3.11 Graphical tools of the goodness of fit for the regression models for rain amount, with link function h_a^i (upper panels) and h_a^l (lower panels), using data from all locations under the common parameter assumption.
3.12 Multivariate rank histograms of different methods for predicting at new times for the Chicago precipitation dataset from 1998 to 2002.
3.13 Multivariate rank histograms of different methods for predicting at new times for the Chicago precipitation dataset from 1998 to 2002. (continued)
3.14 ROC curves and AUCs for rain occurrence predictions of the four candidate methods

List of Tables

1.1 Archimedean copulas and their generators
2.1 Average Bias and Root Mean Squared Error (RMSE) of different estimators of q for Case 1 with Gaussian copula and normal marginals.