Gibbs Sampling and Expectation Maximization Methods for Estimation of Censored Values from Correlated Multivariate Distributions

A dissertation submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Ph.D.) in the Department of Mathematical Sciences of the McMicken College of Arts and Sciences.

May 2008

by Tina D. Hunter

B.S. Industrial and Systems Engineering, The Ohio State University, Columbus, Ohio, 1984
M.S. Aerospace Engineering, University of Cincinnati, Cincinnati, Ohio, 1989
M.S. Statistics, University of Cincinnati, Cincinnati, Ohio, 2003

Committee Chair: Dr. Siva Sivaganesan

Abstract

Statisticians are often called upon to analyze censored data. Environmental and toxicological data are often left-censored due to reporting practices for measurements that fall below a statistically defined detection limit. Although there is an abundance of literature on univariate methods for analyzing this type of data, a great need still exists for multivariate methods that take into account possible correlation among variables. Two methods are developed here for that purpose. One is a Markov chain Monte Carlo (MCMC) method that uses a Gibbs sampler to estimate censored data values as well as distributional and regression parameters. The second is an expectation maximization (EM) algorithm that solves for the distributional parameters that maximize the complete likelihood function in the presence of censored data.
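The two ideas just described can be sketched in a simplified univariate form. The following is an illustrative Python sketch, not the dissertation's R implementation: `impute_censored` draws a left-censored value from the truncated normal conditional (the core step a Gibbs sampler repeats), and `em_censored_normal` iterates closed-form E- and M-steps built from truncated-normal moments. The function names, the detection limit, and the reduction to one variable are all assumptions made here for illustration.

```python
import math
import random
from statistics import NormalDist

def impute_censored(mu, sigma, limit, rng=random):
    """One Gibbs-style imputation: draw from N(mu, sigma^2) restricted
    to values below the detection limit, via the inverse-CDF method."""
    nd = NormalDist(mu, sigma)
    # Uniform draw over the probability mass of the censored (lower) tail;
    # clamped away from 0 so inv_cdf is always defined.
    u = max(1e-12, rng.uniform(0.0, nd.cdf(limit)))
    return nd.inv_cdf(u)

def em_censored_normal(observed, n_cens, limit, iters=200):
    """EM estimates of (mu, sigma) for normal data in which n_cens values
    are known only to lie below `limit` (left-censoring)."""
    n = len(observed) + n_cens
    mu = sum(observed) / len(observed)          # crude start from detects only
    sigma = math.sqrt(sum((x - mu) ** 2 for x in observed) / len(observed))
    std = NormalDist()
    for _ in range(iters):
        a = (limit - mu) / sigma
        lam = std.pdf(a) / std.cdf(a)           # inverse Mills ratio, lower tail
        e_x = mu - sigma * lam                  # E[X | X < limit]
        e_x2 = mu**2 + sigma**2 - sigma * (limit + mu) * lam  # E[X^2 | X < limit]
        # M-step: complete-data MLE with censored moments filled in
        mu = (sum(observed) + n_cens * e_x) / n
        s2 = (sum(x * x for x in observed) + n_cens * e_x2) / n - mu**2
        sigma = math.sqrt(max(1e-12, s2))
    return mu, sigma
```

In the bivariate setting studied in this dissertation, the truncated draw and the E-step moments would instead come from each variable's conditional distribution given the other, which is where the correlation enters.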
Both methods are applied to bivariate normal data and compared to each other and to two commonly used simple substitution methods with respect to bias and mean squared error of the resulting parameter estimates. The EM method is the most consistent for estimating all distributional and regression parameters across all levels of correlation and proportions of censoring. Both methods provide substantially better estimates of the correlation coefficient than the univariate methods.

To Bryce and Callie, with all my love. May you know the satisfaction and rewards of hard work and perseverance.

Acknowledgements

I would like to thank Dr. Marc Mills and Dr. Bryan Boulanger, my mentors at the U.S. Environmental Protection Agency, for all of their time and support during my traineeship, without which this dissertation would not exist. They have been a pleasure to work with and have provided invaluable encouragement and feedback. I would also like to thank my entire committee, and especially my advisor, Dr. Siva Sivaganesan, for taking the time to review my ideas and writing, for helpful and constructive feedback, and for allowing me the freedom to follow my own ideas and interests.

I am grateful for all of the friends who kept me sane along the entire road to this Ph.D., especially those traveling it with me. I have many fond memories of parties, travels, and long talks, all of which eased the stresses along the way. I am also grateful to the many caring professors in the Department of Mathematical Sciences. In particular, Don French has been a great friend as well as a talented teacher and generous tutor. Jim Deddens has always been especially encouraging and willing to help. Terri, Anita, Nancy, and Patti have all been very kind and helpful with a wide variety of administrative technicalities. Finally, I would like to express a very special thank you to my family.
My parents have been wonderful cheerleaders and have helped me out in countless ways on the way to this Ph.D. Bryce and Callie have not only put up with me, but have even turned into amazing young adults along the way.

This research was funded by the U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory and National Risk Management Research Laboratory, through a Graduate Research Training Grant.

Contents

List of Tables  x
List of Figures  xi

Chapter 1. Introduction  1
1.1. General Problem Statement  1
1.2. The Specific Analysis that Inspired this Research  7
1.3. Simplified Problem  12

Chapter 2. Literature Review and Analysis  14
2.1. Overview of Methods for Censored Data  14
2.2. Dropping Censored Values  15
2.3. Replacing Censored Values with Constants  17
2.4. Maximum Likelihood Methods  22
2.5. Regression Methods  29
2.6. Other Methods  30
2.7. EPA Guidelines  31
2.8. Extensions to Multivariate Distributions  33
2.9. Summary  36

Chapter 3. Gibbs Sampling Theory and Method Development  39
3.1. Bayesian Inference  39
3.2. Markov Chain Monte Carlo  45
3.3. Gibbs Sampler Algorithm  48

Chapter 4. Expectation Maximization Theory and Method Development  57
4.1. General Theory of Expectation Maximization  58
4.2. Expectation Maximization for Multivariate Normal Data with Left Censoring  59

Chapter 5. Implementation and Analysis of Methods  71
5.1. Assumptions and Data Transformation  71
5.2. Evaluation of Methods  73
5.3. R Program  75

Chapter 6. Results  77
6.1. Mean, Variance, and Correlation Coefficient in Normal Scale  78
6.2. Regression Parameters  101
6.3. Comparisons in Original Lognormal Scale  106
6.4. Results for Application of Methods to Lower Variance Distribution  107
6.5. Results for Larger Sample Sizes  112
6.6. Conclusions  115

Chapter 7. Discussion and Recommendations  119
7.1. MCMC Method  119
7.2. EM Method  122
7.3. Future Extensions of MCMC and EM Methods  127

References  129

Appendix.
R Code  133

List of Tables

1.1 Hormones Analyzed in Effluent Survey  9
1.2 Sample Correlation Coefficients Above .5  10
1.3 Proportions of 46 Values Reported as <MDL  11
2.1 Comparison of Two Simple Substitution Methods for a Lognormal Distribution such that Ln(x) is Distributed N(0,1)  23
2.2 Comparison of Two Simple Substitution Methods for a Lognormal Distribution such that Ln(x) is Distributed N(0,1/2)  23
2.3 Comparison of Two Simple Substitution Methods for a Lognormal Distribution such that Ln(x) is Distributed N(0,2)  23
6.1 Confidence Intervals for Bias Plots of Mean Estimates  79
6.2 Effects of Bias on Lognormal Parameters  107
7.1 Bias Introduced by Estimating Individual Variance Components Incorrectly  125
7.2 Bias Reduction with Multiple Imputations  126

List of Figures

2.1 Comparison of Tails for 10% Nondetects  21
2.2 Comparison of Tails for 30% Nondetects  22
6.1 Biases in Mean Estimates when ρ = 0.1  80
6.2 MSEs of Mean Estimates when ρ = 0.1  80
6.3 Biases in Mean Estimates when ρ = 0.5  81
6.4 MSEs of Mean Estimates when ρ = 0.5  81
6.5 Biases in Mean Estimates when ρ = 0.9  82
6.6 MSEs of Mean Estimates when ρ = 0.9  82
6.7 Biases in Mean Estimates when ρ = 0.1  83
6.8 MSEs of Mean Estimates when ρ = 0.1  83
6.9 Biases in Mean Estimates when ρ = 0.5  84
6.10 MSEs of Mean Estimates when ρ = 0.5  84
6.11 Biases in Mean Estimates when ρ = 0.9  85
6.12 MSEs of Mean Estimates when ρ = 0.9  85
6.13 Biases in Standard Deviation Estimates when ρ = 0.1  88
6.14 MSEs of Standard Deviation Estimates when ρ = 0.1  88
6.15 Biases in Standard Deviation Estimates when ρ = 0.5  89
6.16 MSEs of Standard Deviation Estimates when ρ = 0.5  89
6.17 Biases in Standard Deviation Estimates when ρ = 0.9  90
6.18 MSEs of Standard Deviation Estimates when ρ = 0.9  90
6.19 Biases in Standard Deviation Estimates when ρ = 0.1  91
6.20 MSEs of Standard Deviation Estimates when ρ = 0.1  91
6.21 Biases in Standard Deviation Estimates when ρ = 0.5  92
6.22 MSEs of Standard Deviation Estimates when ρ = 0.5  92
6.23 Biases in Standard Deviation Estimates when ρ = 0.9  93
6.24 MSEs of Standard Deviation Estimates when ρ = 0.9  93
6.25 Biases in Estimates of ρ when ρ = 0.1  95
6.26 MSEs of Estimates of ρ when ρ = 0.1  95
6.27 Biases in Estimates of ρ when ρ = 0.5  96
6.28 MSEs of Estimates of ρ when ρ = 0.5  96
6.29 Biases in Estimates of ρ when ρ = 0.9  97
6.30 MSEs of Estimates of ρ when ρ = 0.9  97
6.31 Biases in Estimates of ρ when ρ = 0.1  98
6.32 MSEs of Estimates of ρ when ρ = 0.1  98
6.33 Biases in Estimates of ρ when ρ = 0.5  99
6.34 MSEs of Estimates of ρ when ρ = 0.5  99
6.35 Biases in Estimates of ρ when ρ = 0.9  100
6.36 MSEs of Estimates of ρ when ρ = 0.9  100
6.37 Biases in Estimates of Intercept β₀ when ρ = 0.5  103
6.38 MSEs of Estimates of Intercept β₀ when ρ = 0.5  103
6.39 Biases in Estimates of Slope β₁ when ρ = 0.5  104
6.40 MSEs of Estimates of Slope β₁ when ρ = 0.5  104
6.41 Biases in Estimates of Slope β₂ when ρ = 0.5  105
6.42 MSEs of Estimates of Slope β₂ when ρ = 0.5  105
6.43 Biases in Estimates of Lognormal Mean when ρ = 0.5  108
6.44 MSEs of Estimates of Lognormal Mean when ρ = 0.5  108
6.45 Biases in Estimates of Lognormal Mean when ρ = 0.5  109
6.46 MSEs of Estimates of Lognormal Mean when ρ = 0.5  109
6.47 Biases in Estimates of Lognormal Variance when ρ = 0.5  110
6.48 MSEs of Estimates of Lognormal Variance when ρ = 0.5  110
6.49 Biases in Estimates of Lognormal Variance when ρ = 0.5  111
6.50 MSEs of Estimates of Lognormal Variance when ρ = 0.5  111
6.51 Biases in Estimates of Mean μ when ρ = 0.5 and Variances are σ² = 0.25  113
6.52 MSEs of Estimates of Mean μ when ρ = 0.5 and Variances are σ² = 0.25  113
6.53 Biases in Standard Deviation Estimates when ρ = 0.5 and Variances are σ² = 0.25  114
6.54 MSEs of Standard Deviation Estimates when ρ = 0.5 and Variances are σ² = 0.25  114
6.55 Biases in Mean Estimates for n = 100 when ρ = 0.5  116
6.56 MSEs of Mean Estimates for n = 100 when ρ = 0.5  116
6.57 Biases in Estimates of ρ for n = 100 when ρ = 0.5  117
6.58 MSEs of Estimates of ρ for n = 100 when ρ = 0.5  117
7.1 Autocorrelation vs.