Bayesian Decision-theoretic Methods for Parameter Ensembles with Application to Epidemiology
By Cedric E. Ginestet, under the supervision of Nicky G. Best, Sylvia Richardson and David J. Briggs. Doctoral thesis submitted to the Department of Epidemiology and Biostatistics, Imperial College London, in partial fulfilment of the requirements for the degree of Doctor of Philosophy. February 2011. arXiv:1105.5004v6 [math.ST], 18 Mar 2014.
Abstract
Parameter ensembles or sets of random effects constitute one of the cornerstones of modern statistical practice. This is especially the case in Bayesian hierarchical models, where several decision-theoretic frameworks can be deployed to optimise the estimation of parameter ensembles. The reporting of such ensembles in the form of sets of point estimates is an important concern in epidemiology, and most particularly in spatial epidemiology, where each element in these ensembles represents an epidemiological unit such as a hospital or a geographical area of interest. The estimation of these parameter ensembles may vary substantially depending on which inferential goals are prioritised by the modeller. Since one may wish to satisfy a range of desiderata, it is therefore of interest to investigate whether some sets of point estimates can simultaneously meet several inferential objectives.

In this thesis, we are especially concerned with identifying ensembles of point estimates that (i) produce good approximations of the true empirical quantiles and empirical quartile ratio (QR) and (ii) provide an accurate classification of the ensemble's elements above and below a given threshold. For this purpose, we review various decision-theoretic frameworks that have been proposed in the literature in relation to the optimisation of different aspects of the empirical distribution of a parameter ensemble. This includes the constrained Bayes (CB), weighted-rank squared error loss (WRSEL), and triple-goal (GR) ensembles of point estimates. In addition, we also consider the set of maximum likelihood estimates (MLEs) and the ensemble of posterior means, the latter being optimal under the summed squared error loss (SSEL).

Firstly, we test the performance of these different sets of point estimates as plug-in estimators for the empirical quantiles and empirical QR under a range of synthetic scenarios encompassing both spatial and non-spatial simulated data sets. Performance evaluation is here conducted using the posterior regret, which corresponds to the difference in posterior losses between the chosen plug-in estimator and the optimal choice under the loss function of interest. The triple-goal plug-in estimator is found to outperform its counterparts and produce close-to-optimal empirical quantiles and empirical QR. A real data set documenting schizophrenia prevalence in an urban area is also used to illustrate the implementation of these methods.

Secondly, two threshold classification losses (TCLs), weighted and unweighted, are formulated. The weighted TCL can be used to emphasise the estimation of false positives over false negatives, or the converse. These weighted and unweighted TCLs are optimised by a set of posterior quantiles and a set of posterior medians, respectively. Under an unweighted classification framework, the SSEL point estimates are found to be quasi-optimal for all scenarios studied. In addition, the five candidate plug-in estimators are also evaluated under the rank classification loss (RCL), which has been previously proposed in the literature. The SSEL set of point estimates is again found to constitute a quasi-optimal plug-in estimator under this loss function, approximately on a par with the CB and GR sets of point estimates. The threshold and rank classification loss functions are applied to surveillance data reporting methicillin-resistant Staphylococcus aureus (MRSA) prevalence in UK hospitals.
This application demonstrates that all the studied plug-in classifiers under TCL tend to be more liberal than the optimal estimator. That is, all studied plug-in estimators tend to classify a greater number of hospitals above the risk threshold than the set of posterior medians does. In a concluding chapter, we discuss some possible generalisations of the loss functions studied in this thesis, and consider how model specification can be tailored to better serve the inferential goals considered.

Contents
Abstract
Table of Contents
Acknowledgements

1 Introduction

2 Loss functions for Parameter Ensembles
2.1 Bayesian Decision Theory
2.1.1 Premises of Decision Theory
2.1.2 Frequentist and Bayesian Decision Theories
2.1.3 Classical Loss Functions
2.1.4 Functions of Parameters
2.2 Parameter Ensembles and Hierarchical Shrinkage
2.2.1 Bayesian Hierarchical Models
2.2.2 Spatial Models
2.2.3 Estimation of Parameter Ensembles
2.2.4 Hierarchical Shrinkage
2.3 Ranks, Quantile Functions and Order Statistics
2.3.1 Ranks and Percentile Ranks
2.3.2 Quantile Function
2.3.3 Quantiles, Quartiles and Percentiles
2.4 Loss Functions for Parameter Ensembles
2.4.1 Constrained Bayes
2.4.2 Triple-Goal
2.4.3 Weighted and Weighted Ranks Loss Functions
2.5 Research Questions

3 Empirical Quantiles and Quartile Ratio Losses
3.1 Introduction
3.2 Estimation of Empirical Quantiles and Quartile Ratio
3.2.1 Empirical Quantiles
3.2.2 Empirical Quartile Ratio
3.2.3 Performance Evaluation
3.3 Non-spatial Simulations
3.3.1 Design
3.3.2 Generative Models
3.3.3 Simulation Scenarios
3.3.4 Fitted Models
3.3.5 Plug-in Estimators under Q-SEL
3.3.6 Plug-in Estimators under QR-SEL
3.4 Spatial Simulations
3.4.1 Design
3.4.2 Spatial Structure Scenarios
3.4.3 Fitted Models
3.4.4 Plug-in Estimators under Q-SEL
3.4.5 Plug-in Estimators under QR-SEL
3.4.6 Consequences of Scaling the Expected Counts
3.5 Urban Distribution of Schizophrenia Cases
3.5.1 Data Description
3.5.2 Models Used and Performance Evaluation
3.5.3 Results
3.6 Conclusion

4 Threshold and Rank Classification Losses
4.1 Introduction
4.2 Classification Losses
4.2.1 Threshold Classification Loss
4.2.2 Rank Classification Loss
4.2.3 Posterior Sensitivity and Specificity
4.2.4 Performance Evaluation
4.3 Non-spatial Data Simulations
4.3.1 Parameter Ensembles and Experimental Factors
4.3.2 Plug-in Estimators under TCL
4.3.3 Plug-in Estimators under RCL
4.4 Spatial Simulations
4.4.1 Parameter Ensembles and Experimental Factors
4.4.2 Plug-in Estimators under Unweighted TCL
4.4.3 Plug-in Estimators under Weighted TCL
4.4.4 Plug-in Estimators under RCL
4.4.5 Consequences of Scaling the Expected Counts
4.5 MRSA Prevalence in UK NHS Trusts
4.5.1 Data Pre-processing
4.5.2 Fitted Model
4.5.3 TCL Classification
4.5.4 RCL Classification
4.6 Conclusion
4.7 Proof of TCL Minimisation

5 Discussion
5.1 Main Findings
5.2 Generalised Classification Losses
5.3 Modelling Assumptions

A Non-Spatial Simulations
B Spatial Simulations
C WinBUGS Codes
C.1 CAR Normal (BYM) Model
C.2 CAR Laplace (L1) Model
C.3 MRSA Model

References
Index
Abbreviations
AVL Absolute value error loss
DAG Directed Acyclic Graph
iid Independent and identically distributed
ind Independent
BHM Bayesian hierarchical model
BYM Besag, York and Mollié model
CAR Conditional autoregressive
CB Constrained Bayes
CDI Carstairs' deprivation index
CDF Cumulative distribution function
DoPQ Difference of posterior quartiles
EB Empirical Bayes
EDF Empirical distribution function
G-IG Gamma-Inverse Gamma model
GR Triple-goal (G for empirical distribution and R for rank)
ICC Intraclass correlation
IQR Interquartile range
IQR-SEL Interquartile range squared error loss
ISEL Integrated squared error loss
L1 Laplace-based Besag, York and Mollié model
MCMC Markov chain Monte Carlo
MLE Maximum likelihood estimate
MOR Median odds ratio
MRSA Methicillin-resistant Staphylococcus aureus
N-N Normal-Normal model
OR Odds ratio
pdf Probability density function
PM Posterior mean
Q-SEL Quantiles squared error loss
QR Quartile ratio
QR-SEL Quartile ratio squared error loss
RCL Rank classification loss
RoPQ Ratio of posterior quartiles
RR Relative risk
RSEL Rank squared error loss
SEL Squared error loss
SBR Smoothing by roughening algorithm
SSEL Summed squared error loss
WRSEL Weighted rank squared error loss
TCL Threshold classification loss
To my parents.
Acknowledgements
Writing a thesis is always an arduous process, and the present one has been particularly testing. However, during that long journey, I had the chance to receive support from distinguished scholars. For my entry into the world of statistics, I am greatly indebted to Prof. Nicky Best, my director of studies, who has entrusted me with the capacity to complete a PhD in biostatistics, and has demonstrated mountains of patience in the face of my often whimsical behaviour.
I also owe a great debt of gratitude to my second supervisor, Prof. Sylvia Richardson, who has helped me a great deal throughout the PhD process, and has been particularly understanding during the write-up stage. This revised thesis, however, would not be in its current form without the advice and recommendations of the two examiners –Prof. Angelika van der Linde and Dr. Maria De Iorio– and I am also extremely grateful to them. In particular, I would like to thank Prof. Angelika van der Linde for suggesting how to simplify the minimisation of the threshold classification loss (TCL).
This work was supported by a capacity building scholarship awarded by the UK Medical Research Council. In terms of technical resources, I also wish to acknowledge the use of the Imperial College High Performance Computing (HPC) Service. In particular, I would like to thank Simon Burbidge for his help in using the HPC. In addition, I have also benefited from the help of James Kirkbride from the Department of Psychiatry in Cambridge University, who provided the schizophrenia prevalence data set.
Lastly, my greatest thanks go to Sofie Davis, who has been a trove of patience and encouragement throughout the last four years.
Chapter 1
Introduction
An important concern in epidemiology is the reporting of ensembles of point estimates. In disease mapping, for example, one may wish to ascertain the levels of risk for cancer in different regions of the British Isles (Jarup et al., 2002), or evaluate cancer mortality rates in different administrative areas (Lopez-Abente et al., 2007). In public health, one may be interested in comparing different service providers such as neonatal clinics (MacNab et al., 2004). Estimation of parameter ensembles may also be of interest for performance indicators, such as when compiling league tables (Goldstein and Spiegelhalter, 1996). Thus, one of the fundamental tasks of the epidemiologist is the summarising of data in the form of an ensemble of summary statistics, which constitutes an approximation of a 'parameter ensemble'. Such a task, however, is complicated by the variety of goals that such an ensemble of point estimates has to fulfil. A primary goal, for instance, may be to produce element-specific point estimates, which optimally reflect the individual level of risk in each area. Alternatively, one may be required to select the ensemble of point estimates that best approximates the histogram of the true parameter ensemble (Louis, 1984). A related but distinct desideratum may be to choose the set of point estimates that gives a good approximation of the true heterogeneity in the ensemble. This is especially important from a public health perspective, since unexplained dispersion in the ensemble of point estimates may indicate the effect of unmeasured covariates. Naturally, no single set of point estimates can simultaneously optimise all of these criteria. Moreover, reporting several ensembles of point estimates corresponding to different desiderata can lead to inconsistencies, which generally leads epidemiologists to report only a single set of point estimates when communicating their results to the public or to decision makers. There is therefore a need to consider how certain ensembles of point estimates can satisfy several epidemiological goals at once.

A natural statistical framework within which these issues can be addressed is Bayesian decision theory. This approach relies on the formulation of a particular loss function and the fitting of a Bayesian model to the data of interest. The former formalises one's inferential goals, whereas the latter provides the joint posterior distribution of the parameters of interest, which is then used in the optimisation of the loss function. Once these two ingredients are specified, standard arguments in decision theory imply that an optimal set of point estimates can be obtained by minimising the posterior expected loss function. In spatial epidemiology, the use of Bayesian methods, thanks to the now wide availability of computational resources, has increasingly become the procedure of choice for the estimation of small-area statistics (Lawson, 2009). This has paralleled an increase in the amount of multilevel data routinely collected in most fields of public health and in the social sciences. The expansion of Bayesian methods has especially been motivated by an increased utilisation of hierarchical models, which are characterised by the use of hierarchical priors (Best et al., 1996, Wakefield et al., 2000, Gelman et al., 2004). This family of models has the desirable property of borrowing strength across different areas, which reduces the variability of each point estimate in the ensemble.
Such Bayesian models are generally used in conjunction with the summed squared error loss (SSEL) function, whose optimal estimator is the set of posterior means of the elements of the parameter ensemble of interest. The SSEL is widely used in most applications because it produces estimators that are easy to handle computationally and often readily available from MCMC summaries. Despite being the most popular loss function in both spatial and non-spatial epidemiology, this particular choice of loss function remains criticised by several authors (Lehmann and Casella, 1995, Robert, 1996). In particular, the use of the quadratic loss has been challenged by researchers demonstrating that the posterior means tend to overshrink the empirical distribution of the ensemble's elements (Louis, 1984). The empirical variance of the ensemble of point estimates under SSEL can indeed be shown to represent only a fraction of the true empirical variance of the unobserved parameters (Ghosh, 1992, Richardson et al., 2004).

Due to these limitations, and motivated by a need to produce parameter ensemble estimates that satisfy other epidemiological desiderata, several authors have suggested the use of alternative loss functions. Specifically, Louis (1984) and Ghosh (1992) have introduced a constrained loss function, which produces sets of point estimates that match both the empirical mean and the empirical variance of the true parameter ensemble. Other authors have made use of the flexibility of the ensemble distribution to optimise the estimation of certain parts of the empirical distribution to the detriment of the remaining ones. This has been done by specifying a set of weights φ, which emphasise the estimation of a subset of the elements of the target ensemble (Wright et al., 2003, Craigmile et al., 2006). A particularly successful foray into simultaneously satisfying several inferential objectives was achieved by Shen and Louis (1998), who proposed the use of triple-goal ensembles of point estimates. These sets of point estimates constitute a good approximation of the empirical distribution of the parameter ensemble. Moreover, their ranks are close to the optimal ranks. Finally, they also provide good estimates of the element-specific levels of risk.

Two specific goals, however, do not appear to have hitherto been fully addressed in the literature on Bayesian decision theory. These are (i) the estimation of the empirical quantiles and the empirical quartile ratio (QR) of a parameter ensemble of interest, and (ii) the optimisation of the classification of the elements of an ensemble above or below a given threshold. The first objective is closely related to the need to evaluate the amount of dispersion in a parameter ensemble. The QR indeed constitutes a good candidate for quantifying the variability of the elements in the ensemble, which can then be related to the presence or absence of unmeasured risk factors. While some epidemiologists have considered the problem of choosing a specific measure of dispersion for parameter ensembles (Larsen et al., 2000, Larsen and Merlo, 2005), these methods have been formulated in a frequentist framework, and little work has been conducted from a Bayesian perspective. The second goal, which we wish to explore in this thesis, is the classification of the elements of a parameter ensemble. Increased interest in performance evaluation and the development of league tables in education and health has led to the routine gathering of surveillance data, which make it possible to trace the evolution of particular institutions over time.
Despite the range of drawbacks associated with these methods (Goldstein and Spiegelhalter, 1996), a combination of factors has made the use of surveillance data particularly popular. Excess mortality rates following paediatric cardiac surgery at the Bristol Royal Infirmary, for instance, have raised awareness about the significance of this type of data (Grigg et al., 2003). The Shipman inquiry, in addition, has stressed the need for closer monitoring of mortality rates in general practices in the UK (Shipman-Inquiry, 2004). These factors have been compounded by public and political attention to hospital-acquired infections such as methicillin-resistant Staphylococcus aureus (MRSA) or Clostridium difficile (Grigg and Spiegelhalter, 2007, Grigg et al., 2009). Such developments are reflected by the recent creation of a new journal entitled Advances in Disease Surveillance, published by the International Society for Disease Surveillance in 2007. While some statistical work has focused on the monitoring of disease counts over time (Spiegelhalter, 2005, Grigg et al., 2009), few authors have considered the problem of classifying the elements of a parameter ensemble into low- and high-risk groups (Richardson et al., 2004). Such classifications, however, may be particularly useful for the mapping of risk in spatial epidemiology, where different groups of areas could be assigned different colours according to each group's level of risk.

However, while the estimation of the empirical quantiles and empirical QR, or the classification of the elements of a parameter ensemble, can be conducted optimally, these optimal estimators may fail to meet other inferential goals. One of the central themes of this thesis will therefore be to identify sets of point estimates that can simultaneously satisfy several inferential objectives. For this purpose, we will study the behaviour of various plug-in estimators under the specific loss functions of interest. Plug-in estimators are computed by applying a particular function to a candidate set of point estimates. In order to evaluate the performance of each of these candidate ensembles in comparison to the optimal choice of estimator under the loss functions of interest, we will compute the posterior regret associated with the use of that specific candidate plug-in estimator. We will compare different plug-in estimators using spatial and non-spatial simulations, as well as two real data sets documenting schizophrenia and MRSA prevalence.

This thesis is organised as follows. In chapter 2, we describe the general principles of decision theory and present the specific modelling assumptions commonly made in epidemiology and spatial epidemiology. Several loss functions, which have been proposed for the estimation of parameter ensembles in hierarchical models, are introduced with their respective optimal estimators. Estimators resulting from these loss functions will be used throughout the thesis as plug-in estimators under other loss functions of interest. Chapter 3 is specifically dedicated to the optimisation of the empirical quantiles and empirical QR of a parameter ensemble, which allow us to evaluate the amount of dispersion in the ensemble distribution. In chapter 4, we consider the issue of classifying the different elements of a parameter ensemble above or below a given threshold of risk, which particularly pertains to the analysis of surveillance data. Finally, in chapter 5, we consider some possible extensions and generalisations of the techniques proposed in this thesis, with special emphasis on the specification of tailored Bayesian models, which may better serve the target inferential goals.
Chapter 2
Loss functions for Parameter Ensembles
Summary In this introductory chapter, we briefly introduce the general decision-theoretic framework used in Bayesian statistics, with special attention to point estimation problems. We present three commonly used loss functions: the squared-error loss, the absolute error loss and the 0/1-loss. A working definition of Bayesian hierarchical models is then provided, including the description of three specific families of models characterised by different types of prior structure, which will be used throughout the thesis. The concept of a parameter ensemble in a hierarchical model is also defined with some emphasis on the optimal estimation of the functions of parameter ensembles. Issues related to hierarchical shrinkage are reviewed with a brief discussion of the Ghosh-Louis theorem. Several commonly adopted decision-theoretic approaches to the estimation of a parameter ensemble are also introduced, including the constrained Bayes estimator (CB), the triple-goal estimator (GR) and the weighted-rank squared-error loss estimator (WRSEL). Finally, we discuss plug-in estimators, which consist of functions of an ensemble of point estimates. Differences between such plug-in estimators and the optimal estimators of various functions of parameter ensembles will be the object of most of the thesis at hand. In particular, the performance of optimal and plug-in estimators will be studied within the context of two inferential objectives relevant to epidemiological practice: (i) the estimation of the dispersion of parameter ensembles and (ii) the classification of the elements of a parameter ensemble above or below a given threshold. We conclude with a description of the posterior regret, which will be used throughout the thesis as a criterion for performance evaluation.
2.1 Bayesian Decision Theory
In this section, we briefly introduce the premises of decision theory with special attention to point estimation. We also consider the differences between the frequentist and the Bayesian approach to the problem of point estimation, and describe three classical loss functions. A discussion of the specific issues arising from the estimation of a function of the model's parameters is also given, as this is especially relevant to the thesis at hand. Note that, throughout this chapter and the rest of this thesis, we will not emphasise the niceties of measure theory, but restrict ourselves to the level of formality and the notation introduced by Berger (1980) and Robert (2007). Unless otherwise specified, all random variables in this thesis will be assumed to be real-valued.

2.1.1 Premises of Decision Theory

Decision theory formalises the statistical approach to decision making. Here, decision making should be understood in its broadest sense. The estimation of a particular parameter, for instance, constitutes a decision, as we opt for a specific value among a set of possible candidates. The cornerstone of decision theory is the specification of a utility function. The main strength and appeal of decision theory is that once a utility function has been selected, and a set of alternative options has been delineated, the decision problem is then fully specified and the optimal decision can be derived. A rigorous approach to decision theory is based on the definition of three spaces. Firstly, let Θ denote the space of the true states of the world. Secondly, the decision space, denoted D, comprises the decisions (sometimes referred to as acts, actions or choices) that one can take. A third space encompasses the consequences of a particular course of action. These are often expressed in monetary terms, and for that reason are generally termed rewards. This last space will be denoted Z. The true states of the world, the decision space and the space of consequences are linked together by a loss function, defined as follows,
L : Θ × D → Z,    (2.1)

where × denotes the Cartesian product. A loss function is thus a criterion for evaluating a possible procedure or action δ ∈ D, given some true state of the world θ ∈ Θ. This loss function takes values in the space of consequences Z. A decision problem is therefore fully specified when the above four ingredients are known: (Θ, D, Z, L). Note that, in axiomatic treatments of decision theory, the aforementioned quadruple is generally replaced by (Θ, D, Z, ⪯), where ⪯ is a total ordering on the space of consequences. Provided several properties of ⪯ hold (e.g. transitivity, asymmetry), the existence of a loss function L can be demonstrated. Decision theory originated in the context of game theory (von Neumann and Morgenstern, 1944), and was adapted to statistics by Savage (1954). The formalisation of a decision problem as the quadruple (Θ, D, Z, ⪯) is probably the most accepted definition (Fishburn, 1964, Kreps, 1988, DeGroot, 1970, Robert, 2007), although some other authors also include the set of all possible experiments, E, in the definition of a decision problem (Raiffa and Schlaifer, 1960). In this thesis, we will be especially interested in parameter estimation. Since the space of the true states of the world is the set of values that the parameter of interest can take, we will simply refer to Θ as the parameter space. In estimation problems, the space of decisions, denoted D, is generally taken to be identical to the parameter space, Θ. In addition, both spaces are also usually defined as subsets of the real line. That is, we have
D = Θ ⊆ R. (2.2)
For convenience, we will assume in this section that θ is univariate, and study the multivariate case in section 2.2.3. Moreover, the space of consequences is defined as the positive half of the real line, [0, +∞). Thus, when considering point estimation problems, such as the ones of interest in the present thesis, a decision problem will be defined as (Θ, Θ, [0, +∞), L), with L satisfying
L : Θ × Θ → [0, +∞).    (2.3)
Although our discussion has centred on the specification of a particular loss function, historically, the definition of a total ordering on Z has been associated with the choice of an arbitrary utility function (see von Neumann and Morgenstern, 1944). However, the use of an unbounded utility leads to several problems, as exemplified by the Saint Petersburg paradox. (This paradox involves a gamble with an infinite expected gain, thereby leading to the conclusion that an arbitrarily high entrance fee should be paid to participate in the game; see Robert, 2007, for a full description.) As a result, the codomain of the utility function is usually taken to be bounded above, with U : Θ × Θ → (−∞, 0]. This gives the following relationship with our preceding definition of
the loss function in estimation problems,
U(θ, δ) = −L(θ, δ). (2.4)
In the context of decision theory, point estimation problems can be naturally conceived as games, where the player attempts to minimise her losses (see Berger, 1980). Such losses, however, are associated with some level of uncertainty. That is, there exists a space P of probability distributions on Z. The total ordering on the space of consequences Z can then be transposed onto the probability space P, using the expectation operator. An axiomatic construction of a total ordering on Z leading to the proof of the existence of a utility or loss function can be found in several texts (see Savage, 1954, Fishburn, 1964, Kreps, 1988), or see Robert (2007) for a modern treatment. The purpose of the game is therefore to minimise one's loss by selecting the optimal decision, δ∗, which is defined as follows,
δ∗ := argmin_{δ ∈ D} E[L(θ, δ)],    (2.5)

where := indicates that the LHS is defined as the RHS. Here, the two schools of thought in statistics differ in their handling of the expectation operator. Frequentist and Bayesian statisticians have traditionally chosen different types of expectations, as we describe in the next section.
2.1.2 Frequentist and Bayesian Decision Theories

From a statistical perspective, the specification of the space of the states of the world Θ and the decision space D is complemented by the definition of a sample of observations, which we will denote by y := {y1, . . . , yn}, where y ∈ Y ⊆ Rⁿ. We assume that this sample was generated from a population distribution p(y|θ), with θ ∈ Θ. Our decision δ is therefore computed on the basis of this sample, and we will denote this dependency by δ(y). Following Robert (2007), we will use the term estimator to refer to the function δ(·). That is, an estimator is the following mapping,
δ : Y → D,    (2.6)

where D is taken to be equivalent to the space of states of the world, Θ. An estimate, by contrast, will be a single value in D = Θ, given some observation y. From a frequentist perspective, the experiment that has generated the finite sample of observations y is assumed to be infinitely repeatable (Robert, 2007). Under this assumption of repeatability, the optimal decision rule δ can be defined as the rule that minimises the expectation of the loss function with respect to the (unknown) population distribution p(y|θ). This gives

R_F[θ, δ(y)] := ∫_Y L(θ, δ(y)) p(y|θ) dy.    (2.7)
The quantity R_F is called the frequentist risk, whereas p(y|θ) can be regarded as the likelihood function of traditional statistics, evaluated at the true state of the world, θ ∈ Θ. In the context of Bayesian theory, by contrast, we specify a prior distribution on the space of the states of the world, Θ. This distribution, denoted p(θ), reflects our uncertainty about the true state of the world. Given these assumptions and the specification of a particular prior, it is then possible to integrate over the parameter space. This gives the following Bayesian version of equation (2.7),

R_B[p(θ), δ(y)] := ∫_Θ ∫_Y L(θ, δ(y)) p(y|θ) p(θ) dy dθ,    (2.8)

which is usually termed the Bayes risk. Note that R_B takes the prior distribution p(θ) as an argument, since its value entirely depends on the choice of prior distribution on θ. The optimal decision δ∗ that minimises R_B in equation (2.8) can be shown (using Fubini's theorem and Bayes' rule; see Robert, 2007, for a proof) to be equivalent to the argument that minimises the posterior expected loss, which is defined as follows,

ρ[p(θ|y), δ(y)] := ∫_Θ L(θ, δ(y)) p(θ|y) dθ,    (2.9)

where p(θ|y) is the posterior distribution of θ, obtained via Bayes' rule,

p(θ|y) = p(y|θ) p(θ) / ∫_Θ p(y|θ) p(θ) dθ.    (2.10)
Since minimisation of ρ for all y is equivalent to the minimisation of R_B, we will generally use the terms Bayes risk and posterior (expected) loss interchangeably. The optimal decision under both the Bayes risk and the posterior loss is termed the Bayes choice, action or decision. For notational convenience, we will use the following shorthand for the posterior loss,
ρ[θ, δ(y)] := ρ[p(θ|y), δ(y)],    (2.11)

where the dependence of ρ on the posterior distribution p(θ|y), and the dependence of the decision δ(y) on the data, are made implicit. In addition, since in this thesis we adopt a Bayesian approach to statistical inference, all expectation operators will be defined with respect to the appropriate posterior distribution of the parameter of interest, generally denoted θ, except when otherwise specified.

The defining condition for the Bayes choice to exist is that the Bayes risk is finite. In the preceding discussion, we have assumed that the prior distribution on θ is proper. When specifying improper priors, the Bayes risk will, in general, not be well defined. In some cases, however, when an improper prior yields a proper posterior, the posterior expected loss will be well defined. The decision that minimises ρ in that context is then referred to as the generalised Bayes estimator. Several commonly used models in spatial epidemiology make use of improper prior distributions, which nonetheless yield proper posteriors. In such cases, we will use the generalised Bayes estimator in place of the Bayes choice.

We thus readily see why decision theory and Bayesian statistics form such a successful combination. The use of a prior distribution in Bayesian models completes the assumptions about the space of the states of the world in decision theory. The frequentist perspective, by contrast, runs into difficulties by conditioning on the true parameter θ. As a result, the decisions that minimise R_F are conditional on θ, which is unknown. In a Bayesian setting, by contrast, the expected loss is evaluated conditional on y, which does not constitute a problem since such observations are known (see also Robert, 2007, for a discussion).

Decision theory puts no constraints on the nature of the loss function that one may utilise. It should be clear from the foregoing discussion that the specification of the loss function will completely determine which decision in D is considered to be optimal, everything else being equal. In practice, the choice of a particular loss function depends on the needs of the decision maker, and on which aspects of the decision problem are of interest. However, to facilitate comparisons between different estimation problems, there exists a set of loss functions which are widely accepted and used throughout the statistical sciences. These classical loss functions are described in the following section.
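In practice, the posterior expected loss in equation (2.9) is rarely available in closed form and is instead approximated from MCMC output. The following is a minimal sketch of that idea, assuming a set of posterior draws is available; the gamma "posterior" sample, the grid of candidate decisions and the function name are purely illustrative assumptions.

```python
import numpy as np

def bayes_action(posterior_draws, loss, candidates):
    """Approximate the Bayes action argmin_d E[L(theta, d) | y] by
    averaging the loss over posterior draws and minimising over a
    grid of candidate decisions."""
    # expected_loss[j] ~ (1/S) * sum_s L(theta_s, d_j)
    expected_loss = np.array(
        [np.mean(loss(posterior_draws, d)) for d in candidates]
    )
    return candidates[np.argmin(expected_loss)]

# Illustration with hypothetical posterior draws for a scalar theta.
rng = np.random.default_rng(1)
draws = rng.gamma(shape=3.0, scale=1.0, size=5000)   # stand-in posterior sample
grid = np.linspace(draws.min(), draws.max(), 501)

sel = bayes_action(draws, lambda t, d: (t - d) ** 2, grid)   # close to the posterior mean
avl = bayes_action(draws, lambda t, d: np.abs(t - d), grid)  # close to the posterior median
print(sel, np.mean(draws))
print(avl, np.median(draws))
```

Up to Monte Carlo and grid error, the numerical minimisers recover the closed-form Bayes choices discussed in the next section.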
2.1.3 Classical Loss Functions

There exist three classical loss functions, which constitute the building blocks of many other loss functions, and are therefore of key importance to our development. They are the following: (i) the squared error loss, (ii) the absolute value error loss and (iii) the 0/1-loss. We review them in turn, alongside their corresponding minimisers. These three loss functions are especially useful for estimation problems, and will therefore be described with respect to some known sample of observations, y.
Squared Error Loss

The quadratic loss or squared error loss (SEL) is the most commonly used loss function in the statistical literature. It is defined as the squared Euclidean distance between the true state of the world θ and the candidate estimator δ(y). Thus, we have

SEL(θ, δ(y)) := (θ − δ(y))².    (2.12)
In a Bayesian context, this loss function can be minimised by integrating SEL(θ, δ(y)) with respect to the posterior distribution of θ, and minimising the resulting posterior expected loss with respect to δ(y). Since the SEL is strictly convex, there exists a unique minimiser of the posterior expected SEL, which is the posterior mean, E[θ|y]. We will usually denote E[θ|y] by θSEL when comparing that particular estimator with other estimators based on different loss functions.
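For completeness, the optimality of the posterior mean follows from the standard variance decomposition of the posterior expected SEL (a routine argument, sketched here in the notation above):

E[(θ − δ)² | y] = E[(θ − E[θ|y])² | y] + (E[θ|y] − δ)² = Var(θ|y) + (E[θ|y] − δ)²,

since the cross term has zero posterior expectation. The right-hand side is minimised, with value Var(θ|y), at δ = E[θ|y], so the posterior mean is the Bayes choice under SEL.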
Absolute Value Error Loss

Another classical loss function is the absolute value error loss (AVL), which makes use of the modulus to quantify one's losses. It is defined as follows,

AVL(θ, δ(y)) := |θ − δ(y)|,    (2.13)

and its minimiser in the context of Bayesian statistics is the posterior median, which will be denoted θAVL. It can be shown that the posterior median is not the unique minimiser of the posterior expected AVL (see Berger, 1980). In addition, since the posterior median will be found to be the optimal Bayes choice under other loss functions, it will be useful to denote this quantity without any reference to a specific loss function. We will therefore use θMED to denote the posterior median in this context. This quantity will be discussed in greater detail in chapter 4, when considering the classification of the elements of a parameter ensemble.
0/1-Loss

For completeness, we also introduce the 0/1-loss function. It is defined as follows,

L_{0/1}(θ, δ(y)) = 0 if δ(y) = θ, and 1 if δ(y) ≠ θ.    (2.14)

The optimal minimiser of the 0/1-loss under the posterior distribution of θ is the posterior mode. In the sequel, we will drop any reference to the data y when referring to the estimator δ(y), and simply use the notation δ. We now turn to a more specific aspect of loss function optimisation, which will be especially relevant to the research questions addressed in this thesis.
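As a small numerical illustration, the three classical Bayes choices can be read directly off a sample of posterior draws. The log-normal "posterior" below is a stand-in chosen only so that the three estimates visibly differ, and the histogram-based mode is a crude approximation.

```python
import numpy as np

# Fabricated right-skewed posterior sample for a scalar parameter.
rng = np.random.default_rng(0)
draws = rng.lognormal(mean=0.0, sigma=0.75, size=20000)

posterior_mean = np.mean(draws)      # optimal under SEL
posterior_median = np.median(draws)  # optimal under AVL

# The 0/1-loss optimum is the posterior mode; from samples of a continuous
# parameter it can only be approximated, here via a simple histogram.
counts, edges = np.histogram(draws, bins=100)
posterior_mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print(posterior_mean, posterior_median, posterior_mode)  # mean > median > mode here
```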
2.1.4 Functions of Parameters

It will be of interest to consider the estimation of functions of parameters, say h(θ), where h is some mapping from Θ to itself, or possibly to a subset of the true states of the world. For convenience, we will assume in this section that θ is univariate, and study the multivariate case in section 2.2.3. In such cases, the Bayesian decision problem can be formulated as follows. Choose the δ∗ defined as

δ∗ := argmin_{δ} ρ(h(θ), δ),    (2.15)

where the minimisation is conducted over the decision space, here defined as D ⊆ Θ. When the loss function of interest is the SEL, we then have the following equality, for any arbitrary function h(·) of the parameter θ,

E[h(θ)|y] = argmin_{δ} E[(h(θ) − δ)² | y].    (2.16)
Figure 2.1. Directed Acyclic Graph (DAG) for a general hierarchical model, with y = {y1, . . . , yn} denoting n observations and θ = {θ1, . . . , θn} denoting the parameter ensemble of interest. The prior distribution of each θi is controlled by a vector of hyperparameters ξ, which is given a hyperprior. Following standard convention, arrows here indicate direct dependencies between random variables.
That is, the optimal estimator δ∗ ∈ D of h(θ) is the posterior expectation of h(θ). Here, the decision problem is fully specified by defining both the parameter space Θ and the decision space D as the codomain of h(·). If, in addition, h(·) is a linear function, it then follows from the linearity of the expectation that

E[h(θ)|y] = h(E[θ|y]).    (2.17)

However, for non-linear h(·), this relationship does not hold. Letting ρ denote an arbitrary Bayesian expected loss for some loss function L, we then have the following non-equality,

δ∗ := argmin_{δ} ρ(h(θ), δ) ≠ h(argmin_{δ} ρ(θ, δ)) =: h(θ∗).    (2.18)

Much of this thesis will be concerned with the differences between δ∗ and h(θ∗), for particular ρ's. An estimator of h(θ) based on an estimator of θ will be referred to as a plug-in estimator. Thus, in equation (2.18), h(θ∗) is the plug-in estimator of h(θ). Our use of the term plug-in here should be distinguished from its utilisation in reference to the plug-in principle introduced by Efron and Tibshirani (1993) in the context of the bootstrap. For Efron and Tibshirani (1993), plug-in estimators have desirable asymptotic properties in the sense that such estimators cannot be asymptotically improved upon. In the present thesis, however, estimators such as h(θ∗) are not deemed to be necessarily optimal, except in the sense that θ∗ is the Bayes action for some ρ(θ, δ). Specifically, it will be of interest to evaluate whether commonly encountered optimal Bayes actions, such as the posterior means and medians, can be used to construct quasi-optimal plug-in estimators. This problem especially arises when the true parameter of interest is an ensemble of parameters, as described in the next section.
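The practical gap between the two sides of equation (2.18) is easy to exhibit with posterior samples. In the following minimal sketch, the normal "posterior" and the choice h(θ) = exp(θ) are illustrative assumptions only; for a convex h, Jensen's inequality implies that the plug-in value is systematically too small.

```python
import numpy as np

# Hypothetical posterior draws for a scalar parameter theta; the normal
# "posterior" is a stand-in chosen only to illustrate the point.
rng = np.random.default_rng(2)
draws = rng.normal(loc=0.5, scale=0.4, size=20000)

h = np.exp  # a non-linear function of the parameter

optimal = np.mean(h(draws))   # E[h(theta)|y], the optimal SEL estimate of h(theta)
plug_in = h(np.mean(draws))   # h(E[theta|y]), the plug-in estimator

# For convex h, Jensen's inequality implies plug_in < optimal.
print(f"E[h(theta)|y] ~ {optimal:.4f}   h(E[theta|y]) ~ {plug_in:.4f}")
```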
2.2 Parameter Ensembles and Hierarchical Shrinkage
The notion of a parameter ensemble will be the object of most of our discussion in the ensuing two chapters. In this section, we define this concept for general Bayesian models as well as for specific hierarchical models that will be considered throughout the thesis. In addition, we present the Ghosh-Louis theorem, which originally motivated research on the specific inferential problems associated with parameter ensembles.

2.2.1 Bayesian Hierarchical Models

Bayesian hierarchical models (BHMs) can be conceived as Bayesian extensions of traditional mixed models, sometimes called multilevel models, where both fixed and random effects are included in a generalized linear model (see Demidenko, 2004). In its most basic formulation, a BHM is composed of the following two layers of random variables,
yi ∼ind p(yi | θi, σi),    g(θ) ∼ p(θ | ξ),    (2.19)

for i = 1, . . . , n, and where g(·) is a transformation of θ, which may be defined as a link function as commonly used in generalised linear models (see McCullagh and Nelder, 1989). The probability density function at the first level of the BHM, p(yi|θi, σi), is the likelihood function. The joint distribution on the g(θ), here denoted p(θ|ξ), is a multivariate prior, which is directly specified as a distribution on the transformed θ. Different choices of transformation, g(·), will hence induce different joint prior distributions. The specification of a BHM is complete when one fixes values for the vector of hyperparameters, ξ, or specifies a hyperprior distribution for that quantity. When the θi's are independent and identically distributed (iid) given ξ, we obtain the hierarchical structure described in figure 2.1, which is represented using a directed acyclic graph (DAG). By extension, we will refer to the prior on such θi's as an iid prior. In the sequel, we will assume that the n vectors of nuisance parameters σi are known, albeit in practice this may not be the case. Typically, such σi's will include the sampling variance of each yi. In its simplest form, the model in equation (2.19) will be assumed to be composed of conjugate and proper probability density functions, and each θi will be an iid draw from a hierarchical prior. These modelling assumptions, however, will sometimes be relaxed in the present thesis. In particular, we will be interested in the following three variants.

i. Conjugate proper iid priors on the θi's, where the link function g(·) is the identity function. This simple case will encompass both the Normal-Normal model (sometimes referred to as the compound Gaussian model) and the compound Gamma or Gamma-Inverse Gamma model. For the former, we have the following hierarchical structure,
yi ∼ind N(θi, σi²),    θi ∼iid N(µ, τ²),    (2.20)

with i = 1, . . . , n, and where the σi²'s are known variance components. Secondly, for the Gamma-Inverse Gamma model, we specify the following likelihood and prior,
yi ∼ind Gam(ai, θi),    θi ∼iid Inv-Gam(α, β),    (2.21)

where the Gamma and Inverse-Gamma distributions will be specifically described in section 3.3.2.

ii. Non-conjugate proper iid priors on the θi's. In standard epidemiological settings, the specification of a Poisson likelihood for the observations naturally leads to the modelling of the logarithm of the relative risks at the second level of the hierarchy. Such a model is sometimes referred to as a log-linear model. In such cases, however, the conjugacy of the likelihood with the prior on the θi's no longer holds. As an example of such non-conjugacy, we will study the following model,
yi ∼ind Pois(θi Ei),    log θi ∼iid N(α, τ²),    (2.22)

for i = 1, . . . , n, where the link function g(·) := log(·). Here, the conjugacy of the prior with the Poisson likelihood does not hold, and specific sampling schemes need to be adopted in order to evaluate the posterior distributions of the θi's (see Robert and Casella, 2004).

iii. Non-conjugate improper non-iid priors on the θi's. In this final situation, all assumptions on the prior distributions of the θi's will be relaxed. This type of model will be illustrated
in the context of spatial epidemiology, where Poisson-based generalized linear models are commonly used to model counts of disease cases in each of a set of geographical areas, and the joint prior distribution on the θi's models the spatial dependence between the regions of interest. A popular choice of prior distribution reflecting inter-regional spatial dependence is the intrinsic Gaussian conditional autoregressive (CAR) prior, an instance of a Markov random field. The intrinsic CAR does not constitute a proper distribution (Besag, 1974). However, Besag et al. (1991) have shown that the resulting marginal posterior distributions of the θi's are proper (see also Besag and Kooperberg, 1995). It then follows that the posterior expected loss for the parameter ensemble is finite, and that we can therefore derive the generalised Bayes decision, which will be optimal for that decision problem. We describe this particular family of spatial models in more detail in the following section.
In this thesis, when studying the properties of such epidemiological models, we will employ the term prevalence to refer to the rate of a particular condition in the population at risk. That is, the term prevalence here refers to the number of affected cases divided by the total number of individuals at risk for that condition.
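As an illustration of the first variant, the Normal-Normal model in equation (2.20) admits closed-form posterior means when µ and τ² are treated as fixed and known. The sketch below uses fabricated data and hyperparameter values purely for illustration; in practice, µ and τ² would typically receive hyperpriors and the model would be fitted by MCMC.

```python
import numpy as np

def normal_normal_posterior_means(y, sigma2, mu, tau2):
    """Posterior means of theta_i for the conjugate model
       y_i ~ N(theta_i, sigma2_i),  theta_i ~ N(mu, tau2),
    with known hyperparameters: a precision-weighted compromise
    between each observation y_i and the prior mean mu."""
    w = tau2 / (tau2 + sigma2)          # shrinkage weight on the data
    return w * y + (1.0 - w) * mu, w

# Fabricated data: noisy observations around a common prior mean.
rng = np.random.default_rng(3)
sigma2 = rng.uniform(0.5, 2.0, size=10)     # known sampling variances
theta = rng.normal(0.0, 1.0, size=10)       # "true" ensemble (unobserved)
y = rng.normal(theta, np.sqrt(sigma2))

post_means, weights = normal_normal_posterior_means(y, sigma2, mu=0.0, tau2=1.0)
# Posterior means lie between y and mu; noisier units are shrunk more.
print(np.round(y, 2))
print(np.round(post_means, 2))
print(np.round(weights, 2))
```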
2.2.2 Spatial Models

Here, we briefly present some conventional modelling assumptions made in spatial epidemiology, which will be used throughout the thesis (Wakefield et al., 2000). The starting point for modelling a non-infectious disease with known at-risk populations is the Binomial model,
yij |pij ∼ Bin(Nij , pij ), i = 1, . . . , n; j = 1, . . . , J, (2.23)
where pij and Nij represent, respectively, the risk of disease and the population size in the i-th area for the j-th age stratum. In this thesis, we are especially interested in modelling the prevalence of rare non-infectious diseases, such as specific types of cancers. For rare conditions, we approximate this model with a Poisson model,

yij | pij ∼ Pois(Nij pij).    (2.24)
Furthermore, we generally make a proportionality assumption, which states that pij = θi × pj, where θi is the relative risk (RR) for the i-th area, and pj is the reference rate for age stratum j, which is assumed to be known. Each θi is here the ratio of the age-standardised rate of the disease in area i compared to the age-standardised reference rate. Using the proportionality assumption, we may then sum over the risk in each stratum in order to obtain
yi ∼ind Pois(θi Ei),    (2.25)
where yi = Σ_{j=1}^{J} yij and Ei = Σ_{j=1}^{J} Nij pj are the observed and expected counts, respectively. Equation (2.25) is the likelihood of the model. Maximum likelihood, in this context, produces the following point estimate ensemble,

θ̂i^MLE := yi / Ei,    (2.26)

for every i = 1, . . . , n, which are generally referred to as the standardised mortality or morbidity ratios (SMRs). The θ̂i^MLE's, however, tend to be over-dispersed in comparison to the true RRs. In order to provide such a model with more flexibility, different types of hierarchical priors are commonly specified on the θi's (see Wakefield et al., 2000, for a review). Two spatial BHMs will be implemented in this thesis. A popular hierarchical prior in spatial epidemiology is the convolution prior (Besag et al., 1991), which is formulated as follows,
log θi = vi + ui,    (2.27)

for every region i = 1, . . . , n. Note, however, that this model will be implemented within the WinBUGS software, which uses a different representation of the spatial random effects, based on
the joint specification of an intercept with an improper flat prior and a sum-to-zero constraint on the ui (see Appendix C, and section 3.4.3). Here, v is a vector of unstructured random effects with the following specification,
vi ∼iid N(0, τv²),    (2.28)

and the vector u captures the spatial auto-correlation between neighbouring areas. Priors on each element of u are specified conditionally, such that
ui | uj, ∀ j ≠ i ∼ N( Σ_{j∈∂i} uj / mi , τu² / mi ),    (2.29)
where ∂i is defined as the set of indices of the neighbours of the i-th area. Formally, ∂i := {j ∼ i : j = 1, . . . , n}, where i ∼ j implies that regions i and j are neighbours, and by convention, i ∉ ∂i. Moreover, in equation (2.29), we have also used mi := |∂i|, that is, mi is the total number of neighbours of the i-th area. Therefore, each of the ui's is normally distributed around the mean level of risk of the neighbouring areas, and its variability is inversely proportional to its number of neighbours. A different version of this model can be formulated using the Laplace distribution (Besag et al., 1991). As this density function has heavier tails, this specification is expected to produce less smoothing of abrupt changes in risk between adjacent areas. For the CAR Laplace prior, we therefore have ui | uj, ∀ j ≠ i ∼ L(ui | Σ_{j∈∂i} uj / mi, τu²/mi), for every i = 1, . . . , n, using the following definition of the Laplace distribution,
L(x | m0, s0) := (1 / 2s0) exp(−|m0 − x| / s0).    (2.30)

We will refer to this model as the CAR Laplace or L1 model. The rest of the specification of this BHM is identical to the one chosen for the CAR Normal. In these two models, following common practice in spatial epidemiology, two regions are considered to be neighbours if they share a common boundary (Clayton and Kaldor, 1987, Besag et al., 1991, Waller et al., 1997).
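To make the spatial notation concrete, the following sketch computes the SMRs of equation (2.26) and the CAR full-conditional moments of equation (2.29) for a toy map of four areas on a line; all counts, spatial effects and the value of τu² are fabricated for illustration only.

```python
import numpy as np

# Toy example: 4 areas on a line, where area i neighbours i-1 and i+1.
y = np.array([12, 7, 30, 18])          # fabricated observed counts
E = np.array([10.0, 9.5, 22.0, 20.0])  # fabricated expected counts
u = np.array([0.10, -0.05, 0.30, 0.05])  # fabricated spatial effects
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Maximum likelihood SMRs, equation (2.26): theta_hat_i = y_i / E_i.
smr = y / E

# CAR full-conditional moments, equation (2.29): each u_i is centred on the
# mean of its neighbours, with variance tau_u^2 / m_i.
tau2_u = 0.2
cond_mean = np.array([u[neighbours[i]].mean() for i in range(len(u))])
cond_var = np.array([tau2_u / len(neighbours[i]) for i in range(len(u))])

print(np.round(smr, 2))
print(np.round(cond_mean, 2), np.round(cond_var, 3))
```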
2.2.3 Estimation of Parameter Ensembles
The set of parameters, θi’s, in the hierarchical model described in equation (2.19) is generally referred to as a vector of random effects (Demidenko, 2004). In this thesis, we will refer to such a vector as an ensemble of parameters. That is, in a BHM following the general structure presented in equation (2.19), the vector of parameters,
θ := {θ1, . . . , θn},    (2.31)

will be referred to as a parameter ensemble. Several properties of a parameter ensemble may be of interest. One may, for instance, wish to optimise the estimation of each of the individual elements in the ensemble. A natural choice in this context is the sum of quadratic losses for each parameter. This particular loss function is the summed squared error loss (SSEL) function, which takes the following form,

SSEL(θ, θ^est) = Σ_{i=1}^{n} (θi − θi^est)².    (2.32)

In this context, using the notation that we have adopted in section 2.1.1, the decision problem for the estimation of a parameter ensemble using the SSEL function results in a parameter space Θ and a decision space D which are both assumed to be subsets of Rⁿ. The posterior expected loss associated with the loss function in equation (2.32) can be minimised in a straightforward manner. Its optimum is attained when selecting the vector of posterior means as the Bayes choice, which will
be denoted by

θ̂^SSEL := {θ̂1^SSEL, . . . , θ̂n^SSEL} = {E[θ1|y], . . . , E[θn|y]}.    (2.33)

This type of decision problem, characterised by the estimation of a set of parameters, is sometimes referred to as a compound estimation problem or compound loss function (Ghosh, 1992). We will use the term estimator to refer to an entire set of point estimates, such as θ̂^SSEL, with respect to a particular loss function, here the SSEL. This indeed follows from the fact that such a multivariate estimator is optimal under that posterior loss. Such an estimator, however, also constitutes an ensemble of single point estimates. Another property of a parameter ensemble which may be of interest is the empirical distribution function (EDF) of that ensemble, which will generally be referred to as the ensemble distribution. The EDF of θ is defined as follows,
Fn(t) := (1/n) Σ_{i=1}^{n} I{θi ≤ t},    (2.34)

for every t ∈ R, and where I is the usual indicator function. The term EDF is used here in analogy with its usual application in the context of iid observations. Note, however, that neither the elements of a true parameter ensemble nor the realisations of random effects can (in general) be assumed to be realisations of iid variables. A range of different loss functions may be considered in order to optimise the estimation of the EDF in equation (2.34). Previous authors have formalised this decision problem by using the integrated squared error loss function (ISEL), which takes the following form (see Shen and Louis, 1998),

ISEL(Fn, Fn^est) := ∫ (Fn(t) − Fn^est(t))² dt.    (2.35)
The posterior expectation of the ISEL can be easily minimised using Fubini's theorem to invert the ordering of the two integrals (i.e. the one with respect to t, and the one with respect to θ; see Shen and Louis, 1998, for a formal proof). It then follows that the optimal estimator of E[ISEL(Fn, Fn^est)|y] is the posterior EDF,

F̂n(t) := E[Fn(t)|y] = (1/n) Σ_{i=1}^{n} P[θi ≤ t | y].    (2.36)

When there may be ambiguity as to which parameter ensemble the posterior EDF is based on, we will emphasise the dependence on the vector θ by using F̂θ(t). Finally, as discussed in section 2.1.4, one is sometimes interested in specific functions of a parameter. For the case of parameter ensembles, summary functions are often used to quantify particular properties of the ensemble distribution, such as the variance of the ensemble, for instance. These functions are generally real-valued, and the decision problem, in this context, can therefore be formalised as the standard quadruple (R, R, [0, +∞), L), for some loss function L. When using the quadratic loss, we may have
SEL(h(θ), δ) = (h(θ) − δ)²,    (2.37)

for a function of interest h, mapping the parameter space Θ ⊆ Rⁿ into a subset of R. One may, for instance, wish to estimate the empirical variance of the parameter ensemble. This gives the following h(·),

h(θ) := (1/n) Σ_{i=1}^{n} (θi − θ̄)²,    (2.38)

where θ̄ := n⁻¹ Σ_{i=1}^{n} θi. Naturally, in this particular case, the optimal Bayes estimator would be E[h(θ)|y]. However, as argued in section 2.1.4, since h(·) is a non-linear function of θ, the posterior empirical variance is different from the empirical variance of the posterior means. It turns out that this particular observation has led to a key result in the study of BHMs, as we describe in the next section.
2.2.4 Hierarchical Shrinkage

Although the vector of posterior means is optimal under SSEL, its empirical variance is biased in comparison to the empirical variance of the true parameters of interest. The theorem that shows this systematic bias was first proved by Louis (1984) for the Gaussian compound case and further generalised by Ghosh (1992), who relaxed the distributional assumptions but retained the modelling assumptions. In particular, whereas Louis (1984) proved this result for the case of conjugate models composed of Normal distributions, Ghosh (1992) showed that this relationship also holds for non-conjugate models based on arbitrary probability densities.

Theorem 1 (Ghosh-Louis Theorem). Let θ be a parameter ensemble controlling the distribution of a vector of observations y in a general two-stage hierarchical model, as described in equation (2.19). If n ≥ 2, then
" n # n " n #2 1 X 1 X 1 X (θ − θ¯)2 y ≥ [θ |y] − [θ |y] , (2.39) E n i n E i n E i i=1 i=1 i=1
where θ̄ := n⁻¹ Σ_{i=1}^{n} θi, with equality holding if and only if all of {(θ1 − θ̄), . . . , (θn − θ̄)} have degenerate posteriors.
A proof of this result is provided in Ghosh (1992), and a weighted version is presented in Frey and Cressie (2003). The Ghosh-Louis theorem bears a lot of similarity to the law of total variance, which posits that Var(X) = E[Var(X|Y)] + Var(E[X|Y]). One may, however, note that there exists one substantial difference between this standard law of probability and the result at hand: the Ghosh-Louis theorem conditions on the data on both sides of the inequality. The Ghosh-Louis theorem states a general property of BHMs. Hierarchical shrinkage is the under-dispersion of the empirical distribution of the posterior means in comparison to the posterior mean of the empirical variance of the true parameter ensemble. This should be contrasted with the commonly encountered issue of shrinkage in Bayesian models, where a single posterior mean is shrunk towards its prior mean. Although hierarchical shrinkage is here presented as a problem, it is often seen as a desirable property of BHMs, most especially when little information is available for each data point. This theorem has been used both to justify such a modelling decision and to highlight the limitations of this choice. In spatial epidemiological settings, Gelman and Price (1999) have shown that such shrinkage especially affects areas with low expected counts. Before reviewing the different decision-theoretic solutions that have been proposed to produce better estimates of the empirical properties of parameter ensembles, we briefly introduce several statistical concepts, which will be useful in the sequel.
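The inequality (2.39) is easy to check empirically from any matrix of posterior draws for an ensemble, since the empirical variance is a convex function of θ and the result is an instance of Jensen's inequality. In the sketch below, the draws are fabricated rather than produced by a fitted BHM; the layout (one row per MCMC draw, one column per ensemble element) is the assumed convention.

```python
import numpy as np

# Fabricated posterior draws: draws[s, i] is the s-th draw of theta_i.
rng = np.random.default_rng(4)
n, S = 100, 5000
centres = rng.normal(0.0, 1.0, size=n)
draws = centres + rng.normal(0.0, 0.7, size=(S, n))

# Left-hand side of (2.39): posterior mean of the ensemble's empirical variance.
lhs = np.mean(np.var(draws, axis=1))

# Right-hand side: empirical variance of the ensemble of posterior means.
rhs = np.var(draws.mean(axis=0))

# lhs >= rhs: the posterior means are under-dispersed (hierarchical shrinkage).
print(f"E[var(theta)|y] ~ {lhs:.3f} >= var(E[theta|y]) ~ {rhs:.3f}")
```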
2.3 Ranks, Quantile Functions and Order Statistics
2.3.1 Ranks and Percentile Ranks
We here adopt the nomenclature introduced by Laird and Louis (1989) on order statistics and rank percentiles as guidance for our choice of notation. Of particular importance to our development is the definition of a rank. The rank of an element in a parameter ensemble is defined as follows,
R_i(\theta) := \operatorname{rank}(\theta_i \mid \theta) = \sum_{j=1}^{n} I\{\theta_i \ge \theta_j\}, \qquad (2.40)
where the smallest θ_i in θ is given rank 1 and the largest θ_i is given rank n. In equation (2.40), we have made explicit the fact that the rank of each θ_i depends on the entire parameter ensemble, θ. In particular, this notation emphasises the fact that the function R_i(·) is non-linear in its argument. In the sequel, we will also make extensive use of percentile ranks (PRs). These are formally defined as follows,
P_i(\theta) := \frac{R_i(\theta)}{n + 1}. \qquad (2.41)
Quite confusingly, percentile ranks are sometimes referred to as “percentiles”, which should not be confused with the notion of percentiles discussed in section 2.3.2 in reference to quantiles. In general terms, the percentile rank of an element in a parameter ensemble is the percentage of elements in the corresponding EDF that are less than or equal to the value of that element. Rank percentiles are especially useful when communicating ranking statistics to practitioners. Percentile ranks are empirical quantities, in the sense that they depend on the size of the parameter ensemble, n. However, it can be shown that percentile ranks rapidly converge to asymptotic quantities, which are independent of n (Lockwood et al., 2004).
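Definitions (2.40) and (2.41) translate directly into code; the following is a minimal sketch, with arbitrary example values.

```python
import numpy as np

def ranks(theta):
    """Ranks as in (2.40): the smallest element gets rank 1, the largest rank n."""
    theta = np.asarray(theta)
    return (theta[:, None] >= theta[None, :]).sum(axis=1)

def percentile_ranks(theta):
    """Percentile ranks as in (2.41): R_i(theta) / (n + 1)."""
    return ranks(theta) / (len(theta) + 1)

theta = np.array([1.3, -0.2, 2.1, 0.4])
print(ranks(theta))             # [3 1 4 2]
print(percentile_ranks(theta))  # [0.6 0.2 0.8 0.4]
```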
2.3.2 Quantile Function
The quantile function of a continuous random variable X, denoted Q(p), is defined as the inverse of the cumulative distribution function (CDF), F(x). Since the function F : \mathbb{R} \mapsto [0, 1] is, in this case, continuous and strictly monotonic, we can define
Q(p) := F^{-1}(p), \qquad (2.42)
for some real number p ∈ [0, 1]. In general, the inverse function of the CDF of most random variables does not exist in closed form. Among the rare exceptions are the uniform distribution, unif(x|a, b), and the exponential distribution, exp(x|λ), for which the quantile functions are Q(p|a, b) = (1 − p)a + pb and Q(p|λ) = − log(1 − p)/λ, respectively (see Gilchrist, 2000). For discrete random variables, whose CDF may only be weakly monotonic, the quantile distribution function (QDF) is defined as
Q(p) := \inf\{x \in \mathbb{R} : F(x) \ge p\}, \qquad (2.43)
for every p ∈ [0, 1]. Note that the latter definition holds for any arbitrary CDF, whether continuous or discrete. Like the CDF, the QDF is monotone non-decreasing in its argument. It is continuous for continuous random variables, and discrete for discrete random variables. However, whereas F(x) is right-continuous, its inverse is, by convention, continuous from the left. This last property is a consequence of the use of the infimum in equation (2.43). When it is deemed necessary for clarity, we will specify the random variable for which a quantile is computed by a subscript, as in Q_X(p) for the random variable X. More generally, we will distinguish between the theoretical QDF and the empirical QDF by denoting the former by Q(p) and the latter by Q_n(p). For some parameter ensemble θ of size n, we define the empirical QDF as follows,
Q_n(p) := \min\{\theta_i \in \{\theta_1, \ldots, \theta_n\} : F_n(\theta_i) \ge p\}, \qquad (2.44)
where F_n is the EDF of the parameter ensemble, as defined in equation (2.34). Note that our utilisation of the EDF in this context corresponds to a slight abuse of the concept. As aforementioned, the EDF of a particular random variable assumes that several iid realisations of that random variable are available. In our case, given the hierarchical nature of the models under scrutiny, and the possible spatial structure linking the different geographical units of interest, such an iid assumption is not satisfied. Our use of the term empirical distribution should therefore be understood solely in reference to our chosen mode of construction for the discrete distribution of an ensemble of realisations.
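As a concrete illustration of (2.44), the sketch below (a minimal implementation, assuming the ensemble is available as a plain numeric array; all names and values are illustrative) returns the smallest ensemble value at which the EDF reaches the requested probability.

```python
import numpy as np

def empirical_qdf(theta, p):
    """Empirical quantile function of an ensemble, following (2.44):
    the smallest ensemble value whose EDF is at least p."""
    theta = np.sort(np.asarray(theta))
    n = len(theta)
    edf = np.arange(1, n + 1) / n            # F_n evaluated at the sorted values
    idx = np.searchsorted(edf, p)            # first index with F_n >= p
    return theta[np.clip(idx, 0, n - 1)]

theta = np.array([0.8, 2.4, 1.1, 3.0, 1.9])
print(empirical_qdf(theta, 0.5))             # 1.9
print(empirical_qdf(theta, [0.25, 0.75]))    # also accepts a vector of probabilities
```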
When we wish to emphasise the vector of parameters with respect to which the QDF and EDF are defined, we will use the notation Q_X and F_X, respectively. Sometimes, it will be useful to allow the function Q(·) to accept multivariate arguments, such as p := {p_1, \ldots, p_k}. In such cases, the QDF will be vector-valued, such that
Q(\mathbf{p}) = Q(\{p_1, \ldots, p_k\}) = \{Q(p_1), \ldots, Q(p_k)\}. \qquad (2.45)
In contrast to the mean, for which the relationship E[g(X)] = g(E[X]) only holds for linear functions g(·), the quantile function satisfies
Q_{h(X)}(p) = h[Q_X(p)], \qquad (2.46)
for every monotonic non-decreasing function h(·). In particular, we have the following useful transformation,
Q_{\log(X)}(p) = \log[Q_X(p)], \qquad (2.47)
which follows from the monotonicity of the logarithm. Moreover, we can recover the mean of X from its quantile function by integrating the quantile function over its domain. That is,
\int_{[0,1]} Q_X(p)\, dp = E[X], \qquad (2.48)
which is a standard property of the quantile function (see Gilchrist, 2000). When considering the quantile function of the standard normal CDF, denoted Φ(x), the quantile function specialises to the probit, defined as probit(p) := Φ^{-1}(p). The probit is widely used in econometrics as a link function for the generalised linear model when modelling binary response variables.
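Properties (2.46)-(2.48) are easy to verify numerically with the exponential distribution, whose closed-form quantile function was given above. The following is a minimal sketch; the rate and sample size are arbitrary.

```python
import numpy as np

rate = 2.0
p = np.linspace(0.05, 0.95, 19)

# Closed-form quantile function of exp(x | lambda): Q(p) = -log(1 - p) / lambda.
q_exp = -np.log1p(-p) / rate

# (2.47): the quantiles of log(X) are the logs of the quantiles of X.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / rate, size=200_000)
q_logx = np.quantile(np.log(x), p)
print(np.max(np.abs(q_logx - np.log(q_exp))))   # small (Monte Carlo error only)

# (2.48): integrating Q over [0, 1) recovers the mean, here 1 / lambda = 0.5.
grid = np.linspace(0.0, 1.0, 100_000, endpoint=False)
print(np.mean(-np.log1p(-grid) / rate))         # approximately 0.5
```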
2.3.3 Quantiles, Quartiles and Percentiles
In this thesis, we will be mainly concerned with the quantiles of parameter ensembles; these quantities will therefore be defined with respect to the EDF of a vector θ. In such cases, the pth empirical quantile of the ensemble θ is formally defined as
\theta(p) := Q_\theta(p), \qquad (2.49)
where the .50th quantile is the empirical median. In the sequel, we will also make use of the .25th, .50th and .75th quantiles, which are referred to as the first, second and third empirical quartiles, respectively. Of particular interest is the difference between the third and first quartiles, which produces the empirical interquartile range (IQR). For some parameter ensemble θ, we have
IQR(θ) := θ(.75) − θ(.25). (2.50)
In some texts, the term quantile is used to classify different types of divisions of the domain of a probability distribution. More precisely, the kth q-quantile is defined as Q_θ(k/q). In that nomenclature, the percentiles therefore correspond to the 100-quantiles, which would imply that the values taken by the quantile function are, in fact, percentiles. Other authors, however, have used the term quantile in a more straightforward manner, where the pth quantile is simply defined as Q_θ(p), as in equation (2.49) (see, for instance, Koenker and Bassett, 1978). In the sequel, we adopt this particular definition of quantiles and only use the term percentiles to refer to percentile ranks, as introduced in section 2.3.1. There exist different techniques for deriving the quantile function and quantiles of a particular CDF (see Steinbrecher and Shaw, 2008, for recent advances). One of the most popular methods is algorithm AS 241 (Wichura, 1988), which allows the pth quantile of the normal distribution to be computed very efficiently. We will make use of this standard computational technique in chapter 3.
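For parameter ensembles, the empirical quartiles, the IQR of (2.50) and the quartile ratio used later in chapter 3 can be read directly off the ensemble values. The sketch below is a minimal illustration with a simulated positive-valued ensemble (e.g. relative risks); it uses numpy's inverted-CDF quantile rule, available in recent numpy versions, which matches the inf-based definition in (2.43).

```python
import numpy as np

rng = np.random.default_rng(2)
theta = rng.lognormal(mean=0.0, sigma=0.4, size=500)   # e.g. an ensemble of RRs

# Empirical quartiles via the inverted-CDF rule, matching definition (2.43).
q1, q2, q3 = np.quantile(theta, [0.25, 0.50, 0.75], method="inverted_cdf")
iqr = q3 - q1        # empirical IQR, as in (2.50)
qr = q3 / q1         # empirical quartile ratio, used in chapter 3

print(f"median = {q2:.3f}, IQR = {iqr:.3f}, QR = {qr:.3f}")

# Because the quantile function commutes with monotone maps, cf. (2.46),
# log(QR) equals the IQR of the ensemble on the log scale.
log_q1, log_q3 = np.quantile(np.log(theta), [0.25, 0.75], method="inverted_cdf")
print(np.isclose(np.log(qr), log_q3 - log_q1))          # True
```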
2.4 Loss Functions for Parameter Ensembles
In this section, we review three important approaches, which have been proposed to address the issue of hierarchical shrinkage highlighted in the previous section. These three decision-theoretic perspectives will be specifically considered in chapters 3 and 4, and compared with other methods through the construction of plug-in estimators. Here, the parameter space satisfies \Theta \subseteq \mathbb{R}^n, and D = \Theta, as before.
2.4.1 Constrained Bayes
In order to address the problem associated with hierarchical shrinkage, described in section 2.2, Louis (1984) developed a particular set of point estimates that automatically corrects for hierarchical shrinkage. This approach is a constrained minimisation problem, in which \operatorname{SSEL}(\theta, \theta^{\text{est}}) = \sum_{i=1}^{n}(\theta_i - \theta_i^{\text{est}})^2 is minimised subject to the following two constraints: (i) the mean of the ensemble of point estimates is equal to the mean of the true ensemble,
\bar{\theta}^{\text{est}} := \frac{1}{n}\sum_{i=1}^{n} \theta_i^{\text{est}} = \frac{1}{n}\sum_{i=1}^{n} \theta_i =: \bar{\theta}, \qquad (2.51)
and (ii) the variance of the ensemble of point estimates is equal to the variance of the true ensemble,
\frac{1}{n}\sum_{i=1}^{n}\big(\theta_i^{\text{est}} - \bar{\theta}^{\text{est}}\big)^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\theta_i - \bar{\theta}\big)^2. \qquad (2.52)
Based on these constraints, Louis (1984) derived the optimal Bayes estimator minimising the corresponding constrained SSEL function. Since this method is developed within a Bayesian framework, it is generally referred to as the constrained Bayes (CB) or constrained empirical Bayes (EB) estimator, although its use is not restricted to models estimated using EB techniques (see Rao, 2003). This result was further generalised by Ghosh (1992) to any distribution belonging to the exponential family.
Theorem 2 (Ghosh-Louis Estimator). The minimiser of the SSEL under the constraints in equations (2.51) and (2.52) is
\hat{\theta}_i^{\text{CB}} := \omega\, \hat{\theta}_i^{\text{SSEL}} + (1 - \omega)\, \bar{\hat{\theta}}^{\text{SSEL}}, \qquad (2.53)
where \hat{\theta}_i^{\text{SSEL}} := E[\theta_i \mid y], for every i = 1, \ldots, n, and \bar{\hat{\theta}}^{\text{SSEL}} := n^{-1}\sum_{i=1}^{n} \hat{\theta}_i^{\text{SSEL}} is the mean of the empirical distribution of posterior means. The weight ω is defined as
\omega = \left[1 + \frac{n^{-1}\sum_{i=1}^{n} \operatorname{Var}[\theta_i \mid y]}{n^{-1}\sum_{i=1}^{n}\big(\hat{\theta}_i^{\text{SSEL}} - \bar{\hat{\theta}}^{\text{SSEL}}\big)^2}\right]^{1/2}, \qquad (2.54)
where Var[θ_i|y] is the posterior variance of the ith parameter in the ensemble.
Proof. The minimisation of the posterior expected loss is a constrained minimisation problem, which can be solved using Lagrange multipliers (see Rao, 2003, page 221).
The role played by the weight ω is demonstrated more explicitly by rearranging equation (2.53) to obtain
\hat{\theta}_i^{\text{CB}} = \bar{\hat{\theta}}^{\text{SSEL}} + \omega\big(\hat{\theta}_i^{\text{SSEL}} - \bar{\hat{\theta}}^{\text{SSEL}}\big). \qquad (2.55)
In addition, note that the expression in equation (2.53) resembles a convex combination of the SSEL estimator of the ith element and the mean of the ensemble. This is not, however, a convex combination, since the weight ω may take values greater than unity. Note also that the weight does not depend on the index i: it is a constant quantity, identical for all elements in the parameter ensemble. In particular, for the Gaussian compound case described in equation (3.19), the CB estimates have a direct interpretation in terms of hierarchical shrinkage. Let
y_i \overset{\text{ind}}{\sim} N(\theta_i, \sigma^2), \qquad \theta_i \overset{\text{iid}}{\sim} N(\mu, \tau^2), \qquad (2.56)
where we assume that σ², τ² and μ are known. Note that here, for convenience, we have assumed σ² to be constant over all elements in the parameter ensemble, which differs from the model specification described in equation (3.19). The use of Bayes' rule under quadratic loss yields the conventional posterior mean, \hat{\theta}_i^{\text{SSEL}} = \gamma y_i + (1 - \gamma)\mu, where γ is the countershrinkage parameter, defined as
\gamma := \frac{\tau^2}{\sigma^2 + \tau^2}. \qquad (2.57)
In this context, it can be shown that the Ghosh-Louis constrained point estimates bear some important structural similarities with the formula for the posterior means in this model. We have
\hat{\theta}_i^{\text{CB}} \doteq \gamma^{1/2} y_i + (1 - \gamma^{1/2})\mu, \qquad (2.58)
where 0 ≤ γ ≤ 1 (see Rao, 2003, p.212), for every i = 1, \ldots, n, and where \doteq indicates that the equality is approximate. This is a particularly illuminating result, which intuitively illustrates how this estimator controls for the under-dispersion of the posterior means. The countershrinkage parameter γ is here given more weight by taking its square root, as opposed to the use of γ itself in computing the \hat{\theta}_i^{\text{SSEL}}'s. This produces a set of point estimates that is therefore more dispersed than the posterior means. This set of point estimates has very desirable asymptotic properties. As an ensemble, the mean and the variance of the CB point estimates converge almost surely to the mean and variance, respectively, of the true ensemble distribution (Ghosh and Rao, 1994). Furthermore, the CB estimator has a readily interpretable formulation as a shrinkage estimator, as we have seen for the compound Gaussian case. However, the CB approach also suffers from an important limitation: its performance will be greatly influenced by the functional form of the true ensemble distribution. In particular, since the CB estimator only matches the first two moments of the true ensemble, the empirical distribution of the CB point estimates may provide a poor approximation of the ensemble distribution when the distribution of interest is substantially skewed. The next approach attempts to address this limitation by directly optimising the estimation of the EDF of the parameter ensemble.
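Before turning to that approach, note that for a model fitted by MCMC the CB ensemble of (2.53)-(2.55) can be computed directly from the posterior draws. The sketch below is a minimal illustration under that assumption; the array of draws and the simulated values are hypothetical.

```python
import numpy as np

def constrained_bayes(theta_samples):
    """Ghosh-Louis constrained Bayes point estimates, following (2.53)-(2.55).
    Rows of theta_samples index MCMC iterations, columns index ensemble elements."""
    post_means = theta_samples.mean(axis=0)              # posterior means (SSEL)
    post_vars = theta_samples.var(axis=0)                # Var[theta_i | y]
    grand_mean = post_means.mean()
    spread = np.mean((post_means - grand_mean) ** 2)
    omega = np.sqrt(1.0 + post_vars.mean() / spread)     # weight in (2.54)
    return grand_mean + omega * (post_means - grand_mean)

rng = np.random.default_rng(3)
draws = rng.normal(loc=rng.normal(0.0, 1.0, size=50), scale=0.7, size=(4000, 50))
cb = constrained_bayes(draws)
# The CB ensemble is more dispersed than the ensemble of posterior means.
print(draws.mean(axis=0).var(), cb.var())
```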
2.4.2 Triple-Goal
The triple-goal estimator of a parameter ensemble was introduced by Shen and Louis (1998). It constitutes a natural extension of the CB approach. Here, however, instead of solely constraining the minimisation exercise with respect to the first two moments of the ensemble distribution, we consider three successive goals, which are optimised in turn (see also Shen and Louis, 2000). The set of point estimates resulting from these successive minimisations is generally referred to as the GR point estimates, where G denotes the EDF and R refers to the ranks. For consistency, we will therefore adhere to this acronym in the sequel (Shen and Louis, 1998). Note, however, that the GR point estimates are not optimal for these three goals, but will produce very good performance under each of these distinct objectives. These three consecutive steps can be described as follows.
i. Firstly, minimise the ISEL function, as introduced in equation (2.35), in order to obtain an estimate \hat{F}_n(t) of the ensemble EDF, F_n(t), for some ensemble of interest, \theta = \{\theta_1, \ldots, \theta_n\}. As we have seen in equation (2.36), the optimal estimate is the posterior EDF, defined as
\hat{F}_n(t) := E[F_n(t) \mid y] = \frac{1}{n}\sum_{i=1}^{n} P[\theta_i \le t \mid y]. \qquad (2.59)
ii. Secondly, minimise the ranks squared error loss (RSEL) in order to optimise the estimation of the ranks of the parameter ensemble θ. Let R^{\text{est}} := R^{\text{est}}(\theta) and R := R(\theta) denote the vector of candidate ranks and the vector of true ranks, respectively. We wish to minimise the following loss function,
\operatorname{RSEL}(R, R^{\text{est}}) = \frac{1}{n}\sum_{i=1}^{n}\big(R_i - R_i^{\text{est}}\big)^2. \qquad (2.60)
The vector of optimisers \bar{R} of the posterior expected RSEL is composed of the following elements,
\bar{R}_i := E[R_i \mid y] = \sum_{j=1}^{n} P(\theta_i \ge \theta_j \mid y). \qquad (2.61)
The \bar{R}_i's are not generally integers. However, one can easily transform the \bar{R}_i's into integers by ranking them, such that
\hat{R}_i := \operatorname{rank}(\bar{R}_i \mid \bar{R}), \qquad (2.62)
for every i = 1, \ldots, n. The \hat{R}_i's are then used as optimal estimates of the true ranks under RSEL.
iii. Finally, we generate an ensemble of point estimates, conditional on the optimal estimate of the ensemble EDF, \hat{F}_n, and the optimal estimates of the true ranks, the \hat{R}_i's. This is done by setting
\hat{\theta}_i^{\text{GR}} := \hat{F}_n^{-1}\!\left(\frac{2\hat{R}_i - 1}{2n}\right), \qquad (2.63)
for every i = 1, \ldots, n. The −1 in the numerator of equation (2.63) arises from the minimisation of the posterior expected ISEL (see Shen and Louis, 1998, for this derivation).
Despite the seeming complexity of the successive minimisations involved in producing the triple-goal estimators, the computation is relatively straightforward (more details can be found in Shen and Louis, 1998). We note, however, that one of the limitations of the triple-goal technique is that it relies heavily on the quality of the prior distribution –that is, the joint prior distribution for the θ_i's, which we denoted by p(θ|ξ) in equation (2.19). Specific non-parametric methods have been proposed to obtain an EB estimate of the prior distribution, which helps to attenuate these limitations. The smoothing by roughening (SBR) algorithm introduced by Shen and Louis (1999) is an example of such a technique. In such cases, the joint prior distribution, p(θ|ξ), is estimated using the SBR algorithm. Estimation of the parameter ensemble is then conducted using standard Bayesian inference based on this SBR prior. Shen and Louis (1999) demonstrated good performance of this method on simulated data and rapid convergence of the SBR algorithm.
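In practice, the three steps above can be approximated directly from MCMC output. The following is a minimal sketch under that assumption; the pooled-draws approximation to \hat{F}_n and all names are our own illustration, not code from Shen and Louis (1998).

```python
import numpy as np
from scipy.stats import rankdata

def triple_goal(theta_samples):
    """Sketch of the GR point estimates from posterior draws
    (rows: MCMC iterations, columns: ensemble elements)."""
    n_draws, n = theta_samples.shape

    # Step ii: posterior expected ranks (2.61), then integer ranks (2.62).
    ranks_per_draw = rankdata(theta_samples, axis=1)
    expected_ranks = ranks_per_draw.mean(axis=0)         # \bar{R}_i
    integer_ranks = rankdata(expected_ranks)             # \hat{R}_i

    # Steps i and iii: invert the posterior EDF (2.59), approximated here by
    # pooling all draws, at the points (2 \hat{R}_i - 1) / (2n), as in (2.63).
    pooled = np.sort(theta_samples.ravel())
    probs = (2 * integer_ranks - 1) / (2 * n)
    idx = np.ceil(probs * pooled.size).astype(int) - 1   # smallest t with Fhat(t) >= p
    return pooled[idx]

rng = np.random.default_rng(4)
draws = rng.normal(loc=rng.gamma(2.0, 1.0, size=40), scale=0.6, size=(3000, 40))
print(triple_goal(draws)[:5])
```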
2.4.3 Weighted and Weighted Ranks Loss Functions
The standard SSEL framework can also be extended by including a vector of weights within the quadratic loss. These weights, denoted φ(θ_i), may be specified as a function of the unknown parameters of interest, with \phi : \mathbb{R} \mapsto \mathbb{R}. The weights may therefore be made to vary with each of the elements in θ. In addition, since loss functions are assumed to be non-negative, the weights are also constrained to satisfy φ(θ_i) ≥ 0, for all i = 1, \ldots, n. The resulting weighted squared error loss (WSEL) is then
\operatorname{WSEL}(\phi, \theta, \theta^{\text{est}}) = \sum_{i=1}^{n} \phi(\theta_i)\big(\theta_i - \theta_i^{\text{est}}\big)^2, \qquad (2.64)
with φ(θ_i) ≥ 0 for all i = 1, \ldots, n. The set of optimal point estimates, denoted the \hat{\theta}_i^{\text{WSEL}}'s, that minimises the posterior expected WSEL takes the following form
\hat{\theta}_i^{\text{WSEL}} := \frac{E[\phi(\theta_i)\theta_i \mid y]}{E[\phi(\theta_i) \mid y]}, \qquad (2.65)
for every i = 1, \ldots, n. Naturally, if the weights do not depend on the parameters of interest, then the optimal estimators reduce to the posterior means. That is, when the weights are independent of θ, they can be extracted from the expectations and cancel out. In the ensuing discussion, we will therefore implicitly assume that the weights are functions of θ, and they will simply be denoted by the φ_i's.
The weighted ranks squared error loss (WRSEL) is a component-wise loss function that spans the entire set of order statistics of a given function (Wright et al., 2003). The vector of weights φ is here dependent on the ranks of each of the θ_i's. Taking \hat{\theta}_i^{\text{WRSEL}} to be the Bayes decision under the WRSEL, we therefore have a compound loss function of the form,
\operatorname{WRSEL}(\phi, \theta, \theta^{\text{est}}) = \sum_{i=1}^{n}\sum_{j=1}^{n} \phi_j I\{R_i = j\}\big(\theta_i - \theta_i^{\text{est}}\big)^2, \qquad (2.66)
where, as before, R_i := R_i(\theta) is the rank of the ith element in the parameter ensemble. For notational convenience, we write
\operatorname{WRSEL}(\phi, \theta, \theta^{\text{est}}) = \sum_{i=1}^{n} \phi_{R_i}\big(\theta_i - \theta_i^{\text{est}}\big)^2, \qquad (2.67)
where \phi_{R_i} := \sum_{j=1}^{n} \phi_j I\{R_i = j\} identifies the weight assigned to the rank of the ith element. The Bayes action minimising the posterior expected WRSEL is an ensemble of point estimates \hat{\theta}^{\text{WRSEL}}, whose elements are defined as
\hat{\theta}_i^{\text{WRSEL}} := \frac{\sum_{j=1}^{n} \phi_j\, E[\theta_i \mid R_i = j, y]\, P(R_i = j \mid y)}{\sum_{j=1}^{n} \phi_j\, P(R_i = j \mid y)}, \qquad (2.68)
where P(R_i = j \mid y) is the posterior probability that the ith element has rank j. Each estimator \hat{\theta}_i^{\text{WRSEL}} can therefore be seen as a weighted average of the conditional posterior means of θ_i, given that this element has rank j. Note that we can further simplify the formulae provided by Wright et al. (2003) by using the law of total expectation, \int E(x \mid y)\, p(y)\, dy = E(x), such that
\hat{\theta}_i^{\text{WRSEL}} = \frac{E[\theta_i \phi_{R_i} \mid y]}{E[\phi_{R_i} \mid y]}, \qquad (2.69)
for every i = 1, \ldots, n, which is equivalent to the minimiser of the standard WSEL function presented in equation (2.64). In addition, Wright et al. (2003) noted that the WRSEL and the SSEL are equivalent up to a multiplicative constant, which is the vector of weights φ. That is,
\operatorname{WRSEL} = \phi\, \operatorname{SSEL}. \qquad (2.70)
Moreover, the WRSEL trivially reduces to the SSEL when \phi = \mathbf{1}_n. It therefore follows that the WRSEL is a generalisation of the SSEL. This family of loss functions allows the targeting of a range of different inferential goals by adjusting the shape of the vector of weights, φ. Wright et al. (2003) and Craigmile et al. (2006) proposed a vector of weights consisting of a bowl-shaped function of the rank of each of the θ_i's, which emphasises estimation of the extreme quantiles of the ensemble distribution. Wright et al. (2003) make use of the following specification for the
φ_i's,
\phi_i := \exp\!\left\{a_1\!\left(i - \frac{n+1}{2}\right)\right\} + \exp\!\left\{-a_2\!\left(i - \frac{n+1}{2}\right)\right\}, \qquad (2.71)
for every i = 1, \ldots, n, where both a_1 and a_2 are real numbers greater than 0. Different choices of a_1 and a_2 permit the adjustment of the amount of emphasis put on the extremes of the EDF. In the sequel, when considering spatial models, these two parameters will be fixed to a_1 = a_2 = 0.5, which is a symmetric version of the specification used by Wright et al. (2003). When considering non-spatial models, however, we will see that the shape of the ensemble of WRSEL point estimates is highly sensitive to the choice of a_1 and a_2, as will be described in section 3.3.5. For such non-spatial models, we will choose the following 'softer' specification: a_1 = a_2 = 0.05. Such symmetric specifications of the φ_i's emphasise the estimation of the extrema over the estimation of the elements occupying the middle ranks of the parameter ensemble. Therefore, the ensemble distribution of the WRSEL point estimates resulting from such a specification of the φ_i's is expected to be more dispersed than the empirical distribution of the posterior means.
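The bowl shape of (2.71) and its effect on the WRSEL estimates of (2.69) can be inspected directly from posterior draws. The following is a minimal sketch under the assumption that such draws are available; all names and the simulated data are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def wrsel_weights(n, a1=0.5, a2=0.5):
    """Bowl-shaped rank weights of (2.71); larger a1 and a2 place more emphasis
    on the upper and lower extremes of the ensemble, respectively."""
    centred = np.arange(1, n + 1) - (n + 1) / 2
    return np.exp(a1 * centred) + np.exp(-a2 * centred)

def wrsel_estimates(theta_samples, a1=0.5, a2=0.5):
    """WRSEL point estimates via (2.69), computed from posterior draws."""
    n_draws, n = theta_samples.shape
    phi = wrsel_weights(n, a1, a2)
    ranks = rankdata(theta_samples, axis=1).astype(int)  # R_i at each draw
    phi_r = phi[ranks - 1]                               # phi_{R_i} at each draw
    return (theta_samples * phi_r).mean(axis=0) / phi_r.mean(axis=0)

rng = np.random.default_rng(5)
draws = rng.normal(loc=rng.normal(0.0, 1.0, size=30), scale=0.5, size=(2000, 30))
# The WRSEL ensemble is more spread out than the ensemble of posterior means.
print(wrsel_estimates(draws).std(), draws.mean(axis=0).std())
```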
2.5 Research Questions
The main focus of this thesis is the estimation of functions of parameter ensembles. Given a particular BHM, we wish to estimate a function h(·) of the parameter ensemble θ. As we have seen in section 2.1.4, the optimal estimator of h(θ) for some loss function L is the following,
\delta^{*} := \underset{\delta}{\operatorname{argmin}}\; E[L(h(\theta), \delta) \mid y]. \qquad (2.72)
Depending on the nature of the function h(·), the computation of the optimiser \delta^{*} could be relatively expensive. By contrast, computation of the ensemble of optimal point estimates, \hat{\theta}^{L'}, for a variety of commonly used loss functions L', is generally straightforward. Moreover, in the interest of consistency, one generally wishes to report a single set of point estimates that may simultaneously satisfy several inferential desiderata. Thus, we are interested in evaluating the performance of h(\hat{\theta}^{L'}) as an alternative to the optimal estimator, \delta^{*}. For instance, one may evaluate the performance of a plug-in estimator based on the ensemble of posterior means, when this particular estimator –i.e. h(\hat{\theta}^{\text{SSEL}})– is used in place of \delta^{*}. A natural way of comparing the performance of different sets of estimators is to utilise the posterior regret (Shen and Louis, 1998). The posterior regret expresses the penalty paid for using a sub-optimal estimator. That is, it is defined as the difference between the posterior loss incurred by the chosen estimator and the posterior loss associated with the Bayes choice. Formally, the posterior regret for using \delta' under L is
\operatorname{regret}(L, \delta') := E[L(\theta, \delta') \mid y] - \min_{\delta} E[L(\theta, \delta) \mid y]. \qquad (2.73)
As for loss functions, \operatorname{regret}(L, \delta') \ge 0, with equality holding if and only if \delta' = \operatorname{argmin}_{\delta} E[L(\theta, \delta) \mid y]. One of the first utilisations of this concept can be found in Savage (1954), although not expressed in terms of posterior expectations. Shen and Louis (1998) specifically used the posterior regret when comparing ensembles of point estimators. This concept has also been used in more applied settings, such as in computing credibility premiums in mathematical finance (Gomez-Deniz et al., 2006). For our purpose, we can specialise the definition in equation (2.73) to the estimation of functions of parameter ensembles in BHMs. We will be interested in considering the posterior regret associated with using a plug-in estimator, h(\hat{\theta}^{L'}), for some loss function L'. We will therefore compute the following posterior regret,
\operatorname{regret}(L, h(\hat{\theta}^{L'})) = E[L(\theta, h(\hat{\theta}^{L'})) \mid y] - \min_{\delta} E[L(\theta, \delta) \mid y], \qquad (2.74)
where the optimal estimator \delta^{*} under L may be vector-valued. In general, we will refer to the quantity in equation (2.74) as the posterior regret based on L(\theta, h(\hat{\theta}^{L'})). When comparing different plug-in estimators, the classical loss function, denoted L' in equation (2.74), will be taken to be one of the loss functions reviewed in section 2.4. In particular, following the research conducted on estimating parameter ensembles in BHMs, the L' function of interest will be taken to be the SSEL, WRSEL, constrained Bayes or triple-goal loss. Moreover, we will also evaluate the posterior regret for the ensemble of MLEs, as these estimators represent a useful benchmark with respect to which the performance of other plug-in estimators can be compared. For clarity, since optimal estimators under L' are vectors of point estimates, we will denote them by \hat{\theta}^{L'}. In order to facilitate comparison between experimental conditions where the overall posterior expected loss may vary, it will be useful to express the posterior regret in equation (2.74) as a percentage of the minimal loss that can be achieved using the optimal minimiser. That is, we will generally also report the following quantity,
\operatorname{PercRegret}(L, h(\hat{\theta}^{L'})) := \frac{100 \times \operatorname{regret}(L, h(\hat{\theta}^{L'}))}{\min_{\delta} E[L(\theta, \delta) \mid y]}, \qquad (2.75)
which we term the percentage regret. Both the posterior and percentage regrets will be used extensively in the ensuing chapters; a small computational sketch of both quantities is given at the end of this section. In order to emphasise the distinction between these two quantities, the posterior regret in equation (2.74) will sometimes be referred to as the absolute posterior regret. Specifically, in chapters 3 and 4, we will study two aspects of parameter ensembles in BHMs. Firstly, motivated by current practice in both epidemiology and spatial epidemiology, we will consider the optimal determination of the heterogeneity or dispersion of a parameter ensemble. In this case, the function h(·) will either be the quantile function, as described in section 2.3.3, or a ratio of quartiles. Secondly, following an increased demand for the classification of epidemiological units into low-risk and high-risk groups –for surveillance purposes, for instance– we will concentrate our attention on optimising such classification. Here, the function h(·) will be defined as an indicator function with respect to some cut-off point of interest, denoted by C. Specifically, we will use two variants of the posterior regret introduced in equation (2.74) for the following choices of L and h(·):
i. In chapter 3, we will optimise the estimation of the heterogeneity of a parameter ensemble. Our main focus will be on defining h(·) as a ratio of some of the quartiles of the parameter ensemble θ, with L being the quadratic loss.
ii. In chapter 4, we will treat the problem of classifying the elements of a parameter ensemble below and above a particular threshold. Here, the function h(·) will be defined as the indicator function with respect to some cut-off point C, and the loss function of interest L will penalise misclassifications.
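As the promised sketch of (2.74) and (2.75), consider the quadratic loss on the empirical .75 quantile of the ensemble, with the plug-in rule "take the .75 quantile of the posterior means". This is our own minimal illustration: the loss, the plug-in rule and the simulated draws are all hypothetical, and posterior expectations are approximated by averages over MCMC draws.

```python
import numpy as np

def posterior_regrets(loss, theta_samples, delta_plugin, delta_optimal):
    """Absolute and percentage posterior regrets, as in (2.74)-(2.75)."""
    plugin_loss = np.mean([loss(t, delta_plugin) for t in theta_samples])
    optimal_loss = np.mean([loss(t, delta_optimal) for t in theta_samples])
    regret = plugin_loss - optimal_loss
    return regret, 100.0 * regret / optimal_loss

rng = np.random.default_rng(6)
draws = rng.normal(loc=rng.normal(0.0, 1.0, size=80), scale=0.8, size=(4000, 80))

loss = lambda theta, d: (np.quantile(theta, 0.75) - d) ** 2
delta_optimal = np.mean(np.quantile(draws, 0.75, axis=1))  # posterior mean of the quantile
delta_plugin = np.quantile(draws.mean(axis=0), 0.75)       # plug-in from posterior means

print(posterior_regrets(loss, draws, delta_plugin, delta_optimal))
```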
Chapter 3
Empirical Quantiles and Quartile Ratio Losses
Summary
The amount of heterogeneity present in a parameter ensemble is crucial in both epidemiology and spatial epidemiology. A highly heterogeneous ensemble of RRs, for instance, may indicate the presence of a hidden risk factor not included in the model of interest. In modern BHMs, however, the quantification of the heterogeneity of a parameter ensemble is rendered difficult by the utilisation of link functions and the fact that the joint posterior distribution of the parameter ensemble is generally not available in closed form. In this chapter, we consider the estimation of such heterogeneity through the computation of the empirical quantiles and empirical quartile ratio (QR) of a parameter ensemble. We construct simple loss functions to optimise the estimation of these quantities, which we term respectively the quantile squared error loss (Q-SEL) and the QR squared error loss (QR-SEL) functions. We compare the performance of the optimal estimators under these two losses with the use of plug-in estimators based on ensembles of point estimates obtained under standard estimation procedures, using the posterior regret. These candidate plug-in estimators include the set of MLEs and the ensembles of point estimates under the SSEL, WRSEL, CB and triple-goal functions. In addition, we also consider the ratio of posterior quartiles (RoPQ) as a possible candidate for measuring the heterogeneity of a parameter ensemble. We evaluate the performance of these plug-in estimators using non-spatial and spatially structured simulated data. In these two experimental studies, we found that the RoPQ and GR plug-in estimators tend to outperform other plug-in estimators under a variety of simulation scenarios. It was observed that the performance of the WRSEL function is highly sensitive to the size of the parameter ensemble. We also noted that the CAR Normal model tends to yield smaller posterior losses when using the optimal estimator under both the Q-SEL and QR-SEL. In addition, we computed these optimal and plug-in quantities for a real data set describing the prevalence of cases of schizophrenia in an urban population. On this basis, we draw some recommendations on the type of QR summary statistics to use in practice.
3.1 Introduction
In mixed effects models, which can be regarded as the frequentist version of BHMs, the main focus of the analysis has historically been on estimating fixed effects –that is, the regression coefficients of particular covariates. However, since the values of the fixed effect parameters are highly dependent on the values taken by the random effects, statisticians’ attention has progressively turned to the systematic study of random effects (Demidenko, 2004). Such a vector of random effects can be described as a parameter ensemble as introduced in section 2.2.1. A standard way of quantifying the amount of heterogeneity present in a parameter ensemble is to use the intraclass
coefficient (ICC). The ICC can be computed whenever quantitative measurements are made on units organised into groups. This statistic constitutes a useful complement to the analysis of variance (ANOVA), for instance, where individual units are grouped into N groups, indexed by j = 1, \ldots, N. The nested observations are then modelled as
y_{ij} = \mu + \alpha_j + \epsilon_{ij}, \qquad (3.1)
where \alpha_j \overset{\text{iid}}{\sim} N(0, \sigma^2_\alpha) and \epsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2). Here, the α_j's are independent of the ε_ij's, and we assume that there is an identical number of units in each group, i.e. i = 1, \ldots, n. For this model, the ICC is defined as follows,
\operatorname{ICC} := \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2}. \qquad (3.2)
The ICC is a useful measure of parameter ensemble heterogeneity for general linear models that contain random effects. However, there does not exist an equivalent measure for generalised linear models. Indeed, in generalised linear models, the random effects of interest affect the response through a link function, which transforms the distribution of the original random effects. Thus, the variance parameter controlling the variability of these random effects only provides a distorted image of the dispersion of the parameter ensemble of interest.
Generalised linear models are commonly used in epidemiology and spatial epidemiology. In this context, the heterogeneity of a set of random effects may be indicative of the effect of hidden covariates that have not been included in the model. An accurate estimation of the dispersion of the random effects is therefore crucial for the detection of unexplained variability, which may then lead to the inclusion of new candidate risk factors. Most Bayesian models used in modern epidemiological practice, however, are hierarchical. In such models, the set of parameters of interest will be a non-linear function of one or more random effects as well as unit-specific covariates. The units may here be administrative areas within a region of interest, or aggregates such as hospitals or schools. For general BHMs, therefore, there does not exist a unique parameter which controls the variability of the unit-specific levels of risk. The variance parameters of several random effects can be added together, but this only constitutes a heuristic solution for quantifying the heterogeneity of the resulting empirical distribution of the parameters of interest. For such generalised linear models commonly used in epidemiology, there is therefore a need for a statistically principled way of estimating the heterogeneity of parameter ensembles.
One of the first formal attempts at estimating the amount of heterogeneity present in a set of random effects was conducted by Larsen et al. (2000). These authors studied a mixed effects logistic regression model, and considered the variance of the ensemble distribution of random effects as a potential candidate for a measure of dispersion. Formally, they considered N groups of individuals with n subjects in each group. The probability of the ith individual in the jth cluster developing the disease of interest was modelled as
P[Y_{ij} = 1 \mid \alpha_j, x_i] = \frac{\exp\!\big(\alpha_j + \sum_{k=1}^{K} \beta_k x_{ik}\big)}{1 + \exp\!\big(\alpha_j + \sum_{k=1}^{K} \beta_k x_{ik}\big)}, \qquad (3.3)
where i = 1, \ldots, n and j = 1, \ldots, N, and where the random effects are assumed to be independent and identically distributed such that \alpha_j \overset{\text{iid}}{\sim} N(0, \sigma^2_\alpha), for every j. The β_k's constitute a set of fixed effects associated with the individual-specific K-dimensional vectors of risk factors denoted by x_i, for every i. Although the variance parameter, \sigma^2_\alpha, controls the distribution of the random effects, it only affects the probability of developing the disease of interest through the logistic map, and the exact relationship between the variances at the first and second levels of the model hierarchy is therefore difficult to infer from the sole consideration of \sigma^2_\alpha.
Larsen et al. (2000) addressed this problem by assessing the amount of variability in the α_j's by computing the median odds ratio (MOR). For two randomly chosen α_j's drawn from the empirical distribution of the parameter ensemble, Larsen et al. (2000) considered the absolute
difference between these two parameters. Since the α_j's are iid normal variates centred at zero with variance \sigma^2_\alpha, as described above, it can be shown that the difference between two such random variates satisfies \alpha_j - \alpha_k \sim N(0, 2\sigma^2_\alpha) for any pair of indices with j \ne k (Spiegelhalter et al., 2004). Moreover, the absolute difference between two randomly chosen random effects is distributed as the absolute value of a zero-mean Gaussian with variance 2\sigma^2_\alpha, sometimes called a half-Gaussian. Larsen et al. (2000) suggested taking the median of that distribution. For the aforementioned specification, the median of such a truncated Gaussian is \Phi^{-1}(0.75) \times \sqrt{2\sigma^2_\alpha} \doteq 0.95\,\sigma_\alpha. When the α_j's are ORs on the logarithmic scale, the dispersion measure for the parameter ensemble becomes \exp(0.95\,\sigma_\alpha) (see also Larsen and Merlo, 2005).
The MOR has met with considerable success in the literature, and has been adopted by several authors in the medical and epidemiological sciences. In particular, it has been used in cancer epidemiology (e.g. Elmore et al., 2002) and in the study of cardiovascular diseases (e.g. Sundquist et al., 2004), as well as in the monitoring of respiratory health (e.g. Basagana et al., 2004). In a Bayesian setting, however, the MOR will only be available in closed form if the posterior distribution of the random effect ensemble is also available in closed form. This will rarely be the case in practice. For more realistic models, the computation of the MOR will typically require a substantial amount of computational power. Specifically, the calculation of the MOR is based on the distribution of the differences between two random variates. The exhaustive evaluation of such a distribution, however, requires the computation of \binom{n}{2} differences for a parameter ensemble of size n. If one wishes to estimate this particular quantity within a Bayesian context, one could specify a quadratic loss on the MOR. The optimal estimator under that loss function is the posterior mean. It then follows that we would need to compute the median of the distribution of the absolute differences between any two randomly chosen parameters at each iteration of the MCMC algorithm used for fitting the model. Such an estimation scheme rapidly becomes unwieldy as n grows (for n = 2000, about 2 \times 10^{6} differences, |\alpha_j - \alpha_k|, need to be computed at each MCMC iteration). The use of the MOR for Bayesian models thus appears to be impractical.
A less computationally intensive method for quantifying the dispersion of a parameter ensemble is the use of the quartile ratio (QR). This quantity is defined as the ratio of the third to the first quartile of a parameter ensemble. The use of quartiles for measuring heterogeneity comes as no surprise, as these summary statistics have been shown to be especially robust to deviations from normality (Huber and Ronchetti, 2009). That is, the utilisation of quartiles as summary statistics of the data is less sensitive to small departures from general modelling assumptions. This robustness of the quartiles is therefore especially significant when the field of application is the generalised linear model, as in the present case. We will see that, for models with no covariates, the log ratio of the third to the first quartile of the empirical distribution of the RRs is equivalent to the IQR of the random effects on their natural scale. Like the IQR, the ratio of the third to first quartiles of the RRs can thus be regarded as a non-parametric measure of dispersion.
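The half-Gaussian median constant and the relative cost of the MOR and the QR are easy to check numerically. The following is a minimal sketch; the value of σ_α and the simulated ensemble are arbitrary.

```python
import numpy as np
from scipy.stats import norm

# Median of |alpha_j - alpha_k| when alpha_j - alpha_k ~ N(0, 2 * sigma_a^2):
# Phi^{-1}(0.75) * sqrt(2) * sigma_a, i.e. roughly 0.95 * sigma_a.
print(norm.ppf(0.75) * np.sqrt(2))          # ~0.954

sigma_a = 0.4
mor = np.exp(norm.ppf(0.75) * np.sqrt(2) * sigma_a)   # MOR on the odds-ratio scale

# The quartile ratio of the corresponding ensemble of ORs, exp(alpha_j), is
# computed directly from the ensemble: no pairwise differences are required.
rng = np.random.default_rng(7)
alpha = rng.normal(0.0, sigma_a, size=2000)
q1, q3 = np.quantile(np.exp(alpha), [0.25, 0.75])
print(mor, q3 / q1)
```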
Like any function of a parameter ensemble, the QR is amenable to standard optimisation procedures using a decision-theoretic approach. In this chapter, we will construct a specific loss function to quantify the estimation of this quantity. In addition, it will be of interest to optimise the estimation of the quantiles of the ensemble distribution. Specifically, we will study the computation of the optimal first and third quartiles of the ensemble distribution of the parameters of interest. As described in section 2.5, one of our goals in this thesis is to compare the use of optimal estimators with the performance of plug-in estimators based on commonly used ensembles of point estimates. We will conduct these comparisons using the posterior regret, as introduced in chapter 2. The performance of the various plug-in estimators will be assessed by constructing two sets of data simulations. Different experimental factors deemed to influence the estimators' performance will be manipulated in order to precisely evaluate their effects. In particular, the synthetic data sets will be constructed using both a non-spatial and a spatial structure. The non-spatial simulations used in the sequel reproduce in part some of the simulation experiments conducted by Shen and Louis (1998), whereas the spatially structured risk surfaces will be based on the simulation study implemented by Richardson et al. (2004). In addition, we will also be evaluating the impact of using different types of plug-in estimators
for the estimation of the dispersion of parameter ensembles in a real data set. We will re-analyse a published data set describing schizophrenia prevalence in an urban setting. Delineating the geographical distribution of psychoses plays an important role in generating hypotheses about the aetiology of mental disorders, including the role of socioeconomic status (Giggs and Cooper, 1987, Hare, 1956, Hollingshead and Redlich, 1958), urbanicity (Lewis et al., 1992), migration (Cantor-Graae et al., 2003), ethnicity (Harrison et al., 1997) and, more recently, of family history interactions that may betray effect modification by genes (Krabbendam and van Os, 2005, van Os et al., 2003). A raised prevalence of schizophrenia in urban areas was first observed over 75 years ago (Faris and Dunham, 1939), and has been consistently observed since (Giggs, 1973, Giggs and Cooper, 1987, Hafner et al., 1969, Hare, 1956, Lewis et al., 1992, Loffler and Haffner, 1999). Faris and Dunham (1939) demonstrated that the highest rates of schizophrenia occurred in inner-city tracts of Chicago, with a centripetal gradient. More recently, a dose-response relationship between urban birth and the prevalence of schizophrenia has been reported in Sweden (Lewis et al., 1992), the Netherlands (Marcelis et al., 1998) and Denmark (Pedersen and Mortensen, 2001), suggesting that a component of the aetiology of schizophrenia may be associated with the urban environment. This specific data set is especially relevant to the task at hand, since it is characterised by low expected counts, which tend to lead to high levels of shrinkage in commonly used ensembles of point estimates (Gelman and Price, 1999), thereby providing a good testing ground for comparing the use of optimal dispersion estimators with the construction of plug-in estimators.
This chapter is organised as follows. We will first describe our decision-theoretic approach to the estimation of both the empirical quantiles and the empirical QR in section 3.2. The construction of the non-spatial simulations, as well as the comparison of the optimal estimators with some plug-in estimators of interest, will then be reported in section 3.3. The specification of the spatial simulation scenarios and the effects of the different experimental factors considered will be described in section 3.4. Finally, the analysis of the schizophrenia prevalence data set will be given in section 3.5, and some conclusions on the respective performance of the different estimators studied will be provided in section 3.6.
3.2 Estimation of Empirical Quantiles and Quartile Ratio
We here introduce two specific loss functions, which formalise our approach to the estimation of parameter ensembles' quantiles and QR. Both of these loss functions are straightforward adaptations of the quadratic loss, which provides easily derivable optimal minimisers. In addition, we discuss the computation of the respective posterior regrets for these loss functions, as this will be used for evaluating the performance of the plug-in estimators of the quantities of interest.
3.2.1 Empirical Quantiles
The empirical quantiles of a parameter ensemble are especially relevant when one wishes to quantify the dispersion of the ensemble distribution of a parameter ensemble. As defined in section 2.3.3, the pth empirical quantile of an n-dimensional parameter ensemble θ is
\theta(p) := Q_\theta(p), \qquad (3.4)
where Q_\theta(\cdot) is the empirical QDF of θ, and p ∈ [0, 1]. The quantity θ(p) is a non-linear function of θ. Optimal estimation of this quantity can be formalised by quantifying our loss using an SEL function on θ(p), which takes the following form,