Models for Discrete Epidemiological and Clinical Data

Models for Discrete Epidemiological and Clinical data A thesis presented for the degree of Doctor of Philosophy University College London Fiona Clare McElduff UCL Institute of Child Health 2012 1 Declaration I, Fiona Clare McElduff confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. 2 Abstract Discrete data, often known as frequency or count data, comprises of observations which can only take certain separate values, resulting in a more restricted numerical measurement than those provided by continuous data and are common in the clinical sciences and epidemiology. The Poisson distribution is the simplest and most common probability model for discrete data with observations assumed to have a constant rate of occurrence amongst individual units with the property of equal mean and variance. However, in many applications the variance is greater than the mean and overdispersion is said to be present. The application of the Poisson distribution to data exhibiting overdispersion can lead to incorrect inferences and/or inefficient analyses. The most commonly used extension of the Poisson distribution is the negative binomial distribution which allows for unequal mean and variance, but may still be inadequate to model datasets with long tails and/or value-inflation. Further extensions such as Delaporte, Sichel, Gegenbauer and Hermite distributions, give greater flexibility than the negative binomial distribution. These models have received less interest than the Poisson and negative binomial distributions within the statistical literature and many have not been implemented in current statistical software. Also, diagnostics and goodness-of-fit statistics are seldom considered when analysing such datasets. The aim of this thesis is to develop software for analysing discrete data which do not follow the Poisson or negative binomial distributions including component-mix and parameter-mix distributions, value-inflated models, as well as modifications for truncated distributions. The project’s main goals are to create three libraries within the framework of the R project for statistical computing. They are: 1. altmann: to fit and compare a wide range of univariate discrete models 2. discrete.diag: to provide goodness-of-fit and outlier detection diagnostics for these models 3 3. discrete.reg: to fit regression models to discrete response variables within the gamlss framework These libraries will be freely available to the clinical and scientific community to facilitate discrete data interpretation. 4 Acknowledgements I would like to thank my supervisors Mario Cortina-Borja and Angie Wade for their support, guidance and invaluable advice over the term of my PhD study. I would like to thank my colleagues at the MRC Centre of Epidemiology for Child Health, UCL Institute of Child Health, in particular the past and present occupants of Room 5.09 for their support and advice. I am grateful to the many clinicians and researchers who have provided data for this thesis: Professor Adrian Woolf, Dr Shun-Kai Chan, and Dr David long at the Centre of Nephro-eurpology, ICH; Dr Pablo Mateos and Dr James Cheshire from the UCL Department of Geography; Professor Fenella Kirkham at the Neurosciences Unit, ICH; Professor Tony Charman and Dr Greg Pasco, at the Institute of Education, Professor Pat Howlin and Dr Kate Gordon from King’s College London. This project was made possible by a capacity building studentship funded by the Medical Research Council. Finally, I would also like to thank my family for their support and understanding. Especially my parents, who are my biggest champions, Danny, Nicola and Hannah, and also to Kim, Christine and not forgetting Louie. My biggest thanks go to Michael, who has been at my side throughout this journey and whose encouragement has meant the world to me. 5 Contents Abstract 2 Acknowledgements 5 Contents 6 List of Figures 12 List of Tables 18 List of Listings 20 Acronyms and abbreviations 21 1 Introduction 23 1.1 Discrete data . 23 1.2 Examples . 25 1.2.1 UK Surnames distributions . 25 1.2.2 Cysts in steroid treated fetal mouse kidneys . 29 1.2.3 Electroencephalographic seizures in paediatric coma patients 31 1.2.4 Picture Exchange Communication System (PECS) training in teachers of autistic children . 34 1.3 Overview of Thesis . 37 2 Discrete Probability Distributions 39 2.1 Definitions . 39 6 2.1.1 Overdispersion . 39 2.1.2 Value-inflation . 41 2.1.3 Long tails . 42 2.1.4 Truncation . 42 2.1.5 Notation . 43 2.1.6 Special Functions . 49 2.2 Basic Distributions . 55 2.2.1 Bernoulli (p) .......................... 57 2.2.2 Binomial (p; n) ......................... 58 2.2.3 Geometric (p) .......................... 60 2.2.4 Hypergeometric (m; n; k) ................... 62 2.2.5 Poisson (µ) ........................... 66 2.3 Parameter-Mix Distributions . 68 2.3.1 Negative Binomial . 69 2.3.2 Holla (α; θ) ........................... 74 2.3.3 Sichel (α; θ; γ) ......................... 76 2.3.4 Delaporte (α; β; γ) ....................... 80 2.3.5 Yule (λ) ............................. 83 2.3.6 Waring (b; n) .......................... 86 2.3.7 Beta-Binomial (a; b; n) ..................... 88 2.4 Component-Mix Distributions . 93 2.4.1 Zero-inflated Poisson (!; µ) .................. 94 2.4.2 Zero-inflated Negative Binomial (!; p; r) ........... 96 2.4.3 Zero-inflated Sichel (!; α; θ; γ) ................ 99 2.4.4 2-component Poisson Mixture (!; µ, λ) ............ 101 2.4.5 2-component Poisson-Negative Binomial Mixture (!; µ, r; p) . 105 2.5 Truncated Distributions . 108 2.5.1 Positive Poisson (µ) ...................... 109 2.5.2 Positive Geometric (p) ..................... 111 7 2.5.3 Positive Negative Binomial (r; p) ............... 113 2.5.4 Positive Holla (α; θ) ...................... 115 2.5.5 Positive Sichel (α; θ; γ) ..................... 117 2.5.6 Positive Yule (λ) ........................ 121 2.6 Lerch Family Distributions . 123 2.6.1 Lerch (p; a; c) .......................... 124 2.6.2 Zipf (a; c) ............................ 128 2.6.3 Good (p; c) ........................... 130 2.6.4 Zeta (c) ............................. 132 2.7 Generalized Poisson Distributions . 135 2.7.1 Neyman Type A (µ, φ) ..................... 136 2.7.2 Hermite (a; b) .......................... 138 2.7.3 Generalized Hermite (a; b; m) ................. 141 2.7.4 Gegenbauer (a; b; k)...................... 145 2.7.5 Generalized Gegenbauer (a; m; α; β).............. 148 3 Fitting the models 153 3.1 Estimation methods . 153 3.1.1 Rapid Estimation . 153 3.1.2 Maximum Likelihood . 158 3.1.3 Expectation-Maximization (EM) algorithm . 162 3.2 Frameworks for model fitting . 165 3.2.1 Generalized Linear Models (GLM) . 165 3.2.2 Generalized Additive Models (GAM) . 168 3.2.3 Generalized Additive Models for Location, Scale and Shape (GAMLSS) . 171 3.3 Diagnostics . 174 3.3.1 Goodness-of-fit . 175 3.3.2 Model Comparisons . 184 3.3.3 Outlier Detection . 193 8 4 Software for fitting discrete probability models 198 4.1 Current Software . 198 4.1.1 PASW .............................. 199 4.1.2 Stata ............................. 200 4.1.3 SAS ............................... 200 4.1.4 R ................................ 201 4.1.5 MATHEMATICA ........................ 208 4.1.6 Altmann Fitter . 208 4.2 Gaps in methodology . 209 4.3 Outline of software . 212 5 Altmann Library 215 5.1 Datasets . 216 5.2 Summary of discrete datasets . 218 5.3 pdqr for distributions . 221 5.3.1 Probability density function d ................. 221 5.3.2 Cumulative density function p ................. 223 5.3.3 Quantile function q ....................... 226 5.3.4 Random generating function r ................. 227 5.4 Maximum likelihood estimation functions . 229 5.4.1 Estimation of starting values . 233 5.4.2 Maximum likelihood estimation using mle .......... 236 5.4.3 Goodness-of-fit statistics and Output . 237 5.5 Plotting mle objects . 239 5.6 Model comparisons . 244 5.7 Validation of the functions . 249 5.8 Further Examples . 252 5.8.1 Automobile accidents claims for drivers in Belgium, 1978 . 252 5.8.2 Numbers of births occurring to HIV-infected women . 258 5.9 Application to UK surnames distribution . 261 9 6 discrete.diag Library 269 6.1 Goodness-of-fit Methods . 270 6.1.1 Chi-squared Goodness-of-fit Test . 270 6.1.2 Residuals . 272 6.2 Model Comparison . 275 6.2.1 AIC and BIC . 275 6.2.2 EPGF plots . 276 6.3 Outlier Detection . 281 6.3.1 EPGF Outliers plot . 281 6.3.2 Surprise Index plot . 285 6.4 Validation of the functions . 289 6.5 Application to counts of cysts in steroid treated foetal mouse kidneys . 290 6.5.1 Outlier Detection using the EPGF . 291 6.5.2 Model fitting . 292 6.5.3 Outlier detection using Surprise Index . 294 7 discrete.reg library 299 7.1 Geometric Distribution . 299 7.2 Yule Distribution . 304 7.3 Waring Distribution . 309 7.4 Validation of the functions . 314 7.5 Application to Electroencephalographic Seizures in coma patients . 315 8 Discussion 327 8.1 Contributions to software . 327 8.1.1 Altmann library . 327 8.1.2 discrete.diag library . 328 8.1.3 discrete.reg library . 328 8.2 Implications for data analysis . 328 8.3 Limitations of libraries . 330 10 8.4 Further Work . 331 8.5 Conclusion . 333 References 334 Appendices 348 A Distribution Moments 349 B Publications and posters arising from this research 374 B.1 List of publications . 374 B.2 List of Posters . 375 11 List of Figures 1.1 UK Surnames frequencies. 27 1.2 Histograms of counts of cysts in steroid treated and control foetal mouse kidneys. 30 1.3 Number of ES in paediatric coma patients. 32 1.4 Rate of ES in coma patients. 33 1.5 Outcome measure (frequency of initiations, PECS use and speech) as frequencies by treatment group by time period.

Models for Discrete Epidemiological and Clinical Data

Regression-Based Imputation of Explanatory Discrete Missing Data

Hermite Regression Analysis of Multi-Modal Count Data

Bounds on Tail Probabilities of Discrete Distributions

Modeling of Pdfs of Non-Gaussian Data and Application to Wind Pressure Coefficients on Low-Rise Building

A Comparison Between Approximations of Option Pricing Models and Risk-Neutral Densities Using Hermite Polynomials

Package 'Univrng'

A Probabilistic Approach to Some of Euler's Number Theoretic Identities

Multivariate Limit Theorems in the Context of Long-Range Dependence

Handbook on Probability Distributions

Good Distribution Modelling with the R Package Good

Compound Poisson Point Processes, Concentration and Oracle Inequalities

Characterization of Conjugate Priors for Discrete Exponential Families