Models for Discrete Epidemiological and Clinical Data

Total Page:16

File Type:pdf, Size:1020Kb

Models for Discrete Epidemiological and Clinical Data Models for Discrete Epidemiological and Clinical data A thesis presented for the degree of Doctor of Philosophy University College London Fiona Clare McElduff UCL Institute of Child Health 2012 1 Declaration I, Fiona Clare McElduff confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. 2 Abstract Discrete data, often known as frequency or count data, comprises of observations which can only take certain separate values, resulting in a more restricted numerical measurement than those provided by continuous data and are common in the clinical sciences and epidemiology. The Poisson distribution is the simplest and most common probability model for discrete data with observations assumed to have a constant rate of occurrence amongst individual units with the property of equal mean and variance. However, in many applications the variance is greater than the mean and overdispersion is said to be present. The application of the Poisson distribution to data exhibiting overdispersion can lead to incorrect inferences and/or inefficient analyses. The most commonly used extension of the Poisson distribution is the negative binomial distribution which allows for unequal mean and variance, but may still be inadequate to model datasets with long tails and/or value-inflation. Further extensions such as Delaporte, Sichel, Gegenbauer and Hermite distributions, give greater flexibility than the negative binomial distribution. These models have received less interest than the Poisson and negative binomial distributions within the statistical literature and many have not been implemented in current statistical software. Also, diagnostics and goodness-of-fit statistics are seldom considered when analysing such datasets. The aim of this thesis is to develop software for analysing discrete data which do not follow the Poisson or negative binomial distributions including component-mix and parameter-mix distributions, value-inflated models, as well as modifications for truncated distributions. The project’s main goals are to create three libraries within the framework of the R project for statistical computing. They are: 1. altmann: to fit and compare a wide range of univariate discrete models 2. discrete.diag: to provide goodness-of-fit and outlier detection diagnostics for these models 3 3. discrete.reg: to fit regression models to discrete response variables within the gamlss framework These libraries will be freely available to the clinical and scientific community to facilitate discrete data interpretation. 4 Acknowledgements I would like to thank my supervisors Mario Cortina-Borja and Angie Wade for their support, guidance and invaluable advice over the term of my PhD study. I would like to thank my colleagues at the MRC Centre of Epidemiology for Child Health, UCL Institute of Child Health, in particular the past and present occupants of Room 5.09 for their support and advice. I am grateful to the many clinicians and researchers who have provided data for this thesis: Professor Adrian Woolf, Dr Shun-Kai Chan, and Dr David long at the Centre of Nephro-eurpology, ICH; Dr Pablo Mateos and Dr James Cheshire from the UCL Department of Geography; Professor Fenella Kirkham at the Neurosciences Unit, ICH; Professor Tony Charman and Dr Greg Pasco, at the Institute of Education, Professor Pat Howlin and Dr Kate Gordon from King’s College London. This project was made possible by a capacity building studentship funded by the Medical Research Council. Finally, I would also like to thank my family for their support and understanding. Especially my parents, who are my biggest champions, Danny, Nicola and Hannah, and also to Kim, Christine and not forgetting Louie. My biggest thanks go to Michael, who has been at my side throughout this journey and whose encouragement has meant the world to me. 5 Contents Abstract 2 Acknowledgements 5 Contents 6 List of Figures 12 List of Tables 18 List of Listings 20 Acronyms and abbreviations 21 1 Introduction 23 1.1 Discrete data . 23 1.2 Examples . 25 1.2.1 UK Surnames distributions . 25 1.2.2 Cysts in steroid treated fetal mouse kidneys . 29 1.2.3 Electroencephalographic seizures in paediatric coma patients 31 1.2.4 Picture Exchange Communication System (PECS) training in teachers of autistic children . 34 1.3 Overview of Thesis . 37 2 Discrete Probability Distributions 39 2.1 Definitions . 39 6 2.1.1 Overdispersion . 39 2.1.2 Value-inflation . 41 2.1.3 Long tails . 42 2.1.4 Truncation . 42 2.1.5 Notation . 43 2.1.6 Special Functions . 49 2.2 Basic Distributions . 55 2.2.1 Bernoulli (p) .......................... 57 2.2.2 Binomial (p; n) ......................... 58 2.2.3 Geometric (p) .......................... 60 2.2.4 Hypergeometric (m; n; k) ................... 62 2.2.5 Poisson (µ) ........................... 66 2.3 Parameter-Mix Distributions . 68 2.3.1 Negative Binomial . 69 2.3.2 Holla (α; θ) ........................... 74 2.3.3 Sichel (α; θ; γ) ......................... 76 2.3.4 Delaporte (α; β; γ) ....................... 80 2.3.5 Yule (λ) ............................. 83 2.3.6 Waring (b; n) .......................... 86 2.3.7 Beta-Binomial (a; b; n) ..................... 88 2.4 Component-Mix Distributions . 93 2.4.1 Zero-inflated Poisson (!; µ) .................. 94 2.4.2 Zero-inflated Negative Binomial (!; p; r) ........... 96 2.4.3 Zero-inflated Sichel (!; α; θ; γ) ................ 99 2.4.4 2-component Poisson Mixture (!; µ, λ) ............ 101 2.4.5 2-component Poisson-Negative Binomial Mixture (!; µ, r; p) . 105 2.5 Truncated Distributions . 108 2.5.1 Positive Poisson (µ) ...................... 109 2.5.2 Positive Geometric (p) ..................... 111 7 2.5.3 Positive Negative Binomial (r; p) ............... 113 2.5.4 Positive Holla (α; θ) ...................... 115 2.5.5 Positive Sichel (α; θ; γ) ..................... 117 2.5.6 Positive Yule (λ) ........................ 121 2.6 Lerch Family Distributions . 123 2.6.1 Lerch (p; a; c) .......................... 124 2.6.2 Zipf (a; c) ............................ 128 2.6.3 Good (p; c) ........................... 130 2.6.4 Zeta (c) ............................. 132 2.7 Generalized Poisson Distributions . 135 2.7.1 Neyman Type A (µ, φ) ..................... 136 2.7.2 Hermite (a; b) .......................... 138 2.7.3 Generalized Hermite (a; b; m) ................. 141 2.7.4 Gegenbauer (a; b; k)...................... 145 2.7.5 Generalized Gegenbauer (a; m; α; β).............. 148 3 Fitting the models 153 3.1 Estimation methods . 153 3.1.1 Rapid Estimation . 153 3.1.2 Maximum Likelihood . 158 3.1.3 Expectation-Maximization (EM) algorithm . 162 3.2 Frameworks for model fitting . 165 3.2.1 Generalized Linear Models (GLM) . 165 3.2.2 Generalized Additive Models (GAM) . 168 3.2.3 Generalized Additive Models for Location, Scale and Shape (GAMLSS) . 171 3.3 Diagnostics . 174 3.3.1 Goodness-of-fit . 175 3.3.2 Model Comparisons . 184 3.3.3 Outlier Detection . 193 8 4 Software for fitting discrete probability models 198 4.1 Current Software . 198 4.1.1 PASW .............................. 199 4.1.2 Stata ............................. 200 4.1.3 SAS ............................... 200 4.1.4 R ................................ 201 4.1.5 MATHEMATICA ........................ 208 4.1.6 Altmann Fitter . 208 4.2 Gaps in methodology . 209 4.3 Outline of software . 212 5 Altmann Library 215 5.1 Datasets . 216 5.2 Summary of discrete datasets . 218 5.3 pdqr for distributions . 221 5.3.1 Probability density function d ................. 221 5.3.2 Cumulative density function p ................. 223 5.3.3 Quantile function q ....................... 226 5.3.4 Random generating function r ................. 227 5.4 Maximum likelihood estimation functions . 229 5.4.1 Estimation of starting values . 233 5.4.2 Maximum likelihood estimation using mle .......... 236 5.4.3 Goodness-of-fit statistics and Output . 237 5.5 Plotting mle objects . 239 5.6 Model comparisons . 244 5.7 Validation of the functions . 249 5.8 Further Examples . 252 5.8.1 Automobile accidents claims for drivers in Belgium, 1978 . 252 5.8.2 Numbers of births occurring to HIV-infected women . 258 5.9 Application to UK surnames distribution . 261 9 6 discrete.diag Library 269 6.1 Goodness-of-fit Methods . 270 6.1.1 Chi-squared Goodness-of-fit Test . 270 6.1.2 Residuals . 272 6.2 Model Comparison . 275 6.2.1 AIC and BIC . 275 6.2.2 EPGF plots . 276 6.3 Outlier Detection . 281 6.3.1 EPGF Outliers plot . 281 6.3.2 Surprise Index plot . 285 6.4 Validation of the functions . 289 6.5 Application to counts of cysts in steroid treated foetal mouse kidneys . 290 6.5.1 Outlier Detection using the EPGF . 291 6.5.2 Model fitting . 292 6.5.3 Outlier detection using Surprise Index . 294 7 discrete.reg library 299 7.1 Geometric Distribution . 299 7.2 Yule Distribution . 304 7.3 Waring Distribution . 309 7.4 Validation of the functions . 314 7.5 Application to Electroencephalographic Seizures in coma patients . 315 8 Discussion 327 8.1 Contributions to software . 327 8.1.1 Altmann library . 327 8.1.2 discrete.diag library . 328 8.1.3 discrete.reg library . 328 8.2 Implications for data analysis . 328 8.3 Limitations of libraries . 330 10 8.4 Further Work . 331 8.5 Conclusion . 333 References 334 Appendices 348 A Distribution Moments 349 B Publications and posters arising from this research 374 B.1 List of publications . 374 B.2 List of Posters . 375 11 List of Figures 1.1 UK Surnames frequencies. 27 1.2 Histograms of counts of cysts in steroid treated and control foetal mouse kidneys. 30 1.3 Number of ES in paediatric coma patients. 32 1.4 Rate of ES in coma patients. 33 1.5 Outcome measure (frequency of initiations, PECS use and speech) as frequencies by treatment group by time period.
Recommended publications
  • Regression-Based Imputation of Explanatory Discrete Missing Data
    Regression-based imputation of explanatory discrete missing data G. Hern´andez-Herrera1,2, A. Navarro1, and D. Mori˜na∗3,4 1Research Group on Psychosocial Risks, Organization of Work and Health (POWAH), Unitat de Bioestad´ıstica, Facultat de Medicina, Universitat Aut`onoma de Barcelona 2Instituto de Investigaciones M´edicas, Facultad de Medicina, Universidad de Antioquia 3Barcelona Graduate School of Mathematics (BGSMath), Departament de Matem`atiques, Universitat Aut`onoma de Barcelona 4Department of Econometrics, Statistics and Applied Economics, Riskcenter-IREA, Universitat de Barcelona (UB) Abstract Imputation of missing values is a strategy for handling non-responses in surveys or data loss in measurement processes, which may be more effective than ignoring the losses. When the variable represents a arXiv:2007.15031v1 [stat.AP] 29 Jul 2020 count, the literature dealing with this issue is scarce. If the vari- able has an excess of zeros it is necessary to consider models includ- ing parameters for handling zero-inflation. Likewise, if problems of over- or under-dispersion are observed, generalisations of the Pois- son, such as the Hermite or Conway-Maxwell Poisson distributions are recommended for carrying out imputation. In order to assess the performance of various regression models in the imputation of a dis- crete variable based on Poisson generalisations, compared to classical ∗Corresponding Author: David Mori˜na ([email protected]) 1 counting models, this work presents a comprehensive simulation study considering a variety of scenarios and real data from a lung cancer study. To do so we compared the results of estimations using only complete data, and using imputations based on the Poisson, negative binomial, Hermite, and COMPoisson distributions, and the ZIP and ZINB models for excesses of zeros.
    [Show full text]
  • Hermite Regression Analysis of Multi-Modal Count Data
    Econometrics Working Paper EWP1001 ISSN 1485-6441 Department of Economics Hermite Regression Analysis of Multi-Modal Count Data David E. Giles* Department of Economics, University of Victoria Victoria, B.C., Canada V8W 2Y2 March, 2010 Abstract We discuss the modeling of count data whose em pirical distribution is both multi-modal and over- dispersed, and propose the Hermite distribution with covariates introduced through the conditional mean. The model is readily estimated by maximu m likelihood, and nests the Poisson model as a special case. The Hermite regression model is applied to data for the number of banking and currency crises in IMF-member countries, and is found to out-perform the Poisson and negative binomial models. Keywords: Count data, multi-modal data, over-dispersion, financial crises JEL Classifications: C16, C25, G15, G21 Author Contact: David E. Giles, Dept. of Economics, University of Victoria, P.O. Box 1700, STN CSC, Victoria, B.C., Canada V8W 2Y2; e-mail: [email protected]; Phone: (250) 721-8540; FAX: (250) 721-6214 1. Introduction Typically, models for count data (i.e., data that take only non-negative integer values) are based on distributions that do not allow for multi-modality. Obvious examples are the Poisson and negative binomial distributions. This seriously limits the usefulness of such models. We discuss a way of broadening the class of discrete distributions that are used in this field by adopting the Hermite distribution proposed by Kemp and Kemp (1965). Apart from Giles (2007), this distribution has not been used in the econometrics literature to date. Further, it appears that when it has been used in other fields, no consideration has been given to introducing covariates into the model, as is done with the conventional Poisson and negative binomial regression models.
    [Show full text]
  • Bounds on Tail Probabilities of Discrete Distributions
    PEIS02-3 Probability in the Engineering and Informational Sciences, 14, 2000, 161–171+ Printed in the U+S+A+ BOUNDS ON TAIL PROBABILITIES OF DISCRETE DISTRIBUTIONS BEERRRNNNHHHAAARRRDD KLLAAARR Institut für Mathematische Stochastik Universität Karlsruhe D-76128 Karlsruhe, Germany E-mail: [email protected] We examine several approaches to derive lower and upper bounds on tail probabil- ities of discrete distributions+ Numerical comparisons exemplify the quality of the bounds+ 1. INTRODUCTION There has not been much work concerning bounds on tail probabilities of discrete distributions+ In the work of Johnson, Kotz, and Kemp @9#, for example, only two ~quite bad! upper bounds on Poisson tail probabilities are quoted+ This article examines several more generally applicable methods to derive lower and upper bounds on tail probabilities of discrete distributions+ Section 2 presents an elementary approach to obtain such bounds for many families of discrete distribu- tions+ Besides the Poisson, the binomial, the negative binomial, the logseries, and the zeta distribution, we consider the generalized Poisson distribution, which is fre- quently used in applications+ Some of the bounds emerge in different areas in the literature+ Despite the elementary proofs, the bounds reveal much of the character of the distributional tail and are sometimes better than other, more complicated bounds+ As shown in Section 3, some of the elementary bounds can be improved using similar arguments+ In a recent article, Ross @11# derived bounds on Poisson tail probabilities using the importance sampling identity+ Section 4 extends this method to the logseries and the negative binomial distribution+ Finally, some numerical com- parisons illustrate the quality of the bounds+ Whereas approximations to tail probabilities become less important due to an increase in computing power, bounds on the tails are often needed for theoretical © 2000 Cambridge University Press 0269-9648000 $12+50 161 162 B.
    [Show full text]
  • Modeling of Pdfs of Non-Gaussian Data and Application to Wind Pressure Coefficients on Low-Rise Building
    MODELING OF PDFS OF NON-GAUSSIAN DATA AND APPLICATION TO WIND PRESSURE COEFFICIENTS ON LOW-RISE BUILDING By LUPING YANG A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2011 1 © 2011 Luping Yang 2 To my parents Qingcheng Yang and Chaoqun Pang 3 ACKNOWLEDGMENTS First of all, I wish to express my gratitude to my committee chair David O. Prevatt. This work would not have been accomplished without his patient guidance, continual support and great encouragement. I am also grateful to my co-chair and research advisor Dr. Kurtis R. Gurley. Working with Dr. Gurley has been a great pleasure for me. His insightful knowledge and amiable smile always gives me inspiration and encouragement when I encounter difficulties. His guidance to me has not only been limited to study and research but also extended to the way of thinking, the manner of interacting with people and the attitude of life. I wish to express my sincere gratitude to Dr. Sergei V. Shabanov in the Mathematical department for serving on my advisory committee. I would like to thank my family, Mr. and Mrs. Yang and my brother Wei Yang, for all their love, care and support. They have made me feel happy and have given me faith to study from thousands of miles away. Lastly, I wish to express my gratitude to my friends, colleagues and the hurricane research group at the University of Florida. Studying and working at UF has been a valuable and pleasant experience in my life.
    [Show full text]
  • A Comparison Between Approximations of Option Pricing Models and Risk-Neutral Densities Using Hermite Polynomials
    U.U.D.M. Project Report 2020:16 A Comparison between Approximations of Option Pricing Models and Risk-Neutral Densities using Hermite Polynomials Nathaniel Ahy Examensarbete i matematik, 30 hp Handledare: Benny Avelin Examinator: Erik Ekström Juni 2020 Department of Mathematics Uppsala University Contents 1 Introduction 1 1.1 Background of Hermite Polynomials . .1 1.2 Financial Derivatives and Stochastic Volatility . .2 1.3 Hermite Polynomials in Finance . .5 1.4 Problem Formulation . .7 1.5 Methodology . .7 1.6 Outline . .8 2 Theory 10 2.1 Heston Model . 10 2.1.1 The Fundamental Theorem of Finance . 11 2.2 Gram-Charlier Expansions . 13 2.2.1 Cumulants and an Explicit Formula for Computing them 13 2.2.2 Gram–Charlier Expansions for the Heston Model . 14 2.3 Gram-Charlier Expansion Option Pricing Formula . 14 2.4 Necula, Drimus and Farkas Risk-Neutral Measure . 15 2.4.1 A closed form formula for the coefficients for the risk- neutral density . 17 2.4.2 Implementation of Closed Form Formula . 19 3 Option Pricing 21 3.1 Estimating Coefficients through quadratic programming . 21 3.1.1 The Active Set Algorithm . 25 3.2 Fitting the NDF log mean and standard deviation to a Model 27 4 Model Fitting Results 28 4.1 Computing the Cumulants . 28 4.2 Model Errors . 29 2 Uppsala University 5 Option Data 33 5.1 Fitting the GC Parameters to the Data . 33 6 Data Fitting Results 36 6.1 NDF Errors . 36 6.2 GC errors . 40 6.3 Model Complexity . 43 7 Conclusion 45 7.1 Conclusion .
    [Show full text]
  • Package 'Univrng'
    Package ‘UnivRNG’ March 5, 2021 Type Package Title Univariate Pseudo-Random Number Generation Version 1.2.3 Date 2021-03-05 Author Hakan Demirtas, Rawan Allozi, Ran Gao Maintainer Ran Gao <[email protected]> Description Pseudo-random number generation of 17 univariate distributions proposed by Demir- tas. (2005) <DOI:10.22237/jmasm/1114907220>. License GPL-2 | GPL-3 NeedsCompilation no Repository CRAN Date/Publication 2021-03-05 18:10:02 UTC R topics documented: UnivRNG-package . .2 draw.beta.alphabeta.less.than.one . .3 draw.beta.binomial . .4 draw.gamma.alpha.greater.than.one . .5 draw.gamma.alpha.less.than.one . .6 draw.inverse.gaussian . .7 draw.laplace . .8 draw.left.truncated.gamma . .8 draw.logarithmic . .9 draw.noncentral.chisquared . 10 draw.noncentral.F . 11 draw.noncentral.t . 12 draw.pareto . 12 draw.rayleigh . 13 draw.t ............................................ 14 draw.von.mises . 15 draw.weibull . 16 draw.zeta . 16 1 2 UnivRNG-package Index 18 UnivRNG-package Univariate Pseudo-Random Number Generation Description This package implements the algorithms described in Demirtas (2005) for pseudo-random number generation of 17 univariate distributions. The following distributions are available: Left Truncated Gamma, Laplace, Inverse Gaussian, Von Mises, Zeta (Zipf), Logarithmic, Beta-Binomial, Rayleigh, Pareto, Non-central t, Non-central Chi-squared, Doubly non-central F , Standard t, Weibull, Gamma with α<1, Gamma with α>1, and Beta with α<1 and β<1. For some distributions, functions that have similar capabilities exist in the base package; the functions herein should be regarded as com- plementary tools. The methodology for each random-number generation procedure varies and each distribution has its own function.
    [Show full text]
  • A Probabilistic Approach to Some of Euler's Number Theoretic Identities
    TRANSACTIONS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 350, Number 7, July 1998, Pages 2939{2951 S 0002-9947(98)01969-2 A PROBABILISTIC APPROACH TO SOME OF EULER'S NUMBER THEORETIC IDENTITIES DON RAWLINGS Abstract. Probabilistic proofs and interpretations are given for the q-bino- mial theorem, q-binomial series, two of Euler's fundamental partition identi- ties, and for q-analogs of product expansions for the Riemann zeta and Euler phi functions. The underlying processes involve Bernoulli trials with variable probabilities. Also presented are several variations on the classical derange- ment problem inherent in the distributions considered. 1. Introduction For an integer k 1, the q-shifted factorial of t is defined as ≥ k 1 (t; q) =(1 t)(1 tq) (1 tq − ) : k − − ··· − By convention, (t; q)0 = 1. Using arguments on partitions, Euler showed that k ∞ (2) k ∞ q t i 1 (1) = (1 + tq − ) (q; q) k i=1 kX=0 Y and ∞ k ∞ t i 1 1 (2) = (1 tq − )− (q; q) − k i=1 Xk=0 Y hold formally. Identity (1) follows from the observation that the coefficient of k n k t q − in either the sum or the product is equal to the number of partitions of n k n k into k distinct parts. The coefficient of t q − in either side of (2) is the number of partitions of n with k parts (repetition allowed). For n 0, similar arguments on partitions with bounded parts yield the ≥ k k n k q-Binomial Theorem: (t; q) = ( 1) q(2) t n − k k 0 X≥ and the n + k q-Binomial Series: 1=(t; q) = tk n+1 k k 0 X≥ Received by the editors August 8, 1996 and, in revised form, September 16, 1996.
    [Show full text]
  • Multivariate Limit Theorems in the Context of Long-Range Dependence
    Multivariate limit theorems in the context of long-range dependence Shuyang Bai Murad S. Taqqu September 4, 2018 Abstract We study the limit law of a vector made up of normalized sums of functions of long-range depen- dent stationary Gaussian series. Depending on the memory parameter of the Gaussian series and on the Hermite ranks of the functions, the resulting limit law may be (a) a multivariate Gaussian process involv- ing dependent Brownian motion marginals, or (b) a multivariate process involving dependent Hermite processes as marginals, or (c) a combination. We treat cases (a), (b) in general and case (c) when the Hermite components involve ranks 1 and 2. We include a conjecture about case (c) when the Hermite ranks are arbitrary. 1 Introduction A stationary time series displays long-range dependence if its auto-covariance decays slowly or if its spectral density diverges around the zero frequency. When there is long-range dependence, the asymptotic limits of various estimators are often either Brownian Motion or a Hermite process. The most common Hermite processes are fractional Brownian motion (Hermite process of order 1) and the Rosenblatt process (Hermite process of order 2), but there are Hermite processes of any order. Fractional Brownian motion is the only Gaussian Hermite process. Most existing limit theorems involve univariate convergence, that is, convergence to a single limit process, for example, Brownian motion or a Hermite process ([3, 5, 24]). In time series analysis, however, one often needs joint convergence, that is, convergence to a vector of processes. This is because one often needs to consider different statistics of the process jointly.
    [Show full text]
  • Handbook on Probability Distributions
    R powered R-forge project Handbook on probability distributions R-forge distributions Core Team University Year 2009-2010 LATEXpowered Mac OS' TeXShop edited Contents Introduction 4 I Discrete distributions 6 1 Classic discrete distribution 7 2 Not so-common discrete distribution 27 II Continuous distributions 34 3 Finite support distribution 35 4 The Gaussian family 47 5 Exponential distribution and its extensions 56 6 Chi-squared's ditribution and related extensions 75 7 Student and related distributions 84 8 Pareto family 88 9 Logistic distribution and related extensions 108 10 Extrem Value Theory distributions 111 3 4 CONTENTS III Multivariate and generalized distributions 116 11 Generalization of common distributions 117 12 Multivariate distributions 133 13 Misc 135 Conclusion 137 Bibliography 137 A Mathematical tools 141 Introduction This guide is intended to provide a quite exhaustive (at least as I can) view on probability distri- butions. It is constructed in chapters of distribution family with a section for each distribution. Each section focuses on the tryptic: definition - estimation - application. Ultimate bibles for probability distributions are Wimmer & Altmann (1999) which lists 750 univariate discrete distributions and Johnson et al. (1994) which details continuous distributions. In the appendix, we recall the basics of probability distributions as well as \common" mathe- matical functions, cf. section A.2. And for all distribution, we use the following notations • X a random variable following a given distribution, • x a realization of this random variable, • f the density function (if it exists), • F the (cumulative) distribution function, • P (X = k) the mass probability function in k, • M the moment generating function (if it exists), • G the probability generating function (if it exists), • φ the characteristic function (if it exists), Finally all graphics are done the open source statistical software R and its numerous packages available on the Comprehensive R Archive Network (CRAN∗).
    [Show full text]
  • Good Distribution Modelling with the R Package Good
    GOOD DISTRIBUTION MODELLING WITH THE R PACKAGE good Jordi Tur, David Moriña, Pedro Puig, Alejandra Cabaña, Argimiro Arratia, Amanda Fernández-Fontelo ABSTRACT Although models for count data with over-dispersion have been widely considered in the literature, models for under-dispersion -the opposite phenomenon- have received less attention as it is only rela- tively common in particular research fields such as biodosimetry and ecology. The Good distribution is a flexible alternative for modelling count data showing either over-dispersion or under-dispersion, although no R packages are still available to the best of our knowledge. We aim to present in the following the R package good that computes the standard probabilistic functions (i.e., probability density function, cumulative distribution function, and quantile function) and generates random samples from a population following a Good distribution. The package also considers a function for Good regression, including covariates in a similar way to that of the standard glm function. We finally show the use of such a package with some real-world data examples addressing both over-dispersion and especially under-dispersion. Keywords Good distribution · discrete-valued data · underdispersion · regression 1 Introduction Even though the Poisson distribution is the most broadly used for modelling count data (e.g., properties like closure under addition make this distribution especially useful in practice) and has been used to model applications such as the number of soldiers killed by horse kicks per Prussian army corp ([1]) or the number of chromosome aberrations (e.g., dicentrics or rings) in scored sample cells from potentially irradiated individuals ([2]), it is usually chosen for tradition and convenience, and such a practice can very often be brought into question.
    [Show full text]
  • Compound Poisson Point Processes, Concentration and Oracle Inequalities
    Compound Poisson Point Processes, Concentration and Oracle Inequalities Huiming Zhanga, Xiaoxu Wub aSchool of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, 100871, China bMathematics Department, Rutgers University, Piscataway, NJ 08854, USA Abstract This note aims at presenting several new theoretical results for the compound Poisson point process, which follows the work of Zhang et al. [Insurance Math. Econom. 59(2014), 325-336]. The first part provides a new characterization for a discrete compound Poisson point process (proposed by Acz´el[Acta Math. Hun- gar. 3(3)(1952), 219-224]), it extends the characterization of the Poisson point process given by Copeland and Regan [Ann. Math. (1936): 357-362]. Next, we derive some concentration inequalities for discrete com- pound Poisson point process (negative binomial random variable with unknown dispersion is a significant example). These concentration inequalities are potentially useful in count data regressions. We give an ap- plication in the weighted Lasso penalized negative binomial regression whose KKT conditions of penalized likelihood hold with high probability and then we derive non-asymptotic oracle inequalities for a weighted Lasso estimator. Keywords: characterization of point processes, Poisson random measure, concentration inequalities, sub-Gamma random variables, high-dimensional negative binomial regressions. 2010 Mathematics Subject Classification: 60G55, 60G51, 62J12 1. Introduction Powerful and useful concentration inequalities of empirical processes have been derived one after another for several decades. Various types of concentration inequalities are mainly based on mo- ments condition (such as sub-Gaussian, sub-exponential and sub-Gamma) and bounded difference condition; see Boucheron et al. (2013). Its central task is to evaluate the fluctuation of empirical processes from some value (for example, the mean and median) in probability.
    [Show full text]
  • Characterization of Conjugate Priors for Discrete Exponential Families
    Statistica Sinica 11(2001), 409-418 CHARACTERIZATION OF CONJUGATE PRIORS FOR DISCRETE EXPONENTIAL FAMILIES Jine-Phone Chou Academia Sinica, Taipei Abstract: Let X be a nonnegative discrete random variable distributed according to an exponential family with natural parameter θ ∈ Θ. Subject to some regularity we characterize conjugate prior measures on Θ through the property of linear pos- terior expectation of the mean parameter of X : E{E(X|θ)|X = x} = ax + b.We also delineate some necessary conditions for the hyperparameters a and b, and find a necessary and sufficient condition that 0 <a<1. Besides the power series distri- bution with parameter space bounded above (for example, the negative binomial distribution and the logarithmic series distribution) and the Poisson distribution, we apply the result to the log-zeta distribution and all hyper-Poisson distributions. Key words and phrases: Bounded analytic functions, characterization theorems, conjugate priors, discrete exponential families, posterior expectations. 1. Introduction Let X come from an exponential family with natural parameter θ and density xθ−M(θ) dPθ(x)=e dµ(x), (1.1) where µ is a discrete measure with support {x0,x1,...,0 ≤ x0 <x1 <x2 < ···}. Let χ denote the interior of the convex hull of the support of µ,andletΘbe the natural parameter space, i.e., Θ = {θ, eθxdµ(x) < ∞}.Itisobviousthat 0 0 χ =(x0,x )wherex is the least upper bound of {x0,x1,...},andthatΘis convex and unbounded to the left since the xi,i =1, 2,..., are nonnegative. Throughout the paper we assume Θ is nonempty and open, and write Θ = 0 (−∞,θ ).
    [Show full text]