Shrinkage Estimation of Rate Statistics

CS-BIGS 7(1):14-25 http://www.csbigs.fr Shrinkage estimation of rate statistics Einar Holsbø Department of Computer Science, UiT — The Arctic University of Norway Vittorio Perduca Laboratory of Applied Mathematics MAP5, Université Paris Descartes This paper presents a simple shrinkage estimator of rates based on Bayesian methods. Our focus is on crime rates as a motivating example. The estimator shrinks each town’s observed crime rate toward the country-wide average crime rate according to town size. By realistic simulations we confirm that the proposed estimator outperforms the maximum likelihood estimator in terms of global risk. We also show that it has better coverage properties. Keywords : Official statistics, crime rates, inference, Bayes, shrinkage, James-Stein estimator, Monte-Carlo simulations. 1. Introduction variety of performance measures—were small. As it turns out, there is nothing special about 1.1. Two counterintuitive random phenomena small schools except that they are small: their over-representation among the best schools is It is a classic result in statistics that the smaller a consequence of their more variable perfor- the sample, the more variable the sample mean. mance, which is counterbalanced by their over- The result is due to Abraham de Moivre and it representation among the worst schools. The tells us that the standard deviation of the mean p observed superiority of small schools was sim- is sx¯ = s/ n, where n is the sample size and s ply a statistical fluke. the standard deviation of the random variable of interest. Although the equation is very simple, its practical implications are not intuitive. Galton(1886) first described another stochastic People have erroneous intuitions about the laws of mechanism that is dangerous to ignore. Galton chance, argue Tversky and Kahneman in their observed that children of tall (or short) par- famous paper about the law of small numbers ents usually grow up to be not quite as tall (Tversky and Kahneman, 1971). (or short), i.e. closer to average height. Today we know this phenomenon as regression to Serious consequences can follow from small- the mean, and we will find it wherever we sample inference ignoring deMoivre’s equation. find variation. Imagine a coach who berates a Wainer(2007) provides a notorious example: in runner who had an unusually slow lap time the late 1990s and early 2000s private and pub- and finds that, indeed, the next lap is faster. lic institutions provided massive funding to The coach, who always berates slow runners, small schools. This was due to the observation has not had the opportunity to realize that the that most of the best schools—according to a next lap is very likely to be faster no matter -15- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca what. As long as there is variability in lap time showed this by introducing a lower-risk estima- we will some times see unusually slow laps tor that biases or shrinks, the xis toward zero. that we can do nothing about and make no James and Stein(1961) introduced an improved inference from. In this case too do our intu- shrinkage estimator, which we will see below. itions about the laws of chance fail us. People, Efron and Morris(1973) show a similar result including scientists, make the mistake of ig- and a similar estimator for shrinking toward noring regression all the time. Mathematically the pooled mean. There are many successful regression to the mean is as simple as imperfect applications of shrinkage estimation, see for correlation between instances. instance the examples from Morris(1983). The common theme is a setting where the statisti- 1.2. These phenomena in official statistics cian wants to estimate many similar variable quantities. The small-schools example is egregious because it led to wasteful public spending. The 1.4. An almost-Bayesian estimator statistics themselves were probably fine, but their interpretation was not careful enough. In this case study we consider the official Nor- Such summary statistics are often presented wegian crime report counts. We assume that without regard for uncertainty. For instance, in a given year the number of crimes reported every year Statistics Norway (ssb.no), the cen- in town i, denoted ki, corresponds to the num- tral bureau of statistics in Norway, presents ber of criminal events in this town. We further crime report counts. The media usually reports assume that each inhabitant can at most be these numbers as rates and inform us that some reported for one crime a year. Our goal is small town that few people know about is the to estimate the crime probability qi: probability most criminal in the country. Often the focus is that a person will commit a crime in this town. on violent crimes. Figure1 below shows these The obvious estimator is the maximum likeli- rates for 2016. Not knowing de Moivre’s result hood estimate (MLE) for a binomial proportion ˆ k it might be striking to observe that many of the qi = i/ni, where ni is the population of town i. towns with the highest rates are small towns. Similarly, not knowing regression it might be The MLE binomial model rests on an assump- striking to observe that, on average, towns with tion that inhabitants commit crimes indepen- a high rate in one year will have a lower one in dently according to an identical crime proba- any other year, see Figure2 below. These are bility. There are reasons to believe that this is unavoidable stochastic phenomena. Thus there not the case. The desperately poor might be is reason to believe that we should somehow more prone to stealing than the middle class adjust our expectations about these numbers. professional. There is a weaker assumption We will see below that such an adjustment also called exchangeability that says that individuals makes statistical sense. are similar but not identical. More precisely we assume that their joint criminal behavior (some number of zeros and ones) does not depend on 1.3. Shrinkage estimation knowing who the individuals are (the order of There is an astonishing decision-theoretic re- the zeros and ones). It is an important theorem sult due to Charles Stein: suppose that we wish in Bayesian inference, due to De Finetti, that to estimate k ≥ 3 parameters q1, ... , qk and ob- a sequence of exchangeable variables are inde- serve k independent measurements, x1 ... xk, pendent and identically distributed conditional such that xi ∼ N(qi, 1). There is an estimator on an unknown parameter qi that is distributed of qi that has uniformly lower risk, in terms according to an a priori (or prior) distribution of total quadratic loss, than the obvious candi- f (qi) (Spiegelhalter et al., 2004). In the bino- date xi (Stein, 1956). In other words, the maxi- mial sense, qi has the remarkable property that mum likelihood estimate is inadmissible. Stein it is the long-run frequency with which crimes -16- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca occur regardless of the i.i.d. assumption; the the posterior distribution of qi is then also a prior precisely reflects our opinion about this beta distribution. The problem remains how limit. By virtue of De Finetti’s theorem, the to choose the parameters for the prior. On the exchangeability assumption justifies the intro- idea that a given town is probably not that dif- duction of the unknown parameter qi in a bino- ferent from all the other towns, we will simply mial model for ki, so long as we take the prior pool the observed crime rates for all towns and into account. fit a beta distribution to this ensemble by the method of moments. To make an argument with priors is to make a Bayesian argument. Shrinkage is implicit in Under squared error loss, the posterior mean Bayesian inference: observed data gets pulled as point estimate minimizes Bayes risk. The toward the prior (and indeed the prior is pulled posterior mean serves as our shrinkage esti- ˆs ˆs toward the data likelihood). We propose an mate, qi , for qi. We will see that qi in effect ˆs ˆ almost Bayesian shrinkage estimator, qi , that shrinks the observed crime rate qi toward the ¯ 1 ˆ accounts for the variability due to population country-wide mean q = ∑ m qi by taking into size. Our estimator is almost Bayesian because account the size of town i. we do not treat the prior very formally, as will be clear below. Bayesian inference allows for intuitive uncertainty intervals. In contrast to a classical fre- In a Bayesian argument we treat qi as random. quentist confidence interval, which can be The statistician specifies a prior distribution tricky to interpret, we can say that qi lies within f (qi) for the parameter that reflects her knowl- the Bayesian credible interval with a certain edge (and uncertainty) about qi. As in the fre- probability. This probability is necessarily sub- quentist setting, she then selects a parametric jective, as the prior distribution is subjective. model for the data given the parameters, which We will conduct simulations to compare the allows her to compute the likelihood f (xjqi). coverage properties of our estimator to the clas- Inference about qi consists of computing its sical asymptotic confidence interval. posterior distribution by Bayes’ theorem: 1.5. Resources f (xjqi) f (qi) f (qijx) = R . f (xjqi) f (qi) dqi This case-study is written with a pedagogi- cal purpose in mind, and can be used by ad- vanced undergraduate and beginning gradu- There are various assessments we could make ate students in statistics as a tutorial around shrinkage estimation and Bayesian methods.

Shrinkage Estimation of Rate Statistics

Shrinkage Priors for Bayesian Penalized Regression

Shrinkage Estimation of Rate Statistics Arxiv:1810.07654V1 [Stat.AP] 17 Oct

Linear Methods for Regression and Shrinkage Methods

Linear, Ridge Regression, and Principal Component Analysis

Kernel Mean Shrinkage Estimators

Generalized Shrinkage Methods for Forecasting Using Many Predictors

Cross-Validation, Shrinkage and Variable Selection in Linear Regression Revisited

Mining Big Data Using Parsimonious Factor, Machine Learning, Variable Selection and Shrinkage Methods Hyun Hak Kim1 and Norman R

Shrinkage Improves Estimation of Microbial Associations Under Di↵Erent Normalization Methods 1 2 1,3,4, 5,6,7, Michelle Badri , Zachary D

Shrinkage Estimation for Functional Principal Component Scores, with Application to the Population Kinetics of Plasma Folate

Image Denoising Via Residual Kurtosis Minimization

An Empirical Bayes Approach to Shrinkage Estimation on the Manifold of Symmetric Positive-Deﬁnite Matrices∗†