<<

CS-BIGS 7(1):14-25 http://www.csbigs.fr

Shrinkage estimation of rate

Einar Holsbø Department of Computer Science, UiT — The Arctic University of Norway Vittorio Perduca Laboratory of Applied Mathematics MAP5, Université Paris Descartes

This paper presents a simple shrinkage estimator of rates based on Bayesian methods. Our focus is on crime rates as a motivating example. The estimator shrinks each town’s observed crime rate toward the country-wide average crime rate according to town size. By realistic simulations we confirm that the proposed estimator outperforms the maximum likelihood estimator in terms of global risk. We also show that it has better coverage properties.

Keywords : Official statistics, crime rates, inference, Bayes, shrinkage, James-Stein estimator, Monte-Carlo simulations.

1. Introduction variety of performance measures—were small. As it turns out, there is nothing special about 1.1. Two counterintuitive random phenomena small schools except that they are small: their over-representation among the best schools is It is a classic result in statistics that the smaller a consequence of their more variable perfor- the sample, the more variable the sample mean. mance, which is counterbalanced by their over- The result is due to Abraham de Moivre and it representation among the worst schools. The tells us that the standard deviation of the mean √ observed superiority of small schools was sim- is σx¯ = σ/ n, where n is the sample size and σ ply a statistical fluke. the standard deviation of the random variable of interest. Although the equation is very sim- ple, its practical implications are not intuitive. Galton(1886) first described another stochastic People have erroneous intuitions about the laws of mechanism that is dangerous to ignore. Galton chance, argue Tversky and Kahneman in their observed that children of tall (or short) par- famous paper about the law of small numbers ents usually grow up to be not quite as tall (Tversky and Kahneman, 1971). (or short), i.e. closer to average height. Today we know this phenomenon as regression to Serious consequences can follow from small- the mean, and we will find it wherever we sample inference ignoring deMoivre’s equation. find variation. Imagine a coach who berates a Wainer(2007) provides a notorious example: in runner who had an unusually slow lap time the late 1990s and early 2000s private and pub- and finds that, indeed, the next lap is faster. lic institutions provided massive funding to The coach, who always berates slow runners, small schools. This was due to the observation has not had the opportunity to realize that the that most of the best schools—according to a next lap is very likely to be faster no matter -15- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

what. As long as there is variability in lap time showed this by introducing a lower-risk estima- we will some times see unusually slow laps tor that biases or shrinks, the xis toward zero. that we can do nothing about and make no James and Stein(1961) introduced an improved inference from. In this case too do our intu- shrinkage estimator, which we will see below. itions about the laws of chance fail us. People, Efron and Morris(1973) show a similar result including scientists, make the mistake of ig- and a similar estimator for shrinking toward noring regression all the time. Mathematically the pooled mean. There are many successful regression to the mean is as simple as imperfect applications of shrinkage estimation, see for correlation between instances. instance the examples from Morris(1983). The common theme is a setting where the statisti- 1.2. These phenomena in official statistics cian wants to estimate many similar variable quantities. The small-schools example is egregious be- cause it led to wasteful public spending. The 1.4. An almost-Bayesian estimator statistics themselves were probably fine, but their interpretation was not careful enough. In this case study we consider the official Nor- Such summary statistics are often presented wegian crime report counts. We assume that without regard for uncertainty. For instance, in a given year the number of crimes reported every year Statistics Norway (ssb.no), the cen- in town i, denoted ki, corresponds to the num- tral bureau of statistics in Norway, presents ber of criminal events in this town. We further crime report counts. The media usually reports assume that each inhabitant can at most be these numbers as rates and inform us that some reported for one crime a year. Our goal is small town that few people know about is the to estimate the crime probability θi: probability most criminal in the country. Often the focus is that a person will commit a crime in this town. on violent crimes. Figure1 below shows these The obvious estimator is the maximum likeli- rates for 2016. Not knowing de Moivre’s result hood estimate (MLE) for a binomial proportion ˆ k it might be striking to observe that many of the θi = i/ni, where ni is the population of town i. towns with the highest rates are small towns. Similarly, not knowing regression it might be The MLE binomial model rests on an assump- striking to observe that, on average, towns with tion that inhabitants commit crimes indepen- a high rate in one year will have a lower one in dently according to an identical crime proba- any other year, see Figure2 below. These are bility. There are reasons to believe that this is unavoidable stochastic phenomena. Thus there not the case. The desperately poor might be is reason to believe that we should somehow more prone to stealing than the middle class adjust our expectations about these numbers. professional. There is a weaker assumption We will see below that such an adjustment also called exchangeability that says that individuals makes statistical sense. are similar but not identical. More precisely we assume that their joint criminal behavior (some number of zeros and ones) does not depend on 1.3. Shrinkage estimation knowing who the individuals are (the order of There is an astonishing decision-theoretic re- the zeros and ones). It is an important theorem sult due to Charles Stein: suppose that we wish in , due to De Finetti, that to estimate k ≥ 3 parameters θ1, ... , θk and ob- a sequence of exchangeable variables are inde- serve k independent measurements, x1 ... xk, pendent and identically distributed conditional such that xi ∼ N(θi, 1). There is an estimator on an unknown parameter θi that is distributed of θi that has uniformly lower risk, in terms according to an a priori (or prior) distribution of total quadratic loss, than the obvious candi- f (θi) (Spiegelhalter et al., 2004). In the bino- date xi (Stein, 1956). In other words, the maxi- mial sense, θi has the remarkable property that mum likelihood estimate is inadmissible. Stein it is the long-run frequency with which crimes -16- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

occur regardless of the i.i.d. assumption; the the posterior distribution of θi is then also a prior precisely reflects our opinion about this beta distribution. The problem remains how limit. By virtue of De Finetti’s theorem, the to choose the parameters for the prior. On the exchangeability assumption justifies the intro- idea that a given town is probably not that dif- duction of the unknown parameter θi in a bino- ferent from all the other towns, we will simply mial model for ki, so long as we take the prior pool the observed crime rates for all towns and into account. fit a beta distribution to this ensemble by the method of moments. To make an argument with priors is to make a Bayesian argument. Shrinkage is implicit in Under squared error loss, the posterior mean Bayesian inference: observed data gets pulled as point estimate minimizes Bayes risk. The toward the prior (and indeed the prior is pulled posterior mean serves as our shrinkage esti- ˆs ˆs toward the data likelihood). We propose an mate, θi , for θi. We will see that θi in effect ˆs ˆ almost Bayesian shrinkage estimator, θi , that shrinks the observed crime rate θi toward the ¯ 1 ˆ accounts for the variability due to population country-wide mean θ = ∑ m θi by taking into size. Our estimator is almost Bayesian because account the size of town i. we do not treat the prior very formally, as will be clear below. Bayesian inference allows for intuitive uncer- tainty intervals. In contrast to a classical fre- In a Bayesian argument we treat θi as random. quentist confidence interval, which can be The statistician specifies a prior distribution tricky to interpret, we can say that θi lies within f (θi) for the parameter that reflects her knowl- the Bayesian credible interval with a certain edge (and uncertainty) about θi. As in the fre- probability. This probability is necessarily sub- quentist setting, she then selects a parametric jective, as the prior distribution is subjective. model for the data given the parameters, which We will conduct simulations to compare the allows her to compute the likelihood f (x|θi). coverage properties of our estimator to the clas- Inference about θi consists of computing its sical asymptotic confidence interval. posterior distribution by Bayes’ theorem: 1.5. Resources f (x|θi) f (θi) f (θi|x) = R . f (x|θi) f (θi) dθi This case-study is written with a pedagogi- cal purpose in mind, and can be used by ad- vanced undergraduate and beginning gradu- There are various assessments we could make ate students in statistics as a tutorial around shrinkage estimation and Bayesian methods. about the collection of θi. If we assume they are identical we can pool them and use a We will mention some possible extensions in single prior. If we assume they are inde- the conclusion that could be the basis for stu- pendent we specify one prior for each and dent projects. Data and code for all our anal- keep them separate. If we assume they are yses, figures, and simulations are available at exchangeable—similar but not identical—it fol- https://github.com/3inar/crime_rates lows from De Finetti that there is a common prior distribution conditional on which the 2. Data θ1,..., θm are i.i.d. (Spiegelhalter et al., 2004). We will work with the official crime report We make this latter judgment and take a beta statistics released by Statistics Norway (SSB) distribution common to all crime probabilities every year. These data contain the number of as prior. Our likelihood for an observed num- crime reports in a given Norwegian town in a ber of crime reports follows a binomial dis- given year. The counts are stratified by crime tribution. It is a classic exercise to show that type, e.g. violent crimes, traffic violations, etc. -17- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

We will focus on violent crimes. SSB separately Crime rates regress to the mean provides yearly population statistics for each town. Figure1 shows the 2016 crime rates (i.e. counts per population) for all towns in 0.025 Norway against their respective populations. This is some times called a funnel plot for the ●

● funnel-like tapering along the horizontal axis: ● ● ● ● ● 0.015 ● 2016 ● a shape that signals higher among the ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ●● ●● ● ● ● ● ● ●●● ● ●● smaller towns. ● ●●●● ● ● ●●●●●●● ● ● ● ● ●●●● ●● ● ●● ●●●●●●● ● ● ●● ●●● ●●●●●● ●● ● ●●●●●●●●●● ● ●●● ●●●●● ● ●●● ● ● ●●●●●● ●●●● ●● ● ●● ●●●● ●●●●●●●●● ●● ●● ● ●● ●●●●●●● ●●●●●●● ●● ● ● ● ●●●●●●●●● ●● ● ● ●●●●●●●●●●●●● ●● ● ●●●● ●●●●●●●●●●●● ●●●●●●●●●● ●●● ● ●● ●●●●●●● ●●● ●●●●●●●●● 0.005 ●●● ●●● ● ● ●●● ● ● ●

Crime rates more variable for smaller towns 0.005 0.010 0.015 0.020 0.025 2015 ● (correlation = 0.89)

● Figure 2: Regression to the mean from year ● ● 0.015 ● ● to year. The plot compares 2016 and 2015; ● ● ● ● ● ●● ● ● the black regression line shows that towns ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ●● ● 0.010 ●●● ● ● with high crime rates in 2015 tend to have ●●● ● ●●● ● ●● ●●● ● ● ●●●●● ● ●●●● ● ●●●●● ● ● ●●● ● ● ●●● ●●●●● ● ● lower crime rates in 2016, and vice versa for ●●●● ●●● ●●●●● ● ●●●●● ● ●●●●●● ●●●●●●● ●●●●●●● ● ●●●●●● ●●●● ● ●●●●●●● ●●●●●●● low crime rates. The grey dashed line shows ●●●●●● ●●●●●●●●● ●●●●●● ● ● ●●● ●● ● 0.005 ●●●●● ● ●●●● ● ●●●●● ●●●● 2016 Violent crime rate ●●● ●●● what perfect correlation between 2015 and 2016 ● ●● ● ● would look like. 0 100 200 300 400 500 600 Population (in thousands) Figure3 shows the distribution of the pooled Figure 1: Rates of violent crime vs population violent crime rates for 2016. The solid black in 2016 for all towns in Norway. The grey line line is a beta distribution fit to these data. shows the country-wide mean.

Pooled violent crime rates, 2016

Figure2 compares the crime rates in 2015 with 200 those in 2016 and shows that the more (or less)

violent towns in 2015 were on average less (or 150 more) violent in 2016. The solid black line re- 100 gresses 2016 rates on 2015 rates. The dashed Density grey line is what to expect if there were no regression toward the mean. It has an intercept 50

of zero and a slope of unity. The solid grey line 0 is the overall mean in 2016. The most extreme 0.005 0.010 0.015 town in 2015, past .025 on the x-axis, is much Rate closer to the mean in 2016. The solid black regression line shows that this is true for all towns on average. The fact that 2015 and 2016 Figure 3: The distribution of violent crime are consecutive years is immaterial; regression rates in Norway, 2016. The black line describes to the mean will be present between any two the method-of-moments fit of a beta distribu- years. tion to these data. -18- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

2.1. Simulation study model, also shown in Figure5:

We run a simulation study for validation. If θi|α, β ∼ Beta(α, β), we assume that the crime probability in town ki|θi ∼ Binomial(ni, θi). i is stationary we can pool the observed crime rates of all years and use their average, θ¯i, as As mentioned the assumption of town ex- a reasonable “truth.” This allows us to as- changeability leads to this hierarchical model. sess the performance of our estimator against This assumption might not be appropriate if known, realistic crime probabilities, which of we had reasons to think, for instance, that some course is impossible in the real data. The regions are more prone to crime than others. In simulated crime report count in town i is this case, region-specific priors might be better. ki ∼ Binomial(θ¯i, ni), where ni is the 2016 pop- ulation of town i. Figure4 shows a realiza- α, β tion of this procedure. Although not a perfect replica of Figure1—the real data do not have

any rates below .0017—it looks fairly realistic. θ1 θ2 ... θm

Example of simulated crime rates k1 k2 ... km

● ● ● Figure 5: A graph describing our model. ● ●● 0.015

● ● Crime counts, ki, are (conditionally) i.i.d. bi- ● ● ● ●● ● ● ●●● ●● ● nomials whose respective parameters, θ , are ●● ●● i ●● ●● ● ● ● ●●● ● ● ● ● 0.010 ●●●●● ● ● ● (conditionally) i.i.d. according to a common ●●●● ●● ●● ● ●●● ● ●●●● ● ●●● ● ● ●●●●● ● ● ● ● ● ●●●●● ●●●●●●●● ● prior. ●●●●●● ●● ●●● ●●●●● ● ●●●● ● ●●● ● ●●●●●● ●●●●●●●● ●●●●●●● ●●●●●●● ● ●●●● ●●● ● ●●●●●● ●●●●●●●●●●● ● ●●●●●●● ● ●●●●● ●●● 0.005 ●●●●● ●●●●●● ●●● ● ●●●●

Violent crime rate ●● ●●●●● ●●● ●●● The posterior follows from the fact that the beta ●●● ●● ● ● ● distribution is conjugate to itself with respect to ●

0.000 the binomial likelihood. Generally, conjugacy 0 100 200 300 400 500 600 means that the prior and posterior distribu- Population (in thousands) tions belong to the same distributional family and usually entails that there is a simple closed- Figure 4: Funnel plot of a set of simulated form way of computing the parameters of the crime rates posterior. Wasserman(2010, p. 178) shows a derivation of the posterior in the beta–binomial model:

3. Methods θi|ki ∼ Beta(α + ki, β + ni − ki).

We will look into the relation between the pa- 3.1. Shrinkage estimates rameters of the posterior to those of the prior in terms of successes and failures in the results We treat θ as the probability for a person to i section. commit a crime in a given period. We model the total number of crime reports in the i-th The shrinkage estimate for the crime probabil- town, k , as the number of successful Bernoulli i ity in town i is the posterior mean trials among ni, where ni is the population of this town. As explained in the introduction, + ˆs α ki θi = . this suggests the following simple Bayesian α + β + ni -19- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

The maximum likelihood estimate for θi is the and assume ˆ k observed crime rate θi = i/ni. In order to fix k   values of α and β, we pool the MLEs for all θˆ = i ∼ N θ , σ2 , i n i i towns θˆ1, ... , θˆm and fit a beta distribution to i these data by the method of moments. We ( − ) where σ2 = θi 1 θi is unknown. If we assume show the resulting fit in Figure3. Because i ni the expectation and variance of a Beta(α, β) are that towns are similar in terms of variance we α αβ can consider the pooled variance estimate α+β and (α+β)2(α+β+1) , respectively, the param- eter estimates for the prior are m 2 ∑ (ni − 1)σˆ σ2 = i=1 i , α(1 − θ¯) P ∑m (n − 1) β = , and i=1 i θ¯ ˆ ˆ  ¯  2 θi(1−θi) ki(ni−ki) 1 − θ 1 2 where σˆ = = . The James- α = − θ¯ . i ni n3 2 ¯ i S θ Stein estimator of crime probability for town i ˆ ˆ ¯ 2 is then ¯ ∑i θi 2 ∑i(θi−θ) Here θ = m and S = m−1 are the sam- ple mean and variance of the pooled MLEs. 2 ! JS (m − 2)σˆP θˆ = 1 − θˆi. i m θˆ2 Instead of estimating α and β from the data like ∑i=1 i this, which ignores any randomness in these This is a shrinkage toward zero. It assumes that parameters, we could have a prior distribution crime rates are probably not as high as they for the parameters themselves. This would appear. This is different from our assumption yield a typical Bayesian hierarchical model. that crime rates are probably not as far away Note also that in forming the estimate for town from the average as they appear. It is simple to i, we end up using its information twice: once modify the above to shrink toward any origin. in eliciting our prior and once in the likelihood. The Efron-Morris variant (Efron and Morris, This is convenient because we need only to 1973) shrinks toward the average: find one prior rather than one for each town where we exclude the ith town from the ith ! (m − 2)σˆ 2 prior. This bit of trickery does not make much θˆJS = θ¯ + 1 − P (θˆ − θ¯). i m ˆ ¯ 2 i difference: we have several hundreds of towns ∑i=1(θi − θ) and hence removing a single town does not affect the shape of the prior much. We will use this variant so that the two meth- ods shrink toward the same point. + The estimate θˆs = α ki shrinks the observed, i α+β+ni ¯ or MLE, crime rate toward the prior mean θ. 3.3. Uncertainty intervals ˆs ¯ ˆ We can rewrite so that θi = δiθ + (1 − δi)θi, + with δ = α β . Here δ directly reflects the We construct credible intervals from the pos- i α+β+ni i prior’s influence on θˆs, and we see that this terior. A 95% credible interval contains .95 of i the posterior density, and the simplest way to influence grows as the town size, ni, shrinks. construct one is to place it between the .025 and .975 quantiles of the posterior. For the 3.2. James-Stein estimates MLE we use the typical normal approxima- For completeness we demonstrate empirically tion (or Wald) confidence interval. There is that the James–Stein estimator is superior to to our knowledge no straight-forward way to the MLE in terms of risk. If town i has a large construct confidence intervals for the JS estima- enough population, we can consider the nor- tor, so we will leave this as an exercise for the mal approximation to the binomial distribution reader. -20- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

3.4. Global risk estimates Shrinkage vs ordinary estimates

S ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ● We use the total squared-error loss function,

m ˆs ˆs 2 L(θ, θ ) = ∑(θi − θi ) , i=1

to measure the global discrepancy between MLE ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●● ●●● ● ● ● ● ● ● ● the true rates θ = (θi)i=1,...,m and estimates 0.005 0.010 0.015 ˆs ˆs θ = (θi )i=1,...,m. We do the same for the max- Crime rate estimate imum likelihood and James-Stein estimates JS JS θˆ = (θˆ ) = and θˆ = (θˆ ) = , respec- i i 1,...,m i i 1,...,m Figure 6: Comparing shrinkage and maximum tively. likelihood estimates. Oslo, in black, is both close enough to the grand mean and large enough in size that the estimate does not change.

We will compare the expected loss, or risk, of the three estimators R(·) = E[L(·)], confirming the well-known property that shrinkage esti- mators dominate the MLE. We obtain Monte Carlo estimates of risk by averaging L(·) across 4. Results repeated simulations.

4.1. Official SSB data

3.5. Coverage properties We focus on violent crimes in the year 2016. Figure6 shows the effect of shrinking the ob- served crime rates toward the prior mean. We For the credible interval Cs = (a, b), we want see that the more extreme estimates shrink to- to assess the coverage probability P(θ ∈ Cs) ward the center. The city with highest crime and compare with P(θ ∈ CW ) for the classical rate according to the maximum likelihood esti- Wald confidence interval. We will not assess mate is Havsik (θˆ = 0.018), a small town with the James–Stein estimator in terms of coverage. slightly more than 1000 inhabitants (n = 1054). s W Let I(Ci), where Ci = Ci or Ci , be the indi- After shrinkage, Havsik still ranks first, but the s cator function that is equal to unity if θi ∈ Ci, shrinkage estimate is much lower (θˆ = 0.012). and zero otherwise. We obtain MC estimates of Similarly the town with the lowest crime rate coverage probability by averaging the mean in- is Selbu (θˆ = 0.0017), another small town (n = 1 m · ternal coverage, m ∑i=1 I(Ci ), across repeated 4132). Selbu’s shrinkage estimate is higher than simulations. An uncertainty interval should the MLE by more than 40% (θˆs = 0.0024). Oslo, be well-calibrated: if the size of the interval is shown in black, is a big city (n = 658390) and 95% it should trap the true parameter .95 of the difference between the two estimates is null the time. (θˆ − θˆs = 7 × 10−6). -21- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

as though we add the information of 922 extra Q−Q plot of 2016 rates against fitted prior trials in the binomial sense. In other words ● we add a priori 922 inhabitants, including five criminals, to each town. ● ● ● ● 0.015 ● ● ● ● ●● ●●● ●●● Relative information from shrinkage ●● ● ●●●●●● ●●● ● 0.010 ●●● ●●● ●●● ●●●●● ● ● ●● ●● ●●●●● ● ●●●● ● ●●●● 2.0 ●●●●● ● ●●●● ●●●● ● ●●●● ● ●●●●●● ● ●●●● ● ●●●●●● ●●● ●● ●●●● ● ●●●●● ● ●●●●●● ●●●●●● ● ●● 1.8 0.005 ● ●●●●● ● ●●●●● ●● ●●●●● ● ● ● ●●●● ● ●●●● ● Empirical 2016 quantiles ● ●●●● ● ● ● ●●●●●● ●●●● ● ● ●● ●● ●●●● ●●●●●●● ● ● ● 1.6 ● ● ● ●● ●● ●●● ●● ●● 0.000 0.005 0.010 0.015 ●● ● ● ●● ● ●●●● ●● ● ●● ● ●● ●● ● ● ● ●●● ●●● ●●● 1.4 ●● Prior quantiles ● ●●● ●● ● ●● ●● ● ● ● ●●●● ● ●● ● ●●●●● ●●●●●● ● ● ●●● ●● ● ● ●●●●●●●●● ●●●●●●● ● ●●●●●●●●● ● ● ●●●●●● ● ● ●● ● ● ●●●●●●●●●●● ●●●●●●●● ● 1.2 ●●●●●●● ● ● ● Relative information Relative ●●●●●●● ●● ● ● ●●●●●●●●●● ● ● ● ● ●●●●●●●●●● ●● ● ●● ●●●●●●●●●●●● ● ● ●●●●●●●●●●●●●●●●●●●●●● ● Figure 7: Quantile–quantile plot of 2016 crime ●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●●●●● ●● ● ● ●

rates against the fitted prior. The solid line 1.0 describes a perfect fit. 3.0 3.5 4.0 4.5 5.0 5.5 log10 population Figure7 is a quantile–quantile plot of the 2016 violent crime rates against the fitted prior. Figure 8: Relative information in the posterior There is some very slight deviation around the mean compared to the MLE. The figure shows tails, but overall it looks like a nice fit. (α + ki)/ki in grey and (β + ni − ki)/(ni − ki) in gray. These represent the added informa- By shrinking toward the ensemble we tion in terms of number of successes and num- add some information—we use the term ber of failures added to the MLE to form the informally—to the observed rate. We can quan- shrinkage estimate. For the smallest towns, we tify this by looking at the form of the beta practically double the information. distribution, so far taken for granted in this treatment. Its density function is Figure8 shows α0 and β0 (gray and black) rela- − − xα 1(1 − x)β 1 tive to the number of successes (k ) and failures f (x; α, β) = , i B(α, β) (ni - ki) for each town in the 2016 data. For the smaller towns, there is double the informa- where the beta function in the denominator is tion in the shrinkage estimate, while for larger simply the normalizing constant towns there is no practical increase. Naturally Z 1 the value of this extra information depends on B(α, β) = tα−1(1 − t)β−1 dt. 0 the degree to which the prior is relevant. A natural interpretation is that this is a dis- tribution over the probability of success, i.e. Figure9 shows the ten most violent towns ac- crime, in a sequence of Bernoulli trials with cording to shrinkage estimate along with their α − 1 successes and β − 1 failures (cf. the bino- 95% credible intervals. The official, or MLE, mial distribution). Hence we can interpret the crime rate is shown as a red point. We see some posterior for town i as a distribution over the change in ordering. For Hasvik—a small and probability of success in a series of Bernoulli tri- presumably quiet village in northern Norway— 0 0 als with α = α + ki successes and β = β + ni the MLE is so implausible that it is outside the failures (ignoring the −1 for convenience). In credible interval. For Oslo—the biggest city in our data we have that α ≈ 5 and β ≈ 917; it is Norway—the estimate doesn’t change. -22- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

Ten most violent towns in 2016 we fixed θi for this experiment, so we are only Hasvik ● ● assessing the risk function in a single point. Vadsø ● ● Porsanger ● ●

● MC estimates of risk Lebesby ● Vardø ● ● MLE Halden ●● S JS Oslo kommune ●

● Haugesund ● 6000 Ullensaker ●● Hemsedal ● ● Density

0.008 0.012 0.016 2000 0

Figure 9: The ten towns with the highest crime 4e−04 6e−04 8e−04 1e−03 rate, ordered by shrinkage estimate. The bars Loss are 95% credible intervals. MLEs shown in red. Figure 11: Distributions of L(θ, θˆ) (solid black), Figure 10 shows historical data for the three L(θ, θˆs) (dashed red), and L(θ, θˆJS) (solid grey). most violent and the three least violent towns Vertical lines estimate the risk. in 2016, according to official crime rate. We show shrinkage estimates in red and official statistics in black. The vertical bars are 95% un- certainty intervals. The shrinkage estimate is MC estimates of coverage probability

usually more conservative, at least for the more Wald violent towns, but the trends remain similar Credible 40 for both estimates. The credible intervals are

shorter than the classical confidence intervals. 30 We will see that in spite of this their cover-

age is better under simulation. It is interesting 20 Density that the three most violent towns are all in

Finnmark: Norway’s largest and most sparsely 10 populated county. 0 0.86 0.88 0.90 0.92 0.94 0.96 4.2. Simulated data Coverage To obtain MC estimates of risks we run 100 000 Figure 12: Distributions of the internal simulations for each of our two experiments. coverage 1 m I(CW ) (solid black) and Figure 11 shows kernel density estimates of m ∑i=1 i 1 m ( s) the distributions of global loss. Our shrink- m ∑i=1 I Ci (dashed red). Vertical lines esti- age estimates show lower global risk than mate coverage probability. The grey line shows maximum likelihood: Rˆ (θ, θˆs) = 0.00054 ver- the nominal coverage of .95. sus Rˆ (θ, θˆ) = 0.00066. The James–Stein esti- mates fall almost exactly between the two with Figure 12 presents estimated coverage probabil- Rˆ (θ, θˆJS) = 0.00059. We might have observed ities in the same manner as Figure 11. The grey better results for JS had we used a variant of line shows the nominal coverage of .95. The JS that allows unequal . Note that coverage probability of the credible interval for -23- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

Hasvik Lebesby Berlevåg (population: 1054) (population: 1318) (population: 1000)

● ● ● ● ● 0.02 ● 0.02 ● 0.02 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0.00 2008 2012 2016 0.00 2008 2012 2016 0.00 2008 2012 2016

Gausdal Fusa Selbu (population: 6227) (population: 3876) (population: 4132) 0.02 0.02 0.02

●● ●● ●● ● ●● ●● ●● ● ● ●● ●● ● ● ● ●● ●● ●● ● ● ● ●● ● ●● ●● ● ● ● ●● ● ● ● ● ●● ● ● ●

0.00 2008 2012 2016 0.00 2008 2012 2016 0.00 2008 2012 2016

Figure 10: Historical data for the three most violent and the three least violent towns in 2016, ordered by official crime rate (MLE). The official statistics are drawn in black, and shrinkage estimates in red. The vertical bars indicate confidence and credible intervals, respectively.

the shrinkage estimator, Pˆ (θ ∈ Cs) = 0.917, is ble intervals from this treatment are narrower closer to the nominal value than that of the the and have better coverage than the standard standard interval, Pˆ (θ ∈ CW ) = 0.898. There Wald confidence interval. Hence we get better is however still room for improvement. information about the location of θi. Brown et al.(2001) show extensively that the Wald 5. Conclusion confidence interval for the binomial proportion behaves erratically for extreme values of p, for This case study shows a simple method for varying values of n, and for (un)lucky combi- simultaneous estimation of all town-specific nations of the two. Our result is interesting but crime rates in a country. The method is quite narrow. Generalizing it requires more Bayesian in spirit, although we take some short- work. cuts with our prior. It is known that under squared-error loss the posterior mean is the Smaller towns are over-represented among the optimal decision w.r.t. a given prior. In other most and least violent towns in the official words it minimizes Bayes risk, and is called Norwegian data. Mathematically this has to the Bayes estimate. The theory gives us that be the case. Applying shrinkage methods to Bayes estimates are admissible (Wald, 1947), these data we get more conservative estimates and thus cannot be dominated. The risk esti- for these variable and often extreme quanti- mates of our simulation agree with this. Our ties. At the same time it seems that variance analysis provides an estimate of the crime prob- is not the only factor that places some of these ability with favorable frequency properties in small towns among the most violent. As Fig- terms of and coverage. ure9 shows, the top and bottom three in 2016 show a certain stability year by year. Hasvik Our simulations show that the Bayesian credi- in Finnmark has never ranked especially low -24- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

since 2008. Small towns in the north are often References ranked high for violence. There could be many reasons for this and we leave further analysis Brown, L. D., Cai, T. T., and DasGupta, A. to the criminologists. (2001). Interval estimation for a binomial proportion. Statist. Sci., 16(2):101–133.

Efron, B. and Morris, C. (1973). Stein’s estima- These simple and useful estimation methods tion rule and its competitors—an empirical are best understood by practical examples. We bayes approach. Journal of the American Sta- encourage readers and students to actively fol- tistical Association, 68(341):117–130. low this tutorial by playing with the available code and data. We used a single prior for all Galton, F. (1886). Regression towards medi- towns. It would be an interesting extension ocrity in hereditary stature. The Journal of the to use a mixture of beta distributions to ac- Anthropological Institute of Great Britain and count for any heterogeneity due to different Ireland, 15:246–263. latent rate levels. In this case, an EM algorithm could be used to assign each town to a class. Gelman, A. and Nolan, D. (2017). Teaching Or, since Finnmark seems to be a special case, statistics: A bag of tricks. Oxford University we might estimate per-county priors. It is also Press. possible to include Bayesian multiple testing James, W. and Stein, C. (1961). Estimation with procedures to infer a list of cities likely to have quadratic loss. In Proceedings of the fourth true crimes rate above some given threshold. Berkeley symposium on mathematical statistics There is a temporal aspect to these data that and probability, volume 1, pages 361–379. we have not looked into. It would be possi- ble to start out with a country-wide prior, but Morris, C. N. (1983). Parametric empirical after this let the prior for one year be the pos- bayes inference: theory and applications. terior from the previous. Interested readers Journal of the American Statistical Association, can find other ideas for further development 78(381):47–55. in Robinson(2017). Gelman and Nolan(2017) also discuss a similar project to this one in their Robinson, D. (2017). Introduction to Empirical manual for statistics teachers. Bayes: Examples from Baseball Statistics. Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004). Bayesian approaches to clinical trials In this treatment we have moved from descrip- and health-care evaluation, volume 13. John tive figures typical of official statistics to model- Wiley & Sons. based inferential statistics, estimating a crime probability rather than reporting a crime count. Stein, C. (1956). Inadmissibility of the usual This allows us to account for variance and per- estimator for the mean of a multivariate nor- haps avoid over-interpreting noise, and hence mal distribution. In Proceedings of the Third avoid small-schools-type mistakes. We believe Berkeley Symposium on Mathematical Statistics that probabilistic thinking can enrich descrip- and Probability, Volume 1: Contributions to the tive statistics and aid in their interpretation. Theory of Statistics, pages 197–206, Berkeley, Calif. University of California Press.

Acknowledgements Tversky, A. and Kahneman, D. (1971). Belief in the law of small numbers. Psychological bulletin, 76(2):105. We would like to thank our anonymous re- viewer for the very thorough and very useful Wainer, H. (2007). The most dangerous equa- comments and suggestions. tion. American Scientist, 95(3):249. -25- Shrinkage estimation of rate statistics / E. Holsbø & V. Perduca

Wald, A. (1947). An essentially complete class Wasserman, L. (2010). All of Statistics: A Con- of admissible decision functions. Ann. Math. cise Course in . Springer Statist., 18(4):549–555. Publishing Company, Incorporated.

Correspondence: [email protected]