
Published, CS-BIGS 7(1):14-25 http://www.csbigs.fr

Shrinkage estimation of rate statistics

Einar Holsbø, Department of Computer Science, UiT The Arctic University of Norway
Vittorio Perduca, Laboratory of Applied Mathematics MAP5, Université Paris Descartes

This paper presents a simple shrinkage estimator of rates based on Bayesian methods. Our focus is on crime rates as a motivating example. The estimator shrinks each town’s observed crime rate toward the country-wide average crime rate according to town size. By realistic simulations we confirm that the proposed estimator outperforms the maximum likelihood estimator in terms of global risk. We also show that it has better coverage properties.

Keywords: Official statistics, crime rates, inference, Bayes, shrinkage, James-Stein estimator, Monte-Carlo simulations.

arXiv:1810.07654v1 [stat.AP] 17 Oct 2018

1. Introduction

1.1. Two counterintuitive random phenomena

It is a classic result in statistics that the smaller the sample, the more variable the sample mean. The result is due to Abraham de Moivre and it tells us that the standard deviation of the mean is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, where $n$ is the sample size and $\sigma$ the standard deviation of the random variable of interest. Although the equation is very simple, its practical implications are not intuitive. People have erroneous intuitions about the laws of chance, argue Tversky and Kahneman in their famous paper about the law of small numbers (Tversky and Kahneman, 1971).

Serious consequences can follow from small-sample inference ignoring de Moivre's equation. Wainer (2007) provides a notorious example: in the late 1990s and early 2000s private and public institutions provided massive funding to small schools. This was due to the observation that most of the best schools, according to a variety of performance measures, were small. As it turns out, there is nothing special about small schools except that they are small: their over-representation among the best schools is a consequence of their more variable performance, which is counterbalanced by their over-representation among the worst schools. The observed superiority of small schools was simply a statistical fluke.

Galton (1886) first described another stochastic mechanism that is dangerous to ignore. Galton observed that children of tall (or short) parents usually grow up to be not quite as tall (or short), i.e. closer to average height. Today we know this phenomenon as regression to the mean, and we will find it wherever we find variation. Imagine a coach who berates a runner who had an unusually slow lap time and finds that, indeed, the next lap is faster. The coach, who always berates slow runners, has not had the opportunity to realize that the next lap is very likely to be faster no matter what. As long as there is variability in lap time we will sometimes see unusually slow laps that we can do nothing about and make no inference from. In this case, too, our intuitions about the laws of chance fail us. People, including scientists, make the mistake of ignoring regression all the time. Mathematically, regression to the mean is as simple as imperfect correlation between instances.

1.2. These phenomena in official statistics

The small-schools example is egregious because it led to wasteful public spending. The statistics themselves were probably fine, but their interpretation was not careful enough. Such summary statistics are often presented without regard for uncertainty. For instance, every year Statistics Norway (ssb.no), the central bureau of statistics in Norway, presents crime report counts. The media usually report these numbers as rates and inform us that some small town that few people know about is the most criminal in the country. Often the focus is on violent crimes. Figure 1 below shows these rates for 2016. Not knowing de Moivre's result it might be striking to observe that many of the towns with the highest rates are small towns. Similarly, not knowing regression it might be striking to observe that, on average, towns with a high rate in one year will have a lower one in any other year, see Figure 2 below. These are unavoidable stochastic phenomena. Thus there is reason to believe that we should somehow adjust our expectations about these numbers. We will see below that such an adjustment also makes statistical sense.

1.3. Shrinkage estimation

There is an astonishing decision-theoretic result due to Charles Stein: suppose that we wish to estimate $k \geq 3$ parameters $\theta_1, \ldots, \theta_k$ and observe $k$ independent measurements, $x_1, \ldots, x_k$, such that $x_i \sim N(\theta_i, 1)$. There is an estimator of $\theta_i$ that has uniformly lower risk, in terms of total quadratic loss, than the obvious candidate $x_i$ (Stein, 1956). In other words, the maximum likelihood estimate is inadmissible. Stein showed this by introducing a lower-risk estimator that biases, or shrinks, the $x_i$ toward zero. James and Stein (1961) introduced an improved shrinkage estimator, which we will see below. Efron and Morris (1973) show a similar result and a similar estimator for shrinking toward the pooled mean. There are many successful applications of shrinkage estimation, see for instance the examples from Morris (1983). The common theme is a setting where the statistician wants to estimate many similar variable quantities.

1.4. An almost-Bayesian estimator

In this case study we consider the official Norwegian crime report counts. We assume that in a given year the number of crimes reported in town $i$, denoted $k_i$, corresponds to the number of criminal events in this town. We further assume that each inhabitant can be reported for at most one crime a year. Our goal is to estimate the crime probability $\theta_i$: the probability that a person will commit a crime in this town. The obvious estimator is the maximum likelihood estimate (MLE) for a binomial proportion, $\hat{\theta}_i = k_i/n_i$, where $n_i$ is the population of town $i$.

The binomial model behind the MLE rests on an assumption that inhabitants commit crimes independently according to an identical crime probability. There are reasons to believe that this is not the case. The desperately poor might be more prone to stealing than the middle class professional. There is a weaker assumption called exchangeability that says that individuals are similar but not identical. More precisely we assume that their joint criminal behavior (some number of zeros and ones) does not depend on knowing who the individuals are (the order of the zeros and ones). It is an important theorem, due to De Finetti, that a sequence of exchangeable variables is independent and identically distributed conditional on an unknown parameter $\theta_i$ that is distributed according to an a priori (or prior) distribution $f(\theta_i)$ (Spiegelhalter et al., 2004). In the binomial sense, $\theta_i$ has the remarkable property that it is the long-run frequency with which crimes occur regardless of the i.i.d. assumption; the prior precisely reflects our opinion about this limit. By virtue of De Finetti's theorem, the exchangeability assumption justifies the introduction of the unknown parameter $\theta_i$ in a binomial model for $k_i$, so long as we take the prior into account.

To make an argument with priors is to make a Bayesian argument. Shrinkage is implicit in Bayesian inference: observed data gets pulled toward the prior (and indeed the prior is pulled toward the data likelihood). We propose an almost Bayesian shrinkage estimator, $\hat{\theta}_i^s$, that accounts for the variability due to population size. Our estimator is almost Bayesian because we do not treat the prior very formally, as will be clear below.

In a Bayesian argument we treat $\theta_i$ as random. The statistician specifies a prior distribution $f(\theta_i)$ for the parameter that reflects her knowledge (and uncertainty) about $\theta_i$. As in the frequentist setting, she then selects a parametric model for the data given the parameters, which allows her to compute the likelihood $f(x \mid \theta_i)$. Inference about $\theta_i$ consists of computing its posterior distribution by Bayes' theorem:

$$f(\theta_i \mid x) = \frac{f(x \mid \theta_i)\, f(\theta_i)}{\int f(x \mid \theta_i)\, f(\theta_i)\, d\theta_i}.$$

There are various assessments we could make about the collection of $\theta_i$. If we assume they are identical we can pool them and use a single prior. If we assume they are independent we specify one prior for each and keep them separate. If we assume they are exchangeable (similar but not identical) it follows from De Finetti that there is a common prior distribution conditional on which the $\theta_1, \ldots, \theta_m$ are i.i.d. (Spiegelhalter et al., 2004). We make this latter judgment and take a beta distribution common to all crime probabilities as prior. Our likelihood for an observed number of crime reports follows a binomial distribution. It is a classic exercise to show that the posterior distribution of $\theta_i$ is then also a beta distribution. The problem remains how to choose the parameters for the prior. On the idea that a given town is probably not that different from all the other towns, we will simply pool the observed crime rates for all towns and fit a beta distribution to this ensemble by the method of moments.

Under squared error loss, the posterior mean as point estimate minimizes Bayes risk. The posterior mean serves as our shrinkage estimate, $\hat{\theta}_i^s$, for $\theta_i$. We will see that $\hat{\theta}_i^s$ in effect shrinks the observed crime rate $\hat{\theta}_i$ toward the country-wide mean $\bar{\theta} = \frac{1}{m} \sum_i \hat{\theta}_i$ by taking into account the size of town $i$.

Bayesian inference allows for intuitive uncertainty intervals. In contrast to a classical frequentist confidence interval, which can be tricky to interpret, we can say that $\theta_i$ lies within the Bayesian credible interval with a certain probability. This probability is necessarily subjective, as the prior distribution is subjective. We will conduct simulations to compare the coverage properties of our estimator to the classical asymptotic confidence interval.

1.5. Resources

This case study is written with a pedagogical purpose in mind, and can be used by advanced undergraduate and beginning graduate students in statistics as a tutorial around shrinkage estimation and Bayesian methods. We will mention some possible extensions in the conclusion that could be the basis for student projects. Data and code for all our analyses, figures, and simulations are available at https://github.com/3inar/crime_rates

2. Data

We will work with the official crime report statistics released by Statistics Norway (SSB) every year. These data contain the number of crime reports in a given Norwegian town in a given year. The counts are stratified by crime type, e.g. violent crimes, traffic violations, etc. We will focus on violent crimes. SSB separately provides yearly population statistics for each town. Figure 1 shows the 2016 crime rates (i.e. counts per population) for all towns in Norway against their respective populations. This is sometimes called a funnel plot for the funnel-like tapering along the horizontal axis: a shape that signals higher variance among the smaller towns.

[Figure 1 plot: "Crime rates more variable for smaller towns"; x-axis: population (in thousands); y-axis: 2016 violent crime rate.]

Figure 1: Rates of violent crime vs population in 2016 for all towns in Norway. The grey line shows the country-wide mean.

Figure 2 compares the crime rates in 2015 with those in 2016 and shows that the more (or less) violent towns in 2015 were on average less (or more) violent in 2016. The solid black line regresses 2016 rates on 2015 rates. The dashed grey line is what to expect if there were no regression toward the mean. It has an intercept of zero and a slope of unity. The solid grey line is the overall mean in 2016. The most extreme town in 2015, past 0.025 on the x-axis, is much closer to the mean in 2016. The solid black regression line shows that this is true for all towns on average. The fact that 2015 and 2016 are consecutive years is immaterial; regression to the mean will be present between any two years.

[Figure 2 plot: "Crime rates regress to the mean"; x-axis: 2015 (correlation = 0.89); y-axis: 2016.]

Figure 2: Regression to the mean from year to year. The plot compares 2016 and 2015; the black regression line shows that towns with high crime rates in 2015 tend to have lower crime rates in 2016, and vice versa for low crime rates. The grey dashed line shows what perfect correlation between 2015 and 2016 would look like.

Figure 3 shows the distribution of the pooled violent crime rates for 2016. The solid black line is a beta distribution fit to these data.

[Figure 3 plot: "Pooled violent crime rates, 2016"; x-axis: rate; y-axis: density.]

Figure 3: The distribution of violent crime rates in Norway, 2016. The black line describes the method-of-moments fit of a beta distribution to these data.
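The regression to the mean seen in Figure 2 can be reproduced with purely synthetic data. The following is a minimal sketch, not the authors' code: the number of towns, the population range, and the Beta(5, 917) distribution used to draw the "true" probabilities are all assumptions made for illustration. Each town keeps the same crime probability in both years, so any pull toward the mean comes from sampling noise alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical towns: populations and "true" crime probabilities are invented.
m = 400
n = rng.integers(500, 20_000, size=m)
theta = rng.beta(5.0, 917.0, size=m)

# Two years of counts drawn independently from the same, unchanged probabilities.
rate_y1 = rng.binomial(n, theta) / n
rate_y2 = rng.binomial(n, theta) / n

# Least-squares line of year-two rates on year-one rates (slope comes first).
slope, intercept = np.polyfit(rate_y1, rate_y2, deg=1)
corr = np.corrcoef(rate_y1, rate_y2)[0, 1]

# Average rate, in both years, of the towns in the top decile of year one.
top = rate_y1 >= np.quantile(rate_y1, 0.9)
top_y1, top_y2 = rate_y1[top].mean(), rate_y2[top].mean()

print(f"slope = {slope:.2f}, correlation = {corr:.2f}")
print(f"top decile: year one {top_y1:.4f}, year two {top_y2:.4f}")
```

A slope below one and a lower second-year average for the top decile are what imperfect correlation between the two years implies, even though nothing about the towns changed.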

2.1. Simulation study

We run a simulation study for validation. If we assume that the crime probability in town $i$ is stationary we can pool the observed crime rates of all years and use their average, $\bar{\theta}_i$, as a reasonable "truth." This allows us to assess the performance of our estimator against known, realistic crime probabilities, which of course is impossible in the real data. The simulated crime report count in town $i$ is $k_i \sim \mathrm{Binomial}(n_i, \bar{\theta}_i)$, where $n_i$ is the 2016 population of town $i$. Figure 4 shows a realization of this procedure. Although not a perfect replica of Figure 1 (the real data do not have any rates below 0.0017) it looks fairly realistic.

[Figure 4 plot: "Example of simulated crime rates"; x-axis: population (in thousands); y-axis: violent crime rate.]

Figure 4: Funnel plot of a set of simulated crime rates.

3. Methods

3.1. Shrinkage estimates

We treat $\theta_i$ as the probability for a person to commit a crime in a given period. We model the total number of crime reports in the $i$-th town, $k_i$, as the number of successful Bernoulli trials among $n_i$, where $n_i$ is the population of this town. As explained in the introduction, this suggests the following simple Bayesian model, also shown in Figure 5:

$$\theta_i \mid \alpha, \beta \sim \mathrm{Beta}(\alpha, \beta), \qquad k_i \mid \theta_i \sim \mathrm{Binomial}(n_i, \theta_i).$$

As mentioned, the assumption of town exchangeability leads to this hierarchical model. This assumption might not be appropriate if we had reasons to think, for instance, that some regions are more prone to crime than others. In this case, region-specific priors might be better.

[Figure 5 diagram, three levels from top to bottom: $\alpha, \beta$; $\theta_1, \theta_2, \ldots, \theta_m$; $k_1, k_2, \ldots, k_m$.]

Figure 5: A graph describing our model. Crime counts, $k_i$, are (conditionally) i.i.d. binomials whose respective parameters, $\theta_i$, are (conditionally) i.i.d. according to a common prior.

The posterior follows from the fact that the beta distribution is conjugate to itself with respect to the binomial likelihood. Generally, conjugacy means that the prior and posterior distributions belong to the same distributional family and usually entails that there is a simple closed-form way of computing the parameters of the posterior. Wasserman (2010, p. 178) shows a derivation of the posterior in the beta-binomial model:

$$\theta_i \mid k_i \sim \mathrm{Beta}(\alpha + k_i,\ \beta + n_i - k_i).$$

We will look into the relation between the parameters of the posterior and those of the prior in terms of successes and failures in the results section.

The shrinkage estimate for the crime probability in town $i$ is the posterior mean

$$\hat{\theta}_i^s = \frac{\alpha + k_i}{\alpha + \beta + n_i}.$$
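The conjugate update and posterior mean above amount to a few lines of arithmetic. The sketch below is illustrative rather than the authors' implementation. It uses the rounded prior parameters α ≈ 5 and β ≈ 917 that the results section reports for the 2016 data, and the counts for Hasvik and Selbu are back-computed from the rates and populations quoted there (an assumption, since the raw counts are not given in the text).

```python
def posterior_params(alpha, beta, k, n):
    """Beta(alpha, beta) prior plus k events among n inhabitants gives a Beta posterior."""
    return alpha + k, beta + n - k


def shrinkage_estimate(alpha, beta, k, n):
    """Posterior mean (alpha + k) / (alpha + beta + n)."""
    a_post, b_post = posterior_params(alpha, beta, k, n)
    return a_post / (a_post + b_post)


# Rounded prior parameters quoted in the results section.
alpha, beta = 5.0, 917.0

# (k, n) per town; k is back-computed as round(rate * n), which is an assumption.
towns = {"Hasvik": (19, 1054), "Selbu": (7, 4132)}

for name, (k, n) in towns.items():
    mle = k / n
    shrunk = shrinkage_estimate(alpha, beta, k, n)
    print(f"{name}: MLE = {mle:.4f}, shrinkage estimate = {shrunk:.4f}")
```

With these inputs the shrinkage estimates land close to the values reported for the two towns (about 0.012 and 0.0024), which is a useful sanity check on the formula.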

The maximum likelihood estimate for $\theta_i$ is the observed crime rate $\hat{\theta}_i = k_i/n_i$. In order to fix values of $\alpha$ and $\beta$, we pool the MLEs for all towns $\hat{\theta}_1, \ldots, \hat{\theta}_m$ and fit a beta distribution to these data by the method of moments. We show the resulting fit in Figure 3. Because the expectation and variance of a $\mathrm{Beta}(\alpha, \beta)$ are $\frac{\alpha}{\alpha+\beta}$ and $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$, respectively, the parameter estimates for the prior are

$$\beta = \frac{\alpha(1 - \bar{\theta})}{\bar{\theta}}, \quad \text{and} \quad \alpha = \left( \frac{1 - \bar{\theta}}{S^2} - \frac{1}{\bar{\theta}} \right) \bar{\theta}^2.$$

Here $\bar{\theta} = \frac{\sum_i \hat{\theta}_i}{m}$ and $S^2 = \frac{\sum_i (\hat{\theta}_i - \bar{\theta})^2}{m-1}$ are the sample mean and variance of the pooled MLEs.

Instead of estimating $\alpha$ and $\beta$ from the data like this, which ignores any randomness in these parameters, we could have a prior distribution for the parameters themselves. This would yield a typical Bayesian hierarchical model. Note also that in forming the estimate for town $i$, we end up using its information twice: once in eliciting our prior and once in the likelihood. This is convenient because we need only find one prior rather than one for each town where we exclude the $i$th town from the $i$th prior. This bit of trickery does not make much difference: we have several hundred towns and hence removing a single town does not affect the shape of the prior much.

The estimate $\hat{\theta}_i^s = \frac{\alpha + k_i}{\alpha + \beta + n_i}$ shrinks the observed, or MLE, crime rate toward the prior mean $\bar{\theta}$. We can rewrite so that $\hat{\theta}_i^s = \delta_i \bar{\theta} + (1 - \delta_i)\hat{\theta}_i$, with $\delta_i = \frac{\alpha + \beta}{\alpha + \beta + n_i}$. Here $\delta_i$ directly reflects the prior's influence on $\hat{\theta}_i^s$, and we see that this influence grows as the town size, $n_i$, shrinks.

3.2. James-Stein estimates

For completeness we demonstrate empirically that the James-Stein estimator is superior to the MLE in terms of risk. If town $i$ has a large enough population, we can consider the normal approximation to the binomial distribution and assume

$$\hat{\theta}_i = \frac{k_i}{n_i} \sim N\left(\theta_i, \sigma_i^2\right),$$

where $\sigma_i^2 = \frac{\theta_i(1 - \theta_i)}{n_i}$ is unknown. If we assume that towns are similar in terms of variance we can consider the pooled variance estimate

$$\hat{\sigma}_P^2 = \frac{\sum_{i=1}^{m} (n_i - 1)\hat{\sigma}_i^2}{\sum_{i=1}^{m} (n_i - 1)},$$

where $\hat{\sigma}_i^2 = \frac{\hat{\theta}_i(1 - \hat{\theta}_i)}{n_i} = \frac{k_i(n_i - k_i)}{n_i^3}$. The James-Stein estimator of crime probability for town $i$ is then

$$\hat{\theta}_i^{JS} = \left( 1 - \frac{(m - 2)\hat{\sigma}_P^2}{\sum_{i=1}^{m} \hat{\theta}_i^2} \right) \hat{\theta}_i.$$

This is a shrinkage toward zero. It assumes that crime rates are probably not as high as they appear. This is different from our assumption that crime rates are probably not as far away from the average as they appear. It is simple to modify the above to shrink toward any origin. The Efron-Morris variant (Efron and Morris, 1973) shrinks toward the average:

$$\hat{\theta}_i^{JS} = \bar{\theta} + \left( 1 - \frac{(m - 2)\hat{\sigma}_P^2}{\sum_{i=1}^{m} (\hat{\theta}_i - \bar{\theta})^2} \right) (\hat{\theta}_i - \bar{\theta}).$$

We will use this variant so that the two methods shrink toward the same point.

3.3. Uncertainty intervals

We construct credible intervals from the posterior. A 95% credible interval contains 0.95 of the posterior probability mass, and the simplest way to construct one is to place it between the 0.025 and 0.975 quantiles of the posterior. For the MLE we use the typical normal approximation (or Wald) confidence interval. There is to our knowledge no straightforward way to construct confidence intervals for the JS estimator, so we will leave this as an exercise for the reader.
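The estimators of Sections 3.1 and 3.2 can be sketched in a few lines. This is a hedged illustration on synthetic towns, not the code from the authors' repository: the function names `fit_beta_moments` and `efron_morris`, the town sizes, and the Beta(5, 917) distribution used to generate data are all assumptions. The sketch fits the prior by the method of moments, forms the shrinkage estimate together with its weights δ_i, and computes the Efron-Morris variant of the James-Stein estimator with the pooled variance.

```python
import numpy as np


def fit_beta_moments(rates):
    """Method-of-moments Beta(alpha, beta) fit to the pooled rates."""
    mean = rates.mean()
    var = rates.var(ddof=1)                      # S^2 with the m - 1 denominator
    alpha = ((1 - mean) / var - 1 / mean) * mean**2
    beta = alpha * (1 - mean) / mean
    return alpha, beta


def efron_morris(k, n):
    """James-Stein estimate shrinking toward the mean, with pooled variance."""
    rates = k / n
    m = rates.size
    var_i = rates * (1 - rates) / n              # per-town variance estimates
    pooled = np.sum((n - 1) * var_i) / np.sum(n - 1)
    centre = rates.mean()
    factor = 1 - (m - 2) * pooled / np.sum((rates - centre) ** 2)
    return centre + factor * (rates - centre)


# Synthetic towns (hypothetical sizes and probabilities).
rng = np.random.default_rng(2)
n = rng.integers(500, 20_000, size=300)
k = rng.binomial(n, rng.beta(5.0, 917.0, size=300))
mle = k / n

alpha, beta = fit_beta_moments(mle)
delta = (alpha + beta) / (alpha + beta + n)      # weight on the prior mean
shrunk = (alpha + k) / (alpha + beta + n)        # posterior mean
js = efron_morris(k, n)

print(f"alpha = {alpha:.2f}, beta = {beta:.1f}")
print(f"largest prior weight: {delta.max():.2f}, smallest: {delta.min():.2f}")
```

Because the method of moments matches the prior mean to the pooled mean, the weighted-average form δ_i θ̄ + (1 − δ_i) θ̂_i and the posterior-mean form agree exactly, which the sketch makes easy to verify numerically.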

3.4. Global risk estimates

We use the total squared-error loss function,

$$L(\theta, \hat{\theta}^s) = \sum_{i=1}^{m} (\theta_i - \hat{\theta}_i^s)^2,$$

to measure the global discrepancy between the true rates $\theta = (\theta_i)_{i=1,\ldots,m}$ and estimates $\hat{\theta}^s = (\hat{\theta}_i^s)_{i=1,\ldots,m}$. We do the same for the maximum likelihood and James-Stein estimates $\hat{\theta} = (\hat{\theta}_i)_{i=1,\ldots,m}$ and $\hat{\theta}^{JS} = (\hat{\theta}_i^{JS})_{i=1,\ldots,m}$, respectively.

We will compare the expected loss, or risk, of the three estimators, $R(\cdot) = E[L(\cdot)]$, confirming the well-known property that shrinkage estimators dominate the MLE. We obtain Monte Carlo estimates of risk by averaging $L(\cdot)$ across repeated simulations.

3.5. Coverage properties

For the credible interval $C^s = (a, b)$, we want to assess the coverage probability $P(\theta \in C^s)$ and compare with $P(\theta \in C^W)$ for the classical Wald confidence interval. We will not assess the James-Stein estimator in terms of coverage. Let $I(C_i)$, where $C_i = C_i^s$ or $C_i^W$, be the indicator function that is equal to unity if $\theta_i \in C_i$, and zero otherwise. We obtain MC estimates of coverage probability by averaging the mean internal coverage, $\frac{1}{m} \sum_{i=1}^{m} I(C_i^{\cdot})$, across repeated simulations. An uncertainty interval should be well-calibrated: if the nominal level of the interval is 95% it should trap the true parameter 95% of the time.

4. Results

4.1. Official SSB data

We focus on violent crimes in the year 2016. Figure 6 shows the effect of shrinking the observed crime rates toward the prior mean. We see that the more extreme estimates shrink toward the center. The town with the highest crime rate according to the maximum likelihood estimate is Hasvik ($\hat{\theta} = 0.018$), a small town with slightly more than 1000 inhabitants ($n = 1054$). After shrinkage, Hasvik still ranks first, but the shrinkage estimate is much lower ($\hat{\theta}^s = 0.012$). Similarly the town with the lowest crime rate is Selbu ($\hat{\theta} = 0.0017$), another small town ($n = 4132$). Selbu's shrinkage estimate is higher than the MLE by more than 40% ($\hat{\theta}^s = 0.0024$). Oslo, shown in black, is a big city ($n = 658390$) and the difference between the two estimates is negligible ($\hat{\theta} - \hat{\theta}^s = 7 \times 10^{-6}$).

[Figure 6 plot: "Shrinkage vs ordinary estimates"; rows: S (shrinkage) and MLE; x-axis: crime rate estimate.]

Figure 6: Comparing shrinkage and maximum likelihood estimates. Oslo, in black, is both close enough to the grand mean and large enough in size that the estimate does not change.
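The Monte Carlo procedure of Sections 3.4 and 3.5 can be sketched as a small loop. This is an illustration under stated assumptions, not the study itself: the towns are synthetic, the number of repetitions is far below what a real study would use, and the credible-interval quantiles are approximated by sampling from the posterior with NumPy rather than computed exactly (a technique swapped in here to keep the example dependency-light).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: towns, repetitions and posterior draws are all invented.
m, reps, draws = 150, 100, 1000
n = rng.integers(500, 20_000, size=m)
theta = rng.beta(5.0, 917.0, size=m)      # "true" probabilities, held fixed

loss_mle, loss_shrunk = np.empty(reps), np.empty(reps)
cover_wald, cover_cred = np.empty(reps), np.empty(reps)

for r in range(reps):
    k = rng.binomial(n, theta)
    mle = k / n

    # Method-of-moments beta prior from the pooled MLEs.
    mean, var = mle.mean(), mle.var(ddof=1)
    alpha = ((1 - mean) / var - 1 / mean) * mean**2
    beta = alpha * (1 - mean) / mean

    # Posterior Beta(alpha + k, beta + n - k) and its mean.
    post_a, post_b = alpha + k, beta + n - k
    shrunk = post_a / (post_a + post_b)

    # Total squared-error loss for both estimators.
    loss_mle[r] = np.sum((theta - mle) ** 2)
    loss_shrunk[r] = np.sum((theta - shrunk) ** 2)

    # Wald interval around the MLE and its mean internal coverage.
    half = 1.96 * np.sqrt(mle * (1 - mle) / n)
    cover_wald[r] = np.mean((mle - half <= theta) & (theta <= mle + half))

    # Equal-tailed 95% credible interval, quantiles approximated by sampling.
    sample = rng.beta(post_a, post_b, size=(draws, m))
    lo, hi = np.quantile(sample, [0.025, 0.975], axis=0)
    cover_cred[r] = np.mean((lo <= theta) & (theta <= hi))

risk_mle, risk_shrunk = loss_mle.mean(), loss_shrunk.mean()
coverage_wald, coverage_cred = cover_wald.mean(), cover_cred.mean()
print(f"risk: MLE {risk_mle:.2e}, shrinkage {risk_shrunk:.2e}")
print(f"coverage: Wald {coverage_wald:.3f}, credible {coverage_cred:.3f}")
```

Because the true probabilities here are drawn from a beta distribution, the prior is well specified by construction, so this setup is more favorable to the shrinkage estimator than real data would be.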

[Figure 7 plot: "Q-Q plot of 2016 rates against fitted prior"; x-axis: prior quantiles; y-axis: empirical 2016 quantiles.]

Figure 7: Quantile-quantile plot of 2016 crime rates against the fitted prior. The solid line describes a perfect fit.

Figure 7 is a quantile-quantile plot of the 2016 violent crime rates against the fitted prior. There is some very slight deviation around the tails, but overall it looks like a nice fit.

By shrinking toward the ensemble we add some information (we use the term informally) to the observed rate. We can quantify this by looking at the form of the beta distribution, so far taken for granted in this treatment. Its density function is

$$f(x; \alpha, \beta) = \frac{x^{\alpha - 1}(1 - x)^{\beta - 1}}{B(\alpha, \beta)},$$

where the beta function in the denominator is simply the normalizing constant

$$B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1}\, dt.$$

A natural interpretation is that this is a distribution over the probability of success, i.e. crime, in a sequence of Bernoulli trials with $\alpha - 1$ successes and $\beta - 1$ failures (cf. the binomial distribution). Hence we can interpret the posterior for town $i$ as a distribution over the probability of success in a series of Bernoulli trials with $\alpha' = \alpha + k_i$ successes and $\beta' = \beta + n_i - k_i$ failures (ignoring the $-1$ for convenience). In our data we have that $\alpha \approx 5$ and $\beta \approx 917$; it is as though we add the information of 922 extra trials in the binomial sense. In other words we add a priori 922 inhabitants, including five criminals, to each town.

[Figure 8 plot: "Relative information from shrinkage"; x-axis: log10 population; y-axis: relative information.]

Figure 8: Relative information in the posterior mean compared to the MLE. The figure shows $(\alpha + k_i)/k_i$ in grey and $(\beta + n_i - k_i)/(n_i - k_i)$ in black. These represent the added information in terms of number of successes and number of failures added to the MLE to form the shrinkage estimate. For the smallest towns, we practically double the information.

Figure 8 shows $\alpha'$ and $\beta'$ (grey and black) relative to the number of successes ($k_i$) and failures ($n_i - k_i$) for each town in the 2016 data. For the smaller towns, there is double the information in the shrinkage estimate, while for larger towns there is no practical increase. Naturally the value of this extra information depends on the degree to which the prior is relevant.

Figure 9 shows the ten most violent towns according to shrinkage estimate along with their 95% credible intervals. The official, or MLE, crime rate is shown as a red point. We see some change in ordering. For one town, a small and presumably quiet village, the MLE is so implausible that it is outside the credible interval. For Oslo, the biggest city in Norway, the estimate doesn't change.
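The pseudo-count reading of the prior can be checked with simple arithmetic. The sketch below is only an illustration: it uses the rounded prior parameters quoted above, a count for a town of 1054 inhabitants that is back-computed from its reported rate (an assumption), and an entirely hypothetical large city. The function name `relative_information` is likewise made up for this example.

```python
ALPHA, BETA = 5.0, 917.0   # rounded prior parameters quoted in the text


def relative_information(k, n, alpha=ALPHA, beta=BETA):
    """Posterior pseudo-counts relative to the observed counts.

    Requires 0 < k < n; a town with no events has no finite success ratio.
    """
    successes = (alpha + k) / k
    failures = (beta + n - k) / (n - k)
    return successes, failures


# Prior weight expressed as extra inhabitants added to every town.
print(ALPHA + BETA)

# Small town: n = 1054 with k = 19 back-computed from a rate of 0.018 (assumption).
print(relative_information(19, 1054))

# Hypothetical large city: both ratios stay very close to one.
print(relative_information(5000, 650_000))
```

For the small town the failure count nearly doubles while the success count grows by roughly a quarter, whereas for the large city the prior adds almost nothing, which matches the pattern described for Figure 8.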

[Figure 9 plot: "Ten most violent towns in 2016"; towns listed: Hasvik, Vadsø, Porsanger, Lebesby, Vardø, Halden, Oslo kommune, Haugesund, Ullensaker, Hemsedal; x-axis: rate, from 0.008 to 0.016.]

Figure 9: The ten towns with the highest crime rate, ordered by shrinkage estimate. The bars are 95% credible intervals. MLEs shown in red.

[Figure 10 plot: six panels of yearly rates, 2008 to 2016, each with a vertical axis from 0.00 to 0.02. Top row: Hasvik (population: 1054), Lebesby (population: 1318), Berlevåg (population: 1000). Bottom row: Gausdal (population: 6227), Fusa (population: 3876), Selbu (population: 4132).]

Figure 10: Historical data for the three most violent and the three least violent towns in 2016, ordered by official crime rate (MLE). The official statistics are drawn in black, and shrinkage estimates in red. The vertical bars indicate confidence and credible intervals, respectively.

Figure 10 shows historical data for the three most violent and the three least violent towns in 2016, according to official crime rate. We show shrinkage estimates in red and official statistics in black. The vertical bars are 95% uncertainty intervals. The shrinkage estimate is usually more conservative, at least for the more violent towns, but the trends remain similar for both estimates. The credible intervals are shorter than the classical confidence intervals. We will see that in spite of this their coverage is better under simulation. It is interesting that the three most violent towns are all in Finnmark: Norway's largest and most sparsely populated county.

4.2. Simulated data

To obtain MC estimates of risks we run 100 000 simulations for each of our two experiments. Figure 11 shows kernel density estimates of the distributions of global loss. Our shrinkage estimates show lower global risk than maximum likelihood: $\hat{R}(\theta, \hat{\theta}^s) = 0.00054$ versus $\hat{R}(\theta, \hat{\theta}) = 0.00066$. The James-Stein estimates fall almost exactly between the two with $\hat{R}(\theta, \hat{\theta}^{JS}) = 0.00059$. We might have observed better results for JS had we used a variant of JS that allows unequal variances. Note that we fixed $\theta_i$ for this experiment, so we are only assessing the risk function in a single point.

[Figure 11 plot: "MC estimates of risk"; legend: MLE, S, JS; x-axis: loss; y-axis: density.]

Figure 11: Distributions of $L(\theta, \hat{\theta})$ (solid black), $L(\theta, \hat{\theta}^s)$ (dashed red), and $L(\theta, \hat{\theta}^{JS})$ (solid grey). Vertical lines estimate the risk.

[Figure 12 plot: "MC estimates of coverage probability"; legend: Wald, Credible; x-axis: coverage; y-axis: density.]

Figure 12: Distributions of the internal coverage $\frac{1}{m} \sum_{i=1}^{m} I(C_i^W)$ (solid black) and $\frac{1}{m} \sum_{i=1}^{m} I(C_i^s)$ (dashed red). Vertical lines estimate coverage probability. The grey line shows the nominal coverage of 0.95.

Figure 12 presents estimated coverage probabilities in the same manner as Figure 11. The grey line shows the nominal coverage of 0.95. The coverage probability of the credible interval for the shrinkage estimator, $\hat{P}(\theta \in C^s) = 0.917$, is closer to the nominal value than that of the standard interval, $\hat{P}(\theta \in C^W) = 0.898$. There is however still room for improvement.

5. Conclusion

This case study shows a simple method for simultaneous estimation of all town-specific crime rates in a country. The method is Bayesian in spirit, although we take some shortcuts with our prior. It is known that under squared-error loss the posterior mean is the optimal decision w.r.t. a given prior. In other words it minimizes Bayes risk, and is called the Bayes estimate. The theory gives us that Bayes estimates are admissible (Wald, 1947), and thus cannot be dominated. The risk estimates of our simulation agree with this. Our analysis provides an estimate of the crime probability with favorable frequency properties in terms of risk and coverage.

Our simulations show that the Bayesian credible intervals from this treatment are narrower and have better coverage than the standard Wald confidence interval. Hence we get better information about the location of $\theta_i$. Brown et al. (2001) show extensively that the Wald confidence interval for the binomial proportion behaves erratically for extreme values of $p$, for varying values of $n$, and for (un)lucky combinations of the two. Our result is interesting but quite narrow. Generalizing it requires more work.

Smaller towns are over-represented among the most and least violent towns in the official Norwegian data. Mathematically this has to be the case. Applying shrinkage methods to these data we get more conservative estimates for these variable and often extreme quantities. At the same time it seems that variance is not the only factor that places some of these small towns among the most violent. As Figure 10 shows, the top and bottom three in 2016 show a certain stability year by year. Hasvik in Finnmark has never ranked especially low since 2008. Small towns in the north are often ranked high for violence. There could be many reasons for this and we leave further analysis to the criminologists.

These simple and useful estimation methods are best understood by practical examples. We encourage readers and students to actively follow this tutorial by playing with the available code and data. We used a single prior for all towns. It would be an interesting extension to use a mixture of beta distributions to account for any heterogeneity due to different latent rate levels. In this case, an EM algorithm could be used to assign each town to a class. Or, since Finnmark seems to be a special case, we might estimate per-county priors. It is also possible to include Bayesian multiple testing procedures to infer a list of cities likely to have true crime rates above some given threshold. There is a temporal aspect to these data that we have not looked into. It would be possible to start out with a country-wide prior, but after this let the prior for one year be the posterior from the previous. Interested readers can find other ideas for further development in Robinson (2017). Gelman and Nolan (2017) also discuss a similar project to this one in their manual for statistics teachers.

In this treatment we have moved from descriptive figures typical of official statistics to model-based inferential statistics, estimating a crime probability rather than reporting a crime count. This allows us to account for variance and perhaps avoid over-interpreting noise, and hence avoid small-schools-type mistakes. We believe that probabilistic thinking can enrich descriptive statistics and aid in their interpretation.

Acknowledgements

We would like to thank our anonymous reviewer for the very thorough and very useful comments and suggestions.

References

Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statist. Sci., 16(2):101–133.

Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors: an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130.

Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15:246–263.

Gelman, A. and Nolan, D. (2017). Teaching statistics: A bag of tricks. Oxford University Press.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379.

Morris, C. N. (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association, 78(381):47–55.

Robinson, D. (2017). Introduction to Empirical Bayes: Examples from Baseball Statistics.

Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004). Bayesian approaches to clinical trials and health-care evaluation, volume 13. John Wiley & Sons.

Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 197–206, Berkeley, Calif. University of California Press.

Tversky, A. and Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2):105.

Wainer, H. (2007). The most dangerous equation. American Scientist, 95(3):249.

Wald, A. (1947). An essentially complete class of admissible decision functions. Ann. Math. Statist., 18(4):549–555.

Wasserman, L. (2010). All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, Incorporated.

Correspondence: [email protected]