
Statistical methods for Data Science, Lecture 5: Interval estimates; comparing systems

Richard Johansson

November 18, 2018

overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?

“recipes”

- in this lecture, we’ll look at a few “recipes” that you’ll use in the assignment:
  - interval estimate for a proportion (“heads probability”)
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we’ll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of data do we have? what assumptions do we make about the data?

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] so that the true value of the parameter is “likely” to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]

frequentists and Bayesians again...

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%

interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method

the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution

- this distribution depends on the sample size
- large sample → more concentrated distribution

[figure: sampling distribution of the estimator, n = 25]

estimator distribution and sample size (p = 0.35)

[figure: sampling distributions of the estimator for n = 10, 25, 50, 100]

confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal (sketched below for comparison)
- instead, we’ll consider a method to compute a Bayesian credible interval that does not use any approximations
- works fine even if the numbers are small
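
For comparison, here is a minimal sketch of the textbook normal-approximation (Wald) interval; the counts k and n are just illustrative:

import numpy as np
from scipy import stats

# normal-approximation (Wald) 95% confidence interval for a proportion
k, n = 40, 50                      # illustrative counts: k successes out of n trials
p_hat = k / n
z = stats.norm.ppf(0.975)          # two-sided 95% level, z is about 1.96
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
print('Wald interval:', (p_hat - half_width, p_hat + half_width))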

credible intervals in Bayesian statistics

1. choose a prior distribution

   [figure: prior density over the parameter]

2. compute a posterior distribution from the prior and the data

   [figure: posterior density over the parameter]

3. select an interval that covers e.g. 95% of the posterior distribution

   [figure: posterior density with the 95% interval marked]

recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)

  [figure: Beta prior density]

- then the posterior is also a Beta, with parameters k + a and (n − k) + b

  [figure: Beta posterior density]

- select a 95% interval

  [figure: posterior density with the 95% interval marked]

in SciPy

- assume n_success successes out of n
- recall that we use ppf to get the quantiles!
- or even simpler, use interval

from scipy import stats

a = 1
b = 1

n_fail = n - n_success
posterior_distr = stats.beta(n_success + a, n_fail + b)

p_low, p_high = posterior_distr.interval(0.95)
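
Equivalently, the same 95% interval can be read off with ppf (the quantile function):

p_low = posterior_distr.ppf(0.025)
p_high = posterior_distr.ppf(0.975)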

example: political polling

- we ask 87 randomly selected Gothenburgers whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

n_for = 81
n = 87
n_against = n - n_for

p_mle = n_for / n

posterior_distr = stats.beta(n_for + 1, n_against + 1)

print('ML / MAP estimate:', p_mle)
print('95% credible interval:', posterior_distr.interval(0.95))

don’t forget your common sense

- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes

- will I get a good estimate?
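
A quick check with recipe 1 and a uniform prior suggests the interval will be quite wide with only 14 answers:

from scipy import stats

# 12 "yes" answers out of 14, uniform Beta(1, 1) prior
posterior_distr = stats.beta(12 + 1, (14 - 12) + 1)
print('MLE:', 12 / 14)
print('95% credible interval:', posterior_distr.interval(0.95))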

recipe 2: mean of a normal

- we have some sample that we assume follows some normal distribution; we don’t know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
- this is a bell-shaped distribution with longer tails than the normal

[figure: density of a t distribution]

- the t distribution has a parameter called degrees of freedom (df) that controls the tails

recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

[figure: density of the t distribution for the mean]

- to get an interval estimate, select a 95% interval in this distribution

[figure: t density with the 95% interval marked]

example

- to demonstrate, we generate some data:

x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

mu_mle = x.mean()

s = x.std()
n = len(x)

posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))

print('estimate:', mu_mle)
print('95% credible interval:', posterior_distr.interval(0.95))

alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of mean, variance, and standard deviation: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html

- 95% credible intervals for the mean and the std:

res_mean, _, res_std = stats.bayes_mvs(x, 0.95)

mu_est, (mu_low, mu_high) = res_mean
sigma_est, (sigma_low, sigma_high) = res_std
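
For instance, to print what we just computed:

print('mean:', mu_est, '95% interval:', (mu_low, mu_high))
print('std:', sigma_est, '95% interval:', (sigma_low, sigma_high))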

recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation

the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can’t be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset

bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up

repeat N times:
    D∗ = pick k items randomly (with replacement) from D
    m∗ = estimate computed on D∗
    store m∗ in a list M
return the 2.5% and 97.5% percentiles of M

[figure: histogram of the bootstrap estimates]

- see Wikipedia for different varieties
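
In Python, the pseudocode above might look roughly like this; here correct is assumed to be a 0/1 array marking which test items a classifier got right, so the estimate is simply its mean (names and numbers are illustrative):

import numpy as np

def bootstrap_interval(correct, N=10000):
    """Percentile bootstrap: 95% interval for the mean of `correct`."""
    k = len(correct)
    estimates = []
    for _ in range(N):
        # pick k items randomly, with replacement, from the original dataset
        resample = np.random.choice(correct, size=k, replace=True)
        estimates.append(resample.mean())
    # return the 2.5% and 97.5% percentiles of the bootstrap estimates
    return np.percentile(estimates, [2.5, 97.5])

# illustrative data: 93 correct out of 100 test items
correct = np.array([1] * 93 + [0] * 7)
print(bootstrap_interval(correct))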

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we’ll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the “tests” (e.g. documents) are independent
- for tests in general, see e.g. Wikipedia

comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
- so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?

null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
- the null hypothesis is typically “boring”: the true accuracy is equal to 0.7
- the “unusualness” is measured by the p-value: if the null hypothesis is true, how likely are we to see an outcome as unusual as the one we got?
- the traditional threshold for p-values to be considered “significant” is 0.05

the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50
  - is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40

historical side note: sex ratio at birth

- the first known case where a p-value was computed involved the investigation of sex ratios at birth in London in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)

“From whence it follows, that it is Art, not Chance, that governs.” (Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710)

example

- 40 correct guesses out of 50
- if the true accuracy is 0.70, is 40 out of 50 an unusual result?

[figure: pmf of Binomial(50, 0.70), with the observed outcome 40 marked]

- the p-value is 0.16, which isn’t “significantly” unusual!

implementing the exact binomial test in SciPy

- assume we made x correct guesses out of n
- is the accuracy significantly different from test_acc?
- the p-value is the sum of the probabilities of the outcomes that are at least as “unusual” as x:

import scipy.stats

def exact_binom_test(x, n, test_acc):
    rv = scipy.stats.binom(n, test_acc)
    p_x = rv.pmf(x)
    p_value = 0
    # sum the probabilities of all outcomes at least as unusual as x
    for i in range(n + 1):
        p_i = rv.pmf(i)
        if p_i <= p_x:
            p_value += p_i
    return p_value

- actually, we don’t have to implement it since there is a function scipy.stats.binom_test that does exactly this!
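
A quick check of the hand-rolled function against SciPy on the example above (40 correct out of 50, null accuracy 0.70); note that newer SciPy versions provide this as scipy.stats.binomtest, which returns an object with a pvalue attribute:

print(exact_binom_test(40, 50, 0.70))
print(scipy.stats.binom_test(40, 50, 0.70))   # both give about 0.16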

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

comparing two classifiers

- I’m comparing a Naive Bayes and a perceptron classifier
- we evaluate them on the same test set
- the NB classifier had 186 correct out of 312 guesses
- ... and the perceptron had 164 correct guesses
- so the ML estimates of the accuracies are 0.60 and 0.53, respectively
- but does this strongly support that the NB classifier is really better?


- we make a table that compares the errors of the two classifiers:

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- if NB is about as good as the perceptron, the B and C values should be similar
- conversely, if they are really different, B and C should differ
- are these B and C values unusual?

McNemar’s test

- in McNemar’s test, we model the discrepancies (the B and C values)

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- there are a number of variants of this test
- the original formulation: Quinn McNemar (1947). Note on the error of the difference between correlated proportions or percentages. Psychometrika 12:153–157.
- our version builds on the exact binomial test that we saw before

McNemar’s test (continued)

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- the number of discrepancies is B + C
- how are the discrepancies distributed?
- if the two systems are equivalent, the discrepancies should be more or less evenly spread into the B and C boxes
- it can be shown that B would be a binomial random variable with parameters B + C and 0.5
- so we can find the p-value (the “unusualness”) like this:

p_value = scipy.stats.binom_test(B, B+C, 0.5)

- in this case it is 0.035, supporting the claim that NB is better
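
Plugging in the counts from the table above:

import scipy.stats

B, C = 39, 61
print(scipy.stats.binom_test(B, B + C, 0.5))   # about 0.035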

alternative implementation

http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html
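
A sketch of how that statsmodels function could be used on the same table; exact=True selects the binomial version of the test (see the linked documentation for the details):

from statsmodels.stats.contingency_tables import mcnemar

table = [[125, 39],
         [61, 87]]
result = mcnemar(table, exact=True)
print(result.pvalue)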

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

searching for significant effects

- scientific investigations sometimes operate according to the following procedure:
  1. propose some hypothesis
  2. collect some data
  3. do we get a “significant” p-value over some null hypothesis?
  4. if no, revise the hypothesis and go back to 3
  5. if yes, publish your findings, promote them in the media, ...

searching for significant effects (alternative)

- or a “data science” experiment:
  1. you are given some dataset and told to “extract some meaning” from it
  2. look at the data until you find a “significant” effect
  3. publish ...

searching for significant effects

- remember: if the null hypothesis is true, we will still see “significant” effects about 5% of the time
- consequence: if we search long enough, we will probably find some effect with a p-value that is small, even if this is just due to chance
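
A small simulation of this effect: the null hypothesis is true in every run (the classifier’s true accuracy equals the value we test against), yet a few percent of the runs still come out “significant” (all numbers here are made up for illustration):

import numpy as np
import scipy.stats

np.random.seed(0)
n_runs = 1000
n = 50             # test set size in each imaginary experiment
true_acc = 0.70    # the null hypothesis is exactly true

n_significant = 0
for _ in range(n_runs):
    correct = np.random.binomial(n, true_acc)         # simulate one evaluation
    p = scipy.stats.binom_test(correct, n, true_acc)  # exact binomial test
    if p < 0.05:
        n_significant += 1

print('fraction of "significant" runs:', n_significant / n_runs)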

spurious correlations

[chart from tylervigen.com: “Letters in winning word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders”, 1999–2009]

example

http://andrewgelman.com/2017/11/11/student-bosses-want-p-hack-dont-even-know/

“data dredging”: further reading

https://en.wikipedia.org/wiki/Data_dredging

https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data

some solutions

- common sense
- held-out data (a separate test set)
- correcting for multiple comparisons

Bonferroni correction for multiple comparisons

- assume we have an experiment where we carry out N comparisons
- in the Bonferroni correction, we multiply the p-values of the individual tests by N (or alternatively, divide the “significance” threshold by N)
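
A minimal sketch, with made-up p-values from N hypothetical comparisons:

# Bonferroni: multiply each p-value by the number of comparisons
p_values = [0.008, 0.04, 0.20]          # hypothetical results of N = 3 tests
N = len(p_values)
corrected = [min(1.0, p * N) for p in p_values]
print(corrected)    # only the first one stays below the 0.05 threshold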

Bonferroni correction for multiple comparisons: example

the rest of the week

- Wednesday: Naive Bayes and evaluation assignment
- Thursday: probabilistic clustering (Morteza)
- Friday: QA hours (14–16)
