
Statistical methods for Data Science, Lecture 5: Interval estimates; comparing systems

Richard Johansson

November 18, 2018

overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?

“recipes”

- in this lecture, we’ll look at a few “recipes” that you’ll use in the assignment:
  - interval estimate for a proportion (“heads probability”)
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we’ll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of data do we have? what assumptions do we make about the data?

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] so that the true value of the parameter is “likely” to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]

frequentists and Bayesians again...

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%

interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method

the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution

- this distribution depends on the sample size
- large sample → more concentrated distribution

[figure: sampling distribution of the estimator, n = 25]

estimator distribution and sample size (p = 0.35)

[figure: sampling distributions of the estimator for n = 10, 25, 50, 100]

confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal (sketched below for comparison)
- instead, we’ll consider a method to compute a Bayesian credible interval that does not use any approximations
- works fine even if the numbers are small
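
For comparison, here is a minimal sketch of the textbook normal-approximation (Wald) interval; the counts k and n are just illustrative:

import numpy as np
from scipy import stats

# normal-approximation (Wald) 95% confidence interval for a proportion
k, n = 40, 50                      # illustrative counts: k successes out of n trials
p_hat = k / n
z = stats.norm.ppf(0.975)          # two-sided 95% level, z is about 1.96
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
print('Wald interval:', (p_hat - half_width, p_hat + half_width))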

credible intervals in Bayesian statistics

1. choose a prior distribution

   [figure: prior density over the parameter]

2. compute a posterior distribution from the prior and the data

   [figure: posterior density over the parameter]

3. select an interval that covers e.g. 95% of the posterior distribution

   [figure: posterior density with the 95% interval marked]

recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)

  [figure: Beta prior density]

- then the posterior is also a Beta, with parameters k + a and (n − k) + b

  [figure: Beta posterior density]

- select a 95% interval

  [figure: posterior density with the 95% interval marked]

in SciPy

- assume n_success successes out of n
- recall that we use ppf to get the quantiles!
- or even simpler, use interval

from scipy import stats

a = 1
b = 1

n_fail = n - n_success
posterior_distr = stats.beta(n_success + a, n_fail + b)

p_low, p_high = posterior_distr.interval(0.95)
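
Equivalently, the same 95% interval can be read off with ppf (the quantile function):

p_low = posterior_distr.ppf(0.025)
p_high = posterior_distr.ppf(0.975)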

example: political polling

- we ask 87 randomly selected Gothenburgers whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

n_for = 81
n = 87
n_against = n - n_for

p_mle = n_for / n

posterior_distr = stats.beta(n_for + 1, n_against + 1)

print('ML / MAP estimate:', p_mle)
print('95% credible interval:', posterior_distr.interval(0.95))

don’t forget your common sense

- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes

- will I get a good estimate?
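
A quick check with recipe 1 and a uniform prior suggests the interval will be quite wide with only 14 answers:

from scipy import stats

# 12 "yes" answers out of 14, uniform Beta(1, 1) prior
posterior_distr = stats.beta(12 + 1, (14 - 12) + 1)
print('MLE:', 12 / 14)
print('95% credible interval:', posterior_distr.interval(0.95))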

recipe 2: mean of a normal

- we have some sample that we assume follows some normal distribution; we don’t know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
- this is a bell-shaped distribution with longer tails than the normal

[figure: density of a t distribution]

- the t distribution has a parameter called degrees of freedom (df) that controls the tails

recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

[figure: density of the t distribution for the mean]

- to get an interval estimate, select a 95% interval in this distribution

[figure: t density with the 95% interval marked]

example

- to demonstrate, we generate some data:

x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

mu_mle = x.mean()

s = x.std()
n = len(x)

posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))

print('estimate:', mu_mle)
print('95% credible interval:', posterior_distr.interval(0.95))

alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of mean, variance, and standard deviation: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html

- 95% credible intervals for the mean and the std:

res_mean, _, res_std = stats.bayes_mvs(x, 0.95)

mu_est, (mu_low, mu_high) = res_mean
sigma_est, (sigma_low, sigma_high) = res_std
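
For instance, to print what we just computed:

print('mean:', mu_est, '95% interval:', (mu_low, mu_high))
print('std:', sigma_est, '95% interval:', (sigma_low, sigma_high))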

recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation

the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can’t be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset

bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up

repeat N times:
    D∗ = pick k items randomly (with replacement) from D
    m∗ = estimate computed on D∗
    store m∗ in a list M
return the 2.5% and 97.5% percentiles of M

[figure: histogram of the bootstrap estimates]

- see Wikipedia for different varieties
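
In Python, the pseudocode above might look roughly like this; here correct is assumed to be a 0/1 array marking which test items a classifier got right, so the estimate is simply its mean (names and numbers are illustrative):

import numpy as np

def bootstrap_interval(correct, N=10000):
    """Percentile bootstrap: 95% interval for the mean of `correct`."""
    k = len(correct)
    estimates = []
    for _ in range(N):
        # pick k items randomly, with replacement, from the original dataset
        resample = np.random.choice(correct, size=k, replace=True)
        estimates.append(resample.mean())
    # return the 2.5% and 97.5% percentiles of the bootstrap estimates
    return np.percentile(estimates, [2.5, 97.5])

# illustrative data: 93 correct out of 100 test items
correct = np.array([1] * 93 + [0] * 7)
print(bootstrap_interval(correct))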

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we’ll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the “tests” (e.g. documents) are independent
- for tests in general, see e.g. Wikipedia

comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
- so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?

null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
- the null hypothesis is typically “boring”: the true accuracy is equal to 0.7
- the “unusualness” is measured by the p-value: if the null hypothesis is true, how likely are we to see an outcome as unusual as the one we got?
- the traditional threshold for p-values to be considered “significant” is 0.05

the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50
  - is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40

historical side note: sex ratio at birth

- the first known case where a p-value was computed involved the investigation of sex ratios at birth in London in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)

“From whence it follows, that it is Art, not Chance, that governs.” (Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710)

example

- 40 correct guesses out of 50
- if the true accuracy is 0.70, is 40 out of 50 an unusual result?

[figure: pmf of Binomial(50, 0.70), with the observed outcome 40 marked]

- the p-value is 0.16, which isn’t “significantly” unusual!

implementing the exact binomial test in SciPy

- assume we made x correct guesses out of n
- is the accuracy significantly different from test_acc?
- the p-value is the sum of the probabilities of the outcomes that are at least as “unusual” as x:

import scipy.stats

def exact_binom_test(x, n, test_acc):
    rv = scipy.stats.binom(n, test_acc)
    p_x = rv.pmf(x)
    p_value = 0
    # sum the probabilities of all outcomes at least as unusual as x
    for i in range(n + 1):
        p_i = rv.pmf(i)
        if p_i <= p_x:
            p_value += p_i
    return p_value

- actually, we don’t have to implement it since there is a function scipy.stats.binom_test that does exactly this!
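
A quick check of the hand-rolled function against SciPy on the example above (40 correct out of 50, null accuracy 0.70); note that newer SciPy versions provide this as scipy.stats.binomtest, which returns an object with a pvalue attribute:

print(exact_binom_test(40, 50, 0.70))
print(scipy.stats.binom_test(40, 50, 0.70))   # both give about 0.16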

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

comparing two classifiers

- I’m comparing a Naive Bayes and a perceptron classifier
- we evaluate them on the same test set
- the NB classifier had 186 correct out of 312 guesses
- ... and the perceptron had 164 correct guesses
- so the ML estimates of the accuracies are 0.60 and 0.53, respectively
- but does this strongly support that the NB classifier is really better?


- we make a table that compares the errors of the two classifiers:

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- if NB is about as good as the perceptron, the B and C values should be similar
- conversely, if they are really different, B and C should differ
- are these B and C values unusual?

McNemar’s test

- in McNemar’s test, we model the discrepancies (the B and C values)

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- there are a number of variants of this test
- the original formulation: Quinn McNemar (1947). Note on the error of the difference between correlated proportions or percentages. Psychometrika 12:153–157.
- our version builds on the exact binomial test that we saw before

McNemar’s test (continued)

                   NB correct   NB incorrect
  perc correct     A = 125      B = 39
  perc incorrect   C = 61       D = 87

- the number of discrepancies is B + C
- how are the discrepancies distributed?
- if the two systems are equivalent, the discrepancies should be more or less evenly spread into the B and C boxes
- it can be shown that B would be a binomial random variable with parameters B + C and 0.5
- so we can find the p-value (the “unusualness”) like this:

p_value = scipy.stats.binom_test(B, B+C, 0.5)

- in this case it is 0.035, supporting the claim that NB is better
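
Plugging in the counts from the table above:

import scipy.stats

B, C = 39, 61
print(scipy.stats.binom_test(B, B + C, 0.5))   # about 0.035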

alternative implementation

http://www.statsmodels.org/dev/generated/statsmodels.stats.contingency_tables.mcnemar.html
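
A sketch of how that statsmodels function could be used on the same table; exact=True selects the binomial version of the test (see the linked documentation for the details):

from statsmodels.stats.contingency_tables import mcnemar

table = [[125, 39],
         [61, 87]]
result = mcnemar(table, exact=True)
print(result.pvalue)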

overview

interval estimates

significance testing for the accuracy

comparing two classifiers

p-value fishing

searching for significant effects

- scientific investigations sometimes operate according to the following procedure:
  1. propose some hypothesis
  2. collect some data
  3. do we get a “significant” p-value over some null hypothesis?
  4. if no, revise the hypothesis and go back to 3
  5. if yes, publish your findings, promote them in the media, ...

searching for significant effects (alternative)

- or a “data science” experiment:
  1. you are given some dataset and told to “extract some meaning” from it
  2. look at the data until you find a “significant” effect
  3. publish ...

searching for significant effects

- remember: if the null hypothesis is true, we will still see “significant” effects about 5% of the time
- consequence: if we search long enough, we will probably find some effect with a p-value that is small, even if this is just due to chance
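
A small simulation of this effect: the null hypothesis is true in every run (the classifier’s true accuracy equals the value we test against), yet a few percent of the runs still come out “significant” (all numbers here are made up for illustration):

import numpy as np
import scipy.stats

np.random.seed(0)
n_runs = 1000
n = 50             # test set size in each imaginary experiment
true_acc = 0.70    # the null hypothesis is exactly true

n_significant = 0
for _ in range(n_runs):
    correct = np.random.binomial(n, true_acc)         # simulate one evaluation
    p = scipy.stats.binom_test(correct, n, true_acc)  # exact binomial test
    if p < 0.05:
        n_significant += 1

print('fraction of "significant" runs:', n_significant / n_runs)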

spurious correlations

[chart from tylervigen.com: “Letters in winning word of Scripps National Spelling Bee correlates with Number of people killed by venomous spiders”, 1999–2009]

example

http://andrewgelman.com/2017/11/11/student-bosses-want-p-hack-dont-even-know/

“data dredging”: further reading

https://en.wikipedia.org/wiki/Data_dredging

https://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data

some solutions

- common sense
- held-out data (a separate test set)
- correcting for multiple comparisons

Bonferroni correction for multiple comparisons

- assume we have an experiment where we carry out N comparisons
- in the Bonferroni correction, we multiply the p-values of the individual tests by N (or alternatively, divide the “significance” threshold by N)
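
A minimal sketch, with made-up p-values from N hypothetical comparisons:

# Bonferroni: multiply each p-value by the number of comparisons
p_values = [0.008, 0.04, 0.20]          # hypothetical results of N = 3 tests
N = len(p_values)
corrected = [min(1.0, p * N) for p in p_values]
print(corrected)    # only the first one stays below the 0.05 threshold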

Bonferroni correction for multiple comparisons: example

the rest of the week

- Wednesday: Naive Bayes and evaluation assignment
- Thursday: probabilistic clustering (Morteza)
- Friday: QA hours (14–16)
