Statistical Methods for Data Science, Lecture 5: Interval Estimates; Comparing Systems
Richard Johansson, November 18, 2018

statistical inference: overview
- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine some interval that is very likely to contain the true value of the parameter (today):
  - interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?

"recipes"
- in this lecture, we'll look at a few "recipes" that you'll use in the assignment:
  - an interval estimate for a proportion ("heads probability")
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we'll see the standard method to compute an interval estimate for the mean of a normal distribution
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of experiment? what assumptions about the data?

overview
- interval estimates
- significance testing for the accuracy
- comparing two classifiers
- p-value fishing

interval estimates
- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] such that the true value of the parameter is "likely" to be contained in I
- for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]

frequentists and Bayesians again
- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%

interval estimates: overview
- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method

the distribution of our estimator
- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution
- this distribution depends on the sample size
- large sample → more concentrated distribution
[figure: estimator distribution for n = 25]

estimator distribution and sample size (p = 0.35)
[figure: estimator distributions for n = 10, 25, 50, and 100]

confidence and credible intervals for the proportion parameter
- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating the binomial with a normal
- instead, we'll consider a method to compute a Bayesian credible interval that does not use any approximations
- it works fine even if the numbers are small

credible intervals in Bayesian statistics
1. choose a prior distribution
2. compute a posterior distribution from the prior and the data
3. select an interval that covers e.g. 95% of the posterior distribution
[figures: prior, posterior, and 95% interval of the posterior]

recipe 1: credible interval for the estimation of a probability
- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)
- then the posterior is also a Beta, with parameters k + a and (n − k) + b
- select a 95% interval

in SciPy
- assume n_success successes out of n
- recall that we use ppf to get the percentiles!
- or even simpler, use interval:

    # assumes: from scipy import stats
    a = 1
    b = 1
    n_fail = n - n_success
    posterior_distr = stats.beta(n_success + a, n_fail + b)
    p_low, p_high = posterior_distr.interval(0.95)

example: political polling
- we ask 87 randomly selected Gothenburgers whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

    n_for = 81
    n = 87
    n_against = n - n_for
    p_mle = n_for / n
    posterior_distr = stats.beta(n_for + 1, n_against + 1)
    print('ML / MAP estimate:', p_mle)
    print('95% credible interval:', posterior_distr.interval(0.95))

don't forget your common sense
- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes
- will I get a good estimate?
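Probably not, and recipe 1 itself shows why. As a quick sketch (reusing the Beta-posterior snippet from the slides above), we can apply the same computation to this tiny sample:

```python
from scipy import stats

# 12 "yes" answers out of 14, with a uniform Beta(1, 1) prior
n, n_yes = 14, 12
posterior = stats.beta(n_yes + 1, (n - n_yes) + 1)
p_low, p_high = posterior.interval(0.95)

print('point estimate:', n_yes / n)
print('95% credible interval:', (p_low, p_high))
```

The interval comes out roughly 0.60–0.95: far too wide to be informative. And on top of that, 14 students in one course are hardly a random sample of the people concerned.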
recipe 2: mean of a normal
- we have a sample that we assume follows some normal distribution; we don't know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
- this is a bell-shaped distribution with longer tails than the normal
- the t distribution has a parameter called degrees of freedom (df) that controls the tails
[figure: the t distribution compared to the normal]

recipe 2: mean of a normal (continued)
- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

    posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

- to get an interval estimate, select a 95% interval in this distribution

example
- to demonstrate, we generate some data:

    x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

    mu_mle = x.mean()
    s = x.std()
    n = len(x)
    posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))
    print('estimate:', mu_mle)
    print('95% credible interval:',
          posterior_distr.interval(0.95))

alternative: estimation using bayes_mvs
- SciPy has a built-in function for the estimation of the mean, variance, and standard deviation: https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html
- 95% credible intervals for the mean and the std:

    res_mean, _, res_std = stats.bayes_mvs(x, 0.95)
    mu_est, (mu_low, mu_high) = res_mean
    sigma_est, (sigma_low, sigma_high) = res_std

recipe 3 (if we have time): brute force
- what if we have no clue about how our measurements are distributed?
  - word error rate for speech recognition
  - BLEU for machine translation

the brute-force solution to interval estimates
- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can't be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly from the original dataset

bootstrapping a confidence interval, pseudocode
- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up

    repeat N times:
        D* = pick k items randomly (with replacement) from D
        m = estimate computed on D*
        store m in a list M
    return the 2.5% and 97.5% percentiles of M

[figure: histogram of bootstrap estimates]
- see Wikipedia for different varieties

overview
- interval estimates
- significance testing for the accuracy
- comparing two classifiers
- p-value fishing

statistical significance testing for the accuracy
- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we'll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the "tests" (e.g.
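As a footnote to recipe 3: the bootstrap pseudocode translates almost line by line into NumPy. This is only a sketch; the function name bootstrap_interval is my own, and using the sample mean as the statistic is just one illustrative choice.

```python
import numpy as np

def bootstrap_interval(data, estimate, n_boot=10000, seed=0):
    """Percentile bootstrap: resample the dataset with replacement,
    recompute the estimate each time, and return the 2.5% and 97.5%
    percentiles of the resulting list of estimates."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    estimates = [estimate(rng.choice(data, size=len(data), replace=True))
                 for _ in range(n_boot)]
    return np.percentile(estimates, [2.5, 97.5])

# toy example: interval estimate for the mean of a skewed sample,
# where the normal/t recipes would be on shaky ground
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=200)
low, high = bootstrap_interval(x, np.mean)
print('mean:', x.mean(), '95% interval:', (low, high))
```

The same function works for any statistic that can be recomputed on a resampled dataset, which is exactly what makes the method attractive for metrics like word error rate or BLEU.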