
Statistical methods for Data Science, Lecture 5
Interval estimates; comparing systems

Richard Johansson

November 18, 2018


statistical inference: overview

- estimate the value of some parameter (last lecture):
  - what is the error rate of my drug test?
- determine an interval that is very likely to contain the true value of the parameter (today):
  - an interval estimate for the error rate
- test some hypothesis about the parameter (today):
  - is the error rate significantly different from 0.03?
  - are users significantly more satisfied with web page A than with web page B?


"recipes"

- in this lecture, we'll look at a few "recipes" that you'll use in the assignment:
  - an interval estimate for a proportion (a "heads probability")
  - comparing a proportion to a specified value
  - comparing two proportions
- additionally, we'll see the standard method to compute an interval estimate for the mean of a normal
- I will also post some pointers to additional tests
- remember to check that the preconditions are satisfied: what kind of experiment is it? what assumptions do we make about the data?


overview

- interval estimates
- significance testing for the accuracy
- comparing two classifiers
- p-value fishing


interval estimates

- if we get some estimate by ML, can we say something about how reliable that estimate is?
- informally, an interval estimate for the parameter p is an interval I = [p_low, p_high] such that the true value of the parameter is "likely" to be contained in I
  - for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08]


frequentists and Bayesians again...

- [frequentist] a 95% confidence interval I is computed using a procedure that will return intervals that contain p at least 95% of the time
- [Bayesian] a 95% credible interval I for the parameter p is an interval such that p lies in I with a probability of at least 95%
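To make the frequentist reading concrete, here is a minimal simulation sketch of my own (not from the slides): it repeatedly draws samples from a Bernoulli experiment with a known p, computes a 95% interval for each sample, and checks how often the interval actually covers p. The parameter values (p = 0.35, samples of size 100, 10,000 repetitions) are arbitrary choices for illustration, and the interval used is the Beta credible interval introduced as recipe 1 later in the lecture.

    import numpy as np
    from scipy import stats

    # sketch: empirical coverage of a 95% interval procedure when the
    # true parameter is known (p = 0.35, n = 100 observations per sample)
    rng = np.random.default_rng(0)
    p_true, n, n_repeats = 0.35, 100, 10000

    covered = 0
    for _ in range(n_repeats):
        k = rng.binomial(n, p_true)          # one simulated experiment
        # interval from a Beta posterior with a uniform Beta(1, 1) prior
        low, high = stats.beta(k + 1, n - k + 1).interval(0.95)
        covered += (low <= p_true <= high)

    print('observed coverage:', covered / n_repeats)   # should be close to 0.95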
interval estimates: overview

- we will now see two recipes for computing confidence/credible intervals in specific situations:
  - for probability estimates, such as the accuracy of a classifier (to be used in the next assignment)
  - for the mean, when the data is assumed to be normal
- ... and then, a general method


the distribution of our estimator

- our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution
- this distribution depends on the sample size
  - larger sample → more concentrated distribution


estimator distribution and sample size (p = 0.35)

[figure slide: plots of the estimator distribution for increasing sample sizes]


confidence and credible intervals for the proportion parameter

- several recipes, see https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- the traditional textbook method for confidence intervals is based on approximating a binomial with a normal
- instead, we'll consider a method to compute a Bayesian credible interval that does not use any approximations
  - it works fine even if the numbers are small


credible intervals in Bayesian statistics

1. choose a prior distribution
2. compute a posterior distribution from the prior and the data
3. select an interval that covers e.g. 95% of the posterior distribution


recipe 1: credible interval for the estimation of a probability

- assume we carry out n independent trials, with k successes and n − k failures
- choose a Beta prior for the probability; that is, select shape parameters a and b (for a uniform prior, set a = b = 1)
- then the posterior is also a Beta, with parameters k + a and (n − k) + b
- select a 95% interval in the posterior


in SciPy

- assume n_success successes out of n
- recall that we use ppf to get the percentiles! or, even simpler, use interval

    a = 1
    b = 1   # uniform prior: a = b = 1

    n_fail = n - n_success
    posterior_distr = stats.beta(n_success + a, n_fail + b)

    p_low, p_high = posterior_distr.interval(0.95)


example: political polling

- we ask 87 randomly selected Gothenburgers whether they support the proposed aerial tramway line over the river
- 81 of them say yes
- a 95% credible interval for the popularity of the tramway is 0.857 – 0.967

    n_for = 81
    n = 87
    n_against = n - n_for

    p_mle = n_for / n
    posterior_distr = stats.beta(n_for + 1, n_against + 1)
    print('ML / MAP estimate:', p_mle)
    print('95% credible interval:', posterior_distr.interval(0.95))


don't forget your common sense

- I ask 14 Applied Data Science students whether they support free transportation between Johanneberg and Lindholmen; 12 of them say yes
- will I get a good estimate?
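To see what recipe 1 itself says about this small survey, here is a quick sketch of my own applying the same Beta recipe to the 14-student example: the point estimate is about 0.86, but the 95% credible interval comes out much wider than in the polling example, which is exactly the warning the slide is making.

    from scipy import stats

    # recipe 1 applied to the 14-student survey
    n_success, n = 12, 14
    a = b = 1                                   # uniform prior
    posterior_distr = stats.beta(n_success + a, n - n_success + b)

    print('ML estimate:', n_success / n)
    print('95% credible interval:', posterior_distr.interval(0.95))
    # with only 14 respondents, the interval is too wide to pin the
    # proportion down with any useful precision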
recipe 2: mean of a normal

- we have a sample that we assume follows some normal distribution; we don't know the mean µ or the standard deviation σ; the data points are independent
- can we make an interval estimate for the parameter µ?
- frequentist confidence intervals, but also Bayesian credible intervals, are based on the t distribution
  - this is a bell-shaped distribution with longer tails than the normal
  - the t distribution has a parameter called degrees of freedom (df) that controls the tails


recipe 2: mean of a normal (continued)

- x_mle is the sample mean; the size of the dataset is n; the sample standard deviation is s
- we consider a t distribution:

    posterior_distr = stats.t(loc=x_mle, scale=s/np.sqrt(n), df=n-1)

- to get an interval estimate, select a 95% interval in this distribution


example

- to demonstrate, we generate some data:

    x = pd.Series(np.random.normal(loc=3, scale=0.5, size=500))

- a 95% confidence/credible interval for the mean:

    mu_mle = x.mean()
    s = x.std()
    n = len(x)

    posterior_distr = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n))
    print('estimate:', mu_mle)
    print('95% credible interval:', posterior_distr.interval(0.95))


alternative: estimation using bayes_mvs

- SciPy has a built-in function for the estimation of the mean, variance, and standard deviation:
  https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.bayes_mvs.html
- 95% credible intervals for the mean and the std:

    res_mean, _, res_std = stats.bayes_mvs(x, 0.95)
    mu_est, (mu_low, mu_high) = res_mean
    sigma_est, (sigma_low, sigma_high) = res_std
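As a quick sanity check of my own (not part of the slides), the two approaches above should give essentially the same interval for the mean. The sketch below reuses x, mu_mle, s and n from the example, and assumes scipy.stats has been imported as stats as before.

    # compare the manual t-based interval with the one from bayes_mvs
    t_interval = stats.t(df=n-1, loc=mu_mle, scale=s/np.sqrt(n)).interval(0.95)
    mu_est, (mu_low, mu_high) = stats.bayes_mvs(x, 0.95)[0]

    print('t recipe:  ', t_interval)
    print('bayes_mvs: ', (mu_low, mu_high))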
recipe 3 (if we have time): brute force

- what if we have no clue about how our measurements are distributed?
  - e.g. word error rate for speech recognition
  - e.g. BLEU for machine translation


the brute-force solution to interval estimates

- the variation in our estimate depends on the distribution of possible datasets
- in theory, we could find a confidence interval by considering the distribution of all possible datasets, but this can't be done in practice
- the trick in bootstrapping – invented by Bradley Efron – is to assume that we can simulate the distribution of possible datasets by picking randomly (with replacement) from the original dataset


bootstrapping a confidence interval, pseudocode

- we have a dataset D consisting of k items
- we compute a confidence interval by generating N random datasets and finding the interval where most estimates end up

    repeat N times:
        D* = pick k items randomly (with replacement) from D
        m = the estimate computed on D*
        store m in a list M
    return the 2.5% and 97.5% percentiles of M

- see Wikipedia for different varieties:
  https://en.wikipedia.org/wiki/Bootstrapping_(statistics)#Deriving_confidence_intervals_from_the_bootstrap_distribution
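As a concrete illustration of the pseudocode, here is a minimal Python sketch of my own (assuming NumPy; the function name, the toy dataset and the choice of the mean as the estimator are placeholders, not from the slides):

    import numpy as np

    def bootstrap_interval(data, estimator, n_resamples=10000, level=0.95, seed=0):
        """Percentile bootstrap interval for an arbitrary estimator."""
        rng = np.random.default_rng(seed)
        data = np.asarray(data)
        k = len(data)
        # generate N resampled datasets and apply the estimator to each
        estimates = [estimator(rng.choice(data, size=k, replace=True))
                     for _ in range(n_resamples)]
        lower = 100 * (1 - level) / 2
        return np.percentile(estimates, [lower, 100 - lower])

    # toy usage: 95% interval for the mean of a small, skewed sample
    sample = [0.2, 0.5, 0.1, 3.0, 0.4, 0.8, 0.3, 1.2]
    print(bootstrap_interval(sample, np.mean))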
overview

- interval estimates
- significance testing for the accuracy
- comparing two classifiers
- p-value fishing


statistical significance testing for the accuracy

- in the assignment, you will consider two questions:
  - how sure are we that the true accuracy is different from 0.80?
  - how sure are we that classifier A is better than classifier B?
- we'll see recipes that can be used in these two scenarios
- these recipes work when we can assume that the "tests" (e.g. documents) are independent
- for tests in general, see e.g. https://en.wikipedia.org/wiki/Statistical_hypothesis_testing


comparing the accuracy to some given value

- my boss has told me to build a classifier with an accuracy of at least 0.70
- my NB classifier made 40 correct predictions out of 50
  - so the MLE of the accuracy is 0.80
- based on this experiment, how certain can I be that the accuracy is really different from 0.70?
- if the true accuracy is 0.70, how unusual is our outcome?


null hypothesis significance tests (NHST)

- we assume a null hypothesis and then see how unusual (extreme) our outcome is
  - the null hypothesis is typically "boring": e.g. the true accuracy is equal to 0.70
- the "unusualness" is measured by the p-value:
  - if the null hypothesis is true, how likely are we to see an outcome at least as unusual as the one we got?
- the traditional threshold for p-values to be considered "significant" is 0.05


the exact binomial test

- the exact binomial test is used when comparing an estimated probability/proportion (e.g. the accuracy) to some fixed value
  - 40 correct guesses out of 50: is the true accuracy really different from 0.70?
- if the null hypothesis is true, then this experiment corresponds to a binomially distributed r.v. with parameters 50 and 0.70
- we compute the p-value as the probability of getting an outcome at least as unusual as 40 (a SciPy sketch follows after the historical note below)


historical side note: sex ratio at birth

- the first known case where a p-value was computed involved an investigation of sex ratios at birth in London, published in 1710
- null hypothesis: P(boy) = P(girl) = 0.5
- result: p close to 0 (significantly more boys)
- "From whence it follows, that it is Art, not Chance, that governs."
  (Arbuthnot, An argument for Divine Providence, taken from the constant regularity observed in the births of both sexes, 1710; https://www.york.ac.uk/depts/maths/histstat/arbuthnot.pdf)
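Returning to the exact binomial test above, here is a minimal sketch of how one might compute the p-value for the 40-out-of-50 example. This is my own illustration: it assumes SciPy ≥ 1.7, where stats.binomtest is available (older versions instead provide stats.binom_test), and the manual computation below follows the "at least as unusual" definition from the slide, which matches the library result up to floating-point details.

    from scipy import stats

    # H0: the true accuracy is 0.70; we observed 40 correct out of 50
    result = stats.binomtest(40, n=50, p=0.70, alternative='two-sided')
    print('p-value:', result.pvalue)

    # the same idea "by hand": sum the probabilities of all outcomes that are
    # at least as unusual (i.e. at most as probable under H0) as the observed one
    null_distr = stats.binom(50, 0.70)
    p_obs = null_distr.pmf(40)
    p_value = sum(null_distr.pmf(k) for k in range(51) if null_distr.pmf(k) <= p_obs)
    print('p-value (manual):', p_value)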