2018 Statistics Final Exam Academic Success Programme Columbia University Instructor: Mark England


Name:    UNI:

You have two hours to finish this exam.
§ Calculators are necessary for this exam.
§ Always show your working and thought process throughout, so that even if you don't get the answer totally correct I can give you as many marks as possible.
§ It has been a pleasure teaching you all. Try your best and good luck!

Question 1
True or false for each of the following statements? Warning: in this question (and only this question) you will get minus points for wrong answers, so it is not worth guessing randomly. For this question you do not need to explain your answers.

a) Correlation values can only be between 0 and 1.
False. Correlation values are between -1 and 1.

b) The Student's t distribution gives a lower percentage of extreme events occurring compared to a normal distribution.
False. The Student's t distribution gives a higher percentage of extreme events.

c) The units of the standard deviation of X are the same as the units of X.
True.

d) For a normal distribution, roughly 95% of the data is contained within one standard deviation of the mean.
False. It is within two standard deviations of the mean.

e) For a left-skewed distribution, the mean is larger than the median.
False. That describes a right-skewed distribution; for a left-skewed distribution the mean is smaller than the median.

f) If an event A is independent of B, then P(A) = P(A|B) is always true.
True.

g) For a given confidence level, increasing the sample size of a survey should increase the margin of error.
False. Increasing the sample size should reduce the margin of error.

Question 2
I have five cards, and I label them 1, 2, 3, 4, 5. I pick cards out at random without replacing them.

a) What is the probability of picking an odd-numbered card on my first turn?
3/5.

b) I pick two cards out and deal them on the table.
What is the chance that if I add the cards together, they will add up to 3? Hint: think about the order!
The only two ways of doing this are 1 followed by 2 (1/5 × 1/4) and 2 followed by 1 (1/5 × 1/4). Therefore, the probability is 1/5 × 1/4 + 1/5 × 1/4 = 2/20 = 1/10.

c) I pick five cards out and deal them out on the table. What is the chance of picking the five cards out in ascending order (1, 2, 3, 4, 5)?
There is only one way to do this: 1/5 × 1/4 × 1/3 × 1/2 × 1/1 = 1/120.

Question 3
We do a study and find that the length of time people spend in Westside Market on Broadway is normally distributed with a mean of 21 minutes and a standard deviation of 6 minutes.

a) What percentage of people spend over half an hour in the market?
We can use z-scores: z = (30 − 21)/6 = 1.5. Consulting the z-score table, that is the 93.32nd percentile, so above this would be 6.68%.

b) Is it more likely that someone spends (i) under 12 minutes or (ii) over 28 minutes in the market?
For this we could calculate z-scores, but it is easier just to see which value is further from the mean, as the distribution is symmetric. 12 minutes is 9 minutes below the mean, while 28 minutes is only 7 minutes above it. So, over 28 minutes is more likely.

c) What percentage of people spend between 22 and 25 minutes in the market?
First, we find the percentage who spend below 22 minutes: z = (22 − 21)/6 = 0.17, which gives the 56.75th percentile. Then, we find the percentage who spend below 25 minutes: z = (25 − 21)/6 = 0.67, which is the 74.86th percentile. The difference, 74.86 − 56.75, gives 18.11%.

Question 4
In January 2017, the city of Emeryville, California, introduced a minimum wage increase from $9 to $13.50 per hour. The hope was that increasing the minimum wage would give people a 'living wage' and reduce the number of people who have to work multiple jobs to make ends meet.
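As an aside (not part of the exam): the counting answers in Question 2 rely on every ordered draw being equally likely, so they can be checked by brute-force enumeration. A short Python sketch:

```python
from fractions import Fraction
from itertools import permutations

cards = [1, 2, 3, 4, 5]

# (b) Ordered draws of two cards without replacement: 5 x 4 = 20 outcomes.
pairs = list(permutations(cards, 2))
p_sum_3 = Fraction(sum(a + b == 3 for a, b in pairs), len(pairs))

# (c) All 5! = 120 orderings of the deck; only one is ascending.
orders = list(permutations(cards))
p_ascending = Fraction(sum(o == (1, 2, 3, 4, 5) for o in orders), len(orders))

print(p_sum_3, p_ascending)  # → 1/10 1/120
```

Enumerating outcomes exactly (with Fraction rather than floats) reproduces 1/10 and 1/120 with no rounding.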
They ask you to investigate whether the new minimum wage has resulted in a significant decline in the percentage of those working multiple jobs. They know that, before the minimum wage increase, the percentage of people who were working multiple jobs was 7.1%. Eighteen months after the minimum wage increase, the city surveyed 1400 people and found that 83 were working multiple jobs. Use hypothesis testing to decide whether the increase in the minimum wage has resulted in fewer people having to work multiple jobs. You may decide upon a suitable confidence level, as long as it is clearly stated.

Null hypothesis: p = 0.071. Alternative hypothesis: p < 0.071.
n = 1400, p̂ = 83/1400 ≈ 0.059, SE = √(0.071 × 0.929 / 1400) = 0.0069.
Using a 95% confidence level, and a normal distribution with a mean of 0.071 and a standard error of 0.0069.
Using z-scores: z = (0.059 − 0.071)/0.0069 ≈ −1.74.
For a one-sided test the z-score needs to be smaller than −1.65. Since −1.74 is further from zero than −1.65, we can reject the null hypothesis. Therefore we accept the alternative hypothesis, at the 95% confidence level: the increase in the minimum wage has reduced the percentage of people working multiple jobs.

Question 5
Jill Stein, the Green Party candidate for president, won 1.01% of the popular vote in the 2016 general election. I randomly talk to 1000 Americans from across the country who voted in the election. I want to enlist a soccer team of Jill Stein voters, and for that I need 11 people. (Assume that every Jill Stein voter I ask says yes to being on my team.) What is the chance that I have enough players to fill my team? You will need to be careful with decimal places in this question.

p = 0.0101, SE = √(0.0101 × 0.9899 / 1000) = 0.00316, and we are asking for the probability of sampling this distribution and getting p̂ ≥ 11/1000 = 0.011.
We use z-scores: z = (0.011 − 0.0101)/0.00316 ≈ 0.28.
Consulting the z-score table, this is roughly the 61st percentile, so the chance of getting more than this is about 39%.

Question 6
We want to know whether CC2 did better than CC1 in the midterm. The mean for CC1 (24 students) was 36.5 and the standard deviation was 7.5. The mean for CC2 (24 students) was 38.3 and the standard deviation was 9.4. Using the formulae from the back of the exam and the appropriate table, find out whether the average mark of CC2 was significantly higher than that of CC1. You will need to propose a null and alternative hypothesis and then test them at a confidence level of your choosing to arrive at a conclusion.

Null hypothesis: μ1 = μ2. Alternative hypothesis: μ2 > μ1.
Pooled standard deviation: s_p = √(((24 − 1) × 56.25 + (24 − 1) × 88.36) / 46) = 8.50.
t* = |36.5 − 38.3| / (8.50 × √(1/24 + 1/24)) = 0.73. For 46 degrees of freedom, the one-sided 95% critical value from the t table is about 1.68. Since 0.73 < 1.68, we cannot reject the null hypothesis: there is no significant difference between the classes.

Question 7
We have two data series: yearly salary x and yearly rent cost for an apartment y. Both values are in thousands of dollars and are from New York City. We know the means and standard deviations: x̄ = 57.9, s_x = 20.1, ȳ = 21.0, s_y = 9.2, with correlation coefficient r = 0.75. We will then carry out linear regression analysis on these two data sets, where we aim to predict the y values from the x values.

a) What will the equation of the line of best fit (least-squares regression) be, in the form ŷ = b1·x + b0?
b1 = r × s_y / s_x = 0.75 × 9.2/20.1 = 0.343.
b0 = 21.0 − 57.9 × 0.343 = 1.14.
ŷ = 0.343x + 1.14.

b) What does b1 represent?
b1 represents the slope of the line: the amount rent goes up as income goes up.

c) What is the R² of the data? What does this value tell us about the data?
R² is 0.75² = 0.56, which says that 56% of the variation in rent can be explained by the linear relationship with income.

d) I want to make one final check that a linear relationship is really appropriate for the data. What is the best way to do that?
Make a scatter plot of the residuals. Check that there is no shape or change in spread; the residuals should look random.

e) I remove a very large outlier from the data. Will this have a large effect on the line of best fit?
This will have a large effect. Outliers can have an outsized effect on the calculated line of best fit because the square of the residual will be large.

Question 8
In January of this year, Pew Research surveyed 1,200 people on the issue of marijuana legalisation. They found that 61% of respondents favoured legalisation. This is up from 31% in 2000.

a) Come up with a 95% confidence interval for the true current opinion of the whole country. Explain the meaning of your results in words.
p̂ = 0.61, SE = √(0.61 × 0.39 / 1200) = 0.014.
Roughly, the confidence interval will be plus or minus two standard errors: [58.2%, 63.8%]. In words: we have 95% confidence that the true proportion in the population lies between 58.2% and 63.8%.
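As a quick numerical check (not part of the exam), the interval in Question 8a can be reproduced in a few lines of Python:

```python
import math

p_hat, n = 0.61, 1200
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
lo, hi = p_hat - 2 * se, p_hat + 2 * se   # "plus or minus two standard errors"
print(f"SE = {se:.3f}, 95% CI = [{lo:.1%}, {hi:.1%}]")
```

This matches the worked answer: SE ≈ 0.014 and an interval of roughly 58.2% to 63.8%. (Using the exact multiplier 1.96 instead of 2 narrows the interval only slightly.)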