
2018 Final Exam Academic Success Programme Columbia University Instructor: Mark England

Name: UNI:

You have two hours to finish this exam.
- Calculators are necessary for this exam.
- Always show your working and thought process throughout so that even if you don’t get the answer totally correct I can give you as many marks as possible.
- It has been a pleasure teaching you all. Try your best and good luck!


Question 1 True or False for the following statements? Warning: in this question (and only this question) you will get minus points for wrong answers, so it is not worth guessing randomly. For this question you do not need to explain your answer.

a) Correlation values can only be between 0 and 1.

False. Correlation values are between -1 and 1.

b) The Student’s t distribution gives a lower percentage of extreme events occurring compared to a normal distribution.

False. The Student’s t distribution gives a higher percentage of extreme events.

c) The units of the standard deviation of $x$ are the same as the units of $x$.

True.

d) For a normal distribution, roughly 95% of the data is contained within one standard deviation of the mean.

False. It is within two standard deviations of the mean.

e) For a left skewed distribution, the mean is larger than the median.

False. This is true of a right skewed distribution.

f) If an event $A$ is independent of $B$, then $P(A) = P(A|B)$ is always true.

True.

g) For a given confidence level, increasing the sample size of a survey should increase the margin of error.

False. Increasing the sample size should reduce the margin of error.

Question 2 I have five cards, and I label them 1, 2, 3, 4, 5. I pick cards out at random without replacing them. a) What is the probability of picking an odd numbered card on my first turn?


3/5

b) I pick two cards out and deal them on the table. What is the chance that if I add the cards together, they will add up to 3? Hint: think about the order!

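The only two ways of doing this are 1 followed by 2 ($\frac{1}{5} \times \frac{1}{4}$) and 2 followed by 1 ($\frac{1}{5} \times \frac{1}{4}$). Therefore, it would be $\frac{1}{5} \times \frac{1}{4} + \frac{1}{5} \times \frac{1}{4} = \frac{2}{20} = \frac{1}{10}$.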

c) I pick five cards out and deal them out on the table. What is the chance of picking the five cards out in ascending order (1, 2, 3, 4, 5)?

There is only one way to do this: $\frac{1}{5} \times \frac{1}{4} \times \frac{1}{3} \times \frac{1}{2} \times \frac{1}{1} = \frac{1}{120}$.
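As a quick check on the three answers, here is a minimal sketch that simply lists every possible ordered draw and counts the favourable ones. The variable names are my own and not part of the exam.

```python
from fractions import Fraction
from itertools import permutations

cards = [1, 2, 3, 4, 5]

# a) Odd card on the first pick.
p_odd_first = Fraction(sum(1 for c in cards if c % 2 == 1), len(cards))

# b) Ordered pairs drawn without replacement whose sum is 3.
pairs = list(permutations(cards, 2))
p_sum_three = Fraction(sum(1 for a, b in pairs if a + b == 3), len(pairs))

# c) All five cards dealt in ascending order.
orders = list(permutations(cards))
p_ascending = Fraction(sum(1 for o in orders if list(o) == sorted(o)), len(orders))

print(p_odd_first, p_sum_three, p_ascending)  # 3/5 1/10 1/120
```

Counting ordered draws is what the hint in part b) is getting at: (1, 2) and (2, 1) are two different outcomes out of 20.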

Question 3 We can do a study and find that the length of time people spend in Westside Market on Broadway is normally distributed with a mean of 21 minutes and a standard deviation of 6 minutes. a) What percentage of people spend over half an hour in the market?

We can utilise z scores: $z = \frac{30 - 21}{6} = 1.5$. Consulting the z score table, that would be the 93.32 percentile, so above this would be 6.68%.

b) Is it more likely that someone spends (i) under 12 minutes or (ii) over 28 in the market?

For this we could calculate z scores, but it is easier just to see which is further from the mean, as the distribution is symmetric. 12 minutes is 9 minutes below the mean and 28 minutes is only 7 minutes above the mean. The tail further from the mean is the smaller one, so it is more likely that someone spends over 28 minutes.

c) What percentage of people spend between 22 and 25 minutes in the market?

First, we find how many spent below 22 minutes: $z = \frac{22 - 21}{6} = 0.17$, which gives the 56.75 percentile. Then, we find how many spent below 25 minutes: $z = \frac{25 - 21}{6} = 0.67$, which is the 74.86 percentile. Therefore we can find the difference between them, 74.86 − 56.75, which gives 18.11%.
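If you have Python to hand instead of the z score table, the same three parts can be sketched with `statistics.NormalDist` from the standard library. This is only a check, not the method expected in the exam, and the results differ slightly from the table values because z is not rounded to two decimal places first.

```python
from statistics import NormalDist

# Time in the market: mean 21 minutes, standard deviation 6 minutes.
shop_time = NormalDist(mu=21, sigma=6)

# a) Share of people spending over half an hour.
over_30 = 1 - shop_time.cdf(30)

# b) Compare the two tails.
under_12 = shop_time.cdf(12)
over_28 = 1 - shop_time.cdf(28)

# c) Share of people spending between 22 and 25 minutes.
between_22_25 = shop_time.cdf(25) - shop_time.cdf(22)

print(f"over 30: {over_30:.2%}")
print(f"under 12: {under_12:.2%}, over 28: {over_28:.2%}")
print(f"between 22 and 25: {between_22_25:.2%}")
```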


Question 4

In January 2017, the city of Emeryville, California, introduced a minimum wage increase from $9 to $13.50 per hour. The hope was that increasing the minimum wage would give people a ‘living wage’ and reduce the number of people who have to work multiple jobs to make ends meet. They ask you to investigate whether the new minimum wage has resulted in a significant decline in the percentage of those working multiple jobs.

They know that, before the minimum wage increase, the percentage of people who were working multiple jobs was 7.1%. 18 months after the minimum wage increase, the city surveyed 1400 people and found that 83 were working multiple jobs. Use hypothesis testing to decide whether the increase in the minimum wage has resulted in fewer people having to work multiple jobs. You may decide upon a suitable confidence level which you wish to use, as long as it is clearly stated.

Null hypothesis: $p = 0.071$. Alternate hypothesis: $p < 0.071$.

$n = 1400$, $\hat{p} = \frac{83}{1400} = 0.059$, $SE = \sqrt{\frac{0.071 \times 0.929}{1400}} = 0.0069$.

Using a 95% confidence level, and a normal distribution with a mean of 0.071 and a standard deviation of 0.0069, the z score is $z = \frac{0.059 - 0.071}{0.0069} \approx -1.7$. If we are using a one-sided test, the z score needs to be smaller than -1.65. -1.7 is further away from zero than -1.65, so we can reject the null hypothesis. Therefore we can accept the alternate hypothesis, at the 95% confidence level, and say that the increase of the minimum wage has reduced the percentage of people working multiple jobs.
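The test can be sketched in a few lines of Python, assuming the usual one-sided z test for a proportion with the standard error computed under the null hypothesis. The variable names are mine. Without intermediate rounding the z score comes out at roughly -1.7, which is still beyond the -1.65 cut-off.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.071          # proportion before the wage increase (null hypothesis)
n = 1400            # people surveyed
p_hat = 83 / n      # observed proportion working multiple jobs

# Standard error of the sample proportion if the null hypothesis is true.
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# One-sided test at the 95% level: reject if z falls below the 5th percentile.
critical = NormalDist().inv_cdf(0.05)
p_value = NormalDist().cdf(z)
reject_null = z < critical

print(f"SE = {se:.4f}, z = {z:.2f}, critical = {critical:.2f}")
print(f"p-value = {p_value:.3f}, reject null: {reject_null}")
```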

Question 5

Jill Stein, the green party candidate for president, won 1.01% of the popular vote in the 2016 general election. I randomly talk to 1000 Americans from across the country who voted in the election. I want to enlist a soccer team of Jill Stein voters, and for that I need 11 people. (Assume that every Jill Stein voter I ask says yes to being on my team). What is the chance that I have enough players to fill my team? You will need to be careful with decimal places in this question.


$p = 0.0101$, $SE = \sqrt{\frac{0.0101 \times 0.9899}{1000}} = 0.00316$, and we are asking what the probability is of this distribution giving $\hat{p} \geq \frac{11}{1000} = 0.011$. We use z scores: $z = \frac{0.011 - 0.0101}{0.00316} = 0.28$. Consulting the z score table, this is the 61st percentile, so the chance of getting greater than this is 39%.
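The decimal places are easy to lose here, so a short Python sketch of the same normal approximation may help as a check. The variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

p = 0.0101       # Jill Stein's share of the popular vote
n = 1000         # voters I talk to
needed = 11      # players needed for the team

# Standard error of the sample proportion.
se = sqrt(p * (1 - p) / n)

# z score of the smallest sample proportion that fills the team.
z = (needed / n - p) / se
p_enough = 1 - NormalDist().cdf(z)

print(f"SE = {se:.5f}, z = {z:.2f}, chance of a full team = {p_enough:.0%}")
```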

Question 6

We want to know whether CC2 did better than CC1 in the midterm. The mean for CC1 (24 students) was 36.5 and the standard deviation was 7.5. The mean for CC2 (24 students) was 38.3 and the standard deviation was 9.4.

Using the formulae from the back of the exam and the appropriate table, find out whether the mark of CC2 was significantly higher than CC1. You will need to propose a null and an alternate hypothesis and then test them at a confidence level of your choosing to arrive at a conclusion.

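Null hypothesis: $\mu_{CC2} = \mu_{CC1}$. Alternate hypothesis: $\mu_{CC2} > \mu_{CC1}$.

$s_p = \sqrt{\frac{(24 - 1) \times 56.25 + (24 - 1) \times 88.36}{46}} = 8.50$

$t^* = \frac{|36.5 - 38.3|}{8.50 \sqrt{\frac{1}{24} + \frac{1}{24}}} = 0.73$. For a df of 46, 95% confidence from the t table gives 2.01. 0.73 < 2.01, so we cannot reject the null hypothesis. We have no evidence that CC2 did significantly better than CC1.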
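The arithmetic can be checked with a short sketch. It assumes the standard pooled standard deviation formula for comparing two means, which matches the numbers used in the answer; the variable names are mine, and the critical value 2.01 is taken from the answer and not computed.

```python
from math import sqrt

n1, mean1, sd1 = 24, 36.5, 7.5   # CC1
n2, mean2, sd2 = 24, 38.3, 9.4   # CC2

# Degrees of freedom and pooled standard deviation.
df = n1 + n2 - 2
pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)

# t statistic for the difference between the two means.
t_stat = abs(mean2 - mean1) / (pooled_sd * sqrt(1 / n1 + 1 / n2))

t_critical = 2.01  # value read from the t table in the answer above
reject_null = t_stat > t_critical

print(f"df = {df}, pooled SD = {pooled_sd:.2f}, t = {t_stat:.2f}")
print(f"reject null: {reject_null}")
```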


Question 7 We have two data series, yearly salary $x$, and yearly rent cost for apartment $y$. Both values are in thousands of dollars and are from New York City. We know that the means and standard deviations are ($\bar{x} = 57.9$, $s_x = 20.1$, $\bar{y} = 21.0$, $s_y = 9.2$) with $r = 0.75$. We will then carry out analysis on these two data sets where we aim to predict the y values from the x values.

a) What will the equation of the line of best fit (linear regression) be in the form $y = mx + c$?

$m = r \frac{s_y}{s_x} = 0.75 \times \frac{9.2}{20.1} = 0.343$
$c = 21.0 - 57.9 \times 0.343 = 1.14$
$y = 0.343x + 1.14$

b) What does $m$ represent?

$m$ represents the slope of the line: the amount rent goes up (in thousands of dollars) for each extra thousand dollars of income.

c) What is the $r^2$ of the data? What does this value tell us about the data?

$r^2 = 0.75^2 = 0.56$, which says that 56% of the variation in rent can be explained by the linear relationship with income.

d) I want to make one final check that a linear relationship is really appropriate for the data. What is the best way to do that?

Make a plot of the residuals. Check there is no shape or change in spread. The residuals should be random.

e) I remove a very large outlier from the data. Will this have a large effect on the line of best fit?

This will have a large effect. Outliers can have an outsized effect on calculating the line of best fit because the square of the residual will be large.
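Parts a) and c) can be checked with a small sketch built only from the summary statistics given in the question. The names `slope`, `intercept` and `predict_rent` are my own labels. Note that keeping the slope unrounded gives an intercept of about 1.12, whereas rounding the slope to 0.343 first gives 1.14.

```python
x_mean, x_sd = 57.9, 20.1   # yearly salary, thousands of dollars
y_mean, y_sd = 21.0, 9.2    # yearly rent, thousands of dollars
r = 0.75                    # correlation between salary and rent

# The line of best fit passes through (x_mean, y_mean) with slope r * s_y / s_x.
slope = r * y_sd / x_sd
intercept = y_mean - slope * x_mean
r_squared = r**2


def predict_rent(salary):
    """Predicted yearly rent (thousands of dollars) for a given salary."""
    return slope * salary + intercept


print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, r^2 = {r_squared:.2f}")
print(f"predicted rent on an 80k salary: {predict_rent(80):.1f}k")
```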


Question 8 In January of this year, Pew Research surveyed 1,200 people on the issue of marijuana legalisation. They found that 61% of respondents favoured legalisation. This is up from 31% in 2000. a) Come up with 95% confidence intervals for the true current opinion of the whole country. Explain the meaning of your results in words.

$\hat{p} = 0.61$, $SE = \sqrt{\frac{0.61 \times 0.39}{1200}} = 0.014$. Roughly, the 95% confidence interval will be plus or minus two standard deviations: between [58.2%, 63.8%]. In words: we have 95% confidence that the true opinion of the population lies between 58.2% and 63.8%.
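A minimal Python sketch of the interval, using the same "plus or minus two standard deviations" rule of thumb as the answer. Variable names are mine.

```python
from math import sqrt

p_hat = 0.61   # share of respondents in favour
n = 1200       # people surveyed

# Standard error of the sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

# Roughly 95% of sample proportions fall within two standard errors.
lower = p_hat - 2 * se
upper = p_hat + 2 * se

print(f"SE = {se:.3f}, 95% CI = [{lower:.1%}, {upper:.1%}]")
```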

b) Name two things which Pew Research must do so that their sample is as random and representative as possible. One example might be: Make sure you are not surveying people only from one state. You should try to contact people from every state if possible.

There are many possible options.
- Try calling at different times, not just during the work day, else you will oversample the old and unemployed.
- Try to sample different races proportionally to the population.
- Try to sample different ages proportionally to the population.
- Do not ask leading questions.
- Try to sample the same number of men and women.
- Try to sample similar numbers of Democrats and Republicans.

Question 9

Here are the heights of the last 11 presidents in centimetres: 191, 185, 182, 188, 188, 185, 177, 182, 192, 185, 179.

Calculate the mean, standard deviation, median and interquartile range. You may find a table helpful for calculating the standard deviation.

Mean is 184.9 cm. Median is 185 cm. Standard deviation is 4.70 cm. IQR is 6 cm.
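These four values can be checked with the standard library. The sketch below assumes the sample standard deviation (dividing by n − 1) and finds the quartiles as the medians of the lower and upper halves of the sorted data; other quartile conventions can give slightly different answers on other data sets.

```python
from statistics import mean, median, stdev

heights_cm = [191, 185, 182, 188, 188, 185, 177, 182, 192, 185, 179]

ordered = sorted(heights_cm)
half = len(ordered) // 2
# For an odd number of values, leave the median itself out of both halves.
lower_half = ordered[:half]
upper_half = ordered[half + len(ordered) % 2:]

iqr = median(upper_half) - median(lower_half)

print(round(mean(heights_cm), 1))   # 184.9
print(median(heights_cm))           # 185
print(round(stdev(heights_cm), 2))  # 4.7
print(iqr)                          # 6
```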


Question 10 You pick a card out of a normal 52-card deck n times and record the card you pick, replacing it after each time (so there will always be 52 cards when you pick). You will have the choice to play 10 times or 100 times.

a) I would give you $50 if you pick out a club (♣) over 30% of the time. Would you play 10 or 100 times, and why?

The long-term percentage of clubs will approach 25% (as you increase the number of picks this number will fluctuate but gradually home in on one quarter). Therefore, we should play 10 times: 25% is under 30%, and the more we play the more likely we are to end up close to 25%. A few picks can have a large impact when you play ten times, but these will average out for larger numbers.

b) I would give you $50 if you pick out a club (♣) over 20% of the time. Would you play 10 or 100 times, and why?

As said previously, the long-term percentage of clubs will approach 25%, so the correct answer now is to play 100 times. This is because 25% is over 20%, and so the longer you play the more likely the random fluctuations are to average out. The standard deviation of the distribution for playing 100 times is smaller than when you play 10 times. The means will both be the same.
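The reasoning in both parts can be illustrated with exact binomial probabilities. This is a sketch that goes beyond what the exam asks for: it assumes each pick is a club with probability 0.25 and that "over" means strictly more than the stated percentage. The function name is my own.

```python
from math import comb


def prob_share_over(n_picks, percent, p_club=0.25):
    """Exact binomial probability that clubs make up strictly more than
    `percent` per cent of `n_picks` draws with replacement."""
    return sum(
        comb(n_picks, k) * p_club**k * (1 - p_club) ** (n_picks - k)
        for k in range(n_picks + 1)
        if 100 * k > percent * n_picks  # integer comparison avoids rounding issues
    )


for percent in (30, 20):
    for n_picks in (10, 100):
        chance = prob_share_over(n_picks, percent)
        print(f"over {percent}% in {n_picks} picks: {chance:.3f}")
```

For the 30% target the chance is higher with 10 picks, and for the 20% target it is higher with 100 picks, in line with the answers above.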

Question 11 The Gini coefficient is a measure of wealth inequality. A measure of 0 indicates complete equality (everyone earns the same) and a measure of 1 indicates complete inequality (one person has all the wealth in society). Below are box and whisker plots for the states (and the District of Columbia) contained in the four different regions of the USA: the South, the Northeast, the West and the Midwest (although they are not in that order in the figure). The mean is represented by the x.


[Figure: box and whisker plots A, B, C and D. Vertical axis: Gini coefficient, from 0.38 to 0.56.]

Match the individual box and whisker diagram (A to D) to the region.

The Midwest has the most right skewed distribution of the four.

C is most right skewed

The South has the lowest interquartile range of the four regions.

D has lowest IQR

The Northeast contains the District of Columbia which is more unequal than any other state.

A contains highest point

The West contains Alaska which is the most equal state in terms of income distribution.


B contains lowest point

Question 12 The following data shows the percentage of the vote Hillary Clinton got in each of the 50 states (and the District of Columbia) in the 2016 election.

0 |
1 |
2 | 3 7 8 8 8 9
3 | 2 3 4 4 5 5 6 6 8 8 8 8
4 | 0 1 2 3 4 5 6 7 7 7 7 7 8 8 8 8 8 8
5 | 0 2 3 5 5 5 5 5 9
6 | 1 1 1 2 2
7 |
8 |
9 | 3

a) What is this type of diagram called?

b) The outlier is the District of Columbia with 93%. Why is it possible that this can be such a massive outlier and be so one-sided compared to the other states?

c) After removing the outlier, describe this distribution: is it unimodal, bimodal, uniform, skewed, symmetric etc.?

d) Give two reasons why it is not possible to tell from this diagram whether Hillary won or lost the election. If I were to read this naively, I would say that the majority of her numbers were below 50%, so you could easily predict that she lost.

a) Stem and leaf diagram.

b) The District of Columbia can be so one-sided because it has such a small population, so extreme results are much more likely than with larger populations, where things tend to even out. It is true that this is where the Washington elite lives (who dislike Trump) and so voted overwhelmingly for Hillary, but this would not be possible if the population were larger.

c) It seems to be unimodal and symmetric, with the median somewhere around 46.

d) The first reason is the electoral college. You cannot tell from this graph which states got which percentage points. Not all states are equally sized, so not all data points are worth the same. New York and California are much more important than Hawaii and Idaho. Secondly, you don’t need to get over 50 percent to win because there are more than two candidates. You can win states with under 47% of the vote. Hillary got 47.3% in Michigan and lost, and got 46.9% in Minnesota and won, so there is not 100% correlation between doing well in percentage points and winning the state (the third party candidates are important).

Question 13 If an aeroplane is loaded with too much weight, its fuel efficiency is reduced and it may not have enough fuel to get to its destination. The computer system which normally stores the weight of the luggage is broken and we don’t know the total weight of luggage on board our fully booked flight. Regulations state that if the average weight of luggage exceeds 23kg, we must reduce the weight and take some luggage off to put on the next flight.

I get my staff to weigh 60 pieces of luggage randomly. The average weight of these 60 pieces of luggage is 22.1kg and the standard deviation of this sample is 2.5kg. Create confidence intervals for the true average of the weight of luggage on this flight. 95% confidence intervals will be sufficient.

$\bar{x} = 22.1$, $s = 2.5$ and $n = 60$. The standard error is $SE = \frac{s}{\sqrt{n}} = \frac{2.5}{\sqrt{60}} = 0.32$. The 95% confidence interval is approximately [21.5, 22.7].

I need to find out whether we are going to exceed the 23kg limit. State your conclusion and explain your reasoning.

The 95% confidence intervals do not cross 23kg, so we can say with 95% confidence that the weight limit will not be exceeded.
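A short sketch of the interval and the comparison with the limit, again using plus or minus two standard errors as the rough 95% rule. Variable names are my own.

```python
from math import sqrt

n = 60               # pieces of luggage weighed
sample_mean = 22.1   # kg
sample_sd = 2.5      # kg
limit = 23           # regulation limit for the average weight, kg

# Standard error of the sample mean.
se = sample_sd / sqrt(n)

# Approximate 95% confidence interval: mean plus or minus two standard errors.
lower = sample_mean - 2 * se
upper = sample_mean + 2 * se
limit_inside_interval = lower <= limit <= upper

print(f"SE = {se:.2f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
print(f"limit inside interval: {limit_inside_interval}")
```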


Question 14 The table below shows people’s favourite takeaway food based on their age

            Age<35   35<Age<60   Age>60   Total
Pizza           33          13        5      51
Indian           8          10        2      20
Burgers          7          28       59      94
Mexican         15          14       10      39
Chinese         10          32       22      64
Total           73          97       98     268

a) What percentage of all people like Pizza?
b) Which age group prefers Chinese food more: young or old people?
c) What is the probability of picking a person randomly from this group who is middle aged AND favours Indian food?
d) What is the probability of picking a person randomly from this group who is old OR likes Mexican food?
e) Based on this survey, what is the probability of being a middle aged person GIVEN that you like burgers?

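The total number is 268.
a) $\frac{51}{268} = 19\%$
b) Young: $\frac{10}{73} = 14\%$. Old: $\frac{22}{98} = 22\%$.
c) $\frac{10}{268} = 3.7\%$
d) Using the formula: $\frac{98}{268} + \frac{39}{268} - \frac{10}{268} = \frac{127}{268} = 47\%$
e) Given that I like burgers: $\frac{28}{94} = 30\%$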
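All five answers come from reading the right cell and the right total, which can be sketched in Python as below. I am assuming the three age columns are under 35, 35 to 60 and over 60, and the constant names `YOUNG`, `MIDDLE` and `OLD` are my own labels for them.

```python
# Rows: favourite food. Columns: (under 35, 35 to 60, over 60).
counts = {
    "Pizza": (33, 13, 5),
    "Indian": (8, 10, 2),
    "Burgers": (7, 28, 59),
    "Mexican": (15, 14, 10),
    "Chinese": (10, 32, 22),
}
YOUNG, MIDDLE, OLD = 0, 1, 2

total = sum(sum(row) for row in counts.values())
column_totals = [sum(row[i] for row in counts.values()) for i in range(3)]

# a) Share of everyone who likes pizza.
pizza_share = sum(counts["Pizza"]) / total
# b) Share of each age group that likes Chinese food.
chinese_young = counts["Chinese"][YOUNG] / column_totals[YOUNG]
chinese_old = counts["Chinese"][OLD] / column_totals[OLD]
# c) Middle aged AND Indian: a single cell over the grand total.
middle_and_indian = counts["Indian"][MIDDLE] / total
# d) Old OR Mexican: add both totals, subtract the overlap counted twice.
old_or_mexican = (
    column_totals[OLD] + sum(counts["Mexican"]) - counts["Mexican"][OLD]
) / total
# e) Middle aged GIVEN burgers: restrict to the burgers row.
middle_given_burgers = counts["Burgers"][MIDDLE] / sum(counts["Burgers"])

for name, value in [
    ("a", pizza_share), ("b young", chinese_young), ("b old", chinese_old),
    ("c", middle_and_indian), ("d", old_or_mexican), ("e", middle_given_burgers),
]:
    print(f"{name}: {value:.1%}")
```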


Standardised normal distribution (for the right hand side of the distribution)


Student’s t distribution table


Formulae for comparing two means using the Student’s t distribution
