
STAT 200 Guided Exercise 2 Answers

Key Topics
• Measures of Central Tendency
• Stem & Leaf Plot and describing distributions
• Measures of Variability

For On-Line Students, be sure to:
• Submit your answers in a Word file to Sakai at the same place you downloaded the file.
• Remember you can paste any Excel or JMP output into a Word file (use Paste Special for best results).
• Put your name and the Assignment # on the file name, e.g. Ilvento Guided2.doc.

Answer as completely as you can and show your work. Then upload the file via Sakai to get credit.

1. Let’s finish up the Academy Award winners for best actor (and actress) since 1996 that were given in Assignment 1, now that we have command of both central tendency and variability. Each year the Academy of the Screen Guild gives an award for the best actor and actress in a motion picture. We have recorded the name and age of each winner since 1996. The data for males and females are given below (the sample size is n = 20 for each group). The sum of their ages and the sum of their ages squared are also given.

YEAR  ACTOR                 AGE  ACTRESS             AGE
1996  (not listed)           45  Frances McDormand    39
1997  (not listed)           60  Helen Hunt           34
1998  (not listed)           46  Gwyneth Paltrow      26
1999  (not listed)           40  Hilary Swank         25
2000  (not listed)           36  Julia Roberts        33
2001  (not listed)           47  Halle Berry          35
2002  (not listed)           29  Nicole Kidman        35
2003  (not listed)           43  Charlize Theron      28
2004  (not listed)           37  Hilary Swank         30
2005  (not listed)           38  Reese Witherspoon    29
2006  Forest Whitaker        45  Helen Mirren         61
2007  Daniel Day-Lewis       50  Marion Cotillard     32
2008  Sean Penn              48  Kate Winslet         33
2009  (not listed)           60  Sandra Bullock       45
2010  (not listed)           50  Natalie Portman      29
2011  (not listed)           39  Meryl Streep         62
2012  Daniel Day-Lewis       55  Jennifer Lawrence    22
2013  Matthew McConaughey    44  Cate Blanchett       44
2014  Eddie Redmayne         32  Julianne Moore       54
2015  Leonardo DiCaprio      41  Brie Larson          26

               Actors  Actresses
Sum X             885        722
Sum X-squared  40,465     28,598


a. Here is the Stem and Leaf plot for each group to compare the distributions.

Stem and Leaf Plot of Actors Winning Academy Award Since 1996

Males                        Females
Stem  Leaf                   Stem  Leaf
2     9                      2     2 5 6 6 8 9 9
3     2 6 7 8 9              3     0 2 3 3 4 5 5 9
4     0 1 3 4 5 5 6 7 8      4     4 5
5     0 0 5                  5     4
6     0 0                    6     1 2

Key: 6|0 represents 60
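The stem and leaf plots above can also be built by machine. The following is a minimal sketch, assuming the ages are typed in from the data table; the function name `stem_and_leaf` is made up for this illustration and is not part of Excel, JMP, or any course software.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem) and list the units digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 45 -> stem 4, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

# Ages copied from the data table for question 1
actor_ages = [45, 60, 46, 40, 36, 47, 29, 43, 37, 38,
              45, 50, 48, 60, 50, 39, 55, 44, 32, 41]
actress_ages = [39, 34, 26, 25, 33, 35, 35, 28, 30, 29,
                61, 32, 33, 45, 29, 62, 22, 44, 54, 26]

for label, ages in (("Males", actor_ages), ("Females", actress_ages)):
    print(label)
    for stem, leaves in stem_and_leaf(ages).items():
        print(f"  {stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted before they are split, the leaves in each row come out in ascending order, which is what makes the median easy to read off the plot.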

b. Calculate the measures of central tendency and variability for each group. The sum of X and the sum of X-squared for each group are given above.

Mean
  Males:   885/20 = 44.25
  Females: 722/20 = 36.10

Median
  Males:   The 10th observation in the ordered data is 44 and the 11th is 45. The average of the two is 44.5.
  Females: The 10th observation in the ordered data is 33 and the 11th is also 33. The average of the two is 33.

Mode
  Males:   Not a unique mode
  Females: Not a unique mode

Range
  Males:   60 - 29 = 31
  Females: 62 - 22 = 40

Variance
  Males:   [40,465 - (885)^2/20]/(20-1) = [40,465 - 39,161.25]/19 = 1303.75/19 = 68.62
  Females: [28,598 - (722)^2/20]/(20-1) = [28,598 - 26,064.20]/19 = 2533.80/19 = 133.36

Standard Deviation
  Males:   SQRT(68.62) = 8.28
  Females: SQRT(133.36) = 11.55

Coefficient of Variation
  Males:   CV = 8.28/44.25 * 100 = 18.72%
  Females: CV = 11.55/36.10 * 100 = 31.99%
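The hand calculations above all start from n, Sum X, and Sum X-squared. Here is a short sketch of that shortcut formula, in case you want to check the arithmetic; the helper name `summary_from_sums` is an assumption made for this example.

```python
import math

def summary_from_sums(n, sum_x, sum_x2):
    """Mean, sample variance, standard deviation, and CV from n, Sum X, and Sum X-squared."""
    mean = sum_x / n
    # Shortcut formula: s^2 = [Sum(x^2) - (Sum(x))^2 / n] / (n - 1)
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    sd = math.sqrt(variance)
    cv = sd / abs(mean) * 100  # expressed as a percentage
    return mean, variance, sd, cv

males = summary_from_sums(20, 885, 40465)
females = summary_from_sums(20, 722, 28598)

for label, (mean, variance, sd, cv) in (("Males", males), ("Females", females)):
    print(f"{label}: mean={mean:.2f} variance={variance:.2f} sd={sd:.2f} CV={cv:.2f}%")
```

The code carries full precision from step to step and rounds only when printing, so the last digit can occasionally differ from a hand calculation that rounds at each step.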

c. Briefly compare the two distributions with an emphasis on the measures of Central Tendency and Variability.

For males, the distribution is symmetric and centered around the mean of 44.25. There are no obvious outliers. The median is very close to the mean at 44.50. The values vary from 29 to 60 for a range of 31 years. The standard deviation is 8.28 years, which is relatively small compared with the mean (CV = 18.72%).

For females, the mean is lower at 36.10, which is higher than the median of 33. The distribution for females is influenced by two larger outliers at 61 and 62, which pulled the mean up. Otherwise the spread for females is centered in the mid 20s to mid 30s. The range is larger for females compared with that for males (62-22 = 40), as is the standard deviation (11.55 for females). The higher standard deviation is also a reflection of the outliers. The CV for females is much higher than that of males at 31.99%.

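d. For both men and women there are a few outliers. For men there are two individuals with a value of 60. For women there is one winner aged 61 and another aged 62. Calculate z-scores for these values and interpret their meaning.

Zm = (60-44.25)/8.28 = 1.90

Zf1 = (62-36.10)/11.55 = 2.24

Zf2 = (61-36.10)/11.55 = 2.16

Each z-score tells us how many standard deviations the value lies above the mean of its group. All three are roughly two standard deviations above the mean, so these ages are unusually high for their groups, and the two female values are somewhat more extreme than the male value.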
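As a quick check on the z-scores, here is a small sketch that plugs in the rounded means and standard deviations from part b. The function name `z_score` is chosen for this illustration only.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Rounded means and standard deviations from part b
z_m = z_score(60, 44.25, 8.28)     # the two male winners aged 60
z_f1 = z_score(62, 36.10, 11.55)   # female winner aged 62
z_f2 = z_score(61, 36.10, 11.55)   # female winner aged 61

for label, z in (("Zm", z_m), ("Zf1", z_f1), ("Zf2", z_f2)):
    print(f"{label} = {z:.2f}")
```

A positive z-score means the value is above the mean; a value exactly at the mean would have z = 0.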

e. Suppose we wanted to remove the two female outliers from the data. Calculate the new mean for women winners for the remaining 18 winners. Hint: subtract the values from the old sum and divide by 18. Did the outliers influence the mean age much?

(722-62-61) = 599

599/18 = 33.28

The mean for females decreased from 36.10 to 33.28 by removing the two outliers. This is a 7.8% decrease.
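The same adjustment can be sketched in a few lines, working only from the old sum and sample size so the full data set does not need to be retyped. The helper name `mean_without` is an assumption for this example.

```python
def mean_without(total, n, removed):
    """Mean after dropping the listed values from a sample with known sum and size."""
    new_total = total - sum(removed)
    new_n = n - len(removed)
    return new_total / new_n

old_mean = 722 / 20
new_mean = mean_without(722, 20, [62, 61])
pct_change = (new_mean - old_mean) / old_mean * 100

print(f"old mean = {old_mean:.2f}, new mean = {new_mean:.2f}, change = {pct_change:.1f}%")
```

Note that both the sum and the sample size change: two values are removed, so the divisor drops from 20 to 18.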

2. The following is some data from The Daily Beast on the 50 Most Stressful Universities in 2010. We are looking at the Acceptance rate for these 50 universities. The Acceptance rate is the percentage of applicants who were admitted. The Histogram and the Stem and Leaf Plot for this data are given below (note the Stem and Leaf Plot rounds the numbers to a whole number). Use the stem and leaf values for some calculations, such as the min and max. For other calculations, the Sum of (x) is 1574.70 and the Sum of (x^2) is 62204.53. The Median for this data is 26.85.

a. Calculate the:

Mean = 31.49
Median = 26.85
Mode = 22
Maximum = 73
Minimum = 8
Range = 65
Variance = 257.37
Standard Deviation = 16.04
Coefficient of Variation = 50.94%
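The values in part a that depend on the sums can be reproduced as follows. This is only a sketch under the assumption that n = 50 and that the two sums given in the question are exact; the variable names are made up for the example, and the max and min are the rounded values read from the Stem and Leaf Plot.

```python
import math

n = 50
sum_x = 1574.70
sum_x2 = 62204.53
maximum, minimum = 73, 8  # read from the Stem and Leaf Plot (rounded values)

mean = sum_x / n
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)  # shortcut formula for the sample variance
sd = math.sqrt(variance)
cv = sd / abs(mean) * 100
data_range = maximum - minimum

print(f"mean={mean:.2f} range={data_range} variance={variance:.2f} sd={sd:.2f} CV={cv:.2f}%")
```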


b. What is the position of the median value for this data?

Since n=50, the median lies between the 25th and 26th positions. We would take the average of these two values.

c. Does the mode make sense as a measure of Central Tendency for this data?

Based on the Stem and Leaf Plot, the mode is 22%. This is a measure of center for one bunching of the data, but there is much more spread and there are other groupings of the data.

d. Calculate a z-score for an acceptance rate of 61%.

z = (61-31.49)/16.04 = 1.84. This value is 1.84 standard deviations above the mean.

e. Based on what you know about the different criteria used by different universities to judge students for admittance, why do you think this distribution looks the way it does? Think about the spread of the data and the measures of spread for the data, such as the range and standard deviation. Does the spread seem large? Hint: Harvard has the lowest acceptance rate at 7.9%. The Pennsylvania State University has an acceptance rate of 51.2%.

The spread is very large. The CV is 50.94%. It might reflect differences between public and private institutions. Private institutions generally have lower acceptance rates. Public schools may have as part of their mission to have higher rates of acceptance to provide educational opportunities to citizens in the state. Even for the most stressful universities, generally thought to be the most rigorous, the acceptance rate for public institutions should be higher. We could think of this data as coming from two populations.

The Box Plots show a difference between public and private universities. There is still a lot of spread for each type of university - some private universities have high acceptance rates and some public universities have low acceptance rates. But we can see two distinct groups.


3. Answer the following questions about variability of data sets:

a. How would you describe the variance and standard deviation in words, rather than a formula? Think of what you are calculating and how it might be useful in describing a variable.

The variance is the average squared deviation around the center (in this case the center is the mean). The standard deviation is the square root of the variance, which puts the measure back into the original units of the variable. It can be read as roughly the typical deviation of a value from the mean.

b. What is the primary advantage of using the inter-quartile range compared with the range when describing the variability of a variable?

The range uses only two values - the maximum and the minimum - so it can be very sensitive to outliers. The inter-quartile range shows the range of the middle 50% of the values, so extreme values at either end have little or no effect on it.
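To see the difference, here is a small sketch using the actress ages from question 1. The age of 95 is an invented value used only to exaggerate the largest observation, and the helper names are assumptions for this example. Quartiles come from Python's `statistics.quantiles`, whose default method may give slightly different quartiles than Excel or JMP.

```python
import statistics

def value_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

def iqr(data):
    """Inter-quartile range = Q3 - Q1 (Python's default 'exclusive' quartile method)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

actress_ages = [39, 34, 26, 25, 33, 35, 35, 28, 30, 29,
                61, 32, 33, 45, 29, 62, 22, 44, 54, 26]

# Hypothetical: replace the oldest winner (62) with an invented age of 95
with_extreme = [95 if age == 62 else age for age in actress_ages]

print("range:", value_range(actress_ages), "->", value_range(with_extreme))
print("IQR:  ", iqr(actress_ages), "->", iqr(with_extreme))
```

Stretching the maximum changes the range but leaves the inter-quartile range untouched, because the quartiles depend only on the values in the middle of the ordered data.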

c. Can the standard deviation ever be larger than the variance? Explain.

In most cases the standard deviation is less than the variance since it is the square root of the variance. However, in the special case where the variance is between 0 and 1, the standard deviation will be more than the variance. For example, if s^2 = 0.5, then s = 0.71. (The two are equal when the variance is exactly 0 or 1.)
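A few illustrative variances make the pattern easy to see. The values below are chosen only for demonstration, and the function name `compare_sd_to_variance` is made up for this sketch.

```python
import math

def compare_sd_to_variance(variance):
    """Return '>', '=', or '<' to show how the standard deviation relates to the variance."""
    sd = math.sqrt(variance)
    if sd > variance:
        return ">"
    if sd == variance:
        return "="
    return "<"

for variance in (0.25, 0.5, 1.0, 4.0):
    sd = math.sqrt(variance)
    print(f"variance = {variance:<4}  sd = {sd:.2f}  sd {compare_sd_to_variance(variance)} variance")
```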

d. Can the variance ever be negative? Why or why not?

Since the variance is based on a squared measure, no, it cannot be negative.

e. Show the formula for the Coefficient of Variation and explain what it is and how it can be useful in comparing the variability of different variables.

CV = (s / |mean|) * 100. The Coefficient of Variation is the ratio of the standard deviation to the absolute value of the mean, usually multiplied by 100. It expresses the standard deviation in relation to the mean. It makes it easier to compare the spread of different variables, even if they are measured on different metrics.
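As an illustration, the sketch below compares two variables from this exercise that are measured in different units: actor ages in years (question 1) and Bank A waiting times in minutes (question 4). It uses the rounded means and standard deviations from the answers, so the last decimal may differ slightly from a full-precision calculation. The function name is an assumption for the example.

```python
def coefficient_of_variation(sd, mean):
    """CV = standard deviation / |mean| * 100, expressed as a percentage."""
    return sd / abs(mean) * 100

cv_ages = coefficient_of_variation(8.28, 44.25)   # actor ages, in years
cv_waits = coefficient_of_variation(1.18, 4.77)   # Bank A waiting times, in minutes

print(f"Actor ages:         sd = 8.28 years,   CV = {cv_ages:.1f}%")
print(f"Bank A wait times:  sd = 1.18 minutes, CV = {cv_waits:.1f}%")
```

The standard deviation of the ages is much larger in raw terms, but relative to their means the waiting times are the more variable of the two. That is the kind of comparison the CV is designed for.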


4. Two banks use alternative methods of waiting in line for a teller. Both banks use three tellers. Bank A uses separate lines for each teller, so a customer must pick which line she or he thinks is best. This approach does allow a customer to pick his/her favorite teller. In contrast, Bank B uses a single waiting line which leads customers to the next available teller out of all tellers available.

We take a random sample of 15 customers from each bank and record the waiting time in minutes. We are asked to analyze the data and determine the differences we note between the approaches of the two banks. Use graphs and summary measures of central tendency and variability to explain the differences. In the end, I am asking that you summarize your findings in words and not just numbers.

Here are the data. The data are given below (not sorted), along with the Sum(x) and the Sum(x^2):

Bank A (separate lines)   Bank B (single line)
5.3                       5.0
2.5                       3.8
5.9                       4.9
4.1                       4.3
5.4                       5.0
3.8                       4.7
5.1                       3.9
4.1                       5.4
4.1                       5.1
5.0                       4.0
5.1                       4.1
5.7                       4.5
3.0                       5.1
5.3                       5.4
7.1                       4.8

           Bank A   Bank B
Sum(x)      71.50    70.00
Sum(x^2)   360.19   330.68

a. Graph the two banks using stem and leaf plots. Describe the results of your graphs.

Stem and Leaf Plot of Waiting Time at Two Banks

Bank A                     Bank B
Stem  Leaf                 Stem  Leaf
2     5                    2
3     0 8                  3     8 9
4     1 1 1                4     0 1 3 5 7 8 9
5     0 1 1 3 3 4 7 9      5     0 0 1 1 4 4
6                          6
7     1                    7
8                          8

b. Calculate the following for each bank:

Mean
  Bank A: 71.50/15 = 4.77
  Bank B: 70.00/15 = 4.67

Median
  Bank A: N is odd, so (15+1)/2 = 8th observation = 5.1
  Bank B: N is odd, so (15+1)/2 = 8th observation = 4.8

Mode
  Bank A: 4.1, which occurs 3 times
  Bank B: No unique value

Variance
  Bank A: (360.19 - (71.50^2/15))/(15-1) = 1.38
  Bank B: (330.68 - (70.00^2/15))/(15-1) = 0.29

Std Deviation
  Bank A: SQRT(1.38) = 1.18
  Bank B: SQRT(0.29) = 0.54

Maximum
  Bank A: 7.10
  Bank B: 5.40

Minimum
  Bank A: 2.50
  Bank B: 3.80

Range
  Bank A: 7.10 - 2.50 = 4.60
  Bank B: 5.40 - 3.80 = 1.60

Coefficient of Variation
  Bank A: 1.18/4.77*100 = 24.68
  Bank B: 0.54/4.67*100 = 11.47
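These figures can be cross-checked from the raw waiting times rather than from the sums. The sketch below uses Python's standard `statistics` module; the helper name `describe` and the dictionary layout are assumptions made for this example, not the output of Excel or JMP.

```python
import statistics

# Waiting times in minutes, copied from the data table
bank_a = [5.3, 2.5, 5.9, 4.1, 5.4, 3.8, 5.1, 4.1, 4.1, 5.0, 5.1, 5.7, 3.0, 5.3, 7.1]
bank_b = [5.0, 3.8, 4.9, 4.3, 5.0, 4.7, 3.9, 5.4, 5.1, 4.0, 4.1, 4.5, 5.1, 5.4, 4.8]

def describe(data):
    """Summary measures of central tendency and variability for one sample."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return {
        "mean": mean,
        "median": statistics.median(data),
        "modes": statistics.multimode(data),  # more than one entry means no unique mode
        "variance": statistics.variance(data),
        "sd": sd,
        "range": max(data) - min(data),
        "cv": sd / abs(mean) * 100,
    }

stats_a = describe(bank_a)
stats_b = describe(bank_b)

for label, stats in (("Bank A", stats_a), ("Bank B", stats_b)):
    print(label)
    for name, value in stats.items():
        shown = value if name == "modes" else round(value, 2)
        print(f"  {name}: {shown}")
```

The CV is computed from the unrounded mean and standard deviation, which is why Bank A comes out at 24.68 rather than the value you would get by dividing the rounded 1.18 by 4.77.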

c. Summarize your results in a paragraph.

The measures of center for the two banks are close to each other, but the measures of spread are not. The mean and median for Bank A are close, 4.77 and 5.1, respectively. Likewise the mean and median for Bank B are close to each other and to those of Bank A, at 4.67 and 4.8, respectively. However, the spread for Bank A is much larger. The variances are 1.38 and 0.29, respectively. This can also be seen in the Coefficients of Variation, with a value of 24.68 for Bank A and 11.47 for Bank B. Allowing customers to pick their line results in more variability in waiting time compared with a single line.
