STAT 200 Guided Exercise 2 Answers

Be sure to:
• Please submit your answers in a Word file to Sakai at the same place you downloaded the file.
• You can paste Excel/JMP output into a Word file. Please submit only one file for the assignment.
• It is OK to do problems by hand. However, you will need to scan or take a picture of your work.
• Put your name and the Assignment # on the file name that you submit, e.g. Ilvento Guided1.doc
• Answer completely and show your work.
• Guided Assignments are not graded but we check for completed work. Answers are posted on Sakai.

Key Topics
• Measures of Central Tendency
• Stem & Leaf Plot and describing distributions
• Using Excel to graph data

1. Let’s finish up the Academy Award winners for best actor (and actress) since 1996, which we started in Assignment 1, now that we have command of both central tendency and variability. Each year the Academy of Motion Picture Arts and Sciences gives an award for the best actor and actress in a motion picture. We have recorded the name and age of each winner since 1996. The data for males and females are given below (the sample size is n = 21 for each group). The sum of the ages and the sum of the squared ages are also given.

YEAR  ACTOR                MALE AGE  ACTRESS            FEMALE AGE
1996                       45        Frances McDormand  39
1997                       60        Helen Hunt         34
1998                       46        Gwyneth Paltrow    26
1999                       40        Hilary Swank       25
2000                       36        Julia Roberts      33
2001                       47        Halle Berry        35
2002                       29        Nicole Kidman      35
2003                       43        Charlize Theron    28
2004                       37        Hilary Swank       30
2005                       38        Reese Witherspoon  29
2006  Forest Whitaker      45        Helen Mirren       61
2007  Daniel Day-Lewis     50                           32
2008  Sean Penn            48        Kate Winslet       33
2009                       60        Sandra Bullock     45
2010                       50        Natalie Portman    29
2011  Jean Dujardin        39        Meryl Streep       62
2012  Daniel Day-Lewis     55        Jennifer Lawrence  22
2013  Matthew McConaughey  44                           44
2014                       32        Julianne Moore     54
2015  Leonardo DiCaprio    41        Brie Larson        26
2016                       41        Emma Stone         28

               Males    Females
Sum X          926      750
Sum X-squared  42,146   29,382


Here is the Stem and Leaf plot for each group to compare the distributions.

Stem and Leaf Plot of Actors Winning Academy Award Since 1996

Males                        Females
Stem  Leaf                   Stem  Leaf
2     9                      2     2 5 6 6 8 8 9 9
3     2 6 7 8 9              3     0 2 3 3 4 5 5 9
4     0 1 1 3 4 5 5 6 7 8    4     4 5
5     0 0 5                  5     4
6     0 0                    6     1 2

6|0 represents 60

a. Calculate the measures of central tendency and variability for each group.

                          Males                             Females
Mean                      926/21 = 44.09                    750/21 = 35.71
Median                    11th ordered observation = 44     11th ordered observation = 33
Mode                      Not a unique mode                 Not a unique mode
Range                     60 - 29 = 31                      62 - 22 = 40
Variance                  [42,146 - (926)²/21]/(21 - 1)     [29,382 - (750)²/21]/(21 - 1)
                          = [42,146 - 40,832.19]/20         = [29,382 - 26,785.71]/20
                          = 1,313.81/20 = 65.69             = 2,596.29/20 = 129.81
Standard Deviation        SQRT(65.69) = 8.10                SQRT(129.81) = 11.39
Coefficient of Variation  8.10/44.09 × 100 = 18.38%         11.39/35.71 × 100 = 31.90%

b. Briefly compare the two distributions with an emphasis on the measures of Central Tendency and Variability.

For males, the distribution is roughly symmetric and centered around the mean of 44.09. There are no obvious outliers. The median, 44, is very close to the mean. The values vary from 29 to 60 for a range of 31 years. The standard deviation is 8.10 years, which is relatively small compared with the mean (CV = 18.38%).


For females, the mean is lower at 35.71, which is higher than the median of 33. Two large outliers at 61 and 62 pull the mean up. Otherwise most of the female ages fall between the mid 20s and the mid 30s. The range is larger for females than for males (62 - 22 = 40), as is the standard deviation (11.39 for females). The higher standard deviation also reflects the outliers. The CV for females, 31.90%, is much higher than that for males.
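The shortcut calculation in part a works only from n, Sum X, and Sum X-squared. As a rough cross-check, it can be sketched in Python. The function name summary_from_sums is our own choice for illustration and is not part of the assignment.

```python
import math

def summary_from_sums(n, sum_x, sum_x2):
    """Return (mean, sample variance, standard deviation, CV in %) from running sums."""
    mean = sum_x / n
    # Shortcut formula: [Sum X-squared - (Sum X)^2 / n] / (n - 1)
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    sd = math.sqrt(variance)
    cv = sd / abs(mean) * 100
    return mean, variance, sd, cv

# Sums taken from the table of Academy Award winners (n = 21 per group).
males = summary_from_sums(21, 926, 42146)
females = summary_from_sums(21, 750, 29382)

for label, stats in (("Males", males), ("Females", females)):
    print(label, [round(value, 2) for value in stats])
```

Because the code carries full precision through every step, its results can differ in the last decimal place from hand calculations that round along the way.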

c. For both men and women there are a few outliers. For men there are two individuals with a value of 60. For women there is one winner aged 61 and another aged 62. Calculate z-scores for these values and interpret their meaning.

Zm = (60 - 44.09)/8.10 = 1.96

Zf1 = (62 - 35.71)/11.39 = 2.31

Zf2 = (61 - 35.71)/11.39 = 2.22

A z-score represents the distance between Xi and the mean X-bar, expressed in standard deviations. For example, Zm = 1.96 means that an age of 60 lies 1.96 standard deviations above the male mean of 44.09.

d. Suppose we wanted to remove the two female outliers from the data. Calculate the new mean for the remaining 19 women winners. Hint: subtract the values from the old sum and divide by 19. Did the outliers influence the mean age much?

(750 - 62 - 61) = 627
627/19 = 33.00

The mean for females decreased from 35.71 to 33.00 after removing the two outliers. This is a 7.6% decrease.
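The z-score and trimmed-mean arithmetic in parts c and d can be sketched in a few lines of Python. The helper z_score and the variable names are ours, and the inputs are the rounded means and standard deviations from part a.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / sd

# Rounded summary values from part a.
z_m = z_score(60, 44.09, 8.10)     # male winners aged 60
z_f1 = z_score(62, 35.71, 11.39)   # female winner aged 62
z_f2 = z_score(61, 35.71, 11.39)   # female winner aged 61

# Part d: drop the two female outliers from the sum, then divide by 19.
trimmed_mean = (750 - 62 - 61) / 19
pct_decrease = (35.71 - trimmed_mean) / 35.71 * 100

print(round(z_m, 2), round(z_f1, 2), round(z_f2, 2))
print(trimmed_mean, round(pct_decrease, 1))
```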


2. Below are the data for infant mortality for 34 OECD countries. The Organization for Economic Co-operation and Development (OECD) is an international economic organization of 34 countries, founded in 1961 to stimulate economic progress and world trade. It is a forum of countries describing themselves as committed to democracy and the market economy, providing a platform to compare policy experiences, seek answers to common problems, identify good practices, and coordinate domestic and international policies of its members. OECD’s web site provided some data on infant mortality for the 34 countries. Infant mortality (the rate of death of children under 1 year of age per 1,000 live births) is a measure of development.

The Histogram and the Stem and Leaf Plot for these data are given below. Use the stem and leaf values for some calculations, such as the min and max. For other calculations, the Sum of (x) is 128.2 and the Sum of (x²) is 664.42. The two outliers are Turkey (10.2) and Mexico (13.0).

[Histogram of infant mortality rates not reproduced; axis ticks run from 0 to 14]

Stem and Leaf

Stem  Leaf           Count
13    0              1
12
11
10    2              1
9
8
7     0              1
6
5     001            3
4     0458           4
3     13555667       8
2     0034455568999  13
1     377            3

1|3 represents 1.3

a. Calculate the:

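Mean = 128.2/34 = 3.77
Median = (3.1 + 3.3)/2 = 3.2
Mode = no unique mode

Maximum = 13.0
Minimum = 1.3
Range = 13.0 - 1.3 = 11.7

Variance = [664.42 - (128.2)²/34]/(34 - 1) = 5.486
Standard Deviation = SQRT(5.486) = 2.34

Coefficient of Variation = 2.34/3.77 × 100 = 62.12%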

b. What is the position of the median value for this data?

Since n = 34, the median falls between the 17th and 18th positions in the ordered data. We take the average of these two values.

c. Does the mode make sense as a measure of Central Tendency for this data?

Based on the Stem and Leaf Plot, there is not a unique mode. Three values (2.5, 2.9, and 3.5) each have 3 observations. They are close to the center and to each other, but they are not very useful as a measure of center.

d. Calculate a z-score for an infant mortality rate of 13.

Z = (13 - 3.77)/2.34 = 3.94. This value is 3.94 standard deviations above the mean.
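As a cross-check on the answers to problem 2, the stem and leaf plot can be expanded back into the 34 individual rates. The sketch below is our own; the dictionary stem_leaf is transcribed from the plot, and the variable names are ours.

```python
import math
import statistics

# Stem -> leaves, transcribed from the plot (1|3 represents 1.3).
stem_leaf = {
    1: "377",
    2: "0034455568999",
    3: "13555667",
    4: "0458",
    5: "001",
    7: "0",
    10: "2",
    13: "0",
}

# Each leaf is a tenth added to its stem.
rates = sorted(
    stem + int(leaf) / 10
    for stem, leaves in stem_leaf.items()
    for leaf in leaves
)

n = len(rates)
mean = sum(rates) / n
median = statistics.median(rates)       # averages the 17th and 18th values
variance = statistics.variance(rates)   # sample variance, divides by n - 1
sd = math.sqrt(variance)
z_13 = (13 - mean) / sd

print(n, round(sum(rates), 1), round(mean, 2), round(median, 1))
print(round(variance, 3), round(sd, 2), round(z_13, 2))
```

The reconstructed values sum to 128.2, which matches the Sum of (x) given in the problem.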


3. Answer the following questions about variability of data sets:

a. How would you describe the variance and standard deviation in words, rather than a formula? Think of what you are calculating and how it might be useful in describing a variable.

The variance is the average squared deviation around the center (in this case the center is the mean). The standard deviation is the square root of the variance. It can be read as the typical deviation around the mean, expressed in the original units of the variable.

b. What is the primary advantage of using the inter-quartile range compared with the range when describing the variability of a variable?

The range uses only two values, the maximum and the minimum, so it can be very sensitive to outliers. The inter-quartile range is the range of the middle 50% of the values, so extreme values in either tail have little effect on it.
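A small Python sketch can illustrate this difference in sensitivity. The data below are made up for illustration and do not come from the assignment. Note also that quartile conventions differ between textbooks and software; statistics.quantiles is used here with its default method.

```python
import statistics

def value_range(data):
    """Maximum minus minimum."""
    return max(data) - min(data)

def iqr(data):
    """Inter-quartile range using the default method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

# Hypothetical data: the second list replaces the largest value with an outlier.
typical = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
with_outlier = typical[:-1] + [100]

print("range:", value_range(typical), "->", value_range(with_outlier))
print("IQR:  ", iqr(typical), "->", iqr(with_outlier))
```

In this example the single extreme value changes the range a great deal while the inter-quartile range stays the same.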

c. Can the standard deviation ever be larger than the variance? Explain.

In most cases the standard deviation is less than the variance, since it is the square root of the variance. However, in the special case where the variance is between 0 and 1, the standard deviation will be larger than the variance. For example, if s² = 0.5, then s = 0.71.
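A quick numerical check of this point, written as a short Python sketch. The function name sd_exceeds_variance and the trial values are ours.

```python
import math

def sd_exceeds_variance(variance):
    """Return True when the standard deviation is larger than the variance."""
    if variance < 0:
        raise ValueError("a variance cannot be negative")
    return math.sqrt(variance) > variance

# Trial variances on both sides of 1.
for variance in (0.25, 0.5, 1.0, 4.0):
    sd = math.sqrt(variance)
    print(f"variance = {variance:<5} sd = {sd:.2f}  sd > variance: {sd_exceeds_variance(variance)}")
```

When the variance is exactly 0 or 1, the two measures are equal.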

d. Can the variance ever be negative? Why or why not?

Since the variance is based on a squared measure, no, it cannot be negative.

e. Show the formula for the Coefficient of Variation and explain what it is and how it can be useful in comparing the variability of different variables.

CV = (s / |X-bar|) × 100. The Coefficient of Variation is the ratio of the standard deviation to the absolute value of the mean, usually multiplied by 100. It expresses the standard deviation in relation to the mean. This makes it easier to compare the spread of different variables, even if they are measured on different scales.


4. Two banks use alternative methods of waiting in line for a teller. Both banks use three tellers. Bank A uses separate lines for each teller, so a customer must pick which line she or he thinks is best. This approach does allow a customer to pick his/her favorite teller. In contrast, Bank B uses a single waiting line which leads customers to the next available teller out of all tellers available.

We take a random sample of 15 customers from each bank and record the waiting time in minutes. We are asked to analyze the data and determine the differences we note between the approaches of the two banks. Use graphs and summary measures of central tendency and variability to explain the differences. In the end, I am asking that you summarize your findings in words and not just numbers.

Here are the data (not sorted), along with the Sum(x) and the Sum(x^2) for each bank:

          Bank A (Multiple Lines)  Bank B (1 Line)
Sum(x)    71.50                    70.00
Sum(x^2)  360.19                   330.68

Bank A (Multiple Lines)  Bank B (1 Line)
5.3                      5.0
2.5                      3.8
5.9                      4.9
4.1                      4.3
5.4                      5.0
3.8                      4.7
5.1                      3.9
4.1                      5.4
4.1                      5.1
5.0                      4.0
5.1                      4.1
5.7                      4.5
3.0                      5.1
5.3                      5.4
7.1                      4.8

a. Graph the two banks using stem and leaf plots. Describe the results of your graphs.

Stem and Leaf Plot of Waiting Time at Two Banks

Bank A                   Bank B
Stem  Leaf               Stem  Leaf
2     5                  2
3     0 8                3     8 9
4     1 1 1              4     0 1 3 5 7 8 9
5     0 1 1 3 3 4 7 9    5     0 0 1 1 4 4
6                        6
7     1                  7
8                        8
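For anyone who would rather build the plot with code than by hand or in Excel, here is a minimal Python sketch. It assumes values recorded to one decimal place, as the waiting times are. The function name stem_and_leaf is our own.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for values recorded to one decimal place."""
    leaves = defaultdict(list)
    for value in sorted(values):
        # 5.3 -> 53 tenths -> stem 5, leaf 3
        stem, leaf = divmod(round(value * 10), 10)
        leaves[stem].append(leaf)
    rows = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row_leaves = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}".rstrip())
    return rows

bank_a = [5.3, 2.5, 5.9, 4.1, 5.4, 3.8, 5.1, 4.1, 4.1, 5.0, 5.1, 5.7, 3.0, 5.3, 7.1]
bank_b = [5.0, 3.8, 4.9, 4.3, 5.0, 4.7, 3.9, 5.4, 5.1, 4.0, 4.1, 4.5, 5.1, 5.4, 4.8]

for name, data in (("Bank A", bank_a), ("Bank B", bank_b)):
    print(name)
    print("\n".join(stem_and_leaf(data)))
```

Unlike the side-by-side plot, each bank here is printed only over its own range of stems.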

b. Calculate the following for each bank:

                          Bank A                                Bank B
Mean                      71.5/15 = 4.77                        70.00/15 = 4.67
Median                    N is odd, so (15+1)/2 = 8th           N is odd, so (15+1)/2 = 8th
                          observation = 5.1                     observation = 4.8
Mode                      4.1, which occurs 3 times             No unique value
Variance                  (360.19 - (71.50²/15))/(15 - 1)       (330.68 - (70.00²/15))/(15 - 1)
                          = 1.38                                = 0.29
Std Deviation             SQRT(1.38) = 1.18                     SQRT(0.29) = 0.54
Minimum                   2.50                                  3.80
Maximum                   7.10                                  5.40
Range                     7.10 - 2.50 = 4.60                    5.40 - 3.80 = 1.60
Coefficient of Variation  1.18/4.77 × 100 = 24.68               0.54/4.67 × 100 = 11.47

The Coefficients of Variation are computed from the unrounded means and standard deviations.
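The summary measures for the two banks can be cross-checked with Python's standard statistics module. This is a sketch for checking the arithmetic, not the required method. The function summarize and its dictionary keys are our own names.

```python
import statistics

bank_a = [5.3, 2.5, 5.9, 4.1, 5.4, 3.8, 5.1, 4.1, 4.1, 5.0, 5.1, 5.7, 3.0, 5.3, 7.1]
bank_b = [5.0, 3.8, 4.9, 4.3, 5.0, 4.7, 3.9, 5.4, 5.1, 4.0, 4.1, 4.5, 5.1, 5.4, 4.8]

def summarize(times):
    """Return the summary measures for one bank as a dictionary."""
    mean = statistics.mean(times)
    sd = statistics.stdev(times)            # sample standard deviation (n - 1)
    return {
        "mean": mean,
        "median": statistics.median(times),
        "modes": statistics.multimode(times),  # every value tied for most frequent
        "variance": statistics.variance(times),
        "sd": sd,
        "min": min(times),
        "max": max(times),
        "range": max(times) - min(times),
        "cv": sd / abs(mean) * 100,
    }

summary_a = summarize(bank_a)
summary_b = summarize(bank_b)

for name, summary in (("Bank A", summary_a), ("Bank B", summary_b)):
    print(name)
    for key, value in summary.items():
        shown = value if key == "modes" else round(value, 2)
        print(f"  {key}: {shown}")
```

A list with more than one entry under "modes" means there is no unique mode.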

c. Summarize your results in a paragraph.

The measures of center for the two banks are close to each other, but the measures of spread are not. The mean and median for Bank A are close, at 4.77 and 5.1, respectively. Likewise, the mean and median for Bank B are close to each other and to those of Bank A, at 4.67 and 4.8, respectively. However, the spread for Bank A is much larger. The variances are 1.38 and 0.29, respectively. This can also be seen in the Coefficients of Variation, with a value of 24.68 for Bank A and 11.47 for Bank B. Allowing customers to pick their line results in more variability in waiting time compared with a single line.
