Describing Distributions with Numbers

To Describe the Center (the C in SOCS) we use mean and median.

The Mean: $\bar{x}$ (lower case letter used!)

• The average of a set of observations.

• If the n observations are $x_1, x_2, \ldots, x_n$, their mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

or in more compact notation

$$\bar{x} = \frac{\sum x_i}{n}$$

• Sensitive to the influence of a few extreme observations.

• A skewed distribution will pull the mean toward its long tail.

• Not a resistant measure of center.

• For example:

Which way is this data skewed? Where is the mean likely to be?

The Median: the midpoint of the data, such that half of the observations are smaller and half are larger.

Do you think that the median is more or less resistant than the mean?

To find the median:

1. Arrange all observations in order of size, from smallest to largest.

2. If n is odd: median = center observation.

3. If n is even: median = the average of the two center observations.

Example: Find the mean and the median of the following data sets.

1. 3 5 6 6 8 10 11

2. 18 22 22 23 24 26 27 27
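The two data sets above can be checked with a short Python sketch. It uses the standard-library `statistics` module; the variable names are only illustrative and are not part of the notes.

```python
from statistics import mean, median

data_1 = [3, 5, 6, 6, 8, 10, 11]            # n = 7 (odd): median is the center value
data_2 = [18, 22, 22, 23, 24, 26, 27, 27]   # n = 8 (even): median averages the two center values

mean_1, median_1 = mean(data_1), median(data_1)   # mean 7, median 6
mean_2, median_2 = mean(data_2), median(data_2)   # mean 23.625, median 23.5

print("Data set 1: mean =", mean_1, "median =", median_1)
print("Data set 2: mean =", mean_2, "median =", median_2)
```

In the first set the mean (7) sits above the median (6), while in the second the two are nearly equal.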

Mean vs. Median

• What is an example of a data set for which the median gives you more information than the mean?

• Symmetrical distribution: mean = median.

• Skewed distribution: the mean is farther out in the long tail than is the median. (Extreme data points pull the mean toward them and the median is more resistant).

• If you are looking at the ages of the students in this classroom, would you use the mean or the median? What if you also included the teacher?
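One way to explore the classroom question is with made-up numbers. The ages below are hypothetical, invented purely for illustration; the sketch only shows how a single extreme value moves the mean but can leave the median where it was.

```python
from statistics import mean, median

# Hypothetical ages, invented for illustration (not data from the notes).
student_ages = [16, 16, 17, 17, 17, 17, 18, 18]
teacher_age = 45
with_teacher = student_ages + [teacher_age]

# How far does each measure of center move when the teacher is included?
mean_shift = mean(with_teacher) - mean(student_ages)
median_shift = median(with_teacher) - median(student_ages)

print("Mean moves by", round(mean_shift, 2), "years")
print("Median moves by", median_shift, "years")
```

With these assumed ages the mean rises by about three years while the median does not move, which is what "resistant" means in practice.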

To Describe the Spread (the S in SOCS) we use the quartiles and the variance.

Why Measure Spread?

• The mean alone can be very misleading (if there are values significantly higher or lower than most of the other data).

The mean is the same for both of these data sets, but what is different about them?

Measures of spread:

Range: largest data value - smallest data value. It depends on only two data values, so it could be misleading if those values are outliers.

pth percentile: the value such that p percent of the observations fall at or below it.

• The median is the 50th percentile.
• The first quartile is the 25th percentile.
• The third quartile is the 75th percentile.

Interquartile Range (IQR): the distance between Q1 and Q3 (IQR = Q3 - Q1).

• IQR is resistant: it is not affected by changes in either tail of the distribution.
• Not useful on its own for describing skewed distributions because it is only a single number. The two sides of a skewed distribution have different spreads, so one number cannot describe both.

Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

The quartiles: Q1 and Q3

1. Arrange the observations in increasing order.

2. Locate the median.

3. The first quartile Q1 is the median of the observations to the left of the overall median.

4. The third quartile Q3 is the median of the observations to the right of the overall median.

To find the pth percentile: multiply p percent by the number of observations n. The result is the position, counting up from the smallest value, of the observation that fits the description.

Example: 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32

Find the median, Q1, and Q3, as well as the 90th percentile.
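The quartile and percentile rules above can be sketched in Python for this example. The helper names `quartiles` and `percentile` are hypothetical, and rounding the position up when p percent of n is not a whole number is an assumption, since the notes do not say what to do in that case.

```python
import math
from statistics import median

data = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
        23, 24, 25, 25, 26, 28, 28, 28, 29, 32]

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule from the notes."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[:n // 2]          # observations to the left of the overall median
    upper = ordered[(n + 1) // 2:]    # observations to the right (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

def percentile(values, p):
    """Observation at position p% of n, counting from the smallest value."""
    ordered = sorted(values)
    position = math.ceil(p * len(ordered) / 100)   # rounding up is an assumption
    return ordered[max(position, 1) - 1]

q1, med, q3 = quartiles(data)
p90 = percentile(data, 90)
print("Q1 =", q1, "median =", med, "Q3 =", q3, "90th percentile =", p90)
```

For these 20 observations the sketch gives Q1 = 18, median = 23, Q3 = 27, and the 90th percentile is the 18th observation, 28.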

The Five Number Summary

The five-number summary of a set of observations consists of:

Minimum Q1 M Q3 Maximum

These values are graphed in a BOXPLOT to display the data analysis.

• Less detail than a histogram.
• Typically used for a side-by-side comparison of two distributions.

Input the data from above into your calculator and look at the boxplot.
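As a cross-check on the calculator, here is a minimal Python sketch of the five-number summary for the same 20 observations, together with the 1.5 × IQR rule for suspected outliers. The function name `five_number_summary` is hypothetical, and the quartiles follow the median-of-halves rule given in these notes; other software may use a slightly different quartile convention.

```python
from statistics import median

data = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
        23, 24, 25, 25, 26, 28, 28, 28, 29, 32]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[:n // 2])          # median of the lower half
    q3 = median(ordered[(n + 1) // 2:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

minimum, q1, med, q3, maximum = five_number_summary(data)

# Suspected outliers lie more than 1.5 x IQR outside the quartiles.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
suspected_outliers = [x for x in data if x < low_fence or x > high_fence]

print("Five-number summary:", minimum, q1, med, q3, maximum)
print("IQR =", iqr, "fences =", low_fence, high_fence)
print("Suspected outliers:", suspected_outliers)
```

Here the summary is 13, 18, 23, 27, 32 with IQR = 9, so the fences are 4.5 and 40.5 and no observation is flagged.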

Barry Bonds Example in Class

Measuring Spread: Standard Deviation

Std Deviation measures spread by looking at how far each data sample is from the mean.

Variance: $s^2$ and Standard Deviation: $s$

Variance:

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

or

$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$

Standard Deviation:

$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$

Reminder: Will you ever have a negative variance or standard deviation?

• Some deviations from the mean will be negative and some will be positive (that is why we square them).
• The sum of all deviations of the observations from the mean will ALWAYS BE ZERO. (Show why… 3 7 20 30)
• The sum of the squared deviations from the mean is always the smallest sum possible: measuring from any other value gives a larger sum of squares.
• A large variance tells you what?
• A small variance tells you what?
• What does it mean if the variance is 0?
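The "show why" data set 3 7 20 30 can be worked through in a short Python sketch. It follows the variance formula with n - 1 in the denominator; the comparison against the median is an added illustration of the "smallest sum" bullet, and the variable names are only illustrative.

```python
from math import sqrt
from statistics import mean, median

data = [3, 7, 20, 30]

x_bar = mean(data)                                   # 15
deviations = [x - x_bar for x in data]               # [-12, -8, 5, 15]
sum_of_deviations = sum(deviations)                  # negatives and positives cancel: 0

sum_of_squares = sum(d ** 2 for d in deviations)     # 144 + 64 + 25 + 225 = 458
s_squared = sum_of_squares / (len(data) - 1)         # variance, dividing by n - 1
s = sqrt(s_squared)                                  # standard deviation

# Squared deviations measured from another value (here the median) add up to more.
about_median = sum((x - median(data)) ** 2 for x in data)

print("Deviations:", deviations, "sum =", sum_of_deviations)
print("Variance =", round(s_squared, 2), "standard deviation =", round(s, 2))
print("Sum of squares about the mean:", sum_of_squares, "about the median:", about_median)
```

Neither the variance nor the standard deviation can come out negative, because every squared deviation is at least zero.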

Some Facts about Variance and Standard Deviation:

We use $s$ and not $s^2$ when describing data: $s$ is in the same units as the observations, so it can be compared directly with the mean.

Why n - 1? Because once we know n - 1 deviations we must know the nth, since the sum is zero. Therefore only n - 1 can vary freely. We call this degrees of freedom.

Standard deviation is NOT RESISTANT and is greatly affected by outliers. Distributions with outliers have large std deviations.

Choosing a Summary for the data:

• Skewed data set? The five-number summary is better.
• Symmetrical data set (no outliers)? Mean and standard deviation are better.

Effect of Linear Transformations

A linear transformation changes the original variable $x$ into the new variable $x_{new}$ given by an equation of the form

$$x_{new} = a + bx$$

Below are the 2009 Salaries

Player | Salary (US$)

1. Carlos Zambrano 18,750,000

2. Alfonso Soriano 17,000,000

3. Aramis Ramirez 16,650,000

4. 13,250,000

5. Ted Lilly 13,000,000

6. Kosuke Fukudome 12,500,000

7. Ryan Dempster 9,000,000

8 a. Milton Bradley 7,000,000

8 b. Rich Harden 7,000,000

10. Kevin Gregg 4,200,000

11. Reed Johnson 3,000,000

12. 2,300,000

13. Aaron Miles 2,200,000

14. Aaron Heilman 1,625,000

15. Neal Cotts 1,100,000

16 a. Geovany Soto 575,000

16 b. Carlos Marmol 575,000

18. 500,000

19. Koyie Hill 475,000

20. Sean Marshall 450,000

21. Tom Gorzelanny 433,000

22. 430,000

23. Angel Guzman 421,500

24. Jeff Baker 415,000

25. Micah Hoffpauir 407,500

26. 401,500

27. David Patton 400,000

Total Team Salary: 134,058,500

Calculate a five-number summary of the salaries as well as the mean and standard deviation for the given scenarios.

1. The current salary.

2. Given each player receives a $50,000 bonus.

3. Given each player receives a 10% raise.

What are the linear effects of these changes?

1. Multiplying each observation by a positive number b multiplies both the measures of center (mean and median) and the measures of spread (IQR and standard deviation) by b.

2. Adding the same number a to each observation adds a to the measures of center and to the quartiles but does not change the measures of spread.
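The three salary scenarios can be compared with a Python sketch. The salary figures are copied from the table above (including the rows whose player names are missing); the helper `summarize` is hypothetical, and its quartiles follow the median-of-halves rule from these notes, so a calculator or other software may report slightly different quartiles.

```python
from statistics import mean, median, stdev

salaries = [
    18_750_000, 17_000_000, 16_650_000, 13_250_000, 13_000_000, 12_500_000,
    9_000_000, 7_000_000, 7_000_000, 4_200_000, 3_000_000, 2_300_000,
    2_200_000, 1_625_000, 1_100_000, 575_000, 575_000, 500_000, 475_000,
    450_000, 433_000, 430_000, 421_500, 415_000, 407_500, 401_500, 400_000,
]

def summarize(values):
    """Five-number summary plus mean, standard deviation, and IQR."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[:n // 2])          # median of the lower half
    q3 = median(ordered[(n + 1) // 2:])    # median of the upper half
    return {
        "min": ordered[0], "q1": q1, "median": median(ordered),
        "q3": q3, "max": ordered[-1],
        "mean": mean(ordered), "sd": stdev(ordered), "iqr": q3 - q1,
    }

current = summarize(salaries)                            # scenario 1
bonus = summarize([x + 50_000 for x in salaries])        # scenario 2: a = 50,000, b = 1
raise_10 = summarize([x * 1.10 for x in salaries])       # scenario 3: a = 0, b = 1.10

for label, result in [("current", current), ("bonus", bonus), ("raise", raise_10)]:
    print(label, {key: round(value) for key, value in result.items()})
```

Comparing the three printed summaries shows the pattern in the two rules: the bonus shifts the center and the quartiles by 50,000 and leaves the IQR and standard deviation alone, while the raise scales every measure by 1.10.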