Describing Distributions with Numbers
To Describe the Center (the C in SOCS) we use mean and median.
The Mean: x̄ (lowercase letter used!)
• The average of a set of observations.
• If the n observations are $x_1, x_2, \ldots, x_n$, their mean is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, or in more compact notation, $\bar{x} = \frac{\sum x_i}{n}$
• Sensitive to the influence of a few extreme observations.
• A skewed distribution will pull the mean toward its long tail.
• Not a resistant measure of center.
• For example:
Which way is this data skewed? Where is the mean likely to be?
The Median: the midpoint of the data, such that half of the observations are smaller and half are larger.
Do you think that the median is more or less resistant than the mean?
To find the median:
1. Arrange all observations in order of size, from smallest to largest.
2. If n is odd: median = center observation.
3. If n is even: median = the average of the two center observations.
Example: Find the mean and the median of the following data sets.
1. 3 5 6 6 8 10 11
2. 18 22 22 23 24 26 27 27
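One way to check the hand calculation for the two data sets is a short Python sketch. It assumes Python's standard `statistics` module; the variable names `data_1` and `data_2` are only illustrative.

```python
from statistics import mean, median

# The two data sets from the example.
data_1 = [3, 5, 6, 6, 8, 10, 11]            # n = 7 (odd)
data_2 = [18, 22, 22, 23, 24, 26, 27, 27]   # n = 8 (even)

for label, data in (("Set 1", data_1), ("Set 2", data_2)):
    # mean   = sum of the observations divided by n
    # median = center observation (n odd), or the average of the
    #          two center observations (n even)
    print(label, "mean =", mean(data), "median =", median(data))
```

For Set 1 the mean is 49/7 = 7 and the median is the 4th observation, 6. For Set 2 the mean is 189/8 = 23.625 and the median is the average of 23 and 24, which is 23.5.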
Mean vs. Median
• What is an example of a data set for which the median gives you more information than the mean?
• Symmetrical distribution: mean = median.
• Skewed distribution: the mean is farther out in the long tail than is the median. (Extreme data points pull the mean toward them and the median is more resistant).
• If you are looking at the ages of the students in this classroom, would you use the mean or the median? What if you also included the teacher?
To Describe the Spread (the S in SOCS) we use the quartiles and the variance.
Why Measure Spread?
• the mean can be very misleading (if there are values significantly higher or lower than most of the other data)
The mean is the same for both of these data sets, but what is different about them?
Measures of spread:
Range: largest observation − smallest observation. It depends on only two data values, so it can be misleading if those values are outliers.
pth percentile: the value such that p percent of the observations fall at or below it.
• the median is the 50th percentile
• the first quartile is the 25th percentile
• the third quartile is the 75th percentile
Interquartile Range (IQR): the distance between Q1 and Q3 (IQR = Q3 − Q1).
• IQR is resistant: it is not affected by changes in either tail of the distribution.
• Not useful on its own for describing skewed distributions, because it is only a single number. The two sides of a skewed distribution have different spreads, so one number cannot describe both.
Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
The quartiles: Q1 and Q3
1. Arrange the observations in increasing order.
2. Locate the median.
3. The first quartile Q1 is the median of the observations to the left of the overall median.
4. The third quartile Q3 is the median of the observations to the right of the overall median.
To find the pth percentile: multiply p percent by the number of observations. The result is the position, counting from the smallest value in the ordered list, of the observation that fits the description.
Example: 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32
Find the median, Q1, and Q3, as well as the 90th percentile.
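The example can be checked with a minimal Python sketch of the quartile and percentile rules described in these notes. The helper names `quartiles` and `percentile` are illustrative, not from any library. Rounding the position up when p percent of n is not a whole number is an assumption, since the notes do not cover that case. Other software may use a different quartile convention and give slightly different values.

```python
from statistics import median

data = sorted([13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
               23, 24, 25, 25, 26, 28, 28, 28, 29, 32])

def quartiles(values):
    """Q1 and Q3 as the medians of the halves left and right of the overall median."""
    values = sorted(values)
    half = len(values) // 2
    lower = values[:half]                  # observations left of the median
    upper = values[len(values) - half:]    # observations right of the median
    return median(lower), median(upper)

def percentile(values, p):
    """Observation at position p% of n in the ordered list.

    Assumption: the position is rounded up when p% of n is not a whole number.
    """
    values = sorted(values)
    position = -(-p * len(values) // 100)  # integer ceiling of p*n/100
    return values[max(position, 1) - 1]    # positions count from 1

m = median(data)
q1, q3 = quartiles(data)
iqr = q3 - q1
p90 = percentile(data, 90)

# Suspected outliers: more than 1.5 x IQR below Q1 or above Q3.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

five_number = (min(data), q1, m, q3, max(data))
print("median:", m, "Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("90th percentile:", p90)
print("suspected outliers:", outliers)
print("five-number summary:", five_number)
```

With n = 20, the median is the average of the 10th and 11th observations (23), Q1 is the median of the lower ten values (18), and Q3 is the median of the upper ten (27). The 90th percentile is the 18th observation (0.90 × 20 = 18), which is 28. The fences are 4.5 and 40.5, so no observation is a suspected outlier.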
The Five Number Summary
The five-number summary of a set of observations consists of:
Minimum Q1 M Q3 Maximum
These values are graphed in a BOXPLOT to display the data analysis.
• Shows less detail than a histogram.
• Typically used for a side-by-side comparison of two distributions.
Input the data from above into your calculator and look at the boxplot.
Barry Bonds Example in Class
Measuring Spread: Standard Deviation
Standard deviation measures spread by looking at how far each observation is from the mean.
Variance: $s^2$ and Standard Deviation: $s$
Variance:
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$, or $s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$
Standard Deviation:
$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$
Reminder: Will you ever have a negative variance or standard deviation?
• Some deviations from the mean will be negative and some will be positive (that is why we square them).
• The sum of all deviations of the observations from the mean will ALWAYS BE ZERO. (Show why… 3 7 20 30)
• The sum of the squared deviations from the mean is the smallest possible: measuring deviations from any other value gives a larger sum of squares.
• A large variance tells you what?
• A small variance tells you what?
• What does it mean if the variance is 0?
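The facts in this list can be checked numerically on the data set 3 7 20 30 mentioned there. This is a minimal sketch; the helper name `sum_sq` is illustrative.

```python
from math import sqrt

data = [3, 7, 20, 30]
n = len(data)
x_bar = sum(data) / n                    # mean = 60 / 4 = 15

# Deviations from the mean: some negative, some positive.
deviations = [x - x_bar for x in data]   # -12, -8, 5, 15
sum_of_deviations = sum(deviations)      # always zero

# Variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
std_dev = sqrt(variance)

def sum_sq(center):
    """Sum of squared deviations of the data from an arbitrary center."""
    return sum((x - center) ** 2 for x in data)

print("deviations:", deviations, "sum:", sum_of_deviations)
print("variance:", variance, "standard deviation:", std_dev)
# The mean gives a smaller sum of squares than nearby centers.
print(sum_sq(14), sum_sq(x_bar), sum_sq(16))
```

The squared deviations are 144, 64, 25, and 225, which total 458, so the variance is 458/3 ≈ 152.67 and the standard deviation is about 12.36. Centering at 14 or 16 instead of the mean gives a sum of squares of 462 rather than 458.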
Some Facts about Variance and Standard Deviation:
We use s rather than s² when describing data: s is in the same units as the original observations, so it can be compared directly with the mean.
Why n − 1? Because once we know n − 1 of the deviations, we must know the nth, since the sum is zero. Therefore only n − 1 of them can vary freely. We call this number the degrees of freedom.
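The degrees-of-freedom argument rests on the fact that the deviations from the mean always sum to zero. A short derivation, using only the definition of the mean $\bar{x} = \frac{\sum x_i}{n}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

So once any n − 1 deviations are known, the last one must be the negative of their sum.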
Standard deviation is NOT RESISTANT and is greatly affected by outliers. Distributions with outliers tend to have large standard deviations.
Choosing a summary for the data:
• Skewed data set? The five-number summary is better.
• Symmetrical data set (no outliers)? The mean and standard deviation are better.
Effect of Linear Transformations
A linear transformation changes the original variable x into the new variable $x_{new}$ given by an equation of the form
$x_{new} = a + bx$
Below are the 2009 Chicago Cubs salaries.
Player Salary (US$)
1. Carlos Zambrano 18,750,000
2. Alfonso Soriano 17,000,000
3. Aramis Ramirez 16,650,000
4. Derrek Lee 13,250,000
5. Ted Lilly 13,000,000
6. Kosuke Fukudome 12,500,000
7. Ryan Dempster 9,000,000
8 a. Milton Bradley 7,000,000
8 b. Rich Harden 7,000,000
10. Kevin Gregg 4,200,000
11. Reed Johnson 3,000,000
12. John Grabow 2,300,000
13. Aaron Miles 2,200,000
14. Aaron Heilman 1,625,000
15. Neal Cotts 1,100,000
16 a. Geovany Soto 575,000
16 b. Carlos Marmol 575,000
18. Ryan Theriot 500,000
19. Koyie Hill 475,000
20. Sean Marshall 450,000
21. Tom Gorzelanny 433,000
22. Mike Fontenot 430,000
23. Angel Guzman 421,500
24. Jeff Baker 415,000
25. Micah Hoffpauir 407,500
26. Jake Fox 401,500
27. David Patton 400,000
Total Team Salary: 134,058,500
Calculate a five-number summary of the salaries, as well as the mean and standard deviation, for the given scenarios.
1. The current salary.
2. Given each player receives a $50,000 bonus.
3. Given each player receives a 10% raise.
What are the linear effects of these changes?
1. Multiplying each observation by a positive number b multiplies both the measures of center (mean and median) and the measures of spread (IQR and standard deviation) by b.
2. Adding the same number a to each observation adds a to the measures of center and to the quartiles, but does not change the measures of spread.
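The three salary scenarios can be compared with a short Python sketch. The helper name `summarize` is illustrative. It finds quartiles with the median-of-halves method from these notes, so a calculator or library that uses a different quartile convention may report slightly different Q1 and Q3.

```python
from statistics import mean, median, stdev

# 2009 Chicago Cubs salaries (US$), from the table in these notes.
salaries = [
    18_750_000, 17_000_000, 16_650_000, 13_250_000, 13_000_000,
    12_500_000, 9_000_000, 7_000_000, 7_000_000, 4_200_000,
    3_000_000, 2_300_000, 2_200_000, 1_625_000, 1_100_000,
    575_000, 575_000, 500_000, 475_000, 450_000, 433_000,
    430_000, 421_500, 415_000, 407_500, 401_500, 400_000,
]

def summarize(values):
    """Five-number summary, IQR, mean, and standard deviation."""
    values = sorted(values)
    half = len(values) // 2
    q1 = median(values[:half])                 # left of the overall median
    q3 = median(values[len(values) - half:])   # right of the overall median
    return {
        "min": values[0], "Q1": q1, "median": median(values),
        "Q3": q3, "max": values[-1], "IQR": q3 - q1,
        "mean": mean(values), "sd": stdev(values),
    }

current = summarize(salaries)                           # scenario 1
bonus = summarize([x + 50_000 for x in salaries])       # scenario 2: a = 50,000, b = 1
raised = summarize([x * 1.10 for x in salaries])        # scenario 3: a = 0, b = 1.10

for name, s in (("current", current), ("bonus", bonus), ("raise", raised)):
    print(name, {k: round(v, 2) for k, v in s.items()})
```

With n = 27, the median is the 14th salary (1,625,000), Q1 is the 7th smallest (433,000), and Q3 is the 7th largest (9,000,000). The bonus shifts the minimum, quartiles, median, maximum, and mean up by 50,000 and leaves the IQR and standard deviation unchanged. The 10% raise multiplies every one of these numbers by 1.10.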