Chapter 1.2 Notes
Describing Distributions with Numbers

To describe the Center (the C in SOCS) we use the mean and the median.

The Mean: $\bar{x}$ (lower case letter used!)
• The average of a set of observations.
• If the n observations are $x_1, x_2, \ldots, x_n$, their mean is
  $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
  or, in more compact notation,
  $\bar{x} = \frac{\sum x_i}{n}$
• Sensitive to the influence of a few extreme observations.
• A skewed distribution will pull the mean toward its long tail.
• Not a resistant measure of center.
• For example: Which way is this data skewed? Where is the mean likely to be?

The Median: the midpoint of the data, such that half of the observations are smaller and half are larger.
Do you think that the median is more or less resistant than the mean?

To find the median:
1. Arrange all observations in order of size, from smallest to largest.
2. If n is odd: median = center observation.
3. If n is even: median = the average of the two center observations.

Example: Find the mean and the median of the following data sets.
1. 3 5 6 6 8 10 11
2. 18 22 22 23 24 26 27 27

Mean vs. Median
• What is an example of a data set for which the median gives you more information than the mean?
• Symmetrical distribution: mean = median.
• Skewed distribution: the mean is farther out in the long tail than the median is. (Extreme data points pull the mean toward them; the median is more resistant.)
• If you are looking at the ages of the students in this classroom, would you use the mean or the median? What if you also included the teacher?

To describe the Spread (the S in SOCS) we use the quartiles and the variance.

Why Measure Spread?
• The mean alone can be very misleading (if there are values significantly higher or lower than most of the other data).
The mean is the same for both of these data sets, but what is different about them?

Measures of spread:
Range: largest data value - smallest data value. It depends on only two data values, so it can be misleading if those values are outliers.
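As a quick check on the center and range calculations described above, here is a minimal Python sketch. The function names (`mean`, `median`, `value_range`) and the `with_outlier` list are illustrative choices, not part of the notes; the two example data sets are the ones given in the notes.

```python
def mean(values):
    """Sum of the observations divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Center observation (n odd) or average of the two center ones (n even)."""
    ordered = sorted(values)          # step 1: smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: n is odd
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: n is even


def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


data_1 = [3, 5, 6, 6, 8, 10, 11]
data_2 = [18, 22, 22, 23, 24, 26, 27, 27]

print(mean(data_1), median(data_1), value_range(data_1))  # 7.0 6 8
print(mean(data_2), median(data_2), value_range(data_2))  # 23.625 23.5 9

# Hypothetical variation (not in the notes): replace the largest value of
# data_1 with an extreme one to see which measure of center moves.
with_outlier = [3, 5, 6, 6, 8, 10, 110]
print(mean(with_outlier), median(with_outlier))  # mean rises above 21, median stays 6
```

The last line illustrates resistance: one extreme value drags the mean far to the right while the median does not move.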
pth percentile: the value such that p percent of the observations fall at or below it.
• The median is the 50th percentile.
• The first quartile is the 25th percentile.
• The third quartile is the 75th percentile.

Interquartile Range (IQR): the distance between Q1 and Q3, that is, $IQR = Q_3 - Q_1$.
• The IQR is resistant: it is not affected by changes in either tail of the distribution.
• Not very useful for describing skewed distributions, because it is only a single number. The two sides of a skewed distribution have different spreads, so one number cannot describe both.

Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

The quartiles: Q1 and Q3
1. Arrange the observations in increasing order.
2. Locate the median.
3. The first quartile Q1 is the median of the observations to the left of the overall median.
4. The third quartile Q3 is the median of the observations to the right of the overall median.

To find the pth percentile: multiply p percent by the number of observations. This gives the position, in the ordered list, of the observation that fits the description.

Example: 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32
Find the median, Q1 and Q3, as well as the 90th percentile.

The Five Number Summary
The five-number summary of a set of observations consists of:
Minimum   Q1   M   Q3   Maximum
These values are graphed in a BOXPLOT to display the data.
• Less detail than a histogram.
• Typically used for a side-by-side comparison of two distributions.
Input the data from above into your calculator and look at the boxplot.
Barry Bonds Example in Class

Measuring Spread: Standard Deviation
The standard deviation measures spread by looking at how far each data value is from the mean.

Variance: $s^2$ and Standard Deviation: $s$

Variance:
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
or
$s^2 = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$

Standard Deviation:
$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$

Reminder: Will you ever have a negative variance or standard deviation?
• Some deviations from the mean will be negative and some will be positive (that is why we square them).
• The sum of the deviations of the observations from the mean will ALWAYS BE ZERO. (Show why: 3 7 20 30)
• The sum of the squared deviations is smallest when the deviations are measured from the mean rather than from any other value.
• A large variance tells you what?
• A small variance tells you what?
• What does it mean if the variance is 0?

Some Facts about Variance and Standard Deviation:
We report s and not $s^2$ when describing data: s is in the same units as the original observations, so it can be compared directly with the mean.
Why n - 1? Because once we know n - 1 of the deviations we must know the nth, since the sum is zero. Therefore only n - 1 of them can vary freely. We call this the degrees of freedom.
Standard deviation is NOT RESISTANT and is greatly affected by outliers. Distributions with outliers have large standard deviations.

Choosing a Summary for the data:
• Skewed data set? The five-number summary is better.
• Symmetrical data set (no outliers)? The mean and standard deviation are better.

Effect of Linear Transformations
A linear transformation changes the original variable x into the new variable $x_{new}$ given by an equation of the form
$x_{new} = a + bx$

Below are the 2009 Chicago Cubs salaries.

Rank    Player              Salary (US$)
1.      Carlos Zambrano     18,750,000
2.      Alfonso Soriano     17,000,000
3.      Aramis Ramirez      16,650,000
4.      Derrek Lee          13,250,000
5.      Ted Lilly           13,000,000
6.      Kosuke Fukudome     12,500,000
7.      Ryan Dempster        9,000,000
8 a.    Milton Bradley       7,000,000
8 b.    Rich Harden          7,000,000
10.     Kevin Gregg          4,200,000
11.     Reed Johnson         3,000,000
12.     John Grabow          2,300,000
13.     Aaron Miles          2,200,000
14.     Aaron Heilman        1,625,000
15.     Neal Cotts           1,100,000
16 a.   Geovany Soto           575,000
16 b.   Carlos Marmol          575,000
18.     Ryan Theriot           500,000
19.     Koyie Hill             475,000
20.     Sean Marshall          450,000
21.     Tom Gorzelanny         433,000
22.     Mike Fontenot          430,000
23.     Angel Guzman           421,500
24.     Jeff Baker             415,000
25.     Micah Hoffpauir        407,500
26.     Jake Fox               401,500
27.     David Patton           400,000
Total Team Salary: 134,058,500

Calculate a five-number summary of the salaries, as well as the mean and standard deviation, for each of the following scenarios.
1. The current salaries.
2. Each player receives a $50,000 bonus.
3. Each player receives a 10% raise.

What are the effects of these linear changes?
1. Multiplying each observation by a positive number b multiplies both the measures of center (mean and median) and the measures of spread (IQR and standard deviation) by b.
2. Adding the same number a to each observation adds a to the measures of center and to the quartiles, but does not change the measures of spread.
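The standard deviation facts and the two transformation rules above can be checked numerically. The sketch below is illustrative only: the helper names are made up for this example, the `statistics.stdev` function from the Python standard library is used because it divides by n - 1 like the formula in these notes, and the IQR helper assumes the quartile steps given earlier (median of each half).

```python
from statistics import mean, median, stdev   # stdev divides by n - 1


def iqr(values):
    """Q3 - Q1, with each quartile taken as the median of one half of the data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return median(ordered[half + n % 2:]) - median(ordered[:half])


def summarize(values):
    """Two measures of center and two measures of spread."""
    return {"mean": mean(values), "median": median(values),
            "sd": stdev(values), "iqr": iqr(values)}


# Deviations from the mean always sum to zero (the 3 7 20 30 example).
small = [3, 7, 20, 30]
x_bar = sum(small) / len(small)                # 15.0
deviations = [x - x_bar for x in small]        # [-12.0, -8.0, 5.0, 15.0]
print(sum(deviations))                         # 0.0
print(sum(d * d for d in deviations) / (len(small) - 1))  # variance, 458 / 3

# 2009 Chicago Cubs salaries from the table above (US$).
salaries = [18_750_000, 17_000_000, 16_650_000, 13_250_000, 13_000_000,
            12_500_000, 9_000_000, 7_000_000, 7_000_000, 4_200_000,
            3_000_000, 2_300_000, 2_200_000, 1_625_000, 1_100_000,
            575_000, 575_000, 500_000, 475_000, 450_000, 433_000,
            430_000, 421_500, 415_000, 407_500, 401_500, 400_000]

current = summarize(salaries)
bonus = summarize([s + 50_000 for s in salaries])    # x_new = 50,000 + 1.00 x
raised = summarize([s * 1.10 for s in salaries])     # x_new = 0 + 1.10 x

for label, summary in [("current", current), ("bonus", bonus), ("raise", raised)]:
    print(label, {name: round(value) for name, value in summary.items()})
```

Comparing the three printed rows shows the pattern: the bonus shifts the mean and median by 50,000 and leaves the standard deviation and IQR alone, while the raise scales all four numbers by 1.10.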