Boxplots, Interquartile Range, and Outliers

Boxplots provide a visual representation of a data set that can be used to determine whether the data set is symmetric or skewed. Constructing a boxplot requires calculating the “5 number summary” and the interquartile range (IQR), and identifying any outliers.

5 Number Summary – The 5 number summary for a data set includes the following, listed in order from smallest to largest:

1. Minimum – The smallest value in the data set.
2. First Quartile – Separates the lowest 25% of the data in a set from the highest 75%. It is typically denoted $Q_1$, where $\frac{25}{100} \cdot (\text{\# of points in data set}) = \text{position of } Q_1 \text{ in set}$.
3. Median – The middle value in a sorted (smallest to largest) data set. If there is an even number of values, it is calculated by averaging the two middle values. The Median is also referred to as the Second Quartile ($Q_2$) because it separates the lower 50% of data in a set from the upper 50%.
4. Third Quartile – Separates the lowest 75% of the data in a set from the highest 25%. It is typically denoted $Q_3$, where $\frac{75}{100} \cdot (\text{\# of points in data set}) = \text{position of } Q_3 \text{ in set}$.
5. Maximum – The largest value in the data set.

IQR – The Interquartile Range is a measure of spread used to calculate the lower and upper outlier boundaries. These boundaries are then used to determine whether a data set has any actual outliers.

$\text{Interquartile Range (IQR)} = Q_3 - Q_1$
$\text{Lower Outlier Boundary} = Q_1 - 1.5 \cdot \text{IQR}$
$\text{Upper Outlier Boundary} = Q_3 + 1.5 \cdot \text{IQR}$

Outliers – Outliers are data points that are considerably smaller or larger than most of the other values in a data set. Data values that are smaller than the lower outlier boundary or larger than the upper outlier boundary are outliers. Some data sets do not have any outliers. Outliers that are determined to be the result of an error should be removed from the data set.
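The definitions above can be sketched in Python. This is a minimal illustration, not part of the handout: the function names and the sample data are made up for the example, and the quartile rule is the position convention this handout uses in its worked example (round a fractional position up; average with the next value when the position is a whole number). Calculators and statistical software often use other quartile conventions, so their results may differ slightly.

```python
def value_at_percentile(sorted_data, percent):
    """Value at an integer percentile, using position = (percent / 100) * n.

    Positions are 1-based, as in the handout.
    """
    n = len(sorted_data)
    whole, remainder = divmod(percent * n, 100)
    if remainder:
        # Fractional position: round up to the next whole position.
        # 1-based position (whole + 1) is 0-based index `whole`.
        return sorted_data[whole]
    # Whole-number position: average that value with the next one.
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    values = sorted(data)
    return (
        values[0],
        value_at_percentile(values, 25),
        value_at_percentile(values, 50),
        value_at_percentile(values, 75),
        values[-1],
    )


def outlier_boundaries(q1, q3):
    """Return (lower boundary, upper boundary) from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data):
    """Return the values that fall outside the outlier boundaries."""
    _, q1, _, q3, _ = five_number_summary(data)
    lower, upper = outlier_boundaries(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Illustrative data only (not from the handout).
sample = [1, 3, 4, 6, 7, 9, 12, 40]
print(five_number_summary(sample))  # (1, 3.5, 6.5, 10.5, 40)
print(find_outliers(sample))        # [40]
```

Using integer arithmetic for the position (`percent * n` divided by 100) avoids floating-point surprises when deciding whether a position is a whole number.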
Example – For the following data set (2012 data for MLB team payrolls in millions), find a) the 5 number summary, b) the IQR, c) the upper and lower outlier boundaries, and d) any outliers.

Note – Data should be sorted from lowest to highest if it is not provided that way. This allows easy identification of the min, max, median, and individual data positions within the set.

 #  Team       Payroll    #  Team       Payroll    #  Team       Payroll    #  Team       Payroll
 1  Padres          55    9  Rockies         78   17  Mets            93   25  Rangers        121
 2  Athletics       55   10  Indians         78   18  Twins           94   26  Tigers         132
 3  Astros          61   11  Nationals       81   19  Dodgers         95   27  Angels         154
 4  Royals          61   12  Orioles         81   20  W Sox           97   28  Red Sox        173
 5  Pirates         63   13  Mariners        82   21  Brewers         98   29  Phillies       175
 6  Rays            64   14  Reds            82   22  Cardinals      110   30  Yankees        198
 7  D Backs         74   15  Braves          83   23  Giants         118
 8  Blue Jays       75   16  Cubs            88   24  Marlins        118

a) 5 Number Summary – These values can be calculated by hand (shown below) OR they can be found using the “1-Var Stats” button from the Stat Menu on a TI-83 or TI-84 calculator.

Minimum $= 55$
$Q_1$ (25 represents the 25th percentile; 30 is the # of data points in set): $\text{Position of } Q_1 = \frac{25}{100}(30) = 7.5$, so $Q_1$ is in the 8th position: $Q_1 = 75$
Median (average of the 2 middle data points in set) $= \frac{83 + 88}{2} = 85.5$
$Q_3$ (75 represents the 75th percentile; 30 is the # of data points in set): $\text{Position of } Q_3 = \frac{75}{100}(30) = 22.5$, so $Q_3$ is in the 23rd position: $Q_3 = 118$
Maximum $= 198$

If the “position” calculation results in a decimal, round up to the next whole number to determine the position. If the calculation results in a whole number, average that position’s data value with the next data value.

b) IQR

$\text{IQR} = Q_3 - Q_1 = 118 - 75 = 43$

c) Upper and Lower Outlier Boundaries

$\text{Lower Outlier Boundary} = Q_1 - 1.5 \cdot \text{IQR} = 75 - 1.5(43) = 10.5$
$\text{Upper Outlier Boundary} = Q_3 + 1.5 \cdot \text{IQR} = 118 + 1.5(43) = 182.5$

d) Outliers

Lower Outliers: None (There are no individual data points smaller than the lower boundary of 10.5.)
Upper Outliers: 198 (Yankees) (This data value is bigger than the upper boundary of 182.5.)
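The arithmetic in this example can be checked with a few lines of Python. This is only a sketch that mirrors the hand calculation: the variable names are illustrative, and the positions (8th, 15th and 16th, 23rd) are taken directly from the steps above rather than computed by a general quartile routine.

```python
# 2012 MLB team payrolls (in millions), already sorted from lowest to highest.
payrolls = [
    55, 55, 61, 61, 63, 64, 74, 75,
    78, 78, 81, 81, 82, 82, 83, 88,
    93, 94, 95, 97, 98, 110, 118, 118,
    121, 132, 154, 173, 175, 198,
]

# a) 5 number summary. Positions are 1-based in the handout,
#    so position k is list index k - 1.
minimum = payrolls[0]
q1 = payrolls[8 - 1]                                # position 7.5 rounds up to 8
median = (payrolls[15 - 1] + payrolls[16 - 1]) / 2  # average of the 2 middle values
q3 = payrolls[23 - 1]                               # position 22.5 rounds up to 23
maximum = payrolls[-1]

# b) Interquartile range.
iqr = q3 - q1

# c) Outlier boundaries.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# d) Outliers: values below the lower boundary or above the upper boundary.
outliers = [p for p in payrolls if p < lower or p > upper]

print("5 number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
print("Boundaries:", lower, upper)
print("Outliers:", outliers)
```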
Constructing a Box Plot – Construct a Boxplot for the data set in the previous example. Determine whether the data set is symmetric or skewed.

[Figure: box plot above an axis labeled “MLB Team Payrolls (in millions)” running from 55 to 195, with $Q_1$, the Median (85.5), and $Q_3$ (118) marked on the box and the outlier marked with an “x”.]

On the low side, draw the whisker out to the smallest data value that is larger than the lower boundary. On the high side, draw the whisker out to the largest data value that is smaller than the upper boundary. Mark outliers with an “x”. This data set is skewed RIGHT.

Try this on your own – Construct a Boxplot for the following data set by finding the 5 number summary, the IQR, the outlier boundaries, and any outliers (if they exist).

Data Set

  8.2    8.8    9.2   10.6   12.7
  8.4    9.0    9.7   11.6   14.0
  8.5    9.2   10.4   11.8   15.9
  8.8    9.2   10.5   12.6   16.1

Answers:

5 Number Summary: $\text{Min} = 8.2$, $Q_1 = 8.9$, $\text{Median} = 10.05$, $Q_3 = 12.2$, $\text{Max} = 16.1$
$\text{IQR} = 3.3$
$\text{Lower Outlier Boundary} = 3.95$
$\text{Upper Outlier Boundary} = 17.15$
Outliers: None

Box Plot:

[Figure: box plot above an axis running from 8 to 17, with whiskers ending at 8.2 and 16.1 and the box marked at 8.9, 10.05, and 12.2.]
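The whisker endpoints for both data sets can be found with a short Python sketch. The helper names `whisker_ends` and `box_skew` are made up for this illustration, and the quartiles are passed in from the answers above rather than recomputed. The skew check is a rough heuristic that the handout does not spell out: it only compares the two halves of the box on either side of the median. Values that fall exactly on a boundary are treated as non-outliers here, which matches the outlier definition (strictly smaller or strictly larger than a boundary).

```python
def whisker_ends(data, q1, q3):
    """Return (lower whisker end, upper whisker end) for a boxplot.

    Each whisker stops at the most extreme data value that is not an outlier.
    """
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    inside = [x for x in data if lower <= x <= upper]
    return min(inside), max(inside)


def box_skew(q1, median, q3):
    """Rough skew reading (an assumption, not the handout's rule):
    compare the upper half of the box with the lower half."""
    upper_half = q3 - median
    lower_half = median - q1
    if upper_half > lower_half:
        return "right"
    if upper_half < lower_half:
        return "left"
    return "symmetric"


# MLB payroll example: Q1 = 75, Median = 85.5, Q3 = 118.
payrolls = [
    55, 55, 61, 61, 63, 64, 74, 75,
    78, 78, 81, 81, 82, 82, 83, 88,
    93, 94, 95, 97, 98, 110, 118, 118,
    121, 132, 154, 173, 175, 198,
]
print(whisker_ends(payrolls, 75, 118))  # (55, 175)
print(box_skew(75, 85.5, 118))          # right

# "Try this on your own" data: Q1 = 8.9, Q3 = 12.2.
try_this = [
    8.2, 8.4, 8.5, 8.8, 8.8, 9.0, 9.2, 9.2, 9.2, 9.7,
    10.4, 10.5, 10.6, 11.6, 11.8, 12.6, 12.7, 14.0, 15.9, 16.1,
]
print(whisker_ends(try_this, 8.9, 12.2))  # (8.2, 16.1)
```

In the payroll example the upper whisker stops at 175 rather than 198, because 198 lies beyond the upper boundary of 182.5 and is drawn as a separate “x”.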
