Boxplots, Interquartile Range, and Outliers

Boxplots provide a visual representation of a data set that can be used to determine whether the data set is symmetric or skewed. Constructing a boxplot requires calculating the "5 number summary" and the interquartile range (IQR), and checking for the presence of any outliers.

5 Number Summary – The 5 number summary for a data set includes the following, which are listed in order from smallest to largest –

1. Minimum - The smallest value in the data set.
2. First Quartile - Separates the lowest 25% of the data in a set from the highest 75%. It is typically denoted as $Q_1$, where $\frac{25}{100} \cdot (\text{\# points in data set}) = \text{position of } Q_1 \text{ in set}$.
3. Median - The middle value in a sorted (smallest to largest) data set. If there is an even number of values, it is calculated by averaging the two middle values. The Median is also referred to as the Second Quartile ($Q_2$) because it separates the lower 50% of data in a set from the upper 50%.
4. Third Quartile - Separates the lowest 75% of the data in a set from the highest 25%. It is typically denoted as $Q_3$, where $\frac{75}{100} \cdot (\text{\# points in data set}) = \text{position of } Q_3 \text{ in set}$.
5. Maximum - The largest value in the data set.

IQR - The Interquartile Range is a measure of spread used to calculate the lower and upper outlier boundaries. These boundaries are then used to determine whether a data set has any outliers.

$\text{Interquartile Range } (IQR) = Q_3 - Q_1$

$\text{Lower Outlier Boundary} = Q_1 - 1.5 \cdot IQR$

$\text{Upper Outlier Boundary} = Q_3 + 1.5 \cdot IQR$

Outliers - Outliers are data points that are considerably smaller or larger than most of the other values in a data set. Data values that are smaller than the lower outlier boundary or larger than the upper outlier boundary are outliers. Some data sets do not have any outliers. Outliers that are determined to be the result of an error should be removed from the data set.
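The boundary formulas and the outlier test can be sketched in a few lines of Python. This is only an illustration: the function names `outlier_bounds` and `find_outliers` and the small sample list are made up for this sketch and do not come from the handout's data sets.

```python
def outlier_bounds(q1, q3):
    """Return (IQR, lower boundary, upper boundary) from the two quartiles."""
    iqr = q3 - q1
    return iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Return the values smaller than the lower boundary or larger than the upper one."""
    _, lower, upper = outlier_bounds(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical sample values and quartiles, chosen only to illustrate the rule.
sample = [1, 10, 12, 13, 14, 15, 16, 30]
iqr, lower, upper = outlier_bounds(q1=11, q3=15.5)
print(iqr, lower, upper)                       # 4.5 4.25 22.25
print(find_outliers(sample, q1=11, q3=15.5))   # [1, 30]
```

Any value strictly outside the two boundaries is flagged; a value exactly equal to a boundary is not.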

Example – For the following data set (2012 data for MLB team payrolls in millions), find a) the 5 number summary, b) the IQR, c) the upper and lower outlier boundaries, and d) any outliers. Note – data should be sorted from lowest to highest if it is not provided that way. This allows the easy identification of the min, max, median, and individual data positions within the set.

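| # | Team | Payroll | # | Team | Payroll | # | Team | Payroll | # | Team | Payroll |
|---|------|---------|---|------|---------|---|------|---------|---|------|---------|
| 1 | Padres | 55 | 9 | Rockies | 78 | 17 | Mets | 93 | 25 | Rangers | 121 |
| 2 | Athletics | 55 | 10 | Indians | 78 | 18 | Twins | 94 | 26 | Tigers | 132 |
| 3 | Astros | 61 | 11 | Nationals | 81 | 19 | Dodgers | 95 | 27 | Angels | 154 |
| 4 | Royals | 61 | 12 | Orioles | 81 | 20 | W Sox | 97 | 28 | Red Sox | 173 |
| 5 | Pirates | 63 | 13 | Mariners | 82 | 21 | Brewers | 98 | 29 | Phillies | 175 |
| 6 | Rays | 64 | 14 | Reds | 82 | 22 | Cardinals | 110 | 30 | Yankees | 198 |
| 7 | D Backs | 74 | 15 | Braves | 83 | 23 | Giants | 118 | | | |
| 8 | Blue Jays | 75 | 16 | Cubs | 88 | 24 | Marlins | 118 | | | |

a) 5 Number Summary – These values can be calculated by hand (shown below) or they can be found using the "1-Var Stats" button from the Stat Menu on a TI-83 or TI-84 calculator.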

Minimum = 55

Position of $Q_1$: $\frac{25}{100}(30) = 7.5$ → 8th position, so $Q_1 = 75$ (25 represents the 25th percentile; 30 is the number of data points in the set)

Median = $\frac{83 + 88}{2} = 85.5$ (the average of the 2 middle data points)

Position of $Q_3$: $\frac{75}{100}(30) = 22.5$ → 23rd position, so $Q_3 = 118$ (75 represents the 75th percentile; 30 is the number of data points in the set)

Maximum = 198

If the “position” calculation results in a decimal, round up to the next whole number to determine the position.

If the calculation results in a whole number, average that position's data value with the next data value.
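The two position rules above can be written as a short Python sketch. The names `quartile` and `five_number_summary` are hypothetical helpers for this illustration. Library routines such as `statistics.quantiles` use other interpolation conventions and may return slightly different quartiles than this by-hand rule.

```python
def quartile(sorted_data, percent):
    """Value at the given percent using the position rule described above."""
    n = len(sorted_data)
    # position = (percent / 100) * n; work in whole numbers to avoid rounding issues
    whole, remainder = divmod(percent * n, 100)
    if remainder:
        # Decimal position: round up to the next whole position.
        # Position whole + 1 (1-based) is index `whole` (0-based).
        return sorted_data[whole]
    # Whole-number position: average that position's value with the next value.
    return (sorted_data[whole - 1] + sorted_data[whole]) / 2


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    s = sorted(data)
    return s[0], quartile(s, 25), quartile(s, 50), quartile(s, 75), s[-1]


# 2012 MLB team payrolls (in millions) from the example.
payrolls = [55, 55, 61, 61, 63, 64, 74, 75, 78, 78, 81, 81, 82, 82, 83,
            88, 93, 94, 95, 97, 98, 110, 118, 118, 121, 132, 154, 173, 175, 198]

print(five_number_summary(payrolls))  # (55, 75, 85.5, 118, 198)
```

Applying the same rule at 50% reproduces the median definition: an even count gives a whole-number position and averages the two middle values, while an odd count rounds up to the single middle value.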

b) IQR: $IQR = Q_3 - Q_1 = 118 - 75 = 43$

c) Upper and Lower Outlier Boundaries –

$\text{Lower Outlier Boundary} = Q_1 - 1.5 \cdot IQR = 75 - 1.5(43) = 10.5$

$\text{Upper Outlier Boundary} = Q_3 + 1.5 \cdot IQR = 118 + 1.5(43) = 182.5$

d) Outliers –

Lower Outliers: None (There are no individual data points smaller than the lower boundary of 10.5.)

Upper Outliers: 198 (Yankees) (This data value is bigger than the upper boundary of 182.5.)

Constructing a Boxplot – Construct a boxplot for the data set in the previous example. Determine whether the data set is symmetric or skewed.

[Figure: boxplot of MLB Team Payrolls (in millions) on an axis running from 55 to 195. The box spans $Q_1$ = 75 to $Q_3$ = 118, with the median marked at 85.5. The left whisker is drawn out to the smallest data value that is larger than the lower boundary, and the right whisker is drawn out to the largest data value that is smaller than the upper boundary. Outliers are marked with an "x". This data set is skewed right.]
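The whisker rule for this boxplot can be checked with a small Python sketch. The helper name `whisker_ends` is hypothetical, and the final skew comparison is only a rough heuristic assumed for this illustration, not a rule stated in the handout.

```python
# 2012 MLB team payrolls (in millions) and the values found in the example.
payrolls = [55, 55, 61, 61, 63, 64, 74, 75, 78, 78, 81, 81, 82, 82, 83,
            88, 93, 94, 95, 97, 98, 110, 118, 118, 121, 132, 154, 173, 175, 198]
median = 85.5
lower_bound, upper_bound = 10.5, 182.5


def whisker_ends(data, lower_bound, upper_bound):
    """Smallest value above the lower boundary and largest value below the upper one."""
    inside = [x for x in data if lower_bound < x < upper_bound]
    return min(inside), max(inside)


left_end, right_end = whisker_ends(payrolls, lower_bound, upper_bound)
outliers = [x for x in payrolls if x < lower_bound or x > upper_bound]
print(left_end, right_end, outliers)  # 55 175 [198]

# Rough heuristic (an assumption): compare how far the plot extends
# on each side of the median; a longer right side suggests right skew.
left_span = median - left_end     # 30.5
right_span = right_end - median   # 89.5
if right_span > left_span:
    shape = "skewed right"
elif left_span > right_span:
    shape = "skewed left"
else:
    shape = "roughly symmetric"
print(shape)  # skewed right
```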

Try this on your own - Construct a boxplot for the following data set by finding the 5 number summary, the IQR, the outlier boundaries, and any outliers (if they exist).

Data Set

8.2   8.8   9.2   10.6   12.7
8.4   9.0   9.7   11.6   14.0
8.5   9.2   10.4  11.8   15.9
8.8   9.2   10.5  12.6   16.1

Answers:

5 Number Summary: Min = 8.2, $Q_1$ = 8.9, Median = 10.05, $Q_3$ = 12.2, Max = 16.1

IQR = 3.3

Lower Outlier Boundary = 3.95

Upper Outlier Boundary = 17.15

Outliers = None
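For reference, the steps behind these answers can be worked out as follows, with the 20 values sorted from smallest to largest. Both quartile positions come out as whole numbers here, so each quartile is the average of the value at that position and the next one (5th and 6th values for $Q_1$, 15th and 16th for $Q_3$); the median averages the 10th and 11th values.

```latex
\begin{align*}
\text{Position of } Q_1 &= \tfrac{25}{100}(20) = 5
  &&\Rightarrow\; Q_1 = \tfrac{8.8 + 9.0}{2} = 8.9 \\
\text{Median} &= \tfrac{9.7 + 10.4}{2} = 10.05 \\
\text{Position of } Q_3 &= \tfrac{75}{100}(20) = 15
  &&\Rightarrow\; Q_3 = \tfrac{11.8 + 12.6}{2} = 12.2 \\
IQR &= Q_3 - Q_1 = 12.2 - 8.9 = 3.3 \\
\text{Lower Outlier Boundary} &= Q_1 - 1.5 \cdot IQR = 8.9 - 1.5(3.3) = 3.95 \\
\text{Upper Outlier Boundary} &= Q_3 + 1.5 \cdot IQR = 12.2 + 1.5(3.3) = 17.15
\end{align*}
```

Every data value lies between 3.95 and 17.15, so there are no outliers.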

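Box Plot:

[Figure: boxplot on an axis running from 8 to 17. The whiskers end at 8.2 and 16.1, the box spans $Q_1$ = 8.9 to $Q_3$ = 12.2, and the median is marked at 10.05.]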