Chapter 3: Data Description

Objectives:

❑ Using and understanding measures of central tendency.

❑ Using and understanding measures of variation.

❑ Identifying the position of a data value using measures of position.

❑ Understanding exploratory data analysis.

Overview of Chapter 3

Sec. #    Title                           Page(s)
3 – 1     Measures of Central Tendency    111 – 127
3 – 2     Measures of Variation           128 – 147
3 – 3     Measures of Position            148 – 167
3 – 4     Exploratory Data Analysis       168 – 184

3 – 1: Measures of Central Tendency

 A parameter is a measure obtained by using all the data values from a specific population.

 A statistic is a measure obtained by using the data values from a sample.

 What is (are) the difference(s) between a parameter and a statistic?

The Mean (The Arithmetic Average)

 The sample (population) mean is the sum of the values divided by the sample (population) size $n$ ($N$).

 Sample mean: $\bar{x} = \frac{\sum x}{n}$

 Population mean: $\mu = \frac{\sum x}{N}$
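
As a quick illustration, the mean formula can be sketched in Python. The function name `mean` is only an illustrative choice, and the data are the six values that appear later in Example 3 – 20.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

data = [9, 10, 14, 7, 8, 3]   # values from Example 3 - 20
print(mean(data))             # 51 / 6 = 8.5
```

The same function serves for a sample or a population; only the interpretation of the result ($\bar{x}$ or $\mu$) differs.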

 Example 3 – 1 and Example 3 – 2, page 112.

The Median

 The median (MD) is the midpoint of the data array.

[Figure: the ordered data split by the median (MD), with 50% of the values on each side.]

The Median (cont.)

 Steps to calculate the median are found on page 115.
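
The textbook steps are not reproduced on this slide, so the sketch below assumes the usual procedure: sort the data, then take the middle value, or the average of the two middle values when the count is even. The function name is illustrative; the ten values are the data set used later in Example 3 – 30, and the five-value list is hypothetical.

```python
def median(values):
    """Midpoint of the sorted data (assumed standard procedure)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values

print(median([18, 15, 12, 6, 8, 2, 3, 5, 20, 10]))  # even count: (8 + 10) / 2
print(median([9, 10, 14, 7, 8]))                    # odd count (hypothetical data)
```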

 Example 3 – 4, page 115 and Example 3 – 5, pages 115-116.

The Mode

 The mode is the value or values that occur most often in a data set.
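
A minimal sketch of finding the mode(s) in Python follows. It assumes the convention that a data set in which no value repeats has no mode; the function name `modes` and all three data sets are hypothetical, chosen only to show each case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: every value occurs once, so no mode
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical data sets
print(modes([8, 9, 9, 14, 10]))    # one mode
print(modes([3, 3, 5, 7, 7, 9]))   # two modes
print(modes([4, 6, 8, 10]))        # no mode
```

Because the function only counts occurrences, it also works for qualitative data such as a list of category labels.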

[Figure: a data set can be unimodal, bimodal, multimodal, or have no mode; see Example 3-6 (p. 118), Example 3-7 (p. 119), and Example 3-8 (p. 119).]

Properties of Central Tendency (pp. 120 – 121)

Property                           The Mean   The Median           The Mode
Easy to compute                    3          2                    1
Uniqueness                         ✓          ✓                    ✗
Use all data                       ✓          ✗                    ✗
Affected by outliers               ✓          Less than the mean   ✗
Used in case of qualitative data   ✗          ✗                    ✓

The Weighted Mean

 The weighted mean is calculated as follows:

$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wx}{\sum w}$

where $x_1, \dots, x_n$ are the data values and $w_1, \dots, w_n$ are the corresponding weights.

 What is the relationship between the weighted mean and mean?
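
One way to explore this question is to compute both quantities. The sketch below uses the values and weights from Example 3 – 14 later in this section; the function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of w*x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

x = [4, 2, 3, 1]   # data values from Example 3 - 14
w = [3, 3, 4, 2]   # corresponding weights

print(round(weighted_mean(x, w), 2))   # 32 / 12, about 2.67
print(weighted_mean(x, [1, 1, 1, 1]))  # equal weights give the ordinary mean, 10 / 4
```

With equal weights the formula reduces to the ordinary mean, since every value then counts the same.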

 Exercise: Try calculating your GPA using the weighted mean! (Hint: w = units, x = grade points)

The Weighted Mean (cont.)

 Steps to calculate the weighted mean:
1. Determine the data values and the weights.
2. Create a three-column table as shown below.
3. Compute $\bar{x}_w = \frac{\sum wx}{\sum w}$.

$x$      $w$        $wx$
$x_1$    $w_1$      $w_1 x_1$
$x_2$    $w_2$      $w_2 x_2$
⋮        ⋮          ⋮
$x_n$    $w_n$      $w_n x_n$
∑        $\sum w$   $\sum wx$

Example 3 – 14 (page 120)

$x$   $w$   $wx$
4     3     12
2     3     6
3     4     12
1     2     2
∑     12    32

$\bar{x}_w = \frac{\sum wx}{\sum w} = \frac{32}{12} \approx 2.67$

Distribution Shapes

 If the distribution is symmetric, then Mean = Median = Mode

Distribution Shapes (cont.)

 If the distribution is positively (right) skewed, then Mean > Median > Mode

Distribution Shapes (cont.)

 If the distribution is negatively (left) skewed, then Mean < Median < Mode

Distribution Shapes (cont.)

 Using MegaStat, we can calculate a skewness coefficient.

Distribution Shapes (cont.)

 If skewness = 0, then the distribution is symmetric.

 If skewness > 0, then the distribution is positively skewed.

 If skewness < 0, then the distribution is negatively skewed.

3 – 2: Measures of Variation

 Measures of variation are used to examine the variability of variables.

 Measures of variation are always non-negative (i.e., ≥ 0).

 Large values of measures of variation indicate high variability, while values close to 0 indicate low variability (i.e., homogeneous data).

Range

 The range is the difference between the highest value and the lowest value.
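
For illustration, the range can be computed in one line of Python. The function is named `data_range` (an illustrative name) to avoid shadowing Python's built-in `range`, and the data are the six values used later in Example 3 – 20.

```python
def data_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

data = [9, 10, 14, 7, 8, 3]   # values from Example 3 - 20
print(data_range(data))       # 14 - 3 = 11
```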

 Example 3 – 16 and Example 3 – 17, page 129.

Variance and standard deviation

 The variance is the average of the squares of the distances of the values from the mean.

 Important rule! $\text{standard deviation} = \sqrt{\text{variance}}$

Population variance and standard deviation

 The population variance is given by

$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

 The population standard deviation is given by

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

Sample variance and standard deviation (cont.)

 The sample variance is given by

Rule #1: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Rule #2: $s^2 = \frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}$

Sample variance and standard deviation

 The sample standard deviation is given by

Rule #1: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Rule #2: $s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$

Example 3 – 20: Teacher Strikes (Rule #1)
Example 3 – 21: Teacher Strikes (Rule #2)

 Find the sample variance and the standard deviation for the following data: 9, 10, 14, 7, 8, 3
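
Before working through the steps by hand, both rules can be checked numerically. The sketch below is only a cross-check under the formulas stated above; the variable names are illustrative, and `statistics.variance` from the Python standard library is used as an independent comparison.

```python
import math
import statistics

data = [9, 10, 14, 7, 8, 3]
n = len(data)
x_bar = sum(data) / n

# Rule #1: squared distances from the sample mean, divided by n - 1
s2_rule1 = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Rule #2: shortcut form that needs only the sum of x and the sum of x squared
s2_rule2 = (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

s = math.sqrt(s2_rule1)

print(s2_rule1, s2_rule2)          # both should be about 13.1
print(statistics.variance(data))   # library value for comparison
print(round(s, 3))                 # about 3.619
```

Rule #2 avoids computing the mean first, which is convenient by hand; the two rules are algebraically equivalent.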

 Using Rule #1: Step 1. Calculate the sample mean: $\bar{x} = \frac{51}{6} = 8.5$

Example 3 – 20 (Rule #1)

Step 2.

$x$    $x - \bar{x}$      $(x - \bar{x})^2$
9      9 – 8.5 = 0.5      0.25
10     1.5                2.25
14     5.5                30.25
7      -1.5               2.25
8      -0.5               0.25
3      -5.5               30.25
∑      0                  65.5

Example 3 – 20 (cont.)

Step 3. Calculate the sample variance and standard deviation as follows:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{65.5}{5} = 13.1$

$s = \sqrt{13.1} \approx 3.619$

Example 3 – 21 (Rule #2)

 Using Rule #2: Step 1.

$x$    $x^2$
9      81
10     100
14     196
7      49
8      64
3      9
∑ 51   499

Example 3 – 21 (cont.)

Step 2. Calculate the sample variance and standard deviation as follows:

$s^2 = \frac{n \sum x^2 - (\sum x)^2}{n(n - 1)} = \frac{6(499) - 51^2}{6(5)} = \frac{2994 - 2601}{30} = \frac{393}{30} = 13.1$

$s = \sqrt{13.1} \approx 3.619$

Coefficient of Variation

 The coefficient of variation is denoted by CV or CVar.

 The coefficient of variation is used to compare variation of at least two variables when the measuring units are different.

Population: $\text{CVar} = \frac{\sigma}{\mu} \times 100\%$

Sample: $\text{CVar} = \frac{s}{\bar{x}} \times 100\%$

Example 3 – 23: Sales of Automobiles

Variable      Mean    Standard Deviation   Coefficient of Variation
Sales         87      5                    5.7%
Commissions   $5225   $773                 14.8%

 The commissions are more variable than the sales.
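
The two coefficients in the table can be reproduced with a short sketch. The function name `cvar` is an illustrative choice; the means and standard deviations are the ones given in Example 3 – 23.

```python
def cvar(std_dev, mean):
    """Coefficient of variation, expressed as a percentage."""
    return std_dev / mean * 100

sales = cvar(5, 87)            # number of cars sold
commissions = cvar(773, 5225)  # dollars

print(round(sales, 1), round(commissions, 1))   # about 5.7 and 14.8
print(commissions > sales)                      # commissions vary more, relative to their mean
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is a count and the other is in dollars.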

 Example 3 – 24, page 139.

3 – 3: Measures of Position

 Measures of position are used to determine the locations of data values and outliers. We will consider:
1. Standard scores (z-scores).
2. Percentiles.
3. Quartiles.

Standard Score (z-score)

 The standard score (z-score) is defined as

$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$

 The above score represents the number of standard deviations that a data value falls above or below the mean.

Example 3 – 27: Test Scores (page 148)

Course     Score (value)   Mean   Standard Deviation   z
Calculus   65              50     10                   1.5
History    30              25     5                    1.0
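
The z column of this table can be reproduced with a small sketch; the function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

calculus = z_score(65, 50, 10)
history = z_score(30, 25, 5)

print(calculus, history)    # 1.5 and 1.0
print(calculus > history)   # the calculus score sits relatively higher
```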

 Since the z-score for calculus is larger, the student's relative position in the calculus class is higher than in the history class.

Example 3 – 28: Test Scores (page 149)

Test   Score (value)   Mean   Standard Deviation   z
A      38              40     5                    -0.4
B      94              100    10                   -0.6

 The score for test A is relatively higher than the score for test B.

Percentile

 Percentiles divide the data set into 100 equal groups.

 Note that $P_i \le P_{i+1}$ for all $i$.

Percentile (cont.)

 Let $n$ denote the sample size, $X$ a data value, and $P$ a percentile. Usually we deal with two types of problems.

 Problem #1: Finding the percentile rank, $n, X \to P$

 Problem #2: Finding the data value, $n, P \to X$

Problem #1: $n, X \to P$

 After sorting the data in ascending order, we use the following rule:

$P = \frac{(\text{number of values less than } X) + 0.5}{n} \times 100$

Example 3 – 30 (page 153)

 Consider the following data: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10

 Find the percentile rank of 12.

Example 3 – 30 (cont.)

 Step 1. Sort the data in ascending order: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20

 Step 2. Count how many values are less than 12.

 Step 3. Since 6 values are less than 12: $P = \frac{6 + 0.5}{10} \times 100 = 65$, i.e., the 65th percentile.

Example 3 – 30 (cont.)

Question: The percentile rank of 12 is the 65th percentile. What does this mean?

Answer: This means that a student who scores 12 did better than 65% of the class.
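
The percentile-rank rule used in this example can be sketched as a short function. The name `percentile_rank` is an illustrative choice; the data are those of Example 3 – 30.

```python
def percentile_rank(values, x):
    """(number of values less than x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)   # counting does not require sorting first
    return (below + 0.5) * 100 / len(values)

data = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(data, 12))   # (6 + 0.5) / 10 * 100 = 65.0
```

Sorting the data, as in Step 1, makes the count easy to do by hand; the code reaches the same count by comparing each value with 12.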

 Example 3 – 31, page 154.

Problem #2: $n, P \to X$

Example 3 – 32 (page 154)

 Find the value corresponding to the 25th percentile of: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10

 Step 1. Arrange data. 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.

 Step 2. $c = \frac{n \cdot p}{100} = \frac{10 \cdot 25}{100} = 2.5 \approx 3$ (since $c$ is not a whole number, round it up).

Example 3 – 32 (cont.)

 Step 3. Count over to the 3rd value.

2, 3, 5, 6, 8, 10, 12, 15, 18, 20.

The 3rd value, 5, corresponds to the 25th percentile!

Example 3 – 33 (page 155)

 Find the value corresponding to the 60th percentile of: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10

 Step 1. Arrange data. 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.

 Step 2. $c = \frac{n \cdot p}{100} = \frac{10 \cdot 60}{100} = 6$

Example 3 – 33 (cont.)

 Step 3. Since $c = 6$ is a whole number, count over to the 6th value and find the mean of the 6th and 7th values.

2, 3, 5, 6, 8, 10, 12, 15, 18, 20.

$\frac{10 + 12}{2} = 11$

This value corresponds to the 60th percentile!

Quartile

 Quartiles are three numbers ($Q_1$, $Q_2$, $Q_3$) that divide the distribution into four groups.

 Note that $Q_1 \le Q_2 \le Q_3$.

Quartile (cont.)

 Is there a relationship between $Q_2$ and the median?

 What are the relationships between $Q_1$, $Q_2$, and $Q_3$ and percentiles?
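
One way to explore these questions is to code the counting procedure from Examples 3 – 32 and 3 – 33 and evaluate it at p = 25, 50, and 75. The sketch below assumes p is a whole number strictly between 0 and 100; the function name `percentile_value` is an illustrative choice.

```python
def percentile_value(values, p):
    """Data value at the p-th percentile, following c = n * p / 100."""
    if not 0 < p < 100:
        raise ValueError("p is assumed to lie strictly between 0 and 100")
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        # c is a whole number: average the c-th and (c + 1)-th values
        return (ordered[whole - 1] + ordered[whole]) / 2
    # c is not a whole number: round up and take that value
    return ordered[whole]

data = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_value(data, 25))   # c = 2.5, rounded up to 3: the 3rd value
print(percentile_value(data, 60))   # c = 6: mean of the 6th and 7th values
print([percentile_value(data, p) for p in (25, 50, 75)])
```

For this data set the value at p = 50 coincides with the median of the ten values, (8 + 10) / 2.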

 Example 3 – 34, page 156.

Interquartile range

 The interquartile range (IQR) is defined as the difference between $Q_3$ and $Q_1$, that is, $\text{IQR} = Q_3 - Q_1$, and is the range of the middle 50% of the data.

Outliers

 An outlier is an extremely high or low data value when compared with the rest of the data values.

 To detect an outlier:
1. Obtain $Q_1$, $Q_3$, and IQR.
2. An outlier is any data value that is smaller than $Q_1 - 1.5 \times \text{IQR}$ or larger than $Q_3 + 1.5 \times \text{IQR}$.

3 – 4: Exploratory data analysis

 The five-number summary:
1. Minimum
2. $Q_1$
3. Median ($Q_2$)
4. $Q_3$
5. Maximum
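
A minimal sketch of the five-number summary, together with the 1.5 × IQR outlier rule from the previous section, is given below. It assumes that $Q_1$ and $Q_3$ are taken as the medians of the lower and upper halves of the sorted data; other quartile conventions exist and can give slightly different values. The function names are illustrative, and the extra value 50 is hypothetical, added only to trigger the outlier rule.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (quartiles as medians of the halves)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # values above it (middle value skipped when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def outliers(values):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # data from Example 3 - 30
print(five_number_summary(data))
print(outliers(data))           # no value falls outside the fences
print(outliers(data + [50]))    # the hypothetical value 50 is flagged
```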

 The above numbers are used to construct a boxplot.

Boxplot

[Figure: boxplot]

Boxplot (cont.)

 Symmetric distribution

 Positively (right) skewed distribution

 Negatively (left) skewed distribution

Application Summary

Measure | Excel only | Excel + MegaStat
Mean ✓
Median ✓
Mode ✓
Weighted mean ✓
Range ✓
Variance & Standard Dev. ✓
CVar ✓
Standard (Z) Score ✓
Percentile ✓
Quartiles, IQR, and outliers ✓
Boxplot ✓