NUMERICAL DESCRIPTIVE MEASURES

Wish to describe data using summary statistics

1. Measures of Central Tendency (what is the middle of the data?)

Three Options: Arithmetic Mean, Median, Mode (ignore geometric mean in text)

1.a. Arithmetic Mean

Most common measure of central tendency: the “average”

When referring to sample values, denoted as \( \bar{X} \)
When referring to population values, denoted as \( \mu \)

Sample mean:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

where n is sample size

Population mean:

\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N} \]

where N is population size

Example: raw data: 1 3 5 7 9
mean = (1+3+5+7+9)/5 = 5

Arithmetic Mean is affected by outliers

[Figure: two dot plots. Left panel (axis 0 to 10): Mean = 5. Right panel (axis 0 to 14): Mean = 6.]
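The calculation of the mean, and its sensitivity to an outlier, can be sketched in a few lines of Python. The first data set is the example from these notes; the second, in which 9 is replaced by 19, is a hypothetical illustration and not the data behind the figure.

```python
from statistics import mean

raw_data = [1, 3, 5, 7, 9]            # example from the notes
sample_mean = mean(raw_data)          # (1+3+5+7+9)/5

# Hypothetical outlier (not from the notes): replace 9 with 19
with_outlier = [1, 3, 5, 7, 19]
outlier_mean = mean(with_outlier)     # (1+3+5+7+19)/5

print(sample_mean)    # 5
print(outlier_mean)   # 7
```

Changing a single observation moves the mean from 5 to 7, even though the other four values are unchanged.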

Chapter 3 - 1

1.b. Median

Raw data is arrayed in ascending order. The MEDIAN is the middle of the data, i.e., a value such that half of the observations lie below and half lie above the value.
If n or N is odd, the median is the middle number.
If n or N is even, the median is the average of the two middle numbers.

Example: a sample of 10 house prices (in thousands) yields:

144 98 204 177 155 316 100 177 177 170

To find the median, arrange in ascending order:

98 100 144 155 170 177 177 177 204 316

Since even number of observations: median = (170+177)/2 = 173.5

Note: arithmetic mean = 171.8
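As a rough sketch, the odd/even rule for the median can be written as a short Python function. The function name `median` is chosen here for illustration; the data are the 10 house prices from the example.

```python
def median(values):
    """Median by the odd/even rule: sort, then take the middle."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd: the middle number
        return ordered[mid]
    # even: average of the two middle numbers
    return (ordered[mid - 1] + ordered[mid]) / 2

house_prices = [144, 98, 204, 177, 155, 316, 100, 177, 177, 170]  # in thousands

print(median(house_prices))                     # 173.5
print(sum(house_prices) / len(house_prices))    # 171.8
```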

Median is often preferred when data are skewed, since not affected by extreme values:

Ex. Housing prices: 98 100 100 102 110 125 350

Median = 102 Mean = 140.7

[Figure: two dot plots. Left panel (axis 0 to 10): Median = 5. Right panel (axis 0 to 14): Median = 5.]
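To see how differently the two measures react to an extreme value, the housing price example can be recomputed with and without the price of 350. Dropping that observation is only an illustration of the idea, not a recommended way of handling data.

```python
from statistics import mean, median

prices = [98, 100, 100, 102, 110, 125, 350]          # housing prices from the notes
without_extreme = [p for p in prices if p != 350]    # illustration: drop the extreme value

print(median(prices), round(mean(prices), 1))                     # 102 140.7
print(median(without_extreme), round(mean(without_extreme), 1))   # 101.0 105.8
```

The median moves by 1, while the mean falls by roughly 35.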

1.c. Mode

Value that occurs most often
There may be no mode or there may be more than one mode.

Unaffected by extreme values
Ex. In the sample of 10 house prices above, mode = 177

[Figure: two dot plots. One panel (axis 0 to 14): Mode = 9. Other panel (axis 0 to 6): No Mode.]

Mode is infrequently used, but has applications in quality control.
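A minimal sketch of finding the mode, covering the cases of one mode, no mode, and more than one mode. The helper name `modes` and the convention of returning an empty list when no value repeats are assumptions made for this illustration; the two-mode data set is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:                      # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

house_prices = [144, 98, 204, 177, 155, 316, 100, 177, 177, 170]

print(modes(house_prices))       # [177]
print(modes([1, 3, 5, 7, 9]))    # []      (no mode)
print(modes([1, 1, 2, 2, 3]))    # [1, 2]  (hypothetical data with two modes)
```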

Which is the best measure to use? It depends. If the histogram of the data is symmetric about the middle, all three measures give similar results. If the histogram is skewed right or left, the measures will differ.

Choose in context. The remainder of the course will focus on the mean (e.g., what is the mean income of Canadians? what is the mean life expectancy of a component?) and how to estimate the population mean using the sample mean.

2. Measures of Variation/Dispersion

Need a single number describing how “spread out” the data are.

Many options: discuss range, MAD, Variance

2.a. Range

Simply the distance between the highest and lowest values
Not very informative: ignores all but two observations:

Ex. 1, 9, 9.5, 9.5, 10   Range = 9
Ex. 1, 3, 7, 9, 10   Range = 9
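The two examples can be checked with a short sketch. The function name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Distance between the highest and lowest values."""
    return max(values) - min(values)

clustered = [1, 9, 9.5, 9.5, 10]   # most values bunched near the top
spread = [1, 3, 7, 9, 10]          # values spread more evenly

print(value_range(clustered))   # 9
print(value_range(spread))      # 9
```

Both data sets give the same range even though the observations between the extremes are arranged quite differently.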

2. b. Interquartile range (ignore)

2. c. Mean Absolute Deviation (MAD)

Not in text, but simple idea. Conceptually straightforward idea is to measure how spread out the data are from the middle by reporting the average distance from the middle to each observation. Let the mean be the measure of the middle.

Ex. 4 observations: 3 4 5 6
Mean = (3+4+5+6)/4 = 4.5
Obs. 1 lies 1.5 units away from mean
Obs. 2 lies 0.5 units away
Obs. 3 lies 0.5 units away
Obs. 4 lies 1.5 units away

Therefore, average distance away from mean is (1.5+0.5+0.5+1.5)/4 = 1

This measure is Mean Absolute Deviation (MAD). Can be computed using the formula (for population)

\[ \text{MAD} = \frac{\sum_{i=1}^{N} |X_i - \mu|}{N} \]

Absolute value operators are difficult to work with, but are required, since the simple mean of deviations will always equal zero. An alternative to absolute value operators is to square the deviations before adding and dividing by the no. of observations – then positives do not cancel negatives. Leads to an important concept in the course …
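A small sketch using the four observations 3, 4, 5, 6 shows both points: the plain mean of the deviations is zero, while the mean of the absolute deviations gives the MAD. Variable names are illustrative.

```python
observations = [3, 4, 5, 6]
mu = sum(observations) / len(observations)            # 4.5

deviations = [x - mu for x in observations]           # -1.5, -0.5, 0.5, 1.5
mean_deviation = sum(deviations) / len(deviations)    # positives cancel negatives
mad = sum(abs(d) for d in deviations) / len(deviations)

print(mean_deviation)   # 0.0
print(mad)              # 1.0
```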

2.d. Variance

Defined: the average squared deviation from the mean.

Computed differently in population than in sample.

Population Variance:

\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

Sample Variance:

\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

Example of calculation:

Observation   Deviation from Mean   Squared Deviation
3             -1.5                  2.25
4             -0.5                  0.25
5              0.5                  0.25
6              1.5                  2.25
Total                               5

Therefore population variance = 5/4 = 1.25
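The table can be reproduced step by step and cross-checked against Python's standard `statistics` module. Treating the same four values as a sample (dividing by n - 1) is an assumption made here only to illustrate the difference between the two formulas.

```python
from statistics import pvariance, variance

observations = [3, 4, 5, 6]
mu = sum(observations) / len(observations)               # 4.5
squared_devs = [(x - mu) ** 2 for x in observations]     # 2.25, 0.25, 0.25, 2.25
total = sum(squared_devs)                                # 5.0

pop_var = total / len(observations)           # divide by N
samp_var = total / (len(observations) - 1)    # divide by n - 1 (data treated as a sample)

print(pop_var)                   # 1.25
print(pvariance(observations))   # 1.25, library cross-check
print(round(samp_var, 3))        # 1.667
print(round(variance(observations), 3))   # 1.667, library cross-check
```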

Interpretation of variance is difficult, since it is expressed in squared units of measurement

Most frequently used measure of variation/dispersion …

STANDARD DEVIATION - square root of variance

In Population: \( \sigma = \sqrt{\sigma^2} \)

In Sample: \( s = \sqrt{s^2} \)
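Continuing with the observations 3, 4, 5, 6 as an illustration, the standard deviation is simply the square root of the corresponding variance. The sketch below uses the standard `statistics` module; treating the data as a sample in the second call is an assumption for comparison only.

```python
from statistics import pstdev, stdev

observations = [3, 4, 5, 6]

sigma = pstdev(observations)   # square root of the population variance 1.25
s = stdev(observations)        # square root of the sample variance 5/3

print(round(sigma, 3))   # 1.118
print(round(s, 3))       # 1.291
```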

[Figure: three dot plots on a common axis from 11 to 21, each with Mean = 15.5]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.9258
Data C: Mean = 15.5, s = 4.57

The magnitude of the Standard Deviation is meaningful. Standard Deviation uses original units of measurement.

Value of standard deviation interpreted using the Empirical Rule: If the data are fairly symmetric about the mean (i.e., histogram is bell-shaped), then the interval

\( \mu \pm 1\sigma \) contains approximately 68% of the observations

\( \mu \pm 2\sigma \) contains approximately 95% of the observations

\( \mu \pm 3\sigma \) contains approximately 99.7% of the observations

Ex. Suppose final grades in class of 150 statistics students are distributed in a symmetric manner about the mean, with μ = 72 and σ = 9 . Does this indicate a wide variation in grades?

In order to capture approximately 68% of final grades, we require a range from 72 – 9 = 63 to 72 + 9 = 81, and approximately 95% of students will have grades between 72 – 2(9) = 54 and 72 + 2(9) = 90.
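The intervals in the grades example follow directly from the mean and standard deviation. A minimal sketch, with the helper name `interval` chosen for illustration:

```python
def interval(mu, sigma, k):
    """Return the interval (mu - k*sigma, mu + k*sigma)."""
    return (mu - k * sigma, mu + k * sigma)

mu, sigma = 72, 9   # final-grade example from the notes

print(interval(mu, sigma, 1))   # (63, 81)
print(interval(mu, sigma, 2))   # (54, 90)
print(interval(mu, sigma, 3))   # (45, 99)
```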

2.e. Coefficient of Variation

We may wish to compare the dispersion of two sets of data. If they have different means or are measured in different units, the standard deviations are not comparable. In these cases, use measures of dispersion relative to the mean …

CV = (Standard Deviation/Mean) x 100%

where the standard deviation and mean are either both population values or both sample values

Note: measured in percentages

Ex. Stock A: mean price last year = $50
Std. dev. of prices last year = $5

Stock B: mean price last year = $100
Std. dev. of prices last year = $5

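Calculate CVs to compare degree of risk:

Stock A: CV = (5/50) x 100% = 10%

Stock B: CV = (5/100) x 100% = 5%

Both stocks have the same standard deviation, but Stock A is more variable relative to its mean price.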
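The stock comparison can be sketched as a one-line function. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (standard deviation / mean) x 100."""
    return 100.0 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B

print(cv_a)   # 10.0
print(cv_b)   # 5.0
```

Because the CV is unit free, the same function applies when the two data sets are measured in different units.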

3. Measures of Shape

Measures of skewness indicate shape of distribution.

Left-Skewed: Mean < Median < Mode
Symmetric: Mean = Median = Mode
Right-Skewed: Mode < Median < Mean
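As a rough heuristic only, the direction of skew can be guessed by comparing the mean with the median. This is not a formal measure of skewness, and equal mean and median do not guarantee a symmetric distribution. The function name `skew_direction` is an assumption for illustration; the data are examples from earlier in these notes.

```python
from statistics import mean, median

def skew_direction(values):
    """Rough heuristic: compare the mean with the median."""
    m, md = mean(values), median(values)
    if m > md:
        return "right-skewed"
    if m < md:
        return "left-skewed"
    return "symmetric"

housing = [98, 100, 100, 102, 110, 125, 350]   # mode 100 < median 102 < mean 140.7

print(skew_direction(housing))           # right-skewed
print(skew_direction([1, 3, 5, 7, 9]))   # symmetric
```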

4. Measures of Correlation

Require a measure of how closely correlated two variables are.

Coefficient of Correlation indicates type and strength of linear relationship in bivariate data.

Denoted r

n (xi  x)(yi  y) r  i1 n n 2 2 (xi  x) (yi  y) i1 i1

Coefficient of correlation is unit free, ranges from -1 to +1
-1 indicates perfect negative linear relationship
+1 indicates perfect positive linear relationship

[Figure: five scatter plots of Y against X, with r = -1, r = -0.6, r = 0, r = 0.6 and r = 1]
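A minimal sketch of the coefficient of correlation, computed from the sums of cross-products and squared deviations about the two means. The function name `correlation` and both data sets are hypothetical, chosen so that the two extreme cases r = +1 and r = -1 can be checked by hand.

```python
from math import sqrt

def correlation(xs, ys):
    """Coefficient of correlation r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (not from the notes)
x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]      # y = 2x + 1, perfect positive linear relationship
y_down = [10, 8, 6, 4, 2]    # y = 12 - 2x, perfect negative linear relationship

print(round(correlation(x, y_up), 6))     # 1.0
print(round(correlation(x, y_down), 6))   # -1.0
```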
