
Statistics I Chapter 2: Univariate data analysis


Contents

• Graphical displays for categorical data (barchart, piechart)

• Graphical displays for numerical data (histogram, polygon, boxplot)

• Numerical measures to describe:

    • central tendency (mean, median, mode)

    • location (quartiles, percentiles)

    • variation (variance, standard deviation, quasi-variance and quasi-standard deviation, range, IQR, coefficient of variation)

Recommended reading

• Peña, D., Romo, J., 'Introducción a la Estadística para las Ciencias Sociales'

    • Chapters 4, 5

• Newbold, P., 'Estadística para los Negocios y la Economía' (2009)

    • Chapter 2

Graphical presentation of data

Once we have a frequency distribution of the data, the following graphical displays can be obtained:

Categorical        Numerical
     ⇓                 ⇓
• piechart         • histogram
• barchart         • polygon
                   • boxplot

Graphs for qualitative data: piechart

Example 1: The frequency table below corresponds to the data representing blood types reported for a sample of 40 individuals.

Class    Absolute frequency    Relative frequency
A                12                  0.300
B                11                  0.275
AB                8                  0.200
O                 9                  0.225
Total            40                  1

Piechart

Example 1 cont.:

• Each slice is a fraction of the total size of the pie
• Many software packages order the slices alphabetically
• Although visually appealing, piecharts are harder to read than barcharts
• Avoid 3D piecharts: slices in the background appear smaller than slices of the same size in the foreground

[Figure: piechart of blood types: A 30%, B 27.5%, AB 20%, O 22.5%]

Graphs for qualitative data: barchart

Example 2: The frequency table below corresponds to levels of satisfaction for 901 employees.

Class    Absolute     Relative     Cumulative absolute    Cumulative relative
         frequency    frequency    frequency              frequency
VU           62         0.07              62                    0.07
U           108         0.12             170                    0.19
S           319         0.35             489                    0.54
VS          412         0.46             901                    1
Total       901         1
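The columns of the table in Example 2 can be rebuilt from the absolute frequencies alone. The following is a minimal sketch of that calculation; the variable names are illustrative choices, not part of the original slides.

```python
from itertools import accumulate

# Absolute frequencies from Example 2 (class labels as in the table).
classes = ["VU", "U", "S", "VS"]
absolute = [62, 108, 319, 412]

n = sum(absolute)                                      # sample size
relative = [count / n for count in absolute]           # f_i = n_i / n
cum_absolute = list(accumulate(absolute))              # N_i = n_1 + ... + n_i
cum_relative = [total / n for total in cum_absolute]   # F_i = N_i / n

# Print one row per class, rounded to two decimals as in the table.
for label, n_i, f_i, N_i, F_i in zip(classes, absolute, relative,
                                     cum_absolute, cum_relative):
    print(f"{label:<3} {n_i:>4} {f_i:>5.2f} {N_i:>4} {F_i:>5.2f}")
```

The last cumulative absolute frequency always equals the sample size, and the last cumulative relative frequency equals 1, which is a quick check on any hand-built table.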

Barchart

Example 2 cont.:

• Bars have the same width and are equally spaced, with heights corresponding to the frequencies
• There are gaps between the bars
• Bars are labeled with class names
• Many software packages order the bars alphabetically

[Figure: barchart of absolute frequencies for classes VU, U, S, VS; vertical axis FREQUENCY, 0 to 400]

Barchart

• Barcharts can also be constructed for discrete data if there are not too many distinct values
• This is a barchart for Example 3 of Ch. 1, where we looked at the number of leaves attacked by a pest in a sample of 50 plants

[Figure: barchart with vertical axis FREQUENCY (0 to 12) and horizontal axis values 0 to 10]

Graphs for quantitative data: histogram and polygon

Example 4: The frequency distribution of the daily high temperature (in degrees Fahrenheit) reported on 20 winter days is as follows:

Class interval    Midpoint    n_i    f_i     N_i    F_i
[10, 20)             15        3     0.15     3     0.15
[20, 30)             25        6     0.30     9     0.45
[30, 40)             35        5     0.25    14     0.70
[40, 50)             45        4     0.20    18     0.90
[50, 60)             55        2     0.10    20     1
Total                         20     1

Histogram and polygon

• There are no gaps between the bars (bins)
• Bin widths = widths of the class intervals (here identical); class boundaries are marked on the horizontal axis
• Bin heights = frequencies (here, absolute)
• Bin areas are proportional to the frequencies

[Figure: histogram with the frequency polygon overlaid; vertical axis FREQUENCIES (0 to 6), horizontal axis TEMP (F) (0 to 70)]

Histogram with area of 1 (on a density scale)

• Bin widths = widths of the class intervals (not necessarily identical)
• Bin heights = $f_i / (l_i - l_{i-1})$, where $l_{i-1}$ and $l_i$ are the lower and upper limits of the $i$-th class interval
• Bin areas = $f_i$, so the TOTAL AREA = 1
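As a rough illustration of the density scale, the sketch below computes the bin heights for the temperature data of Example 4 and checks that the bin areas add up to 1. The list names are hypothetical; the class limits and relative frequencies come from the frequency table.

```python
# Class limits and relative frequencies from Example 4 (temperatures in F).
limits = [10, 20, 30, 40, 50, 60]           # l_0, l_1, ..., l_5
rel_freq = [0.15, 0.30, 0.25, 0.20, 0.10]   # f_1, ..., f_5

# Width of each class interval: l_i - l_{i-1}.
widths = [upper - lower for lower, upper in zip(limits, limits[1:])]

# Density-scale bin height: f_i / (l_i - l_{i-1}).
heights = [f / w for f, w in zip(rel_freq, widths)]

# Each bin's area is height * width = f_i, so the areas add up to 1.
total_area = sum(h * w for h, w in zip(heights, widths))

print([round(h, 3) for h in heights])
print(round(total_area, 10))
```

Because every interval here has width 10, the heights are simply the relative frequencies divided by 10. With unequal widths the same code still gives a total area of 1, which is the point of the density scale.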

[Figure: density-scale histogram; horizontal axis TEMP (F), 0 to 70]

Describing data numerically

Center        Location         Variation
  ⇓              ⇓                 ⇓
• mean        • quartiles      • range
• median      • percentiles    • interquartile range
• mode                         • variance
                               • standard deviation
                               • coeff. of variation

New notation:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$$

($\sum$: sum; $i = 1$: the lower limit; $n$: the upper limit; $x_i$: an example of a formula depending on $i$)

Example:

$$\sum_{i=-1}^{3} i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 15$$
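For readers who find code easier to parse than sigma notation, the example sum can be written as a short loop. This is only an illustrative sketch; the variable name is an arbitrary choice.

```python
# Sum of i^2 for i = -1, 0, 1, 2, 3.
# range() excludes its upper bound, so the upper limit 3 becomes 4 here.
total = sum(i ** 2 for i in range(-1, 4))
print(total)  # 15
```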

Central tendency: (arithmetic) mean

• The most common measure of central tendency

• Population mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \dots + x_N}{N}$$

• Sample mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$

• If $a$, $b$ ($b \neq 0$) are real numbers and $y = a + bx$, then
$$\bar{y} = a + b\bar{x}$$

• Affected by extreme values (outliers)

Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200
$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3 \qquad \bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6$$

Central tendency: median

• In the ordered list, the median $M$ is the middle number

$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd (the middle number)} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even (the average of the two middle numbers)} \end{cases}$$

(The notation $x_{(1)}, x_{(2)}, \dots, x_{(n)}$ means that the observations are ranked in increasing order, e.g. $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$)

• Not affected by outliers

Example: Given observations 3, 1, 5, 4, 2 ($n = 5$), first rank the data: 1, 2, 3, 4, 5; then identify the middle number, which is the 3rd smallest:

$$M = x_{((5+1)/2)} = x_{(3)} = 3$$

Example: Given observations 3, 1, 5, 4, 2, 0 ($n = 6$), first rank the data: 0, 1, 2, 3, 4, 5; then identify the middle numbers and take the average of the 3rd and 4th:

$$M = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{2 + 3}{2} = 2.5$$
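The definitions of the mean and the median can be sketched in a few lines of code, which also makes the effect of an extreme value easy to see. The function names below are illustrative; the data are the X and Y samples from the mean example and the two median examples.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ranked data (average of the two middle ones if n is even)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]                       # x_((n+1)/2) in 0-based indexing
    return (ranked[mid - 1] + ranked[mid]) / 2   # average of x_(n/2) and x_(n/2+1)


x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]   # same data with one extreme value

print(mean(x), median(x))
print(mean(y), median(y))
print(median([3, 1, 5, 4, 2, 0]))
```

Replacing 2 by 200 moves the mean from 3 to 42.6, while the median only moves from 3 to 4.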

Central tendency: mode

• The value that occurs most often
• Not affected by outliers
• Used for either numerical or categorical data
• There may be no mode, or there may be several modes

Example: Given observations 3, 1, 5, 4, 2, there is no mode
Example: Given observations 3, 1, 5, 4, 2, 1, the mode is 1

Shape: comparing mean and median

Three types of distributions:

• Skewed to the left: Mean < Median
• Symmetric: Mean = Median
• Skewed to the right: Median < Mean

[Figure: three density curves: LEFT-SKEWED ($\bar{x} < M$), SYMMETRIC ($\bar{x} = M$), RIGHT-SKEWED ($M < \bar{x}$)]

Note: The distribution in the middle is known as bell-shaped or normal

Quartiles and percentiles

• Quartiles split the ranked data into four segments with an equal number of values per segment
• The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
• The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$
• The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$

Example: Given observations 22, 18, 17, 16, 16, 13, 12, 21, 11 ($n = 9$), first rank the data: 11, 12, 13, 16, 16, 17, 18, 21, 22; then identify the positions (a fractional position is rounded up here):

$$Q_1 = x_{(2.5)} = x_{(3)} = 13 \qquad Q_2 = x_{(5)} = 16 \qquad Q_3 = x_{(7.5)} = x_{(8)} = 21$$

• $k$th percentile, $k = 1, 2, \dots, 99$: $P_k = x_{(k(n+1)/100)}$

Example cont.: 60th percentile $= x_{(60(9+1)/100)} = x_{(6)} = 17$

Variation: range and interquartile range (IQR)

• The range is the simplest measure of variation
$$R = x_{\max} - x_{\min}$$

• Ignores the way the data are distributed

• Sensitive to outliers

Example: Given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: Given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$

• The interquartile range (IQR) avoids some of these problems: discard the high and low observations and calculate the range of the middle 50% of the data
$$\mathrm{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$
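The position rule for quartiles and percentiles, together with the range and the IQR, can be sketched as follows for the nine observations of the quartile example. This is a minimal sketch under one explicit assumption: a fractional position such as 2.5 is rounded half up to the next rank, which is how the worked example treats $x_{(2.5)}$ and $x_{(7.5)}$. Other textbooks and software interpolate between neighbouring ranks instead, so results may differ slightly. The function name `percentile` is a hypothetical helper.

```python
import math


def percentile(values, k):
    """k-th percentile via the position rule k(n+1)/100 on the ranked data.

    Assumption: a fractional position is rounded half up (2.5 -> 3);
    other conventions interpolate between neighbouring ranks.
    """
    ranked = sorted(values)
    n = len(ranked)
    position = k * (n + 1) / 100         # 1-based position in the ranked list
    index = math.floor(position + 0.5)   # round half up
    index = min(max(index, 1), n)        # keep the position inside 1..n
    return ranked[index - 1]             # convert to 0-based indexing


data = [22, 18, 17, 16, 16, 13, 12, 21, 11]

q1, q2, q3 = (percentile(data, k) for k in (25, 50, 75))
iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q2, q3)              # quartiles
print(percentile(data, 60))    # 60th percentile
print(iqr, data_range)         # IQR and range
```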

Variation: Interquartile range and boxplot

• Outliers are observations that fall
    • below the value of $Q_1 - 1.5 \cdot \mathrm{IQR}$, or
    • above the value of $Q_3 + 1.5 \cdot \mathrm{IQR}$
• For extreme outliers, replace 1.5 by 3 in the above definition

[Figure: boxplot marking $x_{\min}$, $Q_1$, the median ($Q_2$), $Q_3$ and $x_{\max}$; each of the four segments contains 25% of the data]
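The outlier rule can be sketched in code using the quartiles marked on the boxplot ($Q_1 = 24$, $Q_3 = 42$). The helper names `fences` and `is_outlier` are illustrative, and the value 75 is a hypothetical extra observation added only to show a flagged case.

```python
def fences(q1, q3, k=1.5):
    """Limits Q1 - k*IQR and Q3 + k*IQR (use k = 3 for extreme outliers)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(value, q1, q3, k=1.5):
    """True if the value falls below the lower limit or above the upper limit."""
    lower, upper = fences(q1, q3, k)
    return value < lower or value > upper


q1, q3 = 24, 42                  # quartiles read off the boxplot

print(fences(q1, q3))            # limits for outliers
print(fences(q1, q3, k=3))       # limits for extreme outliers
print(is_outlier(58, q1, q3))    # 58 is the maximum shown on the boxplot
print(is_outlier(75, q1, q3))    # 75 is a hypothetical extra observation
```

With these quartiles the limits are −3 and 69, so neither the minimum 12 nor the maximum 58 counts as an outlier.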

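Values marked on the boxplot: $x_{\min} = 12$, $Q_1 = 24$, median $= 31$, $Q_3 = 42$, $x_{\max} = 58$, so $\mathrm{IQR} = 42 - 24 = 18$

Measure of variation: variance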

• Average of squared deviations of values from the mean

• Population variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

• Sample variance (divided by $n$; the second form is faster to calculate)
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$

• Sample quasi-variance, or corrected sample variance (divided by $n - 1$)
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$

• They are related via
$$\hat{\sigma}^2 = \frac{n - 1}{n}\, s^2$$

• If $a$, $b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$
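The "faster to calculate" form of the numerator follows from expanding the square and using $\sum_{i=1}^{n} x_i = n\bar{x}$. A short derivation, added here as a worked step rather than taken from the slides:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{align*}
```

Dividing this common numerator by $n$ gives $\hat{\sigma}^2$ and dividing by $n - 1$ gives $s^2$, which is also why the two are related by the factor $(n - 1)/n$.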

Measure of variation: standard deviation (SD)

• The most commonly used measure of spread

• Population standard deviation, sample standard deviation and sample quasi-standard deviation are, respectively,
$$\sigma = \sqrt{\sigma^2} \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad s = \sqrt{s^2}$$

• Shows variation about the mean

• Has the same units as the original data, whilst the variance is in squared units (units$^2$)

• Variance and SD are both affected by outliers

Calculating variance and standard deviation

Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$
$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$
$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381$$
$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258$$
$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670$$
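The hand calculation above can be checked with a short script that applies the same shortcut formula to the three samples. This is a sketch only; the function names are illustrative and not part of the slides.

```python
import math


def quasi_variance(values):
    """Quasi-variance s^2 via the shortcut (sum of squares - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum(v ** 2 for v in values)
    return (sum_of_squares - n * mean ** 2) / (n - 1)


def sample_variance(values):
    """Sample variance (divided by n), obtained from s^2 through the factor (n - 1) / n."""
    n = len(values)
    return (n - 1) / n * quasi_variance(values)


samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

# Print the quasi-variance and quasi-standard deviation of each sample.
for name, values in samples.items():
    s2 = quasi_variance(values)
    print(name, round(s2, 4), round(math.sqrt(s2), 4))
```

All three samples share the mean 15.5, so the differences in $s$ reflect only how tightly the values cluster around it.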

Comparing standard deviations

Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

[Figure: dot plots of X, Y and Z on a common axis from 11 to 21, labeled $\bar{x} = 15.5$, $s_x = 3.3$; $\bar{y} = 15.5$, $s_y = 0.9$; $\bar{z} = 15.5$, $s_z = 4.6$]

Numerical summaries and frequency tables. Standardization.

• If the data are discrete, with $k$ distinct values $x_1, \dots, x_k$ and absolute frequencies $n_1, \dots, n_k$, then
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$

• If the data are continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals

• To standardize a variable $x$ means to calculate
$$\frac{x - \bar{x}}{s}$$

• If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed values $z_1, \dots, z_n$, then the $z$'s have mean zero and standard deviation one

• Standardization = finding the z-score
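Both ideas in this section lend themselves to a quick numerical check. The sketch below first applies the frequency-table formulas to the temperature data of Example 4, using class midpoints in place of the $x_i$, and then standardizes sample X from the variance example to confirm that the z-scores have mean 0 and quasi-standard deviation 1. The function names are hypothetical helpers, not part of the slides.

```python
import math


def grouped_mean_and_quasi_variance(values, counts):
    """Mean and quasi-variance from a frequency table (values x_i, counts n_i)."""
    n = sum(counts)
    mean = sum(x * c for x, c in zip(values, counts)) / n
    sum_of_squares = sum(x ** 2 * c for x, c in zip(values, counts))
    return mean, (sum_of_squares - n * mean ** 2) / (n - 1)


def standardize(values):
    """z-scores (x - mean) / s, with s the quasi-standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]


# Example 4: class midpoints stand in for the x_i of continuous data.
# 55 is the midpoint of the last interval [50, 60).
midpoints = [15, 25, 35, 45, 55]
counts = [3, 6, 5, 4, 2]
mean, s2 = grouped_mean_and_quasi_variance(midpoints, counts)
print(mean, round(s2, 3))

# Standardize sample X and recompute the mean and quasi-SD of the z-scores.
z = standardize([11, 12, 13, 16, 16, 17, 18, 21])
z_mean = sum(z) / len(z)
z_s = math.sqrt(sum((v - z_mean) ** 2 for v in z) / (len(z) - 1))
print(round(z_mean, 10), round(z_s, 10))
```

Note that the grouped results are approximations of the raw-data values, since every observation in a class is treated as if it sat at the class midpoint.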

Empirical rule

If the data are bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:

• About 68% of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$

• About 95% of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$

• About 99.7% of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: This rule is also known as the 68-95-99.7 rule

Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95% of the observations.

95% of the $x_i$'s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$

Measure of variation: coefficient of variation (CV)

• Measures relative variation and is defined as
$$CV = \frac{s}{|\bar{x}|}$$

• Is a unitless number (sometimes given as a percentage)

• Shows variation relative to the mean

Example:
Stock A: Average price last year = 50, Standard deviation = 5
Stock B: Average price last year = 100, Standard deviation = 5
$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$
Both stocks have the same SD, but stock B is less variable relative to its mean price
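The stock comparison can be reproduced with a small helper. This is a sketch; the function name is illustrative, and the guard against a zero mean is an added assumption, since the CV is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return sd / abs(mean)


cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B
print(cv_a, cv_b)
```

Multiplying the result by 100 gives the CV as a percentage, here 10% for stock A and 5% for stock B.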