I Chapter 3: Bivariate analysis

Chapter 3: Bivariate data analysis

Contents 3.1 Two-way tables

I Bivariate data

I Definition of a two-way table

I Joint absolute/relative

I Marginal frequency distribution

I Conditional frequency distribution 3.2 Correlation

I Graphical display: scatterplot

I Types of relationships between numerical variables

I Measures of linear association Chapter 2: Bivariate data analysis

Recommended reading

I Pe˜na,D., Romo, J., ’Introducci´ona la Estad´ısticapara las Ciencias Sociales’

I Chapters 7, 8, 9

I Newbold, P. ’Estad´ısticapara los Negocios y la Econom´ıa’(2009)

I Chapter 12

Bivariate data

I Bivariate data is obtained when we observe two characteristics (numerical or categorical) of one individual/object.

I Notation: A sample of n bivariate observations will be written as n pairs (x1, y1), (x2, y2),..., (xn, yn). Summarizing bivariate data: two-way table

I In order to summarize the data, we create two-dimensional table ...

I with classes/class intervals of one variable in the rows ...

I and classes/class intervals of the other in the columns

I The corresponding frequencies (absolute or relative) appear in the cells on the cross-section of the given row and the given column

I When one variable is qualitative, the two-way table is called a

Two-way table: joint absolute frequency distribution

I Two-way table with k rows and m columns Y y1 ··· yj ··· ym Total x1 n11 ··· n1j ··· n1m n1• ...... X xi ni1 ··· nij ··· nim ni• ...... xk nk1 ··· nkj ··· nkm nk• Total n•1 ··· n•j ··· n•m n••

I Notation: nij absolute frequency of cell (i, j)

Row i total: ni• = ni1 + ni2 + ··· + nim Column j total: n•j = n1j + n2j + ··· + nkj

n•• sample size Joint absolute frequency distribution

Example Sixty four men were selected and the following two variables were considered Y = Color of man’s mother’s eyes {Bright, Dark} X = Color of man’s eyes {Bright, Dark}

(x1, y1) = (B, B), (x2, y2) = (B, D),..., (x64, y64) = (D, D)

Y Bright Dark Total X Bright 23 12 35 Dark 17 12 29 Total 40 24 64

I (x2, y2) = (B, D) means that the 2nd man had Bright eyes and his mother had Dark eyes

I 23 of men had B eyes and their mothers also had B eyes

I 29 of men had Dark eyes

Two-way table: relative joint frequency distribution

I fij = nij /n•• relative frequency of cell (i, j) Y y1 ··· yj ··· ym Total x1 f11 ··· f1j ··· f1m f1• ...... Xx i fi1 ··· fij ··· fim fi• ...... xk fk1 ··· fkj ··· fkm fk• Total f•1 ··· f•j ··· f•m 1

I Notation: fij relative frequency of cell (i, j)

Row i total: fi• = fi1 + fi2 + ··· + fim Column j total: f•j = f1j + f2j + ··· + fkj

n•• sample size Joint frequency distribution: absolute and relative

Example Y = Weekly number of visits to the cinema X = Weekly number of visits to the theater (absolute frequencies in black, relative in blue) Y 0 1 2 3 4 0 12 0.2795 0.1164 0.0932 0.0471 0.023 1 4 0.0933 0.0702 0.0471 0.0230 0.000 X 2 3 0.0703 0.0702 0.0470 0.0000 0.000 3 1 0.0230 0.0000 0.0000 0.0000 0.000

Marginal frequency distribution: absolute and relative

I The marginal frequencies take us back to the univariate case (we ignore the other variable)

I We obtain them by looking at the margins of the joint frequency table (absolute or relative)

I Specifically, the relative marginal frequencies of X are defined as

fi• = fi1 + ··· + fij + ··· + fim

and those of Y are

f•j = f1j + ··· + fij + ··· + fkj

I Absolute marginal frequencies are obtained analogously Marginal frequency distribution Example Y = Number of workers X = Number of sales Y 1-24 25-49 50-74 75-99 Total 1-100 0.293 0.122 0.098 0.049 0.561 X 101-200 0.098 0.073 0.049 0.024 0.244 201-300 0.073 0.073 0.049 0.000 0.195 Total 0.463 0.268 0.195 0.073 1

Marginal (relative) frequencies of Y Workers 1-24 25-49 50-74 75-99 f•j 0.463 0.268 0.195 0.073 Marginal (relative) frequencies of X Sales 1-100 101-200 201-300 fi• 0.561 0.244 0.195

Conditional frequency distribution

I The conditional distribution of Y , given that X = xi (notation f (Y |X = xi )) is obtained by calculating

fij fi• for j-th class/class interval of Y (we narrow our data set to the observations satisfying the condition)

I The conditional distribution of X , given that Y = yj is obtained by calculating for i-th class/class interval of X

fij f•j

I Exactly the same conditional distribution is obtained if we replace the relative frequencies by the absolute frequencies in the above definitions

I Conditional frequencies always add up to 1! Conditional frequency distribution

Example Y = Number of workers X = Number of sales Find the conditional frequency distribution for sales given that the number of workers is between 50 and 74. Y 1-24 25-49 50-74 75-99 Total 1-100 0.293 0.122 0.098 0.049 0.561 X 101-200 0.098 0.073 0.049 0.024 0.244 201-300 0.073 0.073 0.049 0.000 0.195 Total 0.463 0.268 0.195 0.073 1

The conditional frequency distribution f (X |50 ≤ Y ≤ 74) Sales 1-100 101-200 201-300 0.098 0.049 0.049 f (x|50 ≤ y ≤ 74) 0.500(= 0.195 ) 0.250(= 0.195 ) 0.250(= 0.195 )

Conditional frequency distribution

Example Y = Number of workers X = Number of sales Find the conditional frequency distribution for workers when the number of sales is between 101 and 200. Y 1-24 25-49 50-74 75-99 Total 1-100 0.293 0.122 0.098 0.049 0.561 X 101-200 0.098 0.073 0.049 0.024 0.244 201-300 0.073 0.073 0.049 0.000 0.195 Total 0.463 0.268 0.195 0.073 1

The conditional frequency distribution f (Y |101 ≤ X ≤ 200) Workers 1-24 25-49 50-74 75-99 f (y|101 ≤ x ≤ 200) 0.402 0.299 0.201 0.098 Graphics for bivariate numerical data: scatterplot

size price 107 162657 ● ● 114 165554 ● ● ● 91 154506 ● 165000 100 162103 ● ● ● ● 96 158271 ● 107 166925 104 161917 160000 100 161149 ● 80 152263

81 151878 Price of a house (euro) 105 165678 ● 155000

111 166696 ● 108 165387 ● 97 161806 80 85 90 95 100 105 110 115 106 163824 Size of a house (m^2)

Types of relationships There are different ways in which two variables may be related: (i) absence of relationship (ii) negative linear relationship (iii) positive linear relationship and (iv) nonlinear relationship

NO RELATIONSHIP NEGATIVE LINEAR 3 ● ● 3 ● ● ● ● ● ●

2 ●● ● 2 ● ● ● ● ● ● ●●

1 ● ● ● ●● 1 ● ● ● ●● ● ● ● ● ● ●● ● ●●

0 ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● 0 ● ● ●● ● ● ● ● ● ●●● ● ● ●

−1 ● ● ● ● ● ● ● ● −1 ● ● ● ● ● ● ● ● ● ● ●

−2 ● ● ● −2 ● ● ● ● ● −3 −2 −1 0 1 2 −2 −1 0 1 2

POSITIVE LINEAR NONLINEAR

● ● ●● ● ● ●●● 2 ● ● ● ● 0 ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●

1 ●● ● ● ● ● ●● ● ● ● ● ● ● ● −2 ● ●● 0 ●● ● ● ● ● ● ● ● ● ● ●● ● −4 ●

−1 ● ● ● ●● ● ● ● ● ● −2 ● −6 −3 −8 ● ● −4

−3 −2 −1 0 1 2 −3 −2 −1 0 1 2 Measures of linear association: covariance

I We look for a descriptive measure that would indicate whether there is a linear relationship between x and y

I Sample covariance is defined as Pn i=1 xi yi − nx¯y¯ zn }| { ! 1 X s = (x − x¯)(y − y¯) xy n − 1 i i i=1

I Covariance and linear transformation: if we transform Y variable according to Z = a + bY , then the covariance between X and the new variable Z is sxz = bsxy

Properties of the covariance

I If the covariance is ’much larger than 0’, it’s because there exists a positive linear relationship between the variables

I If the covariance is ’much smaller than 0’, it’s because there exists a negative linear relationship

I If the covariance is ’small’, it’s because: (i) the linear relationship does not exists; or (ii) the relationship is nonlinear.

I Drawbacks:

I What does it mean large/small?

I What are the units of covariance? Measures of linear association: correlation

I Correlation overcomes the issues with covariance

I Sample correlation is defined as

sxy r(x,y) = sx sy

I Properties:

I The correlation is bounded: −1 ≤ r(x,y) ≤ 1 , so the terms large and small now make sense

I It is unitless

Measures of linear association: correlation

Example Three variables are measured over 91 countries: female life expectancy, male life expectancy and gross national product.

I The covariances between the three pairs of two variables are 105.15, 50066.04 and 57917.93, respectively.

I On the other hand, the correlations between the three pairs of two variables are 0.98, 0.64 y 0.65, respectively.

I Therefore, even if the covariances between male and female live expectancy and gross national product are larger than the covariance between male and female live expectancies, the correlation is larger for these last two variables.