Statistics I Chapter 3: Bivariate Data Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Statistics I Chapter 3: Bivariate data analysis Chapter 3: Bivariate data analysis Contents 3.1 Two-way tables I Bivariate data I Definition of a two-way table I Joint absolute/relative frequency distribution I Marginal frequency distribution I Conditional frequency distribution 3.2 Correlation I Graphical display: scatterplot I Types of relationships between numerical variables I Measures of linear association Chapter 2: Bivariate data analysis Recommended reading I Pe~na,D., Romo, J., 'Introducci´ona la Estad´ısticapara las Ciencias Sociales' I Chapters 7, 8, 9 I Newbold, P. 'Estad´ısticapara los Negocios y la Econom´ıa’(2009) I Chapter 12 Bivariate data I Bivariate data is obtained when we observe two characteristics (numerical or categorical) of one individual/object. I Notation: A sample of n bivariate observations will be written as n pairs (x1; y1); (x2; y2);:::; (xn; yn): Summarizing bivariate data: two-way table I In order to summarize the data, we create two-dimensional table ... I with classes/class intervals of one variable in the rows ... I and classes/class intervals of the other in the columns I The corresponding frequencies (absolute or relative) appear in the cells on the cross-section of the given row and the given column I When one variable is qualitative, the two-way table is called a contingency table Two-way table: joint absolute frequency distribution I Two-way table with k rows and m columns Y y1 ··· yj ··· ym Total x1 n11 ··· n1j ··· n1m n1• . X xi ni1 ··· nij ··· nim ni• . xk nk1 ··· nkj ··· nkm nk• Total n•1 ··· n•j ··· n•m n•• I Notation: nij absolute frequency of cell (i; j) Row i total: ni• = ni1 + ni2 + ··· + nim Column j total: n•j = n1j + n2j + ··· + nkj n•• sample size Joint absolute frequency distribution Example Sixty four men were selected and the following two variables were considered Y = Color of man's mother's eyes fBright; Darkg X = Color of man's eyes fBright; Darkg (x1; y1) = (B; B); (x2; y2) = (B; D);:::; (x64; y64) = (D; D) Y Bright Dark Total X Bright 23 12 35 Dark 17 12 29 Total 40 24 64 I (x2; y2) = (B; D) means that the 2nd man had Bright eyes and his mother had Dark eyes I 23 of men had B eyes and their mothers also had B eyes I 29 of men had Dark eyes Two-way table: relative joint frequency distribution I fij = nij =n•• relative frequency of cell (i; j) Y y1 ··· yj ··· ym Total x1 f11 ··· f1j ··· f1m f1• . Xx i fi1 ··· fij ··· fim fi• . xk fk1 ··· fkj ··· fkm fk• Total f•1 ··· f•j ··· f•m 1 I Notation: fij relative frequency of cell (i; j) Row i total: fi• = fi1 + fi2 + ··· + fim Column j total: f•j = f1j + f2j + ··· + fkj n•• sample size Joint frequency distribution: absolute and relative Example Y = Weekly number of visits to the cinema X = Weekly number of visits to the theater (absolute frequencies in black, relative in blue) Y 0 1 2 3 4 0 12 0.2795 0.1164 0.0932 0.0471 0.023 1 4 0.0933 0.0702 0.0471 0.0230 0.000 X 2 3 0.0703 0.0702 0.0470 0.0000 0.000 3 1 0.0230 0.0000 0.0000 0.0000 0.000 Marginal frequency distribution: absolute and relative I The marginal frequencies take us back to the univariate case (we ignore the other variable) I We obtain them by looking at the margins of the joint frequency table (absolute or relative) I Specifically, the relative marginal frequencies of X are defined as fi• = fi1 + ··· + fij + ··· + fim and those of Y are f•j = f1j + ··· + fij + ··· + fkj I Absolute marginal frequencies are obtained analogously Marginal frequency distribution Example Y = Number of workers X = Number of sales Y 1-24 25-49 50-74 75-99 Total 1-100 0.293 0.122 0.098 0.049 0.561 X 101-200 0.098 0.073 0.049 0.024 0.244 201-300 0.073 0.073 0.049 0.000 0.195 Total 0.463 0.268 0.195 0.073 1 Marginal (relative) frequencies of Y Workers 1-24 25-49 50-74 75-99 f•j 0.463 0.268 0.195 0.073 Marginal (relative) frequencies of X Sales 1-100 101-200 201-300 fi• 0.561 0.244 0.195 Conditional frequency distribution I The conditional distribution of Y , given that X = xi (notation f (Y jX = xi )) is obtained by calculating fij fi• for j-th class/class interval of Y (we narrow our data set to the observations satisfying the condition) I The conditional distribution of X , given that Y = yj is obtained by calculating for i-th class/class interval of X fij f•j I Exactly the same conditional distribution is obtained if we replace the relative frequencies by the absolute frequencies in the above definitions I Conditional frequencies always add up to 1! Conditional frequency distribution Example Y = Number of workers X = Number of sales Find the conditional frequency distribution for sales given that the number of workers is between 50 and 74. Y 1-24 25-49 50-74 75-99 Total 1-100 0.293 0.122 0.098 0.049 0.561 X 101-200 0.098 0.073 0.049 0.024 0.244 201-300 0.073 0.073 0.049 0.000 0.195 Total 0.463 0.268 0.195 0.073 1 The conditional frequency distribution f (X j50 ≤ Y ≤ 74) Sales 1-100 101-200 201-300 0:098 0:049 0:049 f (xj50 ≤ y ≤ 74) 0.500(= 0:195 ) 0.250(= 0:195 ) 0.250(= 0:195 ) Conditional frequency distribution Example Y = Number of workers X = Number of sales Find the conditional frequency distribution for workers when the number of sales is between 101 and 200. Y 1-24 25-49 50-74 75-99 Total 1-100 0.293 0.122 0.098 0.049 0.561 X 101-200 0.098 0.073 0.049 0.024 0.244 201-300 0.073 0.073 0.049 0.000 0.195 Total 0.463 0.268 0.195 0.073 1 The conditional frequency distribution f (Y j101 ≤ X ≤ 200) Workers 1-24 25-49 50-74 75-99 f (yj101 ≤ x ≤ 200) 0.402 0.299 0.201 0.098 Graphics for bivariate numerical data: scatterplot size price 107 162657 ● ● 114 165554 ● ● ● 91 154506 ● 165000 100 162103 ● ● ● ● 96 158271 ● 107 166925 104 161917 160000 100 161149 ● 80 152263 81 151878 Price of a house (euro) 105 165678 ● 155000 111 166696 ● 108 165387 ● 97 161806 80 85 90 95 100 105 110 115 106 163824 Size of a house (m^2) Types of relationships There are different ways in which two variables may be related: (i) absence of relationship (ii) negative linear relationship (iii) positive linear relationship and (iv) nonlinear relationship NO RELATIONSHIP NEGATIVE LINEAR 3 ● ● 3 ● ● ● ● ● ● 2 ●● ● 2 ● ● ● ● ● ● ●● 1 ● ● ● ●● 1 ● ● ● ●● ● ● ● ● ● ●● ● ●● 0 ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● 0 ● ● ●● ● ● ● ● ● ●●● ● ● ● −1 ● ● ● ● ● ● ● ● −1 ● ● ● ● ● ● ● ● ● ● ● −2 ● ● ● −2 ● ● ● ● ● −3 −2 −1 0 1 2 −2 −1 0 1 2 POSITIVE LINEAR NONLINEAR ● ● ●● ● ● ●●● 2 ● ● ● ● 0 ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● 1 ●● ● ● ● ● ●● ● ● ● ● ● ● ● −2 ● ●● 0 ●● ● ● ● ● ● ● ● ● ● ●● ● −4 ● −1 ● ● ● ●● ● ● ● ● ● −2 ● −6 −3 −8 ● ● −4 −3 −2 −1 0 1 2 −3 −2 −1 0 1 2 Measures of linear association: covariance I We look for a descriptive measure that would indicate whether there is a linear relationship between x and y I Sample covariance is defined as Pn i=1 xi yi − nx¯y¯ zn }| { ! 1 X s = (x − x¯)(y − y¯) xy n − 1 i i i=1 I Covariance and linear transformation: if we transform Y variable according to Z = a + bY , then the covariance between X and the new variable Z is sxz = bsxy Properties of the covariance I If the covariance is 'much larger than 0', it's because there exists a positive linear relationship between the variables I If the covariance is 'much smaller than 0', it's because there exists a negative linear relationship I If the covariance is 'small', it's because: (i) the linear relationship does not exists; or (ii) the relationship is nonlinear. I Drawbacks: I What does it mean large/small? I What are the units of covariance? Measures of linear association: correlation I Correlation overcomes the issues with covariance I Sample correlation is defined as sxy r(x;y) = sx sy I Properties: I The correlation is bounded: −1 ≤ r(x;y) ≤ 1 , so the terms large and small now make sense I It is unitless Measures of linear association: correlation Example Three variables are measured over 91 countries: female life expectancy, male life expectancy and gross national product. I The covariances between the three pairs of two variables are 105.15, 50066.04 and 57917.93, respectively. I On the other hand, the correlations between the three pairs of two variables are 0.98, 0.64 y 0.65, respectively. I Therefore, even if the covariances between male and female live expectancy and gross national product are larger than the covariance between male and female live expectancies, the correlation is larger for these last two variables..