Bivariate Data
David M. Lane et al., Introduction to Statistics, pp. 172-194
ioc.pdf [email protected] ICY0006: Lecture 3

Descriptive statistics
Descriptive statistics quantitatively describes the main features of a collection of information. It provides simple summaries of the observations that have been made. Such summaries may be either quantitative (summary statistics) or visual (simple-to-understand graphs). It involves two kinds of analysis:
Univariate analysis: describing the distribution of a single variable, including
- central tendency (mean, median, and mode)
- dispersion (range and quantiles of the data set, and measures of spread such as the variance and standard deviation)
- shape of the distribution (skewness and kurtosis)
Bivariate analysis: describing the relationship between a pair of variables. In this case, descriptive statistics include:
- cross-tabulations and contingency tables
- graphical representation via scatterplots
- quantitative measures of dependence
- descriptions of conditional distributions

Contents
1 Introduction to Bivariate Data
2 Pearson product-moment correlation
3 Variance Sum Law II
4 Computing r by R

Next section: 1 Introduction to Bivariate Data

Bivariate Data: more than one variable
Often, more than one variable is collected on each individual. In health studies, variables such as age, sex, height, weight, blood pressure, and total cholesterol are often measured on each individual. Economic studies may be interested in, among other things, personal income and years of education.
Bivariate data consists of data on two variables. Usually we are interested in the relationship between the variables.

Example: Do people tend to marry other people of about the same age?
Our experience tells us yes, but how good is the correspondence? One way to address the question is to look at pairs of ages for a sample of married couples (an excerpt from a dataset consisting of 282 pairs of spousal ages):
We see that, yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives.

Example: Histograms and means of spousal ages
Each distribution is fairly skewed with a long right tail. Not all husbands are older than their wives; this fact is lost when we separate the variables. The pairing within each couple is also lost by separating the variables. For example, based on the means alone, we cannot say what percentage of couples have husbands younger than their wives. Another example of information not available from the separate descriptions: what is the average age of husbands with 45-year-old wives? Finally, we do not know the relationship between the husband's age and the wife's age.

Visualization of Bivariate Data
A scatter plot displays the bivariate data in a graphical form that maintains the pairing. Scatter plots that show linear relationships between variables can differ in several ways, including the slope of the line about which they cluster and how tightly the points cluster about the line. This is a scatter plot of the paired ages (all 282 pairs):

Scatter plot
Two observations:
1 There is a strong relationship between the husband's age and the wife's age: the older the husband, the older the wife.
  - When one variable (Y) increases with the second variable (X), we say that X and Y have a positive association.
  - When Y decreases as X increases, we say that they have a negative association.
2 The points cluster along a straight line. When this occurs, the relationship is called a linear relationship.
  - There is a perfect linear relationship between two variables if all points of the scatter plot fall on a straight line.
  - The relationship is linear even if the points diverge from the line, as long as the divergence is random rather than systematic.
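The point that separate summaries lose the pairing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six age pairs below are made up (they are not taken from the 282-pair dataset), and the helper names `share_husband_younger`, `mean_husband_age_given_wife` and `association` are hypothetical.

```python
from statistics import mean

# Made-up paired ages for illustration only (NOT the 282-pair dataset).
husbands = [25, 30, 45, 47, 52, 60]
wives = [27, 28, 45, 45, 50, 55]


def share_husband_younger(h, w):
    """Fraction of couples in which the husband is younger than the wife."""
    return sum(hi < wi for hi, wi in zip(h, w)) / len(h)


def mean_husband_age_given_wife(h, w, wife_age):
    """Average husband's age among couples whose wife has the given age."""
    matched = [hi for hi, wi in zip(h, w) if wi == wife_age]
    return mean(matched) if matched else None


def association(h, w):
    """Direction of association, from the sign of the summed cross-products."""
    mh, mw = mean(h), mean(w)
    s = sum((hi - mh) * (wi - mw) for hi, wi in zip(h, w))
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "no linear association"


# Re-pair the same ages in reverse order: each variable's own summary
# (its mean, its histogram) is unchanged, but the pair-based answers change.
wives_reversed = wives[::-1]

print(mean(husbands), mean(wives), mean(wives_reversed))
print(share_husband_younger(husbands, wives),
      share_husband_younger(husbands, wives_reversed))
print(mean_husband_age_given_wife(husbands, wives, 45))
print(association(husbands, wives), association(husbands, wives_reversed))
```

Both pairings have identical means for each variable, yet the share of younger husbands and the direction of association differ, which is exactly the information a scatter plot preserves.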
Linear and non-linear relationships
[Figure panels: perfect negative relationship; non-linear relationship]

Weak relationship and no relationship
[Figure panels: weak positive relationship; no relationship]

Next section: 2 Pearson product-moment correlation

Pearson's correlation coefficient
Pearson's correlation coefficient is a statistical measure of the strength of a linear relationship between paired data. The symbol for Pearson's correlation is ρ when it is measured in the population and r when it is measured in a sample. (From here on, we deal with samples and use r.)

Definition
Let $X = \{x_1, \dots, x_N\}$ and $Y = \{y_1, \dots, y_N\}$ be two datasets (two samples) with means $M_X$ and $M_Y$ and sample standard deviations $s_X$ and $s_Y$ (computed with divisor $N - 1$), respectively. Then the sample Pearson correlation coefficient (or simply correlation coefficient) is defined by the formula
\[ r = \frac{\sum (X - M_X)(Y - M_Y)}{(N - 1)\, s_X s_Y}. \]
Substituting the formula for the standard deviation, we obtain the computational formula:
\[ r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{\sum (X - M_X)^2 \, \sum (Y - M_Y)^2}} = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}} \]

Computing Pearson's r: example
Here $N = 5$, $M_X = 4$, $M_Y = 9$, $\sum XY = 210$, $\sum X^2 = 96$ and $\sum Y^2 = 465$, so
\[ r = \frac{210 - 5 \cdot 4 \cdot 9}{\sqrt{(96 - 5 \cdot 4^2)(465 - 5 \cdot 9^2)}} = \frac{210 - 180}{\sqrt{16 \cdot 60}} = \frac{30}{\sqrt{960}} = 0.9682458 \]

Correlation coefficients
[Figure: correlation coefficients]

Properties of Pearson's r
- The Pearson correlation coefficient is symmetric: r = cor(X, Y) = cor(Y, X).
- r is restricted to the range $-1 \le r \le 1$.
- The Pearson correlation coefficient is invariant to separate changes in location and scale in the two variables.
That is, we may transform X to $a + bX$ and Y to $c + dY$, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (If b and d have opposite signs, r changes sign but keeps its magnitude.)
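The computational formula and the properties listed above can be checked with a short Python sketch. Two assumptions are worth flagging: the five data pairs are illustrative values chosen so that they reproduce the sums used in the worked example (N = 5, M_X = 4, M_Y = 9, ΣXY = 210, ΣX² = 96, ΣY² = 465), and `pearson_r` is a hypothetical helper name, not a library function.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation via the computational formula
    (sum XY - N*Mx*My) / sqrt((sum X^2 - N*Mx^2) * (sum Y^2 - N*My^2))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mx * my
    sxx = sum(x * x for x in xs) - n * mx * mx
    syy = sum(y * y for y in ys) - n * my * my
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Illustrative data (an assumption): chosen to match the sums in the example.
X = [1, 3, 5, 5, 6]
Y = [4, 6, 10, 12, 13]

r = pearson_r(X, Y)
print(round(r, 7))  # 0.9682458

# Symmetry: swapping the roles of X and Y leaves r unchanged.
r_swapped = pearson_r(Y, X)

# Location/scale invariance with positive scale factors (b = 3, d = 0.5).
r_rescaled = pearson_r([2 + 3 * x for x in X], [-1 + 0.5 * y for y in Y])

# A negative scale factor on one variable flips the sign of r.
r_flipped = pearson_r([-x for x in X], Y)

print(r_swapped, r_rescaled, r_flipped)
```

The computational form needs only running sums, which makes it convenient for hand calculation; for data with large values and small spread, the deviation-based form is usually the numerically safer choice.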