BIVARIATE

David M. Lane et al., Introduction to Statistics, pp. 172–194

ioc.pdf [email protected] ICY0006: Lecture 3 1 / 24

Descriptive statistics


Descriptive statistics quantitatively describes the main features of a collection of information. It provides simple summaries about the observations that have been made. Such summaries may be either quantitative (summary statistics) or visual (simple-to-understand graphs). It involves two kinds of analysis:

Univariate analysis: describing the distribution of a single variable, including
- central tendency (mean, median, and mode)
- dispersion (range and quantiles of the data set, and measures of spread such as the variance and standard deviation)
- shape of the distribution (skewness and kurtosis)

Bivariate analysis: more than one variable is involved, and we describe the relationship between pairs of variables. In this case, descriptive statistics include:
- Cross-tabulations and contingency tables
- Graphical representation via scatterplots
- Quantitative measures of dependence
- Descriptions of conditional distributions

Contents

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r in R

Next section

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r in R

Bivariate Data – more than one variable

Often, more than one variable is collected on each individual. In health studies, variables such as age, sex, height, weight, blood pressure, and total cholesterol are often measured on each individual. Economic studies may be interested in, among other things, personal income and years of education.

Bivariate data consist of data on two variables. Usually we are interested in the relationship between the variables.

Example: Do people tend to marry other people of about the same age?

Our experience tells us yes, but how good is the correspondence? One way to address the question is to look at pairs of ages for a sample of married couples (an excerpt from a dataset consisting of 282 pairs of spousal ages).

We see that, yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives.

Example: Histograms and means of spousal ages

Each distribution is fairly skewed with a long right tail. Not all husbands are older than their wives; this fact is lost when we separate the variables. The pairing within each couple is also lost by separating the variables. For example, based on the means alone, we cannot say what percentage of couples has younger husbands than wives. Another example of information not available from the separate descriptions: what is the average age of husbands with 45-year-old wives?

Finally, we do not know the relationship between the husband's age and the wife's age.

[email protected] ICY0006: Lecture 3 7 / 24 Example: Histograms and means of spousal ages

Each distribution is fairly skewed with a long right tail. Not all husbands are older than their wives; this fact is lost when we separate the variables. Also the pairing within couple is lost by separating the variables. For example, based on the means alone, we cannot say what percentage of couples has younger husbands than wives. Another example of information not available from the separate descriptions what is the average age of husbands with 45-year-old wives?

Finally, we do not know the relationship between the husband's age and the wife's age.ioc.pdf

[email protected] ICY0006: Lecture 3 7 / 24 Visualization of Bivariate Data

A scatter plot displays the bivariate data in a graphical form that maintains the pairing. Scatter plots that show linear relationships between variables can differ in several ways, including the slope of the line about which they cluster and how tightly the points cluster about the line. This is a scatter plot of the paired ages (all 282 pairs):

Scatter plot

Two observations:

1 There is a strong relationship between the husband's age and the wife's age: the older the husband, the older the wife.

- When one variable (Y) increases with the second variable (X), we say that X and Y have a positive association.
- When Y decreases as X increases, we say that they have a negative association.

2 The points cluster along a straight line. When this occurs, the relationship is called a linear relationship.

- There is a perfect linear relationship between two variables if a scatterplot of the points falls on a straight line.
- The relationship is linear even if the points diverge from the line, as long as the divergence is random rather than systematic.

[email protected] ICY0006: Lecture 3 9 / 24 Linear and non-linear relationships

[Scatter plots: perfect negative relationship; non-linear relationship]

Weak relationship and no relationship

[Scatter plots: weak positive relationship; no relationship]

Next section

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r in R

Pearson's correlation coefficient

Pearson's correlation coefficient is a statistical measure of the strength of a linear relationship between paired data. The symbol for Pearson's correlation is ρ when it is measured in the population and r when it is measured in a sample. (From here on, we deal with samples and will use r.)

Definition

Let X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} be two datasets (two samples) with means M_X and M_Y and standard deviations σ_X and σ_Y respectively. Then the sample Pearson correlation coefficient (or simply correlation coefficient) is defined by the formula

r = \frac{\sum (X - M_X)(Y - M_Y)}{N \sigma_X \sigma_Y}

Substituting the formula for the standard deviation, we obtain a formula for computing r:

r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{\sum (X - M_X)^2}\,\sqrt{\sum (Y - M_Y)^2}} = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}}
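The equivalence of the deviation-score form and the computational shortcut can be checked numerically. A minimal Python sketch with hypothetical data; `pearson_r` and `pearson_r_shortcut` are illustrative names, not library functions:

```python
import math

def pearson_r(x, y):
    """Pearson r from the deviation-score form of the definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def pearson_r_shortcut(x, y):
    """Equivalent computational form: (sum XY - N*Mx*My) / sqrt(...)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (sxy - n * mx * my) / math.sqrt((sxx - n * mx ** 2) * (syy - n * my ** 2))

x = [2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical sample
y = [1.0, 4.0, 3.0, 7.0, 9.0]
assert abs(pearson_r(x, y) - pearson_r_shortcut(x, y)) < 1e-12
print(round(pearson_r(x, y), 4))
```

Both forms give the same value; the shortcut avoids a second pass over the deviations and is the form used in the worked example below in this lecture.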

Computing Pearson's r

Example

r = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}}

r = \frac{210 - 5 \cdot 4 \cdot 9}{\sqrt{(96 - 5 \cdot 4^2)(465 - 5 \cdot 9^2)}} = \frac{210 - 180}{\sqrt{16 \cdot 60}} = \frac{30}{\sqrt{960}} \approx 0.9682458
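The arithmetic can be verified from the summary statistics that survive in the computation (N = 5, M_X = 4, M_Y = 9, ΣXY = 210, ΣX² = 96, ΣY² = 465; the underlying data table is not reproduced here). A short Python check:

```python
import math

# Summary statistics recovered from the worked example above:
N, Mx, My = 5, 4, 9
sum_xy, sum_x2, sum_y2 = 210, 96, 465

num = sum_xy - N * Mx * My                                    # 210 - 180 = 30
den = math.sqrt((sum_x2 - N * Mx**2) * (sum_y2 - N * My**2))  # sqrt(16 * 60)
r = num / den
print(round(r, 7))  # 0.9682458
```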

[email protected] ICY0006: Lecture 3 14 / 24 Correlation coecients

Properties of Pearson's r

- The Pearson correlation coefficient is symmetric: r = cor(X, Y) = cor(Y, X).
- r is bounded: −1 ≤ r ≤ 1.
- The Pearson correlation coefficient is invariant under separate changes in location and scale of the two variables: we may transform X to a + bX and Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (If b·d < 0, the magnitude is unchanged but the sign of r flips.)
- Positive values denote positive linear correlation; negative values denote negative linear correlation; a value of 0 denotes no linear correlation. The closer the value is to 1 or −1, the stronger the linear correlation.
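These properties are easy to confirm numerically. A Python sketch with hypothetical data (`pearson_r` is an illustrative helper, not a library function):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x = [1.0, 2.0, 4.0, 7.0]    # hypothetical data
y = [3.0, 5.0, 4.0, 9.0]
r = pearson_r(x, y)

assert -1 <= r <= 1                              # boundedness
assert abs(pearson_r(y, x) - r) < 1e-12          # symmetry

# Invariance under x -> a + b*x, y -> c + d*y with b, d > 0
x2 = [10 + 2 * v for v in x]
y2 = [-3 + 0.5 * v for v in y]
assert abs(pearson_r(x2, y2) - r) < 1e-12

# With b < 0 the sign flips while the magnitude is unchanged
x3 = [5 - 2 * v for v in x]
assert abs(pearson_r(x3, y) + r) < 1e-12
```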

Assumptions

There are five assumptions that are made with respect to Pearson's correlation:

1 The variables must be either interval or ratio measurements.
2 The variables must be approximately normally distributed (we will discuss this later).
3 There is a linear relationship between the two variables.
4 Outliers are either kept to a minimum or are removed entirely. (Use a scatter plot to detect outliers.)
5 There is homoscedasticity of the data: all random variables in the sequence or vector have the same finite variance. Homoscedasticity basically means that the variability along the line of best fit remains similar as you move along the line. (Use a scatter plot to determine homo- or heteroscedasticity.)

Removing outliers

Homo- and heteroscedasticity

Caution!

1 The existence of a strong correlation does not imply a causal link between the variables!

2 We need to perform a significance test to decide whether, based on a given sample, there is evidence that linear correlation is present in the population. (We will discuss significance tests later in our course.)


Recall: Pearson correlation is a measure of the strength of a relationship between two variables. But any relationship should be assessed for its significance as well as its strength. If your data do not meet the above assumptions, use Spearman's rank correlation (ρ) or Kendall's rank correlation (τ).

[email protected] ICY0006: Lecture 3 20 / 24 Next section

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r in R

Variance Sum Law II

Variance Sum Law I If X and Y are independent (uncorrelated) variables, then

\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y

Variance Sum Law II When X and Y are correlated variables, the following is valid:

\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y \pm 2\rho\,\sigma_X \sigma_Y

where ρ is the correlation between X and Y in the population.
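The law can be verified numerically: with population (N-denominator) variances and the correlation computed from the same data, the identity holds exactly. A Python sketch with hypothetical paired data:

```python
import math

def mean(v):
    return sum(v) / len(v)

def var(v):
    """Population variance (N in the denominator)."""
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def std(v):
    return math.sqrt(var(v))

def corr(x, y):
    """Correlation via covariance / (std_x * std_y)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (std(x) * std(y))

x = [2.0, 4.0, 6.0, 9.0]   # hypothetical paired data
y = [1.0, 3.0, 2.0, 8.0]
rho = corr(x, y)

s = [a + b for a, b in zip(x, y)]   # X + Y
d = [a - b for a, b in zip(x, y)]   # X - Y

# Variance Sum Law II, both signs:
assert abs(var(s) - (var(x) + var(y) + 2 * rho * std(x) * std(y))) < 1e-9
assert abs(var(d) - (var(x) + var(y) - 2 * rho * std(x) * std(y))) < 1e-9
```

Setting rho = 0 recovers Variance Sum Law I for uncorrelated variables.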

Next section

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r in R

Correlations in R

R can compute correlations with the cor() function. Built into the base distribution of the program are three routines: Pearson, Kendall, and Spearman rank correlations. Simplified formats of the function call are:

1 cor(x, y) – the default call returns the Pearson correlation coefficient;
2 cor(dataset) – if you pass a dataset instead of separate variables, you get back a matrix of all the pairwise correlation coefficients;
3 cor(x, y, method = "spearman") – if you specify "spearman", you get the Spearman correlation coefficient;
4 cor(x, y, use = "complete.obs") – the parameter use specifies the handling of missing data. Options are "all.obs" (assumes no missing data; missing data will produce an error), "complete.obs" (listwise deletion), and "pairwise.complete.obs" (pairwise deletion).
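For readers without R at hand, the Pearson and Spearman routines behind cor() can be sketched in pure Python: Spearman's ρ is simply the Pearson correlation of the ranks, with ties receiving average ranks as in R's default ranking. The function names below are illustrative, not part of any library:

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def ranks(v):
    """1-based ranks; ties get the average rank, as in R's rank()."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [1.0, 2.0, 4.0, 8.0, 16.0]        # monotone but not linear
print(spearman(x, y))                  # 1.0: perfect monotone relationship
print(pearson(x, y) < 1.0)             # True: the linear fit is not perfect
```

This illustrates why Spearman's ρ is recommended when the linearity assumption fails: it measures monotone rather than strictly linear association.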
