Lecture 1: Multivariate Methods

Måns Thulin, Department of Mathematics, Uppsala University, [email protected]

Multivariate Methods • 22/3 2011

1/30 Outline

I Multivariate data and matrices
  I Notation and basic facts
I Descriptive statistics
  I Generalizations of univariate statistics
I Graphical methods
  I How to visualize multivariate data sets
I Distances
  I Is there a need to go beyond Euclid?
I Linear algebra
  I What's important in Chapter 2?

2/30 Multivariate data and matrices

General situation: p variables are measured on n subjects (patients, items, countries, ...). That is, we have n observations of a p-variate random vector.

Ex: Height, weight and age are measured for 23 people. Then p = 3 and n = 23.

The data is stored in an array (a matrix):

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

Row j contains the p measurements for subject j: $x_{jk}$ = measurement k for subject j.
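The course examples use R, but the layout of the data matrix X is easy to illustrate in any language. A minimal Python/NumPy sketch with made-up height, weight and age values (the numbers are hypothetical, not from the lecture):

```python
import numpy as np

# Hypothetical data: height (cm), weight (kg) and age (years) for n = 4
# people, so p = 3. Each row holds the p measurements for one subject.
X = np.array([
    [172.0, 70.5, 31],
    [181.0, 83.2, 45],
    [165.5, 58.0, 27],
    [190.0, 91.4, 52],
])

n, p = X.shape    # n subjects, p variables
x_23 = X[1, 2]    # x_jk with j = 2, k = 3 (0-based indices): age of subject 2
print(n, p, x_23)
```

Rows index subjects and columns index variables, exactly as in the matrix above.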

3/30 Descriptive statistics: means and variances

For univariate data, we usually look at summary statistics, such as the mean x̄ and the sample variance s². We can calculate these as usual for each of the p variables:

$$\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}$$

$$s_k^2 = s_{kk} = \frac{1}{n-1}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p$$

The marginal sample variances $s_k^2$ are usually put in a sample covariance matrix together with the sample covariances. Beware: notational hazard! In the sample covariance matrix, $s_k^2$ is usually denoted $s_{kk}$, without the square! The sample standard deviation for variable k is denoted $\sqrt{s_{kk}}$.
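The columnwise formulas above can be checked numerically; a minimal Python/NumPy sketch with a made-up data matrix (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical 5x2 data matrix: n = 5 subjects, p = 2 variables.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0],
              [9.0, 10.0]])
n = X.shape[0]

# Sample means, one per column (variable).
xbar = X.mean(axis=0)

# Marginal sample variances s_kk with the 1/(n-1) factor (ddof=1).
s = X.var(axis=0, ddof=1)

# The same quantity written out from the definition.
s_manual = ((X - xbar) ** 2).sum(axis=0) / (n - 1)
print(xbar, s)
```

Note the `ddof=1` argument: NumPy divides by n by default, while the lecture's definition divides by n − 1.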

4/30 Descriptive statistics: covariance and correlation

The sample covariance between variables k and ℓ is defined as

$$s_{k\ell} = s_{\ell k} = \frac{1}{n-1}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)(x_{j\ell} - \bar{x}_\ell), \qquad k, \ell = 1, 2, \ldots, p$$

It measures the linear association between the variables. It is common to rescale the covariance by dividing by the standard deviations of the variables. The number thus obtained is the sample correlation:

$$r_{k\ell} = r_{\ell k} = \frac{s_{k\ell}}{\sqrt{s_{kk}}\,\sqrt{s_{\ell\ell}}}$$

Note that $r_{kk} = 1$.

5/30 Descriptive statistics: arrays

Sample mean vector:
$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}$$

Sample covariance matrix:
$$S_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}$$

Sample correlation matrix:
$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}$$
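The sample covariance and correlation matrices can be computed with NumPy's built-ins and cross-checked against the definitions; a sketch with hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))   # hypothetical data: n = 50, p = 3

# Sample covariance matrix S_n with the 1/(n-1) factor.
# rowvar=False because our variables sit in columns, not rows.
S = np.cov(X, rowvar=False, ddof=1)

# Sample correlation matrix: r_kl = s_kl / (sqrt(s_kk) * sqrt(s_ll)).
d = np.sqrt(np.diag(S))
R = S / np.outer(d, d)

# Sanity checks: R matches NumPy's own version, and r_kk = 1 on the diagonal.
assert np.allclose(R, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R), 1.0)
```

Both S and R are symmetric by construction, mirroring the arrays on this slide.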

6/30 Graphical methods

"A picture is worth a thousand words..."

Some useful approaches for visualizing multivariate data are:

I 3D plots

I Scatter plots

I Bubble plots

I Stars

I Chernoff faces

I Andrews’ curves

(on the other hand, 1001 words are worth more than a picture...)

7/30 Graphical methods: 3D plots

Three-dimensional scatter plots:

[Figure: 3D scatter plot of the columns xyz[,1], xyz[,2] and xyz[,3]]

8/30 Graphical methods: 3D plots

I Often a good choice for three dimensional data.

I Enables us to see patterns that would disappear if the data were projected to two dimensions.

I Best when interactive – i.e. when the plot can be rotated.
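The lecture's plots were made in R; a comparable static (non-interactive) 3D scatter plot can be sketched in Python with matplotlib, using hypothetical simulated data:

```python
import matplotlib
matplotlib.use("Agg")                  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
xyz = rng.standard_normal((60, 3))     # hypothetical 3-variate data

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # three-dimensional axes
ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2])
ax.set_xlabel("x_1")
ax.set_ylabel("x_2")
ax.set_zlabel("x_3")
fig.savefig("scatter3d.png")
```

With an interactive backend (or a tool such as plotly) the plot can be rotated, which is where 3D plots are at their best, as noted above.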

9/30 Graphical methods: Scatter plots

Scatter plots of each pair of variables, usually combined with plots of the marginal distributions.

[Figure: scatter plot matrix of the variables x_1, x_2, x_3 and x_4]

10/30 Graphical methods: Scatter plots

I Can, unlike 3D plots, be used for p greater than 3.

I Usually a good first choice for plotting.

I Good for detecting dependencies between pairs of variables.

I Higher-dimensional perspective is lost – important dependencies between more than two variables might go unnoticed.

I Useful for assessing multivariate normality.
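A scatter plot matrix like the one on the previous slide can be sketched in Python with pandas (the data here is hypothetical):

```python
import matplotlib
matplotlib.use("Agg")                  # render off-screen; no display needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.standard_normal((80, 4)),
                  columns=["x_1", "x_2", "x_3", "x_4"])  # hypothetical data

# One panel per pair of variables; histograms of the marginals
# on the diagonal, as on the slide.
axes = scatter_matrix(df, diagonal="hist")
```

In R, the corresponding one-liner is `pairs()`; both produce a p × p grid of pairwise panels.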

11/30 Graphical methods: Bubble plots

In a regular 2D plot, a third dimension can be illustrated using bubbles of different sizes.

[Figure: bubble plot of Mortality[1:15] against SO2[1:15]]


13/30 Graphical methods: Bubble plots

I Simple and easy to interpret.

I Possible to make nice-looking plots.

I Possible extensions to higher dimensions?

I 3D bubble plots, colours of bubbles, shapes of bubbles, time dimension...
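A bubble plot is just a scatter plot whose marker areas encode a third variable. A sketch in Python with hypothetical pollution data (the SO2/Mortality names mirror the slide's R example, but the values are made up):

```python
import matplotlib
matplotlib.use("Agg")                  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
so2 = rng.uniform(0, 300, size=15)                        # hypothetical values
mortality = 900 + 0.5 * so2 + rng.normal(0, 30, size=15)  # hypothetical values
third = rng.uniform(0.2, 3.5, size=15)                    # the bubbled variable

fig, ax = plt.subplots()
# The s argument sets the marker *area*, so scale the third variable up.
pts = ax.scatter(so2, mortality, s=60 * third, alpha=0.5)
ax.set_xlabel("SO2")
ax.set_ylabel("Mortality")
fig.savefig("bubble.png")
```

Colour and marker shape can encode further variables, along the lines of the extensions listed above.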

14/30 Graphical methods: Stars

The lengths of the rays from the center of the figure represent the values of the variables. Example with p = 7 and n = 9:

[Figure: star plots for nine cars from the Motor Trend data: Mazda RX4 Wag, Mazda RX4, Datsun 710, Hornet 4 Drive, Valiant, Hornet Sportabout, Merc 240D, Duster 360, Merc 230]

Different versions of stars can be found in the literature.


16/30 Graphical methods: Stars

I Can be used for dimensions higher than three.

I Relatively easy to interpret.

I Can be useful for finding similar data points.

I If plotted in a time sequence, can illustrate change over time.

I Only useful for relative comparisons.

17/30 Graphical methods: Chernoff faces

The human mind is extremely good at facial recognition. Chernoff proposed illustrating data sets with facial features.

Herman Chernoff (1973), The use of faces to represent points in k-dimensional space graphically, Journal of the American Statistical Association, 68, pp. 361-368.

[Figure: Chernoff faces for six cities: akronOH, albanyNY, allenPA, bufaloNY, cantonOH, chatagTN]

18/30 Graphical methods: Chernoff faces

I Each variable is represented by a facial feature (e.g. length of nose, size of eyes, width of head...).

I Easy to see groups or clusters in the data.

I Can be used to find outliers.

I If plotted in a time sequence, can illustrate change over time.

I Can be used for p ≤ 18.

I Care must be taken when choosing which variable is represented by which facial feature. We react more strongly to changes in some of the features!

I Useful or just plain stupid?

I Does the implementation in R work properly? See Computer exercise 1.

19/30 Graphical methods: Andrews' curves

Andrews proposed projecting the observations onto a p-dimensional space of functions, since we are used to comparing functions. He used the observations as Fourier coefficients and plotted the corresponding functions in the interval 0 < t < 2π:

$$f_x(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \ldots$$

D.F. Andrews (1972), Plots of High-Dimensional Data, Biometrics, 28, pp. 125-136.

[Figure: Andrews' curves for the iris data, with the groups setosa, versicolor and virginica]

20/30 Graphical methods: Andrews' curves

I Useful for finding clusters – subgroups of data are characterized by similar curves.

I Can be used to find linear relationships – if a point y lies on a line between x and z then fy(t) lies between fx(t) and fz(t) for all t.

I Can illustrate data with very high dimensions.

I Can be used for testing (see original article).

I Becomes cluttered when there are too many data points (as in the picture on the previous slide?).

I Perhaps not as intuitive as some of the other methods – more useful to mathematicians than to practitioners?
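The Fourier-sum definition of f_x(t) above is straightforward to code. A sketch in Python (an illustration, not the R implementation used in the course; the function name and test observation are made up):

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate f_x(t) = x1/sqrt(2) + x2 sin t + x3 cos t + x4 sin 2t + ..."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    f = np.full_like(t, x[0] / np.sqrt(2))   # constant first term
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2                     # frequencies 1, 1, 2, 2, 3, 3, ...
        f += xi * (np.sin(k * t) if i % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(0, 2 * np.pi, 201)
f = andrews_curve([1.0, 2.0, 3.0], t)        # one hypothetical observation
```

Plotting one such curve per observation, coloured by group, gives the iris picture on the previous slide.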

21/30 Geographical data

The social network Facebook stores lots of data about its users and their online activities.

Such large databases – or parts thereof – can be difficult to visualize.

In December 2010 the Facebook infrastructure engineering team used R to create a map of Facebook friendships. Lines between cities represent friendships between the cities’ inhabitants.

22/30 Geographical data

[Figure: map of Facebook friendships, shown across three slides]

25/30 Distances

Why statistical distances?

I Account for differences in variation
I Account for presence of correlation

[Figure: scatter plot of correlated two-dimensional data in the (x, y) plane]

26/30 Distances: definition

Definition. Let P and Q be two points, where these may represent measurements x and y on two objects. A real-valued function d(P, Q) is a distance function if it satisfies the following properties:

(i) d(P, Q) = d(Q, P) (Symmetry)
(ii) d(P, Q) ≥ 0 (Non-negativity)
(iii) d(P, P) = 0 (Identification mark)

For many distance functions the following properties also hold:

(iv) d(P, Q) = 0 iff P = Q (Definiteness)
(v) d(P, Q) ≤ d(P, R) + d(R, Q) (Triangle inequality)

If (i)-(v) hold, d is called a metric.

27/30 Distances: some examples

Distance between points x = (x_1, ..., x_p) and y = (y_1, ..., y_p):

I Euclidean distance:
$$\sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \ldots + (x_p-y_p)^2} = \sqrt{(x-y)^T(x-y)}$$

I Statistical distance:
$$\sqrt{\frac{(x_1-y_1)^2}{s_{11}} + \frac{(x_2-y_2)^2}{s_{22}} + \ldots + \frac{(x_p-y_p)^2}{s_{pp}}}$$
takes the variances of the variables into account. Same as the Euclidean distance for standardized data.

I Mahalanobis distance:
$$\sqrt{(x-y)^T S_n^{-1} (x-y)}$$
also takes the covariances/correlations into account. If the variables are uncorrelated, this reduces to the statistical distance.
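The three distances above differ only in how the coordinate differences are weighted; a Python sketch with hypothetical points and a hypothetical covariance matrix:

```python
import numpy as np

def euclidean(x, y):
    d = np.asarray(x) - np.asarray(y)
    return np.sqrt(d @ d)

def statistical(x, y, S):
    # Weight each squared difference by the corresponding variance s_kk.
    d = np.asarray(x) - np.asarray(y)
    return np.sqrt(np.sum(d ** 2 / np.diag(S)))

def mahalanobis(x, y, S):
    # Also accounts for the covariances via the inverse of S.
    d = np.asarray(x) - np.asarray(y)
    return np.sqrt(d @ np.linalg.inv(S) @ d)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
S_uncorr = np.diag([4.0, 9.0])   # uncorrelated variables: diagonal S
S_id = np.eye(2)                 # unit variances, no correlation
```

With a diagonal S, Mahalanobis reduces to the statistical distance; with the identity matrix, both reduce to the Euclidean distance, matching the remarks above.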

28/30 Linear algebra

Sections 2.1-2.5, 2.7 and 2A of Johnson & Wichern will not be discussed during the lectures. These contain "well-known" results from linear algebra. You might want to go through those sections on your own. The following topics and results are of particular interest to us:

I Result 2A.14 on page 100: how to express a matrix using its eigenvalues and eigenvectors.

I Positive definite matrices.

I Matrix ranks.
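Result 2A.14 is the spectral decomposition: a symmetric matrix A can be written as the sum of λ_i e_i e_iᵀ over its eigenpairs. A quick numerical check in Python (the matrix is a made-up example):

```python
import numpy as np

# A hypothetical symmetric (here also positive definite) matrix.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# eigh is NumPy's eigensolver for symmetric matrices;
# the eigenvectors are the columns of E.
lam, E = np.linalg.eigh(A)

# Rebuild A as the sum of lambda_i * e_i e_i^T over the eigenpairs.
A_rebuilt = sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(len(lam)))

assert np.allclose(A, A_rebuilt)
# Positive definiteness is equivalent to all eigenvalues being positive.
assert np.all(lam > 0)
```

The same decomposition underlies matrix square roots and inverses of covariance matrices later in the course.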

29/30 Summary

I Multivariate data and matrices
  I n measurements of p variables, stored in a matrix.
I Descriptive statistics
  I Sample means and variances are calculated for each variable.
  I Covariances and correlations between pairs of variables.
  I Stored in arrays.
I Graphical methods
  I 3D plots
  I Scatter plots
  I Bubble plots
  I Stars
  I Chernoff faces
  I Andrews' curves
I Distances
  I Modify Euclidean distance to account for correlations and differences in variance.

30/30