Lecture 1: Multivariate Data
Måns Thulin, Department of Mathematics, Uppsala University. [email protected]
Multivariate Methods • 22/3 2011
Outline

• Multivariate data and matrices
  - Notation and basic facts
• Descriptive statistics
  - Generalizations of univariate means, variances, ...
• Graphical methods
  - How to visualize multivariate data sets
• Distances
  - Is there a need to go beyond Euclid?
• Linear algebra
  - What's important in Chapter 2?
Multivariate data and matrices

General situation: p variables are measured on n subjects (patients, items, countries, ...). That is, we have n observations of a p-variate random variable.

Ex: Height, weight and age are measured for 23 people. Then p = 3 and n = 23.

The data is stored in an n × p array (a matrix):

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

Row j contains the p measurements for subject j: x_{jk} = measurement k for subject j.
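In code, the data matrix is just an n × p array. A minimal sketch in Python/NumPy (the course itself uses R, and the numbers here are made up for illustration):

```python
import numpy as np

# p = 3 variables (height, weight, age) measured on n = 4 subjects;
# the values are invented purely for illustration
X = np.array([
    [170.0, 65.0, 23.0],   # subject 1
    [180.0, 80.0, 31.0],   # subject 2
    [165.0, 55.0, 19.0],   # subject 3
    [175.0, 72.0, 27.0],   # subject 4
])

n, p = X.shape     # rows are subjects, columns are variables
row_2 = X[1]       # the p measurements for subject 2 (NumPy indexing is 0-based)
x_23 = X[1, 2]     # x_jk with j = 2, k = 3
```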
Descriptive statistics: mean and variance

For univariate data, we usually look at summary statistics, such as the sample mean x̄ and the sample variance s². We can calculate these as usual for each of the p variables:

\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}

s_k^2 = s_{kk} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \quad k = 1, 2, \ldots, p

The marginal sample variances s_k^2 are usually put in a sample covariance matrix together with the sample covariances.

Beware: Notational hazard! When we look at the sample covariance matrix, s_k^2 is usually denoted s_{kk}, without the square! The sample standard deviation for variable k is denoted \sqrt{s_{kk}}.
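These per-variable summaries are computed column by column. A sketch in Python/NumPy with made-up data (the course uses R; note that ddof=1 gives the n − 1 divisor in the variance formula above):

```python
import numpy as np

# Toy data matrix: n = 5 subjects (rows), p = 3 variables (columns)
X = np.array([
    [170.0, 65.0, 23.0],
    [180.0, 80.0, 31.0],
    [165.0, 55.0, 19.0],
    [175.0, 72.0, 27.0],
    [160.0, 58.0, 22.0],
])

# Sample mean of each variable: one mean per column
xbar = X.mean(axis=0)

# Sample variance of each variable with the n−1 divisor (ddof=1),
# matching s_kk = (1/(n−1)) Σ_j (x_jk − x̄_k)²
s2 = X.var(axis=0, ddof=1)
```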
Descriptive statistics: covariance and correlation

The sample covariance between the variables k and ℓ is defined as

s_{k\ell} = s_{\ell k} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)(x_{j\ell} - \bar{x}_\ell), \quad k, \ell = 1, 2, \ldots, p

It measures the linear association between the variables. It is common to rescale the covariance by dividing by the standard deviations of the variables. The number thus obtained is the sample correlation:

r_{k\ell} = r_{\ell k} = \frac{s_{k\ell}}{\sqrt{s_{kk}} \sqrt{s_{\ell\ell}}}

Note that r_{kk} = 1.
Descriptive statistics: arrays

Sample mean vector:

\bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}

Sample covariance matrix:

S_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}

Sample correlation matrix:

R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}
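All three arrays can be computed in one go. A Python/NumPy sketch with made-up data (the course uses R), which also checks the relation r_{kℓ} = s_{kℓ}/(√s_{kk}√s_{ℓℓ}):

```python
import numpy as np

X = np.array([
    [170.0, 65.0, 23.0],
    [180.0, 80.0, 31.0],
    [165.0, 55.0, 19.0],
    [175.0, 72.0, 27.0],
    [160.0, 58.0, 22.0],
])

xbar = X.mean(axis=0)               # sample mean vector x̄
S = np.cov(X, rowvar=False)         # sample covariance matrix (n−1 divisor)
R = np.corrcoef(X, rowvar=False)    # sample correlation matrix

# Rebuild R from S: divide each covariance by the two standard deviations
D = np.sqrt(np.diag(S))             # marginal standard deviations √s_kk
R_check = S / np.outer(D, D)
```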
Graphical methods

"A picture is worth a thousand words..."

Some useful approaches for visualizing multivariate data are:

• 3D plots
• Scatter plots
• Bubble plots
• Stars
• Chernoff faces
• Andrews' curves

(On the other hand, 1001 words are worth more than a picture...)
Graphical methods: 3D plots

Three-dimensional scatter plots:

[Figure: a three-dimensional scatter plot of simulated data, with axes xyz[,1], xyz[,2] and xyz[,3].]
Graphical methods: 3D plots

• Often a good choice for three-dimensional data.
• Enable us to see patterns that would disappear if the data were projected to two dimensions.
• Best when interactive, i.e. when the plot can be rotated.
Graphical methods: Scatter plots

Scatter plots of each pair of variables, usually combined with the marginal histograms.

[Figure: a scatter plot matrix of four variables, x_1, x_2, x_3 and x_4.]

Graphical methods: Scatter plots
• Can, unlike 3D plots, be used for p greater than 3.
• Usually a good first choice for plotting.
• Good for detecting dependencies between pairs of variables.
• The higher-dimensional perspective is lost: important dependencies between more than two variables might go unnoticed.
• Useful for assessing multivariate normality.
Graphical methods: Bubble plots

In a regular 2D plot, a third dimension can be illustrated using bubbles of different sizes.

[Figure: a bubble plot of Mortality against SO2 for 15 cities, with bubble size representing a third variable.]
Graphical methods: Bubble plots

• Simple, but easy to interpret.
• Possible to make nice-looking plots.
• Possible extensions to higher dimensions?
  - 3D bubble plots, colours of bubbles, shapes of bubbles, time dimension...
Graphical methods: Stars

The lengths of the rays from the center of the figure represent the values of the variables. Example with p = 7 and n = 9:

[Figure: star plots titled "Motor Trend Cars" for nine cars: Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet Sportabout, Valiant, Duster 360, Merc 240D and Merc 230.]

Different versions of stars can be found in the literature.
Graphical methods: Stars

• Can be used for dimensions higher than three.
• Relatively easy to interpret.
• Can be useful for finding similar data points.
• If plotted in a time sequence, can illustrate change over time.
• Only useful for relative comparisons.
Graphical methods: Chernoff faces

The human mind is extremely good at facial recognition. Chernoff proposed illustrating data sets with facial features.

Herman Chernoff (1973), The use of faces to represent points in k-dimensional space graphically, Journal of the American Statistical Association, 68, pp. 361-368.
[Figure: Chernoff faces for six US cities: akronOH, albanyNY, allenPA, bufaloNY, cantonOH and chatagTN.]

Graphical methods: Chernoff faces
• Each variable is represented by a facial feature (e.g. length of nose, size of eyes, width of head...).
• Easy to see groups or clusters in the data.
• Can be used to find outliers.
• If plotted in a time sequence, can illustrate change over time.
• Can be used for p ≤ 18.
• Care must be taken when choosing which variable is represented by which facial feature. We react more strongly to changes in some of the features!
• Useful or just plain stupid?
• Does the implementation in R work properly? See Computer exercise 1.
Graphical methods: Andrews' curves

Andrews proposed that the observations should be projected onto a p-dimensional space of functions, since we are used to comparing functions. He used the observations as Fourier coefficients and plotted the corresponding functions in the interval 0 < t < 2π:

f_x(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \ldots
D.F. Andrews (1972), Plots of High-Dimensional Data, Biometrics, 28, pp. 125-136.
[Figure: Andrews' curves for the iris data on 0 < t < 2π, coloured by species: setosa, versicolor and virginica.]

Graphical methods: Andrews' curves
• Useful for finding clusters: subgroups of data are characterized by similar curves.
• Can be used to find linear relationships: if a point y lies on a line between x and z, then f_y(t) lies between f_x(t) and f_z(t) for all t.
• Can illustrate data with very high dimensions.
• Can be used for testing (see original article).
• Becomes cluttered when there are too many data points (as in the picture on the previous slide?).
• Perhaps not as intuitive as some of the other methods: more useful to mathematicians than to practitioners?
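Andrews' function is easy to evaluate directly from an observation vector. A sketch in Python (the values are invented; the course uses R), which also illustrates the linearity property: a point midway between x and z gives a curve midway between f_x and f_z.

```python
import math

def andrews_curve(x, t):
    """Evaluate Andrews' function
    f_x(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ...
    for an observation x = (x1, ..., xp)."""
    value = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:]):
        k = i // 2 + 1                      # frequencies 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 0:
            value += xi * math.sin(k * t)   # x2, x4, ... get the sine terms
        else:
            value += xi * math.cos(k * t)   # x3, x5, ... get the cosine terms
    return value

# Two made-up observations and their midpoint
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [2.0, 0.0, 1.0, -1.0, 3.0]
y = [(a + b) / 2 for a, b in zip(x, z)]     # y lies on the line between x and z
```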
Geographical data

The social network Facebook stores lots of data about its users and their online activities.

Such large databases, or parts thereof, can be difficult to visualize.

In December 2010 the Facebook infrastructure engineering team used R to create a map of Facebook friendships. Lines between cities represent friendships between the cities' inhabitants.
Geographical data

[Figure: the Facebook friendship map described above, shown across three slides.]

Distances

Why statistical distances?

• Account for differences in variation
• Account for presence of correlation

[Figure: a scatter plot of correlated two-dimensional data in the (x, y) plane.]
Distances: definition

Definition. Let P and Q be two points, where these may represent measurements x and y on two objects. A real-valued function d(P, Q) is a distance function if it satisfies the following properties:

(i) d(P, Q) = d(Q, P) (Symmetry)
(ii) d(P, Q) ≥ 0 (Non-negativity)
(iii) d(P, P) = 0 (Identification mark)

For many distance functions the following properties also hold:

(iv) d(P, Q) = 0 iff P = Q (Definiteness)
(v) d(P, Q) ≤ d(P, R) + d(R, Q) (Triangle inequality)

If (i)-(v) hold, d is called a metric.
Distances: some examples

Distance between points x = (x_1, ..., x_p) and y = (y_1, ..., y_p):

• Euclidean distance:

  \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_p - y_p)^2} = \sqrt{(x - y)^T (x - y)}

• Statistical distance:

  \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \ldots + \frac{(x_p - y_p)^2}{s_{pp}}}

  takes the variances of the variables into account. Same as Euclidean distance for standardized data.

• Mahalanobis distance:

  \sqrt{(x - y)^T S_n^{-1} (x - y)}

  also takes the covariances/correlations into account. If the variables are uncorrelated, this reduces to the statistical distance.
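The three distances can be compared side by side. A Python/NumPy sketch with made-up points (the course uses R); when the covariance matrix is diagonal, the Mahalanobis distance collapses to the statistical distance:

```python
import numpy as np

def euclidean(x, y):
    d = x - y
    return np.sqrt(d @ d)                      # √((x−y)ᵀ(x−y))

def statistical(x, y, variances):
    d = x - y
    return np.sqrt(np.sum(d**2 / variances))   # each squared term divided by s_kk

def mahalanobis(x, y, S):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)   # √((x−y)ᵀ Sₙ⁻¹ (x−y))

# Two made-up points and a diagonal (uncorrelated) covariance matrix
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
variances = np.array([4.0, 9.0])
S_diag = np.diag(variances)
```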
Linear algebra

Sections 2.1-2.5, 2.7 and 2A of Johnson & Wichern will not be discussed during the lectures. These contain "well-known" results from linear algebra. You might want to go through those sections on your own.

The following topics and results are of particular interest to us:

• Result 2A.14 on page 100: how to express a matrix using its eigenvalues and eigenvectors.
• Positive definite matrices.
• Matrix ranks.
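Result 2A.14 is the spectral decomposition: a symmetric matrix can be written A = Σᵢ λᵢ eᵢ eᵢᵀ in terms of its eigenvalues λᵢ and eigenvectors eᵢ. A quick numerical check in Python/NumPy (the matrix is made up for illustration):

```python
import numpy as np

# A symmetric, positive definite example matrix
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)     # eigh is for symmetric matrices

# Spectral decomposition: rebuild A as Σ_i λ_i e_i e_iᵀ
# (the eigenvectors are the columns of eigvecs)
A_rebuilt = sum(lam * np.outer(v, v)
                for lam, v in zip(eigvals, eigvecs.T))

# Positive definiteness can be read off the eigenvalues: all λ_i > 0
is_pos_def = np.all(eigvals > 0)
```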
Summary

• Multivariate data and matrices
  - n measurements of p variables, stored in a matrix.
• Descriptive statistics
  - Sample means and variances are calculated for each variable.
  - Covariances and correlations between pairs of variables.
  - Stored in arrays.
• Graphical methods
  - 3D plots, scatter plots, bubble plots, stars, Chernoff faces, Andrews' curves.
• Distances
  - Modify Euclidean distance to account for correlations and differences in variance.