Lecture 1: Multivariate Data
Måns Thulin, Department of Mathematics, Uppsala University. [email protected]
Multivariate Methods • 22/3 2011
Outline

• Multivariate data and matrices
  - Notation and basic facts
• Descriptive statistics
  - Generalizations of univariate means, variances, ...
• Graphical methods
  - How to visualize multivariate data sets
• Distances
  - Is there a need to go beyond Euclid?
• Linear algebra
  - What's important in Chapter 2?
Multivariate data and matrices

General situation: p variables are measured on n subjects (patients, items, countries, ...). That is, we have n observations of a p-variate random variable.

Ex: Height, weight and age are measured for 23 people. Then p = 3 and n = 23.

The data is stored in an n × p array (a matrix):

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}

Row j contains the p measurements for subject j: x_{jk} = measurement k for subject j.
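In code, the data matrix is just an n × p array. A minimal sketch in Python/NumPy (the course itself uses R, and the numbers here are made up for illustration):

```python
import numpy as np

# p = 3 variables (height, weight, age) measured on n = 4 subjects;
# the values are invented purely for illustration
X = np.array([
    [170.0, 65.0, 23.0],   # subject 1
    [180.0, 80.0, 31.0],   # subject 2
    [165.0, 55.0, 19.0],   # subject 3
    [175.0, 72.0, 27.0],   # subject 4
])

n, p = X.shape     # rows are subjects, columns are variables
row_2 = X[1]       # the p measurements for subject 2 (NumPy indexing is 0-based)
x_23 = X[1, 2]     # x_jk with j = 2, k = 3
```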
Descriptive statistics: mean and variance

For univariate data, we usually look at summary statistics, such as the sample mean x̄ and the sample variance s². We can calculate these as usual for each of the p variables:

\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}

s_k^2 = s_{kk} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \quad k = 1, 2, \ldots, p

The marginal sample variances s_k^2 are usually put in a sample covariance matrix together with the sample covariances.

Beware: Notational hazard! When we look at the sample covariance matrix, s_k^2 is usually denoted s_{kk}, without the square! The sample standard deviation for variable k is denoted \sqrt{s_{kk}}.
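These per-variable summaries are computed column by column. A sketch in Python/NumPy with made-up data (the course uses R; note that ddof=1 gives the n − 1 divisor in the variance formula above):

```python
import numpy as np

# Toy data matrix: n = 5 subjects (rows), p = 3 variables (columns)
X = np.array([
    [170.0, 65.0, 23.0],
    [180.0, 80.0, 31.0],
    [165.0, 55.0, 19.0],
    [175.0, 72.0, 27.0],
    [160.0, 58.0, 22.0],
])

# Sample mean of each variable: one mean per column
xbar = X.mean(axis=0)

# Sample variance of each variable with the n−1 divisor (ddof=1),
# matching s_kk = (1/(n−1)) Σ_j (x_jk − x̄_k)²
s2 = X.var(axis=0, ddof=1)
```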
Descriptive statistics: covariance and correlation

The sample covariance between the variables k and ℓ is defined as

s_{k\ell} = s_{\ell k} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)(x_{j\ell} - \bar{x}_\ell), \quad k, \ell = 1, 2, \ldots, p

It measures the linear association between the variables. It is common to rescale the covariance by dividing by the standard deviations of the variables. The number thus obtained is the sample correlation:

r_{k\ell} = r_{\ell k} = \frac{s_{k\ell}}{\sqrt{s_{kk}} \sqrt{s_{\ell\ell}}}

Note that r_{kk} = 1.
Descriptive statistics: arrays

Sample mean vector:

\bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}

Sample covariance matrix:

S_n = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{pmatrix}

Sample correlation matrix:

R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{12} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{pmatrix}
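All three arrays can be computed in one go. A Python/NumPy sketch with made-up data (the course uses R), which also checks the relation r_{kℓ} = s_{kℓ}/(√s_{kk}√s_{ℓℓ}):

```python
import numpy as np

X = np.array([
    [170.0, 65.0, 23.0],
    [180.0, 80.0, 31.0],
    [165.0, 55.0, 19.0],
    [175.0, 72.0, 27.0],
    [160.0, 58.0, 22.0],
])

xbar = X.mean(axis=0)               # sample mean vector x̄
S = np.cov(X, rowvar=False)         # sample covariance matrix (n−1 divisor)
R = np.corrcoef(X, rowvar=False)    # sample correlation matrix

# Rebuild R from S: divide each covariance by the two standard deviations
D = np.sqrt(np.diag(S))             # marginal standard deviations √s_kk
R_check = S / np.outer(D, D)
```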
Graphical methods

"A picture is worth a thousand words..."

Some useful approaches for visualizing multivariate data are:

• 3D plots
• Scatter plots
• Bubble plots
• Stars
• Chernoff faces
• Andrews' curves

(On the other hand, 1001 words are worth more than a picture...)
Graphical methods: 3D plots

Three-dimensional scatter plots:

[Figure: a three-dimensional scatter plot of simulated data, with axes xyz[,1], xyz[,2] and xyz[,3].]
Graphical methods: 3D plots

• Often a good choice for three-dimensional data.
• Enable us to see patterns that would disappear if the data were projected to two dimensions.
• Best when interactive, i.e. when the plot can be rotated.
Graphical methods: Scatter plots

Scatter plots of each pair of variables, usually combined with the marginal histograms.

[Figure: a scatter plot matrix of four variables, x_1, x_2, x_3 and x_4.]

Graphical methods: Scatter plots
• Can, unlike 3D plots, be used for p greater than 3.
• Usually a good first choice for plotting.
• Good for detecting dependencies between pairs of variables.
• The higher-dimensional perspective is lost: important dependencies between more than two variables might go unnoticed.
• Useful for assessing multivariate normality.
Graphical methods: Bubble plots

In a regular 2D plot, a third dimension can be illustrated using bubbles of different sizes.

[Figure: a bubble plot of Mortality against SO2 for 15 cities, with bubble size representing a third variable.]
Graphical methods: Bubble plots

• Simple, but easy to interpret.
• Possible to make nice-looking plots.
• Possible extensions to higher dimensions?
  - 3D bubble plots, colours of bubbles, shapes of bubbles, time dimension...
Graphical methods: Stars

The lengths of the rays from the center of the figure represent the values of the variables. Example with p = 7 and n = 9:

[Figure: star plots titled "Motor Trend Cars" for nine cars: Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet Sportabout, Valiant, Duster 360, Merc 240D and Merc 230.]

Different versions of stars can be found in the literature.
Graphical methods: Stars

• Can be used for dimensions higher than three.
• Relatively easy to interpret.
• Can be useful for finding similar data points.
• If plotted in a time sequence, can illustrate change over time.
• Only useful for relative comparisons.
Graphical methods: Chernoff faces

The human mind is extremely good at facial recognition. Chernoff proposed illustrating data sets with facial features.

Herman Chernoff (1973), The use of faces to represent points in k-dimensional space graphically, Journal of the American Statistical Association, 68, pp. 361-368.
[Figure: Chernoff faces for six US cities: akronOH, albanyNY, allenPA, bufaloNY, cantonOH and chatagTN.]

Graphical methods: Chernoff faces
• Each variable is represented by a facial feature (e.g. length of nose, size of eyes, width of head...).
• Easy to see groups or clusters in the data.
• Can be used to find outliers.
• If plotted in a time sequence, can illustrate change over time.
• Can be used for p ≤ 18.
• Care must be taken when choosing which variable is represented by which facial feature. We react more strongly to changes in some of the features!
• Useful or just plain stupid?
• Does the implementation in R work properly? See Computer exercise 1.
Graphical methods: Andrews' curves

Andrews proposed that the observations should be projected onto a p-dimensional space of functions, since we are used to comparing functions. He used the observations as Fourier coefficients and plotted the corresponding functions in the interval 0 < t < 2π:

f_x(t) = x_1/\sqrt{2} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \ldots
D.F. Andrews (1972), Plots of High-Dimensional Data, Biometrics, 28, pp. 125-136.
[Figure: Andrews' curves for the iris data on 0 < t < 2π, coloured by species: setosa, versicolor and virginica.]

Graphical methods: Andrews' curves
• Useful for finding clusters: subgroups of data are characterized by similar curves.
• Can be used to find linear relationships: if a point y lies on a line between x and z, then f_y(t) lies between f_x(t) and f_z(t) for all t.
• Can illustrate data with very high dimensions.
• Can be used for testing (see original article).
• Becomes cluttered when there are too many data points (as in the picture on the previous slide?).
• Perhaps not as intuitive as some of the other methods: more useful to mathematicians than to practitioners?
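Andrews' function is easy to evaluate directly from an observation vector. A sketch in Python (the values are invented; the course uses R), which also illustrates the linearity property: a point midway between x and z gives a curve midway between f_x and f_z.

```python
import math

def andrews_curve(x, t):
    """Evaluate Andrews' function
    f_x(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ...
    for an observation x = (x1, ..., xp)."""
    value = x[0] / math.sqrt(2)
    for i, xi in enumerate(x[1:]):
        k = i // 2 + 1                      # frequencies 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 0:
            value += xi * math.sin(k * t)   # x2, x4, ... get the sine terms
        else:
            value += xi * math.cos(k * t)   # x3, x5, ... get the cosine terms
    return value

# Two made-up observations and their midpoint
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [2.0, 0.0, 1.0, -1.0, 3.0]
y = [(a + b) / 2 for a, b in zip(x, z)]     # y lies on the line between x and z
```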
Geographical data

The social network Facebook stores lots of data about its users and their online activities.

Such large databases, or parts thereof, can be difficult to visualize.

In December 2010 the Facebook infrastructure engineering team used R to create a map of Facebook friendships. Lines between cities represent friendships between the cities' inhabitants.
Geographical data

[Figure: the Facebook friendship map described above, shown across three slides.]

Distances

Why statistical distances?

• Account for differences in variation
• Account for presence of correlation

[Figure: a scatter plot of correlated two-dimensional data in the (x, y) plane.]
Distances: definition

Definition. Let P and Q be two points, where these may represent measurements x and y on two objects. A real-valued function d(P, Q) is a distance function if it satisfies the following properties:

(i) d(P, Q) = d(Q, P) (Symmetry)
(ii) d(P, Q) ≥ 0 (Non-negativity)
(iii) d(P, P) = 0 (Identification mark)

For many distance functions the following properties also hold:

(iv) d(P, Q) = 0 iff P = Q (Definiteness)
(v) d(P, Q) ≤ d(P, R) + d(R, Q) (Triangle inequality)

If (i)-(v) hold, d is called a metric.
Distances: some examples

Distance between points x = (x_1, ..., x_p) and y = (y_1, ..., y_p):

• Euclidean distance:

  \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_p - y_p)^2} = \sqrt{(x - y)^T (x - y)}

• Statistical distance:

  \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \ldots + \frac{(x_p - y_p)^2}{s_{pp}}}

  takes the variances of the variables into account. Same as Euclidean distance for standardized data.

• Mahalanobis distance:

  \sqrt{(x - y)^T S_n^{-1} (x - y)}

  also takes the covariances/correlations into account. If the variables are uncorrelated, this reduces to the statistical distance.
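The three distances can be compared side by side. A Python/NumPy sketch with made-up points (the course uses R); when the covariance matrix is diagonal, the Mahalanobis distance collapses to the statistical distance:

```python
import numpy as np

def euclidean(x, y):
    d = x - y
    return np.sqrt(d @ d)                      # √((x−y)ᵀ(x−y))

def statistical(x, y, variances):
    d = x - y
    return np.sqrt(np.sum(d**2 / variances))   # each squared term divided by s_kk

def mahalanobis(x, y, S):
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)   # √((x−y)ᵀ Sₙ⁻¹ (x−y))

# Two made-up points and a diagonal (uncorrelated) covariance matrix
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
variances = np.array([4.0, 9.0])
S_diag = np.diag(variances)
```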
Linear algebra

Sections 2.1-2.5, 2.7 and 2A of Johnson & Wichern will not be discussed during the lectures. These contain "well-known" results from linear algebra. You might want to go through those sections on your own.

The following topics and results are of particular interest to us:

• Result 2A.14 on page 100: how to express a matrix using its eigenvalues and eigenvectors.
• Positive definite matrices.
• Matrix ranks.
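Result 2A.14 is the spectral decomposition: a symmetric matrix can be written A = Σᵢ λᵢ eᵢ eᵢᵀ in terms of its eigenvalues λᵢ and eigenvectors eᵢ. A quick numerical check in Python/NumPy (the matrix is made up for illustration):

```python
import numpy as np

# A symmetric, positive definite example matrix
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eigh(A)     # eigh is for symmetric matrices

# Spectral decomposition: rebuild A as Σ_i λ_i e_i e_iᵀ
# (the eigenvectors are the columns of eigvecs)
A_rebuilt = sum(lam * np.outer(v, v)
                for lam, v in zip(eigvals, eigvecs.T))

# Positive definiteness can be read off the eigenvalues: all λ_i > 0
is_pos_def = np.all(eigvals > 0)
```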
Summary

• Multivariate data and matrices
  - n measurements of p variables, stored in a matrix.
• Descriptive statistics
  - Sample means and variances are calculated for each variable.
  - Covariances and correlations between pairs of variables.
  - Stored in arrays.
• Graphical methods
  - 3D plots, scatter plots, bubble plots, stars, Chernoff faces, Andrews' curves.
• Distances
  - Modify Euclidean distance to account for correlations and differences in variance.