
Lecture 5: Ecological metrics; Principal Coordinates Analysis

Univariate testing vs. community analysis

• Univariate testing deals with hypotheses concerning individual taxa:
  – Is this taxon differentially present/abundant in different samples?
  – Is this taxon correlated with a given continuous variable?
• What if we would like to draw conclusions about the community as a whole?

Useful ideas from modern multivariate analysis

• Distances between anything (abundances, presence-absence, graphs, trees).
• Direct hypotheses based on distances.
• Decompositions through iterative structuration.
• Projections.
• Randomization tests, probabilistic simulations.


Data → Distances → Statistics

[Data table: rows OTU-1 … OTU-10,000, columns samples S-1 … S-16, entries are abundances; the table is converted into a sample-by-sample distance matrix.]

• Visualization: principal coordinates analysis
• Statistical hypothesis testing: PERMANOVA
(An end-to-end R sketch of this pipeline follows.)
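As a concrete illustration of the pipeline, here is a minimal R sketch using the vegan package; the otu count matrix and the meta$group factor are simulated stand-ins, not data from the slides.

    # Data -> distances -> statistics: simulated OTU counts for 16 samples
    library(vegan)
    set.seed(1)
    otu  <- matrix(rpois(16 * 50, lambda = 5), nrow = 16,
                   dimnames = list(paste0("S-", 1:16), paste0("OTU-", 1:50)))
    meta <- data.frame(group = rep(c("A", "B"), each = 8))

    d   <- vegdist(otu, method = "bray")     # sample-by-sample distance matrix
    ord <- cmdscale(d, k = 2, eig = TRUE)    # principal coordinates (visualization)
    plot(ord$points, col = as.integer(factor(meta$group)), pch = 19)

    adonis2(d ~ group, data = meta)          # PERMANOVA test on the distances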

What is a distance?

• A scalar function d(·,·) of two arguments such that:
  – d(x, y) ≥ 0: always non-negative;
  – d(x, x) = 0: distance to self is 0;
  – d(x, y) = d(y, x): distance is symmetric;
  – d(x, y) ≤ d(x, z) + d(z, y): triangle inequality.
(A numeric check of these axioms is sketched below.)
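A quick numeric check of these properties on an ordinary Euclidean distance matrix (the data are simulated, purely for illustration):

    X <- matrix(rnorm(20), nrow = 5)        # 5 points in 4 dimensions
    D <- as.matrix(dist(X))                 # Euclidean distances between rows

    all(D >= 0)                             # non-negativity
    all(diag(D) == 0)                       # distance to self is 0
    all(D == t(D))                          # symmetry
    # triangle inequality: d(x, y) <= d(x, z) + d(z, y) for every triple
    all(sapply(1:nrow(D), function(z)
      all(D <= outer(D[, z], D[z, ], "+") + 1e-12)))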


Using distances to capture multidimensional heterogeneous information

• A “good” distance will enable us to analyze any type of data usefully.
• We can build specialized distances that incorporate different types of information (abundance, trees, geographical locations, etc.).
• We can visualize complex data as long as we know the distances between objects (observations, variables).
• We can compute distances (correlations) between distances to compare them.
• We can decompose the sources of variability contributing to distances in an ANOVA-like fashion.

Distance and similarity

• Sometimes it is conceptually easier to talk about similarities rather than distances – e.g., sequence similarity.
• Any similarity S can be converted into a distance, e.g.:
  – If S takes values in (0, 1): D = 1 − S.
  – If S > 0: D = 1/S or D = exp(−S).


A few useful distances and similarity indices

• Distances:
  – Euclidean (remember the Pythagorean theorem): sqrt( Σ (xi − yi)² ).
  – Weighted Euclidean, e.g. the χ² statistic: χ² = Σ (ei − oi)² / ei.
  – Hamming: Σ 1{xi ≠ yi}; L1 (Manhattan): Σ |xi − yi|; Bray-Curtis: Σ |xi − yi| / Σ (xi + yi).
  – UniFrac (later).
  – Jensen-Shannon: JSD(X, Y) = [D(X|M) + D(Y|M)] / 2, where M = (X + Y)/2 and
    D(X|Y) = Σ xi ln(xi/yi) is the Kullback-Leibler divergence.
• Similarity (for binary data, with fab the number of positions where x = a and y = b):
  – Correlation coefficient.
  – Matching coefficient: (f11 + f00) / (f11 + f10 + f01 + f00).
  – Jaccard similarity index: f11 / (f11 + f10 + f01).
(R sketches of Bray-Curtis, Jaccard and Jensen-Shannon follow.)
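A few of these are available directly in R; the sketch below assumes the vegan package for Bray-Curtis and Jaccard, and uses a small hand-written helper for Jensen-Shannon (the pseudocount eps is my addition, to keep the Kullback-Leibler terms finite when an entry is zero).

    library(vegan)
    x <- c(5, 0, 3, 2)                               # toy abundance profiles
    y <- c(1, 4, 3, 0)

    dist(rbind(x, y))                                # Euclidean
    vegdist(rbind(x, y), method = "bray")            # Bray-Curtis
    vegdist(rbind(x, y), method = "jaccard", binary = TRUE)  # Jaccard (presence/absence)

    # Jensen-Shannon divergence on relative abundances
    jsd <- function(p, q, eps = 1e-10) {
      p <- (p + eps) / sum(p + eps)
      q <- (q + eps) / sum(q + eps)
      m <- (p + q) / 2
      kl <- function(a, b) sum(a * log(a / b))       # Kullback-Leibler D(a|b)
      (kl(p, m) + kl(q, m)) / 2
    }
    jsd(x, y)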

UniFrac distance (Lozupone and Knight, 2005)

• A distance between groups of organisms related by a tree.
• Definition: the ratio of the sum of the lengths of the branches leading to sample X or sample Y, but not both, to the sum of all branch lengths of the tree.

[Diagram: phyloseq component data objects — OTU abundance table (class otuTable; slots .Data, speciesAreRows), sample variables (class sampleData; slots .Data, row.names, names), taxonomy table (class taxonomyTable; slot .Data), and phylogenetic tree (class phylo, see the ape package) — combined into an experiment-level phyloseq object with slots otuTable, sampleData, taxTab and tre.]

Weighted UniFrac (Lozupone et al., 2007)

u = Σ_{i=1..n} b_i × | A_i/A_T − B_i/B_T |

where the sum runs over the n branches of the tree, b_i is the length of branch i, A_i and B_i are the numbers of sequences descended from branch i in communities A and B, and A_T and B_T are the total numbers of sequences in A and B.
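A hedged sketch of computing UniFrac distances with the phyloseq package, using its bundled GlobalPatterns example data (pruned to the 200 most abundant taxa only to keep the example fast):

    library(phyloseq)
    data(GlobalPatterns)

    top200 <- names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE))[1:200]
    gp     <- prune_taxa(top200, GlobalPatterns)

    d_u <- UniFrac(gp, weighted = FALSE)   # unweighted UniFrac (shared branch lengths)
    d_w <- UniFrac(gp, weighted = TRUE)    # weighted UniFrac (abundance-weighted)
    # equivalently: distance(gp, method = "unifrac") or distance(gp, method = "wunifrac")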

A note of warning!

• “Garbage in, garbage out.”
• Wrong normalization ⇒ wrong distance ⇒ wrong answer.
• However, given the many choices, there is not much beyond prior knowledge, experience and intuition to guide the selection of a distance.


Distance matrix

• It is convenient to organize distances as a matrix, A = (aij).
• Distance matrices are:
  – symmetric: aij = aji;
  – zero on the diagonal: aii = 0.
• A distance matrix is Euclidean if it is possible to generate these distances from a set of n points in Euclidean space.
• A distance matrix is commonly represented by just its lower (or upper) triangular entries.

Some uses of distances

• Suppose D is a distance matrix for n objects, and the objects are of several kinds indicated by a factor variable F.
• Intra-group distances are the distances between objects of the same kind.
• Inter-group distances are the distances between objects of different kinds.
• The mean distance between an object and a group of other objects is equal to the distance between the object and the center of the group.
(A small sketch computing intra- and inter-group mean distances follows.)
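A small illustration of intra- and inter-group mean distances for a grouping factor (the objects and the grouping are simulated):

    X   <- rbind(matrix(rnorm(40, mean = 0), nrow = 10),   # group A
                 matrix(rnorm(40, mean = 2), nrow = 10))   # group B
    grp <- factor(rep(c("A", "B"), each = 10))
    D   <- as.matrix(dist(X))

    within_grp  <- outer(grp, grp, "==") & upper.tri(D)   # pairs from the same group
    between_grp <- outer(grp, grp, "!=") & upper.tri(D)   # pairs from different groups
    c(intra = mean(D[within_grp]), inter = mean(D[between_grp]))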


PRINCIPAL COORDINATE ANALYSIS

Vector

A vector, v, of dimension n is an n × 1 rectangular array of elements:

v = (v1, v2, …, vn)'   (written as a column)

Vectors will be column vectors; they may also be row vectors, when transposed: v' = [v1, v2, …, vn].

A vector, v, of dimension n can be thought of as a point in n-dimensional space.

Every multivariate sample can be represented as a vector in some vector space.

[Figure: a vector v = (v1, v2, v3)' drawn as a point in 3-dimensional space with axes v1, v2, v3.]

Vector Basis

• A basis is a set of linearly independent vectors that span the vector space.
• Spanning the vector space: any vector in the space may be represented as a linear combination of the basis vectors.
• In an orthogonal basis the vectors are pairwise orthogonal (dot product zero); if, in addition, all the vectors are of length 1, the basis is called orthonormal.

Matrix

An n × m matrix, A, is a rectangular array of elements:

A = (aij),  i = 1, …, n (rows),  j = 1, …, m (columns)

n = # of rows, m = # of columns; dimensions = n × m.

Note: let A and B be two matrices whose inverses exist, and let C = AB. Then the inverse of C exists and C⁻¹ = B⁻¹A⁻¹.

Proof: C[B⁻¹A⁻¹] = [AB][B⁻¹A⁻¹] = A[BB⁻¹]A⁻¹ = A[I]A⁻¹ = AA⁻¹ = I.
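A quick numeric check of this identity in R (random Gaussian matrices, which are invertible with probability one):

    set.seed(2)
    A <- matrix(rnorm(9), 3, 3)
    B <- matrix(rnorm(9), 3, 3)
    max(abs(solve(A %*% B) - solve(B) %*% solve(A)))   # ~ 0 up to rounding error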

Diagonalization theorem

If the matrix A is symmetric with distinct eigenvalues λ1, …, λn and corresponding eigenvectors x1, …, xn, normalized so that xi'xi = 1, then

A = λ1 x1 x1' + … + λn xn xn' = P D P',

where P = [x1, …, xn] and D = diag(λ1, …, λn).
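A numeric illustration of the theorem using R's eigen() on a symmetric matrix:

    set.seed(3)
    M <- crossprod(matrix(rnorm(16), 4, 4))   # symmetric matrix
    e <- eigen(M, symmetric = TRUE)
    P <- e$vectors                            # orthonormal eigenvectors as columns
    D <- diag(e$values)
    max(abs(M - P %*% D %*% t(P)))            # ~ 0, i.e. M = P D P'
    max(abs(crossprod(P) - diag(4)))          # ~ 0, i.e. P'P = I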


Basic idea for analysis of multidimensional data

• Compute distances.
• Reduce dimensions.
• Embed in Euclidean space.
• The general framework behind this process is the duality diagram: the triplet (X_{n×p}, Q_{p×p}, D_{n×n}), where
  – X_{n×p} is the (centered) data matrix,
  – Q_{p×p} holds the column weights (weights on variables),
  – D_{n×n} holds the row weights (weights on observations).

The duality diagram (S. Holmes, “Multivariate Data Analysis: The French Way”)

Background: Yves Escoufier [8][10] publicized the method to biologists and ecologists, presenting a formulation based on his RV-coefficient. The first software implementation of the duality-based methods described here was LEAS (1984), a Pascal program written for Apple II computers; the most recent implementation is the R package ade4.

Notation: the data are p variables measured on n observations, recorded in a matrix X with n rows (the observations) and p columns (the variables). D is an n×n matrix of weights on the observations, most often diagonal. A “neighborhood” relation (thought of as a metric on the observations) is defined by a symmetric positive definite matrix Q. For example, to standardize the variables Q can be chosen as

Q = diag(1/σ1², 1/σ2², …, 1/σp²).

These three matrices form the essential “triplet” (X, Q, D) defining a multivariate data analysis. The approach is geometrical: Q and D define geometries (inner products) in R^p and R^n respectively through

x'Qy = <x, y>_Q for x, y in R^p,    x'Dy = <x, y>_D for x, y in R^n.

From these definitions there is a close relation between this approach and kernel-based methods (see [24]). Escoufier [8] proposed to associate to a data set an operator from the space of observations into the dual of the space of variables:

• V = X'DX (p×p) and W = XQX' (n×n).
• Duality: the eigendecomposition of VQ leads to the eigendecomposition of WD.
• The inertia is equal to the trace (the sum of the diagonal elements) of VQ, or equivalently of WD.

This is known as the duality diagram because knowledge of the eigendecomposition of X'DXQ = VQ leads to that of the dual operator XQX'D. The main consequence is an easy transition between principal components and principal axes. (A sketch of the triplet as implemented in ade4 follows.)
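As a sketch of how the triplet appears in software, ade4's dudi.pca stores the row and column weights explicitly; with centred and scaled data this corresponds to Q = I and D = (1/n) I, as on the next slide. USArrests is used only as a convenient built-in data set.

    library(ade4)
    pca <- dudi.pca(USArrests, center = TRUE, scale = TRUE,
                    scannf = FALSE, nf = 2)
    pca$cw         # column weights: the diagonal of Q (all 1 here)
    pca$lw         # row weights: the diagonal of D (all 1/n here)
    pca$eig        # eigenvalues of VQ: the decomposition of the inertia
    sum(pca$eig)   # total inertia = trace(VQ) = p for scaled data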


Principal Component Analysis (PCA)

• Let Q = I and D = (1/n) I, and let X be centered.
• VQ = X'DXQ = (1/n) X'X.
• The inertia Tr(VQ) = the sum of the variances.
• PCA decomposes the variance of X into independent components.
• To decompose the inertia means to find the eigensystem of VQ or, equivalently, of WD.
• Eigenvalues give the amount of inertia explained in the corresponding dimension.
• Eigenvectors give the dimensions of variability.

Example PCA

PCA of the USArrests data (Murder, Assault, UrbanPop and Rape rates for the US states); first rows of the data:

State       Murder  Assault  UrbanPop  Rape
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6
Colorado       7.9      204        78  38.7

[Figures: screeplot (plot of inertia per component, Comp.1–Comp.4) and biplot of the states and variables in the plane of Comp.1 and Comp.2.]

Loadings:
           Comp.1  Comp.2  Comp.3  Comp.4
Murder     -0.536   0.418  -0.341   0.649
Assault    -0.583   0.188  -0.268  -0.743
UrbanPop   -0.278  -0.873  -0.378   0.134
Rape       -0.543  -0.167   0.818
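A sketch reproducing this example in R, assuming the slide used princomp on the built-in USArrests data with the correlation matrix (cor = TRUE), which matches the loadings shown:

    pc.cr <- princomp(USArrests, cor = TRUE)
    summary(pc.cr)     # proportion of variance (inertia) per component
    loadings(pc.cr)    # the loadings table above (signs are arbitrary)
    screeplot(pc.cr)   # plot of inertia per component
    biplot(pc.cr)      # states and variables in the plane of Comp.1 and Comp.2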

Centering

• Let Y be the uncentered data matrix with n observations (rows) and p variables (columns).
• Let H = I − (1/n) 1 1', where 1 is the n-vector of ones.
• Then X = HY is centered.
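A quick check in R that H centres the columns (USArrests is used only as a convenient built-in data set):

    Y <- as.matrix(USArrests)
    n <- nrow(Y)
    H <- diag(n) - matrix(1/n, n, n)      # H = I - (1/n) 1 1'
    X <- H %*% Y
    max(abs(X - scale(Y, center = TRUE, scale = FALSE)))   # ~ 0
    round(colMeans(X), 12)                                 # all ~ 0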

From Euclidean distances to PCA to PCoA

• Note that if D is a Euclidean distance matrix computed from the centered data X, then XX' = −(1/2) H D(2) H, where D(2) is the matrix of squared distances.
• PCoA is a generalization of PCA in that knowledge of X is not required: all you need to represent the points is D, the inter-point distance matrix.


Representation of (arbitrary) distances in Euclidean space

• The idea is to use the singular value decomposition (SVD) of the centered inter-point distance matrix to extract Euclidean dimensions.
• SVD: X = U S V', where S is diagonal with elements s1 ≥ s2 ≥ … ≥ sn ≥ 0 (the singular values), and U and V are orthogonal matrices (their columns are orthonormal and span the corresponding spaces).

PCoA details

• Algorithm, starting from the matrix D of inter-point distances:
  – Center the rows and columns of the matrix of squared (element-by-element) distances: S = −(1/2) H D(2) H.
  – Diagonalize S (its SVD and eigendecomposition coincide since S is symmetric): S = U Λ U'.
  – Extract the Euclidean representation: X = U Λ^(1/2).
• The relative values of the diagonal elements of Λ give the proportion of variability explained by each of the axes.
• The values of Λ should always be examined when deciding how many dimensions to retain.
(A minimal R sketch of this algorithm follows.)
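A minimal R sketch of this algorithm, applied to Euclidean distances on the built-in USArrests data for illustration, and compared against the built-in cmdscale():

    d  <- dist(scale(USArrests))
    D2 <- as.matrix(d)^2                     # element-by-element squared distances
    n  <- nrow(D2)
    H  <- diag(n) - matrix(1/n, n, n)        # centering matrix
    S  <- -0.5 * H %*% D2 %*% H              # double-centred matrix
    e  <- eigen(S, symmetric = TRUE)

    round(e$values[1:4] / sum(e$values), 3)  # proportion of inertia per axis
    X2 <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))   # 2-D coordinates

    ord <- cmdscale(d, k = 2, eig = TRUE)    # built-in equivalent
    max(abs(abs(X2) - abs(ord$points)))      # same coordinates up to axis signs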


Beta-diversity; ordination analysis

ISME J. 2016 Mar 25. doi: 10.1038/ismej.2016.37

Differentiation of microbiota between diabetic and non-diabetic subjects and across body sites

[Figure: PCoA ordinations based on weighted UniFrac distances, with samples colored by diabetic vs. control status and body site (arm/foot). Panels: (A) Hands, PC1 [28.5%] vs. PC2 [12.4%]; (B) Feet, PC1 [54.0%] vs. PC2 [7.9%]; (C) Hands and Feet, PC1 [37.8%] vs. PC2 [7.4%].]

Redel et al. J Infect Dis. 2013 207(7):1105-14


Suggested reading/references

• Susan Holmes, “Multivariate Data Analysis: The French Way”, IMS Lecture Notes–Monograph Series, 2006.
• Any proof-based linear algebra textbook.
