
Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis • Univariate testing deals with hypotheses concerning individual taxa – Is this taxon differentially present/abundant in different samples? – Is this taxon correlated with a given continuous variable? • What if we would like to draw conclusions about the community as a whole? 2 Useful ideas from modern statistics • Distances between anything (abundances, presence-absence, graphs, trees) • Direct hypotheses based on distances. • Decompositions through iterative structuration. • Projections. • Randomization tests, probabilistic simulations. 3 Data à Distances à Statistics S-1 S-2 … S-16 0.54 0.54 0.34 OTU-1 0.60 0.17 0.10 0.01 0.75 0.17 0.72 OTU-2 0.24 0.34 0.74 0.89 0.58 0.08 0.43 0.82 0.91 0.69 0.64 OTU-3 0.57 0.06 0.59 0.03 0.69 0.89 0.07 OTU-4 0.88 0.68 0.59 0.49 0.19 0.64 0.54 0.26 0.24 0.71 0.58 0.52 0.93 0.94 0.46 0.70 0.30 OTU-5 0.92 0.46 0.17 0.21 0.17 0.02 0.81 0.22 0.45 0.59 0.51 0.23 0.74 0.36 0.15 1.00 0.95 0.71 0.88 0.94 0.65 …. 0.29 0.30 0.75 0.28 0.04 0.45 0.15 0.44 0.59 0.89 0.07 0.19 0.57 0.63 0.27 0.72 0.34 0.70 0.19 0.17 0.85 0.07 0.97 0.24 0.45 OTU-10,000 0.95 0.53 0.86 0.01 0.10 0.63 0.40 0.10 0.51 1.00 0.20 0.41 0.11 0.14 0.83 0.43 0.12 0.39 0.43 0.87 0.42 0.17 0.62 0.57 0.03 0.58 0.54 0.48 0.83 Visualization (Principal coordinates analysis) Statistical hypothesis testing (PERMANOVA) 4 What is a distance metric? • Scalar function d(.,.) of two arguments • d(x, y) >= 0, always nonnegative; • d(x, x) = 0, distance to self is 0; • d(x, y) = d(y, x), distance is symmetric; • d(x, y) < d(x, z) + d(z, y), triangle inequality. 5 Using distances to capture multidimensional heterogeneous information • A “good” distance will enable us to analyze any type of data usefully • We can build specialized distances that incorporate different types of information (abundance, trees, geographical locations, etc.) • We can visualize complex data as long as we know the distances between objects (observations, variables) • We can compute distances (correlations) between distances to compare them • We can decompose the sources of variability contributing to distances in ANOVA-like fashion 6 Distance and similarity • Sometimes it is conceptually easier to talk about similarities rather than distances – E.g. sequence similarity • Any similarity measure can be converted into a distance metric, e.g. – S – If S is (0, 1), D=1-S – If S>0, D = 1/S or D = exp(-S) 7 A few useful distances and similarity indices • Distances: 2 – Euclidean: (remember Pythagoras theorem) Σ(xi-yi) 2 2 – Weighted Euclidean: χ = Σ(ei - oi) /ei – Hamming/L1, Bray Curtis = Σ 1{xi=yi} – Unifrac (later) – Jensen-Shannon: (D(X|M) + D(Y|M))/2, where • M = (X + Y)/2 • Kullback-Leibler divergence: D(X|Y) = Σ ln[xi/yi]xi x\y 1 0 • Similarity: 1 f11 f10 – Correlation coefficient 0 f01 f00 – Matching coefficient: (f11+f00)/(f11 + f10 + f01 + f00) – JaccardSimilarity Index: f11/(f11 + f10 + f01) 8 Unifracdistance (Lozupone and Knight, 2005) • Is a distance between groups of organisms related by a tree • Definition: Ratio of the sum of the length of the branches leading to sample X or Y, but not both, to the sum of all branch lengths of the tree. matrix data.frame matrix Sample Variables sampleData Phylogenetic Tree OTU Abundance slots: .Data, Taxonomy Table class: phylo class: otuTable names, taxonomyTable slots: see ape slots: .Data, row.names, slots: .Data speciesAreRows .S3Class package Component data objects: phyloseq slots: otuTable Experiment-level data object: sampleData taxTab tre 9 Weighted Unifrac(Lozupone et al., 2007) × | − | !=1 = 10 A note of warning! • “Garbage in, garbage out” • Wrong normalization => wrong distance => wrong answer • However, given the many choices there isn’t much beyond prior knowledge, experience and intuition to guide in selection of the distance. 11 Distance matrix • It is convenient to organize distances as a matrix, A=(aij) • Distance matrices are: – Symmetric: aij = aji – Diagonals are 0: aii = 0. • A distance matrix is Euclidean if it is possible to generate these distances from a set of n points in Euclidean space. • Distance matrix is commonly represented by just lower (or upper) diagonal entries. 12 Some uses of distances • Suppose D is a distance matrix for n objects. The objects are of several kinds indicated by a factor variable F; • Intra-group distances are the distances between objects of the same kind; • Inter-group distances are the distances between objects of different kinds; • Mean distance between an object and a group of other objects is equal to the distance between the object and the center of the group. 13 PRINCIPAL COORDINATE ANALYSIS 14 Vector A vector, v, of dimension n is an n × 1 rectangular array of elements ! v $ # 1 & # v2 & v = # & # ! & v "# n %& vectors will be column vectors. They may also be row vectors, when transposed T v =[v1, v2, … , vn]. 15 A vector, v, of dimension n ! v $ # 1 & # v2 & v = # & # ! & v "# n %& can be thought a point in n dimensional space 16 Every multivariate sample can be represented as a vector in some vector space v3 ⎡⎤v1 v ⎢⎥v = ⎢⎥2 ⎣⎦⎢⎥v3 v2 v 1 17 Vector Basis • A basis is a set of linearly independent (dot product is zero) vectors that span the vector space. • Spanning the vector space: Any vector in this vector space may be represented as a linear combination of the basis vectors. • The vectors forming a basis are orthogonal to each other. If all the vectors are of length 1, then the basis is called orthonormal. 18 Matrix An n × m matrix, A, is a rectangular array of elements ! a a ! a $ # 11 12 1m & # a21 a22 ! a2m & A = (aij ) = # & # " " # " & a a a "# n1 n2 … nm %& n = # of rows m = # of columns dimensions = n × m 19 Note: Let A and B be two matrices whose inverse exists. Let C = AB. Then the inverse of the matrix C exists and C-1 = B-1A-1. Proof C[B-1A-1] = [AB][B-1A-1] = A[B B-1]A-1 = A[I]A-1 = AA-1=I 20 Diagonalization Thereom If the matrix A is symmetric with distinct eigenvalues, , … , , with λ1 λ!n ! corresponding eigenvectors x ,…,x ! ! 1 n Assume x!x =1 i !i ! ! ! then A = λ1x1x1! +…+ λn xn x!n ! $! ! $ λ1 " 0 x1' ! ! # &# & = x ,…, x # # $ # &# &= PDPʹ [ 1 n ] ! # 0 " λ &# x' & "# n %&"# n %& 21 Basic idea for analysis of multidimensional data • Compute distances • Reduce dimensions • Embed in Euclidean space • The general framework behind this process is called Duality diagram: (Xnxp, Qpxp, Dnxn) – Xnxp (centered) data matrix – Qpxp column weights (weights on variables) – Dnxn row weights (weights on observations) 22 S. Holmes/The French way. 2 which were translated into English. My PhD advisor, Yves Escoufier [8][10] pub- licized the method to biologists and ecologists presenting a formulation based on his RV-coefficient, that I will develop below. The first software implementation of duality based methods described here were done in LEAS (1984) a Pascal program written for Apple II computers. The most recent implementation is the R pack- age ade-4 (see A for a review of various implementations of the methods described here). 2.1. Notation The data are p variables measures on n observations. They are recorded in a matrix X with n rows (the observations) and p columns (the variables). Dn is an nxn matrix of weights on the ”observations”, most often diagonal. We will also use a ”neighborhood” relation (thought of as a metric on the observations) defined by taking a symmetric definite positive matrix Q. For example, to standardize the variables Q can be chosen as 1 2 000... σ1 1 0 2 00... σ2 Q = 0 1 1 . 00 2 0 ... σ3 B 1 C B ... ... ... 0 2 C B σp C @ A These three matrices form the essential ”triplet” (X, Q, D) defining a multivariate data analysis. As the approach here is geometrical, it is important to see that Q and D define geometries or inner products in Rp and Rn respectively through t p x Qy =<x,y>Q x, y R t 2 n x Dy =<x,y>D x, y R 2 From these definitions we see there is a close relation between this approach and kernel based methods, for more details see [24]. Q can be seen as a linear function p p p p from R to R ⇤ = (R ),Duality diagram defines the geometry the space of scalar linear functions on R . D can be seen L n n n as a linear function from R to R ⇤ = (R ). Escoufier[8] proposed to associate to Lof multivariate analysis a data set an operator from the space of observations Rp into the dual of the space T n • V = X DX of variables R ⇤. This is summarized in the following diagram [1] which is made T commutative by defining V and W as XtDX and XQXt respectively.• W = XQX p n • Duality: R ⇤ R −−!X – The eigen Q VDW decomposition of VQ x ? ? x leads to eigen- ? p? ? n? ? R ? ?R ⇤? decomposition of WD ? y Xt y ? • Inertiais equal to trace This is known as the duality diagram because knowledge of the eigendecompo- sition of XtDXQ = VQ leads to that of the dual operator XQX(sum tD.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages17 Page
-
File Size-