Algebraic Statistics Tutorial I
Seth Sullivant
North Carolina State University
July 22, 2012

Example: Hardy-Weinberg Equilibrium

Suppose a gene has two alleles, a and A. If allele a occurs in the population with frequency $\theta$ (and A with frequency $1 - \theta$) and these alleles are in Hardy-Weinberg equilibrium, the genotype frequencies are

\[ P(X = aa) = \theta^2, \quad P(X = aA) = 2\theta(1 - \theta), \quad P(X = AA) = (1 - \theta)^2. \]

The model of Hardy-Weinberg equilibrium is the set

\[ M = \{ (\theta^2, \, 2\theta(1 - \theta), \, (1 - \theta)^2) \mid \theta \in [0, 1] \} \subset \Delta_3, \]

with vanishing ideal

\[ I(M) = \langle p_{aa} + p_{aA} + p_{AA} - 1, \; p_{aA}^2 - 4 p_{aa} p_{AA} \rangle. \]

Main Point of This Tutorial

Many statistical models are described by (semi)-algebraic constraints on a natural parameter space.
Generators of the vanishing ideal can be useful for constructing algorithms or analyzing properties of statistical models.

Two Examples
Phylogenetic Algebraic Geometry
Sampling Contingency Tables

Phylogenetics Problem

Given a collection of species, find the tree that explains their history.
Data consists of aligned DNA sequences from homologous genes:
Human: ...ACCGTGCAACGTGAACGA...
Chimp: ...ACCTTGGAAGGTAAACGA...
Gorilla: ...ACCGTGCAACGTAAACTA...
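As a quick check of the Hardy-Weinberg example earlier in this section, the following minimal Python sketch evaluates the two generators of $I(M)$ on points of the model, using exact rational arithmetic. The function names `hardy_weinberg` and `ideal_generators` are illustrative choices made here, not names from the tutorial.

```python
from fractions import Fraction


def hardy_weinberg(theta):
    """Genotype frequencies (p_aa, p_aA, p_AA) for allele frequency theta."""
    return theta**2, 2 * theta * (1 - theta), (1 - theta) ** 2


def ideal_generators(p_aa, p_aA, p_AA):
    """Values of the two generators of I(M) at the point (p_aa, p_aA, p_AA)."""
    return p_aa + p_aA + p_AA - 1, p_aA**2 - 4 * p_aa * p_AA


# Both generators vanish at every sampled point of the model.
for k in range(11):
    theta = Fraction(k, 10)
    assert ideal_generators(*hardy_weinberg(theta)) == (0, 0)

# A probability distribution that is not on the model:
# the linear generator vanishes, the quadratic one does not.
off_model = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))
print(ideal_generators(*off_model))
```

The last point sums to 1 but fails the quadratic constraint, so it lies in the simplex without being a Hardy-Weinberg distribution.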
Model-Based Phylogenetics

Use a probabilistic model of mutations.
Parameters for the model are the combinatorial tree $T$, and rate parameters for mutations on each edge.
Models give a probability for observing a particular aligned collection of DNA sequences:
Human: ACCGTGCAACGTGAACGA
Chimp: ACGTTGCAAGGTAAACGA
Gorilla: ACCGTGCAACGTAAACTA
Assuming site independence, data is summarized by the empirical distribution of columns in the alignment, e.g. $\hat{p}(AAA) = 6/18$, $\hat{p}(CGC) = 2/18$, etc.
Use the empirical distribution and a test statistic to find the tree best explaining the data.

Phylogenetic Models

Assuming site independence:
The phylogenetic model is a latent class graphical model.
Each vertex $v \in T$ gives a random variable $X_v \in \{A, C, G, T\}$.
All random variables corresponding to internal nodes are latent.

[Figure: tree with leaves X1, X2, X3 and internal nodes Y1, Y2]

\[ P(x_1, x_2, x_3) = \sum_{y_1} \sum_{y_2} P(y_1) P(y_2 \mid y_1) P(x_1 \mid y_1) P(x_2 \mid y_2) P(x_3 \mid y_2) \]

Algebraic Perspective on Phylogenetic Models

Once we fix a tree $T$ and model structure, we get a map $\phi^T : \Theta \to \mathbb{R}^{4^n}$.
$\Theta \subseteq \mathbb{R}^d$ is a parameter space of numerical parameters (transition matrices associated to each edge).
The map $\phi^T$ is given by polynomial functions of the parameters.
For each $i_1 \cdots i_n \in \{A, C, G, T\}^n$, $\phi^T_{i_1 \cdots i_n}(\theta)$ gives the probability of the column $(i_1, \ldots, i_n)$ in the alignment for the particular parameter choice $\theta$.

\[ \phi^T_{i_1 i_2 i_3}(\pi, a, b, c, d) = \sum_{j_1} \sum_{j_2} \pi_{j_1} a_{j_2, j_1} b_{i_1, j_1} c_{i_2, j_2} d_{i_3, j_2} \]

The phylogenetic model is the set $M_T = \phi^T(\Theta) \subseteq \mathbb{R}^{4^n}$.
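The empirical column distribution from the Model-Based Phylogenetics slide can be reproduced with a few lines of Python. This is only a sketch: the helper name `empirical_column_distribution` is an assumption made here, and the input is the 18-site Human/Chimp/Gorilla alignment shown on that slide.

```python
from collections import Counter
from fractions import Fraction


def empirical_column_distribution(sequences):
    """Relative frequency of each column pattern in an alignment.

    `sequences` is a list of equal-length strings, one per species.
    """
    columns = ["".join(column) for column in zip(*sequences)]
    counts = Counter(columns)
    return {pattern: Fraction(n, len(columns)) for pattern, n in counts.items()}


alignment = [
    "ACCGTGCAACGTGAACGA",  # Human
    "ACGTTGCAAGGTAAACGA",  # Chimp
    "ACCGTGCAACGTAAACTA",  # Gorilla
]
p_hat = empirical_column_distribution(alignment)
print(p_hat["AAA"], p_hat["CGC"])
```

Python reduces the fractions, so the two values print as 1/3 and 1/9, which are the 6/18 and 2/18 quoted on the slide.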
Written in terms of the parameters $(\pi, a, b, c, d)$, the three-leaf model is

\[ p_{i_1 i_2 i_3} = \sum_{j_1} \sum_{j_2} \pi_{j_1} a_{j_2, j_1} b_{i_1, j_1} c_{i_2, j_2} d_{i_3, j_2}. \]

Phylogenetic Varieties and Phylogenetic Invariants

Let $\mathbb{R}[p] := \mathbb{R}[p_{i_1 \cdots i_n} : i_1 \cdots i_n \in \{A, C, G, T\}^n]$.

Definition
Let
\[ I_T := \langle f \in \mathbb{R}[p] : f(p) = 0 \text{ for all } p \in M_T \rangle \subseteq \mathbb{R}[p]. \]
$I_T$ is the ideal of phylogenetic invariants of $T$.
Let
\[ V_T := \{ p \in \mathbb{R}^{4^n} : f(p) = 0 \text{ for all } f \in I_T \}. \]
$V_T$ is the phylogenetic variety of $T$.

Note that $M_T \subset V_T$.
Since $M_T$ is the image of a polynomial map, $\dim M_T = \dim V_T$.

[Figure: four-leaf tree with leaves X1, X2, X3, X4]

\[ p_{lmno} = \sum_{i=1}^{4} \sum_{j=1}^{4} \sum_{k=1}^{4} \pi_i a_{ij} b_{ik} c_{jl} d_{jm} e_{kn} f_{ko} = \sum_{i=1}^{4} \pi_i \Big( \sum_{j=1}^{4} a_{ij} c_{jl} d_{jm} \Big) \cdot \Big( \sum_{k=1}^{4} b_{ik} e_{kn} f_{ko} \Big) \]

\[ \Longrightarrow \quad \operatorname{rank} \begin{pmatrix} p_{1111} & p_{1112} & \cdots & p_{1144} \\ p_{1211} & p_{1212} & \cdots & p_{1244} \\ \vdots & \vdots & \ddots & \vdots \\ p_{4411} & p_{4412} & \cdots & p_{4444} \end{pmatrix} \le 4. \]

Splits and Phylogenetic Invariants

Definition
A split of a set is a bipartition $A|B$. A split $A|B$ of the leaves of a tree $T$ is valid for $T$ if the induced trees $T|_A$ and $T|_B$ do not intersect.

[Figure: four-leaf tree with leaves X1, X2, X3, X4]
Valid: 12|34
Not valid: 13|24

2-way Flattenings and Matrix Ranks

\[ p_{ijkl} = P(X_1 = i, X_2 = j, X_3 = k, X_4 = l) \]

\[ \operatorname{Flat}_{12|34}(P) = \begin{pmatrix} p_{AAAA} & p_{AAAC} & p_{AAAG} & \cdots & p_{AATT} \\ p_{ACAA} & p_{ACAC} & p_{ACAG} & \cdots & p_{ACTT} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p_{TTAA} & p_{TTAC} & p_{TTAG} & \cdots & p_{TTTT} \end{pmatrix} \]

Proposition
Let $P \in M_T$.
If $A|B$ is a valid split for $T$, then $\operatorname{rank}(\operatorname{Flat}_{A|B}(P)) \le 4$. Invariants in $I_T$ are subdeterminants of $\operatorname{Flat}_{A|B}(P)$.
If $C|D$ is not a valid split for $T$, then generically $\operatorname{rank}(\operatorname{Flat}_{C|D}(P)) > 4$.

Phylogenetic Algebraic Geometry

Definition
Phylogenetic Algebraic Geometry is the study of the phylogenetic varieties and ideals $V_T$ and $I_T$.

Using Phylogenetic Invariants to Reconstruct Trees

Definition
A phylogenetic invariant $f \in I_T$ is phylogenetically informative if there is some other tree $T'$ such that $f \notin I_{T'}$.
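The rank statement in the flattening proposition can be checked numerically on one example. The sketch below, under assumptions of its own, builds a single distribution on the four-leaf tree with split 12|34 from randomly chosen rational parameters and computes exact ranks of two flattenings. The helper names, the row-stochastic indexing convention `m[parent][child]`, and the diagonally dominant choice of transition matrices (which guarantees invertibility) are all illustrative assumptions, not part of the tutorial.

```python
import random
from fractions import Fraction
from itertools import product

K = 4  # number of states: A, C, G, T


def random_transition_matrix(rng):
    """Row-stochastic K x K matrix, m[parent][child], with rational entries.

    The diagonal is boosted so the matrix is strictly diagonally dominant,
    hence invertible (an assumption of this sketch).
    """
    rows = []
    for i in range(K):
        weights = [rng.randint(1, 5) for _ in range(K)]
        weights[i] += 20
        total = sum(weights)
        rows.append([Fraction(w, total) for w in weights])
    return rows


def quartet_distribution(rng):
    """Joint distribution of (X1, X2, X3, X4) on the tree with split 12|34.

    Internal node Y1 is adjacent to X1, X2; internal node Y2 to X3, X4.
    """
    weights = [rng.randint(1, 5) for _ in range(K)]
    pi = [Fraction(w, sum(weights)) for w in weights]
    a, b, c, d, e = (random_transition_matrix(rng) for _ in range(5))
    p = {}
    for i, j, k, l in product(range(K), repeat=4):
        p[i, j, k, l] = sum(
            pi[u] * a[u][v] * b[u][i] * c[u][j] * d[v][k] * e[v][l]
            for u in range(K)
            for v in range(K)
        )
    return p


def flattening(p, row_leaves, col_leaves):
    """Flattening matrix; leaves are given as 0-based positions."""
    matrix = []
    for row_states in product(range(K), repeat=len(row_leaves)):
        row = []
        for col_states in product(range(K), repeat=len(col_leaves)):
            index = [None] * 4
            for leaf, state in zip(row_leaves, row_states):
                index[leaf] = state
            for leaf, state in zip(col_leaves, col_states):
                index[leaf] = state
            row.append(p[tuple(index)])
        matrix.append(row)
    return matrix


def matrix_rank(matrix):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [row[:] for row in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            if factor != 0:
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


p = quartet_distribution(random.Random(0))
print(matrix_rank(flattening(p, (0, 1), (2, 3))))  # valid split 12|34
print(matrix_rank(flattening(p, (0, 2), (1, 3))))  # non-valid split 13|24
```

With invertible transition matrices and strictly positive parameters, the first rank is 4 and the second is 16. Exact fractions are used instead of floating point so that no numerical tolerance is needed to decide the rank.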
Idea of Cavender-Felsenstein (1987), Lake (1987):
Evaluate phylogenetically informative phylogenetic invariants at the empirical distribution $\hat{p}$ to reconstruct phylogenetic trees.

Proposition
For each $n$-leaf trivalent tree $T$, let $F_T \subseteq I_T$ be a set of phylogenetic invariants such that, for each $T' \neq T$, there is an $f \in F_T$ such that $f \notin I_{T'}$. Let
\[ f_T := \sum_{f \in F_T} |f|. \]
Then for generic $p \in \bigcup_{T'} M_{T'}$, $f_T(p) = 0$ if and only if $p \in M_T$.

Phylogenetic Algebraic Geometry (topics)
Using Phylogenetic Invariants to Reconstruct Trees
Identifiability of Phylogenetic Models
Interesting math, useful in other problems

Performance of Invariants Methods in Simulations

Huelsenbeck (1995) did a systematic simulation comparison of 26 different methods of constructing a phylogenetic tree on 4-leaf trees.
Invariant-based methods did poorly.
HOWEVER... Huelsenbeck only used linear invariants.
Casanellas, Fernandez-Sanchez (2006) redid these simulations using a generating set of the phylogenetic ideal $I_T$.
Phylogenetic invariants become comparable to other methods.
For the particular model studied in Casanellas, Fernandez-Sanchez (2006) for a tree with 4 leaves, the ideal $I_T$ has 8002 generators, so
\[ f_T := \sum_{f \in F_T} |f| \]
is a sum of 8002 terms.
Major work is needed to overcome the combinatorial explosion for larger trees.

Identifiability of Phylogenetic Models

Definition
A parametric statistical model is identifiable if it gives a 1-to-1 map from parameters to probability distributions.

"Is it possible to infer the parameters of the model from data?"
Identifiability guarantees consistency of statistical methods (ML).
Two types of parameters to consider for phylogenetic models:
Numerical parameters (transition matrices)
Tree parameter (combinatorial type of tree)
Geometric Perspective on Identifiability

Definition
The unrooted tree parameter $T$ in a phylogenetic model is identifiable if for all $p \in M_T$ there does not exist another $T' \neq T$ such that $p \in M_{T'}$.

[Figure: models M1, M2, M3; panels "Identifiable" and "Not Identifiable"]

Generic Identifiability

Definition
The tree parameter in a phylogenetic model is generically identifiable if for all $n$-leaf trees $T'$ with $T' \neq T$,
\[ \dim(M_T \cap M_{T'}) < \min(\dim(M_T), \dim(M_{T'})). \]

[Figure: models M1, M2, M3]

Proving Identifiability with Algebraic Geometry

Proposition
Let $M_0$ and $M_1$ be two algebraic models. If there exist phylogenetically informative invariants $f_0$ and $f_1$ such that $f_i(p) = 0$ for all $p \in M_i$, and $f_i(q) \neq 0$ for some $q \in M_{1-i}$, then
\[ \dim(M_0 \cap M_1) < \min(\dim M_0, \dim M_1). \]

[Figure: models M0 and M1 with the hypersurface f = 0]

Phylogenetic Models are Identifiable

Theorem
The unrooted tree parameter of phylogenetic models is generically identifiable.

Proof.
Edge flattening invariants can detect which splits are implied by a specific distribution in $M_T$.
The splits in $T$ uniquely determine $T$.

Phylogenetic Mixture Models

The basic phylogenetic model assumes the same parameters at every site.
This assumption is not accurate within a single gene:
Some sites are more important: unlikely to change.
Tree structure may vary across genes.

Identifiability Questions for Mixture Models

Question
For a fixed number of trees $r$, are the tree parameters $T_1, \ldots, T_r$, and rate parameters of each tree, (generically) identified in phylogenetic mixture models?

$r = 1$ (ordinary phylogenetic models):
Most models are identifiable on ≥ 2, 3, 4 leaves.
(Rogers, Chang, Steel, Hendy, Penny, Székely, Allman, Rhodes, Housworth, ...)
$r > 1$, $T_1 = T_2 = \cdots = T_r$, but no restriction on the number of trees:
Not identifiable (Matsen-Steel, Stefankovic-Vigoda)
$r > 1$, $T_i$ arbitrary:
Not identifiable (Mossel-Vigoda)

Variation across sites leads to mixture models for different classes of sites.
$M(T, r)$ denotes a same tree mixture model with underlying tree $T$ and $r$ classes of sites.

Theorem (Rhodes-Sullivant 2011)
The unrooted tree and numerical parameters in an $r$-class, same tree phylogenetic mixture model on $n$-leaf trivalent trees are

How to Construct Phylogenetic Invariants?

Theorem (Sturmfels-S, Allman-Rhodes, Casanellas-S, Draisma-Kuttler)
Consider “nice” algebraic phylogenetic model.