<<

Example: Hardy-Weinberg Equilibrium

Suppose a gene has two alleles, a and A. If allele a occurs in the population with frequency θ (and A with frequency 1 − θ) and these Algebraic Tutorial I alleles are in Hardy-Weinberg equilibrium, the genotype frequencies are

P(X = aa)=θ2, P(X = aA)=2θ(1 − θ), P(X = AA)=(1 − θ)2 Seth Sullivant The model of Hardy-Weinberg equilibrium is the set North Carolina State University

July 22, 2012 2 2 M = θ , 2θ(1 − θ), (1 − θ) | θ ∈ [0, 1] ⊂ ∆3 ￿￿ ￿ ￿

2 I(M)=￿paa+paA+pAA−1, paA−4paapAA￿

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 1 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 2 / 32

Main Point of This Tutorial Phylogenetics

Problem Given a collection of species, find the tree that explains their history.

Many statistical models are described by (semi)-algebraic constraints on a natural parameter space. Generators of the vanishing ideal can be useful for constructing algorithms or analyzing properties of statistical model.

Two Examples Phylogenetic Sampling Contingency Tables Data consists of aligned DNA sequences from homologous genes

Human: ...ACCGTGCAACGTGAACGA... Chimp: ...ACCTTGGAAGGTAAACGA... Gorilla: ...ACCGTGCAACGTAAACTA...

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 3 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 4 / 32

Model-Based Phylogenetics Phylogenetic Models

Use a probabilistic model of mutations Assuming site independence: Parameters for the model are the combinatorial tree T , and rate Phylogenetic Model is a latent class graphical model

parameters for mutations on each edge Vertex v ∈ T gives a Xv ∈ {A, C, G, T} Models give a probability for observing a particular aligned All random variables corresponding to internal nodes are latent collection of DNA sequences

ACCGTGCAACGTGAACGA Human: X1 X2 X3 Chimp: ACGTTGCAAGGTAAACGA Gorilla: ACCGTGCAACGTAAACTA Y 2

Assuming site independence, data is summarized by empirical Y 1 distribution of columns in the alignment. 6 2 e.g. pˆ(AAA)= 18 , pˆ(CGC)= 18 , etc. Use empirical distribution and test statistic to find tree best P(x1, x2, x3)= P(y1)P(y2|y1)P(x1|y1)P(x2|y2)P(x3|y2) explaining data ￿y1 ￿y2

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 5 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 6 / 32 Phylogenetic Models Algebraic Perspective on Phylogenetic Models

Assuming site independence: Phylogenetic Model is a latent class graphical model Once we fix a tree T and model structure, we get a map n φT : Θ → R4 . Vertex v ∈ T gives a random variable Xv ∈ {A, C, G, T} Θ Rd All random variables corresponding to internal nodes are latent ⊆ is a parameter space of numerical parameters (transition matrices associated to each edge). The map φT is given by functions of the parameters. X X X 1 2 3 For each i ···i ∈ {A, C, G, T }n, φT (θ) gives the probability of 1 n i1···in ￿ Y the column (i1,...,in) in the alignment for the particular 2 parameter choice θ.

φT (π, a, b, c, d)= π a b c d Y 1 i1i2i3 j1 j2,j1 i1,j1 i2,j2 i3,j2 ￿j1 ￿j2

T 4n The phylogenetic model is the set MT = φ (Θ) ⊆ R . pi1i2i3 = πj1 aj2,j1 bi1,j1 ci2,j2 di3,j2 ￿j1 ￿j2

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 7 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 8 / 32

X X Phylogenetic Varieties and Phylogenetic Invariants X1 X2 3 4

R R n Let [p]:= [pi1···in : i1 ···in ∈ {A, C, G, T } ]

Definition

Let 4 4 4 R R IT := ￿f ∈ [p]:f (p)=0 for all p ∈ MT ￿⊆ [p]. plmno = πi aij bik cjl djmeknfko ￿i=1 ￿j=1 k￿=1 IT is the ideal of phylogenetic invariants of T . Let 4 4 4 = πi  aij cjl djm ·  bik eknfko n R4 ￿i=1 ￿j=1 k￿=1 VT := {p ∈ : f (p)=0 for all f ∈ IT }.    

VT is the phylogenetic variety of T . p1111 p1112 ··· p1144 p p ··· p Note that MT ⊂ VT .  1211 1212 1244 =⇒ rank . . . . ≤ 4 . . .. . Since MT is image of a polynomial map dim MT = dim VT .  . . .  p p ··· p   4411 4412 4444 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 9 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 10 / 32

Splits and Phylogenetic Invariants 2-way Flattenings and Matrix Ranks

pijkl = P(X1 = i, X2 = j, X3 = k, X4 = l) Definition

A split of a set is a bipartition A|B. A split A|B of the leaves of a tree T pAAAA pAAAC pAAAG ··· pAATT is valid for T if the induced trees T |A and T |B do not intersect. p p p ··· p  Flat ACAA ACAC ACAG ACTT 12|34(P)= . . . . .  ......    p p p ··· p   TTAA TTAC TTAG TTTT  X X X1 X2 3 4 Valid: 12|34 Proposition Not Valid: 13|24 Let P ∈ MT . rank Flat If A|B is a valid split for T , then ( A|B(P)) ≤ 4. Flat Invariants in IT are subdeterminants of A|B(P). If C|D is not a valid split for T , then generically rank Flat ( C|D(P)) > 4.

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 11 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 12 / 32 Phylogenetic Algebraic Geometry Using Phylogenetic Invariants to Reconstruct Trees

Definition

A phylogenetic invariant f ∈ IT is phylogenetically informative if there is ￿ some other tree T such that f ∈/ IT ￿ . Phylogenetic Algebraic Geometry is the study of the phylogenetic varieties and ideals VT and IT . Idea of Cavender-Felsenstein (1987), Lake (1987): Evaluate phylogenetically informative phylogenetic invariants at Using Phylogenetic Invariants to Reconstruct Trees empirical distribution pˆ to reconstruct phylogenetic trees Identifiability of Phylogenetic Models Interesting Math– Useful in Other Problems Proposition

For each n-leaf trivalent tree T , let FT ⊆ IT be a set of phylogenetic ￿ invariants such that, for each T ￿= T , there is an f ∈ FT , such that ￿ f ∈/ IT ￿ . Let f := |f |. T f ∈FT Then for generic￿ p ∈∪MT ,fT (p)=0 if and only if p ∈ MT .

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 13 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 14 / 32

Performance of Invariants Methods in Simulations Identifiability of Phylogenetic Models

Huelsenbeck (1995) did a systematic simulation comparison of 26 different methods of constructing a phylogenetic tree on 4 leaf trees. Invariant-based methods did poorly. Definition HOWEVER... Huelsenbeck only used linear invariants. A parametric statistical model is identifiable if it gives 1-to-1 map from Casanellas, Fernandez-Sanchez (2006) redid these simulations parameters to probability distributions. using a generating set of the phylogenetic ideal IT . Phylogenetic invariants become comparable to other methods. “Is it possible to infer the parameters of the model from data?” For the particular model studied in Casanellas, Identifiability guarantees consistency of statistical methods (ML) Fernandez-Sanchez (2006) for a tree with 4 leaves, the ideal IT Two types of parameters to consider for phylogenetic models: has 8002 generators. Numerical parameters (transition matrices) fT := |f | Tree parameter (combinatorial type of tree)

f￿∈FT is a sum of 8002 terms. Major work to overcome combinatorial explosion for larger trees.

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 15 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 16 / 32

Geometric Perspective on Identifiability Generic Identifiability

Definition Definition The unrooted tree parameter T in a phylogenetic model is identifiable if The tree parameter in a phylogenetic model is generically identifiable if for all for all n-leaf trees with T ￿= T ￿ , p ∈ MT there does not exist another T ￿ ￿= T such that dim(MT ∩ MT ￿ ) < min(dim(MT ), dim(MT ￿ )).

p ∈ MT ￿ . M1

M2

M1

M 2 M3 M1 M2

M3

Identifiable Not Identifiable

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 17 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 18 / 32 Proving Identifiability with Algebraic Geometry Phylogenetic Models are Identifiable

Proposition

Let M0 and M1 be two algebraic models. If there exist phylogenetically informative invariants f0 and f1 such that Theorem The unrooted tree parameter of phylogenetic models is generically fi (p)=0 for all p ∈ Mi , and fi (q) ￿= 0 for some q ∈ M1−i , then identifiable. dim min dim dim (M0 ∩ M1) < ( M0, M1). Proof. Edge flattening invariants can detect which splits are implied by a M 0 f = 0 specific distribution in MT . The splits in T uniquely determine T . M1

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 19 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 20 / 32

Phylogenetic Mixture Models Identifiability Questions for Mixture Models

Basic phylogenetic model assume same parameters at every site This assumption is not accurate within a single gene Question Some sites more important: unlikely to change For fixed number of trees r, are the tree parameters T1,...,Tr , and Tree structure may vary across genes rate parameters of each tree (generically) identified in phylogenetic mixture models?

r = 1 (Ordinary phylogenetic models) Most models are identifiable on ≥ 2, 3, 4 leaves. ( Rogers, Chang, Steel, Hendy, Penny, Székely, Allman, Rhodes, Housworth, ...)

r > 1 T1 = T2 = ···= Tr but no restriction on number of trees Not identifiable (Matsen-Steel, Stefankovic-Vigoda)

r > 1, Ti arbitrary Leads to mixture models for different classes of sites Not identifiable (Mossel-Vigoda) M(T , r) denotes a same tree mixture model with underlying tree T and r classes of sites

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 21 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 22 / 32

How to Construct Phylogenetic Invariants?

Theorem (Rhodes-Sullivant 2011) Theorem (Sturmfels-S, Allman-Rhodes, Casanellas-S, The unrooted tree and numerical parameters in a r-class, same tree Draisma-Kuttler) phylogenetic mixture model on n-leaf trivalent trees are Consider “nice” algebraic phylogenetic model. The problem of ￿n/4￿ generically identifiable, if r < 4 . computing phylogenetic invariants for any tree T can be reduced to the same problem for star trees K1,k . Proof Ideas. Phylogenetic invariants from flattenings The ideal IT generated by local Tensor rank (Kruskal’s Theorem) [Allman-Matias-Rhodes 2009] contributions from each K1,k , plus Elementary tree combinatorics flattening invariants from edges.

Solving tree and numerical parameter identifiability at the same The varieties VK1,k are interesting classical time algebraic varieties: toric varieties secant varieties Sec4(P3 × P3 × P3)

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 23 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 24 / 32 Group-based models Equations for the CFN Model

αβββ αβγγ αβγ δ Theorem (Sturmfels-S 2005) αβ βαββ βαγγ βαδγ       For any tree T , the toric ideal IT for the CFN model is generated by ￿βα￿ ββαβ γγαβ γδαβ       degree 2 determinantal equations. βββα γγβα δγβα      

X X X3X 4 Random variables in finite abelian group G. 1 2 Fourier coordinates: rl+sm+tn+uo Transitions probabilities satisfy Prob(X = g|Y = h)=f (g + h). qlmno = r,s,t,u∈{0,1}(−1) prstu This means that the formula for Prob(X1 = g1,...,Xn = gn) is a ￿ IT generated by 2 × 2 minors of: convolution (over Gn). Apply discrete Fourier transform to turn convolution into a product. q0000 q0001 q0010 q0011 q0000 q0011 q0001 q0010 ￿q1100 q1101 q1110 q1111￿ Theorem (Hendy-Penny 1993, Evans-Speed 1993) q0100 q0111 q0101 q0110 q1000 q1011 q1001 q1010 In the Fourier coordinates, a group-based model is parametrized by q0100 q0101 q0110 q0111     q1100 q1111 q1101 q1110 monomial functions in terms of the Fourier parameters. ￿q1000 q1001 q1010 q1011￿     In particular, the CFN model is a toric variety.

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 25 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 26 / 32

Gluing Two Trees at a Leaf Gluing more complex graphs

+ = + =

k Still a group action (Glr (C) ). Let T = T1#T2, tree obtained by joining two trees at a leaf. But factors are not acting independently. Each ring C[p]/I , C[p]/I is invariant under action of group T1 T2 C p I ∼ C p I C C p I G k [ ]/ G =￿ ( [ ]/ G1 ⊗ [ ]/ G2 ) G = Glr (C) acting on the glue leaves. C C C C G [p]/IG generated by degree 1 part of ( [p]/IG1 ⊗ [p]/IG2 ) Theorem (Draisma-Kuttler) (toric fiber product if r = 1)

C p I ∼ C p I C C p I G [ ]/ T = ( [ ]/ T1 ⊗ [ ]/ T2 ) Theorem (Engström-Kahle-S)

VT =(VT1 × VT2 )//G (GIT quotient) Can determine generators of IG from IG1 and IG2 if the TFP has “low codimension”. Actions of individual factors (Glr (C)) do no interact. Use Reynolds operator and first fundamental theorem of CIT. Useful for other problems in algebraic statistics.

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 27 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 28 / 32

Summary: Phylogenetic Algebraic Geometry Problems

Theorem (Allman-Rhodes 2006) Let T be a trivalent tree with n leaves, and consider the general Markov model on binary characters. The phylogenetic ideal I has Phylogenetic models are fundamentally algebraic-geometric T generating set objects. Algebraic perspective is useful for: Flat {3 × 3 minors of A|B(P)} Developing new construction algorithms A|B￿∈Σ(T ) Proving theorems about identifiability (currently best available for mixture models) where Σ(T ) is the set of all valid splits on T . Note that P is a Leads to interesting new mathematics, useful for other problems 2 × 2 × ···× 2, n-way tensor. Long way to go: Your Help Needed! Problem

2 For the 5 leaf tree at the right and write 1 4 Flat down all the matrices A|B(P) that are needed in the previous theorem. 5 3

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 29 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 30 / 32 References

E. Allman, C. Matias, J. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37 no.6A (2009) 3099-3132. F.A. Matsen and M. Steel. Phylogenetic mixtures on a single tree can mimic a tree of another topology. Systematic M. Casanellas, J. Fernandez-Sanchez. Performance of a new invariants method on homogeneous and non-homogeneous Biology, 2007. quartet trees, Molecular Biology and Evolution, 24(1):288-293, 2006. E. Mossel and E. Vigoda Phylogenetic MCMC Are Misleading on Mixtures of Trees. Science 309, 2207–2209 (2005) J. Cavender, J. Felsenstein. Invariants of phylogenies: a simple case with discrete states. Journal of Classification 4 (1987) 57-71. J. Rhodes, S. Sullivant. Identifiability of large phylogenetic mixture models. To appear Bulletin of Mathematical Biology, 2011. 1011.4134 J. Draisma, J Kuttler. On the ideals of equivariant tree models, Mathematische Annalen 344(3):619-644, 2009. D. Stefankovic and E. Vigoda. Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Systematic Biology A. Engström, T. Kahle, S. Sullivant. Multigraded commutative of graph decompositions. 1102.2601 56(1): 113-124, 2007. B. Sturmfels, S. Sullivant. Toric ideals of phylogenetic invariants. Journal of Computational Biology 12 (2005) 204-228. S. Evans and T. Speed. Invariants of some probability models used in phylogenetic inference. Annals of Statistics 21 (1993) 355-377.

M. Hendy, D. Penny. Spectral analysis of phylogenetic data. J. Classification 10 (1993) 5–24.

J. Huelsenbeck. Performance of phylogenetic methods in simulation. Systematic Biology 42 no.1 (1995) 17–48.

J. Lake. A rate-independent technique for analysis of nucleaic acid sequences: evolutionary parsimony. Molecular Biology and Evolution 4 (1987) 167-191.

Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 31 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 32 / 32 Generating Random Tables

Problem

Algebraic Statistics Tutorial II Generate a random table from the set of all nonnegative k1 × k2 integer tables with given row and column sums.

Seth Sullivant r1 North Carolina State University r2 r3 June 10, 2012 c1 c2 c3 c4

Fisher’s Exact Test, Missing Data Problems

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 1 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 2 / 28

Random Walk Connecting Lattice Points in Polytopes

Definition n d d 2 2 2 6 1 0 −1 0 3 2 1 6 Let A : Z → Z a linear transformation, b ∈ Z . 2 2 2 6 −1 0 1 0 1 2 3 6 + = −1 n 4 4 4 0 0 0 4 4 4 A [b]:={x ∈ N : Ax = b} (fiber)

3 2 1 6 1 −1 0 0 4 1 1 6 B ⊂ kerZ A 1 2 3 6 + −1 1 0 0 = 0 3 3 6 −1 −1 4 4 4 0 0 0 4 4 4 Let A [b]B be the graph with vertex set A [b] and u −−v an edge if and only u − v ∈ ±B.

1 −10 10−1 01−1 , , Problem ￿￿−110￿ ￿−10 1￿ ￿0 −11￿￿ −1 Given A and b, find finite B ⊆ kerZ A such that A [b]B is connected. allow for a connected random walk over these contingency tables. Definition −1 If B ⊆ kerZ A is a set such that A [b]B is connected for all b, then B is a Markov basis for A.

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 3 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 4 / 28

Example: 2-way tables 3-way tables

× × × × × Let A : Zk1 k2 k3 → Zk1 k2+k1 k3+k2 k3 be the linear transformation × Let A : Zk1 k2 → Zk1+k2 such that such that

m m k k A(u)= ( u ) ;( u ) ;( u ) A(u)= u1j ,..., uk1j ; ui1,..., uik2   i1i2i3 i1,i2 i1i2i3 i1i3 i1i2i3 i2,i3  ￿j=1 ￿j=1 ￿i=1 ￿i=1 ￿i3 ￿i2 ￿i1     = vector of row and column sums of u = all 2-way margins of 3-way table u = all “line sums” of u. k ×k kerZ(A)={u ∈ Z 1 2 : row and columns sums of u are 0} k1 k2 Markov basis depends on k1, k2, k3, contains moves like: Markov basis consists of the 2 2 2 moves like:

￿ ￿￿ ￿ 1 −1 −11 0000 ￿−11￿￿ 1 −1￿ 10−10   but also non-obvious moves like: −10 1 0   1 −10 −110 000 0 −11 01−1 −110  000  10−1 00 0 0 −11 0001 −10 −10 1 01−1 00 0          

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 5 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 6 / 28 Fundamental Theorem of Markov Bases Toric Varieties = Log-linear Models

Definition Definition Zn Zd Let A : → . The toric ideal IA is the ideal The variety VA = V (IA) is a toric variety. The statistical model M = V (I ) ∩ ∆ is a log-linear model. u v n A A m ￿p − p : u, v ∈ N , Au = Av ￿⊂K[p1,...,pn],

u u1 u2 un MA = {p ∈ ∆m : log p ∈ rowspan A}. where p = p1 p2 ···pn . Fisher’s exact test: Does the data u fit the model MA? Theorem (Diaconis-Sturmfels 1998) The set of moves B ⊆ kerZ A is a Markov basis for A if and only if the b+ b− set of binomials {p − p : b ∈ B} generates IA.

0000 u/||u||  10−10 −→ p21p33 − p23p31 −10 1 0  

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 7 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 8 / 28

2-way tables: Independence Computing Markov Bases

0000 p21 p23  10−10 −→ p21p33 − p23p31 = ￿ p p ￿ Software −10 1 0 ￿ 31 33 ￿   ￿ ￿ 4ti2 www.4ti2.de ￿ ￿ p p ··· p Macaulay2 (4ti2 interface) 11 12 1k2 http://www.math.uiuc.edu/Macaulay2/ p p ··· p  21 22 2k2  Singular (toric package) http://www.singular.uni-kl.de/ IA = ￿2 × 2 minors of . . . . ￿  . . .. .  Theory   pk 1 pk 2 ··· pk k  Gluing Results  1 1 1 2  Finiteness Theorems k1×k2 VA = V (IA)={P ∈ R : rank P ≤ 1} Special Configurations

M = V ∩ ∆ = M A A k1k2 X1⊥⊥ X2

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 9 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 10 / 28

“No Hope” Theorem Which Fibers are Connected?

Problem −1 Let B ⊆ kerZ A. For which b is A [b]B connected? When do Theorem (De Loera-Onn (2006)) −1 −1 u, v ∈ A [b] belong to the same component of A [b]B? Every integer vector appears as part of a minimal Markov basis element for 3 × k2 × k3 tables (with fixed 2-way margins). Example (2 × 3) In particular, minimal Markov basis elements for 3-way tables can 1 −10 01−1 have arbitrarily large entries and arbitrarily large 1-norm. B = , ￿￿−110￿ ￿0 −11￿￿ Example (3 × 4 × 6-tables) For 3 × 4 × 6 tables, minimal Markov basis has 355950 elements. 6 Largest element has 1-norm 28. 6 4 4 4

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 11 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 12 / 28 Enter Lattice Walks and Primary Decomposition (Diaconis-Eisenbud-Sturmfels 1998) Let K[p]:=K[p1,...,pn]. To each m ∈ B associate a binomial

+ − pm − pm ∈ K[p]

Decompose ideal IB = ∩i Ii . + − m m1 mn where m = m − m , p = p1 ···pn . u v u v p − p ∈ IB ⇔ p − p ∈ Ii for all i.

Proposition Hope that ideal Ii are easier to analyze. −1 Let B ⊆ kerZ A. Then u, v ∈ A [b] are in the same component of −1 A [b]B if and only if Theorem (Eisenbud-Sturmfels 1996) Every binomial ideal has a binomial primary decomposition. u v m+ m− p − p ∈ IB := ￿p − p : m ∈ B￿. Dickenstein-Matusevich-Miller, Kahle-Miller (Mesoprimary Theorem (Diaconis-Sturmfels (1998)) decomposition) A set of moves B ⊆ kerZ A is a Markov basis if and only if Algorithms implemented in binomials.m2 (Kahle 2010)

u v n IB = IA := ￿p − p : u, v ∈ N , Au = Av ￿.

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 13 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 14 / 28

2 × 3 tables Graphical Models

1 −10 01−1 B = , ￿￿−110￿ ￿0 −11￿￿ G a graph, N-vertices. N d ∈ Z , di ≥ 2.

p11 p12 p12 p13 Gives set of margins of d × d × ···× dn array. IB = , 1 2 ￿￿ p p ￿ ￿ p p ￿￿ ￿ 21 22 ￿ ￿ 22 23 ￿ C(G)=set of maximal cliques in G. ￿ ￿ ￿ ￿ ￿ p11 p12 ￿ ￿ p12 p13 ￿ p11 p13 = , , ∩￿p21, p22￿ ￿￿ p21 p22 ￿ ￿ p22 p23 ￿ ￿ p21 p23 ￿￿ Definition ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ ￿ = IA￿∩￿p21, p22￿￿ ￿ ￿ ￿ ￿ Let d1×···×dn k AG,d : Z → Z u u u v v v be the linear map that computes the margins associated to all 11 12 13 11 12 13 connected by B if and only if ￿u21 u22 u23￿￿v21 v22 v23￿ C ∈ C(G), of a d1 × ···× dn array. they have the same row and column sums and

u12 + u22 = v12 + v22 > 0.

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 15 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 16 / 28

Example (Row and Column Sums) d1×d2×d3 d1×d2+d1×d3 AG,d : Z → Z d1×d2 d1+d2 1 2 AG,d : Z → Z u u , u 1 2 3 ( ijk )i,j,k ￿→ (( ijk )i,j ( ijk )i,k ) (uij )i,j ￿→ (( uij )i , ( uij )j ) ￿k ￿j ￿j ￿i d =(2, 2, 3) Example (Path) 111000000000 d1×d2×d3 d1×d2+d1×d3 AG,d : Z → Z  000111000000 1 2 000000111000   (uijk )i j k ￿→ (( uijk )i j , ( uijk )i k )  000000000111 3 , , , ,   ￿k ￿j  100100000000 A , =   G d  010010000000    001001000000 Example (4-cycle)    000000100100   1 2  000000010010 × × × × × × ×   A : Zd1 d2 d3 d4 → Zd1 d2+d1 d3+d2 d4+d3 d4  000000001001 G,d   C(G)={{1, 2}, {1, 3}, {2, 4}, {3, 4}} 3 4 u =(u111, u112, u113, u121, u122, u123, u211, u212, u213, u221, u222, u223)

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 17 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 18 / 28 Separating Moves (Conditional Independence) Which Fibers Do CI(G) Moves Connect?

Let A, B, C partition V (G) such that C separates A and B in G. Get moves Proposition (Hammersley-Clifford, Besag (1974)) eiAiBiC + ejAjBiC − eiAjBiC − ejAiBiC CI(G) spans kerZ AG,d for all G. where iA, jA ∈ t∈A[dt ], iB, jB ∈ t∈B[dt ], iC ∈ t∈C[dt ] in kerZ A . G,d ￿ ￿ ￿ Theorem (Dobra (2002), Geiger, Meek, Sturmfels (2006)) 1 −1 These moves naturally generalize for 2-way tables. Separating moves CI(G) are a Markov basis for A if and only if G is ￿−11￿ G,d a chordal graph. CI(G) is set of all separating moves. Problem Example (4-cycle) −1 1 Which fibers A [b] are connected by CI(G) for other graphs? 1 2 G,d 2 ei1i2i3i4 + ej1i2i3j4 − ei1i2i3j4 − ej1i2i3i4 What is the primary decomposition of ICI(G)?

ei1i2i3i4 + ei1j2j3i4 − ei1i2j3i4 − ei1j2i3i4 3 4

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 19 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 20 / 28

Computational Results 2 × 3 tables

1 −10 01−1 Theorem (Kahle-Rauh-S (2012)) B = , ￿￿−110￿ ￿0 −11￿￿ Let #V (G)=n ≤ 5,di = 2 for all i. Then I is radical. CI(G) p11 p12 p12 p13 − IB = , A 1 [b] is connected if b is in the interior of the marginal cone. ￿￿ p p ￿ ￿ p p ￿￿ G,d CI(G) ￿ 21 22 ￿ ￿ 22 23 ￿ −1 = IA￿∩￿p21, p22￿￿ ￿ ￿ AG,d [b]CI(G) is connected if b is positive (except for G = K2,3). ￿ ￿ ￿ ￿

B Every prime component I of the form PS = ￿pi : i ∈ S￿ + IAS . Analyze monomial ideal PS = ￿p21, p22￿ Form vector u := e . S i∈/S i 101 Check if Au is on boundary￿ of marginal cone for all prime u = S S ￿101￿ components. uS has a zero column sum If so B has interior point property. ⇒ all fibers with positive margins (row and column sums) are connected.

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 21 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 22 / 28

Theoretical Results Proof Ideas

Proposition (Kahle-Rauh-S (2012))

If G = G1#G2 is a clique sum, then Find minimal primes for ICI(G). All binomial ideals. If ICI(G1) and ICI(G2) radical, so is ICI(G). k If G1 and G2 satisfy interior point property, so does G. Let J = ICI(G) = IAG,d ∩∩i=1Pi . u v If G1 and G2 satisfy positive margins property, so does G. Let u, v ￿such that AG,d u = AG,d v, so p − p ∈ IA.

Connect u and v using Markov basis moves of AG,d . u v + = Show that p − p ∈ Pi for all i, implies we can shortcut moves with CI(G) moves.

Deduce that J = ICI(G). Theorem (Kahle-Rauh-S (2012)) Depends on having Markov basis of AG,d , which is obtained in these cases via toric fiber product. (Engström, Kahle, S 2011) 1 For cycle Cn,ICI(Cn) is radical, when di = 2 for all i.

2 For K2,n with d1 = d2 = 2,ICI(K2,n) is radical. 3 Interior point property holds in both situations.

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 23 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 24 / 28 Questions Summary

Question Many statistical problems require the construction of random Is ICI(G) radical for all G, d? walks over the lattice points in a polytope. Does interior point property hold for all G, d? A Markov basis provides connectivity for all b. Theorem If Markov basis too hard to compute, can ask: Which fibers are ￿ ￿ connected by a “natural” set of moves? If there are n − 2 mutually orthogonal d × d latin squares, then for ￿ any 2-connected, triangle free graph on G nodes, and di = d for all i, Binomial primary decomposition gives information about the interior point property does not hold for (G, d). connectivity of fibers with subset of Markov basis. Computational and theoretical advances allow us to make

For C4 and d =(3, 3, 3, 3) gives failure of interior point property. progress on graphical models.

Radicality fails for K3,3 and d =(2, 2, 2, 2, 2, 2).

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 25 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 26 / 28

Problems References

J. Besag. (1974) Spatial Interaction and the Statistical Analysis of Lattice Systems, Journal of the Royal Statistical Society, Series B, 36 (2), 192Ð236.

Problem J. De Loera, S. Onn. Markov bases of three-way tables are arbitrarily complicated. Journal of Symbolic Computation, 1 2 41:173–181, 2006. P. Diaconis, D. Eisenbud, B. Sturmfels. Lattice walks and primary decomposition, Mathematical Essays in Honor of Gian-Carlo Rota, eds. B. Sagan and R. Stanley, Progress in Mathematics, Vol. 161, Birkhauser, Boston, 1998, pp. 173-193. P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions, Annals of Statistics 26 3 4 (1998) 363-397

A. Dickenstein, L. Matusevich, E. Miller. Combinatorics of binomial primary decomposition. 0803.3846

A. Dobra. Markov bases for decomposable graphical models. Bernoulli 9 No. 6,(2003) 1-16. 1 Let d =(2, 2, 2, 2). Construct the 16 × 16 matrix AC4,d . D. Eisenbud, B. Sturmfels. Binomial ideals, Duke Mathematical Journal 84 (1996) 1-45. 2 List the elements of CI(C4) A. Engström, T. Kahle, S. Sullivant. Multigraded commutative algebra of graph decompositions. (2011) 1102.2601 3 Use 4ti2, Macaulay2, or Singular to compute the Markov basis of D. Geiger, C. Meek, B. Sturmfels. On the toric algebra of graphical models, Annals of Statistics 34 (2006) 1463-1492 C4. T. Kahle, E. Miller. Decompositions of commutative monoid congruences and binomial ideals. arxiv:1107.4699

T. Kahle, J. Rauh, S. Sullivant. Positive margins and primary decomposition. (2012) 1201.2591

Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 27 / 28 Seth Sullivant (NCSU) Algebraic Statistics June 10, 2012 28 / 28