Tapas of Algebraic Statistics

COMMUNICATION Tapas of Algebraic Statistics Carlos Améndola, Marta Casanellas, and Luis David García Puente What is Algebraic Statistics? Algebraic statistics is a broad field actively expand- Algebraic statistics is an interdisciplinary field that uses ing from discrete statistical models, contingency table tools from computational algebra, algebraic geometry, analysis, and experimental design to Gaussian models, and combinatorics to address problems in statistics and singular learning theory, and applications to phylogenet- its applications. A guiding principle in this field is that ics, machine learning, and biochemical reaction networks. many statistical models of interest are semialgebraic In this note, we will address two recent contributions to sets—a set of points defined by polynomial equalities and this field: an extension of Pearson’s work on Gaussian inequalities. Algebraic statistics is not only concerned with mixtures and some recent results in phylogenetics. understanding the geometry and algebra of the underlying statistical model, but also with applying this knowledge Algebraic Statistics of Gaussian Mixtures to improve the analysis of statistical procedures, and to In 1894 the famous statistician Karl Pearson [5] wanted devise new methods for analyzing data. to explain the asymmetry observed in data measured A well-known example of this principle is the model from a population of Naples’ crabs, believing it was of independence of two discrete random variables. Two possible that two subpopulations of crabs were present discrete random variables 푋, 푌 are independent if their in the sample. The corresponding statistical model is joint probability factors into the product of the marginal known as a Gaussian mixture; in this case a mixture of probabilities. Equivalently, 푋 and 푌 are independent if two univariate Gaussian distributions, each with its own and only if every 2 × 2-minor of the matrix of their joint mean and variance. In order to recover the parameters probabilities is zero. These quadratic equations, together from the sample, Pearson introduced the method of with the conditions that the probabilities are nonnegative moments, matching the density moments to the sample and sum to one, define a semialgebraic set. moments. He obtained the following system of polynomial 2 2 In 1998, Persi Diaconis and Bernd Sturmfels showed equations in the means 휇1 and 휇2, variances 휎1 and 휎2 , how one can use algorithms from computational algebraic and mixture proportions 훼1 and 훼2: geometry to sample from conditional distributions. This work is generally regarded as one of the seminal works of 훼1 +훼2 =1 what is now referred to as algebraic statistics. However, 훼1휇1 +훼2휇2 =0 algebraic methods can be traced back to R. A. Fisher, who 2 2 2 2 used Abelian groups in the study of factorial designs, 훼1(휇1 +휎1 )+훼2(휇2 +휎2 ) =푚2 3 2 3 2 and Karl Pearson, who used polynomial algebra to study (1) 훼1(휇1 +3휇1휎1 ) + 훼2(휇2 +3휇2휎2 ) =푚3 Gaussian mixture models. 4 2 2 4 4 2 2 4 훼1(휇1 +6휇1 휎1 +3휎1 )+훼2(휇2 +6휇2 휎2 +3휎2 ) =푚4 5 3 2 4 5 3 2 4 Carlos Améndola is postdoctoral researcher in the department of 훼1(휇1 +10휇1 휎1 +15휇1휎1 )+훼2(휇2 +10휇2 휎2 +15휇2휎2 ) =푚5. mathematics at Technische Universität München. His email address is [email protected]. After considerable effort and cleverness, Pearson man- Marta Casanellas is associate professor and vice director of re- aged to eliminate variables to obtain a ninth degree search in the department of mathematics at Universitat Politèc- polynomial relation in the single unknown 푥 = 휇 휇 , nica de Catalunya. Her email address is marta.casanellas@upc 1 2 .edu. (2) Luis David García Puente is associate professor and assistant 9 7 2 6 2 5 2 2 4 24푥 − 28휆4푥 + 36푚3푥 − (24푚3휆5 − 10휆4)푥 − (148푚3휆4 + 2휆5)푥 + department chair in the department of mathematics and sta- (288푚4 −12휆 휆 푚 −휆3)푥3+(24푚3휆 −7푚2휆2)푥2+32푚4휆 푥−24푚6 = 0, tistics at Sam Houston State University. His email address is 3 4 5 3 4 3 5 3 4 3 4 3 [email protected]. 2 where 휆4 = 9푚2 − 3푚4 and 휆5 = 30푚2푚3 − 3푚5. After For permission to reprint this article, please contact: substituting his numerical moment estimates 푚푖, he [email protected]. found the real roots of this nonic and determined if they DOI: http://dx.doi.org/10.1090/noti1713 could correspond to a solution for the mixture model. We 936 Notices of the AMS Volume 65, Number 8 COMMUNICATION see his approach as one of the first instances of algebraic matrix). The entries of these matrices, together with the statistics. Pearson’s work leads to natural questions: distribution of nucleotides at the root of the tree, form the parameters of the model. Then, the probability of Problem 1. Can Pearson’s method be generalized for a observing a certain pattern of nucleotides at the leaves of mixture of 푘 Gaussians? How many moments are needed the tree can be written in terms of these parameters, by to recover the parameters? Is there an analogous polyno- assuming that the evolutionary processes of two edges mial to (2)? What is its degree? What about Gaussians in incident at a node 푣 is independent given the observations higher dimensions? at 푣. Recovering the parameters from data drawn from a Gaussian mixture is an important problem in statistics, computer science, and machine learning. Answers to the above questions shed light on the computational complexity and the effectiveness of several algorithms proposed in these areas. The key point is that all the moments of a mixture of Gaussians are polynomials in Figure 1. Three phylogenetic trees representing three the parameters, so they define moment varieties that can possible evolutionary histories of human (h), gorilla be studied algebraically. (g), macaque (m), and lemur (l). Recent progress with this approach has been made by Améndola et al. [2, 3], with partial answers. For example, Again, the key point is that for each phylogenetic tree it was shown [3] that considering all the moments up 휏, the map 휑휏 that sends each set of parameters to the to order 3푘 − 1 will yield generically a finite number vector of probabilities of patterns 퐴퐴 … 퐴, 퐴퐴 … 퐶, … , of Gaussian mixture densities with the same matching 푇푇 … 푇 at the leaves is a polynomial map, and hence moments. In other words, the polynomial moment system its image is (almost) an algebraic variety 푉휏. Different generalizing (1) will generically have a finite number of trees (as the ones in Figure 1) lead to different algebraic solutions for the 3푘 unknown parameters 휇푖, 휎푖, 훼푖 for varieties, and the goal is to use the equations that define 1 ≤ 푖 ≤ 푘. For 푘 = 2 this is Pearson’s number 9. For these algebraic varieties in order to decide, given a data 푘 = 3 it was found [2] that the corresponding degree point (that is, a sequence of nucleotides for each species is 225. In contrast, perhaps shockingly, the system of at the leaves), to which variety is closest (in some sense). 20 polynomial equations in 20 unknowns corresponding The idea of using polynomial equations in phyloge- to the moments up to order three of mixtures of two netics is not due to mathematicians but to biologists. 3 Gaussians in 3-dimensional space ℝ will have generically Indeed, in the late 1980s, biologists James A. Cavender, infinitely many solutions. This means that one needs to Joseph Felsenstein, and James A. Lake already realized consider higher order moments in order to recover the that the equations satisfied by the pattern probabilities parameters. A complete classification of such defective on a phylogenetic tree could help in inferring the tree cases is still open. without having to estimate the parameters of the model. It is precisely this, the fact of not having to estimate the Algebraic Statistical Phylogenetics parameters, that makes algebraic statistics potentially Algebraic statistics has been also used in phylogenetics. useful in phylogenetics. However, selecting a set of equa- Phylogenetics seeks to explain the ancestral relationships tions that define the algebraic variety cannot be done among a group of living species. These relationships are in a canonical way. Moreover, the codimension of these usually represented in a phylogenetic tree as in Figure 1, varieties grows exponentially in the number of leaves, so where the leaves are in bijection with the living species, using them directly may not be a practical choice. the interior nodes represent ancestral species, the root A recent approach to indirectly using these algebraic is the common ancestor to all the species in the tree, varieties is based on the following result due to Elizabeth and the edges represent an evolutionary process that led Allman and John Rhodes: Assume the vector of probabili- from one ancestral species to the next. Figure 1 shows ties 푝 = (푝퐴퐴…퐴, 푝퐴퐴…퐶, … , 푝푇푇…푇) belongs to the image three possible phylogenetic trees that could explain the of 휑휏. Any edge of 휏 splits the set of leaves into two evolution of human, gorilla, lemur, and macaque. subsets 푎 and 푏, giving rise to a matrix 푀푎,푏 whose rows In order to infer the phylogenetic tree that best explains (resp. columns) are labeled by the states at the leaves the evolution of the species, one uses the genome of the in 푎 (resp. 푏) and whose entries are the corresponding living species and models the substitution of nucleotides probabilities in 푝. Then the matrix 푀푎,푏 has rank at most using a Markov process on trees, assuming that each 4. This result leads to equations satisfied by the points position in the genome evolves in the same way and of the variety (the 5 × 5 minors must vanish), and it also independently of the others.

Load more