Research Statement

LUIS DAVID GARCIA–PUENTE

My mathematical research interests are in algebraic statistics, computational algebraic geometry, and combinatorial commutative algebra. These areas lie in the interplay among algebra, geometry, combinatorics, and symbolic computation. In recent years new algorithms have been developed and several old and new methods from these fields have led to significant and unexpected advances in several diverse areas of application. My research projects arise from problems in discrete multivariate analysis, probabilistic independence models, computational biology, and geometric modelling.

While most of my work has been in algebraic statistics, I have also mentored several undergraduate students as part of the Texas A&M REU/VIGRE course “Algebraic Methods in Computational Biology.” I have established a network of collaborators from mathematics, statistics, and computer science. My main professional goal is to begin a research program involving both undergraduate and graduate students. My research area is ideal for this, because many problems have simple formulations, computation and experimentation play an important role, and solutions involve beautiful ideas from diverse areas of knowledge.

1 Algebraic Statistics

Algebraic statistics is a new field using ideas from combinatorics, discrete geometry, computational algebra and algebraic geometry to formulate, interpret, and solve statistical problems. It has been applied to experimental design, discrete statistical analysis, statistical inference, and computational biology. The core principle in this area is that discrete statistical models are the non-negative real points of certain algebraic varieties.

Bayesian Networks

Bayesian networks are directed graphical models. The geometry of these models is relevant in statistics; see [6]. Algebraic statistics provides the right framework to understand the geometry of these models.
Bayesian networks can be described either by a recursive factorization of probability distributions or by conditional independence statements dictated by a graph, known as global Markov statements. This is an instance of the computational algebra principle that varieties can be presented either parametrically or implicitly. For example, let G be a graph consisting of two disjoint nodes representing the random variables X and Y. The only global Markov statement encoded by G is global(G) = {X ⊥⊥ Y}. The recursive factorization of the joint probability distribution of X and Y is given by

\[ p(X = i, Y = j) = p(X = i)\,p(Y = j \mid X = i) = p(X = i)\,p(Y = j). \]

The equivalence of these two representations for Bayesian networks is the Factorization Theorem, a well-known theorem in statistics. This theorem is surprisingly delicate and no longer holds in the usual setting of algebraic geometry.

Factorization Theorem

In the previous example, the Factorization Theorem implies that X is independent of Y if and only if the joint probability distribution of X and Y factors as p(X, Y) = p(X)p(Y). Now assume that X has two states and Y has three states. Let $p_{ij}$ be an indeterminate denoting the probability p(X = i, Y = j). Let $I_{X \perp\!\!\!\perp Y}$ be the ideal generated by the 2 × 2 minors of the 2 × 3 matrix

\[ \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \end{pmatrix}. \]

Then every joint probability distribution p = p(X, Y) on two independent random variables is a point in the variety $V(I_{X \perp\!\!\!\perp Y})$. On the other hand, the factorization of the joint probability distribution gives a map

\[ f : \bigl((a), (b, c)\bigr) \longmapsto \bigl(ab,\ ac,\ a(1 - b - c),\ (1 - a)b,\ (1 - a)c,\ (1 - a)(1 - b - c)\bigr), \]

where a = p(X = 0), b = p(Y = 0), and c = p(Y = 1). The image of this map corresponds to the set of all distributions that factor according to G. The map f gives a parametric definition of an algebraic variety.
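As a quick check that the parametric and implicit descriptions in this example agree, the coordinates of f can be substituted into the three 2 × 2 minors. The computation below is a worked sketch; the shorthand d = 1 − b − c is introduced here only for readability and is not part of the original notation. The coordinates of f are read in the order $(p_{11}, p_{12}, p_{13}, p_{21}, p_{22}, p_{23})$.

```latex
\begin{align*}
p_{11}p_{22} - p_{12}p_{21} &= (ab)\,(1-a)c - (ac)\,(1-a)b = 0,\\
p_{11}p_{23} - p_{13}p_{21} &= (ab)\,(1-a)d - (ad)\,(1-a)b = 0,\\
p_{12}p_{23} - p_{13}p_{22} &= (ac)\,(1-a)d - (ad)\,(1-a)c = 0,
\qquad d = 1 - b - c.
\end{align*}
```

Every point in the image of f therefore satisfies the equations cut out by the 2 × 2 minors.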
In this context, each distribution in the image of f is a point in the variety $V(I_f)$, where the ideal $I_f$ is the kernel of the ring homomorphism induced by f. In [4], Michael Stillman, Bernd Sturmfels and I proved the following generalization of the Factorization Theorem.

Theorem 1. The prime ideal $I_f$ is a minimal primary component of $I_{\mathrm{global}(G)}$. More precisely,

\[ I_{\mathrm{global}(G)} : p^{\infty} = I_f, \]

where p is an explicitly given polynomial.

This theorem shows that, in general, $I_{\mathrm{global}(G)}$ is not prime. Therefore, to understand the geometry of $V(I_{\mathrm{global}(G)})$, one needs to find the irreducible components of this variety. This is accomplished by finding the primary decomposition of $I_{\mathrm{global}(G)}$. In [4], I found the primary decompositions for all Bayesian networks on four random variables and on five binary random variables.

Model Selection

Theorem 1 also gives an effective method to find the prime ideal $I_f$. In general, this problem is known as the implicitization problem. For many statistical models the implicitization problem is related to the model selection problem, that is, the problem of choosing the model that best fits a given set of observations. The polynomials in $I_f$ are known as polynomial invariants. These invariants usually vary from model to model, so they can be used to distinguish between them. A central theme in algebraic statistics is finding generating sets for the ideal of polynomial invariants for several statistical models. In [4], I achieved this for all Bayesian networks on four random variables and on five binary random variables. In [3], I found these invariants for Bayesian networks with three observable variables and one hidden variable. As a future project, I want to compute these ideals for larger models. This involves developing new implicitization techniques designed to take advantage of the intrinsic structure present in the parameterizations.
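To illustrate how polynomial invariants can separate models, here is a minimal sketch for the 2 × 3 independence example above. The function names and the two sample tables are hypothetical, chosen only for illustration; the invariants evaluated are the three 2 × 2 minors.

```python
from fractions import Fraction
from itertools import combinations


def two_by_two_minors(table):
    """Return the 2x2 minors of a 2 x n table, one per pair of columns."""
    top, bottom = table
    return [top[j] * bottom[k] - top[k] * bottom[j]
            for j, k in combinations(range(len(top)), 2)]


def satisfies_independence(table):
    """True when every 2x2 minor vanishes, i.e. the table has rank at most 1."""
    return all(m == 0 for m in two_by_two_minors(table))


# Hypothetical joint distribution built as a product of marginals.
row_marginal = [Fraction(1, 4), Fraction(3, 4)]
col_marginal = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
independent = [[r * c for c in col_marginal] for r in row_marginal]

# Hypothetical joint distribution that does not factor.
dependent = [
    [Fraction(1, 3), Fraction(0), Fraction(1, 6)],
    [Fraction(0), Fraction(1, 3), Fraction(1, 6)],
]

print(satisfies_independence(independent))  # True
print(satisfies_independence(dependent))    # False
```

Exact rational arithmetic is used so that "vanishes" is a meaningful test. With empirical frequencies the invariants would not vanish exactly, so in practice one would compare their magnitudes rather than test for zero.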
2 Computational Biology

Computational biology is a new discipline whose domain is the quantitative analysis of biological data, and the engineering of biological systems. The algebraic view of the discrete statistical models used in biological sequence analysis has had a direct impact on the development of algebraic statistics.

Phylogenetic Trees

Statistical models based on phylogenies are used to study and quantify differences between species. A phylogenetic invariant for a model of biological sequence evolution along a phylogenetic tree is a polynomial that vanishes on the expected frequencies of base patterns at the terminal taxa. In [10, Ch. 15], Seth Sullivant, Marta Casanellas and I used phylogenetic invariants to infer tree topologies from data. Due to the recursive nature of many algorithms in phylogenetics, it is important to classify these invariants for small phylogenetic trees. Together with former REU student Jacob Porter, I develop and maintain the small phylogenetic trees website. This project consists in implementing efficient algorithms to compute the polynomial invariants for group-based models based on small trees. We also computed the dimension, degree, minimal generating set, Gröbner basis, singular locus, and maximum likelihood degree for these models. My future projects include developing methods to compute invariants for different statistical models of evolution and developing sound methods to use these invariants in tree reconstruction.

Biochemical Networks

A central problem in computational biology is the modeling of biochemical networks from experimental data. The Discrete Mathematics Group at Virginia Bioinformatics Institute led by Reinhard Laubenbacher has developed a method to reverse-engineer biochemical networks based on the framework of discrete dynamical systems in which each variable takes values on a finite field.
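As an illustration of this framework only, the sketch below iterates a toy discrete dynamical system in which each of three variables takes values in the finite field with three elements. The update polynomials and the names `step` and `trajectory` are invented for this example; they are not taken from the reverse-engineering method itself.

```python
P = 3  # field size; all arithmetic is modulo the prime P


def step(state):
    """Apply one synchronous update of a toy three-variable system over F_P."""
    x1, x2, x3 = state
    return (
        (x2 * x3 + 1) % P,  # f1 = x2*x3 + 1
        (x1 + x3) % P,      # f2 = x1 + x3
        (x1 * x1) % P,      # f3 = x1^2
    )


def trajectory(state):
    """Iterate until a state repeats.

    Returns the list of distinct states visited, in order, together with
    the first repeated state (the entry point of the limit cycle).
    The loop terminates because the state space is finite.
    """
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen, state


path, cycle_entry = trajectory((0, 0, 0))
print(path)
print(cycle_entry)
```

Because the state space is finite, every trajectory eventually enters a cycle. Fixed points and limit cycles of this kind are the long-term behaviours one would compare against experimental time series.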
During the last few years, I have made important contributions to the development of an evolutionary algorithm for the identification and parameter optimization of biochemical network models. This algorithm optimizes the model produced by the reverse-engineering method described above, based on model complexity and data fit. It uses tools from computational algebra to reduce the search space. One of my ongoing projects is to optimize the key computational algebraic steps in this algorithm.

3 Computational Algebraic Geometry

Algebraic geometry has a distinguished presence in the history of mathematics. Advances in computing and algorithms over the last 30 years have revolutionized the area, making many (formerly inaccessible) problems tractable, and providing a fertile ground for experimentation and conjecture. Applications of computational algebraic geometry range over computer science, economics, statistics, chemistry, physics, and engineering.

Linear Precision of Toric Patches

Geometric modelling is the science of modeling curves, surfaces, and higher-dimensional objects by small patches, which are images of functions $\varphi : \Delta \to \mathbb{R}^n$, where $\Delta$ is some domain in $\mathbb{R}^d$. A patch is a collection of “blending” functions $\{\beta_a : \Delta \to \mathbb{R}_{\geq 0} \mid a \in A\}$ indexed by a finite set A of points in $\mathbb{R}^d$ whose convex hull equals $\Delta$. Then $\varphi$ is given by

\[ \varphi(x) = \sum_{a \in A} \beta_a(x)\, b_a, \]

where $\{b_a \in \mathbb{R}^n \mid a \in A\}$ are control points. Linear precision is the ability of a parametric patch to replicate linear functions. This project is aimed at better understanding linear precision in geometric modelling. In [5], Frank Sottile and I developed the following formulation of linear precision. The patch $\beta$ has linear precision if

\[ x = \sum_{a \in A} \beta_a(x)\, a \quad \text{for } x \in \Delta. \]

We showed that any patch $\beta$ has a unique reparametrization having linear precision.
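A classical instance of the formulation above is the Bernstein basis on the interval [0, 1], with A = {0, 1/n, ..., 1}, which is well known to have linear precision. The sketch below checks the identity x = Σ β_a(x) a in exact arithmetic; the function names are illustrative and the degree and sample point are arbitrary choices.

```python
from fractions import Fraction
from math import comb


def bernstein_patch(n, x):
    """Blending functions beta_a(x) for a = i/n, i = 0..n, on the interval [0, 1]."""
    return {Fraction(i, n): comb(n, i) * x**i * (1 - x)**(n - i)
            for i in range(n + 1)}


def weighted_point(blend):
    """Evaluate the sum over a in A of beta_a(x) * a."""
    return sum(beta * a for a, beta in blend.items())


x = Fraction(2, 7)
blend = bernstein_patch(5, x)
print(sum(blend.values()) == 1)    # True: these blending functions sum to 1
print(weighted_point(blend) == x)  # True: linear precision at this x
```

Moving an interior point of A, for example replacing 1/n by another value while keeping the same blending functions, generally breaks the second identity.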
Moreover, this unique reparametrization is a rational function if and only if a certain algebraic variety has a maximally degenerate position with respect to a canonical linear subspace given by the set A. We also found a simple numerical algorithm for computing the blending functions which have linear precision. In our formulation of linear precision, the points A are fixed. A future project consists in understanding linear precision in the case where non-extreme points of A are allowed to be moved.

Primary Decomposition

Primary decomposition is a central concept in algebraic geometry. During the last few years, I have implemented several standard algorithms for primary decomposition that will be available in a future release of CoCoA [1].