Research Statement
LUIS DAVID GARCIA–PUENTE

My mathematical research interests are in algebraic geometry, computational algebra, and combinatorial commutative algebra. These areas lie in the interplay among algebra, geometry, combinatorics, and symbolic computation. In recent years, new algorithms have been developed, and old and new methods from these fields have led to significant and unexpected advances in diverse areas of application. My research projects arise from problems in discrete multivariate analysis, probabilistic independence models, computational biology, and geometric modelling. While most of my work has been in algebraic statistics, I have also mentored several undergraduate students as part of the Texas A&M REU/VIGRE course “Algebraic Methods in Computational Biology.” I have established a network of collaborators from mathematics, statistics, and computer science. My main professional goal is to begin a research program involving both undergraduate and graduate students. My research area is ideal for this because many problems have simple formulations, computation and experimentation play an important role, and solutions involve beautiful ideas from diverse areas of knowledge.

1 Algebraic Statistics

Algebraic statistics is a new field using ideas from combinatorics, discrete geometry, computational algebra and algebraic geometry to formulate, interpret, and solve statistical problems. It has been applied to experimental design, discrete statistical analysis, statistical inference, and computational biology. The core principle in this area is that discrete statistical models are the non-negative real points of certain algebraic varieties.

Bayesian Networks

Bayesian networks are directed graphical models. The geometry of these models is relevant in statistics; see [6]. Algebraic statistics provides the right framework to understand this geometry. Bayesian networks can be described either by a recursive factorization of probability distributions or by conditional independence statements dictated by a graph, known as global Markov statements. This is an instance of the computational algebra principle that varieties can be presented either parametrically or implicitly. For example, let G be a graph consisting of two nodes, with no edge between them, representing the random variables X and Y. The only global Markov statement encoded by G is global(G) = {X ⊥⊥ Y}. The recursive factorization of the joint probability distribution of X and Y is given by

p(X = i, Y = j) = p(X = i)p(Y = j|X = i) = p(X = i)p(Y = j).

The equivalence of these two representations for Bayesian networks is the Factorization Theorem, a well-known theorem in statistics. This theorem is surprisingly delicate and no longer holds in the usual setting of algebraic geometry.

Factorization Theorem

In the previous example, the Factorization Theorem implies that X is independent of Y if and only if the joint probability distribution of X and Y factors as p(X, Y) = p(X)p(Y). Now assume that X has two states and Y has three states. Let $p_{ij}$ be an indeterminate denoting the probability p(X = i, Y = j). Let $I_{X \perp\!\!\!\perp Y}$ be the ideal generated by the 2 × 2 minors of the 2 × 3 matrix

\[
\begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \end{pmatrix}.
\]

Then every joint probability distribution p = p(X, Y) on two independent random variables is a point in the variety $V(I_{X \perp\!\!\!\perp Y})$. On the other hand, the factorization of the joint probability distribution gives a map

\[
f : \bigl((a), (b, c)\bigr) \longmapsto \bigl(ab,\ ac,\ a(1 - b - c),\ (1 - a)b,\ (1 - a)c,\ (1 - a)(1 - b - c)\bigr),
\]

where a = p(X = 0), b = p(Y = 0), and c = p(Y = 1). The image of this map corresponds to the set of all distributions that factor according to G. The map f gives a parametric definition of an algebraic variety. In this context, each distribution in the image of f is a point in the variety $V(I_f)$, where the ideal $I_f$ is the kernel of the ring homomorphism induced by f. In [4], Michael Stillman, Bernd Sturmfels and I proved the following generalization of the Factorization Theorem.

Theorem 1. The prime ideal $I_f$ is a minimal primary component of $I_{\mathrm{global}(G)}$. More precisely,

\[
\bigl(I_{\mathrm{global}(G)} : p^{\infty}\bigr) = I_f,
\]

where p is an explicitly given polynomial.

This theorem shows that, in general, $I_{\mathrm{global}(G)}$ is not prime. Therefore, to understand the geometry of $V(I_{\mathrm{global}(G)})$, one needs to find the irreducible components of this variety. This is accomplished by finding the primary decomposition of $I_{\mathrm{global}(G)}$. In [4], I found the primary decomposition of all Bayesian networks on four random variables and five binary random variables.
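The agreement between the parametric and implicit descriptions in the two-node example can be checked directly. The following is a minimal sketch, not taken from [4]: it evaluates the map f at arbitrarily chosen rational parameters and confirms that the 2 × 2 minors of the resulting 2 × 3 table vanish. The function names and parameter values are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def parametrization(a, b, c):
    """Map f: joint distribution of independent X (2 states) and Y (3 states)."""
    px = [a, 1 - a]
    py = [b, c, 1 - b - c]
    return [[px[i] * py[j] for j in range(3)] for i in range(2)]


def two_by_two_minors(p):
    """All 2x2 minors of a 2x3 matrix p (the generators of the independence ideal)."""
    return [p[0][j] * p[1][k] - p[0][k] * p[1][j]
            for j, k in combinations(range(3), 2)]


# Parameter values chosen arbitrarily for illustration.
p = parametrization(Fraction(1, 3), Fraction(1, 4), Fraction(1, 2))
minors = two_by_two_minors(p)
print(all(m == 0 for m in minors))
```

Exact rational arithmetic is used so that the minors are identically zero rather than zero up to rounding.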

Model Selection

Theorem 1 also gives an effective method to find the prime ideal $I_f$. In general, this problem is known as the implicitization problem. For many statistical models the implicitization problem is related to the model selection problem: the problem of choosing the model that best fits a given set of observations. The polynomials in $I_f$ are known as polynomial invariants. These invariants usually vary from model to model, so they can be used to distinguish between them. A central theme in algebraic statistics is finding generating sets for the ideal of polynomial invariants for several statistical models. In [4], I achieved this for all Bayesian networks on four random variables and five binary random variables. In [3], I found these invariants for Bayesian networks with three observable variables and one hidden variable. As a future project, I want to compute these ideals for larger models. This involves developing new implicitization techniques designed to take advantage of the intrinsic structure present in the parameterizations.
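To illustrate how polynomial invariants can separate models, the sketch below evaluates the invariants of the two-variable independence model (the 2 × 2 minors of a 2 × 3 table) on two contingency tables. Both tables are hypothetical and invented for this example, and the residual shown is only a rough indicator: an actual model selection procedure would need a statistical criterion for deciding when a residual is small enough.

```python
from fractions import Fraction
from itertools import combinations


def independence_invariants(table):
    """Evaluate the 2x2 minors on a 2x3 table of counts, after normalizing."""
    total = sum(sum(row) for row in table)
    p = [[Fraction(n, total) for n in row] for row in table]
    return [p[0][j] * p[1][k] - p[0][k] * p[1][j]
            for j, k in combinations(range(3), 2)]


def residual(table):
    """Largest absolute value attained by an invariant on the table."""
    return max(abs(v) for v in independence_invariants(table))


# Hypothetical contingency tables, invented for illustration only.
proportional_rows = [[10, 20, 30], [20, 40, 60]]
skewed_rows = [[30, 5, 5], [5, 30, 25]]

for name, table in [("proportional", proportional_rows), ("skewed", skewed_rows)]:
    print(name, residual(table))
```

The invariants vanish on the first table, whose rows are proportional, and do not vanish on the second, so only the first is compatible with the independence model.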

2 Computational Biology

Computational biology is a new discipline whose domain is the quantitative analysis of biological data, and the engineering of biological systems. The algebraic view of the discrete statistical models used in biological sequence analysis has had a direct impact on the development of algebraic statistics.

Phylogenetic Trees

Statistical models based on phylogenies are used to study and quantify differences between species. A phylogenetic invariant for a model of biological sequence evolution along a phylogenetic tree is a polynomial that vanishes on the expected frequencies of base patterns at the terminal taxa. In [10, Ch. 15], Seth Sullivant, Marta Casanellas and I used phylogenetic invariants to infer tree topologies from data. Due to the recursive nature of many algorithms in phylogenetics, it is important to classify these invariants for small phylogenetic trees. Together with former REU student Jacob Porter, I develop and maintain the small phylogenetic trees website. This project consists of implementing efficient algorithms to compute the polynomial invariants for group-based models based on small trees. We also computed the dimension, degree, minimal generating set, Gröbner basis, singular locus, and maximum likelihood degree for these models. My future projects include developing methods to compute invariants for different statistical models of evolution and developing sound methods to use these invariants in tree reconstruction.
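The simplest phylogenetic invariants are linear. As a minimal sketch, assume the two-state symmetric model (a group-based model for the group Z/2) on a star tree with three leaves and a uniform root distribution. This setup and the edge parameters are assumptions chosen for illustration and are not taken from the small trees website. Under this model, a leaf pattern and its complement have the same expected frequency, so each difference of the two is an invariant.

```python
from fractions import Fraction
from itertools import product


def transition(t):
    """Symmetric 2-state transition matrix with substitution probability t."""
    return [[1 - t, t], [t, 1 - t]]


def pattern_probabilities(edge_params):
    """Expected frequencies of leaf patterns on a star tree with uniform root."""
    matrices = [transition(t) for t in edge_params]
    probs = {}
    for pattern in product((0, 1), repeat=len(matrices)):
        total = Fraction(0)
        for root in (0, 1):
            term = Fraction(1, 2)  # uniform root distribution
            for m, leaf in zip(matrices, pattern):
                term *= m[root][leaf]
            total += term
        probs[pattern] = total
    return probs


# Edge parameters chosen arbitrarily for illustration.
probs = pattern_probabilities([Fraction(1, 10), Fraction(1, 5), Fraction(3, 10)])

# Linear invariants: p(pattern) - p(complementary pattern).
invariants = [probs[s] - probs[tuple(1 - x for x in s)] for s in probs]
print(all(v == 0 for v in invariants))
```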

Biochemical Networks

A central problem in computational biology is the modeling of biochemical networks from experimental data. The Discrete Mathematics Group at Virginia Bioinformatics Institute, led by Reinhard Laubenbacher, has developed a method to reverse-engineer biochemical networks based on the framework of discrete dynamical systems in which each variable takes values in a finite field. During the last few years, I have made important contributions to the development of an evolutionary algorithm for the identification and parameter optimization of biochemical network models. This algorithm optimizes the model produced by the previous method based on model complexity and data fit, and it uses tools from computational algebra to reduce the search space. One of my ongoing projects is to optimize the key computational algebraic steps in this algorithm.
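The modeling framework itself is easy to sketch. The example below simulates a discrete dynamical system whose three variables take values in the finite field with three elements. The update polynomials are hypothetical and were made up for this illustration. The sketch shows only the framework, not the reverse-engineering method or the evolutionary algorithm described above.

```python
P = 3  # work over the finite field F_3 (integers modulo 3)


def step(state):
    """One synchronous update of a hypothetical 3-variable polynomial system."""
    x1, x2, x3 = state
    return ((x2 * x3 + 1) % P,
            (x1 + x3 * x3) % P,
            (x1 * x2) % P)


def trajectory(state):
    """Iterate until a state repeats; return the transient and the limit cycle."""
    seen = {}
    path = []
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = step(state)
    start = seen[state]
    return path[:start], path[start:]


transient, cycle = trajectory((0, 0, 0))
print("transient:", transient)
print("cycle:", cycle)
```

Because the state space is finite, every trajectory eventually enters a cycle, which is what `trajectory` detects.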

3 Computational Algebraic Geometry

Algebraic geometry has a distinguished presence in the history of mathematics. Advances in computing and algorithms over the last 30 years have revolutionized the area, making many (formerly inaccessible) problems tractable, and providing a fertile ground for experimentation and conjecture. Applications of computational algebraic geometry range across computer science, economics, statistics, chemistry, physics, and engineering.

Linear Precision of Toric Patches

Geometric modelling is the science of modelling curves, surfaces, and higher-dimensional objects by small patches, which are images of functions $\varphi : \Delta \to \mathbb{R}^n$, where $\Delta$ is some domain in $\mathbb{R}^d$. A patch is a collection of “blending” functions $\{\beta_a : \Delta \to \mathbb{R}_{\geq 0} \mid a \in A\}$ indexed by a finite set $A$ of points in $\mathbb{R}^d$ whose convex hull equals $\Delta$. Then $\varphi$ is given by

\[
\varphi(x) = \sum_{a \in A} \beta_a(x)\, b_a,
\]

where $\{b_a \in \mathbb{R}^n \mid a \in A\}$ are control points. Linear precision is the ability of a parametric patch to replicate linear functions. This project is aimed at better understanding linear precision in geometric modelling. In [5], Frank Sottile and I developed the following formulation of linear precision. The patch $\beta$ has linear precision if

\[
x = \sum_{a \in A} \beta_a(x)\, a, \quad \text{for } x \in \Delta.
\]

We showed that any patch $\beta$ has a unique reparametrization having linear precision. Moreover, this unique reparametrization is a rational function if and only if a certain algebraic variety has a maximally degenerate position with respect to a canonical linear subspace given by the set $A$. We also found a simple numerical algorithm for computing the blending functions which have linear precision. In our formulation of linear precision, the points $A$ are fixed. A future project consists in understanding linear precision in the case where non-extreme points of $A$ are allowed to be moved.
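A classical instance of the linear precision condition is the Bézier curve: on Δ = [0, 1], with A = {0, 1/n, ..., 1}, the Bernstein polynomials serve as blending functions and satisfy the identity above. The following minimal sketch checks this exactly at a few sample points. It only illustrates the definition and is not the numerical algorithm mentioned above. The degree, sample points, and function names are arbitrary choices made for this example.

```python
from fractions import Fraction
from math import comb


def bernstein_patch(n):
    """Points A = {i/n} in [0, 1] together with the Bernstein blending functions."""
    points = [Fraction(i, n) for i in range(n + 1)]

    def blend(i, x):
        return comb(n, i) * x**i * (1 - x)**(n - i)

    return points, blend


def reproduce(points, blend, x):
    """Evaluate sum_a beta_a(x) * a, which equals x under linear precision."""
    return sum(blend(i, x) * a for i, a in enumerate(points))


points, blend = bernstein_patch(3)
samples = [Fraction(k, 7) for k in range(8)]
print(all(reproduce(points, blend, x) == x for x in samples))
```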

Primary Decomposition

Primary decomposition is a central concept in algebraic geometry. During the last few years, I have implemented several standard algorithms for primary decomposition that will be available in a future release of CoCoA [1]. In [4], I also developed a new method to decompose the ideals arising in Bayesian networks. Michael Stillman and I implemented this algorithm in the computer algebra systems Macaulay2 [7] and Singular [8]. Nevertheless, current implementations of primary decomposition algorithms are unsuited to deal with ideals arising in applications, due to the large number of indeterminates in these ideals. One of my future goals is to find more efficient methods to decompose ideals with a large number of indeterminates.

Toric Ideals of Matroids

Let $A \in \mathbb{Z}^{d \times n}$ be a d × n integral matrix. Consider the ring homomorphism

\[
\varphi_A : k[x_1, \dots, x_n] \longrightarrow k[t_1, \dots, t_d], \qquad x_j \longmapsto \prod_{i=1}^{d} t_i^{a_{ij}}.
\]

The ideal $I_A = \ker(\varphi_A)$ is called a toric ideal. Let A be the integer matrix whose columns correspond to the incidence vectors of the bases of a given matroid. In 1980, Neil White conjectured that in this case, the toric ideal $I_A$ is generated by quadrics. Recently, Blasiak proved in [2] that the conjecture holds for graphic matroids. In [9], Herzog and Hibi generalized this conjecture to discrete polymatroids. These authors asked the more general question of whether $I_A$ has a quadratic Gröbner basis. In [11], Sturmfels showed this for uniform matroids.
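The quadrics in question are easy to enumerate in a small case. A binomial difference of two monomials lies in the kernel of the homomorphism exactly when both monomials have the same image. The sketch below uses this criterion to list the quadratic binomials for the uniform matroid of rank 2 on 4 elements. The choice of matroid and all names are assumptions made for illustration. The sketch only enumerates quadrics and does not verify that they generate the ideal.

```python
from collections import defaultdict
from itertools import combinations, combinations_with_replacement


def incidence_vector(basis, ground_size):
    """0/1 vector marking which ground-set elements belong to the basis."""
    return tuple(1 if e in basis else 0 for e in range(ground_size))


# Bases of the uniform matroid U_{2,4}: all 2-subsets of {0, 1, 2, 3}.
bases = list(combinations(range(4), 2))
columns = {b: incidence_vector(b, 4) for b in bases}

# Group degree-2 monomials x_B * x_C by their image under the homomorphism,
# recorded as the exponent vector of the resulting monomial in t.
by_image = defaultdict(list)
for b, c in combinations_with_replacement(bases, 2):
    image = tuple(u + v for u, v in zip(columns[b], columns[c]))
    by_image[image].append((b, c))

# Two monomials with the same image differ by a binomial in the toric ideal.
quadrics = [(m1, m2) for group in by_image.values()
            for m1, m2 in combinations(group, 2)]

for (b1, c1), (b2, c2) in quadrics:
    print(f"x{b1} * x{c1} - x{b2} * x{c2}")
```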

Jointly with Giulio Caviglia and Sergi Elizalde, I showed that $I_A$ is generated by quadrics for lattice-path polymatroids. Later on, Anna de Mier and Omer Giménez generalized our ideas to multipath matroids. We are currently working on a paper containing all these results. As a future project, I want to determine whether $I_A$ has a quadratic Gröbner basis in the cases where it is known to be generated by quadrics.

References

[1] J. Abbott, A. Bigatti, M. Caboara, and L. Robbiano. CoCoA, a system for doing computations in commutative algebra. Available at http://cocoa.dima.unige.it, 2000.

[2] J. Blasiak. The toric ideal of a graphic matroid is generated by quadrics. 2005.

[3] L. D. Garcia. Algebraic statistics in model selection. In M. Chickering and J. Halpern, editors, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 177–184. AUAI Press, Arlington, VA, 2004.

[4] L. D. Garcia, M. Stillman, and B. Sturmfels. Algebraic geometry of Bayesian networks. J. Symbolic Comput., 39/3-4:331–355, 2005. Special issue on the occasion of MEGA 2003.

[5] L. D. Garcia-Puente and F. Sottile. Linear precision for parametric patches. 2006.

[6] D. Geiger, D. Heckerman, H. King, and C. Meek. Stratified exponential families: graphical models and model selection. Ann. Statist., 29(2):505–529, 2001.

[7] D. Grayson and M. Stillman. Macaulay 2, a software system for research in algebraic geometry. Available at http://www.math.uiuc.edu/Macaulay2, 2002.

[8] G. Greuel, G. Pfister, and H. Schoenemann. Singular: A computer algebra system for polynomial computations. Available at http://www.singular.uni-kl.de/, 2003.

[9] J. Herzog and T. Hibi. Discrete polymatroids. Journal of Algebraic Combinatorics, 16:239–268, 2002.

[10] L. Pachter and B. Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press, New York, 2005.

[11] B. Sturmfels. Gröbner bases and convex polytopes. American Mathematical Society, Providence, RI, 1996.