A Computer Algebra System for R: Macaulay2 and the M2r Package

Total Page:16

File Type:pdf, Size:1020Kb

Load more

A computer algebra system for R: Macaulay2 and the m2r package

David Kahle∗1, Christopher O’Neill†2, and Jeff Sommars‡3
1Department of Statistical Science, Baylor University
2Department of Mathematics, University of California, Davis
3Department of Mathematics, Statistics, and Computer Science, University of Illinois at
Chicago

Abstract

Algebraic methods have a long history in statistics. The most prominent manifestation of modern algebra in statistics can be seen in the field of algebraic statistics, which brings tools from commutative algebra and algebraic geometry to bear on statistical problems. Now over two decades old, algebraic statistics has applications in a wide range of theoretical and applied statistical domains. Nevertheless, algebraic statistical methods are still not mainstream, mostly due to a lack of easy off-the-shelf implementations. In this article we debut m2r, an R package that connects R to Macaulay2 through a persistent back-end socket connection running locally or on a cloud server. Topics range from basic use of m2r to applications and design philosophy.

1 Introduction

Algebra, a branch of mathematics concerned with abstraction, structure, and symmetry, has a long history of applications in statistics. For example, Pearson’s early work on method of moments estimation in mixture models ultimately involved systems of polynomial equations that he painstakingly and remarkably solved by hand (Pearson, 1894; Am´endola et al., 2016). Fisher’s work in design was strongly algebraic and combinatorial, focusing on topics such as Latin squares (Fisher, 1934). Invariance and equivariance continue to form a major pillar of mathematical statistics through the lens of location-scale families (Pitman, 1939; Bondesson,

1983; Lehmann and Romano, 2005).

Apart from the obvious applications of linear algebra, the most visible manifestations of modern algebra in statistics are found in the young field of algebraic statistics. Algebraic statistics is defined broadly as the application of commutative algebra and algebraic geometry to statistical problems, generally understood to include applications of other mathematical fields that have substantial overlap with commutative algebra and algebraic geometry, such as combinatorics, polyhedral geometry, graph theory, and others (Drton et al., 2009; Sturmfels, 1996). Now a quarter century old, algebraic statistics has revealed that many statistical areas are profitably amenable to algebraic investigation, including discrete multivariate analysis, discrete and Gaussian graphical models, statistical disclosure limitation, phylogenetics, Bayesian statistics, and more. Nevertheless, while the field is well-established and actively growing, advances in algebraic statistical methods are still not mainstream among applied statisticians, largely due to the lack of off-the-shelf implementations of key algebraic algorithms in mainstream statistical software. In this article we debut m2r, a key piece to the puzzle of applied algebraic statistics in R.

[email protected] [email protected] [email protected]

1

1.1 Macaulay2 and the m2r R package

Macaulay2 is a state-of-the-art, open-source computer algebra system designed to perform computations in commutative algebra and algebraic geometry (Grayson and Stillman, 2006). More than twenty years old, the software has a large code base with many community members actively developing add-on packages. In addition, Macaulay2 links to other major open source software in the mathematics community, such as

Normaliz (Bruns et al., 2015, 2016; Bruns and K¨ampf, 2010), 4ti2 (4ti2 Team, 2015), and PHCpack (Verschelde,

1999; Gross et al., 2013), through a variety of interfaces. Natively, Macaulay2 is well-known for its efficiency with large algebraic computations, among other things.
One of the primary benefits of Macaulay2 is its efficiency with large algebraic computations. For instance,
Gr¨obner basis computations comprise the core of many algorithms central to computational algebra. Some of these computations take many hours and produce output consisting of several thousand polynomials or polynomials with several thousand terms. Often, the Macaulay2 user will not be interested in the entire output, but only certain properties; Macaulay2 allows the user to specify relevant properties to return, such as the dimension of the solution set or the highest degree term that appears.
R is increasingly the programming lingua franca of the statistics community, but it has very limited native support for symbolic computing (R Core Team, 2014). rSymPy attempts to alleviate this problem

by connecting R to Python’s SymPy library (Meurer et al., 2017; Grothendieck and Bellosta, 2012). mpoly

provides a basic collection of R data structures and methods for multivariate polynomials and was designed to lay the foundation for a more robust computer algebra system in R (Kahle, 2013). Unfortunately, neither of these wholly meet the computational needs of those in the algebraic statistics community, because neither of them were designed for that purpose. Consequently, for years those using algebraic statistical methods have been forced to go outside of R to manually run key algebraic computations in software such as Macaulay2 and then pull the results back into R. This error prone and tedious process is simply one barrier to entry to using algebraic statistics in R. The problem is compounded by users needing to install Macaulay2, which is not cross-platform, and be familiar with the Macaulay2 language, which is syntactically and semantically very different from R.
In this article we present the m2r package, which is intended to help fill this void. m2r was created at the American Mathematical Society’s 2016 Mathematics Research Community gathering on algebraic statistics. It connects R to a persistent local or remote Macaulay2 session and leverages mpoly’s existing infrastructure to provide wrappers for commonly used algebraic algorithms in a way that naturally fits into the R ecosystem, alleviating the need to learn Macaulay2. It is our hope that m2r will provide a flexible framework for computations in the algebraic statistics community and beyond.
The outline of the article is as follows. In Section 2 we provide a basic overview of the relevant algebraic and geometric concepts used in the rest of the article; we also provide references to learn more. In Section 3 we present a basic demo of m2r to get up and running. Section 4 follows with two applications of interest to R users: using m2r to exactly solve systems of nonlinear algebraic equations and applying m2r to better understand conditional independence models on multiway contingency tables. Next, Sections 5 and 6 provide an overview of how m2r works internally, first by describing the design philosophy and then by demonstrating how m2r connects R to Macaulay2, which need not be installed locally on the user’s machine. We conclude with a brief discussion of future directions in Section 7.

2 Theory and applications

In this section we provide a basic introduction to the algebraic and geometric objects described in the remainder of this work. We aim for understandability over precision, and so in some cases bend the truth a bit. There are accessible texts for more precise definitions; we direct the reader to Gallian (2016) for the basics of modern algebra, and Cox et al. (1997) for the basics of commutative algebra and algebraic geometry.
Broadly speaking, the mathematical discipline of algebra deals with sets of objects with certain welldefined operations between their elements that result in other elements of the set (e.g. the sum of two numbers is a number). At a basic level, modern algebra has focused on three such objects, in order of

2increasing structure: groups, rings, and fields. A group is a set along with a single binary operation “+” in which every element has an inverse. For example, the integers (Z) form a group; 0 is the identity element (x + 0 = x for any x ∈ Z) and the inverse of any integer is its negative (x + (−x) = 0). A ring is a group with a second operation “·” under which elements need not have inverses. For example, Z is also a ring; the product of two integers is an integer, and the multiplicative identity is the number 1 (1·x = x for any x ∈ Z), but 2 has no multiplicative inverse since 1/2 is not an integer. A field is a ring with multiplicative inverses, i.e. a ring where division is defined. As such, the integers form a ring but not a field. On the other hand, the rational numbers Q do form a field, as do the real numbers R and the complex numbers C. Throughout this paper, all group and ring operations will be commutative, or order invariant, e.g. 5 · 2 = 2 · 5.
Among each class of objects, special subsets are distinguished. For example, a subgroup of a group is a subset of a group that is itself a group, e.g. the even integers. The field of commutative algebra focuses on commutative rings and distinguished subsets called ideals. An ideal is a subgroup of a ring that “absorbs” elements of the ring under multiplication. For example, the even integers I ⊂ Z are an ideal of the ring of integers; I is a group under addition, and if you multiply an even number by any integer, the result is even and thus in I. Note that ideals are not necessarily rings, as they usually do not contain the multiplicative identity 1 (in fact, any ideal containing 1 must contain every element of the ring). Special supersets are also distinguished. For example a field extension F0 of a field F is a superset of F that is a field under the same operations as F, e.g. C and R of Q.
As mathematical objects, the set of polynomials in one or several variables forms a commutative ring.
Since the general multivariate setting is as accessible as the more familiar univariate setting, we go straight to multivariate polynomials. Let x denote an n-tuple x = (x1, x2, . . . , xn) of variables. A monomial is a product of the variables of the form

xα

=

xα1 x2 · · · xn

α2

i ∈ N0 = {0, 1, 2, . . .} .

(1)

αn

1

A polynomial f is a finite linear combination of monomials whose coefficients are drawn from some ring K (often a field such as Q, R, or C). The set of all polynomials with coefficients in K is denoted K[x]. For example, f(x, y) = 3x − 2y ∈ Z[x, y]. Obviously, adding, subtracting, and multiplying polynomials results in another polynomial after simplification.
One way to create an ideal in a polynomial ring is simply to generate one from a collection of polynomials.
If f1, . . . , fm is a collection of m polynomials in K[x], the ideal generated by f1, . . . , fm is the set

hf1, . . . , fmi

=

{r1f1 + · · · + rmfm : rk ∈ K[x] for k = 1, . . . , m}

K[x].
(2)

In particular, this set is the smallest ideal containing f1, . . . , fm. The generating polynomials f1, . . . , fm are called a basis of the ideal. Obviously, ideals are infinitely large collections of polynomials. However, they typically aren’t all polynomials; in the ring Z[x, y], I = hx, yi is an ideal, and Z[x, y] \ I consists of all polynomials with nonzero constant term. A remarkable result known as the Hilbert basis theorem states that every ideal has a finite generating set, i.e. a finite basis. However, bases need not be unique. Gr¨obner bases are generating sets with some additional structure and are central objects in computational commutative algebra. In general, it can be difficult to answer questions such as whether or not two ideals are equal, or if a particular polynomial is contained in an ideal. If one has a Gr¨obner basis however, these questions can be answered relatively easily.
There are a number of algorithms known to convert a given collection of polynomials f1, . . . , fm into

0

a Gr¨obner basis g1, . . . , gm . The first historically and simplest is Buchberger’s algorithm, and all major computer algebra systems implement a variant of it, including Macaulay2 and Singular (Buchberger, 1970; Grayson and Stillman, 2006; Greuel et al., 2006). Optimizing Gro¨bner basis computations continues to be an active area of research in computational algebraic geometry, and the aforementioned software packages are regularly updated with newer and faster implementations.
Algebraic geometry is the field of mathematics interested in understanding the geometric structure of zero sets of polynomials, called varieties or algebraic sets. Concretely, the variety generated by f1, . . . , fm is the set of vectors x ∈ Kn where all the polynomials evaluate to zero.

V (f1, . . . , fm)

=
{x ∈ Kn : f1(x) = · · · = fm(x) = 0} .
(3)

3
Sometimes a field extension of K is used instead of K so that, for example, we could consider the set of solutions in Rn of a polynomial with coefficients in Z (which are of course also in R). Varieties are geometric objects. For example, the variety generated by the polynomial x2 + y2 − 1 ∈ R[x, y] is the unit circle; it consists of all pairs (x, y) ∈ R2 such that x2 + y2 = 1.
A system of polynomial equations can be converted into a collection of polynomials by moving every term to one side, leaving the other side to be just zeros; this is a common technique in algebraic geometry. The variety of the resulting set of polynomials is the set of common solutions to the original list of equations. If no solutions exist, the system is said to be inconsistent; if there are a finite number of solutions, the variety is said to be zero dimensional; and if there are an infinite number of solutions, the variety is said to be

positive dimensional.

Note that this construction is a nonlinear generalization of linear algebra. Linear algebra studies polynomials of degree one, where every term has at most one variable and its exponent is one. The varieties are linear varieties: the empty set, a single point, lines, planes, or hyperplanes. By contrast, in general varieties can be significantly more complicated. They can be curved, come to sharp points, be self intersecting, or even disconnected. Unions of varieties are varieties by multiplying their generating sets pairwise, and intersections of varieties are varieties by simply taking all the generators of both. Consequently, given a variety

S

V it make sense to talk about its minimal decomposition, the representation of V as a union V = Vi of smaller irreducible varieties Vi that can not be further decomposed (i.e. if Vi = W1 ∪ W2 for varieties W1 and W2, either Vi = W1 or Vi = W2). Such unions are always finite. The dimension of a variety is the maximum dimension of its irreducible components, which are in turn defined as the dimension of a tangent hyperplane at a generic point, e.g. the dimension of the circle is 1 since (tangent) lines are one dimensional.
There is a rich interplay between polynomial ideals and varieties that forms the core of algebraic geometry and allows us to align geometric structures and procedures with algebraic ones in a near one-to-one fashion. In this setting, Gr¨obner bases play a major role. If I is an ideal, the variety of I, V (I), is the zero set of all the polynomials in I. If I is generated by the polynomials f1, . . . , fm, then V (I) = V (f1, . . . , fm); in particular, different bases of ideals generate identical varieties. In algebraic geometry, Gro¨bner bases are good choices for bases for myriad reasons. For example, if the variety V (I) is zero dimensional, a (lexicographic) Gr¨obner basis is structured in such a way that the equations can be solved one at a time and back-substituted into the others, much in the same way that in a linear system with a unique solution, after Gaussian elimination solutions can be read off and back-substituted one by one. Many geometric properties of varieties, such as their dimension or an irreducible decomposition, can also be easily computed using Gro¨bner bases.

3 Basic usage

This section showcases the basic capabilities of m2r and some of the ways that Macaulay2 can be used.

3.1 Loading m2r

m2r is loaded like any other R package:

R> library(m2r)

Loading required package: mpoly Loading required package: stringr

Warning: package ’stringr’ was built under R version 3.3.2

M2 found in /usr/local/macaulay2/bin

The first two lines of output indicate that m2r depends on mpoly and stringr. The packages mpoly and stringr manipulate and store multivariate polynomials and strings, respectively (Kahle, 2013; Wickham, 2017). The third line indicates that M2, the Macaulay2 executable, was found on the user’s machine at the given path, and that the version of Macaulay2 in that directory will be used for computations. When loaded

4on a Unix-like machine, m2r looks for M2 on the user’s machine by searching through ~/.bash_profile, or if nonexistent, ~/.bashrc and ~/.profile. m2r stores the first place M2is found in the option m2r$m2 path.1
When m2r is loaded, Macaulay2 is searched for but not initialized. The actual initialization and subsequent connection to Macaulay2 by m2r takes place when R first calls a Macaulay2 function through m2r.

3.2 m2r basics

The basic interface to Macaulay2 is provided by the m2() function. m2() accepts a character string containing Macaulay2 code, sends it to Macaulay2 to be evaluated, and brings the output back into R. For example, like all computer algebra systems, Macaulay2 supports basic arithmetic:

R> m2("1 + 1")

Starting M2... done.

[1] "2"

Unlike most m2r functions, m2() does not parse the Macaulay2 output into an R data structure. This can be seen in the result above being a character and not a numeric, but it is even more evident when evaluating a floating point number:

R> m2("1.2") [1] ".12p53e1"

Parsing the output is a delicate task accomplished by the m2 parse() function:

R> m2_parse(m2("1.2")) [1] 1.2

We expand on how m2 parse() works as a general Macaulay2 parser in Section 5.
One of the great advantages to m2r’s implementation is that it provides a persistent connection to a
Macaulay2 session running in the background. In early versions of algstat, Macaulay2 was accessible from R through intermediate script files; algstat saved user supplied Macaulay2 code to a temporary file, called Macaulay2 in script mode to evaluate it, saved the output to another temporary file, and parsed the output back into R (Kahle et al., 2014). One of the major limitations of this scheme is that every computation and every variable created on the Macaulay2 side is lost once the call is complete. Unlike algstat, m2r allows for this kind of persistent connection to a Macaulay2 session, which is easy to demonstrate:

R> m2("a = 1") [1] "1" R> m2("a") [1] "1"

When not actively running code, the Macaulay2 session sits, listening for commands issued by R. The details of the connection are described in detail in Section 6.

1Note that m2r will not necessarily use whatever is on the user’s typical PATH variable because when R makes system() calls, it does not load the user’s personal configuration files. If a different path is desired, the user can easily change this option with the function set m2 path().

5
While the Macaulay2 session is live, it helps to have R-side functions that access it in a natural way.
Because of this, just as there are functions such as ls() and exists() in R, m2r provides analogues for the background Macaulay2 session:

R> m2_ls() [1] "a" R> m2_exists(c("a", "b")) [1] TRUE FALSE R> m2_getwd() [1] "/Users/david_kahle"

m2 ls() also accepts the argument all.names = TRUE, which gives a larger listing of the variables defined in the Macaulay2 session, much like ls(all.names = TRUE). These additional variables fall into two categories: output variables returned by Macaulay2 and m2r variables used to manage the connection. In Macaulay2, the output of each executed line of code is stored as a variable bound to the symbol o followed by the line number executed. For example, the output of the first executed line is o1. These are accessible through m2r as, for example, m2o1; however, since m2r’s internal connection itself makes calls to Macaulay2, the numbering is somewhat unpredictable. This is why they don’t show up in m2 ls() by default. The internal variables that m2r uses to manage the persistent connection to Macaulay2 are called m2rint* and generally shouldn’t be accessed by the user; we provide more on this in Section 5.1.

3.3 Commutative algebra and algebraic geometry

Macaulay2 is designed for computations in commutative algebra and algebraic geometry. Consequently, algebraic structures such as polynomial rings and ideals are of primary interest. While the m2() function suffices at a basic level for these kinds of operations in R, m2r provides a number of wrapper functions and data structures that facilitate interacting with Macaulay2 in a way that is significantly more familiar to R users. In the remainder of this section we showcase these kinds of functions in action. We begin with rings and ideals, the basic algebraic structures in commutative algebra, and the computation of Gro¨bner bases.
Polynomial rings can be created with the ring() function:

Recommended publications
  • Algebraic Statistics Tutorial I Alleles Are in Hardy-Weinberg Equilibrium, the Genotype Frequencies Are

    Algebraic Statistics Tutorial I Alleles Are in Hardy-Weinberg Equilibrium, the Genotype Frequencies Are

    Example: Hardy-Weinberg Equilibrium Suppose a gene has two alleles, a and A. If allele a occurs in the population with frequency θ (and A with frequency 1 − θ) and these Algebraic Statistics Tutorial I alleles are in Hardy-Weinberg equilibrium, the genotype frequencies are P(X = aa)=θ2, P(X = aA)=2θ(1 − θ), P(X = AA)=(1 − θ)2 Seth Sullivant The model of Hardy-Weinberg equilibrium is the set North Carolina State University July 22, 2012 2 2 M = θ , 2θ(1 − θ), (1 − θ) | θ ∈ [0, 1] ⊂ ∆3 2 I(M)=paa+paA+pAA−1, paA−4paapAA Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 1 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 2 / 32 Main Point of This Tutorial Phylogenetics Problem Given a collection of species, find the tree that explains their history. Many statistical models are described by (semi)-algebraic constraints on a natural parameter space. Generators of the vanishing ideal can be useful for constructing algorithms or analyzing properties of statistical model. Two Examples Phylogenetic Algebraic Geometry Sampling Contingency Tables Data consists of aligned DNA sequences from homologous genes Human: ...ACCGTGCAACGTGAACGA... Chimp: ...ACCTTGGAAGGTAAACGA... Gorilla: ...ACCGTGCAACGTAAACTA... Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 3 / 32 Seth Sullivant (NCSU) Algebraic Statistics July 22, 2012 4 / 32 Model-Based Phylogenetics Phylogenetic Models Use a probabilistic model of mutations Assuming site independence: Parameters for the model are the combinatorial tree T , and rate Phylogenetic Model is a latent class graphical model parameters for mutations on each edge Vertex v ∈ T gives a random variable Xv ∈ {A, C, G, T} Models give a probability for observing a particular aligned All random variables corresponding to internal nodes are latent collection of DNA sequences ACCGTGCAACGTGAACGA Human: X1 X2 X3 Chimp: ACGTTGCAAGGTAAACGA Gorilla: ACCGTGCAACGTAAACTA Y 2 Assuming site independence, data is summarized by empirical Y 1 distribution of columns in the alignment.
  • Elect Your Council

    Elect Your Council

    Volume 41 • Issue 3 IMS Bulletin April/May 2012 Elect your Council Each year IMS holds elections so that its members can choose the next President-Elect CONTENTS of the Institute and people to represent them on IMS Council. The candidate for 1 IMS Elections President-Elect is Bin Yu, who is Chancellor’s Professor in the Department of Statistics and the Department of Electrical Engineering and Computer Science, at the University 2 Members’ News: Huixia of California at Berkeley. Wang, Ming Yuan, Allan Sly, The 12 candidates standing for election to the IMS Council are (in alphabetical Sebastien Roch, CR Rao order) Rosemary Bailey, Erwin Bolthausen, Alison Etheridge, Pablo Ferrari, Nancy 4 Author Identity and Open L. Garcia, Ed George, Haya Kaspi, Yves Le Jan, Xiao-Li Meng, Nancy Reid, Laurent Bibliography Saloff-Coste, and Richard Samworth. 6 Obituary: Franklin Graybill The elected Council members will join Arnoldo Frigessi, Steve Lalley, Ingrid Van Keilegom and Wing Wong, whose terms end in 2013; and Sandrine Dudoit, Steve 7 Statistical Issues in Assessing Hospital Evans, Sonia Petrone, Christian Robert and Qiwei Yao, whose terms end in 2014. Performance Read all about the candidates on pages 12–17, and cast your vote at http://imstat.org/elections/. Voting is open until May 29. 8 Anirban’s Angle: Learning from a Student Left: Bin Yu, candidate for IMS President-Elect. 9 Parzen Prize; Recent Below are the 12 Council candidates. papers: Probability Surveys Top row, l–r: R.A. Bailey, Erwin Bolthausen, Alison Etheridge, Pablo Ferrari Middle, l–r: Nancy L. Garcia, Ed George, Haya Kaspi, Yves Le Jan 11 COPSS Fisher Lecturer Bottom, l–r: Xiao-Li Meng, Nancy Reid, Laurent Saloff-Coste, Richard Samworth 12 Council Candidates 18 Awards nominations 19 Terence’s Stuff: Oscars for Statistics? 20 IMS meetings 24 Other meetings 27 Employment Opportunities 28 International Calendar of Statistical Events 31 Information for Advertisers IMS Bulletin 2 .
  • Algebraic Statistics of Design of Experiments

    Algebraic Statistics of Design of Experiments

    Thammasat University, Bangkok Algebraic Statistics of Design of Experiments Giovanni Pistone Bangkok, March 14 2012 The talk 1. A short history of AS from a personal perspective. See a longer account by E. Riccomagno in • E. Riccomagno, Metrika 69(2-3), 397 (2009), ISSN 0026-1335, http://dx.doi.org/10.1007/s00184-008-0222-3. 2. The algebraic description of a designed experiment. I use the tutorial by M.P. Rogantin: • http://www.dima.unige.it/~rogantin/AS-DOE.pdf 3. Topics from the theory, cfr. the course teached by E. Riccomagno, H. Wynn and Hugo Maruri-Aguilar in 2009 at the Second de Brun Workshop in Galway: • http://hamilton.nuigalway.ie/DeBrunCentre/ SecondWorkshop/online.html. CIMPA Ecole \Statistique", 1980 `aNice (France) Sumate Sompakdee, GP, and Chantaluck Na Pombejra First paper People • Roberto Fontana, DISMA Politecnico di Torino. • Fabio Rapallo, Universit`adel Piemonte Orientale. • Eva Riccomagno DIMA Universit`adi Genova. • Maria Piera Rogantin, DIMA, Universit`adi Genova. • Henry P. Wynn, LSE London. AS and DoE: old biblio and state of the art Beginning of AS in DoE • G. Pistone, H.P. Wynn, Biometrika 83(3), 653 (1996), ISSN 0006-3444 • L. Robbiano, Gr¨obnerBases and Statistics, in Gr¨obnerBases and Applications (Proc. of the Conf. 33 Years of Gr¨obnerBases), edited by B. Buchberger, F. Winkler (Cambridge University Press, 1998), Vol. 251 of London Mathematical Society Lecture Notes, pp. 179{204 • R. Fontana, G. Pistone, M. Rogantin, Journal of Statistical Planning and Inference 87(1), 149 (2000), ISSN 0378-3758 • G. Pistone, E. Riccomagno, H.P. Wynn, Algebraic statistics: Computational commutative algebra in statistics, Vol.
  • A Widely Applicable Bayesian Information Criterion

    A Widely Applicable Bayesian Information Criterion

    JournalofMachineLearningResearch14(2013)867-897 Submitted 8/12; Revised 2/13; Published 3/13 A Widely Applicable Bayesian Information Criterion Sumio Watanabe [email protected] Department of Computational Intelligence and Systems Science Tokyo Institute of Technology Mailbox G5-19, 4259 Nagatsuta, Midori-ku Yokohama, Japan 226-8502 Editor: Manfred Opper Abstract A statistical model or a learning machine is called regular if the map taking a parameter to a prob- ability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/logn, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model.
  • Algebraic Statistics 2015 8-11 June, Genoa, Italy

    Algebraic Statistics 2015 8-11 June, Genoa, Italy

    ALGEBRAIC STATISTICS 2015 8-11 JUNE, GENOA, ITALY INVITED SPEAKERS TUTORIALS Elvira Di Nardo - University of Basilicata Luis García-Puente - Sam Houston State University Thomas Kahle - OvGU Magdeburg Luigi Malagò - Shinshu University and INRIA Saclay Sonja Petrović - Illinois Institute of Technology Giovanni Pistone - Collegio Carlo Alberto Piotr Zwiernik - University of Genoa Jim Q. Smith - University of Warwick Bernd Sturmfels - University of California Berkeley IMPORTANT DATES 30 April - deadline for abstract submissions website: http://www.dima.unige.it/~rogantin/AS2015/ 31 May - registration closed 2 Monday, June 8, 2015 Invited talk B. Sturmfels Exponential varieties. Page 7 Talks C. Long, S. Sullivant Tying up loose strands: defining equations of the strand symmetric model. Page 8 L. Robbiano From factorial designs to Hilbert schemes. Page 10 E. Robeva Decomposing tensors into frames. Page 11 Tutorial L. García-Puente R package for algebraic statistics. Page 12 Tuesday, June 9, 2015 Invited talks T. Kahle Algebraic geometry of Poisson regression. Page 13 S. Petrovic´ What are shell structures of random networks telling us? Page 14 Talks M. Compagnoni, R. Notari, A. Ruggiu, F. Antonacci, A. Sarti The geometry of the statistical model for range-based localization. Page 15 R. H. Eggermont, E. Horobe¸t, K. Kubjas Matrices of nonnegative rank at most three. Page 17 A. Engström, P. Norén Algebraic graph limits. Page 18 E. Gross, K. Kubjas Hypergraph decompositions and toric ideals. Page 20 3 J. Rodriguez, B. Wang The maximum likelihood degree of rank 2 matrices via Euler characteristics. Page 21 M. Studený, T. Kroupa A linear-algebraic criterion for indecomposable generalized permutohedra.
  • Lectures on Algebraic Statistics

    Lectures on Algebraic Statistics

    Lectures on Algebraic Statistics Mathias Drton, Bernd Sturmfels, Seth Sullivant September 27, 2008 2 Contents 1 Markov Bases 7 1.1 Hypothesis Tests for Contingency Tables . 7 1.2 Markov Bases of Hierarchical Models . 17 1.3 The Many Bases of an Integer Lattice . 26 2 Likelihood Inference 35 2.1 Discrete and Gaussian Models . 35 2.2 Likelihood Equations for Implicit Models . 46 2.3 Likelihood Ratio Tests . 54 3 Conditional Independence 67 3.1 Conditional Independence Models . 67 3.2 Graphical Models . 75 3.3 Parametrizations of Graphical Models . 85 4 Hidden Variables 95 4.1 Secant Varieties in Statistics . 95 4.2 Factor Analysis . 105 5 Bayesian Integrals 113 5.1 Information Criteria and Asymptotics . 113 5.2 Exact Integration for Discrete Models . 122 6 Exercises 131 6.1 Markov Bases Fixing Subtable Sums . 131 6.2 Quasi-symmetry and Cycles . 136 6.3 A Colored Gaussian Graphical Model . 139 6.4 Instrumental Variables and Tangent Cones . 143 6.5 Fisher Information for Multivariate Normals . 150 6.6 The Intersection Axiom and Its Failure . 152 6.7 Primary Decomposition for CI Inference . 155 6.8 An Independence Model and Its Mixture . 158 7 Open Problems 165 4 Contents Preface Algebraic statistics is concerned with the development of techniques in algebraic geometry, commutative algebra, and combinatorics, to address problems in statis- tics and its applications. On the one hand, algebra provides a powerful tool set for addressing statistical problems. On the other hand, it is rarely the case that algebraic techniques are ready-made to address statistical challenges, and usually new algebraic results need to be developed.
  • Connecting Algebraic Geometry and Model Selection in Statistics Organized by Russell Steele, Bernd Sturmfels, and Sumio Watanabe

    Connecting Algebraic Geometry and Model Selection in Statistics Organized by Russell Steele, Bernd Sturmfels, and Sumio Watanabe

    Singular learning theory: connecting algebraic geometry and model selection in statistics organized by Russell Steele, Bernd Sturmfels, and Sumio Watanabe Workshop Summary 1 Introduction This article reports the workshop, \Singular learning theory: connecting algebraic geometry and model selection in statistics," held at American Institute of Mathematics, Palo Alto, California, in December 12 to December 16, 2011. In this workshop, 29 researchers of mathematics, statistics, and computer science, studied the following issues: (i) mathematical foundation for statistical model selection; (ii)Problems and results in algebraic statistics; (iii) statistical methodology for model selection; and (iv) statistical models in computer science. 2 Background and Main Results Many statistical models and learning machines have hierarchical structures, hidden vari- ables, and grammatical information processes and are being used in vast number of scientific areas, e.g. artificial intelligence, information systems, and bioinformatics. Normal mixtures, binomial mixtures, artificial neural networks, hidden Markov models, reduced rank regres- sions, and Bayes networks are amongst the most examples of these models utilized in practice. Although such statistical models can have a huge impact within these other disciplines, no sound statistical foundation to evaluate the models has yet been established, because they do not satisfy a common regularity condition necessary for modern statistical inference. In particular, if the likelihood function can be approximated by a multi-dimensional normal dis- tribution, then the accuracy of statistical estimation can be established by using properties of the normal distribution. For example, the maximum likelihood estimator will be asymp- totically normally distributed and the Bayes posterior distribution can also be approximated by the normal distribution.
  • Algebraic Statistics Is an Interdisciplinary Field That Uses Tool

    Algebraic Statistics Is an Interdisciplinary Field That Uses Tool

    TAPAS OF ALGEBRAIC STATISTICS CARLOS AMENDOLA,´ MARTA CASANELLAS, AND LUIS DAVID GARC´IA PUENTE 1. What is Algebraic Statistics? Algebraic statistics is an interdisciplinary field that uses tools from computational al- gebra, algebraic geometry, and combinatorics to address problems in statistics and its applications. A guiding principle in this field is that many statistical models of interest are semialgebraic sets|a set of points defined by polynomial equalities and inequalities. Algebraic statistics is not only concerned with understanding the geometry and algebra of the underlying statistical model, but also with applying this knowledge to improve the analysis of statistical procedures, and to devise new methods for analyzing data. A well-known example of this principle is the model of independence of two discrete random variables. Two discrete random variables X; Y are independent if their joint prob- ability factors into the product of the marginal probabilities. Equivalently, X and Y are independent if and only if every 2 × 2-minor of the matrix of their joint probabilities is zero. These quadratic equations, together with the conditions that the probabilities are nonnegative and sum to one, define a semialgebraic set. In 1998, Persi Diaconis and Bernd Sturmfels showed how one can use algorithms from computational algebraic geometry to sample from conditional distributions. This work is generally regarded as one of the seminal works of what is now referred to as algebraic statistics. However, algebraic methods can be traced back to R. A. Fisher, who used Abelian groups in the study of factorial designs, and Karl Pearson, who used polynomial algebra to study Gaussian mixture models.
  • A Survey on Computational Algebraic Statistics and Its Applications

    A Survey on Computational Algebraic Statistics and Its Applications

    East-West J. of Mathematics: Vol. 19, No 2 (2017) pp. 105-148 A SURVEY ON COMPUTATIONAL ALGEBRAIC STATISTICS AND ITS APPLICATIONS Nguyen V. Minh Man Center of Excellency in Mathematics, Ministry of Education; Dept. of Mathematics, Faculty of Science, Mahidol University, Thailand; and Faculty of Environment and Resources University of Technology, VNUHCM, Vietnam e-mail: [email protected] Abstract Computational Algebraic Statistics is a new mathematics-based scien- tific discipline that is combined of three disciplines of Computing, Algebra and Statistics. This review introduces integrating powerful techniques of - among other things- Polynomial Algebra, Algebraic Geometry, Group Theory and Computational Statistics in this new field of Computational Algebraic Statistics (in brief CAS). CAS currently plays an essential and powerful role in some importantly applicable fields growing very fast in recent years, from science to engineering, such as: statistical quality control, process engineering, computational biology, complex biological networks, life sciences, data analytics and finance studies. Furthermore, by combining the theory of algebraic geometry with graph theory, we point out connections between integer linear program- ming (ILP), mathematical modeling in traffic engineering, logistics man- agement and transportation science. These methods and algorithms are based on elegant ideas from some active fields of mathematics and statistics, and their useful applications Key words: algebraic geometry, algebraic statistics, computational science,
  • Mathematisches Forschungsinstitut Oberwolfach Algebraic Statistics

    Mathematisches Forschungsinstitut Oberwolfach Algebraic Statistics

    Mathematisches Forschungsinstitut Oberwolfach Report No. 20/2017 DOI: 10.4171/OWR/2017/20 Algebraic Statistics Organised by Mathias Drton, Seattle Thomas Kahle, Magdeburg Bernd Sturmfels, Berkeley Caroline Uhler, Cambridge MA 16 April – 22 April 2017 Abstract. Algebraic Statistics is concerned with the interplay of techniques from commutative algebra, combinatorics, (real) algebraic geometry, and re- lated fields with problems arising in statistics and data science. This work- shop was the first at Oberwolfach dedicated to this emerging subject area. The participants highlighted recent achievements in this field, explored ex- citing new applications, and mapped out future directions for research. Mathematics Subject Classification (2010): Primary: 13P25, 62E99; Secondary: 15A72, 52A40, 92B05. Introduction by the Organisers The Oberwolfach workshop Algebraic statistics was organized by Mathias Drton, Thomas Kahle, Bernd Sturmfels, and Caroline Uhler and ran April 17-21, 2017. Algebraic statistics is a rather new field, about two decades old. The field emerged from two lines of work: Diaconis and Sturmfels introduced algebraic tools to cat- egorical data analysis and suggested the construction of Markov bases to perform exact goodness-of-fit tests for such data. This got algebraists, combinatorialists, and algebraic geometers interested in problems in statistics. Significant contribu- tions from Japanese statisticians resulted in a book on Markov bases in algebraic statistics. Through recent work Markov bases have also found applications to disclosure limitation and genetics. The second source, which coined the term ‘al- gebraic statistics’, is a book explaining how Gr¨obner basis methods can be used in experimental design. A recent direction that emerged from this is the use of 1208 Oberwolfach Report 20/2017 commutative algebra for experimental design in system reliability.
  • Algebraic Statistics in Practice: Applications to Networks

    Algebraic Statistics in Practice: Applications to Networks

    Algebraic Statistics in Practice: Applications to Networks Marta Casanellas, Sonja Petrovi´c,and Caroline Uhler Abstract: Algebraic statistics uses tools from algebra (especially from multilinear al- gebra, commutative algebra and computational algebra), geometry and combinatorics to provide insight into knotty problems in mathematical statistics. In this survey we illustrate this on three problems related to networks, namely network models for rela- tional data, causal structure discovery and phylogenetics. For each problem we give an overview of recent results in algebraic statistics with emphasis on the statistical achieve- ments made possible by these tools and their practical relevance for applications to other scientific disciplines. 1. Introduction Algebraic statistics is a branch of mathematical statistics that focuses on the use of algebraic, geometric and combinatorial methods in statistics. The term \Algebraic Statistics" itself was coined as the title of a book on the use of techniques from commutative algebra in experi- mental design (Pistone et al., 2001). An early influential paper (Diaconis and Sturmfels, 1998) connected the problem of sampling from conditional distributions for the analysis of categori- cal data to commutative algebra, thereby showcasing the power of the interplay between these areas. In the two decades that followed, growing interest in applying new algebraic tools to key problems in statistics has generated a growing literature. The use of algebra, geometry and combinatorics in statistics did not start only two decades ago. Combinatorics and probability theory have gone hand-in-hand since their beginnings. The first standard mathematical method in statistics may be the Method of Least Squares, which has been used extensively since shortly after 1800 and relies heavily on systems of linear equations.
  • Multi-Way Contingency Tables in Baseball: an Application of Algebraic Statistics

    Multi-Way Contingency Tables in Baseball: an Application of Algebraic Statistics

    Multi-Way Contingency Tables in Baseball: An Application of Algebraic Statistics Brian Minter December 4, 2015 Abstract Algebraic statistics is a relatively new field of mathematics that combines methods from algebraic geometry and commutative algebra to develop new techniques of statis- tical inference. Although there are already methods to test for independence such as Pearson's Chi-Square Test and Fisher's Exact Test, methods from algebraic statistics have shown to be useful when evaluating multi-way and sparse tables. In this paper, we seek to duplicate these results by using Markov Chain Monte Carlo Methods, in par- ticular the Metropolis Hastings Algorithm, to sample multi-dimensional distributions of baseball data to test for independence. 1 Introduction Algebraic statistics is a relatively new field of mathematics that combines methods from alge- braic geometry and commutative algebra to develop new techniques of statistical inference. Algebraic statistics offers a reliable alternative to classic methods of statistical inference. In the past, statisticians have usually relied on classic asymptotic theory. However, when evaluating multi-way tables, algebraic statistics can prove to be a more effective tool. In 1996, Pistone and Wynn [6] published their work on using Gr¨obnerbases to study the rela- tions in factorial designs of experiments. Then in 1998, Diaconis and Sturmfels [5] published their work on constructing Markov chain algorithms for sampling from discrete exponential families. Together, these two papers are credited for introducing algebraic statistics. Since then, the field has been rapidly growing and expanding in different directions. In this paper, we will first discuss general two-way contingency tables.