Algebraic Statistics in Design of Experiments

Maria Piera Rogantin
DIMA – Università di Genova – [email protected]
Torino, September 2004

Table of contents

1. General aspects
   (a) What is the design of experiments?
   (b) Design and ideal of the design
   (c) The full factorial design
2. Fractions and confounding
   (a) Fraction of a full factorial design
   (b) Space of the responses on the fraction
   (c) Identifiability of a model
   (d) Fractions and identifiable linear models
   (e) Confounding of subspaces
3. Indicator function and orthogonality
   (a) Orthogonality
   (b) Complex coding for full factorial designs
   (c) Indicator function of a fraction
   (d) Results about orthogonality
   (e) Generation of fractions with a given orthogonal structure
   (f) Regular fractions
4. Models on a fraction and term-orders
   (a) Identifiable sub-models and initial orders
   (b) Model curvature
   (c) Block term-order and factor screening

PART I: General aspects

Pistone G., Wynn H. P. (1996). Generalized confounding with Gröbner bases, Biometrika, 83(3): 653–666.
Robbiano L. (1998). Gröbner Bases and Statistics, in Gröbner Bases and Applications (Proc. of the Conf. 33 Years of Gröbner Bases), Buchberger B. & Winkler F. eds., Cambridge University Press, 179–204.
Pistone G., Riccomagno E. and Wynn H. P. (2001). Algebraic Statistics: Computational Commutative Algebra in Statistics, Chapman & Hall.

1.1 What is the design of experiments?

An example: the 3^3 factorial design. Push-off force of a spark control valve (Wu & Hamada, 2000, Experiments, Wiley & Sons, p. 247). We study the influence of three factors on the response:

- weld time (0.3 - 0.5 - 0.7 seconds)
- pressure (15 - 20 - 25 psi)
- moisture (0.8 - 1.2 - 1.8 percent)

Each factor has three levels. We consider them as ordinal levels (low - medium - high), coded by integer numbers.

Full factorial design: each treatment corresponds to a point of Z^3.

In practice the experiment is realized on fewer treatments (because of problems with costs, time, practical constraints in setting the factors and measuring the responses, ...). How to choose the treatments? Which fraction gives "the best" study of the responses?

  time   pressure   moisture
   -1       -1         -1
   -1        0          0
   -1        1          1
    0       -1          0
    0        0          1
    0        1         -1
    1       -1          1
    1        0         -1
    1        1          0

Such a subset is a fractional factorial design, or fraction.

Design and functions on the design

  Push-off force (lb)   Time   Pressure   Moisture
        111.1            -1       -1         -1
        131.0            -1        0          0
         65.4            -1        1          1
        125.5             0       -1          0
         46.9             0        0          1
        113.7             0        1         -1
         72.5             1       -1          1
        141.1             1        0         -1
        134.2             1        1          0

Linear influence of the factors and their interactions:

  Force = θ_0 + θ_1 T + θ_2 P + θ_3 M + θ_12 T·P + θ_13 T·M + ...

The θ's quantify the "importance" of the terms with respect to the response.

Full factorial designs and fractions: notation

• A_i = {a_ij : j = 1, ..., n_i} is the set of levels of the i-th factor; the levels a_ij are coded by rational numbers (Q) or complex numbers (C).
• D = A_1 × ... × A_m ⊂ Q^m (or D ⊂ C^m), with N = ∏_{j=1}^m n_j points, is the full factorial design.
• A fraction is a subset F ⊂ D.

Responses on a design: f : D → R (functions defined on D).

(*) Here "design" indicates either "full factorial design" or "fractional factorial design" or ...
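As an illustration (not part of the original slides), a minimal Python sketch of these objects: the 3^3 full factorial design as a set of coded points of Z^3, the 9-run fraction listed above, and the observed push-off force seen as a function defined on the fraction. The variable names are ours.

    from itertools import product

    # Coded levels for the three factors: weld time, pressure, moisture
    levels = [-1, 0, 1]

    # Full factorial design D: all 3^3 = 27 treatments as points of Z^3
    D = list(product(levels, repeat=3))
    assert len(D) == 27

    # The 9-run fraction F of the table (time, pressure, moisture)
    F = [(-1, -1, -1), (-1, 0, 0), (-1, 1, 1),
         ( 0, -1,  0), ( 0, 0, 1), ( 0, 1, -1),
         ( 1, -1,  1), ( 1, 0, -1), ( 1, 1,  0)]
    assert set(F) <= set(D)   # the fraction is a subset of the full design

    # A response is a function defined on design points, e.g. the
    # observed push-off force on the fraction F
    force = dict(zip(F, [111.1, 131.0, 65.4, 125.5, 46.9,
                         113.7, 72.5, 141.1, 134.2]))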
• X_i : D ∋ (d_1, ..., d_m) ↦ d_i is the i-th projection, frequently called factor.
• X^α = X_1^{α_1} ··· X_m^{α_m}, with α = (α_1, ..., α_m) and α_i < n_i for i = 1, ..., m, are the monomial responses (also called terms or interactions).

The term X^α has order k if α has k non-null entries: X^α is an interaction of order k (in the binary case, order = degree).

Definitions:
• Mean value of f on D: E_D(f) = (1/#D) Σ_{d∈D} f(d).
• A contrast is a response f such that E_D(f) = 0.
• Two responses f and g are orthogonal on D if E_D(f g) = 0.

The polynomial complete regression model

• L = {(α_1, ..., α_m) : α_i < n_i, i = 1, ..., m} is the set of exponents (or "logarithms") of all the interactions.
• Complete regression model: for every d_i ∈ D and for the observed value y_i,

  y_i = Σ_{α∈L} θ_α X^α(d_i),   with θ_α ∈ R or θ_α ∈ C.

In vector notation, with Y = (y_1, ..., y_N) the response measured on the points of D:

  Y = Σ_{α∈L} θ_α X^α

• Z = [X^α(d)]_{d∈D, α∈L} is the matrix of the complete regression model.
• θ = (θ_α)_{α∈L} is the vector of the coefficients.

Matrix notation of the complete regression model: Y = Z θ.

On the full factorial design the complete regression model is identifiable, i.e. there is a unique solution with respect to θ:

  θ̂ = Z^{-1} Y    (Z is a square full-rank matrix).

On a fraction there is not a unique solution with respect to θ (the matrix Z = [X^α(d)]_{d∈F, α∈L} has fewer rows than columns):

  θ̂ = (Z′Z)^− Z′ Y,    where (·)^− denotes a generalized inverse.

In general, consider linear regression models of the form Y = W θ + R, where W is a matrix of "explicative variables" and R is a vector of residuals. Even if θ̂ = (W′W)^− W′ Y is not unique, the approximation of the response Y through the "explicative variables" is always unique:

  Ŷ = W θ̂ = W (W′W)^− W′ Y = P Y,

that is, Ŷ is the orthogonal projection of Y onto the linear space generated by the columns of W. Moreover, if Y is a multivariate random variable with normal distribution Y ∼ N(W θ, σ²I), then θ̂ = (W′W)^− W′ Y is a maximum likelihood solution.

Example: 2 × 3 full factorial design

  A_1 = {−1, 1},    n_1 = 2
  A_2 = {−1, 0, 1}, n_2 = 3
  D = A_1 × A_2 = {(−1,−1), (−1,0), (−1,1), (1,−1), (1,0), (1,1)}

Monomial responses: 1, X_1, X_2, X_2², X_1X_2, X_1X_2², i.e. L = {(0,0), (1,0), (0,1), (0,2), (1,1), (1,2)}.

Complete regression model:

  Y = θ_00 + θ_10 X_1 + θ_01 X_2 + θ_02 X_2² + θ_11 X_1X_2 + θ_12 X_1X_2²

Matrix of the complete regression model (columns: 1, X_1, X_2, X_2², X_1X_2, X_1X_2²; rows: the six design points in the order above):

      | 1  −1  −1   1   1  −1 |
      | 1  −1   0   0   0   0 |
  Z = | 1  −1   1   1  −1  −1 |
      | 1   1  −1   1  −1   1 |
      | 1   1   0   0   0   0 |
      | 1   1   1   1   1   1 |
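A numerical sketch of the 2 × 3 example using NumPy (our illustration, not code from the slides): it builds the matrix Z of the complete regression model, solves θ̂ = Z^{-1} Y on the full design, and uses a generalized inverse on a fraction, where only the projection Ŷ is unique. The response values and the chosen fraction are invented purely for illustration.

    import numpy as np
    from itertools import product

    # 2 x 3 full factorial design: A1 = {-1, 1}, A2 = {-1, 0, 1}
    D = list(product([-1, 1], [-1, 0, 1]))

    # Exponents L of the monomial responses 1, X1, X2, X2^2, X1*X2, X1*X2^2
    L = [(0, 0), (1, 0), (0, 1), (0, 2), (1, 1), (1, 2)]

    def model_matrix(points):
        """Matrix Z = [X^alpha(d)]: rows indexed by points, columns by L."""
        return np.array([[x1**a1 * x2**a2 for (a1, a2) in L]
                         for (x1, x2) in points])

    Z = model_matrix(D)                  # 6 x 6, square and full rank
    Y = np.array([3.1, 2.0, 4.2, 5.5, 4.9, 6.8])   # illustrative observations
    theta_full = np.linalg.solve(Z, Y)   # unique solution: theta = Z^{-1} Y

    # On a fraction F the model matrix has fewer rows than columns
    F = [(-1, -1), (-1, 1), (1, 0)]      # an arbitrary 3-point fraction
    W = model_matrix(F)                  # 3 x 6
    YF = Y[[D.index(p) for p in F]]
    theta_frac = np.linalg.pinv(W) @ YF  # one of infinitely many solutions
    Y_hat = W @ theta_frac               # orthogonal projection, always unique
    assert np.allclose(Y_hat, YF)        # W has full row rank, so YF is fitted exactly

With only three runs and six model terms the coefficients θ are not identifiable; only the projection Ŷ is. This is the confounding issue taken up in Section 2 of the slides.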
1.2 Design and design ideal

The application of computational commutative algebra to the study of estimability and confounding on fractions of factorial designs was proposed by Pistone & Wynn (Biometrika, 1996).

1st idea: each finite set of points D ⊆ Q^m is the set of solutions of a system of polynomial equations; we assume that each solution is simple.

2nd idea: each real function defined on D is a polynomial function with coefficients in the field of real numbers R, i.e. it is the restriction to D of a real polynomial.

A design D is a finite set of distinct points of k^m, where k is a field. The defining ideal of the design (or design ideal) I(D) is the set of all polynomials of k[x_1, ..., x_m] that vanish on D.

• The design D is a variety.
• The design ideal I(D) is radical.
• I(D) is the intersection of the design ideals of the single points of D.

Operations with designs and operations with ideals

• Product of designs. Let D_1 ⊂ k^{m_1} and D_2 ⊂ k^{m_2}, so that D_1 × D_2 ⊂ k^{m_1+m_2}, and let I_1 = I(D_1), I_2 = I(D_2). Then

  I(D_1 × D_2) = <I_1, I_2>,

the ideal generated by I_1 and I_2 in k[x_1, ..., x_{m_1+m_2}]. Let τ be a term ordering on k[x_1, ..., x_{m_1+m_2}], and let τ_1, τ_2 be the term orderings obtained by restricting τ to k[x_1, ..., x_{m_1}] and to k[x_{m_1+1}, ..., x_{m_1+m_2}]. If G_{1,τ_1} and G_{2,τ_2} are Gröbner bases of I(D_1) and I(D_2), then

  G = { g_1, g_2 : g_1 ∈ G_{1,τ_1}, g_2 ∈ G_{2,τ_2} }   (i.e. G = G_{1,τ_1} ∪ G_{2,τ_2})

is a Gröbner basis of I(D_1 × D_2).

• Restriction of a design. Let D ⊂ k^m with I = I(D), and let J be an ideal of k[x_1, ..., x_m]. Then I + J = {f + g : f ∈ I, g ∈ J} is an ideal of k[x_1, ..., x_m] and

  Variety(I + J) = D ∩ Variety(J).

• Union of designs. Let D_1, D_2 ⊂ k^m, so that D_1 ∪ D_2 ⊂ k^m, let τ be a term ordering on k[x_1, ..., x_m], and let G_1, G_2 be Gröbner bases of I(D_1), I(D_2). A Gröbner basis of I(D_1 ∪ D_2) is

  G = { g_1 g_2 : g_1 ∈ G_1, g_2 ∈ G_2 }.

1.3 The full factorial design: polynomial representation

The full factorial design D corresponds to the set of solutions of the system

  (X_1 − a_11) ··· (X_1 − a_1n_1) = 0
  (X_2 − a_21) ··· (X_2 − a_2n_2) = 0
  ...
  (X_m − a_m1) ··· (X_m − a_mn_m) = 0

or, equivalently, to the rewriting rules

  X_i^{n_i} = Σ_{k=0}^{n_i−1} ψ_{ik} X_i^k,   i = 1, ..., m.

In the previous examples:

  2 × 3 design:  (X_1 − 1)(X_1 + 1) = 0,  X_2(X_2 − 1)(X_2 + 1) = 0
                 rewriting rules:  X_1² = 1,  X_2³ = X_2

  3^3 design:    X_1(X_1 − 1)(X_1 + 1) = 0,  X_2(X_2 − 1)(X_2 + 1) = 0,  X_3(X_3 − 1)(X_3 + 1) = 0
                 rewriting rules:  X_1³ = X_1,  X_2³ = X_2,  X_3³ = X_3

Space of the real functions on the full factorial design, R(D)

For a full factorial design each function is represented in a unique way by an identified complete regression model (i.e. as a linear combination of the constant, the simple terms and the interactions):

  R(D) = { Σ_{α∈L} θ_α X^α : θ_α ∈ R }

In general, for full factorial designs or fractions:

• R(D) is a Hilbert vector space (classical results derive from this structure). The scalar product is

  <f, g> = E_D(f g) = (1/N) Σ_{d∈D} f(d) g(d).

• R(D) is a ring (the algebraic statistical approach). Products are reduced with the rules derived from the polynomial representation of the full factorial design:

  X_i^{n_i} = Σ_{k=0}^{n_i−1} ψ_{ik} X_i^k,   ψ_{ik} ∈ R,  for i = 1, ..., m.
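These rewriting rules can be reproduced with any computer algebra system that computes Gröbner bases; the sketch below uses SymPy as an illustrative choice (the slides do not prescribe a particular system). It recovers the canonical polynomials of the 2 × 3 design ideal and reduces a product of responses in the ring R(D).

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2')

    # Generators of the design ideal of D = {-1, 1} x {-1, 0, 1}:
    # one polynomial per factor, vanishing exactly at its levels
    g1 = (x1 - 1)*(x1 + 1)        # roots: A1 = {-1, 1}
    g2 = x2*(x2 - 1)*(x2 + 1)     # roots: A2 = {-1, 0, 1}

    G = sp.groebner([g1, g2], x1, x2, order='lex')
    print(G)   # basis {x1**2 - 1, x2**3 - x2}: the rules X1^2 = 1, X2^3 = X2

    # Reduction in the ring R(D): the product (X1*X2)*(X1*X2^2) = X1^2 * X2^3
    # is rewritten with X1^2 -> 1 and X2^3 -> X2, leaving X2
    f = (x1*x2) * (x1*x2**2)
    _, r = sp.reduced(f, [x1**2 - 1, x2**3 - x2], x1, x2, order='lex')
    print(r)   # x2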