Lectures on Algebraic Statistics
Mathias Drton, Bernd Sturmfels, Seth Sullivant

September 27, 2008

Contents

1 Markov Bases
  1.1 Hypothesis Tests for Contingency Tables
  1.2 Markov Bases of Hierarchical Models
  1.3 The Many Bases of an Integer Lattice
2 Likelihood Inference
  2.1 Discrete and Gaussian Models
  2.2 Likelihood Equations for Implicit Models
  2.3 Likelihood Ratio Tests
3 Conditional Independence
  3.1 Conditional Independence Models
  3.2 Graphical Models
  3.3 Parametrizations of Graphical Models
4 Hidden Variables
  4.1 Secant Varieties in Statistics
  4.2 Factor Analysis
5 Bayesian Integrals
  5.1 Information Criteria and Asymptotics
  5.2 Exact Integration for Discrete Models
6 Exercises
  6.1 Markov Bases Fixing Subtable Sums
  6.2 Quasi-symmetry and Cycles
  6.3 A Colored Gaussian Graphical Model
  6.4 Instrumental Variables and Tangent Cones
  6.5 Fisher Information for Multivariate Normals
  6.6 The Intersection Axiom and Its Failure
  6.7 Primary Decomposition for CI Inference
  6.8 An Independence Model and Its Mixture
7 Open Problems

Preface

Algebraic statistics is concerned with the development of techniques in algebraic geometry, commutative algebra, and combinatorics to address problems in statistics and its applications. On the one hand, algebra provides a powerful tool set for addressing statistical problems. On the other hand, it is rarely the case that algebraic techniques are ready-made to address statistical challenges, and usually new algebraic results need to be developed. In this way the dialogue between algebra and statistics benefits both disciplines.

Algebraic statistics is a relatively new field that has developed and changed rather rapidly over the last fifteen years. One of the first pieces of work in this area was the paper of Diaconis and the second author [33], which introduced the notion of a Markov basis for log-linear statistical models and showed its connection to commutative algebra. From there, the algebra/statistics connection spread to a number of different areas, including the design of experiments (highlighted in the monograph [74]), graphical models, phylogenetic invariants, parametric inference, algebraic tools for maximum likelihood estimation, and disclosure limitation, to name just a few. References to this literature are surveyed in the editorial [47] and the two review articles [4, 41] in a special issue of the journal Statistica Sinica. An area of particularly strong activity is applications to computational biology, which is highlighted in the book Algebraic Statistics for Computational Biology by Lior Pachter and the second author [73]. We will sometimes refer to that book as the “ASCB book.”

These lecture notes arose out of a five-day Oberwolfach Seminar, given at the Mathematisches Forschungsinstitut Oberwolfach (MFO), in Germany’s Black Forest, over the days May 12–16, 2008. The seminar lectures provided an introduction to some of the fundamental notions in algebraic statistics, as well as a snapshot of some of the current research directions. Given such a short timeframe, we were forced to pick and choose topics to present, and many areas of active research in algebraic statistics have been left out. Still, we hope that these notes give an overview of some of the main ideas in the area and directions for future research.
The lecture notes are an expanded version of the thirteen lectures we gave throughout the week, with many more examples and background material than we could fit into our hour-long lectures. The first five chapters cover the material in those thirteen lectures and roughly correspond to the five days of the workshop.

Chapter 1 reviews statistical tests for contingency table analysis and explains the notion of a Markov basis for a log-linear model. We connect this notion to commutative algebra, and give some of the most important structural theorems about Markov bases. Chapter 2 is concerned with likelihood inference in algebraic statistical models. We introduce these models for discrete and normal random variables, explain how to solve the likelihood equations parametrically and implicitly, and show how model geometry connects to asymptotics of likelihood ratio statistics. Chapter 3 is an algebraic study of conditional independence structures. We introduce these generally, and then focus in on the special class of graphical models. Chapter 4 is an introduction to hidden variable models. From the algebraic point of view, these models often give rise to secant varieties. Finally, Chapter 5 concerns Bayesian integrals, both from an asymptotic large-sample perspective and from the standpoint of exact evaluation for small samples.

During our week in Oberwolfach, we held several student problem sessions to complement our lectures. We created eight problems highlighting material from the different lectures and assigned the students into groups to work on these problems. The exercises presented a range of computational and theoretical challenges. After daily and sometimes late-night problem solving sessions, the students wrote up solutions, which appear in Chapter 6. On the closing day of the workshop, we held an open problem session, where we and the participants presented open research problems related to algebraic statistics. These appear in Chapter 7.

There are many people to thank for their help in the preparation of this book. First, we would like to thank the MFO and its staff for hosting our Oberwolfach Seminar, which provided a wonderful environment for our research lectures. In particular, we thank MFO director Gert-Martin Greuel for suggesting that we prepare these lecture notes. Second, we thank Birkhäuser editor Thomas Hempfling for his help with our manuscript. Third, we acknowledge support by grants from the U.S. National Science Foundation (Drton DMS-0746265; Sturmfels DMS-0456960; Sullivant DMS-0700078 and 0840795). Bernd Sturmfels was also supported by an Alexander von Humboldt research prize at TU Berlin. Finally, and most importantly, we would like to thank the participants of the seminar. Their great enthusiasm and energy created a very stimulating environment for teaching this material. The participants were Florian Block, Dustin Cartwright, Filip Cools, Jörn Dannemann, Alex Engström, Thomas Friedrich, Hajo Holzmann, Thomas Kahle, Anna Kedzierska, Martina Kubitzke, Krzysztof Latuszynski, Shaowei Lin, Hugo Maruri-Aguilar, Sofia Massa, Helene Neufeld, Mounir Nisse, Johannes Rauh, Christof Söger, Carlos Trenado, Oliver Wienand, Zhiqiang Xu, Or Zuk, and Piotr Zwiernik.

Chapter 1
Markov Bases

This chapter introduces the fundamental notion of a Markov basis, which represents one of the first connections between commutative algebra and statistics. This connection was made in the work of Diaconis and Sturmfels [33] on contingency table analysis.
Statistical hypotheses about contingency tables can be tested in an exact approach by performing random walks on a constrained set of tables with non-negative integer entries. Markov bases are of key importance to this statistical methodology because they comprise moves between tables that ensure that the random walk connects every pair of tables in the considered set.

Section 1.1 reviews the basics of contingency tables and exact tests; for more background see also the books by Agresti [1], Bishop, Fienberg, and Holland [18], or Christensen [21]. Section 1.2 discusses Markov bases in the context of hierarchical log-linear models and undirected graphical models. The problem of computing Markov bases is addressed in Section 1.3, where the problem is placed in the general context of integer lattices and tied to the algebraic notion of a lattice ideal.

1.1 Hypothesis Tests for Contingency Tables

A contingency table contains counts obtained by cross-classifying observed cases according to two or more discrete criteria. Here the word ‘discrete’ refers to criteria with a finite number of possible levels. As an example consider the 2 × 2 contingency table shown in Table 1.1.1. This table, which is taken from [1, §5.2.2], presents a classification of 326 homicide indictments in Florida in the 1970s. The two binary classification criteria are the defendant’s race and whether or not the defendant received the death penalty. A basic question of interest for this table is whether at the time death penalty decisions were made independently of the defendant’s race. In this section we will discuss statistical tests of such independence hypotheses as well as generalizations for larger tables.

                            Death Penalty
    Defendant's Race      Yes      No     Total
    White                  19     141       160
    Black                  17     149       166
    Total                  36     290       326

    Table 1.1.1: Data on death penalty verdicts.

Classifying a randomly selected case according to two criteria with r and c levels, respectively, yields two random variables X and Y. We code their possible outcomes as [r] and [c], where [r] := {1, 2, ..., r} and [c] := {1, 2, ..., c}. All probabilistic information about X and Y is contained in the joint probabilities

    p_{ij} = P(X = i, Y = j),   i ∈ [r], j ∈ [c],

which determine in particular the marginal probabilities

    p_{i+} := \sum_{j=1}^{c} p_{ij} = P(X = i),   i ∈ [r],

    p_{+j} := \sum_{i=1}^{r} p_{ij} = P(Y = j),   j ∈ [c].

Definition 1.1.1. The two random variables X and Y are independent if the joint probabilities factor as p_{ij} = p_{i+} p_{+j} for all i ∈ [r] and j ∈ [c]. We use the symbol X ⊥⊥ Y to denote independence of X and Y.

Proposition 1.1.2. The two random variables X and Y are independent if and only if the r × c matrix p = (p_{ij}) has rank 1.

Proof. (=⇒): The factorization in Definition 1.1.1 writes the matrix p as the product of the column vector filled with the marginal probabilities p_{i+} and the row vector filled with the probabilities p_{+j}.
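To make Definition 1.1.1 and Proposition 1.1.2 concrete, the following small Python sketch (ours, not part of the original text; variable names are our own choices) estimates the joint and marginal probabilities from Table 1.1.1, forms the rank-1 matrix of products p_{i+} p_{+j}, and compares the observed counts with the counts expected under independence via Pearson's chi-square statistic.

```python
import numpy as np

# Observed counts from Table 1.1.1 (death penalty verdicts, Florida, 1970s).
counts = np.array([[19, 141],    # White: Yes, No
                   [17, 149]])   # Black: Yes, No
n = counts.sum()                 # 326 cases in total

# Empirical joint probabilities p_ij and the marginals p_i+ and p_+j.
p = counts / n
p_row = p.sum(axis=1)            # p_i+ = P(X = i), X = defendant's race
p_col = p.sum(axis=0)            # p_+j = P(Y = j), Y = death penalty verdict

# Under the independence hypothesis X ⊥⊥ Y, the joint probability matrix is
# the rank-1 outer product of its marginals (Proposition 1.1.2).
p_indep = np.outer(p_row, p_col)

# Expected counts under independence and Pearson's chi-square statistic.
expected = n * p_indep
chi2 = ((counts - expected) ** 2 / expected).sum()

print(np.round(expected, 2))     # roughly [[17.67, 142.33], [18.33, 147.67]]
print(round(chi2, 3))            # roughly 0.221
```

The resulting statistic is small (about 0.22 on one degree of freedom), so this particular table shows little evidence against independence; the exact conditional tests developed in the remainder of Section 1.1 address the same question without relying on large-sample approximations.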