
Algebraic Statistics Day

Christiane Görgen: What is (algebraic) statistics?

August 11, 2017

Statistics is ‘the science of collecting, analysing and interpreting numerical data relating to an aggregate of individuals’¹. In very simplified terms, inferential statistics (as opposed to descriptive statistics) aims to infer properties of a general population from a subpopulation in order to describe observed phenomena and to predict future behaviour. This inferential process can very roughly be described as follows:

1. Design of experiment: Decide how to collect data such that it can be used to answer questions of interest. In particular, choose which measurements should be taken, how many, and in what way, such that bias can be avoided.
2. Data collection: Randomly sample data from a subpopulation of the population of interest according to the rules decided in the first step.
3. Exploratory data analysis: On the basis of the collected data, explore which type of models can be used to describe the observations: e.g. spatial models, networks, ...
4. Model selection: Find the statistical model which describes the data best. Use e.g. likelihood ratios, hypothesis testing (p-values!), maximum likelihood estimation, ...
5. Inference: Infer properties of the underlying population from the statistical model. Sensitivity analysis validates robustness of the results against modelling assumptions. Propagation algorithms infer how new observations can be embedded in the model. Causal inference investigates how a model behaves under intervention.
6. Presentation: Choose an appropriate way to present the obtained results.
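As a small illustration of step 4, the following Python sketch compares two models for coin-flip data with a likelihood ratio. It is a toy example under stated assumptions, not part of the talk: the function names and the counts (62 successes in 100 trials) are hypothetical, and the null model is assumed to be a fair coin.

```python
import math


def bernoulli_log_likelihood(p, successes, trials):
    """Log-likelihood of `successes` in `trials` independent Bernoulli(p) draws.

    Assumes 0 < p < 1 unless the corresponding count is zero.
    """
    failures = trials - successes
    # Convention 0 * log(0) = 0, so boundary estimates (p = 0 or 1) are handled.
    term_s = successes * math.log(p) if successes > 0 else 0.0
    term_f = failures * math.log(1 - p) if failures > 0 else 0.0
    return term_s + term_f


def likelihood_ratio_statistic(successes, trials, p_null=0.5):
    """Return the maximum likelihood estimate and the statistic 2 * (l_alt - l_null)."""
    p_hat = successes / trials  # maximum likelihood estimate in the larger model
    ll_null = bernoulli_log_likelihood(p_null, successes, trials)
    ll_alt = bernoulli_log_likelihood(p_hat, successes, trials)
    return p_hat, 2.0 * (ll_alt - ll_null)


# Hypothetical data: 62 successes in 100 trials.
p_hat, lr = likelihood_ratio_statistic(successes=62, trials=100)
print(p_hat, lr)
```

Under the usual large-sample approximation, the statistic is compared with a chi-square distribution with one degree of freedom, whose 95% quantile is roughly 3.84; larger values speak against the fair-coin model.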

Each of the steps above constitutes a field of research within statistics. Some of these are very applied (2., 6.), others are more inclined towards designing fast numerical algorithms (4.) or draw on probability theory (1., 4., 5.). Mathematical statistics is a term for the ‘pure’ branch of statistics which uses results from probability theory, functional analysis and linear algebra. These tools mainly come into play in the design of experiments, in model selection procedures, and in the development of inferential methods. Algebraic statistics is a relatively new term for the subbranch of mathematical statistics which employs tools from (computational) algebra and geometry, rather than more standard methods from the fields named above. This can be in order to describe statistical objects using alternative vocabulary² or in order to employ well-known techniques from other areas of mathematics to tackle statistical problems³.
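A standard textbook illustration of this 'alternative vocabulary', not taken from the talks themselves, is the independence model for two binary random variables. A 2x2 table of joint probabilities p factorises into its marginals exactly when the polynomial p00*p11 - p01*p10 vanishes. The Python sketch below checks this with exact rational arithmetic; the function names and the example tables are hypothetical.

```python
from fractions import Fraction


def independence_polynomial(p):
    """Evaluate p00*p11 - p01*p10 for a 2x2 table p of joint probabilities."""
    return p[0][0] * p[1][1] - p[0][1] * p[1][0]


def product_distribution(a, b):
    """Joint table of two independent binary variables with P(X=1)=a, P(Y=1)=b."""
    return [[(1 - a) * (1 - b), (1 - a) * b],
            [a * (1 - b), a * b]]


# A table built from independent marginals lies on the model ...
independent = product_distribution(Fraction(1, 3), Fraction(1, 4))
# ... while this hypothetical table with positive association does not.
dependent = [[Fraction(2, 5), Fraction(1, 10)],
             [Fraction(1, 10), Fraction(2, 5)]]

print(independence_polynomial(independent))  # prints 0
print(independence_polynomial(dependent))    # prints 3/20
```

In this picture the statistical model is the set of points in the probability simplex on which one polynomial vanishes, which is what makes tools from algebra and geometry applicable.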

¹ See The Oxford Dictionary of Statistical Terms (2006). For a distinction between statistics and data science, see e.g. Tukey (The Future of Data Analysis, Annals of Mathematical Statistics, 1962) and Friedman (The Role of Statistics in the Data Revolution?, International Statistical Review, 2001).
² See Eliana Duarte and Alessio D’Alì’s talks today on statistical models which ‘are’ toric varieties up to positivity constraints and sum-to-one conditions.
³ See Vlada Limic and Christian Lehn’s talks today on Gröbner bases and maximum likelihood estimation.

Max-Planck-Institute for Mathematics in the Sciences (MPI MiS), Algebraic Statistics Day

The name algebraic statistics was coined by the first books (Pistone et al., 2001; Pachter and Sturmfels, 2005) on the use of algebraic techniques in the design of experiments and in computational biology, respectively. The research area itself originated with two seminal papers in the mid-1990s (Pistone and Wynn, 1996; Diaconis and Sturmfels, 1998) and has since been a highly active field (Drton et al., 2009; Gibilisco et al., 2010; Zwiernik, 2016; Sullivant, 2017). Important advances include a geometrical description of statistical models with unobserved random variables using simplices and polyhedra (Mond et al., 2003); the classification of certain classes of models represented by graphs as toric varieties (Geiger et al., 2006); an application of results on tensor ranks to identify parameters in a statistical model (Allman et al., 2009); a classification of limiting distributions of likelihood ratio tests using singularity theory (Drton, 2009); an application of Alexander duality in specifying bases for certain types of experimental designs (Maruri-Aguilar et al., 2013); a characterisation of certain Gaussian graphical models as composite exponential families using group actions (Draisma et al., 2013); and the study of varieties arising from likelihood inference in mixtures of statistical models (Kubjas et al., 2015).

Some references

Allman, E. S., C. Matias, and J. A. Rhodes (2009). Identifiability of parameters in latent structure models with many observed variables. Ann. Statist. 37(6A), 3099–3132.
Diaconis, P. and B. Sturmfels (1998). Algebraic algorithms for sampling from conditional distributions. Ann. Statist. 26(1), 363–397.
Draisma, J., S. Kuhnt, and P. Zwiernik (2013). Groups acting on Gaussian graphical models. Ann. Statist. 41(4), 1944–1969.
Drton, M. (2009). Likelihood ratio tests and singularities. Ann. Statist. 37(2), 979–1012.
Drton, M., B. Sturmfels, and S. Sullivant (2009). Lectures on algebraic statistics, Volume 39 of Oberwolfach Seminars. Birkhäuser Verlag, Basel.
Geiger, D., C. Meek, and B. Sturmfels (2006). On the toric algebra of graphical models. Ann. Statist. 34(3), 1463–1492.
Gibilisco, P., E. Riccomagno, M. P. Rogantin, and H. P. Wynn (Eds.) (2010). Algebraic and geometric methods in statistics. Cambridge University Press, Cambridge.
Kubjas, K., E. Robeva, and B. Sturmfels (2015). Fixed points of the EM algorithm and nonnegative rank boundaries. Ann. Statist. 43(1), 422–461.
Maruri-Aguilar, H., E. Sáenz-de-Cabezón, and H. P. Wynn (2013). Alexander duality in experimental designs. Ann. Inst. Statist. Math. 65(4), 667–686.
Mond, D., J. Smith, and D. van Straten (2003). Stochastic factorizations, sandwiched simplices and the topology of the space of explanations. R. Soc. Lond. Proc. Ser. A Math. Phys. Eng. Sci. 459(2039), 2821–2845.
Pachter, L. and B. Sturmfels (2005). Algebraic statistics for computational biology. Cambridge University Press, New York.
Pistone, G., E. Riccomagno, and H. P. Wynn (2001). Algebraic Statistics, Volume 89 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, FL. Computational commutative algebra in statistics.
Pistone, G. and H. P. Wynn (1996). Generalised confounding with Gröbner bases. Biometrika 83(3), 653–666.
Sullivant, S. (2017). Algebraic Statistics. In preparation. http://www4.ncsu.edu/~smsulli2/Pubs/asbook.html.
Zwiernik, P. (2016). Semialgebraic statistics and latent tree models, Volume 146 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, FL.
