
The Pennsylvania State University The Graduate School

AN ALGEBRAIC PERSPECTIVE ON COMPUTING WITH DATA

A Dissertation in Mathematics by William Wright

© 2019 William Wright

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2019

The dissertation of William Wright was reviewed and approved∗ by the following:

Jason Morton Professor of Mathematics Dissertation Advisor, Chair of Committee

Vladimir Itskov Professor of Mathematics

Alexei Novikov Professor of Mathematics, Director of Graduate Studies

Aleksandra Slavkovic Professor of Statistics

∗Signatures are on file in the Graduate School.

Abstract

Historically, algebraic statistics has focused on the application of techniques from computational commutative algebra, combinatorics, and algebraic geometry to problems in statistics. In this dissertation, we emphasize how sheaves and monads are important tools for thinking about modern statistical computing. First, we explore how probabilistic computing necessitates thinking about random variables as tied to their family of extensions and ultimately reformulate this observation in the language of sheaf theory. We then turn our attention to the relationship between topos theory and the relational algebra of databases, showing how Codd’s original operations can be seen as constructions inside Set. Next, we discuss contextuality, the phenomenon whereby the value of a random variable depends on the other random variables observed simultaneously, and demonstrate how sheaves allow us to lift statistical concepts to contextual measurement scenarios. We then discuss a technique for hypothesis testing based on algebraic invariants whose asymptotic convergence properties do not rely on asymptotic normality of any estimator as they are defined as energy functionals on the observed data. Finally, we discuss the Giry monad and how its implementation would aid in analysis of data sets with missing data.

Contents

List of Figures x

Acknowledgments xi

Chapter 1 Introduction 1 1.1 Motivation & Background ...... 1 1.2 Contributions ...... 3 1.3 Summary ...... 5

Chapter 2 Background 7 2.1 Categories ...... 7 2.2 Functors & Categories of Functors ...... 14 2.2.1 Functors ...... 14 2.2.2 Natural Transformations ...... 16 2.2.3 Functor Categories ...... 17 2.2.4 The Yoneda Embedding ...... 17 2.3 Lattices & Heyting Algebras ...... 18 2.3.1 Lattices ...... 18 2.3.2 Heyting Algebras ...... 20 2.4 Monads ...... 21 2.5 Cartesian Closed Categories ...... 23 2.6 Topoi ...... 24 2.7 Presheaves ...... 27 2.7.1 The Category of Presheaves ...... 27 2.7.2 Initial and Terminal Objects ...... 28 2.7.3 Products and Coproducts ...... 28 2.7.4 Equalizers and Coequalizers ...... 29 2.7.5 Pullbacks and Pushouts ...... 30

2.7.6 Exponentials ...... 31 2.7.7 The Classifier ...... 32 2.7.7.1 ...... 32 2.7.7.2 The Subobject Classifier ...... 32 2.7.8 Local and Global Sections ...... 34 2.8 Sheaves ...... 34

Chapter 3 A Sheaf Theoretic Perspective on Higher Order Probabilistic Programming 38 3.1 The Categorical Structure of Measurable Spaces ...... 39 3.1.1 Non-Existence of Exponentials ...... 40 3.1.2 Lack of Subobject Classifier ...... 42 3.2 The Giry Monad ...... 46 3.2.1 The Endofunctor G ...... 46 3.2.2 The Natural Transformation η ...... 46 3.2.3 The Natural Transformation µ ...... 47 3.2.4 The Kleisli Category of the Giry Monad ...... 47 3.2.5 Simple Facts About the Giry Monad ...... 48 3.3 The Cartesian Closed Category of Quasi-Borel Spaces ...... 49 3.3.1 Quasi-Borel Spaces ...... 49 3.3.2 Cartesian Closure of QBS...... 51 3.3.3 The Giry Monad on the Category of Quasi-Borel Spaces . . 51 3.3.4 De Finetti Theorem for Quasi-Borel Spaces ...... 52 3.4 Standard Borel Spaces ...... 53 3.5 Quasi-Borel Sheaves ...... 55 3.5.1 Sample Space Category ...... 55 3.5.2 Quasi-Borel Presheaves ...... 57 3.5.3 Quasi-Borel Sheaves ...... 57 3.5.4 Lifting Measures Lemma ...... 60 3.6 Probability Theory for Quasi-Borel Sheaves ...... 62 3.6.1 Events ...... 62 3.6.2 Global Sections, Local Sections, and Subsheaves ...... 63 3.6.3 Expectation as a Sheaf ...... 64 3.7 Future Work ...... 65 3.7.1 Probabilistic Programming and Simulation of Stochastic Processes ...... 65 3.7.2 Categorical Logic and Probabilistic Reasoning ...... 66 3.7.3 Sample Space Category and the Topos Structure ...... 66 3.7.4 Extension of the Giry Monad ...... 67

Chapter 4 Categorical Logic and Relational Databases 68 4.1 Introduction ...... 68 4.2 Data Tables ...... 69 4.2.1 Attributes ...... 71 4.2.2 Attribute Spaces (Data Types) ...... 71 4.2.3 Missing Data ...... 72 4.2.4 Data Types ...... 73 4.2.5 Column Spaces, Tuples, and Tables ...... 74 4.2.5.1 Column Spaces ...... 74 4.2.5.2 Records ...... 74 4.2.5.3 Tables ...... 74 4.2.6 Primary Keys ...... 76 4.2.7 Versioning ...... 76 4.3 Relational Algebra on Tables ...... 77 4.3.1 Products ...... 77 4.3.2 Projection ...... 78 4.3.3 Union ...... 78 4.3.4 Selection ...... 78 4.3.5 Difference ...... 80 4.4 Some Additional Operations on Tables ...... 81 4.4.1 Addition & Deletion ...... 81 4.4.2 Editing Records ...... 81 4.4.2.1 Rename ...... 82 4.4.2.2 Imputation ...... 82 4.4.3 Merging Overlapping Records ...... 83 4.4.3.1 Table ...... 83 4.4.4 Non-Binary Logics ...... 84 4.5 Random Tables and Random Databases ...... 84 4.5.1 Random Tables ...... 85 4.5.2 Giry Monad Applied to Tables ...... 85 4.5.3 Random Databases ...... 85 4.6 Topological Aspects of Databases ...... 87 4.6.1 Simplicial Complex Associated to a Database ...... 87 4.6.2 Contextuality ...... 87 4.6.3 on a Database ...... 89 4.7 Relationship Between Topological Structure of a Schema and Contextuality ...... 90

Chapter 5 Contextual Statistics 91 5.1 Introduction ...... 91 5.2 The Bell Marginals ...... 93 5.3 Skip-NA and Directed Graphical Models ...... 99 5.4 Motivation from Statistical Privacy ...... 103 5.5 Poset of Joins of a Database ...... 103 5.5.1 Contextual Constraint Satisfaction Problems ...... 104 5.5.2 Poset of Solutions to Contextual Constraint Satisfaction Problems ...... 105 5.6 Topology of a Database Schema ...... 108 5.6.1 Contextual Topology on a Database Schema ...... 109 5.7 Sheaves on Databases ...... 113 5.7.1 Presheaf of Data Types ...... 113 5.7.2 Presheaf of Classical Tables of a Fixed Size ...... 114 5.7.3 Sheaf of Counts on Contextual Tables ...... 114 5.7.4 Presheaf of Classical Probability Measures ...... 117 5.7.5 Sheaf of Outcome Spaces ...... 117 5.7.6 Contextual Random Variables ...... 118 5.7.7 Sheaf of Parameters ...... 119 5.7.8 Sheaf of Contextual Probability Measures ...... 120 5.8 Statistical Models on Contextual Sheaves ...... 121 5.8.1 Contextual Statistical Models ...... 122 5.8.2 Factors ...... 125 5.8.3 Classical Snapshots of a Factor ...... 126 5.9 Subobject Classifier for Contextual Sheaves ...... 127 5.10 Local and Global Sections of a Contextual Sheaf ...... 128 5.11 Fitting Contextual Models ...... 130 5.11.1 Maximum Likelihood Estimation for the Saturated Contextual Model ...... 130 5.11.2 Classical Approximation of a Contextual Distribution . . . . 132 5.12 Contextual Hypothesis Testing ...... 134 5.12.1 Testing if Observed Marginals are Drawn from the Same Distribution ...... 134 5.12.2 Testing if a Collection of Tables can be Explained Classically 135 5.12.3 A Hypothesis Test for Contextuality ...... 136 5.13 Future Work ...... 136 5.13.1 Contextuality Penalization ...... 136 5.13.2 Sampling for Contextual Probability Distributions ...... 137

Chapter 6 Algebraic Hypothesis Testing 139 6.1 Introduction ...... 139 6.2 From model to invariants ...... 140 6.3 Constructing an inner product from invariants ...... 142 6.4 Asymptotic Properties of ⟨ψ| H |ψ⟩ ...... 143 6.5 Quadratic Forms of Multivariate Normal Distributions ...... 144 6.6 Estimation of Parameters for the Asymptotic Distribution . . . . . 146 6.6.1 Using an MLE ...... 146 6.6.2 Using Normalized Count Data ...... 147 6.7 The Independence Model for a 2 × 2 Contingency Table ...... 147 6.8 Behavior of Statistic on the Boundary of the Probability Simplex . 150 6.9 A Test for the Rank of a Contingency Table ...... 150 6.10 Simulation Techniques and Results ...... 151 6.11 Future Work ...... 156 6.11.1 Application to Mixture Models ...... 156 6.11.2 Application To Restricted Boltzmann Machines ...... 157

Chapter 7 A Monadic Approach to Missing and Conflicting Data 159 7.1 Pullbacks, Maximum Entropy Distributions, and Independence Joins 160 7.2 Merging Conflicting Tables ...... 162 7.3 Imputing Missing Data with Giry Tables ...... 165 7.3.1 Imputing by Empirical Probability Measure of a Column . . 166 7.3.2 Lifting Statistics to the Giry Monad ...... 168 7.3.3 A Simple Example of Giry Imputation ...... 169 7.4 Future Work ...... 171 7.4.1 Implementation of Giry Tables ...... 171 7.4.2 The Giry Monad and Contextuality ...... 172 7.4.3 Generalizations of Interval Time Models ...... 174

Appendix A Supplemental Code for Chapter 5 176 A.1 Introduction ...... 176 A.2 Code ...... 176 A.2.1 CSP for All Bell Marginals ...... 176 A.2.2 CSPs Involving Three Bell Marginals ...... 180 A.2.3 CSPs Involving Two Bell Marginals ...... 184

Appendix B Code for Producing Figures in Chapter 7 188 B.1 Introduction ...... 188 B.2 Invariants vs. Chi-Squared 2 × 2 Case ...... 188 B.3 P-values on a Degenerate Distribution in the Binary 4-Cycle Model 192 B.4 Tables of Percentage Deviation from Significance Level ...... 204 B.4.1 Noise Parameter ε = 0.1 ...... 204 B.4.2 Noise Parameter ε = 0.01 ...... 206 B.4.3 Noise Parameter ε = 0.001 ...... 207

Bibliography 210

List of Figures

6.1 A scatterplot showing values of the survival function of the invariant based quadratic form vs. the chi-squared distribution for samples drawn from a uniform distribution...... 152

6.2 A scatterplot showing values of the survival function of the invariant based quadratic form vs. the chi-squared distribution for samples drawn from the distribution (q00, q01, q10, q11) = (0.1, 0.3, 0.2, 0.4). . . 153

6.3 A scatterplot showing p-values computed for the perturbed degenerate distribution on the binary 4-cycle comparing the likelihood ratio test vs. the survival function of the invariants based quadratic form computed via Imhof’s method...... 154

6.4 A scatterplot showing p-values computed for the perturbed degenerate distribution on the binary 4-cycle comparing the chi-squared test vs. the survival function of the invariants based quadratic form computed via Davies’ method...... 155

Acknowledgments

I would like to thank my advisor Jason Morton for all the time he has invested in this project. I would also like to thank the other members of my committee: Vladimir Itskov, Alexei Novikov, and Aleksandra Slavkovic for agreeing to be on my committee and investing their time in this project. I would also like to thank Jared Culbertson and Roman Ilin for their supervision while interning at AFRL. I would also like to thank Kirk Sturtz and Benjamin Robinson for many helpful conversations on the subject of applied category theory. Additionally, I would like to thank Manfred Denker for his early encouragement to explore some of the unconventional ideas in this dissertation. I would also like to thank Becky Halpenny and Allyson Borger for all their help throughout my time at Penn State. I would also like to thank Bojana Radja and Cheryl Huff for their encouragement and support. Lastly, I would like to thank my family and friends for all their support over the years, especially my father, Clifton. This material is based upon work supported by the Air Force Office of Scientific Research under Award No. FA9550-16-1-0300. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the Air Force Office of Scientific Research.

Chapter 1 | Introduction

1.1 Motivation & Background

Traditionally, algebraic statistics has focused on applying techniques from algebraic geometry, commutative algebra, and combinatorics to statistical problems. A major focus of this dissertation is to demonstrate that several new tools, namely sheaf theory and monads, can be useful for thinking about the types of statistical questions arising out of the needs of modern statistical computing. Traditionally, statistical theory has focused on the case where data is tabulated in a single table in a tidy manner. However, many data sets contain multiple tables with overlapping columns, missing entries, and stale records. The application of category theory to probability is not a new idea. The first appearance in the literature appears to be a paper by Michele Giry [59], published in 1980. In this paper, Giry constructs an endofunctor associating a measurable space to its collection of probability distributions, shows that this endofunctor can be given the structure of a monad, and studies the Kleisli category associated to this monad. Following Giry’s work, there appears to be very little subsequent work in the area until 2006 when Doberkat worked out the Eilenberg-Moore algebras for the Giry monad restricted to Polish spaces [34]. Culbertson and Sturtz developed a categorical framework for Bayesian probability in 2013 [27]. Some mathematicians have argued that probability should be rethought foundationally. Mumford has argued that probability should be rethought in a way where the random variable is taken as a primitive concept [103]. Gromov has argued that category theory should give us insights into how such a re-imagining of the foundation could be

achieved [61]. The potential of category theory to give insight into statistics is not new. McCullagh noticed that categories and natural transformations provide a natural way to express the intuitive concept of requiring a statistical model to admit certain natural extensions depending on the domain of inference [97]. In this dissertation, we use category theory as a useful language for organizing and expressing statistical concepts on versioned databases with missing data. In particular, sheaf theory is a natural framework for thinking about local-to-global phenomena. We propose that sheaves provide a natural way of thinking about distributed or contextually dependent statistical questions. Moreover, Tao observes that the concepts that are probabilistically meaningful are those which are invariant under surjective measure preserving maps [136]. Tao also observes that notions like independence are invariant under such maps while constructions such as which elements of the sample space map to which elements of the outcome space do not satisfy this invariance. This suggests a random variable should be identified with its entire sieve of extensions. In this dissertation we realize this construction by viewing a random variable as a construction involving quasi-Borel sheaves on the sieve of extensions of a sample space in chapter three. McCullagh argues that it is not possible to perform inference with a statistical model unless the model can be extended to the domain for which inference is required [97]. This concept of requiring statistical models to admit natural extensions suggests a rejection of the notion of a fixed universe of sets in favor of some notion of variable structure. However, this is precisely what the topos-theoretic perspective is intended to provide. To quote Lawvere, “Every notion of constancy is relative, being derived perceptually or conceptually as a limiting case of variation, and the undisputed value of such notions in clarifying variation is always limited by that origin. This applies in particular to the notion of a constant set, and explains so much of why naive set theory carries over in some form to the theory of variable sets.” [78] A major goal of this dissertation is to demonstrate how the ideas behind topos theory and sheaf theory can clarify the statistical structure of modern complex data sets by providing a language which can naturally handle the variability of these structures.

1.2 Contributions

In chapter three, the main result is extending a construction due to Heunen, Kammar, Staton, and Yang to a presheaf construction on a category of sample spaces (Definition 88) and showing this extension is in fact a sheaf with respect to the atomic Grothendieck topology (Lemma 91). We also realize expectation as a sheaf morphism (Section 3.6.3) and discuss some structural properties of these new objects as they relate to foundational concepts in probability theory (Section 3.6.1, Section 3.6.2). Along the way, we characterize several sub-classes of monic arrows in the category of measurable spaces (Proposition 77) and show that Meas does not admit a subobject classifier (Lemma 76). This provides an alternative proof that Meas is not a topos, a fact already well known due to Aumann [10], who showed that the category of standard Borel spaces is not Cartesian closed. We also prove a simple lemma about lifting probability measures along surjective maps (Section 7.3.2). In chapter four, we discuss a categorical perspective on bag models common in the relational database literature. By emphasizing how these models can be seen as constructions within the underlying category of sets, we create a framework which more easily generalizes to situations, as in SQL, where the underlying logic is not a Heyting algebra. Such situations are impossible in purely topos theoretic models as the internal logic is always a Heyting algebra. We also define a simplicial structure (Section 4.6.3) and graph associated to a database schema and prove a result relating properties of this graph to whether or not agreement on marginal tables is sufficient to ensure that the marginals can arise from a joint distribution on the full outcome space. In particular, we provide sufficient conditions for a joint table on the full column set to exist (Lemma 98, Proposition 101). This result is foundational to the next chapter where we attach an additional topological structure to this simplicial complex and use it to weaken the common assumption in statistics that the family of marginals under consideration arises as projections of a joint distribution on the full column space. In chapter five, our major contributions are the use of an appropriate topological structure to sheaf-theoretically lift standard statistical constructions to families of marginals with some overlapping constraints. More precisely, this includes

the introduction of a poset structure on the collection of constraint satisfaction problems (Section 5.5.2) which allows us to select an appropriate topology based on the shared columns of the tables constituting a database (Section 5.6.1). Using this topology, we see how to express various statistical concepts as sheaves or presheaves with respect to this topology (Section 5.7). This allows us to define the notion of contextual random variables (Section 5.7.6) and to define statistical models in terms of sheaf morphisms (Section 5.8.1). We also introduce the distinction between classical and contextual factors (Definition 127) and the notion of classical snapshot to handle classical approximations to globally irreconcilable marginals. We discuss a pseudo-likelihood approach to extending maximum likelihood estimation based on the realization of contextual random variables as subsets of an equalizer (Section 5.11.1) and provide a test for whether or not marginal distributions can arise from a joint distribution on the full column set (Section 5.12.2). This last result is similar to a result due to Abramsky, Barbosa, and Mansfield based on sheaf cohomology which allows the user to detect contextuality [3]. By combining our results with the construction in chapter six, we can provide a goodness-of-fit measure for contextuality rather than a simple detection of contextuality. In chapter six, our major contributions are constructing an energy statistic based on the invariants of an algebraic statistical model and proving its asymptotic consistency under the null hypothesis. This construction is interesting because its asymptotic properties do not rely on the asymptotic normality of an estimator since it can be computed from empirical frequencies. Thus, this construction provides an alternative technique for computing goodness-of-fit in situations where standard asymptotic theory breaks down such as on boundary points of the probability simplex or near singularities of a statistical model. We demonstrate this improved performance near a singularity of the binary 4-cycle undirected graphical model by benchmarking it against the likelihood ratio and chi-squared test in a simulation. In chapter seven, the main result is a lemma establishing that measurable statistics lift to the Giry monad and the use of the Giry monad to combine conflicting data in a way that does not destroy information about the conflicting records. This construction is potentially useful in statistical decision making situations where we would like to design systems which select more conservative actions in the presence of conflict such as in target recognition in sensor networks. This chapter is more speculative than the remaining chapters and is intended to

explore how implementation of the Giry monad could be beneficial for statistical computing.

1.3 Summary

The contents of the remainder of this dissertation are as follows. Chapter two provides background information collecting many basic definitions from category theory. This chapter is not intended as a complete introduction to the subject but rather provides a list of definitions used elsewhere in the dissertation and can be used as a reference whenever these concepts are used in subsequent chapters. The topics treated include the basic definitions and properties of categories along with the basic definitions and properties of functor categories. We discuss the important Yoneda lemma as it is used several times throughout the dissertation. We also discuss the basics of lattice theory and Heyting algebras along with Cartesian closed categories and topoi. The most important concepts introduced in this chapter are monads and sheaves which are used several times throughout this dissertation. In chapter three, we examine how category theory gives us insight into the semantics of probabilistic programming languages by demonstrating how the need for higher order functions requires us to step outside the standard bounds of measure theory. We first deconstruct the ways in which the category of measurable spaces is an inadequate framework for higher order probabilistic programming. We then discuss restricting attention to an appropriately well-behaved subset of measurable spaces, namely the standard Borel spaces. We review the recent theory of Quasi-Borel spaces developed by Heunen, Kammar, Staton, and Yang and discuss the importance of sample space extensions to probabilistic programming. We then lift their definition to a sheaf-theoretic one in order to naturally incorporate such extensions into their model of higher order probabilistic programming. In chapter four, we examine databases using the lens of topos theory. We construct a simple model of tables within the category of sets that is simple enough to express the standard operations of relational algebra along with some other common operations for table manipulations. This largely sets a notation and definition of tables to be used in subsequent chapters. We emphasize constructions which can also be performed inside the category of standard Borel spaces as subsequent work will focus on adapting random variables to databases with global

inconsistency. Finally, we discuss how a database schema creates an abstract simplicial complex showing the interconnections between tables in the database. This observation is the jumping-off point for chapter five where we analyze statistical techniques in the presence of contextuality. In chapter five, we examine how sheaves on a topology associated to an abstract simplicial complex associated to a database can be used to lift statistical concepts to the realm of databases containing marginal distributions which are globally irreconcilable. We also study the problem of attempting to reconstruct tables from marginal tables and some of the computational issues that arise in these computations. A major theme of this chapter is that using the language of sheaf theory allows us to extend results globally by constructing a classical approximation of the usual concept along the basis. The nuances of adapting statistics to this regime are explored along with some computational issues. In chapter six, we discuss a technique for algebraic hypothesis testing based on algebraic invariants of a statistical model. We derive an asymptotic distribution for an energy functional of a statistic computed from the invariants. The asymptotic theory of this statistic does not rely on asymptotic normality and so provides a method robust to model singularities and boundary points. We discuss potential applications of this technique to mixture models and Restricted Boltzmann machines. We conclude by simulating the statistic and benchmarking it against known techniques. We consider small perturbations of a degenerate binary four-cycle and see the invariant-based statistic outperforms standard techniques for this particular example. In chapter seven, we examine the relationship between the Giry monad and multiple imputation and discuss how implementation of the Giry monad could be useful for statistical computing. We see that implementing the Giry monad allows us to preserve information about conflicting measurements when compared to using a point estimate to resolve conflicts between different tables or computational agents. The main result is a construction which allows us to lift all the common statistics used in practice to Giry monads. We also discuss how its implementation would facilitate the handling of multiple imputation techniques in other parts of the inference pipeline. A simple example involving the k-nearest neighbor technique in machine learning indicates how this implementation leads to different calculations which will agree with original techniques for completely observed data.

Chapter 2 | Background

This chapter provides background information and terminology for this dissertation. We cover the basic language of categories, functors, monads, Cartesian closed categories, topoi, presheaves and sheaves (both on ordinary topological spaces and on Grothendieck topologies). These constructions will be used many times throughout the thesis. References are provided to more detailed treatments of these topics.

2.1 Categories

Categories are the spaces where mathematical objects live. Intuitively, we have a collection of objects and morphisms relating these objects which are composable and have identities. It is tempting to think of these as ’sets’ and ’functions’ respectively, but these are not the only type of category as we will see later. In this section, we review some elementary definitions in category theory. More detailed treatments of these topics can be found in [15,18,93].

Definition 1. A category C consists of a collection of objects, denoted Ob (C) , along with a collection of morphisms for each C,D ∈ Ob (C), denoted Mor (C,D) which satisfy the following conditions:

• For all C,D,E in Ob (C), there is a composition

◦ : Mor (C,D) × Mor (D,E) → Mor (C,E)

which is associative.

• For all C ∈ Ob (C), there is an identity, 1C : C → C such that for any

f : C → D and any g : E → C, we have f = f ◦ 1C and g = 1C ◦ g.

By a commutative diagram, we mean that any two paths with the same source and target yield the same composite. In the diagram below, we represent the axioms for the identity morphism as a commutative diagram. For this diagram, it suffices to check that the left triangle and the right triangle are commutative. From the commutativity of these two triangles, we can deduce the rest. This is a general feature of such diagrams.

[Diagram: A → B via f, B → B via 1B, and B → C via g; the left triangle expresses 1B ◦ f = f and the right triangle expresses g ◦ 1B = g.]

Categories are ubiquitous in mathematics. Here are a few basic examples.

Example 2. The category of sets has a collection of sets as its objects and the morphisms are given by functions.

Example 3. The category of groups has the collection of groups as its objects and the morphisms are given by homomorphisms.

Example 4. There is a category, Meas, of measurable spaces. The objects are measurable spaces and the morphisms are given by measurable mappings.

All the above examples are so called concrete categories which consist of objects which are sets (possibly equipped with additional structure) and morphisms given by functions (which preserve that structure). These are not the only types of categories.

Example 5. Any poset, (P, ≤), can be viewed as a category in the following way. The objects are given by elements of the poset and we define x → y if and only if x ≤ y.

Example 6. A similar construction could be used to view any topological space (X, τ) as a category. The objects are given by open sets and the morphisms are

given by U → V if and only if U ⊂ V . We will use the notation Xτ to denote such a category.
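For readers coming from functional programming, Definition 1 can be made concrete with a small Haskell sketch. This is purely illustrative and not part of the original text; the class name Category and the wrapper Fn are our own choices. The compiler cannot enforce the identity and associativity axioms, which remain promises the instance must keep.

```haskell
import Prelude hiding (id, (.))

-- Objects are types; a morphism from a to b is a value of type `k a b`.
class Category k where
  id  :: k a a                    -- the identity morphism 1_C
  (.) :: k b c -> k a b -> k a c  -- composition, required to be associative

-- Example 2 in this notation: sets and functions, modelled by Haskell functions.
newtype Fn a b = Fn { runFn :: a -> b }

instance Category Fn where
  id          = Fn (\x -> x)
  Fn g . Fn f = Fn (\x -> g (f x))
```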

The main type of category that we will be concerned with in this dissertation is the Cartesian closed category. Before explaining what a Cartesian closed category is, we need a few definitions.

Definition 7. Let C be a category. An object, T , of C is said to be a terminal object if for every object C in C, there is a unique morphism !: C → T .

In Set, a terminal object is given by any singleton set T = {∗}. This also works for Meas. A poset category admits a terminal object if and only if it has a top element. In a category associated to a topological space, Xτ, the terminal object is the underlying set, X. Terminal objects, when they exist, are unique up to unique isomorphism. The dual notion to a terminal object is an initial object.

Definition 8. Given a category C, an initial object is an object 0 ∈ C such that for each C ∈ C, there is a unique morphism ! : 0 → C.

In the category Set or Meas, this object is the empty set (equipped with the empty sigma algebra in Meas). For a poset, this is a bottom element and for a topological space this is again the empty set. Products are the categorical generalization of the Cartesian product of sets. Products in an abstract category are defined by a universal property.

Definition 9. Let C be a category. C admits products if given any two objects

X and Y in C, there exists a third object X × Y and a pair of morphisms pX : X × Y → X, pY : X × Y → Y such that given any other object, Z, and morphisms f : Z → X, g : Z → Y, there exists a unique morphism f × g : Z → X × Y such that the following diagram commutes:

[Diagram: the two triangles Z → X × Y → X and Z → X × Y → Y commute, i.e. pX ◦ (f × g) = f and pY ◦ (f × g) = g.]

In the category of sets, the product is given by the usual Cartesian product of sets. In the category of measurable spaces, you can take the Cartesian product of the underlying sets with the product sigma algebra. In the category associated to

a topological space, Xτ , the product is given by the intersection of open sets. It is also true that products are unique up to unique isomorphism. Coproducts are the dual notion to products.

Definition 10. We say that C admits coproducts if given any two objects X and Y there exists a third object X ∐ Y and a pair of morphisms iX : X → X ∐ Y, iY : Y → X ∐ Y such that given any other object Z and morphisms f : X → Z, g : Y → Z, there exists a unique morphism f ∐ g : X ∐ Y → Z such that the following diagram commutes:

[Diagram: the two triangles X → X ∐ Y → Z and Y → X ∐ Y → Z commute, i.e. (f ∐ g) ◦ iX = f and (f ∐ g) ◦ iY = g.]

In Set, the coproduct is given by the disjoint union of two sets. For the category associated to a topological space, Xτ, the coproduct of two open sets is given by their union. For a lattice the coproduct is given by the join operation. In Meas, X ∐ Y can be constructed as follows. The underlying set is simply the disjoint union of X × {0} and Y × {1}. We can generate a sigma algebra on X ∐ Y by taking the smallest sigma algebra containing sets of the form BX × {0} and BY × {1} where BX is a measurable subset of X and BY is a measurable subset of Y. This observation will be important later in the dissertation when we augment outcome spaces to allow for the possibility of missing data. In this dissertation we only focus on this construction for standard Borel spaces. A more detailed construction is provided in [130].

Definition 11. A morphism i : Ef,g → X in C is an equalizer for a pair of morphisms f, g : X → Y if f ◦ i = g ◦ i and given any morphism h : Z → X such that f ◦ h = g ◦ h, there exists a unique morphism k : Z → Ef,g such that i ◦ k = h, i.e. the following diagram commutes:

[Diagram: Ef,g → X ⇉ Y, with i followed by the parallel pair f, g; any h : Z → X equalizing f and g factors uniquely through i via k : Z → Ef,g.]

In Set, the equalizer of f, g is simply the subset of X defined by Ef,g = {x ∈ X : f (x) = g (x)}. The dual notion to the equalizer is the coequalizer.
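As a concrete, purely illustrative rendering of this description, the equalizer of two maps between finite sets can be computed by filtering. The helper below is a sketch under the assumption that the set X is given as a list; the name equalizer is ours and is not used elsewhere in the dissertation.

```haskell
-- E_{f,g} = { x in X : f(x) = g(x) }, for a finite set X represented as a list.
equalizer :: Eq b => (a -> b) -> (a -> b) -> [a] -> [a]
equalizer f g = filter (\x -> f x == g x)

-- e.g. equalizer (`mod` 2) (`mod` 3) [0..12] == [0,1,6,7,12]
```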

Definition 12. A morphism q : Y → Ef,g in C is a coequalizer for a pair of morphisms f, g : X → Y if q ◦ f = q ◦ g and given any morphism h : Y → Z with h ◦ f = h ◦ g, there exists a unique morphism k : Ef,g → Z such that k ◦ q = h, i.e. the following diagram commutes:

[Diagram: X ⇉ Y → Ef,g, with the parallel pair f, g followed by q; any h : Y → Z coequalizing f and g factors uniquely through q via k : Ef,g → Z.]

In Set, the coequalizer of f, g is the quotient of Y by the minimal equivalence relation ∼ such that f (x) ∼ g (x) for each x ∈ X.

Definition 13. A pullback for morphisms f : X → Z and g : Y → Z is an object W together with a pair of morphisms h : W → X, k : W → Y with f ◦ h = g ◦ k, satisfying the universal property that for any morphisms i : V → X, j : V → Y with f ◦ i = g ◦ j, there exists a unique ℓ : V → W such that h ◦ ℓ = i and k ◦ ℓ = j, i.e. the following diagram commutes:

[Diagram: the square W → X → Z, W → Y → Z commutes, and any V with morphisms i and j making the outer square commute factors uniquely through W via ℓ.]

In the category of sets we can take

W = X ×Z Y = {(x, y) ∈ X × Y | f (x) = g (y)} ,

with h and k the two coordinate projections.
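For finite sets this description can be spelled out directly. The following Haskell fragment is an illustrative sketch (the name pullback is ours), with the two projection morphisms given by fst and snd.

```haskell
-- X x_Z Y = { (x, y) : f(x) = g(y) }, for finite sets represented as lists.
pullback :: Eq z => (x -> z) -> (y -> z) -> [x] -> [y] -> [(x, y)]
pullback f g xs ys = [ (x, y) | x <- xs, y <- ys, f x == g y ]

-- e.g. pullback (`mod` 3) (`mod` 3) [0..2] [3..5] == [(0,3),(1,4),(2,5)]
```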

Definition 14. A pushout for morphisms f : X → Y and g : X → Z is an object W together with a pair of morphisms h : Y → W, k : Z → W with h ◦ f = k ◦ g, satisfying the universal property that for any morphisms i : Y → V and j : Z → V with i ◦ f = j ◦ g, there exists a unique morphism ℓ : W → V such that the following diagram commutes:

[Diagram: the square X → Y → W, X → Z → W commutes, and any V with morphisms i : Y → V and j : Z → V making the outer square commute receives a unique ℓ : W → V.]

In the category of sets the pushout is given by W = (Z ∐ Y) / ∼ where ∼ is the finest equivalence relation such that f (x) ∼ g (x) for every x ∈ X. The constructions we have seen thus far are all examples of more general types of limits or colimits. In order to define these notions, we first need to give a rigorous definition of a diagram.

Definition 15. A diagram of shape J in C is a functor from J to C. The category J is referred to as the index category.

Example 16. Let J be the category 0 ⇒ 1 where only the non-identity morphisms are drawn. A diagram on J just picks out two parallel arrows

f, g : X ⇉ Y.

When discussing diagrams, it is common to leave out explicit mention of the index category J and to simply depict the image under the functor as we have done in all of the diagrams used thus far in this section.

Now that we have defined diagrams we can define the notion of a cone on a diagram.

Definition 17. Let F : J → C be a diagram. A cone to F is an object N in C along with a family ψX : N → F (X) of morphisms indexed by the objects X of J such that for every morphism f : X → Y in J, we have F (f) ◦ ψX = ψY .

A limit is simply a cone that is universal in the sense that any other cone must factor uniquely through it. This is made precise with the following definition:

Definition 18. A limit of the diagram F : J → C is a cone (L, φ) which is universal in the sense that for any other cone (N, ψ) to F there is a unique morphism u : N → L such that φX ◦ u = ψX for all X in J.

[Diagram: the unique u : N → L satisfies φX ◦ u = ψX and φY ◦ u = ψY, and the legs of both cones commute with F (f) : F (X) → F (Y).]

Example 19. Products, equalizers, terminal objects, and pullbacks are all examples of limits.

Dual to the notion of limit is the colimit.

Definition 20. A co-cone of a diagram F : J → C is an object N of C along with a family ψX : F (X) → N of morphisms indexed by the objects X of J such that for every morphism f : X → Y in J, we have ψY ◦ F (f) = ψX .

Similarly to limits being defined as universal cones, we can define colimits as universal co-cones.

Definition 21. A colimit of the diagram F : J → C is a co-cone (L, φ) which is universal in the sense that for any other co-cone (N, ψ) to F there is a unique morphism u : L → N such that u ◦ φX = ψX for all X in J.

[Diagram: the legs of both co-cones commute with F (f) : F (X) → F (Y), and the unique u : L → N satisfies u ◦ φX = ψX and u ◦ φY = ψY.]

Example 22. Coproducts, coequalizers, pushouts, and initial objects are all ex- amples of colimits.

2.2 Functors & Categories of Functors

In this section, we collect basic results and terminology about functors and categories of functors. More detailed treatments can be found in [18,93,94].

2.2.1 Functors

Functors are the mappings between categories. There are two types of functors: covariant functors and contravariant functors.

Definition 23. Let C and D be categories. A covariant functor is a mapping F : C → D which assigns to each object C in C, an object F (C) in D and to each morphism f : C → C0 inside C, a morphism F (f) : F (C) → F (C0) inside D in such a way that F (idC ) = idF (C) and F (f ◦ g) = F (f) ◦ F (g).

Example 24. There is a functor P : Set → Set which maps each set X to its power set P (X), and each set function f : X → Y is sent to the mapping P (f): P (X) → P (Y) that maps each subset S ⊂ X to its image f (S) ⊂ Y.
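Restricted to finite sets, Example 24 can be written out in Haskell using Data.Set. This is a hedged illustration (powerSet and pMap are our names, not constructions from the dissertation), but it shows the functor acting on both objects and morphisms.

```haskell
import qualified Data.Set as Set
import           Data.Set (Set)

-- P on objects: the power set of a finite set.
powerSet :: Ord a => Set a -> Set (Set a)
powerSet = Set.foldr addElem (Set.singleton Set.empty)
  where addElem x subsets = Set.union subsets (Set.map (Set.insert x) subsets)

-- P on a morphism f : X -> Y: the direct-image map S |-> f(S).
pMap :: (Ord a, Ord b) => (a -> b) -> Set (Set a) -> Set (Set b)
pMap f = Set.map (Set.map f)
```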

Example 25. Let Group denote the category of groups. There is a functor U : Group → Set which associates to each group its underlying set and takes each group homomorphism to its underlying set map. This functor is called a forgetful functor because it simply ’forgets’ the additional group structure and homomorphism structure.

Example 26. (Hom-Functor) Let C be a category and C be an object in C. We can define a functor hC := HomC (C, −) : C → Set which takes each object X to the set of all C-morphisms from C to X, C (C, X). Each morphism f : X → Y is sent to HomC (C, f) : HomC (C, X) → HomC (C, Y), defined by post-composition with f, i.e. g ↦ f ◦ g.

The other type of functors are called contravariant functors. These are defined similarly except that they reverse the direction of morphisms.

Definition 27. Let C and D be categories. A contravariant functor F : C → D assigns to each object C in C an object F (C) in D; however, it assigns to each morphism f : C → C0 a morphism F (f) : F (C0) → F (C) in such a way that

F (idC ) = idF (C) and F (f ◦ g) = F (g) ◦ F (f).

Example 28. There is a contravariant functor P′ : Set → Set that associates to each set X its power set P′ (X); however, each function f : X → Y is mapped to its inverse image map P′ (f) : P′ (Y) → P′ (X) which maps each subset S ⊂ Y to its pre-image f⁻¹ (S) ⊂ X.
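The contravariant counterpart of the previous sketch takes preimages. Because preimages are computed pointwise, the (finite) domain X must be carried along explicitly; the helper name preimage below is ours and the fragment is only illustrative.

```haskell
import qualified Data.Set as Set
import           Data.Set (Set)

-- P'(f) : P(Y) -> P(X), sending S to its preimage under f, for a finite domain X.
preimage :: (Ord a, Ord b) => [a] -> (a -> b) -> Set b -> Set a
preimage domainX f s = Set.fromList [ x | x <- domainX, f x `Set.member` s ]

-- e.g. preimage [0..9] (`mod` 3) (Set.fromList [0]) == Set.fromList [0,3,6,9]
```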

Example 29. There is also a contravariant version of the Hom functor. Let D be an object in C. Then we can define a contravariant functor, Hom (−, D) : C → Set, which assigns to each object X in C the set Hom (X, D) of morphisms from X to D and to each morphism f : X → Y assigns a set function Hom (f, D) : Hom (Y, D) → Hom (X, D) defined by pre-composition with f, i.e. g ↦ g ◦ f.

We now collect a sequence of basic definitions for certain properties that are important for functors. For a more detailed exposition and examples, refer to MacLane.

Definition 30. A functor F : C → D is faithful if for all objects X and Y in C the induced map F : C (X,Y ) → D (X,Y ) is injective.

Definition 31. A functor F : C → D is full if for all objects X and Y in C the induced map F : C (X,Y ) → D (X,Y ) is surjective.

Definition 32. A functor F : C → D is fully faithful if for all objects X and Y in C the induced map F : C (X,Y ) → D (X,Y ) is bijective.

Definition 33. A functor F : C → D is essentially surjective if each object D in D is isomorphic to F (C) for some C in C.

Definition 34. A functor F : C → D is called an embedding if it is fully faithful and injective on objects.

Definition 35. A functor F : C → D is called an equivalence of categories if it is fully faithful and essentially surjective.

Definition 36. A functor F : C → D is called an isomorphism if there exists a functor G : D → C such that G ◦ F = idC and F ◦ G = idD.

In category theory, it is rare to talk about isomorphisms between categories and more common to talk about equivalences of categories.

15 Definition 37. A functor is said to preserve a property of a morphism f if F (f) satisfies the property whenever f does.

Definition 38. A functor is said to reflect a property of a morphism if f satisfies the property whenever F (f) does.

Here are a few useful facts about functors:

• Faithful functors reflect monics and epics.

• Fully faithful functors reflect isomorphisms.

• Equivalences of categories preserve monics and epics.

• Every functor preserves isomorphisms.

2.2.2 Natural Transformations

Natural transformations are mappings between functors.

Definition 39. Let F,G : C → D be contravariant functors. A natural transforma- tion, η : F ⇒ G associates to each object A in C a morphism ηA : F (A) → G (A), called a component of the natural transformation, such that for any morphism f : A → B in C, the diagram below commutes:

[Naturality square: ηB : F (B) → G (B) along the top, F (f) and G (f) down the sides, and ηA : F (A) → G (A) along the bottom,]

i.e. ηA ◦ F (f) = G (f) ◦ ηB. Note that an analogous definition holds for covariant functors, mutatis mutandis. We only consider natural transformations between contravariant functors in this dissertation.

Example 40. Let f : A → B be a morphism in a category C. We can construct a natural transformation between the covariant Hom-functors φ : Hom (B, −) ⇒

Hom (A, −) whose components are defined as φC : Hom (B,C) → Hom (A, C)

where g ↦ g ◦ f. The commutativity of the natural transformation square

[Square: Hom (B, h) : Hom (B,C) → Hom (B,D) along the top, φC and φD down the sides, and Hom (A, h) : Hom (A,C) → Hom (A,D) along the bottom]

follows from the associativity of function composition in Set.
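In Haskell this natural transformation is simply a polymorphic function, and the naturality square is exactly the associativity of composition. The sketch below is only illustrative and the name phi is ours.

```haskell
-- Given f : A -> B, the component at C of the natural transformation
-- Hom(B,-) => Hom(A,-) is precomposition with f.
phi :: (a -> b) -> ((b -> c) -> (a -> c))
phi f g = g . f

-- Naturality: for any h : C -> D and g : B -> C,
--   h . (phi f g) == phi f (h . g)
-- both sides equal h . g . f, by associativity of (.).
```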

Another way of stating the definition of equivalence of categories is that two categories C and D are equivalent if there exist functors F : C → D and G : D → C such that F ◦ G is naturally isomorphic to the identity functor on D and G ◦ F is naturally isomorphic to the identity functor on C.

2.2.3 Functor Categories

Definition 41. Given two categories C and D, the functor category D^C is defined to be the category whose objects are contravariant functors F : C → D and whose morphisms are given by natural transformations between functors.

Functor categories can also be defined for covariant functors, but we will not discuss any examples of these in this dissertation. In fact, a special type of contravariant functor, called a presheaf, will be used many times in this dissertation. This special type of contravariant functor is the subject of Section 2.7.

2.2.4 The Yoneda Embedding

The Yoneda lemma is a generalization of Cayley’s theorem in group theory which allows you to embed any category into a category of functors defined on that category. There are two forms of the Yoneda lemma: the covariant version and the contravariant version. In this dissertation, we only use the contravariant form of the lemma and so only it is covered here. We state the Yoneda lemma here without proof. A formal proof can be found in [93]. The contravariant form of the Yoneda lemma concerns the contravariant form of the Hom functor, Hom (−, A), which is often denoted by hA. The contravariant form of the lemma states that for any contravariant functor G : C^op → Set, there

is a natural isomorphism

[hA : G] ≅ G (A)

where [hA : G] denotes the set of natural transformations between hA and G. When the functor used in the Yoneda lemma is another Hom functor, the contravariant Yoneda lemma states

[hA : hB] ≅ Hom (A, B).

This means h− gives rise to a covariant functor from C to the category of contravariant functors into Set, i.e. h− : C → Set^(C^op). Thus, the Yoneda lemma tells us that any locally small category can be embedded in the category of contravariant functors into Set via h−. This is called the Yoneda embedding of the category. Another way of expressing this is to say any locally small category can be represented by presheaves in a full and faithful manner, i.e.

[hA : P] ≅ P (A)

for any presheaf P. A contravariant functor into Set is said to be representable if it is naturally isomorphic to hA for some object A. When working out how topos theoretic constructions arise, we will commonly restrict to deducing how these constructions should work on representable functors and use these insights to surmise the general situation. This trick is very common in category theory and will be used when we discuss exponentials and subobject classifiers for presheaf topoi in later sections of this chapter.

2.3 Lattices & Heyting Algebras

2.3.1 Lattices

Here we briefly introduce the basic definitions for lattices. A more detailed reference is [29]. The main type of lattices we will focus on in this dissertation are Heyting algebras. These will be defined in the next section. A lattice consists of a poset in which every two elements have a unique supremum and a unique infimum.

Example 42. The natural numbers can be given a poset structure by divisibility,

i.e. a ≤ b if and only if a divides b. In this case, the supremum is the least common multiple and the infimum is the greatest common divisor.
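A tiny worked rendering of Example 42 (purely illustrative, restricted to positive integers so that division is well defined):

```haskell
-- a <= b in the divisibility order means a divides b (positive integers only).
divides :: Integer -> Integer -> Bool
divides a b = b `mod` a == 0

-- Supremum and infimum in this order:
join, meet :: Integer -> Integer -> Integer
join = lcm   -- least common multiple
meet = gcd   -- greatest common divisor

-- e.g. join 4 6 == 12, meet 4 6 == 2, and 4 `divides` 12 == True
```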

Lattices can be given a purely algebraic definition as well. The poset-based definition and the algebraic definition are equivalent. We provide the algebraic axioms below.

Definition 43. A lattice (L, ∨, ∧) is a set L along with two binary operations ∨ and ∧ on L which satisfy the following properties for all a, b, and c in L:

• a ∨ b = b ∨ a and a ∧ b = b ∧ a

• a ∨ (b ∨ c) = (a ∨ b) ∨ c and a ∧ (b ∧ c) = (a ∧ b) ∧ c

• a ∨ (a ∧ b) = a and a ∧ (a ∨ b) = a.

The first rules are the commutative laws, the second are the associative laws, and the last are known as the absorption laws. There are two more laws, which are consequences of this definition, which are also important for lattices. The following identities holds for every a in L and are known as the idempotent laws:

• a ∨ a = a and a ∧ a = a.

From an algebraic lattice as defined above, we can endow it with a poset structure by defining x ≤ y if and only if x ∧ y = x.

Definition 44. A lattice is said to be bounded if there exist elements ⊤ and ⊥ such that ⊥ is an identity element for the join operation ∨ and ⊤ is an identity element for the meet operation ∧, i.e.

• a ∨ ⊥ = a and a ∧ ⊤ = a.

Definition 45. A lattice is said to be distributive if the following properties hold for all a, b, and c in L :

• a ∨ (b ∧ c) = (a ∨ b) ∧ (a ∨ c)

• a ∧ (b ∨ c) = (a ∧ b) ∨ (a ∧ c).

Example 46. Let X be a set. The collection of all subsets of X, P (X), is a bounded distributive lattice where the meet ∧ is given by set intersection and the join ∨ is given by set union. The bottom element is the empty set while the top element is the set X itself.

Example 47. The integers can be given the structure of a distributive lattice where the meet is given by minima and the join is given by maxima. Notice that this lattice is not bounded because there is neither a smallest integer nor a largest integer.

2.3.2 Heyting Algebras

A Heyting algebra is a bounded, distributive lattice with a weaker form of complementation called pseudo-complementation which we will define below. More details on the properties of Heyting algebras can be found in [60,94]. Heyting algebras are important in topos theory because the collection of subobjects on any object in a topos has the structure of a Heyting algebra. In a Heyting algebra, we can define a pseudo-complement of any a in H, denoted ¬a, where ¬a is the largest element such that a ∧ ¬a = ⊥. Another way of defining Heyting algebras is via a binary operation, called the implication, →, which satisfies the requirements in the definition below.

Definition 48. Let H be a bounded lattice. We say that H is a Heyting algebra if and only if there exists a binary operation, →, called the implication such that the following identities hold for all a, b, and c in H:

• a → a = ⊤

• a ∧ (a → b) = a ∧ b

• b ∧ (a → b) = b

• a → (b ∧ c) = (a → b) ∧ (a → c)

With this definition, we can provide an alternative definition of the pseudo-complement: ¬a := (a → ⊥). Heyting algebras play an important role in topos theory because the lattice of subobjects of an object in a topos has the structure of a Heyting algebra.

Example 49. Every Boolean algebra is a Heyting algebra with a → b given by ¬a ∨ b.

Example 50. Let {0, 1/2, 1} be given as a totally ordered set with ≤ defined in the usual way. This can be given the structure of a Heyting algebra which is not a Boolean algebra by defining the meet, join, implication, and pseudo-complementation by the rules depicted in the tables below:

a ∧ b:
a\b |  0   1/2   1
 0  |  0    0    0
1/2 |  0   1/2  1/2
 1  |  0   1/2   1

a ∨ b:
a\b |  0   1/2   1
 0  |  0   1/2   1
1/2 | 1/2  1/2   1
 1  |  1    1    1

a → b:
a\b |  0   1/2   1
 0  |  1    1    1
1/2 |  0    1    1
 1  |  0   1/2   1

¬a:
  a |  0   1/2   1
 ¬a |  1    0    0

Note that the above construction is not a Boolean algebra because it does not satisfy double negation, i.e.

¬¬(1/2) = ¬0 = 1 ≠ 1/2.
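The tables above can be checked mechanically. The following Haskell sketch is our own encoding, not part of the dissertation; it implements the three-element Heyting algebra and exhibits the failure of double negation.

```haskell
data H = Zero | Half | One deriving (Eq, Ord, Show)

meetH, joinH, impH :: H -> H -> H
meetH = min                       -- a /\ b
joinH = max                       -- a \/ b
impH a b
  | a <= b    = One               -- a -> b is the top element when a <= b
  | otherwise = b                 -- otherwise the largest c with (min a c) <= b is b

negH :: H -> H
negH a = impH a Zero              -- pseudo-complement: not a = (a -> 0)

-- negH (negH Half) == One, which differs from Half, so double negation fails
-- and the algebra is not Boolean.
```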

2.4 Monads

In this section, we briefly review monads. Our exposition here will be rather terse and will simply recall various standard definitions and results. A more thorough exposition can be found in chapter 6 of [93] or chapter 14 of [15].

Definition 51. Let C be a category. A monad, also called a triple, on C consists of

an endofunctor T : C → C together with two natural transformations η : 1C ⇒ T and µ : T² ⇒ T referred to as the unit and multiplication, respectively. The triple (T, η, µ) is required to satisfy the coherence conditions µ ◦ Tµ = µ ◦ µT and µ ◦ Tη = µ ◦ ηT = 1T. The first equation is an equality of natural transformations T³ ⇒ T and the latter equation is an equality of natural transformations T ⇒ T. These can be visualized by the following commutative diagrams:

[Diagrams: the associativity square T³ → T² → T, with Tµ and µT on the two parallel sides and µ on the remaining two; and the unit triangles T → T² ← T via Tη and ηT, with µ : T² → T making both composites equal to the identity on T.]

The first condition is analogous to the associativity condition for monoids if µ is thought of as a categorification of the monoid's binary operation, while the latter condition is analogous to the existence of an identity element for the binary operation on the monoid. This is why η is referred to as the unit and µ is referred to as the multiplication for the monad.

Example 52. We can define another monad P on Set by defining P (X) to be the power set of X. For a morphism f : A → B, we can define P (f) to be the function defined by taking direct images under f. The unit natural transformation is defined on components as the map ηX : X → P (X) by x ↦ {x}. The multiplication natural transformation µ : P² ⇒ P is defined on components by µX : P² (X) → P (X), taking a set of sets to its union.
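Restricted to finite sets, Example 52 can be sketched in Haskell. Set cannot be a literal Haskell Monad instance because of its Ord constraint, so the unit and multiplication are written out directly; the names unitP, multP, and bindP below are ours, with bindP anticipating the discussion that follows.

```haskell
import qualified Data.Set as Set
import           Data.Set (Set)

unitP :: a -> Set a
unitP = Set.singleton                  -- eta_X : X -> P(X),  x |-> {x}

multP :: Ord a => Set (Set a) -> Set a
multP = Set.unions . Set.toList        -- mu_X : P(P(X)) -> P(X), union of a set of sets

-- The corresponding bind operation.
bindP :: Ord b => Set a -> (a -> Set b) -> Set b
bindP s f = multP (Set.fromList (map f (Set.toList s)))
```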

A particular monad, constructed by Michele Giry in [59], will be used frequently in this dissertation. It associates to a measurable space the collection of probability distributions on that measurable space and equips it with a sigma algebra. We will introduce this monad in greater depth in chapter 3. In the computer science literature (e.g. [100], [65]) monads are often defined slightly differently. The multiplication natural transformation is replaced with a natural transformation called bind, denoted >>=. The bind is defined for all pairs of objects X, Y in C as a function

(>>=) : C (X, T (Y)) → C (T (X), T (Y)).

The notation (t >>= f) is used for (>>=) (f) (t). The monad axioms then become:

• (t >>= η) = t,

• (η (x) >>= f) = f (x),

• (t >>= (λx. f (x) >>= g)) = ((t >>= f) >>= g).

The intuition behind this reformulation is to think of T (X) as an object of compu- tations returning X. In this case, η is the computation that returns immediately.

t >>= f sequences computations by first running t and calling f with the result of t, similar to a UNIX pipeline. The equivalence of these two notions can be found in [96].
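In Haskell, where monads are given by return and (>>=), the translation between the two presentations referenced above is one line in each direction. The sketch below only assumes the standard Control.Monad library.

```haskell
import Control.Monad (join)

-- mu from bind: flatten a computation of computations.
joinFromBind :: Monad m => m (m a) -> m a
joinFromBind mma = mma >>= id

-- bind from mu: apply f inside, then flatten.
bindFromJoin :: Monad m => m a -> (a -> m b) -> m b
bindFromJoin ma f = join (fmap f ma)

-- e.g. joinFromBind [[1,2],[3]] == [1,2,3]
--      bindFromJoin (Just 2) (\x -> Just (x + 1)) == Just 3
```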

2.5 Cartesian Closed Categories

Definition 53. A category C is said to be Cartesian closed if and only if it has a terminal object, admits products, and admits exponentials.

Terminal objects and products were covered in an earlier section. We briefly recall the definition of exponentials here. More information on Cartesian closed categories can be found in [12,15,60,93].

Definition 54. Let Z and Y be objects of the category C and suppose C admits binary products. An object Z^Y together with a morphism eval : Z^Y × Y → Z is said to be an exponential object if for any object X and morphism g : X × Y → Z, there is a unique morphism λg : X → Z^Y (called the transpose of g) such that the diagram below commutes:

[Diagram: λg × 1Y : X × Y → Z^Y × Y followed by eval : Z^Y × Y → Z equals g : X × Y → Z.]

The assignment of a unique λg to each g establishes an isomorphism Hom (X × Y, Z) ≅ Hom (X, Z^Y). In other words, the functor (−)^Y : C → C defined on objects by C ↦ C^Y and on morphisms by (f : C → D) ↦ (f^Y : C^Y → D^Y) is right adjoint to the product functor − × Y.

Example 55. The category Set whose objects are sets and whose morphisms are functions is an example of a Cartesian closed category. The exponential object is defined as Y^X = Hom (X, Y), the set of functions from X to Y.

Example 56. A Boolean algebra can be given the structure of a Cartesian closed category. The objects in the category correspond to the elements of its underlying set. Products are given by conjunctions, exponentials are given by implications,

and evaluation corresponds to modus ponens, i.e.

(A ⇒ B) ∧ A ≤ B.

Cartesian closed categories are important in computer science. In a Cartesian closed category, a morphism f : X × Y → Z can be represented as a morphism λf : X → Z^Y. Computer scientists refer to this as currying. As such, simply-typed lambda calculus can be interpreted in any Cartesian closed category. The formal relationship between these is given by the Curry-Howard-Lambek correspondence which establishes an isomorphism between intuitionistic logic, simply-typed lambda calculus, and Cartesian closed categories [28,69,82–84]. In functional programming languages, eval is often written as apply and λg is often written as curry (g).
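As a small illustration (not from the dissertation), the currying isomorphism and the evaluation morphism look as follows in Haskell, where the exponential object Z^Y is the function type y -> z. The Prelude functions curry and uncurry witness the same bijection; the names lambda and eval below are our own.

```haskell
-- The transpose lambda g : X -> Z^Y of g : X x Y -> Z.
lambda :: ((x, y) -> z) -> (x -> (y -> z))
lambda g = \x y -> g (x, y)

-- The evaluation morphism eval : Z^Y x Y -> Z.
eval :: (y -> z, y) -> z
eval (h, y) = h y

-- The defining triangle: eval (lambda g x, y) == g (x, y) for all x and y.
```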

2.6 Topoi

Topoi are a special class of Cartesian closed categories which have been proposed as a general setting for mathematics [85]. Topoi were initially conceived by Grothendieck as a type of category which behaves like the category of sheaves on a topological space [9]. A later definition, due to Lawvere, generalized Grothendieck’s topoi in order to make suitable connections with logic [94]. These topoi are referred to as elementary topoi. When we discuss topoi in this dissertation, we will mean the elementary topoi due to Lawvere. The exposition in this section will be terse and effectively serves to collect definitions and standardize notation for future sections. Elementary introductions to the subject can be found in [60,85]. More advanced expositions are given in [94,98]. Standard references for the field are the books [14, 77, 78]. In this dissertation, we use sheaf topoi to model the semantics of higher order probabilistic programming languages, extending a previous construction by Heunen, Kammar, Staton, and Yang [65] and demonstrating that this construction is a sheaf with respect to the atomic Grothendieck topology on an appropriately constructed sample space category. We also use sheaf topoi as a background for extending statistical constructions to contextual measurement scenarios in which a family of marginal distributions cannot be assumed to originate from a joint distribution on their full column space. A topos is simply a Cartesian closed category with a subobject classifier. A

subobject classifier of a category C is an object Ω in C such that subobjects of any object X are in one-to-one correspondence with the morphisms from X into Ω. Before we can characterize Ω by a universal property, we need to briefly review the definition of subobjects. Subobjects are the topos theoretic analog of subsets. If A ⊂ B, there is an inclusion map ι : A ↪ B which is injective and hence monic. Conversely, any monic morphism in the category of sets determines a subset via its image. Hence, the domain of a monic map is isomorphic to a subset of the codomain of this map. However, since there are many sets with the same cardinality, we have to think of subobjects as equivalence classes of monic morphisms where f : A → B and g : C → B are equivalent if f factors through g and vice versa.

Definition 57. A subobject of an object B in a category C is an equivalence class of monic morphisms with codomain B under the equivalence relation f ∼ g if f and g factor through one another. We denote the equivalence class of f by [f].

These equivalence classes can be given a poset structure. Let f : A → B and g : C → B be monic morphisms. We say [f] ≤ [g] if there is a morphism h : A → C such that f = g ◦ h. Note this forces h to be monic. This construction categorically represents subsets as each subset determines and is determined by a unique subobject. Now that we have discussed subobjects, we can discuss the subobject classifier. Intuitively, subobjects in the category of sets correspond to subsets of a set.

Any subset S ⊂ A has an indicator function χ_S : A → {0, 1} defined by χ_S(x) = 1 if x ∈ S and χ_S(x) = 0 if x ∉ S.

This can be given a purely categorical definition by associating subsets with collections of monic morphisms into the set. Thus a subobject of A is an equivalence class of monic morphisms with codomain A where two monics m1 and m2 are considered equivalent if and only if they factor through one another. In Set, each subset S ⊂ X determines an equivalence class from its inclusion morphism [ι : S,→ X] and any monic m : E  X determines a subobject which is equivalent to [ι : m (E) ,→ X]. As such, we will often simply refer to subsets of X as subobjects by an abuse of terminology.

In the category of sets, the truth object, Ω, is simply the two element set 2 := {0, 1}. If we let 1 := {∗} denote a terminal object in Set, then there is a truth morphism: true : 1 → {0, 1} defined by true (∗) = 1 which picks out the value 1 as corresponding to true in Boolean logic. Given a subobject [m : S ↣ A], the characteristic function, χ_m, of m(S) satisfies the universal property that the diagram below is a pullback square:

    S --m--> A
    |        |
    !        χ_m
    v        v
    1 -true-> 2

where ! : S → 1 is the unique map to the singleton set 1. Note that if n : E ↣ A is another monic arrow, the condition that true(!(E)) = χ_m(n(E)) requires that n(E) ⊂ m(S). Thus the universal property above merely distinguishes m(S) as the largest possible element in the poset of subobjects. This observation motivates the definition of the subobject classifier for an arbitrary category.
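The following Haskell fragment is a small, informal illustration of the discussion above for finite sets: a subobject of a type is represented by its characteristic function (a predicate), and the classified subset is recovered by pulling back the value True. The names Subobject, chi, and classifiedBy are illustrative and not part of the formal development.

    -- A subobject of a (finite) set as a predicate, and the pullback of True.
    type Subobject a = a -> Bool

    chi :: Eq a => [a] -> Subobject a
    chi s x = x `elem` s

    classifiedBy :: [a] -> Subobject a -> [a]
    classifiedBy univ p = filter p univ

    main :: IO ()
    main = print (classifiedBy [1 .. 10 :: Int] (chi [2, 3, 5, 7]))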

Definition 58. Let C be a category with a terminal object 1. A subobject classifier for C is an object Ω together with a monic morphism ⊤ : 1 → Ω such that given any monic morphism m : C ↣ D, there exists a unique morphism χ_m : D → Ω which makes the diagram below a pullback square:

    C --m--> D
    |        |
    !        χ_m
    v        v
    1 --⊤--> Ω

We can now officially define a topos.

Definition 59. A category T is said to be a topos if and only if it is a Cartesian closed category which also has a subobject classifier.

Example 60. The prototypical example of a topos is Set. However, the categories of presheaves and sheaves are also commonly occurring examples of topoi which we will discuss in the next section. The only topoi discussed in this dissertation are the types mentioned in this example.

There are many equivalent characterizations of topoi. Note that any topos

must admit all finite limits and colimits. Historically, the major successes of topos theory came from algebraic geometry and logic. In algebraic geometry, topoi were invented by Grothendieck as an attempt to construct a cohomology theory with variable coefficients [9]. Grothendieck’s original definition was generalized by Lawvere in his search for an axiomatization of the category of sets [?]. The definition presented above is what Lawvere would have called an elementary topos. Topoi were also essential to Paul Cohen’s forcing technique used to construct new models of Zermelo-Fraenkel set theory, for which he was awarded the Fields medal in 1966. In this dissertation, we will mainly discuss the topos of sets and topoi that arise as presheaf or sheaf topoi on either a topological space or a Grothendieck topology on a category.

2.7 Presheaves

Presheaves and sheaves are an important family of contravariant functors which we will use at many points in this dissertation. We recall some basic definitions and properties below. More detailed treatment of these topics, including proofs, can be found in [94].

2.7.1 The Category of Presheaves

Definition 61. A presheaf on a category C is a contravariant functor into the category Set. Presheaves form a category whose morphisms are given by natural transformations. If C is a category, we use the notation Cˆ to denote the category of presheaves.

Cˆ has the structure of a category. The objects in this category are contravariant functors. Morphisms are given by natural transformations η : F ⇒ G, i.e. given a morphism f : C → D in C, the following diagram commutes:

    F(D) --F(f)--> F(C)
    |              |
    η_D            η_C
    v              v
    G(D) --G(f)--> G(C)

The identity natural transformation ι : F ⇒ F is obtained by taking its components to be ι_X = id_{F(X)} in Set, which produces the following commutative diagram:

    F(D) --F(f)--> F(C)
    |              |
    ι_D            ι_C
    v              v
    F(D) --F(f)--> F(C)

Composition is given by vertically pasting naturality squares, i.e. commutative diagrams of the following form:

    F(D) --F(f)--> F(C)
    |              |
    η_D            η_C
    v              v
    G(D) --G(f)--> G(C)
    |              |
    µ_D            µ_C
    v              v
    H(D) --H(f)--> H(C)

2.7.2 Initial and Terminal Objects

The initial object in Cˆ is the constant functor 0 in Cˆ which maps each object to the empty-set and every morphism to the identity morphism. The terminal object, 1, is the constant functor that maps each object to the one element set 1 and each morphism to the identity map.

2.7.3 Products and Coproducts

Given two presheaves F and G, the product presheaf is defined on an object C in C by (F × G)(C) := F (C) × G (C) where the right-hand side is a product in Set. From each morphism f : B → C, we obtain a function

(F × G)(f) : F(C) × G(C) → F(B) × G(B)

such that the following diagram commutes:

    F(C) <--ρ_1-- F(C) × G(C) --ρ_2--> G(C)
    |                  |                  |
    F(f)          F(f) × G(f)           G(f)
    v                  v                  v
    F(B) <--ρ_1-- F(B) × G(B) --ρ_2--> G(B)

Given two presheaves F and G, the coproduct presheaf is defined on an object C in C by (F ∐ G)(C) := F(C) ∐ G(C), where the right-hand side is a coproduct (disjoint union) in Set. From each morphism f : B → C, we obtain a function

    (F ∐ G)(f) : F(C) ∐ G(C) → F(B) ∐ G(B)

such that the following diagram commutes:

    F(C) --ι_1--> F(C) ∐ G(C) <--ι_2-- G(C)
    |                  |                  |
    F(f)          F(f) ∐ G(f)           G(f)
    v                  v                  v
    F(B) --ι_1--> F(B) ∐ G(B) <--ι_2-- G(B)

2.7.4 Equalizers and Coequalizers

Given two natural transformations η, µ : F ⇒ G, an equalizer ι : E ⇒ F is a natural transformation such that η ◦ ι = µ ◦ ι, i.e. for each C in C, the components

of the natural transformation compose in Set, i.e. ηC ◦ ιC = µC ◦ ιC . Moreover, given any other natural transformation ω : H ⇒ F satisfying η ◦ ω = µ ◦ ω, there is a unique natural transformation κ : H → E such that for any object C in C, the

following diagram commutes:

    E(C) --ι_C--> F(C) ⇉(η_C, µ_C) G(C)

with κ_C : H(C) → E(C) the unique map satisfying ι_C ◦ κ_C = ω_C.

Given two natural transformations η, µ : F ⇒ G, a coequalizer ϑ : G ⇒ E is a natural transformation such that ϑ ◦ η = ϑ ◦ µ, i.e. for each C in C, the components compose in Set as ϑ_C ◦ η_C = ϑ_C ◦ µ_C. Moreover, given any other natural transformation ω : G ⇒ H satisfying ω ◦ η = ω ◦ µ, there is a unique natural transformation κ : E ⇒ H such that for any object C in C, the following diagram commutes:

    F(C) ⇉(η_C, µ_C) G(C) --ϑ_C--> E(C)

with κ_C : E(C) → H(C) the unique map satisfying κ_C ◦ ϑ_C = ω_C.

2.7.5 Pullbacks and Pushouts

Let X, Y, Z be presheaves in Cˆ. Suppose λ : X ⇒ Z and ι : Y ⇒ Z are natural transformations. We say that the natural transformations η : P ⇒ X and κ : P ⇒ Y form a pullback for λ and ι if and only if for every object C in C, the commuting square

    P(C) --κ_C--> Y(C)
    |             |
    η_C           ι_C
    v             v
    X(C) --λ_C--> Z(C)

is a pullback square in Set, i.e. (X ×_Z Y)(C) ≅ X(C) ×_{Z(C)} Y(C).

Notice that a morphism f : C → D in C induces a commuting cube: one face is the pullback square at C, the opposite face is the pullback square at D, and the connecting edges are the restriction maps P(f), X(f), Y(f), and Z(f).

Similarly, if X, Y, Z are presheaves in Cˆ and λ : Z ⇒ X and ι : Z ⇒ Y are natural transformations, we say that the natural transformations η : X ⇒ Q and κ : Y ⇒ Q form a pushout if and only if for every object C in C,

    Z(C) --ι_C--> Y(C)
    |             |
    λ_C           κ_C
    v             v
    X(C) --η_C--> Q(C)

is a pushout in Set, i.e. (X ∐_Z Y)(C) ≅ X(C) ∐_{Z(C)} Y(C). Again, pushouts always exist because they exist in Set.

2.7.6 Exponentials

Let F and G be presheaves on C. If an exponential G^F exists, we have a natural bijection

    Hom_{Cˆ}(E × F, G) ≅ Hom_{Cˆ}(E, G^F)

for every presheaf E in Cˆ. In particular, we may take E to be a representable functor, i.e. E = Hom_C(−, C) = h_C. By the Yoneda lemma this would mean that

    G^F(C) ≅ Hom_{Cˆ}(h_C, G^F) ≅ Hom_{Cˆ}(h_C × F, G).

Rather than assuming that the desired bijection exists, we can use this observation to define

    G^F(C) := Hom_{Cˆ}(h_C × F, G),

i.e. G^F(C) is the set of all natural transformations from Hom_C(−, C) × F into G; this assignment is contravariant in C and hence defines a presheaf. The evaluation mapping, eval : G^F × F → G, is a natural transformation defined on components by

    eval_C(η, y) = η_C(1_C, y) ∈ G(C)

where C is an object in C, η : Hom_C(−, C) × F ⇒ G, and y ∈ F(C). Verification that these satisfy the universal properties can be found in [94].

2.7.7 The Subobject Classifier

2.7.7.1 Subobjects

Definition. Let F, G : C^op → Set be functors. We say F is a subfunctor of G if F(C) ⊂ G(C) for all objects C in C and F(f) is a restriction of G(f) for all morphisms in C. There is then a natural transformation ι whose components are given by inclusion mappings, and this transformation is monic in Cˆ. Thus each such F determines a subobject. Conversely, all subobjects are given by subfunctors: if η : F ⇒ G is monic in the functor category, then each component η_C : F(C) → G(C) is injective.

As subobjects are equivalence classes of monics, ηC is equivalent to ιC where ιC is given by the inclusion of the image of each component.

2.7.7.2 The Subobject Classifier

Definition 62. (Sieves) If C is an object in a category C, a sieve is a subfunctor of Hom (−,C). Sieves can be thought of as the categorical analog of lower sets on a poset. The subfunctor criteria means that if f : B → C belongs to the sieve and g : A → B is any morphism, then f ◦ g also belongs to the sieve.

If Cˆ admits a subobject classifier, Ω, it must, in particular classify the subobjects

of the representable presheaf h_C := Hom_C(−, C). Thus,

    Sub_{Cˆ}(Hom_C(−, C)) ≅ Hom_{Cˆ}(Hom_C(−, C), Ω) = [Hom_C(−, C) : Ω].

By the Yoneda lemma, we would have a natural isomorphism [Hom_C(−, C) : Ω] ≅ Ω(C). Thus, the subobject classifier, if it exists, must be given by

Ω(C) = SubCˆ (HomC (−,C)) .

In other words, the subobject classifier is the collection of sieves on C.

Definition 63. If C is an object in a category C, a sieve on C is a subfunctor of Hom (−,C). Sieves can be thought of as the categorical analog of lower sets on a poset. The subfunctor criteria means that if f : B → C belongs to the sieve and g : A → B is any morphism, then f ◦ g also belongs to the sieve.

Example 64. If we regard a poset P as a category, the sieves on an object p in P are exactly the down-sets below p: sets of elements S such that every s ∈ S satisfies s ≤ p, and s ∈ S together with s′ ≤ s implies s′ ∈ S. A small computational illustration is given below.
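For a finite poset this description can be made completely concrete. The following Haskell sketch enumerates all sieves on an element as the downward-closed subsets of its lower set; the names Poset and sievesOn and the divisibility example are illustrative only and not part of the formal development.

    import Data.List (subsequences)

    -- A finite poset given by its carrier and order relation.
    type Poset a = ([a], a -> a -> Bool)

    -- All sieves on p: downward-closed subsets of { x | x <= p }.
    sievesOn :: Eq a => Poset a -> a -> [[a]]
    sievesOn (carrier, leq) p =
      [ s | s <- subsequences below, downwardClosed s ]
      where
        below = [ x | x <- carrier, x `leq` p ]
        downwardClosed s = and [ y `elem` s | x <- s, y <- carrier, y `leq` x ]

    -- Example: the divisibility poset on {1,2,3,6}; sieves on 6 are its down-sets.
    divides :: Int -> Int -> Bool
    divides a b = b `mod` a == 0

    main :: IO ()
    main = print (sievesOn ([1, 2, 3, 6], divides) 6)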

Definition 65. (Pullback of Sieves) Any morphism f : C → D induces a map f* from the sieves on D to the sieves on C given by S ↦ {g | cod(g) = C and f ◦ g ∈ S}. The map f* is known as the pullback.

The restriction mappings in the presheaf are given by sieve pullbacks, i.e. a morphism f : C → D induces a morphism Ω(f) : Ω(D) → Ω(C) defined by taking a sieve on D to its pullback on C. The largest possible sieve on an object is known as its principal sieve.

Definition 66. For any object C ∈ C, we can define the principal sieve, ↓ C, to be the largest possible sieve on C, i.e. ↓ C := {f | cod (f) = C}. Notice this is equivalent to saying ↓ C is the sieve containing the identity map id : C → C.

Note that using principal sieves, we can rewrite the pullback sieve more simply as f*(S) = S ∩ ↓C.

The collection of sieves on an object in a category has the structure of a Heyting algebra, which was defined in an earlier section. In order to finish defining the subobject classifier for presheaf categories, we need to also define the truth morphism. The truth morphism ⊤ : 1 ⇒ Ω is just the natural transformation whose component at each object picks out the maximal sieve in Ω(C).

2.7.8 Local and Global Sections

Definition 67. A global section on a presheaf F is a natural transformation from the terminal presheaf 1 into F.

A particular class of global sections are important in any topos. These are the global sections of the subobject classifier which are known as the truth values in the topos.

Definition 68. A local section on a presheaf F is a natural transformation from a subobject of the terminal object into F.

2.8 Sheaves

Sheaves are a mathematical tool for handling local to global phenomenon on a topological space. The standard example of a sheaf is the sheaf of continuous real-valued functions on a topological space, X.

Definition 69. A presheaf on a topological space (X, τ) is a function that assigns to each open set U ∈ τ a set F(U), called the set of sections on U. This assignment is done in such a way that if U ⊂ V there is a restriction mapping res^V_U : F(V) → F(U). The restriction mappings are required to satisfy the criterion that if U ⊂ V ⊂ W then res^V_U ◦ res^W_V = res^W_U.

We defined presheaves on an arbitrary category in the previous section. The definition above is a special instance of the previous definition: we construct a category from the topological space whose objects are the open sets, with a morphism U → V if and only if U ⊂ V. Note this is the same definition as the category associated to a topological space X_τ. With this definition, a presheaf is simply a contravariant functor from this category into Set. In our example of continuous functions on a topological space, the sections over each open set U are given by the set of continuous functions f : U → R. The restriction maps are obtained by restricting the definition of the function to a smaller domain. The requirement res^V_U ◦ res^W_V = res^W_U tells us that restricting the domain of our function from W to V and then restricting the domain of the function from V to U is the same thing as simply restricting the domain of the function from W to U. Sheaves are presheaves which satisfy additional gluability criteria.

Definition 70. A presheaf F is called a sheaf if it satisfies two additional criteria:

• (Locality) If {U_i}_{i∈I} is an open covering of a set U, and f, g ∈ F(U) are such that res^U_{U_i}(f) = res^U_{U_i}(g) for each i in the open cover, then f = g.

• (Gluing) If {U_i}_{i∈I} is an open covering of a set U, and if for each i we have a section f_i ∈ F(U_i) such that for every pair U_i and U_j of the covering res^{U_i}_{U_i∩U_j}(f_i) = res^{U_j}_{U_i∩U_j}(f_j), then there is a section f ∈ F(U) such that res^U_{U_i}(f) = f_i for every i ∈ I. (A small computational illustration of gluing follows below.)
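As an informal illustration of the gluing axiom, the following Haskell sketch glues finitely supported local sections (finite maps standing in for sections over open sets) whenever they agree pairwise on overlaps; Section and glue are illustrative names and the example is not part of the formal development.

    import qualified Data.Map as M

    -- Local sections over finite "opens", represented as finite maps.
    type Section k v = M.Map k v

    -- glue checks pairwise compatibility on overlaps and, when it holds,
    -- returns the unique section on the union of the cover.
    glue :: (Ord k, Eq v) => [Section k v] -> Maybe (Section k v)
    glue secs
      | compatible = Just (M.unions secs)
      | otherwise  = Nothing
      where
        compatible =
          and [ and (M.elems (M.intersectionWith (==) s t))
              | (i, s) <- zip [0 :: Int ..] secs
              , (j, t) <- zip [0 :: Int ..] secs
              , i < j ]

    main :: IO ()
    main = do
      let s1  = M.fromList [(1 :: Int, 'a'), (2, 'b'), (3, 'c')]
          s2  = M.fromList [(3, 'c'), (4, 'd')]
          bad = M.fromList [(3, 'x')]
      print (glue [s1, s2])   -- sections agree on the overlap {3}: glued section
      print (glue [s1, bad])  -- sections disagree on the overlap: Nothing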

These two axioms can be characterized category theoretically by stating that the following diagram is an equalizer:

    F(U) → ∏_i F(U_i) ⇉ ∏_{i,j} F(U_i ∩ U_j)

where the first map is the product of the maps res^U_{U_i} and the pair of parallel morphisms are the products of the two different ways of restricting, res^{U_i}_{U_i∩U_j} and res^{U_j}_{U_i∩U_j}. A sheaf can be defined by only specifying its values on the open sets of a basis and verifying the sheaf axioms relative to the basis [94]. We will use this fact frequently in chapter 5 of this dissertation when discussing sheaves on the simplicial complex associated to a database. When defining presheaves, we observed that we could construct a category from the collection of open sets on the underlying topological space and see the presheaf as a contravariant functor into the category Set. As such, we can generalize the definition of presheaf to hold in any category.

Definition 71. Let C be a category. A presheaf F on C is a contravariant functor from C into Set. The collection of presheaves itself forms a category which we will denote by Cˆ .

The definition of a sheaf required a notion of gluing. As such, to extend the definition of sheaf to an arbitrary category, we need some type of category theoretic notion of an open cover. This can be accomplished through the use of sieves, which we defined earlier when discussing presheaf categories.

Definition 72. A Grothendieck topology, J, on a category C is a function which associates to each object C in C a collection of sieves J (C) known as the collection of covering sieves of C. This assignment is required to satisfy the following criteria:

• The maximal sieve tC = {f | cod (f) = C} is in J (C).

• (stability axiom) If S ∈ J (C), then the pullback h∗ (S) ∈ J (D) for any morphism h : D → C.

• (transitivity axiom) If S ∈ J (C) and R is any sieve on C such that h∗ (R) ∈ J (D) for all h : D → C in S, then R ∈ J (C).

Definition 73. A category C equipped with a Grothendieck topology J is referred to as a site.

Definition 74. A sieve S on an object C is said to be a covering sieve if S ∈ J (C) .

The covering sieves on a Grothendieck topology are the analogs of open covers of a set. As such, we can develop sheaves on Grothendieck topologies by using topological notions such as matching families. The interested reader can follow the presentation in [94]. We give a precise definition here for completeness.

Definition 75. A presheaf F on a site (C, J) is a sheaf if and only if for every covering sieve S in J(C), the inclusion S ↪ h_C induces an isomorphism Hom(S, F) ≅ Hom(h_C, F).

In this dissertation we focus only on the atomic Grothendieck topology where the covering sieves are taken to be all inhabited sieves. This topic will be covered in more depth in chapter 3. In the remaining chapters, all other sheaves discussed are constructed on normal topological spaces. The atomic sieves play an important role in topos theory because their sheaves have a particularly simple characterization. Informally, the lemma below states that every morphism is a cover in the atomic topology.

Lemma. (Mac Lane & Moerdijk [94]) A presheaf F is a sheaf for the atomic topology on a category C if and only if for any morphism f : D → C and any y ∈ F(D), if F(g)(y) = F(h)(y) for all diagrams

    E ⇉(g, h) D --f--> C

with f ◦ g = f ◦ h, then y = F(f)(x) for a unique x ∈ F(C).

In chapter 3 we will interpret this lemma probabilistically when constructing a sheaf of random variables.

Chapter 3 | A Sheaf Theoretic Perspective on Higher Order Probabilistic Programming

Probabilistic programming languages are programming languages designed to describe probabilistic models and to perform inference with these models. In this chapter we explore the relationship between sheaf theory and probabilistic programming with higher order functions. The main result in this chapter is extending a construction due to Heunen, Kammar, Staton, and Yang to a presheaf construction on a category of sample spaces (Definition 88) and showing this extension is in fact a sheaf with respect to the atomic Grothendieck topology (Lemma 91). We also realize expectation as a sheaf morphism (Section 3.6.3) and discuss some structural properties of these new objects as they relate to foundational concepts in probability theory (Section 3.6.1, Section 3.6.2). Along the way, we characterize several sub-classes of monic arrows in the category of measurable spaces (Proposition 77) and show that Meas does not admit a subobject classifier (Lemma 76). The fact that Meas is not a topos was already well known due to a theorem of Aumann [10], who showed that the category of standard Borel spaces is not Cartesian closed; by proving that Meas does not admit a subobject classifier, we provide an alternative proof that Meas is not a topos. We also prove a simple lemma about lifting probability measures along surjective maps (Section 7.3.2). Composability is the heart of category theory, and structured programming,

38 with its emphasis on the composability of blocks of code, is essential to building large scale programs. In this sense, it would seem that there should be some connection between category theory and software engineering. Moreover, functional programming goes beyond composability of functions and data types and makes concurrency composable. Eugenio Moggi discovered that computational effects could be mapped using monads from category theory [100]. This observation made functional languages like Haskell more usable and gave computer scientists a more stereoscopic view of traditional programming. It is our belief that category theory is a natural setting for programming semantics and the purpose of this chapter is to develop a framework for thinking about higher order probabilistic programming based on sheaves of families of random variables of some fixed type. Sheaf theory has been previously applied to probability theory by Jackson who showed that the Radon-Nikodym theorem could be interpreted as a sheaf morphism between measurable locales [71]. Gauthier gives a purely algebraic characterization of stochastic calculus via sheaves on a symmetric monoidal infinity category and establishes connections between stochastic differential equations and deformation theory [54]. A similar construction to the one developed in this chapter appears in a conference paper by Simpson [123]; however, Simpson uses a slightly different definition of his underlying sample space category and prefers to work with equivalence classes of random variables. Similar tools to those discussed in this chapter are used in later chapters to reason about statistical inference for non-flat distributed databases in a way that recasts statistical theory as a contextual theory. In this chapter we will first study the category of measurable spaces and see that Meas is an inadequate category for modeling probabilistic programming. We also discuss a recent construction due to Heunen, Kammar, Staton, and Yang which replaces Meas with a Cartesian closed category which they call the category of quasi-Borel spaces [65]. We use this framework to extend their construction to a presheaf which allows us to work with the full structure of a topos.

3.1 The Categorical Structure of Measurable Spaces

Lawvere and Giry have previously analyzed probability from the perspective of category theory by considering probabilistic concepts as constructions inside the

category of measurable spaces or the subcategory corresponding to the category of standard Borel spaces. In this section, we explore the categorical structure of the category of measurable spaces, noting that it fails to be a Cartesian closed category due to a result by Aumann [10]. For the purposes of discussing probabilistic programming, it is essential that we develop a framework compatible with extensions of sample spaces. This is suggestive of Lawvere’s observation that topos theory is a natural framework for thinking about variable sets due to its connection with sheaf theory. To start with, we can consider a category whose objects are measurable spaces (Ω, F) where Ω is a set and F is a sigma algebra on Ω. Morphisms in this category will be given by measurable maps between measurable spaces. In the sequel, we will denote this category by Meas. For shorthand, we will refer to objects in Meas by their underlying set with the sigma algebra being suppressed for notational convenience. As Meas is a concrete category, it shares many structures with the category of sets; however, the restriction of equipping these sets with a sigma algebra leads to several key differences. Although the category of sets is the most elementary example of a topos, i.e. a Cartesian closed category with a subobject classifier, we will see that Meas is not a topos as it admits neither a subobject classifier nor exponentials.

3.1.1 Non-Existence of Exponentials

Given two measurable spaces X and Y , we would like for the collection of measurable maps between them, Mor (X,Y ), to be an object in Meas. Clearly the collection of all measurable maps between the two measurable spaces is a set. In order to verify the universal property for Mor (X,Y ), we can introduce the canonical evaluation map on the product Mor (X,Y ) × X as follows:

    ev : Mor(X, Y) × X → Y,   (f, x) ↦ f(x)

and endow Mor(X, Y) with the smallest sigma algebra such that the evaluation map is measurable. This forces each set of the form ev^{-1}(B) = {(f, x) : f(x) ∈ B} for B measurable in Y to be measurable in Mor(X, Y) × X. Notice that the

sections of this will be measurable sets in Mor(X, Y) and X respectively. Fixing some x ∈ X, this forces sets of the form {f ∈ Mor(X, Y) : f(x) ∈ B} to be measurable in Mor(X, Y), which is equivalent to forcing the canonical evaluation maps ev_x : Y^X → Y defined by f ↦ f(x) to be measurable. Notice this construction appears to be universal, i.e. given any g : Z × X → Y, there exists a unique ĝ : Z → Mor(X, Y) which makes the following diagram commute:

    Z × X --ĝ × id_X--> Y^X × X --ev--> Y

(the composite equal to g : Z × X → Y)

where

ĝ : Z → Mor(X, Y),  z ↦ g(z, ·)

such that

g(z, ·) : X → Y,  x ↦ g(z, x).

At first glance, it might seem like Meas admits exponentials. However, the sigma algebra induced by ev must be the product sigma algebra of a sigma algebra on Y^X and a sigma algebra on X, since products in Meas are defined as being equipped with product sigma algebras. Unfortunately, even with X = Y = R equipped with the standard Borel sigma algebra, this is not true due to the following result.

Theorem. (Aumann, 1961) For any sigma algebra Σ on R^R, the evaluation mapping is never measurable with respect to the product sigma algebra Σ × B_R on R^R × R [10].

This no-go result indicates that neither Meas nor its subcategory of standard Borel spaces can be a Cartesian closed category. The infinite nature of R is essential to Aumann’s argument. In particular, restricting to only finite sets avoids all of the problems we have discussed in this subsection.

3.1.2 Lack of Subobject Classifier

In this section we prove that Meas does not admit a subobject classifier. We also show that for measurable spaces, regular monics, strong monics, and extremal monics all coincide with measurable embeddings. The fact that restricting to these subclasses fails to resolve the problems with the subobject classifier construction is interesting in that it demonstrates how the category of measurable spaces behaves quite differently from the category of topological spaces in spite of the similarity in their definitions. The category of topological spaces does admit a subobject classifier for strong monics [145].

Lemma 76. Meas does not admit a subobject classifier.

Proof. In order for Meas to have a sub-object classifier, we would need an object Ω along with a morphism > : 1 → Ω such that for every monic arrow m : E  X, there exists a measurable function χm : X → Ω which makes the diagram below a pullback:

    E --m--> X
    |        |
    !        χ_m
    v        v
    1 --⊤--> Ω

In the category of measurable spaces, let Ω = {0, 1} denote the measurable space defined on any set with two elements equipped with the sigma algebra of all subsets and let 1 = {∗} be a terminal object. The map ⊤ : 1 → Ω is given by ⊤(∗) = 1 and the map ! : E → 1 is defined by !(e) = ∗. χ_m is then just a characteristic function: χ_m(x) = 1 if x ∈ im(m) and χ_m(x) = 0 otherwise. This construction is perfectly fine in the category of sets; however, for measurable spaces we need to ensure the mapping χ_m is measurable. In general, the image of a measurable set is not measurable and so we can not guarantee that χ_m is even a measurable mapping unless im(m) = m(E) is a measurable set in X. As a simple example to illustrate this point, consider a map from any non-trivial measurable space into the same space equipped with the trivial sigma algebra

B∅ = {∅,X} . This gives an injective map of sets which is measurable and as such

is a monomorphism in Meas. Note that this construction shows that Meas is unbalanced, as it provides an example of a morphism which is monic and epic but not an isomorphism. At this point, the only alternative is to try equipping Ω with the trivial sigma algebra B = {∅, Ω}. Note that χ_m is always measurable with respect to the trivial sigma algebra since χ_m^{-1}(∅) = ∅ and χ_m^{-1}(Ω) = X, and so this construction resolves the first obstruction we found to constructing the pullback diagram for subobject classifiers in Meas. In order for the diagram to be a pullback, we would need that for any other monic arrow n : F ↣ X such that ⊤(!(f)) = χ_m(n(f)) for all f ∈ F, there exists a unique g : F → E making the following diagram commute

    F
     \ g
      v
      E --m--> X
      |        |
      !        χ_m
      v        v
      1 --⊤--> Ω

with n = m ◦ g the composite F → X.

The commutativity of the diagram forces us to define g(f) := m^{-1}(n(f)). In order for this construction to be measurable, we would need n(B_F) to be measurable in X whenever B_F is measurable in F. To see that this can not be the case, let X = R with its Borel sigma algebra and take n to be the inclusion mapping corresponding to an analytic set which is not Borel. As an explicit example of such a set, consider the set of all irrational numbers whose continued fraction expansion

    x = a_0 + 1/(a_1 + 1/(a_2 + ...))

is such that there exists an infinite sequence 0 < i_1 < i_2 < ··· where each a_{i_k} divides a_{i_{k+1}}. A result due to Lusin shows this is a set which is Lebesgue measurable, but not Borel measurable [80]. Thus, Meas does not admit a subobject classifier.

In the topos theory literature, it is common to explore variations on the definition of a subobject classifier by restricting to certain restricted classes of monics. For instance, the category of topological spaces admits a subobject classifier for strong

monics [145]. A strong monic is simply a monic m : E ↣ X such that for every epic e : Y ↠ Z and morphisms f : Z → X, g : Y → E such that m ◦ g = f ◦ e, there exists a morphism d : Z → E such that the following diagram commutes:

    Y --e--> Z
    |        |
    g        f
    v        v
    E --m--> X

with a diagonal d : Z → E satisfying d ◦ e = g and m ◦ d = f.

We will now show that the strong monics in Meas behave like subspace embeddings which in turn satisfy the universal property for the pullback diagram above. In doing so, we show several subclasses of monic arrows all coincide in the category of measurable spaces. As the construction due to Lusin is indeed an embedding, this shows that Meas does not admit a subobject classifier for strong monics and thus provides a categorical construction distinguishing it structurally from the category of topological spaces.

Proposition 77. In Meas, regular monomorphisms, strong monomorphisms, and extremal monomorphisms all coincide with measurable embeddings.

Proof. Another class of monic arrows of interest in category theory are regular monics. A monic m : E ↣ X is regular if it is the equalizer of some pair of arrows. First observe that every regular monic is strong. Suppose e : Y ↠ Z is epic and α : Y → E, β : Z → X are morphisms such that β ◦ e = m ◦ α. Since m is regular, it is the equalizer of some pair of parallel arrows k_1, k_2 : X → C. This situation can be visualized via the diagram below:

    Y --e--> Z
    |        |
    α        β
    v        v
    E --m--> X ⇉(k_1, k_2) C

with the dotted diagonal d : Z → E.

The dotted morphism d is guaranteed to exist by the universal property of equalizers and m ◦ d = β. By assumption β ◦ e = m ◦ α, and so m ◦ (d ◦ e) = m ◦ α which implies d ◦ e = α since m is monic. This establishes the commutativity of the diagram above. Another class of monic arrows of interest are extremal monics. We say that a monic m : E ↣ X is extremal if whenever m factors through an epic morphism, i.e.

m = g ◦ e, e is actually an isomorphism. Every strong monic is easily seen to be extremal. If m : E ↣ X is a strong monic, and e : E ↠ Y, g : Y → X are such that m = g ◦ e, the fact that m is strong means there is a morphism d : Y → E such that the diagram below commutes:

    E --e--> Y
    |        |
    id_E     g
    v        v
    E --m--> X

with the diagonal d : Y → E satisfying d ◦ e = id_E and m ◦ d = g.

Now, e = e ◦ id_E = e ◦ (d ◦ e) = (e ◦ d) ◦ e and e = id_Y ◦ e. Thus, (e ◦ d) ◦ e = id_Y ◦ e and so e ◦ d = id_Y, i.e. e is an isomorphism. Given an injective measurable map m : E ↣ X, we can endow m(E) with the subspace sigma algebra F_m = m(E) ∩ F_X. We call a monic map m : E ↣ X a measurable embedding if E ≅ m(E) where the latter is equipped with the sigma algebra F_m. We now notice that every extremal monomorphism must be a measurable embedding. Let m : E ↣ X be an extremal monic. Then any factorization of m = g ◦ e with e an epimorphism implies that e is an isomorphism. Any monic arrow m can be decomposed as:

    E --e--> m(E) --i--> X    (m = i ◦ e)

where e is the epimorphism onto the image subspace and i is the inclusion. Since m is assumed to be extremal, we must have that e is actually an isomorphism, i.e. m is a measurable embedding. To finish this proof, we need to argue that every measurable embedding is a regular monomorphism. Given a measurable embedding m : E ↣ X, we can form parallel arrows 1, χ_m : X ⇉ 2, where 1(x) = 1 for each x ∈ X and χ_m is defined as above. The equalizer of these parallel arrows is

    E_{1,χ_m} = {x ∈ X | χ_m(x) = 1(x)} = m(E) ≅ E.

Thus, every measurable embedding is the equalizer of these two arrows and hence regular.

The above proposition shows that several subclasses of monics all coincide with measurable embeddings. An analogous characterization of these monics is well known for topological spaces [145].

3.2 The Giry Monad

In this section we briefly discuss the Giry monad, highlighting some of the properties we will use later in this dissertation. A more detailed exposition along with proofs of the coherence conditions can be found in [59]. The Giry monad is a structure on Meas, the category of measurable spaces. A monad on a category C consists of a triple (G, η, µ) where G is an endofunctor from C into itself, η is a natural transformation I_C ⇒ G, and µ is a natural transformation µ : G² ⇒ G satisfying the coherence conditions in definition 51 in chapter 2.

3.2.1 The Endofunctor G

We can construct a functor G : Meas → Meas defined as follows:

• On objects: G(X) is defined to be the collection of probability measures on X, equipped with the smallest sigma algebra such that the evaluation map ev : G(X) × 2^X → [0, 1] defined by ev(p, χ_B) = ∫_X χ_B dp = p(B) is measurable, where χ_B ranges through all subobjects of X (i.e. characteristic functions of measurable sets) and I is the interval [0, 1] equipped with the Borel sigma algebra.

• On morphisms: given a measurable map f : X → Y , G (f) : G (X) → G (Y )

is defined as follows: given a probability distribution p_X on X, G(f) : p_X ↦ p_X(f^{-1}(·)), which defines a probability measure on Y.

Remark 78. Note that for any X, G (X) also has the structure of a convex space, i.e. if p, q ∈ G (X), then αp + (1 − α) q ∈ G (X) for any α ∈ [0, 1].

3.2.2 The Natural Transformation η

η is supposed to be a natural transformation from the identity functor, I ⇒ G, i.e. for each object X ∈ Meas, we need to construct a morphism I(X) → G(X). To do

this we can map each x ∈ X to the Dirac measure δ_x, defined by δ_x(B) = 1 if x ∈ B and δ_x(B) = 0 otherwise. We can verify this is indeed a natural transformation by checking the commutativity of the following diagram:

    I(X) --η_X--> G(X)
    |             |
    f             G(f)
    v             v
    I(Y) --η_Y--> G(Y)

The commutativity of the above diagram asserts η_Y(f(x)) = G(f)(η_X(x)), i.e. δ_{f(x)}(B) = δ_x(f^{-1}(B)) for every measurable B.

3.2.3 The Natural Transformation µ

The endofunctor G associates to a measurable space X the collection of all probability measures on X, viewed as a measurable space itself. This means we can apply G to G(X) and obtain a measurable space G²(X). The natural transformation µ will be a natural transformation µ : G² ⇒ G. Let X be an object in Meas. µ_X needs to take p′ ∈ G²(X) and associate to it a probability measure p ∈ G(X). The evaluation maps ev_B : G(X) → [0, 1] ⊂ R are real-valued measurable functions and therefore we can integrate with respect to them. Thus, we can define µ_X(p′)(B) := ∫_{G(X)} ev_B(p) dp′. This definition is sigma-additive by the monotone convergence theorem and µ_X(p′)(X) = 1, so µ_X(p′) is indeed a probability measure on X.

3.2.4 The Kleisli Category of the Giry Monad

Giry showed that the triple (G, η, µ) forms a monad on the category of measurable spaces by verifying the coherence conditions. Every monad gives rise to a corresponding Kleisli category. The Kleisli category of a monad (G, η, µ) has the same objects as its underlying category, in this case Meas. A Kleisli morphism between X and Y is a map f_K : X → G(Y). Given two Kleisli morphisms f_K : X → G(Y) and g_K : Y → G(Z), their composite is defined to be g_K ∘_K f_K := µ_Z ◦ G(g_K) ◦ f_K. Kleisli arrows can be seen as statistical models when the domain is interpreted as a parameter space. A Kleisli arrow also gives rise to a Markov kernel in the following manner: given a Kleisli arrow f_K : X → G(Y), we can construct a Markov kernel through the evaluation mapping ev_{f_K}, defined on a point x ∈ X and a measurable set B ⊂ Y by ev_{f_K}(x, B) = f_K(x)(B).
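To make the monad structure and Kleisli composition concrete, the following Haskell sketch implements the finitely supported, discrete analogue of the Giry monad: pure plays the role of η (the Dirac measure), bind combines the pushforward with µ, and kleisli is composition of finite Markov kernels. The names Dist, kleisli, coin, and step are illustrative only; the actual Giry monad works with arbitrary probability measures rather than weighted lists.

    -- A minimal, discrete stand-in for the Giry monad: finitely supported
    -- distributions as weighted lists.
    newtype Dist a = Dist { runDist :: [(a, Double)] }

    -- G on morphisms: the pushforward of a distribution along f.
    instance Functor Dist where
      fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

    instance Applicative Dist where
      pure x = Dist [(x, 1)]          -- η: the Dirac measure at x
      Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

    instance Monad Dist where
      -- bind = µ after G(k): push forward along k, then flatten by weights.
      Dist xs >>= k = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x) ]

    -- Kleisli arrows a -> Dist b are finite Markov kernels; composition is
    -- the (discrete) Chapman–Kolmogorov formula.
    kleisli :: (b -> Dist c) -> (a -> Dist b) -> (a -> Dist c)
    kleisli g f x = f x >>= g

    coin :: Double -> Dist Bool      -- a biased coin as a kernel Double -> Dist Bool
    coin p = Dist [(True, p), (False, 1 - p)]

    step :: Bool -> Dist Bool        -- a second kernel, depending on the first draw
    step b = coin (if b then 0.9 else 0.1)

    main :: IO ()
    main = print (runDist (kleisli step coin 0.5))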

3.2.5 Simple Facts About the Giry Monad

Example 79. Let [k] = {1, 2, . . . , k}, viewed as a measurable space whose sigma algebra is just the collection of all subsets of [k]. Then G([k]) is the collection of all probability distributions on [k]. This can be identified with the probability simplex Δ^{k−1} = {(p_1, . . . , p_k) ∈ R^k : Σ_{i=1}^k p_i = 1, p_i > 0 ∀i ∈ [k]}. The sigma algebra on G([k]) is generated by the pullbacks of sets of the form (a, b) ∩ [0, 1] ⊂ I under the evaluation maps ev_i : Δ^{k−1} → [0, 1] given by ev_i(p_1, . . . , p_k) = p_i. Hence ev_i^{-1}((a, b) ∩ I) = {(p_1, . . . , p_k) : p_i ∈ (a, b) ∩ I}. It follows that sets of the form

    ∩_{i=1}^k ev_i^{-1}((a_i, b_i) ∩ I) = [(a_1, b_1) × · · · × (a_k, b_k)] ∩ Δ^{k−1}

are Giry measurable subsets of Δ^{k−1}. This means that the Giry sigma algebra on Δ^{k−1} induced by the evaluation maps is equivalent to the standard Borel sigma algebra on Δ^{k−1}. In particular, algebraic statistical models, i.e. models defined by polynomial equations in Δ^{k−1}, will be measurable sets as these sets are closed in the Euclidean topology.

Lemma 80. Let X × Y be a product of measurable spaces and let πX : X × Y → X be the projection onto the first coordinate. The induced map G (πX ) : G (X × Y ) → G (X) corresponds to marginalizing over Y .

Proof. By definition G(π_X) : G(X × Y) → G(X) is defined as µ ↦ µ(π_X^{-1}(−)). Let B be a measurable subset of X. Then µ(π_X^{-1}(B)) = µ(B × Y).
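In the discrete sketch above (reusing the illustrative Dist type defined there), G(π_X) is simply the pushforward along the first projection:

    -- Marginalization over Y as the pushforward along fst.
    marginalX :: Dist (x, y) -> Dist x
    marginalX = fmap fst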

Example 81. (Univariate Normal Distributions) The collection of single variable normal distributions can be seen as the mapping n : R × R_{>0} → G(R) defined by (µ, σ²) ↦ p_{µ,σ²} where

    p_{µ,σ²}(B) = ∫_B (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx.

To see this mapping is indeed measurable, note that the Borel sigma algebra on I is generated by sets of the form (a, b) ∩ I. The sigma algebra structure on G (R) was generated by the evaluation maps evB : G (R) → I defined by evB (p) =

p(B). The preimage of such a set is just a set of the form ev_B^{-1}((a, b) ∩ I) = {p ∈ G(R) : p(B) ∈ (a, b) ∩ I}. Since the sigma algebra on G(R) is generated by sets of this form, we can observe that

    n^{-1}(ev_B^{-1}((a, b) ∩ I)) = {(µ, σ²) : ∫_B (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} dx ∈ (a, b) ∩ I},

which is a Borel measurable set in R × R_{>0}.

Example 82. (Singular Model) [141] Let P = [0, 1] × R. Let m : P → G(R) be defined by sending (a, b) to the probability measure with density

    (1/√(2π)) ((1 − a) e^{−x²/2} + a e^{−(x−b)²/2}).

This mapping is not injective: if ab = 0, then m(a, 0) = m(0, b) is the measure with density (1/√(2π)) e^{−x²/2}, so distinct parameters lead to the same probability distribution. Thus, this mixture model would not correspond to a subobject of G(R).

3.3 The Cartesian Closed Category of Quasi-Borel Spaces

A Cartesian closed category is a type of category with the same expressive power as a typed λ-calculus. As such this gives a category theoretic framework for expressing computations. Standard Borel spaces are certain well-behaved spaces for which many probabilistic constructions are guaranteed to exist. Unfortunately, as a full subcategory of Meas, these still do not admit a closed structure due to the theorem of Aumann referenced previously. For this reason, Heunen, Kammar, Staton, and Yang invented the category of quasi-Borel spaces [65]. The category of quasi-Borel spaces has the structure of a quasi-topos [66]. In this section, we review quasi-Borel spaces. In later sections, we will see how these can be regarded as certain types of sheaves on an appropriately defined sample space category.

3.3.1 Quasi-Borel Spaces

Fix an uncountable standard Borel space Ω.

49 Definition. A quasi-Borel space, (X,MX ), is a set X together with a subset

MX ⊂ [Ω → X] such that

1. If α ∈ M_X and f : Ω → Ω is measurable, then α ◦ f ∈ M_X.

2. If α : Ω → X is constant, then α ∈ M_X.

3. If Ω = ∐_{i∈N} S_i with each S_i ∈ B_Ω and the S_i pairwise disjoint, and {α_i}_{i∈N} is a sequence in M_X, then ∐_{i∈N} α_i ∈ M_X, where ∐_i α_i sends ω ∈ S_i to α_i(ω).

Given a quasi-Borel space as defined above, these can be identified with subobjects in the topos of sets, i.e. M_X ↣ X^Ω. In Set, we can consider the co-slice category, Ω ↓ Set, whose objects are maps α : Ω → X and whose arrows are given by composition, i.e. α ∈ X^Ω is sent to f ◦ α. To give quasi-Borel spaces the structure of a category, we can consider morphisms in the co-slice category which induce a set mapping between M_X and M_Y. In other words, a morphism f : (X, M_X) → (Y, M_Y) is a set map f : X → Y which induces a map f ◦ _ : M_X → M_Y. By abuse of notation, we will often simply denote a quasi-Borel space (X, M_X) by its collection of functions, M_X. The collection of quasi-Borel spaces thus forms a category which we denote by QBS.
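Very roughly, a quasi-Borel space can be thought of as a carrier together with a distinguished collection of admissible random elements Ω → X. The Haskell sketch below records only this shape; the closure conditions of the definition are not enforced by the type, and QSpace and isMorphismOn are illustrative names, not part of the formal development.

    -- A representational sketch only, with Double standing in for Ω.
    type Omega = Double

    newtype QSpace x = QSpace { admissible :: (Omega -> x) -> Bool }

    -- A candidate morphism f : X -> Y should carry admissible random elements
    -- to admissible ones; here this is checked only on a list of test elements.
    isMorphismOn :: [Omega -> x] -> QSpace x -> QSpace y -> (x -> y) -> Bool
    isMorphismOn tests mx my f =
      and [ admissible my (f . alpha) | alpha <- tests, admissible mx alpha ]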

Given a quasi-Borel space M_X, we can construct a sigma algebra F_{M_X} on its space of outcomes by

    F_{M_X} = {B ⊂ X | f^{-1}(B) ∈ F_Ω for all f ∈ M_X}.

We call members of F_{M_X} events.

Lemma 83. We have a bijection F_{M_X} ≅ QBS(M_X, M_2).

Proof. Take χ in QBS(M_X, M_2). If χ is a QBS morphism, then given any α ∈ M_X, χ ◦ α is in M_2. Then (χ ◦ α)^{-1}(1) = α^{-1}(χ^{-1}(1)) must be a Borel measurable subset of Ω for every α ∈ M_X. Hence, χ^{-1}(1) ∈ F_{M_X}. Conversely, suppose E ∈ F_{M_X}. Define χ_E ∈ QBS(M_X, M_2) by χ_E(x) = 1 if x ∈ E and χ_E(x) = 0 if x ∉ E. Let α ∈ M_X. Then (χ_E ◦ α)^{-1}(1) = α^{-1}(χ_E^{-1}(1)) = α^{-1}(E) is Borel measurable in Ω since E ∈ F_{M_X}. Similarly (χ_E ◦ α)^{-1}(0) = α^{-1}(E^C) = (α^{-1}(E))^C is Borel measurable.

The above proposition allows us to define an evaluation mapping on probability measures.

Definition 84. (Evaluation Mapping for Probability Measures) Let M_2^{M_X} =

QBS(M_X, M_2) and let G(M_X) denote the image of M_X under the Giry endofunctor. Let M_I be the quasi-Borel space obtained from the standard Borel space [0, 1]. The evaluation mapping, ev : M_2^{M_X} × G(M_X) → M_I, is defined to be ev(χ, µ) = ∫_Ω χ(ω) dµ.

3.3.2 Cartesian Closure of QBS.

The terminal object in QBS is just the singleton set whose quasi-Borel structure is

given by the unique map into it. If (X,MX ) and (Y,MY ) are quasi-Borel spaces,

there is a product (X × Y,MX×Y ) where the set is given by the usual product of

sets and the product structure MX×Y is defined by

M_{X×Y} := {α : Ω → X × Y | π_X ◦ α ∈ M_X, π_Y ◦ α ∈ M_Y}.

Exponentials are the only somewhat complicated construction. If (X, M_X) and (Y, M_Y) are quasi-Borel spaces, then so is (Y^X, M_{Y^X}), where Y^X := QBS(X, Y) and M_{Y^X} := {α : Ω → Y^X | uncurry(α) ∈ QBS(Ω × X, Y)}.

3.3.3 The Giry Monad on the Category of Quasi-Borel Spaces

Let M_X be a quasi-Borel space. Any α ∈ M_X and µ ∈ G(Ω) determine a probability

measure on the measurable space (X, FMX ) via the pushforward α∗µ onto (X, FMX ).

As (α, µ) and (β, ν) may pushforward to the same probability measure on (X, FMX )

we can define an equivalence relation on pairs where (α, µ) ∼ (β, ν) if α∗µ = β∗ν

(as probability-measures). As a set this is a quotient set of MX × G (Ω) where G (Ω) is the image of Ω under the Giry endofunctor in Meas (regarded as a set). We now

need to give PX = (MX × G (Ω)) / ∼ a quasi-Borel structure. Define

MPX := {β :Ω → PX | ∃α ∈ MX .∃g ∈ Meas (Ω, G (Ω)) .∀ω ∈ Ω.β (ω) = [α, g (ω)]}

where [α, µ] is the image of (α, µ) under the quotient map π : MX × G (Ω) → PX . Heunen, Kammar, Staton, and Yang prove that this construction yields a strong monad on QBS. The following lemma is an important observation about the Giry endofunctor applied to the sample space Ω.

51 ∼ Lemma 85. MPΩ = Meas (Ω, G (Ω))

Proof. Suppose f ∈ Meas(Ω, G (Ω)). Then f (ω) = [1Ω, f (ω)] so f ∈ MPΩ .

Now, suppose β ∈ M_{PΩ}. Then there exist α ∈ M_Ω = Meas(Ω, Ω) and g ∈ Meas(Ω, G(Ω)) such that β(ω) = [α, g(ω)], i.e. β determines a map g ◦ α : Ω → G(Ω) in Meas(Ω, G(Ω)).

3.3.4 De Finetti Theorem for Quasi-Borel Spaces

DeFinetti’s representation theorem is a foundational theorem in Bayesian statistics. For completeness, its statement is provided below.

Theorem. (DeFinetti’s Representation Theorem) Let (Ω, F, µ) be a probability space, and let (X, B) be a Borel space. For each n, let X_n : Ω → X be measurable. The sequence {X_n}_{n=1}^∞ is exchangeable if and only if there is a random probability measure P on (X, B) such that, conditional on P = ρ, the {X_n}_{n=1}^∞ are IID with distribution ρ. Furthermore, if the sequence is exchangeable, then the distribution of P is unique, and P_n(B) converges to P(B) almost surely for each B ∈ B [119].

Heunen, Kammar, Staton, and Yang formulate a version of the DeFinetti Theorem for Quasi-Borel Spaces. Before giving the statement of this theorem, we must explain exchangeability in the language of Quasi-Borel spaces.

Definition. A probability measure (α, µ) on ∏_{i∈N} X_i is said to be exchangeable if for all permutations π : N → N, [α, µ] = [α_π, µ], where α_π(ω)_i := α(ω)_{π(i)} for all i ∈ N.

Theorem. (DeFinetti’s Theorem for Quasi-Borel Spaces) If (α, µ) is an exchangeable probability measure on ∏_{i=1}^∞ X_i, then there exists a probability measure (β, ν) in G(G(X)) such that for all n ≥ 1, the measure ([β, ν] >>= iid_n) on G(∏_{i=1}^n X) equals G((−)_{1···n})(α, µ) when considered as a measure on the product measurable space ∏_{i=1}^n X with the product sigma algebra, where (−)_{1···n} : ∏_{i=1}^∞ X_i → ∏_{i=1}^n X_i is defined by (x_1, . . . , x_n, x_{n+1}, . . . ) ↦ (x_1, . . . , x_n) [65].

Quasi-Borel spaces provide a category that can be used as the denotational semantics of a probabilistic programming language. However, their construction does not naturally handle extensions of sample spaces. As we discussed at the beginning of this chapter, in the course of executing a program which involves

52 sampling from probability distributions, sample spaces are constructed in memory as needed. Suppose we draw a collection of n samples from a standard normal distribution and another n samples from some binomial distribution and coupling the results into a data frame. The sample space for the joint distribution collected in the data frame is the independence join of the probability spaces used to generate the samples. This suggests that a notion of extensibility is a natural requirement to embed in whatever type of categorical framework we use for modeling probabilistic programming. The notion of extensibility is reminiscent of Lawvere’s idea of topos theory being a framework for working with variable sets. We use this as motivation for constructing a sheaf topos on a particular category of probability spaces which will serve as a framework for discussing probabilistic concepts. Before we can achieve this, we need to briefly recall some important properties of standard Borel spaces. Standard Borel spaces are the prototypical examples of well-behaved measurable spaces and these will be used in the construction of our sheaf topos.

3.4 Standard Borel Spaces

Data on a computer is ultimately represented as a bit-string. The types of measurable spaces which should be sufficiently well-behaved to be modeled on computers should, in some sense, be well-approximated by a bit-string. Recall that a measurable map f : (X, F_X) → (Y, F_Y) is said to be exactly measurable if f^{-1}(F_Y) = F_X. A natural condition on a measurable space could be that there exists an exactly measurable map from (X, F_X) into S = {0, 1}^N with the sigma algebra generated by the cylinders of finite bit-strings. Note this is equivalent to the Borel sigma-algebra induced by the metric

    d({s_n}_{n∈N}, {s′_n}_{n∈N}) := Σ_{n∈N} 2^{−n} |s_n − s′_n|.
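For intuition, the following Haskell helper evaluates a truncation of this metric on the first k coordinates of two bit strings; dApprox is an illustrative name and not part of the formal development.

    -- Truncated evaluation of d(s, s') = Σ_n 2^(-n) |s_n - s'_n|, summing the
    -- first k coordinates only.
    dApprox :: Int -> [Int] -> [Int] -> Double
    dApprox k s t =
      sum [ 2 ** negate (fromIntegral n) * fromIntegral (abs (a - b))
          | (n, a, b) <- zip3 [1 .. k] s t ]

    main :: IO ()
    main = print (dApprox 8 (cycle [0, 1]) (repeat 0))  -- 0101... versus 0000...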

A theorem, due to Mackey, identifies such measurable spaces as exactly the countably generated measurable spaces.

Theorem. (Mackey, 1957) A measurable space is countably generated if and only if there exists an exactly measurable mapping into S [95].

53 Standard Borel spaces are an important class of well-behaved measurable spaces first introduced in [95]. Many important results in probability and statistics are not true for probability measures on arbitrary measurable spaces but do hold for standard Borel spaces. In this dissertation, we will focus on probability measures on standard Borel spaces. In fact, any time we say measurable space in this dissertation it should be assumed that we mean standard Borel space. For a more detailed treatment of the facts collected here, we refer the reader to the survey article [111] or the textbook [130].

Definition 86. Let (X, F) be a measurable space. (X, F) is said to be standard Borel if there exists a metric on X which makes it a complete separable metric space in such a way that F is the Borel σ-algebra corresponding to that metric. In other words, F is the Borel sigma algebra associated to a Polish space.

Theorem. For a measurable space (X, F), the following are equivalent [79]:

• (X, F) is a retract of (R, B_R), i.e. there exist measurable maps f : X → R and g : R → X such that g ◦ f = id_X.

• (X, F) is measurably isomorphic to either (R, B_R), (Z, P(Z)), or ([k], P([k])).

• X has a complete metric with a countable dense subset and F is the Borel sigma algebra generated by this metric.

Standard Borel spaces enjoy a number of useful properties which are not true of general measurable spaces:

• They are countably generated [95].

• Any bijective measurable mapping between standard Borel spaces is necessar- ily an isomorphism, i.e. the inverse must also be measurable [127].

• Products and disjoint unions of standard Borel spaces are standard Borel [130].

• A function f : X → Y between two standard Borel spaces is measurable if and only if its graph is also standard Borel [130].

Moreover, a number of important results in probability hold for standard Borel spaces which are known to not hold for arbitrary measurable spaces. Some notable examples include the following:

54 • the Kolmogorov extension property [107,109]

• the existence of conditional distributions [33,40,42]

• the Dynkin extension property [50]

• DeFinetti’s representation theorem [31,37,67]

The properties that standard Borel spaces have which do not hold in a general measurable space seem to strongly suggest that the subcategory of standard Borel spaces is an appropriate category to use as a base category for our sheaf theoretic approach to probabilistic programming.

3.5 Quasi-Borel Sheaves

3.5.1 Sample Space Category

Tao observed that probability theorists often choose notation which de-emphasizes the role of a sample space preferring instead to treat it more as a black box. He emphasizes that random variables should not be thought of as anchored to any one particular sample space and should be identified with their space of extensions. He proposes that the concepts studied in probability theory are precisely those that are invariant under surjective measure preserving mappings [136]. This idea is very suggestive of some sort of category of extensions of a probability space. We make this idea mathematically precise in this section by modifying the definition of quasi-Borel spaces introduced in [65] to be compatible with Tao’s idea of extensibility. In the course of executing a program which involves sampling from probability distributions, sample spaces are constructed in memory as needed. Imagine drawing a collection of n samples from a standard normal distribution and another n samples from a uniform distribution on [0, 1] and coupling the results into a data frame. The sample space for the joint distribution collected in the data frame is the product of the two sample spaces and these spaces are joined as independent when we couple them into the data frame. As such, it is unnatural to think of our program as arising from one fixed sample space declared initially. Online probabilistic systems must have the ability to dynamically generate sample spaces. For this reason,

extensibility is a natural criterion for thinking about probabilistic programming and will be a major motivation behind the introduction of sheaves later in this chapter.

Example 87. One way to implement sampling from a normal distribution would be to sample uniformly from some approximation of [0, 1] and apply the inverse cumulative distribution function of the standard Gaussian, α, to the resulting samples. Using the identity map id : [0, 1] → [0, 1] allows us to sample uniformly from [0, 1]. Consider a quasi-Borel structure on R containing both α and id. A naive coupling of these along their original sample space yields a map α ∨ id : [0, 1] → R × [0, 1] which would not represent an independence join of the two random variables. Instead, we should consider I² = [0, 1] × [0, 1] equipped with the product measure and sample the components. The original α and id can be recovered from projecting onto the first and second coordinates, respectively. As the component projections are surjective, this construction is an example of an extension of sample space.
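A minimal Haskell sketch of this recipe follows. Since the Gaussian quantile function α has no closed form, the exponential quantile function stands in for it here; the point being illustrated is only that the two outputs are drawn from the product space I² = [0, 1] × [0, 1] and therefore form an independence join. The names invCdfExp and samplePair are illustrative.

    import System.Random (randomRIO)

    -- Inverse-CDF sampling with the exponential quantile standing in for α.
    invCdfExp :: Double -> Double -> Double
    invCdfExp rate u = negate (log (1 - u)) / rate

    -- Sample from the product space [0,1] x [0,1] and push the first
    -- coordinate forward along the inverse CDF; the second stays uniform.
    samplePair :: IO (Double, Double)
    samplePair = do
      u1 <- randomRIO (0, 1)
      u2 <- randomRIO (0, 1)
      pure (invCdfExp 2.0 u1, u2)

    main :: IO ()
    main = mapM_ (\_ -> samplePair >>= print) [1 .. 3 :: Int]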

We will define a category S whose objects are standard Borel spaces (S, B_S, µ) equipped with a probability measure µ, and whose morphisms are given by surjective measure preserving mappings. This category has a terminal object and initial object, but not many other nice properties; however, we will see later that it admits a semi-pullback structure which is essential to developing sheaf theory on this category of extensions. This definition is similar to one used by Simpson in an unpublished conference paper; however, Simpson does not require his morphisms to be surjective [123]. In this section, we will see how we can embed the notion of quasi-Borel spaces as a certain family of representable presheaves on this category. For the purposes of modeling computation, we will choose to think of there being some sample space representation, S_0, that is the initial source of randomness. Conceptually, we can think of this as something like the 32-bit space {0, 1}^{32} and suppose all extensions of sample spaces will be spaces with strictly larger cardinality. This is a subcategory of the full category, but adequate enough for our purposes.

In this subcategory, S0 would become the terminal object. We will not dwell on this point further and instead work with the full sample space category unless mentioned explicitly.

56 3.5.2 Quasi-Borel Presheaves

Definition 88. A quasi-Borel presheaf, Q, is a representable presheaf defined in the following manner:

• If α ∈ Q(S) and f : S′ → S is a morphism in S, then α ◦ f ∈ Q(S′) (see the sketch following this list).

• For any S in S, Q(S) contains all constant mappings.

• If S = ∐_{i=1}^∞ B_i where each B_i ∈ B_S and B_i ∩ B_j = ∅ when i ≠ j, and {α_i}_{i=1}^∞ is a sequence of maps in Q(S), then ⊕_{i=1}^∞ α_i, defined by s ↦ α_i(s) for s ∈ B_i, is also in Q(S).
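In miniature, the first closure condition is just precomposition with the sample-space morphism, as in the following illustrative Haskell fragment (restrict is not a name used in the dissertation):

    -- A random element over S pulls back along f : S' -> S by precomposition.
    restrict :: (s' -> s) -> (s -> x) -> (s' -> x)
    restrict f alpha = alpha . f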

The first observation about this structure is that the presheaves so defined are actually sheaves with respect to the atomic Grothendieck topology. In order to prove this, we need to recall a bit of terminology.

3.5.3 Quasi-Borel Sheaves

In order to discuss sheaves on a sample space category, we need to equip it with a Grothendieck topology. A category equipped with a Grothendieck topology is referred to as a site. Recall that a Grothendieck topology on a category C is a function J which assigns to each object C in C a collection of sieves on C satisfying the following properties:

1. The maximal sieve tC = {f | cod (f) = C} is in J (C).

2. (stability axiom) If S ∈ J (C), then the pullback h∗ (S) ∈ J (D) for any arrow h : D → C.

3. (transitivity axiom) If S ∈ J (C) and R is any sieve on C such that h∗ (R) ∈ J (D) for all h : D → C in S, then R ∈ J (C).

A popular choice of Grothendieck topology is the so-called atomic topology.

Definition 89. A site (C,Jat) is called an atomic site if the covering sieves of Jat are given by the inhabited sieves. Jat is referred to as the atomic topology.

In order for the stability axiom to hold on an atomic topology, we need it to be possible for every cospan to be completed to a commuting square, i.e. given a cospan

         X
         |
         v
    Y -> Z

there exists an object W in C along with morphisms W → X, W → Y which makes the following diagram commute:

    W -> X
    |    |
    v    v
    Y -> Z

Notice the similarity in this condition to the definition of pullbacks. The difference between this condition and the more general pullback condition is that there is no universal property requirement on the object W and the morphisms W → X and W → Y. As such these are sometimes referred to as semi-pullbacks in the literature. This property is so important in topos theory that it is given a name.

Definition 90. A category C is said to satisfy the Ore condition if every cospan can be completed to a commuting square.

Verifying the Ore condition for our sample space category follows from a construction due to Edalat.

Theorem. (Edalat, 1998) The category of standard Borel spaces equipped with probability measures admits semi-pullbacks. [43]

Note that this is not true for arbitrary measurable spaces [52]. The basic idea behind Edalat’s construction is to integrate over the fibers of product measures of the regular conditional probabilities. A theorem due to Johnstone relates this condition to the internal logic of the topos Sˆ.

Theorem. (Johnstone, 1979) Let C be a category. Then Cˆ is a De Morgan topos if and only if C satisfies the Ore condition [76].

This means that the topos of presheaves Sˆ is a De Morgan topos, i.e. its internal logic is a Heyting algebra satisfying the De Morgan laws ¬(a ∧ b) = ¬a ∨ ¬b and ¬(a ∨ b) = ¬a ∧ ¬b. However, it should be noted that the second law holds in every Heyting algebra [18]. Although not required by the Ore condition, the Edalat construction is known to obey a universal property. Simpson attempted to characterize independence and conditional independence in purely category-theoretic terms [121]. Simpson’s work allows the Edalat construction to be seen as universal with respect to independence products. With this background out of the way, we can now prove that every quasi-Borel presheaf is in fact a sheaf with respect to the atomic Grothendieck topology. This result is similar to a result due to Simpson for equivalence classes of random variables stated in [123] without proof.

Lemma 91. Any quasi-Borel presheaf, Q, is a sheaf with respect to the atomic Grothendieck topology.

Proof. A presheaf $R$ on an atomic site $(\mathbf{S}, J_{at})$ is a sheaf if and only if for any morphism $q : S' \to S$ and any $\alpha \in R(S')$, if $R(r_1)(\alpha) = R(r_2)(\alpha)$ for all diagrams
$$S'' \overset{r_1}{\underset{r_2}{\rightrightarrows}} S' \xrightarrow{\;q\;} S$$
with $q \circ r_1 = q \circ r_2$, then there is a unique $\gamma \in R(S)$ such that $\alpha = R(q)(\gamma)$ [94]. To check that this condition holds, notice that it yields the following commutative diagram in Set:
$$S'' \overset{r_1}{\underset{r_2}{\rightrightarrows}} S' \xrightarrow{\;q\;} S \xrightarrow{\;\gamma\;} X, \qquad \alpha : S' \to X.$$
We would like to define $\gamma(\omega)$ to be $\alpha(q^{-1}(\omega))$. However, this is only well-defined if $\alpha$ is constant on the fibers $q^{-1}(\omega)$. By way of contradiction, suppose there exist $\omega'_1$ and $\omega'_2$ in $S'$ with $\alpha(\omega'_1) \neq \alpha(\omega'_2)$ and $q(\omega'_1) = q(\omega'_2)$. Lifting again, we get that there are $\omega''_1$ and $\omega''_2$ with $r_1(\omega''_1) = \omega'_1$ and $r_2(\omega''_2) = \omega'_2$. By assumption, $R(r_1)(\alpha) = R(r_2)(\alpha)$ and hence $\alpha \circ r_1 = \alpha \circ r_2$. However, $\alpha \circ r_1(\omega''_1) = \alpha(\omega'_1) \neq \alpha(\omega'_2) = \alpha \circ r_2(\omega''_2)$. Hence, it must be the case that the map $\alpha$ is constant on fibers. Thus, any quasi-Borel presheaf is a sheaf.

Notice the similarity of this definition to the definition of sheaves in terms of matching families. Informally, the above lemma can be interpreted as stating that every morphism is a cover with respect to the atomic topology. If we interpret this in the language of the presheaf of random variables, the condition that $P(g)(y) = P(h)(y)$ for all diagrams
$$E \overset{g}{\underset{h}{\rightrightarrows}} D \xrightarrow{\;f\;} C$$
with $f \circ g = f \circ h$ is basically saying that the random variable $y$ is not exploiting any of the additional structure of the sample space extension afforded to it as an extension of the space $C$. In essence, this appears to be a statement about the random variable being constant on fibers, which is suggestive of regular conditional probabilities. The trouble with this approach is that regular conditional probabilities are only defined almost surely. As such, we could adjust by passing to equivalence classes of random variables. This may initially seem appealing for the reasons stated when we first defined the sample space category. However, such a construction makes the theory of stochastic processes difficult; it would make properties like the almost sure continuity of the paths of a Brownian motion ill-defined. Nevertheless, by defining these structures in terms of the maps themselves, we do not have to deal with the nuances of the underlying probability measures. In terms of probabilistic programming, any probability measure that we implement on the sample space can be pushed forward onto our quasi-Borel space of outcomes via the collection of maps we have already defined from the sample space into another data type.

3.5.4 Lifting Measures Lemma

In order to discuss probability theory, we need to understand lifts of measures in the sample space category S. In other words, given a measure μ on S, we want to ensure there is a way of extending the measure to any space S′ where q : S′ → S is a surjective measurable map. The fact that arbitrary measures have lifts to the extended sample spaces is the subject of the next lemma.

Lemma 92. Given a measure $\mu : 1 \to G(S)$, $\mu$ lifts to a measure in $G(S')$.

Proof. Given some $B' \in \mathcal{F}'$, the collection $\uparrow B' := \{B \in \mathcal{F} \mid q^{-1}(B) \supset B'\}$ is non-empty, as $q^{-1}(S) = S' \supset B'$. Similarly, we may define a collection $\downarrow B' := \{B \in \mathcal{F} \mid q^{-1}(B) \subset B'\}$. This allows us to construct outer and inner approximations to any measure on $S'$ which is a lift of $\mu$. We define the inner approximation by $\mu_*(B') := \sup\{\mu(B) \mid B \in \downarrow B'\}$ and the outer approximation by $\mu^*(B') := \inf\{\mu(B) \mid B \in \uparrow B'\}$. Any lift $\tilde{\mu}$ of $\mu$ must then satisfy the following inequality:
$$\mu_*(B') \leq \tilde{\mu}(B') \leq \mu^*(B').$$

By way of contradiction, suppose that no such $\tilde{\mu}$ existed. Then there would be a sequence of disjoint sets $\{B'_i\}_{i\in\mathbb{N}}$ belonging to $\mathcal{F}'$ for which
$$\sum_{i=1}^{\infty} \mu_*(B'_i) > \mu^*(B')$$
where $B' = \cup_{i\in\mathbb{N}} B'_i$. For each $B'_i$, there exists a $B_i \in \mathcal{F}$ with $B_i \in \downarrow B'_i$ satisfying
$$\mu_*(B'_i) < \mu(B_i) + \frac{\epsilon}{2^i}.$$

Moreover, the collection $\{B_i\}_{i\in\mathbb{N}}$ consists of pairwise disjoint sets. Hence $\sum_{i=1}^{\infty}\mu(B_i) = \mu(B)$ where $B = \cup_{i\in\mathbb{N}} B_i$. Now, as $B \subset C$ for any $C \in \uparrow B'$, we see that
$$\sum_{i=1}^{\infty}\mu_*(B'_i) < \epsilon + \sum_{i=1}^{\infty}\mu(B_i) = \epsilon + \mu(B) \leq \epsilon + \mu^*(B').$$
Letting $\epsilon \to 0$ establishes
$$\sum_{i=1}^{\infty}\mu_*(B'_i) \leq \mu^*(B')$$
and so a lift $\tilde{\mu}$ of $\mu$ must indeed exist.

Remark. The construction $\mu_*$ is actually a probability measure on the lift. Note that if $B'_1$ and $B'_2$ are disjoint then so are $\downarrow B'_1$ and $\downarrow B'_2$, and thus $\mu_*$ will be countably additive. Moreover, the full space will have measure 1 because $q^{-1}(S) = S'$, so $\mu_*(S') = 1$. What this lemma says in terms of programming languages is that if we implement a probability measure on a sample space like $\{0,1\}^{32}$ and later construct an extension $\{0,1\}^{64}$, then as long as we construct a surjective map connecting these, e.g. $q : \{0,1\}^{64} \to \{0,1\}^{32}$ defined by $q(b_1, \ldots, b_{32}, b_{33}, \ldots, b_{64}) = (b_1, \ldots, b_{32})$, we can guarantee that we can construct a measure on the larger space which pushes forward to the original sample space.
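To make this remark concrete, the following Python sketch (standard library only; the sample spaces are shrunk to 2- and 3-bit strings so the dictionaries stay small, and the measure mu is an arbitrary illustrative choice) lifts a measure along the projection q by splitting the mass of each point uniformly over its fiber, and then checks that the pushforward along q recovers the original measure.

from itertools import product

# Base sample space S = {0,1}^2 with an arbitrary probability measure mu.
S = list(product([0, 1], repeat=2))
mu = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

# Extension S' = {0,1}^3 with the surjective map q(b1, b2, b3) = (b1, b2).
S_prime = list(product([0, 1], repeat=3))
q = lambda b: b[:2]

# One possible lift: split the mass of each point of S uniformly over its fiber q^{-1}(s).
fiber = {s: [b for b in S_prime if q(b) == s] for s in S}
mu_tilde = {b: mu[q(b)] / len(fiber[q(b)]) for b in S_prime}

# Pushforward of the lift along q; it agrees with the original measure mu.
pushforward = {s: sum(mu_tilde[b] for b in S_prime if q(b) == s) for s in S}
assert all(abs(pushforward[s] - mu[s]) < 1e-12 for s in S)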

3.6 Probability Theory for Quasi-Borel Sheaves

3.6.1 Events

We have previously seen that there is a bijection between $\mathcal{F}_{M_X}$ and $\mathbf{QBS}(X, 2)$ for any quasi-Borel space $(X, M_X)$. In order to understand sub-sigma-algebras as constructions within QBS, we should first understand a sigma algebra diagrammatically in terms of the quasi-Borel space $(2, M_2)$ where $M_2 = \mathbf{Meas}(S, 2)$. Sigma algebras are required to be closed under complementation. This can be understood in terms of characteristic functions by observing that $\chi_A + \chi_{A^C} = 1$. In other words, for every $A : 1 \to M_2$ there exists $A^C : 1 \to M_2$ such that the diagram below commutes:
$$\begin{array}{ccc} 1 & \xrightarrow{\;A\;} & M_2 \\ {\scriptstyle A^C}\downarrow & & \downarrow{\scriptstyle \neg} \\ M_2 & \xrightarrow{\;1_{M_2}\;} & M_2. \end{array}$$
The other defining property of a sigma algebra is that it is closed under countable unions. Equivalently, we could require the sigma algebra to be closed under countable intersections. We choose the latter approach because it is easier to express in the language of characteristic functions. In terms of characteristic functions, closure under countable intersections means that for any mapping $\{A_i\}_{i\in\mathbb{N}} : 1 \to \prod_{i\in\mathbb{N}} M_2$, there is a mapping $A : 1 \to M_2$ such that the diagram below commutes:
$$\begin{array}{ccc} 1 & \xrightarrow{\;\prod_{i\in\mathbb{N}} A_i\;} & \prod_{i\in\mathbb{N}} M_2 \\ & {\scriptstyle A}\searrow & \downarrow{\scriptstyle \bigcap} \\ & & M_2. \end{array}$$

Recall that in Meas there is a one-to-one correspondence between events and characteristic functions. In other words, for every object $(S, \mathcal{B}_S)$ in Meas, we have a canonical isomorphism $\mathcal{B}_S \cong \mathbf{Meas}(S, 2)$. Now if $q : S' \to S$ is an extension of $S$, then an event $B \in \mathcal{B}_S$ can be identified with the event $q^{-1}(B) \in \mathcal{B}_{S'}$. The corresponding characteristic function leads to a commutative diagram:
$$\begin{array}{ccc} S' & & \\ {\scriptstyle q}\downarrow & \searrow{\scriptstyle \chi_{q^{-1}(B)}} & \\ S & \xrightarrow{\;\chi_B\;} & 2. \end{array}$$

Thus, to understand events we can construct the following presheaf.

Definition 93. (Event Presheaf) Let S be a sample space category. The presheaf

of events can be identified with the Yoneda embedding h2 := Meas (−, 2) where 2 = {0, 1} is given the sigma-algebra of all its subsets.

3.6.2 Global Sections, Local Sections, and Subsheaves

Heunen, Kammar, Staton, and Yang do not incorporate measures into their underlying sample space. As such, they define a probability measure to be a pair (α, μ) where α ∈ M_X for a quasi-Borel space (X, M_X) and μ is a probability measure on the underlying standard Borel space. Because we have incorporated probability measures into our definition of the underlying sample space category, we can identify random variables with points (or global sections) of our sheaf of random variables of some fixed type. Global sections of a quasi-Borel sheaf can be identified with points in the outcome space because the component function of the terminal object must select out a map from a one-element set into the outcome space of the quasi-Borel sheaf. Any map from a singleton set into another set determines a unique point, namely the image of the singleton. Thus, global sections of a quasi-Borel sheaf are simply the points in the outcome space. Local sections of a quasi-Borel sheaf are more interesting. Because these are defined as morphisms from a subsheaf of the terminal sheaf into a quasi-Borel sheaf, this allows for the possibility of mapping the terminal sample space to the empty set. This added flexibility allows for the possibility of randomness. A random variable, along with its collection of extensions, can be identified as a local section of a quasi-Borel sheaf. Effectively, the local section picks out the random variable α, along with its q-extensions α ◦ q for each surjective measure-preserving map q.

The product construction for quasi-Borel spaces represents a joint distribution on outcomes. This construction lifts naturally to quasi-Borel sheaves: since X × Y is just another set, we can construct a quasi-Borel sheaf of maps into X × Y in the same manner as before. However, the category of presheaves has its own product, namely the component-wise product of presheaves. These two notions are isomorphic by the universal property of products in Set. When discussing independence, most authors begin with a discussion of independence of sub-sigma-algebras. Let $\mathcal{B}_X$ and $\mathcal{G}_X$ be two sigma algebras on the set $X$. We say $\mathcal{G}_X$ is a sub-sigma-algebra of $\mathcal{B}_X$ if $\mathcal{G}_X \subset \mathcal{B}_X$. Another way of stating this is that the inclusion mapping $i : (X, \mathcal{B}_X) \hookrightarrow (X, \mathcal{G}_X)$ is measurable. Recall that two sigma fields $\mathcal{G}_1$ and $\mathcal{G}_2$ are said to be independent with respect to a probability measure $p$ if for all $G_1 \in \mathcal{G}_1$ and $G_2 \in \mathcal{G}_2$, $p(G_1 \cap G_2) = p(G_1)\,p(G_2)$. From independence of sigma fields, we can define the independence of random variables as follows: we say two random variables $(f : \Omega \to X, p)$ and $(g : \Omega \to Y, p)$ are independent if the sigma-algebras $f^{-1}(\mathcal{B}_X)$ and $g^{-1}(\mathcal{B}_Y)$ are independent. From this definition, it is possible to prove that two random variables are independent if and only if their joint distribution is the product of their marginal distributions [112]. For quasi-Borel sheaves, independent random variables can be identified with a subsheaf $R_X \perp R_Y \subset R_X \times R_Y$, where $(\alpha_X, \alpha_Y) \in R_X \perp R_Y$ if and only if $\alpha_X$ and $\alpha_Y$ are independent.

3.6.3 Expectation as a Sheaf Morphism

Let $X = \mathbb{R}^k$ for some $k \in \mathbb{N}$. With a quasi-Borel structure, the sigma algebra on the outcome space is defined so that all maps inside the quasi-Borel space are measurable with respect to the constructed sigma algebra. As such, it is sensible to construct an expectation operator on quasi-Borel sheaves. The codomain of this operation will need to be an extension of the real numbers:
$$\tilde{\mathbb{R}} := \mathbb{R} \amalg \{\infty, -\infty, \text{undefined}\}.$$
Let $\tilde{\mathbb{R}}$ also denote the constant presheaf on $\tilde{\mathbb{R}}$. Expectation is then a morphism $\mathbb{E} : R_X \Rightarrow \tilde{\mathbb{R}}$ which we can define on components as:
$$\mathbb{E}_S[\alpha] := \int_S \alpha(s)\, d\mu = \int_X x\, d\alpha_*\mu.$$

Note that if $q : S' \to S$ in S, then, since $q$ is measure-preserving,
$$\mathbb{E}_{S'}[\alpha \circ q] = \int_{S'} (\alpha \circ q)(s')\, d\mu_{S'} = \int_X x\, d(\alpha \circ q)_*\mu_{S'} = \mathbb{E}_S[\alpha].$$

Thus, $\mathbb{E}_{-}$ is a morphism of presheaves. If $f : R_X \to R_X$ is a morphism of quasi-Borel sheaves, we can define expectation similarly on components:
$$\mathbb{E}_S[f \circ \alpha] := \int_S f(\alpha(s))\, d\mu = \int_X f(x)\, d\alpha_*\mu.$$

Example 94. Let $S = [6]$ where $[6] = \{1, 2, 3, 4, 5, 6\}$ and let $S' = [6] \times [6]$. Equip both $S$ and $S'$ with their uniform probability measures. Define $q : S' \to S$ by $q(x_1, x_2) = x_1$. Note that $q$ is surjective because it is a projection operator, and it is measure-preserving because $\mu_{S'}(q^{-1}(i)) = \mu_S(i)$ for each $i \in [6]$. Thus, $q$ is a legitimate extension of the sample space. Let $\alpha : [6] \to \mathbb{R}$ be the obvious embedding. Then
$$\mathbb{E}_S[\alpha] = \sum_{i\in[6]} i\,\mu_S(i) = \frac{21}{6}.$$
On the other hand,
$$\mathbb{E}_{S'}[\alpha \circ q] = \sum_{(i,j)\in[6]^2} \alpha(q(i,j))\,\mu_{S'}(i,j) = \frac{1}{36}(6 \cdot 21) = \frac{21}{6}.$$
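The computation in Example 94 is easy to verify numerically. The short Python sketch below (standard library only; it assumes nothing beyond the example itself) computes both expectations exactly and confirms that they agree.

from itertools import product
from fractions import Fraction

S = range(1, 7)                    # [6] with the uniform measure
S_prime = list(product(S, S))      # [6] x [6] with the uniform measure
q = lambda pair: pair[0]           # surjective, measure-preserving projection
alpha = lambda s: s                # the obvious embedding [6] -> R

E_S = sum(Fraction(alpha(s), 6) for s in S)
E_S_prime = sum(Fraction(alpha(q(w)), 36) for w in S_prime)

assert E_S == E_S_prime == Fraction(21, 6)   # both equal 7/2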

3.7 Future Work

3.7.1 Probabilistic Programming and Simulation of Stochastic Processes

The demands of probabilistic programming and the demands of simulating stochastic processes seem to impose contradictory requirements. Probabilistic programming relies heavily on conditioning, and since conditional expectation is only defined almost everywhere, it seems natural to replace random variables with equivalence classes of random variables where two random variables are identified if they agree almost surely. Unfortunately, descending to equivalence classes makes many desirable properties of stochastic processes untrue (e.g. almost-sure continuity of paths of a Brownian motion). How can the seemingly conflicting demands of these two applications be balanced?

3.7.2 Categorical Logic and Probabilistic Reasoning

Topos theory provides a wealth of tools for analyzing the internal logic of the constructed topos. Does the internal logic of this construction reflect the logic of plausible reasoning as formulated in [74]? More broadly, what is the relationship between the internal logic of this sheaf topos and statistical inference? The approach outlined in this chapter is reminiscent of the non-commutative viewpoint. For instance, manifold theory from this perspective has been developed in [106]. Perhaps the categorical perspective could help with generalizing probabilistic reasoning to non-commutative situations such as those arising in free probability [99] or quantum Bayesianism [21].

3.7.3 Sample Space Category and the Topos Structure

Although there are counterexamples which prohibit the larger category of measurable spaces from satisfying the Ore condition, it is possible to extend the definition of the atomic topology to arbitrary categories by defining the atomic Grothendieck topology to be the smallest Grothendieck topology containing the inhabited sieves. For this definition of the atomic topology on an arbitrary category, it is still the case that the sheaf category Sh(C, J_at) is an atomic Grothendieck topos, i.e. the subobject lattice of every object is a complete atomic Boolean algebra [22]. This observation could perhaps be a stepping stone for moving the ideas presented in this chapter to more general classes of measurable spaces. More broadly speaking, how does restricting or enhancing the types of spaces we allow as sample spaces affect the subsequent sheaf topos? Do different choices for the underlying base category result in equivalent sheaf topoi?

3.7.4 Extension of the Giry Monad

Can the Giry monad be extended to the presheaf topos Sˆ or the sheaf topos Sh (S)? In chapter 6 of this dissertation we will make the argument that implementing the Giry monad is rather useful for statistical computing with missing or conflicting data. As such, if probabilistic programming can be given sheaf theoretic semantics, it would be helpful to generalize the Giry monad to arbitrary presheaves or sheaves. Another way of framing this is whether or not the collection of probability measures (along with the unit and multiplication operations) can be given a purely category theoretic definition. For the reader interested in pursuing this direction, we suggest the recent paper [132]. Another possible approach in this direction is provided in the conference paper [123]. This construction appears to rely on the co-Yoneda lemma realizing every presheaf as a colimit of representable presheaves.

Chapter 4 | Categorical Logic and Relational Databases

4.1 Introduction

In this chapter we will discuss a simple categorical formalism for defining databases in a mathematically rigorous way. Databases are ubiquitous in modern computing. We demonstrate how the language of topos theory can be used to understand the structure of databases. The traditional mathematical perspective on databases was developed by Edgar F. Codd at IBM and is known as relational algebra [23]. Earlier work has described many operations common in relational databases through the language of category theory. Topos theory is a natural setting for discussing variable sets and multi-valued logics. Rosebrugh and Wood use this viewpoint to discuss a dynamic view of databases [114]. In particular, database updates are modeled by indexing a collection of database objects by a topos, and non-Boolean logics are explored for databases through the lens of sheaf theory. Later work in this direction emphasized the role of sketches to formalize schemas, representing the data itself as a model of the sketch [49, 75]. Baclawski, Simovici, and White define databases as constructions within an arbitrary topos. In particular, they work out how selections, squeeze (elimination of duplicates), projections, joins, and Boolean operations can be performed as constructions inside an arbitrary topos [13]. It has also been shown how database models based on simplices can be used to support type-theoretic operations along

with database queries [51, 128]. More recent work has focused on representing concrete data like integers and strings [120]. The underlying model for SQL is based on multisets [53] while the relational algebra due to Codd is based on relations [23]. In this chapter, we create a multiset model similar to the model underlying SQL and show that this model is sufficient to express all the operations due to Codd as constructions within the category of sets. Such a formulation is interesting because simple extensions of the underlying model allow us to express constructions outside the traditional relational model, such as outer joins. By focusing on concrete constructions within the category of sets, we can easily generalize to non-binary logics. Purely topos-theoretic models are unable to capture the logic of SQL, for instance, because the internal logic of any topos must be a Heyting algebra and the three-valued logic of SQL is not a Heyting algebra. Our model has the advantage that it is easily adaptable to more general logics by substituting the two-element set, along with the operations of AND, OR, and NOT, with their corresponding counterparts for a logic with more values. We also demonstrate how extensions allow us to account for null values and missing data, extending the discussion in [101]. Near the end of this chapter, we define a simplicial structure (Section 4.6.3) and a graph associated to a database schema and prove a result relating properties of this graph to whether or not agreement on marginal tables is sufficient to ensure that the marginals can arise from a joint distribution on the full outcome space. In particular, we provide sufficient conditions for a joint table on the full column set to exist (Lemma 98, Proposition 101). This result is foundational to the next chapter, where we attach an additional topological structure to this simplicial complex and use it to weaken the common assumption in statistics that the family of marginals under consideration arises as projections of a joint distribution on the full column space.

4.2 Data Tables

In this section, we discuss databases as constructions in the topos Set, since we will study random variables whose outcome space is a table space. This construction will be important in the next chapter as we discuss the extension of statistical concepts to databases. We will initially focus on the case of a database containing a single table and on random variables whose outcome space is a table. In the next chapter, we will see how sheaf theory allows us to extend these ideas from single tables to databases whose tables contain overlapping columns. We construct an appropriate category of databases and show that all operations in the relational algebra exist for this category. Such a formulation allows us to develop mathematical models for thinking about databases consisting of multiple tables or more general databases in distributed systems. In future chapters, we will use this construction to aid in statistical modeling of databases. The relational model of a table in a database conceptualizes data as a two-dimensional frame called a relation. For example, a database of students may consist of columns containing the student's name, ID number, email, and phone number. Each row in the table would correspond to the records of an individual student. Our goal will be to build a model of such tables inside Set. In this section, we use the word relation as it is used by the database community. As will become apparent, this is not the same thing as a relation in mathematics. We will formalize this notion using several equivalent representations of multisets inside the category Set. The implementation of SQL involves the use of a three-valued logic, adding a third value UNKNOWN to the standard TRUE and FALSE [53]. Topos theory is a framework that allows more general logic than Boolean logic. As such, the categorical framework in this chapter can help with the exploration of more general database logics. These more general logics have been explored previously in [13, 114]. However, the logic underlying SQL is not a Heyting algebra and thus cannot be represented as the internal logic of any topos. By relying on concrete constructions within Set, we are able to construct a framework that easily generalizes to other logics. Moreover, in SQL there are three types of relations: stored relations, views, and temporary tables. Stored relations are the actual tables saved in the database management system. Views are relations which are constructed by computation; these are not stored after they are used. Temporary tables are constructed by the SQL language processor when it executes queries. After the query is executed and the appropriate modifications are made, these tables are removed from memory [53]. In order to represent the space of possible tables formed from a database schema, we need to discuss the possible operations that can be performed with a collection

of tables. Ultimately, this will lead us to a category representing the possible structures. In this section, we will see how operations can be built up from a collection of simpler primitive operations. We can think of the intermediate diagrams as temporary tables.

4.2.1 Attributes

Tables are collections of data points containing multiple attributes. We first establish an attribute space for databases. An attribute is simply a descriptive label used to describe entries in the corresponding column of a table. When visualizing a table, it is common for the attributes to be pictured in the top row. Thus, we can identify an attribute with a one-element set containing that label, e.g. {Student_ID}. Given a finite set of characters, C (think of the collection of all Unicode characters, for instance), we can form the set of strings on C, $S_C$, by defining
$$S_C := \coprod_{n\in\mathbb{N}} \left(\prod_{i\in[n]} C\right).$$

In practice, the size of this set is actually finite due to memory limitations. Nevertheless, we can think of a particular attribute name as a point $1 \to S_C$. For example, a table containing students could have the following collection of attributes

A = {student_name, student_ID, email, phone} ⊂ SC .

Thus, attribute labels for a table can be identified with finite sub-objects of $S_C$; that is, the attribute labels of a table can be seen as a monomorphism $E \rightarrowtail S_C$ where $|E| < \infty$.

4.2.2 Attribute Spaces (Data Types)

Each attribute tabulated in a table takes values in some type of space which we will call the column space of the table. To each attribute $a \in A$, we associate a set, $X_a$, which denotes the collection of possible values of the attribute $a$. For instance, if $a$ is an attribute representing the number of red cards in a player's 5-card poker hand, then the attribute space could be taken to be the set of all non-negative integers less than or equal to five. However, in most languages, we would simply declare an integer type for this particular situation. More broadly, we can take the attribute space to be the outcome set of any random variable of interest.

4.2.3 Missing Data

We next account for missing data. Many real-world data sets contain records with missing values for certain columns. To model this we can adjoin a singleton set {NA} to the output space of our random variable, i.e. we can consider random variables having outcome space $X \amalg \{NA\}$ (considered as a coproduct in the category of measurable spaces). In SQL and Codd's relational model, NULL is used in place of NA. We choose NA because data frames in R represent missing records in this way. As such, we would like to be able to account for models involving missing records. One of the simplest models for missing data is data that is missing completely at random (MCAR), as introduced in [58]. Any statistical model can naturally lift to a model where records are missing with some fixed probability (1 − α). In the previous chapter, we introduced an endofunctor G on the category of measurable spaces Meas. We can use this endofunctor to extend a statistical model with MCAR data.

Lemma 95. $G(X \amalg Y) = \mathrm{Conv}(G(X), G(Y))$ where
$$\mathrm{Conv}(G(X), G(Y)) := \{\alpha p + (1 - \alpha) q \mid p \in G(X),\, q \in G(Y),\, \alpha \in [0, 1]\}.$$

Proof. The inclusion $\supset$ is obvious. Let $p \in G(X \amalg Y)$ and set $\alpha = p(\{0\} \times X)$. If $\alpha = 0$, we have the desired decomposition. As such, assume $\alpha \neq 0$. Then we can define a probability measure $p_X$ on $X$ by $p_X(B) = \frac{1}{\alpha}\, p(\{0\} \times B)$. Analogously, $p(\{0\} \times X) = \alpha$ implies $p(\{1\} \times Y) = 1 - \alpha$, and as long as $\alpha \neq 1$ we can define a probability measure $q_Y$ on $Y$ by $q_Y(B) = \frac{1}{1-\alpha}\, p(\{1\} \times B)$. This gives us the desired decomposition of $p$.

Corollary. $G(X \amalg \{NA\}) = \mathrm{Conv}(G(X), G(\{NA\}))$.

Another simple consequence of this observation is that statistical models can be extended to the coproduct $X \amalg \{NA\}$ by taking a mixture model. This construction may not be appropriate for all circumstances, such as in the presence of censored data.

Remark 96. Any statistical model $m : P \to G(X)$ extends to a model $\tilde{m} : (P, \alpha) \to G(X \amalg \{NA\})$ via convex hulls (i.e. mixture models), i.e. $\tilde{m}(p, \alpha) = (1 - \alpha)\, p + \alpha\, \delta_{\{NA\}}$.

In the future, we will use the shorthand $\tilde{X}$ to refer to $X \amalg \{NA\}$. Also, observe that there is a canonical inclusion $i : X \hookrightarrow \tilde{X}$. These observations will be important when we introduce measurable presheaves defined on contextual categories later in this dissertation. Recall that the space $\tilde{X}$ can be given a coproduct sigma algebra structure from any sigma algebra on $X$.
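The following Python sketch illustrates Remark 96 for a finite outcome space: a distribution p on X is extended to X ∐ {NA} as the mixture (1 − α)p + αδ_NA. The function name extend_mcar and the Bernoulli example are illustrative choices, not part of the formal development.

def extend_mcar(p, alpha, na="NA"):
    # Extend a distribution p (dict: outcome -> probability) on X to X + {NA}
    # by mixing with the point mass at NA, i.e. (1 - alpha) * p + alpha * delta_NA.
    extended = {x: (1 - alpha) * prob for x, prob in p.items()}
    extended[na] = alpha
    return extended

# A Bernoulli(0.7) model whose records are missing completely at random 10% of the time.
p = {0: 0.3, 1: 0.7}
p_tilde = extend_mcar(p, alpha=0.1)
assert abs(sum(p_tilde.values()) - 1.0) < 1e-12   # {0: 0.27, 1: 0.63, 'NA': 0.1}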

4.2.4 Data Types

When designing programming languages to work with a database management system, we often need some finite collection of primitive data types from which to build our language. In the student database example, student_name and email can be regarded as strings while student_ID and phone can be regarded as integers. A string is simply a finite sequence of characters. If C is a set containing all valid characters, the space of strings, S, can be defined as $S := \coprod_{n\in\mathbb{N}} C^n$. Thus, a string, s, is simply a point $s : 1 \to S$. Other common data types include Booleans, bit strings, floats, dates, and times. Booleans are just elements of the set {TRUE, FALSE}. In many languages, such as SQL, the collection of Booleans is augmented to include UNKNOWN so that logical comparison operators can be appropriately defined for records containing NULL values. Bit strings are just elements of $\coprod_{n\in\mathbb{N}} \{0, 1\}^n$. Floats are the computer approximations of real numbers. We ignore implementation details of how these are represented in computer memory and allow our databases to have the real numbers as a data type. Dates and times can be seen as strings with specified formats, and so the collection of dates or times can be seen as sub-objects of the string space S. Most languages for databases require you to set a maximum length for data types like strings and bit strings when declaring a table. For instance, in SQL, to create a column containing strings the declaration requires you to specify an integer n representing the maximum length of an entry. Thus, in SQL, CHAR(n) corresponds to $\coprod_{i=0}^{n} C^i$. As such, we see that inside the topos of sets we can represent all the types of data commonly found in relational database management systems like SQL.

4.2.5 Column Spaces, Tuples, and Tables

4.2.5.1 Column Spaces

For each attribute $a \in A$, we have an associated attribute space $X_a$. As such, the column space of a collection of attributes is simply the product of attribute spaces taken across all $a \in A$, i.e.
$$X^A := \prod_{a\in A} X_a.$$

4.2.5.2 Records

Individual rows of the database are referred to as tuples. A tuple, or record, is simply a point in a column space, i.e. $r : 1 \to X^A$. Continuing with the student example, our student table may contain a record such as

(Juan Batista, 24601, [email protected], 8675309) .

4.2.5.3 Tables

Tables are simply collections of multiple records, i.e. mappings out of a finite set, $F$, into a column space:
$$t : F \to X^A$$
represents a particular table. Note that any table determines a point in the product space indexed by $F$,
$$t' : 1 \to \prod_{f\in F} X^A.$$
Let $n = |F|$ from above and define $(X^A)^n := \prod_{i=1}^{n} X^A$. With this notation, we can think of tables equivalently as points in a finite-dimensional product of attribute spaces, $t : 1 \to (X^A)^n$. Note that there is a natural bijection $\mathrm{Hom}(F, X^A) \cong \mathrm{Hom}\left(1, \prod_{f\in F} X^A\right)$. Many implementations, such as pandas, allow you to summarize a table as a count of distinct values in the column space, in other words, as a set mapping
$$\tilde{t} : X^A \to \mathbb{N}.$$
Note that these three perspectives are all ways of formalizing the notion of a multiset as a construction involving set theory. We refer to $t : [n] \to X^A$ as the memory allocation representation of the table, to $t : 1 \to (X^A)^n$ as the point representation of the table, and, lastly, to $\tilde{t} : X^A \to \mathbb{N}$ as the count of values representation of the table. Note that a single count of values representation can correspond to many different memory allocation or point representations of the same table. Intuitively, our definition of a table should not really depend on the particular details of the set of memory addresses indexing our data. As such, we will now discuss how to construct an equivalence relation on both the memory allocation representation and the point representation so that these three representations are all isomorphic, but not canonically isomorphic.

Let $t_m : M \to X$ be the memory allocation representation of some table. We can construct its count of values representation $t_v$ by defining $t_v(x) = |t_m^{-1}(x)|$. Note that precomposition by any permutation $\sigma : M \to M$, or more generally any isomorphism $\phi : M \to M'$, does not affect the count of values representation. As such, we can define the memory allocation representation of $t_m$ to be the equivalence class of maps into $X$, where two maps $t_m : M \to X$ and $t'_m : M' \to X$ are said to be equivalent if there exists an isomorphism $\phi : M \to M'$ such that $t_m = t'_m \circ \phi$. Following the same line of reasoning, we will consider two point representations of a table, $t : 1 \to \prod_I X$ and $t' : 1 \to \prod_J X$, to be equivalent if there exists an isomorphism $\phi : I \to J$ such that the induced map $\pi^\phi : \prod_I X \to \prod_J X$ satisfies $\pi^\phi \circ t = t'$. At this point all representations of a table are in bijective correspondence, and so we can now unambiguously move between representations as is convenient. In the rest of this section we will focus primarily on the first definition, as it naturally mimics the idea of thinking of a database as a collection of values indexed at various memory addresses. From this point of view, it is easier to make connections with what is occurring in computer memory as we manipulate a database.
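The three representations are easy to see in pandas (assumed version 1.1 or later for DataFrame.value_counts; the small table and its column names are illustrative). A DataFrame indexed by row position plays the role of the memory allocation representation, value_counts gives the count of values representation, and permuting the rows changes the former but not the latter.

import pandas as pd

# Memory allocation representation: a map [n] -> X^A with n = 4.
t = pd.DataFrame({"colour": ["red", "blue", "red", "red"],
                  "size":   ["S",   "M",    "S",   "L"]})

# Count of values representation: a map X^A -> N.
counts = t.value_counts()

# Precomposing with a permutation of the index does not change the counts.
t_permuted = t.sample(frac=1, random_state=0).reset_index(drop=True)
assert counts.sort_index().equals(t_permuted.value_counts().sort_index())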

4.2.6 Primary Keys

A collection of attributes is said to be a key for a table if we do not allow two rows in the table instance to have the same values for all the attributes in the key. In our student example, we could say student_ID is a key if no two students are able to have the same ID number. In practice, many database management systems simply create a primary key for the users because many applications could easily have redundant rows. As our hope is to explore statistical properties of multiple distributed databases, we expect to have many entries with identical features. We can model this behavior by adding an additional column to a table representing a key or index. A table may contain multiple rows with the same data. For instance, if we are tabulating IID copies of a binary random variable, we need to allow for multiple zeroes and ones in our database. This can be remedied by requiring that the rows be labeled by an index (also known as a primary key). For example, we can take $I = \mathbb{N}$. We can think of a row in a database with a primary key as an element $\langle i, x\rangle : 1 \to I \times X$.

A table is typically thought of as a finite relation $T \subset I \times X$ such that $(i, x_1) \in T$ and $(i, x_2) \in T$ implies $x_1 = x_2$. This ensures each primary key is matched with only one data point. To emphasize the finite nature of the relation, we will think of a table as a mapping $t : [n] \to I \times X$. In the future, we will include any primary keys, if they exist, in the existing column space.

4.2.7 Versioning

In many situations, we want to time-stamp or version records in a table. Date-times are a common choice of time-stamp employed in both R and the pandas package for Python. We could also use a counter representing a version number, as is common with software releases. What is necessary of the time-stamps or versions is that the collection of these objects forms a poset. A more complex method of versioning is the TrueTime model introduced by Corbett et al. at Google [24]. In this framework, events are time-stamped by intervals $[t_s, t_e]$ and we are guaranteed that the true UTC time is a point in the interval. The collection of such intervals can be given a poset structure as follows: $[t_{s_1}, t_{e_1}] < [t_{s_2}, t_{e_2}]$ if and only if $t_{e_1} < t_{s_2}$ as real numbers. We can log time-stamps in a database by having a column whose values are tuples representing a start and end time for the given interval. In versioned databases that also employ an index, it is common to use the product $I \times V$ as a primary key for the entries.

4.3 Relational Algebra on Tables

In the previous section, we saw how tables could be represented inside the category of sets. As tables are represented by morphisms from a finite set into some set representing the possible values of each of the columns of the table, we know, because the category of sets has all finite limits and colimits, that many standard categorical constructions such as products, coproducts, pushouts, and pullbacks exist for tables. In this section, we show how to represent the five primitive operations (selection, projection, products, unions, and differences) of Codd's relational algebra as constructions inside the topos of sets. Our goal in this section is to show that the construction of tables inside Set has the same expressive power as Codd's original construction. An alternative way of obtaining this result would be to use the abstract formulation in [13] applied to the specific topos Set, which would reduce the problem to expressing Codd's original operations in terms of the primitive operations used by the authors. We choose to simply recover Codd's constructions in a manner specific to the topos Set so as to avoid introducing another collection of primitive operations.

4.3.1 Products

Given two tables $t_1 : N \to A$ and $t_2 : M \to B$, we can form a product table
$$t_1 \times t_2 : N \times M \to A \times B.$$
This construction takes all possible combinations of records in $t_1$ and $t_2$. However, this operation alone is not useful for many of the operations we want to do with real databases because we typically want to eliminate redundant features, such as when two columns overlap. Nevertheless, as Codd showed, more complex constructions on tables can be built out of these simple primitive operations [23]. We will discuss some of these operations in later sections after showing that we can construct all five primitive operations.

4.3.2 Projection

Projection involves creating a new table on a subset of the columns in our table. Recall that each table $t : M \to X^A$ has an associated column space which can be decomposed as $X^A = \prod_{a\in A} X_a$. If $S \subset A$ is a subset of the attribute set of our table, there is a corresponding projection operator $\pi^A_S : X^A \to X^S$. The composition $\pi^A_S \circ t$ is referred to as the projection of the table onto the attribute space $S$.

4.3.3 Union

Intuitively, unions correspond to concatenating two tables defined on a common column space. Categorically, unions can be seen as a coproduct construction in the category of sets. Given two tables $t_1 : N \to X$ and $t_2 : M \to X$, the union is given by the copairing of $t_1$ and $t_2$, i.e. $t_1 \oplus t_2 : N \amalg M \to X$.

4.3.4 Selection

Selection is the most complicated of the five primitive operations. Intuitively, selection allows us to determine the collection of rows in a table satisfying certain properties, such as selecting all users in a table whose age is greater than 30. These operations naturally identify selections with certain sub-objects of our original table. In this subsection, we explore this idea in depth.

The principle of comprehension in set theory allows us to form sets consisting of all elements which satisfy a certain property $\varphi(x)$. A naive approach to set theory easily leads to contradictions such as Russell's paradox. In the Zermelo-Fraenkel approach to set theory, this is resolved by restricted comprehension, i.e. requiring the proposition $\varphi(x)$ to apply only to members of a particular set $A$. In this section, we discuss a comprehension principle for databases. In the language of database theory, this is referred to as selection.

When querying a database, we often want to return all results meeting some specified criteria. In particular, this requires a way of thinking about propositional formulas involving the logical operators AND, OR, and NOT. Let $A_i$ and $A_j$ be attributes. If $A_i$ or $A_j$ is a categorical attribute, let $\theta \in \{=, \neq\}$; otherwise, let $\theta \in \{<, \leq, =, \neq, \geq, >\}$. Then $A_i\,\theta\,A_j$ or $A_i\,\theta\,x$ with $x \in \mathbb{R}$ determines a mapping $\theta : D \to 2$. Given such a mapping and a particular database $d : [n] \to D$, we can consider the set $d^{-1}(\theta^{-1}(1))$. Since this is a subset of $[n]$, there is a natural inclusion $d^{-1}(\theta^{-1}(1)) \rightarrowtail [n]$. Thus we can form a new database, $d_\theta : d^{-1}(\theta^{-1}(1)) \to D$, representing the selection of those entries for which the proposition is true.

More complex queries can be formed with the logical operators AND, NOT, and OR. Before discussing these constructions, we review the topos-theoretic perspective on the logical operators of conjunction, disjunction, and negation. A more detailed discussion from the perspective of topos theory can be found in [60]. Recall that negation is the arrow $\neg : 2 \to 2$ such that the diagram below is a pullback:
$$\begin{array}{ccc} 1 & \xrightarrow{\;\bot\;} & 2 \\ {\scriptstyle !}\downarrow & & \downarrow{\scriptstyle \neg} \\ 1 & \xrightarrow{\;\top\;} & 2. \end{array}$$
Similarly, conjunction is the arrow $\cap : 2 \times 2 \to 2$ such that the diagram below is a pullback:
$$\begin{array}{ccc} 1 & \xrightarrow{\;\top\times\top\;} & 2 \times 2 \\ {\scriptstyle !}\downarrow & & \downarrow{\scriptstyle \cap} \\ 1 & \xrightarrow{\;\top\;} & 2. \end{array}$$
Finally, disjunction has a more complex categorical description. From simple truth tables we know $\cup : 2 \times 2 \to 2$ should be the characteristic map corresponding to the sub-object $E = \{(1,1), (1,0), (0,1)\}$. First notice that this sub-object can be decomposed as $E_1 \cup E_2$ where $E_1 = \{(1,1), (1,0)\}$ and $E_2 = \{(1,1), (0,1)\}$. This is important because $E_1$ is identified with the monic mapping $\langle \top, 1\rangle : 2 \to 2 \times 2$ and $E_2$ can be identified with the monic mapping $\langle 1, \top\rangle : 2 \to 2 \times 2$. We can then form the map $f$ induced from the coproduct of these two mappings,
$$\begin{array}{ccccc} 2 & \longrightarrow & 2 + 2 & \longleftarrow & 2 \\ & {\scriptstyle \langle\top,1\rangle}\searrow & \downarrow{\scriptstyle f} & \swarrow{\scriptstyle \langle 1,\top\rangle} & \\ & & 2 \times 2, & & \end{array}$$
and observe that $\mathrm{im}(f) = E$. Thus, by the canonical decomposition of a set mapping into a surjection followed by an injection, we can identify $E$ up to unique isomorphism.

In order to form more complex selection queries based on multiple binary operations $\theta : D \to 2$ and $\theta' : D \to 2$, we can combine binary operations using negation, conjunction, and disjunction. For instance, to represent the selection corresponding to $\theta \wedge \theta'$ we could consider the composite $D \times D \xrightarrow{\theta\times\theta'} 2 \times 2 \xrightarrow{\cap} 2$. Hence, the selection can be represented by the database
$$[n] \xrightarrow{\;d\times d\;} D \times D \xrightarrow{\;\theta\times\theta'\;} 2 \times 2 \xrightarrow{\;\cap\;} 2.$$

By forming larger numbers of products, we can represent more complex selection queries. These are all guaranteed to work because Set admits finite limits and colimits. Gathering the most recent version of the entries in a table can be seen as another type of selection operator. Given a table $d : [n] \to A = I \times V \times 2 \times \prod_{i\in I} X_i$, we can define a binary relation $\theta : (I \times V) \times (I \times V) \to 2$ by
$$\theta((i, v), (i', v')) = \begin{cases} 0 & \text{if } i = i' \text{ and } v \leq v' \\ 1 & \text{otherwise.} \end{cases}$$
Clearly $\theta$ lifts to a map on $A \times A$. Hence, by using the selection criteria above with this binary operation, we obtain a new table containing only the most recent entries.
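In pandas, this "keep only the most recent entry" selection is a one-liner; the column names key, version, and value below are illustrative assumptions and not part of the model.

import pandas as pd

t = pd.DataFrame({"key":     [1, 1, 2, 2, 2],
                  "version": [1, 2, 1, 2, 3],
                  "value":   ["a", "b", "c", "d", "e"]})

# For each key, keep the row with the maximal version number.
latest = t.loc[t.groupby("key")["version"].idxmax()]   # keeps rows "b" and "e"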

4.3.5 Difference

Informally, the difference of two tables $t$ and $t'$ is the set of records that belong to one table but not to the other. If $t : M \to X$ is any table and $t' : N \to X$ is another table on the same attribute space, then the difference $t \setminus t'$ can be identified by first considering the sub-object $E \rightarrowtail M$ defined by $E = \{m \in M \mid t(m) \notin t'(N)\}$. Thus, the difference can be formed by taking the selection operator corresponding to the characteristic function of this sub-object. At this point, we have expressed all five primitive operations as constructions within the category of sets. As such, we know that we can express more complex operations performed on tables, such as equijoins, by chaining several of these primitive operations.
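Each of the five primitive operations has a direct analogue in pandas. The sketch below (illustrative column names; pandas 1.2 or later assumed for how="cross") is meant only to mirror the set-theoretic constructions above, not to be a complete relational engine.

import pandas as pd

t1 = pd.DataFrame({"A": [0, 1, 2], "B": ["x", "y", "x"]})
t2 = pd.DataFrame({"A": [1, 3],    "B": ["y", "z"]})

product    = t1.merge(t2, how="cross")                 # t1 x t2 : N x M -> A x B
projection = t1[["B"]]                                  # composition with the projection pi_S
union      = pd.concat([t1, t2], ignore_index=True)     # copairing N + M -> X
selection  = t1[t1["A"] > 0]                            # rows where the predicate holds
# Difference: rows of t1 whose values do not occur in t2.
difference = t1.merge(t2, how="left", indicator=True)
difference = difference[difference["_merge"] == "left_only"].drop(columns="_merge")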

4.4 Some Additional Operations on Tables

In order to construct a category of tables, we need to define morphisms between tables. Before giving a general definition, we explore several common constructions with individual tables and see how to express these notions via category theory.

4.4.1 Addition & Deletion

Let $t : N \to X$ be a table and let $t' : 1 \to X$ be a record. We can add $t'$ to $t$ by taking the union of $t$ and $t'$ as discussed in the last section. Equivalently, if $\tilde{t} : N \amalg 1 \to X$ is the resulting table, we say that $t$ can be obtained from $\tilde{t}$ by deleting a record. These equivalent situations can be represented by the following commutative diagram:
$$\begin{array}{ccccc} N & \longrightarrow & N \amalg 1 & \longleftarrow & 1 \\ & {\scriptstyle t}\searrow & \downarrow{\scriptstyle \tilde{t}} & \swarrow{\scriptstyle t'} & \\ & & X. & & \end{array}$$

4.4.2 Editing Records

Entries in a database can also be modified. When a row is changed, we want to preserve its primary key, update its version, i.e. replace $t$ with $t'$ where $t \leq t'$, and change $x$ to $x'$. To modify the $j$-th row to have time-stamp $t'$ and value $x'$, we can introduce a modification map defined as follows:
$$\mathrm{mod}_j^{(t', x')}(i, t, x) = \begin{cases} (i, t, x) & i \neq j \\ (i, t', x') & i = j. \end{cases}$$
This results in the following commutative diagram:
$$\begin{array}{ccc} [n] & \xrightarrow{\;\mathrm{id}\;} & [n] \\ {\scriptstyle t_1}\downarrow & & \downarrow{\scriptstyle t_2} \\ I \times V \times \mathbb{R} & \xrightarrow{\;\mathrm{mod}_j^{(t', x')}\;} & I \times V \times \mathbb{R}. \end{array}$$

4.4.2.1 Rename

Let $A$ and $B$ be two attribute sets. An isomorphism $\rho : A \to B$ induces a commutative diagram
$$\begin{array}{ccc} M & \xrightarrow{\;t\;} & X^A \\ & {\scriptstyle t'}\searrow & \downarrow{\scriptstyle \pi_\rho} \\ & & X^B \end{array}$$
where the vertical map is defined by $\pi_\rho(x_a) = x_{\rho(a)}$. Such a diagram can be interpreted as a renaming of the columns of the original table.

Example 97. Consider the following two tables:

I | X        J | X
a | 3        0 | 3
b | 5        1 | 5
c | 7        2 | 7

We can view the second table as a re-indexing of the first table where the re-index map r : {a, b, c} → {0, 1, 2} is defined by r (a) = 0, r (b) = 1, and r (c) = 2. This re-indexing can be seen as a pair of morphisms (r, 1) where r is a mapping between the attribute spaces of the corresponding tables and 1 is the identity mapping between column spaces.

4.4.2.2 Imputation

Another special case of editing records is imputing missing data. When preparing data for model fitting, an analyst must make a decision about what to do with records with missing entries. If there are very few records containing missing entries, the analyst may simply choose to drop these records, dismissing them as measurement errors. Other common techniques involve replacing the missing values with the mean, median, or some other fixed value. This can be seen as a map which takes NA to the chosen value $x_0$ and is the identity mapping on every other outcome. Other imputation schemes involve attempting to predict the missing values based on some statistical model for the missing entries [115, 116]. As such, these determine a collection of modifications replacing the missing values at the individual indices with their predicted values.

4.4.3 Merging Overlapping Records

In the next chapter, we will need to discuss merging data frames which agree on their overlapping columns. In this section, we discuss how to view this as a construction on table categories. A particular record can be viewed as a point in a column space: $p : 1 \to X$. Imagine we have two records $p_1 : 1 \to X \times Y$ and $p_2 : 1 \to Y \times Z$. When we say that these two records agree on their overlapping columns, what we mean is that the projections onto their overlapping column spaces agree. As such, we can understand joining two records as an instance of the pullback operation. Saying two records agree on their overlapping columns is equivalent to saying the diagram below commutes:
$$\begin{array}{ccc} 1 & \xrightarrow{\;p_1\;} & X \times Y \\ {\scriptstyle p_2}\downarrow & & \downarrow{\scriptstyle \pi_Y} \\ Y \times Z & \xrightarrow{\;\pi_Y\;} & Y. \end{array}$$
The universal property of pullbacks implies that there exists a unique $p : 1 \to (X \times Y) \times_Y (Y \times Z) \cong X \times Y \times Z$ such that $\pi_{X\times Y} \circ p = p_1$ and $\pi_{Y\times Z} \circ p = p_2$ in the pullback square below:
$$\begin{array}{ccc} X \times Y \times Z & \xrightarrow{\;\pi_{X\times Y}\;} & X \times Y \\ {\scriptstyle \pi_{Y\times Z}}\downarrow & & \downarrow{\scriptstyle \pi_Y} \\ Y \times Z & \xrightarrow{\;\pi_Y\;} & Y. \end{array}$$

By using successive record-wise joins, we can join data frames together. Imagine two synchronized computational agents tabulating partial observations from a random experiment. In this situation, the time-stamp could be used as a key for our merge operation. Assuming perfect synchronization of the clocks of the different agents, we can again invoke the universal property of pullbacks to merge records.
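The pullback-style gluing of records that agree on their overlapping columns corresponds to an inner join on the shared column. The pandas sketch below uses illustrative column names X, Y, and Z.

import pandas as pd

p1 = pd.DataFrame({"X": [1, 2], "Y": ["a", "b"]})    # records in X x Y
p2 = pd.DataFrame({"Y": ["a", "b"], "Z": [10, 20]})  # records in Y x Z

# Records agreeing on the overlap Y are glued into points of X x Y x Z.
joined = p1.merge(p2, on="Y")                         # columns X, Y, Z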

4.4.3.1 Table Morphisms

Given these possible ways to modify a database, we can now create a subcategory of tables T inside Set. The objects in this category are tables, and a morphism $f : t \to t'$ between tables $t : N \to X^A$ and $t' : M \to Y^B$ is given by a pair of maps $(\sigma, f)$, where $\sigma : N \to M$ and $f : X^A \to Y^B$ make the square below commute:
$$\begin{array}{ccc} N & \xrightarrow{\;t\;} & X^A \\ {\scriptstyle \sigma}\downarrow & & \downarrow{\scriptstyle f} \\ M & \xrightarrow{\;t'\;} & Y^B. \end{array}$$

4.4.4 Non-Binary Logics

The implementation of SQL uses a three-valued logic to handle missing data [53]. Truth tables for this logic are displayed below. We use the shorthand U for UNKNOWN in SQL.

a ∧ b                a ∨ b                ¬a
a\b | 0  U  1        a\b | 0  U  1        a | ¬a
 0  | 0  0  0         0  | 0  U  1        0 |  1
 U  | 0  U  U         U  | U  U  1        U |  U
 1  | 0  U  1         1  | 1  1  1        1 |  0

By inspection of the tables above, the negation of U fails to obey U ∧ ¬U = 0 and thus the three-valued logic implemented in SQL is not a Heyting algebra. This suggests that database theory based on category theory should step outside the scope of topos theory to more general areas of categorical logic. In order to replicate the functionality of SQL, we need only replace the two element set used previously with the three element set {0, U, 1} and the corresponding conjunction, disjunction, and negation operations from the tables displayed previously.
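One convenient way to implement this three-valued logic is to encode 0, U, 1 as the numbers 0, 1/2, 1: conjunction is then the minimum, disjunction the maximum, and negation is x ↦ 1 − x, which reproduces the truth tables above. This Kleene-style encoding is our own illustrative choice, not something prescribed by the SQL standard.

FALSE, UNKNOWN, TRUE = 0.0, 0.5, 1.0

def and3(a, b):   # conjunction: minimum of the truth values
    return min(a, b)

def or3(a, b):    # disjunction: maximum of the truth values
    return max(a, b)

def not3(a):      # negation: 1 - a, so that not3(UNKNOWN) == UNKNOWN
    return 1.0 - a

# The law u AND (NOT u) == FALSE fails, so this logic is not a Heyting algebra.
assert and3(UNKNOWN, not3(UNKNOWN)) == UNKNOWN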

4.5 Random Tables and Random Databases

In order to perform statistical analysis on tables, we need to be able to view them as outcome spaces of some random variable. In this section, we discuss augmenting the column spaces with the structure of a measurable space; this connects the present chapter to the earlier chapters discussing sheaves of random variables. We choose to use the point representation of tables because this representation is the most similar to the notation used by statisticians.

4.5.1 Random Tables

If each attribute space, $X_a$, is given the structure of a measurable space by equipping it with a sigma algebra, then the joint outcome space $X^A = \prod_{a\in A} X_a$ can be endowed with a sigma algebra by taking the product sigma algebra. Again taking product sigma algebras, we can endow the table space $\prod_{n\in\mathbb{N}} X^A$ with a sigma algebra structure. Given some probability space $(\Omega, \mathcal{F}, \rho)$, a measurable mapping $R : \Omega \to \prod_{n\in\mathbb{N}} X^A$ is said to be a random table.

4.5.2 Giry Monad Applied to Tables

Given a table space with a sigma algebra structure, we can apply the Giry monad to the table space to recover the collection of all probability measures on the table space. If all $X_a$ are standard Borel spaces, the product space is also standard Borel, and thus so is its image under the Giry monad. We refer to a map $1 \to \prod_{n\in\mathbb{N}} G(X^A)$ as a Giry table or Giry data frame. In a later chapter, we will discuss techniques for imputing missing data and merging conflicting data which rely on these Giry tables.

4.5.3 Random Databases

A database, $D$, is simply a finite collection of tables $D = \{t_1, \ldots, t_k\}$. The attribute space, $A_D$, of the database is simply the union of the attribute spaces of the tables. A random database can be obtained by considering a random variable whose outcome space is $\prod_{n\in\mathbb{N}} X^{A_D}$. By projecting onto the various outcome spaces of the tables, we obtain the database representation of the random sample. In the next chapter, we will discuss properties of reconstructing global samples. In particular, we will see that requiring any pair of tables to agree on their overlapping columns is insufficient to ensure that the tables can be joined together. Arbitrary probability distributions on the outcome spaces of the tables do not necessarily need to arise from a global probability distribution, even if the overlapping marginals are compatible. The problem of reconstructing probability measures from their marginal distributions has been discussed in [62, 131]. Wang has also provided conditions for compatibility of marginals for undirected graphical models [140]. In this section, we will provide sufficient conditions for when a collection of tables which agree on their shared column spaces can be seen as projections of a global table. The general problem of determining whether or not a collection of tables arises as the projection of a table on the joint column space is known to be NP-complete [68]. As an example, consider a table with three attributes A, B, and C. The outcome space of each attribute is {0, 1}. The following collection of probability distributions agree on their overlapping marginals but fail to arise as marginal distributions of a joint distribution on the outcome space $X_A \times X_B \times X_C$:

           P(A = 0)   P(A = 1)
P(B = 0)     1/2         0
P(B = 1)      0         1/2

           P(B = 0)   P(B = 1)
P(C = 0)     1/2         0
P(C = 1)      0         1/2

           P(A = 0)   P(A = 1)
P(C = 0)      0         1/2
P(C = 1)     1/2         0

Although these distributions cannot be seen as the marginal distributions of some probability distribution on the full outcome space, this type of structure is possible in many data collection scenarios. Morton showed how this phenomenon can arise from missing data or from databases employing versioning techniques like snapshot isolation [101]. In the next chapter, we discuss the representation of random variables of this form and discuss how to extend statistical theory into this setting.
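That no joint distribution exists can also be checked mechanically by posing the marginal constraints as a linear feasibility problem over the eight probabilities p(a, b, c). The sketch below assumes scipy is available and uses scipy.optimize.linprog with a zero objective; the solver reports that the equality constraints are infeasible.

import numpy as np
from itertools import product
from scipy.optimize import linprog

outcomes = list(product([0, 1], repeat=3))   # joint outcomes (a, b, c)
pairs = {("A", "B"): (0, 1), ("B", "C"): (1, 2), ("A", "C"): (0, 2)}
# Pairwise marginals from the tables above: A = B and B = C, but A != C.
marg = {("A", "B"): {(0, 0): 0.5, (1, 1): 0.5},
        ("B", "C"): {(0, 0): 0.5, (1, 1): 0.5},
        ("A", "C"): {(0, 1): 0.5, (1, 0): 0.5}}

A_eq, b_eq = [], []
for pair, (i, j) in pairs.items():
    for u, v in product([0, 1], repeat=2):
        A_eq.append([1.0 if (o[i], o[j]) == (u, v) else 0.0 for o in outcomes])
        b_eq.append(marg[pair].get((u, v), 0.0))
A_eq.append([1.0] * len(outcomes)); b_eq.append(1.0)   # probabilities sum to one

res = linprog(c=np.zeros(len(outcomes)), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, 1)] * len(outcomes))
print(res.success)   # False: no non-negative solution exists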

4.6 Topological Aspects of Databases

4.6.1 Simplicial Complex Associated to a Database

A database is simply a collection of tables. Each table has an attribute space. The attribute space associated to a database can be formed by taking the union of the attribute spaces of the tables constituting the database. The column space of each table determines an abstract simplicial complex by associating a vertex to each column in the table. The maximal face associated to the table is given by the set of column attributes. The abstract simplicial complex condition requires that all subsets of the maximal face be included in the simplicial structure. Intuitively, this corresponds to the collection of tables that can be formed by dropping a subset of columns from the original table. The simplicial complexes associated to each table can be glued together along their overlapping column set. This determines an abstract simplicial complex for the entire database.

4.6.2 Contextuality

Contextuality is a phenomenon first observed in quantum physics whereby the outcome one observes in a measurement depends upon the other measurements taking place. Mathematically, this corresponds to the fact that in quantum mechanics observables are represented by operators on a Hilbert space and two operators are simultaneously observable if and only if they commute. We believe contextuality can arise in databases whenever the simplicial complex associated to the database is non-contractible. First, we can start with a simpler observation.

Lemma 98. If two tables agree on their overlapping counts, then there exists a join of the tables.

Proof. Construct a total order on the values of the overlapping columns of the two tables. Sort each table according to this total order. We can now work inductively. If there is only one entry, the tables agreeing on their overlapping counts means that each entry has the same value on the overlapping columns. As such, there is only one choice to be made when joining the records. For tables with multiple entries, we can just merge the sorted rows by matching the indices in the enumeration implied by the total order.
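The proof is constructive, and the construction is easy to carry out in code. The Python sketch below represents a table as a list of row dictionaries (an illustrative choice), sorts both tables on the shared columns, and pairs rows in order; as in the lemma, it assumes the tables agree on their overlapping counts.

def join_on_overlap(t1, t2, overlap):
    # Join two tables (lists of dict rows) that agree on the counts of the shared
    # columns `overlap`, by sorting on the overlap and pairing rows in order.
    key = lambda row: tuple(row[c] for c in overlap)
    rows1, rows2 = sorted(t1, key=key), sorted(t2, key=key)
    assert [key(r) for r in rows1] == [key(r) for r in rows2]  # agreement on overlapping counts
    return [{**r1, **r2} for r1, r2 in zip(rows1, rows2)]

t1 = [{"A": 0, "B": 0}, {"A": 1, "B": 0}, {"A": 2, "B": 0}, {"A": 3, "B": 0}]
t2 = [{"B": 0, "C": "a"}, {"B": 0, "C": "b"}, {"B": 0, "C": "c"}, {"B": 0, "C": "d"}]
joined = join_on_overlap(t1, t2, overlap=["B"])   # one of several possible joins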

Remark. In general, there may be multiple ways to join two tables together, as the next example shows.

Example 99. Consider the following tables:

A | B        B | C
0 | 0        0 | a
1 | 0        0 | b
2 | 0        0 | c
3 | 0        0 | d

In this situation, there are 4! ways to join the two tables along B. Repeatedly applying binary joins allows us to construct a global table which projects onto each marginal distribution. This procedure can go wrong if we add a table that has non-empty intersection with multiple tables in the group in such a way that the simplicial complex associated to the tables is not contractible.

Definition 100. Let $t_1, \ldots, t_k$ be a collection of tables and let $C_1, \ldots, C_k$ denote their respective column sets. The schema graph associated to this database is the graph whose nodes $n_1, \ldots, n_k$ are given by the respective tables $t_1, \ldots, t_k$, and whose edges are given by the pairs $(i, j)$ such that $C_i \cap C_j \neq \emptyset$.

With this definition, we can establish sufficient conditions for the constraint satisfaction problem to admit a solution.

Proposition 101. Let $t_1, \ldots, t_k$ be a collection of tables. If the schema graph associated to $t_1, \ldots, t_k$ is connected and acyclic, then the constraint satisfaction problem has a non-empty solution set.

Proof. We proceed by induction on $k$. The base case $k = 1$ is trivial, and the case $k = 2$ is established by the previous lemma. By relabeling the nodes if necessary, we may assume without loss of generality that node $k$ is a boundary node, i.e. a node with exactly one incident edge. Such a node must exist because if every node had two or more incident edges, the graph would contain a cycle. By the previous lemma, there is a join between this node and its neighbor. By definition, this join projects onto the two marginal tables and so will obey the condition of agreeing overlaps with any other tables whose edges are incident on either of the original nodes. By considering the graph obtained by contracting these two nodes, we reduce the number of nodes by one and may thus apply the inductive hypothesis.
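The hypothesis of Proposition 101 is straightforward to check in code. The sketch below assumes the networkx library: it builds the schema graph from the column sets and tests whether it is connected and acyclic, i.e. a tree.

import networkx as nx
from itertools import combinations

column_sets = {"t1": {"A", "B"}, "t2": {"B", "C"}, "t3": {"C", "D"}}

# Schema graph: one node per table, an edge whenever two column sets intersect.
G = nx.Graph()
G.add_nodes_from(column_sets)
for u, v in combinations(column_sets, 2):
    if column_sets[u] & column_sets[v]:
        G.add_edge(u, v)

print(nx.is_tree(G))   # True: this chain-shaped schema satisfies Proposition 101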

88 Example 102. Suppose we have three tables whose column sets are {A, B}, {B,C}, and {C,D}, respectively. Then these tables admit a join as long as the first two tables agree on their projection onto B and the last two tables agree on the projection onto C. To see this, we can first join {A, B} and {B,C}. Again, there are potentially multiple solutions to the join problem. Based on the result of the first operation, we can join the resulting table to {C,D} which is again possible because these tables agree on their overlapping counts of the values of C.

Example 103. Consider three tables whose column sets are {A, B}, {B,C}, and {A, C}, respectively. The following collection of tables admits no join:

A | B        B | C        A | C
0 | 0        0 | 0        0 | 1
1 | 1        1 | 1        1 | 0

Note that the simplicial complex associated to these tables is the hollow triangle on the vertices A, B, and C: all three edges {A, B}, {B, C}, and {A, C} are present, but the 2-face {A, B, C} is not.

4.6.3 Topology on a Database

An abstract simplicial complex has a natural poset structure given by subset inclusion. There are two canonical topologies associated to a poset: the topology generated by taking upper sets as open sets and the topology generated by taking lower sets as open sets [139]. Thus, given a database schema, we can construct an abstract simplicial complex representing the overlap between tables in the database. We can construct a topology on this abstract simplicial complex by taking the lower sets to be the open sets. If $(P, \leq)$ is a poset, then a set $U \subset P$ is said to be a lower set if for all $x \in U$, $y \leq x$ implies $y \in U$. This endows our abstract simplicial complex with the structure of an Alexandroff topology (arbitrary intersections of open sets remain open). In the next chapter, we will see how to think about statistical inference on such databases.

4.7 Relationship Between Topological Structure of a Schema and Contextuality

We have seen that contextual schemas which contain cycles can produce collections of tables which agree on all marginal overlaps but fail to admit a global glueing of the tables. Future research could investigate further how topological properties of the database schema affect the potential for contextuality. We present a few conjectures along these lines in this subsection.

Conjecture. Let T_1, T_2, . . . , T_k be a collection of tables whose associated database schema (simplicial complex) is contractible. If the tables agree on their overlapping column sets, then there is a global join of the tables T_1, . . . , T_k.

Note that Proposition 101 is a weaker form of this conjecture. As an example of a database whose simplicial complex is contractible yet whose schema graph is cyclic, consider four binary random variables A, B, C, and D, and a database consisting of tables whose column sets are {A, B, C}, {B, C, D}, and {A, C, D}. The simplicial complex is the boundary of a tetrahedron with one face removed, while the schema graph is simply the cyclic graph on 3 nodes. Another natural question is whether or not any schema which is not contractible will necessarily admit a collection of tables which fail to admit a global glueing. The above conjecture claims that holes (or higher-dimensional analogs thereof) are necessary for tables to fail to admit a glueing. A natural question is then whether or not this condition is also sufficient. In chapter five, we discuss how the space of contextual probability distributions can be realized as a certain linear subspace and also how the collection of classical probability distributions is a linear subspace of this space. One approach to investigating this conjecture could be to analyze the linear algebra of these subspaces and investigate whether or not the constraint equations resulting from closing a hole create an inconsistency in these sets of linear equations.

Chapter 5 | Contextual Statistics

5.1 Introduction

In chapter three, we saw that the collection of quasi-Borel presheaves on a sample space category has the structure of a sheaf with respect to the atomic Grothendieck topology on its underlying sample space category. In this chapter, we will see how replacing objects with presheaves and measurable mappings with natural transformations allows us to extend statistical concepts to a distributed measurement scenario in a natural way. Contextuality is a mathematical property in quantum mechanics arising from Bell's theorem. It is the phenomenon by which the value of a measurement depends on the other measurements being performed simultaneously. Contextuality has been formalized in the language of sheaf theory in [1–5] and has also been shown to arise in cognitive science [104]. Morton showed that contextuality can also arise in many data collection scenarios such as those involving missing data or versioning in a distributed database employing snapshot isolation [101]. Traditional statistical methodology assumes that data is already arranged in a flat and tidy manner [143]. As such, the ways that preprocessing affects the statistical properties of distributions are not typically discussed in most textbooks. The typical workaround when such assumptions fail is first cleaning the data by matching disparate sources and combining them into a single data frame. Choices are made about how to impute values or whether or not to throw away missing records. However, these methods introduce additional implicit assumptions which may not always be warranted, as the models themselves assume missing data is impossible and that the pipeline

transforming the data does not affect subsequent analysis. In fact, many common techniques such as filling missing records by their column mean induce clear biases in the estimation of higher order moments of our data. Modern data sets have a more complex structure than assumed in a traditional statistics course. Traditional statistical methods rely on analyzing what data scientists would call a flat and tidy data set. In this chapter, we will explore how relaxing these assumptions affects statistical techniques. In particular, we will develop statistical theory in a manner that is contextual by creating sheaves on the Alexandroff topology constructed on a database schema as defined in the previous chapter (Section 4.6.3). A non-flat measurement scenario consists of a collection of tables tabulating the outcomes of some collection of random variables of interest. By a context, we mean a collection of simultaneously observable random variables whose outcomes are collected in a single table. In other words, contexts can be identified with the particular collection of attributes tabulated in a single table. We often informally refer to a table as a context in this chapter. The major contributions present in this chapter are the use of an appropriate topological structure to sheaf-theoretically lift standard statistical constructions to families of marginals with some overlapping constraints. More precisely, this includes the introduction of a poset structure on the collection of constraint satisfaction problems (Section 5.5.2) which allows us to select an appropriate topology based on the shared columns of the tables constituting a database (Section 5.6.1). Using this topology, we see how to express various statistical concepts as sheaves or presheaves with respect to this topology (Section 5.7). This allows us to define the notion of contextual random variables (Section 5.7.6) and to define statistical models in terms of sheaf morphisms (Section 5.8.1). We also introduce the distinction between classical and contextual factors (Definition 127) and the notion of classical snapshot to handle classical approximations to globally irreconcilable marginals. We discuss a pseudo-likelihood approach to extending maximum likelihood estimation based on the realization of contextual random variables as subsets of an equalizer (Section 5.11.1) and provide a test for whether or not marginal distributions can arise from a joint distribution on the full column set (Section 5.12.2). This last result is similar to a result due to Abramsky, Barbosa, and Mansfield based on sheaf cohomology which allows the user to detect contextuality [3]. By combining our results with the construction in chapter seven, we can provide a goodness-of-fit

measure for contextuality rather than a simple detection of contextuality.

5.2 The Bell Marginals

The problem of reconstructing specific statistical models from marginal tables has been analyzed previously [6, 63, 134]. The example in this section can be seen as a specific instance of recovering a multinomial model on the full joint distribution from a particular family of marginals. Formulated as a database problem, the problem of whether or not there exists a table projecting onto a given family of marginals is known to be NP-complete [68]. In this section, we review the Bell marginal tables introduced in [101]. We prove that the introduction of a naive transition noise from a joint distribution on the full attribute space onto the product of the marginals is sufficient to match any family of marginal distributions. The problem with this construction is that it is overparametrized, and there are in general many noise transitions which explain a collection of incompatible marginals. We use this as motivation for the introduction of sheaf theory in later sections, which allows us to construct a model of the globally inconsistent marginals as a subspace of an equalizer constructed from the degree of overlapping compatibility. We begin by considering the contextual inference problem for data collected in contingency tables. As a starting point, we recall the Bell marginals example from [101].

Example 104. (Bell Marginals) Consider the following collection of four contingency tables:

t_AB        A = 0    A = 1          t_A′B       A′ = 0    A′ = 1
B = 0       4        0              B = 0       3         1
B = 1       0        4              B = 1       1         3

t_AB′       A = 0    A = 1          t_A′B′      A′ = 0    A′ = 1
B′ = 0      3        1              B′ = 0      1         3
B′ = 1      1        3              B′ = 1      3         1

Given such a collection of tables, we can attempt to construct a table which has the above four contingency tables as marginal tables. In [101], it is shown that such

a construction is impossible. If we attempt to merge A′B, AB′, and A′B′, there are 14 possible tables of counts that marginalize to these tables. However, if we merge AB with A′B, there is only one possible solution to the resulting constraint satisfaction problem. We will explain this constraint satisfaction problem in greater depth in Section 5.5.1. Given a classical random variable, we could never encounter the above situation if these tables were drawn (with clean records) from a random variable on the joint outcome space of A, B, A′, and B′; however, as shown in [101], these tables could result from versioning or missing data. As a random variable is not determined by its marginal distributions, there could in general be many possible probability distributions on the full outcome space with the same marginal distributions as those collected in a particular collection of tables. Nevertheless, the presence of noise in instrumentation in a network of computational agents, or bias in the agents themselves (such as may occur in a network of sensors due to the physical degradation of some of the components of the sensor), could potentially result in this type of situation.

Example 105. Suppose we have four computational agents observing outcomes in the same manner as the Bell distributions above, i.e. one agent observes random variables A and B, another observes A0 and B, etc. If each agent flips a bit while logging their observations with some fixed probability p, then the tables of marginal counts will likely fail to glue together due to the presence of the random noise in the system.

Thus, one way of understanding contextuality is by introducing noise into the system. This motivates the following definition.

Definition 106. (Noisy Random Variable) Let X = ∏_{i=1}^{n} X_i be the joint outcome space of the random variables in R. Let X_{C_i} = ∏_{i ∈ C_i} X̃_i, where X̃_i = X_i ⊔ {NA}. In order to allow for the possibility of global inconsistency among the agents, we equip each context with a 'noise', which is simply a transition probability between X and X_{C_i}, i.e. a Kleisli arrow n_{i,K} : X → X_{C_i}. Thus, we will think of a noisy random variable as a random variable whose outcome space is the product across a set of measurement contexts.

Lemma 107. (Existence of Noisy Random Variables) Given a random variable (f : S → X, p) and a family of Kleisli arrows n_{i,K} : X → X_{C_i}, there exists a noisy random variable whose marginal distribution on each context C_i is the one induced by composing p with n_{i,K}.

Proof. Starting from a random variable (f : S → X, p) and Kleisli arrows n_{i,K} : X → X_{C_i}, we need to construct an extension of the sample space S whose outcome space is ∏_{i=1}^{m} X_{C_i} (or perhaps an extension of this). From our original random variable, we can construct a Kleisli arrow p : 1 →_K X. The family of Kleisli arrows gives rise to a Kleisli arrow ∏_{i=1}^{m} n_{i,K} : X → ∏_{i=1}^{m} X_{C_i}. Thus, Kleisli composition induces a probability distribution on the space ∏_{i=1}^{m} X_{C_i} which descends to marginal distributions on the various contexts via the projection maps. We denote these composite arrows {p_{C_i}}_{i=1}^{m}. Hence, the existence of a contextual random variable is reduced to the question of constructing a sample space for a random variable based on a push-forward distribution. However, this is a straightforward, albeit abstract, construction. Given the original distribution p on S, we can construct S × ∏_{i=1}^{m} X_{C_i} with the distribution p × ∏_{i=1}^{m} p_{C_i}. There is an obvious projection map q : S × ∏_{i=1}^{m} X_{C_i} → ∏_{i=1}^{m} X_{C_i}. Then the random variable (q : S × ∏_{i=1}^{m} X_{C_i} → ∏_{i=1}^{m} X_{C_i}, p × ∏_{i=1}^{m} p_{C_i}) has the desired properties.

Lemma 108. Given any collection of contexts of discrete random variables, there exists a noisy random variable with the prescribed marginal distributions.

Proof. We establish this lemma by constructing a recursive algorithm for finding such a representation. Define k_1 = min{ k ∈ [n] : ∑_{i=1}^{k} p_i ≥ q_1 } and define α_1 such that α_1 p_{k_1} + ∑_{i=1}^{k_1 − 1} p_i = q_1. Note that (1 − α_1) p_{k_1} + ∑_{i=k_1+1}^{n} p_i = ∑_{i=2}^{m} q_i. Recursively, define k_ℓ = min{ k ∈ [n] : ∑_{i=k_{ℓ−1}+1}^{k} p_i ≥ q_ℓ } and α_ℓ such that α_ℓ p_{k_ℓ} + ∑_{i=k_{ℓ−1}+1}^{k_ℓ − 1} p_i = q_ℓ. From these we can construct our Kleisli morphism as the right stochastic matrix whose entries have the following form:

T_{ij} =  0,             j < k_{i−1}
          1 − α_{i−1},   j = k_{i−1}
          1,              k_{i−1} < j < k_i
          α_i,            j = k_i
          0,              j > k_i

From this form, we see this construction is indeed row stochastic: every row i ∉ {k_1, . . . , k_m} contains a single 1, and each row i = k_ℓ contains an entry of the form α_ℓ followed by (1 − α_ℓ), with all other entries being 0. Each α_ℓ ∈ [0, 1] because, by construction, p_{k_ℓ} + ∑_{i=k_{ℓ−1}+1}^{k_ℓ − 1} p_i ≥ q_ℓ. Hence p_{k_ℓ} ≥ q_ℓ − ∑_{i=k_{ℓ−1}+1}^{k_ℓ − 1} p_i. Recall that α_ℓ is chosen such that the latter inequality becomes an equality, and so α_ℓ ≤ 1. By the definition of k_ℓ, we know ∑_{i=k_{ℓ−1}+1}^{k_ℓ − 1} p_i < q_ℓ or, equivalently, q_ℓ − ∑_{i=k_{ℓ−1}+1}^{k_ℓ − 1} p_i > 0. Thus, we also must have α_ℓ ≥ 0. This establishes the row stochasticity of the matrix T.
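The construction in the proof can be carried out mechanically. The following is a minimal computational sketch (an illustration, not the author's implementation) under the assumption that p has strictly positive entries; the thresholds k_ℓ and weights α_ℓ appear implicitly as the points at which the running mass of p is split between consecutive entries of q.

# Greedy construction of a row-stochastic T with p T = q by splitting the mass
# of the true distribution p sequentially across the bins of the marginal q.
import numpy as np

def transition(p, q):
    n, m = len(p), len(q)
    T = np.zeros((n, m))
    i, avail = 0, p[0]                      # avail = unassigned mass of p_i
    for j in range(m):
        need = q[j]                         # mass still required by q_j
        while need > 1e-12:
            take = min(avail, need)
            T[i, j] += take / p[i]          # fraction of row i routed to column j
            avail -= take
            need -= take
            if avail <= 1e-12 and i + 1 < n:
                i, avail = i + 1, p[i + 1]
    return T

p = np.array([0.375, 0.25, 0.125, 0.25])    # illustrative true distribution
q = np.array([0.5, 0.25, 0.25])             # illustrative observed marginal
T = transition(p, q)
assert np.allclose(p @ T, q) and np.allclose(T.sum(axis=1), 1.0)
print(T)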

From the previous proof, we see that the Kleisli transition probabilities are sufficient to match any observed contextual distribution. However, such a construction is highly non-unique. Note that any permutation of the indices of the true distribution or the contextual distribution would result in a different transition probability under the construction in the previous lemma. This means that any model of contextuality as a noisy random variable will be non-identifiable. Such models are called singular, and their asymptotic theory is considerably more complex due to the failure of asymptotic normality of the posterior distribution. More information on the asymptotic theory of singular models can be found in [141, 142]. In order to extend statistical methods to such situations, we need methods that are robust to any of these above possibilities. This motivates our later construction of an Alexandroff topology on a collection of overlapping tables (Section 5.6.1). The next example outlines how contextuality can potentially arise in experiments analyzing human behavior.

Example 109. (Consumer Behavior) Imagine we are testing several options for displaying products in a store. Each display has the capacity for two items. We label our different products A, B, A′, and B′ and try to log counts of purchases of the various combinations of the available options, e.g. the number of customers who bought item A but not item B, etc. The data is tabulated as a collection of counts tabulating the number of customers who bought the various combinations of products. In this scenario, we can interpret the value 0 to correspond with a consumer not purchasing an item and a value of 1 to indicate that a consumer purchased the item. Thus each context can be represented as a 2 × 2 table where x_00 denotes the number of customers who purchased neither item, x_01 and x_10 denote the number of customers who purchased only one of the items, and x_11 denotes the number of customers who purchased both items. If we test four different displays where any two displays have exactly one product in common, we could potentially arrive at the situation discussed at the beginning of this chapter

involving the Bell marginals. We may be interested in predicting how a pair of items would sell before doing the experiment, in order to determine whether or not it would be worth the cost of trying out a different product layout.

In the Bell marginal tables displayed above, we can consider the tables T_AB, T_A′B, and T_AB′ to be tables resulting from previous observations of customer behavior. Merging these three tables yields the collection of purchasing patterns across all four products that are consistent with the observed marginal distributions from the three previous experiments. In the specific situation of the Bell marginals listed above, there are 4 possible joins of T_AB, T_A′B, and T_AB′ consistent with the corresponding marginal distributions. Code for reproducing this observation can be found in Appendix A.2.2. Projecting these tables onto the joint outcome space A′ × B′ yields four possible marginal tables:

            A′ = 0   A′ = 1                   A′ = 0   A′ = 1
B′ = 0      4        0           B′ = 0       3        1
B′ = 1      0        4           B′ = 1       1        3

            A′ = 0   A′ = 1                   A′ = 0   A′ = 1
B′ = 0      3        1           B′ = 0       2        2
B′ = 1      1        3           B′ = 1       2        2

Note that the second and third tables are the same because there are two possible joins of T_AB, T_A′B, and T_AB′ which marginalize onto A′B′ in the same manner. In order to make predictions for this set of outcomes, we need some way of choosing one of the above tables, the ability to reason over the full collection of the above tables, or a technique for averaging the tables in an appropriate way. The latter idea is the most straightforward. When applied to the tables above, the average table is one of the two identical tables. However, as we discuss later in this chapter (Section 5.11.2), this type of analysis can lead to seriously flawed predictions. For methods to choose a particular table, we could apply the maximum-entropy principle, which would mean choosing the last (uniform) table above. However, it is possible to construct situations where the entropy-maximizing table is not unique, and as such this principle can't necessarily be applied to all problems. One

such example is given by the collection of joint tables on A, B, A′, and B′ which marginalize to the tables T_AB and T_A′B′ in the Bell marginals introduced above. As we will see later in this chapter (Section 5.5.1), the collection of all such tables results in a constraint satisfaction problem. In this particular case, there are 14 different tables which solve the resulting constraint satisfaction problem and 6 of these solutions are entropy maximizing. The code for producing this example appears in Appendix A.2.3. Entropy maximization according to marginals will also tend to eliminate any correlation between the random variables, which could be undesirable in many predictive modeling situations. In the specific scenario of predicting customer purchasing habits, we may want to calculate the expected profit for each of these situations and decide if the potential distribution of returns is worth the cost of further experimentation. In general, the profit (or some other reward function) will depend on which of the four distributions we choose. Moreover, simply averaging the profit will destroy information about the spread of values. In terms of making a decision of whether or not additional testing is warranted, the spread is relevant to the analyst who is trying to understand the risk-to-reward characteristics of such a decision. In the situation above, if we were looking at the expected profit for each possible table, we could also report the sample standard deviation of the profit across all tables. A major theme of the next chapter involves using the Giry monad discussed in chapter 3 to develop means of imputing data that better preserve the spread of the observed data. Suppose instead we are interested in some statistical property of the full distribution on A × B × A′ × B′. If we had collected the marginal tables displayed at the beginning of this section, we would find that there is no table on the full outcome space (without missing values) which marginalizes to the Bell marginals. In order to do any statistical inquiries in this situation, we have a few options on how to proceed:

• We can fit models which only depend on the observed marginal distributions.

• Extend our model to incorporate the possibility of unobserved values and attempt to reason over the space of possible joins including NAs. Given the number of possibilities of how to join the data with missing entries, we may need to incorporate some form of bootstrapping to keep our computations feasible.

• Attempt to come up with a representation for the joint distribution that is independent of these arbitrary choices.

To conclude this discussion: when analyzing contextual databases, we most often are not able to join data frames uniquely. Much of the time, we see either many possible ways to join our data frames or no globally consistent way of joining them. In chapter four, we found sufficient conditions for ensuring that a collection of tables admits a consistent join provided their overlapping counts agree (Proposition 101). We next explore some of the limitations of the skip-NA method as it relates to model fitting for the specific case of directed graphical models, or Bayesian networks.

5.3 Skip-NA and Directed Graphical Models

The problem of compatibility amongst conditional distributions has been previously explored in [7, 8]. More recent work on this question has focused on the compatibility question for conditional distributions from the point of view of algebraic geometry [35, 102, 124, 125]. In this section, we examine a naive approach to fitting graphical models to contextual tables. Graphical models are statistical models for which a graph is used to express conditional dependence relationships between different subsets of random variables. In this section we discuss issues related to fitting graphical models to contextual measurement scenarios. Suppose we have a directed acyclic graph on the observables whose nodes are labeled by random variables belonging to some measurement scenario. The Bayesian network associated to the directed acyclic graph is given by

p(x_1, . . . , x_n) = ∏_{i=1}^{n} p(x_i | pa_i)

where pai denotes the collection of parents of Xi. Note that a graphical model can

be naively fit to a collection of contexts as long as the table which covers Xi | pai belongs to the generating family of observables assuming additionally that the collection of contexts agree on any overlapping subset of columns. As we will see in this section, the naive technique of fitting the model from the relevant conditional

table can result in pathological behavior in the presence of contextuality. In practice, such issues could arise when fitting a model to a large database which contains missing records by querying for the relevant tables using a skip-NA framework on the subset of columns of interest to the query. For more details about how missing data can produce contextuality in marginal tables, we refer the reader to the discussion in [101].

Example 110. Consider the following directed acyclic graph on the Bell contexts:

[DAG on the Bell observables, with edges A → B, B → A′, and A → B′, matching the factorization below.]

If we interpret the above directed acyclic graph as representing a Bayesian network, then the corresponding factorization of the joint distribution on X_A × X_B × X_{A′} × X_{B′} is given by

p(x_A, x_B, x_{A′}, x_{B′}) = p(A) p(B | A) p(A′ | B) p(B′ | A).

Each individual probability distribution in the factorization is naively estimable from the Bell contexts since all of these tables are uniquely computable from the observed tables. A is covered by t_AB and t_AB′, and the marginalized tables produce the same marginal probability on A, so p(A) is well-defined as far as the original tables are concerned. Similarly, t_AB, t_A′B, and t_AB′ are all tables of the original contexts, and so p(B | A), p(A′ | B), and p(B′ | A) are all computable from their respective observed tables. Thus, we can fit the above graphical model to the contextual distribution even though there is no global table on {A, B, A′, B′} which marginalizes to the observed Bell tables. Even though the model is naively estimable from the contextual family, it is indeed a probability distribution on the joint outcome space X_A × X_B × X_{A′} × X_{B′}. As such, we know that this model will not marginalize to the Bell tables because

the Bell tables admit no globally consistent join. In particular, marginalizing the distribution naively estimated from conditional tables computed from the Bell marginals gives us the correct frequencies corresponding to t_AB but the incorrect frequencies for t_A′B′.
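The computation described in this example can be reproduced directly from the Bell counts. The sketch below is a hedged illustration (array conventions and variable names are mine, not the appendix code): each conditional is estimated from its Bell table, the product distribution is formed, and its AB and A′B′ marginals are compared with the observed tables.

# Naively fit p(A) p(B|A) p(A'|B) p(B'|A) from the Bell marginal counts and
# compare the induced marginals with t_AB and t_{A'B'} (counts from Section 5.2).
import numpy as np

t_AB   = np.array([[4, 0], [0, 4]]) / 8.0   # rows A, columns B
t_ApB  = np.array([[3, 1], [1, 3]]) / 8.0   # rows A', columns B
t_ABp  = np.array([[3, 1], [1, 3]]) / 8.0   # rows A, columns B'
t_ApBp = np.array([[1, 3], [3, 1]]) / 8.0   # rows A', columns B'

p_A    = t_AB.sum(axis=1)                          # p(A)
p_B_A  = t_AB / t_AB.sum(axis=1, keepdims=True)    # p(B  | A)
p_Ap_B = t_ApB / t_ApB.sum(axis=0, keepdims=True)  # p(A' | B)
p_Bp_A = t_ABp / t_ABp.sum(axis=1, keepdims=True)  # p(B' | A)

# joint[a, b, a', b'] = p(a) p(b|a) p(a'|b) p(b'|a)
joint = np.einsum("a,ab,cb,ad->abcd", p_A, p_B_A, p_Ap_B, p_Bp_A)

print(joint.sum(axis=(2, 3)))   # marginal on (A, B): reproduces t_AB
print(joint.sum(axis=(0, 1)))   # marginal on (A', B'): [[5/16, 3/16], [3/16, 5/16]]
print(t_ApBp)                   # observed t_{A'B'}:    [[1/8,  3/8 ], [3/8,  1/8 ]]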

In the Bell tables, t_AB and t_A′B together admit a unique solution to their resulting constraint satisfaction problem. The same holds true for the tables t_AB and t_AB′. As such, we can also fit graphical models to the Bell tables containing conditional tables involving {A, B, A′} or {A, B, B′}. We will soon study the mathematical structure of the collection of tables constructible from a collection of overlapping tables.

Example 111. The probability model corresponding to the following directed acyclic graph is also naively estimable from the Bell contexts:

[DAG on the Bell observables, with edges B′ → A, B′ → A′, A → B, and A′ → B, matching the factorization below],

which corresponds to the decomposition of the joint distribution given by

p(A, B, A′, B′) = p(A | B′) p(B | A, A′) p(A′ | B′) p(B′).

Note that the term p(B | A, A′) is unambiguous because t_AB and t_A′B admit a unique extension table. For the same reasons as in the previous example, this joint distribution does not marginalize to the Bell contexts.

Remark 112. A collection of contextual tables contains the tables from which sufficient statistics for a graphical model on a directed acyclic graph can be naively estimated if and only if each edge is a contextual table or a consistent projection thereof. This procedure can fit arbitrarily 'bad' models in the presence of contextuality. We present an example where using the marginal tables as sufficient statistics for model selection produces a model with marginal distributions which have infinite Kullback-Leibler divergence from one of the observed marginal distributions.

Example 113. Consider the following normalized marginal distributions on AB, BC, and AC, respectively:

s_{ij} = (1/2) δ_{ij},   q_{ij} = (1/2) δ_{ij},   t_{ij} = (1/2) (1 − δ_{ij}).

Note these marginal tables provide sufficient statistics for the graphical model in which both B and C depend on A (edges A–B and A–C), which corresponds to the probabilistic model:

p(A, B, C) = p(A) p(B | A) p(C | A).

Let p_{ijk} denote the probability that A = i, B = j, and C = k. Then

p_{ijk} = (1/2) δ_{ij} (1 − δ_{ik}).

The projection of the above probability onto the outcome space BC is given by

r_{jk} = ∑_i p_{ijk} = (1/2) (1 − δ_{jk}).

However, the observed marginal distribution of BC is given by

q_{jk} = (1/2) δ_{jk}.

The Kullback-Leibler divergence between r and q is given by

D_KL(r ∥ q) = ∑_{j,k} r_{jk} log(r_{jk} / q_{jk}) = ∞.
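A quick numeric check of this calculation (a sketch; the indexing conventions are assumptions of the snippet, not notation from the text):

# Example 113: the fitted model's BC marginal has infinite KL divergence from q.
import numpy as np

delta = np.eye(2)
s = 0.5 * delta            # observed AB marginal
q = 0.5 * delta            # observed BC marginal
t = 0.5 * (1 - delta)      # observed AC marginal

# fitted model: p_ijk = p(A=i) p(B=j|A=i) p(C=k|A=i) = 0.5 * delta_ij * (1 - delta_ik)
p = 0.5 * np.einsum("ij,ik->ijk", delta, 1 - delta)

r = p.sum(axis=0)          # model's BC marginal = 0.5 * (1 - delta)
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.where(r > 0, r * np.log(r / q), 0.0).sum()
print(r)                   # [[0, 0.5], [0.5, 0]]
print(kl)                  # inf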

In the above calculations, we attempted to fit a model to a collection of tables whose schema graph is cyclic. This observation suggests the topological structure of our database has something to do with the type of behavior observed in the above examples. We will return to this topological question later in the chapter (Section 5.6), but first we will discuss the structure of the types of constraint satisfaction problems which arise from contextual measurement scenarios.

5.4 Motivation from Statistical Privacy

Statistical privacy concerns how data from a statistical database can be released publicly in a way that respects the privacy of the individuals whose data is being released. Simply omitting personally identifiable information is not enough in many situations because other public records could potentially be merged with the released data to identify the individuals in the publicly released data. As such, research has focused on how to construct noise functions which preserve the statistical properties of the original data while making it impossible for an observer using the perturbed data to determine whether or not a particular individual's information is present in the database [11, 41, 87, 126]. Statistical privacy provides another motivation for the study of contextuality we present in this chapter. In databases with a large number of columns, it may be computationally infeasible to manipulate a full joint distribution on the full column set. Instead, analysts will marginalize onto subsets of the full column set, introduce noise, and work with these marginal distributions instead. In general, the introduction of noise to these marginal tables can create a family of marginals which do not arise as the marginals of any joint distribution. In such situations, statistical techniques which assume the collection of marginals arise from a joint distribution are invalid. In this chapter, we develop an approach based on sheaf theory to lay the foundations for adapting statistical methods to families of marginal tables which are not necessarily assumed to arise as marginals of some joint distribution on the full column set.

5.5 Poset of Joins of a Database

When joining two tables together, we form a new table from the two pre-existing tables according to some specified rules for combining them. In the previous chapter, we discussed a few types of join operations and how they could be interpreted in the language of category theory. For a collection of tables which agree on their overlapping columns, we can reformulate the problem of finding a consistent join of the tables as a constraint satisfaction problem, which is known to be NP-complete [68].

5.5.1 Contextual Constraint Satisfaction Problems

Suppose we want to join the tables t_AB and t_A′B from the Bell family. A table which extends these tables is any table which marginalizes to the original two tables. Let π_AB : A × B × A′ → A × B and π_A′B : A × B × A′ → A′ × B denote the canonical projection operators. We are looking for a table t′ : [8] → A × B × A′ for which π_AB ∘ t′ = t_AB and π_A′B ∘ t′ = t_A′B. From the previous chapter, we know that each table is an equivalence class and each equivalence class has a representation as a count of values (Section 4.2.5.3). Let V be the function that accepts a table as an argument and returns its value-count representation. The earlier requirement is then that V(π_AB ∘ t′) = V(t_AB) and V(π_A′B ∘ t′) = V(t_A′B). By matching entries in these tables, we generate a linear constraint satisfaction problem. In this particular example, let n_{ijk} with i, j, k ∈ {0, 1} denote the number of times A = i, B = j, and A′ = k; then our contextual constraint satisfaction problem is given by the set of linear constraint equations

n_000 + n_001 = 4        n_000 + n_100 = 3
n_010 + n_011 = 0        n_001 + n_101 = 1
n_100 + n_101 = 0        n_010 + n_110 = 1
n_110 + n_111 = 4        n_011 + n_111 = 3

together with ∑_{(i,j,k) ∈ {0,1}³} n_{ijk} = 8. This particular constraint satisfaction problem has only a single solution,

n_000 = 3,  n_001 = 1,  n_110 = 1,  n_111 = 3,

with all other entries equal to zero.
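This uniqueness claim is easy to verify by brute force. The following sketch (an illustration, not the appendix implementation) splits each cell count of t_AB between A′ = 0 and A′ = 1 and keeps only the splits consistent with t_A′B:

# Brute-force the contextual constraint satisfaction problem for t_AB and t_{A'B}.
# n[(i, j, k)] tallies rows with A=i, B=j, A'=k; counts are the Bell values above.
from itertools import product

t_AB  = {(0, 0): 4, (0, 1): 0, (1, 0): 0, (1, 1): 4}    # keyed by (A, B)
t_ApB = {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3}    # keyed by (A', B)

solutions = []
# For each (A, B) cell, split its count between A' = 0 and A' = 1.
for splits in product(*(range(t_AB[cell] + 1) for cell in sorted(t_AB))):
    n = {}
    for (i, j), m in zip(sorted(t_AB), splits):
        n[(i, j, 0)], n[(i, j, 1)] = m, t_AB[(i, j)] - m
    if all(n[(0, j, k)] + n[(1, j, k)] == t_ApB[(k, j)] for k, j in t_ApB):
        solutions.append(n)

print(len(solutions))          # 1 -- the unique solution described above
print(solutions[0])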

Proposition 114. (Worst case analysis for number of solutions to a contextual constraint satisfaction problem)

Consider two tables t_1 and t_2 that overlap on a column B. Suppose t_1 has n rows and contains another column A_1 whose values are pairwise distinct. Suppose also that t_2 has n rows and contains an additional column A_2 whose values are likewise pairwise distinct. Moreover, suppose t_1 and t_2 are both constant along the overlapping column B, with the same constant value in each table. In this case, the number of solutions to the contextual constraint satisfaction problem is n!.

Proof. Apply the counting principle. For the first record in t_1, we have n choices of records in t_2. Inductively, the k-th row of t_1 has n − k + 1 possible choices in t_2 for the join, giving n! joins in total.

The above proposition shows that the number of solutions to a constraint satisfaction problem depends on the number of ways of matching overlapping entries and as such can grow combinatorially based on the extent to which the overlapping column fails to function as a primary key for a join. As such, a worst-case analysis can always be performed by assuming the overlapping entries are constant. Note that we could leverage the observation above to construct an algorithm for producing a random join of two tables which agree on the value counts of their overlapping columns: iterate through the rows of one table and, for each, select a random unused record of the other table which agrees on the overlapping columns. The fact that the input tables are required to agree on their overlapping columns ensures that we won't run out of choices at any step in the procedure. A sketch of this procedure appears below.
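A minimal sketch of that random-join procedure, assuming tables stored as lists of dicts (the representation, values, and helper name are illustrative):

# Walk the rows of one table and match each to a random unused row of the other
# table that agrees on the shared columns.
import random

def random_join(t1, t2, shared):
    unused = list(range(len(t2)))
    joined = []
    for row in t1:
        candidates = [i for i in unused
                      if all(t2[i][c] == row[c] for c in shared)]
        pick = random.choice(candidates)   # never empty if value counts agree
        unused.remove(pick)
        joined.append({**row, **t2[pick]})
    return joined

t1 = [{"A": a, "B": a % 2} for a in range(4)]
t2 = [{"B": b % 2, "C": 10 + b} for b in range(4)]
print(random_join(t1, t2, shared=["B"]))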

5.5.2 Poset of Solutions to Contextual Constraint Satisfaction Problems

In database theory, lattices are sometimes used to represent the possible joins and restrictions of a collection of tables constituting a database [89]. In this subsection, we define a poset structure on the collection of solutions to the various constraint satisfaction problems arising from these tables. The poset structure discussed in this subsection serves as motivation for the topology constructed on the simplicial complex associated to the database in the next section. Specifically, we generalize

the observations of the previous section to generate a poset from a collection of tables which consists of all tables that can be constructed from the original collection of tables via marginalization or as a solution to the type of constraint satisfaction problem discussed previously. We can construct a poset on the space of joins where the order relation is based on whether or not a table extends another table.

Fix a finite collection of tables {t_i : [n_i] → X_{C_i}}_{i∈I} and let C_i denote the collection of column names for table t_i. For each σ_i ⊂ C_i, there is a projection operator p_{C_i → σ_i} : X_{C_i} → X_{σ_i}, so we can define a marginal table t_{σ_i} : [n_i] → X_{σ_i} by p_{C_i → σ_i} ∘ t_i. As we saw in the previous section, a family of contextual tables can have multiple joins. Given a collection of tables {t_j : [n_j] → X_{C_j}}_{j∈J} with J ⊂ I, we say that a table t : [n] → X_C is an extension of {t_j}_{j∈J} if the following conditions hold:

• nj = n for all j ∈ J

• C = ∪j∈J Cj

• tj = pC→Cj ◦ t for all j ∈ J.

From the above definitions, we can create a poset whose elements are the collection of tables {t_i : [n_i] → X_{C_i}}_{i∈I} along with the collection of projections and extensions of these tables. The order structure is defined by t ≤ t′ if and only if t′ is an extension of t. We can equip such a collection of tables with an Alexandroff topology generated by lower sets in a manner similar to what we will do for databases in the next section. However, we will not use this topology in any meaningful way in this chapter and so postpone this discussion until the next section.

Example 115. (Bell Context) Let t_AB, t_AB′, t_A′B, and t_A′B′ be the Bell marginals discussed at the beginning of this section. Since these tables all agree on their marginal overlaps, we have tables t_A, t_B, t_A′, and t_B′ obtained by marginalizing onto the variables in the subscripts. The lower set generated by all contexts can be visualized via the Hasse diagram displayed below:

[Hasse diagram: t_AB, t_A′B, t_AB′, t_A′B′ in the top row and t_A, t_B, t_A′, t_B′ in the bottom row, with an arrow from each single-column table to every two-column table containing its column.]

To generate the full topology, we also need to consider the possible extensions of the tables. Unfortunately, the full topology is too difficult to visualize graphically. In the next section, we will consider the smallest possible database which can produce contextuality. For this reason, we postpone any additional visual aids until the next section. Instead, we will present a table counting the number of extensions of each pair and triple of tables in the Bell family. Code for generating these tables can be found in Appendix A.2.3.

Tables                     Number of Extensions
t_AB, t_A′B                1
t_AB, t_AB′                1
t_AB, t_A′B′               14
t_A′B, t_AB′               46
t_A′B, t_A′B′              4
t_AB′, t_A′B′              4

We can similarly count the number of extensions of three tables (Appendix A.2.2):

Tables                     Number of Extensions
t_AB, t_A′B, t_AB′         4
t_AB, t_AB′, t_A′B′        4
t_AB, t_A′B, t_A′B′        4
t_A′B, t_AB′, t_A′B′       14

As a concrete example, the two tables presented below are two possible solutions to the constraint satisfaction problem involving the tables t_AB, t_A′B, and t_AB′ from the Bell marginals. Code for reproducing these results can be found in Appendix A.2.2.

0 0 0 0 0 0 0 0 t1 A = 0, B = 0 A = 0, B = 1 A = 1, B = 0 A = 1, B = 1 A = 0, B = 0 3 0 0 1 A = 0, B = 1 0 0 0 0 A = 1, B = 0 0 0 0 0 A = 1, B = 1 1 0 0 3

107 0 0 0 0 0 0 0 0 t2 A = 0, B = 0 A = 0, B = 1 A = 1, B = 0 A = 1, B = 1 A = 0, B = 0 3 0 0 1 A = 0, B = 1 0 0 0 0 A = 1, B = 0 0 0 0 0 A = 1, B = 1 0 1 1 2.

We also know that there is no table which has all four contexts as its marginals [101]. An implementation of this verification can be found in Appendix A.2.1. Using the table of counts of solutions presented above, we see that the number of solutions to a contextual constraint satisfaction problem can be larger than the number of observations in the table. For instance, the constraint satisfaction problem corresponding to joining t_A′B, t_AB′, and t_A′B′ has fourteen solutions even though the original tables contain only eight observations. As such, we see that enumerating all possible joins can become problematic for large data sets whenever there are many combinations of joins for the overlapping columns, such as would be the case with repeated observations of a categorical feature. For this reason, in practice it will be necessary to use bootstrapping techniques for many statistical methods on contextual databases.

5.6 Topology of a Database Schema

Previously, we saw that for a prescribed set of marginals, the contextuality can always be accounted for by introducing a transition noise from the joint distribution on the full column space onto the collection of marginals constituting a particular measurement scenario and considering the prescribed set of marginals as projections of some higher dimensional random variable. Unfortunately, this representation of the higher dimensional random variable was highly non-unique and as a model was highly non-identifiable due to overparametrization of the transition noise. In light of these considerations, we will propose a definition for contextual statistical models involving morphisms of sheaves between a sheaf of parameters and a sheaf of contextual probability distributions on a topology associated to a database schema (Section 5.8.1). We first expound upon the construction of a database topology discussed at the end of chapter four.

Record linkage refers to the process of finding records in a database or collection of databases that refer to the same entity [39, 105]. The mathematical foundations of the subject were first established by Fellegi and Sunter [48]. In this section, we will discuss constructing a simplicial complex associated to a database. In this discussion, we assume a particularly simple linkage model, i.e. that all links are constructed by overlapping columns. The topological construction presented in this section could be adapted to more general deterministic record linkage scenarios by using more general glueing conditions to connect the tables than are discussed in this section.

5.6.1 Contextual Topology on a Database Schema

Definition 116. Let D be a database consisting of tables t_1, . . . , t_k. Let C_i denote the column set of t_i. We define the abstract simplicial complex associated to the database schema, ∆(D), to be the simplicial complex ∆(D) = ⋃_{i∈[k]} (↓ C_i), where ↓ C_i = {S | S ⊂ C_i}. The sets {↓ C_i}_{i∈[k]} generate a topology which we call the contextual topology associated to the database schema. In other words, the contextual topology τ_D on ∆(D) is the topology whose open sets are the sub-simplicial complexes of ∆(D).

Note that this topology is finite and as such it is an Alexandroff topology, i.e. arbitrary intersections of open sets are open. The topology defined above is common in poset theory [139]. In the next section we will discuss various sheaves and presheaves on ∆(D). Throughout the next section, it will be important to have a running example of a particular database schema and its associated topology to discuss. We will fix the simplest possible contextual database schema and use it as an example topology for all of the sheaves discussed in the next section.

Example 117. Consider a database consisting of three tables: tAB, tAC , and tBC .

Each of the labels A, B, and C corresponds to a binary data type. t_AB counts the number of times A = i and B = j where (i, j) ∈ {0, 1}². t_AC and t_BC tabulate similar counts corresponding to their respective data types. The simplicial complex associated to this database schema can be visualized as the undirected cyclic graph (3-cycle) on the vertices A, B, and C. The poset structure of this simplex can be visualized with the following Hasse diagram:

[Hasse diagram: A, B, C in the bottom row and AB, AC, BC in the top row, with an arrow from each vertex to each edge containing it.]

In the diagram above an arrow X → Y indicates that X ≤ Y. We will use the notation P = {A, B, C, AB, AC, BC} to refer to the underlying poset associated to the abstract simplicial complex. For this particular database, there are a total of eighteen open sets in the contextual topology. These are given by the lower sets of the poset, i.e. the sub-simplicial complexes. In this particular case the contextual topology consists of the following eighteen sets:

∅                          {A}                        {B}
{C}                        {A, B}                     {B, C}
{A, C}                     {A, B, C}                  {AB, A, B}
{BC, B, C}                 {AC, A, C}                 {AB, A, B, C}
{BC, A, B, C}              {AC, A, B, C}              {AB, BC, A, B, C}
{AB, AC, A, B, C}          {BC, AC, A, B, C}          {AB, AC, BC, A, B, C}

Given p ∈ P, we can define ↓ p := {s ∈ P | s ≤ p}. Then the contextual topology associated to this schema can be visualized via the following Hasse diagram (ordered by inclusion, largest open set at the top):

↓AB ∪ ↓AC ∪ ↓BC

↓AB ∪ ↓AC        ↓AB ∪ ↓BC        ↓AC ∪ ↓BC

↓AB ∪ {C}        ↓AC ∪ {B}        ↓BC ∪ {A}

↓AB              ↓AC              ↓BC              {A, B, C}

{A, B}           {A, C}           {B, C}

{A}              {B}              {C}

∅

The arrows in the above diagram correspond to subset inclusion. The collection of open sets of this topology also has the structure of a poset, so it is important to keep in mind the distinction between the underlying poset and the topology constructed on it. In the diagram above, the second row of elements corresponds to the open sets obtained by excluding one of AB, AC, and BC from the poset, e.g. ↓AB ∪ ↓AC is the subset {AB, AC, A, B, C}. Note also that the elements in the third row in the previous diagram correspond to sets excluding two of AB, AC, BC. For example, ↓AB ∪ {C} is the subset {AB, A, B, C}. Recall that a basis for a topology is a collection of open sets which cover the full space X and satisfy the additional requirement that if B1 and B2 are basis elements, then for every x ∈ B1 ∩ B2, there is a basis element B3 with x ∈ B3 and for which B3 ⊂ B1 ∩ B2. A sheaf is determined by its definition on a basis because there is an equivalence of categories between Sh(X) and Sh(B). A proof of this fact can be found in [94]. As such, we can simplify our definition of sheaves by constructing sheaves on a

basis. The contextual database topology discussed in this example has the following basis: B := {↓AB, ↓BC, ↓AC, {A}, {B}, {C}}.

In the next section, we refer to this particular contextual database topology as the 3-cycle database topology for ease of reference.

Remark. Note that {A, B} ≠ ↓ AB. This is important to keep in mind when discussing the glue-ability condition for sheaves, as we will do frequently in the next section. The open subsets in this topology of lower sets can be understood via the one-to-one correspondence between lower sets and sub-simplicial complexes. As such, it can be beneficial to visualize these via their geometric realizations. The top element, P, corresponds to the full undirected 3-cycle on the vertices A, B, and C.

Open sets in the second row from the top correspond to the sub-simplicial complexes obtained by removing one edge from the above picture; e.g. ↓AB ∪ ↓BC corresponds to the path consisting of the edges AB and BC. Open sets in the next row down correspond to an original context of our database together with the remaining isolated vertex; e.g. ↓AB ∪ {C} corresponds to the edge AB together with the isolated vertex C. Open sets in the next row down correspond to the original contexts themselves, e.g. ↓AB corresponds to the single edge AB, except for the set {A, B, C}, which corresponds to the sub-simplicial complex with no edges, i.e. the three isolated vertices A, B, and C. The open sets in the next level down are those obtained from the original contexts by removing an edge, e.g. {A, B} corresponds to the two isolated vertices A and B. Finally, the empty set corresponds to the empty graph.

5.7 Sheaves on Databases

Various statistical concepts can be seen as constructions involving different sheaves on the contextual database topology defined in the previous section. In this section, we will discuss a number of sheaves of interest for statistical analysis on contextual databases. All example sheaves from this section will be constructed on the 3-cycle contextual database topology described at the end of the previous section.

5.7.1 Presheaf of Data Types

When defining databases in chapter four, every attribute had an associated data type which was determined by a set of values that the particular attribute could take. Given a contextual database topology, we can define a presheaf that associates to each open set U, the product of all data types of attributes for the tables belonging to U. The restriction maps of this presheaf are given by projections or identity maps where appropriate.

Example 118. The presheaf of data types on the contextual database topology associated to the 3-cycle with binary data types can be visualized by the following

commutative diagram:

[Diagram mirroring the Hasse diagram of open sets above: every open set containing all three of A, B, and C is assigned {0, 1}³; the open sets containing exactly two of them (↓AB, ↓AC, ↓BC and {A, B}, {A, C}, {B, C}) are assigned {0, 1}²; the singletons are assigned {0, 1}; and ∅ is assigned the one-point set {∗}. The restriction maps are the evident projections and identities.]

5.7.2 Presheaf of Classical Tables of a Fixed Size

A more interesting presheaf is the one that associates to each open set the collection of all tables of size n on the product of the outcome spaces of the attributes present in the open set. The restriction mappings here are given by projections of tables as discussed in chapter four. Note that these open sets are in one-to-one correspondence with the possible constraint satisfaction problems arising from our original contexts (Section 5.5.1). This observation will be used later in the chapter when we discuss the collection of classical approximations of a family of contextual tables (Section 5.11.2).

5.7.3 Sheaf of Counts on Contextual Tables

Each column in a table has an associated data type which specifies the possible values that an observation can take. We can construct a sheaf on the basis by associating to each basis element ↓ C the space of count vectors indexed by the product of the data types of the columns of C. Since the basis elements only involve the column space associated to a particular table or a subset thereof, there is no ambiguity in this prescription. We know that this definition on the basis will allow us to construct the sheaf on the remaining open sets using the equalizer condition for sheaves.

Example 119. We will compute the sheaf of counts for the 3-cycle database topology. We can first define this construction on the original contexts as

N(↓ AB) ≅ N(↓ AC) ≅ N(↓ BC) ≅ ℕ⁴ and define N on the single-column tables as

N({A}) ≅ N({B}) ≅ N({C}) ≅ ℕ².

Any sheaf must map the empty set to the terminal object. In this case, we have N(∅) = ℕ⁰. In order for this construction to define a sheaf, we must also specify the restriction mappings. We define res^U_∅ = ∗ since there is only one map to a singleton set. Thus, we need only define the restriction mappings from the tables to their overlapping columns. In order to define these mappings, we introduce coordinates on N(↓ AB) = ℕ⁴ of the form

(x^{AB}_{00}, x^{AB}_{01}, x^{AB}_{10}, x^{AB}_{11}).

We label the coordinates on N(↓ AC) and N(↓ BC) in an analogous manner. We also label the coordinates on N({A}) = ℕ² as (x^A_0, x^A_1) and use similar definitions for the coordinates on N({B}) and N({C}). We define res^{↓AB}_{\{A\}} : ℕ⁴ → ℕ² by

(x^{AB}_{00}, x^{AB}_{01}, x^{AB}_{10}, x^{AB}_{11}) ↦ (x^{AB}_{00} + x^{AB}_{01}, x^{AB}_{10} + x^{AB}_{11}) and make similar definitions for the other restriction mappings. The equalizer condition in the definition of a sheaf allows us to extend this definition on the basis to the full 3-cycle database topology. For {A, B}, this means that N({A, B}) is given by the equalizer of

N({A}) × N({B}) ⇒ N(∅).

Because N(∅) is just the singleton set, the equalizer condition implies

N({A, B}) = N({A}) × N({B}) = ℕ² × ℕ².

If we equip this space with coordinates (x^A_0, x^A_1, x^B_0, x^B_1), the restriction maps correspond to the canonical projections onto the components of the product. Similar constructions can be used to define N({A, C}) and N({B, C}). A similar argument can be used to define the sheaf of counts on ↓AB ∪ ↓AC, ↓AB ∪ ↓BC, and ↓AC ∪ ↓BC. We will only explicitly construct N(↓AB ∪ ↓BC) because the other constructions work similarly. The equalizer condition requires that N(↓AB ∪ ↓BC) be the equalizer of

N (↓ AB) × N (↓ BC) ⇒ N ({B}) .

Hence, we can define N (↓ AB∪ ↓ BC) as:

{ (x, y) ∈ N(↓AB) × N(↓BC) | res^{↓AB}_{\{B\}}(x) = res^{↓BC}_{\{B\}}(y) }.

This constructs N(↓AB ∪ ↓BC) as a subset of ℕ⁴ × ℕ⁴. If we give the product space the canonical coordinates inherited from N(↓AB) and N(↓BC), i.e. (x^{AB}_{ij}, x^{BC}_{kℓ}), then the equalizer condition means we only consider the collection of coordinates satisfying the following system of linear equations on the counts:

x^{AB}_{00} + x^{AB}_{10} = x^{BC}_{00} + x^{BC}_{01}
x^{AB}_{01} + x^{AB}_{11} = x^{BC}_{10} + x^{BC}_{11}.

These conditions mean that the marginal counts of B agree between N(↓AB) and N(↓BC). Finally, we can extend our definition of N to the full space ∆(D) by defining N(∆(D)) to be the equalizer of

N (↓ AB) × N (↓ BC) × N (↓ AC) ⇒ N ({A}) × N ({B}) × N ({C}) .

The equalizer condition means we can define N(∆(D)) to be the subspace of N(↓AB) × N(↓BC) × N(↓AC), which we can equip with coordinates (x^{AB}_{ij}, x^{BC}_{kℓ}, x^{AC}_{mn})

which are required to satisfy the restriction equations:

x^{AB}_{00} + x^{AB}_{01} = x^{AC}_{00} + x^{AC}_{01}
x^{AB}_{10} + x^{AB}_{11} = x^{AC}_{10} + x^{AC}_{11}
x^{AB}_{00} + x^{AB}_{10} = x^{BC}_{00} + x^{BC}_{01}
x^{AB}_{01} + x^{AB}_{11} = x^{BC}_{10} + x^{BC}_{11}
x^{AC}_{00} + x^{AC}_{10} = x^{BC}_{00} + x^{BC}_{10}
x^{AC}_{01} + x^{AC}_{11} = x^{BC}_{01} + x^{BC}_{11}.
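The gluing condition above is just a finite list of linear equations on the marginal counts, so membership in N(∆(D)) is easy to check. The sketch below (an illustration; array conventions are assumptions) verifies that the three count tables of Example 103, which agree on their pairwise overlaps yet admit no joint table, nonetheless define a section of N over the whole complex:

# Check the equalizer condition for the sheaf of counts N on the 3-cycle topology.
import numpy as np

def glues(x_AB, x_BC, x_AC):
    """Check the six restriction equations of Example 119 (rows = first letter
    of the context, columns = second letter)."""
    return (np.array_equal(x_AB.sum(axis=1), x_AC.sum(axis=1)) and   # A margins
            np.array_equal(x_AB.sum(axis=0), x_BC.sum(axis=1)) and   # B margins
            np.array_equal(x_AC.sum(axis=0), x_BC.sum(axis=0)))      # C margins

# The tables of Example 103: pairwise consistent counts that admit no join.
x_AB = np.array([[1, 0], [0, 1]])
x_BC = np.array([[1, 0], [0, 1]])
x_AC = np.array([[0, 1], [1, 0]])
print(glues(x_AB, x_BC, x_AC))   # True: a section of N over the whole complex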

In subsequent sections, we will mainly define sheaves by discussing their definition on the basis elements, making exceptions when the gluing condition reveals something important, as is the case with contextual probability measures.

5.7.4 Presheaf of Classical Probability Measures

If we presume the data types of each column in our database have the structure of a standard Borel space, we can apply the Giry monad (Section 3.2) to obtain the set of all probability measures on each of the possible products of data types. We can then construct a presheaf of classical probability measures on the contextual topology associated to a database by sending each open set to the collection of all probability distributions on the product space of all data types of columns appearing in the open set. When we define contextual probability measures in Section 5.7.8, we will see that this construction is not a sheaf because it is not what we would obtain by applying the equalizer condition to the basis elements.

5.7.5 Sheaf of Outcome Spaces

In chapter four, each column in a table had an attribute space or data type. As such each point in the poset has an associated data type or product of data types. Let P be the underlying poset from which the contextual topology is constructed. We can construct a sheaf, X , by defining for each p ∈ P , X (↓ p) to be the space of outcomes associated to p. Note that the points in the poset correspond to either the original contexts or tables obtained by projecting onto a subspace of the column space of the original contexts. As such, the equalizer condition allows us to extend

the definition of X to the full contextual topology. This is a generalization of the construction of the sheaf of counts in Section 5.7.3. We will further suppose that these outcome spaces are given the structure of a standard Borel space so that we can use them to define contextual random variables. An example of such a construction can be seen in the sheaf of table counts discussed above. As the outcome space is discrete, we can take the sigma algebra generated by all subsets. We will use the notation X to denote a generic outcome space sheaf.

5.7.6 Contextual Random Variables

We can now define contextual random variables as certain sheaf morphisms. Let S be a standard Borel space equipped with a probability measure ρ. Let X^C = ∏_i X_{C_i} be the product space of the outcomes of all contexts. Define E to be the equalizer of

∏_i X(↓ C_i) ⇒ ∏_{i,j} X(↓ C_i ∩ ↓ C_j).

A contextual random variable is a measurable map α from a probability space (S, B, ρ) into X^C satisfying P[α ∈ E] = 1. In other words, a contextual random variable is a random variable on ∏_i X(C_i) whose image lies in the subspace X(⊤) almost surely, where ⊤ corresponds to the full poset underlying the contextual database topology. Put another way, the family of contextual random variables is the sub-family of random variables constructed in this way with compatible marginal tables determined by the simplicial complex of our database. Unlike our noise model of contextuality, this gives a characterization of contextual random variables which is not subject to as many free parameters. Moreover, by an appropriate reduction of parameters, this insight will allow us to construct an identifiable parametrization of the saturated contextual model, which would have been impossible due to the number of degrees of freedom in the noise model (Definition 125). Finally, the noise model allowed for random variables which do not actually agree on their overlaps, while the sheaf-theoretic definition presented in this section enforces agreement on the overlapping columns.

5.7.7 Sheaf of Parameters

Classically, a statistical model is defined as a function mapping a set of parameters into the collection of probability distributions on some space. A model is said to be identifiable if and only if this mapping is injective. In order to adapt models to contextual sheaves, we first need to discuss sheaves of parameters. A sheaf of parameters associates to each element of the basis in a contextual topology a set of parameters along with restriction mappings relating the parameters on the contexts to their sub-contexts.

Example 120. Consider the contextual database topology corresponding to the 3-cycle. Since each column is presumed to have a binary data type, we can construct a parameter sheaf, P, for the saturated model by associating to ↓AB, ↓BC, and ↓AC the parameter space Δ³ = { (p_{ij}) | p_{ij} ≥ 0, ∑_{i,j} p_{ij} = 1 }. To the sets {A}, {B}, and {C}, we associate the set Δ¹ = { (p_0, p_1) | p_i ≥ 0, p_0 + p_1 = 1 }. The restriction mappings r^U_V each correspond to marginalizing over the set U \ V, e.g. r^{AB}_{B} : Δ³ → Δ¹ is given by r(p_{ij}) = p_{+j} := ∑_i p_{ij}. This parameter presheaf leads to the commutative diagram below:

[Diagram mirroring the Hasse diagram of open sets above: P(⊤) at the top; Δ³ ×_{Δ¹} Δ³ for each two-edge union ↓XY ∪ ↓YZ; Δ³ × Δ¹ for each set of the form ↓XY ∪ {Z}; Δ³ for ↓AB, ↓AC, ↓BC and Δ¹ × Δ¹ × Δ¹ for {A, B, C}; Δ¹ × Δ¹ for the two-element sets; Δ¹ for the singletons; and Δ⁰ for ∅. The arrows are the marginalization and projection maps.]

The sets in the above diagram which were not mentioned previously can all be obtained by applying the equalizer condition to members of the basis. For instance, P({A} ∪ {B}) is the equalizer of

P({A}) × P({B}) ⇒ Δ⁰, and hence P({A, B}) ≅ P({A}) × P({B}) = Δ¹ × Δ¹.

A similar calculation shows

P({A, C}) ≅ P({B, C}) ≅ Δ¹ × Δ¹.

By similar reasoning, P (↓ AB∪ ↓ BC) must be the equalizer of

P (↓ AB) × P (↓ BC) ⇒ P ({B}) .

In this case, we see

P(↓AB ∪ ↓BC) ≅ P(↓AB) ×_{P({B})} P(↓BC), where the right-hand side is the pullback in Set of the cospan

P (↓ AB) → P ({B}) ← P (↓ BC) .

Recall that the pullback is defined explicitly in the following manner:

P(↓AB ∪ ↓BC) = { (p, q) ∈ Δ³ × Δ³ | m^{↓AB}_{\{B\}}(p) = m^{↓BC}_{\{B\}}(q) }.

Finally, applying the equalizer condition to the top element constructs P(⊤) as a subset of Δ³ × Δ³ × Δ³, but the sheaf condition constrains the original nine parameters down to six parameters because we lose a degree of freedom for each of the three marginalization mappings in the equalizer condition.

5.7.8 Sheaf of Contextual Probability Measures

Each column in the database schema has an associated data type. This determines some outcome space which we presume has the structure of a standard Borel space.

We can then construct a probability sheaf by associating to each element of the basis the collection of all probability distributions on its corresponding outcome space. In general, we will denote such a sheaf by G in honor of Michèle Giry. The restriction mappings are the mappings induced by application of the Giry endofunctor to the projection mappings in the outcome space sheaf. More concretely, these correspond to the marginalization mappings (Lemma 80).

Example 121. The sheaf G defined on the contextual topology discussed in the previous example behaves exactly the same as the parameter sheaf because the collection of all probability distributions on a finite space X with k outcomes can be identified with Δ^{k−1}. Previously, we showed the induced sigma algebra on Δ^{k−1} is the same as the Borel sigma algebra on Δ^{k−1} (Example 79). If I is an index set enumerating {↓AB, ↓BC, ↓AC}, then we have G(U_i) ≅ Δ³ and G(U_i ∩ U_j) ≅ Δ¹. The morphisms are given by application of the Giry endofunctor to the projection mappings and as such correspond to marginalization mappings. Let ⊤ = ↓AB ∪ ↓BC ∪ ↓AC. The sheaf condition requires that

G(⊤) = { p ∈ ∏_{i∈I} G(U_i) | m^{U_i}_{U_i ∩ U_j}(p_i) = m^{U_j}_{U_i ∩ U_j}(p_j) for all (i, j) ∈ I × I }.

This realizes p as an element of the product $\prod_{i \in I} \Delta^3$ with the further condition that the overlapping marginals agree. Effectively, the equalizer condition requires us to join the probability measures into a larger-dimensional space in which the marginal distributions on overlapping column sets between contexts are forced to agree.

5.8 Statistical Models on Contextual Sheaves

In this section, we lift the definition of statistical models to contextual database topologies. We then distinguish between classical and contextual factors and use this distinction to define classical snapshots: collections of joins representing the tables which are guaranteed to be projections of some joint table on their common outcome space.

5.8.1 Contextual Statistical Models

Classically, statistical models are defined as maps from a space of parameters into the collection of probability distributions on some measurable space of outcomes. In the previous section, we discussed how to construct sheaves of parameters and sheaves of contextual probability measures. As such, we can lift the definition of statistical models to sheaves on a contextual database topology.

Definition 122. (Contextual Statistical Model) A contextual statistical model is a sheaf morphism from a sheaf of parameters to a sheaf of contextual probability measures.

In light of the previous definition, we can express a couple of examples of contextual statistical models. We start by discussing the adaptation of the independence model to the Bell marginals introduced earlier in this chapter (Section 5.2).

Example 123. (Contextual Independence Model on Bell Contexts) Recall that a basis, B, for the topology generated by the Bell contexts is given by
$$B = \{\downarrow AB,\ \downarrow A'B,\ \downarrow AB',\ \downarrow A'B',\ \{A\},\ \{B\},\ \{A'\},\ \{B'\}\}.$$

Let U denote an arbitrary element of {↓AB, ↓A′B, ↓AB′, ↓A′B′} and let V denote an arbitrary element of {{A}, {B}, {A′}, {B′}}. The sheaf condition implies that it suffices to define our statistical model on the elements of the basis. In this situation, the parameter sheaf is defined to be $P(U) = [0,1]^2$ and $P(V) = [0,1]$. The sheaf of probability measures is defined by $G(U) = \Delta^3$ and $G(V) = \Delta^1$. To define a statistical model, μ, we need to construct a natural transformation between P and G. We can define the components $\mu_U : [0,1]^2 \to \Delta^3$ by
$$\mu_U(x, y) = (xy,\ x(1-y),\ (1-x)y,\ (1-x)(1-y))$$
and $\mu_V : [0,1] \to \Delta^1$ by $\mu_V(z) = (z, 1-z)$.
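A small numerical sketch of this model map (base R, hypothetical names): μ_U sends a pair of Bernoulli parameters to the product distribution on a two-column context, μ_V handles the single-column opens, and naturality amounts to the fact that marginalizing the product distribution recovers μ_V of the restricted parameter.

```r
# A minimal sketch (base R, hypothetical names) of the contextual
# independence model on a Bell context.
mu_U <- function(x, y) c(x * y, x * (1 - y), (1 - x) * y, (1 - x) * (1 - y))
mu_V <- function(z) c(z, 1 - z)

p <- mu_U(0.7, 0.4)                          # distribution on a context, e.g. {A, B}
marg_first <- c(p[1] + p[2], p[3] + p[4])    # marginal of the first variable
all.equal(marg_first, mu_V(0.7))             # TRUE: the naturality square commutes
```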

Remark 124. The above construction only asserts marginal independence within the various contexts. As we cannot guarantee that there exists a joint distribution on $\Delta^{15}$ from which the contextual tables are drawn, it is not sensible to talk about the mutual independence of these random variables in this particular measurement scheme. Moreover, even if the marginal distributions are globally compatible, there are in general many possible choices for distributions with prescribed marginals. We will return to this observation when we discuss classical approximations of a contextual factor.

Definition 125. (Saturated Contextual Model) If all tables involved have data types with finite outcome spaces, the saturated contextual model is the model obtained by using the corresponding probability simplices as both the parameter sheaf and the sheaf of contextual probability measures.

Example 126. (Contextual No Three-Way Interaction Model) This example illustrates some of the nuance involved in extending statistical models to contextual databases. We will see that constructing the parameter sheaf in different ways leads to different possibilities for statistical models. Suppose A, B, and C are discrete random variables whose outcome spaces have cardinalities $k_A$, $k_B$, and $k_C$, respectively. The no three-way interaction model is the graphical model associated to the undirected graph:

[The 3-cycle graph on vertices A, B, and C, with edges AB, BC, and AC.]

The parameters of the model are given by $\theta^A_i$, $\theta^B_j$, $\theta^C_k$, $\theta^{AB}_{ij}$, $\theta^{BC}_{jk}$, and $\theta^{AC}_{ik}$, leading to a total of $k_A + k_B + k_C + k_A k_B + k_B k_C + k_A k_C$ parameters. We can construct local models based on the contexts ↓AB, ↓AC, and ↓BC given by $\mathbb{R}^{k_A} \times \mathbb{R}^{k_B} \times \mathbb{R}^{k_A k_B}$, $\mathbb{R}^{k_A} \times \mathbb{R}^{k_C} \times \mathbb{R}^{k_A k_C}$, and $\mathbb{R}^{k_B} \times \mathbb{R}^{k_C} \times \mathbb{R}^{k_B k_C}$, respectively. We begin by discussing what we call the contextual approximation of the classical no three-way interaction model. In this situation, we can use projection mappings to enforce global consistency among the parameters; e.g. the restriction mapping connecting P(↓AB) and P({A}) is given by the projection mapping

$\pi_A : \mathbb{R}^{k_A} \times \mathbb{R}^{k_B} \times \mathbb{R}^{k_A k_B} \to \mathbb{R}^{k_A}$ onto the A component. The equalizer condition forces $P(\top) = \mathbb{R}^{k_A} \times \mathbb{R}^{k_B} \times \mathbb{R}^{k_C} \times \mathbb{R}^{k_A k_B} \times \mathbb{R}^{k_B k_C} \times \mathbb{R}^{k_A k_C}$. As such, the contextual model constructed by using projection mappings in the parameter sheaf is equivalent to the classical no 3-way interaction model. We have seen previously that there are contextual distributions for which no classical model exists. An alternative possibility for constructing a contextual analog of the no three-way interaction model is to use marginalization of parameters rather than projections. In this case, we use the same parameter spaces as above but require instead that the mapping connecting $\mathbb{R}^{k_A} \times \mathbb{R}^{k_B} \times \mathbb{R}^{k_A k_B}$ to $\mathbb{R}^{k_A}$ is defined by $\pi^{AB}_A(\theta^A_i, \theta^B_j, \theta^{AB}_{ij}) = \theta^A_i \sum_j \theta^B_j \theta^{AB}_{ij}$. Let $\tau^A_i, \tau^C_k, \tau^{AC}_{ik}$ be coordinates for $\mathbb{R}^{k_A} \times \mathbb{R}^{k_C} \times \mathbb{R}^{k_A k_C}$ and let $\pi^{AC}_A(\tau^A_i, \tau^C_k, \tau^{AC}_{ik}) = \tau^A_i \sum_k \tau^C_k \tau^{AC}_{ik}$. Then the equalizer condition requires that P(↓AB ∪ ↓AC) be the pullback associated with $\pi^{AB}_A$ and $\pi^{AC}_A$:

[Pullback square: $P(\downarrow AB) \times_{P(\{A\})} P(\downarrow AC)$ with projections $\pi_1$ and $\pi_2$ to $P(\downarrow AB)$ and $P(\downarrow AC)$, which map to $P(\{A\})$ via $\pi^{AB}_A$ and $\pi^{AC}_A$.]

Notice that this choice results in a much larger parameter space of dimension
$$(k_A + k_B + k_A k_B)(k_B + k_C + k_B k_C)(k_A + k_C + k_A k_C) - k_A - k_B - k_C.$$
For the case where A, B, and C are all binary, this construction leads to an overparametrized (non-identifiable) model containing the saturated contextual model. In general, this model is typically overparametrized, as the number of parameters to fit is $O(k_A^2 k_B^2 k_C^2)$ compared with the $k_A + k_B + k_C + k_A k_B + k_B k_C + k_A k_C$ parameters of the previous construction. Because the local models on the original contexts overparametrize the original model, it is not possible to construct an identifiable parametrization of the no 3-way interaction model for a contextual probability distribution. Note that removing the parameters $\theta^A_i$, $\theta^B_j$, and $\theta^C_k$ in the first parametrization would recover the saturated model.

From these two examples we can see the nuance involved in constructing models for contextual measurement scenarios. Choices made in parametrization can have drastic effects on the identifiability of a model. Of course, this problem still occurs in classical measurement scenarios, but the additional requirement of constructing restriction mappings on the parameter sheaf can create additional complexities if done naively.

5.8.2 Factors

Thus far we have only discussed statistical modeling on the full column space. However, in many applications, we are interested in only a subset of the column space. This motivates the introduction of factors. The idea behind factors is that there is a joint random variable we are interested in which may or may not belong to one of the contexts. In this section we outline these possibilities and introduce a language for discussing contextual statistical inference. When performing statistical analysis on a database, we are oftentimes interested in a subset of the column space, which we call a factor. Formally, a factor, f, is simply a subset of the column set of the entire database, $C_D := \bigcup_{i \in [k]} C_i$ [101]. In the language of graphical models, these are so named because of their connection to factor graphs, which are graphs representing a factorization of the joint distribution of a family of random variables. In our situation, factors naturally correspond to sub-complexes of the simplicial complex associated to our database.

Definition 127. (Classical and Contextual Factors) A given factor identifies a sub-simplicial complex of the database schema by taking the intersection of the factor with the abstract simplicial complex, i.e. $f \cap \Delta(D)$. We say that a factor is classical if the geometric realization of this sub-simplicial complex is contractible, and that the factor is contextual otherwise.

Example 128. The factor {A, B, A′} is an example of a classical factor on the Bell context.

Example 129. The factor {A, B, A′, B′} is an example of a contextual factor on the Bell context.

The inclusion mapping $i : \Delta(f) \hookrightarrow \Delta(D)$ is continuous with respect to the Alexandroff topology generated by the lower sets. As such it determines two functors, $i_* : \mathrm{Sh}(\Delta f) \to \mathrm{Sh}(\Delta D)$, called the direct image, and $i^* : \mathrm{Sh}(\Delta D) \to \mathrm{Sh}(\Delta f)$, called the inverse image, where $i^*$ is left adjoint to $i_*$ and $i^*$ is left exact. In the language of topos theory, the inclusion mapping determines a geometric morphism between the two sheaf topoi. Models on factors can be defined analogously to our original definition on tables.

Definition 130. A contextual statistical model on a factor is a natural transformation between a parameter sheaf on $\Delta f$ and the sheaf of contextual probability measures on $\Delta f$.

5.8.3 Classical Snapshots of a Factor

Let $f \subset C_D$ be a factor. Recall that an open cover of f, $\bigcup_{i=1}^m U_i$, is a collection of open subsets (i.e. lower sets) such that $f \subset \bigcup_{i=1}^m U_i$. We can mimic this definition to define the notion of contextual covers.

Definition 131. Let $D = \{t_1, \ldots, t_k\}$ be a database and let $C_i$ denote the column set of $t_i$. Define $C_D = \bigcup_{i=1}^k C_i$ and, for $\sigma \subset [k]$, define $C_\sigma = \bigcup_{i \in \sigma} C_i$. We say that $C_\sigma$ is a contextual cover of f if and only if $f \subset C_\sigma$.

Example 132. Consider the 3-cycle database topology introduced in Example 117.

Let f = {A, B} be a factor. In this example, the column sets of the three tables are {A, B}, {B, C}, and {A, C}, and the contextual covers of f are then {{A, B}}, {{A, B}, {B, C}}, {{A, B}, {A, C}}, {{B, C}, {A, C}}, and the full measurement scenario {{A, B}, {B, C}, {A, C}}.
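For a schema this small, the contextual covers can simply be enumerated, as in the sketch below (base R, hypothetical names), which checks every nonempty subset of contexts against the covering condition $f \subset C_\sigma$.

```r
# A minimal sketch (base R, hypothetical names): enumerate the contextual
# covers of a factor by brute force over subsets of contexts.
contexts <- list(AB = c("A", "B"), BC = c("B", "C"), AC = c("A", "C"))

contextual_covers <- function(f, contexts) {
  k <- length(contexts)
  covers <- list()
  for (bits in 1:(2^k - 1)) {                      # every nonempty subset of contexts
    sigma <- which(bitwAnd(bits, 2^(0:(k - 1))) > 0)
    cols  <- unique(unlist(contexts[sigma]))       # C_sigma, the union of column sets
    if (all(f %in% cols)) covers[[length(covers) + 1]] <- names(contexts)[sigma]
  }
  covers
}

contextual_covers(c("A", "B"), contexts)
# returns {AB}, {AB,BC}, {AB,AC}, {BC,AC}, and {AB,BC,AC}, as listed in Example 132
```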

In the above example, f = {A, B} is a classical factor. As such, we can simply perform statistical analysis on the table corresponding to {A, B} using standard techniques. We can also consider the contextual covers of a specific contextual factor.

Example 133. Let f = {A, B, C} be a factor on the 3-cycle database topol- ogy. The contextual covers of f are then {{A, B} , {B,C}}, {{A, B} , {A, C}}, {{B,C} , {A, C}}, and {{A, B} , {B,C} , {A, C}}.

If we are interested in performing statistical analysis on the factor in the last example, we could potentially have a problem with the contextual cover {{A, B}, {B, C}, {A, C}}, because there are collections of tables on these contexts which are not marginals of a joint distribution on the full space {A, B, C} (Example 103). However, any of the remaining contextual covers admits a join provided that the tables agree on their overlapping columns (Lemma 98). This observation motivates the idea of maximal classical contextual covers, formalized in the next definition.

Definition 134. We say that a contextual cover, $C_\sigma = \bigcup_{i \in \sigma} C_i$, is a maximal classical contextual cover of the contextual factor f if (1) $f \subset C_\sigma$, i.e. $C_\sigma$ is a contextual cover of f, (2) the geometric realization of $\Delta(C_\sigma)$ is contractible, and (3) for any $C_j$ with $j \notin \sigma$, the geometric realization of $\Delta(C_\sigma \cup C_j)$ is not contractible. In other words, adding any other context to $C_\sigma$ results in the possibility that no join of the contextual tables exists.

Example 135. Let C be the Bell contexts. The maximal classical contextual covers of the global factor f = {A, B, A′, B′} are given by U₁ = {{A, B}, {A, B′}, {A′, B}}, U₂ = {{A, B}, {A, B′}, {A′, B′}}, U₃ = {{A, B}, {A′, B}, {A′, B′}}, and U₄ = {{A, B′}, {A′, B}, {A′, B′}}, because adding the missing context to any of these covers results in closing the 4-cycle.

Definition 136. If f is a contextual factor, we call the set of maximal classical covers the classical snapshot of f.

Statistical inference on contextual factors involves piecing together the statistical properties of the maximal classical factors.

5.9 Subobject Classifier for Contextual Sheaves

In the category of sets, the subobject classifier is simply the two-element set. This creates a one-to-one correspondence between subobjects and maps into the two-element set via characteristic functions. For categories of presheaves and sheaves, the subobject classifier is a more complex object. We first recall the general definition of the subobject classifier in an arbitrary topos and then specialize the discussion to the specific case of a contextual topology on a database.

Definition 137. The subobject classifier $\Omega \in \widehat{\Delta(C_D)}$ is the presheaf that associates to each object $U \in \Delta(C_D)$ the set of sieves on U, and to each arrow $i : U \to V$ the pullback sieve, i.e. the arrow $\Omega(i) : \Omega(V) \to \Omega(U)$ defined by
$$S \mapsto i^*(S) := \{\, j : W \to U \mid i \circ j \in S \,\}.$$

In a topos, the subobject classifier comes equipped with a truth arrow, representing the top value of the Heyting algebra of subobjects of the subobject classifier. For sets this is simply the arrow $\top : 1 \to \{0, 1\}$ defined by $\top(*) = 1$, which picks out the element representing true. We now recall the definition of the truth arrow for a presheaf topos.

Definition 138. $\top : 1 \to \Omega$ is defined to be the natural transformation whose components $\top_U : 1 \to \Omega(U)$ select the maximal sieve on U, i.e. the sieve containing the identity.

In order to make this discussion a bit more specialized to the contextual topologies discussed above, we can work out the sieves on the basis of a contextual topology. Note that sieves on a contextual topology correspond to open subsets.

Example 139. Consider the 3-cycle database topology discussed throughout this chapter. A sieve on the top element is simply a subfunctor. As the collection of open subsets is a poset, the sieves are given by downward-closed subsets of the poset. In the case of a contextual database topology, these just correspond to the open subsets. Thus, $\Omega(\top)$ is the collection of all open sub-simplicial complexes (Example 117).

Remark. In the topos theory literature, the local sections of the subobject classifier are referred to as truth values because these determine the internal logic of the topos.

5.10 Local and Global Sections of a Contextual Sheaf

Recall that the terminal presheaf in the category of presheaves is the presheaf that associates to each open set a set with one element, i.e. 1(U) = {∗} for every U. Every restriction mapping is given by the identity mapping. We use the notation 1 for the terminal presheaf because it is a terminal object in the topos of presheaves $\widehat{\Delta(C_D)}$.

Definition 140. A global section of a presheaf $F \in \widehat{\Delta(C_D)}$ is a natural transformation from the terminal presheaf 1 into the presheaf F.

A particular global section already appeared in the definition of the truth arrow: it picks out the maximal element in the Heyting algebra of subobjects of the subobject classifier Ω.

Example 141. Let $T_n$ be the presheaf of classical tables with n records. A global section of $T_n$ is simply a global table on the full outcome space, along with all marginal tables corresponding to the columns associated with each open set. For example, let $t_{ABC} : [3] \to \{0,1\}^3$ be defined by $t_{ABC}(1) = (0, 0, 1)$, $t_{ABC}(2) = (0, 1, 0)$, $t_{ABC}(3) = (1, 0, 0)$. Then $t_{ABC}$ determines a global section of the sheaf of tables on the contextual topology associated to the complete graph on {A, B, C} discussed earlier. Note that all other components of the global section are determined by the presheaf condition once the top element is defined.
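A quick illustration in base R (hypothetical names): the global table $t_{ABC}$ is a data frame of records, and the components of the corresponding global section at smaller opens are obtained by projecting onto the relevant columns.

```r
# A minimal sketch (base R, hypothetical names): a global table with three
# records and the marginal tables obtained by projecting onto column subsets.
t_ABC <- data.frame(A = c(0, 0, 1), B = c(0, 1, 0), C = c(1, 0, 0))  # records 1..3
t_AB  <- t_ABC[, c("A", "B")]      # component of the global section at ↓AB
t_A   <- t_ABC[, "A", drop = FALSE]
table(t_AB)                        # the induced 2 x 2 contingency table on {A, B}
```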

A very important family of global sections come from the global sections of the subobject classifier Ω. In a general topos, these are known as the truth values of the topos and have a natural structure as a Heyting algebra (Definition 48). This is what in topos theory is referred to as the internal logic of the topos. In general, this internal logic is multivalued and intuitionistic.

Definition 142. A local section of $X \in \widehat{\Delta(C_D)}$ is a natural transformation from a subobject E of the terminal presheaf 1 into the presheaf X.

Note that any subobject E of 1 must associate to each set U either ∅ or {∗}. Moreover, the presheaf condition requires the existence of restriction maps. Thus if E (U) = {∗}, then E (V ) = {∗} for all V ⊂ U.

Example 143. The table tAB in the Bell context determines a local section of the sheaf of tables on the Bell context. We can define

E (↓ AB) = E ({A}) = E ({B}) = E (∅) = {∗} and E (U) = ∅ for all other U in the Bell contextual topology. Then the natural transformation taking E (↓ AB) to the table tAB defines a local section on E since all other mappings are determined by projecting onto subspaces of the attribute space.

Note that the family of Bell tables discussed previously provides an example of a collection of local sections which do not glue together into any global section.

5.11 Fitting Contextual Models

A collection of tables can be thought of as a collection of local sections of the classical table sheaf discussed previously in Section 5.7.2. We have also previously considered the problem of reconstructing tables using constraint satisfaction problems (Section 5.5.2); this corresponds to attempting to glue the local sections of the classical sheaf of tables. As such, that particular sheaf can help us understand the degree of compatibility of the tables, depending on how close they are to admitting a global section. In this section, we discuss the problem of fitting contextual statistical models to data, and in later sections we discuss how knowledge of the glueability of local sections can be used to determine the collection of classical probabilities compatible with the local data in the tables. The sheaf condition for the parameters of a statistical model ensures that overlapping models induced by restriction mappings will be compatible. However, there are, in general, many possibilities for choosing restriction mappings for a contextual model.

5.11.1 Maximum Likelihood Estimation for the Saturated Contextual Model

Let D be a database consisting of tables t1, . . . , tk with column spaces C1,...,Ck, respectively. On each Ci, we can follow a pseudo-likelihood approach by defining a likelihood function analogously to the classical case, i.e.

$$L(x_C \mid \theta_C) = \prod_{i=1}^{k} L(x_{C_i} \mid \theta_{C_i}).$$

The goal of maximum likelihood estimation is then to maximize L subject to the constraint equations generated by the restriction mappings in the statistical model, i.e. for all i, j,
$$\mathrm{res}^{i}_{i,j}\, P(C_i) = \mathrm{res}^{j}_{i,j}\, P(C_j).$$

Example 144. Consider a database schema on three columns A, B, and C. Suppose further that the data type of each column is binary. If we attempt to fit the saturated contextual model to this measurement scenario, our likelihood function is simply the product of multinomial distributions on the contexts, i.e.
$$L(x_C \mid \theta_C) = \left( \frac{n_{AB}!}{\prod (x_{AB}!)} \prod p_{AB}^{x_{AB}} \right) \left( \frac{n_{BC}!}{\prod (x_{BC}!)} \prod p_{BC}^{x_{BC}} \right) \left( \frac{n_{AC}!}{\prod (x_{AC}!)} \prod p_{AC}^{x_{AC}} \right).$$

The log-likelihood function is then the sum of the multinomial log-likelihoods on the individual contexts, i.e.
$$\ell_C(p_{AB}, p_{BC}, p_{AC}) = \ell(p_{AB}) + \ell(p_{BC}) + \ell(p_{AC}),$$
where
$$\ell(p^{AB}) := \log(n_{AB}!) - \sum_{i=1}^{m_{AB}} \log\left(x_i^{AB}!\right) + \sum_{i=1}^{m_{AB}} x_i^{AB} \log p_i^{AB}$$
and $\ell(p_{BC})$ and $\ell(p_{AC})$ are defined similarly. Thus, contextual maximum likelihood estimation can be phrased as optimizing $\ell_C(p_{AB}, p_{BC}, p_{AC})$ subject to the constraints
$$\begin{cases} p^{AB}_{00} + p^{AB}_{01} + p^{AB}_{10} + p^{AB}_{11} = 1 \\ p^{BC}_{00} + p^{BC}_{01} + p^{BC}_{10} + p^{BC}_{11} = 1 \\ p^{AC}_{00} + p^{AC}_{01} + p^{AC}_{10} + p^{AC}_{11} = 1 \\ p^{AB}_{00} + p^{AB}_{01} - p^{AC}_{00} - p^{AC}_{01} = 0 \\ p^{AB}_{00} + p^{AB}_{10} - p^{BC}_{00} - p^{BC}_{01} = 0 \\ p^{BC}_{00} + p^{BC}_{10} - p^{AC}_{00} - p^{AC}_{10} = 0. \end{cases}$$

This optimization problem can be reduced to the following equation:
$$\frac{x_{00}}{p_{00}} + \frac{x_{11}}{1 - q_A - q_B + p_{00}} = \frac{x_{01}}{q_A - p_{00}} + \frac{x_{10}}{q_B - p_{00}},$$
which is equivalent to a degree three equation in $p_{00}$. As such, it has a closed-form solution; however, the closed form is significantly longer than the simple form of the MLE for the multinomial model, and would not fit on the page or be as easily interpretable as the analogous result for the classical multinomial model.
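A numerical sketch of this optimization (base R, hypothetical counts and names): rather than handling the six constraints explicitly, the sketch absorbs them by parametrizing each context's table with the shared one-variable marginals $q_A, q_B, q_C$ and one free cell $p_{00}$ per context, then maximizes the pseudo-likelihood with optim. The multinomial coefficients are dropped since they do not depend on the parameters; this is only a sketch, not the closed-form solution referred to above.

```r
# A minimal sketch (base R, hypothetical names) of contextual maximum
# likelihood for the saturated model on the binary 3-cycle.
cell_probs <- function(p00, qX, qY) c(p00, qX - p00, qY - p00, 1 - qX - qY + p00)

neg_loglik <- function(theta, xAB, xBC, xAC) {
  qA <- theta[1]; qB <- theta[2]; qC <- theta[3]
  pAB <- cell_probs(theta[4], qA, qB)
  pBC <- cell_probs(theta[5], qB, qC)
  pAC <- cell_probs(theta[6], qA, qC)
  if (any(c(pAB, pBC, pAC) <= 0)) return(1e10)   # outside the feasible region
  -sum(xAB * log(pAB)) - sum(xBC * log(pBC)) - sum(xAC * log(pAC))
}

# observed cell counts (x00, x01, x10, x11) for each context
xAB <- c(30, 20, 15, 35); xBC <- c(25, 20, 20, 35); xAC <- c(28, 22, 17, 33)
fit <- optim(c(0.5, 0.5, 0.5, 0.25, 0.25, 0.25), neg_loglik,
             xAB = xAB, xBC = xBC, xAC = xAC)
fit$par   # (qA, qB, qC, p00_AB, p00_BC, p00_AC) at the constrained optimum
```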

5.11.2 Classical Approximation of a Contextual Distribution

In some situations, we may hold a strong belief that the observed outcomes are drawn from a classical random variable whose outcome space coincides with the column space of our database. In this situation, we may want to consider a collection of classical approximations to our contextual distribution. In this section, we formalize this notion. As one such example, consider an algorithm enumerating counts of types of targets in some geographic region. The case of a single target class with no noise has been studied in [16]. Assuming that the region under consideration is well-defined, the true count of targets of each type over a specified time period is a well-defined quantity and should be modeled by a classical probability distribution. Any collection of tables on a collection of contexts which agree on the total number of samples can be thought of as a collection of local sections of the sheaf of tables. The problem of reconstructing a global table from its local sections has already been discussed by Abramsky using the language of sheaf cohomology [2]. Here, we address the question of performing statistical inference on a collection of partial tables. Given a factor, f, we can define C(f) to be the set of maximal classical contextual covers of f (Definition 134). On such a classical contextual cover of the factor, we can use the classical definition of a statistical model. We assume here that our data has a structure suitable for averaging. In the next chapter, we see how to relax this assumption by embedding data into a Giry data frame, which allows us to take convex combinations of arbitrary types of data by embedding them as probability measures. Let $m_f$ denote the number of maximal classical covers of the factor f, and let $e_f : [m_f] \to C(f)$ be an enumeration of these covers. For each maximal classical cover, the resulting contextual constraint satisfaction problem is guaranteed to have a solution; let $k_i$ denote the number of solutions for the cover $e_f(i)$. When performing statistical inference, we need a method of reasoning over the space of classical approximations to our contextual database. Point estimation on contextual tables can easily lead to flawed analysis. Consider three binary random variables A, B, and C tabulated in a database whose simplicial complex is given by the cyclic graph:

[The 3-cycle graph on vertices A, B, and C.]

Suppose the following marginal distributions are stored in our database:

        A = 0   A = 1
B = 0    1/4     1/4
B = 1    1/4     1/4

        B = 0   B = 1
C = 0    1/4     1/4
C = 1    1/4     1/4

        A = 0   A = 1
C = 0    1/4     1/4
C = 1    1/4     1/4

If we consider the resulting contextual constraint satisfaction problem, we arrive at a system of thirteen linear equations:
$$\begin{cases} p_{ij+} = \tfrac{1}{4} & \text{for all } i, j, \\ p_{i+k} = \tfrac{1}{4} & \text{for all } i, k, \\ p_{+jk} = \tfrac{1}{4} & \text{for all } j, k, \\ p_{+++} = 1. \end{cases}$$

The solution set of this linear system is the following line in parametric form:
$$p_{001} = p_{010} = p_{100} = p_{111} = \tfrac{1}{4} - t, \qquad p_{000} = p_{011} = p_{101} = p_{110} = t,$$
where $0 \leq t \leq \tfrac{1}{4}$. Averaging (integrating over the disintegration onto the line spanned by these equations in the probability simplex) would result in the distribution corresponding to the midpoint $t = \tfrac{1}{8}$, and could erroneously lead us to conclude that the outcomes were mutually independent, since the average over all possible joins of the marginally independent distributions is the mutually independent join. As such, classical approximations are more appropriate for analyzing the range of possible explanations of a collection of consistent marginal distributions than for predicting factors which are not covered by a single context.
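The constraint system in this example is small enough to solve numerically. The sketch below (base R, hypothetical helper names) assembles the 13 × 8 coefficient matrix described above, recovers a particular solution with the pseudoinverse, and reads the one-parameter family of joint distributions off the null space.

```r
# A minimal sketch (base R, hypothetical names): recover the family of joint
# distributions p_ijk compatible with the three uniform pairwise marginals.
cells <- expand.grid(A = 0:1, B = 0:1, C = 0:1)      # the 8 joint outcomes
indicator_rows <- function(vars) {
  grid <- expand.grid(rep(list(0:1), length(vars)))
  t(apply(grid, 1, function(v)
    as.numeric(apply(cells[, vars, drop = FALSE], 1, function(x) all(x == v)))))
}
M <- rbind(indicator_rows(c("A", "B")),              # rows summing to p_{ij+}
           indicator_rows(c("B", "C")),              # rows summing to p_{+jk}
           indicator_rows(c("A", "C")),              # rows summing to p_{i+k}
           rep(1, 8))                                # normalization p_{+++}
b <- c(rep(1/4, 12), 1)

s   <- svd(M)
pos <- s$d > 1e-10
p_uniform <- s$v[, pos] %*% ((t(s$u[, pos]) %*% b) / s$d[pos])  # minimum-norm solution
null_dir  <- s$v[, !pos, drop = FALSE]               # one-dimensional null space
round(t(p_uniform), 3)   # all entries 1/8: the (misleading) "average" join
# the full solution set is p_uniform + t * null_dir over the feasible interval of t
```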

5.12 Contextual Hypothesis Testing

When performing a hypothesis test, we compute some statistic from the observed data and then either reject or fail to reject the null hypothesis, according to whether the value of the statistic lies within a range determined by the chosen significance level and the asymptotic distribution of the statistic. In this section, we discuss how to perform hypothesis tests on contextual databases. Classically, a statistic is defined to be a measurable function of the data, and this same definition works for contextual tables. Note that a statistic can be seen as a mapping from $\prod_{i=1}^{k} T_{C_i}$ into some space of values V (typically $V = \mathbb{R}$). In this section, we discuss several methods for hypothesis testing on contextual databases.

5.12.1 Testing if Observed Marginals are Drawn from the Same Distribution

Given a collection of tables we may be interested in whether or not the marginal distributions of the given tables can be assumed to be drawn from the same distribution. This would allow us to test if the tables can be presumed to be independent draws from some contextual probability distribution. As an example of a situation where an analyst could employ this technique, imagine creating a product landing page that suggests two add-on products to users on an e-commerce website. We may have a basket of potential products to offer a user based on some recommender system and we have constructed an experiment that chooses different pairs of products to suggest to the user. We could form a collection of tables tabulating the counts of which combinations of products were purchased on each landing page. Some of these count tables could overlap as products may appear on

multiple landing pages. As such, we may be interested in a hypothesis such as "the fraction of users who prefer a particular product does not depend on the product it is paired with." In order to test this hypothesis, we can form a collection of tables by taking all tables with counts of the particular product and projecting onto the marginal counts for that particular product. The classical way of approaching this test would be to use a chi-squared test on these marginal tables. We can first compute the expected count by summing across all outcomes and dividing by the total number of observations. Next we can compute the chi-squared statistic by iterating through the columns again:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.$$

If m denotes the number of marginals, we can use a chi-squared test with m − 1 degrees of freedom.
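A sketch of this computation in base R (hypothetical counts), assuming for simplicity that each marginal table rests on the same number of observations, so that the expected count is just the mean of the observed marginal counts.

```r
# A minimal sketch (base R, hypothetical counts): chi-squared test that the
# m marginal counts of a single product, collected from the different tables
# in which it appears, are draws from a common distribution.
observed <- c(46, 52, 61)                              # marginal counts, m = 3 tables
expected <- rep(mean(observed), length(observed))      # pooled expected count
chi_sq   <- sum((observed - expected)^2 / expected)
p_value  <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)
c(statistic = chi_sq, p.value = p_value)
```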

5.12.2 Testing if a Collection of Tables can be Explained Classically

A more classical approach to the problem of testing whether or not a collection of tables could be the marginal tables of some global table would involve finding a maximum likelihood estimate for a classical probability subject to the constraints imposed by the marginals. As we have seen previously, marginal distributions do not in general determine a joint distribution on the full probability space. As such, the statistical model naively determined from the marginalization mapping is non-identifiable, and standard asymptotic theory breaks down. The technique of algebraic hypothesis testing developed in chapter seven relies only on the asymptotic convergence properties of the empirical distribution function, rather than of a collection of estimators, as it is based on implicit equations for a statistical model rather than parametric equations. As such, this technique may be preferable in situations like these, especially if the analyst is not equipped with sufficient computational tools to resolve model singularities. As we have seen previously, the space of contextual probabilities is a subspace of the product space $\prod_{i=1}^{k} \Delta^{C_i}$. If $p_i \in \Delta^{C_i}$ is the marginal distribution of some probability distribution on $\Delta^C$, then there exists $p \in \Delta^C$ such that $m_{C_i}(p) = p_i$. Thus, if a contextual probability distribution arises from a classical distribution, it must belong to the image of the linear transformation $\prod_{i=1}^{k} m_{C_i}$. The image of $\Delta^C$ under this transformation, representing the product of the marginalization mappings, identifies the collection of contextual distributions which can arise from a true probability distribution on the outcome space $X_C$. Any $p \in \Delta^C$ can be identified with a vector in $\mathbb{R}^{|C|}$. The image of the full space under the linear transformation $T : \mathbb{R}^{|C|} \to \prod_{i=1}^{k} \mathbb{R}^{|C_i|}$ defines a linear subspace of $\prod_{i=1}^{k} \mathbb{R}^{|C_i|}$. Let $P$ denote the projection operator associated to this linear subspace. If $p_o = \prod p_{C_i}$ represents the observed normalized contextual tables, we can test whether these are the marginal distributions of a classical probability distribution by observing that $p_o - P p_o$ is zero if and only if they come from a classical distribution. Breaking this into linear equations gives us a collection of invariants for the contextual distributions which arise from a classical distribution. With this observation, we can then apply the technique of algebraic hypothesis testing developed in chapter seven in order to obtain a test of significance for whether or not samples are drawn from a contextual distribution.
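A sketch of this check for the binary 3-cycle (base R, hypothetical names; MASS::ginv supplies the pseudoinverse): stack the observed pairwise tables into a single vector, project onto the image of the marginalization map T, and inspect the residual $p_o - P p_o$.

```r
# A minimal sketch (base R with MASS, hypothetical names): residual of the
# observed stacked marginals after projecting onto the image of the
# marginalization map T for three binary variables on the 3-cycle.
library(MASS)
cells <- expand.grid(A = 0:1, B = 0:1, C = 0:1)
marg_block <- function(vars) {
  grid <- expand.grid(rep(list(0:1), length(vars)))
  t(apply(grid, 1, function(v)
    as.numeric(apply(cells[, vars, drop = FALSE], 1, function(x) all(x == v)))))
}
Tmat <- rbind(marg_block(c("A", "B")), marg_block(c("B", "C")), marg_block(c("A", "C")))

p_obs <- c(.30, .20, .10, .40,    # observed normalized table on {A, B}
           .25, .15, .35, .25,    # on {B, C}
           .20, .30, .40, .10)    # on {A, C}
Pr  <- Tmat %*% ginv(Tmat)        # orthogonal projector onto the image of T
res <- p_obs - Pr %*% p_obs
max(abs(res))                     # ~0 exactly when p_obs lies in the image of T
# The entries of the residual, written out as linear forms in p_obs, are the
# invariants fed into the algebraic hypothesis test described above.
```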

5.12.3 A Hypothesis Test for Contextuality

The space of contextual probability distributions can be identified with the intersection of $\prod_{i=1}^{k} \Delta^{C_i}$ with the linear subspace determined by forcing the marginal distributions to agree on overlapping columns. Note that the space of classical marginals is a subspace of this space. By constructing a basis for the subspace of classical probabilities and extending it to a basis of the subspace of contextual distributions, we can obtain invariants for the collection of contextual distributions which are not classical. As such, we can also use the algebraic hypothesis testing technique developed in chapter seven to create a contextual goodness-of-fit statistic.

5.13 Future Work

5.13.1 Contextuality Penalization

Contextuality, in the case of databases, is the phenomenon by which a collection of observed marginal distributions fails to admit any gluing into a joint table. In certain types of analysis we may have a strong belief that our observations actually arise as marginal distributions of some joint distribution on the full column space. In such situations, we may want to add a penalization term to the log-likelihood function in order to reduce the amount of contextuality present in the model. We can start with a contextual L1 penalty. LASSO, or least absolute shrinkage and selection operator, is a statistical technique that performs variable selection and regularization, and is often used to enhance the predictive power and interpretability of the model it produces. In machine learning, this helps combat over-fitting by penalizing complexity. Mathematically, LASSO puts an L1 penalty on the model coefficients. If a collection of tables arises as the result of a probability distribution on the global outcome space $X_C$, then there exists some probability vector $p_C \in \Delta^C$ such that $m_{C_i}(p_C) = p_{C_i}$ for each $C_i$. This determines the collection of classical probabilities as a subspace of the space of contextual probability distributions inside $\prod \Delta^{C_i}$. Using the projection operation, we can construct an energy function in the same manner discussed in chapter seven to penalize contextuality. Future work in this direction could focus on implementing these penalization terms and exploring how they affect statistical analysis of contextual databases.

5.13.2 Sampling for Contextual Probability Distributions

In some situations, such as multiple A/B testing, we may want to summarize our data with a contextual distribution. In this type of setting, we can simulate draws for the marginals independently, because we would not expect exact agreement on the marginal overlaps. This can be done as long as we can draw samples from the local distribution representing each context of interest. If we want to produce a database of samples, we can simply draw the desired number of samples according to the marginal distribution of each context. Note this method will typically not produce perfect agreement on overlapping columns in the database, but for large sample sizes the overlaps should be fairly close to agreement. For many types of contextuality, such as that arising in quantum systems, it is necessary to produce samples from a contextual distribution which satisfy the stronger requirement that the distributions agree perfectly on their overlapping sets. One possible approach to implementing this sampling technique is sketched in the

paragraphs below. First, initialize a database with tables corresponding to the various contexts in our contextual distribution, and also initialize an empty dictionary which will be used to keep track of the previously observed outcomes. Also, initialize an empty list which will contain the set of contexts that have been previously observed. Next, select a random context from the set of all contexts in our contextual distribution and append it to the list of observed contexts. Draw from the local probability distribution associated to this context and add the corresponding columns to the dictionary of observed outcomes. Initialize a tree with the selected context as its root and add children corresponding to the other contexts which have observables that overlap with the selected context. To draw from our contextual distribution, we can use a breadth-first search where a random child node is drawn at each iteration and the resulting child context is then appended to the list of observed contexts. We then draw from the local distribution of that child context, conditioned on the values already recorded in the dictionary of observed outcomes. We terminate the algorithm whenever the list of observed contexts equals the list of contexts of our sampling contextual distribution. Note this procedure is guaranteed to terminate as long as the database schema is connected. By first determining the number of connected components, we could easily adapt this procedure to sample from contextual distributions on non-connected schemas.
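The procedure above can be realized as in the sketch below (base R, hypothetical names and data layout): each context is stored as a data frame of outcomes with a prob column, contexts sharing a variable are treated as adjacent, and each local draw is conditioned on the variables already observed. This is only a sketch for a connected schema; it does not handle contexts whose conditional distributions become degenerate.

```r
# A minimal sketch (base R, hypothetical names) of the breadth-first
# contextual sampler described above.
draw_contextual_sample <- function(contexts) {
  observed <- list()                          # dictionary of observed outcomes
  visited  <- character(0)                    # contexts already drawn
  queue    <- sample(names(contexts), 1)      # start from a random context
  while (length(queue) > 0) {
    name <- queue[1]; queue <- queue[-1]
    if (name %in% visited) next
    tbl  <- contexts[[name]]
    vars <- setdiff(names(tbl), "prob")
    keep <- rep(TRUE, nrow(tbl))              # condition on variables seen so far
    for (v in intersect(vars, names(observed)))
      keep <- keep & (tbl[[v]] == observed[[v]])
    tbl  <- tbl[keep, , drop = FALSE]
    row  <- tbl[sample(nrow(tbl), 1, prob = tbl$prob), , drop = FALSE]
    for (v in vars) observed[[v]] <- row[[v]]
    visited <- c(visited, name)
    for (other in setdiff(names(contexts), visited)) {   # enqueue overlapping contexts
      if (length(intersect(vars, names(contexts[[other]]))) > 0)
        queue <- c(queue, other)
    }
  }
  observed
}

ctxs <- list(
  AB = data.frame(A = c(0, 0, 1, 1), B = c(0, 1, 0, 1), prob = c(.3, .2, .1, .4)),
  BC = data.frame(B = c(0, 0, 1, 1), C = c(0, 1, 0, 1), prob = c(.25, .15, .35, .25)),
  AC = data.frame(A = c(0, 0, 1, 1), C = c(0, 1, 0, 1), prob = c(.2, .3, .4, .1)))
draw_contextual_sample(ctxs)   # e.g. list(A = 0, B = 1, C = 0)
```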

Chapter 6 | Algebraic Hypothesis Testing

6.1 Introduction

In algebraic statistics, a common problem is to compute the ideal (or invariants) for an algebraic statistical model [36,108]. The invariants of a model are obtained by eliminating all parameters and finding polynomial equations whose zero locus defines the model inside the probability simplex. We will construct statistics based on knowledge of these invariants and establish some asymptotic properties of these statistics. These new statistics are alternatives to the likelihood ratio (G2) statistic and Pearson’s chi-squared tests in estimating goodness of fit. Thus, they can serve as an ingredient in model selection (e.g. by adding a penalty, or otherwise comparing fits). We show that this new test performs differently than likelihood ratio and Pearson’s chi-squared, and is in some cases superior. The proposed test has the advantage of being defined in certain situations (such as on the boundary of models) where the other tests fail, and may perform better near singularities due to the fact that it can be directly evaluated on our data without relying on the asymptotic normality of an estimator in order to ensure asymptotic consistency of the test. We show how to construct our test given a model ideal as input, both theoretically and computationally using R and algebraic geometry software. We give several examples of specific models for which our test outperforms alternatives. We also give some theoretical results showing regimes in which this can be expected to occur. The major contributions in this chapter are constructing an energy statistic

based on the invariants of an algebraic statistical model and proving its asymptotic consistency under the null hypothesis. This construction is interesting because its asymptotic properties do not rely on the asymptotic normality of an estimator, since it can be computed from empirical frequencies. Thus, this construction provides an alternative technique for computing goodness-of-fit in situations where standard asymptotic theory breaks down, such as on boundary points of the probability simplex or near singularities of a statistical model. We demonstrate this improved performance near a singularity of the binary 4-cycle undirected graphical model by benchmarking it against the likelihood ratio and chi-squared tests in simulations. The table below reports the average percentage deviation from a perfect procedure at each stated significance level, for samples drawn from the following distribution, which does not lie on the binary 4-cycle undirected graphical model:

q0000 = 0.07453062, q0001 = 0.08634001, q0010 = 0.041189657,

q0011 = 0.0640731, q0100 = 0.008735056, q0101 = 0.1015492,

q0110 = 0.001641928, q0111 = 0.1536179, q1000 = 0.0690093,

q1001 = 0.05180355, q1010 = 0.02624617, q1011 = 0.01191531,

q1100 = 0.06465253, q1101 = 0.04332119, q1110 = 0.122514,

q1111 = 0.07886045.

n = 100
α                    0.01   0.05   0.10   0.15   0.20   0.25   0.30
χ²                    4.7   12.8   18.7   23.0   24.4   26.3   26.2
G²                   28.4   46.0   52.6   55.0   55.4   55.9   53.5
Ours - Davies        -1.0   -4.7   -5.4   -2.3    4.5   11.9   17.2
Ours - Farebrother   -1.0   -4.9   -6.7   -6.0   -1.4    4.2    8.7
Ours - Imhof         -1.0   -4.7   -5.4   -2.3    4.5   11.9   17.2
Ours - Liu           -1.0   -4.7   -5.7   -3.0    3.5   11.9   17.9

6.2 From model to invariants

In this section, we offer a brief review of what an algebraic statistical model and its ideal are, and give the examples we will analyze later. An algebraic statistical model is a statistical model which can be realized as a semi-algebraic set. Given an algebraic statistical model, a set of invariants for the model is a collection of polynomial equations which vanish on the model, i.e. the value of each such polynomial is the same (namely zero) for all points in the probability simplex which belong to the model. The idea of using invariants for statistical inference is not new; however, most of the literature in this area is concentrated on the specific case of phylogenetic invariants [45, 46, 133]. The challenges associated with choosing invariants for statistical inference include developing a principled way to select invariants, as there are infinitely many polynomials belonging to any ideal, and the computational complexity of finding the invariants themselves. Invariants appear implicitly in common statistics. For example, the odds ratio statistic is defined as
$$O = \frac{n_{00}/n_{01}}{n_{10}/n_{11}} = \frac{n_{00}n_{11}}{n_{01}n_{10}},$$
which is equal to 1 precisely on the independence model [44]. A related statistic, the χ² statistic, also bears a relationship to the invariants of the independence model. Recall that the χ² statistic is defined to be

$$\chi^2 := \sum_{i,j} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

where $E_{ij} = \frac{n_{i+} n_{+j}}{n_{++}}$, $n_{++} = \sum_{i,j} n_{ij}$, $n_{i+} = \sum_j n_{ij}$, and $n_{+j} = \sum_i n_{ij}$. By simple algebraic manipulation, we can rewrite this statistic for a 2 × 2 contingency table as:

$$\chi^2 = \frac{(n_{11} + n_{12} + n_{21} + n_{22})\,(n_{12}n_{21} - n_{11}n_{22})^2}{(n_{11} + n_{12})(n_{11} + n_{21})(n_{12} + n_{22})(n_{21} + n_{22})}.$$

In this form it is clear that the χ² statistic is 0 if and only if the point lies on the independence variety, provided that no row or column sum is zero. Formally, this statistic is undefined if any of the row or column sums are zero; however, most software packages will perturb the counts by a small amount to ensure the measured distribution lies in the interior of the probability simplex, so as to ensure convergence of the algorithms used to compute the p-values. Our approach is to create an inner product on probability distributions from the invariants: a matrix H analogous to a Hamiltonian. Viewing the empirical probability distribution function as a state, this state has zero energy if and only if all the invariants in a particular degree are satisfied. Then we compute the distribution of this energy statistic under the null hypothesis to obtain our goodness-of-fit statistic.

6.3 Constructing an inner product from invariants

Given a list of invariants, we may form a vector by evaluating each invariant at our observed data. The most natural form of statistic to consider would be a linear combination of the invariants. However, one desirable property of a statistical test for model membership is that the statistic vanishes if and only if our data actually satisfy the model, and a linear combination of invariants can vanish outside of the model. The easiest way to satisfy this criterion is to notice that, given any collection of real polynomials $g_1, \ldots, g_r$, a point P lies on the algebraic variety $\{g_1 = 0\} \cap \cdots \cap \{g_r = 0\}$ cut out by $I = \langle g_1, \ldots, g_r \rangle$ if and only if $g_1^2 + \cdots + g_r^2 = 0$ at the point P. This observation can be generalized. Let $|\psi\rangle$ be the vector whose i-th entry is $g_i$ evaluated at our normalized data. Then we can consider a statistic of the form $\langle\psi| H |\psi\rangle$ where H is positive definite. As H is positive definite, $\langle\psi| H |\psi\rangle = 0$ if and only if $|\psi\rangle = 0$. In the sequel, we only consider statistics of this form.

Remark. For the purpose of constructing statistics, we may as well assume H is symmetric. Indeed, observe

$$\langle\psi| H |\psi\rangle = \sum_{i,j=1}^{n} h_{ij}\, g_i(x)\, g_j(x) = \sum_{i=1}^{n} h_{ii}\, g_i(x)^2 + \sum_{i < j} (h_{ij} + h_{ji})\, g_i(x)\, g_j(x).$$

From the latter decomposition of $\langle\psi| H |\psi\rangle$ we see that replacing $h_{ij}$ with $\frac{h_{ij} + h_{ji}}{2}$ does not affect the statistic.
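As a concrete sketch (base R, hypothetical names): the statistic is assembled by evaluating each invariant at the normalized data to form $|\psi\rangle$ and contracting with a positive definite H. For the 2 × 2 independence model there is a single invariant, the determinant, and with H the identity the statistic is just its square.

```r
# A minimal sketch (base R, hypothetical names) of the energy statistic
# <psi| H |psi> built from model invariants evaluated at normalized data.
energy_statistic <- function(p, invariants, H = diag(length(invariants))) {
  psi <- vapply(invariants, function(g) g(p), numeric(1))   # the vector |psi>
  as.numeric(t(psi) %*% H %*% psi)                          # <psi| H |psi>
}

g1 <- function(p) p[1] * p[4] - p[2] * p[3]   # determinant invariant, p = (p11, p12, p21, p22)

counts <- c(30, 20, 15, 35)
p_hat  <- counts / sum(counts)
energy_statistic(p_hat, list(g1))   # vanishes exactly when the data lie on the model
```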

6.4 Asymptotic Properties of ⟨ψ| H |ψ⟩

Let $X_i$ be IID random vectors sampled from a distribution p, i.e. $P(X_i = e_j) = p_j$ and $\sum_{i=1}^{k} p_i = 1$. By the central limit theorem, $\sqrt{n}\,(\overline{X}_n - p) \Rightarrow N(0, \Sigma)$, where $\Sigma_{ii} = p_i(1 - p_i)$ and $\Sigma_{ij} = -p_i p_j$. Suppose $\{g_1, \ldots, g_r\}$ are the invariants of a statistical model. We want to design a hypothesis test for membership in this submodel of the saturated model. In other words, we would like to construct a statistic which vanishes only on the collection of distributions belonging to the given submodel. Consider a polynomial function

$$f(x_1, \ldots, x_k) = \langle\psi| H |\psi\rangle = \sum_{i,j=1}^{r} h_{ij}\, g_i(x_1, \ldots, x_k)\, g_j(x_1, \ldots, x_k)$$
with H positive definite and symmetric. As observed previously, $f(x) = 0$ if and only if $g_i(x) = 0$ for all $1 \leq i \leq r$. Moreover,

$$\nabla f(x)_k = \sum_{i,j=1}^{r} h_{ij} \left( \frac{\partial g_i}{\partial x_k} \cdot g_j + g_i \cdot \frac{\partial g_j}{\partial x_k} \right).$$

So, under the null-hypothesis, ∇f (p) = 0.

Thus, in order to apply the delta-method we need to go to the second derivative. We compute the Hessian of f:

$$\nabla^2 f(x)_{k\ell} = \sum_{i,j=1}^{r} h_{ij} \left( \frac{\partial^2 g_i}{\partial x_k \partial x_\ell} \cdot g_j + \frac{\partial g_i}{\partial x_\ell} \cdot \frac{\partial g_j}{\partial x_k} + \frac{\partial g_i}{\partial x_k} \cdot \frac{\partial g_j}{\partial x_\ell} + g_i \cdot \frac{\partial^2 g_j}{\partial x_k \partial x_\ell} \right).$$

Under the null-hypothesis, the above expression simplifies to

$$\nabla^2 f(x)_{k\ell} = \sum_{i,j=1}^{r} h_{ij} \left( \frac{\partial g_i}{\partial x_\ell} \cdot \frac{\partial g_j}{\partial x_k} + \frac{\partial g_i}{\partial x_k} \cdot \frac{\partial g_j}{\partial x_\ell} \right).$$

Assuming this matrix is not identically zero, the delta method implies that

$$2n \cdot f(\overline{X}_n) \Rightarrow Y^{T}\, \nabla^2 f(p)\, Y$$

where $Y \sim N(0, \Sigma)$. The right-hand side is a quadratic form of a multivariate normal distribution. We will investigate properties of such random variables in the next section.

6.5 Quadratic Forms of Multivariate Normal Distributions

In this chapter, we are considering statistical models for various algebraic submodels of the collection of all discrete random variables with a fixed number of outcomes. Any discrete random variable can be thought of as a point lying in the probability simplex. For instance, suppose the random variable X has k possible outcomes labeled by $[k] = \{1, 2, \ldots, k\}$. The law of X can be defined by specifying $P(X = i)$ for each $1 \leq i \leq k$. If we write $p_i$ for $P(X = i)$, the condition that the pushforward measure is a probability measure just tells us $\sum_{i=1}^{k} p_i = 1$, i.e. the point $p = (p_1, \ldots, p_k)^{T}$ lies on the standard $k-1$ simplex. Now consider a collection of independent, identically distributed (IID) random variables whose common sampling distribution is such a discrete random variable. The empirical distribution of the samples, $Y_n = \sum_{i=1}^{n} X_i$, is a multinomial distribution. The central limit theorem tells us that as $n \to \infty$,
$$\sqrt{n}\left(\frac{1}{n} Y_n - p\right) \Rightarrow N(0, \Sigma).$$

Here, ⇒ denotes convergence in distribution, and $\Sigma = \mathrm{diag}(p) - pp^{T}$. Thus, to understand the asymptotic properties of statistics defined as quadratic forms of the algebraic invariants, we need to understand distributions which are quadratic forms of multivariate normal random variables. In 1961, Imhof developed a method for computing the survival function of a quadratic form of a multivariate normal random vector under the assumption that the covariance matrix Σ is non-singular [70]. Several alternative numerical algorithms have also been developed [30, 47, 91]. Under these assumptions, any quadratic form Q of a multivariate normal random vector with non-singular covariance matrix can be written in the form

$$Q = \sum_{r=1}^{n} \lambda_r\, \chi^2_{h_r; \delta_r}$$

where $\chi^2_{h_r;\delta_r}$ is a non-central chi-squared distribution, i.e. $\chi^2_{h_r;\delta_r} = (x_1 + \delta_r)^2 + \sum_{i=2}^{h_r} x_i^2$, where the $x_i$ are IID standard normal random variables [118]. Let $Y \sim N(p, \Sigma)$ be the asymptotic distribution above. For a multinomial distribution, the covariance matrix Σ is positive semi-definite of rank $k - 1$. As such, Σ has a decomposition $\Sigma = LL^{T}$ where L is $k \times (k-1)$. Thus, we may decompose $Y = p + LX$ where $X \sim N(0, I_{(k-1)\times(k-1)})$. Then

$$Y^{T}HY = (p + LX)^{T} H (p + LX) = p^{T}Hp + X^{T}L^{T}Hp + p^{T}HLX + X^{T}L^{T}HLX = p^{T}Hp + 2X^{T}L^{T}Hp + X^{T}L^{T}HLX.$$

By the spectral theorem, there exists an orthogonal matrix U such that $D = U^{T}L^{T}HLU$ is a diagonal $(k-1) \times (k-1)$ matrix. The random variable $Z = U^{T}X \sim N(0, I_{(k-1)\times(k-1)})$. Thus, we may write

$$Y^{T}HY = p^{T}Hp + 2Z^{T}U^{T}L^{T}Hp + Z^{T}DZ.$$

We can break this up into components to find a representation of this statistic as a linear combination of generalized chi-squared random variables. Note that $c = p^{T}Hp$ is just a constant. Let $u_i$ denote the i-th component of $p^{T}HLU$.

$$Y^{T}HY = c + 2\sum_{i=1}^{k-1} u_i Z_i + \sum_{i=1}^{k-1} \lambda_i Z_i^2 = \left( c - \sum_{i=1}^{k-1} \frac{u_i^2}{\lambda_i} \right) + \sum_{i=1}^{k-1} \lambda_i \left( Z_i + \frac{u_i}{\lambda_i} \right)^2$$

as long as $L^{T}HL \neq 0$. Here, the $\lambda_i$'s are the eigenvalues of D. Letting $K = c - \sum_{i=1}^{k-1} \frac{u_i^2}{\lambda_i}$, a constant, we see that

$$Y^{T}HY - K = \sum_{i=1}^{k-1} \lambda_i \left( Z_i + \frac{u_i}{\lambda_i} \right)^2,$$

which is of the form necessary to apply Imhof's method for computing the survival function of a quadratic form of a multivariate normal random vector.

6.6 Estimation of Parameters for the Asymptotic Distribution

6.6.1 Using an MLE

Our goal is to choose the point of our algebraic submodel of the saturated probability simplex which best approximates the data. In other words, we would like to choose a $q \in V$, where V is the variety defined by $\{g_1 = 0\} \cap \cdots \cap \{g_r = 0\} \cap \{\sum_i q_i = 1\}$, which minimizes the information loss
$$K(p \,\|\, q) = \sum_{i=1}^{k} p_i \log\left(\frac{p_i}{q_i}\right)$$
or, equivalently, maximizes the likelihood. One approach to solving this problem is to use the technique of Lagrange multipliers to solve for the collection of critical points. Here we have the function $K(q) = K(q_1, \ldots, q_k) = \sum_{i=1}^{k} p_i \log\left(\frac{p_i}{q_i}\right)$. Observe

$$\nabla K(q)_i = -\frac{p_i}{q_i}.$$

Thus, by clearing denominators in the Lagrange multiplier equations, we obtain the likelihood equations. We can attempt to solve the resulting system of polynomial equations via Gröbner basis techniques. As long as we end up with a finite collection of solutions to the Lagrange multiplier equations, we can evaluate the information loss at each prospective minimizer to find our estimate. Notice that $K(q)$ is continuous on the probability simplex, and V is closed (in the Euclidean topology), being the zero locus of a family of polynomials, and bounded, because V is a subset of the probability simplex. Thus, by the extreme value theorem, our optimization problem indeed has a solution. To construct a hypothesis test based on the invariants, we can say that under the null hypothesis (our data belong to the given algebraic submodel)

$$2n \cdot f(\overline{X}_n) \Rightarrow Y^{T}\, \nabla^2 f(\tilde{p})\, Y$$

where $Y \sim N(\tilde{p}, \tilde{\Sigma})$, $f(\overline{X}_n)$ is the quadratic form evaluated on the invariants, and $\tilde{p}$ and $\tilde{\Sigma}$ are MLEs for the given algebraic submodel. Let $\tilde{\Sigma} = LL^{T}$ be a Cholesky decomposition of the estimated covariance matrix. Assuming $L^{T} \nabla^2 f(\tilde{p}) L \neq 0$, the results in the previous paragraph imply that we have the following asymptotic distribution:

$$2n \cdot f(\overline{X}_n) - \tilde{p}^{T} \nabla^2 f(\tilde{p})\, \tilde{p} + \sum_{i=1}^{k-1} \frac{u_i^2}{\lambda_i} \;\sim\; \sum_{i=1}^{k-1} \lambda_i \left( Z_i + \frac{u_i}{\lambda_i} \right)^2$$

where $u_i$ is the i-th component of $\tilde{p}^{T} \nabla^2 f(\tilde{p}) LU$, U is a unitary matrix which diagonalizes $L^{T} \nabla^2 f(\tilde{p}) L$, and the $\lambda_i$'s are the diagonal elements of this matrix. The survival function of $Q = \sum_{i=1}^{k-1} \lambda_i \left( Z_i + \frac{u_i}{\lambda_i} \right)^2$ may then be evaluated in R using the method of Imhof in the CompQuadForm package.
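A numerical sketch of this evaluation (R; the helper below is hypothetical, and it assumes CompQuadForm's imhof(q, lambda, delta) interface with the upper-tail probability returned in the Qq component, with delta holding the non-centrality parameters $(u_i/\lambda_i)^2$). It follows the formulas above and simply drops any zero eigenvalues, which the derivation implicitly assumes away.

```r
# A minimal sketch (R, hypothetical names): approximate p-value for the energy
# statistic via the asymptotic distribution derived above and Imhof's method.
library(CompQuadForm)

quadform_pvalue <- function(f_obs, n, Hess, p_tilde, Sigma_tilde, tol = 1e-10) {
  eig <- eigen(Sigma_tilde, symmetric = TRUE)            # factor Sigma_tilde = L L^T
  pos <- eig$values > tol
  L   <- eig$vectors[, pos, drop = FALSE] %*% diag(sqrt(eig$values[pos]), sum(pos))
  spec   <- eigen(t(L) %*% Hess %*% L, symmetric = TRUE) # D = U^T L^T H L U
  lambda <- spec$values
  u      <- as.vector(t(p_tilde) %*% Hess %*% L %*% spec$vectors)
  keep   <- abs(lambda) > tol                            # discard zero eigenvalues
  lambda <- lambda[keep]; u <- u[keep]
  q_obs  <- 2 * n * f_obs - as.numeric(t(p_tilde) %*% Hess %*% p_tilde) + sum(u^2 / lambda)
  imhof(q_obs, lambda = lambda, delta = (u / lambda)^2)$Qq
}
```

Here f_obs is the energy statistic evaluated at the empirical frequencies, Hess is the Hessian $\nabla^2 f(\tilde{p})$, and Sigma_tilde the estimated multinomial covariance; the Davies, Farebrother, or Liu routines in the same package could be substituted for imhof.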

6.6.2 Using Normalized Count Data

Using an MLE in the computation of the statistic relies on all the standard assumptions about asymptotic normality of the MLE, which are known to break down for singular statistical models [141, 142]. If instead we use the normalized empirical counts, we know by the strong law of large numbers that these converge pointwise to the true frequencies, and thus also converge in probability to them. The continuous mapping theorem then implies that our energy statistic converges in probability to the true energy of the model. As such, the technique of plugging in empirical counts gives an asymptotically consistent test for whether or not samples are drawn from a distribution in the Zariski closure of the parametric model. It should be noted that the Zariski closure of the parametrization may contain points which are not part of the original model, so some caution must be used in applying this test.

6.7 The Independence Model for a 2 × 2 Contingency Table

In this section, we illustrate the method outlined above in the simplest possible example: the independence model for a 2 × 2 contingency table. We will investigate how the method outlined previously gives us an alternate test for independence.

We can imagine our observed counts are collected in a matrix of the form
$$N = \begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix},$$
from which we can form a matrix of normalized counts
$$P = \frac{1}{n_{++}} N,$$
where $n_{++} = \sum_{i,j=1}^{2} n_{ij}$. The independence model is defined by the vanishing of the determinant of P, i.e.

$$p_{11}p_{22} - p_{12}p_{21} = 0.$$

We let p = (p11, p12, p21, p22) be the vector of normalized counts of the 2 × 2 contingency table. As there is only one invariant, all quadratic form statistics constructed based on such an invariant will be proportional to

$$f(p) = (p_{11}p_{22} - p_{12}p_{21})^2.$$

The sampling distribution is a multinomial distribution, and the maximum likelihood estimate of the saturated model is simply the frequency of counts in each outcome category, i.e. the $p_{ij}$'s above. Following the MLE method described earlier in this chapter, we introduce the Kullback-Leibler function
$$K(q) = p_{11}\log\left(\frac{p_{11}}{q_1}\right) + p_{12}\log\left(\frac{p_{12}}{q_2}\right) + p_{21}\log\left(\frac{p_{21}}{q_3}\right) + p_{22}\log\left(\frac{p_{22}}{q_4}\right).$$

We see that the gradient is then

$$\nabla K(q) = \left\langle -\frac{p_{11}}{q_1}, -\frac{p_{12}}{q_2}, -\frac{p_{21}}{q_3}, -\frac{p_{22}}{q_4} \right\rangle$$
and the gradient of the constraint function $g(q) = q_1 q_4 - q_2 q_3$ is just

$$\nabla g(q) = \langle q_4, -q_3, -q_2, q_1 \rangle,$$
and clearly $\nabla h(q) = \langle 1, 1, 1, 1 \rangle$ where $h(q) = q_1 + q_2 + q_3 + q_4$. Thus, the method of Lagrange multipliers yields the system of polynomial equations:

$$\begin{cases} -p_{11} = q_1\lambda + \mu q_1 q_4 \\ -p_{12} = q_2\lambda - \mu q_2 q_3 \\ -p_{21} = q_3\lambda - \mu q_3 q_2 \\ -p_{22} = q_4\lambda + \mu q_1 q_4 \\ 1 = q_1 + q_2 + q_3 + q_4 \\ 0 = q_1 q_4 - q_2 q_3. \end{cases}$$

Eliminating λ and µ, one can find the solution:
$$\tilde{q} = \langle (p_{11}+p_{12})(p_{11}+p_{21}),\ (p_{11}+p_{12})(p_{12}+p_{22}),\ (p_{11}+p_{21})(p_{21}+p_{22}),\ (p_{12}+p_{22})(p_{21}+p_{22}) \rangle.$$

The Hessian of this statistic under the null hypothesis is then:

 2  2q4 −2q3q4 −2q2q4 −2q1q4  2  2  −2q3q4 2q3 −2q1q4 −2q1q3  ∇ f (q˜) =    −2q q −2q q 2q2 −2q q   2 4 1 4 2 1 2  2 −2q1q4 −2q1q3 −2q1q2 2q1 which is clearly non-zero. Under the null-hypothesis,

$$2n \cdot f(p) \Rightarrow Y^{T} \nabla^2 f(\tilde{q})\, Y$$
where $Y \sim N(p, \Sigma)$ and
$$\Sigma = \begin{pmatrix} (1-q_1)q_1 & -q_1q_2 & -q_1q_3 & -q_1q_4 \\ -q_1q_2 & (1-q_2)q_2 & -q_2q_3 & -q_2q_4 \\ -q_1q_3 & -q_2q_3 & (1-q_3)q_3 & -q_3q_4 \\ -q_1q_4 & -q_2q_4 & -q_3q_4 & (1-q_4)q_4 \end{pmatrix}.$$

Using the symbolic decomposition of Tanabe and Sagae [135], we obtain the following formula for L:
$$L = \begin{pmatrix} \sqrt{(1-q_1)q_1} & 0 & 0 \\[4pt] -q_2\sqrt{\dfrac{q_1}{1-q_1}} & \sqrt{\dfrac{q_2(1-q_1-q_2)}{1-q_1}} & 0 \\[4pt] -q_3\sqrt{\dfrac{q_1}{1-q_1}} & -q_3\sqrt{\dfrac{q_2}{(1-q_1)(1-q_1-q_2)}} & \sqrt{\dfrac{q_3(1-q_1-q_2-q_3)}{1-q_1-q_2}} \\[4pt] -q_4\sqrt{\dfrac{q_1}{1-q_1}} & -q_4\sqrt{\dfrac{q_2}{(1-q_1)(1-q_1-q_2)}} & -q_4\sqrt{\dfrac{q_3}{(1-q_1-q_2)(1-q_1-q_2-q_3)}} \end{pmatrix}.$$

This is about as far as we can go with a symbolic expression for the statistic in this case. We do know that the Hessian of f is non-zero, and we can see that $L^{T}HL$ is not identically zero. In order to evaluate the statistic, we need to find a spectral decomposition of $L^{T}HL$ to get the $u_i$'s and $\lambda_i$'s in the formula derived earlier in this chapter. In the next section, we implement a numerical analysis of this statistic in the R programming language.

6.8 Behavior of Statistic on the Boundary of the Probability Simplex

Neither the G² statistic nor the chi-squared statistic is defined if any row sum or column sum is zero. Our method extends to such cases, although computing a maximum likelihood estimate (MLE) in this situation is subject to a multitude of problems in general. If we use the technique of applying the energy functional to the normalized counts, we can obtain a quick estimate of a p-value. This can be used when designing model architecture or fitting a number of hidden nodes, provided the invariants are known.

6.9 A Test for the Rank of a Contingency Table

We may use the method outlined in Section 6.3 to construct a statistical test to determine whether or not our data were sampled from a table of rank r. Another way to think about the contingency table having rank r is to ask whether or not the observed relationship between the random variables may be explained by a latent variable with r possible outcomes. This observation is formalized in the following

lemma.

Lemma. A contingency table has non-negative rank r if and only if the joint distribution factors as
$$P(X = i, Y = j) = \sum_{h=1}^{r} P(X = i \mid Z = h)\, P(Y = j \mid Z = h)\, P(Z = h).$$

Proof. Suppose our contingency table, M, is an $m \times n$ matrix with rank r. Then M can be written as a sum of r rank-1 matrices, i.e. $M = \sum_{i=1}^{r} M_i$ where each $M_i$ is rank 1. For each $M_i$, define $\alpha_i = \sum_{j,k=1}^{m,n} (M_i)_{jk}$; since the entries of M sum to 1, the $\alpha_i$ sum to 1, so we may define a random variable Z whose law is $P(Z = i) = \alpha_i$. As each $M_i$ is rank 1, it can be written in the form $X_i Y_i^{T}$, and after normalizing we define $X_{i,j} = P(X = j \mid Z = i)$ for $1 \leq j \leq m$ and $Y_{i,j} = P(Y = j \mid Z = i)$ for $1 \leq j \leq n$. For the other direction, suppose

$$P(X = i, Y = j) = \sum_{h=1}^{r} P(X = i \mid Z = h)\, P(Y = j \mid Z = h)\, P(Z = h).$$

Then define αi = P (Z = i) for each 1 ≤ i ≤ r, and

$$M_i = P(X = \cdot \mid Z = i)\, P(Y = \cdot \mid Z = i)^{T}.$$

This expresses M as a linear combination of rank 1 matrices.

Thus, we may generalize the statistic defined in Section 6.3 to test for the presence of latent variables in a contingency table. Note that the number of hidden nodes depends on the non-negative rank of the matrix, while this test finds the actual rank. Thus, when using this technique, follow-up analysis should be performed, such as using the EM algorithm to fit the mixture model and subsequently re-evaluating the goodness of fit.

6.10 Simulation Techniques and Results

The calculation in Section 6.5 allows us to manipulate the asymptotic distribution for our energy statistic into a form where the asymptotic distribution is represented as a sum of non-central chi-squared distributions. The survival function of such a random variable can be computed using the CompQuadForm package [38]. This package implements several numerical algorithms for computing this survival function [30,47,70,91].

Figure 6.1. A scatterplot showing values of the survival function of the invariant-based quadratic form vs. the chi-squared distribution for samples drawn from a uniform distribution.

The idea is to treat the evaluation of this model as a classification problem. By drawing samples from a known true distribution which either belongs or does not belong to the model under consideration, we can repeatedly sample from that distribution and compare the values of the survival functions. Specifying a significance level allows us to score performance based on the accuracy of the classifier, given that we know the true distribution and whether or not it originated from the model. Rather than providing accuracy scores for our results, we produce scatter plots of p-values computed via different methods, i.e. a point in the scatter plot corresponds to the p-values, for the same sample, evaluated according to the methods on the two axes. This allows us to visually inspect the classifier performance in a way whose dependence on the significance level can be understood. In order to get a benchmark for the performance of this statistic, we can do simple comparisons with the chi-squared test. For each run of our experiment, we simulated 1000 samples drawn from a uniform distribution and computed p-values using these methods, for 1000 runs of the experiment. We created a scatter plot of the p-values for the chi-squared test versus the p-values computed using the invariants technique with the Imhof method for computing the survival function. The results of this experiment are presented in Figure 6.1.

Figure 6.2. A scatterplot showing values of the survival function of the invariant-based quadratic form vs. the chi-squared distribution for samples drawn from the distribution $(q_{00}, q_{01}, q_{10}, q_{11}) = (0.1, 0.3, 0.2, 0.4)$.

Running the same experiment with the true distribution

$$(q_{00}, q_{01}, q_{10}, q_{11}) = (0.1, 0.3, 0.2, 0.4)$$

yields a scatter plot which appears to branch off from the chi-squared distribution. This scatter plot is shown in Figure 6.2. In order to demonstrate the potential advantages of our techniques, we consider

the four-cycle model and the degenerate distribution discussed in [57]. Let $q_{ijkl}$ denote the joint probability mass of four binary random variables. Consider the true distribution which satisfies
$$q_{0100} = q_{0111} = q_{1001} = q_{1010} = \frac{1}{4}$$

and all other $q_{ijkl} = 0$. This degenerate distribution belongs to the four-cycle model. From this distribution, we generated a small random noise $\epsilon$ perturbing the true distribution $q$. The exact distribution we used in this simulation is given by

Figure 6.3. A scatterplot of p-values computed for the perturbed degenerate distribution on the binary 4-cycle, comparing the likelihood ratio test vs. the survival function of the invariant-based quadratic form computed via Imhof's method.

q_{0000} = 0.01832664,  q_{0001} = 0.005581024,  q_{0010} = 0.004061217,  q_{0011} = 0.009203675,
q_{0100} = 0.2263769,   q_{0101} = 0.01443055,   q_{0110} = 0.0007282847, q_{0111} = 0.2281081,
q_{1000} = 0.009627385, q_{1001} = 0.2327531,    q_{1010} = 0.2288592,    q_{1011} = 0.005308429,
q_{1100} = 0.005519297, q_{1101} = 0.002346809,  q_{1110} = 0.002494027,  q_{1111} = 0.006275445.

The correct classification according to this distribution is to reject the model. A scatter plot of the likelihood ratio test vs. the Imhof test with a sample size of 400 and 1000 trials is displayed in Figure 6.3. More detailed analysis of this experiment is presented in Appendix B.4.2. Notice that the invariant-based statistic (x-axis) has lower p-values on average than the likelihood ratio test and so can be viewed as the more accurate classifier in this experiment. A similar experiment performed by comparing the chi-squared test with the invariant-based statistic computed using the Davies method is displayed in Figure 6.4.

Figure 6.4. A scatterplot of p-values computed for the perturbed degenerate distribution on the binary 4-cycle, comparing the chi-squared test vs. the survival function of the invariant-based quadratic form computed via Davies' method.

We see again that the invariant-based statistic is better able to reject the model near this degenerate case because it frequently assigns p-values lower than the corresponding values computed with the chi-squared test. In Figure 6.4, these correspond to points lying above the diagonal. For this reason, we believe the statistic developed in this chapter, based on an energy functional of the invariants, can outperform classical techniques in degenerate cases. We studied the performance of p-values computed with this technique versus p-values computed by the chi-squared test and the likelihood ratio test. The results of this simulation are presented in Appendix B.4. These tables show the percentage deviation from the nominal significance level of the test. One result where our statistic performs noticeably better than both the likelihood ratio and the chi-squared test is presented

below. The exact distribution used in this simulation is:

q_{0000} = 0.07453062,  q_{0001} = 0.08634001,  q_{0010} = 0.041189657, q_{0011} = 0.0640731,
q_{0100} = 0.008735056, q_{0101} = 0.1015492,   q_{0110} = 0.001641928, q_{0111} = 0.1536179,
q_{1000} = 0.0690093,   q_{1001} = 0.05180355,  q_{1010} = 0.02624617,  q_{1011} = 0.01191531,
q_{1100} = 0.06465253,  q_{1101} = 0.04332119,  q_{1110} = 0.122514,    q_{1111} = 0.07886045.

The table below shows the percentage deviation from the nominal significance level for several methods. More tables like the one below for different sample sizes and distributions can be found in Appendix B.4.

n = 100
α                    0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ²                    4.7   12.8   18.7   23.0   24.4   26.3   26.2   27.5
G²                   28.4   46.0   52.6   55.0   55.4   55.9   53.5   50.9
Ours - Davies        -1.0   -4.7   -5.4   -2.3    4.5   11.9   17.2   24.7
Ours - Farebrother   -1.0   -4.9   -6.7   -6.0   -1.4    4.2    8.7   14.6
Ours - Imhof         -1.0   -4.7   -5.4   -2.3    4.5   11.9   17.2   24.7
Ours - Liu           -1.0   -4.7   -5.7   -3.0    3.5   11.9   17.9   25.9

6.11 Future Work

6.11.1 Application to Mixture Models

A mixture model is a convex combination of two or more statistical models. Given a model, we can create a mixture model by taking convex combinations of distributions from the model. The method of algebraic hypothesis testing discussed here can be used to determine the number $k$ of components in a mixture of algebraic statistical models. Suppose $\{f_i\}_{i=1}^{r}$ and $\{g_i\}_{i=1}^{s}$ are lists of invariants for a model of interest. Given the empirical distribution $p^*$, we can consider the possible convex decompositions $p_\lambda = \lambda p_1 + (1 - \lambda) p_2$ where

$p_\lambda = p^*$. Then $p^*$ belongs to the mixture model if there exists a $\lambda \in [0, 1]$ such that

$$\lambda \langle f(p_1) \mid H \mid f(p_1) \rangle + (1 - \lambda) \langle g(p_2) \mid H \mid g(p_2) \rangle = 0.$$

As such, we can find the λ for which the left-hand side is minimized and use the Imhof method to calculate a p-value as before. This allows us to perform a hypothesis test without first fitting model parameters, for instance via the EM algorithm. Notice that the size (the dimension of the domain) of the optimization problem scales linearly with the number of hidden variables.

6.11.2 Application To Restricted Boltzmann Machines

Example 145. We review an example from [26]. The restricted Boltzmann machine (RBM) is a statistical model for binary random variables with $n$ visible nodes and $k$ hidden nodes. The model consists of $nk + n + k$ model parameters. Let $W$ be a $k \times n$ matrix, $b$ an $n$-dimensional vector, and $c$ a $k$-dimensional vector. We can define the following energy function:

$$\psi(v, h) = \exp\left(h^T W v + b^T v + c^T h\right).$$

This energy function can be used to define a probability distribution on the visible nodes by
$$p(v) = \frac{1}{Z} \sum_{h \in \{0,1\}^k} \psi(v, h),$$
where $Z = \sum_{v,h} \psi(v, h)$. We can apply the following reparametrization to the RBM:

$$\gamma_i = \exp(c_i), \qquad \omega_{ij} = \exp(W_{ij}), \qquad \beta_j = \exp(b_j),$$

which translates the energy function into the following form:

$$\psi(v, h) = \prod_{i=1}^{k} \gamma_i^{h_i} \cdot \prod_{i=1}^{k} \prod_{j=1}^{n} \omega_{ij}^{h_i v_j} \cdot \prod_{j=1}^{n} \beta_j^{v_j}.$$

We can then express the probability distribution on the visible nodes as:

$$p(v) = \frac{1}{Z}\, \beta_1^{v_1} \beta_2^{v_2} \cdots \beta_n^{v_n} \prod_{i=1}^{k} \left(1 + \gamma_i\, \omega_{i,1}^{v_1} \omega_{i,2}^{v_2} \cdots \omega_{i,n}^{v_n}\right).$$

A proposition from [26] shows that when k = 1, the RBM can be rewritten as a mixture of the independence models for n binary random variables:

$$p(v) = \lambda \prod_{i=1}^{n} \delta_i^{1 - v_i} (1 - \delta_i)^{v_i} + (1 - \lambda) \prod_{i=1}^{n} \epsilon_i^{1 - v_i} (1 - \epsilon_i)^{v_i}.$$

Note that $p(v) = \sum_h p(v, h) = \sum_h p(v \mid h)\, p(h)$, where the conditionals $p(v \mid h)$ are product distributions. This means that the

restricted Boltzmann machine in general is a submodel of $\mathcal{M}_{n,2^k}$, the independence mixture model with $2^k$ components. Using the technique of the previous section, we can tune the hyperparameter $k$ for the RBM using the following procedure:

• Start with $k = 1$.

• Check the $2^k$-component mixture model using algebraic hypothesis testing.

• If the p-value is below the threshold, increase $k$ and repeat the previous step. (A sketch of this loop appears below.)
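The procedure above can be phrased as a short search loop. The Python sketch below is illustrative only; the callable pvalue_for_mixture, which stands in for the algebraic hypothesis test of the 2^k-component mixture model, is a hypothetical placeholder rather than an implemented routine.

def tune_hidden_nodes(pvalue_for_mixture, alpha=0.05, k_max=10):
    # pvalue_for_mixture(m) is assumed to return the algebraic-hypothesis-test
    # p-value for the m-component independence mixture model on the data.
    k = 1
    while k < k_max:
        if pvalue_for_mixture(2 ** k) >= alpha:
            break              # the 2^k-component mixture is not rejected
        k += 1                 # model rejected: add another hidden node
    return k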

Chapter 7 | A Monadic Approach to Missing and Conflicting Data

Monads have been previously applied to database theory in [129], where monads are used to allow databases to contain generalized values such as lists, sets of values, exceptions, and various types of nulls. Spivak uses the framework of monads to show how concepts such as Markov chains, graphs, and finite state automata can all be modeled as different monads. In this chapter, we focus specifically on the Giry monad [59] and discuss techniques for using it to handle conflicting and missing data. The statistical properties of missing data have been studied extensively [58, 90, 116, 117]. Our goal in this chapter is to discuss how an implementation of the Giry monad can be a potentially useful tool for thinking about computations involving missing or conflicting data. In this chapter, we will see that being able to use the unit and multiplication operations of the monad allows us more flexibility with probabilistic computations and can help us design more robust techniques for working with messy data. In particular, we place emphasis on how the Giry monad can be used to represent data frames as collections of empirical measures and how multiple imputation can be naturally expressed in this framework. We discuss the problem of lifting ordinary statistics to these data frames and give an example of how this affects analysis by examining the effect of monadic imputation on k-nearest neighbors clustering. This chapter is speculative and mainly intended to explore how an implementation of the Giry monad could affect probabilistic and statistical calculations. The main results of this chapter are a lemma establishing that measurable statistics lift to the Giry monad and the use of the Giry monad to combine conflicting data

in a way that does not destroy information about the conflicting records. This construction is potentially useful in statistical decision-making situations where we would like to design systems which select more conservative actions in the presence of conflict, such as in target recognition in sensor networks.

7.1 Pullbacks, Maximum Entropy Distributions, and Independence Joins

The categorical structure of independence and conditional independence was first established in [121, 122]. In this section, we examine the relationship of these constructions with the maximum entropy principle. In a later section, we discuss the implications for multiple imputation schemes implemented through the Giry monad, to show why an implementation of the monad should not treat the columns in a table independently, as this destroys any information about correlation between the columns. Let $X$ and $Y$ be discrete random variables with finite outcome spaces. We know that $H(X, Y) \le H(X) + H(Y)$, where $H(X) := -\sum_{x \in X} p_x \ln p_x$ is the entropy of the random variable [25]. Furthermore, equality holds in the above expression if and only if $X$ and $Y$ are independent. This theorem can be interpreted in the language of the Giry monad, which will allow us to create a new technique for multiple imputation of missing data.

A discrete random variable is a map from some probability space $(S, \mathcal{F}_S, \mu)$ into a measurable space isomorphic to $([k], 2^{[k]})$. Note that a random variable $X : (S, \mathcal{F}_S, \mu) \to ([k], 2^{[k]})$ selects a point $G(X) : 1 \to G([k]) \cong \triangle^{k-1}$. Entropy can then be viewed as a map $H : \triangle^{k-1} \to \mathbb{R}_{\ge 0}$ defined by $H : (p_1, \ldots, p_k) \mapsto -\sum_{i \in [k]} p_i \ln p_i$. Let $p \in \triangle^{n-1}$ and $q \in \triangle^{m-1}$ be two probability measures on discrete outcome spaces of size $n$ and $m$, respectively. There is a canonical embedding of $\triangle^{n-1} \times \triangle^{m-1}$ into $\triangle^{mn-1}$. If we give $\triangle^{mn-1}$ coordinates of the form $(r_{ij})$ where $i \in [n]$ and $j \in [m]$, then with respect to these coordinates the embedding is given by $\phi : \triangle^{n-1} \times \triangle^{m-1} \to \triangle^{nm-1}$ defined by $\phi(p_i, q_j) = p_i q_j$. A one-sided inverse to this embedding is given by the product of the marginalization

mappings. Note that the map $\phi$ takes a pair of marginal distributions and constructs the maximum entropy distribution subject to those marginals. A similar construction also works for conditionally independent probability distributions. Two joint probability density functions, $p \in \triangle^{mk-1}$ and $q \in \triangle^{nk-1}$, for which $\sum_i p_{ij} = \sum_k q_{jk}$ for all $j$ determine a co-span:

$$\triangle^{mk-1} \longrightarrow \triangle^{k-1} \longleftarrow \triangle^{nk-1}.$$

The independence join of $p$ and $q$, denoted by $p \amalg q \in \triangle^{mnk-1}$, is defined to be

$$\left(p \amalg q\right)_{ijk} := \frac{p_{ij}\, q_{jk}}{r_j}.$$

This construction produces a commutative square:

$$\begin{array}{ccc}
\triangle^{mnk-1} & \longrightarrow & \triangle^{nk-1} \\
\downarrow & & \downarrow \\
\triangle^{mk-1} & \longrightarrow & \triangle^{k-1}.
\end{array}$$

The usual fibered product is defined to be

$$\triangle^{m} \otimes_{\triangle^{k}} \triangle^{n} \cong \left\{ (p_{ij}, q_{jk}) \mid m_{+j}(p_{ij}) = m_{j+}(q_{jk}) \right\}.$$

As such, the independence join can be seen as a solution to the universal problem solved by pullbacks via the map $\phi : \triangle^{m} \otimes_{\triangle^{k}} \triangle^{n} \to \triangle^{m+n-k}$ defined by $\phi(p_{ij}, q_{jk}) \mapsto \frac{p_{ij}\, q_{jk}}{r_j}$. Note that this expression is well-defined because we can take $r_j = m_{+j}(p_{ij}) = m_{j+}(q_{jk})$. As such, for finite probability measures, the independence join can be seen as a weaker form of the pullback construction. The typical approach to finding maximum entropy distributions for continuous random variables involves setting up constraints on the distribution [17,

19, 20, 72–74]. For instance, the uniform distribution is the maximum entropy distribution subject to the constraint that the support of the density is $[a, b]$. Alternatively, the normal distribution is the maximum entropy distribution subject to a constraint on the variance. This observation suggests a method for imputing missing values where we maximize the entropy of the missing columns in a row subject to the constraint that we match the empirical distributions of the respective columns.
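To make these constructions concrete for finitely supported distributions, the short Python sketch below computes the maximum entropy product of two marginals (the map $\phi$) and the independence join over a shared variable. The function names and the NumPy array representation are our own illustrative conventions, not part of the constructions above.

import numpy as np

def max_entropy_product(p, q):
    # phi(p, q)_{ij} = p_i * q_j: the maximum entropy joint distribution
    # whose marginals are p and q.
    return np.outer(p, q)

def independence_join(p, q):
    # p has entries p[i, j], q has entries q[j, l]; j indexes the shared
    # variable.  The join is (p join q)[i, j, l] = p[i, j] * q[j, l] / r[j].
    r = p.sum(axis=0)                       # shared marginal computed from p
    assert np.allclose(r, q.sum(axis=1))    # must agree with the marginal from q
    joint = np.einsum('ij,jl->ijl', p, q)
    return joint / np.where(r > 0, r, 1.0)[None, :, None]

p = np.array([[0.2, 0.1], [0.3, 0.4]])      # joint distribution of (X, J)
q = np.array([[0.25, 0.25], [0.1, 0.4]])    # joint distribution of (J, Y)
print(independence_join(p, q).sum())        # 1.0: the join is a distribution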

7.2 Merging Conflicting Tables

In chapter four (Section 4.4.3), we saw how merging two records which agree on their overlap can be understood as a weaker form of the pullback construction in the category of sets. Given two records $r_1 : 1 \to X \times Y$ and $r_2 : 1 \to Y \times Z$, the merged record $r_1 \bowtie r_2$ is the universal solution to the following pullback diagram:

$$\begin{array}{ccc}
X \times Y \times Z & \longrightarrow & X \times Y \\
\downarrow & & \downarrow \pi_Y \\
Y \times Z & \xrightarrow{\;\pi_Y\;} & Y,
\end{array}$$

where the record $1$ maps to $X \times Y$ by $r_1$ and to $Y \times Z$ by $r_2$, and $r_1 \bowtie r_2 : 1 \to X \times Y \times Z$ is the unique induced map into the pullback.

When two tables are connected by a primary key and there is disagreement about the value in a shared column between the two tables, an analyst must make some decision about how to complete the merge. For columns whose data can be modeled by real numbers, a common choice is to take the average. If we assume the measurement disagreement is due to some random noise in our instrumentation, this is a reasonable choice that reduces the variance in many circumstances. One such situation is analyzed in the example below.

Example 146. Let $X_1$ and $X_2$ be random variables. Assume $X_1 = c + N_1$ and $X_2 = c + N_2$, where the $N_i$ are independent random variables representing noise, each assumed to have mean 0. Let $\sigma_i^2 < \infty$ denote the variance of $N_i$. Then

$$\mathbb{V}\left[\frac{1}{2} X_1 + \frac{1}{2} X_2\right] = \frac{1}{4}\sigma_1^2 + \frac{1}{4}\sigma_2^2.$$

Without loss of generality suppose $\sigma_1^2 < \sigma_2^2$; then $\frac{1}{4}\sigma_1^2 + \frac{1}{4}\sigma_2^2 < \sigma_1^2$ if and only if $\sigma_2^2 < 3\sigma_1^2$. Thus, averaging reduces the variance of the estimator as long as $\frac{1}{4}(\sigma_1^2 + \sigma_2^2) \le \min(\sigma_1^2, \sigma_2^2)$, or alternatively as long as $\max(\sigma_1^2, \sigma_2^2) \le 3 \min(\sigma_1^2, \sigma_2^2)$.

For other types of random variables, such as categorical random variables, the outcome space does not have the type of algebraic structure that is amenable to averaging. By allowing our tables to take values that are probability distributions, we can extend this idea of averaging to random variables whose outcome space does not naturally have that structure. This is possible because the space of probability measures on a sigma-algebra is a convex space, i.e. given any finite collection of probability measures, any convex combination of them also defines a probability measure. We briefly discussed Giry tables in Section 4.5.2. These were tables obtained by applying the Giry endofunctor to a table whose outcome space was equipped with the structure of a sigma-algebra. This gives us a table where each record is a probability distribution on the outcome space. Given a data table $t : N \to X^A$, the corresponding Giry table, $t_G$, is defined by applying the unit $\delta$ to the image of $t$, i.e. $t_G(i) = \delta_{t(i)}$. Recall that the unit of the Giry monad is the natural transformation $1 \Rightarrow G$ whose component functions $\delta_X : X \to G(X)$ take a point $x$ to the probability measure concentrated at $x$. Here, we choose to use the notation $\delta$ rather than the traditional $\eta$ because of the analogy with the Dirac delta function.

Given two tables $t_1 : N \to K \times \prod_{a \in A} X_a$ and $t_2 : M \to K \times \prod_{b \in B} X_b$ with overlapping column sets that are connected by a primary key, $K$, we can use Giry tables to merge the columns by averaging the probability measures on the overlapping columns. We assume the indexing sets in the products $\prod_{a \in A} X_a$ and $\prod_{b \in B} X_b$ are the column names. As such, the joint column space $A \cup B$ leads to decompositions of $A$ and $B$. Note that $A = (A \cap B^C) \cup (A \cap B)$ and $B = (A \cap B) \cup (A^C \cap B)$. Thus, first decompose $t_1$ and $t_2$ as $t^1_K \times t^1_{A \cap B^C} \times t^1_{A \cap B} : N \to K \times X^{A \cap B^C} \times X^{A \cap B}$ and $t^2_K \times t^2_{A^C \cap B} \times t^2_{A \cap B} : M \to K \times X^{A^C \cap B} \times X^{A \cap B}$, respectively. Let $F = t^1_K(N) \cap t^2_K(M)$ denote the set of keys common to both tables. Let $t_K : F \to K$ be defined by $t_K(f) = f$. Define $\left(t^1_{A \cap B^C}\right)_G : F \to G\!\left(X^{A \cap B^C}\right)$ and $\left(t^2_{A^C \cap B}\right)_G : F \to G\!\left(X^{A^C \cap B}\right)$ by post-composition with the unit $\delta$ of the Giry monad. Lastly, define $\left(t^{1 \bowtie 2}_{A \cap B}\right)_G : F \to G\!\left(X^{A \cap B}\right)$ by $f \mapsto \frac{1}{2}\delta_{t_1(f)} + \frac{1}{2}\delta_{t_2(f)}$. Then using the universal property of products we can define the Giry merge of $t_1$ and $t_2$,

denoted $(t_1 \bowtie t_2)_G$, as

$$(t_1 \bowtie t_2)_G = t_K \times \left(t^1_{A \cap B^C}\right)_G \times \left(t^{1 \bowtie 2}_{A \cap B}\right)_G \times \left(t^2_{A^C \cap B}\right)_G.$$

As constructed, this operation is non-associative. Associativity can be restored by also storing, for each column, a natural number counting the number of entries that have been merged together. This would mean column spaces in Giry tables have the form $G(X) \times \mathbb{N}$ where $X$ is some measurable space. An entry in a column is then a tuple $(\mu, n)$ where $\mu$ is a probability measure on $X$ and $n$ is a count tracking the number of merges that have already occurred. Two tuples $(\mu, n)$ and $(\nu, m)$ can be merged as $\left(\frac{n}{n+m}\mu + \frac{m}{n+m}\nu,\; n + m\right)$. This construction allows us to merge conflicting records in an associative manner. Using the Giry monad, we have seen how to average out conflicts for data types which do not come equipped with an algebraic structure amenable to averaging. This gives us a technique for handling conflicts between columns containing categorical data, for example. In machine learning, many algorithms require you to perform one-hot encoding on categorical variables before passing them to the function which actually fits a model to your data. This process involves constructing an enumeration $e : [k] \to X$ of the outcome space for the categorical variable and embedding the outcomes into $\mathbb{R}^{k-1}$. This can be achieved by sending $e(1) \mapsto 0$ in $\mathbb{R}^{k-1}$ and $e(i) \mapsto e_{i-1}$ for $2 \le i \le k$, where $e_j$ denotes the $j$-th standard basis vector of $\mathbb{R}^{k-1}$. This observation allows us to extend one-hot encoding to Giry data frames. Note that an enumeration also determines an embedding of the probability simplex $\triangle^{k-1}$ as the convex hull of the images. As such, given a probability measure $\mu \in G(X)$, we can determine a vector in $\mathbb{R}^{k-1}$ whose components are determined by the enumeration $e$, namely $(\mu(e(2)), \ldots, \mu(e(k)))$. We denote the extension of this enumeration to the Giry data frame as $e_G$. In order to represent Giry tables on a computer, we need to discuss what type of data structure could be used as a model for a probability distribution on an outcome space. For categorical random variables whose outcome space is finite, this can be achieved via a dictionary whose keys are the various outcomes and whose values are the probabilities of the corresponding outcomes. This technique also works for outcomes of a real-valued random variable, as any table must contain only a finite number of data points. Note that the size of the dictionary is now a

random variable whose expected size grows proportionally to the number of merges performed, where entries in this dictionary have a particular outcome as key and the number of records with that outcome as value.
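As a minimal illustration of the count-augmented merge just described, the Python sketch below stores a measure as an outcome-to-probability dictionary paired with a merge count. The representation and function names are our own choices for illustration.

from collections import defaultdict

def delta(x):
    # Unit of the Giry monad on a record: a point mass, with merge count 1.
    return ({x: 1.0}, 1)

def merge_entries(entry1, entry2):
    # Merge (mu, n) and (nu, m) as (n/(n+m)*mu + m/(n+m)*nu, n+m).
    # Tracking the counts n and m is what makes the merge associative.
    mu, n = entry1
    nu, m = entry2
    total = n + m
    merged = defaultdict(float)
    for outcome, prob in mu.items():
        merged[outcome] += (n / total) * prob
    for outcome, prob in nu.items():
        merged[outcome] += (m / total) * prob
    return dict(merged), total

# Two conflicting categorical records are averaged rather than overwritten.
print(merge_entries(delta("mustard"), delta("spicy mustard")))
# ({'mustard': 0.5, 'spicy mustard': 0.5}, 2)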

7.3 Imputing Missing Data with Giry Tables

Imputation is the process by which null values in a column are replaced with some estimated value. One popular technique is replacing a null entry with its column mean, median, or some other fixed value [86]. This technique reduces the variance of that column because it adds additional rows with $X_i = \mu$. By first counting the number of missing values in the original series and subtracting this number from the denominator when estimating the variance, we can remove the effect of these imputed values. Unfortunately, when preparing data to be fit with a machine learning algorithm, the result of preprocessing is usually represented by a multi-dimensional array and no information about how many records were imputed is seen by the procedure fitting the model. One consequence of this particular implementation is that when using the fitted model to forecast confidence intervals we may systematically under-predict the width of these intervals due to the reduction in spread introduced by the imputation process. Other techniques for imputation involve attempting to predict the missing values. We will use $n$ to denote the number of records in our table and $p$ to denote the number of predictor columns. When predicting real-valued columns, linear regression and gradient boosting are common options. Linear regression achieves this in $O(p^2 n + p^3)$, which thus scales linearly for a fixed number of predictors. If we additionally cap the number of trees, gradient boosting will also scale linearly with the size of our data and thus provides an approach for prediction with fewer assumptions about the nature of the predictor function [64]. Many other techniques, such as k-nearest neighbors, scale super-linearly and so may be impractical for large data sets. Moreover, even if we assume that these techniques provide consistent estimates for $\mathbb{E}[Y \mid X]$, they still decrease the variance of the conditional distribution of $Y \mid X$, which will make predicted confidence intervals thinner in the same manner we discussed previously. For categorical data, the analog of fixed value imputation could be simply encoding null values as an additional category and incorporating the null values as

part of your model. Other approaches typically rely on various multiple imputation schemes or maximum likelihood estimates using the EM algorithm. Multiple imputation techniques make distributional assumptions, typically of multivariate normality, and the scalability of the EM algorithm is directly influenced by the techniques used to compute the probabilities of the latent variables and to estimate values for the next iteration. Discussing these existing techniques in complete depth would be too much of a distraction, as these are the subjects of several books, cf. [115, 116]. A brief survey of the history and further references can be found in [117]. Giry data frames are a general framework that can incorporate aspects of likelihood estimation and multiple imputation. As such, an implementation of this design pattern could provide some engineering benefits in certain applications. For instance, imputing by measures and rewriting model fitting algorithms as functionals would allow us to perform multiple imputation without parallelization. We will discuss the theory of Giry data frames and the problem of lifting statistics to them in this section.

7.3.1 Imputing by Empirical Probability Measure of a Column

Using Giry tables allows us to generalize pre-existing imputation techniques and provides the analyst greater flexibility than other common techniques. Potentially, this can help reduce undesirable side-effects in our imputation schemes. In particular, using Giry tables we can develop both parametric and non-parametric techniques for imputing data. We will start with the Giry analog of imputing by mean, which involves replacing the single estimate of the column mean with the empirical probability measure of the column. This particular scheme is obtained by lifting the mean function to the Giry monad, as discussed in Section 7.3.2.

Definition 147. (Empirical Probability Measure) Let $(X, \mathcal{F})$ be a measurable space such that all singleton sets are measurable. The empirical probability measure associated to a collection of points $s : [n] \to X$ is defined to be

$$\mu_s := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$$

where $x_i = s(i)$ and $\delta_{x_i}$ is the Dirac probability measure concentrated at $x_i$, i.e.
$$\delta_{x_i}(B) = \begin{cases} 1 & x_i \in B \\ 0 & \text{otherwise.} \end{cases}$$

Given a column in a table $t_a : N \to X_a$, we can construct the empirical probability measure of the column by defining

$$\mu_a := \frac{1}{|N|} \sum_{i=1}^{|N|} \delta_{t_a(i)}.$$

To construct the imputation scheme we need to define a mapping $i_{\mu_s} : G(\tilde{X}_a) \to G(\tilde{X}_a)$, where $\tilde{X}_a$ denotes the outcome space $X_a$ extended by the missing value symbol $\mathrm{NA}$. Any probability measure $\nu \in G(\tilde{X}_a)$ can be decomposed as $\nu = \alpha \mu_X + (1 - \alpha)\,\delta_{\{\mathrm{NA}\}}$. Thus, we can define
$$i_{\mu_s}\!\left(\alpha \mu_X + (1 - \alpha)\,\delta_{\{\mathrm{NA}\}}\right) = \alpha \mu_X + (1 - \alpha)\,\mu_s.$$

Note that if our Giry table is obtained by applying the unit of the Giry monad to a regular table, post-composition with $i_{\mu_s}$ will result in a probability measure concentrated at a single point if the row was observed and the empirical measure $\mu_s$ if the row is unobserved. This is the Giry monad analogue of imputing by mean. The advantage of using the Giry monad is that we can also preserve the empirical variance of the column with this imputation scheme. In order to see this, we must discuss the problem of lifting statistics to the Giry monad. Assuming for the moment that this procedure works, note that imputing by the empirical distribution can eliminate correlations between observations across columns. Moreover, if multiple columns are missing and each column is imputed by the empirical probability measure associated to its respective column, this will yield a maximum entropy join of the two empirical column measures and as such will move the estimated value of the covariance between the columns towards zero.
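A minimal Python sketch of the imputation map $i_{\mu_s}$ for finitely supported measures follows; the dictionary representation and the treatment of the NA symbol are our own illustrative conventions.

def impute_by_empirical_measure(nu, mu_s, na="NA"):
    # Decompose nu as alpha*mu_X + (1 - alpha)*delta_NA and replace the NA
    # mass with the empirical column measure mu_s.
    na_mass = nu.get(na, 0.0)
    imputed = {x: p for x, p in nu.items() if x != na}
    for x, p in mu_s.items():
        imputed[x] = imputed.get(x, 0.0) + na_mass * p
    return imputed

mu_s = {"a": 0.25, "b": 0.75}                          # empirical column measure
print(impute_by_empirical_measure({"NA": 1.0}, mu_s))  # {'a': 0.25, 'b': 0.75}
print(impute_by_empirical_measure({"a": 1.0}, mu_s))   # observed rows unchanged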

7.3.2 Lifting Statistics to the Giry Monad

A statistic, $t : X \to V$, is said to lift to $G(X)$ if there exists a function $t^G : G(X) \to V$ such that the triangle formed by the unit $\delta : X \to G(X)$, the lift $t^G : G(X) \to V$, and $t : X \to V$ commutes, i.e. $t^G \circ \delta = t$.

In this section, we discuss lifting common statistics to the Giry monad. We begin by establishing lemmas that illustrate the algebra of extendable statistics. We then use these results to show how using monads can aid us in our design of imputation schemes that better preserve the statistical properties of our data.

Lemma 148. Let t : X → V be a statistic which extends to the Giry monad and let f : V → U be any measurable function. Then f ◦ t extends to the Giry monad.

Proof. Define $(f \circ t)^G := f \circ t^G$. Since $t^G \circ \delta = t$, post-composing with $f$ gives $(f \circ t^G) \circ \delta = f \circ t$, which is exactly the commutativity required for $f \circ t$ to lift: the commuting triangle for $t$ composed with $f : V \to U$ yields the commuting triangle for $f \circ t$.

The main result of this section is characterizing a large class of statistics which admit lifts to the Giry monad. In order to extend the definition of standard statistics to the Giry monad, we need to augment the space of values. For the remainder of this section we focus on statistics taking values in the real numbers. In order to extend statistics to the Giry monad, we need to define them as integrals against probability measures. As such, we need to allow for the possibility that the values of our statistics are infinite or undefined. Let $\mathbb{R}$ be the real numbers equipped with the standard Borel sigma algebra. We define an extended real number system $\tilde{\mathbb{R}} := \mathbb{R} \amalg \{\infty\} \amalg \{-\infty\} \amalg \{\mathrm{NA}\}$ equipped with the coproduct sigma algebra. Note that if $t : X \to \mathbb{R}$ is measurable, then $t : X \to \tilde{\mathbb{R}}$ is also measurable.

Proposition 149. Any measurable $f : X \to \mathbb{R}$ lifts to $f^G : G(X) \to \tilde{\mathbb{R}}$.

Proof. Define $f^G(\mu) = \int f(x)\, d\mu$. Let $x_0 \in X$ be arbitrary and note $f^G(\delta(x_0)) = \int f(x)\, d\delta_{x_0} = f(x_0)$. An arbitrary measurable function $f : X \to \mathbb{R}$ can be decomposed as $f = f^+ - f^-$ where both $f^+$ and $f^-$ are measurable. We can then define the extension to the Giry monad as follows:
$$f^G(\mu) = \begin{cases} \int f(x)\, d\mu & \text{if } \int f^+ d\mu < \infty \text{ and } \int f^- d\mu < \infty \\ \infty & \text{if } \int f^+ d\mu = \infty \text{ and } \int f^- d\mu < \infty \\ -\infty & \text{if } \int f^+ d\mu < \infty \text{ and } \int f^- d\mu = \infty \\ \mathrm{NA} & \text{otherwise.} \end{cases}$$

Remark. The above proposition establishes that essentially any function one would be interested in computing with will lift to the Giry monad. To construct examples of functions which do not lift to the Giry monad, we either have to construct very weak sigma algebras (e.g. given a set $X$, take $\mathcal{F} = \{\emptyset, X\}$) or invoke the axiom of choice. Given that real numbers are only approximately represented on a computer, these considerations should not arise in any practical application of the ideas presented in this section.
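For the finitely supported measures used throughout this chapter, the lift of a statistic reduces to a weighted sum, as the following Python sketch illustrates; the dictionary representation is again our own convention.

def lift_statistic(f):
    # Lift a real-valued statistic f : X -> R to finitely supported measures
    # represented as outcome->probability dictionaries: f^G(mu) = sum f(x) mu(x).
    def f_G(mu):
        return sum(f(x) * p for x, p in mu.items())
    return f_G

mean_G = lift_statistic(lambda x: x)
second_moment_G = lift_statistic(lambda x: x ** 2)

mu = {1.0: 0.5, 3.0: 0.5}             # empirical measure of a column
m = mean_G(mu)                        # 2.0
print(m, second_moment_G(mu) - m**2)  # mean 2.0 and variance 1.0 are preserved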

7.3.3 A Simple Example of Giry Imputation

A method which imputes data by estimating the expected value of a missing record introduces a bias into the second moments of our data, even if the assumptions of the imputation are correct, because it lowers the observed variability in the data and as such reduces the magnitude of the higher order empirical moments. This presents a problem to the data analyst who wants to estimate confidence intervals or predict ranges of possible outcomes of future observations, as their predictions will lead to narrower prediction intervals than can be reasonably justified. To combat this problem, statisticians have proposed multiple imputation, whereby multiple values are drawn from the predictive distribution to offset the bias that imputing a single mean would introduce into the data. The data is split into multiple copies resampled from the predictive distributions for the missing entries, analysis is performed on each sampled copy of the original data, typically in parallel, and the results are pooled together and then returned [21, 90, 115–117].

Giry tables have the benefit of being able to provide the same theoretical benefits as multiple imputation without requiring parallelization. In situations where memory is limited, the additional complexity of implementing the Giry monad may be worthwhile. The Giry monad implementation differs in that the missing values are replaced by their actual predictive distributions. Thus, when writing functions to fit models in the framework of Giry monads, the methods to fit models can be lifted to constructions on the Giry monad. This requires different ways of thinking about designing model fitting criteria, but it is a tradeoff an analyst may be willing to make if memory considerations prohibit storing multiple copies of the original data or limitations in computing resources prohibit running multiple imputations in parallel. An exact implementation of the monad design pattern would involve implementing the method to lift functions to the Giry monad and thus would be compatible with previously implemented functions such as those that fit models or make predictions. To illustrate how imputing by distribution affects existing algorithms, we can investigate the effect that Giry monad imputation has on the Euclidean distance between points. This means that imputation via the Giry monad can potentially affect clustering decisions in algorithms such as k-nearest neighbors. For sparse higher dimensional data, the differences in the computed distance can be a significant consideration when selecting a machine learning algorithm.

Lemma 150. Imputation via any non-point measure increases Euclidean distance between incomplete records and preserves the distance between completely observed records.

Proof. Let $(p_1, p_2, \ldots, p_n)$ be an arbitrary point in $\mathbb{R}^n$. Suppose $(z_1, \ldots, z_n)$ represents a completely observed record. The squared distance is a statistic $t_p = \sum_{i=1}^{n} (p_i - x_i)^2$ whose lift to the Giry monad can be expressed as

$$\mathbb{E}_{t_p}[\mu] = \int \left( \sum_{i=1}^{n} (p_i - x_i)^2 \right) d\mu = \sum_{i=1}^{n} \int (p_i - x_i)^2\, d\mu.$$

Suppose the $j$-th column in this record was observed; then

$$\int (p_j - x_j)^2\, d\mu = \int (p_j - x_j)^2\, d\delta_{z_j} = p_j^2 - 2 p_j \int x_j\, d\delta_{z_j} + \int x_j^2\, d\delta_{z_j} = p_j^2 - 2 p_j z_j + z_j^2 = (p_j - z_j)^2.$$

However, if instead the $j$-th column were imputed by some empirical distribution $\mu_j$ with $\mathbb{E}[\mu_j] = \int x_j\, d\mu_j = z_j$ and $\mathbb{V}[\mu_j] = \int (x_j - z_j)^2\, d\mu_j = \sigma^2 \ne 0$, then
$$\int (p_j - x_j)^2\, d\mu = \int (p_j - x_j)^2\, d\mu_j = p_j^2 - 2 p_j \int x_j\, d\mu_j + \int x_j^2\, d\mu_j = p_j^2 - 2 p_j z_j + z_j^2 + \sigma^2 = (p_j - z_j)^2 + \sigma^2 > (p_j - z_j)^2.$$

Thus, imputing via the Giry monad and lifting the Euclidean distance to a functional on probability measures increases the distance to imputed records as long as the imputed distribution has non-zero variance.

7.4 Future Work

7.4.1 Implementation of Giry Tables

In order to further explore the computational benefits and potential tradeoffs of designing algorithms with the Giry monad, we need to discuss a method of representing this structure. Giry tables can be implemented in practice via dictionaries. The unit natural transformation can simply take a record to a dictionary with a single entry. The key corresponds to the values of the observables and

the value associated to this key is 1, representing the fact that the dictionary is a point mass assigned to that value. Probability measures on finite outcome spaces can be implemented similarly. The keys of the dictionary correspond to the different possibilities for the outcome, while the values corresponding to the keys represent the probability that a particular outcome will be observed. For real-valued random variables, we could simply dynamically add keys based on new observed values and adjust the probabilities at each step. A more efficient way would be to store counts of the observations, also tabulate the total number of observations, and compute probabilities only when needed, so that we do not have to re-normalize the dictionary every time a new entry is added. This approach is fine in theory, but the expected growth of the dictionary is linear assuming we are tabulating a random variable with infinite support. To save storage space, we could use a binning technique to approximate the distribution, similar to how histograms are computed. While histograms will typically use equally spaced bins, such a technique would not be as robust for arbitrary distributions of values, as there could potentially be empty bins depending on how the values are clustered. One principled way of binning a collection of points $\{x_1, \ldots, x_n\}$ would be to divide the points into $k$ clusters such that the total sum of the variances of the clusters is minimized. To achieve this we can first sort the data and divide it into clusters consisting of bins with approximately $n/k$ samples per bin. We can then select the bin with the largest within-cluster variance and see if shifting its largest value to the cluster immediately to the right or shifting its smallest value to the cluster immediately to the left decreases the total variance, choosing the action that results in the greatest decrease in total variance. The algorithm will terminate when there is no choice of shift that will decrease the variance. For samples of real-valued random vectors, the clustering problem is NP-hard [137]; however, many approximate algorithms exist and can be found in many machine learning textbooks, e.g. [64].
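The greedy one-dimensional binning heuristic described above can be sketched in Python as follows; the exact stopping rule and tie-breaking below are our own choices made for illustration.

import numpy as np

def variance_binning(points, k):
    # Start from k contiguous, roughly equal-count bins of the sorted data,
    # then repeatedly shift a boundary value out of the highest-variance bin
    # whenever doing so lowers the total within-bin variance.
    xs = np.sort(np.asarray(points, dtype=float))
    bins = [list(chunk) for chunk in np.array_split(xs, k)]

    def total_var(bs):
        return sum(np.var(b) for b in bs if len(b) > 0)

    while True:
        current = total_var(bins)
        worst = max(range(k), key=lambda i: np.var(bins[i]) if bins[i] else -1.0)
        candidates = []
        if worst + 1 < k and len(bins[worst]) > 1:    # move largest value right
            trial = [list(b) for b in bins]
            trial[worst + 1].insert(0, trial[worst].pop())
            candidates.append(trial)
        if worst - 1 >= 0 and len(bins[worst]) > 1:   # move smallest value left
            trial = [list(b) for b in bins]
            trial[worst - 1].append(trial[worst].pop(0))
            candidates.append(trial)
        best = min(candidates, key=total_var) if candidates else None
        if best is None or total_var(best) >= current:
            break
        bins = best
    return bins

print(variance_binning([1, 2, 3, 10, 11, 12, 40], k=3))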

7.4.2 The Giry Monad and Contextuality

When attempting to resolve a contextual database through its set of maximal classical contextual covers, we saw that proper analysis must take into account the

full range of explanations. Each of these individual tables can be thought of as representing some marginal probability on the outcome space of a table. As such, each maximal factor corresponds to a subspace of the classical outcome space which is consistent with the marginals associated to the various tables. In the case where all outcomes are categorical, this can be identified with the intersection of some linear subspace with the probability simplex on the full space of outcomes. As such, analysis of a maximal factor can be seen as making some decision about how to place a probability measure on this linear subspace. A maximum entropy principle would correspond to the choice of a uniform measure in this case. The multiplication operation of the Giry monad can be used to integrate over the probability measure placed on the subspace of probability measures to produce a probability measure on the original outcome space. As the cautionary examples in the previous chapter prove, care has to be taken not to construct propositions in a way that leads to a self-fulfilling prophecy, e.g. using maximum entropy joins and then concluding mutual independence based only on the marginal independence of the contexts.

Example 151. Recall the Bell marginals discussed in chapter 5. These were a collection of tables on the outcome spaces $A$, $B$, $A'$, $B'$ whose distributions are given by the tables below:

t_{AB}      A = 0   A = 1        t_{A'B}     A' = 0   A' = 1
B = 0         4       0          B = 0          3        1
B = 1         0       4          B = 1          1        3

t_{AB'}     A = 0   A = 1        t_{A'B'}    A' = 0   A' = 1
B' = 0        3       1          B' = 0         1        3
B' = 1        1       3          B' = 1         3        1

When motivating contextuality in chapter 5, we discussed in Example 104 the problem of attempting to forecast the table for $A'$ and $B'$ based on the tables $t_{AB}$, $t_{A'B}$, and $t_{AB'}$ by considering the collection of tables on the outcome space $A, B, A', B'$ and then marginalizing to $A', B'$. The table $t_{A'B'}$ did not belong to any of those tables. This suggests that if we are analyzing outcomes which we believe to be contextual, we should generate the predictive space in a different manner. We can instead consider the collection of tables which will contextually join with the

observed tables $t_{AB}$, $t_{A'B}$, and $t_{AB'}$. This yields the constraint satisfaction problem

produced by requiring the counts and marginal distributions of $t_{A'B'}$ to agree with the marginal counts for $t_{A'B}$ and $t_{AB'}$, respectively. In other words, we are looking at the collection of $(n_{00}, n_{01}, n_{10}, n_{11}) \in \mathbb{N}^4$ which solve the following constraint satisfaction problem:
$$n_{00} + n_{01} = 4, \qquad n_{00} + n_{10} = 4, \qquad n_{00} + n_{01} + n_{10} + n_{11} = 8.$$

By inspection, we can see that there are five possible solutions: (0, 4, 4, 0), (1, 3, 3, 1), (2, 2, 2, 2), (3, 1, 1, 3), and (4, 0, 0, 4). Contextual modelling can be a way of forecasting the possibilities for outcomes in situations where contextuality may play an important role. For instance, when analyzing human behavior, such as the counts of products that may be purchased on a display page, using this technique results in a wider range of possibilities and can account for potential interference effects. As an example of how this type of situation could emerge, imagine we had tested a product landing page displaying mustard and ketchup for users and counted the different combinations of orders. In another test we displayed the same landing page but used ketchup and spicy mustard for the users. Interference effects from the fact that someone is less likely to buy both mustard and spicy mustard could result in a contextual distribution. If we attempted to analyze this situation by treating it as a marginal distribution of one of the joins of the two initial tables, we would not have produced a distribution with the possibility of these interference effects.
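The five solutions listed above can be verified with a brute-force enumeration in a few lines of Python, independently of the python-constraint package used in Appendix A.

from itertools import product

solutions = [
    (n00, n01, n10, n11)
    for n00, n01, n10, n11 in product(range(9), repeat=4)
    if n00 + n01 == 4 and n00 + n10 == 4 and n00 + n01 + n10 + n11 == 8
]
print(solutions)
# [(0, 4, 4, 0), (1, 3, 3, 1), (2, 2, 2, 2), (3, 1, 1, 3), (4, 0, 0, 4)]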

7.4.3 Generalizations of Interval Time Models

The TrueTime [24] model discussed briefly in Section 4.2.7 represents time as a collection of intervals $[a, b]$ where the correct time is guaranteed to belong to the interval. One possible generalization of this would be to treat time as a compactly supported random variable. As such, the timestamps could be represented as a Giry data frame on the datetime objects implemented in the underlying language. A very broad future project could explore how to adapt time series methods to such structures and how to design data visualization techniques for these types of time series.

Appendix A | Supplemental Code for Chapter 5

A.1 Introduction

This appendix contains supplemental code from chapter 5 which generates all solutions to the various constraint satisfaction problems determined by the Bell marginal tables. We present the block of code to generate all solutions to each constraint satisfaction problem along with terminal output displaying the various solutions referenced in chapter 5. All code in this appendix is written in the Python programming language. The focus in this section is making the code easily readable for a large audience. As such, the code is very repetitive and makes no use of type hints or abstract base classes as would be more appropriate in a more reusable implementation.

A.2 Code

A.2.1 CSP for All Bell Marginals

This first code snippet verifies that there is no way to join all four tables in the Bell marginals.

from constraint import *

problem = Problem()

# constrain values for the exhaustive search: since 4 is the largest
# observed count in any Bell table, this is the maximum size of
# any entry in the joined tables
vals = [0, 1, 2, 3, 4]

# nijkl denotes the number of times A=i, B=j, A'=k, B'=l
problem.addVariable("n0000", vals)
problem.addVariable("n0001", vals)
problem.addVariable("n0010", vals)
problem.addVariable("n0011", vals)
problem.addVariable("n0100", vals)
problem.addVariable("n0101", vals)
problem.addVariable("n0110", vals)
problem.addVariable("n0111", vals)
problem.addVariable("n1000", vals)
problem.addVariable("n1001", vals)
problem.addVariable("n1010", vals)
problem.addVariable("n1011", vals)
problem.addVariable("n1100", vals)
problem.addVariable("n1101", vals)
problem.addVariable("n1110", vals)
problem.addVariable("n1111", vals)

# constraints for table AB
# constraint for A=0, B=0
problem.addConstraint(lambda n0000, n0001, n0010, n0011:
                      n0000 + n0001 + n0010 + n0011 == 4,
                      ("n0000", "n0001", "n0010", "n0011"))
# constraint for A=0, B=1
problem.addConstraint(lambda n0100, n0101, n0110, n0111:
                      n0100 + n0101 + n0110 + n0111 == 0,
                      ("n0100", "n0101", "n0110", "n0111"))
# constraint for A=1, B=0
problem.addConstraint(lambda n1000, n1001, n1010, n1011:
                      n1000 + n1001 + n1010 + n1011 == 0,
                      ("n1000", "n1001", "n1010", "n1011"))
# constraint for A=1, B=1
problem.addConstraint(lambda n1100, n1101, n1110, n1111:
                      n1100 + n1101 + n1110 + n1111 == 4,
                      ("n1100", "n1101", "n1110", "n1111"))

# constraints for table A'B
# constraint for B=0, A'=0
problem.addConstraint(lambda n0000, n0001, n1000, n1001:
                      n0000 + n0001 + n1000 + n1001 == 3,
                      ("n0000", "n0001", "n1000", "n1001"))
# constraint for B=0, A'=1
problem.addConstraint(lambda n0010, n0011, n1010, n1011:
                      n0010 + n0011 + n1010 + n1011 == 1,
                      ("n0010", "n0011", "n1010", "n1011"))
# constraint for B=1, A'=0
problem.addConstraint(lambda n0100, n0101, n1100, n1101:
                      n0100 + n0101 + n1100 + n1101 == 1,
                      ("n0100", "n0101", "n1100", "n1101"))
# constraint for B=1, A'=1
problem.addConstraint(lambda n0110, n0111, n1110, n1111:
                      n0110 + n0111 + n1110 + n1111 == 3,
                      ("n0110", "n0111", "n1110", "n1111"))

# constraints for table AB'
# constraint for A=0, B'=0
problem.addConstraint(lambda n0000, n0010, n0100, n0110:
                      n0000 + n0010 + n0100 + n0110 == 3,
                      ("n0000", "n0010", "n0100", "n0110"))
# constraint for A=0, B'=1
problem.addConstraint(lambda n0001, n0011, n0101, n0111:
                      n0001 + n0011 + n0101 + n0111 == 1,
                      ("n0001", "n0011", "n0101", "n0111"))
# constraint for A=1, B'=0
problem.addConstraint(lambda n1000, n1010, n1100, n1110:
                      n1000 + n1010 + n1100 + n1110 == 1,
                      ("n1000", "n1010", "n1100", "n1110"))
# constraint for A=1, B'=1
problem.addConstraint(lambda n1001, n1011, n1101, n1111:
                      n1001 + n1011 + n1101 + n1111 == 3,
                      ("n1001", "n1011", "n1101", "n1111"))

# constraints for table A'B'
# constraint for A'=0, B'=0
problem.addConstraint(lambda n0000, n0100, n1000, n1100:
                      n0000 + n0100 + n1000 + n1100 == 1,
                      ("n0000", "n0100", "n1000", "n1100"))
# constraint for A'=0, B'=1
problem.addConstraint(lambda n0001, n0101, n1001, n1101:
                      n0001 + n0101 + n1001 + n1101 == 3,
                      ("n0001", "n0101", "n1001", "n1101"))
# constraint for A'=1, B'=0
problem.addConstraint(lambda n0010, n0110, n1010, n1110:
                      n0010 + n0110 + n1010 + n1110 == 3,
                      ("n0010", "n0110", "n1010", "n1110"))
# constraint for A'=1, B'=1
problem.addConstraint(lambda n0011, n0111, n1011, n1111:
                      n0011 + n0111 + n1011 + n1111 == 1,
                      ("n0011", "n0111", "n1011", "n1111"))

# total count constraint
problem.addConstraint(lambda n0000, n0001, n0010, n0011,
                             n0100, n0101, n0110, n0111,
                             n1000, n1001, n1010, n1011,
                             n1100, n1101, n1110, n1111:
                      n0000 + n0001 + n0010 + n0011 +
                      n0100 + n0101 + n0110 + n0111 +
                      n1000 + n1001 + n1010 + n1011 +
                      n1100 + n1101 + n1110 + n1111 == 8,
                      ("n0000", "n0001", "n0010", "n0011",
                       "n0100", "n0101", "n0110", "n0111",
                       "n1000", "n1001", "n1010", "n1011",
                       "n1100", "n1101", "n1110", "n1111"))

csp_solutions = problem.getSolutions()
print(len(csp_solutions))

The above code results in the output below:

0

This indicates that the full constraint satisfaction problem involving all of the marginal Bell tables does not have any solution. In the rest of this appendix, we explore the solutions to the other partial constraint satisfaction problems involving the Bell marginal tables and present code for producing all solutions belonging to the poset of solutions discussed in chapter 5 of this dissertation.

A.2.2 CSPs Involving Three Bell Marginals

The remainder of this appendix contains various code snippets and their resulting output. The remaining code is largely reminiscent of the code presented in the previous subsection, mutatis mutandis. In this subsection, we explore the four constraint satisfaction problems resulting from merging the different combinations of three overlapping tables. The next code snippet generates all solutions to the constraint satisfaction problem involving tables AB, A'B, and AB': from constraint import * vals = [0,1,2,3,4] problem = Problem() # nijkl is the number of times A=i, B=j, A'=k, B'=l problem.addVariable("n0000",vals) problem.addVariable("n0001",vals) problem.addVariable("n0010",vals) problem.addVariable("n0011", vals) problem.addVariable("n0100", vals) problem.addVariable("n0101", vals)

180 problem.addVariable("n0110", vals) problem.addVariable("n0111", vals) problem.addVariable("n1000", vals) problem.addVariable("n1001", vals) problem.addVariable("n1010", vals) problem.addVariable("n1011", vals) problem.addVariable("n1100", vals) problem.addVariable("n1101", vals) problem.addVariable("n1110", vals) problem.addVariable("n1111", vals)

# constraints for A, B problem.addConstraint(lambda n0000, n0001, n0010, n0011: n0000 + n0001 + n0010 + n0011==4, ("n0000", "n0001", "n0010", "n0011")) problem.addConstraint(lambda n0100, n0101, n0110, n0111: n0100 + n0101 + n0110 + n0111==0, ("n0100", "n0101", "n0110", "n0111")) problem.addConstraint(lambda n1000, n1001, n1010, n1011: n1000 + n1001 + n1010 + n1011==0, ("n1000", "n1001", "n1010", "n1011")) problem.addConstraint(lambda n1100, n1101, n1110, n1111: n1100 + n1101 + n1110 + n1111 == 4, ("n1100", "n1101", "n1110", "n1111"))

# constraints for A’ B problem.addConstraint(lambda n0000, n0001, n1000, n1001: n0000+n0001+n1000+n1001==3, ("n0000", "n0001", "n1000", "n1001")) problem.addConstraint(lambda n0010, n0011, n1010, n1011: n0010 + n0011 + n1010 + n1011==1, ("n0010", "n0011", "n1010", "n1011")) problem.addConstraint(lambda n0100, n0101, n1100, n1101: n0100 + n0101 + n1100 + n1101==1,

181 ("n0100", "n0101", "n1100", "n1101")) problem.addConstraint(lambda n0110, n0111, n1110, n1111: n0110 + n0111 + n1110 + n1111 == 3, ("n0110", "n0111", "n1110", "n1111"))

# constraints for A, B' problem.addConstraint(lambda n0000, n0010, n0100, n0110: n0000 + n0010 + n0100 + n0110 == 3, ("n0000", "n0010", "n0100", "n0110")) problem.addConstraint(lambda n0001, n0011, n0101, n0111: n0001 + n0011 + n0101 + n0111 == 1, ("n0001", "n0011", "n0101", "n0111")) problem.addConstraint(lambda n1000, n1010, n1100, n1110: n1000 + n1010 + n1100 + n1110 == 1, ("n1000", "n1010", "n1100", "n1110")) problem.addConstraint(lambda n1001, n1011, n1101, n1111: n1001 + n1011 + n1101 + n1111 == 3, ("n1001", "n1011", "n1101", "n1111"))

problem.addConstraint(lambda n0000, n0001, n0010, n0011, n0100, n0101, n0110, n0111, n1000, n1001, n1010, n1011, n1100, n1101, n1110, n1111 : n0000 + n0001 + n0010 + n0011 + n0100 + n0101 + n0110 + n0111 + n1000 + n1001 + n1010 + n1011 + n1100 + n1101 + n1110 + n1111 == 8, ("n0000", "n0001", "n0010", "n0011", "n0100", "n0101", "n0110", "n0111", "n1000", "n1001", "n1010", "n1011", "n1100", "n1101", "n1110", "n1111"))

csp_solutions = problem.getSolutions() print("Number of solutions to constraint satisfaction problem:")

print(len(csp_solutions)) print(" ") for s in csp_solutions: print(s) print(" ") The above code results in the following terminal output: Number of solutions to constraint satisfaction problem: 4

{’n0000’: 3, ’n0001’: 0, ’n0010’: 0, ’n0011’: 1, ’n0100’: 0, ’n0110’: 0, ’n0101’: 0, ’n0111’: 0, ’n1000’: 0, ’n1001’: 0, ’n1010’: 0, ’n1011’: 0, ’n1100’: 1, ’n1101’: 0, ’n1110’: 0, ’n1111’: 3}

{’n0000’: 3, ’n0001’: 0, ’n0010’: 0, ’n0011’: 1, ’n0100’: 0, ’n0110’: 0, ’n0101’: 0, ’n0111’: 0, ’n1000’: 0, ’n1001’: 0, ’n1010’: 0, ’n1011’: 0, ’n1100’: 0, ’n1101’: 1, ’n1110’: 1, ’n1111’: 2}

{’n0000’: 2, ’n0001’: 1, ’n0010’: 1, ’n0011’: 0, ’n0100’: 0, ’n0110’: 0, ’n0101’: 0, ’n0111’: 0, ’n1000’: 0, ’n1001’: 0, ’n1010’: 0, ’n1011’: 0, ’n1100’: 1, ’n1101’: 0, ’n1110’: 0, ’n1111’: 3}

{'n0000': 2, 'n0001': 1, 'n0010': 1, 'n0011': 0, 'n0100': 0, 'n0110': 0, 'n0101': 0, 'n0111': 0, 'n1000': 0, 'n1001': 0, 'n1010': 0, 'n1011': 0, 'n1100': 0, 'n1101': 1, 'n1110': 1, 'n1111': 2}

Similar code can be written to obtain the other three combinations of constraint satisfaction problems involving three of the marginal tables. Note that the above code block can be obtained from the code block in the previous section by deleting the constraints corresponding to A'B'. By similarly deleting constraints, we can use similar blocks to produce the solutions for the other three ways of joining three of the Bell marginals.

A.2.3 CSPs Involving Two Bell Marginals

We first present code producing the solution to the constraint satisfaction problem involving tables AB and A'B. For this constraint satisfaction problem, we only have eight count variables because we don't have a column corresponding to B'. from constraint import * vals = [0,1,2,3,4] problem = Problem() # nijk is # of times A=i, B=j, A'=k problem.addVariable("n000",vals) problem.addVariable("n001",vals) problem.addVariable("n010",vals) problem.addVariable("n011", vals) problem.addVariable("n100", vals) problem.addVariable("n101", vals) problem.addVariable("n110", vals) problem.addVariable("n111", vals)

# constraints for A, B problem.addConstraint(lambda n000, n001: n000 + n001 == 4, ("n000", "n001")) problem.addConstraint(lambda n010, n011: n010 + n011 == 0, ("n010", "n011")) problem.addConstraint(lambda n100, n101: n100 + n101 == 0, ("n100", "n101")) problem.addConstraint(lambda n110, n111: n110 + n111 == 4, ("n110", "n111"))

# constraints for A’ B problem.addConstraint(lambda n000, n100: n000 + n100 == 3, ("n000", "n100"))

184 problem.addConstraint(lambda n001, n101: n001 + n101 == 1, ("n001", "n101")) problem.addConstraint(lambda n010, n110: n010 + n110 == 1, ("n010", "n110")) problem.addConstraint(lambda n011, n111: n011 + n111 == 3, ("n011", "n111"))

problem.addConstraint(lambda n000, n001, n010, n011, n100, n101, n110, n111: n000 + n001 + n010 + n011 + n100 + n101 + n110 + n111 == 8, ("n000", "n001", "n010", "n011", "n100", "n101", "n110", "n111"))

csp_solutions = problem.getSolutions() for s in csp_solutions: print(s) The terminal output for the above code snippet is given below. {'n000': 3, 'n001': 1, 'n100': 0, 'n101': 0, 'n010': 0, 'n011': 0, 'n110': 1, 'n111': 3} The code block for generating the unique solution to the constraint satisfaction problem involving AB and AB' is similar to the code block above. This is also the case for the constraint satisfaction problem involving A'B and A'B', which has four solutions, and for the constraint satisfaction problem involving AB' and A'B', which also has four solutions. The code snippet below generates the number of solutions to the constraint satisfaction problem involving AB and A'B'. from constraint import * vals = [0,1,2,3,4] problem = Problem() # nijkl is # of times A=i, B=j, A'=k, B'=l problem.addVariable("n0000",vals)

185 problem.addVariable("n0001",vals) problem.addVariable("n0010",vals) problem.addVariable("n0011", vals) problem.addVariable("n0100", vals) problem.addVariable("n0101", vals) problem.addVariable("n0110", vals) problem.addVariable("n0111", vals) problem.addVariable("n1000", vals) problem.addVariable("n1001", vals) problem.addVariable("n1010", vals) problem.addVariable("n1011", vals) problem.addVariable("n1100", vals) problem.addVariable("n1101", vals) problem.addVariable("n1110", vals) problem.addVariable("n1111", vals)

# constraints for A, B problem.addConstraint(lambda n0000, n0001, n0010, n0011: n0000 + n0001 + n0010 + n0011 == 4, ("n0000", "n0001", "n0010", "n0011")) problem.addConstraint(lambda n0100, n0101, n0110, n0111: n0100 + n0101 + n0110 + n0111 == 0, ("n0100", "n0101", "n0110", "n0111")) problem.addConstraint(lambda n1000, n1001, n1010, n1011: n1000 + n1001 + n1010 + n1011 == 0, ("n1000", "n1001", "n1010", "n1011")) problem.addConstraint(lambda n1100, n1101, n1110, n1111: n1100 + n1101 + n1110 + n1111 == 4, ("n1100", "n1101", "n1110", "n1111"))

# constraints for A’, B’ problem.addConstraint(lambda n0000, n0100, n1000, n1100: n0000 + n0100 + n1000 + n1100 == 1, ("n0000", "n0100", "n1000", "n1100"))

186 problem.addConstraint(lambda n0001, n0101, n1001, n1101: n0001 + n0101 + n1001 + n1101 == 3, ("n0001", "n0101", "n1001", "n1101")) problem.addConstraint(lambda n0010, n0110, n1010, n1110: n0010 + n0110 + n1010 + n1110 == 3, ("n0010", "n0110", "n1010", "n1110")) problem.addConstraint(lambda n0011, n0111, n1011, n1111: n0011 + n0111 + n1011 + n1111 == 1, ("n0011", "n0111", "n1011", "n1111"))

problem.addConstraint(lambda n0000, n0001, n0010, n0011, n0100, n0101, n0110, n0111, n1000, n1001, n1010, n1011, n1100, n1101, n1110, n1111 : n0000 + n0001 + n0010 + n0011 + n0100 + n0101 + n0110 + n0111 + n1000 + n1001 + n1010 + n1011 + n1100 + n1101 + n1110 + n1111 == 8, ("n0000", "n0001", "n0010", "n0011", "n0100", "n0101", "n0110", "n0111", "n1000", "n1001", "n1010", "n1011", "n1100", "n1101", "n1110", "n1111")) solns = problem.getSolutions() print(len(solns)) The above code snippet results in the terminal output displayed below. 14 A similar code snippet can be used to produce the number of solutions to the constraint satisfaction problem for A0B and AB0 which has 46 solutions.

Appendix B | Code for Producing Figures in Chapter 6

B.1 Introduction

This appendix contains the code used to generate the scatterplots appearing in chapter 6 of the dissertation. The next section contains the code testing the invariant-based method against the standard chi-squared test for the 2×2 case. The section after that contains the code for generating the simulations involving the binary four-cycle model. All code in this appendix is implemented in the R programming language.

B.2 Invariants vs. Chi-Squared 2 × 2 Case

The code below produces scatterplots comparing p-values computed with the invariants for the independence model on a 2 × 2 table versus p-values computed with both the chi-squared test and the likelihood-ratio test. This code produced Figure 6.1 and Figure 6.2. library("CompQuadForm") library("MASS") library("Deducer")

# Create a vector of probabilities corresponding to the # unwinding (p_11 p_12 p_21 p_22) of the table

# This probability vector does not satisfy the # independence condition. This probability vector was # used for Figure 6.2 #p=c(.1 ,.3 ,.2 ,.4)

# This probability vector satisfies the independence # condition. This probability vector was used for # Figure 6.1 p=c(rep(.25 ,4))

# Number of samples to be used in test n=1000 # Number of trials to run to produce scatter plot of p−values numTrials=1000 im=rep(0,numTrials) chi=rep(0,numTrials) G=rep(0 ,numTrials) for(i in 1:numTrials) { # Construct a normalized vector of samples from # the above distribution x=rmultinom(1,n,p)/n invariant1=x[1 ,1] ∗ x [4 ,1] − x [ 2 , 1 ] ∗ x [ 3 , 1 ] sigma=matrix(rep(0 ,16) ,nrow=4,ncol=4) #mean vector for mle xhat=rep(0 ,4) xhat[1]=(x[1,1]+x[2 ,1]) ∗ (x[1,1]+x[3 ,1]) xhat[2]=(x[1,1]+x[2 ,1]) ∗ (x[2,1]+x[4 ,1]) xhat[3]=(x[1,1]+x[3 ,1]) ∗ (x[3,1]+x[4 ,1]) xhat[4]=(x[2,1]+x[4 ,1]) ∗ (x[3,1]+x[4 ,1]) #covariance matrix for mle sigma[1,1]=xhat[1]∗(1 − xhat [ 1 ] ) sigma[1,2]= − xhat [ 1 ] ∗ xhat [ 2 ] sigma[1,3]= − xhat [ 1 ] ∗ xhat [ 3 ] sigma[1,4]= − xhat [ 1 ] ∗ xhat [ 4 ]

189 sigma[2,1]= − xhat [ 2 ] ∗ xhat [ 1 ] sigma[2,2]=xhat[2]∗(1 − xhat [ 2 ] ) sigma[2,3]= − xhat [ 2 ] ∗ xhat [ 3 ] sigma[2,4]= − xhat [ 2 ] ∗ xhat [ 4 ] sigma[3,1]= − xhat [ 3 ] ∗ xhat [ 1 ] sigma[3,2]= − xhat [ 3 ] ∗ xhat [ 2 ] sigma[3,3]=xhat[3]∗(1 − xhat [ 3 ] ) sigma[3,4]= − xhat [ 3 ] ∗ xhat [ 4 ] sigma[4,1]= − xhat [ 4 ] ∗ xhat [ 1 ] sigma[4,2]= − xhat [ 4 ] ∗ xhat [ 2 ] sigma[4,3]= − xhat [ 4 ] ∗ xhat [ 3 ] sigma[4,4]=xhat[4]∗(1 − xhat [ 4 ] ) #Hessian matrix for quadratic form hess=matrix(rep(0 ,16) ,nrow=4,ncol=4) hess[1,1]=2∗ xhat [ 4 ] ∗ xhat [ 4 ] hess [1 ,2]= −2∗ xhat [ 3 ] ∗ xhat [ 4 ] hess [1 ,3]= −2∗ xhat [ 2 ] ∗ xhat [ 4 ] hess [1 ,4]= −2∗ xhat [ 1 ] ∗ xhat [ 4 ] hess [2 ,1]= −2∗ xhat [ 3 ] ∗ xhat [ 4 ] hess[2,2]=2∗ xhat [ 3 ] ∗ xhat [ 3 ] hess [2 ,3]= −2∗ xhat [ 1 ] ∗ xhat [ 4 ] hess [2 ,4]= −2∗ xhat [ 1 ] ∗ xhat [ 3 ] hess [3 ,1]= −2∗ xhat [ 2 ] ∗ xhat [ 4 ] hess [3 ,2]= −2∗ xhat [ 1 ] ∗ xhat [ 4 ] hess[3,3]=2∗ xhat [ 2 ] ∗ xhat [ 2 ] hess [3 ,4]= −2∗ xhat [ 1 ] ∗ xhat [ 2 ] hess [4 ,1]= −2∗ xhat [ 1 ] ∗ xhat [ 4 ] hess [4 ,2]= −2∗ xhat [ 1 ] ∗ xhat [ 3 ] hess [4 ,3]= −2∗ xhat [ 1 ] ∗ xhat [ 2 ] hess[4,4]=2∗ xhat [ 1 ] ∗ xhat [ 1 ] #Cholesky Decomposition of Covariance Matrix at MLE L=matrix(rep(0 ,12) ,nrow=4,ncol=3) L[1,1]=sqrt((1− xhat [ 1 ] ) ∗ xhat [ 1 ] ) L[1 ,2]=0

  L[1,3]=0
  L[2,1]=-xhat[2]*sqrt(xhat[1]/(1-xhat[1]))
  L[2,2]=sqrt(xhat[2]*(1-xhat[1]-xhat[2])/(1-xhat[1]))
  L[2,3]=0
  L[3,1]=-xhat[3]*sqrt(xhat[1]/(1-xhat[1]))
  L[3,2]=-xhat[3]*sqrt(xhat[2]/((1-xhat[1])*(1-xhat[1]-xhat[2])))
  L[3,3]=sqrt((xhat[3])*(1-xhat[1]-xhat[2]-xhat[3])/(1-xhat[1]-xhat[2]))
  L[4,1]=-xhat[4]*sqrt(xhat[1]/(1-xhat[1]))
  L[4,2]=-xhat[4]*sqrt(xhat[2]/((1-xhat[1])*(1-xhat[1]-xhat[2])))
  L[4,3]=-xhat[4]*sqrt(xhat[3]/((1-xhat[1]-xhat[2])*(1-xhat[1]-xhat[2]-xhat[3])))
  temp=t(L)%*%hess%*%L
  spectralDecomposition=eigen(temp)
  U=t(spectralDecomposition$vectors)
  D=diag(spectralDecomposition$values)
  b=t(xhat)%*%hess%*%L%*%U
  c=t(xhat)%*%hess%*%xhat
  K=c-b%*%solve(D)%*%t(b)
  stat=2*n*invariant1*invariant1-K
  # compute survival function P(Q>stat) to get p-value
  C=hess%*%sigma
  Cdiag=eigen(C)
  lambda=c(Cdiag$values[1],Cdiag$values[2],Cdiag$values[3])
  delta=solve(diag(lambda))%*%t(b)
  imTemp=imhof(stat,lambda,rep(1,length(lambda)),delta)
  row1=c(n*x[1,1],n*x[2,1])
  row2=c(n*x[3,1],n*x[4,1])
  dataTable=rbind(row1,row2)
  chiTemp=chisq.test(dataTable)
  GTemp=likelihood.test(dataTable)

  im[i]=imTemp$Qq
  chi[i]=chiTemp$p.value
  G[i]=GTemp$p.value
}
plot(chi,G)
plot(chi,im)
plot(G,im)

B.3 P-values on a Degenerate Distribution in the Binary 4-Cycle Model

The code below generates Figure 6.3 and Figure 6.4. The distribution studied is based on a degenerate distribution studied in [57]. The first block of code contains the contents of a file of helper functions used during the simulated trials; the comments before each function explain its inputs and outputs.

# Filename: invariantsNoMLE.R

library("CompQuadForm")
library("numDeriv")
library("MASS")
library("poLCA")
# value for tolerance comes from default tolerance for
# matrixRank method
tol=3.552714e-15
#
#
# Accepts vector of counts
# Returns Pval for invariants based on invariants for the binary
# 4-cycle model discussed on page 1475 of "On the Toric Algebra
# of Graphical Models" by Geiger, Meek, and Sturmfels
pVal4Cycle1475 = function (df,H=diag(8))
{
  H <<- H

  num_samples=sum(df$freq)
  normalizedCounts=(1/sum(df$freq))*df$freq
  invariants=chainInvariants1475(normalizedCounts)
  stat=2*num_samples*t(invariants)%*%H%*%invariants
  hess=hessian(stat1475,normalizedCounts)
  sigma=diag(normalizedCounts)-normalizedCounts%*%t(normalizedCounts)
  temp=eigen(sigma)
  V=temp$vectors
  #Lambda=diag(temp$values)
  for (i in 1:length(temp$values))
  {
    if (temp$values[i]

    {
      i=i+1
    }
  }
  k=0
  delta=rep(0,length(lambda))
  for (i in 1:length(lambda))
  {
    k=k+b[i]*b[i]/lambda[i]
    delta[i]=b[i]/lambda[i]
  }
  c=t(normalizedCounts)%*%hess%*%normalizedCounts
  correction=c-k
  delta=abs(delta)
  lambda=abs(lambda)
  #rootDelta=sqrt(delta)
  #sqDelta=delta*delta
  stat=stat-correction
  #print(delta)
  if (length(lambda)==0)
  {
    resultI=1
    resultD=1
    resultF=1
    resultL=1
  } else {
    resultI=imhof(stat,lambda,rep(1,length(lambda)),delta)
    resultD=davies(stat,lambda,rep(1,length(lambda)),delta)
    resultF=farebrother(stat,lambda,rep(1,length(lambda)),delta)
    resultL=liu(stat,lambda,rep(1,length(lambda)),delta)
    pI=resultI$Qq
    pD=resultD$Qq
    pF=resultF$Qq

    pL=resultL[1]
  }
  results=c(pI,pD,pF,pL)
  return(results)
}
#

# This function returns the uncorrected value of the statistic
# In the paper we show that there is a term which must be
# subtracted from the energy in order to compute the survival
# function of the generalized quadratic form representing the
# asymptotic form of this statistic.
#
# Inputs : p - a 16 dimensional probability vector
# Outputs: value of uncorrected statistic evaluated at p
stat1475 = function (p)
{
  invariants=chainInvariants1475(p)
  return( t(invariants)%*%H%*%invariants )
}
#

# The function below implements the invariants for the binary 4-cycle
#
# Inputs : p - a length 16 probability vector
# Outputs: a vector representing the value of the invariants evaluated
#          at the point p in the probability simplex
chainInvariants1475 = function (p)
{
  invariants=c(p[12]*p[15]-p[11]*p[16],
               p[8]*p[14]-p[6]*p[16],
               p[10]*p[13]-p[8]*p[14],
               p[7]*p[13]-p[5]*p[15],
               p[4]*p[10]-p[2]*p[12],
               p[4]*p[7]-p[3]*p[8],
               p[2]*p[5]-p[1]*p[6],
               p[3]*p[9]-p[1]*p[11])
  return(invariants)

}

# Accepts : n  - an integer representing the size of the
#                square matrix
#           ev - a list of n eigenvalues for the random matrix.
#                By default we select them uniformly between
#                0 and 10.
# Returns : Z  - a positive definite matrix whose eigenvalues
#                are given by the vector ev
randPosDefMat = function (n, ev = runif(n,0,10))
{
  Z = matrix(ncol=n, rnorm(n^2))
  decomp = qr(Z)
  Q = qr.Q(decomp)
  R = qr.R(decomp)
  d = diag(R)
  ph = d/abs(d)
  O = Q%*%diag(ph)
  Z = t(O)%*%diag(ev)%*%O
  return(Z)
}

The last block of code contains the contents of a file simulation.R, which was the driver behind the simulations.
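Before the driver, a quick check, not part of the original files, that randPosDefMat returns a matrix with exactly the requested spectrum; the eigenvalue vector below is illustrative only.

# Sanity check for randPosDefMat: requested vs. realized eigenvalues
ev_test=c(1,2,5)
Z=randPosDefMat(3,ev_test)
sort(eigen(Z)$values)   # should be approximately 1 2 5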

startTime=Sys.time()
# Code to generate histogram of statistics
# compares 4 different numerical methods for computing
# the invariant based statistic against the chi-squared
# statistic and likelihood ratio statistic
#
# We omit the step of computing an MLE for the invariant
# based statistic in this simulation

#
# By commenting out different qx lines you can choose
# which degenerate distribution is used
# first one: limit of points in model
# second one: not
#
# This parameter controls the name of the folder the results
# will be written in
simLabel="test0"
#
# change these parameters to change simulation
#
# parameter that controls magnitude of noise
eps=1e-3
#
#
#
# The loop will perform numTrials test simulations for each
# sample size in the numSamples vector
numTrials=1000
numSamples=(1:10)*100
#
#
# creates degenerate distribution which is limit of models
#qx=c(1,1,0,1,0,0,0,1,1,0,0,0,1,0,1,1)
#qx=(1/8)*qx
# creates degenerate model which is not a limit point
qx=c(0,0,0,0,1,0,0,1,0,1,1,0,0,0,0,0)
qx=(1/4)*qx
qOriginal=qx

#
# loads file containing helper functions for the simulation
# including tools for computing the p-values

source('invariantsNoMLE.R')
library("MASS")
library("msm")
library("lmtest")
#
# initialize covariates for simulation
x1=c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
x2=c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1)
x3=c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1)
x4=c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1)
#
#
#
resultsImhof=matrix(rep(0,numTrials*length(numSamples)),
                    ncol=length(numSamples),nrow=numTrials)
resultsDavies=matrix(rep(0,numTrials*length(numSamples)),
                     ncol=length(numSamples),nrow=numTrials)
resultsFarebrother=matrix(rep(0,numTrials*length(numSamples)),
                          ncol=length(numSamples),nrow=numTrials)
resultsLiu=matrix(rep(0,numTrials*length(numSamples)),
                  ncol=length(numSamples),nrow=numTrials)
resultsChiSq=matrix(rep(0,numTrials*length(numSamples)),
                    ncol=length(numSamples),nrow=numTrials)
resultsLR=matrix(rep(0,numTrials*length(numSamples)),
                 ncol=length(numSamples),nrow=numTrials)
#
accuracyImhof=rep(0,length(numSamples))
accuracyDavies=rep(0,length(numSamples))
accuracyFarebrother=rep(0,length(numSamples))
accuracyLiu=rep(0,length(numSamples))
accuracyChiSq=rep(0,length(numSamples))
accuracyLR=rep(0,length(numSamples))
#
# changes amplitude of noise added to data
#
noise=abs(rnorm(16))
tol=1e-14
q=0:15
#
# build some noise
qx=qx+eps*noise
qx=qx/sum(qx)
# generates a random pos def matrix
# scale is based on param at top of code
H=randPosDefMat(8,ev)
#
for (j in 1:length(numSamples))
{
  for (i in 1:numTrials)
  {
    draws = sample(q, size = numSamples[j], replace = TRUE, prob = qx)
    counts=rep(0,16)
    for (k in 1:16)
    {
      counts[k]=sum(draws==(k-1))

    }
    freq=rep(0,16)
    for (k in 0:15)
    {
      freq[k+1]=sum(draws==k)
    }
    df=data.frame(x1,x2,x3,x4,freq)
    n=16
    #H=randPosDefMat(8)
    result=pVal4Cycle1475(df,H)
    model=glm(freq~x1*x2+x2*x3+x3*x4+x4*x1,family=poisson(),data=df)
    satModel=glm(freq~x1*x2+x1*x3+x1*x4+x2*x3+x2*x4+x3*x4,
                 family=poisson(),data=df)
    mle=predict(model,type="response")
    mle=mle/sum(mle)
    mle=unname(mle)
    obs=df$freq
    chiSq=chisq.test(obs, p=mle)
    LR=lrtest(model,satModel)
    resultsImhof[i,j]=result[1]
    resultsDavies[i,j]=result[2]
    resultsFarebrother[i,j]=result[3]
    resultsLiu[i,j]=result[4]
    resultsChiSq[i,j]=chiSq$p.value
    resultsLR[i,j]=LR$`Pr(>Chisq)`[2]
  }
}
#
# Count fraction of trials that reject the null hypothesis
#
for (j in 1:length(numSamples))
{
  accuracyImhof[j]=sum(resultsImhof[,j]<0.05)/numTrials

  accuracyDavies[j]=sum(resultsDavies[,j]<0.05)/numTrials
  accuracyFarebrother[j]=sum(resultsFarebrother[,j]<0.05)/numTrials
  accuracyLiu[j]=sum(resultsLiu[,j]<0.05)/numTrials
  accuracyChiSq[j]=sum(resultsChiSq[,j]<0.05)/numTrials
  accuracyLR[j]=sum(resultsLR[,j]<0.05)/numTrials
}
#
# Assess accuracy of each method
#

#
# Code for Exporting Results
#
setwd("simulations")
#
folder=paste("sim",simLabel,sep="_")
dir.create(folder)
setwd(folder)
write(qx,"sampling_distribution.txt")
write.csv(resultsChiSq,"chi_squared_sim_results.csv")
write.csv(resultsLR,"LR_sim_results.csv")
write.csv(resultsDavies,"Davies_sim_results.csv")
write.csv(resultsFarebrother,"Farebrother_sim_results.csv")
write.csv(resultsImhof,"Imhof_sim_results.csv")
write.csv(resultsLiu,"Liu_sim_results.csv")
write.csv(H,"Hamiltonian.csv")
write(qOriginal,"degenerate_distribution.txt")
write(eps,"noise_parameter.txt")
write(ev,"Hamiltonian_eigenvalues.txt")
#
# code to generate plots
#
dir.create("figures")

setwd("figures")
for (j in 1:length(numSamples))
{
  temp=toString(numSamples[j])

  filename=paste("chiSquared_v_Imhof",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsImhof[,j],resultsChiSq[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("LR_v_Imhof",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsImhof[,j],resultsLR[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("chiSquared_v_Davies",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsDavies[,j],resultsChiSq[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("LR_v_Davies",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsDavies[,j],resultsLR[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("chiSquared_v_Farebrother",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsFarebrother[,j],resultsChiSq[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("LR_v_Farebrother",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsFarebrother[,j],resultsLR[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("chiSquared_v_Liu",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsLiu[,j],resultsChiSq[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

  filename=paste("LR_v_Liu",temp,sep="_")
  filename=paste(filename,".svg",sep="")
  svg(filename)
  plot(resultsLiu[,j],resultsLR[,j],xlim=c(0,1),ylim=c(0,1),asp=1)
  dev.off()

}
setwd("..")
temp = "Chi Squared Test Results: "
temp = paste(temp, toString(accuracyChiSq), sep="\n")
temp = paste(temp, "Davies Results: ", sep="\n")
temp = paste(temp, toString(accuracyDavies), sep="\n")
temp = paste(temp, "Farebrother Results:", sep="\n")
temp = paste(temp, toString(accuracyFarebrother), sep="\n")
temp = paste(temp, "Imhof Results: ", sep="\n")
temp = paste(temp, toString(accuracyImhof), sep="\n")
temp = paste(temp, "Liu Results: ", sep="\n")
temp = paste(temp, toString(accuracyLiu), sep="\n")
temp = paste(temp, "LR Results: ", sep="\n")
temp = paste(temp, toString(accuracyLR), sep="\n")
write(temp, file="summary.txt")
setwd("..")
endTime=Sys.time()
setwd("..")
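For reference, a minimal, hypothetical way to run the driver: the listing above references an eigenvalue vector ev (used by randPosDefMat and written to Hamiltonian_eigenvalues.txt) whose definition does not appear in the reproduced code, so it must be supplied beforehand; one plausible choice, matching randPosDefMat's default range, is shown. Both files are assumed to sit in the working directory with the required packages installed.

# Hypothetical invocation; the definition of 'ev' is an assumption,
# since it does not appear in the listing above
ev=runif(8,0,10)
source('simulation.R')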

B.4 Tables of Percentage Deviation from Significance Level

In this section, we present the results of the simulations produced by the code displayed in the previous section. The simulations were obtained by taking a singular point on the statistical model associated to the binary 4-cycle undirected graphical model. The tables below report the observed percentage deviation of each test's rejection rate from the nominal significance level.
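The driver only stores rejection rates at the 0.05 level, so the table entries are presumably recomputed from the saved p-value matrices over a grid of levels; on that reading, each entry is the empirical rejection percentage minus the nominal level, in percentage points. A minimal sketch of that computation (resultsImhof and the column index j come from the driver above; the formula is inferred from the tables rather than taken from the dissertation's code):

# Percentage-point deviation of the empirical rejection rate from the
# nominal level alpha, for one method and one sample size (column j)
alphas=c(0.01,0.05,0.10,0.15,0.20,0.25,0.30,0.35)
deviation=sapply(alphas, function(a) 100*mean(resultsImhof[,j]<a) - 100*a)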

B.4.1 Noise Parameter ε = 0.1

For this simulation we took a singular model on the binary 4-cycle and perturbed it by adding noise generated by sampling from a normal distribution, taking the absolute value of the result, and rescaling by the parameter ε = 0.1. The specific sampling distribution used for the simulations in this subsection is: q0000 = 0.07453062, q0001 = 0.08634001, q0010 = 0.041189657,

q0011 = 0.0640731, q0100 = 0.008735056, q0101 = 0.1015492,

q0110 = 0.001641928, q0111 = 0.1536179, q1000 = 0.0690093,

q1001 = 0.05180355, q1010 = 0.02624617, q1011 = 0.01191531,

q1100 = 0.06465253, q1101 = 0.04332119, q1110 = 0.122514,

q1111 = 0.07886045

n = 100
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                      4.7   12.8   18.7   23.0   24.4   26.3   26.2   27.5
G2                     28.4   46.0   52.6   55.0   55.4   55.9   53.5   50.9
Ours - Davies          -1.0   -4.7   -5.4   -2.3    4.5   11.9   17.2   24.7
Ours - Farebrother     -1.0   -4.9   -6.7   -6.0   -1.4    4.2    8.7   14.6
Ours - Imhof           -1.0   -4.7   -5.4   -2.3    4.5   11.9   17.2   24.7
Ours - Liu             -1.0   -4.7   -5.7   -3.0    3.5   11.9   17.9   25.9

n = 200
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     51.1   74.0   77.6   76.5   74.8   71.6   67.3   63.1
G2                     65.2   79.8   81.3   78.6   75     71.5   67.3   62.5
Ours - Davies           1.0   30.3   56.9   68.3   71.0   69.8   66.5   62.7
Ours - Farebrother      0.8   29.2   55.6   67.2   69.9   68.7   65.4   61.6
Ours - Imhof            1.0   30.3   56.9   68.3   71.0   69.8   66.5   62.7
Ours - Liu              1.0   29.4   54.8   67.1   70.7   69.7   66.8   63.0

n = 300
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     92.5   94.0   89.7   84.8   79.8   74.9   69.9   64.9
G2                     86.4   90.3   87.8   83.9   79.2   74.6   69.8   64.8
Ours - Davies          32.7   83.5   87.5   84.0   79.6   74.7   69.7   64.9
Ours - Farebrother     32.5   83.0   87.0   83.5   79.1   74.2   69.2   64.4
Ours - Imhof           32.7   83.5   87.5   84.0   79.6   74.7   69.7   64.9
Ours - Liu             32.9   82.5   87.3   83.6   79.5   74.7   69.7   64.9

n = 400
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     98.7   95.0   90.0   85.0   80.0   75.0   70.0   65.0
G2                     93.7   94.0   89.2   84.8   79.8   74.8   69.9   64.9
Ours - Davies          73.7   93.7   89.7   84.8   79.8   74.9   69.9   64.9
Ours - Farebrother     73.0   92.9   89.0   84.1   79.4   74.5   69.5   64.5
Ours - Imhof           73.7   93.7   89.7   84.8   79.8   74.9   69.9   64.9
Ours - Liu             73.9   93.3   89.7   84.8   79.8   74.9   69.9   64.9

B.4.2 Noise Parameter ε = 0.01

In this subsection, we perform the same simulation using a noise parameter of ε = 0.01. The specific sampling distribution used for the simulations in this subsection is:

q0000 = 0.1199417, q0001 = 0.1139999, q0010 = 0.005630032,

q0011 = 0.1102922, q0100 = 0.008179367, q0101 = 0.004033609,

q0110 = 0.005865577, q0111 = 0.1120079, q1000 = 0.1106506,

q1001 = 0.01478819, q1010 = 0.01170206, q1011 = 0.01567811,

q1100 = 0.1263541, q1101 = 0.006702296, q1110 = 0.11685,

q1111 = 0.1173245

n = 100
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -0.9   -4.7   -9.3  -13.5  -18.1  -22.4  -26.7  -31.1
G2                      2.1    5.1    8.2    9.4   10.1   10.0    9.6    9.8
Ours - Davies          -1.0   -5.0  -10.0  -14.6  -18.5  -21.8  -24.5  -24.7
Ours - Farebrother     -1.0   -5.0  -10.0  -14.9  -19.8  -24.6  -29.1  -32.9
Ours - Imhof           -1.0   -5.0  -10.0  -14.6  -18.5  -21.8  -24.5  -24.7
Ours - Liu             -1.0   -5.0  -10.0  -14.7  -18.7  -21.8  -24.5  -24.3

n = 200
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -0.6   -4.0   -8.2  -12.1  -16.1  -19.9  -23.9  -27.5
G2                      4.0    8.1   11.3   14.4   15.5   15.4   16.2   15.6
Ours - Davies          -1.0   -4.9   -8.9  -11.7  -14.1  -13.9  -14.0  -13.4
Ours - Farebrother     -1.0   -5.0   -9.4  -13.2  -16.6  -18.2  -19.4  -20.4
Ours - Imhof           -1.0   -4.9   -8.9  -11.7  -14.1  -13.9  -14.0  -13.4
Ours - Liu             -1.0   -4.9   -8.9  -11.8  -14.3  -13.9  -14.1  -12.5

n = 300
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -0.8   -3.3   -6.7   -9.8  -13.7  -16.9  -19.9  -22.6
G2                      5.6   12.4   17.9   20.3   21.4   22.4   22.8   22.6
Ours - Davies          -1.0   -4.7   -6.2   -7.5   -6.7   -4.8   -2.2   -0.3
Ours - Farebrother     -1.0   -4.8   -6.9   -8.8   -8.2   -7.2   -5.2   -3.7
Ours - Imhof           -1.0   -4.7   -6.2   -7.5   -6.7   -4.8   -2.2   -0.3
Ours - Liu             -1.0   -4.7   -6.6   -7.9   -6.8   -4.9   -2.0    0.4

n = 400
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                      0.3   -1.3   -3.7   -6.5   -9.3  -11.3  -12.9  -14.9
G2                      6.3   14.6   20.5   22.5   23.6   23.1   24.4   23.8
Ours - Davies          -0.9   -2.3   -0.7    1.7    4.1    5.3    7.2   10.2
Ours - Farebrother     -0.9   -2.7   -1.6    0.6    2.9    4.2    6.0    8.9
Ours - Imhof           -0.9   -2.3   -0.7    1.7    4.1    5.3    7.2   10.2
Ours - Liu             -0.8   -2.6   -1.1    1.6    3.7    5.2    7.5   10.4

B.4.3 Noise Parameter ε = 0.001

In this subsection, we perform the same simulation using a noise parameter of ε = 0.001. The specific sampling distribution used for the simulations in this

subsection is:

q0000 = 0.1246213, q0001 = 0.124153, q0010 = 0.0009961762,

q0011 = 0.1256715, q0100 = 0.0002068291, q0101 = 0.00068493,

q0110 = 0.001658084, q0111 = 0.1237989, q1000 = 0.1245288,

q1001 = 0.0009192271, q1010 = 0.0002931132, q1011 = 0.0002904846,

q1100 = 0.1235315, q1101 = 0.0005560014, q1110 = 0.1235039,

q1111 = 0.1245863

n = 100
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -1.0   -5.0  -10.0  -14.9  -19.8  -24.6  -29.6  -34.4
G2                     -1.0   -4.2   -7.6  -11.6  -12.4   -4.2    2.6    3.6
Ours - Davies          -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -29.8  -34.0
Ours - Farebrother     -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -30.0  -35.0
Ours - Imhof           -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -29.8  -34.0
Ours - Liu             -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -29.8  -34.0

n = 200
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -1.0   -4.8   -9.8  -14.8  -19.6  -24.2  -29.1  -34.1
G2                     -0.5   -2.4   -1.8   -4.9   -8.0    8.0   25.0   23.9
Ours - Davies          -1.0   -5.0  -10.0  -14.9  -19.5  -23.3  -25.9  -28.7
Ours - Farebrother     -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -30.0  -35.0
Ours - Imhof           -1.0   -5.0  -10.0  -14.9  -19.5  -23.3  -25.9  -28.7
Ours - Liu             -1.0   -5.0  -10.0  -14.9  -19.5  -23.5  -25.7  -28.7

n = 300
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -1.0   -4.7   -9.6  -14.1  -18.5  -23.2  -27.8  -32.7
G2                     -0.6    0.6    6.9    5.3    1.1   18.3   31.6   28.9
Ours - Davies          -1.0   -5.0  -10.0  -14.5  -18.6  -16.9  -17.5  -21.4
Ours - Farebrother     -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -30.0  -34.9
Ours - Imhof           -1.0   -5.0  -10.0  -14.5  -18.6  -16.9  -17.5  -21.4
Ours - Liu             -1.0   -5.0  -10.0  -14.7  -18.6  -16.9  -17.3  -21.2

n = 400
α                      0.01   0.05   0.10   0.15   0.20   0.25   0.30   0.35
χ2                     -0.9   -4.7   -9.6  -14.3  -19.0  -23.8  -28.4  -32.7
G2                      0.5    3.9   11.1   13.6    9.8   22.2   32.5   29.8
Ours - Davies          -1.0   -5.0   -9.8  -13.1  -16.5  -15.3  -13.4  -15.8
Ours - Farebrother     -1.0   -5.0  -10.0  -15.0  -20.0  -25.0  -30.0  -35.0
Ours - Imhof           -1.0   -5.0   -9.8  -13.1  -16.5  -15.3  -13.4  -15.8
Ours - Liu             -1.0   -5.0   -9.8  -13.2  -16.5  -15.0  -13.2  -15.8

Bibliography

[1] S. Abramsky. Relational databases and Bell's theorem. In In Search of Elegance in the Theory and Practice of Computation, pages 13-35. Springer, 2013.

[2] S. Abramsky, R. S. Barbosa, K. Kishida, R. Lal, and S. Mansfield. Contextuality, cohomology and paradox. arXiv preprint arXiv:1502.03097, 2015.

[3] S. Abramsky, R. S. Barbosa, and S. Mansfield. Quantifying contextuality via linear programming. Informal Proceedings of Quantum Physics & Logic, 2016.

[4] S. Abramsky and A. Brandenburger. The sheaf-theoretic structure of non-locality and contextuality. New Journal of Physics, 13(11):113036, 2011.

[5] S. Abramsky, S. Mansfield, and R. S. Barbosa. The cohomology of non-locality and contextuality. arXiv preprint arXiv:1111.3620, 2011.

[6] A. H. Andersen. Multidimensional contingency tables. Scandinavian Journal of Statistics, 1974.

[7] B. C. Arnold, E. Castillo, and J. M. Sarabia. Conditionally Specified Distributions: An Introduction. Statistical Science, 2001.

[8] B. C. Arnold and S. J. Press. Compatible Conditional Distributions. Journal of the American Statistical Association, 1989.

[9] M. Artin, A. Grothendieck, and J. L. Verdier. Theorie des topos et cohomologie etale des schemas (known as SGA4). Springer, 1972.

[10] R. J. Aumann. Borel Structures for Function Spaces. Illinois Journal of Mathematics, vol. 5, pp. 614-630, 1961.

[11] J. Awan and A. Slavkovic. Differentially private uniformly most powerful tests for binomial data. NIPS 2018. https://arxiv.org/abs/1801.09236, 2018.

[12] S. Awodey. Category theory. Oxford University Press, 2010.

[13] K. Baclawski, D. Simovici, and W. White. A categorical approach to database semantics. Mathematical Structures in Computer Science, vol 4, pp. 147-183. 1994.

[14] M. Barr and C. Wells. Toposes, Triples, and Theories. Springer, 1985.

[15] M. Barr and C. Wells. Category Theory for Computing Science. Prentice Hall, 1999.

[16] Y. Baryshnikov and R. Ghrist. Target enumeration via integration over planar sensor networks. Conference: Robotics: Science and Systems IV, 2008.

[17] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Association for Computational Linguistics, 1996.

[18] F. Borceux. Handbook of Categorical Algebra. Volumes 1-3. Cambridge University Press, 1994.

[19] Z. I. Botev and D. P. Kroese. Non-asymptotic bandwidth selection for density estimation of discrete data. Methodology and Computing in Applied Probability. 10 (3): 435. 2008.

[20] Z. I. Botev and D. P. Kroese. The generalized cross entropy method, with applications to probability density estimation. Methodology and Computing in Applied Probability. 13 (1): 1–27, 2011.

[21] S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 2011.

[22] O. Caramello. Atomic toposes and countable categoricity. Applied Categorical Structures. 20 no. 4 pp.379-391. (arXiv:0811.3547), 2012.

[23] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM. 13 (6): 377–387, 1970.

[24] J.C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems, 31(3):8, 2013.

[25] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2006.

[26] M. Cueto, J. Morton, and B. Sturmfels. Geometry of the restricted Boltzmann machine. Contemporary Mathematics. 516. 10.1090/conm/516/10172, 2009.

[27] J. Culbertson and K. Sturtz. A categorical foundation for Bayesian probability. Applied Categorical Structures 22: 647, 2014.

[28] H. B. Curry. Functionality in combinatory logic. Proceedings of the National Academy of Sciences of the United States of America. 20 (11): 584–90, 1934.

[29] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 2002.

[30] R.B. Davies. Algorithm AS 155: the distribution of a linear combination of chi-2 random variables. Journal of the Royal Statistical Society. Series C (Applied Statistics), 29(3), p. 323-333, 1980.

[31] B. de Finetti. La prevision : ses lois logiques, ses sources subjectives. Annales de l’institute Henri Poincare, vol. 7, 1937.

[32] T. de la Rue. Espaces de Lebesgue. Seminaire de Probabilites XXVII, Lecture Notes in Mathematics, 1557, Springer, pp. 15–21, 1993.

[33] J. L. Doob. Stochastic Processes. Wiley, 1953.

[34] E. Doberkat. Eilenberg–Moore algebras for stochastic relations. Information and Computation 204 pp. 1756–1781, 2006.

[35] A. Dobra, S. E. Fienberg, A. Rinaldo, A. Slavkovic, and Y. Zhou. Algebraic Statistics and Contingency Table Problems: Log-Linear Models, Likelihood Estimation, and Disclosure Limitation. In: M. Putinar and S. Sullivant (eds) Emerging Applications of Algebraic Geometry. The IMA Volumes in Mathematics and its Applications, vol 149. Springer, New York, NY, 2009.

[36] M. Drton, B. Sturmfels, and S. Sullivant. Lectures on Algebraic Statistics. Oberwolfach Seminars. Birkhauser, 2008.

[37] L. E. Dubins and D. A. Freedman, Exchangeable processes need not be mixtures of independent, identically distributed random variables. Probability Theory and Related Fields 48(2):115-132, 1979.

[38] P. Duchesne, P. Lafaye de Micheaux. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods, Computational Statistics and Data Analysis, Volume 54, pp. 858-862, 2010.

[39] H. L. Dunn. Record Linkage. American Journal of Public Health, 1946.

[40] R. Durrett. Probability: Theory and Examples. Second edition. Cambridge University Press, 1996.

[41] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In: S. Halevi and T. Rabin (eds) Theory of Cryptography, TCC 2006. Lecture Notes in Computer Science, vol 3876. Springer, Berlin, Heidelberg, 2006.

[42] E. B. Dynkin and A. A. Yushkevich. Controlled Markov Processes. Springer-Verlag, 1979.

[43] A. Edalat. Semi-pullbacks and bisimulation in categories of Markov processes. Mathematical Structures in Computer Science, 9(5):523–543, 1999.

[44] A. W. F. Edwards. The measure of association in a 2 × 2 table. Journal of the Royal Statistical Society. A (General). 126 (1): 109–114. doi:10.2307/2982448. JSTOR 2982448, 1963.

[45] N. Eriksson. Using Invariants for Phylogenetic Tree Construction. In: M. Putinar and S. Sullivant (eds) Emerging Applications of Algebraic Geometry. The IMA Volumes in Mathematics and its Applications, vol 149. Springer, 2009.

[46] N. Eriksson and Y. Yao. Metric learning for phylogenetic invariants. arXiv:q- bio/0703034 [q-bio.PE], 2007.

[47] R. W. Farebrother. Algorithm AS 204: The distribution of a positive linear combination of chi-squared random variables. Journal of the Royal Statistical Society, Series C (applied Statistics), Vol. 33, No. 3, p. 332-339, 1984.

[48] I. Fellegi and A. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 1969.

[49] M. Fleming, R. Gunther, and R. Rosebrugh. A database of categories. Journal of Symbolic Computation 35.2, pp. 127-135, 2003.

[50] H. Follmer. Phase transition and Martin boundary. In: Lecture Notes in Mathematics Vol. 465, Springer, 1975.

[51] H. Forssell, H. R. Gylterud, and D. I. Spivak. Type Theoretical Databases. In: Logical Foundations of Computer Science, pp. 117-129, 2016.

[52] D. H. Fremlin. Measure Theory, volume 4. Torres Fremlin, 2000.

[53] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Pearson, 2008.

[54] R. Gauthier. Algebraic Stochastic Calculus. arXiv 1407.6784v1, 2014.

[55] A. Gelman and M. Betancourt. Does quantum uncertainty have a place in everyday applied statistics? Behavioral and Brain Sciences, 2013.

[56] A. Gelman and T. E. Raghunathan. Using conditional distributions for missing-data imputation. Statistical Science, 2001.

[57] D. Geiger, C. Meek, and B. Sturmfels. On the toric algebra of graphical models. The Annals of Statistics. Volume 34, Number 3, 1463-1492, 2006.

[58] Z. Ghahramani and M. I. Jordan. Learning from incomplete data, 1994.

[59] M. Giry [original paper manuscript from 1982]. A categorical approach to probability theory. In book: Categorical Aspects of Topology and Analysis, pp. 68-85, 2006.

[60] R. Goldblatt. Topoi: the categorial analysis of logic. North-Holland, 1984.

[61] M. Gromov. In a search for a structure, part 1: on entropy. 2012.

[62] S. Gutmann, J. H. B. Kemperman, J. A. Reeds, and L. A. Shepp. Existence of probability measures with given marginals. The Annals of Probability, 1991.

[63] S. J. Haberman. The Analysis of Frequency Data. University of Chicago Press, 1974.

[64] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer, 2016.

[65] C. Heunen, O. Kammar, S. Staton, and H. Yang. A convenient category for higher order probability theory. arXiv:1701.02547, 2017.

[66] C. Heunen, O. Kammar, S. Moss, A. Scibior, S. Staton, M. Vakar, and H. Yang. The semantic structure of quasi-Borel spaces. PPS18, 2018.

[67] E. Hewitt and L. J. Savage. Symmetric measures on cartesian products. Transactions of the American Mathematical Society, vol. 80, pp. 470–501, 1955.

[68] P. Honeyman, R. E. Ladner, and M. Yannakakis. Testing the universal instance assumption. Information Processing Letters, 1980.

[69] W. A. Howard [original paper manuscript from 1969]. The formulae-as-types notion of construction, in J. P. Seldin and J. R. Hindley (eds.), To H.B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, Academic Press, pp. 479–490, 1980.

[70] J. P. Imhof. Computing the distribution of quadratic forms in normal variables. Biometrika, Vol. 48, No. 3/4, pp. 419-426, 1961.

[71] M. Jackson. A Sheaf Theoretic Approach to Measure Theory. Unpublished doctoral dissertation, 2006.

[72] E. T. Jaynes. Information theory and statistical mechanics. In K. Ford (ed.) Statistical Physics. New York, 1963.

[73] E. T. Jaynes. Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics. 4 (3): 227–241, 1968.

[74] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[75] M. Johnson and R. Rosebrugh. Sketch data models, relational schema and data specifications. Electronic Notes in Theoretical Computer Science 61, pp. 51-63, 2002.

[76] P. T. Johnstone. Another condition equivalent to de Morgan’s law. Commu- nications in Algebra, 1979.

[77] P. T. Johnstone. Sketches of an Elephant: A Topos Theory Compendium. Volume 1 and 2. Clarendon Press, 2002.

[78] P. T. Johnstone. Topos Theory. Academic Press, 1977.

[79] O. Kallenberg. Foundations of Modern Probability, 2nd ed. Springer, 2002.

[80] A. S. Kechris. Classical Descriptive Set Theory. Springer, 1995.

[81] J. Kropko, B. Goodrich, A. Gelman, and J. Hill. Multiple Imputation for Continuous and Categorical Data: Comparing Joint and Conditional Approaches. Political Analysis, 2014.

[82] J. Lambek. Deductive Systems and Categories. Theory of Computing Systems, 1968.

[83] J. Lambek. Deductive systems and categories II. Standard constructions and closed categories. Category theory, homology, and their applications. 1969.

[84] J. Lambek. Deductive systems and categories III. Cartesian closed categories, intuitionist , and combinatory logic. Toposes, algebraic geometry, and logic, 1972.

[85] F. W. Lawvere and R. Rosebrugh. Sets for Mathematics. Cambridge University Press, 2003.

[86] G. Lebanon and M. El-Geish. Computing with Data: An Introduction to the Data Industry. Springer, 2019.

[87] B. Li, V. Karwa, A. Slavkovic, and B. Steorts. A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms. Journal of Privacy and Confidentiality 8 (1), 2018.

[88] A. J. Lindenhovius. Grothendieck Topologies on Posets. arXiv:1405.4408, 2014.

[89] T. Litak, S. Mikulas, and J. Hidders. Relational Lattices. In: P. Hofner, P. Jipsen, W. Kahl, and M. E. Muller (eds) Relational and Algebraic Methods in Computer Science. RAMICS 2014. Lecture Notes in Computer Science, vol 8428. Springer, Cham, 2014.

[90] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data, 3rd Ed. John Wiley & Sons, 2019.

[91] H. Liu, Y. Tang, H. H. Zhang. A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Computational Statistics and Data Analysis, Volume 53, 853-856, 2009.

[92] F. Loregian. This is the (co)end, my only (co)friend. arXiv:1501.02503, 2017.

[93] S. Mac Lane. Categories for the Working Mathematician. Springer, 1998.

[94] S. Mac Lane and I. Moerdijk. Sheaves in Geometry and Logic: A First Introduction to Topos Theory. Springer, 1994.

[95] G. W. Mackey. Borel structure in groups and their duals. Transactions of the American Mathematical Society, 85, 134-165, 1957.

[96] E. Manes. Algebraic Theories. Springer-Verlag, 1976.

[97] P. McCullagh. What is a statistical model? Annals of Statistics. Volume 30, Number 5, 1225-1310, 2002.

[98] C. McLarty. Elementary Categories, Elementary Toposes. Oxford University Press, 1995.

[99] J. A. Mingo and R. Speicher. Free Probability and Random Matrices. Springer, 2017.

[100] E. Moggi. Notions of computations and monads. Information and Computation Volume 93, Issue 1, pp. 55-92, 1991.

[101] J. Morton. Contextuality from Missing and versioned data. arXiv:1708.03264, 2017.

[102] J. Morton. Relations among conditional probabilities. arXiv:0808.1149, 2018.

[103] D. Mumford. The dawning of the age of stochasticity. Mathematics: Frontiers and Perspectives, 2000

[104] L. Narens. Alternative probability theories for cognitive psychology. Topics in cognitive science, 6(1):114-120, 2014.

[105] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic Linkage of Vital Records. Science, 1959.

[106] J. Nestruev. Smooth Manifolds and Observables. Springer, 2003.

[107] B. Oksendal. Stochastic Differential Equations: An Introduction with Applications (Sixth ed.). Springer, 2003.

[108] L. Pachter and B. Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press, 2005.

[109] K. R. Parthasarathy. Probability Measures on Metric Spaces. Academic Press, 1967.

[110] K. Petersen. Ergodic Theory. Cambridge University Press, 1983.

[111] C. Preston. Some Notes on Standard Borel and Related Spaces. arXiv:0809.3066, 2008.

[112] D. Pollard. A User’s Guide to Measure Theoretic Probability, Cambridge University Press, 2001.

[113] V. A. Rokhlin [original paper manuscript from 1949] On the fundamental ideas of measure theory. American Mathematical Society Translations, 71, pp. 1–54. Translated from Russian. 25 (67): 107–150, 1952.

[114] R. Rosebrugh and R. J. Wood. Relational Databases and Indexed Categories. In Category Theory 1991: Proceedings of an International Summer Category Theory Meeting, Held June 23-30, 1991 (Conference Proceedings, Vol 13), 1992.

[115] D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley, 1987.

[116] J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall, 1997.

[117] J. L. Schafer and J. W. Graham. Missing Data: Our View of the State of the Art. DOI: 10.1037/1082-989X.7.2.147, 2002.

[118] H. Scheffe. The Analysis of Variance. John Wiley & Sons, 1959.

[119] M. J. Schervish. Theory of Statistics. Springer, 1995.

[120] P. Schultz, D. I. Spivak, C. Vasilakopoulou, and R. Wisnesky. Algebraic Databases. arXiv:1602.03501, 2016.

[121] A. Simpson. Category-theoretic Structure for Independence and Conditional Independence. Electronic Notes in Theoretical Computer Science. 336: 281-297, 2018.

[122] A. Simpson. Conditional Independence in Categories. TACL 2013 : 9, 2013.

[123] A. Simpson. Probability Sheaves and the Giry Monad. 7th Conference on Algebra and Coalgebra in Computer Science, 2017.

[124] A. B. Slavkovic and S. Sullivant. The space of compatible full conditionals is a unimodular toric variety. Journal of Symbolic Computation, 2006.

[125] A. B. Slavkovic, X. Zhu, and S. Petrovic. Fibers of multi-way contingency tables given conditionals: relation to marginals, cell bounds and Markov bases. arXiv:1401.1397, 2014.

[126] J. Snoke, T. Brick, A. Slavkovic, and M. Hunter. Providing Accurate Models across Private Partitioned Data: Secure Maximum Likelihood Estimation. Annals of Applied Statistics - Accepted. https://arxiv.org/abs/1710.06933, 2018.

[127] M. Souslin. Sur une definition des ensembles mesurables B sans nombres transfinis. Comptes rendus de l’Academie des Sciences de Paris, 164: 88–91, 1917.

[128] D.I. Spivak. Simplicial Databases. arXiv:0904.2012, 2009.

[129] D. I. Spivak. Kleisli Database Instances. arXiv:1209.1011, 2012.

[130] S. M. Srivastava. A Course on Borel Sets, Springer-Verlag, 1991.

[131] V. Strassen. The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 1965.

[132] K. Sturtz. The factorization of the Giry monad. Advances in Mathematics, 2018.

[133] J. G. Sumner, A. Taylor, B. R. Holland, and P. D. Jarvis. Developing a statistically powerful measure for quartet tree inference using phylogenetic identities and Markov invariants. arXiv:1608.04761 [q-bio.QM], 2016.

[134] R. Sundberg. Some Results about Decomposable (or Markov-type) Models for Multidimensional Contingency Tables: Distribution of Marginals and Partitioning of Tests. Scandinavian Journal of Statistics, 1975.

[135] K. Tanabe and M. Sagae. An Exact Cholesky Decomposition and the Generalized Inverse of the Variance-Covariance Matrix of the Multinomial Distribution, with Applications. Journal of the Royal Statistical Society. Series B (Methodological), Vol. 54, No. 1, pp. 211-219, 1992.

[136] T. Tao. Topics in Random Matrix Theory. American Mathematical Society, 2012.

[137] A. Vattani. k-means Requires Exponentially Many Iterations Even in the Plane. Special Issue of Discrete and Computational Geometry, Volume 45, Issue 4, 2011.

[138] H. C. von Bayer. QBism: The Future of Quantum Physics. Harvard University Press, 2016.

[139] M. L. Wachs. Poset Topology: Tools and Applications. arXiv:math/0602226, 2006.

[140] Y. J. Wang. Compatibility among marginal densities. Biometrika, 2004.

[141] S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009.

[142] S. Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, 2018.

[143] H. Wickham. Tidy Data. Journal of Statistical Software, 2014.

[144] G. Winskel. The Formal Semantics of Programming Languages: An Introduction. MIT Press, 1993.

[145] O. Wyler. Lecture Notes on Topoi and Quasitopoi. World Scientific, 1991.

Vita
William Wright

Education
• Ph.D. in Mathematics with minor in Statistics, The Pennsylvania State University, Advisor: Jason Morton (expected: August 2019)

• B.S. Mathematics with highest honors, University of Texas at Austin (2012)

Honors and Awards
• Winner (as member of team NDL) Kaggle University Club Winter Hackathon (December 2018)

• Cada R. and Susan Wynn Grove Mathematics Enhancement Endowment (Spring 2018)

• Department Teaching Award (Fall 2017)

• Jack and Eleanor Petit Scholarship (2016, 2017)

• ZZRQ Award (2016)

• NSF GRFP Honorable Mention (2013)

Activities and Leadership
• Nittany Data Labs (2018/19 Academic Year)

• GPSA Eberly College of Science Delegate (2015/16 and 2016/17 terms)

• Organizer for Algebraic Geometry Reading Group (2014/15 Academic Year)

• Penn State Outreach Program Volunteer (Fall 2013)