The Pennsylvania State University The Graduate School
AN ALGEBRAIC PERSPECTIVE ON COMPUTING WITH DATA
A Dissertation in Mathematics by William Wright
© 2019 William Wright
Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
August 2019

The dissertation of William Wright was reviewed and approved∗ by the following:
Jason Morton Professor of Mathematics Dissertation Advisor, Chair of Committee
Vladimir Itskov Professor of Mathematics
Alexei Novikov Professor of Mathematics, Director of Graduate Studies
Aleksandra Slavkovic Professor of Statistics
∗Signatures are on file in the Graduate School.
Abstract
Historically, algebraic statistics has focused on the application of techniques from computational commutative algebra, combinatorics, and algebraic geometry to problems in statistics. In this dissertation, we emphasize how sheaves and monads are important tools for thinking about modern statistical computing. First, we explore how probabilistic computing necessitates thinking about random variables as tied to their families of extensions, and we ultimately reformulate this observation in the language of sheaf theory. We then turn our attention to the relationship between topos theory and the relational algebra of databases, showing how Codd's original operations can be seen as constructions inside Set. Next, we discuss contextuality, the phenomenon whereby the value of a random variable depends on the other random variables observed simultaneously, and demonstrate how sheaves allow us to lift statistical concepts to contextual measurement scenarios. We then discuss a technique for hypothesis testing based on algebraic invariants whose asymptotic convergence properties do not rely on the asymptotic normality of any estimator, as the test statistics are defined as energy functionals on the observed data. Finally, we discuss the Giry monad and how its implementation would aid the analysis of data sets with missing data.
Contents
List of Figures

Acknowledgments

Chapter 1  Introduction
  1.1 Motivation & Background
  1.2 Contributions
  1.3 Summary

Chapter 2  Background
  2.1 Categories
  2.2 Functors & Categories of Functors
    2.2.1 Functors
    2.2.2 Natural Transformations
    2.2.3 Functor Categories
    2.2.4 The Yoneda Embedding
  2.3 Lattices & Heyting Algebras
    2.3.1 Lattices
    2.3.2 Heyting Algebras
  2.4 Monads
  2.5 Cartesian Closed Categories
  2.6 Topoi
  2.7 Presheaves
    2.7.1 The Category of Presheaves
    2.7.2 Initial and Terminal Objects
    2.7.3 Products and Coproducts
    2.7.4 Equalizers and Coequalizers
    2.7.5 Pullbacks and Pushouts
    2.7.6 Exponentials
    2.7.7 The Subobject Classifier
      2.7.7.1 Subobjects
      2.7.7.2 The Subobject Classifier
    2.7.8 Local and Global Sections
  2.8 Sheaves

Chapter 3  A Sheaf Theoretic Perspective on Higher Order Probabilistic Programming
  3.1 The Categorical Structure of Measurable Spaces
    3.1.1 Non-Existence of Exponentials
    3.1.2 Lack of Subobject Classifier
  3.2 The Giry Monad
    3.2.1 The Endofunctor G
    3.2.2 The Natural Transformation η
    3.2.3 The Natural Transformation µ
    3.2.4 The Kleisli Category of the Giry Monad
    3.2.5 Simple Facts About the Giry Monad
  3.3 The Cartesian Closed Category of Quasi-Borel Spaces
    3.3.1 Quasi-Borel Spaces
    3.3.2 Cartesian Closure of QBS
    3.3.3 The Giry Monad on the Category of Quasi-Borel Spaces
    3.3.4 De Finetti Theorem for Quasi-Borel Spaces
  3.4 Standard Borel Spaces
  3.5 Quasi-Borel Sheaves
    3.5.1 Sample Space Category
    3.5.2 Quasi-Borel Presheaves
    3.5.3 Quasi-Borel Sheaves
    3.5.4 Lifting Measures Lemma
  3.6 Probability Theory for Quasi-Borel Sheaves
    3.6.1 Events
    3.6.2 Global Sections, Local Sections, and Subsheaves
    3.6.3 Expectation as a Sheaf Morphism
  3.7 Future Work
    3.7.1 Probabilistic Programming and Simulation of Stochastic Processes
    3.7.2 Categorical Logic and Probabilistic Reasoning
    3.7.3 Sample Space Category and the Topos Structure
    3.7.4 Extension of the Giry Monad

Chapter 4  Categorical Logic and Relational Databases
  4.1 Introduction
  4.2 Data Tables
    4.2.1 Attributes
    4.2.2 Attribute Spaces (Data Types)
    4.2.3 Missing Data
    4.2.4 Data Types
    4.2.5 Column Spaces, Tuples, and Tables
      4.2.5.1 Column Spaces
      4.2.5.2 Records
      4.2.5.3 Tables
    4.2.6 Primary Keys
    4.2.7 Versioning
  4.3 Relational Algebra on Tables
    4.3.1 Products
    4.3.2 Projection
    4.3.3 Union
    4.3.4 Selection
    4.3.5 Difference
  4.4 Some Additional Operations on Tables
    4.4.1 Addition & Deletion
    4.4.2 Editing Records
      4.4.2.1 Rename
      4.4.2.2 Imputation
    4.4.3 Merging Overlapping Records
      4.4.3.1 Table Morphisms
    4.4.4 Non-Binary Logics
  4.5 Random Tables and Random Databases
    4.5.1 Random Tables
    4.5.2 Giry Monad Applied to Tables
    4.5.3 Random Databases
  4.6 Topological Aspects of Databases
    4.6.1 Simplicial Complex Associated to a Database
    4.6.2 Contextuality
    4.6.3 Topology on a Database
  4.7 Relationship Between Topological Structure of a Schema and Contextuality

Chapter 5  Contextual Statistics
  5.1 Introduction
  5.2 The Bell Marginals
  5.3 Skip-NA and Directed Graphical Models
  5.4 Motivation from Statistical Privacy
  5.5 Poset of Joins of a Database
    5.5.1 Contextual Constraint Satisfaction Problems
    5.5.2 Poset of Solutions to Contextual Constraint Satisfaction Problems
  5.6 Topology of a Database Schema
    5.6.1 Contextual Topology on a Database Schema
  5.7 Sheaves on Databases
    5.7.1 Presheaf of Data Types
    5.7.2 Presheaf of Classical Tables of a Fixed Size
    5.7.3 Sheaf of Counts on Contextual Tables
    5.7.4 Presheaf of Classical Probability Measures
    5.7.5 Sheaf of Outcome Spaces
    5.7.6 Contextual Random Variables
    5.7.7 Sheaf of Parameters
    5.7.8 Sheaf of Contextual Probability Measures
  5.8 Statistical Models on Contextual Sheaves
    5.8.1 Contextual Statistical Models
    5.8.2 Factors
    5.8.3 Classical Snapshots of a Factor
  5.9 Subobject Classifier for Contextual Sheaves
  5.10 Local and Global Sections of a Contextual Sheaf
  5.11 Fitting Contextual Models
    5.11.1 Maximum Likelihood Estimation for the Saturated Contextual Model
    5.11.2 Classical Approximation of a Contextual Distribution
  5.12 Contextual Hypothesis Testing
    5.12.1 Testing if Observed Marginals are Drawn from the Same Distribution
    5.12.2 Testing if a Collection of Tables can be Explained Classically
    5.12.3 A Hypothesis Test for Contextuality
  5.13 Future Work
    5.13.1 Contextuality Penalization
    5.13.2 Sampling for Contextual Probability Distributions

Chapter 6  Algebraic Hypothesis Testing
  6.1 Introduction
  6.2 From model to invariants
  6.3 Constructing an inner product from invariants
  6.4 Asymptotic Properties of ⟨ψ|H|ψ⟩
  6.5 Quadratic Forms of Multivariate Normal Distributions
  6.6 Estimation of Parameters for the Asymptotic Distribution
    6.6.1 Using an MLE
    6.6.2 Using Normalized Count Data
  6.7 The Independence Model for a 2 × 2 Contingency Table
  6.8 Behavior of Statistic on the Boundary of the Probability Simplex
  6.9 A Test for the Rank of a Contingency Table
  6.10 Simulation Techniques and Results
  6.11 Future Work
    6.11.1 Application to Mixture Models
    6.11.2 Application to Restricted Boltzmann Machines

Chapter 7  A Monadic Approach to Missing and Conflicting Data
  7.1 Pullbacks, Maximum Entropy Distributions, and Independence Joins
  7.2 Merging Conflicting Tables
  7.3 Imputing Missing Data with Giry Tables
    7.3.1 Imputing by Empirical Probability Measure of a Column
    7.3.2 Lifting Statistics to the Giry Monad
    7.3.3 A Simple Example of Giry Imputation
  7.4 Future Work
    7.4.1 Implementation of Giry Tables
    7.4.2 The Giry Monad and Contextuality
    7.4.3 Generalizations of Interval Time Models

Appendix A  Supplemental Code for Chapter 5
  A.1 Introduction
  A.2 Code
    A.2.1 CSP for All Bell Marginals
    A.2.2 CSPs Involving Three Bell Marginals
    A.2.3 CSPs Involving Two Bell Marginals

Appendix B  Code for Producing Figures in Chapter 7
  B.1 Introduction
  B.2 Invariants vs. Chi-Squared 2 × 2 Case
  B.3 P-values on a Degenerate Distribution in the Binary 4-Cycle Model
  B.4 Tables of Percentage Deviation from Significance Level
    B.4.1 Noise Parameter = 0.1
    B.4.2 Noise Parameter = 0.01
    B.4.3 Noise Parameter = 0.001

Bibliography
List of Figures
6.1 A scatterplot showing values of the survival function of the invariant-based quadratic form vs. the chi-squared distribution for samples drawn from a uniform distribution.

6.2 A scatterplot showing values of the survival function of the invariant-based quadratic form vs. the chi-squared distribution for samples drawn from the distribution (q00, q01, q10, q11) = (0.1, 0.3, 0.2, 0.4).

6.3 A scatterplot showing p-values computed for the perturbed degenerate distribution on the binary 4-cycle, comparing the likelihood ratio test vs. the survival function of the invariants-based quadratic form computed via Imhof's method.

6.4 A scatterplot showing p-values computed for the perturbed degenerate distribution on the binary 4-cycle, comparing the chi-squared test vs. the survival function of the invariants-based quadratic form computed via Davies' method.
Acknowledgments
I would like to thank my advisor Jason Morton for all the time he has invested in this project. I would also like to thank the other members of my committee, Vladimir Itskov, Alexei Novikov, and Aleksandra Slavkovic, for agreeing to serve and for investing their time in this project. I would also like to thank Jared Culbertson and Roman Ilin for their supervision while I was interning at AFRL, and Kirk Sturtz and Benjamin Robinson for many helpful conversations on the subject of applied category theory. Additionally, I would like to thank Manfred Denker for his early encouragement to explore some of the unconventional ideas in this dissertation. I would also like to thank Becky Halpenny and Allyson Borger for all their help throughout my time at Penn State, and Bojana Radja and Cheryl Huff for their encouragement and support. Lastly, I would like to thank my family and friends for all their support over the years, especially my father, Clifton. This material is based upon work supported by the Air Force Office of Scientific Research under Award No. FA9550-16-1-0300. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the Air Force Office of Scientific Research.
Chapter 1 | Introduction
1.1 Motivation & Background
Traditionally, algebraic statistics has focused on applying techniques from algebraic geometry, commutative algebra, and combinatorics to statistical problems. A major focus of this dissertation is to demonstrate that several newer tools, namely sheaf theory and monads, can be useful for thinking about the types of statistical questions arising out of the needs of modern statistical computing. Traditionally, statistical theory has focused on the case where data is tabulated in a single table in a tidy manner. However, many data sets contain multiple tables with overlapping columns, missing entries, and stale records. The application of category theory to probability is not a new idea. The first appearance in the literature appears to be a paper by Michèle Giry [59], published in 1980. In this paper, Giry constructs an endofunctor associating a measurable space to its collection of probability distributions, shows that this endofunctor can be given the structure of a monad, and studies the Kleisli category associated to this monad. Following Giry's work, there appears to have been very little activity in the area until 2006, when Doberkat worked out the Eilenberg-Moore algebras for the Giry monad restricted to Polish spaces [34]. Culbertson and Sturtz developed a categorical framework for Bayesian probability in 2013 [27]. Some mathematicians have argued that probability should be rethought foundationally. Mumford has argued that probability should be rethought in a way where the random variable is taken as a primitive concept [103]. Gromov has argued that category theory should give us insights into how such a re-imagining of the foundation could be
achieved [61]. The potential of category theory to provide insight into statistics is not new. McCullagh noticed that categories and natural transformations provide a natural way to express the intuitive requirement that a statistical model admit certain natural extensions depending on the domain of inference [97]. In this dissertation, we use category theory as a language for organizing and expressing statistical concepts on versioned databases with missing data. In particular, sheaf theory is a natural framework for thinking about local-to-global phenomena. We propose that sheaves provide a natural way of thinking about distributed or contextually dependent statistical questions. Moreover, Tao observes that the concepts that are probabilistically meaningful are those which are invariant under surjective measure-preserving maps [136]. Tao also observes that notions like independence are invariant under such maps, while constructions such as which elements of the sample space map to which elements of the outcome space do not satisfy this invariance. This suggests a random variable should be identified with its entire sieve of extensions. In chapter three, we realize this construction by viewing a random variable as a construction involving quasi-Borel sheaves on the sieve of extensions of a sample space. McCullagh argues that it is not possible to perform inference with a statistical model unless the model can be extended to the domain for which inference is required [97]. This concept of requiring statistical models to admit natural extensions suggests a rejection of the notion of a fixed universe of sets in favor of some notion of variable structure. However, this is precisely what the topos-theoretic perspective is intended to provide.
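Giry's construction can be illustrated concretely by restricting to finite outcome spaces, where it becomes the finite distribution monad. The sketch below is only a toy stand-in for the measurable-space construction, and all names in it are illustrative choices rather than the dissertation's notation: a distribution is a dictionary from outcomes to probabilities, the unit η is the Dirac distribution, and Kleisli composition pushes a distribution through a Markov kernel.

```python
# A minimal sketch of the Giry monad on finite outcome spaces (the finite
# distribution monad). Distributions are dicts {outcome: probability}.

def unit(x):
    """eta: embed a point as the Dirac distribution concentrated on it."""
    return {x: 1.0}

def bind(dist, kernel):
    """Kleisli extension: push a distribution through a Markov kernel,
    i.e. a function sending each outcome to a distribution."""
    out = {}
    for x, p in dist.items():
        for y, q in kernel(x).items():
            out[y] = out.get(y, 0.0) + p * q
    return out

# A fair coin followed by a kernel that rerolls "H" uniformly.
coin = {"H": 0.5, "T": 0.5}
kernel = lambda x: {"H": 0.5, "T": 0.5} if x == "H" else unit("T")
posterior = bind(coin, kernel)  # {"H": 0.25, "T": 0.75}
```

The left unit law of the monad, bind(unit(x), k) = k(x), says that conditioning a Dirac distribution is just applying the kernel; this is one of the structural facts that Giry verifies for general measurable spaces.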
To quote Lawvere, “Every notion of constancy is relative, being derived perceptually or conceptually as a limiting case of variation, and the undisputed value of such notions in clarifying variation is always limited by that origin. This applies in particular to the notion of a constant set, and explains so much of why naive set theory carries over in some form to the theory of variable sets.” [78] A major goal of this dissertation is to demonstrate how the ideas behind topos theory and sheaf theory can clarify the statistical structure of modern complex data sets by providing a language which can naturally handle the variability of these structures.
1.2 Contributions
In chapter three, the main result extends a construction due to Heunen, Kammar, Staton, and Yang to a presheaf construction on a category of sample spaces (Definition 88) and shows that this extension is in fact a sheaf with respect to the atomic Grothendieck topology (Lemma 91). We also realize expectation as a sheaf morphism (Section 3.6.3) and discuss some structural properties of these new objects as they relate to foundational concepts in probability theory (Sections 3.6.1 and 3.6.2). Along the way, we characterize several sub-classes of monic arrows in the category of measurable spaces (Proposition 77) and show that Meas does not admit a subobject classifier (Lemma 76). This provides an alternative proof that Meas is not a topos, a fact already well known from the work of Aumann [10], who showed that the category of standard Borel spaces is not Cartesian closed. We also prove a simple lemma about lifting probability measures along surjective maps (Section 7.3.2). In chapter four, we discuss a categorical perspective on the bag models common in the relational database literature. By emphasizing how these models can be seen as constructions within the underlying category of sets, we create a framework which more easily generalizes to situations, as in SQL, where the underlying logic is not a Heyting algebra. Such situations are impossible in purely topos-theoretic models, as the internal logic of a topos is always a Heyting algebra. We also define a simplicial structure (Section 4.6.3) and a graph associated to a database schema and prove a result relating properties of this graph to whether agreement on marginal tables is sufficient to ensure that the marginals can arise from a joint distribution on the full outcome space. In particular, we provide sufficient conditions for a joint table on the full column set to exist (Lemma 98, Proposition 101).
This result is foundational to the next chapter, where we attach an additional topological structure to this simplicial complex and use it to weaken the common assumption in statistics that the family of marginals under consideration arises as projections of a joint distribution on the full column space. In chapter five, our major contribution is the use of an appropriate topological structure to sheaf-theoretically lift standard statistical constructions to families of marginals with some overlapping constraints. More precisely, this includes
the introduction of a poset structure on the collection of constraint satisfaction problems (Section 5.5.2), which allows us to select an appropriate topology based on the shared columns of the tables constituting a database (Section 5.6.1). Using this topology, we show how to express various statistical concepts as sheaves or presheaves with respect to it (Section 5.7). This allows us to define the notion of contextual random variables (Section 5.7.6) and to define statistical models in terms of sheaf morphisms (Section 5.8.1). We also introduce the distinction between classical and contextual factors (Definition 127) and the notion of a classical snapshot to handle classical approximations to globally irreconcilable marginals. We discuss a pseudo-likelihood approach to extending maximum likelihood estimation based on the realization of contextual random variables as subsets of an equalizer (Section 5.11.1) and provide a test for whether marginal distributions can arise from a joint distribution on the full column set (Section 5.12.2). This last result is similar to a result due to Abramsky, Barbosa, and Mansfield based on sheaf cohomology which allows the user to detect contextuality [3]. By combining our results with the construction in chapter six, we can provide a goodness-of-fit measure for contextuality rather than a simple detection of contextuality. In chapter six, our major contributions are constructing an energy statistic based on the invariants of an algebraic statistical model and proving its asymptotic consistency under the null hypothesis. This construction is interesting because its asymptotic properties do not rely on the asymptotic normality of an estimator, since the statistic can be computed directly from empirical frequencies.
Thus, this construction provides an alternative technique for computing goodness-of-fit in situations where standard asymptotic theory breaks down, such as on boundary points of the probability simplex or near singularities of a statistical model. We demonstrate this improved performance near a singularity of the binary 4-cycle undirected graphical model by benchmarking it against the likelihood ratio and chi-squared tests in a simulation. In chapter seven, the main result is a lemma establishing that measurable statistics lift to the Giry monad, together with the use of the Giry monad to combine conflicting data in a way that does not destroy information about the conflicting records. This construction is potentially useful in statistical decision-making situations where we would like to design systems which select more conservative actions in the presence of conflict, such as in target recognition in sensor networks. This chapter is more speculative than the others and is intended to
explore how an implementation of the Giry monad could be beneficial for statistical computing.
1.3 Summary
The contents of the remainder of this dissertation are as follows. Chapter two provides background, collecting many basic definitions from category theory. This chapter is not intended as a complete introduction to the subject; rather, it provides a list of definitions used elsewhere in the dissertation and can be used as a reference whenever these concepts appear in subsequent chapters. The topics treated include the basic definitions and properties of categories and of functor categories. We discuss the Yoneda lemma, as it is used several times throughout the dissertation. We also discuss the basics of lattice theory and Heyting algebras, along with Cartesian closed categories and topoi. The most important concepts introduced in this chapter are monads and sheaves, which are used several times throughout this dissertation. In chapter three, we examine how category theory gives us insight into the semantics of probabilistic programming languages by demonstrating how the need for higher-order functions requires us to step outside the standard bounds of measure theory. We first deconstruct the ways in which the category of measurable spaces is an inadequate framework for higher-order probabilistic programming. We then discuss restricting attention to an appropriately well-behaved subclass of measurable spaces, namely the standard Borel spaces. We review the recent theory of quasi-Borel spaces developed by Heunen, Kammar, Staton, and Yang and discuss the importance of sample space extensions to probabilistic programming. We then lift their definition to a sheaf-theoretic one in order to naturally incorporate such extensions into their model of higher-order probabilistic programming. In chapter four, we examine databases through the lens of topos theory.
We construct a model of tables within the category of sets that is simple enough to express the standard operations of relational algebra along with some other common table manipulations. This largely fixes the notation and definition of tables used in subsequent chapters. We emphasize constructions which can also be performed inside the category of standard Borel spaces, as subsequent work will focus on adapting random variables to databases with global
inconsistency. Finally, we discuss how a database schema gives rise to an abstract simplicial complex showing the interconnections between tables in the database. This observation is the jumping-off point for chapter five, where we analyze statistical techniques in the presence of contextuality. In chapter five, we examine how sheaves on a topology associated to the abstract simplicial complex of a database can be used to lift statistical concepts to the realm of databases containing marginal distributions which are globally irreconcilable. We also study the problem of attempting to reconstruct tables from marginal tables and some of the computational issues that arise in these computations. A major theme of this chapter is that the language of sheaf theory allows us to extend results globally by constructing a classical approximation of the usual concept along the basis. The nuances of adapting statistics to this regime are explored. In chapter six, we discuss a technique for hypothesis testing based on algebraic invariants of a statistical model. We derive an asymptotic distribution for an energy functional of a statistic computed from the invariants. The asymptotic theory of this statistic does not rely on asymptotic normality and so provides a method robust to model singularities and boundary points. We discuss potential applications of this technique to mixture models and restricted Boltzmann machines. We conclude by simulating the statistic and benchmarking it against known techniques. We consider small perturbations of a degenerate binary four-cycle distribution and see that the invariant-based statistic outperforms standard techniques for this particular example. In chapter seven, we examine the relationship between the Giry monad and multiple imputation and discuss how an implementation of the Giry monad could be useful for statistical computing.
We see that implementing the Giry monad allows us to preserve information about conflicting measurements, compared to using a point estimate to resolve conflicts between different tables or computational agents. The main result is a construction which allows us to lift all the common statistics used in practice to the Giry monad. We also discuss how its implementation would facilitate the handling of multiple imputation techniques in other parts of the inference pipeline. A simple example involving the k-nearest neighbor technique in machine learning indicates how this implementation leads to different calculations, which agree with the original technique on completely observed data.
Chapter 2 | Background
This chapter provides background information and terminology for this dissertation. We cover the basic language of categories, functors, monads, Cartesian closed categories, topoi, presheaves, and sheaves (both on ordinary topologies and on Grothendieck topologies). These constructions will be used many times throughout the dissertation. References are provided to more detailed treatments of these topics.
2.1 Categories
Categories are the spaces where mathematical objects live. Intuitively, we have a collection of objects and morphisms relating these objects, where morphisms are composable and have identities. It is tempting to think of these as 'sets' and 'functions' respectively, but these are not the only kind of category, as we will see later. In this section, we review some elementary definitions in category theory. More detailed treatments of these topics can be found in [15, 18, 93].
Definition 1. A category C consists of a collection of objects, denoted Ob(C), along with a collection of morphisms Mor(C, D) for each pair C, D ∈ Ob(C), which satisfy the following conditions:
• For all C,D,E in Ob (C), there is a composition
◦ : Mor (C,D) × Mor (D,E) → Mor (C,E)
which is associative.
• For all C ∈ Ob(C), there is an identity 1C : C → C such that for any
f : C → D and any g : E → C, we have f = f ◦ 1C and g = 1C ◦ g.
By a commutative diagram, we mean a diagram in which any two paths with identical source and target yield the same morphism. In the diagram below, we represent the axioms for the identity morphism as a commutative diagram. For this diagram, it suffices to check that the left triangle and the right triangle commute; from the commutativity of these two triangles, we can deduce the rest. This is a general feature of such diagrams.
A --f--> B --1_B--> B --g--> C,   with f = 1_B ◦ f and g = g ◦ 1_B.
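The composition and identity axioms of Definition 1 can be checked concretely in the simplest setting, where morphisms are ordinary functions and composition is function composition. The sketch below is only an illustration; the particular maps f and g are arbitrary choices, not notation used elsewhere in the dissertation.

```python
# A minimal check of the identity and associativity axioms of Definition 1,
# with morphisms as Python functions and categorical composition g . f.

def compose(g, f):
    """Categorical composition g ∘ f: apply f first, then g."""
    return lambda x: g(f(x))

identity = lambda x: x  # the identity morphism on any of these objects

f = lambda n: n + 1  # a morphism A -> B
g = lambda n: 2 * n  # a morphism B -> C

# Identity laws: f ∘ 1_A = f and 1_B ∘ f = f.
assert compose(f, identity)(3) == f(3)
assert compose(identity, f)(3) == f(3)

# The two paths through the commuting triangle agree: g ∘ (f ∘ 1_A) = g ∘ f.
assert compose(g, compose(f, identity))(3) == compose(g, f)(3)
```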
Categories are ubiquitous in mathematics. Here are a few basic examples.
Example 2. The category of sets has a collection of sets as its objects and the morphisms are given by functions.
Example 3. The category of groups has the collection of groups as its objects and the morphisms are given by group homomorphisms.
Example 4. There is a category, Meas, of measurable spaces. The objects are measurable spaces and the morphisms are given by measurable mappings.
All the above examples are so-called concrete categories, which consist of objects that are sets (possibly equipped with additional structure) and morphisms given by functions (which preserve that structure). These are not the only types of categories.
Example 5. Any poset, (P, ≤), can be viewed as a category in the following way. The objects are given by elements of the poset and we define x → y if and only if x ≤ y.
Example 6. A similar construction could be used to view any topological space (X, τ) as a category. The objects are given by open sets and the morphisms are
given by U → V if and only if U ⊂ V . We will use the notation Xτ to denote such a category.
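Example 5 can be made concrete by enumerating the morphisms of a small poset category; the divisibility order on {1, 2, 3, 6} below is an illustrative choice. A morphism x → y is exactly a witness that x ≤ y, identities come from reflexivity, and composites come from transitivity.

```python
# A sketch of Example 5: a finite poset viewed as a category. A morphism
# x -> y exists iff x <= y; here <= is divisibility on {1, 2, 3, 6}.

elements = [1, 2, 3, 6]
leq = lambda x, y: y % x == 0  # x <= y iff x divides y

# The morphisms are exactly the ordered pairs witnessing the relation.
morphisms = {(x, y) for x in elements for y in elements if leq(x, y)}

# Identities exist by reflexivity of <=.
assert all((x, x) in morphisms for x in elements)

# Composites exist by transitivity of <=: (x -> y) then (y -> z) gives x -> z.
assert all((x, z) in morphisms
           for (x, y1) in morphisms for (y2, z) in morphisms if y1 == y2)
```

Note that between any two objects there is at most one morphism, which is why the category axioms reduce to reflexivity and transitivity of the order.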
The main type of category that we will be concerned with in this dissertation is the Cartesian closed category. Before explaining what a Cartesian closed category is, we need a few definitions.
Definition 7. Let C be a category. An object, T , of C is said to be a terminal object if for every object C in C, there is a unique morphism !: C → T .
In Set, a terminal object is given by any singleton set T = {∗}. This also works for Meas. A poset category admits a terminal object if and only if it has a top element. In the category associated to a topological space, Xτ, the terminal object is the underlying set X. Terminal objects, when they exist, are unique up to unique isomorphism. The dual notion to a terminal object is an initial object.
Definition 8. Given a category C, an initial object is an object 0 ∈ C such that for each C ∈ C, there is a unique morphism ! : 0 → C.
In the category Set or Meas, this object is the empty set (equipped with the empty sigma algebra in Meas). For a poset, this is a bottom element and for a topological space this is again the empty set. Products are the categorical generalization of the Cartesian product of sets. Products in an abstract category are defined by a universal property.
Definition 9. Let C be a category. C admits products if given any two objects X and Y in C, there exists a third object X × Y and a pair of morphisms pX : X × Y → X, pY : X × Y → Y such that given any other object Z and morphisms f : Z → X, g : Z → Y, there exists a unique morphism f × g : Z → X × Y such that pX ◦ (f × g) = f and pY ◦ (f × g) = g.
In the category of sets, the product is given by the usual Cartesian product of sets. In the category of measurable spaces, you can take the Cartesian product of the underlying sets with the product sigma algebra. In the category associated to
a topological space, Xτ, the product is given by the intersection of open sets. It is also true that products are unique up to unique isomorphism. Coproducts are the dual notion to products.
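The universal property of Definition 9 can be verified directly in Set on small finite sets. In the sketch below, the sets X, Y, Z and the cone maps f, g are arbitrary illustrative choices; the induced map h is the unique f × g of the definition.

```python
# A concrete check of the product universal property in Set.
from itertools import product as cartesian

X, Y, Z = [0, 1], ["a", "b"], [10, 20]
XxY = list(cartesian(X, Y))        # the product object X x Y
p_X = lambda pair: pair[0]         # projection p_X : X x Y -> X
p_Y = lambda pair: pair[1]         # projection p_Y : X x Y -> Y

# An arbitrary cone over X and Y with apex Z.
f = lambda z: z // 10 - 1              # f : Z -> X  (10 -> 0, 20 -> 1)
g = lambda z: "a" if z == 10 else "b"  # g : Z -> Y

# The unique induced map f x g : Z -> X x Y.
h = lambda z: (f(z), g(z))

# The triangles commute: p_X ∘ h = f and p_Y ∘ h = g.
assert all(p_X(h(z)) == f(z) and p_Y(h(z)) == g(z) for z in Z)
assert all(h(z) in XxY for z in Z)
```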
Definition 10. We say that C admits coproducts if given any two objects X and Y there exists a third object X ∐ Y and a pair of morphisms iX : X → X ∐ Y, iY : Y → X ∐ Y such that given any other object Z and morphisms f : X → Z, g : Y → Z, there exists a unique morphism f ∐ g : X ∐ Y → Z such that (f ∐ g) ◦ iX = f and (f ∐ g) ◦ iY = g.
In Set, the coproduct is given by the disjoint union of two sets. For the category associated to a topological space, Xτ, the coproduct of two open sets is given by their union. For a lattice, the coproduct is given by the join operation. In Meas, X ∐ Y can be constructed as follows. The underlying set is the disjoint union of X × {0} and Y × {1}. We can generate a sigma algebra on X ∐ Y by taking the smallest sigma algebra containing sets of the form BX × {0} and BY × {1}, where BX is a measurable subset of X and BY is a measurable subset of Y. This observation will be important later in the dissertation when we augment outcome spaces to allow for the possibility of missing data. In this dissertation we only focus on this construction for standard Borel spaces. A more detailed construction is provided in [130].
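The tagged disjoint union just described can be sketched in Set; the example sets below are illustrative, and the tags 0 and 1 mirror the X × {0}, Y × {1} construction used in Meas. The copairing implements the unique map f ∐ g out of the coproduct.

```python
# The tagged disjoint union in Set, mirroring X x {0} and Y x {1}.

X, Y = ["red", "blue"], ["red", "green"]  # overlapping elements are fine
coproduct = [(x, 0) for x in X] + [(y, 1) for y in Y]
i_X = lambda x: (x, 0)  # injection i_X : X -> X ∐ Y
i_Y = lambda y: (y, 1)  # injection i_Y : Y -> X ∐ Y

def copair(f, g):
    """The unique induced map f ∐ g out of the coproduct."""
    return lambda tagged: f(tagged[0]) if tagged[1] == 0 else g(tagged[0])

f = lambda x: len(x)   # f : X -> Z
g = lambda y: -len(y)  # g : Y -> Z
h = copair(f, g)

# The triangles commute: (f ∐ g) ∘ i_X = f and (f ∐ g) ∘ i_Y = g.
assert all(h(i_X(x)) == f(x) for x in X)
assert all(h(i_Y(y)) == g(y) for y in Y)
# Tagging prevents shared elements from being collapsed.
assert len(coproduct) == len(X) + len(Y)
```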
Definition 11. A morphism i : Ef,g → X in C is an equalizer for a pair of morphisms f, g : X → Y if f ◦ i = g ◦ i and given any morphism h : Z → X such that f ◦ h = g ◦ h, there exists a unique morphism k : Z → Ef,g such that i ◦ k = h,
i.e. h factors uniquely through the equalizer as h = i ◦ k.
In Set, the equalizer of f, g is simply the subset of X defined by Ef,g = {x ∈ X : f(x) = g(x)}. The dual notion to the equalizer is the coequalizer.
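A Python sketch of the equalizer in Set, using two hypothetical functions on a small set:

```python
# Equalizer in Set: the subset of X on which f and g agree.
X = set(range(20))
f = lambda x: x % 3
g = lambda x: x % 5

# E_{f,g} = {x ∈ X : f(x) = g(x)}
E = {x for x in X if f(x) == g(x)}
i = lambda e: e                     # the inclusion E ↪ X

assert E == {0, 1, 2, 15, 16, 17}
assert all(f(i(e)) == g(i(e)) for e in E)   # f ∘ i = g ∘ i
```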
Definition 12. A morphism q : Y → Ef,g in C is a coequalizer for a pair of morphisms f, g : X → Y if q ◦ f = q ◦ g and given any morphism h : Y → Z with h ◦ f = h ◦ g, there exists a unique morphism k : Ef,g → Z such that k ◦ q = h.
In Set, the coequalizer of f, g is the quotient of Y by the minimal equivalence relation such that f(x) ∼ g(x) for each x ∈ X.
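Computing this quotient amounts to closing the identifications f(x) ∼ g(x) under equivalence; a Python sketch using a simple union-find (all data hypothetical):

```python
# Coequalizer in Set: quotient Y by the smallest equivalence relation
# identifying f(x) with g(x) for every x in X, via union-find.
X = {0, 1, 2}
Y = {0, 1, 2, 3, 4}
f = lambda x: x          # f : X → Y
g = lambda x: x + 1      # g : X → Y

parent = {y: y for y in Y}

def find(y):
    while parent[y] != y:
        y = parent[y]
    return y

for x in X:
    parent[find(f(x))] = find(g(x))

q = lambda y: find(y)    # the quotient map q : Y → Y/~
classes = {frozenset(z for z in Y if find(z) == find(y)) for y in Y}

assert all(q(f(x)) == q(g(x)) for x in X)   # q ∘ f = q ∘ g
assert len(classes) == 2                     # the classes {0,1,2,3} and {4}
```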
Definition 13. A pullback for morphisms f : X → Z and g : Y → Z is an object W together with morphisms h : W → X, k : W → Y satisfying f ◦ h = g ◦ k and the universal property that for any morphisms i : V → X, j : V → Y with f ◦ i = g ◦ j, there exists a unique ℓ : V → W such that the following diagram commutes:
h ◦ ℓ = i and k ◦ ℓ = j.
In the category of sets we can take

W = X ×Z Y = {(x, y) ∈ X × Y | f(x) = g(y)}.
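A Python sketch of the fibered product in Set on toy data:

```python
# Pullback in Set: the fibered product X ×_Z Y.
X = {0, 1, 2, 3}
Y = {'a', 'b', 'c'}
Z = {0, 1}
f = lambda x: x % 2                  # f : X → Z
g = lambda y: 0 if y == 'a' else 1   # g : Y → Z

W = {(x, y) for x in X for y in Y if f(x) == g(y)}
h = lambda w: w[0]   # h : W → X
k = lambda w: w[1]   # k : W → Y

assert all(f(h(w)) == g(k(w)) for w in W)   # f ∘ h = g ∘ k
assert len(W) == 6   # even x's pair with 'a', odd x's with 'b' and 'c'
```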
Definition 14. A pushout for morphisms f : X → Y and g : X → Z is an object W together with morphisms h : Y → W, k : Z → W satisfying h ◦ f = k ◦ g and the universal property that for any morphisms i : Y → V and j : Z → V with i ◦ f = j ◦ g, there exists a unique morphism ℓ : W → V such that the following diagram commutes:
ℓ ◦ h = i and ℓ ◦ k = j.
In the category of sets the pushout is given by W = (Z ⊔ Y)/∼ where ∼ is the smallest equivalence relation such that f(x) ∼ g(x) for each x ∈ X. The constructions we have seen thus far are all examples of more general types of limits or colimits. In order to define these notions, we first need to give a rigorous definition of a diagram.
Definition 15. A diagram of shape J in C is a functor from J to C. The category J is referred to as the index category.
Example 16. Let J be the category 0 ⇒ 1, where only the non-identity morphisms are drawn. A diagram on J just picks out two parallel arrows f, g : X ⇒ Y.
When discussing diagrams, it is common to leave out explicit mention of the index category J and to simply depict the image under the functor as we have done in all of the diagrams used thus far in this section.
Now that we have defined diagrams we can define the notion of a cone on a diagram.
Definition 17. Let F : J → C be a diagram. A cone to F is an object N in C along with a family ψX : N → F (X) of morphisms indexed by the objects X of J such that for every morphism f : X → Y in J, we have F (f) ◦ ψX = ψY .
A limit is simply a cone that is universal in the sense that any other cone must factor uniquely through it. This is made precise with the following definition:
Definition 18. A limit of the diagram F : J → C is a cone (L, φ) which is universal in the sense that for any other cone (N, ψ) to F there is a unique morphism u : N → L such that φX ◦ u = ψX for all X in J.
That is, every component ψX factors through the limit cone: φX ◦ u = ψX and F(f) ◦ φX = φY for each f : X → Y in J.
Example 19. Products, equalizers, terminal objects, and pullbacks are all examples of limits.
Dual to the notion of limit is the colimit.
Definition 20. A co-cone of a diagram F : J → C is an object N of C along with a family ψX : F(X) → N of morphisms indexed by the objects X of J such that for every morphism f : X → Y in J, we have ψY ◦ F(f) = ψX.
Similarly to limits being defined as universal cones, we can define colimits as universal co-cones.
Definition 21. A colimit of the diagram F : J → C is a co-cone (L, φ) which is universal in the sense that for any other co-cone (N, ψ) to F there is a unique morphism u : L → N such that u ◦ φX = ψX for all X in J.
That is, every component ψX factors through the colimit co-cone: u ◦ φX = ψX and φY ◦ F(f) = φX for each f : X → Y in J.
Example 22. Coproducts, coequalizers, pushouts, and initial objects are all examples of colimits.
2.2 Functors & Categories of Functors
In this section, we collect basic results and terminology about functors and categories of functors. More detailed treatments can be found in [18,93,94].
2.2.1 Functors
Functors are the mappings between categories. There are two types of functors: covariant functors and contravariant functors.
Definition 23. Let C and D be categories. A covariant functor is a mapping F : C → D which assigns to each object C in C, an object F (C) in D and to each morphism f : C → C0 inside C, a morphism F (f) : F (C) → F (C0) inside D in such a way that F (idC ) = idF (C) and F (f ◦ g) = F (f) ◦ F (g).
Example 24. There is a functor P : Set → Set which maps each set X to its power set P(X); each set function f : X → Y is sent to the mapping P(f) : P(X) → P(Y) that takes each subset S ⊂ X to its image f(S) ⊂ Y.
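A Python sketch of this covariant power-set functor on finite sets, checking functoriality on a small example (all sets and maps hypothetical):

```python
# The covariant power-set functor: P on objects, direct image on morphisms.
def P(X):
    subsets = [frozenset()]
    for x in X:
        subsets += [s | {x} for s in subsets]
    return set(subsets)

def P_map(f):
    return lambda S: frozenset(f(x) for x in S)   # S ↦ f(S)

X = {1, 2, 3}
f = lambda x: x % 2
g = lambda x: x + 1

assert P_map(f)(frozenset({1, 2, 3})) == frozenset({0, 1})
# functoriality: P(g ∘ f) = P(g) ∘ P(f) on every subset of X
for S in P(X):
    assert P_map(lambda x: g(f(x)))(S) == P_map(g)(P_map(f)(S))
assert len(P(X)) == 8
```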
Example 25. Let Group denote the category of groups. There is a functor U : Group → Set which associates to each group its underlying set and takes each group homomorphism to its underlying set map. This functor is called a forgetful functor because it simply 'forgets' the additional group and homomorphism structure.
Example 26. (Hom-Functor) Let C be a category and C be an object in C. We can define a functor hC := HomC(C, −) : C → Set which takes each object X to the set of all C-morphisms from C to X, HomC(C, X). For each morphism f : X → Y,

HomC(C, f) : HomC(C, X) → HomC(C, Y) is defined by post-composition with f, i.e. g 7→ f ◦ g.
The other type of functors are called contravariant functors. These are defined similarly except that they reverse the direction of morphisms.
Definition 27. Let C and D be categories. A contravariant functor F : C → D assigns to each object C in C an object F (C) in D; however, it assigns to each morphism f : C → C0 a morphism F (f) : F (C0) → F (C) in such a way that
F (idC ) = idF (C) and F (f ◦ g) = F (g) ◦ F (f).
Example 28. There is a contravariant functor P′ : Set → Set that associates to each set X its power set P′(X); each function f : X → Y is mapped to its inverse image map P′(f) : P′(Y) → P′(X), which takes each subset S ⊂ Y to its pre-image f⁻¹(S) ⊂ X.
Example 29. There is also a contravariant version of the Hom functor. Let D be an object in C. Then we can define a contravariant functor Hom(−, D) : C → Set which assigns to each object X in C the set of morphisms Hom(X, D) and to each morphism f : X → Y assigns a set function Hom(f, D) : Hom(Y, D) → Hom(X, D) defined by pre-composition with f, i.e. g 7→ g ◦ f.
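A Python sketch of the contravariant power-set functor of Example 28, checking on toy data that it reverses composition:

```python
# The contravariant power-set functor: preimage reverses composition,
# P'(g ∘ f) = P'(f) ∘ P'(g).
def P_inv(f, domain):
    return lambda S: frozenset(x for x in domain if f(x) in S)

X = {0, 1, 2, 3}
Y = {0, 1}
f = lambda x: x % 2   # f : X → Y
g = lambda y: y + 1   # g : Y → {1, 2}
gf = lambda x: g(f(x))

assert P_inv(f, X)(frozenset({0})) == frozenset({0, 2})
# contravariant functoriality, checked on all subsets of the codomain of g
for S in [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})]:
    assert P_inv(gf, X)(S) == P_inv(f, X)(P_inv(g, Y)(S))
```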
We now collect a sequence of basic definitions for certain properties that are important for functors. For a more detailed exposition and examples, refer to Mac Lane.
Definition 30. A functor F : C → D is faithful if for all objects X and Y in C the induced map F : C(X, Y) → D(F(X), F(Y)) is injective.
Definition 31. A functor F : C → D is full if for all objects X and Y in C the induced map F : C(X, Y) → D(F(X), F(Y)) is surjective.
Definition 32. A functor F : C → D is fully faithful if for all objects X and Y in C the induced map F : C(X, Y) → D(F(X), F(Y)) is bijective.
Definition 33. A functor F : C → D is essentially surjective if each object D in D is isomorphic to F(C) for some C in C.
Definition 34. A functor F : C → D is called an embedding if it is fully faithful and injective on objects.
Definition 35. A functor F : C → D is called an equivalence of categories if it is fully faithful and essentially surjective.
Definition 36. A functor F : C → D is called an isomorphism if there exists a functor G : D → C such that G ◦ F = idC and F ◦ G = idD.
In category theory, it is rare to talk about isomorphisms between categories and more common to talk about equivalences of categories.
Definition 37. A functor is said to preserve a property of a morphism f if F(f) satisfies the property whenever f does.
Definition 38. A functor is said to reflect a property of a morphism if f satisfies the property whenever F (f) does.
Here are a few useful facts about functors:
• Faithful functors reflect monics and epics.
• Fully faithful functors reflect isomorphisms.
• Equivalences of categories preserve monics and epics.
• Every functor preserves isomorphisms.
2.2.2 Natural Transformations
Natural transformations are mappings between functors.
Definition 39. Let F, G : C → D be contravariant functors. A natural transformation η : F ⇒ G associates to each object A in C a morphism ηA : F(A) → G(A), called a component of the natural transformation, such that for any morphism f : A → B in C, the diagram below commutes:
ηA ◦ F(f) = G(f) ◦ ηB

(note that F(f) : F(B) → F(A) and G(f) : G(B) → G(A) since both functors are contravariant). An analogous definition holds for covariant functors, mutatis mutandis. We only consider natural transformations between contravariant functors in this dissertation.
Example 40. Let f : A → B be a morphism in a category C. We can construct a natural transformation between the covariant Hom-functors, φ : Hom(B, −) ⇒ Hom(A, −), whose components φC : Hom(B, C) → Hom(A, C) are defined by g 7→ g ◦ f. The commutativity of the natural transformation square
φD ◦ Hom(B, h) = Hom(A, h) ◦ φC, both composites sending g to h ◦ g ◦ f, follows from the associativity of function composition in Set.
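This naturality can be checked pointwise on sample functions; a Python sketch with hypothetical toy maps:

```python
# Naturality of φ : Hom(B, −) ⇒ Hom(A, −), φ_C(g) = g ∘ f, checked
# pointwise: both paths around the square send g to h ∘ g ∘ f.
compose = lambda p, q: (lambda x: p(q(x)))

A = [0, 1, 2]
f = lambda a: a + 1   # f : A → B
g = lambda b: 2 * b   # g : B → C
h = lambda c: c - 1   # h : C → D

phi = lambda k: compose(k, f)     # pre-compose with f (a component of φ)
hom_h = lambda k: compose(h, k)   # post-compose with h (the action of Hom(−, h))

left = phi(hom_h(g))     # φ_D(Hom(B, h)(g))
right = hom_h(phi(g))    # Hom(A, h)(φ_C(g))
assert all(left(a) == right(a) for a in A)
```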
Another way of stating the definition of equivalence of categories is that two categories C and D are equivalent if there exist functors F : C → D and G : D → C such that F ◦ G is naturally isomorphic to the identity functor on D and G ◦ F is naturally isomorphic to the identity functor on C.
2.2.3 Functor Categories
Definition 41. Given two categories C and D, the functor category DC is defined to be the category whose objects are contravariant functors F : C → D and whose morphisms are given by natural transformations between functors.
Functor categories can also be defined for covariant functors, but we will not discuss any examples of these in this dissertation. In fact, a special type of contravariant functor, called a presheaf, will be used many times in this dissertation. This special type of contravariant functor is the subject of the next section.
2.2.4 The Yoneda Embedding
The Yoneda lemma is a generalization of Cayley's theorem in group theory which allows you to embed any category into a category of functors defined on that category. There are two forms of the Yoneda lemma: the covariant version and the contravariant version. In this dissertation, we only use the contravariant form of the lemma and so only it is covered here. We state the Yoneda lemma without proof; a formal proof can be found in [93]. The contravariant form of the Yoneda lemma concerns the contravariant form of the Hom functor, Hom(−, A), which is often denoted by hA. The contravariant form of the lemma states that for any contravariant functor G : Cop → Set, there
is a natural isomorphism [hA : G] ≅ G(A), where [hA : G] denotes the set of natural transformations between hA and G. When the functor used in the Yoneda lemma is another Hom functor, the contravariant Yoneda lemma states
[hA : hB] ≅ Hom(A, B).
This means h− gives rise to a covariant functor from C to the category of contravariant functors into Set. Thus, the Yoneda lemma tells us that any locally small category can be embedded in the category of contravariant functors into Set via h−. This is called the Yoneda embedding of the category. Another way of expressing this is to say any locally small category can be represented by presheaves in a full and faithful manner, i.e.

[hA : P] ≅ P(A)

for any presheaf P. A contravariant functor into Set is said to be representable if it is naturally isomorphic to hA for some object A. When working out how topos theoretic constructions arise, we will commonly restrict to deducing how these constructions should work on representable functors and use these insights to surmise the general situation. This trick is very common in category theory and will be used when we discuss exponentials and subobject classifiers for presheaf topoi in later sections of this chapter.
2.3 Lattices & Heyting Algebras
2.3.1 Lattices
Here we briefly introduce the basic definitions for lattices. A more detailed reference is [29]. The main type of lattices we will focus on in this dissertation are Heyting algebras. These will be defined in the next section. A lattice consists of a poset in which every two elements have a unique supremum and a unique infimum.
Example 42. The natural numbers can be given a poset structure by divisibility,
i.e. a ≤ b if and only if a divides b. In this case, the supremum is the least common multiple and the infimum is the greatest common divisor.
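A Python check of this lattice structure, verifying the bound properties and the absorption laws of Definition 43 on a few sample divisors of 12 (sample values hypothetical):

```python
# The divisibility lattice: join = lcm, meet = gcd, with a ≤ b iff a | b.
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

divides = lambda a, b: b % a == 0
nums = [1, 2, 3, 4, 6, 12]

for a in nums:
    for b in nums:
        assert divides(a, lcm(a, b)) and divides(b, lcm(a, b))   # upper bound
        assert divides(gcd(a, b), a) and divides(gcd(a, b), b)   # lower bound
        assert lcm(a, gcd(a, b)) == a                            # absorption
        assert gcd(a, lcm(a, b)) == a                            # absorption
```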
Lattices can be given a purely algebraic definition as well. The poset-based definition and the algebraic definition are equivalent. We provide the algebraic axioms below.
Definition 43. A lattice (L, ∨, ∧) is a set L along with two binary operations ∨ and ∧ on L which satisfy the following properties for all a, b, and c in L:
• a ∨ b = b ∨ a and a ∧ b = b ∧ a
• a ∨ (b ∨ c) = (a ∨ b) ∨ c and a ∧ (b ∧ c) = (a ∧ b) ∧ c
• a ∨ (a ∧ b) = a and a ∧ (a ∨ b) = a.
The first rules are the commutative laws, the second are the associative laws, and the last are known as the absorption laws. There are two more laws, consequences of this definition, which are also important for lattices. The following identities hold for every a in L and are known as the idempotent laws:
• a ∨ a = a and a ∧ a = a.
From an algebraic lattice as defined above, we can endow it with a poset structure by defining x ≤ y if and only if x ∧ y = x.
Definition 44. A lattice is said to be bounded if there exist elements ⊤ and ⊥ such that ⊥ is an identity element for the join operation ∨ and ⊤ is an identity element for the meet operation ∧, i.e.

• a ∨ ⊥ = a and a ∧ ⊤ = a.
Definition 45. A lattice is said to be distributive if the following properties hold for all a, b, and c in L :
• a ∨ (b ∧ c) = (a ∨ b) ∧ (a ∨ c)
• a ∧ (b ∨ c) = (a ∧ b) ∨ (a ∧ c).
Example 46. Let X be a set. The collection of all subsets of X, P(X), is a bounded distributive lattice where the meet ∧ is given by set intersection and the join ∨ is given by set union. The bottom element is the empty set while the top element is the set X itself.
Example 47. The integers can be given the structure of a distributive lattice where the meet is given by minima and the join by maxima. Notice that this lattice is not bounded because there is neither a smallest integer nor a largest integer.
2.3.2 Heyting Algebras
A Heyting algebra is a bounded, distributive lattice with a weaker form of complementation called pseudo-complementation, which we will define below. More details on the properties of Heyting algebras can be found in [60,94]. Heyting algebras are important in topos theory because the collection of subobjects of any object in a topos has the structure of a Heyting algebra. In a Heyting algebra, we can define a pseudo-complement of any a in H, denoted ¬a, where ¬a is the largest element such that a ∧ ¬a = ⊥. Another way of defining Heyting algebras is via a binary operation called the implication, →, which satisfies the requirements in the definition below.
Definition 48. Let H be a bounded lattice. We say that H is a Heyting algebra if and only if there exists a binary operation, →, called the implication such that the following identities hold for all a, b, and c in H:
• a → a = ⊤
• a ∧ (a → b) = a ∧ b
• b ∧ (a → b) = b
• a → (b ∧ c) = (a → b) ∧ (a → c)
With this definition, we can provide an alternative definition of the pseudo-complement: ¬a := (a → ⊥).
Example 49. Every Boolean algebra is a Heyting algebra with a → b given by ¬a ∨ b.
Example 50. Let {0, 1/2, 1} be given as a totally ordered set with ≤ defined in the usual way. This can be given the structure of a Heyting algebra which is not a Boolean algebra by defining the meet, join, implication, and pseudo-complementation by the rules depicted below:
Here a ∧ b = min(a, b) and a ∨ b = max(a, b), while the implication and pseudo-complement are given by:

    a → b | b=0   b=1/2  b=1          a   | ¬a
    ------+------------------        -----+----
    a=0   |  1     1      1            0  |  1
    a=1/2 |  0     1      1           1/2 |  0
    a=1   |  0    1/2     1            1  |  0

Note that the above construction is not a Boolean algebra because it does not satisfy double negation, i.e.

    ¬¬(1/2) = ¬0 = 1 ≠ 1/2.
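A Python sketch of this three-element Heyting algebra, verifying the identities of Definition 48 and the failure of double negation:

```python
# The three-element Heyting algebra on {0, 1/2, 1}: meet = min, join = max,
# and a → b = 1 if a ≤ b, else b.
from fractions import Fraction

H = [Fraction(0), Fraction(1, 2), Fraction(1)]
meet, join = min, max
imp = lambda a, b: Fraction(1) if a <= b else b
neg = lambda a: imp(a, Fraction(0))   # pseudo-complement ¬a = a → ⊥

# the identities of Definition 48
for a in H:
    assert imp(a, a) == 1
    for b in H:
        assert meet(a, imp(a, b)) == meet(a, b)
        assert meet(b, imp(a, b)) == b
        for c in H:
            assert imp(a, meet(b, c)) == meet(imp(a, b), imp(a, c))

# double negation fails, so this is not a Boolean algebra
assert neg(neg(Fraction(1, 2))) == 1
assert neg(neg(Fraction(1, 2))) != Fraction(1, 2)
```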
2.4 Monads
In this section, we briefly review monads. Our exposition here will be rather terse and will simply recall various standard definitions and results. A more thorough exposition can be found in chapter 6 of [93] or chapter 14 of [15].
Definition 51. Let C be a category. A monad, also called a triple, on C consists of
an endofunctor T : C → C together with two natural transformations η : 1C ⇒ T and µ : T² ⇒ T referred to as the unit and multiplication, respectively. The triple (T, η, µ) is required to satisfy the coherence conditions µ ◦ Tµ = µ ◦ µT and µ ◦ Tη = µ ◦ ηT = 1T. The first equation is an equality of natural transformations T³ ⇒ T and the latter is an equality of natural transformations T ⇒ T. These can be visualized by the following commutative diagrams:
µ ◦ Tµ = µ ◦ µT : T³ ⇒ T and µ ◦ Tη = µ ◦ ηT = 1T : T ⇒ T.

The first condition is analogous to the associativity condition for monoids if µ is thought of as a categorification of the monoid's binary operation, while the latter condition is analogous to the existence of an identity element for the binary operation on the monoid. This is why η is referred to as the unit and µ is referred to as the multiplication for the monad.
Example 52. We can define a monad P on Set by defining P(X) to be the power set of X. For a morphism f : A → B, we can define P(f) to be the function defined by taking direct images under f. The unit natural transformation is defined on components as the map ηX : X → P(X) given by x 7→ {x}. The multiplication natural transformation µ : T² ⇒ T is defined on components by µX : T²(X) → T(X), taking a set of sets to its union.
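A Python sketch of this power-set monad on finite sets, checking the unit and associativity laws on sample elements (all data hypothetical):

```python
# The power-set monad on finite sets: unit x ↦ {x}, multiplication = union.
def T_map(f):
    return lambda S: frozenset(f(x) for x in S)

unit = lambda x: frozenset({x})
mult = lambda SS: frozenset(x for S in SS for x in S)

S = frozenset({1, 2, 3})
# unit laws: µ ∘ Tη = µ ∘ ηT = 1_T
assert mult(T_map(unit)(S)) == S
assert mult(unit(S)) == S

# associativity µ ∘ Tµ = µ ∘ µT, on a sample element of T^3(X)
SSS = frozenset({frozenset({frozenset({1}), frozenset({2})}),
                 frozenset({frozenset({2, 3})})})
assert mult(T_map(mult)(SSS)) == mult(mult(SSS))
```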
A particular monad, constructed by Michele Giry in [59], will be used frequently in this dissertation. It associates to a measurable space the collection of probability distributions on that measurable space and equips it with a sigma algebra. We will introduce this monad in greater depth in chapter 3. In the computer science literature (e.g. [100], [65]), monads are often defined slightly differently. The multiplication natural transformation is replaced with an operation called bind, denoted >>=. The bind is defined for all pairs of objects X, Y in C as a function

(>>=) : C(X, T(Y)) → C(T(X), T(Y)).

The notation (t >>= f) is used for (>>=)(f)(t). The monad axioms then become:
• (t >>= η) = t,

• (η(x) >>= f) = f(x),

• (t >>= (λx. (f(x) >>= g))) = ((t >>= f) >>= g).
The intuition behind this reformulation is to think of T(X) as an object of computations returning X. In this case, η is the computation that returns immediately. t >>= f sequences computations by first running t and calling f with the result of t, similar to a UNIX pipeline. The equivalence of these two formulations can be found in [96].
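As a sketch of the equivalence, bind can be derived from the power-set monad above as t >>= f := µ(T(f)(t)); the following Python snippet checks the three bind laws on toy data:

```python
# Bind for the power-set monad, derived as t >>= f := µ(T(f)(t)).
unit = lambda x: frozenset({x})
bind = lambda t, f: frozenset(y for x in t for y in f(x))

f = lambda x: frozenset({x, x + 1})
g = lambda x: frozenset({2 * x})
t = frozenset({1, 5})

assert bind(t, unit) == t                                        # right unit
assert bind(unit(3), f) == f(3)                                  # left unit
assert bind(t, lambda x: bind(f(x), g)) == bind(bind(t, f), g)   # associativity
```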
2.5 Cartesian Closed Categories
Definition 53. A category C is said to be Cartesian closed if and only if it has a terminal object, admits products, and admits exponentials.
Terminal objects and products were covered in an earlier section. We briefly recall the definition of exponentials here. More information on Cartesian closed categories can be found in [12,15,60,93].
Definition 54. Let Z and Y be objects of the category C and suppose C admits binary products. An object Z^Y together with a morphism eval : Z^Y × Y → Z is said to be an exponential object if for any object X and morphism g : X × Y → Z, there is a unique morphism λg : X → Z^Y (called the transpose of g) such that

eval ◦ (λg × 1Y) = g.
The assignment of a unique λg to each g establishes an isomorphism Hom(X × Y, Z) ≅ Hom(X, Z^Y). In other words, the functor (−)^Y : C → C defined on objects by C 7→ C^Y and on morphisms by (f : C → D) 7→ f^Y : C^Y → D^Y is right adjoint to the product functor − × Y.
Example 55. The category Set whose objects are sets and whose morphisms are functions is an example of a Cartesian closed category. The exponential object is defined as Y^X = Hom(X, Y), the set of functions from X to Y.
Example 56. A Boolean algebra can be given the structure of a Cartesian closed category. The objects in the category correspond to the elements of its underlying set. Products are given by conjunctions, exponentials are given by implications,
and evaluation corresponds to Modus Ponens, i.e.
(A ⇒ B) ∧ A ≤ B.
Cartesian closed categories are important in computer science. In a Cartesian closed category, a morphism f : X × Y → Z can be represented as a morphism λf : X → Z^Y. Computer scientists refer to this as currying. As such, simply-typed lambda calculus can be interpreted in any Cartesian closed category. The formal relationship between these is given by the Curry-Howard-Lambek correspondence, which establishes an isomorphism between intuitionistic logic, simply-typed lambda calculus, and Cartesian closed categories [28,69,82-84]. In functional programming languages, eval is often written as apply and λg is often written as curry(g).
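A Python sketch of currying in Set (toy function, names hypothetical):

```python
# Currying: the bijection Hom(X × Y, Z) ≅ Hom(X, Z^Y) in Set.
def curry(g):
    return lambda x: lambda y: g(x, y)

def uncurry(h):
    return lambda x, y: h(x)(y)

g = lambda x, y: x + 2 * y      # g : X × Y → Z
h = curry(g)                    # λg : X → Z^Y

assert h(3)(4) == g(3, 4) == 11
assert uncurry(curry(g))(3, 4) == g(3, 4)   # the two maps are mutually inverse
```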
2.6 Topoi
Topoi are a special class of Cartesian closed categories which have been proposed as a general setting for mathematics [85]. Topoi were initially conceived by Grothendieck as a type of category which behaves like the category of sheaves on a topological space [9]. A later definition, due to Lawvere, generalized Grothendieck's topoi in order to make suitable connections with logic [94]. These topoi are referred to as elementary topoi. When we discuss topoi in this dissertation, we will mean the elementary topoi due to Lawvere. The exposition in this section will be terse and effectively serves to collect definitions and standardize notation for future sections. Elementary introductions to the subject can be found in [60,85]. More advanced expositions are given in [94,98]. Standard references for the field are the books [14,77,78]. In this dissertation, we use sheaf topoi to model the semantics of higher order probabilistic programming languages, extending a previous construction by Heunen, Kammar, Staton, and Yang [65] and demonstrating that this construction is a sheaf with respect to the atomic Grothendieck topology on an appropriately constructed sample space category. We also use sheaf topoi as a background for extending statistical constructions to contextual measurement scenarios in which a family of marginal distributions cannot be assumed to originate from a joint distribution on their full column space. A topos is simply a Cartesian closed category with a subobject classifier. A
subobject classifier of a category C is an object Ω in C such that subobjects of any object X are in one-to-one correspondence with the morphisms from X into Ω. Before we can characterize Ω by a universal property, we need to briefly review the definition of subobjects. Subobjects are the topos theoretic analog of subsets. If A ⊂ B, there is an inclusion map ι : A ,→ B which is injective and hence monic. Conversely, any monic morphism in the category of sets determines a subset via its image. Hence, the domain of a monic map is isomorphic to a subset of its codomain. However, since there are many sets with the same cardinality, we have to think of subobjects as equivalence classes of monic morphisms, where f : A → B and g : C → B are equivalent if f factors through g and vice versa.
Definition 57. A subobject in a category C is an equivalence class of monic morphisms under the equivalence relation f ∼ g if f and g have the same codomain and factor through one another. We denote the equivalence class of f by [f].
These equivalence classes can be given a poset structure. Let f : A → B and g : C → B be monic morphisms. We say [f] ≤ [g] if there is a morphism h : A → C such that f = g ◦ h. Note this forces h to be monic. This construction categorically represents subsets, as each subset determines and is determined by a unique subobject. Now that we have discussed subobjects, we can discuss the subobject classifier. Intuitively, subobjects in the category of sets correspond to subsets of a set.
Any subset S ⊂ A has an indicator function χS : A → {0, 1} defined by χS(x) = 1 if x ∈ S and χS(x) = 0 otherwise.
This can be given a purely categorical definition by associating subsets with collections of monic morphisms into the set. Thus a subobject of A is an equivalence class of monic morphisms with codomain A, where two monics m1 and m2 are considered equivalent if and only if they factor through one another. In Set, each subset S ⊂ X determines an equivalence class from its inclusion morphism [ι : S ,→ X], and any monic m : E ↣ X determines a subobject which is equivalent to [ι : m(E) ,→ X]. As such, we will often simply refer to subsets of X as subobjects by an abuse of terminology.
In the category of sets, the truth object, Ω, is simply the two element set 2 := {0, 1}. If we let 1 := {∗} denote a terminal object in Set, then there is a truth morphism true : 1 → {0, 1} defined by true(∗) = 1, which picks out the value 1 as corresponding to true in Boolean logic. Given a subobject [m : S ↣ A], the characteristic function χm of m(S) satisfies the universal property that the diagram below is a pullback square:
χm ◦ m = true ◦ !, where ! : S → 1 is the unique map to the singleton set 1. Note that if n : E ↣ A is another monic arrow, the condition true ◦ ! = χm ◦ n requires that n(E) ⊂ m(S). Thus the universal property above merely distinguishes m(S) as the largest possible element in the poset of subobjects for which the square commutes. This observation motivates the definition of the subobject classifier for an arbitrary category.
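A Python sketch of this correspondence in Set between subsets and characteristic functions (toy data):

```python
# Subobject classifier in Set: a subset S ⊆ A corresponds to its
# characteristic function χ_S : A → {0, 1}, and S is recovered as the
# pullback of true : 1 → 2 along χ_S.
A = {1, 2, 3, 4}
S = {2, 4}
chi = lambda a: 1 if a in S else 0   # χ_S : A → 2

# pulling back along true (taking the fiber over 1) recovers exactly S
recovered = {a for a in A if chi(a) == 1}
assert recovered == S

# the correspondence is one-to-one: distinct subsets give distinct χ's
S2 = {1, 2}
chi2 = lambda a: 1 if a in S2 else 0
assert any(chi(a) != chi2(a) for a in A)
```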
Definition 58. Let C be a category with a terminal object 1. A subobject classifier for C is an object Ω together with a monic morphism ⊤ : 1 → Ω such that given any monic morphism m : C ↣ D, there exists a unique morphism χm : D → Ω which makes the diagram below a pullback square:
χm ◦ m = ⊤ ◦ !.
We can now officially define a topos.
Definition 59. A category T is said to be a topos if and only if it is a Cartesian closed category which also has a subobject classifier.
Example 60. The prototypical example of a topos is Set. However, the categories of presheaves and sheaves are also commonly occurring examples of topoi, which we will discuss in the next section. The only topoi discussed in this dissertation are the types mentioned in this example.
There are many equivalent characterizations of topoi. Note that any topos
must admit all finite limits and colimits. Historically, the major successes of topos theory came from algebraic geometry and logic. In algebraic geometry, topoi were invented by Grothendieck as an attempt to construct a cohomology theory with variable coefficients [9]. Grothendieck's original definition was generalized by Lawvere in his search for an axiomatization of the category of sets [?]. The definition presented above is what Lawvere would have called an elementary topos. Topoi were also essential to Paul Cohen's forcing technique used to construct new models of Zermelo-Fraenkel set theory, for which he was awarded the Fields medal in 1966. In this dissertation, we will mainly discuss the topos of sets and topoi that arise as presheaf or sheaf topoi on either a topological space or a Grothendieck topology on a category.
2.7 Presheaves
Presheaves and sheaves are an important family of contravariant functors which we will use at many points in this dissertation. We recall some basic definitions and properties below. More detailed treatment of these topics, including proofs, can be found in [94].
2.7.1 The Category of Presheaves
Definition 61. A presheaf on a category C is a contravariant functor into the category Set. Presheaves form a category whose morphisms are given by natural transformations. If C is a category, we use the notation Cˆ to denote the category of presheaves.
Cˆ has the structure of a category. The objects in this category are contravariant functors. Morphisms are given by natural transformations η : F ⇒ G; i.e., given a morphism f : C → D in C, the naturality square commutes:

ηC ◦ F(f) = G(f) ◦ ηD,

where F(f) : F(D) → F(C) and G(f) : G(D) → G(C).
The identity natural transformation ι : F ⇒ F is obtained by taking its components to be ιX = idF(X) in Set; the naturality squares then commute trivially.
Composition is given by vertical composition of natural transformations: given η : F ⇒ G and µ : G ⇒ H, the composite µ ◦ η : F ⇒ H has components (µ ◦ η)C = µC ◦ ηC, and the stacked naturality squares commute.
2.7.2 Initial and Terminal Objects
The initial object in Cˆ is the constant functor 0 in Cˆ which maps each object to the empty-set and every morphism to the identity morphism. The terminal object, 1, is the constant functor that maps each object to the one element set 1 and each morphism to the identity map.
2.7.3 Products and Coproducts
Given two presheaves F and G, the product presheaf is defined on an object C in C by (F × G)(C) := F(C) × G(C), where the right-hand side is a product in Set. From each morphism f : B → C, we obtain a function

(F × G)(f) : F(C) × G(C) → F(B) × G(B)
such that the projections commute with the structure maps, i.e. ρ1 ◦ (F × G)(f) = F(f) ◦ ρ1 and ρ2 ◦ (F × G)(f) = G(f) ◦ ρ2.
Given two presheaves F and G, the coproduct presheaf is defined on an object C in C by (F ⊔ G)(C) := F(C) ⊔ G(C), where the right-hand side is a coproduct (disjoint union) in Set. From each morphism f : B → C, we obtain a function

(F ⊔ G)(f) : F(C) ⊔ G(C) → F(B) ⊔ G(B)
such that the inclusions commute with the structure maps, i.e. (F ⊔ G)(f) ◦ ι1 = ι1 ◦ F(f) and (F ⊔ G)(f) ◦ ι2 = ι2 ◦ G(f).
2.7.4 Equalizers and Coequalizers
Given two natural transformations η, µ : F ⇒ G, an equalizer ι : E ⇒ F is a natural transformation such that η ◦ ι = µ ◦ ι, i.e. for each C in C, the components compose in Set as ηC ◦ ιC = µC ◦ ιC. Moreover, given any other natural transformation ω : H ⇒ F satisfying η ◦ ω = µ ◦ ω, there is a unique natural transformation κ : H ⇒ E such that for any object C in C, the
component ωC factors as ωC = ιC ◦ κC in Set.
Given two natural transformations η, µ : F ⇒ G, a coequalizer ϑ : G ⇒ E is a natural transformation such that ϑ ◦ η = ϑ ◦ µ, i.e. for each C in C, the components compose in Set as ϑC ◦ ηC = ϑC ◦ µC. Moreover, given any other natural transformation ω : G ⇒ H satisfying ω ◦ η = ω ◦ µ, there is a unique natural transformation κ : E ⇒ H such that for any object C in C, the component ωC factors as ωC = κC ◦ ϑC.
2.7.5 Pullbacks and Pushouts
Let X, Y, Z be presheaves in Cˆ. Suppose λ : X ⇒ Z and ι : Y ⇒ Z are natural transformations. We say that the natural transformations η : P ⇒ X and κ : P ⇒ Y form a pullback for λ and ι if and only if for every object C in C, the square formed by the components κC : P(C) → Y(C), ηC : P(C) → X(C), ιC : Y(C) → Z(C), and λC : X(C) → Z(C) is a pullback square in Set, i.e. (X ×Z Y)(C) ≅ X(C) ×Z(C) Y(C).
Notice that a morphism f : C → D in C induces a commuting cube relating the pullback square at C to the pullback square at D via the structure maps P(f), X(f), Y(f), and Z(f).
Similarly, if X, Y, Z are presheaves on C with values in Set and λ : Z ⇒ X and ι : Z ⇒ Y are natural transformations, we say that the natural transformations η : X ⇒ Q and κ : Y ⇒ Q form a pushout if and only if for every object C in C, the square formed by the components ιC : Z(C) → Y(C), λC : Z(C) → X(C), κC : Y(C) → Q(C), and ηC : X(C) → Q(C) is a pushout in Set, i.e. (X ⊔Z Y)(C) ≅ X(C) ⊔Z(C) Y(C). Again, pushouts always exist because they exist in Set.
2.7.6 Exponentials
Let F and G be presheaves on C. If an exponential $G^F$ exists, we have the natural bijection
\[
\mathrm{Hom}_{\hat{C}}(E \times F, G) \cong \mathrm{Hom}_{\hat{C}}(E, G^F)
\]
for every presheaf E in Cˆ. In particular, we may take E to be a representable functor, i.e. E = HomC(−, C) = hC. By the Yoneda lemma, this would mean that
\[
G^F(C) \cong \mathrm{Hom}_{\hat{C}}(h_C, G^F) \cong \mathrm{Hom}_{\hat{C}}(h_C \times F, G).
\]
Rather than assuming that the desired bijection exists, we can use this observation to define
\[
G^F(C) := \mathrm{Hom}_{\hat{C}}(h_C \times F, G),
\]
i.e. $G^F(C)$ is the set of all natural transformations from HomC(−, C) × F into G. This assignment is contravariant in C and hence defines a presheaf. The evaluation mapping, $\mathrm{eval} : G^F \times F \to G$, is a natural transformation defined on components by
\[
\mathrm{eval}_C(\eta, y) = \eta_C(1_C, y) \in G(C)
\]
where C is an object in C, η : HomC(−, C) × F → G, and y ∈ F(C). Verification that these satisfy the universal properties can be found in [94].
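At the level of Set, the exponential adjunction is just currying and `eval` is function application. A small sketch (illustrative names only) of the bijection between maps Z × X → Y and maps Z → Y^X:

```python
def curry(g):
    """Transpose g : Z × X -> Y to ĝ : Z -> (X -> Y)."""
    return lambda z: (lambda x: g(z, x))

def ev(f, x):
    """The evaluation map Y^X × X -> Y at the level of sets."""
    return f(x)

# the exponential triangle commutes: ev ∘ (ĝ × id_X) = g
g = lambda z, x: z * 10 + x
g_hat = curry(g)
assert ev(g_hat(4), 2) == g(4, 2)
```

The point of the presheaf construction above is precisely that this familiar currying survives, componentwise, in any presheaf category even though it fails in Meas.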
2.7.7 The Subobject Classifier
2.7.7.1 Subobjects
Definition. Let F, G : C^op → Set be functors. We say F is a subfunctor of G if F(C) ⊂ G(C) for all objects C in C and F(f) is the restriction of G(f) for all morphisms f in C. There is then a natural transformation ι whose components are given by inclusion mappings, and ι is monic in Cˆ. Thus each such F determines a subobject. Conversely, all subobjects are given by subfunctors: if η : F ⇒ G is monic in the functor category, then each component ηC : F(C) → G(C) is injective.
As subobjects are equivalence classes of monics, ηC is equivalent to ιC where ιC is given by the inclusion of the image of each component.
2.7.7.2 The Subobject Classifier
Definition 62. (Sieves) If C is an object in a category C, a sieve is a subfunctor of Hom (−,C). Sieves can be thought of as the categorical analog of lower sets on a poset. The subfunctor criteria means that if f : B → C belongs to the sieve and g : A → B is any morphism, then f ◦ g also belongs to the sieve.
If Cˆ admits a subobject classifier, Ω, it must, in particular, classify the subobjects
of the representable presheaf hC := HomC (−,C). Thus,
\[
\mathrm{Sub}_{\hat{C}}(\mathrm{Hom}_C(-,C)) \cong \mathrm{Hom}_{\hat{C}}(\mathrm{Hom}_C(-,C), \Omega) = [\mathrm{Hom}_C(-,C) : \Omega].
\]
By the Yoneda lemma, we would have a natural isomorphism $[\mathrm{Hom}_C(-,C) : \Omega] \cong \Omega(C)$. Thus, the subobject classifier, if it exists, must be given by
Ω(C) = SubCˆ (HomC (−,C)) .
In other words, the subobject classifier is the collection of sieves on C.
Example 64. If we regard a poset P as a category, the sieves on an object p in P are just the subsets S of elements where s ≤ p for each s ∈ S, and s ∈ S together with s′ ≤ s implies s′ ∈ S.
Definition 65. (Pullback of Sieves) Any morphism f : C → D induces a map f∗ : Ω(D) → Ω(C) sending a sieve S on D to f∗(S) = {g | cod(g) = C and f ◦ g ∈ S}. The map f∗ is known as the pullback.
The restriction mappings in the presheaf are given by sieve pullbacks, i.e. a morphism f : C → D induces a morphism Ω(f) : Ω(D) → Ω(C) defined by taking a sieve on D to its pullback on C. The largest possible sieve on an object is known as its principal sieve.
Definition 66. For any object C ∈ C, we can define the principal sieve, ↓ C, to be the largest possible sieve on C, i.e. ↓ C := {f | cod (f) = C}. Notice this is equivalent to saying ↓ C is the sieve containing the identity map id : C → C.
Note that, in the poset case, using principal sieves we can rewrite the pullback sieve more simply as f∗(S) = S ∩ ↓C.
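In the poset case of Example 64, sieves, principal sieves, and pullbacks can be computed directly. A small sketch (hypothetical helper names; the divisibility order on {1, …, 12} is used purely as an example):

```python
def down_set(leq, elements, p):
    """Principal sieve ↓p: everything below p."""
    return {s for s in elements if leq(s, p)}

def is_sieve(leq, elements, p, S):
    """A sieve on p is a subset of ↓p closed under going further down."""
    return (all(leq(s, p) for s in S)
            and all(t in S for s in S for t in elements if leq(t, s)))

def pullback_sieve(leq, elements, q, S):
    """Pullback of a sieve S on p along q ≤ p is S ∩ ↓q."""
    return S & down_set(leq, elements, q)

# divisibility poset on {1,...,12}: s ≤ p iff s divides p
elements = set(range(1, 13))
leq = lambda s, p: p % s == 0
S = {1, 2, 3, 6}                  # a sieve on 12: the divisors of 6
assert is_sieve(leq, elements, 12, S)
print(pullback_sieve(leq, elements, 4, S))   # {1, 2}
```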
The collection of sieves on an object in a category has the structure of a Heyting algebra, which was defined in an earlier section. In order to finish defining the subobject classifier for presheaf categories, we need to also define the truth morphism. The truth morphism ⊤ : 1 ⇒ Ω is the natural transformation defined on each component by picking out the maximal sieve in Ω(C).
2.7.8 Local and Global Sections
Definition 67. A global section of a presheaf F is a natural transformation from the terminal presheaf into F.
A particular class of global sections is important in any topos. These are the global sections of the subobject classifier, which are known as the truth values in the topos.
Definition 68. A local section of a presheaf F is a natural transformation from a subobject of the terminal object into F.
2.8 Sheaves
Sheaves are a mathematical tool for handling local-to-global phenomena on a topological space. The standard example of a sheaf is the sheaf of continuous real-valued functions on a topological space X.
Definition 69. A presheaf on a topological space (X, τ) is a function that assigns to each open set U a set F(U), called the set of sections on U, in such a way that if U ⊂ V there is a restriction mapping $\mathrm{res}^V_U : F(V) \to F(U)$. The restriction mappings are required to satisfy the criterion that if U ⊂ V ⊂ W then $\mathrm{res}^V_U \circ \mathrm{res}^W_V = \mathrm{res}^W_U$.
We defined presheaves on an arbitrary category in the previous section. The definition above is a special instance of the previous definition: we construct a category from the topological space whose objects are the open sets, with a morphism U → V if and only if U ⊂ V. Note this is the same definition as the category associated to a topological space, Xτ. With this definition, a presheaf is simply a contravariant functor from this category into Set. In terms of our example of continuous functions on a topological space, the sections over each open set U are given by the set of continuous functions f : U → R. The restriction maps are obtained by restricting the definition of the function to a smaller domain. The requirement $\mathrm{res}^V_U \circ \mathrm{res}^W_V = \mathrm{res}^W_U$ tells us that restricting the domain of our function from W to V and then restricting from V to U is the same thing as simply restricting the domain from W to U. Sheaves are presheaves which satisfy additional gluability criteria.
Definition 70. A presheaf F is called a sheaf if it satisfies two additional criteria:
• (Locality) If {Ui}i∈I is an open covering of a set U, and f, g ∈ F(U) are such that $\mathrm{res}^U_{U_i}(f) = \mathrm{res}^U_{U_i}(g)$ for each i in the open cover, then f = g.
• (Gluing) If {Ui}i∈I is an open covering of a set U, and if for each i we have a section fi ∈ F(Ui) in such a way that for every pair Ui and Uj of the covering $\mathrm{res}^{U_i}_{U_i \cap U_j}(f_i) = \mathrm{res}^{U_j}_{U_i \cap U_j}(f_j)$, then there is a section f ∈ F(U) such that $\mathrm{res}^U_{U_i}(f) = f_i$ for every i ∈ I.
These two axioms can be characterized category-theoretically by stating that the following diagram is an equalizer:
\[
F(U) \to \prod_i F(U_i) \rightrightarrows \prod_{i,j} F(U_i \cap U_j)
\]
where the first map is the product of the maps $\mathrm{res}^U_{U_i}$ and the pair of parallel morphisms are the products of the two different restrictions $\mathrm{res}^{U_i}_{U_i \cap U_j}$ and $\mathrm{res}^{U_j}_{U_i \cap U_j}$. A sheaf can be defined by only specifying its values on the open sets of a basis and verifying the sheaf axioms relative to the basis [94]. We will use this fact frequently in chapter 5 of this dissertation when discussing sheaves on the simplicial complex associated to a database. When defining presheaves, we observed that we could construct a category from the collection of open sets on the underlying topological space and see the presheaf as a contravariant functor into the category Set. As such, we can generalize the definition of presheaf to hold in any category.
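For the sheaf of arbitrary functions on a finite space, the locality and gluing axioms can be checked mechanically. A minimal sketch (hypothetical `glue` helper; sections are represented as dicts, and restriction is domain restriction):

```python
def glue(cover, sections):
    """Glue local sections (one dict per open set U_i) into a single section
    on U = ∪ U_i.  Returns None when the sections disagree on an overlap,
    i.e. when the compatibility hypothesis of the gluing axiom fails."""
    glued = {}
    for Ui, fi in zip(cover, sections):
        for x in Ui:
            if x in glued and glued[x] != fi[x]:
                return None  # incompatible on U_i ∩ U_j: nothing to glue
            glued[x] = fi[x]
    return glued

U1, U2 = {1, 2, 3}, {3, 4}
f1 = {1: 'a', 2: 'b', 3: 'c'}
f2 = {3: 'c', 4: 'd'}
print(glue([U1, U2], [f1, f2]))                    # a section on {1, 2, 3, 4}
assert glue([U1, U2], [f1, {3: 'x', 4: 'd'}]) is None
```

Locality holds automatically here: a function on U is determined by its values on any cover of U, which is exactly the equalizer condition above.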
Definition 71. Let C be a category. A presheaf F on C is a contravariant functor from C into Set. The collection of presheaves itself forms a category which we will denote by Cˆ .
The definition of a sheaf required a notion of gluing. As such, to extend the definition of sheaf to an arbitrary category, we need some category-theoretic notion of an open cover. This can be accomplished through the use of sieves, which we defined earlier when discussing presheaf categories.
Definition 72. A Grothendieck topology, J, on a category C is a function which associates to each object C in C a collection of sieves J (C) known as the collection of covering sieves of C. This assignment is required to satisfy the following criteria:
• The maximal sieve tC = {f | cod (f) = C} is in J (C).
• (stability axiom) If S ∈ J (C), then the pullback h∗ (S) ∈ J (D) for any morphism h : D → C.
• (transitivity axiom) If S ∈ J (C) and R is any sieve on C such that h∗ (R) ∈ J (D) for all h : D → C in S, then R ∈ J (C).
Definition 73. A category C equipped with a Grothendieck topology J is referred to as a site.
Definition 74. A sieve S on an object C is said to be a covering sieve if S ∈ J (C) .
The covering sieves in a Grothendieck topology are the analogs of open covers of a set. As such, we can develop sheaves on Grothendieck topologies by using topological notions such as matching families. The interested reader can follow the presentation in [94]. We give a precise definition here for completeness.
Definition 75. A presheaf F on a site (C, J) is a sheaf if and only if for every covering sieve S in J(C), the inclusion S ↪ hC induces an isomorphism Hom(S, F) ≅ Hom(hC, F).
In this dissertation we focus only on the atomic Grothendieck topology, where the covering sieves are taken to be all inhabited sieves. This topic will be covered in more depth in chapter 3. In the remaining chapters, all other sheaves discussed are constructed on ordinary topological spaces. The atomic topology plays an important role in topos theory because its sheaves have a particularly simple characterization. Informally, the lemma below states that every morphism is a cover in the atomic topology.
Lemma. (Mac Lane & Moerdijk [94]) A presheaf F is a sheaf for the atomic topology on a category C if and only if for any morphism f : D → C and any y ∈ F(D), the following holds: if F(g)(y) = F(h)(y) for all pairs of morphisms $g, h : E \rightrightarrows D$ with f ◦ g = f ◦ h, then y = F(f)(x) for a unique x ∈ F(C).
In chapter 3 we will interpret this lemma probabilistically when constructing a sheaf of random variables.
Chapter 3 | A Sheaf Theoretic Perspective on Higher Order Probabilistic Programming
Probabilistic programming languages are programming languages designed to describe probabilistic models and to perform inference with these models. In this chapter we explore the relationship between sheaf theory and probabilistic programming with higher order functions. The main result in this chapter is the extension of a construction due to Heunen, Kammar, Staton, and Yang to a presheaf construction on a category of sample spaces (Definition 88), together with a proof that this extension is in fact a sheaf with respect to the atomic Grothendieck topology (Lemma 91). We also realize expectation as a sheaf morphism (Section 3.6.3) and discuss some structural properties of these new objects as they relate to foundational concepts in probability theory (Section 3.6.1, Section 3.6.2). Along the way, we characterize several subclasses of monic arrows in the category of measurable spaces (Proposition 77) and show that Meas does not admit a subobject classifier (Lemma 76). This provides an alternative proof that Meas is not a topos; the fact itself was already well known by a theorem of Aumann [10], who showed that the category of standard Borel spaces is not Cartesian closed. We also prove a simple lemma about lifting probability measures along surjective maps (Section 7.3.2).

Composability is the heart of category theory, and structural programming,
with its emphasis on the composability of blocks of code, is essential to building large scale programs. In this sense, it would seem that there should be some connection between category theory and software engineering. Moreover, functional programming goes beyond composability of functions and data types and makes concurrency composable. Eugenio Moggi discovered that computational effects could be modeled using monads from category theory [100]. This observation made functional languages like Haskell more usable and gave computer scientists a more stereoscopic view of traditional programming. It is our belief that category theory is a natural setting for programming semantics, and the purpose of this chapter is to develop a framework for thinking about higher order probabilistic programming based on sheaves of families of random variables of some fixed type.

Sheaf theory has been previously applied to probability theory by Jackson, who showed that the Radon-Nikodym theorem could be interpreted as a sheaf morphism between measurable locales [71]. Gauthier gives a purely algebraic characterization of stochastic calculus via sheaves on a symmetric monoidal infinity category and establishes connections between stochastic differential equations and deformation theory [54]. A similar construction to the one developed in this chapter appears in a conference paper by Simpson [123]; however, Simpson uses a slightly different definition of his underlying sample space category and prefers to work with equivalence classes of random variables. Similar tools to those discussed in this chapter are used in later chapters to reason about statistical inference for non-flat distributed databases in a way that recasts statistical theory as a contextual theory. In this chapter we will first study the category of measurable spaces and see that Meas is an inadequate category for modeling probabilistic programming.
We also discuss a recent construction due to Heunen, Kammar, Staton, and Yang which replaces Meas with a Cartesian closed category which they call the category of quasi-Borel spaces [65]. We use this framework to extend their construction to a presheaf which allows us to work with the full structure of a topos.
3.1 The Categorical Structure of Measurable Spaces
Lawvere and Giry have previously analyzed probability from the perspective of category theory by considering probabilistic concepts as constructions inside the
category of measurable spaces or the subcategory corresponding to the category of standard Borel spaces. In this section, we explore the categorical structure of the category of measurable spaces, noting that it fails to be a Cartesian closed category due to a result by Aumann [10]. For the purposes of discussing probabilistic programming, it is essential that we develop a framework compatible with extensions of sample spaces. This is suggestive of Lawvere’s observation that topos theory is a natural framework for thinking about variable sets due to its connection with sheaf theory. To start with, we can consider a category whose objects are measurable spaces (Ω, F) where Ω is a set and F is a sigma algebra on Ω. Morphisms in this category will be given by measurable maps between measurable spaces. In the sequel, we will denote this category by Meas. For shorthand, we will refer to objects in Meas by their underlying set, with the sigma algebra being suppressed for notational convenience. As Meas is a concrete category, it shares many structures with the category of sets; however, the restriction of equipping these sets with a sigma algebra leads to several key differences. Although the category of sets is the most elementary example of a topos, i.e. a Cartesian closed category with a subobject classifier, we will see that Meas is not a topos as it admits neither a subobject classifier nor exponentials.
3.1.1 Non-Existence of Exponentials
Given two measurable spaces X and Y , we would like for the collection of measurable maps between them, Mor (X,Y ), to be an object in Meas. Clearly the collection of all measurable maps between the two measurable spaces is a set. In order to verify the universal property for Mor (X,Y ), we can introduce the canonical evaluation map on the product Mor (X,Y ) × X as follows:
\[
\mathrm{ev} : \mathrm{Mor}(X,Y) \times X \to Y, \qquad (f, x) \mapsto f(x)
\]
and endow Mor(X, Y) with the smallest sigma algebra such that the evaluation map is measurable. This forces each set of the form $\mathrm{ev}^{-1}(B) = \{(f, x) : f(x) \in B\}$, for B measurable in Y, to be measurable in Mor(X, Y) × X. Notice that the sections of such a set will be measurable sets in Mor(X, Y) and X respectively. Fixing some x ∈ X, this forces sets of the form {f ∈ Mor(X, Y) : f(x) ∈ B} to be measurable in Mor(X, Y), which is equivalent to forcing the canonical evaluation maps $\mathrm{ev}_x : \mathrm{Mor}(X,Y) \to Y$ defined by f ↦ f(x) to be measurable. Notice this construction appears to be universal, i.e. given any g : Z × X → Y, there exists a unique ĝ : Z → Mor(X, Y) which makes the following diagram commute:
\[
\begin{array}{ccc}
\mathrm{Mor}(X,Y) \times X & \xrightarrow{\ \mathrm{ev}\ } & Y\\
{\scriptstyle \hat{g} \times \mathrm{id}_X}\uparrow & {\nearrow}_{\,g} & \\
Z \times X & &
\end{array}
\]
where
\[
\hat{g} : Z \to \mathrm{Mor}(X,Y), \qquad z \mapsto g(z, \cdot)
\]
with
\[
g(z, \cdot) : X \to Y, \qquad x \mapsto g(z, x).
\]
At first glance, it might seem like Meas admits exponentials. However, the sigma algebra induced by ev must be the product sigma algebra of a sigma algebra on $Y^X$ and a sigma algebra on X, since products in Meas are defined as being equipped with product sigma algebras. Unfortunately, even with X = Y = R equipped with the standard Borel sigma algebra, this is not true due to the following result.
Theorem. (Aumann, 1961) For any sigma algebra Σ on $\mathbb{R}^{\mathbb{R}}$, the evaluation mapping is never measurable with respect to the product sigma algebra $\Sigma \times \mathcal{B}_{\mathbb{R}}$ on $\mathbb{R}^{\mathbb{R}} \times \mathbb{R}$ [10].
This no-go result indicates that neither Meas nor its subcategory of standard Borel spaces can be a Cartesian closed category. The infinite nature of R is essential to Aumann’s argument. In particular, restricting to only finite sets avoids all of the problems we have discussed in this subsection.
3.1.2 Lack of Subobject Classifier
In this section we prove that Meas does not admit a subobject classifier. We also show that for measurable spaces, regular monics, strong monics, and extremal monics all coincide with measurable embeddings. The fact that restricting to these subclasses fails to resolve the problems with the subobject classifier construction is interesting in that it demonstrates how the category of measurable spaces behaves quite differently from the category of topological spaces in spite of the similarity in their definitions. The category of topological spaces does admit a subobject classifier for strong monics [145].
Lemma 76. Meas does not admit a subobject classifier.
Proof. In order for Meas to have a subobject classifier, we would need an object Ω along with a morphism ⊤ : 1 → Ω such that for every monic arrow m : E ↪ X, there exists a measurable function χm : X → Ω which makes the diagram below a pullback:
\[
\begin{array}{ccc}
E & \xrightarrow{\ m\ } & X\\
{\scriptstyle !}\downarrow & & \downarrow{\scriptstyle \chi_m}\\
1 & \xrightarrow{\ \top\ } & \Omega
\end{array}
\]
In the category of measurable spaces, let Ω = {0, 1} denote the measurable space defined on any set with two elements equipped with the sigma algebra of all subsets, and let 1 = {∗} be a terminal object. The map ⊤ : 1 → Ω is given by ⊤(∗) = 1 and the map ! : E → 1 is defined by !(e) = ∗. χm is then just the characteristic function
\[
\chi_m(x) = \begin{cases} 1 & x \in \mathrm{im}(m) \\ 0 & \text{otherwise.} \end{cases}
\]
This construction is perfectly fine in the category of sets; however, for measurable spaces we need to ensure the mapping χm is measurable. In general, the image of a measurable set is not measurable, and so we cannot guarantee that χm is even a measurable mapping unless im(m) = m(E) is a measurable set in X. As a simple example to illustrate this point, consider a map from any non-trivial measurable space into the same space equipped with the trivial sigma algebra $\mathcal{B}_\emptyset = \{\emptyset, X\}$. This gives an injective map of sets which is measurable and as such
is a monomorphism in Meas. Note that this construction shows that Meas is unbalanced, as it provides an example of a morphism which is monic and epic but not an isomorphism. At this point, the only alternative is to try equipping Ω with the trivial sigma algebra $\mathcal{B} = \{\emptyset, \Omega\}$. Note that χm is always measurable with respect to the trivial sigma algebra since $\chi_m^{-1}(\emptyset) = \emptyset$ and $\chi_m^{-1}(\Omega) = X$, and so this construction resolves the first obstruction we found to constructing the pullback diagram for subobject classifiers in Meas. In order for the diagram to be a pullback, we would need that for any other monic arrow n : F ↪ X such that ⊤(!(f)) = χm(n(f)) for all f ∈ F, there exists a unique g : F → E making the following diagram commute
\[
\begin{array}{ccc}
F & & \\
{\scriptstyle g}\downarrow & {\searrow}^{\,n} & \\
E & \xrightarrow{\ m\ } & X\\
{\scriptstyle !}\downarrow & & \downarrow{\scriptstyle \chi_m}\\
1 & \xrightarrow{\ \top\ } & \Omega\,.
\end{array}
\]
The commutativity of the diagram forces us to define g(f) := m⁻¹(n(f)). In order for this construction to be measurable, we would need n(BF) to be measurable in X whenever BF is measurable in F. To see that this cannot be the case, let X = R with its Borel sigma algebra and take n to be the inclusion mapping corresponding to an analytic set which is not Borel. As an explicit example of such a set, consider the set of all irrational numbers whose continued fraction expansion can be written as
\[
x = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \ddots}}
\]
where there exists an infinite sequence $0 < i_1 < i_2 < \cdots$ such that each $a_{i_k}$ divides $a_{i_{k+1}}$. A result due to Lusin shows this is a set which is Lebesgue measurable, but not Borel measurable [80]. Thus, Meas does not admit a subobject classifier.
In the topos theory literature, it is common to explore variations on the definition of a subobject classifier by restricting to certain classes of monics. For instance, the category of topological spaces admits a subobject classifier for strong
monics [145]. A strong monic is simply a monic m : E ↪ X such that for every epic e : Y ↠ Z and morphisms f : Z → X, g : Y → E such that m ◦ g = f ◦ e, there exists a morphism d : Z → E such that the following diagram commutes:
\[
\begin{array}{ccc}
Y & \xrightarrow{\ e\ } & Z\\
{\scriptstyle g}\downarrow & {\swarrow}_{\,d} & \downarrow{\scriptstyle f}\\
E & \xrightarrow{\ m\ } & X\,.
\end{array}
\]
We will now show that the strong monics in Meas behave like subspace embeddings which in turn satisfy the universal property for the pullback diagram above. In doing so, we show several subclasses of monic arrows all coincide in the category of measurable spaces. As the construction due to Lusin is indeed an embedding, this shows that Meas does not admit a subobject classifier for strong monics and thus provides a categorical construction distinguishing it structurally from the category of topological spaces.
Proposition 77. In Meas, regular monomorphisms, strong monomorphisms, and extremal monomorphisms all coincide with measurable embeddings.
Proof. Another class of monic arrows of interest in category theory are the regular monics. A monic m : E ↪ X is regular if it is the equalizer of some pair of arrows. First observe that every regular monic is strong. Suppose e : Y ↠ Z is epic and α : Y → E, β : Z → X are morphisms such that β ◦ e = m ◦ α. Since m is regular, it is the equalizer of some pair of parallel arrows k1, k2 : X → C. This situation can be visualized via the diagram below:
\[
\begin{array}{ccccc}
Y & \xrightarrow{\ e\ } & Z & & \\
{\scriptstyle \alpha}\downarrow & {\swarrow}_{\,d} & \downarrow{\scriptstyle \beta} & & \\
E & \xrightarrow{\ m\ } & X & \overset{k_1}{\underset{k_2}{\rightrightarrows}} & C\,.
\end{array}
\]
The dotted morphism d is guaranteed to exist by the universal property of equalizers, and m ◦ d = β. By assumption β ◦ e = m ◦ α, and so m ◦ (d ◦ e) = m ◦ α, which implies d ◦ e = α since m is monic. This establishes the commutativity of the diagram above. Another class of monic arrows of interest are extremal monics. We say that a monic m : E ↪ X is extremal if whenever m factors through an epic morphism, i.e.
m = g ◦ e, e is actually an isomorphism. Every strong monic is easily seen to be extremal. If m : E ↪ X is a strong monic, and e : E ↠ Y, g : Y → X are such that m = g ◦ e, the fact that m is strong means there is a morphism d : Y → E such that the diagram below commutes:
\[
\begin{array}{ccc}
E & \xrightarrow{\ e\ } & Y\\
{\scriptstyle \mathrm{id}_E}\downarrow & {\swarrow}_{\,d} & \downarrow{\scriptstyle g}\\
E & \xrightarrow{\ m\ } & X\,.
\end{array}
\]
Now, e = e ◦ idE = e ◦ (d ◦ e) = (e ◦ d) ◦ e and e = idY ◦ e. Thus, (e ◦ d) ◦ e = idY ◦ e, and since e is epic, (e ◦ d) = idY, i.e. e is an isomorphism. Given an injective measurable map m : E ↪ X, we can endow m(E) with the subspace sigma algebra $\mathcal{F}_m = \{m(E) \cap B : B \in \mathcal{F}_X\}$. We call a monic map m : E ↪ X a measurable embedding if E ≅ m(E) where the latter is equipped with the sigma algebra $\mathcal{F}_m$. We now notice that every extremal monomorphism must be a measurable embedding. Let m : E ↪ X be an extremal monic. Then any factorization of m = g ◦ e with e an epimorphism implies that e is an isomorphism. Any monic arrow m can be decomposed as:
\[
\begin{array}{ccc}
& m(E) & \\
{}^{e}\nearrow & & {\searrow}^{\,i}\\
E & \xrightarrow{\quad m\quad} & X
\end{array}
\]
where e is the epimorphism onto the image subspace. Since m is assumed to be extremal, we must have that e is actually an isomorphism, i.e. m is a measurable embedding. To finish this proof, we need to argue that every measurable embedding is a regular monomorphism. Given a measurable embedding m : E ↪ X, we can form the parallel arrows $1, \chi_m : X \rightrightarrows 2$, where 1(x) = 1 for each x ∈ X and χm is defined as above. The equalizer of these parallel arrows is
\[
E_{1,\chi_m} = \{x \in X \mid \chi_m(x) = 1(x)\} = m(E) \cong E.
\]
Thus, every measurable embedding is the equalizer of these two arrows and hence regular.
The above proposition shows that several subclasses of monics all coincide with measurable embeddings. An analogous characterization of these monics is well known for topological spaces [145].
3.2 The Giry Monad
In this section we briefly discuss the Giry monad, highlighting some of the properties we will use later in this dissertation. A more detailed exposition, along with proofs of the coherence conditions, can be found in [59]. The Giry monad is a structure on Meas, the category of measurable spaces. A monad on a category C consists of a triple (G, η, µ) where G is an endofunctor from C into itself, η is a natural transformation η : IC ⇒ G, and µ is a natural transformation µ : G² ⇒ G satisfying the coherence conditions in Definition 51 in chapter 2.
3.2.1 The Endofunctor G
We can construct a functor G : Meas → Meas defined as follows:
• On objects: G(X) is defined to be the collection of probability measures on X, equipped with the smallest sigma algebra such that each evaluation map $\mathrm{ev}_{\chi_B} : G(X) \to I$ defined by $\mathrm{ev}_{\chi_B}(p) = \int_X \chi_B \, dp = p(B)$ is measurable, where $\chi_B : X \to 2$ ranges through all subobjects of X (i.e. characteristic functions of measurable sets) and I is the interval [0, 1] equipped with the Borel sigma algebra.
• On morphisms: given a measurable map f : X → Y, $G(f) : G(X) \to G(Y)$ is defined as follows: given a probability distribution $p_X$ on X, $G(f) : p_X \mapsto p_X(f^{-1}(\cdot))$, which defines a probability measure on Y.
Remark 78. Note that for any X, G (X) also has the structure of a convex space, i.e. if p, q ∈ G (X), then αp + (1 − α) q ∈ G (X) for any α ∈ [0, 1].
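On a finite measurable space, where every probability measure is a finitely supported probability mass function, the action of G on morphisms (the pushforward) and the convex structure noted in Remark 78 are easy to compute directly. A sketch with hypothetical helper names:

```python
def pushforward(p, f):
    """G(f): push a finitely supported distribution p (dict: outcome -> prob)
    forward along f, i.e. q(y) = p(f^{-1}({y}))."""
    q = {}
    for x, px in p.items():
        q[f(x)] = q.get(f(x), 0.0) + px
    return q

def convex_combination(p, q, alpha):
    """Remark 78: G(X) is convex — alpha*p + (1-alpha)*q is again in G(X)."""
    support = set(p) | set(q)
    return {x: alpha * p.get(x, 0.0) + (1 - alpha) * q.get(x, 0.0)
            for x in support}

p = {1: 0.2, 2: 0.3, 3: 0.5}
print(pushforward(p, lambda x: x % 2))  # parity: mass 0.7 on 1, mass 0.3 on 0
```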
3.2.2 The Natural Transformation η
η is supposed to be a natural transformation from the identity functor, I ⇒ G, i.e. for each object X in Meas, we need to construct a morphism I(X) → G(X). To do
this we can map each x ∈ X to the Dirac measure
\[
\delta_x(B) = \begin{cases} 1 & x \in B \\ 0 & \text{otherwise.} \end{cases}
\]
We can verify this is indeed a natural transformation by checking the commutativity of the following diagram:
\[
\begin{array}{ccc}
I(X) & \xrightarrow{\ \eta_X\ } & G(X)\\
{\scriptstyle f}\downarrow & & \downarrow{\scriptstyle G(f)}\\
I(Y) & \xrightarrow{\ \eta_Y\ } & G(Y)
\end{array}
\]
The commutativity of the above diagram asserts $\eta_Y(f(x)) = G(f) \circ \eta_X(x)$, i.e. $\delta_{f(x)}(B) = \delta_x(f^{-1}(B))$.
3.2.3 The Natural Transformation µ
The endofunctor G associates to a measurable space X the collection of all probability measures on X, viewed as a measurable space itself. This means we can apply G to G(X) and obtain a measurable space G²(X). The natural transformation µ will be a natural transformation µ : G² ⇒ G. Let X be an object in Meas. µX needs to take p′ ∈ G²(X) and associate to it a probability measure p ∈ G(X). The evaluation maps $\mathrm{ev}_B : G(X) \to [0,1] \subset \mathbb{R}$ are real-valued measurable functions, and therefore we can integrate against them. Thus, we can define $\mu_X(p')(B) := \int_{G(X)} \mathrm{ev}_B(p)\, dp'(p)$. This definition is sigma-additive by the monotone convergence theorem and $\mu_X(p')(X) = 1$, so this is indeed a probability measure on X.
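For finitely supported distributions, η and µ admit direct implementations: η_X produces a Dirac mass, and µ_X averages a distribution over distributions into a mixture. A sketch (hypothetical names; since dicts are not hashable, an element of G²(X) is represented as a list of (distribution, weight) pairs):

```python
def eta(x):
    """η_X: the Dirac measure δ_x as a finitely supported distribution."""
    return {x: 1.0}

def mu(pp):
    """µ_X: flatten p' ∈ G²(X), given as a list of (distribution, weight)
    pairs, into the mixture Σ_q p'(q)·q ∈ G(X)."""
    out = {}
    for q, w in pp:
        for x, qx in q.items():
            out[x] = out.get(x, 0.0) + w * qx
    return out

# a 40/60 mixture of a fair coin and a two-headed coin
pp = [({'H': 0.5, 'T': 0.5}, 0.4), ({'H': 1.0}, 0.6)]
print(mu(pp))  # mass 0.8 on 'H', mass 0.2 on 'T'
```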
3.2.4 The Kleisli Category of the Giry Monad
Giry showed that the triple (G, η, µ) forms a monad on the category of measurable spaces by verifying the coherence conditions. Every monad gives rise to a corresponding Kleisli category. The Kleisli category of a monad (G, η, µ) has the same objects as its underlying category, in this case Meas. A Kleisli morphism between
X and Y is a map $f_K : X \to G(Y)$. Given two Kleisli morphisms $f_K : X \to G(Y)$ and $g_K : Y \to G(Z)$, the composition is defined to be $g_K \circ_K f_K := \mu_Z \circ G(g_K) \circ f_K$. Kleisli arrows can be seen as statistical models when the domain is interpreted as a parameter space. A Kleisli arrow also gives rise to a Markov kernel in the following manner: given a Kleisli arrow $f_K : X \to G(Y)$, we can construct a
Markov kernel through the evaluation mapping $\mathrm{ev}_{f_K} : X \times \mathcal{F}_Y \to [0,1]$ defined by $\mathrm{ev}_{f_K}(x, B) = f_K(x)(B)$.
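For finitely supported distributions, Kleisli composition is exactly the Chapman–Kolmogorov composition of Markov kernels: run the first kernel, then average the second over its output. A sketch with hypothetical names:

```python
def kleisli_compose(f_K, g_K):
    """Kleisli composition g_K ∘_K f_K = µ_Z ∘ G(g_K) ∘ f_K for finitely
    supported distributions (dicts: outcome -> probability)."""
    def composed(x):
        out = {}
        for y, py in f_K(x).items():
            for z, pz in g_K(y).items():
                out[z] = out.get(z, 0.0) + py * pz
        return out
    return composed

# two Markov kernels as Kleisli arrows: a fair coin, then a noisy bit-flip
f_K = lambda x: {0: 0.5, 1: 0.5}
g_K = lambda y: {y: 0.9, 1 - y: 0.1}
h = kleisli_compose(f_K, g_K)
print(h(None))  # the uniform distribution is preserved: mass 0.5 on each bit
```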
3.2.5 Simple Facts About the Giry Monad
Example 79. Let [k] = {1, 2, . . . , k}, viewed as a measurable space whose sigma algebra is the collection of all subsets of [k]. Then G([k]) is the collection of all probability distributions on [k]. This can be identified with the probability simplex $\triangle^{k-1} = \{(p_1, \ldots, p_k) \in \mathbb{R}^k : \sum_{i=1}^k p_i = 1,\ p_i \geq 0\ \forall i \in [k]\}$. The sigma algebra on G([k]) is generated by the pullbacks of sets of the form (a, b) ∩ [0, 1] ⊂ I under the evaluation maps $\mathrm{ev}_i : \triangle^{k-1} \to [0,1]$ given by $\mathrm{ev}_i(p_1, \ldots, p_k) = p_i$. Hence $\mathrm{ev}_i^{-1}((a,b) \cap I) = \{(p_1, \ldots, p_k) : p_i \in (a,b) \cap I\}$. It follows that sets of the form
\[
\bigcap_{i=1}^k \mathrm{ev}_i^{-1}((a_i, b_i) \cap I) = [(a_1, b_1) \times \cdots \times (a_k, b_k)] \cap \triangle^{k-1}
\]
are Giry measurable subsets of $\triangle^{k-1}$. This means that the Giry sigma algebra on $\triangle^{k-1}$ induced by the evaluation maps is equivalent to the standard Borel sigma algebra on $\triangle^{k-1}$. In particular, algebraic statistical models, i.e. models defined by polynomial equations in $\triangle^{k-1}$, will be measurable sets as these sets are closed in the Euclidean topology.
Lemma 80. Let X × Y be a product of measurable spaces and let πX : X × Y → X be the projection onto the first coordinate. The induced map G (πX ) : G (X × Y ) → G (X) corresponds to marginalizing over Y .
Proof. By definition, $G(\pi_X) : G(X \times Y) \to G(X)$ is given by $\mu \mapsto \mu(\pi_X^{-1}(-))$. Let B be a measurable subset of X. Then $\mu(\pi_X^{-1}(B)) = \mu(B \times Y)$.
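On finite products this lemma is directly checkable: pushing a joint distribution forward along the first projection sums out the second coordinate. A sketch (hypothetical helper; joint distributions are keyed by pairs):

```python
def marginalize_first(p_xy):
    """G(π_X) applied to a joint distribution on X × Y (keys are pairs):
    p(B) = p(B × Y), i.e. sum out the second coordinate."""
    out = {}
    for (x, _y), pxy in p_xy.items():
        out[x] = out.get(x, 0.0) + pxy
    return out

joint = {('a', 0): 0.1, ('a', 1): 0.2, ('b', 0): 0.3, ('b', 1): 0.4}
print(marginalize_first(joint))  # mass 0.3 on 'a', 0.7 on 'b' (up to rounding)
```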
Example 81. (Univariate Normal Distributions) The collection of univariate normal distributions can be seen as the mapping $n : \mathbb{R} \times \mathbb{R}_{>0} \to G(\mathbb{R})$ defined by $(\mu, \sigma^2) \mapsto p_{\mu,\sigma^2}$ where
\[
p_{\mu,\sigma^2}(B) = \int_B \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx.
\]
To see this mapping is indeed measurable, note that the Borel sigma algebra on I is generated by sets of the form (a, b) ∩ I. The sigma algebra structure on G(R) is generated by the evaluation maps $\mathrm{ev}_B : G(\mathbb{R}) \to I$ defined by $\mathrm{ev}_B(p) = p(B)$. The preimage of such a set is a set of the form $\mathrm{ev}_B^{-1}((a,b) \cap I) = \{p \in G(\mathbb{R}) : p(B) \in (a,b) \cap I\}$. Since the sigma algebra on G(R) is generated by sets of this form, we can observe that
\[
n^{-1}\left(\mathrm{ev}_B^{-1}((a,b) \cap I)\right) = \left\{(\mu, \sigma^2) : \int_B \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \in (a,b) \cap I\right\},
\]
which is a Borel measurable set in $\mathbb{R} \times \mathbb{R}_{>0}$.
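The map (µ, σ²) ↦ p_{µ,σ²} can be evaluated numerically on intervals via the error function. The sketch below (hypothetical `normal_measure`) computes p_{µ,σ²}((a, b)) using the standard relation between the normal CDF and erf:

```python
import math

def normal_measure(mu, sigma2):
    """Kleisli-style map (µ, σ²) -> p_{µ,σ²}: returns a function sending an
    interval (a, b) to its probability under N(µ, σ²)."""
    sigma = math.sqrt(sigma2)
    # CDF of N(µ, σ²) expressed through the error function
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return lambda a, b: cdf(b) - cdf(a)

p = normal_measure(0.0, 1.0)
print(p(-1.96, 1.96))   # approximately 0.95
```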
Example 82. (Singular Model) [141] Let P = [0, 1] × R. Let m : P → G(R) be the map sending (a, b) to the distribution with density
\[
x \mapsto \frac{1}{\sqrt{2\pi}} \left[ (1-a)\, e^{-\frac{x^2}{2}} + a\, e^{-\frac{(x-b)^2}{2}} \right].
\]
This mapping is not injective: whenever ab = 0, the parameters (a, 0) and (0, b) all lead to the same probability distribution, with density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. Thus, this mixture model does not correspond to a subobject of G(R).
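The non-injectivity on the locus ab = 0 can be checked numerically by comparing the densities pointwise. A sketch (hypothetical `mixture_density`):

```python
import math

def mixture_density(a, b):
    """Density of the mixture (1-a)·N(0,1) + a·N(b,1) from Example 82."""
    return lambda x: (1.0 / math.sqrt(2.0 * math.pi)) * (
        (1.0 - a) * math.exp(-x**2 / 2.0)
        + a * math.exp(-(x - b)**2 / 2.0))

# two different parameter points on the locus ab = 0 give the same density
f1, f2 = mixture_density(0.0, 3.0), mixture_density(0.3, 0.0)
assert all(abs(f1(x) - f2(x)) < 1e-12 for x in [-2.0, -1.0, 0.0, 1.0, 2.0])
```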
3.3 The Cartesian Closed Category of Quasi-Borel Spaces
A Cartesian closed category is a type of category with the same expressive power as a typed λ-calculus. As such, this gives a category-theoretic framework for expressing computations. Standard Borel spaces are certain well-behaved spaces for which many probabilistic constructions are guaranteed to exist. Unfortunately, as a full subcategory of Meas, these still do not admit a closed structure due to the theorem of Aumann referenced previously. For this reason, Heunen, Kammar, Staton, and Yang introduced the category of quasi-Borel spaces [65]. The category of quasi-Borel spaces has the structure of a quasi-topos [66]. In this section, we review quasi-Borel spaces. In later sections, we will see how these can be regarded as certain types of sheaves on an appropriately defined sample space category.
3.3.1 Quasi-Borel Spaces
Fix an uncountable standard Borel space Ω.
Definition. A quasi-Borel space, (X, M_X), is a set X together with a subset M_X ⊂ [Ω → X] such that

1. If α ∈ M_X and f : Ω → Ω is measurable, then α ◦ f ∈ M_X.

2. If α : Ω → X is constant, then α ∈ M_X.

3. If Ω = ⨿_{i∈N} S_i with each S_i ∈ B_Ω, and {α_i}_{i∈N} is a sequence in M_X, then ⨿_{i∈N} α_i, defined by s ↦ α_i(s) for s ∈ S_i, is in M_X.
Given a quasi-Borel space as defined above, these can be identified with subobjects in the topos of sets, i.e. M_X ⊂ X^Ω. In Set, we can consider the co-slice category, Ω ↓ Set, whose objects are maps Ω → X and whose arrows are given by post-composition, i.e. α ∈ X^Ω is sent to f ◦ α. To give quasi-Borel spaces the structure of a category, we can consider morphisms in the co-slice category which induce a set mapping between M_X and M_Y. In other words, a morphism f : (X, M_X) → (Y, M_Y) is a set map f : X → Y which induces a map f ◦ _ : M_X → M_Y. By abuse of notation, we will often simply denote a quasi-Borel space (X, M_X) by its collection of functions, M_X. The collection of quasi-Borel spaces thus forms a category which we denote by QBS.
Given a quasi-Borel space M_X, we can construct a sigma algebra, F_{M_X}, on its space of outcomes by
\[ \mathcal{F}_{M_X} = \left\{ B \subset X \mid f^{-1}(B) \in \mathcal{F}_\Omega \text{ for all } f \in M_X \right\}. \]
We call members of F_{M_X} events.

Lemma 83. We have a bijection F_{M_X} ≅ QBS(M_X, M_2).

Proof. Take χ in QBS(M_X, M_2). If χ is a QBS morphism, then given any α ∈ M_X, χ ◦ α is in M_2. Then (χ ◦ α)^{-1}(1) = α^{-1}(χ^{-1}(1)) must be a Borel measurable subset of Ω for every α ∈ M_X. Hence, χ^{-1}(1) ∈ F_{M_X}.

Conversely, suppose E ∈ F_{M_X}. Define χ_E ∈ QBS(M_X, M_2) by χ_E(x) = 1 if x ∈ E and χ_E(x) = 0 if x ∉ E. Let α ∈ M_X. Then (χ_E ◦ α)^{-1}(1) = α^{-1}(χ_E^{-1}(1)) = α^{-1}(E) is Borel measurable in Ω since E ∈ F_{M_X}. Similarly, (χ_E ◦ α)^{-1}(0) = α^{-1}(E^C) = (α^{-1}(E))^C is Borel measurable.
The above lemma allows us to define an evaluation mapping on probability measures.
Definition 84. (Evaluation Mapping for Probability Measures) Let M_2^{M_X} = QBS(M_X, M_2) and let G(M_X) denote the image of M_X under the Giry endofunctor. Let M_I be the quasi-Borel space obtained from the standard Borel space [0, 1]. The evaluation mapping, ev : M_2^{M_X} × G(M_X) → M_I, is defined to be
\[ \mathrm{ev}(\chi, \mu) = \int_\Omega \chi(\omega)\, d\mu. \]

3.3.2 Cartesian Closure of QBS
The terminal object in QBS is just the singleton set whose quasi-Borel structure is given by the unique map into it. If (X, M_X) and (Y, M_Y) are quasi-Borel spaces, there is a product (X × Y, M_{X×Y}) where the set is given by the usual product of sets and the product structure M_{X×Y} is defined by
\[ M_{X \times Y} := \{ \alpha : \Omega \to X \times Y \mid \pi_X \circ \alpha \in M_X,\ \pi_Y \circ \alpha \in M_Y \}. \]
Exponentials are the only somewhat complicated construction. If (X, M_X) and (Y, M_Y) are quasi-Borel spaces, then so is (Y^X, M_{Y^X}), where Y^X := QBS(X, Y) and
\[ M_{Y^X} := \left\{ \alpha : \Omega \to Y^X \mid \mathrm{uncurry}(\alpha) \in \mathbf{QBS}(\Omega \times X, Y) \right\}. \]
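Underlying this definition is the currying bijection Hom(S × X, Y) ≅ Hom(S, Y^X) that characterizes Cartesian closure. A minimal Python sketch of the two directions (names are ours):

```python
def curry(f):
    # [ (S x X) -> Y ]  ->  [ S -> (X -> Y) ]
    return lambda s: lambda x: f(s, x)

def uncurry(g):
    # [ S -> (X -> Y) ]  ->  [ (S x X) -> Y ]
    return lambda s, x: g(s)(x)

# curry and uncurry are mutually inverse on functions of two arguments.
add = lambda s, x: s + x
assert uncurry(curry(add))(2, 3) == add(2, 3) == 5
```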
3.3.3 The Giry Monad on the Category of Quasi-Borel Spaces
Let M_X be a quasi-Borel space. Any α ∈ M_X and µ ∈ G(Ω) determine a probability measure on the measurable space (X, F_{M_X}) via the pushforward α_*µ.
As (α, µ) and (β, ν) may push forward to the same probability measure on (X, F_{M_X}),
we can define an equivalence relation on pairs where (α, µ) ∼ (β, ν) if α∗µ = β∗ν
(as probability-measures). As a set this is a quotient set of MX × G (Ω) where G (Ω) is the image of Ω under the Giry endofunctor in Meas (regarded as a set). We now
need to give PX = (MX × G (Ω)) / ∼ a quasi-Borel structure. Define
MPX := {β :Ω → PX | ∃α ∈ MX .∃g ∈ Meas (Ω, G (Ω)) .∀ω ∈ Ω.β (ω) = [α, g (ω)]}
where [α, µ] is the image of (α, µ) under the quotient map π : MX × G (Ω) → PX . Heunen, Kammar, Staton, and Yang prove that this construction yields a strong monad on QBS. The following lemma is an important observation about the Giry endofunctor applied to the sample space Ω.
Lemma 85. M_{PΩ} ≅ Meas(Ω, G(Ω)).

Proof. Suppose f ∈ Meas(Ω, G(Ω)). Then β(ω) := [1_Ω, f(ω)] defines an element β ∈ M_{PΩ}.

Now, suppose β ∈ M_{PΩ}. Then there exist α ∈ M_Ω = Meas(Ω, Ω) and g ∈ Meas(Ω, G(Ω)) such that β(ω) = [α, g(ω)], i.e. β determines the map ω ↦ α_*(g(ω)), an element of Meas(Ω, G(Ω)).
3.3.4 De Finetti Theorem for Quasi-Borel Spaces
DeFinetti’s representation theorem is a foundational theorem in Bayesian statistics. For completeness, its statement is provided below.
Theorem. (DeFinetti's Representation Theorem) Let (Ω, F, µ) be a probability space, and let (X, B) be a Borel space. For each n, let X_n : Ω → X be measurable. The sequence {X_n}_{n=1}^∞ is exchangeable if and only if there is a random probability measure P on (X, B) such that, conditional on P = ρ, the {X_n}_{n=1}^∞ are IID with distribution ρ. Furthermore, if the sequence is exchangeable, then the distribution of P is unique, and P_n(B) converges to P(B) almost surely for each B ∈ B [119].
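The representation can be illustrated concretely with coin flips whose bias is itself uniformly distributed: integrating out the bias yields a joint law that depends only on the number of ones, hence an exchangeable sequence. A Python sketch in exact rational arithmetic (the helper name is ours; it uses the beta integral ∫₀¹ p^k (1−p)^{n−k} dp = k!(n−k)!/(n+1)!):

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def seq_prob(xs):
    # P(X_1 = x_1, ..., X_n = x_n) for coin flips whose bias p is itself
    # uniform on [0,1]: the mixture integral depends only on k = #ones.
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))

xs = (1, 1, 0, 1)
probs = {seq_prob(p) for p in permutations(xs)}
# Every reordering of the sequence has the same probability: exchangeability.
assert probs == {Fraction(1, 20)}
```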
Heunen, Kammar, Staton, and Yang formulate a version of the DeFinetti Theorem for Quasi-Borel Spaces. Before giving the statement of this theorem, we must explain exchangeability in the language of Quasi-Borel spaces.
Definition. (α, µ) Q X A probability measure on i∈N i is said to be exchangeable if
for all permutations π : N → N, [α, µ] = [απ, µ] where απ (ω)i := α (ω)π(i) for all i ∈ N. Theorem. (DeFinetti’s Theorem for Quasi-Borel Spaces) If (α, µ) is an exchange- Q∞ able probability measure on i=1 Xi, then there exists a probability measure (β, ν) Qn in G (G (X)) such that for all n ≥ 1, the measure ([β, ν] = iidn) on G ( i=1 X)
equals G ((−)1···n)(α, µ) when considered as a measure on the product measur- Qn n Q∞ Qn able space ( i=1 X, ⊗i=1X) where (−)1···n : i=1 Xi → i=1 Xi is defined by
(x1, . . . , xn, xn+1,... ) 7→ (x1, . . . , xn) [65].
Quasi-Borel spaces provide a category that can be used as the denotational semantics of a probabilistic programming language. However, their construction does not naturally handle extensions of sample spaces. As we discussed at the beginning of this chapter, in the course of executing a program which involves sampling from probability distributions, sample spaces are constructed in memory as needed. Suppose we draw a collection of n samples from a standard normal distribution and another n samples from some binomial distribution and couple the results into a data frame. The sample space for the joint distribution collected in the data frame is the independence join of the probability spaces used to generate the samples. This suggests that a notion of extensibility is a natural requirement to embed in whatever type of categorical framework we use for modeling probabilistic programming. The notion of extensibility is reminiscent of Lawvere's idea of topos theory being a framework for working with variable sets. We use this as motivation for constructing a sheaf topos on a particular category of probability spaces which will serve as a framework for discussing probabilistic concepts. Before we can achieve this, we need to briefly recall some important properties of standard Borel spaces. Standard Borel spaces are the prototypical examples of well-behaved measurable spaces and these will be used in the construction of our sheaf topos.
3.4 Standard Borel Spaces
Data on a computer is ultimately represented as a bit-string. The types of measurable spaces which should be sufficiently well-behaved to be modeled on computers should, in some sense, be well-approximated by a bit-string. Recall that a measurable map f : (X, F_X) → (Y, F_Y) is said to be exactly measurable if f^{-1}(F_Y) = F_X. A natural condition on a measurable space could be that there exists an exactly measurable map from (X, F_X) into S = {0, 1}^N with the sigma algebra generated by the cylinders of finite bit-strings. Note this is equivalent to the Borel sigma-algebra induced by the metric
\[ d\left( \{s_n\}_{n\in\mathbb{N}}, \{s'_n\}_{n\in\mathbb{N}} \right) := \sum_{n\in\mathbb{N}} 2^{-n} \left| s_n - s'_n \right|. \]
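Truncating the series gives a computable approximation to this metric. A Python sketch (indexing bit positions from 1 via the exponent n + 1, a choice of convention; names are ours):

```python
def d(s, t, terms=64):
    # Metric inducing the cylinder sigma-algebra on {0,1}^N; truncating
    # the series at `terms` bits incurs an error below 2**-terms.
    return sum(2.0 ** -(n + 1) * abs(s(n) - t(n)) for n in range(terms))

# Bit-strings represented as functions N -> {0, 1}.
zeros = lambda n: 0
ones  = lambda n: 1
alt   = lambda n: n % 2

assert abs(d(zeros, ones) - 1.0) < 1e-12        # differ everywhere: sum 2^-n = 1
assert abs(d(zeros, alt) - 1.0 / 3.0) < 1e-12   # differ at every other position
```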
A theorem, due to Mackey, identifies such measurable spaces as exactly the countably generated measurable spaces.
Theorem. (Mackey, 1957) A measurable space is countably generated if and only if there exists an exactly measurable mapping into S [95].
Standard Borel spaces are an important class of well-behaved measurable spaces first introduced in [95]. Many important results in probability and statistics are not true for probability measures on arbitrary measurable spaces but do hold for standard Borel spaces. In this dissertation, we will focus on probability measures on standard Borel spaces. In fact, any time we say measurable space in this dissertation it should be assumed that we mean standard Borel space. For a more detailed treatment of the facts collected here, we refer the reader to the survey article [111] or the textbook [130].
Definition 86. Let (X, F) be a measurable space. (X, F) is said to be standard Borel if there exists a metric on X which makes it a complete separable metric space in such a way that F is the Borel σ-algebra corresponding to that metric. In other words, F is the Borel sigma algebra associated to a Polish space structure on X.
Theorem. For a measurable space (X, F), the following are equivalent [79]:
• (X, F) is a retract of (R, B_R), i.e. there exist measurable maps f : X → R and g : R → X such that g ◦ f = id_X.
• (X, F) is measurably isomorphic to either (R, B_R), (Z, P(Z)), or ([k], P([k])) for some k ∈ N.
• X has a complete metric with a countable dense subset and F is the Borel sigma algebra generated by this metric.
Standard Borel spaces enjoy a number of useful properties which are not true of general measurable spaces:
• They are countably generated [95].
• Any bijective measurable mapping between standard Borel spaces is necessar- ily an isomorphism, i.e. the inverse must also be measurable [127].
• Products and disjoint unions of standard Borel spaces are standard Borel [130].
• A function f : X → Y between two standard Borel spaces is measurable if and only if its graph is also standard Borel [130].
Moreover, a number of important results in probability hold for standard Borel spaces which are known to not hold for arbitrary measurable spaces. Some notable examples include the following:
• the Kolmogorov extension property [107,109]
• the existence of conditional distributions [33,40,42]
• the Dynkin extension property [50]
• DeFinetti’s representation theorem [31,37,67]
The properties that standard Borel spaces have which do not hold in a general measurable space seem to strongly suggest that the subcategory of standard Borel spaces is an appropriate category to use as a base category for our sheaf theoretic approach to probabilistic programming.
3.5 Quasi-Borel Sheaves
3.5.1 Sample Space Category
Tao observed that probability theorists often choose notation which de-emphasizes the role of a sample space preferring instead to treat it more as a black box. He emphasizes that random variables should not be thought of as anchored to any one particular sample space and should be identified with their space of extensions. He proposes that the concepts studied in probability theory are precisely those that are invariant under surjective measure preserving mappings [136]. This idea is very suggestive of some sort of category of extensions of a probability space. We make this idea mathematically precise in this section by modifying the definition of quasi-Borel spaces introduced in [65] to be compatible with Tao’s idea of extensibility. In the course of executing a program which involves sampling from probability distributions, sample spaces are constructed in memory as needed. Imagine drawing a collection of n samples from a standard normal distribution and another n samples from a uniform distribution on [0, 1] and coupling the results into a data frame. The sample space for the joint distribution collected in the data frame is the product of the two sample spaces and these spaces are joined as independent when we couple them into the data frame. As such, it is unnatural to think of our program as arising from one fixed sample space declared initially. Online probabilistic systems must have the ability to dynamically generate sample spaces. For this reason,
extensibility is a natural criterion for thinking about probabilistic programming and will be a major motivation behind the introduction of sheaves later in this chapter.
Example 87. One way to implement sampling from a normal distribution would be to sample uniformly from some approximation of [0, 1] and apply the inverse cumulative distribution function of the standard Gaussian, α, to the resulting samples. Using the identity map id : [0, 1] → [0, 1] allows us to sample uniformly from [0, 1]. Consider a quasi-Borel structure on R containing both α and id. A naive coupling of these along their original sample space yields a map α ∨ id : [0, 1] → R × [0, 1] which would not represent an independence join of the two random variables. Instead, we should consider I² = [0, 1] × [0, 1] equipped with the product measure and sample the components. The original α and id can be recovered by projecting onto the first and second coordinates, respectively. As the component projections are surjective, this construction is an example of an extension of sample space.
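The difference between the naive coupling and the independence join shows up empirically as correlation. A Python sketch using the standard library's inverse normal CDF (the sample size, seed, and thresholds are illustrative choices):

```python
import random
from statistics import NormalDist

random.seed(0)
inv_cdf = NormalDist().inv_cdf   # alpha: inverse CDF of the standard Gaussian

def corr(xs, ys):
    # Sample Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

us = [random.random() for _ in range(10_000)]
vs = [random.random() for _ in range(10_000)]

# Naive coupling along the original sample space [0, 1]: one draw feeds both
# coordinates, so alpha and id are strongly dependent.
naive = corr([inv_cdf(u) for u in us], us)

# Independence join: sample on I^2 with the product measure and project.
indep = corr([inv_cdf(u) for u in us], vs)
```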
We will define a category S whose objects are standard Borel spaces, (S, B_S, µ), equipped with a probability measure µ, and whose morphisms are given by surjective measure preserving mappings. This category has a terminal object and initial object, but not many other nice properties; however, we will see later that it admits a semi-pullback structure which is essential to developing sheaf theory on this category of extensions. This definition is similar to one used by Simpson in an unpublished conference paper; however, Simpson does not require his morphisms to be surjective [123]. In this section, we will see how we can embed the notion of quasi-Borel spaces as a certain family of representable presheaves on this category. For the purposes of modeling computation, we will choose to think of there being some sample space representation, S_0, that is the initial source of randomness. Conceptually, we can think of this as something like the 32-bit space {0, 1}^32 and suppose all extensions of sample spaces will be spaces with strictly larger cardinality. This is a subcategory of the full category, but adequate for our purposes.
In this subcategory, S0 would become the terminal object. We will not dwell on this point further and instead work with the full sample space category unless mentioned explicitly.
3.5.2 Quasi-Borel Presheaves
Definition 88. A quasi-Borel presheaf, Q, is a representable presheaf defined in the following manner:
• If α ∈ Q(S) and f : S′ → S is a morphism in S, then α ◦ f ∈ Q(S′).
• For any S in S, Q(S) contains all constant mappings.
• If S = ⨿_{i=1}^∞ B_i where each B_i ∈ B_S and B_i ∩ B_j = ∅ when i ≠ j, and {α_i}_{i=1}^∞ is a sequence of maps in Q(S), then ⊕_{i=1}^∞ α_i, defined by (⊕_{i=1}^∞ α_i)(s) := α_i(s) if s ∈ B_i, is also in Q(S).
The first observation about this structure is that the presheaves so defined are actually sheaves with respect to the atomic Grothendieck topology. In order to prove this, we need to recall a bit of terminology.
3.5.3 Quasi-Borel Sheaves
In order to discuss sheaves on a sample space category, we need to equip it with a Grothendieck topology. A category equipped with a Grothendieck topology is referred to as a site. Recall that a Grothendieck topology on a category C is a function J which assigns to each object C in C a collection of sieves on C satisfying the following properties:
1. The maximal sieve tC = {f | cod (f) = C} is in J (C).
2. (stability axiom) If S ∈ J (C), then the pullback h∗ (S) ∈ J (D) for any arrow h : D → C.
3. (transitivity axiom) If S ∈ J (C) and R is any sieve on C such that h∗ (R) ∈ J (D) for all h : D → C in S, then R ∈ J (C).
A popular choice of Grothendieck topology is the so-called atomic topology.
Definition 89. A site (C,Jat) is called an atomic site if the covering sieves of Jat are given by the inhabited sieves. Jat is referred to as the atomic topology.
In order for the stability axiom to hold on an atomic topology, we need it to be possible for every cospan to be completed to a commuting square: given a cospan X → Z ← Y, there must exist an object W in C along with morphisms W → X and W → Y such that the two composites W → X → Z and W → Y → Z agree.
Notice the similarity of this condition to the definition of pullbacks. The difference between this condition and the more general pullback condition is that there is no universal property requirement on the object W and the morphisms W → X and W → Y. As such, these are sometimes referred to as semi-pullbacks in the literature. This property is so important in topos theory that it is given a name.
Definition 90. A category C is said to satisfy the Ore condition if every cospan can be completed to a commuting square.
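On a finite poset, viewed as a category, the Ore condition is decidable by brute force: every cospan x → z ← y must admit some w below both x and y. A Python sketch (the poset and helper names are ours) in which adjoining a bottom element repairs a failing example:

```python
from itertools import product

def leq_closure(elts, pairs):
    # Reflexive-transitive closure of generating inequalities: the arrows
    # of the poset regarded as a category.
    leq = {(e, e) for e in elts} | set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(leq), repeat=2):
            if b == c and (a, d) not in leq:
                leq.add((a, d))
                changed = True
    return leq

def ore(elts, leq):
    # Every cospan x -> z <- y must complete: some w with w -> x and w -> y.
    return all(
        any((w, x) in leq and (w, y) in leq for w in elts)
        for x, y, z in product(elts, repeat=3)
        if (x, z) in leq and (y, z) in leq
    )

V = {"x", "y", "z"}
leq = leq_closure(V, {("x", "z"), ("y", "z")})
assert not ore(V, leq)                 # the cospan x -> z <- y has no completion

W = V | {"w"}                          # adjoin a bottom element
leq_w = leq_closure(W, {("x", "z"), ("y", "z"), ("w", "x"), ("w", "y")})
assert ore(W, leq_w)
```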
Verifying the Ore condition for our sample space category follows from a construction due to Edalat.
Theorem. (Edalat, 1998) The category of standard Borel spaces equipped with probability measures admits semi-pullbacks [43].
Note that this is not true for arbitrary measurable spaces [52]. The basic idea behind Edalat's construction is to integrate over the fibers of product measures of the regular conditional probabilities. A theorem due to Johnstone relates this condition to the internal logic of the topos Sˆ.
Theorem. (Johnstone, 1979) Let C be a category. Then Cˆ is a De Morgan topos if and only if C satisfies the Ore condition [76].
This means that the topos of presheaves Sˆ is a De Morgan topos, i.e. its internal logic is a Heyting algebra satisfying the De Morgan laws: ¬(a ∧ b) = ¬a ∨ ¬b and ¬(a ∨ b) = ¬a ∧ ¬b. However, it should be noted that the second law holds in every Heyting algebra [18]. Although not required by the Ore condition, the Edalat construction is known to obey a universal property. Simpson attempted to characterize independence and conditional independence in purely category theoretic terms [121]. Simpson's work allows the Edalat construction to be seen as universal with respect to independence products. With this background out of the way, we can now prove that every quasi-Borel presheaf is in fact a sheaf with respect to the atomic Grothendieck topology. This result is similar to a result due to Simpson for equivalence classes of random variables, stated in [123] without proof.

Lemma 91. Any quasi-Borel presheaf, Q, is a sheaf with respect to the atomic Grothendieck topology.
Proof. A presheaf R on an atomic site (S, J_at) is a sheaf if and only if for any morphism q : S′ → S and any α ∈ R(S′), if R(r_1)(α) = R(r_2)(α) for all diagrams
\[ S'' \overset{r_1}{\underset{r_2}{\rightrightarrows}} S' \overset{q}{\to} S \]
with q ◦ r_1 = q ◦ r_2, then there is a unique γ ∈ R(S) such that α = R(q)(γ) [94]. To check that this condition holds, notice that it amounts to the following data in Set: maps r_1, r_2 : S′′ → S′ and q : S′ → S as above, together with α : S′ → X, for which we seek γ : S → X with γ ◦ q = α.

We would like to define γ(ω) to be α(q^{-1}(ω)). However, this is only well-defined if α is constant on the fibers q^{-1}(ω). By way of contradiction, suppose there exist ω′_1 and ω′_2 in S′ with α(ω′_1) ≠ α(ω′_2) and q(ω′_1) = q(ω′_2). Lifting to the semi-pullback of q with itself, we get ω′′ ∈ S′′ with r_1(ω′′) = ω′_1 and r_2(ω′′) = ω′_2. By assumption, R(r_1)(α) = R(r_2)(α) and hence α ◦ r_1 = α ◦ r_2. However, α ◦ r_1(ω′′) = α(ω′_1) ≠ α(ω′_2) = α ◦ r_2(ω′′), a contradiction. Hence, it must be the case that the map α is constant on fibers. Thus, any quasi-Borel presheaf is a sheaf.
Notice the similarity of this definition to the definition of sheaves in terms of matching families. The above lemma can informally be interpreted as stating that every morphism is a cover with respect to the atomic topology. If we interpret this in the language of the presheaf of random variables, the condition that P(g)(y) = P(h)(y) for all diagrams
\[ E \overset{g}{\underset{h}{\rightrightarrows}} D \overset{f}{\to} C \]
with f ◦ g = f ◦ h is basically saying that the random variable y is not exploiting any of the additional structure of the sample space extension afforded to it as an extension of the space C. In essence, this appears to be a statement about the random variable being constant on fibers, which is suggestive of regular conditional probabilities. The trouble with this approach is that regular conditional probabilities are only defined almost surely. As such, we could adjust by passing to equivalence classes of random variables. This may initially seem appealing for the reasons stated when we first defined the sample space category. However, such a construction makes the theory of stochastic processes difficult. It would make properties like the almost sure continuity of the paths of a Brownian motion an ill-defined notion. Nevertheless, by defining these structures in terms of the maps themselves, we do not have to deal with the nuances of the underlying probability measures. In terms of probabilistic programming, any probability measure that we implement on the sample space can be pushed forward onto our quasi-Borel space of outcomes via the collection of maps we have already defined from the sample space into another data type.
3.5.4 Lifting Measures Lemma
In order to discuss probability theory, we need to understand lifts of measures in the sample space category S. In other words, given a measure µ on S, we want to ensure there is a way of extending the measure to any space S′ equipped with a surjective measurable map q : S′ → S. The fact that arbitrary measures have lifts to the extended sample spaces is the subject of the next lemma.
Lemma 92. Given a measure µ : 1 → G(S), µ lifts to a measure on G(S′).
Proof. Given some B′ ∈ F′, the collection ↑B′ := {B ∈ F | q^{-1}(B) ⊃ B′} is non-empty as q^{-1}(S) = S′ ⊃ B′. Similarly, we may define a collection ↓B′ := {B ∈ F | q^{-1}(B) ⊂ B′}. This allows us to construct outer and inner approximations to any measure on S′ which is a lift of µ. We define the inner approximation by µ_*(B′) := sup{µ(B) | B ∈ ↓B′} and the outer approximation by µ^*(B′) := inf{µ(B) | B ∈ ↑B′}. Any lift µ̃ of µ must then satisfy the following inequality:
\[ \mu_*(B') \leq \tilde{\mu}(B') \leq \mu^*(B'). \]

By way of contradiction, suppose that no such µ̃ existed. Then there would be a sequence of disjoint sets {B′_i}_{i∈N} belonging to F′ for which
\[ \sum_{i=1}^\infty \mu_*(B'_i) > \mu^*(B') \]
where B′ = ∪_{i∈N} B′_i. Fix ε > 0. For each B′_i, there exists a B_i ∈ F with B_i ∈ ↓B′_i satisfying
\[ \mu_*(B'_i) < \mu(B_i) + \frac{\varepsilon}{2^i}. \]
Moreover, the collection {B_i}_{i∈N} consists of pairwise disjoint sets. Hence Σ_{i=1}^∞ µ(B_i) = µ(B) where B = ∪_{i∈N} B_i. Now, as B ⊂ C for any C ∈ ↑B′, we see that
\[ \sum_{i=1}^\infty \mu_*(B'_i) < \varepsilon + \sum_{i=1}^\infty \mu(B_i) = \varepsilon + \mu(B) \leq \varepsilon + \mu^*(B'). \]
Letting ε → 0 establishes
\[ \sum_{i=1}^\infty \mu_*(B'_i) \leq \mu^*(B'), \]
a contradiction, and so a lift µ̃ of µ must indeed exist.
Remark. The construction µ_* is actually a probability measure on the lift. Note that if B′_1 and B′_2 are disjoint then so are ↓B′_1 and ↓B′_2, thus µ_* will be countably additive. Moreover, the full space will have measure 1 because q^{-1}(S) = S′, so µ_*(S′) = 1.

What this lemma says in terms of programming languages is that if we implement a probability measure on a sample space like {0, 1}^32 and later construct an extension {0, 1}^64, then as long as we construct a surjective map connecting these, i.e. q : {0, 1}^64 → {0, 1}^32 defined by q(b_1, ..., b_32, b_33, ..., b_64) = (b_1, ..., b_32), we can guarantee that we can construct a measure on the larger space which pushes forward to the original sample space.
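This picture can be checked exactly on small bit widths. The following Python sketch (using 3- and 6-bit spaces as stand-ins for the 32- and 64-bit ones, with an arbitrary illustrative measure) lifts a measure by extending with uniform independent bits and verifies that q pushes the lift back to the original:

```python
from fractions import Fraction
from itertools import product

BITS_SMALL, BITS_BIG = 3, 6   # stand-ins for the 32- and 64-bit spaces

def q(word):
    # Surjective projection {0,1}^6 -> {0,1}^3: drop the extra bits.
    return word[:BITS_SMALL]

# An arbitrary probability measure mu on {0,1}^3 (weight proportional to
# 1 + number of ones, an illustrative choice).
small = list(product((0, 1), repeat=BITS_SMALL))
total = sum(1 + sum(w) for w in small)
mu = {w: Fraction(1 + sum(w), total) for w in small}

# One lift of mu: extend by uniform, independent extra bits.
big = list(product((0, 1), repeat=BITS_BIG))
lift = {w: mu[q(w)] * Fraction(1, 2 ** (BITS_BIG - BITS_SMALL)) for w in big}

# The pushforward q_* lift recovers mu exactly.
pushforward = {w: sum(lift[v] for v in big if q(v) == w) for w in small}
assert pushforward == mu
assert sum(lift.values()) == 1
```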
3.6 Probability Theory for Quasi-Borel Sheaves
3.6.1 Events
We have previously seen that there is a bijection between F_{M_X} and QBS(X, 2) for any quasi-Borel space (X, M_X). In order to understand sub-sigma algebras as constructions within QBS, we should first understand a sigma algebra diagrammatically in terms of the quasi-Borel space (2, M_2) where M_2 = Meas(S, 2). Sigma algebras are required to be closed under complementation. This can be understood in terms of characteristic functions by observing that χ_A + χ_{A^C} = 1. In other words, for every A : 1 → M_2 there exists A^C : 1 → M_2 such that A^C = ¬ ◦ A, where ¬ : M_2 → M_2 is the negation map.

The other defining property of a sigma algebra is that it is closed under countable unions. Equivalently, we could require the sigma algebra to be closed under countable intersections. We choose the latter approach because this is easier to express in the language of characteristic functions. In terms of characteristic functions, closure under countable intersections means that for any mapping {A_i}_{i∈N} : 1 → ∏_{i∈N} M_2, there is a mapping ∩_{i∈N} A_i : 1 → M_2 obtained by composing ∏_{i∈N} A_i with the countable conjunction ∏_{i∈N} M_2 → M_2 of characteristic functions.
Recall that in Meas there is a one-to-one correspondence between events and characteristic functions. In other words, for every object (S, B_S) in Meas, we have a canonical isomorphism B_S ≅ Meas(S, 2). Now if q : S′ → S is an extension of S, then an event B ∈ B_S can be identified with the event q^{-1}(B) ∈ B_{S′}. The corresponding characteristic functions satisfy χ_{q^{-1}(B)} = χ_B ◦ q, i.e. the triangle formed by q : S′ → S and χ_B : S → 2 commutes.
Thus, to understand events we can construct the following presheaf.
Definition 93. (Event Presheaf) Let S be a sample space category. The presheaf
of events can be identified with the Yoneda embedding h2 := Meas (−, 2) where 2 = {0, 1} is given the sigma-algebra of all its subsets.
3.6.2 Global Sections, Local Sections, and Subsheaves
Heunen, Kammar, Staton, and Yang do not incorporate measures into their underlying sample space. As such, they define a probability measure to be a pair (α, µ) where α ∈ M_X for a quasi-Borel space (X, M_X) and µ is a probability measure on the underlying standard Borel space. Because we have incorporated probability measures into our definition of the underlying sample space category, we can identify random variables with points (or global sections) of our sheaf of random variables of some fixed type. Global sections of a quasi-Borel sheaf can be identified with points in the outcome space because the component function of the terminal object must select out a map from a one element set into the outcome space of the quasi-Borel sheaf. Any map from a singleton set into another set determines a unique point, namely the image of the singleton. Thus, global sections of a quasi-Borel sheaf are simply the points in the outcome space. Local sections of a quasi-Borel sheaf are more interesting. Because these are defined as morphisms from a subsheaf of the terminal sheaf into a quasi-Borel sheaf, this allows for the possibility of mapping the terminal sample space to the empty set. This added flexibility allows for the possibility of randomness. A random variable, along with its collection of extensions, can be identified as a local section of a quasi-Borel sheaf. Effectively, the local section picks out the random variable α, along with its q-extensions, α ◦ q for each surjective measure-preserving map q.
The product construction for quasi-Borel spaces represents a joint distribution on outcomes. This construction lifts naturally to a quasi-Borel sheaf: since X × Y is just another set, we can construct a quasi-Borel sheaf of maps into X × Y in the same manner as before. However, the category of presheaves has its own product, namely the component-wise product of presheaves. These two notions are isomorphic by the universal property of products in Set.

When discussing independence, most authors begin with a discussion of independence of sub-sigma algebras. Let B_X and G_X be two sigma algebras on the set X. We say G_X is a sub-sigma algebra of B_X if G_X ⊂ B_X. Another way of stating this is that the identity mapping i : (X, B_X) → (X, G_X) is measurable. Recall that two sigma fields 𝒢_1 and 𝒢_2 are said to be independent with respect to a probability measure p if for all G_1 ∈ 𝒢_1 and G_2 ∈ 𝒢_2, p(G_1 ∩ G_2) = p(G_1) p(G_2). From independence of sigma fields, we can define the independence of random variables as follows: we say two random variables (f : Ω → X, p) and (g : Ω → X, p) are independent if the sigma-algebras f^{-1}(B_X) and g^{-1}(B_X) are independent. From this definition, it is possible to prove that two random variables are independent if and only if their joint distribution is the product of their marginal distributions [112]. For quasi-Borel sheaves, independent random variables can be identified with a subsheaf, R_X ⊥ R_Y ⊂ R_X × R_Y, where (α_X, α_Y) ∈ R_X ⊥ R_Y if and only if α_X and α_Y are independent.
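The equivalence between independence and factorization of the joint distribution can be verified exhaustively on a two-dice sample space with exact rational arithmetic. A Python sketch (all names are ours):

```python
from fractions import Fraction
from itertools import product

# Sample space [6] x [6] with the uniform (product) measure.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

f = lambda w: w[0]          # first die
g = lambda w: w[1]          # second die

def prob(event):
    # Probability of the event {w : event(w)} under the uniform measure.
    return sum(p for w in omega if event(w))

# Joint distribution factors as the product of the marginals, for every cell.
for i, j in product(range(1, 7), repeat=2):
    joint = prob(lambda w: f(w) == i and g(w) == j)
    assert joint == prob(lambda w: f(w) == i) * prob(lambda w: g(w) == j)
```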
3.6.3 Expectation as a Sheaf Morphism
Let X = R^k for some k ∈ N. With a quasi-Borel structure, the sigma algebra on the outcome space is defined so that all maps inside the quasi-Borel space are measurable with respect to the constructed sigma algebra. As such, it is sensible to construct an expectation operator on quasi-Borel sheaves. The codomain of this operation will need to be an extension of the real numbers:
\[ \widetilde{\mathbb{R}} := \mathbb{R} \amalg \{\infty, -\infty, \text{undefined}\}. \]
Let R̃ also denote the constant presheaf on R̃. Expectation is then a morphism E : R_X ⇒ R̃ which we can define on components as
\[ \mathbb{E}_S[\alpha] := \int_S \alpha(s)\, d\mu = \int_X x\, d(\alpha_*\mu). \]
Note that if q : S′ → S in S with measure µ′ on S′, then, since q is measure preserving,
\[ \mathbb{E}_{S'}[\alpha \circ q] = \int_{S'} \alpha \circ q(s')\, d\mu' = \int_X x\, d((\alpha \circ q)_*\mu') = \mathbb{E}_S[\alpha]. \]
Thus, E_− is a morphism of presheaves. If f : R_X → R_X is a morphism of quasi-Borel sheaves, we can define expectation similarly on components:
\[ \mathbb{E}_S[f \circ \alpha] := \int_S f(\alpha(s))\, d\mu = \int_X f(x)\, d(\alpha_*\mu). \]
Example 94. Let S = [6] where [6] = {1, 2, 3, 4, 5, 6} and let S′ = [6] × [6]. Equip both S and S′ with their uniform probability measures. Define q : S′ → S by q(x_1, x_2) = x_1. Note that q is surjective because it is a projection and it is measure preserving because µ_{S′}(q^{-1}(i)) = µ_S(i) for each i ∈ [6]. Thus, q is a legitimate extension of sample space. Let α : [6] → R be the obvious embedding. Then
\[ \mathbb{E}_S[\alpha] = \sum_{i \in [6]} i\, \mu_S(i) = \frac{21}{6}. \]
On the other hand,
\[ \mathbb{E}_{S'}[\alpha \circ q] = \sum_{(i,j) \in [6]^2} \alpha(q(i, j))\, \mu_{S'}(i, j) = \frac{1}{36}(6 \cdot 21) = \frac{21}{6}. \]
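The computation in Example 94 can be replayed exactly in Python (a sketch; the names mirror the example):

```python
from fractions import Fraction
from itertools import product

S = range(1, 7)                          # [6] with uniform measure 1/6
S_prime = list(product(S, repeat=2))     # extension [6] x [6], measure 1/36
q = lambda pair: pair[0]                 # surjective measure-preserving map
alpha = lambda s: s                      # the obvious embedding [6] -> R

E_S = sum(Fraction(1, 6) * alpha(s) for s in S)
E_S_prime = sum(Fraction(1, 36) * alpha(q(w)) for w in S_prime)

# Expectation is invariant under the extension of sample space.
assert E_S == E_S_prime == Fraction(21, 6)
```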
3.7 Future Work
3.7.1 Probabilistic Programming and Simulation of Stochas- tic Processes
The demands of probabilistic programming and the demands of simulating stochastic processes seem to hold contradictory requirements. Probabilistic programming relies heavily on conditioning, and as conditional expectation is only defined almost everywhere, it seems natural to replace random variables with equivalence classes of random variables where two random variables are identified if they agree almost surely. Unfortunately, descending to equivalence classes makes many desirable properties of stochastic processes untrue (e.g. almost-sure continuity of paths of a Brownian motion). How can the seemingly conflicting demands of these two applications be balanced?
3.7.2 Categorical Logic and Probabilistic Reasoning
Topos theory provides a wealth of tools for analyzing the internal logic of the constructed topos. Does the internal logic of this construction reflect the logic of plausible reasoning as formulated in [74]? More broadly, what is the relationship between the internal logic of this sheaf topos and statistical inference? The approach outlined in this chapter is reminiscent of the non-commutative viewpoint. For instance, manifold theory from this perspective has been developed in [106]. Perhaps the categorical perspective could help with generalizing probabilistic reasoning to non-commutative situations such as those arising in free probability [99] or quantum Bayesianism [21].
3.7.3 Sample Space Category and the Topos Structure
Although there are counterexamples which prohibit the larger category of measurable spaces from satisfying the Ore condition, it is possible to extend the definition of the atomic topology to arbitrary categories by defining the atomic Grothendieck topology to be the smallest Grothendieck topology containing the inhabited sieves. For this definition of atomic topology for an arbitrary category, it is still the case that the sheaf category Sh(C, J_at) is an atomic Grothendieck topos, i.e. the subobject lattice of every object is a complete atomic Boolean algebra [22]. This observation could perhaps be a stepping stone for moving the ideas presented in this chapter to more general classes of measurable spaces. More broadly speaking, how does restricting or enhancing the types of spaces we allow as sample spaces affect the subsequent sheaf topos? Do different choices for the underlying base category result in equivalent sheaf topoi?
3.7.4 Extension of the Giry Monad
Can the Giry monad be extended to the presheaf topos Ŝ or the sheaf topos Sh(S)? In Chapter 6 of this dissertation we will make the argument that implementing the Giry monad is rather useful for statistical computing with missing or conflicting data. As such, if probabilistic programming can be given sheaf-theoretic semantics, it would be helpful to generalize the Giry monad to arbitrary presheaves or sheaves. Another way of framing this question is to ask whether the collection of probability measures (along with the unit and multiplication operations) can be given a purely category-theoretic definition. For the reader interested in pursuing this direction, we suggest the recent paper [132]. Another possible approach is provided in the conference paper [123]; this construction appears to rely on the co-Yoneda lemma, which realizes every presheaf as a colimit of representable presheaves.
Chapter 4 | Categorical Logic and Relational Databases
4.1 Introduction
In this chapter we discuss a simple categorical formalism for defining databases in a mathematically rigorous way. Databases are ubiquitous in modern computing, and we demonstrate how the language of topos theory can be used to understand their structure. The traditional mathematical perspective on databases was developed by Edgar F. Codd at IBM and is known as relational algebra [23]. Earlier work has described many operations common in relational databases through the language of category theory. Topos theory is a natural setting for discussing variable sets and multi-valued logics; Rosebrugh and Wood use this viewpoint to discuss a dynamic view of databases [114]. In particular, database updates are modeled by indexing a collection of database objects by a topos, and non-Boolean logics for databases are explored through the lens of sheaf theory. Later work in this direction emphasized the role of sketches to formalize schemas, representing the data itself as a model of the sketch [49, 75]. Baclawski, Simovici, and White define databases as constructions within an arbitrary topos; in particular, they work out how selections, squeeze (elimination of duplicates), projections, joins, and Boolean operations can be performed as constructions inside an arbitrary topos [13]. It has also been shown how database models based on simplices can be used to support type-theoretic operations along
with database queries [51, 128]. More recent work has focused on representing concrete data like integers and strings [120]. The underlying model for SQL is based on multisets [53], while the relational algebra due to Codd is based on relations [23]. In this chapter, we create a multiset model similar to the one underlying SQL and show that this model is sufficient to express all of Codd's operations as constructions within the category of sets. Such a formulation is interesting because simple extensions of the underlying model allow us to express constructions outside the traditional relational model, such as outer joins. By focusing on concrete constructions within the category of sets, we can easily generalize to non-binary logics. Purely topos-theoretic models are unable to capture the logic of SQL, for instance, because the internal logic of any topos must be a Heyting algebra and the three-valued logic of SQL is not a Heyting algebra. Our model has the advantage that it is easily adapted to more general logics by substituting the two-element set, along with the operations AND, OR, and NOT, with their counterparts for a logic with more values. We also demonstrate how extensions allow us to account for null values and missing data, extending the discussion in [101]. Near the end of this chapter, we define a simplicial structure (Section 4.6.3) and a graph associated to a database schema, and prove a result relating properties of this graph to whether agreement on marginal tables is sufficient to ensure that the marginals can arise from a joint distribution on the full outcome space. In particular, we provide sufficient conditions for a joint table on the full column set to exist (Lemma 98, Proposition 101).
This result is foundational to the next chapter, where we attach an additional topological structure to this simplicial complex and use it to weaken the common assumption in statistics that the family of marginals under consideration arises as projections of a joint distribution on the full column space.
4.2 Data Tables
In this section, we discuss databases as constructions in the topos Set, since we will study random variables whose outcome space is a table space. This construction will be important in the next chapter as we discuss the extension of statistical concepts to databases. We will initially focus on the case of a database containing
a single table and on random variables whose outcome space is a table. In the next chapter, we will see how sheaf theory allows us to extend these ideas from single tables to databases whose tables contain overlapping columns. We construct an appropriate category of databases and show that all operations in the relational algebra exist for this category. Such a formulation allows us to develop mathematical models for databases consisting of multiple tables, or for more general databases in distributed systems. In future chapters, we will use this construction to aid in statistical modeling of databases. The relational model of a table in a database conceptualizes data as a two-dimensional frame called a relation. For example, a database of students may consist of columns containing each student's name, ID number, email, and phone number; each row in the table corresponds to the records of an individual student. Our goal is to build a model of such tables inside Set. In this section, we use the word relation as it is used by the database community; as will become apparent, this is not the same thing as a relation in mathematics. We will formalize this notion using several equivalent representations of multisets inside the category Set. The implementation of SQL involves a three-valued logic, adding a third value UNKNOWN to the standard TRUE and FALSE [53]. Topos theory is a framework that allows logics more general than Boolean logic, and the categorical framework of this chapter can help with the exploration of more general database logics. These more general logics have been explored previously in [13, 114]. However, the logic underlying SQL is not a Heyting algebra and thus cannot be represented as the internal logic of any topos. By relying on concrete constructions within Set, we are able to construct a framework that easily generalizes to other logics.
Moreover, in SQL there are three types of relations: stored relations, views, and temporary tables. Stored relations are the actual tables saved in the database management system. Views are relations which are constructed by computation. These are not stored after they are used. Temporary tables are constructed by the SQL language processor when it executes queries. After the query is executed and the appropriate modifications are made, these tables are removed from memory [53]. In order to represent the space of possible tables formed from a database schema, we need to discuss the possible operations that can be performed with a collection
of tables. Ultimately, this will lead us to a category representing the possible structures. In this section, we will see how such operations can be built up from a collection of simpler primitive operations. We can think of the intermediate diagrams as temporary tables.
4.2.1 Attributes
Tables are collections of data points containing multiple attributes. We first establish an attribute space for databases. An attribute is simply a descriptive label used to describe the entries in the corresponding column of a table. When visualizing a table, it is common for the attributes to be pictured in the top row. Thus, we can identify an attribute with a one-element set containing that label, e.g. {Student_ID}. Given a finite set of characters C (think of the collection of all Unicode characters, for instance), we can form the set of strings on C by defining

    S_C := ∐_{n∈ℕ} C^n.
In practice, the size of this set is finite due to memory limitations. Nevertheless, we can think of a particular attribute name as a point 1 → S_C. For example, a table containing students could have the following collection of attributes:

    A = {student_name, student_ID, email, phone} ⊂ S_C.

Thus, the attribute labels of a table can be identified with finite sub-objects of S_C, i.e. with monomorphisms E ↪ S_C where |E| < ∞.
4.2.2 Attribute Spaces (Data Types)
Each attribute tabulated in a table takes values in some space, which we will call an attribute space. To each attribute a ∈ A we associate a set X_a denoting the collection of possible values of the attribute a. For instance, if a is an attribute representing the number of red cards in a player's 5-card poker hand, then the attribute space could be taken to be the set of all non-negative integers less than or equal to five. In most languages, however, we would simply declare an integer type for this situation. More broadly, we can take the attribute space to be the outcome set of any random variable of interest.
4.2.3 Missing Data
We next account for missing data. Many real-world data sets contain records with missing values for certain columns, and we would like our models to account for such records. To model this, we can adjoin a singleton set {NA} to the output space of our random variable, i.e. we consider random variables with outcome space X ∐ {NA} (a coproduct in the category of measurable spaces). In SQL and Codd's relational model, NULL is used in place of NA; we choose NA because data frames in R represent missing records this way. One of the simplest models for missing data is data that is missing completely at random (MCAR), introduced in [58]: any statistical model naturally lifts to a model in which records are missing with some fixed probability. In the previous chapter, we introduced an endofunctor G on the category of measurable spaces Meas. We can use this endofunctor to extend a statistical model with MCAR data.
Lemma 95. G(X ∐ Y) = Conv(G(X), G(Y)) where

    Conv(G(X), G(Y)) := {αp + (1 − α)q | p ∈ G(X), q ∈ G(Y), α ∈ [0, 1]}.
Proof. The inclusion ⊃ is obvious. Let p ∈ G(X ∐ Y) and set α := p({0} × X), so that p({1} × Y) = 1 − α. If α = 0 or α = 1, the decomposition is immediate. Otherwise, define a probability measure p_X on X by p_X(B) := (1/α) p({0} × B), and a probability measure q_Y on Y by q_Y(B) := (1/(1 − α)) p({1} × B). Then p = α p_X + (1 − α) q_Y, which gives the desired decomposition of p.
Corollary. G(X ∐ {NA}) = Conv(G(X), G({NA})).
Another simple consequence of this observation is that statistical models can be extended to the coproduct X ∐ {NA} by taking a mixture model. This construction may not be appropriate in all circumstances, such as in the presence of censored data.

Remark 96. Any statistical model m : P → G(X) extends to a model m̃ : P × [0, 1] → G(X ∐ {NA}) via convex hulls (i.e. mixture models): m̃(p, α) = (1 − α) m(p) + α δ_NA.

In what follows, we use the shorthand X̃ to refer to X ∐ {NA}. Also, observe that there is a canonical inclusion i : X ↪ X̃. These observations will be important when we introduce measurable presheaves defined on contextual categories later in this dissertation. Recall that the space X̃ can be given the coproduct sigma algebra from any sigma algebra on X.
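The mixture-model extension of Remark 96 can be sketched concretely for finitely supported distributions, represented as Python dicts from outcomes to probabilities (the function name `mcar_extend` is our own, chosen for illustration):

```python
def mcar_extend(p, alpha, na="NA"):
    """Lift a distribution p on X to X + {NA} via the mixture
    (1 - alpha) * p + alpha * delta_NA, as in Remark 96."""
    out = {x: (1 - alpha) * px for x, px in p.items()}
    out[na] = alpha
    return out
```

Applied to a fair coin with a 20% missingness rate, the extension reweights the observed outcomes to 0.4 each and places mass 0.2 on NA, so total mass is preserved.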
4.2.4 Data Types
When designing programming languages to work with a database management system, we often need some finite collection of primitive data types from which to build the language. In the student database example, student_name and email can be regarded as strings while student_ID and phone can be regarded as integers. A string is simply a finite sequence of characters: if C is a set containing all valid characters, the space of strings can be defined as S := ∐_{n∈ℕ} C^n. Thus a string s is simply a point s : 1 → S. Other common data types include Booleans, bit strings, floats, dates, and times. Booleans are just elements of the set {TRUE, FALSE}. In many languages, such as SQL, the collection of Booleans is augmented to include UNKNOWN so that logical comparison operators can be appropriately defined for records containing NULL values. Bit strings are just elements of ∐_{n∈ℕ} {0, 1}^n. Floats are the computer approximations of real numbers; we ignore the implementation details of how these are represented in computer memory and allow our databases to have the real numbers as a data type. Dates and times can be seen as strings with specified formats, and so the collections of dates and times can be seen as sub-objects of the string space S. Most database languages require you to set a maximum length for data types like strings and bit strings when declaring a table. For instance, in SQL, creating a column containing strings requires you to specify an integer n representing the maximum length of an entry; thus, in SQL, CHAR(n) corresponds to ∐_{i=0}^{n} C^i. As such, we see that inside the topos of sets we can represent all the types of data commonly found in relational database management systems like SQL.
4.2.5 Column Spaces, Tuples, and Tables
4.2.5.1 Column Spaces
For each attribute a ∈ A, we have an associated attribute space X_a. As such, the column space of a collection of attributes is simply the product of the attribute spaces taken across all a ∈ A, i.e.

    X^A := ∏_{a∈A} X_a.
4.2.5.2 Records
Individual rows of the database are referred to as tuples. A tuple, or record, is simply a point in a column space, i.e. r : 1 → X^A. Continuing with the student example, our student table may contain a record such as
(Juan Batista, 24601, [email protected], 8675309) .
4.2.5.3 Tables
Tables are simply collections of multiple records, i.e. mappings out of a finite set, F, into a column space:

    t : F → X^A

represents a particular table. Note that any table determines a point in the product space indexed by F,

    t' : 1 → ∏_{f∈F} X^A.

Let n = |F| and define (X^A)^n := ∏_{i=1}^{n} X^A. With this notation, we can think of tables equivalently as points in a finite-dimensional product of attribute spaces, t : 1 → (X^A)^n. Note that there is a natural bijection Hom(F, X^A) ≅ Hom(1, ∏_{f∈F} X^A). Many implementations, such as pandas, allow you to summarize a table as a count of distinct values in the column space, in other words, as a set mapping

    t̃ : X^A → ℕ.
Note that these three perspectives are all ways of formalizing the notion of a multiset as a construction in set theory. We refer to t : [n] → X^A as the memory allocation representation of the table, to t : 1 → (X^A)^n as the point representation, and to t̃ : X^A → ℕ as the count-of-values representation. Note that a single count-of-values representation can arise from many different memory allocation or point representations of the same table. Intuitively, our definition of a table should not depend on the particular details of the set of memory addresses indexing our data. As such, we now construct an equivalence relation on both the memory allocation representation and the point representation so that these three representations become isomorphic, though not canonically so.
Let t_m : M → X be the memory allocation representation of some table. We can construct its count-of-values representation t_v by defining t_v(x) = |t_m⁻¹(x)|. Note that precomposition by any permutation σ : M → M, or more generally any isomorphism φ : M' → M, does not affect the count-of-values representation. As such, we define the memory allocation representation of t_m to be the equivalence class of maps into X, where two maps t_m : M → X and t'_m : M' → X are equivalent if there exists an isomorphism φ : M' → M such that t_m ∘ φ = t'_m. Following the same line of reasoning, we consider two point representations of a table, t : 1 → ∏_I X and t' : 1 → ∏_J X, to be equivalent if there exists an isomorphism φ : I → J such that the induced map π^φ : ∏_I X → ∏_J X satisfies π^φ ∘ t = t'. At this point all representations of a table are in bijective correspondence, so we can unambiguously move between representations as is convenient. In the rest of this section we focus primarily on the memory allocation representation, as it naturally mimics the idea of a database as a collection of values indexed at various memory addresses. From this point of view, it is easier to make connections with what occurs in computer memory as we manipulate a database.
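The three representations and their equivalence can be sketched in Python, with a memory allocation representation as a dict from addresses to records and the count-of-values representation built with `collections.Counter` (the toy records below are illustrative):

```python
from collections import Counter

# Memory allocation representation: a map from addresses to records.
t_mem = {0: ("a", 3), 1: ("b", 5), 2: ("a", 3)}  # duplicate rows allowed

def to_value_counts(t_mem):
    """Memory allocation representation -> count-of-values representation."""
    return Counter(t_mem.values())

def equivalent(t1, t2):
    """Two memory allocation representations are equivalent iff some
    bijection of addresses carries one to the other, i.e. iff their
    count-of-values representations agree."""
    return to_value_counts(t1) == to_value_counts(t2)

# Re-indexing the memory addresses yields an equivalent table.
t_mem2 = {10: ("b", 5), 11: ("a", 3), 12: ("a", 3)}
```

Here `equivalent(t_mem, t_mem2)` holds because only the addresses differ, reflecting the claim that a table should not depend on its indexing set.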
4.2.6 Primary Keys
A collection of attributes is said to be a key for a table if we do not allow two rows in the table instance to have the same values for all the attributes in the key. In our student example, we could say student_ID is a key if no two students are able to have the same ID number. In practice, many database management systems simply create a primary key for the users because many applications could easily have redundant rows. As our hope is to explore statistical properties of multiple distributed databases, we expect to have many entries with identical features. We can model this behavior by adding an additional column to a table representing a key or index. A table may contain multiple rows with the same data. For instance, if we are tabulating IID copies of a binary random variable, we need to allow for multiple zeroes and ones in our database. This can be remedied by requiring the rows be labeled by an index (aka primary key). For example, we can take I = N. We can think of a row in a database with a primary key as an element i × x : 1 → I × X.
A table is typically thought of as a finite relation T ⊂ I × X such that (i, x₁) ∈ T and (i, x₂) ∈ T implies x₁ = x₂. This ensures each primary key is matched with only one data point. To emphasize the finite nature of the relation, we will think of a table as a mapping t : [n] → I × X. In the future, we will absorb any primary keys, if they exist, into the existing column space.
4.2.7 Versioning
In many situations, we want to time-stamp or version the records in a table. Date-times are a common choice of time-stamp, employed in both R and the pandas package for Python. We could also use a counter representing a version number, as is common with software releases. What is required of the time-stamps or versions is that the collection of these objects forms a poset. A more complex method of versioning is the TrueTime model introduced by Corbett et al. at Google [24]. In this framework, events are time-stamped by intervals [t_s, t_e] and we are guaranteed that the true UTC time is a point in the interval. The collection of such intervals can be given a poset structure as follows: [t_{s₁}, t_{e₁}] < [t_{s₂}, t_{e₂}] if and only if t_{e₁} < t_{s₂} as real numbers. We can log time-stamps in a database by having a column whose values are tuples representing the start and end times of the given interval. In versioned databases that also employ an index, it is common to use the product I × V as a primary key for the entries.
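The TrueTime partial order on uncertainty intervals can be sketched as follows (a minimal illustration; `tt_before` is our own name, and intervals are plain tuples):

```python
def tt_before(iv1, iv2):
    """Strict partial order on TrueTime-style uncertainty intervals:
    [s1, e1] < [s2, e2] iff e1 < s2, i.e. the first interval certainly
    ends before the second begins. Overlapping intervals are incomparable."""
    (s1, e1), (s2, e2) = iv1, iv2
    return e1 < s2

a, b, c = (0.0, 1.0), (2.0, 3.0), (0.5, 2.5)
```

Here a < b, while a and c overlap and are therefore incomparable, which is exactly why this order is a poset rather than a total order.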
4.3 Relational Algebra on Tables
In the previous section, we saw how tables can be represented inside the category of sets. As tables are represented by morphisms from a finite set into a set representing the possible values of the columns, we know, since Set admits all finite limits and colimits, that many standard categorical constructions such as products, coproducts, pushouts, and pullbacks exist for tables. In this section, we show how to represent the five primitive operations (selection, projection, products, unions, and differences) of Codd's relational algebra as constructions inside the topos of sets. Our goal is to show that the construction of tables inside Set has the same expressive power as Codd's original construction. An alternative way of obtaining this result would be to apply the abstract formulation of [13] to the specific topos Set, which would reduce the problem to expressing Codd's original operations in terms of the primitive operations used by those authors. We choose instead to recover Codd's constructions directly in the topos Set so as to avoid introducing another collection of primitive operations.
4.3.1 Products
Given two tables t1 : N → A and t2 : M → B, we can form a product table
t1 × t2 : N × M → A × B. This construction takes all possible combinations
of records in t1 and t2. However, this operation alone is not useful for many of the operations we want to do with real databases because we typically want to eliminate redundant features such as when two columns overlap. Nevertheless, as Codd showed, more complex constructions on tables can be built out of these simple primitive operations [23]. We will discuss some of these operations in later sections after showing that we can construct all five primitive operations.
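The product of two tables can be sketched concretely, pairing every record of one table with every record of the other (the toy tables and names below are illustrative):

```python
def table_product(t1, t2):
    """Product of tables t1 : N -> A and t2 : M -> B, giving
    t1 x t2 : N x M -> A x B (records are tuples, concatenated)."""
    return {(n, m): r1 + r2 for n, r1 in t1.items() for m, r2 in t2.items()}

students = {0: ("Juan",), 1: ("Ana",)}
courses = {0: ("Math",), 1: ("Stat",)}
```

With two records in each table, the product has 2 × 2 = 4 records, mirroring the index set N × M.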
4.3.2 Projection
Projection involves creating a new table on a subset of the columns of our table. Recall that each table t : M → X^A has an associated column space which decomposes as X^A = ∏_{a∈A} X_a. If S ⊂ A is a subset of the attribute set of our table, there is a corresponding projection operator π^A_S : X^A → X^S. The composition π^A_S ∘ t is referred to as the projection of the table onto the attribute space S.
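Post-composing a table with the projection π^A_S can be sketched as follows (the attribute list and record layout are illustrative):

```python
def project(table, attrs, subset):
    """Post-compose a table t : M -> X^A with pi^A_S : X^A -> X^S,
    where attrs lists the columns A in order and subset picks S."""
    idx = [attrs.index(a) for a in subset]
    return {m: tuple(row[i] for i in idx) for m, row in table.items()}

students = {0: ("Juan", 24601, "[email protected]"),
            1: ("Ana", 7, "[email protected]")}
```

Projecting the student table onto ("name", "email") keeps the index set M fixed and simply forgets the dropped coordinates of each record.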
4.3.3 Union
Intuitively, unions correspond to concatenating two tables defined on a common column space. Categorically, unions can be seen as a coproduct construction in the category of sets: given two tables t₁ : N → X and t₂ : M → X, the union is given by the coproduct of t₁ and t₂, i.e. t₁ ⊕ t₂ : N ∐ M → X.
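The coproduct of index sets can be sketched by tagging each address with the table it came from, so the two index sets remain disjoint even when addresses collide (names below are our own):

```python
def table_union(t1, t2):
    """Union as the coproduct t1 + t2 : N + M -> X: concatenate the
    tables, tagging addresses to keep the index sets disjoint."""
    out = {(0, n): r for n, r in t1.items()}
    out.update({(1, m): r for m, r in t2.items()})
    return out
```

Note that, unlike SQL's UNION (which deduplicates), this coproduct keeps duplicate records, matching the multiset semantics developed in this chapter.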
4.3.4 Selection
Selection is the most complicated of the five primitive operations. Intuitively, selection allows us to determine the collection of records in a table satisfying certain properties, such as selecting all users in a table whose age is greater than 30. These operations naturally identify selections with certain sub-objects of our original table. In this subsection, we explore this idea in depth. The principle of comprehension in set theory allows us to form sets consisting of all elements which satisfy a certain property φ(x). A naive approach to set theory easily leads to contradictions such as Russell's paradox. The Zermelo-Fraenkel approach to set theory resolves this via restricted comprehension, i.e. by requiring the proposition φ(x) to apply only to members of a particular set A. In this section, we discuss a comprehension principle for databases; in the language of database theory, this is referred to as selection. When querying a database, we often want to return all records meeting some specified criteria. In particular, this requires a way of thinking about propositional formulas involving the logical operators AND, OR, and NOT. Let A_i and A_j be attributes. If A_i or A_j is a categorical attribute, let θ ∈ {=, ≠}; otherwise, let θ ∈ {<, ≤, =, ≠, ≥, >}. Then A_i θ A_j, or A_i θ x with x ∈ ℝ, determines a mapping θ : D → 2. Given such a mapping and a particular database d : [n] → D, we can consider the set d⁻¹(θ⁻¹(1)). Since this is a subset of [n], there is a natural inclusion d⁻¹(θ⁻¹(1)) ↪ [n]. Thus we can form a new database, d_θ : d⁻¹(θ⁻¹(1)) → D, representing the selection of those entries for which the proposition is true. More complex queries can be formed using the logical operators AND, NOT, and OR. Before discussing these constructions, we review the topos-theoretic perspective on the logical operators of conjunction, disjunction, and negation. A more detailed discussion of classical logic from the perspective of topos theory can be found in [60]. Recall that negation is the arrow ¬ : 2 → 2 such that the square below is a pullback:

    1 --⊥--> 2
    |        |
    !        ¬
    v        v
    1 --⊤--> 2

Similarly, conjunction is the arrow ∩ : 2 × 2 → 2 such that the square below is a pullback:

    1 --⊤×⊤--> 2 × 2
    |            |
    !            ∩
    v            v
    1 ----⊤----> 2

Finally, disjunction has a more complex categorical description. From simple truth tables we know ∪ : 2 × 2 → 2 should be the characteristic map corresponding to the sub-object E = {(1, 1), (1, 0), (0, 1)}. First notice that this sub-object can be decomposed as E₁ ∪ E₂ where E₁ = {(1, 1), (1, 0)} and E₂ = {(1, 1), (0, 1)}. This is important because E₁ can be identified with the monic mapping ⟨⊤, 1⟩ : 2 → 2 × 2 and E₂ with the monic mapping ⟨1, ⊤⟩ : 2 → 2 × 2. We can then form the induced map f out of the coproduct of these two mappings,

    2 --> 2 + 2 <-- 2
           |
           f
           v
         2 × 2,

and observe that im(f) = E. Thus, by the canonical decomposition of a set mapping into a surjection followed by an injection, we can identify E up to unique isomorphism. In order to form more complex selection queries based on multiple binary operations θ : D → 2 and θ' : D → 2, we can combine the binary operations using negation, conjunction, and disjunction. For instance, to represent the selection corresponding to θ ∧ θ', we can consider the map (θ × θ') : D × D → 2 × 2 followed by ∩ : 2 × 2 → 2. Hence, the selection can be represented by the database

    [n] --d×d--> D × D --θ×θ'--> 2 × 2 --∩--> 2.
By forming larger products, we can represent more complex selection queries. These constructions are all guaranteed to work because Set admits finite limits and colimits. Gathering the most recent version of the entries in a table can be seen as another type of selection operator. Given a table d : [n] → A = I × V × 2 × ∏_{i∈I} X_i, we can define a binary relation θ : (I × V) × (I × V) → 2 by

    θ((i, v), (i', v')) = 0 if i = i' and v ≤ v', and 1 otherwise.

Clearly θ lifts to a map on A × A. Hence, by using the selection criteria above with this binary operation, we obtain a new table containing only the most recent entries.
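The selection construction and its closure under AND, OR, and NOT can be sketched with predicates playing the role of the maps θ : D → 2 (the uppercase combinator names are our own, chosen to avoid Python keywords):

```python
def select(table, pred):
    """Selection: restrict t : [n] -> D to the subset d^{-1}(theta^{-1}(1))."""
    return {m: row for m, row in table.items() if pred(row)}

def AND(p, q):  # mirrors the conjunction arrow 2 x 2 -> 2
    return lambda row: p(row) and q(row)

def OR(p, q):   # mirrors the disjunction arrow 2 x 2 -> 2
    return lambda row: p(row) or q(row)

def NOT(p):     # mirrors the negation arrow 2 -> 2
    return lambda row: not p(row)

people = {0: ("a", 35), 1: ("b", 25), 2: ("c", 40)}
```

For example, `select(people, AND(lambda r: r[1] > 30, NOT(lambda r: r[0] == "c")))` returns the sub-table of rows over 30 whose name is not "c", with the surviving addresses inherited from the original index set.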
4.3.5 Difference
Informally, the difference of two tables t and t' is the collection of records that belong to one table but not the other. If t : M → X is any table and t' : N → X is another table on the same attribute space, then the difference t \ t' can be identified by first considering the sub-object E ↪ M defined by E = {m ∈ M | t(m) ∉ t'(N)}. Thus, the difference can be formed by taking the selection operator corresponding to the characteristic function of this sub-object. At this point, we have expressed all five primitive operations as constructions within the category of sets. As such, we know that we can express more complex operations on tables, such as equijoins, by chaining several of these primitive operations.
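The difference as a selection on the sub-object E ↪ M can be sketched directly (names are illustrative):

```python
def table_difference(t1, t2):
    """Difference t1 \\ t2: keep the addresses m in M whose record t1(m)
    does not lie in the image of t2, i.e. select on the sub-object E."""
    image = set(t2.values())
    return {m: row for m, row in t1.items() if row not in image}
```

Note that this removes every copy of a record that appears in `t2`, which is the selection-based reading above; SQL's multiset EXCEPT ALL instead subtracts multiplicities.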
4.4 Some Additional Operations on Tables
In order to construct a category of tables, we need to define morphisms between tables. Before giving a general definition, we explore several common constructions with individual tables and see how to express these notions via category theory.
4.4.1 Addition & Deletion
Let t : N → X be a table and let t' : 1 → X be a record. We can add t' to t by taking the union of t and t' as discussed in the last section. Equivalently, if t̃ : 1 ∐ N → X is the resulting table, we say that t can be obtained from t̃ by deleting a record. These equivalent situations can be represented by the following commutative diagram:

    N -----> N ∐ 1 <----- 1
      \        |        /
       t       t̃      t'
        \      v      /
         `---> X <---'
4.4.2 Editing Records
Entries in a database can also be modified. When a row is changed, we want to preserve its primary key, update its version (i.e. replace the time-stamp t with t' where t ≤ t'), and change the value x to x'. To modify the j-th row to have time-stamp t' and value x', we can introduce a modification map defined as follows:

    mod_j^{(t', x')}(i, t, x) = (i, t, x) if i ≠ j, and (i, t', x') if i = j.
This results in the following commutative diagram:
    [n] -----id-----> [n]
     |                 |
     t₁                t₂
     v                 v
    I × V × ℝ --mod_j^{(t',x')}--> I × V × ℝ.
4.4.2.1 Rename
Let A and B be two attribute sets. An isomorphism ρ : A → B induces a commutative diagram

    M --t--> X^A
      \       |
       t'     π_ρ
        \     v
         `-> X^B

where the vertical map is defined by π_ρ(x_a) = x_{ρ(a)}. Such a diagram can be interpreted as a renaming of the columns of the original table.
Example 97. Consider the following two tables:
    I  X        J  X
    a  3        0  3
    b  5        1  5
    c  7        2  7
We can view the second table as a re-indexing of the first table where the re-index map r : {a, b, c} → {0, 1, 2} is defined by r (a) = 0, r (b) = 1, and r (c) = 2. This re-indexing can be seen as a pair of morphisms (r, 1) where r is a mapping between the attribute spaces of the corresponding tables and 1 is the identity mapping between column spaces.
4.4.2.2 Imputation
Another special case of editing records is imputing missing data. When preparing data for model fitting, an analyst must decide what to do with records containing missing entries. If very few records contain missing entries, the analyst may simply choose to drop those records, dismissing them as measurement error. Other common techniques replace the missing values with the mean, the median, or some other fixed value. This can be seen as a map which takes NA to the chosen value x₀ and is the identity on every other outcome. Other imputation schemes attempt to predict the missing values based on some statistical model for the missing entries [115, 116]. As such, these determine a collection of modifications replacing the missing values at the individual indices with their predicted values.
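Mean imputation, viewed as the map sending NA to a chosen value x₀ and fixing everything else, can be sketched as follows (using Python's None as the stand-in for the adjoined point NA):

```python
NA = None  # stand-in for the adjoined point {NA}

def impute_mean(column):
    """The map X + {NA} -> X sending NA to the mean of the observed
    values and acting as the identity on every other outcome."""
    observed = [x for x in column if x is not NA]
    mean = sum(observed) / len(observed)
    return [mean if x is NA else x for x in column]
```

Median or fixed-value imputation is the same map with a different choice of x₀.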
4.4.3 Merging Overlapping Records
In the next chapter, we will need to discuss merging data frames which agree on their overlapping columns. In this section, we discuss how to view this as a construction on table categories. A particular record can be viewed as a point in a column space: p : 1 → X. Imagine we have two records p1 : 1 → X × Y and p2 : 1 → Y × Z. When we say that these two records agree on their overlapping columns, what we mean is that the projections onto their overlapping column spaces agree. As such, we can understand joining two records as an instance of the pullback operation. Saying two records agree on their overlapping columns is equivalent to saying the diagram below commutes:
    1 --p₁--> X × Y
    |           |
    p₂         π_Y
    v           v
    Y × Z --π_Y--> Y.

The universal property of pullbacks implies that there exists a unique p : 1 → (X × Y) ×_Y (Y × Z) ≅ X × Y × Z such that the diagram below commutes:

    1
    | \ p₁
    p  \
    v   v
    X × Y × Z ----> X × Y
      |               |
      |              π_Y
      v               v
    Y × Z ---π_Y---> Y

where the composite 1 → X × Y × Z → Y × Z is p₂.
By using successive record-wise joins, we can join data frames together. Imagine two synchronized computational agents tabulating partial observations from a random experiment. In this situation, the time-stamp could be used as a key for our merge operation. Assuming perfect synchronization of the clocks of the different agents, we can again invoke the universal property of pullbacks to construct merged records.
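The pullback-style merge of two records agreeing on their overlapping columns can be sketched as follows (column names and the function name are illustrative):

```python
def merge_records(r1, cols1, r2, cols2):
    """Merge p1 : 1 -> X x Y and p2 : 1 -> Y x Z into the unique record
    over the combined columns, provided the projections onto the overlap
    agree; return None when no merge exists."""
    d1 = dict(zip(cols1, r1))
    d2 = dict(zip(cols2, r2))
    overlap = set(cols1) & set(cols2)
    if any(d1[c] != d2[c] for c in overlap):
        return None  # the square does not commute: no pullback point
    merged = {**d1, **d2}
    cols = list(cols1) + [c for c in cols2 if c not in set(cols1)]
    return tuple(merged[c] for c in cols)
```

Uniqueness of the merged record mirrors the uniqueness clause in the universal property of the pullback: once the overlap agrees, the combined record is determined.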
4.4.3.1 Table Morphisms
Given these possible ways to modify a database, we can now create a subcategory of tables T inside Set. The objects in this category are tables, and a morphism f : t → t' between tables t : N → X^A and t' : M → Y^B is given by a pair of maps (σ, f) where σ : N → M and f : X^A → Y^B make the square below commute:

    N --t--> X^A
    |         |
    σ         f
    v         v
    M --t'--> Y^B.
4.4.4 Non-Binary Logics
The implementation of SQL uses a three-valued logic to handle missing data [53]. Truth tables for this logic are displayed below; we use the shorthand U for SQL's UNKNOWN.
    ¬a:   ¬0 = 1,   ¬U = U,   ¬1 = 0

    a ∧ b             a ∨ b
    a\b  0  U  1      a\b  0  U  1
    0    0  0  0      0    0  U  1
    U    0  U  U      U    U  U  1
    1    0  U  1      1    1  1  1
By inspection of the tables above, the negation of U fails to satisfy U ∧ ¬U = 0 (indeed U ∧ ¬U = U), and thus the three-valued logic implemented in SQL is not a Heyting algebra. This suggests that database theory based on category theory should step outside the scope of topos theory into more general areas of categorical logic. In order to replicate the functionality of SQL, we need only replace the two-element set used previously with the three-element set {0, U, 1} and the corresponding conjunction, disjunction, and negation operations from the tables above.
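The truth tables above can be transcribed directly, which also makes the failure of U ∧ ¬U = 0 easy to check mechanically (a minimal sketch; the function names are our own):

```python
U = "U"  # SQL's UNKNOWN; 0 and 1 play the roles of FALSE and TRUE

def not3(a):
    return {0: 1, 1: 0, U: U}[a]

def and3(a, b):
    if a == 0 or b == 0:
        return 0
    if a == U or b == U:
        return U
    return 1

def or3(a, b):
    if a == 1 or b == 1:
        return 1
    if a == U or b == U:
        return U
    return 0
```

Swapping these three operations in for Boolean AND, OR, and NOT is exactly the substitution described in the text for replicating SQL's selection semantics.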
4.5 Random Tables and Random Databases
In order to perform statistical analysis on tables, we need to be able to view them as outcome spaces of some random variable. In this section, we discuss augmenting the column spaces with the structure of a measurable space. This connects this chapter to earlier chapters discussing sheaves of random variables. In this section, we choose to use the point representation of tables because this representation is the most similar to the notation used by statisticians.
4.5.1 Random Tables
If each attribute space X_a is given the structure of a measurable space by equipping it with a sigma algebra, then the joint outcome space X^A = ∏_{a∈A} X_a can be endowed with a sigma algebra by taking the product sigma algebra. Again taking product sigma algebras, we can endow a table space ∏_{n∈N} X^A with a sigma algebra structure. Given some probability space (Ω, F, ρ), a measurable mapping R : Ω → ∏_{n∈N} X^A is said to be a random table.
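A draw from a random table can be simulated by sampling one outcome per cell of the product space; a minimal sketch, with hypothetical attribute spaces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical column spaces: each attribute gets a finite outcome set.
attributes = {"A": [0, 1], "B": ["x", "y", "z"]}
n_records = 4  # |N|: number of rows in the table

# One draw from R : Omega -> prod_{n in N} X^A samples every
# (row, attribute) cell of the product space independently here.
table = {a: rng.choice(vals, size=n_records).tolist()
         for a, vals in attributes.items()}
print(table)
```

Of course, a general random table need not have independent cells; this merely illustrates the outcome space.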
4.5.2 Giry Monad Applied to Tables
Given a table space with a sigma algebra structure, we can apply the Giry monad to the table space to recover the collection of all probability measures on the table space. If all X_a are standard Borel spaces, then the product space is also standard Borel, and thus so is its image under the Giry monad. We refer to a table 1 → ∏_{n∈N} G(X^A) as a Giry table or Giry data frame. In a later chapter, we will discuss techniques for imputing missing data and merging conflicting data which rely on these Giry tables.
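A minimal sketch of the idea: each cell of a Giry data frame holds a probability measure on the attribute's outcome space rather than a point value (the dict encoding of finitely supported measures and the example records are our own):

```python
# A sketch of a "Giry data frame": each cell is a finitely supported
# probability measure, encoded as a dict mapping outcomes to weights.
def dirac(x):
    # A fully observed value is the Dirac measure at that value.
    return {x: 1.0}

giry_table = [
    {"A": dirac(0), "B": dirac("x")},           # fully observed record
    {"A": {0: 0.5, 1: 0.5}, "B": dirac("y")},   # A missing: mass is spread
]

# Every cell is a normalized measure on its attribute's outcome space.
for row in giry_table:
    for cell in row.values():
        assert abs(sum(cell.values()) - 1.0) < 1e-9
print("all cells normalized")
```

A missing entry becomes a non-degenerate measure rather than an NA token, which is the representation used for imputation later.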
4.5.3 Random Databases
A database, D, is simply a finite collection of tables D = {t1, . . . , tk}. The attribute space, A_D, of the database is the union of the attribute spaces of the tables. A random database can be obtained by considering a random variable
whose outcome space is ∏_{n∈N} X^{A_D}. By projecting onto the various outcome spaces of the tables, we obtain the database representation of the random sample. In the next chapter, we will discuss properties of reconstructing global samples. In particular, we will see that requiring any pair of tables to agree on their overlapping columns is insufficient to ensure that tables can be joined together. Arbitrary probability distributions on the outcome spaces of the tables do not necessarily arise from a global probability distribution even if the overlapping marginals are compatible. The problem of reconstructing probability measures from their marginal distributions has been discussed in [62, 131]. Wang has also provided conditions for compatibility of marginals for undirected graphical models [140]. In this section, we will provide sufficient conditions for when a
collection of tables with overlapping projections onto their shared column space can be seen as projections of a global table. The general problem of determining whether or not a collection of tables arise as the projection of a table on the joint column space is known to be NP-complete [68]. As an example, consider a table with three attributes A, B, and C. The outcome space of each attribute is {0, 1}. The following collection of probability distributions agree on their overlapping marginals but fail to arise as marginal distributions of a joint distribution on the outcome space X_A × X_B × X_C :
              A = 0   A = 1
  P (B = 0)    1/2      0
  P (B = 1)     0      1/2

              B = 0   B = 1
  P (C = 0)    1/2      0
  P (C = 1)     0      1/2

              A = 0   A = 1
  P (C = 0)     0      1/2
  P (C = 1)    1/2      0
Although these distributions can not be seen as the marginal distributions of some probability distribution on the full outcome space, this type of structure is possible in many data collection scenarios. Morton showed how this phenomenon can arise from missing data or databases employing versioning techniques like snapshot isolation [101]. In the next chapter, we discuss the representation of random variables of this form and discuss how to extend statistical theory into this setting.
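The failure claimed above can be verified mechanically: assembling the marginal constraints into a linear system and asking a solver for any nonnegative joint distribution reports infeasibility. A sketch, assuming scipy is available:

```python
import itertools

import numpy as np
from scipy.optimize import linprog

# Atoms p(a, b, c) of a putative joint distribution on three binary variables.
atoms = list(itertools.product([0, 1], repeat=3))

def marginal_row(pair, values):
    # Indicator row: selects atoms whose projection onto `pair` equals `values`.
    i, j = pair
    return [1.0 if (atom[i], atom[j]) == values else 0.0 for atom in atoms]

# A = B and B = C perfectly correlated, A and C perfectly anti-correlated,
# exactly as in the three marginal tables above.
specs = [((0, 1), {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}),
         ((1, 2), {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}),
         ((0, 2), {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0})]

A_eq, b_eq = [], []
for pair, table in specs:
    for values, prob in table.items():
        A_eq.append(marginal_row(pair, values))
        b_eq.append(prob)
A_eq.append([1.0] * len(atoms))  # total probability is 1
b_eq.append(1.0)

# Feasibility problem: any nonnegative solution would be a valid joint.
res = linprog(c=np.zeros(len(atoms)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * len(atoms))
print(res.success)  # False: no joint distribution has these marginals
```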
4.6 Topological Aspects of Databases
4.6.1 Simplicial Complex Associated to a Database
A database is simply a collection of tables. Each table has an attribute space. The attribute space associated to a database can be formed by taking the union of the attribute spaces of the tables constituting the database. The column space of each table determines an abstract simplicial complex by associating a vertex to each column in the table. The maximal face associated to the table is given by the set of column attributes. The abstract simplicial complex condition requires that all subsets of the maximal face be included in the simplicial structure. Intuitively, this corresponds to the collection of tables that can be formed by dropping a subset of columns from the original table. The simplicial complexes associated to each table can be glued together along their overlapping column set. This determines an abstract simplicial complex for the entire database.
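The construction just described can be sketched directly (a sketch; the function names are ours):

```python
from itertools import chain, combinations

def faces(columns):
    # All nonempty subsets of a table's column set: the faces of its simplex,
    # i.e. the tables obtainable by dropping a subset of columns.
    cols = sorted(columns)
    return {frozenset(s) for s in chain.from_iterable(
        combinations(cols, r) for r in range(1, len(cols) + 1))}

def database_complex(tables):
    # Glue the simplices of all tables along shared faces: take the union.
    complex_ = set()
    for columns in tables:
        complex_ |= faces(columns)
    return complex_

K = database_complex([{"A", "B"}, {"B", "C"}, {"A", "C"}])
# All vertices and edges appear, but not the 2-face {A, B, C}:
print(frozenset({"A", "B", "C"}) in K)  # False
```

The final example is the hollow triangle that reappears in Example 103 below.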
4.6.2 Contextuality
Contextuality is a phenomenon first observed in quantum physics whereby the outcome one observes in a measurement depends upon the other measurements taking place. Mathematically, this corresponds to the fact that in quantum mechanics observables are represented by operators on a Hilbert space, and two operators are simultaneously observable if and only if they commute. We believe contextuality can arise in databases whenever the simplicial complex associated to the database is non-contractible. First, we can start with a simpler observation.
Lemma 98. If two tables agree on their overlapping counts, then there exists a join of the tables.
Proof. Construct a total order on the values of the overlapping columns of the two tables. Sort each table according to this total order. We can now work inductively. If there is only one entry, the tables agreeing on their overlapping counts means that each entry has the same value on the overlapping columns. As such, there is only one choice to be made when joining the records. For tables with multiple entries, we can simply merge the sorted rows by matching the indices in the enumeration implied by the total order.
Remark. In general, there may be multiple ways to join two tables together, as the next example shows. Example 99. Consider the following tables:
  A  B        B  C
  0  0        0  a
  1  0        0  b
  2  0        0  c
  3  0        0  d

In this situation, there are 4! ways to join the two tables along B. Repeatedly applying binary joins allows us to construct a global table which projects onto each marginal distribution. This procedure can go wrong if we add a table that has non-empty intersection with multiple tables in the group in such a way that the simplicial complex associated to the tables is not contractible.
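The count 4! can be verified by enumerating the bijections between matching rows (a sketch using the tables above):

```python
from itertools import permutations

t1 = [(0, 0), (1, 0), (2, 0), (3, 0)]          # columns (A, B)
t2 = [(0, "a"), (0, "b"), (0, "c"), (0, "d")]  # columns (B, C)

# Every row of t1 and every row of t2 has B = 0, so any bijection between
# the two row sets yields a valid join along B.
joins = [[(a, b, c) for (a, b), (_, c) in zip(t1, perm)]
         for perm in permutations(t2)]
print(len(joins))  # 24 = 4! distinct joins along B
```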
Definition 100. Let t1, . . . , tk be a collection of tables and let C1, . . . , Ck denote their respective column sets. The schema graph associated to this database is the graph whose nodes n1, . . . , nk are given by their respective tables t1, . . . , tk. The edges of this graph are given by the pairs (i, j) such that Ci ∩ Cj ≠ ∅. With this definition, we can establish sufficient conditions for the constraint satisfaction problem to admit a solution.
Proposition 101. Let t1, . . . , tk be a collection of tables. If the schema graph associated to t1, . . . , tk is connected and acyclic then the constraint satisfaction problem has a non-empty solution set.
Proof. We proceed by induction on k. The base case k = 1 is trivial. The case k = 2 is established by the previous lemma. By relabeling the nodes if necessary, we may assume without loss of generality that node k is a boundary node, i.e. a node with exactly one incident edge. Such a node must exist because a finite connected graph in which every node has two or more incident edges must contain a cycle. By the previous lemma, there is a join between this node and its neighbor. By definition, this join projects onto the two marginal tables and so will obey the condition of agreeing overlaps with any other tables with edges incident on either of the original nodes. By contracting the edge between these two nodes, we reduce the number of nodes by one and may thus apply the inductive hypothesis.
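The hypothesis of Proposition 101 can be checked mechanically: build the schema graph from the column sets and test whether it is a tree (connected and acyclic). A sketch, with function names of our own:

```python
def schema_graph(column_sets):
    # Nodes are table indices; edge (i, j) whenever the column sets overlap.
    n = len(column_sets)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if column_sets[i] & column_sets[j]}

def is_tree(n, edges):
    # A graph on n nodes is a tree iff it has n - 1 edges and is connected.
    if len(edges) != n - 1:
        return False
    parent = list(range(n))  # union-find for connectivity
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)}) == 1

path = [{"A", "B"}, {"B", "C"}, {"C", "D"}]   # path schema: joinable
cycle = [{"A", "B"}, {"B", "C"}, {"A", "C"}]  # cyclic schema: may fail
print(is_tree(3, schema_graph(path)), is_tree(3, schema_graph(cycle)))
```

The path schema satisfies the proposition; the cyclic schema does not, matching the examples that follow.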
Example 102. Suppose we have three tables whose column sets are {A, B}, {B, C}, and {C, D}, respectively. Then these tables admit a join as long as the first two tables agree on their projection onto B and the last two tables agree on their projection onto C. To see this, we can first join {A, B} and {B, C}. Again, there are potentially multiple solutions to the join problem. Based on the result of the first operation, we can join the resulting table to {C, D}, which is again possible because these tables agree on their overlapping counts of the values of C.
Example 103. Consider three tables whose column sets are {A, B}, {B,C}, and {A, C}, respectively. The following collection of tables admits no join:
  A  B        B  C        A  C
  0  0        0  0        0  1
  1  1        1  1        1  0
Note the simplicial complex associated to these tables is given by:
      A
     / \
    B---C.
4.6.3 Topology on a Database
An abstract simplicial complex has a natural poset structure given by subset inclusion. There are two canonical topologies associated to a poset: the topology generated by taking upper sets as open sets, and the topology generated by taking lower sets as open sets [139]. If (P, ≤) is a poset, then a set U ⊂ P is said to be a lower set if for all x ∈ U, y ≤ x implies y ∈ U. Thus, given a database schema, we can construct an abstract simplicial complex representing the overlap between tables in the database, and we topologize this complex by taking the lower sets to be the open sets. This endows our abstract simplicial complex with the structure of an Alexandroff topology (arbitrary intersections of open sets are again open). In the next chapter, we will see how to think about statistical inference on such databases.
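A set of faces is open in this topology exactly when it is a lower set; a small sketch checking this condition (the function name and example complex are ours):

```python
def is_lower_set(U, faces):
    # U is open in the lower-set (Alexandroff) topology iff it is closed
    # under passing to smaller faces within the complex.
    return all(G in U for F in U for G in faces if G <= F)

faces = {frozenset(s) for s in
         [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"B", "C"}]}

# The "closure" of the edge {A, B} (the edge and its vertices) is open;
# the star of B (all faces containing B) is an upper set and is not.
closure = {frozenset({"A", "B"}), frozenset({"A"}), frozenset({"B"})}
star_B = {F for F in faces if "B" in F}
print(is_lower_set(closure, faces), is_lower_set(star_B, faces))
```

Expected: the closure is open, the star is not — the two canonical topologies on the poset make opposite choices here.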
4.7 Relationship Between Topological Structure of a Schema and Contextuality
We have seen that schemas which contain cycles can produce collections of tables which agree on all marginal overlaps but fail to admit a global glueing of the tables. Future research could investigate how topological properties of the database schema affect the potential for contextuality. We present a few conjectures in this direction below.
Conjecture. Let T1, T2, . . . , Tk be a collection of tables whose associated database schema is contractible. If the tables agree on their overlapping column sets, then there is a global join of the tables T1, . . . , Tk.
Note that Proposition 101 is a weaker form of this conjecture. As an example of a database whose simplicial complex is contractible yet whose schema graph is cyclic, consider four binary random variables A, B, C, and D and a database consisting of tables whose column sets are {A, B, C}, {B, C, D}, and {A, C, D}. The simplicial complex is a triangular pyramid with one face removed, while the schema graph is simply the cyclic graph on 3 nodes. Another natural question is whether any schema which is not contractible necessarily admits a collection of tables which fail to admit a global glueing. The above conjecture claims that holes (or higher-dimensional analogs thereof) are necessary for tables to fail to admit a glueing; is this condition also sufficient? In chapter five, we discuss how the space of contextual probability distributions can be realized as a certain linear subspace and also how the collection of classical probability distributions is a linear subspace of this space. One approach to investigating this conjecture could be to analyze the linear algebra of these subspaces and investigate whether or not the constraint equations resulting from closing a hole create an inconsistency in these sets of linear equations.
Chapter 5 | Contextual Statistics
5.1 Introduction
In chapter three, we saw that the collection of quasi-Borel presheaves on a sample space category has the structure of a sheaf with respect to the atomic Grothendieck topology on its underlying sample space category. In this chapter, we will see how replacing objects with presheaves and measurable mappings with natural transformations allows us to extend statistical concepts to a distributed measurement scenario in a natural way. Contextuality is a mathematical property in quantum mechanics arising from Bell’s theorem. It is the phenomenon by which the value of a measurement depends on the other measurements being performed simultaneously. Contextuality has been formalized in the language of sheaf theory in [1–5] and has also been shown to arise in cognitive science by [104]. Morton showed that contextuality can also arise in many data collection scenarios such as those involving missing data or versioning in a distributed database employing snapshot isolation [101]. Traditional statistical methodology assumes that data is already arranged in a flat and tidy manner [143]. As such, the ways that preprocessing affects the statistical properties of distributions is not typically discussed in most textbooks. The typical workaround when such assumptions fail is first cleaning the data by matching disparate sources and combining them into a single data frame. Choices are made about how to impute values or whether or not to throw away missing records. However, these methods introduce additional implicit assumptions which may not always be warranted as the models themselves assume missing data is impossible and that the pipeline
transforming the data does not affect subsequent analysis. In fact, many common techniques such as filling missing records by their column mean induce clear biases in the estimation of higher order moments of our data.

Modern data sets have a more complex structure than assumed in a traditional statistics course. Traditional statistical methods rely on analyzing what data scientists would call a flat and tidy data set. In this chapter, we will explore how relaxing these assumptions affects statistical techniques. In particular, we will develop statistical theory in a manner that is contextual by creating sheaves on the Alexandroff topology constructed on a database schema as defined in the previous chapter (Section 4.6.3).

A non-flat measurement scenario consists of a collection of tables tabulating the outcomes of some collection of random variables of interest. By a context, we mean a collection of simultaneously observable random variables whose outcomes are collected in a single table. In other words, contexts can be identified with the particular collection of attributes tabulated in a single table. We often informally refer to a table as a context in this chapter.

The major contributions present in this chapter are the use of an appropriate topological structure to sheaf-theoretically lift standard statistical constructions to families of marginals with some overlapping constraints. More precisely, this includes the introduction of a poset structure on the collection of constraint satisfaction problems (Section 5.5.2) which allows us to select an appropriate topology based on the shared columns of the tables constituting a database (Section 5.6.1). Using this topology, we see how to express various statistical concepts as sheaves or presheaves with respect to this topology (Section 5.7). This allows us to define the notion of contextual random variables (Section 5.7.6) and to define statistical models in terms of sheaf morphisms (Section 5.8.1).
We also introduce the distinction between classical and contextual factors (Definition 127) and the notion of a classical snapshot to handle classical approximations to globally irreconcilable marginals. We discuss a pseudo-likelihood approach to extending maximum likelihood estimation based on the realization of contextual random variables as subsets of an equalizer (Section 5.11.1) and provide a test for whether or not marginal distributions can arise from a joint distribution on the full column set (Section 5.12.2). This last result is similar to a result due to Abramsky, Barbosa, and Mansfield based on sheaf cohomology which allows the user to detect contextuality [3]. By combining our results with the construction in chapter seven, we can provide a goodness-of-fit
measure for contextuality rather than a simple detection of contextuality.
5.2 The Bell Marginals
The problem of reconstructing specific statistical models from marginal tables has been analyzed previously [6, 63, 134]. The example in this section can be seen as a specific instance of recovering a multinomial model on the full joint distribution from a particular family of marginals. Formulated as a database problem, the problem of whether or not there exists a table projecting onto a given family of marginals is known to be NP-complete [68]. In this section, we review the Bell marginal tables introduced in [101]. We prove that the introduction of a naive transition noise from a joint distribution on the full attribute space onto the product of the marginals is sufficient to match any family of marginal distributions. The problem with this construction is that it is overparametrized, and there are in general many noise transitions which explain a collection of incompatible marginals. We use this as motivation for the introduction of sheaf theory in later sections, which allows us to construct a model of the globally inconsistent marginals as a subspace of an equalizer constructed from the degree of overlapping compatibility. We begin by considering the contextual inference problem for data collected in contingency tables. As a starting point, we recall the Bell marginals example from [101].
Example 104. (Bell Marginals) Consider the following collection of four contingency tables:
  t_AB     A = 0   A = 1        t_A′B    A′ = 0   A′ = 1
  B = 0      4       0          B = 0       3        1
  B = 1      0       4          B = 1       1        3

  t_AB′    A = 0   A = 1        t_A′B′   A′ = 0   A′ = 1
  B′ = 0     3       1          B′ = 0      1        3
  B′ = 1     1       3          B′ = 1      3        1
Given such a collection of tables, we can attempt to construct a table which has the above four contingency tables as marginal tables. In [101], it is shown that such
a construction is impossible. If we attempt to merge A′B, AB′, and A′B′, there are 14 possible tables of counts that marginalize to these tables. However, if we merge AB with A′B, there is only one possible solution to the resulting constraint satisfaction problem. We will explain this constraint satisfaction problem in greater depth in Section 5.5.1. Given a classical random variable, there is no way we could encounter the above situation if these tables were drawn from a random variable on the joint outcome space of A, B, A′, and B′ with clean records; however, as shown in [101], these tables could result from versioning or missing data. As a random variable is not determined by its marginal distributions, there could in general be many possible probability distributions on the full outcome space with the same marginal distributions as those collected in a particular collection of tables. Nevertheless, the presence of noise in instrumentation in a network of computational agents or bias in the agents themselves (such as may occur in a network of sensors due to the physical degradation of some of the components of the sensor) could potentially result in this type of situation.
Example 105. Suppose we have four computational agents observing outcomes in the same manner as the Bell distributions above, i.e. one agent observes random variables A and B, another observes A′ and B, etc. If each agent flips a bit while logging their observations with some fixed probability p, then the tables of marginal counts will likely fail to glue together due to the presence of the random noise in the system.
Thus, one way of understanding contextuality is by introducing noise into the system. This motivates the following definition.
Definition 106. (Noisy Random Variable) Let X = ∏_{i=1}^n X_i be the joint outcome space of the random variables in R. Let X_{C_i} = ∏_{j∈C_i} X̃_j where X̃_j = X_j ⨿ {NA}. In order to allow for the possibility of global inconsistency among the agents, we equip each context with a 'noise' which is simply a transition probability between X and X_{C_i}, i.e. a Kleisli arrow n_{iK} : X → X_{C_i}. Thus, we will think of a noisy random variable as a random variable whose outcome space is the product across a set of measurement contexts.
Lemma 107. (Existence of Noisy Random Variables) Given a random variable (f : S → X, p) and a family of Kleisli arrows n_{iK} : X → X_{C_i}, there exists a random variable whose outcome space is ∏_{i=1}^m X_{C_i} and whose marginal on each context is the corresponding pushforward distribution.
Proof. Starting from a random variable (f : S → X, p) and Kleisli arrows n_{iK} : X → X_{C_i}, we need to construct an extension of the sample space S whose outcome space is ∏_{i=1}^m X_{C_i} (or perhaps an extension of this). From our original random variable, we can construct a Kleisli arrow p : 1 →_K X. The family of Kleisli arrows gives rise to a Kleisli arrow ∏_{i=1}^m n_{iK} : X → ∏_{i=1}^m X_{C_i}. Thus, Kleisli composition induces a probability distribution on the space ∏_{i=1}^m X_{C_i} which descends to marginal distributions on the various contexts via the projection maps. We denote these composite arrows {p_{C_i}}_{i=1}^m. Hence, the existence of a contextual random variable is reduced to the question of constructing a sample space for a random variable based on a push-forward distribution. This is, however, a straightforward, albeit abstract, construction. Given the original distribution p on S, we can construct S × ∏_{i=1}^m X_{C_i} with the distribution p × ∏_{i=1}^m p_{C_i}. There is an obvious projection map q : S × ∏_{i=1}^m X_{C_i} → ∏_{i=1}^m X_{C_i}. Then the random variable (q : S × ∏_{i=1}^m X_{C_i} → ∏_{i=1}^m X_{C_i}, p × ∏_{i=1}^m p_{C_i}) has the desired properties.

Lemma 108. Given any collection of contexts of discrete random variables, there exists a noisy random variable with the prescribed marginal distributions.
Proof. We establish this lemma by constructing a recursive algorithm for finding such a representation. Write p_1, . . . , p_n for the atoms of the source distribution and q_1, . . . , q_m for the atoms of the target distribution. Define k_1 = min{ k ∈ [n] : ∑_{i=1}^k p_i ≥ q_1 } and define α_1 such that α_1 p_{k_1} + ∑_{i=1}^{k_1−1} p_i = q_1. Note that (1 − α_1) p_{k_1} + ∑_{i=k_1+1}^n p_i = ∑_{i=2}^m q_i. Recursively, define k_ℓ = min{ k ∈ [n] : ∑_{i=k_{ℓ−1}+1}^k p_i ≥ q_ℓ } and α_ℓ such that α_ℓ p_{k_ℓ} + ∑_{i=k_{ℓ−1}+1}^{k_ℓ−1} p_i = q_ℓ. From these we can construct our Kleisli morphism as the right stochastic matrix whose entries have the following form:

  T_{iℓ} = 1 − α_{ℓ−1}   if i = k_{ℓ−1},
  T_{iℓ} = 1              if k_{ℓ−1} < i < k_ℓ,
  T_{iℓ} = α_ℓ            if i = k_ℓ,
  T_{iℓ} = 0              otherwise.

From this form, we see this construction is indeed row stochastic: every row i ∉ {k_1, . . . , k_m} contains a single 1, and each row i = k_ℓ contains the entry α_ℓ in column ℓ followed by 1 − α_ℓ in column ℓ + 1, with all other entries being 0. Each α_ℓ ∈ [0, 1] because, by construction, p_{k_ℓ} + ∑_{i=k_{ℓ−1}+1}^{k_ℓ−1} p_i ≥ q_ℓ and hence p_{k_ℓ} ≥ q_ℓ − ∑_{i=k_{ℓ−1}+1}^{k_ℓ−1} p_i. Recall that α_ℓ is chosen so that the latter inequality becomes an equality after scaling p_{k_ℓ} by α_ℓ, so α_ℓ ≤ 1. By the definition of k_ℓ, we know ∑_{i=k_{ℓ−1}+1}^{k_ℓ−1} p_i < q_ℓ or, equivalently, q_ℓ − ∑_{i=k_{ℓ−1}+1}^{k_ℓ−1} p_i > 0. Thus, we also must have α_ℓ ≥ 0. This establishes the row stochasticity of the matrix T.
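Under the assumption of finitely supported distributions with strictly positive atoms, the recursive construction in the proof amounts to a greedy "water-filling" routine; a sketch (the function name is ours):

```python
import numpy as np

def transfer_matrix(p, q):
    # Greedy construction of a right-stochastic matrix T with p @ T == q,
    # routing source mass to target atoms in order, splitting the boundary
    # atom k_l with weight alpha_l as in the proof sketch.
    n, m = len(p), len(q)
    T = np.zeros((n, m))
    j = 0
    remaining = q[0]  # mass still needed by target atom j
    for i in range(n):
        mass = p[i]
        while mass > 1e-12 and j < m:
            moved = min(mass, remaining)
            T[i, j] += moved / p[i]  # fraction of atom i routed to atom j
            mass -= moved
            remaining -= moved
            if remaining <= 1e-12:
                j += 1
                if j < m:
                    remaining = q[j]
    return T

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.6])
T = transfer_matrix(p, q)
print(np.allclose(p @ T, q), np.allclose(T.sum(axis=1), 1.0))
```

Any permutation of the atoms of p or q produces a different valid T, illustrating the non-uniqueness discussed next.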
From the previous proof, we see that the Kleisli transition probabilities are sufficient to match any observed contextual distribution. However, such a construction is highly non-unique. Note that any permutation of the indices of the true distribution or the contextual distribution would result in a different transition probability by the construction in the previous lemma. This means that any model of contextuality as a noisy random variable will be non-identifiable. Such models are called singular, and their asymptotic theory is considerably more complex due to the failure of asymptotic normality of the posterior distribution. More information on the asymptotic theory of singular models can be found in [141, 142]. In order to extend statistical methods to such situations, we need methods that are robust to any of the above possibilities. This motivates our later construction of an Alexandroff topology on a collection of overlapping tables (Section 5.6.1). The next example outlines how contextuality can potentially arise in experiments analyzing human behavior.
Example 109. (Consumer Behavior) Imagine we are testing several options for displaying products in a store. Each display has the capacity for two items. We label our different products A, B, A0, and B0 and try to log counts of purchases of the various combinations of the available options, e.g. the number of customers who bought item A but not item B, etc. The data is tabulated as a collection of counts tabulating the number of customers who bought the various combinations of products. In this scenario, we can interpret the value 0 to correspond with a consumer not purchasing an item and a value of 1 to indicate that a consumer purchased the item. Thus each context can be represented as a 2 × 2 table where x00 denotes the number of customers who purchased neither item, x01 and x10 denote the number of customers who purchased only one of the items, and x11 denotes the number of customers who purchased both items. If we test four different displays where any two pairs of display have only one overlapping product, we could potentially arrive at the situation discussed at the beginning of this chapter
involving the Bell marginals. We may be interested in predicting how a pair of items would sell before doing the experiment to make a determination as to whether or not it would be worth the cost of trying out a different product layout.
In the Bell marginal tables displayed above, we can consider tables T_AB, T_A′B, and T_AB′ to be tables resulting from previous observations of customer behavior. Merging these three tables yields the collection of purchasing patterns across all four products that are consistent with the observed marginal distributions from the three previous experiments. In the specific situation of the Bell marginals listed above, there are 4 possible joins of T_AB, T_A′B, and T_AB′ consistent with the corresponding marginal distributions. Code for reproducing this observation can be found in Appendix A.2.2. Projecting these tables onto the joint outcome space A′ × B′ yields four possible marginal tables:
           A′ = 0   A′ = 1             A′ = 0   A′ = 1
  B′ = 0      4        0      B′ = 0      3        1
  B′ = 1      0        4      B′ = 1      1        3

           A′ = 0   A′ = 1             A′ = 0   A′ = 1
  B′ = 0      3        1      B′ = 0      2        2
  B′ = 1      1        3      B′ = 1      2        2
Note that the two off-diagonal tables are the same because there are two possible joins of T_AB, T_A′B, and T_AB′ which marginalize onto A′B′ in the same manner. In order to make predictions for this set of outcomes, we need some way of choosing one of the above tables, the ability to reason over the full collection of the above tables, or a technique for averaging the tables in an appropriate way. The latter idea is the most straightforward. When applied to the tables above, the average table is one of the two identical off-diagonal tables. However, as we discuss later in this chapter (Section 5.11.2), this type of analysis can lead to seriously flawed predictions. For methods to choose a particular table, we could apply the maximum-entropy principle, which would mean choosing the uniform table above. However, it is possible to construct situations where the entropy-maximizing table is not unique, and as such this principle can't necessarily be applied to all problems. One
such example is given by the collection of joint tables on A, B, A′, and B′ which marginalize to the tables T_AB and T_A′B′ in the Bell marginals introduced above. As we will see later in this chapter (Section 5.5.1), the collection of all such tables results in a constraint satisfaction problem. In this particular case, there are 14 different tables which solve the resulting constraint satisfaction problem and 6 of these solutions are entropy maximizing. The code for producing this example appears in Appendix A.2.3. Entropy maximization according to marginals will also tend to eliminate any correlation between the random variables, which could be undesirable in many predictive modeling situations.

In the specific scenario of predicting customer purchasing habits, we may want to calculate the expected profit for each of these situations and decide if the potential distribution of returns is worth the cost of further experimentation. In general, the profit (or some other reward function) will depend on which of the four distributions we choose. Moreover, simply averaging the profit will destroy information about the spread of values. In terms of making a decision of whether or not additional testing is warranted, the spread is relevant to the analyst who is trying to understand the risk-to-reward characteristics of such a decision. In the situation above, if we were looking at the expected profit for each possible table, we could also report the sample standard deviation of the profit across all tables. A major theme of the next chapter involves using the Giry monad discussed in chapter 3 to develop means of imputing data that better preserve the spread of the observed data.

Suppose instead we are interested in some statistical property of the full distribution on A × B × A′ × B′.
If we had collected the marginal tables displayed at the beginning of this section, we would find that there is no table on the full outcome space (without missing values) which marginalizes to the Bell marginals. In order to do any statistical inquiries in this situation, we have a few options on how to proceed:
• We can fit models which only depend on the observed marginal distributions.
• Extend our model to incorporate the possibility of an unobserved observation and attempt to reason over the space of possible joins including NAs. Given the number of possibilities of how to join the data with missing entries, we may need to incorporate some form of bootstrapping to keep our computations
feasible.
• Attempt to come up with a representation for the joint distribution that is independent of these arbitrary choices.
To conclude this discussion: when analyzing contextual databases we are most often not able to join data frames uniquely. Much of the time we see either many possible ways to join our data frames or no globally consistent way of joining them. In chapter four, we found sufficient conditions for ensuring that a collection of tables admits a consistent join provided their overlapping counts agree (Proposition 101). We next explore some of the limitations of the skip-NA method as it relates to model fitting for the specific case of directed graphical models, or Bayesian networks.
5.3 Skip-NA and Directed Graphical Models
The problem of compatibility amongst conditional distributions has been previously explored in [7, 8]. More recent work on this question has focused on the compatibility of conditional distributions from the point of view of algebraic geometry [35, 102, 124, 125]. In this section, we examine a naive approach to fitting graphical models to contextual tables. Graphical models are statistical models in which a graph is used to express conditional dependence relationships between different subsets of random variables. Suppose we have a directed acyclic graph whose nodes are labeled by random variables belonging to some measurement scenario. The Bayesian network associated to the directed acyclic graph is given by
p (x_1, . . . , x_n) = ∏_{i=1}^{n} p (x_i | pa_i)
where pai denotes the collection of parents of Xi. Note that a graphical model can
be naively fit to a collection of contexts as long as the table which covers Xi | pai belongs to the generating family of observables assuming additionally that the collection of contexts agree on any overlapping subset of columns. As we will see in this section, the naive technique of fitting the model from the relevant conditional
table can result in pathological behavior in the presence of contextuality. In practice, such issues could arise when attempting to fit a model to a large database which contains missing records by querying for the relevant tables using a skip-NA framework on the subset of columns of interest to the query. For more details about how missing data can produce contextuality in marginal tables, we refer the reader to the discussion in [101].
Example 110. Consider the directed acyclic graph on the Bell observables with edges A → B, B → A′, and A → B′.
If we interpret the above directed acyclic graph as representing a Bayesian network, then the corresponding factorization of the joint distribution on X_A × X_B × X_{A′} × X_{B′} is given by
p(x_A, x_B, x_{A′}, x_{B′}) = p(A) \, p(B \mid A) \, p(A′ \mid B) \, p(B′ \mid A).
Each individual probability distribution in the factorization is naively estimable from the Bell contexts since all of these tables are uniquely computable from the observed tables. A is covered by t_{AB} and t_{AB′}, and the marginalized tables produce the same marginal probability on A, so p(A) is well-defined as far as the original tables are concerned. Similarly, t_{AB}, t_{A′B}, and t_{AB′} are all tables of the original contexts, and so p(B | A), p(A′ | B), and p(B′ | A) are all computable from their respective observed tables. Thus, we can fit the above graphical model to the contextual distribution even though there is no global table on {A, B, A′, B′} which marginalizes to the observed Bell tables. Although the model is naively estimable from the contextual family, it is nonetheless a probability distribution on the joint outcome space X_A × X_B × X_{A′} × X_{B′}. As such, we know that this model will not marginalize to the Bell tables because
the Bell tables admit no globally consistent join. In particular, marginalizing the distribution naively estimated from conditional tables computed from the Bell marginals gives us the correct frequencies corresponding to t_{AB} but incorrect frequencies for t_{A′B′}.
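The failure described above can be checked numerically. The sketch below uses the value counts for t_{AB} and t_{A′B} that appear in the constraint equations of Section 5.5.1; the counts assumed here for t_{AB′} (equal to those of t_{A′B}) and the anticorrelated counts (1, 3, 3, 1) for t_{A′B′} are illustrative assumptions, not fixed by this section.

```python
from fractions import Fraction as F

# Bell value counts (8 observations per context); t_{AB'} and t_{A'B'} are
# assumed values for illustration.
t = {
    "AB":   {(0, 0): 4, (0, 1): 0, (1, 0): 0, (1, 1): 4},  # keyed (a, b)
    "A'B":  {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3},  # keyed (b, a')
    "AB'":  {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3},  # keyed (a, b')
    "A'B'": {(0, 0): 1, (0, 1): 3, (1, 0): 3, (1, 1): 1},  # keyed (a', b')
}

def cond(table, given):
    """p(second coordinate | first coordinate = given) from a count table."""
    tot = sum(v for (g, _), v in table.items() if g == given)
    return {x: F(v, tot) for (g, x), v in table.items() if g == given}

p_a = {a: F(sum(v for (x, _), v in t["AB"].items() if x == a), 8) for a in (0, 1)}

# Joint distribution via the factorization p(A) p(B|A) p(A'|B) p(B'|A).
joint = {}
for a in (0, 1):
    for b in (0, 1):
        for ap in (0, 1):
            for bp in (0, 1):
                joint[(a, b, ap, bp)] = (p_a[a] * cond(t["AB"], a)[b]
                                         * cond(t["A'B"], b)[ap]
                                         * cond(t["AB'"], a)[bp])

marg_ab = {(a, b): sum(v for k, v in joint.items() if k[:2] == (a, b))
           for a in (0, 1) for b in (0, 1)}
marg_apbp = {(ap, bp): sum(v for k, v in joint.items() if k[2:] == (ap, bp))
             for ap in (0, 1) for bp in (0, 1)}

print(marg_ab[(0, 0)])    # 1/2, matching the observed t_AB frequency 4/8
print(marg_apbp[(0, 0)])  # 5/16, not the assumed t_{A'B'} frequency 1/8
```

The AB marginal of the fitted model agrees with the observed table, while the A′B′ marginal does not, mirroring the claim in the text.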
In the Bell tables, t_{AB} and t_{A′B} together admit a unique solution to their resulting constraint satisfaction problem. The same holds for the tables t_{AB} and t_{AB′}. As such, we can also fit graphical models to the Bell tables containing conditional tables involving {A, B, A′} or {A, B, B′}. We will soon study the mathematical structure of the collection of tables constructable from a collection of overlapping tables.
Example 111. The probability model corresponding to the directed acyclic graph with edges B′ → A, B′ → A′, A → B, and A′ → B is also naively estimable from the Bell contexts; it corresponds to the decomposition of the joint distribution given by
p(A, B, A′, B′) = p(A \mid B′) \, p(B \mid A, A′) \, p(A′ \mid B′) \, p(B′).
Note that the term p(B | A, A′) is unambiguous because t_{AB} and t_{A′B} admit a unique extension table. For the same reasons as in the previous example, this joint distribution does not marginalize to the Bell contexts.
Remark 112. A collection of contextual tables contains the tables from which sufficient statistics for a graphical model on a directed acyclic graph can be naively estimated if and only if each family {X_i} ∪ pa_i is covered by a contextual table or a consistent projection thereof. This procedure can fit arbitrarily 'bad' models in the presence of contextuality. We present an example where using the marginal tables as sufficient statistics for model selection produces a model with a marginal distribution having infinite Kullback-Leibler divergence from one of the observed marginal distributions.
Example 113. Consider the following normalized marginal distributions on AB, BC, and AC, respectively:
s_{ij} = \frac{1}{2}\delta_{ij}, \quad r_{ij} = \frac{1}{2}\delta_{ij}, \quad t_{ij} = \frac{1}{2}\left(1 - \delta_{ij}\right).
Note that these marginal tables provide sufficient statistics for the graphical model with edges A → B and A → C, which corresponds to the probabilistic model

p(A, B, C) = p(A) \, p(B \mid A) \, p(C \mid A).
Let p_{ijk} denote the probability that A = i, B = j, and C = k. Then

p_{ijk} = \frac{1}{2}\delta_{ij}\left(1 - \delta_{ik}\right).
The projection of the above probability distribution onto the outcome space BC is given by

q_{jk} = \sum_{i} p_{ijk} = \frac{1}{2}\left(1 - \delta_{jk}\right).
However, the observed marginal distribution on BC is given by

r_{jk} = \frac{1}{2}\delta_{jk}.
The Kullback-Leibler divergence between the projected and observed distributions is therefore

D_{KL}(q \parallel r) = \sum_{j,k} q_{jk} \log \frac{q_{jk}}{r_{jk}} = \infty,

since q places mass on outcomes where r vanishes.
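The divergence can be verified directly. The sketch below assumes the reading of the example in which the model has edges A → B and A → C fitted from the AB and AC marginals, so that p_{ijk} = (1/2) δ_{ij} (1 − δ_{ik}); it then compares the BC projection of the model with the observed BC marginal.

```python
import math

delta = lambda i, j: 1 if i == j else 0

# Model fitted from the AB and AC marginals: p_ijk = (1/2) d_ij (1 - d_ik).
p = {(i, j, k): 0.5 * delta(i, j) * (1 - delta(i, k))
     for i in (0, 1) for j in (0, 1) for k in (0, 1)}

# Projection onto BC versus the observed BC marginal r_jk = (1/2) d_jk.
q = {(j, k): sum(p[(i, j, k)] for i in (0, 1)) for j in (0, 1) for k in (0, 1)}
r = {(j, k): 0.5 * delta(j, k) for j in (0, 1) for k in (0, 1)}

def kl(q, r):
    """D_KL(q || r), returning inf when q puts mass where r has none."""
    total = 0.0
    for key, qv in q.items():
        if qv == 0:
            continue  # 0 log 0 = 0 by convention
        if r[key] == 0:
            return math.inf
        total += qv * math.log(qv / r[key])
    return total

print(kl(q, r))  # inf
```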
In the above calculations, we attempted to fit a model to a collection of tables whose schema graph is cyclic. This observation suggests that the topological structure of our database has something to do with the type of behavior observed in the above examples. We will return to this topological question in Section 5.6, but first we discuss the structure of the types of constraint satisfaction problems which arise from contextual measurement scenarios.
5.4 Motivation from Statistical Privacy
Statistical privacy concerns how data from a statistical database can be released publicly in a way that respects the privacy of the individuals whose data is being released. Simply omitting personally identifiable information is not enough in many situations because other public records could potentially be merged with the released data to re-identify individuals. As such, research has focused on how to construct noise mechanisms which preserve the statistical properties of the original data while making it impossible for an observer using the perturbed data to determine whether or not a particular individual's information is present in the database [11, 41, 87, 126]. Statistical privacy provides another motivation for the study of contextuality we present in this chapter. In databases with a large number of columns, it may be computationally infeasible to manipulate a joint distribution on the full column set. Instead, analysts will marginalize onto subsets of the full column set, introduce noise, and work with these marginal distributions. In general, the introduction of noise to these marginal tables can create a family of marginals which do not arise as the marginals of any joint distribution. In such situations, statistical techniques which assume the collection of marginals arises from a joint distribution are invalid. In this chapter, we develop an approach based on sheaf theory to lay the foundations for adapting statistical methods to families of marginal tables which are not necessarily assumed to arise as marginals of some joint distribution on the full column set.
5.5 Poset of Joins of a Database
When joining two tables together, we form a new table from the two pre-existing tables according to some specified rules for combining the tables. In the previous chapter, we discussed a few types of join operations and how they could be interpreted in the language of category theory. For a collection of tables which
agree on their overlapping columns, we can reformulate the problem of finding a consistent join of the tables as a constraint satisfaction problem, a class of problems known to be NP-complete [68].
5.5.1 Contextual Constraint Satisfaction Problems
Suppose we want to join the tables t_{AB} and t_{A′B} from the Bell family. A table which extends these tables is any table which marginalizes to the original two tables. Let π_{AB} : X_A × X_B × X_{A′} → X_A × X_B and π_{A′B} : X_A × X_B × X_{A′} → X_{A′} × X_B denote the canonical projection operators. We are looking for a table t′ : [8] → X_A × X_B × X_{A′} for which π_{AB} ∘ t′ = t_{AB} and π_{A′B} ∘ t′ = t_{A′B}. From the previous chapter, we know that each table is an equivalence class and each equivalence class has a representation as a count of values (Section 4.2.5.3). Let V be the function that accepts a table as an argument and returns its value count representation. The earlier requirement is then that V(π_{AB} ∘ t′) = V(t_{AB}) and V(π_{A′B} ∘ t′) = V(t_{A′B}). By matching entries in these tables, we generate a linear constraint satisfaction problem. In this particular example, let n_{ijk} with i, j, k ∈ {0, 1} denote the number of times A = i, B = j, and A′ = k; then our contextual constraint satisfaction problem is given by the set of linear constraint equations:

n_{000} + n_{001} = 4,  n_{010} + n_{011} = 0,
n_{100} + n_{101} = 0,  n_{110} + n_{111} = 4,
n_{000} + n_{100} = 3,  n_{001} + n_{101} = 1,
n_{010} + n_{110} = 1,  n_{011} + n_{111} = 3,
\sum_{(i,j,k) \in \{0,1\}^3} n_{ijk} = 8.

This particular constraint satisfaction problem has only a single solution
{n_{000} = 3, n_{001} = 1, n_{110} = 1, n_{111} = 3}
with all other entries not appearing in the list above equal to zero.
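This constraint satisfaction problem is small enough to solve by exhaustive enumeration. The sketch below splits each t_{AB} cell count between A′ = 0 and A′ = 1 and keeps the splits whose (B, A′) margins match t_{A′B}:

```python
from itertools import product

# Value counts for t_AB and t_{A'B}, read off the constraint equations above.
ab = {(0, 0): 4, (0, 1): 0, (1, 0): 0, (1, 1): 4}          # keyed (a, b)
a_prime_b = {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3}   # keyed (b, a')

cells = sorted(ab)
solutions = []
for split in product(*[range(ab[c] + 1) for c in cells]):
    # split[c] of the ab[c] rows get a' = 0; the rest get a' = 1.
    n = {}
    for (a, b), s in zip(cells, split):
        n[(a, b, 0)] = s
        n[(a, b, 1)] = ab[(a, b)] - s
    if all(sum(n[(a, b, k)] for a in (0, 1)) == a_prime_b[(b, k)]
           for b in (0, 1) for k in (0, 1)):
        solutions.append(n)

print(len(solutions))  # 1: the unique solution stated above
```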
Proposition 114. (Worst case analysis for number of solutions to a contextual constraint satisfaction problem)
Consider two tables t_1 and t_2 that overlap on a column B. Suppose t_1 contains another column A_1 with n rows taking distinct values, and that t_2 contains an additional column A_2, also with n rows taking distinct values. Moreover, suppose t_1 and t_2 are both constant along the overlapping column B, with the same constant value in each table. Then the number of solutions to the contextual constraint satisfaction problem is n!.
Proof. Apply the counting principle. For the first record in t_1, we have n choices of records in t_2. Inductively, once the first k rows of t_1 have been matched, the (k + 1)-st row has n − k possible choices in t_2.
The above proposition shows that the number of solutions to a constraint satisfaction problem depends on the number of ways of matching overlapping entries and as such can grow combinatorially based on the extent to which the overlapping column fails to function as a primary key for a join. As such, worst case analysis can always be performed by assuming the overlapping entries are constant. Note that we could leverage the observation above to construct an algorithm for producing a random join of two tables which agree on the value counts of their overlapping columns: iterate through the records of one table and select, for each, a random unmatched record of the other table which agrees on the overlapping columns. The fact that the input tables are required to agree on their overlapping columns ensures that we won't run out of choices at any step in the procedure.
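The random-join procedure just described can be sketched as follows, representing tables as lists of dictionaries; the function name and record layout are illustrative choices, not notation from the text.

```python
import random
from collections import defaultdict

def random_join(t1, t2, key):
    """Produce a random join of two tables (lists of dicts) that agree on
    the value counts of the overlapping column `key`."""
    pool = defaultdict(list)
    for row in t2:
        pool[row[key]].append(row)
    for rows in pool.values():
        random.shuffle(rows)
    # Each t1 record consumes one matching t2 record; agreement on the value
    # counts of `key` guarantees the pool is never empty at any step.
    return [{**row, **pool[row[key]].pop()} for row in t1]

# Worst case of Proposition 114: both tables constant on B, n = 4 rows.
t1 = [{"B": 0, "A1": i} for i in range(4)]
t2 = [{"B": 0, "A2": i} for i in range(4)]
joined = random_join(t1, t2, "B")
```

Since the overlapping column is constant, any of the 4! = 24 matchings can be produced, consistent with the proposition's count.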
5.5.2 Poset of Solutions to Contextual Constraint Satisfaction Problems
In database theory, lattices are sometimes used to represent the possible joins and restrictions of a collection of tables constituting a database [89]. In this subsection, we define a poset structure on the collection of solutions to the various constraint satisfaction problems arising from these tables. The poset structure discussed in this subsection serves as motivation for the topology constructed on the simplicial complex associated to the database in the next section. Specifically, we generalize
the observations of the previous section to generate a poset from a collection of tables, consisting of all tables that can be constructed from the original collection via marginalization or as a solution to the type of constraint satisfaction problem discussed previously. We can construct a poset on the space of joins where the order relation is based on whether or not a table extends another table.
Fix a finite collection of tables {t_i : [n_i] → X_{C_i}}_{i∈I} and let C_i denote the collection of column names for table t_i. For each σ_i ⊂ C_i, there is a projection operator p_{C_i→σ_i} : X_{C_i} → X_{σ_i}, and we can define a marginal table t_{σ_i} : [n_i] → X_{σ_i} by t_{σ_i} = p_{C_i→σ_i} ∘ t_i. As we saw in the previous section, a family of contextual tables can have multiple joins. Given a collection of tables {t_j : [n_j] → X_{C_j}}_{j∈J} with J ⊂ I, we say that a table t : [n] → X_C is an extension of {t_j}_{j∈J} if the following conditions hold:
• n_j = n for all j ∈ J
• C = ∪_{j∈J} C_j
• t_j = p_{C→C_j} ∘ t for all j ∈ J.
From the above definitions, we can create a poset whose elements are the tables {t_i : [n_i] → X_{C_i}}_{i∈I} together with all projections and extensions of these tables. The order structure is defined by t ≤ t′ if and only if t′ is an extension of t. We can equip such a collection of tables with an Alexandroff topology generated by lower sets in a manner similar to what we will do for databases in the next section. However, we will not use this topology in any meaningful way in this chapter and so postpone this discussion until the next section.
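The extension conditions can be checked mechanically. Since tables are treated as equivalence classes with value-count representatives V, the sketch below (function names are ours) compares a candidate table against a family through value counts rather than row order:

```python
from collections import Counter

def project(table, cols, sub):
    """Marginalize a table (list of row tuples labelled by cols) onto sub."""
    idx = [cols.index(c) for c in sub]
    return [tuple(row[i] for i in idx) for row in table]

def is_extension(t, cols, family):
    """Check that t : [n] -> X_C extends every (t_j, C_j) in family, comparing
    tables through their value-count representations V (i.e. as Counters)."""
    return all(set(sub) <= set(cols) and
               Counter(project(t, cols, sub)) == Counter(tj)
               for tj, sub in family)

# The unique join of t_AB and t_{A'B} found earlier, as 8 rows over (A, B, A').
t = [(0, 0, 0)] * 3 + [(0, 0, 1)] + [(1, 1, 0)] + [(1, 1, 1)] * 3
t_ab = [(0, 0)] * 4 + [(1, 1)] * 4
t_apb = [(0, 0)] * 3 + [(0, 1)] + [(1, 0)] + [(1, 1)] * 3
print(is_extension(t, ("A", "B", "A'"),
                   [(t_ab, ("A", "B")), (t_apb, ("B", "A'"))]))  # True
```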
Example 115. (Bell Context) Let t_{AB}, t_{AB′}, t_{A′B}, and t_{A′B′} be the Bell marginals discussed at the beginning of this section. Since these tables all agree on their marginal overlaps, we have tables t_A, t_B, t_{A′}, and t_{B′} obtained by marginalizing onto the variables in the subscripts. The lower set generated by all contexts can be visualized via the Hasse diagram displayed below:
t_{AB}   t_{A′B}   t_{AB′}   t_{A′B′}

t_A   t_B   t_{A′}   t_{B′}

with an arrow from each one-column table upward to every context containing its column.
To generate the full topology, we also need to consider the possible extensions of the tables. Unfortunately, the full topology is too difficult to visualize graphically. In the next section, we will consider the smallest possible database which can produce contextuality, and for this reason we postpone any additional visual aids until then. Instead, we present a table counting the number of extensions of each pair and triple of tables in the Bell family. Code for generating these tables can be found in Appendix A.2.3.
Tables                    Number of Extensions
t_{AB}, t_{A′B}                    1
t_{AB}, t_{AB′}                    1
t_{AB}, t_{A′B′}                  14
t_{A′B}, t_{AB′}                  46
t_{A′B}, t_{A′B′}                  4
t_{AB′}, t_{A′B′}                  4
We can similarly count the number of extensions of three tables (Appendix A.2.2):

Tables                          Number of Extensions
t_{AB}, t_{A′B}, t_{AB′}                 4
t_{AB}, t_{AB′}, t_{A′B′}                4
t_{AB}, t_{A′B}, t_{A′B′}                4
t_{A′B}, t_{AB′}, t_{A′B′}              14
As a concrete example, the two tables presented below are two possible solutions to the constraint satisfaction problem involving the tables t_{AB}, t_{A′B}, and t_{AB′} from the Bell marginals. Code for reproducing these results can be found in Appendix A.2.2.
t_1              A′=0, B′=0   A′=0, B′=1   A′=1, B′=0   A′=1, B′=1
A=0, B=0              3            0            0            1
A=0, B=1              0            0            0            0
A=1, B=0              0            0            0            0
A=1, B=1              1            0            0            3

t_2              A′=0, B′=0   A′=0, B′=1   A′=1, B′=0   A′=1, B′=1
A=0, B=0              3            0            0            1
A=0, B=1              0            0            0            0
A=1, B=0              0            0            0            0
A=1, B=1              0            1            1            2
We also know that there is no table which has all four contexts as its marginals [101]. An implementation of this verification can be found in Appendix A.2.1. Using the table of counts of solutions presented above, we see that the number of solutions to a contextual constraint satisfaction problem can be larger than the number of observations in the table. For instance, the constraint satisfaction
problem corresponding to joining t_{A′B}, t_{AB′}, and t_{A′B′} has fourteen solutions even though the original tables contain only eight observations. As such, enumerating all possible joins can become problematic for large data sets whenever there are many combinations of joins for the overlapping columns, as would be the case with repeated observations of a categorical feature. For this reason, in practice it will be necessary to use bootstrapping techniques for many statistical methods on contextual databases.
5.6 Topology of a Database Schema
Previously, we saw that for a prescribed set of marginals, the contextuality can always be accounted for by introducing transition noise from the joint distribution on the full column space onto the collection of marginals constituting a particular measurement scenario, thereby viewing the prescribed set of marginals as projections of some higher-dimensional random variable. Unfortunately, this representation of the higher-dimensional random variable was highly non-unique, and as a model it was highly non-identifiable due to overparametrization of the transition noise. In light of these considerations, we will propose a definition for contextual statistical models involving morphisms of sheaves between a sheaf of parameters and a sheaf of contextual probability distributions on a topology associated to a database schema (Section 5.8.1). We first expound upon the construction of a database topology discussed at the end of chapter four.
Record linkage refers to the process of finding records in a database or collection of databases that refer to the same entity [39, 105]. The mathematical foundations of the subject were first established by Fellegi and Sunter [48]. In this section, we will discuss constructing a simplicial complex associated to a database. In this discussion, we assume a particularly simple linkage model, i.e. that all links are constructed by overlapping columns. The topological construction presented in this section could be adapted to more general deterministic record linkage scenarios by using more general gluing conditions to connect the tables than are discussed here.
5.6.1 Contextual Topology on a Database Schema
Definition 116. Let D be a database consisting of tables t_1, . . . , t_k, and let C_i denote the column set of t_i. We define the abstract simplicial complex associated to the database schema, Δ(D), to be the simplicial complex Δ(D) = ∪_{i∈[k]} (↓ C_i), where ↓ C_i = {S | S ⊆ C_i}. The sets {↓ C_i}_{i∈[k]} generate a topology which we call the contextual topology associated to the database schema. In other words, the contextual topology τ_D on Δ(D) is the topology whose open sets are the sub-simplicial complexes.
Note that this topology is finite and as such is an Alexandroff topology, i.e. a topology in which arbitrary intersections of open sets are open. The topology defined above is common in poset theory [139]. In the next section we will discuss various sheaves and presheaves on Δ(D). Throughout the next section, it will be important to have a running example of a particular database schema and its associated topology. We will fix the simplest possible contextual database schema and use it as the example topology for all of the sheaves discussed in the next section.
Example 117. Consider a database consisting of three tables: t_{AB}, t_{AC}, and t_{BC}. Each of the labels A, B, and C corresponds to a binary data type. t_{AB} counts the number of times A = i and B = j where (i, j) ∈ {0, 1}^2, and t_{AC} and t_{BC} tabulate similar counts for their respective data types. The simplicial complex associated to this database schema can be visualized as the undirected cyclic graph (the triangle) on the vertices A, B, and C with edges AB, AC, and BC. The poset structure of this simplicial complex can be visualized with the following Hasse diagram:

AB   AC   BC

A    B    C

where an arrow X → Y, drawn upward from each vertex to each edge containing it, indicates that X ≤ Y. We will use the notation P = {A, B, C, AB, AC, BC} to refer to the underlying poset associated to the abstract simplicial complex. For this particular database, there are a total of eighteen open sets in the contextual topology. These are given by the lower sets of the poset, i.e. the sub-simplicial complexes. In this particular case the contextual topology consists of the following eighteen sets:
∅, {A}, {B}, {C}, {A, B}, {B, C}, {A, C}, {A, B, C},
{AB, A, B}, {BC, B, C}, {AC, A, C},
{AB, A, B, C}, {BC, A, B, C}, {AC, A, B, C},
{AB, BC, A, B, C}, {AB, AC, A, B, C}, {BC, AC, A, B, C},
{AB, AC, BC, A, B, C}.
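The count of eighteen open sets can be checked by enumerating the down-closed subsets of the face poset directly; the representation of faces below is an illustrative choice.

```python
from itertools import chain, combinations

# Faces of the simplicial complex for the 3-cycle schema {AB, AC, BC}.
contexts = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
faces = {frozenset(s) for C in contexts
         for r in (1, 2) for s in combinations(sorted(C), r)}

def is_open(subset):
    """Open sets are sub-simplicial complexes: sets of faces closed under
    taking nonempty sub-faces."""
    return all(frozenset(s) in subset
               for face in subset
               for r in range(1, len(face))
               for s in combinations(sorted(face), r))

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

opens = [set(sub) for sub in powerset(faces) if is_open(set(sub))]
print(len(opens))  # 18
```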
Given p ∈ P, define ↓p := {s ∈ P | s ≤ p}. Then the contextual topology associated to this schema can be visualized via the following Hasse diagram (top to bottom):

↓AB ∪ ↓AC ∪ ↓BC

↓AB ∪ ↓AC    ↓AB ∪ ↓BC    ↓AC ∪ ↓BC

↓AB ∪ {C}    ↓AC ∪ {B}    ↓BC ∪ {A}

↓AB    ↓AC    ↓BC    {A, B, C}

{A, B}    {A, C}    {B, C}

{A}    {B}    {C}

∅.
The arrows in the above diagram correspond to subset inclusion. The collection of open sets of this topology itself has the structure of a poset, so it is important to keep in mind the distinction between the underlying poset and the topology constructed on it. In the diagram above, the second row of elements corresponds to the open sets obtained by excluding one of AB, AC, and BC from the poset, e.g. ↓AB ∪ ↓AC is the subset {AB, AC, A, B, C}. The elements in the third row correspond to sets excluding two of AB, AC, and BC; for example, ↓AB ∪ {C} is the subset {AB, A, B, C}. Recall that a basis for a topology is a collection of open sets which cover the full space X and satisfy the additional requirement that if B_1 and B_2 are basis elements, then for every x ∈ B_1 ∩ B_2 there is a basis element B_3 with x ∈ B_3 and B_3 ⊂ B_1 ∩ B_2. A sheaf is determined by its values on a basis because there is an equivalence of categories between Sh(X) and Sh(B); a proof of this fact can be found in [94]. As such, we can simplify our definition of sheaves by constructing them on a basis. The contextual database topology discussed in this example has the following basis:

B := {↓AB, ↓BC, ↓AC, ↓A, ↓B, ↓C}.
In the next section, we refer to this particular contextual database topology as the 3-cycle database topology for ease of reference.
Remark. Note that {A, B} ≠ ↓AB. This distinction is important to keep in mind when discussing the glueability condition for sheaves, as we will do frequently in the next section. The open subsets in this topology of lower sets are in one-to-one correspondence with sub-simplicial complexes, so it can be helpful to visualize them via their geometric realizations. The top element, P, corresponds to the full cyclic graph on the vertices A, B, and C, i.e. the triangle with edges AB, AC, and BC.
Open sets in the second row from the top correspond to the sub-simplicial complexes obtained by removing one edge from the triangle, e.g. ↓AB ∪ ↓BC corresponds to the path with edges AB and BC. Open sets in the next row down are obtained from a single context by adjoining the remaining isolated vertex, e.g. ↓AB ∪ {C} corresponds to the edge A — B together with the isolated vertex C.
Open sets in the next row down correspond to the original contexts of our database, e.g. ↓AB corresponds to the single edge A — B, except for the set {A, B, C}, which corresponds to the sub-simplicial complex with no edges: the three isolated vertices A, B, and C. The open sets in the next level down are those obtained from the original contexts by removing an edge, e.g. {A, B} corresponds to the two isolated vertices A and B.
Finally, the empty set corresponds to the empty graph.
5.7 Sheaves on Databases
Various statistical concepts can be seen as constructions involving different sheaves on the contextual database topology defined in the previous section. In this section, we will discuss a number of sheaves of interest for statistical analysis on contextual databases. All example sheaves from this section will be constructed on the 3-cycle contextual database topology described at the end of the previous section.
5.7.1 Presheaf of Data Types
When defining databases in chapter four, every attribute had an associated data type, determined by the set of values that the particular attribute could take. Given a contextual database topology, we can define a presheaf that associates to each open set U the product of the data types of all attributes appearing in U. The restriction maps of this presheaf are given by projections or identity maps where appropriate.
Example 118. The presheaf of data types on the contextual database topology associated to the 3-cycle with binary data types assigns to each open set the product of the data types of the attributes appearing in it: every open set containing all three vertices A, B, and C (including ↓AB ∪ {C} and {A, B, C}) is assigned {0, 1}^3; the opens ↓AB, ↓AC, ↓BC and the vertex pairs {A, B}, {A, C}, {B, C} are assigned {0, 1}^2; the singletons {A}, {B}, {C} are assigned {0, 1}; and the empty set is assigned the one-point set {∗}. All restriction maps are the evident coordinate projections (or identities), so the resulting commutative diagram of sets mirrors the Hasse diagram of the topology displayed above.
5.7.2 Presheaf of Classical Tables of a Fixed Size
A more interesting presheaf is the one that associates to each open set the collection of all tables of size n on the outcome space determined by the attributes present in the open set. The restriction mappings here are given by projections of tables as discussed in chapter four. Note that these open sets are in one-to-one correspondence with the possible constraint satisfaction problems arising from our original contexts (Section 5.5.1). This observation will be used later in the chapter when we discuss the collection of classical approximations of a family of contextual tables (Section 5.11.2).
5.7.3 Sheaf of Counts on Contextual Tables
Each column in a table has an associated data type which specifies the possible values an observation can take. We can construct a sheaf on the basis by associating to each basic open set the collection of count functions on the product of the data types of the columns appearing in its top element. Since a basis element only involves the column space of a particular table or a subset thereof, there is no ambiguity in this prescription. This definition on the basis will allow us to construct the sheaf on the remaining open sets using the equalizer condition for sheaves.
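The restriction maps of this sheaf marginalize count functions, summing counts over the discarded columns. A minimal sketch (function and variable names are ours):

```python
def restrict_counts(counts, cols, sub):
    """Restriction map of the sheaf of counts: marginalize a count function
    on the outcomes of `cols` to a count function on the outcomes of `sub`."""
    idx = [cols.index(c) for c in sub]
    out = {}
    for outcome, c in counts.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0) + c
    return out

# A section of N(↓AB) = N^4, i.e. one count for each (a, b) in {0,1}^2.
n_ab = {(0, 0): 4, (0, 1): 0, (1, 0): 0, (1, 1): 4}
print(restrict_counts(n_ab, ["A", "B"], ["A"]))  # the induced section of N^2
```

Two contexts glue precisely when their sections restrict to the same counts on their overlapping columns, which is the agreement condition used throughout this chapter.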
Example 119. We will compute the sheaf of counts for the 3-cycle database topology. We first define this construction on the original contexts as

N(↓AB) ≅ N(↓AC) ≅ N(↓BC) ≅ ℕ^4

and define N on the single-attribute open sets as

N({A}) ≅ N({B}) ≅ N({C}) ≅ ℕ^2.

Any sheaf must map the empty set to the terminal object; in this case, we have N(∅) = ℕ^0. In order for this construction to define a sheaf, we must also specify the restriction mappings. We define res^U_∅ = ∗ since there is only one map to a singleton set. Thus, we need only define the restriction mappings from the tables to their overlapping columns. In order to define these mappings, we introduce coordinates on N(↓AB) = ℕ^4 of the form