
Fast Left-Kan Extensions Using The Chase

David I. Spivak · Ryan Wisnesky

August 30, 2021

Abstract We show how any chase algorithm from relational database theory that computes initial models on finite-limit theories can be used to compute left-Kan extensions of set-valued functors, and prove that the core chase and the parallel chase both compute initial models on finite-limit theories. We also describe an optimized implementation of the parallel chase specialized to left-Kan computations that achieves an order of magnitude improvement in our performance benchmarks compared to the next fastest left-Kan extension algorithm we are aware of.

Keywords Computational category theory · left-Kan extensions · the Chase · Data migration · Data integration

1 Introduction

Left-Kan extensions [8] are used for many purposes in automated reasoning: to enumerate the elements of finitely-presented algebraic structures such as monoids; to construct semi-decision procedures for Thue (equational) systems; to compute the cosets of groups; to compute the orbits of a group action; to compute quotients of sets by equivalence relations; and more. Left-Kan extensions are described category-theoretically, and we assume knowledge of category theory [3] in this paper, but see Section 1.4 for a review.

Let C and D be categories and F : C → D a functor. Given a functor J : D → Set, where D → Set (also written Set^D) is the category of functors from D to the category of sets, Set, we define ∆F(J) : C → Set := J ◦ F, and think of ∆F as a functor from D → Set to C → Set. ∆F has a left adjoint, which we write as ΣF, taking functors in C → Set to functors in D → Set. Given a functor I : C → Set, the functor ΣF(I) : D → Set is called the left-Kan extension [8] of I along F. Left-Kan extensions always exist, up to unique isomorphism, but they need not be finite (i.e., ΣF(I)(d) may have infinite cardinality for some object d ∈ D, even when I(c) has finite cardinality for every object c ∈ C). In this paper we describe how to compute finite left-Kan extensions when C, D, and F are finitely presented and I is finite, a semi-computable problem originally solved in [8] and significantly improved upon in [7].

Both authors at Conexus AI

1.1 Motivation

Our interest in left-Kan extensions comes from their use in data migration [24, 28, 25], where C and D represent database schemas, F represents a “schema mapping” [16] defining a translation from schema C to D, and I represents an input C-database (often called an instance) that we wish to migrate to schema D. Our implementation of the fastest left-Kan algorithm we knew of from existing literature [7] was impractical for large input instances, yet it bore a striking operational resemblance to an algorithm from relational database theory known as the chase [10], which is also used to solve data migration problems, and for which efficient implementations are known [5]. The chase takes a set of formulae F in a subset of first-order logic known to logicians as existential Horn logic [10], to category theorists as regular logic [22], to database theorists as embedded dependencies [10], and to topologists as lifting problems [27], and constructs an F-model chaseF(I) that is weakly initial among other such “F-repairs” of I.

1.2 Contributions

In this paper, we:

– show how any chase algorithm that computes initial models on finite-limit theories [1] (regular theories where every ∃ quantifier is read as “exists-unique”) can be used to compute left-Kan extensions of set-valued functors; and,

– prove that the core chase [10] and the parallel chase [10] compute initial models on finite-limit theories; and,

– describe an optimized left-Kan extension algorithm, inspired by the parallel chase, that achieves an order of magnitude improvement in our performance benchmarks compared to the next fastest left-Kan extension algorithm we are aware of [7].

1.3 Outline

This paper is structured as follows. In the next section we review category theory [3] and then describe a running example of a left-Kan extension. Then in section 2 we describe how to compute left-Kan extensions using chases that produce initial models on finite-limit theories, as well as prove that the core chase and parallel chase construct initial models on finite-limit theories. In section 3 we describe our particular left-Kan algorithm implementation, compare it to the algorithm in [7], and provide experimental performance results. We conclude in section 4 by discussing the differences between the chase as used in relational database theory and as used in this paper. We assume knowledge of formal logic and algebraic specification at the level of [2]; knowledge of left-Kan extensions at the level of [8] and of the chase at the level of [10] is helpful.

1.4 Review of Category Theory

In this section, we review standard definitions and results from category theory [3]. A category C is an axiomatically-defined algebraic structure similar to a group, ring, or monoid. It consists of:

– a set (or class), Ob(C), the members of which we call objects, and

– for every two objects c1, c2, a set (or class) C(c1, c2), the members of which we call morphisms (or arrows) from c1 to c2, and

– for every three objects c1, c2, c3, a function ◦c1,c2,c3 : C(c2, c3) × C(c1, c2) → C(c1, c3), which we call composition, and

– for every object c, an arrow idc ∈ C(c, c), which we call the identity for c.

We may write f : c1 → c2 instead of f ∈ C(c1, c2), and drop object subscripts on id and ◦, when doing so does not create ambiguity. The sets and functions above must obey axioms stating that ◦ is associative and id is its unit:

    id ◦ f = f        f ◦ id = f        f ◦ (g ◦ h) = (f ◦ g) ◦ h

Two morphisms f : c1 → c2 and g : c2 → c1 such that f ◦ g = id and g ◦ f = id are each said to be an isomorphism. We may write f; g or f.g to indicate g ◦ f and write c ∈ C to indicate c ∈ Ob(C) when it is clear c is an object.

An example category is the category of sets and total functions, Set, which has for objects all the “small” sets in some set theory, such as ZFC, and for morphisms X to Y the total functions X → Y represented as sets. The isomorphisms of Set are exactly the bijections. Programming languages often form categories, with types as objects and programs taking inputs of type t1 and returning outputs of type t2 as morphisms t1 → t2. A relational database schema consisting of single-part foreign keys and single-part unique identifiers also forms a category, say C, and we may consider C-databases as functors C → Set [24], with the pleasant property that natural transformations of such functors correspond exactly to the C-database homomorphisms in the sense of relational database theory [10] (bearing in mind some caveats discussed in the conclusion to this paper).

A functor F : C → D between categories C and D consists of:

– a function F : Ob(C) → Ob(D), and

– for every c1, c2 ∈ Ob(C), a function Fc1,c2 : C(c1, c2) → D(F (c1),F (c2)), where we may omit object subscripts when they can be inferred, such that

    F(idc) = idF(c)        F(f ◦ g) = F(f) ◦ F(g).

A natural transformation h : F → G between functors F, G : C → D consists of a family of morphisms hc : F(c) → G(c), indexed by objects in C, called the components of h, such that for every f : c1 → c2 in C, hc2 ◦ F(f) = G(f) ◦ hc1. Given (small [3]) categories C and D, the functors from C to D form a category, C → D (written also D^C), whose morphisms are natural transformations. The collection of all (small) categories forms a category in the obvious way. A natural transformation is called a natural isomorphism when all of its components are isomorphisms. The family of equations defining a natural transformation may be rendered as a commutative diagram:

              F(f)
      F(c1) --------> F(c2)
        |               |
    hc1 |               | hc2
        v               v
      G(c1) --------> G(c2)
              G(f)

Such a diagram indicates that all compositions of morphisms that start at the same object and end at the same object are to commute (be equal morphisms) in D; in this case, there are two such paths, east-south and south-east.

Because categories are algebraic objects, they can be defined by generators and relations (or as we like to say, generators and equations) in a manner similar to e.g. groups. Formally, let G be a directed multi-graph with nodes N and edges E, and let a path from node n1 to node nk consist of a contiguous, possibly 0-length list of edges n1 →e1 n2 → ... →ek nk. An equation p = q (from node n1 to node nk) is a pair of paths p and q (each from n1 to nk), and given a set of equations Eq, we say that two paths p and q (each from n1 to nk) are equivalent according to Eq if there is a way to re-write (in the sense of a Thue system [3]) p into q according to Eq and the usual axioms of equality (reflexivity, symmetry, transitivity) and congruence (if p = q, then p.e = q.e). The category presented by (G, Eq) has for objects the nodes of G, and for morphisms from n1 to nk the equivalence classes of paths from n1 to nk modulo the equivalence relation induced by Eq, with composition given by “path append”. When Eq is empty, the category presented is called the free category generated by G, and in fact, the category presented by G and Eq is exactly the quotient free(G)/Eq. In summary, a category presentation is a multi-sorted monoid presentation [13].
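To make path equivalence concrete, the following Python sketch searches for a rewriting of one path into another. It is an illustration only: the function name `equivalent`, the encoding of a path as a tuple of edge names, and the `max_paths` budget are assumptions made here, and both sides of every equation are assumed to be non-empty (identity paths would additionally require tracking nodes). Because path equivalence is only semi-decidable in general, a `False` result means "no rewriting found within the budget", which is conclusive only when the equivalence class of the first path is finite and was fully explored.

```python
from collections import deque

def equivalent(p, q, equations, max_paths=10_000):
    """Bounded breadth-first search for a rewriting of path p into path q.

    Paths are tuples of edge names; equations is a list of (lhs, rhs) pairs
    of non-empty paths, usable in both directions (symmetry).
    """
    rules = list(equations) + [(r, l) for l, r in equations]
    seen = {p}
    queue = deque([p])
    while queue and len(seen) <= max_paths:
        path = queue.popleft()
        if path == q:
            return True
        for lhs, rhs in rules:
            n = len(lhs)
            # Congruence: rewrite any contiguous occurrence of lhs inside path.
            for i in range(len(path) - n + 1):
                if path[i:i + n] == lhs:
                    new = path[:i] + rhs + path[i + n:]
                    if new not in seen:
                        seen.add(new)
                        queue.append(new)
    return False

# The single equation of a commutative square, written in path order.
square = [(("isTF", "isFP"), ("isTS", "isSP"))]
print(equivalent(("isTF", "isFP"), ("isTS", "isSP"), square))  # True
```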
In algebraic terminology, a functor free(G)/Eq → Set can be represented as a G-algebra [2] satisfying Eq; a natural transformation of such functors can be represented as a G-algebra homomorphism in the usual way [2], and a functor F : free(G)/Eq → free(G′)/Eq′ can be presented as a morphism of theories: for each node n ∈ G, F provides a node F(n) ∈ G′, and for each edge e : n1 → n2 in G, F(e) provides a path F(n1) → F(n2) in G′, such that for all paths p, q from n1 to n2 in G equivalent according to equations Eq, we have that F(p) = F(q) is entailed by equations Eq′. This notion of theory morphism is studied specifically in [19]; see also [23] for a fully formal definition.

A pushout of objects A, B, C and morphisms f, g in a category, as shown below, is the object D and morphisms α and β as shown below, having the universal property that for any other such D′ and α′ and β′, there is a unique morphism θ making the diagram commute:

            g
      A --------> C
      |           |
    f |           | β
      v           v
      B --------> D - - θ - -> D′
            α

    with α′ : B → D′ and β′ : C → D′ satisfying θ ◦ α = α′ and θ ◦ β = β′.

The dual notion of pushout is pullback. A pushout where B = C is called a co-equalizer.

Given two functors F : C → D and G : D → C, we say that F is left adjoint to G, written F ⊣ G, when for every object c in C and d in D the set of morphisms F(c) → d in D is isomorphic to the set of morphisms c → G(d) in C, naturally in c and d (i.e., when we independently consider each side of the isomorphism as a functor C → Set and as a functor D → Set). Associated with each adjunction is a natural transformation η : idC → G ◦ F, called the unit of the adjunction, that obeys certain equations derived from the naturality of the hom-set bijection just described: for every object X in C there is a morphism ηX : X → G(F(X)) such that for every object Y in D and every morphism f : X → G(Y) there exists a unique morphism g : F(X) → Y with G(g) ◦ ηX = f. Dually, a natural transformation F ◦ G → idD called the co-unit is associated to each adjunction.

1.5 Running Example Left-Kan Extension

Our running example of a left-Kan extension is that of quotienting a set by an equivalence relation, where the equivalence relation is induced by two given functions. In this example, the input data consists of teaching assistants (TAs), Faculty, Students, and Persons, such that every TA is exactly one faculty and exactly one student. We wish to compute all of the persons without double-counting the TAs, which we can do by taking the disjoint union of the faculty and the students and then equating the two occurrences of each TA. Our source category C is the category Faculty’ ← TA’ → Student’, our target category D extends C into a commutative square with new object Person and no prime (’) marks, and the functor F : C → D is the inclusion:

    C :=   Faculty’ <--isTF’-- TA’ --isTS’--> Student’

                      isTS
    D :=        TA ---------> Student
                |                |
           isTF |                | isSP        isTF.isFP = isTS.isSP
                v                v
             Faculty --------> Person
                      isFP

Our input functor I : C → Set, displayed with one table per object, is:

    Faculty’     |  Student’  |  TA’      isTF’        isTS’
    ’Dr.’ Alice  |  Alice     |  math-TA  ’Dr.’ Alice  Alice
    ’Dr.’ Bob    |  Bob       |  cs-TA    ’Dr.’ Bob    Bob
    Prof. Ed     |  Chad      |
    Prof. Finn   |  Doug      |
    Prof. Gil    |            |

The cs-TA is both ’Dr.’ Bob and Bob, and the left-Kan extension equates them as persons. Similarly, the math-TA is both ’Dr.’ Alice and Alice. We thus expect 5 + 4 − 2 = 7 persons in ΣF(I). However, there are infinitely many left-Kan extensions ΣF(I); each is naturally isomorphic to the one below in a unique way. That is, the following tables uniquely define ΣF(I) up to choice of names:

    Faculty      isFP        |  Student  isSP     |  TA       isTF         isTS   |  Person
    ’Dr.’ Alice  math-TA     |  Alice    math-TA  |  math-TA  ’Dr.’ Alice  Alice  |  Chad
    ’Dr.’ Bob    cs-TA       |  Bob      cs-TA    |  cs-TA    ’Dr.’ Bob    Bob    |  cs-TA
    Prof. Ed     Prof. Ed    |  Chad     Chad     |                               |  Doug
    Prof. Finn   Prof. Finn  |  Doug     Doug     |                               |  Prof. Ed
    Prof. Gil    Prof. Gil   |                    |                               |  Prof. Finn
                             |                    |                               |  Prof. Gil
                             |                    |                               |  math-TA

Because in this example F is fully faithful, the natural transformation ηI : I → ∆F(ΣF(I)), i.e. the unit of the ΣF ⊣ ∆F adjunction, is an identity of C-instances [18]; it associates each source Faculty’ to the same-named target Faculty, etc.
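The Person table of this example can be reproduced directly as a quotient of a disjoint union. The Python sketch below is only an illustration of that special case, not the left-Kan algorithm of Section 3; the function name `quotient_persons`, the tags "F" and "S" used to keep the union disjoint, and the small union-find are assumptions introduced here.

```python
def quotient_persons(faculty, students, tas, is_tf, is_ts):
    """Disjoint union of faculty and students, with each TA's two roles equated."""
    # Tag elements so that the union is disjoint even if names coincide.
    parent = {("F", f): ("F", f) for f in faculty}
    parent.update({("S", s): ("S", s) for s in students})

    def find(x):
        # Follow parent links to the class representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for t in tas:
        a, b = find(("F", is_tf[t])), find(("S", is_ts[t]))
        if a != b:
            parent[a] = b  # equate the TA's faculty and student occurrences

    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

faculty = ["Dr. Alice", "Dr. Bob", "Prof. Ed", "Prof. Finn", "Prof. Gil"]
students = ["Alice", "Bob", "Chad", "Doug"]
tas = ["math-TA", "cs-TA"]
is_tf = {"math-TA": "Dr. Alice", "cs-TA": "Dr. Bob"}
is_ts = {"math-TA": "Alice", "cs-TA": "Bob"}

persons = quotient_persons(faculty, students, tas, is_tf, is_ts)
print(len(persons))  # 5 + 4 - 2 = 7
```

Each returned class corresponds to one row of the Person table; the two-element classes are the ones named math-TA and cs-TA above.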

2 Left-Kan Extensions Using the Chase

In this section we show how to compute the left-Kan extension ΣF(I) : D → Set of functors F : C → D and I : C → Set by invoking a particular chase algorithm on I and a theory col(F) associated with F, called the collage [15] of F. We:

– define the collage of F, the category col(F) that axiomatizes the unit of the ΣF ⊣ ∆F adjunction (Section 2.1); and,

– define finite-limit logic (Section 2.2), and what it means to be an initial model of a theory and input instance (Section 2.3); and,

– prove that, for chases that construct initial models of finite-limit theories, chasing I with col(F) results in ΣF(I) (Section 2.4); and,

– prove that the core chase [12], and the parallel chase [12], construct initial models of finite-limit theories (Sections 2.5 and 2.6, respectively).

2.1 The Collage of a Functor

The collage [15] of a functor F : C → D, written col(F), is the category that axiomatizes the natural transformation ηI : I → ∆F(ΣF(I)) associated with the left-Kan extension of any instance I along F (see Proposition 2). To construct a presentation (see Section 1.4) of col(F) we first take the co-product of C and D as categories (i.e., take the disjoint union of C and D’s objects, generating morphisms, and equations), then add a generating morphism αc : c → F(c) for each object c ∈ C, and finally add an equation F(f) ◦ αc = αc′ ◦ f for each generating morphism f : c → c′ ∈ C, for example:

    C part:      Faculty’ <--isTF’-- TA’ --isTS’--> Student’

    α arrows:    α : TA’ → TA      α : Faculty’ → Faculty      α : Student’ → Student

                          isTS
    D part:         TA ---------> Student
                    |                |
               isTF |                | isSP
                    v                v
                 Faculty --------> Person
                          isFP

    equations:   isTF.isFP = isTS.isSP      isTF’.α = α.isTF      isTS’.α = α.isTS
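The construction of a presentation of col(F) is mechanical, as the following Python sketch suggests. The dictionary encoding of a presentation (keys "nodes", "edges", "eqs"), the edge names `alpha_c`, and the function name `collage` are assumptions made for illustration; paths are written in diagrammatic order, so the added equation reads f.α = α.F(f), and node and edge names of C and D are assumed to be disjoint.

```python
def collage(C, D, F_ob, F_mor):
    """Presentation of col(F) from presentations of C and D.

    C, D:  {"nodes": set, "edges": {name: (src, tgt)}, "eqs": [(path, path)]}
    F_ob:  node of C -> node of D
    F_mor: edge of C -> path in D (tuple of edge names)
    """
    nodes = C["nodes"] | D["nodes"]          # co-product of objects
    edges = {**C["edges"], **D["edges"]}     # co-product of generators
    eqs = list(C["eqs"]) + list(D["eqs"])    # co-product of equations
    for c in sorted(C["nodes"]):
        edges[f"alpha_{c}"] = (c, F_ob[c])   # alpha_c : c -> F(c)
    for f, (src, tgt) in sorted(C["edges"].items()):
        # f . alpha_tgt = alpha_src . F(f), i.e. F(f) o alpha_c = alpha_c' o f
        eqs.append(((f, f"alpha_{tgt}"), (f"alpha_{src}",) + tuple(F_mor[f])))
    return {"nodes": nodes, "edges": edges, "eqs": eqs}

C = {"nodes": {"TA'", "Faculty'", "Student'"},
     "edges": {"isTF'": ("TA'", "Faculty'"), "isTS'": ("TA'", "Student'")},
     "eqs": []}
D = {"nodes": {"TA", "Faculty", "Student", "Person"},
     "edges": {"isTF": ("TA", "Faculty"), "isTS": ("TA", "Student"),
               "isFP": ("Faculty", "Person"), "isSP": ("Student", "Person")},
     "eqs": [(("isTF", "isFP"), ("isTS", "isSP"))]}
F_ob = {"TA'": "TA", "Faculty'": "Faculty", "Student'": "Student"}
F_mor = {"isTF'": ("isTF",), "isTS'": ("isTS",)}

col = collage(C, D, F_ob, F_mor)
print(len(col["nodes"]), len(col["edges"]), len(col["eqs"]))  # 7 9 3
```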

The evident inclusion functors iC : C → col(F ) and iD : D → col(F ) of C and D into col(F ) will be used several times throughout the paper. We will also make use of the following easy propositions:

Proposition 1 Let F : C → D be a functor. For objects c ∈ C and d ∈ D, there is a bijection between hom-sets, D(F(c), d) ≅ col(F)(iC(c), iD(d)).

Proof Our definition of col(F) is in terms of generators and equations. A morphism in col(F) from iC(c) to iD(d) must be of the form g ◦ αc′ ◦ f, where f : c → c′ is in C, αc′ : c′ → F(c′) is a generator, and g : F(c′) → d is in D. Exactly one equation can apply, namely that which says αc′ ◦ f = F(f) ◦ αc, so in fact every morphism in col(F) from iC(c) to iD(d) must be of the form g ◦ αc for a unique g; thus we define our map col(F)(iC(c), iD(d)) → D(F(c), d) to send g ◦ αc to g. To see that g is unique, note that the morphism g ◦ F(f) is invariant under the equational transformations of the forms g ◦ αc′ ◦ f. The map going the other way sends g : F(c) → d to g ◦ αc, and it is easy to check that the two maps are mutually inverse.

Proposition 2 Let F : C → D be a functor. The following are equivalent:

1. the category of triples (I, J, f), with I : C → Set, J : D → Set, and f : ΣF(I) → J (where a morphism (I, J, f) → (I′, J′, f′) is a pair of morphisms i : I → I′ and j : J → J′ such that j ◦ f = f′ ◦ ΣF(i)),

2. the category of triples (I, J, f), with I : C → Set, J : D → Set, and f : I → ∆F(J),

3. the category of functors col(F) → Set.

Proof 1 ↔ 2 is the definition of ΣF being left adjoint to ∆F, naturally in I, J.

3 → 2. Given a functor K : col(F) → Set, we compose it with iC : C → col(F) to obtain a functor I := K ◦ iC : C → Set, and similarly we obtain J := K ◦ iD : D → Set. To give a natural transformation f : I → ∆F(J), first choose an object c in C. We need a function I(c) → J(F(c)), so we use K(αc) : I(c) = K(iC(c)) → K(iD(F(c))) = J(F(c)). For any morphism g : c → c′, the corresponding equation F(g) ◦ αc = αc′ ◦ g in col(F) ensures that f is indeed natural. This establishes 3 → 2 on objects, and it is straightforward to check that it is functorial.

2 → 3. As expected, this is just inverse to the above. Given I, J, and f, we define K : col(F) → Set on objects via I and J on objects. Every generating morphism in col(F) is either in C or in D (in which case use I or J) or it is of the form αc : c → F(c), in which case use fc : I(c) → J(F(c)). The equations in col(F) are satisfied by the naturality of f. This establishes 2 → 3 on objects, and it is again straightforward to check that it is functorial.

2.2 Finite-Limit Theories

A finite-limit theory [29] consists of a set S = {s1, . . . , sj} of sorts and a set P = {p1, . . . , pk} of arity-sorted relation symbols (together these form a signature) as well as a set A of formulae, which we call axioms, each having the following form:

    ∀(x0 : s0) ··· (xn : sn). φ(x0, ··· , xn) ⇒ ∃!(xn+1 : sn+1) ··· (xm : sm). ψ(x0, . . . , xm)

where φ and ψ are (possibly empty, indicating truth) conjunctions of:

– assertions x = x′, for some variables of the same sort, or

– assertions p(x, . . . , x′), for some variables of appropriate sort,

and ∃! means “exists unique.”

A pre-model I consists of a set I(s) for every sort s ∈ S (called the carrier for s) and a subset I(p) ⊆ I(si1) × · · · × I(sik) for every relation symbol p of arity si1, . . . , sik. An A-model is a pre-model that additionally satisfies every axiom of A in the obvious way. A morphism of pre-models I and J (on the same signature) consists of a sort-indexed family of functions hs : I(s) → J(s) such that for every relation symbol p of arity si1, . . . , sik and all carrier elements c1 ∈ I(si1), . . . , ck ∈ I(sik), I(p)(c1, . . . , ck) implies J(p)(h_si1(c1), . . . , h_sik(ck)). A morphism of models is defined as a morphism of pre-models.

Finite-limit theories can be equivalently described using partial functions instead of relations, in which case they are often called essentially algebraic theories [1] (page 8 and also theorem 5.30).
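For finite pre-models, the morphism condition can be checked by direct enumeration. The Python sketch below is illustrative only; the function name `is_morphism` and the dictionary encodings of arities, relations, and the sort-indexed family h are assumptions introduced here.

```python
def is_morphism(h, arity, I_rels, J_rels):
    """Check that h preserves every relation of the pre-model I in J.

    h:      sort -> {element of I(sort): element of J(sort)}
    arity:  relation symbol -> tuple of sorts
    I_rels, J_rels: relation symbol -> set of tuples
    """
    for p, sorts in arity.items():
        for row in I_rels.get(p, ()):
            # Apply the component of h for each column's sort.
            image = tuple(h[s][c] for s, c in zip(sorts, row))
            if image not in J_rels.get(p, ()):
                return False
    return True

arity = {"isTF": ("TA", "Faculty")}
I_rels = {"isTF": {("t", "f")}}
J_rels = {"isTF": {("T", "F")}}
h = {"TA": {"t": "T"}, "Faculty": {"f": "F"}}
print(is_morphism(h, arity, I_rels, J_rels))  # True
```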

The Finite-limit Theory of a Category

We now describe how to convert a category C—including that for the collage col(F ) of any functor F —into a finite-limit theory that axiomatizes the functors C → Set. To do so, consider the objects of C as sorts of the theory; convert each generating morphism of C to a binary relation symbol; and add axioms requiring all relations be total and functional, for example:

∀(x : TA). ∃!(y : Faculty). isTF(x, y).

In addition, we must include axioms corresponding to C’s presentation. For exam- ple, here are the three equations from col(F ), one for the commutative square in D and two associated with α; see Section 2.1.

    isTF(x, y) ∧ isFP(y, z) ∧ isTS(x, y′) ∧ isSP(y′, z′) ⇒ z = z′
    isTF′(x, y) ∧ αFaculty′(y, z) ∧ αTA′(x, y′) ∧ isTF(y′, z′) ⇒ z = z′
    isTS′(x, y) ∧ αStudent′(y, z) ∧ αTA′(x, y′) ∧ isTS(y′, z′) ⇒ z = z′

Note that from now on, we will omit universal quantifiers and sorts when they can be inferred from context, but we will continue to make existential quantifiers explicit, as above.
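The translation from a category presentation to axioms can be sketched in a few lines of Python. This is an illustration under assumptions: the function names, the ASCII rendering of the connectives ("forall", "exists!", "and", "=>"), and the variable-naming scheme are choices made here, and equations are given as pairs of paths in diagrammatic order.

```python
def path_atoms(path, prefix):
    """Atoms for following path from variable x; returns (atoms, end variable)."""
    atoms, var = [], "x"
    for i, edge in enumerate(path):
        nxt = f"{prefix}{i}"
        atoms.append(f"{edge}({var}, {nxt})")
        var = nxt
    return atoms, var

def axioms(edges, eqs):
    """Finite-limit axioms for a presentation: one per edge, one per equation."""
    out = []
    for f, (src, tgt) in sorted(edges.items()):
        # Each generating morphism is a total, functional binary relation.
        out.append(f"forall (x : {src}). exists! (y : {tgt}). {f}(x, y)")
    for lhs, rhs in eqs:
        left, lv = path_atoms(lhs, "y")
        right, rv = path_atoms(rhs, "z")
        body = " and ".join(left + right) or "true"
        out.append(f"{body} => {lv} = {rv}")
    return out

edges = {"isTF": ("TA", "Faculty"), "isTS": ("TA", "Student"),
         "isFP": ("Faculty", "Person"), "isSP": ("Student", "Person")}
eqs = [(("isTF", "isFP"), ("isTS", "isSP"))]
for line in axioms(edges, eqs):
    print(line)
```

The last line printed is the commutative-square axiom, with variables named y0, y1, z0, z1 rather than y, z, y′, z′.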

2.3 Chasing Finite-Limit Theories

The chase is usually defined over a slightly more general logic than finite-limit logic, known to database theorists as the logic of embedded dependencies (EDs) [10], but which we like to call “regular logic” following the categorical tradition [3]. Regular logic is obtained from finite-limit logic simply by removing the requirement that every existential quantifier be modally unique (i.e., remove the ! from ∃!). Given a regular theory A and a pre-model κ, to chase κ by A is to construct a pre-model chaseA(κ) and morphism h : κ → chaseA(κ) such that:

1. chaseA(κ) satisfies A (i.e., is an A-model).

2. for any model κ′ satisfying A and any morphism h′ : κ → κ′, there is a possibly non-unique morphism g : chaseA(κ) → κ′ such that g ◦ h = h′.

                  h′
        κ ---------------> κ′
        |                  ^
      h |                  |
        v                  | g
    chaseA(κ) -------------+

In the database theory literature, chaseA(κ) is called a “universal solution” [10] to the problem of A and κ. Note that two such universal solutions to the same problem may have different cardinalities; database theorists often identify instances κ and κ′ for which there exist morphisms κ → κ′ and κ′ → κ even if the morphisms do not compose to the identity, a notion they call “homomorphic equivalence”.

Our goal in this section is to show that a chase algorithm, when run on the finite-limit theory derived from the collage of a functor, computes left-Kan extensions, for which our primary lemma is that finite-limit theories admit strongly universal chases:

Lemma 1 Given a finite signature (S, P) and finite set of ∀∃! axioms A, as in Section 2.2, for any pre-model κ, there exists a pre-model initA(κ) and morphism h : κ → initA(κ) with the following properties:

1. initA(κ) satisfies A (i.e., is an A-model),

2. for any model κ′ satisfying A and any morphism h′ : κ → κ′, there is a unique morphism g : initA(κ) → κ′ such that g ◦ h = h′.

Proof The proof uses the theory of sketches; see [4] and [29]. Note that understanding the proof is not required to understand our left-Kan algorithm. Given the theory (S, P, A) there is a category SP and a set RP of limit cones such that the category of models for the sketch (SP, RP) is equivalent to the category of A-models. Indeed, begin with the category with objects S ⊔ P and a morphism φ → si for each φ ∈ P with arity (s1, . . . , sk) and 1 ≤ i ≤ k. Now form the free finite-limit sketch on this category and add to the sketch a cone for each φ that enforces the unique map φ → s1 × · · · × sk to be a monomorphism. Finally, for each axiom

    ∀(x0 : s0) ··· (xn : sn). φ(x0, ··· , xn) ⇒ ∃!(xn+1 : sn+1) ··· (xm : sm). ψ(x0, . . . , xm)

in A, the conjunctions φ and ψ are given by pullbacks, say p and q, which already exist in SP, and we finish by adding a morphism p → q. The resulting category sketch is (SP, RP), and it is tedious but not hard to show that the category of models of this sketch is equivalent to that of A-models. The lemma then becomes just a restatement of the fact that the category of models of a limit sketch (S, R) is a reflective subcategory of Set^S; see [4, Theorem 4.2.1]. Here initA is the name of the reflection functor, and given κ ∈ Set^S, the map h is the unit of the reflection.

The properties above imply that initial model construction is a reflector, i.e. left adjoint to the inclusion of the category of A-models into the category of pre-models. In other words, initA(κ) is an initial object in the category of A-models equipped with a map from κ.

2.4 Left-Kan Extensions Using Initial Models

To compute the left-Kan extension ΣF(I) of I : C → Set along F : C → D using the previous lemma, we consider I as a pre-model Ī on the finite-limit theory col(F), compute init_col(F)(Ī), and then project the D part we need for ΣF(I). Our main result is:

Lemma 2 init_col(F)(Ī) ≅ (I, ΣF(I), ηI), where η is the unit of the ΣF ⊣ ∆F adjunction and Ī is I, considered as a pre-model of col(F) with empty D and η projections.

Proof Intuitively, by Proposition 2, the universal properties defining the two sides of the isomorphism are the same. In detail, let J := init_col(F)(Ī). By definition, I = ∆iC(Ī), so there is a map I → ∆iC(J) and hence an induced map ΣiC(I) → J over I. Also ΣiC(I) contains I, and by universality (Lemma 1 (2)), there is a unique morphism J → ΣiC(I) over I. By a standard argument, we have an isomorphism J ≅ ΣiC(I).

Now it suffices to show that ΣF(I) ≅ ∆iD ◦ ΣiC(I). This can be seen at a high level of abstraction, but at a hands-on level it follows from Proposition 1, which implies that the colimit formulas for both sides are the same:

    colim_{∑_{c∈C} D(F(c), d)} I(c)  ≅  colim_{∑_{c∈C} col(F)(iC(c), iD(d))} I(c).

With the above lemma in hand, all that remains is to show that the core chase and parallel chase algorithms compute initial models on finite-limit theories.

2.5 The Core Chase on Finite-limit Theories

The core chase [12] is a canonical (determined up to isomorphism) chase algorithm which is intractable (exponential time [12]) but easy to work with in theory. Formally, the core of a database instance I is the smallest sub-instance that has (not necessarily unique) morphisms into and out of I. In this section, we prove that on finite-limit theories, the core chase constructs initial models. Or rather, a slightly more general lemma:

Lemma 3 Any weakly initial model of a finite-limit theory A and input instance κ that is also core is initial. We will write such a model as corechaseA(κ).

Proof Let χ be an A-model and κ → χ a morphism. Then there exists a morphism h : corechaseA(κ) → χ and morphism κ → corechaseA(κ) making the following diagram commute:

    κ ───→ corechaseA(κ)
      ↘         │ h
                ↓
                χ

Suppose h′ : corechaseA(κ) → χ also commutes in the above diagram in place of h; we aim to show h = h′. Let E be the equalizer of h and h′ as such:

                E
         ⋰      │
      ⋰         │ eq(h, h′)   (monomorphism)
                ↓
    κ ───→ corechaseA(κ) ⇉ χ      (parallel arrows h and h′; κ ⇢ E is dotted)

Because models of limit theories are closed under limits, E is an A-model, and the dotted morphism exists by weak initiality. By definition of core, E = corechaseA(κ), and so h = h′. Recall that by being core, for any monomorphism E ↪ corechaseA(κ) and morphism κ → corechaseA(κ) the following diagram commutes:

    κ ───→ E
      ↘    │ (monomorphism)
           ↓
       corechaseA(κ).

2.6 A Pushout-based Parallel Chase Algorithm

We now describe a particular canonical (determined up to isomorphism) chase algorithm, which was one of our original inspirations for the left-Kan algorithm described in Section 3. This algorithm turns out to be equivalent to the core chase algorithm of [10] without the core step, formalized using the categorical notion of pushout. We now make this statement precise and prove that this chase produces initial models on finite-limit theories. A conjunctive query Q on some signature S with free variables (x0 : s0, ··· , xn : sn) is a formula in multi-sorted first-order logic of the form:

∃(xn+1 : sn+1) ··· (xm : sm). ψ(x0, . . . , xm)

where ψ is as usual a conjunction of equalities between variables of the same sort or membership predicates for base relations of S. The meaning of query Q in an instance M is the subset of M(s0) × ... × M(sn) that satisfies Q. Every query Q induces a so-called frozen ψ-model Q′ whose carrier for sort s is the bound variables of sort s and whose membership relation is given by ψ [11]. Inversely, we may convert an arbitrary ψ-model to a conjunctive query by considering the elements of each carrier to be the ∃-bound variables of that carrier’s sort. A morphism of conjunctive queries h : Q1 → Q2 on the same free variables is a sort-preserving function assigning each bound variable of Q1 to a bound variable of Q2 in a way that preserves entailment, so that e.g. Q1 → Q2 implies Q2(I) ⊆ Q1(I) for all I [11].

An embedded dependency (ED) on some signature S:

∀(x0 : s0) ··· (xn : sn). φ(x0, ··· , xn) ⇒ ∃(xn+1 : sn+1) ··· (xm : sm). ψ(x0, . . . , xm)

induces two relational conjunctive S-queries, called the ED’s front and back; each has (x0 : s0) ··· (xn : sn) for free variables:

    front := φ(x0, ··· , xn)

    back := ∃(xn+1 : sn+1) ··· (xm : sm). φ(x0, ··· , xn) ∧ ψ(x0, ··· , xm)

Each ED also induces a morphism of queries (technically, an empty inclusion) m : front → back. It is an easy fact of database theory that m(I) : back(I) → front(I) is an injection (technically, an inclusion) for every S-instance I, and that the ED holds on an S-instance I exactly when m(I) is also surjective [27] [11]. We will denote the obvious morphism front′ → back′ between frozen instances as m′.

We are now ready to describe our parallel chase algorithm, which runs in stages, starting from an original input S-instance I0. For each stage n ≥ 0, we first find the subset Tn of front(In) that does not appear in the image of m(In); these are our “triggers”, and when Tn is empty, we stop, returning In. Otherwise, we define In+1 to be the pushout below (in the category of pre-models; that pushouts of pre-models exist follows from [1]):

                         eval
    ∐_{t∈Tn} front′ ------------> In
          |                        |
          | ∐_{t∈Tn} m′            |
          v                        v
    ∐_{t∈Tn} back′ ------------> In+1

where the morphism eval chooses as the morphism front′ → In, for each t ∈ Tn, the satisfying valuation/assignment/trigger front′ → In corresponding to t.

Lemma 4 The algorithm above computes initial models of finite-limit theories.

Proof (Sketch). By a modified small object argument of Quillen; see [14]. In a bit more detail, we proceed by induction, starting with a model K satisfying the EDs, and a map h : I0 → K. For the inductive step, since K satisfies all the EDs and there is a commutative solid-arrow square as shown

                         eval
    ∐_{t∈Tn} front′ ------------> In ----------> K
          |                        |
          | ∐_{t∈Tn} m′            |
          v                        v
    ∐_{t∈Tn} back′ ------------> In+1

    (solid arrow ∐_{t∈Tn} back′ → K; dotted arrow In+1 ⇢ K)

there is a dotted lift making the diagram commute. If the algorithm terminates, the final I_N is a colimit of the whole sequence I0 → I1 → · · · . In general, whether or not it terminates, there is always a map from the colimit to K, again by the universal property of colimits. To see that if the algorithm terminates, it does so with a model, note that the set of triggers of a pre-model will be empty exactly when the pre-model is a model, which is precisely the termination condition of our algorithm. There are regular (but not finite-limit) theories for which a model exists but the above algorithm diverges; for completeness, we must add a core computation at each step of the algorithm [12]. To see that the dotted lift is unique, note that both the morphism ∐_{t∈Tn} back′ → K and the composite morphism ∐_{t∈Tn} back′ → In+1 → K are unique because the EDs are finite-limit (i.e., literally have “exists-unique” form).
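For the theories of Section 2.2 that come from category presentations, the staged construction can be imitated by a short fixed-point loop. The Python sketch below is a simplified illustration, not the pushout-based algorithm itself and not our implementation: it fires all existence triggers of a round together, then merges elements forced equal by functionality or by a path equation using a union-find (an assumption standing in for the identifications a pushout would make). The function name `chase`, the encoding of equations as (start sort, path, path), and the `max_rounds` cutoff are hypothetical, and element names are assumed to be distinct across sorts.

```python
import itertools

def chase(edges, eqs, carriers, rels, max_rounds=50):
    """edges: name -> (src, tgt); eqs: [(start_sort, lhs_path, rhs_path)]."""
    carriers = {s: set(xs) for s, xs in carriers.items()}
    rels = {f: set(rels.get(f, ())) for f in edges}
    fresh = itertools.count()
    for _round in range(max_rounds):
        # 1. Existence triggers: pairs (f, x) where x has no f-successor yet.
        triggers = [(f, x) for f, (src, _tgt) in edges.items()
                    for x in carriers[src] - {a for a, _b in rels[f]}]
        for f, x in triggers:
            y = ("null", next(fresh))          # fresh labelled null
            carriers[edges[f][1]].add(y)
            rels[f].add((x, y))
        # 2. Equality triggers, collected in a union-find.
        parent, merged = {}, False
        def find(x):
            while x in parent:
                x = parent[x]
            return x
        def union(a, b):
            nonlocal merged
            a, b = find(a), find(b)
            if a != b:
                parent[a], merged = b, True
        succ = {f: {} for f in edges}
        for f in edges:
            for x, y in rels[f]:
                if x in succ[f]:
                    union(succ[f][x], y)       # functionality of f
                else:
                    succ[f][x] = y
        def follow(x, path):
            for f in path:
                x = succ[f].get(x)
                if x is None:
                    return None                # path not yet defined on x
            return x
        for start, lhs, rhs in eqs:
            for x in carriers[start]:
                a, b = follow(x, lhs), follow(x, rhs)
                if a is not None and b is not None:
                    union(a, b)                # path equation
        # 3. Quotient the state by the merges of this round.
        carriers = {s: {find(x) for x in xs} for s, xs in carriers.items()}
        rels = {f: {(find(x), find(y)) for x, y in r} for f, r in rels.items()}
        if not triggers and not merged:
            return carriers, rels              # fixed point: a model
    raise RuntimeError("no fixed point within max_rounds")

# The collage theory of the running example (Section 2.1).
edges = {"isTF'": ("TA'", "Faculty'"), "isTS'": ("TA'", "Student'"),
         "isTF": ("TA", "Faculty"), "isTS": ("TA", "Student"),
         "isFP": ("Faculty", "Person"), "isSP": ("Student", "Person"),
         "aTA": ("TA'", "TA"), "aF": ("Faculty'", "Faculty"),
         "aS": ("Student'", "Student")}
eqs = [("TA", ("isTF", "isFP"), ("isTS", "isSP")),
       ("TA'", ("isTF'", "aF"), ("aTA", "isTF")),
       ("TA'", ("isTS'", "aS"), ("aTA", "isTS"))]
carriers = {"TA'": {"math-TA", "cs-TA"},
            "Faculty'": {"Dr. Alice", "Dr. Bob", "Prof. Ed", "Prof. Finn", "Prof. Gil"},
            "Student'": {"Alice", "Bob", "Chad", "Doug"},
            "TA": set(), "Faculty": set(), "Student": set(), "Person": set()}
rels = {"isTF'": {("math-TA", "Dr. Alice"), ("cs-TA", "Dr. Bob")},
        "isTS'": {("math-TA", "Alice"), ("cs-TA", "Bob")}}

model, model_rels = chase(edges, eqs, carriers, rels)
print({s: len(model[s]) for s in ("TA", "Faculty", "Student", "Person")})
# {'TA': 2, 'Faculty': 5, 'Student': 4, 'Person': 7}
```

On the running example the loop reaches a fixed point after four rounds, and the D-sorted carriers have the cardinalities of the tables in Section 1.5.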

3 A Fast left-Kan Extension Algorithm

We have implemented a specialized version of the parallel chase algorithm in Section 2.6 tailored for computing left-Kan extensions inside the open-source CQL tool for computational category theory (http://categoricaldata.net); this algorithm also resembles a parallel version of the left-Kan algorithm in [7]. By leveraging collection-oriented (parallelizable) operations, it improves upon our optimized (necessarily sequential) Java implementations of various existing left-Kan algorithms ([8] and [7] and several from [23]) by a factor of ten on our benchmarks.

More sophisticated strategies for representing left-Kan extensions using TGDs only, such as introducing a binary equality predicate, adding the requisite rules such as transitivity and congruence, obtaining duplicate entities that are still distinct but satisfy the equality predicate, and then merging those entities in a final post-processing step, may, as suggested by a reviewer, be even more performant than the algorithm we describe in this section. We leave such enhancements for a new Spark-based version of this algorithm currently in development.

3.1 Input Specification

The input to our left-Kan algorithm consists of a source finite category presentation, whose vertex (node) set we refer to as C, whose edge set from c1 to c2 we refer to as C(c1, c2), and whose equations (pairs of possibly 0-length paths) from c1 to c2 we refer to as CE(c1, c2). Similarly, our input contains a target finite category presentation (D, D(−, −), DE(−, −)). We require as further input a finite functor presentation F : (C, C(−, −), CE(−, −)) → (D, D(−, −), DE(−, −)). Finally, we require a (C, C(−, −))-algebra I satisfying CE. The source equations CE are not used by our algorithm (or any chase algorithm we are aware of) but are required to fully specify I. A review of concepts such as finite category presentation and (C, C(−, −))-algebra is found in Section 1.4.
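As an illustration only, the input described above could be represented as follows. This is a minimal Python sketch, not the CQL data model: the names `Presentation`, `path_target`, and `is_well_typed_functor` are hypothetical, and we assume (as is standard, but not restated in this section) that a functor presentation sends each source edge to a path in the target. The sketch checks only that endpoints match; it does not check that F respects the equations CE and DE.

```python
from dataclasses import dataclass, field

@dataclass
class Presentation:
    """A finite category presentation: nodes, edges, and path equations."""
    nodes: set
    edges: dict                                    # edge name -> (source node, target node)
    equations: list = field(default_factory=list)  # (start node, path, path)

def path_target(pres, start, path):
    """Follow a path (a list of edge names, possibly empty) from `start`."""
    node = start
    for e in path:
        source, target = pres.edges[e]
        if source != node:
            raise ValueError(f"edge {e} does not start at {node}")
        node = target
    return node

def is_well_typed_functor(C, D, F_node, F_edge):
    """Check that each edge c1 -> c2 of C is sent to a path F(c1) -> F(c2) in D."""
    return all(
        path_target(D, F_node[c1], F_edge[e]) == F_node[c2]
        for e, (c1, c2) in C.edges.items()
    )
```

Here `F_node` maps source nodes to target nodes and `F_edge` maps source edge names to lists of target edge names; both encodings are assumptions of this sketch.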

3.2 The State

Like most chase algorithms [21], our left-Kan extension algorithm runs in rounds, possibly forever, transforming a state consisting of a col(F) pre-model until a fixed point is reached. In general, termination of the chase is undecidable, but sufficient criteria exist based on the acyclicity of the “firing pattern” of the existential quantifiers [21] in the finite-limit theory corresponding to DE from the previous section. Similarly, termination of a left-Kan extension is undecidable [8], although using the results of this paper we can use termination of one to show termination of the other. Formally, the state of our canonical chase algorithm consists of:

– For each d ∈ D, a set J(d), the elements of which we call output rows. J is initialized in the first round by setting J(d) := ∐_{c∈C | F(c)=d} I(c).
– For each d ∈ D, an equivalence relation ∼d ⊆ J(d) × J(d), initialized to identity at the beginning of every round.
– For each edge f : d1 → d2 ∈ D, a binary relation J(f) ⊆ J(d1) × J(d2), initialized in the first round to empty. When the chase completes, each such relation will be total and functional.
– For each node c ∈ C, a function η(c) : I(c) → J(F(c)). η is initialized in the first round to the co-product/disjoint-union injections from the first item, i.e., η(c)(x) = (c, x).

Given a path p : d1 → d2 in D, we may evaluate p on any x ∈ J(d1), written p(x), resulting in a (possibly empty) set of values from J(d2) (a set because each J(f) is a relation). Given a state, we may consider it as a col(F ) pre-model in the obvious way by extending ∼ into a congruence (e.g., so that x ∼ y and J(f)(x, a) implies J(f)(y, a)).
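The initial state and the evaluation of a path on a row can be sketched as below. This is a hedged illustration under assumptions of our own: the function names `init_state` and `eval_path`, and the encoding of relations as Python sets of pairs, are hypothetical and are not taken from the CQL implementation. The relation ∼ is omitted because it is reset to the identity at the start of every round.

```python
def init_state(C_nodes, D_nodes, D_edges, F_node, I):
    """Build the initial state.

    C_nodes, D_nodes: iterables of node names.
    D_edges: dict mapping an edge name to (source node, target node).
    F_node: dict mapping each c in C to F(c).
    I: dict mapping each c in C to its set of rows I(c).
    """
    J = {d: set() for d in D_nodes}
    eta = {}
    for c in C_nodes:
        # Tag each row with its source node so that the union is disjoint:
        # eta(c)(x) = (c, x).
        eta[c] = {x: (c, x) for x in I[c]}
        J[F_node[c]].update(eta[c].values())
    # Each edge relation J(f) starts empty and is stored as a set of pairs.
    J_edge = {f: set() for f in D_edges}
    return J, J_edge, eta

def eval_path(J_edge, path, x):
    """Evaluate a path (list of edge names) on x, returning a set of values."""
    current = {x}
    for f in path:
        current = {b for (a, b) in J_edge[f] if a in current}
    return current
```

Evaluating the empty path returns `{x}`, and evaluating a path through an edge on which `x` has no image yet returns the empty set, matching the "possibly empty" remark above.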

3.3 The Step

Unlike most chase algorithms, our left-Kan algorithm consists of a fully deterministic sequence of state transformations, up to unique isomorphism. In practice, designing a practical chase algorithm (and hence, we believe, a practical left-Kan algorithm) comes down to choosing an equivalent sequence of state transformations, as well as efficiently executing them in bulk. In this section, we first describe the actions of our algorithm, and then we show how each action is equivalent to a sequence of the two kinds of actions used to define chase steps in database theory. We conclude with some discussion around the additional observation that each step in our left-Kan algorithm can be directly read as computing a pushout in the chase algorithm of Section 2.6. A single step of our left-Kan algorithm consists of applying the actions below to the state in the order they appear in this list:

– Action α: add new elements. For every edge g : d1 → d2 in D and x ∈ J(d1) for which there does not exist y ∈ J(d2) with (x, y) ∈ J(g), add a fresh (not occurring elsewhere) symbol g(x) to J(d2), and add (x, g(x)) to J(g). Note that this action may not force every edge to be total (which might lead to an infinite chain of new element creations), but rather adds one more “layer” of new elements.
– Action βD: add all coincidences induced by D. The phrase “add coincidences” is used by the authors of [7] where a database theorist would use the phrase “fire equality-generating dependencies”. In this step, for each equation p = q in DE(d1, d2) and x ∈ J(d1), we update ∼d2 to be the smallest equivalence relation also including {(x′, x″) | x′ ∈ p(x), x″ ∈ q(x)}.
– Action βF: add all coincidences induced by F. This step is similar to the step above, except that the equation p = q comes from the collage of F and evaluation requires data from η and I in addition to J.
– Action δ: add all coincidences induced by functionality. For every (x, y) and (x, y′) in J(f) for some f : d1 → d2 in D with y ≠ y′, update ∼d2 to be the smallest equivalence relation also including (y, y′). This step makes ∼ into a congruence, allowing us to quotient by it in the next step.
– Action γ: merge coincidentally equal elements. In many chase algorithms, including [7], elements are equated in place, necessitating complex reasoning and inducing non-determinism. Our algorithm is deterministic: step 1 adds a new layer of elements, and the next steps add to ∼. In this last step, we replace every entry in J and η with its equivalence class (or representative) from ∼, and then ∼ resets on the next round.
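To make the order of the actions concrete, the following Python sketch runs one round consisting of actions α, βD, δ, and γ on a state stored as plain dictionaries. It is an illustration under assumptions of our own, not the CQL implementation: action βF and the η maps are omitted for brevity, rows are assumed to be strings, fresh symbols are named by formatting g(x) as a string, and all function names are hypothetical.

```python
def find(parent, x):
    """Representative of x in the forest `parent`, with path compression."""
    root = x
    while parent.setdefault(root, root) != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def union(parent, x, y):
    """Merge the classes of x and y; the representative of y survives."""
    parent[find(parent, x)] = find(parent, y)

def eval_path(J_edge, path, x):
    """Evaluate a path (list of edge names) on x, returning a set of values."""
    current = {x}
    for f in path:
        current = {b for (a, b) in J_edge[f] if a in current}
    return current

def chase_round(J, J_edge, D_edges, D_eqs):
    """Apply alpha, beta_D, delta, gamma in place; return True if the state changed."""
    before = ({d: set(s) for d, s in J.items()},
              {g: set(r) for g, r in J_edge.items()})
    # Action alpha: one new layer, computed from the state as it was
    # at the start of the round.
    for g, (d1, d2) in D_edges.items():
        has_image = {x for (x, _) in before[1][g]}
        for x in before[0][d1] - has_image:
            fresh = f"{g}({x})"
            J[d2].add(fresh)
            J_edge[g].add((x, fresh))
    parent = {}  # the relation ~ starts as the identity in every round
    # Action beta_D: coincidences from each target equation (d1, p, q).
    for d1, p, q in D_eqs:
        for x in J[d1]:
            for left in eval_path(J_edge, p, x):
                for right in eval_path(J_edge, q, x):
                    union(parent, left, right)
    # Action delta: two images of the same x under one edge must coincide.
    for g in D_edges:
        first_image = {}
        for x, y in J_edge[g]:
            union(parent, y, first_image.setdefault(x, y))
    # Action gamma: replace every entry by its representative.
    for d in J:
        J[d] = {find(parent, x) for x in J[d]}
    for g in J_edge:
        J_edge[g] = {(find(parent, x), find(parent, y)) for x, y in J_edge[g]}
    return (J, J_edge) != before
```

Calling `chase_round` repeatedly until it returns `False` corresponds to running rounds until a fixed point is reached; as noted in Section 3.2, such a loop need not terminate in general.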

3.4 Example Run of the Algorithm

See Section 1.5 for the definition of our running example. The state begins as:

Faculty        isFP
’Dr.’ Alice
’Dr.’ Bob
Prof. Ed
Prof. Finn
Prof. Gil

Student        isSP
Alice
Bob
Chad
Doug

TA             isTF           isTS
math-TA
cs-TA

Person
(no rows)

The η tables are initialized to identities and do not change, so we do not display them. First, we add new elements (action α):

Faculty          isFP
’Dr.’ Alice      isFP(’Dr.’ Alice)
’Dr.’ Bob        isFP(’Dr.’ Bob)
Prof. Ed         isFP(Prof. Ed)
Prof. Finn       isFP(Prof. Finn)
Prof. Gil        isFP(Prof. Gil)
isTF(math-TA)
isTF(cs-TA)

Student          isSP
Alice            isSP(Alice)
Bob              isSP(Bob)
Chad             isSP(Chad)
Doug             isSP(Doug)
isTS(math-TA)
isTS(cs-TA)

TA               isTF             isTS
math-TA          isTF(math-TA)    isTS(math-TA)
cs-TA            isTF(cs-TA)      isTS(cs-TA)

Person
isFP(’Dr.’ Alice)
isFP(’Dr.’ Bob)
isFP(Prof. Ed)
isFP(Prof. Finn)
isFP(Prof. Gil)
isSP(Chad)
isSP(Alice)
isSP(Bob)
isSP(Doug)

Next, we add coincidences (actions βD, βF, and δ). The single target equation in DE induces no equivalences, because of the missing values in the isFP and isSP columns, so βD does not apply. βF requires that isTF and isTS be copies of isTF’ and isTS’ (from the source schema C), inducing the following equivalences:

isTF(math-TA) ∼ ’Dr.’ Alice
isTS(math-TA) ∼ Alice
isTF(cs-TA) ∼ ’Dr.’ Bob
isTS(cs-TA) ∼ Bob

The edge relations are all functions, so action δ does not apply. So, after merging equal elements (action γ) we have:

Faculty          isFP
’Dr.’ Alice      isFP(’Dr.’ Alice)
’Dr.’ Bob        isFP(’Dr.’ Bob)
Prof. Ed         isFP(Prof. Ed)
Prof. Finn       isFP(Prof. Finn)
Prof. Gil        isFP(Prof. Gil)

Student          isSP
Alice            isSP(Alice)
Bob              isSP(Bob)
Chad             isSP(Chad)
Doug             isSP(Doug)

TA               isTF             isTS
math-TA          ’Dr.’ Alice      Alice
cs-TA            ’Dr.’ Bob        Bob

Person
isFP(’Dr.’ Alice)
isFP(’Dr.’ Bob)
isFP(Prof. Ed)
isFP(Prof. Finn)
isFP(Prof. Gil)
isSP(Chad)
isSP(Alice)
isSP(Bob)
isSP(Doug)

The empty cells disappear from the Faculty and Student tables because, after merging, the rows containing them are subsumed by existing entries; i.e., based on our definition of state, the tables below are visually distinct from the tables above but denote exactly the same state:

Faculty          isFP
’Dr.’ Alice      isFP(’Dr.’ Alice)
’Dr.’ Bob        isFP(’Dr.’ Bob)
Prof. Ed         isFP(Prof. Ed)
Prof. Finn       isFP(Prof. Finn)
Prof. Gil        isFP(Prof. Gil)
’Dr.’ Alice
’Dr.’ Bob

Student          isSP
Alice            isSP(Alice)
Bob              isSP(Bob)
Chad             isSP(Chad)
Doug             isSP(Doug)
Alice
Bob

In the second and final round, no new elements are added and only one action, βD, adds coincidences. In particular, it induces the equivalences

isFP(’Dr.’ Alice) ∼ isSP(Alice)
isFP(’Dr.’ Bob) ∼ isSP(Bob)

which, after merging, leads to a final state of:

Faculty          isFP
’Dr.’ Alice      isFP(’Dr.’ Alice)
’Dr.’ Bob        isFP(’Dr.’ Bob)
Prof. Ed         isFP(Prof. Ed)
Prof. Finn       isFP(Prof. Finn)
Prof. Gil        isFP(Prof. Gil)

Student          isSP
Alice            isSP(Alice)
Bob              isSP(Bob)
Chad             isSP(Chad)
Doug             isSP(Doug)

TA               isTF             isTS
math-TA          ’Dr.’ Alice      Alice
cs-TA            ’Dr.’ Bob        Bob

Person
isFP(’Dr.’ Alice)
isFP(’Dr.’ Bob)
isFP(Prof. Ed)
isFP(Prof. Finn)
isFP(Prof. Gil)
isSP(Chad)
isSP(Doug)

which is obviously uniquely isomorphic¹ to the original example output (see Section 1.5). The actual choice of names in the above tables is not canonical, as we would expect for a set-valued functor defined by a universal property, and different naming strategies are possible.

¹ Note that the uniqueness of the isomorphism from these tables to any other left-Kan extension does not imply there are no automorphisms of the input (for example, we may certainly swap Prof. Finn and Prof. Gil), but rather that there will be at most one isomorphism of the tables above with any other left-Kan extension. If desired, we may of course restrict ourselves to only considering those morphisms that leave their inputs fixed, ruling out the swapping of Prof. Finn and Prof. Gil, although we do not do so in this paper.

3.5 Comparison to Previous Work

The authors of [7] identify four actions that leave invariant the left-Kan extension denoted by a state, and consider a run of their left-Kan extension algorithm to be any “fair” sequence of these actions:

– Action α: add a new element. This step is similar to our α step, except it only adds one element.
– Action β: add a coincidence. This step is similar to our βF and βD, except it only considers one equation.
– Action δ: delete non-determinism. This is similar to our δ step, except it only applies to one edge at a time. If (x, y) ∈ J(g) and (x, y′) ∈ J(g) but y ≠ y′, add (y, y′) and (y′, y) to ∼ and delete (x, y′) from J(g). This process is biased towards keeping older values to ensure fairness.
– Action γ: delete a coincidence. If (x, y) ∈ ∼d for some d ∈ D, then replace y by x in various places, and add new coincidences. In the first computational left-Kan paper [8], this action took an entire companion technical report to justify [9]; the authors of [7] reduced this step to about a page. One reason this step is complicated to write in [7] is that the relation ∼ is not required to be transitive; another reason is that the way deletion is done in the various places depends on the particular place; another is that deletion is done in place.

Finally, in [7], ∼ is persistent; their notions of action and round are the same, and ∼ does not reset between rounds. As remarked by a reviewer, our algorithm’s actions form a “fair strategy” in the sense of [7], and so our algorithm is complete. For example, resetting ∼ after each round corresponds to applying action γ exhaustively until ∼ is reduced to the diagonal. Completeness of our algorithm on left-Kan extensions also follows from the fact that at each chase step of a finite-limit theory, our algorithm and the core chase both yield isomorphic results. (Of course, our algorithm and the core chase give different results on regular theories that are not finite-limit.)
It is straightforward to observe a direct operational correspondence between our algorithm and traditional chase algorithms. In particular, each of our actions corresponds to a set of firings of a formula in the regular theory for col(F ), and any sequence of such firings is a chase algorithm [10]. To elaborate, we may consider col(F ) to consist of two kinds of regular formulae: the tuple-generating dependencies (so-called TGDs), and the equality-generating dependencies (so-called EGDs). A TGD has the form:

∀(x0:s0) ··· (xn:sn). φ(x0, ··· , xn) ⇒ ∃(xn+1:sn+1) ··· (xm:sm).P1(x0, . . . , xm) ∧ ...

where each Pi is an atomic predicate symbol. An EGD has the form:

∀(x0 : s0) ··· (xn : sn). φ(x0, ··· , xn) ⇒ xi = xj ∧ ...

where all the conjuncts are equalities. In both cases φ is a possibly empty conjunction of predicate symbols applied to variables. To fire a TGD on a pre-model I is to find all the I-valuations V of the variables x0, . . . , xn satisfying φ for which there does not exist any I-valuation extending V and satisfying P1 ∧ ..., create fresh model elements xn+1, . . . , xm, and add them to I so that P1 ∧ ... is satisfied by them. To fire an EGD on a pre-model I is to find all the I-valuations V of the variables x0, . . . , xn satisfying φ and replace xi with xj throughout I.
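As a small illustration of these two kinds of firing, the Python sketch below fires one TGD and one EGD of fixed, very simple shapes on a pre-model stored as a dictionary from predicate names to sets of tuples. The shapes, the function names `fire_tgd` and `fire_egd`, and the encoding are assumptions chosen for brevity; a general chase engine handles arbitrary bodies φ, which this sketch does not attempt.

```python
import itertools

def fire_tgd(model, body, head, fresh):
    """Fire  forall x. body(x) => exists y. head(x, y).

    Only valuations of x with no existing head(x, _) tuple are triggers;
    each trigger receives one fresh element drawn from the iterator `fresh`.
    """
    satisfied = {x for (x, _) in model[head]}
    for (x,) in sorted(model[body]):
        if x not in satisfied:
            model[head].add((x, next(fresh)))

def fire_egd(model, rel):
    """Fire  forall x y y'. rel(x, y) and rel(x, y') => y = y'.

    Each violation is repaired by replacing one value with the other
    throughout the whole pre-model, repeating until no violation remains.
    """
    while True:
        images = {}
        clash = None
        for (x, y) in sorted(model[rel]):
            if x in images and images[x] != y:
                clash = (y, images[x])
                break
            images[x] = y
        if clash is None:
            return
        old, new = clash
        for name, tuples in model.items():
            model[name] = {tuple(new if v == old else v for v in t) for t in tuples}
```

A fresh-name supply can be any iterator, for example `(f"n{i}" for i in itertools.count())`. Note that `fire_egd` substitutes in place, one coincidence at a time, which is the style of merging that action γ of our algorithm replaces with a single bulk quotient.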

Proposition 3 Each round of the left-Kan sequence for F has the same effect on the state as firing the EDs associated with col(F), considered as a regular theory.

Proof By construction; we designed actions α, βD, βF , δ, γ by grouping the EDs for col(F ) according to where they came from and then performing all the TGDs at once (action α) and all the EGDs at once (the other actions).

Finally, we conclude by noting that each step in our left-Kan algorithm corresponds to one of the steps from the canonical chase algorithm from Section 2.6. In particular, our left-Kan algorithm makes use of the equivalence relation at each step, clearing it for the subsequent step, rather than say deferring it until the very end of the algorithm, to make the correspondence with that chase algorithm direct. To summarize, both our canonical chase algorithm and left-Kan algorithm push out a set of triggers and a set of EDs, suitably encoded.

3.6 Implementation and Experiments in CQL

In this section we establish the baseline performance of our algorithm by a reference implementation, primarily motivated by the fact that we are not aware of any benchmarks for any left-Kan algorithms besides our own from previous work [23]. The primary optimization of our CQL implementation of our left-Kan algorithm is to minimize memory usage by storing cardinalities and lists instead of sets, such that a CQL left-Kan state as benchmarked in this paper consists of:

1. For each d ∈ D, a number J(d) ≥ 0 representing the cardinality of a set.
2. For each d ∈ D, a union-find data structure [20] based on path-compressed trees [26], representing ∼d ⊆ {n | 0 ≤ n < J(d)} × {n | 0 ≤ n < J(d)}.
3. For each edge f : d1 → d2 ∈ D, a list of length J(d1), each element of which is a set of numbers ≥ 0 and < J(d2).
4. For each c ∈ C, a function η(c) : I(c) → {n | 0 ≤ n < J(F(c))}.

From a theoretical viewpoint, the above state is more precisely considered as a functor to the skeleton [3] of the category of sets.

Scalability tests, for both time (rows/second) and space (rows/megabyte (MB) of RAM), based on randomly constructed instances of the running example and taken on a 13” 2018 MacBook Air with a 1.6 GHz i5 CPU and 16 GB RAM, on Oracle Java 11, are shown in Figure 1. Memory throughput, measured here in rows/MB and perhaps less familiar than time throughput, measures the memory used by the algorithm during its execution as a function of input size; the periodic spikes in Figure 1 are likely due to the “double when size exceeded” behavior of the many hash-set and hash-map data structures [26] in our Java implementation. We believe memory throughput improves as the input gets larger because the path-compressed union-find data structure of item two above scales logarithmically in space. We believe time throughput (rows/second) gets worse as the input gets larger because that same union-find structure scales linearly times logarithmically in time.
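The integer-indexed representation of ∼d in item 2 can be illustrated with a small path-compressed union-find over the row numbers 0, ..., J(d) − 1, together with a renumbering pass that yields the new cardinality after merging. This Python sketch is an assumption-laden illustration rather than the Java code in CQL: the class `UnionFind`, the function `renumber`, and the choice to keep the smaller index as representative are all hypothetical.

```python
class UnionFind:
    """Union-find over the integers 0 .. n-1 with path compression."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Keep the smaller index as the representative.
            self.parent[max(rx, ry)] = min(rx, ry)

def renumber(uf):
    """Return (mapping, k): mapping[i] is the dense id in 0 .. k-1 of row i's class."""
    dense = {}
    mapping = []
    for x in range(len(uf.parent)):
        r = uf.find(x)
        mapping.append(dense.setdefault(r, len(dense)))
    return mapping, len(dense)
```

After a round's coincidences have been recorded with `union`, `renumber` gives the new value of J(d) as `k`, and `mapping` can be applied to the entries of each edge list and of η to carry out the merge in bulk.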
The CQL implementation runs the Java garbage collector between rounds, uses “hash-consed” [2], tree-based terms, and uses strings for symbol and variable names. Although performance on random instances may not be representative of performance in practice, our algorithm is fast enough to support multi-gigabyte real-world use cases, such as [6].

Fig. 1 Left-Kan Chase Throughput, Quotient of a Set

To demonstrate the significant speed-up of our algorithm compared to all the other algorithms we are aware of, Figure 2 shows time throughput for the same experiment using three previous left-Kan algorithms: the “substitute and saturate” algorithm of [23] using either specialized Knuth-Bendix completion (“monoidal” [17]) or congruence closure [20] to decide the word problem associated to each category presentation, and the sequential chase-like left-Kan algorithm of [7]. All the algorithms are implemented in Java 11 in the CQL tool, and share micro-level optimization techniques such as hash-consed terms [2], making the comparison relatively apples-to-apples, with one caveat: only our algorithm from this paper targets the skeleton of the category of sets. However, with row counts limited to 12 million in this paper, the three algorithms besides our own that we compare to are all CPU-bound as opposed to memory-bound, and so we hypothesize this difference does not impact our performance analysis below. Performance analysis using Java’s built-in jvisualvm tool indicates that, as expected, the source of the performance benefit in our algorithm stems from the bulk-oriented (table at a time) nature of the actions that make up our rounds. That is, the algorithms of [8] and [7] are innately sequential in that they pick particular rows at a time non-deterministically, and so their runtime comes to be dominated by many small sequential reads and writes to large collections. In contrast, our algorithm, by performing bulk-oriented collection operations, spends less time on row-level overhead. This finding is consistent with that of the database theory literature, where parallel versions of the chase are deliberately employed because they are faster than sequential versions [21].

Fig. 2 Prior left-Kan Throughput, Quotient of a Set

4 Conclusion: Left-Kan Extensions and Database Theory

We conclude by briefly summarizing how our use of the chase relates to its use in data migration. Unlike traditional logic and model theory, where models hold a single kind of value (typically drawn from a domain/universe of discourse), in data migration, models / database instances hold two kinds of values: constants and labelled nulls. Constants have inherent meaning, such as the numerals 1 or 2 or a Person’s name; labelled nulls, sometimes called Skolem variables [11], are created when existential quantifiers are encountered during the chase and are distinct from constants and are not meaningful; they are considered up to isomorphism and correspond to the fresh g(x) symbols in our left-Kan algorithm. All practical chase engines we know of enforce the constant/null distinction, and when an equality-generating dependency n = c is encountered, where n is a null and c a constant, then n is replaced by c, and never vice-versa; if c = c′ is encountered, where c and c′ are distinct constants, then the chase fails. Hence, when performing a left-Kan extension using a chase, the input data must be encoded using labelled nulls, rather than constants, to obtain the required semantics. This is more a complication arising from the fact that many categorical constructions cannot distinguish between isomorphic sets than a problem in practice; the consequences of adopting an unfailing, nulls-only chase procedure in the context of data migration are explored in [24,28,25]. In the other direction, from category theory to database theory, our pushout-based chase algorithm defines an alternative, tractable solution concept for data exchange settings that agrees with the notion of core [10] on finite-limit theories.
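The constant/null discipline described above can be summarized in a few lines of Python. This is a hypothetical sketch of the rule as stated in this section, not the behavior of any particular chase engine: the names `Null`, `ChaseFailure`, and `orient_equality` are assumptions, and the choice of which null to keep when both sides are nulls is arbitrary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Null:
    """A labelled null; any value that is not a Null is treated as a constant."""
    label: str

class ChaseFailure(Exception):
    """Raised when an equality between two distinct constants is encountered."""

def orient_equality(a, b):
    """Decide how to apply the equality a = b.

    Returns (replaced, kept), or None when a and b are already equal.
    A null is always replaced by a constant, never the other way round.
    """
    if a == b:
        return None
    if isinstance(a, Null):
        return (a, b)
    if isinstance(b, Null):
        return (b, a)
    raise ChaseFailure(f"distinct constants equated: {a!r} = {b!r}")
```

Encoding every input row as a `Null` means the failure branch is never reached, which is the nulls-only, unfailing behavior needed when a chase is used to compute a left-Kan extension.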

References

1. Adamek, J., Rosicky, J.: Locally Presentable and Accessible Categories. London Mathematical Society Lecture Note Series. Cambridge University Press (1994). DOI 10.1017/CBO9780511600579
2. Baader, F., Nipkow, T.: Term Rewriting and All That. Cambridge University Press, New York, NY, USA (1998)
3. Barr, M., Wells, C.: Category Theory for Computing Science. Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1990)
4. Barr, M., Wells, C.: Toposes, Triples and Theories (2002)
5. Benedikt, M., Konstantinidis, G., Mecca, G., Motik, B., Papotti, P., Santoro, D., Tsamoura, E.: Benchmarking the chase. In: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’17, pp. 37–52. ACM, New York, NY, USA (2017)
6. Brown, K.S., Spivak, D.I., Wisnesky, R.: Categorical data integration for computational science. Computational Materials Science 164, 127–132 (2019)
7. Bush, M.R., Leeming, M., Walters, R.F.C.: Computing left Kan extensions. J. Symb. Comput. 35(2), 107–126 (2003)
8. Carmody, S., Leeming, M., Walters, R.: The Todd-Coxeter procedure and left Kan extensions. J. Symb. Comput. 19(5), 459–488 (1995)
9. Carmody, S., Walters, R.F.C.: Computing quotients of actions of a free category. In: A. Carboni, M.C. Pedicchio, G. Rosolini (eds.) Category Theory, pp. 63–78. Springer Berlin Heidelberg, Berlin, Heidelberg (1991)
10. Deutsch, A., Nash, A., Remmel, J.: The chase revisited. In: Proceedings of the Twenty-seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’08, pp. 149–158. ACM, New York, NY, USA (2008)
11. Doan, A., Halevy, A., Ives, Z.: Principles of Data Integration, 1st edn. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2012)
12. Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: Getting to the core. ACM Trans. Database Syst. 30(1), 174–210 (2005). DOI 10.1145/1061318.1061323
13. Fleming, M., Gunther, R., Rosebrugh, R.: A database of categories. J. Symb. Comput. 35(2), 127–135 (2003)
14. Garner, R.: Understanding the small object argument. Applied Categorical Structures 20 (2008). DOI 10.1007/s10485-008-9126-7
15. Garner, R., Shulman, M.: Enriched categories as a free cocompletion. Advances in Mathematics 289, 1–94 (2016)
16. Haas, L.M., Hernández, M.A., Ho, H., Popa, L., Roth, M.: Clio grows up: From research prototype to industrial tool. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pp. 805–810. ACM, New York, NY, USA (2005)
17. Kapur, D., Narendran, P.: The Knuth-Bendix completion procedure and Thue systems. SIAM Journal on Computing 14(4) (1985)
18. MacLane, S., Moerdijk, I.: Sheaves in geometry and logic: a first introduction to topos theory. Universitext. Springer, Berlin (1992). DOI 10.1007/978-1-4612-0927-0. URL https://cds.cern.ch/record/824105
19. Mossakowski, T., Krumnack, U., Maibaum, T.: What is a derived signature morphism? In: M. Codescu, R. Diaconescu, I. Țuțu (eds.) Recent Trends in Algebraic Development Techniques, pp. 90–109. Springer International Publishing, Cham (2015)
20. Nelson, G., Oppen, D.C.: Fast decision procedures based on congruence closure. J. ACM 27(2), 356–364 (1980)
21. Onet, A.: The Chase Procedure and its Applications in Data Exchange. In: P.G. Kolaitis, M. Lenzerini, N. Schweikardt (eds.) Data Exchange, Integration, and Streams, Dagstuhl Follow-Ups, vol. 5, pp. 1–37. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2013). DOI 10.4230/DFU.Vol5.10452.1. URL http://drops.dagstuhl.de/opus/volltexte/2013/4288
22. Patterson, E.: Knowledge representation in bicategories of relations. https://arxiv.org/abs/1706.00526 (2017)
23. Schultz, P., Spivak, D.I., Vasilakopoulou, C., Wisnesky, R.: Algebraic databases. Theory and Applications of Categories 32(16), 547–619 (2017)
24. Schultz, P., Spivak, D.I., Wisnesky, R.: Algebraic model management: A survey. In: P. James, M. Roggenbach (eds.) Recent Trends in Algebraic Development Techniques, pp. 56–69. Springer International Publishing, Cham (2017)
25. Schultz, P., Wisnesky, R.: Algebraic data integration. Journal of Functional Programming 27, e24 (2017)
26. Sedgewick, R., Wayne, K.: Algorithms, 4th edn. Addison-Wesley Professional (2011)
27. Spivak, D.I.: Database queries and constraints via lifting problems. Mathematical Structures in Computer Science 24(6), e240602 (2014)
28. Spivak, D.I., Wisnesky, R.: Relational foundations for functorial data migration. In: Proceedings of the 15th Symposium on Database Programming Languages, DBPL 2015, pp. 21–28. ACM, New York, NY, USA (2015)
29. Wells, C.: Sketches: Outline with references. In: Dept. of Computer Science, Katholieke Universiteit Leuven (1994)

Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Outline
  1.4 Review of Category Theory
  1.5 Running Example Left-Kan Extension
2 Left-Kan Extensions Using the Chase
  2.1 The Collage of a Functor
  2.2 Finite-Limit Theories
  2.3 Chasing Finite-Limit Theories
  2.4 Left-Kan Extensions Using Initial Models
  2.5 The Core Chase on Finite-limit Theories
  2.6 A Pushout-based Parallel Chase Algorithm
3 A Fast left-Kan Extension Algorithm
  3.1 Input Specification
  3.2 The State
  3.3 The Step
  3.4 Example Run of the Algorithm
  3.5 Comparison to Previous Work
  3.6 Implementation and Experiments in CQL
4 Conclusion: Left-Kan Extensions and Database Theory
References