
as a Unifying

ABSTRACT

Database theory developed from mathematical logic and set theory, but many formalisms have been introduced over time to tackle specific extensions of these core theories. This paper makes the case for using category theory as a unifying formalism for databases, by arguing that category theory can capture in a uniform and succinct way the basic relational model (e.g., schemas, constraints, queries, updates) and its many extensions (e.g., aggregates, transactions, triggers, schema mapping, pivoting, anonymization, and semi-structured data). We also show how existing results from the large corpus of theorems developed within category theory can be naturally imported, and how automated proof assistants such as COQ could be leveraged. Finally, we observe that category theory has been successfully applied to other areas of computer science (e.g., programming languages); this provides hope about the feasibility of our attempt, and suggests that by sharing a rigorous formal footing we could enable fascinating cross-area analysis and theories.

1. INTRODUCTION

DS: Much of the theory below would work out if we used only finite schemas (only a finite number of non-equivalent paths) and allowed only finitely many rows in each table. Would you prefer this? If so, should we still call the category of states on $\mathcal{C}$ "$\mathcal{C}$–Set" or should we replace Set with something like Fin or FSet?

The field of database research developed at the boundary of active industrial development and solid formal theories, and here more than in any other field this duality has produced a fascinating interplay between system builders and theoreticians. The flow of innovation has not always been a unidirectional transfer from new advances in theory to corresponding systems, but rather an intricate back and forth, where systems have often been built before a full understanding of the theoretical implications was available.

The core of database theory has developed from rather old and well-understood mathematics: mathematical logic and set theory. From this common starting point the database community has built a host of extensions, which were needed to provide the formal underpinnings for a large number of new functionalities and new solutions to problems arising from the system development side. For example, consider the various advances regarding schema mapping, aggregates and user-defined functions, deductive databases, incomplete data and management of nulls, database recovery and concurrency control, and real-time DB.

DS: are we proposing to handle each of these?

While precise formal models, rooted in formal logic, have been provided for each of these areas, it often remains hard to understand precisely the correspondences between different formalizations; at the very least this understanding requires a significant time investment. This task is often daunting, making it harder to operate across the boundaries of different theories, which in turn works as a disincentive to anyone trying to reduce this gap.

A similar problem faced the mathematics community in the first half of the 20th century. Mathematics was subdivided into many subfields, each with its own jargon and way of doing business. In order to advance, a unifying language and formalism was needed. Category theory was invented in the 1940s by Mac Lane and Eilenberg to strengthen the connection between topology and algebra, but it quickly spread to neighboring fields. By providing a precise language for comparing different universes of discourse, category theory has been a unifying force for mathematics.

CC: next two paragraphs need smoothing

Category theory is no stranger to computer science. It has been remarkably successful in formalizing the semantics of programming languages []. In fact, it has also been applied many times to database theory in the past. These attempts did not catch on in mainstream applications, perhaps because those models attempted to be too faithful to the [...]. Because these formalisms did not seem to offer enough advantages to justify the learning curve, some database theorists developed a slight aversion to category theory.

CC: We need to check who's on the PC, and avoid making enemies!

In this paper, however, we argue that database theory and category theory are naturally compatible; in fact, a basic database schema and a category are roughly the same thing. Once a simple dictionary is set up, classical category-theoretic results and constructions capture many of the formal results that appear in the database literature.

In particular, we present an application of categories and functors to uniformly represent schemas, integrity constraints, database states, queries, and updates, and show that this is not only natural but enlightening. We also show, with a simple online demo, that simple [...] can be used to translate SQL schemas (with primary keys and foreign keys) and queries into the formalism of categories and functors.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

Next, we showcase the expressivity of category theory in modeling classical database problems, many of which required significant extensions of the relational model, such as schema mapping, transactions, and user-defined aggregates. We show that existing results in DB theory can be proved using results from category theory.

We then argue that by adopting this formalism we can also: (i) inherit results from the large corpus of theorems produced in pure category theory research, and (ii) leverage automated proof assistants such as COQ.

Finally, since category theory has already been embraced by other communities, such as the theoretical programming language community, there is an interesting opportunity to bridge results and to enable theories and analyses that span different areas. As an example, consider security models.

DS: ...?

CC: A common problem in applying category theory to a computer science problem is excessively loose modeling, which leads to an application of the theory that, while formally correct, is not constructive. We carefully avoid this in this paper, giving a tighter modeling of databases. This allows us to leverage some of the core results and machinery developed in category theory (such as composition properties of functors, and the theory of adjoint functors) to constructively force important properties. As an example, the semantics of query execution we provide derives naturally and automatically from the use of functors to model the query commands. As further confirmation that this application of category theory is useful, we observed that some of the classical results in category theory (e.g., composition of functors) perfectly fit known theorems in database theory (e.g., composition of second-order mappings).

DS: "Data migration and querying are both formalized by the same mathematics. There is something deep going on here." –Peter Gates

DS: This paper can serve as a very fast introduction to databases for anyone familiar with computer science. Database theorists should note the economy of the definitions contained here.

Contributions. In summary this paper makes the following contributions:

• Offer an "all-at-once" ETL process, rather than one table at a time.

The rest of this paper is organized as follows: Section 2 provides a crash course in category theory, Section 3 presents the basic application of category theory to databases that we propose, Section 5 shows how some of the classical database results can be proved in this new framework, Section 6 showcases the power of category theory beyond classical results on a simple case and argues that much more can be done, Section 7 provides a summary of related work, and Section ?? summarizes our conclusions.

1.0.1 Acknowledgments

The first author would like to thank Peter Gates for many useful conversations and Greg Morrisett for suggesting we explore transactions.

2. BACKGROUND

This section provides the reader with a crash course in category theory. More details can be found in [?, ?, ?, ?]. The reader already familiar with the definition of category and functor, as well as the category Set of sets, can safely skip to Section 2.1. The goal of this section is to provide database researchers with a core introduction to category theory, so we slightly abuse our notation and postpone detailed comments on some set-theoretic issues (e.g., the distinction between sets and classes) to Appendix B.

Intuitively, a category is a multi-graph (i.e., a graph that can have multiple edges between the same two nodes) together with an equivalence relation on finite paths (i.e., we can declare two paths between the same nodes to be equivalent or not), as shown in the example figure below.

[Figure: an example category $\mathcal{C}$ with objects A, B, C; arrows f, g, h, i; identity arrows a, b, c; and the declared path equation f • h = g.]

Each node represents an object of a specific type, each arrow A → B represents a chosen function between objects (which can be total, continuous, etc.), a path in the graph represents the composition of such functions, and two paths are equivalent if they represent the same function. Conceptually, the graph contains the transitive closure of every finite path, including the paths of length 0, i.e. the degenerate self-loop on each node (a, b, c in our example). Note that we often omit obvious paths when representing a category pictorially. A more formal definition is presented below:

Definition 2.0.1. A category $\mathcal{C}$ consists of the following components:

1. A set of objects $\mathrm{Ob}_\mathcal{C}$; each element $x \in \mathrm{Ob}_\mathcal{C}$ is called an object of $\mathcal{C}$.

2. For each $x, y \in \mathrm{Ob}_\mathcal{C}$, a set $\mathrm{Arr}_\mathcal{C}(x, y)$; each element $f \in \mathrm{Arr}_\mathcal{C}(x, y)$ is called an arrow from $x$ to $y$ in $\mathcal{C}$, and is denoted as $f : x \to y$ or as $x \xrightarrow{f} y$.

3. For each object $x \in \mathrm{Ob}_\mathcal{C}$, a chosen arrow $\mathrm{id}_x \in \mathrm{Arr}_\mathcal{C}(x, x)$ called the identity arrow on $x$.

4. For each $x, y, z \in \mathrm{Ob}_\mathcal{C}$, a function
   $\mathrm{comp}_\mathcal{C} : \mathrm{Arr}_\mathcal{C}(x, y) \times \mathrm{Arr}_\mathcal{C}(y, z) \to \mathrm{Arr}_\mathcal{C}(x, z)$
   called the composition law for $\mathcal{C}$; we denote $\mathrm{comp}_\mathcal{C}(f, g)$ simply by $f \bullet g$.

To be a category the above components must also satisfy the following properties:

Identity law: For each arrow $f : x \to y$, the following equations hold:
   $\mathrm{id}_x \bullet f = f$ and $f \bullet \mathrm{id}_y = f$.

Associative law: Given a sequence of composable arrows $w \xrightarrow{f} x \xrightarrow{g} y \xrightarrow{h} z$, the following equation holds:
   $f \bullet (g \bullet h) = (f \bullet g) \bullet h$.
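The components and laws of Definition 2.0.1 can be checked mechanically for a small finite category. The following Python sketch is only an illustration: the three-object category (arrows f, g, h with f • g = h) and all function names are assumptions made for this example and do not come from the paper.

```python
from itertools import product

# Hypothetical finite category: objects x, y, z and arrows
# f : x -> y, g : y -> z, h : x -> z, with the declared equation f . g = h.
objects = ["x", "y", "z"]

# Each arrow name maps to its (source, target) pair.
arrows = {
    "id_x": ("x", "x"), "id_y": ("y", "y"), "id_z": ("z", "z"),
    "f": ("x", "y"), "g": ("y", "z"), "h": ("x", "z"),
}

# Chosen identity arrow for each object (component 3 of the definition).
identity = {"x": "id_x", "y": "id_y", "z": "id_z"}

# Composites of non-identity arrows (the only such composable pair is f, g).
composites = {("f", "g"): "h"}


def composable(p, q):
    """True when the target of p equals the source of q."""
    return arrows[p][1] == arrows[q][0]


def comp(p, q):
    """Composition in diagrammatic order: first p, then q (written p . q)."""
    if not composable(p, q):
        raise ValueError(f"{p} and {q} are not composable")
    if p == identity[arrows[p][0]]:
        return q
    if q == identity[arrows[q][1]]:
        return p
    return composites[(p, q)]


def identity_law_holds():
    """Check id_x . f = f and f . id_y = f for every arrow f : x -> y."""
    return all(
        comp(identity[src], a) == a and comp(a, identity[tgt]) == a
        for a, (src, tgt) in arrows.items()
    )


def associative_law_holds():
    """Check p . (q . r) = (p . q) . r for every composable triple."""
    for p, q, r in product(arrows, repeat=3):
        if composable(p, q) and composable(q, r):
            if comp(p, comp(q, r)) != comp(comp(p, q), r):
                return False
    return True


print(identity_law_holds(), associative_law_holds())
```

Brute-force checking is only feasible because the example is finite; it says nothing about infinite categories such as Set.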

CC: I pulled this triangle comment from the definition.

Given the following triangle:

[Diagram: a triangle with arrows f : x → y, g : y → z, and h : x → z.]

We say the triangle commutes iff $f \bullet g = \mathrm{comp}_\mathcal{C}(f, g) = h$.

DS: I don't like using bullet (•) both to denote the composition law and to draw objects in a category (as •x). How about f ∗ g for composition?

CC: Let's use an empty bullet in the graphs, and keep the solid one for the text.

Mathematicians often use categories to formalize rather general concepts such as the category of topological spaces and continuous functions, the category of groups and group homomorphisms, or the category Set of sets and functions. The last of these, Set, is an important category both in mathematics and for the present paper. The objects in this category are sets¹, the arrows are (total) functions between sets, and the composition law simply sends a pair of composable functions to their composition. In this paper we are not interested in platonic categories (those of mathematical interest), other than the Set category and various derivations of it. Instead we deal mainly with "custom" categories designed to model some enterprise. Our objects will represent tables, our arrows will represent attributes, and our composition law will represent integrity constraints or "business rules." All this will be made explicit in Section 3.

Another key notion is that of functors. A functor is a mapping between categories that sends objects to objects, and arrows to arrows, while preserving compositions and identities. We formalize the notion of functor as follows:

Definition 2.0.2. Let $\mathcal{C}$ and $\mathcal{D}$ be categories. A functor from $\mathcal{C}$ to $\mathcal{D}$, denoted $F : \mathcal{C} \to \mathcal{D}$, maps:

1. each object $x \in \mathrm{Ob}_\mathcal{C}$ to an object $F(x) \in \mathrm{Ob}_\mathcal{D}$, and

2. each arrow $f : x \to y$ in $\mathrm{Arr}_\mathcal{C}(x, y)$ to an arrow $F(f) : F(x) \to F(y)$ in $\mathrm{Arr}_\mathcal{D}(F(x), F(y))$.

A functor must also satisfy the following requirements:

Preserve identities: for each object $x \in \mathrm{Ob}_\mathcal{C}$ the equation $F(\mathrm{id}_x) = \mathrm{id}_{F(x)}$ holds in $\mathrm{Arr}_\mathcal{D}(F(x), F(x))$, and

Preserve composition: for each composable pair of arrows $x \xrightarrow{f} y \xrightarrow{g} z$ in $\mathcal{C}$, the equation $F(f \bullet g) = F(f) \bullet F(g)$ holds in $\mathrm{Arr}_\mathcal{D}(F(x), F(z))$.

Specifying a functor means defining "where each object and arrow from one category is sent in the other category". One should note that the composition of two functors is a functor. Let us consider the example category $\mathcal{C}$ above.

Example 2.0.3. Consider as an example the functor F in the following figure. Every object/arrow of the input category $\mathcal{C}$ is mapped to an object/arrow in $\mathcal{D}$, and both identities and compositions are preserved.

[Figure: a functor F from the example category $\mathcal{C}$ (objects A, B, C) to a category $\mathcal{D}$ (objects S, T; arrows u, v; identity arrows s, t). F sends A and B to T, C to S, the arrows f, a, b to t, c to s, g and h to u, and i to v.]

Example 2.0.3. Let $\mathcal{C}_1$ and $\mathcal{C}_2$ denote the two categories from Example ??, so the sets $\mathrm{Arr}_{\mathcal{C}_1}(X, Z)$ and $\mathrm{Arr}_{\mathcal{C}_2}(X, Z)$ have one and two elements, respectively. A functor $F : \mathcal{C}_1 \to$ Set is given by supplying three sets F(X), F(Y), F(Z) and three functions F(f), F(g), F(h), where (because we must preserve composition) we require that F(f)F(g) = F(f)F(h); a functor $G : \mathcal{C}_2 \to$ Set is the same thing, except that the above requirement is not enforced. Simple examples of functors $G : \mathcal{C}_2 \to$ Set that do not lift to (i.e. "come from") functors $\mathcal{C}_1 \to$ Set are ones in which G(f) is a surjection and G(g) ≠ G(h).

A particularly interesting class of functors for our paper are the set-valued functors; i.e. functors of the type $\mathcal{C} \to$ Set. Such a functor sends objects and arrows of some category $\mathcal{C}$ to objects and arrows in the category of sets. In fact, a category $\mathcal{C}$ will serve as a database schema and each set-valued functor $\gamma : \mathcal{C} \to$ Set will serve as an instance of it. This will be explained in Section 3.

2.1 Representing categories and functors

In order to simplify the pictorial representation of categories, we will omit the needed but obvious arrows; e.g., we will avoid drawing identity arrows and arrows representing paths through drawn arrows. Similarly, by convention, two paths will not be considered equivalent unless explicitly declared. As for functors $F : \mathcal{C} \to \mathcal{D}$, we will omit all the obvious mappings (i.e., if there is an object in $\mathcal{C}$ with the same name as an object in $\mathcal{D}$, we intend that the first is mapped to the second unless otherwise specified). Mathematically, these pictorial representations are called "linear sketches" and are akin to representing a group in abstract algebra by generators and relations.

CC: We want to say: we only draw needed stuff, the rest exists implicitly and does not commute... comfy + mathematically sound

We will now show that, once equipped with the above basic notions of category and functor, we can easily leverage the expressive power and mathematical rigor of category theory to model a broad spectrum of database problems.

2.1 Presentations of categories

Given a graph G, one can create a category $\mathcal{C}_G$ whose objects are the nodes of G, whose arrows x → y are the paths from x to y in G, and whose composition law takes a pair of paths meeting "head to tail" and concatenates them. The category $\mathcal{C}_G$ is called the free category on G. For example, consider the graph H with only one node and one loop: its free category $\mathcal{C}_H$ again has one object, but it has an arrow for every natural number $n \in \mathbb{N}$, and $\mathrm{comp}_{\mathcal{C}_H}(m, n) = m + n$. It is certainly much easier to draw the graph H than the category $\mathcal{C}_H$. This is akin to the fact that in Example ?? it was more convenient to draw the displayed graph G than its free category $\mathcal{C}_2$.

Categories, like groups in abstract algebra, can be presented by generators and relations. The generators for a category are a set of objects and arrows, and the relations are equations of paths. In Example ??, $\mathcal{C}_1$ is generated by G, but has the additional relation fg = fh. A category $\mathcal{C}$ typically can have many different presentations; $\mathcal{C}$ is called finitely presentable if it is generated by a finite graph G and finitely many relations. For example, Set is not finitely presentable, but of course $\mathcal{C}_1$ and $\mathcal{C}_2$ are. In this paper we do not make much of a distinction between a finitely presentable category and a given presentation of it. For more details, see [].

3. DATABASES AS CATEGORIES

In this section, we describe a basic observation: there is a tight connection between database schemas and categories, and between database instances and functors.

3.1 The basic idea: "category = schema"

The most natural way to express databases in terms of mathematics is via the following definitions.

CC: I broke your definition into 3 definitions

Definition 3.1.1. Schemas are Categories: a database schema, or simply schema, is a finitely presentable category $\mathcal{C}$. An object c of $\mathcal{C}$ is called a table in $\mathcal{C}$, and an arrow $f : c \to c'$ is called a column of c valued in c'. The identity map $\mathrm{id}_c : c \to c$ is called the identity column of c. A leaf table is an object $c \in \mathrm{Ob}(\mathcal{C})$ with no outgoing arrows (other than $\mathrm{id}_c$).

CC: State -> Instance

Definition 3.1.2. Database States are set-valued Functors: A database state, or simply state, on schema $\mathcal{C}$ is a functor $\gamma : \mathcal{C} \to$ Set. Given an object $c \in \mathrm{Ob}(\mathcal{C})$, an element $x \in \gamma(c)$ is called a row of $\gamma(c)$. We refer to a pair (x, f), where x is a row of $\gamma(c)$ and $f : c \to c'$ is a column of c, as a cell in $\gamma(c)$; for each cell (x, f), the element $\gamma(f)(x) \in \gamma(c')$ is called the value of the (x, f)-cell.

Remark 3.1.3. Another way to look at Definitions 3.1.1 and 3.1.2 is as a kind of "normal form" for a database. Suppose we say that a database is in categorical normal form if

• every table t has an identity column $\mathrm{id}_t$, chosen at the outset. The cells in this column are called the row-ids of t and no two row-ids are the same;

• for every column c of a table t there is some table t' such that the value in each cell of c refers to some row-id of t'; and

• each "pure data" column of t (with values in some set, say the set of strings) is considered a foreign key column to a 1-column table whose cells are all possible values of the given set (e.g. all strings). These 1-column tables do not have to be physically stored, but that issue is hidden.

3.2 Running example

In this section we set up a running example which we will discuss throughout the paper.

A database in categorical normal form consists of a bunch of tables. Each has an identity column, and other columns. For example, consider these tables:

Employee
Id   First     Last     Mgr  Dpt
101  David     Hilbert  103  q10
102  Bertrand  Russell  102  x02
103  Alan      Turing   103  q10
                                     (1)
Department
Id   Name        Secr
q10  Sales       101
x02  Production  102

$^1$ See Appendix B for a more rigorous definition.

3. DATABASES AS CATEGORIES

In this section, we discuss the basic intuition: there is a tight connection between database schemas and categories, and between database states and functors.

3.1 Schemas are Categories

The schema of a database is naturally represented by a category in which (i) objects represent schema tables and datatypes, and (ii) arrows represent columns. As an example, consider the following relational schema $S$:

    employee(ssn int, name varchar, deptno int);
    department(dno int, address varchar);

where employee.deptno is a foreign key that references department.dno. The categorical representation $\mathcal{C}_S$ for schema $S$ is shown in Figure 1.

COMMUTE: deptno = id•fk1;

Figure 1: A simple schema as a category

More precisely we state the following:

Definition 3.1.1. Schemas are Categories: A database schema $S$ is a finitely presentable category $\mathcal{C}_S$. Each table $t$ of the schema $S$ is represented by a corresponding object $t \in \mathrm{Ob}_{\mathcal{C}_S}$, and each column $c$ of $t$ taking values in $d$ is represented as an arrow $c\colon t \to d$. Each table also has an implicit column $\mathrm{id}_t$, represented as an identity map $\mathrm{id}_t\colon t \to t$ and called the identity column of $t$. Foreign keys are defined as arrows between tables (functions between row identifiers), together with commutative properties for the referenced columns (see Figure 1).

Note that the identity column of each table closely matches the notion of row identifier often used in practical implementations of the relational model. Also note that datatypes are represented uniformly as tables conceptually listing all the values in the domain of an attribute. In this way data columns and foreign key columns are put on equal footing. The definition of a foreign key is constituted by an arrow between the two tables, and the declaration of commutativity of the two paths containing referencing and referenced columns, as shown in Figure 1. The notion of uniqueness of a column, required to model primary keys, requires more powerful mathematics, namely sketches, and its discussion is postponed to Section ??.

3.2 Database states are Set-Valued Functors

Now the question arises: how to represent a database state? A database state is a collection of sets/functions, one for each table/column, that satisfies the schema structure, data types and integrity constraints. In the world of categories it is natural to represent this as a functor from the category $\mathcal{C}_S$ representing the schema to the category Set of all possible sets.

So a database state is a functor in which each table (object) in the schema is mapped to the set (in Set) of row identifiers stored in that table, and each column (arrow) is mapped to an arrow (in Set) that associates row identifiers to column values for this specific state. This fully characterizes a state, and thanks to the fact that functors preserve identity and composition, it also guarantees that the state is valid, that is, a model for $S$. This means that only functors representing valid states can be defined between $\mathcal{C}_S$ and Set. More formally we say:

Definition 3.2.1. Database states are set-valued Functors: A database state of schema $S$ is a functor $\gamma\colon \mathcal{C}_S \to$ Set. The functor $\gamma$ takes each object $t$ in $\mathcal{C}_S$ to a set $\gamma(t) \in$ Set (representing the row identifiers of table $t$ for this state), and each column $c\colon t \to t'$ in $\mathrm{Arr}_{\mathcal{C}_S}(t, t')$ to an arrow $\gamma(c)\colon \gamma(t) \to \gamma(t')$ in Set that associates each row identifier $x \in \gamma(t)$ to its corresponding column value in the set $\gamma(t')$.

One of the strengths of category theory is to allow us to change level of abstraction very naturally. We exploit this here by representing database states as the objects of a much larger category called $\mathcal{C}_S$–Set. Thus $\mathcal{C}_S$–Set is the category of all valid database states for schema $S$. The arrows of $\mathcal{C}_S$–Set are called natural transformations. Using transformations we can capture database insert and delete operations. In the following section we will use this to provide a semantics for update commands. Various mathematical subtleties related to $\mathcal{C}_S$–Set will be worked out rigorously in Section ??.

3.3 The Update Category

We are now ready to introduce the notion of update commands in terms of the arrows/transformations in $\mathcal{C}_S$–Set. We introduce this intuitively here for insert and delete statements, while we postpone to Section ?? a richer construction and rigorous definition. For now we define an insert/delete command in terms of the arrows in $\mathcal{C}_S$–Set. In particular, every insert/delete command is represented by a family of arrows in $\mathcal{C}_S$–Set, one for each database state (object). An arrow from the object $D_1$ to $D_2$ represents the effect of a certain update statement applied to $D_1$.

In the following figure we show how an update $u$ operates on a few database states (in light gray you can see other updates).

[Figure: an update $u$ drawn as arrows among the database states $S_1, \dots, S_7$; other updates are shown in light gray.]

A richer definition of the semantics of an update command is given in terms of the transformations in $\mathcal{C}_S$–Set in Section ??, where we will also show that this definition fits nicely with many other constructions representing schema migration, querying etc.

3.4 Queries as Functors

Queries are naturally represented as functors between the category $\mathcal{C}_S$ of the schema $S$ and the category representing the

[Figure 2 diagram: categories labelled $\mathcal{C}_s$, $Q_s$ and $A_s$, with functors $F$ and $G$, annotated as follows.]

As.address = Qs.address•Qs.q_departm

COMMUTE: deptno = id•fk1;
COMMUTE: carlo•q_carlo = name•q_employee; deptno = id•fk1; did•q_department = deptno•q_employee;

Figure 2: Representing a query

answer schema/targetlist. Let us introduce this with a simple example. Consider the schema $S$ introduced in Figure 1, and the following SQL query:

    SELECT d.address
    FROM employee e, department d
    WHERE e.deptno = d.did AND e.name = 'carlo';

We can represent this categorically as shown in Figure 2.

In this section, we showed the basic results of our work and how categories and functors can be used to capture the basic entities in the database world. In the following section, we provide further details on this and, thus, prepare the ground for the more advanced results of Section ??. [CC: longer and explaining why we need next section]

4. UNDERLYING MATHEMATICS

In Section 3, we gave an explicit definition of the basic idea, that of representing schemas as categories and states as functors. We also gave a brief introduction to updates and queries so that the reader could see how the general theory fits in with these definitions. In this section we explicate these ideas in detail. We will show that the concepts are tight enough that classical theorems in mathematics yield beautiful results, for example a tight link between data migration and queries.

We begin by explaining the setup for migrating data or querying, namely the notion of morphism between schemas, which will simply be a functor. We will then discuss the notion of mapping from one state to another (within the same schema), which will be the basis for our rigorous definition of updates.

At this point we will need to shift gears to describe the data migration functors and queries. While queries are defined solely by schema mappings, they act at an instance level by applying some powerful mathematical machinery. We will conclude this section by discussing data types and calculated fields.

For the reader who is interested in the tight formulations of schema mapping, data migration, queries, updates, and typing, this section will be a careful presentation. The reader who is interested in only the main ideas may skim (or skip) to Section 5 without much trouble.

From here on, we may denote a schema as $\mathcal{C}$ rather than $\mathcal{C}_S$.

4.1 Morphisms between schemas

A schema can be thought of as a graph in which we can specify when two paths are equivalent. Graph morphisms send objects to objects and arrows to arrows, and a schema morphism does too, with the additional requirement that equivalent paths are sent to equivalent paths; in other words it is a functor. Here is the precise definition.

Definition 4.1.1. Let $\mathcal{C}$ and $\mathcal{D}$ be schemas (categories). A morphism of schemas, denoted $F\colon \mathcal{C} \to \mathcal{D}$, is simply a functor. That is, it is a mapping sending each table $T$ in $\mathcal{C}$ to a table $F(T)$ in $\mathcal{D}$ and each column $T \to T'$ in $\mathcal{C}$ to a column $F(T) \to F(T')$ in $\mathcal{D}$, such that the composition law enforced in $\mathcal{C}$ must also be enforced in $\mathcal{D}$.

At this point we have defined schemas and morphisms of schemas, and there is an obvious identity and composition law for morphisms. In other words, we have a category of schemas. We will not really need to use this fact, although it may be convenient for phrasing future results.

Example 4.1.2. [DS: worthwhile? If not, can you come up with something?]

There are two possible category structures on the graph

[Diagram: objects $X$, $Y$, $Z$ with an arrow $f\colon X \to Y$ and two parallel arrows $g, h\colon Y \to Z$.]

In one, we set $f \ast g = f \ast h$ and in the other we do not (so $f \ast g \neq f \ast h$). Let us call the first one $\mathcal{C}_1$ and the second $\mathcal{C}_2$. Considering these as schemas, we will discuss the morphisms between them.

There is a unique functor $\mathcal{C}_2 \to \mathcal{C}_1$ that sends every object and every generating arrow to itself, but there is no functor $\mathcal{C}_1 \to \mathcal{C}_2$ that does so, because the equation $f \ast g = f \ast h$ fails in $\mathcal{C}_2$. However, if we drop this requirement, then there are many functors $\mathcal{C}_1 \to \mathcal{C}_2$. For example there is one which sends every object in $\mathcal{C}_1$ to $X \in \mathrm{Ob}(\mathcal{C}_2)$ and every morphism to $\mathrm{id}_X$. As an exercise, the reader may check that there are in fact fifteen (15) functors $\mathcal{C}_1 \to \mathcal{C}_2$. Each of these will prescribe a different way to exchange data between $\mathcal{C}_1$ and $\mathcal{C}_2$.

Currently, in order to create a schema mapping one writes down a set of logical axioms that detail how the data in one schema relates to that of the other schema. Here we suggest that many (if not all) of the mappings which come up in practice can be inferred from functors, as in Definition 4.1.1. Whereas logical axioms can be ad hoc, redundant, and difficult to envision, functors are simple to understand and surprisingly powerful. [DS: Can you support this claim?]
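The requirement in Definition 4.1.1 that equivalent paths be sent to equivalent paths can be tested directly on Example 4.1.2. The sketch below is an illustration under stated assumptions, not the authors' procedure: we assume the graph has arrows f : X → Y and g, h : Y → Z, we represent a path as a tuple of generator names (the empty tuple standing for an identity), and we rely on the fact that in the free category two paths are equal exactly when they are the same tuple. All function and variable names are hypothetical.

```python
# Assumed shape of the graph in Example 4.1.2: f : X -> Y and g, h : Y -> Z.
GENERATORS = {"f": ("X", "Y"), "g": ("Y", "Z"), "h": ("Y", "Z")}
# C1 imposes one path equation; C2 imposes none (it is free on the graph).
C1_RELATIONS = [(("f", "g"), ("f", "h"))]


def path_ok(path, start, end):
    """True if `path` is a composable chain of generators from start to end."""
    here = start
    for arrow in path:
        src, tgt = GENERATORS[arrow]
        if src != here:
            return False
        here = tgt
    return here == end


def image(path, arrow_map):
    """Image of a path of C1 in C2: concatenate the images of its generators."""
    result = ()
    for arrow in path:
        result += arrow_map[arrow]
    return result


def is_functor_c1_to_c2(object_map, arrow_map):
    """Check that a candidate assignment on generators defines a functor C1 -> C2."""
    for arrow, (src, tgt) in GENERATORS.items():
        if not path_ok(arrow_map[arrow], object_map[src], object_map[tgt]):
            return False
    # Every equation of C1 must become an equality of paths in the free category C2.
    return all(image(p, arrow_map) == image(q, arrow_map) for p, q in C1_RELATIONS)


# Sending every object and every generator to itself breaks f*g = f*h.
label_preserving = is_functor_c1_to_c2(
    {"X": "X", "Y": "Y", "Z": "Z"},
    {"f": ("f",), "g": ("g",), "h": ("h",)},
)
# Sending everything to X and every arrow to id_X (the empty path) is a functor.
collapse_to_x = is_functor_c1_to_c2(
    {"X": "X", "Y": "X", "Z": "X"},
    {"f": (), "g": (), "h": ()},
)
print(label_preserving, collapse_to_x)  # False True
```

Comparing tuples is only valid here because the target category is free; for a target with its own path equations one would need a procedure for deciding equality of paths, which this sketch does not attempt.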

4.2 A new concept of state transformations

Legend has it that Eilenberg and MacLane spent the effort to invent category theory because they needed to formalize the concept now known as natural transformation. Whereas a functor connects categories, a natural transformation connects functors (we promise not to go on like this!).

The definition of natural transformations between states naturally captures much of the semantics for what remains unchanged when performing an update. Between two set-valued functors $I, J\colon \mathcal{C} \to$ Set, a natural transformation is a way to tie row-ids in instance $I$ to row-ids in instance $J$ such that all facts are preserved on the nose.

With enough uniqueness constraints, this amounts only to inserting rows (and tying rows pre-insert to their correlates post-insert) or collapsing duplicate rows into a single one. If data is anonymized (a concept we will discuss in Section 4.4) then transformations give a much richer semantics, which may have further uses. Either way, the natural “adjointness” machinery which converts queries into their implementation depends on this notion of transformation, so we carefully state it in Definition 4.2.1.

In Section 4.3 we will discuss how a morphism of schemas creates a port through which to transfer data. Before we do so, we must discuss mappings between states on a given schema $\mathcal{C}$. To a category theorist, these are natural transformations of set-valued functors. To a database theorist, transformations may be a bit mysterious at first. However it will turn out that many constructions familiar to a database theorist happen to respect all transformations; for example queries, updates, and pivoting. Moreover, queries can be understood using classical theorems of category theory if given as input the category of states and transformations, rather than just the set of states.

One can think of such a transformation as a one-way dictionary between two states, a translation of every record in the first to a record in the second, which preserves the foreign-key relations. It turns out that a large class of updates are “functorial” with respect to these transformations. We will discuss that after giving the following definition.

Definition 4.2.1. Let $\mathcal{C}$ be a schema and let $\gamma$ and $\delta$ be two states on $\mathcal{C}$, i.e. functors $\mathcal{C} \to$ Set. A transformation of states, denoted $m\colon \gamma \to \delta$, consists of a function $m_c\colon \gamma(c) \to \delta(c)$ for each table $c \in \mathrm{Ob}_{\mathcal{C}}$ and must satisfy the following condition. For each row $x \in \gamma(c)$ and each column $f\colon c \to c'$, the following equation holds:

$$m_c(x).f = m_{c'}(x.f). \qquad (1)$$

Definition 4.2.1 is not as difficult as it may appear. A mapping of states $m\colon \gamma \to \delta$ is as one might expect: for every row in the first state, $m$ provides a row in the second state. Equation (1) is a coherence condition on cells: given a column $f\colon c \to c'$ and a row $r \in \gamma(c)$, the value of the $(r, f)$-cell in $\gamma$ must be sent by $m$ to the value of the $(m_c(r), \delta(f))$-cell in $\delta$. For example, $\delta$ may be the result of inserting some rows into $\gamma$, in which case $m$ is the inclusion of states.

By now we have defined a new category for every schema $\mathcal{C}$: the category of states on $\mathcal{C}$, which we denote $\mathcal{C}$–Set. The objects of $\mathcal{C}$–Set are states $\gamma\colon \mathcal{C} \to$ Set and the morphisms are transformations of states. As a trivial example, consider the category with only one object and its identity morphism, $[0] := \bullet$. Then $[0]$–Set is the category of sets: a state on $[0]$ is just a set of rows.

For any schema $\mathcal{C}$, the category $\mathcal{C}$–Set is a topos. That means that there is an internal language and logic for states on $\mathcal{C}$ supporting all the usual logical connectives. For example, if $m\colon \gamma \hookrightarrow \delta$ is an inclusion of states, the logic of $\mathcal{C}$–Set allows us to take the complement of $\gamma$ in $\delta$. However, note that the naive (table by table) complement of one state in another will not generally be a state. The topos logic “takes that into account” and returns the largest possible state contained inside the naive complement.

Example 4.2.2. Consider the category $\mathcal{G}$ with two objects $E$ and $V$ and two parallel arrows $s, t\colon E \to V$. A state $\delta\colon \mathcal{G} \to$ Set is the same thing as a multi-graph; it consists of a set $\delta(V)$ of vertices, a set $\delta(E)$ of edges, and a source and target map between them. A transformation of states $m\colon \gamma \to \delta$ is the same thing as a graph morphism; it sends vertices to vertices and edges to edges in such a way that the source and target of each edge are preserved. So for example a graph isomorphism is just an isomorphism in the category $\mathcal{G}$–Set.

4.3 Data migration functors

In this section we show that given any morphism of schemas $F\colon \mathcal{C} \to \mathcal{D}$, three data migration functors $\Delta_F$, $\Sigma_F$, and $\Pi_F$ are automatically defined. These functors transport data between $\mathcal{C}$ and $\mathcal{D}$ according to logical rules. However, these rules do not have to be explicitly mentioned; they arise as universal constructions.

In this section we will define these three migration functors. As evidence that these constructions, which seemingly have not been previously discussed in database literature, have “real-world” meaning, we will show that in special cases they naturally compute projects, duplicates, unions, joins, and selects, roughly following this dictionary:

Migration functor   SQL command           Direction
$\Delta_F$          Project / duplicate   Backward
$\Sigma_F$          Union / Skolem        Forward
$\Pi_F$             Join / Select         Forward

These migration functors have been well-known in category theory literature for 50 years, albeit with different intentions. We will explain what is meant in the above table by the “Direction” column shortly.

We should also note that $\Delta$, $\Sigma$, and $\Pi$ can be used in combination to create GLAV views or more complex queries. Although we have not defined or even properly described these migration functors yet, we provide an example of a theorem which roughly says that any complex query of this form can be refactored as a project / duplicate, followed by a select / join, followed by a union / skolem. [DS: fix this.]

Theorem 4.3.1. Suppose one has a sequence of functors

$$\mathcal{C}_1 \xleftarrow{F_1} \mathcal{D}_1 \xrightarrow{G_1} \mathcal{E}_1 \xrightarrow{H_1} \mathcal{C}_2 \xleftarrow{F_2} \mathcal{D}_2 \xrightarrow{G_2} \mathcal{E}_2 \xrightarrow{H_2} \cdots \xrightarrow{G_n} \mathcal{E}_n \xrightarrow{H_n} \mathcal{C}_{n+1},$$

and consider the migration functor $X\colon \mathcal{C}_1$–Set $\to \mathcal{C}_{n+1}$–Set given by

$$X = \Delta_{F_1} \Pi_{G_1} \Sigma_{H_1} \Delta_{F_2} \Pi_{G_2} \Sigma_{H_2} \cdots \Pi_{G_n} \Sigma_{H_n}.$$

Then there exists some 3-term sequence

$$\mathcal{C}_1 \xleftarrow{F} \mathcal{D} \xrightarrow{G} \mathcal{E} \xrightarrow{H} \mathcal{C}_{n+1}$$

and an isomorphism $X \cong \Delta_F \Pi_G \Sigma_H$ of functors $\mathcal{C}_1$–Set $\to \mathcal{C}_{n+1}$–Set. Moreover, the process of finding this 3-term sequence is implementable. [DS: word choice?]

Proof. [?]

[DS: Say somewhere that this composition of functors can be thought of as “pipelining.”]

Now we will describe the migration functors from a mathematical point of view. The simplest to describe is $\Delta$, which basically works by projection or duplication.

4.3.1 The migration functor $\Delta$

We begin with an example. Consider the morphism $F$ of schemas depicted below: [DS: Should the schema below have the commutativity relations on it as in (??)? The idea is the same without them, they add space, we don’t care if the reader thinks they’re there or not. Yet not having them here, and saying that (??) is the running example may add annoying ambiguity.]

[Diagram (2): the schema $\mathcal{D}$ has a table Employee with columns First and Last into String; the schema $\mathcal{C}$ has tables Employee and Department with columns Mgr, Dpt, Secr, and columns First, Last, Name into String; $F\colon \mathcal{D} \to \mathcal{C}$.]

Given a database state on $\mathcal{C}$, it should be clear how one would import it to $\mathcal{D}$: one simply drops the Department table entirely, and projects out the Dpt and Mgr columns of the Employee table. One can also use $\Delta$ to duplicate tables or columns. For example, consider the functor

[Diagram (3): the schema $\mathcal{D}$ has tables $E_1$, $E_2$, $S$ and columns $l_1\colon E_1 \to S$, $l_2\colon E_2 \to S$; the schema $\mathcal{C}$ has tables $E$, $D$, $S$ and columns $m$, $d$, $s$, $f$, $l$, $n$; $F\colon \mathcal{D} \to \mathcal{C}$.]

where $F$ sends $E_1, E_2$ to $E$ and sends $l_1$ and $l_2$ to $l$. Given any state $\gamma\colon \mathcal{C} \to$ Set on $\mathcal{C}$, the migration functor $\Delta_F$ duplicates the employee table, as well as projecting out everything but its last name column, and puts that result in $\mathcal{D}$.

In this section we will show that a similar process works for any morphism of schemas. Not only that, but there will be two ways to export a state on $\mathcal{C}$ to a state on $\mathcal{D}$. The distinction between them does not appear to have been discussed before in database literature. [DS: how do you feel about that?]

Let $F\colon \mathcal{C} \to \mathcal{D}$ be a morphism of schemas. Given a state $\delta\colon \mathcal{D} \to$ Set, let $\Delta_F(\delta)\colon \mathcal{C} \to$ Set denote the composition of functors

$$\mathcal{C} \xrightarrow{F} \mathcal{D} \xrightarrow{\delta} \text{Set}.$$

We will call $\Delta_F$ the re-indexing functor; it is a functor by Lemma 4.3.2 below.

Lemma 4.3.2. Let $F\colon \mathcal{C} \to \mathcal{D}$ be a morphism of schemas. The mapping $\Delta_F$ described above extends to a functor

$$\Delta_F\colon \mathcal{D}\text{–Set} \to \mathcal{C}\text{–Set},$$

which in particular means that for any transformation of states $\delta \to \delta'$ on $\mathcal{D}$, there is a transformation $\Delta_F(\delta) \to \Delta_F(\delta')$ of states on $\mathcal{C}$, and the composition law is preserved.

Proof. See [?].

Given a schema mapping $F\colon \mathcal{C} \to \mathcal{D}$, the functor $\Delta_F$ imports data from $\mathcal{D}$ to $\mathcal{C}$. We now describe two functors that export data from $\mathcal{C}$ to $\mathcal{D}$. These functors, denoted $\Sigma_F$ and $\Pi_F$, are known as the left and right adjoint (respectively) of $\Delta_F$. The reader can see above that once $F$ is defined, the migration functor $\Delta_F$ is not chosen but mathematically given. In the same way, its adjoints $\Sigma_F$ and $\Pi_F$ are mathematically given, even if they are more difficult to explain. We must save the explicit description of $\Sigma_F$ and $\Pi_F$ for the appendix, Section ??, where we allow ourselves to assume more category-theoretic background on the part of the reader. However, we can give the following informal descriptions here.

4.3.2 The migration functor $\Pi$

Let $F\colon \mathcal{C} \to \mathcal{D}$, and let $\gamma\colon \mathcal{C} \to$ Set be a state on $\mathcal{C}$; we wish to describe the state $\Pi_F(\gamma)$ on $\mathcal{D}$. To do so, we will take an arbitrary table $t$ in $\mathcal{D}$ and describe the set of rows in the exported state $\Pi_F(\gamma)(t)$. Given any table $t$ in $\mathcal{D}$ there are a number of tables $u$ in $\mathcal{C}$ that are either sent by $F$ to $t$ itself, or are sent to some table to which $t$ has a transitive column. The set $\Pi_F(\gamma)(t)$ is roughly the join (or “limit”) of all rows in all such $u$’s.

Example 4.3.3. Recall the functor $F\colon \mathcal{D} \to \mathcal{C}$ from Diagram (3), and that given a state $\gamma$ on $\mathcal{C}$ the result of $\Delta_F(\gamma)$ is a state on $\mathcal{D}$ that has duplicated the employee table and forgotten everything but the last name. Now suppose we follow $F$ by the functor

[Diagram (4): the schema $\mathcal{D}$ has tables $E_1$, $E_2$, $S$ and columns $l_1$, $l_2$ into $S$; the schema $\mathcal{E}$ has the same tables and columns together with a table $W$ with columns to $E_1$ and $E_2$; $G\colon \mathcal{D} \to \mathcal{E}$.]

Applying the migration functor $\Delta_F \Pi_G$ to $\gamma$ produces a state on $\mathcal{E}$. One can prove that the result will leave the employee tables alone, and that table $W$ will be their join; in other words, a row in $W$ (more precisely in $\Delta_F \Pi_G(\gamma)(W)$) is a pair of employees with the same last name.
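The re-indexing functor of Section 4.3.1 and the join of Example 4.3.3 can be made concrete on a tiny state. This is a minimal sketch under our own assumptions: a state is stored as a pair of dictionaries (one from table names to sets of row identifiers, one from column names to lookup tables), only the part of the schema touched by F is populated, the employee data is invented for illustration, and we assume that in Diagram (4) the two composites from W to S agree, so that the limit defining W reduces to a pullback.

```python
def delta(F_tables, F_columns, state):
    """Re-index a state along F: each table or column x of the source schema
    of F receives the data that `state` assigns to F(x)."""
    tables, columns = state
    new_tables = {t: tables[F_tables[t]] for t in F_tables}
    new_columns = {c: columns[F_columns[c]] for c in F_columns}
    return new_tables, new_columns


# Hypothetical state gamma: three employees and their last names.
gamma = (
    {"E": {1, 2, 3}, "S": {"Smith", "Jones"}},
    {"l": {1: "Smith", 2: "Smith", 3: "Jones"}},
)

# The functor F of Diagram (3) on the part we need: E1, E2 -> E and l1, l2 -> l.
F_tables = {"E1": "E", "E2": "E", "S": "S"}
F_columns = {"l1": "l", "l2": "l"}

d_tables, d_columns = delta(F_tables, F_columns, gamma)

# Table W as a pullback: pairs of employees whose last names agree.
W = {
    (e1, e2)
    for e1 in d_tables["E1"]
    for e2 in d_tables["E2"]
    if d_columns["l1"][e1] == d_columns["l2"][e2]
}
print(sorted(W))  # [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]
```

Note that `delta` never inspects the data: it only renames, which reflects the remark that once F is defined the functor is not chosen but given. The pullback step, by contrast, is specific to the shape of Diagram (4) and is not a general implementation of the right adjoint.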

4.3.3 The migration functor $\Sigma$

Let $H\colon \mathcal{E} \to \mathcal{C}$, and let $\epsilon\colon \mathcal{E} \to$ Set be a state on $\mathcal{E}$; we wish to describe the state $\Sigma_H(\epsilon)$ on $\mathcal{C}$. To do so, we will take an arbitrary table $t$ in $\mathcal{C}$ and describe the set of rows in the exported state $\Sigma_H(\epsilon)(t)$. Given any table $t$ in $\mathcal{C}$ there are a number of tables $u$ in $\mathcal{E}$ that are either sent by $H$ to $t$ itself, or are sent to some table with a transitive column to $t$. The set $\Sigma_H(\epsilon)(t)$ is roughly the union (or “colimit”) of all rows in all such $u$’s.

Note that this process automatically creates Skolem variables. For example, consider Diagram (2) but rename $\mathcal{D}$, $\mathcal{C}$, and $F\colon \mathcal{D} \to \mathcal{C}$ respectively to $\mathcal{E}$, $\mathcal{C}$, and $H\colon \mathcal{E} \to \mathcal{C}$ so as to conform with the current notation. Given any state $\epsilon$ on $\mathcal{E}$ (which consists of a set of employees and a first and last name for each), we need to give a state on $\mathcal{C}$. This is done by including a row in the Department table for not only every employee but every employee’s manager (to the $n$th power). We would also have to add new rows to the String table. These would not be strings, but variables acting like strings, called Skolem variables. These Skolem variables could be manipulated in various ways by state transformations; e.g. we may equate the Skolem variable for Alice’s manager with the Skolem variable for Bob’s department’s secretary. In Theorem 4.3.4 we will show that, as long as there are only finitely many paths through both the source and target database schema, there will only be finitely many Skolem variables created under the mapping.

Theorem 4.3.4. Let $F\colon \mathcal{C} \to \mathcal{D}$ be a morphism of schemas, and let $\gamma \in \mathcal{C}$–Set be a finite state on $\mathcal{C}$. If $\mathcal{C}$ and $\mathcal{D}$ are finite categories then $\Sigma_F(\gamma)$ will be a finite state on $\mathcal{D}$.

Proof. For each object $x \in \mathrm{Ob}_{\mathcal{D}}$, the set of $x$-records of $\Sigma_F(\gamma)$ is given by a colimit of finite sets, whose indexing category is finite. The result is a quotient of (and hence has at most the cardinality of) a finite coproduct of finite sets, which is finite.

Example 4.3.5. Let $\mathcal{D}$ be the category with one object $E$ and one generating arrow $\mathrm{mgr}\colon E \to E$; with no relations imposed, $\mathcal{D}$ has infinitely many arrows, $\mathrm{mgr}^n$ for all $n \in \mathbb{N}$. Let $\mathcal{C} := \bullet^E$ and let $F\colon \mathcal{C} \to \mathcal{D}$ be the unique functor. Given a finite set of employees $\gamma\colon \bullet^E \to$ Set, one can load that data into $\mathcal{D}$ by applying $\Sigma_F(\gamma)$. The result will be a state on $\mathcal{D}$ which includes infinitely many Skolem variables (one for each choice of employee-record in $\gamma$ and $n \in \mathbb{N}$).

However, if we know that the manager-hierarchy of our business is only 7 deep (i.e. that a 7th-degree manager is his or her own manager) then we can replace $\mathcal{D}$ by the category $\mathcal{E}$ with the same object $E$ and arrow $\mathrm{mgr}$, subject to the relation

$$\mathrm{mgr}^7 = \mathrm{mgr}^8,$$

and consider the unique functor $G\colon \mathcal{C} \to \mathcal{E}$. Now Theorem 4.3.4 applies, and we can rest assured that $\Sigma_G(\gamma)$ is a finite state. In fact its cardinality will be precisely 8 times the cardinality of $\gamma(E)$.

Moreover, if in the future our manager-hierarchy changes (say to allow for 10th degree managers), we can again use migration functors to transform the old data into the new scheme.

4.4 Typing and calculated fields

Consider again the category $\mathcal{C}$ from Section ?? Diagram ??. The observant reader may have noticed that a state $\gamma\colon \mathcal{C} \to$ Set has no obligation to return the set of strings when applied to the object called $\bullet^{\text{String}}$. This predicts an interesting problem: when merging or comparing databases across schemas, there is no reason to expect that the implementation of basic types, such as strings, is the same. In order to enforce that $\gamma(\bullet^{\text{String}})$ really is some set of strings (e.g. varchar(40)), one can use the migration functors and an already-established typing system.

This brings us to an interesting point. As mentioned in the introduction, category theory has been remarkably successful in modeling programming languages and advancing the theory of functional programs. For example, monads (functors of a certain kind) have brought notions such as input/output, state, exceptions, and concurrency to the purely functional setting. The goal of PL theory is to reason about programs and prove that they will perform as expected. It is our hope and belief that category theory will help bring that kind of guarantee and reasoning ability to database theory as well.

While it is a bit of a digression, let us take a moment to describe how category theory applies to functional programming languages (such as Haskell and ML). The relevant category $\mathcal{T}$ is that of “types and programs”: the objects are simple data types such as Int, String, Bool, as well as more complex data types such as (Int $\to$ Int) whose values are themselves functions. The arrows in $\mathcal{T}$ are programs, for example the program that takes a string and returns its length as an integer; this is an arrow String $\to$ Int. For each object $t \in \mathrm{Ob}(\mathcal{T})$, there is an intended “set of values”$^2$ $V(t)$ and each program $t \to t'$ is a function from the values of $t$ to the values of $t'$. In other words, $V\colon \mathcal{T} \to$ Set is a functor.

We call the category $\mathcal{T}$, together with the values functor $V$, a typing system, meaning a category of types and programs for which every object and arrow is fully implementable on a computer. Note the similarity between $\mathcal{T}$ and a database schema $\mathcal{C}$: in both cases the objects are types and the arrows are some way to take values of one type and produce values of another; in both cases the values of those types are given by a set-valued functor. The only difference is that the types in $\mathcal{C}$ are most often “user-defined” and their values are changing, whereas the types in $\mathcal{T}$ seem closer to “god-given” and their values are permanent.

Definition 4.4.1. Let $(\mathcal{T}, V\colon \mathcal{T} \to \text{Set})$ be a typing system as above and let $\mathcal{C}$ be a category. We call a functor $G\colon \mathcal{T} \to \mathcal{C}$ a $(\mathcal{T}, V)$-typing of $\mathcal{C}$. An object in $\mathcal{T}$ (or its image under $G$) will be called a typed object and a morphism in $\mathcal{T}$ (or its image under $G$) will be called a calculated field.

A $(\mathcal{T}, V)$-typed state on $\mathcal{C}$ consists of a state $\gamma \in \mathcal{C}$–Set and a transformation $\gamma \to \Pi_G V$ of states on $\mathcal{C}$, or equivalently a transformation $\Delta_G(\gamma) \to V$ of states on $\mathcal{T}$.

$^2$ In fact, the values of these types are often CPOs, but as there is a functor from CPOs to Sets, the above discussion is valid.
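Returning to Example 4.3.5, the count of 8 times the number of employees can be reproduced by writing the Skolem terms out explicitly. The sketch below relies on assumptions of our own: the source schema has the single table E and no columns, the target imposes only the relation mgr^(depth+1) = mgr^depth, and a row `(e, n)` stands for the term "the n-th degree manager of e" (with n = 0 being e itself). The function names and the employee names are hypothetical.

```python
def sigma_single_table(employees, depth):
    """Rows of the pushed-forward state on table E, together with its mgr column,
    when the target schema satisfies mgr^(depth + 1) = mgr^depth."""
    rows = {(e, n) for e in employees for n in range(depth + 1)}

    def mgr(row):
        e, n = row
        # Beyond `depth` the relation identifies further managers with the last one.
        return (e, min(n + 1, depth))

    return rows, mgr


def iterate(f, k, x):
    """Apply f to x exactly k times."""
    for _ in range(k):
        x = f(x)
    return x


employees = {"alice", "bob", "carol"}
rows, mgr = sigma_single_table(employees, depth=7)

print(len(rows))  # 24, i.e. 8 times the number of employees
# The defining relation mgr^7 = mgr^8 holds on every row.
print(all(iterate(mgr, 7, r) == iterate(mgr, 8, r) for r in rows))  # True
```

Changing `depth` to 10 gives 11 rows per employee, which is one way to picture the remark that a change in the manager hierarchy can be handled by migrating the old data along a new functor.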

Note that under this definition any computer program $P$ can serve as a foreign key column connecting its input type $X$ to its output type $Y$; we merely write it as an arrow $X \xrightarrow{P} Y$ in $\mathcal{T}$. We did not define morphisms of $(\mathcal{T}, V)$-typed states in Definition 4.4.1 for fear of overwhelming readers. The definition can be found in Definition B.0.4.

Example 4.4.2. Suppose each employee of a company has a salary and a travel allowance, and we wish to calculate the sum of these two fields. The relevant schema is

[Diagram: the schema $\mathcal{C}$ has objects $\bullet^E$, $\bullet^{\mathbb{R}}$, $\bullet^{\mathbb{R}\times\mathbb{R}}$; columns salary, travel $\colon E \to \mathbb{R}$; a column $f\colon E \to \mathbb{R}\times\mathbb{R}$; projections $\pi_1, \pi_2\colon \mathbb{R}\times\mathbb{R} \to \mathbb{R}$; and $+\colon \mathbb{R}\times\mathbb{R} \to \mathbb{R}$; subject to the relations $f;\pi_1 = \text{salary}$ and $f;\pi_2 = \text{travel}$.]

The composite $f;+\colon \bullet^E \to \bullet^{\mathbb{R}}$ is a calculated field. However, nothing about this schema forces that the elements of $\bullet^{\mathbb{R}}$ or $\bullet^{\mathbb{R}\times\mathbb{R}}$ are anything like real numbers or pairs of such, nor that the column $+$ really produces the sum. To enforce these constraints, we use a typing as above. Let $\mathcal{T}$ be the category

[Diagram: the category $\mathcal{T}$ has objects $\bullet^{\mathbb{R}}$ and $\bullet^{\mathbb{R}\times\mathbb{R}}$, projections $\pi_1, \pi_2\colon \mathbb{R}\times\mathbb{R} \to \mathbb{R}$, and $+\colon \mathbb{R}\times\mathbb{R} \to \mathbb{R}$.]

and let $V\colon \mathcal{T} \to$ Set be the appropriate functor. [DS: Carlo: help.] Under Definition 4.4.1, every object in $\mathcal{C}$, except for $E$, is a typed object, and both projections $\pi_1, \pi_2$ and the sum of real numbers $+$ are calculated fields, whereas salary and travel are not. There is an obvious functor $G\colon \mathcal{T} \to \mathcal{C}$, so we have a typing of $\mathcal{C}$. A $(\mathcal{T}, V)$-typed state $\delta$ on $\mathcal{C}$ consists of a set $\delta(x)$ for every object $x$ in $\mathcal{C}$, together with an assignment of a value to each cell in $\mathcal{T}$. In other words, it is what we want.

The point here is that the way in which categories are used in PL theory fits in perfectly with the way we propose to use categories in DB theory. We can exploit that connection to make a bridge between these two fields of research. In Section 6.1 we discuss this issue a bit more to integrate a few advanced results.

4.5 The update category, revisited

In Section 3.3 we introduced update commands intuitively. We start from this rather loose definition of an update command, which we can formally state as follows:

Definition 4.5.1. Let $\mathcal{C}$ be a schema. An update command on $\mathcal{C}$ is a function $U\colon \mathrm{Ob}\,\mathcal{C}\text{–Set} \to \mathrm{Ob}\,\mathcal{C}\text{–Set}$; the subcategory of $\mathcal{C}$–Set on which $U$ is defined is called the domain of definition of $U$. Let $\mathrm{UpCmd}_{\mathcal{C}}$ denote the monoid of update commands on $\mathcal{C}$. The multiplication law for this monoid is simply sequential application of update commands. Two update commands are considered equal if they are the same as functions.

This simple definition already allows us to define equivalence for chains of update statements. However, we are interested in giving a tighter definition. We are now interested in the category of updates, as opposed to the category (or monoid) of update commands. We want each object in this category to be a state and each morphism $A \to B$ to be an update command applied to $A$ which returns $B$. We say that two updates $f, g\colon A \to B$ are equivalent if the commands which generated $f$ and $g$ are equal, i.e. if these two commands are identical when applied to any state. It will turn out (see ??) that this definition is not arbitrary from a categorical perspective, but instead “suggested” in the sense that it comes from a major tool of category theory called the Grothendieck construction.

Definition 4.5.2. Let $\mathcal{C}$ be a schema. The category of updates on $\mathcal{C}$, denoted $\mathrm{Up}_{\mathcal{C}}$, has as objects the set $\mathrm{Ob}\,\mathcal{C}\text{–Set} \sqcup \{\text{fail}\}$ of states on $\mathcal{C}$ plus a fail state. If $\gamma$ and $\delta$ are states, then a morphism $f\colon \gamma \to \delta$ is an update command $U \in \mathrm{UpCmd}_{\mathcal{C}}$ such that $U(\gamma) = \delta$, and a morphism $\gamma \to \text{fail}$ is any update command which fails on $\gamma$. For the composition law, we say $f_1 \ast f_2 = f_3$ if the underlying update commands also satisfy that equation, $U_1 \ast U_2 = U_3$, i.e. if this composition law would hold for any input state.

Proposition 4.5.3. Let $\mathcal{C}$ be a category and let $\mathrm{UpCmd}_{\mathcal{C}}$ be the monoid of update commands, considered as a category with one object. There is a functor $F\colon \mathrm{UpCmd}_{\mathcal{C}} \to$ Set where the unique object of $\mathrm{UpCmd}_{\mathcal{C}}$ is sent to the set $\mathrm{Ob}\,\mathcal{C}\text{–Set} \sqcup \{\text{fail}\}$, and each update command is sent to the corresponding function $\mathrm{Ob}\,\mathcal{C}\text{–Set} \to \mathrm{Ob}\,\mathcal{C}\text{–Set}$. The Grothendieck construction of $F$ is precisely the category $\mathrm{Up}_{\mathcal{C}}$ defined in Definition 4.5.2.

Proof. Obvious to those who know and can work with the definition of Grothendieck constructions.

Remark 4.5.4. In Definition 4.5.1, we defined an update command to be a function from states to states. Surprisingly, it seems that updates are often functorial; i.e. they respect the notion of “transformation of states” defined in Definition 4.2.1. The precise structure of update commands, such as those found in SQL’s data manipulation language, should be studied using category theory to determine

• whether all update commands are indeed partial functors $U\colon \mathcal{C}\text{–Set} \to \mathcal{C}\text{–Set}$,
• whether there is a nice characterization of standard updates within the category of all partial functors $\mathcal{C}\text{–Set} \to \mathcal{C}\text{–Set}$, and
• whether, if there are some partial functors that are not standardly considered updates, such functors may have new capabilities that have not been explored.

In Section ?? we will discuss transactions and triggers and show that category theory can provide a useful semantics for these.
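Definitions 4.5.1 and 4.5.2 can be illustrated on a schema with a single table. The sketch below makes several assumptions that are ours rather than the paper's: a state is a frozen set of row identifiers, the failure rule (inserting a row that is already present fails) is invented for illustration, and the fail state is treated as absorbing so that every command is total on states plus fail. All function names are hypothetical.

```python
FAIL = "fail"  # the extra fail state of Definition 4.5.2


def insert(row):
    """Update command: add `row`; assumed to fail if the row is already present."""
    def command(state):
        if state == FAIL or row in state:
            return FAIL
        return state | {row}
    return command


def delete(row):
    """Update command: remove `row` if present; assumed never to fail on a state."""
    def command(state):
        if state == FAIL:
            return FAIL
        return state - {row}
    return command


def then(first, second):
    """Multiplication in the monoid of update commands: sequential application."""
    return lambda state: second(first(state))


def arrow(state, command):
    """A morphism of the update category: (source state, command, target state)."""
    return (state, command, command(state))


s0 = frozenset({"alice"})
swap = then(insert("bob"), delete("alice"))

print(swap(s0) == frozenset({"bob"}))                     # True
print(insert("alice")(s0))                                # fail
print(then(insert("alice"), delete("alice"))(s0))         # fail
print(arrow(s0, insert("bob"))[2] == frozenset({"alice", "bob"}))  # True
```

In this encoding an arrow is a command together with a chosen source state, and its target is forced by applying the command, which is the shape of morphism that Proposition 4.5.3 obtains from the Grothendieck construction. Equality of commands "as functions" cannot be decided by the sketch; it can only be spot-checked on sample states.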

5. CATEGORIES FOR CLASSICAL RESULTS

In the previous section, we provided the user with a natural representation of databases in terms of categories and functors. What remains to be shown is that this specific application of category theory to databases is constructive, and that it is a valid alternative for proving useful results. We attempt this in this section by showing how we can use the framework introduced above to prove some of the classical results in database theory.

CC:
- schema mapping as a functor
- extensions of aggregates and UDF

6. MORE ADVANCED RESULTS

In the previous sections, we showed that category theory naturally extends set theory and thus naturally allows us to represent database problems, and that its rigorous foundations allow us to prove some of the classical theorems from the database literature. This section is devoted to suggesting how, using these conceptual tools, we can tackle more advanced problems.

6.1 Integrations with PL

We described in Section 4.4 how category theory is typically applied in the theory of programming languages, and how this meshes with our application to databases. In Section 6.6 we will also mention how Grothendieck constructions are useful [?][?] and how they relate to our work. Harper’s work also discussed “Kinds”: a category $\mathcal{K}$ together with a map $\mathcal{K} \to \mathbf{Cat}$. Here, each kind has a category of states, rather than a mere set of states, and these are the types of that kind. Many of the above ideas on set-valued functors can be recast to category-valued functors and again integrate well with Harper’s theory. One could also use functors valued in topological spaces or vector spaces and perhaps find use in spatial or temporal databases.

As a quick example of the link, one of the big successes in PL theory was the use of monads to describe effects such as those encountered in IO, exceptions, concurrency, etc. A monad on a category $\mathcal{C}$ is a functor $M \colon \mathcal{C} \to \mathcal{C}$ with some properties. The simplest such example is the list monad, which takes a type and produces its list type. In database theory the same monad might be useful to relax the notion of “state transformation” $m \colon \gamma \to \delta$ (from Section 4.2), and instead allow each row in $\gamma$ to map to a sequence of rows in $\delta$.

6.2 Grouping

In this section we show that grouping a table by a few of its columns is functorial. That is, it fits in perfectly with the rest of the theory of this paper.

Let $\mathcal{C}$ be a schema; one can picture a 5-column table in $\mathcal{C}$ as a functor $F \colon \mathcal{H} \to \mathcal{C}$, where $\mathcal{H}$ is the right-hand category below.

[Diagram: $\mathcal{G}$ has arrows $f_1, f_2, f_3$ from $T_1$ to $C_1, C_2, C_3$, an arrow $g \colon T_1 \to T_2$, and arrows $f_4, f_5$ from $T_2$ to $C_4, C_5$; $\mathcal{H}$ has arrows $f_i \colon T \to C_i$ for $i = 1, \dots, 5$; an arrow $G$ is drawn from $\mathcal{G}$ to $\mathcal{H}$.]

There is a functor $G \colon \mathcal{G} \to \mathcal{H}$ (where $T_1, T_2$ are sent to $T$ and labels are otherwise preserved) which will allow us to group by columns 4 and 5. Hopefully, the way one does this for an arbitrary subset of the columns is clear. At this point we have functors

$$\mathcal{G} \xrightarrow{\;G\;} \mathcal{H} \xrightarrow{\;F\;} \mathcal{C}$$

so, as in Section 4.3.1, we can take any database state $\delta$ on $\mathcal{C}$ and apply $\Delta_F$ if we want to isolate table $T$, or we can apply $\Delta_G$ to that to get a state $\gamma := \Delta_G(\Delta_F(\delta))$ (DS: ?) in $\mathcal{G}\text{–Set}$. It should be clear how this would look: two tables and a foreign key column pointing from the first to the second, where we intend to group by the columns of the second. In other words, for every element in $\gamma(T_2)$, we want a table with data columns $C_1, C_2, C_3$, i.e. a state on

[Diagram: $\mathcal{G}_0$ has arrows $f_1, f_2, f_3$ from $T_1$ to $C_1, C_2, C_3$.]

We have the following proposition.

Proposition 6.2.1. There is a functor

$$\mathcal{G}\text{–Set} \longrightarrow \mathcal{X}\text{–Set}_{\mathrm{Ob}\,\mathcal{G}_0\text{–Set}}$$

where $\mathcal{X}$ is the schema

[Diagram: $\mathcal{X}$ has arrows from $T_2$ to $C_4$, $C_5$ and $X$.]

and $X$ is typed by the set of $\mathcal{G}_0$-tables, $\mathrm{Ob}\,\mathcal{G}_0\text{–Set}$. DS: Carlo: help

Proof. ***

Proposition 6.2.1 says that given a choice of how to group some table $T$, the grouping operation is functorial.

Remark 6.2.2. The same idea works for indexing. Given a field $c \colon X \to T$ in a schema $\mathcal{C}$, we can index table $X$ by type $T$. Suppose that $T$ has value set $V \to T$. Form the fiber product

$$\begin{array}{ccc} X \times_T V & \longrightarrow & X \\ \downarrow & & \downarrow c \\ V & \longrightarrow & T \end{array}$$

and, after removing the arrow $c$ (using a migration functor), apply the above operation to group by $V$. The result is a map $V \to \mathrm{Ob}\,\mathcal{X}\text{–Set}$ that sends every value to the set of $X$ records that have that value in the $c$ field.

6.3 Aggregates

Let $\mathcal{C}$ be a schema and $S = \mathrm{Ob}\,\mathcal{C}\text{–Set}$ its set of states. Suppose given a function $h \colon S \to T$, where $T$ is any set. For example $T$ could be the set of integers and $h$ could send any state to its total number of rows or the sum of cells in a given column of $\mathcal{C}$. Then aggregation simply becomes a combination of grouping, as defined in Section 6.2, and the function $h$, which could be user-defined.

Example 6.3.1. A scientist is measuring the spring constant for springs made of different materials. He or she wants a table of materials vs. spring constants. We will consider the map of schemas

[Diagram: $\mathcal{G}_0$ has arrows from sample to x-val and y-val; $\mathcal{G}$ has the same arrows together with sample → experiment → mtrl; an arrow $F$ is drawn from $\mathcal{G}_0$ to $\mathcal{G}$.]

A single experiment can be viewed in $\mathcal{G}_0$, whereas the totality is viewed on $\mathcal{G}$. Let $S = \mathrm{Ob}\,\mathcal{G}_0\text{–Set}$ be the set of states on $\mathcal{G}_0$. A data set for one spring can either be seen formally as an element of $S$ or visually as a scatterplot of force-extension data for that spring.

By Hooke’s law, the force vs. extension curve of a spring is assumed linear, of the form $y = kx + b$. During the data reduction process the scientist will use some program $h \colon S \to \mathbb{R}$ which assigns to each data set a best-fit value of the spring constant $k$. By grouping, we can functorially transform a state on $\mathcal{G}$ into a state on the schema

[Diagram: arrows from experiment to mtrl and to $S$.]

which sends each experiment to its data set. We can then use migration functors and typing to add a calculated field given by $h \colon S \to \mathbb{R}$ to yield a state on the schema

[Diagram: arrows from experiment to mtrl and to $\mathbb{R}$.]

This will be a table of materials and their spring constants.

6.4 Transactions

A transaction is a series of updates which must be performed as a unit, or at least act as though it is performed as a unit. In this section we describe the relationship between transactions and updates in the language of category theory.

The main point is to relate transactions to updates; each will be a morphism in an appropriate category and we want a functor between them. The set of objects in these two categories will be the same, namely the set of states on a given schema $\mathcal{C}$. The difference is in the morphisms: the morphisms in the category of updates are all sequences of updates, whereas the morphisms in the category of transactions are all sequences of transactions. Clearly, every transaction yields a sequence of updates, but some sequences of updates are not transactions.

Definition 6.4.1. Let $\mathcal{C}$ be a schema and let $\mathrm{Up}_{\mathcal{C}}$ be the category of updates on $\mathcal{C}$. A category of transactions on $\mathcal{C}$ is a subcategory $\mathcal{X}$ of $\mathrm{Up}_{\mathcal{C}}$ that has the same set of objects, $\mathrm{Ob}\,\mathcal{X} = \mathrm{Ob}\,\mathrm{Up}_{\mathcal{C}} = \mathrm{Ob}\,\mathcal{C}\text{–Set}$. We denote the inclusion functor by $L \colon \mathcal{X} \to \mathrm{Up}_{\mathcal{C}}$.

Note that under the above definition there may be many different categories of transactions on $\mathcal{C}$. That is, one can decide which updates (or sequences of updates) will count as transactions and which will not.

The beauty of this approach is that the updates in a sequence of transactions could be done in any order, so long as the resulting path is equivalent to the “default” path. Let us make this more precise. To say that $t_1$ and $t_2$ are transactions with underlying update sequences $r_1, r_2, \dots, r_m$ and $s_1, s_2, \dots, s_n$ respectively, is to say that $L(t_1) = r_1 \ast r_2 \ast \cdots \ast r_m$ and $L(t_2) = s_1 \ast s_2 \ast \cdots \ast s_n$. And of course, $L(t_1 \ast t_2) = r_1 \ast \cdots \ast s_n$ accordingly. However there may be more than one sequence of updates that compose to equal $L(t_1 \ast t_2)$. In a category, two different sequences of arrows may be declared equal, and we have declared two sequences of updates to be equal if they have the same effect. Thus the transaction scheduler could choose any sequence of updates whose composition is equal to $L(t_1 \ast t_2)$.

Recall that we defined updates using a notion of update commands. Similarly we could define transaction commands, a monoid which acts on the set $\mathrm{Ob}\,\mathcal{C}\text{–Set}$ of states, just as $\mathrm{UpCmd}_{\mathcal{C}}$ does. A result analogous to Proposition 4.5.3 is easily proved. In fact, it would be interesting to study the category of submonoids of $\mathrm{UpCmd}_{\mathcal{C}}$, to classify various notions of transaction on a given schema.

Remark 6.4.2. The above does not explore transactions in all their glory. There are promising connections with the “state monad” in PL theory that will be useful to explore in future work. This monad would allow one to model programs which have the capacity to access the state of the database, use it in a computation, and then update the state. Monads are specifically designed to allow one to keep track of the history of such transactions.

6.5 Triggers

A database schema is supposed to give a certain semantics to the data. Triggers are designed to uphold this semantics: given an update which would violate a constraint, triggers are employed to rectify the situation. In this paper, we do not describe triggers which act outside the database (like sending a supervisor an email whenever a certain event occurs); we only describe those that act on the state of the database, and we call these corrective triggers. We can describe corrective triggers categorically as follows.

Suppose that $\mathcal{C}$ is a sketch (which includes information like “table $T$ is the product of table $A$ and table $B$”) and $\mathcal{B}$ is the underlying category (with only foreign key type constraints). Then any state on $\mathcal{C}$ will restrict to a state on $\mathcal{B}$, whereas a state on $\mathcal{B}$ must be “corrected” in order to give a state on $\mathcal{C}$. In other words, there are functors $F \colon \mathcal{C}\text{–Set} \to \mathcal{B}\text{–Set}$ and $G \colon \mathcal{B}\text{–Set} \to \mathcal{C}\text{–Set}$ such that $FG = \mathrm{id}_{\mathcal{C}\text{–Set}}$, where

$G$ is the functor that takes a state on $\mathcal{B}$ and formulaically adds rows to table $T$ to ensure that it becomes the product of $A$ and $B$. Updates to a $\mathcal{C}$-set $\delta$ will generally land in $\mathcal{B}\text{–Set}$ because the updater either does not know about the constraint or wants it to be taken care of automatically. The trigger is basically $G$: it is employed to take the resulting $\mathcal{B}$-set and “rectify it” to give a new $\mathcal{C}$-set. Whereas this is generally done with a piece of code, the semantics should be considered as a functor $G$ as above.

6.6 RDF and semi-structured data

Suppose that $\mathcal{C}$ is a schema and $\delta \colon \mathcal{C} \to \mathbf{Set}$ is a state. A beautiful construction of category theory, called the Grothendieck construction, allows one to convert the functor $\delta$ into a new category $\mathrm{El}(\delta)$, called the category of elements of $\delta$, which is surprisingly similar to an RDF graph. The objects of $\mathrm{El}(\delta)$, like vertices in an RDF graph, correspond to the entities in $\delta$, i.e. one for each row in each table. The arrows of $\mathrm{El}(\delta)$, like edges in an RDF graph, correspond to the attributes of the entities in $\delta$, i.e. an arrow points from an entity to its value in some cell. So the subject and value in a (subject, predicate, value) triple correspond to two objects in $\mathrm{El}(\delta)$, and the predicate corresponds to an arrow between them.

In fact, there is a canonical functor $\pi \colon \mathrm{El}(\delta) \to \mathcal{C}$, where $\mathrm{El}(\delta)$ acts as the RDF triple store (or RDF graph), $\mathcal{C}$ acts as the RDF schema, and $\pi$ sends each entity and predicate to its type as declared in $\mathcal{C}$. Functors such as $\pi$ can encode semi-structured data as well. In other words, such a functor (or RDF triple store) may not come from a relational database state on $\mathcal{C}$, for example because such a state would have null values or multiple values in a single cell, or may not conform to the commutativity constraints declared in $\mathcal{C}$. Thus the Grothendieck construction sends database states into a more “relaxed” world of semi-structured data in a functorial way.

6.7 Pivoting

DS: If we want to be sure of no nulls in the pivoted thing, we have to enforce that Ce is a fiber product as drawn below in $\mathcal{P}$. But if we allow nulls in the pivoted thing then we don’t need to get into it (we can stay in categories, rather than needing sketches). If we allow nulls then we aren’t in the world of database states, but “semi-structured data” as described in Section 6.6. What do you think?

We can pivot any state on a schema of the following form:

$$\mathcal{P} := \begin{array}{ccccc} \mathrm{R} & \longleftarrow & \mathrm{Ce} & \longrightarrow & \mathrm{D} \\ \downarrow & & \downarrow & & \downarrow \\ \mathrm{T} & \xleftarrow{\;s\;} & \mathrm{Co} & \xrightarrow{\;t\;} & \mathrm{DT} \end{array}$$

where the labels of the objects stand for row, cell, domain, table, column, and data type, respectively. One can change the number of columns in a table or the data types of those columns by changing the state on $\mathcal{P}$. Pivoting turns the state on $\mathcal{P}$ into a new schema whose tables are the elements of T, whose columns are the elements of Co with data types given by $t$, etc. In this section we define pivoting only for schemas in which the tables have no foreign keys, just data columns. Allowing foreign keys can be done along the same lines. DS: should we put it in an appendix?

What is amazing, from a categorical perspective, about pivoting is that it is functorial: in other words it naturally respects the categorical structure we defined in Section 4.2, and works hand-in-hand with two of the migration functors, $\Delta$ and $\Sigma$.

A state on only the bottom row of $\mathcal{P}$ can be transformed into a bipartite graph with tables on one side, data types on the other, and an arrow from each table to its data type. Such a graph is in fact a category. Moreover, morphisms of states on $\mathcal{P}$ are sent by this construction to functors between categories. That is, we have a functor

$$\mathrm{Piv}_0 \colon \mathcal{P}\text{–Set} \to \mathbf{Cat}$$

(we prove this as part of Proposition 6.7.2) which works only at the schema level. We need the following definition in order to see the full power of pivoting.

Definition 6.7.1. Let $\mathbf{Data}$ denote the category for which an object is a pair $(\mathcal{C}, \delta \colon \mathcal{C} \to \mathbf{Set})$ consisting of a schema and a state on it; and for which a morphism $a \colon (\mathcal{C}, \delta) \to (\mathcal{C}', \delta')$ consists of a pair $(a, a^\sharp)$ where $a \colon \mathcal{C} \to \mathcal{C}'$ is a map of schemas and $a^\sharp \colon \delta \to a^*\delta'$ is a transformation of states.

In terms of RDF graphs, the above is easier to understand: it’s just a commutative square of categories

$$\begin{array}{ccc} \mathrm{Gr}(\delta) & \xrightarrow{\;a^\sharp\;} & \mathrm{Gr}(\delta') \\ \downarrow & & \downarrow \\ \mathcal{C} & \xrightarrow{\;a\;} & \mathcal{C}' \end{array}$$

Proposition 6.7.2. There is a functor $\mathrm{Piv} \colon \mathcal{P}\text{–Set} \to \mathbf{Data}$, sending a state $\gamma$ on $\mathcal{P}$ to an object $(\mathcal{C}, \delta)$ where the set of tables in $\mathcal{C}$ is $\gamma(\mathrm{T})$, the set of all columns is $\gamma(\mathrm{Co})$, each column is sourced in a table by $\gamma(s)$ and has values in a data type by $\gamma(t)$, the set of all rows is given by $\gamma(\mathrm{R})$, the set of cells is given by $\gamma(\mathrm{Ce})$, etc.

Proof. If $F$ is the schema morphism

$$\mathcal{P}_0 := \big(\mathrm{T} \xleftarrow{\;s\;} \mathrm{Co} \xrightarrow{\;t\;} \mathrm{DT}\big) \xrightarrow{\;F\;} \mathcal{Q} := \big(\mathrm{Co} \rightrightarrows \mathrm{T}\big)$$

the migration functor $\Sigma_F$ sends states on $\mathcal{P}_0$ to states on $\mathcal{Q}$. A state on $\mathcal{Q}$ is a graph, and there is a functor sending graphs to categories. Combining this with the $\Delta$ migration functor applied to the obvious inclusion $\mathcal{P}_0 \to \mathcal{P}$ gives us $\mathrm{Piv}_0$ as above.

Applying this same construction to the top row of $\mathcal{P}$ yields another such functor, written $\mathrm{Piv}_1 \colon \mathcal{P}\text{–Set} \to \mathbf{Cat}$, and a natural transformation $m \colon \mathrm{Piv}_1 \to \mathrm{Piv}_0$. The requirement on $\mathcal{P}$-sets $\delta$ that $\delta(\mathrm{Ce}) = \delta(\mathrm{R}) \times_{\delta(\mathrm{T})} \delta(\mathrm{Co})$ enforces that the functor $\mathrm{Piv}_1(\delta) \to \mathrm{Piv}_0(\delta)$ is a discrete op-fibration, which means that $\mathrm{Piv}_1(\delta) = \mathrm{Gr}(\gamma)$ for some state $\gamma \colon \mathrm{Piv}_0(\delta) \to \mathbf{Set}$. We can consider $m$ as a functor $m \colon \mathcal{P}\text{–Set} \to \mathbf{Cat}^{[1]}$, and the preceding sentence implies that it serves as a functor $m \colon \mathcal{P}\text{–Set} \to \mathbf{Data}$. If we take $\mathrm{Piv} := m$, it is clear that it has the prescribed properties.

Remark 6.7.3. It is interesting to notice that applying the Grothendieck construction to a state on $\mathcal{P}$ is similar to “reification” in RDF triple stores and schemas. The arrows

(cells) become new objects which point to both their subject and value. This is just another case of how natural concepts in category theory magically recreate ideas that may have seemed ad hoc or complicated without it.

CC: (mention many prove one)
- transactions
- schema mapping extensions... beyond weak acyclicity
- PL integration... etc..

7. RELATED WORK

DS: The following papers discuss sketches as models for database schemas:

• Zinovy Diskin. “Databases as diagram algebras: specifying queries and views via the graph-based logic of sketches.” Research Report FIS/LDBD-96-02. Frame Inform System, Riga, Latvia, 1996.

• Zinovy Diskin, Boris Kadish. “Variable set semantics for keyed generalized sketches: formal semantics for object identity and abstract syntax for conceptual modeling.” Data & Knowledge Engineering. Volume 47, Issue 1, October 2003.

• Michael Johnson, Robert Rosebrugh, and R. J. Wood. “Entity-relationship-attribute designs and sketches.”

I claim that we are doing something more here.

• Many of these papers were either not clear or not written for a computer scientist.

• These papers are too flexible: so much of the theory can be done with just categories (not sketches) that you give up a lot of power by jumping so far. For example, “data migration” does not really exist for sketches like it does for categories.

• These papers do not discuss calculated fields.

• These papers do not discuss the transformations of states that make so much of our machinery work.

• These papers do not discuss updates, grouping, triggers, transactions, pivoting, RDF, etc.

• The conception of aggregates in some of them is surprisingly unlike mine.

• These papers do not show any examples in table form, which makes them difficult to understand.

• These papers do not emphasize commutative diagrams, and never suggest anything like $\mathrm{mgr}^7 = \mathrm{mgr}^8$.

8. CONCLUSIONS

category theory can be very useful for DB

APPENDIX

A. CATEGORY THEORY

B. ABUSE OF NOTATION, AND DETAILED COMMENTS

Definition B.0.4. A typing system is defined as a category $\mathcal{T}$ and a functor $V \colon \mathcal{T} \to \mathbf{Set}$. Given a category $\mathcal{C}$, we call a functor

$$G \colon \mathcal{T} \to \mathcal{C}$$

a $(\mathcal{T}, V)$-typing of $\mathcal{C}$. A $(\mathcal{T}, V)$-typed state on $\mathcal{C}$ consists of a pair $(\gamma, m)$ where $\gamma \in \mathcal{C}\text{–Set}$ is a state and $m \colon \gamma \to \Pi_G V$ is a transformation of states on $\mathcal{C}$. A morphism of $(\mathcal{T}, V)$-typed states on $\mathcal{C}$ is a transformation of states $f \colon (\delta, m) \to (\delta', m')$ such that $m = m' \ast f$ (DS: $\ast$). We denote the category of $(\mathcal{T}, V)$-typed states on $\mathcal{C}$ by $\mathcal{C}\text{–Set}_{/V}$.

CC: comment on use of sketches for multi-column pk and fk.

CC: add back to proper example

The fact that $\mathrm{Ob}(\mathbf{Set})$ is not actually a set but a “class” is a distinction that is carefully made in mathematics textbooks but which we elide.

DS: Some references follow below the \end{document} line in this .tex file.