A Categorical Viewpoint on

Christophe Culan Maxime Lubin [email protected] [email protected]

Abstract

Modeling the reasoning process of human beings is a long-standing goal of artificial intelligence, from the symbolic approach rooted in logic of the 1960s - composable and interpretable, but brittle in the face of fuzzy and stochastic patterns - to recent advances at the other end of the spectrum. We propose a novel model and inference technique based on Category Theory to treat training data not only as a set but as a full-fledged learnable category: which points are related to each other? In what way? In which causal direction? Learning such a rich view of a dataset is tantamount to soundly dropping the often unquestioned hypothesis of independent, identically distributed training data. This can be useful to better account for the consequences of data augmentation techniques, as well as to learn non-trivial relations spanning multiple observations.

1 Introduction

We perceive and abstract our reality through the prism of causality: causes precede consequences. However, we do not perceive causality directly. Instead, through repeated experimentation, we eventually notice patterns emerging from the underlying causal links, e.g., drinking alcohol then feeling tipsy. These patterns can be learned through induction, even by machines, as all the great recent progress in machine learning showcases in vision [KSH12][HZRS15], natural language processing [VSP+17][HS97], speech processing [GMH13][BCS+16] or reinforcement learning [SSS+17][MKS+15]. We now have tremendous pattern-matching machines, but they still struggle and are largely incapable of reifying observations into causal links. The most notable exceptions are Probabilistic Graphical Models [KF09], Markov Logic Networks [RD06], Structural Causal Models [Pea09] and the recent Graph Networks [BHB+18], all of which require an explicit encoding of stochastic causal dependence between observables.

Nature and human abstractions abound in different patterns. So much so that they first appear to form a large set of distinct, unrelated curiosities. However, in each domain, humanity's work and ingenuity largely proved the infinite-looking zoo of patterns to stem from a small number of unique interacting entities. For example, physical processes are largely described by the handful of elementary particles and evolution laws of the Standard Model, mathematics by axioms and proof constructs, language by words and syntax. Complexity seems to emerge through the interaction of simpler components, a phenomenon known as combinatorial generalization. The emphasis is not on the intrinsics of the elementary entities but on their collective behavior and interaction. [BHB+18] offers a comprehensive, well-written introduction and justification for this argument. Category theory [EM45] is a very powerful framework [Law66] that precisely embodies this view.
Since its inception, the pervasive nature of categories has been steadily fleshed out, revealing many deep connections between seemingly unrelated fields of mathematics, and category theory is now a core tool in state-of-the-art developments of mathematics, computer science and theoretical physics. Machine learning, and more generally statistics, has thus far reaped little theoretical and practical benefit from it. Our choice to rely on category theory for AI is not so strange a position. Most if not all work on computational ontologies, e.g. [HV06][PSST10], is about categories, albeit very often not phrased in its direct language but instead in logic. Category theory is starting to creep into cognitive sciences to formally model concepts, interplays and analogies [BP][HG08], and has even been applied to neural networks [Hea00]. This paper starts with a brief introduction of the few basic notions required from category theory, after which we incrementally build a categorical view of classification problems in Section 2. This is followed by a short presentation of related works in the literature, after which we briefly conclude and expand on planned future works. An open source Python implementation can be found at: https://github.com/Previsou/CategoryLearning.

Preprint. Work in progress.

Basic category theory primer For the reader unfamiliar with category theory, we here define the few basic notions used in the rest of the paper. For a more principled introduction and more advanced category theory topics, see [Awo10] and [Rie16].

Definition 1. A category C consists of:

• Objects, noted x, y, z, · · · : C
• Arrows/relations between two objects, noted f : x → y, g, . . .
• Identities: given x : C, there is an identity arrow 1x : x → x in C.
• Compositions: the collection of arrows is closed under composition. Given f : x → y and g : y → z there is an arrow noted g ◦ f : x → z in C.

The composition ◦ is furthermore restricted to be associative and identities act as left and right units. That is, given f : x → y, g : y → z, h : z → w in C:

• associativity: h ◦ (g ◦ f) = (h ◦ g) ◦ f.

• unit: f = f ◦ 1x = 1y ◦ f.

Everything matching the above definitions is a category. For example, a directed graph G(V, E) can be seen as a form of proto-category. A path in G is a finite sequence of edges e1e2 . . . el. The free category C(G) over a graph G(V, E) is obtained by completing the set of edges1: C(G) has for objects the vertices of G and for arrows the paths of G, the composite of two paths e1 . . . el and e′1 . . . e′m being their concatenation e1 . . . el e′1 . . . e′m. In other words, there is an arrow f : V1 → V2 if and only if V1 = V2 or there is a path from V1 to V2 in G.

Definition 2. Let C be a category. A subcategory S of C is given by a subcollection of objects of C - denoted ob(S) - and a subcollection of arrows of C - denoted hom(S) - such that

• ∀X : ob(S), idX : hom(S),
• ∀f : hom(S)(X, Y ), X : ob(S) ∧ Y : ob(S),
• ∀f, g : hom(S), (g ◦ f : C ⇒ g ◦ f : hom(S)).
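Definitions 1 and 2 can be checked mechanically on small finite categories. The following sketch (our own illustration, not part of the paper's implementation; all names are hypothetical) encodes a three-object category as plain dictionaries and verifies that identities exist and that composition is closed with the right endpoints (associativity is left unchecked for brevity):

```python
# A tiny finite category: arrows are name -> (source, destination),
# and `compose` is a partial lookup table compose[(g, f)] = "g after f".

objects = {"x", "y", "z"}
arrows = {
    "1x": ("x", "x"), "1y": ("y", "y"), "1z": ("z", "z"),
    "f": ("x", "y"), "g": ("y", "z"), "g.f": ("x", "z"),
}
compose = {("g", "f"): "g.f"}
# identities compose trivially with every arrow (unit laws)
for a, (src, dst) in arrows.items():
    compose[(a, "1" + src)] = a
    compose[("1" + dst, a)] = a

def is_category(objects, arrows, compose):
    # every object must have an identity loop
    if not all("1" + o in arrows for o in objects):
        return False
    # composition must be defined on compatible pairs, with correct endpoints
    for f, (fs, fd) in arrows.items():
        for g, (gs, gd) in arrows.items():
            if fd == gs:  # g after f is defined
                h = compose.get((g, f))
                if h is None or arrows[h] != (fs, gd):
                    return False
    return True

print(is_category(objects, arrows, compose))
```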

2 Learning the category of a dataset

A supervised machine learning algorithm receives a set of observations and matching targets as input. Let θ ∈ R^q be a q-dimensional parameter vector. Let X ∈ FT^N and Y be respectively the observation sample of length N and the associated labels or targets, where FT is the feature space of a single observation. Discriminative supervised machine learning training can be viewed as, given a model f - typically a neural network or an SVM - finding the maximum a posteriori (MAP) parameter vector:

MAP(θ, f, X, Y ) = arg max_θ P(Y = f(X, θ) | X)    (1)

1Loops must also be added to every vertex to serve as identities.

For tractability, most algorithms hypothesize those observations are independent and identically distributed2: they were generated by the same measurement protocol, with previous measurements having no effect on future ones. However, not only is this assumption quite wrong in many situations, but it is inherently broken by the very common use of data augmentation techniques. Independence of samples is usually assumed because it makes it possible to factor Eq. 1:

MAP(θ, f, X, Y ) = arg max_θ ∏_{i=1}^N P(Y_i = f(X_i, θ) | X_i)    (2)

For numerical reasons, Eq. 2 is transformed into a loss minimization problem L by applying the transformation x → − log x. The factored product transforms into a summation, less prone to floating-point rounding errors.

L(θ, f, X, Y ) = −(1/N) ∑_{i=1}^N log P(Y_i = f(X_i, θ) | X_i)    (3)

We now particularize to the K-class classification setting. Targets Y are assumed to be one-hot encoded, i.e. Y_{i,j} = 1 if and only if observation i is of class j, and 0 otherwise. Let our model be S : FT × θ → [0, 1]^K constrained by ∑_{i=1}^K S_i(∗) = 1. The output of S is essentially a probability vector. The same holds for Y_i, where all probability mass is concentrated in a single entry. The goal is to have these two distributions align. Classification problems typically select the Kullback-Leibler divergence D_KL to compare distributions.

Definition 3. The Kullback-Leibler divergence between discrete probability distributions P and Q is defined as D_KL(P ||Q) = ∑_i P(i) log(P(i)/Q(i)). Its most relevant properties are:

• positivity: DKL(P ||Q) ≥ 0.

• asymmetry: D_KL(P ||Q) ≠ D_KL(Q||P ) in the general case.

• minimum: D_KL(P ||Q) = 0 if and only if P = Q almost everywhere.

We end up with the following training loss:

L(θ, S, X, Y ) = (1/N) ∑_{i=1}^N D_KL(Y_i || S(X_i, θ))    (4)

Given the degenerate case of Yi the KL-divergence reduces to the cross-entropy H(P,Q), and we end up with the familiar classification loss:

L_CE(θ, S, X_i, Y_i) = H(Y_i, S(X_i, θ)) = −∑_{j=1}^K Y_{i,j} log S(X_i, θ)_j
L(θ, S, X, Y ) = (1/N) ∑_{i=1}^N L_CE(θ, S, X_i, Y_i)    (5)

Considering the input data not as a set but as a learnable small category requires selecting a proper and practical cost function for identities and composition, and defining how observations can relate to each other and in how many ways. The gist of what we set out to do can be exemplified as follows. Assume we set out to perform a binary classification task: given a sequenced genome, is the patient affected by a given genetic disease? From what we know of DNA and heredity, encoding and properly leveraging whether two patients are related should yield a more accurate model.
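As a quick numerical sanity check of Eqs. 4-5 (our own illustration; `kl` and `cross_entropy` are hypothetical helper names, not the paper's API): for a one-hot Y_i, the entropy of Y_i is zero, so D_KL(Y_i || S(X_i, θ)) reduces exactly to the cross-entropy H(Y_i, S(X_i, θ)):

```python
import math

def kl(p, q, eps=1e-12):
    # D_KL(p || q) = sum_i p_i log(p_i / q_i), skipping zero-mass entries
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(y, s, eps=1e-12):
    # H(y, s) = -sum_j y_j log s_j
    return -sum(yj * math.log(max(sj, eps)) for yj, sj in zip(y, s))

y = [0.0, 1.0, 0.0]   # one-hot label: class 2 of K = 3
s = [0.1, 0.7, 0.2]   # model output, a probability vector
# degenerate case of Eq. 4: KL equals cross-entropy for one-hot targets
assert abs(kl(y, s) - cross_entropy(y, s)) < 1e-9
```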

2The most notable exceptions are time-series and sequence learning problems, where the time/order dependence is implicit and given.

Figure 1: Non-iid observations can be modeled as a category. (a) Observations as a set. (b) Observations as a category: Alice, Bob and Carol linked by father/sibling arrows, with an "ill?" label.

In the remainder of this section we build the full categorified classification model step-by-step, increasing the complexity along the way.

2.1 Classification as a relational problem

The above setting can easily be rearranged as learning whether two points are related to each other. We start by embedding labels and input observations into a larger space, in the same vein as projective geometry. The restricted set of enhanced labels will serve as anchors or sources for the binary point relation. First we introduce S̃ and Ṽ, respectively the extended vectors, summing to 1, of the scoring model output and the input labels. Let EFT = R^K × FT be the extended feature space, Y_i^ext : EFT = [Y_i 0_FT] the embedded labels and X_i^ext : EFT = [0_{R^K} X_i] the embedded data points. We modify the scoring model S so as to score whether two points are related to each other: S : EFT^2 → [0, 1]. The input training batch now consists of pairs of points in the embedding space: the source is a label and the destination is the embedded observation. The new batch labels V are then 1 if the pair represents a valid association, 0 otherwise. The training loss to minimize for a single observation is again a simple KL divergence, referred to as the matching loss.

L_match(Y_i^ext, X_i^ext, V_i) = D_KL( Ṽ_i || S̃(Y_i^ext, X_i^ext) )    (6)

Note that the original dataset must be artificially augmented by generating known wrong associations,3 otherwise the constant model S(_, _) = 1 minimizes L_match. Our small test cases show that in practice adding a single wrong (label, observation) pair for each true datapoint suffices, effectively doubling the input data of the model. A proper theoretical bound or in-depth practical study of this ratio is left for future work. A slight generalization allowing labels in [0, 1] provides a direct way to handle semi-supervised learning, encoded by generating pairs with such weak labeling.4
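The negative-pair augmentation described above can be sketched as follows (all names are ours, not the reference implementation's): each true (label, observation) pair with V = 1 is accompanied by one pair carrying a deliberately wrong label with V = 0, doubling the batch.

```python
import random

def build_matching_batch(labels, observations, num_classes, rng=random):
    """For each true pair, add one known-wrong (label, observation) pair."""
    batch = []
    for y, x in zip(labels, observations):
        batch.append((y, x, 1.0))  # valid association: V = 1
        wrong = rng.choice([c for c in range(num_classes) if c != y])
        batch.append((wrong, x, 0.0))  # known-wrong association: V = 0
    return batch

batch = build_matching_batch([0, 2, 1], ["xa", "xb", "xc"], num_classes=3)
assert len(batch) == 6  # the input data is effectively doubled
```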

Figure 2: Free category on a bipartite graph: a label Y_i and observations X_j, X_k, X_l, each vertex carrying an identity loop.

We are essentially trying to learn a bipartite graph. Since there can be no edges between two labels or between two observations, the free category of such a graph is simply the graph itself, augmented with identity

3Other workarounds do exist. For example [INS18] develops a scheme to learn from only positive examples. 4The matching cost in Eq.6 must then be reverted to use the full Kullback-Leibler divergence.

arrows for each vertex (see Figure 2). To get a truly categorical view of classification on such restricted graphs, we must thus incorporate a cost for identities, which we refer to as the unit cost.

L_unit(Y_i^ext, X_i^ext) = −(1/2) [ log S(Y_i^ext, Y_i^ext) + log S(X_i^ext, X_i^ext) ]    (7)

The total training loss is then L_cat = L_match + λ_unit L_unit, averaged over the input batch. Predictions for new observations can then be obtained by testing the tentative labelings. This procedure is akin to hypothesis testing. Let H_i be the one-hot encoding of i. A decision function D : FT × {1, . . . , K} → [0, 1] can now be computed by:

D(x, i) = −L_match( H_i^ext, x^ext, 1 )    (8)
predict(x) = arg max_{i∈1..K} D(x, i)
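A minimal sketch of the prediction rule of Eq. 8, assuming a trained scoring model (here replaced by a hard-coded toy table; `score`, `decision` and `predict` are hypothetical names). With V = 1, the matching loss reduces to −log S, so D(x, i) is just log S(H_i, x):

```python
import math

K = 3

def score(i, x):
    # stand-in for the trained model S(H_i^ext, x^ext) in (0, 1]
    toy = {("x1", 0): 0.2, ("x1", 1): 0.9, ("x1", 2): 0.1}
    return toy[(x, i)]

def decision(x, i, eps=1e-12):
    # D(x, i) = -L_match(H_i, x, 1) = log S(H_i, x) for a degenerate V = 1
    return math.log(max(score(i, x), eps))

def predict(x):
    # keep the class whose tentative labeling best matches x
    return max(range(K), key=lambda i: decision(x, i))

print(predict("x1"))
```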

2.2 Bring out your arrows

Thus far we have not explicitly mentioned the arrows of the category we want to learn, though the setting above uses two different types of relations: strict identities and class membership. We now explicitly define, model and incorporate them in the training loss. Let us first take a step back and return to the definition of a category (see Definition 1), of which we give another characterization.5

Lemma 1. A category C with a set of objects O and a set of arrows A can alternatively be characterized by a membership function 1_C : O × O × A → {0, 1} such that:

• membership: given an arrow f : a → b, 1C (x, y, f) = 1 iff x = a, y = b and f ∈ A.

• composition: given compatible arrows f, g : A, 1_C(x, z, g ◦ f) = 1_C(x, y, f) · 1_C(y, z, g).

We want not only to learn whether two points are related but also in what way. Given x, y : O we want to learn R(x, y) = {f ∈ A | 1_C(x, y, f) = 1}. This form is poorly amenable to inference and would preclude learning anything but the smallest categories. We instead consider R : O × O ,→ A as a generative, trainable model. Given src, dst ∈ O^N, the objective is to minimize membership errors:

L_rel(R, src, dst) = ∑_{i=1}^N ( 1 − 1_C(src_i, dst_i, R(src_i, dst_i)) )    (9)
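Eq. 9 can be illustrated on a tiny crisp category (our own toy example; the hard-coded `R` stands in for a trainable relation model): the membership function 1_C is just a set of (source, destination, arrow) triples, and each wrong proposal costs one unit.

```python
# crisp category membership: 1_C(x, y, f) = 1 iff (x, y, f) is in this set
members = {("a", "b", "f"), ("b", "c", "g"), ("a", "c", "g.f")}

def R(x, y):
    # stand-in for a trainable relation model proposing one arrow per pair
    return {("a", "b"): "f", ("b", "c"): "g", ("a", "c"): "h"}[(x, y)]

def rel_loss(src, dst):
    # Eq. 9: count the pairs where the proposed arrow is not in the category
    return sum(1 - ((s, d, R(s, d)) in members) for s, d in zip(src, dst))

# R is right on (a, b) and (b, c) but proposes a non-existent arrow on (a, c)
assert rel_loss(["a", "b", "a"], ["b", "c", "c"]) == 1
```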

The above setting assumed the category to be crisply defined. Our scoring model S is essentially a probabilistic version of it, and we need to extend S with a third input: the relation under consideration. Our choice of relation representation must be compatible with the usual input representations of machine learning algorithms, hence should preferably be real vector spaces, potentially with some added structure (for now we use vector spaces, matrix algebras as well as affine spaces). Let M be such a structure, with u_M a fixed, specific element representing the identities (the rationale for this notation is detailed in Section 2.4). We then have the modified losses, where we jointly optimize the parameters of S and R:

L_match(Y_i^ext, X_i^ext, V_i) = D_KL( Ṽ_i || S̃(Y_i^ext, X_i^ext, R(Y_i^ext, X_i^ext)) )    (10)

L_unit(Y_i^ext, X_i^ext) = −(1/2) [ log S(Y_i^ext, Y_i^ext, u_M) + log S(X_i^ext, X_i^ext, u_M) ]    (11)

2.3 The I.I.D. fallacy: the case of data augmentation

The assumption of independent, identically distributed (iid) observations is a very common working hypothesis, yielding drastic model and inference simplifications. It however breaks in many real-world patterns of interest:

5A restricted, slightly informal characterization

• sequences: observations are related by a linear total order (very often temporal).
• online learning: streams may undergo distribution shifts.
• active learning: the learner itself selects observations and requests labelling.
• reinforcement learning: actions taken by the agent may modify future observations, thus creating a causal dependency between previous and new observations.
• ranking: selected items to rank match a common filter, e.g. a user's query.
• data augmentation: crafting new examples from existing ones by applying several label-preserving transformations.

We claimed in the introduction that our categorical setting can gracefully handle non-iid observations, and especially the widely used data augmentation meta-technique. We now proceed to build up our model to do so, focusing exclusively on data augmentation. Given an observation X : FT, a data augmentation is a mapping f : FT → FT such that Label(x) = Label(f(x)). Label preservation is essential, otherwise we would not be able to use the generated new 'observation' for training, at least in the strictly supervised setting. We now take the example of images, for visual clarity. The usual image data augmentation operators are rotations (by a small angle), translations, cropping, adding noise, etc. We denote this family of augmentations as (f) = (f_i)_{i∈{crop, rotate, translate, ...}}. Until now our model was only concerned with learning relations between a label and a single observation, both embedded in a single extended space. We may instead ask whether two observations x : EFT and y : EFT are related by transformation f_i, e.g. are two images related by a translation. Let X : EFT^N be the original observation dataset and X^aug : EFT^M the training dataset obtained by applying several transformations to a sample of the original observations. Learning whether X_i^aug and X_j^aug are f_k-related is precisely the problem defined in the previous section, where the label V = 1 if and only if f_k(X_i^aug) = X_j^aug. The learned models S_k and R_k essentially approximate the f_k-category. However this falls short of helping us solve the original classification task, since class labels are nowhere to be found and we never try to score label-observation pairs. In addition, learning a different category for each augmentation operator would not help in encoding observations related by multiple transformations, e.g. a translation followed by a rotation, unless we train new score and relation models for precisely this composite augmentation.
We instead fuse all these subtasks by allowing scoring models to output several membership probabilities. Let s ∈ N be the number of subtasks we want to jointly solve and ∆_c^s = {(t_1, . . . , t_s) ∈ R^s | ∑_{i=1}^s t_i ≤ 1 and ∀i, t_i ≥ 0} be the standard orthogonal s-simplex. We extend the scoring model to a mapping S : EFT × EFT × M → ∆_c^s. The target labels must also be extended to V ∈ ∆_c^s. The different subtasks can be viewed as learning particular subcategories over the augmented dataset. Since identity loops are shared across all subtasks, we are only interested in the overall membership S̃. Hence the unit cost is modified as in Eq. 12.

L_unit(X_i, X_j) = −(1/2) [ log ∑_{a=1}^s S(X_i, X_i, u_M)_a + log ∑_{a=1}^s S(X_j, X_j, u_M)_a ]    (12)

Similarly, the matching cost is readily extended to multiple scores.6 The multi-dimensional scoring makes it possible to encode, and in principle infer, any symmetry of the data. Given the fundamental, unifying, and simplifying role symmetries have played in physics, we firmly believe explicitly using data symmetries to be an interesting lead for the future of AI, albeit perhaps not in this specific form.

2.4 The power of composition

Let’s temporarily come back to the simpler case of a single score. In the preceding subsections, we have enriched our description to model dependencies between pairs of observations and labels. Looking back at the definition of a category Eq.1, so far we have taken into account objects,

6Original labels are embedded in EFT and added to X as well.

arrows between them and identity relations. However we are still lacking one crucial ingredient: the composition of arrows. We first start by distinguishing between primitive and composite arrows, restricting the relation model R to generate only primitive arrows.

Definition 4. An arrow h in a category C is called composite iff there exist compatible arrows f, g in C (the source of g is the target of f), such that h = g ◦ f and neither f nor g is an identity arrow. An arrow which is not composite is called primitive.

Consider now the following commutative triangle (Fig. 3). For our modeling to form a category, we must equip the output of R with a composition law ∗, such that R(x, z) = R(x, y) ∗ R(y, z). Adding the constraint for unit arrows to exist, we obtain precisely the definition of a monoid. We therefore require the codomain of a relation model R to at least be a monoid. Thus we are free to model relations with any vector space, affine space or more generally any algebra.
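For instance, square matrices under the matrix product form such a monoid, with the identity matrix as u_M. The sketch below (our own illustration; the paper does not commit to this particular M) checks the unit laws and associativity on sample relation values:

```python
def matmul(a, b):
    """Plain n x n matrix product: the monoid operation * on relations."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

u_M = [[1, 0], [0, 1]]   # unit element u_M: the identity matrix
r_xy = [[0, 1], [1, 0]]  # hypothetical relation samples R(x, y), R(y, z)
r_yz = [[2, 0], [0, 3]]

# unit laws: u_M acts as a left and right identity
assert matmul(u_M, r_xy) == r_xy == matmul(r_xy, u_M)
# associativity on these samples: (R(x,y) * R(y,z)) * u_M == R(x,y) * (R(y,z) * u_M)
assert matmul(matmul(r_xy, r_yz), u_M) == matmul(r_xy, matmul(r_yz, u_M))
```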

Figure 3: Composition of relation samples: x →^{R(x,y)} y →^{R(y,z)} z, with R(x, z) the direct arrow x → z.

Definition 5. A monoid (M, ∗) is a set equipped with an associative binary product ∗ : M × M → M with a unit element u_M ∈ M.

Going back to the commutative triangle, how are we to score it? We have four numbers: S(x, z, R_xz), S(x, y, R_xy), S(y, z, R_yz) and S(x, z, R_xy ∗ R_yz). We are interested in two behaviors:

• Does R(x, z) fit the data better than R(x, y) ∗ R(y, z)? That is the purpose of the matching cost, introduced in the previous sections. It must be modified to handle compositions, which is explained in Section 2.5.
• Does the composition make sense, given the current scoring and relation models?

Ideally we would like to associate a cost L_comp enforcing a form of causality between the links. Informally, the idea is to ensure that composite arrows are only ever as good as their base constituents; indeed, in order to think logically, a good argument should not be obtainable from a sequence of bad arguments. However, one should also account for the fact that two elementary arguments may not match, as part of their underlying hypotheses could be contradictory; hence the composition of two relations can be lossy without any constraint. To this effect we want the composition to respect the following inequality: S(x, z, R_xy ∗ R_yz) ≤ S(x, y, R_xy) S(y, z, R_yz). Given these requirements, we have chosen the following form for L_comp, the elementary loss function associated with composition:

L_comp(S, (x →^{R_{x,y}} y →^{R_{y,z}} z)) = ReLU( S(x, z, R_{x,y} ∗ R_{y,z}) log [ S(x, z, R_{x,y} ∗ R_{y,z}) / ( S(x, y, R_{x,y}) · S(y, z, R_{y,z}) ) ] )    (13)

Note however that a single primitive triangle is but the simplest shape on which causality needs to be enforced. In practice, we want to be able to extend this cost to arbitrarily long sequences of composite arrows. We build upon those elementary triangles to cost primitive chains of arbitrary length, then we introduce the order of a relation to properly cost composites. Each batch input entry has so far consisted of two entries Y_i^ext, X_i^ext : EFT, which can in turn be seen as a single batch entry Z_i ∈ EFT^∗ made of a sequence of length two.

Consider scoring a primitive sequence Z = z_1 . . . z_L of length |Z| = L (Fig. 4). We need to compute L_comp for each possible triangle starting from z_1 and ending at z_L, which in this case amounts to selecting a single middle point in Z, including the extremities, and normalizing (see Eq. 14).

L_causal(S, R, z_1 . . . z_L) = (1/L) ∑_{k=1}^L L_comp( S, (z_1 →^{R(z_1,z_k)} z_k →^{R(z_k,z_L)} z_L) )    (14)
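A numerical sketch of Eqs. 13-14 (toy scoring and relation models of our own; relations are represented as strings purely for readability): L_comp penalizes a composite that scores better than the product of its parts, and L_causal averages L_comp over every choice of middle point in the chain.

```python
import math

def relu(t):
    return max(t, 0.0)

def l_comp(s_xz_composed, s_xy, s_yz):
    # Eq. 13: ReLU( S(x,z,Rxy*Ryz) * log( S(x,z,Rxy*Ryz) / (S(x,y,Rxy) * S(y,z,Ryz)) ) )
    return relu(s_xz_composed * math.log(s_xz_composed / (s_xy * s_yz)))

def l_causal(S, R, chain):
    # Eq. 14: average L_comp over triangles (z1 -> zk -> zL), k = 1..L
    z1, zL, L = chain[0], chain[-1], len(chain)
    total = 0.0
    for zk in chain:
        r1, r2 = R(z1, zk), R(zk, zL)
        total += l_comp(S(z1, zL, r1 + "*" + r2), S(z1, zk, r1), S(zk, zL, r2))
    return total / L

# toy models: every primitive pair scores 0.9; composites score 0.8 < 0.81 = 0.9 * 0.9,
# so the causality inequality holds and the cost vanishes
S = lambda x, z, r: 0.8 if "*" in r else 0.9
R = lambda x, y: f"R({x},{y})"
assert l_causal(S, R, ["z1", "z2", "z3"]) == 0.0
```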

Figure 4: Primitive chain of length 4. R(1,4) is shared between the red and purple triangles.

The complexity of a relation depends on its decomposition over the primitives; we call the length of this decomposition the order θ of the relation. We define the order of an arrow f : x → z recursively:

• Primitive arrows R(x, y) are of order 1.
• θ(f) = n + 1 if and only if there exist y and g : y → z such that f = R(x, y) ∗ g and θ(g) = n.

Since loops are allowed, there is an infinite number of composite relations over a sequence. From now on we always restrict ourselves to a finite problem by introducing a new hyperparameter θ_max controlling the maximum order of relations. We have so far restricted ourselves to modeling all potential relations with a single learnable model R. While this simplifies constructing and explaining the model, there is no fundamental reason for the restriction. Thus we allow several relation models to exist, and the number of relation models r ∈ N used for a given task becomes a new hyperparameter, akin to the number of coefficients to use when fitting a polynomial curve. Too few may not suffice to fit the data appropriately, too many may overfit and hurt generalization performance. Analytical or empirical study of this hyperparameter remains an open question.
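The role of θ_max can be illustrated by symbolically enumerating the composites it allows (a sketch of ours; the actual counting is in Appendix A). Because loops such as R(z1, z1) are valid factors, every extra unit of order multiplies the candidate set, which is why the cutoff is needed.

```python
def relations(chain, theta_max, r):
    """Enumerate, as strings, all composites chain[0] -> chain[-1]
    of order at most theta_max using relation models R1..Rr."""
    end = len(chain) - 1

    def rels(i, order):
        # composites chain[i] -> chain[end] of exactly `order` primitive factors
        if order == 1:
            return [f"R{m}({chain[i]},{chain[end]})" for m in range(1, r + 1)]
        out = []
        for k in range(i, len(chain)):       # middle point; loops are allowed
            for m in range(1, r + 1):
                head = f"R{m}({chain[i]},{chain[k]})"
                out.extend(head + "*" + tail for tail in rels(k, order - 1))
        return out

    return [f for o in range(1, theta_max + 1) for f in rels(0, o)]

rel = relations(["z1", "z2"], theta_max=2, r=1)
# one order-1 relation plus two order-2 composites (one per loop endpoint)
assert len(rel) == 3
```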

Let Tri(Z, θ_max, (R_i)_{i=1..r}) be the set of all triangles of order at most θ_max > 1 over a chain Z using r relation models (see Fig. 5 for a graphical representation of Tri(1 → 2 → 3 → 4, 3, R)). Let cc(θ_max, |Z|, r) = |Tri(Z, θ_max, (R_i)_{i=1..r})| be the number of such unique triangles (see Appendix A for a detailed derivation). Eq. 15 extends L_causal to any order and multiple relation models.

L_causal(S, (R_i)_{i=1..r}, θ_max, Z) = (1/cc(θ_max, |Z|, r)) ∑_{t ∈ Tri(Z, θ_max, (R_i)_{i=1..r})} L_comp(S, t)    (15)

The number of terms in the summation grows quickly as the order or chain length increases. Therefore handling larger cases would effectively require dropping exhaustivity in favor of a Monte-Carlo approximation. We must also extend the unit and matching costs to the newly introduced concepts - input sequence, order and multiple relation models. Fortunately, the unit cost is readily amenable to this: the only relations at play are identities, which are independent of relation models and order. The matching cost should undergo the same kind of extension as L_causal, iterating over different shapes, which is the focus of Section 2.5.

2.5 Full model

The preceding sections defined various costs, their extensions and the rationale behind them. We now proceed to aggregate everything and lay out the categorical learning problem in full. Let s ∈ N; the s-simplex is defined by ∆_c^s = {(t_1, . . . , t_s) ∈ R^s | ∑_{i=1}^s t_i ≤ 1 and ∀i, t_i ≥ 0}. Let Obs be the space of observations, Z : Obs^∗ a sequence of observations of length l = |Z| ∈ N, V ∈ ∆_c^s the corresponding training label and (M, ∗) a monoid with unit u_M. A scoring model S is a trainable mapping S : Obs × Obs × M → ∆_c^s, and (R_i : Obs × Obs → M)_{i=1..r} is a finite family of trainable relation models. To simplify notations we denote by S̃(x, y, r) and Ṽ the scores and label vectors,

Figure 5: Partial graphical representation of Tri(1 → 2 → 3 → 4, 3, R). (a) Triangles of order Θ = 2. (b) Triangles of order Θ = 3, rooted by R(1, 3). (c) Triangles of order Θ = 3, rooted by R(1, 4).

augmented to form probability vectors. We then try to optimize the trainable parameters of S and (R_i) to minimize the following loss, with λ_unit ∈ R+ and λ_causal ∈ R+:

L_cat(S, (R_i)_{i=1..r}, Z, V ) = λ_unit L_unit(S, Z) + λ_causal L_causal(S, (R_i)_{i=1..r}, Z) + L_match(S, (R_i)_{i=1..r}, Z, V )    (16)

The unit and causal costs enforce that the learned models behave like a category, while the matching cost drives the models to learn the particular category we are interested in - provided by the labels. The unit cost forces the learned model to see identity loops as trivial, and is independent of the relation models.

L_unit(S, Z) = −(1/|Z|) ∑_{i=1}^{|Z|} log ( ∑_{a=1}^s S(Z_i, Z_i, u_M)_a )    (17)

The causality cost constrains relation composition to uphold a loose form of causality: a composite arrow should at best be as 'accurate' as its constituents. A new term L_equiv constraining triangles is introduced. Its goal is to further restrict the modeling of equivalences. If we are to model an equivalence relation then, loosely speaking, we want to enforce that composition loses as little information as possible. For each score dimension, the user must specify whether it encodes an equivalence relation. For example, if classifying digits in MNIST [LC10], augmented with translations (restricted to transformations where the digit stays inside the image), we could model the problem with a 2-dimensional scoring: the first dimension for class membership, the second one encoding whether images are related by translation. Let Tri(Z, θ_max, (R_i)_{i=1..r}) be the set of all triangles of order at most θ_max > 1 over Z using r relation models and cc(θ_max, |Z|, r) = |Tri(Z, θ_max, (R_i)_{i=1..r})| the

number of such unique triangles. Let Ŝ, Ŝ_eq, V̂ and V̂_eq be the respective component-wise summations over the user-specified equivalence dimensions.

L_causal(S, (R_i)_{i=1..r}, Z) = (1/cc(θ_max, |Z|, r)) ∑_{t ∈ Tri(Z, θ_max, (R_i)_{i=1..r})} [ L_comp(S, t) + L_equiv(S, t) ]

L_comp(S, (x →^{r_1} y →^{r_2} z)) = ReLU( Ŝ(x, z, r_1 ∗ r_2) log [ Ŝ(x, z, r_1 ∗ r_2) / ( Ŝ(x, y, r_1) Ŝ(y, z, r_2) ) ] )    (18)

L_equiv(S, (x →^{r_1} y →^{r_2} z)) = ReLU( Ŝ(x, z, r_1 ∗ r_2) log [ Ŝ_eq(x, y, r_1) Ŝ_eq(y, z, r_2) / Ŝ_eq(x, z, r_1 ∗ r_2) ] )

Finally, the matching cost is the label constraint, i.e. the real classification constraint. The final form differs from Eq. 10 to account for all the possible induced relations, up to order θ_max. Let Rel(Z, (R_i)_{i=1..r}, θ_max) be the set of all relations Z_1 → Z_l of order at most θ_max (see Appendix A).

L_match(S, (R_i)_{i=1..r}, Z, V ) = σ_agg( { D_KL(V || S(Z_1, Z_l, r)) | r ∈ Rel(Z, (R_i)_{i=1..r}, θ_max) } )    (19)

In all our tests we restricted ourselves to the following choice of σ_agg. Let P ∈ N and α ≥ 0; we define σ_agg by the following three steps:

1. Sort by increasing DKL.

2. Keep only the lowest P matching scores (m_i)_{i=1..P}, i.e. the best relation matches.

3. Do a weighted sum over the selected scores: ∑_{k=1}^P m_k / k^α.

We only use the best P relations in the matching score because there is no reason for all relations to be a match; a good model would thus result in a high KL divergence for many relations, which would in turn completely bury the interesting constraints in noise.
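The three steps above can be sketched directly (the function name `sigma_agg` is ours):

```python
def sigma_agg(kl_scores, P, alpha):
    # steps 1-2: sort by increasing KL divergence, keep the P lowest scores
    best = sorted(kl_scores)[:P]
    # step 3: weighted sum with weights 1 / k^alpha over the kept scores
    return sum(m / (k ** alpha) for k, m in enumerate(best, start=1))

scores = [5.0, 0.2, 3.1, 0.4]
# with P = 2 and alpha = 1: 0.2 / 1 + 0.4 / 2 = 0.4
assert abs(sigma_agg(scores, P=2, alpha=1.0) - 0.4) < 1e-12
```

The 1/k^α weighting lets the best-matching relation dominate while still letting runner-up relations contribute.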

Hyperparameters The full model is in the end governed by several hyperparameters:

• Scoring model S: we need to select a model and its set of parameters.
• Relation models: similar to the scoring model. However there are in general several such models, with different parameters and possibly different architectures.

• λunit ∈ R+: controls the unit cost. • λcausal ∈ R+: controls the causality cost. • maximum order θmax ∈ N: governs the depth of the category approximation. Highly influences training time. • α ≥ 0 and P ∈ N: controls how many relations between points to consider, and their weight.

3 Related works

To the best of our knowledge, this is the first attempt to learn an approximation of a finite category and to apply it to classification tasks. There are however recent works in machine learning along our lines of reasoning: relations ought to be made explicit and become first-class citizens, not just curiosities, if we are to extend machine learning's range of applications and performance. In this section we briefly review the existing works closest to our full model.

Topological Data Analysis Topological data analysis focuses on making explicit the topological structure of a dataset by representing it as a simplicial complex and studying its homological properties [CM17]. The aim is then to compute the persistence of these properties as one gradually zooms in on the data. It is particularly useful for clustering tasks; in particular the ToMATo algorithm

[CGOS13] is a powerful clustering technique which can handle arbitrary cluster shapes with non-trivial hierarchical structure. It is however limited in several ways, as TDA techniques generally assume one has already computed a distance between each pair of data points, as well as a density estimate for each of these points. In practice this often amounts to an embedding in a finite-dimensional vector space, combined with a kernel-based density estimation. Moreover it assumes symmetrical relations between members of the dataset, which we do not. It is nonetheless interesting since, like our framework, it is a very generic method which can be applied in a wide variety of contexts.

Probabilistic Programming Probabilistic programming fuses probability and programming languages by interpreting a program as a generative description of a distribution. Executing the program once gives a sample of the distribution; inference is carried out through multiple executions and various sampling schemes. For example, Church [GMR+08] is a Lisp-based probabilistic language, while Alchemy (based on Markov Logic Networks [RD06]) directly uses the language of first-order logic. By nature, those representations and techniques are highly composable and structure-rich. However, the user must specify a generative model, i.e. the causal structure, whereas our categorical framework's goal is to flesh out those structures.

Structural Causal Models (SCM) Introduced in [Pea09], SCMs directly target causal relations between variables, traditionally embodied by Bayesian networks. By focusing solely on causality, an SCM can be used to precisely answer complex queries about the observed data, generally divided into a three-level hierarchy of increasing generality and power: association, intervention and counterfactual. By combining data with a causal model, precise computation of conditional and interventional probabilities (e.g. pinning some of the variables to certain values) is achieved, expressing interventions through syntactic rewrites until only pure conditional probabilities remain (the do-calculus). Most theoretical and practical developments of SCMs focus on getting the most out of a given causal graph and observations. Efficient inference of the causal structure itself is still very much an open question, where our categorical approach fits naturally.
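The difference between seeing and doing can be made concrete with a minimal sketch of ours (variable names and probabilities are illustrative): an SCM as a dict of mechanisms, each variable computed from its parents plus exogenous noise, with the do-operator implemented by replacing a mechanism with a constant:

```python
import random

def sample(mechanisms, rng):
    values = {}
    for var, f in mechanisms.items():  # dict assumed in topological order
        values[var] = f(values, rng)
    return values

def do(mechanisms, var, value):
    """Pearl's intervention: cut the mechanism of var, force a value."""
    forced = dict(mechanisms)
    forced[var] = lambda v, rng: value
    return forced

scm = {
    "rain":      lambda v, rng: rng.random() < 0.2,
    "sprinkler": lambda v, rng: rng.random() < (0.01 if v["rain"] else 0.4),
    "wet":       lambda v, rng: v["rain"] or v["sprinkler"],
}

rng = random.Random(0)
# Seeing: P(rain | sprinkler=True) is tiny, sprinklers mostly run when dry...
obs = [s for s in (sample(scm, rng) for _ in range(20000)) if s["sprinkler"]]
p_rain_seeing = sum(s["rain"] for s in obs) / len(obs)
# ...doing: forcing the sprinkler on leaves P(rain) at its prior of 0.2.
itv = [sample(do(scm, "sprinkler", True), rng) for _ in range(20000)]
p_rain_doing = sum(s["rain"] for s in itv) / len(itv)
```

Observing the sprinkler on makes rain unlikely, while switching it on by hand says nothing about rain: the gap between `p_rain_seeing` and `p_rain_doing` is exactly the association/intervention distinction of the hierarchy above.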

Tsetlin Machines Tsetlin automata are simple finite-state automata whose transitions depend on a scalar reward/penalty signal. These automata have proven very good at solving variations of the multi-armed bandit problem, but a single automaton can only output a binary signal. More complex pattern matching therefore requires an ensemble of automata, which have to cooperate. Each automaton learns by itself, with no centralization point: the output signal of the cohort is thus noisy, and larger learning tasks suffer from a poor signal-to-noise ratio. [Gra18] overcomes the noise problem with a novel game-based decentralized learning algorithm; an ensemble of Tsetlin automata governed by the newly designed game is called a Tsetlin Machine. Inputs and outputs of Tsetlin Machines are bit patterns, which are in turn interpreted as propositional logic formulas: the internal organization of the automata inside the Tsetlin Machine is reinterpreted as building Conjunctive Normal Form propositional formulas over the inputs. While the emphasis of [Gra18] is on the Tsetlin Machine's capabilities for interpretation, it also provides rich compositional structure. In fact the input propositional observations together with the usual propositional operators (∧, ∨, ¬) form a category with extra properties.
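A single two-action Tsetlin automaton is small enough to sketch in full (this is the basic automaton only, not the full Tsetlin Machine game of [Gra18]): with 2n states, states 0..n−1 select action 0 and states n..2n−1 select action 1; a reward pushes the state deeper into the current half, a penalty pushes it toward the boundary and eventually flips the action:

```python
import random

class TsetlinAutomaton:
    def __init__(self, n):
        self.n = n
        self.state = n  # start at the boundary, on the action-1 side

    def action(self):
        return 0 if self.state < self.n else 1

    def update(self, reward):
        if self.action() == 0:
            self.state += -1 if reward else 1  # deeper into half 0, or toward flip
        else:
            self.state += 1 if reward else -1  # deeper into half 1, or toward flip
        self.state = max(0, min(2 * self.n - 1, self.state))

# Two-armed bandit: arm 1 pays off more often; the automaton learns to pick it.
rng = random.Random(0)
ta = TsetlinAutomaton(6)
payoff = [0.2, 0.8]  # reward probability of each arm
picks = []
for _ in range(2000):
    a = ta.action()
    picks.append(a)
    ta.update(rng.random() < payoff[a])
preference = sum(picks[-500:]) / 500  # fraction of late plays on arm 1
```

The memory depth n controls the trade-off between stability and adaptivity: a deep automaton rarely flips on a streak of unlucky penalties, a shallow one tracks non-stationary rewards faster.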

Graph Networks Since [BHB+18] presents a framework unifying essentially all graph networks, we will use its terminology. Let $G = (V, E, T, u)$ where $V = (v_k)_{k=1:N_v}$ are the vertices, $E \subset V \times V$ are the edges and $u$ is graph-wide extra data. Each $e \in E$ and $v \in V$ carries associated extra data, noted $T(e)$ and $T(v)$ respectively - typically a tensor in deep learning settings. Graph Networks are mappings $(V, E, T, u) \to (V, E, T', u')$, and provided the modifications to $T$ and $u$ preserve types, Graph Networks can themselves be composed in virtually any way - forming a directed graph of Graph Networks. Essentially, the structure of the graph acted upon is not changed; only the data associated to each vertex and edge, and the global data, are. The vertex and edge data updates $\phi^v$ and $\phi^e$ crucially depend solely on local neighborhood information - plus $u$ - and are shared across the whole graph. Neither Graph Networks nor our classification model in its current form subsumes the other: we focus solely on the edges, never tweaking vertex representations, and Graph Networks have no ready equivalent of $\mathcal{L}_{causal}$ and higher-order relations (this might be doable with a combination of repeated applications of a Graph Network and extra bookkeeping of relations).
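The data flow of one Graph Network block in the sense of [BHB+18] can be sketched as follows. The three update functions below are placeholder linear maps of ours, not a trained model, and all data share one dimension for simplicity:

```python
import numpy as np

def gn_block(V, E, T_v, T_e, u, phi_e, phi_v, phi_u):
    # 1. update every edge from its endpoints' data and the global data
    T_e2 = {e: phi_e(T_e[e], T_v[e[0]], T_v[e[1]], u) for e in E}
    # 2. update every vertex from the aggregate of its incoming edges
    T_v2 = {}
    for v in V:
        incoming = [T_e2[e] for e in E if e[1] == v]
        agg = np.sum(incoming, axis=0) if incoming else np.zeros_like(u)
        T_v2[v] = phi_v(T_v[v], agg, u)
    # 3. update the global data from edge and vertex aggregates
    u2 = phi_u(u, np.sum(list(T_e2.values()), axis=0),
                  np.sum(list(T_v2.values()), axis=0))
    return T_v2, T_e2, u2

# Tiny chain graph 0 -> 1 -> 2 with a shared toy update phi(xs) = 0.5 * sum(xs).
V, E, d = [0, 1, 2], [(0, 1), (1, 2)], 4
T_v = {v: np.ones(d) for v in V}
T_e = {e: np.ones(d) for e in E}
u = np.zeros(d)
lin = lambda *xs: 0.5 * sum(xs)
T_v2, T_e2, u2 = gn_block(V, E, T_v, T_e, u, lin, lin, lin)
```

Note that the vertex set and edge set pass through unchanged: exactly the "structure is preserved, only the data moves" property discussed above, which is also what makes such blocks freely composable.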

4 Discussion

4.1 Conclusion

The model described here is cast in the context of a classification task. However this is certainly not the only problem which can be tackled using similar methods, nor the most amenable one. Our choice was mostly guided by the ubiquity of such tasks in machine learning, which makes them a good candidate for expository purposes. Indeed the problem of inferring relations between different objects is very general and in fact underlies many, if not most, higher-level tasks. We believe this makes categorical modeling a useful general framework in which to cast particular problems.

4.2 Future works

Although there are many possible applications, we hope in particular to apply category-based learning to recommendation systems. Indeed recommending can be thought of as inferring and exploiting relations between objects, of the form: "If person X has liked object A, person X should also like object B". Hence we think this is a particularly well-suited task which could make full use of associative reasoning, the core principle of this categorical framework.

References

[Awo10] Steve Awodey. Category Theory. Oxford University Press, Inc., New York, NY, USA, 2nd edition, 2010.

[BCS+16] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. End-to-end attention-based large vocabulary speech recognition. In ICASSP, pages 4945–4949. IEEE, 2016.

[BHB+18] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks, 2018.

[BP] Ronald Brown and Timothy Porter. Category theory and higher dimensional algebra: potential descriptive tools in neuroscience. In Proceedings of the International Conference on Theoretical Neurobiology, Delhi, February 2003, National Brain Research Centre, Conference Proceedings, pages 80–92, 2003.

[CGOS13] Frédéric Chazal, Leonidas J Guibas, Steve Y Oudot, and Primoz Skraba. Persistence- based clustering in riemannian manifolds. Journal of the ACM (JACM), 60(6):41, 2013.

[CM17] Frédéric Chazal and Bertrand Michel. An introduction to topological data analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019, 2017.

[EM45] S. Eilenberg and S. MacLane. General Theory of Natural Equivalences. American Mathematical Society, 1945.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP 2013. arXiv:1303.5778.

[GMR+08] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In Proc. of Uncertainty in Artificial Intelligence, 2008.

[Gra18] Ole-Christoffer Granmo. The Tsetlin machine - a game theoretic bandit driven approach to optimal pattern recognition with propositional logic. CoRR, abs/1804.01508, 2018.

[Hea00] Michael J. Healy. Category theory applied to neural modeling and graphical representations. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2000). IEEE Press, 2000.

[HG08] Michael Healy, Thomas Caudell, and Timothy Goldsmith. A model of human categorization and similarity based upon category theory, 2008.

[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[HV06] Ian Horrocks and Andrei Voronkov. Reasoning support for expressive ontology languages using a theorem prover. pages 201–218, 02 2006.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[INS18] Takashi Ishida, Gang Niu, and Masashi Sugiyama. Binary classification from positive-confidence data. In NeurIPS, 2018.

[KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[Law66] F. William Lawvere. The category of categories as a foundation for mathematics. In S. Eilenberg, D. K. Harrison, S. MacLane, and H. Röhrl, editors, Proceedings of the Conference on Categorical Algebra, pages 1–20, Berlin, Heidelberg, 1966. Springer Berlin Heidelberg.

[LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[MKS+15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[Pea09] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.

[PSST10] Adam Pease, Geoff Sutcliffe, Nick Siegel, and Steven Trac. Large theory reasoning with SUMO at CASC. AI Communications, 23:137–144, 01 2010.

[RD06] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, February 2006.

[Rie16] Emily Riehl. Category Theory in Context. Courier Dover Publications, Mineola, NY, USA, 2016.

[SSS+17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–, October 2017.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

A Relation combinatorics

A.1 Introduction and notations

We detail here the non-trivial derivations used to normalize the cost functions. We shall use the following notations in what follows:

• $\binom{n}{k}$ is the binomial coefficient "n choose k", i.e. the number of k-subsets of an n-element set. Let us recall that these binomial coefficients are subject to Pascal's triangle formula:
$$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1} \tag{20}$$

• $\left(\!\!\binom{n}{k}\!\!\right)$ is the number of combinations with repetition of size k in an n-element set. One has the identity:
$$\left(\!\!\binom{n}{k}\!\!\right) = \binom{n+k-1}{k} \tag{21}$$
From this and Pascal's triangle formula we can deduce:
$$\left(\!\!\binom{n}{k}\!\!\right) = \left(\!\!\binom{n-1}{k}\!\!\right) + \left(\!\!\binom{n}{k-1}\!\!\right) \tag{22}$$

We shall also observe that:
$$k\left(\!\!\binom{n}{k}\!\!\right) = k\,\frac{(n+k-1)!}{k!\,(n-1)!} = n\,\frac{(n+k-1)!}{(k-1)!\,n!} = n\left(\!\!\binom{n+1}{k-1}\!\!\right) \tag{23}$$
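These identities are easy to confirm numerically; the sketch below (function name ours) implements the multiset coefficient through (21) and verifies (22) and (23) on small values:

```python
from math import comb

# Multiset coefficient via identity (21); math.comb returns 0 when k > n,
# which gives the right boundary values for the recurrences below.
def multichoose(n, k):
    return comb(n + k - 1, k)

for n in range(1, 8):
    for k in range(1, 8):
        # (22): Pascal-like recurrence for combinations with repetition
        assert multichoose(n, k) == multichoose(n - 1, k) + multichoose(n, k - 1)
        # (23): absorption identity, used later to simplify (28)
        assert k * multichoose(n, k) == n * multichoose(n + 1, k - 1)
```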

• Let P be a strictly positive integer, and R be a positive integer. The directed multigraph $D_P^R$ is the graph which:

1. has exactly P vertices, noted $0, \dots, P-1$;

2. has exactly R edges between any pair of vertices $i, j$ with $i \le j$, noted $r_{i,j}$ for $0 \le r \le R-1$.

[Figure 6: Example of $D_3^2$ — three vertices 0, 1, 2, with parallel edges $0_{0,1}, 1_{0,1}$, $0_{1,2}, 1_{1,2}$ and $0_{0,2}, 1_{0,2}$.]

• For any directed multigraph D, we shall note Cat(D) the free category over D. A morphism of Cat(D) is said to be of order $n \in \mathbb{N}$ (over D) if it is the composition of exactly n edges of D; each morphism m thus has a unique order, noted o(m). For example, on $D_3^2$:

1. the identities $id_0$, $id_1$ and $id_2$ are of order 0;

2. $r_{i,j}$ is of order 1 for any $0 \le i \le j \le P-1$, $0 \le r \le R-1$;

3. $0_{1,2}\,1_{0,1}$ is of order 2.

A composite pair (f, l) of Cat(D) is of order n if lf is of order n, or equivalently if o(f) + o(l) = n.

A.2 Categorical combinatorics

Let θ, P be strictly positive integers, and R be a positive integer.

Counting relations of a given order The first value we want to compute is the number of morphisms of order θ from vertex 0 and ending at vertex P in $Cat(D_P^R)$, which we denote by $c_o(\theta, P, R)$. Choosing such a morphism is equivalent to:

1. Choosing θ−1 ordered vertices $i_0 \le \cdots \le i_{\theta-2}$, the targets/codomains of the first θ−1 edges of $D_P^R$ composing the morphism (the last edge has P forced as its target). This is in fact equivalent to choosing a (θ−1)-combination with repetition among the P vertices, the number of choices thus being $\left(\!\!\binom{P}{\theta-1}\!\!\right)$.

2. Choosing, for each consecutive pair of visited vertices, one of the R parallel edges between them; there are R possible choices for each of the θ edges.

Hence in total we have:
$$c_o(\theta, P, R) = \left(\!\!\binom{P}{\theta-1}\!\!\right) R^\theta \tag{24}$$

Moreover from Pascal’s triangle (22), we get the following recursive relation:

$$c_o(\theta, P, R) = R \cdot c_o(\theta-1, P, R) + c_o(\theta, P-1, R) \tag{25}$$

In practice this relation can be used to optimize computations when lower order values are also needed.
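As a sanity check, the closed form (24) and the recursion (25) can be compared numerically (function names ours):

```python
from math import comb

# Multiset coefficient via identity (21).
def multichoose(n, k):
    return comb(n + k - 1, k)

# Closed form (24) for the number of morphisms of a given order.
def co(theta, P, R):
    return multichoose(P, theta - 1) * R**theta

# Verify the recursion (25) on small values.
for theta in range(2, 6):
    for P in range(2, 6):
        for R in range(1, 4):
            assert co(theta, P, R) == R * co(theta - 1, P, R) + co(theta, P - 1, R)
```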

Counting relations of at most a given order We now look for the number of morphisms of order at most Θ (included), from 0 to P in $Cat(D_P^R)$, which we shall note $c_m(\Theta, P, R)$. Quite evidently, we can obtain it by summing over all possible orders up to Θ:
$$c_m(\Theta, P, R) = \sum_{\theta=1}^{\Theta} c_o(\theta, P, R) \tag{26}$$

Here the benefit of using the recursion formula (25) should be obvious.
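To make the benefit concrete, here is a sketch (function names ours) computing the sum (26) two ways: directly from the closed form (24), and by filling a table with the recursion (25), which reuses all lower-order values instead of recomputing powers and binomial coefficients:

```python
from math import comb

# Multiset coefficient via identity (21).
def multichoose(n, k):
    return comb(n + k - 1, k)

# Direct evaluation of (26) through the closed form (24).
def cm_direct(Theta, P, R):
    return sum(multichoose(P, t - 1) * R**t for t in range(1, Theta + 1))

# Same value through a (Theta x P) table filled with the recursion (25).
def cm_table(Theta, P, R):
    # co[t][p] holds co(t, p, R); co(t, p) = R*co(t-1, p) + co(t, p-1)
    co = [[0] * (P + 1) for _ in range(Theta + 1)]
    for p in range(1, P + 1):
        co[1][p] = R  # order 1: a single edge, R parallel choices
    for t in range(2, Theta + 1):
        for p in range(1, P + 1):
            co[t][p] = R * co[t - 1][p] + co[t][p - 1]
    return sum(co[t][P] for t in range(1, Theta + 1))
```

The table version needs only O(Θ·P) additions and multiplications by R, and yields every lower-order value $c_o(t, p, R)$ along the way.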

Counting composite pairs of at most a given order We shall now suppose that Θ ≥ 2. The last value we want to compute is the number of non-trivial composite pairs of order at most Θ from 0 to P in $Cat(D_P^R)$, which we shall note $c_c(\Theta, P, R)$. In order to choose such a non-trivial composite pair one can:

• Choose an order $2 \le \theta \le \Theta$ (a non-trivial composite pair is necessarily of order at least 2).

• Choose a morphism $r : 0 \to P$ of order θ. Hence r can be written as $r = r_{\theta-1} \dots r_0$ where $r_0, \dots, r_{\theta-1}$ are edges of $D_P^R$. By definition, there are $c_o(\theta, P, R)$ such choices.

• Choose a decomposition of r as a non-trivial pair. This is equivalent to choosing a cut of the string $r_{\theta-1} \dots r_0$ into two non-empty substrings $r_{\theta-1} \dots r_{k+1}$ and $r_k \dots r_0$ for $0 \le k \le \theta-2$. There are θ−1 possible choices of such a cut.

Hence one has in total:
$$c_c(\Theta, P, R) = \sum_{\theta=2}^{\Theta} (\theta-1) \cdot c_o(\theta, P, R) \tag{27}$$

This is in turn equal to:

$$c_c(\Theta, P, R) = \sum_{\theta=2}^{\Theta} (\theta-1) \left(\!\!\binom{P}{\theta-1}\!\!\right) R^\theta = \sum_{\theta=2}^{\Theta} P \left(\!\!\binom{P+1}{\theta-2}\!\!\right) R^\theta \tag{28}$$

Hence one can further simplify by remapping the summation index from 1 to Θ−1 and factoring constants out of the sum:
$$c_c(\Theta, P, R) = PR \sum_{\theta=1}^{\Theta-1} \left(\!\!\binom{P+1}{\theta-1}\!\!\right) R^\theta = PR \sum_{\theta=1}^{\Theta-1} c_o(\theta, P+1, R) = PR \cdot c_m(\Theta-1, P+1, R) \tag{29}$$
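Identity (29) can also be validated by brute force (function names ours): enumerate morphisms as nondecreasing chains of intermediate vertices with R parallel edge choices per step, count the θ−1 cuts of each order-θ morphism, and compare with the closed form:

```python
from itertools import combinations_with_replacement
from math import comb

# Multiset coefficient via identity (21).
def multichoose(n, k):
    return comb(n + k - 1, k)

# Closed form (26) for the number of morphisms of order at most Theta.
def cm(Theta, P, R):
    return sum(multichoose(P, t - 1) * R**t for t in range(1, Theta + 1))

# Brute-force count of non-trivial composite pairs of order at most Theta:
# an order-theta morphism is a nondecreasing choice of theta-1 intermediate
# targets (with R choices per edge), and admits theta-1 non-trivial cuts.
def cc_brute(Theta, P, R):
    total = 0
    for theta in range(2, Theta + 1):
        chains = sum(1 for _ in combinations_with_replacement(range(P), theta - 1))
        total += (theta - 1) * chains * R**theta
    return total

# Compare against the right-hand side of (29) on small values.
for Theta in range(2, 6):
    for P in range(1, 5):
        for R in range(1, 4):
            assert cc_brute(Theta, P, R) == P * R * cm(Theta - 1, P + 1, R)
```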
