A Categorical Viewpoint on Machine Learning
Christophe Culan                     Maxime Lubin
[email protected]                    [email protected]

Abstract

Modeling the reasoning process of human beings is a long-standing goal of artificial intelligence. Approaches have ranged from the symbolic methods rooted in logic of the 1960s, composable and interpretable but brittle in the face of fuzzy and stochastic patterns, to recent deep learning advances at the other end of the spectrum. We propose a novel model and inference technique based on category theory that treats training data not only as a set but as a full-fledged learnable category: which points are related to each other? In what way? In which causal direction? Learning such a rich view of a dataset amounts to soundly dropping the often unquestioned hypothesis of independent, identically distributed training data. This can help to better account for the consequences of data augmentation techniques, as well as to learn non-trivial relations spanning multiple observations.

1 Introduction

We perceive and abstract our reality through the prism of causality: causes precede consequences. However, we do not perceive causality directly. Instead, through repeated experimentation, we eventually notice patterns emerging from the underlying causal links, e.g., drinking alcohol then feeling tipsy. These patterns can be learned through induction, even by machines, as the great recent progress in machine learning showcases in vision [KSH12][HZRS15], natural language processing [VSP+17][HS97], speech processing [GMH13][BCS+16] and reinforcement learning [SSS+17][MKS+15]. We now have tremendous pattern-matching machines, but they still struggle and are largely incapable of reifying observations into causal links. The most notable exceptions are Probabilistic Graphical Models [KF09], Markov Logic Networks [RD06], Structural Causal Models [Pea09] and the recent Graph Networks [BHB+18], all of which require an explicit encoding of stochastic causal dependence between observables.

Nature and human abstractions abound in different patterns, so much so that they first appear to form a large set of distinct, unrelated curiosities. However, in each domain, humanity's work and ingenuity have largely shown the infinite-looking zoo of patterns to stem from a small number of unique interacting entities. For example, physical processes are largely described by the handful of elementary particles and evolution laws of the Standard Model, mathematics is built from axioms and proof constructs, and language from words and syntax. Complexity seems to emerge through the interaction of simpler components, a phenomenon known as combinatorial generalization. The emphasis is not on the intrinsics of the elementary entities but on their collective behavior and interactions. [BHB+18] offers a comprehensive, well-written introduction to and justification for this argument.

Category theory [EM45] is a very powerful framework [Law66] that precisely embodies this view. Since its inception, the pervasive nature of categories has been steadily fleshed out, revealing many deep connections between seemingly unrelated fields of mathematics, and it is now a core tool in state-of-the-art developments in mathematics, computer science and theoretical physics. Machine learning, and more generally statistics, has thus far reaped little theoretical or practical benefit from it. Our choice to rely on category theory for AI is not so strange a position. Most if not all computational ontologies, e.g.
[HV06][PSST10], are about categories, albeit very often not phrased directly in the language of category theory but in that of logic. Category theory is also starting to creep into cognitive science to formally model concepts, their interplay and analogies [BP][HG08], and has even been applied to neural networks [Hea00].

This paper starts with a brief introduction to the few basic notions required from category theory, after which we incrementally build a categorical view of classification problems in Section 2. This is followed by a short presentation of related work in the literature, after which we briefly conclude and expand on planned future work. An open source Python implementation can be found at: https://github.com/Previsou/CategoryLearning.

Basic category theory primer

For the reader unfamiliar with category theory, we here define the few basic notions used in the rest of the paper. For a more principled introduction and advanced category theory topics, see [Awo10] and [Rie16].

Definition 1. A category $C$ consists of:
• Objects, noted $x, y, z, \dots : C$.
• Arrows/relations between two objects, noted $f : x \to y$, $g, \dots$
• Identities: given $x : C$, there is an identity arrow $1_x : x \to x$ in $C$.
• Compositions: the collection of arrows is closed under composition. Given $f : x \to y$ and $g : y \to z$, there is an arrow noted $g \circ f : x \to z$ in $C$.

The composition $\circ$ is furthermore required to be associative, and identities act as left and right units. That is, given $f : x \to y$, $g : y \to z$, $h : z \to w$ in $C$:
• associativity: $h \circ (g \circ f) = (h \circ g) \circ f$;
• unit: $f = f \circ 1_x = 1_y \circ f$.

Everything matching the above definition is a category. For example, a directed graph $G(V, E)$ can be seen as a form of proto-category. A path in $G$ is a finite sequence of edges $e_1 e_2 \dots e_l$. The free category $C(G)$ over a graph $G(V, E)$ is obtained by completing the set of edges under composition: $C(G)$ has the vertices as objects and the paths of $G$ as arrows, the composite of paths $e_1 \dots e_l$ and $e'_1 \dots e'_m$ being their concatenation $e_1 \dots e_l e'_1 \dots e'_m$.¹ In other words, there is an arrow $f : V_1 \to V_2$ if and only if $V_1 = V_2$ or there is a directed path from $V_1$ to $V_2$ in $G$. (A minimal Python sketch of this construction is given below.)

¹ Loops must also be added to every vertex to serve as identities.

Definition 2. Let $C$ be a category. A subcategory $S$ of $C$ is given by a subcollection of objects of $C$, denoted $ob(S)$, and a subcollection of arrows of $C$, denoted $hom(S)$, such that:
• $\forall X : ob(S),\ id_X : hom(S)$,
• $\forall f : hom(S)(X, Y),\ X : ob(S) \wedge Y : ob(S)$,
• $\forall f, g : hom(S),\ (g \circ f : C \Rightarrow g \circ f : hom(S))$.

2 Learning the category of a dataset

A supervised machine learning algorithm receives a set of observations and matching targets as input. Let $\theta \in \mathbb{R}^q$ be a $q$-dimensional parameter vector. Let $X \in FT^N$ and $Y$ be respectively the observation sample of length $N$ and the associated labels or targets, where $FT$ is the feature space of a single observation. Discriminative supervised training can be viewed as finding, given a model $f$ (typically a neural network or an SVM), the maximum a posteriori (MAP) parameter vector:

$$\mathrm{MAP}(\theta, f, X, Y) = \arg\max_{\theta} P(Y = f(X, \theta) \mid X) \quad (1)$$

For tractability, most algorithms hypothesize that the observations are independent and identically distributed²: they were generated by the same measurement protocol, with previous measurements having no effect on future ones. However, not only is this assumption quite wrong in many situations, but it is inherently broken by the very common use of data augmentation techniques.
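Here is the minimal Python sketch of the free-category construction announced in the primer above, with observations playing the role of vertices and relations the role of edges. It assumes objects are hashable labels and represents an arrow as a (source, target, path-of-edges) triple; the class FreeCategory and its method names are illustrative only and are not taken from the paper's CategoryLearning implementation.

```python
class FreeCategory:
    """Free category over a directed graph G(V, E): objects are the vertices,
    arrows are finite directed paths, composition is path concatenation."""

    def __init__(self, vertices, edges):
        self.objects = set(vertices)             # V
        self.edges = [tuple(e) for e in edges]   # E, as (source, target) pairs

    def identity(self, x):
        # The identity arrow 1_x is the empty path based at x
        # (the loop mentioned in footnote 1).
        return (x, x, ())

    def compose(self, g, f):
        # g o f: first follow f : x -> y, then g : y -> z.
        fx, fy, f_path = f
        gy, gz, g_path = g
        if fy != gy:
            raise ValueError("arrows are not composable")
        return (fx, gz, f_path + g_path)

    def arrows_from(self, x, max_length=3):
        # Enumerate arrows out of x up to a path-length bound, since the
        # full collection of arrows may be infinite when G has cycles.
        arrows = [self.identity(x)]
        frontier = [(x, ())]
        for _ in range(max_length):
            next_frontier = []
            for node, path in frontier:
                for (u, v) in self.edges:
                    if u == node:
                        extended = path + ((u, v),)
                        arrows.append((x, v, extended))
                        next_frontier.append((v, extended))
            frontier = next_frontier
        return arrows


if __name__ == "__main__":
    C = FreeCategory({"a", "b", "c"}, [("a", "b"), ("b", "c")])
    f = ("a", "b", (("a", "b"),))
    g = ("b", "c", (("b", "c"),))
    gf = C.compose(g, f)                          # an arrow a -> c
    assert C.compose(f, C.identity("a")) == f     # identities are right units
    assert C.compose(C.identity("c"), gf) == gf   # and left units
```

Associativity comes for free from tuple concatenation, and the empty path at each vertex serves as the identity loop of footnote 1. The following paragraphs derive the usual i.i.d. classification loss that such a relational structure is meant to relax.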
Independence of the samples is usually assumed, as it makes it possible to factor Eq. (1):

$$\mathrm{MAP}(\theta, f, X, Y) = \arg\max_{\theta} \prod_{i=1}^{N} P(Y_i = f(X_i, \theta) \mid X_i) \quad (2)$$

For numerical reasons, Eq. (2) is transformed into a loss minimization problem $L$ by applying the transformation $x \mapsto -\log x$. The factored product turns into a summation, which is less prone to floating-point rounding errors:

$$L(\theta, f, X, Y) = -\frac{1}{N} \sum_{i=1}^{N} \log P(Y_i = f(X_i, \theta) \mid X_i) \quad (3)$$

We now particularize to the $K$-class classification setting. Targets $Y$ are assumed to be one-hot encoded, i.e., $Y_{i,j} = 1$ if and only if observation $i$ is of class $j$, and $0$ otherwise. Let our model be $S : FT \times \mathbb{R}^q \to [0, 1]^K$, constrained by $\sum_{i=1}^{K} S_i(\cdot) = 1$. The output of $S$ is essentially a probability vector. The same holds for $Y_i$, where all the probability mass is concentrated in a single entry. The goal is to have these two distributions align. Classification problems typically select the Kullback-Leibler divergence $D_{KL}$ to compare distributions.

Definition 3. The Kullback-Leibler divergence between discrete probability distributions $P$ and $Q$ is defined as $D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$. Its most relevant properties are:
• positivity: $D_{KL}(P \,\|\, Q) \geq 0$;
• asymmetry: $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$ in the general case;
• minimum: $D_{KL}(P \,\|\, Q) = 0$ if and only if $P = Q$ almost everywhere.

We end up with the following training loss:

$$L(\theta, S, X, Y) = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(Y_i \,\|\, S(X_i, \theta)) \quad (4)$$

In the degenerate case of a one-hot $Y_i$, the entropy $H(Y_i)$ is zero, so the KL divergence reduces to the cross-entropy $H(P, Q)$ and we end up with the familiar classification loss (a short numpy sketch is given at the end of this section):

$$L_{CE}(\theta, S, X_i, Y_i) = H(Y_i, S(X_i, \theta)) = -\sum_{j=1}^{K} Y_{i,j} \log S(X_i, \theta)_j$$
$$L(\theta, S, X, Y) = \frac{1}{N} \sum_{i=1}^{N} L_{CE}(\theta, S, X_i, Y_i) \quad (5)$$

Considering the input data not as a set but as a learnable small category requires selecting a proper and practical cost function for identities and composition, and defining how observations can relate to each other and in how many ways. The gist of what we set out to do can be exemplified as follows. Assume we set out to perform a binary classification task: given a sequenced genome, is the patient affected by a given genetic disease? From what we know of DNA and heredity, encoding and properly leveraging whether two patients are related should yield a more accurate model.

² The most notable exceptions are time-series and sequence learning problems, where the time/order dependence is implicit and given.
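To close this section, here is a small numpy sketch of the baseline losses of Eqs. (4) and (5), including a check that the KL divergence reduces to the cross-entropy for one-hot targets. The function names, the eps clipping and the toy arrays are ours, purely for illustration; this is not the paper's implementation.

```python
import numpy as np

def cross_entropy_loss(Y, S, eps=1e-12):
    """Eq. (5): Y is an (N, K) one-hot target matrix, S an (N, K) matrix of
    predicted class probabilities S(X_i, theta) whose rows sum to 1."""
    S = np.clip(S, eps, 1.0)                     # avoid log(0)
    per_sample = -np.sum(Y * np.log(S), axis=1)  # H(Y_i, S(X_i, theta))
    return per_sample.mean()                     # (1/N) * sum_i L_CE

def kl_loss(Y, S, eps=1e-12):
    """Eq. (4): average D_KL(Y_i || S(X_i, theta)); equals the cross-entropy
    loss when each Y_i is one-hot, since H(Y_i) is then zero."""
    Y_c = np.clip(Y, eps, 1.0)
    S = np.clip(S, eps, 1.0)
    per_sample = np.sum(Y * (np.log(Y_c) - np.log(S)), axis=1)
    return per_sample.mean()

if __name__ == "__main__":
    Y = np.array([[1.0, 0.0, 0.0],   # one-hot targets, N = 2, K = 3
                  [0.0, 1.0, 0.0]])
    S = np.array([[0.7, 0.2, 0.1],   # model outputs (rows sum to 1)
                  [0.1, 0.8, 0.1]])
    assert np.isclose(cross_entropy_loss(Y, S), kl_loss(Y, S))
```

The categorical loss developed in the remainder of the paper is meant to complement this per-observation term with cost functions for identities and compositions between observations.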