arXiv:2106.14587v2 [math.AT] 12 Jul 2021 1 uwiPrsRsac etr ahmtcladAlgorithmi and Mathematical Center, Research Paris Huawei oo n tcso epNua Networks Neural Deep of Stacks and enCad Belfiore Jean-Claude 2 nvriyo ai,Fclyo Mathematics of Faculty Paris, of University 1 Lab n ailBennequin Daniel and 1,2 Sciences c Contents

1 Architectures 4 1.1 Underlyinggraph ...... 4 1.2 Dynamicalobjectsofthechains ...... 5 1.3 DynamicalobjectsofthegeneralDNNs...... 6 1.4 Back-propagation as a natural (stochastic) flow in the topos ...... 8 1.5 TheparticularnatureofthetoposofDNNs ...... 10

2 Stacks of DNNs 14 2.1 Groupoids, general categorical invariance and logic ...... 14 2.2 Objects classifiers of the fibers of a classifying topos ...... 16 2.3 Theories, interpretation, inferences and deductions ...... 22 2.4 The model of a DNN and its Martin-L¨of type theory ...... 25 2.5 ClassifyingtheM-Ltheory?...... 32

3 Dynamics and homology 34 3.1 Ordinary cat’s manifolds ...... 34 3.2 Dynamics with spontaneous activity ...... 35 3.3 Fibrations and cofibrations of languages and theories ...... 36 3.4 Semantic information. Homology constructions ...... 45 3.5 Homotopyconstructions ...... 57

4 Unfoldings and memories, LSTM and GRU 75 4.1 RNNlattices,LSTMcells ...... 75 4.2 GRU,MGU...... 77 4.3 Universalstructurehypothesis ...... 79 4.4 Memoriesandbraids ...... 82 4.5 Pre-semantics ...... 87

5 A natural 3-category of deep networks 90 5.1 Attention moduli and relation moduli ...... 90 5.2 The2-categoryofanetwork ...... 92 5.3 Grothendieck derivators and semantic information ...... 93 5.4 StackshomotopyofDNNs ...... 95

Appendices A Localic topos and Fuzzy identities ...... 98 B ToposofDNNsandspectraofcommutativerings ...... 102 C Classifyingobjectsofgroupoids ...... 103 D Non-booleaninformationfunctions ...... 105 E Closer to natural languages: linear semantic information ...... 107

1 .

2 Abstract

Every known artificial deep neural network (DNN) corresponds to an object in a canoni- cal Grothendieck’s topos; its learning dynamic corresponds to a flow of in this topos. Invariance structures in the layers (like CNNs or LSTMs) correspond to Giraud’s stacks. This invariance is supposed to be responsible of the generalization property, that is extrapolation from learning data under constraints. The fibers represent pre-semantic categories (Culioli, Thom), over which artificial languages are defined, with internal logics, intuitionist, classical or linear (Girard). Semantic functioning of a network is its ability to express theories in such a language for answering questions in output about input data. Quantities and spaces of semantic information are defined by analogy with the homological interpretation of Shannon’s entropy (P.Baudot and D.B. 2015). They generalize the mea- sures found by Carnap and Bar-Hillel (1952). Amazingly, the above semantical structures are classified by geometric fibrant objects in a closed of Quillen, then they give rise to homotopical invariants of DNNs and of their semantic functioning. Intentional type theories (Martin-L¨of) organize these objects and fibrations between them. Information contents and exchanges are analyzed by Grothendieck’s derivators. Preliminaries

Introduction

This text presents a general theory of semantic functioning of deep neural networks, DNNs, based on Topology, more precisely, Grothendieck’s topos, Quillen’s homotopy theory, Thom’s singularity theory and the pre-semantic of Culioli in enunciative linguistic.

The theory is based on the existing networks, transforming data, as images, movies or written texts, for answering questions, achieving actions or taking decisions. Experiments, recent and past, show that the deep neural networks, having learned under constrained methods, can achieve surprizing semantic performances, cf.[XQLJ20], [BBD+11], [BBDH14], [BBG20]. However, the exploitation of more explicit invariance structures and adapted lan- guages, are in great part a task for the future. Then the present text is a melange of an analysis of the functioning networks, and of a conjectural frame to make them able to ap- proach more ideal semantic functioning.

Note that categories, homology and homotopy were recently applied in several manners to semantic information, for instance the application of to the design of networks, by Fong and Spivak [FS18], a general notion of Information Networks based on Segal spaces by Manin and Matilde Marcolli, [MM20], the Cech homology reconstruction of the environment by place fields of Curto and collaborators, [Cur17]. Let us also mention the characterization of entropy, by Baez, Fritz, Leinster, [BFL11], and the use of sheaves and cosheaves for studying information networks, Ghrist, Hiraoka 2011 [GH11], Curry 2013 [Cur13], Robinson and Joslyn, cf. [Rob17], and Abramsky et al. specially for Quantum Information [AB11]. Persistent homology for detecting structures in data must also be cited in this context, for instance Port, Karidi, Marcolli 2019, [PKM19] on syntactic structures, and Carlsson et al. on shape recognition [CZCG05]. More in relation with Bayes networks, there are the three recent theses of J-P. Vigneaux [Vig19], O. Peltre [Pel20] and G. Sergeant- Perthuis [SP21a], [SP21b], [SP21c].

With respect to these works, we look at a notion of information which is a (toposic) topological invariant of the junction of three dimensions of dynamics: 1. a logical flow along the network,

2. in the layers, the action of categories,

3. the evocations of meaning in languages. The resulting notion of information generalizes the suggestion of Carnap and Bar-Hillel 1952 in these three dynamical directions. Our inspiration came from the toposic interpre- tation of Shannon’s entropy in [BB15], cf. also [Vig19]. The fundamental ingredient is the

1 interpretation of internal implication (exponential) as a conditioning on theories. We dis- tinguish between the theoretically accessible information, concerning all the theories in a fibred languages, and the practically accessible information, that corresponds to the seman- tic functioning of neural networks, associated to a feed-forward dynamics which depends on a learning process. The main results in this article are the theorems 1.1 and 1.2, characterizing the topos as- sociated to DNNs; the theorems 2.1 and 2.2, characterizing the fibrations, in particular the fibrant objects, in a closed model category, made by the stacks of the DNNs having a given network’s architecture; the tentative definition of Semantic Information quantities and spaces in the sections 3.4, 3.5; the theorem 4.1 on the generic structures and dynamics of LSTMs. Particular examples showing the nature of the semantic information that we present here, are at the of section 3.5 about the exemplar language of Carnap and Bar-Hillel, and the mathematical interpretation of the pre-semantic of Culioli in relation with the artificial memory cells in sections 4.4, 4.5.

Chapter 1 describes the nature of the sites and the topos associated to deep neural networks, said DNNs, with their dynamics, feedforward and feedback (back-propagation) learning. Chapter 2 presents the different stacks of a DNN, which are fibred categories over the site of the DNN, incorporating symmetries and logics for approaching the semantics of wanted functioning. Usual examples are CNNs for translation symmetries, but other ex- amples concern logic and semantic (cf. experiments in Logical Information Cells, Belfiore, Bennequin, Giraud [BBG20]). Then the logical structure of the classifying topos of such a stack is described. Semantic functioning of a DNN is defined. The 2-category of these stacks is shown to constitute a closed model theory of injective type, in the sense of Quillen (Cisinski, Lurie). The fibrant objects, which are difficult to characterize in general, are de- termined in the case of the sites of DNNs. Interestingly, they correspond to the hypothesis guarantying a logical and semantic functioning. This model theory gives rise to a Martin-L¨of type theory associated to every DNN. Semantics in the sense of topos (Lambek) are added by considering objects in the classifying topos of the stack. In chapter 3, we start exploring the notion of semantic information and semantic function- ing in DNNs, in relation with homology and homotopy theory. We introduce hypotheses on the stack and the language objects that allow a transmission of theories downstream and of propositions upstream in the network. These hypotheses are in agreement with the fibrant morphisms in the model category of section 2. Then we define semantic condition- ing of the theories by the propositions, and compute the corresponding ringed co-homology of the functions of these theories; this gives a numerical notion of semantic ambiguity, of mutual information and Kullback-Leibler divergence. Then we generalize the homogeneous • bar-complex, to define a bi-simplicial set I∗ of classes of theories and propositions histories over the network, by taking homotopy co-limits. We introduce a class of increasing and • concave functions from I∗ to an external model category M; and with them, we obtain natural homotopy types of semantic information, associated to coherent semantic function- ing of a network with respect to a semantic problem; they satisfy properties conjectured by Carnap and Bar-Hillel in 1952 [CBH52] for the sets of semantic information. On the simple example they studied we show the interest of considering spaces of information, in partic- ular groupoids, in addition to the more usual combinatorial dimension of logical content of propositions. Chapter 4 describes examples of memory cells, and show that the natural groupoids

2 for their stack have for fundamental group the group of Artin’s braids with three strands. Generalizations are proposed, for semantics closer to the semantic of natural languages in appendix E. Finally chapter 5 introduces possible applications of topos, stacks and models to the re- lations between several DNNs: understanding the modular structures of networks, defining and studying the obstructions to integrate certain semantics or to solve problems in certain contexts. Examples could be taken from the above mentioned experiments on logical infor- mation cells, and from recent attempts of several teams in artificial intelligence: Hudson and Manning, Santoro, Raposo, Bengio and Hinton, using memory modules, linguistic analysis modules, attention modules and relation modules, in addition to convolution CNNs, for answering questions about images and movies.

Most of the figures mentioned in the text can be found in the chapter of Bennequin and Belfiore On new mathematical concepts for Artificial Intelligence, in the Huawei volume on Mathematics for Future Computing and Communication, edited by Liao Heng and Bill McColl, 2021. We also refer to this chapter for the elements of category theory that are nec- essary to understand this text, the definitions and first properties of topos and Grothendieck topos, and the presentation of elementary type theories. Chapter 9 of this book by Ge Yiqun and Tong Wen, Mathematics, Information and Learning, gives a large place to Topology in the notions of semantic information.

In a forthcoming preprint, A mathematical theory of semantic communication, with Mer- ouane Debbah, we will present the application of the above stacks of functioning DNNs and their information spaces, to the problem of semantic communication. In particular we show how the invariance structures in the fibers, made by categories acting on artificial languages, give a way to understand generalization properties of DNNs, for extrapolation, not only interpolation. Analytical aspects, as equivariant standard DNNs approximation of functions, or gradi- ent descent respecting the invariance, are developed in this context.

Acknowledgements

The two authors deeply thank Olivia Caramello and Laurent Lafforgue for the impulsion they gave to this research, for their constant encouragements and many helpful suggestions. They also warmly thank Merouane Debbah for his deep interest, the help and the support he gave, Xavier Giraud, for the concrete experiments he realized with us, allowing to connect the theory with the lively spontaneous behavior of artificial neural networks, and Zhenrong Liu (Louise) for her constant and very kind help at work. D.B. gives special thanks to Alain Berthoz, with whom he has had the chance to work and dream since many years on a conjectural topos geometry (properly speaking stacks) for the generation and control of the variety of humans voluntary movements. He also does not forget that the presence of natural invariants of topos in Information theory was dis- covered during a common work with Pierre Baudot, that he heartily thanks. D.B. had many inspiring discussions on closely related subjects with his former students, in particular Alireza Bahraini, Alexandre Afgoustidis, Juan-Pablo Vigneaux, Olivier Peltre and Gr´egoire Sergeant-Perthuis, that he friendly thanks, with gratitude.

3 Chapter 1

Architectures

Let us show how every (known) artificial deep neural network (DNN) can be described by a family of objects in a well defined topos.

1.1 Underlying graph

Definitions: an oriented graph Γ is directed when the relation a ≤ b between vertices, defined by the existence of an oriented path, made by concatenation of oriented edges, is a partial ordering on the set V (Γ) = Γ(0) of vertices. A graph is said classical if there exists at most one edge between two vertices, and no loop at one vertex (also named tadpole). A classical directed graph can have non-oriented cycles, but not oriented cycles. The layers and the direct connections between layers in an artificial neural network constitute a finite oriented graph Γ, which is directed, and classical. The minimal elements correspond to the initial layers, or input layers, and the maximal elements to the final layers, or output layers, all the other correspond to hidden layers, or inner layers. In the case of RNNs (as when we look at feedback connections in the brain) we apparently see loops, however they are not loops in space-time, the graph which represents the functioning of the network must be seen in the space-time (not necessary Galilean but causal), then the loops disappear and the graph appears directed and classical (cf. figure 1.1). Apparently there is no exception to these rules in the world of DNNs.

yt yt−1 yt

rt−1 rt rt+1 rt

rt−1 t− t xt x 1 x RNN Γ (a) Original RNN (b) Unfolded RNN in Time-Space

Figure 1.1: RNN with space-time unfolding

4 Remark. Bayesian networks are frequently associated to oriented or non-oriented graphs, which can be non-directed, and have oriented loops. However, the underlying random vari- ables are associated to vertices and to edges, the variable of an edge ab being the joint variable of the variables of a and b. More generally, an hypergaph is considered, made by a subset A of the set P(I) of subsets of a given set I. In this situation, we have a poset, where the partial ordering relation is the opposite of the inclusion.

1.2 Dynamical objects of the chains

The simplest architecture of a network is a chain, and the feed-forward functioning of the network, when it has learned, corresponds to a covariant X from the category Co(Γ) freely generated by the graph to the S: to a layer Lk; k ∈ Γ is associated the set Xk of possible activities of the population of neurons in Lk, to the edge Lk 7→ Lk+1 w is associated the map Xk+1,k : Xk → Xk+1 which corresponds to the learned weights wk+1,k; then to each arrow in Co(Γ), we associate the composed map. o But also the weights can be encoded in a covariant functor Π from C (Γ) to S: for Lk we define Πk as the product of all the sets Wl+1,l of weights for l ≥ k, and to the edge k 7→ k + 1 we associate the natural forgetting projection Πk+1,k : Πk → Πk+1. (The product over an empty set is the singleton ∗ in S, then for the output layer Ln the last projection is the unique possible map from Πn−1 to ∗.) In what follows, we will note W = Π, for remembering that it describes the functor of weights, but the notation Π is less confusing for denoting the morphisms in this functor. The cartesian products Xk × Πk together with the maps

′ w ′ Xk+1,k × Πk+1,k(xk, (wk+1,k,wk))=(Xk+1,k(xa),wk) (1.1) also defines a covariant functor X; it represents all the possible feed-forward functioning of the network, for every potential weights. The natural projection from X to W = Π is a of . It is remarkable that, in supervised learning, the Back- propagation algorithm is represented by a flow of natural transformations of the functor W to itself. We give a proof below in the general case, not only for a chain, where it is easier. Remark a difference with Spivak et al. ([FST19]), where the backpropagation is a functor, not a natural transformation. In fact, the weights represent mappings between two layers, individually they correspond to morphisms in a functor Xw, then it should have been more intuitive if they had been coded by morphisms, however globally they are better encoded by the objects in the func- tor W, and the morphisms in this functor are the erasure of the weights along the arrows that corresponds to them. This appears as a kind of dual representation of the mappings Xw.

As we want to respect the convention of Topos theory, [AGV63], we introduce the cat- egory C = C(Γ) which is opposed to Co(Γ); then Xw, W = Π and X become contravariant functors from this category C to Sets, i.e. presheaves over C, i.e. objects in the topos C∧. This is this topos which is associated to the neural network in form of a chain. Observe that the arrows between sets continue to follow the natural dynamical ordering, from the initial layer to the final layer, but the arrows in the category (the site) C are going now in the opposite direction. The object Xw can be naturally identified with a sub-object of X, we call this singleton ∧ the fiber of pr2 : X → W over the singleton w in W, (that is a in C from the final object 1 (the constant functor equal to the point ∗ at each layer) to the object W), which is

5 a system of weights for each edge of the graph Γ.

In this simple case of a chain, the classifying object of subobjects Ω, which is responsible of the logic in the topos, cf.[Pro19], is given by the sub-objects of 1; more precisely, for every k ∈ C, Ω(k) is the set of sub-objects of the localization 1|k, made by the arrows in C going to k. All these sub-objects are increasing sequences (∅, ..., ∅, ∗, ..., ∗). This can be interpreted as the fact that a proposition in the language (and internal semantic theory) of the topos is more and more determined when we approach to the last layer. Which corresponds well to what happens in the internal world of the network, and also, in most cases, to the informa- tion about the output that an external observer can deduce from the activity in the inner layers. Cf. Logical Information Cells, soon on , [BBG20].

1.3 Dynamical objects of the general DNNs

However, many networks, and most today’s networks, are far from being simple chains. The topology of Γ is very complex, with many paths going from a layer to a deeper one, and many inputs and outputs at a same vertex. In these cases, the functioning and the weights are not defined by functors on C(Γ) (the category opposite to the category freely generated by Γ). But a canonical modification of this category permits to solve the problem: at each layer a where more than one layer sends information, say a′, a”, ..., i.e. where exist irreducible arrows aa′, aa”, ... in C(Γ) (edges in Γop), we perform a surgery, between a and a′ (resp. a and a”, a.s.o.) introduce two new objects A∗ and A, with arrows a′ → A∗, a” → A∗, ..., and A∗ → A, a → A, forming a fork, with tips in a′, a”, ... and handle A∗Aa (more precisely if not too pedantically, the arrows a′A∗, a”A∗, ... are the tines, the arrow A∗A is the tang, or socket, and the arrow aA is the handle), cf. figure 1.2. By reversing arrows, this gives a new oriented graph Γ, also without oriented cycles, and the category C which replaces C(Γ) is the category C(Γ), opposite of the category which is freely generated by Γ.

′ a Γ

a′′ a

a′ ···

a′′ A∗ A a

··· Γ

Figure 1.2: From the initial graph to the Fork

6 Remark: in Γ, the complement of the unions of the tangs is a forest. Only the convergent multiplicity in Γ gives rise to forks, not the divergent one. In the category C, this convergence (resp. divergence) corresponds to a divergence (resp. convergence) of the arrows.

When describing concrete networks (cf. for instance RNN, and LST M or GRU memory cells we will study in section 4), ambiguity can appear with the input layers: they can be considered as input or as tips when several inputs join for connecting a deeper layer a. The better attitude is to duplicate them; for instance two input layers xt, ht−1 going to ht,yt, ′ ′ ∗ ′ ′ we introduce Xt, xt, Ht−1, ht−1, then a fork A , A, and in C, arrows xt → Xt, ht−1 → Ht−1 ′ ∗ ′ ∗ ∗ for representing the input data, arrows of fork xt → A , ht−1 → A , A → A, and arrows of information transmissions ht → A and yt → A, representing the ourput of the memory cell.

With this category C, it is possible to define the analog of the pre-sheaves Xw, W = Π and X in general.

w w First X : at each old vertex, the set Xa is as before the set of activities of the neurons of the corresponding layer; over a point like A∗ and A we put the product of all the incoming w w sets Xa′ × Xa”, .... The map from XA to Xa is the dynamical transmission in the network, joining the information coming from all the inputs layers a′, a”, ... at a, all the other maps w are given by the structure: the projection on its factors from XA∗ , and the identity over the ∗ 0 arrow A A. It is easy to show, that given a collection of activities εin in all the initial layers of the network, it results a unique section of the presheaf Xw, a singleton, or an element of w 0 limCX , which induces εin. Thus, dynamically, each arrow of type a → A has replaced the set of arrows from a to a′, a”, ....

It is remarkable that the main structural part (which is the projection from a product to its components) can be interpreted by the fact that the presheaf is a sheaf for a natural Grothendieck topology J on the category C: in every object x of C the only covering is the full category C|x, except when x is of the type of A∗, where we add the covering made by the arrows of the type a′ → A∗. Cf. [AGV63].

The sheafification process, associating a sheaf X∗ over (C,J) to any presheaf X over C is ∗ easy to describe: no value is changed except at a place A , where XA∗ is replaced by the ∗ ∗ ∗ product XA∗ of the Xa′ , and the map from XA = XA to XA∗ is replaced by the product of the maps from XA to the Xa′ given by the functor X. In particular, important for us, the sheaf C∗ associated to a constant pre-sheaf C replaces C in A∗ by a product Cn and the identity C → C by the diagonal map C → Cn over the arrow A∗A.

Let us now describe the sheaf W over (C,J) which represents the set of possible weights of the DNN (or RNN a.s.o.). First consider at each vertex a of the initial graph Γ, the set Wa ′ of weights describing the allowed maps from the product XA = Qa′←a Xa to Xa, over the projecting layers a′, a”, ... to a. Then consider at each layer x the (necessarily connected) subgraph Γx (or x|Γ) which is the union of the connected oriented paths in Γ from x to some output layer (i.e. the maximal branches issued from x in Γ); take for W(x) the product of the Wy over all the vertices in Γx. (For the functioning, it is useful to consider the part Γx ∗ (or x|Γ) which is formed from Γx, by adding the collections of points A , A when necessary, and the arrows containing them in Γ.) At every vertex of type A∗ or A of Γ, we put the ′ ′ product WA of the sets Wa′ for the afferent a , a”, ... to a. If x x is an oriented edge of Γ, ′ there exists a natural projection Πxx′ : W(x ) → W(x). This defines a sheaf over C = C(Γ).

7 The crossed product X of the Xw over W is defined as for the simple chains. It is an object of the topos of sheaves over C that represents all the possible functioning of the neural network.

1.4 Back-propagation as a natural (stochastic) flow in the topos

Nothing is loosed in generality if we put together the inputs (resp. the output) in a product space X0 (resp. Xn); this corresponds to the introduction of an initial vertex x0 and a final vertex xn in Γ, respectively connected to all the existing initial or final vertices. We also assume that the spaces of states of activity Xa and the spaces of weights WaA are smooth manifolds, and that the maps (x, w) 7→ Xw(x) defines smooth maps on the cor- responding product manifolds. In particular it is possible to define tangent objects in the topos of the network T X and T W, and smooth natural transformations between them.

Supervised learning consists in the choice of an energy function

(ξ0,w) 7→ F (ξ0; ξn(w,ξ0)); (1.2) then in the search of the absolute minimum of the mean Φ = E(F ) of this energy over a measure on the inputs ξ0; it is a real function on the whole set of weights W = W0. For simplicity, we assume that F is smooth, and we do not enter the difficult point of effective numerical gradient descent algorithms, we just want to develop the formula of the linear form dF on Tw0 W , for a fixed input ξ0 and a fixed system of weights w0. The gradient will depend on the choices of a Riemannian metric on W . And the gradient of Φ is the mean of the individual gradients. We have ∗ dF (δw)= F dξn(δw), (1.3) then it is sufficient to compute dξn. The product formula is W 0 = Y WaA, (1.4) a∈Γ where a describes all the vertices of Γ, Aa is the corresponding edge in Γ. Then it is sufficient to compute dξn(δwa) for δwa ∈ Tw0 WaA, assuming that all the other vectors δwbB are zero, except δwa which denotes the weight over the edge Aa.

For that, we consider the set Ωa of directed paths γa in Γ going from a to the output layer xn. Each such path gives rise to a zigzag in Γ : ... ← B′ → b′ ← B → b ← ... (1.5) which gives a feed-forward composed map, by taking over each B → b the map XwbB from the product XB to the manifold Xb, where everything is fixed by ξ0 and w0 except on the ′ ′ branch coming from b , where wa varies, and by taking over each b ← B the injection ρBb′ defined by the other factors Xb”,Xb′′′ , ... of XB. This composition is written φ = Xw0 ◦ ρ ◦ Xw ; (1.6) γa Y bkBk Bkbk−1 aA bk∈γa

8 going from the manifold Wa × XA to the manifold Xn. In the above formula, k starts with 1, and b0 = a.

′ Two different elements γa, γ”a of Ωa must coincide after a certain vertex c, where they about from different branches c′c, c”c in Γ; they pass through B in Γ; then we can define ⊕2 ′ the sum φγa ⊕ φγ”a , as a map from WaA × XA to Xn, by composing the maps between the ′ ′ X s after b, from b to xn, with the two maps φγa and φγ”a truncated at B. We name this ′ operation the cooperation, or cooperative sum, of φγa and φγ”a . Cooperation can be iterated in associative and commutating manner to any subset of Ωa, representing a tree issued from xn, embedded in Γ, made by all the common branches between the pairs of paths from a to xn. The full cooperative sum is the map

M φγa : XA × M WaA → Xn. (1.7) γa∈Ωa

For a fixed ξ0, and all wbB fixed except waA, the point ξn(w) can be described as the com- position of the diagonal map with the total cooperative sum

wa 7→ (wa, ...wa) ∈ M WaA → Xn. (1.8) γa∈Ωa| This gives

dξn(δwa)= X dφγa δwa; (1.9) γa∈Ωa which implies the back-propagation formula:

Lemma 1.1. dξ (δw )= DXw0 ◦ Dρ ◦ ∂ Xw .δw (1.10) n a X Y bkBk Bkbk−1 w aA a γa∈Ωa bk∈γa

0 0 going from the tangent space Twa (Wa) to the tangent space Tξn (Xn). In this expression, k starts with 1, and b0 = a. To get the back-propagation flow, we compose to the left with F ∗ = dF , which gives a linear form, then apply the chosen metric on the manifold W , which gives a vector field β(w0|ξ0). Let us assume that the function F is bounded from below on X0 × W and coercive (at least proper). Then the flow of β is globally defined on W . From it we define a one parameter group of natural transformations of the object W.

In practice, a sequence Ξm; m ∈ [M] of finite set of inputs ξ0 (benchmarks) is chosen randomly, according to the chosen mesure on the initial data, and the gradient is taken for the sum

Fm = X Fξ0 , (1.11) Ξm then the flow is integrated (with some important cooking) for a certain time, before the next integration with Fm+1. This changes nothing for the result:

Theorem 1.1. Backpropagation is a flow of natural transformations of W, computed from collections of singletons in X.

9 Remark. Frequently, the function F takes the form of a Kullback-Leibler divergence

DKL(P (ξn)|Pn) and can be rewritten as a Free energy, which can itself be replaced by a Bethe Free energy over inner variables, which are probabilistic laws on the weights. This is where information quantities could enter, cf. [Pel20].

1.5 The particular nature of the topos of DNNs

We ask now to what species belong the topos C∼ of a DNN?

Definitions. Let X denotes the set of vertices of Γ of type a or of type A. We introduce the ′ full CX of C generated by X. There exist only one arrow from a a to a vertex of type A through A∗ (but a given a′ can join different A∗ then different A), only one arrow from a a to its preceding A (but A can belong to several vertices a). Moreover there exists only one arrow from a vertex c to a vertex b when b and c are on a chain in C which doesn’t contain a fork. And no other arrows exist in CX. By definition of the forks, a point a (i.e. a handle) cannot join another point than its tang A, and an input or a tang A is the center of a convergent star. op Any maximal chain in CX joins an input entry or a A-point (i.e. a tang), to a vertex of type a′ (i.e. a tip) or to an output layer. Issued from a tang A it can pass through a handle a or a tip a′, because nothing forbids a tip to join a vertex b. If x, y belong to X, we note x ≤ y when there exists a morphism from x to y; then it is equivalent to write x → y in the category CX.

Proposition 1.1. (i) CX is a poset. (ii) Every presheaf on C induces a presheaf on CX. (iii) For every presheaf on CX, there exists a unique sheaf on C which induces it.

Proof. (i) let γ1,γ2 be two different simple directed paths in CX going from a point z in X to a point x in X, there must exists a first point y where the two paths disjoin, going to two different points y1, y2. This point y cannot be a handle (type a), nor an input, nor a tang (type A), then it is an output or a tip. It cannot be an output, because a fork would have been introduced here to manage the divergence. If the two points y1,y2 were tangs, they were the ending points of the paths, which is impossible. But at least one of them is a tang, say A2, because a tip cannot diverge to two ordinary vertices, if not, there should be a fork here. Then one of them, say y1, is an ordinary vertex and begins a chain, without divergence until it attains an input or a tang A1. Therefore A1 = A2, but this gives an oriented loop in the initial graph Γ, which was excluded from the beginning for a DNN. This final argument directly forbids the existence of x, =6 y with x ≤ y and y ≤ x. Then CX is a poset. (ii) is obvious. For (iii), remark that the vertices of Γ which are eliminated in X are the ∗ ∗ A . Then consider a pre-sheaf FX on X, the sheaf condition over C tells that F (A ) must be the product of the entrant F (a′), ..., then the map F (A) → F (A∗) must be the product of the maps F (A) → F (a′). 

∼ ∧ Corollary. C is naturally equivalent to the category of presheaves CX. Remark. In Friedman [Fri05], it was shown that every topos defined by a finite site, where objects do not possess non unit endomorphisms, has this property to be equivalent to a topos

10 of presheaves over a finite full subcategory of the site: this is the category generated by the objects that have only the trivial full covering. Then we are in a particular case of this theorem. The special fact, that we get a site which is a poset, implies many good properties for the topos, cf. Bell [Bel08], Caramello [Car09].

In what follows, we will frequently note X the poset CX.

The (lower) topology of Alexandrov on X, is made by the subsets U of X such that y ∈ U and x ≤ y imply x ∈ U. A basis for this topology is made by the collections Uα of the β such that α ≤ β. In fact, ′ consider the intersection Ux ∩ Ux′ ; if y ≤ x and y ≤ x , we have Uy ⊆ Ux ∩ Ux′ , then Ux ∩ Ux′ = Uy. Sy∈Ux∩Ux′ In our examples the poset X is in general not stable by intersections or unions of subsets of X, but the intersection and union of the sets Ux, Uy for x, y ∈ X plays this role. We note Ω or Ω(X) when there exists a possibility of confusion, the set of (lower) open sets on X. A sheaf in the topological sense over the Alexandrov space X is a sheaf in the sense of topos over the category Ω(X), where arrows are the inclusions, equipped with the Grothendieck topology, generated by the open coverings of open sets.

Proposition 1.2. (cf. [Car18, Theorem 1.1.8, the comparison lemma] and [Bel08, p. 210]): every presheaf of sets over the category CX can be extended to a sheaf on X for the Alexandrov topology, and this extension is unique up to a unique isomorphism.

Proof. Let F be a presheaf on CX; for every x ∈ X, F (Ux) is equal to F (x). For any open set U = Sx∈U Ux we define F (U) as the over x ∈ U of the sets F (x) (that is the set of ′ families sx; x ∈ U in the sets F (x); x ∈ U, such that for any pair x, x in U and any element y in Ux ∩ Ux′ , the images of sx and sx′ in F (y) coincide. This defines a presheaf for the lower topological topology. This presheaf is a sheaf:

1. if U is a covering of U, and if s,s′ are two elements of F (U) which give the same elements ′ ′ over V for all V ∈ U, the elements sx,sx that are defined by s and s respectively in every F (x) for x ∈ U are the same, then by definition, s = s′.

2. To verify the second axiom of a sheaf, suppose that a collection sV is defined for V in the covering U of U, and that for any intersection V ∩ W ,V, W ∈ U the restrictions of sV and sW coincide, then by restriction to any Ux for x ∈ U we get a coherent section over U.

3. For the uniqueness, take a sheaf F ′ which extends F , and consider the open set U = ′ ′ ′ Sx∈U Ux, any element s of F (U) induces a collection sx ∈ F (Ux) = F (x) which is ′ coherent, then defines a unique element s = fU (s ) ∈ F (U). These maps fU ; U ∈ Ω defines the required isomorphism. 

Corollary. The category C∼ is equivalent to the category Sh(X) of sheaves of X, in the ordinary topological sense, for the (lower) Alexandrov topology.

11 Consequences, from [Bel08, pp.408-410]: the topos E = C∼ of a neural network is coherent. It possesses sufficiently many points, i.e. geometric functors S → C∼, such that equality of morphisms in C∼ can be tested on these points. In fact, such an equality can be tested on sub-singletons, i.e. the topos is generated by the subobjets of the final object 1. This property is valled the sub-extensionality of the topos E. Moreover E (as any Grothendieck topos) is defined over the category of sets: there exists a unique geometric functor µ : E → S. This functor is given by the global sections of a sheaf over X. In this case, as shown in the book of Bell, the equality of sub-objects (i.e. propositions) in every object of the form µ∗(S) (named sub-constant objects) is decidable. The two above properties characterize the so-called localic topos, cf. [Bel08] or [MLM92].

The points of E correspond to the ordinary points of the topological space X; they are also the points of the poset CX. For each such point x ∈ X, the functor ǫx : S→E is the right adjoint of the functor sending the sheaf F to its fiber F (x).

In the neural network, the minimal elements for the ordering in X are the output layers plus some points a′ (tips), the maximal ones are the input layers, and the points of type A (tangs). However, for the standard functioning and for the supervised learning, in the objects X, W, the fibers in A are identified with the products of the fibers in the tips a′, a”, ..., and play the role of transmission to the branches of type a. Therefore the feed-forward functioning doesn’t reflect the complexity of the set Ω. The backpropagation learning algorithm also escapes this complexity.

Remarks: if A were not present in the fork, we should have added the empty covering of a for satisfying the axioms of a Grothendieck topology, and this would be disastrous, implying that every sheaf must have in a the value ∗ (singleton). A consequence is the existence of more general sheaves than the ones that correspond to usual feed-forward dynamics, because ∗ they can have a value XA different from the product of the Xa′ appearing in A , equipped with a map XA∗A : XA → Q Xa′ and XaA : XA → Xa. Then, depending on the value of 0 εin and of the other objects and morphisms, a propagation can happen or not. This could opens the door for new types of networks, having a part of spontaneous activities. Several evidences show that the natural neuronal networks in the brain of the animals are working with internal modulations and complex variants of supervised learning, involving memories, spontaneous activities, genetically and epi-genetically programmed activations and desactivations, which optimize the survival at the level of the evolution of species.

Remark. Appendix A gives an interpretation due to Bell of the class of topos we encounter here, named localic topos, in terms of a categorical version of fuzzy sets, called sets with fuzzy identities taking values in an given Heyting algebra.

For the topos of a DNN, the Heyting algebra Ω is the algebra of open subsets of the poset X. However, we can go further in the characterization of the topos by using the particular properties of the poset X, and of the algebra Ω.

Theorem 1.2. The poset X of a DNN is made by a finite number of trees, rooted in the maximal points, which are joined in the minimal points.

12 More precisely, the minimal elements are of two sorts: the outputs layers xn,j and the tips of the forks, i.e. the points of type a′; the maximal elements are also of two sorts: the input layers x0,i and the tangs of the forks (i.e. the points A). Moreover, the tips and the tanks are joined by an irreducible arrows, but a tip can join several tanks and some ordinary point (type a) (but no input x0,i), and a tank can be joined by several tips and other ordinary points (but no output xn,j).

Remark. The only possible divergences happen at tips, because they can joint several tanks and additional ordinary points in X.

Remark. Appendix B gives an interpretation of the kind of we get for DNNs in terms of spectrum of commutative rings.

13 Chapter 2

Stacks of DNNs

2.1 Groupoids, general categorical invariance and logic

In many interesting cases, a restriction on the structure of the functioning Xw, or the learning in W, comes from a geometrical or semantical invariance, which is extracted (or expected) from the input data and/or the problems that the network has to solve as output. The most celebrate example is given by the convolutional networks CNNs. These net- works are made for analyzing images; it can be for finding something precise in an image in a given class of images, or it can be for classifying special forms. The images are assumed to be by nature invariant by planar translation, then it is imposed to a large number of layers to accept a non trivial action of the group G of 2D-translations and to a large numbers of connections between two layers to be compatible with the actions, which implies that the underlying linear part when it exists is made by convolutions with a numerical function on the plane. This doesn’t forbid that in several layers, the action of G is trivial, to get invariant characteristics under translations, and here, the layers can be fully connected. The Resnets today have such a structure, with non-trivial architectures, as studied before. Other Lie groups and their associated convolutions were recently used for DNNs, cf. Co- hen et al. [CWKW19], [CGW20]. The authors underline the analogy with Gauge theory in Physics. In the same spirit, Bondesan and Welling [BW21] give an interpretation of the excited states in DNNs in terms of particles in Quantum Field Theory.

DNNs that analyze images today, for instance in object detection, have several channels of convolutional maps, max pooling and fully connected maps, that are joint together to take a decision. It looks as a structure for localizing the translation invariance, as it happens in the successive visual areas in the brains of animals. Experiments show that in the first layers, kinds of wavelet kernels are formed spontaneously to translate contrasts, and color opposition kernels are formed to construct color invariance.

A toposic manner to encode such a situation consists to consider contra-variant functors from the category C of the network with values in the topos of G-sets. The collection of ∼ these functors, with morphisms given by the natural transformations, form a category CG , which was shown to be itself a topos by Giraud 1972 [Gir72]. In fact it is equivalent to introduce a category F fibred in groups isomorphic to G over C, π : F → C, satisfying the axioms of a stack; the category F has a canonical topology J (the less fine such that π is continuous), and the ordinary topos E of sheaves of sets over the site (F,J), which is named ∼ the classifying topos of the stack, is naturally equivalent to CG . The theorem of Giraud is more general; it extends to any stack over C, telling that the

14 category of conra-variant functors from C to the the topos of the fibers (satisfying a topo- logical condition) is equivalent to the classifying topos of the stack. In particular nothing is seriously changed (in principles) if the group is replaced by a groupoid, and if we consider a category F which is fibered in groupoids over the category C, or better, its associated stack. The classifying topos E of the stack fibers geometrically over ∼ the topos C and its fibers Ex for every x ∈ C can be naturally identified with the topos of presheaves over the fiber Fx. We could even imagine arbitrary small categories for the fibers. The case of groupoids has the interest that the presheaves on a groupoid form a Boolean topos, then ordinary logic is automatically incorporated.

Remarks. 1. The logic in the topos of a groupoid consists of simple Boolean algebras; however, things appear more interesting when we remember the meaning of the atoms Zi; i ∈ K, because they are made by irreducible Ga-sets. We interpret that as a part of the semantic point of view, in the languages of topos and stacks.

2. In the experiments reported in [BBG20] as in CNNs, the irreducible linear represen- tations of groups appear spontaneously among the dynamical objects.

3. In every language we can speak of the future, the uncertain past, and introduce hy- potheses, this doesn’t mean that we are leaving the world of usual Boolean logic, we are just considering externally some intuitionist Heyting algebra, this can be done within ordinary set theory, as is done topos theory in Mathematics, in the fibers, defined by groupoids.

Appendix C gives a description of the classifying object of a groupoid, that is well known by specialists of category theory.

However, other logics, intuitionist, can also have an interest. In more recent experiments, with Xavier Giraud, on data representing time evolution, we used simple posets in the fibers.

The notion of invariance goes farther than groupoids.

Invariance is synonymous of action, and is understood here in the categorical sense: a category G acts on another category V when a (contravariant) functor from G to V is given. The example that justifies this terminology is when G is a group G, and V the Abelian category of vector spaces and linear maps over a commutative field K. In the latter case, we obtain a linear representation of the group G.

In any category V, there exists a notion which generalizes the notion of element of a set. Any morphism ϕ : u → v in V can be viewed as an element of the object v of V.

Definition. Suppose that G acts through the functor f : G→V and that v = f(a), then the orbit of ϕ under G|a is the functor from the left slice category G|a to the right slice category u|V, that associates to any morphism a′ → a the element u → f(a) → f(a′) of f(a′) in V and to an arrow a” → a′ over a the corresponding morphism f(a′) → f(a”) from u → f(a′) to u → f(a”).

In the classical example of a group representation, u = K and the morphism ϕ defines a vec- tor x in the space Ve. The group G is identified with G|e and the vector space Ve, identified

15 with Hom(K,Ve), contains the whole orbit of x.

In a stack, the notion of action of categories is extended to the notion of fibred action of a fibred category F to a fibred category N :

Definition. Suppose we are given a sheaf of categories F : C → Cat, that we consider as a general structure of invariance, and another sheaf M : C → Cat. An action of F on M is a ′ family of contra-variant functors fU : FU → MU such that, for any morphism α : U → U of C, we have fU ◦ Fα = Mα ◦ fU ′ . (2.1) This is the equivariance formula generalizing group equivariance as it can be found in [Kon18] for instance. It is equivalent to morphisms of stacks, and allows to define the orbits of sections uU → fU (ξU ) in the sheaf u|M under the action of the relative stack F|ξ. Remark that Eilenberg and MacLane, when they invented categories and functors in [EM45], were conscious to generalize the Klein’s program in Geometry (Erlangen program).

In the next sections, we will introduce languages with types taken from pre-sheaves over the fibers of the stack, where we define the terms of theories and propositions of interest for the functioning of the DNN. Then the above notion of invariance will concern the action of a kind of pre-semantic categories on the languages and the possible sets of theories, that the network could use and express in functioning.

This view is a crucial point for our applications of topos theory to DNNs, because it is in this framework that logical reasoning, and more generally semantics, in the neural net- work, can be posed: in a stack the different layers interpret the logical propositions and the sentences of the output layers. As we will see, the interpretations are expected to become more and more faithful when approaching the output, however the information flow in the whole networks is interesting.

This shift from groups to groupoids, then to categories, then to more general semantic, by taking pre-sheaves in groupoids or categories, is a fundamental addition to the site C. The true topos associated to a network is the classifying topos E over F; it incorporates much more structure than the visible architecture of layers, it takes into account invariance (which appears here to be part of the semantic, or better pre-semantic). More generally, it can concern the domain of natural human semantics that the network has to understand in his own world, called artificial.

Moreover, as we will show below, working in this setting gives access to more flexible type theories, like the Martin-L¨of intensional types, and goes into the direction of homotopy type theory, Hofmann and Streicher 1996 [HS98], Hollander 2001 [Hol08], Arndt and Kapulkin 2012 [AK11], enlarged by objects and morphisms in classifying topos in the sense of Giraud.

2.2 Objects classifiers of the fibers of a classifying topos

Among the equivalent points of view on stacks and classifying topos (cf. Giraud, 1964, [Gir64], Non-abelian cohomology 1971 [Gir71], and 1972 [Gir72]), the most concrete starts

16 with a contra-variant functor F from the category C to the 2-category of small categories. (This is an element of the category Scind(C) in the book of Giraud [Gir71].) To each object U ∈ C is associated a small category F (U), and to each morphism α : U → U ′ is ′ associated a covariant functor Fα : F (U ) → F (U), also noted F (α), satisfying the axioms of a presheaf. If fU : ξ → η is a morphism in F (U), the functor Fα sends it to a morphism ′ Fα(fU ): Fα(ξ) → Fα(η) in F (U ). The corresponding fibration π : F → C, written ∇F by Grothendieck, has for objects the pairs (U, ξ) where U ∈ C and ξ ∈ F (U), sometimes shortly written ξU , and for morphisms the elements of

′ ′ ′ HomF ((U, ξ), (U ,ξ )) = [ HomF (U)(ξ, F (α)ξ ). (2.2) ′ α∈HomC (U,U )

′ ′ For every morphism α : U → U of C, the set HomF (U)(ξ, F (α)ξ ) is also denoted ′ ′ Homα((U, ξ), (U ,ξ )); it is the subset of morphisms in F, that lift α.

−1 The functor π sends (U, ξ) on U. We will write indifferently F (U) or FU the fiber π (U).

A section s of π corresponds to a family sU ∈ FU indexed by U ∈ C, and a family of ′ morphisms sα ∈ HomF (U)(sU , F (α)sU ′ ) indexed by α ∈ HomC(U, U ) such that, for any pair of compatible morphisms α, β, we have

sα◦β = Fβ(sα) ◦ sβ. (2.3)

As shown by Grothendieck and Giraud (cf. Giraud 1964), a pre-sheaf A over F corre- sponds to a family of presheaves AU on the categories FU indexed by U ∈ C, and a family ′ ∗ ∗ Aα indexed by α ∈ HomC(U, U ), of natural transformations from AU ′ to FαAU . (Here Fα ′ denotes the pullback of presheaf associated to the functor Fα : F (U ) → F (U), that is, for AU : F (U) → Set, the composed functor AU ◦ Fα.) Moreover, for any compatible morphisms β : V → U, α : U → U ′, we must have

∗ Aα◦β = Fα (Aβ) ◦ Aα. (2.4)

′ If ξ is an object of FU , we define A(U, ξ)= AU (ξ), and if f : ξU → FαξU ′ is a morphism of ′ F between ξU ∈FU and ξU ′ ∈FU ′ lifting α, we take

′ ′ A(f)= AU (f) ◦ Aα : AU ′ (ξ ) → AU (Fα(ξ )) → AU (ξ). (2.5)

The relation A(f ◦ g)= A(g) ◦ A(f) follows from (2.4). A natural transformation ϕ : A → A′ corresponds to a family of natural transformations ′ ′ ϕU : AU → AU , such that, for any arrow α : U → U in C,

∗ ′ ∗ ′ FαϕU ◦ Aα = Aα ◦ ϕU ′ : AU ′ → FαAU . (2.6)

This describes the category E of pre-sheaves over F from the family of categories EU of ∗ pre-sheaves over the fibers FU and the family of functors Fα : EU →EU ′ . ′ ∗ ∗ ∗ Note that for two consecutive morphisms β : V → U, α : U → U , we have Fαβ = Fα ◦ Fβ .

The category E is fibred over the category C, it corresponds to the functor from C to ′ Cat, which associates to U ∈ C the category EU and to an arrow α : U → U , the functor α ∗ F! : EU ′ → EU , which is the left adjoint of Fα. This functor extends Fα by the Yoneda

17 embedding, cf. SGA 4 I, pre-sheaves. ′ αβ β α For two consecutive morphisms β : V → U, α : U → U , we have F! = F! ◦ F! . α ∗ Let ηα : F! ◦ Fα → IdEU the co-unit of the adjunction; a natural transformation ∗ ∗ α ∗ Aα : AU ′ → FαAU gives a natural transformation Aα : F! AU ′ → AU , by taking Aα = α (ηα ⊗ Id)F! (Aα). This gives another way to describe the elements of E, the pre-sheaves over F.

Remark. A section (sU ,sα) defines a pre-sheaf A, by taking

AU (ξ)= HomFU (ξ,sU ); (2.7)

∗ and Aα = sα ◦ Fα, according to the following sequence:

′ ′ ′ Hom(ξ ,sU ′ ) → Hom(Fαξ , Fα(sU ′ )) → Hom(Fαξ ,sU ). (2.8)

The identity (2.4) follows from the identity (2.3). This construction generalizes in the fibered situation the Yoneda objects in the absolute sit- uation. A morphism of sections gives a morphism of pre-sheaves.

In each topos EU there exists a classifying object ΩU , such that the natural transforma- tions HomU (XU , ΩU ) correspond naturally to the sub-objects of XU ; the pre-shaef ΩU has ∧ for value in ξU ∈ FU the set of sub-objects in EU of the Yoneda pre-sheaf ξU defined by η 7→ Hom(η,ξU ). The set ΩU (ξU ) can also be identified or with the set of sub-objects of the

final sheaf 1ξU over the slice category FU |ξU . ∗ As just said before, the functor Fα : EU →EU ′ which associates A ◦ Fα to A, possesses a left α adjoint F! : EU ′ →EU which extends the functor Fα on the Yoneda objects. For any object ′ ′ α ′ ∧ ∧ ′ ∧ ξ in FU ′ , note ξ = Fα(ξ ); the functor F! sends (ξ ) to ξ , and sends a sub-set of (ξ ) to ∧ α a subset of ξ . This is not because F! is necessarily left exact, but because we are working with Grothendieck topos, where sub-objects are given by families of coherent subsets. α Moreover F! respects the ordering between these sub-sets, then it induces a poset morphism between the posets of sub-objects

′ ′ ′ ∗ ′ Ωα(ξ ) : ΩU ′ (ξ ) → ΩU (Fα(ξ )) = FαΩU (ξ ); (2.9) the functoriality of ΩU , ΩU ′ and Fα implies that these maps constitute a natural transfor- mation between pre-sheaves ∗ Ωα : ΩU ′ → FαΩU . (2.10) The naturalness of the construction insures the formula (2.4) for the composition of mor- phisms. Consequently, we obtain a pre-sheaf ΩF . ∧ Moreover the final object 1F of the classifying topos E = F corresponds to the collection ∗ ′ of final objects 1U ; U ∈ C and to the collection of morphisms 1U ′ → Fα1U ; α ∈ HomC(U, U ), then we have:

Proposition 2.1. The classifier of the classifying topos is the sheaf ΩF given by the classi- fiers ΩU and the pullback morphisms Ωα, which can be summarized by the formula

ΩF = ∇U∈CΩU dΩα. (2.11)

18 ∗ In general the functor Fα is not geometric; by definition, it is so if and only if its left adjoint (Fα)!, which is right exact (i.e. commutes with the finite co-limits), is also left exact (i.e. commutes with the finite limits). Also by definition, this is the case if and only if the morphism Fα is a morphism of sites. The prototype is the functor associated to a continuous map between topological spaces. ∗ Important for us: it results from the work of Giraud in [Gir72], that Fα is geometric when Fα is a stack. (We will see in the next section, that the stacks π which correspond to the admissible contexts in a dependent type theory over the site of a DNN satisfy this condition.)

∗ When Fα is geometric, a great part of the logic in EU ′ can be transported to EU : ∗ ∗ Let us write f = Fα and f =(Fα)! its left adjoint, supposed to be left exact, therefore exact, because it is right exact as every left adjoint. This functor f ∗ preserves the monomor- phisms, and the final elements of the slices categories. Then it induces a map between the ′ sets of subsets, called the inverse image or pullback by f, for any object X ∈EU ′ :

f ∗ : Sub(X′) → Sub(f ∗X′). (2.12)

′ ′ ∧ ∗ When X describe the Yoneda objects (ξ ) , this gives the morphism Ωα : ΩU ′ → FαΩU . As it is shown in MacLane-Moerdijk [MLM92, p. 496], this map is a morphism of lattices, ′ ′ it preserves the ordering and the operations ∧ and ∨. If h : Y → X is a morphism in EU ′ , ∗ the reciprocal image h between the sets of subsets has a left adjoint ∃h and a right adjoint ∗ ∀h. The morphism f commutes with ∃h, but in general not with ∀h, for which there is only an inclusion: ∗ ′ ∗ ′ f (∀hP ) ≤ ∀f ∗h(f P ). (2.13) To have an equality, the morphism f must be geometric and open. Which is equivalent to the existence of a left adjoint, in the sense of posets morphisms, for Ωα, cf. [MLM92, Theorem 3, p. 498]. In McL-M this natural transformation Ωα is noted λα, and its left adjoint when it exists is noted µα.

When this left adjoint in the sense of Heyting algebras exists, we have, by adjunction, the co-unit and unit morphisms:

µ ◦ λ ≤ Id : ΩU ′ → ΩU ′ ; (2.14) ∗ ∗ λ ◦ µ ≥ Id : F ΩU → F ΩU . (2.15)

If f is geometric and open, the map f ∗ also commutes with the negation ¬ and the (internal) implication ⇒. If openness fails, only implications hold, as for the universal quantifier.

′ Remark. When FU ′ and FU are posets of open sets of (sober) topological spaces X and X , ′ ∗ and Fα is given by a continuous map ϕ : X → X , the functor Fα is geometric, and it is open if and only if ϕ is open in the topological sense. This extends to locale, cf. McL-M [MLM92].

∗ When Fα is geometric and open, it transports the predicate calculus of formal theories from EU ′ to EU , as exposed in the book of Mac Lane and Moerdijk, ”Sheaves and Geome- try in Logic. A first introduction to Topos Theory”. This is expressed by the following result:

19 Proposition 2.2. Suppose that all the Fα are open morphisms of sites, then (i) the pullback Ωα commutes with all the operations of the predicate calculus; ′ (ii) any theory at a layer U , i.e. in EU ′ , can be read and translated in a deeper layer U, in EU , in particular at the output layers.

In the sequence we will be particularly interested by the case where all the FU are groupoids and the Fα are morphisms of groupoids, in this case, the algebras of subobjects SubE (X) are ∗ boolean, then, in this case, the following lemma will imply that, as soon as Fα is geometric, it is open:

Lemma 2.1. In the boolean case the morphism of lattices f ∗ : Sub(X′) → Sub(f ∗X′) is a morphism of algebras which commutes with the universal quantifiers ∀h. Proof. Since f ∗ is right and left exact, it sends 0 = ⊥ to 0 = ⊥ and X′ = ⊤ to X = ⊤. ′ ∗ ′ ′ ∗ ′ ∗ Therefore, for every A ∈ SubE′ (X ), f (X \ A ) = X \ f (A ), i.e. f commutes with the negation ¬. This negation establishes a duality between ∃ and ∀, then f ∗ commutes with the universal quantifier. More precisely:

f ∗(¬(∀x′,P ′(x′))) = f ∗(∃a′, ¬P ′(a′)) = ∃a, f ∗(¬P ′)(a)= ¬[∀xf ∗(P ′)(x)), (2.16) then by commutation with ¬, and ¬¬ = Id, we have

f ∗(∀x′,P ′(x′)) = ∀xf ∗(P ′)(x)). (2.17)



On the other side, it is important that a theory in the fiber over U can be understood over U ′. ∗ Hopefully, this can always be done, at least in part: the functor Fα is left exact and has α a right adjoint F∗ : EU ′ →EU , which can be described as a right (cf SGA4): ′ α ′ for a pre-sheaf A over FU ′ , the value of the presheaf F∗ (AU ′ ) at ξU ∈ FU is the limit of ′ ′ ′ AU ′ over the slice category Fα|ξU , whose objects are the pairs (η ,ϕ) where η ∈ FU ′ and ′ ′ ′ ϕ : Fα(η ) → ξU is a morphism in FU , and whose morphisms from (η ,ϕ) to (ζ ,φ) are the ′ ′ morphisms u : η → ζ such that ϕ = φ ◦ Fα(u). Therefore, if we denote by ρ the forgetting functor from Fα|ξU to FU ′ , we have

α ′ 0 ∗ ′ F∗ (A )(ξU )= H (Fα|ξU ; ρ A ), (2.18) is the set of sections of the pre-sheaf ρ∗A′ over the slice category.

Remark. in the case where Fα : FU ′ → FU is a morphism of groupoids, this set is the set ′ −1 of sections of A over the connected components of Fα (ξU ).

α Therefore the functor g = F∗ is always geometric. Consequently, as shown in [MLM92], the pullback of sub-objects defines a natural transformation of pre-sheaves over FU ′ :

′ α λα : ΩU → F∗ ΩU ′ ; (2.19)

∗ α which corresponds by the adjunction of functors Fα ⊣ F∗ , to a natural transformation of sheaves over FU : ′ ∗ τα : FαΩU → ΩU ′ . (2.20)

20 Lemma 2.2. If Fα is a fibration (here not necessarily in groupoids), it is an open morphism α of sites, and the functor F∗ is open. Cf. the results of the article of Giraud [Gir72]. Proof. This results directly from [MLM92, Proposition 1, pp. 509-513]. Precisely this propo- sition tells that a morphism of sites F : F ′ → F induces an open geometric morphism ′ ′ F∗ : Sh(F ,J ) → Sh(F,J) between the categories of sheaves, as soon as the following three conditions are satisfied: (i) F has the property of lifting of the coverings:

∀ξ′ ∈F ′, ∀S ∈ J(F (ξ′)), ∃T ′ ∈ J ′(ξ′), F (T ′) ⊆ S; (2.21)

where F (T ′) is the sieve generated by the images of the arrows in T ′;

(ii) F preserves the covers, i.e.

∀ξ′ ∈F ′, ∀S′ ∈ J ′(ξ′), F (S′) ∈ J((F (ξ′)); (2.22)

(iii) for every ξ′ ∈ F ′, the sliced morphism F |ξ′ : F ′|ξ′ → F|F (ξ′) is surjective on the objects. The two first conditions are true for the canonical topology of a stack (cf. [Gir72]). They are evident in our case of presheaves. And the condition (iii) is part of the definition of fibration (pre-fibration).  If in addition F itself is surjective on the objects, as it will be the case in our applica- ∗ ∗ tions, the maps of algebras gX : Sub(X) → Sub(f X) are injective and the geometric open morphism g = F∗ is surjective on the objects. Cf. McL-M page 513.

∗ Lemma 2.3. When Fα is a fibration, the relation between λα = Ωα : ΩU ′ → FαΩU and ′ α λα : ΩU → F∗ ΩU ′ , is given by the adjunction of posets morphisms:

′ Ωα ⊣ τα; (2.23) ′ ∗ ′ where τα : FαΩU → ΩU ′ is the dual of λα. ′ ′ The morphism Ωα is the left adjoint of the morphism τα. More precisely τα is an injective section of the surjective morphism Ωα.

∗ Proof. If Fα is a fibration, FαΩU is isomorphic to ΩU , it is the sub-algebra of ΩU ′ formed by ∗ the sub-objects of 1U ′ that are invariant by Fα, i.e. by λα : ΩU ′ → Fα ΩU . ′ The map τα associates to an element P of ΩU the element P ◦ Fα, seen as a sub-sheaf ′ of 1U ′ , that is an element of ΩU ′ saturated by Fα. Therefore, for every P ∈ ΩU ′ , the ′ ′ ′ ′ element τα ◦ λα(P ) of ΩU ′ is the saturation of P , then it contains P . This gives a natural transformation ′ η : IdΩU′ → τα ◦ Ωα. (2.24)

′ ′ ′ ∗ In the other direction, τα is a section over ΩU of the map λα, i.e. Ωα ◦ τα = IdFαΩU . Which gives a natural transformation ′ ∗ ǫ : Ωα ◦ τα → IdFαΩU . (2.25) In the following lines, we forget the indices α everywhere, and show that η and ǫ are respec- tively the unit and co-unit of an adjunction of posets morphisms. ′ ′ Let P and Q, be respectively elements of ΩU ′ and ΩU , if we have a morphism from λP to

21 Q, by applying τ ′, we obtain a morphism from τ ′ ◦ λP ′ to τ ′Q, then a morphism from P ′ to τ ′Q. All that is equivalent to the following implications:

(λP ′ ≤ Q) ⇛ (P ′ ≤ τ ′λP ′ ≤ τ ′Q). (2.26)

In the other direction, (P ′ ≤ τ ′Q) ⇛ (λP ′ ≤ λτ ′P ′ ≤ Q). (2.27) Therefore (P ′ ≤ τ ′Q) ⇚⇛ (λP ′ ≤ Q). (2.28) Which is the statement of the lemma 2.3. 

Morality: from the above lemmas, we conclude that when Fα is a fibration, the logical ′ ′ formulas and their truth in the topos propagate from U to U by λα (feedback propagation in the DNN), and if in addition Fα is a morphism of groupoids, the logic in the topos also ′ propagates from U to U, by λα (feedforward functioning in the DNN). The logic is always richer in U ′ than in U, like a fibration of Heyting algebras of sub-objects of objects.

To finish this section, let us describe the relation between the classifier ΩF and the classifier ΩC of the basis category C of the fibration π : F → C. As remained above, the proposition 2.1 in Giraud 1971, establishes that the functor π∗ is geometric. And the above lemma 2 tells that the functor π∗, which is its right adjoint ′ is geometric and open. We can apply the above lemma 3, and get an adjunction λπ ⊣ τπ; where ∗ λπ : ΩF → π ΩC, (2.29) is a surjective morphism of lattices, and

′ ∗ τπ : π ΩC → ΩF , (2.30) is the section by invariant objects. ∗ When π is fibration of groupoids, π is open, and λπ is a morphism of Heyting algebras. In this case, there exists a perfect lifting of the theories in C to the theories in F.

The above logic in the stack F over C is studied in more general and canonical toposic terms by Olivia Caramello and Riccardo Zanfa, ”Relative topos theory in a stack”, soon on Arxiv; see the available notes for Topos Online, 24-30 june 2021.

2.3 Theories, interpretation, inferences and deductions

References: Bell [Bel08], Lambek and Scott [LS81], [LS88] , MacLane and Moerdijk [MLM92].

The formal languages we are considering are the typed languages of type theory, in the sense of Lambek and Scott [LS81]. In particular, in such a type theory we have a notion of deduction, conditioned by a set S of propositions, named axioms, which is denoted by ⊢S. That is a relation between two propositions, P ⊢S Q, which satisfies the usual axioms, structural, logical, and set theoretical, also named rules of inference, of the form

(P1 ⊢S Q1,P2 ⊢S Q2, ..., Pn ⊢S Qn)/P ⊢S Q, (2.31)

22 meaning that the truth (or validity) of the left (said upper) conjunction of deductions implies the truth of the right deduction (said lower). The conditional validity of a proposition R is noted ⊢S R. A (valid) proof of ⊢S R is an oriented classical graph without oriented cycles, whose vertices are labelled by valid inferences, and whose oriented edges are identifying one of the upper terms of its final extremity to the lower term of its initial extremity, and having only one final vertex whose lower term is ⊢S R. The initial vertices have left terms that are empty or belonging to the set S. A theory T in a formal language L is the set of propositions that can be asserted to be true if certain axioms are assumed to be true, this means that these propositions are deduced by valid proofs from the axioms.

A language L is interpreted in a topos E when certain objects of E are associated to every type, the object ΩE corresponding to the logical type ΩL, when certain arrows A → B are associated to the variables (or terms) of B in the context A, all that being compatible with the respective definitions of products, sub-sets, exponentials, singleton, changes of contexts (substitutions), and logical rules, including the predicate calculus, which includes the two projections (existential and universal) on the side of topos, cf. Bell, or Lambek and Scott. A theory T is represented in E when all its axioms are true in E. The fact that all the deductions are valid in E is the statement of the soundness theorem of T in E.

Remark. The completeness theorem tells that, for any language and any theory, there exists a minimal ”elementary topos” ET, which in general is not a Grothendieck topos, where the converse of the soundess theorem is true; validity in ET implies validity in T. The different interpretations in a topos E of a theory T form a category M(T, E), which is equivalent to the category of ”logical functors” from ET to E. This equivalence needs precisions given by Lambek and Scott, in particular to fix representant of subobjects, which is automatic in Grothendieck topos. For us, an interpretation of a type theory in a topos constitutes a semantic of this theory.

If a formal language L can be interpreted in a topos E, and if F : E →F is a left exact functor from E to a topos F, the interpretation is transferred to F. The condition for trans- porting any theory T by f is that it admits a right adjoint f : F →E which is geometric and open. A geometric functor allows the transportation of the restricted family of geometric theories, cf. Caramello [Car09], MacLane and Moerdijk [MLM92].

′ Remark. If T is a geometric theory, there is a Grothendieck topos ET which classifies the interpretations of T, i.e. for every Grothendieck topos E the category of geometric functors ′ from E to ET is equivalent to M(T, E). Cf. O.Caramello, L.Lafforgue, 2016, 2019. A logical functor is the left adjoint of a geometric functor. In many applications of DNNs, a network has to proceed to a semantic analysis of some data. Our aim now is to precise what this can mean, and how we, observers, can have access to the internal process of this analysis. As before, the network is presented by a dynamic object X in a topos, with learning object W , and this topos E is the classifying topos of a fibration π : F → C. In the applications, the logic is richer in U ′ than in U when there is a morphism α : U →

23 ′ U in C. We suppose given a family of typed language LU ; U ∈ C, interpreted in the topos EU ; U ∈ C of the corresponding layers. ∗ ∗ We say that the functors f = g = Fα propagate these languages backward, when for each morphism α : U → U ′ in C, there exists a natural transformation

L L ∗L α : U ′ → Fα U , (2.32) which extends Ωα = λα, implying that the types define objects or morphisms in E, in particular 0U , 1U . And we say that the left adjoint functor f ∗ propagate the languages feed-forward, when for each morphism α : U → U ′ in C, there exists a natural transformation

L′ L αL α : U → F∗ U ′ , (2.33)

′ ′ which extends λα, implying that the types define objects or morphisms in the fibration E , α defined by the right F∗ .

We write L for the corresponding pre-sheaf in theories over C, ΩL its logical type, and for each U ∈ C, we note ΩLU the value of this logical type at U. L For each U ∈ C, we write SU the set of set of axioms in U , that is SU = P(ΩLU ).

In the known applications, the richer logic relies on a richer language, but the contrary hap- pens to theories, they are more constrained in the deeper layers, with more axioms in general.

We take as output (resp. input) the union of the output (resp. output) layers. In supervised and reinforcement learning, we can tell that, for every input ξin ∈ Ξin in a set of inputs for learning, a theory Tout(ξ) in Lout is imposed at the output of the network., i.e. some propositions are asked to be true, other are asked to be false. The set of theories in the language Lout is denoted Θout. Then the objectives of the func- tioning is a map Tout :Ξin → Θout.

Definition. A semantic functioning of the dynamic object Xw of possible activities in the T w network, with respect to the mapping out, is a family of quotient sets DU of XU , U ∈ C, equipped with a map SU : DU → SU , such that for every ξin ∈ Ξin and every U ∈ C, the image SU (ξU ) generates a theory which is coherent with Tout(ξin), for the transport in both directions along any path.

In the examples we know (cf. [BBG20]), the quotient DU (from discretized cells) is given by the activity of some special neurons in the layer LU , which saturate at a finite number of values, associated to propositions in the Heyting algebras ΩLU . In this case, the definition of semantic functioning can be made more concrete: for each neuron a ∈ LU , each quantized value of activity ǫa implies the validity of a proposition Pa(ǫa) in ΩLU ; this defines the map SU . Then the definition of semantic functioning asks that, for each input ξin ∈ Ξin, the generated activity defines values ǫa(ξin) of the special neurons, such that the generated set of propositions Pa(ǫa), implies the validity of a certain proposition in ΩLout , which is valid for Tout(ξin). In particular, we saw experimentally that the inner layers understand the language Lout, ∗ ∗ which is an indication that the functors f = g = Fα propagate the languages backward. This gives a notion of semantic information of a given layer, or any sub-set E of neurons in the union of the sets DU : it is the set of propositions predicted to hold true in Tout(ξin)

24 by the activities in E. If all the involved sets are finite, the amount of information given by the set E can be defined as the ratio of the number of predicted propositions over the number of wanted decisions, and a mean of this ratio can be taken over the entries ξin.

Remark: the above notion of semantic functioning and semantic information can be extended to sets of global activities Ξ, singletons sections of Xw, more general that the ones used for learning.

In the above neural networks, there exist two kinds of functors Fα : FU ′ → FU over C, the ordinary ones, flowing from the input to the output, and the canonical projection from the fiber at a fork A to the fibers of their tines a′, a”, .... The second kind of functors are canonically fibrations, for the other functors, this is a condition wee can ask for a good semantic functioning.

Our experiments in [BBG20] have shown that the number of hidden layers, or the com- plexity of the architecture, strongly influences the nature of the semantic functioning. This implies that the semantic functioning, then the corresponding semantic information, depend on the characteristics of the dynamic Xw, for instance the non-linearities for saturation and quantization, and of the characteristics of the learning, the influence of the non-linearities of the gradient of back-propagation on the optimal weights w ∈ W . Therefore, it appears a notion of semantic learning, which is a flow of natural transformations between dynamic objects Xwt , augmenting the semantic information. In the mentioned experiments, the semantic behavior appears only for sufficiently deep networks, and non-linear activities.

2.4 The model category of a DNN and its Martin-L¨of type theory

′ ′ Consider two fibrations (FU , Fα) and (FU , Fα) over C; a morphism ϕ from the first to the ′ ′ second is given by a collection of functors ϕU : FU →FU such that for any arrow α : U → U ′ of C, ϕU ◦ Fα = Fα ◦ ϕU ′ . With the fibrations in groupoids, this gives a category GrpC. Nat- ural transformation between two morphisms gives it a structure of strict 2-category. We consider this category fibred over CX. Remind that the Grothendieck topology on CX is chaotic. If we consider an equivalent site, with a non-trivial topology, homotopical con- straints appear for defining stacks, cf. Giraud 1972 [Gir72], Hollander 2001, 2007 [Hol08]. However the category of stacks (resp. stacks in groupoids) is equivalent to the category obtained from CX. Hofmann and Streicher, 1996, [HS98], have proved that the category Grp of groupoids gives rise to a Martin-L¨of type theory [ML80], by taking for types the fibrations in groupoids, for terms their sections, for substitutions the pullbacks, and they have defined non-trivial (non-extensional) identity types in this theory. Hollander 2001, 2007, using Giraud’s work and homotopy limits, constructed a Quillen model theory on the category of fibrations (resp. stacks) in groupoids over any site C, where the fibrant objects are the stacks, the cofibrant objects are generators, and the weak equiva- lences are the homotopy equivalence in the fibers. Cf. also Joyal-Tierney, and Jardine cited in Hollander [Hol08]. These results were extended to the category of general stacks, not only in groupoids, over a site by Stanculescu 2014 [Sta14].

25 Awodey and Warren 2007 [AW09] observed that the construction of Hofmann-Streicher is based on the closed model category structure in the sense of Quillen on Grp, and proposed an extension of the construction to more general model categories. Thus they established the connection between Quillen’s models and Martin-L¨of intensional theories, which was soon extended to a connection between more elaborate Quillen’s models and Voedvosky univalent theory. Arndt and Kapulkin, in Homotopy Theoretic Models of Type Theory 2012, cf. [AK11], have posed additional axioms on a closed model theory that are sufficient for deducing formally a Martin-L¨of theory. This was extended later by Kapulkin and Lumsdaine 2018 [KLV12], to obtain models of Voedvosky theory, by using more simplicial techniques. Here, we will follow their approach, without going to the special properties of HoTT, that are Functions extensionality, Univalence axiom and Higher inductive type formations.

In what follows, we focus on the model structure of groupoids and stacks in groupoids, which are the most useful for our applications. However, many things work also with Cat in place of Grp, and some other model categories M. The complication is due to the difference between fibrations (resp. stacks) in the sense of Giraud and Grothendieck and the fibrations in the sense of Quillen’s models, that is not present with groupoids. For Cat, there exists a unique closed model structure, defined by Joyal and Tierney, such that the weak equiv- alences are the equivalence of categories (cf. Schommers-Pries in [SP12]). It is named for this reason the canonical model structure on Cat; in this structure, the cofibrations are the functors injective on objects and the fibrations are the iso-fibrations. An isofibration is a functor F : A → B, such that every isomorphism of B can be lifted to an isomorphism of A. Any fibration of category is an iso-fibration, but the converse is true only for groupoids. A different model theory was defined by Thomason 1980, [Tho80]. It is better understandable in terms of ∞-groupoids and ∞-categories.

The axioms of Quillen [Qui67] concern three subsets of morphisms in a category M, supposed to be (at least finitely) complete and co-complete, the set F ib of fibrations, the set Cofib of co-fibrations and the set W E of weak equivalences. An object A of M is said fibrant (resp. cofibrant) if A → 1, the final object (resp. ∅→ A from the initial object) is a fibration (resp. a cofibration).

Definitions. Two morphisms i : A → B and p : C → D in a category are said orthogonal, written (non-traditionally) i ⋌ p, if for any pair of morphisms u : A → C and v : B → D, such that p ◦ u = v ◦ i, there exists a morphism j : B → C such that j ◦ i = u and p ◦ j = v. The morphism j is named a lifting, left lifting of i and a right lifting of p. Two sets L and R are said be the orthogonal one of each other if i ∈ L is equivalent to ∀p ∈ R, i ⋌ p and p ∈ R is equivalent to ∀i ∈L, i ⋌ p. The three axioms of Quillen for a closed model category M of models are: (1) given two morphisms f : A → B, g : B → C, define h = g ◦ f; if two of the morphisms f,g,h belong to WE, then the third one belongs to WE; (2) every morphism f is a composition f = p ◦ i of an element p of Fib and an element i of Cofib ∩ WE, and a composition p′ ◦ i′ of an element p′ of Fib ∩ WE and an element i′ of Cofib; (3) the sets Fib and Cofib∩WE are the orthogonal one of each other and the sets Fib∩WE and Cofib also.

26 An element of Fib ∩ WE is named a trivial fibration, and element of Cofib ∩ WE is named a trivial cofibration.

These axioms (and some more general) allowed Quillen to develop a convenient homotopy theory in M, and to define a homotopy category HoM. Cf. his book, Homotopical Algebra, 1967, [Qui67]. The objects of HoM are the fibrant and cofibrant objects of M, and its morphisms are the homotopy classes of morphisms in M; two morphisms f,g from A to B are homotopic if there exists an object A′, equipped with a weak equivalence σ : A′ → A and ′ ′ two morhisms i0, i1 from A to A such that σ ◦ i0 = σ ◦ i1, and a morphism h : A → B, such that h◦i0 = f and h◦i1 = g. In the category HoM, the weak equivalences of M are inverted.

A particular example is the category of sets with surjections as fibrations, injections as cofibrations and all maps as equivalences. Another trivial structure, which exists for any category is no restriction for Fib and Cofib but isomorphisms for WE. As we already said, an important example is the category of groupoids Grp, with the usual fibrations in groupoids, with all the functors injective on the objects as cofibrations, and the usual homotopy equivalence (i.e. here category equivalence) as weak equivalences. We also mentioned the canonical structure on Cat, that is the only one where weak homo- topy corresponds to the usual equivalence of category. Other fundamental examples are the topological spaces Top and the simplicial sets SSet = ∆∧, with Serre and Kan fibrations for F ib respectively. Thomason 1980 showed that the category Cat has a closed model category that is deduced by the above structure on SSet, using the nerve construction and the square of the right adjoint functor f the barycentric subdivision. In this structure the weak equivalences are not reduced to the category equivalences; and the co-fibrant objects are constrained (Cisinski [Cis06]); this theory is weakly equivalent to the Kan structure on SSet. In this structure, a category is considered through its weak homotopy type (the weak homotopy type of its nerve).

We now make appeal to a general result of Lurie ’s book of 2009, [Lur09], appendix A.2.8, prop. A.2.8.2, which establishes the existence of two canonical closed model structures on op the category of functors MC = F un(C , M) when M is a model category. (Caution, Lurie consider diagrams, i.e. C and not Cop.) An additional hypothesis is made on M, that it is combinatorial in the sense of Smith (cf. Rosicky in [Ros09]), i.e. locally presentable (i.e. ac- cessible by a regular cardinal), and generated by co-fibrant objects, which are both satisfied by Grp and by Cat. Moreover M is supposed to have all small limits and small colimits, which is also the case for Grp (or Cat); as Set, both are cartesian closed categories; every object is fibrant and co-fibrant. The two Lurie structures are respectively obtained by defining the sets Fib or Cofib in the fiberwise manner, as for the set WE, and by taking respectively the set Cofib or Fib of morphism which satisfy the required lifting properties, respectively on the left and on the right, i.e. the orthogonality of Quillen. The structure obtained by fixing Fib (resp. Cofib) by the behavior in the fibers, is named the projective structure, or right one (resp. the injective one, or left one). Caution: depending on the authors, the term right and left can be exchanged.

The model structure of Hollander on GrpC (or Stanculescu for CatC) is the right Lurie model. She called this model a left model. A model category is said right proper when the pullback of any weak equivalence along an element of F ib is again a weak equivalence. Dually, left proper is when pus-forward of

27 weak equivalence along co-fibrations is again in W E. In the right proper case, the injective (left) structure of Lurie was defined before by D-C. Cisinski in ”Images directes cohomologiques dans les cat´egories de mod`eles”, 2003, [Cis03].

The cofibrations in the right model (resp. the fibrations in the left model) depend on the category C. They certainly deserve to be better understood. Cf. the discussion of Cisinski, section 2.3.10 in his book Higher Categories and Homotopical Algebra, [Cis19].

Proposition 2.3. If C has sufficiently many points, the elements of F ib for the left Lurie structure are fibrations in the fibers (i.e. elements of F ib for the right structure) and the ele- ments of Cofib for the right structure are injective on the objects in the fibers (i.e. elements of Cofib for the left structure. Proof. Suppose that a morphism ϕ is right orthogonal to any trivial cofibration ψ of the left Lurie structure; for every point x in C, this gives an orthogonality in the model Grp, then over x, ϕx induces a fibration in groupoids. From the hypothesis, this implies that in every fiber over C, ϕ is a fibration, then an element of Fib for the right Lurie structure. The other case is analog.  However in general, even if C is a poset, not all fibrations in the fibers are in F ib for the left model structure, and not all the injective in fibers are in Cofib for the right model. This was apparent in Hollander [Hol01].

Trying to determine the obstruction for a local fibration (resp. local co-fibration) to be orthogonal to functors that are locally injective on the objects (resp. local fibrations) and locally homotopy equivalence, we see that the intuitionistic structure of ΩC enters the play, through the global constraints on the complement of pre-sheaves:

Lemma 2.4. The category C being the oriented segment 1 → 0 and the category M being Set (then MC is the topos of the Shadoks); in the left Lurie model the fibrant objects are the (non-empty) surjective maps f : F0 → F1. Proof. A trivial cofibration is a natural transformation

′ ′ ′ η :(h : H0 → H1) → (h : H0 → H1); (2.34) such that η0 and η1 are injective. Suppose given a natural transformation u = (u0,u1) from h to f : F0 → F1; the lifting ′ ′ problem is the prolongment of u to u from h to f. If H1 is empty, there is no problem. ′ ′ If not, we choose a point ∗0 in H0 and note ∗1 = h(∗0). If x1 ∈ H1 does’nt belong to H1 ′ ′ ′ ′ ′ ′ ′ ′ we define u1(x1)= u1(∗1), and for any x0 such that h (x0)= x1, we define u0(x0)= u0(∗0). ′ ′ Now the problem comes with the points x”0 in H0\H0 such that h (x”0) ∈ H1 (a shadok ′ ′ with an egg); their image by u1 is defined, then u1(h (x”0)) is forced to be in the image of F0 by f. If f is not surjective there exists η such that the lifting is impossible. But, if f ′ is surjective there is no obstruction: we define u0(x”0) to be any point y0 in F0 such that ′ f(y0)= u1(h (x”0)) in F1. 

Lemma 2.5. Also M = Set, but C being the (confluence) category V with three objects 0, 1, 2 and two non-trivial arrows 1 → 0 and 2 → 0. In the left Lurie model, the fibrant objects are the pairs (f1 : F0 → F1, f2 : F0 → F2), such that the product map (f1, f2) is surjective.

28 Proof. Following the path of the preceding proof, with an injective transformation η from ′ ′ ′ ′ a triple H0,H1,H2 to a triple H0,H1,H2, we are in trouble with the elements x”0 ∈ H0 ′ ′ that h1 or h2 sends into H1 or H2 respectively. Under the hypothesis of bi-surjectivity, we ′ know where to define u0(x”0). But if this hypothesis is not satisfied, impossibility happen in general for η. 

Lemma 2.6. Also M = Set, but C being the (divergence) category W with three objects 0, 1, 2 and two non-trivial arrows 0 → 1 and 0 → 2. In the left Lurie model, the fibrant objects are the pairs (f1 : F1 → F0, f2 : F2 → F0), such that separately f1 and f2 are surjective. Proof. following the path of the preceding proof, with an injective transformation η from a ′ ′ ′ ′ triple H0,H1,H2 to a triple H0,H1,H2, we are in trouble with the elements x”1 ∈ H1 (resp. ′ ′ ′ x”2 ∈ H2) that h1 (resp. h2) sends into H0. As in the proff of the lemma 1, the problem is solved under the hypothesis of surjectivity, but it cannot be solved without it.  More generally, we can determine the fibrant objects of the left Lurie model (injective) for every closed model category M, and a finite poset C which has the structure of a DNN, coming with a graph, with unique directed paths:

Theorem 2.1. When C is the poset of a DNN, for any combinatorial category of model, the fibrations of MC for the injective (left) model structure are made by the natural trans- formations F→F ′ between functors in C to M, that induce fibrations in M at each object of C, such that the functor F is also a fibration in M along each arrow of C coming from an internal of minimal vertex (ordinary vertex, output or tip), and a fibration along each of the arrows issued from a minimal vertex (output and tip), and a multi-fibration at each confluence point (cf. lemma 2), in particular at the maximal vertices (input or tank).

By multi-fibration fi, i ∈ I from an object FA of M to a family of objects Fi, i ∈ I of M, we mean a fibration (element of Fib) from FA to the product Qi∈I Fi.

Proof. We proceed by recurrence on the number of vertices. For an isolated vertex, this is the definition of fibration in M. Then consider an initial vertex (tank or input) A with ∗ incoming arrows si : i → A for i ∈ I in the graph poset C, and note C the category with ′ the star A, si deleted. A trivial cofibration in MC is a natural transformation η; H → H between contravariant functors in C→M, which is at each vertex injective on objects and ′ an element of WE. Let us consider a morphism (u,u ) in in MC from η to a morphism ′ ϕ : F→F , where F belongs to MC. Suppose that ϕ satisfies the hypotheses of the theorem. From the recurrence hypothesis, there exists a lifting θ∗ : (H′)∗ → F ∗ between the restrictions of the functors to C∗; it is in ′ particular defined on the objects Hi, i ∈ I to the objects Fi, i ∈ I. ′ Consider the functor from HA to the product Qi Fi, which is obtained by composing the ′ ′ ′ ′ horizontal arrows of η, from HA to the product H = Qi Hi with θ . The fact that FA → Qi Fi ′ is a multi-fibration in M and the fact that ηA : HA → HA is a trivial cofibration in M imply ′ the existence of a lifting θA : HA → FA, which is given on HA. Conversely, if the hypothesis of multi-fibration is not satisfied, there exists elements ′ ′ ηA : HA → HA in Cofib∩WE of M, such that the lifting θA of HA to FA does’nt exist, by the axiom (3) of closed models. To finish the proof, we note that the necessity to be a fibration at each vertex in C is given by proposition 2.3. 

29 Corollary. Under the same hypotheses, the fibrant objects of MC for the injective (left) model structure are made by the functors that are a fibration in M at each internal of minimal vertex (ordinary vertex, output or tip), and a fibrant object at the minimal (output and tip), and a multi-fibration at each confluence point (cf. lemma 2.6), in particular at the maximal vertices (input or tank). One interest of this result is that it will describe the allowed contexts in the associated Martin-L¨of theory when it exists, as we will see just below.

Another interest is for the behavior of the classifying object ΩF : in the case of GrpdC the fibrant objects are all good for the induction theory in logic over the network, cf. above sections 1.2, 1.3. In the case of CatC, with the canonical structure on Cat, it is not the case, only a geometric subclass of fibrant objects are good, they are made by composition of Giraud-Grothendieck fibrations, cf. below.

In GrpC the final object 1 (resp. the initial object ∅) is the constant functor on C with values a singleton, (resp. the empty set). It follows that any object is co-fibrant.

The additional axioms of Arndt and Kapulkin for a Logical Model Theory are as follows: (1) for any element f ∈ F ib, f : B → A, the pullback functor f ∗ : M|A → M|B, once restricted to the fibrations, possesses a right-adjoint, denoted Πf . (2) The pullback of a trivial cofibration, i.e. an element of Cofib ∩ W E, along an element of F ib is again a trivial cofibration. Remark. In Arndt and Kapulkin, the first axiom is written without the restriction of the adjunction to fibrations, however they remark later ([AK11, section 4.1, acknowledging an anonymous reviewer]) that this restricted axiom is sufficient for the application below. The second axiom is satisfied if separately Cofib and W E are stable by pullback along a fibration. A model category satisfying the second property, for W E, is called right proper.

When every object in M is fibrant (resp. cofibrant) the theory is right (resp. left) proper (Rezk, 2002). This is the case for Grp (and Cat). And Lurie proved that his two model structures on diagrams (or phe-sheaves) are right (reps. left) proper as soon as M is so. Then in our case, all the considered models are right proper and left proper. This was shown by Hollander 2001 for GrpC.

The injectivity on objects in the fibers and the equivalence of category in the fibers are preserved by every pullback, thus the condition (2) is verified for the left injective structure. This is the structure we choose. What happens for the right structure? Arndt and Kapulkin noticed the example of the injective structure, cf. [AK11, Prop. 27, p.12], and its Bousfield-Kan localizations; this gives in particular the injective model struc- tures for the category of stacks over any site, cf. Hirschhorn, Localization of Model Categories [Hir03].

The existence of a right adjoint and a left adjoint of the pullback of fibrations in cate- gories, as it holds for pre-sheaves of sets, was proved by Giraud in 1964, cf. [Gir72, section I.2.]. Then, by the proposition 2.3, for M = Grp, both left and right structures satisfy the con- dition (1). For M = Cat this is true only if f is a fibration in the geometric sense, not only

30 an isofibration. What happens for other models categories M?

As noticed by Arndt and Kapulkin, the left adjoint of f ∗ : M|A → M|B always exists, it is written Σf , and the right properness implies that it respects W E.

If M satisfies the axioms (1) and (2), Arndt and Kapulkin generalized the constructions of Seely 1983, Hofmann and Streicher, Awodey and Warren, to define a M-L theory:

A context is a fibration Γ → C, that is a fibrant object. A type A in this context is a fibration A → Γ. The declaration (judgment) of a type is written Γ ⊢ A. A term a : A is a section Γ →A. It is noted Γ ⊢ a : A. ∗ A substitution x/a is given by a change of base F for a functor F : ∆ → Γ in MC, not necessarily a fibration. ∗ The adjoint functor Σf and Πf of f , allows to define new types of objects: given Γ and f : A→ Γ, and g : B→A, we get Σf (g) : Σx:AB(x) → Γ and Πf (g) : Πx:AB(x) → Γ. They respectively replace the union over A and the product over A. On the types, logical operations are applied, A ∧B, A ∨B, A ⇒ B, ⊥ is empty, ∃x, B(x), ∀x, B(x). The rules for these operations satisfy the usual axioms. More types, like the integers or the real numbers or the well ordering can be added, with specific rules.

As remarked by Arndt and Kapulkin, it is not necessary to have a full closed model theory to get a M − L type theory, cf. [AK11, remarks pages 12, 15]. They noticed that M −L type theories are probably associated to fibration-categories (or categories with fibrant objects) in the sense of Brown 1973. Cf. Uemura 2016. In these categories, cofibrations are not considered, however a nice homotopy theory can be developed. We have the following result concerning the weak factorization system made by co- fibrations and trivial fibrations in the canonical model Cat.

Lemma 2.7. A canonical trivial fibration in Cat is a geometric fibration.

Proof. Consider an isofibration f : A → B that is also an equivalence of category. Take a ∈ A and f(a) = b ∈ B and a morphism ϕ : b′ → b of B; because f is surjective on the objects, there exists a′ ∈ A such that f(a”) = b′, and because f is an equivalence the map from Hom(a′, a) to Hom(b”, b) is a bijection, then there exists a unique morphism ψ : a′ → a such that f(ψ)= ϕ. In the same manner, every morphism b” → b′ has a unique lift a” → a′, and conversely any morphism ψ′ : a” → a′ defines a composed morphism χ : a” → a and a morphism image ϕ′ : b” → b′ that define the same morphism ϕ ◦ ϕ” from b” to b. As the morphisms from a” to a are identified by f with the morphisms from b” to b, this gives a natural bijection between the morphisms ψ′ from a” to a′ and the pairs (χ, ϕ′) in Hom(a”, a) × Hom(b”, b′) over the same element in Hom(b”, b). Therefore ψ is a strong cartesian morphism over ϕ.  The same proof shows that a canonical trivial fibration is a geometric op-fibration, that is by definition a fibration between the opposite categories.

In the case where C is the poset of a DNN and M is the category Cat, we say that a model fibration f : A → B, in MC is a geometric fibration if it is a Grothendieck-Giraud

31 fibration, and if all the iso-fibrations that constitute the fibrant object A are Grothendieck- Giraud fibrations. Cf. Theorem 2.1.

Theorem 2.2. Let C be a poset of DNN, there exists a canonical M − L structure where contexts and types correspond to the geometric fibrations in the 2-category of contra-variant functors CatC, and such that base change substitutions correspond to its 1-morphisms. Proof. We follow the lines of Arndt and Kapulkin 2012 establishing their theorem 26. The main point is to prove that if f : A → B is a geometric fibration in MC, the pull-back ∗ functor f : Cat|A → CatB, has a left adjoint f! = Σf and a right adjoint f∗ = Πf that both preserve the geometric fibrations. For the first case it is the stability of G-G fibrations by composition. For the second one, this is the Giraud thorem of bi-adjunction, cf. [Gir71].  There exist several equivalent interpretations of such a type theory, as for the intuition- istic theory of Bell, Lambek Scott et al. Cf. the text in 1980 of Martin-L¨of, Intuitionistic Type Theory, [ML80]. For instance the types are sets, the terms are elements, or a type is a proposition and a term is a proof, or a type is a problem (a task) and a term is a method for solving it. (For each interpretation, things are local over a context.) In particular, Identity types are admitted, representing equivalence of elements, proofs or methods that are not the strict equality, like homotopies, or invertible natural equivalences. The types of identities, as in Hofmann and Streicher, are fibrations IdA : IA → A×A equipped with a cofibration r : A→ IA (with a section) such that IdA ◦ r = ∆, the diagonal morphism. They are considered as paths spaces. For instance, given a groupoid A, IdA = {0 ↔ 1}⇒ A. Axioms of inference for the types are expressed by rules of formation, introduction and de- termination, specific for each type.

Let us compare to the semantic in a topos C∧: a context is an object Γ which is a pre-sheaf with values Set then a fibration in sets over C, a type is another object A; to get something over Γ we can consider the product Γ × A → Γ. A section corresponds to a morphism a :Γ → A, which is rightly a term of type A, Γ ⊢ a : A. A substitution corresponds to a morphism F : ∆ → Γ, and defines a pullback of trivial fibrations ∆ × A → ∆. If we have a morphism g : B → Γ × A in the topos, we can define its existential image ∃πg(B) and its universal image ∀πg(B) as subobjects of Γ, which can be seen as a trivial fibration over Γ. Therefore, we have analogs of the type theory M-L in Set theory, but with trivial fibra- tions only, and without fibrant restriction.

2.5 Classifying the M-L theory ?

In what precedes the category Grp has replaced the category Set; it is also cartesian closed.

Also we have seen that all small limits and colimits exist in GrpC (Giraud, Hollander, Lurie). However every natural transformation between two functors with values groupoids is invert- ′ ible. Thus in the 2-category, the morphisms in HomGrp(G,G ) are like homotopies. In fact they become exactly like that when passing to the nerves.

32 Let us introduce the categories of pre-sheaves on every fibration in groupoids A → C, i.e. the classifying topos EA of the stack A. Their objects are fibered in groupoids over C, because the fibers EU for U ∈ C are such (they take their values in IsoSet), but their morphisms, the natural transformations between functors, are taken in the sense of sets, not invertible.

In what follows we combine the type theory of topos with the groupoidal M −L type theory. We propose new types, associated to every object XA in every EA.

The fibration A→ Γ itself can be identified with the final object 1A ∈EA in the context Γ.

Sections of A → Γ are particular cases of objects. For the terms in an object XA, we take any natural transformation from the object S corresponding to a section Γ → A to the object XA in EA. A simple section is a term to 1A, the final object, which is a usual M-L type.

Due to the adjunction for the topos of pre-sheaves, the construction Σ and Π extend to the new types.

Now a classifier of sub-objects ΩA is disponible for any M-L type A. ∗ We define relative sub-objects using the correspondence λπ : ΩAπ ΩΓ.

33 Chapter 3

Dynamics and homology

3.1 Ordinary cat’s manifolds

Certain limits in the sense of category theory of the dynamical object Xw of C∼ describe the sets of activities in the DNN which correspond to certain decisions taken by its output (the so called cat’s manifolds in the folklore of Deep Learning). Here we consider the case of supervised learning or the case of reinforcement learning, because the success or the failure of an action integrating the output of the network is also a kind of metric. For instance, consider a proposition Pout about the world which depends on the final states ξout. It can be seen as a function P on the product XB = Qb Xb of the spaces of states over the output layers to the boolean field ΩSet = {0, 1}, taking the value 1 if the proposition is true, 0 if not. Our aim is to understand better the involvement of the full network in this decision; it is caused by the input data in a deterministic manner, but it results from the chosen weights and from the full functioning of the DNN. One of the many ways to express the situation in terms of category is to enlarge C (or Γ) by several terminal layers: 1) a layer B∗ which makes the product of the output layers, as we have done with forks, followed by the layer B (this can be replaced by B only projecting to the Xb, b ∈ xout);

2) a layer ωb with one cell and two states in a set Ωb, as in ΩSet, with one arrow from ωb to B, for translating the proposition P , followed by a last layer ω1, with one arrow

ωb → ω1, the state’s space Xω1 being a singleton ∗1, and the map ∗1 → Ωb sending the singleton to 1. This gives a category C+ enlarging C by a fork with handle B ← ωb → w w op ω1, and a unique extension X+ , depending on P , of the functor X from C to Set in a presheaf over C+.

w The space of sections singletons of X+ is identified naturally with the space of sections of w w X such that the output satisfies Pout, i.e. the subset of the product of all the X (a) when a describes C made by the coherent activities giving the assertion ”P is true” at the output. In this picture, we also can consider that P is the weight over the arrow B ← ωb, and note w,P w X+ the extension of X . In other terms, the subset of activities of X which affirm the proposition Pout is given by a op value of the right Kan extension of X+ along the unique functor p+ : C+ →∗:

op w M(Pout)(X)= RKanC+ (X+)= lima∈C+ X+ (a) : (3.1)

In the folklore of AI, the set M(Pout)(X) is named a cat’s manifold, alluding to the case where the network has to decide if yes or no a cat is present in the image. M(Pout)(X) can

34 be identified with a subset of the product Xin of the input layers. It has to be compared with the assertion ”P is true” made by an observer, then studied in function of the weights W of the dynamics. However, in general, M(Pout)(X) cannot be identified with a product of subsets in the Xa, for a ∈ C, it is a global invariant. In fact, it is a particular case of a set of co-homology:

0 M(Pout)(X)= H (C+; X+). (3.2)

w If the proposition Pout is always true, M coincides with is the set of section of X = X , which can be identified with the product of the entry layers activities:

0 ∼ Γ(X)= H (C; X) = Y Xa (3.3) a∈xin

The construction of C+ and the extension of X by X+ can be seen as a conditioning. The map X+(ωb → B) is equivalent to a proposition, the characteristic map of a subset of XB. In this case we have 0 0 H (C+; X+) ⊂ H (C; X). (3.4)

In the same manner, we define the manifold of a theory Tout expressed in a typed lan- guage Lout in the output layers, by replacing the above objects ωb, ω1, and the pre-sheaf X+(P ) over them, by larger sets and X+(T), as the set of sections of X+(T) over the whole C+.

We will revisit the notion of cat’s manifold when considering the homotopy version of semantic information.

3.2 Dynamics with spontaneous activity

In our approach of networks functioning, the feed-forward dynamic coincides with the limit set H0(X). The coincidence with the traditional notion of propagation from the input to the output relies on the particular choice of morphisms at the centers of forks (named tanks), product on one side and isomorphism on the other. But this can be generalized to other morphisms: the only condition being that the inner sources A and the input from the outer world I determine a unique section of the object Xw over C. In concrete terms, this happens if and only if the maps from A and I give coherent values at any tip of each fork. This tuning involves the values in entry ξ0 ∈ Ξ, the values of the inner sources σ0 ∈ Σ and the weights, in particular from a A to the a′, a”, .... Therefore it depends on the learning process.

Then a possibility for defining coherent dynamical objects with spontaneous activities, is to start with standards objects Xw, satisfying the restriction of products and isomorphisms, and to include algorithms constructing solutions of an Implicit Function Theorem in the Learning dynamics.

35 Another possibility, closer to the natural networks, and more readable, is to introduce new entries for each A and to maintain the form of product and isomorphisms, which guaranties the coherence condition. These spontaneous entries can be learned by back-propagation, as the weights, for minimizing a functional, or for realizing a task with success.

It is important to remark that in natural brains, even for very small animals, having hundred neurons, the part of spontaneous activity is much larger than the part due to the sensory inputs. This activity comes from internal rhythms, regulatory activities of the au- tonomous system, internal motivations more or less planed. The neural network transform them in actions or more general decisions. To make them efficient, corrections are necessary, due to reentrant architectures. However these networks in general do not learn by fully supervised methods, they depend on reinforcement, by success of actions, or unsupervised methods, involving maximization of mutual information quantities. This will require much further works to attain this degree of integration. But certainly experiments can be conducted in this direction, with simple networks as used in Logical Information Cells [BBG20], the experimental companion article.

3.3 Fibrations and cofibrations of languages and theo- ries

Taking in account the internal dimensions given by the stacks F over C, several levels of information emerge. Without closing the subject, they reflect different meaning of the word information.

A first level concerns the pertinent types, or objects, to introduce for understanding how the network performs a semantic task, in addition to the types coming from Lout, that are put at the hand by the observer, and guide the learning process, by back-propagation or reinforcement. A first conjecture, that we will not study in the present text, is that new objects appear in co-homological forms, as obstructions for integrating correctly the input data in the output theory. It is not excluded that this can appear spontaneously in the network, but more probably it requires the intervention of the observer, for changing the functional (the metric) or the data, which induces a variation of the weights. We will de- scribe below in section 3 examples of semantic groupoids which could generate or constrain these obstructions. More precisely, we expect that the new objects are vanishing cycles, in the sense of Grothendieck, Deligne, Laumon, for convenient maps of sites, localized in the fibers FU , at points (U, ξ). In some regions of the weights, the network should become able to develop a semantic func- tioning about the new objects, formalized by the languages LU ; U ∈ C similarly to what happens with singularities of functions or varieties, with imposed reality conditions. The analogy is made more precise in sections 3.4, 3.5 below.

A second level, perhaps not independent of the first, concerns the information contained in certain theories about other theories, or about decisions to take or actions to do, for ′ instance TU ′ in some layer, considered in relation to TU , when α : U → U , or Tout. As we saw, these theories in general depend on the given section ǫ of Xw. Moreover, we expect that the notion of information permits also to compare the theories made by different networks about a certain class of problems.

36 The semantic information that we want to precise must be attached to the communi- cation between layers and the communication between networks, and attached to certain problems to solve, cf. Ren´eThom, Mathematical Models of Morphogenesis, 1980, for a view of the necessary interaction in information. Cf. [Tho83] Some theories will be more informative than others, or more redundant, then we will be happy to attach quantitative notions of amount of information to the notion of semantic information. However, efficient numerical measures should also take care of the expression of theories by certain axioms. Some systems of axioms are more economical than others, or more redundant than others. Redundancy is more the matter of axioms, ambiguity is more the matter of theories. Probably, the notion of ambiguity will come first. In the Shannon information theory, [SW49], the fundamental quantity is the entropy, which is in fact a measure of the ambiguity of the expressed knowledge with respect to an individual fact, for instance a message. Only the difference of entropies can be un- derstood as an information in the common sense, for instance the mutual information I(X; Y )= H(X)+ H(Y ) − H(X,Y ).

Here the theories TU , U ∈ C are seen as possible models, analogs to the probabilistic models PX ,X ∈ B in Bayesian networks. The variables of the Bayesian network are analog to the layers of the neural networks; the values of the variables are analog of the states of the neurons of the layers. In some version of Bayes analysis, for instance presented by Pearl (ref. [Pea88]), the Bayes network is associated to a directed graph, but in some other versions it is not cf. Yedidia, Weiss et al. 2001 [YFW01]. In the case of the probabilistic models, Shannon theorems have revealed the importance of entropy and of mutual information. We have shown (Baudot and Bennequin, Entropy 2015 [BB15] and Juan-Pablo Vigneaux, appeared in TAC, 35-2020 [Vig19]), that the entropy is a universal class of co-homology of degree one of the topos of presheaves over the Bayesian network, seen as a poset B, equipped with a co-sheaf P of probabilities (covariant functor of sets). The operation of joining variables gives a pre-sheaf A in monoids over B. On the other hand, the numerical functions on P form a sheaf FP , which becomes a A-module by 1 considering the mean conditioning of Shannon. The entropy belongs to the ExtA(K; FP ) with coefficients in this module. Moreover, in this framework, the higher mutual information quantities (McGill 1954 [McG54], Hu Kuo Ting 1962 [Tin62], belong to the homotopical al- gebra of co-cycles of higher degrees (cf. [BB15]). We conjectured that something analog appears in the case of DNNs and theories T, and of axioms for them.

The first ingredient in the case of probabilities was the operation of marginalization of a probability law, interpreted as the definition of a functor (a co-presheaf); it can be replaced here by the transfers of theories associated to the functors Fα : FU ′ → FU , and to the ′ morphisms h in the fibers from objects ξ to objects Fα(ξ ), as we saw in the section 3. For ′ logics, this transfer can go in two directions, depending on the geometry of Fα, from U to U, and from U to U ′, as seen in 1.2.

We will begin by the transfer from U ′ to U, having in mind the flow of information in the downstream direction to the output of the DNN; a non-supervised learning should also correspond to this direction. However, the learning by back-propagation or by reinforcement goes from the output layers to the inner layers, then the inner layers have to understand something of the imposed language Lout and the useful theories Tout for concluding. Therefore we will also consider this backward or upstream direction.

37 For an arrow (α, h):(U, ξ) → (U ′,ξ′), the map

′ Ωα,h : ΩU ′ (ξ ) → ΩU (ξ), (3.5)

′ ′ ′ is obtained by composing the map λα = Ωα at ξ , from ΩU ′ (ξ ) to ΩU (Fαξ ) with the map ′ ΩU (h) from ΩU (Fαξ ) to ΩU (ξ). ′ α ′ More generally, for every object X in EU ′ , the map F! sends the subobjects of X to the α ′ subobjects of F! (X ), respecting the lattices structures. Then for any natural transformation α ′ over FU : γ : X → F! (X ), we get a transfer X′ X Ωα,γ : ΩU ′ → ΩU . (3.6) Remind that X (resp. X′) are seen as local contexts in the topos semantic. We assume in what follows that this mapping extends to the sentences in the typed languages LU , where the dependency on ξ reflects the variation of meaning in the included notions. In particular, the morphisms in the topos EU express such variations. At the level of theories, this induces in general a weakening, something which is implied at (U, ξ) by the propositions at (U ′,ξ′), or more generally at the context X by telling what is true, or expected, in the context X′. L In what follows we note by A = Ω this presheaf of sentences in L over F, and by Lα,h, or ∗ πα,h, its transition maps.

Under strong hypotheses on the fibration F, for instance if it defines a fibrant object in the injective groupoids models, i.e. any Fα is a fibration, cf. 2.3 above, following the lemma 2.3 of 2.2, there exists a right adjoint of Ωα,h: ′ X X′ Ωα,h : ΩU → ΩU ′ . (3.7) ′ ∗ α It is given by extension of the operators λα, associated to Fα, in the place of F! , plus a transposition. ′ t L tL′ In what follows we note by A = Ω this co-presheaf of sentences over F, and by α,h, or α,h simply π∗ , its transition maps.

′ For fixed U and ξ ∈FU , the operation ∧ gives a monoid structure on the set AU,ξ = AU,ξ, L tL′ which is respected by the maps α,h and α,h. Moreover, AU,ξ has a natural structure of poset category, given by the external implication L tL′ P ≤ Q, for which α,h and α,h are functors. There exists a right adjoint of the functor R 7→ R ∧ Q; this is the internal implication, ′ P 7→ (Q ⇒ P ). Then, by definition, AU,ξ = AU,ξ is a closed . In fact this is the only structure that is essentially needed for the information theory below, cf. generalization below in this text.

This gives a fibration A over F, and a co-fibration A′ over F, in the sense of Grothendieck, cf. Maltsiniotis, The homotopye theory of Grothendiecke, 2009, [Mal05].

A morphism γ in A from (U,ξ,P )to (U ′,ξ′,P ′), lifting a morphism (α, h) in F from (U, ξ) L ′ ′ e U L ′ ∗ ′ to (U ,ξ ), is given by an arrow ι in Ω from P to α,h(P ) = πα,hP , that is an external implication ′ P ≤ Lα,h(P ). (3.8) Similarly, an arrow in the category A′ lifting the same morphism (α, h) in F, is an implication e tL′ ′ α,h(P ) ≤ P . (3.9)

38 ∗ α,h Remark that a priori the left adjunction πα,h ⊣ π∗ does’nt imply something between P ′ and Lα,h(P ) when (3.9) is satisfied. However, as we will see, the stronger hypothesis that ∗ π ◦ π∗ = Id, has an interest for us, being compatible with hypothesis that Fα; α ∈ C are fibrations, and in this case, (3.9) implies (3.8), i.e. A′ is a subcategory of A. e e Remark: an important particular case, where our standard hypotheses are satisfied, is when LU,ξ the Ω = AU,ξ are the sets of open sets of a topological spaces ZU,ξ, and when there exist ∗ continuous open maps fα : ZU ′,ξ′ → ZU,ξ lifting the functors Fα, such that the maps π and π∗ are respectively the direct images and the inverse images. The strong hypothesis holds when the fα are topological fibrations.

A and A′ belong to augmented model categories using monoidal posets, cf. Raptis, 2010, Homotopye e theory of posets [Rap10].

LU,ξ For P ∈ Ω = AU,ξ, we note AU,ξ,P the set of proposition Q such that P ≤ Q. They are sub-monoidal categories of AU,ξ. Moreover they are closed, because P ≤ Q, P ≤ R implies P ∧ Q = P , then P ∧ Q ≤ R, then P ≤ (Q ⇒ R). ′ When varying P , these sets form a pre-sheaf over AU,ξ = AU,ξ.

Lemma 3.1. Under our hypotheses on the fibration F, the monoids AU,ξ,P , with the functors π∗ between them, form a presheaf over the category A. e ∗ ′ Proof. Given a morphism (α, h, ι): AU,ξ,P → AU ′,ξ′,P ′ in A, the symbol ι means P ≤ π P , then, from P ′ ≤ Q′, we deduce P ≤ π∗P ′ ≤ π∗Q′. e 

∗ Lemma 2.3 in 2.2 asserted the existence of a counit η : π π∗ → IdU , for every morphism ′ ′ ∗ (α, h):(U, ξ) → (U ,ξ ), then for every P ∈ AU,ξ, we have π π∗P ≤ P . ∗ Under the stronger hypothesis on the fibration F, that η = IdΩL , i.e. π π∗P = P , the lemma 1 implies the same result over the category A′. e

LU Definition. ΘU,ξ is the set of theories expressed in the algebra Ω in the context ξ. Under t our standard hypothesis on F, both Lα and Lα send theories to theories.

Definition. ΘU,ξ,P is the subset of theories which imply the truth the proposition ¬P , i.e. the theories excluding P . Remind that ¬P ≡ (P ⇒ ⊥) is the largest proposition R such that R ∧ P ≤ ⊥. It is always true that P ≤ P ′ implies ¬P ′ ≤ ¬P , but the reciprocal implication in general requires a boolean logic. Then, for fixed U, ξ, the sets ΘU,ξ,P when P varies in AU,ξ, form a pre-sheaf over AU,ξ; if P ≤ Q, any theory excluding Q is a theory excluding P .

Lemma 3.2. Under the standard hypotheses on the fibration F, without necessarily the ∗ stronger one, the sets ΘU,ξ,P with the morphisms π , form a pre-sheaf over A.

∗ ′ Proof. Let us consider a morphism (α, h, ι): AU,ξ,P → AU ′,ξ′,P ′ , where ι denotes P ≤ π P ; we deduce π∗¬P ′ = ¬π∗P ′ ≤ ¬P ; then T ′ ≤ ¬P ′ implies π∗T ′ ≤ π∗¬P ′ ≤ ¬P . 

Corollary. Under the standard hypotheses on the fibration F plus the stronger one, the sets ∗ ′ ΘU,ξ,P with morphisms π , form also a pre-sheaf over A . e

39 What happens with π∗?

α,h It is in general false that the collection AU,ξ,P with the functors π∗ form a co-presheaf ′ ′ over A . However, if we restrict ourselves to the smaller category Astrict, with the same e ′ eα,h objects but with morphisms from AU,ξ,P to AU ′,ξ′,P ′ only when P = π∗ P , this is true. ′ Proof: if P ≤ Q, π∗P ≤ π∗Q, then P ≤ π∗Q.

The same thing happens for the collection of the ΘU,ξ,P with the morphism π∗: over the ′ restricted category Astrict, they form a co-pre-sheaf. e ′ Proof: if T ≤ ¬P , we have π∗T ≤ π∗¬P = ¬π∗P = ¬P . However for the full category A′ (resp. the category A), the argument does’nt work: from ′ ∗ ′ e ′ e ∗ ′ π∗P ≤ P (resp. P ≤ π P ), it follows that ¬P ≤ ¬¬π∗P = π∗¬P (resp. π∗P ≤ π∗π P ∗ ′ ′ then ¬π∗π P ≤ π∗P , then by adjunction ¬P ≤ ¬π∗P = π∗¬P ); then T ≤ ¬P implies ′ π∗T ≤ π∗¬P , not π∗T ≤ P .

To summarize what is positive:

Lemma 3.3. Under the above hypothesis, the collections AU,ξ,P and ΘU,ξ,P with the mor- ′ phisms π∗, constitute co-presheaves over Astrict. e ′ ′ Note that the fibers AU,ξ,P are not sub-categories of Astrict. They are subcategoris of A and A. e e e

Definition. A theory T′ is said weaker than a theory T if its axioms are true in T. We note T ≤ T′, as we made for weaker probabilistic models. This applies to theories excluding a P , in ΘU,ξ,P . With respect to propositions, if we take the joint R by the operation ”and” of all the ′ ′ ′ axioms ⊢ Ri; i ∈ I of T, and the analog R for T , the above relation corresponds to R ≤ R . Remark: a weaker theory can also be seen as a more precise, or understandable, theory; for instance in Θλ, the maximal theory ⊢ (¬P ) is dedicated to exclude P , and only it.

Be careful that in the sense of sets of truth assertions, the pre-ordering by inclusion of the theories goes in the reverse direction. For instance {⊢ ⊥} is the strongest theory, in it ev- erything is true, thus every other theory is weaker.

Now we introduce a notion of semantic conditioning.

LU,ξ Definition. For fixed U, ξ, P ≤ Q in Ω , and T a theory in the language LU,ξ, we define a new theory by the internal implication:

Q.T =(Q ⇒ T). (3.10)

More precisely: the axioms of Q.T are the assertions ⊢ (Q ⇒ R) where ⊢ R describes the axioms of T. We consider Q.T as the conditioning of T by Q, in the logical or semantical sense, and frequently we write the resulting theory T|Q.

40 At the level of propositions, the operation ⇒ is the right adjoint in the sense of the Heyting algebra of the relation ∧; i.e.

(R ∧ Q ≤ P ) iff (R ≤ (Q ⇒ P )). (3.11)

Proposition 3.1. The conditioning gives an action of the monoid AU,ξ,P on the set of theories in the language LU,ξ. Proof.

(R ∧ Q′ ∧ Q ≤ P ) iff (R ∧ Q′) ≤ (Q ⇒ P ) iff (R ≤ (Q′ ⇒ (Q ⇒ P )). (3.12)

Note that Q ⇒ P is also the maximal proposition Q′ (for ≤) such that Q ∧ Q′ ≤ P .

Therefore he theory Q ⇒ T is the largest of the theories T′ such that

Q ∧ T′ ≤ T. (3.13)

This implies that T|Q is weaker than T and than ¬Q.

1) In Q ∧ T, the axioms are of the form ⊢ (Q ∧ R) where ⊢ R is an axiom of T, and from ⊢ (Q ∧ R), we deduce ⊢ R.

2) Here Q (resp. ¬Q) is understood as the theory with unique axiom ⊢ Q (resp. ⊢ ¬Q), then if ⊢ (Q ∧ ¬Q) we have ⊢ ⊥ and all theories are true. 

Remark. The theory T|Q = (Q ⇒ T) can also be written TQ, by definition of the internal exponential; an the action by conditioning is also the internal exponential.

Notation: for being lighter, in what follows, we will mostly denote the propositions by the letters P,Q,R,P ′, ... and the theories by the simple capital letters S,T,U,S′, ....

The operation of conditioning was considered by Carnap and Bar-Hilled 1952 [CBH52], in the case of Boolean theories, studying the content of propositions and looking for a general notion of sets of semantic Information. In this case Q ⇒ T is equivalent to T ∨¬Q =(T ∧ Q) ∨¬Q. Cf. the companion text on logico-probabistic information for more details. Their main formula for the concept of information was

Inf(T|P )= Inf(T ∧ P )\Inf(P ); (3.14) assuming that Inf(A ∧ B) ⊇ Inf(A) ∪ In(B).

Proposition 3.2. The conditioning by elements of AU,ξ,P , i.e. propositions Q implied by P , preserves the set ΘU,ξ,P of theories excluding P .

Proof. Let T be a theory excluding P and Q ≥ P ; consider a theory T ′ such that Q∧T ′ ≤ T , we deduce T ′ ∧ P ≤ T , thus T ′ ∧ P ≤ T ∧ P . But T ∧ P ≤ ⊥, then T ′ ∧ P ≤ ⊥. But Q ⇒ T is the largest theory such that Q ∧ T ′ ≤ T , therefore Q ⇒ T excludes P , i.e. asserts ¬P . 

41 ′ Remark. Consider the sets ΘU,ξ,P of theories which imply the validity of the proposition P . ′ ∗ These sets constitute a co-sheaf over the category Astrict for π∗ and a sheaf for π . However, e ′ the formula (3.10) does’nt give an action of the monoid AU,ξ,P on the set ΘU,ξ,P , even in the boolean case, where (Q ⇒ T )= T ∨ ¬Q.

We can also consider the set of all theories over the largest category A, without further ∗ e localization; they also form a sheaf for π and a co-sheaf Θ for π∗, which are stable by the conditioning. ∗ When necessary, we note Θloc the pre-sheaf for π made by the ΘU,ξ,P over A. e ′ The naturality over Astrict of the action of the monoids relies on the following formulas, for e ′ ′ ′ ′ every arrow (α, h):(U, ξ) → (U ,ξ ) in F, we have the arrows (U,ξ,P ) → (U ,ξ , π∗P ) in ′ ∗ Astrict; in the pre-sheaf of monoids AU,ξ,P , for the morphism π , and the pre-sheaf ΘU,ξ,P e with morphisms π∗: ∗ ′ ∗ ′ (π Q ).T = π [Q .π∗(T )] (3.15) ∗ This holds true under the strong hypothesis π π∗ = Id.

If we want to consider functions φ of the theories, two possibilities appear, π∗ for Θ with ∗ ∗ π for the monoids A, or the contrary π for Θ with π∗ for the monoids A. Even with our strongest hypothesis, only the second one gives a modules Φ over A. But they are both co-sheaves.

∗ Proposition 3.3. Under the strong hypothesis π π∗ = Id, and over the restricted category ′ ′ Astrict, the co-sheaf Φ made by the measurable functions (with any sort of fixed values) of e ∗ ′ the theories ΘU,ξ,P , with the morphisms π , is a co-sheaf of modules over the co-sheaf Aloc, made by the monoidal categories AU,ξ,P , with the morphisms π∗.

′ ′ ′ ′ ′ Proof. Consider a morphism (α, h, ι): AU,ξ,P → AU ,ξ ,π∗P , a theory T in ΘU ,ξ ,π∗P , a propo- ′ sition Q in AU,ξ,P , and an element φP in ΦU,ξ,P , we have

′ ′ ′ ′ π∗Q.(Φ∗φP )(T )=(Φ∗φP )(T |π∗Q) ∗ ′ = φP [π (T |π∗Q)] ∗ ′ = φP [π (π∗Q ⇒ T )] ∗ ∗ ′ = φP [π π∗Q ⇒ π T ] ∗ ′ = φP [Q ⇒ π T ] ∗ ′ = φP [π (T )|Q]. 

Remark. The same kind of computation shows that, in the case of the sheaf Φ of functions ∗ on the co-sheaf Θ with π∗ and the sheaf Aloc with π , we would have, for the corresponding elements Q′,T,φ′, ∗ ′ ∗ ′ ′ ∗ ′ π Q .Φ (φ )(T )= φ (π∗π Q .π∗T ); (3.16) which is not the correct equation of compatibility, under our assumption. It is with the other ∗ direction, if ǫ = π∗π = IdAU′,ξ′ .

∗ ∗ However, there exists an important case where both hypotheses π π∗ = IdU and π∗π = IdU ′ hold true, it the case where the languages over the objects (U, ξ) are all isomorphic. In terms

42 of the intuitive maps fα, this means that they are homeomorphisms. This case happens in particular when we consider the restriction of the story to a given layer in a network.

Alternative: looking at the lemmas 1 and 2, we could forget the functional point of view of Φ. In this case we don’t have an abelian situation, but we have a sheaf of sets of theories Θloc, on which the sheaf of monoids Aloc acts by conditioning:

∗ Proposition 3.4. The pre-sheaf Θloc for π is compatible with the monoidal action of the ′ presheaf Aloc, both considered on the category A (then over A by restriction, under the strong hypothesis on F). e e

Proof. If T ′ ≤ ¬P ′ and P ≤ π∗P ′, we have ¬π∗P ′ ≤ ¬P , therefore π∗T ′ ≤ ¬P .  In the Bayesian case, the conditioning is expressed algebraically by the Shannon mean for- mula on the functions of probabilities: P E P Y.φ( X )= Y∗PX (φ( |Y = y)) (3.17)

This gives an action of the monoid of the variables Y coarser than X, as we find here for the fibers AU,ξ,P and the functions of theories ΦU,ξ,P .

Equation (3.14) was also inspired by the Shannon equation

(Y.H)(X; P)= H((Y,X); P) − H(Y ; Y∗P). (3.18)

However this set of equations for a system B can be deduced from the set of equations of invariance (HX − HY )|Z = HX∧Z − HY ∧Z. (3.19) We see that in the semantic case, two analogies appear: in one of them, in each layer, the role of random variables is played by the propositions P ; in the other one, their role is played by the layers U, augmented by the objects of a groupoid (or another kind of category for contexts). The first analogy was chosen by Carnap and Bar-Hillel, and certainly will play a role in our toposic approach too, at each U, ξ, to measure the logical value of functioning. However, the second analogy is more promising for the study of DNNs, for understanding the semantic adventure in the feedforwat and feedback dynamics. What precedes implies that the co-homological approach, looking for topological invari- ant in terms of topos must be developed, at least for beginning, in two separated roads.

The alternative comes from proposition 3.3. It allows to define invariants mixing the two analogies, but not in the abelian framework, possibly in the homotopical framework.

As defined in section 2.3, the semantic functioning of the neural network X is given by a function SU,ξ : XU,ξ → ΘU,ξ. (3.20) The introduction of P , seen as logical localization, corresponds to a refined notion of seman- tic functioning, a quotient of the activities made by the neurons that express a rejection of this proposition.

43 ′ ′ ′ Remark. We could a priori consider the co-sheaf Θ or Θloc over Astrict, and obtain a co- ′ ′ e sheaf Σ , of all possible maps SU,ξ : XU,ξ → ΘU,ξ; U ∈ C,ξ ∈ FU , where the transition from U, ξ to U ′,ξ′ over α, h is given by the contravariance of X and by the covariance of Θ′:

′ t L′ Σα,h(sU )U ′,ξ′ = α,h ◦ sU ◦ Xα,h. (3.21)

However the above discussion shows that the compatibility with the conditioning would require ∗ π∗π = Id, which appeared too restrictive. Moreover, our principal interest is for a particular class of inputs Ξ, coming from the effective dynamical feed-forward flows Xw, more or less adapted by learning to the expected theories in the output. Then the convenient notions of information, if they exist, must be applied to these ingredients. By using functions of the SU,ξ we could not apply them to particular vectors in XU,ξ. But using functions on the ΘU,ξ we can. And that could give numbers (or sets or spaces) ′ associated to a family of activities xλ ∈ Xλ, and to their semantic expression Sλ(xλ) ∈ Θλ. And we can take the sum over the set of x belonging to Ξ. This seems preferable.

On another side, a semantic information must correspond to the impact of the inner functioning on the output, given the inputs. For instance, it has to measure how far from the output theory is the expressed theory at U ′,ξ′. We hope that this should be done by the analog of the mutual information. If we believe in the analogy with probabilities, this ′ is given by the topological co-boundary of the family φλ; λ ∈ A , supported on the arrows γ : λ → λ′. Then we enter the theory of topological invariants of the sheaves of modules in a ringed topos. Here Θ′ over A′.

Then, suppose that Θ takes the form of a sheaf, not a co-sheaf. This corresponds to the alternative of the above prop. 3, where A is also a sheaf over A. Both A and Θ have morphisms given by the maps π∗. e In this framework, it appears a natural topological obstruction: the Sλ; λ ∈ A have to form a natural transformation to give a coherent semantic functioning. e

The relations ∗ ∗ SU,ξ ◦ X = π ◦ SU ′,ξ′ , (3.22) mean that the logical transmission of the theories expressed by U ′ (in the context ξ′) coincide with the theories in U induced by the neuronal transmission from U ′ to U.

If this coherence is verified, the object Σ in the topos, replacing Σ′, could be taken as the ΘX in the topos of presheaves over A. By definition, this is equivalent to consider the parameterized families of functioning e

Sλ : XU,ξ × Yλ → ΘU,ξ,P ; (3.23) where Y is any object in the topos of pre-sheaves over A. e

Remark. In the experiments with small networks, we verified this coherence, but only ap- proximatively, i.e. with high probability on the activities in X.

44 ′ op The category Astrict, that we will also denote D , gives birth to a refinement of the cat’s manifolds we definede before.

Suppose, for simplifying, that we have a unique initial point in C; it corresponds to the output layer Uout. Then look at a given ξ0 ∈ Fout, and a given proposition Pout in

Ωout(ξ0) = AUout,ξ0 ; it propagates in the inner layers through π∗ in P ∈AU,ξ for any U and ∗ any ξ linked to ξ0, and can be reconstructed by π at the output, due to the hypothesis ∗ op π π∗ = Id. Then we get a section over C of the co-fibration D → C. This can be extended op as a section of D →F, by varying ξ0, when the Fα are fibrations, which is the main case we have in mind. Note that this doesn’t give all the sections, because certain propositions P in a Aλ are not ∗ in the image of π∗, even if all of them are sent by π to an element of a set Ωout(ξ0). However, these sections are in bijection with the connected components of Dop. Let K be a commutative ring, and cP a non zero element of K; we define the (measurable) function δP on the theories in the Θλ(P ), taking the value cP over a point in the above connected component of D, and 0 outside.

Now look at the semantic functioning S : XU,ξ → Θλ, we get a function δP on the sets of local activities. This function takes the value cP on the set of activities that form theories excluding P .

Several subtle points appear: 1) the function really depends on P , but varying P , it is non-zero when two propositions have the same negation ¬P ; 2) to conform with the before introduced notion of cat’s manifold, we must assume that the activities in different layers which exclude P in their axioms, are coherent, i.e. form a section of the object Xw. Without the coherence hypothesis between dynamics and logics, we have two different notions of cat’s manifolds, one dynamic and one linguistic or logical. In a sense, only the agreement deserves to be really named semantic.

3.4 Semantic information. Homology constructions

Bar complex of functions of theories and conditioning by proposi- tions. We start with the computation of the abelian invariants, therefore with functions Φ on Θ in the cases where conditioning can act.

We consider first the case described by proposition 3.2. As it concerns co-sheaves, we prefer ′ ′ to work over the opposite of the category Astrict, denoted D. Then Aloc with morphisms e ′ ∗ π∗, becomes a sheaf of monoids over D, and Θloc, with morphisms π , becomes a co-sheaf of ′ ′ sets over D, in such a manner that the functions Φ on Θloc constitute a sheaf of Aloc modules.

We suppose that the elements φλ in Φλ take their values in a commutative ring K (with cardinality at most continuous).

45 The method of relative homological algebra, used for probabilities in Baudot, Bennequin ∗ [BB15], and Vigneaux [Vig19], cited above, can be applied here, for computing Ext ′ (K, Φ) Aloc ′ in the toposic sense. The action of Aloc on K is supposed trivial.

′ ′ ′ We note R = K[Aloc] the co-sheaf in K-algebras associated to the monoids Aλ; λ ∈A . The non-homogeneous bar construction gives a free resolution of the trivial constant module K: ′ ′ ′ 0 ← K ← B0 ← B1 ← B2 ← ... (3.24) ′ N ⊗(n+1) where Bn; n ∈ , is the free R module R , with the action on the first factor. In ′ ′ each object λ =(U,ξ,P ), the module Bn(λ) is freely generated over K[Aλ] by the symbols ′ [P1|P2|...|Pn], where the Pi are elements of Aλ, i.e. propositions implied by P . Then the ′ elements of Bn(λ) are finite sums of elements P0[P1|P2|...|Pn]. ′ The first arrow from B0 to K is the coordinate along [∅]. The higher boundary operators are of the Hochschild type, defined by on the basis by the formula

∂[P1|P2|...|Pn] n−1 i n = P1[P2|...|Pn]+ X(−1) [P1|...|PiPi+1|...|Pn]+(−1) [P1|P2|...|Pn−1] (3.25) i=1 N n And, for each n ∈ , the vector space ExtA′ (K, Φ) is the n-th group of co-homology of the ∗ associated complex HomA′ (B , Φ), made by natural transformations which commutes with the action of K[A′].

′ A cochain of degree zero is a section φλ; λ ∈D of Φ, that is, a collection of maps φλ :Θλ → K, ′ op ′ ′ such that, for any morphism γ : λ → λ in D , and any S ∈ Θλ′ , we have

′ ∗ ′ φλ′ (S )= φλ(π S ). (3.26)

If there exists a unique last layer Uout, as in the chain, this implies that the functions φµ are all determined by the functions φout on the sets of theories Sout in the final logic, exclud- ′ ing given propositions, by definition of the sets ΘU,ξ,P . And a priori these final functions are arbitrary.

Acyclicity anf fundamental cochains To be a co-cycle, φ must satisfy, for any λ =(U,ξ,P ), and P ≤ Q,

0= δφ([Q])(S)= Q.φλ(S) − φλ(S)= φλ(Q ⇒ S) − φλ(S). (3.27)

However, for any P we have P ≤ ⊤, and S|⊤ = ⊤; then the invariance (3.27) implies that φλ is independent of S; it is equal to φλ(⊤).

Then, a co-cycle is a collection elements φ(λ) in K, satisfying φλ′ = φλ each time there ′ ′ ′ exists an arrow from λ to λ in Astrict, thus forming a section of the constant sheaf over Astrict. e e This gives:

46 Proposition 3.5. As

e′ 0 0 ′ π0(Astrict) ExtA′ (K, Φ) = H (Astrict; K)= K , (3.28) e then degree zero co-homology counts the propositions that are transported by π∗ from the output. Remark. From the discussion at the end of the preceding section, this gives a strong relation between the zero co-homology of information and the cats manifolds, then with the zero co- homology in the sense of sheaves, as explained in the corresponding section above. R ′ ′ A degree one co-chain is a collection φλ of measurable functions on Θλ, and R ∈Aλ, which ′ op ′ ′ satisfies the naturality hypothesis: for any morphism γ : λ → λ in D , and any S ∈ Θλ′ , we have π∗R ′ R ∗ ′ φλ′ (S )= φλ (π S ). (3.29) The co-cycle equation is

′ ∀U, ξ, ∀P, ∀Q ≥ P, ∀R ≥ P, ∀S ∈ ΘU,ξ,P , Q∧R Q R φλ (S)= φλ (S)+ φλ (Q ⇒ S). (3.30) Let us define a family of elements of K by the equation P ψλ(S)= −φλ (S). (3.31)

The formula (3.29) implies the formula (3.26), then ψλ is a zero co-chain. Take its co-boundary P P δψλ([Q])(S)= φλ (S) − Q.φλ (S). (3.32) using the co-cycle equation and the fact that for any Q ≥ P we have Q ∧ P = P , this gives Q Q∧P P φλ (S)= φλ (S) − Q.φλ (S)= δψλ([Q])(S). (3.33) P Remark that the co-chain ψ is not unique, the formula ψ = −φλ is only a choice. Two co-chains ψ satisfying δψ = φ differ by a zero co-cycle, that is a family of numbers cλ, de- pendent on P but not on S. Remind us that P is part of the object λ.

Therefore every one co-cycle is a co-boundary, or in other terms:

1 Proposition 3.6. ExtA′ (K, Φ)=0. The same argument applies to every degree n ≥ 1.

n Proposition 3.7. ExtA′ (K, Φ)=0. Q1;...;Qn Proof. If φλ is a cocycle of degree n ≥ 1, where λ =(U,ξ,P ), the formula

Q1;...;Qn−1 n Q1;...;Qn−1;P ψλ =(−1) φλ (3.34) defines a cochain of degree n − 1 such that δψ = φ. Q1;...;Qn Extracting φλ from the last term of the co-cycle equation for φ, applied to Q1, ..., Qn+1 with Qn+1 = P , gives

n Q1;...;Qn (−1) φλ n−1 Q2;...;Qn;P Q2;...;QiQi+1;...;Qn;P n Q2;...;Qn∧P = Q1.φλ + X φλ +(−1) φλ . (3.35) i=1

As Qn ∧ P = P in Aλ, this is exactly the co-boundary of ψ applied to Q1; ...; Qn. 

47 Remark. At first sight this is a deception; however, there is a morality here, because it tells that the measure of semantic information reflects a value of a theory at the output, depending on many elements that the network does not know, without knowing the consequences of this theory. Some of these consequences can be included in the metric for learning, some other cannot be.

When a co-chain ψ as above is chosen, it defines the co-cycle φ by the formula

Q φλ (S)= ψλ(Q ⇒ S) − ψλ(S). (3.36)

The cochain ψ satisfied (3.26), and the co-boundary φ the equation (3.29). All the arbitrariness is contained in the values of ψout, which are function of P and of the theory excluding P . Now examine the role of a proposition Q implied by P . It changes the value of φ according to the equation

Q P P φout(Q; T )= φout(T )= φout(T ) − φout(T |Q)= ψout(T |Q) − ψout(T ), (3.37) then it subtracts from ψout(T ) the conditioned value ψout(T |Q). And this is transmitted inside the network by the equation

π∗Q ′ Q ∗ ′ φλ′ (S )= φλ (π S ); (3.38) which is equivalent to the simplest equation

′ ∗ ′ ψλ′ (S )= ψλ(π S ). (3.39)

∗ Note that we are working under the hypothesis π π∗ = Id, then it can happen that a theory ′ ∗ ′ S , in the inner layers cannot be reconstructed (by π∗) from its deduction π S in the outer layer. Thus the logic inside is richer than the transmitted propositions, but the quantity ′ ∗ ′ ψλ′ (S ) depends only on π S . This corresponds fairly well with what we observed in the experiments about simple clas- sification problems, with architectures more elaborated than a chain, cf. Logical cells II, [BBG21]. In some cases, the inner layers invent propositions that are not asked in the ob- jectives. They correspond to demonstrations of these objectives.

Mutual information, classical and quantum analogies We propose now an interpretation of the functions φ and ψ, when K = R, or an ordered Z P ring, as : the value φout(S) measures the ambiguity of S with respect to ¬P , then the value of ψout(S) is growing with S, i.e. S ≤ T implies ψout(S) ≤ ψout(T ).

Among the theories which exclude P , there is a minimal one, which is ⊥, without much interest, even it has the maximal information in the sense of Carnap and Bar-Hillel, and a maximal theory, which is ¬P itself; it is the more precise, but with the minimal information, if we measure information by the quantity of exclusions of propositions it can give. Thus ψ doesn’t count the quantity of possible information, but the closeness to ¬P .

Q Consequently, φP (S) is always a positive number, which is decreasing in Q when S is given. P Therefore, we can take ψ negative, by choosing ψλ = −φλ . In what follows we take this choice for ψ.

48 Q The maximal value of φP (S), for a given S is attained for Q = P , in this case S|P = ¬P , P P then the maximal value is φλ (S) − φλ (¬P ).

The truth of the proposition ¬Q can be seen as a theory excluding P when P ≤ Q. Like a counterexample of P .

Note the following formula for P ≤ Q:

Q P P φλ (S)= φλ (S) − φλ (S|Q). (3.40) Remind that the entropy function H of a joint probability is also always positive, and we have I(X; Y )= H(X) − H(X|Y ), (3.41) as it follows from the Shannon equation and the definition of I. This also gives I(X; X)= H(X).

Q P Then we interpret φλ (S) as a mutual information between S and ¬Q, and φλ (S) itself as a kind of entropy, thus measuring an ambiguity: the ambiguity of what is expressed in the layer λ about the exclusion of π∗P at the output. This accords with the formula π∗Q Q ∗ φλ (S)= φout(π S). (3.42) Remark: in Quantum Information, where variables are replaced by orthogonal decomposition of an Hilbert space, and probabilities are replaced by adapted positive hermitian operators of trace one (cf. Baudot and Bennequin, [BB15]), the Shannon entropy H (entropy of the associated classical law) appears as (minus) the co-boundary of a co-chain, which is the Von

Neumann entropy S = − log2 Trace(ρ).

HY (Y ; ρ)= SX (ρ) − Y.SX (ρ). (3.43)

Then in the present case, the theories are analogs of the density matrices, the propositions are the analogs of the observables, the function ψ is an analog of the opposite of the Von- Neumann entropy, and the ambiguity φ an analog to the Shannon entropy.

Let us see what this gives for a functioning network Xw, possessing a semantic functioning SU,ξ : XU,ξ → ΘU,ξ, not necessarily assuming the naturality (3.23). We can even specialise by taking a family of neurons having an interest in the exclusion of some property P , and look at a family ′ Sλ : XU,ξ → Θλ, (3.44) where λ =(U,ξ,P ).

To a true activity x of the network, we get xU,ξ, then, we define

Q Q Hλ (x)= φλ (Sλ(xU,ξ)). (3.45) And we propose it as the ambiguity in the layer U, ξ, about the proposition P at the output, when Q is given as an example.

To understand better the role of Q, we apply the equation (3.29), which gives

π∗Q ′ Q ∗ ′ ′ Hλ′ (x )= φλ (π S (x )). (3.46)

49 ′ Therefore, evaluated on a proposition π∗Q which comes from the output, I(x ) in the hidden ′ ∗ layer U , is the mutual information of ¬Q and the deduction in Uout by π of the theory S′(x′), expressed in U ′ in presence of the given section (feed-forward information flow), com- ′ ing from the input, by the activity x ∈ XU ′ .

′ ′ ∗ Remark: consider a chain (U, ξ) → (U ,ξ ) → (U”,ξ”). We denote by ρ∗ and ρ the ap- plications which correspond to the arrow (U ′,ξ′) → (U”,ξ”). Therefore (π′)∗ = π∗ρ∗ and ′ π∗ = ρ∗π∗. For any section x, and proposition P in the output (U, ξ), consider the particular case P = Q, where (Q ⇒ S)= ¬P for every theory excluding P :

′ P ∗ ′ ′ P ∗ ′ ′ P ′ ∗ H(x ) − H(x”) = φλ (π S (x )) − φλ (π S (x )|P ) − (φλ ((π ) S”(x”)) P ′ ∗ − φλ ((π ) S”(x”)|P )) P ∗ ′ ′ P ′ ∗ = φλ (π S (x )) − φλ ((π ) S”(x”)) ∗ ∗ ∗ ′ ′ = ψλ(π ρ S”(x”)) − ψλ(π S (x ))

This is surely negative in practice, because the theory S′(x′) is larger than the theory ∗ ρ S”(x”). For instance, at the end, we surely have Sout = ¬P , as soon as the network has learned. Consequently this quantity has a tendency to be negative. Then it is not like the mutual between the layers. It looks more as a difference of ambiguities. Because the ambiguity is decreasing in a network in reality. This confirms that H is an ambiguity. Therefore, the mutual information has to come from a manner to involve a pair of layers.

To obtain a notion of mutual information, we make an extension of the monoids AU,ξ,P , which continues to act by conditioning on the sets ΘU,ξ,P .

′ For that, we consider a fibration over Astrict made by monoids Dλ which contain Aλ as submonoids. ′ By definition, if λ = (U,ξ,P ), an object of Dλ is an arrow γ0 = (α0, h0, ι0) of Astrict, going e from a triple (U0,ξ0,P0) to a triple (U, ξ, π∗P0), where P ≤ π∗P0, and a morphism from α10,h10 (α0, h0, ι0) to γ1(α1, h1, ι1) is a morphism γ10 from (U0,ξ0,P0) to (U1,ξ1, Q1 = π∗ P0) such that Q1 ≥ P1. For the intuition it is better to see the objects as arrows in the D of ′ Astrict, in such a manner they can compose with the arrows Q ≤ R in the monoidal category e Aλ, then we get a variant of the right slice λ|D, just extended by Aλ. The category Dλ is monoidal and strict if we define the product by

γ1 γ0 γ1.γ2 =(U, ξ, π∗ P1 ∧ π∗ P2). (3.47)

The identity being the truth ⊤λ.

We also define the action of Dλ on Θλ as follows: for every arrow γ0 : λ0 → λπ∗P0 , where λ0 =(U0,ξ0,P0), and where λπ∗P0 dentes (U, ξ, π∗P0), assuming π∗P0 ≥ P , we define T γ0 T γ0. =(π∗ P0 ⇒ ). (3.48)

This gives an action of the monoid of propositions in Aλ0 which are implied by P0, whose images by π∗ are implied by P .

50 γ0 γ0 γ0 If P0 ≤ Q0 and P0 ≤ R0, we have π∗ (Q0 ∧ R0)= π∗ (Q0) ∧ π∗ (R0).

The monoidal categories Dλ; λ ∈ D form a natural presheaf over D. For any morphism ′ ′ ′ γ = (α, h, ι) of Astrict, going from (U,ξ,P ) to (U ,ξ , π∗P ), and any object γ0 : λ0 → λπ∗P0 e γ in Dλ, we define γ∗(γ0) by the composition (α, h)◦(α0,ξ0) and the proposition π∗ ◦π∗P0 in Aλ′ .

∗ γ The naturalness of the monoidal action on the theories follows from πγ π∗ = IdU :

∗ ′ ∗ γ ′ πγ [γ∗(π∗P0).T ]= πγ[π∗ π∗P0 ⇒ T ] ∗ γ ∗ ′ = πγπ∗ π∗P0 ⇒ πγT ∗ ′ = π∗P0 ⇒ πγT

′ ∗ ′ Then, defining [Φ∗(γ)(φλ)(T )= φλ(πγ T ), we get the following result

Lemma 3.4. ′ ∗ ′ [Φ∗(γ)φλ](γ∗(γ0).T )= φλ(γ0.π T ). (3.49) Consequently the methods of abelian homological algebra can be applied. Cf. [Mac12].

The (non-homogeneous) bar construction makes now appeal to symbols [γ1|γ2|...|γn], where the γi are elements of Dλ. The action of algebra pass through the direct image of propositions π∗Pi; i =1, ..., n. ′ Things are very similar to what happened with the precedent monoids Aλ: the zero co-chains are families φλ of maps on theories satisfying

∗ ′ ′ ψλ(π T )= ψλ′ (T ), (3.50)

′ ′ where γ : λ → λ is a morphism in Astrict. The coboundary operator is e

γ1 δψλ([γ1]) = ψλ(T |π∗ P1) − ψλ(T ). (3.51) Then the co-homology is defined as before. We get the same proposition ... γ1 The one cochains are collections of maps of theories φλ satisfying

γ1 ∗ ′ γ∗γ1 ′ φλ (π T )= φλ′ (T ). (3.52) The cocycle equation is γ1∧γ2 γ1 γ2 φλ = φλ + γ1.φλ . (3.53) One more time, the cocycles are coboundaries; the following formula is easily verified

λ1 φλ =(δψλ)[λ1]= π∗P1.ψλ − ψλ; (3.54) where Idλ ψλ = −φλ . (3.55) The new interesting point is the definition of a mutual information. For that we mimic the formulas of Shannon theory, thus we apply a combinatorial operator to the ambiguity. Then we consider the canonical bar resolution for Ext∗ (K, Φ), with the trivial action of Ae A′|λ; λ ∈ A. The operator is the combinatorial co-boundary δt at degree two, and it gives: e t γ1 γ1∧γ1 γ2 Iλ(γ1; γ2)= δ φλ[γ1,γ2]= φλ − φλ + φλ . (3.56)

51 This gives the following formulas

γ1 γ1 γ2 γ2 Iλ(γ1; γ2)= φλ − γ2.φλ = φλ − γ1.φλ . (3.57)

More concretely, for two morphisms γ1 : λ1 → λ and γ2 : λ2 → λ, denoting by P1,P2 λ their respective coordinates on propositions, and by ψλ = −φλ the canonical 0-cochain, we have: Iλ(γ1; γ2)(T )= ψλ(T |π∗P2)+ ψλ(T |π∗P1) − ψλ(T |π∗P1 ∧ π∗P2) − ψλ(T )

Remark. We decided that the interpretation of φλ is better when ψλ is growing. Now, the positivity of the quantity Iλ implies a sort of concavity of ψλ.

More generally, we say that a real function ψ of the theories, containing ⊢ ¬P , in a given language is concave (resp. strictly concave), if for any pair of such theories T ≤ T ′ and any proposition Q ≥ P , the following expression is positive (resp. strictly positive),

′ ′ ′ IP (Q; T, T )= ψ(T |Q) − ψ(T ) − ψ(T |Q)+ ψ(T ). (3.58)

Remark that this definition extends verbatim to any closed monoidal category, because it uses only the pre-order and the exponential. ′ The positivity of the mutual information is the particular case where T = T |Q1.

This makes that ψ looks like the function ln P for a domain ⊥ < P ≤ ¬P , analog of the inteval ]0, 1[ in the propositional context.

P The functions ψλ can always be chosen such that φλ = −ψλ. Then the above interpre- tation of φ as an informational ambiguity is compatible with an interpretation of ψ(T ) as a measure of the precision of the theory.

The Boolean case, comparing to Carnap and Bar-Hillel [CBH52] In the finite boolean case, the opposite of the content defined by Carnap and Bar-Hillel gives such a function ψ, strictly increasing and concave. Remind that the content set C(T ) is the set of elementary propositions that are excluded by the theory T . Here we assimilate a theory with the and of its axioms, and with a subset of a finite set E. If T < T ′, there is less excluded points by T ′ than by T , then −c(T ′) − (−c(T )) > 0. If P ≤ Q, the content set of T ∨ ¬Q is the intersection of C(T ) and C(⊢ ¬Q)= C(Q)c, and the content of T ′ ∨ ¬Q the intersection of C(T ) and C(⊢ ¬Q) = C(Q)c, then the complement of C(T ′ ∨ ¬Q) in C(T ′) is contained in the complement of C(T ∨ ¬Q) in C(T ). Consequently

ψ(T |Q) − ψ(T ) − (ψ(T ′|Q) − ψ(T ′)) = c(T ) − c(T |Q) − (c(T ′) − c(T ′|Q)) ≥ 0. (3.59)

It is zero when T ′ ∧ (¬Q) ≤ T . A natural manner to obtain a strictly concave function is to apply the logarithm function to the function (cmax − c(T ))/cmax. Therefore a natural formula in the boolean case is c(⊥) − c(T) ψ (T) = ln (3.60) P c(⊥) − c(¬P )

52 But we also could take a uniform normalization: c(⊥) − c(T) ψ (T) = ln (3.61) ⊥ c(⊥) Amazingly, this was the definition of the amount of information (with a sign minus) of Car- nap and Bar-Hillel, 1952. A generalization along their line consists to choose any strictly positive function m of the elementary propositions and to define the numerical content c(T ) as the sum of the values of m over the elements excluded by T . This corresponds to the attribution of more or less value to the individual elements.

Question: find a natural formula, if it exists, that is valid in every Heyting algebra, or at least in a class of Heyting algebras larger than the Boole algebras.

Example: the open sets of a topology on a finite set X. The analog of the content of T is the cardinality of the closed set X \ T . Then a preliminary function ψ is the cardinality of T itself, which is naturally increasing with T . However simple examples show that this function can be non-concave. The set T |Q\T is made by the points x of X\T having a neighborhood V such that V ∩ V ⊂ T , there exists no relation between this set and the analog set for T ′ larger than T , but smaller than ¬P .

However, appendix D constructs a good function ψ for the sites of DNNs and the injective finite sheaves. This applies in particular to the chains 0 → 1 → ... → n.

A remark on semantic independency In their 1952 report[CBH52], Carnap and bar-Hillel gave a different justification than us for taking the logarithm of a normalized version of the content. This was in the Boolean situation, n = 0, but the appendix five permits to extend what they said to some non-boolean situations. They had in mind that independent assertions must give an addition of the amounts of information of the separate assertions. However, as they recognized themselves, the concept of semantic independency is not very clear (cf. [CBH52, page 12]). In fact they studied a π particular case of typed language, they named Ln where there exists one type of subjects with n elements, a, b, c, ..., that can have a certain number π of attributes (or predicate). The example is three humans and their genre, male or female, and their age, old of young. For every elementary proposition Zi, i.e. a point inn E, they choose a number mP (Zi) in ]0, 1|, and define, as in the preceding section with µ, the function m of any proposition L, by taking the sum of the mi over the elements of L, viewed as a subset of E. Carnap and Bar-Hillel imposed several axioms on mP , for instance the invariance under the natural action of the symmetry group Sn × Gπ, and the normalization by m(E) = 1. The content is an additive normalization of the opposite of m. The number c(L) evaluates the quantity of elementary propositions excluded by L. At some moment, they introduce axiom h, [CBH52, page 14], m(Q ∧ R) = m(Q)m(R), if Q and R do not consider any common predicate. This axiom was rarely considered in the rest of the paper. However it is followed by a definition: two assertions S and T were said inductively independent (with respect to mP ) if an only if m(S ∧ T )= m(S)m(T ). (3.62)

53 This was obviously inspired from the theory of probabilities, cf. Carnap’s book [Car50], where primitive predicates are considered in relation to probabilities. If we think to the example or age of male or female, the axiom is not very convincing from the point of view of probability, because in most sufficiently large population of humans it is not true that age and genre are independent. However, from a semantic point of view, this is completely justified!

Now, if we come to the amount of information: taking the logarithm of the inverse of m(T ) to measure inf(T ) makes that independency (inductive) is equivalent to the additivity:

ψ(S ∧ T )= ψ(S)+ ψ(T ). (3.63)

Under this form, the definition conserves a meaning, for any function ψ. Even with values in a category of models, with a good notion of co-limit, as the disjoint union of sets.

In Shannon’s theory, with the set theoretic interpretation of Hu Kuo Ting, [Tin62], we recover the same thing.

Comparison of information between layers Another way to obtain a comparison between layers, i.e. objects (U, ξ), comes from the ordinary co-homology of the object Φ in the topos of pre-sheaves over the opposite category ′ of Astrict, that we named D. eThis cohomology can be computed following the method exposed by Grothendieck and Verdier in SGA 4 [AGV63], using a canonical resolution of Φ. This resolution is constructed ′ from the nerve N (D), made by the sequences of arrows λ → λ1 → λ2... in Astrict, then associated to the fibration by the slices category λ|D over D. Be carefull thate in D, the arrows are in reversed order.

The nerve N (D) has a natural structure of simplicial set whose n simplices are sequences ′ of composable arrows (γ1, ..., γn) between objects λ0 →···→ λn in Astrict, and whose face e operators di; i =0, ..., n are given by the following formulas:

d0(γ1, ..., γn)=(γ2, ..., γn)

di(γ1, ..., γn)=(γ1, ..., γi+1 ◦ γi, ..., γn)if 0

dn(γ1, ..., γn)=(γ1, ..., γn−1).

This allows to define a canonical cochain complex (Cn(D, Φ),d) which cohomology is H∗(D, Φ).

The n-cochains are n C (D, Φ) = Y Φλn (3.64) λ0→···→λn and the co-boundary operator δ : Cn−1(D, Φ) → Cn(D, Φ) is given by

n−1 i n (δφ)λ0→···→λn = X(−1) φdi(λ0→···λn) +(−1) Φ∗(γn)φdn(λ0→···λn). (3.65) i=0

54 For instance at degree zero, this gives, for γ : λ → λ′

0 ′ 0 ′ 0 ∗ ′ δφγ(S )= φλ′ (S ) − φλ(π S ). (3.66)

Q For our co-cycle φλ , with P ≤ Q, a more convenient sheaf over D is given by the sets Ψλ of functions of the pairs (S, Q), with S excluding P and P implying Q, with morphisms

′ ′ ∗ ′ ∗ ′ Ψ∗(γ)(S , Q )= ψ(π S , π Q ). (3.67) This gives 0 ′ ′ 0 ′ ′ 0 ∗ ′ ∗ ′ δφγ(S , Q )= φλ′ (S , Q ) − φλ(π S , π Q ). (3.68) 0 Q In our case, with φλ(S, Q) = φλ (S), we get the measure of the evolution of the ambiguity along the network.

Now we change of subjects and consider the reverse direction of propagation of the theories and propositions.

The particular case of natural isomorphisms Until the end of this subsection, we consider the particular case of isomorphisms between ∗ ∗ the logics in the layers, i.e. π π∗ = IdU and π∗π = IdU ′ . As we will see, this is rather deceptive, giving a particular case of the preceding notion of ambiguity and information, obtained without the hypothesis of isomorphism, then it can be skipped easily, but it seemed necessary to explore what possibilities were offered by the contra-variant side of A. e In this case we are allowed to consider the sheaf of propositions A for π∗ together and the co-sheaf of theories Θ for π∗ over the category A. The action of A by conditioning on the sheaf Φ of measurable functions on Θ is natural,e cf. prop. 3. Thus we can apply the same strategy as before, using the bar complex.

The zero co-chains satisfy ψλ′ (π∗T )= ψλ(T ). (3.69) This equation implies the naturality (3.26). However, there is a difference with the preced- ing framework, because we have more morphisms to take in account, i.e. the implications P ≤ P ′. This implies that, for U, ξ fixed, φ doesn’t depend on P ; there exists a function ψU,ξ on all the theories such that ψλ on Θ(U,ξ,P ) is its restriction. Proof : for any pair P ≤ Q in Aλ, and any theory which excludes Q then P , we have ψP (S)= ψQ(S). Therefore ψP = ψ⊥. The equation of co-cycle is the same as before, i.e. (3.27). It implies that ψU,ξ is invariant by the action of Aλ. In every case, boolean or not, this implies that φU,ξ is also independent of the theory T . Therefore the H0 now simply counts the sections of F.

The degree one co-chains satisfy

R′ π∗R′ φλ′ (π∗S)= φλ (S). (3.70)

In particular, for any triple P ≤ Q ≤ R, and any S ∈ ΘP , we have

R R φU,ξ,Q(S)= φU,ξ,P (S), (3.71)

55 P which allows us to consider only the elements of the form φλ , that we denote simply φλ.

The co-cycle equation is as before, (3.30): And taking ψλ = −φλ gives canonically a zero whose co-boundary is φ: Q φλ (S)= ψλ(S) − ψλ(S|Q). (3.72) Which defines the dependency of φ in Q.

The naturality, in the case of isomorphisms, for a connected network, with a unique output layer, tells that everything can be computed in the output layer. The intervention of the layers is illusory. Then it is sufficient to consider the case of one layer and logical calculus. What follows is only a verification that things transport naturally to the whole category A. e The extension of monoids is made with the left slices categories λ|A; the action of λ|A on Θλ is given by T ∗ ′ T ∗ ′ γ. =(πγ P ⇒ )= T |πγP (3.73) ′ ′ ′ ′ ′ ∗ ′ where γ : λ → λ , λ =(U,ξ,P ), λ =(U ,ξ ,P ), P ≤ π P , and πγ =(α, h) is the projected morphism of F. ′ This defines an action of the monoid of propositions in Aλ′ which are implied by P . If ′ ′ ′ ′ ∗ ′ ′ ∗ ′ ∗ ′ P ≤ Q and P ≤ R , we have πγ (Q ∧ R )= πγ (Q ) ∧ πγ (R ). A natural structure of monoid is given by

∗ ∗ γ1.γ2 =(U, ξ, π γ1 ∧ π γ2). (3.74)

′ ∗ ′ This works because, for a morphism γ : λ → λ , we have P ≤ πγ P . The identity is the truth ⊤λ.

Lemma 3.5. The naturality of the operations over A′, follows from the further hypothesis: ∗ for every morphism (α, h), we assume that the counit π π∗ is equal to IdLU,ξ .

∗ Proof. Consider an arrow ρ : λ → λ1; it gives a morphism ρ : λ1|A → λ|A. ′ ∗ For a morphism γ1 : λ1 → λ1, ρ (λ1)= γ1 ◦ ρ. ′ ′ ′ ′ ′ ′ If γ1 : λ1 → λ1 is an arrow in A , where λ1 =(U1,ξ1,P1), and T a theory in Θλ, we have

∗ ∗ ′ ρ (γ1).T = πγ1◦ρP1 ⇒ T ∗ ∗ ′ ∗ = πρπγ1 P1 ⇒ πρ(πρ)∗T ∗ ∗ ′ = πρ[πγ1 P1 ⇒ (πρ)∗T ] ∗ = πρ[γ1.(πρ)∗T ] ∗ = ρ (γ1.ρ∗T )

The monoids λ|A constitute a pre-sheaf over A, only in the case of isomorphisms, i.e. ∗ e e π∗π = Idλ′ . 

The bar construction now makes appeal to symbols [γ1|γ2|...|γn|, where the γi are arrows ∗ issued from λ. The action of algebra pass through the inverse image of propositions π Pi. The zero co-chains are families φλ of maps on theories satisfying

ψλ(T )= ψλ′ (π∗T ), (3.75)

56 where γ : λ → λ′ is a morphism in A. The coboundary operator is e

∗ δψλ([γ1]) = ψλ(T |πγP1) − ψλ(T ). (3.76)

Then the co-homology is as before. γ1 The one cochains are collections of maps of theories φλ satisfying

′ ′ γ1 γ1◦γ φλ′ (π∗T )= φλ (T ). (3.77)

The cocycle equation is γ1∧γ2 γ1 γ2 φλ = φλ + γ1.φλ . (3.78) One more time, the cocycles are coboundaries; the following formula is easily verified

λ1 ∗ φλ =(δψλ)[λ1]= π P1.ψλ − ψλ; (3.79) where Idλ ψλ = −φλ . (3.80) The combinatorial co-boundary δt at degree two gives:

t γ1 γ1∧γ1 γ2 Iλ(γ1; γ2)= δ φλ[γ1,γ2]= φλ − φλ + φλ . (3.81)

This gives the following formulas

γ1 γ1 γ2 γ2 Iλ(γ1; γ2)= φλ − γ2.φλ = φλ − γ1.φλ . (3.82)

More concretely, for two morphisms γ1 : λ1 → λ and γ2 : λ2 → λ, denoting by P1,P2 λ their respective coordinates on propositions, and by ψλ = −φλ the canonical 0-cochain, we have:

Iλ(γ1; γ2)(T ) ∗ ∗ ∗ ∗ = ψλ(T |π P1 ∧ π P2) − ψλ(T |π P1) − ψλ(T |π P2)+ ψλ(T ) (3.83)

In a unique layer U, for a given context ξ, we get

I(P1; P2)(T )= ψ(T |P1 ∧ P2) − ψ(T |P1) − ψ(T |P2)+ ψ(T ). (3.84)

This is the particular case of the mutual information we got before, cf. (3.56), because now, the generating function ψ is the restriction to Θ(P ) of a function that is defined on Θ=Θ(⊥).

3.5 Homotopy constructions

Abelian homogeneous bar complex of information We start by describing an homogeneous version of the information co-cycles, giving first the differences of ambiguities, from which the above ambiguity can ne derived by reducing redundancy. For that purpose we consider equivariant co-chains as in [BB15]. The sets Θλ, where λ = (U,ξ,P ), are now extended by the symbols [γ0|γ1|...|γn], where N ′ n ∈ , and the γi; i =0, ..., n, are objects of the category Dλ or arrows in Astrict abutting to e 57 λR =(U, ξ, R) for P ≤ R. n This extension with n + 1 symbols is denoted by Θλ. It represents the possible theories in the local language and its context U, ξ, excluding the validity of P , augmented by the possibility to use counter-examples ¬Qi, i = 0, ..., n. There is a natural simplicial structure • on the union Θλ of these sets. The face operators di; i =0, ..., n being given by the following formulas:

d0(γ0, ..., γn)=(γ1, ..., γn)

di(γ0, ..., γn)=(γ0, ..., γi−1,γi+1..., γn)if 0

dn(γ0, ..., γn)=(γ0, ..., γn−1).

• By definition, the geometric realization of Θλ is named the space of theories at λ or localized at λ. Its homotopy type is named the algebraic homotopy type of theories, also at λ.

Remind that a simplicial set K is a presheaf over the category ∆, with objects N and mor- phisms fro`u m to n, the non decreasing maps from [m]= {1, ..., m} to [n]= {1, ..., n}. The geometric realization |K| of a simplicial set K is the topological space obtained by quoti- n+1 enting the disjoint union of the products Kn × ∆(n), where Kn = K([n]) and ∆(n) ⊂ R is the geometric standard simplex, by the equivalence relation that identifies (x, ϕ∗(y)) and ∗ (ϕ (x),y) for every nondecreasing map ϕ :[m] → [n], every x ∈ Kn and every y ∈ ∆(m); ∗ n+1 here f is K(f) and f∗ is the restriction to ∆(n) of the unique linear map from R to m+1 R that sends the canonical vector ei to ef(i). In this construction, for n ∈ N, Kn is equipped with the discrete topology and ∆(n) with its usual topology, then compact, the topology on the union over n ∈ N is the weak topology, i.e. a subset is closed if and only if its intersection with each closed simplex is closed, and the realization is equipped with the quotient topology, the finest making the quotient map continuous. In particular, even it is not evident at first sight, the realization of the simplicial set ∆k is the standard simplex ∆(k).

K n N Suppose given a countable commutative ring . We consider the rings Φλ; n ∈ of (mea- n K surable) functions on the respective Θλ with values in . • The above simplicial structure gives a differential complex on the graded sum Φλ of the n N Φλ; n ∈ , with the simplicial (or combinatorial) co-boundary operator

n γ0|···|γn i γ0|···|γbi|···|γn (δλφ)λ = X(−1) φ . (3.85) i=0 We call algebraic co-cycles the elements in the .

As we have seen, the arrows γQ ∈Dλ can by multiplied, using the operation ∧ on propositions in Aλ, and this defines an action of monoid on Θλ by the conditioning operation. Therefore we can define the homogeneous functions or homogeneous algebraic co-chains of degree n ∈ N γ0;γ1;...;γn as the (measurable) functions φλ on Θλ, such that for any γQ in Dλ, abutting in (U, ξ, Q), for P ≤ Q, and any T ∈ Θλ, thus excluding P ,

γQ∧γ0;γQ∧γ1;...;γQ∧γn γ0;γ1;...;γn φλ (T )= φλ (T |Q). (3.86)

The above operator δλ preserve the homogeneous algebraic co-chains. The kernel restriction of δλ defines the homogeneous algebraic co-cycles.

58 ′ ′ ′ ′ γ0|γ1|...|γn γ0|γ1|...|γn A morphism γ : λ → λ naturally associates φλ with φλ′ through the formula

γ0|γ1|...|γn ∗ ′ γ∗γ0|γ∗γ1|...|γ∗γn ′ φλ (π T )= φλ′ (T ). (3.87)

∗ n Then the hypothesis π π∗ = IdU ′,ξ′ permits to define a co-sheaf Φλ; λ ∈ D over D, not a sheaf, by γ0|γ1|...|γn γ∗γ0|γ∗γ1|...|γ∗γn (Φ∗φλ′ ) (T )= φλ′ (π∗T ). (3.88) However the first equation (3.87) is more precise, and we take it as definition of natural algebraic co-chains.

∗ ′ Remark: we cannot consider a sheaf because we lack a definition of γ γi.

The operation of conditioning preserves the naturality, in reason of the following identity, ′ ′ n involving γ : λ → λ , γQ ∈Dλ, S ∈ Θλ:

∗ ′ ∗ ′ πγ [S |γ∗(γQ)] = πγ S |γQ. (3.89) Therefore we can speak of natural homogeneous algebraic co-cycles.

γ0 For n = 0, the co-chains are collections of functions ψλ of the theories in Aλ such that

γQ∧γ0 γ0 ψλ (S)= ψλ (S|Q), (3.90) and such that, for any morphism γ : λ → λ′,

γ0 ∗ ′ γ∗γ0 ′ ψλ (πγ T )= ψλ′ (T ). (3.91)

⊤ From the first equation, we can eliminate γ0. We define ψλ = ψλ , and get

γQ ψλ (S)= ψλ(S|Q). (3.92) The second equation, with the transport of truth, is equivalent to

∗ ′ ′ ψλ(πγ T )= ψλ′ (T ). (3.93)

A co-cycle corresponds to a collection of constant cλ, which are natural, then to the functions of the connected components of the category D. Thus we recover the same notion as in the preceding section.

γ0;γ1 In degree one, the homogeneous co-chain φλ cannot be a priori expressed through the γ0 γ0;⊤ collection of functions ϕλ = φλ , but, if it is a co-cycle, it can:

γ0;γ1 γ0 γ1 φλ = ϕλ − ϕλ ; (3.94) as this follows directly from the algebraic co-cycle relation applied to [γ0|γ1|⊤λ].

But we also have, by homogeneity

Q.ϕγQ = Q.φγQ|⊤ = φγQ∧γQ|γQ∧⊤ = φγQ|γQ = ϕγQ − ϕγQ =0. (3.95)

Then, the homogeneity equation gives the particular case

Q∧Q Q∧Q γ γ γ ϕ O − ϕ = Q.ϕ QO − Q.ϕ Q = Q.ϕ QO , (3.96)

59 therefore Q∧Q γ γ ϕ O = ϕ Q + Q.ϕ QO ; (3.97) which is the co-cycle equation we discussed in the preceding section, under the form of Shan- non.

Remark: all that generalizes to any degree, in virtue of the comparison theorem between projective resolutions, proved in the relative case in MacLane ”Homology”, or in SGA 4, more generally, because the above homogeneous bar complex and in-homogeneous bar com- plex are such resolutions of the constant functor K.

Semantic Kullback-Leibler distance In Baudot, Bennequin 2015 [BB15], it was also shown that the Kullback-Leibler distance ′ (or divergence) DKL(X; P; P ) between tow probability laws on a random variable X is a co-homology class in the above sense. The co-chains depend on a sequence P0, ..., Pn of probabilities and a sequence of variables X0, ..., Xm less fine than a given variable X; the conditioning the n + 1 laws by the value y of a variable Y ≥ X is integrated over Y∗P0, for giving an action on the set of measurable functions of the n = 1 laws, then the homogeneity is defined as before, and the co-boundary is the standard combinatorial one, as before. For n = 1, the universal degree one class is shown to be the difference of divergences. Remind that the K − L divergence is given by the formula ′ P P′ pi DKL(X; ; )= − X pi log . (3.98) pi xi In our present case, we consider functions of n + 1 theories and m + 1 propositions, all works as for n = 0. In degree zero, the co-chains are defined by functions ψλ(S0,S1) satisfying ∗ ′ ∗ ′ ′ ′ ψλ(πγ S0; ...; πγ Sn)= ψλ′ (S0; ...; Sn), (3.99) for any morphism γ : λ → λ′. The formula for the homogeneous co-chain is

γQ ψλ (S0; ...; Sn)= ψλ(S0|Q; ...; Sn|Q). (3.100) The non-homogeneous zero co-cycles are the functions of P only, invariant by the transport π∗. Q In degree one, the co-cycles are defined by any function ϕλ (S0; ...; Sn) which satisfies

Q ∗ ′ ∗ ′ π∗(Q) ′ ′ ϕλ (πγS0; ...; πγSn)= ϕλ′ (S0; ...; Sn), (3.101) for any morphism γ : λ → λ′, and verifies the co-cycle equation

Q∧R Q R ϕλ (S0; ...; Sn)= ϕλ (S0; ...; Sn)+ ϕλ (S0|Q; ...; Sn|Q). (3.102) The homogeneous co-cycle associated to ϕ is defined by

γQ0 ;γQ1 Q0 Q1 φλ (S0; ...; Sn)= ϕλ (S0; ...; Sn) − ϕλ (S0; ...; Sn). (3.103)

As for n = 0, there exists a function ψλ(S0; ...; Sn) such that for any Q ∈ Aλ, i.e. Q ≥ P , we have Q ϕλ (S0; ...; Sn)= ψλ(S0|Q; ...; Sn|Q) − ψλ(S0; ...; Sn). (3.104)

60 In the particular case n = 1, we can consider a basic real function ψλ(S), seen as a logarithm of theories as before, and define

ψλ(S0; S1)= ψλ(S0 ∧ S1) − ψλ(S0). (3.105)

If the function ψλ(S) is supposed increasing in S (for the relation of weakness ≤, as before), this gives a negative function. We obtain

Q φλ (S0; S1)= ψλ(S0 ∧ S1|Q) − ψλ(S0 ∧ S1) − ψλ(S0|Q)+ ψλ(S0). (3.106)

The positivity of this quantity is equivalent to the concavity of ψλ(S) on the pre-ordered set of theories. Assuming this property we obtain an analog of the Kullback-Leibler divergence. If ψλ(S) is strictly concave, that is the most convenient hypothesis, this function takes the value zero if and only if S0 = S1. Therefore it can be taken as a natural semantic distance, depending on the data of Q, as candidate from a counter-example of P . As in the case of DKL this function is not symmetric, then it could be more convenient to take the sum Q Q Q σλ (S0; S1)= φλ (S0; S1)+ φλ (S1; S0) (3.107) to have a good notion of distance between two theories.

Simplicial homogeneous space of histories of theories Another argument to justify the consideration of the homogeneity is the interest of taking a pushout of the theories.

The sheaf of monoidal categories Dλ over D acts in two manners on the algebraic space of • theories Θλ: γQ.(S ⊗ [γ0; ...; γn]) = (S|Q) ⊗ [γ0; ...; γn], (3.108)

γQ ∧ (S ⊗ [γ0; ...; γn]) = S ⊗ [γQγ0; ...; γQγn]. (3.109) • Then we can consider the colimit Θλ/D of these pairs of maps over all the arrows γQ, i.e. over Dλ: this co-limit is the disjoint union of the co-equalizers for each arrow. This is a quotient simplicial set. The homogeneous co-chains are just the (measurable) functions on this simplicial set.

This can be realized directly as a pushout, or co-equalizer, of a unique pair of maps, by • • taking the union Z of the products Θλ ×Dλ, and the two natural maps µ, ν to T =Θλ given by multiplication and conditioning respectively.

Remark that the two operations in (3.108) and (3.109) are adjoint of each other, then we can speak of adjoint gluing.

Also interesting is the homotopy quotient, taking into account that, geometrically, Z has a higher degree in propositions belonging to Dλ, due to the presence of γQ. This homotopy col- imit is the simplicial set Σ• obtained from the disjoint union (Z ×[0, 1])⊔(T ×{0})⊔(T ×{1}) by taking the identification of (z, 0) with µ(z) and of (z, 1) with ν(z). It can be named a ho- motopy gluing, because the arrows are used geometrically as continuous links between points

61 in T ×{0} and T ×{1}. The simplicial set Σ• is equipped with a natural projection onto the • ordinary co-equalizer Θλ/Dλ. See for instance Dugger [Dug08] for a nice exposition of this notion, and its interest for homotopical stability with respect to the ordinary colimit. Then we propose that a more convenient notion of homogeneous co-chains could be the functions on Σ•.

Similarly, we have two natural actions of the category D of arrows abutting to λ and issued from λ′: the first one being of the type

⊗(n+1) n Θλ′ ⊗Dλ → Θλ; (3.110) the second one of the type ⊗(n+1) n Θλ′ ⊗Dλ → Θλ′ . (3.111) They are respectively defined by the following formulas:

∗ ′ ∗ ′ γ (S ⊗ [γ0; ...; γn]) = (πγ S )λ ⊗ [γ0; ...; γn] (3.112)

The second one is ′ ′ γ γ γ∗(S ⊗ [γ0; ...; γn]) = S ⊗ [π∗ γ0; ...; π∗ γn] (3.113) They are both compatibles with the quotient by the actions of the monoids, then they define maps at the level of Σ•.

The natural co-chains are the functions that satisfy, for each γ : λ → λ′, the equation

∗ φλ ◦ γ = φλ′ ◦ γ∗. (3.114)

Note that no one of the above equations, for homogeneity and naturality, necessitates nu- merical values, but the second necessitates values in a constant set or a constant category, at least along the orbits of D. And it is important for us that the co-chains can take their values in a category M admitting limits, like Set or Top, non necessarily abelian, because our aim is to obtain a theory of information spaces in the sense searched by Carnap and Bar-Hillel in 1952.

n n ′ Define a set Θ1 (resp. Θ0 ) by the co-product, or disjoint union, over γ : λ → λ (resp. ⊗(n+1) n • λ) of the sets Θλ′ ⊗Dλ (resp. Θλ). When the integer n varies, we note the sum by Θ1 • (resp. Θ0). They are canonically simplicial sets. ∗ • • The collections of maps γ and γ∗ define two (simplicial) maps from Θ1 to Θ0, that we will denote respectively ̟ and ϑ, for past and future. The co-limit or co-equalizer of these • • two maps, is the quotient H0 of Θ0 by the equivalence relation

∗ ′ ′ γ γ (πγ S )λ ⊗ [γ0; ...; γn]λ ∼ Sλ′ ⊗ [π∗ γ0; ...; π∗ γn]λ′ . (3.115)

Once iterated over the arrows, this relation represents the complete story of a theory, from the source of its formulation in the network to the final layer. It is remarkably conform to the notion of cat’s manifold, and to the possible presence of inner sources in the network.

62 Remark that the two operations in (3.112) and (3.113) are also adjoint one each other, then again the corresponding colimit can be named an adjoint gluing.

Remark: the above equivalence relation is more fine than the relation we would have found with the co-variant functor, i.e.

γ γ γ (π∗ S)λ′ ⊗ [π∗ γ0; ...; π∗ γn]λ′ ∼ Sλ ⊗ [γ0; ...; γn]λ; (3.116) ′ because this relation is implied by the former, when we applied it to S = π∗S, in virtue of ∗ our hypothesis π π∗ = Id. ∗ The two relations ar equivalent if and only if π∗π = Id, that is the case of isomorphic logics among the network.

• We define the natural co-chains as the (measurable) functions on H0, and the natural ho- • mogeneous co-chains as the functions on the quotient H0/D by the identification of junction with conditioning. And we are more interested in the homogeneous case.

However, in a non-abelian context, the stability under homotopy will be an advantage, therefore we also consider the homotopy colimit of the maps ̟ and ϑ, or homotopy gluing • between past and future, and propose that this colimit I0 (or hoI if we reserve I for the usual colimit) is a better notion of the histories of theories in the network. It is also naturally a simplicial set. Then the natural homotopy homogeneous co-chains will be functions on the homotopy gluing hoI.

• The homotopy type of the theories histories I0 itself is an interesting candidate for repre- senting the information, and information flow in the network. For instance, its connected components gives the correct notion of zero-cycles, and the functions on them are zero-cocycles. The abelian construction is sufficient to realize these cocycles. • We will later consider functions from the space I0 to a closed model category M, their homotopy type in the sense of Quillen can be seen as a non-abelian set of co-cocycles.

What we just have made above for the co-chains (homogeneous and/or natural) is a particular case of a homotopy limit.

The notion of homotopy limit was introduced in Bousfield-Kan 1972, [BK72, chapter XI] where it generalized the classical bar resolution in a non-linear context, cf. MacLane’s book ”Homology” [Mac12]. The authors attributed its origin to Milnor, in the article ”On ax- iomatic homology theory”, 1962, [Mil62]. For this notion and more recent developments, cf. Hirschhorn, 2003, 2014, [Hir03], Hirschhorn, Dwyer et al. [DHKS04], or the preprint of Dugger [Dug08].

• • In this spirit, we extend now the two maps ̟, ϑ from Θ1 to Θ0, in higher degrees, by using the nerve of the category D.

The nerve N = N (D) of the category D is the simplicial set made by the sequences A of succeeding arrows in D. For k ∈ N, Nk is the set of sequences of length k. A sequence is written (δ1, ..., δk), where δi; i = 1, ..., k goes from λi−1 to λi in D. We use the symbols ∗ δi , or the letters γi when there is no ambiguity, for the arrow δi considered in the opposite op ′ category D = Astrict; this reverse the direction of the sequence, going now upstream. When e 63 necessary, we write δi(A),λi−1(A), ..., for the arrows and vertices of a chain A. For k ∈ N, we define Θn as the disjoint union over A =(δ , ..., δ )of the sets Θ ⊗D⊗(n+1). k 1 k λ0 λk Thus the theory is attached to the beginning in the sense of D, and the involved propositions are at the end. The chain in D goes in the dynamical direction, downstream. When the • integers n and k vary, we note Θ∗ the sum (disjoint union). This is a bi-simplicial set.

n n We have k + 1 canonical maps ϑi; i = 1, ..., k +1 from Θk+1 to Θk . Each map deletes a vertex, moreover at the extremities it also deletes the arrow, and inside the chain, it com- ∗ poses the arrows at i−1 and i. In λ0, the map πγ1 is applied to the theory, to be transmitted γk+1 downstream, and in λk+1, the map π∗ is applied to the n + 1 elements γQj in Dλk+1 , to be transmitted upstream.

By analogy with the definition of the homotopy colimit of a diagram in a model category cf. references upcit, we take for a more complete space of histories, the whole geometric • realization of the simplicial functor Θ∗, seen now as a simplicial space with the above skeleton in degree k, and the above gluing maps ϑi. The expression gI denotes this space, that we understand as the geometrical space of complete histories of theories. The extension of information over the nerve incorporates the topology of the categories C, F, D. The degree n was for the logic, the degree k is for its transfer through the layers. gI, or its homotopy type, represents for us the logical part of the available information; it takes into account

1) the architecture C,

2) the pre-semantic structure, through the fibration F over C, which constrains the pos- ∗ sible weights, and also generates the logical transfers π , π∗, 3) the terms of a language through A, and the propositional judgements through D and Θ. e

What is missing is the dynamic and the data; they will be given by the semantic functioning Sw : Xw → Θ; needing an intermediary, a notion of co-cycles of information.

The information appears as a tensor F γ0,...,γn (S). A priori they take their values in the δ1,...,δk category M, that can be Set or Top.

The points in gI are classes of elements

u = S ⊗ [γ0, ..., γn] ⊗ [δ1, ..., δk](t0, ..., tn; s1, ..., sk) (3.117) where the ti; i =0, ..., n and sj; j =1, ..., k are respectively barycentric coordinates in ∆(n) and ∆(k − 1).

It is tempting to interpret the coordinates ti as weights, or values, attributed to the propo- sitions Qi, and the numbers sj as times, conduction times perhaps, along the chain of mappings.

Therefore we see the tensor F as a local system Fu; u ∈ gI over gI.

64 Simplicial dynamical space of a DNN, information content Considering a semantic functioning S : X → Θ, we can enrich it by the choice of proposi- tions in each layer U and context ξU (or better collections of elements of Dλ), and consider sequences over the networks, relating activities and enriched theories. Then, for each local activity, and each chain of arrows in the network, equipped with propositions at one end (downstream), the function F gives a space of information.

More precisely, we form the topological space of activities gX, by taking the homotopy col- imit of the object X, fibred the object W, in the classifying topos of F, lifted to D, and seen as a diagram over D. This space is defined in the same manner gI∗ was defined from Θ∗ over D; it is the geometric realization of the simplicial set gX∗, whose k-skeleton is the sum of the pairs (Ak, xλ) where A is an element of length k in N (D) and xλ an element in Xλ, at the origin of A in D. The degeneracies di; i = 1, ..., k +1 from Xk+1 to Xk are given for 1

Then we can ask for an extension of the semantic functioning to a continuous or simplicial map gS : gX → gI. (3.118) This implies a compatibility between dynamical functioning in X and logical functioning in Θ. However, this map factorizes by a quotient, that can be small, when the semantic functioning is poor. It is only for certain regions in the weight object W, giving itself a geometrical space gW, that the semantic functioning is interesting.

Given F : gI → M, this gives a map F ◦ gS from gX to M, that can be seen as the information content of the network. To have a better analog on the abelian quantities, we suppose that M is a closed model category, and we pass to the homotopy type

ho.F ◦ gS : gX → hoM. (3.119)

For real data inputs and spontaneous internal activities, this gives a homotopy type for each image.

Q For instance, the degree one homogeneous co-cycle φλ (S) deduced from a precision func- tion ψλ(S) with real values, is replaced by a map to topological spaces, associated to some ”propositional” paths between two points of gI, the degree two combinatorial co-cycles, as the mutual information, by a varying space associated to a ”propositional” triangle, up to homotopy.

Non-abelian inhomogeneous fundamental cochains and cocycles. A tentative

Q0 Q Remember that the fundamental zero cochain ψλ with real coefficients, satisfied ψλ (S) = ψλ(S|Q) ≥ ψλ(S). Then, in the non-linear context, it is tempting to assume the existence in M of a class of morphisms replacing the inclusions of the sets, namely co-fibrations, and to generalize the increasing of the function ψλ of S, by the existence of a co-fibration,

65 F (S) ֌ F (S|Q), or more generally a co-fibration F (S) ֌ F (S′) each time S ≤ S′. This is sufficient for defining an object of ambiguity, then an information object (non- homogeneous), by generalizing the relation between precision and ambiguity of the abelian case: HQ(S)= F (S|Q)\F (S); (3.120) where the collapse to a point is taken in the homotopical sense. All that works if we suppose that M is a closed model category of Quillen.

This invites us to suppose that F is co-variant under the action of the monoidal categories Dλ, i.e. for every arrow γQ in Dλ, and every theory S in Θλ, there exists a morphism F (γQ; S): F (S) → F (S|Q) in M, and for two arrows γQ, γQ′ ,

F (γQ′ γQ; S)= F (γQ′ ; S|Q) ◦ F (γQ; S). (3.121)

Then we assume that every F (γQ; S) is a co-fibration.

Q In the same manner, the generalization of the concavity of the real function ψλ , is the hypothesis that for two arrows γQ, γQ′ , there exists a co-fibration of the quotient objects H:

H(Q, Q′; S): HQ(S|Q′) ֌ HQ(S). (3.122)

′ ′ The same thing happening for HQ (S|Q) ֌ HQ (S). The difference space is the model category version of the mutual information between Q and Q′: by definition ′ Q Q⊗Q′ Q′ I2(Q; Q )= H \[H \H ], (3.123) or in other terms, ′ ′ ′ I2(Q; Q )=(Q.F \F )\[(Q ⊗ Q )F \Q .F ], (3.124) ′ Reasoning on subsets of HQ⊗Q , this gives the symmetric relation

′ Q Q′ I2(Q; Q ) ∼ H ∩ H . (3.125)

The general concavity condition is the existence of a natural cofibration HQ(S′) ֌ HQ(S) as soon as there is an inclusion S ≤ S′.

This stronger property of concavity for the functor F implies in particular, for any pair of theories S0,S1, the existence of a co-fibration

Q Q JQ(S0; S1): H (S0) → H (S0 ∧ S1). (3.126)

This allows to define a homotopical notion of Kullback-Leibler divergence space in M, be- tween two theories falsifying P , at a proposition Q ≥ P :

Q Q Q D (S0; S1)= H (S0 ∧ S1)\F∗H (S0). (3.127)

66 Comparison between homogeneous and inhomogeneous non-abelian co-chains and co-cycles To be complete, we have to relate these maps F, H, I, D, ... from theories and constellations γ0,...,γn of propositions to M with the homogeneous tensors Fδ1,...,δk (S). For that, the natural idea is to follow the path we had described from the homogeneous abelian bar-complex to the non- homogeneous one, at the beginning of this section. This will give a homotopical/geometrical version of the MacLane comparison in homological algebra.

• We consider the the bi-simplicial set I∗ as a simplicial set I∗ in the algebraic exponent n for •, then it is a contravariant functor from the category ∆ to the category of simplicial sets ∆Set. The morphisms of ∆ from [m] to [n] are the non-decreasing maps, their set is noted ∆(m, n).

Our hypothesis is that the above tensors form a co-simplicial local system Φ with values in the category M over the simplicial presheaf I∗, in the sense of the preprint Extra-fine sheaves and interaction decompositions, D.B., O. Peltre, G. Sergeant-Perthuis, J-P. Vigneaux, 2020, [BPSPV20]. In an equivalent manner, we consider the category T = S(I∗) whose objects are the simpliciaal cells u of I∗ and arrows from v of dimension n to u of dimension m are the non-decreasing maps ϕ ∈ ∆(m, n) (morphisms in the category ∆) such that ϕ∗(v)= u. Here the map ϕ∗ is simplicial in the index k for ∗, concerning the nerve complex of D; then the co-simplicial local system is a contra-variant functor from T to M. All that is made to obtain a non-abelian version of the propositional (semantic) bar-complex. Following a recent trend, we name spaces the elements of M.

We add that an inclusion of theories S ≤ S′, gives a co-fibration Φ(S′) ֌ Φ(S), in a func- torial manner over the poset of theories.

Let us repeat the arguments to go from homogeneous co-chains or co-cycles to non-homogeneous ones.

First, a zero-cochain is defined over the cells Sλ ⊗ [γ0], where the arrow γ0 abuts in a propo- sitions Q0 ≥ P . The associated non-homogeneous space F (S) corresponds to Q0 = ⊤. The relation between conditioning and multiplication gives the way to recover ΦQ0 (S).

Second, we name degree one homogeneous co-cycle a sheaf of spaces Φ[γ0,γ1](S), over the one ∗ [γ0,γ1] skeleton of ϕ , which satisfies that for the triangle [γ0, ⊤,γ1], the space Φ is homotopy equivalent to the difference of the spaces Φ[γ0,⊤] and Φ[γ1,⊤]. Remark: more generally a degree one co-cycle should satisfies this axiom for every zigzag γ0 ≤ γ 1 ≥ γ1. 2 This definition supposes that we have a notion of difference in M, satisfying the same prop- erties that the difference A\(A ∩ B) satisfies in subsets of set. If all the theories considered contain a minimal one, then spaces are subspaces of a given space, and this hypothesis has a meaning. However, this is the case in our situation, considering the sets ΘP , because we consider only propositions Q, Q0, Q1, ... that are impled by P .

To the degree one co-cycle Φ[γ0,γ1](S) we associate the space Hγ0 (S) = Φ[γ0,⊤](S), obtained γ1 by replacing γ1 by ⊤. The space G (S) is obtained by replacing γ0 by ⊤ in Φ.

67 Note the important point that H and G are in general non-homogeneous.

Applying the definition of 1-co-cycle to the triangle [γ0, ⊤,γ1], we obtain that

Φ[γ0,γ1](S) ∼ Hγ0 (S) \ Hγ1 (S). (3.128)

Lemma 3.6. The co-cyclicity of Φ implies

Q.HQ ∼ HQ⊗Q\HQ. (3.129)

Proof. Q.HQ = Q.ΦQ|⊤ = ΦQ⊗Q|Q⊗⊤ = ΦQ⊗Q|Q = HQ⊗Q\HQ. (3.130) 

From that we deduce immediately:

Proposition 3.8. The homogeneity of Φ implies

′ ′ HQ⊗Q \HQ⊗Q ∼ Q.HQ \[HQ⊗Q\HQ]. (3.131)

Proof. ′ ′ ′ HQ⊗Q \HQ⊗Q = Q.HQ \Q.HQ ∼ Q.HQ \[HQ⊗Q\HQ]. (3.132) 

In the abelian case or ordinary difference this is equivalent to

′ ′ HQ⊗Q ∼ Q.HQ ∪ HQ. (3.133)

This is the usual Shannon equation, then (3.131) can be seen as a non-abelian Shannon equation. Taking homotopy in Ho(M) probably gives a more intrinsic meaning of semantic information.

It is natural to admit that at the level of information spaces, HQ⊗Q ∼ HQ. Under this hypothesis, we have the usual Shannon’s formula under the form

′ ′ HQ⊗Q \HQ ∼ Q.HQ . (3.134)

That is, for every theory S falsifying P :

′ ′ HQ⊗Q (S)\HQ(S) ∼ HQ (S|Q). (3.135)

′ Remind there is no reason a priori that HQ ֌ HQ⊗Q . Then the above difference is after intersection.

If F is any non-homogeneous zero-cochain, we have a co-fibration F ֌ Q.F , where Q.F (S)= F (S|Q). In this case we already defined a space HQ by

HQ(S)= F (S|Q)\F (S). (3.136)

Proposition 3.9. HQ automatically satisfies equation (3.131).

68 Proof. we have F ֌ (Q ⊗ Q′)F and F ֌ (Q ⊗ Q)F , then

′ HQ⊗Q \HQ⊗Q = ((Q ⊗ Q′)F \F )\((Q ⊗ Q)F \F ) ∼ (Q ⊗ Q′)F \(Q ⊗ Q)F.

Using F ֌ Q.F ֌ (Q ⊗ Q)F , and assuming Q.F ֌ (Q ⊗ Q′)F , we get

′ Q.HQ \[HQ⊗Q\HQ]= Q.(Q′F \F )\[((Q ⊗ Q)F \F )\(Q.F \F )] =(Q ⊗ Q′)F \Q.F )\[(Q ⊗ Q)F \Q.F ] ∼ (Q ⊗ Q′)F \(Q ⊗ Q)F.

Therefore, as wanted, ′ ′ HQ⊗Q \HQ⊗Q ∼ Q.HQ \[HQ⊗Q\HQ]. (3.137) 

′ We also had suggested above to define the mutual information I2(Q; Q ) associated to a ′ Q ′ Q co-cycle H by the formula I2(Q : Q )= H \Q .H . The restricted concavity condition on H is the existence of a natural cofibration Q′.HQ ֌ HQ.

Remark: this goes in the reverse direction than for F : more precise is the theory S, bigger is HQ(S), i.e. S ≤ S′ implies HQ(S′) ֌ HQ(S).

′ ′ We assume also that for all pair Q, Q′ we have HQ⊗Q ∼ HQ ⊗Q.

′ ′ Proposition. under the above hypothesis and the assumption that HQ⊗Q ∼ HQ and HQ ⊗Q ∼ ′ ′ ′ HQ , we can consider HQ and HQ as subsets of HQ⊗Q , and we have

′ ′ Q Q′ I2(Q; Q )= I2(Q ; Q)= H ∩ H . (3.138)

′ ′ ′ ′ Proof. The Shannon formula (3.135) tells that Q.HQ is HQ⊗Q \HQ and Q′.HQ is HQ ⊗Q\HQ , then ′ Q Q⊗Q′ Q′ Q Q′ I2(Q; Q )= H \[H \H ] ∼ H ∩ H . (3.139) 

Remark. We cannot write the relation with the usual union, but, under the above hypotheses, there is a co-fibration ′ ′ j ∨ j′ : HQ ∨ HQ ֌ HQ⊗Q , (3.140) giving rise to a quotient ′ ∼ Q Q′ I2(Q; Q ) = H ×HQ⊗Q′ H . (3.141) Generalizing the suggestion of Carnap and Bar-Hillel, and a Shannon theorem in the case of probabilities, we propose, to tell that Q, Q′ are independent (with respect to P ) at the theory ′ S, when HQ ∩ HQ is empty (initial element of M).

With I2, we can continue and get a semantic version of the synergy quantity of three variables:

I3(Q1; Q2; Q3)(S)= I2(Q1; Q2)(S)\I2(Q1; Q2)(S|Q3). (3.142)

However, there is no reason why it must be a true space, because in the abelian case it can be a negative number; cf. Baudot et al. 2019 [BTBG19] for the relation with the Borromean

69 links.

Remark: this invites us to go to Ho(M), where there exists a notion of relative objects: for a zigzag A և C ֌ B, with a trivial fibration to the left, and a co-fibration to the right, the deduced arrow A → B in Ho(M), can be considered as a kind of difference of spaces; cf. Jardine, Cocyle categories, 2009, [Jar09], and Zhen Lin Low, Cocycles in categories of fibrant objects, 2015 [Low15]. Before Quillen and Jardine this kind of homotopy construction was introduced by Gabriel and Zisman, 1967 [GZ67], as a calculus of fraction, in the framework of simplicial objects, their book being the first systematic exposition of the simplicial theory.

With respect to the Shannon information, what is missing is an analog of the expectation of functions over the states of the random variables. In some sense, this is replaced by the properties of growing and concavity of the function ψ, or spaces F and H, which give a manner to compare the theories. The true semantic information is not the value attributed to each individual theory, it is the set of relations between these values, either numerical, • either geometric, as expressed by functors over the simplicial space gI∗ , or better, more practical over the part of ot that is accessible to a functioning network gX.

2 The example of the theory L3 of Carnap and Bar-Hillel.

Let us try to describe the structure of Information, as we propose it, in the simple ex- ample that was chosen for development by Carnap and Bar-Hillel in their report in 1952, [CBH52].

π The authors considered a language Ln with n subjects a, b, c, ... and π attributes of them A, B, ..., having certain numbers of possible values, respectively πA, πB, .... In their devel- oped example n = 3, π = 2 and every πi equals 2. The subjects are human persons, the two attributes are the gender G, male M or female F , and the age A, old O or young Y . The elementary, or ultimate, states, e ∈ E of the associated Boolean algebra Ω = ΩE are given by choosing values of all the attributes for all the subjects. For instance, in the 2 3 language L3, we have 3 = 64 elementary states. The proposition P, Q, R, ... are the subsets of Ω, their number is 264. The theories S, T, ... in this case are also described by their initial assertion, that is the truth of a certain propo- sition, also named S, T, .... With our conventions, for conditioning and information spaces or quantities, it appears practical to define the propositions by the disjunction of their elements eI = ei1 ∨ ... ∨ eik and the theories by the conjonction of the complementary sets ¬ei = Si, that is SI =

(¬ei1 ) ∧ ... ∧ (¬eik ). Experimentally (cf.[BBG20]) the theories exclude something, like P , i.e. contain ¬P , then with SI we see that P = eI is excluded, as are all the eij for 1 ≤ j ≤ k. A proposition Q which is implied by P , corresponds to a subset which contains all the ele- mentary propositions eij for 1 ≤ j ≤ k.

In what follows, the models of ”spaces of information” that are envisaged are mainly groupoids, or sets, or topological spaces. A zero co-chain FP (S) gives a space for any theory excluding P , in a growing manner, in the sense that S ≤ S′ (inclusion of sets) implies F (S) ≤ F (S′). The co-boundary δF = H, Q gives a space HP (S) for any proposition Q such that P ≤ Q, whose formula is

Q HP (S)= FP (S ∨ ¬Q)\FP (S). (3.143)

70 By concavity, this function (space) is assumed to be decreasing with S, i.e. if S ≤ S′,

Q Q ′ HP (S) ֋ HP (S ). (3.144)

And by monotonicity of F , it is also decreasing in Q, i.e. if Q ≤ Q′,

Q Q′ ′ HP (S) ֋ HP (S ). (3.145)

In particular, we can consider the smaller FP (S) that is FP (⊥), as it is contained in all the spaces FP (S), we choose to take it as the empty space (or initial object in M), then

Q HP (⊥)= FP (¬Q). (3.146)

As we saw in general for every one-cocycle, not necessarily a co-boundary, we have for any pair Q, Q′ larger than P ,

Q∧Q′ Q′ Q ′ Q ′ HP (S)\HP (S) ≈ HP (S|Q )= HP (S ∨ ¬Q ). (3.147)

Therefore, in the boolean case, every value of H can be deduced from its value on the empty theory: Q ′ Q∧Q′ Q′ HP (¬Q ) ≈ HP (⊥)\HP (⊥). (3.148) Q Q We note simply HP (⊥)= HP = FP (¬Q). And they are the spaces to determine.

The localization at P (i.e. the fact to exclude P ) consists in discarding the elements ei belonging to P from the analysis. Therefore we begin by considering the complete situation, which corresponds to P = ⊥. Q Q In this case we note simply H = H⊥ = F (¬Q).

There exists a Galois group G of the language, generated by the permutation of the n subjects, the permutations of the values of each attribute and the permutations of the attributes that have the same number of possible values. To be more precise, we order and label the subjects, the attribute and the values, with triples xYi. In our example, x = a, b, c, Y = A, G, i =1, 2, the group of subjects permutation is S3, the transposition of values are σA =(A1A2) and σG =(G1G2), and the four exchanges 3 −1 of attributes are σ = (A1G1)(A2G2), κ = (A1G1A2G2), κ = κ = (A1G2A2G1), and τ =(A1G2)(A2G1). We have 2 σA ◦ σG = σG ◦ σA =(A1A2)(G1G2)= κ ; (3.149) −1 σ ◦ σA = σG ◦ σ = κ; σA ◦ σ = σ ◦ σG = κ ; (3.150)

σA ◦ σ ◦ σG = τ; σA ◦ τ ◦ σG = σ (3.151)

The group generated by σ, σA, σG is of order 8; it is the dihedral group D4 of all the isometries of the square with vertices A1G1, A1G2, A2G2, A2G1. The stabilizer of a vertex is a cyclic A A group C2, of type σ or τ, the stabilizer of an edge is of type σA or σG, noted C2 or C2 .

2 S Therefore, in the example L3, the group G is the product of 3 with a dihedral group D4.

In the presentation given by the present article, the language L is a sheaf over the cate- gory G, which plays the role of the fiber F. We have only one layer U0, but the duality of

71 propositions and theories corresponds to the duality between questions and answers respec- tively.

The action of G on the set Ω is deduced from its action on the set E, which can be described as follows: 1) One orbit of four elements, where a, b, c have the same gender and age. The stabilizer of each element is S3 × C2, or order 12. 2) One orbit of 24 elements made by a pair of equal subjects and one that differs from them by one attribute only. The stabilizer being the S2 of the pair of subjects. 3) One orbit of 12 elements made by a pair of equal subjects and one that differs from them by the two attributes. The stabilizer being the product S2 × C2, where C2 stabilizes the characteristic of the pair, which is the same as stabilizing the character of the exotic subject. 4) One last orbit of 24 elements, where the three subjects are different, then two of them differ by one attribute and differ from the last one by the two attributes. The stabilizer ′ is the stabilizer C2 of the missing pair of values of the attributes. Ansatz 1. The maximal spaces He = F (¬e) are given by 64 subspaces of H = H⊥, which are divided in four orbits of isomorphic spaces, permuted by G. Then there is four types of maximal spaces. The form of the space He must be deduced from the stabilisation group, noted Ge, which is also named the inertia subgroups: S3 × C2 in the type I; S2 ⊂ S3 in the type II; S2 × C2 ′ in the type III; and C2 in the type IV . The action of G on the set E corresponds to the conjugation of the inertia subgroups.

It is natural to take for the total space H = H⊥ the group G itself, or its product by a space counting the dimension of the problem. This will permit to compare the information in different systems.

All that looks like Galois theory, however there exists subgroups of G, even normal sub- groups, that cannot happen as stabilizers in the language, without adding terms or concepts. For instance, the cyclic group A3 ⊂ S3; if it stabilizes a proposition P , this means that the subjects appear in complete orbits of A3, but these orbits are orbits of S3 as well, then the stabilizer contains S3. The notion of ordering is missing.

The collection of all the ultimate states of one type defines a proposition, noted T , describing I,II,III,IV . This proposition has for stabilizer the group G itself. According to the above ansatz the space of information must have a form attached to G, but it also must take into account the structure of its elements. A natural choice is the connected groupoid with objects the elements e ∈ T and with isomorphisms their stabilizers. Then the choice of an element e gives an equivalence of ho- motopy between the space of the type and the fundamental group based in this element, that is the inertia Ge.

Remark. In some sense, a group is less ambiguous than a groupoid, connected, having this group as fundamental group. However, when an element e is chosen, the trivial co-fibration

72 goes from the group to the groupoid, not in the reverse direction. And a morphism HT → He cannot be a cofibration, that is injective on the objects in the cases we forecast.

However, each Ge comes with an embedding in G, as the groupoid HT . Therefore a solution to the above paradox, consists to define He as the groupoid HT with a marked object e. This tells that the information of e is the structure of its type plus a particular choice of conju- gate subgroup, or in other term, a homogeneous Klein space G/Ge, which has a marked point.

Ansatz 2. The information space of type T corresponds to the natural groupoid of type T , and the information space of the ultimate element e is the homogeneous space associated to the inertia subgroups Ge of type T .

Remark that each type corresponds to a well formed sentence in natural languages: type I is translated by ”all the subjects have the same attributes”; type II by ”all the subjects have the same attributes except one which differs by only one aspect”; type III ”one subject is opposite to all the others”; type IV ”all the subjects are distinguished by at least one attribute”. It is natural to describe the union of the types II and III by the sentence ”all the subjects have the same attributes except one”.

Remark that other propositions have non-trivial inertia, and evidently support interesting semantic information. The most important for describing the system are the numerical state- S A ments, for instance ”there exist two female subjects in the population”. Its inertia is 3×C2 .

By definition, a simple proposition is given by the form aA, telling that one given subject has one given value for one given attribute. There exist twelve such propositions, they are per- muted by the group G. The simple props form an orbit of the group G, of the type III above.

Amazingly, the set of the twelve simples is selfdual under the negation:

¬(aA)= aA, (3.152) where A denotes the opposite value.

A last but not least ingredient, introduced by Carnap and Bar-Hillel, is the mutual indepen- dency of the 12 simple propositions.

′ According to the definition of the spaces I2(Q, Q ), this implies:

Ansatz 3. The spaces of the simples are disjoint, the maximal elements are unions of them.

Illustration: associate to each e a trefoil knot, presented as a braid with three colored strands, corresponding to its simple constituents. Each subject corresponds to a strand, each pair of values A, G of the attributes to a color, red, blue, green and black for the vertices ofthe square, red and green and blue and black being in diagonal.

Problem: the unions of propositions, giving co-fibrations to the propositions, correspond to mixtures of colors. We see no further forms to represent them, then combinatorics and

73 numbers enter the structure. The ”or” of several propositions describe a simplex having these propositions as vertices.

Ansatz 4. The simple propositions have no shape. Their unions are counted as logical values, from the ordinary content, as we described by the numerical function ψ, when discussing Bar-Hillel and Carnap theory.

This concerns propositions that are complex and not used in natural languages; example: ”in this population, there is two old mans, or there is a young woman, or there exist a woman that has the same age of a man”. This is pure logical calculus, not really semantic.

We are faced to the problem of combining the forms given by the groups and groupoids, as for HT and He, or for numerical statements, and the combinatorial counting of informa- tion. A suggestion is to represent the combinatorial aspect by a dimension: all propositions are ranged by their numerical content, for instance e has c(e) = 63, ¬e has c = 1, and aA has c = 58. We represent the groups and groupoids by CW complexes of dimension 2, associated to a presentation by generators and relations of their fundamental group, possibly marked by several base points. The spaces of information HQ are obtained by thickening the complexes, by taking the product with a simplex or a ball of the dimension corresponding to Q. However, note that any manner to code this dimension by a number, for instance, connected components, would work as well.

Then, on this simple example we see that ”spaces” of semantic information are more interesting and justified than numerical estimations, but also that this justification concerns only a few propositions, which seem too have more sense. Then the structure of spaces has to be completed by calculus and combinatorics for most of the 264 sentences. This touches the sensitive departure point from the admissible sentences, more relevant to Shannon theory, and the significant sentences, more relevant for a future semantic theory, that we hope to find in the direction of homotopy invariants of spaces of theories and questions.

74 Chapter 4

Unfoldings and memories, LSTM and GRU

4.1 RNN lattices, LSTM cells

Artificial networks for analyzing or translating successions of words, or any timely ordered set of data, have a structure in lattice, which generalizes the chain: the input layers are arranged in a corner: horizontally x1,0, x2,0, ..., named data, vertically h0,1, h0,2, ..., named hidden memories. Generically, there is a layer xi,j for each i =1, 2, ..., N, j =0, 1, 2, ..., M, and a layer hi,j for each i =1, 2, ..., N, j =0, 1, 2, ..., M. The information of xi,j−1 and hi−1,j are joined in a layer Ai,j, which sends information to xi,j and hi,j. Then in our representation, the category CX has one arrow from xi,j to Ai,j, from hi,j to Ai,j, from xi,j−1 to Ai,j and from hi−1,j to Ai,j, and it is all. Cf. figure 1.1. If we want, we could ∗ add the layers Ai,j, but there is no necessity. The output is generally a up-right corner horizontally y1 = x1,M , y2 = x2,M , ..., named the result (a classification or a translation), and vertically hN,1, hN,2, ..., (which could be named future memories).

However, the inputs and outputs can have the shape of a more complex curves, transverse to vertical and horizontal propagation. Things are organized as in a two dimensional Lorentz space, where a space coordinate is xi,j−1 − hi−1,j and a time coordinate xi,j−1 + hi−1,j. Input and output correspond to spatial sections, related by causal propagation.

Remark. In many applications, several lattices are used together, for instance a sentence or a book can be read backward after translation, giving reverse propagation, without trouble. We will discuss these aspects with the modularity. Most RNNs have a dynamic of the type 1D-non-linearity applied to a linear summation: we denote the vectorial states of the layers by greek letters ξ for layers x and η for layers h, a b like ξi,j and ηk,l; the lower indices denote the coordinates of the layer and the upper indices denote the neuron, that is the real value of the state. In most applications, as we will see, the basis of neurons plays an important role. a b In the layer Ai,j the vector of state is made by the pairs (ξi,j−1, ηi−1,j); a ∈ xi,j−1, b ∈ hi−1,j. The dynamic Xw has the following form:

a a a a′ a b′ a ξi,j = fx (X wa′;x,i,jξi,j−1 + X ub′;x,i,jηi−1,j + βx,i,j); (4.1) a′ b′

75 b b b a′ b b′ b ηi,j = fh(X wa′;h,i,jξi,j−1 + X ub′;h,i,jηi−1,j + βx,i,j). (4.2) a′ b′ The functions f are sigmoids or of the type tanh(Cx), the real numbers β are named bias, and the numbers w and u are the weights. In practice, everything here is important, the system being very sensitive, however theoret- ically, only the overall form matters, thus for instance we can incorporate the bias in the weights, just by adding a formal neuron in x or h, with fixed value 1. The weights are summarized by the matrices Wx,i,j, Ux,i,j, Wh,i,j, Uh,i,j. All these weights are supposed to be learned by back-propagation, or analog more general reinforcement.

Experiments during the eighties and nineties showed the strongness of the RNNs but also some weakness, in particular for learning or memorizing long sequences. Then Hochreiter and Schmidhuber, in a remarkable paper in Neural Computation 1997, [HS97], introduced a modification of the simple RNN, named the Long Short Term Memory, or LST M, which overcame all the difficulties so efficiently that more than thirty years after it continues to be the standard. The idea is to duplicate the layers h by introducing parallel layers c, playing the role of longer time memory states, and just called cell states, by opposition to hidden states for h.

In what follows we present the cell which replaces Ai,j without insisting on the lattice aspect, which is unchanged for many applications.

The sub-network which replaces the simple crux A = Ai,j is composed of five tanks ′ ′ ′ ′ A,F,I,H ,V , plus the inputs Ct−1,Ht−1,Xt−1, and has nine tips ct−1, ht−1, xt, f, i, o, h, vi, vf e plus the three outputs ct, ht,yt. However, yt being a function of ht only, it is forgotten in the analysis below. In A, the two layers h′ and x′ (where we forget the indices t − 1 and t respectively) join to give by formulas like (4.2) the four states of i, f, o, h respectively called input gate, forget gate, output gate, combine gate, the first three are sigmoidal,e the fourth one is of type tanh, indicating a function of states separations. The weights in these operations are the only parameters to adapt, they form matrices Wi, Ui, Wf , Uf , Wo, Uo and Wh, Uh; which makes four times more than for a RNN (because the output ξi,j is not taken in account). ′ Then the states in vf and vi are respectively given by combining c with f and h with i, in the simplest bilinear way: e a a a ξv = γ ϕ ; a ∈ v; (4.3) where γ denotes the states of c′ or h, and ϕ the states of f or i respectively. e Note that the above formulae have a sense if and only of the dimensions of c and f and vf are equal and the dimension of h and i and vi are equal. This is an important restriction. At the level of vectors this diagonale product is name the Hadamard product and is written

ξv = γ ⊙ ϕ. (4.4)

It is free of parameters. Only the dimension is free for a choice. Then, vi and vf are joined by a Hadamard sum, adding term by term, to give the new cell state

ξc = ξvf ⊕ ξvi ; (4.5)

76 which implies that vi and vf have the same dimension. And finally, a new Hadamard product gives the new hidden state:

ηh = ξo ⊙ tanh ξc. (4.6)

This leaves the latitude of a normalization tanh Cx but this is all. However this implies that c and o and h have the same dimension. Therefore the LST M has a discrete invariant, which is the dimension of the layers, and is named its multiplicity m. Only the layers x can have other dimensions, in what follows we write n for this dimension.

Symbolically, the dynamics can be summarized by the two formulas:

ct = ct−1 ⊙ σf (xt, ht−1) ⊕ σi(xt, ht−1) ⊙ τh(xt, ht−1); (4.7)

ht = σo(xt, ht−1) ⊙ tanh ct; (4.8) where σk (resp. τk) denotes the application of σ (resp. tanh) to a linear or affine form. ′ ′ ′ In what follow we denote xt by x and ht−1, ct−1 by h , c , like their tips.

Due to the non-linearities σ and tanh, there ares several regimes of the functioning, ac- cording to the fact that some of the variable give or not a saturation; this can generate almost linear transmission or to the contrary, discrete transmission, for instance ±1 when tanh is applied, or 0 or 1 if σ is applied. Here appears the fundamental aspect of discretization in the functioning of DNNs.

In the linear regime, the new state c appears as a polynomial of degree 2 in the vectors x, h′ and degree 1 in c′, and h appears as a polynomial of degree 3 in x′, h′. Introducing the linear (or affine with bias) forms αf ,αi,o ,αh, before application of σ or tanh, we have ′ ht = αo ⊙ (c ⊙ αf ⊕ αi ⊙ αh). (4.9) ′ ′ ′ ′ The dominant term in x , h is decomposable: αo ⊙i ⊙αh; the term of second degree in x , h ′ is αo ⊙ c ⊙ αf , and there is no linear term, because we forgotten the bias. When separating x′ from h′, we obtain all the types of degrees less than 3. However, experiments with alternative memory cells, named GRU and their simplifications, have shown that the degree in x′ is apparently less important then the degree in h′. All the essays with degree less than 3 in h′ had dramatic loss of performance, but this was not the case for x′, where degree 1 appeared to be sufficient.

The number of parameters to adapt is 4m2 +4mn or 4m2 + dmn, with 1 ≤ d ≤ 4 count the dependencies in x in the four operations αf ,αi,o ,αh. At least d = 1 for αh or for αf seems to indispensable from the study of MGU.

4.2 GRU, MGU

Several attempts were made for diminishing the quantity of parameters to adapt in LST M without diminishing the performance. The most popular solution is known as Gated Recur- rent Unit, or GRU, Cho et al. 2014 [CvMBB14], Chung et al. [CGCB14] in the group of

77 Bengio, then simplified in several kinds of Minimal Gated Units, MGU, Zhou et al. 2016 [ZWZZ16], cf. Heck and Salem 2017 [HS17]. The idea is to replace several gate layers by one, at the cost of a more complex architec- ture’s topology. In the standard GRU, the pair ht,ct is replaced by ht alone, as in the original RNN; ′ there exists two input layers Xt,Ht−1, the number of joins, our tanks, is six: R,F,I,V,W,H , the number of tips is six, z, r, v1−z, vr, vx, vh and one output ht. The dynamic begins with two non-linear linear transform, of type σ P, like (4.2) in R, giving ′ ′ ′ z and r from x and h ; then in I, there is a Hadamard product vz = h ⊙ (1 − z), where 1 − z designates the Hadamard difference between the saturation and the values of the states of z. ′ Moreover, in F , there is another Hadamard product vr = h ⊙ r. A tanh P, like (4.2) with ′ ′ f = tanh, joins x with vr in W to give vx, which joins z in H to give vh by a third Hadamard product. Finally, vh and v1−z are joined together by a Hadamard sum in V , giving h = vz ⊕vh.

Symbolically, with the same conventions used for LST M, the dynamic can be summarized by the following formula

ht = (1 − σz(xt, ht−1)) ⊙ ht−1

⊕ σz(xt, ht−1) ⊙ tanh(Wx(xt)+ Ux(σr(xt, ht−1) ⊙ ht−1)). (4.10) In a GRU as in a LST M we have three Hadamard products and one Hadamard sum, plus three non-linear-linear transforms NLL (one with tanh); LST M had four NLL trans- forms (two with tanh), but the complexity of GRU stays in the succession of two NLL with adaptable parameters. Remark that LST M also contains a succession of non-linearities, th being applied to ct, which is a sum of product on non-linear terms of type σ or th.

In the linear regime, the GRU gives

ht = [(1 − αz) ⊙ ht−1] ⊕ [αz ⊙ [W xt + U(αr ⊙ ht−1)]]. (4.11) For the same reason than LST M a GRU has a multiplicity m, and a dimension n of data input. The parameters to be adapted are the matrices Wz, Uz, Wr, Ur and Wx, Ux in W . This gives 3m2+3mn real numbers to adapt, in place of 4m2+4mn for a complete LST M.

The simplification which was proposed by Zhou et al. for MGU consists in taking σz = σr, 2 thus reducing the parameters to 2m +2mn. This unique vector is denoted σf , assimilated to the forget gate f of LST M. It seems that the performance of MGU were as good as the ones of GRU, which are almost as good as LST M for many tasks.

Heck and Salem 2017 suggested further radical simplifications, some of them being as good as MGU.

′ MGU1 consists in suppressing the dependency of the unique σf in x , and MGU2 in sup- ′ ′ pressing also the bias βf . A MGU3 removed x and h , just conserving a bias, but it showed poor learning and accuracy in the tests.

The experimental results proved that MGU2 is excellent in all tests, even better than GRU.

78 Note that MGU2 (and MGU1) continue to be of degree 3 in h′. This reinforces the impres- sion that this degree is an important invariant of the memory cell. But these results indicate that the degree in x′ is not so important.

Consequently we may assume

ht = (1 − σz(ht−1)) ⊙ ht−1 ⊕ σz(ht−1) ⊙ tanh(Wx(xt)+ Ux(σz(ht−1) ⊙ ht−1))). (4.12) And in the linear regime

′ ′ ht = [(1 − αz) ⊙ h ] ⊕ [αz ⊙ [W xt + U(αz ⊙ h )]]. (4.13)

a ′ ′ Only two vectors of linear (or affine) forms intervene, αz (h ); a = 1, ..., m and h itself, i.e. ηa(h′); a =1, ..., m. The parameters to adapt are Uz, giving αz, and Ux = U, Wx = W , giving the polynomial of degree two in parenthesis, i.e. the state of the layer called vh.

The number of free parameters in MGU2 is 2m2 + mn, two time less than the most eco- nomical LST M.

The graph Γ of a GRU or a MGU has five independent loops, a fundamental group free of rank five; it is non-planar. The graph of a LST M has only three independent loops, and is planar. Cf. figure 4.1 .

C −1 V C t ⊙ f ⊕ t

f Vi ⊙ tanh ı

ht−1

˜ ⊙ tanh h o ht

xt Figure 4.1: Grothendieck site of a LSTM cell

4.3 Universal structure hypothesis

A possible form of dynamic covering the above examples is a vector of dimension m of non- linear functions of several vectors σαa , σβb , ..., that are σ of th functions of linear (or perhaps affine) forms of the variables ξa, ηb, for a, b, c varying from 1 to m. More precisely

a a a a a ηt = X tb σαb tanh[X uc,dσβc σγd + X vc σβc + X wdσγd + σδa ]. (4.14) b,c,d c,d c d

79 Remark: we have written σα, σβ, ... for the application to a linear form of a sigmoid or a tanh indifferently; but for a more precise discussion of the examples, we must distinguish and write τα, τβ, ... when tanh is applied. However, sometimes in the following lines, we will use τ when we are certain that a tanh is preferable to a σ.

a 3 The tensor uc,d would introduce m parameters, leading to great computational difficul- ties. A natural manner to limit the degrees of freedom at Km2, inspired by LST M and GRU, is to use the Hadamard product, for instance σβa σγa . A second simplification, justified by the success of MGU consists to impose αa = γa. A third one, justified by the success of MGU2 is to limit the degree in x′ to 1. This can be done by reserving the dependency on x′ to the forms β and δ. All that gives a ηt = σαa (η) tanh[σαa (η)σβa (η,ξ)+ σβa (η,ξ)+ σδa (ξ)]. (4.15) This contains 2m2 +2mn free parameters to be adapted.

Remark: here we have neglected the addition of the alternative term in the dynamic which a is (1 − σαa)η in GRU and MGU, but this term is probably very important, therefore, we must keep in mind that it can be added in the applications. At the end it will reappear in the formulas we suggest below.

For MGU1, 2, the term of higher degree has no dependency in x′, then we can simplify further in a ηt = σαa (η) tanh[σαa (η)σβa (η)+ σya (ξ)σβa (η)+ τδa (ξ)]. (4.16) Moreover, as MGU2 is apparently better than MGU1 in the tested applications, the forms αa can be taken linear, not affine.

It looks like a simplified LST M, if we define for the state of ct the following vector:

a γt = σαa (η)σβa (η)+ σya (ξ)σβa (η)+ τδa (ξ), (4.17)

a a and impose the recurrence y (ξ)= γt−1. This gives a kind of minimal LST M or MLST M

a a γt = σαa (η)σβa (η)+ γt−1σβa (η)+ τδa (ξ), (4.18) a a ηt = σαa (η) tanh[γt ]. (4.19) Or with the forgotten alternative,

a a a ηt = σαa (η) tanh[γt ]+(1 − σαa (η))η . (4.20) Now we suggest to look at these formulas from the point of view of the deformation of singularities having polynomial universal models, and trying to keep the main properties of the above dynamics: 1) on a generic straight line in the input space h′, and in any direction of the output space h, we have every possible shape of a 1D polynomial function of degree 3, when modulating by the functions of x′; 2) the presence of non-linearity σ applied to forms in h′ and th applied to forms in x′ allow discretized regimes for the full application, but also a regime where the dynamic is close to a simple polynomial model.

80 In the above formulas the last application of th renders possible the degeneration to degree 1 in h′ and x′, we suggest to forbid that, and to focus on the coefficients of the polynomial. In fact the truncation of the linear forms by σ or th is sufficient to warranty the saturation of the polynomial map. From this point of view the terms of degree 2 are in general not essential, being absorbed by a Viete transformation. Also the term of degree zero, doesn’t change the shape, only the values; but this can be non-negligible. In the simplest form this gives

a 3 a a ηt = σαa (η) + u (ξ)σαa (η)+ v (ξ); (4.21) where u and v are th applied to a linear form of ξ, and σα is a σ applied to a linear form in η. This gives only m2 +2mn free parameters, thus one order less than MGU2 in m.

However, we cannot neglect the forgotten alternative (1−z)h′ of GRU, or more generally the possible function in the transfer of a term of degree two, even if structurally, from the point of view of the deformation of shapes, it seems not necessary, thus the following form could be preferable:

a 3 a a a ηt = σαa (η) + (1 − σαa (η))η + u (ξ)σαa (η)+ v ; (4.22) or more generally, with 2m2 +2mn free parameters:

a 3 a a ηt = σαa (η) + σαa (η)[σβa (η)+ u (ξ)] + v (ξ); (4.23) where β is a second linear map in η.

Description of an architecture for this dynamic : it has two input layers Ht−1,Xt, three sources or tanks A, B, C, and seven internal layers that give six tips, α,β, vβ, u, v, vαβ, vααα, and one output layer ht. First ht−1 gives σα and σβ, and xt gives u and v; then σβ joins u in A to give vβ = σβ ⊕ u, then σα joins vβ in B to give vαβ = σα ⊙ vβ. In parallel, σα ⊙3 is transformed along an ordinary arrow in vααα = σα . And finally, in C, the sum of v, vααα and vβ produces the only output ht.

The simplified network is for β = 0. It has also three tanks, A, B and C, but only five tips, α, u, v, vα, vααα. The schema is the same, without the creation of β, and vβ (resp. vαβ) replaced by vα (resp. vαα).

Remark. In the models with tanh like (4.20) the sign of the terms of effective degree three can be minus or plus; in the model (4.23) it is always plus, however this can be compensated by the change of sign of the efferent weights in the next transformation.

Formula (4.15) could induce the belief that 0 goes to 0, but in general this is not the case, because the function σ contrarily to tanh has only strictly positive values. For instance the standard σ(z)=1/1 + exp(−z) gives σ(0) = 1/2. However, the point 0 plays apparently an important role, even if it is not preserved: 1) in MGU2 the absence of bias in αa confirms this point; 2) the functions σ and th are almost linear in the vicinity of 0 and only here. Therefore, let us define the space H of the activities of the memory vectors ht−1 and ht, of real dimension m; it is pointed by 0, and the neighborhood of this point is a region of special interest.

81 We also introduce the line U of coordinate u and the plane Λ = U × R of coordinates u, v, where 0 and its neighborhood is also crucial. The input from new data xt is sent to Λ, by the two maps u(ξ) and v(ξ). By definition this constitutes an unfolding of the degree three map in σα(η).

A more complex model of the same spirit is

a 3 2 a a ηt = σαa (η) ± σαa (η)[σβa (η) + u (ξ)]+ v (ξ)σβa (η) a 2 2 a + w (ξ)[σαa (η) + σβa (η) ]+ z (ξ); (4.24)

2 it has 2m +4mn free parameters. The exression of xt is much richer and we will see below that it shares many good properties with the model (4.21), in particular stability and uni- versality. The corresponding space U has dimension 3 and the corresponding space Λ has dimension 4.

4.4 Memories and braids

In every DNN, the dynamic from one or several layers to a deeper one must have a sort of stability, to be independent of most of the details in the entries, but it must also be plastic, and sensitive to the important details in the data, then not too stable, able to shift from a state to another one, for constructing a kind of discrete signification. These two aspects are complementary. They were extensively discussed a long time before the apparition of DNNs in the theory of dynamical systems. The framework was different because most concepts in this theory were asymptotic, pertinent when the time tends to infinity, and here in deep learning, to the contrary, most concepts are transient: one shot transformations for feed forward, and gradient descent or open exploration for learning; however, with respect to the shape of individual transformation, or with respect to the parameters of deformation, the two domains encounter similar problems, and probably answer in similar manners. Structural stability is the property to preserve the shape after small variation of the pa- rameters. In the case of individual map between layers, this means that little change in the input has little effect on the output. In the case of a family of maps, taking in account a large set of different inputs, this means that varying a little the weights, we get little change in the global functioning and the discrimination between data. The second level is deeper, because it allows to understand what are the regions of the manifolds of input data, where the individual dynamics are stable in the first sense, and what happens when individual dynamics changes abruptly, how are made the transitions and what are the properties of the inputs at the frontiers. A third level of structural stability concerns the weights, selected by learning: in the space of weights it appears regions where the global functioning in the sense of family is stable, and regions of transitions where the global functioning changes; this happens when the tasks of the network change, for instance detect a cat versus a dog. This last notion of stability depends on the architecture and on the forms of dynamical maps that are imposed.

With LST M, GRU and their simplified versions like MGU, MGU2, we have concrete examples of these notions of structural stability. w The transformation is X from (ht−1, xt) to ht. The weights w are made by the coeffi- cients of the linear forms, αa(η), βa(η),ua(ξ), va(ξ), but the structure depends on the fixed architecture and the non-linearities, of two types, the tensor products and sums, and the

82 applied sigmoids and tanh. For simplicity we assume the form (4.21), but the discussion is not very different for the other families (4.23), (4.16) or (4.20). We have a linear endomorphism α of coordinates αa; a ∈ h of Rm = H; when we apply to it the sigmoid function coordinate by coordinate, we obtain a map φ from H to a compact do- main in H. The invariance of the multiplicity m of the memory cell suggests the hypothesis (to be verified experimentally) that φ is a diffeomorphism from H to its image. However, as we will see just below, other reasons like redundancy suggests the opposite, therefore we left open this hypothesis, with a preference for diffeomorphism, for mathematical or structural reasons. Probably, depending on the application, there exists a range of dimensions m which performs the task, such that φ is invertible. We also have the two mappings ua(ξ); a ∈ h and va(ξ); a ∈ h from the space X = Rn of m states xt, to R . This gives a complete description of the set of weights Wh;h′,x′ . The formula (4.21) defines the map Xw from H × X to H. w We also consider the restriction Xξ at a fixed state ξ of xt.

Theorem 4.1. The map Xw is not structurally stable on H or H × X, but each coordinate a ηt , seen as function on a generic line of the input ht−1 and a generic line of the input xt, or as a function on H or H × X, is stable (at least in the bounded regions where the discretization doesn’t apply). These coordinates represent the activities of individual neurons, then we get structural sta- bility at the level of the neurons and not at the level of the layers.

As we justify in the following lines, this theorem follows from the results of the universal unfolding theory of smooth mappings, developed by Whitney, Thom, Malgrange and Mather (cf. LN 552, Gibson et al., 1976 [GWDPL76], and Jean Martinet, 1982, [Mar82]. The main point here (our hypothesis) is the insistence that for each neuron in the layer of 3 ht, the cubic degeneracy z can appear, together with its deformation by the function u. For the deformation of singularities of functions, and their unfolding, see V.Arnold et al. [Arn73], [AGZV12b].

The universal unfolding of the singularity z3 is given by a polynomial

3 Pu(z)= z + uz, (4.25)

This means that for every smooth real function F , from a neighbor of a point 0 in R1+M , such that F (z, 0, ..., 0) = z3, (4.26) there exist a smooth map u(Y ) and a smooth family of maps ζ(z,Y ) such that

F (z,Y )= ζ(z,Y )3 + u(Y )ζ(z,Y ) (4.27)

Equivalently, the smooth map (z,u) 7→ (Pu(z),u), (4.28) in the neighbor of (0, 0) is stable: every map sufficiently near to it can be transformed to it by a pair of diffeomorphisms of the source and the goal. This result on maps from the plane to the plane, is the starting point of the whole theory, found by Whitney: the stability of

83 the gathered surface over the plane v,u. The stability is not true for the product

(z,u,w,v) 7→ (Pu(z),u,Pv(w), v) (4.29)

The infinitesimal criterion of Mather is not satisfied (cf. [GWDPL76], [Mar82]).

There exists also a notion of universal unfolding for maps from a domain of Rn to Rp in the vicinity of a point 0, however in most cases, there exists no universal unfolding, contrarily to the case of functions, when p = 1. Here n = p = m, the transformation from ht−1 to ht is an unfolding, dependent of ξ ∈ xt, but it doesn’t admit a universal deformation. It has an infinite codimension in the space of germs of maps. Also for mappings, universality of and unfolding and its stability as a map are equivalent (another theorem of Mather).

Our non-linear model (4.21) with u free being equivalent to the polynomial model by diffeomorphism, we can apply to it the above results. This establishes the theorem 5.

Corollary. Each individual cell plays a role.

This doesn’t contradict the fact that frequently several cells send similar message, i.e. there exists a redundancy, which is opposite to the stability or genericity of the whole layer. However, as said before, in certain regime and/or for m sufficiently small, the redundancy is not a simple repetition, it is more like a creation of characteristic properties.

a Let us look at a neuron a ∈ ht, and consider the model (4.21). If u = u (ξ) doesn’t change of sign, the dynamic of the neuron a is stable under small perturbations. For u> 0, it looks like a linear function, it is monotonic. For u< 0 there exist a unique stable minimum and a unique saddle point which limits its basin of attraction. But for u = 0 the critical points collide, the individual map is unstable. This is named the catastrophe point. For the whole theory, see [Tho72], [AGZV12b].

a If we are interested in the value of ηt , as this is the case in the analysis of the cat’s manifolds seen before, for understanding the information flow layer by layer, we must also consider the levels of the function, involving va then Λ. This asks to follow a sort of inversion of the flow, going to the past, by finding the roots z of the equations

P a(z)= c. (4.30)

Depending on u and v, there exist one root or three roots. For instance, for c = 0, the second case happens if an only if the numbers ua(ξ), va(ξ) satisfy the inequality 4u3 + 27v2 < 0. When the point (ua(ξ), va) in the plane Λ belongs to the discriminant curve ∆ of equation 4u3 + 27η2 = 0, things become ambiguous, two roots collide and disappear together for 4u3 + 27v2 > 0. These accidents create ramifications in the cat’s manifolds.

This analysis must be applied independently to all the neurons a = 1, ..., m in h, that is to all the axis in H. If α is an invertible endomorphism, the set of inversions has a finite

84 number of solutions, less than 3m.

Remind that the region around 0 in the space H is especially important, because it is only here that the polynomial model applies numerically, σ and th being almost linear around 0. Therefore the set of data ηt−1 and ξt which gives a certain point ηt in this region have a special meaning: they represent ambiguities in the past for ηt−1 and critical parameters for 3 2 ξt. Thus the discriminant ∆ of equation 4u +27v = 0 in Λ plays an important role in the global dynamic.

w a a The inversion of Xξ : H → H is impossible continuously along a curve in ξ whose u , v meet ∆ for some component a. It becomes possible if we pass to complex numbers, and ∼ C ∗ lift the curve in Λ to the universal covering Λ∗ ( ) of the complement ΛC of ∆C in ΛC. Cf. [AGZV12a]. The complex numbers have the advantage that every degree k polynomials has k roots, when counted with multiplicities. The ambiguity in distinguishing individual roots along a path is ∗ contained in the Poincar´efundamental group π1(ΛC). However the precise definition of this ∗ group requires the choice of a base point in ΛC, then it is more convenient to consider the ∗ fundamental groupoid Π(ΛC) = B3, which is a category, having for points the elements of ∗ ΛC and arrows the homotopy classes of paths between two points. The choice of an object ∗ λ0 determine π1(ΛC; λ0), which is the group of homotopy classes of loops from λ0 to itself, i.e. the isomorphisms of λ0 in B3. This group is isomorphic to the Artin braid group B3 of braids with three strands. Cf. Arnold et al. volume 2, [AGZV12a]. This group B3 is generated by two loops σ1, σ2 that could be define as follows: take a line R C + − u = u0 ∈ − ⊂ ,, with complex coordinate v, and let v0 , v0 be the positive and negative 4 3 + − square roots of − 27 u0; the loop σ1 = σ (resp. σ2 = σ ) is based in 0, contained in the line + − u = u0 and makes one turn in the trigonometric sense around v0 (resp. v0 ). The relations between σ1 and σ2 are generated by σ1σ2σ1 = σ2σ1σ2. 3 The center of B3 is generated by c = (σ1σ2] . The quotient by this center is isomorphic 2 3 to the group B3/C generated by a = σ1σ2σ1 and b = σ1σ2 satisfying a = b ; the quotient 2 of B3/C by a is the M¨obius group P SL2(Z of integral homographies, and the quotient of 4 B3/C by a is the modular group SL2(Z) of integral matrices of determinant one, then a two Z S 2 2 fold covering of P SL2( ). The quotient 3 of B3 is defined by the relations σ1 = σ2 = 1, and by the relation which defines B3, i.e. σ1σ2σ1 = σ2σ1σ2.

Of course the disadvantage of the complex numbers is the difficulty to compute with them in DNNs, for instance σ and tanh extended to C have poles. Moreover all the dynamical ∗ regions are confounded in ΛC; in some sense the room is too wide. Therefore, we will limit ∗ ourselves to the sub-category ΠR = B3(R), made by the real points of Λ , but retaining all the morphisms between them, that is a full sub-category of B3. This means that only the paths are imaginary in B3(R).

Another sub-groupoid could be also useful: consider the gathered surface Σ in Λ × R of 3 equation z + uz + v = 0; let ∆3 be the natural lifting of ∆ along the folding lines of Σ ∗ over Λ, the complement Σ of ∆3 in Σ can be canonically embedded in the complex uni- ∼ versal covering Λ∗ , based in the real contractile region Λ0 inside the real cusp, by taking, + for each (u, v) = λ in Λ0 the points λ+ and λ− respectively given by the paths σ = σ1 − and σ = σ2, which make simple turn over the branches of the cusp. When λ approaches one of these branches, the corresponding point collide with it on ∆3, but the other point continues to be isolated then the construction gives an embedding of Σ∗. Therefore we can

85 ∗ r define the full sub-groupoid of B3 which has for objects the points of Σ , and name it B3 or Πr.

Remark. The groupoid Πr can be further simplified, by taking one point in each region of interest: one point outside the preimage of the cusp ∆, and three points in each region over the interior of the cusp. Remark. These four points correspond to the four real structures of Looijenga in the complex kaleidoscope, cf. Looijenga 1978 [Loo78].

r The groupoid B3 is naturally equipped with a covering (surjective) functor π to the groupoid B3(R) of real points. r R The interest of B3 with respect to B3( ) is that it distinguishes between the stable minimum R r and the unstable one in the regime u< 0. But the interest of B3( ) with respect to B3 is that it speaks only of computable quantities u, v without ambiguity, putting all the ambiguities in the group B3.

R r All these groupoids are connected, the two first ones, B3( ) and B3 because they are full of the connected groupoid B3, the other ones in virtue of the definition of a quotient (to the right) of a groupoid by a normal sub-group H of its fundamental group G: it has the same objects, and two arrows f,g from a to b are equivalent if they dif- fer by an element of H. This is meaningful because in Auta (resp. Autb) the sub-group Ha −1 −1 (resp. Hb) is well defined, being normal, and moreover f g ∈ Ha is equivalent to gf ∈ Hb.

Cardan formulas expresses the roots by using square roots and cubic roots. They give explicit formulas for the differences of roots z2 − z1, z3 − z1. They can be seen directly in the surface Σ.

Remarks. They correspond to the simplest case of a map of period:

−1 (i) integral classes of H0(Pu,v (0) are transported along paths; (ii) the holomorphic form dz is integrated on the integral classes.

This gives a linear representation of B3, which factorizes through S3. Augment the variable z by a variable y, the roots can be completed by the levels Zu,v over (u, v) ∈ Λ, which are the elliptic curves

3 2 Pu,v(z,y)= z + y + uz + v =0, (4.31) the 2-form ω = dz ∧ dy can be factorized as follows 1 dz ω = − dP ∧ ; (4.32) 2 y the integral of dx/y over the curve Zu,v is an elliptic integral, its periods over integral cycles, gives a linear representation of B3 which factorizes through SL2(Z). Every stabilization of z3 by a quadratic form gives rise to the representation of the first case in odd dimension and of the second case in even dimension.

Smaller natural groupoids are given by quotienting the morphisms, replacing B3 by S3 or SL2(Z) or its projective version P SL2(Z) made by homographies.

86 4.5 Pre-semantics

The natural languages have many functions, from everyday life to poetry and science, or pol- itics and law, however all of them rely on cognitive operations about meanings and shapes, as they appear in the many language-games of Wittgenstein or the action/perception di- mensions of Austin. Cf. [Wit53], [Aus61]. The linguist Antoine Culioli, having studied in depth a great variety of natural lan- guages, tried to characterize some of these operations in meta-linguistic, for instance the generic structure and dynamics of a notional domain. The notion here can be ”dog” or ”cat” or ”good” or ”absent” or anything which has a meaning for most peoples, or spe- cialists in some field. To have a meaning must involve in general several occurrences and disappearances of the notion, a knowledge of its possible properties and individuations, in a language and in the world (data for instance, relations between them and classifications). A good reference is the book Cognition and Representation in Linguistic Theory, A. Culioli, Benjamins, 1995, [CLS95]. The notional domain has an interior I where the properties of the notion are certain, an exterior E where the properties are false, and a boundary B, where things are more uncertain. A path through the boundary goes from ”truly P” to ”truly not P”, through an uncertain region where ”non-really P, non really not P” can be said. In the center of I are one or several prototypes of the notion. A kind of gradient vector leads the mind to these archetypes, that Culioli named attracting centers, or attractors; however he wrote in 1989 (ref.) the following important precision: ”Now the term attractor cannot be interpreted as an attainable last point (...) but as the representation of the imaginary absolute value of the property (the predicate) which organizes an aggregate of occurrences into a structured notional domain.” Culioli also used the term of organizing center, but as we shall see this would conflict with another use. The division I,B,E takes all its sense when interrogative mode is involved, or negation and double negation, or intero-negative mode. In negation you go out of the interior, in interro-negation you come back inside from E. ”Is your brother really here” (it means that ”I don’t expect that your brother your brother is here”.) ”Now that, that isn’t a dog!” (you place yourself in front of P, or inside the notion I, you know what is a dog, then goes to E); ”Shall I still call that a dog?” ”I don’t refuse to help”; here come back in I of ”help” after a turn in its exterior E. All these circumstances involve an imaginary place IE, where the regions are not separated, this is like the cuspidal point before the separation of the branches I and E of the cusp. Mathematically this corresponds precisely the creation of the external (resp. internal) critical point of z3 + uz + v, on the curve ∆. Example: ”he could not have left the window open”, the meaning mobilizes the place IE of indetermination, the maximum of ambiguity, where the two actions, ”left” and ”not to left” are possible, then one of them is forbidden, and ”not having left” is retained by the negation. In the terminology of Thom, the place IE is the organizing center, the function z3 itself, the most degenerate one in the stable family, giving birth to the unfolding. To describe the mechanisms beyond these paths, Culioli used the model of the cam: ”the movement travels from one place to another, only to return to the initial plane”. Example: start from IE, then make a half-turn around I which passes by E then come to I by an- other half-turn. ”This book is only slightly interesting.” The meaning appears only if you imagine the place where interesting and not interesting are not yet separated, then go to not interesting and finally temperate the judgment by going to the boundary, near I; the compete turn leads you in another place, over the same point, thus the meaning is greatly in

87 the path, as an enclosed area. ”This book is not uninteresting” means that it is more than interesting. The paths here are well represented on the gathered real surface Σ, of equation

z3 + uz + v =0, (4.33) but they can also be made in the complement of ∆ in Λ in a complexified domain. It seems that only the homotopy class is important, not the metric, however we cannot neglect a weakly quantitative aspect, on the way of discretization in the nuances of the language. r Consequently, the convenient representation of the moves of Culioli is in the groupoid B3, that we propose to name the Culioli groupoid.

Remind that LST M and the other memory cells are mostly used in chains, for translating texts. It is natural to make a rapprochement between their structural and dynamical properties and the meta-linguistic description of Culioli. In many aspects Ren´eThom was closed to Culioli in his own approach of semantics, see the book Mathematical Models of Morphogenesis, R.Thom, Harwood 1983, [Tho83], translation of a french book, Bourgois 1980. The original theory was exposed in [Tho72]. In this approach, all the elementary catastrophes having a universal unfolding of dimension less than 4 are used, through their sections and projections, for understanding in particular the valencies of the verbs, from the semantic point of view, according to Peirce, Tesni`ere, Allerton: impersonal, ”it rains”, intransitive ”she sleeps”, transitive ”he kicks the ball”, triadic ”she gives him a ball”, quadratic ”she ties the goat to a tree with a rope”. The list of organizing centers is as follows:

y = x2, y = x3, y = x4, y = x5, y = x6, 3 2 3 3 3 2 4 2 y = x1 − x2x1, y = x1 + x2 (or y = x1 + x2x1), y = x1 + x2x1; (4.34) respectively named: well, fold, cusp, swallowtail, butterfly, elliptic umbilic, hyperbolic umbilic and parabolic umbilic, or with respect to the group which generalizes the Galois group S3 + − for the fold, respectively: A1, A2, A3, A4, A5, D4 = D4 = D4 and D5. The An are the symmetric groups Sn+1 and the Dn index two subgroups of the symmetry groups of the hypercubes In. Cf. Caustic mystic, 1984 [Ben86].

It is not difficult to construct networks, on the model of mLST M, such that the dynamics of neurons obey to the unfolding of these singular functions. The various actors of a verb in a sentence could be separated input data, for different coordinates on the unfolding param- eters. Their efficiency should be tested in translation.

Coming back to the memory cell (4.21), the critical parameters xt over ∆ can be inter- preted as frontiers between regions of notional domains. a a The precise learned 2mn weights wx for the coefficients u and v , for a = 1, ..., m, to- a gether with the weights in the forms α for ht−1 gives vectors (or more accurately matrices), which are like readers of the words x in entry, taking in account the contexts from the other words through h. Remember Frege: a word has a meaning only in the context od a sen- tence. This is a citation of Wittgenstein, after he said that ”Naming is not yet a move in a language-game” (W. 49), [Wit53]. To get ”meanings”, the names, necessarily embedded in sentences, must resonate with other contexts and experiences, and must be situated with respect to the discriminant, along a path, thus we suggest that the vector spaces of ”readers” W , and the vector spaces of states

88 r X are local systems A over a fibered category F in groupoids B3 over the network’s category C. r R In certain circumstances, the groupoid B3 can be replaced by the quotient over objects B3( ), or a quotient over morphisms giving SL2 or S3.

3 The case of z corresponds to A2. It is tempting to consider the case of D4, i.e. the ellip- tic and hyperbolic umbilics, because their formulas are very closed to MGU2 as mentioned at the end of the preceding section. This would allow the direct coding and translation of sentences with three actant.

η = z3 ∓ zw2 + uz + vw + x(z2 + w2)+ y. (4.35)

89 Chapter 5

A natural 3-category of deep networks

5.1 Attention moduli and relation moduli

In addition to the chains of LST M, another network’s component is now recognized as es- sential for most of the tasks in linguistic: to translate, to complete a sentence, to determine a context and to take into account a context for finding the meaning of a word or sentence. This modulus has its origin in the attention operator, introduced by Bahdanau et al. 2015 [BCB16], for machine translation of texts. The extended form that is the most used today was defined in the same context by Vaswani et al. 2017 [VSP+17], under the frequent name of transformer or simply decoder.

Let us describe the steps of the algorithm: the input contains vectors Y representing mem- ories or hidden variables like contexts, and external input data X. 1) Three sets of linear operators are applied: Q = W Q[Y ], K = W K[Y,X], V = W V [Y ]; where the W are matrices of weights, to be learned. The vectors Q,K,V are respec- tively called queries, keys and values, from names used in ; they are supposed to be indexed by ”heads” i ∈ I, representing individuals in the input, and by other indices a ∈ A, representing for instance different instant times, or aspects, to a a a be integrated together. Then we have vectors Qi ,Ki ,Vi . a a a 2) The scalar products Ei = k(Qi |Ki ) are computed (implying that Q and K have the same dimension), and the soft-max function is applied to them, giving a probability a law, from the Boltzmann weights of energy Ei

1 a a Ei pi = a e , (5.1) Zi 3) a sum of product is computed ′ a a Vi = X pi Vi . (5.2) a 4) A new matrix is applied for mixing the heads

i ′ Aj = X wjVi . (5.3) i

90 All that is summarized in the formula: i Q a K a V a Aj(Y,X)= X X wjsoftmax k(W (Y )i |W (Y,X)i ) W (Y )i . (5.4) i a A remarkable point is that, as the cell MGU2 or LST M and GRU, the transformer cor- responds to a mapping of degree 3, made by multiplying a linear form of Y with non-linear function of a bilinear form of Y . Strictly speaking the degree 3 is only valid in a region of the parameters, and in other regions, saturation decreases the degree.

Chains of LST M were first used for language translations, and were later on used for image description helped by sentences predictions, cf. cf. Karpathy, Fei-Fei 2015 [KL14], Mao, Yuille 2015 [MXY+15], where they proved to outperform other methods for detections of objects and their relations. In the same manner, the addition of attention cell proved to be very beneficial in this context, cf. Zambaldi et al. 2018 [ZRS+18], then it was extended to develop reasoning about the the relations between objects in images and videos, cf. Raposo et al. 2017 [RSB+17], Barrett et al. 2018 [BHS+18], [BHS+18], Santoro et al. 2019 [SRB+17], Ding et al. 2020 [DHSB20].

In the algorithm MHDPA (multi-head dot product attention) Santoro et al. 2018, 2019, the inputs X either words, questions and features of objects and their relations, coded in vectors, and inputs Y combining hidden and external memories, the output A is new memories, new relations and new questions. Remark: in fact the method combined fully supervised learning with un-supervised learning (or adaptation) by maximization of a learned functional of the above variables. In particular, the memories or hidden variables issued from the transformer were re- introduced in the LST M chain; giving the following symbolic formulas:

ct = ct−1 ⊙ σf (xt, ht−1) ⊕ σi(xt, mt) ⊙ τh(xt, ht−1); (5.5) where mt results of transformer applied to the antecedent sequence of hs, cs and xs ; and

ht = σo(xt, ht−1) ⊙ tanh ct. (5.6) Geometrically, this can be seen as a concatenation of folds, as proposed by Thom Esquisse d’une S´emiophysique [Tho88], for explaining many kinds of organized systems in biology and cognition. From this point of view, the concatenation of folds, giving the possibility of coincidence of co-folds [Arg78], is a necessary condition for representing the emergence of a meaning in a living system. Note that, in the non-saturated regimes, ht has a degree 5 in ht−1, then its natural groupoid can be embedded in a braids groupoid of type B5. This augmentation, from the fold to the so called swallowtail, could explain the greatest syntactic power of the MHDPA with respect to LST M. However the concrete use of more memories in times s before t renders the cells much more complex than a simple mapping from t − 1 to t.

The above algorithm can be composed with other cells for detecting relations. For in- stance, Raposo et al. in 2017 [RSB+17] had defined a relation operator: having produced contexts H or questions Q concerning two objects oi,oj by a chain of LST M (that can be helped by external memories and attention cells) the answer is taken from a formula:

A = f(X g(oi,ol; Q, H)), (5.7) i,j

91 where f and g are parameterized functions, and oi : i ∈ I are vectors representing objects with their characteristics. The authors insisted on the important invariance of this operator by the permutation group Sn of the objects.

More generally, composed networks were introduced in 2016 by Andreas et al. [ARDK16] for question answering about images. The reasoning architecture MAC, defined by Hudson and Manning, 2018 [HM19], composed three Attention operators named control, write and read, in a DNN, inspired from the architecture of computers.

This leads us to consider the evolution of architectures and internal fibers of stacks and languages, in relation to the problems to be solved in semantic analysis.

5.2 The 2-category of a network

For representing languages in DNNs, we have associated to a small category C the class ∧ AC = GrpC of pre-sheaves over the category of fibrations in groupoids over C. The objects of AC were described in terms of pre-sheaves AU on the fibers FU for U ∈ C satisfying gluing conditions, cf. sections 3 and 4.

Remark. Other categories than groupoids, for instance posets or fibrations in groupoids over posets, can replace the groupoids in this section, and are useful in the applications, as we mentioned before, and as we will show in the forthcoming article on semantic communication.

′ ′ Natural morphisms between objects (F, A) and (F , A ) of AC are defined by a family of ′ ′ functors FU : FU →FU , such that for any morphism α : U → U in C,

′ Fα ◦ FU ′ = FU ◦ Fα; (5.8)

∗ ′ ′ and by a family of natural transformations ϕU : AU → FU (AU )= AU ◦ FU , such that for any morphism α : U → U ′ in C,

∗ ′ ∗ FU ′ (Aα) ◦ ϕU ′ = Fα(ϕU ) ◦ Aα, (5.9)

∗ ∗ ′ ∗ ′ ∗ ′ from AU ′ to Fα(FU AU )= FU ′ ((Fα) AU ). ′ Note that the family FU ; U ∈ C is equivalent to a C-functor F : F→F of fibered categories in groupoids, and the family ϕU is equivalent to a morphism ϕ in the topos EF from the object A to the object F ∗(A′).

Remark. These morphisms include the morphisms already defined for the individual classi- fying topos EF . But, even for one fibration F and its topos E, we can consider non-identity end-functor from F to itself, which give new morphisms in AC.

The composition of (FU ,ϕU ); U ∈ C with (GU , ψU ) from (G, B) to (F, A) is defined by the ordinary composition of functors FU ◦ GU , and the twisted composition of natural transformation ∗ ∗ ′ (ϕ ◦ ψ)U = GU (ϕU ) ◦ ψU : BU → (FU ◦ GU ) AU . (5.10)

92 This rule gives a structure of category to AC.

In addition, the natural transformations between functors give vertical arrows in HomA(F, A : F ′, A′) forming categories: a morphism from (F,ϕ)to(G, ψ) is a natural transformations λ : F → G, which in this case with groupoids, is an homotopy in the nerve, plus a morphism a : A → A, such that

A′(λ) ◦ ϕ = ψ ◦ a : A → G∗A′. (5.11)

For better understanding of this relation, we can introduce the points (U, ξ) in F over C, and read ′ ′ AU (λU (ξ)) ◦ ϕU (ξ)= ψU (ξ) ◦ aU (ξ): AU (ξ) → AU (GU (ξ)). (5.12) This can be understood geometrically, as a lifting of the deformation λ to a deformation of the pre-sheaves. Vertical composition is defined by usual composition for the deformations λ and ordinary composition in End(A) for a. Horizontal composition are for F →F ′ →F”.

Horizontal arrows and vertical arrows satisfy the axioms of a 2-category, cf. [Gir71], [Mac71]. This structure encodes the relations between several semantics over the same network.

The relations between several networks, for instance moduli inside a network, or aug- mented networks by external links, belong to a 3-category, whose objects are the above semantic triples, and the 1-morphism are lifting of functors between sites u : C → C′.

[Gir71, Theorem 2.3.2] tells us that, as for ordinary pre-sheaves, there exist natural right ∗ and left adjoints u∗ and u! respectively of the pullback u from the 2-category CatC′ of fi- ′ brations over C to the 2-category CatC of fibrations over C. They are natural 2-functors, adjoint in the extended sense. These 2-functors define adjoint 2-functors between the above 2-categories of classifying toposes AC and AC′ , by using the natural constructions of SGA4 for the categories of presheaves. They can be seen as substitutions of stacks and languages induced by functors u.

This is a particular case of Grothendieck’s derivators, cf. [Cis03].

5.3 Grothendieck derivators and semantic information

∧ The map C 7→ MC, or MC is an example of derivator in the sense of Grothendieck, cf. [Gro83], [Gro90], the three articles of Cisinski 2002, cf. [Cis03], and the book of Maltsiniotis on the homotopy theory of Grothendieck [Mal05]. A derivator generalizes the passage from a category to its topos of presheaves, for devel- oping homotopy theory, as topos were made to develop cohomology theory. It is a 2-functor D from the category Cat (or a special sub-category of diagrams, for instance Poset) to the 2-category CAT , satisfying four axioms. The first tells that D transforms sums of categories in products, the second that isomorphisms of images can be tested on objects, the third that ′ there exists for any functor u : C → C , a right adjoint u∗ (defining homotopy limit) and ∗ a left adjoint u! (defining homotopy co-limit) of the functor u = D(u); the fourth axiom asks that these adjoint are defined locally, for instance, if X′ ∈ C′, and F ∈ D(C), therefore

93 ′ u∗F ∈ D(C) , the fourth axiom tells that

∼ ∗ (u∗F )X′ = p∗j F ; (5.13) where j is the canonical map from C|X′ to C, and p the unique morphism from from C|X′ to ∗. Another formula that expresses the same thing is

∼ ∗ ′ (u∗F )X′ = H (C|X ; F |C|X′ ), (5.14) abstract version of a Kan extension formula. In general, the cohomology is defined by

∗ H (C; F )=(pC)∗F ∈ D(∗). (5.15)

A first example of derivator is given by an abelian category Ab, like commutative groups or real vector spaces, and it is defined by the derived category of differential complexes, where quasi-isomorphisms (isomorphisms in co-homology) are formally inverted,

D(I)= Der(Hom(Iop, Ab)). (5.16)

Another kind of example is a representable derivator

op DM(I)= F unct(I , M), (5.17) where M is a closed model category. This can be seen as a non-abelian generalization of the above first example.

A third kind of examples is given by the topos of sheaves over a representable derivator ∧ MC .

In this article, we have defined information quantities, or information spaces, by applying co-homology or homotopy limits, over the category D which expresses a triple C, F, A, made by a language over a pre-semantic over a site. The abelian situation was studied through the bar-complex of co-chains of the module of functions Φ on the fibration T of theories Θ over the category D. A non-abelian tentative, for defining spaces of information, was also proposed at this level, using (in the non-homogeneous form) the functors F from Θloc to a model category M. Therefore information spaces were defined at the level of MT , not at a level MC.

Then representable derivators allow to compare the elements of semantic functioning be- tween several networks, for instance a network with a sub-network of this network, playing the role of a module in computation.

Consider the sub-categories ΘP , over the languages λ,λ ∈ FU , made by the theories that exclude a rigid proposition P =!Γ, in the sense they contain P ⇒ ∆, for a certain chosen ∆. The category P |Aλ acts on ΘP . The information space F defines an object of

MΘP , its co-homology allow us to generalize the cat’s manifolds, that we defined below with the connected components of the category D: the dynamical object X is assumed to be defined over the stack F, the dynamical space gX is defined over the nerve of F, the semantic functioning gives a simplicial map gS : gX → gI• from gX space to the equipped

94 theories, then we can consider the inverse image of ΘP in the functioning network. Com- posing with F we obtain a parameterized object MP in M, defining a local system over the category associated to gX, which depends on Γ, ∆. This represents the information in X about the problem of (rigidly) excluding P when considering that ∆ is (think to be) false. Seen as an element of D(gX), its co-homology is an homotopical invariant of the information.

Information spaces belong to DM(T ). To compare spaces of information flows in two theoretical semantic networks, we have at disposition the adjoint functors ϕ∗,ϕ! of the func- tors ϕ∗ = D(ϕ) associated to ϕ : T → T ′, between categories of theories. Those functors ϕ can be associated to changes of languages A, changes of stacks F and/or changes of basic architecture C.

An important problem to address, for constructing networks and applying deep learning to them, is the realization of information relations or correspondences, by relations or corre- spondences between the underlying structures. For instance, to realize a family of homotopy equivalences (resp. fibration, resp. cofibration) in M, by transformations of languages, stacks or sites having certain properties, like enlargement of internal symmetries.

The analog problem for pre-sheaves (set valued) is to realize a correspondence (or rela- tion) between the topos I∧ and (I′)∧ from a correspondence between convenient sites for them.

For toposes morphisms this is a classic (cf. SGA4 4.9.4, Stacks project 7.16n 2.29), any geometric morphism f∗ : Sh(I) → Sh(J) comes from a morphism of sites up to topos equiv- alence between I and I′. More precisely, there exists a site I′ and a co-continuous and ′ ′ continuous functor v : I → I giving an equivalence v! : Sh(I) → Sh(I ) extending v, and a ′ ′ site morphism J → I , given by a continuous functor u : I → J such that f∗ = u∗ ◦ v!.

From Schulman, Exact completion and small sheaves, 2012 [Shu12]: a geometric morphism between Sh(I) and Sh(J) comes from a morphism of site if and only if it is compatible with the Yoneda embeddings.

5.4 Stacks homotopy of DNNs

The characterization of fibrant and cofibrant objects in MC was the main result of section 5, cf. 5.4. All objects of MC are cofibrant and the fibrant objects are described by the theorem ∗ 4; we saw that they correspond to ideal semantic flows, where the condition π π∗ = Id holds. They also correspond to the contexts and the types of a natural M − L theory. The objects of Ho(MC), [Qui67], are these fibrant and cofibrant objects of MC, the Ho morphisms being the homotopy classes of morphisms in MC, generated by inverting formally zigzags similar to the above ones. Thus we get a direct access to the homotopy category HoMC. The Ho morphisms are the homotopy equivalences classes of the substitutions of variables in the M − L theory.

From the point of view of semantic information, we saw that homotopy is pertinent at the next level: looking first at languages over the stacks, then at certain functors from the posets of theories to a test model category M′, then going to Ho(M′). However, the fact

95 that we restricted us to theories over fibrant objects and fibrations between them, implies that the homotopy of semantic information depends only of the images of these theories over the category Ho(MC). How to use this fact for functioning networks?

96 Appendices

97 A Localic topos and Fuzzy identities

Definitions. let Ω be a complete Heyting algebra; a set over Ω, (X, δ), also named an Ω-set, is a set X equipped with a map δ : X × X → Ω, which is symmetric and transitive, in the sense that for any tripe x, y, z, we have δ(x, y)= δ(y, x) and

δ(x, y) ∧ δ(y, z) ≤ δ(x, z). (18)

Note that δ(x, x) can be different from ⊤. But we always have δ(x, y)= δ(x, y) ∩ δ(y, x) ≤ δ(x, x), and δ(x, y) ≤ δ(y,y). As Ω is made for fixing a notion of relative values of truth, δ is interpreted as fuzzy equality in X; it generalizes the characteristic function of the diagonal when Ω is boolean. In our context of DNN, it can be understood as the progressive decision about the outputs on the trees of layers rooted in a given layer. A morphism from (X, δ) to (X′, δ′) is an application f : X × X′ → Ω, such that, for every, x, x′,y,y′

δ(x, y) ∧ f(x, x′) ≤ f(y, x′), (19) f(x, x′) ∧ δ′(x′,y′) ≤ f(x, y′); (20) f(x, x′) ∧ f(x, y′) ≤ δ′(x′,y′). (21)

Moreover ′ δ(x, x)= _ f(x, x ). (22) x′∈X′ Which generalizes the usual properties of the characteristic function of the graph of a function in the boolean case. The composition of a map f : X × X′ → Ω with a map f ′ : X′ × X” → Ω is given by

′ ′ ′ (f ◦ f)(x, x”) = _ f(x, x ) ∧ f(x , x”). (23) x′∈X′

And the identity morphism is defined by

idX,δ = δ. (24)

This gives the category SetΩ of sets over Ω, also named Ω-sets. The Heyting algebra Ω of a topos E is made by the subobjects of the final object 1; the elements of Ω are named the open sets of E. In fact, there exists an object Ω in E, the Lawere object, such that for every object X ∈E, the set of sub-objects of X is naturally identified with the set of morphisms ΩX . When E = Sh(X) is a Grothendieck topos, Ω is the sheaf over X, which is defined by Ω(x)=Ω(E|x), the sub-objects of 1|x. In the Alexandrov case, Ω(x) is the set of open sets for the Alexandrov topology contained in Λx.

According to Bell, cf. [Bel08], a localic topos, as the one of a DNN, is naturally equivalent to the category SetΩ of Ω-sets, i.e. sets equipped with fuzzy identities with values in Ω. We now give an explicit construction of this equivalence, because it offers a view of the relation between the layers directly connected to the intuitionist logic of the topos.

98 Definition .1. On the poset (Ω, ≤), the canonical Grothendieck topology K is defined by the coverings by open subsets of the open sets. In the localic case, where we are, the topos is isomorphic to the Grothendieck topos E = Sh(Ω,K). We assume that this is the case in the following exposition.

In the particular case E = X∧, where X is a poset, Ω is the poset of lower Alexandrov open sets and the isomorphism with Sh(Ω,K) is given explicitly by proposition 1.2.

Let X be an object of E; we associate to it the set XΩ of natural transformation from Ω Ω to X. For two elements x, y of X , we define δX (x, y) ∈ Ω as the largest open set over which x and y coincide. Ω An element u of X is nothing else than a sub-singleton in X, its domain ωu is δX (u,u). In other terms, in the localic case, u is a section of the presheaf X over an open subset ωu in Ω. Then, if u, v and w are three elements of XΩ, the maximal open set where u = w contains the intersection of the open sets where u = v and v = w. Thus XΩ is a set over Ω. In the same manner, suppose we have a morphism f : X → Y in E, if we take x ∈ XΩ Ω and y ∈ Y we define f(x, y) ∈ Ω as the largest open set of X where y coincides with f∗x. This gives a morphism of Ω-sets. All that defines a functor from E to SetΩ.

A canonical functor from SetΩ to E is given by a similar construction: for U ∈ Ω, ΩU = Ω(U) is an Ω-set, with the fuzzy equality defined by the internal equality

′ ′ δU (α,α )=(α ≍ α ), (25)

.that is the restriction of the characteristic map of the diagonal subset: ∆ : Ω ֒→ Ω × Ω Ω The set ΩU can be identified with the Ω-set U associated to the Yoneda pre-sheaf defined by U. More concretely, an element ω of ΩU is an open subset of U, and its domain δ(ω,ω) is ω itself. Now, for any Ω-set (X, δ), and for any element U ∈ Ω, we define the set

XΩ(U)= HomSetΩ (ΩU ,X)= {f : ΩU × X → Ω|(19), (22)}. (26)

In what follows, we sometimes write XΩ = X, when there this doesn’t introduce too much ambiguity. If V ≤ W , the formula f(ωV ,ωW )= ωV ∩ ωW defines a Ω-morphism from ΩV to ΩW , which gives a map from X(W ) to X(V ). Then XΩ is a presheaf over Ω. A morphism of Ω-set f : X × Y → Ω gives by composition to a natural transformation fΩ : XΩ → YΩ of presheaves over Ω. Consider fU ∈ X(U); the axiom (22) tells that for every open set V ⊂ U, the family of V open sets fU (V,u); u ∈ X is an open covering fU of V . The first axiom of (19), which represents the substitution of the first variable, tells that V W on V ∩ W the two coverings fU and fU coincide. Therefore, for every u ∈ X, the value fU (u) = f(U, u) of fU on the maximal element U determines by intersection all the values fU (V,u) for V ⊂ U. For fU ∈ X(U) and V ≤ U, the functorial image fV of fU in X(V ) is the trace on V :

∀u ∈ X, fV (u)= ρVU fU (u)= fU (u) ∩ V. (27)

99 This implies that XΩ is a sheaf: consider a covering U of U, (1) for two elements fU ,gU of X(U), if the families of restrictions fU ∩ V ; V ∈ U, gU ∩ V ; V ∈ U, then fU = gU ; (2) if ′ a family of coverings fV ; V ∈ U is given, such that for any intersection W = V ∩ V , the restriction fV |W and fV ′ |W coincide, as open coverings, we can define an element fU of X(U) by taking for each u ∈ X the open set fU (u) which is the reunion of all the fV (u) for V ∈ U. The union of the sets fV (u) over u ∈ X is V , and the union of the sets V is U, then the union of the fU (u) when u describes X is U. Q.E.D.

The second axiom of substitution tells that for any u, v ∈ X, δ(u, v) ∩ f(u)= δ(u, v) ∩ f(v). The third axiom of (19), which expresses the functional character of f, tells that for any u, v ∈ X, δ(u, v) ⊇ f(u) ∩ f(v). Consequently, the elements of X(α) can be identified with the open coverings fU (u); u ∈ X of the open set U, such that, in Ω, we have

∀u, v ∈ X, fU (u) ∩ fU (v) ⊆ δ(u, v) ⊆ (fU (u) ⇔ fU (v)); (28) where ⇔ denotes the internal equivalence ⇐∧⇒ in Ω. Remind that α ⇒ β is the largest element γ ∈ Ω such that γ ∧ α ≤ β, and in our topo- logical setting Ω = U(X) it is the union of the open sets V such that V ∩ α ⊆ β, therefore f(u) ⇔ f(v) is the union of the elements V of Ω such that V ∩ f(u)= V ∩ f(v).

Proposition .1. Let Ω be any complete Heyting algebra (i.e. a locale); the two functors Ω F : (X, δ) 7→ (U 7→ X(U) = HomΩ(ΩU ,X) and G : X 7→ (X , δX ) = HomE (Ω, X) define an equivalence of category between SetΩ and E = Sh(Ω,K). Proof. The composition F ◦ G sends a sheaf X(U); U ∈ Ω to the sheaf XΩ(U); U ∈ Ω made by the open coverings of U by sets indexed by the sub-singletons u of X satisfying the two inclusions (28). Consider an element sU ∈ X(U), identified with a section of X over U. For each sub- Ω s singleton v ∈ X , we define the open set f(v) = fU (v) by the largest open set in U where v = sU . As the sub-singletons generate X, this forms an open covering of U. It satisfies (28) for any pair (u, v): δ(u, v) is the largest open set where u coincides with v, then the first inclusion is evident, for the second one, consider the intersection δ(u, v) ∩ f(u), on it we have u = v and u = s, then it is included in δ(u, v) ∩ f(v). Q.E.D. If V ⊂ U and sV = sU |V , the open covering of V defined by sV is the trace of the open covering defined by sU . Moreover, a morphism φ : X → Y in E sends sub-singletons to sub-singletons and induces injections of the maximal domain of extension; therefore the above construction defines a natural transformation ηE from IdE to F ◦ G. This transformation is invertible: take an element f of XΩ(U), and for every U ∈ Ω, consider the set S(f, U) of sub-singletons u of X such that fU (u) =6 ∅. If u and v belong to this set, the first inequality of (28) implies that u = v on the intersection fU (u) ∩ fU (v), then, by the sheaf property 3, S(f, U) defines a unique element uU ∈ X(U). In the other direction, the composition G ◦ F associates to a Ω-set (X, δ) the Ω-set Ω (X , δX,Ω) made by the sub-singletons of the pre-sheaf XΩ, i.e. the families (f, U) of com- patible coverings fV (v), v ∈ X of V ; V ⊂ U. We have δ((f, U), (f, U)) = U; therefore, for simplifying the notations, we denote the singleton by f, and U is δ(f, f). We saw that, for two elements f, (f ′, the open set δ(f, f ′) is the maximal open subset of ′ ′ U ∩ U where the coverings fV (u) and fV (u) coincide for every u ∈ X and V ⊂ U.

100 For a pair (u, f), of u ∈ X and (f ∈ XΩ, we define H(u, f) ∈ Ω as the unions of the open sets fV (u), over V ⊂ δ(f, f) ∩ δ(u,u). The formula (27) implies that H(u, f) is also the union of open sets α such that α ⊂ fα(u), i.e. fα(u)= α. We verify that H is a morphism of Ω-sets: the first axiom

δ(u, v) ∧ H(u, f) ≤ H(v, f) (29) results from δ(u, v) ∧ fα(u) ≤ fα(v) (30) for every α ∈ Ω. The second axiom H(u, f) ∧ δ(f, f ′) ≤ H(u, f ′) (31) comes from the definition of δ(f, f ′) as an open set where the induced coverings coincide. For the third axiom, H(u, f) ∧ H(u, f ′) ≤ δ(f, f ′); (32) ′ ′ if α is included in the intersection we have fα(u)= α = fα(u), then α ≤ δ(f, f ). From (28), we have fα(u) ⊂ δ(u,u), then

H(u, f) ⊂ δ(u,u) (33)

u And for every α ≤ δ(u,u), we can define a special covering fα by

u u fα (u)= α, fα (v)= α ∧ δ(u, v); (34) it satisfies (28). Then δ(u,u)= _ H(u, f) (35) f∈X(U)

The Ω-map H is natural in X ∈ SetΩ. To terminate the proof of the prop. 4, we have to ′ Ω ′ show that H is invertible, that is to find a Ω-map H : XΩ × X → Ω, such that H ◦ H = δX ′ ′ and H ◦ H = δXΩ,Ω. We note the first fuzzy identity by δ and the second one by δ . In fact H′(f,u) = H(u, f) works; in other terms H is an involution of Ω-sets. let us verify this fact: by definition of the composition

′ ′ H ◦ H(u, v)= _ H(u, f) ∧ H (f, v) (36) f is the reunion of the α ∈ Ω such that there exists f with α = fα(u)= fα(v), then by the first inequality in (28) it is included in δ(u, v). Now consider α ≤ δ(u, v) ⊆ δ(u,u), and define a u u u covering of α by fα (w)= α∩δ(u,w) for any w ∈ X, this gives α ≤ fα (v) then α ⊆ H(v, f ), then α ⊂ H(u, f u) ∧ H′(f u, v). On the other side, ′ H ◦ H (g, f)= _ H(g,u) ∧ H(u, f), (37) u is the reunion of the α ∈ Ω such that there exists u with α = fα(u)= gα(u). In this case, we consider the set S(f,α) of elements v ∈ X such that fα(v) =6 ∅. If v and w belong to this set, the first inequality of (28) implies that v = w on the intersection fα(v) ∩ fz(w), then, by the sheaf property, S(f,α) defines a unique element uα ∈ X. This element must be equal to u.

101 The same thing being true for g, this implies that fα(v)= gα(v) for all the elements v of X, some of them giving α the other giving the empty set. Consequently, H ◦ H′(g, f) ⊆ δ′(f,g). The other inclusion δ′(f,g) ⊆ H ◦ H′(g, f) being evident, this terminates the proof of the proposition.  This proposition generalizes to the localic Grothendieck topos the construction of the sheaf space (espace ´etal´ein French) associated to a usual topological sheaf. However the accent in Ω-sets is put more on the gluing of sections than on a well defined set of germs of sections, as in the sheaf space. In some sense, the more general Ω-sets give also a more global approach, as in the original case of Riemann surfaces. Replacing a dynamics for instance by its solutions, pairs of domains and functions on them, with the relation of prolongation over sub-domains. This seems to be well adapted to the understanding of a DNN, on sub-trees of its architectural graph Γ.

The localic Grothendieck topos EΩ are the ”elementary topos” which are sub-extensional (generated by sub-singletons) and defined over Set (cf. [Bel08, p. 207]). Particular cases are characterized by special properties of the lattice structure of the locale Ω (cf. the book of Bell, [Bel08, pp. 208-210]): we say that two elements U, V in Ω are separated by another element α ∈ Ω when one of them is smaller than α but not the other one.

EΩ is the topos of sheaves over a topological space X if and only if Ω is spatial, which means by definition, that any pair of elements of Ω is separated by a large element, i.e. an element α such that β ∧ γ ≤ α implies β ≤ α or γ ≤ α. Moreover, in this case, Ω is the poset of open sets of X, and the large elements are the complement of the closures of points of X. The topological space is not unique, only the sober quotient is unique. A topological space is sober when every irreducible closed set is the closure of one and only one point.

EΩ is the topos of pre-sheaves over a poset CX if and only if Ω is an Alexandrov lattice, i.e. any pair of elements of Ω is separated by a huge (very large) element, i.e. an element α such that Vi∈I βi ≤ α implies that ∃i ∈ I, βi ≤ α. In this case Ω is the set of lower open sets for the Alexandrov topology on the poset. If Ω is finite, large and huge coincide, then spatial is the same as Alexandrov.

B Topos of DNNs and spectra of commutative rings

A finite poset with the Alexandrov topology is sober. This is a particular case of Scott’s topology. Then it is also a particular case of spectral space (cf. Hochster, 1969, [Hoc69] Priestley 1994 [Pri94], that are (prime) spectra of a commutative ring with the Zariski topology.

From the point of view of spectrum, a tree in the direction described in theorem 1.2, corre- sponds to a ring with a unique maximal ideal, i.e. a by definition a local ring. The minimal points correspond to minimal primes. The gluing of two posets along an end- ing vertex corresponds to the fiber product of the two rings over the simple ring with only one prime ideal (cf. the thesis of C.F. Tedd, 2016, Ring constructions on spectral spaces

102 [Ted16]). A ring with a unique prime ideal is a field, in this case the maximal ideal is {0}. This gives the following result:

Proposition .2. The canonical (i.e. sober) topological space of a DNN is the Zariski spectrum of a commutative ring which is the fiber product of a finite set of local rings over a product of fields.

The construction of a local rings for a given finite poset can be made by recurrence over the number of primes, by successive application of two operations: gluing a poset along an open subset of another poset, and joining several maximal points; this method is due to Lewis 1973 [Ted16].

Examples. 1) The topos of Shadoks corresponds to the poset β<α with two points; this is the spectrum of any discrete valuation ring, the ideal {0} and a non-zero maximal ideal. Such a ring is the subset of a commutative field k with a valuation v valued in Z, defined by {a ∈ K|v(a) ≥ 0}. An example is K((x)) the field of fractions of the formal series K[[x]], with the valuation given by the smallest power of x (and ∞) for a = 0. The valuation ring is K[[x]], also noted K{x}, its maximal ideal is mx = xK[[x]]. 2) Consider the poset of length three: γ<β<α. Apply the gluing construction to the ring A = K{x} embedded in K((x)) and the ring B = K((x)){y} projecting to K((x)); this gives the following local ring: ∼ D = K{x} ×K((x)) K((x)){y} = {d = a + yb|a ∈ A, b ∈ B} ⊂ B. (38)

The sequence of prime ideals is

{0} ⊂ yB ⊂ mx + yB. (39)

3) Continuing this process, we get a natural local ring whose spectral space is the chain of length n +1, αn < ... < α0 or simplest DNNs. There is one such ring for any commutative field K:

Dn = {d = an + xn−1bn−1 + ... + x1b1 ∈ K((x1, x2, ..., xn))|

an ∈ K{xn}, bn−1 ∈ K((xn)){xn−1}, ..., b1 ∈ K((x2, ..., xn)){x1}. (40)

The sequence of prime ideals is

{0} ⊂ x1K((x2, ..., xn)){x1} ⊂

x1K((x2, ..., xn)){x1} + x2K((x3, ..., xn)){x2} ⊂

... ⊂ x1K((x2, ..., xn)){x1} + ... + xnK{xn}. (41)

C Classifying objects of groupoids

Proposition .3. There exists an equivalence of category between any connected groupoid G and its fundamental group G.

103 Proof. let us choose an object O in G, the group G is represented by the group GO of automorphisms of O. The inclusion gives a natural functor J : G → G which is full and faithful. In the other direction, we choose for any object x of G, a morphism (path) γx from x to O, we choose γO = idO, and we define a functor R from G to G by sending any −1 object to O and any arrow γ : x → y to the endomorphism γy ◦ γ ◦ γx of O. The rule of composition follows by cancellation. A natural isomorphism between R ◦ J and IdG is the identity. A natural transformation T from J ◦ R to IdG is given by sending x ∈ G to γx, which is invertible for each x. The fact that it is natural results from the definition of R: for every morphism γ : x → y, we have

−1 T (y) ◦ Id(γ)= γy ◦ γ =(γy ◦ γ) ◦ γx ◦ γx = JR(γ) ◦ T (x). (42)

What is not natural in general (except if G = G = {1}) is the choice of R. This makes groupoids strictly richer than groups, but not from the point of view of homotopy equivalence. Every functor between two groupoids that induces an isomorphism of π0, the set of connected components, and of π1, the fundamental group, is an equivalence of category. 

One manner to present the topos E = EG of presheaves over a small groupoid G (up to category equivalence) is to decompose G in connected components Ga; a ∈ A, then E will be product of the topos Ea; a ∈ A of pre-sheaves over each component. For each a ∈ A, the topos Ea is the category of Ga-sets, where Ga denotes the group of auto-morphisms of any object in Ga. The classifying object Ω = ΩG is the boolean algebra of the subsets of A. In the applications, we are frequently interested by the sub-objects of a fixed object X X = {Xa; a ∈ A}. The algebra of sub-objects Ω , has for elements all the subsets that are preserved by Ga for each component a ∈ A independently. Xa Thus we can consider what happens for a given a. Every element Ya ∈ Ω has a comple- c ment Ya = ¬Ya, which is also invariant by Ga, and we have ¬¬ = Id. Here the relation of negation ≤ is the set-theoretic one. It is also true for the operations ∧ (intersection of sets), ∨ (union of sets), and the internal implication p ⇒ q, which is defined in this case by (p ∧ q) ∨ ¬p. Xa All the elements Ya of Ω are reunions of orbits Zi; i ∈ K(Xa) of the group Ga in the Ga-set Xa. On each orbit, Ga acts transitively. Each sub-object of X is a product of sub-objects of the Xa for a ∈ A. The product over a of the K(Xa) is a set K = K(X). X The algebra Ω is the Boolean algebra of the subsets of the set of elements {Zi; i ∈ K}, that we can note simply ΩK . The arrows in this category, p → q, correspond to the pre-order ≤, or equivalently to the inclusion of sets, and can be understood as implication of propositions. This is the implication in the external sense, if p is true then q is true, not in the internal sense qp, also denoted p ⇒ q, that is also the maximal element x such that x ∧ p ≤ q). On this category, there exists a natural Grothendieck topology, named the canonical topol- ogy, which is the largest (or the finest) Grothendieck topology such that, for any p ∈ Ω, the presheaf x 7→ Hom(x, p) is a sheaf. For any p ∈ Ω, the set of coverings JK(p) is the set of collections of sub-sets q of p whose reunion is p. In particular JK(∅) contains the empty family; this is a singleton.

Proposition .4. The topos E is isomorphic to the topos Sh(Ω; K) of sheaves for this topology JK (cf. for instance Bell, Toposes and local set theories, [Bel08]).

104 Proof. For all p, any covering of p has for refinement the covering made by the disjoint singletons Zi that belong to p, seen as a set; then, for every sheaf F over Ω, the restriction maps give a canonical isomorphism from F (p) with the product of the sets F (Zi) over p itself. In particular, any sheaf has for value in ⊥ = ∅ a singleton. 

D Non-boolean information functions

This is the case of chains and injective pre-sheaves on them.

The site Sn is the poset 0 → 1 → ... → n. A finite object E is chosen in the topos of ∧ presheaves Sn , such that each map Ei → Ei−1 is an injection, and we consider the Heyt- ing algebra ΩE, that is made by the sub-objects of E. The inclusion the intersection and the union of sub-objects are evident. The only non-trivial internal operations are the expo- nential, or internal implication Q ⇒ T , and the negation ¬Q, that is a particular case Q ⇒∅.

E Lemma .1. Let Tn ⊂ Tn−1 ⊂ ... ⊂ T0 and Qn ⊂ Qn−1 ⊂ ... ⊂ Q0 be two elements of Ω , then the implication U =(Q ⇒ T ) is inductively defined by the following formulas:

U0 = T0 ∨ (E0\Q0),

U1 = U0 ∧ (T1 ∨ (E1\Q1), ...

Uk = Uk−1 ∧ (Tk ∨ (Ek\Qk), ...

Proof. By recurrence. For n = 0 this is the well known boolean formula. Let us assume the result for n = N − 1, and prove it for n = N. The set UN must belong to UN−1 and must be the union of all the sets V ⊂ EN ∩ UN−1 such that V ∧ QN ⊂ TN , then it is the union of TN ∩ UN−1 and (EN \QN ) ∩ UN−1. In particular the complement ¬Q is made by the sequence

n n−1 \(Ek\Qk) ⊂ \(Ek\Qk) ⊂ ... ⊂ E0\Q0. (43) k=0 k=0 

Definition .2. We choose freely a strictly positive function µ on E0; for any subset F of E0, we note µ(F ) the sum of the numbers µ(x) for x ∈ F . In practice µ is the constant function equal to 1, or to |F |−1.

Definition .3. Consider a strictly decreasing sequence [δ] of strictly positive real numbers E δ0 > δ1 > ... > δn; the function ψδ : Ω → R is defined by the formula

n ψδ(Tn ⊂ Tn−1 ⊂ ... ⊂ T0) = Σk=0δkµ(Tk). (44)

Lemma .2. The function ψδ is strictly increasing.

′ This is because index by index, Tk contains Tk.

105 Definition .4. A function ϕ : ΩE → R is concave concave (resp. strictly concave), if for any pair of subsets T ≤ T ′ and any proposition Q, the following expression is positive (resp. strictly positive),

∆ϕ(Q; T, T ′)= ϕ(Q ⇒ T ) − ϕ(T ) − ϕ(Q ⇒ T ′)+ ϕ(T ′). (45)

Hypothesis on δ: for each k, n ≥ k ≥ 0, we assume that δk > δk+1 + ... + δn. k This hypothesis is satisfied for instance for δ0 =1, δ1 =1/2, ..., δk =1/2 , ....

Proposition .5. Under this hypothesis, the function ψδ is concave.

′ E (k) Proof. Let T ≤ T in Ω . We define inductively an increasing sequence T of Sn-sets by (0) (k) (k−1) ′ taking T = T and, for k > 0, Tj equal to Tj for jk, but equal to Tj for ′ j = k. In other terms, the sequence is formed by enlarging Tk to Tk, index after index. Let (k−1) (k) us prove that ∆ψδ(Q; T , T ) is positive, and strictly positive when at the index k, Tk ′ is strictly included in Tk. The theorem follows by telescopic cancellations. (k−1) (k) ′ The only difference between T and T is the enlargement of Tk to Tk, and this generates (k−1) (k) a difference between Tj |Q and Tj |Q only for the indices j>k. This allows us to simplify the notations by assuming k = 0. The contribution of the index 0 to the double difference ∆ψδ is the difference between the sum of δ0µ over the points in E0\Q0 that do not belong to T0 and the sum of δ0µ over the ′ points in E0\Q0 that do not belong to T0, then it is the sum of δ0µ over the points in E0\Q0 ′ that belong to T0\T0. ′ ′ As in lemma .1, let us write U0 = T0 ∨ (E0\Q0) and U0 = T0 ∨ (E0\Q0). And for k ≥ 1, let us write Vk = Tk ∨ (Ek\Qk), and Wk = V1 ∩ ... ∩ Vk. From the lemma 1, the contribution of the index 1 to the double difference ∆ψδ, is the simple difference between the sum of δkµ over the points in U0 ∩ Wk and its sum over the ′ points in U0 ∩ Wk, then it is equal to the opposite of the sum of δkµ over the points in ′ (T0\T0) ∩ (E0\Q0) ∩ Wk. The hypothesis on the sequence δ implies that the sum over k of these sums is smaller than the difference given by the index 0. 

Remark. In general the function ψδ, whatever being the sequence δ, is not strictly concave, ′ ′ because it can happen that T0 is strictly larger than T0, and the intersection of T0\T0 with E0\Q0 is empty. Therefore, to get a strictly concave function, we take the logarithm, or R∗ R another function from + to that transforms strictly positive strictly increasing concave functions to strictly increasing strictly concave functions. This property for the logarithm comes from the formulas

ϕ′ ϕϕ” − (ϕ′)2 (ln ϕ)” = [ ]′ = < 0. (46) ϕ ϕ2

In what follows we take ψ = ln ψδ as the fundamental function of precision. By normalizing µ and taking δ0 =1, we get 0 < ψδ ≤ 1, −∞ < ψ ≤ 0.

Remark. Lemmas .1, .2 and proposition .5 can easily be extended to the case where the basic site S is a rooted (inverse) tree, i.e. the poset that comes from an oriented graph with several initial vertices and a unique terminal vertex. The computation with intersections works in the same manner. The hypothesis on δ concerns only the descending branches to the terminal vertex.

106 Now, remember that the poset of a DNN is obtained by gluing such trees on some of their initial vertices, interpreted as tips (of forks) or output layers. The maximal points correspond to tanks (of forks) of input layers. Therefore it is natural to expect that the existence of ψ holds true for any site of a DNN.

E Closer to natural languages: linear semantic infor- mation

Several attempts were made by logicians and computer scientists, since Frege and Russel, Tarski and Carnap, to approach the properties of human natural languages by formal lan- guages and processes. In particular, a computational grammar was proposed by Lambek 1958, 1961 [Lam58]: a syntactic category is defined with sentences as objects and applica- tions of grammatical rules as arrows, a second category is defined, that contains products and exponentials, for instance a topos, and semantic is seen as a certain functor from the first category to the second one. This is the first place where semantic is defined as interpre- tations of types and propositions in a topos. Precursors of the kind of grammar considered by Lambek were Adjukiewicz in 1935 and Bar-Hillel in 1953. Then a decisive contribution was made by Montague 1970, [Mon70], which developed in particular a formal treatment of pieces of English, cf. Partee 1975 [Par75]. Also in this approach, semantic appears as a transformation from a syntactic algebraic structure, having lexis and multiple operations, to a coarser structure. In the nineties mathematicians and linguists observed that the categorial point of view, as in Lambek, gives a good framework for developing further the theory of Montague, cf. Van Benthem 1990 [vB90]. The next step used intensional type theories, like the Martin-L¨of theory, 1980, 1984, named modern TT by Luo 2000 [Luo14], or rich TT by Cooper et al. 2016 [CDLL15]. New types were introduced, corresponding to the many structural notions of linguistic, ex. noun, verb, adjective, and so on. Also modalities like interrogative, performative, can be intro- duced. Cf. F. Brunot 1936 [Bru36] for the complexity of the enterprise in French. Recent experiment with programming languages have shown that many properties of languages can be captured by extending TT. For instance, in Martin-L¨of TT it is possible to construct ZFT theories but also alternative Non-well-founded set theories, cf. Aczel 1978 [Acz88], taking into account paradoxical vicious circles as natural languages do, cf. Lindstr¨om 1989 [Lin89]. Even more powerful is the homotopical type theory, HoTT, Voedvoski, Awodey, Kapulkin, Lumsdaine, Shulman, ..., [KLV12]. Cf. Gylterud and Bonnevier, 2020 [GB20], for the inclusion of non-well-founded sets theories. These formal theories do not give a true definition of what is meaning, cf. the funda- mental objections of Austin 1961 [Aus61], but they give an insight of the various ways the meanings can be combined and how they are related to grammar, compatible with the intu- ition we have of human interpretations. We do not suggest that the categorial presentation defines the natural languages, but here also we think that its capture something of toys lan- guages, an some languages games that can help the understanding of semantic functioning in networks, including properties of natural semantics of human peoples.

In what follows, we consider that a certain category A represents the semantic for a certain language, or a certain language’s game, cf. Wittgenstein, 1953 [Wit53], and reflects properties of a language, not the abstract rules, as in the algebra ΩL before. The objects of A represent interpretations of sentences, or images, corresponding to the ”types as propo- sitions” (Curry-Howard) in a certain grammar, and its arrows represent the evocations,

107 significations, or deductions, corresponding to proofs or application of rules in grammar. Oriented cycles are a priori admitted. We simply assume that A is a closed monoidal category cf. Eilenberg-Kelly 1966 [EK66], that connects with linear logic, and linear type theory, cf. Mellies, Categorical Semantics of Linear Logic, 2006, [Mel09]. In such a category, a bifunctor (X,Y ) 7→ X ⊗ Y is given, that is associative up to natural transformation, with a neutral element ∗ also up to linear transformation, satisfying con- ditions of coherence. This product representing aggregation of sentences. Moreover there exists classifiers objects of morphisms, i.e. objects AY defined for any pair of objects A, Y , such that for any X, there exist natural isomorphisms Hom(X ⊗ Y, A) ≃ Hom(X, AY ). (47) The functor X 7→ X ⊗ Y has for right-adjoint the functor A 7→ AY . For us, this defines the semantic conditioning, the effect on the interpretation A that Y is taken into account, when A is evoked by a composition with Y . Thus we also denote AY by Y ⇒ A or A|Y . When A is given, and if Y ′ → Y we get A|Y → A|Y ′. From X ⊗∗ ∼= X, it follows that canonically A∗ ∼= A. We make the supplementary hypoth- esis that ∗ is a final object, then we get a canonical arrow A → A|Y , for any object Y . This represents the internal constants.

Remark. In the product X ⊗ Y , the ordering plays a role, and in linguistic, in the spirit of Montague, two functors can appear, the one we just said Y 7→ X ⊗ Y and the other one X 7→ X ⊗ Y . If both have a left adjoint, we get two exponentials: AY = A|Y and X A = X A; the natural axiomatic becomes the bi-closed category of Eilenberg and Kelly, 1965. Dougherty 1993 [Dou92] gave a clear exposition of part of the Lambek calculus in the Montague grammar in terms of this structure. Cf. also Lambek, 1988, categorial and categorical grammars. The semantic information theory should benefit of this possibility, where composition depends on the ordering, but in what follows, to begin, we assume that A is symmetric: there exist natural isomorphisms exchanging the two factors of the product. All that can be localized in a context Γ ∈ A by considering the category Γ\A of mor- phisms Γ → A, where A describes A, with morphisms given by the commutative triangles. For Γ → A, and Y ∈ A, we get a morphism Γ → A|Y by composition with the canonical morphism A → A|Y . This extends the conditioning. We will discuss the existence of a restricted tensor product later on; it asks restrictions on Γ.

The analog of a theory, that we will also name theory here, is a collection S of propo- sitions A, that is stable by morphisms to the right, i.e. A ∈ S and A → B implies B ∈ S. This can be seen as the consequences of a discourse. A theory S′ is said weaker than a theory S if it is contained in it, noted S ≤ S′. Then the analog of the conditioning of S by Y is the collection of the objects AY for A in S. The collection of theories is partially ordered. We have S|Y ′ ≤ S|Y when there exists Y ′ → Y . In particular S|Y ≤ S, as it was the case in simple type theory. When a context is given, it defines restricted theories, because it introduces a constraint of commutativity for A → B, to define a morphism from Γ → A to Γ → B.

The monoidal category A acts on the set of functions from the theories to a fixed commu- tative group, for instance the real numbers.

108 We will discuss later how the context Γ can be included in a category generalizing D, for obtaining the analog of the logical case with the propositions P excluded. This needs a notion of negation, we will there are plenty ones.

Remark. The model should be more complete if we introduce a syntactic type theory, as in Montague 1970, such that A is an interpretation of part of the types, compatible with products and exponentials. Then some of the arrows can interpret transformation rules in the grammar. The introduction of syntaxes will be necessary for communication between networks. Between two layers α : U → U ′ lifted by h to F, we assume the existence of a functor ∗ ∗ π∗α, h from AU,ξ to AU ′,ξ′ , with a left adjoint πα,h, such that π π∗ = Id, in such a manner that A becomes a pre-co-sheaf over F for π∗ and the sets of theories Θ form a pre-sheaf for π∗. The information quantities are defined as before, by the natural bar-complex associated to the action of A on the pre-cosheaf Φ′ of functions on the functor Θ. The passage to a network gives a dynamic to the semantic, and the consideration of weights gives a model of learning semantic. Even if they are caricature of the natural ones, we hope this will help to capture some interesting aspects of them.

A big difference with the ordinary logical case, is the absence of ”false”, then in general, the absence of the negation operation. This can make the co-homology of information non-trivial.

Another big difference is that the category A is not supposed to be a poset, the sets Hom can be more complex than ∅ and ∗, and they can contains isomorphisms. In particular loops can be present.

Consider for instance any function ψ on the collection of theories; and suppose that there exist arrows from A to B and from B to A; then the function ψ must take the same value on the theories generated by A and B. This tells in particular that they contain the same information.

The homotopy construction of a bi-simplicial set gΘ can be made as before, representing the propagation feed-forward of theories and propagation backward of the propositions, and the information can be defined by a natural increasing and concave map F with values in a closed model category M of Quillen, cf. below. The semantic functioning becomes a simplicial map gS : gX → gΘ, and the semantic spaces are given by the composition F ◦ gS.

Another interest of this generalization: we can assume that a measure of complexity K is attributed to the objects, seen as expressions in a languages, and that this complexity is additive in the product, i.e. K(X ⊗ Y )= K(X)+ K(Y ), and related to the combinatorics of the syntax, and the complexity of the lexicon, and the grammatical rules of formation. In this framework, we could compare the values of K in the category, and define the compres- sion as the ratio F/K of information by complexity.

Remark: it is amazing and happy that the bar-complex for the information co-cycles and the homotopy limit, can also be defined for the bi-closed generalization. The two exponen- tials X A and AY an action of the monoid A to the right and to the left that commute on

109 the functions of theories, and on the bi-simplicial set gΘ. Then we can apply the work of MacLane, Beck on bi-modules and the work of Schulman on enriched categories. •• Taking into account the network, we get a tri-simplicial set Θ∗ of information elements, or tensors, giving rise to a bi-simplicial space of histories of theories, with multiple left and right conditioning, gI••, that is the geometrical analog of the bar-complex of semantic information.

Links with Linear Logic (intuitionist) and negations.

The generalized framework corresponds to a fragment of an intuitionist Linear Logic, cf. Bierman and de Paiva [BdP00], Mellies [Mel09]. The arrows A → B in the category are the expression of the assertions of consequence A ⊢ B, and the product expresses the joint of the elements of the left members of consequences, in the sense that a deduction A1, ..., An ⊢ B corresponds to an arrow A1 ⊗ ... ⊗ An → B. There is no necessarily a ”or” for the right side, but there is an internal implication A ⊸ B which satisfies all the axioms of the above implication A ⇒ B, right adjoint of the tensor product. The existence of the final element corresponds to the existence of (multiplicative) truth 1 = ∗. To be more complete, we should suppose that all the finite products exist in the category A. Then the (categorial) product of two corresponds to an additive disjunction ⊕, then a ”or”, that can generate the right side of sequents B1, ..., Bm in A1, ..., An/B1, ..., Bm; however, a neutral element for ⊕ could be absent, even if it is always present in the full theory of Girard 1987 [Gir87]. No right adjoint is required for ⊕. And in what follows we don’t assume the data ⊕. One of the main ideas of Girard 1987 was to incorporate the fact that in real life the proposition A that is used in a consequence A ⊸ B does’nt stay unchanged after the event, however it is important to give a special status for propositions that continue to holds after the event. For that purpose Girard introduced an operator on the formulas, named a linear exponential, and written !. It is said ”of course” and has the meaning of a reaffirmation, something stable. The functor ! is asked to be naturally equivalent to !!, then a projector in the sense of categories, such that, in a natural manner, the objects !A and the morphisms !f between them satisfy the Gentzen rules of weakening and contraction, respectively (Γ ⊢ ∆)/(Γ, !A ⊢ ∆) and (Γ, A, A ⊢ ∆)/(Γ, A ⊢ ∆). (This corresponds to the traditional assertions A ∧ B ≤ A and A ≤ A ∧ A.) Further axioms tell, when translated in categorical terms, that ! is a monoidal functor equipped with two natural transformations ǫA : !A → A and δA :!A →!!A, that are monoidal transformations, satisfying the coherence rules of a co-monad, and with natural transformations eA :!A → 1 (useful when 1 is not assumed final) and dA :!A →!A⊗!A, that is a diagonal operator, also satisfying coherence axioms telling that each !A is a commutative comonoid, and each !f a morphism of commutative comonoid. From all these axioms, it is proved that under ! the monoidal product becomes an usual categorial product in the category !A := A!,

!(A ⊗ B) ∼=!A⊗!B ∼=!(A × B); (48) and the category A!, named the Keisli category of A, !, is cartesian closed/ More precisely, under ! the multiplicative exponential becomes the usual exponential:

!(A ⊸ B) ∼=!B!A. (49)

Remind that a comonad in a category is a functor T of this category to itself, equipped with two natural transformations T → T ◦ T and ε : T → Id, satisfying coassociativity and co- unity axioms. This the dual of a monad, T ◦ T → T and Id → T , that is the generalization of monoids to categories. The functor ! is an example of comonad. Cf. MacLane Categories

110 for working mathematician, [Mac71].

The axioms of a closed symmetric monoidal category, plus the existence of finite prod- ucts, plus the functor !, give the largest part of the Gentzen rules, as they were generalized by Jean-Yves Girard 1987.

For us, the linear exponential ! permits to localize the product at a given proposition, in the sense that the slice category to the right Γ|A is closed by products of linear exponential objects as soon as Γ belongs to A!. Demonstration: if we restrict us to the arrows !Γ → Q, then the product !Γ → Q ⊗ Q′ is ob- ′ tained by composing the diagonal d!Γ :!Γ →!Γ⊗!Γ with the tensor product !Γ⊗!Γ → Q ⊗ Q . Its right adjoint is given by !Γ → (Q ⊸ R), obtained by composing !Γ → Q with the natural map Q → Q|R.

To localize the theories themselves at P , for instance at a !Γ, we used, in the Heyting case, a notion of negation. To exclude a certain propostion was the only coherent choice from the point of view of information, and this was also in accord with the experiments of spontaneous logics in small networks.

In the initial work of Girard, negation was a fundamental operator, verifying the hy- pothesis of involution ¬¬ = Id, then giving a duality. That explains that the initial theory is considered as a classical Linear Logic; it generalizes the usual Boolean logic in another direction than intuitionism. In a linear intuitionist theory, the negation is not necessary, but it is also not forbidden, and axioms were discussed in the nineties. We follow here the exposition of Paul-Andr´eMelli`es, 2006 [Mel09], and of his article with Nicolas Tabareau 2009 [MT10]. The authors work directly in a monoidal category A, without assuming that it is closed, and define negation as a functor ¬ : A→Aop, such that the opposite functor ¬op from Aop to A, also denoted by ¬, is the left-adjoint of ¬, giving a unit η : Id → ¬¬ and a counit ǫ : ¬¬ → Id, that are not equivalence in general. Then there exist for any objects A, B a canonical bijection bijection between HomA(¬A, B) and HomA(¬B, A). Note that in this case ε and η coincide, because the morphisms in Aop are the morphisms in A written in the reverse order. The double negation T = ¬op¬ forms a monad whose η is the unit; the multiplication µ : ¬¬¬¬ → ¬¬ is obtained by composing Id¬ with ¬(η), to the left or to the right, that is µA = ¬(ηA) ◦ Id¬A = Id¬¬¬A ◦ ¬(ηA). In theoretical computer science, T is called the continuation monad, and plays an important role in computation and games logics, cf. Kock, Moggi, Mellies, Tabareau.

In the case of the Heyting algebra of a topos (elementary), this continuation defines a topol- ogy, named after Lawvere and Tirney, which defines the unique subtopos that is Boolean and dense (i.e. contains the initial object ∅, cf. O. Caramello 2012, Universal models and definability [Car12].

The second important axiom tells how is transformed the (multiplicative) product ⊗: it is asked that for any objects B,C the object ¬(B ⊗ C) represents the functor A 7→ Hom(A ⊗ B, ¬C) ∼= Hom(C, ¬(A ⊗ B); that is Hom(A ⊗ B, ¬C) ∼= Hom(A, ¬(B ⊗ C)). (50) This bijection being natural in the three argument and coherent with the associativity and

111 unit for the product ⊗. For instance all the sets Hom(ABC, ¬D), Hom(AB, ¬(CD), Hom(A, ¬(BCD)), are iden- tified with Hom(ABCD, ¬1). Mellies and Tabareau called such a structure a tensorial negation, and named the monoidal category A, equipped with ¬, a dialogue category.

The special object ¬1 is canonically associated to the chosen negation; it is named the pole and frequently denoted by ⊥. It has no reason in general to be an initial object of A.

A monoidal structure of (multiplicative) disjunction is deduced from the tensor product by duality: A℘B = ¬(¬A ⊗ ¬B). (51) Its neutral element is the pole of ¬. This implies that the notion of ”or” is parameterized by the variety of negations, that we will see equivalent to A itself.

In the same manner an additive conjonction is defined by A&B = ¬(¬A ⊕ ¬B). (52) Its neutral element is ⊤ = ¬∅, when an initial element ∅ exists, that is the additive ”false”.

An operator ? is introduced in classical linear that satisfies ?¬A = ¬!A, ¬?A =!¬A (53) For us, supposing these relations, is not sufficient to define it, because ¬ is not a bijection.

The Girard operator ? means ”why not?”, as the operator ! means ”of course”; they are examples of modalities, and correspond to the modalities more frequently noted  and ⋄ in modal logics.

However, Hasegawa, Moggi, Mellies, Tabareau, have remarked that more convenient ten- sorial negations must satisfy a further axiom. Note that this story began with Kock, 1969, 1970, 1972, inspired by Eilenberg and Kelly 1966, cited. Cf. [Has03], [MT10].

Lemma .3. From the second axiom of a tensorial negation it results two natural transfor- mations ¬¬A ⊗ B → ¬¬(A ⊗ B); (54) A ⊗ ¬¬B → ¬¬(A ⊗ B). (55) A monad where such maps exist in a monoidal category, is named a strong monad, Kock, 1970, 1972, cf. Moggi, Computations and monads [Mog91]. The first transformation is named the strength of the monad T = ¬¬, the second one its co-strength. Proof. Let us start with the Identity morphism of ¬(A ⊗ B); by the axiom, it can be interpreted as a morphism B ⊗ ¬(A ⊗ B) → ¬A, then applying the functor ¬, we get a morphism ¬¬A → ¬[B ⊗ ¬(A ⊗ B)]; (56)

112 then, applying the axiom again, we obtain a natural transformation

¬¬A ⊗ B → ¬¬(A ⊗ B). (57)

Exchanging the roles of A and B gives the other transformation. Said in other terms, we have natural bijections given by the tensorial axiom, applied two times,

Hom(¬(A ⊗ B), ¬(A ⊗ B)) ∼= Hom(¬(A ⊗ B) ⊗ B, ¬A) ∼= Hom(A, ¬[B ⊗ ¬(A ⊗ B)] ∼= Hom(A ⊗ B, ¬¬(A ⊗ B)); (58) and also natural bijections, obtained in the same manner,

Hom(¬(A ⊗ B), ¬(A ⊗ B)) ∼= Hom(¬(A ⊗ B) ⊗ A, ¬B) ∼= Hom(B, ¬[A ⊗ ¬(A ⊗ B)] ∼= Hom(A ⊗ B, ¬¬(A ⊗ B)); (59)

The identity of ¬(A⊗B) in the first term gives a natural marked point, that is also identifiable with ηA⊗B in the last term. On the set Hom((¬(A⊗B)⊗B, ¬A) (resp. Hom(A⊗¬(A⊗B), ¬B)) we can apply the functor ¬; this gives a map to Hom(¬¬A, ¬[B ⊗ ¬(A ⊗ B)]) (resp. Hom(¬¬B, ¬[A ⊗ ¬(A ⊗ B)])), then the strength (resp. the costrength) after applying the second axiom.  The strength and co-strength taken together give two a priori different transformations T A ⊗ T B → T (A ⊗ B) (cf. n lab cafe, Kock, Moggi, Hazegawa). The first one is the composition starting with the co-strength of T A followed by the strength of B, then ending with the product:

T A ⊗ T B → T (T A ⊗ B) → T T (A ⊗ B) → T (A ⊗ B); (60) the other one starts with the strength, then uses the co-strength, and ends with the product

T A ⊗ T B → T (A ⊗ T B) → T T (A ⊗ B) → T (A ⊗ B). (61)

Then a third axiom was suggested by Kock in general for strong monads, and reconsidered by Hazegawa, Moggi, Mellies and Tabareau, it consist to require that these two morphisms coincide. This is named since Kock a commutative monad, or a monoidal monad. We will say that the negation itself is monoidal.

According to Mellies and Tabareau, Hasegawa observed that T = ¬¬ is commutative, if and only if η gives an isomorphism ¬ ∼= ¬¬ on the objects of ¬A, if and only if µ gives an isomorphism on the objects of A.

Proposition .6. A necessary and sufficient condition for having ¬ monoidal is that for each object A, the transformation η¬A is an equivalence from ¬A and ¬¬¬A in the category A.

η ′ Corollary. Define A as the collection of objects A of A, such that ηA′ is an isomporphism; in the commutative case, ¬A is a sub-category ¬ induces an equivalence of the full subcategory Aη of A with its opposite Cf. [Bel08, Proposotion 1.31].

113 Thus we recover most of the usual properties of negation, without having a notion of false.

Now assume that A is symmetric monoidal and closed; we get natural isomorphisms

¬(A ⊗ B) ≈ A ⇒ ¬B ≈ B ⇒ ¬A. (62)

And using the neutral element 1 = ∗ for C, and denoting ¬1 by P , we obtain that ¬B = B ⊸ P .

Proposition .7. Conversely, for any object P ∈A, the functor A 7→ (A ⊸ P )= P |A is a tensor negation whose pole is P .

Proof. First, this is a contra-variant functor in A. Secondly, for any pair A, B in A, using the symmetry hypothesis, we get natural bijections

Hom(B, A ⊸ P ) ∼= Hom(B ⊗ A, P ) ∼= Hom(A, B ⊸ P ). (63)

This gives the basic adjunction. Third, for any triple A,B,C in A, the associativity gives

Hom(A ⊗ B,C ⊸ P ) ∼= Hom(A ⊗ B ⊗ C,P ) =∼ Hom(A, (B ⊗ C) ⊸ P ). (64)

This gives the tensorial condition.  The transformation η is given by the Yoneda lemma, from the following natural map

Hom(X, A) → Homop(¬X, ¬A) ∼= Hom(X, ¬¬A). (65)

There is no reason for asserting that this negation is commutative. From the proposition 1, the necessary and sufficient condition is that for any object A, the following map an isomorphism

ηA⇒P :(A ⇒ P ) → (((A ⇒ P ) ⇒ P ) ⇒ P ). (66)

Even for A = 1 this is a non-trivial condition: P ≈ ((P ⇒ P ) ⇒ P ). The fact that 1 ⇒ P ≡ P being obvious.

Choose an arbitrary object ∆ and define ¬Q as Q ⊸ ∆. This ∆ will play the role of ”false”. We say that a theory T excludes P if it contains P ⊸ ∆. This is equivalent to say that there exists R in T such that R → (P ⊸ ∆), i.e. R ⊗ P → ∆, that is by symmetry: there exists P → (R ⊸ ∆). In particular, if P → R, we obtain such a map by composition with R → (R ⊸ ∆).

To localize the action of the proposition at P , we have to prove the following lemma:

Lemma .4. Conditioning by Q such that P → P ⊗ Q is non-empty, sends a theory T that excludes P into a theory T that also excludes P .

Proof. From the hypothesis we have a morphism ¬(P ×Q) → ¬P , but ¬(P ×Q) is isomorphic to Q ⇒ (P ⇒ ∆)=(¬P )|Q. 

114 This is analog to the statement of Proposition 3.2 in section 3.3, because in this case P ≤ Q is equivalent to P = P ∧ Q and to P ≤ P ∧ Q. The proof does’nt use that P is a linear exponential object.

Now assume that P belongs to the category A!, i.e. P =!Γ for a certain object Γ ∈ A; we saw that the set AP of Q such that P → Q forms a closed monoidal category, and by the above lemma, it acts on the set of theories excluding P . That is because P → Q implies P → P ⊗ P → P ⊗ Q Therefore, all the ingredients of the information topology are present in this situation.

115 Bibliography

[AB11] Samson Abramsky and Adam Brandenburger. The sheaf-theoretic structure of non-locality and contextuality. New Journal of Physics, 13(11):113036, 2011.

[Acz88] Peter Aczel. Non-Well-Founded Sets. Stanford University, Center for the Study of Language and Information, 1988.

[AGV63] Michael Artin, Alexander Grothendieck, and Jean-Louis Verdier. Th´eorie des topos et cohomologie ´etale des sch´emas. SGA4, IHES, 1963.

[AGZV12a] E. Arnold, S.M. Gusein-Zade, and A.N. Varchenko. Singularities of Differ- entiable Maps, Volume 2: Monodromy and Asymptotics of Integrals. Modern Birkh¨auser Classics. Birkh¨auser Boston, 2012.

[AGZV12b] V.I. Arnold, S.M. Gusein-Zade, and A.N. Varchenko. Singularities of Differ- entiable Maps, Volume 1: Classification of Critical Points, Caustics and Wave Fronts. Modern Birkh¨auser Classics. Birkh¨auser Boston, 2012.

[AK11] Peter Arndt and Krzysztof Kapulkin. Homotopy-theoretic models of type the- ory. In Proceedings of the 10th International Conference on Typed Lambda Calculi and Applications, TLCA’11, page 45–60, Berlin, Heidelberg, 2011. Springer-Verlag.

[ARDK16] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. CoRR, abs/1601.01705, 2016.

[Arg78] Jose Argemi. Approche qualitative d’un probl`eme de perturbations singuli`eres dans R4. Equadiff. 78, Conv. int. su equazioni differenziali ordinarie ed equazioni funzionali, Firenze 1978, 333-340 (1978)., 1978.

[Arn73] V. I. Arnold. Normal forms for functions near degenerate critical points, the Weyl groups of Ak,Dk, Ek and lagrangian singularities. Funct. Anal. Appl., 6:254–272, 1973.

[Aus61] J. L. Austin. Philosophical Papers. Oxford University Press, 1961.

[AW09] Steve Awodey and Michael A. Warren. Homotopy theoretic models of iden- tity types. Mathematical Proceedings of the Cambridge Philosophical Society, 146(1):45–55, Jan 2009.

[BB15] Pierre Baudot and Daniel Bennequin. The Homological Nature of Entropy. Entropy, pages 3253–3318, 2015.

116 [BBD+11] J. Bao, P. Basu, M. Dean, C. Partridge, A. Swami, W. Leland, and J. A. Hendler. Towards a theory of semantic communication. In 2011 IEEE Network Science Workshop, pages 110–117, 2011. [BBDH14] P. Basu, J. Bao, M. Dean, and J. Hendler. Preserving quality of information by using semantic relationships. Pervasive and Mobile Computing, 11:188 – 202, 2014. [BBG20] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical informa- tion cells, Part I. Internal technical report, Huawei, 2020. [BBG21] Jean-Claude Belfiore, Daniel Bennequin, and Xavier Giraud. Logical informa- tion cells, Part II. Internal technical report, Huawei, 2021. [BCB16] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ArXiv, abs/1409.0473v7, 2016. [BdP00] G. M. Bierman and V. C. V. de Paiva. On an intuitionistic modal logic. Studia Logica: An International Journal for Symbolic Logic, 65(3):383–416, 2000. [Bel08] John L. Bell. Toposes and Local Set Theories. Dover, 2008. [Ben86] Daniel Bennequin. Caustique mystique. In S´eminaire Bourbaki : vol- ume 1984/85, expos´es 633-650, number 133-134 in Ast´erisque. Soci´et´e math´ematique de France, 1986. talk:634. [BFL11] John C Baez, Tobias Fritz, and Tom Leinster. A characterization of entropy in terms of information loss. Entropy, 13(11):1945–1957, 2011. [BHS+18] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 511– 520, Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018. PMLR. [BK72] A. K. Bousfield and D. M. Kan. Homotopy Limits, Completions and Localiza- tions. Springer, 1972. [BPSPV20] Daniel Bennequin, Olivier Peltre, Gr´egoire Sergeant-Perthuis, and Juan Pablo Vigneaux. Extra-fine sheaves and interaction decompositions, 2020. [Bru36] Ferdinand Brunot. La pens´ee et la langue. Masson et compagnie, 1936. [BTBG19] Pierre Baudot, Monica Tapia, Daniel Bennequin, and Jean-Marc Goaillard. Topological information data analysis. Entropy, 21(9):869, Sep 2019. [BW21] Roberto Bondesan and Max Welling. The hintons in your neural network: a quantum field theory view of deep learning, 2021. [Car50] Rudolf Carnap. Logical Foundations of Probability. Chicago]University of Chicago Press, 1950. [Car09] Olivia Caramello. The Duality Between Grothendieck Toposes and Geometric Theories. University of Cambridge, 2009.

117 [Car12] Olivia Caramello. Universal models and definability. Mathematical Proceedings of the Cambridge Philosophical Society, 152(2):279–302, 2012.

[Car18] Olivia Caramello. Theories sites toposes : relating and studying mathematical theories through topos-theoretic ’bridges’. Oxford University Press, Oxford, 2018.

[CBH52] Rudolf Carnap and Jehoshua Bar-Hillel. An Outline of a Theory of Semantic Information. Technical report, Research Laboratory of Electronics, MIT, 1952.

[CDLL15] R. Cooper, Simon Dobnik, Staffan Larsson, and Shalom Lappin. Probabilistic type theory and natural language semantics. Linguistic Issues in Language Technology, 10, 2015.

[CGCB14] Junyoung Chung, C¸aglar G¨ul¸cehre, KyungHyun Cho, and Yoshua Bengio. Em- pirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

[CGW20] Taco Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant cnns on homogeneous spaces, 2020.

[Cis03] Denis-Charles Cisinski. Images directes cohomologiques dans les cat´egories de mod`eles. Annales Math´ematiques Blaise Pascal, 10(2):195–244, 2003.

[Cis06] Denis-Charles Cisinski. Les Pr´efaisceaux Comme Mod`eles Des Types D’homotopie. Ast´erisque (Soci´et´e Math´ematique de France). Soci´et´e Math´ematique de France, 2006.

[Cis19] Denis-Charles Cisinski. Higher Categories and Homotopical Algebra. Cam- bridge Studies in Advanced Mathematics. Cambridge University Press, 2019.

[CLS95] Antoine Culioli, Michel Liddle, and John T. Stonham. Cognition and Repre- sentation in Linguistic Theory. Benjamins, John Publishing Company, 1995.

[Cur13] Justin Curry. Sheaves, cosheaves and applications. arXiv preprint arXiv:1303.3255, 2013.

[Cur17] Carina Curto. What can topology tell us about the neural code? Bulletin of the American Mathematical Society, 54(1):63–78, 2017.

[CvMBB14] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Ben- gio. On the properties of neural machine translation: Encoder-decoder ap- proaches. CoRR, abs/1409.1259, 2014.

[CWKW19] Taco S. Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. Gauge equivariant convolutional networks and the icosahedral cnn, 2019.

[CZCG05] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas J Guibas. Persistence barcodes for shapes. International Journal of Shape Modeling, 11(02):149–187, 2005.

[DHKS04] William G. Dwyer, Philip S. Hirschhorn, Daniel M. Kan, and Jeffrey H. Smith. Homotopy Limit Functors on Model Categories and Homotopical Categories. AMS Mathematical Surveys and Monographs, 2004.

118 [DHSB20] David Ding, Felix Hill, Adam Santoro, and Matt M. Botvinick. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic mod- els with flexible distributed architectures. CoRR, abs/2012.08508, 2020.

[Dou92] Daniel J. Dougherty. Closed categories and categorical grammar. Notre Dame Journal of Formal Logic, 34(1):36 – 49, 1992.

[Dug08] Daniel Dugger. A primer on homotopy colimits. https://pages.uoregon.edu/ddugger/hocolim.pdf, 2008.

[EK66] Samuel Eilenberg and G. Max Kelly. Closed categories. In S. Eilenberg, D. K. Harrison, S. MacLane, and H. R¨ohrl, editors, Proceedings of the Conference on Categorical Algebra, pages 421–562, Berlin, Heidelberg, 1966. Springer Berlin Heidelberg.

[EM45] Samuel Eilenberg and Saunders MacLane. General theory of natural equiv- alences. Transactions of the American Mathematical Society, 58(2):231–294, 1945.

[Fri05] Joel Friedmann. Cohomology in Grothendieck topologies and lower bounds in Boolean complexity. ArXiv, 2005.

[FS18] Brendan Fong and David I Spivak. Seven sketches in compositionality: An invitation to applied category theory, 2018.

[FST19] Brendan Fong, David I. Spivak, and R´emy Tuy´eras. Backprop as functor: a compositional perspective on supervised learning. ArXiv, 2019.

[GB20] H˚akon Robbestad Gylterud and Elisabeth Bonnevier. Non-wellfounded sets in homotopy type theory, 2020.

[GH11] R. Ghrist and Y. Hiraoka. Applications of sheaf cohomology and exact se- quences on network codings. Research Institute for Mathematical Sciences, 1752:31–40, 2011.

[Gir64] Jean Giraud. M´ethode de la descente. Number 2 in M´emoires de la Soci´et´e Math´ematique de France. Soci´et´emath´ematique de France, 1964.

[Gir71] Jean Giraud. Cohomologie non ab´elienne. Springer, 1971.

[Gir72] Jean Giraud. Classifying Topos. Toposes, algebraic Geometry and Logic, Dalhousie Univ. Halifax 1971, Lect. Notes Math. 274, 43-56 (1972)., 1972.

[Gir87] Jean-Yves Girard. Linear logic. Theoretical Computer Science, 50(1):1–101, 1987.

[Gro83] Alexander Grothendieck. Pursuing Stacks. unpublished, 1983. https://thescrivener.github.io/PursuingStacks/ps-online.pdf.

[Gro90] Alexandre Grothendieck. Les d´erivateurs, 1990.

[GWDPL76] C.G. Gibson, K. Wirthmuller, A.A. Du PLessis, and Eduard Looijenga. Topo- logical Stability of Smooth Mappings, volume 552 of Lecture Notes in Mathe- matics. Springer, 1976.

119 [GZ67] P. Gabriel and M. Zisman. Calculus of Fractions and Homotopy Theory. Ergeb- nisse der Mathematik und ihrer Grenzgebiete. 3Island Press, 1967. [Has03] Masahito Hasegawa. Coherence of the double negation in linear logic. In (Algebra, Logic and Geometry in Informatics, 2003. [Hir03] Philip S. Hirschhorn. Model Categories and their Localizations, volume 99 of Mathematical Surviews and Monographs. AMS, 2003. [HM19] Drew A. Hudson and Christopher D. Manning. GQA: a new dataset for com- positional question answering over real-world images. CoRR, abs/1902.09506, 2019. [Hoc69] M. Hochster. Prime ideal structure in commutative rings. Transactions of the American Mathematical Society, 142:43–60, 1969. [Hol01] Sharon Hollander. A homotopy theory for stacks. Israel Journal of Mathemat- ics, 163:93–124, 2001. [Hol08] Sharon Hollander. A homotopy theory for stacks. Israel Journal of Mathemat- ics, 2008. [HS97] Sepp Hochreiter and Jurgen Schmidhuber. Long-short term memory. Neural Computation, 9, 1997. [HS98] Martin Hofmann and Thomas Streicher. The groupoid interpretation of type theory. In Twenty-five years of constructive type theory (Venice, 1995), vol- ume 36 of Oxford Logic Guides, pages 83–111. Oxford Univ. Press, New York, 1998. [HS17] Joel C. Heck and Fathi M. Salem. Simplified minimal gated unit variations for recurrent neural networks. CoRR, abs/1701.03452, 2017. [Jar09] J. F. Jardine. Cocycle categories. In Nils Baas, Eric M. Friedlander, Bj¨orn Jahren, and Paul Arne Østvær, editors, Algebraic Topology: The Abel Sym- posium 2007, pages 185–218, Berlin, Heidelberg, 2009. Springer Berlin Heidel- berg. [KL14] Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generat- ing image descriptions. CoRR, abs/1412.2306, 2014. [KLV12] Chris Kapulkin, Peter Lumsdaine, and Vladimir Voevodsky. The simplicial model of . arxiv, 11 2012. [Kon18] Risi Kondor. N−body Networks: a Covariant Hierarchical Neural Network Architecture for learning atomic potentials. ArXiv, abs/1803.01588, 2018. [Lam58] Joachim Lambek. The mathematics of sentence structure. The American Mathematical Monthly, 65(3):154–170, 1958. [Lin89] Ingrid Lindstr¨om. A construction of non-well-founded sets within martin-l¨of’s type theory. The Journal of Symbolic Logic, 54(1):57–64, 1989. [Loo78] Eduard Looijenga. The discriminant of a real simple singularity. Compositio Mathematica, 37(1):51–62, 1978.

120 [Low15] Zhen Lin Low. Cocycles in categories of fibrant objects, 2015.

[LS81] Joachim Lambek and Philip J. Scott. Intuitionist type theory and foundations. Journal of Philosophical Logic, 10:101–115, 1981.

[LS88] Joachim Lambek and Philip J Scott. Introduction to higher-order categorical logic, volume 7. Cambridge University Press, 1988.

[Luo14] Zhaohui Luo. Formal semantics in modern type theories: Is it model-theoretic, proof-theoretic, or both? In Nicholas Asher and Sergei Soloviev, editors, Log- ical Aspects of Computational Linguistics, pages 177–188, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[Lur09] J. Lurie. Higher Topos Theory (AM-170). Academic Search Complete. Prince- ton University Press, 2009.

[Mac71] Saunders MacLane. Categories for the Working Mathematician. Springer- Verlag, New York, 1971. Graduate Texts in Mathematics, Vol. 5.

[Mac12] S. MacLane. Homology. Classics in Mathematics. Springer Berlin Heidelberg, 2012.

[Mal05] Georges Maltsiniotis. La th´eorie de l’homotopie de Grothendieck. Number 301 in Ast´erisque. Soci´et´emath´ematique de France, 2005.

[Mar82] Jean Martinet. Singularities of smooth functions and maps. Transl. from the French by Carl P. Simon, volume 58. Cambridge University Press, Cambridge. London Mathematical Society, London, 1982.

[McG54] W. McGill. Multivariate information transmission. Psychometrika, 19:97–116, 1954.

[Mel09] Paul-Andr´eMelli`es. Categorical Semantics of Linear logic. SMF, 2009.

[Mil62] J. Milnor. On axiomatic homology theory. Pacific Journal of Mathematics, 12(1):337 – 341, 1962.

[ML80] Per Martin-L¨of. Intuitionistic Type Theory. Lectures given at University of Padova, 1980.

[MLM92] Saunders Mac Lane and Ieke Moerdijk. Sheaves in geometry and logic: a first introduction to topos theory. Universitext. New York etc.: Springer-Verlag. xii, 627 p. , 1992.

[MM20] Yuri Manin and Matilde Marcolli. Homotopy theoretic and categorical models of neural information networks. ArXiv, 2020.

[Mog91] Eugenio Moggi. Notions of computation and monads. Information and Com- putation, 93(1):55–92, 1991.

[Mon70] Richard Montague. Universal grammar. Theoria, 36:373–398, 1970.

[MT10] Paul-Andr´eMelli`es and Nicolas Tabareau. Resource modalities in tensor logic. Annals of Pure and Applied Logic, 161(5):632–653, February 2010.

121 [MXY+15] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Zhiheng Huangand Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). ArXiv, abs/1412.6632v5, 2015.

[Par75] Barbara Partee. Montague grammar and transformational grammar. Linguistic Inquiry, 6(2):203–300, 1975.

[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[Pel20] O. Peltre. A Homological Approach to Belief Propagation and Bethe Approxi- mations. PhD thesis, Dept of Mathematics, Universit´eParis Diderot, 2020.

[PKM19] Alexander Port, Taelin Karidi, and Matilde Marcolli. Topological analysis of syntactic structures. arXiv preprint arXiv:1903.05181, 2019.

[Pri94] H.A. Priestley. Spectral sets. Journal of Pure and Applied Algebra, 94:101–114, 1994.

[Pro19] Alain Prout´e. Introduction `ala logique cat´egorique. MSc Course, Universit´e Paris Diderot, 2019.

[Qui67] Daniel G. Quillen. Homotopical Algebra. Lecture notes in mathematics. Springer-Verlag, 1967.

[Rap10] G.Raptis. Homotopy theoryofposets. Homology, Homotopy and Applications, 12:211–230, 2010.

[Rob17] Michael Robinson. Sheaves are the canonical data structure for sensor integra- tion. Information Fusion, 36:208–224, 2017.

[Ros09] J. Rosick´y. On combinatorial model categories. Applied Categorical Structures, 17:303–316, 2009.

[RSB+17] David Raposo, Adam Santoro, David G. T. Barrett, Razvan Pascanu, Timo- thy P. Lillicrap, and Peter W. Battaglia. Discovering objects and their relations from entangled scene representations. CoRR, abs/1702.05068, 2017.

[Shu12] Michael Shulman. Exact completions and small sheaves, 2012.

[SP12] Chris Schommer-Pries. The canonical model structure on cat. Blog, https://sbseminar.wordpress.com/2012/11/, 2012.

[SP21a] Gr´egoire Sergeant-Perthuis. Bayesian/graphoid intersection property for fac- torisation spaces, 2021.

[SP21b] Gr´egoire Sergeant-Perthuis. Interaction decomposition for presheaves, 2021.

[SP21c] Gr´egoire Sergeant-Perthuis. Intersection property and interaction decomposi- tion, 2021.

[SRB+17] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Raz- van Pascanu, Peter W. Battaglia, and Timothy P. Lillicrap. A simple neural network module for relational reasoning. CoRR, abs/1706.01427, 2017.

122 [Sta14] Alexandru E. Stanculescu. Stacks and sheaves of categories as fibrant objects, I and II. Arxiv, 2014.

[SW49] Claude E. Shannon and Warren Weaver. The Mathematical Theory of Com- munication. University of Illinois Press, Urbana and Chicago, 1949.

[Ted16] Christopher Francis Tedd. Ring constructions on spectral spaces, 2016.

[Tho72] Ren´e Thom. Stabilit´estructurelle et morphog´en´ese : Essai d’une th´eorie g´en´erale des modeles. Benjamin, 1972.

[Tho80] R. W. Thomason. Cat as a closed model category. Cahiers de Topologie et G´eom´etrie Diff´erentielle Cat´egoriques, 21(3):305–324, 1980.

[Tho83] Ren´eThom. Mathematical Models of Morphogenesis. Ellis Horwood Series in Mathematics and its applications. Ellis Horwood, 1983. translated by W.M. Brookes and D. Rand.

[Tho88] Ren´e Thom. Esquisse d’une s´emiophysique : Physique aristot´elicienne et th´eorie des catastrophes. Dunod, 1988.

[Tin62] Hu Kuo Ting. On the amount of information. Theory of Probability and its Applications, 1962.

[vB90] Johan van Benthem. Categorial grammar and type theory. Journal of Philo- sophical Logic, 19(2):115–168, 1990.

[Vig19] Juan-Pablo Vigneaux. Topology of statistical systems : a cohomological ap- proach to information theory. PhD thesis, University of Paris Diderot, 2019.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

[Wit53] Ludwig Wittgenstein. Philosophical Investigations. Oxford, Blackwell, 1953.

[XQLJ20] Huiqiang Xie, Zhijin Qin, Geoffrey Ye Li, and Biing-Hwang Juang. Deep learning enabled semantic communication systems. ArXiv, 2020.

[YFW01] Jonathan S Yedidia, William Freeman, and Yair Weiss. Generalized belief propagation. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.

[ZRS+18] Vin´ıcius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yu- jia Li, Igor Babuschkin, Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter W. Battaglia. Relational deep reinforcement learning. CoRR, abs/1806.01830, 2018.

[ZWZZ16] Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. Minimal gated unit for recurrent neural networks. CoRR, abs/1603.09420, 2016.

123