<<

arXiv:2106.07890v1 [math.CT] 15 Jun 2021 [email protected] Appendix Acknowledgements Conclusion structure 4. Tropical interpretation space metric 3.1. A setting enriched the in 3. Semantics theory enriched of 2.3. Basics language of category enriched 2.2. The enrichment through statistics 2.1. Incorporating theory? category 2. Why 1.2. Compositionality 1.1. 1 a is References Implication of product the E. Computing colimits and functorD. limits Weighted enriched an as embedding Yoneda C. The enriched an as probability B. Conditional A. .Introduction 1. 3 -aladdresses E-mail 2 ,T X, T T NERCE AEOYTER FLNUG:FROM LANGUAGE: OF THEORY CATEGORY ENRICHED AN UNNEL HE A-AA BRADLEY TAI-DANAE HE C htge ihwa.W hnps oteerce aeoyo u of category enriched sem the find to to category pass syntactical then this We on copresheaves synt valued what. is category with This goes pro another. what of conditional extension are an Objects objects is hom expression interval. and unit language the in over expressions enriched category a probabil as model texts enriche we an speaking, to Roughly texts information. given semantic passi of for extensions framework on mathematical distributions a ity propose we paper, gr this of In knowledge a including sophistication, some implies it A ITY BSTRACT M N , OONSHOT U EW IEST OF NIVERSITY : Y ie ic ftx,teaiiyt eeaeachrn ext coherent a generate to ability the text, of piece a Given . [email protected],[email protected], [email protected], ORK F NY , COYAND ACTORY . [0 , YTXT SEMANTICS TO SYNTAX 1] 1 N ONTERILLA JOHN , -functor EW Y R AND ORK [0 S C ANDBOX , ONTENTS 1] -copresheaves 1 T 2 @A UNNEL N INI VLASSOPOULOS YIANNIS AND , LPHABET N , EW M , ma n semantics. and ammar Y aeoycontaining category d cia—tdescribes actical—it ORK ni information. antic fti aeoyare category this of OUNTAIN t itiuin on distributions ity aiiista one that babilities gfo probabil- from ng NY , i interval- nit ninof ension V IEW CA , 3 20 19 18 16 16 15 15 15 15 14 12 9 8 6 6 3 2 2 2 AN ENRICHED OF LANGUAGE: FROM SYNTAX TO SEMANTICS

1. INTRODUCTION The world’s best large language models (LLMs) have recently attained new lev- els of sophistication by effectively learning a probability distribution on possible continuations of a given text. Interactively, one can input prefix text and then sam- ple repeatedly from a next word distribution to generate original, high quality texts [VSP+17, RNSS18, RWC+18, B+20]. Intuitively, the ability to continue a story implies a great deal of sophistication. A grammatically correct continuation re- quires a mastery of syntax, careful pronoun matching, part of speech awareness, a sense of tense, and much more. A language model that effectively learns a prob- ability distribution on possible continuations must apparently also have learned some semantic knowledge. For the continuation of a story to be reasonable and internally consistent requires meta-temporal knowledge of the world: dogs are an- imals that bark, golf is played outdoors during the day, Tuesday is the day after Monday, etc. What is striking is that these LLMs are trained only on samples of text without any rules for syntactic structure or semantic knowledge, neverthe- less complex syntactic structures and semantic information are learned [MCH+20]. Consistent with this evidence that it is possible to pass from probability distribu- tions on text continuations to semantic information, we propose a mathematical framework within which the passage may be explained. We define a syntax category, which is a category enriched over the unit-interval [0, 1], that models probability distributions on text continuations [BV20]. Our se- mantic category is then defined to be the enriched category of [0, 1]-valued co- presheaves on the syntax category. The Yoneda embedding, which maps the syntax category as a of the semantic category, assigns to a given text its repre- sentable copresheaf, which is regarded as the text’s meaning. Moreover, there are categorical operations in the semantic category that allow one to combine mean- ings that correspond to certain logical operations. In particular, there is a kind of implication that models whether a certain text is true, given that another text is true.

1.1. Compositionality. Language is a compositional structure: expressions can be combined to make longer expressions. This is not a new idea. Linguists have richly developed theories of formal grammars—think Chomsky’s hierarchy. Fo- cusing on the compositionality, one may consider language as some kind of alge- braic structure, perhaps simply as the free on some set of atomic symbols quotiented by the ideal of all expressions that are not grammatically correct; or, perhaps, as something more operadic, resembling the tree-like structures of parsed texts, labelled with classes that can be assembled by some set of tree-grafting rules. In either case, the algebraist might identify the meaning of a word, such as “red” with the principal ideal of all expressions that contain “red.” So, the concept of red, then, is the ideal containing red ruby, bright red ruby, red rover, blood red, red blooded, the worker’s and peasant’s red army, red meat, red firetruck, ... If one knows every expression that contains a word, then, as the thinking goes, one understands the meaning of that word. One is then led to a geometric picture by an algebro-geometric analogy: construct a space whose points are ideals and view AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 3 the algebraic structure as a sort of coordinate algebra on the space. From this per- spective, language is a coordinate algebra on a space of meanings. This geometric picture points in appealing directions. If one can say the same kinds of things in different languages, then those languages should be regarded as having comparable ideal structures (or possibly equivalent categories of modules), translating between different languages involves a change of coordinates on the space of meanings, and so on. Going further, one has a way to model semantics via sheaves on this space of meanings, as Lawvere in his classic 1969 paper [Law69] about syntax-semantic duality. We think this algebro-geometric picture is inspirational, but incomplete. Here, we furnish it with an important refinement. The compositional structure of lan- guage is only part of the structure that exists in language. What is missing is the distribution of meaningful texts. What has been, and what will be, said and written is crucially important. When one encounters the word “firetruck” it’s relevant that “red firetruck” has been observed more often than “green firetruck.” The fact that “red firetruck” is observed more often than “red idea” contributes to the meaning of red. Encountering an unlikely expression like “red idea” involves a shift in expec- tation that demands attention and contributes to the meaning of the broader context containing it. So, the ideals mentioned above are just a first approximation for the space of meanings. In this paper, we formalize a marriage of the compositional structure of language with the distributional structure as a category enriched over the unit interval, which we denote by L and call the language syntax category. The DisCoCat program of [CSC10] seeks to combine compositional and distri- butional structures in a single categorical framework, though a choice of grammar is needed as input. In the work below, syntactical structures are inherent within the enriched category theory and no grammatical input is required. This is moti- vated by the observation that LLMs can continue texts in a grammatically correct way without the additional input of a grammatical structure, suggesting that all im- portant grammatical information is already contained in the enriched category. In [AS14], Abramsky and Sadrzadeh consider a sheaf theoretic framework for study- ing language. That work does not consider the distributional structures that we consider here, but does emphasize the gluing properties of sheaves. Here, we work only with copresheaves.

1.2. Why category theory? The algebraic perspective of viewing ideals as a proxy for meaning is consistent with a category theoretical perspective, and the latter pro- vides a better setting in which to merge the compositional and distributional struc- tures of language. But even before adding distributional structures, moving from an algebraic to a categorical perspective provides certain conceptual advantages that we highlight first. Consider the category whose objects are elements of the free monoid on some finite set of atomic symbols, where there is a x → y whenever x is a substring of y, that is, when the expression y is a continuation of x. If the finite set is taken to be a set of English words, for example, then there are red → red firetruck and red → bright red ruby and so on. Each string is a substring of 4 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS

FIGURE 1. A category with at most one morphism between any two objects. Here the objects are expressions in a language and morphisms indicate when one expression is a continuation of an- other. itself, providing the identity morphisms, and composite morphisms are provided by transitivity: if x is a substring of y and y is a substring of z, then x is a substring of z. This category is simple to visualize: it is thin, which is to say there is at most one morphism between any two objects, and so one might have in mind a picture like that in Figure 1. By the Yoneda lemma, a fixed object in this category is determined up to unique isomorphism by the totality of its relationships to all other objects in the category. One thus thinks of identifying an expression x with the representable copresheaf hx := hom(x, −) whose value on an expression y is the one-point set ∗ if x is contained in y and is the empty set otherwise. In this way the preimage of ∗ is precisely the principal ideal generated by x. Representable copresheaves, therefore, are comparable to ideals and both can be thought of as a first approximation for the meaning of an expression. An immediate advantage to this shift in perspective is that the C of copresheaves C := Set on any small category C is better behaved than C itself. It has all limits and colimits and is Cartesian closed. In fact, the category of copresheaves is anb example of a , which is known to be an appropriate setting for internal intuitionistic logic. The full theory of is not incorporated in this work, but intuition abounds when one takes inspiration from the field. Think about colimits, for a moment.

1.2.1. . In the algebraic setting, the union of ideals is not an ideal, and so one is left to wonder what algebraic structure might represent the disjunc- tion of two concepts, say “red” or “blue.” In the categorical setting, the answer is straightforward. The of copresheaves is again a copreasheaf. Given any objects x and y in a small category C, the coproduct of representable copresheaves hx ⊔ hy : C → Set is computed “pointwise.” So suppose C is the category of ex- pressions in language defined above, and let x = red and y = blue. The coproduct hred ⊔ hblue therefore maps an expression c to the set hred(c) ⊔ hblue(c), which is isomorphic to ∗ if c contains either red or blue or both and is the empty set other- wise. The output of the functor hred ⊔ hblue is therefore nonempty on the union of all expressions that contain red or blue or both, and this matches well with the role of union as logical “or” among sets. AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 5

1.2.2. Products. Now think about limits and in particular products. Like the co- product, the product of repesentable copresheaves hx × hy : C → Set is again a copresheaf computed pointwise. So if C is the category of language and x = red and y = blue, then the value of the product on any expression c is given by hred(c) × hblue(c), which is isomorphic to ∗ if c contains both red and blue and is the empty set otherwise. So the output of the functor hred × hblue is nonempty on the intersection of expressions that contain red with those that contain blue, and this coincides with the role of intersection as logical “and” among sets. 1.2.3. Cartesian closure. Advantages of the categorical perspective can further be illustrated with a third example, which comes from the Cartesian closure of the category of copresheaves C on a small category C. To say that C is Cartesian closed means that it has finite products and moreover that the product of copresheaves has a right adjoint (See sectionb I.6. of [MM92]). That is, there is ab bifunctor [ , ]: C × C → C called the internal hom that fits into an isomorphism C(F × G, H) =∼ C(F, [G, H]) for all copresheaves F, G and H. Here, the internal hom [G, Hb] is theb copresheafb defined on objects c in C by c 7→ C(hc × G, H)bwhere as before hbc = hom(c, −) is the image of c under the Yoneda embedding Cop → C. The relevance of this new copresheaf can be seen in itsb kinship to implication in logic. Indeed, in intuitionistic propositional calculus one works within a certainb kind of thin Cartesian called a Heyting algebra. The usual notation in this setting is to write x ≤ y if there is a morphism from x → y. The categorical product is denoted by ∧ and the internal hom is denoted by ⇒. So, in this notation, one has: x ∧ y ≤ z if and only if x ≤ (y ⇒ z) for all elements x,y,z. Keep this analogy in mind as we now inspect the internal hom in C when C is the category of expressions in language. In this setting, the copresheaf [hred, hblue]: C → Set b captures something of implication red ⇒ blue. To see this, recall that the internal hom assigns to an expression c the set of natural transformations from hc × hred to hblue. Explicitly, this set consists of those functions (1) (hc × hred)(d) → hblue(d), as d ranges over all expressions in C, that fit into a particular commutative square. Let’s unwind this. The product is computed pointwise, and thus the domain (hc × hred)(d)= hc(d)×hred(d) is isomorphic to the one-point set ∗ whenever d contains both c and red and is the empty set otherwise. The codomain hblue(d) is likewise either ∗ if d contains blue and is the empty set otherwise. For example, if c is the expression French flag, then the function in Equation (1) is ∗ → ∗ when d is a text that contain both French flag and red as subtexts, and d also contains blue. In general, the set [hred, hblue](d) is comprised of functions of the form ∗ → ∗ or ∅ → ∗ or ∅ → ∅, and so the values of the copresheaf [hred, hblue] are built up from expressions that contain both red and blue (corresponding to functions ∗ → ∗), along with expressions containing blue but not red (corresponding to functions ∅ → ∗), along with expressions containing neither blue nor red (corresponding 6 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS

red blue red ⇒ blue hred hblue hred ⇒ hblue T T T ∗ ∗ ∗ F T T ∅ ∗ ∗ T F F ∗ ∅ ∅ F F T ∅ ∅ ∗

TABLE 1. The internal hom between copresheaves associated to red and blue is akin to logical implication.

to functions ∅ → ∅). This mirrors the familiar notion of logical implication as summarized in Table 1, where we have introduced the suggestive notation hred ⇒ hblue := [hred, hblue]. C In summary, the Yoneda embedding Cop → Set maps an expression x to the hx, which captures something of meaning. This is compa- rable to making a passage from syntax to semantics, after which concepts can be combined by computing limits, colimits, internal homs, and more. And yet this picture is still incomplete. The distributional structure of language has not yet been accounted for. To do so, we propose the use of enriched category theory.

2. INCORPORATING STATISTICS THROUGH ENRICHMENT Enriched category theory provides a ready-made way to simply decorate mor- phisms in the category C of language with conditional probabilities as in Figure 2. In general, enriched category theory begins with the observation that the set of morphisms between objects may have more structure than just a set. Examples are plentiful. The set of linear map between vector spaces is itself a ; the set of continuous functions between topological spaces is often given a topology; and so on. Enriched category theory is then the setting in which the morphisms between objects is itself an object in a category called the base category or the category over which the original category is enriched. For the desired axioms to make sense, the formal definitions require that the base category be a symmetric . We will review the basic definitions below, though the discus- sion will not require the full machinery of general enriched category theory. The base categories we consider are an especially simple kind of category—namely a preordered set, or simply preorder, which is a set equipped with a reflexive and transitive binary relation. As a category, the objects are the elements of the pre- order, and the relation defines the morphisms.

2.1. The enriched category of language. Recalling the picture in Figure 2, the base category we are primarily interested in is the unit interval, which becomes symmetric monoidal in a way that we now specify. A symmetric monoidal preorder (V, ≤, ⊗, 1) is a commutative monoid (V, ⊗, 1) equipped with a preorder ≤ that is compatible with the monoidal product in that x ⊗ y ≤ x′ ⊗ y′ whenever x ≤ x′ and y ≤ y′. The unit interval [0, 1] together with multiplication of real numbers is AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 7

.03 .05 .33 .10 .11 .97 .07 .65 .17 .02

FIGURE 2. The compositional and distributional structures of lan- guage are married by decorating arrows with the conditional prob- ability that one expression contains another. a commutative monoid with unit 1. Together with the usual ordering ≤, the unit interval is thus a symmetric monoidal preorder. Definition (Enriched Category). Let (V, ≤, ⊗, 1) be a symmetric monoidal pre- order. A V-enriched category, or simply a V-category, C consists of a set ob(C) of objects and an element C(x,y) ∈ V, called a V-hom object, for every pair of objects x and y, such that 1 ≤C(x,x) for each object x, and (2) C(x,y) ⊗C(y, z) ≤C(x, z) for all objects x,y,z. Following [BV20], we construct a category L enriched over [0, 1]. The objects of L are expressions, that is, texts, in the language. Morphisms are defined by L(x,y) := π(y|x), where π(y|x) denotes the probability that expression y extends expression x. If x is not a subtext of y then necessarily π(y|x) = 0. So, for example, one might have L (red,red firetruck) = 0.02 L (red,red idea) = 10−5 L (red,blue sky) = 0. That L is indeed a category enriched over [0, 1] follows from the fact that π(y|x)π(z|y)= π(z|x) for all texts x,y,z and so satisfies Equation (2). The reader might think of these probabilities π(y|x) as being most well defined when y is a short extension of x. While one may be skeptical about assigning a probability distribution on the set of all possible texts, we find it perfectly reasonable to say there is a nonzero probabil- ity that cat food will follow I am going to the store to buy a can of, and practically speaking, that probability can be estimated. Indeed, existing LLMs [RNSS18, RWC+18, B+20] successfully learn these conditional probabilities π(y|x) using standard machine learning tools trained on large corpora of existing texts, which 8 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS one may view as providing a wealth of samples drawn from these conditional prob- ability distributions. Figure 2 gives the right toy picture: the objects are expressions in language and the labels on the arrows describe the probability of extension. And yet, as in the unenriched setting of Section 1.2, the category L is inherently syntactical—it encodes “what goes with what” together with the statistics of those expressions. But what about semantics? Following the same line of reasoning that led to the consideration of copresheaves in the unenriched case, we now wish to pass from L to the enriched version of copresheaves on L where we propose that meaning lies, and where concepts can again be combined through the enriched version of limits, colimits, and internal homs. This discussion first requires a few mathematical preliminaries.

2.2. Basics of enriched category theory. Recall that a copresheaf on a category C is a functor C → Set. To understand the enriched version, then, one must first have a notion of between enriched categories. So suppose C and D are categories enriched over a symmetric monoidal preorder (V, ⊗, ≤, 1).A V-functor C → D is a function f : ob(C) → ob(D) satisfying

(3) C(x,y) ≤ D(fx,fy) for all objects x and y in C. There is also a category DC enriched over V whose objects are V-functors from C to D. The hom object between any two such functors F, G: C → D is defined by an , which is a particular kind of in V:

(4) DC(F, G) := D(F c, Gc). Zc∈ob(C) The precise definition of an end is not needed now, as the above expression will re- duce to something familiar in Equation (6). The takeaway is that just as mappings between (ordinary) categories define a functor category, so mappings between en- riched categories define an enriched functor category. So to understand the en- riched version of copresheaves, one starts with an enriched functor valued in the base category V. For this notion to make sense, however, Equation (3) implies that one must first understand what it means for V itself to be a V-category. This leads to another important definition: closure. A symmetric monoidal preorder V is said to be monoidal closed, or simply closed, if it is closed under forming homs; that is, if for every pair of elements v and w in V there is an element [v, w] ∈ V , called the internal hom, such that u ⊗ v ≤ w if and only if u ≤ [v, w] for all u, v, w in V. So if a symmetric monoidal closed preorder V is taken to be the base category of an enrichment, then one can define the hom object V(u, v) := [u, v] to be the internal hom of V. In this setting, the notion of a V-copresheaf on a V-category C makes sense: it is a function f : ob(C) → ob(V) satisfying C(x,y) ≤ V(fx,fy) for all objects x and y in C. In addition to closure, the notions of completeness and cocompleteness will also be required in the work to come. A category is said to be complete if it has all small AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 9 limits and is said to be cocomplete if it has all small colimits.1 If the category is simply a preorder, as in the present discussion, then it is complete if it has all finite meets and is cocomplete if it has all finite joins.

Example 1. The unit interval is closed when the monoidal product is taken to be usual multiplication and the internal hom is truncated division: for all a, b ∈ [0, 1] define

b/a if b ≤ a (5) [a, b]= (1 otherwise. The unit interval is also both complete and cocomplete as a preorder. The limit of a finite set of elements is the minimum of the elements, while the colimit is the maximum. In particular, the categorical product and coproduct are given by a×b = min{a, b} and a⊔b = max{a, b} for all a and b in [0, 1]. As a consequence, the end in Equation (4) is straightforward to understand in the case when the target category is the unit interval. Explicitly, if C is any category enriched over [0, 1], then the functor category [0, 1]C is also enriched over the same. The [0, 1]-object between any pair of [0, 1]-functors F, G: C → [0, 1] is given by the following minimum over all objects in C,

(6) [0, 1]C (F, G)= min {[F c, Gc]} = min {1, Gc/F c} c∈ob(C) c∈ob(C) where the second equality follows from the closure of the unit interval as in Equa- tion (5). So the unit interval has rich structure from the perspective of category theory. It is symmetric monoidal closed and complete and cocomplete, and as such, the enriched category [0, 1]L of copresheaves on the [0, 1]-category L of language in- herits rich structure, as well. In particular, it has both products and coproducts as well as an internal hom, thus allowing for the combination of concepts in language in a way that’s parallel to the ideas explored in Section 1.2. We elaborate below.

2.3. Semantics in the enriched setting. Section 1.2 described a passage from syntax to semantics involving ideas from ordinary category theory—namely the passage from a category of expressions in language (thought of as syntax) to the category of copresheaves on it (thought of as semantics) through the Yoneda em- bedding. Now having enriched both categories over the unit interval, we revisit the analogous passage, namely the Yoneda embedding Lop → [0, 1]L between en- riched categories,2 which assigns to each expression in language a representable [0, 1]-copresheaf. Explicitly, for each object x in L let hx := L(x, −) be defined

1The (co)limit of a diagram of categories I → C is said to be small if the indexing category I is a small category. 2Here Lop denotes the opposite [0, 1]-category of L, which has the same objects x,y,... as L and where the hom objects are defined by Lop(x,y) := L(y,x). 10 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS by conditional probability of containment,

π(c|x) if x ≤ c (7) c 7→ hx(c) := (0 otherwise. where we use “x ≤ c” as shorthand for “x is contained as a subtext of c.” So hx captures something of the meaning of the expression x as desired, plus more. Notably, its support is comprised of all expressions containing x (and this again coincides with the principal ideal associated to x), and it further accounts for the statistics associated with those expressions, which is precisely the distributional structure missing from Section 1.2. That both hx and the assignment x 7→ hx are indeed [0, 1]-functors remains to be checked, and the details can be found in Appendices A and B. With this definition in mind, we now turn to the combinations of concepts in this new context. Think back to Section 1.2.1, for instance, where we described the coproduct of ordinary copresheafs associated to the words red and blue, which represented their disjunction red or blue. Section 1.2.2 likewise computed the prod- uct of copresheaves representing the conjunction red and blue, and Section 1.2.3 computed the copresheaf representing the implication red ⇒ blue. What are the analogous constructions for enriched copresheaves? By way of motivation, recall that given a small ordinary category C the functor category SetC has all limits and colimits because the base category Set has all limits and colimits. The idea is that the (co)limit of a diagram of copresheaves is evaluated “pointwise” and is therefore built up from (co)limits of sets, which are guaranteed to exist. Now suppose the category Set is replaced with a sufficiently nice base category such as [0, 1]. To make sense of (co)limits in this new enriched setting, the familiar definition must be slightly modified. This leads to the notion of weighted limits and colimits, which are the appropriate notions of limits and colimits in enriched category theory. As it turns out, however, the weighted (co)limits we are interested in—namely, the triv- ially weighted (co)product of [0, 1]-copresheaves—are also evaluated pointwise in the way one would hope and in direct analogue with Sections 1.2.1 and 1.2.2. For this reason, we state the results below without reference to the full theory of weighted (co)limits, though background material and a formal definition are in- cluded in Appendix C for the interested reader.

2.3.1. Enriched Products and Coproducts. Let’s begin with a look at products and coproducts of copresheaves in the semantic category L := [0, 1]L. Fix a pair of expressions x and y in L. The product hx × hy : L → [0, 1] assigns to an expression c the pointwise product in [0, 1], which by Exampleb 1 is a minimum: hx(c) × hy(c) = min{hx(c), hy(c)}. By Equation (7) the value of this minimum depends on whether the expressions x and y are jointly contained within c, and so one has min{π(c|x),π(c|y)} if x ≤ c and y ≤ c (8) hx(c) × hy(c)= (0 otherwise. AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 11

The support of this copresheaf thus coincides with the support of logical “and” in Boolean logic, and this pairs well with the intuition that categorical limits capture something of conjunction. In a similar way, the coproduct hx ⊔ hy : L → [0, 1] is also simple to understand. It assigns to an expression c the pointwise maxi- mum hx(c) ⊔ hy(c) = max{hx(c), hy(c)} in [0, 1], the value of which depends on whether x or y or both are contained within c. Explicitly,

max{π(c|x),π(c|y)} if x ≤ c and y ≤ c π(c|x) if x ≤ c and y 6≤ c  (9) hx(c) ⊔ hy(c)=  π(c|y) if x 6≤ c and y ≤ c  if and 0 x 6≤ c y 6≤ c.   The support of this copresheaf coincides with the support of logical “or,” and this matches the intuition that colimits capture something of disjunction. A detailed proof of the claim in Equation (8) is given in Appendix D, and the proof for the co- product is analogous. Let’s now turn to a discussion of implication in the enriched setting.

2.3.2. Enriched Implication. Recall from Section 1.2.3 that the internal hom be- tween a pair of ordinary copresheaves G, H : C → Set is the copresheaf [G, H] de- fined on objects by c 7→ C(hc × G, H), which is the set of natural transformations from hc × G to H. Analogously, when we replace the base with the unit interval, we defineb for fixed expressions x and y in L the [0, 1]-copresheaf [hx, hy]: L→ [0, 1] whose value on an expression c is given by hy(d) (10) [hx, hy](c) := L(hc × hx, hy)= min 1, , d∈ob(L) hc(d) × hx(d)   where the second equalityb follows from Equation (6). A proof that this indeed defines a [0, 1]-functor is given in Appendix E. That this assignment resembles implication becomes intuitive when one takes a closer look at the expression on the right-hand side. To start, observe that the minimum in Equation (10) is zero if there is an expression d in the language that contains both c and x but does not contain y. For in that case, the denominator hc(d) × hx(d) is nonzero by Equation (8) whereas hy(d) is zero. So to understand the support, let’s consider the argument of the minimum for a fixed text d that contains y. There are two cases to consider. First, the minimum is 1 if d contains c but not x, or if it contains x but not c, or if it contains neither x nor c. (In fact, the minimum is 1 even if we drop the assumption that d contains y.) In the second case, the minimum lies strictly between 0 and 1 if the text d contains both x and c. The output of the copresheaf [hx, hy] is therefore nonzero on the union of the set of texts that contain both x and y and the set of texts that do not contain y. This coincides with the support of the logical implication operator in Boolean logic, and shows that Table 1 can be promoted to Table 2, where we have again introduced the suggestive notation hx ⇒ hy := [hx, hy]. 12 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS

x y x ⇒ y hx hy hx ⇒ hy 0 1 1 ∗ ∗ ∗ 1 1 1 0 ∗ ∗ 0 0 0 0 0 0 1 0 1 ∗ 0 ∗

TABLE 2. The internal hom between [0, 1]-copresheaves associ- ated to texts x and y has the same support as the implication op- erator in Boolean logic. Here the asterisk ∗ represents a nonzero number.

3. A INTERPRETATION It is sometimes desirable to work in a category M that is a slight variant of the syntax category L. First notice that like the unit interval, the set of nonnegative ex- tended reals [0, ∞] together with addition of real numbers is a commutative monoid with unit 0, where a + ∞ := ∞ and ∞ + a := ∞ for all a. If one further specifies a morphism from a to b whenever b ≤ a, then [0, ∞] is a symmetric monoidal preorder. In fact, the function − ln: [0, 1] → [0, ∞] defines an isomorphism of preorders and so [0, ∞] is also monoidal closed as well as complete and cocom- plete. The internal hom is given by truncated subtraction, [a, b] := max{b − a, 0}, and the limit of a finite set of elements is the maximum of the elements, while the colimit is the minimum. In particular, the categorical product and coproduct are respectively given by a × b = max{a, b} and a ⊔ b = min{a, b} for all a and b in [0, ∞]. By applying − ln to morphisms of L we thus obtain a new category M having the same objects as L and where the hom object between any pair of expressions x and y is given by M(x,y) := − ln L(x,y). It is then straightforward to check that 0 ≥ M(x,x) for all expressions x and moreover that M(x,y)+ M(y, z) ≥ M(x, z) for all expressions x,y,z (in fact, both inequalities become equalities in this case), thus showing that M is indeed a [0, ∞]-category. Categories en- riched over [0, ∞] are typically called generalized metric spaces [Law73, Law86] since composition of morphisms is precisely the triangle inequality, though no- tice that symmetry is not required as with usual metrics. Even so, we embrace the generalized metric space point of view and will denote hom objects in M by dM(x,y) := M(x,y), thinking of dM as defining a kind of distance between texts. Texts that are likely extensions have distances that are near the texts they extend, and texts that are not extensions of one another are infinitely far apart. Figure 3 illustrates a picture one might have in mind. Paths in this generalized metric space go in only one direction and represent stories: you begin somewhere, there are ex- pectations of where the story is going, the story continues, expectations of where you are going are revised, the story continues, and so on. So the two categories L and M have the same information, but the first emphasizes the probabilistic point of view while the second gives a geometric picture. AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 13

red idea

red red ruby

FIGURE 3. When viewing language as a generalized metric space, a text such as red is identified with its corresponding [0, ∞]- copresheaf dM(red, −). Expressions that are unlikely continua- tions of red are therefore far away, whereas expressions that are more likely to be continuations of red are closer to it.

As the reader might expect, the idea that semantic information lies in a co- presheaf category resurfaces here as well. There is a generalized metric space M := [0, ∞]M whose objects are [0, ∞]-copresheaves on M, which are functions f : M → [0, ∞] satisfying dM (x,y) ≥ max{fy − fx, 0} for all expressions x andc y in the language. The hom object between any pair of [0, ∞]-copresheaves f and g is given by M(f, g)= sup (max{gx − fx, 0}) x∈ob(M) in analogy with Equationc (6). This generalized metric space is also both complete and cocomplete with respect to weighted (co)limits, and moreover the Yoneda em- bedding M → M that maps x 7→ dM(x, −) defines an isometric embedding of M as the representable copresheaves within M [Wil13]. So although a pair of ex- pressions x and ycmay be unrelated in M, one can make sense of combining them using the rich structure present in M. For instance,c ideas analogous to those in Section 2.3.1 show that the (trivially weighted) product of a pair of representable [0, ∞]-copresheaves is the [0, ∞]-copresheafc dM(x, −)×dM(y, −): M→ [0, ∞] whose value on an expression c is computed pointwise,

(11) dM(x, c) × dM(y, c) = max{dM(x, c), dM(y, c)}. This is finite and nonzero on the set of all expressions containing both x and y, consistent with one’s intuition about conjunction. Geometrically, one may think of the product as a new “point” added in M, which is specified precisely by giv- ing its distance to all other points in the category. Moreover, the exponential map a 7→ e−a defines an isomorphism of categoriesc [0, ∞] → [0, 1], and under this map Equation (11) recovers the pointwise product of [0, 1]-copresheaves in Equa- tion (8) using the identity min{a, b} = e− max{− ln(a),− ln(b)}. Dually, we may also consider the (trivially weighted) coproduct dM(x, −) ⊔ dM(y, −). On each expression c it is given by

(12) dM(x, c) ⊔ dM(y, c) = min{dM(x, c), dM(y, c)}, 14 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS which is finite and nonzero whenever c contains either x or y or both, realiz- ing a disjunction operation. Geometrically, one also views the coproduct as a point added in M specified by indicating its distance to every other point. And again the exponential map applied to Equation (12) recovers the pointwise coprod- uct of [0, 1]-copresheavesc in Equation (9) using the dual identity max{a, b} = e− min{− ln(a),− ln(b)}. The general idea that copresheaves f in M are specified precisely by indicat- ing the distance between f and the representable copresheaves dM(x, −) suggests that the categorical completion M resemblesc a kind of metric completion. The so called distributional hypothesis [Har54, TP10] in linguistics, which states that linguistic items with similar distributionsc have similar meanings, therefore has a precise mathematical formulation in the following statement: Two texts x and y have similar meanings if and only if the co- presheaves representing them are near in the generalized metric space M.

3.1. Tropical modulec structure. We note now that the completion M has a semi- tropical module structure, that is, the structure of a module over the semi-tropical semi-ring that we define in the next paragraph. This is entirely analogousc to the fact that the set of scalar valued functions on a given set has the structure of a vector space, which it inherits from that on the target of the functions, namely the field of scalars. Here elements of M are functions valued in [0, ∞], which is a semi-tropical semi-ring and therefore they inherit the structure of a semi-tropical module. c The set ((−∞, ∞], ⊕, ⊙) with operations ⊕ and ⊙ defined by

(13) s1 ⊕ s2 = min{s1,s2} and s1 ⊙ s2 = s1 + s2 defines a semi-ring called the tropical semi-ring, or the (min, +) algebra. We adopt the terminology in [Wil13] and call the sub semi-ring ([0, ∞], ⊕, ⊙) the semi-tropical semi-ring. A module over a semi-ring is a commutative monoid with an action of the semi-ring. The commutative monoid here is defined by the colimit (12) (f ⊔ g)(x) = min{f(x), g(x)}, where f, g : M → [0, ∞] are copresheaves in M and the action is defined by (s ⊙ f)(x) := f(x)+ s. Semi-tropical module structures on presheaves on generalized metric spaces, along with a variation where mincis replaced with max, were first explored in [Wil13]. Tropical geometry contains rich structures actively researched over the last twenty years, and provides a way to transform non-linear algebraic geometry to piecewise linear geometry [Mac12, BIMS15, DS04a, DS04b, Yos11, SS04, SS03, Vir00]. The passage is achieved by considering usual polynomials in several variables and replacing the additions and products by their tropical counterparts. The zero set of the original polynomial is mapped to the singular set of the tropical one which is a piecewise linear space. A relevant example is a tropical module structure [DS04a, DS04b, SS04] gener- ated by distance functions, like the ones we have here but coming from a symmetric AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 15 metric and satisfying a tropical Pl¨ucker relation. A one-dimensional tropical poly- tope arises, which realizes the metric defined by the distance functions as a tree metric. The achievement here is that extra limit points are added in precise loca- tions, indicating branching points, so that the distances between leaves are realized as distances along tree paths connecting them. This result is used in phylogenetics for the production of trees encoding common ancestors of species using distances defined by their DNA. More recently, it has been discovered [ZNL18, MCT21, CM19, SM19] that feedforward neural networks with ReLU activation functions, compute tropical ra- tional maps, opening a new way to study the mathematical structure underlying them. Such neural networks are used in learning the next-words statistics encoded by the enriched category theory language model described here.

4. CONCLUSION Knowing how to continue texts implies a good deal of semantic information. So one might think that it would be very difficult to learn the syntactic language category L which encodes probability distributions on text continuations. It would be understandable to believe that a computer model that coherently continues texts would require, as input, some semantic knowledge. However, existing LLMs prove that is not the case. Standard machine learning techniques prove that one can learn L in an unsupervised way directly from samples of existing texts. These models have then necessarily learned some semantic information. In this paper, we provide a mathematical framework for where, mathematically speaking, that semantic information lives. Namely, it lives in the category of [0, 1]-copresheaves on L. One future direction for this work is to study how these presheaves can be found within the huge and seemingly disordered collection of parameters of trained LLMs, and once found, accessed, manipulated, combined and generally, put to use.

ACKNOWLEDGEMENTS The authors thank Olivia Caramello, Shawn Henry, Maxim Kontsevich, Laurent Lafforgue, Jacob Miller, David Jaz Myers, David Spivak, and Simon Willerton for helpful discussions.

APPENDIX A. Conditional probability as an enriched functor. For each expression x in L, we show the representable functor hx defined in Equation (7) is indeed a [0, 1]- functor as in Equation (3); that is, that the inequality L(c, d) ≤ [hx(c), hx(d)] holds for all expressions c and d in L. Recall that the right-hand side is given by 16 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS truncated division as in Equation (5),

hx(d)/hx(c) if hx(d) ≤ hx(c) [hx(c), hx(d)] = (1 otherwise

π(d)/π(c) if x ≤ d and x ≤ c = 0 if x 6≤ d and x ≤ c 1 otherwise

 Now, if c is not contained within d, then L(c, d) = 0 ≤ [hx(c), hx(d)] is always true. So suppose c is contained in d so that L(c, d) is equal to π(d)/π(c). The internal hom [hx(c), hx(d)] is also equal to π(d)/π(c) if x is contained in both c and d, and it is equal to 1 if x is contained in neither c nor d. (The possibility that it is equal to 0 is excluded since the assumption c ≤ d precludes case when x is contained in c but not d.) Thus in either case L(c, d) ≤ [hx(c), hx(d)] as desired.

B. The Yoneda embedding as an enriched functor. The copresheaf hx discussed in Appendix A is the image of an expression x under the Yoneda embedding Lop → [0, 1]L. To verify that the assignment x 7→ hx does indeed define a [0, 1]- L functor, it remains to check that L(y,x) ≤ [0, 1] [hx, hy], where by Equation y x (6) the right-hand side is equal to minc∈L{1, h (c)/h (c)}. So suppose y is a substring of x so that L(y,x) = π(x)/π(y). (Otherwise, L(y,x) = 0 and the desired inequality always holds.) Now, for any expression c that contains x, the quotient hy(c)/hx(c) is equal to π(c|x)/π(c|y) = π(x)/π(y). Importantly, the quotient cannot equal 0, which would happen in the case that c contains x but not y which is not possible under the assumption that y ≤ x. So the minimum y x minc∈L{1, h (c)/h (c)} is either 1 or π(x)/π(y) which is an upper bound for L(y,x) in either case.

C. Weighted limits and colimits. The appropriate notion of limits and colimits in enriched category theory are called weighted limits and colimits. We focus on building intuition around limits as the discussion for colimits is analogous. Now, to understand weighted limits, it will first help to recall the defining isomorphism for ordinary limits, also called “conical limits.” To that end, let J be an indexing category and let C be any category. Recall that the limit of a diagram F : J → C, if it exists, is an object lim F in C together with the following isomorphism of functors C → Set,

J (14) C(−, lim F ) =∼ Set (∗, C(−, F )) AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 17 which is natural in the first variable. Here, ∗ is used to denote the constant functor at the one-point set: J∗ Set i ∗ (15) id∗ j ∗

J So for each object Z in C there is a bijection C(Z, lim F ) =∼ Set (∗, C(Z, F −)) and so “morphisms into the limit are the same as a cone over the diagram, whose legs commute with morphisms in the diagram,” and one has in mind the following picture. Z

lim F

Fi Fj Taking a closer look at the right-hand side of Equation (14), notice that for each object Z in C, a ∗ ⇒ C(Z, F ) consists of a function ∗ = ∗(i) → C(Z,Fi) for each object i in J. This simply picks out a morphism Z → F i in C, and these morphisms comprise the legs of the limit cone over lim F . Natu- rality ensures that these legs are compatible with the morphisms in the diagram. With this background in mind, we now introduce weighted limits. The essential difference between the two constructions is that the constant functor in (15) is re- placed with a so-called V-functor of weights W : J →V, where V is the base cat- egory over which enrichment takes place and where J is an indexing V-category. Here is the formal definition. Definition (Weighted Limit). Let V be a closed symmetric monoidal category and let J and E be V-categories. Given a V-functor F : J →E and a V-functor W : J →V, the weighted limit of F , if it exists, is an object limW F of E together with the following isomorphism of V-functors E→V, J (16) E(−, limW F ) =∼ V (W, E(−, F )) that is natural in the first variable.

So in other words, for each object Z in E there is an isomorphism E(Z, limW F ) =∼ VJ (W, E(Z, F )) of objects in V. Comparing this with Equation (14), notice that the legs of a conical limit cone ∗ → C(Z,Fi) have now been replaced with a V- morphism W i →E(Z,Fi) indexed by objects i in J . Informally then, the legs of a weighted limit cone are now “W i-shaped objects in E(Z,Fi) ∈ V” rather than “point-shaped objects in C(Z,Fi) ∈ Set.” The idea behind a weighted colimit is analogous, and details may be found in [Rie14, Chapter 7]. 18 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS

D. Computing the product of [0, 1]-copresheaves. In this section, we prove the claim made in Equation (8) of Section 2.3.1 and do so by unwinding the isomor- phism in (16) of Appendix C in the simple case when V = [0, 1], when E = [0, 1]L, and when the indexing category J is a discrete category with two objects, call them 1 and 2. This is a discrete category enriched over [0, 1] by setting J (i, j)= δij for i, j ∈ {1, 2}. To begin, fix a functor of weights W : J → [0, 1]. This is nothing more than a choice of two numbers w1 := W (1) and w2 := W (2). For simplicity choose the trivial weighting w1 = w2 = 1. Further, for a fixed pair of expressions x,y in L, define F : J → [0, 1]L to be the [0, 1]-functor with F (1) := hx and F (2) := hy. We now wish to compute the product of hx and hy, which is the weighted limit of F with respect to the weight W. To start, the of limW F in Equation (16) implies that for any copresheaf Z in [0, 1]L we must have the equality3 J (17) E(Z, limW F )= V (W, E(Z, F )) with V = [0, 1]. Taking a closer look at the right-hand side, a V-natural transfor- 4 x mation W ⇒E(Z, F ) consists of a pair of inequalities 1 ≤V(w1, E(Z, h )) and y 1 ≤V(w2, E(Z, h )). To understand the right-hand side of each expression, recall that the unit interval is monoidal closed and so the hom objects are given by trun- cated division as in Equation (5). Therefore since w1 = w2 = 1, these inequalities imply (18) E(Z, hx) ≥ 1 and E(Z, hy) ≥ 1, and by Equation (6) this means min {1, hx(c)/Z(c)}≥ 1 and min {1, hy(c)/Z(c)}≥ 1 c∈ob(L) c∈ob(L) which is true if and only if Z(c) ≤ hx(c) and Z(c) ≤ hy(c). One recognizes this latter pair of inequalities as the legs of the familiar cone over the discrete diagram, when unwinding the universal property from ordinary category theory: “For all objects Z(c) ∈ [0, 1] with morphisms to hx(c) and hy(c), there exists....” Now, by Equation (17) the information in this cone is the same as E(Z, limW F ) which is again given by a minimum, and it’s enough to show this is equal to 1:

limW F c claim (19) E(Z, limW F )= min 1, = 1. c∈ob(L) Z(c)   For if the equality on the right holds, then limW F (c) ≥ Z(c), and we have finally recovered the usual limit diagram as in the diagrams below. This shows that the

3This equality follows from unwinding the definition of a V-natural isomorphism. In this case, J it consists of a morphism 1 ≤ V(E(Zc, limW F ), V (W, E(Zc, F )) for each c (together with an- other inequality involving the same expression but with the arguments swapped) which implies that J V (W, E(Zc, F )) ≥E(Zc, limW F ). But we show below that the right-hand side of this inequality is equal to 1, and so the left-hand side is, too. 4A V-natural transformation α: F ⇒ G between a pair of V-functors F, G: C → D consists of a morphism ηc : 1 → D(F c, Gc) in V for each object c in C such that a certain diagram commutes. Here, 1 is the monoidal unit in V. AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 19 product limW F (c) is the usual product in [0, 1], which the minimum and is what is shown in Equation (8) in Section 2.3.1.

Z(c) Z(c)

≤ ≤ ≤ x y limW F (c) = min{h (c), h (c)} ≤ ≤

y x y hx(c) h (c) h (c) h (c)

As an aside, notice the legs going out of limW F come from Equation (17) in the special case when Z = limW F , for then we have 1= E(lim F, lim F ) = min{1, E(lim F, hx), E(lim F, hy)}, W W W W x x which implies E(limW F, h ) = 1, which is true if and only if h (c) ≥ limW F (c) for all c and similarly for hy. In any case, it now remains to prove the claim on the right-hand side of Equation (19). But that follows since the right-hand side of Equation (17) is evaluated by using the expression for hom objects between [0, 1]-functors given in Equation (6): J V (W, E(Z, F )) = min {[wi, E(Z,Fi)]} i∈ob(J )

= min {1, E(Z,Fi)/wi} i∈ob(J ) = min {1, E(Z,Fi)} i∈ob(J ) = min{1, E(Z, hx), E(Z, hy )} = 1 where the last equality follows from (18).

E. Implication is a [0, 1]-functor. To verify the claim that [hx, hy]: L→ [0, 1] is indeed a [0, 1]-functor, we show that [hx, hy](d) if [hx, hy](d) ≤ [hx, hy](c) [hx, hy](c) L(c, d) ≤  1 otherwise for all expressions c and d in L. If c is not contained within d then L(c, d) = 0 and the desired inequality holds. So suppose c is contained in d and further suppose [hx, hy](d) ≤ [hx, hy](c), since otherwise we’re done. We then wish to show that hy(a) x y min 1, [h , h ](d) a∈ob(L) hd(a) × hx(a) (20) π(d|c) ≤ =   [hx, hy](c) hy(a) min 1, a∈ob(L) hc(a) × hx(a)   20 AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS where the equality is given by Equation (10). This inequality holds for a straight- forward reason: the right-hand side is always equal to 1. To see this, first note that the denominator of the above is nonzero by assumption, and so we must have hy(a) 6= 0 for all a in L, which is to say that y must be a subtext of every expression in L. Now consider the following two cases. If the numerator is strictly less than 1, then it must be that hd(a) × hx(a) 6= 0 for all a, which means both x and d must be contained in every expression. In particular, d must be contained in c, and so we have both d ≤ c and c ≤ d which is true if and only if c = d. The left-hand side of (20) is therefore 1= π(d|d), which cannot be bounded above by a number strictly less than 1. On the other hand, suppose the denominator is strictly less than 1. Then we must have hc(a) × hx(a) 6= 0 for all a, which means both x and c must be contained in every expression in the language. In particular, x ≤ y and (by the opening observation) y ≤ x, which is true if and only if x = y. And further it must be that x ≤ c and c ≤ x and so x = y = c. Therefore the right-hand side of (20) is equal to hc(a′) × hc(a′) = 1 hc(a′) for some fixed expression a′, a contradiction. So numerator and denominator in Equation (20) must both be 1, and indeed π(d|c) ≤ 1 for all c and d.

REFERENCES [AS14] Samson Abramsky and Mehrnoosh Sadrzadeh. Semantic Unification, pages 1–13. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. [B+20] Tom Brown et al. Language models are few-shot learners. In H. Larochelle, M. Ran- zato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. [BIMS15] Erwan Brugall´e, Ilia Itenberg, Grigory Mikhalkin, and Kristin Shaw. Brief introduction to tropical geometry, 2015. [BV20] Tai-Danae Bradley and Yiannis Vlassopoulos. Language modeling with reduced densi- ties. arXiv 2007.03834, 2020. [CM19] Vasileios Charisopoulos and Petros Maragos. A tropical approach to neural networks with piecewise linear activations, 2019. [CSC10] Bob Coecke, Mehrnoosh Sadrzadeh, and Stephen Clark. Mathematical foundations for a compositional distributional model of meaning. arXiv:1003.4394, 2010. [DS04a] M. Develin and B. Sturmfels. Tropical convexity. Documenta Mathematica, pages 205– 206, 2004. [DS04b] M. Develin and B. Sturmfels. Tropical covexity erratum. Documenta Mathematica, 9 1-27, 2004. [Har54] Zellig S. Harris. Distributional structure. WORD, 10(2-3):146–162, 1954. [Law69] F. William Lawvere. Adjointness in foundations. Dialectica, 23(3/4):281–296, 1969. [Law73] F. William Lawvere. Metric spaces, generalized logic and closed categories. Rendiconti del seminario matematico e fisico di MIlano, 1973. [Law86] F. William Lawvere. Taking categories seriously. Revista Colombiana de Matematicas, XX 147-178, 1986. [Mac12] Diane Maclagan. Introduction to tropical algebraic geometry, 2012. AN ENRICHED CATEGORY THEORY OF LANGUAGE: FROM SYNTAX TO SEMANTICS 21

[MCH+20] Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. Emergent linguistic structure in artificial neural networks trained by self- supervision. Proceedings of the National Academy of Sciences, 117(48):30046–30054, 2020. [MCT21] Petros Maragos, Vasileios Charisopoulos, and Emmanouil Theodosis. Tropical geome- try and machine learning. Proceedings of the IEEE, 109(5):728–755, 2021. [MM92] Saunders MacLane and Ieke Moerdijk. Sheaves in geometry and logic: a first introduc- tion to topos theory. Universitext. Springer, Berlin, 1992. [Rie14] Emily Riehl. Categorical Homotopy Theory. New Mathematical Monographs. Cam- bridge University Press, 2014. [RNSS18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Language mod- els are unsupervised multitask learners. 2018. [RWC+18] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. [SM19] Georgios Smyrnis and Petros Maragos. Tropical polynomial division and neural net- works, 2019. [SS03] David Speyer and Bernd Sturmfels. The tropical grassmannian, 2003. [SS04] David Speyer and Bernd Sturmfels. Tropical mathematics, 2004. [TP10] P. D. Turney and P. Pantel. From frequency to meaning: Vector space models of se- mantics. Journal of Artificial Intelligence Research, arXiv:1003.1141, 37:141–188, Feb 2010. [Vir00] Oleg Viro. Dequantization of real algebraic geometry on logarithmic paper, 2000. [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, undefinedukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Pro- ceedings of the 31st International Conference on Neural Information Processing Sys- tems, NIPS’17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. [Wil13] Simon Willerton. Tight spans, Isbell completions and semi-tropical modules. arXiv 1302.4370, 2013. [Yos11] Shuhei Yoshitomi. Generators of modules in tropical geometry, 2011. [ZNL18] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim. Tropical geometry of deep neural networks, 2018.