<<

Tropical Geometry of Deep Neural Networks

Liwen Zhang 1 Gregory Naitzat 2 Lek-Heng Lim 2 3

Abstract

We establish, for the first time, explicit connections between feedforward neural networks with ReLU activation and tropical geometry — we show that the family of such neural networks is equivalent to the family of tropical rational maps. Among other things, we deduce that feedforward ReLU neural networks with one hidden layer can be characterized by zonotopes, which serve as building blocks for deeper networks; we relate decision boundaries of such neural networks to tropical hypersurfaces, a major object of study in tropical geometry; and we prove that linear regions of such neural networks correspond to vertices of polytopes associated with tropical rational functions. An insight from our tropical formulation is that a deeper network is exponentially more expressive than a shallow network.

1. Introduction

Deep neural networks have recently received much limelight for their enormous success in a variety of applications across many different areas of artificial intelligence, computer vision, speech recognition, and natural language processing (LeCun et al., 2015; Hinton et al., 2012; Krizhevsky et al., 2012; Bahdanau et al., 2014; Kalchbrenner & Blunsom, 2013). Nevertheless, it is also well-known that our theoretical understanding of their efficacy remains incomplete.

There have been several attempts to analyze deep neural networks from different perspectives. Notably, earlier studies have suggested that a deep architecture could use parameters more efficiently and requires exponentially fewer parameters to express certain families of functions than a shallow architecture (Delalleau & Bengio, 2011; Bengio & Delalleau, 2011; Montufar et al., 2014; Eldan & Shamir, 2016; Poole et al., 2016; Telgarsky, 2016; Arora et al., 2018). Recent work (Zhang et al., 2016) showed that several successful neural networks possess a high representation power and can easily shatter random data. However, they also generalize well to data unseen during the training stage, suggesting that such networks may have some implicit regularization. Traditional measures of complexity such as VC-dimension and Rademacher complexity fail to explain this phenomenon. Understanding this implicit regularization that begets the generalization power of deep neural networks remains a challenge.

The goal of our work is to establish connections between neural networks and tropical geometry in the hope that they will shed light on the workings of deep neural networks. Tropical geometry is a new area in algebraic geometry that has seen an explosive growth in the recent decade but remains relatively obscure outside pure mathematics. We will focus on feedforward neural networks with rectified linear units (ReLU) and show that they are analogues of rational functions, i.e., ratios of two multivariate polynomials f, g in variables x1, . . . , xd,

f(x1, . . . , xd) / g(x1, . . . , xd),

in tropical algebra. For standard and trigonometric polynomials, it is known that rational approximation — approximating a target function by a ratio of two polynomials instead of a single polynomial — vastly improves the quality of approximation without increasing the degree. This gives our analogue: an ReLU neural network is the tropical ratio of two tropical polynomials, i.e., a tropical rational function. More precisely, if we view a neural network as a function ν : R^d → R^p, x = (x1, . . . , xd) ↦ (ν1(x), . . . , νp(x)), then ν is a tropical rational map, i.e., each νi is a tropical rational function. In fact, we will show that:

¹Department of Computer Science, University of Chicago, Chicago, IL  ²Department of Statistics, University of Chicago, Chicago, IL  ³Computational and Applied Mathematics Initiative, University of Chicago, Chicago, IL. Correspondence to: Lek-Heng Lim.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

    the family of functions represented by feedforward neural networks with rectified linear units and integer weights is exactly the family of tropical rational maps.

It immediately follows that there is a semifield structure on this family of functions. More importantly, this establishes a bridge between neural networks¹ and tropical geometry that allows us to view neural networks as well-studied tropical geometric objects. This insight allows us to closely relate boundaries between linear regions of a neural network to tropical hypersurfaces and thereby facilitate studies of decision boundaries of neural networks in classification problems as tropical hypersurfaces. Furthermore, the number of linear regions, which captures the complexity of a neural network (Montufar et al., 2014; Raghu et al., 2017; Arora et al., 2018), can be bounded by the number of vertices of the polytopes associated with the neural network's tropical rational representation. Lastly, a neural network with one hidden layer can be completely characterized by zonotopes, which serve as building blocks for deeper networks.

¹Henceforth a "neural network" will always mean a feedforward neural network with ReLU activation.

In Sections 2 and 3 we introduce basic tropical algebra and tropical algebraic geometry of relevance to us. We state our assumptions precisely in Section 4 and establish the connection between tropical geometry and multilayer neural networks in Section 5. We analyze neural networks with tropical tools in Section 6, proving that a deeper neural network is exponentially more expressive than a shallow network — though our objective is not so much to perform state-of-the-art analysis but to demonstrate that tropical algebraic geometry can provide useful insights. All proofs are deferred to Section D of the supplement.

2. Tropical Algebra

Roughly speaking, tropical algebraic geometry is an analogue of classical algebraic geometry over C, the field of complex numbers, but where one replaces C by a semifield² called the tropical semiring, to be defined below. We give a brief review of tropical algebra and introduce some relevant notations. See (Itenberg et al., 2009; Maclagan & Sturmfels, 2015) for an in-depth treatment.

²A semifield is a field sans the existence of additive inverses.

The most fundamental component of tropical algebraic geometry is the tropical semiring T := (R ∪ {−∞}, ⊕, ⊙), also known as the max-plus algebra. The two operations ⊕ and ⊙, called tropical addition and tropical multiplication respectively, are defined as follows.

Definition 2.1. For x, y ∈ R, their tropical sum is x ⊕ y := max{x, y}; their tropical product is x ⊙ y := x + y; the tropical quotient of x over y is x ⊘ y := x − y.

For any x ∈ R, we have −∞ ⊕ x = 0 ⊙ x = x and −∞ ⊙ x = −∞. Thus −∞ is the tropical additive identity and 0 is the tropical multiplicative identity. Furthermore, these operations satisfy the usual laws of arithmetic: associativity, commutativity, and distributivity. The set R ∪ {−∞} is therefore a semiring under the operations ⊕ and ⊙. While it is not a ring (it lacks additive inverses), one may nonetheless generalize many algebraic objects (e.g., matrices, polynomials, tensors, etc.) and notions (e.g., rank, determinant, degree, etc.) over the tropical semiring — the study of these, in a nutshell, constitutes the subject of tropical algebra.

Let N := {n ∈ Z : n ≥ 0}. For an integer a ∈ N, raising x ∈ R to the ath power is the same as multiplying x by itself a times. When standard multiplication is replaced by tropical multiplication, this gives us the tropical power

x^⊙a := x ⊙ · · · ⊙ x = a · x,

where the last · denotes the standard product of real numbers; it is extended to R ∪ {−∞} by defining, for any a ∈ N, −∞^⊙a := −∞ if a > 0 and −∞^⊙a := 0 if a = 0.

A tropical semiring, while not a field, possesses one quality of a field: Every x ∈ R has a tropical multiplicative inverse given by its standard additive inverse, i.e., x^⊙(−1) := −x. Though not reflected in its name, T is in fact a semifield. One may therefore also raise x ∈ R to a negative power a ∈ Z by raising its tropical multiplicative inverse −x to the positive power −a, i.e., x^⊙a = (−x)^⊙(−a). As is the case in standard real arithmetic, the tropical additive inverse −∞ does not have a tropical multiplicative inverse and −∞^⊙a is undefined for a < 0.

For notational simplicity, we will henceforth write x^a instead of x^⊙a for tropical power when there is no cause for confusion. Other algebraic rules of tropical power may be derived from the definition; see Section B in the supplement.

We are now in a position to define tropical polynomials and tropical rational functions. In the following, x and xi will denote variables (i.e., indeterminates).

Definition 2.2. A tropical monomial in d variables x1, . . . , xd is an expression of the form

c ⊙ x1^a1 ⊙ x2^a2 ⊙ · · · ⊙ xd^ad,

where c ∈ R ∪ {−∞} and a1, . . . , ad ∈ N. As a convenient shorthand, we will also write a tropical monomial in multiindex notation as cx^α where α = (a1, . . . , ad) ∈ N^d and x = (x1, . . . , xd). Note that x^α = 0 ⊙ x^α as 0 is the tropical multiplicative identity.

Definition 2.3. Following notations above, a tropical polynomial f(x) = f(x1, . . . , xd) is a finite tropical sum of tropical monomials

f(x) = c1x^α1 ⊕ · · · ⊕ crx^αr,

where αi = (ai1, . . . , aid) ∈ N^d and ci ∈ R ∪ {−∞}, i = 1, . . . , r. We will assume that a monomial of a given multiindex appears at most once in the sum, i.e., αi ≠ αj for any i ≠ j.

Definition 2.4. Following notations above, a tropical rational function is a standard difference, or, equivalently, a tropical quotient of two tropical polynomials f(x) and g(x): f(x) − g(x) = f(x) ⊘ g(x). We will denote a tropical rational function by f ⊘ g, where f and g are understood to be tropical polynomial functions.
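To make these definitions concrete, here is a minimal sketch in Python (our own illustration, not part of the paper; all function names are hypothetical) that evaluates a tropical polynomial and a tropical rational function at a point by translating ⊕ to max and ⊙ to +.

```python
# Minimal sketch of max-plus (tropical) arithmetic; names are our own.
NEG_INF = float("-inf")          # tropical additive identity

def trop_add(x, y):              # x ⊕ y := max{x, y}
    return max(x, y)

def trop_mul(x, y):              # x ⊙ y := x + y, with (−∞) ⊙ y = −∞
    return NEG_INF if NEG_INF in (x, y) else x + y

def trop_monomial(c, alpha, x):
    # c ⊙ x1^a1 ⊙ ... ⊙ xd^ad evaluates to c + a1*x1 + ... + ad*xd
    return c + sum(a * xi for a, xi in zip(alpha, x))

def trop_poly(monomials, x):
    # a tropical polynomial is the pointwise max of finitely many affine functions
    return max(trop_monomial(c, alpha, x) for c, alpha in monomials)

# f(x1, x2) = 1 ⊙ x1^2 ⊕ 2 ⊙ x1x2 ⊕ 2   and   g(x1, x2) = x2
f = [(1.0, (2, 0)), (2.0, (1, 1)), (2.0, (0, 0))]
g = [(0.0, (0, 1))]
x = (0.5, -1.0)
print(trop_poly(f, x))                    # max(1 + 1.0, 2 - 0.5, 2) = 2.0
print(trop_poly(f, x) - trop_poly(g, x))  # the tropical rational function f ⊘ g at x
```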

It is routine to verify that the set of tropical polynomials T[x1, . . . , xd] forms a semiring under the standard extension 2 2 Figure 1. 1 x1 ⊕ 1 x2 ⊕ 2 x1x2 ⊕ 2 x1 ⊕ 2 x2 ⊕ 2. of ⊕ and to tropical polynomials, and likewise the set of Left: Tropical curve. Right: Dual subdivision of Newton polygon tropical rational functions T(x1, . . . , xd) forms a semifield. and tropical curve. We regard a tropical polynomial f = f 0 as a special case of a tropical rational function and thus T[x1, . . . , xd] ⊆ T(x1, . . . , xd). Henceforth any result stated for a tropical A tropical hypersurface divides the domain of f into convex rational function would implicitly also hold for a tropical cells on each of which f is linear. These cells are convex polynomial. polyhedrons, i.e., defined by linear inequalities with integer coefficients: {x ∈ Rd : Ax ≤ b} for A ∈ Zm×d and A d-variate tropical polynomial f(x) defines a function m d b ∈ R . For example, the cell where a tropical monomial f : R → R that is a convex function in the usual sense as αj d T cjx attains its maximum is {x ∈ R : cj + αjx ≥ ci + taking max and sum of convex functions preserve convexity αTx for all i 6= j}. Tropical hypersurfaces of polynomials (Boyd & Vandenberghe, 2004). As such, a tropical rational i in two variables (i.e., in R2) are called tropical curves. function f g : Rd → R is a DC function or difference- convex function (Hartman, 1959; Tao & Hoai An, 2005). Just like standard multivariate polynomials, every tropical polynomial comes with an associated Newton polygon. We will need a notion of vector-valued tropical polynomials and tropical rational functions. Definition 3.2. The Newton polygon of a tropical polyno- α1 αr d p mial f(x) = c1x ⊕ · · · ⊕ crx is the convex hull of Definition 2.5. F : R → R , x = (x1, . . . , xd) 7→ d d α1, . . . , αr ∈ N , regarded as points in R , (f1(x), . . . , fp(x)), is called a tropical polynomial map if each f : d → is a tropical polynomial, i = 1, . . . , p,  d i R R ∆(f) := Conv αi ∈ R : ci 6= −∞, i = 1, . . . , r . and a tropical rational map if f1, . . . , fp are tropical ra- tional functions. We will denote the set of tropical poly- A tropical polynomial f determines a dual subdivision of nomial maps by Pol(d, p) and the set of tropical rational d ∆(f), constructed as follows. First, lift each αi from R maps by Rat(d, p). So Pol(d, 1) = T[x1, . . . , xd] and d+1 into R by appending ci as the last coordinate. Denote Rat(d, 1) = T(x1, . . . , xd). the convex hull of the lifted α1, . . . , αr as

d 3. Tropical Hypersurfaces P(f) := Conv{(αi, ci) ∈ R × R : i = 1, . . . , r}. (1) There are tropical analogues of many notions in classical al- Next let UFP(f) denote the collection of upper faces in gebraic geometry (Itenberg et al., 2009; Maclagan & Sturm- P(f) and π : Rd × R → Rd be the projection that drops fels, 2015), among which are tropical hypersurfaces, trop- the last coordinate. The dual subdivision determined by f ical analogues of algebraic curves in classical algebraic is then geometry. Tropical hypersurfaces are a principal object of  d  interest in tropical geometry and will prove very useful in δ(f) := π(p) ⊆ R : p ∈ UF P(f) . our approach towards neural networks. Intuitively, the trop- δ(f) forms a polyhedral complex with support ∆(f). By ical hypersurface of a tropical polynomial f is the set of (Maclagan & Sturmfels, 2015, Proposition 3.1.6), the tropi- points x where f is not linear at x. cal hypersurface T (f) is the (d − 1)-skeleton of the poly- Definition 3.1. The tropical hypersurface of a tropical poly- hedral complex dual to δ(f). This means that each vertex nomial f(x) = c xα1 ⊕ · · · ⊕ c xαr is d 1 r in δ(f) corresponds to one “cell” in R where the function f is linear. Thus, the number of vertices in P(f) provides T (f) := x ∈ d : c xαi = c xαj = f(x) R i j an upper bound on the number of linear regions of f. for some α 6= α . i j Figure1 shows the Newton polygon and dual subdivision 2 2 i.e., the set of points x at which the value of f at x is attained for the tropical polynomial f(x1, x2) = 1 x1 ⊕ 1 x2 ⊕ by two or more monomials in f. 2 x1x2 ⊕ 2 x1 ⊕ 2 x2 ⊕ 2. Figure2 shows how we Tropical Geometry of Deep Neural Networks

and for λ1, λ2 ≥ 0, their weighted Minkowski sum is

λ1P1 + λ2P2 := {λ1x1 + λ2x2 ∈ R^d : x1 ∈ P1, x2 ∈ P2}.

Weighted Minkowski sum is clearly commutative and associative and generalizes to more than two sets. In particular, the Minkowski sum of line segments is called a zonotope.

Let V(P) denote the set of vertices of a polytope P. Clearly, the Minkowski sum of two polytopes is given by the convex hull of the Minkowski sum of their vertex sets, i.e., P1 + P2 = Conv(V(P1) + V(P2)). With this observation, the following is immediate.

Proposition 3.2. Let f, g ∈ Pol(d, 1) = T[x1, . . . , xd] be tropical polynomials. Then

P(f ⊙ g) = P(f) + P(g),

2 2  Figure 2. 1 x1 ⊕ 1 x2 ⊕ 2 x1x2 ⊕ 2 x1 ⊕ 2 x2 ⊕ 2. P(f ⊕ g) = Conv V(P(f)) ∪ V(P(g)) . The dual subdivision can be obtained by projecting the edges on the upper faces of the polytope. We reproduce below part of (Gritzmann & Sturmfels, 1993, Theorem 2.1.10) and derive a corollary for bounding the may find the dual subdivision for this tropical polynomial by number of verticies on the upper faces of a zonotope. following the aforementioned procedures; with step-by-step Theorem 3.3 (Gritzmann–Sturmfels). Let P1,...,Pk be details given in Section C.1. polytopes in Rd and let m denote the total number of non- parallel edges of P ,...,P . Then the number of vertices Tropical polynomials and tropical rational functions are 1 k of P + ··· + P does not exceed clearly piecewise linear functions. As such a tropical ratio- 1 k nal map is a piecewise linear map and the notion of linear d−1 X m − 1 region applies. 2 . ( j ) Definition 3.3. A linear region of F ∈ Rat(d, m) is a max- j=0 imal connected subset of the domain on which F is linear. The upper bound is attained if all Pi’s are zonotopes and The number of linear regions of F is denoted N (F ). all their generating line segments are in general positions. Note that a tropical polynomial map F ∈ Pol(d, m) has con- Corollary 3.4. Let P ⊆ Rd+1 be a zonotope generated by d d vex linear regions but a tropical rational map F ∈ Rat(d, n) m line segments P1,...,Pm. Let π : R × R → R be the generally has nonconvex linear regions. In Section 6.3, projection. Suppose P satisfies: we will use N (F ) as a measure of complexity for an (i) the generating line segments are in general positions; F ∈ Rat(d, n) given by a neural network. (ii) the set of projected vertices {π(v): v ∈ V(P )} ⊆ Rd 3.1. Transformations of Tropical Polynomials are in general positions. Then P has Our analysis of neural networks will require figuring out d how the polytope P(f) transforms under tropical power, X (m) sum, and product. The first is straightforward. j=0 j Proposition 3.1. Let f be a tropical polynomial and let vertices on its upper faces. If either (i) or (ii) is violated, a ∈ N. Then then this becomes an upper bound. P(f a) = aP(f). As we mentioned, linear regions of a tropical polynomial f aP(f) = {ax : x ∈ P(f)} ⊆ Rd+1 is a scaled version of correspond to vertices on UFP(f) and the corollary will P(f) with the same shape but different volume. be useful for bounding the number of linear regions. To describe the effect of tropical sum and product, we need a few notions from convex geometry. The Minkowski sum 4. Neural Networks d of two sets P1 and P2 in R is the set While we expect our readership to be familiar with feedfor-  d P1 + P2 := x1 + x2 ∈ R : x1 ∈ P1, x2 ∈ P2 ; ward neural networks, we will nevertheless use this short Tropical Geometry of Deep Neural Networks section to define them, primarily for the purpose of fixing • real weights can be approximated arbitrarily closely by notations and specifying the assumptions that we retain rational weights; throughout this article. We restrict our attention to fully • one may then ‘clear denominators’ in these rational connected feedforward neural networks. 
weights by multiplying them by the least common mul- Viewed abstractly, an L-layer feedforward neural network tiple of their denominators to obtain integer weights; is a map ν : Rd → Rp given by a composition of functions • keeping in mind that scaling all weights and biases by the same positive constant has no bearing on the ν = σ(L) ◦ ρ(L) ◦ σ(L−1) ◦ ρ(L−1) · · · ◦ σ(1) ◦ ρ(1). workings of a neural network. (1) (L) The preactivation functions ρ , . . . , ρ are affine trans- The activation function in (c) includes both ReLU activation formations to be determined and the activation functions t(l) = 0 t(l) = −∞ (1) (L) ( ) and identity map ( ) as special cases. σ , . . . , σ are chosen and fixed in advanced. Aside from ReLU, our tropical framework will apply to We denote the width, i.e., the number of nodes, of the lth piecewise linear activations such as leaky ReLU and abso- layer by nl, l = 1, ··· ,L − 1. We set n0 := d and nL := p, lute value, and with some extra effort, may be extended to respectively the dimensions of the input and output of the max pooling, maxout nets, etc. But it does not, for example, network. The output from the lth layer will be denoted by apply to activations such as hyperbolic tangent and sigmoid. ν(l) := σ(l) ◦ ρ(l) ◦ σ(l−1) ◦ ρ(l−1) · · · ◦ σ(1) ◦ ρ(1), In this work, we view an ReLU network as the simplest and most canonical model of a neural network, from which i.e., it is a map ν(l) : Rd → Rnl . For convenience, we other variants that are more effective at specific tasks may assume ν(0)(x) := x. be derived. Given that we seek general theoretical insights (l) n n and not specific practical efficacy, it makes sense to limit The affine function ρ : R l−1 → R l is given by a weight (l) n ×n (l) n ourselves to this simplest case. Moreover, ReLU networks matrix A ∈ Z l l−1 and a bias vector b ∈ R l : already embody some of the most important elements (and ρ(l)(ν(l−1)) := A(l)ν(l−1) + b(l). mysteries) common to a wider range of neural networks (e.g., universal approximation, exponential expressiveness); (l) (l) The (i, j)th coordinate of A will be denoted aij and the they work well in practice and are often the go-to choice for (l) (l) ith coordinate of b by bi . Collectively they form the feedforward networks. We are also not alone in limiting our parameters of the lth layer. discussions to ReLU networks (Montufar et al., 2014; Arora et al., 2018). For a vector input x ∈ Rnl , σ(l)(x) is understood to be in the coordinatewise sense; so σ : Rnl → Rnl . We assume the final output of a neural network ν(x) is fed into a score 5. Tropical Algebra of Neural Networks function s : p → m that is application specific. When R R We now describe our tropical formulation of a multilayer used as an m-category classifier, s may be chosen, for ex- feedforward neural network satisfying (a)–(c). ample, to be a soft-max or sigmoidal function. The score function is quite often regarded as the last layer of a neu- A multilayer feedforward neural network is generally non- ral network but this is purely a matter of convenience and convex, whereas a tropical polynomial is always convex. we will not assume this. We will make the following mild Since most nonconvex functions are a difference of two assumptions on the architecture of our feedforward neural convex functions (Hartman, 1959), a reasonable guess is networks and explain next why they are indeed mild: that a feedforward neural network is the difference of two tropical polynomials, i.e., a tropical rational function. 
This (a) the weight matrices A(1),...,A(L) are integer-valued; is indeed the case, as we will see from the following. (b) the bias vectors b(1), . . . , b(L) are real-valued; Consider the output from the first layer in neural network (c) the activation functions σ(1), . . . , σ(L) take the form ν(x) = max{Ax + b, t}, σ(l)(x) := max{x, t(l)}, where A ∈ p×d, b ∈ p, and t ∈ ( ∪ {−∞})p. We will (l) nl Z R R where t ∈ (R ∪ {−∞}) is called a threshold vec- decompose A as a difference of two nonnegative integer- tor. p×d valued matrices, A = A+ −A− with A+,A− ∈ N ; e.g., Henceforth all neural networks in our subsequent discus- in the standard way with entries sions will be assumed to satisfy (a)–(c). + − aij := max{aij, 0}, aij := max{−aij, 0} (b) is completely general but there is also no loss of gen- respectively. Since erality in (a), i.e., in restricting the weights A(1),...,A(L) from real matrices to integer matrices, as: max{Ax + b, t} = max{A+x + b, A−x + t} − A−x, Tropical Geometry of Deep Neural Networks we see that every coordinate of one-layer neural network Corollary 5.3. Let ν : Rd → R be an ReLU activated is a difference of two tropical polynomials. For networks feedforward neural network with integer weights and linear with more layers, we apply this decomposition recursively output. Then ν is a tropical rational function. to obtain the following result. A more remarkable fact is the converse of Corollary 5.3. Proposition 5.1. Let A ∈ Zm×n, b ∈ Rm be the parame- ters of the (l + 1)th layer, and let t ∈ (R ∪ {−∞})m be the Theorem 5.4 (Equivalence of neural networks and tropical threshold vector in the (l + 1)th layer. If the nodes of the lth rational functions). layer are given by tropical rational functions, d (i) Let ν : R → R. Then ν is a tropical rational func- ν(l)(x) = F (l)(x) G(l)(x) = F (l)(x) − G(l)(x), tion if and only if ν is a feedforward neural network satisfying assumptions (a)–(c). (l) (l) i.e., each coordinate of F and G is a tropical polyno- (ii) A tropical rational function f g can be represented mial in x, then the outputs of the preactivation and of the as an L-layer neural network, with (l + 1)th layer are given by tropical rational functions L ≤ max{dlog rf e, dlog rge} + 2, ρ(l+1) ◦ ν(l)(x) = H(l+1)(x) − G(l+1)(x), 2 2 ν(l+1)(x) = σ ◦ ρ(l+1) ◦ ν(l)(x) = F (l+1)(x) − G(l+1)(x) where rf and rg are the number of monomials in the tropical polynomials f and g respectively. respectively, where We would like to acknowledge the precedence of (Arora (l+1)  (l+1) (l+1) F (x) = max H (x),G (x) + t , et al., 2018, Theorem 2.1), which demonstrates the equiva- (l+1) (l) (l) lence between ReLU-activated L-layer neural networks with G (x) = A+G (x) + A−F (x), real weights and d-variate continuous piecewise functions (l+1) (l) (l) H (x) = A+F (x) + A−G (x) + b. with real coefficients, where L ≤ dlog2(d + 1)e + 1. (l) (l) (l) By construction, a tropical rational function is a continuous We will write fi , gi and hi for the ith coordinate of F (l), G(l) and H(l) respectively. In tropical arithmetic, the piecewise linear function. The continuity of a piecewise recurrence above takes the form linear function automatically implies that each of the pieces on which it is linear is a polyhedral region. 
As we saw in (l+1) (l+1) (l+1) d fi = hi ⊕ (gi ti), Section3, a tropical polynomial f : R → R gives a tropical  n   n  hypersurface that divides d into convex polyhedral regions (l+1) (l) − (l) + R K aij K aij gi = (fj ) (gj ) , defined by linear inequalities with integer coefficients: {x ∈ (2) d m×d m j=1 j=1 R : Ax ≤ b} with A ∈ Z and b ∈ R . A tropical  n   n  rational function f g : d → must also be a continuous (l+1) K (l) + K (l) − R R aij aij d hi = (fj ) (gj ) bi. piecewise linear function and divide R into polyhedral j=1 j=1 regions on each of which f g is linear, although these regions are nonconvex in general. We will show the converse Repeated applications of Proposition 5.1 yield the following. — any continuous piecewise linear function with integer Theorem 5.2 (Tropical characterization of neural networks). coefficients is a tropical rational function. A feedforward neural network under assumptions (a)–(c) Proposition 5.5. Let ν : Rd → R. Then ν is a continuous is a function ν : Rd → Rp whose coordinates are tropical piecewise linear function with integer coefficients if and rational functions of the input, i.e., only if ν is a tropical rational function.

ν(x) = F (x) G(x) = F (x) − G(x) Corollary 5.3, Theorem 5.4, and Proposition 5.5 collectively where F and G are tropical polynomial maps. Thus ν is a imply the equivalence of tropical rational map. (i) tropical rational functions, Note that the tropical rational functions above have real (ii) continuous piecewise linear functions with integer co- coefficients, not integer coefficients. The integer weights efficients, A(l) ∈ Znl×nl−1 have gone into the powers of tropical (iii) neural networks satisfying assumptions (a)–(c). monomials in f and g, which is why we require our weights An immediate advantage of this characterization is that the to be integer-valued, although as we have explained, this set of tropical rational functions (x , . . . , x ) has a semi- requirement imposes little loss of generality. T 1 d field structure as we pointed out in Section2, a fact that By setting t(1) = ··· = t(L−1) = 0 and t(L) = −∞, we we have implicitly used in the proof of Proposition 5.5. obtain the following corollary. However, what is more important is not the algebra but the Tropical Geometry of Deep Neural Networks algebraic geometry that arises from our tropical characteri- We provide bounds on the number of positive and negative zation. We will use tropical algebraic geometry to illuminate regions and show that there is a tropical polynomial whose our understanding of neural networks in the next section. tropical hypersurface contains the decision boundary. The need to stay within tropical algebraic geometry is the Proposition 6.1 (Tropical geometry of decision boundary). reason we did not go for a simpler and more general char- Let ν : Rd → R be an L-layer neural network satisfying acterization (that does not require the integer coefficients assumptions (a)–(c) with t(L) = −∞. Let the score function assumption). A tropical signomial takes the form s : R → R be injective with decision threshold c in its range. m n If ν = f g where f and g are tropical polynomials, then M K aij ϕ(x) = b x , d −1 i j (i) its decision boundary B = {x ∈ R : ν(x) = s (c)} i=1 j=1 divides Rd into at most N (f) connected positive re- where aij ∈ R and bi ∈ R ∪ {−∞}. Note that aij is not gions and at most N (g) connected negative regions; tropical required to be integer-valued nor nonnegative. A (ii) its decision boundary is contained in the tropical hy- rational signomial is a tropical quotient ϕ ψ of two tropical persurface of the tropical polynomial s−1(c) g(x) ⊕ signomials ϕ, ψ.A tropical rational signomial map is a −1 d p f(x) = max{f(x), g(x) + s (c)}, i.e., function ν = (ν1, . . . , νp): R → R where each νi : d R → R is a tropical rational signomial νi = ϕi ψi. The B ⊆ T (s−1(c) g ⊕ f). (3) same argument we used to establish Theorem 5.2 gives us the following. The function s−1(c) g⊕f is not necessarily linear on every Proposition 5.6. Every feedforward neural network with positive or negative region and so its tropical hypersurface ReLU activation is a tropical rational signomial map. T (s−1(c) g ⊕ f) may further divide a positive or negative Nevertheless tropical signomials fall outside the realm of region derived from B into multiple linear regions. Hence tropical algebraic geometry and we do not use Proposi- the “⊆” in (3) cannot in general be replaced by “=”. tion 5.6 in the rest of this article. 6.2. Zonotopes as Geometric Building Blocks of Neural 6. 
Tropical Geometry of Neural Networks Networks From Section3, we know that the number of regions a Section5 defines neural networks via tropical algebra, a per- tropical hypersurface T (f) divides the space into equals the spective that allows us to study them via tropical algebraic number of vertices in the dual subdivision of the Newton geometry. We will show that the decision boundary of a polygon associated with the tropical polynomial f. This neural network is a subset of a tropical hypersurface of a cor- allows us to bound the number of linear regions of a neural responding tropical polynomial (Section 6.1). We will see network by bounding the number of vertices in the dual that, in an appropriate sense, zonotopes form the geometric subdivision of the Newton polygon. building blocks for neural networks (Section 6.2). We then prove that the geometry of the function represented by a We start by examining how geometry changes from one neural network grows vastly more complex as its number of layer to the next in a neural network, more precisely: layers increases (Section 6.3). Question. How are the tropical hypersurfaces of the tropi- cal polynomials in the (l + 1)th layer of a neural network 6.1. Decision Boundaries of a Neural Network related to those in the lth layer? We will use tropical geometry and insights from Section5 The recurrent relation (2) describes how the tropical poly- to study decision boundaries of neural networks, focusing nomials occurring in the (l + 1)th layer are obtained from on the case of two-category classification for clarity. As those in the lth layer, namely, via three operations: tropical explained in Section4, a neural network ν : d → p R R sum, tropical product, and tropical powers. Recall that a together with a choice of score function s : p → give R R tropical hypersurface of a tropical polynomial is dual to us a classifier. If the output value s(ν(x)) exceeds some the dual subdivision of the Newton polytope of the tropical decision threshold c, then the neural network predicts x is polynomial, which is given by the projection of the upper from one class (e.g., x is a CAT image), and otherwise x faces on the polytopes defined by (1). Hence the question is from the other category (e.g., a DOG image). The input boils down to how these three operations transform the poly- space is thereby partitioned into two disjoint subsets by the topes, which is addressed in Propositions 3.1 and 3.2. We decision boundary B := {x ∈ d : ν(x) = s−1(c)}. Con- R follow notations in Proposition 5.1 for the next result. nected regions with value above the threshold and connected (l) (l) (l) regions with value below the threshold will be called the Lemma 6.2. Let fi , gi , hi be the tropical polynomials positive regions and negative regions respectively. produced by the ith node in the lth layer of a neural network, Tropical Geometry of Deep Neural Networks

(l) (l) (l) d i.e., they are defined by (2). Then P fi , P gi , P hi Theorem 6.3. Let ν : R → R be an L-layer real-valued (L) are subsets of Rd+1 given as follows: feedforward neural network satisfying (a)–(c). Let t = (L) (1) (1) −∞ and nl ≥ d for all l = 1,...,L − 1. Then ν = ν (i) P gi and P hi are points. has at most (1) L−1 d (ii) P fi is a line segment. Y X nl (2) (2) ( )   l=1 i=0 i (iii) P gi and P hi are zonotopes.

(iv) For l ≥ 1, linear regions. In particular, if d ≤ n1, . . . , nL−1 ≤ n, the number of linear regions of ν is bounded by Ond(L−1). (l)  (l) (l) (l) P fi = Conv P gi ti ∪ P hi

(l) (l) (l) (l) Proof. If L = 2, this follows directly from Lemma 6.2 and if ti ∈ R, and P fi = P hi if ti = −∞. Corollary 3.4. The case of L ≥ 3 is in Section D.7 in the (l+1) (l+1) (v) For l ≥ 1, P gi and P hi are weighted supplement. Minkowski sums,

nl nl As was pointed out in (Raghu et al., 2017), this upper (l+1) X − (l) X + (l)    (L−1)d d P gi = aijP fj + aijP gj , bound closely matches the lower bound Ω (n/d) n j=1 j=1 in (Montufar et al., 2014, Corollary 5) when n1 = ··· = nl nl (l+1) X + (l) X − (l) nL−1 = n ≥ d. Hence we surmise that the number of linear P hi = aijP fj + aijP gj regions of the neural network grows polynomially with the j=1 j=1 width n and exponentially with the number of layers L. + {bie},

(l+1) 7. Conclusion where aij, bi are entries of the weight matrix A ∈ Znl+1×nl and bias vector b(l+1) ∈ Rnl+1 , and e := We argue that feedforward neural networks with rectified (0,..., 0, 1) ∈ Rd+1. linear units are, modulo trivialities, nothing more than tropi- cal rational maps. To understand them we often just need to A conclusion of Lemma 6.2 is that zonotopes are the build- understand the relevant tropical geometry. ing blocks in the tropical geometry of neural networks. Zonotopes are studied extensively in convex geometry and, In this article, we took a first step to provide a proof-of- among other things, are intimately related to hyperplane ar- concept: questions regarding decision boundaries, linear rangements (Greene & Zaslavsky, 1983; Guibas et al., 2003; regions, how depth affect expressiveness, etc, can be trans- McMullen, 1971; Holtz & Ron, 2011). Lemma 6.2 connects lated into questions involving tropical hypersurfaces, dual neural networks to this extensive body of work but its full subdivision of Newton polygon, polytopes constructed from implication remains to be explored. In Section C.2 of the zonotopes, etc. supplement, we show how one may build these polytopes As a new branch of algebraic geometry, the novelty of tropi- for a two-layer neural network. cal geometry stems from both the algebra and geometry as well as the interplay between them. It has connections to 6.3. Geometric Complexity of Deep Neural Networks many other areas of mathematics. Among other things, there is a tropical analogue of linear algebra (Butkovicˇ, 2010) and We apply the tools in Section3 to study the complexity a tropical analogue of convex geometry (Gaubert & Katz, of a neural network, showing that a deep network is much 2006). We cannot emphasize enough that we have only more expressive than a shallow one. Our measure of com- touched on a small part of this rich subject. We hope that plexity is geometric: we will follow (Montufar et al., 2014; further investigation from this tropical angle might perhaps Raghu et al., 2017) and use the number of linear regions of unravel other mysteries of deep neural networks. a piecewise linear function ν : Rd → Rp to measure the complexity of ν. Acknowledgments We would like to emphasize that our upper bound below does not improve on that obtained in (Raghu et al., 2017)— The authors thank Ralph Morrison, Yang Qi, Bernd Sturm- in fact, our version is more restrictive given that it applies fels, and the anonymous referees for their very helpful com- only to neural networks satisfying (a)–(c). Nevertheless our ments. The work in this article is generously supported goal here is to demonstrate how tropical geometry may be by DARPA D15AP00109, NSF IIS 1546413, the Eckhardt used to derive the same bound. Faculty Fund, and a DARPA Director’s Fellowship. Tropical Geometry of Deep Neural Networks

References

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.

Bengio, Y. and Delalleau, O. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory, pp. 18–36. Springer, 2011.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Butkovič, P. Max-linear systems: theory and algorithms. Springer Science & Business Media, 2010.

Delalleau, O. and Bengio, Y. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pp. 666–674, 2011.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907–940, 2016.

Gaubert, S. and Katz, R. Max-plus convex geometry. In International Conference on Relational Methods in Computer Science. Springer, 2006.

Greene, C. and Zaslavsky, T. On the interpretation of Whitney numbers through arrangements of hyperplanes, zonotopes, non-Radon partitions, and orientations of graphs. Transactions of the American Mathematical Society, 280(1):97–126, 1983.

Gritzmann, P. and Sturmfels, B. Minkowski addition of polytopes: computational complexity and applications to Gröbner bases. SIAM Journal on Discrete Mathematics, 6(2):246–269, 1993.

Guibas, L. J., Nguyen, A., and Zhang, L. Zonotopes as bounding volumes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 803–812. SIAM, 2003.

Hartman, P. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Holtz, O. and Ron, A. Zonotopal algebra. Advances in Mathematics, 227(2):847–894, 2011.

Itenberg, I., Mikhalkin, G., and Shustin, E. I. Tropical algebraic geometry, volume 35. Springer Science & Business Media, 2009.

Kalchbrenner, N. and Blunsom, P. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709, 2013.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Maclagan, D. and Sturmfels, B. Introduction to tropical geometry, volume 161. American Mathematical Society, 2015.

McMullen, P. On zonotopes. Transactions of the American Mathematical Society, 159:91–109, 1971.

Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pp. 2924–2932, 2014.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International Conference on Machine Learning, pp. 2847–2854, 2017.

Tao, P. D. and Hoai An, L. T. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1–4):23–46, 2005.

Tarela, J. and Martinez, M. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 30(11–12):17–27, 1999.

Telgarsky, M. Benefits of depth in neural networks. In Conference on Learning Theory, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv:1611.03530, 2016.

Supplementary Material: Tropical Geometry of Deep Neural Networks

A. Illustration of Our Neural Network Figure A.1 summarizes the architecture and notations of the feedforward neural network discussed in this paper.

[Figure A.1 (schematic): input layer (l = 0), hidden layers l = 1, . . . , L − 1, and output layer (l = L). The ith neuron in layer l computes max{∑_{j=1}^{n_{l−1}} a_ij^(l) ν_j^(l−1)(x) + b_i^(l), t_i^(l)}; in the tropical formulation this is (⨀_{j=1}^{n_{l−1}} (ν_j^(l−1))^{a_ij^(l)} ⊙ b_i^(l)) ⊕ t_i^(l).]

Figure A.1. General form of an ReLU feedforward neural network ν : R^d → R^p with L layers.

B. Tropical Power As in Section 2, we write x^a = x^⊙a; aside from this slight abuse of notation, ⊕ and ⊙ denote tropical sum and product, + and · denote standard sum and product in all other contexts. Tropical power evidently has the following properties:

• For x, y ∈ R and a ∈ R, a ≥ 0, (x ⊕ y)^a = x^a ⊕ y^a and (x ⊙ y)^a = x^a ⊙ y^a. If a is allowed negative values, then we lose the first property. In general (x ⊕ y)^a ≠ x^a ⊕ y^a for a < 0. • For x ∈ R, x^0 = 0.

• For x ∈ R and a, b ∈ N, (xa)b = xa·b.

• For x ∈ R and a, b ∈ Z, x^a ⊙ x^b = x^{a+b}.

• For x ∈ R and a, b ∈ Z, x^a ⊕ x^b = x^b ⊙ (x^{a−b} ⊕ 0) = x^a ⊙ (x^{b−a} ⊕ 0).

C. Examples

C.1. Examples of Tropical Curves and Dual Subdivision of Newton Polygon

Let f ∈ Pol(2, 1) = T[x1, x2], i.e., a bivariate tropical polynomial. It follows from our discussions in Section 3 that the tropical hypersurface T(f) is a planar graph dual to the dual subdivision δ(f) in the following sense: (i) Each two-dimensional face in δ(f) corresponds to a vertex in T(f). (ii) Each one-dimensional edge of a face in δ(f) corresponds to an edge in T(f). In particular, an edge from the Newton polygon ∆(f) corresponds to an unbounded edge in T(f) while other edges correspond to bounded edges. Figure 2 illustrates how we may find the dual subdivision for the tropical polynomial f(x1, x2) = 1 ⊙ x1^2 ⊕ 1 ⊙ x2^2 ⊕ 2 ⊙ x1x2 ⊕ 2 ⊙ x1 ⊕ 2 ⊙ x2 ⊕ 2. First, find the convex hull

P(f) = Conv{(2, 0, 1), (0, 2, 1), (1, 1, 2), (1, 0, 2), (0, 1, 2), (0, 0, 2)}.

Then, by projecting the upper envelope of P(f) to R2, we obtain δ(f), the dual subdivision of the Newton polygon.
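As a sanity check, this upper-envelope construction can be reproduced numerically. The following sketch (ours, assuming scipy is available; not part of the paper) lifts the six exponent vectors by their coefficients, computes the convex hull, keeps the facets whose outward normals point upwards, and projects them back to the (a1, a2)-plane. Note that Qhull triangulates facets, so a coplanar upper face may be reported as several triangles.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Lifted points (a1, a2, c) of f(x1,x2) = 1⊙x1^2 ⊕ 1⊙x2^2 ⊕ 2⊙x1x2 ⊕ 2⊙x1 ⊕ 2⊙x2 ⊕ 2
P = np.array([(2, 0, 1), (0, 2, 1), (1, 1, 2), (1, 0, 2), (0, 1, 2), (0, 0, 2)], dtype=float)

hull = ConvexHull(P)
for simplex, eq in zip(hull.simplices, hull.equations):
    # eq = (outward normal, offset); a facet is an upper face if its normal points up
    if eq[2] > 1e-9:
        cell = P[simplex][:, :2]   # drop the last coordinate: a cell of the dual subdivision
        print("upper face projects to cell with vertices:\n", cell)
```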

C.2. Polytopes of a Two-Layer Neural Network

We illustrate our discussions in Section 6.2 with a two-layer example. Let ν : R^2 → R be a network with n0 = 2 input nodes, n1 = 5 nodes in the first layer, and n2 = 1 node in the output:

ω = ν^(1)(x) = max{A^(1)x + b^(1), 0},   where
A^(1) =
[ −1   1
   1  −3
   1   2
  −4   1
   3   2 ],   b^(1) = (1, −1, 2, 0, −2)^T,

ν^(2)(ω) = max{ω1 + 2ω2 + ω3 − ω4 − 3ω5, 0}.

We first express ν^(1) and ν^(2) as tropical rational maps, ν^(1) = F^(1) ⊘ G^(1), ν^(2) = f^(2) ⊘ g^(2), where

y := F^(1)(x) = H^(1)(x) ⊕ G^(1)(x),

z := G^(1)(x) = (x1, x2^3, 0, x1^4, 0),
H^(1)(x) = (1 ⊙ x2, (−1) ⊙ x1, 2 ⊙ x1x2^2, x2, (−2) ⊙ x1^3x2^2),

and

f^(2)(x) = g^(2)(x) ⊕ h^(2)(x),
g^(2)(x) = y4 ⊙ y5^3 ⊙ z1 ⊙ z2^2 ⊙ z3 = (x2 ⊕ x1^4) ⊙ ((−2) ⊙ x1^3x2^2 ⊕ 0)^3 ⊙ x1 ⊙ (x2^3)^2,
h^(2)(x) = y1 ⊙ y2^2 ⊙ y3 ⊙ z4 ⊙ z5^3 = (1 ⊙ x2 ⊕ x1) ⊙ ((−1) ⊙ x1 ⊕ x2^3)^2 ⊙ (2 ⊙ x1x2^2 ⊕ 0) ⊙ x1^4.
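As a quick numerical check of this decomposition (our own sketch, assuming numpy; not from the paper), one can evaluate the ReLU network of this example and its tropical rational form f^(2) ⊘ g^(2) at random points and confirm that they agree.

```python
import numpy as np

A1 = np.array([[-1, 1], [1, -3], [1, 2], [-4, 1], [3, 2]], dtype=float)
b1 = np.array([1, -1, 2, 0, -2], dtype=float)
c2 = np.array([1, 2, 1, -1, -3], dtype=float)

def relu_net(x):
    omega = np.maximum(A1 @ x + b1, 0.0)     # first layer
    return max(c2 @ omega, 0.0)              # output layer

def tropical_form(x):
    x1, x2 = x
    # building blocks of Section C.2 in classical arithmetic: ⊙ is +, ⊕ is max, powers are scalings
    z = np.array([x1, 3 * x2, 0.0, 4 * x1, 0.0])                                # G^(1)
    h = np.array([1 + x2, -1 + x1, 2 + x1 + 2 * x2, x2, -2 + 3 * x1 + 2 * x2])  # H^(1)
    y = np.maximum(h, z)                                                        # F^(1) = H^(1) ⊕ G^(1)
    g2 = y[3] + 3 * y[4] + z[0] + 2 * z[1] + z[2]                               # g^(2)
    h2 = y[0] + 2 * y[1] + y[2] + z[3] + 3 * z[4]                               # h^(2)
    return max(g2, h2) - g2                                                     # ν = f^(2) ⊘ g^(2)

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.uniform(-5, 5, size=2)
    assert abs(relu_net(x) - tropical_form(x)) < 1e-9
print("network and tropical rational form agree")
```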

(1) (1) (1) (1) (1) (1) (1) We will write F = (f1 , . . . , f5 ) and likewise for G and H . The monomials occurring in gj (x) and hj (x) are a1 a2 (1) (1) 3 all of the form cx1 x2 . Therefore P(gj ) and P(hj ), j = 1,..., 5, are points in R .

(1) (1) (1) (1) 3 Since F = G ⊕ H , P(fj ) is a convex hull of two points, and thus a line segment in R . The Newton polygons (1) associated with fj , equal to their dual subdivisions in this case, are obtained by projecting these line segments back to the plane spanned by a1, a2, as shown on the left in Figure C.1. Tropical Geometry of Deep Neural Networks

Figure C.1. Left: P(fj^(1)) and dual subdivision of ∆(fj^(1)), j = 1, . . . , 5. Right: P(g^(2)) and dual subdivision of ∆(g^(2)). In both figures, dual subdivisions have been translated along the −c direction (downwards) and separated from the polytopes for visibility.

Figure C.2. Left: The polytope associated with h^(2) and its dual subdivision. Right: P(f^(2)) and dual subdivision of ∆(f^(2)). In both figures, dual subdivisions have been translated along the −c direction (downwards) and separated from the polytopes for visibility.

The line segments P(fj^(1)), j = 1, . . . , 5, and points P(gj^(1)), j = 1, . . . , 5, serve as building blocks for P(h^(2)) and P(g^(2)), which are constructed as weighted Minkowski sums:

P(g^(2)) = P(f4^(1)) + 3P(f5^(1)) + P(g1^(1)) + 2P(g2^(1)) + P(g3^(1)),
P(h^(2)) = P(f1^(1)) + 2P(f2^(1)) + P(f3^(1)) + P(g4^(1)) + 3P(g5^(1)).

P(g^(2)) and the dual subdivision of its Newton polygon are shown on the right in Figure C.1. P(h^(2)) and the dual subdivision of its Newton polygon are shown on the left in Figure C.2. P(f^(2)) is the convex hull of the union of P(g^(2)) and P(h^(2)). The dual subdivision of its Newton polygon is obtained by projecting the upper faces of P(f^(2)) to the plane spanned by a1, a2. These are shown on the right in Figure C.2.
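These weighted Minkowski sums are straightforward to reproduce numerically. The sketch below (ours, assuming scipy; variable names are hypothetical) enumerates candidate vertices of P(g^(2)) from the endpoints of the generating segments and recovers the Newton polygon ∆(g^(2)) by projection; the four candidate vertices it prints correspond to those on the right of Figure C.1.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Lifted monomials (a1, a2, c) of the first-layer tropical polynomials from Section C.2.
g1 = [(1, 0, 0)]; g2 = [(0, 3, 0)]; g3 = [(0, 0, 0)]; g4 = [(4, 0, 0)]; g5 = [(0, 0, 0)]
h  = [(0, 1, 1), (1, 0, -1), (1, 2, 2), (0, 1, 0), (3, 2, -2)]
f  = [gj + [hj] for gj, hj in zip([g1, g2, g3, g4, g5], h)]   # P(f_j^(1)) = Conv{P(g_j^(1)), P(h_j^(1))}

def weighted_minkowski(terms):
    """Weighted Minkowski sum of (weight, point-list) pairs, via all endpoint combinations."""
    sums = [np.zeros(3)]
    for lam, pts in terms:
        sums = [s + lam * np.array(p, float) for s in sums for p in pts]
    return np.unique(np.round(sums, 9), axis=0)

# P(g^(2)) = P(f4) + 3 P(f5) + P(g1) + 2 P(g2) + P(g3)
Pg2 = weighted_minkowski([(1, f[3]), (3, f[4]), (1, g1), (2, g2), (1, g3)])
print(Pg2)                        # e.g. (1,7,0), (5,6,0), (10,13,-6), (14,12,-6)

newton = ConvexHull(Pg2[:, :2])   # Newton polygon ∆(g^(2)) = projection of P(g^(2))
print(Pg2[newton.vertices, :2])
```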

D. Proofs D.1. Proof of Corollary 3.4

Proof. Let V1 and V2 be the sets of vertices on the upper and lower envelopes of P respectively. By Theorem 3.3, P has

n1 := 2 ∑_{j=0}^{d} (m−1 choose j)

vertices in total. By construction, we have |V1 ∪ V2| = n1. It is well-known that zonotopes are centrally symmetric and so there are an equal number of vertices on the upper and lower envelopes, i.e., |V1| = |V2|. Let P′ := π(P) be the projection of P into R^d. Since the projected vertices are assumed to be in general positions, P′ must be a d-dimensional zonotope generated by m nonparallel line segments. Hence, by Theorem 3.3 again, P′ has

n2 := 2 ∑_{j=0}^{d−1} (m−1 choose j)

vertices. For any vertex v ∈ P, π(v) is a vertex of P′ if and only if v belongs to both the upper and lower envelopes, i.e., v ∈ V1 ∩ V2. Therefore the number of vertices on P′ equals |V1 ∩ V2|. By construction, we have |V1 ∩ V2| = n2. Consequently the number of vertices on the upper envelope is

|V1| = (1/2)(|V1 ∪ V2| − |V1 ∩ V2|) + |V1 ∩ V2| = (1/2)(n1 − n2) + n2 = ∑_{j=0}^{d} (m choose j).
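The count in Corollary 3.4 can also be probed experimentally. The following sketch (our own, assuming numpy and scipy) builds a zonotope in R^{d+1} from m random line segments, counts the vertices on its upper faces, and compares with ∑_{j=0}^{d} (m choose j); for generic random generators the two numbers agree.

```python
import itertools
from math import comb
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
d, m = 2, 5
gens = rng.standard_normal((m, d + 1))   # m generic line segments [0, v_i] in R^{d+1}

# zonotope vertices are among the 2^m subset sums of the generators
pts = np.array([gens[list(S)].sum(axis=0) if S else np.zeros(d + 1)
                for r in range(m + 1) for S in itertools.combinations(range(m), r)])
hull = ConvexHull(pts)

upper = set()
for simplex, eq in zip(hull.simplices, hull.equations):
    if eq[d] > 1e-9:                     # outward normal points "up" in the last coordinate
        upper.update(simplex)

print(len(upper), sum(comb(m, j) for j in range(d + 1)))   # both are 16 for generic segments
```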

D.2. Proof of Proposition 5.1

Proof. Writing A = A+ − A−, we have

ρ^(l+1)(x) = (A+ − A−)(F^(l)(x) − G^(l)(x)) + b
           = (A+F^(l)(x) + A−G^(l)(x) + b) − (A+G^(l)(x) + A−F^(l)(x))
           = H^(l+1)(x) − G^(l+1)(x),
ν^(l+1)(x) = max{ρ^(l+1)(x), t} = max{H^(l+1)(x) − G^(l+1)(x), t}
           = max{H^(l+1)(x), G^(l+1)(x) + t} − G^(l+1)(x)
           = F^(l+1)(x) − G^(l+1)(x).

D.3. Proof of Theorem 5.4

Proof. It remains to establish the "only if" part. We will write σt(x) := max{x, t}. Any tropical monomial bix^αi is clearly such a neural network as

bix^αi = (σ−∞ ◦ ρi)(x) = max{αi^T x + bi, −∞}.

If two tropical polynomials p and q are represented as neural networks with lp and lq layers respectively,

p(x) = σ−∞ ◦ ρp^(lp) ◦ σ0 ◦ · · · ◦ σ0 ◦ ρp^(1)(x),
q(x) = σ−∞ ◦ ρq^(lq) ◦ σ0 ◦ · · · ◦ σ0 ◦ ρq^(1)(x),

then (p ⊕ q)(x) = max{p(x), q(x)} can also be written as a neural network with max{lp, lq} + 1 layers:

(p ⊕ q)(x) = σ−∞([σ0 ◦ ρ1](y(x)) + [σ0 ◦ ρ2](y(x)) − [σ0 ◦ ρ3](y(x))),

where y : R^d → R^2 is given by y(x) = (p(x), q(x)) and ρi : R^2 → R, i = 1, 2, 3, are linear functions defined by

ρ1(y) = y1 − y2, ρ2(y) = y2, ρ3(y) = −y2.

Thus, by induction, any tropical polynomial can be written as a neural network with ReLU activation. Observe also that if a tropical polynomial is the tropical sum of r monomials, then it can be written as a neural network with no more than ⌈log2 r⌉ + 1 layers. Next we consider a tropical rational function (p ⊘ q)(x) = p(x) − q(x) where p and q are tropical polynomials. Under the same assumptions, we can represent p ⊘ q as

(p ⊘ q)(x) = σ−∞([σ0 ◦ ρ4](y(x)) − [σ0 ◦ ρ5](y(x)) + [σ0 ◦ ρ6](y(x)) − [σ0 ◦ ρ7](y(x)))

where ρi : R^2 → R, i = 4, 5, 6, 7, are linear functions defined by

ρ4(y) = y1, ρ5(y) = −y1, ρ6(y) = −y2, ρ7(y) = y2.

Therefore p ⊘ q is also a neural network with at most max{lp, lq} + 1 layers.

Finally, if f and g are tropical polynomials that are respectively tropical sums of rf and rg monomials, then the discussions above show that (f ⊘ g)(x) = f(x) − g(x) is a neural network with at most max{⌈log2 rf⌉, ⌈log2 rg⌉} + 2 layers.
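The two ReLU identities underlying this construction are easy to verify numerically; the following sketch (ours, plain Python; not from the paper) checks that the three-term combination above realizes p ⊕ q and the four-term combination realizes p ⊘ q.

```python
import random

relu = lambda v: max(v, 0.0)

def max_via_relu(p, q):
    # one extra ReLU layer realizing p ⊕ q = max{p, q}  (ρ1, ρ2, ρ3 from the proof)
    return relu(p - q) + relu(q) - relu(-q)

def diff_via_relu(p, q):
    # one extra ReLU layer realizing p ⊘ q = p − q  (ρ4, ..., ρ7 from the proof)
    return relu(p) - relu(-p) + relu(-q) - relu(q)

random.seed(0)
for _ in range(1000):
    p, q = random.uniform(-10, 10), random.uniform(-10, 10)
    assert abs(max_via_relu(p, q) - max(p, q)) < 1e-12
    assert abs(diff_via_relu(p, q) - (p - q)) < 1e-12
print("identities hold")
```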

D.4. Proof of Proposition 5.5

Proof. It remains to establish the "if" part. Let R^d be divided into finitely many polyhedral regions on each of which ν restricts to a linear function

ℓi(x) = ai^T x + bi,   ai ∈ Z^d, bi ∈ R, i = 1, . . . , L,

i.e., for any x ∈ R^d, ν(x) = ℓi(x) for some i ∈ {1, . . . , L}. It follows from (Tarela & Martinez, 1999) that we can find N subsets of {1, . . . , L}, denoted by Sj, j = 1, . . . , N, so that ν has a representation

ν(x) = max_{j=1,...,N} min_{i∈Sj} ℓi(x).

It is clear that each `i is a tropical rational function. Now for any tropical rational functions p and q,

min{p, q} = −max{−p, −q} = 0 ⊘ [(0 ⊘ p) ⊕ (0 ⊘ q)] = [p ⊙ q] ⊘ [p ⊕ q].

Since p ⊙ q and p ⊕ q are both tropical rational functions, so is their tropical quotient. By induction, min_{i∈Sj} ℓi is a tropical rational function for any j = 1, . . . , N, and therefore so is their tropical sum ν.
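As a tiny sanity check (ours, plain Python; translating ⊙ to +, ⊕ to max, ⊘ to −), the identity for min used above can be verified numerically.

```python
import random

random.seed(1)
for _ in range(1000):
    p, q = random.uniform(-10, 10), random.uniform(-10, 10)
    assert abs(min(p, q) - ((p + q) - max(p, q))) < 1e-12   # [p ⊙ q] ⊘ [p ⊕ q]
print("min{p, q} = (p ⊙ q) ⊘ (p ⊕ q) verified")
```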

D.5. Proof of Proposition 5.6

Proof. For a one-layer neural network ν(x) = max{Ax + b, t} = (ν1(x), . . . , νp(x)) with A ∈ R^{p×d}, b ∈ R^p, x ∈ R^d, t ∈ (R ∪ {−∞})^p, we have

νk(x) = (bk ⊙ ⨀_{j=1}^{d} xj^{a_kj}) ⊕ tk = (bk ⊙ ⨀_{j=1}^{d} xj^{a_kj}) ⊕ (tk ⊙ ⨀_{j=1}^{d} xj^0),   k = 1, . . . , p.

So for any k = 1, . . . , p, if we write b̄1 = bk, b̄2 = tk, ā_1j = a_kj, ā_2j = 0, j = 1, . . . , d, then

νk(x) = ⨁_{i=1}^{2} b̄i ⊙ ⨀_{j=1}^{d} xj^{ā_ij}

is clearly a tropical signomial function. Therefore ν is a tropical signomial map. The result for arbitrary number of layers then follows from using the same recurrence as in the proof in Section D.2, except that now the entries in the weight matrix are allowed to take real values, and the maps H^(l)(x), G^(l)(x), F^(l)(x) are tropical signomial maps. Hence every layer can be written as a tropical rational signomial map ν^(l) = F^(l) ⊘ G^(l).

D.6. Proof of Proposition 6.1 We prove a slightly more general result.

Proposition D.1 (Level sets). Let f ⊘ g ∈ Rat(d, 1) = T(x1, . . . , xd).

(i) Given a constant c > 0, the level set

B := {x ∈ R^d : f(x) ⊘ g(x) = c}

divides R^d into at most N(f) connected polyhedral regions where f(x) ⊘ g(x) > c, and at most N(g) such regions where f(x) ⊘ g(x) < c.

(ii) If c ∈ R is such that there is no tropical monomial in f(x) that differs from any tropical monomial in g(x) by c, then the level set B is contained in a tropical hypersurface,

B ⊆ T(max{f(x), g(x) + c}) = T(c ⊙ g ⊕ f).

Proof. We show that the bounds on the numbers of connected positive (i.e., above c) and negative (i.e., below c) regions are d as we claimed in (i). The tropical hypersurface of f divides R into N (f) convex regions C1,...,CN (f) such that f is d linear on each Ci. As g is piecewise linear and convex over R , f g = f − g is piecewise linear and concave on each Ci. Since the level set {x : f(x) − g(x) = c} and the superlevel set {x : f(x) − g(x) ≥ c} must be convex by the concavity of f − g, there is at most one positive region in each Ci. Therefore the total number of connected positive regions cannot exceed N (f). Likewise, the tropical hypersurface of g divides Rd into N (g) convex regions on each of which f g is convex. The same argument shows that the number of connected negative regions does not exceed N (g). We next address (ii). Upon rearranging terms, the level set becomes

B = {x ∈ R^d : f(x) = g(x) + c}.

Since f(x) and g(x) + c are both tropical polynomials, we have

f(x) = b1x^β... 

f(x) = b1x^α1 ⊕ · · · ⊕ brx^αr,
g(x) + c = c1x^β1 ⊕ · · · ⊕ csx^βs,

with appropriate multiindices α1, . . . , αr, β1, . . . , βs, and real coefficients b1, . . . , br, c1, . . . , cs. By the assumption on the monomials, we have that x0 ∈ B only if there exist i, j so that αi ≠ βj and bix0^αi = cjx0^βj. This completes the proof since if we combine the monomials of f(x) and g(x) + c by (tropical) summing them into a single tropical polynomial, max{f(x), g(x) + c}, the above implies that on the level set the value of the combined tropical polynomial is attained by at least two monomials and therefore x0 ∈ T(max{f(x), g(x) + c}).

Proposition 6.1 follows immediately from Proposition D.1 since the decision boundary {x ∈ Rd : ν(x) = s−1(c)} is a level set of the tropical rational function ν.

D.7. Proof of Theorem 6.3 The linear regions of a tropical polynomial map F ∈ Pol(d, m) are all convex but this is not necessarily the case for a tropical rational map F ∈ Rat(d, n). Take for example a bivariate real-valued function f(x, y) whose graph in R3 is a pyramid with base {(x, y) ∈ R2 : x, y ∈ [−1, 1]} and zero everywhere else, then the linear region where f vanishes is R2 \{(x, y) ∈ R2 : x, y ∈ [−1, 1]}, which is nonconvex. The nonconvexity invalidates certain geometric arguments that only apply in the convex setting. Nevertheless there is a way to subdivide each of the nonconvex linear regions into convex ones to get ourselves back into the convex setting. We will start with the number of convex linear regions for tropical rational maps although later we will deduce the required results for the number of linear regions (without imposing convexity). We first extend the notion of tropical hypersurface to tropical rational maps: Given a tropical rational map F ∈ Rat(d, m), we define T (F ) to be the boundaries between adjacent linear regions. When F = (f1, . . . , fm) ∈ Pol(d, m), i.e., a tropical polynomial map, this set is exactly the union of tropical hypersurfaces T (fi), i = 1, . . . , m. Therefore this definition of T (F ) extends Definition 3.1. For a tropical rational map F , we will examine the smallest number of convex regions that form a refinement of T (F ). For brevity, we will call this the convex degree of F ; for consistency, the number of linear regions of F we will call its linear d degree. We define convex degree formally below. We will write F |C to mean the restriction of map F to C ⊆ R . Definition D.1. The convex degree of a tropical rational map F ∈ Rat(d, n) is the minimum division of Rd into convex regions over which F is linear, i.e.

Nc(F) := min{n : C1 ∪ · · · ∪ Cn = R^d, Ci convex, F|_{Ci} linear}.

d Note that C1,...,CNc(F ) either divide R into the same regions as T (F ) or form a refinement. Tropical Geometry of Deep Neural Networks

For m ≤ d, we will denote by Nc(F | m) the maximum convex degree obtained by restricting F to an m-dimensional affine subspace in Rd, i.e.,  d Nc(F | m) := max Nc(F |Ω):Ω ⊆ R is an m-dimensional affine space .

For any F ∈ Rat(d, n), there is at least one tropical polynomial map that subdivides T (F ), and so convex degree is well- defined (e.g., if F = (p1 q1, . . . , pn qn) ∈ Rat(d, n), then we may choose P = (p1, . . . , pn, q1, . . . , qn) ∈ Pol(d, 2n)). Since the linear regions of a tropical polynomial map are always convex, we have N (F ) = Nc(F ) for any F ∈ Pol(d, n). n 3 Let F = (f1, . . . , fn) ∈ Rat(d, n) and α = (a1, . . . , an) ∈ Z . Consider the tropical rational function n α T K aj F := α F = a1f1 + ··· + anfn = fj ∈ Rat(d, 1). j=1 For some α, F α may have fewer linear regions than F , e.g, α = (0,..., 0). As such, we need the following notion. n α Definition D.2. α = (a1, . . . , an) ∈ Z is said to be a general exponent of F ∈ Rat(d, n) if the linear regions of F and the linear regions of F are identical.

We show that general exponent always exists for any F ∈ Rat(d, n) and may be chosen to have all entries nonnegative. Lemma D.2. Let F ∈ Rat(d, n). Then (i) N (F α) = N (F ) if and only if α is a general exponent; (ii) F has a general exponent α ∈ Nn.

Proof. It follows from the definition of tropical hypersuface that T (F α) and T (F ) comprise respectively the points x ∈ Rd at which F α and F are not differentiable. Hence T (F α) ⊆ T (F ), which implies that N (F α) < N (F ) unless T (F α) = T (F ). This concludes (i).

For (ii), we need to show that there always exists an α ∈ Nn such that F α divides its domain Rd into the same set of linear regions as F . In other words, for every pair of adjacent linear regions of F , the (d − 1)-dimensional face in T (F ) that separates them is also present in T (F α) and so T (F α) ⊇ T (F ).

Let L and M be adjacent linear regions of F . The differentials of F |L and F |M must have integer coordinates, i.e., n×d dF |L, dF |M ∈ Z . Since L and M are distinct linear regions, we must have dF |L 6= dF |M (or otherwise L and M can α α T T be merged into a single linear region). Note that the differentials of F |L and F |M are given by α dF |L and α dF |M .

α T To ensure the (d − 1)-dimensional face separating L and M still exists in T (F ), we need to choose α so that α dF |L 6= T T n α dF |M . Observe that the solution to (dF |L − dF |M ) α = 0 is contained in a one-dimensional subspace of R . Let A(F ) be the collection of all pairs of adjacent linear regions of F . Since the set of α that degenerates two adjacent linear regions into a single one, i.e.,

[  n T S := α ∈ N :(dF |L − dF |M ) α = 0) , (L,M)∈A(F ) is contained in a union of a finite number of hyperplanes in Rn, S cannot cover the entire lattice of nonnegative integers Nn. Therefore the set Nn ∩ (Rn \S) is nonempty and any of its element is a general exponent for F .

Lemma D.2 shows that we may study the linear degree of a tropical rational map by studying that of a tropical rational function, for which the results in Section 3.1 apply. We are now ready to prove a key result on the convex degree of composition of tropical rational maps.

Theorem D.3. Let F = (f1, . . . , fm) ∈ Rat(n, m) and G ∈ Rat(d, n). Define H = (h1, . . . , hm) ∈ Rat(d, m) by

hi := fi ◦ G, i = 1, . . . , m. Then N (H) ≤ Nc(H) ≤ Nc(F | d) ·Nc(G). 3This is in the sense of a tropical power but we stay consistent to our slight abuse of notation and write F α instead of F α. Tropical Geometry of Deep Neural Networks

Proof. Only the upper bound requires a proof. Let k = Nc(G). By the definition of Nc(G), there exist convex sets C ,...,C ⊆ d whose union is d and on each of which G is linear. So G| is some affine function ρ . For any i, 1 k R R Ci i

Nc(F ◦ ρi) ≤ Nc(F | d), by the definition of Nc(F | d). Since F ◦ G = F ◦ ρi on Ci, we have

k X Nc(F ◦ G) ≤ Nc(F ◦ ρi). i=1 Hence k k X X Nc(F ◦ G) ≤ Nc(F ◦ ρi) ≤ Nc(F | d) = Nc(F | d) ·Nc(G). i=1 i=1

We now apply our observations on tropical rational functions to neural networks. The next lemma follows directly from Corollary 3.4. Lemma D.4. Let σ(l) ◦ ρ(l) : Rnl−1 → Rnl where σ(l) and ρ(l) are the affine transformation and activation of the lth layer of a neural network. If d ≤ nl, then d (l) (l) X nl Nc(σ ◦ ρ | d) ≤ ( ) . i=0 i

(l) (l) d nl Proof. Nc(σ ◦ ρ | d) is the maximum convex degree of a tropical rational map F = (f1, . . . , fnl ): R → R of the form (l) (l) α1 αnl−1 fi(x) := σi ◦ ρ ◦ (b1 x , . . . , bnl−1 x ), i = 1, . . . , nl. For a general affine transformation ρ(l),

0 0 (l) α1 αn 0 α 0 α  ρ (b x , . . . , b x l−1 ) = b x 1 , . . . , b x nl =: G(x) 1 nl−1 1 nl

(l) α0 , . . . , α0 b0 , . . . , b0 G : d → nl f = σ ◦ G for some 1 nl and 1 nl , and we denote this map by R R . So i i . By Theorem D.3, we (l) (l) (l) (l) have Nc(σ ◦ ρ | d) = Nc(σ | d) ·Nc(G) = Nc(σ | d); note that Nc(G) = 1 as G is a linear function.

We have thus reduced the problem to determining a bound on the convex degree of a single layer neural network with nl d nl nl nodes ν = (ν1, . . . , νnl ): R → R . Let γ = (c1, . . . , cnl ) ∈ N be a nonnegative general exponent for ν. Note that

nl nl  d   d  cj nl  d cj c + − − K j K K aji K aji K K aji νj = bi x ⊕ x tj − x . j=1 j=1 i=1 i=1 j=1 i=1

Since the last term is linear in x, we may drop it without affecting the convex degree of the entire expression. It remains to determine an upper bound for the number of linear regions of the tropical polynomial

nl  d   d  cj K K a+ K a− h(x) = bi x ji ⊕ x ji tj , j=1 i=1 i=1 which we will obtain by counting vertices of the polytope P(h). By Propositions 3.1 and 3.2 the polytope P(h) is given by a weighted Minkowski sum

nl  d   d   X K a+ K a− cjP bi x ji ⊕ x ji tj . j=1 i=1 i=1 By Proposition 3.2 again,  d   d   K a+ K a−  P bi x ji ⊕ x ji tj = Conv V(P(f)) ∪ V(P(g)) i=1 i=1 Tropical Geometry of Deep Neural Networks where d  d  K a+ K a− f(x) = bi x ji and g(x) = x ji tj i=1 i=1

d+1  d+1 are tropical monomials. Therefore P(f), P(g) are just points in R and Conv V(P(f)) ∪ V(P(g)) is a line in R . d+1 Hence P(h) is a Minkowski sum of nl line segments in R , i.e., a zonotope, and Corollary 3.4 completes the proof.

Using Lemma D.4, we obtain a bound on the number of linear regions created by one layer of a neural network. Theorem D.5. Let ν : Rd → RnL be an L-layer neural network satisfying assumptions (a)–(c) with F (l), G(l),H(l), and (l) ν as defined in Proposition 5.1. Let nl ≥ d for all l = 1,...,L. Then

Nc(ν^(1)) = N(G^(1)) = N(H^(1)) = 1,   Nc(ν^(l+1)) ≤ Nc(ν^(l)) · ∑_{i=0}^{d} (n_{l+1} choose i).

(1) (1) (1) (1) (1) Proof. The l = 1 case follows from the fact that G (x) = A− x and H (x) = A+ x + b are both linear, which in (1) (l) (l) (l) (l−1) turn forces Nc(ν ) = 1 as in the proof of Lemma D.4. Since ν = (σ ◦ ρ ) ◦ ν , the recursive bound follows from Theorem D.3 and Lemma D.4.

Theorem 6.3 follows from applying Theorem D.5 recursively.
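For concreteness, the bound of Theorem 6.3 is a one-liner to evaluate; the sketch below (ours, plain Python; not part of the paper) computes ∏_{l=1}^{L−1} ∑_{i=0}^{d} (nl choose i) for a given input dimension and list of hidden widths.

```python
from math import comb, prod

def linear_region_bound(d, widths):
    """Theorem 6.3 bound for input dimension d and hidden widths n_1, ..., n_{L-1} (each >= d)."""
    return prod(sum(comb(n, i) for i in range(d + 1)) for n in widths)

print(linear_region_bound(2, [5]))        # one hidden layer of width 5: 16
print(linear_region_bound(2, [5, 5, 5]))  # three hidden layers of width 5: 16**3 = 4096
```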