From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

André F. T. Martins†] [email protected]
Ramón F. Astudillo† [email protected]
†Unbabel Lda, Rua Visconde de Santarém, 67-B, 1000-286 Lisboa, Portugal
]Instituto de Telecomunicações (IT), Instituto Superior Técnico, Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Rua Alves Redol, 9, 1000-029 Lisboa, Portugal

Abstract

We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus.

1. Introduction

The softmax transformation is a key component of several statistical learning models, encompassing multinomial logistic regression (McCullagh & Nelder, 1989), action selection in reinforcement learning (Sutton & Barto, 1998), and neural networks for multi-class classification (Bridle, 1990; Goodfellow et al., 2016). Recently, it has also been used to design attention mechanisms in neural networks, with important achievements in machine translation (Bahdanau et al., 2015), image caption generation (Xu et al., 2015), speech recognition (Chorowski et al., 2015), memory networks (Sukhbaatar et al., 2015), and various tasks in natural language understanding (Hermann et al., 2015; Rocktäschel et al., 2015; Rush et al., 2015) and computation learning (Graves et al., 2014; Grefenstette et al., 2015).

There are a number of reasons why the softmax transformation is so appealing. It is simple to evaluate and differentiate, and it can be turned into the (convex) negative log-likelihood by taking the logarithm of its output. Alternatives proposed in the literature, such as the Bradley-Terry model (Bradley & Terry, 1952; Zadrozny, 2001; Menke & Martinez, 2008), the multinomial probit (Albert & Chib, 1993), the spherical softmax (Ollivier, 2013; Vincent, 2015; de Brébisson & Vincent, 2015), or softmax approximations (Bouchard, 2007), while theoretically or computationally advantageous for certain scenarios, lack some of the convenient properties of softmax.

In this paper, we propose the sparsemax transformation. Sparsemax has the distinctive feature that it can return sparse posterior distributions, that is, it may assign exactly zero probability to some of its output variables. This property makes it appealing as a filter for large output spaces, to predict multiple labels, or as a component to identify which of a group of variables are potentially relevant for a decision, making the model more interpretable. Crucially, this is done while preserving most of the attractive properties of softmax: we show that sparsemax is also simple to evaluate, that it is even cheaper to differentiate, and that it can be turned into a convex loss function.

To sum up, our contributions are as follows:

• We formalize the new sparsemax transformation, derive its properties, and show how it can be efficiently computed (§2.1–2.3). We show that in the binary case sparsemax reduces to a hard sigmoid (§2.4).

• We derive the Jacobian of sparsemax, comparing it to the softmax case, and show that it can lead to faster gradient backpropagation (§2.5).

• We propose the sparsemax loss, a new loss function that is the sparsemax analogue of logistic regression (§3). We show that it is convex, everywhere differentiable, and can be regarded as a multi-class generalization of the Huber classification loss, an important tool in robust statistics (Huber, 1964; Zhang, 2004).

• We apply the sparsemax loss to train multi-label linear classifiers (which predict a set of labels instead of a single label) on benchmark datasets (§4.1–4.2).

• Finally, we devise a neural selective attention mechanism using the sparsemax transformation, evaluating its performance on a natural language inference problem, with encouraging results (§4.3).

Algorithm 1 Sparsemax Evaluation
  Input: z
  Sort z as z_(1) ≥ ... ≥ z_(K)
  Find k(z) := max{ k ∈ [K] | 1 + k z_(k) > Σ_{j≤k} z_(j) }
  Define τ(z) = (Σ_{j≤k(z)} z_(j) − 1) / k(z)
  Output: p s.t. p_i = [z_i − τ(z)]_+
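As a concrete companion to Algorithm 1 (our own illustration, not part of the original paper), a minimal NumPy sketch of the sort-based evaluation could look as follows; the function name sparsemax and the example input are ours.

    import numpy as np

    def sparsemax(z):
        """Project the score vector z onto the simplex (Eq. 2), via Algorithm 1."""
        z = np.asarray(z, dtype=float)
        K = z.shape[0]
        z_sorted = np.sort(z)[::-1]          # z_(1) >= ... >= z_(K)
        cumsum = np.cumsum(z_sorted)
        ks = np.arange(1, K + 1)
        # k(z): largest k such that 1 + k * z_(k) > sum_{j<=k} z_(j)
        in_support = 1.0 + ks * z_sorted > cumsum
        k_z = ks[in_support][-1]
        tau = (cumsum[k_z - 1] - 1.0) / k_z  # threshold tau(z), Eq. 4
        return np.maximum(z - tau, 0.0)      # p_i = [z_i - tau(z)]_+

    print(sparsemax([1.5, 0.8, -1.0]))       # [0.85 0.15 0.  ], exactly zero on the last entry

Unlike softmax, the output can hit the boundary of the simplex, so some coordinates are exactly zero.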

2. The Sparsemax Transformation

2.1. Definition

Let Δ^{K−1} := {p ∈ R^K | 1ᵀp = 1, p ≥ 0} be the (K−1)-dimensional simplex. We are interested in functions that map vectors in R^K to probability distributions in Δ^{K−1}. Such functions are useful for converting a vector of real weights (e.g., label scores) to a probability distribution (e.g., posterior probabilities of labels). The classical example is the softmax function, defined componentwise as:

softmax_i(z) = exp(z_i) / Σ_j exp(z_j). (1)

A limitation of the softmax transformation is that the resulting probability distribution always has full support, i.e., softmax_i(z) ≠ 0 for every z and i. This is a disadvantage in applications where a sparse probability distribution is desired, in which case it is common to define a threshold below which small probability values are truncated to zero.

In this paper, we propose as an alternative the following transformation, which we call sparsemax:

sparsemax(z) := argmin_{p ∈ Δ^{K−1}} ‖p − z‖². (2)

In words, sparsemax returns the Euclidean projection of the input vector z onto the probability simplex. This projection is likely to hit the boundary of the simplex, in which case sparsemax(z) becomes sparse. We will see that sparsemax retains most of the important properties of softmax, having in addition the ability of producing sparse distributions.

2.2. Closed-Form Solution

Projecting onto the simplex is a well studied problem, for which linear-time algorithms are available (Michelot, 1986; Pardalos & Kovoor, 1990; Duchi et al., 2008). We start by recalling the well-known result that such projections correspond to a soft-thresholding operation. Below, we use the notation [K] := {1,...,K} and [t]_+ := max{0, t}.

Proposition 1 The solution of Eq. 2 is of the form:

sparsemax_i(z) = [z_i − τ(z)]_+, (3)

where τ : R^K → R is the (unique) function that satisfies Σ_j [z_j − τ(z)]_+ = 1 for every z. Furthermore, τ can be expressed as follows. Let z_(1) ≥ z_(2) ≥ ... ≥ z_(K) be the sorted coordinates of z, and define k(z) := max{ k ∈ [K] | 1 + k z_(k) > Σ_{j≤k} z_(j) }. Then,

τ(z) = (Σ_{j≤k(z)} z_(j) − 1) / k(z) = (Σ_{j∈S(z)} z_j − 1) / |S(z)|, (4)

where S(z) := {j ∈ [K] | sparsemax_j(z) > 0} is the support of sparsemax(z).

Proof: See App. A.1 in the supplemental material.

In essence, Prop. 1 states that all we need for evaluating the sparsemax transformation is to compute the threshold τ(z); all coordinates above this threshold (the ones in the set S(z)) will be shifted by this amount, and the others will be truncated to zero. We call τ in Eq. 4 the threshold function. This piecewise linear function will play an important role in the sequel. Alg. 1 illustrates a naïve O(K log K) algorithm that uses Prop. 1 for evaluating the sparsemax.1

1More elaborate O(K) algorithms exist based on linear-time selection (Blum et al., 1973; Pardalos & Kovoor, 1990).

2.3. Basic Properties

We now highlight some properties that are common to softmax and sparsemax. Let z_(1) := max_k z_k, and denote by A(z) := {k ∈ [K] | z_k = z_(1)} the set of maximal components of z. We define the indicator vector 1_A(z), whose kth component is 1 if k ∈ A(z), and 0 otherwise. We further denote by γ(z) := z_(1) − max_{k∉A(z)} z_k the gap between the maximal components of z and the second largest. We let 0 and 1 be vectors of zeros and ones, respectively.

Proposition 2 The following properties hold for ρ ∈ {softmax, sparsemax}.

1. ρ(0) = 1/K and lim_{ε→0⁺} ρ(ε⁻¹z) = 1_A(z)/|A(z)| (uniform distribution, and distribution peaked on the maximal components of z, respectively). For sparsemax, the last equality holds for any ε ≤ γ(z) · |A(z)|.

2. ρ(z) = ρ(z + c1), for any c ∈ R (i.e., ρ is invariant to adding a constant to each coordinate).

1.0 softmax1 ([t,0]) 1.0 1.0 sparsemax1 ([t,0])

0.8 ,0]) 2 0.8 ,0]) 0.8

2 ,t

1 ,t

1 0.6 0.6 ([t

1 ([t

0.6 1 0.4 0.4

0.4 0.2 softmax 0.2

sparsemax 0.0 0.0 0.2 3 3 2 2 1 1 −3 −3 0.0 −2 0 −2 0 −1 −1 t 2 −1 −1 t 2 0 0 t 1 −2 t 1 −2 −3 −2 −1 0 1 2 3 1 2 1 2 t 3 −3 3 −3

Figure 1. Comparison of softmax and sparsemax in 2D (left) and 3D (two rightmost plots).

3. ρ(Pz) = Pρ(z) for any permutation matrix P (i.e., ρ commutes with permutations).

4. If z_i ≤ z_j, then 0 ≤ ρ_j(z) − ρ_i(z) ≤ η(z_j − z_i), where η = 1/2 for softmax, and η = 1 for sparsemax.

Proof: See App. A.2 in the supplemental material.

Interpreting ε as a "temperature parameter," the first part of Prop. 2 shows that the sparsemax has the same "zero-temperature limit" behaviour as the softmax, but without the need of making the temperature arbitrarily small.

Prop. 2 is reassuring, since it shows that the sparsemax transformation, despite being defined very differently from the softmax, has a similar behaviour and preserves the same invariances. Note that some of these properties are not satisfied by other proposed replacements of the softmax: for example, the spherical softmax (Ollivier, 2013), defined as ρ_i(z) := z_i² / Σ_j z_j², does not satisfy properties 2 and 4.

2.4. Two and Three-Dimensional Cases

For the two-class case, it is well known that the softmax activation becomes the logistic (sigmoid) function. More precisely, if z = (t, 0), then softmax_1(z) = σ(t) := (1 + exp(−t))⁻¹. We next show that the analogous in sparsemax is the "hard" version of the sigmoid. In fact, using Prop. 1, Eq. 4, we have that, for z = (t, 0),

τ(z) = { t − 1, if t > 1;  (t − 1)/2, if −1 ≤ t ≤ 1;  −1, if t < −1 }, (5)

and therefore

sparsemax_1(z) = { 1, if t > 1;  (t + 1)/2, if −1 ≤ t ≤ 1;  0, if t < −1 }. (6)

Fig. 1 provides an illustration for the two and three-dimensional cases. For the latter, we parameterize z = (t_1, t_2, 0) and plot softmax_1(z) and sparsemax_1(z) as a function of t_1 and t_2. We can see that sparsemax is piecewise linear, but asymptotically similar to the softmax.

2.5. Jacobian of Sparsemax

The Jacobian matrix of a transformation ρ, J_ρ(z) := [∂ρ_i(z)/∂z_j]_{i,j}, is of key importance to train models with gradient-based optimization. We next derive the Jacobian of the sparsemax activation, but before doing so, let us recall how the Jacobian of the softmax looks like. We have

∂softmax_i(z)/∂z_j = (δ_ij e^{z_i} Σ_k e^{z_k} − e^{z_i} e^{z_j}) / (Σ_k e^{z_k})² = softmax_i(z)(δ_ij − softmax_j(z)), (7)

where δ_ij is the Kronecker delta, which evaluates to 1 if i = j and 0 otherwise. Letting p = softmax(z), the full Jacobian can be written in matrix notation as

J_softmax(z) = Diag(p) − p pᵀ, (8)

where Diag(p) is a matrix with p in the main diagonal.

Let us now turn to the sparsemax case. The first thing to note is that sparsemax is differentiable everywhere except at splitting points z where the support set S(z) changes, i.e., where S(z) ≠ S(z + εd) for some d and infinitesimal ε.2 From Eq. 3, we have that:

∂sparsemax_i(z)/∂z_j = { δ_ij − ∂τ(z)/∂z_j, if z_i > τ(z);  0, if z_i ≤ τ(z) }. (9)

It remains to compute the gradient of the threshold function τ. From Eq. 4, we have:

∂τ(z)/∂z_j = { 1/|S(z)|, if j ∈ S(z);  0, if j ∉ S(z) }. (10)

Note that j ∈ S(z) ⇔ z_j > τ(z). Therefore we obtain:

∂sparsemax_i(z)/∂z_j = { δ_ij − 1/|S(z)|, if i, j ∈ S(z);  0, otherwise }. (11)

2For those points, we can take an arbitrary matrix in the set of generalized Clarke's Jacobians (Clarke, 1983), the convex hull of all points of the form lim_{t→∞} J_sparsemax(z_t), where z_t → z.

Let s be an indicator vector whose ith entry is 1 if i ∈ S(z), and 0 otherwise. We can write the Jacobian matrix as

J_sparsemax(z) = Diag(s) − s sᵀ/|S(z)|. (12)

It is instructive to compare Eqs. 8 and 12. We may regard the Jacobian of sparsemax as the Laplacian of a graph whose elements of S(z) are fully connected. To compute it, we only need S(z), which can be obtained in O(K) time with the same algorithm that evaluates the sparsemax.

Often, e.g., in the gradient backpropagation algorithm, it is not necessary to compute the full Jacobian matrix, but only the product between the Jacobian and a given vector v. In the softmax case, from Eq. 8, we have:

J_softmax(z) · v = p ⊙ (v − v̄1), with v̄ := Σ_j p_j v_j, (13)

where ⊙ denotes the Hadamard product; this requires a linear-time computation. For the sparsemax case, we have:

J_sparsemax(z) · v = s ⊙ (v − v̂1), with v̂ := (Σ_{j∈S(z)} v_j) / |S(z)|. (14)

Interestingly, if sparsemax(z) has already been evaluated (i.e., in the forward step), then so has S(z), hence the nonzeros of J_sparsemax(z) · v can be computed in only O(|S(z)|) time, which can be sublinear. This can be an important advantage of sparsemax over softmax if K is large.
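To illustrate Eqs. 13–14 (an addition of ours, not from the paper), both Jacobian–vector products can be written in a few lines of NumPy; the function names are hypothetical, and sparsemax_jvp assumes the forward output p is available so that the support needs no recomputation.

    import numpy as np

    def softmax_jvp(p, v):
        """Product J_softmax(z) @ v, given p = softmax(z) (Eq. 13)."""
        v = np.asarray(v, dtype=float)
        v_bar = np.dot(p, v)
        return p * (v - v_bar)

    def sparsemax_jvp(p, v):
        """Product J_sparsemax(z) @ v, given p = sparsemax(z) (Eq. 14)."""
        v = np.asarray(v, dtype=float)
        s = p > 0.0                  # support S(z), known from the forward step
        v_hat = v[s].mean()          # average of v over the support
        out = np.zeros_like(v)
        out[s] = v[s] - v_hat
        return out

Only coordinates in the support receive a nonzero gradient, which is where the potential O(|S(z)|) advantage comes from.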

3. A Loss Function for Sparsemax

Now that we have defined the sparsemax transformation and established its main properties, we show how to use this transformation to design a new loss function that resembles the logistic loss, but can yield sparse posterior distributions. Later (in §4.1–4.2), we apply this loss to label proportion estimation and multi-label classification.

3.1. Logistic Loss

Consider a dataset D := {(x_i, y_i)}_{i=1}^N, where each x_i ∈ R^D is an input vector and each y_i ∈ {1,...,K} is a target output label. We consider regularized empirical risk minimization problems of the form

minimize (λ/2)‖W‖²_F + (1/N) Σ_{i=1}^N L(W x_i + b; y_i),   w.r.t. W ∈ R^{K×D}, b ∈ R^K, (15)

where L is a loss function, W is a matrix of weights, and b is a bias vector. The loss function associated with the softmax is the logistic loss (or negative log-likelihood):

L_softmax(z; k) = −log softmax_k(z) = −z_k + log Σ_j exp(z_j), (16)

where z = W x_i + b, and k = y_i is the "gold" label. The gradient of this loss is, invoking Eq. 7,

∇_z L_softmax(z; k) = −δ_k + softmax(z), (17)

where δ_k denotes the delta distribution on k, [δ_k]_j = 1 if j = k, and 0 otherwise. This is a well-known result; when plugged into a gradient-based optimizer, it leads to updates that move probability mass from the distribution predicted by the current model (i.e., softmax_k(z)) to the gold label (via δ_k). Can we have something similar for sparsemax?

3.2. Sparsemax Loss

A nice aspect of the log-likelihood (Eq. 16) is that, adding up loss terms for several examples, assumed i.i.d., we obtain the log-probability of the full training data. Unfortunately, this idea cannot be carried over to sparsemax: now, some labels may have exactly probability zero, so any model that assigns zero probability to a gold label would zero out the probability of the entire training sample. This is of course highly undesirable. One possible workaround is to define

L^ε_sparsemax(z; k) = −log [(ε + sparsemax_k(z)) / (1 + εK)], (18)

where ε is a small constant, and (ε + sparsemax_k(z))/(1 + εK) is a "perturbed" sparsemax. However, this loss is non-convex, unlike the one in Eq. 16.

Another possibility, which we explore here, is to construct an alternative loss function whose gradient resembles the one in Eq. 17. Note that the gradient is particularly important, since it is directly involved in the model updates for typical optimization algorithms. Formally, we want L_sparsemax to be a differentiable function such that

∇_z L_sparsemax(z; k) = −δ_k + sparsemax(z). (19)

We show below that this property is fulfilled by the following function, henceforth called the sparsemax loss:

L_sparsemax(z; k) = −z_k + (1/2) Σ_{j∈S(z)} (z_j² − τ²(z)) + 1/2, (20)

where τ² is the square of the threshold function in Eq. 4. This loss, which has never been considered in the literature to the best of our knowledge, has a number of interesting properties, stated in the next proposition.

Proposition 3 The following holds:

1. L_sparsemax is differentiable everywhere, and its gradient is given by the expression in Eq. 19.

2. L_sparsemax is convex.

3. Lsparsemax(z + c1; k) = Lsparsemax(z; k), ∀c ∈ R.

4. Lsparsemax(z; k) ≥ 0, for all z and k.

5. The following statements are all equivalent: (i) L_sparsemax(z; k) = 0; (ii) sparsemax(z) = δ_k; (iii) margin separation holds, z_k ≥ 1 + max_{j≠k} z_j.
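Before turning to the proof, here is a small numerical sketch of Eqs. 19–20 (ours, not from the paper); it reuses the hypothetical sparsemax helper from the sketch after Algorithm 1 and checks property 1 by finite differences at a point where the support is stable.

    import numpy as np

    def sparsemax_loss(z, k):
        """Sparsemax loss L_sparsemax(z; k) of Eq. 20."""
        p = sparsemax(z)
        support = p > 0.0
        tau = (z[support].sum() - 1.0) / support.sum()   # Eq. 4
        return -z[k] + 0.5 * np.sum(z[support] ** 2 - tau ** 2) + 0.5

    def sparsemax_loss_grad(z, k):
        """Gradient of Eq. 19: -delta_k + sparsemax(z)."""
        g = sparsemax(z)
        g[k] -= 1.0
        return g

    # Finite-difference check of Prop. 3, item 1 (support stays fixed near this point).
    z = np.array([0.7, 0.2, -0.5]); k = 0; eps = 1e-6
    num = np.array([(sparsemax_loss(z + eps * e, k) - sparsemax_loss(z - eps * e, k)) / (2 * eps)
                    for e in np.eye(3)])
    print(np.allclose(num, sparsemax_loss_grad(z, k), atol=1e-5))   # True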

Proof: See App. A.3 in the supplemental material.

Note that the first four properties in Prop. 3 are also satisfied by the logistic loss, except that the gradient is given by Eq. 17. The fifth property is particularly interesting, since it is satisfied by the hinge loss of support vector machines. However, unlike the hinge loss, L_sparsemax is everywhere differentiable, hence amenable to smooth optimization methods such as L-BFGS or accelerated gradient descent (Liu & Nocedal, 1989; Nesterov, 1983).

Figure 2. Comparison between the sparsemax loss and other commonly used losses for binary classification.

3.3. Relation to the Huber Loss

Coincidentally, as we next show, the sparsemax loss in the binary case reduces to the Huber classification loss, an important loss function in robust statistics (Huber, 1964).

Let us note first that, from Eq. 20, we have, if |S(z)| = 1,

L_sparsemax(z; k) = −z_k + z_(1), (21)

and, if |S(z)| = 2,

L_sparsemax(z; k) = −z_k + (1 + (z_(1) − z_(2))²)/4 + (z_(1) + z_(2))/2, (22)

where z_(1) ≥ z_(2) ≥ ... are the sorted components of z. Note that the second expression, when z_(1) − z_(2) = 1, equals the first one, which asserts the continuity of the loss even though |S(z)| is non-continuous on z.

In the two-class case, we have |S(z)| = 1 if z_(1) ≥ 1 + z_(2) (unit margin separation), and |S(z)| = 2 otherwise. Assume without loss of generality that the correct label is k = 1, and define t = z_1 − z_2. From Eqs. 21–22, we have

L_sparsemax(t) = { 0, if t ≥ 1;  −t, if t ≤ −1;  (t − 1)²/4, if −1 < t < 1 }, (23)

whose graph is shown in Fig. 2. This loss is a variant of the Huber loss adapted for classification, and has been called "modified Huber loss" by Zhang (2004); Zou et al. (2006).

3.4. Generalization to Multi-Label Classification

We end this section by showing a generalization of the loss functions in Eqs. 16 and 20 to multi-label classification, i.e., problems in which the target is a non-empty set of labels Y ∈ 2^[K] \ {∅} rather than a single label.3 Such problems have attracted recent interest (Zhang & Zhou, 2014).

3Not to be confused with "multi-class classification," which denotes problems where Y = [K] and K > 2.

More generally, we consider the problem of estimating sparse label proportions, where the target is a probability distribution q ∈ Δ^{K−1}, such that Y = {k | q_k > 0}. We assume a training dataset D := {(x_i, q_i)}_{i=1}^N, where each x_i ∈ R^D is an input vector and each q_i ∈ Δ^{K−1} is a target distribution over outputs, assumed sparse.4 This subsumes single-label classification, where all q_i are delta distributions concentrated on a single class. The generalization of the multinomial logistic loss to this setting is

L_softmax(z; q) = KL(q ‖ softmax(z)) = −H(q) − qᵀz + log Σ_j exp(z_j), (24)

where KL(·‖·) and H(·) denote the Kullback-Leibler divergence and the Shannon entropy, respectively. Note that, up to a constant, this loss is equivalent to standard logistic regression with soft labels. The gradient of this loss is

∇_z L_softmax(z; q) = −q + softmax(z). (25)

The corresponding generalization in the sparsemax case is:

L_sparsemax(z; q) = −qᵀz + (1/2) Σ_{j∈S(z)} (z_j² − τ²(z)) + (1/2)‖q‖², (26)

which satisfies the properties in Prop. 3 and has gradient

∇_z L_sparsemax(z; q) = −q + sparsemax(z). (27)

We make use of these losses in our experiments (§4.1–4.2).

4This scenario is also relevant for "learning with a probabilistic teacher" (Agrawala, 1970) and semi-supervised learning (Chapelle et al., 2006), as it can model label uncertainty.
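Assuming the same hypothetical helpers as before (our own sketch, not from the paper), the soft-target variants in Eqs. 26–27 differ from the single-label sketch only in replacing the delta distribution by q; when q is a one-hot vector this reduces to Eqs. 19–20.

    def sparsemax_loss_q(z, q):
        """Soft-target sparsemax loss of Eq. 26; q is a (possibly sparse) distribution."""
        p = sparsemax(z)
        support = p > 0.0
        tau = (z[support].sum() - 1.0) / support.sum()
        return -np.dot(q, z) + 0.5 * np.sum(z[support] ** 2 - tau ** 2) + 0.5 * np.dot(q, q)

    def sparsemax_loss_q_grad(z, q):
        """Gradient of Eq. 27: -q + sparsemax(z)."""
        return sparsemax(z) - q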

4. Experiments

We next evaluate empirically the ability of sparsemax for addressing two classes of problems:

1. Label proportion estimation and multi-label classification, via the sparsemax loss in Eq. 26 (§4.1–4.2).

2. Attention-based neural networks, via the sparsemax transformation of Eq. 2 (§4.3).

Table 1. Statistics for the 5 multi-label classification datasets.

DATASET    DESCR.   #LABELS   #TRAIN   #TEST
SCENE      IMAGES   6         1211     1196
EMOTIONS   MUSIC    6         393      202
BIRDS      AUDIO    19        323      322
CAL500     MUSIC    174       400      100
REUTERS    TEXT     103       23,149   781,265

4.1. Label Proportion Estimation

We show simulation results for sparse label proportion estimation on synthetic data. Since sparsemax can predict sparse distributions, we expect its superiority in this task. We generated datasets with 1,200 training and 1,000 test examples. Each example emulates a "multi-labeled document": a variable-length sequence of word symbols, assigned to multiple topics (labels). We pick the number of labels N ∈ {1,...,K} by sampling from a Poisson distribution with rejection sampling, and draw the N labels from a multinomial. Then, we pick a document length from a Poisson, and repeatedly sample its words from the mixture of the N label-specific multinomials. We experimented with two settings: uniform mixtures (q_{k_n} = 1/N for the N active labels k_1, ..., k_N) and random mixtures (whose label proportions q_{k_n} were drawn from a flat Dirichlet).5

5Note that, with uniform mixtures, the problem becomes essentially multi-label classification.

We set the vocabulary size to be equal to the number of labels K ∈ {10, 50}, and varied the average document length between 200 and 2,000 words. We trained models by optimizing Eq. 15 with L ∈ {L_softmax, L_sparsemax} (Eqs. 24 and 26). We picked the regularization constant λ ∈ {10^j}_{j=−9}^{0} with 5-fold cross-validation.

Results are shown in Fig. 3. We report the mean squared error (average of ‖q − p‖² on the test set, where q and p are respectively the target and predicted label posteriors) and the Jensen-Shannon divergence (average of JS(q, p) := ½ KL(q ‖ (p+q)/2) + ½ KL(p ‖ (p+q)/2)).6 We observe that the two losses perform similarly for small document lengths (where the signal is weaker), but as the average document length exceeds 400, the sparsemax loss starts outperforming the logistic loss consistently. This is because with a stronger signal the sparsemax estimator manages to identify correctly the support of the label proportions q, contributing to reduce both the mean squared error and the JS divergence. This occurs both for uniform and random mixtures.

6Note that the KL divergence is not an appropriate metric here, since the sparsity of q and p could lead to −∞ values.

4.2. Multi-Label Classification on Benchmark Datasets

Next, we ran experiments in five benchmark multi-label classification datasets: the four small-scale datasets used by Koyejo et al. (2015),7 and the much larger Reuters RCV1 v2 dataset of Lewis et al. (2004).8 For all datasets, we removed examples without labels (i.e., where Y = ∅). For all but the Reuters dataset, we normalized the features to have zero mean and unit variance. Statistics for these datasets are presented in Table 1.

7Obtained from http://mulan.sourceforge.net/datasets-mlc.html.
8Obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html.

Recent work has investigated the consistency of multi-label classifiers for various micro and macro-averaged metrics (Gao & Zhou, 2013; Koyejo et al., 2015), among which is a plug-in classifier that trains independent binary logistic regressors on each label, and then tunes a probability threshold δ ∈ [0, 1] on validation data. At test time, those labels whose posteriors are above the threshold are predicted to be "on." We used this procedure (called LOGISTIC) as a baseline for comparison. Our second baseline (SOFTMAX) is a multinomial logistic regressor, using the loss function in Eq. 24, where the target distribution q is set to uniform over the active labels. A similar probability threshold p_0 is used for prediction, above which a label is predicted to be "on." We compare these two systems with the sparsemax loss function of Eq. 26. We found it beneficial to scale the label scores z by a constant t ≥ 1 at test time, before applying the sparsemax transformation, to make the resulting distribution p = sparsemax(tz) more sparse. We then predict the kth label to be "on" if p_k ≠ 0.
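For concreteness (our own sketch, not part of the paper), the test-time rule just described is a few lines; W, b, and the default value of the scaling coefficient t are placeholders, with t tuned as described in the text.

    def predict_labels(W, b, x, t=2.0):
        """Multi-label prediction with sparsemax: labels with nonzero probability are 'on'."""
        z = W.dot(x) + b             # label scores, as in Eq. 15
        p = sparsemax(t * z)         # scaling by t >= 1 sharpens the distribution
        return np.nonzero(p)[0]      # indices k with p_k != 0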

Figure 3. Simulation results for the estimation of label posteriors, for uniform (top) and random mixtures (bottom). Shown are the mean squared error and the Jensen-Shannon divergence as a function of the document length, for the logistic and the sparsemax estimators.

We optimized the three losses with L-BFGS (for a maximum of 100 epochs), tuning the hyperparameters in a held-out validation set (for the Reuters dataset) and with 5-fold cross-validation (for the other four datasets). The hyperparameters are the regularization constant λ ∈ {10^j}_{j=−8}^{2}, the probability thresholds δ ∈ {0.05 × n}_{n=1}^{10} for LOGISTIC and p_0 ∈ {n/K}_{n=1}^{10} for SOFTMAX, and the coefficient t ∈ {0.5 × n}_{n=1}^{10} for SPARSEMAX.

Results are shown in Table 2. Overall, the performances of the three losses are all very similar, with a slight advantage of SPARSEMAX, which attained the highest results in 4 out of 10 experiments, while LOGISTIC and SOFTMAX won 3 times each. In particular, sparsemax appears better suited for problems with larger numbers of labels.

Table 2. Micro (left) and macro-averaged (right) F1 scores for the logistic, softmax, and sparsemax losses on benchmark datasets.

DATASET    LOGISTIC        SOFTMAX         SPARSEMAX
SCENE      70.96 / 72.95   74.01 / 75.03   73.45 / 74.57
EMOTIONS   66.75 / 68.56   67.34 / 67.51   66.38 / 66.07
BIRDS      45.78 / 33.77   48.67 / 37.06   49.44 / 39.13
CAL500     48.88 / 24.49   47.46 / 23.51   48.47 / 26.20
REUTERS    81.19 / 60.02   79.47 / 56.30   80.00 / 61.27

4.3. Neural Networks with Attention Mechanisms

We now assess the suitability of the sparsemax transformation to construct a "sparse" neural attention mechanism. We ran experiments on the task of natural language inference, using the recently released SNLI 1.0 corpus (Bowman et al., 2015), a collection of 570,000 human-written English sentence pairs. Each pair consists of a premise and an hypothesis, manually labeled with one of the labels ENTAILMENT, CONTRADICTION, or NEUTRAL. We used the provided training, development, and test splits.

The architecture of our system, shown in Fig. 4, is the same as the one proposed by Rocktäschel et al. (2015). We compare the performance of four systems: NOATTENTION, a (gated) RNN-based system similar to Bowman et al. (2015); LOGISTICATTENTION, an attention-based system with independent logistic activations; SOFTATTENTION, a near-reproduction of Rocktäschel et al. (2015)'s attention-based system; and SPARSEATTENTION, which replaces the latter softmax-activated attention mechanism by a sparsemax activation.

Figure 4. Network diagram for the NL inference problem. The premise and hypothesis are both fed into (gated) RNNs. The NOATTENTION system replaces the attention part (in green) by a direct connection from the last premise state to the output (dashed violet line). The LOGISTICATTENTION, SOFTATTENTION and SPARSEATTENTION systems have respectively independent logistics, a softmax, and a sparsemax-activated attention mechanism. In this example, L = 5 and N = 9.

We represent the words in the premise and in the hypothesis with 300-dimensional GloVe vectors (Pennington et al., 2014), not optimized during training, which we linearly project onto a D-dimensional subspace (Astudillo et al., 2015).9 We denote by x_1, ..., x_L and x_{L+1}, ..., x_N, respectively, the projected premise and hypothesis word vectors. These sequences are then fed into two recurrent networks (one for each). Instead of long short-term memories, as Rocktäschel et al. (2015), we used gated recurrent units (GRUs, Cho et al. 2014), which behave similarly but have fewer parameters. Our premise GRU generates a state sequence H_{1:L} := [h_1 ... h_L] ∈ R^{D×L} as follows:

z_t = σ(W^{xz} x_t + W^{hz} h_{t−1} + b^z), (28)
r_t = σ(W^{xr} x_t + W^{hr} h_{t−1} + b^r), (29)
h̄_t = tanh(W^{xh} x_t + W^{hh} (r_t ⊙ h_{t−1}) + b^h), (30)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̄_t, (31)

with model parameters W^{xz,xr,xh,hz,hr,hh} ∈ R^{D×D} and b^{z,r,h} ∈ R^D.

9We used GloVe-840B embeddings trained on Common Crawl (http://nlp.stanford.edu/projects/glove/).
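For reference (our own sketch, not from the paper), Eqs. 28–31 correspond to the following NumPy step; the parameter dictionary keys mirror the superscripts in the equations and sigmoid denotes σ.

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, h_prev, P):
        """One GRU update (Eqs. 28-31); P is a dict holding the W and b parameters."""
        z_t = sigmoid(P["Wxz"].dot(x_t) + P["Whz"].dot(h_prev) + P["bz"])           # Eq. 28
        r_t = sigmoid(P["Wxr"].dot(x_t) + P["Whr"].dot(h_prev) + P["br"])           # Eq. 29
        h_bar = np.tanh(P["Wxh"].dot(x_t) + P["Whh"].dot(r_t * h_prev) + P["bh"])   # Eq. 30
        return (1.0 - z_t) * h_prev + z_t * h_bar                                   # Eq. 31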

Likewise, our hypothesis GRU (with distinct parameters) generates a state sequence [h_{L+1}, ..., h_N], being initialized with the last state from the premise (h_L). The NOATTENTION system then computes the final state u based on the last states from the premise and the hypothesis as follows:

u = tanh(W^{pu} h_L + W^{hu} h_N + b^u), (32)

where W^{pu}, W^{hu} ∈ R^{D×D} and b^u ∈ R^D. Finally, it predicts a label ŷ from u with a standard softmax layer. The SOFTATTENTION system, instead of using the last premise state h_L, computes a weighted average of premise words with an attention mechanism, replacing Eq. 32 by

z_t = vᵀ tanh(W^{pm} h_t + W^{hm} h_N + b^m), (33)
p = softmax(z), where z := (z_1, ..., z_L), (34)
r = H_{1:L} p, (35)
u = tanh(W^{pu} r + W^{hu} h_N + b^u), (36)

where W^{pm}, W^{hm} ∈ R^{D×D} and b^m, v ∈ R^D. The LOGISTICATTENTION system, instead of Eq. 34, computes p = (σ(z_1), ..., σ(z_L)). Finally, the SPARSEATTENTION system replaces Eq. 34 by p = sparsemax(z).
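Switching between SOFTATTENTION and SPARSEATTENTION is a one-line change, as the following sketch of Eqs. 33–36 shows (our own code, reusing the hypothetical sparsemax helper; a softmax function with the same signature would give the soft variant).

    def attention_state(H_premise, h_N, P, activation=sparsemax):
        """Attention over premise states (Eqs. 33-36); H_premise has shape D x L."""
        # Eq. 33: one score per premise position t = 1..L
        scores = np.array([P["v"].dot(np.tanh(P["Wpm"].dot(h_t) + P["Whm"].dot(h_N) + P["bm"]))
                           for h_t in H_premise.T])
        p = activation(scores)                   # Eq. 34: sparsemax may zero out positions
        r = H_premise.dot(p)                     # Eq. 35: weighted average of premise states
        return np.tanh(P["Wpu"].dot(r) + P["Whu"].dot(h_N) + P["bu"])   # Eq. 36

Because sparsemax can return exact zeros, only a few premise positions contribute to r, which is what produces the compact attention patterns discussed below.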

We optimized all the systems with Adam (Kingma & Ba, 2014), using the default parameters β_1 = 0.9, β_2 = 0.999, and ε = 10⁻⁸, and setting the learning rate to 3 × 10⁻⁴. We tuned an ℓ2-regularization coefficient in {0, 10⁻⁴, 3 × 10⁻⁴, 10⁻³} and, as Rocktäschel et al. (2015), a dropout probability of 0.1 in the inputs and outputs of the network.

The results are shown in Table 3. We observe that the soft and sparse-activated attention systems perform similarly, the latter being slightly more accurate on the test set, and that both outperform the NOATTENTION and LOGISTICATTENTION systems.10

Table 3. Accuracies for the natural language inference task. Shown are our implementations of a system without attention, and with logistic, soft, and sparse attentions.

                     DEV ACC.   TEST ACC.
NOATTENTION          81.84      80.99
LOGISTICATTENTION    82.11      80.84
SOFTATTENTION        82.86      82.08
SPARSEATTENTION      82.52      82.20

10Rocktäschel et al. (2015) report scores slightly above ours: they reached a test accuracy of 82.3% for their implementation of SOFTATTENTION, and 83.5% with their best system, a more elaborate word-by-word attention model. Differences in the former case may be due to distinct word vectors and the use of LSTMs instead of GRUs.

Table 4 shows examples of sentence pairs, highlighting the premise words selected by the SPARSEATTENTION mechanism. We can see that, for all examples, only a small number of words are selected, which are key to making the final decision. Compared to a softmax-activated mechanism, which provides a dense distribution over all the words, the sparsemax activation yields a compact and more interpretable selection, which can be particularly useful in long sentences such as the one in the bottom row.

Table 4. Examples of sparse attention for the natural language inference task. Nonzero attention coefficients are marked in bold. Our system classified all four examples correctly. The examples were picked from Rocktäschel et al. (2015).

A boy rides on a camel in a crowded area while talking on his cellphone.
Hypothesis: A boy is riding an animal. [entailment]

A young girl wearing a pink coat plays with a yellow toy golf club.
Hypothesis: A girl is wearing a blue jacket. [contradiction]

Two black dogs are frolicking around the grass together.
Hypothesis: Two dogs swim in the lake. [contradiction]

A man wearing a yellow striped shirt laughs while seated next to another man who is wearing a light blue shirt and clasping his hands together.
Hypothesis: Two mimes sit in complete silence. [contradiction]

5. Conclusions

We introduced the sparsemax transformation, which has similar properties to the traditional softmax, but is able to output sparse probability distributions. We derived a closed-form expression for its Jacobian, needed for the backpropagation algorithm, and we proposed a novel "sparsemax loss" function, a sparse analogue of the logistic loss, which is smooth and convex. Empirical results in multi-label classification and in attention networks for natural language inference attest the validity of our approach.

The connection between sparse modeling and interpretability is key in signal processing (Hastie et al., 2015). Our approach is distinctive: it is not the model that is assumed sparse, but the label posteriors that the model parametrizes. Sparsity is also a desirable (and biologically plausible) property in neural networks, present in rectified units (Glorot et al., 2011) and maxout nets (Goodfellow et al., 2013).

There are several avenues for future research. The ability of sparsemax-activated attention to select only a few variables to attend makes it potentially relevant to neural architectures with random access memory (Graves et al., 2014; Grefenstette et al., 2015; Sukhbaatar et al., 2015), since it offers a compromise between soft and hard operations, maintaining differentiability. In fact, "harder" forms of attention are often useful, arising as word alignments in machine translation pipelines, or latent variables as in Xu et al. (2015). Sparsemax is also appealing for hierarchical attention: if we define a top-down product of distributions along the hierarchy, the sparse distributions produced by sparsemax will automatically prune the hierarchy, leading to computational savings. A possible disadvantage of sparsemax over softmax is that it seems less GPU-friendly, since it requires sort operations or linear-selection algorithms. There is, however, recent work providing efficient implementations of these algorithms on GPUs (Alabi et al., 2012).
Acknowledgements

We would like to thank Tim Rocktäschel for answering various implementation questions, and Mário Figueiredo and Chris Dyer for helpful comments on a draft of this report. This work was partially supported by Fundação para a Ciência e Tecnologia (FCT), through contracts UID/EEA/50008/2013 and UID/CEC/50021/2013.

References

Agrawala, Ashok K. Learning with a Probabilistic Teacher. IEEE Transactions on Information Theory, 16(4):373–379, 1970.

Alabi, Tolu, Blanchard, Jeffrey D, Gordon, Bradley, and Steinbach, Russel. Fast k-Selection Algorithms for Graphics Processing Units. Journal of Experimental Algorithmics (JEA), 17:4–2, 2012.

Albert, James H and Chib, Siddhartha. Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association, 88(422):669–679, 1993.

Astudillo, Ramon F, Amir, Silvio, Lin, Wang, Silva, Mário, and Trancoso, Isabel. Learning Word Representations from Scarce and Noisy Data with Embedding Subspaces. In Proc. of the Association for Computational Linguistics (ACL), Beijing, China, 2015.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations, 2015.

Blum, Manuel, Floyd, Robert W, Pratt, Vaughan, Rivest, Ronald L, and Tarjan, Robert E. Time Bounds for Selection. Journal of Computer and System Sciences, 7(4):448–461, 1973.

Bouchard, Guillaume. Efficient Bounds for the Softmax Function and Applications to Approximate Inference in Hybrid Models. In NIPS Workshop for Approximate Bayesian Inference in Continuous/Hybrid Systems, 2007.

Bowman, Samuel R, Angeli, Gabor, Potts, Christopher, and Manning, Christopher D. A Large Annotated Corpus for Learning Natural Language Inference. In Proc. of Empirical Methods in Natural Language Processing, 2015.

Bradley, Ralph Allan and Terry, Milton E. Rank Analysis of Incomplete Block Designs: The Method of Paired Comparisons. Biometrika, 39(3-4):324–345, 1952.

Bridle, John S. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In Neurocomputing, pp. 227–236. Springer, 1990.

Chapelle, Olivier, Schölkopf, Bernhard, and Zien, Alexander. Semi-Supervised Learning. MIT Press Cambridge, 2006.

Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proc. of Empirical Methods in Natural Language Processing, 2014.

Chorowski, Jan K, Bahdanau, Dzmitry, Serdyuk, Dmitriy, Cho, Kyunghyun, and Bengio, Yoshua. Attention-based Models for Speech Recognition. In Advances in Neural Information Processing Systems, pp. 577–585, 2015.

Clarke, Frank H. Optimization and Nonsmooth Analysis. New York, Wiley, 1983.

de Brébisson, Alexandre and Vincent, Pascal. An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family. arXiv preprint arXiv:1511.05042, 2015.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient Projections onto the L1-Ball for Learning in High Dimensions. In Proc. of International Conference of Machine Learning, 2008.

Gao, Wei and Zhou, Zhi-Hua. On the Consistency of Multi-Label Learning. Artificial Intelligence, 199:22–44, 2013.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep Sparse Rectifier Neural Networks. In International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. Book in preparation for MIT Press, 2016. URL http://goodfeli.github.io/dlbook/.

Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout Networks. In Proc. of International Conference on Machine Learning, 2013.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to Transduce with Unbounded Memory. In Advances in Neural Information Processing Systems, pp. 1819–1827, 2015.

Hastie, Trevor, Tibshirani, Robert, and Wainwright, Martin. Statistical Learning with Sparsity: the Lasso and Generalizations. CRC Press, 2015.

Hermann, Karl Moritz, Kocisky, Tomas, Grefenstette, Edward, Espeholt, Lasse, Kay, Will, Suleyman, Mustafa, and Blunsom, Phil. Teaching Machines to Read and Comprehend. In Advances in Neural Information Processing Systems, pp. 1684–1692, 2015.

Huber, Peter J. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. In Proc. of International Conference on Learning Representations, 2014.

Koyejo, Sanmi, Natarajan, Nagarajan, Ravikumar, Pradeep K, and Dhillon, Inderjit S. Consistent Multilabel Classification. In Advances in Neural Information Processing Systems, pp. 3303–3311, 2015.

Lewis, David D, Yang, Yiming, Rose, Tony G, and Li, Fan. RCV1: A New Benchmark Collection for Text Categorization Research. The Journal of Machine Learning Research, 5:361–397, 2004.

Liu, Dong C and Nocedal, Jorge. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 45(1-3):503–528, 1989.

McCullagh, Peter and Nelder, John A. Generalized Linear Models, volume 37. CRC Press, 1989.

Menke, Joshua E and Martinez, Tony R. A Bradley–Terry Artificial Neural Network Model for Individual Ratings in Group Competitions. Neural Computing and Applications, 17(2):175–186, 2008.

Michelot, C. A Finite Algorithm for Finding the Projection of a Point onto the Canonical Simplex of R^n. Journal of Optimization Theory and Applications, 50(1):195–200, 1986.

Nesterov, Y. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k²). Soviet Math. Doklady, 27:372–376, 1983.

Ollivier, Yann. Riemannian Metrics for Neural Networks. arXiv preprint arXiv:1303.0818, 2013.

Pardalos, Panos M. and Kovoor, Naina. An Algorithm for a Singly Constrained Class of Quadratic Programs Subject to Upper and Lower Bounds. Mathematical Programming, 46(1):321–328, 1990.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global Vectors for Word Representation. Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014), 12:1532–1543, 2014.

Penot, Jean-Paul. Conciliating Generalized Derivatives. In Demyanov, Vladimir F., Pardalos, Panos M., and Batsyn, Mikhail (eds.), Constructive Nonsmooth Analysis and Related Topics, pp. 217–230. Springer, 2014.

Rocktäschel, Tim, Grefenstette, Edward, Hermann, Karl Moritz, Kočiský, Tomáš, and Blunsom, Phil. Reasoning about Entailment with Neural Attention. arXiv preprint arXiv:1509.06664, 2015.

Rush, Alexander M, Chopra, Sumit, and Weston, Jason. A Neural Attention Model for Abstractive Sentence Summarization. Proc. of Empirical Methods in Natural Language Processing, 2015.

Sukhbaatar, Sainbayar, Szlam, Arthur, Weston, Jason, and Fergus, Rob. End-to-End Memory Networks. In Advances in Neural Information Processing Systems, pp. 2431–2439, 2015.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press Cambridge, 1998.

Vincent, Pascal. Efficient Exact Gradient Update for Training Deep Networks with Very Large Sparse Targets. In Advances in Neural Information Processing Systems, 2015.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Courville, Aaron, Salakhutdinov, Ruslan, Zemel, Richard, and Bengio, Yoshua. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning, 2015.

Zadrozny, Bianca. Reducing Multiclass to Binary by Coupling Probability Estimates. In Advances in Neural Information Processing Systems, pp. 1041–1048, 2001.

Zhang, Min-Ling and Zhou, Zhi-Hua. A Review on Multi-Label Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.

Zhang, Tong. Statistical Behavior and Consistency of Classification Methods Based on Convex Risk Minimization. Annals of Statistics, pp. 56–85, 2004.

Zou, Hui, Zhu, Ji, and Hastie, Trevor. The Margin Vector, Admissible Loss and Multi-class Margin-Based Classifiers. Technical report, Stanford University, 2006.

A. Supplementary Material

A.1. Proof of Prop. 1

The Lagrangian of the optimization problem in Eq. 2 is:

L(z, μ, τ) = (1/2)‖p − z‖² − μᵀp + τ(1ᵀp − 1). (37)

The optimal (p*, μ*, τ*) must satisfy the following Karush-Kuhn-Tucker conditions:

p* − z − μ* + τ*1 = 0, (38)
1ᵀp* = 1, p* ≥ 0, μ* ≥ 0, (39)
μ*_i p*_i = 0, ∀i ∈ [K]. (40)

If for i ∈ [K] we have p*_i > 0, then from Eq. 40 we must have μ*_i = 0, which from Eq. 38 implies p*_i = z_i − τ*. Let S(z) = {j ∈ [K] | p*_j > 0}. From Eq. 39 we obtain Σ_{j∈S(z)} (z_j − τ*) = 1, which yields the right hand side of Eq. 4. Again from Eq. 40, we have that μ*_i > 0 implies p*_i = 0, which from Eq. 38 implies μ*_i = τ* − z_i ≥ 0, i.e., z_i ≤ τ* for i ∉ S(z). Therefore we have that k(z) = |S(z)|, which proves the first equality of Eq. 4.

A.2. Proof of Prop. 2

We start with the third property, which follows from the coordinate-symmetry in the definitions in Eqs. 1–2. The same argument can be used to prove the first part of the first property (uniform distribution). Let us turn to the second part of the first property (peaked distribution on the maximal components of z), and define t = ε⁻¹. For the softmax case, this follows from

lim_{t→+∞} e^{t z_i} / Σ_k e^{t z_k} = lim_{t→+∞} e^{t z_i} / Σ_{k∈A(z)} e^{t z_k} = lim_{t→+∞} e^{t(z_i − z_(1))} / |A(z)| = { 1/|A(z)|, if i ∈ A(z);  0, otherwise }. (41)

For the sparsemax case, we invoke Eq. 4 and the fact that k(tz) = |A(z)| if γ(tz) ≥ 1/|A(z)|. Since γ(tz) = tγ(z), the result follows.

The second property holds for softmax, since e^{z_i + c} / Σ_k e^{z_k + c} = e^{z_i} / Σ_k e^{z_k}; and for sparsemax, since for any p ∈ Δ^{K−1} we have ‖p − z − c1‖² = ‖p − z‖² − 2c 1ᵀ(p − z) + ‖c1‖², which equals ‖p − z‖² plus a constant (because 1ᵀp = 1).

Finally, let us turn to the fourth property. The first inequality states that z_i ≤ z_j ⇒ ρ_i(z) ≤ ρ_j(z) (i.e., coordinate monotonicity). For the softmax case, this follows trivially from the fact that the exponential function is increasing. For the sparsemax, we use a proof by contradiction. Suppose z_i ≤ z_j and sparsemax_i(z) > sparsemax_j(z). From the definition in Eq. 2, we must have ‖p − z‖² ≥ ‖sparsemax(z) − z‖², for any p ∈ Δ^{K−1}. This leads to a contradiction if we choose p_k = sparsemax_k(z) for k ∉ {i, j}, p_i = sparsemax_j(z), and p_j = sparsemax_i(z). To prove the second inequality in the fourth property for softmax, we need to show that, with z_i ≤ z_j, we have (e^{z_j} − e^{z_i}) / Σ_k e^{z_k} ≤ (z_j − z_i)/2. Since Σ_k e^{z_k} ≥ e^{z_j} + e^{z_i}, it suffices to consider the binary case, i.e., we need to prove that tanh((z_j − z_i)/2) = (e^{z_j} − e^{z_i}) / (e^{z_j} + e^{z_i}) ≤ (z_j − z_i)/2, that is, tanh(t) ≤ t for t ≥ 0. This comes from tanh(0) = 0 and tanh′(t) = 1 − tanh²(t) ≤ 1. For sparsemax, given two coordinates i, j, three things can happen: (i) both are thresholded, in which case ρ_j(z) − ρ_i(z) = z_j − z_i; (ii) the smaller (z_i) is truncated, in which case ρ_j(z) − ρ_i(z) = z_j − τ(z) ≤ z_j − z_i; (iii) both are truncated, in which case ρ_j(z) − ρ_i(z) = 0 ≤ z_j − z_i.

A.3. Proof of Prop. 3

To prove the first claim, note that, for j ∈ S(z),

∂τ²(z)/∂z_j = 2τ(z) ∂τ(z)/∂z_j = 2τ(z)/|S(z)|, (42)

where we used Eq. 10. We then have

∂L_sparsemax(z; k)/∂z_j = { −δ_k(j) + z_j − τ(z), if j ∈ S(z);  −δ_k(j), otherwise }.

That is, ∇_z L_sparsemax(z; k) = −δ_k + sparsemax(z).

To prove the second statement, from the expression for the Jacobian in Eq. 11, we have that the Hessian of L_sparsemax (strictly speaking, a "sub-Hessian" (Penot, 2014), since the loss is not twice-differentiable everywhere) is given by

∂²L_sparsemax(z; k)/∂z_i ∂z_j = { δ_ij − 1/|S(z)|, if i, j ∈ S(z);  0, otherwise }. (43)

This Hessian can be written in the form Id − 11ᵀ/|S(z)| up to padding zeros (for the coordinates not in S(z)); hence it is positive semi-definite (with rank |S(z)| − 1), which establishes the convexity of L_sparsemax.

For the third claim, we have L_sparsemax(z + c1; k) = −z_k − c + (1/2) Σ_{j∈S(z)} (z_j² − τ²(z) + 2c(z_j − τ(z))) + 1/2 = −z_k − c + (1/2) Σ_{j∈S(z)} (z_j² − τ²(z) + 2c p_j) + 1/2 = L_sparsemax(z; k), since Σ_{j∈S(z)} p_j = 1.

From the first two claims, we have that the minima of L_sparsemax have zero gradient, i.e., satisfy the equation sparsemax(z) = δ_k. Furthermore, from Prop. 2, we have that the sparsemax never increases the distance between two coordinates, i.e., sparsemax_k(z) − sparsemax_j(z) ≤ z_k − z_j. Therefore sparsemax(z) = δ_k implies z_k ≥ 1 + max_{j≠k} z_j. To prove the converse statement, note that the distance above can only be decreased if the smallest coordinate is truncated to zero. This establishes the equivalence between (ii) and (iii) in the fifth claim. Finally, we have that the minimum loss value is achieved when S(z) = {k}, in which case τ(z) = z_k − 1, leading to

L_sparsemax(z; k) = −z_k + (1/2)(z_k² − (z_k − 1)²) + 1/2 = 0. (44)

This proves the equivalence with (i) and also the fourth claim.