Joint Embedding of Words and Labels for Text Classification

Guoyin Wang, Chunyuan Li,* Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, Lawrence Carin
Duke University
{gw60,cl319,ww107,yz196,ds337,xz139,r.henao,lcarin}@duke.edu
*Corresponding author

Abstract

Word embeddings are effective intermediate representations for capturing semantic regularities between words, when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space with the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information, in addition to input text sequences. Extensive results on several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.

1 Introduction

Text classification is a fundamental problem in natural language processing (NLP). The task is to annotate a given text sequence with one (or multiple) class label(s) describing its textual content. A key intermediate step is the text representation. Traditional methods represent text with hand-crafted features, such as sparse lexical features (e.g., n-grams) (Wang and Manning, 2012). Recently, neural models have been employed to learn text representations, including convolutional neural networks (CNNs) (Kalchbrenner et al., 2014; Zhang et al., 2017b; Shen et al., 2017) and recurrent neural networks (RNNs) based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Wang et al., 2018).

To further increase the representation flexibility of such models, attention mechanisms (Bahdanau et al., 2015) have been introduced as an integral part of models employed for text classification (Yang et al., 2016). The attention module is trained to capture the dependencies that make significant contributions to the task, regardless of the distance between the elements in the sequence. It can thus provide complementary information to the distance-aware dependencies modeled by RNNs/CNNs. The increased representation power of the attention mechanism, however, comes with increased model complexity.

Alternatively, several recent studies show that the success of deep learning on text classification largely depends on the effectiveness of the word embeddings (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). In particular, Shen et al. (2018a) quantitatively show that word-embedding-based text classification tasks have a similar level of difficulty regardless of the employed models, using the concept of intrinsic dimension (Li et al., 2018); thus, simple models are preferred. As the basic building blocks in neural-based NLP, word embeddings capture the similarities/regularities between words (Mikolov et al., 2013; Pennington et al., 2014). This idea has been extended to compute embeddings that capture the semantics of word sequences (e.g., phrases, sentences, paragraphs and documents) (Le and Mikolov, 2014; Kiros et al., 2015). These representations are built upon various types of compositions of word vectors, ranging from simple averaging to sophisticated architectures. Further, they suggest that simple models are efficient and interpretable, and have the potential to outperform sophisticated deep neural models.

It is therefore desirable to leverage the best of both lines of work: learning text representations that capture the dependencies making significant contributions to the task, while maintaining low computational cost. For the task of text classification, labels play a central role in the final performance. A natural question to ask is how we can directly use label information in constructing the text-sequence representations.

1.1 Our Contribution

Our primary contribution is therefore to propose such a solution by making use of the label embedding framework, and to propose the Label-Embedding Attentive Model (LEAM) to improve text classification. While there is an abundant literature in the NLP community on word embeddings (how to describe a word) for text representations, much less work has been devoted, in comparison, to label embeddings (how to describe a class). The proposed LEAM is implemented by jointly embedding the words and labels in the same latent space, and the text representations are constructed directly using the text-label compatibility.

Our label embedding framework has the following salutary properties: (i) Label-attentive text representation is informative for the downstream classification task, as it directly learns from a shared joint space, whereas traditional methods proceed in multiple steps by solving intermediate problems. (ii) The LEAM learning procedure only involves a series of basic algebraic operations, and hence it retains the interpretability of simple models, especially when the label description is available. (iii) Our attention mechanism (derived from the text-label compatibility) has fewer parameters and less computation than related methods, and thus is much cheaper in both training and testing, compared with sophisticated deep attention models. (iv) We perform extensive experiments on several text-classification tasks, demonstrating the effectiveness of our label-embedding attentive model, providing state-of-the-art results on benchmark datasets. (v) We further apply LEAM to predict medical codes from clinical text. As an interesting by-product, our attentive model can highlight the informative key words for prediction, which in practice can reduce a doctor's burden in reading clinical notes.

2 Related Work

Label embedding has been shown to be effective in various domains and tasks. In computer vision, there has been a vast amount of research on leveraging label embeddings for image classification (Akata et al., 2016), multimodal learning between images and text (Frome et al., 2013; Kiros et al., 2014), and text recognition in images (Rodriguez-Serrano et al., 2013). It is particularly successful on the task of zero-shot learning (Palatucci et al., 2009; Yogatama et al., 2015; Ma et al., 2016), where the label correlation captured in the embedding space can improve the prediction when some classes are unseen. In NLP, label embedding for text classification has been studied in the context of heterogeneous networks (Tang et al., 2015) and multitask learning (Zhang et al., 2017a), respectively. To the authors' knowledge, there is little research on investigating the effectiveness of label embeddings for designing efficient attention models, and how to jointly embed words and labels to make full use of label information for text classification has not been studied previously, representing a contribution of this paper.

For text representation, the currently best-performing models usually consist of an encoder and a decoder connected through an attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2015), with successful applications to sentiment classification (Zhou et al., 2016), sentence pair modeling (Yin et al., 2016) and sentence summarization (Rush et al., 2015). Based on this success, more advanced attention models have been developed, including hierarchical attention networks (Yang et al., 2016), attention over attention (Cui et al., 2016), and multi-step attention (Gehring et al., 2017). The idea of attention is motivated by the observation that different words in the same context are differentially informative, and the same word may be differentially important in a different context. The realization of "context" varies in different applications and model architectures. Typically, the context is chosen as the target task, and the attention is computed over the hidden layers of a CNN/RNN. Our attention model is directly built in the joint embedding space of words and labels, and the context is specified by the label embedding.

Several recent works (Vaswani et al., 2017; Shen et al., 2018b,c) have demonstrated that simple attention architectures alone can achieve state-of-the-art performance with less computational time, dispensing with recurrence and convolutions entirely. Our work is in the same direction, sharing the similar spirit of retaining model simplicity and interpretability. The major difference is that the aforementioned work focuses on self-attention, which applies attention to each pair of word tokens in the text sequence. In this paper, we investigate the attention between words and labels, which is more directly related to the target task. Furthermore, the proposed LEAM has far fewer model parameters.

3 Preliminaries

Throughout this paper, we denote vectors as bold lower-case letters, and matrices as bold upper-case letters. We use ⊘ for element-wise division when applied to vectors or matrices, ◦ for function composition, and ∆^p for the set of one-hot vectors in dimension p.

We are given a training set S = {(X_n, y_n)}_{n=1}^N of pair-wise data, where X ∈ 𝒳 is a text sequence and y ∈ 𝒴 is its corresponding label. Specifically, y is a one-hot vector in the single-label problem and a binary vector in the multi-label problem, as defined later in Section 4.1. Our goal for text classification is to learn a function f : 𝒳 → 𝒴 by minimizing an empirical risk of the form:

    min_{f∈F} (1/N) ∑_{n=1}^{N} δ(y_n, f(X_n)),    (1)

where δ : 𝒴 × 𝒴 → R measures the loss incurred from predicting f(X) when the true label is y, and f belongs to the functional space F. In the evaluation stage, we use the 0/1 loss as a target loss: δ(y, z) = 0 if y = z, and 1 otherwise. In the training stage, we consider surrogate losses commonly used for structured prediction in different problem setups (see Section 4.1 for details on the surrogate losses used in this paper).

More specifically, an input sequence X of length L is composed of word tokens: X = {x_1, ..., x_L}. Each token x_l is a one-hot vector in the space ∆^D, where D is the dictionary size. Performing learning in ∆^D is computationally expensive and difficult. An elegant framework in NLP, initially proposed in (Mikolov et al., 2013; Le and Mikolov, 2014; Pennington et al., 2014; Kiros et al., 2015), allows learning to be performed concisely by mapping the words into an embedding space. The framework relies on so-called word embeddings: ∆^D → R^P, where P is the dimensionality of the embedding space. Therefore, the text sequence X is represented via the respective word embedding for each token: V = {v_1, ..., v_L}, where v_l ∈ R^P. A typical text classification method proceeds in three steps, end-to-end, by considering a function decomposition f = f_2 ◦ f_1 ◦ f_0, as shown in Figure 1(a):

• f_0 : X → V, the text sequence is represented in its word-embedding form V, which is a matrix of size P × L.

• f_1 : V → z, a compositional function f_1 aggregates word embeddings into a fixed-length vector representation z.

• f_2 : z → y, a classifier f_2 annotates the text representation z with a label.

A vast amount of work has been devoted to devising the proper functions f_0 and f_1, i.e., how to represent a word or a word sequence, respectively. The success of NLP largely depends on the effectiveness of word embeddings in f_0 (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014). They are often pre-trained offline on a large corpus, then refined jointly via f_1 and f_2 for task-specific representations. Furthermore, the design of f_1 can be broadly cast into two categories. Popular deep learning models consider the mapping as a "black box," and have employed sophisticated CNN/RNN architectures to achieve state-of-the-art performance (Zhang et al., 2015; Yang et al., 2016). On the contrary, recent studies show that simple manipulation of the word embeddings, e.g., mean or max-pooling, can also provide surprisingly excellent performance (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). Nevertheless, these methods only leverage the information from the input text sequence.
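To make the three-step decomposition concrete, the following is a minimal NumPy sketch of the traditional pipeline in Figure 1(a), with the simplest choice of f_1 (mean-pooling over word embeddings, as in the SWEM-style models above). All parameters and the toy dimensions are random placeholders of this sketch, not settings used in the paper.

```python
import numpy as np

# Toy dimensions (placeholders, not the paper's settings)
D, P, K, L = 1000, 300, 4, 7          # vocab size, embedding dim, #classes, sequence length

rng = np.random.default_rng(0)
W_embed = rng.normal(size=(D, P))               # f0: word-embedding lookup table (D x P)
W2 = rng.normal(size=(K, P)); b2 = np.zeros(K)  # f2: linear classifier parameters

def f0(token_ids):
    """Map token ids (length L) to their word embeddings V, a P x L matrix."""
    return W_embed[token_ids].T

def f1(V):
    """Simplest compositional function: average the word embeddings into z in R^P."""
    return V.mean(axis=1)

def f2(z):
    """Classifier: softmax over K classes."""
    logits = W2 @ z + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.integers(0, D, size=L)        # a toy token-id sequence
probs = f2(f1(f0(x)))                 # f = f2 ∘ f1 ∘ f0
print(probs.shape, probs.sum())       # (4,) 1.0
```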

4 Label-Embedding Attentive Model

4.1 Model

By examining the three steps in the traditional pipeline of text classification, we note that the use of label information only occurs in the last step, when learning f_2, and its impact on learning the representations of words in f_0 or word sequences in f_1 is ignored or indirect. Hence, we propose a new pipeline that incorporates label information in every step, as shown in Figure 1(b):

• f_0: Besides embedding words, we also embed all the labels in the same space; they act as the "anchor points" of the classes to influence the refinement of word embeddings.

• f_1: The compositional function aggregates word embeddings into z, weighted by the compatibility between labels and words.

• f_2: The learning of f_2 remains the same, as it directly interacts with labels.

[Figure 1: Illustration of different schemes for document representations z. (a) Traditional method: much work in NLP has been devoted to directly aggregating the word embeddings V for z. (b) Proposed joint embedding method: we focus on learning label embeddings C (how to embed class labels in a Euclidean space), and leveraging the "compatibility" G between embedded words and labels to derive the attention score β for an improved z. Note that ⊗ denotes the cosine similarity between C and V. In this figure, there are K = 2 classes.]

Under the proposed label embedding framework, we specifically describe a label-embedding attentive model.

Joint Embeddings of Words and Labels  We propose to embed both the words and the labels into a joint space, i.e., ∆^D → R^P and 𝒴 → R^P. The label embeddings are C = [c_1, ..., c_K], where K is the number of classes.

A simple way to measure the compatibility of label-word pairs is via the cosine similarity

    G = (C^T V) ⊘ Ĝ,    (2)

where Ĝ is the normalization matrix of size K × L, with each element obtained as the product of the ℓ2 norms of the k-th label embedding and the l-th word embedding: ĝ_kl = ‖c_k‖ ‖v_l‖.

To further capture the relative spatial information among consecutive words (i.e., phrases¹) and introduce non-linearity in the compatibility measure, we consider a generalization of (2). Specifically, for a text phrase of length 2r + 1 centered at l, the local matrix block G_{l−r:l+r} in G measures the label-to-token compatibility for the "label-phrase" pairs. To learn a higher-level compatibility score u_l between the l-th phrase and all labels, we have:

    u_l = ReLU(G_{l−r:l+r} W_1 + b_1),    (3)

where W_1 ∈ R^{2r+1} and b_1 ∈ R^K are parameters to be learned, and u_l ∈ R^K. The largest compatibility value of the l-th phrase with respect to the labels is collected:

    m_l = max-pooling(u_l).    (4)

Together, m is a vector of length L. The compatibility/attention score for the entire text sequence is:

    β = SoftMax(m),    (5)

where the l-th element of the SoftMax is β_l = exp(m_l) / ∑_{l′=1}^{L} exp(m_{l′}).

The text sequence representation can then be simply obtained by averaging the word embeddings, weighted by the label-based attention score:

    z = ∑_l β_l v_l.    (6)

¹We call it a "phrase" for convenience; it could be any longer word sequence, such as a sentence or paragraph, when a larger window size r is considered.

Relation to Predictive Text Embeddings  Predictive Text Embeddings (PTE) (Tang et al., 2015) is the first method to leverage label embeddings to improve the learned word embeddings. We discuss three major differences between PTE and our LEAM: (i) The general settings are different. PTE casts the text representation through heterogeneous networks, while we consider text representation through an attention model. (ii) In PTE, the text representation z is the average of word embeddings. In LEAM, z is a weighted average of word embeddings through the proposed label-attentive score in (6). (iii) PTE only considers the linear interaction between individual words and labels. LEAM greatly improves the performance by considering the nonlinear interaction between phrases and labels. Specifically, we note that the text embedding in PTE is similar to a very special case of LEAM, in which the window size is r = 1 and the attention score β is uniform. As shown later in Figure 2(c) of the experimental results, LEAM can be significantly better than this PTE variant.
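As a minimal illustration, the sketch below computes the attention of (2)-(6) as a plain NumPy forward pass. The released implementation is in TensorFlow and learns W_1, b_1 end-to-end; the random initializations and the zero-padding at the sequence boundaries are assumptions of this sketch.

```python
import numpy as np

def leam_attend(V, C, W1, b1, r):
    """Compute the label-attended text representation z from
    word embeddings V (P x L) and label embeddings C (P x K).
    W1: (2r+1,) and b1: (K,) are the phrase-level parameters of Eq. (3)."""
    P, L = V.shape
    # Eq. (2): cosine-similarity compatibility G (K x L)
    G = (C.T @ V) / (np.linalg.norm(C, axis=0)[:, None] *
                     np.linalg.norm(V, axis=0)[None, :])
    # Zero-pad along the length dimension so every position has a (2r+1)-wide window
    # (the boundary handling here is an assumption of this sketch)
    G_pad = np.pad(G, ((0, 0), (r, r)), mode='constant')
    m = np.empty(L)
    for l in range(L):
        block = G_pad[:, l:l + 2 * r + 1]          # K x (2r+1) local block G_{l-r:l+r}
        u_l = np.maximum(block @ W1 + b1, 0.0)     # Eq. (3): ReLU, u_l in R^K
        m[l] = u_l.max()                           # Eq. (4): max-pooling over labels
    beta = np.exp(m - m.max()); beta /= beta.sum() # Eq. (5): SoftMax over positions
    return V @ beta                                # Eq. (6): weighted average, z in R^P

# Toy usage with random placeholders
rng = np.random.default_rng(0)
P, L, K, r = 300, 10, 4, 2
V = rng.normal(size=(P, L)); C = rng.normal(size=(P, K))
W1 = rng.normal(size=(2 * r + 1,)); b1 = np.zeros(K)
z = leam_attend(V, C, W1, b1, r)
print(z.shape)   # (300,)
```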

Training Objective  The proposed joint embedding framework is applicable to various text classification tasks. We consider two setups in this paper. For a learned text sequence representation z = f_1 ◦ f_0(X), we jointly optimize f = f_2 ◦ f_1 ◦ f_0 over F, where f_2 is defined according to the specific task:

• Single-label problem: categorizes each text instance into precisely one of K classes, y ∈ ∆^K,

    min_{f∈F} (1/N) ∑_{n=1}^{N} CE(y_n, f_2(z_n)),    (7)

where CE(·, ·) is the cross entropy between two probability vectors, and f_2(z_n) = SoftMax(z′_n), with z′_n = W_2 z_n + b_2, and W_2 ∈ R^{K×P}, b_2 ∈ R^K trainable parameters.

• Multi-label problem: categorizes each text instance into a set of K target labels {y_k ∈ ∆^2 | k = 1, ..., K}; there is no constraint on how many of the classes the instance can be assigned to, and

    min_{f∈F} (1/(NK)) ∑_{n=1}^{N} ∑_{k=1}^{K} CE(y_{nk}, f_2(z_{nk})),    (8)

where f_2(z_{nk}) = 1 / (1 + exp(−z′_{nk})), and z′_{nk} is the k-th element of z′_n.

To summarize, the model parameters are θ = {V, C, W_1, b_1, W_2, b_2}. They are trained end-to-end during learning. {W_1, b_1} and {W_2, b_2} are weights in f_1 and f_2, respectively, which are treated as standard neural networks. For the joint embeddings {V, C} in f_0, pre-trained word embeddings are used as initialization if available.

4.2 Learning & Testing with LEAM

Learning and Regularization  The quality of the jointly learned embeddings is key to the model performance and interpretability. Ideally, we hope that each label embedding acts as the "anchor" point for its class: closer to the word/sequence representations in the same class, while farther from those in different classes. To best achieve this property, we regularize the learned label embeddings c_k to lie on their corresponding manifolds. This is imposed by requiring that c_k be easily classified as the correct label y_k:

    min_{f∈F} (1/K) ∑_{k=1}^{K} CE(y_k, f_2(c_k)),    (9)

where f_2 is specified according to the problem, as in either (7) or (8). This regularization is used as a penalty added to the main training objective in (7) or (8), and the default weighting hyperparameter is set to 1. It leads to the meaningful interpretability of learned label embeddings shown in the experiments.

Interestingly, in text classification the class itself is often described by a set of E words {e_i, i = 1, ..., E}. These words are considered the most representative description of each class, and are highly distinguishing between different classes. For example, the Yahoo! Answers Topic dataset (Zhang et al., 2015) contains ten classes, most of which have two words that precisely describe the class-specific features, such as "Computers & Internet", "Business & Finance" and "Politics & Government". We use each label's corresponding pre-trained word embeddings as the initialization of the label embeddings. For datasets without representative class descriptions, one may initialize the label embeddings as random samples drawn from a standard Gaussian distribution.

Testing  Both the learned word and label embeddings are available in the testing stage. We clarify that the label embeddings C of all class candidates in 𝒴 are considered as input in the testing stage; one should distinguish this from the use of the ground-truth label y in prediction. For a text sequence X, one feeds it through the proposed pipeline for prediction: (i) f_0: harvesting the word embeddings V; (ii) f_1: V interacts with C to obtain G, pooled as β, which further attends V to derive z; and (iii) f_2: assigning labels based on the task. To speed up testing, one may store the label embeddings C offline, and avoid their online computational cost.
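A small NumPy sketch of the single-label objective in (7) combined with the label-embedding regularization in (9), again written as a plain forward computation rather than the actual TensorFlow/Adam training loop; the uniform penalty weight of 1 follows the default stated above, and all arrays in the usage example are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(y_onehot, p):
    """CE(y, p) for a one-hot target and a probability vector."""
    return -float(np.sum(y_onehot * np.log(p + 1e-12)))

def leam_loss(Z, Y, C, W2, b2, reg_weight=1.0):
    """Eq. (7) + Eq. (9).
    Z: N x P attended text representations, Y: N x K one-hot labels,
    C: P x K label embeddings, (W2, b2): the classifier f2."""
    N, K = Y.shape
    # Eq. (7): cross entropy of the classifier applied to each text representation
    data_loss = np.mean([cross_entropy(Y[n], softmax(W2 @ Z[n] + b2)) for n in range(N)])
    # Eq. (9): each label embedding c_k should be classified as its own class k
    reg_loss = np.mean([cross_entropy(np.eye(K)[k], softmax(W2 @ C[:, k] + b2)) for k in range(K)])
    return data_loss + reg_weight * reg_loss

# Toy usage with random placeholders
rng = np.random.default_rng(1)
N, P, K = 8, 300, 4
Z = rng.normal(size=(N, P)); Y = np.eye(K)[rng.integers(0, K, size=N)]
C = rng.normal(size=(P, K)); W2 = rng.normal(size=(K, P)) * 0.01; b2 = np.zeros(K)
print(leam_loss(Z, Y, C, W2, b2))
```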

4.3 Model Complexity

We compare CNNs, LSTMs, Simple Word-Embedding-based Models (SWEM) (Shen et al., 2018a) and our LEAM with respect to the number of parameters and computational speed. For the CNN, we assume the same size m for all filters. Specifically, h represents the dimension of the hidden units in the LSTM or the number of filters in the CNN; R denotes the number of blocks in Bi-BloSAN; P denotes the final sequence representation dimension. Similar to (Vaswani et al., 2017; Shen et al., 2018a), we examine the number of compositional parameters, the computational complexity and the number of sequential steps of the four methods. As shown in Table 1, both the CNN and LSTM have a large number of compositional parameters. Since K ≪ m, h, the number of parameters in our model is much smaller than for the CNN and LSTM models. In terms of computational complexity, our model is of almost the same order as the simplest SWEM model, and is smaller than the CNN or LSTM by a factor of mh/K or h/K.

Table 1: Comparisons of CNN, LSTM, SWEM and our model architecture. Columns correspond to the number of compositional parameters, computational complexity and sequential operations.

Model      | Parameters        | Complexity                              | Seq. Operation
CNN        | m · h · P         | O(m · h · L · P)                        | O(1)
LSTM       | 4 · h · (h + P)   | O(L · h^2 + h · L · P)                  | O(L)
SWEM       | 0                 | O(L · P)                                | O(1)
Bi-BloSAN  | 7 · P^2 + 5 · P   | O(P^2 · L^2/R + P^2 · L + P^2 · R^2)    | O(1)
Our model  | K · P             | O(K · L · P)                            | O(1)
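To make the comparison in Table 1 concrete, a quick back-of-the-envelope calculation; the specific values of m and h below are illustrative assumptions, not settings reported in the paper.

```python
# Illustrative compositional-parameter counts from the formulas in Table 1.
P, K = 300, 10        # embedding dimension and number of classes (e.g., Yahoo has K = 10)
m, h = 5, 100         # assumed CNN filter size / number of filters (or LSTM hidden units)

print("CNN :", m * h * P)            # 150000
print("LSTM:", 4 * h * (h + P))      # 160000
print("LEAM:", K * P)                # 3000  -> orders of magnitude fewer parameters
```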

5 Experimental Results

Setup  We use 300-dimensional GloVe word embeddings (Pennington et al., 2014) as the initialization for the word embeddings and label embeddings in our model. Out-of-vocabulary (OOV) words are initialized from a uniform distribution with range [−0.01, 0.01]. The final classifier is implemented as an MLP layer followed by a sigmoid or softmax function, depending on the specific task. We train our model's parameters with the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.001 and a minibatch size of 100. Dropout regularization (Srivastava et al., 2014) is employed on the final MLP layer, with dropout rate 0.5. The model is implemented in TensorFlow and trained on a Titan X GPU. The code to reproduce the experimental results is at https://github.com/guoyinwang/LEAM
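A minimal sketch of the embedding initialization described in the Setup paragraph, assuming a standard whitespace-separated GloVe text file; the file name, vocabulary, and the averaging suggestion in the trailing comment are assumptions of this sketch.

```python
import numpy as np

def build_embedding_matrix(vocab, glove_path="glove.840B.300d.txt", dim=300, seed=0):
    """Initialize a |vocab| x dim matrix with GloVe vectors where available;
    OOV rows are drawn from Uniform[-0.01, 0.01], as in the Setup above."""
    rng = np.random.default_rng(seed)
    emb = rng.uniform(-0.01, 0.01, size=(len(vocab), dim))
    index = {w: i for i, w in enumerate(vocab)}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in index and len(vec) == dim:
                emb[index[word]] = np.asarray(vec, dtype=np.float32)
    return emb

# Label embeddings can be initialized analogously from the class-description words,
# e.g. by averaging the GloVe vectors of "business" and "finance" for that class
# (the averaging here is an assumption, not a detail specified in the paper).
```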

5.1 Classification on Benchmark Datasets

We test our model on the same five standard benchmark datasets as in (Zhang et al., 2015). The summary statistics of the data are shown in Table 2, with the content specified below:

Table 2: Summary statistics of the five datasets, including the number of classes, number of training samples and number of testing samples.

Dataset      | # Classes | # Training | # Testing
AGNews       | 4         | 120k       | 7.6k
Yelp Binary  | 2         | 560k       | 38k
Yelp Full    | 5         | 650k       | 38k
DBPedia      | 14        | 560k       | 70k
Yahoo        | 10        | 1400k      | 60k

• AGNews: Topic classification over four categories of Internet news articles (Del Corso et al., 2005), composed of titles plus descriptions, classified into World, Entertainment, Sports and Business.

• Yelp Review Full: The dataset is obtained from the Yelp Dataset Challenge in 2015; the task is sentiment classification of polarity star labels ranging from 1 to 5.

• Yelp Review Polarity: The same set of text reviews from the Yelp Dataset Challenge in 2015, except that a coarser sentiment definition is considered: 1 and 2 are negative, and 4 and 5 are positive.

• DBPedia: Ontology classification over fourteen non-overlapping classes picked from DBpedia 2014 (Wikipedia).

• Yahoo! Answers Topic: Topic classification over the ten largest main categories from Yahoo! Answers Comprehensive Questions and Answers version 1.0, including question title, question content and best answer.

We compare with a variety of methods, including (i) the bag-of-words in (Zhang et al., 2015); (ii) sophisticated deep CNN/RNN models: large/small word CNN and LSTM reported in (Zhang et al., 2015; Dai and Le, 2015) and deep CNN (29 layers) (Conneau et al., 2017); (iii) simple compositional methods: fastText (Joulin et al., 2016) and simple word-embedding models (SWEM) (Shen et al., 2018a); (iv) deep attention models: the hierarchical attention network (HAN) (Yang et al., 2016); and (v) simple attention models: the bi-directional block self-attention network (Bi-BloSAN) (Shen et al., 2018c). The results are shown in Table 3.

Table 3: Test accuracy on document classification tasks, in percentage. † We ran Bi-BloSAN using the authors' implementation; all other results are directly cited from the respective papers.

Model                                        | Yahoo | DBPedia | AGNews | Yelp P. | Yelp F.
Bag-of-words (Zhang et al., 2015)            | 68.90 | 96.60   | 88.80  | 92.20   | 58.00
Small word CNN (Zhang et al., 2015)          | 69.98 | 98.15   | 89.13  | 94.46   | 58.59
Large word CNN (Zhang et al., 2015)          | 70.94 | 98.28   | 91.45  | 95.11   | 59.48
LSTM (Zhang et al., 2015)                    | 70.84 | 98.55   | 86.06  | 94.74   | 58.17
SA-LSTM (word-level) (Dai and Le, 2015)      | -     | 98.60   | -      | -       | -
Deep CNN (29 layers) (Conneau et al., 2017)  | 73.43 | 98.71   | 91.27  | 95.72   | 64.26
SWEM (Shen et al., 2018a)                    | 73.53 | 98.42   | 92.24  | 93.76   | 61.11
fastText (Joulin et al., 2016)               | 72.30 | 98.60   | 92.50  | 95.70   | 63.90
HAN (Yang et al., 2016)                      | 75.80 | -       | -      | -       | -
Bi-BloSAN† (Shen et al., 2018c)              | 76.28 | 98.77   | 93.32  | 94.56   | 62.13
LEAM                                         | 77.42 | 99.02   | 92.45  | 95.31   | 64.09
LEAM (linear)                                | 75.22 | 98.32   | 91.75  | 93.43   | 61.03

Testing accuracy  Simple compositional methods indeed achieve performance comparable to the sophisticated deep CNN/RNN models. On the other hand, the deep hierarchical attention model can improve on the pure CNN/RNN models. The recently proposed self-attention networks generally yield higher accuracy than previous methods, and all approaches are better than the traditional bag-of-words method. Our proposed LEAM outperforms the state-of-the-art methods on the two largest datasets, i.e., Yahoo and DBPedia. On the other datasets, LEAM ranks 2nd or 3rd best, similar to the top method in terms of accuracy. This is probably due to two reasons: (i) the number of classes on these datasets is smaller, and (ii) there is no explicit corresponding word embedding available for the label embedding initialization during learning, so the potential of label embedding may not be fully exploited. As an ablation study, we replace the nonlinear compatibility (3) with the linear one in (2). The degraded performance (LEAM (linear) in Table 3) demonstrates the necessity of spatial dependency and nonlinearity in constructing the attention.

Nevertheless, we argue that LEAM is favorable for text classification, by comparing the model size and time cost in Table 4, as well as the convergence speed in Figure 2(a). The time cost is reported as the wall-clock time for 1000 iterations. LEAM maintains the simplicity and low cost of SWEM, compared with the other models. LEAM uses far fewer model parameters, and converges significantly faster than Bi-BloSAN. We also compare the performance when only part of the dataset is labeled; the results are shown in Figure 2(b). LEAM consistently outperforms the other methods across different proportions of labeled data.

Table 4: Comparison of model size and speed.

Model      | # Parameters | Time cost (s)
CNN        | 541k         | 171
LSTM       | 1.8M         | 598
SWEM       | 61k          | 63
Bi-BloSAN  | 3.6M         | 292
LEAM       | 65k          | 65

Hyper-parameter  Our method has one additional hyperparameter, the window size r, which defines the length of the "phrase" used to construct the attention. Larger r captures long-term dependencies, while smaller r enforces local dependencies. We study its impact in Figure 2(c). The topic classification tasks generally require a larger r, while the sentiment classification tasks allow a relatively smaller r. One may safely choose r around 50 if not fine-tuning. We report the optimal results in Table 3.

[Figure 2: Comprehensive study of LEAM, including (a) convergence speed, (b) performance vs. proportion of labeled data, and (c) the effect of the window size hyper-parameter.]

5.2 Representational Ability

Label embeddings are highly meaningful  To provide insight into the meaningfulness of the learned representations, in Figure 3 we visualize the correlation between label embeddings and document embeddings on the Yahoo dataset. First, we compute the averaged document embedding per class: z̄_k = (1/|S_k|) ∑_{i∈S_k} z_i, where S_k is the set of sample indices belonging to class k. Intuitively, z̄_k represents the center of the embedded text manifold for class k. Ideally, the perfect label embedding c_k should be the representative anchor point for class k. We compute the cosine similarity between z̄_k and c_k across all the classes, shown in Figure 3(a). The rows are the averaged per-class document embeddings, while the columns are the label embeddings. Therefore, the on-diagonal elements measure how representative the learned label embeddings are of their own classes, while the off-diagonal elements reflect how distinctive the label embeddings are from other classes. The high on-diagonal elements and low off-diagonal elements in Figure 3(a) indicate the superb ability of the label representations learned by LEAM.

Further, since both the document and label embeddings live in the same high-dimensional space, we use t-SNE (Maaten and Hinton, 2008) to visualize them on a 2D map in Figure 3(b). Each color represents a different class, the point clouds are document embeddings, and the label embeddings are the large dots with black circles. As can be seen, each label embedding falls into the internal region of the respective manifold, which again demonstrates the strong representative power of label embeddings.

[Figure 3: Correlation between the learned text sequence representations z and the label embeddings C on the Yahoo dataset. (a) Cosine similarity matrix between the averaged z̄ per class and the label embeddings, and (b) t-SNE plot of the joint embedding of texts z and labels.]

Interpretability of attention  Our attention score β can be used to highlight the most informative words with respect to the downstream prediction task. We visualize two examples in Figure 4(a) for the Yahoo dataset; darker yellow indicates more important words. The first text sequence is on the topic of "Sports", and the second is "Entertainment". The attention score correctly detects the key words with proper scores.
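A short sketch of how the per-class diagnostic in Figure 3(a) could be reproduced from model outputs; the arrays in the usage example are random placeholders standing in for the learned document representations and label embeddings.

```python
import numpy as np

def class_label_cosine(Z, labels, C):
    """Z: N x P document representations, labels: N integer class ids,
    C: P x K label embeddings. Returns the K x K cosine-similarity matrix between
    per-class averaged documents (rows) and label embeddings (columns)."""
    K = C.shape[1]
    Z_bar = np.stack([Z[labels == k].mean(axis=0) for k in range(K)])   # K x P class centers
    Z_bar /= np.linalg.norm(Z_bar, axis=1, keepdims=True)
    C_n = C / np.linalg.norm(C, axis=0, keepdims=True)
    return Z_bar @ C_n     # high diagonal / low off-diagonal indicates well-placed anchors

# Toy usage with random placeholders
rng = np.random.default_rng(2)
N, P, K = 100, 300, 10
Z = rng.normal(size=(N, P)); labels = rng.integers(0, K, size=N); C = rng.normal(size=(P, K))
print(class_label_cosine(Z, labels, C).shape)   # (10, 10)
```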

5.3 Applications to Clinical Text

To demonstrate the practical value of label embeddings, we apply LEAM to a real health care scenario: medical code prediction on an Electronic Health Records dataset. A given patient may have multiple diagnoses, and thus multi-label learning is required.

Specifically, we consider an open-access dataset, MIMIC-III (Johnson et al., 2016), which contains text and structured records from a hospital intensive care unit. Each record includes a variety of narrative notes describing a patient's stay, including diagnoses and procedures. They are accompanied by a set of metadata codes from the International Classification of Diseases (ICD), which present a standardized way of indicating diagnoses and procedures. To compare with previous work, we follow (Shi et al., 2017; Mullenbach et al., 2018) and preprocess a dataset consisting of the 50 most common labels. This results in 8,067 documents for training, 1,574 for validation, and 1,730 for testing.

Results  We compare against three baselines: a logistic regression model with bag-of-words features, a bidirectional gated recurrent unit (Bi-GRU), and a single-layer 1D convolutional network (Kim, 2014). We also compare with three recent methods for multi-label classification of clinical text: Condensed Memory Networks (C-MemNN) (Prakash et al., 2017), Attentive LSTM (Shi et al., 2017) and Convolutional Attention (CAML) (Mullenbach et al., 2018).

To quantify the prediction performance, we follow (Mullenbach et al., 2018) and consider the micro-averaged and macro-averaged F1 and area under the ROC curve (AUC), as well as precision at n (P@n). Micro-averaged values are calculated by treating each (text, code) pair as a separate prediction. Macro-averaged values are calculated by averaging metrics computed per label. P@n is the fraction of the n highest-scored labels that are present in the ground truth.

The results are shown in Table 5. LEAM provides the best AUC scores, and better F1 and P@5 values than all methods except the CNN. The CNN consistently outperforms the basic Bi-GRU architecture, and the logistic regression baseline performs worse than all deep learning architectures.

Table 5: Quantitative results for the doctor-notes multi-label classification task.

Model                              | AUC Macro | AUC Micro | F1 Macro | F1 Micro | P@5
Logistic Regression                | 0.829     | 0.864     | 0.477    | 0.533    | 0.546
Bi-GRU                             | 0.828     | 0.868     | 0.484    | 0.549    | 0.591
CNN (Kim, 2014)                    | 0.876     | 0.907     | 0.576    | 0.625    | 0.620
C-MemNN (Prakash et al., 2017)     | 0.833     | -         | -        | -        | 0.42
Attentive LSTM (Shi et al., 2017)  | -         | 0.900     | -        | 0.532    | -
CAML (Mullenbach et al., 2018)     | 0.875     | 0.909     | 0.532    | 0.614    | 0.609
LEAM                               | 0.881     | 0.912     | 0.540    | 0.619    | 0.612

We emphasize that the learned attention can be very useful in reducing a doctor's reading burden. As shown in Figure 4(b), the health-related words are highlighted.

[Figure 4: Visualization of the learned attention β. (a) Yahoo dataset; (b) clinical text.]

6 Conclusions

In this work, we investigate label embeddings for text representations, and propose the label-embedding attentive model. It embeds the words and labels in the same joint space, and measures the compatibility of word-label pairs to attend the document representations. The learning framework is tested on several large standard datasets and a real clinical text application. Compared with previous methods, our LEAM algorithm requires much lower computational cost, and achieves better, or at least comparable, performance relative to the state-of-the-art. The learned attention is highly interpretable: it highlights the most informative words in the text sequence for the downstream classification task.

Acknowledgments

This research was supported by DARPA, DOE, NIH, ONR and NSF.

References

Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2016. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. ICLR.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1107-1116.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2016. Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079-3087.

Gianna M Del Corso, Antonio Gulli, and Francesco Romani. 2005. Ranking a stream of news. In Proceedings of the 14th International Conference on World Wide Web. ACM.

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. EACL.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. NIPS 2014 Deep Learning Workshop.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. 2018. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations.

Yukun Ma, Erik Cambria, and Sa Gao. 2016. Label embedding for zero-shot fine-grained named entity typing. In COLING.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695.

Mark Palatucci, Dean Pomerleau, Geoffrey E Hinton, and Tom M Mitchell. 2009. Zero-shot learning with semantic output codes. In NIPS.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Aaditya Prakash, Siyuan Zhao, Sadid A Hasan, Vivek V Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2017. Condensed memory networks for clinical diagnostic inferencing. In AAAI.

Jose A Rodriguez-Serrano, Florent Perronnin, and France Meylan. 2013. Label embedding for text recognition. In Proceedings of the British Machine Vision Conference.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018a. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In ACL.

Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. 2017. Deconvolutional latent-variable model for text sequence matching. AAAI.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018b. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. AAAI.

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018c. Bi-directional block self-attention for fast and memory-efficient sequence modeling. ICLR.

Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P Xing. 2017. Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.

Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165-1174. ACM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010.

Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In ACL.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2018. Topic compositional neural language model. AISTATS.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. ICLR.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. TACL.

Dani Yogatama, Daniel Gillick, and Nevena Lazic. 2015. Embedding methods for fine grained entity type classification. In ACL.

Honglun Zhang, Liqiang Xiao, Wenqing Chen, Yongkun Wang, and Yaohui Jin. 2017a. Multi-task label embedding for text classification. arXiv preprint arXiv:1710.07210.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS.

Yizhe Zhang, Dinghan Shen, Guoyin Wang, Zhe Gan, Ricardo Henao, and Lawrence Carin. 2017b. Deconvolutional paragraph representation learning. In NIPS.

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Attention-based LSTM network for cross-lingual sentiment classification. In EMNLP.