Joint Embedding of Words and Labels for Text Classification
Guoyin Wang, Chunyuan Li,∗ Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, Lawrence Carin
Duke University
{gw60,cl319,ww107,yz196,ds337,xz139,r.henao,lcarin}@duke.edu
∗Corresponding author

Abstract

Word embeddings are effective intermediate representations for capturing semantic regularities between words when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space with the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information in addition to the input text sequence. Extensive results on several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.

1 Introduction

Text classification is a fundamental problem in natural language processing (NLP). The task is to annotate a given text sequence with one (or multiple) class label(s) describing its textual content. A key intermediate step is the text representation. Traditional methods represent text with hand-crafted features, such as sparse lexical features (e.g., n-grams) (Wang and Manning, 2012). Recently, neural models have been employed to learn text representations, including convolutional neural networks (CNNs) (Kalchbrenner et al., 2014; Zhang et al., 2017b; Shen et al., 2017) and recurrent neural networks (RNNs) based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Wang et al., 2018).

To further increase the representation flexibility of such models, attention mechanisms (Bahdanau et al., 2015) have been introduced as an integral part of models employed for text classification (Yang et al., 2016). The attention module is trained to capture the dependencies that make significant contributions to the task, regardless of the distance between the elements in the sequence. It can thus provide information complementary to the distance-aware dependencies modeled by RNNs/CNNs. However, the increased representation power of the attention mechanism comes with increased model complexity.

Alternatively, several recent studies show that the success of deep learning on text classification largely depends on the effectiveness of the word embeddings (Joulin et al., 2016; Wieting et al., 2016; Arora et al., 2017; Shen et al., 2018a). In particular, Shen et al. (2018a) quantitatively show, using the concept of intrinsic dimension (Li et al., 2018), that word-embedding-based text classification tasks have a similar level of difficulty regardless of the employed model; thus, simple models are preferred. As the basic building blocks of neural NLP, word embeddings capture the similarities/regularities between words (Mikolov et al., 2013; Pennington et al., 2014). This idea has been extended to compute embeddings that capture the semantics of word sequences (e.g., phrases, sentences, paragraphs and documents) (Le and Mikolov, 2014; Kiros et al., 2015). These representations are built upon various types of compositions of word vectors, ranging from simple averaging to sophisticated architectures. Further, these studies suggest that simple models are efficient and interpretable, and have the potential to outperform sophisticated deep neural models.

It is therefore desirable to leverage the best of both lines of work: learning text representations that capture the dependencies making significant contributions to the task, while maintaining low computational cost. For the task of text classification, labels play a central role in the final performance. A natural question to ask is how we can directly use label information in constructing the text-sequence representations.
1.1 Our Contribution

Our primary contribution is therefore to propose such a solution by making use of the label embedding framework, introducing the Label-Embedding Attentive Model (LEAM) to improve text classification. While there is an abundant literature in the NLP community on word embeddings (how to describe a word) for text representations, much less work has been devoted, in comparison, to label embeddings (how to describe a class). The proposed LEAM is implemented by jointly embedding the words and labels in the same latent space, and the text representations are constructed directly using the text-label compatibility.

Our label embedding framework has the following salutary properties: (i) Label-attentive text representation is informative for the downstream classification task, as it directly learns from a shared joint space, whereas traditional methods proceed in multiple steps by solving intermediate problems. (ii) The LEAM learning procedure involves only a series of basic algebraic operations, and hence retains the interpretability of simple models, especially when the label descriptions are available. (iii) Our attention mechanism, derived from the text-label compatibility, has fewer parameters and less computation than related methods, and thus is much cheaper in both training and testing than sophisticated deep attention models. (iv) We perform extensive experiments on several text-classification tasks, demonstrating the effectiveness of our label-embedding attentive model and providing state-of-the-art results on benchmark datasets. (v) We further apply LEAM to predict medical codes from clinical text. As an interesting by-product, our attentive model can highlight the informative keywords for prediction, which in practice can reduce a doctor's burden in reading clinical notes.
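To give a concrete sense of what constructing a text representation "directly using the text-label compatibility" can look like, the following NumPy sketch weights word embeddings by their cosine compatibility with label embeddings that live in the same space. It is only an illustration under assumed design choices (cosine scores, max-pooling over labels, a softmax over words); the paper's actual formulation is developed in later sections.

```python
import numpy as np

def label_attentive_representation(V, C):
    """Illustrative sketch of label-word compatibility attention
    (assumed design choices, not the paper's exact formulation).

    V : (P, L) word embeddings of a length-L text sequence.
    C : (P, K) label embeddings for K classes, in the same P-dim space.
    Returns z : (P,) label-attentive text representation.
    """
    # Cosine compatibility between every label and every word: shape (K, L).
    Vn = V / (np.linalg.norm(V, axis=0, keepdims=True) + 1e-8)
    Cn = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-8)
    G = Cn.T @ Vn

    # Collapse the label axis into one relevance score per word, then
    # normalize over words so that relevant words receive higher weight.
    scores = G.max(axis=0)               # (L,)
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()                   # attention weights over the L words

    # Text representation: attention-weighted combination of word embeddings.
    return V @ beta                      # (P,)
```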
2 Related Work

Label embedding has been shown to be effective in various domains and tasks. In computer vision, there has been a vast amount of research on leveraging label embeddings for image classification (Akata et al., 2016), multimodal learning between images and text (Frome et al., 2013; Kiros et al., 2014), and text recognition in images (Rodriguez-Serrano et al., 2013). It has been particularly successful on the task of zero-shot learning (Palatucci et al., 2009; Yogatama et al., 2015; Ma et al., 2016), where the label correlation captured in the embedding space can improve the prediction when some classes are unseen. In NLP, label embedding for text classification has been studied in the context of heterogeneous networks (Tang et al., 2015) and of multitask learning (Zhang et al., 2017a). To the authors' knowledge, there is little research investigating the effectiveness of label embeddings for designing efficient attention models, and how to jointly embed words and labels to make full use of label information for text classification has not been studied previously; this represents a contribution of this paper.

For text representation, the currently best-performing models usually consist of an encoder and a decoder connected through an attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2015), with successful applications to sentiment classification (Zhou et al., 2016), sentence pair modeling (Yin et al., 2016) and sentence summarization (Rush et al., 2015). Based on this success, more advanced attention models have been developed, including hierarchical attention networks (Yang et al., 2016), attention over attention (Cui et al., 2016), and multi-step attention (Gehring et al., 2017). The idea of attention is motivated by the observation that different words in the same context are differentially informative, and the same word may be differentially important in a different context. The realization of "context" varies across applications and model architectures. Typically, the context is chosen as the target task, and the attention is computed over the hidden layers of a CNN/RNN. Our attention model is instead built directly in the joint embedding space of words and labels, and the context is specified by the label embedding.

Several recent works (Vaswani et al., 2017; Shen et al., 2018b,c) have demonstrated that simple attention architectures can alone achieve state-of-the-art performance with less computational time, dispensing with recurrence and convolutions entirely. Our work is in the same direction, sharing the similar spirit of retaining model simplicity and interpretability. The major difference is that the aforementioned works focus on self-attention, which applies attention to each pair of word tokens in the text sequence. In this paper, we investigate the attention between words and labels, which is more directly related to the target task. Furthermore, the proposed LEAM has far fewer model parameters.

3 Preliminaries

Throughout this paper, we denote vectors as bold, lower-case letters, and matrices as bold, upper-case letters. We use ⊘ for element-wise division when applied to vectors or matrices. Each word token is mapped from its one-hot representation to its word embedding, $\Delta_D \mapsto \mathbb{R}^P$, where $D$ is the vocabulary size and $P$ is the dimensionality of the embedding space. Therefore, the text sequence $\mathbf{X}$ is represented via the respective word embedding of each token: $\mathbf{V} = \{\mathbf{v}_1, \cdots, \mathbf{v}_L\}$, where $\mathbf{v}_l \in \mathbb{R}^P$. A typical text classification method proceeds in three steps, end-to-end, by considering a function decomposition $f = f_0 \circ f_1 \circ f_2$ as shown in Figure 1(a):

• $f_0 : \mathbf{X} \mapsto \mathbf{V}$, the text sequence is represented as its word-embedding form $\mathbf{V}$, which is a matrix of size $P \times L$.
• $f_1 : \mathbf{V} \mapsto \mathbf{z}$, a compositional function $f_1$ aggregates word embeddings into a fixed-length vector representation $\mathbf{z}$.
• $f_2 : \mathbf{z} \mapsto \mathbf{y}$, a classifier $f_2$ annotates the text representation $\mathbf{z}$ with a label.
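For concreteness, below is a minimal NumPy sketch of this generic three-step pipeline (not LEAM itself), with $f_1$ chosen as simple averaging and $f_2$ as a linear softmax classifier; the dimensions and the randomly initialized parameters are placeholders, and in practice these components would be trained end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, K = 10_000, 300, 4           # vocabulary size, embedding dim, number of classes

W_emb = rng.normal(size=(D, P))    # word-embedding table used by f0
W_clf = rng.normal(size=(P, K))    # linear classifier weights used by f2
b_clf = np.zeros(K)

def f0(token_ids):
    """f0 : X -> V, look up word embeddings; V has shape (P, L)."""
    return W_emb[token_ids].T

def f1(V):
    """f1 : V -> z, aggregate word embeddings; here, simple averaging."""
    return V.mean(axis=1)

def f2(z):
    """f2 : z -> y, annotate the text representation with a label distribution."""
    logits = z @ W_clf + b_clf
    p = np.exp(logits - logits.max())
    return p / p.sum()

x = rng.integers(0, D, size=20)    # a toy sequence of L = 20 token ids
y_prob = f2(f1(f0(x)))             # the end-to-end composition f
predicted_label = int(y_prob.argmax())
```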