Fast and Efficient Text Classification with Class-based Embeddings

Conference Paper · March 2019


Jônatas Wehrmann, Camila Kolling, and Rodrigo C. Barros
Machine Intelligence and Robotics Research Group
School of Technology, Pontifícia Universidade Católica do Rio Grande do Sul
Av. Ipiranga, 6681, 90619-900, Porto Alegre, RS, Brazil
Email: {jonatas.wehrmann,camila.kolling}@acad.pucrs.br, [email protected]

Abstract—Current state-of-the-art approaches for Natural Language Processing tasks such as text classification are either based on Recurrent or Convolutional Neural Networks. Notwithstanding, those approaches often require a long time to train, or large amounts of memory to store the entire trained models. In this paper, we introduce a novel neural network architecture for ultra-fast, memory-efficient text classification. The proposed architecture is based on word embeddings trained directly over the class space, which allows for fast, efficient, and effective text classification. We divide the proposed architecture into four main variations that present distinct capabilities for learning temporal relations. We perform several experiments across four widely-used datasets, in which we achieve results comparable to the state-of-the-art while being much faster and lighter in terms of memory usage. We also present a thorough ablation study to demonstrate the importance of each component within each proposed model. Finally, we show that our model predictions can be visualized and thus easily explained.

Index Terms—Text classification, deep learning, neural networks, natural language processing.

I. INTRODUCTION

Text classification approaches are important components within Natural Language Processing (NLP) research, and they have been designed for countless application domains such as document classification [1], sentiment analysis [2]–[4], cross-modal retrieval [5], [6], hierarchical classification [7], and generation of sentence embeddings [8], just to name a few.

A central problem in text classification is feature representation, which has relied for a long time on the well-known bag-of-words (or bag-of-n-grams) approach. Such a strategy describes the occurrence of words or characters within a document, and basically requires the usage of a vocabulary of known words and the measurement of the occurrence of those known words. Since we need to store the vocabulary, memory requirements are often a practical concern.

Recently, the NLP community has turned to methods that are capable of automatically learning features from raw text, such as Convolutional or Recurrent Neural Networks (CNNs/RNNs). CNNs were originally designed with computer vision applications in mind, but they have shown to be quite effective for a plethora of NLP applications [9], [10]. Indeed, models based on neural networks have outperformed traditional hand-crafted approaches, achieving state-of-the-art performance in several NLP tasks [1], [11], [12].

Whereas neural network models often achieve very good performance on text classification, they tend to use a large amount of memory during both training and inference, especially when learning from a corpus that contains a very large vocabulary. Recent work has tried to change this perspective, e.g., FastText [11]. Such a method represents a document by averaging the word vectors of a given sentence, resulting in a bag-of-words-like representation, though allowing the update of word vectors through backpropagation during training, as opposed to the static word representation in a standard bag-of-words model.

FastText provides fast learning of word representations and sentence classification. Compared to other systems [9], [13], [14] that are based on either CNNs or RNNs, FastText shows comparable results with much smaller training times. Nevertheless, in spite of being faster to train and to test than traditional techniques based on n-grams, FastText uses a lot of memory to store and process the embeddings. This is an important issue for applications that need to run on systems with limited memory, such as smartphones.

To address the limitations of current neural network models, we propose four different fast and memory-efficient approaches. The first one, CWE-BC, draws inspiration from FastText [11] and generates word-embedding vectors by directly mapping the word-embedding space to the target class space. Our second method, CWE-SA, replaces traditional pooling functions with a self-attention module, giving different weights to each word and thereby focusing on the most important words of the sentences. CWE-C is the third approach and employs convolutional layers for processing the temporal dimension. Our final approach, CWE-R, is based on recurrent operators: it is designed to learn temporal relations and, unlike traditional RNNs, it does not require additional trainable weight matrices.

We compare our models with previous state-of-the-art approaches. They perform on a par with recently-proposed deep learning methods while being faster and lighter in terms of memory consumption. Our models use ≈ 100× less memory while running up to 4× faster when compared to FastText [11]. In addition, they are much easier to visualize and understand, since they learn word embeddings that are trained directly on the class space.

The rest of this paper is organized as follows. In Section II we describe the proposed approach and its variations, while in Section III we detail the setup used for training and evaluating each of the models and the respective baselines. In Section IV we describe the experiments that are performed to quantitatively evaluate our approaches on several text-classification datasets. In Section V we qualitatively assess the proposed methods. Section VI summarizes previous related work, and we finally conclude this paper and discuss future research directions in Section VII.
II. CLASS-BASED WORD EMBEDDINGS

In this paper, we introduce Class-based Word Embeddings (CWE), an approach designed to classify text in a fast, light, and effective fashion. Unlike traditional state-of-the-art text classification methods, CWE is developed to work with minimal resources in terms of processing and memory, while achieving solid results across distinct datasets.

CWE works by learning a text classification function ψ(T) = y, where T ⊃ {ω_j}_{j=1}^{t} is a given text (instance) within the text-classification dataset comprising t words, each encoded as a vector ω_j ∈ ℝ^C, and y is the respective binary-class vector for a C-class classification problem, so that Σ_{i=1}^{C} y_i = 1. In the following sections we detail several flavors of the function ψ(·), which give CWE distinct capabilities with advantages and disadvantages.

A. CWE-BC: Bag-of-Classes

CWE-BC is the first strategy for learning the function ψ(·), and it somewhat draws inspiration from the FastText [11] model. In the latter, word embeddings ω ∈ ℝ^d are averaged in order to build a d-dimensional sentence feature representation, which is then linearly mapped through a trainable weight matrix W ∈ ℝ^{d×h} to a hidden feature space with h dimensions, and finally projected onto the C-dimensional class space. Our approach aims to generate word-embedding vectors by exploring direct relations across the word-embedding space and the target class space. In CWE-BC, the direct word-class mapping is achieved by training word embeddings ω ∈ ℝ^C instead of ω ∈ ℝ^d, discarding the need for additional transition weight matrices. The final classification scores are given by processing the input matrix T ∈ ℝ^{t×C} with a given pooling function. We have tried three distinct pooling functions, namely max, mean, and selfatt, the latter being a self-attention-based function. Figure 1 depicts the CWE-BC approach.

Fig. 1. Model architecture of CWE-BC. Each word is embedded according to the number of classes C and passed to one of the pooling functions.

Note that by using a pooling function directly over unigram word embeddings, CWE-BC fully discards temporal information, i.e., the order of the words across a sentence is not considered and does not affect the model predictions whatsoever. Therefore, such a method can be regarded either as a linear bag-of-words or as a bag-of-classes approach. Our hypothesis is that despite CWE-BC being very simple, it will be capable of providing good predictive performance across distinct datasets, since there are several words that are directly related to a given class. In sentiment analysis datasets, for instance, it is easy to find specific words that carry strong class-based content, such as amazing and awesome, which most certainly denote a positive polarity, while terrible and worst mostly represent the negative polarity.

In addition, word embeddings trained with CWE-BC can be easily visualized when the class space is small. For instance, assume that a sentiment analysis model is trained to classify text as either positive or negative. In that case, one could naturally visualize those embeddings in a bidimensional Euclidean space and fully explain the model predictions, without the need of further employing algorithms for visualizing high-dimensional data such as t-SNE [15], which presents a high asymptotic computational complexity (quadratic in the number of instances), limiting its use to roughly 10,000 instances [15]. In theory, high-dimensional word-embedding-based methods project the input data so that the feature space lies on several different, though related, low-dimensional manifolds. In CWE, we directly optimize a low-dimensional manifold which is equivalent to a Euclidean space, making it much easier to visualize.
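To make the bag-of-classes idea concrete, the following is a minimal PyTorch sketch of a CWE-BC-style model with mean pooling, assuming γ = 1 so that the embedding width equals the number of classes C. The class name, the padding convention, and the handling of sentence lengths are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CWEBC(nn.Module):
    """Bag-of-Classes sketch: every word embedding lives directly in R^C."""

    def __init__(self, vocab_size, num_classes, pad_idx=0):
        super().__init__()
        # gamma = 1: the embedding width equals the number of classes C.
        self.emb = nn.Embedding(vocab_size, num_classes, padding_idx=pad_idx)

    def forward(self, tokens, lengths):
        # tokens: (batch, t) word indices padded with pad_idx; lengths: (batch,) true sizes
        T = self.emb(tokens)                                        # (batch, t, C)
        pooled = T.sum(dim=1) / lengths.clamp(min=1).unsqueeze(1)   # mean pooling over real words
        return pooled                                               # class scores; softmax lives in the loss
```

A call such as `CWEBC(vocab_size=35065, num_classes=4)` would roughly match the AGNews configuration in Table I; the softmax is deferred to the cross-entropy loss during training.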
B. CWE-SA: Self-Attentive Pooling

Our second approach replaces traditional global pooling strategies with a learned self-attention module (selfatt), responsible for assigning distinct weights to each word, so that the final sentence representation is given by a weighted mean pooling. We refer to this method hereby as CWE-SA, and it is depicted in Figure 2. The self-attention mechanism was originally introduced in [16], being applied within RNNs.

Several papers have evaluated such an approach to summarize both convolutional and word-based data. In [17], the authors use the self-attention module so that the models can more easily focus on the most important sentence parts. Here we use the same input representation strategy T ∈ ℝ^{t×C}, which is processed by a tanh-based fully-connected layer with n neurons (here we use n = C) that is responsible for normalizing activation values in the (−1, 1) range, yielding T̂. A second fully-connected layer, namely φ(T̂), generates the annotation vector θ ∈ ℝ^{t×1} that contains all word weights after a softmax function:

    \theta_k = \frac{\exp(\phi(\hat{T})_k)}{\sum_{j=1}^{t} \exp(\phi(\hat{T})_j)},    (1)

where θ_k provides the importance of the k-th word, and Σ_{j=1}^{t} θ_j = 1. Ultimately, model predictions ŷ are generated as ŷ = T^⊤θ, activated once again via softmax.

Similarly to CWE-BC, this approach also discards temporal data, being unable to learn word dependencies and eventually leading to sub-optimal results. To circumvent this issue, we discuss two additional methods that are capable of leveraging temporal data for text classification, presented in Sections II-C and II-D.
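A possible PyTorch rendering of this self-attentive pooling, following Eq. (1); the module name and the optional padding mask are our additions, and this is only a sketch of how such a layer could be wired:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentivePooling(nn.Module):
    """selfatt pooling following Eq. (1): one scalar weight per word, summing to 1."""

    def __init__(self, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(num_classes, num_classes)  # n = C hidden units, tanh-activated
        self.fc2 = nn.Linear(num_classes, 1)            # phi(T_hat): one annotation score per word

    def forward(self, T, mask=None):
        # T: (batch, t, C) class-sized word embeddings
        T_hat = torch.tanh(self.fc1(T))                 # values normalized to (-1, 1)
        scores = self.fc2(T_hat).squeeze(-1)            # (batch, t)
        if mask is not None:                            # optionally ignore padded positions
            scores = scores.masked_fill(~mask, float("-inf"))
        theta = F.softmax(scores, dim=1)                # Eq. (1)
        return torch.bmm(theta.unsqueeze(1), T).squeeze(1)  # (batch, C): weighted mean = T^T theta
```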

Fig. 2. Model architecture of CWE-SA. In this case, the first fully-connected layer transforms the t × C input into a t × n matrix, with tanh as the activation function. This matrix is passed to another fully-connected layer, generating a t × 1 annotation vector to which a softmax function is applied so that the word weights sum to 1. Finally, this vector is multiplied by the input matrix, generating the final vector of C class scores, which become probabilities after a softmax function.


Fig. 3. Model architecture of CWE-C. Each word is embedded according to the number of classes and passed to a convolutional layer. Each convolutional layer has a different kernel size k, which affects the size of the receptive field over the temporal dimension, i.e., the number of words being processed altogether. Each feature map resulting from the convolutional layers is processed with a pooling layer, resulting in n arrays of dimension 1×C. Finally, all arrays are averaged.

C. CWE-C: A Convolutional Approach

CWE-C is a CWE variation that employs at least one convolutional layer for processing the temporal dimension. By applying a convolution with f filters of kernel size k directly over the input data, one can embed k words altogether within an f-sized feature representation. Thus, by processing the input textual data with distinct convolutional layers, one can learn k-gram-like information. Even though such a capability has been previously explored [1], [3], [5], to the best of our knowledge it was never applied over C-dimensional data, which may present additional learning constraints.

Inspired by [1], our models employ parallel convolutional layers over the input text T. Each of these layers uses f convolutional filters of length k, where the j-th filter in the i-th convolutional layer generates the feature map F_ij, whose x-th position is given by:

    F_{ij}^{x} = \phi\left( b_{ij} + \sum_{m=0}^{f_{i-1}} \sum_{p=0}^{k-1} w_{ijm}^{p}\, T_{(i-1)m}^{(x+p)} \right),    (2)

where φ is an activation function, b_ij is the bias of the respective convolutional filter, m iterates over the feature maps (channels), p indexes the position within the kernel, w_{ijm}^{p} is the filter weight, and T_{(i−1)m}^{(x+p)} is the value of the previous feature map (or of the input). Note that m iterates over the C dimensions of the input text T, while p iterates over the temporal dimension, building k-gram-like word embeddings.

The default version of CWE-C, as depicted in Figure 3, employs three parallel convolutional layers that are applied over the input data, containing {k = 1, k = 2, k = 3}. Each of those layers is padded so that the output size remains unchanged, and the resulting feature matrices can be merged for building a consensual representation. Finally, a pooling strategy (mean, max, or selfatt) is used to generate the final predictions.
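A hedged PyTorch sketch of the convolutional variant described above, with three parallel padded convolutions (k = 1, 2, 3), f = C filters each, and mean pooling. The ReLU non-linearity and the way the extra padded positions are trimmed are our assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn


class CWEC(nn.Module):
    """Convolutional sketch: parallel k-gram convolutions over class-sized embeddings."""

    def __init__(self, vocab_size, num_classes, kernel_sizes=(1, 2, 3), pad_idx=0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, num_classes, padding_idx=pad_idx)
        # p = 3 parallel branches, f = C filters each; padding keeps the temporal size.
        self.convs = nn.ModuleList(
            nn.Conv1d(num_classes, num_classes, k, padding=k - 1) for k in kernel_sizes
        )

    def forward(self, tokens):
        T = self.emb(tokens).transpose(1, 2)              # (batch, C, t) layout for Conv1d
        t = T.size(2)
        branches = []
        for conv in self.convs:
            feat = torch.relu(conv(T))[:, :, :t]          # trim extra padded positions back to t
            branches.append(feat.mean(dim=2))             # mean pooling over time -> (batch, C)
        return torch.stack(branches).mean(dim=0)          # average the k-gram branches
```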

D. CWE-R: A Recurrent Approach

Our final approach, hereafter called CWE-R, is based on fast and light recurrent operators that are designed to learn temporal relations. Differently from traditional RNNs such as LSTMs [18] and GRUs [19], CWE-R does not require any additional trainable weight matrices, since it applies non-linearity functions to the word embeddings themselves, allied to residual connections over distinct time steps.

Formally, a given text T that contains word-embedding vectors {ω_1, ω_2, ..., ω_t} is forwarded through CWE-R by activating the first word embedding, φ(ω_1), which is then added to the original state of the second word embedding, ω_2. This addition, which can also be seen as a temporal residual connection, resembles to some extent the input gate of the LSTM network. Note that by adding the non-linear word embedding to a linear one, distinct word orders would most likely produce distinct results, i.e., φ(ω_1) + ω_2 ≠ φ(ω_2) + ω_1. Therefore, CWE-R should be able to learn temporal relations across words. Note that although the current CWE-R definition uses the normalized last non-linear state, softmax(ω̂_t), for performing the sentence prediction ŷ, one could use any of the previously-discussed pooling strategies across all the t non-linear states as well.

Fig. 4. Model architecture of CWE-R.
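One plausible reading of the CWE-R recurrence, sketched in PyTorch; the choice of tanh as the non-linearity φ and the decision to return the last non-linear state as the prediction logits are assumptions on our part:

```python
import torch
import torch.nn as nn


class CWER(nn.Module):
    """Recurrent sketch: state_j = phi(state_{j-1}) + omega_j, with no extra weight matrices."""

    def __init__(self, vocab_size, num_classes, pad_idx=0):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, num_classes, padding_idx=pad_idx)

    def forward(self, tokens):
        T = self.emb(tokens)                         # (batch, t, C)
        state = T[:, 0, :]                           # omega_1
        for j in range(1, T.size(1)):
            state = torch.tanh(state) + T[:, j, :]   # temporal residual connection
        return torch.tanh(state)                     # last non-linear state; softmax gives y_hat
```

The loop adds no parameters beyond the embedding table itself, which is what keeps CWE-R light compared to LSTMs or GRUs.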

TABLE I. OVERVIEW OF THE TEXT CLASSIFICATION DATASETS.

Datasets | Training | Validation | Test | Total | #Characters/Inst | #Words/Inst | Vocabulary Size | Classes (C)
AGNews  | 112,400 | 7,600  | 7,600  | 127,600 | 239.62 ± 66.78  | 48.01 ± 13.78   | 35,065  | 4
DBPedia | 490,000 | 70,000 | 70,000 | 630,000 | 307.00 ± 139.34 | 57.69 ± 24.42   | 193,680 | 14
Twitter | 89,720  | 12,818 | 25,635 | 128,173 | 79.09 ± 37.22   | 18.37 ± 7.55    | 24,768  | 2
Yelp P. | 522,000 | 38,000 | 38,000 | 598,000 | 728.43 ± 670.08 | 156.66 ± 141.68 | 26,241  | 2

III. EXPERIMENTAL SETUP

In this section we detail the setup used for training and evaluating our models and the baseline approaches.

A. Datasets

We train and evaluate our models on four distinct text classification datasets: (i) AGNews, which contains short texts extracted from the AG corpus, classified according to 4 categories, namely Sports, Business, Sci/Tech, and World; it comprises 30k training samples and 1.9k test samples per class; (ii) DBPedia, which is labeled according to an ontology built with 14 categories from DBPedia 2014; each class has 40k training samples and 5k testing samples; (iii) Twitter [3], [20], a multilingual sentiment analysis dataset comprised of Tweets manually annotated according to the classes positive and negative; it comprehends Tweets in English, German, Spanish, and Portuguese; and (iv) Yelp Polarity [9], a large-scale sentiment analysis dataset that comprises more than 500,000 reviews of several products and places.

Given that most datasets do not provide a public validation set, we randomly selected instances from the training data according to the size of the test set. Table I presents an overview of the four datasets that were used in the experiments.

B. Hyper-parameters

All of our models were trained using Adam to minimize the categorical cross-entropy loss function. We used the default learning rate as suggested in [21], namely 1 × 10^-3. For building the default vocabularies for each dataset, we filtered rare words by keeping words with a minimum frequency of 4 occurrences within their respective corpus. Note that both CWE-BC and CWE-R are free from model hyper-parameters, only requiring the setup of the optimizer. On the other hand, CWE-C does take additional model hyper-parameters: p is the number of parallel convolutional layers, each one containing f filters with kernel size k. We used p = 3, k = {1, 2, 3}, and f = C when not specified otherwise.

Regarding the pooling strategy, mean pooling is the default option for all methods but CWE-R, which uses the last activation of the hidden state as the sentence representation. Finally, the hyper-parameter for controlling the size of the word embeddings with respect to the number of classes is by default γ = 1.
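The training configuration described above could be set up along the following lines. This is only a sketch: the helper name, the <pad>/<unk> tokens, and the reuse of the CWEBC sketch from Section II-A are our assumptions, not the authors' code.

```python
from collections import Counter

import torch
import torch.nn as nn


def build_vocab(tokenized_texts, min_freq=4):
    """Keep only words that occur at least `min_freq` times in the training corpus."""
    counts = Counter(word for text in tokenized_texts for word in text)
    vocab = {"<pad>": 0, "<unk>": 1}          # special tokens are our assumption
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab


# CWEBC: the Bag-of-Classes sketch from Section II-A (assumed available here).
model = CWEBC(vocab_size=35065, num_classes=4)             # AGNews-sized vocabulary (Table I)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # default Adam learning rate
criterion = nn.CrossEntropyLoss()                          # categorical cross-entropy over class scores
```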

IV. EXPERIMENTS

We first compare our models with the current state-of-the-art. Then, we analyze the impact of the pooling strategies for summarizing the temporal dimension: max, mean, or selfatt. Next, we evaluate the impact of varying γ on CWE-BC, CWE-R, and CWE-C. We also evaluate the impact of the vocabulary size for CWE-BC, CWE-R, and CWE-C.

A. State-of-the-art

In this section, we compare our methods with the state-of-the-art for text classification, depicted in the upper region of Table II. We measure predictive performance in terms of test set accuracy for each model in each dataset. Notably, our simplest approach, CWE-BC, which employs only bag-of-class-based word embeddings, was capable of achieving performance comparable to the state-of-the-art models in three out of four datasets, namely AGNews (91.8%), DBPedia (98.5%), and Twitter (72.8%). In addition, it outperformed much more complex and heavier models in at least one dataset, e.g., char-CNN [9], char-CRNN [23], FastText (unigram) [11], and VDCNN [13], the latter being a very deep character-level neural network that requires several hours to train a single epoch [11]. Table II also shows that CWE-BC and CWE-SA achieve comparable performance. Indeed, the use of the self-attention strategy for pooling did not help achieving better predictive performance, which is somewhat surprising given that it allows the model to more easily focus on the most important words across a given text. CWE-R also presented sound results, outperforming all methods but FastText (bigram) on AGNews. Overall, it performed in a similar fashion to CWE-BC for all datasets, which gives us the intuition that temporal relations are not that helpful for classifying text from the selected data. One can also observe that convolution-based approaches present better performance when trained with large datasets such as Yelp Polarity, in which CWE-C achieves 94.3% accuracy, only 1.4% behind both VDCNN and FastText (bigram), despite being much faster for classification and requiring fewer trainable parameters.

TABLE II. TEXT CLASSIFICATION RESULTS.

Model | AGNews | DBPedia | Yelp P. | Twitter
char-CNN [9] | 87.2 | 98.3 | 94.7 | 70.6
char-CRNN [23] | 91.4 | 98.6 | 94.5 | -
VDCNN [12] | 91.3 | 98.7 | 95.7 | -
Conv [1] | - | - | - | 71.8
Conv-Char-S [3] | - | - | - | 72.0
fastText h = 10 [11] | 91.5 | 98.1 | 93.8 | 71.3
fastText h = 10 bigram [11] | 92.5 | 98.6 | 95.7 | -
CHAIN-v1 [5] | 91.5 | 98.6 | - | 73.5
CWE-BC | 91.8 | 98.5 | 93.8 | 72.8
CWE-SA | 91.9 | 98.3 | 93.5 | 72.2
CWE-R | 92.1 | 98.4 | 93.7 | 72.5
CWE-C | 91.2 | 98.0 | 94.3 | 71.1

B. Ablation Study

Validation performance during training. We first compare the training steps for each approach. Figure 5 shows this comparison on the AGNews, DBPedia, and Yelp Polarity datasets. Results show that CWE-BC takes more steps to converge compared to CWE-C and CWE-R, since it is a linear model. CWE-C and CWE-R have similar training convergence, except for the Yelp Polarity dataset, in which CWE-R converges faster than CWE-C. Finally, CWE-R and CWE-C converge up to 9× faster than CWE-BC.

Fig. 5. Validation accuracy during training of CWE (AGNews, DBPedia, and Yelp Polarity).

Pooling strategy. We analyze the impact of the pooling strategy for summarizing the temporal dimension. We provide results by using max, mean, and selfatt pooling layers in all our methods. For CWE-R, since it is based on a recurrent formulation, we also provide experiments using the last activation of the hidden state as the sentence representation instead of summarizing the temporal dimension via pooling, namely hs. Table III shows the impact of distinct pooling strategies on AGNews, Twitter, and DBPedia. Results show that all approaches but CWE-BC seem to be quite robust to the pooling strategy. For instance, CWE-R and CWE-C present only a slight variation when changing the pooling strategy. We believe that such a robustness is due to the fact that, differently from CWE-BC, both CWE-R and CWE-C are capable of approximating non-linear functions, which makes it easier for those models to approximate complex functions regardless of the pooling strategy. Finally, CWE-BC underperforms in all datasets with max-pooling while presenting virtually the same results with mean and selfatt.

TABLE III. IMPACT OF THE POOLING STRATEGY.

Dataset | Method | max | mean | selfatt | hs
AGNews | CWE-BC | 88.0 | 91.9 | 91.9 | -
AGNews | CWE-R | 92.3 | 92.1 | 92.2 | 92.1
AGNews | CWE-C | 90.4 | 91.5 | 91.2 | -
Twitter | CWE-BC | 67.0 | 72.3 | 72.2 | -
Twitter | CWE-R | 72.5 | 72.4 | 72.3 | 72.5
Twitter | CWE-C | 71.2 | 72.2 | 72.2 | -
DBPedia | CWE-BC | 96.7 | 98.5 | 98.3 | -
DBPedia | CWE-R | 98.4 | 98.3 | 98.4 | 98.4
DBPedia | CWE-C | 98.1 | 98.4 | 97.8 | -

Impact of γ. Table IV depicts the predictive performance of CWE-BC, CWE-R, and CWE-C when varying γ ∈ {10, 50, 100, 150}, which is the number of dimensions per class (default = 1). Hence, each model is trained with word embeddings of C × γ dimensions in order to evaluate the impact of the embedding size. Especially for this set of experiments, given that the original word-space is larger than the class space, we use a linear layer of shape (C × γ) × C to project the word-space onto the class space (a sketch of this projection is shown at the end of this subsection). Clearly CWE-C is the method that best leverages higher-dimensional input data, achieving 95.4% accuracy on Yelp Polarity when γ = 150, which is only 0.3% behind the state-of-the-art approaches, namely VDCNN and FastText (bigram). Results also show that both CWE-BC and CWE-R in their original incarnations are not affected by larger values of γ, presenting virtually the same results in all experiments.

TABLE IV. IMPACT OF THE CLASS-EMBEDDING SIZE γ.

γ | CWE-BC | CWE-R | CWE-C
1 | 93.8 | 93.7 | 94.3
10 | 93.9 | 94.1 | 94.8
50 | 93.9 | 94.4 | 95.1
100 | 93.8 | 94.2 | 95.3
150 | 93.8 | 94.4 | 95.4

Impact of the vocabulary size. Figure 6 depicts the impact of the vocabulary size on the AGNews dataset for three different approaches, namely CWE-BC, CWE-R, and CWE-C. Figure 6-(a) shows the performance when filtering relatively rare words, i.e., with frequency thresholds ranging from 25 to 100; Figure 6-(b) depicts the effect of pruning the vocabulary by filtering more frequent words, i.e., with thresholds ranging between 200 and 1000. We defined those thresholds according to the word distribution in the AGNews training set. Results show, in all cases, a large performance drop as the word frequency threshold increases. One can observe that such a drop presents a linear correlation to the word frequency threshold. We believe that when limiting the vocabulary, trained models may suffer from underfitting.

Fig. 6. Impact of the vocabulary size on the validation set accuracy for CWE-BC, CWE-R, and CWE-C. (a) The impact of filtering rare words; and (b) the impact of removing more frequent ones.
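For the γ > 1 experiments of Table IV, the (C × γ) × C projection mentioned above could be realized as in the following sketch; the module name and the exact placement of the linear layer are our assumptions:

```python
import torch.nn as nn


class ClassProjection(nn.Module):
    """gamma > 1 sketch: C * gamma-dimensional embeddings mapped back onto the C classes."""

    def __init__(self, vocab_size, num_classes, gamma):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, num_classes * gamma)
        self.proj = nn.Linear(num_classes * gamma, num_classes)   # the (C x gamma) x C layer

    def forward(self, tokens):
        return self.proj(self.emb(tokens))    # (batch, t, C), pooled as in the gamma = 1 models
```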


C. Time Analysis

Table V shows the time (in seconds) each model takes to train a single epoch. To provide a fair comparison, we reimplemented all methods in PyTorch and trained them on the same hardware, which generated slightly different results when compared to the original FastText report [11]. All time-related experiments were processed on a MacBook Pro, 2 GHz Intel Core i5, 16 GB 1867 MHz LPDDR3 RAM, 512 GB SSD. Results show that CWE-BC is the fastest method to train, taking only 14.7 seconds to train an epoch on Yelp Polarity. Even CWE-C was capable of outperforming FastText in some datasets. As expected, CWE-R is slower when compared to the other CWE incarnations, mostly due to its recurrent nature. Moreover, our methods train up to three orders of magnitude faster than classic convolutional approaches such as VDCNN [12] and ConvChar [9].

TABLE V. TIME ANALYSIS (IN SECONDS).

Method | AGNews | DBPedia | Twitter | Yelp P.
VDCNN (d=29) [12] | 3 × 10^3 | 3.6 × 10^3 | - | 4.1 × 10^4
ConvChar [9] | 1.1 × 10^4 | 1.8 × 10^4 | - | -
FastText [11] | 3.4 | 17.8 | 2.4 | 60.02
CWE-BC | 2.6 | 12.4 | 1.9 | 14.7
CWE-C | 5.1 | 26.9 | 2.2 | 54.6
CWE-R | 6.2 | 45.3 | 2.8 | 129.6

V. QUALITATIVE ANALYSIS

Our models are easier to visualize and understand given that they learn word embeddings that are trained directly over the class space. In this section, we show examples of word-embedding visualizations without using costly techniques such as t-SNE. In addition, we also depict plots of word-by-word predictions in the learned Euclidean class space. All the following analyses were generated using the first approach, namely CWE-BC, trained on Yelp Polarity. Since it was trained for sentiment analysis, we can directly visualize the trained word embeddings within a bidimensional chart.

A. Embedding Visualization

Our first qualitative analysis regards the visualization of the trained word embeddings.
Figure 7 shows two distinct charts: (a) the spatial organization of the ten words with the largest magnitude for the positive class; and (b) the top ten words with the largest magnitude for the negative class. In this visualization, we can see that the word with the largest positive score is perfection, while for the negative class the word is worst. In addition, we can also observe that the classification of the sentence "pleasantly disappointment" should be neutral, i.e., ≈ 0 for both classes, given that the words pleasantly and disappointment present almost perfectly opposite weights.

Fig. 7. Visualization of the learned word embeddings in the class space. (a) The ten words with the largest score for the positive class. (b) The ten words with the largest score for the negative class.
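Because the embeddings live directly in the two-dimensional class space, a plot like Figure 7 can be produced with a few lines of matplotlib. The sketch below assumes the CWE-BC sketch from Section II-A, an index-to-word list `itos`, and that column 1 of the embedding table corresponds to the positive class (all assumptions on our part):

```python
import matplotlib.pyplot as plt
import torch


def plot_top_words(model, itos, k=10):
    """Scatter the k words with the largest positive-class score in the 2-D class space."""
    W = model.emb.weight.detach()                 # (vocab, 2); column order is an assumption
    top = torch.topk(W[:, 1], k).indices          # strongest words for the positive class
    for idx in top:
        i = int(idx)
        x, y = W[i, 0].item(), W[i, 1].item()
        plt.scatter(x, y, color="tab:blue")
        plt.annotate(itos[i], (x, y))
    plt.xlabel("Negative Class Score")
    plt.ylabel("Positive Class Score")
    plt.show()
```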


B. Prediction Visualization

Figure 8 shows detailed word-by-word predictions for two excerpts of Yelp Polarity validation reviews, one for each class. Each marker on the charts depicts an intermediate prediction after all previous words. Figure 8-(a) explains the complete prediction for the text "their food is good and super healthy", in which it is easy to see that the words in {their, food, is} present neutral content (near-zero values for both classes), while the words in {good, super, healthy} largely increase the prediction score for the positive class. Figure 8-(b) details the prediction for the review "I liked the bar. The food not at all.". In this case, despite the fact that this review contains the positive-biased word liked, the following sequence {not, at, all} is responsible for increasing the negative scores, ultimately causing the review to be assigned to the negative class. Once again we can observe that there are neutral words that present slight biases that do not largely affect the predictions, e.g., those in {I, the, bar, food, at}.

Fig. 8. Visualization of the word-by-word prediction generated by CWE-BC. The horizontal line depicts the decision boundary between the positive and negative classes, so positive values on the x-axis generate positive-class predictions whilst negative values on the x-axis result in negative-class predictions. (a) The detailed prediction of a positive-class review. (b) The detailed prediction of a negative-class review.
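A word-by-word trace like the one in Figure 8 can be computed by accumulating the class-space embeddings of a review one word at a time. The sketch below assumes the CWE-BC sketch with mean pooling, a word-to-index dictionary `stoi`, and whitespace tokenization (all assumptions):

```python
import torch


def prediction_trace(model, stoi, review):
    """Running class scores after each word of a review, as in the Figure 8 plots."""
    weights = model.emb.weight.detach()               # (vocab, C) class-space embeddings
    scores = torch.zeros(weights.size(1))
    trace = []
    for i, word in enumerate(review.lower().split(), start=1):
        idx = stoi.get(word, stoi.get("<unk>", 0))
        scores = scores + weights[idx]
        trace.append((word, (scores / i).tolist()))   # running mean = CWE-BC prediction so far
    return trace
```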
representations can be obtained by incorporating knowledge [5] J. Wehrmann and R. C. Barros, “Bidirectional retrieval made simple,” in The IEEE Conference on Computer Vision and of document structure into the model architecture. The work (CVPR), June 2018. of [16], in turn, introduces the self-attention mechanism along [6] J. Wehrmann, A. Mattjie, and R. C. Barros, “Order embeddings and with a regularization term for regulating the importance of character-level convolutions for multimodal alignment,” Pattern Recog- nition Letters, vol. 102, pp. 15–22, 2018. diversity within the self-attention activation map. [7] J. Wehrmann, R. Cerri, and R. Barros, “Hierarchical multi-label classi- FastText [11] is a model based solely on averaging word fication networks,” in International Conference on , embeddings for building bag-of-words-like representations of 2018, pp. 5225–5234. [8] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and sentences. The default incarnation of FastText uses only uni- R. Ward, “Deep sentence embedding using long short-term memory gram information, fully discarding temporal information. The networks: Analysis and application to information retrieval,” IEEE/ACM authors also propose the use of n-gram-based embeddings, Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 4, pp. 694–707, 2016. which led to state-of-the-art text classification results while [9] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional training up to two orders of magnitude faster than the baseline networks for text classification,” in Advances in neural information approaches. On the other hand, the use of additional n-gram processing systems, 2015, pp. 649–657. [10] J. Wehrmann, W. E. Becker, and R. C. Barros, “A multi-task neural information requires much larger amounts of memory to store network for multilingual sentiment classification and language detection all the trained embeddings. Note that our architecture in this on twitter,” Symposium on Applied Computing, vol. 2, no. 32, p. 37, paper is orthogonal to the work that perform quantization and 2018. [11] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for compression of the models [27]. Our models are much lighter efficient text classification,” arXiv preprint arXiv:1607.01759, 2016. and faster due to their original design, which does not prevent [12] A. Conneau and H. Schwenk, “Very deep convolutional networks for them to be compressed or quantized. natural language processing,” 2016. [13] A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very deep convolutional networks for text classification,” arXiv preprint VII.CONCLUSION arXiv:1606.01781, 2016. In this paper, we present four efficient, light, and very fast [14] D. Tang, B. Qin, and T. Liu, “Document modeling with gated recurrent neural network for sentiment classification,” in Proceedings of the 2015 methods for text classification. All of them follow the same conference on empirical methods in natural language processing, 2015, principle, which is to generate word-embedding vectors by pp. 1422–1432. exploring direct relations across the word-embedding space [15] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008. and the target class space. By doing so, our approaches are [16] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and quite fast at both training and inference while using less Y. 
REFERENCES

[1] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[2] S. Rosenthal, N. Farra, and P. Nakov, "SemEval-2017 task 4: Sentiment analysis in Twitter," in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017, pp. 502–518.
[3] J. Wehrmann, W. Becker, H. E. Cagnini, and R. C. Barros, "A character-based convolutional neural network for language-agnostic Twitter sentiment analysis," in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 2384–2391.
[4] W. Becker, J. Wehrmann, H. E. L. Cagnini, and R. C. Barros, "An efficient deep neural architecture for multilingual sentiment analysis in Twitter," in Proceedings of the Thirtieth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2017, Marco Island, Florida, USA, May 22-24, 2017, pp. 246–251.
[5] J. Wehrmann and R. C. Barros, "Bidirectional retrieval made simple," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[6] J. Wehrmann, A. Mattjie, and R. C. Barros, "Order embeddings and character-level convolutions for multimodal alignment," Pattern Recognition Letters, vol. 102, pp. 15–22, 2018.
[7] J. Wehrmann, R. Cerri, and R. Barros, "Hierarchical multi-label classification networks," in International Conference on Machine Learning, 2018, pp. 5225–5234.
[8] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward, "Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 4, pp. 694–707, 2016.
[9] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[10] J. Wehrmann, W. E. Becker, and R. C. Barros, "A multi-task neural network for multilingual sentiment classification and language detection on Twitter," Symposium on Applied Computing, vol. 2, no. 32, p. 37, 2018.
[11] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
[12] A. Conneau and H. Schwenk, "Very deep convolutional networks for natural language processing," 2016.
[13] A. Conneau, H. Schwenk, L. Barrault, and Y. LeCun, "Very deep convolutional networks for text classification," arXiv preprint arXiv:1606.01781, 2016.
[14] D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1422–1432.
[15] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[16] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," arXiv preprint arXiv:1703.03130, 2017.
[17] J. Wehrmann, M. A. Lopes, M. D. More, and R. C. Barros, "Fast self-attentive multimodal retrieval," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1871–1878.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[20] I. Mozetič, M. Grčar, and J. Smailović, "Multilingual Twitter sentiment classification: The role of human annotators," PLoS ONE, vol. 11, no. 5, p. e0155036, 2016.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] C. Zhou, C. Sun, Z. Liu, and F. Lau, "A C-LSTM neural network for text classification," arXiv preprint arXiv:1511.08630, 2015.
[23] Y. Xiao and K. Cho, "Efficient character-level document classification by combining convolution and recurrent layers," arXiv preprint arXiv:1602.00367, 2016.
[24] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in European Conference on Machine Learning. Springer, 1998, pp. 137–142.
[25] S. Wang and C. D. Manning, "Baselines and bigrams: Simple, good sentiment and topic classification," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. Association for Computational Linguistics, 2012, pp. 90–94.
[26] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," pp. 1480–1489, 2016.
[27] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, "Fasttext.zip: Compressing text classification models," CoRR, vol. abs/1612.03651, 2016. [Online]. Available: http://arxiv.org/abs/1612.03651
