Text Classification Using Label Names Only: A Language Model Self-Training Approach

Yu Meng¹, Yunyi Zhang¹, Jiaxin Huang¹, Chenyan Xiong², Heng Ji¹, Chao Zhang³, Jiawei Han¹
¹University of Illinois at Urbana-Champaign, IL, USA
²Microsoft Research, WA, USA
³Georgia Institute of Technology, GA, USA
¹{yumeng5, yzhan238, jiaxinh3, hengji, hanj}@illinois.edu, [email protected], [email protected]

Abstract

Current text classification methods typically require a good number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples, relying only on a small set of words describing the categories to be classified. In this paper, we explore the potential of using only the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets, covering topic and sentiment classification, without using any labeled documents, learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.¹

¹Source code can be found at https://github.com/yumeng5/LOTClass.

1 Introduction

Text classification is a classic and fundamental task in Natural Language Processing (NLP) with a wide spectrum of applications such as question answering (Rajpurkar et al., 2016), spam detection (Jindal and Liu, 2007) and sentiment analysis (Pang et al., 2002). Building an automatic text classification model has been viewed as a task of training machine learning models from human-labeled documents. Indeed, many deep learning-based classifiers including CNNs (Kim, 2014; Zhang et al., 2015) and RNNs (Tang et al., 2015a; Yang et al., 2016) have been developed and have achieved great success when trained on large-scale labeled documents (usually over tens of thousands), thanks to their strong representation learning power that effectively captures the high-order, long-range semantic dependency in text sequences for accurate classification.

Recently, increasing attention has been paid to semi-supervised text classification, which requires a much smaller amount of labeled data. The success of semi-supervised methods stems from the use of abundant unlabeled data: unlabeled documents provide natural regularization for constraining the model predictions to be invariant to small changes in input (Chen et al., 2020; Miyato et al., 2017; Xie et al., 2019), thus improving the generalization ability of the model. Despite mitigating the annotation burden, semi-supervised methods still require manual efforts from domain experts, which might be difficult or expensive to obtain, especially when the number of classes is large.

Contrary to existing supervised and semi-supervised models, which learn from labeled documents, a human expert only needs to understand the label name (i.e., a single or a few representative words) of each class to classify documents. For example, we can easily classify news articles when given label names such as "sports", "business", and "politics", because we are able to understand these topics based on prior knowledge.

In this paper, we study the problem of weakly-supervised text classification where only the label name of each class is provided to train a classifier on purely unlabeled data. We propose a language model self-training approach wherein a pre-trained neural language model (LM) (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2018; Yang et al., 2019) is used both as the general knowledge source for category understanding and as the feature representation learning model for classification. The LM creates contextualized word-level category supervision from unlabeled data to train itself, and then generalizes to document-level classification via a self-training objective.

Specifically, we propose the LOTClass model for Label-Name-Only Text Classification, built in three steps: (1) We construct a category vocabulary for each class that contains words semantically correlated with the label name, using a pre-trained LM. (2) The LM collects high-quality category-indicative words in the unlabeled corpus to train itself to capture category-distinctive information with a contextualized word-level category prediction task. (3) We generalize the LM via document-level self-training on abundant unlabeled data.
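To give a concrete feel for step (3), the sketch below shows a generic form of document-level self-training on unlabeled documents: the classifier's current predictions are sharpened into soft pseudo-targets, and the model is then trained to match them. This is a minimal illustration under assumptions, not the authors' implementation: the squared-and-renormalized target distribution and the KL-divergence loss are common choices assumed here for concreteness, the `model`/`batch`/`optimizer` interfaces are hypothetical, and LOTClass's exact self-training objective is specified later in the paper.

```python
# Minimal sketch of generic document-level self-training on unlabeled data.
# Illustrative only -- not the LOTClass implementation. The target construction
# (square class probabilities, then renormalize) and the KL loss are assumed
# common choices; `model` is any document classifier returning class logits.
import torch
import torch.nn.functional as F

def soft_targets(probs: torch.Tensor) -> torch.Tensor:
    """Sharpen predicted class probabilities of shape (batch, num_classes)
    into soft pseudo-targets by squaring and renormalizing."""
    weight = probs ** 2 / probs.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def self_train_step(model, batch, optimizer):
    """One self-training step on a batch of unlabeled documents."""
    logits = model(batch)                                # (batch, num_classes)
    with torch.no_grad():
        targets = soft_targets(torch.softmax(logits, dim=-1))
    # Train the model to match its own sharpened predictions: KL(target || model).
    loss = F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```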
LOTClass achieves around 90% accuracy on four benchmark text classification datasets, the AG News, DBPedia, IMDB and Amazon corpora, without learning from any labeled data and using only at most 3 words (1 word in most cases) per class as the label name, significantly outperforming existing weakly-supervised methods and yielding performance comparable even to strong semi-supervised and supervised models.

The contributions of this paper are as follows:

• We propose a weakly-supervised text classification model, LOTClass, based on a pre-trained neural LM without any further dependencies². LOTClass does not need any labeled documents but only the label name of each class.

• We propose a method for finding category-indicative words and a contextualized word-level category prediction task that trains the LM to predict the implied category of a word using its contexts. The LM so trained generalizes well to document-level classification upon self-training on an unlabeled corpus.

• On four benchmark datasets, LOTClass significantly outperforms weakly-supervised models and has comparable performance to strong semi-supervised and supervised models.

²Other semi-supervised/weakly-supervised methods usually take advantage of distant supervision like a Wikipedia dump (Chang et al., 2008), or augmentation systems like trained back translation models (Xie et al., 2019).

2 Related Work

2.1 Neural Language Models

Pre-training deep neural models for language modeling, including autoregressive LMs such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018) and XLNet (Yang et al., 2019) and autoencoding LMs such as BERT (Devlin et al., 2019) and its variants (Lan et al., 2020; Lewis et al., 2020; Liu et al., 2019b), has brought astonishing performance improvement to a wide range of NLP tasks, mainly for two reasons: (1) LMs are pre-trained on large-scale text corpora, which allows the models to learn generic linguistic features (Tenney et al., 2019) and to serve as knowledge bases (Petroni et al., 2019); and (2) LMs enjoy strong feature representation learning power, capturing high-order, long-range dependency in texts thanks to the Transformer architecture (Vaswani et al., 2017).

2.2 Semi-Supervised and Zero-Shot Text Classification

For semi-supervised text classification, two lines of frameworks have been developed to leverage unlabeled data. Augmentation-based methods generate new instances and regularize the model's predictions to be invariant to small changes in input. The augmented instances can be created either as real text sequences (Xie et al., 2019) via back translation (Sennrich et al., 2016) or in the hidden states of the model via perturbations (Miyato et al., 2017) or interpolations (Chen et al., 2020). Graph-based methods (Tang et al., 2015b; Zhang et al., 2020) build text networks with words, documents and labels, and propagate labeling information along the graph via embedding learning (Tang et al., 2015c) or graph neural networks (Kipf and Welling, 2017).
Zero-shot text classification generalizes a classifier trained on a known label set to an unknown one without using any new labeled documents. Transferring knowledge from seen classes to unseen ones typically relies on semantic attributes and descriptions of all classes (Liu et al., 2019a; Pushp and Srivastava, 2017; Xia et al., 2018), correlations among classes (Rios and Kavuluru, 2018; Zhang et al., 2019), or joint embeddings of classes and documents (Nam et al., 2016). However, zero-shot learning still requires labeled data for the seen label set and cannot be applied to cases where no labeled documents for any class are available.

2.3 Weakly-Supervised Text Classification

Weakly-supervised text classification aims to categorize text documents based only on word-level descriptions of each category, eschewing the need for any labeled documents. Early attempts rely on distant supervision such as Wikipedia to interpret the label name semantics and derive document-concept relevance via explicit semantic analysis (Gabrilovich and Markovitch, 2007). Since the classifier is learned purely from general knowledge, without even requiring any unlabeled domain-specific data, these methods are called dataless classification (Chang et al., 2008; Song and Roth, 2014; Yin et al., 2019). Later, topic models (Chen et al., 2015; Li et al., 2016) are exploited for seed-guided classification to learn seed-word-aware topics by biasing the Dirichlet priors and to infer posterior document-topic assignments. Recently, neural approaches (Mekala and Shang, 2020; Meng et al., 2018, 2019) have been developed for weakly-supervised text classification.

For each occurrence of a label name in the corpus, its contextualized embedding vector $\boldsymbol{h}$ produced by the BERT encoder is fed to the MLM head, which outputs a probability distribution over the entire vocabulary $V$, indicating the likelihood of each word $w$ appearing at this position:

$$p(w \mid \boldsymbol{h}) = \mathrm{Softmax}\left(\boldsymbol{W}_2\,\sigma(\boldsymbol{W}_1 \boldsymbol{h} + \boldsymbol{b})\right), \quad (1)$$

where $\sigma(\cdot)$ is the activation function; $\boldsymbol{W}_1 \in \mathbb{R}^{h \times h}$, $\boldsymbol{b} \in \mathbb{R}^{h}$, and $\boldsymbol{W}_2 \in \mathbb{R}^{|V| \times h}$ are learnable parameters that have been pre-trained with the MLM objective of BERT.

Table 1 shows the pre-trained MLM predictions for the top words (sorted by $p(w \mid \boldsymbol{h})$) to replace the original label name "sports" under two different contexts.
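As a concrete illustration of Eq. (1) and of the kind of output summarized in Table 1, the sketch below feeds a sentence containing the label name "sports" through a pre-trained BERT MLM head and prints the most likely replacement words for that position. It is a minimal sketch under assumptions: the Hugging Face transformers library, the bert-base-uncased checkpoint, and the example sentence are illustrative choices, not taken from the paper, and the library's MLM head corresponds to the $\boldsymbol{W}_1$, $\boldsymbol{W}_2$, $\boldsymbol{b}$ parametrization of Eq. (1) only up to implementation details.

```python
# Minimal sketch of Eq. (1): inspect the MLM distribution p(w | h) at the
# position of the (unmasked) label name "sports". The checkpoint and the
# example sentence are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The Lakers won the championship, a great night for sports fans."
inputs = tokenizer(text, return_tensors="pt")

# Locate the label name "sports" in the tokenized sequence.
label_id = tokenizer.convert_tokens_to_ids("sports")
pos = (inputs["input_ids"][0] == label_id).nonzero(as_tuple=True)[0][0]

with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, |V|) pre-softmax scores

# Softmax over the vocabulary gives p(w | h) for this position, as in Eq. (1).
probs = torch.softmax(logits[0, pos], dim=-1)
top = torch.topk(probs, k=10)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```

Collecting such top-ranked replacement words over many occurrences of a label name in the corpus is one natural way to realize the category vocabulary of step (1).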