
Contextualized Weak Supervision for Text Classification

Dheeraj Mekala¹  Jingbo Shang¹,²
¹ Department of Computer Science and Engineering, University of California San Diego, CA, USA
² Halıcıoğlu Data Science Institute, University of California San Diego, CA, USA
{dmekala, jshang}@ucsd.edu

Abstract

Weakly supervised text classification based on a few user-provided seed words has recently attracted much attention from researchers. Existing methods mainly generate pseudo-labels in a context-free manner (e.g., string matching); therefore, the ambiguous, context-dependent nature of human language has long been overlooked. In this paper, we propose a novel framework, ConWea, which provides contextualized weak supervision for text classification. Specifically, we leverage contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of the same word, and thus create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand the seed words in an iterative manner. This process not only adds new contextualized, highly label-indicative keywords but also disambiguates the initial seed words, making our weak supervision fully contextualized. Extensive experiments and case studies on real-world datasets demonstrate the necessity and significant advantages of using contextualized weak supervision, especially when the class labels are fine-grained.

1 Introduction

Weak supervision in text classification has recently attracted much attention from researchers because it alleviates the burden on human experts of annotating massive documents, especially in specific domains. One of the popular forms of weak supervision is a small set of user-provided seed words for each class. Typical seed-driven methods follow an iterative framework: generate pseudo-labels using some heuristics, learn the mapping between documents and classes, and expand the seed set (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015; Meng et al., 2018).
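To make this loop concrete, the following is a minimal sketch of the generic iterative framework, assuming documents are lists of tokens and pseudo-labels come from context-free string matching; the helpers `train_classifier` and `expand_seeds` are hypothetical placeholders, not code from any of the cited systems:

```python
from collections import Counter

def pseudo_label(doc_tokens, seed_words):
    """Context-free pseudo-labeling heuristic: pick the class whose
    seed words occur most often in the document (no hits / tie -> None)."""
    counts = Counter({label: sum(doc_tokens.count(s) for s in seeds)
                      for label, seeds in seed_words.items()})
    best = counts.most_common(2)
    if best[0][1] == 0 or (len(best) > 1 and best[0][1] == best[1][1]):
        return None  # no evidence, or ambiguous between two classes
    return best[0][0]

def seed_driven_loop(docs, seed_words, train_classifier, expand_seeds, T=5):
    """Generic iterative framework: pseudo-label -> train -> expand seeds."""
    classifier = None
    for _ in range(T):
        labeled = []
        for d in docs:
            y = pseudo_label(d, seed_words)
            if y is not None:
                labeled.append((d, y))
        classifier = train_classifier(labeled)                    # learn doc -> class
        seed_words = expand_seeds(classifier, docs, seed_words)   # grow the seed sets
    return classifier, seed_words
```

Note that the counting inside `pseudo_label` is pure string matching, blind to context; this is exactly the weakness discussed next.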
Most, if not all, existing methods generate pseudo-labels in a context-free manner; therefore, the ambiguous, context-dependent nature of human language has long been overlooked. Suppose the user gives "penalty" as a seed word for the sports class, as shown in Figure 1. The word "penalty" has at least two different meanings: the penalty in sports-related documents and the fine or death penalty in law-related documents. If the pseudo-label of a document is decided based only on the frequency of seed words, some documents about law may be mislabelled as sports. More importantly, such errors will further introduce wrong seed words, and thus be propagated and amplified over the iterations.

In this paper, we introduce contextualized weak supervision to train a text classifier based on user-provided seed words. The "contextualized" here is reflected in two places: the corpus and the seed words. Every word occurrence in the corpus may be interpreted differently according to its context; every seed word, if ambiguous, must be resolved according to its user-specified class. In this way, we aim to improve the accuracy of the final text classifier.

We propose a novel framework, ConWea, as illustrated in Figure 1. It leverages contextualized representation learning techniques, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), together with user-provided seed information to first create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand the seed words in an iterative manner. During this process, contextualized seed words are introduced by expanding and disambiguating the initial seed words. Specifically, for each word, we develop an unsupervised method to adaptively decide its number of interpretations and, accordingly, group all its occurrences based on their contextualized representations. We design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. We repeat the iterative classification and seed-word expansion process until convergence.

[Figure 1: Our proposed contextualized weakly supervised method leverages BERT to create a contextualized corpus. This contextualized corpus is further utilized to resolve interpretations of seed words (e.g., "penalty$0" vs. "penalty$1"), generate pseudo-labels, train a classifier, and expand the seed set in an iterative fashion.]

To the best of our knowledge, this is the first work on contextualized weak supervision for text classification. It is also worth mentioning that our proposed framework is compatible with almost any contextualized representation learning model and text classification model. Our contributions are summarized as follows:
• We propose a novel framework enabling contextualized weak supervision for text classification.
• We develop an unsupervised method to automatically group occurrences of the same word into an adaptive number of interpretations, based on contextualized representations and user-provided seed information.
• We design a principled ranking mechanism to identify words that are discriminative and highly label-indicative.
• We have performed experiments on real-world datasets for both coarse- and fine-grained text classification tasks. The results demonstrate the superiority of using contextualized weak supervision, especially when the labels are fine-grained. Our code is publicly available on GitHub (https://github.com/dheeraj7596/ConWea).

2 Overview

Problem Formulation. The input to our problem contains (1) a collection of n text documents D = {D_1, D_2, ..., D_n} and (2) m target classes C = {C_1, C_2, ..., C_m} with their seed words S = {S_1, S_2, ..., S_m}. We aim to build a high-quality document classifier from these inputs, assigning a class label C_j ∈ C to each document D_i ∈ D. Note that all these words could be upgraded to phrases if phrase mining techniques (Liu et al., 2015; Shang et al., 2018) were applied as pre-processing. In this paper, we stick to words.

Framework Overview. We propose a framework, ConWea, enabling contextualized weak supervision. Here, "contextualized" is reflected in two places: the corpus and the seed words. Therefore, we have developed two novel techniques accordingly to make both contextualizations happen.

First, we leverage contextualized representation learning techniques (Peters et al., 2018; Devlin et al., 2019) to create a contextualized corpus. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector for every word occurrence.
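A minimal sketch of this step using the HuggingFace `transformers` library; the `bert-base-uncased` checkpoint and the mean-pooling of sub-word pieces are illustrative assumptions, not necessarily the exact implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any BERT-style encoder with a fast tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def occurrence_vectors(sentence, target_word):
    """Return one contextualized vector per occurrence of target_word.
    Assumes punctuation-free, whitespace-separated, lower-cased input."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_sub_words, 768)
    word_ids = enc.word_ids(0)  # maps each sub-word position to a word index
    vectors = []
    for w_idx, word in enumerate(sentence.split()):
        if word == target_word:
            rows = [i for i, wid in enumerate(word_ids) if wid == w_idx]
            vectors.append(hidden[rows].mean(dim=0))  # pool sub-word pieces
    return vectors

# "penalty" receives a different vector in each context:
v_sports = occurrence_vectors("messi scored the penalty", "penalty")
v_law = occurrence_vectors("the court issued a penalty", "penalty")
```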
We assume the user-provided seed words are of reasonable quality: the majority of the seed words are not ambiguous, and the majority of the occurrences of each seed word are about the semantics of the user-specified class. Based on these two assumptions, we develop an unsupervised method to automatically group occurrences of the same word into an adaptive number of interpretations, harvesting the contextualized corpus.

Second, we design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. Specifically, we start with all possible interpretations of the seed words and train a neural classifier. Based on its predictions, we compare and contrast the documents belonging to different classes, and rank contextualized words by how label-indicative, frequent, and unusual they are. During this process, we eliminate the wrong interpretations of the initial seed words and also add more highly label-indicative contextualized words.
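One plausible way to combine these three signals is sketched below; the concrete scoring functions here are illustrative assumptions, not the exact formulas used by ConWea. Given the classifier's predicted labels, each contextualized word is scored by how concentrated it is in the target class, how frequent it is within that class, and how rare it is corpus-wide:

```python
import math
from collections import Counter

def rank_candidates(docs, predicted, target_class, top_k=10):
    """Sketch of comparative ranking over contextualized words such as
    "penalty$0": combines label-indicativeness, in-class frequency, and
    an IDF-style rarity term (all three choices are illustrative)."""
    in_cls, out_cls = Counter(), Counter()
    for doc, label in zip(docs, predicted):
        for w in set(doc):  # document frequency, not raw counts
            (in_cls if label == target_class else out_cls)[w] += 1
    n_in = max(1, sum(1 for p in predicted if p == target_class))
    scores = {}
    for w, f in in_cls.items():
        label_indicative = f / (f + out_cls[w])           # concentrated in class?
        frequent = f / n_in                                # common within class?
        unusual = math.log(len(docs) / (f + out_cls[w]))   # rare corpus-wide?
        scores[w] = label_indicative * frequent * unusual
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```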
This entire process is visualized in Figure 1. We denote the number of iterations between classifier training and seed-word expansion as T, which is the only hyper-parameter in our framework.

[Figure 2: Document contextualization examples using the words "windows" and "penalty": (a) similarity distribution for "windows"; (b) cluster visualisation for "windows"; (c) cluster visualisation for "penalty". τ is decided based on the similarity distributions of all seed word occurrences. Two clusters are discovered for each of the two words.]

Choice of Clustering Methods. We model the word-occurrence disambiguation problem as a clustering problem. Specifically, we propose to use the K-Means algorithm (Jain and Dubes, 1988) to cluster all contextualized representations b_{w_i} of a word's occurrences into K clusters, where K is the number of interpretations. We prefer K-Means because (1) the cosine similarity and Euclidean distance …
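A minimal sketch of this clustering step with scikit-learn; choosing K itself (via the threshold τ above) is omitted. The normalization relies on the standard identity that, for unit vectors, ||a − b||² = 2 − 2 cos(a, b), so Euclidean K-Means then behaves like cosine-based clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_occurrences(vectors, K):
    """Cluster the contextualized vectors of one word's occurrences into
    K interpretations. Normalizing to unit length first makes Euclidean
    K-Means equivalent to clustering by cosine similarity."""
    X = np.asarray(vectors, dtype=np.float32)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    return km.labels_, km.cluster_centers_

# Relabel each occurrence with its interpretation id, e.g. penalty$0 / penalty$1:
# labels, _ = cluster_occurrences(penalty_vectors, K=2)
# tokens = [f"penalty${c}" for c in labels]
```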