Contextualized Weak Supervision for Text Classification

Dheeraj Mekala1 Jingbo Shang1,2 1 Department of Computer Science and Engineering, University of California San Diego, CA, USA 2 Halıcıoğlu Data Science Institute, University of California San Diego, CA, USA {dmekala, jshang}@ucsd.edu

Abstract

Weakly supervised text classification based on a few user-provided seed words has recently attracted much attention from researchers. Existing methods mainly generate pseudo-labels in a context-free manner (e.g., string matching), therefore, the ambiguous, context-dependent nature of human language has been long overlooked. In this paper, we propose a novel framework, ConWea, providing contextualized weak supervision for text classification. Specifically, we leverage contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of the same word, and thus create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand seed words in an iterative manner. This process not only adds new contextualized, highly label-indicative keywords but also disambiguates initial seed words, making our weak supervision fully contextualized. Extensive experiments and case studies on real-world datasets demonstrate the necessity and significant advantages of using contextualized weak supervision, especially when the class labels are fine-grained.

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 323–333, July 5–10, 2020. © 2020 Association for Computational Linguistics

1 Introduction

Weak supervision in text classification has recently attracted much attention from researchers, because it alleviates the burden on human experts of annotating massive documents, especially in specific domains. One of the popular forms of weak supervision is a small set of user-provided seed words for each class. Typical seed-driven methods follow an iterative framework — generate pseudo-labels using some heuristics, learn the mapping between documents and classes, and expand the seed set (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015; Meng et al., 2018).

Most, if not all, existing methods generate pseudo-labels in a context-free manner, so the ambiguous, context-dependent nature of human language has been long overlooked. Suppose the user gives "penalty" as a seed word for the sports class, as shown in Figure 1. The word "penalty" has at least two different meanings: the penalty in sports-related documents, and the fine or death penalty in law-related documents. If the pseudo-label of a document is decided based only on the frequency of seed words, some documents about law may be mislabelled as sports. More importantly, such errors will further introduce wrong seed words, and thus be propagated and amplified over the iterations.

In this paper, we introduce contextualized weak supervision to train a text classifier based on user-provided seed words. The "contextualized" here is reflected in two places: the corpus and the seed words. Every word occurrence in the corpus may be interpreted differently according to its context; every seed word, if ambiguous, must be resolved according to its user-specified class. In this way, we aim to improve the accuracy of the final text classifier.

[Figure 1: Our proposed contextualized weakly supervised method leverages BERT to create a contextualized corpus. This contextualized corpus is further utilized to resolve interpretations of seed words, generate pseudo-labels, train a classifier, and expand the seed set in an iterative fashion. For example, user-provided seeds {soccer, goal, penalty} for Soccer and {law, judge, court} for Law are contextualized and expanded into {soccer, goal$0, goal$1, penalty$1, ...} and {law, judge, court$0, court$1, penalty$0, ...}, and raw documents are rewritten accordingly, e.g., "Messi scored the penalty$1" and "The court$1 issued a penalty$0".]

We propose a novel framework, ConWea, as illustrated in Figure 1. It leverages contextualized representation learning techniques, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), together with user-provided seed information to first create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand seed words in an iterative manner. During this process, contextualized seed words are introduced by expanding and disambiguating the initial seed words. Specifically, for each word, we develop an unsupervised method to adaptively decide its number of interpretations and, accordingly, group all its occurrences based on their contextualized representations. We design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. We repeat the iterative classification and seed word expansion process until convergence.

To the best of our knowledge, this is the first work on contextualized weak supervision for text classification. It is also worth mentioning that our proposed framework is compatible with almost any contextualized representation learning model and text classification model. Our contributions are summarized as follows:
• We propose a novel framework enabling contextualized weak supervision for text classification.
• We develop an unsupervised method to automatically group word occurrences of the same word into an adaptive number of interpretations based on contextualized representations and user-provided seed information.
• We design a principled ranking mechanism to identify words that are discriminative and highly label-indicative.
• We have performed experiments on real-world datasets for both coarse- and fine-grained text classification tasks. The results demonstrate the superiority of using contextualized weak supervision, especially when the labels are fine-grained. Our code is made publicly available at GitHub [1].

[1] https://github.com/dheeraj7596/ConWea

2 Overview

Problem Formulation. The input of our problem contains (1) a collection of n text documents D = {D1, D2, ..., Dn} and (2) m target classes C = {C1, C2, ..., Cm} and their seed words S = {S1, S2, ..., Sm}. We aim to build a high-quality document classifier from these inputs, assigning a class label Cj ∈ C to each document Di ∈ D. Note that all these words could be upgraded to phrases if phrase mining techniques (Liu et al., 2015; Shang et al., 2018) were applied as pre-processing. In this paper, we stick to words.

Framework Overview. We propose a framework, ConWea, enabling contextualized weak supervision. Here, "contextualized" is reflected in two places: the corpus and the seed words. Therefore, we have developed two novel techniques accordingly to make both contextualizations happen.

First, we leverage contextualized representation learning techniques (Peters et al., 2018; Devlin et al., 2019) to create a contextualized corpus. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector of every word occurrence. We assume the user-provided seed words are of reasonable quality — the majority of the seed words are not ambiguous, and the majority of the occurrences of the seed words are about the semantics of the user-specified class. Based on these two assumptions, we are able to develop an unsupervised method to automatically group word occurrences of the same word into an adaptive number of interpretations, harvesting the contextualized corpus.

Second, we design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. Specifically, we start with all possible interpretations of seed words and train a neural classifier. Based on the predictions, we compare and contrast the documents belonging to different classes, and rank contextualized words based on how label-indicative, frequent, and

unusual these words are. During this process, we eliminate the wrong interpretations of initial seed words and also add more highly label-indicative contextualized words.

This entire process is visualized in Figure 1. We denote the number of iterations between classifier training and seed word expansion as T, which is the only hyper-parameter in our framework. We discuss these two novel techniques in detail in the following sections. To make our paper self-contained, we also brief the pseudo-label generation and the document classifier.

[Figure 2: Document contextualization examples using the words "windows" and "penalty". (a) similarity distribution for "windows"; (b) cluster visualisation for "windows"; (c) cluster visualisation for "penalty". τ is decided based on the similarity distributions of all seed word occurrences. Two clusters are discovered for both words, respectively.]

3 Document Contextualization

We leverage contextualized representation techniques to create a contextualized corpus. The key objective of this contextualization is to disambiguate different occurrences of the same word into several interpretations. We treat every word separately, so in the rest of this section, we focus on a given word w. Specifically, given a word w, we denote all its occurrences as w_1, ..., w_n, where n is its total number of occurrences in the corpus.

Contextualized Representation. First, we obtain a contextualized vector representation b_{w_i} for each occurrence w_i. Our proposed method is compatible with almost any contextualized representation learning model. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector for each word occurrence. In this contextualized vector space, we use the cosine similarity to measure the similarity between two vectors. Two word occurrences w_i and w_j of the same interpretation are expected to have a high cosine similarity between their vectors b_{w_i} and b_{w_j}. For ease of computation, we normalize all contextualized representations into unit vectors.

Choice of Clustering Methods. We model the word occurrence disambiguation problem as a clustering problem. Specifically, we propose to use the K-Means algorithm (Jain and Dubes, 1988) to cluster all contextualized representations b_{w_i} into K clusters, where K is the number of interpretations. We prefer K-Means because (1) the cosine similarity and Euclidean distance are equivalent for unit vectors, and (2) it is fast, and we are clustering a significant number of times.

Automated Parameter Setting. We choose the value of K purely based on a similarity threshold τ. τ is introduced to decide whether two clusters belong to the same interpretation by checking if the cosine similarity between two cluster center vectors is greater than τ. Intuitively, we should keep increasing K until there exist no two clusters with the same interpretation. Therefore, we choose K to be the largest number such that the similarity between any two cluster centers is no more than τ:

    K = arg max_K { K : cos(c_i, c_j) < τ, ∀ i ≠ j }    (1)

where c_i refers to the i-th cluster center vector after clustering all contextualized representations into K clusters. In practice, K is usually no more than 10, so we increase K gradually until the constraint is violated.

We pick τ based on the user-provided seed information instead of hand-tuning. As mentioned, we make two "majority" assumptions: (1) for any seed word, the majority of its occurrences follow the interpretation intended by the user; and (2) the majority of the seed words are not ambiguous — they have only one interpretation. Therefore, for each seed word s, we take the median of the pairwise cosine similarities between its occurrences:

    τ(s) = median({ sim(b_{s_i}, b_{s_j}) | ∀ i, j })    (2)

Then, we take the median of these medians over all seed words as τ. Mathematically,

    τ = median({ τ(s) | ∀ s })    (3)

The nested median solution makes the choice of τ safe and robust to outliers. For example, consider the word "windows" in the 20Newsgroup corpus. In fact, the word "windows" has two interpretations in the 20Newsgroup corpus — one represents an opening in the wall, and the other is an operating system. We first compute the pairwise similarities between all its occurrences and plot the histogram as shown in Figure 2(a). From this plot, we can see that its median value is about 0.7. We apply the same for all seed words and obtain τ following Equation 3; τ is calculated to be 0.82. Based on this value, we gradually increase K for "windows" and it ends up with K = 2. We visualize its K-Means clustering results using t-SNE (Maaten and Hinton, 2008) in Figure 2(b). Similar results can be observed for the word "penalty", as shown in Figure 2(c). These examples demonstrate how our document contextualization works for each word. In practice, to make it more efficient, one can subsample the occurrences instead of enumerating all pairs in a brute-force manner.

Contextualized Corpus. The interpretation of each occurrence of w is decided by the cluster-ID to which its contextualized representation belongs. Specifically, given each occurrence w_i, the word w is replaced by ŵ_i in the corpus as follows:

    ŵ_i = w       if K = 1
    ŵ_i = w$j*    otherwise                               (4)

where j* = arg max_{j=1..K} cos(b_{w_i}, c_j). By applying this to all words and their occurrences, the corpus is contextualized. The pseudo-code for corpus contextualization is shown in Algorithm 1.

Algorithm 1: Corpus Contextualization
  Input: Word occurrences w_1, w_2, ..., w_n of the word w; seed words s_1, s_2, ..., s_m and their occurrences s_{i,j}.
  Output: Contextualized word occurrences ŵ_1, ŵ_2, ..., ŵ_n.
  Obtain b_{w_i} and b_{s_{i,j}} using BERT.
  Compute τ following Equation 3.
  K ← 1
  while True do
      Run K-Means on {b_{w_i}} for (K+1) clusters.
      Obtain cluster centers c_1, c_2, ..., c_{K+1}.
      if max_{i,j} cos(c_i, c_j) > τ then
          break
      K ← K + 1
  Run K-Means on {b_{w_i}} for K clusters.
  Obtain cluster centers c_1, c_2, ..., c_K.
  for each occurrence w_i do
      Compute ŵ_i following Equation 4.
  Return ŵ_1, ..., ŵ_n.
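As a concrete illustration, the adaptive clustering of Algorithm 1 can be sketched in Python. This is an illustrative sketch, not the released implementation; `choose_tau` and `contextualize_word` are hypothetical helper names, and scikit-learn's K-Means stands in for any K-Means implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_tau(seed_occurrence_vecs):
    """tau = median over seed words of the median pairwise cosine
    similarity among that seed word's unit-normalized occurrence
    vectors (Equations 2-3)."""
    per_seed = []
    for vecs in seed_occurrence_vecs:      # one (n_i, d) array per seed word
        sims = vecs @ vecs.T               # cosine similarity of unit vectors
        iu = np.triu_indices(len(vecs), k=1)
        per_seed.append(np.median(sims[iu]))
    return float(np.median(per_seed))

def contextualize_word(vectors, tau, max_k=10):
    """Grow K until two cluster centers are more similar than tau, then
    keep the previous clustering; returns one cluster id per occurrence."""
    labels = np.zeros(len(vectors), dtype=int)   # K = 1: one interpretation
    for k in range(2, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
        centers = km.cluster_centers_
        centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        sims = centers @ centers.T
        np.fill_diagonal(sims, -1.0)
        if sims.max() > tau:                     # two clusters share an interpretation
            break
        labels = km.labels_
    return labels
```

An occurrence w_i whose vector falls in cluster j is then rewritten as w$j (Equation 4) whenever more than one interpretation survives.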
4 Pseudo-Label and Text Classifier

We generate pseudo-labels for unlabeled contextualized documents and train a classifier based on these pseudo-labels, similar to many other weakly supervised methods (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015; Meng et al., 2018). These two parts are not the focus of this paper; we briefly introduce them to make the paper self-contained.

Pseudo-Label Generation. There are several ways to generate pseudo-labels from seed words. As a proof of concept, we employ a simple but effective method based on counting. Each document is assigned the label whose aggregated term frequency of seed words is maximum. Let tf(ŵ, d) denote the term frequency of a contextualized word ŵ in the contextualized document d and S_c represent the set of seed words of class c; the document d is assigned a label l(d) as follows:

    l(d) = arg max_l Σ_{s_i ∈ S_l} tf(s_i, d)    (5)

Document Classifier. Our framework is compatible with any text classification model. We use Hierarchical Attention Networks (HAN) (Yang et al., 2016) as an example in our implementation. HAN considers the hierarchical structure of documents (document – sentences – words) and includes an attention mechanism that finds the most important words and sentences in a document while taking the context into consideration. There are two levels of attention: word-level attention identifies the important words in a sentence, and sentence-level attention identifies the important sentences in a document. The overall architecture of HAN is shown in Figure 3. We train a HAN model on the contextualized corpus with the generated pseudo-labels. The predicted labels are used in seed expansion and disambiguation.

[Figure 3: The HAN classifier used in our ConWea framework. It is trained on our contextualized corpus with the generated pseudo-labels.]

5 Seed Expansion and Disambiguation

Seed Expansion. Given contextualized documents and their predicted class labels, we propose to rank contextualized words and add the top few words into the seed word sets. The core element of this process is the ranking function. An ideal seed word s of label l is an unusual word that appears only in the documents belonging to label l, with significant frequency. Hence, for a given class C_j and a word w, we measure its ranking score based on the following three aspects:

• Label-Indicative. Since our pseudo-label generation follows the presence of seed words in the document, ideally, the posterior probability of a document belonging to the class C_j after observing the presence of word w (i.e., P(C_j | w)) should be very close to 100%. Therefore, we use P(C_j | w) as our label-indicative measure:

    LI(C_j, w) = P(C_j | w) = f_{C_j, w} / f_{C_j}

where f_{C_j} refers to the total number of documents that are predicted as class C_j, and among them, f_{C_j, w} documents contain the word w. All these counts are based on the prediction results on the input unlabeled documents.

• Frequent. Ideally, a seed word s of label l appears in the documents belonging to label l with significant frequency. To measure the frequency score, we first compute the average frequency of the word in all the documents belonging to the class. Since the average frequency is unbounded, we apply the tanh function to scale it, resulting in the frequency score:

    F(C_j, w) = tanh( f_{C_j}(w) / f_{C_j} )

Here, different from f_{C_j, w} defined earlier, f_{C_j}(w) is the frequency of word w in documents that are predicted as class C_j.

• Unusual. We want our highly label-indicative and frequent words to be unusual. To incorporate this, we consider the inverse document frequency (IDF). Let n be the number of documents in the corpus D and f_{D, w} represent the document frequency of word w; the IDF of a word w is computed as follows:

    IDF(w) = log( n / f_{D, w} )

Similar to previous work (Tao et al., 2015), we combine these three measures using the geometric mean, resulting in the ranking score R(C_j, w) of a word w for a class C_j:

    R(C_j, w) = ( LI(C_j, w) × F(C_j, w) × IDF(w) )^{1/3}

Based on this aggregated score, we add the top words to expand the seed word set of the class C_j.

Seed Disambiguation. While the majority of user-provided seed words are nice and clean, some of them may have multiple interpretations in the given corpus. We propose to disambiguate them based on the ranking. We first consider all possible interpretations of an initial seed word, generate the pseudo-labels, and train a classifier. Using the classified documents and the ranking function, we rank all possible interpretations of the same initial seed word. Because the majority of the occurrences of a seed word are assumed to belong to the user-specified class, the intended interpretation shall be ranked the highest. Therefore, we retain only the top-ranked interpretation of this seed word. After this step, we have fully contextualized our weak supervision, including the initial user-provided seeds.
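The counting-based pseudo-label generation (Equation 5) and the ranking score R(C_j, w) above can be sketched as follows. This is an illustrative sketch with hypothetical helper names, not the authors' released code:

```python
import numpy as np
from collections import Counter

def pseudo_label(doc_tokens, seeds_by_class):
    """Equation 5: assign the class whose seed words have the highest
    aggregated term frequency in the (contextualized) document."""
    tf = Counter(doc_tokens)
    scores = {c: sum(tf[s] for s in seeds) for c, seeds in seeds_by_class.items()}
    return max(scores, key=scores.get)

def ranking_score(word, class_docs, n_total_docs, doc_freq):
    """R(C_j, w) = (LI * F * IDF)^(1/3), computed over the token lists of
    documents predicted as class C_j."""
    f_cj = len(class_docs)
    f_cj_w = sum(1 for d in class_docs if word in d)    # docs of C_j containing w
    li = f_cj_w / f_cj                                  # label-indicativeness P(C_j|w)
    freq = np.tanh(sum(d.count(word) for d in class_docs) / f_cj)
    idf = np.log(n_total_docs / doc_freq[word])         # unusualness
    return (li * freq * idf) ** (1.0 / 3.0)
```

In each iteration, the top-ranked contextualized words per class are appended to that class's seed set, and for an ambiguous initial seed only its top-ranked interpretation is kept.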

6 Experiments

In this section, we evaluate our framework and the compared methods on coarse- and fine-grained text classification tasks under the weakly supervised setting.

Table 1: Dataset statistics.

    Dataset   # Docs   # Coarse   # Fine   Avg Doc Len
    NYT       13,081   5          25       778
    20News    18,846   6          20       400

6.1 Datasets

Following previous work (Tao et al., 2015; Meng et al., 2018), we use two news datasets in our experiments. The dataset statistics are provided in Table 1. Here are some details:
• The New York Times (NYT): The NYT dataset contains news articles written and published by The New York Times. These articles are classified into 5 wide genres (e.g., arts, sports) and 25 fine-grained categories (e.g., dance, music, hockey, basketball).
• The 20 Newsgroups (20News): The 20News dataset [2] is a collection of newsgroup documents partitioned widely into 6 groups (e.g., recreation, computers) and 20 fine-grained classes (e.g., graphics, windows, baseball, hockey).
We perform coarse- and fine-grained classifications on the NYT and 20News datasets. The NYT dataset is imbalanced in both fine-grained and coarse-grained classifications. 20News is nearly balanced in fine-grained classification but imbalanced in coarse-grained classification. Being aware of these facts, we adopt micro- and macro-F1 scores as evaluation metrics.

6.2 Compared Methods

We compare our framework with a wide range of methods described below:
• IR-TF-IDF treats the seed word set for each class as a query. The relevance of a document to a label is computed by the aggregated TF-IDF values of its respective seed words. The label with the highest relevance is assigned to each document.
• Dataless (Chang et al., 2008) uses only label surface names as supervision and leverages Wikipedia to derive vector representations of labels and documents. Each document is labeled based on the document-label similarity.
• Word2Vec first learns word vector representations (Mikolov et al., 2013) for all terms in the corpus and derives label representations by aggregating the vectors of its respective seed words. Finally, each document is labeled with the most similar label based on cosine similarity.
• Doc2Cube (Tao et al., 2015) considers label surface names as the seed set and performs multi-dimensional document classification by learning dimension-aware embeddings.
• WeSTClass (Meng et al., 2018) leverages seed information to generate pseudo documents and refines the model through a self-training module that bootstraps on real unlabeled documents.
We denote our framework as ConWea, which includes contextualizing the corpus, disambiguating seed words, and iterative classification & keyword expansion. Besides, we have three ablated versions. ConWea-NoCon refers to the variant of ConWea trained without the contextualization of the corpus. ConWea-NoExpan is the variant of ConWea without the seed expansion module. ConWea-WSD refers to the variant of ConWea with the contextualization module replaced by the Lesk algorithm (Lesk, 1986), a classic word sense disambiguation (WSD) algorithm.

We also present the results of HAN-Supervised under the supervised setting for reference. We use an 80-10-10 train-validation-test split for it and report the test-set results. All weakly supervised methods are evaluated on the entire datasets.

6.3 Experiment Settings

We use pre-trained BERT-base-uncased [3] to obtain contextualized word representations. We follow Devlin et al. (2019) and concatenate the averaged word-piece vectors of the last four layers.

The seed words are obtained as follows: we asked 5 human experts to nominate 5 seed words per class, and then considered the majority words (i.e., > 3 nominations) as our final set of seed words. For every class, we mainly use the label surface name as seed words. For some multi-word class labels (e.g., "international business"), we have multiple seed words, but never more than four per class. The same seed words are utilized for all compared methods for fair comparison.

For ConWea, we set T = 10. For any method using word embeddings, we set the dimension to 100. We use the public implementations of WeSTClass [4] and Dataless [5] with the hyper-parameters mentioned in their original papers.

[2] http://qwone.com/~jason/20Newsgroups/
[3] https://github.com/google-research/
[4] https://github.com/yumeng5/WeSTClass
[5] https://cogcomp.org/page/software_view/Descartes
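The last-four-layers pooling described in the experiment settings can be illustrated at the shape level with mock hidden states. In practice the 13 layers of hidden states would come from the pre-trained BERT-base-uncased model (e.g., via a library such as HuggingFace Transformers); the word-piece grouping below is a made-up example:

```python
import numpy as np

# Mock hidden states of a BERT-base-like encoder: 13 layers
# (embedding layer + 12 transformer layers), hidden size 768,
# for a sentence split into 6 word-pieces.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(6, 768)) for _ in range(13)]

# Word-piece to word alignment, e.g. "pen ##alty" -> one word.
wordpiece_groups = [[0], [1, 2], [3], [4, 5]]  # 4 words from 6 pieces

def word_vectors(hidden_states, groups, last_n=4):
    """Average word-piece vectors per word within each of the last
    `last_n` layers, then concatenate across those layers."""
    out = []
    for g in groups:
        per_layer = [layer[g].mean(axis=0) for layer in hidden_states[-last_n:]]
        out.append(np.concatenate(per_layer))
    return np.stack(out)

vecs = word_vectors(hidden_states, wordpiece_groups)
print(vecs.shape)  # → (4, 3072), i.e., 4 words x (4 layers x 768)
```

The resulting per-occurrence vectors are then unit-normalized before clustering, as described in Section 3.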

Table 2: Evaluation results for all methods on fine-grained and coarse-grained labels. Both micro-F1 and macro-F1 scores are presented. Ablation and supervised results are also included.

                        NYT 5-Class (Coarse)   NYT 25-Class (Fine)   20News 6-Class (Coarse)   20News 20-Class (Fine)
    Methods             Micro-F1  Macro-F1     Micro-F1  Macro-F1    Micro-F1  Macro-F1        Micro-F1  Macro-F1
    IR-TF-IDF           0.65      0.58         0.56      0.54        0.49      0.48            0.53      0.52
    Dataless            0.71      0.48         0.59      0.37        0.50      0.47            0.61      0.53
    Word2Vec            0.92      0.83         0.69      0.47        0.51      0.45            0.33      0.33
    Doc2Cube            0.71      0.38         0.67      0.34        0.40      0.35            0.23      0.23
    WeSTClass           0.91      0.84         0.50      0.36        0.53      0.43            0.49      0.46
    ConWea              0.95      0.89         0.91      0.79        0.62      0.57            0.65      0.64
    ConWea-NoCon        0.91      0.83         0.89      0.74        0.53      0.50            0.58      0.57
    ConWea-NoExpan      0.92      0.85         0.76      0.66        0.58      0.53            0.58      0.57
    ConWea-WSD          0.83      0.78         0.72      0.64        0.52      0.46            0.49      0.47
    HAN-Supervised      0.96      0.92         0.94      0.82        0.90      0.88            0.83      0.83
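For reference, the micro- and macro-F1 scores reported in Table 2 can be computed with scikit-learn; the toy labels below are made up for illustration:

```python
from sklearn.metrics import f1_score

# Toy multi-class predictions (made-up labels, 3 classes).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Micro-F1 aggregates counts over all classes; for single-label
# multi-class data it equals plain accuracy.
micro = f1_score(y_true, y_pred, average="micro")
# Macro-F1 averages per-class F1, so minority classes weigh as
# much as majority ones (useful for the imbalanced NYT dataset).
macro = f1_score(y_true, y_pred, average="macro")
print(round(micro, 3), round(macro, 3))  # → 0.667 0.656
```

Reporting both metrics separates overall accuracy (micro) from per-class balance (macro), which matters because NYT is imbalanced while 20News is nearly balanced at the fine-grained level.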

6.4 Performance Comparison

We summarize the evaluation results of all methods in Table 2. As one can observe, our proposed framework achieves the best performance among all the compared weakly supervised methods. We discuss the effectiveness of ConWea as follows:
• Our proposed framework ConWea outperforms all the other methods by significant margins. By contextualizing the corpus and resolving the interpretations of seed words, ConWea achieves inspiring performance, demonstrating the necessity and importance of using contextualized weak supervision.
• We observe that in fine-grained classification, the advantages of ConWea over the other methods are even more significant. This can be attributed to the contextualization of the corpus and seed words. Once the corpus is contextualized properly, the subtle ambiguity between words, which is a drawback for other methods, becomes a distinction that ConWea can recognize and predict correctly.
• The comparison between ConWea and the ablation method ConWea-NoExpan demonstrates the effectiveness of our seed expansion. For example, for fine-grained labels on the 20News dataset, the seed expansion improves the micro-F1 score from 0.58 to 0.65.
• The comparison between ConWea and the two ablation methods ConWea-NoCon and ConWea-WSD demonstrates the effectiveness of our contextualization. Our contextualization, building upon Devlin et al. (2019), is adaptive to the input corpus, without requiring any additional human annotations. However, WSD methods (e.g., Lesk, 1986) are typically trained for a general domain. If one wants to apply WSD to some specific corpus, additional annotated training data might be required to match our performance, which defeats the purpose of a weakly supervised setting. Therefore, we believe that our contextualization module has its unique advantages, and our experimental results further confirm this reasoning empirically. For example, for coarse-grained labels on the 20News dataset, the contextualization improves the micro-F1 score from 0.53 to 0.62.
• We observe that ConWea performs quite close to the supervised method, for example, on the NYT dataset. This demonstrates that ConWea is quite effective in closing the performance gap between the weakly supervised and supervised settings.

6.5 Parameter Study

The only hyper-parameter in our algorithm is T, the number of iterations of iterative expansion & classification. We conduct experiments to study the effect of the number of iterations on the performance. The plot of performance w.r.t. the number of iterations is shown in Figure 4. We observe that the performance increases initially and gradually converges after 4 or 5 iterations; after the convergence point, the expanded seed words become almost unchanged. While there is some fluctuation, a reasonably large T, such as T = 10, is a good choice.

6.6 Number of Seed Words

We vary the number of seed words per class and plot the F1 scores in Figure 5. One can observe that, in general, the performance increases as the number of seed words increases. There is a slightly different pattern on the 20News dataset when the labels are fine-grained. We conjecture that it is caused by the

329 (a) NYT Coarse (b) NYT Fine (c) 20News Coarse (d) 20News Fine

Figure 4: Micro- and Macro-F1 scores w.r.t. the number of iterations.

(a) NYT Coarse (b) NYT Fine (c) 20News Coarse (d) 20News Fine

Figure 5: Micro- and Macro-F1 scores w.r.t. the number of seed words. subtlety of seed words in fine-grained cases – addi- 7.1 Weak Supervision for Text Classification tional seed words may bring some noise. Overall, Weak supervision has been studied for building three seed words per class are enough for reason- document classifiers in various of forms, includ- able performance. ing hundreds of labeled training documents (Tang 6.7 Case Study et al., 2015; Miyato et al., 2016; Xu et al., 2017), class/category names (Song and Roth, 2014; Tao We present a case study to showcase the power of et al., 2015; Li et al., 2018), and user-provided seed contextualized weak supervision. Specifically, we words (Meng et al., 2018; Tao et al., 2015). In this investigate the differences between the expanded paper, we focus on user-provided seed words as seed words in the plain corpus and contextualized the source of weak supervision, Along this line, corpus over iterations. Table3 shows a column-by- Doc2Cube (Tao et al., 2015) expands label key- column comparison for the class For Sale on the words from label surface names and performs multi- 20News dataset. The class For Sale refers to doc- dimensional document classification by learning uments advertising goods for sale. Starting with dimension-aware embedding; PTE (Tang et al., the same seed sets in both types of corpora, from 2015) utilizes both labeled and unlabeled docu- Table3, in the second iteration, we observe that ments to learn text embeddings specifically for a “space” becomes a part of expanded seed set in the task, which are later fed to logistic regression classi- plain corpus. Here “space” has two interpretations, fiers for classification; Meng et al.(2018) leverage one stands for the physical universe beyond the seed information to generate pseudo documents Earth and the other is for an area of land. 
This and introduces a self-training module that boot- error gets propagated and amplified over the iter- straps on real unlabeled data for model refining. ations, further introducing wrong seed words like This method is later extended to handle hierarchical “nasa”, “shuttle” and “moon”, related to its first in- classifications based on a pre-defined label taxon- terpretation. The seed set for contextualized corpus omy (Meng et al., 2019). However, all these weak addresses this problem and adds only the words supervisions follow a context-free manner. Here, with appropriate interpretations. Also, one can see we propose to use contextualized weak supervision. that the initial seed word “offer” has been disam- biguated as “offer$0”. 7.2 Contextualized Word Representations

7 Related Work Contextualized word representation is originated from machine translation (MT). CoVe (McCann We review the literature about (1) weak supervision et al., 2017) generates contextualized representa- for text classification methods, (2) contextualized tions for a word based on pre-trained MT models, representation learning techniques, (3) document More recently, ELMo (Peters et al., 2018) lever- classifiers, and (4) word sense disambiguation. ages neural language models to replace MT models,

330 Table 3: Case Study: Seed word expansion of the For Sale class in context-free and contextualized corpora. The For Sale class contains documents advertising goods for sale. Blue bold words are potentially wrong seeds. Seed Words for For Sale class Iter Plain Corpus Contextualized Corpus 1 sale, offer, forsale sale, offer, forsale 2 space, price, shipping, sale, offer shipping, forsale, offer$0, condition$0, sale space, price, shipping, sale, nasa, price, shipping, sale, forsale, condition$0, 3 offer, package, email offer$0, package, email space, price, moon, shipping, sale, nasa, price, shipping, sale, forsale, condition$0, 4 offer, shuttle, package, email offer$0, package, email, offers$0, obo$0 which removes the dependency on massive parallel 7.4 Document Classifier texts and takes advantages of nearly unlimited raw Document classification problem has been long corpora. Many models leveraging language mod- studied. In our implementation of the proposed eling to build sentence representations (Howard ConWea framework, we used HAN (Yang et al., and Ruder, 2018; Radford et al., 2018; Devlin 2016), which considers the hierarchical structure et al., 2019) emerge almost at the same time. Lan- of documents and includes attention mechanisms guage models have also been extended to the char- to find the most important words and sentences in a acter level (Liu et al., 2018; Akbik et al., 2018), document. CNN-based text classifiers(Kim, 2014; which can generate contextualized representations Zhang et al., 2015; Lai et al., 2015) are also popular for character spans. and can achieve inspiring performance. Our proposed framework is compatible with all Our framework is compatible with all the above the above contextualized representation techniques. text classifiers. We choose HAN just for a demon- In our implementation, we choose to use BERT stration purpose. to demonstrate the power of using contextualized supervision. 
7.3 Word Sense Disambiguation

Word Sense Disambiguation (WSD) is one of the challenging problems in natural language processing. Typical WSD models (Lesk, 1986; Zhong and Ng, 2010; Yuan et al., 2016; Raganato et al., 2017; Le et al., 2018; Tripodi and Navigli, 2019) are trained for a general domain. Recent works (Li and Jurafsky, 2015; Mekala et al., 2016; Gupta et al., 2019) have also shown that machine-interpretable representations of words that account for their senses improve document classification. However, applying WSD to a specific corpus may require additional annotated training data to match our performance, which defeats the purpose of a weakly supervised setting. In contrast, our contextualization, built upon BERT (Devlin et al., 2019), adapts to the input corpus without requiring any additional human annotations. Therefore, our framework is more suitable than WSD under the weakly supervised setting. Our experimental results verify this reasoning and show the superiority of our contextualization module over WSD in weakly supervised document classification tasks.

8 Conclusions and Future Work

In this paper, we proposed ConWea, a novel contextualized weakly supervised classification framework. Our method leverages contextualized representation techniques and initial user-provided seed words to contextualize the corpus. This contextualized corpus is further used to resolve the interpretation of seed words through iterative seed word expansion and document classifier training. Experimental results demonstrate that our model significantly outperforms previous methods, confirming the advantage of contextualized weak supervision, especially when labels are fine-grained.

In the future, we are interested in generalizing contextualized weak supervision to hierarchical text classification problems. Currently, we perform coarse- and fine-grained classification separately; there should be more useful information embedded in the tree structure of the label hierarchy. Also, extending our method to other types of textual data, such as short texts, multi-lingual data, and code-switched data, is a potential direction.

Acknowledgements

We thank Palash Chauhan and Harsh Jhamtani for valuable discussions.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 85–94. ACM.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.

Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In AAAI, volume 2, pages 830–835.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Vivek Gupta, Ankit Saw, Pegah Nokhiz, Harshit Gupta, and Partha Talukdar. 2019. Improving document classification with multi-sense embeddings. arXiv preprint arXiv:1911.07918.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Benjamin J. Kuipers, Patrick Beeson, Joseph Modayil, and Jefferson Provost. 2006. Bootstrap learning of foundational representations. Connection Science, 18(2):145–158.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Minh Le, Marten Postma, Jacopo Urbani, and Piek Vossen. 2018. A deep dive into word sense disambiguation with LSTM. In Proceedings of the 27th International Conference on Computational Linguistics, pages 354–365.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.

Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. 2018. Unsupervised neural categorization for scientific publications. In SIAM Data Mining, pages 37–45. SIAM.

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1729–1744.

Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Fangzheng Xu, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Dheeraj Mekala, Vivek Gupta, Bhargavi Paranjape, and Harish Karnick. 2016. SCDV: Sparse composite document vectors using soft clustering over distributional representations. arXiv preprint arXiv:1612.06778.

Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2018. Weakly-supervised neural text classification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 983–992. ACM.

Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-supervised hierarchical text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6826–6833.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110.

Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 25–32. Association for Computational Linguistics.

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering, 30(10):1825–1837.

Yangqiu Song and Dan Roth. 2014. On dataless hierarchical text classification. In AAAI.

Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM.

Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, Tim Hanratty, Lance Kaplan, and Jiawei Han. 2015. Doc2Cube: Automated document allocation to text cube via dimension-aware joint embedding.

Rocco Tripodi and Roberto Navigli. 2019. Game theory meets embeddings: A unified framework for word sense disambiguation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 88–99.

Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In AAAI.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. arXiv preprint arXiv:1603.07012.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83.
