Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Correlation-Guided Representation for Multi-Label Text Classification

Qian-Wen Zhang1∗, Ximing Zhang2∗†, Zhao Yan1, Ruifang Liu2, Yunbo Cao1 and Min-Ling Zhang3,4

1Tencent Cloud Xiaowei, Beijing 100080, China
2Beijing University of Posts and Telecommunications, Beijing 100876, China
3School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
4Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

∗Equal contribution.
†Work done during an internship at Tencent.

Abstract

Multi-label text classification is an essential task in natural language processing. Existing multi-label classification models generally consider labels as categorical variables and ignore the exploitation of label semantics. In this paper, we view the task as a correlation-guided text representation problem: an attention-based two-step framework is proposed to integrate text information and label semantics by jointly learning words and labels in the same space. In this way, we aim to capture high-order label-label correlations as well as context-label correlations. Specifically, the proposed approach works by learning token-level representations of words and labels globally through a multi-layer Transformer and constructing an attention vector from the word-label correlation matrix to generate the text representation. This ensures that relevant words receive higher weights than irrelevant words and thus directly optimizes the classification performance. Extensive experiments over benchmark multi-label datasets clearly validate the effectiveness of the proposed approach, and further analysis demonstrates that it is competitive in both predicting low-frequency labels and convergence speed.

1 Introduction

Multi-label text classification (MLTC) deals with real-world objects with rich semantics, where each text is simultaneously associated with multiple class labels that tend to be correlated. It is a fundamental task in natural language processing (NLP), which aims to learn a predictive model that assigns an appropriate set of labels to an unseen text. It is worth noting that, to learn from multi-label text data, one needs to pay attention to two key factors: 1) how to generate a more discriminative text representation; 2) how to effectively mine correlations to facilitate the learning procedure.

Text representation is critical in multi-label text classification. Transformer-based studies [Devlin et al., 2019; Lan et al., 2019] demonstrate the effectiveness of the Transformer module for capturing the dependencies of all tokens in a sequence and providing a contextualized representation for classification tasks. Nevertheless, utilizing only contextual information to generate the text representation is suboptimal, as it ignores the information conveyed by class labels and thus fails to take advantage of potential label-label and context-label correlations. The fact that different labels may share the same subset of words helps generate a strong text representation. For example, academic literature containing keywords such as “neural network” is often tagged with “artificial intelligence” and “deep learning”; closely related labels tend to co-occur. It is therefore desirable to fully exploit potential correlation information in text representation generation, which can be investigated from two angles: 1) label-label correlations can be exploited to extract latent inter-dependent features; 2) context-label correlations can be exploited to enhance the discriminative ability of the extracted features. As far as we know, the simultaneous exploitation of both correlations has still not been well studied.

Generally, the learning process must be facilitated by exploiting correlations among labels in order to tackle the challenge of an exponential-sized output space for MLTC. Specifically, CNN-RNN [Chen et al., 2017] presents an ensemble method of CNN and RNN to capture semantics and model label correlations. SGM [Yang et al., 2018] captures high-order correlations between labels through a sequence generation model. We argue that correlations change dynamically in different contexts, so if we can learn words and labels jointly in the same space, we will obtain better label-label correlations as well as context-label correlations that fit a given text. To further model the context-label correlations, several label embedding methods, including C2AE [Liu et al., 2017], LEAM [Wang et al., 2018], LSAN [Xiao et al., 2019], X-Transformer [Chang et al., 2020], etc., take advantage of label information and construct label-specific text representations. However, they fail to capture implicit information within the label space, which leads to prediction bias in favor of the majority classes while ignoring the minority classes.

Furthermore, such methods are limited when the labels do not carry semantic descriptions. For example, “deep learning” is a label with a description, whereas the symbol “DL” has none. Providing labels only as abbreviations or symbols in a dataset can lead to poor prediction performance or even inapplicability.

Inspired by the potential of correlations, we import label semantics as auxiliary information through a global embedding strategy. The encoder learns word-word, label-label, and word-label correlations globally through a Transformer module. Since not all text words contribute equally to the prediction, we construct an attention vector from the word-label correlation matrix to extract more discriminative words. The attention mechanism can improve performance with interpretability for text classification, which means that it helps relevant words receive higher attention than irrelevant words.

To the best of our knowledge, we are the first to learn label-label and context-label correlations together with an attention-based two-step framework. The main contributions of this paper include:

1. We propose a basic global embedding strategy that represents context and all class labels in the same latent space by Transformer to generate token-level representations, which captures correlations and reduces the dependence on label description information.

2. We propose an effective and novel method, called CORE, which exploits COrrelation-guided REpresentation for multi-label text classification. CORE utilizes higher-order context-label correlations to guide attention processes and attempts to produce a semantic-aware representation.

3. Experimental results show that CORE achieves competitive performance against other state-of-the-art multi-label text classification approaches. We further provide a series of BERT-based methods and analyze the performance with macro-based and rank-based metrics. Results show that the utilization of label embedding and label correlations has a significant impact on the performance of our approach.

Figure 1: The framework of CORE. Word and label representations are first generated by a multi-layer Transformer encoder, which learns word-word, word-label, and label-label correlations through self-attention. After that, a context-label correlation matrix is learned from the output representations. The text representation vector is generated from the context part of the output and the attention vector, and is finally used to predict labels.

2 The CORE Approach

In this section, we first introduce the standard formal definition of multi-label text classification. Afterwards, the formulation of our method is illustrated. The technical details of CORE are presented in three steps: global embedding strategy, text representation learning, and predictive model induction.

2.1 Problem Formulation

Given a training set S = {(X_i, Y_i)}_{i=1}^N of multi-label text classification data, where X_i ∈ X is the text sequence and Y_i ⊆ Y is its corresponding label set, the task of MLTC is to learn a predictive function f. More specifically, an input text sequence X of length m is composed of word tokens X = {x_1, x_2, ..., x_m}, and Y = {y_1, y_2, ..., y_l} is the label space with l labels. Different from single-label classification, where only one label is associated with X_i, the multi-label classification function f : X → 2^Y assigns a set of possible class labels (Y, 0 ≤ |Y| ≤ l) to the unseen text. Here, y_i is either regarded as relevant (y_i ∈ Y) or irrelevant (y_i ∉ Y) for instance X_i. Note that we use k ∈ {0, 1} to denote the categorical information of y_i.

A typical text classification approach first preprocesses the text data X to obtain a text representation X. Then, the classifier annotates the text representation with a set of proper labels Y. Intuitively, such approaches utilize only the information from the input text sequence. Our method extends the input by adding label information. Therefore, the new input sequence of CORE is overlaid with both text and labels and is composed of all tokens: {X; Y} = {x_1, x_2, ..., x_m; y_1, y_2, ..., y_l}, where the number of labels is fixed to l in the data. The preprocessing is to obtain a text representation C from context and labels. The aim of the predictive function f : C → 2^Y is to minimize a loss function which ensures that the model predicts relevant and irrelevant labels for each training instance with minimal misclassification.
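To make the formulation concrete, the short sketch below is our own illustration (not part of the original paper; the helper name and the example label space are hypothetical). It shows how a label set Y_i is mapped to the binary target vector k ∈ {0, 1}^l that the loss in Eq. (6) later operates on.

```python
# Minimal illustration of the multi-label target encoding (our own sketch).
# `label_space` and `example_labels` are hypothetical stand-ins for Y and Y_i.

def encode_label_set(example_labels, label_space):
    """Map a set of relevant labels to a binary vector k of length l."""
    return [1 if y in example_labels else 0 for y in label_space]

label_space = ["cs.CV", "cs.CL", "cs.LG", "cs.AI"]   # Y, with |Y| = l
example_labels = {"cs.CV", "cs.CL"}                  # Y_i for one instance

k = encode_label_set(example_labels, label_space)
print(k)  # [1, 1, 0, 0] -> relevant labels get 1, irrelevant labels get 0
```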
2.2 Global Embedding Strategy

We utilize BERT [Devlin et al., 2019], which outperforms state-of-the-art models on a wide range of NLP tasks, as the base encoder in the CORE framework. The basic architecture of BERT is a multi-layer bidirectional self-attention Transformer. For classification tasks, a special token [CLS] is placed at the beginning of the text, and the output vector of [CLS] is designed to serve as the final text representation. Different from this operation, we unite the input text with all class labels, which are packed into a single sequence and separated by the special token [SEP].

Let {[CLS], x_1, x_2, ..., x_m, [SEP], y_1, y_2, ..., y_l, [SEP]} be the token sequence fed into the Transformer module. Note that the input representation of each given token is constructed by summing the corresponding token, segment, and position embeddings. We simply use the same pre-trained parameters as the official public model to initialize our model, and fine-tune all parameters end-to-end to obtain the contextualized token-level representations H, i.e., {h_[CLS], h_x1, ..., h_xm, h_[SEP], h_y1, ..., h_yl, h_[SEP]}.

As shown in Figure 1, the global embedding strategy guarantees that both label-label correlations and context-label correlations are considered in the same space from the beginning. Note that when label descriptions are unavailable, we represent each label with a new word token (an unused token) to learn its hidden representation.
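The following is a minimal sketch of how such a joint input sequence could be assembled (our illustration, not code from the paper; the function name and the trimming/padding details are assumptions based on Sections 2.2 and 2.3, and the length budget of 320 follows Section 3.4).

```python
# Hedged sketch: packing context tokens and label tokens into one sequence,
# [CLS] x_1 ... x_m [SEP] y_1 ... y_l [SEP], as described in Section 2.2.
# `max_len = 320` follows Section 3.4; everything else is illustrative.

def build_joint_input(word_tokens, label_tokens, max_len=320):
    # Reserve 3 positions for [CLS] and the two [SEP] tokens, plus the labels,
    # which always stay in the sequence (their number is fixed to l).
    budget = max_len - 3 - len(label_tokens)
    words = word_tokens[:budget]                      # trim the excess
    tokens = ["[CLS]"] + words + ["[SEP]"] + list(label_tokens) + ["[SEP]"]
    tokens += ["[PAD]"] * (max_len - len(tokens))     # pad the deficient
    return tokens

words = "generating descriptions for videos has many applications".split()
labels = [f"[unused{i}]" for i in range(4)]           # labels without descriptions
print(build_joint_input(words, labels, max_len=16))
```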

2.3 Text Representation Learning

To characterize the underlying structure of the contextualized representations, CORE works by constructing an attention vector (H_x, H_y) → α. H_x corresponds to the set of context sequence representations and H_y corresponds to the set of label sequence representations. Since the input text length is variable, we fix it for ease of use, i.e., the excess is trimmed off and the deficient part is padded.

Attention Discovery
A simple way to measure the deep context-label correlations is to multiply matrix H_x by matrix H_y:

G = H_x H_y^T    (1)

where G ∈ R^{m×l}; note that H_x and H_y have been normalized by the L2 norm.

We consider a further generalization of Eq. (1), which aims to strengthen the relative spatial information among consecutive tokens. In particular, for a text fragment of length 2r+1 centered at position p, the local block G_{p−r:p+r} of G measures the correlation of the word-label fragment pairs. Since the matrix G can be viewed as an extension in the spatial orientation, we use g(·) to denote a convolution layer with ReLU activation to learn the higher-order correlation matrix:

M = g(G_{p−r:p+r} W_1 + b_1)    (2)

where M ∈ R^{l×m} and p ∈ {1, ..., m}. Here, W_1 ∈ R^{2r+1} and b_1 ∈ R^l are the weight matrix and bias vector, respectively. We compress the matrix M into a vector of length m by selecting the maximum value, which reduces the effect of fluctuating values:

α = Ω(M)    (3)

Here, the length of α is m, and softmax and hyperbolic tangent are applied sequentially in Ω(·). By learning the context-label correlation matrix M from Eq. (2), the attention vector α is kept in a manageable range by Eq. (3).

Text Representation Generation
Given the attention vector α, the original contextualized representations of the text sequence can be transformed into an enriched version. In CORE, the final text representation is generated by aggregating the word representations, weighted by the attention vector:

c = α · H_x    (4)

Intuitively, the text representation uses higher-order context-label correlations to guide the attention process. The nonlinear interaction between context and labels is thus adequately considered to improve performance.
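To make the two-step computation concrete, here is a small numpy sketch of Eqs. (1)–(4). This is our own reading of the equations, not the authors' TensorFlow code; in particular, the exact form of the convolution g(·) and the order of operations inside Ω(·) follow the description above but remain assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def correlation_guided_representation(Hx, Hy, W1, b1, r):
    """Sketch of Eqs. (1)-(4): Hx is (m, d), Hy is (l, d), W1 is (2r+1,), b1 is (l,)."""
    # Eq. (1): word-label correlation matrix after L2 normalization, G is (m, l).
    Hx_n = Hx / np.linalg.norm(Hx, axis=1, keepdims=True)
    Hy_n = Hy / np.linalg.norm(Hy, axis=1, keepdims=True)
    G = Hx_n @ Hy_n.T

    # Eq. (2): slide a window of width 2r+1 over the context axis; each window
    # G[p-r:p+r+1] (shape (2r+1, l)) is combined by W1, shifted by b1, and passed
    # through ReLU, giving column p of the higher-order matrix M (shape (l, m)).
    m, l = G.shape
    G_pad = np.pad(G, ((r, r), (0, 0)))
    M = np.empty((l, m))
    for p in range(m):
        window = G_pad[p:p + 2 * r + 1]          # (2r+1, l)
        M[:, p] = np.maximum(W1 @ window + b1, 0.0)

    # Eq. (3): compress M over the label axis by max-pooling, then apply
    # softmax followed by tanh (the Omega described in the text).
    alpha = np.tanh(softmax(M.max(axis=0)))       # length m

    # Eq. (4): attention-weighted aggregation of the context representations.
    return alpha @ Hx                             # length d

# Toy usage with random "encoder outputs": m=12 context tokens, l=4 labels, d=8 dims.
rng = np.random.default_rng(0)
m, l, d, r = 12, 4, 8, 2
c = correlation_guided_representation(rng.normal(size=(m, d)), rng.normal(size=(l, d)),
                                      rng.normal(size=2 * r + 1), np.zeros(l), r=r)
print(c.shape)  # (8,)
```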
2.4 Predictive Model Induction

According to the objective function f : C → 2^Y, the correlation-guided representation replaces the original text representation for multi-label prediction. We choose a standard neural network to annotate the correlation-guided representation with a set of relevant labels:

p = Sigmoid(W_2 c + b_2)    (5)

where W_2 ∈ R^{l×|c|} and b_2 ∈ R^l are parameters to be learned. Notice that the sigmoid function can deal with non-exclusive labels, while the softmax function only deals with mutually exclusive classes.

In CORE, the binary cross-entropy loss is used to measure the probability error in multi-label classification, where each class is independent rather than mutually exclusive:

loss_i = −[k_i ln p_i + (1 − k_i) ln(1 − p_i)]    (6)

To minimize the loss function, we train the model end-to-end with all of the above parameters.
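A compact numpy sketch of the prediction layer and per-label loss in Eqs. (5)–(6) follows. This is again our illustration; the assumption that losses are averaged over labels for training is ours, since the paper only specifies the per-label term.

```python
import numpy as np

def predict(c, W2, b2):
    """Eq. (5): per-label probabilities from the correlation-guided representation c."""
    logits = W2 @ c + b2          # W2 is (l, |c|), so logits has one entry per label
    return 1.0 / (1.0 + np.exp(-logits))

def binary_cross_entropy(p, k, eps=1e-12):
    """Eq. (6): independent binary cross-entropy terms, one per label."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(k * np.log(p) + (1.0 - k) * np.log(1.0 - p))

# Toy usage: l = 4 labels, representation dimension 8.
rng = np.random.default_rng(1)
c = rng.normal(size=8)
W2, b2 = rng.normal(size=(4, 8)), np.zeros(4)
k = np.array([1.0, 0.0, 1.0, 0.0])        # binary targets k_i from Section 2.1

p = predict(c, W2, b2)
print(binary_cross_entropy(p, k).mean())  # averaging over labels is our assumption
```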


3 Experimental Setup

In this section, the datasets, comparing algorithms, evaluation metrics, and parameter settings are introduced.

3.1 Datasets

We use two datasets for MLTC: AAPD [Yang et al., 2018] and RCV1-V2 [Lewis et al., 2004]. Table 1 summarizes their detailed characteristics. Each dataset is divided into a training set, a validation set, and a test set, which are used as the basic divisions in the performance experiments of each algorithm [Yang et al., 2018].

Dataset    |S|       L(S)   WCard(S)   LCard(S)
AAPD       55,840    54     163.42     2.41
RCV1-V2    804,414   103    123.94     3.24

Table 1: Characteristics of the datasets. Here, |S| and L(S) denote the total number of samples and labels, respectively. WCard(S) denotes word cardinality, the average number of words per sample, and LCard(S) denotes label cardinality, the average number of labels per sample.

3.2 Comparing Algorithms

The performance of CORE is compared against the following multi-label algorithms:

• SGM [Yang et al., 2018] proposes a sequence-to-sequence model with an attention mechanism to capture label correlations. Although label correlations are exploited, it ignores the use of label semantics to construct text representations.

• Seq2Set [Yang et al., 2019] utilizes deep reinforcement learning to improve the performance of the seq2seq model, which reduces the dependency on label order. Similar to SGM, it lacks the use of label information.

• LSAN [Xiao et al., 2019] makes use of content and label text to learn a label-specific text representation with the help of self-attention and label-attention mechanisms.

• LEAM [Wang et al., 2018] applies label embedding to text classification, obtaining each label's embedding from its corresponding text description. Here, we provide a comparable baseline method, LEAMw/BERT, which uses BERT to provide the text encoding but encodes labels independently.

• BERT [Devlin et al., 2019] is a recently proposed language representation model that generates contextualized word vectors. For multi-label text classification, it only uses text as input, with no label information. As shown in Figure 2, we propose two comparable baseline methods. BERTonelab is given only one label at a time and runs the binary classification task l times, once per label; while incorporating label embedding, it cannot learn label correlations. BERTlabseq uses the output of the label sequence part, H_y, as multiple contextual representations for multiple binary classifications. It adopts our global embedding strategy and solves the problem as a sequence annotation task with an additional BiLSTM-CRF layer.

More information about the other baselines can be found in Binary Relevance (BR) [Boutell et al., 2004], CNN [Kim, 2014], Classifier Chains (CC) [Read et al., 2011], Label Powerset (LP) [Tsoumakas and Katakis, 2007], and CNN-RNN [Chen et al., 2017].

Figure 2: Illustration of different BERT-based methods. Subfigure (a) corresponds to the classical BERT setup for the classification task. Subfigure (b) represents a label embedding method that does not consider label correlations, corresponding to BERTonelab. Subfigure (c) uses the label-related parts of the context-label representations, corresponding to BERTlabseq. Subfigure (d) is a simplified diagram of Figure 1, corresponding to CORE.

                               AAPD dataset                               RCV1-V2 dataset
Algorithm      LE    LC    HL↓      Micro-P↑  Micro-R↑  Micro-F1↑    HL↓      Micro-P↑  Micro-R↑  Micro-F1↑
BR             no    no    0.0316   0.644     0.648     0.646        0.0086   0.904     0.816     0.858
CNN            no    no    0.0256   0.849     0.545     0.664        0.0089   0.922     0.798     0.855
BERT           no    no    0.0224   0.786     0.687     0.734        0.0073   0.927     0.832     0.877
CC             no    yes   0.0306   0.657     0.651     0.654        0.0087   0.887     0.828     0.857
LP             no    yes   0.0312   0.662     0.608     0.634        0.0087   0.896     0.824     0.858
CNN-RNN        no    yes   0.0278   0.718     0.618     0.664        0.0085   0.889     0.825     0.856
SGM            no    yes   0.0251   0.746     0.659     0.699        0.0081   0.887     0.850     0.869
Seq2Set        no    yes   0.0247   0.739     0.674     0.705        0.0073   0.900     0.858     0.879
LSAN           yes   no    0.0242   0.777     0.646     0.706        0.0076   0.913     0.841     0.875
LEAM           yes   no    0.0261   0.765     0.596     0.670        0.0090   0.871     0.841     0.856
LEAMw/BERT     yes   no    0.0237   0.753     0.700     0.726        0.0077   0.893     0.857     0.875
BERTonelab     yes   no    0.0239   0.775     0.659     0.712        0.0077   0.909     0.835     0.871
BERTlabseq     yes   yes   0.0236   0.742     0.727     0.734        0.0074   0.897     0.862     0.879
Ours (CORE)    yes   yes   0.0210   0.803     0.704     0.750        0.0069   0.911     0.864     0.887

Table 2: Predictive performance of each comparing algorithm on the two datasets. BERTonelab and BERTlabseq are comparable baselines that we propose. Note that LE and LC indicate whether the algorithm considers label embedding and label correlations, respectively. HL, Micro-P, and Micro-R denote hamming loss, micro-precision, and micro-recall, respectively. The best performance is highlighted in bold in the original table.

3.3 Evaluation Metrics

Following previous work [Yang et al., 2018; Zhang and Zhou, 2014], we adopt Hamming Loss and Micro-F1 as our main metrics; micro-precision and micro-recall are also reported for reference. We further use Macro-F1, which assumes equal label weights, as a key metric providing a different analytical perspective; macro-precision and macro-recall are also reported for reference. When the intermediate real-valued function is available, macro-AUC, ranking loss, and coverage are provided as rank-based metrics, which expect relevant labels to score higher than irrelevant ones. For each evaluation metric, “↓” indicates “the smaller the better”, while “↑” indicates “the larger the better”.

3.4 Experimental Setting

We implement our experiments in TensorFlow on an NVIDIA Tesla P40. In the experiments, we fine-tune models on the base-uncased version of BERT for English texts. The batch size is 32, the learning rate is 5e-5, and the window size of the additional layer is 10. Based on WCard(S) and L(S) in Table 1, the maximum total input sequence length is 320. In addition, learning rate decay is applied to the BERT training part, which starts with a large learning rate and then decays multiple times [Clark et al., 2019]. Note that all BERT-based models in this paper use the learning rate decay technique to improve performance.
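For concreteness, below is a hedged numpy sketch of the two main metrics from Section 3.3, Hamming Loss and Micro-F1, computed from binary prediction and ground-truth matrices. These are our own implementations of the standard definitions, not the evaluation code used in the paper.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots that are misclassified (lower is better)."""
    return np.mean(Y_true != Y_pred)

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1: pool true/false positives and negatives over all labels."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy usage: 3 instances, 4 labels, binary indicator matrices.
Y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]])
print(hamming_loss(Y_true, Y_pred), micro_f1(Y_true, Y_pred))
```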


AAPD dataset
Algorithm      LE    LC    Macro-P↑  Macro-R↑  Macro-F1↑  Macro-AUC↑  RL↓     coverage↓
BERT           no    no    0.687     0.521     0.572      0.8843      0.0866  0.2018
LEAM           yes   no    0.547     0.386     0.453      0.9372      0.0451  0.1048
LEAMw/BERT     yes   no    0.627     0.555     0.577      0.8446      0.1123  0.2448
BERTonelab     yes   no    0.666     0.483     0.534      0.9200      0.0480  0.1195
BERTlabseq     yes   yes   0.610     0.585     0.586      0.9463      0.0362  0.0962
Ours (CORE)    yes   yes   0.704     0.546     0.595      0.8866      0.0614  0.1545

RCV1-V2 dataset
Algorithm      LE    LC    Macro-P↑  Macro-R↑  Macro-F1↑  Macro-AUC↑  RL↓     coverage↓
BERT           no    no    0.773     0.619     0.667      0.9460      0.0310  0.1140
SGM            no    yes   0.713     0.680     0.681      -           -       -
LEAM           yes   no    0.741     0.649     0.692      0.9881      0.0073  0.0440
LEAMw/BERT     yes   no    0.743     0.676     0.684      0.9543      0.0357  0.1176
BERTonelab     yes   no    0.752     0.616     0.659      0.9866      0.0075  0.0436
BERTlabseq     yes   yes   0.735     0.678     0.686      0.9949      0.0037  0.0308
Ours (CORE)    yes   yes   0.759     0.684     0.703      0.9909      0.0064  0.0414

Table 3: The performance of different models on macro-based and rank-based metrics. Note that Macro-P, Macro-R, and RL denote macro-precision, macro-recall, and ranking loss, respectively.

4 Experimental Results

We report the detailed experimental results of all comparing algorithms on the two datasets in Table 2. The following observations can be made:

1) Our proposed CORE presents the best performance in terms of hamming loss and micro-F1. A significance test among the comparing algorithms suggests that the performance difference is statistically significant (p < 0.05). On the AAPD dataset, compared to the traditional deep learning model CNN, which only considers text content, CORE decreases hamming loss by 17.97% and improves micro-F1 by 12.95%. Compared to BERT, CORE continues to perform well, with a relative reduction of 6.25% in hamming loss and an improvement of 2.18% in micro-F1. On the RCV1-V2 dataset, compared with Seq2Set, which only uses label semantics to correct predictions, CORE achieves a 5.48% reduction in hamming loss and a 0.91% improvement in micro-F1. This shows that modeling correlations with label semantics can lead to performance gains.

2) With our global embedding strategy, BERTlabseq achieves a significant improvement in micro-recall compared to BERT. We argue that the potential label-label and word-label correlations help capture more meaningful features. BERTonelab predicts labels one by one, but compared to BERT, it loses 3.00% in micro-F1 on the AAPD dataset. LEAMw/BERT improves over LEAM by using a Transformer encoder, but its performance is still slightly lower than BERT's because the labels are encoded independently. This indicates that the lack of label correlations may lead to performance degradation.

3) Algorithms like CNN, BERT, LSAN, etc. are biased towards predicting positive examples as negative examples, resulting in fewer matches than the actual number of samples in each class. CORE maintains good micro-precision while improving micro-recall. We attribute this to the effect of the attention mechanism. According to the above observations, it is noteworthy that no algorithm significantly outperforms CORE across all evaluation metrics.

To summarize, compared against recent state-of-the-art models, our method significantly improves on previous state-of-the-art results in the main metrics.

5 Further Analysis

In this section, several studies are used to argue that CORE has a good capability to learn high-order correlations and generate correlation-guided representations with competitive convergence speed. Moreover, CORE can classify each class well, even if that class is low-frequency.

We provide the macro-based and rank-based metrics in Table 3 to quantify the prediction performance from different analytical perspectives. Our proposed CORE shows the best performance on macro-F1, which indicates that our method effectively improves the performance across all classes. In addition, BERTlabseq, which we proposed to validate the global embedding strategy, has the best performance on the rank-based metrics. The direct use of label sequence representations preserves label-label correlations in a more primitive form, which favors relevant labels ranking higher than irrelevant ones. However, this method has weak word-label correlations and is prone to misclassification.

Alternatively, we divide the labels on AAPD into three groups according to their frequency of occurrence. Nearly 56% of labels, appearing more than 60 times, are high-frequency labels and form Group1. Labels appearing 15-60 times form Group2 (34%), and the remaining 10% of labels form Group3. Figure 3 shows that all algorithms perform better on high-frequency labels (Group1) than on low-frequency labels (Group3), which is reasonable since there are more samples for high-frequency labels. More significantly, CORE improves macro-F1 on Group2 and Group3 compared to other methods and is more robust in classifying mid/low-frequency labels. These results demonstrate the superiority of our proposed model in predicting low-frequency labels.

The convergence speed of three BERT-based models is shown in Figure 4. Both CORE and BERTlabseq outperform BERT in terms of convergence speed. CORE converges significantly faster than BERT, which means that the performance of our proposed CORE can approach the optimal solution more efficiently thanks to the global embedding strategy and text representation learning.
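As an aside on reproducibility, the label-frequency grouping behind Figure 3 can be sketched in a few lines; the snippet below is our illustration (only the 60 and 15 occurrence thresholds come from the text, and the helper name and toy label sets are hypothetical). Macro-F1 can then be computed within each group to produce a plot like Figure 3.

```python
from collections import Counter

def group_labels_by_frequency(train_label_sets, high=60, low=15):
    """Split labels into Group1 (>high), Group2 (low..high), Group3 (<low) by training frequency."""
    counts = Counter(label for labels in train_label_sets for label in labels)
    groups = {"Group1": [], "Group2": [], "Group3": []}
    for label, n in counts.items():
        if n > high:
            groups["Group1"].append(label)
        elif n >= low:
            groups["Group2"].append(label)
        else:
            groups["Group3"].append(label)
    return groups

# Toy usage with a handful of hypothetical AAPD-style label sets.
train_label_sets = [{"cs.CV", "cs.LG"}, {"cs.CV"}, {"cs.CL", "cs.LG"}, {"cs.NI"}]
print(group_labels_by_frequency(train_label_sets, high=2, low=2))
```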


Figure 3: Macro-F1 for the three groups on AAPD. All classes are sorted in descending order of their frequency of occurrence. Group1, Group2, and Group3 denote high-frequency, mid-frequency, and low-frequency labels, respectively.

Figure 4: The convergence speed of three BERT-based methods on (a) AAPD and (b) RCV1-V2. The x-axis refers to the training steps, and the y-axis refers to the micro-F1 score.

In Figure 5, we visualize one example from the AAPD dataset, where dark orange marks the more important words. For the labels “CV” and “CL”, the selected informative words are videos, image, movie, descriptions, etc. For the labels “CV” and “LG”, the selected informative words are images, machine, etc. Benefiting from the interpretability of the attention mechanism, the text representation learning correctly detects the key words with proper scores.

Figure 5: Visualization of the attention mechanism on the AAPD dataset. Two abstracts are shown with their reference labels: the first (“generating descriptions for videos has many applications including assisting blind people and human robot interaction the recent advances in image captioning ...”) with reference labels “CV” and “CL”, and the second (“compared to machines, humans are extremely good at classifying images into categories, especially when they possess prior knowledge of the categories at hand ... we propose an interactive machine teaching algorithm that enables a computer to teach challenging visual concepts to a human ...”) with reference labels “CV” and “LG”. Note that “CV”, “CL”, and “LG” denote computer vision, computational linguistics, and machine learning, respectively.

6 Related Work

Text representation plays a significant role in model performance. Early models relied on essential hand-crafted features [Joachims, 1998], whereas features can now be extracted automatically by DNNs. During the past decade, DNNs have been employed progressively in classification tasks by learning a set of nonlinear transformations that map text directly to outputs, such as CNN [Kim, 2014]. Each word in the text is represented by a specific vector obtained through a word embedding technique [Joulin et al., 2017]. Recently, the Transformer, proposed by [Vaswani et al., 2017], relies entirely on an attention mechanism to draw global dependencies between input and output. [Devlin et al., 2019] marked an important turning point in the development of text classification; it works by generating contextualized word vectors using the Transformer. [Yang et al., 2018; Yang et al., 2019] use sequence-to-sequence models that consist of an encoder and a decoder connected through an attention mechanism. [Pang et al., 2021] performs few-shot learning for text classification. To further manipulate the feature space, [Sun and Zhang, 2021] exploits a distance metric to generate discriminative meta-level features.

Label embedding techniques are gaining attention and application in large-scale heterogeneous networks [Tang et al., 2015], multi-class learning [Joshi et al., 2017], and label distribution learning [Peng et al., 2018]. [Pappas and Henderson, 2019] propose a non-linear transformation to capture the relationships across labels, which performs significantly better on unseen labels. [Liu et al., 2017] is the first DNN-based multi-label embedding method that seeks a deep latent space to jointly embed the instances and labels. [Chang et al., 2020] considers the extreme multi-label text classification (XML) problem and performs label embedding via label text or positive instances.

7 Conclusions and Future Work

In this paper, we present the CORE approach, which exploits correlation-guided representation for multi-label text classification. We first introduce the global embedding strategy, which learns high-order correlations between the context and all class labels in the same space. Then, the attention mechanism is used to highlight the most informative words in the text sequence. Extensive comparative studies clearly validate the superiority of our proposed CORE against state-of-the-art multi-label classification algorithms.

Our method treats all class labels as a label sequence, which means that by default the labels are ordered; however, they could also be treated as an unordered set. On the other hand, each label is virtualized as a single token. If a label has descriptive text, multiple tokens could be used for semantic learning, which might be useful for XML problems. These issues should be further explored in the future.

References


[Boutell et al., 2004] Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.

[Chang et al., 2020] Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit S. Dhillon. Taming pretrained transformers for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3163–3171, 2020.

[Chen et al., 2017] Guibin Chen, Deheng Ye, Zhenchang Xing, Jieshan Chen, and Erik Cambria. Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In Proceedings of the International Joint Conference on Neural Networks, pages 2377–2383, 2017.

[Clark et al., 2019] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the International Conference on Learning Representations, 2019.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[Joachims, 1998] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning, pages 137–142. Springer, 1998.

[Joshi et al., 2017] Bikash Joshi, Massih R. Amini, Ioannis Partalas, Franck Iutzeler, and Yury Maximov. Aggressive sampling for multi-class to binary reduction with applications to text classification. In Advances in Neural Information Processing Systems 30, pages 4159–4168, 2017.

[Joulin et al., 2017] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, 2017.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. arXiv:1408.5882, 2014.

[Lan et al., 2019] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations, 2019.

[Lewis et al., 2004] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

[Liu et al., 2017] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124, 2017.

[Pang et al., 2021] Ning Pang, Xiang Zhao, Wei Wang, Weidong Xiao, and Deke Guo. Few-shot text classification by leveraging bi-directional attention and cross-class knowledge. SCIENCE CHINA Information Sciences, 2021.

[Pappas and Henderson, 2019] Nikolaos Pappas and James Henderson. GILE: A generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics, 7:139–155, 2019.

[Peng et al., 2018] Cheng-Lun Peng, An Tao, and Xin Geng. Label embedding based on multi-scale locality preservation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2623–2629, 2018.

[Read et al., 2011] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333, 2011.

[Sun and Zhang, 2021] Yan-Ping Sun and Min-Ling Zhang. Compositional metric learning for multi-label classification. Frontiers of Computer Science, 15(5):1–12, 2021.

[Tang et al., 2015] Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174, 2015.

[Tsoumakas and Katakis, 2007] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008, 2017.

[Wang et al., 2018] Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331, 2018.

[Xiao et al., 2019] Lin Xiao, Xin Huang, Boli Chen, and Liping Jing. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 466–475, 2019.

[Yang et al., 2018] Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. SGM: Sequence generation model for multi-label classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926, 2018.

[Yang et al., 2019] Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, and Xu Sun. A deep reinforced sequence-to-set model for multi-label classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5252–5258, 2019.

[Zhang and Zhou, 2014] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
