Knowledge Enhanced Masked Language Model for Stance Detection

Kornraphop Kawintiranon and Lisa Singh
Department of Computer Science, Georgetown University, Washington, DC, USA
{kk1155,lisa.singh}@georgetown.edu

Abstract

Detecting stance on Twitter is especially challenging because of the short length of each tweet, the continuous coinage of new terminology and hashtags, and the deviation of sentence structure from standard prose. Fine-tuned language models using large-scale in-domain data have been shown to be the new state-of-the-art for many NLP tasks, including stance detection. In this paper, we propose a novel BERT-based fine-tuning method that enhances the masked language model for stance detection. Instead of random token masking, we propose using a weighted log-odds-ratio to identify words with high stance distinguishability and then model an attention mechanism that focuses on these words. We show that our proposed approach outperforms the state of the art for stance detection on Twitter data about the 2020 US Presidential election.

1 Introduction

Stance detection refers to the task of classifying a piece of text as either being in support, opposition, or neutral towards a given target. While this type of labeling is useful for a wide range of opinion research, it is particularly important for understanding the public's perception of given targets, for example, candidates during an election. For this reason, our focus in this paper is on detecting stance towards political entities, namely Joe Biden and Donald Trump during the 2020 US Presidential election.

Stance detection is related to, but distinct from, the task of sentiment analysis, which aims to extract whether the general tone of a piece of text is positive, negative, or neutral. Sobhani and colleagues (Sobhani et al., 2016) show that measures of stance and sentiment are only 60% correlated. For example, the following sample tweet^1 has an obvious positive sentiment, but an opposing stance towards Donald Trump.

    I'm so happy Biden beat Trump in the debate.

Stance detection is an especially difficult problem on Twitter. A large part of this difficulty comes from the fact that Twitter content is short, highly dynamic, continually generating new hashtags and abbreviations, and deviates from standard prose sentence structure. Recently, learning models using pre-training (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019) have shown a strong ability to learn semantic representation and outperform many state-of-the-art approaches across different natural language processing (NLP) tasks. This is also true for stance detection. The strongest models for stance detection on Twitter use pre-trained BERT (Ghosh et al., 2019; Sen et al., 2018).

A recent study that proposed models for sentiment analysis (Tian et al., 2020) showed that focusing the learning model on some relevant words, i.e., sentiment words extracted using Pointwise Mutual Information (PMI) (Bouma, 2009), performed better than using the standard pre-trained BERT model. We are interested in understanding whether or not focusing attention on specific stance-relevant vocabulary during the learning process will improve stance detection. To accomplish this, we consider the following two questions. First, how do we identify the most important stance-relevant words within a data set? And second, how much attention needs to be paid to these words versus random domain words? Toward that end, we propose building different knowledge enhanced learning models that integrate an understanding of important context-specific stance words into the pre-training process.

^1 All of the sample tweets in this paper are invented by the authors. They are representative of real data, but do not correspond to any actual tweet in the data set, in order to preserve the privacy of Twitter users.

While we consider PMI as a way to identify important stance words, we find that using the log-odds-ratio performs better.

We also consider different options for fine-tuning an attention-based language model. To fine-tune an attention-based language model to a specific task, the most common approach is to fine-tune using unlabeled data with random masking (Devlin et al., 2019; Liu et al., 2019). Because of the noise within social media posts, random tokens that are not task-relevant can impact the sentence representation negatively. Therefore, instead of letting the model pay attention to random tokens, we introduce Knowledge Enhanced Masked Language Modeling (KE-MLM), where significant tokens generated using the log-odds-ratio are incorporated into the learning process and used to improve a downstream classification task. To the best of our knowledge, this is the first work that identifies significant tokens using the log-odds-ratio for a specific task and integrates those tokens into an attention-based learning process for better classification performance.

In summary, we study stance detection on English tweets and our contributions are as follows. (i) We propose using the log-odds-ratio with Dirichlet prior for knowledge mining to identify the most distinguishable stance words. (ii) We propose a novel method to fine-tune a pre-trained masked language model for stance detection that incorporates background knowledge about the stance task. (iii) We show that our proposed knowledge mining approach and our learning model outperform the fine-tuned BERT in a low resource setting in which the data set contains 2500 labeled tweets about the 2020 US Presidential election. (iv) We release our labeled stance data to help the research community continue to make progress on stance detection methods.^2

^2 https://github.com/GU-DataLab/stance-detection-KE-MLM

2 Related Work

In the NLP community, sentiment analysis is a more established task that has received more attention than stance detection. A sub-domain of sentiment analysis is target-directed or aspect-specific sentiment, which refers to the tone with which an author writes about a specific target/entity or an aspect of a target (Mitchell et al., 2013; Jiang et al., 2011). One common use case is breaking down sentiment toward different aspects of a product in reviews, e.g., the price of a laptop versus its CPU performance (Schmitt et al., 2018; Chen et al., 2017; Poddar et al., 2017; Tian et al., 2020). Different approaches have been proposed to tackle this problem. Chen and colleagues combine attention with recurrent neural networks (Chen et al., 2017). Schmitt and colleagues propose combining a convolutional neural network and fastText embeddings (Schmitt et al., 2018). A recent study proposes modifying the learning objective of the masked language model to pay attention to a specific set of sentiment words extracted by PMI (Tian et al., 2020). The model achieves new state-of-the-art results on most of the test data sets. Because stance is a different task,^3 we will adjust their target-directed sentiment approach for stance and compare to it in our empirical evaluation.

^3 Stance detection aims to detect the opinion s towards the specific target e, while aspect-based sentiment focuses on extracting the aspect a of the target e and the corresponding opinion s (Wang et al., 2019).
The most well-known data for political stance detection is published by SemEval 2016 (Mohammad et al., 2016b; Aldayel and Magdy, 2019). The paper describing the data set provides a high-level review of approaches to stance detection using Twitter data. The best user-submitted system was a neural classifier from MITRE (Zarrella and Marsh, 2016), which utilized a pre-trained language model on a large amount of unlabeled data. An important contribution of this study was using pre-trained word embeddings from an auxiliary task where a language model was trained to predict a missing hashtag from a given tweet. The runner-up model was a convolutional neural network for text classification (Wei et al., 2016).

Following the MITRE model, there were a number of both traditional and neural models proposed for stance detection. A study focusing on traditional classifiers proposed using a support vector machine (SVM) with lexicon-based features, sentiment features, and additional features (Sen et al., 2018). Another SVM-based model consisted of two-step SVMs (Dey et al., 2017). In the first step, the model predicts whether an input sequence is relevant to a given target. The next step detects the stance if the input sequence is relevant. The target-specific attention neural network (TAN) is a novel bidirectional LSTM-based attention model. In this study, Du and colleagues trained it on unpublished unlabeled data to learn the domain context (Du et al., 2017). Recently, a neural ensemble model consisting of bi-LSTM, nested LSTMs, and an attention model was proposed for stance detection on Twitter (Siddiqua et al., 2019). The model's embedding weights were initialized with the pre-trained embeddings from fastText (Bojanowski et al., 2017).

The emergence of transformer-based deep learning models has led to high levels of improvement for many NLP tasks, including stance detection (Ghosh et al., 2019; Küçük and Can, 2020; AlDayel and Magdy, 2020). BERT (Devlin et al., 2019) is the most used deep transformer encoder. More specifically, BERT uses Masked Language Modeling (MLM) to pre-train a transformer encoder by predicting masked tokens in order to learn the semantic representation of a corpus. Ghosh and colleagues (Ghosh et al., 2019) show that the original pre-trained BERT, without any further fine-tuning, outperforms other former state-of-the-art models on the SemEval set, including the model that utilizes both text and user information (Del Tredici et al., 2019). Because we are interested in the 2020 US Presidential election and many temporal factors relevant to stance exist (e.g. political topics), we introduce a new Election 2020 data set. For our empirical analysis, we will use this data set, and compare our approach to other state-of-the-art methods that used the SemEval data set. Our data sets are described in Section 5.1.

Inspired by BERT, different variations of BERT have been proposed to solve different specific NLP tasks. SpanBERT (Joshi et al., 2019) masks tokens within a given span range. ERNIE (Sun et al., 2019) finds and masks entity tokens, achieving new state-of-the-art results on many Chinese NLP tasks, including sentiment analysis. GlossBERT (Huang et al., 2019) uses gloss knowledge (sense definitions) to improve performance on a word sense disambiguation task. SenseBERT (Levine et al., 2020) aims to predict both masked words and their WordNet super-senses to improve word-in-context tasks. Zhang and colleagues introduce entity token masking (Zhang et al., 2019) for relation classification, where the goal is to classify relation labels of given entity pairs based on context. A number of studies have worked on adjusting transformers for sentiment analysis tasks. A recent study (Tian et al., 2020) proposes a sentiment knowledge enhanced pre-training method (SKEP). It shows that masking sentiment words extracted by PMI guides the language model to learn more sentiment knowledge, resulting in better sentiment classification performance. SentiLARE (Ke et al., 2020) uses an alternative approach that injects word-level linguistic knowledge, including part-of-speech tags and sentiment polarity scores obtained from SentiWordNet (Guerini et al., 2013), into the pre-training process. Following these works, SENTIX (Zhou et al., 2020) was proposed to incorporate domain-invariant sentiment knowledge for cross-domain sentiment data sets. Our work differs because our task is stance detection and we employ a novel knowledge mining step that uses the log-odds-ratio to determine significant tokens that need to be masked.
3 KE-MLM: Knowledge Enhanced Masked Language Modeling

We propose Knowledge Enhanced Masked Language Modeling (KE-MLM), which integrates knowledge that enhances the classification task in the fine-tuning process. We identify task-relevant tokens using knowledge mining (Section 3.1). We then use these discovered tokens within a masked language model (Section 3.2).

3.1 Knowledge Mining for Classification

While TF-IDF is the preferred method for identifying important words in a corpus, we are interested in identifying important words for distinguishing stance, not just words that are important within the corpus. Therefore, we propose using the weighted log-odds-ratio technique with informed Dirichlet priors proposed by Monroe and colleagues (Monroe et al., 2008) to compute significant words for each stance class. Intuitively, this measure attempts to account for the amount of variance in a word's frequency and uses word frequencies from a background corpus as priors to reduce the noise generated by rare words. This technique has been shown to outperform other methods that were designed to find significant words within a corpus, such as PMI and TF-IDF (Monroe et al., 2008; Jurafsky et al., 2014; Budak, 2019).

More formally, we compute the usage difference for word w among two corpora using the log-odds-ratio with informative Dirichlet priors as shown in Equation 1, where n^i is the size of corpus i and n^j is the size of corpus j. y_w^i and y_w^j indicate the word count of w in corpus i and j, respectively. α_0 is the size of the background corpus and α_w is the word count of w in the background corpus.

\delta_w^{(i-j)} = \log \frac{y_w^i + \alpha_w}{n^i + \alpha_0 - y_w^i - \alpha_w} - \log \frac{y_w^j + \alpha_w}{n^j + \alpha_0 - y_w^j - \alpha_w}    (1)
To measure the significance of each word, we first compute the variance (σ²) of the log-odds-ratio using Equation 2, and then compute the Z-score using Equation 3. A higher score indicates more significance of word w within corpus i compared to corpus j. A lower score means more significance of word w within corpus j compared to corpus i.

\sigma^2\left(\delta_w^{(i-j)}\right) \approx \frac{1}{y_w^i + \alpha_w} + \frac{1}{y_w^j + \alpha_w}    (2)

Z = \frac{\delta_w^{(i-j)}}{\sqrt{\sigma^2\left(\delta_w^{(i-j)}\right)}}    (3)

Since stance has three different classes (support, opposition and neutral), we need to adjust the log-odds-ratio technique in order to obtain a set of significant stance words. Using a training set, we find stance tokens, which are significant tokens for support/non-support or opposition/non-opposition, as follows:

• Supportive & Non-supportive tokens are the highest and lowest Z-score tokens, respectively, when i contains only the support class and j contains only the opposition and neutral classes.

• Opposing & Non-opposing tokens are the highest and lowest Z-score tokens, respectively, when i contains only the opposition class and j contains only the support and neutral classes.

We select the highest and lowest k tokens based on Z-score from each token list above. This results in four k-token lists. The combined tokens of these lists, after removing duplicates, are defined to be the stance tokens. We hypothesize that these stance tokens will play a key role during stance detection.
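To make the mining step concrete, the sketch below computes the z-scored log-odds-ratio of Equations 1-3 and assembles the four stance-token lists described above. It is our own illustration rather than the authors' released code, and the helper names (log_odds_z_scores, mine_stance_tokens) are ours.

```python
import math
from collections import Counter

def log_odds_z_scores(corpus_i, corpus_j, background):
    """Weighted log-odds-ratio with informative Dirichlet priors (Eq. 1-3).

    Each argument is a list of tokenized documents (lists of tokens).
    Returns {word: z_score}; positive scores favor corpus_i, negative favor corpus_j.
    """
    yi, yj, alpha = Counter(), Counter(), Counter()
    for doc in corpus_i: yi.update(doc)
    for doc in corpus_j: yj.update(doc)
    for doc in background: alpha.update(doc)

    ni, nj, a0 = sum(yi.values()), sum(yj.values()), sum(alpha.values())
    z = {}
    for w in set(yi) | set(yj):
        aw = alpha[w]
        if aw == 0:          # skip words absent from the background prior
            continue
        # Eq. 1: difference of log-odds, smoothed by the background counts
        delta = (math.log((yi[w] + aw) / (ni + a0 - yi[w] - aw))
                 - math.log((yj[w] + aw) / (nj + a0 - yj[w] - aw)))
        # Eq. 2: approximate variance of the log-odds difference
        var = 1.0 / (yi[w] + aw) + 1.0 / (yj[w] + aw)
        # Eq. 3: z-score
        z[w] = delta / math.sqrt(var)
    return z

def mine_stance_tokens(support, oppose, neutral, background, k=10):
    """Top/bottom-k tokens for support-vs-rest and oppose-vs-rest, then union."""
    z_sup = log_odds_z_scores(support, oppose + neutral, background)
    z_opp = log_odds_z_scores(oppose, support + neutral, background)
    ranked_sup = sorted(z_sup, key=z_sup.get)
    ranked_opp = sorted(z_opp, key=z_opp.get)
    supportive, non_supportive = ranked_sup[-k:], ranked_sup[:k]
    opposing, non_opposing = ranked_opp[-k:], ranked_opp[:k]
    return set(supportive + non_supportive + opposing + non_opposing)
```

In this setting, corpus i would hold the training tweets of one stance class and corpus j the tweets of the remaining classes; the unlabeled election tweets could, for example, serve as the background corpus.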
3.2 Significant Token Masking

There are two main approaches to train a transformer encoder: Causal Language Modeling (CLM) and Masked Language Modeling (MLM). CLM has a standard language modeling objective, predicting the next token given all previous tokens in the input sequence. This means that it needs to learn tokens in order and can only see the previous tokens. On the other hand, MLM uses a masking technique that is more flexible, allowing researchers to explicitly assign which tokens to mask. The other tokens are used for masked token recovery. Intuitively, a language model that learns to recover a specific set of tokens well will tend to produce a better semantic representation for sequences containing those tokens (Tian et al., 2020; Ke et al., 2020; Zhou et al., 2020). Generally, randomly masking tokens is preferred when the task requires the language model to learn to recover all tokens equally. This tends to result in a semantic representation that is equally good for any input sequence.

In many BERT-based models, when training the transformer encoder with masked language modeling, the input sequence is modified by randomly substituting tokens of the sequence. Specifically, BERT uniformly chooses 15% of input tokens, of which 80% are replaced with a special masked token [MASK], 10% are replaced with a random token, and 10% are not replaced and remain unchanged. The goal of significant token masking is to produce a corrupted version of the input sequence by masking the significant tokens rather than random tokens. We keep the same ratio of masked tokens by masking up to 15% of the significant tokens. If fewer than 15% of the tokens are significant, we randomly mask other tokens to fill up to 15%.^4

Formally, significant word masking creates a corrupted version X́ for an input sequence X that is influenced by the extracted knowledge G. Tokens of the sequences X and X́ are denoted by x_i and x́_i, respectively. In the fine-tuning process, the transformer encoder is trained using a masked word prediction objective that is supervised by recovering masked significant words using the final state of the encoder x́_1, ..., x́_n, where n is the length of the sequence.

After constructing this corrupted version of the sequence, MLM aims to predict the masked tokens to recover the original tokens. In this paper, we inject knowledge for our specific classification task during MLM, causing the model to pay more attention to stance tokens instead of random tokens.

^4 With a set of 20-40 significant words, their word counts are roughly 1% of the total number of tokens of the unlabeled data that we trained the language model on.
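The masking policy just described can be illustrated with a short sketch (again our own, not the released implementation): significant positions are consumed first, random positions fill the remaining 15% budget, and the chosen positions follow BERT's 80/10/10 replacement scheme.

```python
import random

def mask_with_stance_tokens(tokens, stance_tokens, mask_token="[MASK]",
                            vocab=None, mask_ratio=0.15):
    """Corrupt a token sequence, preferring significant (stance) tokens.

    Masks up to 15% of positions: significant positions first, then random
    positions to fill the remaining budget. Returns (corrupted, labels),
    where labels[i] is the original token at masked positions and None elsewhere.
    """
    budget = max(1, int(round(mask_ratio * len(tokens))))
    significant = [i for i, t in enumerate(tokens) if t in stance_tokens]
    sig_set = set(significant)
    others = [i for i in range(len(tokens)) if i not in sig_set]
    random.shuffle(significant)
    random.shuffle(others)
    chosen = significant[:budget] + others[:max(0, budget - len(significant))]

    corrupted, labels = list(tokens), [None] * len(tokens)
    for i in chosen:
        labels[i] = tokens[i]
        r = random.random()
        if r < 0.8:                      # 80%: replace with [MASK]
            corrupted[i] = mask_token
        elif r < 0.9 and vocab:          # 10%: replace with a random vocabulary token
            corrupted[i] = random.choice(vocab)
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```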

Biden (Support): administration, ballot, bluewave, early, kamala, rule, safe, show, trust, voteblue
Biden (Oppose): bernie, black, blah, cities, kag, maga, money, patriots, woman, wwg1wga
Trump (Support): americafirst, follow, help, ifbap, kag, maga, patriots, retweet, thanks, votered
Trump (Oppose): bluewave, consequences, demconvention, division, liar, make, republicans, resist, stand

These results are based on real data. Tokens are sorted alphabetically.

Formally, we get an embedding vector x̃_i from the transformer encoder by feeding in the corrupted version X́ of input sequence X. Next, the embedding vector is fed into a single layer of neural network with a softmax activation layer in order to produce a normalized probability vector ŷ_i over the entire vocabulary, as shown in Equation 4, where W is a weight vector and b is a bias vector. Therefore, the prediction objective L is to maximize the probability of the original token x_i, computed in Equation 5, where m_i = 1 if the token at the i-th position is masked, otherwise m_i = 0, and y_i is a one-hot representation of the original token.

\hat{y}_i = \mathrm{softmax}(\tilde{x}_i W + b)    (4)

L = -\sum_{i=1}^{n} m_i \times y_i \log \hat{y}_i    (5)

Finally, we fine-tune a pre-trained BERT with unlabeled in-domain (election 2020) data. The representation learned by the language model is expected to be customized for the stance detection task.
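As a compact rendering of Equations 4 and 5, the prediction head and loss could be written in PyTorch as below; the module and function names are ours, and the encoder is assumed to emit one hidden vector per token.

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Single linear layer plus softmax over the vocabulary (Equation 4)."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)   # weight W and bias b

    def forward(self, encoder_states):                   # (batch, seq_len, hidden)
        return torch.log_softmax(self.proj(encoder_states), dim=-1)

def masked_word_loss(log_probs, original_ids, mask_flags):
    """Equation 5: summed negative log-likelihood of the original tokens,
    counted only at positions that were masked (mask_flags == 1)."""
    token_ll = log_probs.gather(-1, original_ids.unsqueeze(-1)).squeeze(-1)
    return -(mask_flags * token_ll).sum()
```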
4 Experimental Design

In this section we describe our experimental design, beginning with the knowledge mining decisions, followed by the decisions and parameters used for the language models.

4.1 Stance Knowledge Mining

We begin by determining the number of significant stance words to identify. Based on a sensitivity analysis, we set k = 10 to extract the top-10 significant words for each stance category as described in Section 3.1 (support, non-support, oppose, non-oppose). Examples of significant tokens from the strong supportive/opposing stance are shown in Table 1. Our stance detection models are independently trained for each candidate, so overlapping tokens are allowed (e.g. the word patriots tends to support Trump but oppose Biden). Once we have a set of tokens for the four categories, we union these four token sets. After removing duplicates, there are roughly 30 stance tokens for each candidate.

4.2 Language Models

Because the state-of-the-art models for stance detection are neural models with language models pre-trained on a large amount of in-domain data (Zarrella and Marsh, 2016; Küçük and Can, 2020), we use both the original pre-trained BERT and BERT fine-tuned on the unlabeled election data as our benchmarks. We fine-tuned BERT for two epochs since this gives the best perplexity score.^5 For KE-MLM, we first initialize the weights of the model using the same values as the original BERT, then we fine-tune the model with unlabeled election data using the identified stance tokens masked. We exhaustively fine-tuned KE-MLM to produce the language model that focuses attention on the stance tokens from the training set.

Because BERT's tokenizer uses WordPiece (Wu et al., 2016), a subword segmentation algorithm, it cannot learn new tokens after pre-training is finished without explicitly specifying them. However, adding new tokens with random embedding weights would cause the pre-trained model to work differently, since it was not pre-trained with those new tokens. We realize that some significant tokens for the stance of Election 2020 are new to BERT and were not in the original BERT pre-training process. Therefore, we consider adding all the stance words to the BERT tokenizer. We hypothesize that adding such a small number of tokens will barely affect the pre-trained model. To test the effect of adding stance tokens into the normal fine-tuning process, we train language models in which stance tokens are added, but we fine-tune them with the normal random masking method. We refer to this model as a-BERT, where stance tokens are added to the BERT tokenizer, but only the standard fine-tuning method is performed. To compare our performance to the sentiment knowledge enhanced pre-training method, or SKEP (Tian et al., 2020), we use the pre-training method proposed in their paper and then fine-tune the model using our election 2020 data (SKEP).

^5 Perplexity is a performance measurement of the masked language model; a lower score is better.
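For the a-BERT variant, registering the mined stance tokens with the tokenizer can be done with the standard Hugging Face calls sketched here; the snippet is illustrative, and the token list is an abbreviated example drawn from Table 1.

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Stance tokens mined in Section 3.1 that WordPiece would otherwise split apart
new_tokens = ["bluewave", "voteblue", "maga", "kag", "wwg1wga", "demconvention"]
num_added = tokenizer.add_tokens(new_tokens)

# Appends randomly initialized rows to the embedding matrix for the new tokens
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

The resize call is what introduces the randomly initialized embeddings discussed above; the subsequent fine-tuning on unlabeled election data is what gives those rows meaningful values.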
We hypothesize that applying KE-MLM may guide the language model to focus too much attention on the stance knowledge and learn less semantic information about the election itself. Therefore, we consider a hybrid fine-tuning strategy. We begin by fine-tuning BERT for one epoch. Then we fine-tune using KE-MLM in the next epoch. This hybrid strategy forces the model to continually learn stance knowledge along with semantic information about the election. We expect that this dual learning will construct a language model biased toward necessary semantic information about the election, as well as the necessary embedded stance knowledge. We refer to this approach as our KE-MLM (with continuous fine-tuning), while KE-MLM– refers to a model that is overly fine-tuned with only stance token masking.

To summarize, the language models we will evaluate are as follows: the original pre-trained BERT (o-BERT), a normally fine-tuned BERT that uses our election data (f-BERT), a normally fine-tuned BERT that uses stance tokens as part of its tokenizer (a-BERT), a fine-tuned BERT using the SKEP method (Tian et al., 2020) (SKEP), our overly fine-tuned model (KE-MLM–), and our hybrid fine-tuned model (KE-MLM). For all the language models, we truncate the size of an input sequence to 512 tokens. The learning rate is constant at 1e−4 and the batch size is 16.

4.3 Classification Models

In masked language modeling, we fine-tune the model using a neural layer on top, with the learning objective to predict masked tokens. In this step, we substitute that layer with a new neural layer as a stance classifier layer. Its weights are arbitrarily initialized. The prediction equation is similar to Equation 4, but now the input is not corrupted and the output is a vector of the normalized probabilities of the three stance classes. We use a cross-entropy loss function and the objective is to minimize it. We use the Adam optimizer (Kingma and Ba, 2015) with five different learning rates, including 2e−5, 1e−5, 5e−6, 2e−6 and 1e−6. The batch size is constantly set to 32 during the classification learning process.

We train and test our models on each candidate independently with five different learning rates. The best model is determined by the best macro average F1 score over three classes among the five learning rates. Because the weights of the classifier layer are randomly initialized, we run each model five times. The average F1 score is reported in Table 2 as the classification performance.

5 Empirical Evaluation

After describing our data set (Section 5.1), we present our experimental evaluation, both quantitative (Section 5.2) and qualitative (Section 5.3).

5.1 Data Sets

For this study, our research team collected English tweets related to the 2020 US Presidential election. Through the Twitter Streaming API, we collected data using election-related hashtags and keywords. Between January 2020 and September 2020, we collected over 5 million tweets, not including quotes and retweets. These unlabeled tweets were used to fine-tune all of our language models.

Our specific stance task is to determine the stance for the two presidential candidates, Joe Biden and Donald Trump. For each candidate, we had three stance classes: support, opposition, and neutral.^6 We consider two stance-labeled data sets, one for each candidate, Biden and Trump. Our data were labeled using Amazon Mechanical Turk (MTurk) workers (Crowston, 2012). These workers were not trained. Instead, we provided a set of examples for each stance category that they could refer to as they conducted the labeling task. Examples of statements presented to MTurk workers are presented in Table 3. We asked annotators to carefully review each tweet t_C^i from the tweet set T_C = {t_C^1, t_C^2, ...} and determine whether the tweet t_C^i is (i) clearly in support of C, (ii) clearly in opposition to C, or (iii) not clearly in support of or opposition to C, where t_C^i ∈ T_C and C ∈ {Donald Trump, Joe Biden}. To increase the labeling yield, we verify that the two tweet sets T_{C=Donald Trump} and T_{C=Joe Biden} are mutually exclusive. Each tweet was labeled by three annotators and the majority vote is considered to be the true label. If all three annotators vote for three different classes, we assume the tweet's label is neutral because the stance is ambiguous.

^6 Our definition of stance labels is consistent with the definition from (Mohammad et al., 2016a).
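The aggregation rule for the three annotator votes can be summarized by a small helper; this is our own illustration of the stated rule, not code from the labeling pipeline.

```python
from collections import Counter

def aggregate_stance(votes):
    """Majority vote over three annotator labels; a three-way split defaults to neutral."""
    counts = Counter(votes)                     # e.g. ["support", "oppose", "neutral"]
    label, n = counts.most_common(1)[0]
    return label if n >= 2 else "neutral"

assert aggregate_stance(["support", "support", "oppose"]) == "support"
assert aggregate_stance(["support", "oppose", "neutral"]) == "neutral"
```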

Biden
Model      F1-Support  F1-Oppose  F1-Neutral  F1-macro
o-BERT     0.7324      0.6875     0.7151      0.7117 (±0.0063)
f-BERT     0.7743      0.7226     0.7347      0.7439 (±0.0049)
a-BERT     0.7905      0.7234     0.7432      0.7523 (±0.0049)
SKEP       0.7923      0.7153     0.7349      0.7475 (±0.0047)
KE-MLM–    0.7618      0.7303     0.7380      0.7434 (±0.0040)
KE-MLM     0.7927      0.7329     0.7475      0.7577 (±0.0032)

Trump
Model      F1-Support  F1-Oppose  F1-Neutral  F1-macro
o-BERT     0.7574      0.8101     0.6955      0.7543 (±0.0069)
f-BERT     0.7921      0.8147     0.6961      0.7677 (±0.0084)
a-BERT     0.8090      0.8154     0.6926      0.7724 (±0.0078)
SKEP       0.7852      0.8169     0.7151      0.7724 (±0.0067)
KE-MLM–    0.7854      0.7968     0.7083      0.7635 (±0.0081)
KE-MLM     0.8094      0.8184     0.7354      0.7877 (±0.0075)

Table 3: Sample of stance examples presented to MTurk labelers.

Candidate  Statement                                                            Stance
Biden      Biden will be a great president. I am voting for him in November.   Support
Biden      Biden has handled the pandemic poorly.                              Oppose
Biden      Biden spoke in Pennsylvania.                                        Neutral
Trump      Trump has been a great president. I am voting for him in November.  Support
Trump      Trump has handled the pandemic poorly.                              Oppose
Trump      Trump held a rally yesterday.                                       Neutral

Our data set contains 1250 stance-labeled tweets for each candidate. The stance label distributions are shown in Table 4. The distributions of both candidates are skewed towards the opposition label. Overall, the stance class proportions vary from 27% to 39%. The inter-annotator agreement scores from different metrics are shown in Table 5. The task-based and worker-based metrics are recommended by the MTurk official site (Amazon, 2011), given their annotating mechanism. All scores range from 86% up to 89%, indicating high inter-rater reliability for these data sets.

Table 4: Stance distribution for Biden and Trump.

         %SUPPORT  %OPPOSE  %NEUTRAL
Biden    31.3      39.0     29.8
Trump    27.3      39.9     32.8

Table 5: Mechanical Turk inter-annotator agreement for Biden and Trump.

Metric        Biden   Trump
Task-based    0.8693  0.8920
Worker-based  0.8915  0.8969

5.2 Experimental Results

We conducted experiments on train-test sets using a 70:30 split for both the Biden and Trump data sets.^7 We evaluate the classification performance using the macro-average F1 score along with the F1 score of each class. The results presented in Table 2 show the average F1 scores over five runs with different random seeds. The highest score for each evaluation metric is highlighted in bold.

^7 Because we do not have sufficient unlabeled election data from 2016, we cannot fairly test our model with the SemEval 2016 stance data.

For Biden-stance, every fine-tuning method (f-BERT, a-BERT, SKEP, KE-MLM– and KE-MLM) improves the average F1 score over the original pre-trained model by 3.2%, 4.1%, 3.6%, 3.2% and 4.6%, respectively. For Trump-stance, the average F1 scores are also improved by 1.3%, 1.8%, 1.8%, 0.9% and 3.3%. The improvement is twice as large for Biden as for Trump. This is an indication that the additional background knowledge is more important for detecting stance for Biden than for Trump. In general, our knowledge enhanced model performs better than all the other models and outperforms the original BERT by three to five percent. a-BERT performs similarly to SKEP for Trump, but its performance is better for Biden. The model's overall performances are second-best, with only a difference of 0.5% and 1.5% in the average F1-macro score when compared to KE-MLM for Biden and Trump, respectively. These results further highlight the importance of incorporating stance tokens into the tokenizer. While adding stance tokens to the tokenization is important, the additional improvement of KE-MLM comes from focusing attention on both the stance tokens and the general election data. The result also supports our hypothesis that training KE-MLM– alone for two epochs would result in better accuracy than the original BERT (o-BERT), but lower accuracy than the normally fine-tuned BERT (f-BERT), because it learns stance knowledge but lacks in-domain election knowledge.
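For reference, the macro-average F1 and a 95% interval over five runs can be computed as sketched below; the numbers shown are placeholders, and whether the paper's intervals use this normal approximation is an assumption on our part.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred):
    """Macro-average F1 over the three stance classes."""
    return f1_score(y_true, y_pred,
                    labels=["support", "oppose", "neutral"], average="macro")

def ci95(scores):
    """Half-width of a 95% normal-approximation interval over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    return 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))

runs = [0.7581, 0.7562, 0.7590, 0.7549, 0.7603]   # hypothetical macro F1 per run
print(f"{np.mean(runs):.4f} (±{ci95(runs):.4f})")
```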
To better understand the robustness of our models, we analyze the variance in the F1 scores across the different runs. Figure 1 shows the box plots of the macro average F1 scores for each model. The scores of both candidates follow a similar pattern. For Biden, the highest F1 score and the lowest variance is KE-MLM. For Trump, the highest F1 score is KE-MLM, but the variance is comparable to the other models. The model with the lowest variance is SKEP. These figures further emphasize KE-MLM's ability to detect stance better than normal fine-tuning methods. Interestingly, a-BERT performs second-best (see gray boxes in Figure 1), further highlighting the importance of not ignoring stance tokens. Forcefully adding unseen stance tokens to the BERT tokenizer with random initial weights benefits overall classification performance.

Figure 1: The distribution of macro average F1 scores from five independent runs. (a) Biden. (b) Trump.

Additionally, we conducted a sensitivity analysis on different sizes of unlabeled data for pre-training to verify that the large unlabeled data set is actually beneficial. We fine-tune f-BERT using different sizes of data (100K, 500K, 1M, 2M) and compare the results to those of BERT with zero fine-tuning (o-BERT) and fine-tuning using the entire 5M tweets (f-BERT). We train each pre-trained language model on the training data and test on the testing data set five times. The average F1 scores are shown in Figure 2. For Biden, the average F1 score is 3% lower when there is no fine-tuning compared to using all 5M tweets. For Trump, the score only improves a little over 1%. Interestingly, as the size of the unlabeled data increases, the F1 score also increases, even though the increase is not always large. Therefore, pre-training using a smaller unlabeled data set does still produce benefits, but when possible, using a large sample does lead to improvement.

Figure 2: The model performance of f-BERT pre-trained on different sizes of unlabeled data. We train each model five times and report the average F1 scores.

5.3 Qualitative Analysis of the Effect of Stance Knowledge

While we see from Table 2 that KE-MLM outperforms all baselines on average, we are interested in understanding, when there is labeling disagreement between the other methods and KE-MLM, what features are driving the disagreement. Therefore, we manually investigate samples in which f-BERT and a-BERT produced incorrect predictions, while KE-MLM produced correct ones. On average over multiple runs, 28.8% and 38.5% of misclassified tweets by f-BERT are correctly predicted by KE-MLM for Biden and Trump, respectively. For a-BERT, they are 22.5% and 25.7% on average. As a case example, Table 6 illustrates the attention distribution of the sequence representation learned by each language model for a few mislabeled tweets. Significant words are colored. The color darkness is determined by the attention weights of the representation learned for the classification token.^8 The darker the color, the more important the word. From the selected samples, we know from the knowledge mining step that the words "maga" and "demconvention" are two of the most distinguishing stance words (see Table 1), but both f-BERT and a-BERT fail to identify these strong stance words and therefore produced incorrect predictions. In contrast, KE-MLM produces the correct predictions by paying reasonable attention to the stance information, further supporting the notion that KE-MLM is using meaningful, interpretable tokens.

^8 The representation of classification tokens produced by a transformer encoder is usually referred to as [CLS]. Please see (Devlin et al., 2019) for details about the attention weight calculation.

Candidate  Model   Sampled Sentence                                                                                                                                Prediction
Biden      f-BERT  The democrats and @joebiden believe in the power of the government. The #gop and @realdonaldtrump believe in the power of the american people. #maga   Neutral
Biden      a-BERT  The democrats and @joebiden believe in the power of the government. The #gop and @realdonaldtrump believe in the power of the american people. #maga   Neutral
Biden      KE-MLM  The democrats and @joebiden believe in the power of the government. The #gop and @realdonaldtrump believe in the power of the american people. #maga   Opposition
Trump      f-BERT  Covid-19 was Trump's biggest test. He failed miserably. #demconvention                                                                          Neutral
Trump      a-BERT  Covid-19 was Trump's biggest test. He failed miserably. #demconvention                                                                          Neutral
Trump      KE-MLM  Covid-19 was Trump's biggest test. He failed miserably. #demconvention                                                                          Opposition

6 Conclusions and Future Directions

Intuitively, a language model fine-tuned using in-domain unlabeled data should result in better classification performance than using the vanilla pre-trained BERT. Since our goal is to maximize the accuracy of a specific classification task, we train an attention-based language model to pay attention to words that help distinguish between the classes. We have shown that for stance detection, using the log-odds-ratio to identify significant tokens that separate the classes provides important knowledge for this classification task. Once these important tokens are identified, forcing the language model to pay attention to these tokens further improves the performance when compared to using standard data for fine-tuning. To the best of our knowledge, our approach is better than the other state-of-the-art approaches for stance detection. Additionally, we are releasing our data set to the community to help other researchers continue to make progress on the stance detection task. We believe this is the first stance-labeled Twitter data set for the 2020 US Presidential election.

There are several future directions for this work. First, to relax the trade-off between learning election semantics in general and learning stance knowledge, instead of fine-tuning one epoch with the normal fine-tuning method and another epoch with KE-MLM, we could reduce the masking probability of stance distinguishing words from 100% to something lower based on the distinguishability of the token. Theoretically, this would give a higher weight to words that are more polarizing. This also relaxes the potential overfitting that may occur when learning only stance knowledge and lets the model randomly learn more tokens. Another future direction is to test our language modeling method on other classification tasks (e.g. sentiment analysis, spam detection). Also, this paper uses BERT as the base language model. There are many variations of BERT that can be further investigated (e.g. RoBERTa). Finally, we view stance as an important task for understanding public opinion. As our models get stronger, using them to gain insight into public opinion on issues of the day is another important future direction.

Acknowledgements

This research was funded by National Science Foundation awards #1934925 and #1934494, and the Massive Data Institute (MDI) at Georgetown University. We would like to thank our funders, the MDI staff, and the members of the Georgetown DataLab for their support. We would also like to thank the anonymous reviewers for the detailed and thoughtful reviews.
References

Abeer Aldayel and Walid Magdy. 2019. Your stance is exposed! Analysing possible factors for stance detection on social media. The ACM on Human-Computer Interaction, 3(CSCW):1–20.
Abeer AlDayel and Walid Magdy. 2020. Stance detection on social media: State of the art and trends. arXiv preprint arXiv:2006.03644.
Amazon. 2011. HIT review policies. https://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_HITReviewPolicies.html.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135–146.
Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial Conference of the German Society for Computational Linguistics and Language Technology (GSCL).
Ceren Budak. 2019. What happened? The spread of fake news publisher content during the 2016 US presidential election. In Proceedings of the World Wide Web Conference (WWW).
Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. 2017. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Kevin Crowston. 2012. Amazon Mechanical Turk: A research tool for organizations and information systems scholars. In Shaping the Future of ICT Research: Methods and Approaches.
Marco Del Tredici, Diego Marcheggiani, Sabine Schulte im Walde, and Raquel Fernández. 2019. You shall know a user by the company it keeps: Dynamic representations for social media users in NLP. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Kuntal Dey, Ritvik Shrivastava, and Saroj Kaushik. 2017. Twitter stance detection—a subjectivity and sentiment polarity inspired two-phase approach. In IEEE International Conference on Data Mining Workshops (ICDMW).
Jiachen Du, Ruifeng Xu, Yulan He, and Lin Gui. 2017. Stance classification with target-specific neural attention networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI).
Shalmoli Ghosh, Prajwal Singhania, Siddharth Singh, Koustav Rudra, and Saptarshi Ghosh. 2019. Stance detection in web and social media: A comparative study. Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 75–87.
Marco Guerini, Lorenzo Gatti, and Marco Turchi. 2013. Sentiment analysis: How to derive prior polarities from SentiWordNet. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent Twitter sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics (TACL), 8:64–77.
Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday.
Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. 2020. SentiLARE: Sentiment-aware language representation learning with linguistic knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Dilek Küçük and Fazli Can. 2020. Stance detection: A survey. ACM Computing Surveys (CSUR), 53(1).
Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2020. SenseBERT: Driving some sense into BERT. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016a. A dataset for detecting stance in tweets. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).
Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016b. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the International Workshop on Semantic Evaluation (SemEval).
Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Lahari Poddar, Wynne Hsu, and Mong Li Lee. 2017. Author-aware aspect topic sentiment model to retrieve supporting opinions from reviews. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.
Martin Schmitt, Simon Steinheber, Konrad Schreiber, and Benjamin Roth. 2018. Joint aspect and polarity classification for aspect-based sentiment analysis with end-to-end neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
Anirban Sen, Manjira Sinha, Sandya Mannarswamy, and Shourya Roy. 2018. Stance classification of multi-perspective consumer health information. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data.
Umme Aymun Siddiqua, Abu Nowshed Chy, and Masaki Aono. 2019. Tweet stance detection using an attention based neural ensemble model. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Parinaz Sobhani, Saif Mohammad, and Svetlana Kiritchenko. 2016. Detecting stance in tweets and analyzing its interaction with sentiment. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM).
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
Hao Tian, Can Gao, Xinyan Xiao, Hao Liu, Bolei He, Hua Wu, Haifeng Wang, and Feng Wu. 2020. SKEP: Sentiment knowledge enhanced pre-training for sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Rui Wang, Deyu Zhou, Mingmin Jiang, Jiasheng Si, and Yang Yang. 2019. A survey on opinion mining: From stance to product aspect. IEEE Access, 7:41101–41124.
Wan Wei, Xiao Zhang, Xuqin Liu, Wei Chen, and Tengjiao Wang. 2016. pkudblab at SemEval-2016 task 6: A specific convolutional neural network system for effective stance detection. In Proceedings of the International Workshop on Semantic Evaluation (SemEval).
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems (NeurIPS).
Guido Zarrella and Amy Marsh. 2016. MITRE at SemEval-2016 task 6: Transfer learning for stance detection. In Proceedings of the International Workshop on Semantic Evaluation (SemEval).
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
Jie Zhou, Junfeng Tian, Rui Wang, Yuanbin Wu, Wenming Xiao, and Liang He. 2020. SentiX: A sentiment-aware pre-trained model for cross-domain sentiment analysis. In Proceedings of the International Conference on Computational Linguistics (COLING).