
SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis

Hao Tian‡,†, Can Gao†, Xinyan Xiao†, Hao Liu†, Bolei He†, Hua Wu†, Haifeng Wang†, Feng Wu‡
†Baidu Inc., Beijing, China ‡University of Science and Technology of China
{tianhao,gaocan01,xiaoxinyan,liuhao24,hebolei,wu_hua,wanghaifeng}@baidu.com, [email protected]

Abstract

Recently, sentiment analysis has seen remarkable advance with the help of pre-training approaches. However, sentiment knowledge, such as sentiment words and aspect-sentiment pairs, is ignored in the process of pre-training, despite the fact that it is widely used in traditional sentiment analysis approaches. In this paper, we introduce Sentiment Knowledge Enhanced Pre-training (SKEP) in order to learn a unified sentiment representation for multiple sentiment analysis tasks. With the help of automatically-mined knowledge, SKEP conducts sentiment masking and constructs three sentiment knowledge prediction objectives, so as to embed sentiment information at the word, polarity and aspect level into the pre-trained sentiment representation. In particular, the prediction of aspect-sentiment pairs is converted into multi-label classification, aiming to capture the dependency between the words in a pair. Experiments on three kinds of sentiment tasks show that SKEP significantly outperforms a strong pre-training baseline, and achieves new state-of-the-art results on most of the test datasets. We release our code at https://github.com/baidu/Senta.

1 Introduction

Sentiment analysis refers to the identification of sentiment and opinion contained in input texts that are often user-generated comments. In practice, sentiment analysis involves a wide range of specific tasks (Liu, 2012), such as sentence-level sentiment classification, aspect-level sentiment classification, opinion extraction and so on. Traditional methods often study these tasks separately and design specific models for each task, based on manually-designed features (Liu, 2012) or deep learning (Zhang et al., 2018).

Recently, pre-training methods (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019) have shown their power in learning general semantic representations, and have remarkably improved most natural language processing (NLP) tasks, including sentiment analysis. These methods build unsupervised objectives at the word level, such as masking (Devlin et al., 2019), next-word prediction (Radford et al., 2018) or permutation (Yang et al., 2019). Such word-prediction-based objectives have shown a strong ability to capture dependencies between words and syntactic structures (Jawahar et al., 2019). However, as the sentiment information of a text is seldom explicitly studied, it is hard to expect such pre-trained general representations to deliver optimal results for sentiment analysis (Tang et al., 2014).

Sentiment analysis differs from other NLP tasks in that it deals mainly with user reviews rather than news texts. There are many specific sentiment tasks, and these tasks usually depend on different types of sentiment knowledge, including sentiment words, word polarity and aspect-sentiment pairs. The importance of such knowledge has been verified by tasks at different levels, for instance, sentence-level sentiment classification (Taboada et al., 2011; Shin et al., 2017; Lei et al., 2018), aspect-level sentiment classification (Vo and Zhang, 2015; Zeng et al., 2019), opinion extraction (Li and Lam, 2017; Gui et al., 2017; Fan et al., 2019) and so on. Therefore, we assume that, by integrating this knowledge into the pre-training process, the learned representation would be more sentiment-specific and appropriate for sentiment analysis.

In order to learn a unified sentiment representation for multiple sentiment analysis tasks, we propose Sentiment Knowledge Enhanced Pre-training (SKEP), where sentiment knowledge about words, polarity, and aspect-sentiment pairs is included to guide the process of pre-training. The sentiment knowledge is first automatically mined from unlabeled data (Section 3.1).

[Figure 1 shows the example input "this product came really fast and I appreciated it", sentiment-masked into "this [MASK] came really [MASK] and I [MASK] it" and fed through the Transformer encoder (tokens x1 ... x10), which predicts the removed sentiment information: "product", "fast" and "appreciated".]

Figure 1: Sentiment Knowledge Enhanced Pre-training (SKEP). SKEP contains two parts: (1) Sentiment masking recognizes the sentiment information of an input sequence based on automatically-mined sentiment knowledge, and produces a corrupted version by removing this information. (2) Sentiment pre-training objectives require the transformer to recover the removed information from the corrupted version. The three prediction objectives on top are jointly optimized: Sentiment Word (SW) prediction (on x9), Word Polarity (WP) prediction (on x6 and x9), Aspect-sentiment Pair (AP) prediction (on x1). Here, the smiley denotes positive polarity. Notably, on x6, only WP is calculated without SW, as its original word has already been predicted in the pair prediction on x1.

With the knowledge mined, sentiment masking (Section 3.2) removes sentiment information from input texts. Then, the pre-training model is trained to recover the sentiment information with three sentiment objectives (Section 3.3).

SKEP integrates different types of sentiment knowledge together and provides a unified sentiment representation for various sentiment analysis tasks. This is quite different from traditional sentiment analysis approaches, where different types of sentiment knowledge are often studied separately for specific sentiment tasks. To the best of our knowledge, this is the first work to tackle sentiment-specific representation during pre-training. Overall, our contributions are as follows:

• We propose sentiment knowledge enhanced pre-training for sentiment analysis, which provides a unified sentiment representation for multiple sentiment analysis tasks.

• Three sentiment knowledge prediction objectives are jointly optimized during pre-training so as to embed sentiment words, polarity, and aspect-sentiment pairs into the representation. In particular, the pair prediction is converted into multi-label classification to capture the dependency between aspect and sentiment.

• SKEP significantly outperforms the strong pre-training method RoBERTa (Liu et al., 2019) on three typical sentiment tasks, and achieves new state-of-the-art results on most of the test datasets.

2 Background: BERT and RoBERTa

BERT (Devlin et al., 2019) is a self-supervised representation learning approach for pre-training a deep transformer encoder (Vaswani et al., 2017). BERT constructs a self-supervised objective called masked language modeling (MLM) to pre-train the transformer encoder, and relies only on large-scale unlabeled data. With the help of the pre-trained transformer, downstream tasks have been substantially improved by fine-tuning on task-specific labeled data. We follow the method of BERT to construct masking objectives for pre-training.

BERT learns a transformer encoder that produces a contextual representation for each token of an input sequence. In practice, the first token of an input sequence is a special classification token [CLS]. In the fine-tuning step, the final hidden state of [CLS] is often used as the overall semantic representation of the input sequence.

In order to train the transformer encoder, MLM is proposed. Similar to a cloze test, MLM predicts masked tokens in a sequence from their placeholders. Specifically, parts of the input tokens are randomly sampled and substituted. BERT uniformly selects 15% of input tokens; of these sampled tokens, 80% are replaced with a special masked token [MASK], 10% are replaced with a random token, and 10% are left unchanged. After the construction of this noisy version, MLM aims to predict the original tokens at the masked positions using the corresponding final states.

Most recently, RoBERTa (Liu et al., 2019) significantly outperforms BERT by robust optimization without changing the neural structure, and has become one of the best pre-training models. RoBERTa also removes the next-sentence prediction objective of standard BERT. To verify the effectiveness of our approach, this paper uses RoBERTa as a strong baseline.
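As a concrete illustration of this corruption scheme, here is a minimal Python sketch of the 80/10/10 rule, written for exposition only: the [MASK] id, the token-level sampling granularity and the absence of whole-word handling are simplifying assumptions rather than BERT's actual implementation.

```python
import random

MASK_ID = 103  # hypothetical id for [MASK]; the real id depends on the vocabulary

def mlm_corrupt(token_ids, vocab_size, mask_rate=0.15):
    """BERT-style corruption: sample ~15% of positions, then apply the 80/10/10 rule."""
    corrupted = list(token_ids)
    targets = [-1] * len(token_ids)  # -1 marks positions that MLM does not predict
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_rate:
            continue
        targets[i] = tok  # MLM must recover the original token here
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_ID                       # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = random.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged, but still predict it
    return corrupted, targets
```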

3 SKEP: Sentiment Knowledge Enhanced Pre-training

We propose SKEP, Sentiment Knowledge Enhanced Pre-training, which incorporates sentiment knowledge by self-supervised training. As shown in Figure 1, SKEP contains sentiment masking and sentiment pre-training objectives. Sentiment masking (Section 3.2) recognizes the sentiment information of an input sequence based on automatically-mined sentiment knowledge (Section 3.1), and produces a corrupted version by removing this information. Three sentiment pre-training objectives (Section 3.3) require the transformer to recover the sentiment information of the corrupted version.

Formally, sentiment masking constructs a corrupted version $\tilde{X}$ for an input sequence $X$, guided by sentiment knowledge $G$. $x_i$ and $\tilde{x}_i$ denote the $i$-th token of $X$ and $\tilde{X}$ respectively. After masking, parallel data $(X, \tilde{X})$ is obtained. Thus, the transformer encoder can be trained with sentiment pre-training objectives that are supervised by recovering sentiment information using the final states of the encoder $\tilde{x}_1, ..., \tilde{x}_n$.

3.1 Unsupervised Sentiment Knowledge Mining

SKEP mines sentiment knowledge from unlabeled data. As sentiment knowledge has been the central subject of extensive research, SKEP finds a way to integrate existing knowledge-mining techniques with pre-training. This paper uses a simple and effective mining method based on Pointwise Mutual Information (PMI) (Turney, 2002).

The PMI method depends only on a small number of sentiment seed words, and the word polarity WP(s) of each seed word $s$ is given. It first builds a collection of candidate word pairs, where each word pair contains a seed word and matches pre-defined part-of-speech patterns as in Turney (2002). Then, the co-occurrence of a word pair is calculated by PMI as follows:

$$\mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)} \tag{1}$$

Here, $p(\cdot)$ denotes a probability estimated by counting. Finally, the polarity of a word is determined by the difference between its PMI scores with all positive seeds and with all negative seeds:

$$\mathrm{WP}(w) = \sum_{s:\, \mathrm{WP}(s)=+} \mathrm{PMI}(w, s) - \sum_{s:\, \mathrm{WP}(s)=-} \mathrm{PMI}(w, s) \tag{2}$$

If $\mathrm{WP}(w)$ of a candidate word $w$ is larger than 0, then $w$ is a positive word; otherwise, it is negative.

After mining sentiment words, aspect-sentiment pairs are extracted by simple constraints. An aspect-sentiment pair refers to the mention of an aspect and its corresponding sentiment word. Thus, a sentiment word together with its nearest noun is considered an aspect-sentiment pair. The maximum distance between the aspect word and the sentiment word of a pair is empirically limited to no more than 3 tokens.

Consequently, the mined sentiment knowledge $G$ contains a collection of sentiment words with their polarity, along with a set of aspect-sentiment pairs. Our research focuses for now on the necessity of integrating sentiment knowledge into pre-training by virtue of a relatively common mining method. We believe that a more fine-grained method would further improve the quality of the knowledge, and this is something we will explore in the near future.
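To make the mining procedure concrete, here is a minimal Python sketch of Eqs. (1) and (2). It assumes sentences are already tokenized and that candidate filtering by part-of-speech patterns has been done upstream; counting co-occurrence within a sentence is our simplification, as the paper does not fix the co-occurrence window.

```python
import math
from collections import Counter

def mine_polarity(sentences, seeds, min_count=5):
    """PMI-based polarity mining, following Eqs. (1)-(2).

    sentences: list of token lists; seeds: dict mapping seed word -> +1 or -1.
    Returns a dict word -> WP(w); WP(w) > 0 marks a positive word.
    """
    word_cnt, pair_cnt, n = Counter(), Counter(), 0
    for sent in sentences:
        n += 1
        uniq = set(sent)
        word_cnt.update(uniq)
        for w in uniq:
            for s in uniq & seeds.keys():  # co-occurrence with seeds in this sentence
                if w != s:
                    pair_cnt[w, s] += 1

    def pmi(w, s):
        if pair_cnt[w, s] == 0:
            return 0.0  # simplification: unseen pairs contribute nothing
        return math.log((pair_cnt[w, s] / n) /
                        ((word_cnt[w] / n) * (word_cnt[s] / n)))

    return {w: sum(pol * pmi(w, s) for s, pol in seeds.items())
            for w, c in word_cnt.items() if c >= min_count and w not in seeds}
```

Words with positive WP enter the lexicon as positive and the rest as negative; pairing each mined sentiment word with its nearest noun within three tokens then yields the aspect-sentiment pairs of $G$.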

3.2 Sentiment Masking

Sentiment masking aims to construct a corrupted version of each input sequence, in which the sentiment information is masked. Our sentiment masking is directed by sentiment knowledge, which is quite different from previous random word masking. This process contains sentiment detection and hybrid sentiment masking, described as follows.

Sentiment Detection with Knowledge: Sentiment detection recognizes both sentiment words and aspect-sentiment pairs by matching input sequences with the mined sentiment knowledge $G$.

1. Sentiment Word Detection. Word detection is straightforward: if a word of an input sequence also occurs in the knowledge base $G$, then this word is seen as a sentiment word.

2. Aspect-Sentiment Pair Detection. The detection of an aspect-sentiment pair is similar to its mining described before. A detected sentiment word and its nearby noun are considered as an aspect-sentiment pair candidate, and the maximum distance between these two words is limited to 3. If such a candidate is also found in the mined knowledge $G$, then it is considered as an aspect-sentiment pair.

Hybrid Sentiment Masking: Sentiment detection results in three types of tokens for an input sequence: aspect-sentiment pairs, sentiment words and common tokens. The process of masking a sequence runs in the following steps (sketched in code after this list):

1. Aspect-sentiment Pair Masking. At most 2 aspect-sentiment pairs are randomly selected to mask. All tokens of a pair are replaced by [MASK] simultaneously. This masking provides a way of capturing the combination of an aspect word and a sentiment word.

2. Sentiment Word Masking. For the still-unmasked sentiment words, some are randomly selected, and all of their tokens are substituted with [MASK] at the same time. The total number of tokens masked in this step is limited to less than 10%.

3. Common Token Masking. If the number of tokens masked in step 2 is insufficient, say less than 10%, the deficit is filled in this step with randomly-selected tokens. Here, random token masking is the same as in RoBERTa.¹
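The hybrid masking above can be rendered as the following sketch. This is our own simplified reading, assuming word-level tokens (no sub-word pieces), a polarity lexicon, and detected pairs given as index tuples; the exact sampling in the released code may differ.

```python
import random

MASK = "[MASK]"

def sentiment_mask(tokens, polarity_lexicon, detected_pairs,
                   mask_rate=0.10, max_pairs=2):
    """Hybrid sentiment masking: pairs first, then sentiment words, then common tokens."""
    corrupted = list(tokens)
    masked = set()

    # Step 1: mask all tokens of at most `max_pairs` detected aspect-sentiment pairs.
    for i, j in random.sample(detected_pairs, min(max_pairs, len(detected_pairs))):
        corrupted[i] = corrupted[j] = MASK
        masked.update((i, j))

    # Steps 2 and 3 together always mask 10% of the tokens (see footnote 1).
    budget = int(mask_rate * len(tokens))

    # Step 2: mask randomly selected, still-unmasked sentiment words.
    sentiment_idx = [i for i, t in enumerate(tokens)
                     if t in polarity_lexicon and i not in masked]
    random.shuffle(sentiment_idx)
    chosen = sentiment_idx[:budget]

    # Step 3: fill the remaining budget with random common tokens (as in RoBERTa).
    if len(chosen) < budget:
        common_idx = [i for i in range(len(tokens))
                      if i not in masked and i not in chosen]
        chosen += random.sample(common_idx, min(budget - len(chosen), len(common_idx)))

    for i in chosen:
        corrupted[i] = MASK
        masked.add(i)
    return corrupted, masked
```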
3.3 Sentiment Pre-training Objectives

Sentiment masking produces corrupted token sequences $\tilde{X}$, in which the sentiment information is substituted with masked tokens. Three sentiment objectives are defined to train the transformer encoder to recover the replaced sentiment information. The three objectives, Sentiment Word (SW) prediction $L_{sw}$, Word Polarity (WP) prediction $L_{wp}$ and Aspect-sentiment Pair (AP) prediction $L_{ap}$, are jointly optimized. Thus, the overall pre-training objective $L$ is:

$$L = L_{sw} + L_{wp} + L_{ap} \tag{3}$$

Sentiment Word Prediction: Sentiment word prediction recovers the masked tokens of sentiment words using the output vector $\tilde{x}_i$ from the transformer encoder. $\tilde{x}_i$ is fed into an output softmax layer, which produces a normalized probability vector $\hat{y}_i$ over the entire vocabulary. In this way, the sentiment word prediction objective $L_{sw}$ maximizes the probability of the original sentiment word $x_i$ as follows:

$$\hat{y}_i = \mathrm{softmax}(\tilde{x}_i W + b) \tag{4}$$

$$L_{sw} = -\sum_{i=1}^{n} m_i \times y_i \log \hat{y}_i \tag{5}$$

Here, $W$ and $b$ are the parameters of the output layer. $m_i = 1$ if the $i$-th position of a sequence is a masked sentiment word², otherwise it equals 0. $y_i$ is the one-hot representation of the original token $x_i$.

Despite a certain similarity to the MLM of BERT, our sentiment word prediction has a different purpose. Instead of predicting randomly masked tokens, this sentiment objective selects sentiment words for self-supervision. As sentiment words play a key role in sentiment analysis, the representation learned here is expected to be more suitable for sentiment analysis.

Word Polarity Prediction: Word polarity is crucial for sentiment analysis. For example, the traditional lexicon-based model (Turney, 2002) directly utilizes word polarity to classify the sentiment of texts. To incorporate this knowledge into the encoder, an objective called word polarity prediction $L_{wp}$ is further introduced. $L_{wp}$ is similar to $L_{sw}$. For each masked sentiment token $\tilde{x}_i$, $L_{wp}$ calculates its polarity (positive or negative) using the final state $\tilde{x}_i$. The target polarity corresponds to the polarity of the original sentiment word, which can be found in the mined knowledge.
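Written as code, the SW and WP objectives of Eqs. (4) and (5) are two prediction heads over the same final states, later summed with $L_{ap}$ as in Eq. (3). The PyTorch sketch below is ours, under the assumption that non-predicted positions are marked with an ignore index; it is not the released implementation.

```python
import torch
import torch.nn as nn

class SentimentObjectives(nn.Module):
    """L_sw + L_wp of Eqs. (4)-(5); L_ap is computed separately over [CLS]."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # y_hat = softmax(x W + b)
        self.polarity_head = nn.Linear(hidden_dim, 2)       # positive / negative

    def forward(self, h, word_targets, polarity_targets):
        # h: (batch, seq, dim) final encoder states for the corrupted sequence.
        # word_targets: original token ids, -100 where m_i = 0 (position not predicted).
        # polarity_targets: 0/1 polarity labels, -100 on non-sentiment positions.
        ce = nn.CrossEntropyLoss(ignore_index=-100)  # realizes the m_i mask of Eq. (5)
        l_sw = ce(self.word_head(h).flatten(0, 1), word_targets.flatten())
        l_wp = ce(self.polarity_head(h).flatten(0, 1), polarity_targets.flatten())
        return l_sw + l_wp
```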

¹ For each sentence, we always mask 10% of its tokens in total at steps 2 and 3. Among these masked tokens, 79.9% are sentiment words (from step 2) and 20.1% are common words (from step 3) in our experiment.

² In sentiment masking, we add common tokens to make up for the deficiency of masked sentiment-word tokens. $L_{sw}$ is also calculated on these common tokens, while $L_{wp}$ does not include them.

Aspect-sentiment Pair Prediction: Aspect-sentiment pairs reveal more information than sentiment words do. Therefore, in order to capture the dependency between aspect and sentiment, an aspect-sentiment pair objective is proposed. Notably, the words in a pair are not mutually exclusive. This is quite different from BERT, which assumes that masked tokens can be predicted independently.

We thus conduct aspect-sentiment pair prediction with multi-label classification. We use the final state of the classification token [CLS], which denotes the representation of the entire sequence, to predict pairs. The sigmoid activation function is utilized, which allows multiple tokens to occur in the output at the same time. The aspect-sentiment pair objective $L_{ap}$ is denoted as follows:

$$\hat{y}_a = \mathrm{sigmoid}(\tilde{x}_1 W_{ap} + b_{ap}) \tag{6}$$

$$L_{ap} = -\sum_{a=1}^{A} y_a \log \hat{y}_a \tag{7}$$

Here, $\tilde{x}_1$ denotes the output vector of [CLS]. $A$ is the number of masked aspect-sentiment pairs in a corrupted sequence. $\hat{y}_a$ is the word probability normalized by sigmoid. $y_a$ is the sparse representation of a target aspect-sentiment pair: each element of $y_a$ corresponds to one token of the vocabulary, and equals 1 if the target aspect-sentiment pair contains the corresponding token.³ As multiple elements of $y_a$ equal 1, the prediction here is multi-label classification.⁴
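A sketch of this multi-label pair objective follows, assuming the pair tokens of each corrupted sequence are known. Eq. (7) as written sums only the log-probabilities of target tokens; binary cross-entropy over the multi-hot target, used below, is a common relaxation that also penalizes non-target tokens.

```python
import torch
import torch.nn as nn

class PairPrediction(nn.Module):
    """Aspect-sentiment pair objective (Eqs. 6-7): multi-label prediction from [CLS]."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.pair_head = nn.Linear(hidden_dim, vocab_size)  # W_ap, b_ap

    def forward(self, cls_state, pair_token_ids):
        # cls_state: (batch, dim) final state of [CLS]; pair_token_ids: one list of
        # token ids per sequence, covering all of its masked pairs.
        logits = self.pair_head(cls_state)                  # (batch, vocab)
        target = torch.zeros_like(logits)
        for b, ids in enumerate(pair_token_ids):
            target[b, ids] = 1.0  # multi-hot: several tokens may be "on" at once
        return nn.functional.binary_cross_entropy_with_logits(logits, target)
```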
³ This means that the dimension of $y_a$ equals the vocabulary size of the pre-training method, which is 50,265 in our experiment.

⁴ It is possible to predict masked pairs with a CRF layer. However, that is more than 10 times slower than multi-label classification, and thus could not be used in pre-training.

4 Fine-tuning for Sentiment Analysis

We verify the effectiveness of SKEP on three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling. On top of the pre-trained transformer encoder, an output layer is added to perform task-specific prediction. The neural network is then fine-tuned on task-specific labeled data.

Sentence-level Sentiment Classification: This task classifies the sentiment polarity of an input sentence. The final state vector of the classification token [CLS] is used as the overall representation of the input sentence. On top of the transformer encoder, a classification layer is added to calculate the sentiment probability based on this overall representation.

Aspect-level Sentiment Classification: This task aims to analyze the fine-grained sentiment toward an aspect given a contextual text. Thus, there are two parts in the input: the aspect description and the contextual text. The two parts are combined with a separator [SEP] and fed into the transformer encoder. This task also utilizes the final state of the first token [CLS] for classification.

Opinion Role Labeling: This task detects fine-grained opinion elements, such as the holder and the target, from input texts. Following SRL4ORL (Marasović and Frank, 2018), it is converted into sequence labeling, which uses the BIOS scheme, and a CRF layer is added to predict the labels.⁵

⁵ All the pre-training models, including our SKEP and the baselines, use a CRF layer here, so their performances are comparable.
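The fine-tuning setups share one pattern: a light task head on top of the encoder's final states. Below is a minimal sketch of the classification variants; the encoder call signature is an assumption, and the opinion role labeling variant (a per-token head whose scores are decoded by a CRF layer over BIOS tags) is omitted for brevity.

```python
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Sentence- and aspect-level fine-tuning: classify from the final [CLS] state.

    For aspect-level tasks, the input is built as
    [CLS] aspect tokens [SEP] context tokens [SEP] before calling forward.
    """

    def __init__(self, encoder, hidden_dim, num_labels):
        super().__init__()
        self.encoder = encoder                  # pre-trained SKEP/RoBERTa transformer
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids):
        h = self.encoder(input_ids)             # (batch, seq, dim), assumed signature
        return self.classifier(h[:, 0])         # position 0 is [CLS]
```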

5 Experiment

5.1 Dataset and Evaluation

A variety of English sentiment analysis datasets are used in this paper. Table 1 summarizes the statistics of the datasets used in the experiments. These datasets cover three types of tasks: (1) For sentence-level sentiment classification, the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and Amazon-2 (Zhang et al., 2015) are used. In Amazon-2, 400k samples of the original training data are reserved for development. Performance is evaluated in terms of accuracy.

Dataset | Train | Dev | Test
SST-2 | 67k | 872 | 1821
Amazon-2 | 3.2m | 400k | 400k
Sem-R | 3608 | - | 1120
Sem-L | 2328 | - | 638
MPQA2.0 | 287 | 100 | 95

Table 1: Numbers of samples for each dataset. Sem-R and Sem-L refer to the restaurant and laptop parts of SemEval 2014 Task 4.

Dataset | Learning Rate | Batch | Epoch
SST-2 | 1e-5, 2e-5, 3e-5 | 16, 32 | 10
Amazon-2 | 2e-5, 5e-5 | 4 | 3
Sem-R | 3e-5 | 16 | 5
Sem-L | 3e-5 | 16 | 5
MPQA2.0 | 3e-5 | 16 | 5

Table 2: Hyper-parameters for fine-tuning on each dataset. Batch and Epoch indicate the batch size and the maximum number of epochs respectively.

 | Sentence-Level | | Aspect-Level | | Opinion Role |
Model | SST-2 | Amazon-2 | Sem-L | Sem-R | MPQA-Holder | MPQA-Target
Previous SOTA | 97.1*¹ | 97.37² | 81.35³ | 87.89⁴ | 83.67/77.12⁵ | 81.59/73.16⁵
RoBERTa-base | 94.9 | 96.61 | 78.11 | 84.93 | 81.89/77.34 | 80.23/72.19
RoBERTa-base + SKEP | 96.7 | 96.94 | 81.32 | 87.92 | 84.25/79.03 | 82.77/74.82
RoBERTa-large | 96.5 | 97.33 | 79.22 | 85.88 | 83.52/78.59 | 81.74/75.87
RoBERTa-large + SKEP | 97.0 | 97.56 | 81.47 | 88.01 | 85.77/80.99 | 83.59/77.41

Table 3: Comparison with RoBERTa and previous SOTA. For MPQA, both binary-F1 and prop-F1 are reported as in Marasović and Frank (2018), separated by a slash. The scores of previous SOTA come from: ¹(Raffel et al., 2019; Lan et al., 2019); ²(Xie et al., 2019); ³(Zhao et al., 2019); ⁴(Rietzler et al., 2019); ⁵(Marasović and Frank, 2018). The SOTA score of SST-2 (marked *) is from the GLUE leaderboard (Wang et al., 2018) on December 1, 2019, and that system is an ensemble model.

 | Sentence-Level | | Aspect-Level | | Opinion Role |
Model | SST-2 dev | Amazon-2 | Sem-L | Sem-R | MPQA-Holder | MPQA-Target
RoBERTa-base | 95.21 | 96.61 | 78.11 | 84.93 | 81.89/77.34 | 80.23/72.19
+ Random Token | 95.57 | 96.73 | 78.89 | 85.77 | 82.71/77.71 | 80.86/73.01
+ SW | 96.38 | 96.82 | 80.13 | 86.92 | 82.95/77.63 | 81.18/73.15
+ SW + WP | 96.51 | 96.87 | 80.32 | 87.25 | 82.97/77.82 | 81.09/73.24
+ SW + WP + AP | 96.87 | 96.94 | 81.32 | 87.92 | 84.25/79.03 | 82.77/74.82
+ SW + WP + AP-I | 96.89 | 96.93 | 81.19 | 87.71 | 84.01/78.36 | 82.69/74.36

Table 4: Effectiveness of objectives. SW, WP and AP refer to the pre-training objectives: Sentiment Word prediction, Word Polarity prediction and Aspect-sentiment Pair prediction. "Random Token" denotes the random token masking used in RoBERTa. AP-I denotes predicting the words in an Aspect-sentiment Pair Independently.

(2) Aspect-level sentiment classification is evaluated on SemEval 2014 Task 4 (Pontiki et al., 2014). This task contains both a restaurant domain and a laptop domain, whose accuracy is evaluated separately. (3) For opinion role labeling, the MPQA 2.0 dataset (Wiebe et al., 2005; Wilson, 2008) is used. MPQA aims to extract the targets or the holders of opinions. Here we follow the evaluation method of SRL4ORL (Marasović and Frank, 2018), which is released and available online. 4-fold cross-validation is performed, and the F1 scores of both holder and target are reported.

To perform the sentiment pre-training of SKEP, the training part of Amazon-2 is used, which is the largest dataset in Table 1. Notably, the pre-training only uses raw texts without any sentiment annotation. To reduce the dependency on manually-constructed knowledge and provide SKEP with the least supervision, we use only 46 sentiment seed words. Please refer to the appendix for more details about the seed words.

5.2 Experiment Setting

We use RoBERTa (Liu et al., 2019) as our baseline, which is one of the best pre-training models. Both the base and large versions of RoBERTa are used: RoBERTa-base and RoBERTa-large contain 12 and 24 transformer layers respectively. As the pre-training method is quite costly in terms of GPU resources, most of the experiments are done on RoBERTa-base, and only the main results report the performance of RoBERTa-large.

For SKEP, the transformer encoder is first initialized with RoBERTa, and then pre-trained on unlabeled sentiment data. An input sequence is truncated to 512 tokens. The learning rate is kept at 5e-5, and the batch size is 8192. The number of epochs is set to 3. For the fine-tuning on each dataset, we run 3 times with random seeds for each combination of hyper-parameters (Table 2), and choose the median checkpoint for testing according to the performance on the development set.
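For reference, the setup of this section can be summarized as a configuration sketch. The dictionaries below simply restate the hyper-parameters given above and in Table 2; the key names are our own, not those of the released Senta code.

```python
# Hyper-parameters restated from Section 5.2 and Table 2 (key names are ours).
SKEP_PRETRAIN = {
    "init_checkpoint": "RoBERTa",  # base or large
    "max_seq_len": 512,
    "learning_rate": 5e-5,
    "batch_size": 8192,
    "epochs": 3,
}

FINETUNE = {  # per-dataset grids; 3 random seeds per combination
    "SST-2":    {"lr": [1e-5, 2e-5, 3e-5], "batch": [16, 32], "max_epoch": 10},
    "Amazon-2": {"lr": [2e-5, 5e-5],       "batch": [4],      "max_epoch": 3},
    "Sem-R":    {"lr": [3e-5],             "batch": [16],     "max_epoch": 5},
    "Sem-L":    {"lr": [3e-5],             "batch": [16],     "max_epoch": 5},
    "MPQA2.0":  {"lr": [3e-5],             "batch": [16],     "max_epoch": 5},
}
```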

From | Model | Sentence Sample | Prediction
SST-2 | RoBERTa | altogether, this is successful as a film, while at the same time being a most touching reconsideration of the familiar masterpiece. | positive
SST-2 | SKEP | altogether, this is successful as a film, while at the same time being a most touching reconsideration of the familiar masterpiece. | positive
Sem-L | RoBERTa | I got this at an amazing price from Amazon and it arrived just in time. | negative
Sem-L | SKEP | I got this at an amazing price from Amazon and it arrived just in time. | positive

Table 5: Visualization of chosen samples. In the original rendering, words with a wavy underline are sentiment words, words with double underlines are aspects, and the color depth denotes importance for classification (a deeper color means higher importance); this highlighting is not reproduced here. The color depth is calculated from the attention weights with the classification token [CLS].

5.3 Main Results

We compare our SKEP method with the strong pre-training baseline RoBERTa and with the previous SOTA. The results are shown in Table 3.

Compared with RoBERTa, SKEP significantly and consistently improves the performance in both base and large settings. Even on RoBERTa-large, SKEP achieves an improvement of up to 2.4 points. Across task types, SKEP achieves larger improvements on the fine-grained tasks, aspect-level classification and opinion role labeling, which are supposed to be more difficult than sentence-level classification. We attribute this to the aspect-sentiment knowledge, which is more effective for these tasks. Interestingly, "RoBERTa-base + SKEP" always outperforms RoBERTa-large, except on Amazon-2. As the large version of RoBERTa is computationally expensive, the base version of SKEP provides an efficient model for application. Compared with the previous SOTA, SKEP achieves new state-of-the-art results on almost all datasets, with a less satisfactory result only on SST-2.

Overall, through comparisons on various sentiment tasks, the results strongly verify the necessity of incorporating sentiment knowledge into pre-training methods, and the effectiveness of our proposed sentiment pre-training method.

5.4 Detailed Analysis

Effect of Sentiment Knowledge: SKEP uses additional sentiment data for further pre-training and utilizes three objectives to incorporate three types of knowledge. Table 4 compares the contributions of these factors. With further pre-training on Amazon using random sub-word masking, RoBERTa-base obtains some improvements. This proves the value of large-size task-specific unlabeled data. However, the improvement is less evident compared with sentiment word masking, which indicates the importance of sentiment word knowledge. Further improvements are obtained when the word polarity and aspect-sentiment pair objectives are added, confirming the contribution of both types of knowledge. Comparing "+SW+WP+AP" with "+Random Token", the improvements are consistently significant on all evaluated data, reaching up to about 1.5 points.

Overall, from the comparison of objectives, we conclude that sentiment knowledge is helpful, and that more diverse knowledge results in better performance. This also encourages us to use more types of knowledge and better mining methods in the future.

Effect of Multi-label Optimization: Multi-label classification is proposed to deal with the dependency within an aspect-sentiment pair. To confirm the necessity of capturing the dependency of the words in an aspect-sentiment pair, we also compare it with a method where each token is predicted independently, denoted by AP-I. AP-I uses softmax for normalization and independently predicts each word of a pair, as in sentiment word prediction. According to the last line of Table 4, which contains AP-I, predicting the words of a pair independently does not hurt the performance of sentence-level classification. This is reasonable, as the sentence-level task mainly relies on sentiment words. In contrast, on aspect-level classification and opinion role labeling, multi-label classification is effective and yields an improvement of up to 0.6 points. This shows that multi-label classification does capture a better dependency between aspect and sentiment, and confirms the necessity of dealing with such dependency.

Model | SST-2 dev | Sem-L | Sem-R
Sent-Vector | 96.87 | 81.32 | 87.92
Pair-Vector | 96.91 | 81.38 | 87.95

Table 6: Comparison of the vector used for aspect-sentiment pair prediction. Sent-Vector uses the sentence representation (output vector of [CLS]) for prediction, while Pair-Vector uses the concatenation of the output vectors of the two words in a pair.

Comparison of Vectors for Aspect-Sentiment Pair Prediction: SKEP utilizes the sentence representation, which is the final state of the classification token [CLS], for aspect-sentiment pair prediction. We call this the Sent-Vector method. Another way is to use the concatenation of the final vectors of the two words in a pair, which we call Pair-Vector. As shown in Table 6, the performances of these two choices are very close. We suppose this is due to the robustness of the pre-training approach. As using a single vector for prediction is more efficient, we use the final state of the token [CLS] in SKEP.

Attention Visualization: Table 5 shows the attention distribution of the final layer for the [CLS] token when we adopt our SKEP model to classify the input sentences. On the SST-2 example, although RoBERTa gives a correct prediction, its attention over sentiment is inaccurate. On the Sem-L case, RoBERTa fails to attend to the word "amazing" and produces a wrong prediction. In contrast, SKEP produces correct predictions and appropriate attention over sentiment information in both cases. This indicates that SKEP has better interpretability.

6 Related Work

Sentiment Analysis with Knowledge: Various types of sentiment knowledge, including sentiment words, word polarity and aspect-sentiment pairs, have been proved useful for a wide range of sentiment analysis tasks.

Sentiment words with their polarity are widely used for sentiment analysis, including sentence-level sentiment classification (Taboada et al., 2011; Shin et al., 2017; Lei et al., 2018; Barnes et al., 2019), aspect-level sentiment classification (Vo and Zhang, 2015), opinion extraction (Li and Lam, 2017), emotion analysis (Gui et al., 2017; Fan et al., 2019) and so on. Lexicon-based methods (Turney, 2002; Taboada et al., 2011) directly utilize the polarity of sentiment words for classification. Traditional feature-based approaches encode sentiment word information in manually-designed features to improve supervised models (Pang et al., 2008; Agarwal et al., 2011). In contrast, deep learning approaches enhance the embedding representation with the help of sentiment words (Shin et al., 2017), or absorb the sentiment knowledge through linguistic regularization (Qian et al., 2017; Fan et al., 2019).

Aspect-sentiment pair knowledge is also useful for aspect-level classification and opinion extraction. Previous works often provide weak supervision through this type of knowledge, either for aspect-level classification (Zeng et al., 2019) or for opinion extraction (Yang et al., 2017; Ding et al., 2017). Although studies of exploiting sentiment knowledge have been made throughout the years, most of them tend to build a specific mechanism for each sentiment analysis task, so different knowledge is adopted to support different tasks. In contrast, our method incorporates diverse knowledge in pre-training to provide a unified sentiment representation for sentiment analysis tasks.

Pre-training Approaches: Pre-training methods have remarkably improved natural language processing by using self-supervised training with large-scale unlabeled data. This line of research has been dramatically advanced very recently, and various types of methods have been proposed, including ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and so on. Among them, BERT pre-trains a bidirectional transformer by randomly masked word prediction, and has shown strong performance gains. RoBERTa (Liu et al., 2019) further improves BERT by robust optimization, and has become one of the best pre-training methods.

Inspired by BERT, some works propose fine-grained objectives beyond random word masking. SpanBERT (Joshi et al., 2019) masks spans of words at the same time. ERNIE (Sun et al., 2019) proposes to mask entity words. On the other hand, pre-training for specific tasks is also studied. GlossBERT (Huang et al., 2019) exploits gloss knowledge to improve word sense disambiguation. SenseBERT (Levine et al., 2019) uses WordNet super-senses to improve word-in-context tasks. A different ERNIE (Zhang et al., 2019) exploits entity knowledge for entity linking and relation classification.

7 Conclusion

In this paper, we propose Sentiment Knowledge Enhanced Pre-training for sentiment analysis. Sentiment masking and three sentiment pre-training objectives are designed to incorporate various types of knowledge into the pre-training model. Though conceptually simple, SKEP is empirically highly effective. SKEP significantly outperforms the strong pre-training baseline RoBERTa, and achieves new state-of-the-art results on most datasets of three typical sentiment analysis tasks. Our work verifies the necessity of utilizing sentiment knowledge for pre-training models, and provides a unified sentiment representation for a wide range of sentiment analysis tasks.

In the future, we plan to apply SKEP to more sentiment analysis tasks, to further examine the generalization of SKEP, and we are also interested in exploiting more types of sentiment knowledge and more fine-grained sentiment mining methods.

Acknowledgments

We thank Qinfei Li for her valuable comments. We also thank the anonymous reviewers for their insightful comments. This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900).

References

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. 2011. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011).

Jeremy Barnes, Samia Touileb, Lilja Øvrelid, and Erik Velldal. 2019. Lexicon information in neural sentiment analysis: a multi-task learning approach. In Proceedings of the 22nd Nordic Conference on Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019.

Ying Ding, Jianfei Yu, and Jing Jiang. 2017. Recurrent neural networks with auxiliary labels for cross-domain opinion target extraction. In AAAI 2017.

Chuang Fan, Hongyu Yan, Jiachen Du, Lin Gui, Lidong Bing, Min Yang, Ruifeng Xu, and Ruibin Mao. 2019. A knowledge regularized hierarchical approach for emotion cause analysis. In EMNLP 2019.

Lin Gui, Jiannan Hu, Yulan He, Ruifeng Xu, Qin Lu, and Jiachen Du. 2017. A question answering approach for emotion cause extraction. In EMNLP 2017.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In EMNLP 2019.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In ACL 2019.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

Zeyang Lei, Yujiu Yang, Min Yang, and Yi Liu. 2018. A multi-sentiment-resource enhanced attention network for sentiment classification. In ACL 2018.

Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2019. SenseBERT: Driving some sense into BERT.

Xin Li and Wai Lam. 2017. Deep multi-task learning for aspect term extraction with memory interaction. In EMNLP 2017.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ana Marasović and Anette Frank. 2018. SRL4ORL: Improving opinion role labeling using multi-task learning with semantic role labeling. In NAACL 2018.

Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In SemEval 2014.

Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. 2017. Linguistically regularized LSTM for sentiment classification. In ACL 2017.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2019. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. ArXiv, abs/1908.11860.

Bonggun Shin, Timothy Lee, and Jinho D. Choi. 2017. Lexicon integrated CNN models with attention for sentiment analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP 2013.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In ACL 2014.

Peter D. Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In ACL 2002.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017.

Duy-Tin Vo and Yue Zhang. 2015. Target-dependent Twitter sentiment classification with rich automatic features. In IJCAI 2015.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation.

Theresa Ann Wilson. 2008. Fine-grained subjectivity and sentiment analysis: Recognizing the intensity, polarity, and attitudes of private states.

Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation. CoRR, abs/1904.12848.

Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. ArXiv, abs/1703.06345.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.

Ziqian Zeng, Wenxuan Zhou, Xin Liu, and Yangqiu Song. 2019. A variational approach to weakly supervised document-level multi-aspect sentiment classification. In NAACL 2019.

Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS 2015.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In ACL 2019.

Pinlong Zhao, Linlin Hou, and Ou Wu. 2019. Modeling sentiment dependencies with graph convolutional networks for aspect-level sentiment classification. CoRR, abs/1906.04501.

A Appendix

For sentiment knowledge mining, we construct 46 sentiment seed words as follows. We first count the 9,750 items of Qian et al. (2017) on the training data of Amazon-2, and get the 50 most frequent sentiment words. Then, we manually filter out inappropriate words from these 50 words in a few minutes, and finally get 46 sentiment words with polarities (Table 7). The filtered-out words are need, fun, plot and fine, which are all negative words.

positive word | great, good, like, just, will, well, even, love, best, better, back, want, recommend, worth, easy, sound, right, excellent, nice, real, fun, sure, pretty, interesting, stars
negative word | too, little, bad, game, down, long, hard, waste, disappointed, problem, try, poor, less, boring, worst, trying, wrong, least, although, problems, cheap

Table 7: Sentiment seed words used in our experiment.
