Aspect-Based Sentiment Analysis Using BERT

Mickel Hoang
Chalmers University of Technology, Sweden
[email protected]

Oskar Alija Bihorac
Chalmers University of Technology, Sweden
[email protected]

Jacobo Rouces
Språkbanken, University of Gothenburg, Sweden
[email protected]

Abstract

Sentiment analysis has become very popular in both research and business due to the vast amount of opinionated text currently produced by Internet users. Standard sentiment analysis deals with classifying the overall sentiment of a text, but this does not include other important information, such as towards which entity, topic or aspect within the text the sentiment is directed. Aspect-based sentiment analysis (ABSA) is a more complex task that consists in identifying both sentiments and aspects. This paper shows the potential of using the contextual word representations from the pre-trained language model BERT, together with a fine-tuning method with additional generated text, in order to solve out-of-domain ABSA and outperform previous state-of-the-art results on SemEval-2015 (task 12, subtask 2) and SemEval-2016 (task 5). To the best of our knowledge, no other existing work has been done on out-of-domain ABSA for aspect classification.

1 Introduction

Sentiment analysis, also known as opinion mining, is a field within natural language processing (NLP) that consists in automatically identifying the sentiment of a text, often in categories like negative, neutral and positive. It has become a very popular field in both research and industry due to the large and increasing amount of opinionated user-generated text on the Internet, for instance in social media and product reviews. Knowing how users feel or think about a certain service, product, idea or topic is a valuable source of information for companies, organizations and researchers, but it can be a challenging task. Natural language often contains ambiguity and figurative expressions that make the automated extraction of information in general very complex.

Traditional sentiment analysis focuses on classifying the overall sentiment expressed in a text without specifying what the sentiment is about. This may not be enough if the text simultaneously refers to different topics or entities (also known as aspects), possibly expressing different sentiments towards each of them. Identifying the sentiments associated with specific aspects in a text is a more complex task, known as aspect-based sentiment analysis (ABSA).

ABSA as a research topic gained special traction during the SemEval-2014 workshop (Pontiki et al., 2014), where it was first introduced as Task 4, and it reappeared in the SemEval-2015 (Pontiki et al., 2015) and SemEval-2016 (Pontiki et al., 2016) workshops.

In parallel, within NLP, there have been numerous developments in the field of pre-trained language models, for example ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). These language models are pre-trained on large amounts of unannotated text, and their use has been shown to allow better performance with a reduced requirement for labeled data, as well as much faster training. At SemEval-2016, there were no submissions that used such a pre-trained language model as a base for the ABSA tasks. In this paper we use BERT as the base model to improve ABSA models for the unconstrained evaluation, which permits using additional resources such as external training data, due to the pre-training of the base language model. More precisely, the contributions of this paper are as follows:

• It proposes the new ABSA task of out-of-domain classification at both sentence and text levels.

• To solve this task, a general classifier model is proposed, which uses the pre-trained language model BERT as the base for the contextual word representations. It makes use of the sentence pair classification model (Devlin et al., 2019) to find semantic similarities between a text and an aspect. This method outperforms all of the previous submissions, except for one in SemEval-2016.

• It proposes a combined model, which uses only one sentence pair classifier model from BERT to solve both aspect classification and sentiment classification simultaneously.

2 State-of-the-art

This section provides an overview of the techniques and models used throughout the rest of the paper, as well as existing state-of-the-art results. Section 2.1 covers the pre-trained model used in this paper, which has achieved state-of-the-art results on several NLP tasks, together with the architecture of the model and its key features. Thereafter, Section 2.2 explains the ABSA task from SemEval-2016. Previous work without and with a pre-trained model is briefly described in Section 2.3 and Section 2.4, respectively.

2.1 BERT

Pre-trained language models provide contextual word representations, having previously learned the occurrence and representations of words from unannotated training data.

Bidirectional encoder representations from transformers (BERT) is a pre-trained language model that is designed to consider the context of a word from both the left and the right side simultaneously (Devlin et al., 2019). While the concept is simple, it improves results on several NLP tasks such as sentiment analysis and question answering systems. BERT can extract more context features from a sequence than models that train the left and right context separately, as other models such as ELMo do (Peters et al., 2018).

Language Models
Language models are key components in solving NLP problems; they learn word occurrence and word prediction patterns from unannotated text data. A language model learns context by using techniques such as word embeddings, which use vectors to represent words in a vector space (Mikolov et al., 2013). With a large amount of training data, the language model learns representations of words that depend on context, which allows similar words to have a similar representation.

Pre-training tasks
Supervised tasks are solved by training a model from scratch with training data. NLP is a diversified field that contains many distinct tasks, for which only small sets of human-labeled training data may be available. It has been proven that a large amount of training data increases the performance of models, for instance in the computer vision field with ImageNet (Deng et al., 2009). The same concept can be applied to deep language models. The development of a general purpose language model uses large amounts of unannotated text, which is called pre-training, and the purpose of the general language model is to learn the contextual representation of words.

Previous work
BERT is the first deeply bidirectional and unsupervised language representation model developed. There have been several other pre-trained language models before BERT that also use bidirectional context. One of them is ELMo (Peters et al., 2018), which also focuses on contextualized word representations. The word embeddings ELMo generates are produced by using a recurrent neural network (RNN) named long short-term memory (LSTM) (Sak et al., 2014) to train left-to-right and right-to-left independently, later concatenating both word representations (Peters et al., 2018). BERT does not utilize an LSTM to get the word context features, but instead uses transformers (Vaswani et al., 2017), which are attention-based mechanisms that are not based on recurrence.

Transformers
Previous work in sequence modeling used the common sequence-to-sequence (seq2seq) framework (Sutskever et al., 2014), with techniques such as recurrent neural networks (RNNs) (Graves, 2013) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). The architecture of transformers is not based on RNNs but on attention mechanisms (Vaswani et al., 2017), which decide which parts of the sequence are important in each computational step. The encoder does not only map the input to a higher-dimensional space vector, but also uses the important keywords as additional input to the decoder. This in turn improves the decoder, because it has additional information such as which parts of the sequence are important and which keywords give context to the sentence.

Masked Language Model
BERT uses a mask token [MASK] to pre-train deep bidirectional representations for the language model. The left and right pre-training of BERT is achieved using a modified language model objective, called the masked language model (MLM). The purpose of MLM is to mask a random word in a sentence with a small probability. When the model masks a word, it replaces the word with the token [MASK]. The model then tries to predict the masked word by using the context from both the left and the right of the masked word, with the help of transformers. This is opposed to conditional language models that train left-to-right or right-to-left to predict words, where the predicted word is positioned at the end or at the start of the text sequence; BERT instead masks a random word within the sequence. Another reason for using a mask token for pre-training is that standard conditional language models can only be explicitly trained left-to-right or right-to-left, since otherwise the words could indirectly "see themselves" in a multi-layered context.
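To make the MLM objective concrete, the following minimal Python sketch (an illustration, not the original BERT code) masks random tokens in a sentence. The 15% masking rate is the one reported by Devlin et al. (2019); their additional refinement of sometimes substituting a random or unchanged token instead of [MASK] is omitted here.

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
        """Randomly hide tokens; the model must predict the originals."""
        masked, targets = [], {}
        for i, token in enumerate(tokens):
            if random.random() < mask_prob:
                targets[i] = token        # original word to be predicted
                masked.append(mask_token)
            else:
                masked.append(token)
        return masked, targets

    tokens = "the man went to the store".split()
    masked, targets = mask_tokens(tokens)
    # e.g. masked = ['the', 'man', '[MASK]', 'to', 'the', 'store']
    #      targets = {2: 'went'}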
Next Sentence Prediction
In addition to the left and right context extraction using MLM, BERT has an additional key objective which differs from previous work: next-sentence prediction, which is used to understand the relationship between two text sentences. BERT is pre-trained to predict whether or not there exists a relation between two sentences. Each of these sentences, sentence A and sentence B, has its own embedding dimensions.

Sentence A: [CLS] The man went to the store . [SEP]
Sentence B: He bought a gallon of milk . [SEP]
Label: IsNextSentence

During training, half of the time sentence B is the actual follow-up of sentence A, and the IsNextSentence label is used. The other half of the time, a random sentence is chosen for sentence B, and the IsNotNextSentence label is used.

Input Representation
The text input for the BERT model is first processed through a method called wordpiece tokenization (Wu et al., 2016). This produces a set of tokens, where each token represents a word or a piece of a word. There are also two specialized tokens that get added to the set of tokens: the classifier token [CLS], which is added to the beginning of the set, and the separation token [SEP], which marks the end of a sentence. If BERT is used to compare two sentences, these sentences are separated with a [SEP] token. The set of tokens is later processed through three different embedding layers with the same dimensions, which are summed together and passed to the encoder layer: the token embedding layer, the segment embedding layer and the position embedding layer.

Sentence Pair Classifier Task
Originally, BERT pre-trains the model to obtain word embeddings, which makes it easy to fine-tune the model for a specific task without having to make major changes in the model and its parameters. Usually, only one additional output layer on top of the model is required to make the model task-specific.

The sentence pair classifier task deals with determining the semantic relation between two sentences. The model takes two texts as input, as described above under Input Representation, and outputs a label representing the type of relation between the sentences. This kind of task evaluates how good a model is at comprehensive understanding of natural language and at doing further inference on full sentences (Conneau et al., 2017). There is a benchmark that evaluates the natural language understanding of models, named the general language understanding evaluation (GLUE) (Wang et al., 2018), which consists of several such tasks, including multi-genre natural language inference (MNLI) (Williams et al., 2018), the semantic textual similarity benchmark (STS-B) (Cer et al., 2017) and the Microsoft Research paraphrase corpus (MRPC) (Dolan and Brockett, 2005).
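As an illustration of this input representation, the sketch below encodes the sentence pair from the next-sentence prediction example above. The use of the HuggingFace transformers library is our assumption for illustration; the paper does not specify its tooling.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    sentence_a = "The man went to the store."
    sentence_b = "He bought a gallon of milk."

    # Builds [CLS] sentence A [SEP] sentence B [SEP] plus segment ids.
    encoding = tokenizer(sentence_a, sentence_b)
    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
    # ['[CLS]', 'the', 'man', 'went', 'to', 'the', 'store', '.', '[SEP]',
    #  'he', 'bought', 'a', 'gallon', 'of', 'milk', '.', '[SEP]']
    print(encoding["token_type_ids"])  # 0 for segment A, 1 for segment B

The token_type_ids correspond to the segment embedding layer described above, distinguishing the first text from the second.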
2.2 Aspect-Based Sentiment Analysis

ABSA is a more complex task than traditional text-level sentiment analysis. It focuses on identifying the attributes or aspects of an entity mentioned in a text, together with the sentiment expressed towards each aspect.

ABSA was first introduced in SemEval-2014 (Pontiki et al., 2014), which provided a dataset with annotated reviews about restaurants and laptops. The ABSA task did not contain full reviews until SemEval-2015 (Pontiki et al., 2015), and the dataset for SemEval-2016 did not change from 2015 except for additional test data.

The goal of the SemEval-2016 ABSA task is to identify opinions expressed towards specific aspects of a topic within customer reviews. Specifically, given a text review about a certain topic from the dataset (e.g. laptop, restaurant), the goal for SemEval-2016 is to address the following tasks:

Aspect category classification aims to identify the topic and aspect pair towards which an opinion is expressed in the text. The topic and aspect should be chosen from an already defined set of topic types (e.g. LAPTOP, RESTAURANT, FOOD) and aspects (e.g. PRICE, QUALITY) per domain.

Opinion target expression (OTE) is the task of extracting, for each entity-aspect pair, the linguistic expression in the input text that refers to the reviewed entity. The OTE is defined with starting and ending offsets in the sequence. If no entity is explicitly mentioned, the value returned is "NULL".

Sentiment polarity classification has the objective of predicting the sentiment polarity for each identified topic and aspect pair. The sentiment polarity is a value within the set {positive, negative, neutral, conflict}.

Subtask 1: Sentence Level. The input consists of one sentence, usually obtained from the full text-level review.

Subtask 2: Text Level. The input is a full review, where several aspects can be mentioned simultaneously, and where different opinions on the same aspect can be given.

2.3 ABSA without BERT

The submissions that performed best at the SemEval-2016 ABSA challenges mostly used machine learning techniques such as support vector machines (SVM) (Joachims, 1998; Hsu et al., 2003) or conditional random field classifiers (Lafferty et al., 2001). Even though deep learning models have been shown to perform well in sentiment analysis (Kim, 2014), the submissions employing deep learning techniques performed poorly that year.

The features used with the SVM were usually word representations extracted using GloVe (Pennington et al., 2014), or word lists generated by extracting the nouns and adjectives from the datasets.

2.4 ABSA with BERT

BERT has been shown to produce good results on NLP tasks (Wang et al., 2018) due to the large amounts of text it has been trained on. For tasks such as ABSA, performance has been shown to improve with the help of additional training on review text, called post-training (Xu et al., 2019). To solve an ABSA task, the post-training paper constructed ABSA as a question answering problem, together with a machine reading comprehension technique for reviews called "review reading comprehension".

Solving ABSA as a sentence-pair classification task using BERT, by constructing an auxiliary sentence, has been shown to improve the results compared to the previous state-of-the-art models that used single-sentence classification (Sun et al., 2019).

3 Experiments

The models implemented in this paper are three: an aspect classification model, a sentiment polarity classification model, and a combined model for both aspect and sentiment classification. The aspect classification model, described in Section 3.4, uses sentence pair classification from BERT (Devlin et al., 2019). As it only predicts whether an aspect is related to a text or not, this model can also be used for out-of-scope aspects. The sentiment polarity classifier, described in Section 3.3, is a classification model that is trained to determine the sentiment labels (positive, negative, neutral, conflict) for a given aspect and text input. Finally, Section 3.5 explains the last model, which is a combination of both the sentiment and aspect classification models. It outputs a sentiment if the aspect is related, and otherwise it returns the unrelated label.

                            Sentence Level                Text Level
                            Restaurant  Laptop   Both     Restaurant  Laptop  Both
Texts                       2000        2500     4500     334         395     729
Unique Aspects              12          81       93       12          81      93
Aspects with Sentiment      2507        2908     5415     1435        2082    3517
Aspects without Sentiment   21493       199592   413085   2573        29913   64280
Total Aspects               24000       202500   418500   4008        31995   67797

Table 1: Distribution of data in each training dataset.

3.1 Pre-processing

The format of the entity-aspect pairs in the SemEval-2016 dataset is originally structured in the form "ENTITY#ASPECT". In order to better fit the BERT model during training, and to make the pre-trained data in BERT applicable, we reformat the pairs to have a sentence-like structure, so that the pair "FOOD#STYLE_OPTIONS" gets parsed into "food, style options". This text is what we use as the aspect.

3.2 Data generation

The dataset used in our experiments is reused from SemEval-2016 Task 5 (Pontiki et al., 2016). Each sample in the dataset contains a text that has been annotated with a list of aspects and a sentiment polarity, which is one of 'positive', 'neutral', 'negative' or 'conflict'. The annotations to be generated are those with an aspect that is not related to the text; for example, the text "The food tasted great!" and the aspect 'restaurant, ambience' do not have any relation.

As the dataset has a fixed set of aspects (e.g. the Restaurant dataset has 12 different unique aspects), we can assume that each aspect that has not been annotated for a specific text is unrelated to that text. The aspects which are not related to the text are added to the list of aspects for the text with an 'unrelated' label instead of a sentiment label. Table 1 and Table 6 show the distribution of the original data and our generated data in the training and test datasets, respectively.

Unbalanced data
The dataset from SemEval-2016 is originally very unbalanced, and it becomes even more so when the unrelated data is generated, as can be seen by comparing aspects without sentiment to aspects with sentiment in Table 1. To compensate for the imbalance, we weight the entity and aspect pairs for each label depending on how frequently the label shows up in the training set: the higher the frequency of a label, the lower the weight of that label.

3.3 Sentiment Classifier

This model predicts the sentiment of a text with respect to a given aspect. It is implemented using the architecture of the sentence pair classification model explained in Section 2.1, where the first input is the text to be evaluated and the second input is the aspect that the text will be evaluated on. The output of this model is one of the labels 'positive', 'negative', 'neutral' and 'conflict', where 'conflict' means that there are parts of the text where the aspect is judged positively and other parts where the aspect is judged negatively.

3.4 Aspect Category Classifier

This is a model for aspect classification, with the structure of the sentence pair classifier described in Section 2.1, taking the text and the aspect as input. The model is used to predict whether or not the aspect is related to the text, using the labels 'related' and 'unrelated'. With the aspect as input, it is possible to handle out-of-domain aspects, i.e. aspects outside the set the model was trained on.

3.5 Combined model

This model is structured as a multi-class classifier for predicting both the aspect and the sentiment, using the structure of the sentence pair classification model described in Section 2.1. The model also takes the text and the aspect as input, and returns a sentiment label if the aspect is related to the text, and the unrelated label otherwise.

The model can be used as an entire ABSA structure. It can behave as an aspect category model by mapping the polarity labels to 'related', it can behave as a sentiment model by ignoring the value of the 'unrelated' label, or it can behave as both at the same time, as illustrated in the sketches below.
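The following sketch illustrates the data pipeline of Sections 3.1 and 3.2: parsing "ENTITY#ASPECT" pairs into sentence-like aspects, generating 'unrelated' annotations for every aspect missing from a text, and weighting labels inversely to their frequency. Function names and the exact weighting formula are our assumptions; the paper only states that more frequent labels receive lower weights.

    from collections import Counter

    def parse_aspect(pair):
        """'FOOD#STYLE_OPTIONS' -> 'food, style options' (Section 3.1)."""
        entity, aspect = pair.lower().split("#")
        return ", ".join(p.replace("_", " ") for p in (entity, aspect))

    def generate_examples(samples, all_aspects):
        """Add an 'unrelated' label for every aspect not annotated for a text."""
        rows = []
        for text, annotations in samples:  # annotations: {aspect: sentiment}
            for aspect in all_aspects:
                rows.append((text, aspect, annotations.get(aspect, "unrelated")))
        return rows

    def label_weights(rows):
        """Inverse-frequency weights: frequent labels get lower weight."""
        counts = Counter(label for _, _, label in rows)
        total = sum(counts.values())
        return {label: total / count for label, count in counts.items()}

    samples = [("The food tasted great!", {parse_aspect("FOOD#QUALITY"): "positive"})]
    aspects = [parse_aspect(p) for p in ("FOOD#QUALITY", "RESTAURANT#AMBIENCE")]
    rows = generate_examples(samples, aspects)
    # [('The food tasted great!', 'food, quality', 'positive'),
    #  ('The food tasted great!', 'restaurant, ambience', 'unrelated')]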
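All three classifiers in Sections 3.3 to 3.5 share the sentence pair architecture and differ only in their label sets. The sketch below shows the combined model of Section 3.5 as a five-way sentence pair classifier, again assuming the HuggingFace transformers API; the label order, hyperparameters and helper names are illustrative, not taken from the paper.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    LABELS = ["positive", "negative", "neutral", "conflict", "unrelated"]

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))

    def predict(text, aspect):
        """Sentence pair input: [CLS] text [SEP] aspect [SEP]."""
        encoding = tokenizer(text, aspect, return_tensors="pt")
        with torch.no_grad():
            logits = model(**encoding).logits
        return LABELS[int(logits.argmax(dim=-1))]

    # The combined model doubles as an aspect classifier by mapping any
    # sentiment label to 'related', and as a sentiment classifier by
    # ignoring the 'unrelated' label.
    print(predict("The food tasted great!", "food, quality"))

During fine-tuning, the label weights from Section 3.2 could be passed to the cross-entropy loss (e.g. torch.nn.CrossEntropyLoss(weight=...)) to counter the label imbalance.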
model by ignoring the value of the ’unrelated’ la- To compensate for the imbalance, we weight bel or it can behave as both at the same time. 4 Evaluation Model Domain Level F1 PRE REC ACC ASP REST SENL 79.9 80.2 79.5 96.3 COM REST SENL 77.4 75.9 79.0 95.8 The evaluation is based on the SemEval-2016 Task ASP REST TEXL 55.5 41.0 85.9 87.4 5, more specifically the subtasks: aspect catego- COM LAPT SENL 35.7 30.0 44.1 85.5 rization in subtask 1 & 2, slot 1 and sentiment po- Baseline: BERT-PT 78.0 - - - Baseline: NLANGP 73.0 - - - larity in subtask 1 & 2, slot 3. The results for each model implemented are presented in the tables in (a) Results of Aspect models on dataset: Restau- rant, Sentence-Level. BERT-PT (Xu et al., 2019) and Table 2a to Table 5c, with the previous state-of- NLANGP (Toh and Su, 2016) as baselines the-art models as baseline. Model Domain Level F1 PRE REC ACC The Aspect Category Classifier, the Sentiment ASP BOTH SENL 51.7 40.7 70.6 98.4 Classifier, and the Combined Classifier, have all ASP BOTH TEXL 39.0 27.5 66.7 97.5 been trained on each dataset described in Ta- COM BOTH SENL 38.7 25.5 80.7 96.9 ASP REST SENL 5.7 3.0 67.3 73.5 ble 1. This results in 18 models, where each of Baseline: NLANGP 51.9 --- these models have been tested on every dataset de- scribed in Table 6. However, the text-level Ho- (b) Results of Aspect models on dataset: Laptop, Sentence Level. With NLANGP (Toh and Su, 2016) as baseline. tel dataset was generated by concatenating all the sentence-level input to a full text and labelling the Model Domain Level F1 PRE REC ACC text with all the aspects corresponding to the sen- ASP BOTH SENL 34.4 23.3 65.9 89.1 COM REST SENL 34.1 22.9 67.5 88.7 tences, because the Hotel dataset only consisted of ASP LAPT TEXL 33.8 28.2 42.1 92.8 the sentence level. (c) Performance of Aspect models on the dataset: Hotel, For the results in the tables of this section, we Sentence-Level. only show the best performing model for model Table 2: Best performance of aspect category clas- type, in-domain, out-of-domain, text-level and sifiers in sentence-level datasets sentence-level model. The dataset in which these models has been tested on can be found in the de- scription of the tables.

4.1 Aspect Category Models

In this section, we evaluate how well aspect categorization works with our models, which are described in Section 3.4 and Section 3.5. Each model is trained on all the different domains and levels described in Table 1. As the performance of aspect category classifiers is only measured with F1 score in SemEval-2016, all the result tables in this section are ordered by F1 score in descending order.

In the tables within this section, the 'Model' column represents the model type: the combined model is denoted 'COM' and the aspect category classifier 'ASP'. The other two columns, Domain and Level, denote which domain and text type the model was trained on.

For aspect classification, the text-level datasets in Table 3 produce better results than the sentence-level datasets in Table 2. In both of these tables, the aspect classifiers always outperform the combined classifiers. In out-of-scope evaluations, aspect classification performs better with classifiers that have been trained on datasets with more unique aspects.

Model  Domain  Level  F1    PRE   REC   ACC
ASP    REST    TEXL   85.0  84.2  85.9  88.7
COM    BOTH    TEXL   82.4  78.2  87.1  86.1
ASP    BOTH    SENL   78.8  81.9  76.0  84.7
COM    LAPT    TEXL   68.0  66.4  69.6  75.5
Baseline: GTI         84.0  -     -     -

(a) Results of Aspect models on the dataset: Restaurant, Text-Level. GTI (Álvarez-López et al., 2016) as baseline.

Model  Domain  Level  F1    PRE   REC   ACC
ASP    BOTH    TEXL   64.3  60.9  68.1  92.3
COM    BOTH    TEXL   63.9  57.4  72.1  91.7
ASP    LAPT    SENL   61.0  58.7  64.6  91.6
COM    REST    SENL   21.6  12.3  87.4  37.0
Baseline: UWB         60.5  -     -     -

(b) Performance of Aspect models on the dataset: Laptop, Text-Level. UWB (Hercig et al., 2016) as baseline.

Model  Domain  Level  F1    PRE   REC   ACC
ASP    BOTH    TEXL   60.8  48.3  82.0  62.8
COM    LAPT    TEXL   59.4  53.7  66.4  68.0
COM    BOTH    SENL   58.8  45.2  84.0  58.5
ASP    REST    SENL   56.7  45.0  76.7  58.8

(c) Results of aspect category classifiers on the dataset: Hotel, Text-Level.

Table 3: Best performance of aspect category classifiers on text-level datasets.

4.2 Sentiment Models

In this section, we evaluate how well sentiment classification performs with our models, which are described in Section 3.3 and Section 3.5. Each model is trained on all the different domains and levels described in Table 1. The F1 score reported in the tables of this section is a weighted average of the F1 on each label. As the performance of sentiment classifiers is only measured with accuracy in SemEval-2016, all the tables in this section are ordered by accuracy in descending order.

In the tables within this section, the 'Model' column represents the model type: 'COM' is the combined model and 'SEN' is the sentiment classifier. The other two columns, Domain and Level, denote which domain and text type the model was trained on.
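As a note on the metric, the weighted-average F1 used in this section can be reproduced with scikit-learn, as in the short sketch below (our choice of tooling for illustration; the paper does not name one).

    from sklearn.metrics import accuracy_score, f1_score

    y_true = ["positive", "negative", "neutral", "positive"]
    y_pred = ["positive", "negative", "positive", "positive"]

    # Per-label F1 scores averaged, weighted by each label's support.
    print(f1_score(y_true, y_pred, average="weighted"))
    print(accuracy_score(y_true, y_pred))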
For sentiment classification, in both Table 4 and Table 5, the combined classifiers always outperform the sentiment classifiers. In out-of-scope scenarios, the classifiers that have been trained on sentence-level datasets outperform the classifiers that have been trained on text-level datasets.

Model  Domain  Level  F1    PRE   REC   ACC
COM    BOTH    SENL   89.5  89.5  89.8  89.8
SEN    BOTH    SENL   89.2  89.6  89.5  89.5
COM    BOTH    TEXL   83.3  84.0  84.0  84.0
SEN    LAPT    SENL   81.6  84.0  81.2  81.2
Baseline: XRCE        -     -     -     88.1

(a) Performance of Sentiment models on the dataset: Restaurant, Sentence-Level. XRCE (Brun et al., 2016) as baseline.

Model  Domain  Level  F1    PRE   REC   ACC
COM    BOTH    SENL   83.2  83.6  82.8  82.8
SEN    LAPT    SENL   82.7  83.0  82.6  82.6
COM    REST    SENL   77.0  75.7  79.0  79.0
COM    BOTH    TEXL   76.2  76.1  76.7  76.7
Baseline: IIT-T       -     -     -     82.8

(b) Performance of Sentiment models on the dataset: Laptop, Sentence-Level. IIT-T (Kumar et al., 2016) as baseline.

Model  Domain  Level  F1    PRE   REC   ACC
COM    BOTH    SENL   90.0  91.0  89.5  89.5
SEN    BOTH    SENL   89.0  89.4  88.9  88.9
SEN    REST    SENL   87.0  86.9  87.3  87.3
COM    LAPT    SENL   86.2  86.0  87.0  87.0
COM    BOTH    TEXL   84.2  84.2  84.2  84.2
Baseline: lsislif     -     -     -     85.8

(c) Performance of Sentiment models on the dataset: Hotel, Sentence-Level. Lsislif (Hamdan et al., 2015) as baseline.

Table 4: Best performance of sentiment classifiers on sentence-level datasets.

Model  Domain  Level  F1    PRE   REC   ACC
COM    BOTH    SENL   86.3  86.2  87.5  87.5
COM    BOTH    TEXL   84.7  84.1  86.6  86.6
SEN    REST    SENL   83.4  81.0  86.3  86.3
COM    LAPT    SENL   80.4  79.9  82.4  82.4
Baseline: UWB         -     -     -     81.9

(a) Results of Sentiment models on the dataset: Restaurant, Text-Level. UWB (Hercig et al., 2016) as baseline.

Model  Domain  Level  F1    PRE   REC   ACC
COM    BOTH    SENL   79.4  80.8  78.7  78.7
COM    REST    SENL   75.6  73.4  78.2  78.2
SEN    BOTH    SENL   77.1  76.7  77.8  77.8
COM    LAPT    TEXL   75.1  74.4  76.7  76.7
Baseline: ECNU        -     -     -     75.0

(b) Performance of Sentiment models on the dataset: Laptop, Text-Level. ECNU (Jiang et al., 2016) as baseline.

Model  Domain  Level  F1    PRE   REC   ACC
COM    BOTH    SENL   86.9  86.5  87.3  87.3
COM    REST    SENL   85.5  84.1  87.3  87.3
COM    BOTH    TEXL   85.4  84.9  86.4  86.4
SEN    BOTH    SENL   82.5  81.6  83.8  83.8
COM    LAPT    SENL   82.3  81.1  83.5  83.5

(c) Performance of Sentiment models on the dataset: Hotel, Text-Level.

Table 5: Best performance of sentiment classifiers on text-level datasets.

5 Discussion

Our proposed out-of-domain implementation performed well in the out-of-domain evaluation. In aspect category classification for hotels in Table 3c, a domain which our aspect models had not been introduced to before, the model achieved a higher F1 score than the in-domain baseline for laptops in Table 3b. This shows the potential of using semantic similarities to find features for the relation between an aspect and a text input. However, to compare these models more in depth, a better measurement would be to look at both precision and recall, as the laptop domain has many more unique aspects, which in turn makes it more likely to predict more false positives, causing a lower precision.

For all the experiments and evaluations, we trained the models on each specific dataset and tested them on the others. Our expectation was that the model would be able to improve its performance by using the combined dataset (restaurant & laptop), because it offers more features to use for the aspect classification task. This was not always the case, and we assume this has to do with the difference in the number of unique aspects between the domains. The aspect classifiers also seem not to work well on the sentence-level test dataset. We suspect that the reason for this is that a single sentence does not necessarily contain enough information to validate whether an aspect is relevant for a text. A sentence-level input example is "It wakes up super fast and is always ready to go", which is categorized as "LAPTOP#OPERATION_PERFORMANCE". In the out-of-domain and generalized model, this sentence does not provide the necessary information to make clear that the aspect is related to the sentence, and it could instead be applied to a lot of other aspects from other domains.

                            Sentence Level                       Text Level
                            Restaurant  Laptop  Hotel    Restaurant  Laptop  Hotel
Texts                       676         782     226      90          80      30
Unique Aspects              12          81      28       12          81      28
Aspects with Sentiment      859         777     339      404         545     215
Aspects without Sentiment   7253        62565   5989     676         5935    625
Total Aspects               8112        63342   6328     1080        6480    840

Table 6: Data distribution in the test datasets.

The combined model performs consistently better than the sentiment model in all domains. We believe that the reason for this is that the combined model is trained on a vast volume of "unrelated" data compared to the sentiment model, which allows it to learn to ignore redundant features when predicting the sentiment. However, the combined model performs worse than the aspect model at classifying relevant aspects. We conclude that the reason for this is that the combined model has to find what is relevant, which for this model is defined by the four sentiment polarity labels. This increases the complexity compared to the aspect model, which was trained specifically on whether or not the aspect is relevant to the text.

A possible reason why our model improves upon previous state-of-the-art models may be that it uses BERT for the word representation, and can then employ the semantic similarities between the different word embeddings, which capture the context, to find sentiments for an aspect in a text. Compared to the previous best models, which generate one vector for each word, BERT uses positional word embeddings to generate a different word embedding for each word depending on its position in the text. Another possible reason is the use of sentence-pair classification to compare the similarities of an aspect to a text, instead of the single-sentence classification used by the previous best models to determine what aspect is found in a text.

6 Conclusion

In this paper, we proposed an ABSA model that can predict the aspect related to a text, both in-domain and out-of-domain. We achieve this by using the pre-trained language model BERT and fine-tuning it as a sentence pair classification model for the ABSA task. Moreover, we train the aspect classifier model with data that we generate, which consists of 'related' and 'unrelated' labels. We further experimented with this approach for the sentiment classifier, by fine-tuning the model to find a relation between an aspect and a text, and to make the model learn when the contextual representation shows a sentiment context. Furthermore, we proposed a combined model that can classify both aspect and sentiment using only one sentence pair classification model. Experimental results show that the combined model outperforms previous state-of-the-art results for aspect-based sentiment classification.

References

Tamara Álvarez-López, Jonathan Juncal-Martínez, Milagros Fernández-Gavilanes, Enrique Costa-Montenegro, and Francisco Javier González-Castaño. 2016. GTI at SemEval-2016 task 5: SVM and CRF for aspect detection and unsupervised aspect-based sentiment analysis. In SemEval@NAACL-HLT.

Caroline Brun, Julien Perez, and Claude Roux. 2016. XRCE at SemEval-2016 task 5: Feedbacked ensemble modeling on syntactico-semantic knowledge for aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 277–281, San Diego, California. Association for Computational Linguistics.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.

Hussam Hamdan, Patrice Bellot, and Frederic Bechet. 2015. Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 753–758, Denver, Colorado. Association for Computational Linguistics.

Tomáš Hercig, Tomáš Brychcín, Lukáš Svoboda, and Michal Konkol. 2016. UWB at SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 342–349, San Diego, California. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

Mengxiao Jiang, Zhihua Zhang, and Man Lan. 2016. ECNU at SemEval-2016 task 5: Extracting effective features from relevant fragments in sentence for aspect-based sentiment analysis in reviews. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 361–366, San Diego, California. Association for Computational Linguistics.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML'98, pages 137–142, Berlin, Heidelberg. Springer-Verlag.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.

Ayush Kumar, Sarah Kohail, Amit Kumar, Asif Ekbal, and Chris Biemann. 2016. IIT-TUDA at SemEval-2016 task 5: Beyond sentiment lexicon: Combining domain dependency and distributional semantics features for aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1129–1135, San Diego, California. Association for Computational Linguistics.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Maria Pontiki, Dimitrios Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammed AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Núria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30. Association for Computational Linguistics.

Haşim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In INTERSPEECH, pages 338–342.

Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 380–385, Minneapolis, Minnesota. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 3104–3112, Cambridge, MA, USA. MIT Press.

Zhiqiang Toh and Jian Su. 2016. NLANGP at SemEval-2016 task 5: Improving aspect based sentiment analysis using neural network features. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 282–288, San Diego, California. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis.