2020 International Conference on Computational Science and Computational Intelligence (CSCI)

An attention-based deep learning method for text sentiment analysis

Thanh Le
School of Business Information Technology
University of Economics HCMC, Vietnam
[email protected]

Abstract: Text sentiment analysis is target-oriented, aiming to identify the opinion or attitude expressed in a piece of natural language text toward topics or entities, whether it is negative, positive or neutral, using natural language processing and computational methods. With the growth of the internet, numerous business websites have been deployed to support online shopping and service booking, and to allow customers to review and comment on services through business forums or social networks. Using text sentiment analysis to automatically mine opinions from the feedback on such emerging internet platforms is not only useful for customers seeking advice, but also necessary for businesses studying customers' attitudes toward brands, products, services, or events, and it has become an increasingly dominant trend in business strategic management. Current state-of-the-art approaches for text sentiment analysis include lexicon based and machine learning based methods. In this research, we proposed a method that utilizes deep learning with attention embedding. We showed that our method outperformed popular lexicon and embedding based methods.

Keywords: sentiment analysis, deep learning, attention mechanism, natural language processing

978-1-7281-7624-6/20/$31.00 ©2020 IEEE 282 DOI 10.1109/CSCI51800.2020.00054

I. INTRODUCTION

Text sentiment analysis is an ongoing field in natural language processing (NLP). It has been widely used in an array of business applications, including social media, algorithmic trading, customer experience, and human resource management. It has recently become an important analysis tool in financial and business research thanks to its ability to analyze the opinions, expressions, likes and dislikes of customers towards various entities, namely products, services, organizations, individuals, etc., as well as to identify customer trends. These marketing matters have always been among the most important issues in business strategic management, particularly in business decision making. According to Liu et al. [1], the beliefs, or perceptions of reality, and the choices one makes are somehow conditioned upon the way others act. This is true not only for individuals but also for businesses. While consumers hunger for and rely on online advice or recommendations of products and services, businesses demand utilities that can transform customers' expressions and conversations into customer insights, or that support social media monitoring, reputation management and voice of the customer programs. Traditionally, individuals ask for opinions from friends and family members, while businesses rely on surveys, focus groups, opinion polls and consultants. In the modern age of Big Data, when millions of consumer reviews and discussions flood the Internet every day, individuals feel overwhelmed with information, and it is equally impossible for businesses to keep up manually. Thus, there is a clear need for computational methods that automatically analyze sentiment in unstructured text from social media to help people cope with information overload.

Several approaches have been proposed for text sentiment analysis. Among those based on lexical resources, methods that utilize lexicons of opinionated synsets like SentiWordNet [2], or sets of opinion adjective terms like the Liu lexicon [1], are widely used these days thanks to their simplicity and effectiveness. SentiWordNet is basically a lexical resource for opinion mining, in which the connections between synsets and opinions are defined based on the WordNet [3] dictionary. Computationally generated from a set of seed words known to be popularly used, SentiWordNet consists of about 147,306 synsets, and is adequate for many real world sentiment analysis problems. However, its large number of selected terms, including those having no positive or negative sentiment polarity, makes SentiWordNet highly noisy, and it fails to account for sentiment bearing lexical features relevant to text in micro-blogs [4]. On the other hand, the Liu lexicon [5] [1], which was initiated from a set of seed adjective terms meaning either good or bad and augmented through a knowledge discovery process using semantic (synonym and antonym) relations, is a set of around 6,800 English terms that are categorized into positive and negative opinion groups. Thanks to its update in the past decade with misspellings, morphological variants, slang, and social-media markup, the Liu lexicon can perform well on social media text analysis without using advanced methods for text pre-processing. It however cannot cover all kinds of real world problems due to its limitation in terms of sentiment intensity. Moreover, the significant disagreement between SentiWordNet and the gold standard lexicons, namely Harvard General Inquirer [6] and Linguistic Inquiry and Word Counts [7], is another reason for its worse performance, preventing it from being used widely in real world problems.

On the other side, methods based on machine learning utilize supervised machine learning models such as Support Vector Machines, Naïve Bayes, Ensemble Learning and Neural Networks, together with advanced text embedding techniques, namely Word2vec, GloVe and FastText, for word representation in the form of numerical vectors [8] [9]. Since its emergence, deep learning has become part of most state-of-the-art systems in various areas, especially in text sentiment analysis. Similar to conventional machine learning methods, deep learning depends heavily on word embedding techniques. Both Word2vec and GloVe preserve the syntactic meaning of words, which makes it possible to rank words by syntactic similarity. They are however unable to capture the sentiment polarity of the words [10] [11] [12]. Therefore, words with opposite polarity may be mapped to close vectors. The lack of sentiment information in the vector representation, which plays a vital role in text sentiment analysis, may significantly degrade the performance of deep learning systems for text sentiment analysis.

In this research, we proposed a deep learning based method with an attention word embedding technique that utilizes NLP and popular lexicons for text sentiment analysis. We showed that our method outperformed the popular lexicon based and deep learning methods.

II. SENTIMENT ANALYSIS AND RELEVANT METHODS

A. Text sentiment analysis

Text sentiment analysis is a multi-discipline research field aiming to analyze people's opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes as expressed in written text. Entities can be products, services, organizations, individuals, events, issues, or discussion topics. In general, text sentiment analysis is a semantic analysis problem, but one that focuses on determining "positive" or "negative" without the need to fully understand the given sentence or document. Since human language is rarely precise or plainly spoken, sentiment analysis is hard and challenging. There are different levels of tasks in text sentiment analysis [13] [14]. While the basic task is to identify the sentiment polarity of a given text, that is, whether the opinion expressed in that text is positive, negative, or neutral, higher-level tasks deal with different levels of sentiment towards different features of a given entity. In this research, we aimed at improving the accuracy of identifying the opinion, which is either positive or negative, in short reviews made available on websites or social media.

Current state-of-the-art text sentiment analysis methods can be roughly classified into the lexicon based approach, the machine learning based approach and the hybrid approach [15]. While the lexicon based approach relies on a sentiment lexicon, which is a precompiled dictionary of terms with sentiment polarity scores, the machine learning based approach employs popular machine learning algorithms and embedding techniques to "learn" the sentiment-oriented features of text. The hybrid approach combines solution models and techniques from both the lexicon based and machine learning based approaches.

Because sentiment terms are instrumental to sentiment analysis, sentiment lexicons should be used to fully understand sentiments toward a given topic in natural language. Recent research works have pointed out that sentiment analysis methods that do not use sentiment lexicons all result in poor performance due to the lack of understanding of human language while solving the NLP problem [5]. Among available resources, SentiWordNet is the most widely used lexicon for text sentiment analysis.

B. SentiWordNet

SentiWordNet is basically a sentiment intensity lexical resource for opinion mining, in which connections between terms and opinions are defined based on the WordNet dictionary and a set of seed words that are popularly used. The expansion of the term set was done based on a gloss similarity measure, resulting in a set of around 146,000 terms that allows SentiWordNet to serve many real world sentiment analysis problems.

SentiWordNet basically consists of automatic annotations of all the synsets in WordNet according to the notions of "positivity", "negativity" and "objectivity". Each synset s is associated with three numerical scores, Pos(s), Neg(s), and Obj(s), indicating how positive, negative or objective the terms in the synset are. Different senses of the same term may have different opinion-related properties. Scores vary in the interval [0.0, 1.0], and their total is 1.0 per synset. The positive polarity score (PosS) and negative polarity score (NegS) are independently assigned to synsets. Each synset may consist of many terms having the same meaning in a given context, and these terms are ranked based on their popularity in that context. Table I shows a list of synsets for the adjective term "terrible" and describes how information about synsets is stored in SentiWordNet.

TABLE I: SYNSETS FOR 'TERRIBLE' ADJECTIVE TERM

Name | POS | PosS | NegS | ObjS | Definition
awful.s.02 | S | 0.000 | 0.625 | 0.375 | causing fear or dread or terror
atrocious.s.02 | S | 0.000 | 0.875 | 0.125 | exceptionally bad or displeasing
severe.s.01 | S | 0.000 | 0.875 | 0.125 | intensely or extremely bad or unpleasant in degree or quality
frightful.s.02 | S | 0.125 | 0.250 | 0.625 | extreme in degree or extent or amount or impact

C. One-hot encoding

One-hot encoding is a method to encode categorical features by a 1-of-K encoding scheme. It represents a word by a vector in which all the elements are 0 except the one corresponding to the word index. One-hot encoding is useful for converting text data into numerical values that are usable by machine learning algorithms. It however suffers from some limitations. Because each word is assigned a different representation, there is no notion of "similarity" between these representations. One-hot encoding therefore cannot encode the conceptual similarity between words. In addition, the size of one-hot representations grows with the number of words in the vocabulary, which typically ranges from a few thousand to a few million. Representing each word with a one-hot vector is too storage-intensive and would make downstream processing much more difficult.

D. Word embeddings

Word embedding [16] is a technique for language modeling and feature learning, where each word is mapped to a vector of real values in such a way that words with similar meanings have a similar representation. The values can be learned using neural networks. Popular word embedding techniques include Word2vec, GloVe and FastText.

1) Word2vec

Word2vec [16] is based on a simple but efficient feed forward neural architecture which is trained with a language modeling objective. Two different but related Word2vec models were proposed: Continuous Bag-Of-Words (CBOW) and Skip-gram. Both models are based on the probability of words occurring in proximity to each other. While Skip-gram makes it possible to start with a word and predict the words that are likely to surround it, CBOW reverses that by predicting a word that is likely to occur on the basis of specific context words, based on a loss function defined as in (1).

$$\mathcal{L} = -\log p(w_t \mid W_t) \qquad (1)$$

where $w_t$ is the target word and $W_t = w_{t-n}, \ldots, w_t, \ldots, w_{t+n}$ represents the sequence of words in context.

Fig. 1: Learning architecture of Word2vec CBOW and Skip-gram models [16]

Fig. 1 shows a simplification of the general architecture of the CBOW and Skip-gram models of Word2vec. The architecture consists of input, hidden and output layers. The input layer has the size of the word vocabulary and encodes the context as a combination of one-hot vector representations of the words surrounding a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word during the training phase.

2) GloVe

An alternative to the Word2vec embedding technique is GloVe [17], named after the way the model directly captures the global corpus statistics. GloVe is a model for distributed word representation. It tries to perform the meaning embedding procedure of Word2vec in an explicit manner. The main idea behind GloVe is that the ratio of co-occurrence probabilities of two words, $w_i$ and $w_j$, with a third probe word $w_k$, defined as $P(w_i, w_k)/P(w_j, w_k)$, is more indicative of their semantic association than a direct co-occurrence probability, i.e., $P(w_i, w_j)$. This is achieved by fulfilling the following objective function:

$$w_i^\top w_k + b_i + b_k = \log(X_{ik}) \qquad (2)$$

where $b_i$ and $b_k$ are bias terms for word $w_i$ and probe word $w_k$ respectively, and $X_{ik}$ is the number of times $w_i$ co-occurs with $w_k$. The optimization results in the construction of vectors $w_i$ and $w_k$ whose dot product gives a good estimate of their transformed co-occurrence counts.

Both Word2vec and GloVe have several advantages over the bag of words, TF-IDF and one-hot schemes. They retain the semantic meaning of different words in a document, so the context information is not lost. Another great advantage of their approach is that the size of the embedding vector is very small. Each dimension in the embedding vector contains information about one aspect of the word, unlike the bag of words, TF-IDF and one-hot approaches. However, a common limitation of Word2vec and GloVe is the lack of word sentiment information. For instance, the words "like" and "dislike" can appear in similar contexts such as "I like the movie" and "I dislike the movie". With respect to their syntactic structure and word co-occurrences, both words have similar vector representations, but from a sentiment point of view their vector representations must be different, as they are of opposite polarities. Thus, a vector representation based on the word embedding algorithm does not have enough sentiment information to perform sentiment analysis and is not able to precisely capture the overall sentiment of a sentence.

III. PROPOSED METHOD

In this research, we proposed SADL, a sentiment attention based deep learning method that combines lexicons with a Word2vec model and an attention mechanism (AM) for word semantic sentiment representation. While SentiWordNet and the Liu lexicon are used with AM for generating sentiment information, the Word2vec model is applied for identifying the context of the text to analyze. These two subtasks together support the deep learning model in handling linguistic problems. Thus, to understand a given text, SADL tries to understand the words in terms of sentiment polarity, then employs the context to create sentiment features of the text. This routine replicates the way human beings solve this problem. Fig. 2 shows the SADL architecture. The method consists of three key components, which are described in the following sections.

A. Text preprocessing

This is an essential task in sentiment analysis, especially in the analysis of social network data. Our pre-processing procedure includes four steps [14]: text extraction, text cleaning, lemmatizing and negation marking, which is an NLP based method using language grammar and structure. Word lemmatizing using WordNet eases the use of grammar and the process of matching words in the text with those from the lexicons in downstream analysis tasks.

1) Text extraction

The reviews, each of which consists of a few sentences, are first split into clauses based on clause-level punctuation marks. A clause-level punctuation mark is any word matching the regular expression: ^[.,:;!?]$. This is an intermediate step for negation marking.

2) Text cleaning

This step removes special characters and turns uppercase letters into lowercase ones.

3) Lemmatizing

This process turns words in the text into their base form. Current state-of-the-art methods usually use a heuristic method for collapsing distinct word forms by trying to remove affixes. Nouns can be in either singular or plural form, the latter using either the -es or -s suffix. Similarly, verbs can be in either present or past participle form, using -ing and -ed respectively. Adjectives can be in comparative form using the -er suffix or superlative form using the -est suffix. These methods however have the drawback of producing incorrect stemmed forms, which makes it impossible to search for the synsets of a given word in SentiWordNet. In this research, we employed the method from [13], where WordNet is used for lemmatizing while the WordNet lemma is used with SentiWordNet for retrieving the word's synsets and sentiment scores.
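The text extraction and text cleaning steps described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `NEGATION_CUES` list and the `_NEG` suffix scheme are assumptions made for the example (the paper only states that clause splitting is an intermediate step for negation marking), and lemmatizing with WordNet is left out.

```python
import re

# Clause-level punctuation, following the character class given in the text.
CLAUSE_PUNCT = re.compile(r"[.,:;!?]")

# Hypothetical negation cue list; the paper does not specify one.
NEGATION_CUES = {"not", "no", "never", "cannot"}


def split_clauses(review):
    """Split a review into clauses at clause-level punctuation marks."""
    parts = CLAUSE_PUNCT.split(review)
    return [p.strip() for p in parts if p.strip()]


def clean(clause):
    """Lowercase a clause, drop special characters, and return its tokens."""
    lowered = clause.lower()
    kept = re.sub(r"[^a-z0-9' ]+", " ", lowered)
    return kept.split()


def mark_negation(tokens):
    """Append _NEG to every token after a negation cue (assumed scheme).

    The scope ends with the clause, which is why clauses are split first.
    """
    marked, negated = [], False
    for tok in tokens:
        marked.append(tok + "_NEG" if negated else tok)
        if tok in NEGATION_CUES or tok.endswith("n't"):
            negated = True
    return marked


def preprocess(review):
    """Run extraction, cleaning and negation marking on one review."""
    return [mark_negation(clean(c)) for c in split_clauses(review)]


if __name__ == "__main__":
    for clause in preprocess("I did not like the movie, but the Cast was GREAT!"):
        print(clause)
```

Splitting into clauses before marking keeps the negation in the first clause from leaking into "but the cast was great", which is the reason the clause boundary matters for this step.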


B. Attention word embedding

The attention mechanism (AM), first introduced by Bahdanau et al. [18], is one of the most valuable breakthroughs in deep learning research, allowing effective improvement over encoder-decoder based neural systems in NLP. In the SADL method, AM is used together with lexicons to learn to focus on the important words of a given sentence based on word sentiment scores. The use of both lexicons and Word2vec models in SADL makes it possible to represent words in terms of both sentiment and semantics.

In order to achieve the best performance of Word2vec, we employ Google's pre-trained Word2vec model, which contains 300-dimensional vectors for 3 million words and phrases from Google's news dataset of about 100 billion word tokens [19]. For testing purposes with the GloVe technique, we used the pre-trained model from Stanford, which contains 300-dimensional vectors for 1.6 billion word tokens from Wikipedia [17].

Regarding word sentiment information, we employ both the Liu lexicon [1] and SentiWordNet [2]. The sentiment score calculation method in [13] is used to extract sentiment information from the lexicons. The sentiment scores are grouped by the word's POS (part of speech) tags, such as noun, verb, adjective and adverb, and are arranged into a word vector based on the word's synset rank. For instance, the list of sentiment scores for the word "terrible" (Table I) as an adjective is: 0.125, 0.000, 0.000, 0.000, -0.250, -0.625, -0.875, -0.875. Such a vector of scores is used as an input of the AM model in SADL to learn word importance.

C. SADL deep learning components

1) Dropout

[Figure: SADL pipeline. Stages: Corpus; Text Preprocessing (token, lemma, stem, POS, stop word); Word Embedding with Sentiment Attention Mechanism (attention over lex2vec and POS2vec with Pos/Neg scores, text embedding with word2vec); Concatenate; Dropout; BiLSTM; Dense with softmax]

Fig. 2: Architecture of SADL
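As a rough illustration of the "Word Embedding with Sentiment Attention Mechanism" stage in Fig. 2, the sketch below weights each word vector by a softmax over per-word sentiment magnitudes. It is only a stand-in under stated assumptions: in SADL the attention weights are learned from the sentiment score vectors, whereas here they are fixed, and the function names, the toy 2-dimensional embeddings and the choice of "largest absolute score" as the relevance signal are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_attention(embeddings, sentiment_scores):
    """Scale each word vector by a softmax weight derived from lexicon scores.

    embeddings: one list of floats per word (e.g. Word2vec vectors).
    sentiment_scores: one list of signed lexicon scores per word.
    Returns (weights, weighted_embeddings).
    """
    # Assumed relevance signal: the largest absolute score of each word.
    relevance = [
        max((abs(s) for s in scores), default=0.0)
        for scores in sentiment_scores
    ]
    weights = softmax(relevance)
    weighted = [[w * x for x in vec] for w, vec in zip(weights, embeddings)]
    return weights, weighted


if __name__ == "__main__":
    words = ["the", "movie", "terrible"]
    toy_embeddings = [[0.1, 0.2], [0.3, 0.1], [0.5, -0.4]]  # toy values
    scores = [
        [],  # no lexicon entry
        [],  # no lexicon entry
        [0.125, 0.0, 0.0, 0.0, -0.25, -0.625, -0.875, -0.875],  # from the text
    ]
    weights, _ = sentiment_attention(toy_embeddings, scores)
    for word, weight in zip(words, weights):
        print(word, round(weight, 3))
```

With these toy inputs the sentiment bearing word "terrible" receives the largest weight, while the two words without lexicon scores share the remainder equally.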

Dropout is a regularization technique for neural network models proposed by [20]. It randomly selects neurons which are ignored during the training process. These neurons temporarily make no contribution to the activation of downstream neurons in the forward phase, and no weight updates are applied to them in the backward phase. Dropout is commonly used to address the problem of overfitting. In the SADL method, it also helps to deal with words of low importance.

2) BiLSTM - Bidirectional Long Short-Term Memory Network

LSTM is a special kind of gated RNN, introduced by [21] to address the problem of handling long-term dependencies. A common LSTM network architecture is composed of a memory cell, a forget gate, an input gate, and an output gate. The memory cell flows straight down the entire chain, storing information for either long or short time periods. The forget gate determines what information to discard from the cell. The input gate controls what new information is stored in the cell. The output gate controls the output value of the LSTM unit based on the memory cell. BiLSTM consists of two LSTMs, one taking the input in the forward direction and the other in the backward direction. BiLSTM is used in SADL to increase the amount of information available to the model, in terms of both sentiment and semantics.

3) Dense layer

The Dense layer is a fully connected layer that represents a matrix vector multiplication. The values in the matrix are the trainable parameters, which get updated during the training process. In SADL, the Dense layer is used as the final layer with the softmax function, defined as in (3), which helps to map the output into one of two categories: positive opinion and negative opinion.

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}} \qquad (3)$$

IV. EXPERIMENTAL RESULTS

We compared our proposed method (SADL) with popular lexicon based methods, including the SentiWN method, the methods that use SentiWordNet with heuristics (SPL, LPL) [6], the method by Liu et al. [1], and the LML method [13]. The versions of SADL that only use either Word2vec (W2VDL) or GloVe (GVEDL) for word embedding are also taken into account. For algorithm performance evaluation, three popular datasets, Amazon, IMDb, and Yelp, which are available online together with the SADL Python source code [22] [23], were used. All deep learning based models were trained with the number of epochs set to 10, and 10-fold cross validation was applied.

The performance of the algorithms is shown in Table II. Without the attention mechanism and sentiment information, the methods based on Word2vec (W2VDL) and GloVe (GVEDL) performed worse than SADL, achieving an average accuracy of 79.4% and 78.7% respectively. SADL outperformed all the methods listed in Table II except the LML method. In comparison with LML, SADL performed better on the Amazon and IMDb datasets. It however performed slightly worse than LML on average (86.7% versus 87.1%), and particularly on the Yelp dataset (83.7% versus 86.5%). It is worth noting that LML [13] is also a hybrid method that utilizes both lexicons and machine learning models for text sentiment analysis.

TABLE II: ALGORITHM PERFORMANCE (accuracy per test dataset, number of samples in parentheses)

Algorithm | Amazon (1000) | IMDb (1000) | Yelp (1000) | Average
Lexicon based methods
SentiWN | 49.9% | 50.0% | 50.0% | 50.0%
LIU | 74.7% | 75.3% | 70.6% | 73.5%
SPL | 80.9% | 74.3% | 80.6% | 78.6%
LPL | 77.3% | 75.9% | 72.8% | 75.3%
LML | 89.7% | 85.2% | 86.5% | 87.1%
Deep learning based methods
SADL | 90.1% | 86.4% | 83.7% | 86.7%
GVEDL | 84.0% | 79.1% | 73.0% | 78.7%
W2VDL | 84.1% | 78.9% | 75.2% | 79.4%

V. CONCLUSIONS

Text sentiment analysis is a fascinating and yet challenging problem in NLP with almost unlimited practical applications in various industries, including social media, customer service and business strategic management. In this paper, we introduced an attention-based deep learning method using lexicons and a Word2vec model for word representation for sentiment analysis of social media text data. Our proposed method is novel and effective because it combines current state-of-the-art machine learning and NLP representation learning methods. The experimental results in Section IV indicated that our method performed better than the popular methods using lexicons and/or traditional word embedding techniques. Our directions for future work are (i) to conduct a deeper study of natural language grammar and structures to fully understand opinion at the aspect level, (ii) to leverage additional deep learning algorithms for the proposed method to achieve better results, and (iii) to evaluate the proposed method on large and varied datasets to confirm its accuracy.

VI. REFERENCES

[1] B. Liu, "Sentiment analysis and opinion mining," Synthesis lectures on human language technologies, pp. 1-167, 2012.
[2] S. Baccianella, A. Esuli and F. Sebastiani, "Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining," in LREC 2010, 2010.
[3] G. A. Miller, "Wordnet: a lexical database for english," in Comm. of the ACM, 1995.
[4] C. Hutto and E. Gilbert, "Vader: A parsimonious rule-based model for sentiment analysis of social media text," in 8th Intl' AAAI Conf. on Weblogs and Social Media, 2015.
[5] M. Hu and B. Liu, "Mining and summarizing customer reviews," in 10th ACM SIGKDD Intl' Conf. on Knowledge Discovery and Data Mining, 2004.
[6] P. Stone, D. Dunphy and M. Smith, "The general inquirer: A computer approach to content analysis," 1966.
[7] J. Pennebaker, M. Francis and R. Booth, "Linguistic inquiry and word count: LIWC 2001," 2001.
[8] B. Jagtap and V. Dhotre, "SVM and HMM Based Hybrid Approach of Sentiment Analysis for Teacher Feedback Assessment," Intl' J. of Emerging Trends & Technology in Computer, 2014.
[9] S. Chowdhury and W. Chowdhury, "Sentiment Analysis for Bengali Microblog Posts," in Intl' Conf. on Informatics, Electronics & Vision, 2015.
[10] A. Abdi, S. M. Shamsuddin, S. Hasan and J. Piran, "Deep learning-based sentiment classification of evaluative text based on Multi-feature fusion," Information Processing and Management, 2019.
[11] O. Araque, I. Corcuera-Platas, J. F. Sanchez-Rada and C. A. Iglesias, "Enhancing deep learning sentiment analysis with ensemble techniques in social applications," Expert Systems with Applications, pp. 236-246, 2017.
[12] M. Giatsoglou, M. G. Vozalis, K. Diamantaras, A. Vakali, G. Sarigiannidis and K. C. Chatzisavvas, "Sentiment analysis leveraging emotions and word embeddings," Expert Systems with Applications, pp. 214-224, 2017.
[13] T. Le, "A Hybrid Method for Text-Based Sentiment Analysis," in Intl' Conf. on Comp. Science and Comp. Intel, 2019.
[14] L. Vu and T. Le, "A lexicon-based method for Sentiment Analysis using social network data," in Intl' Conf. on Information and Knowledge Engineering, 2017.
[15] W. Medhat, A. Hassan and H. Korashy, "Sentiment analysis algorithms and applications: A survey," Ain Shams Engineering Journal, pp. 1093-1113, 2014.
[16] T. Mikolov, K. Chen, G. Corrado and J. Dean, "Efficient estimation of word representations in vector space," in CoRR, 2013.
[17] J. Pennington, R. Socher and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
[18] D. Bahdanau, K. Cho and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Intl' Conf. on Learning Representations, 2015.
[19] Google, "Word2vec," 29 July 2013. [Online]. Available: https://code.google.com/archive/p/word2vec/.
[20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," JMLR, pp. 1929-1958, 2014.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, pp. 1735-1780, 1997.
[22] D. Kotzias, M. Denil, N. D. Freitas and P. Smyth, "From group to individual labels using deep features," in 21st ACM SIGKDD Intl' Conf. on Knowledge Discovery and Data Mining, 2015.
[23] T. Le, "Opinion detection from text, Research and Demonstration projects," 2019. [Online]. Available: https://demo.tinyray.com/lml. [Accessed 2020].
