Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity
Evaluating the Combination of Word Embeddings with Mixture of Experts and Cascading gcForest in Identifying Sentiment Polarity
by Mounika Marreddy, Subba Reddy Oota, Radha Agarwal, Radhika Mamidi
in 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD-2019), Anchorage, Alaska, USA
Report No: IIIT/TR/2019/-1
Centre for Language Technologies Research Centre
International Institute of Information Technology
Hyderabad - 500 032, INDIA
August 2019

Mounika Marreddy, IIIT-Hyderabad, Hyderabad, India, [email protected]
Subba Reddy Oota, IIIT-Hyderabad, Hyderabad, India, [email protected]
Radha Agarwal, IIIT-Hyderabad, Hyderabad, India, [email protected]
Radhika Mamidi, IIIT-Hyderabad, Hyderabad, India, [email protected]

ABSTRACT

Neural word embeddings have delivered impressive results in many Natural Language Processing tasks. The quality of the word embedding determines the performance of a supervised model. However, choosing the right set of word embeddings for a given dataset is a major challenge for improving results. In this paper, we evaluate neural word embeddings with (i) a mixture of classification experts (MoCE) model for the sentiment classification task, and (ii) a cascade model inspired by gcForest, which takes different combinations of word embeddings as first-level features and extracts diverse features from them, in order to compare and improve classification accuracy. We argue that each expert learns the positive and negative examples corresponding to its category of features, and that the resulting features, on a given task (polarity identification), achieve performance competitive with state-of-the-art methods in terms of accuracy, precision, and recall when passed to gcForest.

KEYWORDS

mixture of experts, gcForest, word embeddings, sentiment analysis

1 INTRODUCTION

Sentiment Analysis is one of the most successful and well-studied fields in Natural Language Processing [3, 9, 10]. Traditional approaches mainly focus on designing a set of features, such as bag-of-words or sentiment lexicons, to train a classifier for sentiment classification [18]. However, feature engineering is labor intensive and has almost reached its performance bottleneck. Moreover, with the growing amount of information on the web, such as reviews written on review sites and social media, opinions influence human behavior and help organizations or individuals in decision making. With the huge success of deep learning techniques, some researchers have designed effective neural networks that generate low-dimensional contextual representations and yield promising results on sentiment analysis [7, 14, 21].

Since the work of [2], the NLP community has focused on improving the feature representation of sentences and documents through the continuous development of neural word embeddings. Word2Vec was the first powerful technique to capture semantic similarity between words, but it fails to capture the meaning of a word based on its context [17]. As an improvement over Word2Vec, [19] introduced GloVe embeddings, which primarily rely on global co-occurrence counts for generating word vectors. Word2Vec and GloVe were easy to train and found application in Question Answering, Sentiment Analysis, and Automatic Summarization [13], and also gained popularity in Word Analogy, Word Similarity, and Named Entity Recognition tasks [5]. However, the main limitation of GloVe and Word2Vec is that they cannot differentiate between the uses of a word in different contexts. [16] introduced a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation (MT) to contextualize word vectors (MT-LSTM/CoVe). The main limitation of CoVe vectors is that they use zero vectors for unknown (out-of-vocabulary) words.

ELMO [20] and BERT [6] embeddings are two recent popular techniques that outperform earlier methods on many NLP tasks and have achieved huge success as neural embedding techniques that represent context in their features due to the attention-based mechanism. ELMO is a character-based embedding: it allows the model to handle out-of-vocabulary words, and its deep contextualized word representations capture syntactic and semantic features of words, improving performance on problems like sentiment analysis [1] and named entity recognition [15]. Advancing contextual embeddings further, BERT is a breakthrough neural embedding technique built upon transformers and the self-attention mechanism. It can represent features that capture the relationships between all words in a sentence. BERT provides state-of-the-art feature representations for tasks like question answering on SQuAD [22], language modeling, and sentiment classification.

Although in recent years neural word embeddings have provided better vector representations of semantic information, there has been relatively little work on direct evaluations of these models. Previous work has evaluated various word embedding techniques [8] on specific tasks like word similarity, word analogy, and named entity recognition [23], based on the obtained performance metric.

In this paper, we evaluate neural word embedding techniques with (i) a mixture of classification experts (MoCE) model for the sentiment classification task, and (ii) gcForest, to compare and improve classification accuracies. The underlying mechanism of the MoCE model gives it great potential to discriminate positive and negative examples in the sentiment classification task on Amazon product reviews data. In the next sections, we discuss the proposed MoCE approach, cascading gcForest, and our enhancements.
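Throughout the paper, the input to both models is a fixed-length text vector derived from pretrained embeddings (Word2Vec, GloVe, ELMO, or BERT). As a minimal illustration of how such first-level features might be built for the non-contextual embeddings, the Python sketch below loads GloVe-format vectors from a plain-text file and mean-pools them over the tokens of a review. The file path, dimensionality, and mean-pooling scheme are assumptions for illustration; the paper does not specify its exact feature construction.

import numpy as np

def load_vectors(path):
    # Load GloVe-style word vectors: one "word v1 v2 ... vd" per line.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def text_features(text, vectors, dim=300):
    # Mean-pool the vectors of known tokens into one fixed-length feature vector.
    tokens = text.lower().split()
    rows = [vectors[t] for t in tokens if t in vectors]
    return np.mean(rows, axis=0) if rows else np.zeros(dim, dtype=np.float32)

# Hypothetical usage; the path and review text are placeholders.
# vectors = load_vectors("glove.6B.300d.txt")
# x = text_features("hilarious movie , streamed cleanly", vectors)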
2 MODEL ARCHITECTURE

We use a mixture-of-experts based model, whose architecture is inspired by [11]. The mixture-of-experts architecture is composed of a gating network and several expert networks, each of which solves a function approximation problem over a local region of the input space. The detailed overview of our model is shown in Figure 1, where the input is a text vector extracted from recently successful neural embeddings such as Word2Vec, GloVe, ELMO, Amazon Small Embeddings, and BERT. These input features pass through both the gating network and the two experts. The gating network uses a probabilistic model to choose the best expert for a given input text vector.

Figure 1: Proposed Mixture of Classification Experts (MoCE) model. Here Expert1 captures positive reviews and Expert2 captures negative reviews. (The figure shows text features extracted from Word2Vec, GloVe, BERT, or ELMO being routed by the gating function, with example review sentences that receive the highest gating probability from each expert.)

2.1 MoCE Architecture

Given an input feature vector x from one of the neural word embedding methods, we model its posterior probability as a mixture of the posteriors produced by each expert model trained on x:

p(y \mid x) = \sum_{j=1}^{K} P(S_j \mid x; \theta_0) \, p(y \mid x; S_{\theta_j}) = \sum_{j=1}^{K} g_{S_j}(x; \theta_0) \, p(y \mid x; S_{\theta_j})    (1)

Here, P(S_j \mid x; \theta_0) = g_{S_j}(x; \theta_0) is the probability of choosing the j-th expert S_j for a given input x. Note that \sum_{j=1}^{K} g_{S_j}(x; \theta_0) = 1 and g_{S_j}(x; \theta_0) \geq 0, \forall j \in [K]. g_{S_j}(x; \theta_0) is also called the gating function and is parameterized by \theta_0.

In this paper, we choose p(y \mid x; S_{\theta_j}) as a Gaussian probability density for each of the experts, denoted by:

p(y \mid x; S_{\theta_j}) = \frac{1}{(|\sigma_j| 2\pi)^{1/2}} \exp\left( -\frac{1}{2} (y - W_j x)^2 \right)    (2)

where S_{\theta_j} = \{W_j\} and W_j \in \mathbb{R}^{m \times n} is the weight matrix associated with the j-th expert S_j. We use the softmax function for the gating variable g_{S_j}(x; \theta_0):

g_{S_j}(x; \theta_0) = \frac{\exp(v_j^T x)}{\sum_{i=1}^{K} \exp(v_i^T x)}    (3)

where v_j \in \mathbb{R}^n, \forall j \in [K]. Thus, \theta_0 = \{v_1, \ldots, v_K\}. Let \Theta be the set of all parameters involved for the K experts; thus \Theta = \{\theta_0, W_1, \ldots, W_K\}. We train the MoCE model and update its weights iteratively using the expectation-maximization (EM) algorithm.
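For concreteness, the following minimal NumPy sketch implements the forward computation of Eqs. (1)-(3): softmax gating over K experts and Gaussian expert densities combined into the mixture posterior. The shapes, the random initialization, and the variance value are illustrative assumptions; in the paper the parameters are fit with the EM algorithm, which is not reproduced here.

import numpy as np

def gating(x, V):
    # Eq. (3): softmax gating probabilities g_{S_j}(x; theta_0); V has shape (K, n).
    scores = V @ x
    scores -= scores.max()                 # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def expert_density(y, x, W_j, sigma_j=1.0):
    # Eq. (2): Gaussian density of target y under expert j with weight matrix W_j.
    mean = W_j @ x
    return np.exp(-0.5 * np.sum((y - mean) ** 2)) / np.sqrt(2.0 * np.pi * sigma_j)

def mixture_posterior(y, x, V, W):
    # Eq. (1): p(y | x) as a gate-weighted sum of expert densities.
    g = gating(x, V)
    return sum(g[j] * expert_density(y, x, W[j]) for j in range(len(W)))

# Illustrative shapes: n-dimensional embedding features, K = 2 experts, m = 1 output.
rng = np.random.default_rng(0)
n, m, K = 300, 1, 2
x = rng.normal(size=n)                     # text feature vector from an embedding
V = rng.normal(size=(K, n))                # gating parameters theta_0 = {v_1, ..., v_K}
W = rng.normal(size=(K, m, n))             # expert weight matrices W_j
print(mixture_posterior(np.array([1.0]), x, V, W))

During EM, the E-step would compute each expert's responsibility for every training example from these same quantities, and the M-step would re-estimate V and the W_j accordingly.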
2.2 Multigrained gcForest Architecture

Figure 2: Cascading gcForest Architecture. (The figure shows text features extracted from Word2Vec, GloVe, BERT, or ELMO passing through cascade levels 1 to n; each level contains LightGBM, Random Forest, XGBoost, and Extra-Trees forests, their class vectors are concatenated with the input features between levels, and the final prediction is obtained by averaging the class distributions and taking the maximum.)

This ensemble of ensembles yields diversity in feature construction. Each forest produces a class distribution for each instance, and averaging all class distributions across the ensemble-based forests gives an output vector. The output vector is concatenated with the original feature vector and passed to the next cascading level. To avoid the risk of overfitting, each forest uses K-fold cross-validation to produce the class vector. Moreover, the complexity of the model can be controlled by monitoring the training and validation errors and terminating the process when the training is adequate.

3 EXPERIMENTAL SETUP & RESULTS

3.1 Dataset Description

For our experiments, we are using the Dranziera protocol dataset provided in ESWC Semantic Challenge-2019.

Table 1: Model-Parameters
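To make the cascade of Section 2.2 concrete for an experiment of this kind, the sketch below builds a single cascade level with scikit-learn forests: each forest produces out-of-fold class-probability vectors via K-fold cross-validation, and those vectors are concatenated with the original embedding features before being passed to the next level. This is an illustrative simplification, not the paper's implementation: it uses only Random Forest and Extra-Trees (the LightGBM and XGBoost classifiers from Figure 2 could be added the same way), and the feature matrix and labels are random placeholders standing in for embedding features of the review dataset.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def cascade_level(X, y, n_folds=5, seed=0):
    # One gcForest-style cascade level: out-of-fold class vectors from each
    # forest are concatenated with the original features for the next level.
    forests = [
        RandomForestClassifier(n_estimators=100, random_state=seed),
        ExtraTreesClassifier(n_estimators=100, random_state=seed),
        # LightGBM / XGBoost classifiers would be appended here in the full model.
    ]
    class_vectors = [
        cross_val_predict(f, X, y, cv=n_folds, method="predict_proba")
        for f in forests
    ]
    return np.hstack([X] + class_vectors)

# Placeholder data: 1000 reviews with 300-d embedding features, binary polarity.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
y = rng.integers(0, 2, size=1000)

X_next = cascade_level(X, y)   # augmented features for the next cascade level
print(X_next.shape)            # (1000, 300 + 2 classes x 2 forests)

At the final level, the per-forest class distributions are averaged and the class with the highest mean probability is taken as the prediction, corresponding to the Avg/Max step shown in Figure 2.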