
Department of Computer Science
MSc in Computer Science

Master of Science

Sentiment Analysis for Tweets

Konstantinos Korovesis

Athens 2018

Acknowledgements

I would like to thank my supervisor Prof. Ion Androutsopoulos for introducing me to the world of NLP and for giving me the opportunity to work on this exciting field. Furthermore, I would like to thank C. Baziotis, J. Pavlopoulos and J. Koutsikakis for their useful insights and assistance. Finally, I would like to thank Irene and my family for their support and understanding.


Abstract

Sentiment Analysis is a field of Natural Language Processing which addresses the problem of extracting sentiment or, more generally, opinion from text. Obtaining deeper insights on that topic can be very valuable for a range of fields such as finance, marketing, politics and business. Previous research has shown how sentiment and public opinion can affect stock markets, product sales and polls, as well as public health. This thesis studies the message sentiment polarity classification problem, aiming to classify messages based on the polarity of the sentiment towards a specific topic, where the tweets and the topics are always given. The dataset analyzed and the evaluation metrics considered are provided by the SemEval 2017 International Workshop, Task 4, "Sentiment Analysis in Twitter". This task includes five subtasks, two of which we address in this research. First, subtask B is a binary classification task, where the goal is to classify messages into two classes, positive and negative, regarding the sentiment towards the topic. Second, in subtask C the target is to classify messages on a five-point sentiment polarity scale, from highly negative, negative and neutral to positive and highly positive, based on the sentiment towards a given topic. We re-implemented and experimented with two deep learning models for both subtasks: a Convolutional Neural Network (CNN) and a state-of-the-art Recurrent Neural Network with context attention (Att-BiLSTM+WL). We compare both models to the baselines of the challenge and show that the Att-BiLSTM+WL outperforms the other models, in both subtasks, on all evaluation measures but one.


Περίληψη

Η Ανάλυση Συναισθήματος είναι ένας τομέας της Επεξεργασίας Φυσικής Γλώσσας ο οποίος αντιμετωπίζει το πρόβλημα της εξαγωγής συναισθημάτων ή γενικότερα γνώμης από κείμενο. Η απόκτηση βαθύτερων γνώσεων σχετικά με το θέμα αυτό μπορεί να είναι πολύτιμη για μια σειρά πεδίων όπως τα χρηματοοικονομικά, το μάρκετινγκ, η πολιτική και οι επιχειρήσεις. Προηγούμενες έρευνες έχουν δείξει πώς το συναίσθημα και η κοινή γνώμη μπορούν να επηρεάσουν τις χρηματιστηριακές αγορές, τις πωλήσεις προϊόντων, τις δημοσκοπήσεις καθώς και τη δημόσια υγεία. Αυτή η διπλωματική εργασία ερευνά το πρόβλημα της ταξινόμησης της πολικότητας του συναισθήματος μηνυμάτων στο Twitter με στόχο την ταξινόμησή τους με βάση την πολικότητα του συναισθήματος προς ένα συγκεκριμένο θέμα, όπου δίνονται πάντα τα tweets και τα θέματα. Το σύνολο δεδομένων που αναλύεται και οι μετρήσεις αξιολόγησης που εξετάζονται παρέχονται από το SemEval 2017 International Workshop Task 4 για το "Sentiment Analysis in Twitter". Αυτός ο διαγωνισμός περιλαμβάνει πέντε υπό-εργασίες, δύο από τις οποίες εν τέλει προσεγγίσαμε σε αυτή την έρευνα. Πρώτον, η υπό-εργασία Β είναι μια εργασία δυαδικής ταξινόμησης, όπου ο στόχος είναι να ταξινομηθούν τα μηνύματα σε δύο κατηγορίες, θετικά και αρνητικά σχετικά με το συναίσθημα προς το θέμα. Ακολούθως, η υπό-εργασία Γ όπου ο στόχος είναι να ταξινομηθούν τα μηνύματα, ως προς κάποιο θέμα, σε πέντε τάξεις πολικότητας, πολύ αρνητική, αρνητική, ουδέτερη, θετική και πολύ θετική. Εφαρμόσαμε δύο μοντέλα βαθιάς μάθησης και εκτελέσαμε πειράματα για τις δύο υπό-εργασίες: ένα CNN και ένα τελευταίας τεχνολογίας RNN που περιλαμβάνει και μηχανισμό Attention (Att-BiLSTM+WL). Συγκρίνουμε και τα δύο μοντέλα με τα βασικά συστήματα (baselines) του διαγωνισμού και δείχνουμε ότι το Att-BiLSTM+WL ξεπερνά τα άλλα μοντέλα, και στις δύο υπό-εργασίες, με όλα τα μέτρα αξιολόγησης, πλην ενός.


Contents

1. INTRODUCTION
1.1 THE SENTIMENT ANALYSIS TASK
1.2 MESSAGE POLARITY CLASSIFICATION IN TWITTER
1.3 APPROACHES TO SENTIMENT CLASSIFICATION
1.4 THE PURPOSE OF THIS WORK
1.5 THESIS OUTLINE
2. RELATED WORK
3. TOPIC-BASED SENTIMENT ANALYSIS MODELS
3.1 DATASETS
3.2 PREPROCESSING
3.3 BASELINES
3.3.1 EMBEDDING LAYER
3.3.2 CONVOLUTIONAL NEURAL NETWORK BASELINE
3.3.3 BIDIRECTIONAL LSTM BASELINE
3.4 ATT-BILSTM+WL MODEL
3.4.1 MEAN POOLING AND ANNOTATION
3.4.2 CONTEXT ATTENTION LAYER
3.4.3 REGULARIZATION
3.4.4 OPTIMIZATION
3.4.5 HYPER-PARAMETERS
3.5 CLASS BALANCING
4. EXPERIMENTS
4.1 EVALUATION MEASURES
4.2 EXPERIMENTAL SETUP
4.3 EVALUATION AT VALIDATION
4.4 EVALUATION AT TEST
5. CONCLUSIONS AND FUTURE WORK
5.1 CONCLUSIONS
5.2 FUTURE WORK
BIBLIOGRAPHY


List of Figures

Figure 1. CNN model with 3 filters.
Figure 2. The Att-BiLSTM+WL model: A 2-layer bidirectional LSTM with attention.
Figure 3. Att-BiLSTM+WL Accuracy in 30 epochs, subtask B.
Figure 4. Validation Accuracy of Att-BiLSTM+WL and CNN model, subtask B.
Figure 5. Att-BiLSTM+WL macro-MAE in epochs, subtask C (lower is better).
Figure 6. Validation macro-MAE of Att-BiLSTM+WL and CNN model, subtask C (lower is better).

List of Tables

Table 1. Tweets, annotated for their sentiment polarity.
Table 2. Dataset statistics for Task 4 subtasks B and C.
Table 3. Preprocessing examples.
Table 4. Confusion Matrix.
Table 5. Scores in Subtask B (higher is better), Bx indicates a baseline.
Table 6. Scores in Subtask C (lower is better), Bx indicates a baseline.
Table 7. The effect of weighted loss (WL) in Subtask B (higher is better).
Table 8. The effect of weighted loss (WL) in Subtask C (for MAE lower is better).


Notation

NLP   Natural Language Processing
SA    Sentiment Analysis
TSA   Topic-based Sentiment Analysis
OM    Opinion Mining
SM    Statistical Modeling
ML    Machine Learning
SVM   Support Vector Machine
NLTK  Natural Language Toolkit
NN    Neural Network
ANN   Artificial Neural Networks
DNN   Deep Neural Networks
CNN   Convolutional Neural Network
RNN   Recurrent Neural Network
LSTM  Long Short-Term Memory
SGD   Stochastic Gradient Descent
MAE   Mean Absolute Error
WLF   Weighted Loss Function


1. Introduction

1.1 The Sentiment Analysis task

Sentiment Analysis (SA), a kind of Opinion Mining (OM), is a field of Natural Language Processing (NLP) whose goal is to extract the emotion, sentiment or, more generally, the opinion expressed in a human-written text. The text mostly derives from social media, product reviews and blogs. While the terms opinion and sentiment are quite generic, the field encompasses a number of tasks. Some of these are identifying the stance towards a target or topic, for instance "Climate Change", extracting the opinion on a product from a review, or detecting the sentiment polarity of a message. Sentiment Analysis is performed at various linguistic levels. The standard ones are the document level, the sentence level, and the aspect or entity level (Appel et al. 2016).

Sentiment Analysis has plenty of applications in business, marketing and politics. Determining people's opinion is key for future planning in many fields. SA can be used to evaluate future business plans based on public opinion of a new product. Trends in product sales can be identified early by measuring customer sentiment. Many marketing agencies propose to companies the right direction for advertising a product based on public sentiment, extracted from messages in social media or product reviews. In addition, political parties plan their campaigns around public sentiment that can be extracted from text in social networks, blogs and forums.

Sentiment analysis is not an easy task. There are issues that can throw the analysis off and need attention. For instance, tweets can be sarcastic or contain ambiguous words, which often leads to misclassifying the polarity of the tweet. For example:

- “Shut up. And take my money.”, which actually refers to a “must buy” product, although the message could be classified as negative because it contains negatively charged words.

In addition, a product review might contain text such as the following, which expresses both a positive and a negative opinion:

- “The tuna was cooked perfectly but the miso dressing wasn’t tasty at all”.

Also, in Twitter, as well as in product reviews, the polarity is often unclear, for example:

- “Saakashvili is pushing his own agenda here. The Ukrainian economy is growing, although corruption is still a problem”


- “Although some vaccines protect our children, they still have potential to be very toxic”.

A Sentiment Analysis system should also handle negation (e.g., "not good"); perform some kind of word sense disambiguation; and, in the case of multiple sentiments and sentiment targets, be able to classify them accordingly. If a message expresses negative sentiment towards one topic and positive sentiment towards another, then the system should classify the message for each topic accordingly. Such a system should also be able to accurately detect irony, which is challenging and the subject of ongoing research.

1.2 Message Polarity Classification in Twitter

Twitter is a social medium and micro-blogging site where users can post text messages, commonly referred to as tweets¹. The number of monthly active Twitter users in the fourth quarter of 2017 was 300 million², while approximately 90% of tweets are public and can be collected for research without violating user anonymity. Tweets are available in real time through Twitter's streaming API³. Tweets can be filtered by the time and location at which they were published.

Messages in Twitter usually include emoticons, misspellings (e.g., "Comeee onnnnn fineee, waaaay too") and slang (e.g., "hooked on" for being addicted to something, "sick" for something very good), in addition to normal text. Tweets normally include hashtags, which often indicate the topics of the message (e.g., #Yemen, #thankyouobama, #BlackLivesMatter). These deviations in the text should be handled, or used to gain information regarding sentiment. Table 1 shows a few examples of Twitter messages and their annotated sentiment labels. All annotations were performed on CrowdFlower⁴.

In SemEval Task 4, "Sentiment Analysis in Twitter", messages must be classified based on the polarity of the opinion expressed in the tweet. The goal of this thesis is to study and examine Topic-based Message Polarity Classification, which is described in subtasks B and C of Task 4 in SemEval 2017 (Rosenthal, Farra, and Nakov 2017). The goal of these two subtasks is to classify the sentiment of a tweet towards a predefined topic. Subtask B employs a two-point scale, positive or negative, assuming that there is no neutral class. By contrast, subtask C employs a more elaborate five-point scale: highly negative, negative, neutral, positive and highly positive.

¹ https://twitter.com
² https://identitygroup.co.uk/digital-marketing-2016/statistic_id282087_twitter_-number-of-monthly-active-users-2010-2015/
³ https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data.html
⁴ https://www.figure-eight.com/platform

The task can be addressed in many different ways, for example using sentiment lexicons, human annotators, or ontologies. For the purposes of this thesis, we used Machine Learning (ML) techniques. We employ a supervised learning model, which is trained to label messages based on the sentiment polarity they express. The model is then tested and its performance is evaluated on accuracy and other metrics.

Positive:
- "#Greece #Israel Greek pilot: We are your friends, and we are always here for you."
- "Remember; Clinton WON the popular vote by MILLIONS and probably won electoral vote too (if not for Russian intervention) #AuditTheVote"
- "Tesla Model S P100D so much more than cutting edge technology #tesla #models #model3 #elonmusk"

Negative:
- "#Islam #Jerusalem POTUS You mean all the ones you created through your illegal wars & genocide across the Middle East? #Libya #Syria #Yemen"
- "The US public essentially decided an election based on amplified lies and suppressed truths--and some people want a direct popular vote!"
- "Tesla's autopilot model highlights the dangers of self-driving cars and potential flaws in engineering."

Table 1. Tweets, annotated for their sentiment polarity.

1.3 Approaches to Sentiment Classification

The existing methods for Sentiment Analysis can be grouped into two main categories:

1. Knowledge-Based

2. Machine Learning

In knowledge-based methods, also called lexicon-based sentiment classification, the goal is to construct, or use existing, sentiment lexicons that assign sentiment labels to the words or phrases of the text. The classification of the text is defined by rules, e.g., a function over the words, such as the sum of word polarities (Taboada et al. 2011). This approach does not require any training (other than forming a lexicon, if required). However, it requires powerful linguistic resources⁵ to extract knowledge from words, which are not always available.
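To make the rule-based idea concrete, the following minimal Python sketch classifies a tokenized message by summing word polarities and thresholding the result; the lexicon entries and the threshold are illustrative assumptions, not a published resource:

```python
# A toy polarity lexicon: word -> sentiment score (hypothetical values).
POLARITY_LEXICON = {"great": 1.0, "amazing": 1.0, "nice": 0.5, "cool": 0.5,
                    "bad": -1.0, "boring": -0.5}

def classify(tokens):
    # Sum the polarities of the known words; unknown words contribute 0.
    score = sum(POLARITY_LEXICON.get(t.lower(), 0.0) for t in tokens)
    return "positive" if score > 0 else "negative"

print(classify("what an amazing and cool product".split()))  # -> positive
```

A sketch like this also illustrates the approach's weaknesses discussed above: it handles neither negation nor sarcasm without extra hand-written rules.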

Hu and Liu (2004a) built a lexicon using only WordNet and a list of labeled seed adjectives. The seed list contains only positive adjectives (e.g., great, amazing, nice, cool) and negative adjectives (e.g., bad, boring). Their method retrieves the synonyms and antonyms of the seed words and automatically labels them with the same and the opposite polarity, respectively. This process allows the list to grow into a lexicon. A drawback of this approach is that it is only applicable to languages for which WordNet is available. In any case, knowledge-based methods can struggle with noisy text, and manually creating rules that combine the word-level information obtained from sentiment lexicons takes time and effort.

On the other hand, Machine Learning requires training a model to predict the polarity of the text. The model is trained on text messages, labeled for their sentiment and represented as feature vectors. The latter conventionally requires text preprocessing using language processing tools like NLTK⁶. Text preprocessing mainly involves tokenization, stemming, tagging, and possibly parsing of the text. The selection of appropriate features from the data is crucial; it has proven to be a major issue and remains a key objective for researchers.

Previous work on sentiment analysis has exploited well-known supervised machine learning methods, such as Naive Bayes (Martinez-Arroyo and Sucar 2006), SVMs (Vinet and Zhedanov 2010), and Random Forests (Ho 1995), (Wahid et al. 2017). More recent work uses deep learning models (Goldberg and Hirst 2017), especially Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). This thesis also focuses on deep learning models.

1.4 The purpose of this work

The purpose of this thesis is to study, build and evaluate a Topic-based Sentiment Analysis (TSA) model, re-implementing a state-of-the-art deep learning neural network capable of classifying messages from Twitter with respect to the sentiment polarity of the message towards a given topic. Using the SemEval 2017 Task 4 datasets and evaluation metrics, we created a TSA model based on a top-scoring paper (Baziotis, Pelekis, and Doulkeridis 2017). We named our model Att-BiLSTM+WL (Bidirectional LSTM with Attention mechanism and Weighted Loss). Our implementation of the system of Baziotis et al. outperforms a strong baseline based on a Convolutional Neural Network (CNN) model (Kim 2014) and the baselines of the competition, in both binary and five-scale sentiment polarity classification. We compare the results of these two different neural networks on this task: the bidirectional LSTM with an attention mechanism (Bi-LSTM) and the CNN model. These types of neural networks have the ability to recognize patterns in text data and perform very well in text classification (Kim 2014; Tang, Qin, and Liu 2015; Vu et al. 2016).

⁵ https://nlp.stanford.edu/links/linguistics.html
⁶ http://www.nltk.org/

1.5 Thesis Outline

The structure of the remainder of this thesis is the following:

In Chapter 2 we present previous work in the field of SA in social media, blogs and webpages in general. We show how important it is to understand and identify public opinion, which impacts brand marketing, product sales and stock markets. We also present related work on SA for polling political parties, and how it can predict the sentiment and opinion of voters accurately, in contrast to standard methods. In addition, we cite further work on sentiment analysis for detecting major events, identifying public health issues, and mapping people's demographics, health characteristics and the "geography of happiness", a term describing the correlation between sentiment and place.

Our main goal in Chapter 3 is to present our work on message polarity classification in Twitter. We provide a full description of the task and of the dataset on which we train the Att-BiLSTM+WL model. Furthermore, we describe the official baselines, our strong CNN baseline, and a simple bidirectional RNN with LSTM cells that serves as a building block of our model. Next, we give an in-depth analysis of all layers of the Att-BiLSTM+WL deep learning model. Here we also present the individual key procedures and techniques that constitute the training process, including regularization, class balancing and optimization. Finally, we present the CNN model that we built for comparing the final results with the Att-BiLSTM+WL model.

In Chapter 4 we describe the evaluation measures for the SA task and how the experiments were conducted. We also explain how we worked during the training process, what choices we made regarding when to stop training, and how we validated the model based on the experiments. In the results we present the scores of the Att-BiLSTM+WL model compared to the CNN model and the baselines of SemEval 2017 Task 4.

Finally, Chapter 5 summarizes the work of this thesis and proposes future work.


2. Related Work

Sentiment Analysis has been an attractive object of study for AI researchers, computational linguists, cognitive scientists and neurobiologists. As mentioned in Section 1.3, one of the most successful approaches to Sentiment Analysis is Natural Language Processing with Machine Learning (ML) techniques. The fundamental requirement for using supervised ML to solve classification and regression problems is data availability. Thus, as more and more text data became available through blogs, websites and social media, where people publish their opinions on many subjects, including politics, products, events and business among others, more researchers studied SA. The first papers on SA focused on reviews of movies (Pang, Lee, and Vaithyanathan 2002), (Turney 2001), (Pang and Lee 2005) and (Popescu and Etzioni 2005) or products (Hu and Liu 2004b), (Popescu and Etzioni 2005). Following those studies, others focused on analyzing the sales of products such as books, movies and videogames based on customers' opinions (Chevalier and Mayzlin 2006), (Mishne and Glance 2006), (Liu et al. 2007), (Zhu and Zhang 2010).

With the rapid growth of social media like Twitter, more attention was drawn to social media content, as in (Jansen et al. 2009), (Asur and Huberman 2010), (Arias, Arratia, and Xuriguera 2013). In addition to Sentiment Analysis and the impact of people's opinions on sales, extensive research has been done in the fields of finance and economics. As demonstrated by Lemmon and Portniaguina (2006) and Han (2008), there is a correlation between the sentiment and confidence of investors and the stock market. Moreover, Gilbert and Karahalios (2010) show that "estimating emotions from weblogs provides novel information about future stock market prices", while Bollen, Mao, and Zeng (2011) explored the fact that national events affect people's emotions and the relationship of those emotions to the value of the Dow Jones Industrial Average (DJIA). Due to these findings, more work has been done on the subject in recent years (Oh and Sheng 2011; Zhang, Fuehres, and Gloor 2011; Makrehchi, Shah, and Liao 2013; Si et al. 2013; Smailović et al. 2013; Sprenger et al. 2014). Quoting Mitchell et al. (2013): "Companies should pay more attention to the analysis of sentiment related to their brands and products in social media communication as well as in designing advertising content that triggers emotions."


The ability to measure public opinion on social and political affairs is critical for political parties. The usual methods, such as polls, are expensive, may be inaccurate, and their results are not always representative of public sentiment. In addition, collecting people's opinions by asking questions is not the best method of gathering useful data. Thus, Sentiment Analysis in social media like Twitter may provide an alternative measure of public opinion and a source of useful data (Ceron, Curini, and Stefano 2012; O'Connor et al. 2010; Stieglitz and Dang-Xuan 2012; Zhou et al. 2013). For example, Diakopoulos and Shamma (2010) present an analysis of ephemeral changes of sentiment in reaction to the video of the first 2008 U.S. presidential debate.

Further work has been done in other areas of sentiment analysis in social media. Sakaki et al. (2010) propose a method for detecting major events by analyzing the Twitter text stream, and Culotta (2010) proposes methods for identifying influenza-related messages. Data from Twitter can also be used to analyze public emotion, demographics, health characteristics and the "geography of happiness" (Mitchell et al. 2013), a term describing the correlation of sentiment to place. Studying virality in Twitter and the correlation of viral messages with sentiment, Hansen et al. (2011) showed that "news with negative sentiment is more likely to become viral, while in the non-news segment this is not the case".


3. Topic-based Sentiment Analysis Models

We re-implemented a Topic-based Sentiment Analysis (TSA) deep neural network model based on the work of Baziotis et al. (2017). For the purpose of comparing test results, we also re-implemented a Convolutional Neural Network baseline model based on the model of Kim (2014). We named our TSA model Att-BiLSTM+WL (Bidirectional LSTM with Attention mechanism and Weighted Loss). Before describing the Att-BiLSTM+WL model in depth, we describe the official baselines used in the competition and the strong CNN baseline. We also describe a simple bidirectional RNN model, which we did not use as a baseline but as part of our model.

3.1 Datasets

A key prerequisite for building a good statistical model is having a large amount of good-quality data. Collecting and cleaning data is a time-consuming process. In our research, the training and test data for the TSA model were provided by the organizers of SemEval 2017 Task 4 (Rosenthal et al. 2017). The dataset consists of human-annotated tweets with sentiment labels on multiple scales, with respect to the topic referenced in each message.⁷

As shown in Table 2, the dataset for each subtask consists of:

Subtask B: Tweets, labeled with one of two class labels, positive or negative sentiment towards a given topic.

Subtask C: Tweets, labeled with one of five class labels (highly negative, negative, neutral, positive, or highly positive) for the sentiment conveyed by the tweet towards a given topic.

The topics were selected from the topics (events) trending on Twitter between December 2016 and January 2017, according to TRENDS24⁸. The topics cover a range of named entities (e.g., Fidel Castro, Melania, iPhone), geopolitical entities (e.g., Yemen, Palestine), and other entities (e.g., Vaccines, 3D Printing, Dakota Access Pipeline, Thankyouobama, Western media, gun control, and vegetarianism). The developers of the dataset automatically filtered the tweets for duplicates. The construction of the dataset also included a stage that removed tweets whose bag-of-words representations had a cosine similarity exceeding the threshold of 0.6 (it is unclear whether one of the duplicate tweets was retained in these cases, or whether all the duplicates were removed); also, topics with fewer than 100 tweets were discarded. In addition, user profile information, such as age and location, as well as friend lists, was provided. In our research we did not study the advantages of including this information (profile information, friend lists) in the input of the TSA model; we consider it part of future work.

⁷ All annotations for the competition were performed with CrowdFlower (https://www.figure-eight.com/platform).
⁸ https://trends24.in

Train, Subtask B: 14897 positive, 3997 negative (total: 18894)
Train, Subtask C: 1016 highly positive (+2), 12852 positive (+1), 12888 neutral (0), 3380 negative (-1), 296 highly negative (-2) (total: 30432)
Test, Subtask B: 2463 positive, 3722 negative (total: 6185)
Test, Subtask C: 131 highly positive (+2), 2332 positive (+1), 6194 neutral (0), 2545 negative (-1), 177 highly negative (-2) (total: 12379)

Table 2. Dataset statistics for Task 4 subtasks B and C.

3.2 Preprocessing

Text preprocessing is a key step towards accurate sentiment classification. The preprocessing of tweets consists of a series of tasks that normalize the text and prepare it for feature extraction. For text preprocessing we use ekphrasis⁹, a library for text processing that performs word segmentation, word normalization, tokenization, and spelling correction. The first task is tokenization, where the text is split into tokens; most punctuation and whitespace is discarded, while words and other meaningful tokens are retained. It is important to keep not only words but also other character sequences and symbols, which have proven useful for sentiment analysis in social media like Twitter (Gimpel et al. 2011). The next task is spelling correction, where misspelled words are replaced with the most probable correct word. In addition, word normalization and word segmentation are included in the pipeline. Examples retrieved from the ekphrasis documentation can be found in Table 3; they show the original and the processed text of tweets. During word normalization, most expressions including emoticons, emojis, dates, times, currencies and censored words (e.g., f**k, s**t), along with other special types of tokens, are recognized and replaced by labels (e.g., <date>, <money>, <censored>). These expressions do not usually carry information regarding the opinion and do not affect the sentiment; thus, we exclude them from the vocabulary. Normalization also treats hashtags: for instance, #FunFacts becomes <hashtag> fun facts </hashtag>.

⁹ https://github.com/cbaziotis/ekphrasis

Original Tweets:
- "CANT WAIT for the new season of #TwinPeaks \(^o^)/!!! #davidlynch #tvseries :)))"
- "I saw the new #johndoe movie and it suuuuucks!!! WAISTED $10... #badmovies :/"
- "@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY !!! :-D http://sentimentsymposium.com/."

Preprocessed Tweets:
- "cant wait for the new season of twin peaks \(^o^)/ ! david lynch tv series"
- "i saw the new john doe movie and it sucks ! waisted . bad movies :"
- "can not wait for the sentiment talks ! yay !"

Table 3. Preprocessing examples.
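For illustration, a preprocessing pipeline close to the configuration suggested in the ekphrasis documentation could be set up as follows; the exact option values shown here are our assumptions, and the library offers more options than we list:

```python
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons

text_processor = TextPreProcessor(
    # Replace these token types with labels such as <url>, <money>, <date>.
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
               'time', 'date', 'number'],
    # Annotate (surround with tags) instead of replacing.
    annotate={"hashtag", "allcaps", "elongated", "repeated",
              "emphasis", "censored"},
    fix_html=True,
    segmenter="twitter",        # word segmentation stats learned from Twitter
    corrector="twitter",        # spell correction stats learned from Twitter
    unpack_hashtags=True,       # #FunFacts -> <hashtag> fun facts </hashtag>
    unpack_contractions=True,   # can't -> can not
    spell_correct_elong=False,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts=[emoticons]           # replace emoticons with labels, e.g. :) -> <happy>
)

tokens = text_processor.pre_process_doc("CANT WAIT for #TwinPeaks!!! :)))")
```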

3.3 Baselines

In this section we describe the official baselines used by the organizers for the Topic-based SA task, and we report the results of these baselines. We also describe a Convolutional Neural Network (CNN), based on the model of Kim (2014), which we re-implemented to serve as a strong baseline. Finally, we describe a simple bidirectional LSTM model, which is often used as a baseline. Although we did not use this model as a baseline, its description will help in understanding our Att-BiLSTM+WL model, described in a following section.

The official baselines for the Topic-based SA task, for both subtasks, are majority-class classifiers. A majority-class classifier simply labels every instance with the majority class (in the training data) of the corresponding target. As discussed by Rosenthal et al. (2017), "the accuracy of the majority-class classifier is the relative frequency of the majority class, which can be much higher than 0.5 if the test set is imbalanced". For subtask B the two baseline scores are Baseline score 1 (B1) for the positive class and B2 for the negative class. Respectively, for subtask C: Baseline score 1 (B1) for the highly negative class, B2 for the negative class, B3 for the neutral class, B4 for the positive class and B5 for the highly positive class. The baseline scores are those reported in Rosenthal et al. (2017).
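A majority-class baseline of this kind can be sketched in a few lines of Python; this is a minimal illustration, not the organizers' implementation:

```python
from collections import Counter

# Label every test instance with the most frequent class in the training data.
def majority_baseline(train_labels, test_instances):
    majority_class = Counter(train_labels).most_common(1)[0][0]
    return [majority_class] * len(test_instances)

# Example: a training set dominated by "negative" labels.
preds = majority_baseline(["negative", "negative", "positive"], range(5))
```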

3.3.1 Embedding Layer

A word embedding (Collobert and Weston 2008), (Tang et al. 2014) is a type of word representation generated by a model trained on a text corpus, usually with a language modeling objective; additionally, a text classification objective may be added. Word representations capture semantic and syntactic information about the words. Words are projected into a vector space $\mathbb{R}^E$, where $E$ is the dimensionality of the embedding layer. The position of a word within the vector space is calculated based on the words that surround it in the text. By retraining word embeddings on data that reflect the sentiment polarity of words, we can use them for sentiment analysis tasks (Maas et al. 2011), (Tang et al. 2014). In a sentiment analysis task, the words are then distributed in the space according to their sentiment orientation.

At the embedding layer, every message is first turned into a sequence of ids whose length equals the maximum message length in the dataset (padding to that length if necessary), where every position $x_1, x_2, \ldots, x_T$ holds an id that points to a word embedding. For instance, the word "hate" in a message is turned into "65536", which identifies a word embedding of size $1 \times E$, where $E$ is the dimensionality of the embedding layer.
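As a minimal PyTorch sketch of this lookup (the vocabulary size, maximum length and ids below are hypothetical; id 0 is reserved for padding):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, max_len = 100000, 200, 50
embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)

# A 3-word message as ids, zero-padded to the maximum message length.
ids = torch.tensor([[65536, 42, 7] + [0] * (max_len - 3)])
vectors = embedding(ids)  # shape: (1, max_len, emb_dim)
```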

3.3.2 Convolutional neural network baseline

We built a Convolutional Neural Network (LeCun et al. 1989) model based on the work of Kim on sentence classification (Kim 2014). Here we describe the CNN model we re-implemented as a strong baseline. In the CNN model, we first pass the words of the message $X^{tw} = (x_1^{tw}, x_2^{tw}, \ldots, x_{T_{tw}}^{tw})$ and the words of the topic $X^{to} = (x_1^{to}, x_2^{to}, \ldots, x_{T_{to}}^{to})$ through an embedding layer, as in Section 3.3.1. On the embedding layer we add noise and apply dropout (Gal and Ghahramani 2016) to prevent overfitting (Section 3.4.3 provides more information on noise and dropout). Second, we concatenate the word vectors of the message with the word vectors of the topic to create a single sentence matrix representation $A \in \mathbb{R}^{s \times E}$, where $s$ is the length of the concatenated sentences and $E$ the dimensionality of the embedding layer; we write $A[i:j]$ for the sub-matrix of $A$ from row $i$ to row $j$. We apply a convolution operation using multiple filters $w \in \mathbb{R}^{h \times E}$, where $h$ is the window size of the filter. The filters we use have different window sizes $h$, in effect considering n-grams of different lengths. The output $c \in \mathbb{R}^{s-h+1}$ of the convolution is produced by repeatedly applying a filter to the sub-matrices of $A$:

$c_i = f(w \cdot A[i : i + h - 1] + b), \quad 1 \le i \le s - h + 1$

$c = [c_1, c_2, \ldots, c_{s-h+1}]$

where $f$ is an activation function, $\cdot$ is the dot product between the sub-matrix and the filter, and $b \in \mathbb{R}$ is a bias term. Each filter is applied across the sentence matrix $A$ successively and produces a feature map.

Next, we apply max-over-time pooling (Collobert et al. 2011) over the different feature maps to get the maximum value $\tilde{c} = \max\{c_1, c_2, \ldots, c_{s-h+1}\}$ from each one; in each feature map, the highest value corresponds to the most important feature. After applying dropout, the max features obtained from all filters are passed through a fully connected softmax layer, as shown in Figure 1, which produces a probability distribution over all possible target classes. For optimization we used the Adam optimizer (Kingma and Ba 2014); Section 3.4.4 provides more information on the optimization method used in both models. For the number of kernels (filters) and the kernel sizes, we used the corresponding hyperparameter values of Kim (2014). For the dropout rate and noise parameters we used the following values.

CNN model hyperparameters for both subtasks:

• Embedding Layer size: 200.

• Kernel feature maps: 100

• Kernel sizes: {3, 4, 5}

• Gaussian Noise for Embedding Layer: σ = 0.5.

• Embedding Layer dropout factor: 0.5.


• Optimizer learning rate: 0.001.

• Gradient Norm clipping: 1.

• Mini-Batches of size: 128.
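The following PyTorch sketch shows one way to assemble a Kim-style CNN with the hyperparameters listed above. It is a simplified illustration of our baseline, not a verbatim copy of it; in particular, the ReLU activation and the exact placement of the noise are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    """Sketch of the CNN baseline: kernel sizes {3, 4, 5}, 100 feature maps
    per size, dropout 0.5, Gaussian noise on the embeddings during training."""
    def __init__(self, vocab_size, emb_dim=200, n_filters=100,
                 kernel_sizes=(3, 4, 5), n_classes=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, h) for h in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, msg_ids, topic_ids):
        # Concatenate message and topic word vectors into one "sentence" A.
        x = torch.cat([self.embedding(msg_ids), self.embedding(topic_ids)], dim=1)
        if self.training:                       # Gaussian noise, sigma = 0.5
            x = x + 0.5 * torch.randn_like(x)
        x = self.dropout(x).transpose(1, 2)     # (batch, emb_dim, seq_len)
        # One feature map per filter; max-over-time pooling on each.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        # Logits; the softmax is applied inside the cross-entropy loss.
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))
```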

Figure 1. CNN model with 3 filters.

3.3.3 Bidirectional LSTM baseline

Long Short-Term Memory (LSTM) is a variation of recurrent neural networks, capable of learning long-term dependencies, that was proposed by Hochreiter and Schmidhuber (1997) as a solution to vanishing gradients. An LSTM allows recurrent nets to continue learning over long sequences by preserving the error through time during backpropagation, thus creating a memory mechanism. A standard LSTM network is composed of memory blocks known as cells. Each cell can be trained to decide which information from the sequence should be saved, propagated to the output, or discarded. The LSTM block consists of:

- The forget gate $f_t$, responsible for discarding information from the cell state that is not required or has no importance for understanding the data.


- An input gate $i_t$, for adding information to the cell state.

- The output gate $o_t$, which is responsible for selecting the information from the current cell state that is useful to output.

The LSTM block is described by the following set of equations, where σ stands for a sigmoid activation:

Gates:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

Cell state update:

$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$

Hidden state (output):

$h_t = o_t \circ \tanh(c_t)$

In order to extract more information from the input text sequence, we make two passes over it: one from left to right, i.e., from $x_1$ to $x_T$, and one from right to left, i.e., from $x_T$ to $x_1$. We concatenate the hidden states of the forward LSTM $\overrightarrow{f}$ with those of the backward LSTM $\overleftarrow{f}$ into a single representation:

$h_i = \overrightarrow{h_i} \parallel \overleftarrow{h_i}, \quad h_i \in \mathbb{R}^{L}$

where $L$ is the dimensionality of the (bidirectional) LSTM output.
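In PyTorch, a bidirectional LSTM of this kind can be sketched as follows; the library already returns the forward and backward hidden states concatenated, and the sizes below match the hyperparameters reported in Section 3.4.5:

```python
import torch
import torch.nn as nn

# A 2-layer bidirectional LSTM over embedded tokens.
emb_dim, hidden_size = 200, 150
bilstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_size,
                 num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(1, 50, emb_dim)  # (batch, seq_len, emb_dim), dummy embeddings
h, _ = bilstm(x)                 # h: (1, 50, 2 * hidden_size) = forward || backward
```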

3.4 Att-BiLSTM+WL model

For topic-based sentiment analysis we re-implemented the model of Baziotis et al. (2017), a deep learning model using a bidirectional LSTM recurrent neural network with an attention mechanism and a weighted loss (Att-BiLSTM+WL). In contrast to the model of Baziotis et al., we did not use a Maxout layer. Given a message and the corresponding topic, the Att-BiLSTM+WL model can classify the tweet on a binary sentiment polarity scale (positive, negative) and on a five-point sentiment polarity scale towards the topic (highly positive, positive, neutral, negative, highly negative). The model architecture includes an embedding layer, a pair of two-layer bidirectional LSTMs with shared weights, and an attention layer. We used 200-dimensional GloVe word embeddings (Pennington, Socher, and Manning 2014), pre-trained on Twitter. The output of the embedding layer, which is of size (max length × 200), is used as input to the bidirectional LSTM.

Figure 2. The Att-BiLSTM+WL model: A 2-layer bidirectional LSTM with attention.

Given that the task at hand provides both the message and the corresponding topic, we use two inputs for the Bi-LSTM, as shown in Figure 2. After passing the words of the message $X^{tw} = (x_1^{tw}, x_2^{tw}, \ldots, x_{T_{tw}}^{tw})$ and the words of the topic $X^{to} = (x_1^{to}, x_2^{to}, \ldots, x_{T_{to}}^{to})$ through the embedding layer, we feed them as two inputs to the Bi-LSTM with shared weights. The Bi-LSTM then produces the hidden representations of the message $H^{tw} = (h_1^{tw}, h_2^{tw}, \ldots, h_{T_{tw}}^{tw})$ and of the topic $H^{to} = (h_1^{to}, h_2^{to}, \ldots, h_{T_{to}}^{to})$.


In the next sections, we describe the layers of the Att-BiLSTM+WL model that follow on from the output of the bidirectional LSTM and the important procedures of model regularization and optimization. We also report the hyperparameters of the Att-BiLSTM+WL model.

3.4.1 Mean Pooling and Annotation

At the mean pooling layer, in order to produce the vector of the topic $\bar{h}^{to}$, we compute the mean over time of all the hidden states of $H^{to} = (h_1^{to}, h_2^{to}, \ldots, h_{T_{to}}^{to})$:

$\bar{h}^{to} = \frac{1}{T_{to}} \sum_{i=1}^{T_{to}} h_i^{to}$

We then take this single vector annotation $\bar{h}^{to}$ and concatenate it with each of the hidden states of $H^{tw}$:

$h_i = h_i^{tw} \parallel \bar{h}^{to}, \quad h_i \in \mathbb{R}^{2L}$

where $L$ is the dimension of the LSTM output. This way we create vectors that incorporate the information of the topic together with each word of the message.

3.4.2 Context Attention Layer

In any text, some words carry more emotional weight than others; these are the words that shift the polarity of the message. In order to find the most important words, we use an attention mechanism as in Seo et al. (2016). This is the attention layer shown in Figure 2. The mechanism calculates a weight for each word, which we use to weigh the word's participation in the final representation of the message. The word representation $h_i$ is passed through a dense layer with $\tanh$ activation, and the resulting vector is multiplied with a context vector $u_h$ (learned during backpropagation) to obtain the attention score $a_i$ of $h_i$. A softmax is applied to the attention scores to make them sum to 1. The final representation $r$ of the message-topic concatenation is the sum of all the word annotations $h_i$, weighted by the attention scores. Specifically:

$e_i = \tanh(W_h h_i + b_h), \quad e_i \in [-1, 1]$

$a_i = \frac{\exp(e_i^{\top} u_h)}{\sum_{t=1}^{T} \exp(e_t^{\top} u_h)}, \quad \sum_{i=1}^{T} a_i = 1$

$r = \sum_{i=1}^{T} a_i h_i, \quad r \in \mathbb{R}^{2L}$

where $W_h$, $b_h$ and $u_h$ are learned during training.
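A minimal PyTorch sketch of this context attention layer, following the equations above (the initialization of the context vector is a simplified assumption):

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """e_i = tanh(W_h h_i + b_h); a_i = softmax(e_i^T u_h); r = sum_i a_i h_i."""
    def __init__(self, dim):
        super().__init__()
        self.dense = nn.Linear(dim, dim)           # W_h and b_h
        self.u_h = nn.Parameter(torch.randn(dim))  # context vector, learned

    def forward(self, h):                  # h: (batch, seq_len, dim)
        e = torch.tanh(self.dense(h))      # (batch, seq_len, dim)
        scores = e @ self.u_h              # (batch, seq_len)
        a = torch.softmax(scores, dim=1)   # attention weights, sum to 1
        return (a.unsqueeze(-1) * h).sum(dim=1)  # r: (batch, dim)
```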

3.4.3 Regularization

Before training, we apply regularization to the Att-BiLSTM+WL model. Regularization is key to controlling and fine-tuning the model's complexity. If the model is too complex, it will overfit the training data, which leads to misclassifying test data. On the other hand, if the model is too simple, it will have low training accuracy; this is called underfitting. We apply two regularization methods. First, we apply Gaussian noise (Rasmussen 2004) to the embedding layer, as a dataset augmentation method. Second, we apply dropout to the embedding layer to prevent overfitting (Geoffrey E. Hinton et al. 2012); dropout is a technique that prevents units from co-adapting excessively to the data, by randomly dropping units from the neural network during training. By applying dropout both to the embedding layer and to the RNN, as in (Gal and Ghahramani 2016), we expect the Att-BiLSTM+WL model to learn more robust features and improve its performance.

3.4.4 Optimization

We trained our neural network model on 90% of the training dataset and kept the remaining 10% for validation. All the models we compare used the same training and validation data. We used the cross-entropy loss function (Golik, Doetsch, and Ney 2013). To optimize the model's performance we used the Adam optimizer (Kingma and Ba 2014), which is an extension of Stochastic Gradient Descent. Stochastic Gradient Descent (SGD) is an optimization algorithm used to update the weights of the network in order to minimize the cost.

Cross-Entropy Loss:

$E(y, \hat{y}) = -\sum_{i} \hat{y}_i \log(y_i)$

where $y_i$ is the predicted probability of class $i$ and $\hat{y}_i$ is the true label of the class, which is 0 or 1.

During training, the weights of the network are updated using the following equation:

$W = W - \alpha \nabla J(W, b)$


Stochastic Gradient Descent (with mini-batches of size $m$):

$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} J(W, b, x^{(i)}, y^{(i)})$

where $W$ are the weights, $\alpha$ is the learning rate, $\nabla J(W, b)$ is the gradient of the cost function with respect to the weights, and $m$ is the number of training instances ($x^{(i)}$, $y^{(i)}$) in a mini-batch. In our task we used mini-batch gradient descent, where the gradient is averaged over a small number of samples from the dataset, called a batch. We trained the model using mini-batches of size 128.

In addition to the classic methods for updating the network's weights, more sophisticated alternatives have been proposed in recent years, such as AdaGrad (Duchi, Bartlett, and Wainwright 2012), AdaDelta (Zeiler 2012) and Adam (Kingma and Ba 2014), which was employed in this work. These methods automatically adjust the learning rates during training. Adam is an optimization algorithm introduced as a variation of SGD which, as described by its authors, can handle sparse gradients and noisy data. In detail, Adam offers a combination of AdaGrad and the RMSProp optimizer (Geoffrey E. Hinton, Srivastava, and Swersky 2012).

Also, during training we performed gradient clipping, to mitigate the problem of exploding gradients (Pascanu, Mikolov, and Bengio 2012).
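Putting the pieces together, one training step with Adam, cross-entropy and gradient norm clipping might look as follows in PyTorch. This is a sketch: `model` and `train_loader` are assumed to be defined elsewhere (e.g., the CNN sketch above and a standard DataLoader yielding mini-batches of size 128), and the hyperparameter values match those reported in this chapter:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for msg_ids, topic_ids, labels in train_loader:
    optimizer.zero_grad()
    logits = model(msg_ids, topic_ids)   # raw scores; softmax is inside the loss
    loss = criterion(logits, labels)
    loss.backward()
    # Clip the global gradient norm to 1, as reported in the hyperparameters.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```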

3.4.5 Hyper-parameters

Training a model that achieves the best scores requires suitable hyper-parameter tuning. For the Att-BiLSTM+WL model we manually tuned the hyperparameters based on previous work (Baziotis et al. 2017), which uses a combination of Random Search (Strub et al. 2017) and Bayesian Optimization (Snoek, Larochelle, and Adams 2016). For the dropout rate and noise parameters, we adopted the defaults of the implementations we used.

Att-BiLSTM+WL model hyperparameters for both subtasks:

• Embedding Layer size: 200.

• RNN Hidden Layer size: 150.

• Gaussian Noise for Embedding Layer: σ = 0.5.


• Embedding Layer Dropout factor: 0.5.

• RNN dropout factor: 0.5.

• Optimizer learning rate: 0.001.

• Gradient Norm clipping: 1.

• Mini-Batches of size: 128.

3.5 Class Balancing

A very common problem in machine learning classification tasks is that some classes have many more instances than others; this phenomenon is called class imbalance. For instance, in subtask C the model is more likely to classify a message as neutral, due to the large number of instances of that class. In order to cope with this problem, we add weights to the loss function, to penalize more heavily the misclassification of instances of the under-represented classes. This way, the loss is higher for under-represented classes and lower for over-represented ones. To compute the weight of each class we use the following equation¹⁰:

$w_i = \frac{n}{k \, n_i}$

where $w_i$ is the weight of class $i$, $n$ is the total number of observations, $n_i$ is the number of observations of class $i$, and $k$ is the total number of classes.

The weighted loss function is described by the following formula:

$E(y, \hat{y}) = -\sum_{i} w_i \hat{y}_i \log(y_i)$

where $w_i$ is the weight assigned to each class, $y_i$ is the predicted probability of class $i$, and $\hat{y}_i$ is the true label of the class, which is 0 or 1. Section 4.4 presents the results of class balancing via the weighted loss function.

¹⁰ http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
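As a sketch, the class weights can be computed with the scikit-learn utility referenced in the footnote and passed to a weighted cross-entropy loss in PyTorch; the toy labels below are for illustration only:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

train_labels = np.array([0, 0, 0, 1, 1, 2])  # toy labels for illustration
# "balanced" computes w_i = n / (k * n_i), matching the equation above.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(train_labels),
                               y=train_labels)
criterion = nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float))
```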


4. Experiments

4.1 Evaluation Measures

Here we present the evaluation measures for Sentiment Analysis in Twitter, as defined in SemEval 2017 Task 4 (Rosenthal et al. 2017).

For Subtask B the evaluation measures are accuracy ($Acc$), macro-averaged recall ($R_{macro}$) and macro-averaged $F_1$ ($F_{1macro}$).

                      Predicted P            Predicted N
Actual P         True Positives (TP)    False Negatives (FN)
Actual N         False Positives (FP)   True Negatives (TN)

Table 4. Confusion Matrix.

Accuracy is the ratio of correctly predicted instances, where an instance is a Twitter text message. Table 4 presents the confusion matrix for the positive class of a binary classifier, such as the one used in subtask B; a corresponding confusion matrix for the negative class also applies.

In more detail:

True Positives (TP): instances predicted as positive that were actually positive.

True Negatives (TN): instances predicted as negative that were actually negative.

False Positives (FP): instances predicted as positive that were actually negative.

False Negatives (FN): instances predicted as negative that were actually positive.

$Acc = \frac{TP + TN}{TP + TN + FP + FN}$


$R_{macro}$ is the macro-averaged recall over the two classes (positive, negative):

$R_P = \frac{TP}{TP + FN}$

$R_N = \frac{TN}{TN + FP}$

$R_{macro} = \frac{1}{2}(R_P + R_N)$

$F_{1macro}$ is the $F_1$ score macro-averaged over the positive and negative classes. To calculate $F_{1P}$ for the positive class, we need the corresponding precision $P_P$, where $P_P$ is the ratio of messages predicted positive that were actually positive:

$P_P = \frac{TP}{TP + FP}$

$F_{1P} = \frac{2 \, P_P \, R_P}{P_P + R_P}$

Likewise, we calculate $F_{1N}$:

$F_{1N} = \frac{2 \, P_N \, R_N}{P_N + R_N}$

and finally:

$F_{1macro} = \frac{1}{2}(F_{1P} + F_{1N})$

For Subtask C the predictions are on a five-point scale, $C = \{-2, -1, 0, +1, +2\}$, where $-2$ stands for highly negative, $-1$ for negative, $0$ for neutral, $+1$ for positive and $+2$ for highly positive. Classifying a message as positive ($+1$) produces a bigger error if the actual class is highly negative ($-2$) than if it is negative ($-1$). Therefore, in Subtask C the evaluation measures are the macro-averaged mean absolute error ($MAE^M$) and the micro-averaged mean absolute error ($MAE^{\mu}$). Lower values are better.

$MAE^M(h, Te) = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{1}{|Te_j|} \sum_{x_i \in Te_j} |h(x_i) - y_i|$

$MAE^{\mu}(h, Te) = \frac{1}{|Te|} \sum_{x_i \in Te} |h(x_i) - y_i|$

where $y_i$ is the actual class of $x_i$, $h(x_i)$ is the predicted class of $x_i$, and $|h(x_i) - y_i|$ is the absolute error between the predicted and the actual class. $Te_j$ is the set of all observations in the test set $Te$ whose actual class is $c_j$.

$MAE^{\mu}$ is the mean absolute error over all the observations in the test set, while $MAE^M$ is the mean of the per-class errors. $MAE^M$ is a better measure than $MAE^{\mu}$ for subtask C, since it takes into account the class imbalance of the observations.
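For clarity, the two measures can be computed as in the following sketch, assuming predictions and gold labels on the five-point scale, and that every class appears at least once in the test set:

```python
import numpy as np

def mae_micro(y_true, y_pred):
    """Micro-averaged MAE: mean absolute error over all test observations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.abs(y_pred - y_true).mean())

def mae_macro(y_true, y_pred, classes=(-2, -1, 0, 1, 2)):
    """Macro-averaged MAE: per-class mean absolute error, averaged over
    the classes, so every class counts equally regardless of its size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.abs(y_pred[y_true == c] - y_true[y_true == c]).mean()
                 for c in classes]
    return float(np.mean(per_class))
```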

4.2 Experimental Setup

To implement the models we used PyTorch¹¹ with Theano (Al-Rfou et al. 2016) as backend. We trained the neural networks on a GTX 1060 (6GB) GPU. We provide the source code of the Att-BiLSTM+WL and the CNN model through GitHub¹².

4.3 Evaluation at Validation

Before presenting the scores that the Att-BiLSTM+WL model achieved in the task, we explain how we worked during training and the decisions we made based on evaluations over the validation data. In Subtask B we trained the model for 30 epochs, evaluating its progress by measuring accuracy on the validation part of the dataset at each epoch. We decided to use accuracy as our main evaluation metric; hence, we stopped training after 17 epochs, the point where the model achieved optimal accuracy (as shown in Figure 3). We evaluated and compared the Att-BiLSTM+WL model with the CNN model over 30 epochs and observed (Figure 4) that the CNN achieves its best performance after the 25th epoch, but still achieves worse accuracy than the Att-BiLSTM+WL at its best epoch (17th), although the difference is small and the performance of the CNN is more stable. There is no need to train beyond 30 epochs, because the validation metrics decline for both models from that point on. Likewise, for Subtask C we trained the model for 18 epochs, as this was the lowest point of $MAE^M$ during validation (Figure 5). We chose to fine-tune the Att-BiLSTM+WL model with respect to the macro-averaged $MAE$ only, but in future work we would like to study a more balanced performance across both the micro- and the macro-averaged $MAE$. Here the Att-BiLSTM+WL performs significantly better than the CNN model (Figure 6).

¹¹ https://pytorch.org/
¹² https://github.com/kkorovesis/Att-BiLSTM-WL

Figure 3. Att-BiLSTM+WL Accuracy in 30 epochs, subtask B.

Figure 4. Validation Accuracy of Att-BiLSTM+WL and CNN model, subtask B.


Figure 5. Att-BiLSTM+WL macro-MAE in epochs, subtask C (lower is better).

Figure 6. Validation macro-MAE of Att-BiLSTM+WL and CNN model, subtask C (lower is better).


4.4 Evaluation at Test

For Subtask B, the Att-BiLSTM+WL model outperforms the CNN model in accuracy and $F_{1macro}$ on the test set (Table 5). Both models score very high compared to the SemEval 2017 baselines (Rosenthal et al. 2017). In addition, for Subtask C the Att-BiLSTM+WL model achieved very high scores on the test set (see Table 6). Based on $MAE^M$ (the main ordinal classification measure, as described in Section 4.1), the Att-BiLSTM+WL outperforms the CNN model and all the baselines. We employed the weighted loss function (see Section 3.5) on the BiLSTM model to study its effect in both subtasks. In Subtask B, it improved the accuracy and $F_{1macro}$ scores, but the $R_{macro}$ score decreased significantly (see Table 7, where Att-BiLSTM denotes the model without the weighted loss function). Here the model misclassified more true positives (TP) and thus $R_{macro}$ is lower: due to the weighted loss function, the model is more accurate on the class with the smaller coverage but fails more often on the class with the bigger coverage. In Subtask C, employing WL improved the $MAE^M$ and $R_{macro}$ scores. This was expected, due to the considerable class imbalance of the dataset (see Table 8). On the other hand, the Att-BiLSTM with WL scores lower in accuracy and $MAE^{\mu}$, because these metrics do not consider the class ratio. We observe that, despite the weighted loss function, $F_{1macro}$ is lower. Based on the previous observations, we expected all the macro-averaged scores to be higher when employing WL. We assume that in this subtask the model has lower precision on specific classes, because of the class weights $w_i$, and that this affects $F_{1macro}$. Verifying this assumption requires more experiments and error analysis, which we leave for future work.

System               Acc     F1macro  Rmacro
Att-BiLSTM+WL        0.861   0.854    0.808
CNN                  0.771   0.769    0.862
B1 (all POSITIVE)    0.398   0.285    0.500
B2 (all NEGATIVE)    0.602   0.376    0.500

Table 5. Scores in Subtask B (higher is better), Bx indicates a baseline.


System                     MAE^M   MAE^μ
Att-BiLSTM+WL              0.585   0.898
CNN                        0.829   1.167
B1 (all HIGHLY NEGATIVE)   2.000   1.895
B2 (all NEGATIVE)          1.400   0.923
B3 (all NEUTRAL)           1.200   0.525
B4 (all POSITIVE)          1.400   1.127
B5 (all HIGHLY POSITIVE)   2.000   2.105

Table 6. Scores in Subtask C (lower is better), Bx indicates a baseline.

System          Acc     F1macro  Rmacro
Att-BiLSTM+WL   0.861   0.854    0.808
Att-BiLSTM      0.843   0.841    0.892
CNN             0.771   0.769    0.862

Table 7. The effect of weighted loss (WL) in Subtask B (higher is better).

System          MAE^M   MAE^μ   Acc     F1macro  Rmacro
Att-BiLSTM+WL   0.585   0.898   0.423   0.319    0.555
Att-BiLSTM      0.608   0.813   0.494   0.373    0.508
CNN             0.829   1.167   0.236   0.201    0.457

Table 8. The effect of weighted loss (WL) in Subtask C (for MAE lower is better).

We note that these results could improve even further with better fine-tuning.¹³

¹³ The fine-tuned model described in (Baziotis et al. 2017) achieved slightly better results, i.e., 0.8971 Acc, 0.8901 $F_{1macro}$ and 0.8821 $R_{macro}$ in subtask B, and 0.5552 $MAE^M$ and 0.5434 $MAE^{\mu}$ in subtask C.

5. Conclusions and Future Work

5.1 Conclusions

In this thesis we built a neural network model (called Att-BiLSTM+WL) for topic-based sentiment classification of messages in Twitter. We re-implemented in PyTorch a Bi-LSTM model based on the work of Baziotis et al. (2017) for a sentiment analysis task that consists of two subtasks, and we released our implementation for public use. We outperformed the baselines provided by SemEval 2017, while scoring high results in both subtasks: we obtained a test accuracy of 0.861 in subtask B, and in subtask C we reduced the macro-averaged mean absolute error on the test data to 0.585. In addition, we built and trained a CNN model (Kim 2014) and compared the results obtained from both models. The Att-BiLSTM+WL performs slightly better than the CNN model in subtask B and much better in subtask C (see Tables 5 and 6). We added a weighted loss to the Att-BiLSTM, leading to the Att-BiLSTM+WL, and studied its effect in both subtasks. Although this is not a novel addition, we evaluated this component, reporting the margin by which it improves the model.

5.2 Future work

Further research should deliver even better results on similar tasks; other types of neural networks, such as CNNs, also achieve very good results, alone or in combination with LSTMs (Cliche 2017). We did not extensively tune the hyper-parameters of our models; in most cases, we used defaults or hyper-parameter values from previous work. Hence, further improvements may be possible with hyper-parameter tuning, for example using Bayesian Optimization (Snoek et al. 2016). In the Att-BiLSTM+WL model we "froze" the embeddings, not letting their weights be updated during training. As a next step, we intend to study trainable word embeddings, in order to examine whether better, domain-adapted word representations can improve the models. Therefore, future work consists of:

- Extensive fine tuning.

- Trainable embeddings.


Bibliography

Al-Rfou, Rami et al. 2016. “Theano: A Python Framework for Fast Computation of Mathematical Expressions.” CoRR abs/1605.0.

Appel, Orestes, Francisco Chiclana, Jenny Carter, and Hamido Fujita. 2016. “A Hybrid Approach to the Sentiment Analysis Problem at the Sentence Level.” Knowledge-Based Systems 108:110–24.

Arias, Marta, Argimiro Arratia, and Ramon Xuriguera. 2013. “Forecasting with Twitter Data.” ACM Transactions on Intelligent Systems and Technology 5(1):1–24.

Asur, Sitaram and Bernardo A. Huberman. 2010. “Predicting the Future with Social Media.” Pp. 492–99 in 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT ’10. Washington, DC, USA: IEEE Computer Society.

Baziotis, Christos, Nikos Pelekis, and Christos Doulkeridis. 2017. “DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-Level and Topic-Based Sentiment Analysis.” Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (1):747–54.

Bollen, Johan, Huina Mao, and Xiaojun Zeng. 2011. “Twitter Mood Predicts the Stock Market.” Journal of Computational Science 2(1):1–8.

Ceron, Andrea, Luigi Curini, and M. Stefano. 2012. “Tweet Your Vote: How Content Analysis of Social Networks Can Improve Our Knowledge of Citizens’ Policy Preferences. An Application to Italy and France.” New Media & Society 16:1–24.

Chevalier, Judith A. and Dina Mayzlin. 2006. “The Effect of Word of Mouth on Sales: Online Book Reviews.” Journal of Marketing Research 43(3):345–54.

Collobert, Ronan et al. 2011. “Natural Language Processing (Almost) from Scratch.” CoRR abs/1103.0398.

Collobert, Ronan and Jason Weston. 2008. “A Unified Architecture for Natural Language Processing.” Pp. 160–67 in Proceedings of the 25th international conference on Machine learning - ICML ’08, ICML ’08. New York, NY, USA: ACM.

Culotta, Aron. 2010. “Detecting Influenza Outbreaks by Analyzing Twitter Messages.” Pp. 115–22 in Proceedings of the First Workshop on Social Media Analytics, SOMA ’10. New York, NY, USA: ACM.

Diakopoulos, Nicholas A. and David A. Shamma. 2010. “Characterizing Debate Performance via Aggregated Twitter Sentiment.” P. 1195 in Proceedings of the 28th international conference on Human factors in computing systems - CHI ’10, CHI ’10. New York, NY, USA: ACM.

Duchi, John C., Peter L. Bartlett, and Martin J. Wainwright. 2012. “Randomized Smoothing for (Parallel) Stochastic Optimization.” Proceedings of the IEEE Conference on Decision and Control 12:5442–44.

Gal, Yarin and Zoubin Ghahramani. 2016. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” Pp. 1027–35 in Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16. USA: Curran Associates Inc.

Gilbert, Eric and Karrie Karahalios. 2010. “Widespread Worry and the Stock Market.” Proceedings of the 4th International AAAI Conference on Weblogs and Social Media 58–65.

Gimpel, Kevin et al. 2011. “Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments.” Pp. 42–47 in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT ’11. Stroudsburg, PA, USA: Association for Computational Linguistics.

Goldberg, Yoav and Graeme Hirst. 2017. Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers.

Golik, Pavel, Patrick Doetsch, and Hermann Ney. 2013. “Cross-Entropy vs. Squared Error Training: A Theoretical and Experimental Comparison.” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 1756–60.

Han, Bing. 2008. “Investor Sentiment and Option Prices.” Review of Financial Studies 21(1):387–414.

Hansen, Lars Kai, Adam Arvidsson, Finn Aarup Nielsen, Elanor Colleoni, and Michael Etter. 2011. “Good Friends, Bad News - Affect and Virality in Twitter.” Communications in Computer and Information Science 185(Part 2):34–43.

Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” CoRR abs/1207.0580.

Hinton, Geoffrey E., Nitish Srivastava, and Kevin Swersky. 2012. “Lecture 6e: RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude.” COURSERA: Neural Networks for Machine Learning 26–31.

Ho, Tin Kam. 1995. “Random Decision Forests.” Pp. 278–82 in Proceedings of the 3rd International Conference on Document Analysis and Recognition. Vol. 1.

Hochreiter, Sepp and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9(8):1735–80.

Hu, Minqing and Bing Liu. 2004a. “Mining and Summarizing Customer Reviews.” Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’04 168.

Hu, Minqing and Bing Liu. 2004b. “Mining and Summarizing Customer Reviews.” P. 168 in Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04, KDD ’04. New York, NY, USA: ACM.

Jansen, Bernard J., Mimi Zhang, Kate Sobel, and Abdur Chowdury. 2009. “Twitter Power: Tweets as Electronic Word of Mouth.” Journal of the American Society for Information Science and Technology 60(11):2169–88.

Kim, Yoon. 2014. “Convolutional Neural Networks for Sentence Classification.” CoRR abs/1408.5882.

Kingma, Diederik P. and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” CoRR abs/1412.6980.

LeCun, Y. et al. 1989. “Backpropagation Applied to Handwritten Zip Code Recognition.” Neural Computation 1(4):541–51.

Lemmon, Michael and Evgenia Portniaguina. 2006. “Consumer Confidence and Asset Prices: Some Empirical Evidence.” Review of Financial Studies 19(4):1499–1529.

Liu, Yang, Xiangji Huang, Aijun An, and Xiaohui Yu. 2007. “ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs.” P. 607 in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’07, SIGIR ’07. New York, NY, USA: ACM.


Maas, Andrew L. et al. 2011. “Learning Word Vectors for Sentiment Analysis.” Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies 142–50.

Makrehchi, Masoud, Sameena Shah, and Wenhui Liao. 2013. “Stock Prediction Using Event-Based Sentiment Analysis.” Pp. 337–42 in Proceedings - 2013 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2013. Vol. 1, WI-IAT ’13. Washington, DC, USA: IEEE Computer Society.

Martinez-Arroyo, M. and L. E. Sucar. 2006. “Learning an Optimal Naive Bayes Classifier.” P. 958 in 18th International Conference on Pattern Recognition (ICPR’06). Vol. 4.

Mishne, Gilad and Natalie Glance. 2006. “Predicting Movie Sales from Blogger Sentiment.” AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs 155–58.

Mitchell, Lewis, Morgan R. Frank, Kameron Decker Harris, Peter Sheridan Dodds, and Christopher M. Danforth. 2013. “The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place.” PLoS ONE 8(5):1–15.

O’Connor, Brendan, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. “From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series.” International AAAI Conference on Weblogs and Social Media 11.

Oh, Chong and Olivia Sheng. 2011. “Investigating Predictive Power of Stock Micro Blog Sentiment in Forecasting Future Stock Price Directional Movement.” in ICIS.

Pang, Bo and Lillian Lee. 2005. “Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales.” CoRR abs/cs/0506075.

Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. 2002. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - EMNLP ’02 10:79–86.

Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. 2012. “On the Difficulty of Training Recurrent Neural Networks.” Pp. III-1310–III-1318 in Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13. JMLR.org.


Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” EMNLP 14:1532–43.

Popescu, Ana-Maria and Oren Etzioni. 2005. “Extracting Product Features and Opinions from Reviews.” Pp. 339–46 in Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver, British Columbia, HLT ’05. Stroudsburg, PA, USA: Association for Computational Linguistics.

Rasmussen, Carl Edward. 2004. “Gaussian Processes in Machine Learning.” Pp. 63–71 in Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2 - 14, 2003, Tübingen, Germany, August 4 - 16, 2003, Revised Lectures, edited by O. Bousquet, U. von Luxburg, and G. Rätsch. Berlin, Heidelberg: Springer Berlin Heidelberg.

Rosenthal, Sara, Noura Farra, and Preslav Nakov. 2017. “SemEval-2017 Task 4: Sentiment Analysis in Twitter.” Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) 502–18.

Sakaki, Takeshi, Makoto Okazaki, and Yutaka Matsuo. 2010. “Earthquake Shakes Twitter Users.” P. 851 in Proceedings of the 19th international conference on World wide web - WWW ’10, WWW ’10. New York, NY, USA: ACM.

Seo, Paul Hongsuck, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han. 2016. “Progressive Attention Networks for Visual Attribute Prediction.” CoRR abs/1606.02393.

Si, Jianfeng et al. 2013. “Exploiting Topic Based Twitter Sentiment for Stock Prediction.” Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 24–29.

Smailović, Jasmina, Miha Grčar, Nada Lavrač, and Martin Žnidaršič. 2013. “Predictive Sentiment Analysis of Tweets: A Stock Market Application.” Lecture Notes in Computer Science 7947:77–88.

Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” Pp. 2951–59 in Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc.


Sprenger, Timm O., Andranik Tumasjan, Philipp G. Sandner, and Isabell M. Welpe. 2014. “Tweets and Trades: The Information Content of Stock Microblogs.” European Financial Management 20(5):926–57.

Stieglitz, Stefan and Linh Dang-Xuan. 2012. “Political Communication and Influence through Microblogging - An Empirical Analysis of Sentiment in Twitter Messages and Retweet Behavior.” Proceedings of the Annual Hawaii International Conference on System Sciences 3500–3509.

Strub, Florian et al. 2017. “End-to-End Optimization of Goal-Driven and Visually Grounded Dialogue Systems.” IJCAI International Joint Conference on Artificial Intelligence 13:2765– 71.

Taboada, Maite, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. “Lexicon-Based Methods for Sentiment Analysis.” Computational Linguistics 37(2):267–307.

Tang, Duyu et al. 2014. “Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification.” 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference 1:1555–65.

Tang, Duyu, Bing Qin, and Ting Liu. 2015. “Document Modeling with Gated Recurrent Neural Network for Sentiment Classification.” Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (September):1422–32.

Turney, Peter D. 2001. “Thumbs up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews.” Pp. 417–24 in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02.

Boser, Bernhard E., Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” Pp. 144–52 in Proceedings of the Fifth Annual Workshop on Computational Learning Theory - COLT ’92. New York, NY, USA: ACM.

Vu, Ngoc Thang, Heike Adel, Pankaj Gupta, and Hinrich Schütze. 2016. “Combining Recurrent and Convolutional Neural Networks for Relation Classification.” CoRR abs/1605.07333.

Wahid, Hairunnizam, Sanep Ahmad, Mohd Ali Mohd Nor, and Maryam Abd Rashid. 2017. “Prestasi Kecekapan Pengurusan Kewangan Dan Agihan Zakat: Perbandingan Antara Majlis Agama Islam Negeri Di Malaysia” [Financial Management Efficiency and Zakat Distribution Performance: A Comparison among State Islamic Religious Councils in Malaysia]. Jurnal Ekonomi Malaysia 51(2):39–54.


Zeiler, Matthew D. 2012. “ADADELTA: An Adaptive Learning Rate Method.” CoRR abs/1212.5701.

Zhang, Xue, Hauke Fuehres, and Peter A. Gloor. 2011. “Predicting Stock Market Indicators Through Twitter ‘I Hope It Is Not as Bad as I Fear.’” Procedia - Social and Behavioral Sciences 26:55–62.

Zhou, Xujuan, Xiaohui Tao, Jianming Yong, and Zhenyu Yang. 2013. “Sentiment Analysis on Tweets for Social Events.” Proceedings of the 2013 IEEE 17th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2013 557–62.

Zhu, Feng and Xiaoquan (Michael) Zhang. 2010. “Impact of Online Consumer Reviews on Sales: The Moderating Role of Product and Consumer Characteristics.” Journal of Marketing 74(2):133–48.
