Sentiment Analysis on Twitter Data of World Cup Soccer Tournament Using Machine Learning

IoT, Article
Ravikumar Patel and Kalpdrum Passi *
Department of Mathematics and Computer Science, Laurentian University, Sudbury, ON P3E 2C6, Canada; [email protected]
* Correspondence: [email protected]
Received: 15 September 2020; Accepted: 6 October 2020; Published: 10 October 2020
Abstract: In the derived approach, Twitter data for the 2014 World Cup soccer tournament held in Brazil are analyzed with machine learning techniques to detect the sentiment of people throughout the world. The data are filtered and analyzed using natural language processing techniques, and sentiment polarity is calculated from the emotion words detected in the user tweets. The dataset is normalized for use by machine learning algorithms and prepared using natural language processing techniques such as word tokenization, stemming and lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), and parsing to extract emotions from the text of each tweet. The approach is implemented in the Python programming language with the Natural Language Toolkit (NLTK). A derived algorithm uses WordNet and the POS of each word to extract the emotional words that carry meaning in the current context of the sentence, and assigns them sentiment polarity using the SentiWordNet dictionary or a lexicon-based method. The resulting polarity is further analyzed using naïve Bayes, support vector machine (SVM), K-nearest neighbor (KNN), and random forest machine learning algorithms and visualized on the Weka platform. Naïve Bayes gives the best accuracy of 88.17%, whereas random forest gives the best area under the receiver operating characteristic curve (AUC) of 0.97.
Keywords: natural language processing (NLP); data preprocessing; word tokenization; word stemming and lemmatizing; POS tagging; named entity recognition; machine learning; naïve Bayes; SVM; maximum entropy; KNN; random forest
IoT 2020, 1, 218–239; doi:10.3390/iot1020014; www.mdpi.com/journal/iot
1. Introduction
In this advancing world of technology, expressing emotions, feelings, and views regarding any and every situation is much easier through social networking sites. Sentiment analysis and opinion mining are two processes that aid in classifying and investigating the behavior and approach of customers with regard to a brand, product, event, company, or customer service [1]. Sentiment analysis can be defined as the automatic process of extracting emotions from the user’s written text by processing unstructured information and preparing a model to extract knowledge from it [2]. This paper considers one such site, Twitter, which is among the largest social networking sites. Twitter has around 316 million monthly active users, and on average about 500 million tweets are sent daily [3].
There are many approaches used for sentiment analysis on linguistic data, and the approach to be used depends on the nature of the data and the platform. Most research carried out in the field of sentiment analysis employs lexicon-based analysis or machine learning techniques. Machine learning techniques process the data with machine learning algorithms and classify the linguistic data by representing them in vector form [4]. On the other hand, a lexicon-based (also called dictionary-based) approach classifies the linguistic data using a dictionary lookup database.
During this classification, it computes sentence- or document-level sentiment polarity using lexicon databases for processing linguistic data, such as WordNet, SentiWordNet, and treebanks. A lexicon dictionary or database contains opinionated words classified as positive or negative, together with a description of the word as it occurs in the current context. Each word in the document is assigned a numeric score; the average score is computed by summing up all the numeric scores, and a sentiment polarity is assigned to the document. With the machine learning approach, the words in the sentence are considered in the form of vectors and analyzed using different machine learning algorithms like naïve Bayes, support vector machine (SVM), and maximum entropy. The data are prepared accordingly so that machine learning algorithms can be trained on them.
In this paper, both approaches were combined, namely lexicon-based and machine learning, for sentiment analysis of Twitter data. These algorithms were implemented for preprocessing the dataset and for filtering and reducing the noise in it.
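The lexicon-based scoring described above (a numeric score per word, averaged into a document-level polarity) can be illustrated with a minimal Python sketch. It is not the paper's implementation: `TOY_LEXICON`, `tokenize`, and `polarity` are hypothetical names, the scores are made up for illustration, and averaging over only the words found in the lexicon is an assumption, since the text does not specify the denominator. A real system would look scores up in a resource such as SentiWordNet.

```python
import re

# Hypothetical miniature lexicon with made-up scores; it stands in for a
# real resource such as SentiWordNet and is NOT taken from the paper.
TOY_LEXICON = {
    "great": 0.75,
    "win": 0.5,
    "love": 0.625,
    "bad": -0.625,
    "lose": -0.5,
    "terrible": -0.875,
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def polarity(text, lexicon=TOY_LEXICON):
    """Return (label, average_score) for a piece of text.

    Assumption: the average is taken over the words found in the lexicon;
    text with no lexicon words is labeled neutral with a score of 0.0.
    """
    scores = [lexicon[word] for word in tokenize(text) if word in lexicon]
    if not scores:
        return "neutral", 0.0
    average = sum(scores) / len(scores)
    if average > 0:
        label = "positive"
    elif average < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, average


if __name__ == "__main__":
    for tweet in ["What a great win for Brazil!",
                  "Terrible defence, bad day",
                  "Kickoff at noon"]:
        print(tweet, "->", polarity(tweet))
```

Labels produced this way could then serve as the class attribute when training a classifier, which is the combination of the two approaches that the paper describes.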
In this approach, the core linguistic data processing algorithm using natural language processing (NLP) has been designed and implemented, and sentiment polarity is assigned to the tweets using a lexicon-based approach. Later, the resultant dataset was used to train the machine learning algorithms naïve Bayes, SVM (support vector machine), K-nearest neighbor (KNN), and random forest to measure accuracy on the training dataset, and the results were compared. An abstract view of the derived approach that combines lexicon-based and machine learning for sentiment analysis is shown in Figure 1.
Figure 1. Overview of approach for sentiment analysis.
2. Related Work
In this era, information sharing through social media has increased and most users actively share their personal ideas and information publicly. For an analyst or researcher, this information is a gold mine from which to dig out valuable information for strategic decision-making [5].
Fishing out sentiments embodied in the user’s written text, in the world of social media, is known as sentiment analysis or opinion mining. Firmino Alves et al. [6] state that from the beginning of the 21st century, sentiment analysis has been one of the most interesting as well as active research topics in the domain of natural language processing. It helps the decision maker to understand the responses of people towards a particular topic and aids in determining whether the response to an event is positive, negative, or neutral. Twitter has been considered a very important platform for data mining by many researchers. Hemalatha et al. [7] discuss that the Twitter platform contains much relevant information on particular events with hashtags that have been followed and accepted by many popular personalities. The fundamental objective of sentiment analysis is to classify sentiment polarity from a text as positive, negative, or neutral. This classification can be done at the sentence level, document level, or entity and aspect level. There are many approaches to classify the sentiment
