
Recent Trends in Information Technology and its Application Volume 4 Issue 1

Analyzing Tweets to Rank FIFA Players using Named Entity Recognition

Md. Niaz Imtiaz1, Md. Toukir Ahmed2, Golum Rabby3*
1,2,3Department of Computer Science and Engineering, Pabna University of Science and Technology, Pabna, Bangladesh.

*Corresponding Author E-mail Id:- [email protected]

ABSTRACT
Online media now play a significant role in sharing opinions on all kinds of occasions. People regularly share their experiences, feelings, and judgments, and opinions shared online can influence many issues. Twitter is a well-known social media platform where news and reactions are posted as events unfold. Football is arguably the most popular game in the world. In this paper, eight of the top-listed FIFA players of the current time are analyzed on the basis of tweets from the public. A sentiment analysis approach is applied to the tweets, and a ranking system is designed using computational and statistical means to rank the players.

Keywords:-Sentiment analysis, natural language processing, opinion mining, named entity recognition, Twitter data.

HBRP Publication Page 1-10 2021. All Rights Reserved

INTRODUCTION
Twitter (launched in March 2006) is a free online micro-blogging and social networking service where users post and interact through 'tweets'. Users can follow, post, tweet and re-tweet, and a huge number of tweets are posted each day. The Twitter API gives access to Twitter data in real time and to historical data going back to 2006. In this paper, the Twitter API has been used to gather tweets: 5000 recent tweets on FIFA players have been collected for our model.

Sentiment analysis is a field of Natural Language Processing (NLP) that analyzes human feelings, opinions, perspectives, and emotions about products, services, organizations, events, people, issues, etc. [1]. Sentiment analysis can anticipate the outcome of upcoming events, assess the effect of an issue, measure the popularity of individuals and much more [2]. For many applications it is not enough to classify opinion texts at the sentence level, because sentence-level classification cannot attach sentiments to their corresponding targets.

A text corpus contains many tweets, and each tweet can be as simple as a single sentence or as complex as several paragraphs. Each tweet may contain one or more entities such as a person, organization, location, money, time, etc. Information Extraction (IE) approaches attempt to pull named entities and objects out of a text corpus and determine their significance in event descriptions [3,4]. Named Entity Recognition (NER) is a process for identifying such information. The primary goal of NER is to recognize named entities and place them into predefined classes [5].

When a document is classified as positive, that does not mean all of its opinions are positive. Similarly, a document classified as negative does not mean every opinion in it is negative. Usually, when analyzing the sentiment of product reviews, one is interested not only in whether people are saying positive, neutral, or negative things about the product, but also in which particular aspects they are talking about.

In this model, spaCy has been used for information extraction. spaCy uses deep learning approaches to implement its NLP models [6]. C. J. Hutto implemented the VADER lexicon, which was originally designed to identify the sentiment of social media posts [7]. The VADER library under NLTK is applied to each tweet to get its polarity scores.

Several studies on the analysis of online reviews have been conducted. Many of them tried to extract meaningful information that is useful in real-life applications. Ji et al. attempted to compare items in e-commerce [8]. They built a review-based decision support model using probability multivalued neutrosophic linguistic numbers (PMVNLNs) and applied it to characterize their extracted reviews. McGlohon et al. attempted to find the true quality of products and merchants from reviews on different websites [9]. They experimented with statistical and heuristic models and compared their performance. Liu et al. attempted to rate different products by analyzing online reviews [10]. They applied a sentiment analysis technique and intuitionistic fuzzy logic to the reviews. Kumar and Abirami built a novel framework to rank products based on aspects [11]. They experimented with the Harel-Koren fast multiscale layout to identify and visualize the reviews. Bollen et al. analyzed tweets to determine the mood states affecting decision making in the stock market [12]. They used OpinionFinder to measure the polarity of mood and Google-Profile of Mood States (GPOMS) to measure mood in six dimensions: Calm, Alert, Sure, Vital, Kind and Happy. Fuzzy neural networks were then used to investigate the results given by the two tools. Yue et al. surveyed the progress of sentiment analysis over the years [13]. They described the major achievements in the field and summarized its limitations. Mohey and Hussein analyzed the hurdles in determining the accurate meaning of sentiments [14] and described the obstacles to sentiment analysis.

Several works describe different applications based on Named Entity Recognition. Ao et al. tried to detect the locations of accidents [15]. They analyzed accident-related information from a popular micro-blog in China called Sina Weibo. Chavan and Suryawanshi focused on ways of segmenting tweets [16]. They tried to build a framework that supports the segmentation of tweets, using an approach called HybridSeg. Aleidi et al. proposed a system that gathers tweets and applies a sentiment analysis technique to them [17]. Their main objective was to give decision makers categorized tweets; the system categorized tweets by analyzing their sentiment and popularity. Liu et al. analyzed the difficulty of identifying named entities in tweets, where information is insufficient [18]. They proposed a combination of a K-Nearest Neighbors (KNN) classifier and a linear Conditional Random Fields (CRF) model to overcome those difficulties.

Various studies on recognizing named entities and on public sentiment analysis of tweets have been carried out, but recognizing named entities for sentiment analysis is a new idea. In this research, tweets are analyzed to extract FIFA players, and the polarities of the comments are used to rank them.

MATERIALS AND METHODS
Python is used as the programming language throughout the work. The VADER algorithm is used to measure sentiment polarity. We recognized named entities with the help of Stanford NER. The spaCy and NLTK libraries are used for information extraction and to perform various text preprocessing tasks. First, tweets are extracted from Twitter. Then, irrelevant information and noise are removed from the data, and the data are processed into a meaningful form. FIFA players are recognized using Named Entity Recognition, and their corresponding tweets are separated. Finally, the ranks of the players are computed from the sentiment analysis of the tweets. Moreover, the results are investigated through two machine learning approaches: Logistic Regression and Naïve Bayes Classifier. Figure 1 shows the model for player ranking based on Twitter data.

[Figure 1: flow diagram of the pipeline, from the Twitter Streaming API through data gathering, pre-processing and entity extraction to positive/negative tweets, compound score, rating and model evaluation]
Fig.1:-Model for player ranking based on twitter data.

Data Extraction
The Twitter API has been used to collect tweets from Twitter. The API gives access to data in real time as well as to past data. 5000 recent tweets have been collected with tags such as FIFA, football, club football and so on. Table 1 shows a portion of the collected tweets. Tweets contain a lot of noise, including HTML tags, labels, links, symbols, stop-words and so on, so preprocessing is needed to eliminate it.

Table 1:-Twitter Data
Unprocessed Tweets
If it’s not Messi I don’t wanna hear it https://t.co/Y0EGOuNxGF
Toni Kroos + Jordan Henderson + Casemiro + Valverde + Modric = 22 goal contributions this season (10 goals, 13 assi… https://t.co/Yu0LLulBEp
Tuesday's Arsenal transfer talk news roundup: Pierre-EmerickAubameyang, Luis Suarez, James Rodriguez https://t.co/b7wbfWe1HB #arsenal
@ESPNFC You don't have Toni kroos and Karim Benzema in Team Adidas..Sometimes I don't understand how dumb people ar… https://t.co/DWLFoGMKOt
@UtdChi Did Messi and Ronaldo actually stop them from winning BallonD'or or they didn't measure up to them in the… https://t.co/zSHP8bPhrC

Preprocessing
For unprocessed text data, preprocessing plays a key role in building a model that understands the data. Text data contain a lot of noise, and cleaning it intelligently is challenging. Preprocessing substantially reduces the size of the input text documents, and it proceeds in several steps. We have replaced each positive label with '1' and each negative label with '0' for easier processing. With the help of deep learning and machine learning techniques, the text is then refined and parsed into a meaningful format. The preprocessing cycle turns the raw text into well-defined sequences of semantic components with a standard structure. Some of the preprocessing techniques applied in this work are described here.

Removing HTML tags
A portion of the tweets contains links that are not important for our model. To take out HTML tags, we have applied regular expressions. Texts also contain single characters, special characters and white spaces that are not important for our


model. Hence, all these unnecessary entities have been removed.

Converting uppercase to lowercase
In Natural Language Processing, converting capitalized words to lower case is important because it represents the characters uniformly. Additionally, case conversion has been done in this work to match specific words or tokens and to remove ambiguities among similar words.

Tokenization
Texts are unstructured, but they follow a particular syntax and semantics. Syntax, or structure, is a set of rules or conventions that describes how words come together to form phrases, clauses or sentences. Words are independent units with meanings of their own, and a word is made up of a group of morphemes. It is necessary to interpret words and classify them according to their parts of speech. In natural language processing, tokens are the smallest units that have a particular syntax and semantics. A text document is composed of sentences, which are in turn composed of words, phrases and clauses. Tokenization approaches split a text corpus into sentences and then split the sentences into words. These words are known as tokens. Word tokenization breaks a sentence down into a list of words that can be joined again to rebuild the original sentence if needed. Word tokenization is essential for cleaning and normalizing text as well as for operations like stemming and lemmatization. Table 2 shows some tokenized words.

Table 2:-A Snippet of Dataset after Tokenization.
Tweet: If it is not Messi I do not wanna hear it
Tokenized Words: 'If', 'it', 'is', 'not', 'Messi', 'I', 'do', 'not', 'wanna', 'hear', 'it'
Tweet: Toni Kroos contributes 22 goals this season, 13 assist. I think it is good from him.
Tokenized Words: 'Toni', 'Kroos', 'contributes', '22', 'goals', 'this', 'season', '13', 'assist', 'I', 'think', 'it', 'is', 'good', 'from', 'him'
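The cleaning, case conversion and word tokenization steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard `re` module, not the authors' code: the regular expressions and the helper names `clean_tweet` and `tokenize` are assumptions made for the example, and the paper itself reports using spaCy and NLTK for these tasks.

```python
import re

# Assumed patterns for this sketch; the paper does not list its expressions.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Anything that is not a lowercase letter, digit, or whitespace counts as noise.
NOISE_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text):
    """Strip links and HTML tags, lowercase, and drop special characters."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = NOISE_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text into word tokens on whitespace."""
    return text.split()


raw = "If it is not Messi I do not wanna hear it https://t.co/Y0EGOuNxGF"
tokens = tokenize(clean_tweet(raw))
print(tokens)
```

Lowercasing before tokenization keeps later matching simple, at the cost of losing the capitalization shown in Table 2.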

Lemmatization
Lemmatization is a technique for grouping words that share a common base form, the dictionary form of the word. For grouping, the words are analyzed using vocabulary and morphological techniques. It is essential to identify the correct part of speech of a word and to understand its actual meaning. In tweets a word may be used in various forms, so lemmatization is an essential part of our work. Table 3 shows some examples of lemmatization. For example, the words friends and friendly both derive from the common base word friend.

Table 3:-Lemmatized Words
Real Words | Lemmatized Word
Friends, Friendly | Friend
Bad, Badly | Bad
Playing, Played | Play
Performing, Performed, Performance | Perform

Removing unnecessary tokens and stopwords
Stopwords are words that have little or no importance. For example, 'a', 'the', 'an', 'of', 'where', 'which', 'would', 'whom', 'this' etc. are stop words. The selection of stop words, however, varies from application to application. We first had to identify the correct stop words for our application. After that, those stop words are removed to reduce complexity. As they have no significance for decision making, removing them does not affect the overall performance. Table 4 shows some tokens found after removing stop words. Each of the 5000 tweets is segmented in the same way.
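As a rough illustration of the two steps above, the sketch below lemmatizes tokens with a toy lookup table and then filters stop words. Both the `LEMMAS` dictionary (built only from the Table 3 examples) and the `STOP_WORDS` set are hypothetical stand-ins: the paper performs lemmatization with spaCy and does not publish its stop-word list.

```python
# Hypothetical, hand-picked stop-word list; the paper does not publish its own.
STOP_WORDS = {"a", "an", "the", "of", "if", "it", "is", "i", "this",
              "where", "which", "would", "whom", "from", "him", "think"}

# Toy lemma dictionary built from the Table 3 examples only.
LEMMAS = {
    "friends": "friend", "friendly": "friend",
    "badly": "bad",
    "playing": "play", "played": "play",
    "performing": "perform", "performed": "perform", "performance": "perform",
}


def lemmatize(tokens):
    """Map each token to its base form; unknown tokens pass through unchanged."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = ["if", "it", "is", "not", "messi", "i", "do", "not", "wanna", "hear", "it"]
filtered = remove_stop_words(lemmatize(tokens))
print(filtered)
print(lemmatize(["played", "badly", "friends"]))
```

A dictionary lookup is used here only to keep the example self-contained; a real lemmatizer also needs part-of-speech information, as the text notes.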


Table 4:-Data after Removing Stop Words.
Tweet: If it is not Messi I do not wanna hear it
Tokens: 'not', 'Messi', 'do', 'wanna', 'hear'
Tweet: Toni Kroos contributes 22 goals this season, 13 assist. I think it is good from him.
Tokens: 'Toni', 'Kroos', 'contributes', '22', 'goals', 'season', '13', 'assist', 'good'

Named Entity Recognition
An entity is a person, product, service, organization, topic, event or issue [19]. Named Entity Recognition (NER) has numerous applications, for example in machine learning, knowledge extraction, information retrieval, data processing and text mining. In sentiment analysis, people often express opinions about particular entities, so it is important to find the entity to which a comment refers. People can mention the same entity in different ways. For instance, "Lionel Messi" is sometimes written as "Lionel", "Messi" or "Lio". An efficient system identifies all of them in the corpus. The principal requirement of this extraction is that the system should identify entities according to the user's input. There are various categories of entities. Some of them are shown in Table 5.

Table 5:-Various Named Entities.
Type | Description
PERSON | People
NORP | Nationalities, religious or political groups
ORG | Companies, agencies, institutions etc.
GPE | Countries, cities, states etc.

Stanford NER is also called the CRF Classifier. The Conditional Random Field (CRF) is a statistical method for recognizing patterns in structured or labeled data. We used Stanford NER for recognizing named entities. spaCy is an NLP library in Python with a reputation for industrial strength. Preprocessing tasks such as tokenization, lemmatization and POS tagging are performed using spaCy. The Natural Language Toolkit (NLTK) is a Python library that consists of several corpora, APIs and numerous Natural Language Processing algorithms.

In this research, a Named Entity Recognizer has been used to find FIFA players in the tweets. Eight FIFA players have been selected from the top 10 players in the EA Sports FIFA 19 Ratings: "Lionel Messi", "Cristiano Ronaldo", "Toni Kroos", "Eden Hazard", "Sergio Ramos", "Luka Modric", "Luis Suarez" and "David de Gea".

Calculating Polarity Score
A sentiment analysis model measures the polarity of sentences. There are two types of technique for analyzing sentiment: the machine learning based approach and the lexicon-based approach. The lexicon-based approach is easier to understand and to implement, whereas the machine learning based approach requires a huge volume of data. In this study, a lexicon-based approach is used to analyze sentiment. It uses a lexicon model that holds detailed information about subjective words and phrases, including emotion, mood, polarity, objectivity, subjectivity, etc. A lexicon model works with lexicons. A lexicon is composed of numerous words, each of which is assigned a polarity score. Researchers have experimented with many lexicon models, such as the VADER lexicon, TextBlob lexicon, MPQA subjectivity lexicon, Bing Liu's lexicon, Pattern lexicon etc. In this work the VADER lexicon is used to calculate the sentiment polarity of tweets. VADER is a framework that follows the rule-based sentiment analysis approach and is mostly used for analyzing sentiment in social media. Previous sentiment analysis experiments with VADER showed precise results. VADER has been found to perform well in online review analysis, where the text is unorganized and complex. In this study we implemented rule based


sentiment analysis to calculate the sentiment polarity scores. The model is then evaluated using machine learning approaches.

Table 6:-Sentiment Scoring.
Sentiment Metric | Score
Positive | 0.674
Negative | 0.0
Neutral | 0.326
Compound | 0.735

Table 6 shows the sentiment polarity that VADER gives for a random sentence. The Positive, Negative and Neutral scores indicate the proportion of the text that falls in each class. Here the random sentence is 67% Positive, 33% Neutral and 0% Negative. The compound score is the normalized final polarity score: the extreme positive polarity is marked by +1 and the extreme negative polarity by -1.

First of all, eight of the top 10 players in the EA Sports FIFA 19 Ratings are chosen, and 5000 tweets on different players are extracted. Then a series of preprocessing tasks is performed. Using the Named Entity Recognizer, tweets on the eight selected FIFA players are identified and separated under the players' names. For each player, the polarities of the tweets are calculated, and the mean of those polarities is taken as the final sentiment score for that player. A rating schema is defined, and the ranks of the players are computed from the polarity scores. This rank reflects how popular a player is among people all around the world.

RESULT AND DISCUSSION
After the preprocessing tasks on the 5000 extracted tweets, the Named Entity Recognizer is applied to them. Various named entities are found in the results. Among them, we only considered the tweets about our eight selected FIFA players and removed the rest. 2646 tweets are found on the selected players.

Table 7:-Number of tweets found on players.
Player Name | Number of Tweets Found
Cristiano Ronaldo | 262
Toni Kroos | 397
Eden Hazard | 370
Lionel Messi | 460
Sergio Ramos | 415
Luka Modric | 301
Luis Suarez | 348
David de Gea | 93

Table 7 shows the number of tweets found on each of the eight players after applying NER and filtering. The largest number of tweets (460) is found on Lionel Messi and the smallest (93) on David de Gea. In terms of percentage, about 17% of the tweets are on Messi, 13% on Suarez and 4% on de Gea (Figure 2).

Fig.2:-Percentage of tweets on each player.
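The grouping of tweets under player names and the percentage shares can be illustrated with a short Python sketch. Note the substitution: the paper resolves player mentions with Stanford NER, whereas this example uses a plain alias lookup so that it stays self-contained. The `ALIASES` table and the function names are hypothetical; the counts in `TABLE_7` are copied from Table 7.

```python
import re
from collections import defaultdict

# Hypothetical alias table; the paper resolves mentions with Stanford NER instead.
ALIASES = {
    "Lionel Messi": ["lionel messi", "lionel", "messi", "lio"],
    "Luis Suarez": ["luis suarez", "suarez"],
    "David de Gea": ["david de gea", "de gea"],
}


def group_by_player(tweets, aliases=ALIASES):
    """Return {player: [tweets that mention any alias of that player]}."""
    grouped = defaultdict(list)
    for tweet in tweets:
        lowered = tweet.lower()
        for player, names in aliases.items():
            # Word boundaries stop "lio" from matching inside "lionel".
            if any(re.search(r"\b" + re.escape(n) + r"\b", lowered) for n in names):
                grouped[player].append(tweet)
    return dict(grouped)


def share_percent(counts):
    """Convert per-player tweet counts into percentages of the total."""
    total = sum(counts.values())
    return {player: round(100 * n / total, 1) for player, n in counts.items()}


# Counts copied from Table 7.
TABLE_7 = {
    "Cristiano Ronaldo": 262, "Toni Kroos": 397, "Eden Hazard": 370,
    "Lionel Messi": 460, "Sergio Ramos": 415, "Luka Modric": 301,
    "Luis Suarez": 348, "David de Gea": 93,
}

shares = share_percent(TABLE_7)
print(sum(TABLE_7.values()))                            # 2646
print(shares["Lionel Messi"], shares["David de Gea"])   # 17.4 3.5
```

A tweet that mentions two players is filed under both, which seems the natural reading of "separated under their names", though the paper does not say how such tweets are handled.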


Tweets are separated under the player names, and we analyzed the sentiment polarities of the tweets for each player. Figure 3 shows the sentiment analysis results. Out of 460 tweets on Lionel Messi, 224 are positive, 185 are negative and the rest are neutral.

Fig.3:-Sentiment polarities of tweets
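To make the positive/negative/neutral split concrete, here is a minimal Python sketch of how a compound score can be produced and turned into a label. The normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` follows the VADER reference implementation; the cut-offs of +0.05 and -0.05 are the commonly recommended VADER thresholds and are an assumption here, since the paper does not state which thresholds it used. The sample scores are invented for illustration.

```python
import math


def normalize(raw_score, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


def label(compound, threshold=0.05):
    """Map a compound score to a polarity class (assumed thresholds)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def polarity_counts(compounds):
    """Count how many compound scores fall in each polarity class."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for c in compounds:
        counts[label(c)] += 1
    return counts


# Hypothetical compound scores, not data from the paper.
scores = [0.735, -0.42, 0.0, 0.03, -0.6]
print(polarity_counts(scores))
```

In practice the compound score would come from NLTK's `SentimentIntensityAnalyzer.polarity_scores`, which returns the `neg`, `neu`, `pos` and `compound` values of the kind shown in Table 6.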

Lionel Messi has the largest number of positive tweets (224), while Luis Suarez has the largest number of negative tweets (202). Among the 93 tweets on David de Gea, 67 (72%) are positive, which is the highest percentage of positive tweets (Table 8). Luis Suarez has the highest percentage of negative tweets (58%), with 202 negative tweets out of 348.

Table 8:-Percentage of tweet polarities.
Player Name | Positive Tweets (%) | Negative Tweets (%) | Neutral Tweets (%)
Cristiano Ronaldo | 52.3 | 30.5 | 17.2
Toni Kroos | 49.9 | 16.6 | 33.5
Eden Hazard | 39.2 | 20.3 | 40.5
Lionel Messi | 46.2 | 38.7 | 15.1
Sergio Ramos | 49.9 | 25.3 | 24.8
Luka Modric | 45.2 | 24.2 | 30.6
Luis Suarez | 29.0 | 58.0 | 12.9
David de Gea | 72.0 | 18.3 | 9.7

The mean sentiment score is calculated from the polarities of the tweets on each individual player and is taken as the sentiment polarity score for that player. A higher score indicates more positive sentiment toward the player among people. A ranking schema is defined on the hypothesis that a higher sentiment polarity score corresponds to a higher rank and a lower score to a lower rank.

The model is evaluated using two well-known machine learning approaches to sentiment classification: Logistic Regression and Naïve Bayes Classifier. Pre-trained Naïve Bayes and Logistic Regression models are applied to the tweets to check the results given by our model. The two machine learning approaches measure the prediction accuracy of the sentiment polarities given by our model.
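The mean-score ranking described above can be sketched as follows. This is an illustrative Python example under stated assumptions: the compound scores in `sample` are invented, and the sketch only orders players by mean score. The paper does not spell out how mean scores are mapped to the final rating values, so that step is deliberately left out.

```python
from statistics import mean


def rank_players(scores_by_player):
    """Return (player, mean compound score) pairs, highest mean first."""
    means = {p: mean(s) for p, s in scores_by_player.items() if s}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical compound scores, not data from the paper.
sample = {
    "Lionel Messi": [0.7, -0.3, 0.2],
    "Luis Suarez": [-0.5, -0.1, 0.0],
    "David de Gea": [0.8, 0.6, 0.4],
}

ranking = rank_players(sample)
for position, (player, score) in enumerate(ranking, start=1):
    print(position, player, round(score, 2))
```

Players with no tweets are skipped rather than given a score of zero, since a missing sample is not the same as neutral sentiment.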

Table 9:-Model Evaluation
Evaluation | Logistic Regression | Naïve Bayes Classifier
Accuracy (%) | 91.4 | 93.8

Table 9 shows the evaluation of our model using the two approaches. The Naïve Bayes Classifier gives a satisfactory level of classification accuracy.
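One plausible reading of the accuracy figures in Table 9 is the percentage of tweets on which a classifier's label agrees with the lexicon-based label. The Python sketch below shows that computation only; the label lists are invented, the function name `accuracy` is an assumption, and the paper does not describe its exact evaluation procedure.

```python
def accuracy(reference, predicted):
    """Percentage of items on which the two label sequences agree."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("need at least one label")
    hits = sum(r == p for r, p in zip(reference, predicted))
    return 100.0 * hits / len(reference)


# Hypothetical labels: 1 = positive, 0 = negative, as in the preprocessing step.
lexicon_labels = [1, 0, 1, 1, 0, 1, 0, 1]
classifier_labels = [1, 0, 1, 0, 0, 1, 0, 1]
print(accuracy(lexicon_labels, classifier_labels))  # 87.5
```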

Fig.4:-Player ranking.

Figure 4 shows the rating obtained from the sentiment polarity scores for each player. David de Gea is the highest-ranked player with a rating of 10, while Luis Suarez is the lowest-ranked player with a rating of 3. Cristiano Ronaldo, Toni Kroos and Luka Modric have similar ratings of 7. This ranking represents how football fans evaluate the players.

CONCLUSION
Social media have become the most popular platform for sharing personal feelings and opinions in recent days. People share their opinions on whatever they evaluate. Football is an exciting game that is played in almost every country in the world. People love football, play football, watch football, criticize it and share opinions about football and football players via various online media. Thus, Twitter is a great source of data about people's opinions of football players. In this paper, tweets on eight players from the top 10 players in the EA Sports FIFA 19 Ratings are extracted and analyzed to rank them according to public sentiment. The sentiment determination technique and ranking framework are discussed. By computational and statistical means, a rating for each of the eight players is generated based on public views of the players. This model is reliable, as the rating is evaluated from public feeling and reflects the real impression of the FIFA players among people. People


usually review players based on how satisfied they are with the players' performance. This ranking system shows the current satisfaction levels with the players among football fans. In addition, this rating model can be used to rate players in other games.

REFERENCES
1. Zhao, J., Liu, K., & Xu, L. (2016). Sentiment analysis: mining opinions, sentiments, and emotions.
2. Lin, B., Zampetti, F., Bavota, G., Di Penta, M., Lanza, M., & Oliveto, R. (2018, May). Sentiment analysis for software engineering: How far can we go?. In Proceedings of the 40th International Conference on Software Engineering (pp. 94-104).
3. Hobbs, J., & Riloff, E. (2010). Chapter 21, Information Extraction. Handb. Nat. Lang. Process., 511-532.
4. Ghiassi, M., & Lee, S. (2018). A domain transferable lexicon set for Twitter sentiment analysis using a supervised machine learning approach. Expert Systems with Applications, 106, 197-216.
5. Li, C., Sun, A., Weng, J., & He, Q. (2014). Tweet segmentation and its application to named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 27(2), 558-570.
6. Partalidou, E., Spyromitros-Xioufis, E., Doropoulos, S., Vologiannidis, S., & Diamantaras, K. I. (2019, October). Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy. In 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI) (pp. 337-341). IEEE.
7. Hutto, C., & Gilbert, E. (2014, May). Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 8, No. 1).
8. Ji, P., Zhang, H. Y., & Wang, J. Q. (2018). A fuzzy decision support model with sentiment analysis for items comparison in e-commerce: The case study of http://PConline.com. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 49(10), 1993-2004.
9. McGlohon, M., Glance, N., & Reiter, Z. (2010, May). Star quality: Aggregating reviews to rank products and merchants. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 4, No. 1).
10. Liu, Y., Bi, J. W., & Fan, Z. P. (2017). Ranking products through online reviews: A method based on sentiment analysis technique and intuitionistic fuzzy set theory. Information Fusion, 36, 149-161.
11. Kumar, A., & Abirami, S. (2018). Aspect-based opinion ranking framework for product reviews using a Spearman's rank correlation coefficient method. Information Sciences, 460, 23-41.
12. Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1-8.
13. Yue, L., Chen, W., Li, X., Zuo, W., & Yin, M. (2019). A survey of sentiment analysis in social media. Knowledge and Information Systems, 60(2), 617-663.
14. Hussein, D. M. E. D. M. (2018). A survey on sentiment analysis challenges. Journal of King Saud University-Engineering Sciences, 30(4), 330-338.
15. Ao, J., Zhang, P., & Cao, Y. (2014). Estimating the locations of emergency events from Twitter streams. Procedia Computer Science, 31, 731-739.
16. Chavan, C., & Suryawanshi, R. (2016, September). Summarization of tweets and Named Entity Recognition from tweet segmentation. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT) (pp. 66-71). IEEE.
17. Aleidi, S., Alsuhaibani, D., Alrajebah, N., & Kurdi, H. (2019, December). A tweet-ranking system using sentiment scores and popularity measures. In International Conference on Computing (pp. 162-169). Springer, Cham.
18. Liu, X., Zhang, S., Wei, F., & Zhou, M. (2011, June). Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 359-367).
19. Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.
