Empirical Study of Twitter and Tumblr for Sentiment Analysis Using Soft Computing Techniques

Proceedings of the World Congress on Engineering and Computer Science 2017 Vol I, WCECS 2017, October 25-27, 2017, San Francisco, USA

Akshi Kumar, Member, IAENG, Arunima Jaiswal

Manuscript received July 16, 2017. Akshi Kumar is a faculty member with the Dept. of Computer Science & Engineering, Delhi Technological University, Delhi, India (e-mail: [email protected]). Arunima Jaiswal is a research scholar with the Dept. of Computer Science & Engineering, Delhi Technological University, Delhi, India (e-mail: [email protected]).

ISBN: 978-988-14047-5-6; ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online)

Abstract: Twitter and Tumblr are two prominent players in the micro-blogging sphere. Data generated from these social micro-blogs is voluminous and varied. People express and voice their emotions and opinions over these social media channels, making them a big sentiment-rich corpus from which strategic data can be analyzed. Sentiment Analysis on Twitter has been a research trend, with constant and continued studies on it to improve and optimize the accuracy of results. As an alternative to the 'restricting' tweets, tumblogs from Tumblr offer a 'scrapbook' micro-blogging space. In this paper, we intend to mine and compare these two social media for sentiments to empirically analyze their performance as data analytic tools based on soft computing techniques. We have collected tweets and tumblogs related to four trending events from both platforms (approximately 3000 tweets and 3000 tumblogs) and analyzed the sentiment polarity using six supervised classification algorithms, namely, Naive Bayesian, Support Vector Machine, Multilayer Perceptron, Decision Tree, k-NN, and Fuzzy logic. The results are evaluated for classifier performance, based on precision, recall and accuracy.

Index Terms: Micro-blogging, Sentiment Analysis, Soft Computing, Twitter, Tumblr

I. INTRODUCTION

The new technology paradigm of SMAC (Social media, Mobile, Analytics & Cloud) [1] has revolutionized the way computing is done and also how users get information and engage themselves. This confluence is dominating IT practices, primarily due to the increased popularity of social media and the subsequent development of analytic tools. The buzzing term 'Big data' includes data generated from Social Networks (human-sourced information), Traditional Business systems (process-mediated data) or the Internet of Things (machine-generated data) [2].

Social media generates a high volume of varied data at a high velocity, thus leading to the 'bigness' in data. People choose to express and voice their emotions and opinions over major social media channels such as blogs, review websites, posts, comments and micro-blogs. A persistent need to leverage this big social data for analytics has been identified by both researchers and practitioners. Sentiment Analysis is one such research direction, which drives this cutting-edge SMAC paradigm by transforming text into a knowledge-base.

Formally, Sentiment Analysis, established as a typical text classification task [3], is defined as the computational study of people's opinions, attitudes and emotions towards an entity [4, 5]. These opinions are expressed as written text on any particular topic, available and discussed over social media sources such as blogs, micro-blogs, social networking sites and product review sites, with the primary intent to categorize content into negative, positive or otherwise neutral polarities.

Amongst the social media channels, Twitter has emerged as a key player from which sentiment-rich data can be extracted. This is primarily due to the characteristics of the real-time messages shared on it. Post size is limited to 140 characters, the users are diverse (regular user, public figure, company representative) and globally distributed, and user involvement ranges from simply liking a post to commenting on or re-tweeting it, making it a leading sentiment-rich corpus. Moreover, the easy availability of several Twitter APIs and programming services has helped make sentiment analysis on Twitter a significant research trend [3, 5, 6].

The techniques for sentiment analysis on Twitter have been categorized into Lexicon Based, Machine Learning Based, Hybrid (Lexicon + Machine Learning) and Concept-based (Ontology or context) across pertinent literature [4, 5, 6, 7].

Tumblr is another micro-blogging portal, which arrived at almost the same time as Twitter but has gained popularity recently due to some value-added features, such as posting images, audio, video and other media, depending on the user's knowledge for customizing, managing and uploading such files to create tumblogs (short-form blogs). 'Tumblogging' has not been used much in research studies, whereas 'Tweeting' has been the core of most prominent baseline studies.

Moreover, there has been a constant need to improve sentiment classification accuracy with the increase in practical use of sentiment analysis for various analytic domains. The high-dimensional, uncertain social data-space is complex, and researchers are keen on applying and testing novel techniques to improve the generic sentiment classification task.

Soft computing has emerged as a significant approach to solve real-world problems which are pervasively imprecise and uncertain. It covers a variety of techniques, namely Machine Learning (Supervised; Unsupervised; Ensemble); Neural Networks; Evolutionary Computation (Evolutionary Algorithms: Genetic Algorithms, Differential Evolution; Swarm Intelligence: nature-inspired algorithms such as Ant Colony Optimization and Particle Swarm Optimization); Fuzzy Logic; and Probabilistic Reasoning (Naive Bayesian: Bayesian probability) [8]. Studies to understand the theory, research and practice of using soft computing techniques for sentiment analysis exist but are limited, and have mostly considered Twitter as the database. This motivated the work presented in this paper, where we implement and analyze a few (not all) of these supervised soft computing techniques for sentiment analysis on both Twitter and Tumblr.

In this paper, we considered tumblogs (re-blogs) and tweets (re-tweets) on four of the most trending topics of the last two years, that is, the US presidential elections (2016), Donald Trump's plans to ban Muslims from the US (2017), the Rio Olympics (2016) and the release of Pokemon Go second generation (2017), for empirical evaluation and comparison. The extracted data was preprocessed for feature selection and was manually labelled to accomplish coarse-grained sentiment analysis (positive, negative or neutral). It was then assessed using six supervised soft computing techniques, namely, Naive Bayesian, Support Vector Machine, Multilayer Perceptron, Decision Tree, k-NN, and Fuzzy logic, in Weka. Other supervised soft computing techniques could also be evaluated; we selected these six to define the scope of the work presented in this paper and leave the rest for future work. The results were evaluated based on efficacy measures like precision, recall and accuracy, for probing the capabilities and scope of sentiment analysis within the two micro-blogs.

The paper has been organized as follows. Section 2 provides a brief idea on the key concepts of this work, namely the micro-blogs (Twitter and Tumblr), sentiment analysis and soft computing techniques, followed by discussion on the related studies. Section 3 explicates the system architecture for this empirical analysis and describes ...

... Feature Selection; Sentiment Classification and Sentiment Polarity detection [10].

Feature selection directly impacts the classification accuracy, but the high-dimensional, unstructured social media content makes this sub-task even more challenging, fostering the need for improved and optimized techniques for feature selection. Studies exploring and evaluating micro-blogging portals, especially Twitter, are available in literature [11, 12, 13, 14].

The challenging aspects of using micro-blogs for sentiment analysis stem from the domains of Natural Language Processing (NLP), text analytics and computational linguistics. They include issues related to the fixed text length; spelling variation due to use of short forms like 'gr8' for 'great' and 'gud' for 'good'; use of colloquial words; multilingual content in the same tweet or post; use of emoticons; co-reference resolution; negation handling; sarcasm, irony and emotion detection; and word sense disambiguation [3, 15].

Table I illustrates the basic difference between the two micro-blogging portals: Tumblr, a blog-based social media website created in February 2007, and Twitter, a free social networking micro-blogging service that allows registered members to broadcast short posts called tweets, released in March 2006.

In this paper, we have focused on six supervised soft computing techniques, namely, Naive Bayesian, Support Vector Machine, Multilayer Perceptron, Decision Tree, k-NN, and Fuzzy logic, to perform the task of sentiment classification in two micro-blogging portals, Twitter and Tumblr. In a supervised learning model, the training data (observations, measurements, etc.) are labeled with pre-defined classes, and the test data is classified into these classes.

The training and testing datasets were selected in Weka using the 5-fold cross-validation method.
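
The evaluation protocol named in the text (5-fold cross-validation scored by precision, recall and accuracy) can be illustrated with a short Python sketch. It is not the Weka configuration used in the study: the function names, the whitespace tokenizer, the add-one-smoothed multinomial Naive Bayes and the plain shuffled (unstratified) split are all assumptions made here to keep the example self-contained.

```python
import math
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_nb(docs, labels):
    """Collect class and per-class word counts (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()  # assumption: whitespace tokens
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior for one post."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for tok in doc.lower().split():
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(gold, pred, classes):
    """Overall accuracy plus per-class (precision, recall)."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    report = {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[c] = (precision, recall)
    return accuracy, report


def cross_validate(docs, labels, k=5, seed=0):
    """Pool out-of-fold predictions over k folds, then score them once."""
    gold, pred = [], []
    for held_out in k_fold_indices(len(docs), k, seed):
        held = set(held_out)
        train_docs = [d for i, d in enumerate(docs) if i not in held]
        train_labels = [y for i, y in enumerate(labels) if i not in held]
        model = train_nb(train_docs, train_labels)
        for i in held_out:
            gold.append(labels[i])
            pred.append(predict_nb(model, docs[i]))
    return evaluate(gold, pred, sorted(set(labels)))
```

Calling `cross_validate(docs, labels)` on two parallel lists of post texts and manually assigned polarity labels returns the pooled accuracy and a dictionary mapping each class to its precision and recall. Every post is predicted exactly once, by a model that never saw it during training, which is the property that makes the cross-validated scores a fair comparison across classifiers.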
