
Sarcastic Sentiment Detection in Tweets Streamed in Real Time: A Big Data Approach

S. K. Bharti*, B. Vachha, R. K. Pradhan, K. S. Babu, S. K. Jena

Department of Computer Science & Engineering, National Institute of Technology, Rourkela 769008, India

*Santosh Kumar Bharti (corresponding author) is currently pursuing his PhD in Computer Science & Engineering at the National Institute of Technology Rourkela, India. His research interests include opinion mining and sarcasm sentiment detection (email: sbharti1984@.com). Bakhtyar Vachha is currently pursuing his M.Tech in Computer Science & Engineering at the National Institute of Technology Rourkela, India. His research interests include network security and big data (email: [email protected]). Ramkrushna Pradhan is currently pursuing his M.Tech dual degree in Computer Science & Engineering at the National Institute of Technology Rourkela, India. His research interests include speech translation, social media analysis and big data (email: [email protected]). Korra Sathya Babu is working as Assistant Professor in the Department of Computer Science & Engineering, National Institute of Technology Rourkela, India (email: [email protected]). Sanjay Kumar Jena is working as Professor in the Department of Computer Science & Engineering, National Institute of Technology Rourkela, India (email: [email protected]).

Abstract Sarcasm is a type of sentiment where people express their negative feelings using positive or intensified positive words in the text. While speaking, people often use heavy tonal stress and certain gestural clues like rolling of the eyes, hand movements, etc. to indicate sarcasm. In textual data, these tonal and gestural clues are missing, making sarcasm detection very difficult for an average human. Due to these challenges, researchers are showing interest in sarcasm detection in social media text, especially in tweets. The rapid growth of tweets in volume and the need to analyze them pose major challenges. In this paper, we propose a Hadoop-based framework that captures real-time tweets and processes them with a set of algorithms that identify sarcastic sentiment efficiently. We observe that the elapsed time for analyzing and processing under the Hadoop-based framework significantly outperforms the conventional methods, making it better suited for real-time streaming tweets.

© 2015 Published by Elsevier Ltd.

KEYWORDS: Big Data, Flume, Hadoop, Hive, MapReduce, Sarcasm, Sentiment, Tweets

1. Introduction

With the advent of smart mobile devices and high-speed Internet, users are able to engage with social media services like Facebook, Twitter, Instagram, etc. The volume of social data being generated is growing rapidly. Statistics from Global WebIndex show a 17% yearly increase in mobile users, with the total number of unique mobile users reaching 3.7 billion people [1]. Social networking websites have become a well-established platform for users to express their feelings and opinions on various topics such as events, individuals or products. Social media channels have become a popular medium to discuss ideas and to interact with people worldwide. For instance, Facebook claims to have 1.59 billion monthly active users, each one being a friend with 130 people on average [2]. Similarly, Twitter claims to have more than 500 million users, out of which more than 332 million are active [1]. Users post more than 340 million tweets and 1.6 billion search queries every day [1].

With such large volumes of data being generated, a number of challenges are posed. Some of them are accessing, storing and processing the data, verifying data sources, dealing with misinformation and fusing various types of data [3]. Moreover, almost 80% of the generated data is unstructured [4]. As technology developed, people were given more and more ways to interact, from simple text messaging and message boards to other more engaging and engrossing channels such as images and videos. These days, social media channels are usually the first to get feedback about current events and trends from their user base, allowing them to provide companies with invaluable data that can be used to position their products in the market as well as to gather rapid feedback from customers.

When an event commences or a product is launched, people start tweeting, writing reviews, posting comments, etc. on social media. People look at social media platforms to read reviews from other users about a product before they decide whether or not to purchase it. Organizations also depend on these sites to know the response of users to their products and subsequently use the feedback to improve those products. However, finding and verifying the legitimacy of opinions or reviews is a formidable task. It is difficult to manually read through all the reviews and determine which of the opinions expressed are sarcastic. In addition, the average human reader will have difficulty in recognizing sarcasm in tweets or product reviews, which may end up misleading them.

A tweet or a review may not state the exact orientation of the user directly, i.e., it may be expressed sarcastically. Sarcasm is a kind of sentiment which acts as an interfering factor in any text and can flip the polarity [5]. Example: 'I love being ignored #sarcasm'. Here, love expresses a positive sentiment in a negative context; therefore, the tweet is classified as sarcastic. Unlike a simple negation, sarcastic tweets contain positive words or even intensified positive words to convey a negative opinion, or vice versa. This creates a need for large volumes of reviews, tweets or feedback to be analyzed rapidly to predict their exact orientation. Moreover, each tweet may have to pass through a set of algorithms to be accurately classified.

In this paper, we propose a Hadoop-based framework [6] that allows the user to acquire and store tweets in a distributed environment [7] and process them for detecting sarcastic content in real time using the MapReduce [8] programming model. The mapper class works as a partitioner: it divides large volumes of tweets into small chunks and distributes them among the nodes in the Hadoop cluster. The reducer class works as a combiner and is responsible for collecting the processed tweets from each node in the cluster and assembling them to produce the final output. Apache Flume [9, 10] is used for capturing tweets in real time, as it is highly reliable, distributed and configurable. Flume uses an elegant design to make data loading from several sources into the Hadoop Distributed File System (HDFS) [11] easy and efficient. For processing the tweets stored in the HDFS, we use Apache Hive [12], which provides an SQL-like language called HiveQL to convert queries into mapper and reducer classes [12]. Further, we use natural language processing (NLP) techniques like POS tagging [13], parsing [14], text mining [15, 16] and sentiment analysis [17] to identify sarcasm in these processed tweets.

Our paper compares and contrasts the time requirements of our approach when run on a standard non-Hadoop implementation as well as on a Hadoop deployment, to find the improvement in performance when Hadoop is used. For real-time applications where millions of tweets need to be processed as fast as possible, we observe that the time taken by the single-node approach grows much faster than that of the Hadoop implementation. This suggests that for higher volumes of data it is more advantageous to use the proposed deployment for sarcasm analysis.

The contributions of this paper are as follows:

1. Capturing and processing real-time tweets using Apache Flume and Hive under the Hadoop framework.
2. A set of algorithms to detect sarcasm in tweets under the Hadoop framework.
3. Another set of algorithms to detect sarcasm in tweets.

The rest of this paper is organized as follows. Section 2 presents related work on capturing and processing data acquired through the Twitter streaming API, followed by work on sarcasm analysis of the captured data. Section 3 explains the preliminaries of this research. The proposed scheme is described in Section 4. Section 5 presents the performance analysis of the proposed schemes. Finally, the conclusion and recommendations for future work are drawn in Section 6.

2. Related work

In this section, the literature survey is presented in two parts. First, work on capturing and preprocessing real-time tweets is surveyed, and then the literature on sarcasm detection follows.

2.1. Capturing and preprocessing of tweets in large volume

The rapid adoption and growth of social networking platforms enable users to generate data at an alarming rate. Storing and processing such large data sets is a complex problem. Twitter is one such social networking platform that generates data continuously. In the existing literature, most researchers used Tweepy (an easy-to-use Python library for accessing the Twitter API) and Twitter4J (a Java library for accessing the Twitter API) for the aggregation of tweets from Twitter [5, 18, 19, 20, 21, 22]. The Twitter Application Programming Interface (API) [23] provides a streaming API [24] to allow developers to obtain real-time access to tweets. Bifet and Frank [25] discuss the challenges of capturing Twitter data streams. Tufekci [26] examined the methodological and conceptual challenges of social media based big data operations, with special attention to the validity and representativeness of social media big data analysis. Due to restrictions placed by Twitter on the use of their retrieval APIs, one can only download a limited amount of tweets in a specified time frame using these APIs and libraries. Getting a larger amount of tweets in real time is a challenging task, and there is a need for efficient techniques to acquire a large amount of tweets from Twitter. Researchers are evaluating the feasibility of using the Hadoop ecosystem [6] for the storage and processing [22, 27, 28, 29] of large amounts of tweets from Twitter. Shirahatti et al. [27] used Apache Flume [10] with the Hadoop ecosystem to collect tweets from Twitter. Ha et al. [22] used Topsy with the Hadoop ecosystem for gathering tweets from Twitter; furthermore, they analyzed the sentiment and emotion information of the collected tweets in their research. Taylor et al. [28] used the Hadoop framework in applications in the bioinformatics domain.

2.2. Sarcasm sentiment analysis

Sarcasm sentiment analysis is a rapidly growing area of NLP, with research ranging from word, phrase and sentence level classification [5, 18, 19, 30] to document [31] and concept level classification [21]. Research is progressing in finding ways for efficient analysis of sentiment with better accuracy in written text, as well as in analyzing irony, humor and sarcasm within social media data. Sarcastic sentiment detection is classified into three categories based on the text features used for classification: lexical, pragmatic and hyperbolic, as shown in Fig. 1.

2.2.1. Lexical feature based classification

Text properties such as unigrams, bigrams, n-grams, etc. are classified as lexical features of a text. Several authors used these features to identify sarcasm. Kreuz et al. [32] introduced this concept for the first time and observed that lexical features play a vital role in detecting irony and sarcasm in text. Kreuz et al. [33], in their subsequent work, used these lexical features along with syntactic features to detect sarcastic tweets. Davidov et al. [30] used pattern-based (high-frequency words and content words) and punctuation-based methods to build a weighted k-nearest neighbor (kNN) classification model to perform sarcasm detection. Tsur et al. [34] observed that bigram-based features produce better results in detecting sarcasm in tweets and Amazon product reviews. Gonzalez-Ibanez et al. [18] explored numerous lexical features (derived from LIWC [35] and WordNet-Affect [36]) to identify sarcasm. Riloff et al. [5] used a well-constructed lexicon-based approach to detect sarcasm, and for lexicon generation they used unigram, bigram and trigram features. Bharti et al. [19] considered bigrams and trigrams to generate bags of lexicons for sentiment and situation in tweets. Barbieri et al. [37] considered seven lexical features to detect sarcasm through its inner structure, such as unexpectedness, the intensity of the terms or the imbalance between registers.

2.2.2. Pragmatic feature based classification

The use of symbolic and figurative text in tweets is frequent due to the limitation on the message length of a tweet. These symbolic and figurative texts are called pragmatic features (such as smilies, emoticons, replies, @user, etc.). Pragmatic features are powerful features for identifying sarcasm in tweets, and several authors have used them in their work to detect sarcasm. They are one of the key features used by Kreuz et al. [33] to detect sarcasm in text. Carvalho et al. [38] used pragmatic features like emoticons and special punctuation to detect irony in newspaper text data. Gonzalez-Ibanez et al. [18] further explored this feature with some more parameters like smilies and replies and developed a sarcasm detection system using the pragmatic features of Twitter data. Tayal et al. [39] also used pragmatic features in political tweets to predict which party would win the election. Similarly, Rajadesingan et al. [40] used psychological and behavioral features of users' present and past tweets to detect sarcasm.

2.2.3. Hyperbole feature based classification

Hyperbole is another key feature often used in sarcasm detection from textual data. A hyperbolic text contains one of the text properties such as an intensifier, an interjection, quotes, punctuation, etc. Previous authors used these hyperbole features and achieved good accuracy in their research on detecting sarcasm in tweets. Utsumi [41] discussed extreme adjectives and adverbs and how the presence of these two intensifies text; most often, it provides an implicit way to display negative attitudes, i.e., sarcasm. Kreuz et al. [33] discussed other hyperbolic terms such as interjections and punctuation and showed how hyperbole is useful in sarcasm detection. Filatova [31] used the hyperbole features in document-level text; according to this work, the phrase or sentence level is not sufficient for good accuracy, and the textual context of the document is considered to improve the accuracy. Liebrecht et al. [42] explained hyperbole features with examples of utterances: 'Fantastic weather' when it rains is identified as sarcastic with more ease than an utterance without a hyperbole ('the weather is good' when it rains). Lunando et al. [20] stated that tweets containing interjection words such as wow, aha, yay, etc. have a higher chance of being sarcastic, and they developed a system for sarcasm detection for Indonesian social media. Tungthamthiti et al. [21] explored concept-level knowledge using the hyperbolic words in sentences and the indirect contradiction between sentiment and situations such as raining and bad weather, which are conceptually the same; therefore, if 'raining' is present in any sentence, then one can assume 'bad weather'. Bharti et al. [19] considered interjections as a hyperbole feature to detect sarcasm in tweets that start with an interjection word.

Based on this classification, a consolidated summary of previous studies related to sarcasm identification is shown in Table 1. It lists the types of approaches used by previous authors (denoted as A1 and A2), the various types of sarcasm occurring in tweets (denoted as T1, T2, T3, T4, T5, T6, and T7), the text features (denoted as F1, F2, and F3) and the datasets from different domains (denoted as D1, D2, D3, D4, and D5), mostly from Twitter data. The details of these notations are given in Table 2.

From Table 1, it is observed that only Bharti et al. [19] have worked on sarcasm types T2 and T3. Lunando et al. [20] discussed that tweets with interjections are classified as sarcastic. Further, Rajadesingan et al. [40] are the only authors who worked on sarcasm type T4. Most of the researchers identified sarcasm in tweets of type T1. None of the authors have worked on sarcasm types T5, T6 and T7 until now. In this work, we consider these research gaps as challenges and propose a set of algorithms to tackle them.

Fig. 1: Classification of sarcasm detection based on the text features used.

3. Preliminaries

This section describes the overall framework for capturing and analyzing tweets streamed in real time. In addition, the architecture of the Hadoop HDFS, followed by POS tagging, parsing and sentiment analysis of a given phrase or sentence, is elaborated.

3.1. Framework for sarcasm analysis in real time tweets

The proposed system uses the Hadoop framework to process and store the tweets streamed in real time. These tweets are retrieved from Twitter using the Twitter streaming API (Twitter4J), as shown in Fig. 2. The Flume module is responsible for communicating with the Twitter streaming API and retrieving tweets matching certain criteria, trends or keywords. The tweets retrieved by Flume are in JavaScript Object Notation (JSON) format and are passed on to the HDFS. Oozie is a module in Hadoop that provides the output from one stage as the input to the next. Oozie is used to partition the incoming tweets into blocks of tweets, partitioned on an hourly basis. These partitions are passed on to the Hive module, which then parses the incoming JSON tweets into a format suitable for consumption by the sarcasm detection engine (SDE). The parsed tweets are stored again in the HDFS and are later retrieved by the SDE for further processing and for obtaining the final sentiment summarization.

3.2. Parallel HDFS

To increase the throughput of the system and handle the massive volume of tweets, the parallel architecture of HDFS is used, as shown in Fig. 3. The overall file system consists of a metadata file, a master node and multiple slave nodes that are managed by the master node.

The metadata file contains two subfiles, namely the fsimage and the edits file. The fsimage contains the complete state of the file system at a given instance of time, and the edits file contains the log of changes to the file system after the most recent fsimage was made. The master node contains three entities, namely the name node, the secondary name node and a data node. All three entities in the master node can communicate with each other. The name node is responsible for the overall functioning of the file system. The secondary name node is responsible for updating and maintaining the name node as well as managing the updates to the metadata. The job tracker is a service in Hadoop that interfaces between the name node and the task trackers and matches the jobs with the closest available task tracker.

Table 1: Previous studies in sarcasm detection in text.

Study (Kreuz et al., 1995; Utsumi et al., 2000; Verma et al., 2004; Bhattacharyya et al., 2004; Kreuz et al., 2007; Chaumartin et al., 2007; Carvalho et al., 2009; Tsur et al., 2010; Davidov et al., 2010; Gonzalez-Ibanez et al., 2011; Filatova et al., 2012; Riloff et al., 2013; Lunando et al., 2013; Liebrecht et al., 2013; Lukin et al., 2013; Tungthamthiti et al., 2014; Peng et al., 2014; Raquel et al., 2014; Kunneman et al., 2014; Barbieri et al., 2014; Tayal et al., 2014; Pielage et al., 2014; Rajadesingan et al., 2015; Bharti et al., 2015), each marked against the approaches (A1: A11/A12, A2), the sarcasm types (T1–T7), the feature types (F1, F2, F3: F31–F34) and the domains (D1–D5) defined in Table 2.

Fig. 2: System model for capturing and analysing sarcasm sentiment in tweets.

The slave nodes contain two entities each, namely a data node and a task tracker. Both entities can communicate with each other within the slave node. The data node is responsible for handling the data blocks and providing the services for storage and retrieval of the data as requested by the name node. The task tracker is responsible for processing the input according to the user's requirements and returning the output.

In the parallel HDFS architecture, the name node communicates with the various data nodes in the slave nodes while, simultaneously, the job tracker in the name node coordinates with the task trackers on the slaves in parallel, resulting in a high rate of output which is fed into the SDE.

3.3. Sarcasm detection engine

To identify the sentiment of a given tweet, the tweet passes through the MapReduce functions for sentiment classification. The tweet is classified as either negative, positive or neutral by the detection engine. Fig. 4 depicts the automated SDE, which takes tweets as input and produces the actual sentiment of each tweet as output. Once a tweet is classified as either positive or negative, further checks are required to confirm whether it carries an actual positive/negative sentiment or a sarcastic sentiment.

3.4. Parts-of-speech tagging

Parts-of-speech (POS) tagging divides sentences or paragraphs into words and assigns the corresponding parts-of-speech information to each word based on its relationship with adjacent and related words in a phrase, sentence, or paragraph.

Table 2: Types, features and domains of sarcasm detection

Types of approaches used in sarcasm detection
  A1  Machine learning based
      A11  Supervised
      A12  Semi-supervised
  A2  Corpus based
Types of sarcasm occurring in text
  T1  Contrast between positive sentiment and negative situation
  T2  Contrast between negative sentiment and positive situation
  T3  Tweet starts with an interjection word
  T4  Likes and dislikes contradiction (behavior based)
  T5  Tweet contradicting universal facts
  T6  Tweet carries positive sentiment with antonym pair
  T7  Tweet contradicting time-dependent facts
Types of features
  F1  Lexical: unigram, bigram, trigram, n-gram, #hashtag
  F2  Pragmatic: smilies, emoticons, replies
  F3  Hyperbole: interjection, intensifier, punctuation mark, quotes
      F31  Interjection: yay, oh, wow, yeah, nah, aha, etc.
      F32  Intensifier: adverbs, adjectives
      F33  Punctuation mark: !!!!!, ????
      F34  Quotes: “ ”, ‘ ’
Types of domains
  D1  Tweets from Twitter
  D2  Online product reviews
  D3  Website comments
  D4  Books
  D5  Online discussion forums

Fig. 3: Parallel HDFS architecture

In this paper, a Hidden Markov Model (HMM) based POS tagger [13] is used to identify the correct POS tag information of the given words. For example, the POS tag information for the sentence "Love has no finite coverage" is love-NN, has-VBZ, no-DT, finite-JJ, and coverage-NN, where NN, JJ, VBZ and DT denote the notations for noun, adjective, verb and determiner, respectively. The Penn Treebank tag set [43] notations are used to assign a tag to a particular word. It is a Brown-corpus style of tagging having 44 tags.

Fig. 4: Sarcasm detection engine.

3.5. Parsing

Parsing is the process of analyzing the grammatical structure of a sentence, identifying its parts of speech and the syntactic relations of the words in it. When a sentence is passed through a parser, the parser divides the sentence into words and identifies the POS tag information. With the help of the POS information and the syntactic relations, it forms units like subject, verb, and object, then determines the relations between these units and generates a parse tree. In this paper, a Python-based package called TextBlob has been used for parsing. An example of parsing for the text "I love waiting forever for my doctor" is I/PRP/B-NP/O, love/NN/I-NP/O, waiting/VBG/B-VP/O, forever/RB/B-ADVP/O, for/IN/B-PP/B-PNP, my/PRP$/B-NP/I-PNP, doctor/NN/I-NP/I-PNP. With the help of the parse data, two examples of parse trees are shown in Fig. 5 and Fig. 6.

Fig. 5: Parse tree for a tweet: I love waiting forever for my doctor.

Fig. 6: Parse tree for a tweet: I hate Australia in cricket because they always win.

3.6. Sentiment analysis

Sentiment analysis is a mechanism to recognize one's opinion, polarity, attitude and orientation towards any target like movies, individuals, events, sports, products, organizations, locations, services, etc. To identify the sentiment of a given phrase, we use pre-defined lists of positive and negative words such as SentiWordNet [44], a standard list of positive and negative English words. Using the SentiWordNet lists along with Equations 1, 2 and 3, we find the sentiment score for a given phrase or sentence.

PR = PWP / TWP    (1)

NR = NWP / TWP    (2)

Sentiment Score = PR − NR    (3)

where PR is the positive ratio, NR is the negative ratio, PWP is the number of positive words in the given phrase, NWP is the number of negative words in the given phrase, and TWP is the total number of words in the given phrase.
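As an illustration, the phrase-level score of Equations 1, 2 and 3 can be computed with a few lines of Python. This is a minimal sketch assuming the positive and negative word lists have already been loaded from SentiWordNet [44] or a similar lexicon; the lists shown below are placeholders only.

def sentiment_score(phrase, positive_words, negative_words):
    """Compute PR, NR and the sentiment score of Equations 1-3."""
    words = phrase.lower().split()
    twp = len(words)                                    # TWP: total words in phrase
    pwp = sum(1 for w in words if w in positive_words)  # PWP: positive words
    nwp = sum(1 for w in words if w in negative_words)  # NWP: negative words
    pr = pwp / twp if twp else 0.0                      # Equation (1)
    nr = nwp / twp if twp else 0.0                      # Equation (2)
    return pr - nr                                      # Equation (3)

# Placeholder lexicons; in the paper these come from SentiWordNet [44].
positive = {"love", "great", "happy"}
negative = {"hate", "ignored", "bad"}
print(sentiment_score("I love being ignored", positive, negative))  # 0.0: one positive, one negative word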

4. Proposed scheme

There is an increasing need for automatic techniques to capture and process real-time tweets and analyze their sarcastic sentiment. Such analysis provides useful information for market analysis and risk management applications. Therefore, we propose the following approaches towards sarcasm detection in tweets:

• Capturing and processing real-time tweets using Flume and Hive.
• An HMM-based algorithm for POS tagging.
• MapReduce functions for three approaches to detect sarcasm in tweets:
  1. Parsing based lexicon generation algorithm
  2. Interjection word start
  3. Positive sentiment with antonym pair
• Other approaches to detect sarcasm in tweets:
  1. Tweets contradicting universal facts
  2. Tweets contradicting time-dependent facts
  3. Likes dislikes contradiction

4.1. Capturing and processing real time streaming tweets using Flume and Hive

The Twitter streaming API returns a constant stream of tweets in JSON format, which is then stored in the HDFS as shown in Fig. 2. To avoid issues related to security and to avoid writing code that requires complicated integration with secure clusters, we prefer to use the existing components within Cloudera Hadoop [29]. This allows us to directly store the data retrieved by the API into the HDFS. We use Apache Flume to store the data in the HDFS. Flume is a data ingestion system that is defined by setting up channels in which data flows between sources and sinks. Each piece of data is an event, and each such event goes through a channel. The Twitter API does the work of the source here, and the sink is a system that writes out the data to the HDFS. Along with the data capture, the Flume module allows us to set up custom filters and keyword-based searches that further narrow down the tweets to just the ones relevant to our requirements.

Once the data from the Twitter API is fed into the HDFS, the data must be pre-processed to convert the tweets stored in JSON format into usable text for the SDE. We make use of the Oozie module for handling the workflow, which is scheduled to run at periodic intervals. We configure Oozie to partition the data in the HDFS on the basis of hourly retrievals and to load the last hour's data into Hive, as shown in Fig. 2.

Hive is another module in Hadoop that allows one to translate and load data with the help of a Serializer-Deserializer. This allows us to convert the JSON tweets into a query-able format; we then add these entries back into the HDFS for processing by the SDE.
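The JSON-to-columns step that Hive's Serializer-Deserializer performs can be pictured with a small Python sketch. This is only an illustration of the transformation, not the production Hive path; the field names (id_str, created_at, user.screen_name, text) are standard Twitter JSON fields, and the tab-separated output format is an assumption made for the example.

import json

def flatten_tweet(json_line):
    """Turn one raw JSON tweet into a flat, tab-separated record for the SDE."""
    tweet = json.loads(json_line)
    fields = [
        tweet.get("id_str", ""),
        tweet.get("created_at", ""),
        tweet.get("user", {}).get("screen_name", ""),
        tweet.get("text", "").replace("\t", " ").replace("\n", " "),
    ]
    return "\t".join(fields)

# Example with a minimal hand-made record (real records carry many more fields).
raw = ('{"id_str": "1", "created_at": "Mon May 16 10:00:00 +0000 2016", '
       '"user": {"screen_name": "someone"}, "text": "I love being ignored #sarcasm"}')
print(flatten_tweet(raw))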
4.2. HMM-based POS tagging

In this paper, an HMM-based POS tagger is deployed to evaluate accurate POS tag information for the Twitter dataset, as shown in Algorithm 1 and Algorithm 2. Algorithm 1 trains the system using 500,000 pre-tagged (according to the Penn Treebank style) American English words from the American National Corpus (ANC) [45, 46]. Algorithm 2 evaluates the POS tag information of the words in the given dataset.

According to Algorithm 1, the HMM uses pre-tagged American English words [45, 46] as input and creates three dictionary objects, namely WT, TT and T. WT stores the number of occurrences of each word with its associated tag in the training corpus. Similarly, TT stores the number of occurrences of each bigram tag pair in the corpus, and T stores the number of occurrences of each unigram tag. For each word in a sentence, it checks whether the word is the starting word of the sentence or not.

If a word is the starting word, it assumes the previous tag to be '$'. Otherwise, the previous tag is the tag of the previous word in the respective sentence. It increments the occurrence counts of the various tags through the dictionary objects WT, TT and T. Finally, it creates a probability table using the dictionary objects WT, TT and T.

Algorithm 1: POS training
  Data: dataset := annotated corpus
  Result: WT := dictionary with ⟨key, value⟩ pairs counting each word with its tag in the corpus;
          TT := dictionary with ⟨key, value⟩ pairs counting each bigram tag pair;
          T := dictionary with ⟨key, value⟩ pairs counting each tag
  while sentence in corpus do
    while word in sentence do
      if word == first word then
        previous_tag = $
        current_tag = POS tag of current word
        TT[previous_tag, current_tag]++
        T[current_tag]++
        WT[word, current_tag]++
      else
        previous_tag = POS tag of previous word
        current_tag = POS tag of current word
        TT[previous_tag, current_tag]++
        T[current_tag]++
        WT[word, current_tag]++
      end
    end
  end

Algorithm 2 finds all the possible tags of a given word (for tag evaluation) using the pre-tagged corpus [45, 46] and applies Equation 4 [47] if the word is the starting word of the respective sentence; otherwise it applies Equation 5 [47]. Next, it selects the tag whose probability value is maximum. For example, once a POS tag determiner (DT), such as 'the', is encountered, the probability that the next word is a noun might be 40% and that it is a verb 20%. Once the model finishes its training, it is used to determine whether 'can' in 'the can' is a noun (as it should be) or a verb.

tag = argmax over t in APT of [TT($, t) / T($)] × [WT(word, t) / T(t)]    (4)

tag = argmax over t in APT of [TT(P, t) / T(P)] × [WT(word, t) / T(t)]    (5)

where APT is the set of all possible tags of the word and P is the tag of the previous word.

Algorithm 2: POS testing
  Data: dataset := corpus of real-time streaming tweets
  Result: tag sequence of each tweet in the dataset
  while tweet in dataset do
    while word in tweet do
      APT = find all possible tags(word)
      if word == first word then
        tag = argmax over t in APT of [TT($, t)/T($)] × [WT(word, t)/T(t)]
      else
        P = tag of previous word
        tag = argmax over t in APT of [TT(P, t)/T(P)] × [WT(word, t)/T(t)]
      end
    end
  end
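A compact Python rendering of Algorithms 1 and 2 is sketched below. It is a simplified, greedy version of the bigram HMM described above (each word keeps only the locally best tag rather than a full search over tag sequences), the sentence-start count for T($) is an assumption of this sketch, and the tiny pre-tagged corpus only stands in for the ANC training data [45, 46].

from collections import defaultdict

WT = defaultdict(int)       # (word, tag) counts
TT = defaultdict(int)       # (previous_tag, tag) counts
T = defaultdict(int)        # tag counts
tags_of = defaultdict(set)  # word -> tags seen in training (APT)

def train(tagged_sentences):
    """Algorithm 1: count word/tag, tag-bigram and tag occurrences."""
    for sentence in tagged_sentences:
        T["$"] += 1          # sentence starts, used as T($) in Equation (4)
        prev = "$"
        for word, tag in sentence:
            TT[(prev, tag)] += 1
            T[tag] += 1
            WT[(word, tag)] += 1
            tags_of[word].add(tag)
            prev = tag

def tag_sentence(words):
    """Algorithm 2 (greedy): pick the tag maximising Equations 4/5 per word."""
    result, prev = [], "$"
    for word in words:
        apt = tags_of.get(word, set(T) - {"$"})   # unseen word: consider every tag
        best = max(apt, key=lambda t: (TT[(prev, t)] / T[prev] if T[prev] else 0.0)
                                      * (WT[(word, t)] / T[t] if T[t] else 0.0))
        result.append((word, best))
        prev = best
    return result

# Toy training data standing in for the 500,000-word ANC corpus.
train([[("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("empty", "JJ")],
       [("i", "PRP"), ("can", "MD"), ("wait", "VB")]])
print(tag_sentence(["the", "can"]))   # [('the', 'DT'), ('can', 'NN')]: 'can' after 'the' resolves to a noun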
4.3. MapReduce functions for sarcasm analysis

Here, the Map function comprises three approaches to detect sarcasm. Each of the approaches is detailed below.

4.3.1. Parsing based lexicon generation algorithm

The MapReduce function for the parsing based lexicon generation algorithm (PBLGA) is based on our previous article [19]. It takes tweets as input from the HDFS and parses them into phrases such as noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), etc. These phrases are stored in a phrase file for further processing. The phrase file is then passed on to a rule-based classifier that classifies sentiment phrases and situation phrases, as shown in the mapper part of Fig. 7, and the results are stored in a sentiment phrase file and a situation phrase file. The output of the mapper class (the sentiment phrase file and the situation phrase file) is then passed to the reducer class as input. The reducer class calculates the sentiment score (as explained in Section 3.6) of each phrase in both the sentiment and the situation phrase files, and outputs an aggregated positive or negative score for each phrase in terms of the sentiment and the situation of the tweet. Based on whether the score is positive or negative, the phrases are stored in the corresponding phrase file, as shown in the reducer class of Fig. 7. PBLGA generates four files as output, namely the positive sentiment, negative sentiment, positive situation and negative situation files. Furthermore, we use these four files to detect sarcasm in tweets whose structure shows a contradiction between a positive sentiment and a negative situation, or vice versa, as shown in Algorithm 3.
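As a picture of how these four lexicon files are used, a plain-Python sketch of the PBLGA sarcasm test (the check formalized in Algorithm 3 below) is given here. The lexicon contents are placeholders, and in the framework this logic runs inside the MapReduce job rather than as a standalone script.

def load_phrases(path):
    """Load one lexicon file (one phrase per line) into a set."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_sarcastic_pblga(tweet, pos_sent, neg_sit, neg_sent, pos_sit):
    """Positive sentiment + negative situation (or the reverse) marks sarcasm."""
    text = tweet.lower()
    has_pos_sent = any(p in text for p in pos_sent)
    has_neg_sit = any(p in text for p in neg_sit)
    has_neg_sent = any(p in text for p in neg_sent)
    has_pos_sit = any(p in text for p in pos_sit)
    return (has_pos_sent and has_neg_sit) or (has_neg_sent and has_pos_sit)

# Placeholder lexicons standing in for the four PBLGA output files.
pos_sent = {"i love", "so happy"}
neg_sit = {"being ignored", "waiting forever"}
neg_sent = {"i hate"}
pos_sit = {"holiday"}
print(is_sarcastic_pblga("I love waiting forever for my doctor",
                         pos_sent, neg_sit, neg_sent, pos_sit))  # True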

Fig. 7: Procedure to obtain sentiment and situation phrases from tweets.

According to Algorithm 3, it takes the testing tweets and the four bags of lexicons generated using PBLGA. If a testing tweet matches any positive sentiment phrase from the positive sentiment file, then it subsequently checks for any matches against the negative situation file. If both checks match, then the testing tweet is sarcastic; similarly, it checks for sarcasm with a negative sentiment in a positive situation. Otherwise, the given tweet is not sarcastic. Both algorithms are executed under the Hadoop framework as well as without the Hadoop framework to compare the running time.

4.3.2. Interjection word start

The MapReduce function for interjection word start (IWS) is also based on [19], as shown in Fig. 8. This approach is applicable to tweets that start with an interjection word such as aha, wow, nah, uh, etc. In this approach, the tweet that is sent to the mapper is first parsed into its constituent tags using Algorithm 1 and Algorithm 2. Then, the tags are separated into the first tag, the second tag and the remaining tags of each tweet. The output of this stage gives us three lists: the list of first tags, which stores the first tag of the tweet; the list of second tags, which stores the second tag of the tweet; and the list of remaining tags, which stores the remaining tags of the tweet. The lists are then passed to a rule-based pattern, as given in the mapper class of Fig. 8, which checks whether the first tag is an interjection, i.e., UH (the interjection tag notation), and the second tag is either an adjective or an adverb; if so, the tweet is classified as sarcastic. Otherwise, it checks whether the first tag is an interjection and the remaining tags contain an adverb followed by an adjective, an adjective followed by a noun, or an adverb followed by a verb; if so, the tweet is sarcastic, else it is not sarcastic. If the pattern does not find any match in a given tweet, the tweet is not sarcastic. The IWS algorithm also executes under the Hadoop framework as well as without the Hadoop framework to compare the running time.
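The IWS rule can be expressed as a short tag-pattern check in Python. This sketch assumes the tweet has already been tagged (for example, by the HMM tagger of Section 4.2) into a list of Penn Treebank tags; the helper below only encodes the pattern described above.

ADJ = {"JJ", "JJR", "JJS"}
ADV = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS", "NNP", "NNPS"}
VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def is_sarcastic_iws(tags):
    """IWS rule: interjection first, then an intensifying adjective/adverb pattern."""
    if not tags or tags[0] != "UH":
        return False
    if len(tags) > 1 and (tags[1] in ADJ or tags[1] in ADV):
        return True
    # Otherwise look for ADV+ADJ, ADJ+NOUN or ADV+VERB anywhere in the rest.
    rest = tags[1:]
    for a, b in zip(rest, rest[1:]):
        if (a in ADV and b in ADJ) or (a in ADJ and b in NOUN) or (a in ADV and b in VERB):
            return True
    return False

# "Wow great game" tagged as UH JJ NN: interjection followed by an adjective.
print(is_sarcastic_iws(["UH", "JJ", "NN"]))  # True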

Fig. 8: Procedure to detect sarcasm in tweets that start with an interjection word.

4.3.3. Positive sentiment with antonym pair

The MapReduce function for positive sentiment with antonym pair (PSWAP) is a novel approach, shown in Fig. 9, to determine whether a tweet is sarcastic or not. The tweet that is sent to the mapper is first parsed into its constituent tags using Algorithm 1 and Algorithm 2. The output of this stage gives us a bag of tags which is then passed to a rule-based classifier, as given in the mapper class of Fig. 9, which looks for antonym pairs of certain tags such as nouns, verbs, adjectives and adverbs. If any antonym pair is found, the tweet is stored in a separate file. The reducer class is responsible for generating a sentiment score using Equations 1, 2 and 3 for the tweets contained in the file of antonym tweets, and the tweets are sorted according to their sentiment score into positive and negative sentiment tweets. It then classifies all the positive sentiment tweets as sarcastic, as shown in the reducer class of Fig. 9. In this approach, the antonym pairs of nouns, verbs, adjectives and adverbs are taken from the NLTK WordNet [48]. The PSWAP algorithm is executed under the Hadoop framework as well as without the Hadoop framework to compare the running time.
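A minimal sketch of the antonym-pair check in PSWAP using the NLTK WordNet interface [48] is shown below. It assumes the WordNet corpus has been downloaded (nltk.download('wordnet')) and that the tweet's content words have already been extracted; combining it with the sentiment score of Equations 1, 2 and 3 completes the rule.

from itertools import combinations
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def antonyms_of(word):
    """Collect all WordNet antonym lemmas of a word (any part of speech)."""
    result = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                result.add(ant.name().lower())
    return result

def has_antonym_pair(words):
    """PSWAP mapper rule: does the tweet contain a word together with its antonym?"""
    words = [w.lower() for w in words]
    return any(b in antonyms_of(a) for a, b in combinations(words, 2))

# 'love' and 'hate' form an antonym pair, so this tweet would reach the reducer.
print(has_antonym_pair(["I", "love", "to", "hate", "Mondays"]))  # True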

Fig. 9: Procedure to detect sarcasm in positive sentiment tweets with an antonym pair.

4.4. Other approaches for sarcasm detection in tweets

We propose three other novel approaches to identify sarcasm in three different tweet types, i.e., T4, T5 and T7, as shown in Table 2. Due to the unavailability of various aspects needed to model these algorithms, their deployment in the Hadoop framework has not been done. However, the methods were implemented without the Hadoop framework. Each of the methods is described below.

Algorithm 3: PBLGA testing
  Data: dataset := tweets for testing, bags of lexicons
  Result: classification := sarcastic or non-sarcastic
  while tweet in dataset do
    count = 0
    sarcasmFlag = False
    while word in tweet do
      if word == (any phrase in positive sentiment lexicons) && (count == 0) then
        count = 1
        check negative situation lexicons
        continue
      end
      if word == (any phrase in negative situation lexicons) && (count == 1) then
        sarcasmFlag = True
        break
      else
        if word == (any phrase in negative sentiment lexicons) && (count == 0) then
          count = 1
          check positive situation lexicons
          continue
        end
        if word == (any phrase in positive situation lexicons) && (count == 1) then
          sarcasmFlag = True
          break
        end
      end
    end
    if sarcasmFlag == True then
      given tweet is sarcastic
    else
      given tweet is not sarcastic
    end
  end

4.4.1. Tweets contradicting universal facts

Tweets contradicting with universal facts (TCUF) is based on universal facts. In this approach, universal facts are used as a feature to identify sarcasm in tweets, as shown in Algorithm 4. For example, 'the sun rises in the east' is a universal fact. Algorithm 4 takes the corpus of universal fact sentences as input and generates a list of ⟨key, value⟩ pairs for every sentence in the corpus. To generate a ⟨key, value⟩ pair, it finds the triplet of (subject, verb, object) values according to the Rusu triplet extraction method [49] for every sentence. Furthermore, it combines the subject and verb together as the key and the object as the value. The ⟨key, value⟩ pair for the sentence "the sun rises in the east" is ⟨(sun, rises), east⟩.

Algorithm 4: Tweet contradicting universal facts
  Data: dataset := corpus of universal facts
  Result: a ⟨Key, Value⟩ pair file
  Notation: S: subject, V: verb, O: object, T: tweet, C: corpus, PF: parse file, TWP: tweet-wise parse phrase, UFF: universal fact file
  Initialization: PF = {φ}, UFF = {φ}
  while T in C do
    p = find parsing(T)
    PF ← PF ∪ p
  end
  while TWP in PF do
    S = find subject(TWP)
    V = find verb(TWP)
    O = find object(TWP)
    Key ← (S + V)
    Value ← O
    UFF ← ⟨Key, Value⟩
  end

Identifying sarcasm in tweets using universal facts is shown in Algorithm 5. It takes the universal fact ⟨key, value⟩ pair file and the test tweets as input and extracts the triplet values (subject, verb, object) from the test tweets using the Rusu triplet method [49]. Furthermore, we form ⟨key, value⟩ pairs of the testing tweet using the subject, verb and object. If the key of the testing tweet matches any key in the universal fact ⟨key, value⟩ pair file, then the value of the testing tweet is checked against the corresponding value in the universal fact ⟨key, value⟩ pair file. If both the ⟨key, value⟩ pairs match, then the current testing tweet is not sarcastic; otherwise, the tweet is sarcastic.
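A small Python sketch of the fact-file construction and look-up in Algorithms 4 and 5 is given below. The triplet extractor is reduced to a placeholder (a real implementation would follow the Rusu triplet method [49] over a parse tree), and the check is a simplified variant that only judges tweets whose (subject, verb) key already appears in the fact file.

def extract_triplet(sentence):
    """Placeholder triplet extractor: naive (subject, verb, object) guess.
    The paper uses the Rusu triplet method [49] on a full parse instead."""
    words = sentence.lower().replace(".", "").split()
    # crude assumption: first word = subject, second = verb, last = object
    return words[0], words[1], words[-1]

def build_fact_file(fact_sentences):
    """Algorithm 4 sketch: map (subject, verb) -> object for every universal fact."""
    uff = {}
    for sentence in fact_sentences:
        s, v, o = extract_triplet(sentence)
        uff[(s, v)] = o
    return uff

def is_sarcastic_tcuf(tweet, uff):
    """Simplified TCUF check: a tweet whose (subject, verb) is a known fact key
    but whose object contradicts the stored value is flagged as sarcastic."""
    s, v, o = extract_triplet(tweet)
    return (s, v) in uff and uff[(s, v)] != o

facts = ["sun rises in the east"]
uff = build_fact_file(facts)
print(is_sarcastic_tcuf("sun rises in the west", uff))  # True: contradicts the fact
print(is_sarcastic_tcuf("sun rises in the east", uff))  # False: matches the fact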

Algorithm 5: TCUF testing tweets
  Data: dataset := corpus of tweets, UFF
  Result: classification := sarcastic or not sarcastic
  Notation: S: subject, V: verb, O: object, T: tweet, C: corpus, PF: parse file, TWP: tweet-wise parse phrase
  Initialization: PF = {φ}
  while T in C do
    p = find parsing(T)
    PF ← PF ∪ p
  end
  while TWP in PF do
    SarcasticFlag = 1
    S = find subject(TWP)
    V = find verb(TWP)
    O = find object(TWP)
    Key ← (S + V)
    Value ← O
    form ⟨Key, Value⟩ pair for the tweet
    if (Key in UFF) && (Value == UFF[Key].Value) then
      SarcasticFlag = 0
    end
    return SarcasticFlag
  end

4.4.2. Tweets contradicting time-dependent facts

Tweets contradicting with time-dependent facts (TCTDF) are based on temporal facts. In this approach, time-dependent facts (ones that may change over a certain time period) are used as a feature to identify sarcasm in tweets, as shown in Algorithm 6. For instance, '@MirzaSania becomes world number one. Great day for Indian tennis' is a time-dependent fact sentence; after some time, someone else will be the number one tennis player. Newspaper headlines are used as a corpus for time-dependent facts. Algorithm 6 uses newspaper headlines as an input corpus and generates a list of ⟨key, value⟩ pairs for every headline in the corpus. To generate a ⟨key, value⟩ pair, it finds the triplet of (subject, verb, object) values according to the Rusu triplet method [49] for every sentence. Furthermore, it combines the subject and verb together as the key and combines the object and the time-stamp as the value. The time-stamp is the news headline date. The ⟨key, value⟩ pair for the sentence 'Wow, Australia won the cricket world cup again in 2015' is ⟨(Australia, won), (cricket world cup, 2015)⟩.

Algorithm 6: Tweet contradicting time-dependent facts
  Data: dataset := corpus of time-dependent facts (newspaper headlines)
  Result: a ⟨Key, Value⟩ pair file
  Notation: S: subject, V: verb, O: object, TS: time-stamp, T: tweet, C: corpus, PF: parse file, TWP: tweet-wise parse phrase, UFF: fact file
  Initialization: PF = {φ}, UFF = {φ}
  while T in C do
    p = find parsing(T)
    PF ← PF ∪ p
  end
  while TWP in PF do
    S = find subject(TWP)
    V = find verb(TWP)
    O = find object(TWP)
    Key ← (S + V)
    Value ← (O + TS)
    UFF ← ⟨Key, Value⟩
  end

Identifying sarcasm in tweets using time-dependent facts is similar to TCUF, as shown in Algorithm 7. The only difference is in the value of the ⟨key, value⟩ pair. While matching the ⟨key, value⟩ pair of the testing tweet with the ⟨key, value⟩ pairs in the file to identify sarcasm using the TCTDF approach, one needs to match the object as well as the time-stamp together as the value. If both match, then the current testing tweet is not sarcastic; otherwise, it is sarcastic.

Algorithm 7: TCTDF testing tweets
  Data: dataset := corpus of tweets, UFF
  Result: classification := sarcastic or not sarcastic
  Notation: S: subject, V: verb, O: object, TS: time-stamp, T: tweet, C: corpus, PF: parse file, TWP: tweet-wise parse phrase
  Initialization: PF = {φ}
  while T in C do
    p = find parsing(T)
    PF ← PF ∪ p
  end
  while TWP in PF do
    SarcasticFlag = 1
    S = find subject(TWP)
    V = find verb(TWP)
    O = find object(TWP)
    Key ← (S + V)
    Value ← (O + TS)
    form ⟨Key, Value⟩ pair for the tweet
    if (Key in UFF) && (Value == UFF[Key].Value) then
      SarcasticFlag = 0
    end
    return SarcasticFlag
  end

4.4.3. Likes dislikes contradiction

Likes dislikes contradiction (LDC) is based on the behavioral features of the Twitter user and is given in Algorithm 8. Here, the algorithm observes a user's behavior using their past tweets. It analyzes the user's tweet history in the profile and generates lists of the user's likes and dislikes. To generate the likes and dislikes lists of a particular user, one needs to crawl through all the past tweets from the user's Twitter account as the input for Algorithm 8. Next, the algorithm calculates the sentiment score of all the tweets in the corpus using Equations 1, 2 and 3. Later, it classifies the tweets as positive sentiment or negative sentiment using the sentiment score (if the sentiment score is >0.0 the tweet is positive, otherwise the tweet is negative), and both the positive and the negative tweets are stored in separate files. From the positive sentiment tweet file, one needs to extract the triplet values (subject, verb, object) for every tweet in the file using the Rusu triplet method [49]. If the subject value is a pronoun such as 'I' or 'We', then the 'object' value of that tweet is appended to the likes list.
Otherwise, the 'subject' value of that tweet is appended to the likes list. Similarly, from the negative sentiment tweet file, one needs to extract the triplet values (subject, verb, object) for every tweet in the file using the Rusu triplet method [49]. If the subject value is a pronoun such as 'I' or 'We', then the 'object' value of that tweet is appended to the dislikes list; otherwise, the 'subject' value of that tweet is appended to the dislikes list. For example, consider the tweet '@Modi is doing good job for India'. The tweet is positive, as the word 'good' is present, and the subject of this particular tweet is 'Modi'. Therefore, Modi is appended to the likes list of that particular user.

The method to identify sarcasm in tweets using behavioral features (likes, dislikes) is shown in Algorithm 9. The algorithm considers the testing tweets and the lists of likes and dislikes of the particular user as input parameters. While testing sarcasm in tweets, one needs to calculate the sentiment score of the tweet and then extract the triplet (subject, verb, object) of that tweet. If the tweet is positive and the subject is not a pronoun, then the subject value is checked against the likes list: if the subject value is found in the likes list, the tweet is not sarcastic, and if it is found in the dislikes list, the tweet is sarcastic. Similarly, if the subject value is a pronoun and the tweet is positive, then the object value is checked against the likes list: if it is found there, the tweet is not sarcastic, and if it is found in the dislikes list, the tweet is sarcastic. In a similar fashion, one identifies sarcasm for negative tweets as well.

Algorithm 8: Likes and dislikes contradiction
  Data: dataset := tweet corpus of a particular user
  Result: likes and dislikes lists of that user
  Notation: N: noun, T: tweet, C: corpus, PSTF: positive sentiment tag file, NSTF: negative sentiment tag file, t: tag, TWT: tweet-wise tags, FT: first tag, pstf: positive sentiment tweet file, nstf: negative sentiment tweet file, LL: likes list, DLL: dislikes list, PF: parse file
  Initialization: pstf = {φ}, nstf = {φ}, NSTF = {φ}, PSTF = {φ}, LL = {φ}, DLL = {φ}
  while T in C do
    SC = find sentiment score(T)
    if SC > 0.0 then
      pstf ← pstf ∪ T
    end
    if SC < 0.0 then
      nstf ← nstf ∪ T
    else
      discard the tweet
    end
  end
  while T in pstf do
    k = find postag(T)
    PSTF ← PSTF ∪ k
    x = find parsing(T)
    PF ← PF ∪ x
  end
  while TWT in PSTF do
    t = find subset(TWT)
    FT = find first tag(t)
    if (FT = pronoun) then
      O = find object(PF)
      LL ← LL ∪ O
    else
      S = find subject(PF)
      LL ← LL ∪ S
    end
  end
  while T in nstf do
    k = find postag(T)
    NSTF ← NSTF ∪ k
    x = find parsing(T)
    PF ← PF ∪ x
  end
  while TWT in NSTF do
    t = find subset(TWT)
    FT = find first tag(t)
    if (FT = pronoun) then
      O = find object(PF)
      DLL ← DLL ∪ O
    else
      S = find subject(PF)
      DLL ← DLL ∪ S
    end
  end
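The list-building and testing steps of Algorithms 8 and 9 can be sketched in Python as follows. The sentiment scorer and triplet extractor are assumed to be the helpers sketched earlier (Equations 1, 2 and 3 and the placeholder triplet extractor); only the likes/dislikes bookkeeping is shown, and the stand-in helpers in the demo are purely illustrative.

PRONOUNS = {"i", "we"}

def build_like_dislike_lists(user_tweets, score_fn, triplet_fn):
    """Algorithm 8 sketch: split a user's tweets by sentiment and collect
    the liked/disliked targets (subject, or object when the subject is a pronoun)."""
    likes, dislikes = set(), set()
    for tweet in user_tweets:
        score = score_fn(tweet)
        if score == 0.0:
            continue                      # neutral tweets are discarded
        subject, _verb, obj = triplet_fn(tweet)
        target = obj if subject in PRONOUNS else subject
        (likes if score > 0.0 else dislikes).add(target)
    return likes, dislikes

def is_sarcastic_ldc(tweet, likes, dislikes, score_fn, triplet_fn):
    """Algorithm 9 sketch: a positive tweet about a disliked target, or a
    negative tweet about a liked target, contradicts past behavior."""
    score = score_fn(tweet)
    subject, _verb, obj = triplet_fn(tweet)
    target = obj if subject in PRONOUNS else subject
    if score > 0.0:
        return target in dislikes
    if score < 0.0:
        return target in likes
    return False

# Tiny demo with stand-in helpers.
score = lambda t: 1.0 if "good" in t or "love" in t else (-1.0 if "hate" in t else 0.0)
triplet = lambda t: (t.lower().split()[0].lstrip("@"), "", t.lower().split()[-1])
likes, dislikes = build_like_dislike_lists(["@modi is doing good job for india"], score, triplet)
print(likes)                                                       # {'modi'}
print(is_sarcastic_ldc("i hate modi", likes, dislikes, score, triplet))  # True: negative tweet about a liked target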
5. Results and discussion

This section describes the experimental results of the proposed scheme. We start with the experimental setup, where a five-node cluster is deployed under the Hadoop framework. Five datasets are crawled using Apache Flume and the Twitter streaming API. We then discuss the time consumption of the proposed approaches under the Hadoop framework as well as without the Hadoop framework and make a comparison. We also discuss all the approaches in terms of the precision, recall and F-score measures.

5.1. Experimental environment

Our experimental setup consists of a five-node cluster with the specifications shown in Table 3. The master node consists of an Intel Xeon E5-2620 (6 cores, v3 @ 2.4 GHz) processor running the Ubuntu 14.04 operating system with 24 GB of main memory. The remaining four nodes were virtual machines, all running on a single physical machine. The secondary name node server is another Ubuntu 14.04 machine running on an Intel Xeon E5-2620 with 8 GB of main memory. The remaining three slave nodes, responsible for processing the data, are three Ubuntu 14.04 machines running on an Intel Xeon E5-2620 with 4 GB of main memory each.

5.2. Datasets collection for experiment and analysis

The datasets for the experimental analysis are shown in Table 4. There are five sets of tweets crawled from Twitter using the Twitter streaming API and processed through Flume before being stored in the HDFS. In total, 1.45 million tweets were collected using keywords such as #sarcasm, #sarcastic, sarcasm, sarcastic, happy, enjoy, sad, good, bad, love, joyful, hate, etc. After preprocessing, approximately 156,000 tweets were found to be sarcastic (tweets ending with #sarcasm or #sarcastic); the remaining tweets, approximately 1.294 million, were not sarcastic. Every set contained a different number of tweets. Depending on the number of tweets in each set, the crawling time (in hours) is given in Table 4.

Algorithm 9: LDC testing tweets
  Data: dataset := corpus of tweets, LL, DLL
  Result: classification := sarcastic or not sarcastic
  Notation: N: noun, T: tweet, C: corpus, t: tag, TWT: tweet-wise tags, FT: first tag, LL: likes list, DLL: dislikes list
  while T in C do
    SC = find sentiment score(T)
    k = find postag(T)
    x = find parse(T)
    t = find subset(k)
    FT = find first tag(t)
    if (FT = pronoun) then
      O = find object(x)
    else
      S = find subject(x)
    end
    if (SC > 0.0) && (O in LL) then tweet is not sarcastic end
    if (SC > 0.0) && (O in DLL) then tweet is sarcastic else discard the tweet end
    if (SC < 0.0) && (O in LL) then tweet is sarcastic end
    if (SC < 0.0) && (O in DLL) then tweet is not sarcastic else discard the tweet end
    if (SC > 0.0) && (S in LL) then tweet is not sarcastic end
    if (SC > 0.0) && (S in DLL) then tweet is sarcastic else discard the tweet end
    if (SC < 0.0) && (S in LL) then tweet is sarcastic end
    if (SC < 0.0) && (S in DLL) then tweet is not sarcastic else discard the tweet end
  end

5.3. Execution time for POS tagging

In this paper, POS tagging is an essential phase for all the proposed approaches. Therefore, we used Algorithm 1 and Algorithm 2 to find the POS information for all the datasets (approximately 1.45 million tweets). We deployed the algorithms both on Hadoop and without the Hadoop framework and estimated the elapsed time, as shown in Fig. 10. The solid line shows the time taken (approx. 674 seconds) for POS tagging (approx. 10.5 million tweets) without the Hadoop framework, while the dotted line shows the time (approx. 225 seconds) for POS tagging (approx. 10.5 million tweets) under the Hadoop framework. The tweets were in different sets, and we ran the POS tagging algorithm separately for each set. Therefore, the graph in Fig. 10 shows the maximum time (674 seconds) for 10.5 million tweets.

Fig. 10: Elapsed time for POS tagging under the Hadoop framework vs without the Hadoop framework.

5.4. Execution time for sarcasm detection algorithms

There are three proposed approaches, namely PBLGA, IWS and PSWAP, deployed under the Hadoop framework to analyze the estimated time for sarcasm detection in tweets. We pass tagged tweets as input to all three approaches; therefore, the tagging time is not considered in the proposed approaches for sarcasm analysis.

Table 3: Experimental environment.

Components         OS                 CPU                                         Memory   HDD
Primary server     Ubuntu 14.04 x64   Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)   24 GB    1 TB
Secondary server   Ubuntu 14.04 x64   Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)   8 GB     1 TB
Data server 1      Ubuntu 14.04 x64   Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)   4 GB     20 GB
Data server 2      Ubuntu 14.04 x64   Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)   4 GB     20 GB
Data server 3      Ubuntu 14.04 x64   Intel Xeon E5-2620 (6 core, v3 @ 2.4 GHz)   4 GB     20 GB

Table 4: Datasets captured for experiment and analysis.

Datasets   No. of tweets   Extraction period (approx., hours)
Set 1      5,000           1
Set 2      51,000          9
Set 3      100,000         21
Set 4      250,000         50
Set 5      1,050,000       187

Then, we compared the elapsed time under the Hadoop framework vs without the Hadoop framework for all three approaches, as shown in Fig. 11, 12 and 13. The PBLGA approach takes approx. 3,386 seconds to analyze sarcasm in 1.4 million tweets without the Hadoop framework and approx. 1,400 seconds under the Hadoop framework. The IWS approach takes approx. 25 seconds to analyze sarcasm in 1.4 million tweets without the Hadoop framework and approx. 9 seconds under the Hadoop framework. The PSWAP approach takes approx. 7,786 seconds to analyze sarcasm in 1.4 million tweets without the Hadoop framework and approx. 2,663 seconds under the Hadoop framework.

Fig. 11: Processing time to analyze sarcasm in tweets using PBLGA under the Hadoop framework vs without the Hadoop framework.

Fig. 12: Processing time to analyze sarcasm in tweets using IWS under the Hadoop framework vs without the Hadoop framework.

Finally, we combined all three approaches and ran them on the 1.4 million tweets. We then compared the elapsed time under the Hadoop framework vs without the Hadoop framework for the combined approach, as shown in Fig. 14: it takes approx. 11,609 seconds to analyze sarcasm in 1.4 million tweets without the Hadoop framework (indicated with the solid line) and approx. 4,147 seconds under the Hadoop framework (indicated with the dotted line).

Fig. 13: Processing time to analyze sarcasm in tweets using PSWAP under the Hadoop framework vs without the Hadoop framework.

5.5. Statistical evaluation metrics

Three statistical parameters, namely precision, recall and F-score, are used to evaluate our proposed approaches. Precision shows how much of the extracted information is relevant, and recall shows how much of the relevant information is correctly identified. F-score is the harmonic mean of precision and recall. Equations 6, 7 and 8 show the formulas used to calculate precision, recall and F-score.

Precision = Tp / (Tp + Fp)    (6)

Recall = Tp / (Tp + Fn)    (7)

F-score = (2 × Precision × Recall) / (Precision + Recall)    (8)

where Tp is the number of true positives, Fp the number of false positives and Fn the number of false negatives.
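For completeness, a minimal Python helper computing Equations 6, 7 and 8 from raw counts is shown below; the counts in the example are arbitrary illustrative values, not results from the paper.

def precision_recall_fscore(tp, fp, fn):
    """Equations 6-8: precision, recall and their harmonic mean (F-score)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    return precision, recall, f_score

# Arbitrary example counts (not taken from the paper's experiments).
print(precision_recall_fscore(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)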

5.6. Discussion on experimental results

Among the six proposed approaches, PBLGA and IWS were earlier implemented and discussed in [19] with small a set of test data (approx. 3,000 tweets for each experiment) and deployed in a without Hadoop framework. In this work, we deployed PSWAP (novel approach) along with PBLGA and IWS in both a Hadoop and without Hadoop framework to check the efficiency in terms of time. PBLGA generates four lexicon files namely, positive sentiment, negative sit- Fig. 14: Processing time to analyze sarcasm in tweets using PBLGA, IWS and PSWAP (combined approach) under the Hadoop uation, positive situation, and negative sentiment us- framework vs without the Hadoop framework. ing 156,000 sarcastic tweets. The PBLGA algorithm used 1.45 million tweets as test data. While testing, PBLGA checks each tweet’s tweet structure for con- tradiction between positive sentiment and negative sit- T p Precision = (6) uation and vice-versa to classify them as sarcastic or T p + Fp non-sarcastic. For 1.45 million tweets, PBLGA takes approx. 3,386 seconds in the without Hadoop frame- T p Recall = (7) work and it takes approx. 1,400 seconds in the Hadoop T + F p n framework. PBLGA consumes most of the time to ac- 2 ∗ Precision ∗ Recall cess the four lexicon files for every tweet to meet the F − S core = (8) Precision + Recall condition of tweet structure. IWS doesn‘t require any training set to identify tweets as sarcastic. Therefore, where, it takes the minimal processing time in both frame- T = true positive, F = false positive, F = false p p n works (25 seconds for the without Hadoop and 9 sec- negative. onds for the Hadoop framework). PSWAP requires a list of antonym pairs for noun, adjective, adverb, Experimental datasets consist of a mixture of sar- and verb to identify sarcasm in tweets. Therefore, it castic and non sarcastic tweets. In this paper, we takes approx. 7,786 seconds for 1.45 million tweets assume that the tweets with the hashtag sarcasm or in the without Hadoop framework and approx. 2,663 sarcastic (#sarcasm or #sarcastic) as sarcastic tweets. seconds for 1.45 million tweets in the Hadoop frame- The datasets consist of a total of 1.4 million tweets. work. PSWAP consumes most of the time in search- Among these tweets, 156,000 were sarcastic and rest ing antonym pairs for all four tags (noun, adjective, was non sarcastic. Experimental results in terms of adverb, and verb) for every tweet. Finally, we com- precision, recall and F − score was the same un- bined all three approaches together and tested. In the der both the Hadoop and without the Hadoop frame- combined approach, the F-score value attained is 97% work. The only difference was algorithm processing but, execution time is more as it checks all three ap- time due to the parallel architecture of HDFS. Experi- proaches sequentially for every tweet until each one is mental results are shown in Table 5. satisfied to detect sarcasm. Table 5: Precision, recall and F-score values for proposed Three more novel algorithms were proposed approaches namely, TCUF, TCTDF and LDC. These three al- gorithms are implemented using conventional meth- Approach Precision Recall F − score ods with small datasets. Presently, there are no suf- PBLGA approach 0.84 0.81 0.82 ficient datasets available with us to deploy these al- IWS approach 0.83 0.91 0.87 gorithms under the Hadoop framework. TCUF re- PSWAP approach 0.92 0.89 0.90 quires a corpus of universal facts. The accuracy of Combined (PBLGA, IWS, and 0.97 0.98 0.97 this approach is dependent on the universal facts set. 
Classification performance is measured in terms of precision, recall and F-score:

\[ \text{Precision} = \frac{T_p}{T_p + F_p} \tag{6} \]

\[ \text{Recall} = \frac{T_p}{T_p + F_n} \tag{7} \]

\[ \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8} \]

where \(T_p\) denotes true positives, \(F_p\) false positives and \(F_n\) false negatives.

The experimental datasets consist of a mixture of sarcastic and non-sarcastic tweets. In this paper, we treat tweets carrying the hashtag #sarcasm or #sarcastic as sarcastic tweets. The datasets consist of a total of 1.4 million tweets, of which 156,000 are sarcastic and the rest are non-sarcastic. The experimental results in terms of precision, recall and F-score were the same under both the Hadoop and the without-Hadoop frameworks; the only difference was the algorithm processing time, owing to the parallel architecture of HDFS. The experimental results are shown in Table 5.

Table 5: Precision, recall and F-score values for the proposed approaches

Approach                                     Precision   Recall   F-score
PBLGA approach                               0.84        0.81     0.82
IWS approach                                 0.83        0.91     0.87
PSWAP approach                               0.92        0.89     0.90
Combined (PBLGA, IWS, and PSWAP) approach    0.97        0.98     0.97
LDC (first user's account)                   0.92        0.72     0.81
LDC (second user's account)                  0.91        0.77     0.84
LDC (third user's account)                   0.92        0.73     0.82
TCUF approach                                0.96        0.57     0.72
TCTDF approach                               0.93        0.62     0.74

Three more novel algorithms were proposed, namely TCUF, TCTDF and LDC. These three algorithms were implemented using conventional methods on small datasets; sufficient data are not yet available to deploy them under the Hadoop framework. TCUF requires a corpus of universal facts, and its accuracy depends on this fact set; we crawled approximately 5,000 universal facts from Google for experimentation. TCTDF requires a corpus of time-dependent facts, on which its accuracy likewise depends; presently, we trained TCTDF with 10,000 news article headlines as time-dependent facts. LDC requires Twitter users' profile information and their past tweet history; in this work, we tested LDC using ten Twitter users' profiles and their past tweet histories.
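For completeness, the short sketch below illustrates how the figures in Table 5 follow from Eqs. (6)-(8): hashtag-labelled tweets provide the ground truth and each detector's predictions are scored against it. The toy tweets, labels and function names are ours and merely stand in for the 1.4 million-tweet datasets and the detector outputs used in the experiments.

# Illustrative sketch of the evaluation behind Table 5: tweets carrying the
# hashtag #sarcasm or #sarcastic serve as ground-truth sarcastic tweets, and
# each detector's predictions are scored with Eqs. (6)-(8). The names and the
# toy tweets below are ours; the real experiments use the datasets described
# above.

def evaluate(predicted, gold):
    """Compute (precision, recall, F-score) from parallel boolean label lists."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)      # true positives
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)  # false positives
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0             # Eq. (6)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                # Eq. (7)
    f_score = (2 * precision * recall) / (precision + recall) if (precision + recall) else 0.0  # Eq. (8)
    return precision, recall, f_score

def hashtag_gold_label(tweet):
    """Hashtag-based ground truth: #sarcasm or #sarcastic marks a sarcastic tweet."""
    text = tweet.lower()
    return "#sarcasm" in text or "#sarcastic" in text

if __name__ == "__main__":
    tweets = ["I love Mondays #sarcasm", "Lovely weather today", "So much fun #sarcastic"]
    gold = [hashtag_gold_label(t) for t in tweets]   # [True, False, True]
    predicted = [True, True, True]                   # toy detector output
    print(evaluate(predicted, gold))                 # (0.666..., 1.0, 0.8)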

6. Conclusion and Future work

Sarcasm detection and analysis in social media provide invaluable insight into the current public opinion on trends and events in real time. In this paper, six algorithms, namely PBLGA, IWS, PSWAP, TCUF, TCTDF and LDC, were proposed to detect sarcasm in tweets collected from Twitter. Three of the algorithms were run both with and without the Hadoop framework, and the running time of each was reported. The processing time under the Hadoop framework with data nodes was reduced by up to 66% on 1.45 million tweets.

In the future, sufficient datasets suitable for the other three algorithms, namely LDC, TCUF and TCTDF, need to be obtained so that they too can be deployed under the Hadoop framework.

References

[1] D. Chaffey, Global social media research summary 2016. URL: http://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/
[2] W. Tan, M. B. Blake, I. Saleh, S. Dustdar, Social-network-sourced big data analytics, Internet Computing 17 (5), 2013, pp. 62–69.
[3] Z. N. Gastelum, K. M. Whattam, State-of-the-Art of Social Media Analytics Research, Pacific Northwest National Laboratory, 2013.
[4] P. Zikopoulos, C. Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Osborne Media, 2011.
[5] E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, R. Huang, Sarcasm as contrast between a positive sentiment and negative situation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2013, pp. 704–714.
[6] Apache Hadoop. URL: http://hadoop.apache.org/
[7] S. Fitzgerald, I. Foster, C. Kesselman, G. Von Laszewski, W. Smith, S. Tuecke, A directory service for configuring high-performance distributed computations, in: Proceedings on High Performance Distributed Computing, IEEE, 1997, pp. 365–375.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (1), 2008, pp. 107–113.
[9] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing Ltd, 2013.
[10] Apache Flume. URL: http://flume.apache.org/
[11] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop distributed file system, in: Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2010, pp. 1–10.
[12] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, Hive: a warehousing solution over a map-reduce framework, Proceedings of the VLDB Endowment 2 (2), 2009, pp. 1626–1629.
[13] S. M. Thede, M. P. Harper, A second-order hidden Markov model for part-of-speech tagging, in: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL, 1999, pp. 175–182.
[14] D. Klein, C. D. Manning, Accurate unlexicalized parsing, in: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, ACL, 2003, pp. 423–430.
[15] K. Park, K. Hwang, A bio-text mining system based on natural language processing, Journal of KISS: Computing Practices 17 (4), 2011, pp. 205–213.
[16] Q. Mei, C. Zhai, Discovering evolutionary theme patterns from text: an exploration of temporal text mining, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 198–207.
[17] B. Liu, Sentiment analysis and opinion mining, Synthesis Lectures on Human Language Technologies 5 (1), 2012, pp. 1–167.
[18] R. González-Ibáñez, S. Muresan, N. Wacholder, Identifying sarcasm in Twitter: a closer look, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL, 2011, pp. 581–586.
[19] S. K. Bharti, K. S. Babu, S. K. Jena, Parsing-based sarcasm sentiment recognition in Twitter data, in: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), ACM, 2015, pp. 1373–1380.
[20] E. Lunando, A. Purwarianti, Indonesian social media sentiment analysis with sarcasm detection, in: International Conference on Advanced Computer Science and Information Systems (ICACSIS), IEEE, 2013, pp. 195–198.
[21] P. Tungthamthiti, S. Kiyoaki, M. Mohd, Recognition of sarcasm in tweets based on concept level sentiment analysis and supervised learning approaches, 2014, pp. 404–413.
[22] I. Ha, B. Back, B. Ahn, MapReduce functions to analyze sentiment information from social big data, International Journal of Distributed Sensor Networks 2015 (1), 2015, pp. 1–11.
[23] Twitter streaming API, 2010. URL: http://apiwiki.twitter.com/
[24] J. Kalucki, Twitter streaming API, 2010. URL: http://apiwiki.twitter.com/Streaming-API-Documentation/
[25] A. Bifet, E. Frank, Sentiment knowledge discovery in Twitter streaming data, in: 13th International Conference on Discovery Science, Springer, 2010, pp. 1–15.
[26] Z. Tufekci, Big questions for social media big data: representativeness, validity and other methodological pitfalls, arXiv preprint arXiv:1403.7400.
[27] A. P. Shirahatti, N. Patil, D. Kubasad, A. Mujawar, Sentiment analysis on Twitter data using Hadoop.
[28] R. C. Taylor, An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, BMC Bioinformatics 11 (Suppl 12), 2010, pp. 1–6.
[29] M. Kornacker, J. Erickson, Cloudera Impala: real-time queries in Apache Hadoop, for real. URL: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real
[30] D. Davidov, O. Tsur, A. Rappoport, Semi-supervised recognition of sarcastic sentences in Twitter and Amazon, in: Proceedings of the Fourteenth Conference on Computational Natural Language Learning, ACL, 2010, pp. 107–116.
[31] E. Filatova, Irony and sarcasm: corpus generation and analysis using crowdsourcing, in: Proceedings of the Language Resources and Evaluation Conference, 2012, pp. 392–398.
[32] R. J. Kreuz, R. M. Roberts, Two cues for verbal irony: hyperbole and the ironic tone of voice, Metaphor and Symbol 10 (1), 1995, pp. 21–31.
[33] R. J. Kreuz, G. M. Caucci, Lexical influences on the perception of sarcasm, in: Proceedings of the Workshop on Computational Approaches to Figurative Language, ACL, 2007, pp. 1–4.
[34] O. Tsur, D. Davidov, A. Rappoport, ICWSM - a great catchy name: semi-supervised recognition of sarcastic sentences in online product reviews, in: Proceedings of the International Conference on Weblogs and Social Media, 2010, pp. 162–169.
[35] J. W. Pennebaker, M. E. Francis, R. J. Booth, Linguistic Inquiry and Word Count: LIWC 2001, Mahwah: Lawrence Erlbaum Associates 71 (1), 2001, pp. 1–11.
[36] C. Strapparava, A. Valitutti, et al., WordNet-Affect: an affective extension of WordNet, in: Proceedings of the Language Resources and Evaluation Conference, Vol. 4, 2004, pp. 1083–1086.
[37] F. Barbieri, H. Saggion, F. Ronzano, Modelling sarcasm in Twitter, a novel approach, in: Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 2014, pp. 50–58.
[38] P. Carvalho, L. Sarmento, M. J. Silva, E. De Oliveira, Clues for detecting irony in user-generated contents: oh...!! it's so easy ;-), in: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, ACM, 2009, pp. 53–56.
[39] D. Tayal, S. Yadav, K. Gupta, B. Rajput, K. Kumari, Polarity detection of sarcastic political tweets, in: Proceedings of the International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, 2014, pp. 625–628.
[40] A. Rajadesingan, R. Zafarani, H. Liu, Sarcasm detection on Twitter: a behavioral modeling approach, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 97–106.
[41] A. Utsumi, Verbal irony as implicit display of ironic environment: distinguishing ironic utterances from nonirony, Journal of Pragmatics 32 (12), 2000, pp. 1777–1806.
[42] C. Liebrecht, F. Kunneman, A. van den Bosch, The perfect solution for detecting sarcasm in tweets #not, in: Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, ACL, 2013, pp. 29–37.
[43] M. P. Marcus, M. A. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: the Penn Treebank, Computational Linguistics 19 (2), 1993, pp. 313–330.
[44] A. Esuli, F. Sebastiani, SentiWordNet: a publicly available lexical resource for opinion mining, in: Proceedings of the Language Resources and Evaluation Conference, 2006, pp. 417–422.
[45] N. Ide, K. Suderman, The American National Corpus first release, in: Proceedings of the Language Resources and Evaluation Conference, 2004.
[46] N. Ide, C. Macleod, The American National Corpus: a standardized resource of American English, in: Proceedings of Corpus Linguistics, 2001.
[47] E. Charniak, Statistical techniques for natural language parsing, AI Magazine 18 (4), 1997, pp. 33–43.
[48] J. Perkins, Python Text Processing with NLTK 2.0 Cookbook, Packt Publishing Ltd, 2010.
[49] D. Rusu, L. Dali, B. Fortuna, M. Grobelnik, D. Mladenic, Triplet extraction from sentences, in: Proceedings of the 10th International Multiconference on Information Society (IS), 2007, pp. 8–12.