Controversy Detection of Music Artists

Master Thesis to obtain the academic degree of Diplom-Ingenieur in the Master's Program Computer Science

Submitted by Mhd Mousa HAMAD
Submitted at Department of Computational Perception
Supervisor Dr. Markus Schedl
October 2017

JOHANNES KEPLER UNIVERSITY LINZ
Altenberger Str. 69, 4040 Linz, Austria
www.jku.at
DVR 0093696

Table of Contents

Abstract
1. Introduction
   1.1. Motivation
   1.2. Research Problem
   1.3. Research Tasks
2. Literature Review
   2.1. Social Media Analysis
   2.2. Text Preprocessing
   2.3. Trend Detection
   2.4. Sentiment Analysis
   2.5. Controversy Detection
3. Data Collection and Processing
   3.1. Data Streaming
   3.2. Data Processing
   3.3. Sentiment Analysis
   3.4. Data Storage
   3.5. Data Annotation
4. Experiments and Results
   4.1. Feature Extraction
   4.2. Feature Analysis
   4.3. Machine Learning Models Evaluations
   4.4. News Dataset Evaluations
5. Conclusion and Future Work
6. Bibliography
7. Appendix A: Detailed Evaluation Results
   7.1. Twitter Dataset
   7.2. CNN Dataset


Abstract

“We are creating the same quantity of data every two days, as we created from the dawn of time up until 2003. It is estimated to be 5 Exabyte” [1]. The Internet and web technologies give billions of users the ability to share information and express their opinions on various issues. This enormous amount of data can be very valuable. Social media, as the main sharing platform, is a promising data source for researchers who want to investigate and analyze how people feel or think about a variety of issues, from politics to entertainment. Previous research has explored the problem of detecting controversies involving several kinds of entities (people, events, …) by analyzing the different feelings and opinions expressed about these entities. The music domain, although one of the most controversial domains, has received little attention in this research. This thesis studies to what extent Twitter, as a social media platform, can be used to detect controversies involving music artists. It generalizes and extends the work proposed in previous research to build strong machine learning models that predict these controversies. We analyze what people share about music artists on Twitter, present the problems in this data and study how to tackle most of them. We then use this data to build a new controversy detection dataset in the music domain. The created dataset is used to evaluate a comprehensive set of features for building prediction models that detect controversies involving music artists. We propose enriching this feature set with information about the users who share their opinions in addition to information about the shared opinions themselves. Our evaluations show promising results in detecting controversies involving music artists using the created dataset. They also show that our approach can improve controversy detection in other domains, as we additionally run our evaluations on a CNN news dataset.


Chapter 1

1. Introduction

Over 3 billion people used the Internet in 2016 [2]. Most of these users use social media to share information and communicate with each other. They express their feelings about various entities (e.g., products they use or famous people they know) on these social platforms. This data may provide a real-time view of opinions, activities and trends around the world. This chapter introduces the problem of using part of this data feed to detect controversies about music artists that engage social media users.

1.1. Motivation

Social media have profoundly changed our lives and how we interact with one another and the world around us. Recent research indicates that more and more people are using social media applications such as Facebook and Twitter for various reasons, such as making new friends, socializing with old friends, receiving information, and entertaining themselves [3]. As a result, many organizations and companies are adopting social media to accommodate this growing trend, either to provide better services or to gain business value such as driving customer traffic, increasing customer loyalty and retention, increasing sales and revenues, improving customer satisfaction, creating brand awareness and building reputation.

Users from very different backgrounds participate in the massive open collaboration in social media. This often leads to vandalism, when users deliberately try to damage someone's or something's reputation, and to controversies, when users share conflicting opinions on someone or something. Companies, governments, national security agencies, and marketing groups are interested in identifying which issues the public is having problems with. They are also very interested in predicting early whether an issue or a product is likely to generate controversies, so that they can counteract them. As music is one of the most controversial domains, detecting controversies about a song, a clip or a music event as early as possible is also very important for music producers and for the artists themselves, so that they can respond to them.

In the music domain, automatic recommendation systems are becoming more important for music companies and producers on the one hand and for listeners on the other. Detecting controversies about music artists will most likely boost the performance of these systems. They can identify the users who usually listen to music by controversial artists and those who avoid it, and recommend the appropriate music for each group. These systems may also use this information to change how they recommend controversial artists to users who do not usually listen to music by such artists, by focusing the recommendations on non-controversial facts.

As web services, from search engines to social media services, move more and more towards personalization, an unsuspecting user who has never heard of a controversy is likely to be misled. This is known as “The Filter Bubble Effect”, wherein web services serve users with what they expect, rather than encouraging them to seek the multiple perspectives available on a subject. Detecting controversies is therefore getting more attention in the research community as a way to counteract this filter bubble effect [4, 5].


1.2. Research Problem

The goal of this thesis is to study, evaluate and build machine learning prediction models to detect controversies involving music artists on Twitter. Approaches to detecting controversies differ based on the type of controversies they detect. This section defines controversy and some related terms to differentiate between them. It then introduces different types of controversies. Finally, it defines the controversy detection problem that is the topic of this thesis.

1.2.1. Controversy Definition

Controversy is “a state of prolonged public dispute or debate, usually concerning a matter of conflicting opinion or point of view” (Wikipedia1) or an “argument that involves many people who strongly disagree about something” (Merriam-Webster2). The most well-known controversial topics are related to politics, religion, sex and philosophy. Celebrities and the media are also prominent areas of controversy.

Polarity is “a state in which two ideas, opinions, etc., are completely opposite or very different from each other” (Merriam-Webster3). Polarity can appear in environments associated with affirmation or negation. “A polarity item that appears in affirmative (positive) contexts is called a positive polarity item (PPI), and one that appears in negative contexts is a negative polarity item (NPI)” (Wikipedia4).

From the definitions above, we can conclude that a controversy is usually about a matter with many conflicting points of view held by many people. Polarity, on the other hand, is usually about a matter with two completely different points of view. Polarity can therefore be seen as a special case of controversy where the points of view are limited to exactly two.

Vandalism is “the act of deliberately destroying or damaging property” (Merriam-Webster5). It has been around ever since knowledge crowdsourcing emerged. Crowdsourcing is a well-established paradigm to acquire knowledge using open platforms where anyone can contribute (e.g., Wikipedia, Stack Exchange, …). The impact of vandalism on unstructured knowledge bases can be severe, spreading false information or nonsense. Vandalism towards a certain entity (politician, artist, …) can reinforce the belief that this entity is controversial although the controversy is artificial rather than natural; it is therefore important to account for vandalism when detecting controversies. To the best of our knowledge, research has recently covered the task of automatically detecting vandalism and controversies in Wikipedia, but no such research has been conducted on other social media platforms like Twitter or Facebook, as the task there is complex even for humans. [6] proposed an advanced machine learning model based on random forest classifiers to detect vandalism in Wikidata (the knowledge base used by Wikipedia) using a set of 47 features that exploit both content and context information.

1 https://en.wikipedia.org/wiki/Controversy; Accessed in Jun 2017
2 https://www.merriam-webster.com/dictionary/controversy; Accessed in Jun 2017
3 https://www.merriam-webster.com/dictionary/polarity; Accessed in Jun 2017
4 https://en.wikipedia.org/wiki/Polarity_item; Accessed Jun 2017
5 https://www.merriam-webster.com/dictionary/vandalism; Accessed in Jun 2017


1.2.2. Types of Controversies

Controversies can be categorized based on one of the following factors: time period, synchronism, or text (or author) [7].

- Time-period-wise
  o Specific-Point Controversies: These controversies live for a very short time (e.g., a day or a week) and then stop. They are usually related to specific events and are mostly generated in social media.
  o Long-Period Controversies: These controversies last for a very long time (e.g., generations) or even forever. They are usually related to the topics of politics, religion, philosophy and sex.
- Synchronism-wise
  o Synchronous Controversies: Multiple points of view about the controversial matter exist at the same point in time. Most controversial matters generate synchronous controversies.
  o Asynchronous Controversies: Different points of view about the controversial matter exist in different periods of time. These controversies are generated when a sudden change regarding a specific matter occurs. They are usually related to science.
- Text-wise (author-wise)
  o Single-Text (Author) Controversies: These controversies appear in a single text written by one or more authors, or in multiple texts written by one author, discussing multiple points of view about a specific matter. They usually exist in objective articles discussing a scientific or political debate.
  o Multiple-Texts (Authors) Controversies: These controversies appear in multiple texts written by multiple authors where each author expresses his/her point of view about a specific matter. Most controversies are generated in the form of multiple-texts (authors) controversies.

1.2.3. Controversy Detection

Controversy detection is the process of automatically detecting any type of controversy involving an entity (person, product, event, article, etc.). Research has focused on single-text (author) controversies, especially using Wikipedia articles either as candidate entities for controversies or as an information source to be used when detecting controversies about other entities. As other social media platforms are getting more attention as some of the richest information sources to be mined, detecting controversies on these platforms has started to receive more attention in research. Only a few approaches have been proposed to detect multiple-texts (authors) controversies, which is the type of controversy generated on these platforms. One major problem that limits this research is the lack of a benchmark dataset to evaluate proposed approaches.


1.3. Research Tasks

The main tasks to achieve the objectives of this thesis can be summarized in the following points:

- Perform a thorough study of controversy detection approaches, especially in the domain of social media.
- Analyze the data available on Twitter, define its problems and propose solutions to tackle these problems. The analyzed data is among the noisiest data available in social media.
- Based on the study and the analysis, create a new controversy detection dataset for the music domain.
- Evaluate the features used in previous approaches, generalize them and extend them with domain-specific features.
- Evaluate multiple machine learning models to detect (specific-point, synchronous and multiple-texts) controversies in the music domain based on Twitter data.


Chapter 2

2. Literature Review

Controversies can exist in multiple kinds of data sources, from a single text discussing multiple controversial issues to a source where multiple users discuss the same issue, as on Twitter. Detecting controversies using data generated by multiple users on Twitter is considered a social media analysis task. This task typically comprises several subtasks, starting from preprocessing the retrieved data, to detecting trends within this data, and then analyzing how users feel or what they think about these trends to figure out whether any of them is controversial. This chapter first introduces social media analysis in general and then provides a brief literature study of each of the subtasks typically required to detect controversies. These subtasks are illustrated in Figure 2.1.

[Figure 2.1 depicts the pipeline: Data Preprocessing → Trend Detection → Sentiment Analysis → Controversy Detection]

Figure 2.1 A typical controversy detection process

2.1. Social Media Analysis

Data mining can be defined as the process of discovering logical, consistent and previously unseen knowledge patterns in large data sets. Data stream mining is then defined as the process of extracting these knowledge patterns from continuous, rapidly evolving data records. A data stream is a sequence of data instances that can be read only a small number of times, typically once. Mining a data stream often applies techniques from the field of incremental learning, in which online learning, real-time demands and changing data are coped with. Text mining (or text analytics) refers to the process of deriving knowledge patterns from text. These patterns form a higher data level than the text itself. Text mining usually involves structuring a text (pre-processing), deriving patterns from the structured data, and finally evaluating and interpreting the output. Text mining involves the fields of natural language processing, information extraction, information retrieval, and information classification based on large textual datasets. The authors in [8] define text mining as an extension of conventional data mining techniques. This extension is more complex than conventional data mining due to the fuzzy and unstructured nature of natural language text [9]. Due to the ever-increasing amount of textual data

available on the Web, the field of text mining is gaining more popularity in research. Such data is available in the form of website content, blogs, comments, social media posts, digital libraries, and chats. Social media websites, such as Facebook and Twitter, enable users to create various forms of textual data in the form of posts, comments and private messages. Due to the unprecedented use of social media services in recent years, the amount of textual data available through these services has become enormous. Applying text mining techniques to this collected data can reveal significant results related to people's interaction behaviors, thinking patterns and opinions on any specific subject [10, 9]. The work in [3] describes an in-depth case study which applies text mining to analyze unstructured text content on the Facebook and Twitter pages of the three largest pizza chains: Pizza Hut, Domino's Pizza and Papa John's Pizza. The results revealed the value of social media analysis and the power of text mining as an effective technique to extract business value from the vast amount of available social media data.

2.2. Text Preprocessing

Text preprocessing is a very important phase of any text mining technique due to its key role in structuring and simplifying the text, which, during the text gathering process, may be loosely organized. If the text is not scanned carefully to identify problems, text mining might lead to the “garbage in, garbage out” phenomenon [11]. Preprocessing increases the success rate of applying text mining techniques, but it may consume considerable processing resources. Feature extraction and feature selection are the two basic methods of text preprocessing.

2.2.1. Feature Extraction

2.2.1.1. Morphological Analysis (MA)

MA processes the individual words in a text and consists mainly of tokenization, stop-word removal, and stemming. In the tokenization process, the text is divided into its word components. Tokenization is mainly based on whitespace characters and punctuation for most common languages. In some applications, it may be beneficial to extract phrases, and not just word tokens, for effective feature selection. The core idea in [12] is that individual words are not very effective for a clustering algorithm because they miss the context in which each word is used. The usage of such phrases to improve text mining quality is referred to as semantic smoothing because it reduces the noise associated with semantic ambiguity [13]. Stop-words usually refer to the most common words in a language, such as “the”, “a”, and “or” in English. In the stop-word removal process, such words are removed from the text. This process improves both the efficiency and the effectiveness of processing a text, as its number of words is reduced. Stemming is the process of reducing inflected or derived words to their root or base form. The stem may not be a valid morphological root of the word. In most text processing applications, it is sufficient that related words get mapped to the same stem, even if it is not a valid root, as for the word “computer”, which may be mapped to the stem “comput”.
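The morphological-analysis steps above can be illustrated with a minimal sketch. The choice of NLTK here is an assumption for illustration only; the thesis does not prescribe a specific toolkit.

```python
# Minimal sketch of the morphological analysis steps described above
# (tokenization, stop-word removal, stemming), using NLTK as one possible toolkit.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# The required NLTK resources must be downloaded once beforehand:
# import nltk; nltk.download("punkt"); nltk.download("stopwords")

def preprocess(text):
    # 1. Tokenization: split the text on whitespace and punctuation.
    tokens = word_tokenize(text.lower())
    # 2. Stop-word removal: drop very common, low-information words.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # 3. Stemming: reduce inflected words to a (possibly non-lexical) stem,
    #    e.g. "computer" -> "comput".
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The computers are computing the computed results."))
# -> ['comput', 'comput', 'comput', 'result']
```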


2.2.1.2. Syntactic Analysis (SYA)

SYA processes a text in a specific language to provide knowledge about its grammatical structure. SYA consists of Part-of-Speech tagging (POS tagging) and parsing. POS tagging is the process of marking up a word in a text as corresponding to a part of speech (a category of words having similar grammatical properties), based on both its definition and its context. In simpler words, POS tagging is the process of identifying words as nouns, verbs, adjectives, adverbs, etc. Knowing the lexical class of the words in a text improves the results of processing and analyzing this text and reduces ambiguities. Parsing is the process of analyzing the grammatical structure of a sentence. Parsing a text represents it in a tree-like structure, called the parse tree, which shows the syntactic relations between its words. Parse trees are commonly used to check the correct syntactic structure of a sentence in a text.

2.2.1.3. Semantic Analysis (SMA)

SMA is “the process of relating syntactic structures, from the levels of phrases, clauses, sentences and paragraphs to the level of the writing as a whole, to their language-independent meanings”.6 SMA analyzes a text, usually using ontologies, to understand and describe its intended meaning.

2.2.2. Feature Selection

Feature selection is the process of selecting a subset of relevant, non-redundant features to generate less complex and more robust machine learning and language models. This process simplifies the models to avoid overfitting and to allow easier interpretation by researchers and users. Using feature selection techniques is based on the premise that data contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information. In text processing, feature selection scores the importance of the words (phrases) of a text to select important (relevant) features [14]. There are several models that assign this importance score and represent a text as a feature vector. The Vector Space Model (VSM), Latent Semantic Indexing (LSI) and Random Mapping (RM) are some of the well-known models for text representation. The representations of these models are commonly used in research as the feature vector representing a text.
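As a small illustration of the vector space model with importance weighting, the following sketch builds TF-IDF-weighted feature vectors. The use of scikit-learn and the toy documents are assumptions for illustration, not part of the thesis pipeline.

```python
# Sketch: representing documents as TF-IDF-weighted feature vectors (vector
# space model) and limiting the vocabulary to the highest-scoring terms.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the new album by the artist is great",
    "the concert was cancelled after the controversy",
    "fans are divided about the new single",
]

# Build the vocabulary, drop English stop words, and weight terms by tf-idf.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
X = vectorizer.fit_transform(docs)          # sparse matrix: documents x terms

terms = vectorizer.get_feature_names_out()
print(X.shape)                              # (3, vocabulary size)
print(dict(zip(terms, X.toarray()[0].round(2))))  # weights of the first document
```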

2.3. Trend Detection

Trend detection is the process of automatically detecting and analyzing emerging topics (i.e. the “trends”) that appear in a stream of data. Trends are usually driven by breaking news, emerging events and topics that suddenly attract the attention of people. Trend detection is important for news reporters, data analysts and marketing experts as trends highlight fast-evolving topics that attract public attention [15]. The process of detecting trends in a stream of data documents (e.g. stream of tweets) starts with a topic detection process in which the topics mentioned in each document of the stream are

6 https://en.wikipedia.org/wiki/Semantic_analysis_(linguistics) ; Accessed in Jun 2017

detected. The detected topics are then aggregated considering their time resolution, which measures how emergent each topic is during a period. Topic detection approaches may be categorized into document-pivot or feature-pivot approaches [16].

2.3.1. Document-Pivot Approaches

In these approaches, the documents themselves are clustered into topic clusters by representing them as feature vectors and leveraging some similarity measure between these vector representations. The features extracted to represent a document contain not only textual information (keywords) but also temporal information, social information (e.g., number of mentions when processing a Twitter stream) and other information that may be specific to the source of the processed stream. Using this set of features illustrates one of the main drawbacks of document-pivot approaches, namely their sensitivity to noise.

In [17], the authors proposed “TwitterStand”, a news processing system that uses Twitter to detect the latest breaking news. It uses a “leader-follower” clustering algorithm that considers textual similarity and temporal proximity to cluster tweets into news topic clusters. The center of each cluster was represented using a centroid vector and the average sharing time of the tweets in this cluster. A similarity metric based on both dimensions, in addition to the number of shared hashtags, was used to do incremental clustering, i.e., merging new tweets into existing clusters. Each news topic was then associated with an importance score representing whether it refers to a breaking news topic (trend) or not. This score was computed by tracking the number of tweets involving this topic and how quickly they were found by TwitterStand. The authors used their proposed approach to build TwitterStand without reporting their evaluation results.

In [18], the authors proposed a system to collect, group, rank and track breaking news in Twitter. Tweets were converted into a bag-of-words representation and then incrementally merged into topic clusters using textual similarity. Similarity scores were boosted with tf-idf (term frequency - inverse document frequency) weights on proper nouns to improve the grouping results. The topics were then ranked as emergent or not using three measures: “popularity” using the number of retweets, “reliability” using the number of followers and “freshness” using the timestamp. The authors used their proposed approach to build a system called “Hotstream” without reporting their evaluation results.

In [19], the authors used the approach presented in TwitterStand to assign tweets to their suitable topic clusters. The topics were then classified as referring to real-world events (trends) or not. The classifier was trained on a set of features that included social features (e.g., number of mentions, number of retweets), topical features (e.g., keywords), temporal features (e.g., timestamp), and other Twitter-specific features (e.g., hashtags). The experiments were conducted on a dataset of 2,600,000 tweets shared during February 2010 in New York City. The tweets were clustered into topic clusters using incremental clustering. The clusters were then manually annotated as referring to real-world events or not. After processing the annotated clusters, the authors used 374 clusters to train their classification model and 100 clusters to test it. They measured the performance of their model using the F-measure and achieved a score of 0.837 on the test set.
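The document-pivot idea can be sketched as follows: an incoming tweet joins the most similar existing topic cluster, or opens a new one. The representation, threshold and toy tweets below are simplifying assumptions, not the exact TwitterStand or Hotstream algorithms.

```python
# Illustrative sketch of document-pivot incremental clustering: each incoming
# tweet joins the most similar existing topic cluster (cosine similarity of its
# tf-idf vector to the cluster centroid) or starts a new cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "new single from the artist out now",
    "the new single is amazing",
    "huge controversy after the award show",
    "award show controversy keeps growing",
]

vectors = TfidfVectorizer().fit_transform(tweets).toarray()
clusters = []          # each cluster: {"members": [indices], "centroid": vector}
THRESHOLD = 0.3        # assumed similarity threshold

for i, vec in enumerate(vectors):
    best, best_sim = None, 0.0
    for c in clusters:
        sim = cosine_similarity([vec], [c["centroid"]])[0, 0]
        if sim > best_sim:
            best, best_sim = c, sim
    if best is not None and best_sim >= THRESHOLD:
        best["members"].append(i)
        best["centroid"] = vectors[best["members"]].mean(axis=0)  # update centroid
    else:
        clusters.append({"members": [i], "centroid": vec})

print([c["members"] for c in clusters])    # -> [[0, 1], [2, 3]] for these toy tweets
```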


2.3.2. Feature-Pivot Approaches

These approaches are related to topic modeling in natural language processing. Topic modeling uses Bayesian models to assign each document to a probability distribution over topics, each of which is in turn assigned a probability distribution over terms. In these approaches, a set of terms (keywords) that represent the topics is selected based on their information content, so that the most informative set of terms is used. The information content is usually measured using the frequency of occurrence of these terms. Documents are then clustered into topic clusters based on the occurrences of these terms. Capturing misleading term correlations is one of the main drawbacks of feature-pivot approaches, e.g., when two unrelated terms co-occur frequently. For trend detection, this set of terms may be enriched by bursty terms, i.e., terms that suddenly occur in a stream of data documents at unusually high rates.

In [15], the authors proposed “TwitterMonitor”, a system that performs trend detection over Twitter streams. This system identifies bursty terms and then groups them into trends based on their co-occurrences. In other words, a trend is detected when a set of bursty terms occur together frequently in a Twitter stream.

Latent Dirichlet Allocation (LDA) [20] is a well-known and widely used topic model. In LDA, every document is represented as a bag-of-words. These words are the only observed features in the model. The topic probability distribution per document and the term probability distribution per topic are hidden and are estimated using Bayesian inference.

Alternative topic modeling approaches have been explored in research, such as graph-based and signal processing approaches. In the graph-based approaches, documents are clustered into topic clusters using their pairwise similarities. In [21], the authors used a community detection algorithm to cluster the nodes of a term co-occurrence graph, which was then used to cluster documents into topic clusters using a similarity measure. In the signal processing approaches, a wavelet analysis is applied to blocks of term statistics computed over consecutive time slots. A signal is constructed using the difference between the normalized entropies of these consecutive blocks. This signal is then further processed, and the terms that represent the topics are selected using some threshold [22].
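To make the LDA description concrete, the sketch below fits a small topic model on bag-of-words counts and inspects the per-topic term distributions. The scikit-learn implementation, the toy corpus and the number of topics are illustrative assumptions.

```python
# Sketch: fitting an LDA topic model on a bag-of-words representation and
# inspecting the per-topic term distributions (the feature-pivot view of topics).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the band announced a new tour and a new album",
    "tickets for the tour sold out in minutes",
    "the singer faced backlash over controversial remarks",
    "fans criticised the controversial statement of the singer",
]

vectorizer = CountVectorizer(stop_words="english")    # observed bag-of-words counts
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)                 # documents x topics mixture

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):            # topics x terms weights
    top_terms = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top_terms}")
print(doc_topics.round(2))                             # topic mixture per document
```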

2.4. Sentiment Analysis

Sentiment analysis is the process of analyzing a text to identify or quantify the emotional state expressed in it with respect to some entity, which can be a person, product, event, article, etc. It is widely applied to customer reviews, survey responses, social media content and similar textual content to support applications ranging from marketing and advertising to psychology and political science. Nowadays, people express their feelings and opinions about different topics using social media more than any other sharing service. That is why research has lately turned towards social media to analyze the sentiments expressed in what people share. This has revealed several big challenges, such as misspellings, poor grammatical structures, emoticons, acronyms and slang words. Sentiment analysis approaches may be categorized into lexicon-based or machine learning approaches.


2.4.1. Lexicon-Based Approaches

These approaches rely on an underlying sentiment lexicon. A sentiment lexicon is a list of lexical features (e.g., words) attached to values declaring how negative or positive they are based on their semantic meaning. These approaches use the lexical features as the key features of their models to identify or quantify the sentiment of a text. These lexicons can be categorized, based on the labels assigned to their words, into polarity-based or valence-based lexicons.

2.4.1.1. Polarity-Based Lexicons

Polarity-based lexicons (also known as semantic orientation lexicons) assign a binary value to each word declaring its sentiment polarity (i.e., either positive or negative). The following lexicons are some of the widely-used polarity-based lexicons:

- Harvard General Inquirer (GI)7
GI is a content analysis tool with one of the oldest manually constructed lexicons. It maps a text into counts of dictionary-supplied categories by counting the words of each category within the text. Each category is supplied with a dictionary of words and word senses, and each word is supplied with syntactic, semantic and pragmatic information. The currently distributed version of this lexicon combines 182 categories from “Harvard IV-4”, “Lasswell” and 5 manually added categories [23]. The words of two categories, namely “positive” with 1915 words and “negative” with 2291 words, have been widely used in sentiment analysis research [24, 25].

- Linguistic Inquiry and Word Counts (LIWC)8
“LIWC is a commercial text analysis tool developed by researchers with interests in social, clinical, health, and cognitive psychology to capture people’s social and psychological states”. It simply counts the percentage of words within a text that reflect different emotional states, thinking styles, social concerns, and even parts of speech. LIWC uses a proprietary lexicon which consists of 4500 words and regular expressions categorized into one or more of 76 categories. These categories are highly correlated with those of the General Inquirer. The words of two categories, namely “positive emotion” with 406 words and “negative emotion” with 499 words, have been widely used in sentiment analysis research [24, 25].

- Opinion Lexicon9
The Opinion lexicon is a publicly available lexicon of about 6800 English words categorized as either positive (2006 words) or negative (4783 words) [26]. The Opinion lexicon is more attuned to sentiment words and phrases used in social media and product reviews. It includes common misspellings, morphological variants and slang words.

- Subjectivity Lexicon – MPQA (Multi-Perspective Question Answering)10

7 http://www.wjh.harvard.edu/~inquirer/
8 http://liwc.wpengine.com/
9 https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
10 http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/


The Subjectivity lexicon is also a publicly available lexicon, distributed under a GNU Public License. It consists of over 8000 (8222 in June 2017) subjectivity clues. “Subjectivity clues are words and phrases that may be used to express private states, i.e., they have subjective usages”. The lexicon was compiled from different sources including the General Inquirer. The clues in this lexicon are categorized as either positive (2718 clues), negative (4913 clues), both (21 clues) or neutral (570 clues). This lexicon is part of the “OpinionFinder” system11, which processes documents and automatically identifies subjective sentences as well as various aspects of subjectivity within sentences [27, 28].

2.4.1.2. Valence-Based Lexicons

Valence-based lexicons (also known as semantic intensity lexicons) assign a valence score to each word declaring its sentiment intensity. The following lexicons are some of the widely-used valence-based lexicons:

- Affective Norms for English Words (ANEW)12
ANEW provides a set of “normative emotional ratings” for about 1040 English words. The words in ANEW were rated in terms of pleasure, arousal, and dominance. Pleasure, or “affective valence”, ranges from “happy” to “unhappy”, arousal ranges from “calm” to “excited”, and dominance, or control, ranges from “controlled” to “in-control” [29]. Each word in ANEW has three associated sentiment values, one for each of the rated dimensions, ranging from 1 to 9 with a neutral point in the middle at 5. ANEW further breaks down these values for women versus men, allowing gender-specific emotional responses to be explored.

- SentiWordNet13
SentiWordNet is a lexicon that attaches three numeric scores related to positivity, negativity and objectivity (neutrality) to WordNet synsets [30]. These scores range continuously from 0.0 to 1.0 and sum up to one for each synset. The scores were originally calculated automatically using a semi-supervised machine learning approach. SentiWordNet was introduced for the first time in 2006 with version 1.0 [31]. Two more versions were later introduced in 2007 (1.1) and 2008 (2.0). In 2010, the current version (3.0) was introduced, scoring synsets from WordNet 3.0 instead of WordNet 2.0. In the last version, the authors used the scores calculated by the semi-supervised approach as input to an additional iterative random-walk process that is run to convergence, refining the scores [32]. SentiWordNet is freely distributed for noncommercial use, and licenses are available for commercial applications.

- SentiStrength14
SentiStrength is an application to analyze the sentiments expressed in short texts. It is a free application for academic research, and a license is available for commercial use. It assigns a positive sentiment strength from +1 (“no positive sentiment”) to +5 (“very strong positive sentiment”) and a negative sentiment strength from -1 (“no negative sentiment”) to -5 (“very strong

11 http://mpqa.cs.pitt.edu/opinionfinder/
12 http://csea.phhp.ufl.edu/media.html
13 http://sentiwordnet.isti.cnr.it/
14 http://sentistrength.wlv.ac.uk/

negative sentiment”) to each text (zeros are not used). The core of SentiStrength is a lexicon of (currently) 2310 words. Each of these words was assigned two (positive and negative) sentiment scores on scales that exactly match the text scales for both scores [33]. The lexicon was compiled from common well-established sentiment lexicons (General Inquirer, LIWC). The sentiment scores for each word were initially assigned based upon a development corpus of 2600 comments from the social network site MySpace, and subsequently updated through additional testing. The lexicon was then enriched by a list of emoticons. The application and its lexicon have been continuously updated since the first version in 2010 [34]. An improved version introduced in 2012 [35] improved the lexicon and proposed special rules to deal with negations, questions, booster words (e.g., very), emoticons, and a range of other special cases.

- AFINN15
AFINN is a lexicon of (currently) 2477 English words, including 15 phrases, rated with a valence score ranging from -5 (very negative) to +5 (very positive). The words were collected from different sources and word lists and then manually rated by the author [36]. AFINN in its current version is attuned to sentiment words used in social media (namely Twitter). It includes common acronyms, obscene words and slang.

- SenticNet16
“SenticNet is a publicly available resource for sentiment analysis built exploiting Artificial Intelligence and Semantic Web techniques”. This resource can be used to analyze sentiment in texts at a semantic, rather than just syntactic, level. SenticNet introduced this concept-level sentiment analysis using the “bag-of-concepts” model as a replacement for simple word co-occurrence frequencies, which reduces dimensionality, and by leveraging “linguistic patterns” to allow sentiments to flow from concept to concept based on the dependency relations between clauses. SenticNet uses the above dimensionality reduction and graph mining techniques to infer the polarity scores of common sense concepts, leveraging the semantic network called ConceptNet.17 Different versions of this lexicon have been introduced, starting from version 1.0 in 2010 to the current version (4.0) in 2016. The current version consists of 50000 common sense concepts. Each concept is attached to a continuous numeric polarity score ranging from -1.0 to +1.0 [37, 38].

- Valence Aware Dictionary for sEntiment Reasoning (VADER)18
VADER is a simple rule-based model for sentiment analysis of texts from social media. Along with proposing the model, the researchers proposed a new gold-standard sentiment lexicon specifically attuned to microblogs. This lexicon contains over 7500 lexical features (words, slang, emoticons, …). It was compiled by examining some of the common well-established sentiment lexicons (General Inquirer, LIWC,

15 http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
16 http://sentic.net/
17 http://conceptnet.io/
18 http://comp.social.gatech.edu/papers/


ANEW) and compiling a list of words from these lexicons. The lexicon was then enriched by numerous other lexical features (tokens) common in microblogs, including a list of Western-style emoticons, sentiment-related acronyms and initialisms and commonly used slangs. Each of these words was then rated by ten independent human raters on a scale from -4 (extremely negative) to +4 (extremely positive) with a neutral point in the middle at 0 [24]. The lexical features compiled in VADER were then used with a consideration for five general rules to build a state of the art model for sentiment analysis of tweets. The rules embody simple grammatical and syntactical conventions for expressing and emphasizing sentiment intensity; e.g. exclamation mark (!) and an all-capitalized sentiment-related word (e.g. GREAT) increase the magnitude of sentiment intensity without affecting the semantic orientation. The proposed model in most cases performed better than eleven other sentiment analysis tools. It also outperformed individual human raters and generalized relatively well across context [24]. Table 2.1 shows a summary of all the lexicons presented above, focusing on the features important to sentiment analysis in the domain of social media. In [24], the researchers briefly survey research that used most of these lexicons and discuss the pros and cons for almost each of them.

Table 2.1 Sentiment analysis lexicons summary with features attuned to the domain of social media

Lexicon              | Reference    | #Tokens           | Annotation               | Acronyms | Slangs | Emoticons
GI                   | [23]         | 2406              | Binary                   |          |        |
LIWC                 | Commercial19 | 905               | Binary                   |          |        |
Opinion Lexicon      | [26]         | 6800              | Binary                   |          | True   |
Subjectivity Lexicon | [27, 28]     | 8200+             | Binary20                 |          |        |
ANEW                 | [29]         | 1040              | Numeric [1, 9]           |          |        |
SentiWordNet         | [30]         | 147306 (Synsets)  | Numeric [0.0, 1.0]       |          |        |
SentiStrength        | [33, 35, 34] | 2310              | Numeric [-5, -1], [1, 5] |          |        | True
AFINN                | [36]         | 2477              | Numeric [-5, +5]         | True     | True   |
SenticNet            | [38, 37]     | 50000 (Concepts)  | Numeric [-1.0, +1.0]     |          |        |
VADER                | [24]         | 7500              | Numeric [-4, +4]         | True     | True   | True

19 http://liwc.wpengine.com/
20 In addition to positive and negative, some words are annotated as neutral or both.
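As a brief illustration of the lexicon-and-rule-based scoring summarized in Table 2.1, the sketch below uses the publicly available vaderSentiment package, one implementation of the VADER approach [24]; the example texts and printed scores are illustrative only.

```python
# Example of lexicon- plus rule-based scoring in the style of VADER,
# using the publicly available "vaderSentiment" package.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for text in [
    "The new album is GREAT!!!",        # capitalization and "!" boost intensity
    "The new album is great.",
    "The concert was a disaster :(",    # emoticons are part of the lexicon
]:
    scores = analyzer.polarity_scores(text)
    # 'pos', 'neu' and 'neg' sum to 1.0; 'compound' is a normalized score in [-1, 1]
    print(text, "->", scores["compound"])
```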


2.4.2. Machine Learning Approaches

Creating and validating a lexicon that covers all the important sentiment-related lexical features is hard and time-consuming work. That is why some research has explored the possibility of using machine learning approaches to automatically identify and process the most important sentiment-related features of texts. These approaches rely on the underlying machine learning model. These models also need a training set, whose creation is also hard and time-consuming work. Some of these models lack the ability to explain their results, and some of them may generate and use features within the model that are not human-interpretable (meta-features) and are therefore difficult to generalize and modify. Naïve Bayes (NB) classifiers, Maximum Entropy (ME) and Support Vector Machines (SVM) are some of the common machine learning models for text analysis. Neural networks and deep learning models have also gained more attention in text analysis research, as these models have outperformed other models in multiple domains.

In [39], the authors proposed the inclusion of word bigrams as lexical features in bag-of-words models. This inclusion gave consistent gains in sentiment analysis tasks. The modified bag-of-words model was used to generate features for NB and SVM classifiers. Their results showed that NB classifiers outperformed SVM in sentiment analysis for short texts, whereas SVM outperformed NB classifiers for full-length text documents. The authors also proposed building SVM classifiers over NB log-count ratios as feature values instead of the direct bag-of-words model. This model performed well in all the test cases for both short texts and full-length text documents. The experiments were conducted on multiple sentiment analysis datasets. The results often outperformed the original results reported with each of the evaluated datasets by 1% to 2%.

In [40], the authors introduced a deep learning model on top of sentences' grammatical structures to build a representation of a whole sentence based on its structure. The model then analyzes the sentiment of a text based on how its words compose the meaning of longer phrases. An important key feature of this model is that it uses the order of words when analyzing sentiments. This order is completely ignored by simple machine learning models based on bag-of-words representations. The experiments were conducted on a proposed “Sentiment Treebank” constructed by parsing 11855 sentences extracted from movie reviews and then labeling the 215154 unique phrases of the parse trees of these sentences. Each phrase was labeled manually by 3 human annotators using a continuous range from “very negative” to “very positive”. The labels were then processed and converted into 5 distinct classes. The proposed model achieved an accuracy of 80.7%, outperforming all other evaluated models, such as NB (67.2%), SVM (64.3%) and RNN (79.0%).

In [41], the authors used a recursive neural network, adapted from the deep learning model used in [40], to analyze sentiments in short texts (tweets) with respect to a specific target rather than analyzing the whole text as proposed by almost all research in this area. A target is a word in a sentence. Each target may have a different sentiment; for example, "windows phone is better than iOS!" has different sentiments for the two targets “Windows Phone” and “iOS”. The proposed model adaptively propagates the sentiments of words to the target depending on the context and the syntactic relationships between them.


The experiments were conducted on a dataset of 6940 tweets involving 5 targets. The tweets were manually annotated as “negative”, “neutral” or “positive” for each target. The proposed model for target-dependent sentiment analysis achieved an accuracy of 66.3%, outperforming all other evaluated models, such as SVM (63.4%) and RNN (63.0%).

NB, ME and SVM classifiers were evaluated in [24] along with the proposed improved-lexicon and rule-based approach, which outperformed these models in most of the test cases. The evaluations were conducted on 4 different datasets (tweets, movie reviews, technical product reviews and news articles). The proposed approach was especially good at analyzing sentiment in social media texts, as it was attuned to processing them. It achieved an F-measure of 0.96 on the tweets dataset, outperforming all other evaluated models, such as NB (0.84) and SVM (0.83). It performed worse than the model proposed in [39] on the movie reviews dataset [42], achieving an F-measure of 0.61 compared to the original model, which achieved an F-measure of 0.75. For the two other datasets, the proposed model also outperformed all other evaluated models, achieving an F-measure of 0.63 on the technical product reviews dataset and of 0.55 on the news articles dataset.

SemEval (Semantic Evaluation)21 “is an ongoing series of evaluations of computational semantic analysis systems, organized under the umbrella of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics”. Each evaluation in this series contains multiple tasks, and multiple teams participate in solving each task. Task 10 of SemEval-201522 was about sentiment analysis in Twitter and contained 5 different subtasks [43]:

- Subtask A. Contextual Polarity Disambiguation
- Subtask B. Message Polarity Classification
- Subtask C. Topic-Based Message Polarity Classification
- Subtask D. Detecting Trend Towards a Topic
- Subtask E. Determining Strength of Association of Twitter Terms with Positive Sentiment

The results of this task showed that the top systems proposed by the participating teams in several subtasks used deep learning neural networks. The most important features used by these systems were those derived from sentiment lexicons. Other important features included features derived from bag-of-words models, word shape and punctuation. These results indicate that a hybrid system combining the power of both sentiment lexicons and machine learning models performs best in sentiment analysis.
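In the spirit of the bigram-extended bag-of-words classifiers discussed above for [39], the following sketch trains a Naïve Bayes classifier over unigram and bigram features; a linear SVM could be swapped in. The tiny training set is purely illustrative, and the NB-log-count-ratio (NBSVM) variant from [39] is not implemented here.

```python
# Sketch: a bag-of-words model extended with word bigrams feeding a Naive Bayes
# classifier (a LinearSVC could replace MultinomialNB in the same pipeline).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "i love this song", "what a great performance",
    "this album is terrible", "worst concert ever",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),  # unigrams + bigrams, binarized
    MultinomialNB(),
)
model.fit(train_texts, train_labels)
print(model.predict(["this was a great song", "terrible performance"]))
# -> ['positive' 'negative'] on this toy data
```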

2.5. Controversy Detection

Controversy detection is the process of automatically analyzing information about an entity to detect whether it is controversial or not. It is also the process of automatically analyzing a text (or multiple texts) to detect whether it contains controversies or not. Controversy detection approaches may be categorized into content-based and feature-based approaches.

21 http://alt.qcri.org/conferences/
22 http://alt.qcri.org/semeval2015/task10/


2.5.1. Content-Based Controversy Detection

Content-based approaches usually detect single-text, synchronous controversies. They analyze the content of a text and/or its edit history to extract a suitable set of features, which are then used in a machine learning model to predict whether the text is controversial or not. Almost all detection systems in this category were proposed to detect controversies in Wikipedia articles or to detect whether some general topic is controversial using articles about it from Wikipedia.

In [44], the authors proposed a Basic model and two Controversy Rank (CR) models to identify controversial articles in Wikipedia. These models drew clues from the collaboration between the contributors and the edit history of an article instead of interpreting the actual content. The Basic model considered only the number of disputes within an article. A dispute between a pair of contributors in an article is defined as the number of words that they have deleted from each other in the edit history of this article. Only deletion operations were considered, as they indicate disagreement between contributors.

The two Controversy Rank models extended the Basic model by considering the relationships between articles and contributors. These models were based on the premise that disputes involving controversial contributors are more likely caused by the behavior (or aggressiveness) of these contributors, not by the topic of the article. On the other hand, disputes involving non-controversial contributors should be rare; such disputes are therefore strong evidence of a controversial article. The CR models, therefore, derive the controversy score of an article by aggregating two measures:

- Article Controversy: this measure follows the premise that an article is more controversial if its disputes involve less controversial contributors.
- Contributor Controversy: this measure follows the premise that a contributor is more controversial if he/she is involved in disputes of less controversial articles.

The experiments on these models were conducted on a dataset of 19456 Wikipedia articles related to the “Religious” category with 174338 distinct contributors. The best performance was achieved by one of the CR models when evaluated on the top 71 articles, with an F-measure of about 0.145. This performance dropped when the model was evaluated on the whole dataset.

In [45], the authors proposed an approach to detect whether a webpage is controversial or not using its neighboring Wikipedia articles. The proposed approach first maps the webpage to a set of k neighboring articles. The controversy scores of the neighboring articles are then aggregated to calculate the controversy score of the webpage. Finally, this score is converted into a binary value declaring whether the webpage is controversial or not using a predefined threshold. To get the neighboring articles of a webpage, the top ten most frequent non-stop terms are selected from this webpage. The “blekko” search engine23 is used to search for these terms in Wikipedia. The search results were then processed and some of them were discarded (e.g., user and talk articles), resulting in a set of 8755 Wikipedia articles used as the search space for neighbors. 1761 of these articles were manually annotated as controversial or not on an integer

23 https://blekko.com/

scale from 1 (clearly controversial) to 4 (clearly non-controversial). Unannotated articles were handled by guessing, setting their controversy score to 2.5.

The experiments were conducted on a dataset of 377 webpages selected by querying the “blekko” search engine using the titles of 41 manually selected seed Wikipedia articles with an implied controversy level ranging from clearly controversial (“Abortion”) to clearly non-controversial (“Mary Poppins”). These webpages were manually annotated as controversial or not using the same methodology used for annotating the articles forming the search space. The dataset was split into training and testing sets, starting from the seed Wikipedia articles, with a ratio of 60-40 respectively. Multiple numbers of neighbors (k) and aggregation methods were evaluated. The best performance was obtained using the maximum controversy score of the 8 neighboring articles (k=8, aggregation=max), achieving an F-measure of 0.642 on the test set.

In [46], the authors improved the approach proposed in [45] by replacing the manual annotation process of the Wikipedia articles forming the search space with a fully automated one. The controversy score of a Wikipedia article was calculated using a weakly-supervised approach based on three automatically generated scores for each article, derived from the article and its metadata available in Wikipedia. These three scores measure the controversy level of Wikipedia articles and are referred to in research as the C, D, and M scores.

- The C score estimates the controversy level of a Wikipedia article using a regression model. This model uses metadata available for Wikipedia articles such as the length of the article, the number of contributors, the number of anonymous contributors, etc. [47].
- The D score represents the presence of dispute tags, which are added to the talk pages of Wikipedia articles by their contributors [47].
- The M score also estimates a controversy score of a Wikipedia article, based on the concepts of “mutual reverts” and “edit wars” in Wikipedia. The model used for the estimation assumes that “the larger the armies, the larger the war”, which is why it uses the number and reputation of the contributors reverting each other's edits [48].

The experiments were conducted on the same webpage dataset proposed in [45]. The results were very similar to those of the previous research. The number of neighbors considered in these experiments is relatively higher given the larger number of annotated articles available after the automatic annotation process. The best performance was achieved using the maximum controversy score of 15, instead of 8, neighboring articles (k=15, aggregation=max) with an F-measure of about 0.7. The performance of the proposed fully automated approach to annotate Wikipedia articles was considered good, as the authors achieved results comparable to those of the previous research [45].

In [49], the authors proposed using a stacked machine learning model on relational data to improve the quality of the controversy detection approach proposed in [46]. The proposed approach classifies whether a webpage is controversial or not using a combination of its intrinsic features and the controversy predictions of its related webpages. A stacked model is a collective classification model which is applied to the neighbors of a data instance, using their intrinsic features to predict a value for each of these neighbors. These predictions are then aggregated and used as new features for the original data instance.
Finally, a stacked machine learning model is trained using the extended set of features to predict the final

expected value. Stacked models are particularly useful in situations where there is a lack of extensive ground truth data about the neighborhood of a data instance [50]. Relational data represents how data instances are related to each other. Hyperlinks, textual similarity and topic similarity are common examples of such data. The proposed approach uses textual similarity to build a network of relations between the webpages based on TF-IDF-based cosine similarity. Using textual similarity is based on the hypothesis that “a definition of relatedness that incorporates textual or topical similarity will hold more predictive power than pre-existing relationships such as hyperlinks” [49].

The experiments were conducted on the same webpage dataset proposed in [45] and on another dataset of 480 Wikipedia articles, equally balanced between 240 controversial and 240 non-controversial articles [51]. The stacked model outperformed most of the other evaluated models on both datasets, achieving an area under the ROC curve of 0.8 on the first dataset [45] and of 0.84 on the second one [51].

In [52], the authors improved the approach proposed in [46] to detect controversy in webpages based on related Wikipedia articles. The improved approach changed the query used to retrieve related articles (neighbors). It also proposed a smoothing technique to adjust the D, C and M scores associated with Wikipedia articles. Instead of using the top ten non-stop terms, this approach used the topics and subtopics present in a webpage to query its related articles. A webpage is split into multiple blocks of sentences (tiles) when a subtopic shift occurs. Topic shifts are detected using a TextTiling technique [53] in which a block is simply defined as a few sentences and the lexical similarity between every two consecutive blocks is calculated. When the similarity score changes dramatically, a subtopic shift is assumed to occur. A query for each tile is then generated, and one query for the webpage is generated by creating a mixture of the globally most frequent terms and the locally most frequent terms of each tile. The main advantage of using this query is that it keeps the context of the original webpage.

The proposed smoothing technique computes or modifies the three automatically generated scores which measure the controversy level of Wikipedia articles (the D, C, and M scores). These scores are generated based on the edit history of an article; therefore, when an article has no edit history or only a short one, these scores will not correctly measure its controversy level. To infer more reliable scores for such an article, the proposed approach smoothens the original scores using the scores of neighbors with a more established edit history.

The experiments were conducted on the same webpage dataset proposed in [45]. The proposed improvements outperformed the original results in almost all test cases, achieving an F-measure of 0.71 using the maximum controversy score of 20 neighboring articles (k=20, aggregation=max). The impact of the smoothing technique on improving the results was higher than the impact of changing the query.

2.5.2. Feature-Based Controversy Detection

Feature-based approaches can detect controversies of any kind. They use a set of different features and metadata to build a machine learning model that predicts whether a matter is controversial or not. Matters in these approaches can be entities of any kind (person, product,

event, article, etc.). Sentiment-based features are the most common and important features used by these systems.

In [54], the authors defined a controversial issue as “a concept that invokes conflicting sentiments or views”. Conforming to this definition, they proposed an approach to detect controversial issues in news articles based on the magnitude and the difference of the scores of the sentiments expressed within the terms of that issue. For example, “Afghanistan war” would be a controversial issue in the proposed approach. The proposed approach extracted a set of terms (a phrase) to represent each controversial issue. Terms were extracted by adopting a topic model known as “known-item query generation” [55]. A known-item query is generated from a document known to be relevant based on its topic (probabilistic) model. The scores of the positive and negative sentiments expressed in each of the extracted terms were calculated using the SentiWordNet lexicon (explained in section 2.4.1.2). The scores of the positive and negative sentiments involving a candidate issue were then calculated by combining the respective scores of its terms. An issue was considered controversial if the sum of its polarity scores was greater than one specified threshold and the difference between these scores was smaller than another; these thresholds were set empirically to 250 and 50, respectively. The experiments were conducted on the MPQA Opinion Corpus24, a dataset of 350 news articles classified into 10 topics and manually annotated for opinions, by selecting the top 10 (above the thresholds) controversial issue phrases for each topic and asking three annotators whether each phrase is appropriate for a controversial issue or not. A phrase was considered correctly classified as controversial if at least two annotators agreed with that. The proposed approach achieved a precision of 0.74.

In [56], the authors formulated the task of detecting controversies involving celebrities in Twitter and proposed an approach to detect these controversies. The authors defined a “snapshot” as a triple s = (e, Δt, tweets), where e represents an entity (celebrity), Δt represents a time period and tweets represents the set of tweets published during the target period involving the target entity. The controversy score then represents the level of controversy involving an entity in the context of a snapshot. The controversy detection task can finally be formulated as follows:

“Given an entity set E and a snapshot set S = {(e, Δt, tweets) | e ∈ E}, compute the controversy score for each snapshot s in S and rank S with respect to the resulting scores”.

The controversy score for a snapshot is computed as a linear combination (weighted sum) of two scores, namely a “historical controversy score” and a “timely controversy score”. The historical controversy score estimates the controversy level of an entity independently of time. It is estimated simply by counting the occurrences of controversial terms using a controversy lexicon available on Wikipedia as a “list of controversial issues”.25 This lexicon was compiled by mining controversial topics and phrases from Wikipedia articles.

24 http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/
25 https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues

The timely controversy score estimates the controversy level of an entity during a period by analyzing a given snapshot. It is estimated as a linear combination of a sentiment-based score and a controversy-based score. The sentiment-based score represents the disagreement in sentiments expressed in the tweets of the snapshot; sentiment scores were estimated using the subjectivity lexicon (OpinionFinder) (explained in a previous section). The controversy-based score is also estimated by counting the occurrences of controversial terms in the tweets of the snapshot. The experiments were conducted on a dataset of 120 snapshots randomly sampled from 30451 buzzy snapshots which, in turn, were sampled from 738045 snapshots collected from Twitter between July 2009 and February 2010 using 104713 celebrity names (entities) collected from Wikipedia for the Actor, Athlete, Politician and Musician categories. The snapshots were manually annotated as controversial or non-controversial by two annotators. The results suggested that the presence of disagreement in sentiments in a snapshot is a good indicator of controversy. The proposed approach achieved an area under ROC of 0.662 using the best test setup.

In [57], the authors extended the work proposed in [56] and reformulated the task of detecting controversies to differentiate between “event snapshots” and “non-event snapshots”. The new formulation computed different controversy scores for controversial-event, non-controversial-event and non-event snapshots. The reformulated task was decomposed into two steps, event detection and controversy detection, and three regression models using Gradient Boosted Decision Trees (GBDT) [58] were proposed to solve them:

- Direct model: a machine learning regression model that estimates the controversy score of a snapshot in a single step; i.e., in the training set used to build the model, positive examples represent “controversial-event” snapshots, whereas negative examples represent “non-controversial-event” and “non-event” snapshots.
- Two-step pipeline model: a model composed of two models following the two-step decomposition of the formulated task. The “event detection classification model” selects “event” snapshots, then the “controversy detection regression model” estimates the controversy score of these snapshots.
- Two-step blended model: a “soft variant of the pipeline model”. The results of the event detection classification model are used as additional features of the snapshots to provide more information for the controversy detection regression model, which takes the complete set of snapshots as input.

The proposed regression models used a rich set of features representing each snapshot, including Twitter-based features and external features. The Twitter-based features consisted of:

- Linguistic features: such as the percentage of nouns, the percentage of verbs, the average number of mentions of the target entity in the snapshot, …
- Structural features: such as the number of tweets in the snapshot, the percentage of tweets that are retweets, …
- Buzziness features: these features estimate the buzziness of the target entity by comparing the number of tweets in the snapshot with the numbers from previous snapshots of this entity.
- Sentiment features: such as the percentage of positive tweets in the snapshot, the percentage of negative tweets, …


- Controversy features: such as the sentiment contradiction score proposed in [59], the percentage of tweets with at least one controversial term from the Wikipedia-based controversy lexicon used in the previous research, …

The external features consisted of:

- News buzziness features: these features represent the buzziness of the target entity in news articles during the period of the snapshot. News articles were aligned with each snapshot based on its period (Δt) and on whether the target entity is a “salient entity” in them. A salient entity is defined as an entity that is mentioned in the headlines of the articles or is one of the 3 most frequently mentioned named entities in the articles.
- Web and news controversy features: these features estimate the levels of controversy involving the target entity of the snapshot using external Web resources (search engines) and the aligned news articles. Some of these features, for example, count the controversial terms from the controversy lexicon that co-occurred with the target entity on the Web or occurred in the aligned news articles.

The experiments were conducted on an extended gold-standard dataset of 800 snapshots randomly sampled from the complete set of 738045 snapshots of the previous research [56]. The snapshots were also manually annotated by two annotators: 325 event snapshots, of which 152 are controversial-event and 173 non-controversial-event snapshots, and 475 non-event snapshots. All the proposed models outperformed a random-rank baseline. The two-step blended model performed best among the tested models, achieving an area under ROC of 0.850. This result suggested that the event-related features generated in the first step (the event detection classification model) are the most relevant features.

In [7], the authors formulated the task of detecting contradictions involving some topic in a set of texts (text documents or simple sentences). They also proposed an approach to detect these contradictions based on the mean value and the variance of the scores of the sentiments expressed within the texts. The existence of a contradiction involving some topic was formally defined as follows:

“There is a contradiction on a topic T between two groups of documents D1, D2 ⊂ D in a document collection D, where D1 ∩ D2 = ∅, when the information conveyed about T is considerably more different between D1 and D2 than within each one of them”.

The research suggested that it is important to consider the publication time of texts when identifying contradictions involving some topic in these texts. Based on this suggestion and the above definition, two types of contradictions were defined (explained in section 1.2.2):

- Synchronous contradictions: when the documents in D1 and D2 were published during the same time interval, i.e., conflicting opinions on the topic coexist at the same time.
- Asynchronous contradictions: when all documents in D1 were published in a time interval t1 and all documents in D2 were published in a time interval t2 that followed t1, i.e., the sentiment on the topic changed over time.


The combination of the mean value and the variance of the sentiment scores was chosen based on the following observations:

- A large variance of sentiment scores does not indicate a contradiction when only positive or only negative sentiment scores are present.
- A zero mean of sentiment scores may occur even when all texts are neutral, which does not indicate a contradiction.
- A set of only two texts with opposite sentiment scores would generate a contradiction that compares unfavorably to a contradiction generated from a set of texts with a much higher cardinality.

The proposed approach to detect contradictions was not evaluated in terms of relevance (precision, recall and F-measure) due to the absence of a benchmark dataset and the difficulty of creating one. The approach was applied to three sentiment analysis datasets, and the results were discussed as a user study showing how useful the approach was for 8 persons.

In [60], the authors proposed an approach to detect controversies in news articles using machine learning classification models applied to a rich set of features extracted from the comments on the articles. The feature set that represented each article was derived from the feature set used in [57], and the same Wikipedia-based controversy lexicon was used. The SentiStrength lexicon (explained in section 2.4.1.2), instead of OpinionFinder, was used to estimate the scores of the sentiments expressed within the comments. The authors discarded the external features and extended the rest with new features specific to the news domain and the available dataset, for example:

- The reputation scores of the users who commented on an article, which are values available in the CNN news articles dataset.
- The average number of likes of the comments on an article.
- The average word count of the comments on an article.

The proposed approach took the time of the comments on an article into consideration to study how early it is possible to detect controversies in the article. Multiple time ranges were considered: 6, 12, 18, 24, 36, 42 and 48 hours. For each of these ranges, the features representing each article were extracted using only the comments generated within the range, starting from the time of publishing the article. The approach was evaluated using a dataset of 728 news articles published by CNN. The articles were manually annotated, by at least three different annotators each, using a simple manually created application. Naïve Bayes, Support Vector Machines (SVM), Random Forest and Decision Table classification models were evaluated for each of the tested feature sets. Decision Table models outperformed the other models, achieving an F-measure of 0.711 and 0.706 considering the feature sets extracted using the comments generated within 6 and 48 hours, respectively. These results suggested that the first 6 hours are sufficient to detect controversies in news articles; the performance changed only marginally using features extracted considering the subsequent ranges. The features were also evaluated for their importance using the Chi-squared attribute ranking evaluator available in WEKA,26 a machine learning application developed at the University of Waikato, New Zealand [61, 62]. The evaluation results indicated that controversy features, extracted based on the controversy lexicon, and sentiment-based features are the most relevant features in detecting controversies in news articles.

26 http://www.cs.waikato.ac.nz/ml/weka/

In [63], similarly to [60], the authors proposed an approach to detect controversial topics in pages submitted to Reddit, a social navigation site, using classification models applied to a rich set of features extracted from the submissions and the comments on them. The feature set that represented each page was also derived from the feature set used in [57]. The SentiWordNet lexicon (explained in section 2.4.1.2), instead of OpinionFinder, was used to estimate the scores of the sentiments expressed within the comments on Reddit pages, and the controversy score introduced in [7] was used to replace the Wikipedia-based controversy lexicon. The authors discarded the external features and extended the rest with new features specific to Reddit, for example:

- The number of “gilds”27 in a page. “Gilding” is giving “reddit gold” directly to a user or to one of their posts or comments; it is similar to crediting a user, post or comment.

Two lexicons of positive words and of bad and swear words were also used to extract features related to the words in each of these lexicons. The experiments were conducted on a dataset of about 196 million submissions and 1.7 billion comments from 370 million distinct authors. A page was considered controversial if it had at least one controversial comment. A comment is marked as controversial or not by Reddit itself using a simple voting-based method which aggregates the up-votes and the down-votes of submissions and comments. A Naïve Bayes classifier was used as the classification model, and only pages that were predicted as controversial with a confidence score of 100% were marked as controversial. The model achieved an F-measure of 0.73. Analyzing the relevance of the adopted feature set confirmed the relevance of the controversy features, computed using the approach proposed in [7], and of the sentiment-based features in detecting controversies.

27 https://www.reddit.com/gilding/


Chapter 3

3. Data Collection and Processing

The goal of this thesis is to detect controversies involving music artists using what is published about them on Twitter. We therefore need a controversy detection dataset with content from the domain of micro-blogs or social media. The only available dataset that is close to the problem area of this thesis is the one used in [57], but this dataset is not publicly available, either for research or for commercial use. This chapter proposes a new dataset to evaluate controversy detection approaches in the music domain using Twitter. The rest of the chapter presents the dataset creation process, including streaming the data from Twitter and storing, processing and annotating this data.

3.1. Data Streaming

We used the public streams28 of the Twitter Streaming APIs29 to collect data about music artists. Public streams give developers low-latency access to the public tweet data flowing through Twitter. This data is suitable for following specific users or topics and for data mining, which exactly fits our requirements. Public streams can be filtered using a list of “track terms”; “the default access level allows up to 400 track keywords, 5,000 follow userIds”.30 We compiled a list of about 300 artist names to filter the public streams using one of the following two methods (a short sketch of this setup is given at the end of this section):

- As hashtags: the names of the artists are converted into hashtags which are then used to filter the public streams. To convert a name into a hashtag, we simply removed all white spaces and special characters and attached the hashtag sign at the start of the name (e.g. “John Lennon” becomes “#JohnLennon” and “alt-J” becomes “#altJ”). Using this method, the Twitter APIs return only tweets containing any of these hashtags.
- As texts: the names of the artists are used as they are to filter the public streams. Using this method, the Twitter APIs return only tweets containing any of these full names in any possible Twitter-specific format (text, hashtag or user mention).

3.1.1. Artist List

The list of artist names was compiled by combining names retrieved from different sources to keep the balance between controversial and non-controversial artists.

3.1.1.1. Last.fm

We used two API methods from the Last.fm API:31

- “chart.getTopArtists”: gets the top artists chart. We used this method to get the top 50 artists.

28 https://dev.twitter.com/streaming/public
29 https://dev.twitter.com/streaming/overview
30 https://dev.twitter.com/streaming/reference/post/statuses/filter
31 https://www.last.fm/api


- “tag.getTopArtists”: gets the top artists tagged by any of the top global tags on Last.fm, which are available using the API method “tag.getTopTags”. We used this API method to get the top 50 tags, then, for each of them, we used the top artists API method to get the top 3 artists tagged with this tag, which led to 150 artists.

3.1.1.2. Spotify

We used the “regional global daily” top 200 music tracks chart from Spotify Charts32 to get the artists of these tracks.

3.1.1.3. Billboard

We manually collected two lists of artists available on Billboard:

- “Artist 100”:33 contains the top 100 artists in the week this list was accessed.
- “Banned Music: 21 Artists Censors Tried to Silence”:34 contains a list of 21 of music's most infamous banned moments involving some controversial music artists.

3.1.1.4. The Top Tens

We manually collected the list of “Top Ten Worst Music Artists”35 along with “The Contenders” to get 100 of the worst artists.

3.1.1.5. Google

We manually collected the list of “Music Artists Frequently Mentioned on The Web”. This is an automatic list compiled by Google36 and accessed by searching for “Top Artists 2017” using Google's search engine, which renders this list at the top of the search results. The list contained about 50 artists when it was accessed.

3.1.2. Streaming Process

We streamed Twitter on three different days separated by relatively long periods, using the names of the artists as hashtags. The streamed data was stored completely in a raw format on the hard disk to always be available as is. The main purpose of this data is to perform thorough analyses and to use their results to process other data directly during the streaming process, so that minimal storage resources are used. This data was streamed for 24 hours starting from 20:00 UTC on Dec 31, 2016, Feb 25, 2017 and May 20, 2017. These days were selected as one very special day in the music domain, since music events happen all over the world during New Year's Eve, and two randomly selected weekends. This selection is based on two reasons:

- People tend to be more positive during special events, especially New Year's Eve. Combining the day of this very special event with other normal days helps in analyzing this tendency and compensating for it if the analysis results confirm it.
- People tend to express their feelings about a music artist more when a special event involving this artist occurs (e.g. an album release, a concert, a TV interview, etc.). The long periods separating the selected days help in analyzing this tendency and compensating for it if the analysis results confirm it.

32 https://spotifycharts.com; Accessed in Dec 2016
33 http://www.billboard.com/charts/artist-100; Accessed in Dec 2016
34 http://www.billboard.com/articles/list/513610/banned-music-21-artists-censors-tried-to-silence; Accessed in Dec 2016
35 https://www.thetoptens.com/worst-music-artists/; Accessed in Dec 2016
36 https://www.google.com/search?q=Top+Artists+2017; Accessed in Dec 2016


Table 3.1 shows the number of raw tweets streamed on each of these three days.

Table 3.1 Number of raw tweets streamed in each of the selected three days

Streaming Day    Number of Tweets
Dec 31, 2016     497629
Feb 25, 2017     1059290
May 20, 2017     574890

After analyzing and processing the data streamed on these three days, we streamed Twitter again for about 10 consecutive days, from Jun 03, 2017 to Jun 13, 2017. The names of the artists were used as hashtags for the whole streaming period, except for Jun 09, 2017 and Jun 12, 2017, where we used them as texts to filter the public streams. This data was processed directly during streaming, based on the results of the analyses conducted on the data streamed during the three previous days (explained in section 3.2), and then stored in the data store. Table 3.2 shows the number of processed tweets streamed in every streaming period along with the filtering method, i.e., how the names of the artists were used to filter the public streams (hashtags or texts).

Table 3.2 Number of processed tweets and filtering method in each of the streaming periods

Streaming Period    Number of Processed Tweets    Filtering Method (Artist Names as)
Dec 31, 2016        1004                          Hashtags
Feb 25, 2017        1046                          Hashtags
May 20, 2017        1034                          Hashtags
Jun 03-10, 2017     7995                          Hashtags
Jun 09, 2017        11623                         Texts
Jun 12, 2017        40543                         Texts

Storing the processed data instead of the raw data reduced the required disk space to store the tweets of the first three streaming days from 20 GB to about 100 MB.
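To make the streaming setup concrete, the sketch below shows how artist names could be converted into hashtag track terms and fed to Tweepy's streaming interface. This is a minimal illustration only, assuming the Tweepy 3.x API that was current in 2017 and placeholder credentials; it is not the exact streaming code used for this thesis.

    import re

    import tweepy  # Tweepy 3.x style API (current around 2017)


    def name_to_hashtag(name):
        """Convert an artist name into a hashtag track term,
        e.g. "John Lennon" -> "#JohnLennon" and "alt-J" -> "#altJ"."""
        return "#" + re.sub(r"[^0-9A-Za-z]", "", name)


    class RawTweetListener(tweepy.StreamListener):
        """Minimal handler that just prints every received status."""

        def on_status(self, status):
            print(status.id_str, status.text)

        def on_error(self, status_code):
            return status_code != 420  # disconnect when rate limited


    # Placeholder credentials; the real keys are of course not part of the thesis.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    artists = ["John Lennon", "alt-J", "Rihanna"]
    track_terms = [name_to_hashtag(a) for a in artists]

    stream = tweepy.Stream(auth=auth, listener=RawTweetListener())
    stream.filter(track=track_terms)  # up to 400 track keywords are allowed

The “as texts” variant would simply pass the unmodified artist names as track terms instead of the generated hashtags.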

3.2. Data Processing

We built an annotation tool where annotators can assign a controversy score to each music artist based on a selected set of tweets involving this artist, using a simple web interface (this tool is described in more detail in section 3.5). Eight annotators with excellent English knowledge were asked to start using this tool with the data streamed during the first three streaming days. They were asked to continuously provide us with information about problems they faced, improvements they thought of, and any other feedback on the annotation process. The data processing pipeline was continuously improved based on the feedback of the annotators. The annotations provided during this phase were completely discarded, as its main goal was to improve the processing pipeline and to familiarize the annotators with the annotation tool.


Figure 3.1 shows the complete data processing pipeline used to filter and process the streamed tweets. This pipeline was integrated into the streaming process for all data streamed after the first three streaming days.

[Figure 3.1 depicts the data processing pipeline: Collect Artist Names → Filter out Ambiguous Artist Names → Stream Twitter → Consider Original Tweets → Filter out Non-English Tweets → Filter out Useless Tweets → Extract the Involved Artist → Filter out Duplicated Tweets → Random Tweet Selection]

Figure 3.1 Data processing pipeline

Table 3.3 shows the number of tweets streamed during the first three streaming days after or during each of the major processing steps that strongly affects (mostly reduces) this number.

Table 3.3 The effect of the data processing pipeline on the number of tweets

Processing Step       Dec 31, 2016    Feb 25, 2017    May 20, 2017
Streamed Tweets       497629          1059290         574890
Retweeted Tweets      383107          803174          465718
Quoted Tweets         5498            12143           5782
Original Tweets       120020          268259          114954
English Tweets        73165           93256           67416
Non-English Tweets    46855           175003          47538
Processed Tweets      1004            1046            1034

We applied very selective filters because we kept receiving feedback from the annotators about useless tweets and about what they have in common, and because of the general tendency of users to share more positive tweets about music artists, which biased the decisions of the annotators. The applied processes and filters reduced the number of tweets to less than 1%, keeping only what we believe to be informative tweets in the context of detecting controversies.

3.2.1. Collect Artist Names

We compiled a list of about 300 artist names from different sources, as discussed in detail in section 3.1.1.


3.2.2. Filter out Ambiguous Artist Names

Many artists on the compiled artist list have ambiguous names. These names refer to someone or something else beside the artists themselves (e.g., Air, Berlin, FloRida, Pitbull, etc.). In the first experimental run of the annotation tool, we asked the annotators to skip annotating artists for whom most of the tweets viewed on the annotation page are confused with someone or something else. The names of the skipped artists were marked using the annotation tool to be removed. We then manually reviewed all the marked artist names and removed all the ambiguous ones. In addition to the ambiguous names, some non-English names were also removed manually, as we are only interested in processing English tweets about music artists known in English-speaking countries (e.g., a Japanese pop-rock band and a Japanese singer-songwriter whose names contain non-Latin characters, Die Ärzte (a German punk rock band), etc.). This process filtered out about 200 names, keeping only about 100 names to be used for the subsequent streaming and processing.

3.2.3. Stream Twitter

We streamed Twitter using the public streams of its Streaming APIs for three different days separated by relatively long periods, and then for about another 10 consecutive days. We used Tweepy,37 “an easy-to-use Python library for accessing the Twitter API”, for the streaming process.

3.2.4. Consider Original Tweets

During the streaming process, we received a lot of tweets that were retweeted. Retweeting is republishing someone's tweet on Twitter. Twitter keeps track of retweets and maps them all to their original tweet. The retweets we received generate useless duplicates, which showed up clearly during the annotation process, which is based only on the text of the tweets; all annotators complained about the duplicates during the experimental annotation phase. Twitter provides a property called “retweet_count”, available in the data of each tweet, which represents the “number of times this tweet has been retweeted”.38 Except for this property and the unique id of each tweet, the original tweet and its retweeted versions are identical. Based on these facts, we always used original tweets: for every streamed retweet, we stored its original tweet, which is available as a property in its data. If we had already stored this tweet before, we updated the “retweet_count” of the stored one to match the newer value. If a user x, for example, published a tweet with the text “Mariah Carey's backup dancers deserve Oscars for their performance of having to act like nothing was wrong #MariahCarey” and a user y retweeted this tweet, the text of the retweet is received as “RT @x: Mariah Carey's backup dancers deserve Oscars for their performance of having to act like nothing was wrong #MariahCarey”. Figure 3.2 shows a histogram of the number of retweets (“retweet_count”) of all the processed streamed tweets; the right-hand side of the figure is the same histogram focusing on the area where most of the information resides.

37 http://www.tweepy.org/
38 https://dev.twitter.com/overview/api/tweets


Figure 3.2 Number of retweets of the complete processed tweets
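The handling of retweets described above can be sketched as follows. The snippet assumes the tweet is available as the JSON dictionary delivered by the Streaming API (where a retweet carries its original tweet under the "retweeted_status" key) and a hypothetical in-memory store keyed by tweet id; the real implementation persists the data to MongoDB instead (see section 3.4).

    def resolve_original(tweet, store):
        """Keep only original tweets.

        `tweet` is the JSON dictionary delivered by the Streaming API; a retweet
        carries its original tweet under the "retweeted_status" key.
        `store` is a hypothetical dict mapping tweet id -> stored tweet.
        """
        original = tweet.get("retweeted_status", tweet)
        stored = store.get(original["id_str"])
        if stored is None:
            store[original["id_str"]] = original
        elif original["retweet_count"] > stored["retweet_count"]:
            # A newer copy was seen: update the counter of the stored original.
            stored["retweet_count"] = original["retweet_count"]
        return store[original["id_str"]]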

Twitter also allows its users to quote tweets. Quoting is republishing a tweet within quotation marks with the ability to manually add one's own comment on it. As the comment alone is useless without the original tweet, we also extracted the original tweet of all quoting tweets and stored both, to be able to render any quoting tweet together with its original tweet in the annotation tool. If a user x, for example, published a tweet with the text “#PressPlay Looks like #Rihanna had too much to drink last night ! via: @rihmotions #PourItUpPourItUp” and a user y quoted this tweet and added his/her own comment “My girl was lit lol”, both tweets are stored, with the quoting tweet referencing the quoted one. The text of the quoting tweet is then always shown in the annotation tool as “My girl was lit lol Quoting: “#PressPlay Looks like #Rihanna had too much to drink last night ! via: @rihmotions #PourItUpPourItUp””.

3.2.5. Filter out Non-English Tweets

More than 53% of the original tweets streamed during the first three streaming days (Dec 31, 2016, Feb 25, 2017 and May 20, 2017) were non-English. These tweets were removed, which strongly reduced the number of processed tweets. Non-English tweets can be identified using the “lang” property available in each tweet's data received from Twitter. This process was later integrated as an additional filter for the public streams using the “language” parameter available in all streaming APIs offered by Twitter.

3.2.6. Filter out Useless Tweets

Useless tweets are tweets with almost no textual sentimental information content. During the first experimental run of the annotation tool, we found that a lot of tweets do not provide much value in the controversy domain. These tweets distracted the annotators and affected their decisions, given the limited number of presented tweets involving each music artist. Most of these tweets were about buying tickets for a concert, buying t-shirts, voting for an artist or a now-playing song. We processed all the tweets streamed during the first three streaming days and calculated how many times each word occurred in all these tweets. We considered all 1-grams and 2-grams in these tweets, i.e., each single word and each combination of two consecutive words, respectively. We then removed all stop words and meaningless words from the most frequently occurring 1-grams, and all combinations of two stop words and meaningless combinations from the most frequently occurring 2-grams.

Figure 3.3 and Figure 3.4 show the most frequently occurring words, after removing meaningless words, for 1-grams and 2-grams, respectively.


Figure 3.3 Most occurring (1-gram) words in all streamed tweets during the first 3 streaming days


Figure 3.4 Most occurring (2-gram) words in all streamed tweets during the first 3 streaming days
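The word-frequency analysis described above can be sketched with NLTK's tweet tokenizer and a simple counter; the stop-word handling shown here is an assumption, since the thesis compiled the final unwanted-word list manually from the resulting frequency lists.

    from collections import Counter

    from nltk.corpus import stopwords          # requires nltk.download("stopwords")
    from nltk.tokenize import TweetTokenizer

    STOP_WORDS = set(stopwords.words("english"))
    tokenizer = TweetTokenizer(preserve_case=False)


    def ngram_counts(texts):
        """Count 1-gram and 2-gram occurrences over a list of tweet texts,
        dropping 1-grams that are stop words and 2-grams made of two stop words."""
        unigrams, bigrams = Counter(), Counter()
        for text in texts:
            tokens = tokenizer.tokenize(text)
            unigrams.update(t for t in tokens if t.isalpha() and t not in STOP_WORDS)
            bigrams.update(
                (a, b) for a, b in zip(tokens, tokens[1:])
                if not (a in STOP_WORDS and b in STOP_WORDS)
            )
        return unigrams, bigrams


    # The most frequent terms then feed the manually compiled unwanted-word list:
    # unigrams, bigrams = ngram_counts(tweet_texts)
    # print(unigrams.most_common(50), bigrams.most_common(50))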

Using these two word lists, we manually compiled a list of about 50 unwanted words. This list contains words from the most frequently occurring words that carry no sentimental information (e.g., vote, Instagram, ticket, tix, now playing, etc.). Based on the most frequently occurring words and the feedback from the annotators, we removed all tweets matching any of the following criteria:

- Tweets starting with “Retweet” or “RT”, which encourage Twitter users to retweet about some music artist.
- Tweets containing any of the unwanted words compiled from the most frequently occurring words.


- Tweets containing a hyperlink, as we are only processing textual tweets. These links point to external articles, images or videos, and we asked the annotators to annotate based on the presented text only.
- Tweets containing a quotation, which, in almost all observed cases, consisted of lyrics from a song of the music artist to be annotated.
- Tweets containing many hashtags. We considered a tweet to contain many hashtags when the combined length of its hashtags equals the length of its text without any hashtags; in other words, when the length of its text without hashtags is less than half of its full length.
- Tweets with a very short or very long text. Based on the histogram of the text length of the tweets, shown in Figure 3.5, we noticed that there are many tweets with a very short text and some with a very long one. Long-text tweets are tweets with more than 140 characters, which is the maximum tweet length allowed by Twitter; we nevertheless received some tweets with more than 140 characters, which might be caused by an encoding problem. As each of the processed tweets contains at least the artist name, we considered any tweet with fewer than 25 characters as a very short one; in almost all observed cases these consisted of the name of an artist and the title of one of his/her songs (e.g., “#BrunoMars - Grenade”, “Slide #migos”, etc.).

A simplified sketch of these filtering rules is given below, after Figure 3.5.

Figure 3.5 Text length of the complete processed tweets
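As announced above, the following is a simplified reconstruction of the filtering rules; UNWANTED_WORDS stands for the manually compiled unwanted-word list, of which only a small excerpt is shown, and the exact thresholds follow the description in this section.

    import re

    # Small excerpt of the manually compiled unwanted-word list (about 50 words).
    UNWANTED_WORDS = {"vote", "instagram", "ticket", "tix", "now playing"}


    def is_useless(text):
        """Return True when a tweet matches one of the filter criteria (simplified)."""
        lowered = text.lower()
        without_hashtags = re.sub(r"#\w+", "", text)
        return (
            lowered.startswith(("rt ", "retweet"))           # retweet-begging tweets
            or any(word in lowered for word in UNWANTED_WORDS)
            or "http" in lowered                              # tweets with hyperlinks
            or '"' in text                                    # quotations (mostly lyrics)
            or len(without_hashtags) < len(text) / 2          # too many hashtags
            or len(text) < 25 or len(text) > 140              # very short or very long
        )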

This process, along with filtering out non-English tweets, significantly reduced the number of tweets involving each artist; roughly, it removed about 96% to 98% of the streamed tweets involving each artist.

3.2.7. Extract the Involved Artist

As we are using multiple artist names as “track terms” to filter Twitter's public streams, the received tweets have to be processed to determine which of these names are involved in each tweet. Typically, a tweet involves one name, which can easily be determined using simple string matching. Nevertheless, some tweets involve more than one of the artist names used to filter the public streams. In these cases, we consider the tweet to involve the name that occurs first in its text, either as a hashtag or as text.

3.2.8. Filter out Duplicated Tweets

Although we considered only original tweets, we noticed some duplicates using the annotation tool. These duplicates mostly stem from copying original tweets rather than retweeting them.

To overcome the duplicates problem, we introduced an additional step after finishing the streaming process, in which we compared the texts of all the tweets involving an artist and removed any tweet with a duplicated text, keeping the oldest tweet, as it could be the origin of the duplicated tweet and it has the more accurate number of retweets. Instead of exact string matching to compare texts, we used the edit distance (Levenshtein distance), which measures how dissimilar (rather than similar) two texts are by counting the minimum number of operations (character insertion, removal, substitution) required to transform one of them into the other. This approach did not provide much value, as most duplicates could already be removed by simple exact string matching; it is implemented either with a recursive or with a dynamic programming algorithm which, in both cases, is time and resource consuming compared to the value it provides (a simplified sketch of this step is given after Figure 3.6).

3.2.9. Random Tweet Selection

After finishing all the processes described above, some artists were still involved in many tweets (tens of thousands of tweets). As we are only presenting a limited number of tweets to be annotated and our experiments are based on this annotated dataset, we reduced the number of tweets involving these artists: we randomly selected 1700 tweets for each artist with more than 1700 tweets. This process balanced the number of tweets involving each artist. Figure 3.6 shows histograms of the number of tweets per artist before reducing this number, for tweets streamed using artist names as hashtags (A) and using artist names as texts (B). It also shows a histogram of the number of tweets per artist after reducing this number, combining all processed tweets (C).

A – Number of tweets per artist (streamed using names as hashtags)    B – Number of tweets per artist (streamed using names as texts)


C – Number of tweets per artist

Figure 3.6 Number of tweets per artist
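As referenced in section 3.2.8, the duplicate-removal step can be sketched as follows; the edit-distance threshold is an assumption, as the thesis does not state the exact cut-off used.

    def levenshtein(a, b):
        """Minimum number of insertions, deletions and substitutions needed to
        turn string a into string b (row-wise dynamic programming)."""
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                 # deletion
                    current[j - 1] + 1,              # insertion
                    previous[j - 1] + (ca != cb),    # substitution
                ))
            previous = current
        return previous[-1]


    def deduplicate(tweets, max_distance=3):
        """Keep the oldest tweet of every group of (near-)identical texts.

        `tweets` are dictionaries with "text" and "created_at" keys, where
        "created_at" is assumed to be a parsed, comparable timestamp;
        `max_distance` is an assumed threshold, not stated in the thesis.
        """
        kept = []
        for tweet in sorted(tweets, key=lambda t: t["created_at"]):
            if all(levenshtein(tweet["text"], k["text"]) > max_distance for k in kept):
                kept.append(tweet)
        return kept

As noted above, a plain exact-match comparison already removes most duplicates at a much lower cost, which is why the edit-distance variant added little value in practice.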

3.3. Sentiment Analysis

Providing some of the best sentiment analysis results, VADER [24], explained in detail in section 2.4.1.2, is integrated into the Natural Language Toolkit (NLTK).39 NLTK is a platform with a suite of libraries for natural language processing implemented in Python [64]. We used the implementation of VADER available in NLTK to compute the sentiments expressed in the processed tweets. This implementation provides four sentiment scores, for positive, negative and neutral sentiments, in addition to a compound score ranging from -1 for a completely negative sentiment to +1 for a completely positive sentiment, with the neutral sentiment at 0. Based on the computed sentiment scores, we evaluated whether people tend to be more positive on days with special events or during weekends. Figure 3.7 shows histograms of the compound sentiment scores of the processed tweets streamed during the first three streaming days: New Year's Eve on Dec 31, 2016 (A), a weekend on Feb 25, 2017 (B), and another weekend on May 20, 2017 (C). Table 3.4 aggregates the numbers of positive and negative tweets and shows the ratio of positive to negative tweets for each of those days. For this analysis, we considered a tweet as positive when its compound sentiment score is greater than 0 and as negative when it is smaller than 0. The diagrams and the table show that people tend to share more positive tweets about music artists in general, regardless of the day of the year.

39 http://www.nltk.org/
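Computing the VADER scores with NLTK is straightforward; a minimal example (the lexicon has to be downloaded once):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
    analyzer = SentimentIntensityAnalyzer()

    scores = analyzer.polarity_scores(
        "Mariah Carey's backup dancers deserve Oscars for their performance"
    )
    # `scores` is a dict with "neg", "neu", "pos" and "compound" entries; "compound"
    # ranges from -1 (completely negative) to +1 (completely positive).
    print(scores)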


A – Tweets streamed during Dec 31, 2016 B – Tweets streamed during Feb 25, 2017

C – Tweets streamed during May 20, 2017

Figure 3.7 Sentiment scores of the processed tweets streamed during the first 3 streaming days

Table 3.4 Number of positive and negative tweets streamed during the first 3 streaming days

Streaming Day    Number of Positive Tweets    Number of Negative Tweets    Ratio (Positive / Negative)
Dec 31, 2016     442                          212                          2.08
Feb 25, 2017     384                          178                          2.16
May 20, 2017     369                          199                          1.85

We also evaluated whether people tend to be more positive when they mention the music artist they are tweeting about using his/her name as a hashtag. Figure 3.8 shows histograms of the compound sentiment scores of all the processed tweets streamed during all streaming days using artist names as hashtags (A) and as texts (B). Table 3.5 aggregates the numbers of positive and negative tweets and the ratio of positive to negative tweets for each of the two filtering methods. The diagrams and the table reinforce the previous finding that people tend to share more positive tweets about music artists in general, regardless of whether they mention them using their names as hashtags or as texts.


A – Tweets streamed using artist names as hashtags*    B – Tweets streamed using artist names as texts*

Figure 3.8 Sentiment scores of all the processed tweets streamed using artist names as hashtags and texts
* The number of tweets at the neutral score 0 is much higher; the plots are zoomed in for a better interpretation

Table 3.5 Number of positive and negative tweets streamed using artist names as hashtags and texts

Filtering Method (Artist Names as)    Number of Positive Tweets    Number of Negative Tweets    Ratio (Positive / Negative)
Hashtags                              4357                         2280                         1.91
Texts                                 17997                        9136                         1.97
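The positive/negative counts in Tables 3.4 and 3.5 follow directly from the compound scores; a minimal sketch of this aggregation:

    def positive_negative_ratio(compound_scores):
        """Count tweets with a compound score > 0 as positive and < 0 as negative
        (scores of exactly 0 are ignored), as done for Tables 3.4 and 3.5."""
        positive = sum(1 for s in compound_scores if s > 0)
        negative = sum(1 for s in compound_scores if s < 0)
        ratio = positive / negative if negative else float("inf")
        return positive, negative, ratio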

3.4. Data Storage

Figure 3.9 presents the simple model we used to represent the streamed tweets' data and the collected artists' information. We used MongoDB40 to store this data using PyMongo,41 the recommended Python distribution containing tools for working with MongoDB. MongoDB is a free and open-source document-oriented database. It is classified as a NoSQL database and uses JSON-like documents with optional schemas. These documents are represented in a binary-encoded format called BSON (Binary JSON), which extends the JSON model with additional data types and other features. We provided every modeled entity with two methods, as_dictionary() and from_dictionary(), which convert an object of an entity into a dictionary data structure and vice versa; the dictionary is the Python data structure closest to the JSON representation used in MongoDB. The LastFM_Artist, Spotify_Track and PublicList_Artist entities represent artists' information collected from Last.fm, Spotify and other publicly available lists (Billboard, The Top Tens, Google), respectively. “name” in LastFM_Artist and PublicList_Artist and “artist_name” in Spotify_Track were combined in the Entity entity, which represents entities involved in controversies; these entities are music artists in the scope of this research. The data represented by this entity was used to filter Twitter's public streams API, as discussed in a previous section.

40 https://www.mongodb.com/
41 https://api.mongodb.com/python/current/


The User entity represents Twitter users who are the publishers of the streamed tweets. These tweets are represented in the Status entity; “Status” is how Twitter refers to tweets in all its streaming APIs.

Figure 3.9 Model (class diagram) used to represent streamed and collected data
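A minimal sketch of the storage pattern described in this section: a simplified stand-in entity with as_dictionary()/from_dictionary() methods persisted through PyMongo. The database and collection names are placeholders, not taken from the thesis.

    from pymongo import MongoClient


    class Entity:
        """Simplified stand-in for the Entity class of Figure 3.9."""

        def __init__(self, name, source):
            self.name = name
            self.source = source

        def as_dictionary(self):
            # Dictionaries map directly to the JSON-like documents MongoDB stores.
            return {"name": self.name, "source": self.source}

        @classmethod
        def from_dictionary(cls, document):
            return cls(document["name"], document["source"])


    client = MongoClient()                          # assumes a local MongoDB instance
    collection = client["controversy"]["entities"]  # placeholder database/collection names

    artist = Entity("Rihanna", "Last.fm")
    collection.insert_one(artist.as_dictionary())
    loaded = Entity.from_dictionary(collection.find_one({"name": "Rihanna"}))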

3.5. Data Annotation

We built a simple web-based annotation tool where only authorized annotators can help estimate how controversial sets of tweets involving music artists are. The tool was developed using Python and Django,42 a free and open-source Python web framework, and hosted on Amazon Web Services (AWS).43 We also used the Bootstrap44 framework to keep the annotation tool responsive, fitting both desktop and mobile screens, as most of the annotators preferred using their mobile devices. Figure 3.10 and Figure 3.11 show screenshots of the annotation tool taken from a desktop and a mobile screen, respectively.

42 https://www.djangoproject.com/
43 https://aws.amazon.com/
44 http://getbootstrap.com/


Figure 3.10 Annotation tool - desktop screen shot

Figure 3.11 Annotation tool - mobile screen shot

We defined an annotation set as a music artist together with a set of tweets involving this artist. This set of tweets is selected randomly from all the processed tweets involving the artist (see section 3.2). We used sets of 50 tweets to keep the balance between showing the annotators as many tweets as possible and making sure that the time required for each annotation remains reasonable to avoid user fatigue. The annotation tool presents each annotation set with some basic instructions.


Each annotation set can be annotated by selecting one of four presented options: “Clearly non-controversial”, “Possibly non-controversial”, “Possibly controversial” and “Clearly controversial”. For each annotation option, we chose at least one example (annotation set) from the data streamed during the first three streaming days (see section 3.1). We explained the annotation options using these chosen examples to 8 annotators with excellent English knowledge. Three annotators were primary (#1, #2 and #3) and were asked to finish as many annotations as possible; the other five were secondary and were asked to help in their free time. Table 3.6 shows basic information about each of the 8 annotators along with their group as a primary or a secondary annotator.

Table 3.6 Basic information about the annotators

Annotator    Group        Gender    Age    Musical Affinity
#1           Primary      Male      29     Medium
#2           Primary      Female    21     Medium
#3           Primary      Female    30     Medium
#4           Secondary    Male      25     High
#5           Secondary    Male      26     High
#6           Secondary    Male      28     Low
#7           Secondary    Male      29     Low
#8           Secondary    Male      28     High

We asked the annotators to consider only the presented text in their evaluations, not to follow any reference to external resources and, more importantly, not to be biased by their own knowledge of (or feelings about) the music artist. Annotators could skip annotating an annotation set whenever the presented tweets do not provide enough information to evaluate the controversy level of the artist involved in this set. Annotators could also refresh the page presenting the annotation set to change the presented set of tweets, as described in the annotation instructions. We evaluated the time required per annotation for each of the annotators. This time was estimated using the difference between the submission times of every two consecutively submitted annotations, considering only differences lower than 10 minutes to exclude long breaks taken by the annotators. Figure 3.12 shows the average time for each annotator in minutes. We noticed that annotator #7 took much less time than the other annotators; since he was one of the secondary annotators, we excluded all his annotations from the final results.



Figure 3.12 Average annotation time per annotator (in minutes)
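The average annotation time shown in Figure 3.12 was derived from the submission timestamps as described above; a sketch of that computation, assuming the submission times of one annotator are available as a sorted list of datetime objects:

    from datetime import timedelta


    def average_annotation_time(submission_times, cutoff=timedelta(minutes=10)):
        """Average time between consecutive annotation submissions of one annotator,
        ignoring gaps of 10 minutes or more (assumed to be breaks).

        `submission_times` is a chronologically sorted list of datetime objects."""
        gaps = [
            later - earlier
            for earlier, later in zip(submission_times, submission_times[1:])
            if later - earlier < cutoff
        ]
        if not gaps:
            return None
        return sum(gaps, timedelta()) / len(gaps)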

Despite all the strict rules applied during the processing of the tweets, the annotators skipped some annotation sets. We analyzed all skipped sets and found that we still had some artists with ambiguous names (“Nirvana” (a rock band), “Oasis” (a rock band), etc.). We removed all artists that were skipped by at least two of the primary annotators, along with their tweets. This process removed about 10 artists and 8500 tweets from our dataset, ending up with exactly 95 music artists and therefore 95 annotated annotation sets. Figure 3.13 shows the number of annotations done by each annotator along with how many times they skipped an annotation set.


Figure 3.13 Number of annotations (skips) per annotator

We always selected the music artist to be annotated next as the one with the smallest number of received annotations. This kept the number of annotations received per artist balanced; as a result, each artist received exactly 3 annotations (only 7 artists received 4 annotations). As a further consequence of this selection criterion, no artist received more than one annotation from the secondary annotators. Figure 3.13 also shows that the secondary annotators, combined, did not provide many annotations compared to what the primary annotators provided.

That is why we combined all annotations done by the secondary annotators (except for annotator #7, whose annotations were discarded) and considered them as done by a single “virtual” annotator.

As our annotation options are ordinal values, we used weighted kappa [65], a statistic which measures inter-rater agreement for ordinal values, to measure the agreement between our annotators. We averaged the agreement values between every two annotators to get a general inter-rater agreement value of 0.335. This value may be considered a fair agreement according to the table proposed in [66] for interpreting kappa values, although this table is not universally accepted. We also computed the intra-class correlation coefficient [67] between the values provided by each annotator and obtained a correlation coefficient of 0.443. This value may also be considered a fair agreement according to the table proposed in [68].45

To compute the controversy score of each music artist, we mapped the annotation options to numeric values using a simple convention: “Clearly non-controversial” → 1, “Possibly non-controversial” → 2, “Possibly controversial” → 3 and “Clearly controversial” → 4. The final controversy score associated with each artist was computed by averaging all annotation scores received for this artist. Figure 3.14 shows a histogram of the controversy scores of all artists in the created dataset. Table 3.7 lists the 5 most and the 5 least controversial artists in this dataset along with their controversy scores.

Figure 3.14 Controversy scores of music artists

Table 3.7 Most and least controversial music artists

Most Controversial Artists    Score    Least Controversial Artists    Score
The Chainsmokers              4.0      The Weeknd                     1.0
Kodak Black                   4.0      Pink Floyd                     1.0
Desiigner                     4.0      alt-J                          1.0
Chris Cornell                 4.0      Halsey                         1.0
Kesha                         4.0      Gorillaz                       1.0
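The inter-rater agreement reported above could be reproduced, for example, with scikit-learn's weighted Cohen's kappa; the sketch below assumes linear weights and fully aligned annotation lists, neither of which is stated explicitly in the thesis, and averages the pairwise values as described above.

    from itertools import combinations

    from sklearn.metrics import cohen_kappa_score


    def mean_pairwise_weighted_kappa(annotations):
        """`annotations` maps annotator -> list of ordinal labels (1..4), aligned so
        that position i refers to the same artist for every annotator.
        Returns the average linearly weighted kappa over all annotator pairs."""
        kappas = [
            cohen_kappa_score(annotations[a], annotations[b], weights="linear")
            for a, b in combinations(annotations, 2)
        ]
        return sum(kappas) / len(kappas)


    # Toy example with two annotators and five artists (hypothetical labels).
    print(mean_pairwise_weighted_kappa({
        "#1": [1, 2, 4, 3, 2],
        "#2": [1, 3, 4, 2, 2],
    }))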

To summarize, this chapter proposed a dataset which can be used to evaluate controversy detection approaches using both classification and regression machine learning techniques.

45 https://en.wikipedia.org/wiki/Intraclass_correlation

The created dataset contains 95 music artists annotated with a continuous controversy score ranging from 1 for “clearly non-controversial” artists to 4 for “clearly controversial” artists. The scores were computed manually, based on the annotation of 53441 tweets involving these artists, published by 43141 Twitter users during the selected streaming days.


Chapter 4

4. Experiments and Results

We used the created dataset to evaluate machine learning approaches for detecting controversies based on a large set of features. This chapter presents the feature set used for detecting controversies involving music artists along with an evaluation of its relevance to this problem. It also presents the evaluation of machine learning approaches applied to these features. Finally, it evaluates how well the use of these features generalizes by extracting (most of) them from another dataset from the news domain, which was used in one of the reference studies.

4.1. Feature Extraction

The evaluated feature set is designed to be as comprehensive as possible, likely including features that are not meaningful for this task, in an attempt not to omit any feature that might be useful. This set contains 41 features representing each music artist, in addition to a numeric feature representing how controversial this artist is. These features were determined in a multilevel feature extraction process: we first extracted features representing Twitter users, then features representing tweets, and finally, based on both feature sets, we extracted features representing the music artists involved in these tweets published by these users. Most of the tweet-based features are based on the approaches presented in [57, 60, 63]. We extended this set with features extracted from multiple external resources, which are presented in this section. In [60], the authors considered user-based features in detecting controversies in news articles; this thesis focuses more on this area, as it considers more user-based features in its evaluations.

4.1.1. External Resources

We compiled and manually processed multiple lists from different external resources to be used for extracting relevant features from tweets. Almost all these lists were used for extracting tweet-based features, except for the “Professions” list, which was used for extracting user-based features.

4.1.1.1. Profession List

This list contains 112 occupations compiled from three main sources:

- An automatic list compiled by Google and accessed by searching for “Professions List” using Google's search engine.
- Wikipedia's list of occupations.46
- A list of professions available on 123test.com.47

We manually checked the professions in these sources and selected common and top-level professions (e.g., “Military Engineer” was compiled as “Engineer”).

46 https://en.wikipedia.org/wiki/Lists_of_occupations; Accessed in Mar 2017
47 https://www.123test.com/professions/; Accessed in Mar 2017


4.1.1.2. Controversial Word List

This list contains 1474 controversial terms (words and phrases). The list was compiled by manually processing Wikipedia's list of controversial issues.48 These issues are mentioned as single words or as sentences. We first extracted the controversial terms when they were mentioned within sentences. Then, we processed all terms to ensure a better string matching process (e.g., “Women's rights and feminism” was compiled as “women's rights”, “women rights” and “feminism”). Finally, we added common synonyms for some of these terms. Names and controversial phrases available in this list were kept as they were, without processing.

4.1.1.3. Controversial Abbreviation List

This list contains 42 controversial abbreviations compiled while processing the controversial word list (e.g., “World War II” was compiled as “WWII”). This list was not combined with the original controversial word list, as the string matching process differs between matching terms and matching abbreviations. In the former, we match strings case-insensitively (e.g., “Censorship laws” may match “censorship laws” or any other capitalization), whereas, in the latter, we match strings using a full-word, case-sensitive approach (e.g., “WWII” will only match “WWII” and neither “wwii” nor “wwiixyz”).

4.1.1.4. Positive Word List

This list contains 266 positive words compiled from a list available on EnchantedLearning.com49 (e.g., “reward” and “wow”).

4.1.1.5. Negative Word List

This list contains 939 negative words compiled from three sources:

- A list of negative words available on EnchantedLearning.com50 (e.g., “boring” and “sad”).
- A list of dirty words available on github.com.51 This list is based on a metasearch engine from Google called “What Do You Love” (WDYL), which is currently inactive.
- A list of swear words available on noswearing.com.52

4.1.1.6. Slang Word List

This list contains 5379 slang words compiled from a list of slang words available on noslang.com.53 The list contains slang words along with their meanings (e.g., “abt” for “about” and “afc” for “away from computer”).

48 https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues; Accessed in Mar 2017
49 http://www.enchantedlearning.com/wordlist/positivewords.shtml; Accessed in Mar 2017
50 http://www.enchantedlearning.com/wordlist/negativewords.shtml; Accessed in Mar 2017
51 https://gist.github.com/jamiew/1112488; Accessed in Mar 2017
52 http://www.noswearing.com/dictionary; Accessed in Mar 2017
53 https://www.noslang.com/dictionary/; Accessed in Mar 2017


4.1.1.7. Positive and Negative Emoticon List

This list contains 47 positive emoticons (e.g., “:)”) and 21 negative ones (e.g., “:(”). These emoticons are common in the domain of social media and were compiled from the source code of an open-source Python wrapper for the Twitter API available on github.com.54

4.1.2. User Features

These features were extracted for every unique user in the created dataset. A user is the publisher of a tweet involving one of the music artists. A user may have multiple tweets; Figure 4.1 shows a histogram of the number of tweets per user in the created dataset after processing it.

Figure 4.1 Number of tweets per user

The following list presents these features:

- Verified: a binary feature representing whether the Twitter account of this user is verified.
- Tweets count: a numeric feature representing the number of tweets published by this user since the creation of his/her Twitter account. This number is usually not identical to the number of tweets of this user in the created dataset.
- Likes count: a numeric feature representing the number of tweets liked by this user since the creation of his/her Twitter account. This number is usually not identical to the number of tweets liked by this user in the created dataset.
- Followings count: a numeric feature representing the number of Twitter users followed by this user.
- Followers count: a numeric feature representing the number of Twitter users following this user.
- Lists count: a numeric feature representing the number of public lists joined by this user.
- Description existed: a binary feature representing whether the Twitter account of this user has a description.
- Profession included: a binary feature representing whether the description of the Twitter account of this user, if there is one, contains the profession of the user. We used the compiled profession list (explained earlier) to look for the profession of the user; 2251 out of the 43141 users in the created dataset have included their profession in the description of their account.

54 https://github.com/bear/python-twitter; Accessed in Mar 2017


- Days since account creation: a numeric feature representing the number of days that have passed since the creation of the Twitter account of this user. We used a date in the near future (Jan 01, 2018) as a reference date and computed this feature as the number of days between the account creation date and this reference date. We did not use the Unix timestamp (the number of seconds that have elapsed since Jan 01, 1970), as we are processing data from a relatively short period of time: the processed time differences are very small compared to the time since the epoch (Jan 01, 1970).

The user features were used to compute one new feature representing how active and trustworthy the account of a user is; we refer to this feature as the “participation score”. To compute this score for a user, we first normalized all numeric features to the range [0, 1] and then combined the normalized values into one value by averaging them; this combined value represents the participation score of the user. We used the logarithm (base 10) of the numbers of tweets, likes, followings, followers and lists to normalize their associated feature values, as these values span a very large scale. For example, some users have about 20 million followers; if we used a linear normalization technique, other users would have almost 0 for the followers count feature. Using the logarithm gives more importance to small values, i.e., increasing the number of followers from 1000 to 2000 has more impact on the normalized value than increasing it from 1000000 to 1001000, which is very close to how people perceive these changes. Using a weighted sum instead of the average to combine the normalized values into the participation score would be a better approach, but it needs proper weights for the contributing features.

4.1.3. Tweet Features

These features were extracted for every unique tweet in the created dataset based on its content, its metadata and the features extracted for its user. They can be categorized into multiple groups, as presented in the following list:

Controversy Features:

- Controversial words count: a numeric feature representing the number of controversial words in the text of this tweet.
- Controversial abbreviations count: a numeric feature representing the number of controversial abbreviations in the text of this tweet.
- Controversial terms percentage: a numeric feature representing the percentage of controversial words and abbreviations in the text of this tweet compared to the total number of tokens in this text.

Sentiment Features:

- Neutral, positive, negative and compound sentiment scores: numeric features representing the scores of the sentiment expressed in this tweet. These four scores were computed using the implementation of VADER [24] available in the Natural Language Toolkit (NLTK).
- Positive and negative words count: numeric features representing the numbers of positive and negative words in the text of this tweet.
- Positive and negative emoticons count: numeric features representing the numbers of positive and negative emoticons in the text of this tweet.


Syntactic Features:55 (a code sketch following the feature list at the end of this section shows how these counts can be extracted)
- Tokens count: a numeric feature representing the number of tokens in the text of this tweet.
- Nouns, verbs, adjectives and adverbs count: numeric features representing the number of nouns, verbs, adjectives and adverbs in the text of this tweet.
- Capitalized words count: a numeric feature representing the number of words written entirely in capital letters in the text of this tweet. This feature reflects the expressiveness of the tweet.
- Slang words count: a numeric feature representing the number of slang words in the text of this tweet.
- Expressive punctuation marks count: a numeric feature representing the number of expressive punctuation marks in the text of this tweet. The most common punctuation marks used on social media to express a feeling are "!", "?" and "$".

Twitter-based Features:
- Likes count: a numeric feature representing the number of times this tweet has been liked.
- Retweets count: a numeric feature representing the number of times this tweet has been retweeted.
- User mentions and hashtags count: numeric features representing the number of user mentions and hashtags in this tweet. User mentions start with the "@" sign in Twitter, while hashtags start with the "#" sign.
- Reply: a binary feature representing whether this tweet is a reply to another one.
- Quoting: a binary feature representing whether this tweet is quoting another one.

User-based Features:
- User participation score: a numeric feature representing the participation score of the user who published this tweet. This score is computed as described earlier based on the features extracted for this user.
- User verified: a binary feature representing whether the Twitter account of the user who published this tweet is verified.
- Username includes artist name: a binary feature representing whether the username of the user who published this tweet is related to the music artist involved in it. This user may be the Twitter account of the artist themselves or just related to them (e.g., "One Direction Fans"). This feature should typically reduce the effect of this tweet, as sentiments and opinions expressed in such tweets are mostly biased.

4.1.4. Music Artist Features

These features were extracted for every music artist in the created dataset based on the features extracted for the tweets involving this artist and their publishers (users). They can also be categorized into multiple groups as presented in the following list:

Controversy Features:

55 We used the Natural Language Toolkit (NLTK) to tokenize and tag texts using part of speech tagging approaches.


- Controversial terms mean and SD: numeric features representing the mean and standard deviation of the share of controversial terms in the tweets involving this artist, respectively.
- Controversial terms count: a numeric feature representing the number of controversial terms in all tweets involving this artist.

Sentiment Features:
- Positive and negative tweets percentage: numeric features representing the percentage of positive and negative tweets compared to the total number of tweets involving this artist. Based on Figure 4.2, which shows histograms of the positive (A) and negative (B) sentiment scores of all processed tweets, we consider a tweet to be positive when its positive sentiment score is greater than 0.2 (the peak in panel A) and negative when its negative sentiment score is greater than 0.15 (the peak in panel B). Both scores range from 0 to 1, where 1 indicates the strongest sentiment.

Figure 4.2 Positive (A) and negative (B) sentiment scores of all processed tweets * The number of tweets at the neutral score 0 is much higher; the plots are zoomed in for better interpretation

- Neutral, positive, negative and compound sentiment mean: numeric features representing the mean of neutral, positive, negative and compound sentiment scores of the tweets involving this artist.
- Neutral, positive, negative and compound sentiment SD: numeric features representing the standard deviation of neutral, positive, negative and compound sentiment scores of the tweets involving this artist.
- Positive and negative words mean: numeric features representing the mean of positive and negative words count of the tweets involving this artist.
- Positive and negative emoticons mean: numeric features representing the mean of positive and negative emoticons count of the tweets involving this artist.

Syntactic Features:
- Tokens mean: a numeric feature representing the mean of tokens count of the tweets involving this artist.
- Nouns, verbs, adjectives and adverbs mean: numeric features representing the mean of nouns, verbs, adjectives and adverbs count of the tweets involving this artist.


- Capitalized words mean: a numeric feature representing the mean of capitalized words count of the tweets involving this artist.
- Slang words mean: a numeric feature representing the mean of slang words count of the tweets involving this artist.
- Expressive punctuation marks mean: a numeric feature representing the mean of expressive punctuation marks count of the tweets involving this artist.

Twitter-based Features:
- Likes mean and SD: numeric features representing the mean and standard deviation of likes count of the tweets involving this artist, respectively.
- Retweets mean and SD: numeric features representing the mean and standard deviation of retweets count of the tweets involving this artist, respectively.
- User mentions and hashtags mean: numeric features representing the mean of user mentions and hashtags count of the tweets involving this artist.
- Reply and quoting tweets percentage: numeric features representing the percentage of reply and quoting tweets compared to the total number of tweets involving this artist.

User-based Features:
- User participation mean and SD: numeric features representing the mean and standard deviation of participation scores of the users who published the tweets involving this artist.
- Verified users percentage: a numeric feature representing the percentage of verified users compared to the total number of users who published the tweets involving this artist.
- Username includes artist name percentage: a numeric feature representing the percentage of users with a username related to the artist compared to the total number of users who published the tweets involving this artist.
- Tweets by verified users percentage: a numeric feature representing the percentage of tweets published by verified users compared to the total number of tweets involving this artist.
- Tweets by very active users percentage: a numeric feature representing the percentage of tweets published by very active users compared to the total number of tweets involving this artist. Based on Figure 4.3, showing a histogram of the participation scores of all users in the created dataset, we considered a user to be very active when he/she has a participation score greater than 0.4.

Figure 4.3 User participation


Other Features:
- Tweets and users count: numeric features representing the number of tweets involving this artist and the number of unique users who published these tweets.
- Tweets per user mean: a numeric feature representing the mean number of tweets involving this artist per user who published these tweets.
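To make the participation score of Section 4.1.2 concrete, the following minimal Python sketch log-scales the count features, min-max normalizes everything to [0, 1] and averages the results. The field names and the toy value ranges are illustrative assumptions, not the exact schema or bounds used for the created dataset.

import math

# Hypothetical user record; field names are illustrative, not the thesis's schema.
user = {
    "verified": 1, "tweets": 15300, "likes": 4200,
    "followings": 310, "followers": 1250, "lists": 4,
    "description_existed": 1, "profession_included": 0,
    "days_since_creation": 2100,
}

COUNT_FEATURES = ["tweets", "likes", "followings", "followers", "lists"]

def normalize(value, lo, hi):
    # Min-max normalization into [0, 1]; lo/hi come from the whole dataset.
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def participation_score(user, ranges):
    # Average of all normalized user features (count features on a log10 scale).
    values = []
    for name, raw in user.items():
        if name in COUNT_FEATURES:
            raw = math.log10(raw + 1)          # compress the huge value range
        lo, hi = ranges[name]
        values.append(normalize(raw, lo, hi))
    return sum(values) / len(values)

# 'ranges' holds per-feature (min, max) bounds observed over all users, with the
# count features already on the log10 scale; the bounds below are toy values.
ranges = {
    "verified": (0, 1), "description_existed": (0, 1), "profession_included": (0, 1),
    "tweets": (0, 8), "likes": (0, 8), "followings": (0, 8),
    "followers": (0, 8), "lists": (0, 8),
    "days_since_creation": (0, 4000),
}
print(round(participation_score(user, ranges), 3))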
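Similarly, the syntactic tweet features of Section 4.1.3 can be derived with NLTK's tokenizer and part-of-speech tagger, roughly as sketched below. The slang list and the exact counting rules shown here are placeholders, not the resources used in the thesis, and the NLTK tokenizer and tagger resources must be downloaded beforehand.

from nltk import word_tokenize, pos_tag

EXPRESSIVE_MARKS = {"!", "?", "$"}
SLANG = {"lol", "omg", "wtf", "af"}            # placeholder slang list

def syntactic_features(text):
    tokens = word_tokenize(text)
    tags = [tag for _, tag in pos_tag(tokens)]  # Penn Treebank POS tags
    return {
        "tokens_count": len(tokens),
        "nouns_count": sum(t.startswith("NN") for t in tags),
        "verbs_count": sum(t.startswith("VB") for t in tags),
        "adjectives_count": sum(t.startswith("JJ") for t in tags),
        "adverbs_count": sum(t.startswith("RB") for t in tags),
        "capitalized_words_count": sum(tok.isupper() and len(tok) > 1
                                       for tok in tokens),
        "slang_words_count": sum(tok.lower() in SLANG for tok in tokens),
        "expressive_punctuation_count": sum(tok in EXPRESSIVE_MARKS
                                            for tok in tokens),
    }

print(syntactic_features("OMG this new album is AMAZING !!!"))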

4.2. Feature Analysis

The main goal of conducting feature analysis on data that will be used to build prediction models is to select relevant and non-redundant features from this data. Irrelevant and redundant features are common data problems in machine learning. Most prediction models perform better when these features are removed, as this process usually simplifies the models, shortens the training time, avoids the curse of dimensionality and enhances generalization by reducing overfitting. Irrelevant and redundant features are two different notions. Relevant features contribute to prediction accuracy, and some of these features may be redundant when they are highly correlated. Irrelevant features, on the other hand, never contribute to prediction accuracy; that is why having such features may negatively affect building the prediction model and is usually worse than having redundant features.

We analyzed the extracted features for music artists using the "caret" (Classification and Regression Training) package available in R. This package contains tools to build and evaluate prediction models, conduct feature analyses and much more. It provides uniform interfaces to numerous machine learning functions available in different R packages [69].

4.2.1. Feature Correlation Analysis

We analyzed the correlation between each possible pair of features to detect redundant features. Two features were considered highly correlated, and thus redundant, if they had a correlation score smaller than -0.75 or greater than +0.75. A correlation score ranges from -1 for total negative correlation to +1 for total positive correlation, with a neutral point at 0 for no correlation. Table 4.1 shows these highly correlated features (a minimal code sketch of this redundancy check follows Table 4.2).

Table 4.1 Highly correlated features

1st Feature | 2nd Feature | Correlation Score
Tweets count | Users count | 0.970
Controversial terms mean | Controversial terms SD | 0.821
Controversial terms mean | Controversial terms count | 0.922
Positive tweets percentage | Positive sentiment mean | 0.977
Positive tweets percentage | Compound sentiment mean | 0.971
Negative tweets percentage | Negative sentiment mean | 0.979
Positive sentiment mean | Compound sentiment mean | 0.801
Negative sentiment mean | Negative sentiment SD | 0.791
Verbs mean | Adverbs mean | 0.759
Likes mean | Likes SD | 0.949
Likes mean | Retweets mean | 0.875
Likes mean | Retweets SD | 0.838
Likes SD | Retweets mean | 0.879
Likes SD | Retweets SD | 0.890
Retweets mean | Retweets SD | 0.986
Reply tweets percentage | User mentions mean | 0.814
Verified users percentage | Tweets by verified users percentage | 0.949

Table 4.2 shows the top negatively correlated features. None of these correlations revealed redundant features as they do not have scores smaller than the considered threshold for redundancy (-0.75).

Table 4.2 Top negatively correlated features

1st Feature | 2nd Feature | Correlation Score
Positive sentiment mean | Neutral sentiment mean | -0.706
Compound sentiment mean | Negative sentiment SD | -0.677
Positive tweets percentage | Neutral sentiment mean | -0.673
Compound sentiment mean | Negative sentiment mean | -0.661
Compound sentiment SD | Neutral sentiment mean | -0.646
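The redundancy check itself was performed with caret in R; an equivalent check can be sketched in Python with pandas as follows, where `features` is assumed to be a data frame with one row per music artist and one column per numeric feature (the toy data frame is only for illustration).

import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.75):
    # Return feature pairs whose absolute Pearson correlation exceeds the threshold.
    corr = features.corr()                     # pairwise Pearson correlation matrix
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):      # upper triangle only, no self-pairs
            score = corr.iloc[i, j]
            if abs(score) > threshold:
                pairs.append((cols[i], cols[j], round(score, 3)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Example with toy data; the real input is the artist-level feature table.
toy = pd.DataFrame({"likes_mean": [1, 2, 3, 4], "retweets_mean": [2, 4, 6, 7],
                    "tokens_mean": [9, 1, 5, 3]})
for a, b, r in highly_correlated_pairs(toy):
    print(a, b, r)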

4.2.2. Feature Importance Analysis

We analyzed the importance of each feature to detect irrelevant features, which are usually ranked as the least important ones. Feature importance is usually measured by building many prediction models using different subsets of the feature set and evaluating the accuracy of each model. Features that continuously generate the worst accuracy are considered the least important.

We used a linear regression model to evaluate feature importance. Linear regression is "a linear approach for modeling the relationship between a scalar dependent variable and one or more explanatory variables (or independent variables)".56 It is a common and well-known regression approach. This model was evaluated using R² (R-squared or coefficient of determination),57 a statistical measure of the proportion of the variance in the dependent variable that is predictable from the independent variables, and 10-fold cross validation. Each model was built three times using the same set of features. Figure 4.4 shows the 10 most important features (a sketch of a comparable importance computation follows Figure 4.4).

56 https://en.wikipedia.org/wiki/Linear_regression
57 https://en.wikipedia.org/wiki/Coefficient_of_determination


Figure 4.4 Feature importance scores using the linear regression prediction model (top 10 features: Compound sentiment SD, Positive emoticons mean, Tweets by verified users percentage, Verified users percentage, Username includes artist name percentage, User reputation SD, Slang words mean, Tweets by highly reputed users percentage, Verbs mean, Negative words mean)

To obtain a better feature importance analysis, it is recommended to use different prediction models and combine their results, as each model ranks the features according to its own algorithm. We therefore also used a Learning Vector Quantization (LVQ) model to evaluate feature importance. LVQ is a special Artificial Neural Network (ANN) that applies a winner-takes-all, Hebbian-learning-based approach and is the predecessor of Self-Organizing Maps (SOM) [70]. LVQ is a classification model that can only be applied to data with a binary or categorical class attribute. To be able to use this model on the created dataset, we adapted the dataset to fit classification models by converting the "controversy score" from a numeric attribute into a binary one indicating whether a music artist is controversial or not, using a threshold of 2.4. This adaption is discussed in more detail in Section 4.3.2. The LVQ model was evaluated using the area under the ROC curve,58 a curve that combines two measures (recall and inverse recall), and 10-fold cross validation. Each model was again built three times using the same set of features. Figure 4.5 shows the 10 most important features.

58 https://en.wikipedia.org/wiki/Receiver_operating_characteristic


Figure 4.5 Feature importance scores using the learning vector quantization prediction model (top 10 features: Negative words mean, Slang words mean, Negative tweets percentage, Negative sentiment mean, Positive emoticons mean, Compound sentiment mean, Negative sentiment SD, Controversial terms SD, User reputation mean, Controversial terms mean)

From the results of both analyses, we can notice that "positive emoticons mean", "slang words mean", "negative words mean" and "user participation mean" are among the most important features, being ranked highly by both prediction models. Table 4.3 shows the full list of features along with their normalized importance scores from both prediction models. To rank the features using a single importance score, we combined the scores from both models by multiplying their normalized values, which yields one general importance score for each feature. This combined score is also shown in Table 4.3.

Table 4.3 Feature importance scores * Features highly correlated with a more important feature (higher in the table)

# | Feature Name | LR Score (Normalized) | LVQ Score (Normalized) | Product Score
#1 | Negative words mean | 0.704 | 1.000 | 0.704
#2 | Slang words mean | 0.738 | 0.830 | 0.612
#3 | Positive emoticons mean | 0.904 | 0.673 | 0.609
#4 | Compound sentiment mean | 0.501 | 0.628 | 0.315
#5 | Negative sentiment mean | 0.405 | 0.674 | 0.273
 | Negative sentiment SD * | 0.452 | 0.574 | 0.260
#6 | Compound sentiment SD | 1.000 | 0.252 | 0.252
#7 | User participation mean | 0.589 | 0.421 | 0.248
 | Negative tweets percentage * | 0.267 | 0.733 | 0.196
#8 | Reply tweets percentage | 0.472 | 0.326 | 0.154
#9 | Tweets by very active users percentage | 0.719 | 0.201 | 0.145
#10 | Verified users percentage | 0.834 | 0.171 | 0.143
#11 | Controversial terms mean | 0.349 | 0.396 | 0.138
#12 | Verbs mean | 0.712 | 0.175 | 0.125
 | Positive tweets percentage * | 0.474 | 0.259 | 0.123
 | Tweets by verified users percentage * | 0.864 | 0.129 | 0.112
 | Controversial terms count * | 0.297 | 0.372 | 0.110
#13 | Neutral sentiment mean | 0.406 | 0.270 | 0.110
 | Positive sentiment mean * | 0.408 | 0.187 | 0.076
#14 | Tokens mean | 0.325 | 0.198 | 0.064
#15 | Retweets mean | 0.509 | 0.115 | 0.058
#16 | Tweets per user mean | 0.214 | 0.242 | 0.052
#17 | Positive words mean | 0.126 | 0.366 | 0.046
#18 | User participation SD | 0.803 | 0.054 | 0.043
#19 | Negative emoticons mean | 0.609 | 0.069 | 0.042
#20 | Hashtags mean | 0.111 | 0.357 | 0.040
#21 | Expressive punctuation marks mean | 0.558 | 0.068 | 0.038
 | Retweets SD * | 0.457 | 0.077 | 0.035
 | Likes mean * | 0.411 | 0.083 | 0.034
#22 | Username includes artist name percentage | 0.820 | 0.040 | 0.033
#23 | Capitalized words mean | 0.552 | 0.058 | 0.032
#24 | Neutral sentiment SD | 0.086 | 0.295 | 0.025
 | User mentions mean * | 0.249 | 0.092 | 0.023
 | Likes SD * | 0.291 | 0.075 | 0.022
 | Adverbs mean * | 0.308 | 0.057 | 0.018
#25 | Neutral sentiment SD | 0.076 | 0.173 | 0.013
#26 | Tweets count | 0.036 | 0.289 | 0.010
 | Controversial terms SD * | 0.023 | 0.440 | 0.010
#27 | Adjectives mean | 0.006 | 0.273 | 0.002
#28 | Nouns mean | 0.695 | 0.000 | 0.000
 | Users count * | 0.000 | 0.333 | 0.000

From this table, we can notice that the controversy features are not as relevant as reported in some of the reference studies [60]. A closer look at the dataset used in [60] and at the dataset created in this thesis reveals the reason: the compiled list of controversial terms contains mostly politics-related terms, which are regularly used in the CNN dataset of [60] but rarely in our dataset, which belongs, first, to the music domain and, second, to the social media domain, where such terms are seldom used. This also explains why the slang words feature is highly relevant: it acts as a good replacement for controversial terms, since most of the used slang words are swear words. We can also notice that sentiment-based features are still highly relevant, as reported in almost all reference studies. The proposed user-based features are also highly relevant, which encourages a deeper investigation in this area. Unfortunately, the small dataset that we created does not support such an investigation, as it does not contain many tweets per user.

4.3. Machine Learning Models Evaluations

We evaluated two categories of prediction models based on the features analyzed before, using 10-fold cross validation in all evaluations. All evaluations were conducted using WEKA (Version 3.8.1) [61, 62]. ZeroR is the simplest prediction model available in WEKA. This model ignores all independent attributes (input attributes) of a dataset and simply predicts the mean value of the dependent attribute (class attribute) for a numeric class attribute and the mode (majority) for a nominal (categorical) class attribute. We used this model as our baseline in both evaluation categories.

4.3.1. Regression Prediction Models

We evaluated five different regression models available in WEKA: Decision Table, Sequential Minimal Optimization (SMO), Multilayer Perceptron, Random Forest and Linear Regression. Decision Table predicts a value based on a modeled set of complex rules along with their corresponding actions [71]. SMO is a Support Vector Machine (SVM) optimized to solve the training problem of an SVM using heuristics that accelerate the rate of convergence; this algorithm trains an SVM more efficiently than solving the standard Quadratic Programming (QP) problem required to train one [72]. Multilayer Perceptron is a simple Artificial Neural Network (ANN); it uses backpropagation to train the model and the sigmoid function as the activation function for all its nodes [73]. Random Forest is an ensemble prediction model that constructs multiple decision trees in the training phase, then uses these trees to predict a class value for a new instance by returning the mean value of the predictions of these trees for a numeric class attribute and the mode (majority) for a nominal (categorical) class attribute [74].

We used the correlation coefficient (r) and the Root Mean Squared Error (RMSE) as our evaluation metrics. The correlation coefficient (r) measures how correlated predicted and actual (true) values are; its values range from -1 to 1, with 1 indicating the best performance. The Root Mean Squared Error (RMSE) measures the average distance (standard deviation) between predicted and actual (true) values; its values range from 0 to ∞, with 0 indicating the best performance.

r_{y,y'} = cov(y, y') / (σ_y · σ_{y'})

RMSE = √( (1/n) · Σ_{i=1}^{n} (y'_i − y_i)² )

Where:
- y is the list of actual (true) values.
- y' is the list of predicted values.

- r_{y,y'} is the correlation score between the two lists of values y and y'.
- cov(y, y') is the covariance of y and y'.


- σ_y is the standard deviation of a list of values y.
- y_i is the actual (true) value of data instance i.
- y'_i is the predicted value for data instance i.
- n is the total number of data instances.

We evaluated the five regression models using the most important features, starting with the most important one (a single feature) and adding one feature at a time until all features were included. We skipped adding features that are highly correlated with one of the already added ones (marked in Table 4.3 with *). Figure 4.6 and Figure 4.7 show the correlation coefficient and root mean squared error for all evaluated models, using the number of included most important features as the horizontal axis.
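The two evaluation metrics defined above reduce to a few lines of code; the following plain Python sketch is purely illustrative (WEKA computes both internally during evaluation).

import math

def rmse(actual, predicted):
    # Root mean squared error between actual and predicted values.
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def correlation_coefficient(actual, predicted):
    # Pearson correlation r between actual and predicted values.
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    return cov / (sd_a * sd_p)

actual = [1.0, 2.5, 3.0, 4.0]
predicted = [1.2, 2.2, 3.4, 3.6]
print(round(rmse(actual, predicted), 3),
      round(correlation_coefficient(actual, predicted), 3))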

Figure 4.6 RMSE for regression models using the most important features (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Linear Regression, Baseline)


Figure 4.7 Correlation coefficient for regression models using the most important features (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Linear Regression, Baseline)

Based on these figures, we can notice that using the most relevant features from the start improved the performance, which started to get slightly worse after including about half of the features. Table 7.1 and Table 7.2 in Section 7.1.1 in Appendix A show the evaluation results of the regression prediction models in more detail. SMO and Linear Regression performed best on both evaluation metrics; Linear Regression performs slightly better when using the #9 most important features (RMSE = 0.688, r = 0.580).

4.3.2. Classification Prediction Models

We evaluated four different classification models available in WEKA: Decision Table, Sequential Minimal Optimization (SMO), Multilayer Perceptron and Random Forest. The algorithms for these models are implemented in WEKA (Version 3.8.1) to support both numeric and nominal class attributes, i.e., they can be built as both regression and classification prediction models.

We used the F-measure and the area under the ROC curve as our evaluation metrics. The F-measure is widely used in evaluating binary classification models as it combines two well-known measures, precision and recall; it is the harmonic mean of both. Precision is the fraction of correctly classified positive instances compared to the total number of instances predicted as positive, while recall is the fraction of correctly classified positive instances compared to the total number of actual (true) positive instances. F-measure values range from 0 to 1, with 1 indicating the best performance.

F = 2 · (precision · recall) / (precision + recall)


The Receiver Operating Characteristic (ROC) curve also combines two measures: recall and inverse recall. Inverse recall is the fraction of instances incorrectly classified as positive compared to the total number of actual (true) negative instances. The area under this curve is also widely used in evaluating binary classification models; its values range from 0 to 1, with 1 indicating the best performance.

The controversy score in the created dataset is a numeric attribute, which cannot be used as the target class in classification models. To solve this problem in this set of evaluations, we changed the controversy score into a binary attribute indicating whether a music artist is controversial or not. We used a threshold τ as a cutoff, so that each music artist with a controversy score greater than τ is considered controversial:

controversial(a) = True if controversy_score(a) > τ, False if controversy_score(a) ≤ τ

The controversy score ranges from 1 for completely non-controversial to 4 for completely controversial. We evaluated multiple threshold values within the range [1.5, 2.5] with a step of 0.1. These evaluations are based on the complete feature set and use the default parameters defined in WEKA for each of the evaluated models. Figure 4.8, Figure 4.9 and Figure 4.10 show the number of controversial and non-controversial artists, the F-measure and the area under ROC, respectively, for the four evaluated classification models using different threshold values τ.
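A minimal sketch of the class-attribute adaption and the two classification metrics, using scikit-learn in place of WEKA; the toy controversy scores, the predicted probabilities and the 0.5 decision threshold are illustrative assumptions.

import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def binarize_controversy(scores, tau=2.4):
    # Map the numeric controversy score (1..4) to a binary class label.
    return (np.asarray(scores) > tau).astype(int)

# Toy annotations and model outputs; real values come from the 95-artist dataset.
controversy_scores = [1.3, 2.6, 3.8, 2.1, 2.5]
predicted_probs    = [0.1, 0.7, 0.9, 0.3, 0.4]     # model's P(controversial)

y_true = binarize_controversy(controversy_scores)
y_pred = (np.asarray(predicted_probs) >= 0.5).astype(int)

print("F-measure:", round(f1_score(y_true, y_pred), 3))
print("Area under ROC:", round(roc_auc_score(y_true, predicted_probs), 3))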

Figure 4.8 Number of controversial and non-controversial artists using multiple threshold values


Figure 4.9 F-measure using multiple threshold values (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Baseline)

Figure 4.10 Area under ROC using multiple threshold values (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Baseline)


From these figures, we can notice that a cutoff at τ = 2.0 keeps the best balance of the class attribute between the numbers of controversial and non-controversial artists. It is also clear that a cutoff at τ = 2.4 results in the best performance on both evaluation metrics, and at this cutoff point the class balance is not too skewed. Based on this analysis, we used a threshold value of τ = 2.4 to convert the numeric class attribute (controversy score) into a binary one. After adapting the class attribute to fit classification models, we evaluated the four classification models using the most important features, following the same methodology used for evaluating the regression models. Figure 4.11 and Figure 4.12 show the F-measure and the area under ROC for all evaluated models, using the number of included most important features as the horizontal axis.

Figure 4.11 F-measure for classification models using the most important features (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Baseline)


Figure 4.12 Area under ROC for classification models using the most important features (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Baseline)

Based on these figures, we can notice that the best performance was again achieved with a small set of the important features: Multilayer Perceptron performed best using the #4 - #5 most important features (F-measure = 0.811, area under ROC = 0.793). Table 7.3 and Table 7.4 in Section 7.1.2 in Appendix A show the evaluation results of the classification prediction models in more detail.

The results of evaluating the regression and classification models show that adding more features does not necessarily improve the performance of a prediction model. It does, however, increase the training time, which is much shorter with a smaller set of features. The main goal of this thesis is to define the best set of features to detect controversies involving music artists in social media. This set should not only improve the performance of the detection model but also optimize the resources required to build it. Based on the results of these evaluations, we believe that using the 5 to 9 most relevant non-redundant features of the evaluated set leads to the best system for detecting controversies involving music artists in Twitter (a sketch of the incremental evaluation loop used in this section follows).
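The incremental evaluation procedure used for both model categories can be sketched as follows, here with a scikit-learn classifier and 10-fold cross validation standing in for the WEKA models; the feature names, the redundancy set and the toy labels are placeholders, and the real ranking comes from Table 4.3.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def incremental_evaluation(features: pd.DataFrame, y, ranked_features, redundant):
    # Evaluate growing feature subsets ordered by importance, skipping redundant ones.
    selected, results = [], []
    for name in ranked_features:
        if name in redundant:                      # marked with * in Table 4.3
            continue
        selected.append(name)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores = cross_val_score(clf, features[selected], y,
                                 cv=10, scoring="roc_auc")
        results.append((len(selected), scores.mean()))
    return results

# Toy usage; the real inputs are the artist feature table and the binarized labels.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(95, 4)),
                  columns=["negative_words_mean", "slang_words_mean",
                           "positive_emoticons_mean", "likes_sd"])
labels = (df["negative_words_mean"] + rng.normal(scale=0.5, size=95) > 0).astype(int)
for k, auc in incremental_evaluation(df, labels, list(df.columns), {"likes_sd"}):
    print(k, round(float(auc), 3))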

4.4. News Dataset Evaluations

To gain better insights into the presented feature set, we evaluated the possibility of generalizing it to domains other than Twitter. We extracted as many of these features as possible from a dataset in the news domain and ran the complete evaluation process again. This section presents the results of this evaluation. In [60], the authors used a CNN dataset to evaluate their approach for detecting controversies involving news articles. This dataset contains 728 articles (376 controversial and 352 non-controversial) published by CNN and 522,595 comments written by 40,826 CNN users. The articles were annotated by multiple annotators as controversial or non-controversial. We contacted the main author and, thankfully, obtained a copy of this dataset. Figure 4.13 shows the structure of this dataset using its Entity Relationship Diagram (ERD).

Figure 4.13 CNN controversy dataset structure

We extracted almost the same features discussed earlier for the Twitter dataset. Articles in this dataset are mapped to music artists in the created dataset, being the entities involved in controversies. Comments and CNN users are mapped to tweets and Twitter users, respectively. The user features differ because they are platform specific: CNN provides an automatic reputation score for each of its users, and we treated this score like any other user feature, generating a user participation score by combining it with the other available user features (i.e., "about" section existed, profession included, days since account creation). Some of the Twitter-based tweet features have counterparts in CNN comments (e.g., likes, replies), while others were dropped entirely because the news dataset has no similar or related features (e.g., retweets, hashtags and user mentions). As the music artist features are completely based on the tweet and Twitter user features, processing the comments and CNN users made the feature extraction step for CNN articles straightforward; only features derived from the Twitter-specific tweet features had to be dropped (a short aggregation sketch follows Table 4.5). We conducted the same feature analyses as for the created dataset. Table 4.4 and Table 4.5 show the highly correlated features and the normalized importance scores for the full list of features, using only the Learning Vector Quantization (LVQ) model, as this dataset contains only a binary class value for its data instances.


Table 4.4 Highly correlated features in CNN dataset

1st Feature | 2nd Feature | Correlation Score
Comments count | Users count | 0.889
Comments count | Controversial terms count | 0.971
Controversial terms mean | Controversial terms SD | 0.767
Positive comments percentage | Positive sentiment mean | 0.780
Negative comments percentage | Negative sentiment mean | 0.793
Neutral sentiment SD | Positive sentiment SD | 0.804
Tokens count | Nouns count | 0.963
Tokens count | Verbs count | 0.978
Tokens count | Adjectives count | 0.938
Tokens count | Adverbs count | 0.916
Nouns count | Verbs count | 0.910
Nouns count | Adjectives count | 0.916
Nouns count | Adverbs count | 0.820
Verbs count | Adjectives count | 0.890
Verbs count | Adverbs count | 0.924
Adjectives count | Adverbs count | 0.855
Replies mean | Replies SD | 0.866
Comments by very active users percentage | User participation mean | 0.805

Table 4.5 Feature importance scores for CNN dataset * Features highly correlated with a more important feature (higher in the table)

# | Feature Name | LVQ Score (Normalized)
#1 | Controversial terms mean | 1.000
#2 | Controversial terms count | 0.996
#3 | Comments per user mean | 0.942
#4 | Compound sentiment SD | 0.851
 | Controversial terms SD * | 0.776
#5 | Replies mean | 0.753
#6 | Nouns mean | 0.748
#7 | Negative sentiment mean | 0.737
 | Verbs mean * | 0.713
#8 | Compound sentiment mean | 0.705
 | Comments count * | 0.686
 | Negative comments percentage * | 0.683
 | Tokens mean * | 0.679
#9 | Negative words mean | 0.637
#10 | Capitalized words mean | 0.595
 | Users count * | 0.586
 | Adjectives mean * | 0.575
 | Replies SD * | 0.483
 | Adverbs mean * | 0.463
#11 | Positive comments percentage | 0.408
#12 | Negative sentiment SD | 0.344
 | Positive sentiment mean * | 0.276
#13 | User participation SD | 0.248
#14 | Likes SD | 0.235
#15 | Positive emoticons mean | 0.207
#16 | Positive sentiment SD | 0.166
#17 | Negative emoticons mean | 0.147
#18 | Expressive punctuation marks mean | 0.119
#19 | Positive words mean | 0.098
#20 | Neutral sentiment mean | 0.055
#21 | User participation mean | 0.052
#22 | Slang words mean | 0.013
 | Comments by very active users percentage * | 0.008
#23 | Likes mean | 0.000
 | Neutral sentiment SD * | 0.000
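The mapping from comment-level features to article-level features described above boils down to a group-by aggregation; the following minimal pandas sketch uses illustrative column names and toy values rather than the actual CNN schema.

import pandas as pd

# Toy comment-level table; the real one holds one row per CNN comment with the
# features described earlier (minus the Twitter-specific ones).
comments = pd.DataFrame({
    "article_id":          [1, 1, 1, 2, 2],
    "negative_sentiment":  [0.1, 0.4, 0.0, 0.6, 0.5],
    "controversial_terms": [0, 2, 1, 3, 2],
    "likes":               [5, 0, 2, 9, 1],
})

article_features = comments.groupby("article_id").agg(
    comments_count=("negative_sentiment", "size"),
    negative_sentiment_mean=("negative_sentiment", "mean"),
    negative_sentiment_sd=("negative_sentiment", "std"),
    controversial_terms_mean=("controversial_terms", "mean"),
    likes_mean=("likes", "mean"),
    likes_sd=("likes", "std"),
)
print(article_features)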

We also evaluated the same classification prediction models on this dataset, using the same methodology as for the created dataset after adapting it to fit classification models. Figure 4.14 and Figure 4.15 show the F-measure and area under ROC for all evaluated models using different subsets of the most important features.


Figure 4.14 F-measure for classification models using the most important features of the CNN dataset (upper panel: full range with Baseline; lower panel: magnified view; curves: Decision Table, SMO, Multilayer Perceptron, Random Forest)


Figure 4.15 Area under ROC for classification models using the most important features of the CNN dataset (curves: Decision Table, SMO, Multilayer Perceptron, Random Forest, Baseline)

Based on these figures, we can notice that after adding the #3 - #9 most important features, the performance of each model almost stabilizes, and the performance of all models is very comparable. Table 7.5 and Table 7.6 in Section 7.2.1 in Appendix A show the detailed evaluation results of these models. We evaluated almost the same prediction models as reported in [60] and obtained an almost consistent slight improvement in performance across all evaluated models. The best performance reported in [60] for the feature set extracted using all comments (48 hours) (see Section 2.5.2) was achieved by Decision Table with an F-measure of 0.706. The best performance for Decision Table in our evaluations was achieved using the #14 most important features, with an F-measure of 0.722. Using SMO, we gained an even better performance, with an F-measure of 0.748 using the #18 most important features, compared to the SVM results reported in the original work (F-measure = 0.507). These results again show that using a smaller set of the right features may lead to better performance using fewer resources.

Looking at the most important features in both datasets, we can notice the specificity of Twitter compared to other media sources. Negative words, slang words and emoticons are much more relevant to the controversy score in Twitter than any other features, whereas the controversial terms themselves and the sentiment expressed in the comments are the most relevant to the controversy score of articles in the news domain. The relatively short length of allowed tweets may be the cause of this difference, as people probably tend to use more emoticons and slang to shorten the messages in which they express their feelings. Another cause might be the average age and education level of the users in both domains.


Chapter 5

5. Conclusion and Future Work

Controversy detection in the music domain is important for music companies, producers and listeners, and it can improve the performance of many music-related systems. Social media, as one of the richest sources of information, is being investigated to detect controversies not only in the music domain but also in many other domains. This thesis studied the problem of controversy detection in the music domain using data streamed from Twitter. It evaluated a comprehensive list of features and analyzed their relevance to the problem of this thesis. These features were collected by carefully examining previous research, generalizing the features used in it, and extending them with new features related to the tweets and the users publishing them. The features were also used to evaluate multiple machine learning prediction models that predict the level of controversy involving a music artist from a set of tweets about this artist. The evaluations were conducted on a newly created dataset fitting the problem of controversy detection in the music domain.

The evaluation results were very promising for detecting controversies involving music artists in Twitter. We achieved a root mean squared error (RMSE) of 0.688 and a correlation coefficient (r) of 0.580 using a linear regression model built on a relatively small subset of the evaluated features. We generalized our evaluations and applied them to a controversy detection dataset from the news domain, improving the results originally reported with this dataset [60] while using a much smaller set of features.

The created dataset contains only 95 samples representing music artists, annotated with a continuous controversy score ranging from 1 for "clearly non-controversial" artists to 4 for "clearly controversial" artists. This dataset should be improved and extended with new samples. Processing streamed tweets should be further investigated and improved to properly filter out trolling (inflammatory, irrelevant, or offensive tweets) and other meaningless tweets. The extended feature set should also be improved and further extended with features about the music artists from external resources other than Twitter, which may prove very useful.


6. Bibliography

[1] A. Bifet, "Mining Big Data in Real Time," Informatica, vol. 37, no. 1, pp. 15-20, 2013.

[2] International Telecommunication Union, "ICT Facts and Figures 2016," International Telecommunication Union, Geneva, Switzerland, 2016.

[3] W. He, S. Zha and L. Li, "Social Media Competitive Analysis and Text Mining: A Case Study in the Pizza Industry," International Journal of Information Management, vol. 33, no. 3, pp. 464 - 472, 2013.

[4] E. Pariser, The Filter Bubble: What the Internet is Hiding from You, Penguin UK, 2011.

[5] S. Dori-hacohen, E. Yom-tov and J. Allan, "Navigating Controversy as a Complex Search Task," in 1st International Workshop on Supporting Complex Search Tasks (ECIR), 2015.

[6] S. Heindorf, M. Potthast, B. Stein and G. Engels, "Vandalism Detection in Wikidata," in The 25th ACM International on Conference on Information and Knowledge Management (CIKM), Indianapolis, IN, USA, 2016.

[7] M. Tsytsarau, T. Palpanas and K. Denecke, "Scalable Detection of Sentiment-Based Contradictions," in The 1st International Workshop on Knowledge Diversity on the Web (DiversiWeb), Hyderabad, India, 2011.

[8] F. Liu and L. Xiong, "Survey on Text Clustering Algorithm -Research Present Situation of Text Clustering Algorithm," in IEEE 2nd International Conference on Software Engineering and Service Science, Beijing, China, 2011.

[9] R. Irfan, C. K. King, D. Grages, S. Ewen, S. U. Khan, S. A. Madani, J. Kolodziej, L. Wang, D. Chen, A. Rayes, N. Tziritas, C.-Z. Xu, A. Y. Zomaya, A. S. Alzahrani and H. Li, "A Survey on Text Mining in Social Networks," The Knowledge Engineering Review, vol. 30, no. 2, pp. 157–170, 2015.

[10] C. C. Aggarwal and H. Wang, "Text Mining in Social Networks," in Social Network Data Analytics, C. C. Aggarwal, Ed., Boston, MA: Springer US, 2011, pp. 353-378.

[11] Y. Dai, T. Kakkonen and E. Sutinen, "MinEDec: A Decision Support Model That Combines Text Mining with Competitive Intelligence," in International Conference on Computer Information Systems and Industrial Management Applications (CISIM), Krackow, Germany, 2010.

[12] Y.-B. Liu, J.-R. Cai, J. Yin and A. W.-C. Fu, "Clustering Text Data Streams," Journal of Computer Science and Technology, vol. 23, no. 1, pp. 112–128, January 2008.


[13] C. C. Aggarwal, "Mining Text Streams," in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds., Boston, MA: Springer US, 2012, pp. 297-321.

[14] J. Hua, W. D. Tembe and E. R. Dougherty, "Performance of Feature-Selection Methods in the Classification of High-Dimension Data," Pattern Recognition, vol. 42, pp. 409-424, 2009.

[15] M. Mathioudakis and N. Koudas, "TwitterMonitor: Trend Detection over the Twitter Stream," in ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 2010.

[16] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Göker, I. Kompatsiaris and A. Jaimes, "Sensing Trending Topics in Twitter," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1268-1282, 2013.

[17] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman and J. Sperling, "TwitterStand: News in Tweets," in The 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 2009.

[18] S. Phuvipadawat and T. Murata, "Breaking News Detection and Tracking in Twitter," in IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Toronto, ON, Canada, 2010.

[19] H. Becker, M. Naaman and L. Gravano, "Beyond Trending Topics: Real-World Event Identification on Twitter," Columbia University Academic Commons, New York City, NY, USA, 2011.

[20] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research 3, pp. 993-1022, 2003.

[21] H. Sayyadi, M. Hurst and A. Maykov, "Event Detection and Tracking in Social Streams," in Third International AAAI Conference on Weblogs and Social Media (ICWSM), Palo Alto, CA, USA, 2009.

[22] J. Weng and B.-S. Lee, "Event Detection in Twitter," in Fifth International AAAI Conference on Weblogs and Social Media (ICWSM), Palo Alto, CA, USA, 2011.

[23] P. J. Stone, D. C. Dunphry, M. S. Smith and D. M. Ogilvie, The General Inquirer: A Computer Approach to Content Analysis, Cambridge: MIT Press, 1966.

[24] C. Hutto and E. Gilbert, "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text," in The Eighth International AAAI Conference on Weblogs and Social Media, 2014.

[25] C. Potts, "Sentiment Symposium Tutorial: Lexicons," November 2011. [Online]. Available: http://sentiment.christopherpotts.net/lexicons.html. [Accessed June 2017].


[26] M. Hu and B. Liu, "Mining and Summarizing Customer Reviews," in The ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), Seattle, WA, USA, 2004.

[27] J. Wiebe, T. Wilson and C. Cardie, "Annotating Expressions of Opinions and Emotions in Language," Language Resources and Evaluation, vol. 39, no. 2-3, pp. 165–210, May 2005.

[28] T. Wilson, J. Wiebe and P. Hoffmann, "Recognizing Contextual Polarity in Phrase-level Sentiment Analysis," in The Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT), Vancouver, British Columbia, Canada, 2005.

[29] M. M. Bradley and P. J. Lang, "Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings," Gainesville, FL, USA, 1999.

[30] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge, MA: MIT Press, 1998.

[31] A. Esuli and F. Sebastiani, "SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining," in The 5th Conference on Language Resources and Evaluation (LREC), Genova, Italy, 2006.

[32] S. Baccianella, A. Esuli and F. Sebastiani, "SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining," in The International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, 2010.

[33] M. Thelwall, "The Heart and Soul of the Web? Sentiment Strength Detection in the Social Web with SentiStrength," in Cyberemotions: Collective Emotions in Cyberspace, J. A. Holyst, Ed., Springer, 2017, pp. 119-134.

[34] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai and A. Kappas, "Sentiment in Short Strength Detection Informal Text," Journal of the Association for Information Science and Technology, vol. 61, no. 12, pp. 2544-2558, December 2010.

[35] M. Thelwall, K. Buckley and G. Paltoglou, "Sentiment Strength Detection for the Social Web," Journal of the Association for Information Science and Technology, vol. 63, no. 1, pp. 163-173, January 2012.

[36] F. Å. Nielsen, "A New ANEW: Evaluation of a Word List for Sentiment Analysis in Microblogs," in The ESWC2011 Workshop on 'Making Sense of Microposts': Big Things Come in Small Packages, Heraklion, Greece, 2011.

[37] E. Cambria, R. Speer, C. Havasi and A. Hussain, "SenticNet: A Publicly Available Semantic Resource for Opinion Mining," in Commonsense Knowledge, Menlo Park, CA, USA, 2010.


[38] E. Cambria, S. Poria, R. Bajpai and B. Schuller, "SenticNet 4: A Semantic Resource for Sentiment Analysis Based on Conceptual Primitives," in The 26th International Conference on Computational Linguistics (COLING), Osaka, Japan, 2016.

[39] S. Wang and C. D. Manning, "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification," in The 50th Annual Meeting of the Association for Computational Linguistics (ACL): Short Papers, Jeju Island, Korea, 2012.

[40] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng and C. Potts, "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank," in Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, WA, USA, 2013.

[41] L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou and K. Xu, "Adaptive Recursive Neural Network for Target-dependent Twitter Sentiment Classification," in The 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, Baltimore, MD, USA, 2014.

[42] B. Pang and L. Lee, "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts," in The 42Nd Annual Meeting on Association for Computational Linguistics (ACL), Barcelona, Spain, 2004.

[43] S. Rosenthal, P. Nakov, S. Kiritchenko, S. M. Mohammad, A. Ritter and V. Stoyanov, "SemEval-2015 Task 10: Sentiment Analysis in Twitter," in The 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 2015.

[44] B.-Q. Vuong, E.-P. Lim, A. Sun, M.-T. Le, H. W. Lauw and K. Chang, "On Ranking Controversies in Wikipedia: Models and Evaluation," in The 2008 International Conference on Web Search and Data Mining (WSDM), Palo Alto, CA, USA, 2008.

[45] S. Dori-Hacohen and J. Allan, "Detecting Controversy on the Web," in The 22nd ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, USA, 2013.

[46] S. Dori-Hacohen and J. Allan, "Automated Controversy Detection on the Web," in Advances in Information Retrieval - The 37th European Conference on IR Research (ECIR), Vienna, Austria, 2015.

[47] A. Kittur, B. Suh, B. A. Pendleton and E. H. Chi, "He Says, She Says: Conflict and Coordination in Wikipedia," in The SIGCHI Conference on Human Factors in Computing Systems (CHI), San Jose, CA, USA, 2007.

[48] T. Yasseri, R. Sumi, A. Rung, A. Kornai and J. Kertész, "Dynamics of Conflicts in Wikipedia," PLoS ONE, vol. 7, no. 6, 2012.


[49] S. Dori-Hacohen, D. Jensen and J. Allan, "Controversy Detection in Wikipedia Using Collective Classification," in The 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Pisa, Italy, 2016.

[50] Z. Kou and W. W. Cohen, "Stacked Graphical Models for Efficient Inference in Markov Random Fields," in The 2007 SIAM International Conference on Data Mining, 2007.

[51] H. Sepehri Rad, A. Makazhanov, D. Rafiei and D. Barbosa, "Leveraging Editor Collaboration Patterns in Wikipedia," in The 23rd ACM Conference on Hypertext and Social Media (HT), Milwaukee, WI, USA, 2012.

[52] M. Jang and J. Allan, "Improving Automated Controversy Detection on the Web," in The 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Pisa, Italy, 2016.

[53] M. A. Hearst, "TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages," Computational Linguistics, vol. 23, no. 1, pp. 33-64, March 1997.

[54] Y. Choi, Y. Jung and S.-H. Myaeng, "Identifying Controversial Issues and Their Sub-topics in News Articles," in Pacific-Asia Workshop on Intelligence and Security Informatics (PAISI), Hyderabad, India, 2010.

[55] L. Azzopardi, M. de Rijke and K. Balog, "Building Simulated Queries for Known-item Topics: An Analysis Using Six European Languages," in The 30th Conference on Research and Development in Information Retrieval (SIGIR), Amsterdam, The Netherlands, 2007.

[56] M. Pennacchiotti and A.-M. Popescu, "Detecting Controversies in Twitter: A First Study," in The NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media (WSA), Los Angeles, CA, USA, 2010.

[57] A.-M. Popescu and M. Pennacchiotti, "Detecting Controversial Events from Twitter," in The 19th ACM International Conference on Information and Knowledge Management (CIKM), Toronto, ON, Canada, 2010.

[58] J. H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, October 2001.

[59] M. Tsytsarau, T. Palpanas and K. Denecke, "Scalable Discovery of Contradictions on the Web," in The 19th International Conference on World Wide Web (WWW), Raleigh, NC, USA, 2010.

[60] R. V. Chimmalgi, "Controversy Trend Detection in Social Media," LSU Digital Commons, Baton Rouge, LA, USA, 2013.


[61] E. Frank, M. A. Hall and I. H. Witten, "The WEKA Workbench. Online Appendix," in Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition ed., Morgan Kaufmann, 2016.

[62] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, no. 1, pp. 10-18, 2009.

[63] O. P. Anifowose, "Identifying Controversial Topics in Large-scale Social Media Data," Weimar, Germany, 2016.

[64] S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O’Reilly Media Inc., 2009.

[65] J. Cohen, "Weighted kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit," Psychological Bulletin, vol. 70, no. 4, pp. 213-220, 1968.

[66] J. R. Landis and G. G. Koch, "The Measurement of Observer Agreement for Categorical Data," Biometrics, vol. 33, no. 1, pp. 159-174, 1977.

[67] R. A. Fisher, Statistical Methods for Research Workers, Edinburgh: Oliver and Boyd, 1925.

[68] D. V. Cicchetti, "Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology," Psychological Assessment, vol. 6, no. 4, pp. 284-290, 1994.

[69] M. Kuhn, "Building Predictive Models in R Using the caret Package," Journal of Statistical Software, vol. 28, no. 5, pp. 1-26, 2008.

[70] T. Kohonen, "Learning Vector Quantization," in The Handbook of Brain Theory and Neural Networks, Cambridge, MA: MIT Press, 1998, pp. 537-540.

[71] R. Kohavi, "The Power of Decision Tables," in 8th European Conference on Machine Learning (ECML), Berlin, Germany, 1995.

[72] J. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization," in Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, USA, 1998.

[73] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, D. E. Rumelhart, J. L. McClelland and C. PDP Research Group, Eds., Cambridge, MA: MIT Press, 1986, pp. 318-362.

[74] T. K. Ho, "Random Decision Forests," in 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 1995.


[75] C. Strapparava and G. Ozbal, "The Color of Emotions in Texts," in 2nd Workshop on Cognitive Aspects of the Lexicon (CogALex), Beijing, China, 2010.

[76] H. S. Rad and D. Barbosa, "Identifying Controversial Articles in Wikipedia: A Comparative Study," in The 8th Annual International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, 2012.


7. Appendix A: Detailed Evaluation Results

This appendix includes the evaluation results for all the evaluated prediction models for both the created Twitter dataset and the CNN dataset.

7.1. Twitter Dataset

7.1.1. Regression Prediction Models

Table 7.1 RMSE for regression models using the most important features

Number of Features | Decision Table | SMO | Multilayer Perceptron | Random Forest | Linear Regression | Baseline
#1 | 0.786 | 0.723 | 0.811 | 0.805 | 0.722 | 0.843
#2 | 0.786 | 0.727 | 0.812 | 0.769 | 0.710 | 0.843
#3 | 0.801 | 0.701 | 0.787 | 0.727 | 0.716 | 0.843
#4 | 0.818 | 0.710 | 0.822 | 0.712 | 0.720 | 0.843
#5 | 0.842 | 0.713 | 0.954 | 0.741 | 0.732 | 0.843
#6 | 0.836 | 0.715 | 0.967 | 0.735 | 0.702 | 0.843
#7 | 0.836 | 0.710 | 0.971 | 0.721 | 0.719 | 0.843
#8 | 0.836 | 0.713 | 0.971 | 0.729 | 0.718 | 0.843
#9 | 0.887 | 0.705 | 1.177 | 0.715 | 0.688 | 0.843
#10 | 0.887 | 0.702 | 1.401 | 0.722 | 0.706 | 0.843
#11 | 0.944 | 0.708 | 1.083 | 0.712 | 0.706 | 0.843
#12 | 0.928 | 0.707 | 1.589 | 0.722 | 0.710 | 0.843
#13 | 0.946 | 0.709 | 1.510 | 0.726 | 0.713 | 0.843
#14 | 0.987 | 0.701 | 1.329 | 0.717 | 0.712 | 0.843
#15 | 0.961 | 0.707 | 1.597 | 0.725 | 0.712 | 0.843
#16 | 0.955 | 0.702 | 1.397 | 0.710 | 0.720 | 0.843
#17 | 0.958 | 0.705 | 1.477 | 0.735 | 0.720 | 0.843
#18 | 0.961 | 0.729 | 1.729 | 0.740 | 0.725 | 0.843
#19 | 0.956 | 0.723 | 1.538 | 0.740 | 0.730 | 0.843
#20 | 0.954 | 0.741 | 1.601 | 0.730 | 0.742 | 0.843
#21 | 0.954 | 0.740 | 1.301 | 0.733 | 0.745 | 0.843
#22 | 1.004 | 0.768 | 1.247 | 0.753 | 0.742 | 0.843
#23 | 1.004 | 0.773 | 1.594 | 0.759 | 0.764 | 0.843
#24 | 0.911 | 0.773 | 1.306 | 0.748 | 0.768 | 0.843
#25 | 0.918 | 0.788 | 1.470 | 0.753 | 0.775 | 0.843
#26 | 0.959 | 0.794 | 1.388 | 0.748 | 0.782 | 0.843
#27 | 0.901 | 0.809 | 1.479 | 0.756 | 0.832 | 0.843
#28 | 0.901 | 0.792 | 1.212 | 0.763 | 0.834 | 0.843

Table 7.2 Correlation coefficient for regression models using the most important features

Number of Features | Decision Table | SMO | Multilayer Perceptron | Random Forest | Linear Regression | Baseline
#1 | 0.373 | 0.513 | 0.353 | 0.424 | 0.512 | -0.135
#2 | 0.373 | 0.512 | 0.353 | 0.443 | 0.536 | -0.135
#3 | 0.336 | 0.558 | 0.433 | 0.509 | 0.530 | -0.135
#4 | 0.311 | 0.542 | 0.341 | 0.536 | 0.522 | -0.135
#5 | 0.244 | 0.546 | 0.325 | 0.481 | 0.509 | -0.135
#6 | 0.276 | 0.543 | 0.383 | 0.489 | 0.557 | -0.135
#7 | 0.276 | 0.553 | 0.455 | 0.515 | 0.533 | -0.135
#8 | 0.276 | 0.546 | 0.404 | 0.500 | 0.534 | -0.135
#9 | 0.243 | 0.559 | 0.277 | 0.526 | 0.581 | -0.135
#10 | 0.243 | 0.565 | 0.330 | 0.515 | 0.560 | -0.135
#11 | 0.116 | 0.557 | 0.443 | 0.533 | 0.560 | -0.135
#12 | 0.091 | 0.558 | 0.173 | 0.516 | 0.556 | -0.135
#13 | 0.104 | 0.557 | 0.382 | 0.506 | 0.557 | -0.135
#14 | -0.004 | 0.566 | 0.340 | 0.530 | 0.563 | -0.135
#15 | 0.019 | 0.560 | 0.234 | 0.518 | 0.563 | -0.135
#16 | 0.038 | 0.564 | 0.340 | 0.552 | 0.553 | -0.135
#17 | 0.049 | 0.562 | 0.283 | 0.490 | 0.553 | -0.135
#18 | 0.047 | 0.526 | 0.222 | 0.479 | 0.546 | -0.135
#19 | 0.047 | 0.532 | 0.206 | 0.483 | 0.539 | -0.135
#20 | -0.009 | 0.510 | 0.241 | 0.505 | 0.533 | -0.135
#21 | -0.009 | 0.515 | 0.324 | 0.503 | 0.529 | -0.135
#22 | -0.056 | 0.488 | 0.357 | 0.456 | 0.532 | -0.135
#23 | -0.056 | 0.482 | 0.234 | 0.439 | 0.512 | -0.135
#24 | 0.129 | 0.480 | 0.298 | 0.467 | 0.507 | -0.135
#25 | 0.103 | 0.456 | 0.224 | 0.452 | 0.498 | -0.135
#26 | 0.058 | 0.443 | 0.217 | 0.474 | 0.488 | -0.135
#27 | 0.133 | 0.433 | -0.010 | 0.456 | 0.443 | -0.135
#28 | 0.133 | 0.448 | 0.270 | 0.435 | 0.430 | -0.135


7.1.2. Classification Prediction Models

Table 7.3 F-measure for classification models using the most important features

Number of Features | Decision Table | SMO | Multilayer Perceptron | Random Forest | Baseline
#1 | 0.781 | 0.539 | 0.778 | 0.712 | 0.515
#2 | 0.745 | 0.643 | 0.767 | 0.736 | 0.515
#3 | 0.745 | 0.661 | 0.780 | 0.747 | 0.515
#4 | 0.745 | 0.635 | 0.812 | 0.710 | 0.515
#5 | 0.745 | 0.635 | 0.799 | 0.710 | 0.515
#6 | 0.745 | 0.643 | 0.757 | 0.719 | 0.515
#7 | 0.745 | 0.670 | 0.807 | 0.738 | 0.515
#8 | 0.745 | 0.635 | 0.801 | 0.747 | 0.515
#9 | 0.745 | 0.635 | 0.743 | 0.781 | 0.515
#10 | 0.745 | 0.635 | 0.736 | 0.772 | 0.515
#11 | 0.745 | 0.679 | 0.739 | 0.756 | 0.515
#12 | 0.745 | 0.679 | 0.728 | 0.756 | 0.515
#13 | 0.745 | 0.679 | 0.778 | 0.769 | 0.515
#14 | 0.745 | 0.679 | 0.757 | 0.762 | 0.515
#15 | 0.745 | 0.679 | 0.789 | 0.759 | 0.515
#16 | 0.745 | 0.679 | 0.747 | 0.756 | 0.515
#17 | 0.745 | 0.679 | 0.707 | 0.731 | 0.515
#18 | 0.745 | 0.679 | 0.717 | 0.747 | 0.515
#19 | 0.745 | 0.670 | 0.709 | 0.728 | 0.515
#20 | 0.745 | 0.694 | 0.686 | 0.734 | 0.515
#21 | 0.745 | 0.694 | 0.698 | 0.722 | 0.515
#22 | 0.745 | 0.677 | 0.694 | 0.759 | 0.515
#23 | 0.745 | 0.677 | 0.684 | 0.741 | 0.515
#24 | 0.745 | 0.669 | 0.633 | 0.734 | 0.515
#25 | 0.745 | 0.677 | 0.675 | 0.769 | 0.515
#26 | 0.745 | 0.677 | 0.684 | 0.734 | 0.515
#27 | 0.745 | 0.703 | 0.738 | 0.716 | 0.515
#28 | 0.745 | 0.703 | 0.772 | 0.728 | 0.515


Table 7.4 Area under ROC for classification models using the most important features

Number of Features | Decision Table | SMO | Multilayer Perceptron | Random Forest | Baseline
#1 | 0.680 | 0.515 | 0.751 | 0.749 | 0.456
#2 | 0.704 | 0.591 | 0.798 | 0.779 | 0.456
#3 | 0.772 | 0.606 | 0.788 | 0.745 | 0.456
#4 | 0.772 | 0.583 | 0.793 | 0.776 | 0.456
#5 | 0.772 | 0.583 | 0.796 | 0.773 | 0.456
#6 | 0.772 | 0.591 | 0.789 | 0.782 | 0.456
#7 | 0.772 | 0.613 | 0.776 | 0.792 | 0.456
#8 | 0.772 | 0.583 | 0.778 | 0.804 | 0.456
#9 | 0.772 | 0.583 | 0.771 | 0.796 | 0.456
#10 | 0.772 | 0.583 | 0.727 | 0.797 | 0.456
#11 | 0.772 | 0.620 | 0.742 | 0.790 | 0.456
#12 | 0.772 | 0.620 | 0.717 | 0.796 | 0.456
#13 | 0.772 | 0.620 | 0.755 | 0.796 | 0.456
#14 | 0.772 | 0.620 | 0.762 | 0.782 | 0.456
#15 | 0.772 | 0.620 | 0.742 | 0.786 | 0.456
#16 | 0.772 | 0.620 | 0.749 | 0.788 | 0.456
#17 | 0.772 | 0.620 | 0.729 | 0.782 | 0.456
#18 | 0.772 | 0.620 | 0.714 | 0.775 | 0.456
#19 | 0.772 | 0.612 | 0.686 | 0.764 | 0.456
#20 | 0.772 | 0.635 | 0.719 | 0.784 | 0.456
#21 | 0.772 | 0.635 | 0.717 | 0.785 | 0.456
#22 | 0.772 | 0.619 | 0.714 | 0.786 | 0.456
#23 | 0.772 | 0.619 | 0.743 | 0.751 | 0.456
#24 | 0.772 | 0.611 | 0.696 | 0.754 | 0.456
#25 | 0.772 | 0.619 | 0.701 | 0.751 | 0.456
#26 | 0.772 | 0.619 | 0.670 | 0.767 | 0.456
#27 | 0.772 | 0.649 | 0.769 | 0.760 | 0.456
#28 | 0.772 | 0.649 | 0.774 | 0.765 | 0.456


7.2. CNN Dataset

7.2.1. Classification Prediction Models

Table 7.5 F-measure for classification models using the most important features of CNN dataset

Number of   Decision    Multilayer    Random    SMO     Baseline
Features    Table       Perceptron    Forest

#1          0.692       0.696         0.697     0.572   0.352
#2          0.691       0.696         0.695     0.675   0.352
#3          0.695       0.701         0.697     0.688   0.352
#4          0.700       0.714         0.701     0.688   0.352
#5          0.694       0.714         0.693     0.699   0.352
#6          0.716       0.715         0.708     0.703   0.352
#7          0.701       0.719         0.706     0.710   0.352
#8          0.701       0.724         0.729     0.699   0.352
#9          0.699       0.724         0.719     0.702   0.352
#10         0.699       0.725         0.703     0.717   0.352
#11         0.699       0.728         0.706     0.710   0.352
#12         0.705       0.729         0.718     0.700   0.352
#13         0.708       0.725         0.715     0.710   0.352
#14         0.709       0.724         0.700     0.718   0.352
#15         0.712       0.723         0.687     0.722   0.352
#16         0.712       0.718         0.695     0.720   0.352
#17         0.706       0.722         0.686     0.726   0.352
#18         0.702       0.724         0.688     0.729   0.352
#19         0.702       0.721         0.695     0.732   0.352
#20         0.702       0.722         0.686     0.742   0.352
#21         0.702       0.719         0.690     0.734   0.352
#22         0.700       0.724         0.674     0.736   0.352
#23         0.702       0.732         0.703     0.732   0.352


Table 7.6 Area under ROC for classification models using the most important features of CNN dataset

Number of   Decision    Multilayer    Random    SMO     Baseline
Features    Table       Perceptron    Forest

#1          0.689       0.695         0.731     0.632   0.495
#2          0.743       0.695         0.733     0.709   0.495
#3          0.745       0.701         0.747     0.732   0.495
#4          0.750       0.714         0.769     0.739   0.495
#5          0.747       0.714         0.759     0.747   0.495
#6          0.757       0.714         0.773     0.758   0.495
#7          0.760       0.718         0.762     0.774   0.495
#8          0.759       0.724         0.778     0.783   0.495
#9          0.760       0.724         0.783     0.784   0.495
#10         0.761       0.724         0.775     0.791   0.495
#11         0.761       0.728         0.770     0.784   0.495
#12         0.761       0.729         0.767     0.780   0.495
#13         0.765       0.725         0.767     0.784   0.495
#14         0.768       0.724         0.749     0.795   0.495
#15         0.769       0.722         0.755     0.796   0.495
#16         0.769       0.718         0.744     0.796   0.495
#17         0.770       0.722         0.737     0.802   0.495
#18         0.768       0.723         0.752     0.803   0.495
#19         0.768       0.721         0.764     0.796   0.495
#20         0.768       0.722         0.745     0.802   0.495
#21         0.768       0.718         0.753     0.803   0.495
#22         0.762       0.724         0.746     0.798   0.495
#23         0.762       0.732         0.747     0.790   0.495

Curriculum vitae

PERSONAL INFORMATION

Mhd Mousa HAMAD
Julius Raab Str. 10, 4040 Linz (Austria)
(+43) 664 7655600
mhd.mousa.hamad@.com
http://linkedin.com/in/mhdmousahamad
http://stackoverflow.com/story/mhdmousahamad
Sex: Male | Date of birth: 14 Mar 1988 | Nationality: Syrian

EDUCATION AND TRAINING

2015–Present    Master of Science in Computer Science
Johannes Kepler Universität Linz (JKU), Linz (Austria)
▪ I was awarded the ASSUR scholarship from the Erasmus Mundus Program (EACEA) to pursue this degree program.
▪ Thesis: "Controversy Detection of Music Artists".
▪ Main Courses: Data Analysis, Machine Learning, Information Retrieval, Knowledge-based Systems, Semantic Data Modeling and Web Information Systems.

2011–2015    Master of Science in Artificial Intelligence
Damascus University, Damascus (Syria)
▪ Thesis: "A Study and Implementation of Mechanisms to Improve the Methodologies of Measuring Semantic Similarity for Arabic Scripts".
▪ Taught Courses: Verbal Communication, Data Mining, Distributed Network Applications, Electronic Business (Internet Applications), Advanced Software Engineering.

2006–2011    Bachelor of Science in Informatics Engineering
Damascus University, Damascus (Syria)
▪ Main Courses: Programming Languages, Software Engineering, Database, Data Mining, Machine Learning, Natural Language Processing, Intelligent Search Algorithms, Communication Skills, Project Management and Marketing.

WORK EXPERIENCE

Mar 2017–Present    Software Engineer
Smarter Ecommerce GmbH (smec), Linz (Austria)
▪ Developing systems for PPC (pay-per-click) automation for Google AdWords, namely AdEngine and Whoop!.

Sep 2011–Jul 2015    Teaching Assistant
Damascus University (DU), Arab International University (AIU), Syrian Private University (SPU), Damascus (Syria)
▪ DU (Oct 2011 – Jul 2015), AIU (Oct 2013 – Jun 2015), SPU (Sep 2011 – Aug 2014).
▪ Teaching assistant in Data Mining, Information Retrieval, Computer Graphics (OpenGL) and Software Engineering.
▪ Supervisor of projects in Natural Language Processing and Data Mining.

Jun 2010–Dec 2014    Mobile Applications Developer
A.I Apps, Damascus (Syria)
▪ Customized and public mobile applications for iOS and Android:
▫ Taif Municipality is a customized application, built for iOS and Android, to provide some of the municipality's e-services.
▫ The Holy Quran is a mobile application, built for iOS, to listen to the Holy Quran, read by many reciters, online or offline.
▫ Epsilon and Ultimate Paper are generic RSS readers built using the Cocoa Touch framework, supported by a web-based Python API.
▫ Saudi Newspaper, Saudi News

Jun 2009–May 2010    Web-based Application Developer
Al-Bayan Co., Damascus (Syria)
▪ Web-based applications using ASP.NET, C#, and SQL Server databases:
▫ Al-Bairuni University Hospital is a web-based application for managing all medicine transactions between inventories and patients.
▫ Customs Office Management is a web-based application for managing all the operations required to clear goods through customs.

PERSONAL SKILLS

Mother tongue(s) Arabic

Digital competence
▪ Mobile Development:
▫ iOS native mobile applications (Cocoa Touch framework; top 10% of Stack Overflow answers for iPhone), Android native mobile applications.
▪ Database:
▫ SQL (Microsoft SQL Server), SQLite.
▪ Programming Languages:
▫ Objective-C (Cocoa Touch framework), Swift (basic knowledge), C# (.NET Framework – Entity Framework, LINQ to SQL, ASP security, Crystal Reports), Java, C++, Python (basic knowledge), R (basic knowledge).
▪ Web Design:
▫ ASP.NET (Web Forms, MVC), HTML, CSS, jQuery.
▪ Semantic Web & Information Retrieval:
▫ RDF, SPARQL, Ontologies, Apache Lucene (Lucene.Net, Highlight).
▪ Version Control Systems:
▫ SVN, Git.

ADDITIONAL INFORMATION

Other languages
▪ English: fluent
▪ German: basic knowledge (A2)

Volunteer experiences
▪ Business Team Member (Mar 2016 – Jul 2016)
▫ AIESEC, Linz (Austria)

Honours and awards
▪ Certificate for the Online/Offline Plagiarism Detector, recognized as one of the best 25 information technology applications by Zayed University, UAE (Apr 2011).


Projects
▪ As a graduation project, plagiarism detection software was developed with a team of 3 students. NMPlagiarism (No More Plagiarism) was a web-based application, built using ASP.NET, C#, and an SQL Server database, to detect word-for-word and paraphrasing plagiarism in both English and Arabic. NMPlagiarism was ranked 5th among the best 25 information technology applications at Zayed University, UAE (Apr 2011).


STATUTORY DECLARATION

I hereby declare that the thesis submitted is my own unaided work, that I have not used any sources other than those indicated, and that all direct and indirect sources are acknowledged as references. This printed thesis is identical to the electronic version submitted.

Place, Date: Linz, October 27, 2017

Signature: ______