Controversy Detection of Music Artists
Submitted by Mhd Mousa HAMAD
Submitted at Department of Computational Perception
Supervisor Dr. Markus Schedl
October 2017
Master Thesis to obtain the academic degree of Diplom-Ingenieur in the Master’s Program Computer Science
JOHANNES KEPLER UNIVERSITY LINZ
Altenberger Str. 69, 4040 Linz, Austria
www.jku.at
DVR 0093696

Table of Contents
Abstract
1. Introduction
    1.1. Motivation
    1.2. Research Problem
    1.3. Research Tasks
2. Literature Review
    2.1. Social Media Analysis
    2.2. Text Preprocessing
    2.3. Trend Detection
    2.4. Sentiment Analysis
    2.5. Controversy Detection
3. Data Collection and Processing
    3.1. Data Streaming
    3.2. Data Processing
    3.3. Sentiment Analysis
    3.4. Data Storage
    3.5. Data Annotation
4. Experiments and Results
    4.1. Feature Extraction
    4.2. Feature Analysis
    4.3. Machine Learning Models Evaluations
    4.4. News Dataset Evaluations
5. Conclusion and Future Work
6. Bibliography
7. Appendix A: Detailed Evaluation Results
    7.1. Twitter Dataset
    7.2. CNN Dataset

Abstract

“We are creating the same quantity of data every two days, as we created from the dawn of time up until 2003. It is estimated to be 5 Exabyte” [1]. The Internet and web technologies give billions of users the ability to share information and express their opinions on various issues. This enormous amount of data might be very valuable. Social media, as the main sharing platform, is a very promising data source for researchers to investigate and analyze how people feel or think about a variety of issues, from politics to entertainment. Previous research has explored the problem of detecting controversies involving multiple kinds of entities (people, events, …) by analyzing the different feelings and opinions expressed about these entities. The music domain, although one of the most controversial domains, has not been investigated much in this research.

This thesis studies to what extent Twitter, as a social media platform, can be used to detect controversies involving music artists. It generalizes and extends the work proposed in previous research to build good machine learning prediction models to detect these controversies. We analyze what people share about music artists on Twitter, present the problems in this data and study how to tackle most of them. Then, we use this data to build a new controversy detection dataset in the music domain. The created dataset is then used to evaluate a comprehensive set of features for building prediction models that detect controversies involving music artists. To enrich this set of features, we propose using information about the users who share their opinions along with information about the shared opinions themselves.
Our evaluations show promising results in detecting controversies involving music artists using the created dataset. They also show that we can easily improve the results of detecting controversies in other domains, as our additional evaluations on a CNN news dataset indicate.

Chapter 1
1. Introduction

Over 3 billion people used the Internet in 2016 [2]. Most of these users use social media to share information and communicate with each other. They express their feelings about various entities (e.g., products they use or famous people they know) on these social platforms. This data may provide a real-time view of opinions, activities and trends around the world. This chapter introduces the problem of using part of this data feed to detect controversies about music artists that engage social media users.

1.1. Motivation

Social media have profoundly changed our lives and how we interact with one another and the world around us. Recent research indicates that more and more people are using social media applications such as Facebook and Twitter for various reasons, such as making new friends, socializing with old friends, receiving information, and entertaining themselves [3]. As a result, many organizations and companies are adopting social media to accommodate this growing trend, either to provide better services or to gain business value such as driving customer traffic, increasing customer loyalty and retention, increasing sales and revenues, improving customer satisfaction, creating brand awareness and building reputation.

Users from different backgrounds participate in the massive open collaboration in social media. This often leads to vandalism, when users deliberately try to damage the reputation of someone or something, and to controversies, when users share differing opinions on someone or something. Companies, governments, national security agencies, and marketing groups are interested in identifying which issues the public is having problems with.
They are also very interested in predicting early whether an issue or a product is likely to generate controversies, so that they can act to prevent them. As music is one of the most controversial domains, detecting controversies about a song, a clip or a music event as early as possible is also very important for music producers and for the artists themselves, so that they can counteract them.

In the music domain, automatic recommendation systems are becoming more important for music companies and producers on the one hand and for listeners on the other. Detecting controversies about music artists will most likely boost the performance of these systems. They can identify the users who usually listen to music by controversial artists and those who avoid it, and recommend the appropriate music to each group. These systems may also use this information to change how they recommend controversial artists to users who do not usually listen to music by such artists, by focusing the recommendations on non-controversial facts.

As web services, from search engines to social media services, are moving more and more towards personalization, an unsuspecting user who has never heard of a controversy is likely to be misled. This is known as “The Filter Bubble Effect”, wherein web services serve users what they expect rather than encouraging them to seek the multiple perspectives available on a subject. Detecting controversies is getting more attention in the research community as a way to counteract this effect [4, 5].

1.2. Research Problem

The goal of this thesis is to study, evaluate and build machine learning prediction models to detect controversies involving music artists on Twitter. The approaches to detecting controversies differ based on the type of controversies they detect. This section defines controversy and some related terms to differentiate