Distant Labelling and Author Profiling on Reddit

Gytha Muller ANR: u408823

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMMUNICATION AND INFORMATION SCIENCES, MASTER TRACK DATA SCIENCE: BUSINESS & GOVERNANCE, AT THE SCHOOL OF HUMANITIES AND DIGITAL SCIENCES OF TILBURG UNIVERSITY

Thesis committee: C.D. Emmery MSc, dr. Maryam Alimardani

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
December 2018

Preface

This thesis forms the final part of my master's programme Data Science: Business and Governance at Tilburg University. I would like to thank my supervisor Chris Emmery for all his feedback. It was reassuring to know that whenever I had questions, whether about coding or theory, I could count on a fast response with specific feedback that enabled me to continue and, hopefully, improve the final product. In a short period of time, I learnt a lot about a field that was almost completely new to me. Thanks to Drew Hendrickson as well. Without his endless patience with questions and troubleshooting coding errors during my research traineeship before and during this thesis, my code undoubtedly would have been a lot worse off. Finally, thank you to my friends and family for always cheering me on and supporting me during my master's degree.

Gytha Muller
December 2018

Distant Labelling and Author Profiling on Reddit

Gytha Muller

This thesis investigates the application of distant labelling and author profiling techniques to Reddit, one of the world's major online communities yet a barely investigated domain in the field of author profiling. Firstly, this thesis proposes and demonstrates a method for reliably creating a large corpus of Reddit users and posts labelled with gender (>40,000 users) and/or age (>18,000 users). The method uses flairs and submission titles to distantly label users, which reduces noise compared to earlier approaches. Currently, no such corpus exists, as earlier approaches on Reddit have sampled from a very limited user subset. Secondly, it is investigated whether current approaches to author profiling for Twitter translate to Reddit, by using common methods such as fastText and a linear kernel SVM with a TF-IDF representation to predict gender and age respectively. Models are trained on both Reddit and Twitter. All models have good in-domain performance in line with existing work: for Reddit, 0.75 accuracy on gender and 0.70 accuracy on age; for Twitter, 0.91 accuracy on gender and 0.93 accuracy on age. Cross-domain performance of the trained models is mediocre, suggesting that while the general approach translates, without preprocessing there is too much domain-specificity to properly generalize. In conclusion, this work contributes to understanding applicable author profiling approaches for Reddit and demonstrates how to automatically build a labelled corpus on Reddit, which should be able to assist future research on this domain.

1. Introduction

The internet has revolutionized communication and plays an increasingly important role in our society. Its effects are vast: virtually all countries have adopted the internet, while internet discussions influence mainstream news and politics. Often, this means that written communication plays a central role in our lives, posing new challenges in interpretation and analysis. Alongside its growth, the role of the internet itself has changed fundamentally. Coined "Web 2.0", the last decade saw a move from static web pages to "the network as platform" (O'Reilly 2005). The web has become cross-platform, constantly updated and encouraging of interaction between users. In 2018, communication scholars and marketing experts have already widely adopted the move to a "web 3.0", where interaction and co-creation are central concepts (Prahalad and Ramaswamy 2004). However, as in the early days of the internet, the internet landscape is still largely dominated by a few titans. Among these is the online community Reddit, the domain this thesis will focus on.


1.1 Reddit

While Reddit may be lesser-known than communities such as Twitter or Facebook, its impact should not be understated. Reddit was founded in 2005 and has been growing continuously since then. It is currently one of the most active communities on the web, with 13 million submissions and 103 million comments in August 2018 alone 1. According to Reddit itself, it currently has over 330 million active users and 14 billion monthly pageviews 2, while Alexa ranks it as the 17th most used website worldwide and 5th in the US 3. Reddit consists of subgroups centred around specific topics called 'subreddits', stylized as /r/subredditname, comparable to Facebook groups or sub-sections on online forums. Subreddits range in size from tens to millions of users. They range vastly in topic as well, from general and broad topics such as politics or food, to fan subreddits for specific media, to very niche interests such as /r/birdswitharms (as the name indicates, a subreddit dedicated to collecting images of birds with arms photoshopped on). Users can register anonymously, as they only need a username and password, and optionally an email address for account retrieval. After registering, they can post submissions to subreddits and comment on both submissions and other users' comments. Reddit's emphasis on interaction and community makes it an excellent fit for the web 3.0 trend. It does, however, also make its users more vulnerable to the growing field of author profiling.

1.2 Author profiling

Author profiling is a growing subdomain of Natural Language Processing 4 (Rangel et al. 2018). Author profiling aims to infer unknown attributes of an author, such as age or gender, through that author's writing (Reddy, Vardhan, and Reddy 2017; Emmery, Chrupała, and Daelemans 2017). As mentioned, web 3.0 and the emphasis on communities mean internet users increasingly share their own writing (whether as comments or posts) online, making users vulnerable to author profiling techniques. At the same time, the rise of (semi-)anonymous networks means a decrease in users actively sharing author attributes such as age or country on e.g. profiles. Twitter is an example of one such prolific network that does not require a legal name, gender or profile picture of oneself, with many users opting for a picture of someone or something else. It should be noted that communities struggle with fake accounts and impersonators, regardless of whether the communities are anonymous such as Twitter 5 or "non-anonymous" such as Facebook 6. Author profiling may have another application as a more refined tool to automatically identify such accounts. There are multiple reasons why knowing user attributes is of interest. Firstly, there are commercial uses such as enabling more targeted marketing (Rangel et al. 2018) or profiling online reviews to find out which demographics (dis)like a product (Argamon et al. 2009).

1 Statistics retrieved from https://pushshift.io/
2 https://www.redditinc.com/
3 https://www.alexa.com/siteinfo/reddit.com; accessed on December 19, 2018
4 Reflected too in the number of participants in the annual PAN Author Profiling task, which will be discussed more extensively later
5 https://www.nytimes.com/2018/07/11/technology/twitter-fake-followers.html
6 https://newsroom.fb.com/news/2018/05/enforcement-numbers/

Secondly, there are security purposes such as knowing author attributes of anonymous threat senders, which can help legal investigations (Reddy, Vardhan, and Reddy 2017; Argamon et al. 2009). For instance, in a related application, researchers tried to predict whether users are Dutch based on English-language comments. Given Dutch users who only comment in English, this would allow the Dutch police to better allocate their resources when prosecuting persons for illegal activities online (van den Boom and Veenman 2018). Finally, there are scientific applications as well. Knowledge of user attributes can improve some text classification tasks (Hovy 2015). Additionally, if certain characteristics such as gender or age category can be reliably inferred, researchers can use the predictions of these characteristics as variables in content analysis or discourse studies of anonymous communities. Section 2 of this thesis will show that Reddit is a popular domain for research other than author profiling as well. Author profiling research originally started on formal texts, such as articles, and quickly expanded to informal texts, such as blog posts. In the last few years, author profiling work has extensively covered Twitter and Facebook, reaching high accuracies on predicting gender and age category. For gender, the most common approach is a fastText model (a linear classifier over a bag-of-words representation), which generally reaches an accuracy of 85% - 95% on various social media and blog posts. The most common approach for age is different: researchers generally treat age as a multiclass problem and use support vector machines on various kinds of text representations; accuracy varies more widely, from around 50% to 80%. While Reddit is a heavily used social network, it has gone largely unnoticed in the author profiling field. A few works have looked at profiling personality type (Gjurković and Šnajder 2018) and political preference (van Duijnhoven 2018), but only two works have looked at predicting age and gender (Vasilev 2018; Amezoug 2018). One possible explanation for this is the lack of a suitable corpus to train on.

1.3 Distant labelling

A corpus suitable for author profiling needs reliable labels and a large enough quantity of posts to properly train machine learning algorithms. For Twitter and the PAN tasks, hand-labelled corpora are available; these are generally very reliable but cost a lot of time and resources to create. For Reddit - and some other domains - one has to look at generating a corpus automatically. This can be achieved through distant labelling, a type of distant supervision where queries are used to 'distantly label' users. Examples are retrieving all users who tweet "I'm a woman/man" and using this to label users as female or male respectively (Emmery, Chrupała, and Daelemans 2017), or setting up a query to retrieve anyone who comments "I'm a Democrat/Republican" and using this to label them with a political preference (van Duijnhoven 2018). While setting up queries on tweets provides acceptable performance, Amezoug (2018) indicates that querying for comment text on Reddit includes a substantial amount of noise such as sarcasm or quotes, which lowers the reliability of labels. The use of flairs and submission titles on Reddit for distant labelling does not have these issues, but the few attempts at this sample from a very limited set of subreddits (Gjurković and Šnajder 2018; Amezoug 2018; Vasilev 2018), which can result in a fairly small corpus and is likely to introduce bias into the dataset.
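To make the comment-based style of distant labelling concrete, the following is a minimal sketch; the pattern and function names are illustrative and not taken from any of the cited works. The second call also shows how dialogue slips through such a query, which is exactly the noise problem described above.

```python
import re

# Illustrative comment-based distant labelling: a self-identifying phrase
# assigns a gender label to the comment's author.
GENDER_PATTERN = re.compile(r"\bI'?m a (woman|man)\b", re.IGNORECASE)

def distant_label(comment_text):
    """Return 'female'/'male' if the text self-labels, else None."""
    match = GENDER_PATTERN.search(comment_text)
    if match is None:
        return None
    return "female" if match.group(1).lower() == "woman" else "male"

print(distant_label("I'm a woman who loves hiking"))   # female
print(distant_label("You assume I'm a man, but..."))   # male: a mislabel caused by dialogue
```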


1.4 Research questions

This thesis aims to fill two gaps in the research discussed so far. The first gap relates to data collection. As mentioned, limitations of the few existing studies on Reddit are sampling from a very limited number of subreddits and noise introduced by labelling via comments. Current works on distant labelling note that refinement of the simple queries used for distant labelling should lead to a more reliable corpus and therefore better model performance. The first research question aims to address these shortcomings:

RQ1: How does combining and refining existing distant labelling techniques affect model performance?

This thesis will sample users from ten different subreddits for distant labelling of gender and five different subreddits for distant labelling of age. Care was taken to include more varied topics, such as subreddits about powerlifting, asking advice, losing weight and finding new friends. Secondly, rather than querying for text in comments, distant labelling will be done through flairs and submission titles. These are less ambiguous than comment queries because there is no risk of the label being a quote or negation of some kind. The same simple fastText model that was used for Reddit in Amezoug (2018) will be trained on this new corpus to determine if performance improves when building a larger corpus sampled from more subreddits.

The second gap relates to analysis: author profiling for Reddit on gender and age is nearly non-existent, which this thesis aims to address. The few author profiling studies conducted on Reddit data have mostly used simple fastText models. Meanwhile, many different approaches have been tried to predict gender and age for social media posts from another domain, Twitter (Rangel et al. 2018).

RQ2: To what extent are prolific Twitter-based classification methods suitable for inferring age and gender from Reddit comments?

This research question will be answered by testing the most common approaches from Twitter - fastText for gender, an SVM for age - on Reddit. The same models will be trained and evaluated on the Reddit data and a Twitter corpus. This should give some insight into how different approaches translate to Reddit. After training, the models which performed best will be tested cross-domain as well (Reddit to Twitter, Twitter to Reddit) to identify any domain-specificity.

2. Related Work

Author profiling is the inference of user attributes from their written text (Reddy, Vardhan, and Reddy 2017). Researchers have attempted to predict a wide variety of attributes, such as age (Rangel Pardo et al. 2015), gender (Emmery, Chrupała, and Daelemans 2017; Rangel et al. 2018), nationality (Argamon et al. 2009; van den Boom and Veenman 2018; Estival et al. 2007) and mental health status (Conway and O'Connor 2016). For the sake of structure, this section will first discuss inferring gender and age, then continue with the profiling of other author attributes. The PAN convention is so prolific in the author profiling field that it will serve as the central theme through this section. PAN describes itself as "a series of scientific events and shared tasks on digital text forensics and stylometry" 7.

7 https://pan.webis.de/index.html

One of these shared tasks is the author profiling task, hosted annually since 2013 8. The overall format remains the same across the years: researchers are given a collection of documents, such as blog posts, tweets or Facebook posts, and need to predict certain author attributes. The precise focus of the author profiling task(s) differs each year.

2.1 Prediction of gender

The groundwork for predicting gender as an author attribute was laid well over a decade before PAN started its contests. Koppel, Argamon, and Shimoni (2002) reached an accuracy of 0.80 when predicting gender for formal texts. This was followed by multiple endeavours to predict the gender of blog authors, obtaining accuracies such as 80.1% (Schler et al. 2006) and 76.1% (Argamon et al. 2009). Gender is easily the most prolific dependent variable in the PAN tasks: "determine an author's gender" has been a task every year from 2013 to 2019. The annual recurrence of the PAN task clearly shows the progress that this field, or at least the contest participants, have made in a short timeframe. The winner of the 2013 author profiling task on blog posts reached a mere 0.59 accuracy for gender prediction in English, barely above the random baseline of 0.50 (Rangel et al. 2013). Yet in later years this accuracy was far surpassed: the 2018 winner had an accuracy of 0.82 using only text (Rangel et al. 2018). The rest of the field has caught up as well, with contemporary accuracies for gender prediction reaching 88% for Twitter and 91.9% for Facebook (Schwartz et al. 2013). It should be noted that all these tasks and many papers treat gender as a binary problem, so far discounting the idea of more than two genders or gender expressions.

There are slight differences in the preprocessing and models used for the most successful approaches. Some preprocessing methods are more universal than others. Of the 23 participants in the PAN2018 task, eleven teams removed Twitter-specific features such as hashtags and user mentions (Rangel et al. 2018). In terms of frequency, this approach is followed by lowercasing the words, used by eight teams. Rarer preprocessing steps are the removal of stopwords, removing character flooding and restoring contractions and abbreviations (Rangel et al. 2018). Many authors do not preprocess the text at all, which appears to be consistent with the wider field. Most researchers barely preprocess, choosing only to tokenize the data (Emmery, Chrupała, and Daelemans 2017; Sap et al. 2014; van den Boom and Veenman 2018), remove domain-specific features such as @username mentions (van Duijnhoven 2018), or remove special characters and tokenize and lemmatize the text (Amezoug 2018). This simple but effective approach to preprocessing continues into the modelling phase. Generally, gender is predicted with simple models. While complex machine learning models and neural networks are the most successful approach in many tasks, for author profiling simple models appear to perform as well if not better. One of the most popular models is fastText, a linear model over (sub)word embeddings (Joulin et al. 2016). Alternative approaches include an SVM trained on a combination of character and TF-IDF word n-grams (Basile et al. 2017) and an SVM trained on n-grams (Schwartz et al. 2013; Sap et al. 2014). Approaches to predicting author attributes from Reddit comments include a fastText model to predict gender and age (Amezoug 2018), and a character-level convolutional neural network to determine gender (Vasilev 2018). Additionally, fastText has been used to predict political preference on Reddit (van Duijnhoven 2018).

8 https://pan.webis.de/tasks.html


2.2 Prediction of age

Age is almost as prolific a dependent variable in author profiling as gender. In PAN, age prediction was a recurring task too, having been set from 2013 to 2016 9. Each of these tasks turned age into a categorical rather than a continuous predictor, an approach echoed by researchers independent of PAN (Argamon et al. 2009; Amezoug 2018). In contrast, Sap et al. (2014) approached age as a continuous variable for which they tried to predict the exact age, while sometimes age is divided merely into an over-18 and under-18 category, for instance when trying to detect criminal behaviour (van de Loo, De Pauw, and Daelemans 2016). Regardless of approach, age appears to be harder to predict than gender, not least because it often has more categories. The 2015 winner of the PAN author profiling task reached an accuracy of 0.84 for four age categories, the 2016 winner an accuracy of 0.53 for five age categories (Rangel Pardo et al. 2015; Rangel et al. 2016).

The popular methods for age prediction differ slightly from those for predicting gender. Most participants in the PAN2016 task used a support vector machine to predict age (Rangel et al. 2016). For English emails, a Support Vector Machine provided the best performance with 56.46% accuracy for three age categories, compared to a baseline of 39.43%; in Arabic with the same age groups, the accuracy was better, reaching 72.10% (Estival et al. 2008). Argamon et al. (2009) represent blog posts as numerical vectors with feature weights and predict three age categories with Bayesian Multinomial Regression, a variant of multivariate logistic regression. They report an accuracy of 76.1% using this approach. In a similar approach using 10s, 20s and 30s as age categories, Schler et al. (2006) reached an overall accuracy of 76.2%, noting that while 10s and 30s are easily distinguishable, many users in the 20s and 30s categories get misclassified. Schwartz et al. (2013) and Sap et al. (2014) opted for linear ridge regression to predict age as a continuous variable. Both papers reported a Pearson correlation coefficient of r = 0.84 between actual and predicted age.

2.3 Other author attributes

While age and gender are by far the most commonly predicted attributes in author profiling, researchers have explored other attributes too, and it is worthwhile to include them. To start with PAN, its 2015 author profiling task included personality prediction, where personality was operationalized as a series of traits scored between -0.5 and 0.5. Researchers tried to predict the scores for each author, which led to low accuracies of approximately 0.10 - 0.20 per trait (Rangel Pardo et al. 2015). The PAN organizers are not alone in showing interest in personality types. Gjurković and Šnajder (2018) determined MBTI personality types on Reddit by using user flairs as ground truth; because of this setup, their approach treats personality prediction as a categorical problem, rather than the continuous problem of the PAN approach. With this approach, performance increases to match that of standardized tests, with an accuracy of 41% for an exact match and 82% for an exact or one-off match (Gjurković and Šnajder 2018). So far, Reddit has only been mentioned briefly in this section, because author profiling research on Reddit is very limited. There are no peer-reviewed publications that focus on predicting age and gender for Reddit, though some other author attributes have been investigated.

9 https://pan.webis.de/tasks.html

As mentioned before, self-labelling through flairs was used to predict users' MBTI types, yielding an accuracy comparable to that of popular commercial models (Gjurković and Šnajder 2018). In another example, English-language Reddit comments were used to predict whether a user was a native English or Dutch speaker (van den Boom and Veenman 2018). The results are sufficiently accurate for the authors to suggest repeating their approach on different forums to determine if accuracy remains consistently high.

2.4 Language

Traditionally, both the PAN profiling task and the wider author profiling field have focused on a select few languages. English has been a document language in all PAN author profiling tasks, although the three most recent years also included Spanish, Portuguese, Arabic 10 and Dutch. All but Dutch were annotated with the region of the author, e.g. "Australia" or "Canada" for English. In 2017, the winning team was able to determine language variety with an average accuracy of 0.92 (on the secondary predictor, gender, they achieved an accuracy of 0.83). Despite these brief forays into different languages, it has been noted that the lack of inclusion of other languages is a shortcoming in the author profiling field in general (Fatima et al. 2017; Verhoeven 2018). Attempts at author profiling in non-Western languages are relatively few and varied. They include predicting age based on Chinese microblogs, predicting various attributes based on Vietnamese forum posts and predicting age and gender based on Roman Urdu Facebook posts (Rangel et al. 2017). In Arabic, Estival et al. (2008) found an 81.15% accuracy for emails (compared to 69.26% for English), and an 86.4% accuracy was found for news articles (Alsmearat et al. 2015).

10 https://pan.webis.de/clef17/pan17-web/author-profiling.html

2.5 Text representation and relevant features

It has been established earlier that many researchers do not or barely preprocess text, but many algorithms need text to be represented as a numerical vector in order to process it. One of the standard ways of representing a text is called a 'bag of words', which represents text as a numerical vector that contains each word and a value indicating how often that word occurs in the text, discarding word order (Zhang, Jin, and Zhou 2010). The fastText model, too, first represents the text as a bag of words before applying the classifier (Joulin et al. 2016). Optionally, a TF-IDF weighting can be applied over the vector to assign more importance to uncommon words, though it is unable to find relations between words (Ramos et al. 2003). With or without weighting, the main disadvantages of a bag-of-words representation are caused by the very large number of potential features, such as the risk of running out of memory (van den Boom and Veenman 2018). Another disadvantage is a tendency to overfit, as the model includes even the least relevant features such as typos, though proper regularization should help counter this (van den Boom and Veenman 2018). One simple way of feature reduction is keeping only the top n n-grams that provide the most information gain (Argamon et al. 2009; Schler et al. 2006). Another way of feature reduction is through Latent Semantic Analysis, which keeps the contextual relation between words (Landauer, Foltz, and Laham 1998).

Traditionally, one word is the base unit for all the methods described in the previous paragraph; e.g., a 1-gram is one word. Recent papers have investigated the use of character-level n-grams, concluding they might be an effective alternative to word-level n-grams (Kešelj et al. 2003; Zhang, Zhao, and LeCun 2015; Bojanowski et al. 2016). One distinct advantage is that character-level n-grams can 'learn' variations of words not in the training set, such as misspelled words, by comparing the known n-gram configurations to new configurations (Bojanowski et al. 2016).

Returning to the text itself, Campbell and Pennebaker (2003) distinguish two types of textual features for author profiling: content-based, which comprises what people choose to write about, and style-based, or how people choose to write about the topic. On the content-based features, there are conflicting results. Argamon et al. (2009) report blog posts about technology as a discriminant feature for men and writing about personal life or relationships as a discriminant feature for women, yet Santosh et al. (2013) found no difference in blog topics between men and women. Argamon et al. (2009) do note that content features are dependent on the underlying collection of blogs. For age, Argamon et al. (2009) report a difference in topics for teens, twenties and bloggers in their thirties, with the groups writing about school and mood, work and social life, and family life, respectively.

Use of language, or style-based features, does seem to differ by gender. One of the most-used tools in author profiling, the Linguistic Inquiry Word Count, is based on findings such as women using more first-person identifiers and men using more determiners (Newman et al. 2008). Similarly, female authors use fewer prepositions in blog posts than male authors, but use more self-conscious language (e.g. "I think...") (Newman et al. 2008) and more pronouns (Argamon et al. 2009; Schler et al. 2006). Arguably, emojis are part of modern language too, and emoji use can therefore be considered a stylistic feature. Lu et al. (2016) analysed a large dataset of mobile communication and found that men tend to use more emojis than women. Although the top five emojis are the same for both genders, the proportions differ, with women using a significantly higher proportion of face-related emojis than men (Lu et al. 2016). However, emoji use and interpretation are also influenced by platform (Tauch and Kanjo 2016; Miller et al. 2016). It should be noted that these papers mainly looked at blogs or mobile communication, and it is unknown if and to what extent these effects transfer to social media platforms like Reddit, and whether these effects change(d) over time as, for instance, new emojis get introduced.

For age, less information on relevant stylistic features is available. Usage of determiners and prepositions grows with age (Argamon et al. 2009; Schler et al. 2006). Use of contractions without apostrophes ("dont") and word-lengthening are stylistic features of younger authors (Brody and Diakopoulos 2011; Argamon et al. 2009).
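As a toy illustration of the representations discussed in this section (word-level bag of words, TF-IDF weighting and character-level n-grams), consider the following scikit-learn sketch; the example documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I think the cake was great", "the cake was grreat"]

bow = CountVectorizer().fit(docs)                # word-level bag of words
print(bow.get_feature_names_out())               # vocabulary; word order is discarded

tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF weighting over the same counts
print(tfidf.shape)

# Character n-grams still produce overlapping features for the misspelled
# "grreat", which is how sub-word models generalize to unseen word variants.
chars = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(docs)
print(len(chars.get_feature_names_out()))
```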

2.6 Distant labelling

All of the author profiling research discussed above relies on the availability of a sufficiently large labelled corpus. In the case of the discussed PAN tasks, all corpora are labelled by hand. However, with the amount of data needed to properly train a model, hand-labelled corpora can be difficult or even impossible to obtain (Ratner et al. 2016). Automation of this process seems a logical solution, yet turning data collection and processing into a fully unsupervised labelling process can cause problems with validity and can make it difficult to 'map' the labels to specific factors (Mintz et al. 2009). Ratner et al. (2016) instead proposed what they refer to as 'data programming': a relatively new set of techniques and heuristics whose goal is to create a large and reliable labelled corpus, filling the gap between manual annotation and completely unsupervised labelling.


For consistency, in this thesis any form of automated labelling is referred to as "distant labelling". Sometimes users will label themselves, for instance through flairs or phrases such as "I'm a woman who...". To distinguish between these two related concepts, the latter is referred to as self-labelling. In other words, self-labelling can be used to distantly label authors. Distant labelling has been used to label Twitter sentiments (Go, Bhayani, and Huang 2009), to label the gender of Twitter users (Emmery, Chrupała, and Daelemans 2017), to label political preference on Reddit (van Duijnhoven 2018) and to label personality types on Reddit (Gjurković and Šnajder 2018). The approach to labelling varies slightly: researchers either searched for self-identifying comment text via queries such as "I'm a woman" (Emmery, Chrupała, and Daelemans 2017) or "I'm a democrat" (van Duijnhoven 2018), used emojis for sentiment labelling (Go, Bhayani, and Huang 2009), or searched for self-labelling through flairs (Gjurković and Šnajder 2018).

A major limitation of distant labelling through comments on Reddit is the introduction of noise. Examples are users citing someone else and the quote being mistaken for self-labelling, and sarcasm or dialogue ("You assume I'm a democrat but...") leading to mistaken labels (van Duijnhoven 2018). To circumvent this, some researchers use flairs to distantly label users, which do not suffer from the same issues. However, research so far has sampled from a very limited number of subreddits, possibly introducing bias. Gjurković and Šnajder (2018) only sampled from MBTI discussion subreddits; Amezoug (2018) only sampled from /r/AskMen and /r/AskWomen; Vasilev (2018) sampled from AskMen, AskWomen, tall, relationships and relationship_advice. Vasilev (2018) noted that users only list their age on the relationships and relationship_advice subreddits, so the other subreddits were discarded for age labelling; furthermore, 75% of all labels were assigned through the relationship_advice subreddit.

2.7 Other research on Reddit

While author profiling on age and gender from Reddit comments is extremely limited, Reddit has been the subject of various other research. Therefore, building a working automated gender and age classifier for Reddit users may help researchers by allowing them to incorporate these dimensions as additional variables. For instance, analyses of Reddit's mental health communities suggest it might fill a unique niche where users can discuss stigmatized mental illnesses, with the protection through anonymity playing an important role in the amount of self-disclosure and the quality of advice given by commenters (De Choudhury and Sushovan 2014). Pappa et al. (2017) analysed LoseIt (a weight loss subreddit). Using a subset of users whose gender, age and weight loss over time were available through self-labelling, they found that high online activity, high engagement, greater BMI and certain topics (e.g. self-esteem, working out) were associated with greater weight loss. The knowledge gained from this may help with devising more effective weight loss strategies (Pappa et al. 2017).

2.8 Conclusion

In this section it has been established that author profiling is a growing field within NLP, of interest because of its applications in, among others, marketing and security. We have identified some gaps in existing research. One which will be addressed by this work is that author profiling has focused on Facebook, Twitter and personal blogs, largely ignoring Reddit despite it being a prolific social media platform. Little is known about predicting age and gender, the two most common author attributes in these tasks, for Reddit. The few works which attempt to do so use a relatively small corpus sampled from a very limited set of subreddits.

3. Experimental method

This work aims to address the shortcomings identified in Section 2. The shortcomings are addressed by using distant labelling techniques in conjunction with Reddit's built-in features, such as flairs, to quickly generate a large, labelled corpus suitable for author profiling on this domain. After generating the corpus, different models to predict age and gender are trained and tested both in-domain and cross-domain. The performance of these models is compared to earlier work on Reddit and Twitter. Lastly, post-hoc analysis is performed on both models to gain more insight into the classifiers themselves.

3.1 Data collection and description

Reddit itself limits the number of requests made through its API, so as an alternative, data was retrieved from Pushshift.io 11, a data science project hosting a complete clone of all Reddit comments and submissions, including their metadata. These files can either be queried through the Pushshift API or downloaded in 'raw dump' form, allowing download of all comments or submissions per month. The Pushshift API limits the number of objects returned, therefore comment and submission dumps were downloaded in November 2018 to keep the data freely queryable. The data was stored on a server in JSON format and accessed with MongoDB queries, using PyMongo. The data is split into two collections. The first collection contains all comments in all subreddits of Reddit during the period June until November 2018, totalling approximately 520 million comments and 460 GB of data. The other collection contains submissions to the subreddits ProgressPics and r4r made in November 2018, amounting to 14,947 submissions and 115 MB of data. Flairs and submission titles were used to label users.
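A minimal sketch of this setup is shown below; the database and collection names are illustrative, not taken from the actual server.

```python
from pymongo import MongoClient

# Connect to the server holding the Pushshift dumps (names are illustrative).
client = MongoClient("mongodb://localhost:27017")
db = client["reddit"]

comments = db["comments"]        # all comments, June-November 2018
submissions = db["submissions"]  # ProgressPics and r4r submissions, November 2018

# Example query: count the stored comments for a single subreddit.
print(comments.count_documents({"subreddit": "AskWomen"}))
```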

3.1.1 Subreddits, flairs and features. One of Reddit's characteristic features is 'flairs', which show up as a little tag next to the name of the user. Flairs are subreddit-specific: it is up to the subreddit whether it allows flairs and how they are implemented. Flairs can be assigned by moderators or set by users themselves, and can either be an option from a given list (e.g. "M" or "F") or 'freeform'. A flair is individual to a subreddit, meaning users can set a different flair for each subreddit they are active in. For example, users posting in /r/askmen or /r/askwomen can opt to choose a male or female symbol as flair. An example of a 'freeform' flair following a specific format is "M | 542.5 kg | 93 kg | 343 wlks | USAPL | RAW" in /r/powerlifting (a type of weight training), which communicates that user's gender, his top competition result and the specific powerlifting ruleset this result was achieved under. The choice of subreddits included in the analysis is based on subreddits already incorporated in existing work and an exploration of the Reddit website. The included subreddits expand on previous works, which often only included a subset of this selection, such as only the AskMen/AskWomen subreddits (Amezoug 2018; Vasilev 2018).

11 http://pushshift.io


Including subreddits such as powerlifting and progresspics is an attempt to diversify the users selected for analysis. Since the data collections include metadata, the data contains many fields irrelevant to the experiment, such as comment score, controversiality score and retrieval date. Only the fields relevant to this experiment are briefly described here. 'Author' contains the username, which on Reddit is unique (two users cannot share the exact same username) and permanent (it cannot be changed), allowing the username to serve as a unique identifier. The 'Body' field contains the full body text of a comment without any preprocessing, so including special characters and emojis. Lastly, 'subreddit' shows which subreddit the comment was placed in. As with usernames, subreddit names are unique and case-sensitive. The remaining fields are not used for querying and are discarded.

3.2 Preprocessing

The preprocessing of users can roughly be split into two phases. In the first phase, any user with a relevant flair or submission title is retrieved through MongoDB queries; this is described more extensively in Section 3.2.1. In the second phase, the saved usernames were processed to add gender and age labels, and the collection of users was split into gender and age subsets. After the preprocessing pipeline, all comments by a total of 40,806 users labelled with gender and 18,588 users labelled with an age category were retrieved.

3.2.1 User selection. All users were collected through MongoDB queries aimed at flairs or submission titles. Because of the unique rules and formatting of each subreddit, querying was done on a per-subreddit basis. For the subreddits gainit, tall, askmen and askwomen, the MongoDB query returned all users who had chosen either "male" or "female" as their flair. In the case of AskMenOver30 and AskWomenOver30, flairs include both gender and age category in the same field, so the query returned all users who had set a flair. For the powerlifting subreddit, flairs are set by moderators and always include gender, hence the query returned all users with flairs. For the LoseIt and keto subreddits, users create their own flairs, leading to a wide variety of formats: age might be referred to as "23M", "23/M" or "M|23". Returning all flairs allowed further inspection, which was used to fine-tune the method for extracting age and gender labels. The analysis included two additional subreddits where users were not selected by flair but by the titles of submissions, because these subreddits encourage or require users to put age and/or gender in their submission titles. In the r4r subreddit, users are required to put their age and gender at the start of a title (e.g. "60 [M4F] Looking for music-loving dates"). This appears to be strictly enforced, so a query returned all submissions made to this subreddit on the assumption that all would include age and gender in the required format. In the subreddit progresspics not all users include their age and gender in the title, since this is not required. Therefore, the MongoDB query used regular expressions to filter for submissions whose title contained age-gender pairs such as "23F"; submissions not containing such pairs were discarded (see the query sketch below). Authors can self-label only gender, only age or both. If only authors of whom both attributes are known were kept, much of the data would have to be discarded. Therefore, the data was split into two subsets: a dataset with gender-labelled users and a dataset with age-labelled users.
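The sketch below approximates the progresspics title query; the exact regular expression used in this thesis is more elaborate, and the collection name is illustrative.

```python
import re
from pymongo import MongoClient

# Approximate filter for titles containing an age-gender pair such as "23F".
pair = r"\b\d{2}\s*/?\s*[MF]\b|\b[MF]\s*/?\s*\d{2}\b"

client = MongoClient()
submissions = client["reddit"]["submissions"]

labelled = submissions.find({
    "subreddit": "progresspics",
    "title": {"$regex": pair, "$options": "i"},   # case-insensitive match
})
for sub in labelled:
    print(sub["author"], sub["title"])            # submissions without a pair never match
```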


Table 1
Overview of user and label counts in the gender subset, specified per subreddit

Subreddit          Male      Female
AskWomen           2,200     5,193
AskMen             7,933     1,707
AskWomenOver30     309       1,713
AskMenOver30       3,759     541
Tall               2,513     412
LoseIt             1,799     3,134
Keto               3,712     1,799
GainIt             406       53
Powerlifting       261       51
r4r                1,180     337
progresspics       458       494
Total              24,530    16,276

There is an overlap in users between these two datasets, but this is not expected to negatively influence modelling, because age and gender modelling are performed separately.

3.2.2 User preprocessing: creating a gender subset. The gender label was created with simple if-then rules for subreddits with predetermined flair categories and regular expressions for freeform flairs. The accuracy of the recoding was checked by creating crosstabs of the original flair and the gender label, ensuring that all flairs were accurately converted. Only users who selected either "female" or "male" as gender were included, meaning that e.g. transgender or non-binary users were discarded. To give a sense of scale: on the AskMenOver30 subreddit, 13 out of 4,403 users (0.29%) indicated they are trans; on AskWomenOver30, 8 out of 2,073 users (0.39%) did so. Transgender users were excluded for practical reasons (the subset would be too small for proper modelling) and ethical reasons (detailed in Section 5.3). Table 1 contains a breakdown of the users in the gender dataset whose comments were retrieved. Overall, the class distribution is fairly balanced, with 60.11% of users being male, although there is large variability in gender distribution between subreddits.
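A condensed sketch of this recoding and its crosstab check, combining the if-then route for fixed flair options with a regex route for freeform flairs (the flair values and helper names are illustrative):

```python
import re
import pandas as pd

FIXED = {"male": "male", "female": "female"}          # predetermined flair options
FREEFORM = re.compile(r"\b([MF])\b", re.IGNORECASE)   # freeform flairs such as "M | 23"

def gender_from_flair(flair):
    flair = flair.strip()
    if flair.lower() in FIXED:                        # if-then rule for fixed categories
        return FIXED[flair.lower()]
    match = FREEFORM.search(flair)                    # regex for freeform flairs
    if match:
        return "male" if match.group(1).upper() == "M" else "female"
    return None                                       # flair carries no gender label

df = pd.DataFrame({"flair": ["male", "F | 23", "M|50", "keto 4 life"]})
df["gender"] = df["flair"].map(gender_from_flair)
print(pd.crosstab(df["flair"], df["gender"]))         # manual check of the recoding
```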

3.2.3 User preprocessing: creating an age subset. Age required more preprocessing than gender. For AskMenOver30 and AskWomenOver30, recoding age was relatively straightforward, as users can only select their age from predetermined categories. The first two digits in these flairs were simply copied to the age column. Many categories span approximately 4-5 years, such as "26 - 29 years". In these cases, the first age number was considered the user's age, because each such age span fits wholly into one category (i.e., the categorical representation of age lines up with the age categories offered as flair options). Extracting age for the r4r subreddit was similarly straightforward due to the strict rules of the subreddit, which ensure a user's age always appears at the start of the submission title. However, extracting age for keto, LoseIt and progresspics proved more complicated, as it is not possible to simply extract pairs of two digits: '70', for instance, may refer to either weight or age.


Table 2
Overview of user and label counts in the age subset, specified per subreddit

Subreddit          <20     20 - 30    >30
AskMenOver30       0       1,665      2,449
AskWomenOver30     0       888        973
Keto               0       2,702      2,971
LoseIt             221     2,817      1,296
Progresspics       73      656        217
r4r                123     1,103      331
Total              520     9,831      8,237

First, flairs and submission titles were stripped of any special characters used by submitters to separate age and gender, e.g. in the formats "23/F" or "M|50". After this, the regular expressions set up for these subreddits looked for pairs such as "70M"; any pair of two digits in such a match was then considered to be the user's age. Once age is extracted, the approach of Estival et al. (2008), Argamon et al. (2009) and Schler et al. (2006) to treat age as a categorical problem is followed. Age is split into three categories in line with pre-existing age categories on the sampled subreddits: under 20, 20 to 30 years old, and older than 30. Crosstabs were used to check for outliers and proper recoding. Based on this, a total of seven users who selected the category "over 100 years of age" were dropped due to the low likelihood of this being true. For keto, progresspics and LoseIt, users whose age was <15 or >80 were dropped: after a manual check against the original flair or title, it was discovered these were coding errors where weight, weight loss or start date had been mistaken for age. Table 2 shows the final set of age-labelled users.
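The following sketch condenses this pipeline: strip separators, extract an age-gender pair, apply the plausibility bounds and bin into the three categories. The regular expression is an approximation of the one actually used, and the boundary handling (age 30 assigned to the 20 - 30 class) is an assumption.

```python
import re

PAIR = re.compile(r"\b(\d{2})\s*([MF])\b|\b([MF])\s*(\d{2})\b", re.IGNORECASE)

def age_category(flair_or_title):
    text = re.sub(r"[|/\\]", " ", flair_or_title)   # strip separators, e.g. "23/F" -> "23 F"
    match = PAIR.search(text)
    if match is None:
        return None
    age = int(match.group(1) or match.group(4))
    if age < 15 or age > 80:                        # likely weight or start date, not age
        return None
    if age < 20:
        return "<20"
    return "20 - 30" if age <= 30 else ">30"

print(age_category("M|50"), age_category("23/F"), age_category("SW: 120 kg"))
# '>30' '20 - 30' None
```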

3.3 Comment retrieval and preprocessing

Comment retrieval was performed separately for age and gender to create two datasets, but the process is the same. Retrieval was performed through a MongoDB query that iterated over a list of labelled users and retrieved each user's comments. One JSON data file was created containing the fields author, label (gender label or age category label, depending on the subset) and a list with all of that user's comments. In earlier author profiling work on Reddit, Gjurković and Šnajder (2018) only kept users who had at least one comment, while Amezoug (2018) kept users who had made a minimum of five comments. Due to the large number of users collected, the threshold for inclusion in the dataset was set at a minimum of five comments per user, as sketched below. Table 3 shows basic descriptives of the resulting comments file. On average, men have more posts than women, but use fewer words and characters in their posts. Because men have more posts than women, they make up 70.65% of all posts in the dataset, which is more skewed than the labelling distribution (60.11% men) suggests. For age, post frequency, words per post and characters per post all increase with age. In terms of total posts, the 20 - 30 class and >30 class are fairly balanced, but the <20 class is underrepresented, making up only 2.2% of all posts in the dataset.
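A sketch of the retrieval step (collection and file names are illustrative):

```python
import json
from pymongo import MongoClient

comments = MongoClient()["reddit"]["comments"]

def build_dataset(labelled_users, out_path, min_comments=5):
    """Write one JSON record per user: author, label and all their comments."""
    with open(out_path, "w") as f:
        for username, label in labelled_users.items():
            posts = [c["body"] for c in comments.find({"author": username})]
            if len(posts) < min_comments:       # drop users with fewer than five comments
                continue
            record = {"author": username, "label": label, "comments": posts}
            f.write(json.dumps(record) + "\n")

# build_dataset({"some_user": "female"}, "gender_dataset.json")
```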


Table 3
Overview of descriptives for posts in the Reddit dataset

Label     Posts per user    Words per post    Characters per post    Total posts
Male      339 (SD = 638)    34 (SD = 22)      187 (SD = 127)         7,558,557
Female    217 (SD = 446)    44 (SD = 26)      236 (SD = 145)         3,140,454
<20       186 (SD = 373)    31 (SD = 22)      167 (SD = 119)         96,316
20 - 30   217 (SD = 437)    39 (SD = 25)      212 (SD = 139)         2,024,160
>30       286 (SD = 551)    41 (SD = 26)      224 (SD = 141)         2,263,909

However, for all these metrics, the large standard deviations cause an overlap in confidence intervals between groups, meaning definitive conclusions cannot be drawn from these statistics alone. The distributions, especially for the number of posts per user, show strong positive skewness and heavy kurtosis, with a long right tail, which may partially explain the high standard deviations.

3.4 Modeling

For both age and gender, upon loading, the dataset is split into an 80% training set and a 20% held-out test set, stratified on the outcome variable. The data is split further during parameter tuning, with 20% of the training set used as a validation set during grid search to find the ideal hyperparameters. FastText 12 serves as the baseline model for both attributes. It is a text classification model which contains a hidden layer with a normalized bag-of-words representation of the text, to which a linear classifier is applied (Joulin et al. 2016). The complete hyperparameters of the baseline model are listed in Table 4; it is set to include n-grams from one to three words but to skip learning sub-word representations (character-level n-grams). The learning rate is set to 0.01 and the loss function to softmax. FastText uses raw text as input, so no further preprocessing of the data is done. Before tuning other fastText hyperparameters, the size of comment batches when reading in the data was investigated. The default size is 100, but reducing the comment batch size to 50 maintains accuracy (with default parameters) while adding 6,050 training instances. Therefore, in subsequent parameter tuning, the comment batch size is kept at 50. Learning rate, dimension and number of epochs are tuned using grid search with two-fold cross-validation on the training dataset; two-fold cross-validation was chosen to keep runtime manageable. Table 4 provides an overview of the tuned hyperparameters plus the values tested in the grid search. Due to the poor performance of the fastText model for age prediction, no further fastText parameter tuning was conducted for age. Instead, additional models were explored.
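A sketch of the baseline configuration using the fastText Python bindings is shown below; train.txt is assumed to hold one "__label__<class> <text>" line per training instance, and parameters not listed fall back to library defaults.

```python
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.01,           # baseline learning rate
    dim=100,           # vector size
    epoch=30,
    wordNgrams=3,      # word n-grams of length one to three
    minn=0, maxn=0,    # no character-level (sub-word) n-grams
    bucket=1000,
    loss="softmax",
)

n, precision, recall = model.test("test.txt")
print(precision)       # with one label per document this equals accuracy
```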

3.4.1 Additional modelling for age. Based on the existing approaches to age classification, the choice was made to implement a Support Vector Machine (Chang and Lin 2011; Cortes and Vapnik 1995; Boser, Guyon, and Vapnik 1992).

12 https://fasttext.cc/


Table 4
fastText model parameters, listing the baseline, any tuning values and the parameters of the best performing model.

Parameter                            Baseline   Tuning values                             Best value
Learning rate                        0.01       1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001   0.1
Learning rate update                 100        -                                         100
Vector size (dim)                    100        30, 50, 100, 200, 300                     100
Size of context window (ws)          3          -                                         3
Epochs                               30         5, 10, 15, 20                             30
Minimal number of word occurrences   0          -                                         0
Negatives sampled                    10         -                                         10
N-gram range                         (1,3)      -                                         (1,3)
Loss function                        Softmax    -                                         Softmax
bucket                               1000       -                                         1000
MinN, MaxN                           0          -                                         0
t                                    0.0001     -                                         0.0001

The SVM uses a linear kernel (Fan et al. 2008) and treats multi-class classification as a 'one versus all' problem 13. Contrary to the fastText model, an SVM does require preprocessing. To this end, the data was vectorized, after which TF-IDF weighting was applied, using ℓ2 normalization (Baeza-Yates, Ribeiro et al. 2011). Latent Semantic Analysis was used as an additional step after the TF-IDF weighting (Halko, Martinsson, and Tropp 2011), testing reduction to 150 features and reduction to 300 features. After each of these steps, the SVM was retrained with default parameters and re-tested on the test dataset to quantify any effects on its accuracy. Finally, grid search with two-fold cross-validation was used to tune the SVM itself. The grid search tested weighted versus non-weighted classes and different values for the error term parameter. Table 5 contains an overview of the tested hyperparameters.
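This pipeline can be sketched in scikit-learn as follows; the grids follow Table 5, while everything else (including the LSA dimensionality shown) is a default or an assumption.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(norm="l2")),     # TF-IDF with l2 normalization
    ("lsa", TruncatedSVD(n_components=150)),   # LSA; 300 components were also tested
    ("svm", LinearSVC()),                      # linear kernel, one-vs-rest by default
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10_000, 100_000],
        "svm__class_weight": [None, "balanced"],
    },
    cv=2,               # two-fold cross-validation to keep runtime manageable
    scoring="accuracy",
)
# grid.fit(train_texts, train_labels)   # texts: one concatenated string per user
```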

3.5 Evaluation

The performance of the tuned models is evaluated by looking at the accuracy score on the held-out test data. In addition, full classification reports can be found in Appendix A and Appendix B. Part of the evaluation concerns the extent to which a model generalizes to different platforms, such as Twitter. To test this, the best-performing model from Reddit is identified and its parameter settings are used to create a new model trained on Twitter. These two models are then evaluated on both Reddit and Twitter. With this approach, cross-domain performance can be compared to in-domain performance, allowing better insight into whether any cross-platform increase or decrease in accuracy is meaningful. The Twitter corpus from Volkova, Coppersmith, and Van Durme (2014) is used as Twitter domain data.

13 All further preprocessing and modelling of the age data was ran in the SciKit-learn package


Table 5
SVM model parameters, listing the baseline, any tuning values and the parameters of the best performing model.

Parameter                     Baseline        Tuning values                              Best value
Fit intercept (center data)   True            -                                          True
Class weight                  None            None, balanced                             None
Loss function                 Squared hinge   -                                          Squared hinge
Multi-class handling          One vs. all     -                                          One vs. all
Maximum iterations            1000            -                                          1000
Penalty                       ℓ2              -                                          ℓ2
Penalty error term (C)        1.0             0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100,    1.0
                                              1000, 10,000, 100,000

Table 6
Descriptives for the Volkova Twitter corpus

Feature   Category   Frequency
Gender    Male       26,708
          Female     32,367
Age       <20        8,898
          20 - 30    28,821
          >30        6,919

This corpus contains batched tweets without preprocessing from 103,713 different authors. Full descriptives of the corpus can be found in Table 6.
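The comparison itself reduces to evaluating each trained model on both held-out sets, along these lines (variable names are illustrative):

```python
from sklearn.metrics import accuracy_score

def evaluate(model, test_sets):
    """Report accuracy of one trained model on several domains."""
    for domain, (texts, labels) in test_sets.items():
        acc = accuracy_score(labels, model.predict(texts))
        print(f"{domain}: accuracy = {acc:.4f}")

# evaluate(reddit_model, {"Reddit": (reddit_X, reddit_y), "Twitter": (twitter_X, twitter_y)})
```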

3.6 Post-hoc analyses

To give some insight into which features the fastText model considers important, an experiment similar to that proposed by Ribeiro, Singh, and Guestrin (2016) and Kádár, Chrupała, and Alishahi (2017) is performed. In this experiment, the gender class of a random sentence is predicted using the trained model, after which the function iterates over the tokens in the sentence, each time comparing the difference in probability between the full sentence and the sentence with the token omitted. The more important a token, the higher the difference in probability. This experiment is repeated for both the Reddit and the Twitter model. The SVM used to predict age allows extraction of the coefficients for each class; linking these to the TF-IDF features reveals which n-grams are most predictive for each class. The 20 most positive n-grams for each class are extracted for both the Reddit and Twitter models. This allows 'human sense-making' of the model: it gives an indication of whether noise from the unprocessed input was incorporated into the model, and shows any possible overlaps between classes and domains.
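Both analyses can be sketched as follows. The omission scores assume the model exposes a probability function for the target class; the coefficient extraction assumes the SVM was fit directly on the TF-IDF features (with LSA in between, coefficients would map to SVD components instead). All names are illustrative.

```python
import numpy as np

def token_importance(predict_proba, tokens):
    """Score each token by the drop in predicted-class probability when omitted."""
    full = predict_proba(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(full - predict_proba(" ".join(reduced)))
    return scores                     # higher score = more important token

def top_ngrams(svm, vectorizer, n=20):
    """Map each class's largest SVM coefficients back to their TF-IDF n-grams."""
    features = np.array(vectorizer.get_feature_names_out())
    return {cls: features[np.argsort(svm.coef_[k])[-n:][::-1]]
            for k, cls in enumerate(svm.classes_)}
```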


Table 7
Reported accuracies for in-domain and cross-domain performance of the best-performing models

                         Test data
Feature   Train data     Reddit    Twitter
Gender    Reddit         0.8579    0.6149
          Twitter        0.7945    0.9071
          Baseline       0.601     0.548
Age       Reddit         0.7012    0.5873
          Twitter        0.5393    0.9289
          Baseline       0.529     0.646

4. Results

Table 7 reports the in-domain and cross-domain results of the best-performing Reddit and Twitter models for gender and age.

Gender prediction for Reddit has an accuracy of 0.857, well above the naive majority baseline of 0.601. The Reddit model does not generalize well to Twitter, with the accuracy dropping to 0.615 when tested on the Twitter corpus. The Twitter model reached a higher in-domain accuracy than the Reddit model, at 0.907. Here too, the cross-domain accuracy is lower, though the drop is less pronounced than for the Reddit model and accuracy stays well above the baseline. The classification report in Appendix A shows that in-domain precision for men and women is equal for both models, despite class imbalance. With cross-domain testing, precision remains approximately equal for both genders in the Twitter model, but drops dramatically from 0.85 to 0.54 for males in the Reddit model, despite men being overrepresented in the Reddit dataset.

For age prediction the results are more mixed. The Reddit-to-Reddit model performs well above baseline, with an accuracy of 0.701 compared to the 0.529 baseline. Reddit-to-Twitter, however, drops below baseline, with an accuracy of 0.587 compared to a 0.646 baseline for Twitter. The best performing model is the Twitter in-domain model, with an accuracy of 0.928. The Twitter-to-Reddit model shows that there is a steep decline in performance with cross-domain testing: with an accuracy of 0.539, it barely remains above the naive majority baseline. The classification report for age prediction in Appendix B shows that the Reddit model fails to predict any <20 cases in both domains. While the Twitter model has an in-domain precision of 0.98 for the <20 class, this drops to 0.00 when testing on the Reddit domain.

4.1 Post-hoc analysis of the gender model

Table 8 shows some example sentences with the three most significant tokens for either male or female prediction. The experiment shows that the model highlights completely different tokens for the same sentence, depending on whether it was trained on Twitter or Reddit, with only occasional overlap.

17 Data Science: Business & Governance 2018

Table 8
Results of the gender post-hoc analysis. Sentences were picked at random from both corpora. If the model misclassified the sentence, the actual label is listed in brackets. Pink indicates the three tokens most predictive for female, blue the three tokens most predictive for male. The darker colour is the top token.

Results for the Reddit model
Domain, predicted label (actual label)
Reddit, female: I'm finding non-food rewards helpful... hot baths, quiet time with a , time to myself messing around with my guitar, etc. Hang in there, you're not alone.
Reddit, female: I get how you feel. I feel the same way a lot of the time.
Reddit, male: Just Cause 3 is still $6.00. ($9.00 for XL Edition). And I'm still not sure if I want it lol.
Twitter, female: Please don't ruin my day Mavs. It's bad enough I'm already half dead
Twitter, male: Y'all laugh at Dj Khaled but he really the biggest motivator in the rap game

Results for the Twitter model
Domain, predicted label (actual label)
Reddit, male (female): It's absolutely a good thing! For me, it's forcing me to end my relationship with a bunch of detrimental old habits.
Reddit, female: I get how you feel. I feel the same way a lot of the time.
Reddit, male: Just Cause 3 is still $6.00. ($9.00 for XL Edition). And I'm still not sure if I want it lol.
Twitter, male (female): Please don't ruin my day Mavs. It's bad enough I'm already half dead
Twitter, male: Y'all laugh at Dj Khaled but he really the biggest motivator in the rap game
Twitter, male: Some people believe everything they hear smh

4.2 SVM Coefficients

Table 9 shows the five most positive predictors for each age class for the Reddit model and the Twitter model; the appendix contains a table showing the twenty most positive predictors of each class. The results of this experiment show that noise has been incorporated into the Reddit model. For instance, "favorite https" is one of the top five n-grams for the under 20 class, and the typo "didn learn" [sic] is one of the twenty most positive predictors for the over 30 category. The Twitter model is similarly noisy, including a significant number of Twitter handles as influential coefficients for each class.


Table 9
The top 5 predictive n-grams for each model per class, ordered top to bottom from most to least predictive.

Label     | Reddit model     | Twitter model
Under 20  | doctor nurse     | birthday dj
          | marketability    | nigga good
          | favorite https   | grown man
          | soon thing       | weird sorry
          | months live      | list phone
20 - 30   | easy simple      | logeyy_bear
          | injections       | raised questions
          | cordelia         | understand support
          | counting single  | hike amp
          | bowl eat         | historical kindle
Over 30   | sorry trying     | fat motherfucker
          | going pain       | im walkin
          | just saw         | don depend
          | stick routine    | safe bro
          | online thanks    | chief like

Furthermore, there is little overlap between the top n-grams for Twitter and Reddit, regardless of class.
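
For a linear SVM over TFIDF features, coefficients like these can be read directly from the fitted model by mapping each weight back to its n-gram. A minimal sketch with scikit-learn, assuming fitted vectorizer and svm objects (the names are illustrative, not this thesis's exact code):

    # Sketch: the n most positive TFIDF n-grams per class of a LinearSVC.
    import numpy as np

    def top_ngrams_per_class(vectorizer, svm, n=5):
        feature_names = np.asarray(vectorizer.get_feature_names_out())
        # In the multiclass (one-vs-rest) case, coef_ stores one weight
        # vector per class: shape (n_classes, n_features).
        for label, weights in zip(svm.classes_, svm.coef_):
            top = np.argsort(weights)[::-1][:n]
            print(label, list(zip(feature_names[top], np.round(weights[top], 3))))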

5. Discussion

For gender prediction on Reddit, a simple fastText model reaches approximately the same accuracy as established in Amezoug (2018): 0.848 in Amezoug (2018)'s work versus 0.857 in our own. The Twitter model significantly outperforms the Reddit model with an in-domain accuracy of 0.907, in line with earlier work reporting accuracies of 0.902 (Amezoug 2018) and 0.854 (Rangel et al. 2018). The closeness of these results suggests that a larger corpus alone is not sufficient to significantly increase model performance. The post-hoc analysis shows that the fastText model tags very different words as important depending on whether it was trained on Twitter or Reddit, which indicates some degree of domain-specificity in the language used. The findings appear to contradict earlier findings on gender differences in language use. Newman et al. (2008) stated that men use more determiners and prepositions ("my", "in") than women, yet the Reddit model flags several of these as positive for a female label prediction. Likewise, Newman et al. (2008) found that women use more first-person identifiers ("I"), but particularly the Twitter model considers these predictive for males.

Prediction of age for Reddit exceeds existing work, with 0.70 accuracy compared to 0.65 (Amezoug 2018). One possible explanation follows the one given by Amezoug (2018), who notes that their distant queries (e.g. "I'm X years old") may have included sarcastic remarks or quotes of other commenters as labels. Flairs and submission titles are assumed not to have these issues, as both explicitly refer to the users themselves, and sarcasm in flairs would have to be subreddit-wide, since the same flair appears on all comments in that subreddit. A second possible explanation for the slight increase in accuracy is the difference in age categories: <20, 20-30 and >30 versus <20, 20-40 and >40.

The main issue with the age model is its inability to detect the <20 class. In the Reddit dataset this is somewhat expected due to class imbalance: the <20 class makes up only 2.50% of the Reddit dataset, instead of the 33.33% it would under balanced classes. However, the Twitter model being able to predict the <20 class on Twitter while failing to do so on Reddit indicates that, beyond the class distribution problem, there may be a clear difference in how the <20 class communicates on Reddit and Twitter. The extracted SVM coefficients hint at this, as there is no overlap between the most influential Reddit and Twitter n-grams.

Furthermore, it is noteworthy that some n-grams intuitively make sense while others do not. Examples of 'sensible' n-grams are "make dinner" (positive for the 20 - 30 class) or "stick routine" and "room kids" (both positive for the over 30 class). Yet "insurance" as a positive predictor for the under 20 class does not make sense on the face of it; these are the kind of discrepancies that, as Ribeiro, Singh, and Guestrin (2016) point out, only become visible when analysing the classifier itself. On the other hand, some cryptic influential n-grams may have perfectly sensible explanations: "historical kindle" appears to come from users persistently promoting their historical fiction in the Amazon Kindle store (see footnote 14). Regardless of possible explanations, the noisiness of these coefficients suggests that preprocessing could make the dataset cleaner and predictions better by removing irrelevant features. Secondly, the presence of usernames suggests that some users may have a disproportionate presence in the dataset, which carries over into the model. This second issue especially affects the Twitter data, where at least 8 of the 60 most influential n-grams appear to be Twitter handles, while the Reddit n-grams contain only one name.
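
The kind of classifier inspection Ribeiro, Singh, and Guestrin (2016) advocate is available off the shelf as LIME. The sketch below is hypothetical throughout: it assumes a fitted scikit-learn pipeline with a predict_proba method (a LinearSVC would first need calibration, e.g. via CalibratedClassifierCV), an invented example sentence, and that "<20" is class index 0.

    # Sketch: explaining a single age prediction with LIME.
    from lime.lime_text import LimeTextExplainer

    explainer = LimeTextExplainer(class_names=["<20", "20-30", ">30"])
    explanation = explainer.explain_instance(
        "my insurance won't cover it and rent is due",  # invented example
        pipeline.predict_proba,  # list of texts in, class probabilities out
        labels=(0,),             # explain the "<20" class
        num_features=6,          # report the six most influential tokens
    )
    print(explanation.as_list(label=0))  # (token, weight) pairs for "<20"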

5.1 Limitations

Despite attempts to diversify, users are still sampled from a very small percentage of all subreddits, which may introduce bias into the data. When labelling users only through flairs or submission titles, there is no clear solution to this, as many subreddits do not contain flairs or submission titles with author attributes. Another possible bias in the data lies in the age distribution: there are comparatively very few under-20s in the dataset. Understandably, this makes predicting age a more difficult task, but it may also affect the prediction of gender: if the gender dataset follows roughly the same age distribution, it will be just as imbalanced in this dimension, and any models trained on it will be more attuned to predicting gender for slightly older users.

Users of social media may also lie about their identity, and the automated labelling will accept these lies as truth. For instance, a teenager might pretend to be an adult and label themselves as such. While careful manual annotation might notice such inconsistencies, we are not aware of a method to automatically detect these users.

Lastly, this thesis did not use any preprocessing. Preprocessing might improve model performance by removing two types of noise: parts of texts that are not the author's own, i.e. quotes and retweets, and internet slang, which could be removed or standardized. As a counterpoint to too much preprocessing, slang and (intentional) mistyping of words could themselves be predictive of author attributes, such as millennials' use of word lengthening or people with a shared native language making the same typos.

14 See the results on https://twitter.com/search?q=historical%20kindle


5.2 Future work

We have shown how to create a large-scale labelled dataset with minimal resources, an approach that can be re-used in future author profiling work on Reddit. Authors can choose whether to use the existing corpus, augment it, or create their own. To address the limitation of sampling from a small set of subreddits, future research could combine the flair- and submission-based method from this thesis with distant labelling based on comment text, which is noisier but applicable to any subreddit.

Due to time and resource constraints, no preprocessing methods were explored. The poor cross-domain performance suggests that there may be domain-specific features which hinder model generalization. A starting point would therefore be to remove all domain-specific features from Reddit comments, such as user mentions, subreddit mentions and hyperlinks. Preprocessing can also focus on removing noise, such as the Twitter handles that currently often appear as significant features.
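
As a concrete illustration of this starting point, such surface features can be stripped with a handful of regular expressions. The patterns below are indicative only, not code used in this thesis; real data would need more careful rules.

    # Sketch: removing domain-specific surface features from posts.
    import re

    PATTERNS = [
        re.compile(r"https?://\S+"),         # hyperlinks
        re.compile(r"(?<!\w)/?u/\w+"),       # Reddit user mentions, e.g. u/name
        re.compile(r"(?<!\w)/?r/\w+"),       # subreddit mentions, e.g. r/AskReddit
        re.compile(r"@\w+"),                 # Twitter handles
        re.compile(r"\bRT\b"),               # retweet markers
        re.compile(r"^>.*$", re.MULTILINE),  # Reddit quote lines (not the author's words)
    ]

    def strip_domain_features(text):
        for pattern in PATTERNS:
            text = pattern.sub(" ", text)
        return " ".join(text.split())  # collapse leftover whitespace

    # strip_domain_features("RT @user see r/books https://t.co/abc") -> "see"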

5.3 Ethical considerations

Author profiling work currently still treats gender as a binary variable, contrary to the academic understanding of gender and gender identity. By treating gender as a binary male/female variable, a subset of the population is ignored. However, researchers should be careful when including these marginalized groups: any algorithm that identifies marginalized groups could also be used by malicious agents.

A second point of concern is the analysis of vulnerable communities. Users often do not seem to be aware that if part of a userbase shares defining characteristics, this can be used to profile all 'anonymous' users. In 2014, the UK suicide prevention charity Samaritans released "Samaritans Radar". Once installed, the tool scanned the tweets of every account the user followed and emailed the user whenever it determined that someone displayed suicidal tendencies. This led to widespread outcry and a quick retraction of the tool. Bernal (2014) found that those using Twitter for mental health support saw privacy as a collective right, and that any attempt at harming this privacy, e.g. through automated analysis, was seen as an attack on the community as a whole. Indiscriminately using author profiling on all Reddit communities may cause similar harm, since De Choudhury and Sushovan (2014) found that anonymity plays an important role in people's usage of mental health subreddits. Creating algorithms that profile anonymous users can slowly erode these supportive networks, leaving vulnerable persons with nowhere to go.

6. Conclusion

This thesis contributes a method to quickly and cheaply generate large labelled datasets for Reddit, a platform currently underutilised in the author profiling field. It refines previous methods by sampling more subreddits and using unambiguous flairs and submission titles, leading to a significantly larger corpus with less label pollution. Comparing the achieved accuracies to earlier attempts shows that the increased reliability and size of the assembled corpus only slightly enhance model performance: compared to earlier work using the same models, gender classification accuracy improves by merely 0.01, while age classification accuracy improves by 0.05. Due to the limited scope of this thesis, it is possible that the performance improvement will be more noticeable with different models or author attributes.


This thesis investigated the use of a simple fastText model for gender prediction, showing it can achieve a high accuracy of 0.86 for Reddit. For age on Reddit, fastText does not prove suitable; the results show that a linear kernel SVM with TFIDF representation is successful instead, reaching an accuracy of 0.70 for a three-class categorisation of age, which is on par with earlier work. Neither Reddit model beats the performance of a model trained on Twitter, which reaches 0.90 accuracy for gender and 0.92 accuracy for age. Taken together, these results support the hypothesis that successful Twitter approaches and algorithms generalize to Reddit, since the approaches used provide satisfactory results without much further engineering. However, the trained models show poor cross-domain performance, taking a hit to accuracy and, in the case of age, failing to detect an entire class. The lack of preprocessing to filter out domain-specific features is a possible explanation for this.

In conclusion, the chosen models perform acceptably but leave room for improvement, possibly through preprocessing of the input and further tuning of the models. The corpus generation method proposed here can aid in this, as it allows any researcher to efficiently create a large dataset suitable for training text classifiers.


References

Alsmearat, Kholoud, Mohammed Shehab, Mahmoud Al-Ayyoub, Riyad Al-Shalabi, and Ghassan Kanaan. 2015. Emotion analysis of arabic articles and its impact on identifying the author's gender. In Computer Systems and Applications (AICCSA), 2015 IEEE/ACS 12th International Conference of, pages 1–6, IEEE.
Amezoug, Anwar. 2018. Inferring personal attributes based on different modalities on social networks. Master's thesis, Tilburg University, 8.
Argamon, Shlomo, Moshe Koppel, James W Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2):119–123.
Baeza-Yates, Ricardo, Berthier de Araújo Neto Ribeiro, et al. 2011. Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley.
Basile, Angelo, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. Is there life beyond n-grams? a simple svm-based author profiling system. In Cappellato et al. [13].
Bernal, P. 2014. Samaritans radar: Misunderstanding privacy and 'publicness'. (Unpublished manuscript).
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
van den Boom, Bernard and Cor J Veenman. 2018. Finding dutch natives in online forums. Forensic Sciences Research, pages 1–10.
Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, ACM.
Brody, Samuel and Nicholas Diakopoulos. 2011. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of the conference on empirical methods in natural language processing, pages 562–570, Association for Computational Linguistics.
Campbell, R Sherlock and James W Pennebaker. 2003. The secret life of pronouns: Flexibility in writing style and physical health. Psychological science, 14(1):60–65.
Chang, Chih-Chung and Chih-Jen Lin. 2011. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):27.
Conway, Mike and Daniel O'Connor. 2016. Social media, big data, and mental health: current advances and ethical implications. Current opinion in psychology, 9:77–82.
Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20(3):273–297.
De Choudhury, Munmun and De Sushovan. 2014. Mental health discourse on reddit: Self-disclosure, social support, and anonymity. In ICWSM.
van Duijnhoven, Coen. 2018. Predicting political preference through content- and stylistic text features and distant labeling. Master's thesis, Tilburg University, 7.
Emmery, Chris, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55.
Estival, Dominique, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and Will Radford. 2008. Author profiling for english and arabic emails.
Estival, Dominique, Tanja Gaustad, Son Bao Pham, Will Radford, and Ben Hutchinson. 2007. Author profiling for english emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 263–272.
Fan, Rong-En, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874.
Fatima, Mehwish, Komal Hasan, Saba Anwar, and Rao Muhammad Adeel Nawab. 2017. Multilingual author profiling on facebook. Information Processing & Management, 53(4):886–904.
Gjurković, Matej and Jan Šnajder. 2018. Reddit: A gold mine for personality prediction. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 87–97.
Go, Alec, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12).


Halko, Nathan, Per-Gunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288.
Hovy, Dirk. 2015. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 752–762.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Kádár, Ákos, Grzegorz Chrupała, and Afra Alishahi. 2017. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics, 43(4):761–780.
Kešelj, Vlado, Fuchun Peng, Nick Cercone, and Calvin Thomas. 2003. N-gram-based author profiles for authorship attribution. In Proceedings of the conference pacific association for computational linguistics, PACLING, volume 3, pages 255–264.
Koppel, Moshe, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.
Landauer, Thomas K, Peter W Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284.
van de Loo, Janneke, Guy De Pauw, and Walter Daelemans. 2016. Text-based age and gender prediction for online safety monitoring. Comput. Linguistics Netherlands, 5(1):46–60.
Lu, Xuan, Wei Ai, Xuanzhe Liu, Qian Li, Ning Wang, Gang Huang, and Qiaozhu Mei. 2016. Learning from the ubiquitous language: an empirical analysis of emoji usage of smartphone users. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 770–780, ACM.
Miller, Hannah, Jacob Thebault-Spieker, Shuo Chang, Isaac Johnson, Loren Terveen, and Brent Hecht. 2016. “Blissfully happy” or “ready to fight”: Varying interpretations of emoji. Proceedings of ICWSM, 2016.
Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011, Association for Computational Linguistics.
Newman, Matthew L, Carla J Groom, Lori D Handelman, and James W Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes, 45(3):211–236.
O'reilly, Tim. 2005. What is web 2.0.
Pappa, Gisele Lobo, Tiago Oliveira Cunha, Paulo Viana Bicalho, Antonio Ribeiro, Ana Paula Couto Silva, Wagner Meira Jr, and Alline Maria Rezende Beleigoli. 2017. Factors associated with weight change in online weight management communities: a case study in the loseit reddit community. Journal of medical Internet research, 19(1).
Prahalad, Coimbatore K and Venkat Ramaswamy. 2004. Co-creation experiences: The next practice in value creation. Journal of interactive marketing, 18(3):5–14.
Ramos, Juan et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 133–142.
Rangel, Francisco, Paolo Rosso, Manuel Montes-y Gómez, Martin Potthast, and Benno Stein. 2018. Overview of the 6th author profiling task at pan 2018: multimodal gender identification in twitter. Working Notes Papers of the CLEF.
Rangel, Francisco, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. 2013. Overview of the author profiling task at pan 2013. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pages 352–365, CELCT.
Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter. Working Notes Papers of the CLEF.
Rangel, Francisco, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at pan 2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings/Balog, Krisztian [edit.]; et al., pages 750–784.
Rangel Pardo, Francisco Manuel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd author profiling task at pan 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pages 1–8.


Ratner, Alexander J, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575.
Reddy, T Raghunadha, B Vishnu Vardhan, and P Vijayapal Reddy. 2017. N-gram approach for gender prediction. In Advance Computing Conference (IACC), 2017 IEEE 7th International, pages 860–865, IEEE.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, ACM.
Santosh, K, Romil Bansal, Mihir Shekhar, and Vasudeva Varma. 2013. Author profiling: Predicting age and gender from blogs. Notebook for PAN at CLEF, pages 119–124.
Sap, Maarten, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and Hansen Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1146–1151.
Schler, Jonathan, Moshe Koppel, Shlomo Argamon, and James W Pennebaker. 2006. Effects of age and gender on blogging. In AAAI spring symposium: Computational approaches to analyzing weblogs, volume 6, pages 199–205.
Schwartz, H Andrew, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski, Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin EP Seligman, et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 8(9):e73791.
Tauch, Channary and Eiman Kanjo. 2016. The roles of emojis in mobile phone notifications. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, pages 1560–1565, ACM.
Vasilev, Evgenii. 2018. Inferring gender of Reddit users. Ph.D. thesis, Universität Koblenz-Landau.
Verhoeven, Ben. 2018. Two authors walk into a bar: studies in author profiling. Ph.D. thesis, University of Antwerp.
Volkova, Svitlana, Glen Coppersmith, and Benjamin Van Durme. 2014. Inferring user political preferences from streaming communications. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 186–196.
Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
Zhang, Yin, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 1(1-4):43–52.


Appendices

A. Classification reports for gender prediction

Table 10
Classification reports for gender, using the best-performing model (fastText trained on Reddit).

fastText trained on Reddit, predicting on Reddit

              precision  recall  f1-score  support
female             0.87    0.70      0.78     1732
male               0.85    0.94      0.89     3125
micro avg          0.86    0.86      0.86     4857
macro avg          0.86    0.82      0.84     4857
weighted avg       0.86    0.86      0.85     4857

fastText trained on Reddit, predicting on Twitter

              precision  recall  f1-score  support
female             0.87    0.35      0.50     6473
male               0.54    0.94      0.69     5342
micro avg          0.61    0.61      0.61    11815
macro avg          0.71    0.64      0.59    11815
weighted avg       0.72    0.61      0.58    11815

Table 11
Classification reports for gender, using the best-performing model (fastText trained on Twitter).

fastText trained on Twitter, predicting on Twitter

              precision  recall  f1-score  support
female             0.91    0.92      0.92     6473
male               0.91    0.89      0.90     5342
micro avg          0.91    0.91      0.91    11815
macro avg          0.91    0.91      0.91    11815
weighted avg       0.91    0.91      0.91    11815

fastText trained on Twitter, predicting on Reddit

              precision  recall  f1-score  support
female             0.78    0.59      0.67     1732
male               0.80    0.91      0.85     3125
micro avg          0.79    0.79      0.79     4857
macro avg          0.79    0.75      0.76     4857
weighted avg       0.79    0.79      0.79     4857


B. Classification reports for age prediction

Table 12
Classification reports for age, using the best-performing model (SVM trained on Reddit).

SVM trained on Reddit, predicting on Reddit

                 precision  recall  f1-score  support
20 - 30               0.69    0.74      0.72     1104
older than 30         0.72    0.70      0.71     1040
younger than 20       0.00    0.00      0.00       55
micro avg             0.70    0.70      0.70     2199
macro avg             0.47    0.48      0.48     2199
weighted avg          0.69    0.70      0.69     2199

SVM trained on Reddit, predicting on Twitter

                 precision  recall  f1-score  support
20 - 30               0.68    0.74      0.71     5764
older than 30         0.36    0.69      0.48     1384
younger than 20       0.00    0.00      0.00     1780
micro avg             0.59    0.59      0.59     8928
macro avg             0.35    0.48      0.40     8928
weighted avg          0.50    0.59      0.53     8928

Table 13
Classification reports for age, using the best-performing model (SVM trained on Twitter).

SVM trained on Twitter, predicting on Twitter

                 precision  recall  f1-score  support
20 - 30               0.91    0.99      0.95     5764
older than 30         0.98    0.87      0.92     1384
younger than 20       0.98    0.77      0.86     1780
micro avg             0.93    0.93      0.93     8928
macro avg             0.96    0.88      0.91     8928
weighted avg          0.93    0.93      0.93     8928

SVM trained on Twitter, predicting on Reddit

                 precision  recall  f1-score  support
20 - 30               0.52    0.96      0.68     1104
older than 30         0.73    0.12      0.21     1040
younger than 20       0.00    0.00      0.00       55
micro avg             0.54    0.54      0.54     2199
macro avg             0.42    0.36      0.30     2199
weighted avg          0.61    0.54      0.44     2199


C. Coefficients of the SVM

Table 14
Most influential coefficients (n-gram, weight) for each class of the Reddit-trained SVM.

   | Reddit under 20             | Reddit 20-30                   | Reddit >30
 1 | doctor nurse       0.675971 | easy simple           3.257542 | sorry trying      3.084984
 2 | marketability      0.660743 | injections            2.214002 | going pain        2.286150
 3 | favorite https     0.633053 | cordelia              1.932245 | just saw          1.850574
 4 | soon thing         0.622117 | counting single       1.923312 | stick routine     1.755480
 5 | months live        0.615818 | bowl eat              1.821784 | online thanks     1.722979
 6 | reply day          0.610005 | policy gt             1.735433 | track eating      1.667309
 7 | appreciate words   0.537924 | provence              1.702276 | 520               1.645445
 8 | insurance          0.535737 | said suspect          1.517488 | perfect exactly   1.636734
 9 | nicer place        0.533491 | like plastic          1.440709 | jealous didn      1.451777
10 | plan plan          0.501572 | walkway               1.425718 | amazing tried     1.374537
11 | job aren           0.492992 | make dinner           1.392441 | room kids         1.349704
12 | therapy right      0.492795 | told watch            1.372025 | doorman           1.325109
13 | reservoir          0.490263 | tried little          1.338940 | sick don          1.304082
14 | just submitted     0.484365 | wait 20               1.277630 | meghan            1.302293
15 | cordelia           0.482183 | suffer consequences   1.275206 | just started      1.256477
16 | hands just         0.465961 | water black           1.270268 | vital             1.253584
17 | follow suit        0.457663 | going probably        1.259378 | trans women       1.238346
18 | probably year      0.443900 | art isn               1.252262 | highway traffic   1.203242
19 | freaks             0.439974 | nicer place           1.236824 | didn learn        1.183471
20 | use positive       0.438557 | regulates             1.205978 | fact correct      1.169088


Table 15
The most positive coefficients (n-gram, weight) for each class of the Twitter-trained SVM.

   | Twitter under 20                 | Twitter 20-30                   | Twitter >30
 1 | birthday dj             2.188085 | logeyy_bear            2.427438 | fat motherfucker    2.187499
 2 | nigga good              2.113997 | raised questions       1.693042 | im walkin           1.863517
 3 | grown man               2.049330 | understand support     1.506671 | don depend          1.726872
 4 | weird sorry             1.729439 | hike amp               1.433984 | safe bro            1.637248
 5 | list phone              1.715524 | historical kindle      1.325379 | chief like          1.508339
 6 | worldwide rt            1.709411 | wage labor             1.303460 | goes minutes        1.475641
 7 | offered free            1.667660 | 12 wanna               1.276124 | sugar rush          1.463418
 8 | like bedroom            1.629207 | just dive              1.250283 | rt aaronr_pxw       1.460467
 9 | wire share              1.599877 | today enjoy            1.236441 | repost tarajixone   1.457263
10 | dailylovescopes cancer  1.566035 | rt tittromney          1.230008 | wow miss            1.446954
11 | feel outta              1.530270 | robfee black           1.207991 | faith leads         1.415822
12 | mother bad              1.515561 | malcolm gladwell       1.201477 | white inside        1.377968
13 | rn looks                1.472729 | different interesting  1.201452 | tryna hear          1.370816
14 | really large            1.438865 | rt realjrdonato        1.194704 | wayfair             1.356995
15 | em purpose              1.435678 | dead talking           1.185850 | hahahahahahah rt    1.319665
16 | situation clearly       1.402089 | kemba                  1.164759 | friendly crab       1.294252
17 | rahrahraina             1.397267 | nypost rt              1.152335 | expect buy          1.284988
18 | matter guys             1.380762 | nick bad               1.097149 | attacks white       1.276545
19 | pro equality            1.376564 | tomorrow needed        1.072935 | reflect worship     1.266116
20 | icewear                 1.363392 | tote amp               1.058457 | birthday good       1.263891
