Distant Labelling and Author Profiling on Reddit

Gytha Muller ANR: u408823

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMMUNICATION AND INFORMATION SCIENCES, MASTER TRACK DATA SCIENCE: BUSINESS & GOVERNANCE, AT THE SCHOOL OF HUMANITIES AND DIGITAL SCIENCES OF TILBURG UNIVERSITY

Thesis committee: C.D. Emmery MSc, dr. Maryam Alimardani

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
December 2018

Preface

This thesis forms the final part of my master's programme Data Science: Business and Governance at Tilburg University. I would like to thank my supervisor Chris Emmery for all his feedback. It was reassuring to know that whenever I had questions, whether about coding or theory, I could count on a fast response with specific feedback that enabled me to continue and, hopefully, improve the final product. In a short period of time, I learnt a lot about a field that was almost completely new to me. Thanks to Drew Hendrickson as well. Without his endless patience with questions and troubleshooting coding errors during my research traineeship before and during this thesis, my code undoubtedly would have been a lot worse off. Finally, thank you to my friends and family for always cheering me on and supporting me during my master's degree.

Gytha Muller
December 2018

Distant Labelling and Author Profiling on Reddit

Gytha Muller

This thesis investigates the application of distant labelling and author profiling techniques to Reddit, one of the world's major online communities yet a barely investigated domain in the field of author profiling. Firstly, this thesis proposes and demonstrates a method for reliably creating a large corpus of Reddit users and posts labelled with gender (>40,000 users) and/or age (>18,000 users). The method uses flairs and submission titles to distantly label users, which reduces noise compared to earlier approaches. Currently, no such corpus exists, as earlier approaches on Reddit have sampled from a very limited user subset. Secondly, it is investigated whether current approaches to author profiling for Twitter translate to Reddit, by using common methods such as fastText and a linear kernel SVM with a TF-IDF representation to predict gender and age respectively. Models are trained on both Reddit and Twitter. All models have good in-domain performance in line with existing work: for Reddit, 0.75 accuracy on gender and 0.70 accuracy on age; for Twitter, 0.91 accuracy on gender and 0.93 accuracy on age. Cross-domain performance of the trained models is mediocre, suggesting that while the general approach translates, without preprocessing there is too much domain-specificity to properly generalize. In conclusion, this work contributes to understanding applicable author profiling approaches for Reddit and demonstrates how to automatically build a labelled corpus on Reddit, which should be able to assist future research on this domain.

1. Introduction

The internet has revolutionized communication and plays an increasingly important role in our society. Its effects are vast: virtually all countries have adopted the internet, while internet discussions influence mainstream news and politics. Often, this means that written communication plays a central role in our lives, posing new challenges in interpretation and analysis. Alongside its growth, the role of the internet itself has changed fundamentally. Coined "Web 2.0", the last decade saw a move from static web pages to "the network as platform" (O'Reilly 2005). The web has become cross-platform, constantly updated and encouraging of interaction between users. In 2018, communication scholars and marketing experts have already widely adopted the move to a "web 3.0", where interaction and co-creation are central concepts (Prahalad and Ramaswamy 2004). However, as in the early days of the internet, the internet landscape is still largely dominated by a few titans. Among these is the online community Reddit, the domain this thesis will focus on.


1.1 Reddit

While Reddit may be lesser-known than communities such as Twitter or Facebook, its impact should not be understated. Reddit was founded in 2005 and has been growing continuously since then. It is currently one of the most active communities on the web, with 13 million submissions and 103 million comments in August 2018 alone 1. According to Reddit itself, it currently has over 330 million active users and 14 billion monthly pageviews 2, while Alexa ranks it as the 17th most used website worldwide and 5th in the US 3. Reddit consists of subgroups centred around specific topics called 'subreddits', stylized as /r/subredditname, comparable to Facebook groups or sub-sections on online forums. Subreddits range in size from tens to millions of users. They range vastly in topic as well, from general and broad topics such as politics or food, to fan subreddits for specific media, to very niche interests such as /r/birdswitharms (as the name indicates, a subreddit dedicated to collecting images of birds with arms photoshopped on). Users can register anonymously, as they only need a username and password, and optionally an email address for account retrieval. After registering, they can post submissions to subreddits and comment on both submissions and other users' comments. Reddit's emphasis on interaction and community makes it an excellent fit for the web 3.0 trend. It does, however, also make its users more vulnerable to the growing field of author profiling.

1.2 Author profiling

Author profiling is a growing subdomain of Natural Language Processing 4 (Rangel et al. 2018). Author profiling aims to infer unknown attributes of an author, such as age or gender, through that author's writing (Reddy, Vardhan, and Reddy 2017; Emmery, Chrupała, and Daelemans 2017). As mentioned, web 3.0 and the emphasis on communities mean internet users increasingly share their own writing (whether as comments or posts) online, making users vulnerable to author profiling techniques. At the same time, the rise of (semi-)anonymous networks means a decrease in users actively sharing author attributes such as age or country on e.g. profiles. Twitter is an example of one such prolific network that does not require a legal name, gender or profile picture of oneself, with many users opting for a picture of someone or something else. It should be noted that communities struggle with fake accounts and impersonators, regardless of whether the communities are anonymous such as Twitter 5 or "non-anonymous" such as Facebook 6. Author profiling may have another application as a more refined tool to automatically identify such accounts. There are multiple reasons why knowing user attributes is of interest. Firstly, there are commercial uses such as enabling more targeted marketing (Rangel et al. 2018) or profiling online reviews to find out which demographics (dis)like a product (Argamon et al. 2009).

1 Statistics retrieved from https://pushshift.io/
2 https://www.redditinc.com/
3 https://www.alexa.com/siteinfo/reddit.com; accessed on December 19, 2018
4 Reflected too in the number of participants in the annual PAN Author Profiling task, which will be discussed more extensively later
5 https://www.nytimes.com/2018/07/11/technology/twitter-fake-followers.html
6 https://newsroom.fb.com/news/2018/05/enforcement-numbers/

Secondly, there are security purposes such as knowing author attributes of anonymous threat senders, which can help legal investigations (Reddy, Vardhan, and Reddy 2017; Argamon et al. 2009). For instance, in a related application, researchers tried to predict whether users are Dutch based on English-language comments. Given Dutch users who only comment in English, this would allow the Dutch police to better allocate their resources when prosecuting persons for illegal activities online (van den Boom and Veenman 2018). Finally, there are scientific applications as well. Knowledge of user attributes can improve some text classification tasks (Hovy 2015). Additionally, if certain characteristics such as gender or age category can be reliably inferred, researchers can use the predictions of these characteristics as variables in content analysis or discourse studies of anonymous communities. Section 2 of this thesis will show that Reddit is a popular domain for research other than author profiling as well. Author profiling research originally started on formal texts, such as articles, and quickly expanded to informal texts, such as blog posts. In the last few years, author profiling work has extensively covered Twitter and Facebook, reaching high accuracies on predicting gender and age category. For gender, the most common approach is a fastText model (a linear classifier over a bag-of-words representation), which generally reaches an accuracy of 85% - 95% on various social media and blog posts. The most common approach for age is different: researchers generally treat age as a multiclass problem and use support vector machines on various kinds of text representations; accuracy varies more widely, from around 50% to 80%. While Reddit is a heavily used social network, it has gone largely unnoticed in the author profiling field. A few works have looked at profiling personality type (Gjurković and Šnajder 2018) and political preference (van Duijnhoven 2018), but only two works have looked at predicting age and gender (Vasilev 2018; Amezoug 2018). One possible explanation for this is the lack of a suitable corpus to train on.

1.3 Distant labelling

A corpus suitable for author profiling needs reliable labels and a large enough quantity of posts to properly train machine learning algorithms. For Twitter and the PAN tasks, hand-labelled corpora are available; these are generally very reliable but cost a lot of time and resources to create. For Reddit - and some other domains - one has to look at generating a corpus automatically. This can be achieved through distant labelling, a type of distant supervision where queries are used to 'distantly label' users. Examples are retrieving all users who tweet "I'm a woman/man" and using this to label users as female or male respectively (Emmery, Chrupała, and Daelemans 2017), or setting up a query to retrieve anyone who comments "I'm a Democrat/Republican" and using this to label them with a political preference (van Duijnhoven 2018). While setting up queries on tweets provides acceptable performance, Amezoug (2018) indicates that querying for comment text on Reddit includes a substantial amount of noise such as sarcasm or quotes, which lowers the reliability of labels. The use of flairs and submission titles on Reddit for distant labelling does not have these issues, but the few attempts at this sample from a very limited set of subreddits (Gjurković and Šnajder 2018; Amezoug 2018; Vasilev 2018), which can result in a fairly small corpus and is likely to introduce bias into the dataset.
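To make the comment-based style of distant labelling concrete, the following is a minimal sketch; the pattern and function names are illustrative and not taken from any of the cited works. The second call also shows how dialogue slips through such a query, which is exactly the noise problem described above.

```python
import re

# Illustrative comment-based distant labelling: a self-identifying phrase
# assigns a gender label to the comment's author.
GENDER_PATTERN = re.compile(r"\bI'?m a (woman|man)\b", re.IGNORECASE)

def distant_label(comment_text):
    """Return 'female'/'male' if the text self-labels, else None."""
    match = GENDER_PATTERN.search(comment_text)
    if match is None:
        return None
    return "female" if match.group(1).lower() == "woman" else "male"

print(distant_label("I'm a woman who loves hiking"))   # female
print(distant_label("You assume I'm a man, but..."))   # male: a mislabel caused by dialogue
```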


1.4 Research questions

This thesis aims to fill two gaps in the research discussed so far. The first gap relates to data collection. As mentioned, limitations of the few existing studies on Reddit are sampling from a very limited number of subreddits and noise introduced by labelling via comments. Current works on distant labelling note that refinement of the simple queries used for distant labelling should lead to a more reliable corpus and therefore better model performance. The first research question aims to address these shortcomings:

RQ1: How does combining and refining existing distant labelling techniques affect model performance?

This thesis will sample users from ten different subreddits for distant labelling of gender and five different subreddits for distant labelling of age. Care was taken to include more varied topics, such as subreddits about powerlifting, asking advice, losing weight and finding new friends. Secondly, rather than querying for text in comments, distant labelling will be done through flairs and submission titles. These are less ambiguous than comment queries because there is no risk of the label being a quote or negation of some kind. The same simple fastText model that was used for Reddit in Amezoug (2018) will be trained on this new corpus to determine if performance improves when building a larger corpus sampled from more subreddits.

The second gap relates to analysis: author profiling for Reddit on gender and age is nearly non-existent, which this thesis aims to address. The few author profiling studies conducted on Reddit data have mostly used simple fastText models. Meanwhile, many different approaches have been tried to predict gender and age for social media posts from another domain, Twitter (Rangel et al. 2018).

RQ2: To what extent are prolific Twitter-based classification methods suitable for inferring age and gender from Reddit comments?

This research question will be answered by testing the most common approaches from Twitter - fastText for gender, an SVM for age - on Reddit. The same models will be trained and evaluated on the Reddit data and a Twitter corpus. This should give some insight into how different approaches translate to Reddit. After training, the models which performed best will be tested cross-domain as well (Reddit to Twitter, Twitter to Reddit) to identify any domain-specificity.

2. Related Work

Author profiling is the inference of user attributes from their written text (Reddy, Vardhan, and Reddy 2017). Researchers have attempted to predict a wide variety of attributes, such as age (Rangel Pardo et al. 2015), gender (Emmery, Chrupała, and Daelemans 2017; Rangel et al. 2018), nationality (Argamon et al. 2009; van den Boom and Veenman 2018; Estival et al. 2007) and mental health status (Conway and O'Connor 2016). For the sake of structure, this section will first discuss inferring gender and age, then continue with the profiling of other author attributes. The PAN convention is so prolific in the author profiling field that it will serve as the central theme through this section. PAN describes itself as "a series of scientific events and shared tasks on digital text forensics and stylometry" 7.

7 https://pan.webis.de/index.html

One of these shared tasks is the author profiling task, hosted annually since 2013 8. The overall format remains the same across the years: researchers are given a collection of documents, such as blog posts, tweets or Facebook posts, and need to predict certain author attributes. The precise focus of the author profiling task(s) differs each year.

2.1 Prediction of gender

The groundwork for predicting gender as an author attribute was laid well over a decade before PAN started its contests. Koppel, Argamon, and Shimoni (2002) reached an accuracy of 0.80 when predicting gender for formal texts. This was followed by multiple endeavours to predict the gender of blog authors, obtaining accuracies such as 80.1% (Schler et al. 2006) and 76.1% (Argamon et al. 2009). Gender is easily the most prolific dependent variable in the PAN tasks: "determine an author's gender" has been a task every year from 2013 to 2019. The annual recurrence of the PAN task clearly shows the progress that this field, or at least the contest participants, have made in a short timeframe. The winner of the 2013 author profiling task on blog posts reached a mere 0.59 accuracy for gender prediction in English, barely above the random baseline of 0.50 (Rangel et al. 2013). Yet in later years this accuracy was far surpassed: the 2018 winner had an accuracy of 0.82 using only text (Rangel et al. 2018). The rest of the field has caught up as well, with contemporary accuracies for gender prediction reaching 88% for Twitter and 91.9% for Facebook (Schwartz et al. 2013). It should be noted that all these tasks and many papers treat gender as a binary problem, so far discounting the idea of more than two genders or gender expressions.

There are slight differences in the preprocessing and models used for the most successful approaches. Some preprocessing methods are more universal than others. Of the 23 participants in the PAN2018 task, eleven teams removed Twitter-specific features such as hashtags and user mentions (Rangel et al. 2018). In terms of frequency, this approach is followed by lowercasing the words, used by eight teams. Rarer preprocessing steps are the removal of stopwords, removing character flooding and restoring contractions and abbreviations (Rangel et al. 2018). Many authors do not preprocess the text at all, which appears to be consistent with the wider field. Most researchers barely preprocess, choosing only to tokenize the data (Emmery, Chrupała, and Daelemans 2017; Sap et al. 2014; van den Boom and Veenman 2018), remove domain-specific features such as @username mentions (van Duijnhoven 2018), or remove special characters and tokenize and lemmatize the text (Amezoug 2018). This simple but effective approach to preprocessing continues into the modelling phase. Generally, gender is predicted with simple models. While complex machine learning models and neural networks are the most successful approach in many tasks, for author profiling simple models appear to perform as well if not better. One of the most popular models is fastText, a linear model over (sub)word embeddings (Joulin et al. 2016). Alternative approaches include an SVM trained on a combination of character and TF-IDF word n-grams (Basile et al. 2017) and an SVM trained on n-grams (Schwartz et al. 2013; Sap et al. 2014). Approaches to predicting author attributes from Reddit comments include a fastText model to predict gender and age (Amezoug 2018), and a character-level convolutional neural network to determine gender (Vasilev 2018). Additionally, fastText has been used to predict political preference on Reddit (van Duijnhoven 2018).

8 https://pan.webis.de/tasks.html


2.2 Prediction of age

Age is almost as prolific a dependent variable in author profiling as gender. In PAN, age prediction was a recurring task too, having been set from 2013 to 2016 9. Each of these tasks turned age into a categorical rather than a continuous predictor, an approach echoed by researchers independent of PAN (Argamon et al. 2009; Amezoug 2018). In contrast, Sap et al. (2014) approached age as a continuous variable for which they tried to predict the exact age, while sometimes age is divided merely into an over-18 and under-18 category, for instance when trying to detect criminal behaviour (van de Loo, De Pauw, and Daelemans 2016). Regardless of approach, age appears to be harder to predict than gender, not least because it often has more categories. The 2015 winner of the PAN author profiling task reached an accuracy of 0.84 for four age categories, the 2016 winner an accuracy of 0.53 for five age categories (Rangel Pardo et al. 2015; Rangel et al. 2016).

The popular methods for age prediction differ slightly from those for predicting gender. Most participants in the PAN2016 task used a support vector machine to predict age (Rangel et al. 2016). For English emails, a Support Vector Machine provided the best performance with 56.46% accuracy for three age categories, compared to a baseline of 39.43%; in Arabic with the same age groups, the accuracy was better, reaching 72.10% (Estival et al. 2008). Argamon et al. (2009) represent blog posts as numerical vectors with feature weights and predict three age categories with Bayesian Multinomial Regression, a variant of multivariate logistic regression. They report an accuracy of 76.1% using this approach. In a similar approach using 10s, 20s and 30s as age categories, Schler et al. (2006) reached an overall accuracy of 76.2%, noting that while 10s and 30s are easily distinguishable, many users in the 20s and 30s categories get misclassified. Schwartz et al. (2013) and Sap et al. (2014) opted for linear ridge regression to predict age as a continuous variable. Both papers reported a Pearson correlation coefficient of r = 0.84 between actual and predicted age.

2.3 Other author attributes

While age and gender are by far the most commonly predicted attributes in author profiling, researchers have explored other attributes too, and it is worthwhile to include them. To start with PAN, its 2015 author profiling task included personality prediction, where personality was operationalized as a series of traits scored between -0.5 and 0.5. Researchers tried to predict the scores for each author, which led to low accuracies of approximately 0.10 - 0.20 per trait (Rangel Pardo et al. 2015). The PAN organizers are not alone in showing interest in personality types. Gjurković and Šnajder (2018) determined MBTI personality types on Reddit by using user flairs as ground truth; because of this setup, their approach treats personality prediction as a categorical problem, rather than the continuous problem of the PAN approach. With this approach, performance increases to match that of standardized tests, with an accuracy of 41% for an exact match and 82% for an exact or one-off match (Gjurković and Šnajder 2018). So far, Reddit has only been mentioned briefly in this section, because author profiling research on Reddit is very limited. There are no peer-reviewed publications that focus on predicting age and gender for Reddit, though some other author attributes have been investigated.

9 https://pan.webis.de/tasks.html

As mentioned before, self-labelling through flairs was used to predict users' MBTI types, yielding an accuracy comparable to that of popular commercial models (Gjurković and Šnajder 2018). In another example, English-language Reddit comments were used to predict whether a user was a native English or Dutch speaker (van den Boom and Veenman 2018). The results are sufficiently accurate for the authors to suggest repeating their approach on different forums to determine if accuracy remains consistently high.

2.4 Language

Traditionally, both the PAN profiling task and the wider author profiling field have focused on a select few languages. English has been a document language in all PAN author profiling tasks, although the three most recent years also included Spanish, Portuguese, Arabic 10 and Dutch. All but Dutch were annotated with the region of the author, e.g. "Australia" or "Canada" for English. In 2017, the winning team was able to determine language variety with an average accuracy of 0.92 (on the secondary predictor, gender, they achieved an accuracy of 0.83). Despite these brief forays into different languages, it has been noted that the lack of inclusion of other languages is a shortcoming in the author profiling field in general (Fatima et al. 2017; Verhoeven 2018). Attempts at author profiling in non-Western languages are relatively few and varied. They include predicting age based on Chinese microblogs, predicting various attributes based on Vietnamese forum posts and predicting age and gender based on Roman Urdu Facebook posts (Rangel et al. 2017). In Arabic, Estival et al. (2008) found an 81.15% accuracy for emails (compared to 69.26% for English), and an 86.4% accuracy was found for news articles (Alsmearat et al. 2015).

10 https://pan.webis.de/clef17/pan17-web/author-profiling.html

2.5 Text representation and relevant features

It has been established earlier that many researchers do not or barely preprocess text, but many algorithms need text to be represented as a numerical vector in order to process it. One of the standard ways of representing a text is called a 'bag of words', which represents text as a numerical vector that contains each word and a value indicating how often that word occurs in the text, discarding word order (Zhang, Jin, and Zhou 2010). The fastText model, too, first represents the text as a bag of words before applying the classifier (Joulin et al. 2016). Optionally, a TF-IDF weighting can be applied over the vector to assign more importance to uncommon words, though it is unable to find relations between words (Ramos et al. 2003). With or without weighting, the main disadvantages of a bag-of-words representation are caused by the very large number of potential features, such as the risk of running out of memory (van den Boom and Veenman 2018). Another disadvantage is a tendency to overfit, as the model includes even the least relevant features such as typos, though proper regularization should help counter this (van den Boom and Veenman 2018). One simple way of feature reduction is keeping only the top n n-grams that provide the most information gain (Argamon et al. 2009; Schler et al. 2006). Another way of feature reduction is through Latent Semantic Analysis, which keeps the contextual relation between words (Landauer, Foltz, and Laham 1998).

Traditionally, one word is the base unit for all the methods described in the previous paragraph; e.g., a 1-gram is one word. Recent papers have investigated the use of character-level n-grams, concluding they might be an effective alternative to word-level n-grams (Kešelj et al. 2003; Zhang, Zhao, and LeCun 2015; Bojanowski et al. 2016). One distinct advantage is that character-level n-grams can 'learn' variations of words not in the training set, such as misspelled words, by comparing the known n-gram configurations to new configurations (Bojanowski et al. 2016).

Returning to the text itself, Campbell and Pennebaker (2003) distinguish two types of textual features for author profiling: content-based, which comprises what people choose to write about, and style-based, or how people choose to write about the topic. On the content-based features, there are conflicting results. Argamon et al. (2009) report blog posts about technology as a discriminant feature for men and writing about personal life or relationships as a discriminant feature for women, yet Santosh et al. (2013) found no difference in blog topics between men and women. Argamon et al. (2009) do note that content features are dependent on the underlying collection of blogs. For age, Argamon et al. (2009) report a difference in topics for teens, twenties and bloggers in their thirties, with the groups writing about school and mood, work and social life, and family life, respectively.

Use of language, or style-based features, does seem to differ by gender. One of the most-used tools in author profiling, the Linguistic Inquiry Word Count, is based on findings such as women using more first-person identifiers and men using more determiners (Newman et al. 2008). Similarly, female authors use fewer prepositions in blog posts than male authors, but use more self-conscious language (e.g. "I think...") (Newman et al. 2008) and more pronouns (Argamon et al. 2009; Schler et al. 2006). Arguably, emojis are part of modern language too, and emoji use can therefore be considered a stylistic feature. Lu et al. (2016) analysed a large dataset of mobile communication and found that men tend to use more emojis than women. Although the top five emojis are the same for both genders, the proportions differ, with women using a significantly higher proportion of face-related emojis than men (Lu et al. 2016). However, emoji use and interpretation are also influenced by platform (Tauch and Kanjo 2016; Miller et al. 2016). It should be noted that these papers mainly looked at blogs or mobile communication, and it is unknown if and to what extent these effects transfer to social media platforms like Reddit, and whether these effects change(d) over time as, for instance, new emojis get introduced.

For age, less information on relevant stylistic features is available. Usage of determiners and prepositions grows with age (Argamon et al. 2009; Schler et al. 2006). Use of contractions without apostrophes ("dont") and word-lengthening are stylistic features of younger authors (Brody and Diakopoulos 2011; Argamon et al. 2009).
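As a toy illustration of the representations discussed in this section (word-level bag of words, TF-IDF weighting and character-level n-grams), consider the following scikit-learn sketch; the example documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I think the cake was great", "the cake was grreat"]

bow = CountVectorizer().fit(docs)                # word-level bag of words
print(bow.get_feature_names_out())               # vocabulary; word order is discarded

tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF weighting over the same counts
print(tfidf.shape)

# Character n-grams still produce overlapping features for the misspelled
# "grreat", which is how sub-word models generalize to unseen word variants.
chars = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(docs)
print(len(chars.get_feature_names_out()))
```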

2.6 Distant labelling

All of the author profiling research discussed above relies on the availability of a sufficiently large labelled corpus. In the case of the discussed PAN tasks, all corpora are labelled by hand. However, with the amount of data needed to properly train a model, hand-labelled corpora can be difficult or even impossible to obtain (Ratner et al. 2016). Automation of this process seems a logical solution, yet turning data collection and processing into a fully unsupervised labelling process can cause problems with validity and can make it difficult to 'map' the labels to specific factors (Mintz et al. 2009). Ratner et al. (2016) instead proposed what they refer to as 'data programming': a relatively new set of techniques and heuristics whose goal is to create a large and reliable labelled corpus, filling the gap between manual annotation and completely unsupervised labelling.


For consistency, in this thesis any form of automated labelling is referred to as "distant labelling". Sometimes users will label themselves, for instance through flairs or phrases such as "I'm a woman who...". To distinguish between these two related concepts, the latter is referred to as self-labelling. In other words, self-labelling can be used to distantly label authors. Distant labelling has been used to label Twitter sentiments (Go, Bhayani, and Huang 2009), to label the gender of Twitter users (Emmery, Chrupała, and Daelemans 2017), to label political preference on Reddit (van Duijnhoven 2018) and to label personality types on Reddit (Gjurković and Šnajder 2018). The approach to labelling varies slightly: researchers either searched for self-identifying comment text via queries such as "I'm a woman" (Emmery, Chrupała, and Daelemans 2017) or "I'm a democrat" (van Duijnhoven 2018), used emojis for sentiment labelling (Go, Bhayani, and Huang 2009), or searched for self-labelling through flairs (Gjurković and Šnajder 2018).

A major limitation of distant labelling through comments on Reddit is the introduction of noise. Examples are users citing someone else and the quote being mistaken for self-labelling, and sarcasm or dialogue ("You assume I'm a democrat but...") leading to mistaken labels (van Duijnhoven 2018). To circumvent this, some researchers use flairs to distantly label users, which do not suffer from the same issues. However, research so far has sampled from a very limited number of subreddits, possibly introducing bias. Gjurković and Šnajder (2018) only sampled from MBTI discussion subreddits; Amezoug (2018) only sampled from /r/AskMen and /r/AskWomen; Vasilev (2018) sampled from AskMen, AskWomen, tall, relationships and relationship_advice. Vasilev (2018) noted that users only list their age on the relationships and relationship_advice subreddits, so the other subreddits were discarded for age labelling; furthermore, 75% of all labels were assigned through the relationship_advice subreddit.

2.7 Other research on Reddit

While author profiling on age and gender from Reddit comments is extremely limited, Reddit has been the subject of various other research. Therefore, building a working automated gender and age classifier for Reddit users may help researchers by allowing them to incorporate these dimensions as additional variables. For instance, analyses of Reddit's mental health communities suggest it might fill a unique niche where users can discuss stigmatized mental illnesses, with the protection through anonymity playing an important role in the amount of self-disclosure and the quality of advice given by commenters (De Choudhury and Sushovan 2014). Pappa et al. (2017) analysed LoseIt (a weight loss subreddit). Using a subset of users whose gender, age and weight loss over time were available through self-labelling, they found that high online activity, high engagement, greater BMI and certain topics (e.g. self-esteem, working out) were associated with greater weight loss. The knowledge gained from this may help with devising more effective weight loss strategies (Pappa et al. 2017).

2.8 Conclusion

In this section it has been established that author profiling is a growing field within NLP, of interest because of its applications in, among others, marketing and security. We have identified some gaps in existing research. One which will be addressed by this work is that author profiling has focused on Facebook, Twitter and personal blogs, largely ignoring Reddit despite it being a prolific social media platform. Little is known about predicting age and gender, the two most common author attributes in these tasks, for Reddit. The few works which attempt to do so use a relatively small corpus sampled from a very limited set of subreddits.

3. Experimental method

This work aims to address the shortcomings identified in Section 2. The shortcomings are addressed by using distant labelling techniques in conjunction with Reddit's built-in features, such as flairs, to quickly generate a large, labelled corpus suitable for author profiling on this domain. After generating the corpus, different models to predict age and gender are trained and tested both in-domain and cross-domain. The performance of these models is compared to earlier work on Reddit and Twitter. Lastly, post-hoc analysis is performed on both models to gain more insight into the classifiers themselves.

3.1 Data collection and description

Reddit itself limits the number of requests made through its API, so as an alternative, data was retrieved from Pushshift.io 11, a data science project hosting a complete clone of all Reddit comments and submissions, including their metadata. These files can either be queried through the Pushshift API or downloaded in 'raw dump' form, allowing download of all comments or submissions per month. The Pushshift API limits the number of objects returned, therefore comment and submission dumps were downloaded in November 2018 to keep the data freely queryable. The data was stored on a server in JSON format and accessed with MongoDB queries, using PyMongo. The data is split into two collections. The first collection contains all comments in all subreddits of Reddit during the period June until November 2018, totalling approximately 520 million comments and 460 GB of data. The other collection contains submissions to the subreddits ProgressPics and r4r made in November 2018, amounting to 14,947 submissions and 115 MB of data. Flairs and submission titles were used to label users.
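A minimal sketch of this setup is shown below; the database and collection names are illustrative, not taken from the actual server.

```python
from pymongo import MongoClient

# Connect to the server holding the Pushshift dumps (names are illustrative).
client = MongoClient("mongodb://localhost:27017")
db = client["reddit"]

comments = db["comments"]        # all comments, June-November 2018
submissions = db["submissions"]  # ProgressPics and r4r submissions, November 2018

# Example query: count the stored comments for a single subreddit.
print(comments.count_documents({"subreddit": "AskWomen"}))
```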

3.1.1 Subreddits, flairs and features. One of Reddit's characteristic features is 'flairs', which show up as a little tag next to the name of the user. Flairs are subreddit-specific: it is up to the subreddit whether it allows flairs and how they are implemented. Flairs can be assigned by moderators or set by users themselves, and can either be an option from a given list (e.g. "M" or "F") or 'freeform'. A flair is individual to a subreddit, meaning users can set a different flair for each subreddit they are active in. For example, users posting in /r/askmen or /r/askwomen can opt to choose a male or female symbol as flair. An example of a 'freeform' flair following a specific format is "M | 542.5 kg | 93 kg | 343 wlks | USAPL | RAW" in /r/powerlifting (a type of weight training), which communicates that user's gender, his top competition result and the specific powerlifting ruleset this result was achieved under. The choice of subreddits included in the analysis is based on subreddits already incorporated in existing work and an exploration of the Reddit website. The included subreddits expand on previous works, which often only included a subset of this selection, such as only the AskMen/AskWomen subreddits (Amezoug 2018; Vasilev 2018).

11 http://pushshift.io


Including subreddits such as powerlifting and progresspics is an attempt to diversify the users selected for analysis. Since the data collections include metadata, the data contains many fields irrelevant to the experiment, such as comment score, controversiality score and retrieval date. Only the fields relevant to this experiment are briefly described here. 'Author' contains the username, which on Reddit is unique (two users cannot share the exact same username) and permanent (it cannot be changed), allowing the username to serve as a unique identifier. The 'Body' field contains the full body text of a comment without any preprocessing, so including special characters and emojis. Lastly, 'subreddit' shows which subreddit the comment was placed in. As with usernames, subreddit names are unique and case-sensitive. The remaining fields are not used for querying and are discarded.

3.2 Preprocessing

The preprocessing of users can roughly be split into two phases. In the first phase, any user with a relevant flair or submission title is retrieved through MongoDB queries; this is described more extensively in Section 3.2.1. In the second phase, the saved usernames were processed to add gender and age labels, and the collection of users was split into gender and age subsets. After the preprocessing pipeline, all comments by a total of 40,806 users labelled with gender and 18,588 users labelled with an age category were retrieved.

3.2.1 User selection. All users were collected through MongoDB queries aimed at flairs or submission titles. Because of the unique rules and formatting of each subreddit, querying was done on a per-subreddit basis. For the subreddits gainit, tall, askmen and askwomen, the MongoDB query returned all users who had chosen either "male" or "female" as their flair. In the case of AskMenOver30 and AskWomenOver30, flairs include both gender and age category in the same field, so the query returned all users who had set a flair. For the powerlifting subreddit, flairs are set by moderators and always include gender, hence the query returned all users with flairs. For the LoseIt and keto subreddits, users create their own flairs, leading to a wide variety of formats: age might be referred to as "23M", "23/M" or "M|23". Returning all flairs allowed further inspection, which was used to fine-tune the method for extracting age and gender labels. The analysis included two additional subreddits where users were not selected by flair but by the titles of submissions, because these subreddits encourage or require users to put age and/or gender in their submission titles. In the r4r subreddit, users are required to put their age and gender at the start of a title (e.g. "60 [M4F] Looking for music-loving dates"). This appears to be strictly enforced, so a query returned all submissions made to this subreddit on the assumption that all would include age and gender in the required format. In the subreddit progresspics not all users include their age and gender in the title, since this is not required. Therefore, the MongoDB query used regular expressions to filter for submissions whose title contained age-gender pairs such as "23F"; submissions not containing such pairs were discarded (see the query sketch below). Authors can self-label only gender, only age or both. If only authors of whom both attributes are known were kept, much of the data would have to be discarded. Therefore, the data was split into two subsets: a dataset with gender-labelled users and a dataset with age-labelled users.
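The sketch below approximates the progresspics title query; the exact regular expression used in this thesis is more elaborate, and the collection name is illustrative.

```python
import re
from pymongo import MongoClient

# Approximate filter for titles containing an age-gender pair such as "23F".
pair = r"\b\d{2}\s*/?\s*[MF]\b|\b[MF]\s*/?\s*\d{2}\b"

client = MongoClient()
submissions = client["reddit"]["submissions"]

labelled = submissions.find({
    "subreddit": "progresspics",
    "title": {"$regex": pair, "$options": "i"},   # case-insensitive match
})
for sub in labelled:
    print(sub["author"], sub["title"])            # submissions without a pair never match
```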


Table 1
Overview of user and label counts in the gender subset, specified per subreddit

Subreddit          Male      Female
AskWomen           2,200     5,193
AskMen             7,933     1,707
AskWomenOver30     309       1,713
AskMenOver30       3,759     541
Tall               2,513     412
LoseIt             1,799     3,134
Keto               3,712     1,799
GainIt             406       53
Powerlifting       261       51
r4r                1,180     337
progresspics       458       494
Total              24,530    16,276

There is an overlap in users between these two datasets, but this is not expected to negatively influence modelling, because age and gender modelling are performed separately.

3.2.2 User preprocessing: creating a gender subset. The gender label was created with simple if-then rules for subreddits with predetermined flair categories and regular expressions for freeform flairs. The accuracy of the recoding was checked by creating crosstabs of the original flair and the gender label, ensuring that all flairs were accurately converted. Only users who selected either "female" or "male" as gender were included, meaning that e.g. transgender or non-binary users were discarded. To give a sense of scale: on the AskMenOver30 subreddit, 13 out of 4,403 users (0.29%) indicated they are trans; on AskWomenOver30, 8 out of 2,073 users (0.39%) did so. Transgender users were excluded for practical reasons (the subset would be too small for proper modelling) and ethical reasons (detailed in Section 5.3). Table 1 contains a breakdown of the users in the gender dataset whose comments were retrieved. Overall, the class distribution is fairly balanced, with 60.11% of users being male, although there is large variability in gender distribution between subreddits.
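A condensed sketch of this recoding and its crosstab check, combining the if-then route for fixed flair options with a regex route for freeform flairs (the flair values and helper names are illustrative):

```python
import re
import pandas as pd

FIXED = {"male": "male", "female": "female"}          # predetermined flair options
FREEFORM = re.compile(r"\b([MF])\b", re.IGNORECASE)   # freeform flairs such as "M | 23"

def gender_from_flair(flair):
    flair = flair.strip()
    if flair.lower() in FIXED:                        # if-then rule for fixed categories
        return FIXED[flair.lower()]
    match = FREEFORM.search(flair)                    # regex for freeform flairs
    if match:
        return "male" if match.group(1).upper() == "M" else "female"
    return None                                       # flair carries no gender label

df = pd.DataFrame({"flair": ["male", "F | 23", "M|50", "keto 4 life"]})
df["gender"] = df["flair"].map(gender_from_flair)
print(pd.crosstab(df["flair"], df["gender"]))         # manual check of the recoding
```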

3.2.3 User preprocessing: creating an age subset. Age required more preprocessing than gender. For AskMenOver30 and AskWomenOver30, recoding age was relatively straightforward, as users can only select their age from predetermined categories. The first two digits in these flairs were simply copied to the age column. Many categories span approximately 4-5 years, such as "26 - 29 years". In these cases, the first age number was considered the user's age, because each such age span fits wholly into one category (i.e., the categorical representation of age lines up with the age categories offered as flair options). Extracting age for the r4r subreddit was similarly straightforward due to the strict rules of the subreddit, which ensure a user's age always appears at the start of the submission title. However, extracting age for keto, LoseIt and progresspics proved more complicated, as it is not possible to simply extract pairs of two digits: '70', for instance, may refer to either weight or age.


Table 2
Overview of user and label counts in the age subset, specified per subreddit

Subreddit          <20     20 - 30    >30
AskMenOver30       0       1,665      2,449
AskWomenOver30     0       888        973
Keto               0       2,702      2,971
LoseIt             221     2,817      1,296
Progresspics       73      656        217
r4r                123     1,103      331
Total              520     9,831      8,237

First, flairs and submission titles were stripped of any special characters used by submitters to separate age and gender, e.g. in the formats "23/F" or "M|50". After this, the regular expressions set up for these subreddits looked for pairs such as "70M"; any pair of two digits in such a match was then considered to be the user's age. Once age is extracted, the approach of Estival et al. (2008), Argamon et al. (2009) and Schler et al. (2006) to treat age as a categorical problem is followed. Age is split into three categories in line with pre-existing age categories on the sampled subreddits: under 20, 20 to 30 years old, and older than 30. Crosstabs were used to check for outliers and proper recoding. Based on this, a total of seven users who selected the category "over 100 years of age" were dropped due to the low likelihood of this being true. For keto, progresspics and LoseIt, users whose age was <15 or >80 were dropped: after a manual check against the original flair or title, it was discovered these were coding errors where weight, weight loss or start date had been mistaken for age. Table 2 shows the final set of age-labelled users.
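The following sketch condenses this pipeline: strip separators, extract an age-gender pair, apply the plausibility bounds and bin into the three categories. The regular expression is an approximation of the one actually used, and the boundary handling (age 30 assigned to the 20 - 30 class) is an assumption.

```python
import re

PAIR = re.compile(r"\b(\d{2})\s*([MF])\b|\b([MF])\s*(\d{2})\b", re.IGNORECASE)

def age_category(flair_or_title):
    text = re.sub(r"[|/\\]", " ", flair_or_title)   # strip separators, e.g. "23/F" -> "23 F"
    match = PAIR.search(text)
    if match is None:
        return None
    age = int(match.group(1) or match.group(4))
    if age < 15 or age > 80:                        # likely weight or start date, not age
        return None
    if age < 20:
        return "<20"
    return "20 - 30" if age <= 30 else ">30"

print(age_category("M|50"), age_category("23/F"), age_category("SW: 120 kg"))
# '>30' '20 - 30' None
```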

3.3 Comment retrieval and preprocessing

Comment retrieval was performed separately for age and gender to create two datasets, but the process is the same. Retrieval was performed through a MongoDB query that iterated over a list of labelled users and retrieved each user's comments. One JSON data file was created containing the fields author, label (gender label or age category label, depending on the subset) and a list with all of that user's comments. In earlier author profiling work on Reddit, Gjurković and Šnajder (2018) only kept users who had at least one comment, while Amezoug (2018) kept users who had made a minimum of five comments. Due to the large number of users collected, the threshold for inclusion in the dataset was set at a minimum of five comments per user, as sketched below. Table 3 shows basic descriptives of the resulting comments file. On average, men have more posts than women, but use fewer words and characters in their posts. Because men have more posts than women, they make up 70.65% of all posts in the dataset, which is more skewed than the labelling distribution (60.11% men) suggests. For age, post frequency, words per post and characters per post all increase with age. In terms of total posts, the 20 - 30 class and >30 class are fairly balanced, but the <20 class is underrepresented, making up only 2.2% of all posts in the dataset.
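A sketch of the retrieval step (collection and file names are illustrative):

```python
import json
from pymongo import MongoClient

comments = MongoClient()["reddit"]["comments"]

def build_dataset(labelled_users, out_path, min_comments=5):
    """Write one JSON record per user: author, label and all their comments."""
    with open(out_path, "w") as f:
        for username, label in labelled_users.items():
            posts = [c["body"] for c in comments.find({"author": username})]
            if len(posts) < min_comments:       # drop users with fewer than five comments
                continue
            record = {"author": username, "label": label, "comments": posts}
            f.write(json.dumps(record) + "\n")

# build_dataset({"some_user": "female"}, "gender_dataset.json")
```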


Table 3
Overview of descriptives for posts in the Reddit dataset

Label     Posts per user    Words per post    Characters per post    Total posts
Male      339 (SD = 638)    34 (SD = 22)      187 (SD = 127)         7,558,557
Female    217 (SD = 446)    44 (SD = 26)      236 (SD = 145)         3,140,454
<20       186 (SD = 373)    31 (SD = 22)      167 (SD = 119)         96,316
20 - 30   217 (SD = 437)    39 (SD = 25)      212 (SD = 139)         2,024,160
>30       286 (SD = 551)    41 (SD = 26)      224 (SD = 141)         2,263,909

However, for all these metrics, the large standard deviations cause an overlap in confidence intervals between groups, meaning definitive conclusions cannot be drawn from these statistics alone. The distributions, especially for the number of posts per user, show strong positive skewness and heavy kurtosis, with a long right tail, which may partially explain the high standard deviations.

3.4 Modeling

For both age and gender, upon loading, the dataset is split into an 80% training set and a 20% held-out test set, stratified on the outcome variable. The data is split further during parameter tuning, with 20% of the training set used as a validation set during grid search to find the ideal hyperparameters. FastText 12 serves as the baseline model for both attributes. It is a text classification model which contains a hidden layer with a normalized bag-of-words representation of the text, to which a linear classifier is applied (Joulin et al. 2016). The complete hyperparameters of the baseline model are listed in Table 4; it is set to include n-grams from one to three words but to skip learning sub-word representations (character-level n-grams). The learning rate is set to 0.01 and the loss function to softmax. FastText uses raw text as input, so no further preprocessing of the data is done. Before tuning other fastText hyperparameters, the size of comment batches when reading in the data was investigated. The default size is 100, but reducing the comment batch size to 50 maintains accuracy (with default parameters) while adding 6,050 training instances. Therefore, in subsequent parameter tuning, the comment batch size is kept at 50. Learning rate, dimension and number of epochs are tuned using grid search with two-fold cross-validation on the training dataset; two-fold cross-validation was chosen to keep runtime manageable. Table 4 provides an overview of the tuned hyperparameters plus the values tested in the grid search. Due to the poor performance of the fastText model for age prediction, no further fastText parameter tuning was conducted for age. Instead, additional models were explored.
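A sketch of the baseline configuration using the fastText Python bindings is shown below; train.txt is assumed to hold one "__label__<class> <text>" line per training instance, and parameters not listed fall back to library defaults.

```python
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.01,           # baseline learning rate
    dim=100,           # vector size
    epoch=30,
    wordNgrams=3,      # word n-grams of length one to three
    minn=0, maxn=0,    # no character-level (sub-word) n-grams
    bucket=1000,
    loss="softmax",
)

n, precision, recall = model.test("test.txt")
print(precision)       # with one label per document this equals accuracy
```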

3.4.1 Additional modelling for age. Based on the existing approaches to age classification, the choice was made to implement a Support Vector Machine (Chang and Lin 2011; Cortes and Vapnik 1995; Boser, Guyon, and Vapnik 1992).

12 https://fasttext.cc/


Table 4
fastText model parameters, listing the baseline, any tuning values and the parameters of the best performing model.

Parameter                            Baseline   Tuning values                             Best value
Learning rate                        0.01       1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001   0.1
Learning rate update                 100        -                                         100
Vector size (dim)                    100        30, 50, 100, 200, 300                     100
Size of context window (ws)          3          -                                         3
Epochs                               30         5, 10, 15, 20                             30
Minimal number of word occurrences   0          -                                         0
Negatives sampled                    10         -                                         10
N-gram range                         (1,3)      -                                         (1,3)
Loss function                        Softmax    -                                         Softmax
bucket                               1000       -                                         1000
MinN, MaxN                           0          -                                         0
t                                    0.0001     -                                         0.0001

The SVM uses a linear kernel (Fan et al. 2008) and treats multi-class classification as a 'one versus all' problem 13. Contrary to the fastText model, an SVM does require preprocessing. To this end, the data was vectorized, after which TF-IDF weighting was applied, using ℓ2 normalization (Baeza-Yates, Ribeiro et al. 2011). Latent Semantic Analysis was used as an additional step after the TF-IDF weighting (Halko, Martinsson, and Tropp 2011), testing reduction to 150 features and reduction to 300 features. After each of these steps, the SVM was retrained with default parameters and re-tested on the test dataset to quantify any effects on its accuracy. Finally, grid search with two-fold cross-validation was used to tune the SVM itself. The grid search tested weighted versus non-weighted classes and different values for the error term parameter. Table 5 contains an overview of the tested hyperparameters.
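This pipeline can be sketched in scikit-learn as follows; the grids follow Table 5, while everything else (including the LSA dimensionality shown) is a default or an assumption.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(norm="l2")),     # TF-IDF with l2 normalization
    ("lsa", TruncatedSVD(n_components=150)),   # LSA; 300 components were also tested
    ("svm", LinearSVC()),                      # linear kernel, one-vs-rest by default
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "svm__C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10_000, 100_000],
        "svm__class_weight": [None, "balanced"],
    },
    cv=2,               # two-fold cross-validation to keep runtime manageable
    scoring="accuracy",
)
# grid.fit(train_texts, train_labels)   # texts: one concatenated string per user
```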

3.5 Evaluation

The performance of the tuned models is evaluated by looking at the accuracy score on the held-out test data. In addition, full classification reports can be found in Appendix A and Appendix B. Part of the evaluation concerns the extent to which a model generalizes to different platforms, such as Twitter. To test this, the best-performing model from Reddit is identified and its parameter settings are used to create a new model trained on Twitter. These two models are then evaluated on both Reddit and Twitter. With this approach, cross-domain performance can be compared to in-domain performance, allowing better insight into whether any cross-platform increase or decrease in accuracy is meaningful. The Twitter corpus from Volkova, Coppersmith, and Van Durme (2014) is used as Twitter domain data.

13 All further preprocessing and modelling of the age data was ran in the SciKit-learn package


Table 5
SVM model parameters, listing the baseline, any tuning values and the parameters of the best performing model.

Parameter                     Baseline        Tuning values                              Best value
Fit intercept (center data)   True            -                                          True
Class weight                  None            None, balanced                             None
Loss function                 Squared hinge   -                                          Squared hinge
Multi-class handling          One vs. all     -                                          One vs. all
Maximum iterations            1000            -                                          1000
Penalty                       ℓ2              -                                          ℓ2
Penalty error term (C)        1.0             0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100,    1.0
                                              1000, 10,000, 100,000

Table 6
Descriptives for the Volkova Twitter corpus

Feature   Category   Frequency
Gender    Male       26,708
          Female     32,367
Age       <20        8,898
          20 - 30    28,821
          >30        6,919

This corpus contains batched tweets without preprocessing from 103,713 different authors. Full descriptives of the corpus can be found in Table 6.
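The comparison itself reduces to evaluating each trained model on both held-out sets, along these lines (variable names are illustrative):

```python
from sklearn.metrics import accuracy_score

def evaluate(model, test_sets):
    """Report accuracy of one trained model on several domains."""
    for domain, (texts, labels) in test_sets.items():
        acc = accuracy_score(labels, model.predict(texts))
        print(f"{domain}: accuracy = {acc:.4f}")

# evaluate(reddit_model, {"Reddit": (reddit_X, reddit_y), "Twitter": (twitter_X, twitter_y)})
```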

3.6 Post-hoc analyses

To give some insight into which features the fastText model considers important, an experiment similar to that proposed by Ribeiro, Singh, and Guestrin (2016) and Kádár, Chrupała, and Alishahi (2017) is performed. In this experiment, the gender class of a random sentence is predicted using the trained model, after which the function iterates over the tokens in the sentence, each time comparing the difference in probability between the full sentence and the sentence with the token omitted. The more important a token, the higher the difference in probability. This experiment is repeated for both the Reddit and the Twitter model. The SVM used to predict age allows extraction of the coefficients for each class; linking these to the TF-IDF features reveals which n-grams are most predictive for each class. The 20 most positive n-grams for each class are extracted for both the Reddit and Twitter models. This allows 'human sense-making' of the model: it gives an indication of whether noise from the unprocessed input was incorporated into the model, and shows any possible overlaps between classes and domains.
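Both analyses can be sketched as follows. The omission scores assume the model exposes a probability function for the target class; the coefficient extraction assumes the SVM was fit directly on the TF-IDF features (with LSA in between, coefficients would map to SVD components instead). All names are illustrative.

```python
import numpy as np

def token_importance(predict_proba, tokens):
    """Score each token by the drop in predicted-class probability when omitted."""
    full = predict_proba(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(full - predict_proba(" ".join(reduced)))
    return scores                     # higher score = more important token

def top_ngrams(svm, vectorizer, n=20):
    """Map each class's largest SVM coefficients back to their TF-IDF n-grams."""
    features = np.array(vectorizer.get_feature_names_out())
    return {cls: features[np.argsort(svm.coef_[k])[-n:][::-1]]
            for k, cls in enumerate(svm.classes_)}
```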


Table 7
Reported accuracies for in-domain and cross-domain performance of the best-performing models

                         Test data
Feature   Train data     Reddit    Twitter
Gender    Reddit         0.8579    0.6149
          Twitter        0.7945    0.9071
          Baseline       0.601     0.548
Age       Reddit         0.7012    0.5873
          Twitter        0.5393    0.9289
          Baseline       0.529     0.646

4. Results

Table 7 reports the in-domain and cross-domain results of the best-performing Reddit and Twitter models for gender and age.

Gender prediction for Reddit has an accuracy of 0.857, well above the naive majority baseline of 0.601. The Reddit model does not generalize well to Twitter, with the accuracy dropping to 0.615 when tested on the Twitter corpus. The Twitter model reached a higher in-domain accuracy than the Reddit model, at 0.907. Here too, the cross-domain accuracy is lower, though the drop is less pronounced than for the Reddit model and accuracy stays well above the baseline. The classification report in Appendix A shows that in-domain precision for men and women is equal for both models, despite class imbalance. With cross-domain testing, precision remains approximately equal for both genders in the Twitter model, but drops dramatically from 0.85 to 0.54 for males in the Reddit model, despite men being overrepresented in the Reddit dataset.

For age prediction the results are more mixed. The Reddit-to-Reddit model performs well above baseline, with an accuracy of 0.701 compared to the 0.529 baseline. Reddit-to-Twitter, however, drops below baseline, with an accuracy of 0.587 compared to a 0.646 baseline for Twitter. The best performing model is the Twitter in-domain model, with an accuracy of 0.928. The Twitter-to-Reddit model shows that there is a steep decline in performance with cross-domain testing: with an accuracy of 0.539, it barely remains above the naive majority baseline. The classification report for age prediction in Appendix B shows that the Reddit model fails to predict any <20 cases in both domains. While the Twitter model has an in-domain precision of 0.98 for the <20 class, this drops to 0.00 when testing on the Reddit domain.

4.1 Post-hoc analysis of the gender model

Table 8 shows some example sentences with the three most significant tokens for either male or female prediction. The experiment shows that the model highlights completely different tokens for the same sentence, depending on whether it was trained on Twitter or Reddit, with only occasional overlap.

17 Data Science: Business & Governance 2018

Table 8
Results of the gender post-hoc analysis. Sentences were picked at random from both corpora. If the model misclassified the sentence, the actual label is listed in brackets. Pink indicates the three tokens most predictive for female, blue the three tokens most predictive for male. The darker colour is the top token.

Results for the Reddit model
Domain, predicted label (actual label)
Reddit, female: I'm finding non-food rewards helpful... hot baths, quiet time with a , time to myself messing around with my guitar, etc. Hang in there, you're not alone.
Reddit, female: I get how you feel. I feel the same way a lot of the time.
Reddit, male: Just Cause 3 is still $6.00. ($9.00 for XL Edition). And I'm still not sure if I want it lol.
Twitter, female: Please don't ruin my day Mavs. It's bad enough I'm already half dead
Twitter, male: Y'all laugh at Dj Khaled but he really the biggest motivator in the rap game

Results for the Twitter model
Domain, predicted label (actual label)
Reddit, male (female): It's absolutely a good thing! For me, it's forcing me to end my relationship with a bunch of detrimental old habits.
Reddit, female: I get how you feel. I feel the same way a lot of the time.
Reddit, male: Just Cause 3 is still $6.00. ($9.00 for XL Edition). And I'm still not sure if I want it lol.
Twitter, male (female): Please don't ruin my day Mavs. It's bad enough I'm already half dead
Twitter, male: Y'all laugh at Dj Khaled but he really the biggest motivator in the rap game
Twitter, male: Some people believe everything they hear smh

4.2 SVM Coefficients

Table 9 shows the five most positive predictors for each age class for the Reddit model and the Twitter model; the appendix contains a table showing the twenty most positive predictors of each class. The results of this experiment show that noise has been incorporated into the Reddit model. For instance, "favorite https" is one of the top five n-grams for the under 20 class, and the typo "didn learn" [sic] is one of the twenty most positive predictors for the over 30 category. The Twitter model is similarly noisy, including a significant number of Twitter handles as influential coefficients for each class.


Table 9
The top 5 predictive n-grams for each model per class, ordered top to bottom from most to least predictive.

Label     | Reddit model     | Twitter model
Under 20  | doctor nurse     | birthday dj
          | marketability    | nigga good
          | favorite https   | grown man
          | soon thing       | weird sorry
          | months live      | list phone
20 - 30   | easy simple      | logeyy_bear
          | injections       | raised questions
          | cordelia         | understand support
          | counting single  | hike amp
          | bowl eat         | historical kindle
Over 30   | sorry trying     | fat motherfucker
          | going pain       | im walkin
          | just saw         | don depend
          | stick routine    | safe bro
          | online thanks    | chief like

Furthermore, there is little overlap between the top n-grams for Twitter and Reddit, regardless of class.
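
For a linear SVM over TFIDF features, coefficients like these can be read directly from the fitted model by mapping each weight back to its n-gram. A minimal sketch with scikit-learn, assuming fitted vectorizer and svm objects (the names are illustrative, not this thesis's exact code):

    # Sketch: the n most positive TFIDF n-grams per class of a LinearSVC.
    import numpy as np

    def top_ngrams_per_class(vectorizer, svm, n=5):
        feature_names = np.asarray(vectorizer.get_feature_names_out())
        # In the multiclass (one-vs-rest) case, coef_ stores one weight
        # vector per class: shape (n_classes, n_features).
        for label, weights in zip(svm.classes_, svm.coef_):
            top = np.argsort(weights)[::-1][:n]
            print(label, list(zip(feature_names[top], np.round(weights[top], 3))))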

5. Discussion

For gender prediction on Reddit, a simple fastText model reaches approximately the same accuracy as established in Amezoug (2018): 0.848 in Amezoug (2018)'s work versus 0.857 in our own. The Twitter model significantly outperforms the Reddit model with an in-domain accuracy of 0.907, in line with earlier work reporting accuracies of 0.902 (Amezoug 2018) and 0.854 (Rangel et al. 2018). The closeness of these results suggests that a larger corpus alone is not sufficient to significantly increase model performance. The post-hoc analysis shows that the fastText model tags very different words as important depending on whether it was trained on Twitter or Reddit, which indicates some degree of domain-specificity in the language used. The findings appear to contradict earlier findings on gender differences in language use. Newman et al. (2008) stated that men use more determiners and prepositions ("my", "in") than women, yet the Reddit model flags several of these as positive for a female label prediction. Likewise, Newman et al. (2008) found that women use more first-person identifiers ("I"), but particularly the Twitter model considers these predictive for males.

Prediction of age for Reddit exceeds existing work, with 0.70 accuracy compared to 0.65 (Amezoug 2018). One possible explanation follows the one given by Amezoug (2018), who notes that their distant queries (e.g. "I'm X years old") may have included sarcastic remarks or quotes of other commenters as labels. Flairs and submission titles are assumed not to have these issues, as both explicitly refer to the users themselves, and sarcasm in flairs would have to be subreddit-wide, since the same flair appears on all comments in that subreddit. A second possible explanation for the slight increase in accuracy is the difference in age categories: <20, 20-30 and >30 versus <20, 20-40 and >40.

The main issue with the age model is its inability to detect the <20 class. In the Reddit dataset this is somewhat expected due to class imbalance: the <20 class makes up only 2.50% of the Reddit dataset, instead of the 33.33% it would under balanced classes. However, the Twitter model being able to predict the <20 class on Twitter while failing to do so on Reddit indicates that, beyond the class distribution problem, there may be a clear difference in how the <20 class communicates on Reddit and Twitter. The extracted SVM coefficients hint at this, as there is no overlap between the most influential Reddit and Twitter n-grams.

Furthermore, it is noteworthy that some n-grams intuitively make sense while others do not. Examples of 'sensible' n-grams are "make dinner" (positive for the 20 - 30 class) or "stick routine" and "room kids" (both positive for the over 30 class). Yet "insurance" as a positive predictor for the under 20 class does not make sense on the face of it; these are the kind of discrepancies that, as Ribeiro, Singh, and Guestrin (2016) point out, only become visible when analysing the classifier itself. On the other hand, some cryptic influential n-grams may have perfectly sensible explanations: "historical kindle" appears to come from users persistently promoting their historical fiction in the Amazon Kindle store (see footnote 14). Regardless of possible explanations, the noisiness of these coefficients suggests that preprocessing could make the dataset cleaner and predictions better by removing irrelevant features. Secondly, the presence of usernames suggests that some users may have a disproportionate presence in the dataset, which carries over into the model. This second issue especially affects the Twitter data, where at least 8 of the 60 most influential n-grams appear to be Twitter handles, while the Reddit n-grams contain only one name.
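
The kind of classifier inspection Ribeiro, Singh, and Guestrin (2016) advocate is available off the shelf as LIME. The sketch below is hypothetical throughout: it assumes a fitted scikit-learn pipeline with a predict_proba method (a LinearSVC would first need calibration, e.g. via CalibratedClassifierCV), an invented example sentence, and that "<20" is class index 0.

    # Sketch: explaining a single age prediction with LIME.
    from lime.lime_text import LimeTextExplainer

    explainer = LimeTextExplainer(class_names=["<20", "20-30", ">30"])
    explanation = explainer.explain_instance(
        "my insurance won't cover it and rent is due",  # invented example
        pipeline.predict_proba,  # list of texts in, class probabilities out
        labels=(0,),             # explain the "<20" class
        num_features=6,          # report the six most influential tokens
    )
    print(explanation.as_list(label=0))  # (token, weight) pairs for "<20"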

5.1 Limitations

Despite attempts to diversify, users are still sampled from a very small percentage of all subreddits, which may introduce bias into the data. When labelling users only through flairs or submission titles, there is no clear solution to this, as many subreddits do not contain flairs or submission titles with author attributes. Another possible bias in the data lies in the age distribution: there are comparatively very few under-20s in the dataset. Understandably, this makes predicting age a more difficult task, but it may also affect the prediction of gender: if the gender dataset follows roughly the same age distribution, it will be just as imbalanced in this dimension, and any models trained on it will be more attuned to predicting gender for slightly older users.

Users of social media may also lie about their identity, and the automated labelling will accept these lies as truth. For instance, a teenager might pretend to be an adult and label themselves as such. While careful manual annotation might notice such inconsistencies, we are not aware of a method to automatically detect these users.

Lastly, this thesis did not use any preprocessing. Preprocessing might improve model performance by removing two types of noise: parts of texts that are not the author's own, i.e. quotes and retweets, and internet slang, which could be removed or standardized. As a counterpoint to too much preprocessing, slang and (intentional) mistyping of words could themselves be predictive of author attributes, such as millennials' use of word lengthening or people with a shared native language making the same typos.

14 See the results on https://twitter.com/search?q=historical%20kindle


5.2 Future work

We have shown how to create a large-scale labelled dataset with minimal resources, an approach that can be re-used in future author profiling work on Reddit. Authors can choose whether to use the existing corpus, augment it, or create their own. To address the limitation of sampling from a small set of subreddits, future research could combine the flair- and submission-based method from this thesis with distant labelling based on comment text, which is noisier but applicable to any subreddit.

Due to time and resource constraints, no preprocessing methods were explored. The poor cross-domain performance suggests that there may be domain-specific features which hinder model generalization. A starting point would therefore be to remove all domain-specific features from Reddit comments, such as user mentions, subreddit mentions and hyperlinks. Preprocessing can also focus on removing noise, such as the Twitter handles that currently often appear as significant features.
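
As a concrete illustration of this starting point, such surface features can be stripped with a handful of regular expressions. The patterns below are indicative only, not code used in this thesis; real data would need more careful rules.

    # Sketch: removing domain-specific surface features from posts.
    import re

    PATTERNS = [
        re.compile(r"https?://\S+"),         # hyperlinks
        re.compile(r"(?<!\w)/?u/\w+"),       # Reddit user mentions, e.g. u/name
        re.compile(r"(?<!\w)/?r/\w+"),       # subreddit mentions, e.g. r/AskReddit
        re.compile(r"@\w+"),                 # Twitter handles
        re.compile(r"\bRT\b"),               # retweet markers
        re.compile(r"^>.*$", re.MULTILINE),  # Reddit quote lines (not the author's words)
    ]

    def strip_domain_features(text):
        for pattern in PATTERNS:
            text = pattern.sub(" ", text)
        return " ".join(text.split())  # collapse leftover whitespace

    # strip_domain_features("RT @user see r/books https://t.co/abc") -> "see"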

5.3 Ethical considerations

Author profiling work currently still treats gender as a binary variable, contrary to the academic understanding of gender and gender identity. By treating gender as a binary male/female variable, a subset of the population is ignored. However, researchers should be careful when including these marginalized groups: any algorithm that identifies marginalized groups could also be used by malicious agents.

A second point of concern is the analysis of vulnerable communities. Users often do not seem to be aware that if part of a userbase shares defining characteristics, this can be used to profile all 'anonymous' users. In 2014, the UK suicide prevention charity Samaritans released "Samaritans Radar". Once installed, the tool scanned the tweets of every account the user followed and emailed the user whenever it determined that someone displayed suicidal tendencies. This led to widespread outcry and a quick retraction of the tool. Bernal (2014) found that those using Twitter for mental health support saw privacy as a collective right, and that any attempt at harming this privacy, e.g. through automated analysis, was seen as an attack on the community as a whole. Indiscriminately using author profiling on all Reddit communities may cause similar harm, since De Choudhury and Sushovan (2014) found that anonymity plays an important role in people's usage of mental health subreddits. Creating algorithms that profile anonymous users can slowly erode these supportive networks, leaving vulnerable persons with nowhere to go.

6. Conclusion

This thesis contributes a method to quickly and cheaply generate large labelled datasets for Reddit, a platform currently underutilised in the author profiling field. It refines previous methods by sampling more subreddits and using unambiguous flairs and submission titles, leading to a significantly larger corpus with less label pollution. Comparing the achieved accuracies to earlier attempts shows that the increased reliability and size of the assembled corpus only slightly enhance model performance: compared to earlier work using the same models, gender classification accuracy improves by merely 0.01, while age classification accuracy improves by 0.05. Due to the limited scope of this thesis, it is possible that the performance improvement will be more noticeable with different models or author attributes.


This thesis investigated the use of a simple fastText model for gender prediction, showing it can achieve a high accuracy of 0.86 for Reddit. For age on Reddit, fastText does not prove suitable; the results show that a linear kernel SVM with TFIDF representation is successful instead, reaching an accuracy of 0.70 for a three-class categorisation of age, which is on par with earlier work. Neither Reddit model beats the performance of a model trained on Twitter, which reaches 0.90 accuracy for gender and 0.92 accuracy for age. Taken together, these results support the hypothesis that successful Twitter approaches and algorithms generalize to Reddit, since the approaches used provide satisfactory results without much further engineering. However, the trained models show poor cross-domain performance, taking a hit to accuracy and, in the case of age, failing to detect an entire class. The lack of preprocessing to filter out domain-specific features is a possible explanation for this.

In conclusion, the chosen models perform acceptably but leave room for improvement, possibly through preprocessing of the input and further tuning of the models. The corpus generation method proposed here can aid in this, as it allows any researcher to efficiently create a large dataset suitable for training text classifiers.


References

Alsmearat, Kholoud, Mohammed Shehab, Mahmoud Al-Ayyoub, Riyad Al-Shalabi, and Ghassan Kanaan. 2015. Emotion analysis of arabic articles and its impact on identifying the author's gender. In Computer Systems and Applications (AICCSA), 2015 IEEE/ACS 12th International Conference of, pages 1–6, IEEE.
Amezoug, Anwar. 2018. Inferring personal attributes based on different modalities on social networks. Master's thesis, Tilburg University, 8.
Argamon, Shlomo, Moshe Koppel, James W Pennebaker, and Jonathan Schler. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2):119–123.
Baeza-Yates, Ricardo, Berthier de Araújo Neto Ribeiro, et al. 2011. Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley.
Basile, Angelo, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. Is there life beyond n-grams? a simple svm-based author profiling system. In Cappellato et al. [13].
Bernal, P. 2014. Samaritans radar: Misunderstanding privacy and 'publicness'. (Unpublished manuscript).
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
van den Boom, Bernard and Cor J Veenman. 2018. Finding dutch natives in online forums. Forensic Sciences Research, pages 1–10.
Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, ACM.
Brody, Samuel and Nicholas Diakopoulos. 2011. Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!: using word lengthening to detect sentiment in microblogs. In Proceedings of the conference on empirical methods in natural language processing, pages 562–570, Association for Computational Linguistics.
Campbell, R Sherlock and James W Pennebaker. 2003. The secret life of pronouns: Flexibility in writing style and physical health. Psychological science, 14(1):60–65.
Chang, Chih-Chung and Chih-Jen Lin. 2011. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):27.
Conway, Mike and Daniel O'Connor. 2016. Social media, big data, and mental health: current advances and ethical implications. Current opinion in psychology, 9:77–82.
Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. Machine learning, 20(3):273–297.
De Choudhury, Munmun and De Sushovan. 2014. Mental health discourse on reddit: Self-disclosure, social support, and anonymity. In ICWSM.
van Duijnhoven, Coen. 2018. Predicting political preference through content- and stylistic text features and distant labeling. Master's thesis, Tilburg University, 7.
Emmery, Chris, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55.
Estival, Dominique, Tanja Gaustad, Ben Hutchinson, Son Bao Pham, and Will Radford. 2008. Author profiling for english and arabic emails.
Estival, Dominique, Tanja Gaustad, Son Bao Pham, Will Radford, and Ben Hutchinson. 2007. Author profiling for english emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 263–272.
Fan, Rong-En, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874.
Fatima, Mehwish, Komal Hasan, Saba Anwar, and Rao Muhammad Adeel Nawab. 2017. Multilingual author profiling on facebook. Information Processing & Management, 53(4):886–904.
Gjurković, Matej and Jan Šnajder. 2018. Reddit: A gold mine for personality prediction. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 87–97.
Go, Alec, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12).


Halko, Nathan, Per-Gunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288.
Hovy, Dirk. 2015. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 752–762.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
Kádár, Ákos, Grzegorz Chrupała, and Afra Alishahi. 2017. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics, 43(4):761–780.
Kešelj, Vlado, Fuchun Peng, Nick Cercone, and Calvin Thomas. 2003. N-gram-based author profiles for authorship attribution. In Proceedings of the conference pacific association for computational linguistics, PACLING, volume 3, pages 255–264.
Koppel, Moshe, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.
Landauer, Thomas K, Peter W Foltz, and Darrell Laham. 1998. An introduction to latent semantic analysis. Discourse processes, 25(2-3):259–284.
van de Loo, Janneke, Guy De Pauw, and Walter Daelemans. 2016. Text-based age and gender prediction for online safety monitoring. Comput. Linguistics Netherlands, 5(1):46–60.
Lu, Xuan, Wei Ai, Xuanzhe Liu, Qian Li, Ning Wang, Gang Huang, and Qiaozhu Mei. 2016. Learning from the ubiquitous language: an empirical analysis of emoji usage of smartphone users. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 770–780, ACM.
Miller, Hannah, Jacob Thebault-Spieker, Shuo Chang, Isaac Johnson, Loren Terveen, and Brent Hecht. 2016. “Blissfully happy” or “ready to fight”: Varying interpretations of emoji. Proceedings of ICWSM, 2016.
Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011, Association for Computational Linguistics.
Newman, Matthew L, Carla J Groom, Lori D Handelman, and James W Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes, 45(3):211–236.
O'reilly, Tim. 2005. What is web 2.0.
Pappa, Gisele Lobo, Tiago Oliveira Cunha, Paulo Viana Bicalho, Antonio Ribeiro, Ana Paula Couto Silva, Wagner Meira Jr, and Alline Maria Rezende Beleigoli. 2017. Factors associated with weight change in online weight management communities: a case study in the loseit reddit community. Journal of medical Internet research, 19(1).
Prahalad, Coimbatore K and Venkat Ramaswamy. 2004. Co-creation experiences: The next practice in value creation. Journal of interactive marketing, 18(3):5–14.
Ramos, Juan et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 133–142.
Rangel, Francisco, Paolo Rosso, Manuel Montes-y Gómez, Martin Potthast, and Benno Stein. 2018. Overview of the 6th author profiling task at pan 2018: multimodal gender identification in twitter. Working Notes Papers of the CLEF.
Rangel, Francisco, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. 2013. Overview of the author profiling task at pan 2013. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pages 352–365, CELCT.
Rangel, Francisco, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter. Working Notes Papers of the CLEF.
Rangel, Francisco, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at pan 2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings/Balog, Krisztian [edit.]; et al., pages 750–784.
Rangel Pardo, Francisco Manuel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd author profiling task at pan 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pages 1–8.


Ratner, Alexander J, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575.
Reddy, T Raghunadha, B Vishnu Vardhan, and P Vijayapal Reddy. 2017. N-gram approach for gender prediction. In Advance Computing Conference (IACC), 2017 IEEE 7th International, pages 860–865, IEEE.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, ACM.
Santosh, K, Romil Bansal, Mihir Shekhar, and Vasudeva Varma. 2013. Author profiling: Predicting age and gender from blogs. Notebook for PAN at CLEF, pages 119–124.
Sap, Maarten, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and Hansen Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1146–1151.
Schler, Jonathan, Moshe Koppel, Shlomo Argamon, and James W Pennebaker. 2006. Effects of age and gender on blogging. In AAAI spring symposium: Computational approaches to analyzing weblogs, volume 6, pages 199–205.
Schwartz, H Andrew, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski, Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin EP Seligman, et al. 2013. Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS one, 8(9):e73791.
Tauch, Channary and Eiman Kanjo. 2016. The roles of emojis in mobile phone notifications. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, pages 1560–1565, ACM.
Vasilev, Evgenii. 2018. Inferring gender of Reddit users. Ph.D. thesis, Universität Koblenz-Landau.
Verhoeven, Ben. 2018. Two authors walk into a bar: studies in author profiling. Ph.D. thesis, University of Antwerp.
Volkova, Svitlana, Glen Coppersmith, and Benjamin Van Durme. 2014. Inferring user political preferences from streaming communications. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 186–196.
Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
Zhang, Yin, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 1(1-4):43–52.


Appendices

A. Classification reports for gender prediction

Table 10
Classification reports for gender, using the best-performing model (fastText trained on Reddit).

fastText trained on Reddit, predicting on Reddit

              precision  recall  f1-score  support
female             0.87    0.70      0.78     1732
male               0.85    0.94      0.89     3125
micro avg          0.86    0.86      0.86     4857
macro avg          0.86    0.82      0.84     4857
weighted avg       0.86    0.86      0.85     4857

fastText trained on Reddit, predicting on Twitter

              precision  recall  f1-score  support
female             0.87    0.35      0.50     6473
male               0.54    0.94      0.69     5342
micro avg          0.61    0.61      0.61    11815
macro avg          0.71    0.64      0.59    11815
weighted avg       0.72    0.61      0.58    11815

Table 11
Classification reports for gender, using the best-performing model (fastText trained on Twitter).

fastText trained on Twitter, predicting on Twitter

              precision  recall  f1-score  support
female             0.91    0.92      0.92     6473
male               0.91    0.89      0.90     5342
micro avg          0.91    0.91      0.91    11815
macro avg          0.91    0.91      0.91    11815
weighted avg       0.91    0.91      0.91    11815

fastText trained on Twitter, predicting on Reddit

              precision  recall  f1-score  support
female             0.78    0.59      0.67     1732
male               0.80    0.91      0.85     3125
micro avg          0.79    0.79      0.79     4857
macro avg          0.79    0.75      0.76     4857
weighted avg       0.79    0.79      0.79     4857


B. Classification reports for age prediction

Table 12
Classification reports for age, using the best-performing model (SVM trained on Reddit).

SVM trained on Reddit, predicting on Reddit

                 precision  recall  f1-score  support
20 - 30               0.69    0.74      0.72     1104
older than 30         0.72    0.70      0.71     1040
younger than 20       0.00    0.00      0.00       55
micro avg             0.70    0.70      0.70     2199
macro avg             0.47    0.48      0.48     2199
weighted avg          0.69    0.70      0.69     2199

SVM trained on Reddit, predicting on Twitter

                 precision  recall  f1-score  support
20 - 30               0.68    0.74      0.71     5764
older than 30         0.36    0.69      0.48     1384
younger than 20       0.00    0.00      0.00     1780
micro avg             0.59    0.59      0.59     8928
macro avg             0.35    0.48      0.40     8928
weighted avg          0.50    0.59      0.53     8928

Table 13
Classification reports for age, using the best-performing model (SVM trained on Twitter).

SVM trained on Twitter, predicting on Twitter

                 precision  recall  f1-score  support
20 - 30               0.91    0.99      0.95     5764
older than 30         0.98    0.87      0.92     1384
younger than 20       0.98    0.77      0.86     1780
micro avg             0.93    0.93      0.93     8928
macro avg             0.96    0.88      0.91     8928
weighted avg          0.93    0.93      0.93     8928

SVM trained on Twitter, predicting on Reddit

                 precision  recall  f1-score  support
20 - 30               0.52    0.96      0.68     1104
older than 30         0.73    0.12      0.21     1040
younger than 20       0.00    0.00      0.00       55
micro avg             0.54    0.54      0.54     2199
macro avg             0.42    0.36      0.30     2199
weighted avg          0.61    0.54      0.44     2199


C. Coefficients of the SVM

Table 14
Most influential coefficients (n-gram, weight) for each class of the Reddit-trained SVM.

   | Reddit under 20             | Reddit 20-30                   | Reddit >30
 1 | doctor nurse       0.675971 | easy simple           3.257542 | sorry trying      3.084984
 2 | marketability      0.660743 | injections            2.214002 | going pain        2.286150
 3 | favorite https     0.633053 | cordelia              1.932245 | just saw          1.850574
 4 | soon thing         0.622117 | counting single       1.923312 | stick routine     1.755480
 5 | months live        0.615818 | bowl eat              1.821784 | online thanks     1.722979
 6 | reply day          0.610005 | policy gt             1.735433 | track eating      1.667309
 7 | appreciate words   0.537924 | provence              1.702276 | 520               1.645445
 8 | insurance          0.535737 | said suspect          1.517488 | perfect exactly   1.636734
 9 | nicer place        0.533491 | like plastic          1.440709 | jealous didn      1.451777
10 | plan plan          0.501572 | walkway               1.425718 | amazing tried     1.374537
11 | job aren           0.492992 | make dinner           1.392441 | room kids         1.349704
12 | therapy right      0.492795 | told watch            1.372025 | doorman           1.325109
13 | reservoir          0.490263 | tried little          1.338940 | sick don          1.304082
14 | just submitted     0.484365 | wait 20               1.277630 | meghan            1.302293
15 | cordelia           0.482183 | suffer consequences   1.275206 | just started      1.256477
16 | hands just         0.465961 | water black           1.270268 | vital             1.253584
17 | follow suit        0.457663 | going probably        1.259378 | trans women       1.238346
18 | probably year      0.443900 | art isn               1.252262 | highway traffic   1.203242
19 | freaks             0.439974 | nicer place           1.236824 | didn learn        1.183471
20 | use positive       0.438557 | regulates             1.205978 | fact correct      1.169088


Table 15
The most positive coefficients (n-gram, weight) for each class of the Twitter-trained SVM.

   | Twitter under 20                 | Twitter 20-30                   | Twitter >30
 1 | birthday dj             2.188085 | logeyy_bear            2.427438 | fat motherfucker    2.187499
 2 | nigga good              2.113997 | raised questions       1.693042 | im walkin           1.863517
 3 | grown man               2.049330 | understand support     1.506671 | don depend          1.726872
 4 | weird sorry             1.729439 | hike amp               1.433984 | safe bro            1.637248
 5 | list phone              1.715524 | historical kindle      1.325379 | chief like          1.508339
 6 | worldwide rt            1.709411 | wage labor             1.303460 | goes minutes        1.475641
 7 | offered free            1.667660 | 12 wanna               1.276124 | sugar rush          1.463418
 8 | like bedroom            1.629207 | just dive              1.250283 | rt aaronr_pxw       1.460467
 9 | wire share              1.599877 | today enjoy            1.236441 | repost tarajixone   1.457263
10 | dailylovescopes cancer  1.566035 | rt tittromney          1.230008 | wow miss            1.446954
11 | feel outta              1.530270 | robfee black           1.207991 | faith leads         1.415822
12 | mother bad              1.515561 | malcolm gladwell       1.201477 | white inside        1.377968
13 | rn looks                1.472729 | different interesting  1.201452 | tryna hear          1.370816
14 | really large            1.438865 | rt realjrdonato        1.194704 | wayfair             1.356995
15 | em purpose              1.435678 | dead talking           1.185850 | hahahahahahah rt    1.319665
16 | situation clearly       1.402089 | kemba                  1.164759 | friendly crab       1.294252
17 | rahrahraina             1.397267 | nypost rt              1.152335 | expect buy          1.284988
18 | matter guys             1.380762 | nick bad               1.097149 | attacks white       1.276545
19 | pro equality            1.376564 | tomorrow needed        1.072935 | reflect worship     1.266116
20 | icewear                 1.363392 | tote amp               1.058457 | birthday good       1.263891
