
Echoing Within and Between: Quantifying Behaviours on

A thesis submitted to The University of Manchester for the degree of Doctor of Philosophy in the Faculty of Humanities

2020

Ella M. Guest

School of Social Sciences, Social Statistics

Contents

1 Introduction
  1.1 Background and motivation
    1.1.1 Social media echo chambers: fact or fiction?
    1.1.2 Reddit: self-proclaimed ‘front page of the internet’
  1.2 Research aims
    1.2.1 Research questions
    1.2.2 Computational social science approach
  1.3 Thesis outline

2 Research Background
  2.1 The emergence of social media ‘echo chambers’
    2.1.1 The public sphere(s)
    2.1.2 Reinforcing and avoiding
    2.1.3 Echoing in networks
  2.2 ‘The front page of the internet’
    2.2.1 Subreddits: topic-based communities
    2.2.2 Who is on Reddit?
  2.3 Topic interests on Reddit
    2.3.1 ‘Interest meta-communities’ as public sphericules
    2.3.2 changemyview: the anti-echo chamber?
    2.3.3 The Donald: a self-defined echo chamber
  2.4 Chapter summary

3 Data Collection & Preparation
  3.1 Data collection
    3.1.1 Data quality
    3.1.2 Data access
    3.1.3 Overview of the data
  3.2 Subsetting data
    3.2.1 Subsetting authors
  3.3 Data Preparation
    3.3.1 Author-subreddit pairs
    3.3.2 Subreddit similarity
  3.4 Labelling subreddit topics
    3.4.1 Overview of topic frequencies
    3.4.2 Description of the political subreddits
  3.5 Chapter summary

4 Subreddits as Echo Chambers
  4.1 Defining chamberness
    4.1.1 Within subreddit author participation
    4.1.2 Between subreddit author participation
  4.2 Within subreddit participation results
    4.2.1 General trends
    4.2.2 The case study subreddits
  4.3 Between subreddit participation results
    4.3.1 General trends
    4.3.2 The case study subreddits
  4.4 Chapter summary

5 Networks of Echo Chamber
  5.1 Defining meta echo chambers
    5.1.1 Measuring subreddit similarity
    5.1.2 Defining chamberness similarity
  5.2 A network of public sphericules
    5.2.1 The subreddit network
    5.2.2 Detecting public sphericules
    5.2.3 Sphericules as chambers?
  5.3 The geo-political sphericule
    5.3.1 The most chamber-like edges?
    5.3.2 Political meta echo chambers?
    5.3.3 Political echoing on the left
  5.4 Chapter summary

6 Discussion
  6.1 Addressing the research questions
    6.1.1 Research question 1
    6.1.2 Research question 2
    6.1.3 Research questions 3 & 4
  6.2 The case study subreddits
    6.2.1 The right
    6.2.2 The neutral
    6.2.3 The left
    6.2.4 The anti echo chamber
  6.3 Non-political echoing
    6.3.1 The sports
    6.3.2 The porn community
    6.3.3 vs echoing
  6.4 Implications of the research
    6.4.1 Wider theoretical implications
    6.4.2 Strengths of the research
    6.4.3 Limitations of the research

7 Conclusion
  7.1 Summary of key findings
  7.2 Key contributions of the research
  7.3 Future research
  7.4 Concluding remarks

References

A Appendix
  A.1 Code repository
  A.2 Subreddit topic distributions
  A.3 Political subreddit results
  A.4 Inter-community edge weights
  A.5 Community subreddit topic counts
  A.6 Closest neighbours of political subreddits

Word count: 61,102

List of Figures

2.1 Rates of Reddit use by US adults in 2019, data from (Pew Research Center 2019)

3.1 Screenshot of BigQuery interface querying January 2019 Reddit comment dataset
3.2 Overall author and comment counts per subreddit
3.3 Cumulative frequency of author counts for top 1000 subreddits
3.4 Distribution of 21 most common subreddit topic labels
3.5 Timeline of political subreddit creation
3.6 Author and comment counts for political subreddits

4.1 Frequency of median comment count per author for all subreddits
4.2 Frequency of median author insubreddit proportion for all subreddits
4.3 Scatterplot of median number of author out-comments vs in-comments for all subreddits
4.4 Heatmap of correlations between comment count, median comment count, and median insubreddit proportion for all subreddits
4.5 Author and comment counts for case study subreddits
4.6 Results of within subreddit author participation measures for case study subreddits
4.7 Scatterplot of median number of author out-comments vs in-comments for case study subreddits
4.8 Heatmap of political subreddit percentiles for within subreddit participation measures
4.9 Scatterplot of median comments per authors by median author insubreddit proportion for case study subreddits
4.10 Frequency of median author subreddit count for all subreddits
4.11 Frequency of median author comment count for all subreddits
4.12 Frequency of median author average comment count for all subreddits
4.13 Frequency of median author Gini for all subreddits
4.14 Heatmap of between subreddit median author participation measure percentiles for case study subreddits
4.15 First pair of bar plots for between subreddit author participation measures for case study subreddits
4.16 Second pair of bar plots for between subreddit author participation measures for case study subreddits

5.1 Scatterplot of logged values of co-authorship (y-axis) and text similarity (x-axis) per pair of subreddits
5.2 Distribution of regression residuals for all subreddit pairs
5.3 Frequency distributions of subreddit degree in the network
5.4 Frequency of the 11 most popular topic labels for subreddits in the network
5.5 Heatmaps of relative subreddit topic and community frequencies
5.6 Community sizes
5.7 Network graph of communities
5.8 Barplot of degree percentile rank for political subreddits
5.9 Frequency distribution of residuals for all pairs of political subreddits
5.10 Number of edges shared by case study subreddits
5.11 Network of top edges between pairs of political subreddits
5.12 Network of case study subreddits and their ten closest neighbours
5.13 Barplot of percentile rank of edge weights between changemyview and political subreddits

List of Tables

3.1 Polarity of political subreddits in order of date created
3.2 Description of left-wing political subreddits
3.3 Description of neutral political subreddits
3.4 Description of right-wing subreddits

5.1 Most prevalent topics per community
5.2 Community Descriptive Statistics
5.3 Example contingency table of edges by association
5.4 Top 20 subreddits in the geo-political community by internal edge count
5.5 Top 20 subreddit pairs in the geo-political community by edge weight
5.6 Top pairs of political subreddits by residual value
5.7 Top 10 subreddits changemyview shares an edge

A.1 Frequencies of all subreddit topic labels
A.2 Chapter 4 raw values for the case study subreddits
A.3 Community edge weights
A.4 Community topic counts
A.5 Closest neighbours of the case study subreddits

Abstract

This thesis examines whether echo chambers exist on Reddit, the self-titled ‘front page of the internet’. As a social media platform Reddit is widely popular in Western countries but relatively unknown and understudied by academics. This research will show that the structural features of the platform, in particular its organisation into distinct topic-based communities called subreddits, make it an ideal space to empirically test the existence of echo chambers on social media. In this research we consider how the network of subreddits naturally self-organises into topical ‘public sphericules’ of like-minded communities. Whereas most previous research has treated echo chambers as binary, here we treat echo chamberness as a spectrum from greatest chamberness to least. To achieve this, a key methodological advantage of the research is the use of multiple simple measures of chamberness applied to the 1,000 most active subreddits during our period of study, January 2019. By creating a background distribution of ‘expected’ levels of chamberness, we can understand the relative chamberness of any single subreddit. This enables us to consider at what point the natural process of interest homophily crosses from useful public sphericules into problematic ‘meta echo chambers’.

We focus our research on the set of political subreddits, and in particular two key case study subreddits: The Donald, a group of supporters of United States President Donald Trump, which self-characterises as an echo chamber by explicitly banning dissenting views; and changemyview, a subreddit where people go to have their questionable opinions changed, which seeks to be an ‘anti-echo chamber’. Through this research we find that, contrary to our expectations, The Donald is not actually the echo chamber it set itself up to be, and that the most active participants of changemyview show a notable bias towards alt-right contrarian discourse, which we would not expect of a broad-minded anti-echo chamber.

The research also more broadly considers whether social media echo chambers are exclusively the concern of political content, as the majority of existing literature suggests. Through the examples of the sports and porn communities, which are both internally densely connected and largely isolated from the wider subreddit network, we illustrate how content which is not socio-political can still fall victim to the concerning behaviours which motivate our fears of echo chambers. This thesis will show that the concerns which motivate the study of social media echo chambers (increasing polarisation, fragmentation, and loss of common ground between opposing groups) affect many of our views, not only the political.

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright Statement

1. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

2. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

3. The ownership of certain Copyright, patents, designs, trademarks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

4. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=24420), in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.library.manchester.ac.uk/about/regulations/) and in The University’s policy on Presentation of Theses.

Acknowledgements

Thank you to my supervisors Prof Martin Everett, Dr Termeh Shafie, and Dr Johan Koskinen for your guidance and support throughout this work. Martin, you have been a consistent source of encouragement over the last four years. Johan, your interest in the research from the earliest days provided me with the initial confidence to commit to it. And Termeh, your constant enthusiasm and encouragement has bolstered me as I saw this work through to the end.

The Mitchell Centre for Social Network Analysis and its members have offered me a welcoming academic home. Particular thanks to Prof Nick Crossley and Dr David Schoch for taking the time and effort as my annual reviewers to provide valuable feedback on the earlier stages of this research.

To my PhD colleagues & friends, in particular Dorottya Hoór, Dominika Czerniawska, Pete Jones, Alejandro Espinosa-Rada, Francisca Ortiz-Ruiz, Anna Sanders, and Joanna Flavell: my PhD experience has been greatly improved by being able to celebrate the ups and commiserate the downs with you all, usually at Sandbar.

As much as anyone, Andy Jones, thank you for being a true partner and growing with me these last several years. You help me keep things in perspective, encourage me to do the hard things when I don’t want to, and inspire me to want to do the even harder things.

Lastly to my family, the Alburys, Careys, Peterkins, and Roberts’, thank you for your laughter and warmth. It’s one of the things I miss most about the Bahamas. And to my parents, Ron & Suzie Guest, I offer endless thanks for your limitless love and belief in me. There aren’t enough words to express how much I appreciate everything you have done and continue to do for me. I would not be where I am without you.

Chapter 1

Introduction

Do echo chambers exist on ‘the front page of the internet’? This thesis seeks to understand the extent to which echo chambers exist on Reddit, the self-styled ‘front page of the internet’ and one of the most popular social media platforms in the world. It will show how the study of online echo chambers over the last two decades has been inhibited by the absence of an early formalised theory and, as such, has lacked a clear working definition. As a result much of the evidence base is divided on what methods can adequately determine the existence, or lack thereof, of echo chambers on social media. There is also a lack of consensus on whether echo chambers are uniquely a concern for political content, or whether they extend to any opinion-based issue on which we can become polarised over time.

The following research will make three key contributions to this research gap. First, we will capitalise on recent developments in the literature to operationalise the theory of echo chambers not as a binary phenomenon but as a spectrum of ‘chamberness’ (Bruns 2019a). We will define multiple simple measures of chamberness informed by the literature on echo chamber theory and test their real world utility. Second, this research will apply these measures across the complete network of active users on a single platform, Reddit, which has received little attention in academic research on echo chambers. The research will show that, apart from being popular but understudied, Reddit offers the opportunity, and access, to compare the chamberness of a wide range of topics. This leads to the third key contribution of this research: by applying chamberness measures across a range of topics, we can contextualise the findings for political subreddits. This will allow us to interpret the relative differences in chamberness between political subreddits (i.e. do left wing or right wing subreddits echo more?). It will also enable us to consider whether non-political topics can also be considered echo chambers.

This introductory chapter is organised into three sections. Section 1.1 summarises the background and motivation for this research. It gives an overview of the concept of online echo chambers and their importance in understanding the relationship between social media and the current political landscape. We will then move on to an introduction to Reddit, its significance as a popular but non-mainstream platform, and the key features which make it an ideal case study for echo chamber research. Section 1.2 will then present the aims of the research. It will present the four leading research questions and the computational social science approach which has guided the research methodology. Finally, Section 1.3 will outline the contents of each of the following chapters in the thesis.

1.1 Background and motivation

1.1.1 Social media echo chambers: fact or fiction?

It is often suggested, both in academic research and the wider media, that social media is leading to greater political fragmentation and polarisation (Sunstein 2018). A key contributor to this fragmentation is believed to be the new modes of consuming media which allow us to enter into ‘information cocoons’ (Bruns 2019b; Borgesius et al. 2016; Eady et al. 2019; Nyhan 2014). There are two related but distinct kinds of information cocoons: echo chambers, which are the focus of this research, and filter bubbles, which are not. Filter bubbles are the result of recommendation engines suggesting personalised content based on our previous actions, which can lead to a narrowing of the range of content we consume over time (Pariser 2011). They occur, for example, when a platform’s algorithms prioritise content in our newsfeed and recommend new people and pages for us to friend or follow. Echo chambers, by comparison, are more directly social. They result from us engaging with groups of individuals whose views we share and avoiding those we disagree with.

This research will focus on the information cocoons of echo chambers. While filter bubbles depend on the recent technological advances which enable platforms to serve users with individual recommendations, fears of echo chambers predate social media. The underlying cause of echo chambers is selective exposure, the process by which we choose to expose ourselves to media which aligns with our existing interests, opinions, or beliefs (Stroud 2008; Guess et al. 2018; Guess et al. 2018). Selective exposure is related to the concept of homophily, that birds of a feather flock together, but emphasises the relationship by which we select not to expose ourselves to opposing viewpoints.

The term echo chamber was first used in this capacity in the early 2000s to describe the disintermediating effect that new media would have on how citizens consumed media (Sunstein 2001). However, fears of the online echo chamber effect have grown in recent years, and have become a point of discussion not just in academia but also in the wider media (Moore 2019; Stelter 2020; El-Bermawy 2016). This is largely tied to a rise in populism and to numerous surprising and contentious election results in Western countries, most notably the 2016 United States Presidential Election and the United Kingdom’s referendum on European Union membership. Both of these votes remain highly divisive and were shocking losses to many left-leaning voters (Chater 2016; Groshek and Koc-Michalska 2017).

In the wake of those results there has been a deluge of research into the existence of echo chambers on social media. Much of that research has found evidence of echo chambers within specific boundaries (Vicario et al. 2016). Some has shown that online echo chambers reflect existing ones offline (Bastos et al. 2018). Other studies find that users are more likely to encounter a diverse range of views online than in their day to day, face to face interactions (Dubois and Blank 2018a; Guess et al. 2018). Overall, the extent to which online echo chambers are real and a distinct problem from offline echo chambers is still being debated. This research will provide evidence of the extent to which echo chambers exist on a different platform, one which makes a significant but under-recognised contribution to wider online media: Reddit.

1.1.2 Reddit: self-proclaimed ‘front page of the internet’

As is the case for most research into social media, the majority of empirical research into echo chambers has focused on Facebook and Twitter, two of the most popular online platforms in the world in terms of user base and brand recognition. While Facebook is by far the most popular social network in the world with almost 2.5 billion monthly active users, Reddit has had more global monthly active users than Twitter since 2018 (430 vs 386 million as of April 2020) (Statista 2020; Perez 2019). However, Reddit is still relatively less known by the general public. This has led to it being referred to as one of the ‘biggest [s] you’ve never heard of’ (Marantz 2018). For this reason alone Reddit provides a rich opportunity for new research, but there are also many unique attributes to the platform which make it an ideal site to study online echo chambers.

Reddit is a social news site. It was originally intended to be a place where users posted links to content from other parts of the internet which they found interesting. Since its inception Reddit has branded itself as the ‘front page of the internet’, providing a bridge between its users and the rest of the web. By having users surface content rather than using professional editors to curate the platform, Reddit relies on social collaboration to collate a constant stream of fresh content which is organised into specific topics. In the style of similar message board based platforms, user contributions are organised into distinct topic-based communities, called subreddits. Any Reddit user, called a redditor, can create a subreddit devoted to any niche topic. Any redditor can then post links to relevant content across the web, or make a text post to discuss an element of the topic. Each post then has a comment section in which users can discuss the original post.

For example, if I create a subreddit called reddit echo chambers, other users might contribute links to news articles referencing echo chambers on Reddit, or make a text post discussing whether or not there is evidence of echo chambers on Reddit. We could then discuss this content in the comment sections of the posts.

The topic-oriented subreddit structure is the first attribute which clearly distinguishes Reddit from most other popular social media, and makes it a highly relevant place to study echo chambers, as we will see. First we must understand the slight but nuanced distinction between social media and social networking sites. Social media can refer to any website which depends on user-generated content. Social networking sites (SNSs) are a subset of social media which are centred on user profiles and the networks formed by friending (e.g. on Facebook) or following (e.g. on Twitter) other users (Boyd and Ellison 2007). Social media can use some features of SNSs without being structurally designed as a social network. For example, in 2018 Reddit introduced the ability for redditors to ‘friend’ each other, but the process of social networking is not a core function of the platform. Reddit, like other social news sites, is therefore not based on the direct relationships between users that are fundamental to social networking sites; instead, interactions between users are mediated by the subreddits in which they co-participate.

Instead of being oriented around user profiles, Reddit is oriented around subreddits. Subreddits themselves are user-defined, distinct, and clearly bounded topic-based communities. This provides a clear unit of analysis, and a unique opportunity to study a complete network of user interactions between clearly defined and bounded topical communities. A core methodological challenge in previous echo chamber research has been in locating and determining the boundaries of suspected echo chambers. These studies often start with an initial subject and then sample outwards. However, by being able to map out a complete range of communities we can take a more top-down approach. This will allow us to understand the context of suspected echo chambers within the wider platform, and also allow for the discovery of previously unsuspected echo chambers.

As a platform Reddit is a strong supporter of free speech, almost always allowing subreddits which cover very troubling and contentious content to remain on the site (Marantz 2018; Staff 2019). The majority of moderation on the site is performed by ordinary redditors who volunteer to moderate subreddits of which they are members. Therefore each subreddit is allowed to develop as a unique, self-run community with minimal external oversight. Most subreddits also feature written descriptions and rules which are developed by their moderators to briefly introduce the subreddit to newcomers (Reddit FAQ 2020). These written summaries allow anyone to get a quick, high level understanding of how a subreddit defines itself.

Echo chambers on Reddit

Reddit is generally considered to be politically left-leaning and there is ethnographic evidence to support this (Massanari 2015b). It is, therefore, surprising to many users that it is also home to one of the largest online fan communities for the Republican US President Donald Trump, the subreddit The Donald. As we will see in this research, The Donald is in fact one of the largest political subreddits by number of active authors. As a result of its status as an anomalous but large right-wing hub on an otherwise left-leaning platform The Donald has attracted accusations of being an echo chamber separate from and opposed to the rest of Reddit.

However, to date there has been no empirical research to determine whether that is the case. As we will see, the subreddit’s own rules show that it is fundamentally designed to become one. Within its sidebar it is clearly stated that moderators will ban dissenting opinions. But it remains to be seen whether, or how, that exclusion principle has actually led to the subreddit becoming an echo chamber.

Reddit also offers us a unique opportunity to study a self-defined ‘anti-echo chamber’. The subreddit changemyview is devoted to people posting an opinion which they hold, which is often controversial, and on which they are willing to have their mind changed. These contributors must be willing to engage in critical discussions of the issue and are encouraged to reward commenters who successfully change their original point of view. The creator and lead moderator has stated that he created the subreddit specifically to act as an ‘anti-echo chamber’, a place where people come to have rational disagreements of opinion and to be open to having their views changed. While academic research of changemyview has studied the dynamics of interactions within the subreddit (Tan et al. 2016), none has examined, or questioned, its position as an ‘anti-echo chamber’.

This research was initially inspired by these two fascinating and oddly complementary subreddits: The Donald as a self-defined echo chamber and changemyview as a self-proposed anti-echo chamber. I sought to question whether these communities actually empirically fit the roles they described for themselves. I also saw that this presented an opportunity to draw on existing theories and definitions of echo chambers to test different measures of the level of chamberness of any given subreddit.

1.2 Research aims

Broadly this research sets out to systematise the study of echo chambers to determine to what extent, if any, they exist on one of the largest social media platforms in the world. More specifically, the aim of this research is to address gaps in the literature on social media echo chambers in three key ways.

Firstly, it will take a computational social science approach to develop multiple simple empirical measures by operationalising the theory of chamberness as a spectrum rather than a binary.

Secondly, it will provide cross-platform analysis of a very popular social media site which has yet to attract much attention in echo chamber research. The advantages of using Reddit, in particular the subreddit structure and open data access, will allow for the generation of background distributions of a large number of communities, covering a range of topics. This will allow us to contextualise chamberness within the wider network and interpret the different ways in which any one subreddit is chamber-like.

Thirdly, this will allow us to consider whether echo chambers are uniquely a problem of political content and communities. By measuring chamberness on a scale and analysing subreddits across a range of topics, we can see how chamberness manifests for non-political subreddits. Referring back to existing echo chamber theory we can then consider whether this chamberness is simply a natural, innocuous extension of the interest-based homophily we expect to find in topic-based communities, or whether increasing polarisation of non-political topics should also be of greater societal concern.
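To make the idea of a background distribution concrete, the following is a minimal sketch in Python rather than the analysis code used in this thesis. It computes one illustrative per-subreddit measure (the median share of each author's comments that fall inside that subreddit) and then places a single subreddit on the distribution of all subreddits as a percentile rank. The function names, the input format of (author, subreddit) pairs, and the mid-rank treatment of ties are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from statistics import median


def median_in_subreddit_proportion(comments):
    """Return {subreddit: median share of its authors' comments made there}.

    `comments` is an iterable of (author, subreddit) pairs, one per comment.
    This input format is a hypothetical simplification of the raw data.
    """
    # Count comments per author per subreddit.
    counts = defaultdict(lambda: defaultdict(int))
    for author, subreddit in comments:
        counts[author][subreddit] += 1

    # For each author, the proportion of their comments in each subreddit.
    proportions = defaultdict(list)
    for by_subreddit in counts.values():
        total = sum(by_subreddit.values())
        for subreddit, n in by_subreddit.items():
            proportions[subreddit].append(n / total)

    # Summarise each subreddit by the median over its authors.
    return {s: median(values) for s, values in proportions.items()}


def percentile_rank(value, background):
    """Percentile (0-100) of `value` within `background`, ties at mid-rank."""
    ordered = sorted(background)
    below = bisect_left(ordered, value)
    equal = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * equal) / len(ordered)
```

Under these assumptions, a subreddit whose measure sits near the top of the background distribution would be read as relatively chamber-like on that measure, while one near the middle would fall within the 'normal' range for the platform.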

1.2.1 Research questions

The core research of this thesis is organised into two sections, first looking at intra-subreddit behaviours, then looking at inter-subreddit behaviours. In total there are four specific research questions, two at each level.

The first set of subreddit-level measures treats each subreddit as the group of authors who contribute to it, and explores the way these authors behave within their home subreddits and across the wider platform. These measures are guided by the general question of whether any single subreddit is an echo chamber. The two subreddit-level research questions integrate the general trends of the proposed measures of chamberness, and each has an echo chamber and an anti-echo chamber hypothesis.

The first research question is based on the principle of reinforcement bias, a factor in selective exposure and a leading cause of echo chambers in the literature. It asks: how does author within-subreddit participation vary between subreddits? Reinforcement bias suggests that members of an echo chamber will be attracted to participate more in the subreddits which reinforce their views and do not raise uncomfortable cognitive dissonance. The first echo chamber hypothesis is thus that echo chamber subreddits will have higher author within-subreddit participation than other subreddits. Conversely, we expect that the anti-echo chamber will have lower author within-subreddit participation than other subreddits, as contributing to the anti-echo chamber takes more effort in terms of the time and consideration required to make the high level of arguments expected in the subreddit.

The second research question addresses the second component of selective exposure, the avoidance of dissenting views. The question asks: how actively do authors of a given subreddit participate between all subreddits? Taking all other factors to be equal, if members of the echo chamber practise avoidance of views and content with which they do not agree, we would hypothesise that authors in echo chambers will have less varied participation between subreddits. In other words, they would participate in fewer subreddits overall, and spend relatively less of their time in those other, non-reinforcing subreddits. By contrast, we hypothesise that anti-echo chamber authors will have more varied participation between subreddits because they are not restricted by the avoidance principle and are more likely to actively choose to engage with a greater variety of content.

In answering the first two research questions we can create a distribution of behaviours across subreddits. For each of the measures used we will be able to determine a ‘normal’ range of behaviours. By examining where any given subreddit sits on the spectrum we can then determine its relative level of chamberness for each measure. The anti-echo chamber hypothesis will be applied to changemyview to test whether it actually differs quantitatively from other subreddits, also acting as a form of control against the political subreddits. The echo chamber hypothesis will then be tested not just for The Donald but for all of the political subreddits identified during the data collection process. This will allow for an unbiased comparison of chamberness across the political spectrum on Reddit, from The Donald on the right to anti-Trump and other assorted subreddits on the left.

Once we have determined to what extent any of the political subreddits may exist as echo chambers in themselves, we then move on to the second part of the research and consider the interrelations between subreddits. Network analysis has been the most popular approach to identifying echo chambers on social media (Adamic and Glance 2005; Zollo et al. 2017; Bastos et al. 2018). Social network analysis allows us to understand social systems by the relationships between actors (Borgatti, Everett, et al. 2018). It is, therefore, the method best suited to examine the key mechanism of echo chambers (Borgatti, Mehra, et al. 2009): namely, the presence of close connections within the chamber and, most importantly, the absence of connections to alternative spaces.

Building on the many previous informative applications of network analysis to the problem of echo chambers, the second part of the research considers the complete network of subreddits and examines the structures between them, specifically considering their topical relatedness. Again, this part has two research questions. Here our first research question, and third overall, considers whether public sphericule theory holds for Reddit. Specifically, can we identify topical public sphericules among the top subreddits? To do this we create a new measure of chamberness similarity, which signals how many more co-authors any pair of subreddits share than we would expect given the similarity of the content they share. Even after attempting to control for the relatedness of textual content between subreddits, we find that most subreddits still share the largest author overlaps with subreddits of similar topics.

This enables us to address the fourth and final research question: can we identify ‘meta-echo chambers’ among those public sphericules? By meta-echo chambers we mean groups (or pairs) of highly topically related subreddits that are highly interconnected (i.e. are potentially reinforcing) and also have a notable absence of ties with subreddits with dissenting views (i.e. avoiding). The complete network approach gives us two key advantages in identifying meta-echo chambers. First, we will be able to compare subreddits across the political spectrum to determine whether echoing is a concern on either, or both, of the political poles on Reddit. The research will show that, contrary to expectations, The Donald is not part of a right wing meta-echo chamber, though this is most likely due to the absence of any similarly chamber-like right wing subreddits large enough to be included in the analysis. However, we do find that The Donald may in fact be the target of opposition for a group of anti-Trump and anti-alt right subreddits which themselves form a left wing meta-echo chamber.

The second key advantage of the network approach to identifying meta-echo chambers is that it allows us to consider whether non-political topics can form echo chambers. In our discussion of the research findings we will explore how non-political public sphericules, particularly the sports and porn communities, could result in problematic polarisation if their fragmentation from the larger network grows over time.
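
The two subreddit-level hypotheses above can be made concrete with a toy calculation. The Python sketch below is illustrative only: the comment counts are invented, and the two quantities it computes (the mean share of each author's comments made in a given home subreddit, and the mean number of distinct subreddits each author comments in) are simplified stand-ins assumed for this example, not the chamberness measures defined in Chapter 4.

```python
from collections import defaultdict

# Hypothetical (author, subreddit) -> comment count data, invented for illustration.
comment_counts = {
    ("alice", "politics"): 8,
    ("alice", "sports"): 2,
    ("bob", "politics"): 5,
    ("carol", "politics"): 1,
    ("carol", "sports"): 3,
    ("carol", "music"): 4,
}


def author_totals(counts):
    """Return total comments and the set of subreddits for every author."""
    totals = defaultdict(int)
    subreddits = defaultdict(set)
    for (author, subreddit), n in counts.items():
        totals[author] += n
        subreddits[author].add(subreddit)
    return totals, subreddits


def subreddit_summary(counts, home):
    """Summarise the authors of one subreddit (assumed to have at least one).

    Returns a pair (within_share, breadth):
    within_share is the mean share of each author's comments made in `home`;
    breadth is the mean number of distinct subreddits each author comments in.
    """
    totals, subreddits = author_totals(counts)
    authors = [a for (a, s) in counts if s == home]
    within = [counts[(a, home)] / totals[a] for a in authors]
    breadth = [len(subreddits[a]) for a in authors]
    return sum(within) / len(authors), sum(breadth) / len(authors)


within_share, breadth = subreddit_summary(comment_counts, "politics")
print(f"within share: {within_share:.3f}, breadth: {breadth:.2f}")
```

Under the echo chamber hypotheses, a subreddit would show a comparatively high within share and a comparatively low breadth relative to the distribution of these values across all subreddits; the anti-echo chamber hypotheses predict the reverse.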

1.2.2 Computational social science approach

This research uses a computational social science approach. The research questions are motivated by social theory from social science and communication and media studies, the literature of which will be explored in the next chapter. The analysis then takes a computational approach to answering the research questions at scale. It is the goal of computational social science in general, and this research in particular, to bridge these two domains.

Research in the computer sciences into social phenomena, especially that which examines user behaviour on social media, can often treat this information source simply as a new form of data. In technical domains, the application of computational methods to rich social datasets such as the one we will be examining often lacks grounding in the vast literature of existing theoretical frameworks available to interpret the societal implications of the data. For example, as we will show in the next chapter, previous research into the topical networks of Reddit which has informed the current research did not offer substantive interpretations of its analyses, or reflect on the user behaviours these uncovered.

Previous computational research which has looked at the networks of similarity between subreddits has often used that insight to build technical systems, such as recommendation systems in which users are recommended new subreddits similar to the ones they have previously visited (Jamonnak et al. 2015; Olson and Neal 2015). In these cases the researchers suggested how their findings could be leveraged to create technological tools which would likely encourage the kinds of troubling fragmentation which we study here.

However, such research rarely reflects on whether those technical insights might fail to provide a social benefit, or might indeed encourage asocial behaviours, such as introducing filter bubble effects to subreddits and users which already have high chamberness. It also does not take maximal advantage of the wealth of insight that these computational methods could provide about the way we behave online. Though it may be interesting to simply confirm that subreddits do group into ‘interest meta-communities’, as we would expect due to interest-based homophily, such a shallow consideration of the phenomena misses out on the more detailed, contextual, and nuanced insights that the social scientific perspective offers. That balance of breadth plus depth is the goal of this research.

Possibly the greatest advantage of the computational aspect of the methods used in this research is that it has enabled the research process to be fully reproducible, a key factor in the growing move towards open science (Munafò and Jacobs 2019; Open Science Framework 2020). As we will see in the later chapter on data and methods, the data used in this analysis is freely and publicly available online. All work, from data collection, through analysis, to the presentation of results, has been conducted using the Python programming language. I have compiled the code developed during the research process into a public code repository available on GitHub. Details on the repository are available in Appendix A.1. The intention of making this repository publicly available is to ensure that the research is fully reproducible, and therefore open to due scrutiny, and that the methods developed are available for use in future research.

1.3 Thesis outline

Chapter 2 provides a more in-depth background to the research. Section 2.1 details the research context of echo chambers, including the origin of the theory, the rise in the application of the term, the state of existing evidence, and general criticisms of the theory. It details the suggested causes of echo chambers (homophily, reinforcement bias, and selective exposure) and the key concerns (increasing polarisation and fragmentation of citizens in democratic societies). It draws out the connection between echo chambers and the Habermasian public sphere, with particular focus on more recent work on the fragmentation of the public sphere into many public ‘sphericules’ on social media.

Section 2.2 provides deeper background on Reddit as a platform. It explains the use of pseudonymity and draws on the literature on context collapse on social media. We also consider the platform’s free speech ethos and how this has allowed communities to develop organically, setting their own rules and limitations. Section 2.3 then considers the breakdown of topic interests on Reddit. It shows the relationship between existing work on interest ‘meta-communities’ on Reddit and public sphericules on Twitter to provide a framework for the original research presented in later chapters. Here we will also more thoroughly consider the state of political discourse on Reddit, particularly in light of its free speech ethos. We will provide background on The Donald, a self-characterised echo chamber with a contentious history with Reddit administrators, as well as changemyview, our potential anti-echo chamber.

Chapter 3 describes the data collection and processing used to prepare for the analysis presented in later chapters. Section 3.1 outlines Reddit’s very open public data sharing policy and the availability of complete monthly datasets of all comments made on Reddit, available via the BigQuery platform. This research uses the January 2019 dataset, the most recent available at the time final analysis began. Section 3.2 details how a subset of the data was defined to provide a reasonable level of granularity and to allow the researcher to develop some familiarity with all subreddits under analysis. The top 1000 subreddits by monthly author count were selected, and for each of these we focus on the 25% most active authors.

Section 3.3 describes how the data was prepared for analysis. After subsetting, the raw data was transformed into matrices. First, a bipartite matrix of subreddits and authors was weighted by comment count. This was used to create a unipartite weighted co-authorship matrix between subreddits. Then, focusing on the text of comments, we constructed a second bipartite matrix of subreddits by terms in the text, weighted by the term frequency-inverse document frequency measure. Pairwise cosine similarity was then used to create a second unipartite matrix of text similarity between subreddits. Section 3.4 then describes the process of manually labelling each of the 1000 subreddits by high level topics. It also provides extra descriptive information for the subreddits labelled as political.

The first analysis chapter, Chapter 4, focuses on the subreddit-level measures of chamberness by addressing the first two research questions and sets of hypotheses. Section 4.1 outlines the research questions and hypotheses in greater detail with reference to the existing literature and introduces the measures used to answer the research questions. Section 4.2 takes us through the results for the first set of chamberness measures, those for within-subreddit author participation. First we view the general trends (i.e. the spectrum of chamberness), then we focus on the results for the political subreddits and the anti-echo chamber. Section 4.3 then follows the same procedure for the second research question, thus focusing on the between-subreddit chamberness measures. Finally, Section 4.4 draws the subreddit-level findings together and explains why a networked approach is then needed to understand whether subreddits are part of ‘meta-echo chambers’.

The second analysis chapter, Chapter 5, then builds on the findings of the previous chapter to understand the networked relationships between subreddits and to address the third and fourth research questions. Section 5.1 re-introduces the discussion of public sphericules and interest meta-communities from Chapter 2, providing context for network relationships as indicative of chamberness. We then also re-introduce the co-authorship and text similarity matrices from Section 3.3, which are used to create a combined measure of chamberness similarity to highlight pairs of subreddits with much higher than expected author overlap given their text similarity.

Section 5.2 then takes us through the process of mapping the complete network, using our measure of chamberness similarity as the edge weight between subreddits and only focusing on those with exceptionally high levels of chamberness. Here we also briefly review the key social network theory and analysis techniques used in the research. Specifically, we use community detection to identify potential public sphericules in the network. By applying the manual topic labels described in Section 3.4 to validate the topical relatedness of the communities detected, we are able to confirm the existence of topic-based public sphericules on Reddit.

To return to the question of meta-echo chambers, Section 5.3 then focuses in on the community which holds all political subreddits. Here we examine the ties amongst political subreddits, as well as how they connect to other subreddit topics, to determine whether political meta-echo chambers exist on Reddit. We ultimately find that while The Donald is not part of a meta-echo chamber, the anti-Trump and anti-alt right subreddits may form a left wing meta-echo chamber. Section 5.4 will then summarise the results of the chapter and begin to tie them back into those of the previous chapter.

Chapter 6 will bring together the findings of the two analysis chapters, interpreting them in light of the research background provided earlier in the thesis. Section 6.1 gives a detailed discussion of each of the research questions and findings in turn. These are then brought together in Section 6.2 to examine the results of the political subreddits across the spectrum, from right to left, determining the extent to which they are, or exist within, echo chambers. Section 6.3 then returns to the question of whether echo chambers are inherently political. Here we discuss the sports and porn communities identified in the network and how these insular, disconnected public sphericules could raise echo chamber concerns if their members become polarised over time.

Section 6.4 then discusses the overall implications of the research presented in this thesis. It outlines the strengths of the research and how these enabled it to address the gap shown in the wider literature. We then reflect on the limitations of the research and discuss how these could be improved upon in future research. Finally, Chapter 7 will conclude the thesis by summarising the outcomes of the research, showing how the goals of the thesis have been met, and how the work contributes to our understanding of social media echo chambers and new ways to identify them.

Having provided a general introduction to the original research that will be presented in this thesis, let us now gain a better understanding of the context of this research. Specifically, the next chapter will review the relevant theoretical and empirical research into echo chambers, as well as the key features which make Reddit a valuable site of study.
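
The matrix construction summarised above for Section 3.3 can be sketched in a few lines of Python. The example is a minimal illustration using only the standard library and invented toy data. The function names, the use of shared-author counts as the co-authorship weight, and the particular TF-IDF formula are assumptions made for this sketch rather than a description of the code in the thesis repository.

```python
import math
from collections import Counter
from itertools import combinations

# Toy comments: (author, subreddit, text). Invented for illustration.
comments = [
    ("alice", "politics", "vote election senate"),
    ("alice", "news", "election result vote"),
    ("bob", "politics", "senate vote"),
    ("bob", "soccer", "goal match league"),
    ("carol", "soccer", "match goal"),
    ("carol", "news", "election news"),
]


def bipartite_counts(rows):
    """Subreddit -> author -> comment count (the weighted bipartite matrix)."""
    matrix = {}
    for author, subreddit, _ in rows:
        matrix.setdefault(subreddit, Counter())[author] += 1
    return matrix


def coauthorship(matrix):
    """Unipartite projection: number of authors shared by each subreddit pair."""
    return {
        (a, b): len(set(matrix[a]) & set(matrix[b]))
        for a, b in combinations(sorted(matrix), 2)
    }


def tfidf_vectors(rows):
    """Subreddit -> term -> TF-IDF weight, treating each subreddit as one document."""
    term_counts = {}
    for _, subreddit, text in rows:
        term_counts.setdefault(subreddit, Counter()).update(text.split())
    n_docs = len(term_counts)
    # Document frequency: in how many subreddits each term appears at least once.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    return {
        sub: {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        for sub, counts in term_counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


authors_by_subreddit = bipartite_counts(comments)
shared_authors = coauthorship(authors_by_subreddit)
vectors = tfidf_vectors(comments)
text_similarity = {
    (a, b): cosine(vectors[a], vectors[b]) for a, b in shared_authors
}
```

Comparing the two resulting matrices pair by pair (author overlap against text similarity) is the starting point for the combined chamberness similarity measure outlined for Chapter 5.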

Chapter 2

Research Background

This chapter provides context to the research of this thesis. It gives an overview of key motivating theories from social science and communication studies, summarises the existing evidence (or lack thereof) of online echo chambers, and briefly outlines relevant existing network analysis in the area. In this chapter we will see that there has been a chasm between compelling social theory with limited empirical evidence on the one hand and large scale network analysis with little theoretical motivation on the other. The research outlined in the later chapters will aim to navigate the middle ground that is a computational social science approach.

The first section of this chapter gives an overview of echo chamber theory, the current state of evidence, and in particular the key limitations of the existing evidence base. The following chapters will then operationalise the most developed definition of the term ‘echo chamber’ to provide empirical evidence of the extent to which (if at all) echo chambers exist on one of the largest social media platforms in the world.

The second section then introduces Reddit, the social media platform which is the focus of this work. It gives a brief introduction to the history of the platform and the key differences between Reddit and more widely known platforms such as Facebook and Twitter. A key difference discussed will be the organisation of the website into topic-based message boards called ‘subreddits’. We will also examine the practice of pseudonymity on Reddit to understand how identity is performed, as this has a substantial influence on the interpretation of trace data used in computational methods such as ours.

The third and final section of this chapter will give an overview of existing research on Reddit culture, with particular focus on the main topic interests on the platform. This will include previous research which used community detection to identify ‘interest meta-communities’ on Reddit. Finally, I will provide context on two subreddits which are central to our study of chamberness on Reddit: r/The_Donald and r/changemyview.

2.1 The emergence of social media ‘echo chambers’

In 2001 the American legal scholar Cass Sunstein used the term ‘echo chamber’ to describe a troubling trend he suspected would develop as citizens began to participate in political discourse online (Sunstein 2001). Sunstein did not provide a definition for the term, but used it as an offhand metaphor, more a choice of creative language than the introduction of a new theoretical term. Since then, however, the term echo chamber has been picked up and widely used in academia, particularly within communications and media studies and the political and social sciences, as well as in the wider media. This is despite the lack of any clear, or commonly agreed on, theoretical grounding, as we will soon see.

The numerous surprising and still highly contentious political events of 2016, including the United States’ Presidential election and the United Kingdom’s Brexit referendum, led to increasing discussions amongst academics and society at large as to why the outcomes of these votes were so surprising for the losing sides. A commonly suggested cause has been the rise of online echo chambers in the social media era (Hooton 2016; Chater 2016).

In this section I will give background to the origins of the term echo chamber as applied to online, particularly political, discourse and its foundations in the Habermasian concept of the public sphere. We will then briefly look at how the application of the term has morphed and varied over the last two decades with the rise of social media to become a ubiquitous, albeit loosely defined, concept. Finally we will look at the most promising evidence for and against the existence of social media echo chambers. This will highlight the challenges previous research has faced in clearly operationalising the term. We will then move on in the following section to introduce the social media platform Reddit, to understand why it provides a unique, and promising, opportunity to test the echo chamber theory.

2.1.1 The public sphere(s)

Before addressing the specifics of what an echo chamber is, and how it might come into being, it is important to understand why echo chambers are a concern, if they do exist at all. Why should we care about echo chambers? Understanding the origins of the term will allow us to arrive at a much needed clear definition.

Sunstein’s original work, Republic.com, is centred on what he refers to as ‘the public forum doctrine’ and the ways in which new media was expected to change the way citizens consume news (Sunstein 2001). The public forum doctrine as described by Sunstein is a term from legal theory which states that in countries that support free speech, governments must ensure that unrestricted speech can occur freely in public spaces. While traditional examples of public spaces would be streets and parks, Sunstein explores how these principles should be applied within the newfound domain of the internet.

Sunstein emphasises that occupants of a public space who would otherwise prefer not to hear a certain form of speech do not have a right to be protected from hearing that speech, except for limited cases depending on local laws such as those against inciting violence. This builds to his central premise that it is essential for citizens of a deliberative democracy to be informed on a wide range of topics and viewpoints, particularly those that they did not plan on, or were specifically opposed to, seeing. This he refers to as the ‘affirmative’ side of free speech (Sunstein 2018).

In summary, according to Sunstein it is the obligation of citizens in democracies to listen to a range of opinions, on a range of topics, even some which they have no interest in, or even an aversion to, hearing. This is, of course, a democratic ideal that is unlikely to ever be fully reflected in society. However, in Sunstein’s view, general interest intermediaries in the media (in particular large broadcasters such as the BBC, CNN, or Fox News) were for a time ‘the unacknowledged public forums [of the world]’ (Sunstein 2001, p. 34). That was, however, until the rise of new media, which decentralised and fragmented our news consumption.

Written in 2001, Sunstein’s introduction of the term echo chamber predates social media and instead focuses on how the rise of blogging would challenge the traditional role of mainstream media as gatekeepers of news. Although a companion book focusing on social media, #Republic: Divided Democracy in the Age of Social Media (Sunstein 2018), was released in 2018, the earlier Republic.com is still the seminal text out of which most research on social media echo chambers developed.

The public forum doctrine has parallels with the concept of the ‘public sphere’, articulated by German sociologist Jürgen Habermas in 1962 (Habermas 1991), which remains a core tenet in communication and media studies. The public sphere is the more familiar theory in the social sciences and communications studies. However, the two theories are opposed in how they perceive the merits of news gatekeepers.

In Sunstein’s framing, general interest intermediaries fill the beneficial role of providing a common public forum. Habermas’ public sphere, in comparison, asserts that news intermediaries provide too much gate-keeping, that “mediated political communication is carried on by an elite” (Habermas 2006, p. 416) which overly restricts the public to the role of passive audience. Habermas states that the true public sphere is limited in modern society by the influence of money and corporate interests within the very media which Sunstein applauds as providers of a public forum. Sunstein and Habermas would both agree that mass media allows select actors to carry out political deliberation on behalf of the public, and, of key import, set the agenda of public discourse.

Whilst Sunstein focuses on the loss of shared discourse that results from relying less on mass media, Habermas emphasises that social media should offer the opportunity to rectify the limitations of this gate-keeping model by giving citizens a forum to voice their opinions and reach their own audiences (Habermas 2006). This highlights a central tension in modern communication theory: the trade-offs between democratisation, information overload, and fragmentation. As the internet allows everyone a platform to voice their views, exponentially more content is created, and as consumers we are overwhelmed by possible choices. When we therefore have to decide which information to consume, and from which sources, we are likely to choose the content, topics, and sources which are already appealing to us, leading to fragmentation in consumption patterns.

In the succinctly named 2015 paper ‘Is Habermas on Twitter?’ media scholars Tim Highfield and Axel Bruns use the term ‘public sphericules’ to account for this fragmentation of public discourse into increasingly niche interest groups on social media. The authors suggest that by “unpacking the traditional public sphere into a series of public sphericules and micro-publics” (Highfield and Bruns 2015, p. 78) we can move beyond the limited theoretical construct of ‘the’ public sphere to account for the breakdown of broad domain publics such as the political, cultural, business, sporting, etc. Later in this chapter we will see how this framing of broad, topic-based publics clearly maps onto the interest-based structure of Reddit.

By disassembling the public sphere into many public sphericules, Highfield and Bruns expose the inherently networked structure of online discourse. Public sphericules offer a “multifaceted model that allows for connections and overlap between a multitude of coexisting public spheres” (Highfield and Bruns 2015, p. 3, emphasis added). These connections suggest a collection of interconnected communities, whilst the overlaps between spheres highlight that inter-topic fragmentation is of limited concern as “few participants are likely to be interested only in politics but not in culture” (Highfield and Bruns 2015, p. 6), for example.

Later in this chapter we will examine two key pieces of prior research which use network analysis and community detection to expose these connected and overlapping public sphericules, one on Twitter and one on Reddit. But first we must consider the relationship between public sphericules and echo chambers, specifically how the former can develop into the latter.

Fragmentation fears

Sunstein’s original passing use of the echo chamber metaphor was a tool to illustrate worries of increasing group polarisation online. “This phenomenon raises serious questions about any system in which individuals and groups make diverse choices, and many people end up in echo chambers of their own design” (Sunstein 2001, p. 96, emphasis added).

From this single statement a body of echo chamber research has followed, some of which we will soon review. However, much of this work has not offered a more formal definition of what exactly such echo chambers look like and, as such, what clear methods we can use to identify them. This has given rise to inconsistencies in how the term is applied and how evidence has been interpreted. In his critical review of the term, Bruns has provided the clearest definition of an echo chamber, which will be the basis of our research:

An echo chamber comes into being when a group of participants choose to preferentially connect with each other, to the exclusion of outsiders. The more fully formed this network is (that is, the more connections are created within the group, and the more connections with outsiders are severed), the more isolated from the introduction of outside views is the group, while the views of its members are able to circulate widely within it (Bruns 2019b, p. 29)

This formal definition allows us to study potential echo chambers more systematically. We can measure the degree of ‘chamberness’, or communicative enclosure, of a group. This in turn raises follow-up questions overlooked by previous research, which treated echo chamberness as a binary: a community or group either was an echo chamber or it was not. By conceiving of chamberness as a spectrum we can then interrogate the processes that create and reinforce echo chambers over time. For example, when does mild preferential attachment to like-minded others become dysfunctional avoidance of alternative perspectives?

As such, we can define one of the central research focuses of the current work. To what extent do political subreddits show evidence of greater ‘chamberness’ than non-political subreddits? And how does this vary across the political spectrum from far left to far right? Do communities of political contributors show different levels of preferential attachment, and is this affected by political leaning or polarity? As we will see, the question of political vs non-political content is an important but open-ended question in echo chamber research.

Bruns’ work also addresses the related phenomenon of filter bubbles. Filter bubbles are the result of algorithmic curation that allows each of us to be served a custom selection of content (Pariser 2011; Borgesius et al. 2016). To some extent as users we opt in to this curation, such as by selecting who to follow on Twitter or which friends to unfollow (or mute) on Facebook. But beyond these choices, what content we are shown, and in what order it is shown, are the result of filtering and sorting algorithms designed to serve us what they predict we will most respond to.

A deeper exploration of filter bubbles is beyond the focus of this research. However, it is important to note that together the two distinct but complementary terms, echo chambers and filter bubbles, have been referred to as ‘information cocoons’. They are both the result of recent technological advancements which have significantly affected the way we access and consume information online. Some of these changes have been negative, and these have been the focus of most of the literature which we will focus on.

Other changes have been much more positive. In line with the original ethos of the Habermasian public sphere, echo chambers can provide safe spaces for disadvantaged groups to meet and community build. They can be supportive communities where people, often from marginalised groups, can feel free to express their opinions and experiences without fear of retaliation. This is a view supported by O’Hara and Stevens, who show that echo chambers can be beneficial for users if they provide increased autonomy. When users knowingly enter political echo chambers “it strengthens in-group identification by providing an outside ‘other’ that serves as an embodiment of the political enemy” (O’Hara and Stevens 2015, p. 418).

However, as we will see in later chapters, many political communities on Reddit are more ‘anti’ an opposing group or viewpoint than they are focused on positive support or fostering shared values. The subreddit r/fuckthealtright will be one such key example. Rather than offering a safe space for members of the far left to meet and community build without fears of external pressures, such groups can give in to the negative forces which make echo chambers a worrying force for increasing political polarisation and antagonism.

40 2.1.2 Reinforcing and avoiding

There are two key interrelated theoretical drivers behind the formation and continuation of echo chambers: homophily and selective exposure. Homophily is the principle that similar people are likely to be connected, commonly characterised by the saying ‘birds of a feather flock together’ (McPherson et al. 2001). In keeping with the political example, homophily broadly suggests that someone with left wing views is more likely to be friends with other people with left wing views, and less likely to have right wing friends (Colleoni et al. 2014).

Selective exposure refers more specifically to how we choose which content to consume. It has two complementary mechanisms, reinforcement seeking and challenge avoidance. Reinforcement seeking drives us to consume content which aligns with and supports our existing views. At the same time we may seek to avoid content which will contradict and call into question our existing beliefs (Stroud 2008). It is this second mechanism, avoidance of diverse viewpoints, which was the cause of Sunstein’s initial fears of echo chambers and which has been the focus of most research on echo chambers and political polarisation since.

A nationally representative telephone survey of 1,510 households found that Americans were very likely to practise reinforcement seeking when selecting online information sources but did not actively avoid challenging views (Garrett 2009). This research, published in 2009, has been regularly cited as evidence against selective exposure. However, the research relied on data collected in 2004, years before the rise of social media and the current online information environment. The use of self-reporting also raises questions about the reliability of the findings, as confirmation bias has been shown to encourage us to avoid conflict without consciously realising we are doing so (Scharkow 2016).

More recent, larger scale research has shown that challenge avoidance is very common among some communities and that it can be robust to interventions. Zollo et al. researched the polarisation of pro-science and pro-conspiracy Facebook users over a number of years. They analysed the interactions of 54 million Facebook users over a five year period (January 2010 to December 2014) and found two distinct groups, or public sphericules, which they refer to as echo chambers: those who were polarised towards scientific sources and those who were polarised towards conspiracy theories (Zollo et al. 2017).

The researchers then examined the responses of members of the conspiracy echo chamber to the 50,220 debunking posts which they had been exposed to during this period. They found that very few members ever interacted with the debunking posts, and those that did actually had increased rates of conspiracy posts after being faced with the debunking material. These findings suggested that not only did the Facebook users fall into echo chambers of increasing polarisation over time through a process of challenge avoidance, but members of the conspiracy echo chamber were actually driven towards greater polarisation when faced with conflicting evidence.

These findings are in line with those of Bail et al., who found that exposure to opposing views can actually increase political polarisation among American Twitter users (Bail et al. 2018). Using an experimental research design, they found that Republicans became significantly more conservative after being asked to spend one month following a Twitter bot which exposed them to liberal messages. Democrats showed a small but not statistically significant increase in liberality after being exposed to a conservative bot over the same period. This provided more evidence that not only did some groups selectively avoid opposing content but, when made to face it, they were actually polarised more in their original direction.

Now that we have a good understanding of the theorised mechanisms behind the development and increased polarisation of echo chambers, we can turn to an examination of the bodies of evidence for and against their existence.

2.1.3 Echoing in networks

Adamic and Glance’s 2005 paper ‘The Political Blogosphere and the 2004 U.S. Election: Divided They ’ (Adamic and Glance 2005) examined the network structure of communities of liberal and conservative American blogs in the lead up to the 2004 U.S. Presidential election. This research shows that, from its earliest days, the echo chamber problem has been a network homophily problem. The authors found that conservative blogs were linked to each other more frequently, and in a denser pattern, than liberal blogs. Comparing the pairwise cosine similarity of the URLs that blogs cited, they also found that the denser network of conservative blogs was not associated with greater uniformity of content discussed - i.e. it did not suggest a uniform voice. The researchers found that blogs tended to talk more about opposing political figures than their own: 59% of mentions of the Democratic candidate John Kerry were by right wing bloggers and 53% of mentions of George Bush were by liberal blogs. “These statistics indicate that our A-list political bloggers, like mainstream journalists (and like most of us) support their positions by criticising those of the political figures they dislike” (Adamic and Glance 2005, p. 12). This is in keeping with the later work of O’Hara and Steven, who highlighted the utility of the relationship between in-group identification and echo chambers.

Adamic and Glance again used the term ‘echo chamber’ loosely and did not provide a definition. They did use the related term ‘cyberbalkanisation’, which was popular at the time, and which they define as a process of segregation of the internet “where the proliferation of specialised online news sources allows people with different political leanings to be exposed only to information in agreement with their previously held views” (Adamic and Glance 2005, p. 3).

More recently, in their work on public sphericules, Bruns and Highfield used a similar methodology to map the entire Australian ‘Twittersphere’ - the follower network of all Australian Twitter users. They found that “users cluster together in highly connected (but loosely thematic) groups around shared interests”, providing evidence of topic-based public sphericules on Australian Twitter. They note, however, that “the map also demonstrates the fact that few clusters would fit the description of filter bubbles which are far removed and difficult to reach from the remainder of the network” (Highfield and Bruns 2015). The process of topical fragmentation of Twitter had not developed so far as to show densely connected echo chambers becoming farther removed from the overall network.

Bruns and Highfield did find that the cluster of accounts relating to teen culture was the most distant from the rest of the network. However, teen culture was still connected by the ‘gateway’ topic of popular and celebrity culture. The more general topic of popular culture appears to have provided a bridge between the more niche teen culture and the wider network. In more detailed follow up research, Bruns showed that the Australian Twittersphere could be partitioned into clear topic-based subnetworks which represented public sphericules of users and accounts with similar high-level interests (Bruns et al. 2017). However, no matter how closely internally connected these “communities of interest that accumulate around shared topics and identities” became, they “do not detach altogether from the world beyond” (Bruns 2019b, p. 93, emphasis added). By emphasising the principle of disconnection Bruns concluded, thus, that there were no echo chambers among the Australian public sphericules. Later in this chapter, we will see research showing a similar pattern of separate but not divided interest groups on Reddit, the platform focus of this research.

In their research of discourse on social media Williams et al found three distinct groups (Williams et al. 2015). The first two represented polarised echo chambers on either side of the issue, the “sceptics” and the “activists”. Each of these communities was dominated by a single key viewpoint, members expressed largely positive sentiments among each other, and there was little interaction between the two groups. However, the research also identified a third group of “open forum” mixed attitude communities in which both sceptics and activists participated. Unsurprisingly the content of these communities tended to carry much more negative sentiments as the two sides argued with each other. But most interestingly the participants of the open forums tended to hold less strongly polarised views than those of the echo chambers. This evidence lends support to the understanding that members of echo chambers tend to be more polarised and less open to exposure to opposing views. Unfortunately the research did not provide longitudinal analysis and therefore could not offer evidence of the causal mechanism of polarisation and openness to discourse.

With the exception of the Australian Twittersphere, all of the evidence cited so far focuses exclusively on the American socio-political context. Vaccari et al have sought to address the bias of this US dominance in the evidence base by studying the context of German and Italian Twitter users (Vaccari et al. 2016). Both countries have multi-party parliamentary political systems in contrast to the American two-party system. They found that for both countries, 1 in 5 respondents were part of what they referred to as “contrarian clubs”, local networks where political disagreement was common, similar to the open forums found by Williams et al. Though most respondents on average encountered more political content with which they agreed than disagreed, a large minority of them appeared to be using Twitter to engage in political debate.

Elsewhere, research has focused on Brexit, the withdrawal of the United Kingdom from the European Union. Whilst still clearly different from the context of the majority of US-focused research, the issue of Brexit shares a two-sided split with American politics, that between pro-departure Leavers and anti-departure Remainers. Analysis of the consumption of Brexit related news on Facebook in the United Kingdom showed the emergence of two distinct communities based on their main news providers (Del Vicario et al. 2017). The research used automatic topic extraction and to compare how news outlets present the same topic and how readers respond to articles. It focused on mainstream news sources included in the European Media Monitor (Europe Media Monitor 2013; Steinberger et al. 2009). Echoing the previous work of Adamic and Glance on the American political blogosphere, del Vicario et al found that the consumption of Brexit related news fell into two clear, separate camps. Using only the topics of the news articles they were able to clearly distinguish between Leave and Remain readers.

Other research of Brexit related echo chambers, this time on Twitter, found that the chambers largely mapped onto offline, geographic social networks (Bastos et al. 2018). The echo chambers associated with the Leave campaign were associated with geographic proximity; however, the opposite was true for Remainers. This provides important evidence that online echo chambers are not a departure from more diverse face to face interactions, but are often a continuation of the patterns of ideological homophily already experienced offline. Recent survey research in the US, UK, and France has also shown that offline interactions are more associated with moving into right wing populism than online media behaviours, providing counter evidence to suggestions that online echo chambers are the dominant cause of increasing political polarisation (Boulianne et al. 2020).

The research presented so far largely focuses on identifying, and quantifying, the existence of social media echo chambers. However, in doing so it has been restricted to studying a single platform and, for the most part, a single topic or issue such as climate change or Brexit. These general limitations to the growing body of evidence intended to exhibit the existence of echo chambers have given rise to criticisms, both of the theory itself and the methodologies used to study it, as well as to important counter-research.

Criticism

We saw that Bruns et al did not find evidence of echo chambers within the Australian Twittersphere (Bruns et al. 2017; Highfield and Bruns 2015). In addition to providing the most formalised working definition of the term currently available, Bruns also offers well considered criticism of the very term and what he believes to be a misguided overuse of the metaphor in recent years (Bruns 2019b; Bruns 2019c).

We are hampered, misled, and distracted from more important questions by metaphors of echo chambers and filter bubbles that are no longer fit for purpose, and probably never were (Bruns 2019b, p. vi).

We have seen that this current time of political upheaval has coincided with a seismic shift in the way news is both generated and consumed, from the gate-keeping of print and broadcast generations to citizen journalism circulated over social media. According to Bruns, it is not surprising that this technological shift causes some level of moral panic, particularly when it concerns something as important as access to quality news. However, Bruns’ central thesis is that the popular narrative of online information cocoons relies on a technological determinist fallacy which asserts that the platforms we use, and how they are designed, inevitably lead us into information cocoons. According to Bruns, by using the technology as a scapegoat, this theory avoids addressing the tougher questions about why users may be drawn to more polarising opinions and less willing to seek out conflicting views.

In addition, Bruns points out that both theories of information cocoons ‘assign far too much importance to the role of politics in ordinary people’s lives’ (Bruns 2019b, p. 10) and ‘in discussing ideological cocoons, we must not forget that the vast majority of users are not on social media primarily to discuss politics’ (Bruns 2019b, p. 11). Even the most politically-engaged citizens will likely only spend a portion of their time online engaging with political content. Fundamentally, large social networking platforms such as Facebook and Twitter are built on our interpersonal relationships. As the research in this thesis will show, even for Reddit, a platform that is centred on ties of shared topical interest, the political sphere only comprises a fraction of users’ interactions, and that fraction is enmeshed in a slew of other broad topic interests. However, even social media users who do not actively seek out political discourse have been shown to experience incidental exposure to views they disagree with (Fletcher and Nielsen 2018). This gives rise to a question we will repeatedly return to throughout this research - are echo chambers only a problem for political content?

In contrast to the information cocoon perspective, a survey of American adults found that social media users perceive more political disagreement online than in face-to-face interaction, presumably because they have greater access to a variety of opinions online than in their offline community (Barnidge 2017). Consuming news via social media was also positively associated with perceived disagreement on social media; in other words, the more respondents read online the more disagreement they perceived (Barnidge 2017). Engagement with political discussion online has been shown to be associated with participating in “politically heterogeneous individual discussion networks” (Brundidge and Rice 2009, p. 144). This suggests that the majority of people who do choose to participate in political discourse online are choosing not to engage in echo chambers.

Dubois and Blank provide strong evidence of the limitations of echo chamber research which is conducted on a single platform, as most has been (Dubois and Blank 2018a; Dubois and Blank 2018b). Specifically they suggest that any research which only examines a person’s behaviour on one site cannot possibly observe whether the person themselves is in an echo chamber. To suggest so neglects the obvious fact that people may choose to consume different types of content on different platforms. This is a very natural form of segmentation of consumption that reflects the concept of context collapse, which we will explore later in this chapter. Dubois and Blank instead suggest that to know to what extent a person may live within an echo chamber you must observe their complete media environment. Using representative survey data of British internet users they found that media diversity, as measured by the number of unique media sources consumed, and level of political interest are both highly negatively associated with the likelihood of being in an echo chamber. These held true for users who were politically left wing, right wing, or neutral, showing that neither political pole was likely to be in echo chambers.

As Dubois and Blank show, research conducted using only online data, especially that from a single platform, will never provide evidence that people are living in completely isolated echo chambers in their daily lives. However, there are still questions about how societal discourse is portrayed in the mass media, in particular the extent to which political polarisation is developing and self-perpetuating online. While this research agrees that theories of online echo chambers and their dangers to society have been largely overstated, they speak to larger societal fears that discourse, in particular political discourse, is becoming increasingly divided and harder to bridge.

Having acknowledged the inherent limitations of single platform studies, this research nonetheless takes that approach. However, it seeks to build on the example provided by the Australian Twittersphere research to map a complete network of one of the most popular social media platforms in the world, which has so far not been the object of echo chamber research. By studying Reddit, we can test the limits of the echo chamber theory, and develop a better understanding of how ideological homophily operates as a subset of general interest homophily. Reddit serves as a natural experiment of how online political communities interact with non-political interest communities and how users develop the connections between these communities.

The structure of Reddit also allows us to avoid the wicked bounding problem of Facebook and Twitter studies, which have shown that defining which communities to study is one of the most difficult steps in conducting analysis of echo chambers. The community based structure of Reddit, which I will outline in the next section, avoids this issue. As communities are self-defined it is possible to set clear boundaries on areas of interest. Studying Reddit also allows us to examine another fuzzy question that has received little examination to date. Are echo chambers only about political content? Or are they about any view, however apolitical or seemingly innocuous, being amplified and increasingly polarised?

2.2 ‘The front page of the internet’

Reddit could be considered one of the most popular social media sites that many people have never heard of, or at least never visited. As of January 2020, reddit.com was the 18th most popular website in the world. At that time 44% of visitors to the website were based in the United States, where it was the 5th most popular website after Google, YouTube, , and Facebook (Top Sites in United States 2020). Around 8% of visits to Reddit came from the United Kingdom, where it was remarkably the 3rd most visited website at the beginning of 2020, after Google and YouTube, but before the BBC homepage or Facebook (Top Sites in United Kingdom 2020). But what is Reddit and how is it different from more mainstream social media such as Facebook and Twitter?

Reddit refers to itself as ‘the front page of the internet’ (Reddit 2019). It was originally created as a place for users to share links to content from across the internet that they found interesting. The intention was to surface content that people might not have otherwise discovered, similar to other social news aggregators of the time such as and . Reddit’s name was a reference to the role the site wanted to have in the process of users discovering content they enjoyed. The founders wanted users to tell their friends ‘I read it on Reddit’ (Reddit FAQ 2020).

Unlike most other major social media platforms, it is not a traditional social networking site. Users do not use the platform to connect directly with other users, to network, but to share and consume content. As such the social aspect of the media is mediated by the content. This is a nuanced but important distinction, which has been described by danah boyd:

SNSs [social networking sites] are primarily organised around people, not interests. Early public online communities such as and public discussion forums were structured by topics or according to topical hierarchies, but social network sites are structured as person (or ‘egocentric’) networks, with the individual at the center of their own community (Boyd and Ellison 2007, p. 219, emphasis added).

Reddit is not a social networking site because it is not based on direct ties between users; it does not represent an online personal network. It is, instead, a ‘website dedicated to communities of interest’ (Boyd and Ellison 2007, p. 219, emphasis added) which ‘help[s] strangers connect based on shared interests’ (Boyd and Ellison 2007, p. 210, emphasis added). Reddit users (called redditors) are not directly connected, but share mutual ties based on the interests that they share. This duality of persons and topics creates a natural membership or affiliation network within the framing of social network analysis (Breiger 1974). In the following chapters we will see that this bipartite structure of people and topics greatly informs the framing of this research. But first we need to understand the second actor in these bipartite networks, subreddits.
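As a rough illustration of the duality of persons and topics described above, the short Python sketch below projects a list of author-subreddit pairs onto the subreddit side, counting how many authors each pair of subreddits shares. The usernames, the subreddit names, and the function `project_onto_subreddits` are hypothetical and chosen only for illustration; this is a minimal sketch of a one-mode projection of an affiliation network, not necessarily the procedure used later in this thesis.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical author-subreddit pairs (illustrative names only).
pairs = [
    ("alice", "r/politics"),
    ("alice", "r/aww"),
    ("bob", "r/politics"),
    ("bob", "r/aww"),
    ("bob", "r/soccer"),
    ("carol", "r/soccer"),
]


def project_onto_subreddits(pairs):
    """Return {(subreddit_a, subreddit_b): number of shared authors}."""
    # Step 1: collect the set of subreddits each author participates in.
    subs_by_author = defaultdict(set)
    for author, subreddit in pairs:
        subs_by_author[author].add(subreddit)

    # Step 2: every pair of subreddits sharing an author gains one tie.
    shared = defaultdict(int)
    for subs in subs_by_author.values():
        for a, b in combinations(sorted(subs), 2):
            shared[(a, b)] += 1
    return dict(shared)


overlap = project_onto_subreddits(pairs)
for (a, b), n_authors in sorted(overlap.items()):
    print(f"{a} -- {b}: {n_authors} shared author(s)")
```

In this toy example the two subreddits are tied only through the redditors who take part in both, which is the sense in which redditors ‘share mutual ties based on the interests that they share’.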

2.2.1 Subreddits: topic-based communities

Whether you’re into breaking news, sports, TV fan theories, or a never-ending stream of the internet’s cutest animals, there’s a community on Reddit for you (Reddit Inc 2020).

To enable users to find content that may be of interest to them, the platform is organised into topic-based message boards, called subreddits. Each subreddit has its own homepage; for example, the subreddit devoted to the University of Manchester can be found at https://www.reddit.com/r/manchester_uni/. Here any user can make a post related to the topic, in this case the University of Manchester. Each post has a text title which briefly describes what it is about. The body of a post can be a link to an external web page, such as a news article or an image, or it can be a chunk of text. For example, a post on r/manchester_uni could contain a link to a recent news article about the University, or it could contain a question from a new student asking current students for course recommendations.

Any user can create a subreddit, on any subject. Subreddits are overseen by moderators, users who volunteer to look after the running of a subreddit. They write the community descriptions that introduce new users to the community and codify community norms. The role of moderators has been described as ‘civic labor’ (Matias 2016). In comparison to the employees of Reddit, who are called admins, moderators are ordinary users who freely volunteer their unpaid labour to manage particular subreddits.

Similar to other message board style websites, each post can then be commented on by users, and these comments can then be commented on themselves, creating a nested, tree-like discussion thread. Comments allow the members of subreddits to engage with others on the content of posts. Users can also give feedback on any post or comment by upvoting or downvoting it. Each post and comment then has an overall score, depending on how many upvotes and downvotes it has received. Scores are not simply calculated as the difference between the total number of upvotes and downvotes a piece of content receives. Because Reddit is oriented towards what is current and newsworthy, older votes receive less weight in the algorithm that determines an item’s score.

The limited use of moderation and the emphasis on community-led content generation and curation are founded on Reddit’s belief that with enough views from users the best content will naturally find its way to the top. In the absence of traditional gatekeeping this process depends on the ‘wisdom of the crowd’. However, depending on the crowd, this poses the risk of group think, as previous research has shown. This process then has the potential to allow subreddits to descend into echo chambers.
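The idea that older votes count for less than recent ones can be sketched as follows. Reddit’s actual scoring formula is not given here, so the exponential half-life weighting, the function name `decayed_score`, and the 12 hour half-life are all assumptions made purely for illustration.

```python
def decayed_score(votes, now, half_life_hours=12.0):
    """Sum +1/-1 votes, halving a vote's weight every `half_life_hours`.

    `votes` is a list of (value, cast_at) tuples, where value is +1 for an
    upvote or -1 for a downvote and cast_at is the time of the vote in hours.
    This weighting scheme is an illustrative assumption, not Reddit's own.
    """
    total = 0.0
    for value, cast_at in votes:
        age = max(0.0, now - cast_at)  # hours since the vote was cast
        total += value * 0.5 ** (age / half_life_hours)
    return total


# Two early upvotes followed by one recent downvote (hypothetical data).
votes = [(+1, 0.0), (+1, 12.0), (-1, 24.0)]

raw_difference = sum(value for value, _ in votes)
weighted = decayed_score(votes, now=24.0)
print(raw_difference, weighted)
```

Under these assumptions the simple difference between upvotes and downvotes is positive, while the time-weighted score is negative because the single recent downvote outweighs the two older upvotes.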

Wisdom of the crowd

As a social news website, Reddit is centred on this wisdom of the crowd approach to surfacing the ‘best’ content, without the gatekeeping of traditional news intermediaries. This is exactly the kind of content generation and consumption format which Sunstein originally feared and Habermas relished. Any user can contribute content and customise their own ‘front page’ to filter for only the topics they find the most interesting, provided for them by other people with those shared interests.

However, research by Gilbert has provided evidence that the wisdom of the crowd theory of social voting does not work as well as intended. His research found that 52% of the most popular links on Reddit were overlooked the first time they were submitted within the 17 day period studied (Gilbert 2013). One link was submitted 54 times before it was upvoted to most popular status. There appears to be a fundamental flaw in this system, allowing the same content to receive such different responses at different points in time.

Gilbert suggests that this flaw is the overdependence on a small minority of redditors voting on content very soon after it is published to the site. By default the Reddit news feed is sorted by ‘popularity’, but for content to become popular enough users must have already identified the content as worthy before it reaches this feed. This initial rise to popularity is the result of a small subset of users actively choosing to change their news feed setting to sort for ‘new’ content, giving them a live stream of the most recent content to be posted across Reddit or within a given subreddit. According to Gilbert, users who only ever browse by popularity - the vast majority of users - ‘free load’ off the extra work of those who sift through new content.

Gilbert uses the term ‘under-provision’ to describe the process by which only a small group of users provide the initial sorting of the vast quantity of new content, a process which is essential to the social navigation central to the structure of Reddit. Gilbert emphasises that under-provision threatens to become a ‘tragedy of the commons’, in opposition to the original ethos of Reddit. However, if only about 5% of content ever reaches the most popular page, as Gilbert found, we must ask why the average redditor would want to wade through the many poor submissions of new content. It seems inevitable that the majority of users will want to exploit the filtering work of others to consume only the ‘best’ content which Reddit has offered them. The greater threat, I would suggest, is the amount of power given to those early voters to determine which content does, or does not, have the opportunity to be seen by other viewers. This is another form of gatekeeping, albeit on a much smaller scale. In this way, not every member of the crowd is afforded equal wisdom.

Other research has shown that Reddit has largely moved away from its original purpose of linking to external newsworthy content. Singer et al have suggested that Reddit may no longer operate predominantly as a gateway to other parts of the internet, but has become increasingly self-referential over time (Singer et al. 2016). For the period 2008 to 2012, 25% of posts were self posts, meaning they did not link to any external sites. As of 2014, 26% of posts linked to .com, an image hosting website that was originally created for Reddit. Over one quarter of all posts to subreddits were images (often memes) that had been uploaded online specifically to be shared on Reddit.

This shows that Reddit is no longer simply the social news website it was originally created to be. It is not merely a customisable filter to other parts of the internet. It is a site for the generation of new content itself. However, despite this evolution in how content is sourced, the importance of subreddits remains central. All content, whether linked or original, is in some way categorised by what subreddit it belongs to. Each of these subreddits can be considered a distinct community, composed as much of the redditors who contribute to and consume from it as by the content it hosts. For each redditor, then, their experience of the platform is defined by what subreddits they participate in and how they choose to participate.

2.2.2 Who is on Reddit?

Unlike most content ranking algorithms on social media today, Reddit does not take into account the subject matter of content, or who contributed or supported it, when determining what content a user sees. In fact there is minimal personalisation to a user’s own news feed - only that which they implement themselves by subscribing to, or unsubscribing from, specific subreddits. This lack of personalised algorithms is beneficial in the sense that it does not risk users entering filter bubbles, the other form of information cocoon touched on earlier when referring to Bruns’ conceptualisation of echo chambers.

Each user has a unique, customisable experience - they only need consume the topics of interest to them. By building their custom ‘front page of the internet’, each user can forgo the shared public sphere of news sites, or even the homepages of more niche interest sites. They instead jump directly to the specific content they want, which has been curated by people with the same interests as them. They are actively opting in to their chosen public sphericules. In this sense Reddit is the antithesis of the general interest intermediaries heralded by Sunstein. Despite the many advantages of this fully customisable experience, as we will see this can lead to the development of echo chambers for some users if interest homophily descends into selective exposure. This is especially the case for Reddit, where the use of pseudonymity may encourage users to behave in ways they would not on other, more public platforms.

Pseudonymity

To create a Reddit account all you need is a username and a password. Unlike most online registrations, you do not need to use an email to register, and you will never be asked to provide your real name. Redditors rarely use identifying characteristics in their usernames. This practice of pseudonymity, using an alias rather than your real name, is actively encouraged by Reddit. As of January 2020, when you create a new account you are even provided with a list of available randomly generated usernames such as Lucky-Wishbone or TemporaryLobster5. This precedent was set when the founders, and , created their own usernames, kn0thing and spez.

Given the lack of connection between users’ Reddit profiles and their real world selves that results from the use of pseudonyms, there can be no official statistics or demographic breakdowns of who redditors are. Instead our understanding of the makeup of Reddit’s userbase depends on a combination of general public surveys, surveys administered by redditors themselves, and ethnographic research. In early 2019 the Pew Research Center conducted a nationally representative survey of American adults to assess the rates at which they used Reddit and other platforms, and reported the breakdown by key demographic features including gender, age, race, and educational attainment. The results of this survey are presented in Figure 2.1.

The research found that, of all US adults, 11% used Reddit. 15% of men and 8% of women used Reddit, providing evidence that the user base skews largely male. It also skews younger, being used by 22% of 18 to 29 year olds, 15% of 30 to 49 year olds, and only 6% of 50 to 64 year olds. American redditors also tend to be more highly educated: 6% of US adults with a high school degree or less reported using Reddit, versus 15% of college graduates (Pew Research Center 2019).

Figure 2.1: Rates of Reddit use by US adults in 2019, data from (Pew Research Center 2019)

These results give us an idea of the overall makeup of Reddit users, and some of the demographic biases are also found in existing qualitative research we will explore later in this chapter. However, the use of pseudonymous usernames on Reddit makes it impossible to get a more accurate account of who is on Reddit. And, perhaps more importantly, it obfuscates our ability to make offline assumptions about specific users. While from a research perspective this may be limiting, from a social perspective it is largely positive.

In his 2013 paper Pseudonyms and the Rise of the Real-Name Web, Hogan provides one of the most detailed examinations of the practice of pseudonymity online which is currently available to us (Hogan 2013). He outlines the trade-offs that online users must make between context-specific impression management and the persistence of online content. A common example would be a potential employer being able to see comments an applicant made on Facebook many years ago as a teenager and the damage this may cause to the applicant’s chance of being hired. “Using a pseudonym for some topics and one’s real name for others is a straightforward way to avoid such serendipitous linking.” (Hogan 2013, p. 296). One could choose to use one’s real name for LinkedIn and a professional Twitter account, but use a pseudonym on a personal Facebook or Instagram account to prevent employers or others from seeing one’s personal content, for example.

Related work by Hogan is centred on the theory of the presentation of the self originally proposed by the sociologist Erving Goffman (Hogan 2010; Goffman 1978). He uses the metaphor of walking into different rooms and presenting different selves. This ability to act as a social chameleon is more available to us in the offline world than online. The persistence of content online, as in the example of the potential employer on Facebook, and the use of search engines allow the casual observer much greater and almost instantaneous access to existing information about us, or by us, than a stranger we meet in person would have. This leads to the phenomenon of ‘context collapse’ in which many different versions of ourselves which we present in different contexts collapse into a single unified self in a way which is impossible in face to face interactions (Marwick and boyd 2010).

Other popular social media platforms, most notably Facebook, have historically discouraged, or even banned, the use of pseudonyms in the belief that they encourage and allow people to get away with bad behaviours. This is based in a belief in the online disinhibition effect, which suggests that without the threat of being held personally accountable for negative actions, pseudonymous or users are more likely to behave badly (Suler 2004). While it is surely true that pseudonyms allow some users to act more negatively than they otherwise would, such as by trolling other users or engaging in hate speech, for Reddit the pros have been judged to greatly outweigh the cons. In Hogan’s words, “if one is connecting to others based on an interest, a name that signals one’s interest in that space makes more sense than a real name” (Hogan 2013, p. 298).

One of the key advantages of the practice of pseudonymity is that it allows users to limit context collapse, as we will see. Not only does this have benefits for redditors but it also allows us as researchers to view connections between topics which users might otherwise not exhibit if they were made to use their real names. This has meaningful implications for how public sphericules and echo chambers can be analysed on Reddit, as I will address in the coming chapters.

Context collapse

Leavitt et al. explore the use of ‘throwaway’ accounts on Reddit and introduce the term ‘multiple profile maintenance’ (Leavitt et al. 2016). A redditor may create a throwaway account when they want to disclose sensitive content, or partake in a community they do not want associated with their main account. By maintaining multiple profiles users clearly separate their interactions on Reddit in an extreme act of limiting context collapse. Throwaways allow users to set boundaries between which audiences see what content.

Work by van der Nagel explores the use of ‘alts’, or alternative accounts, on social media (van der Nagel 2018). Alts act as secondary profiles to a redditor’s main account. “They are a kind of automediation, a way of representing the self, that deliberately displays a different identity facet, and addresses a different audience, to what someone considers to be their main account” (van der Nagel 2018). She offers the example of ‘finstas’ on Instagram, a portmanteau of ‘fake Instagram’, which teenagers use to portray less polished versions of themselves or to hide illicit behaviour from parents. However, even on Reddit, where all interactions are pseudonymous, the use of multiple accounts allows users to avoid context collapse. In later chapters of this work, we will explore the example of pornographic subreddits and suggest that redditors may use alt or throwaway accounts when engaging with pornographic content to keep those behaviours separate from their other Reddit experiences.

Recent research has focused on the tension of context collapse, and the benefits of anonymity, on Reddit, specifically among queer users (Triggs et al. 2019). The authors highlight that Reddit is possibly the most influential, and certainly the largest, anonymity-based social media platform. They show that LGBTQ people use differentiation strategies to avoid context collapse. This supports the development of a ‘safe space’, outside of the pressures of heteronormative society. The use of multiple accounts is a practice of ‘intra-platform differentiation’. This is also supported by the ability to ‘lurk’, browsing without having an account or being logged in. By contrast, Facebook, Twitter, and Instagram all require users to be logged in to see content.

Two interviewees in that study maintained multiple profiles to differentiate between their participation in LGBT and Christian subreddits. This is an important finding for our later construction of ties between subreddits. Communities which traditionally have little overlap, or may even be opposed to each other in some way, may have hidden connections, facilitated by this technical affordance, which the use of trace data will not capture.

van der Nagel and Frith provide a case study of the use of pseudonyms on the subreddit r/gonewild, where women, and sometimes men, post nude photos of themselves (van der Nagel and Frith 2015). Pseudonymity allows these users to engage in a greater range of identity practices than they would be able to in spaces with a ‘real name’ policy. Facebook is the most notable platform to enforce a ‘real name’ policy, by which users should only have one account, and use their legal names. In the case of gonewild, women use alt accounts when sharing explicit pictures of themselves to create a layer of privacy and ensure those photos cannot be found via their main account, which friends or family members may be aware of. According to van der Nagel and Frith,

“The option of not using real names online allows people to control what they reveal about themselves and who they reveal it to, opening up possibilities for identity exploration, exhibitionism, and connections with people who share different interests without being limited by the social factors that routinely shape everyday life” (van der Nagel and Frith 2015).

Based on in-depth ethnographic research on Reddit, conducted over a period of years, Massanari agrees that pseudonymity is essential to the culture of Reddit and encourages “reddit’s tendency towards both altruism and cynicism” (Massanari 2015a, p. 50). She gives the example of users who occasionally post in NSFW (not safe for work) subreddits, such as r/gonewild, who want to maintain ‘contextual privacy’ with other parts of their Reddit life and, thus, fight context collapse. We will find preliminary quantitative evidence of similar behaviours among other pornographic subreddits in the later chapters of the current work.

These examples of the use of multiple profiles to prevent context collapse on Reddit pose a methodological limitation to the current research, as we will see in later chapters. However, the general principle of a single pseudonymous profile on Reddit, separate to and safe from the judgement of others, presents a likely strength of studying the platform. The level of privacy offered by pseudonymity means that many redditors may behave in ways on the platform that they would be less likely to on others where their contributions are associated with their offline selves. For our research, this has the clearest advantage of creating connections between topics that might not be visible in other spaces.

2.3 Topic interests on Reddit

In addition to the core tenet of users’ right to pseudonymity, Reddit is also strongly committed to their right to free speech. Since its origin Reddit has placed great emphasis on this right, but Reddit’s framing of free speech can be somewhat confusing to new observers. The platform’s content policy sets minimal limits on what content is allowed to be posted (Reddit Content Policy 2020). These are broadly reduced to content that is illegal, incites violence, or is harassing.

However, Massanari questions the extent to which Reddit actually represents a beacon of free speech. Her long-term ethnographic research highlighted the strong cultural norms that much of Reddit can exhibit (Massanari 2015a). She found that there is a strong emphasis on technology and gaming, and that user politics tended towards the left wing but with a techno-libertarian streak.

We will see some of these qualitative insights reinforced in Chapter 3 when we look at the distribution of topics among the most popular subreddits, in particular a very large skew towards gaming subreddits. With regards to politics, we will also see a slight left-wing skew, with almost twice as many left-wing subreddits as right-wing ones making the top 1000. However, there are important exceptions to this. As we will soon see, the largest political subreddit, and one of the largest subreddits overall, is a fan community for US President Donald Trump.

Massanari’s ethnography of Reddit briefly mentions the concern of echo chambers. Many of her interviewees expressed awareness that in creating completely unique Reddit experiences for themselves, based on their specific niche interests and views, they may be actively creating their own echo chambers, exactly as Sunstein foretold. However, existing network-based research suggests that this may not be the case at scale.

2.3.1 ‘Interest meta-communities’ as public sphericules

Published in 2015, before the widespread fear of social media ‘echo chambers’ arose in 2016, research conducted by Olson and Neal was the first to map the complete topical network of Reddit (Olson and Neal 2015). In doing so, it confirmed a few key network qualities about the platform, which are generally consistent with findings from other social media. Most notably, they found that Reddit had a modular community structure which aligned with high-level subreddit topics. This mirrored the findings of Bruns et al. on the organisation of topical communities on the Australian Twittersphere.

The researchers collected data in mid-2013 for nearly 90,000 users, one third of all of Reddit’s active users at that time, across 15,000 distinct subreddits. They considered a user to be ‘interested’ in a subreddit if they had made at least 10 submissions or comments to it. They constructed an unweighted bipartite network between users and subreddits, then projected a weighted unipartite network of the number of users shared between subreddits. To reduce the number of edges they used a backbone extraction algorithm and retained only those edges whose weights were statistically significantly larger than expected under a null model. The authors state that this provides a network in which edges between subreddits suggest there is a high probability users might move from one subreddit to the other if they navigate Reddit by following the posts of other users.

They then applied a community detection algorithm to the network. Focusing on the largest connected component in the graph, they identified 59 distinct clusters which they called interest meta-communities.¹ These interest meta-communities are analogous to the topic-based public sphericules identified in the Australian Twittersphere work. Of the Reddit interest communities, 30 are ‘core’, densely connected communities, while 29 are more weakly internally connected. At the centre of their network is a community of the most popular general interest subreddits (such as r/pictures and r/videos) which overlap with other, more topically specific communities. The associated interactive map of Reddit is still available online (Olson 2020).

¹ A note on the use of the term communities: On Reddit a subreddit and its active users are often referred to as a community. However, when conducting community detection analysis, as Olson and Neal did in their paper, and as I will in the research to be presented, we use the term community to refer to groups of subreddits in the network that are internally densely connected. To avoid ambiguity going forward I will not refer to subreddits as communities and will reserve the term for the sets of subreddits grouped by community detection, as we will see in Chapter 5.

The authors suggest that the scale-free nature of the network, meaning the degree distribution of subreddits follows a power law, is a product of the high level of preferential attachment on Reddit. In other words, the largest subreddits continue to get larger. They note that this is partially influenced by the default subreddit system that Reddit implemented until 2017, which meant all new users were automatically subscribed to the same set of 20 subreddits selected by Reddit for being of ‘general interest’.

Olson & Neal suggest that their map would be useful for users to find niche subreddits and avoid overcrowding in the largest ones. They offer their map as a utility to the platform “to assist users in organizing themselves into more specific interest groups” (Olson and Neal 2015, p. 1), rather than as a source of social scientific insight to reflect upon. In the light of the concerns in recent years about possible information cocoons on social media, this goal of using the existing structure of the subreddit network to suggest user behaviours, without more in-depth consideration of potential issues of polarisation or fragmentation, appears short-sighted.

The research presented in this thesis seeks to build on the insights provided by Olson & Neal’s large-scale analysis of the structure of the subreddit network to consider the dynamics of interest meta-communities as public sphericules and, beyond that, echo chambers. To allow for more nuanced insights we will examine a selection of case study subreddits, most of which will be introduced in the next chapter. Here, we will begin to close out this chapter with more detailed background on the two key subreddits which inspired this research: changemyview, a self-purported ‘anti-echo chamber’, and The Donald, a self-characterised potential echo chamber.
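To make the network construction attributed to Olson and Neal above more concrete, the following is a minimal sketch of the first two steps only: applying the ‘interest’ threshold to build unweighted user-subreddit ties, then projecting these into a weighted subreddit-subreddit network of shared users. It is not the authors’ code. The toy activity data, the usernames, and the function names are all hypothetical, and the backbone extraction and community detection steps are deliberately omitted.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (user, subreddit) -> number of submissions or comments.
activity = {
    ("alice", "r/politics"): 12,
    ("alice", "r/news"): 15,
    ("bob", "r/politics"): 30,
    ("bob", "r/news"): 11,
    ("bob", "r/gaming"): 3,  # below the threshold, so not counted as an interest
    ("carol", "r/gaming"): 40,
    ("carol", "r/news"): 10,
}

# Threshold reported in the text: at least 10 submissions or comments.
MIN_CONTRIBUTIONS = 10


def interested_subreddits(activity, threshold=MIN_CONTRIBUTIONS):
    """Unweighted bipartite ties: map each user to the subreddits they are 'interested' in."""
    ties = defaultdict(set)
    for (user, subreddit), count in activity.items():
        if count >= threshold:
            ties[user].add(subreddit)
    return ties


def project_to_subreddits(ties):
    """Weighted unipartite projection: edge weight is the number of shared users."""
    weights = defaultdict(int)
    for subreddits in ties.values():
        # Sort so that each undirected pair is always stored in the same order.
        for a, b in combinations(sorted(subreddits), 2):
            weights[(a, b)] += 1
    return dict(weights)


edges = project_to_subreddits(interested_subreddits(activity))
print(edges)
```

In this toy example r/news and r/politics share two interested users, while r/gaming and r/news share one. A backbone extraction step would then decide which of these weights are larger than a null model would predict.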

2.3.2 changemyview: the anti-echo chamber?

The subreddit changemyview was created in 2013 by a Scottish teenager to act as a place where he and others could try to escape the echo chambers they may find themselves in during their other online or offline interactions (Wendling 2019; Malone 2017). The description provided by the subreddit’s moderators to introduce new visitors defines the subreddit as

“A place to post an opinion you accept may be flawed, in an effort to understand other perspectives on the issue. Enter with a mindset for conversation, not debate” (R/Changemyview 2020).

On changemyview, post authors make a statement about an opinion they have which they are open to debating and possibly having changed. In the post they clearly outline their knowledge of the subject and the reasoning behind their views. They are encouraged to provide links to supporting evidence. Commenters then respond with counterpoints, either to the original poster’s (OP) central thesis or to specific points. A discussion then takes place in the comment threads between the OP and commenters. If the OP decides that a comment has changed their mind, or given them a new perspective on the issue, they respond to it with a delta icon (∆). A bot account called DeltaBot keeps count of how many deltas each user has received and posts a live ‘leaderboard’ to the subreddit’s front page, thus celebrating and rewarding users who are most successful in changing others’ views.

Many posts deal with large socio-political issues, and often express views that would be considered controversial by the mainstream. At the time of writing the second most popular post was titled “Gender Dysphoria is a curable mental illness, we’ve stopped looking for the cure because society is now forced into accepting .” (u/Phill Hermouth 2020). The poster then goes on to recognise the highly controversial nature of the statement and asks for help in changing their opinion: “I know this is a big yikes to post in 2020, but I am posting this because I truly want my view to be changed. I know it is offensive to a lot of people.”

By contrast, other submissions address rather more innocuous subjects. At the time of writing the most popular post, by a large number of upvotes, was titled “if you name your child something like “Abcde” (pronounced ‘Absidy’) and get upset at the mispronunciation or negative attention it brings, you knew what you were doing and you wanted the attention for yourself” (u/Pirateer 2020). This seemingly random subject matter was in response to a news story at the time about an airline staff member mocking a five-year-old called Abcde (BBC News 2018), and the post sparked a flurry of responses, as evidenced by its level of popularity.

Moderators of changemyview, often shortened to CMV, use strict policing to ensure posters submit in good faith. One of the subreddit’s rules states that the OP must hold the view themselves, and cannot be playing devil’s advocate. As such, the subreddit is not just a place to debate controversial ideas, but specifically to challenge your own opinions. This could be considered the opposite process to selective exposure: here users seek out views which challenge the ones they hold in an effort to un-enforce their existing opinions. Participants are encouraged to use the public sphere not to find like-minded users but to explore the range of views on an issue and debate their relative merits. The subreddit, therefore, is designed to be an anti-echo chamber.

Previous research has examined the interaction dynamics between posters and commenters on CMV to understand what made for the most persuasive arguments, i.e. what factors were most associated with posters changing their view (Tan et al. 2016). The authors showed that the way the OP expresses their opinion can predict how likely they are to have their view changed. They also found that comments which include links as evidence were more likely to change the OP’s view. This is in contrast to evidence presented earlier in the chapter which showed that members of echo chambers became more polarised when exposed to counter-evidence (Bail et al. 2018; Zollo et al. 2017).

For most posts examined in the research it appeared the OP did not ultimately change their mind. Early comments were more likely to persuade the OP. Exchanges (back-and-forth replies between the OP and commenters) corresponded with higher delta rates, but only up to 4 total replies. Exchanges with 5 or more replies almost never received deltas, suggesting that while some deliberation with the OP had a positive effect, further deliberation beyond that point was not more persuasive. This may suggest that OPs are more likely to have their mind changed when shown novel information or takes on an issue, as opposed to having their mind changed through argumentation.

This shows that the majority of attempts to change a view are not successful and that there are specific interaction dynamics at play. However, the willingness, indeed the active decision, of contributors to changemyview to engage in these debates offers a unique opportunity to research a space which exists to be the antithesis of an echo chamber. Similar to the mixed-attitudinal open forums and ‘contrarian clubs’ identified in previous research into echo chambers on Twitter, changemyview represents a melting pot of perspectives (Williams et al. 2015; Vaccari et al. 2016).

For the purposes of our research, the subreddit structure of Reddit gives us a methodological advantage which those previous studies lacked. Subreddits are bounded, unique units of analysis, and often provide clearly defined descriptions to initiate new members, which helps us to understand subreddits at scale, as we will see further in the next chapter. In the case of changemyview, we have seen from this description that the subreddit intends to act as an anti-echo chamber. Other subreddits, by contrast, appear to define themselves implicitly as echo chambers.

2.3.3 The Donald: a self-defined echo chamber

The subreddit The Donald was created in June 2015 after the announcement that Donald Trump would be running in the 2016 US Presidential election (R/The Donald 2020). As a fan site for the Republican candidate the subreddit was somewhat of an outlier on the generally left-leaning platform (Massanari 2015b). The subreddit became increasingly controversial both on and off Reddit as its userbase quickly grew and it became the largest community of Trump supporters online, as well as one of the most popular subreddits on the platform.

For years The Donald maintained a contentious relationship with Reddit admins, whom the subreddit’s members accused of political biases and unfair treatment. In November 2016, soon after Donald Trump was elected president, this tension gained wider attention when Reddit’s CEO Steve Huffman altered posts in The Donald which were directed at him by using his username u/spez. The incident, referred to as Spezgiving as it occurred over American Thanksgiving weekend, though intended as a prank, was generally considered to be a violation of Reddit’s fundamental anti-censorship ethos. Though Huffman expressed serious regret and quickly apologised for his actions, the relationship between Reddit admins and the moderators of The Donald appeared to never recover (Roozenbeek and Salvador Palau 2017; Kiberd 2016; Burgess 2016).

In spite of these strained relations, for years Reddit refused repeated calls from other redditors and the media to shut down The Donald for violating Reddit’s content policy (Marantz 2018). In 2017 the platform banned smaller alt-right and neo-Nazi subreddits commonly associated with The Donald for doxxing (maliciously revealing someone’s private information), harassment and inciting violence, the same content issues The Donald was accused of (Sacks 2017; Romano 2017). Ultimately, in late June 2020, at the time this thesis was nearing completion, Reddit extended its content policy and banned The Donald, along with 2,000 other subreddits, for “promoting hate based on identity or vulnerability” (r/announcements 2020; Reddit Help 2020).

Throughout its lifespan, however, The Donald remained a popular, albeit divisive, part of the Reddit community. For years it was one of the most popular subreddits devoted to politics, as we will see in the next chapter. Using the sidebar feature, moderators of The Donald maintained clear guidelines on what views were, and were not, acceptable within the subreddit. The following quote is taken from the public description of the subreddit from before it was banned, and was intended to clarify the purpose of the subreddit for new visitors,

“Trump Supporters ONLY – This sub is for supporters of Donald J. Trump ONLY. This is not a place for you to debate with us about Donald Trump, or to ask us to convince you to like Donald Trump. This is not a neutral place – we are 100% in support of Donald J. Trump. Moderators reserve the right to ban non-supporters as we see fit.” [emphasis added]

This self-description provides as direct a statement as possible that The Donald sought to enforce the practices associated with echo chambers. It was a place for supporters only. It was not a place to debate, or even ask questions. And, should any non-supporters try to participate in the subreddit, they could, and often would, be banned. Thus, the subreddit was explicitly designed as a place to have one’s pro-Trump beliefs reinforced and to avoid any opinions that challenge these views.

The quote highlights again one of the key methodological strengths of using Reddit in this research: subreddits are self-categorised. In the next chapter we will see that these descriptions can be used to group subreddits by topic. In the case of The Donald it provides qualitative evidence that the subreddit was designed around the behaviours we have come to associate with echo chambers. As such, we are provided with an initial potential echo chamber which we can use to test our hypotheses on what echo chambers are expected to look like quantitatively. In other words, was The Donald actually the echo chamber it claimed to be?

2.4 Chapter summary

In this chapter we have examined the research context for the original research of this thesis. First, in Section 2.1 we looked at the growing body of research into social media echo chambers and the societal concerns motivating its development. We saw that the introduction of the term echo chamber by Sunstein inspired this field of research but did not offer us a clear working definition of what one is or, therefore, how to measure it. We drew parallels between Sunstein’s public forum doctrine and Habermas’ public sphere, and saw that the more recent theoretical and empirical work of Bruns and Highfield on public sphericules may offer us a clearer path to operationalising the study of echo chambers, while also keeping in mind the inherent limitations of any work which studies a single platform.

In the second part of the chapter we then turned our attention to the site of this research, Reddit. While a disproportionate amount of academic social media research has examined Twitter and Facebook, Reddit remains understudied and there is a large gap in the body of echo chamber research regarding the extent of this phenomenon on different platforms. Section 2.2 provided background on the unique character of Reddit as a social media platform, with particular emphasis on the subreddit structure and the pseudonymity of users as a tool to avoid context collapse.

In Section 2.3 we then explored the topic-based nature of subreddits in greater detail and the methodological advantage this offers to network-based research. We reviewed existing research which identified interest meta-communities within the complete Reddit network and drew parallels between this and the example of topic-based public sphericules. Finally, we introduced two key subreddits, changemyview and The Donald, the examination of which will enable us to bridge the gap between large-scale computational mapping of Reddit and the theoretically motivated exploration of echo chamber behaviours. In the next chapter we will examine the data and methods which were foundational to this approach.

Chapter 3

Data Collection & Preparation

In this chapter I will discuss how I collected and processed the data used to conduct this research. In the first section I will describe the availability of Reddit comment data and why I selected a single month of comments for analysis. The second section will show how I selected a subset of data based on the most popular subreddits and their most active authors, excluding suspected bot accounts. The third section will explain how I prepared this data to be used in analysis, the methods for which I will discuss in greater detail in Chapters 4 and 5. The fourth and final section of this chapter will describe the process of manually labelling the subset of subreddits under analysis with substantive topic labels and give a description of the subreddits labelled as ‘political’.

3.1 Data collection

The metadata for all comments made in public subreddits, which are the vast majority of subreddits, is available via Reddit’s application programming interface (API) (Reddit API Terms of Use 2016). Reddit has one of the most open and exhaustive APIs of any major social media platform, and possibly the most. It allows anyone with a Reddit account to collect a wide range of data types, and to access historical data. This research is focused on the comment-level data that is freely available via the API.

While data on original posts, also known as submissions, is also available, I chose to focus specifically on comments for three key reasons. First, many redditors make comments on posts but do not contribute original posts themselves, thus by looking at commenting we include a greater range of users. Second, comments provide a much larger body of data than posts alone as there can be thousands of comments on a single post. Third, and most importantly, the comment sections of posts are where members of subreddits engage with each other and the content which most interests them. For these reasons I believe that comment threads are much more informative of the character of subreddits, and thus the similarities between them, than the original posts. To avoid introducing ambiguity to the analysis I only looked at comment-level data rather than both posts and comments.

For the purposes of this research, for any given comment I specifically needed to know:

• the subreddit it was made in

• the username of the author who posted it

• and the text body of the comment
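As an illustration of the three fields listed above, the sketch below reduces a single comment record to just the subreddit, author, and text body. It assumes, purely for illustration, that a comment is represented as a JSON object with fields named `subreddit`, `author`, and `body`. The example record and the function name are hypothetical, and the real records contain many more fields than the one extra field shown here.

```python
import json

# Hypothetical comment record; the field names are an assumption for illustration.
raw_line = (
    '{"subreddit": "changemyview", "author": "example_user", '
    '"body": "I disagree, because...", "score": 3}'
)

# The three pieces of information needed for each comment.
REQUIRED_FIELDS = ("subreddit", "author", "body")


def extract_required(line):
    """Parse one JSON record and keep only the fields needed for the analysis."""
    record = json.loads(line)
    return {field: record[field] for field in REQUIRED_FIELDS}


comment = extract_required(raw_line)
print(comment)
```

Dropping unneeded fields at this stage keeps the working dataset small, which matters given the size of the monthly datasets.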

Since 2016 Jason Baumgartner, an American data scientist and avid redditor, has regularly used the Reddit API to compile complete datasets of comments for discrete time periods, referred to as the Pushshift datasets (Baumgartner et al. 2020). These datasets include every comment made during the specific time period. For 2005, the year Reddit was launched, there is a single dataset as relatively few comments were made on the nascent platform. For data from 2006 to the present Baumgartner has created monthly datasets to handle the enormous increase in the number of comments, and therefore the size of the datasets. The entire compressed 2005 dataset was 118 kilobytes (KB), and by January 2006 this had almost tripled to 350 KB. Ten years later, by January 2016, the monthly dataset had increased in size by around four orders of magnitude to 6,309,604 KB (6.3 GB).

Each dataset is formed of one row per comment, and a column for every piece of comment metadata available via the API at the time of Baumgartner’s data collection. These include the subreddit, author, and text body of each comment as well as the exact time and date it was posted. Many other variables are available in the datasets, such as the score of a comment or the unique identifier of the original post on which it was made. However, these additional variables were not required for our research, so were not used in data collection.

3.1.1 Data quality

Due to the ease of using the structured Pushshift datasets compared to Reddit’s own API, the Pushshift datasets are commonly used in computational social science research on Reddit. However, research comparing data accessed via the Reddit and Pushshift APIs has shown that, as of 2018 at least, there were cases of missing data in the Pushshift corpus which could bias findings (Gaffney and Matias 2018). This investigative research particularly focused on the impact on analysis based on user histories or network analysis. Given that the completeness of the data is a significant technical advantage of studying Reddit, this posed a methodological dilemma during the course of this research.

The main cause of missing data identified was brief periodic outages in the automatic data collection process. As of March 2018 Baumgartner had corrected the underlying issues in his code which allowed for these outages (Baumgartner 2020b). The second cause of substantial missingness in the pre-2018 datasets was determined to be inconsistencies in the Reddit API itself. I therefore decided to focus analysis on data provided by Pushshift post-March 2018, as I will discuss at the end of this section.

For comments that had been deleted between the time they were posted on Reddit and when Baumgartner collected the data, a comment entry is available but metadata is not included. Therefore, while it is possible to know how many comments have been deleted, it is not possible to know any details about deleted comments, including the author who made them. Comments are most often removed either voluntarily by the author, or by a subreddit moderator if the comment breaks the standards of the subreddit. Where comments are removed for violating subreddit norms I would argue the comments do not suit the purpose of this research, to categorise the nature of subreddits, therefore their omission does not pose a substantive limitation.
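The handling of deleted comments described above, where entries can be counted but carry no author or content, can be sketched as follows. This is an illustrative assumption rather than the processing code used in this research: it supposes that a deleted comment is marked by a placeholder string in its author field. The placeholder value, the toy records, and the function name are all hypothetical and should be checked against the actual encoding in the dataset.

```python
# Assumed placeholder for missing author and body; verify against the real data.
DELETED_MARKER = "[deleted]"

# Hypothetical toy records standing in for rows of a monthly dataset.
comments = [
    {"subreddit": "AskReddit", "author": "user_a", "body": "An answer"},
    {"subreddit": "AskReddit", "author": DELETED_MARKER, "body": DELETED_MARKER},
    {"subreddit": "politics", "author": "user_b", "body": "A reply"},
]


def split_deleted(records, marker=DELETED_MARKER):
    """Return the retained comments and a count of the deleted entries."""
    kept = [record for record in records if record["author"] != marker]
    n_deleted = len(records) - len(kept)
    return kept, n_deleted


kept, n_deleted = split_deleted(comments)
share_deleted = n_deleted / len(comments)
print(len(kept), n_deleted, round(share_deleted, 3))
```

Counting the deleted entries before discarding them preserves the one thing that is knowable about them, namely how many there were.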

3.1.2 Data access

The Pushshift monthly datasets can be downloaded as compressed files from the Pushshift website or accessed programmatically via the Pushshift API (Pushshift 2020; Baumgartner 2020a). They are also available through BigQuery, a data warehouse service from the Google Cloud Platform (BigQuery 2020). BigQuery stores datasets in tables which can be accessed and manipulated using structured query language (SQL). As a cloud-based product Google BigQuery allows users to work with large datasets without using local computer memory.

I decided to collect my data from the Google BigQuery tables, rather than directly from the Pushshift or Reddit APIs, for a number of reasons. First, they were already compiled into discrete time periods, and the monthly datasets were a manageable size to conduct analysis on. Second, I was able to query the BigQuery tables using SQL, which allowed me to conduct all of my data cleaning and preparation within BigQuery. Given the size of the datasets, being able to do this processing in the cloud rather than on my own computer greatly improved the ease and efficiency of the process. And finally, I was able to easily integrate with the Google Cloud Platform Storage service to store data post-processing (Cloud Storage 2020). This allowed me to store my data subsets in the cloud and access them locally only when I needed to conduct data analysis. Details of the data workflow, and the accompanying public code repository, are available in Appendix A.1.

Figure 3.1 shows the interface for the BigQuery platform for illustration. The top half of the image shows the query editor and the SQL command used to query the dataset. This example queries the January 2019 dataset of Reddit comments, the monthly dataset used in this research. The SQL code shown groups the comments by subreddit, counts the number of unique authors (num_authors) in each subreddit, then returns the results in descending order. The lower half of the image shows the results of the query. We can see that the subreddit AskReddit had the greatest number of unique authors in January 2019.

Figure 3.1: Screenshot of BigQuery interface querying January 2019 Reddit comment dataset

3.1.3 Overview of the data

Due to the sheer size of the full collection of comments on Reddit, I decided to limit my analysis to a single monthly cross-section. I chose January 2019 as it was the most recent dataset available on BigQuery when I began to finalise the analysis in April 2019. Due to the size of the datasets, and the fact that their creator, Jason Baumgartner, provides his services voluntarily, there is often a delay of a few months before a new monthly dataset is added to BigQuery. As previously mentioned, using more recent data ensured that issues of missingness in the older datasets had been corrected.

The January 2019 dataset is 33.2 GB in size and contains over 129 million comments (BigQuery 2020). In total, 6,135,425 unique authors commented in 125,363 subreddits. Approximately 4% of all comments (n=5,345,373) had been deleted between when they were posted on Reddit and when the data was collected by Pushshift. The data available for deleted comments did not include the username of the author or the content of the comment. As such I removed all deleted comments from further analysis.

Figure 3.2 shows the range in the number of authors and the number of comments made per subreddit in January 2019. All count values are highly skewed, so I log-transformed each of them, and the logged values are the ones shown. The first subplot shows a scatter plot of the number of comments (logged, x-axis) by the number of authors (logged, y-axis), per subreddit. There is a general positive trend: the more authors a subreddit has, the more comments it naturally tends to have. The trend is also left censored, as a subreddit cannot have a higher author count than comment count: each author must have made at least one comment in the subreddit.

On the lower end of the y-axis we see subreddits which have a logged author count of 0, that is, 1 author in total. Following this line from left to right we can see that some of these subreddits have many comments despite only having 1 author each, in some cases with a logged comment count greater than 10, or over 22,026 comments. In most cases subreddits with very many comments but only one author will be the result of bot accounts. A bot can automatically contribute many more posts and comments than any single human author would be capable of within the one month time frame. These subreddits could serve a variety of purposes, such as a bot creator testing out a bot’s behaviours, or a bot which makes funny responses to contributed posts. In the next section we will discuss the process of removing bot accounts from the dataset.

The second subplot in Figure 3.2 shows the frequency distribution of subreddit author counts. The y-axis is normalised to show the fraction of subreddits at each author count. For example, around 38% of all subreddits have an author count above exp(8.5), or 4,915 authors. The histogram shows that author participation on Reddit is highly skewed between subreddits. As such I focused analysis on a subset of the largest subreddits, which account for the vast majority of activity on the platform.

3.2 Subsetting data

Half of the comments made during January 2019 were made in only 344 subreddits, accounting for 0.27% of all subreddits. I decided to select the 1000 largest subreddits for analysis, as this would cover the majority of activity on Reddit while still including a range of very large and relatively very small subreddits. Importantly, it was a small enough sample of subreddits for me to manually label, as I will describe in Section 3.4, whilst also a large enough sample to cover a range of topics.

Figure 3.2: Overall author and comment counts per subreddit

I selected the top 1000 subreddits by total number of authors. I used the number of authors instead of the number of comments as my measure of size because author-level behaviours are an important focus of my analysis, as we will see in later chapters. If I had selected subreddits by number of comments this would have included some subreddits with very few authors who collectively made very many comments. Such subreddits would have greatly skewed the analysis in Chapter 4. For the top 1000 subreddits, the number of authors per subreddit ranges from 818 (SelfAwarewolves) to 618,016 (AskReddit). Figure 3.3 shows that the distribution of author counts for the subset of top 1000 subreddits is still highly skewed.
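The two selection steps discussed above can be sketched as follows: finding how few of the largest subreddits cover a given share of all comments, and ranking subreddits by unique authors. This is a minimal illustration, not the thesis code. The function names are hypothetical, and the toy counts are invented apart from the two author counts quoted in the text.

```python
def subreddits_covering_share(comment_counts, share=0.5):
    """Smallest number of largest subreddits whose comments reach `share` of the total."""
    total = sum(comment_counts.values())
    running = 0
    ranked = sorted(comment_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= share * total:
            return rank
    return len(comment_counts)


def top_subreddits_by_authors(author_counts, n=1000):
    """Names of the n subreddits with the most unique authors."""
    ranked = sorted(author_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


# Toy comment counts (invented): the two largest cover 70 of 100 comments.
comment_counts = {"a": 40, "b": 30, "c": 20, "d": 10}
n_for_half = subreddits_covering_share(comment_counts)

# Author counts: the first two values are quoted in the text, "tiny" is invented.
author_counts = {"AskReddit": 618016, "SelfAwarewolves": 818, "tiny": 3}
largest_two = top_subreddits_by_authors(author_counts, n=2)
```

Ranking on unique authors rather than comments keeps a subreddit with a handful of extremely prolific accounts from displacing one with a broad base of participants.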

Figure 3.3: Cumulative frequency of author counts for top 1000 subreddits

3.2.1 Subsetting authors

In total there are 20,560,704 comments, 3,838,553 unique authors, and 8,019,055 author-subreddit pairs in the top 1000 subreddits. An author-subreddit pair is any instance in which an author comments in a subreddit. It can also be considered as an edge between the set of authors and the set of subreddits in a bipartite graph, as I will describe in the next section. For each author-subreddit pair I calculated the number of comments, such that $C_{ij}$ is the number of comments author $i$ made in subreddit $j$.

I used a combination of methods to identify and remove suspected bot accounts. Bots are widely used on Reddit to perform automated utility tasks. For example, alternate-source-bot is an account that replies to posts of news articles with comments linking to similar news articles from different sources. As with alternate-source-bot, it is common practice for the developers of bot accounts to choose usernames which end in ‘bot’ to signal to other users that the account is a bot. Therefore I removed all authors with a username ending in ‘bot’ from the dataset. This included 5,254 unique authors and a total of 808,835 comments. It is possible that some human authors who happened to have a username ending in ‘bot’ were removed at this stage. However, after manually checking the comment histories of a random selection of a couple of hundred of the usernames I was confident that the majority were bots.

I also wanted to remove bot accounts whose usernames did not end in ‘bot’, but which were highly active and, therefore, may have had a significant effect on my analysis. I compiled a list of the top 50 most active authors by either total number of comments or number of subreddits commented in and manually reviewed their posting histories. By looking at

the content and frequency of posts I was able to identify seven authors that were clearly bot accounts and removed them from the dataset. The seven highly active bot authors were AutoModerator, autotldr, Mentioned_Videos, TweetPoster, xkcd_transcriber, imgurtranscriber, and The-Paranoid-Android.

As the number of authors is highly skewed between subreddits, the number of comments made per author within subreddits is also usually highly skewed. Many authors only make a few comments within a subreddit in a given month. As my research is interested in how the behaviours of authors as members of subreddits determine the characteristics of subreddits, I decided to set a threshold for author membership of a subreddit by only looking at the most active authors in each subreddit. I chose to take a subset of the top quartile (25%) of most active authors within each subreddit, based on the number of comments ($C_{ij}$).

Many subreddits had overall very low average author participation, as will be shown in Chapter 4. This meant that they had very low 75th percentiles for the number of comments per author. For example, the subreddit ContagiousLaughter had a 75th percentile of 1, meaning that at least 75% of its authors only made 1 comment in the subreddit in January 2019. I therefore decided to only select authors who made at least one more comment than the 75th percentile comment count. For ContagiousLaughter, therefore, I kept authors who made at least 2 comments in the subreddit.

After removing suspected bot accounts and subsetting the 25% most active authors per subreddit there were 59,662,403 comments, 1,478,695 unique authors and 3,642,568 subreddit-author pairs remaining in the final January 2019 data subset.
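The filtering steps above can be sketched as a small function: drop suspected bots, then keep authors whose comment count in a subreddit is at least one above that subreddit's 75th percentile. This is a hedged illustration only. The nearest-rank percentile definition, the case-insensitive ‘bot’ suffix match, and all function and parameter names (`extra_bots`, `most_active_pairs`) are assumptions of the sketch, and the toy counts are invented.

```python
import math
from collections import defaultdict


def is_suspected_bot(author, extra_bots=frozenset()):
    """Flag usernames ending in 'bot' (case-insensitive, an assumption) or manually listed."""
    return author.lower().endswith("bot") or author in extra_bots


def percentile_75(values):
    """75th percentile using the nearest-rank definition (an assumption)."""
    ordered = sorted(values)
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def most_active_pairs(pair_counts, extra_bots=frozenset()):
    """Keep (author, subreddit) pairs with at least p75 + 1 comments.

    `pair_counts` maps (author, subreddit) to a number of comments.
    """
    by_sub = defaultdict(dict)
    for (author, sub), n in pair_counts.items():
        if not is_suspected_bot(author, extra_bots):
            by_sub[sub][author] = n
    kept = {}
    for sub, counts in by_sub.items():
        threshold = percentile_75(counts.values()) + 1
        for author, n in counts.items():
            if n >= threshold:
                kept[(author, sub)] = n
    return kept


# Toy counts (invented) for one subreddit whose 75th percentile is 1.
toy_pairs = {
    ("a", "ContagiousLaughter"): 1,
    ("b", "ContagiousLaughter"): 1,
    ("c", "ContagiousLaughter"): 1,
    ("d", "ContagiousLaughter"): 2,
    ("alternate-source-bot", "ContagiousLaughter"): 40,
    ("AutoModerator", "ContagiousLaughter"): 90,
}
kept_pairs = most_active_pairs(toy_pairs, extra_bots={"AutoModerator"})
```

In the toy data the percentile is computed after the two bot accounts are removed, so their very high counts do not inflate the threshold for human authors.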

3.3 Data Preparation

Chapters 4 and 5 will set out the research problem in detail, including our research questions and hypotheses, and outline the specific methods used to address them. In this section I will explain how the subset of data described in the previous section was prepared for the analysis that will be explained in the following chapters.

3.3.1 Author-subreddit pairs

The previous section showed that after selecting the top quartile of authors in each of the top 1000 subreddits, there were 1,478,695 unique authors in the January 2019 dataset. The first part of analysis, which will be presented in Chapter 4, addresses author behaviour across and within subreddits. To conduct this analysis I constructed a subreddit by author bipartite matrix.

In Section 3.2, to be able to subset authors by the number of comments they made in a given subreddit, I calculated $C_{ij}$, the number of comments made by author $i$ in subreddit $j$. The resulting dataset had 3,642,568 rows, one for each author-subreddit pair. We can consider $C_{ij}$ to be the edge weight between the pair of nodes $A_i$ and $S_j$. From this edge data I created a weighted bipartite matrix $B = (A, S, C)$ such that $A$ is the set of authors, $S$ is the set of subreddits, and $C$ denotes the edge values (i.e. number of comments per author-subreddit pair). Here $i = 1, \dots, m$, where $m$ is the number of authors (1,478,695), and $j = 1, \dots, n$, where $n$ is the number of subreddits (1000).

\[
B = \begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1n} \\
c_{21} & c_{22} & \cdots & c_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
c_{m1} & c_{m2} & \cdots & c_{mn}
\end{pmatrix}
\tag{3.1}
\]

Constructing the comment weighted subreddit by author matrix $B$ served two purposes. First, it allowed me to easily conduct the analysis we will examine in Chapter 4, for which I needed to access the comment count vectors of both subreddits and authors. For example, to calculate the average number of comments made by the $i$th author ($A_i$) I would index the $i$th column and drop any zero values to get the distribution of the number of comments $A_i$ made in each subreddit they commented in. I would then take the sum of this vector, the total number of comments made by author $A_i$ in the dataset, and divide by the number of values, to get the author’s average number of comments per subreddit. Similarly, I could perform the same analysis on the $j$th row (i.e. the subreddit vector) to calculate the average number of comments per author in subreddit $S_j$.

Second, I was able to use the subreddit by author matrix to project a weighted unipartite matrix of subreddits. To construct the unipartite matrix, also known as the one-mode projection, I multiplied the subreddit by author matrix $B$ by its transpose, the author by subreddit matrix $B^T$. This resulted in a square subreddit by subreddit matrix where each value is the number of authors the given pair of subreddits shares (Breiger 1974; Everett and Borgatti 2013). This then provided the first of two types of similarity between subreddits used in the second stage of analysis.
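The two uses of the matrix described above can be sketched with dictionaries instead of a full matrix. This is a minimal illustration on invented toy data, not the thesis implementation, and the function names are hypothetical. The sketch assumes the projection is taken on presence or absence (a binarised matrix), since that is what yields counts of shared authors; multiplying raw comment counts would instead sum products of counts. The binarisation is an assumption of the sketch rather than something stated in the text.

```python
from collections import defaultdict
from itertools import combinations


def mean_comments_per_subreddit(pair_counts):
    """Average comments per subreddit for each author, ignoring zero entries."""
    totals = defaultdict(int)
    n_subs = defaultdict(int)
    for (author, _sub), n in pair_counts.items():
        if n > 0:
            totals[author] += n
            n_subs[author] += 1
    return {author: totals[author] / n_subs[author] for author in totals}


def co_authorship(pair_counts):
    """Number of shared authors for each unordered pair of subreddits."""
    subs_by_author = defaultdict(set)
    for (author, sub), n in pair_counts.items():
        if n > 0:
            subs_by_author[author].add(sub)
    shared = defaultdict(int)
    for subs in subs_by_author.values():
        # Each author adds one to every pair of subreddits they commented in.
        for s1, s2 in combinations(sorted(subs), 2):
            shared[(s1, s2)] += 1
    return dict(shared)


# Toy (author, subreddit) -> comment counts, invented for illustration.
toy_pairs = {
    ("a", "x"): 2, ("a", "y"): 4,
    ("b", "x"): 1, ("b", "y"): 1, ("b", "z"): 7,
    ("c", "z"): 5,
}
author_means = mean_comments_per_subreddit(toy_pairs)
shared_authors = co_authorship(toy_pairs)
```

Working from the edge list avoids materialising a 1,478,695 by 1000 dense matrix; a sparse matrix library would be the natural choice at full scale.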

3.3.2 Subreddit similarity

To examine the relationships between subreddits, as I will further discuss in Chapter 5, I extracted two forms of information from the comment datasets: which subreddits authors commented in, and what text content they commented. As I described in the previous section, I was able to use the subreddit by author comment matrix $B$ to project a unipartite adjacency matrix on the set of subreddits. This subreddit co-authorship matrix was weighted such that $S_{ij}$ was the number of authors who commented in both subreddits $s_i$ and $s_j$. As such I will refer to the weighted subreddit adjacency matrix as the co-authorship matrix, or $C$, as it describes the number of authors who commented in a given pair of subreddits.

Weighted co-authorship

The values for co-authorship are highly skewed, as the total size of subreddits varies greatly. Raw values of co-authorship between subreddits thus often tell us more about the size of the subreddits than their relative level of author overlap. If two subreddits each have 1,000 authors and 100 authors comment in both, there is a 10% overlap in co-authorship. However, if another two subreddits have 100 authors each, 99 of which comment in both, the overlap is 99%. The 99% overlap is much more indicative of subreddit similarity than the 10% overlap, even though 99 authors is fewer than 100. This example shows that to properly understand the relative level of co-authorship between subreddits we need to normalise the co-authorship values by the size of subreddits.

To control for subreddit size I used the configuration model to generate a matrix of expected co-authorship values between subreddits. The configuration model generates a random network based on a pre-defined set of degrees for all subreddits in the network (Barabási 2016; Newman 2010). As we know the degree for each subreddit, i.e. the number of other subreddits with which it shares an edge, we can use the configuration model to determine the probability of an edge between each pair of subreddits.

For subreddits $i$ and $j$ with degrees $k_i$ and $k_j$, the probability of an edge between $i$ and $j$ is calculated by:

\[
p_{ij} = \frac{k_i k_j}{2m - 1} \tag{3.2}
\]

where $m$ is the total number of edges in $C$ ($m = \frac{\sum k}{2}$). As $C$ is a 1000 by 1000 undirected matrix, without self-loops, the maximum possible value of $m$ was 499,500 ($\frac{1000 \cdot 999}{2}$). However, the actual value of $m$ is 406,746 as some pairs of subreddits did not share any authors. The overall density of the co-authorship network was therefore 0.81.

We then use Equation (3.2) to create the expected co-authorship matrix, $E$, of the probability of an edge between all pairs of subreddits. Finally, to calculate the normalised co-authorship matrix ($C_{\text{weighted}}$) I divided the weighted subreddit matrix $C$ by the expected matrix $E$ such that the weighted co-authorship between subreddits $i$ and $j$ is:

\[
\text{weighted}_{ij} = \frac{c_{ij}}{e_{ij}} \tag{3.3}
\]
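Equations (3.2) and (3.3) can be sketched directly from a dictionary of shared-author counts. This is an illustration under stated assumptions, not the thesis code: the function name and the dictionary layout are hypothetical, the toy counts are invented, and pairs with no shared authors are simply omitted (their weighted value would be zero).

```python
from collections import defaultdict


def weighted_co_authorship(shared):
    """Divide each shared-author count by its configuration-model expectation.

    `shared` maps an unordered subreddit pair (s1, s2) to the number of
    authors the two subreddits have in common.
    """
    # Degree: number of other subreddits a subreddit shares an edge with.
    degree = defaultdict(int)
    for (s1, s2), n in shared.items():
        if n > 0:
            degree[s1] += 1
            degree[s2] += 1
    m = sum(degree.values()) / 2  # total number of edges
    weighted = {}
    for (s1, s2), n in shared.items():
        if n > 0:
            expected = degree[s1] * degree[s2] / (2 * m - 1)  # Equation (3.2)
            weighted[(s1, s2)] = n / expected                 # Equation (3.3)
    return weighted


# Toy network (invented): degrees are x=2, y=2, z=3, w=1, so m=4.
toy_shared = {("x", "y"): 2, ("x", "z"): 1, ("y", "z"): 1, ("z", "w"): 3}
toy_weighted = weighted_co_authorship(toy_shared)
```

In the toy network the pair involving the low-degree subreddit `w` receives the largest weight, which is the up-weighting of overlap between less connected subreddits that the normalisation is meant to produce.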

Weighting the raw co-authorship counts between subreddits by the probability of them sharing an edge at random serves two complementary purposes. First, it reduces the size of the authorship overlap between large subreddits. The more authors a subreddit has, the more subreddits it tends to share authors with. Therefore we expect large subreddits to share higher numbers of authors. Down-weighting co-authorship by the high likelihood that these subreddits share an edge thus partially reduces this effect. Second, and conversely, this process emphasises the small but significant author overlap between smaller subreddits. In our earlier example, we saw that a co-authorship value of 99 can be much more meaningful than a value of 100 where it accounts for a 99% overlap in authors compared to a 10% overlap. As two small subreddits will have a very low probability of sharing an edge, this method will up-weight their relative levels of co-authorship.

In effect, the weighted co-authorship values tell us how much higher the authorship overlap is for a pair of subreddits than we would expect at random, given their degrees. This is therefore a more informative measure of similarity between subreddits than the raw co-author counts, as it mitigates some of the effect of preferential attachment in the network.

In Chapter 5 I will show how the weighted co-authorship matrix $C_{\text{weighted}}$ was used to address our network research questions about the existence of public sphericules and meta echo chambers on Reddit. But first, I will outline how I constructed the second measure of subreddit similarity used, this one based on the text of comments.

Text similarity

In Chapter 5 I will also analyse the text similarity between subreddits. To create a measure of text similarity I returned to the original subset of 59,662,403 comments made by the top quartile of authors in the top 1000 subreddits. I used term frequency-inverse document frequency (TF-IDF), a common ranking statistic in text analysis, to create a weighted text vector for each subreddit. I then used cosine similarity to calculate a single measure of text similarity between all subreddit pairs.

For each subreddit I combined the text of all comments made by its top authors into a single ‘document’. I then processed each subreddit document to create a ‘bag of words’ by removing all numbers and punctuation, making all characters lower case, and splitting the documents into individual terms on whitespace. For example, the comment ‘Nah, it’s just a pumpernickel loaf.’ would be split into the separate terms ‘nah’, ‘its’, ‘just’, ‘a’, ‘pumpernickel’, ‘loaf’.

Commonly in text analysis, after text has been broken into terms they are pre-processed using either stemming or lemmatising (Manning et al. 2008). Stemming removes the ending from the ‘stem’ of a term such that ‘winning’ becomes ‘win’. Lemmatising is a more detailed process which converts a word into its root form such that ‘won’ would be converted to ‘win’, its root form or lemma. The advantage of these processes is that they group sets of words which provide the same information under one term (e.g. ‘win’). These methods can be very useful on structured, standard language text.

I tested multiple algorithms for both stemming and lemmatising using the Python package NLTK (Natural Language Toolkit) (Bird et al. 2009). I found that all algorithms were very effective at lemmatising standard English words such as ‘won’ or ‘winning’. However, they also tended to incorrectly over-stem some standard terms (e.g. ‘Winnie’ becomes ‘win’). This was also very common for non-standard language terms (e.g. ‘winnerz’). A possible solution to this over-processing would have been to consider any non-standard terms as typos or misspellings and to use distance metrics such as Levenshtein or Hamming to convert them into their ‘correct’ form

(e.g. ‘winnerz’ becomes ‘winners’, which can become ‘win’) (Manning et al. 2008).

I suggest, however, that either of these steps (stemming/lemmatising or spelling correction) is actually more likely to introduce new errors to our dataset and lose valuable information than it is to improve the quality of the data. While these processes have been shown to work well on standard language text, they are often less applicable to the unstructured, informal text common on social media (Uysal and Gunal 2014). Perhaps most importantly, what may appear to an outside observer to be a typo may actually be a meaningful neologism within the community in which it is used. Our example of ‘winnerz’ is in fact a self-referential term used by members of The_Donald. Therefore it is important that we maintain the exact form of the term used by the authors. For our research we are specifically concerned with identifying subreddits which share these kinds of rare and highly insightful terms. For our purposes, then, these data cleaning methods would not just introduce errors into the dataset; they would be certain to reduce the most crucial information in the data. For these reasons I did not pre-process the text further.

After creating the ‘bag of words’ for each subreddit I calculated the number of times each word, or term, was used in a given subreddit, to create a weighted word vector for each subreddit (Manning et al. 2008). As with the author comment counts in the previous section, I then combined all subreddit vectors into a bipartite matrix of words by subreddits. In total 26,197,494 terms appeared across the 1000 subreddits. I constructed a subreddit by word affiliation matrix of word occurrence counts such that term $t$ in subreddit document $d$ has term frequency $tf_{td}$. I removed terms which appeared in 80% or more of the subreddits

(N≥800). This ensured that very common words such as ‘a’, which would not be informative of subreddit similarity, were excluded. I also removed any terms that did not appear in at least two subreddits, as these would also not provide information on similarity between subreddits. By setting upper and lower thresholds on the number of subreddits a term appeared in I was also able to focus only on terms which represented meaningful textual similarity between subreddits. This also had the benefit of reducing the dataset to 38% of the original number of terms (N=9,898,678), making the dataset a much more manageable size to compute.

I then used term frequency-inverse document frequency to weight the term counts per subreddit. TF-IDF downweights words that are common across many documents, in our case subreddits, and upweights terms that are rare across subreddits but commonly used in a few. This is used to highlight semantic closeness between subreddits that share use of uncommon words (Sparck Jones 1972). The inverse document frequency measures how important a term is based on how common or rare it is in the total corpus. It is calculated by taking the log of the total number of subreddits ($N$) divided by the number of subreddits the term appears in ($n_t$). Therefore, for term $t$:

\[
\text{idf}_t = \log \frac{N}{n_t} \tag{3.4}
\]

The weighted text matrix is then obtained by multiplying each term frequency by its term’s idf:

\[
\text{tfidf}_{td} = tf_{td} \cdot \text{idf}_t \tag{3.5}
\]
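The text pipeline described in this section (tokenising, filtering terms by the number of subreddits they appear in, then applying Equations (3.4) and (3.5)) can be sketched as follows. This is a minimal illustration, not the thesis code. The function names, the natural-log base, and the exact set of characters stripped are assumptions of the sketch, and the toy documents are invented apart from the terms quoted in the text.

```python
import math
import string
from collections import Counter


def tokenise(text):
    """Lower-case, remove digits and punctuation, then split on whitespace."""
    drop = string.punctuation + string.digits + "‘’“”"
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    return cleaned.split()


def tfidf(documents, max_share=0.8, min_docs=2):
    """Return {subreddit: {term: tf-idf weight}} for a {subreddit: text} dict.

    Terms are kept only if they appear in at least `min_docs` documents
    and in fewer than `max_share` of all documents.
    """
    tf = {name: Counter(tokenise(text)) for name, text in documents.items()}
    n_docs = len(tf)
    doc_freq = Counter(term for counts in tf.values() for term in counts)
    keep = {t for t, n in doc_freq.items()
            if min_docs <= n < max_share * n_docs}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in keep}  # Equation (3.4)
    return {name: {t: c * idf[t] for t, c in counts.items() if t in keep}
            for name, counts in tf.items()}                  # Equation (3.5)


terms = tokenise("Nah, it’s just a pumpernickel loaf.")

# Toy documents (invented): 'the' is in every document and is dropped,
# while terms found in a single document are dropped as uninformative.
toy_docs = {
    "d1": "the winnerz rally",
    "d2": "the winnerz vote",
    "d3": "the loaf recipe",
    "d4": "the loaf loaf bread",
    "d5": "the game",
}
weights = tfidf(toy_docs)
```

No stemming or spelling correction is applied, so a community-specific spelling such as ‘winnerz’ survives as its own term, in line with the reasoning given above.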

For each pair of subreddits, I then computed the cosine similarity between their TF-IDF weighted term vectors. Cosine similarity values could be between 0, where the subreddits shared no terms, and 1, if the subreddits had identical term vectors.

Finally, I compiled a list of all 499,500 undirected edges between subreddit pairs. For each edge I listed the weighted co-authorship and the text similarity between the subreddits. In Chapter 5 I will show how these two measures were used to create a single measure of similarity between subreddits, which I in turn used to perform community detection to identify topical public sphericules and meta-echo chambers in the subreddit network. But first I had to manually label each of the 1000 subreddits to be able to determine their high-level topics.
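Cosine similarity between two sparse term vectors can be sketched with dictionaries. This is an illustration rather than the thesis implementation: the function name is hypothetical, the toy weights are invented, and returning 0.0 for an empty vector is a convention chosen for the sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch when a vector has no terms
    return dot / (norm_u * norm_v)


# Toy weighted term vectors (invented values).
vec_a = {"winnerz": 2.0, "rally": 1.0}
vec_b = {"winnerz": 2.0, "rally": 1.0}
vec_c = {"pumpernickel": 3.0}

same = cosine_similarity(vec_a, vec_b)       # identical vectors
disjoint = cosine_similarity(vec_a, vec_c)   # no shared terms

# Number of unordered pairs among 1000 subreddits.
n_pairs = math.comb(1000, 2)
```

The pair count confirms the 499,500 undirected edges mentioned above, which is small enough to compute every pairwise similarity exhaustively.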

3.4 Labelling subreddit topics

To understand the range of content in the subset of subreddits, I needed to manually label the high level topic of each subreddit. This was particularly necessary to identify which subreddits explicitly dealt with political content and thus be able to determine the levels of chamberness of political subreddits. I wrote a Python script which took the name of a subreddit from the list of the top 1000 subreddits, for example The_Donald, and automatically opened a web browser that navigated to the page of the most popular posts in the subreddit of all time.¹

¹ Available at: https://www.reddit.com/r/The_Donald/top/?t=all

The most popular posts in a subreddit are the posts with the highest total scores. A post’s score is approximately the difference between the total number of upvotes and downvotes it has ever received. Top posts, therefore, represent the content that has been overall the best received by the users who have been exposed to it. The page for top posts of all time provides two key types of information: 1) content about the top posts and 2) the sidebar of the subreddit. The sidebar contains basic information written by subreddit moderators and is intended to provide a brief description of the subreddit for new visitors. Therefore, the top post page provided multiple sources of information for me to use to determine the high level topic of a subreddit.

From the list of top posts, for each post I could see the title, score, number of comments, author’s username, how old the post was, and a thumbnail if the post linked to an image or external website. I could also click through to see the full content of a post including any comments it had received. This was helpful when the content and/or tone of a post’s title were unclear and more context was needed. For example, if I suspected a subreddit was satirical I would click through to see the full text of a post and how commenters responded to it.

As posts are created and upvoted by subreddit users, top posts can be considered a form of crowd-sourced information about what content users enjoy within a subreddit. The sidebar, comparatively, outlines the high level ideals for the community, which are explicitly set by moderators, and implicitly accepted by the subreddit’s users. The sidebar provided multiple sources of general information about each subreddit including:

• a brief text description

• the date the subreddit was created

• the number of subscribers to the subreddit

• the number of redditors currently browsing the subreddit

• the subreddit rules (where they existed)

• a small logo (where it existed)

• related subreddits (automatically created by Reddit using an unknown formula, but moderators can disable this feature)

• a list of current moderators

The brief text description and subreddit rules were the most informative sources of information to determine the topic of a subreddit. The brief description is the moderators’ opportunity to explain to new visitors, within a few words, the purpose of the subreddit and what they should expect to find there. For example, the description for The_Donald was ‘The_Donald is a never-ending rally dedicated to the 45th President of the United States, Donald J. Trump’. This description clearly and quickly shows that the subreddit is about, and supportive of, Donald Trump as President of the United States.

When a subreddit description did not provide much information, or if it was unclear whether it was satirical or ironic, I would then look at the subreddit rules to gain a better understanding of the nature of the subreddit. The rules of a subreddit outline what is and is not allowed within a subreddit, specifically what content users can post and comment or how they should otherwise behave. Moderators are then supposed to use those rules to regulate user behaviour, such as by removing inappropriate content and explaining to authors why such content violates the subreddit’s rules.

Subreddit rules often include references to Reddit’s general content guidelines. The first rule in The_Donald at the time of labelling was ‘Do not violate Reddit’s Content Policy. You will be banned. If you see rulebreaking content, report & downvote.’ While these rules were not useful for gaining insights about a subreddit, subreddit-specific rules could provide further information about the purpose and scope of a subreddit. The_Donald’s sixth rule was titled ‘Trump Supporters Only. No Cucks or Leftists².’ It specifically stated ‘This forum is for Trump supporters only. If you have questions about our president, our way of thinking or other discussion questions, post on r/AskThe_Donald, where we will gladly answer. This forum is NOT for that.’ This rule clearly summarised what the subreddit was, and what it was not. It was a place to post content in support of Trump, and not a place to challenge, or even discuss, this position.

After reviewing the sidebar and top posts for a subreddit, I determined what high level topic(s) the subreddit belonged to. I used broad categories such as ‘gaming’ or ‘political’ to ensure that there were a reasonable number of total topics and number of subreddits per topic to be able to analyse. If the substantive topic of a subreddit was still unclear I labelled it as ‘discussion’ if it focused on textual content or ‘image’ if it predominantly featured pictures or videos. Where a subreddit could be classified under multiple topics I listed each of these, in order of which seemed most central to the purpose of the subreddit. As such ukpolitics was labelled as ‘political’ and then ‘geographic’. For methodological simplicity I only used the first topic label for each subreddit in the analysis presented in this research.

² ‘Cuck’ is a reference to the word cuckold, a term for a man whose wife is sexually unfaithful to him. Cuck was originally used by the online men’s rights movement as a pejorative against men who don’t subscribe to the same philosophy. The alt-right then began to use the portmanteau ‘cuckservative’ to define moderate conservatives who do not agree with alt-right views (Romo 2017).

However, given the time and labour intensiveness of manual coding, I decided to note more detailed labels than I expected to be necessary, as they could be useful for future analysis.

The Python script that automatically opened the top posts page for a subreddit also automatically prompted me to enter the label(s) I decided to annotate for the subreddit. Once I had entered the label the script automatically updated a master file with the subreddit name and label, moved to the next subreddit in the list, and opened the top posts page and label prompt for the next subreddit. This allowed me to cycle through the 1000 subreddits relatively quickly, and ensured I used the same reference material for all subreddits (i.e. the top posts and sidebar) to determine subreddit topics.
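A labelling loop of the kind described above can be sketched with the standard library. This is a hypothetical reconstruction, not the script used in the thesis: the function names, the CSV layout of the master file, and the prompt wording are all assumptions. The browser opener and the prompt are passed in as parameters so that the demonstration below runs without opening a browser or waiting for input.

```python
import csv
import io
import webbrowser


def top_posts_url(subreddit):
    """URL of a subreddit's top posts of all time."""
    return f"https://www.reddit.com/r/{subreddit}/top/?t=all"


def label_subreddits(names, handle, open_page=webbrowser.open, ask=input):
    """Open each subreddit's top posts page, prompt for labels, record them.

    `handle` is any writable text file object acting as the master file.
    """
    writer = csv.writer(handle)
    for name in names:
        open_page(top_posts_url(name))
        labels = ask(f"Label(s) for {name} (comma-separated): ")
        writer.writerow([name, labels.strip()])
        handle.flush()  # save progress after every subreddit


# Demonstration with stand-ins for the browser and the keyboard prompt.
opened = []
master = io.StringIO()
label_subreddits(
    ["ukpolitics"],
    master,
    open_page=opened.append,
    ask=lambda prompt: "political, geographic",
)
```

For interactive use the defaults would be left in place, for example `label_subreddits(names, open("labels.csv", "a", newline=""))`, where `labels.csv` is a placeholder file name.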

3.4.1 Overview of topic frequencies

Overall I used 53 unique top level topic labels, though the majority of these only included a few subreddits. Appendix A.2 lists all 53 topic labels with subreddit counts. Figure 3.4 shows the frequency of subreddits for the 21 topics with at least 10 subreddits (i.e. 1% of subreddits). The x-axis shows the subreddit count, and the y-axis shows the topic label. The subreddit count is also written to the right of each topic bar. These 21 topics account for 92% of subreddits.

The most common topic is gaming, which accounts for 213 subreddits, or 21% of the top 1000 subreddits in January 2019. funny and images then account for roughly another 10% each (109 and 99 subreddits, respectively). The general topic discussion consists of 79 subreddits. This means that approximately 8% of the top subreddits did not clearly belong to a substantive topic, at least to the casual observer. These four top topics (gaming, funny, images, and discussion) cover 50% of subreddits. The next six topics (tech, geographic, porn, tv, music, and political) account for more than the next quarter of subreddits (27%). The lower ten topics in Figure 3.4 account for a further eighth (12.5%) of subreddits.

Figure 3.4: Distribution of 21 most common subreddit topic labels

Possible data limitations

Due to the large number of subreddits to label I did not recruit anyone else to conduct independent labelling or to validate my topic labels. As I favoured broad topic labels (e.g. gaming) the categorisations should be largely objective. As stated, when I had any uncertainty about the substantive topic of a subreddit I used the formative labels discussion or images as appropriate. However, as I only spent a relatively short amount of time labelling each subreddit I may have misinterpreted and thus mislabelled the topics of some subreddits.

Overall I am confident in the general frequency of the most common topics, on which we will focus analysis in Chapter 5. However, the use of manual labelling by a single annotator may have introduced some errors, and therefore minor bias, into the dataset. This would have been the most detrimental to the research in the case of missing political subreddits. Therefore I took a second pass at the names of all subreddits to ensure that no clearly political subreddits were missed in the initial labelling process. I ultimately re-labelled two subreddits which had been erroneously labelled as political, as I will discuss shortly. I did not relabel any other subreddits.

We will explore the variety of subreddit topics in greater detail in Chapter 5, where we determine whether the network of subreddits is organised into topic-based public sphericules. However, the majority of our exploration of echo chamber behaviours will focus specifically on the subreddits labelled as political, as well as the potential anti-echo chamber changemyview. Therefore one final step of data collection was required to provide a more in-depth understanding of the character of the political subreddits.

3.4.2 Description of the political subreddits

After manually labelling all subreddits, I returned to the subreddits labelled political to check the accuracy of my labelling. In my initial annotation I had labelled 23 subreddits as political. However, on second assessment I decided that two of the initial subreddits were not sufficiently explicitly political to be included under the label. These were MensRights and MGTOW (i.e. ), both of which related to the topic of men’s rights activism (also known as the men’s rights movement).

I originally labelled these as political because online men’s rights activism is often associated with the alt-right (Nagle 2017). However, I ultimately decided to be more conservative in my use of the label political to limit the risk of introducing personal bias, and re-labelled both subreddits as discussion as they did not fit within any existing substantive topic label. These were the only subreddits which I relabelled.

I then manually labelled the remaining political subreddits by their political polarity. All political subreddits were labelled as either left-leaning (L), neutral (N), or right-leaning (R). In total, of the 21 political subreddits 11 were left-leaning, 5 neutral, and 5 right-leaning.

Table 3.1 lists each of the political subreddits, their polarity, and the date they were created as taken from their sidebars. This includes our potential anti-echo chamber changemyview for reference, which is labelled with polarity A for anti-echo chamber. As there are 11 left-wing subreddits compared to 5 on the right among the subreddits in our dataset, there is a clear left-leaning skew in the most popular subreddits in January 2019.

subreddit              polarity  date created
politics               N         2007-08-06
Conservative           R         2008-01-25
Libertarian            R         2008-01-25
worldpolitics          N         2008-02-10
ukpolitics             N         2008-03-08
socialism              L         2008-06-29
PoliticalHumor         N         2008-09-03
neoliberal             R         2011-04-14
PoliticalDiscussion    N         2011-04-14
changemyview           A         2013-01-16
SandersForPresident    L         2013-12-05
beholdthemasterrace    L         2015-04-26
The Donald             R         2015-06-27
COMPLETEANARCHY        L         2015-07-21
LateStageCapitalism    L         2015-08-28
ABoringDystopia        L         2016-01-29
ChapoTrapHouse         L         2016-06-10
Fuckthealtright        L         2016-07-09
esist                  L         2017-01-26
Trumpgret              L         2017-01-29
The Mueller            R         2017-07-26
ENLIGHTENEDCENTRISM    L         2017-08-28

Table 3.1: Polarity of political subreddits in order of date created

In Table 3.1 subreddits are ordered from oldest to newest. This is also shown in Figure 3.5, where each point represents a subreddit. The subreddit name is listed on the x-axis, the date it was created is on the y-axis, and the colour of the point shows the subreddit’s polarity. We can see that the majority of left subreddits, highlighted in blue, are relatively new. socialism was created in 2008, early in Reddit’s history. The next left wing subreddit to be created, and to still be popular as of January 2019, was SandersForPresident in 2013. The remaining 9 left subreddits were created in quick succession between 2015 and 2017. This coincides with the events surrounding the 2016 United States Presidential election and the rise of right wing populism and the online alt-right movement.

Figure 3.5: Timeline of political subreddit creation

This shows an interesting development in the nature of political subreddits, at least for those that were still active in January 2019. The first political subreddit created was the generalist politics in 2007. In 2008, 1 left, 2 right, and 3 neutral subreddits were created. There was a gap until 2011, when one more right and one more neutral subreddit were created. Until this point all of the subreddits were quite broad in topic, even those that were politically aligned such as Conservative and socialism. There is another gap until 2013, when the second left subreddit, SandersForPresident, was created. This is the oldest niche political subreddit to still be among the most active as of January 2019. Between 2015 and 2017 the remaining 9 left subreddits were created, while only 2 more right-wing subreddits were: The Donald and The Mueller.

The Mueller is about Special Counsel and his investigation into allegations of Russian interference in the 2016 U.S. Presidential election. The subreddit is associated with The Donald, as is showcased by their shared styling. Other subreddits have been created in this style and centre on political figures, such as The Farage, the schulz, the hofer, and The Europe; however, none of these were active enough to be included in our dataset.

Tables 3.2 to 3.4 provide brief descriptions of all of the political subreddits for reference, grouped by polarity. The descriptions are taken from the subreddit sidebars used to label the subreddit topics. I have attempted to select as brief a description as possible while still encapsulating the purpose of each subreddit.

subreddit              description
socialism              This is a community to discuss current events in our world from an anti-capitalist perspective and to provide clarity to socialist ideas.
SandersForPresident    A hub for activists supporting ’s platform and run for president.
beholdthemasterrace    Supremacists of any shade displaying their supremacy
COMPLETEANARCHY        Just... The most *Complete* Anarchy. 1. No - Liberalism and its defense is not accepted within the sub.
LateStageCapitalism    A One-Stop-Shop for Evidence of our Social, Moral and Ideological Rot
ABoringDystopia        A subreddit for chronicling how Advanced Capitalist Society is not only dystopic, but also incredibly boring
ChapoTrapHouse         is a
Fuckthealtright        A subreddit dedicated to shitting on the racist, misogynist, anti-Semitic, adolescent clusterfuck known as the “Alt-Right”.
esist                  Rule 7. no trump supporters. Rule 9. No references to subreddits and users of those subreddits who may disagree with us
Trumpgret              trumpgret - regretting it already
ENLIGHTENEDCENTRISM    The goal of this subreddit is to point out the hypocrisy of the centrist types who often align with (sometimes extreme) right wing views.

Table 3.2: Description of left-wing political subreddits

subreddit              description
politics               /r/Politics is for news and discussion about U.S. politics
worldpolitics          reddit’s free speech political subreddit - no agenda, few rules, reddit’s opposite day - do not encourage or incite violence - do not abuse your fellow redditors
ukpolitics             Political news and debate concerning the United Kingdom
PoliticalHumor         A subreddit focused on US politics, and the ridiculousness surrounding them
PoliticalDiscussion    Discussion about politics. Talk amongst yourselves. Politely and informatively.

Table 3.3: Description of neutral political subreddits

subreddit              description
Conservative           The place for Conservatives on Reddit.
Libertarian            A place to discuss , related topics, and share things that would be of interest to libertarians
neoliberal             Defending the open society. Rule 3. Bad faith arguing- Engage others assuming good faith and don’t reflexively downvote people for disagreeing with you or having different assumptions than you. Don’t troll other users.
The Donald             The Donald is a never-ending rally dedicated to the 45th President of the United States Donald J. Trump.
The Mueller            This is a subreddit for memes, discussions and articles about about Special Counsel Robert Mueller III, and his investigation of the Trump administration’s ties to Russia and any person of interest or related associate listed in the Mueller report

Table 3.4: Description of right-wing subreddits

We can see from both the subreddit names and their descriptions shown in Tables 3.2 to 3.4 that the political subreddits created since 2013 focus on more specific topics than the older ones. The left wing subreddits are notable for two reasons: 1) there are more of them, created relatively recently within a short time span, and 2) many of them are not just pro-left but explicitly anti-right.

On the first point, the recent creation of the left wing subreddits likely corresponds with the build up to and aftermath of the 2016 US Presidential election. The surge of new left subreddits began in the spring of 2015, around the time that candidates began making official announcements of their intentions to run.

On the second point, of the 9 left wing subreddits created since 2015, 7 are topically against and/or critical of some system or philosophy. Two are anti-alt right (beholdthemasterrace, Fuckthealtright), two are anti-Capitalist (LateStageCapitalism, ABoringDystopia), two are anti-Trump (esist³, Trumpgret), and finally ENLIGHTENEDCENTRISM focuses on criticising “centrist types who often align with (sometimes extreme) right wing views”.

³ esist is a play on the stylistic use of the prefix r/ before subreddit names. Its full name is r/esist, i.e. resist.

Only two of the recent left wing subreddits are not centred on criticising opposing views. COMPLETEANARCHY is a meme heavy and somewhat self-satirising pro-anarchist group. ChapoTrapHouse is devoted to an eponymous hard left wing podcast, which itself is known for its support for Bernie Sanders and satirization of and the right wing (Chapo Trap House 2020; Bowles 2020). In late June 2020 it was among the group of 2,000 subreddits, including The Donald, which were banned for promoting violence (Newton 2020).

Based only on these subreddit descriptions, the majority of the recent popular left wing ecosystem on Reddit appears to focus on external criticism, rather than discussion of internal beliefs and values. As we will see in our analysis of chamberness behaviours in the following chapters, this focus on othering political opposition could be a factor in these leftist subreddits having greater chamberness than other political subreddits.

Though there are more left wing subreddits, they tend to be smaller than the right subreddits in general and The Donald in particular. Figures 3.6a and 3.6b show the number of authors and comments, respectively, in each of the political subreddits.⁴ Left subreddits are coloured blue, neutral are green, and right are red. changemyview is also included in purple, for reference. Both plots show that many of the left subreddits are relatively small. PoliticalDiscussion is the smallest neutral subreddit by author and comment counts. By far the largest political subreddit by number of authors is The Donald, which at 11,990 authors is the 44th largest subreddit overall. By comment count The Donald is the 5th largest subreddit.

I will discuss the variation between political subreddits in greater detail in the following analysis chapters. For now we have introduced the supplementary data which will allow us to gain a better understanding of the content of the political subreddits. This will help us interpret our findings as we explore the relative measure of chamberness among subreddits in the following chapters.

3.5 Chapter summary

This chapter has outlined how I accessed and collected Reddit comment data. It described aspects of my technical workflow, such as the use of structured query language (SQL) to manipulate data, Google Cloud Services to manage very large datasets, and an automated program written to facilitate manual labelling of subreddit topics.

⁴ Note on the colouring of political subreddits: This is the default ordering of colours in Python (blue, green, red, etc). As I have ordered the polarities intuitively from left to right (i.e. left, neutral, right) this coincidentally aligns with the American standards of blue for Democrats and red for Republicans. In the UK these colours are inverted, with blue being used by the Conservatives and red by the Labour Party. As the majority of political subreddits relate to American politics, I have decided to maintain the default colour-mapping.

Figure 3.6: Author and comment counts for political subreddits. (a) Number of authors. (b) Number of comments.

I explained how I chose to create a subset of data: first by selecting a monthly time period, January 2019; then by selecting the 1000 most active subreddits by author counts; and finally by removing suspected bot accounts and selecting the top quartile of most active authors per subreddit by comment count.

I also described how I processed this final subset of data to prepare it for the analyses which I will discuss in the following chapters. First I explained how I constructed the weighted subreddit by author matrix to conduct the subreddit-level analysis of author participation which will be further detailed in Chapter 4. Then I gave detail of how I created two measures of subreddit similarity (weighted co-authorship and text similarity), which I will then combine in Chapter 5 to perform community detection and identify topical public sphericules and meta-echo chambers in the subreddit network.

Finally, I introduced the political subreddits which will be the main focus of our exploration of echo chamber behaviours on Reddit. I introduced supplementary data on these subreddits, in particular the dates on which they were created and excerpts of their official public descriptions, to provide context both for the subreddits themselves and the development of political discourse on Reddit in general.

Chapter 4

Subreddits as Echo Chambers

In Chapter 1 I outlined the key research questions of this thesis and in Chapter 2 discussed the motivations behind them. This chapter addresses the first two research questions, which seek to determine the level of ‘chamberness’ of individual subreddits using a variety of simple but informative measures. These two research questions address how each subreddit’s set of authors participates within the subreddit itself, as well as within other subreddits, to understand the relative degree of chamberness of each subreddit. The next chapter, Chapter 5, will then address the two remaining research questions, which examine the network dynamics of subreddits and whether public sphericules and meta-echo chambers exist between groups of subreddits.

This chapter is divided into four sections. Section 4.1 defines the research questions in greater detail, drawing on material from the background research to show how they relate to the measurement of subreddit chamberness. It also defines specific echo chamber and anti-echo chamber hypotheses for each research question, motivated by existing echo chamber theories. This first section will also outline the methods of analysis used to address the research questions.

Sections 4.2 and 4.3 will then present the results for each of the research questions in turn. Each section will begin by presenting the general trends in each of the relevant measures across all 1,000 subreddits. Then they will present the results of the 22 case study subreddits: the 21 political subreddits and the self-defined anti-echo chamber r/changemyview.

Finally, Section 4.4 will summarise the results discussed in this chapter. It will relate the results back to the research questions and confirm whether the results support or disagree with the hypotheses outlined in the first section. This will prepare us to move on to the second stage of analysis, looking at the networked nature of echo chambers in the following chapter.

4.1 Defining chamberness

This chapter will test whether a range of simple measures can be used to determine the level of chamberness of the top 1000 subreddits on Reddit. To determine the chamberness of a given subreddit we use intra-subreddit measures of author participation. I consider a subreddit to be characterised by the group of authors who most regularly contribute to it. As such, I can understand the overall nature of a subreddit by examining the behaviour of its most active authors, both within the original subreddit and in others. I am particularly interested in two aspects of author participation: where they comment, and how much.

On Reddit, as on other social media sites, author comment participation is an important indicator of overall group dynamics. However, how often and where authors comment is particularly relevant to echo chamber dynamics, as this section will show. In its most strict form, echo chamber theory posits that users of the echo chamber will not participate in other groups. In addition, due to the attraction of reinforcement bias, they may be drawn to participate more in the echo chamber than members of a non-chamber group. Therefore, to test the echo chamberness of subreddits, we must understand the general trends in author participation between subreddits, and determine to what extent, if any, the hypothesised echo chambers display relatively greater chamberness behaviours.

4.1.1 Within subreddit author participation

The literature on echo chambers suggests that confirmation bias encourages people to enter and maintain echo chambers, as we experience pleasure from having our opinions re-affirmed by a like-minded group. To test this theory we can determine whether authors who choose to comment in a purported echo chamber, such as The Donald, spend relatively more time within the chamber than authors in other subreddits. As such, the first research question looks at the levels of participation of authors within subreddits. By examining how often authors tend to comment in each subreddit we can begin to understand trends in author participation for the top subreddits on Reddit, and consider how these are expected to vary for echo chambers. In other words, how ‘devoted’ are authors to the chamber?

Research question 1: How does author within subreddit participation vary between subreddits?

Once we understand how within subreddit author participation varies between subreddits, we can determine whether political subreddits have higher or lower levels of participation than other subreddits. In earlier chapters we highlighted The Donald as a potential echo chamber based on the public description of the subreddit. The descriptions of the political subreddits presented in Tables 3.2 to 3.4 in the previous chapter suggest that other political subreddits have similar anti-opposition policies to The Donald. However, without an equally thorough understanding of the culture of any of these subreddits I refrain from singling out specific ones as additional possible echo chambers. Instead, with regards to the chamberness of other political subreddits I suggest the general principle that the farther left- or right-wing a subreddit is, i.e. the more polarised its views, the more we would expect it to be an echo chamber. By contrast, politically neutral subreddits are not expected to have high levels of chamberness.

As an indication of chamberness I anticipate that authors of echo chambers will spend relatively more time in the chamber than in other subreddits. This is due to both push and pull factors. Selective exposure would suggest that members of an echo chamber will participate in fewer subreddits overall to avoid situations that may conflict with their views, in effect pushing them away from other subreddits. It follows, then, that if chamber authors participate in fewer subreddits overall, and therefore divide their time between fewer subreddits, they will on average spend more of their time in the chamber than authors who participate in more subreddits.

In addition, the reinforcement aspect of selective exposure discussed in the literature may also act as a pull factor towards the chamber. I suggest that authors will be attracted to the chamber for bolstering and reinforcing their opinions, and that this process will draw them into participating more in the chamber than authors of other subreddits which do not show as strong confirmation biases.

Echo chamber hypothesis 1: Echo chamber subreddits will have higher author within subreddit participation than other subreddits.

As stated, this hypothesis is based on the dual reinforcement and avoidance aspects of selective exposure which are theorised to drive echo chambers. Members of an echo chamber are expected to be drawn to engage in the echo chamber by the positive feedback of having their views supported and reinforced. As such I anticipate that authors of echo chambers will participate more actively than authors in subreddits overall. In other words, high author participation rates are an expected indicator of greater chamberness. Therefore I would expect the opposite trend to be true for changemyview if it is an anti-echo chamber.

Anti-echo chamber hypothesis 1: The anti-echo chamber will have lower author within subreddit participation than other subreddits.

This hypothesis is based on the diversity of content within changemyview and the expected quality of contributions. First, as the anti-echo chamber covers a wider range of disparate topics, I suggest that many authors will only make comments on posts relating to topics they are particularly interested in or knowledgeable about, thus curbing the extent to which any given author will participate in the subreddit. The wide range of topics discussed in the anti-echo chamber will mean that most authors will not engage with most posts, but only participate in topics which they are particularly interested in.

Secondly, given the greater level of effort and time needed to meet the standards for contributions that the anti-echo chamber requires of authors, I suggest that there will be an upper limit to how many comments the most active authors will make within the subreddit. The effort involved in developing the well structured arguments required in changemyview may limit the frequency at which the average author can contribute to it. Authors may also experience a degree of cognitive fatigue that limits the number of disagreements they engage with.

I also expect the average author of the anti-chamber to spend relatively less of their time in the anti-chamber compared to other subreddits. Rather than avoiding subreddits in an act of selective exposure, I expect authors of the anti-echo chamber to be drawn to participate in a greater range of subreddits overall and, thus, their rate of participation per subreddit to be more widely spread.

To answer this first research question I suggest two measures. The first is simply the average number of comments per author in a given subreddit. The second is the average proportion of an author’s total comments that they made within the given subreddit, which I call the insubreddit proportion.

Average comments per author

Equation (3.1) in Chapter 3 presented the weighted bipartite matrix B. To calculate the average number of comments per author for subreddit $j$ I take the row from B which corresponds to the $j$th subreddit and select all non-zero values. The resulting vector, called $s_j$, is the number of comments made by each of the $a_{s_j}$ authors in the subreddit. For the $a$th author in subreddit $s_j$ there are $x_a$ comments, such that for each subreddit we have the vector:

$$s_j = [x_1, x_2, \cdots, x_a] \tag{4.1}$$

As we have seen, the number of comments and authors per subreddit is highly skewed. The same is true of the number of comments per author in any given subreddit in our dataset. As such I use the median rather than the mean as my measure of average. The arithmetic mean is more sensitive to skewness in the distribution and is heavily influenced by outliers. By contrast the median is a more robust measure of central tendency in skewed distributions.

The median number of comments per author in each subreddit will be my first measure of author participation used to answer this first research question. The results of this analysis are presented in Section 4.2.
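As a minimal sketch of this first measure, the snippet below takes the non-zero entries of each subreddit row and compares their median with their mean. The toy matrix, the subreddit names, and the function name are invented for illustration and are not taken from the thesis workflow.

```python
from statistics import mean, median

# Hypothetical stand-in for the weighted bipartite matrix B:
# each key is a subreddit (a row of B), each list position is an author (a column).
B = {
    "subreddit_a": [12, 0, 3, 1, 0],
    "subreddit_b": [0, 2, 2, 40, 1],
}

def subreddit_vector(row):
    """Return s_j: the non-zero comment counts in one subreddit row."""
    return [x for x in row if x > 0]

# Median comments per author, the measure used in the text.
medians = {name: median(subreddit_vector(row)) for name, row in B.items()}

# Mean comments per author, shown only to illustrate its sensitivity to outliers.
means = {name: mean(subreddit_vector(row)) for name, row in B.items()}

print(medians)
print(means)
```

In this toy data a single very active author pulls the mean for subreddit_b far above the median, which is the behaviour that motivates the choice of the median.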

Author insubreddit proportion

For each author in the subreddit I then calculate the fraction of all of their comments which were made in the subreddit. I refer to this measure as the insubreddit proportion. To calculate the insubreddit proportion for each author in a subreddit I take the number of comments they made in the given subreddit $s_j$ and divide by the total number of comments they made across all subreddits. For the $i$th author I return to matrix B and select the column with the corresponding author vector $a_i$:

$$a_i = [x_1, x_2, \cdots, x_{s_j}] \tag{4.2}$$

The number of comments the $i$th author made in the $j$th subreddit is $x_{s_j}$ in $a_i$. To calculate the insubreddit proportion, $x_{s_j}$ is divided by the sum of $a_i$:

$$\mathit{insubreddit}_{as} = \frac{x_{s_j}}{\sum a_i} \tag{4.3}$$

such that $\mathit{insubreddit}_{as}$ is the fraction of all comments made by author $a_i$ that were made in subreddit $s_j$.

To determine a single measure of insubreddit participation for each subreddit I then take the average of the insubreddit proportions of all authors in the subreddit. Again, to avoid the bias of skewness in the distribution of insubreddit proportions I use the median instead of the mean as my measure of average. Thus subreddit $s_j$ has the vector of author insubreddit proportions $\mathit{insubreddit}_{s_j}$ such that:

$$\mathit{insubreddit}_{s_j} = [x_1, x_2, \cdots, x_a] \tag{4.4}$$

The median of this vector, the median author insubreddit proportion in each subreddit, will be our second measure of author participation used to answer the first research question and set of hypotheses. The results of this analysis are presented in Section 4.2.
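The calculation in Equations (4.2) to (4.4) can be sketched as follows. This is an illustrative example only: the small matrix, the subreddit names, and the helper function are assumptions made for the sketch, with rows standing for subreddits and columns for authors as in the description of B.

```python
from statistics import median

# Hypothetical toy version of B: rows are subreddits, columns are authors.
subreddits = ["subreddit_a", "subreddit_b", "subreddit_c"]
B = [
    [8, 0, 1, 5],  # subreddit_a
    [2, 4, 0, 5],  # subreddit_b
    [0, 4, 3, 0],  # subreddit_c
]

def insubreddit_proportions(B, j):
    """Insubreddit proportion (Equation 4.3) for every author active in row j."""
    proportions = []
    for i, x_sj in enumerate(B[j]):
        if x_sj == 0:
            continue  # author i made no comments in subreddit j
        author_total = sum(row[i] for row in B)  # sum of the author vector a_i
        proportions.append(x_sj / author_total)
    return proportions

# Median author insubreddit proportion per subreddit (the second measure).
median_insubreddit = {
    name: median(insubreddit_proportions(B, j))
    for j, name in enumerate(subreddits)
}
print(median_insubreddit)
```

Each proportion lies between 0 and 1, and a value of 1 would mean the author commented nowhere else in the sample.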

4.1.2 Between subreddit author participation

Our second research question then considers authors as actors who participate across subreddits. Using similar author participation measures as in Section 4.2 I will instead look at the distribution of comments made by the same author, but across different subreddits. This will speak to how much diversity an author shows in the range and frequency of comments they choose to make across subreddits.

However, as our research is centred on the chamberness of subreddits, rather than individual authors, I will then aggregate these author-level diversity measures for each subreddit. This will allow me to determine, for example, whether a subreddit consists of authors who engage in a variety of subreddits (i.e. they comment in a range of subreddits). This will allow us to incorporate knowledge of a subreddit’s authors’ ex situ behaviour into our understanding of the subreddit’s chamberness. As such the second research question asks,

Research question 2: How actively do authors of a given subreddit participate between all subreddits?

For any subreddit and its set of authors I want to understand how wide a range of other subreddits those authors participate in. As a measure of chamberness, I expect authors in more chamber-like subreddits to participate in fewer subreddits overall.

Echo chamber hypothesis 2: Echo chamber authors will have less varied participation between subreddits.

This hypothesis is again motivated by selective exposure theory. If authors of an echo chamber subreddit select to expose themselves only to views they agree with, this is expected to lead to them having a more limited palate (i.e. participating in a smaller range of subreddits). The fewer subreddits an author participates in, the less likely they are to meet ideological resistance. Conversely, I would expect authors of the anti-chamber to participate in a greater number of subreddits overall. As these authors actively choose to engage with the debates of the anti-chamber, they may be motivated by an opposing drive to selective exposure - they are seeking out disagreement. Therefore we would not expect them to have a limited palate due to any form of avoidance. On the contrary, I anticipate authors of the anti-chamber will choose to participate in a greater range of subreddits than those of any general subreddit.

Anti-echo chamber hypothesis 2: Anti-echo chamber authors will have more varied participation between subreddits.

Measures of author participation

To address this research question I return to the bipartite matrix B and repeat the process of constructing the subreddit vectors, but here I focus on the columns which correspond to authors. For the $i$th author I select the corresponding column and keep all non-zero values. This author vector, $a_i$, will have a length equal to the number of subreddits the author commented in. For the $j$th subreddit the author commented in, there are $x_{s_j}$ comments, such that for each author we have the vector:

$$a_i = [x_1, x_2, \cdots, x_{s_j}] \tag{4.5}$$

Using the author vector I then calculate four measures for each author which provide different information about how they participated between subreddits. First is the number of subreddits they commented in, $s_{a_i}$. Second is the total number of comments they made across all subreddits. This is the sum of the author vector:

$$a_{i\,\text{total comments}} = \sum a_i \tag{4.6}$$

The third measure is the median number of comments author $i$ made in each subreddit. This is a mirror of the median number of comments per author calculated on the subreddit vector in the previous section.

Finally, I want to understand how evenly a given author participates between the subreddits they comment in. I use the Gini coefficient, a standard measure of inequality in population distributions. The Gini coefficient has previously been applied to the dispersion of comments on Reddit, to measure the change in the concentration of submissions between subreddits over time (Singer et al. 2016).

The Gini coefficient measures the inequality in a given distribution, in our case the number of comments an author makes in each subreddit they commented in. It is half the relative mean absolute difference: the mean of the absolute difference between all pairs of subreddit counts in the author’s distribution is divided by the mean number of comments the author made per subreddit, and the result is halved. It is a ratio between the values of 0 and 1, where 0 means perfect equality (i.e. all subreddits have the same count) and 1 indicates perfect inequality (i.e. one subreddit accounts for all comments). Therefore the higher the value of Gini for a given author vector, the more unequally the author’s comments are distributed across subreddits.

For a given author vector the Gini coefficient is calculated using the following formula, where $x_{s_j}$ is the number of comments the author made in the $j$th subreddit and $s_{a_i}$ is the total number of subreddits the author participated in:

$$\mathit{Gini} = \frac{1}{s_{a_i}} \left( s_{a_i} + 1 - 2\,\frac{\sum (s_{a_i} + 1 - i)\, x_s}{\sum x_s} \right) \tag{4.7}$$

Note that the order of the values is important: the distribution must be sorted from smallest to largest value of $x_s$, and within the sum $i$ denotes the position of each value in that ordering. Consider an author who made one comment in each of two subreddits and two comments in each of a further two subreddits. Their distribution would be (1, 1, 2, 2) with $s_{a_i}$ of 4. Using Equation (4.7) their Gini coefficient would be 0.17, much closer to equal (0) than unequal (1).

However, because small sample sizes can bias the Gini coefficient, I chose to use an adjusted Gini measure to account for this bias (Deltas 2003).
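The four author-level measures can be sketched in a few lines, including the worked example for Equation (4.7). This is an illustrative sketch rather than the thesis code: it assumes an author vector is held as a plain list of non-zero comment counts, and the function names are hypothetical.

```python
from statistics import median

def gini(counts):
    """Gini coefficient of one author's comment counts, following Equation (4.7)."""
    x = sorted(counts)  # values must be ordered from smallest to largest
    n = len(x)          # s_ai: the number of subreddits the author commented in
    weighted = sum((n + 1 - rank) * value for rank, value in enumerate(x, start=1))
    return (n + 1 - 2 * weighted / sum(x)) / n

def author_measures(author_vector):
    """The four author-level measures for a list of non-zero comment counts."""
    return {
        "n_subreddits": len(author_vector),
        "total_comments": sum(author_vector),
        "median_comments": median(author_vector),
        "gini": gini(author_vector),
    }

# The example author from the text: (1, 1, 2, 2).
example = author_measures([1, 1, 2, 2])
print(round(example["gini"], 2))  # 0.17
```

Sorting inside the function means the caller does not need to pre-order the counts, which guards against the ordering error the text warns about.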

The adjusted Gini measure takes the original Gini value and weights it by the sample size divided by the sample size minus one. Here the sample size is the number of subreddits in the author vector, $s_{a_i}$, such that the adjusted Gini measure is calculated as:

$$\text{adjusted Gini} = \mathit{Gini} \cdot \frac{s_{a_i}}{s_{a_i} - 1} \tag{4.8}$$

For our example author the adjusted Gini would therefore be 0.22 ($0.17 \cdot \frac{4}{3}$). As the adjusted Gini is higher after accounting for the small sample bias we can see that the author has actually distributed their comments slightly less equally between subreddits than originally calculated.

These four measures will show us different aspects of how a given author participates between subreddits. However, we are interested in the patterns of participation at the subreddit rather than the author-level. Therefore, we must aggregate these author-level measures to the subreddit-level to address the second research question of cumulatively how actively authors of a subreddit participate between subreddits.

To aggregate each measure to the subreddit-level I take the average of values for all authors in each subreddit. For example, to aggregate the number of subreddits per author for the subreddit Manchester Uni I take all of the authors in Manchester Uni in my sample, calculate the number of subreddits each of the authors commented in, then take the median of these values. As with insubreddit proportion in the previous section, I used the median as my measure of average rather than the mean because of the skewness in the overall distribution of number of subreddits per author. Ultimately I will have four aggregate measures for each subreddit:

• median number of subreddits per author

• median total number of comments per author

• median of the median number of comments per subreddit per author

• median adjusted Gini coefficient of comments per author
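The aggregation described above can be sketched as follows. This is an illustrative outline only: the nested-dictionary input, the toy author and subreddit names, and the helper names are all hypothetical. The Gini formula is the same assumed rank-weighted form, and the per-subreddit average uses the median as the measure names state.

```python
from collections import defaultdict
from statistics import median


def adjusted_gini(values):
    """Adjusted Gini (assumed rank-weighted form); 0 for a single value."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return ((2 * weighted) / (n * sum(xs)) - (n + 1) / n) * n / (n - 1)


def author_measures(counts):
    """Map each author to (n subreddits, total, median per subreddit, Gini)."""
    measures = {}
    for author, per_sub in counts.items():
        vals = list(per_sub.values())
        measures[author] = (len(vals), sum(vals), median(vals), adjusted_gini(vals))
    return measures


def subreddit_medians(counts):
    """Aggregate the four author-level measures to each subreddit by median."""
    measures = author_measures(counts)
    rows_by_sub = defaultdict(list)
    for author, per_sub in counts.items():
        for sub in per_sub:
            rows_by_sub[sub].append(measures[author])
    return {
        sub: tuple(median(column) for column in zip(*rows))
        for sub, rows in rows_by_sub.items()
    }


# Hypothetical toy data: {author: {subreddit: number of comments}}
toy = {
    "author_a": {"sub_x": 1, "sub_y": 1, "sub_z": 2, "sub_w": 2},
    "author_b": {"sub_x": 5},
    "author_c": {"sub_x": 3, "sub_y": 9},
}
result = subreddit_medians(toy)
print(result["sub_x"])
```

Each author contributes their own four values to every subreddit they commented in, so an author who is active in many subreddits raises the median subreddit count of all of them.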

The results of these measures will be presented in Section 4.3 to offer evidence for the second research question and set of hypotheses.

In this section we reviewed the two research questions and accompanying hypotheses which form the first stage of analysis, and described the methods we will be using to address them. First we will examine how author within subreddit participation varies between subreddits using the median number of comments and insubreddit proportion per author. Drawing on the echo chamber literature, we expect echo chamber subreddits to have high levels of author participation and the anti-echo chamber to have low levels. We will address these hypotheses in the next section.

The second research question we outlined addressed how actively authors of a given subreddit participate between all subreddits. Based on the literature we anticipate that the authors of the echo chamber will have less varied participation between subreddits and anti-echo chamber authors will have more. To test these hypotheses we will use the four measures listed above, covering how much authors comment, how many subreddits they comment in, and how evenly they distribute their comments among subreddits. This research question will be the focus of the third section of this chapter.

4.2 Within subreddit participation results

In this section we present the analysis undertaken to answer the first research question: how does author participation vary between subreddits? This first subsection presents the general trends across all 1000 top subreddits. The second subsection then focuses specifically on the results of the political subreddits and changemyview to address the echo chamber and anti-echo chamber hypotheses.

4.2.1 General trends

Median Comments per Author

Figure 4.1 shows the frequency distribution of median comments per author for the top 1000 subreddits, which is highly right skewed. This is common with count measurements for online phenomena (Johnson et al. 2014; Adamic and Huberman 2002). The mean of the distribution is 8.3 and the standard deviation is 6.5. Values range from a low of only 2 median comments per author in 42 subreddits to 103 comments in the subreddit Market76. The lower quartile, median, and upper quartile are 4, 6, and 10, respectively. Recall from Chapter 3 that our dataset is limited to the top quartile of authors by comment count for each subreddit. Therefore, in 75% of subreddits the most active authors made an average of 10 comments or fewer per month. We can see in Figure 4.1 that the frequency of subreddits drops off quickly after 20 comments. The long tail shows the handful of outliers with very large median numbers of comments per author.

Figure 4.1: Frequency of median comment count per author for all subreddits

In-subreddit participation

In Section 4.1 I introduced the measure of insubreddit proportion, the proportion of an author’s total comments that they made within a given subreddit. To aggregate to the subreddit level, for each subreddit I took the median insubreddit proportion of all authors in the subreddit. Figure 4.2 shows the frequency distribution of median insubreddit proportion, which is also right skewed among subreddits. The mean is 0.2 and the standard deviation is 0.19. The subreddit thisismylifenow, which was one of the subreddits with the lowest median comments per author (N=2), also has the lowest median insubreddit proportion, 0.018, which means that the average active author in thisismylifenow made only 1.8% of all of their comments in the dataset in thisismylifenow. Only one subreddit, EggsInc, has the maximum possible median insubreddit proportion of 1, meaning that the median active author in EggsInc did not comment in any of the other top 1000 subreddits in January 2019.

Figure 4.2: Frequency of median author insubreddit proportion for all subreddits
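As a rough sketch of the insubreddit proportion and its subreddit-level median, under the definition given in the text (an author's comments within a subreddit divided by their total comments in the dataset): the input layout, the toy values, and the function names are hypothetical and not taken from the thesis code.

```python
from statistics import median


def insubreddit_proportion(per_sub, subreddit):
    """Share of one author's comments that fall within `subreddit`."""
    return per_sub.get(subreddit, 0) / sum(per_sub.values())


def median_insubreddit_proportion(counts, subreddit):
    """Median proportion over the authors who commented in `subreddit`."""
    proportions = [
        insubreddit_proportion(per_sub, subreddit)
        for per_sub in counts.values()
        if subreddit in per_sub
    ]
    return median(proportions)


# Hypothetical toy data: {author: {subreddit: number of comments}}
toy = {
    "author_a": {"sub_x": 2, "sub_y": 8},  # 2 / 10 = 0.2 in sub_x
    "author_b": {"sub_x": 5},              # 5 / 5  = 1.0 in sub_x
    "author_c": {"sub_x": 3, "sub_y": 3},  # 3 / 6  = 0.5 in sub_x
}
print(median_insubreddit_proportion(toy, "sub_x"))  # 0.5
```

An author with equal in-comments and out-comments, such as the hypothetical `author_c`, has a proportion of exactly 0.5.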

Figure 4.3 shows an alternative view of insubreddit author participation where each point is a subreddit. The x-axis is the median number of comments authors of the given subreddit made in all other subreddits. The y-axis is the median number of comments made within the given subreddit, which we used as our numerator in Equation (4.3). The grey line is the identity line where x=y, that is, where the number of out-comments is the same as the number of in-comments. Points on this line would have an insubreddit proportion of 0.5. Points above the line show subreddits whose authors make more comments within the subreddit than in all other subreddits combined. The highest point in the top left corner is Market76, where the average author made 103 comments within the subreddit but only 26 comments elsewhere. This shows that authors in this subreddit make most of their contributions to Reddit there. Conversely, points below the line are subreddits where authors made fewer comments within the subreddit than elsewhere. We can see that the majority of subreddits are in this group, mirroring the finding from Figure 4.2 that the vast majority of subreddits have an insubreddit proportion below 0.5. The subreddit in the far bottom right corner is nottheonion, with 116 out-comments but only 4 in-comments. The average highly active author in this subreddit is actually much more active across other subreddits.

Figure 4.3: Scatterplot of median number of author out-comments vs in-comments for all subreddits

Figure 4.4 shows the correlations between median comment count (avg com), median insubreddit proportion (insub) and the total number of comments in the subreddit (com count). This allows us to understand the relationship between our two measures, and the extent to which they are related to the size of the subreddit. I chose comment count as the measure of subreddit size here rather than author count because the data subsetting process outlined in Chapter 3 used author count as the cut-off for selecting subreddits; author count is therefore left censored.

Each square in Figure 4.4 represents the correlation between a pair of measures. For each square the names of the corresponding measures are written on the x and y axes. The background colour represents the relative strength of the correlation on a scale of 0 to 1, with a darker colour signifying a higher correlation. The correlation value is also written within each square.

Figure 4.4: Heatmap of correlations between comment count, median comment count, and median insubreddit proportion for all subreddits

All three pairs of measures have positive correlations. There is a positive correlation of 0.23 between the total number of comments and the median number of comments per author in a subreddit. Comment count and insubreddit proportion have a much lower correlation of 0.12. Though the relationship is positive, this suggests that there is not a strong relationship between the size of a subreddit and the relative number of comments authors make in it. Our two measures of within subreddit author participation, median comments per author and median author insubreddit proportion, are highly correlated at 0.74. At the end of this chapter, in Section 4.4, we will reflect on the general trends in these measures and how they inform our knowledge of the first research question and, thus, how well they speak to the nature of subreddit chamberness.

Despite the correlations between our measures and comment count, I chose not to normalise either measure by comment count. Controlling for comment count as an indicator of subreddit size would ideally control for any effects the size of a subreddit has on the number of comments authors make and their levels of insubreddit participation. However, as all of the measures were non-normally distributed and featured a few strong outliers, it was not statistically feasible to normalise by subreddit size. Instead, as we move on to focusing on the results for political subreddits we will keep in mind how subreddit size may have some minor influence on our measures and, therefore, how we interpret our results.
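A pairwise correlation matrix of the kind shown in Figure 4.4 could be computed along the following lines. The text does not state which correlation coefficient was used, so Pearson's r here is an assumption; a rank-based coefficient may be equally plausible given the skew in the measures. The numbers are made up purely for illustration, and the key names simply mirror the labels in the figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical subreddit-level values, one entry per subreddit.
measures = {
    "com_count": [100, 250, 400, 900, 1500],
    "avg_com": [2, 4, 3, 8, 10],
    "insub": [0.05, 0.10, 0.08, 0.30, 0.45],
}

# One entry per ordered pair of measures, as in the squares of a heatmap.
names = list(measures)
matrix = {(a, b): pearson(measures[a], measures[b]) for a in names for b in names}

for (a, b), r in matrix.items():
    print(f"{a:>9} vs {b:<9} {r:.2f}")
```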

4.2.2 The case study subreddits

In Chapter 3 we introduced the 21 subreddits in our dataset that were manually labelled with the primary topic label of ‘political’. Each of the political subreddits was then labelled as being left-leaning (e.g. socialism), right-leaning (e.g. Conservative), or politically neutral (e.g. ukpolitics). Of the 21 political subreddits 11 were left-leaning, 5 right-leaning, and 5 were politically neutral. The largest of these subreddits is the neutral discussion forum politics, which has the fourth highest number of authors among subreddits in our dataset (51,981) and the second highest number of total comments (1,935,457). politics is the generalist politics discussion subreddit, so we would expect it to be the largest political subreddit. However, its position as one of the largest subreddits overall is an important indicator of the generally high level of engagement with politics on Reddit.

Figure 4.5 shows the number of comments and authors for each of the political subreddits. We previously saw these plots in Section 3.4.2 when we introduced the political subreddits. Left wing subreddits are coloured blue, neutral subreddits green, and right wing subreddits red. Though changemyview was labelled as discussion and is not classified as a political subreddit, it is included for reference and coloured purple. Together I will refer to these 22 subreddits as our case study subreddits.

(a) Number of active authors

(b) Number of comments by active authors

Figure 4.5: Author and comment counts for case study subreddits

Although there are more left wing subreddits in the top 1000 subreddits in January 2019, they tend to be smaller than both neutral political subreddits and right wing subreddits in terms of both author count and comment count. The left wing subreddits tend to have more niche topical focuses, such as ChapoTrapHouse, which discusses the eponymous far left podcast, or Fuckthealtright, which is an antagonistic group specifically against the alt-right. The leftist anti-Trump subreddit Trumpgret has 1,032 authors and 6,954 comments, while the pro-Trump subreddit The Donald is the largest non-neutral political subreddit with 11,990 authors and 753,522 comments, making it the 5th largest subreddit overall by comment count. However, even the leftist subreddits which cover broad philosophical topics, such as socialism, are much smaller (882 authors, 9,752 comments) than right-wing subreddits which focus on similarly broad ideological topics, such as Conservative (2,263 authors, 50,156 comments) or Libertarian (4,310 authors, 50,156 comments). This simple finding is very informative about the political engagement of authors on Reddit. I suggest three possible explanations for this trend, which we will keep in mind as we undertake the analysis:

1. There are more right-wing authors on Reddit than left-wing. This, however, runs counter to previous survey research presented in Chapter 2 which suggests that redditors tend to be liberal (Barthel 2020).

2. Left-wing redditors do not participate in political subreddits as much as right-wing redditors, and/or

3. Engagement in political discourse by left-wing redditors is more diffuse between the smaller but more plentiful left-wing subreddits.

Figure 4.6 shows the results of our two measures of within subreddit author participation for the political subreddits and changemyview. These results are presented as bar plots rather than tables for two key reasons. First, the range of lengths of the bars provides a better visual sense of scale than the raw numbers. Second, and most importantly, colouring the bars by subreddit polarity emphasises important trends and allows for better comparison between the two subplots and their measures.

At the top of both subplots we see that the right wing subreddits The Donald and neoliberal, the neutral ukpolitics and the left wing ChapoTrapHouse all have much higher levels of author participation than the other case study subreddits. These are followed by politics, the very large neutral subreddit, and changemyview, our potential anti-echo chamber. Conversely, we can see that the left wing subreddits tend to fall at the bottom or centre of the plots, with the obvious exception of ChapoTrapHouse. This shows that the left-wing subreddits tend to have lower levels of author participation compared to other political subreddits. However, to be able to interpret the relative chamberness of the case study subreddits we must contextualise them within the overall trends we found earlier in this section. To do this we look at where the political subreddits and changemyview rank among all subreddits for each of the measures of author within subreddit participation.

(a) Median author comment count

(b) Median author insubreddit proportion

Figure 4.6: Results of within subreddit author participation measures for case study subreddits

The Donald is the only political subreddit to have a median author insubreddit proportion greater than 0.5, as shown in Figure 4.6b. As we saw in Figure 4.3, we can better understand insubreddit participation by comparing the number of author in-comments to out-comments per subreddit. Figure 4.7 shows this for the political subreddits and changemyview. We see that The Donald is the only political subreddit above the line, with the third highest in-comment value (N=29) and the lowest number of out-comments by far. For the majority of subreddits here, authors make fewer than 20 comments within the subreddit and at least 60, up to 150, in other subreddits.

As stated in the general trends section above, it is very important to keep in mind trends in subreddit size when interpreting our measures of within subreddit author participation. To account for a) subreddit size and b) skewness in our measures, Figure 4.8 shows the percentile values for political subreddits rather than their raw values. I display percentiles instead of raw values because they provide better comparison between the political subreddits, contextualising their relative differences within the background distribution of all subreddits in the sample. This is essential for addressing our hypotheses of how the level of chamberness of the case study subreddits compares to the background distribution of all other subreddits.
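A percentile value of this kind can be sketched as the share of all subreddits whose value is less than or equal to that of the subreddit of interest. The exact percentile definition used for Figure 4.8 is not given in the text, so this "less than or equal" convention, the function name, and the toy values are assumptions for illustration only.

```python
def percentile_rank(value, population):
    """Share of the population with a value less than or equal to `value`."""
    return sum(1 for v in population if v <= value) / len(population)


# Hypothetical median comment counts for a background set of subreddits.
all_subreddits = [2, 2, 4, 6, 6, 10, 29, 103]

print(percentile_rank(29, all_subreddits))  # 0.875
print(percentile_rank(2, all_subreddits))   # 0.25
```

Because the rank depends only on ordering, an extreme outlier such as the hypothetical 103 moves no other subreddit's percentile, which is why percentiles are less sensitive to skew than raw values.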

Figure 4.7: Scatterplot of median number of author out-comments vs in-comments for case study subreddits

Each row in Figure 4.8 is a subreddit. On the y axis the name of the subreddit is prefixed with a letter signifying the political polarity of the subreddit. ‘A’ stands for the anti-echo chamber subreddit, changemyview. ‘L’, ‘N’, and ‘R’ stand for left-wing, neutral, and right-wing, respectively. The subreddits are ordered alphabetically, first by polarity then by subreddit name. The columns are the comment count (com count), median number of comments per subreddit (avg com), and the median insubreddit proportion (insub) per subreddit. The background colour of each cell is the relative strength of the percentile value on a scale of 0 to 1: the darker the colour, the higher the subreddit is for that measure in the overall distribution. The percentile value is also written within each cell.

By grouping the subreddits according to polarity we can see that there are trends between polarities. The top row of Figure 4.8 shows the values for changemyview. It is in the mid eighties for both total comment count and median comment count. For insubreddit proportion it is in the middle of all subreddits (51st).

Figure 4.8: Heatmap of political subreddit percentiles for within subreddit participation measures

Figure 4.9: Scatterplot of median comments per authors by median author insubreddit proportion for case study subreddits

By comparison, left wing subreddits tend to have relatively low values across all three measures, as shown by the block of lighter blue at the upper centre of the figure. This is in line with the previous finding that left-wing subreddits tend to be smaller by both author and comment counts. There are some variations in this trend, however. For example, the subreddit ENLIGHTENEDCENTRISM, a far left satire of centrism, is in the 55th percentile for median comment count. The main outlier among left subreddits is ChapoTrapHouse, which has values in the 82nd and higher percentiles. This subreddit is a community for fans of the popular left wing podcast ‘Chapo Trap House’. It is by far the most popular left wing subreddit in the dataset and it is also in the 99th percentile for average comment count.

Turning to the lower half of Figure 4.8 we can see that the neutral and right subreddits are generally in the higher percentiles for both comment count and median comment count per author. All neutral and right wing subreddits are in the 60th or higher percentile except the general subreddit PoliticalDiscussion, in the 43rd percentile for comment count, and The Mueller, a sibling subreddit to The Donald devoted to Robert Mueller’s investigation of Trump, which has a median comment count in the 47th percentile. There is less consistency among right wing and neutral political subreddits for insubreddit proportion, however.

The right-hand column of Figure 4.8 shows the percentile values for median insubreddit proportion. As we have seen, the suggested anti-echo chamber changemyview is at a significantly lower percentile for median insubreddit proportion than for the other two measures. Most left wing subreddits are in the low percentile ranges for insubreddit proportion as well. The more niche left wing subreddits in particular are in the bottom fifth of all subreddits. beholdthemasterrace, Trumpgret, [r]esist, ABoringDystopia, and Fuckthealtright have insubreddit proportion percentile values of 0.15, 0.12, 0.1, 0.068, and 0.047, respectively. The main exception to this, again, is ChapoTrapHouse, though it is more than ten points lower than for the other values (82nd vs 96th).

The majority of neutral and right wing subreddits tend to be at much lower percentiles for insubreddit proportion than for either total comment count or median comment count. For these subreddits a greater comment count did not correspond with a high insubreddit proportion as it did for median author count. There are a few important outliers to this trend, however. The neutral subreddits politics and ukpolitics are in the 70th and 85th percentiles, respectively. This can be compared with the similar generalist subreddit worldpolitics, which is smaller, though still relatively large, but in the bottom third of values for insubreddit proportion (29th). For right wing subreddits, both The Donald and neoliberal are in the top fifth for insubreddit values, 93rd and 87th respectively, the highest values amongst all political subreddits.

Section summary

In this section we have addressed the first research question: how does author within subreddit participation vary between subreddits? We have used two measures to address this question: the median number of comments per author and the median insubreddit proportion per author. To answer the question we calculated these measures for the 1,000 subreddits under analysis to understand overall trends. We saw that both measures were highly skewed, with the majority of subreddits having generally low levels of author participation.

We then focused on the results for our case study subreddits, the 21 political subreddits and changemyview, to test our echo chamber and anti-echo chamber hypotheses. In line with our first hypothesis we found that The Donald, our suspected echo chamber, had very high levels of within subreddit author participation. We also identified other political subreddits which ranked unusually highly for these measures, suggesting they could also exhibit echo chamber tendencies. However, contrary to our anti-echo chamber hypothesis, we found that authors of changemyview did not exhibit lower levels of author participation, but were actually in the upper range compared to the political subreddits.

In the final section of this chapter, Section 4.4, we will summarise the key findings of this and the next stages of analysis to reflect on the research questions and hypotheses covered in this chapter. This will set us up to consider the inter-subreddit dynamics, including textual similarity, of subreddits in Chapter 5. But first we must review the analysis undertaken to address our second research question on author participation between subreddits.

4.3 Between subreddit participation results

In Section 4.1 we introduced the second research question: how actively do authors of a given subreddit participate between all subreddits? I also set out the four measures to be used to address this question. Those measures are the medians for each subreddit of:

• the number of subreddits per author

• the total number of comments per author

• the median number of comments per subreddit per author

• the adjusted Gini coefficient of comments per author

I calculated each of the measures for all authors in the dataset, then aggregated to the subreddit by taking the median values for the set of authors of each subreddit. As in the previous section, I will first present the general trends in these measures across all subreddits, which speaks to the broad research question. I will then explore the results for the case study subreddits in particular, which will be interpreted to address the echo chamber and anti-echo chamber hypotheses specifically.

4.3.1 General trends

Median author subreddit count

Figure 4.10 shows the frequency distribution of median author subreddit counts among the top 1000 subreddits. 89 subreddits have a median author subreddit count of 1, meaning at least half of their most active authors were not also active members of another subreddit in January 2019. Three tenths (N=306, the mode) of subreddits had a median subreddit count of 2, and 172 subreddits had a value of 3. This means that for more than half of subreddits (N=567), the median subreddit count was 3 or fewer. The frequency is then roughly steady until 9, then drops off to the maximum median subreddit count of 11, which 6 subreddits have.

Figure 4.10: Frequency of median author subreddit count for all subreddits

Median author comment count

Figure 4.11 shows the frequency distribution of the median author total comments among subreddits. Only three subreddits have a value lower than 10, the lowest being millionairemakers with 2. This means that in millionairemakers, for the most active authors, at most half made more than 2 comments across all top 1000 subreddits in January 2019. The frequency then peaks in the high 10s to 40s, then drops off steadily to a maximum of 129 for Market76, which also has the highest within subreddit mean number of comments as previously shown by Figure 4.1.

Figure 4.11: Frequency of median author comment count for all subreddits

Median author average comment count

Figure 4.12 shows that the average number of comments made per author across subreddits is also right skewed, with a very long tail. millionairemakers again has the lowest value with a median of 2 median comments per subreddit per author. The lower, mid, and upper quartiles are 8.25, 9.4, and 11.2, respectively. This shows that in half of subreddits the median active author made between 8.25 and 11.2 comments per subreddit they commented in. The highest value is again in Market76 at 86. This means that in Market76 half of the most active authors had a mean of 86 comments per subreddit they commented in.

Figure 4.12: Frequency of median author average comment count for all subreddits

Median author gini

Recall from the explanation of the Gini coefficient in Equation (4.7) that a value of 0 means perfect equality (i.e. the author made the same number of comments in all subreddits) and 1 means perfect inequality (i.e. the author made all comments in one subreddit and none in the others). As we are only looking at the subreddits which authors did comment in, no author could have a Gini of 1, but higher values still show greater inequality.

Figure 4.13 shows the frequency distribution of median author comment Gini, which is bimodal. 93 subreddits have a median author Gini value of 0. Most of these subreddits also have a median subreddit count of 1. For a distribution with only one value the Gini coefficient will always be 0 because all comments are equally distributed among the population of 1. Therefore if an author only comments in 1 subreddit, their Gini value will be 0. In turn, if most authors in a subreddit only comment within that subreddit, most of those authors will have a Gini of 0. This explains the large number of subreddits with a median author Gini of 0: most of their authors only comment in that subreddit. If we ignore Gini values of 0, we see that median author comment Gini is left skewed.

Figure 4.13: Frequency of median author Gini for all subreddits

The rest of Figure 4.13 is much more informative. We see a steady incline in frequency for ascending values of median author Gini until a sharp peak around 0.4, followed by a sharp decline to the maximum value of 0.47 for PoliticalDiscussion, one of the political subreddits we will return to in the next subsection. The lower, mid, and upper quartile values are 0.23, 0.33, and 0.39, respectively. We can gain a better understanding of the relationships between the general trends for the four measures by honing in on the results for the case study subreddits.

4.3.2 The case study subreddits

Figure 4.14 shows the percentile values for the case study subreddits for the four between subreddit author participation measures. The heat map shows that most political subreddits are in the upper quartile (percentile value of 0.75 or higher) for both median comment count and median author Gini. This means that authors in the political subreddits tend to a) make more comments across the top 1000 subreddits but b) spread their comments less equally amongst the subreddits in which they comment. The main exception to this trend is the suspected echo chamber The Donald, which is in a lower decile than any other political subreddit for median author Gini (N=0.62). In Figure 4.14 we can see that The Donald also has a lower rank for median author subreddit count than any other case study subreddit. However, as we saw in the histograms in the previous section on general trends, the distributions of the measures are highly skewed. This means that a large difference in percentile rank may not actually correspond to a large difference in raw values, or vice versa. Therefore we must also look at the raw values to interpret to what extent there is significant variation between the subreddits for these measures.

Figure 4.14: Heatmap of between subreddit median author participation measure percentiles for case study subreddits

Figures 4.15 and 4.16 show bar plots of the political subreddit values for each of the measures. The y-axes show the name of the corresponding subreddit. The x-axes show the values of the respective measures. The bars are coloured by the polarity of the subreddit: blue for left, green for neutral, red for right wing. The anti-echo chamber changemyview is included and highlighted in purple. In each of these plots the subreddits are sorted in order of descending values. This allows us to see trends between polarity groups.

Figure 4.15 shows the median author subreddit count and mean author comments per subreddit for the political subreddits. These two measures were grouped to highlight a pattern among left wing subreddits. In Figure 4.15a we can see that the six top values for median author subreddit count are in left wing subreddits: beholdthemasterrace, ABoringDystopia, Fuckthealtright, esist, Trumpgret, and ENLIGHTENEDCENTRISM. Figure 4.15b shows that these same six subreddits, plus another leftist subreddit, LateStageCapitalism, also have the lowest values for median average comment count per subreddit among their authors. Authors of these subreddits tend to a) post in more subreddits and b) make fewer comments per subreddit than authors in other political subreddits.

Figure 4.15 shows us the opposite trend for The Donald (R), neoliberal (R), ChapoTrapHouse (L), and ukpolitics (N), which were identified in the previous section as potential echo chambers based on their within subreddit levels of author participation. These subreddits have the lowest median author subreddit count and highest author mean comment count among political subreddits. Authors in these subreddits tended to comment in only 2 or 3 subreddits but made an average of 18 to 27 comments in the subreddits they did comment in. It should also be noted that in changemyview the median values among active authors were 5 subreddits, 77 total comments, 14.7 mean comments (38th percentile), and a Gini of 0.427 (86th percentile).

(a) Median author subreddit count

(b) Median author median comments per subreddit

Figure 4.15: First pair of bar plots for between subreddit author participation measures for case study subreddits

(a) Median author comment count

(b) Median author comment Gini

Figure 4.16: Second pair of bar plots for between subreddit author participation measures for case study subreddits

Figure 4.14 showed that all political subreddits were in the top quartile for median author total comment count, with the exception of LateStageCapitalism just behind at 0.7. Figure 4.16a shows that there is no significant variation in author total comment count among political subreddits. The comment count steadily decreases from 108 for beholdthemasterrace to 55 for socialism. The Donald is on the lower end of this range at 61 total comments per author, while changemyview is in the mid-range at 77. Figure 4.16b shows that values for median author comment Gini also do not vary significantly. PoliticalDiscussion has the highest value at 0.478, while The Donald has the lowest value at 0.368. This means that the most active authors in The Donald tend to spread their comments more evenly between the subreddits they comment in than the most active authors in any other political subreddit. changemyview is again in the mid-range with a median author comment Gini of 0.427.

Section summary

In this section we addressed our second research question: how actively do subreddit authors participate between all subreddits? To answer this question we used four measures of author participation across subreddits: the number of subreddits they commented in, the total number of comments they made, their average number of comments per subreddit, and the Gini coefficient of how equally they made comments between subreddits. By taking the median value of each measure among all authors for each subreddit we could determine the collective patterns of how authors of a subreddit participated across Reddit. We saw that in general most subreddits tended to have authors with low levels of participation across subreddits according to the three count measures and that, where they did comment in multiple subreddits, they tended to distribute comments slightly but not very unequally.

We then focused on the case study subreddits to test our hypotheses: 1) that authors of an echo chamber subreddit would have less varied participation between subreddits, and 2) that authors of the anti-echo chamber would have more varied participation of this kind. In the suspected echo chambers The Donald, neoliberal, ChapoTrapHouse, and ukpolitics, authors tended to make relatively more comments overall, but in fewer subreddits than other political subreddits, providing partial support for the echo chamber hypothesis. changemyview authors had high values for all measures, though not relative to some political subreddits. We therefore also found only partial support for the anti-echo chamber hypothesis.

Next we will bring together the findings of this and the previous section to reflect on the results of this chapter before moving on to our second stage of analysis.

4.4 Chapter summary

In the final section of this chapter we will summarise the results presented in Sections 4.2 and 4.3 and briefly describe how they address the research questions and hypotheses set out in Section 4.1. A deeper discussion of the findings will be presented in Chapter 6, where we will also be able to reflect on the second stage of findings presented in the next chapter.

Within subreddit author participation

RQ1: How does author within subreddit participation vary between subreddits?

The distribution of mean number of comments per active author among subreddits in our dataset was highly skewed. The majority of subreddits had a mean of 20 or less, but there was a long tail of a handful of outliers with up to 197 mean comments per author. We therefore know that the level of within subreddit author participation in terms of average number of comments does not vary greatly between subreddits, with the exception of the few outlier subreddits with very high author comment counts.

The distribution of median insubreddit proportion of author comments was also right skewed, though not as severely. For the majority of subreddits, authors made 20% or less of their total comments within the given subreddit. There was a steady decrease in frequency for higher rates, tapering to the handful of subreddits in which authors made all of their comments in the dataset within those subreddits. We can conclude that for the majority of subreddits active authors made more comments in other subreddits than within the initial subreddit.

Next let us examine the results for the echo chamber(s) and anti-echo chamber, and whether these confirm our hypotheses.
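As an illustration of the two within-subreddit measures summarised above, the following Python sketch computes the mean number of comments per author in a subreddit and the median insubreddit proportion. The function name and toy data are hypothetical and are not taken from the analysis code or dataset.

```python
from statistics import mean, median

def within_subreddit_measures(authors, subreddit):
    """Return (mean comments per author, median insubreddit proportion).

    authors: dict mapping author -> {subreddit: comment count}.
    Only authors with at least one comment in `subreddit` are included.
    """
    in_counts, proportions = [], []
    for counts in authors.values():
        n_in = counts.get(subreddit, 0)
        if n_in == 0:
            continue
        in_counts.append(n_in)
        # Share of the author's comments made inside this subreddit.
        proportions.append(n_in / sum(counts.values()))
    return mean(in_counts), median(proportions)

# Hypothetical toy data.
toy_authors = {
    "a1": {"sub_a": 60, "sub_b": 20},
    "a2": {"sub_a": 10, "sub_c": 30},
    "a3": {"sub_a": 5},
    "a4": {"sub_c": 12},
}
print(within_subreddit_measures(toy_authors, "sub_a"))
```

A high value on both measures corresponds to the pattern treated here as a possible signal of chamberness: authors who comment heavily in a subreddit and make most of their comments there.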

Echo chamber hypothesis 1: Echo chamber subreddits will have higher author within subreddit participation than other subreddits.

The suspected echo chamber The Donald was in the 99th percentile for median number of comments per author and the 93rd percentile for median insubreddit proportion. The average active author in The Donald made 63 comments in the subreddit in January 2019, or 52% of all comments they made in the top 1000 subreddits. This provides some support for the first echo chamber hypothesis.

If we take high author insubreddit participation as a possible signal of chamberness, this also drew our attention to the right wing subreddit neoliberal, the neutral ukpolitics, and the left wing subreddit ChapoTrapHouse as potential echo chambers. They are in the 87th, 85th, and 83rd percentiles, respectively, which is much higher than the next highest political subreddit. We will see these subreddits emerge again in later results as potential echo chambers. But first let us turn to the anti-echo chamber.

Anti-echo chamber hypothesis 1: The anti-echo chamber will have lower author within subreddit participation than other subreddits.

changemyview had a mean of 27 comments per author, placing it in the 86th percentile. This partially contradicts our first anti-echo chamber hypothesis, which stated that authors would participate less in changemyview than authors of other subreddits because of the higher level of effort needed to produce comments of an acceptable quality. The subreddit had a median author insubreddit proportion of 0.14 (51st percentile), meaning the average active author only made 14% of all of their comments in the dataset within changemyview. This provides slight support for the first anti-echo chamber hypothesis. changemyview authors do not participate less in the subreddit than authors in the average subreddit, but they do participate much less relative to the potential echo chambers.

Between subreddit author participation

RQ 2: How actively do authors of a given subreddit participate between all subreddits?

The three count measures used to address author participation between subreddits (author subreddit count, author comment count, and median comments per subreddit) were all highly right skewed. For the majority of subreddits the median active author made between 20 and 40 comments in 2 or 3 subreddits, for an average of 8 to 11 comments per subreddit. There was a thin long tail of subreddits with higher median subreddit counts, and a more steadily descending long tail for median total number of comments per author. For median author comment Gini, a plurality of subreddits had a value around 0.4, showing their authors spread their comments slightly more equally than not.

Echo chamber hypothesis 2: Echo chamber authors will have less varied participation between subreddits.

Active authors in The Donald tended to only comment in one other subreddit. As we have seen, this places it among the majority of subreddits. It is interesting to note, however, that neoliberal, ukpolitics, and ChapoTrapHouse are the only other political subreddits in this majority, with a median of 3 subreddits per author. This suggests that in the potential echo chamber subreddits authors have more committed levels of participation but among fewer subreddits relative to other political subreddits.

For the final measure of author comment Gini, we saw that there was no great variation between political subreddits. However, The Donald had the lowest value among political subreddits, suggesting that its authors participated between communities more equally than authors of other subreddits. This could provide evidence against the second hypothesis, though the smaller number of subreddits per author means that the Gini is more sensitive to small changes in comment count, which weakens the interpretability of this finding.

Anti-echo chamber hypothesis 2: Anti-echo chamber authors will have more varied participation between subreddits.

Finally, we found that the purported anti-echo chamber changemyview was in the mid range among political subreddits, and in the higher percentiles among all subreddits, for these four measures. The average active author made 77 comments among 5 subreddits, slightly higher than in the average subreddit or among the suspected echo chamber subreddits. We therefore do find evidence that authors of the suspected anti-echo chamber are significantly more active in terms of number of subreddits and comments than those in most subreddits, partially supporting the hypothesis. Interestingly, according to the Gini measure, its authors actually participated less equally between subreddits, especially relative to The Donald.

The next chapter will follow on from the work presented so far to address the second stage of analysis, examining inter-subreddit dynamics in greater detail. The chapter will build on the findings of this one to understand the networked relationship of how multiple subreddits may cluster to form ‘meta echo chambers’. It will also introduce the use of textual data to add an additional layer of insight to that of author participation, which we have focused on so far.

Chapter 5

Networks of Echo Chamber

This chapter turns to look at the networked relationship between subreddits and considers how similarity in terms of both users and content can determine whether connections between subreddits, or a lack of them, show networked echo chambers. The chapter is organised into four sections.

Section 5.1 introduces the concept of the ‘meta echo chamber’ as a way to move from the subreddit-level to the network-level in approaching the existence of echo chambers on Reddit. This section draws on literature previously discussed in Chapter 2 about topic-based public sphericules to contextualise ‘meta echo chambers’ in the wider network and to motivate the two research questions which will guide this second stage of analysis. The section will also provide a refresher of the subreddit similarity measures introduced in Chapter 3 and explain how these can be combined into a single measure of similarity to characterise networked ‘chamberness’.

Section 5.2 then provides a short overview of social network analysis before introducing the network of connections between our subreddits. It describes the process of constructing the network and performing community detection, then looks at the relationship between subreddit topics and communities to answer the first research question of this chapter (and third overall).

Next, Section 5.3 will narrow in on the community of political and related subreddits identified in Section 5.2. It will explore the different forms of relationship between subreddits that could be associated with high ‘chamberness’ and reflect on the extent to which the political subreddits and the suspected anti-echo chamber fit into expected levels of ‘chamberness’, thus addressing our final research question.

Finally, in Section 5.4 we will summarise the findings of the chapter and determine how the research questions set out in the first section have been answered. We will then begin to consider how these insights tie back into our earlier understanding of chamberness from the previous chapter, before moving on to a more thorough and consolidated interpretation of the cumulative findings in the following discussion chapter.

5.1 Defining meta echo chambers

The previous chapter showed that, contrary to our hypotheses, active authors in The Donald, our main suspected echo chamber, a) did not comment significantly more in the subreddit and b) commented in the same range of subreddits as authors of most other subreddits. Thus, we did not find conclusive evidence that The Donald shows high levels of chamberness at the subreddit-level according to these experimental measures. However, those measures could not speak to the nature of the other subreddits The Donald authors participated in, in particular how similar, or different, they are to The Donald in terms of their content. The previous chapter showed that our suspected echo chamber was not an island in itself, but we have yet to see if it is perhaps on an island with other similar subreddits. To address this question of islands we can examine the patterns of similarity between all subreddits in terms of both the number of authors they share and the level of similarity of their content.

In Chapter 2 we discussed the concept of ‘public sphericules’ introduced by Highfield and Bruns (Highfield and Bruns 2015). Developing on the Habermasian concept of a single public sphere of discourse, they suggest that social media discourse actually consists of many smaller topic-based ‘sphericules’. In their study of the Australian Twittersphere they used network analysis to identify distinct communities of highly connected, topically related Twitter accounts (Bruns et al. 2017). In similar research, Olson and Neal showed that high-level ‘interest meta-communities’ could be found on Reddit when mapping authorship between all subreddits as of 2015 (Olson and Neal 2015). This leads to the first research question to be explored in this chapter, our third overall:

RQ 3: Can we identify topical public sphericules among the top subreddits?

Given the previous evidence by Olson and Neal, we expect to find topic-based ‘public spheres’. However, this analysis has three key methodological differences from that work. First, we are only examining the top 1000 subreddits at our time point, as this allows us to develop a more in-depth knowledge of individual subreddits. Second, we are not just considering the similarity between subreddits in terms of shared authors but also shared content, specifically the text of comments. Third, we will be validating the relationship between communities and topics using an independent list of subreddit topic labels. Due to the number of subreddits in their network, Olson and Neal relied on determining the topics of communities inductively after subreddits were grouped. Therefore the original research presented here will build on that earlier work, using a more developed methodology, to provide necessary context for the investigation of meta echo chambers.

If we can identify public sphericules in the network, we can then dive deeper into when a public sphericule crosses over into being an echo chamber, that is, when a group of highly inter-connected subreddits is also only loosely connected to the rest of the network. This raises the second research question to be addressed in this chapter, and the fourth and final overall:

RQ 4: Can we identify ‘meta-echo chambers’ among those public sphericules?

We can think of ‘meta echo chambers’ as smaller sections of public sphericules, consisting of topically similar subreddits which are a) tightly internally connected and b) loosely connected to, or even disconnected from, the wider network. By contrast, an anti-echo chamber should be closely connected to many other subreddits across a range of topics, and we will also examine whether changemyview actually shows such a position in the network.

A benefit of the public sphericule approach is that it will allow us to also examine potential echo chambers beyond the political subreddits which have previously been our focus. We will touch on this in Section 5.2 and return to it in greater detail in the next chapter. But first let us briefly review existing measures of echo chamberness in the literature before moving on to define the measure of subreddit similarity which we will use to build our network.

Reviewing previous network-based approaches to echo chamberness

In Section 2.1 we explored some key existing work using network-based approaches to study echo chamber behaviours, or political polarisation more broadly. Adamic and Glance (2005) presented one of the earliest such works, examining the patterns of cross-linking between Democratic and Republican bloggers in the US. Del Vicario et al. (2017) analysed news consumption patterns of UK Facebook users in the lead up to the Brexit referendum. Many other studies focusing on Twitter looked at the follower, retweet, or commenting patterns between users to construct their networks (Bruns et al. 2017; Bastos et al. 2018).

Each of these studies used a single form of data to construct their network, most often some form of relational trace data such as follower relationships or the sharing of a link. These kinds of structural data do not carry direct information about the content of a relationship, such as why one user follows another. However, other literature, which is often not specifically focused on echo chambers but more concerned with general political polarisation, takes a two-pronged approach in its methodology, using both structural and content-based information. We provide a brief review of some of these methodologies before describing how the new measures developed here differ from them.

Himelboim et al. (2013) constructed follower networks of Twitter users who used keywords related to 10 controversial topics, such as ‘gun reform’, during the 2010 US midterm elections. They used the structural follower relationship to construct their networks then manually coded content for political orientation. This is similar to the approach used by both Highfield and Bruns (2015) and Olson and Neal (2015) to construct a network using structural ties which is then interpreted using content-based information.

However, with this method the structure of the network is highly dependent on the type of tie chosen (e.g. followers vs retweets), so the choice of tie needs to be deeply informed by the research goals. This is demonstrated by work which compares the differences between layers in multiplex networks, where a set of nodes can share multiple types of edges. In an analysis of interactions between politicians on a Swiss online platform, Garcia et al. (2015) show that using different forms of relationships, which may appear comparable at face value, can result in very different findings. Specifically, they compare the patterns of supports (similar to follows), likes, and comments and find that each layer provides information not replicated by the others. This suggests the need to consider multiple sources of information when constructing ties between actors in the network.

Similarly, Aragón et al. (2013) examine political campaigning on Twitter ahead of the 2011 Spanish national election. They look at cross-communication between political parties using both retweets and replies. They show that there is less polarisation in the replies network, with only 79% of replies being between members of the same party versus 93% of retweets. This represents a significant difference between diffusion and conversation patterns: politicians will engage with other parties via replies but not endorse them by sharing retweets. This demonstrates the importance of using multiple forms of relational information when considering connection and polarisation.

Other research has used a single structural feature to construct networks, but then used independently conducted content analysis to validate the network findings. Colleoni et al. (2014) use automatic content analysis in addition to network analysis of follower relationships in their study of US political polarisation on Twitter. They used supervised machine learning classification to distinguish Democratic and Republican discourse, then overlaid this on the follower network to determine whether there was evident polarisation. In conducting their content analysis separately from the network analysis they were able to avoid the confirmation bias that is inherent in approaches which analyse content in light of the need to interpret network-based findings.

Yet other work has compared the outcomes of content and structure-based approaches. Conover et al. (2011) tested the power of content-based features and structural features to predict the political alignment of US Twitter users. First, they conducted manual content analysis of tweets to determine whether users were left or right leaning. They then separately constructed the retweet network of users and ran community detection to identify group segregation. Comparing the two approaches, they found that the simple retweet network was able to predict users’ political alignment more accurately than the content analysis of their tweets. This surprising but illuminating finding shows that there can be a great deal of latent information provided by seemingly simple measures such as retweeting. It shows that there is valuable information in basic network structures which should not be neglected. However, it also raises the question of whether we can create more robust and illuminating networks by not choosing between content-based and structural features, but by using them in a more cohesive and complementary way.

In the following section we offer an approach which does not favour one set of features over the other, but considers the relationship between content-based and structural features to create a combined, streamlined measure of subreddit similarity. In Section 5.2 we will then move on to the question of how to infer group polarisation, and specifically echo chamberness, from our wider network. Guerra et al. (2013) suggest that modularity is often used as a measure of polarisation, but that it is not a direct measure of antagonism between groups. Specifically, a lack of connection is more likely to be the result of indifference or dissimilarity than of active dislike or disagreement. The new measure introduced in this chapter will keep in line with the suggestion that we can use modularity to define communities, or specifically topical public sphericules, but that we must then use additional steps to address echo chamberness.

Where previous methods have used both networks and content analysis they have tended to employ them at separate stages. Instead, the new measures outlined here will include content-based information, specifically the text similarity of subreddits, in the construction of the network itself. As we have seen, content-based information is often used as a secondary step to validate or interpret the findings of non-content-based networks. The method developed here is unique because it treats the two forms of data as related, considers the relationship between them, and creates a unified measure which carries information of both content and form. This allows for a more nuanced understanding of the relationship between subreddits. Later on we also use a secondary source of content information, the independently generated subreddit topic labels, to validate the network structure without introducing confirmation bias, a common limitation in previous research.

5.1.1 Measuring subreddit similarity

In Chapter 3 we saw how the top 1000 subreddits were selected for the month of January 2019, and how the top 25% most active authors were selected for each subreddit. We also described the process by which we collected the comment data for each subreddit-author pair. Using these data, I then described the process of constructing two distinct measures of similarity between subreddits: 1) based on shared authors, and 2) based on shared words in comments. The following section will briefly re-outline these measures separately, then explain how they were combined to form a single measure of adjusted similarity which will be our indicator of potential chamberness throughout this chapter.

Co-authorship

Starting with all pairs of subreddits and authors used in the first stage of analysis, as described in Chapter 4, I constructed an incidence matrix of all 1000 subreddits by all 3,838,553 authors. I then took the weighted one-mode projection of subreddit-subreddit edges. This 1000 by 1000 network gives the number of shared authors, or co-authors, between all subreddit pairs. As I previously described in Chapter 3, the values for co-authorship are highly skewed, as the total size of subreddits varies greatly. Thus I chose to weight the observed co-authorship matrix by dividing it by an expected matrix, determined by the total number of authors per subreddit and the total weight of all edges. The expected matrix was generated using the configuration model (Barabási 2016, section 4.8). Given a degree distribution, where subreddit $i$ has a degree of $k_i$, i.e. it shares co-authors with $k_i$ other subreddits, the probability of an edge between subreddits $i$ and $j$ is defined as:

$$p_{ij} = \frac{k_i k_j}{2m - 1}$$

where $m$ is the total number of edges in the undirected network, $m = \frac{\sum_i k_i}{2}$.

For each pair of subreddits I then divided the observed number of co-authors by the expected co-authorship $p_{ij}$ to create the weighted measure of co-authorship. This weighted measure emphasised the relationships between subreddits with relatively high levels of co-authorship which were generally less well connected in the network.
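The weighting described above can be illustrated with a minimal Python sketch on toy data. It is not the code used in this research: the function names and the four-author example are hypothetical, and it assumes that the degree of a subreddit is the number of other subreddits with which it shares at least one co-author, following the definition given above.

```python
from itertools import combinations

def coauthorship(authors):
    """Weighted one-mode projection: shared-author counts per subreddit pair.

    authors: dict mapping author -> set of subreddits they commented in.
    """
    shared = {}
    for subs in authors.values():
        for pair in combinations(sorted(subs), 2):
            shared[pair] = shared.get(pair, 0) + 1
    return shared

def weighted_coauthorship(shared):
    """Divide observed co-authors by the configuration-model expectation."""
    degree = {}
    for i, j in shared:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    m = sum(degree.values()) / 2  # number of undirected edges
    weighted = {}
    for (i, j), observed in shared.items():
        expected = degree[i] * degree[j] / (2 * m - 1)
        weighted[(i, j)] = observed / expected
    return weighted

# Hypothetical toy data: four authors across four subreddits.
toy = {
    "a1": {"A", "B"},
    "a2": {"A", "B"},
    "a3": {"A", "C"},
    "a4": {"B", "C", "D"},
}
for pair, value in sorted(weighted_coauthorship(coauthorship(toy)).items()):
    print(pair, round(value, 2))
```

In this toy network the pair A and B shares two authors while both subreddits have modest degree, so it receives the largest weight, whereas B and C, the two best connected subreddits, receive the smallest.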

Text similarity

To determine the level of text similarity between pairs of subreddits I first created a ‘bag of words’ for each subreddit. For each comment in a subreddit I removed numbers and punctuation and split the string into tokens. For each subreddit I combined all comments to create a weighted vector of total counts per term. There were 1,988,121 unique tokens in the dataset, 70% of which occur at most twice and 80% of which occur in at most 2 subreddits. I removed any terms which did not appear in at least two subreddits, or which appeared in 80% or more of subreddits. From the subreddit token vectors I constructed a subreddit-by-term affiliation matrix of term occurrence counts such that for subreddit $s$ term $t$ has a term frequency of $\mathrm{tf}_{ts}$. I then weighted the term frequency matrix by the inverse document frequency of terms (Sparck Jones 1972). The inverse document frequency measures how important a term is based on how common or rare it is across all subreddits. It is calculated by taking the log of the total number of subreddits ($N$) divided by the number of subreddits the token appears in ($n_t$). Therefore, for token $t$:

$$\mathrm{idf}_t = \log \frac{N}{n_t} \tag{5.1}$$

The weighted text matrix is then obtained by multiplying each subreddit term frequency by the term’s inverse document frequency:

$$\mathrm{tfidf}_{ts} = \mathrm{tf}_{ts} \cdot \mathrm{idf}_t \tag{5.2}$$

Similar to the process of weighting co-authorship using the configuration model, weighting term frequency by inverse document frequency emphasises terms which are generally less common across subreddits but which may therefore be more indicative of the unique character of the subreddits in which they do appear. I then computed the pairwise cosine similarity of the weighted subreddit-by-term matrix, resulting in a 1000 by 1000 matrix of cosine text similarity measures for all subreddit pairs. The observed values ranged from 0 (where the subreddits did not share any terms) to 0.94 (88% shared variance) for the subreddits Market76 and fo76bazaar, which are both about the video game Fallout 76, meaning they shared the majority of terms and at relatively similar frequencies.
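The text similarity steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used here: the function names and toy tokens are hypothetical, the natural logarithm is assumed for the inverse document frequency, and the term filtering step (dropping terms found in fewer than two or in 80% or more of subreddits) is omitted for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: subreddit -> list of tokens. Returns subreddit -> {term: tf-idf}."""
    tf = {s: Counter(tokens) for s, tokens in docs.items()}
    n_docs = len(docs)
    # Number of subreddits in which each term appears at least once.
    doc_freq = Counter(term for counts in tf.values() for term in counts)
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    return {s: {t: c * idf[t] for t, c in counts.items()}
            for s, counts in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus of four subreddits.
docs = {
    "s1": ["caps", "trade", "trade"],
    "s2": ["caps", "trade"],
    "s3": ["goal", "match"],
    "s4": ["goal", "caps"],
}
vectors = tfidf_vectors(docs)
print(cosine(vectors["s1"], vectors["s2"]))  # close to 1: shared rare term
print(cosine(vectors["s1"], vectors["s3"]))  # 0.0: no shared terms
```

The term that appears in three of the four toy subreddits contributes little to any similarity score, which mirrors the way the weighting emphasises less common terms.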

5.1.2 Defining chamberness similarity

Figure 5.1 shows the relationship between co-authorship and text similarity values for all pairs of subreddits that share at least two co-authors. Both co-authorship and text similarity are highly skewed, so logged values of each are used from this point onwards. The measures have a correlation of 0.33, showing there is a moderate positive relationship between the co-authorship and text similarity of subreddits, as shown by the green trend line in Figure 5.1. This aligns with our expectation that subreddits with similar topics will tend to share more users than those with dissimilar topics, based on the principle of interest homophily. We can understand how the relationship between co-authorship and text similarity varies by looking at the highest and lowest points for each measure.

Figure 5.1: Scatterplot of logged values of co-authorship (y-axis) and text similarity (x-axis) per pair of subreddits

Each point in Figure 5.1 represents a pair of subreddits. The y-axis is the logged co-authorship measure and the x-axis is the logged text similarity measure. The farthest left point, highlighted in purple, is for the subreddits BoJackHorseman, about the eponymous animated television series, and thenetherlands. This pair of subreddits has the lowest text similarity score (just above -8, or 0.0005 unlogged) because comments in thenetherlands are predominantly in the Dutch language. The point is mid-way along the y-axis because the weighted co-authorship value (6.69, or 804 unlogged) is in the 46th percentile. This suggests that some members of thenetherlands watch BoJackHorseman, or at least like to discuss it.

In the top right corner of the plot, highlighted in red, is the subreddit pair with the highest text similarity at -0.02, or 0.94 unlogged: Market76 and fo76bazaar, both of which are forums for users to trade in-game items in the video game Fallout 76. They also have a weighted co-authorship score in the 99th percentile. These subreddits have a very high level of text similarity and also share many authors, a prime example of interest homophily. Conversely, the yellow point in the centre bottom of the plot shows the pair of subreddits with the lowest weighted co-authorship: CFB, about college level American football, and Sweden. This suggests that not many Redditors are interested in both Sweden and American college football.

The top right section of Figure 5.1 (circled in black) shows subreddit pairs with the highest values for both similarity measures. Of the 50 edges with the highest co-authorship measures, 44 include at least one subreddit about pornographic content. For the moment it is only necessary to note that, given the inherent ‘not safe for work’ nature of porn subreddits and their names, which is largely beyond the scope of this research, I will default to using the most relevant non-pornographic examples where possible. However, we will return to the question of porn subreddits in this and later chapters, especially as they relate to the practice of multiple profile maintenance. In this instance, the edge with the highest co-authorship score that does not include a porn subreddit is between the subreddits Market76 and fo76bazaar, which we have already observed. They have the third highest weighted co-authorship score (98082) and also have the highest text similarity, as we have seen. The games are created by the same independent developer and feature some shared characters.

In total 74,625 pairs of subreddits share only one co-author. However, after accounting for subreddit size their weighted co-authorship values vary from 0.08 to 46.8. The edges with the lowest co-authorship (<0.1) all include the subreddit dankmemes, a very popular subreddit for sharing memes. These edges are shared with orangetheory, BabyBumps, and beyondthebump. orangetheory is about a HIIT (high intensity interval training) fitness franchise with a predominantly female userbase. BabyBumps and beyondthebump are discussion forums for pregnancy and post-pregnancy, respectively. Anecdotally, this suggests an interesting gendered pattern to the subreddits with the lowest weighted co-authorship. Interestingly, these edges have mid-range values for text similarity (41st, 47th, and 33rd percentiles respectively). This can be seen from the U-shape at the bottom of the scatterplot in Figure 5.1. This means these subreddits share very low co-authorship with dankmemes even though they have only slightly lower than average text similarity. These pairs of subreddits could be considered to display the opposite of chamberness, as they share fewer authors than we would expect given the level of textual similarity. By comparison, beyondthebump has a text similarity of 0.04 (50th percentile) with namenerd but their co-authorship is in the 99th percentile (807). namenerd is a forum for discussing one’s favourite names so would likely attract new and expecting parents looking for baby names.

These examples have shown that there is a meaningful relationship for many subreddits between their level of text similarity and the level of co-authorship we would expect them to share. Thus, to create a single measure of similarity to examine chamberness we need to some extent to control for text similarity, as interest homophily would tell us that subreddits with high text similarity are generally likely to have high co-authorship. To determine which subreddit pairs still have high co-authorship after controlling for text similarity I ran a simple linear regression of co-authorship on text similarity:

$$\mathrm{coauthorship}_t = \beta_1 \, \mathrm{text}_t + \epsilon \tag{5.3}$$

By taking the simple linear regression of co-authorship on text similarity we can determine for any pair of subreddits whether they have higher co-authorship than we would expect given their level of text similarity. Given the reinforcement bias from selective exposure theory, we would consider subreddits to be part of a meta-echo chamber if they share much greater co-authorship than we would expect from their level of text similarity alone. By calculating the residual ε in Equation (5.3) we can determine to what extent a pair of subreddits have a higher, or lower, than expected co-authorship given their text similarity. It is the residuals of the regression of co-authorship on text similarity which provide our measure of ‘chamberness’ between pairs of subreddits. Higher residual values suggest higher levels of inter-subreddit chamberness.

The model has an R-squared of 0.734, meaning that 73.4% of the variation in co-authorship can be described by the variation in text similarity. Figure 5.2 shows that the residuals of the model are normally distributed and the distribution is centred around 0. This shows that the majority of subreddit pairs have the level of co-authorship we would expect given their text similarity and that roughly equal numbers of pairs have lower or higher co-authorship than we would anticipate.

Figure 5.2: Distribution of regression residuals for all subreddit pairs

Again, we can understand these trends with a few illustrative examples. The pair of subreddits with the second lowest residual value are AskDocs, a place to seek advice from medical professionals, and Drugs, a community for discussing recreational drug use. Their low level of controlled similarity shows us that these subreddits share far fewer authors than we would expect given how similar their content is. This suggests they discuss similar topics or use related terminology, but that the recreational drug discussants don’t seek medical advice on Reddit.

On the opposite end of the spectrum we find the subreddits with the highest residuals, which therefore have much higher authorship overlap than we would expect after accounting for the similarity of their comments. Of the top 10 of these pairs, six are between pairs of porn subreddits which, judging by their names, are highly topically related, thus explaining their shared author bases. Redditors who comment in a subreddit for a niche type of pornography appear to comment in multiple such subreddits where they exist. Of the non-pornographic pairs the highest was Market76 and fo76bazaar, which we previously saw had the highest level of text similarity. This shows us that even though these subreddits had very high text similarity the number of authors they shared was still far greater than we would have anticipated. This is an example of a pair of subreddits for which interest homophily could potentially cross over into chamberness. In the next section, as we construct our network of subreddits, we will focus on the connections between subreddits such as these: those with atypically high levels of controlled similarity.

In this section we have introduced our second set of research questions and the motives behind them. Moving from the subreddit-level approach of the previous research questions we now turn to the inter-subreddit level to analyse the network of subreddits. We saw that our first task will be to determine whether this network is composed of topic-based public sphericules, which will be the subject of the next section. We then outlined our return to the question of echo chambers: to determine whether ‘meta-echo chambers’ of subreddits exist within those public sphericules. This section then outlined the measure of inter-subreddit similarity which we will use to construct our network. By taking a weighted measure of the number of authors which subreddits share and controlling this by the level of similarity between their content we are able to create a measure of potential chamberness between subreddits. Using this controlled measure of similarity we can identify pairs of subreddits which have higher than expected co-authorship given their text similarity. This is the measure of chamberness which we use to construct the network of subreddits and identify public sphericules.
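As a rough illustration of the controlled similarity measure summarised above, the following plain-Python sketch fits an ordinary least squares line of co-authorship on text similarity and returns the residuals. It is not the code used in the thesis: the function name `ols_residuals`, the `fit_intercept` option, and the toy values are all assumptions made for illustration. Equation (5.3) as printed shows only a slope term, so the intercept is optional here.

```python
def ols_residuals(text_sim, coauthorship, fit_intercept=True):
    """Residuals from a simple linear regression of co-authorship on text similarity.

    Hypothetical helper: positive residuals mean more shared authors than the
    text similarity of the pair would predict.
    """
    n = len(text_sim)
    if n != len(coauthorship) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    if fit_intercept:
        mean_x = sum(text_sim) / n
        mean_y = sum(coauthorship) / n
        sxx = sum((x - mean_x) ** 2 for x in text_sim)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(text_sim, coauthorship))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
    else:
        # Regression through the origin, matching the slope-only form of Equation (5.3).
        slope = sum(x * y for x, y in zip(text_sim, coauthorship)) / sum(x * x for x in text_sim)
        intercept = 0.0
    return [y - (intercept + slope * x) for x, y in zip(text_sim, coauthorship)]


# Toy values only (not thesis data): the third pair shares far more authors than expected.
text_sim = [0.01, 0.02, 0.03, 0.04, 0.05]
coauthorship = [1.0, 2.0, 9.0, 4.0, 5.0]
residuals = ols_residuals(text_sim, coauthorship)
most_chamber_like = max(range(len(residuals)), key=residuals.__getitem__)
```

In this toy example the pair at index 2 has the largest positive residual, which is the kind of pair the chapter treats as potentially chamber-like.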

5.2 A network of public sphericules

As we previously saw in Chapter 2, research has frequently used network analysis to understand the structure of relationships between users and communities on social media platforms, including Reddit specifically (Bruns et al. 2017; Highfield and Bruns 2015; Olson and Neal 2015). Network-based approaches have been used to showcase the existence of topic-based public sphericules in the Australian Twittersphere and interest-based meta communities on Reddit. This research builds on those existing examples to map the relationships between highly similar subreddits. We can then use this blueprint to understand whether echo chamber behaviours emerge at this inter-subreddit level.

Networks consist of units of analysis referred to as nodes, in our case subreddits. Nodes are connected if they share a tie or edge, such as if they share authors. Edges can be weighted by some measure and ours are weighted by the controlled similarity of the subreddits. As our research is interested in those subreddits which share very high levels of controlled similarity (those which share many more authors than we expect given the similarity of their content) we can choose to look exclusively at this subset of edges. And thus we can construct our network.

5.2.1 The subreddit network

There are 427,911 total subreddit pairs in the dataset. As we are only interested in exposing the relationships between subreddits with much higher co-authorship than expected given the text similarity, as a marker of chamberness, I selected only edges with residual values in the top 5% (i.e. greater than 3.46). Thirteen subreddits did not have at least one edge with a value in the top 5% and therefore do not appear in the network. Our subreddit similarity network thus consisted of 987 subreddits. Components are sections of the network which do not share any edges with each other. There is one giant component of 976 subreddits, and 4 smaller components of 5 or fewer subreddits each. We will consider these distinct components in the following section on community detection.
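The construction described above (keep only the highest-residual edges, then look at the resulting components) can be sketched in plain Python as follows. This is an illustrative sketch rather than the thesis code, which used NetworkX; the function names, the ranking-based way of selecting the top fraction, and the toy residuals are assumptions.

```python
from collections import defaultdict


def top_edges(residual_by_pair, fraction=0.05):
    """Keep the top `fraction` of subreddit pairs, ranked by residual (hypothetical helper)."""
    ranked = sorted(residual_by_pair.items(), key=lambda item: item[1], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return [pair for pair, _ in ranked[:keep]]


def connected_components(edges):
    """Return the components of an undirected graph, largest first."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from an unvisited node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy residuals only; a 50% cut-off is used so the small example keeps three edges.
toy_residuals = {
    ("a", "b"): 5.0, ("b", "c"): 4.0, ("d", "e"): 3.9,
    ("a", "d"): 0.1, ("c", "e"): -1.0, ("a", "e"): -2.0,
}
kept = top_edges(toy_residuals, fraction=0.5)
components = connected_components(kept)
```

Nodes with no retained edge never enter the adjacency map, which mirrors how the thirteen subreddits without a top 5% edge drop out of the network.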

Subreddit degree

Figure 5.3 shows the degree distribution of the network. The degree of a subreddit is the number of edges it shares with other subreddits. Remember we are only considering the top 5% of edges by value of controlled similarity. In Figure 5.3a we can see that the subreddit degrees are strongly right skewed. Figure 5.3b shows the logged values of degree are more normally distributed. Overall 17 subreddits have the minimum possible degree of 1, meaning they only share one top level edge with another subreddit. The subreddit NoStupidQuestions has the maximum degree of 305, meaning it shares a top level edge with 31% of other subreddits in the network. The lower, mid, and upper quartiles are 9, 23, and 54, respectively. We can see that the vast majority of subreddits, 3 in 4, share an edge with at most 5.5% of other subreddits, though some subreddits are progressively more connected up to a maximum of 31%. As we will see, this aligns with the network density measure and tells us that overall the network is not strongly connected.

Figure 5.3: Frequency distributions of subreddit degree in the network. (a) degree; (b) log of degree.

Network density

The density of a network is the proportion of all possible edges between subreddits which are actually observed. As our network is undirected and subreddits cannot have self-loops (i.e. share an edge with themselves), where S is the number of subreddits in the network we calculate the number of possible edges as:

$$\text{possible edges} = \frac{S \cdot (S - 1)}{2} \tag{5.4}$$

To calculate the density we then take the proportion of actual edges over possible edges:

$$\text{density} = \frac{\text{actual edges}}{\text{possible edges}} \tag{5.5}$$

With 987 subreddits the number of possible edges is therefore 486,591. However, there are only 21,396 edges in our network. The network density is thus 0.0439, i.e. 4.4% of all possible edges between subreddits are found in the network. Before subsetting the top 5% of values, there were 427,911 total edges between pairs of subreddits. The density of the network without subsetting would be 0.879 (87.9%), therefore our network density of 4.4% is a reflection of the subsetting process rather than low subreddit connectivity overall (0.88 · 0.05 ≈ 0.044). However, it will be very important later in the chapter to compare this overall density of 4.4% to the local density of communities detected in the next section. By keeping in mind that, in principle, relatively few pairs of subreddits can share edges with very high weights, we can understand how patterns of high connectivity relate to chamberness among groups of topically similar subreddits.
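The arithmetic in Equations (5.4) and (5.5) can be checked with a few lines of Python. The function names are illustrative; the counts are those reported in the text, and the unsubsetted density uses the same 987-subreddit denominator as the thesis.

```python
def possible_edges(n_nodes):
    """Number of possible edges in an undirected network without self-loops."""
    return n_nodes * (n_nodes - 1) // 2


def density(n_edges, n_nodes):
    """Proportion of possible edges that are actually observed."""
    return n_edges / possible_edges(n_nodes)


n_subreddits = 987
possible = possible_edges(n_subreddits)            # 486,591 possible edges
subset_density = density(21_396, n_subreddits)     # network built on the top 5% of edges
full_density = density(427_911, n_subreddits)      # all subreddit pairs, before subsetting
```

Multiplying `full_density` by 0.05 gives almost exactly `subset_density`, which is the point made above: the 4.4% figure follows from the subsetting step. The same `density` function applies unchanged to a single community when given its own subreddit and internal edge counts.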

5.2.2 Detecting public sphericules

Perhaps the most important step in conducting community detection in network analysis is careful algorithm selection. Each algorithm uses a different approach to partitioning a network into groups and can therefore result in very different partitions. Various widely used community detection algorithms have been employed in the existing echo chamber research reviewed earlier. For example, Garcia et al. (2015) and Conover et al. (2011) used the ‘Newman’ method presented in Newman (2006). Himelboim et al. (2013) applied the fast greedy algorithm by Clauset et al. (2004). Aragón et al. (2013) and Guerra et al. (2013) employed the Louvain method developed by Blondel et al. (2008), which is noted for being very efficient for large networks.

These studies selected different community detection algorithms for a variety of reasons. In their review of community detection methods in network analysis, Fortunato and Hric (2016) emphasise that there are a number of subjective methodological choices which must be made in model selection. First, there are no agreed measures of how to define a ‘community’. As such each model, and each researcher, must define the term based on their specific context. This reflects the ambiguity around bounding communities which was discussed in the earlier chapters. Second, there are no agreed upon steps to select the ‘best’ community detection algorithm for a specific problem or to robustly compare the accuracy of different models on a network. For example, as justification for selecting the fast greedy algorithm Himelboim et al. (2013) simply state that the method ‘typically results in a few major clusters and several very small ones’ (Himelboim et al. 2013, p. 161). In other words, they selected the method they expected to produce the type of clusters they sought. Model selection therefore allows the potential for bias to enter the methodological setup, as different algorithms can produce very different results. We must therefore ensure that we do not base our analysis solely on the results of a single algorithm. In the following paragraphs I will outline how I selected the Louvain method for this analysis and compare its performance to other methods.

I used the NetworkX and python-louvain packages in Python to run community detection using the Louvain algorithm developed by Blondel et al. (2008). I used the best_partition function in the python-louvain package to optimise for the partition with the highest modularity. Modularity considers how many internal edges there are in a group compared to the number you would expect at random. By optimising for this the Louvain method partitions subreddits into mutually exclusive groups which are highly internally connected. Using the Louvain algorithm I detected 13 communities in the network, consisting of the 4 small components mentioned earlier and 9 communities from within the giant component.

I also tested two other community detection algorithms for comparison. I used the implementation of Clauset-Newman-Moore fast greedy modularity maximization available in the NetworkX package. However, this method returned 107 communities, 60 of which consisted of a single subreddit, each of which was part of the giant component (i.e. not isolates). I also tested the Girvan–Newman algorithm using the same package, which also only peeled off single subreddits from the giant component over iterations. Therefore I proceeded with the communities detected using the Louvain method. As I will demonstrate, the application of independent subreddit topic labels to these communities can act as an essential additional validation of their accuracy.

In reviewing previous research on echo chambers we have seen that community detection is commonly used to address polarisation in the structure of networks. Guerra et al. (2013) provide a review of modularity-based community detection as a measure of polarisation. They point out that most work treats modularity as a direct indicator of polarity and assumes that a lack of connection between communities shows antagonism between the groups. This of course overlooks the likely more common phenomenon of communities being disconnected simply because they are disinterested in, or unrelated to, each other. Having shown that modularity is necessary but not sufficient in demonstrating group polarisation, Guerra et al. suggest we instead examine whether key actors are central to modular communities (suggesting polarisation) or act as bridges at the boundaries between them (suggesting non-polarisation). While this approach is well suited to their example bipartisan network, which focused on a single topic, it is less well suited to our multi-topic network. However, we still need to add an additional level of insight to move from high level modularity to polarisation, as is called for by Guerra et al. Instead I chose to apply the manual subreddit topic labels which were previously developed to determine whether the communities produced by the chosen method were indeed valid.
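To make the quantity being optimised more concrete, the sketch below computes unweighted Newman modularity for a given partition in plain Python. It is not the Louvain algorithm itself and not the python-louvain implementation used in the thesis; the function name and the toy graph are assumptions, and any edge weighting used in the actual analysis is ignored here.

```python
def modularity(edges, membership):
    """Unweighted modularity Q of a partition of an undirected graph.

    Q sums, over communities, the share of edges that are internal minus the
    share expected at random given the total degree of the community.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = {}     # community -> number of internal edges
    degree_sum = {}   # community -> total degree of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles joined by a single bridging edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than treating it as one group, which is the kind of comparison a modularity-maximising method makes when choosing between partitions.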

Breakdown of topics by community

In Chapter 3 I described the process of manually labelling the high level topics of each of the 1000 subreddits under analysis. In Chapter 4 we used these labels to identify the subset of political subreddits. In this chapter we will incorporate the rest of the subreddit labels to test the validity of the community detection as a means of identifying topic-based public sphericules. By examining the breakdown of topics between communities in the network, we will be able to determine whether the communities correspond with the concepts of ‘public sphericules’ or ‘meta-interest communities’ as summarised in the first section.

Figure 5.4: Frequency of the 11 most popular topic labels for subreddits in the network

Figure 5.4 shows the frequency of the most common topic labels for subreddits in the network, which we also previously saw in Chapter 3. The y-axis shows the 11 topics with at least 20 subreddits in the network (5% of all subreddits). Together these 11 topics account for 72% of subreddits (n=724). The frequency of topics is highly skewed. Almost one fifth of all subreddits are about gaming (n=192). The general topics ‘funny’ and ‘images’ then account for around 10% each, followed by discussion at 8%. The remaining topics are more substantively specific, ranging from technology to pornography to political.

I mapped these subreddit topics on to the 13 communities identified in the network. This allowed me to determine to what extent these communities represent topical public sphericules. Figures 5.5a and 5.5b present two ways of examining the intersection between topics and communities. Both figures are heatmaps with values normalised between 0 and 1. Normalised values are written in each cell, and the strength of the value is visualised by the colour of the cell: the darker the colour, the closer the value is to 1. For clarity values below 0.05 (5%) are not displayed.

The y-axis of each subplot in Figure 5.5 lists the 11 most common subreddit topics, in descending order of size, as previously shown in Figure 5.4. The x-axis lists the communities, again in order of descending size from left to right. Thus, the raw values used to create Figures 5.5a and 5.5b are the same. These raw values can be seen in Table A.4 in Appendix A.5. However, to create the heatmaps I then normalised by community size and topic size in turn to control for variation in the frequency of topics and size of communities. This will enable us to understand the spread of topics both across and within communities.

Values in Figure 5.5a are normalised by the size of the relevant community such that each value is the portion of all subreddits in community c that are labelled with topic t. For example, communities 10, 11, and 12 all have values of 1 for geographic. This means all subreddits (100%) in each of those communities are labelled geographic. By comparison, for community 3 (C3) 25% (0.25) of subreddits are geographic. The only other community to consist of subreddits all of one topic is community 13, which only contains two gaming subreddits. In summary, Figure 5.5a shows which topics are most prevalent within each community.

Figure 5.5: Heatmaps of relative subreddit topic and community frequencies. (a) Breakdown of communities by topic: values are the portion of all subreddits in community c that are labelled with topic t. (b) Breakdown of topics by community: values are the portion of all subreddits of topic t that are in community c.

The plot also shows us that for most communities at least 50% of subreddits fit under a single topic. The exceptions to this are C3, C5, and C6. This finding is reflected in the column ‘Community majority’ in Table 5.1 which lists the majority topic per community, where applicable. This allows us to quickly see that more than 50% of subreddits in C1 are gaming, funny for C2, etc. Bold values indicate that all of the subreddits in community C are labelled with that topic, as is the case for the small communities as we have already seen.

| C | Community majority | Topic majority |
|---|---|---|
| 1 | gaming | gaming, tech |
| 2 | funny | funny, images |
| 3 | | geographic, political |
| 4 | discussion | |
| 5 | | |
| 6 | | |
| 7 | porn | porn |
| 8 | sports | sports |
| 9 | music | music |
| 10 | geographic | |
| 11 | geographic | |
| 12 | geographic | |
| 13 | gaming | |

Table 5.1: Most prevalent topics per community. Community majority is the topic which accounts for more than 50% of subreddits in the community. Topic majority lists any topics for which 50% or more of subreddits are in the community. Bold values show that all subreddits fall under that category.

The first heatmap gave us an understanding of how communities are comprised of topics. To understand how topics are spread between communities I repeated the complementary process. Figure 5.5b is normalised by topic frequency, such that each value is the fraction of all subreddits labelled with topic t that are included in community c. For example, the value for political subreddits in C3 is 1, therefore we know that all subreddits labelled as political are in C3. Similarly all porn subreddits are in C7, as are 12% of television subreddits. For each topic except discussion and tv the majority of subreddits fall within a single community. The column Topic majority in Table 5.1 lists the topics which have a majority of subreddits in community c. Again bold values indicate that all of the subreddits of that topic are in the community, for example all political subreddits in C3.

By comparing these heatmaps side by side we can see that there is significant grouping of topics within communities, which validates that the community detection has allowed us to identify topic-based ‘interest meta-communities’. Using the majority topics listed in Table 5.1 we can give each of the communities a more descriptive name based on their subreddit topics. C1 is the gaming and tech community, and notably also has almost half of television subreddits. C2 is the funny and images community. C3 is the political and geographic community and also includes one third of discussion subreddits. We will examine this community in greater detail in Section 5.3 when we return our focus to the political subreddits. C4 is a more general discussion community. C5 and C6 are the only communities without strong associations to any topics, though C6 mostly consists of image and discussion subreddits. These could be considered the miscellaneous or highly generalist communities. C7 is the pornographic community, though it also contains some non-pornographic television subreddits. C8 is the sports community and also features a bit of gaming. C9 is the music community. Finally, C10 to C12 are all geographic communities and C13 exclusively covers gaming. In the next section we will look at the key network features of our interest communities and also look at the nature of subreddits in these small communities.
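The two normalisations behind Figures 5.5a and 5.5b can be sketched as follows. This is an illustrative reconstruction, not the thesis code: the function name, the dictionary layout, and the toy counts (generic topics "x" and "y", communities "A" and "B") are assumptions.

```python
def normalise_counts(counts, community_sizes, topic_sizes):
    """Normalise raw (topic, community) counts in the two ways described above.

    counts:          {(topic, community): number of subreddits}
    community_sizes: {community: total subreddits in that community}
    topic_sizes:     {topic: total subreddits with that topic label}
    """
    # Share of each community's subreddits carrying a topic (as in Figure 5.5a).
    by_community = {(t, c): n / community_sizes[c] for (t, c), n in counts.items()}
    # Share of each topic's subreddits found in a community (as in Figure 5.5b).
    by_topic = {(t, c): n / topic_sizes[t] for (t, c), n in counts.items()}
    return by_community, by_topic


# Toy counts only. Community B has two subreddits, one of which carries a topic
# outside the plotted set, so its plotted shares need not sum to 1.
toy_counts = {("x", "A"): 3, ("y", "A"): 1, ("x", "B"): 1}
toy_community_sizes = {"A": 4, "B": 2}
toy_topic_sizes = {"x": 4, "y": 1}
by_community, by_topic = normalise_counts(toy_counts, toy_community_sizes, toy_topic_sizes)
```

Reading the same cell both ways is what distinguishes the two majority columns in Table 5.1: a topic can dominate a community without that community holding most of the topic, and vice versa.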

5.2.3 Sphericules as chambers?

Table 5.2 lists key network statistics for each of the 13 communities. The first column, C, is the number for the community. Communities are again numbered 1 to 13 in descending order by number of subreddits (Num subs in the second column). Figure 5.6 also shows the community sizes as a bar plot. The largest community, C1, has 265 subreddits while the three smallest communities (C11-C13) only contain two subreddits each. The third column in Table 5.2, Labels, lists the most common topic labels for each of the larger communities, as determined in the previous section. For the last four communities, which each contain five or fewer subreddits, the labels refer more specifically to their subreddits.

| C | Num subs | Labels | Num edges | Yule’s Q | Density |
|---|---|---|---|---|---|
| 1 | 265 | gaming/tech | 1848 | 0.82 | 0.05 |
| 2 | 190 | funny/images | 7521 | 0.81 | 0.25 |
| 3 | 147 | pol/geo | 4878 | 0.82 | 0.28 |
| 4 | 143 | discussion/tv | 5068 | 0.81 | 0.26 |
| 5 | 58 | generalist | 455 | 0.93 | 0.18 |
| 6 | 51 | images/discussion | 463 | 0.93 | 0.18 |
| 7 | 49 | porn | 762 | 0.99 | 0.63 |
| 8 | 47 | sports | 279 | 0.99 | 0.25 |
| 9 | 26 | music | 117 | 0.98 | 0.3 |
| 10 | 5 | Romance | 7 | 1.0 | 0.7 |
| 11 | 2 | Dutch | 1 | 1.0 | 1.0 |
| 12 | 2 | Scandinavian | 1 | 1.0 | 1.0 |
| 13 | 2 | Runescape | 1 | 1.0 | 1.0 |

Table 5.2: Community Descriptive Statistics

The fourth smallest community, C10, has 5 subreddits: argentina, brasil, italy, portugal, and mexico. Each of these subreddits is predominantly written in the native language of its respective country (i.e. Spanish, Portuguese, and Italian) so I have labelled the community for the Romance languages. C11 contains two Dutch language subreddits, thenetherlands and ik_ihe, so I have labelled it Dutch. C12 contains sweden and Denmark, also predominantly in their native languages, and has been labelled Scandinavian. These three communities could be considered linguistic ‘public sphericules’ for their respective language or groups of languages. The final small community, C13, contains the subreddits 2007scape and runescape which both discuss the online game Runescape.

Figure 5.6: Community sizes

Table 5.2 also shows the number of edges and Yule’s Q for each of the communities. Yule’s Q is a measure of association which indicates how many of a node’s edges are shared with other similar or dissimilar nodes. In our case, applied to the communities of the network, it compares the number of edges subreddits in community X share with other subreddits in community X to the number of edges they share with subreddits in any of the other communities (Borgatti, Everett, et al. 2018). A key advantage of Yule’s Q is that it controls for the relative size of the communities, which is important to account for given the significant variation in the size of the communities.

To measure the Yule’s Q of community c we must first construct a contingency table of the number of edges all subreddits in c have in the network and the number of edges they could have but do not, broken down by whether those edges are shared with another subreddit in c, or are with a subreddit in another community. An example contingency table is shown in Table 5.3.

| | same | different |
|---|---|---|
| edge | a | b |
| no edge | c | d |

Table 5.3: Example contingency table of edges by association

We can understand this table by taking one of our communities as an example. We have labelled community 11 the Dutch community because it exclusively contains two Dutch language subreddits. These subreddits share 1 edge, therefore a is 1. Neither subreddit shares an edge with another subreddit in the network, therefore b, the number of edges in a different community, is 0. As the network is undirected the two subreddits in the community can only share 1 edge, which they do, therefore there are no edges within the community which do not exist, thus c is 0. Finally, there are 987 subreddits in the entire network, meaning each of the Dutch subreddits could, but does not, share an edge with any of the 985 subreddits not in the community. Together this means that d is 1,970 edges, the number of combined edges the subreddits do not share with other communities.

After constructing this contingency table for each of the communities we use the following equation to measure their Yule’s Q:

$$\text{Yule's } Q = \frac{ad - bc}{ad + bc} \tag{5.6}$$

ad corresponds with the level of association, i.e. how many similar edges exist and how many dissimilar edges do not exist. bc by comparison relates to the amount of disassociation, i.e. how many similar edges do not exist and how many dissimilar ones do exist. In our example of the Dutch community we can input the counts from above:

$$\text{Dutch Yule's } Q = \frac{1 \cdot 1970 - 0 \cdot 0}{1 \cdot 1970 + 0 \cdot 0} = \frac{1970}{1970} = 1 \tag{5.7}$$
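The contingency counts and Equation (5.6) can be sketched in a few lines of Python. The function names and argument layout are illustrative assumptions rather than the thesis code; the example reproduces the Dutch community calculation above.

```python
def contingency(n_in, n_total, internal_edges, external_edges):
    """Cell counts (a, b, c, d) of the edge contingency table for one community.

    n_in:    number of subreddits in the community
    n_total: number of subreddits in the whole network
    """
    a = internal_edges                                  # edges that exist within the community
    b = external_edges                                  # edges that exist to other communities
    c = n_in * (n_in - 1) // 2 - internal_edges         # possible internal edges that are absent
    d = n_in * (n_total - n_in) - external_edges        # possible external edges that are absent
    return a, b, c, d


def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc), ranging from -1 to 1."""
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("Yule's Q is undefined when ad + bc is zero")
    return (a * d - b * c) / denominator


# Dutch community: 2 subreddits, 1 internal edge, no external edges, 987 subreddits overall.
dutch_cells = contingency(n_in=2, n_total=987, internal_edges=1, external_edges=0)
dutch_q = yules_q(*dutch_cells)
```

The opposite extreme, no internal edges but some external ones, gives ad = 0 and therefore a value of -1.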

We see that the Dutch community has a Yule’s Q of 1. Values of Yule’s Q can range from -1 to 1. A value of -1 indicates a total negative association. In our example this would, theoretically, occur if the Dutch subreddits did not share an edge, but shared edges with all subreddits not in their community. A Yule’s Q of 0 indicates no relationship between subreddits being in the same community and sharing an edge. A top end value of 1 indicates a total positive association. As is the case in our Dutch example, this means all subreddits in the community share edges with each other and none shares an edge with a subreddit from another community.

The Louvain method used to detect the communities optimises for modularity. Therefore we would expect our communities to have many internal edges. Given the low overall density in the network, we would also expect communities to have relatively few external edges. Thus all communities are expected to have high values of Yule’s Q. We can see in the column Yule’s Q in Table 5.2 that this is the case, with all communities having very strong positive associations, 0.81 or higher. The four small communities all have perfect association, showing that their subreddits all share edges and they are totally disconnected from other communities.

Though Yule’s Q accounts for the relative size of the communities, we can see that there is still a relationship between community size and level of association. In the table the communities are listed in descending order of size by number of subreddits. Conversely, the corresponding values for Yule’s Q fall in roughly ascending order. The four largest communities, with more than 140 subreddits each, are all at 0.81 or 0.82 while the five remaining mid-size communities are at 0.93, 0.98 or 0.99. This shows that the bigger communities have high but not perfect associations, likely due to the greater variation in subreddit topics leading to more ‘missing’ internal edges between unrelated subreddits.

If we compare these results to the other measures presented in Table 5.2 we can identify some interesting variations for the mid-sized communities which do have almost total positive associations. The sports and porn communities both have a Yule’s Q of 0.99, showing that each of these communities shares very few external edges. However, the porn community has almost three times as many internal edges (Num edges) as the sports community but only two more subreddits (Num subs). This means that subreddits in the porn community tend to have many more edges with high residuals than subreddits in communities of similar size, and share these with other subreddits in the same community. This is reflected in the relative densities for the communities (Density).

In Section 5.2 we saw that the density of the whole network was 4.4%, which reflects that the network was constructed on the top 5% of edges between subreddits. Using the same equation for density we can measure the local density of each community. This will allow us to compare the relative levels of density between communities, and contextualise these within the overall network. If community C has s subreddits, the number of possible edges in the community is:

$$\text{possible edges} = \frac{s \cdot (s - 1)}{2} \tag{5.8}$$

because we have an undirected network without self loops. The local density of the community is then calculated as the portion of possible internal edges which are actually observed:

$$\text{local density} = \frac{\text{actual edges}}{\text{possible edges}} \tag{5.9}$$

The final column in Table 5.2 shows the local density of each community. The Romance languages community, for example, has 5 subreddits meaning it has 10 possible edges ((5 · 4)/2). As it has 7 actual edges, its local density is 0.7 (7/10) as 70% of all possible edges exist. The three communities with 2 subreddits each all have densities of 1 because they consist of the single possible edge between their respective two subreddits. The next most dense is the porn community with 63% of possible edges. This is then followed by the music community (30%) and then the political and geographical community (28%). On the other end of the spectrum the gaming and tech community has the lowest density (0.05). Though it is the largest community by number of subreddits it has approximately 1/4 as many edges as the second largest. Thus, while the community is weakly connected to the rest of the network, as shown by its high Yule’s Q, it is also very weakly connected internally. As a public sphericule this mega gaming and tech community is very diffuse. Finally, as a reference for the dynamics between public sphericules we need to look at the connections between communities.

Network structure of topic communities

Figure 5.7 shows the network between connected communities. Each node is a community. The four small communities are not shown because they do not share edges with any other communities. The nodes are sized by the relative number of subreddits in the community. Nodes were positioned using the Fruchterman–Reingold force-directed graph drawing algorithm. The thickness and opacity of edges are weighted by the number of external edges between the communities. For visual clarity I have removed any edges with a weight of fewer than 5 subreddits. The raw edge weights are available in Appendix A.5 in the Appendix A.4.

The three thickest edges form a triangle in the centre of the graph connecting the geopolitical, funny & images, and discussion & tv communities. Subreddits in these communities share many more external edges with each other than with any other communities. This shows that though these communities can still be considered distinct public sphericules, their subreddits still tend to share many authors relative to their text similarity. The largest community, gaming & tech, does not share a large number of external edges with any other community, but does share a handful of edges with each other community.
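The aggregation behind a community-level graph of this kind can be sketched as follows: count the subreddit-level edges that cross each pair of communities and drop pairs below a minimum weight. This is a plain-Python illustration under assumed names (`community_edge_weights`, `min_weight`) and toy data, not the code used to produce Figure 5.7.

```python
from collections import Counter


def community_edge_weights(edges, membership, min_weight=5):
    """Count external edges between each pair of communities.

    Pairs with fewer than `min_weight` connecting edges are dropped, mirroring
    the visual-clarity threshold described in the text.
    """
    weights = Counter()
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu != cv:  # internal edges do not contribute to the community graph
            weights[tuple(sorted((cu, cv)))] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Toy data only; a threshold of 2 is used so the small example keeps one edge.
toy_membership = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "C"}
toy_edges = [("s1", "s3"), ("s2", "s4"), ("s1", "s2"), ("s3", "s5")]
community_graph = community_edge_weights(toy_edges, toy_membership, min_weight=2)
```

Sorting each community pair before counting keeps the result undirected, so an edge from A to B and one from B to A add to the same weight.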

Figure 5.7: Network graph of communities

Section summary

In this section we have seen that all of the 13 communities in the network have very strong levels of association as measured by Yule’s Q: they each share many more internal edges than external edges given their size. We have also seen that these communities have strong associations with subreddit topics. 11 of the 13 communities are either dominated by subreddits of a single topic and/or are home to the majority of subreddits on a particular topic. This suggests that the communities do correspond with topical public sphericules, providing strong affirmation for our third research question overall, and first of this chapter.

By exploring key network properties of the communities we have also begun to address our final research question: whether meta-echo chamber networks can be identified among the public sphericules. The four smallest communities are each both fully internally connected and completely disconnected from the rest of the network. In theory this would be the structure of a perfect meta-echo chamber. However, given the tiny sizes of the communities (only 2 to 5 subreddits each) and the specificity of their relationships I suggest the subreddits do not pose significant risk of being echo chambers. Instead, I suggest, they are the result of the Reddit ideal: the bringing together of niche or specific audiences through the process of interest homophily.

If we move our attention to the mid-sized, topic specific communities we begin to see the emergence of public sphericules as potential meta-echo chambers. This applies in particular to the porn community, which is much more densely internally connected than the other communities and almost completely disconnected from the wider network. To a lesser degree, the sports and music communities also only share limited external edges, but do not have the same levels of internal density. The theoretical implications of these findings will be discussed in greater detail in the next chapter where we will return to the question of whether echo chambers are inherently political, an issue previously introduced in Chapter 2.

This brings us finally back to the question of whether echo chambers are a problem among political subreddits, and whether far left and far right political subreddits on Reddit form networks of meta-echo chambers. In the next section we will explore the geo-political community in greater detail, to understand the local networks in which political subreddits are embedded. We will look at how closely the political subreddits are interconnected, as well as the kinds of non-political subreddits they share the greatest similarity with. This will allow us to answer our fourth and final research question before moving on to Chapter 6 where we will have an in-depth discussion of the cumulative findings of this thesis and their wider implications.

5.3 The geo-political sphericule

As we have seen, all 21 political subreddits are in the geo-political discussion community. Our suspected anti-echo chamber changemyview also belongs to this community. As such we now turn to focus on the geo-political ‘sphericule’ in greater detail, to understand to what extent the political subreddits may be part of networked meta-echo chambers, and to consider to what extent changemyview actually is an anti-echo chamber.

In total there are 147 subreddits in the community. Of the 21,396 edges in our network, 3,050 (14.26%) are between subreddits in the geo-political community. Table 5.2 showed us that the community has a local density of 28%, the second highest after the porn community if we exclude the small isolate communities. It also has a Yule’s Q of 0.82, almost exactly the same as the other larger communities. This shows us that while the subreddits in this community tend to share many more edges internally than they do with external subreddits, they aren’t internally very densely connected. We can thus determine that the community as a whole is not an echo chamber, which is to be expected of a group of nearly 150 subreddits covering a range of topics. To approach the question of smaller sub-networks of possible meta echo chambers we can begin by first identifying the dyads - pairs of subreddits - with unusually high levels of similarity.
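The two summary figures quoted above can be re-derived from the counts given in the text. The following is a minimal sketch only: the function names are hypothetical, and the way the 2x2 table for Yule’s Q is filled in (internal versus external pairs, with and without an edge) is an assumption about the construction described in Section 5.2, so the counts passed to `yules_q` are placeholders rather than values from the thesis.

```python
from math import comb


def local_density(n_nodes: int, n_edges: int) -> float:
    """Share of all possible undirected pairs among n_nodes that are edges."""
    return n_edges / comb(n_nodes, 2)


def yules_q(a: int, b: int, c: int, d: int) -> float:
    """Yule's Q for a 2x2 table [[a, b], [c, d]]: (ad - bc) / (ad + bc).

    Assumed layout (not confirmed by the text):
      a = internal pairs with an edge,  b = internal pairs without an edge,
      c = external pairs with an edge,  d = external pairs without an edge.
    """
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("Yule's Q is undefined when ad + bc == 0")
    return (a * d - b * c) / denominator


# Counts quoted in the text for the geo-political community.
geo_nodes, geo_edges, all_edges = 147, 3050, 21396

geo_density = local_density(geo_nodes, geo_edges)  # 3050 of 10731 possible pairs
geo_edge_share = geo_edges / all_edges             # share of all network edges

print(f"local density: {geo_density:.1%}")
print(f"share of network edges: {geo_edge_share:.2%}")

# Placeholder counts, purely to show the behaviour of the measure.
print(f"Yule's Q (hypothetical counts): {yules_q(30, 70, 10, 90):.2f}")
```

With 147 subreddits there are 10,731 possible internal pairs, so 3,050 observed edges correspond to the local density of roughly 28% reported in Table 5.2.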

5.3.1 The most chamber-like edges?

Table 5.4 shows the 20 subreddits which share the greatest number of edges (N) with other members of the geo-political community. Together they account for 79.5% of all internal edges in the community. changemyview is in joint second with 59 internal edges, meaning it is strongly connected with 40% of the other subreddits in the community, which we can consider a signal of possible anti-echo chamberness.

Nine of the 21 political subreddits also appear in the list - PoliticalHumor, worldpolitics, neoliberal, ENLIGHTENEDCENTRISM, ChapoTrapHouse, Conservative, LateStageCapitalism, politics, and Libertarian. This shows that almost half of the political subreddits are among the most highly connected in the community and each share many ties with non-political subreddits. To understand the nature of these connections we can look at which dyads share the highest levels of similarity.

subreddit               N
unpopularopinion        68.0
PoliticalHumor          59.0
changemyview            59.0
worldpolitics           56.0
worldnews               52.0
news                    48.0
neoliberal              41.0
TrueReddit              38.0
ENLIGHTENEDCENTRISM     38.0
ChapoTrapHouse          36.0
iamatotalpieceofshit    36.0
insanepeoplefacebook    35.0
Conservative            35.0
TopMindsOfReddit        35.0
todayilearned           35.0
LateStageCapitalism     35.0
JordanPeterson          35.0
conspiracy              34.0
politics                33.0
Libertarian             30.0

Table 5.4: Top 20 subreddits in the geo-political community by internal edge count

Table 5.5 lists the 20 edges in the geo-political community with the highest residuals. It includes the topic labels for the source and target subreddits. The edges are displayed in descending order of weight and all are in the 99th overall percentile for ‘chamberness’ similarity.

Many of the edges involve one or two geographic subreddits, and show varying levels of topical redundancy. For example, the top edge is between Seattle and SeattleWA, both about the city of Seattle, Washington. Looking at the subreddits, there appears to be no clear distinction between their purposes, thus they appear to be redundant. This could be the product of a previous schism between the moderators and/or contributor communities of the subreddits, or conversely of the subreddits developing naturally but separately.

Some edges show the nested nature of subreddits, for example those between ontario and toronto, or australia and melbourne. These are edges between geographical subreddits that are clearly related; Toronto is a city in the province of Ontario, Melbourne is a city in the country of Australia. These edges are not redundant but are clearly closely, and hierarchically, related. Similarly, the edges between ukpolitics and unitedkingdom and between Catholicism and Christianity highlight the strong relationships between larger generalist subreddits (i.e. unitedkingdom and Christianity) and more specific subreddits within those broad topics (i.e. ukpolitics and Catholicism).

There are also edges between subreddits that are clearly topically related, but not necessarily hierarchical or redundant. MGTOW stands for ‘men going their own way’, a term used in the men’s rights community, thus it is clear why the subreddit shares an edge with MensRights. Other topical overlaps are somewhat less direct. skiing and snowboarding likely share many co-authors who are fans of winter sports generally. It is interesting, however, that these subreddits are located in the geo-political community, and not the sports community. This could be due to winter sports having stronger associations with specific locations than other, non-winter sports. eu4 and hoi4 refer to the games Europa Universalis IV and Hearts of Iron IV, both of which are developed by Paradox Development Studio and set in or around fictionalised historical Europe. Again, it is interesting that these subreddits are included in the geo-political community, and not in one of those more clearly related to gaming. This could suggest a latent similarity between these subreddits and geographic topics, such as Europe in this instance.

source                 target            source_topic    target_topic
Seattle                SeattleWA         geographic      geographic
solotravel             travel            discussion      discussion
ukpolitics             unitedkingdom     political       geographic
ontario                toronto           geographic      geographic
COMPLETEANARCHY        socialism         political       political
COMPLETEANARCHY        ChapoTrapHouse    political       political
AskEurope              europe            geographic      geographic
ChapoTrapHouse         socialism         political       political
Catholicism            Christianity      discussion      discussion
australia              melbourne         geographic      geographic
analog                 photography       images          discussion
AskUK                  britishproblems   discussion      funny
CasualUK               britishproblems   discussion      funny
australia              sydney            geographic      geographic
AskUK                  CasualUK          discussion      discussion
eu4                    hoi4              gaming          gaming
MGTOW                  MensRights        discussion      discussion
LateStageCapitalism    socialism         political       political
skiing                 snowboarding      sports          sports
CrusaderKings          eu4               gaming          gaming

Table 5.5: Top 20 subreddit pairs in the geo-political community by edge weight

Noticeably, only 5 of the top edges in the geo-political community contain at least one subreddit labelled as ‘political’. The first, between ukpolitics and unitedkingdom, shows a topical hierarchy as previously discussed. In this case we see one subreddit, ukpolitics, which bridges the intersection between two more general topics, unitedkingdom and politics. The remaining 4 political edges in the table, however, are all between two political subreddits and begin to speak to the question of the networked nature of political echo chambers.
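The count of political edges in Table 5.5 can be checked mechanically. The sketch below transcribes the 20 rows of the table into a list of tuples (the variable and function names are hypothetical and introduced only for illustration) and counts how many edges involve at least one, or exactly two, subreddits labelled ‘political’.

```python
# Rows of Table 5.5 as (source, target, source_topic, target_topic).
TOP_EDGES = [
    ("Seattle", "SeattleWA", "geographic", "geographic"),
    ("solotravel", "travel", "discussion", "discussion"),
    ("ukpolitics", "unitedkingdom", "political", "geographic"),
    ("ontario", "toronto", "geographic", "geographic"),
    ("COMPLETEANARCHY", "socialism", "political", "political"),
    ("COMPLETEANARCHY", "ChapoTrapHouse", "political", "political"),
    ("AskEurope", "europe", "geographic", "geographic"),
    ("ChapoTrapHouse", "socialism", "political", "political"),
    ("Catholicism", "Christianity", "discussion", "discussion"),
    ("australia", "melbourne", "geographic", "geographic"),
    ("analog", "photography", "images", "discussion"),
    ("AskUK", "britishproblems", "discussion", "funny"),
    ("CasualUK", "britishproblems", "discussion", "funny"),
    ("australia", "sydney", "geographic", "geographic"),
    ("AskUK", "CasualUK", "discussion", "discussion"),
    ("eu4", "hoi4", "gaming", "gaming"),
    ("MGTOW", "MensRights", "discussion", "discussion"),
    ("LateStageCapitalism", "socialism", "political", "political"),
    ("skiing", "snowboarding", "sports", "sports"),
    ("CrusaderKings", "eu4", "gaming", "gaming"),
]


def count_political(edges):
    """Return (edges with >= 1 political end, edges with two political ends)."""
    at_least_one = sum(1 for _, _, s, t in edges if "political" in (s, t))
    both = sum(1 for _, _, s, t in edges if s == t == "political")
    return at_least_one, both


at_least_one, both = count_political(TOP_EDGES)
print(at_least_one, both)
```

On the rows as transcribed, five edges have at least one political end and four of those join two political subreddits, which matches the counts discussed above.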

5.3.2 Political meta echo chambers?

We can see how political subreddits rank in terms of their degree in Figure 5.8. changemyview has the highest degree, in the 97th percentile among all subreddits. The political subreddit with the lowest degree, SandersForPresident at 33, is still in the 62nd percentile overall. This shows that all political subreddits are highly connected in the network, relative to other subreddits. There is some clustering by political polarity, as demonstrated by the groups of red and blue lines in the graph, but the variation is not very large. To understand to what extent, if any, political subreddits exist in echo chambers we must examine a) the kinds of subreddits they share ties with and b) the strength of these ties. To begin, we’ll look only at the edges between two political subreddits.

Figure 5.8: Barplot of degree percentile rank for political subreddits

Edges between political subreddits

For the 22 case study subreddits, including changemyview, there are 231 possible edges ((22 · 21)/2) that can exist. With 201 observed edges, the density among political subreddits is therefore 0.87. A remarkable 87% of possible edges between political subreddits are within the top 5% of most similar edges. Figure 5.9 shows the frequency distribution of residual values between all pairs of political subreddits. Recall from Section 5.2 that the cut-off point for the top 5% of residual values to be included in the network was 3.49. For completeness the plot includes the few residuals in the lower 95% which were excluded from the network analysis. This allows us to see that only a small fraction of pairs of political subreddits do not fall within this top range (N=29, 12.5%). The lowest residual value between a pair of political subreddits is that between ABoringDystopia and The Donald, at 1.8; however, this is still in the 77th percentile among all subreddit pairs.
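The density calculation above, and the way a residual cut-off turns pairwise residuals into network edges, can be sketched as follows. The names are hypothetical, the three residuals are values quoted in this chapter, and treating the cut-off as inclusive is an assumption rather than something the text specifies.

```python
from math import comb

CUTOFF = 3.49  # top-5% residual cut-off quoted from Section 5.2


def edges_above_cutoff(residuals, cutoff=CUTOFF):
    """Return the pairs whose residual reaches the cut-off (inclusive by assumption)."""
    return sorted(pair for pair, value in residuals.items() if value >= cutoff)


# Density among the 22 case study subreddits.
n_subreddits = 22
possible_pairs = comb(n_subreddits, 2)  # (22 * 21) / 2 = 231
observed_edges = 201
political_density = observed_edges / possible_pairs

print(possible_pairs, round(political_density, 2))

# Three residuals quoted in the text, used here only as sample input.
sample_residuals = {
    ("COMPLETEANARCHY", "socialism"): 8.19,
    ("Conservative", "Libertarian"): 6.3,
    ("ABoringDystopia", "The_Donald"): 1.8,
}
kept = edges_above_cutoff(sample_residuals)
print(kept)
```

Only the first two sample pairs survive the cut-off; the ABoringDystopia and The_Donald pair falls below 3.49 and would therefore not appear as an edge in the network.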

Figure 5.9: Frequency distribution of residuals for all pairs of political subreddits

This shows that there is very high author overlap between almost all political subreddits, even when controlling for the similarity in their text matter. This suggests the political subreddits are densely interconnected, possibly even across the political poles. Figure 5.10 shows the number of edges that each political subreddit shares with other political subreddits in the network. Bars are coloured by the polarity of the subreddit. Six of the 22 subreddits share an edge with all 21 other subreddits, shown at the bottom of the plot. These are changemyview, PoliticalHumor, worldpolitics, neoliberal, LateStageCapitalism, and ChapoTrapHouse. This is particularly interesting given the findings of the previous chapter, which suggested that ChapoTrapHouse may be acting as a left wing mirror to some of the subreddit-level chamberness characteristics of The Donald.

Figure 5.10: Number of edges shared by case study subreddits

However, Figure 5.10 also shows us that The Donald is the third lowest in terms of shared edges (N=14), behind only ukpolitics and ABoringDystopia. It is not surprising that ukpolitics does not share edges with many other political subreddits, as most are specifically focused on US politics. ABoringDystopia is one of the smallest political subreddits, so this may also explain why it is less connected to others. Its posts are also predominantly images and memes, rather than text, suggesting it is less discursive than the other case study subreddits. This may explain why it is one of the few political subreddits not to share an edge with politics or PoliticalDiscussion. The Donald, on the other hand, is the second largest political subreddit; it is therefore conspicuous that it is not connected to 7 other political subreddits in the network.

Given the high level of connectivity between political subreddits overall, to understand chamberness we instead need to examine which subreddits are missing edges. The Donald does not share edges with the less connected subreddits ukpolitics or ABoringDystopia. It is also missing edges in the network with the left wing subreddits beholdthemasterrace, Fuckthealtright, Trumpgret, esist and socialism. Figure 5.10 shows us that the first four of these subreddits are grouped with SandersForPresident, the blue lines towards the top of the figure, as having somewhat low counts of shared edges themselves. This shows us that The Donald does not share edges with the specifically anti-Trump and anti-alt right subreddits in the network. It is also the only political subreddit that socialism does not share an edge with.

Instead of seeing The Donald grouped tightly with other right-wing subreddits, as we might expect of a meta echo chamber, we see a more noticeable distance from a group of left-wing subreddits. This, instead, shifts our attention to the potential chamberness of these far left subreddits.
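Looking for missing rather than present edges amounts to taking a set difference. The sketch below is illustrative only: the 22 case study names are those mentioned across this chapter, the edge list for The_Donald is reconstructed from the seven missing partners named above rather than taken from the underlying data, and the function name is hypothetical.

```python
# The 22 case study subreddits named in this chapter.
CASE_STUDY = {
    "changemyview", "PoliticalHumor", "worldpolitics", "neoliberal",
    "ENLIGHTENEDCENTRISM", "ChapoTrapHouse", "Conservative",
    "LateStageCapitalism", "politics", "Libertarian", "SandersForPresident",
    "ABoringDystopia", "The_Donald", "ukpolitics", "PoliticalDiscussion",
    "beholdthemasterrace", "Fuckthealtright", "Trumpgret", "esist",
    "socialism", "COMPLETEANARCHY", "The_Mueller",
}

# Partners with which The_Donald shares no edge, as listed in the text.
DONALD_MISSING = {
    "ukpolitics", "ABoringDystopia", "beholdthemasterrace",
    "Fuckthealtright", "Trumpgret", "esist", "socialism",
}


def missing_partners(nodes, edges, node):
    """List the members of `nodes` with which `node` shares no edge."""
    linked = {b if a == node else a for a, b in edges if node in (a, b)}
    return sorted(nodes - linked - {node})


# Edge list for The_Donald only, reconstructed from the text (an assumption).
donald_edges = [
    ("The_Donald", other)
    for other in CASE_STUDY - DONALD_MISSING - {"The_Donald"}
]

print(len(donald_edges))
print(missing_partners(CASE_STUDY, donald_edges, "The_Donald"))
```

The reconstruction is consistent with the figures in the text: 21 possible partners minus 7 missing edges leaves the 14 shared edges reported for The Donald.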

5.3.3 Political echoing on the left

If we begin to draw together the various trends we have seen among political subreddits so far, a pattern emerges of close connections between left wing subreddits, and a lack of connections between these subreddits and specific right wing ones.

Figure 5.10 showed that the left wing subreddits ABoringDystopia, beholdthemasterrace, Fuckthealtright, Trumpgret and esist all had multiple missing edges with other political subreddits in the network. Separately, we saw in Table 5.5 that the left wing subreddits COMPLETEANARCHY, socialism, ChapoTrapHouse, and LateStageCapitalism shared edges with each other that are among the highest in the geo-political community.

Figure 5.9 also showed that the distribution of residual values between pairs of subreddits is right-skewed. There is a steep drop off in frequency for values over approximately 6.3. This right-hand tail of the distribution accounts for 19 of the 231 pairs of political subreddits (8.2%). Table 5.6 lists each of these pairs from highest to lowest residual value and includes polarity labels for both subreddits.

Of the 19, one edge is between a left and a right wing subreddit (ChapoTrapHouse and neoliberal). There are also two edges for each of the following types of pairs: two neutral, two right wing, and between a neutral and a right wing (PoliticalDiscussion and neoliberal). There are no top edges between a left wing and a neutral subreddit. The remaining 12 highest edges are between 6 of the 11 left wing subreddits: COMPLETEANARCHY, ENLIGHTENEDCENTRISM, Fuckthealtright, LateStageCapitalism, beholdthemasterrace, and ChapoTrapHouse.

Figure 5.11 shows a graph of these edges. Nodes are labelled by the political polarity of the subreddit: red for right, green for neutral, and blue for left. The component of three red nodes is between Conservative, The Donald and Libertarian. The component of three green nodes and one red node shows PoliticalHumor as the central point between worldpolitics, politics, and The Mueller. The largest component shows a branch from ChapoTrapHouse to neoliberal to PoliticalDiscussion. This shows that, as opposed to being a right wing echo chamber, neoliberal shares very strong links with both a left wing subreddit and a general political discussion subreddit. This suggests that authors of the subreddit may actively engage in discourse across ideological lines, though confirmation of the nature of such discourse is beyond the scope of this thesis.

Subreddit 1            Subreddit 2            Residual   Sub 1 polarity   Sub 2 polarity
COMPLETEANARCHY        socialism              8.19       L                L
COMPLETEANARCHY        ChapoTrapHouse         8.17       L                L
ChapoTrapHouse         socialism              8.06       L                L
LateStageCapitalism    socialism              7.59       L                L
COMPLETEANARCHY        ENLIGHTENEDCENTRISM    7.42       L                L
PoliticalDiscussion    neoliberal             7.38       N                R
Conservative           The_Donald             7.22       R                R
ChapoTrapHouse         ENLIGHTENEDCENTRISM    7.18       L                L
COMPLETEANARCHY        LateStageCapitalism    7.09       L                L
PoliticalHumor         worldpolitics          6.9        N                N
ChapoTrapHouse         LateStageCapitalism    6.88       L                L
ChapoTrapHouse         neoliberal             6.71       L                R
ENLIGHTENEDCENTRISM    socialism              6.69       L                L
Fuckthealtright        beholdthemasterrace    6.54       L                L
PoliticalHumor         politics               6.38       N                N
ENLIGHTENEDCENTRISM    beholdthemasterrace    6.37       L                L
PoliticalHumor         The_Mueller            6.36       N                R
Conservative           Libertarian            6.3        R                R
ENLIGHTENEDCENTRISM    Fuckthealtright        6.3        L                L

Table 5.6: Top pairs of political subreddits by residual value

Figure 5.11: Network of top edges between pairs of political subreddits

Also in the largest component, the blue nodes show the many edges shared between left subreddits. Fuckthealtright, beholdthemasterrace, and ENLIGHTENEDCENTRISM form a triangle which we will call the anti-alt right triad. ENLIGHTENEDCENTRISM then acts as a bridge to the other left subreddits which all share edges with each other, with the exception of ENLIGHTENEDCENTRISM and LateStageCapitalism, forming an almost complete subgraph. Together, these findings provide strong evidence that political subreddits on the left, especially the far left, may form a left wing meta echo chamber.
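The structure described for Figure 5.11 can be reproduced from the rows of Table 5.6 alone. The sketch below is a plain-Python illustration, not the procedure used to draw the figure: it transcribes the 19 pairs, tallies them by polarity combination, finds the connected components, and lists which pairs are absent within a chosen group of subreddits. All function and variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Rows of Table 5.6 as (subreddit 1, subreddit 2, polarity 1, polarity 2).
TOP_PAIRS = [
    ("COMPLETEANARCHY", "socialism", "L", "L"),
    ("COMPLETEANARCHY", "ChapoTrapHouse", "L", "L"),
    ("ChapoTrapHouse", "socialism", "L", "L"),
    ("LateStageCapitalism", "socialism", "L", "L"),
    ("COMPLETEANARCHY", "ENLIGHTENEDCENTRISM", "L", "L"),
    ("PoliticalDiscussion", "neoliberal", "N", "R"),
    ("Conservative", "The_Donald", "R", "R"),
    ("ChapoTrapHouse", "ENLIGHTENEDCENTRISM", "L", "L"),
    ("COMPLETEANARCHY", "LateStageCapitalism", "L", "L"),
    ("PoliticalHumor", "worldpolitics", "N", "N"),
    ("ChapoTrapHouse", "LateStageCapitalism", "L", "L"),
    ("ChapoTrapHouse", "neoliberal", "L", "R"),
    ("ENLIGHTENEDCENTRISM", "socialism", "L", "L"),
    ("Fuckthealtright", "beholdthemasterrace", "L", "L"),
    ("PoliticalHumor", "politics", "N", "N"),
    ("ENLIGHTENEDCENTRISM", "beholdthemasterrace", "L", "L"),
    ("PoliticalHumor", "The_Mueller", "N", "R"),
    ("Conservative", "Libertarian", "R", "R"),
    ("ENLIGHTENEDCENTRISM", "Fuckthealtright", "L", "L"),
]


def edge_type_counts(pairs):
    """Tally edges by unordered polarity combination, e.g. 'LL' or 'NR'."""
    return Counter("".join(sorted((pa, pb))) for _, _, pa, pb in pairs)


def components(pairs):
    """Connected components of the undirected graph, largest first."""
    adjacency = {}
    for a, b, *_ in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, found = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        found.append(component)
    return sorted(found, key=len, reverse=True)


def missing_within(group, pairs):
    """Pairs inside `group` that do not appear as an edge."""
    present = {frozenset((a, b)) for a, b, *_ in pairs}
    return [p for p in combinations(sorted(group), 2) if frozenset(p) not in present]


print(edge_type_counts(TOP_PAIRS))
print([len(c) for c in components(TOP_PAIRS)])

far_left = {"COMPLETEANARCHY", "socialism", "ChapoTrapHouse",
            "LateStageCapitalism", "ENLIGHTENEDCENTRISM"}
triad = {"Fuckthealtright", "beholdthemasterrace", "ENLIGHTENEDCENTRISM"}
print(missing_within(far_left, TOP_PAIRS))
print(missing_within(triad, TOP_PAIRS))
```

On the rows as transcribed this yields three components of nine, four and three subreddits, a complete triangle for the anti-alt right triad, and a single absent pair among the other five left subreddits (ENLIGHTENEDCENTRISM and LateStageCapitalism), in line with the description above.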

Non-political neighbours

Finally, we can understand whether the political subreddits are part of non-political topic meta echo chambers by returning to the edges they share with non-political subreddits. Lists of the five closest neighbours of each of the political subreddits can be found in Table A.5 in the appendix. Nine of the 21 political subreddits share all of their five strongest edges with other political subreddits. A further 6 share four out of five. This shows that the majority of political subreddits are more closely connected to other political subreddits than they are to non-political neighbours.

Figure 5.12 shows the edges that the political subreddits and changemyview share with their 10 closest neighbours in the overall network. For visual clarity, edges between political subreddits are not shown. The network therefore shows the connections between our case study subreddits and their closest non-political neighbours. Non-political subreddits are represented by grey nodes. The case study subreddits retain the same colouring as in previous plots.
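The ‘closest neighbours’ comparison can be sketched as a top-k selection over edge weights. The example below uses entirely made-up node names and weights (`p1` to `p5` standing in for political subreddits, `x1` and `x2` for non-political ones), and the function names are hypothetical; it is not based on the contents of Table A.5.

```python
def top_k_neighbours(weights, node, k):
    """Return the k neighbours of `node` with the largest edge weights."""
    scored = []
    for (a, b), weight in weights.items():
        if node == a:
            scored.append((weight, b))
        elif node == b:
            scored.append((weight, a))
    # Highest weight first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


def political_count(weights, node, political, k=5):
    """How many of the k closest neighbours of `node` are political."""
    return sum(name in political for name in top_k_neighbours(weights, node, k))


# Hypothetical edge weights, for illustration only.
toy_weights = {
    ("p1", "p2"): 8.0,
    ("p1", "p3"): 7.5,
    ("p1", "x1"): 7.0,
    ("p1", "p4"): 6.5,
    ("p1", "x2"): 6.0,
    ("p1", "p5"): 5.0,
}
toy_political = {"p1", "p2", "p3", "p4", "p5"}

print(top_k_neighbours(toy_weights, "p1", 5))
print(political_count(toy_weights, "p1", toy_political))
```

In this toy case three of the five closest neighbours of `p1` are political; the chapter reports the equivalent count for each real political subreddit.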

Figure 5.12: Network of case study subreddits and their ten closest neighbours

SandersForPresident is not included in the network because all of its 10 closest neighbours are political subreddits. The network consists of two components. The smaller component has the neutral subreddit ukpolitics at its centre, the green node, surrounded by nine non-political subreddits. ukpolitics only shares one top edge with another political subreddit, neoliberal. All of its other top edges are shared with subreddits related to the UK, Ireland, or Europe, which in turn are not connected to any other case study subreddits. This suggests that the subreddit for British politics is more embedded within a regionally specific public sphericule than a political one, and is as such the most disconnected of the political subreddits. This is not surprising given the American focus of most of the political subreddits, with the obvious exception of worldpolitics.

The other, larger component contains all of the other case study subreddits, excluding SandersForPresident. This shows that all of our case study subreddits, with two exceptions, share at least one non-political closest neighbour with another in the group. Fuckthealtright shares exactly one, but all others share at least two and up to 7 in the case of The Donald.

The network was again plotted using the Fruchterman–Reingold force-directed layout. We can see that there does not appear to be any clustering in the colour of nodes, representing the polarity of the subreddits. There is no obvious pattern of left-wing subreddits (in blue) sharing different neighbours than the right wing subreddits (in red). We see that there is a high level of overlap in the non-political neighbours of our subreddits across the political spectrum.

While it may be possible that our political subreddits could share more ‘’ edges with lower levels of similarity than we have examined here, the fact that for the most part their strongest edges are shared with mutual non-political neighbours shows that there must be high levels of shared interests and participation between their authors for non-political content. Therefore none of the political subreddits can be considered to be in meta echo chambers in the strictest sense, as most are only ever two very short steps away from the other side of the political divide. We will discuss these findings and their ramifications in greater detail in the following chapter. But first we return our focus to changemyview to test its claim that it is in fact an anti-echo chamber.

changemyview

We saw in Section 5.2.1 that the median subreddit degree in the network was 23. By comparison, our suspected anti-echo chamber subreddit changemyview has a degree of 197, meaning it shares a top edge with 20% of all subreddits in the network, placing it in the 97th percentile. This shows that it is very highly connected in comparison with the rest of the subreddits in the network. This supports the idea that authors of changemyview participate actively in a range of subreddits, as we would expect of an anti-echo chamber. However, to understand the nature, and therefore the actual diversity, of the topics in which changemyview authors actively participated we need to examine the specific subreddits with which it shares the strongest ties.

Table 5.7 lists the subreddits with which changemyview shares the greatest controlled similarity, in descending order. With the exception of the four political subreddits with which we are now familiar, all of the subreddits are labelled as general discussion subreddits. As with changemyview, these discussion subreddits are generally discursive and may focus on a particular topic (e.g. philosophy) but did not fit into any of the higher level substantive topic labels used.

Subreddit              Edge percentile
unpopularopinion       96
JordanPeterson         94
PoliticalDiscussion    94
TrueReddit             94
neoliberal             94
NoStupidQuestions      93
MensRights             93
philosophy             92
Libertarian            92
worldpolitics          92

Table 5.7: Top 10 subreddits with which changemyview shares an edge
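The degree comparisons used in this subsection (a median, and the percentile position of one subreddit) can be illustrated with the standard library. The sketch below uses a made-up list of degrees, not the degrees of the real network, and it assumes one common definition of percentile rank (the percentage of values strictly below the value of interest); the thesis may use a different convention.

```python
from statistics import median


def percentile_rank(values, x):
    """Percentage of `values` strictly below x (one common convention)."""
    return 100 * sum(v < x for v in values) / len(values)


# Hypothetical degrees for a six-node toy network.
toy_degrees = [1, 5, 23, 23, 40, 197]

print(median(toy_degrees))
print(round(percentile_rank(toy_degrees, 197), 1))
```

Applied to the real degree sequence, the same two calculations would correspond to the median degree of 23 and the percentile position reported for changemyview.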

The top subreddits with which changemyview shares the highest edge values can be broadly grouped into three thematic categories and a fourth ‘other’ group:

general discourse: unpopularopinion, TrueReddit, NoStupidQuestions, philosophy

neutral political: PoliticalDiscussion and worldpolitics

right wing political: neoliberal and Libertarian

other: JordanPeterson, MensRights

Two of the ‘general discourse’ subreddits share a similar broad-minded ethos to changemyview, as is evidenced by their names: unpopularopinion and NoStupidQuestions. These subreddits are similarly devoted to sharing views, or questions, which may not be welcomed in other spaces. TrueReddit is intended to be true to Reddit’s original mission as a place to share ‘insightful’ articles from across the internet. philosophy is, as we might expect, the general philosophy discussion subreddit.

Turning to the neutral political subreddits with which changemyview is closely connected, it is not surprising that the authors of changemyview are regular participants in PoliticalDiscussion and worldpolitics, or that these are some of the subreddits with which it is most connected. It is perhaps surprising that we do not see the subreddit politics on this list, but it is changemyview’s 35th closest neighbour.

We then see that there are two right wing subreddits in the list of changemyview’s closest neighbours and, notably, no left-wing subreddits. The right subreddits, neoliberal and Libertarian, are centred on political philosophies, and therefore may have more discursive natures than the political subreddits devoted to more niche, specific topics such as The Donald or The Mueller. Therefore, of the right wing subreddits, it is unsurprising that changemyview shares the strongest edges with these. However, the fact that changemyview authors appear to participate relatively more in these subreddits than in most others, and that changemyview does not share a similarly strong edge with the high level left wing subreddit socialism, begins to suggest that changemyview authors may tend to hold right-wing views.

Finally, in the top 10 subreddits with which changemyview is connected we can find JordanPeterson and MensRights. JordanPeterson is devoted to the eponymous Canadian psychologist who has been called a leader of the ‘intellectual dark web’ for his range of controversial opinions and fervent stance against (Weiss and Winter 2018; Ribeiro et al. 2020). MensRights is a men’s rights activism subreddit which has been shown to regularly express misogynistic viewpoints and is also associated with the alt-right (Marwick and Caplan 2018).

We previously briefly encountered MensRights in Chapter 3 when I described relabelling it and MGTOW (men going their own way), another subreddit on the men’s rights movement. changemyview also shares a strong edge with MGTOW (75th percentile). MGTOW is a more explicitly anti-woman subreddit, so much so that it was quarantined by Reddit in January 2020 (Basu 2020). A quarantine limits the visibility of a subreddit, making new readers acknowledge that it is “dedicated to shocking or highly offensive content” before they can view the community.

These subreddits, JordanPeterson and MensRights, and the topics on which they focus are known for being strong critics of political correctness and for supporting opinions which are considered controversial by the mainstream. This is an ethos which we would expect to appeal to many authors of changemyview. However, the particularly high levels of similarity changemyview shares with these subreddits show that it is not just a small fraction of authors who participate in these subreddits and, we can extrapolate, hold these views. This in itself suggests a level of conformity of viewpoints which we would not expect to see in a true anti-echo chamber. We will discuss the implications of this potential bias more in the next chapter.

The strongest edge changemyview shares with a left-leaning subreddit is with ENLIGHTENEDCENTRISM, in 15th place. The second is with TwoXChromosomes, which is not a political subreddit but is Reddit’s largest feminist subreddit, at 23rd.

Returning our focus to political subreddits, Figure 5.13 shows where each of the political subreddits ranks in terms of the weight of the edge it shares with changemyview, relative to all edges changemyview has in the network. We can see the neutral and right wing subreddits with which it shares the greatest similarity at the top of the graph. At the bottom of the graph we see a group of blue bars corresponding to most of the far left subreddits: ABoringDystopia, beholdthemasterrace, SandersForPresident and Fuckthealtright. This tells us that, of the political subreddits, changemyview authors share the least similarity with the far left subreddits. In the next chapter we will explore in greater depth how this is indicative of both a) the nature of the far left subreddits as a possible meta echo chamber and b) changemyview not being the anti-echo chamber it purports to be. But before moving on to a deeper discussion of the overall findings, let us first review the findings we have covered in this chapter.

Figure 5.13: Barplot of percentile rank of edge weights between changemyview and political subreddits

5.4 Chapter summary

While the previous chapter approached the question of echo chambers on Reddit from a subreddit level perspective, this chapter has raised our focus to the network level by examining the patterns of atypically high levels of shared authorship between subreddits. Section 5.1 outlined the two key research questions for this second stage of research as well as the measure of combined similarity used to address them. The following sections then outlined the steps undertaken to answer each of those research questions in turn.

Section 5.2 sought to determine if we could identify topical public sphericules among our set of subreddits. After providing a quick overview of the key aspects of social network analysis, it outlined the process of constructing a network focusing specifically on those subreddit pairs with the very highest levels of combined similarity. We then used community detection to partition the network into communities of internally highly connected subreddits. By mapping the subreddit topic labels previously presented in Chapter 3 on to these communities we were able to validate that they did constitute topic-based public sphericules. We briefly considered whether some of the mid-sized communities, in particular the home of the porn subreddits, might constitute meta echo chambers. We’ll return to this question in the next chapter.

Section 5.3 then moved on to our fourth and final research question - to determine whether we could identify ‘meta-echo chambers’ among those public sphericules. Here we focused our attention on the geo-political public sphericule, the community which was home to all political subreddits as well as our suspected anti-echo chamber changemyview. We saw that some pairs of non-political subreddits which ranked highly for our measure of chamberness bore specific types of relationships which would not be considered chamber-like.

With regards to the political subreddits we found that those which had been flagged as potential echo chambers at the subreddit level were not part of meta echo chambers at the network level. Quite the converse: two of these from opposite ends of the political spectrum, neoliberal and ChapoTrapHouse, were very closely connected, suggesting that these subreddits welcome discourse across the political divide.

When considering only those edges between political subreddits we determined that the cluster of far left subreddits came the closest to forming a meta-echo chamber - one which was noticeably disconnected from The Donald. However, once we incorporated the edges political subreddits share with their closest non-political neighbours we saw that almost all political subreddits are closely enmeshed in the same local network. This simple finding shows that no political subreddits could exist in meta-echo chambers in the strictest sense, because they are all only a couple of short steps away from each other.

Finally, we saw that changemyview may not be the anti-echo chamber it claims to be. While it is enmeshed in the same local network as the political subreddits, its closest neighbours tend to skew right wing. In the following chapter we’ll explore in more depth whether changemyview’s supposed open-mindedness actually leads to diversity of thought, or appeals mainly to a certain form of contrarianism.

Chapter 6

Discussion

In this chapter we will bring together the findings presented in the previous two chapters and interpret them in light of the research aims and existing literature presented in the early chapters of the thesis. This discussion is organised into four sections. To begin, Section 6.1 will interpret the findings to directly address each of the four research questions and accompanying hypotheses in turn.

Section 6.2 will take a closer look at the findings for the political subreddits and changemyview, to discuss our ultimate understanding of the chamberness of each in turn. We will start with our original suspected echo chamber The Donald on the right, move through the neutral subreddits, and then on to the left and the likely left-wing meta echo chamber that emerged over the course of the research. In comparison to the general behaviours of political subreddits, we will then consider whether changemyview actually is the anti-echo chamber it claims to be.

Section 6.3 will then look at the most notable non-political ties and communities that came to light in Chapter 5, in particular the porn and sports communities. It will reflect on the dynamics of these communities, such as why they are so internally cohesive and separate from the wider network. It will use these examples to explore the question of whether echo chamberness is only a societal concern for political content, a theme originally introduced in Chapter 2. We will also examine the kinds of non-political ties which ranked highly for our chamberness similarity measure and discuss how understanding the nuances of subreddit context is essential to how we interpret the measure.

Finally, Section 6.4 will review the overall implications of this research, reflecting on the initial aims presented in Chapter 1 and how we have addressed the gaps identified in the literature. It will do this by discussing the strengths of the research methodology used. We will then reflect on the limitations of the research before presenting possible areas of future research which the findings presented here have paved the way for, and by which they could be extended. This will prepare us to move on to the following and final chapter, Chapter 7, in which we conclude the thesis.

6.1 Addressing the research questions

We introduced Bruns’ definition of an echo chamber in Chapter 2.

An echo chamber comes into being when a group of participants choose to preferentially connect with each other, to the exclusion of outsiders. The more fully formed this network is (that is, the more connections are created within the group, and the more connections with outsiders are severed), the more isolated from the introduction of outside views is the group, while the views of its members are able to circulate widely within it (Bruns 2019b, p. 29)

This definition states two simple but essential conditions for the existence of an echo chamber. First, the presence of connections between participants with shared views. Second, the absence of connections with outsiders with different views. Without the possibility of new or opposing views being introduced and circulating among participants, the existing shared views are expected to echo within the chamber, becoming more and more reinforced over time. The ultimate concern about echo chambers is that as views are reinforced they become more polarised over time, such that opposing groups ultimately lose any form of common ground on which to meet.

In the research presented in this thesis we set out four research questions broadly intended to determine whether echo chambers exist on the platform Reddit and, if so, whether they are inherently political. We have considered echo chamberness to be a spectrum rather than a binary: subreddits can show more or less chamberness, according to a variety of different measures, and this is always relative to what is considered ‘standard’ behaviour among other subreddits.

In Chapters 4 and 5 we presented the analysis conducted to answer the four research questions. In this section we will discuss in turn the findings for each of the questions and associated hypotheses. In the following section we will then integrate these findings to explore what they collectively mean for our understanding of the chamberness of the political subreddits, and the anti-chamberness of changemyview.

6.1.1 Research question 1

RQ1: How does author within subreddit participation vary between subreddits?

We found that, overall, the median number of comments per author per subreddit was highly skewed across subreddits. In most subreddits the average highly active author made 20 comments in the subreddit during January 2019. However, there were a handful of subreddits with substantially higher participation, up to almost 10 times the average number for the top subreddit. The relative number of comments authors made within a given subreddit, as measured by the insubreddit proportion, was also right skewed. In most subreddits the average active author made only 20% of all their comments on Reddit within the given subreddit.

Together, these two measures show that levels of within subreddit author participation vary greatly between subreddits. This is of course in line with the general principles of preferential attachment and the resulting uneven levels of participation between subreddits which have been shown in previous research (Olson and Neal 2015; Gilbert 2013). However, the key benefit to us in confirming these trends was to provide a background distribution of what would be considered ‘normal’ or ‘standard’ levels of author participation for any given subreddit. This allows us to contextualise the findings for the political subreddits and changemyview, to understand their relative levels of chamberness, and thus to answer our echo chamber and anti-echo chamber hypotheses.
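To make the two within subreddit measures concrete, the following is a minimal Python sketch rather than the code used in the thesis. It assumes a hypothetical input, a dictionary of comment counts keyed by (author, subreddit), and it assumes that each subreddit is summarised by the median over its authors. The function name is illustrative only, and the filtering to active authors described in Chapter 4 is omitted.

```python
from collections import defaultdict
from statistics import median


def within_subreddit_measures(counts):
    """Summarise within subreddit participation.

    counts: dict mapping (author, subreddit) -> number of comments.
    Returns a dict mapping subreddit -> (median comments per author,
    median insubreddit proportion). Both summaries are assumptions
    made for illustration, not the thesis's exact definitions.
    """
    # Total comments each author made across all subreddits.
    totals = defaultdict(int)
    for (author, _), n in counts.items():
        totals[author] += n

    per_sub_counts = defaultdict(list)
    per_sub_props = defaultdict(list)
    for (author, subreddit), n in counts.items():
        per_sub_counts[subreddit].append(n)
        # Share of this author's comments that fall in this subreddit.
        per_sub_props[subreddit].append(n / totals[author])

    return {
        subreddit: (median(per_sub_counts[subreddit]),
                    median(per_sub_props[subreddit]))
        for subreddit in per_sub_counts
    }
```

A subreddit whose authors make many comments there and few elsewhere would score highly on both values, which is the pattern the first echo chamber hypothesis looks for.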

Echo chamber hypothesis 1: Echo chamber subreddits will have higher author within subreddit participation than other subreddits.

Our main suspected echo chamber subreddit, The Donald, was in the top decile for both measures. We therefore find that it does have much higher author within subreddit participation than the vast majority of other subreddits. We can confirm that, according to our first hypothesis, The Donald shows high levels of chamberness for these measures.

Three other political subreddits also showed very high levels of author insubreddit participation. These were the left wing subreddit ChapoTrapHouse, the neutral ukpolitics, and the right wing neoliberal. As we take high levels of author participation, especially relative to participation in other subreddits, as a possible signal of chamberness, this provided an initial suggestion that these subreddits may also be echo chambers, at least at the subreddit level. However, we would of course also expect there to be many engaging subreddits with highly active bases of authors which are not the result of attraction due to confirmation bias, and which are therefore not echo chambers. Thus we needed to move on to other measures of chamberness with the second research question to develop a more well-rounded understanding of the behaviours of subreddit authors across subreddits. But first let us address the anti-echo chamber.

Anti-echo chamber hypothesis 1: The anti-echo chamber will have lower author within subreddit participation than other subreddits.

Authors in the self-purported anti-echo chamber changemyview made on average seven more comments there than authors in other subreddits, providing strong evidence against our first anti-echo chamber hypothesis. They were also in the middle of the distribution for author insubreddit participation. Therefore we must reject the first anti-echo chamber hypothesis. changemyview commenters do not appear to be put off engaging actively in the subreddit by the higher costs of time and effort required to make contributions of the quality expected in the subreddit. In fact they tend to make more comments than authors in other subreddits. This could be the result of the discursive nature of changemyview, where commenters are expected to engage with the author of the original post (the OP) to convince the OP to ultimately change their mind.

Though the anti-echo chamber did not have a lower author insubreddit proportion than most subreddits, its value was substantially lower than that for The Donald: 15% versus 50%. Relative to the suspected echo chamber, changemyview authors made far fewer of their overall comments within the subreddit. However, to understand whether these findings correspond with a greater range of other subreddits that authors are commenting in, a signal of their openness to a diversity of content, we must look at the between subreddit measures of author participation.

6.1.2 Research question 2

RQ2: How actively do authors of a given subreddit participate between all subreddits?

The second research question looked at the range of subreddits that authors participated in, specifically the number of subreddits, number of comments, and median number of comments per subreddit made by active authors in any given subreddit. Again, the distribution for each of these measures was highly right skewed across subreddits. For most subreddits the average author made 20 to 40 comments between only 2 or 3 subreddits. In terms of how evenly authors distributed their comments between subreddits, most subreddits had a median author comment Gini of 0.4 or lower, showing that their authors spread their comments more evenly between subreddits than not. With this in mind we could then contextualise the behaviours of the authors of suspected echo chambers.
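As an illustration of how the Gini coefficient summarises the evenness of one author's comments across subreddits, the sketch below uses the standard rank-based formula for a sample of non-negative values. It is a minimal example, not the thesis's implementation: the function name and the choice to return 0 for an empty or all-zero input are assumptions.

```python
def gini(comment_counts):
    """Gini coefficient of one author's comment counts per subreddit.

    0 means comments are spread perfectly evenly; values near 1 mean
    almost all comments are concentrated in a single subreddit.
    """
    xs = sorted(comment_counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        # Assumption: treat a degenerate input as perfectly even.
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n
```

Under this formula an author who comments equally in four subreddits scores 0, while an author who makes all of their comments in one of four subreddits scores 0.75, the maximum possible for four subreddits. A subreddit-level value such as the median author comment Gini would then be the median of these per-author scores.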

Echo chamber hypothesis 2: Echo chamber authors will have less varied participation between subreddits.

The average active author in The Donald only commented in one other subreddit; however, as we have seen, due to the highly skewed distribution of author subreddit counts this is common among subreddits. Therefore by this measure The Donald does not have less varied participation relative to most subreddits, as the majority have very low levels.

Once again we saw that neoliberal, the neutral ukpolitics, and the left wing subreddit ChapoTrapHouse showed similar patterns to The Donald, which differentiated the four from the other political subreddits. Authors in these three subreddits only commented in two other subreddits. Again, relative to all other subreddits this was not low. However, it was much lower than the values for the other political subreddits. Here we begin to see the importance of contextualising subreddits not just by general trends, but by those of topically related subreddits. This was a key feature in the later network section of the research.

In terms of how equally authors made their comments between subreddits, as measured by the Gini coefficient, there was not much variation between the political subreddits. The Donald actually had the lowest value among them, meaning that its authors distributed their comments more equally between subreddits than authors in other political subreddits.

We must therefore reject the second echo chamber hypothesis, as we do not find evidence that authors of The Donald, or any of the three additional suspected echo chambers, have less varied participation between subreddits than those in most subreddits. While they do tend to post in very few subreddits, this is the general norm. And for the subreddits that The Donald authors do comment in, they tend to split their attention more evenly than other political authors. Ultimately we do not see signs that authors of the potential echo chamber are participating less with outside content. However, to determine whether that range of subreddits actually corresponds with engaging with a range of content, we had to consider the textual similarity and network dynamics in the second stage of the research.

Anti-echo chamber hypothesis 2: Anti-echo chamber authors will have more varied participation between subreddits.

changemyview authors made more comments across more subreddits than most others; however, those comments were spread less evenly between subreddits. They participate more, in a greater range of subreddits, and can therefore be considered to have more varied participation between subreddits. This suggests that changemyview authors may have a wider palette than most other subreddit author bases.

It must be noted, though, that this was also the case for most political subreddits, with the exception of the four suspected echo chambers previously mentioned. Having high levels of author activity, but also high inequality of that activity, therefore seems to be normal for most subreddits which engage heavily with political content, and not a unique feature of changemyview. To understand whether high levels of between subreddit author participation actually correspond with participation in a greater variety of subreddits, as we would expect of an anti-echo chamber, we must turn to the kinds of topics which our case study subreddits are highly connected to.

In Section 6.2 we will reflect in greater detail on how these separate measures come together to inform our understanding of the echo chamberness, or anti-chamberness, of the case study subreddits.

6.1.3 Research questions 3 & 4

RQ3: Can we identify topical public sphericules among the top subreddits?

In Chapter 5 we were able to identify highly modular communities of subreddits, each containing many subreddits of similar topics. We created a new measure of ‘chamberness’ between pairs of subreddits, which signified how many more authors they shared relative to what their level of text similarity would suggest. Given the natural process of interest-based homophily we would expect authors in a subreddit to be attracted to other subreddits with similar content. As such, we found that the two measures, author overlap and text similarity, were highly correlated. By controlling for the text similarity between subreddits we were thus able to identify those pairs of subreddits with far more overlapping authors than we would expect based on topical relatedness alone.

By mapping independently annotated subreddit topic labels onto the identified communities we were able to determine that they corresponded very well to the concept of topic-based public sphericules, even though our measure of similarity partially controlled for content similarity. In Section 6.3 we will discuss in greater detail how the topic of, and relationship between, subreddits affects whether we should actually interpret high levels of similarity as chamberness. But first let us address the final research question before moving on to a more unified discussion of the findings.
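The idea of controlling for text similarity can be illustrated with a simple residual approach. The sketch below is an assumption made for illustration and is not the chamberness measure defined in Chapter 5: it fits an ordinary least squares line predicting author similarity from text similarity and scores each pair by how far it sits above that line. The input format and function name are hypothetical.

```python
def excess_author_overlap(pairs):
    """Score subreddit pairs by author overlap beyond what text predicts.

    pairs: dict mapping (subreddit_a, subreddit_b) ->
           (text_similarity, author_similarity).
    Returns a dict mapping each pair to its residual from a simple
    least squares fit of author similarity on text similarity.
    This is an illustrative stand-in, not the thesis's measure.
    """
    xs = [text for text, _ in pairs.values()]
    ys = [author for _, author in pairs.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # If every pair has the same text similarity, fall back to a flat line.
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return {
        pair: author - (intercept + slope * text)
        for pair, (text, author) in pairs.items()
    }
```

Pairs with large positive residuals share more authors than their topical relatedness alone would suggest, which is the kind of pair the chapter goes on to interpret.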

RQ4: Can we identify ‘meta-echo chambers’ among those public sphericules?

After confirming that the communities detected did strongly correspond to topical public sphericules, we returned to the question of echo chambers.

We homed in specifically on the geo-political discussion community, which held all of the political subreddits as well as changemyview. We discovered that many of the pairs of subreddits in this community with the highest values of chamberness had clear and specific types of relationships, such as australia and melbourne or Seattle and SeattleWA. We will discuss the implications of this for our understanding of the chamberness similarity measure in Section 6.3.

Focusing specifically on edges involving at least one political subreddit, we were able to determine whether subreddits on either side of the political spectrum were in closely connected cliques, or meta echo chambers, within the wider geopolitical public sphere. We saw that there was an almost complete clique among left wing subreddits. We will discuss the significance of this in greater detail in Section 6.2.

In general, most pairs of political subreddits shared an edge in the network, even those on opposing sides of the political spectrum. Due to the high level of inter-connectedness between political subreddits overall, I suggest that the question of political meta echo chambers is better understood through the ties that we do not find in the network than through the ones we do see. In other words, if we expect political subreddits in general to share edges, as we have seen, even between those on the left and right, then the absence of a tie tells us much more about the relationship between the subreddits. Specifically, we saw that this was the case for The Donald and a group of left wing subreddits.

We did not see that The Donald was particularly tightly connected to other right wing subreddits. Therefore we did not find that The Donald was part of a right wing meta echo chamber. It did, however, share relatively few edges with political subreddits in general. In particular, it did not share edges with any of the anti-Trump and anti-alt right subreddits. This is not, of course, especially surprising given that they are on opposite ends of the ideological spectrum among the political subreddits and fundamentally opposed on many key issues. But it is worth exploring in greater detail, as we will do in Section 6.2.

In summary, we did not find that The Donald was part of a right wing meta echo chamber. We will discuss The Donald more in the next section, and reflect on how this may have been affected by the research design in Section 6.4. The lack of ties with the left did show a level of chamberness for the subreddit itself, but it was not part of a wider meta-echo network.

In comparison, we found that the far left wing subreddits beholdthemasterrace, Fuckthealtright, Trumpgret, and esist formed an almost complete clique. In addition to the absence of ties with The Donald which we have explored, this suggests that if there were a political meta-echo chamber among the popular subreddits it was actually on the left.

However, by examining the non-political subreddits with which the political subreddits were most closely related we saw that almost all political subreddits, even those on the far left, shared their closest neighbours with each other. The only exceptions to this were SandersForPresident, which was only very close to other political subreddits, and ukpolitics, which was in a separate Euro-British local network. In the strictest sense, then, none of the political subreddits were found to exist within meta-echo chambers, as they were enmeshed within the same larger local network which featured a range of discourse. No active political subreddit was truly isolated from the wider Reddit network, or far from any other political subreddit holding opposing views.

In this section we have reviewed the findings for each of the four research questions in turn. To bring these results together and consider their insights in greater detail, let us next examine the political subreddits in turn, working across the political spectrum from right to left.

6.2 The case study subreddits

To consolidate our findings and discuss their collective implications we will take a closer look at each of our case study subreddits in turn. We will begin on the right with an emphasis on The Donald, our original suspected echo chamber, before moving on to the neutral political subreddits. We will then look at the left wing subreddits and consider in greater detail the far left clique identified in the course of the research. Finally, we will end this section by reviewing our findings for changemyview to determine whether it is in practice the anti-echo chamber it claims to be.

6.2.1 The right

On the right we had five active subreddits: our suspected echo chamber The Donald as well as Conservative, Libertarian, The Mueller, and neoliberal. In the first stage of analysis, where we adopted subreddit-level measures of chamberness, we identified that neoliberal had a very high author insubreddit proportion: its authors tended to make a relatively much larger share of their comments there than authors of other subreddits did in theirs. This led us to question whether neoliberal could also possibly be an echo chamber.

An examination of the network level structures of political subreddits, however, showed that this was not the case. All of the right wing subreddits, including neoliberal, shared very high similarity with neutral or left wing subreddits and therefore did not exist in any right wing meta echo chambers.

The Donald

The Donald had very high values for both author comment count and author insubreddit proportion, confirming our first echo chamber hypothesis and providing some initial evidence that it may in fact be an echo chamber. It also had very low numbers of comments and subreddits per author, but this was in line with the overall low levels for most subreddits. We also found that, among political subreddits, it had a very low Gini coefficient, suggesting its authors actually distributed their comments relatively equally among subreddits.

With regard to the second echo chamber hypothesis, then, we did not find that authors of The Donald behaved the way we expect members of an echo chamber to behave in terms of the frequency and variety of other subreddits they participated in. We therefore needed to look at the nature of the specific subreddits The Donald authors also participated in to understand any meta-echo chamber dynamics it might show at the network level.

In Section 6.4 we will explore how the methodological decision to select the 1000 most popular subreddits may have missed out on smaller right wing subreddits with which The Donald could share high chamberness. For example, The Donald might have been the ‘leader’ in a network of similar satellite subreddits which were too small to be included in our sample. From the existence of The Mueller in our dataset we could expect that other stylistically and/or thematically related subreddits may have existed at that time but were too small to meet the threshold of inclusion in our research.

However, our findings have shown that The Donald is not isolated from either the wider subreddit network in general or the mainstream political subreddits in particular. It is relatively more distant from socialism and the far left subreddits compared to other political subreddits, but it still shares relatively high similarity with these, even the anti-Trump subreddits, as all political subreddits ultimately do. We can therefore conclude that, due to these strong associations with opposing groups, The Donald was not part of a right wing meta echo chamber, even if it has connections with satellite subreddits not shown in this research.

Despite this, we did identify an interesting relationship between The Donald and the left wing subreddits. We will discuss that relationship from The Donald’s perspective here, and later from the perspective of the left wing subreddits.

Relationship with the left

The Donald is the only right wing subreddit not to share a tie in our network with the more moderate leftist subreddit socialism. To interpret what this means for The Donald’s level of chamberness we need to consider the two way movement of authors which could theoretically exist between The Donald and any of these subreddits.

On the one hand, we know that the description of The Donald presented in its sidebar, as discussed in Chapters 2 and 3, clearly states that content which questions or detracts from support of Trump is not allowed within the subreddit. Therefore it is understandable that supporters of the far left subreddits and/or socialism would either not be compelled to contribute to The Donald, or would have their comments removed if they did engage there. If comments were removed they would not appear in our dataset.

In the opposite direction, supporters of The Donald could choose to comment in those subreddits, as we have seen that they did in some left wing subreddits such as ChapoTrapHouse, LateStageCapitalism and SandersForPresident. There are multiple possible reasons why we do not see The Donald authors commenting in the other left wing subreddits. Most simply, they may choose not to comment there because they have no interest in engaging in discourse. This would especially be true for the far left subreddits beholdthemasterrace, Fuckthealtright, Trumpgret, and esist, which are diametrically opposed to The Donald.

Alternatively, if The Donald authors do choose to participate in those subreddits, they may in turn see their comments removed if those subreddits also enforce policies against dissenting voices. This is more likely the case in the far left subreddits, and less likely to be the case for socialism, which claims to be more open to engaging in discourse.

This suggests that the outreach of The Donald authors across political divides may shut down at a certain point. They appear to be able to find some common ground with ChapoTrapHouse and even LateStageCapitalism but not with socialism. One explanation for this phenomenon would be that political chamberness on Reddit may be a two way process, with socialism choosing to reject members of The Donald just as The Donald expels members of socialism.

Auto-banning is a process by which a subreddit can choose to automatically ban from commenting any redditor who comments in another specific subreddit, through the use of bot moderator accounts called ‘banbots’. This is a practice which Reddit discourages but has not yet banned (r/ModSupport 2017). Subreddits do not need to publicise their use of auto-banning, therefore we cannot know for certain whether any of our political subreddits practised auto-banning during the period of study and, as such, whether this affected the co-authorship patterns we observed.

This means that we cannot determine conclusively whether authors of The Donald and the left subreddits chose not to participate in each other’s subreddits, or whether they were not allowed to and had their submissions automatically removed by banbots. In practice, however, this does not affect the overall interpretation of our findings, as one subreddit auto-banning another would be the clearest sign that they are not open to dissenting views. The theoretical relationship between auto-banning and echo chambers could provide an interesting topic of future research.

We will examine these findings from the perspective of the left wing subreddits later in this chapter, but first let us consider the neutral subreddits.

6.2.2 The neutral

There were five neutral subreddits, which could not be categorised as politically left or right leaning: politics, PoliticalHumor, PoliticalDiscussion, ukpolitics, and worldpolitics. Methodologically these subreddits provided a useful control group for how ‘normal’ political subreddits might be expected to behave, and against which we could compare any trends on the left or right.

In practice, ukpolitics provided some interesting insights in itself. We found that it had a very high author insubreddit proportion, leading us to consider it as a possible echo chamber, at least at the subreddit level. On inspecting the non-political neighbours with which it was most associated we found that it was the only political subreddit to exist in a totally separate local network. ukpolitics shared its closest edges with more generalist subreddits which were geographically related, focusing on the UK, Ireland, and Europe.

Though a further exploration of this section of the geo-political community is beyond the focus of this research, it is worth noting that this is an example of the kind of interest-based homophily which is one of the central aims of Reddit. The UK politics subreddit is not part of a meta echo chamber, but a member of a larger but still regionally specific public sphericule.

6.2.3 The left

On the left there were 11 popular subreddits. Whereas the right wing had two generalist political philosophy subreddits, Conservative and Libertarian, the left only had one, socialism. The other 10 left wing subreddits had much narrower topical focuses, such as SandersForPresident or Fuckthealtright. Among these subreddits there were two key findings of note: the first about ChapoTrapHouse and the second about a group of far left subreddits.

As with neoliberal and ukpolitics, we saw that ChapoTrapHouse had a very high author insubreddit proportion, leading us to consider that it might also be an echo chamber candidate. But once again the network measures showed this was not the case. Indeed, ChapoTrapHouse appeared to be one of the polarised subreddits (i.e. left or right) with the strongest connections across political lines.

ChapoTrapHouse was the only left wing subreddit to share an especially strong edge with a non-left subreddit, interestingly neoliberal. This suggests that there is strong discourse between the two groups of authors. In this sense these two subreddits may in fact be the farthest of the polarised subreddits from being part of a meta echo chamber, as their relationship provides a bridge along which ideas can flow. This is quite a different situation to the one that we found among most of the other far left subreddits.

A left-wing meta-echo chamber?

When we examined the pairs of political subreddits with the very highest levels of similarity we found that all left wing subreddits created a densely connected component, which was only connected to other subreddits by the bridge from ChapoTrapHouse to neoliberal. In particular the subreddits Fuckthealtright, beholdthemasterrace, and ENLIGHTENEDCENTRISM formed an anti-alt right triad.

In addition to being tightly internally connected, the left wing subreddits, including the more moderate socialism, were notable for the absence of strong edges with The Donald. In principle this is, of course, not surprising at all. In the previous section and in our earlier review of The Donald we addressed the obvious point of the division between Trump supporters and Trump opponents, in addition to the possible use of auto-banning.

The absence was notable, however, because it highlighted the possible requirement of two opposing forces in the creation of political chamberness on Reddit. Not only might The Donald echo internally and the left wing subreddits also echo internally, but to become echo chambers both groups would have to deny each other. Given the otherwise very high level of overlap between all political subreddits, if a subreddit on either side of the political divide welcomed debate or discussion, as ChapoTrapHouse and neoliberal appear to with each other, we might reasonably expect to see a bridge appear. In other words, if socialism was willing to debate with members of The Donald they might not be so disconnected.

This bi-directional distance is predominantly due to the fact that most of the left wing subreddits do not take an affirmative pro-issue stance, but are framed as anti-issue. They focus less on what they support, and more on what they are against. We can see this most clearly from the names of Fuckthealtright and Trumpgret, but all of the left wing subreddits beyond socialism and SandersForPresident are specifically framed against either the right (e.g. Trump or the alt right) or moderate centrism.

This mirrors previous research touched on in Chapter 2 which found that the Republican blogosphere devoted more time to criticising Democratic politicians than sharing supportive content about Republican ones, and vice versa.

As we previously discussed, the methodological decision to set a threshold of the 1000 most popular subreddits during our time of study may have missed out on right wing subreddits which act as satellites of The Donald. It might also have missed out on smaller left wing subreddits which were willing to have discourse with the active members of The Donald. However, if those subreddits did exist, we know that they were not very popular with Redditors.

Overall, we have shown that there were no political meta echo chambers in our network in the strictest sense, as the majority of political subreddits were highly connected with each other. Even in the case of the far left subreddits and The Donald, which did not share direct ties, they were part of the same local neighbourhood, sharing many close non-political neighbours. If any political subreddits were to be considered in a meta echo chamber they would be those on the left rather than the right, for the reasons we have discussed. However, even in the absence of close direct ties, the far left was never more than two short steps away from The Donald and, as such, could not form an airtight chamber.
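Two of the structural ideas used in this argument, an "almost complete clique" and being "two steps away", can be checked mechanically on any edge list. The sketch below is illustrative only and uses a hypothetical input format, a list of undirected edges between subreddit names; it treats edges as unweighted, which is an assumption, and it is not the network code used in Chapter 5.

```python
from collections import defaultdict, deque
from itertools import combinations


def edge_density(edges, nodes):
    """Fraction of possible ties among `nodes` that are present.

    A value of 1.0 is a complete clique; values close to 1.0
    correspond to an almost complete clique.
    """
    nodes = sorted(set(nodes))
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return 0.0
    present = {frozenset(edge) for edge in edges}
    found = sum(1 for a, b in combinations(nodes, 2)
                if frozenset((a, b)) in present)
    return found / possible


def steps_between(edges, source, target):
    """Fewest edges separating two nodes, or None if unreachable."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {source}
    queue = deque([(source, 0)])
    # Breadth-first search visits nodes in order of distance from source.
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

On such a representation, a group of subreddits with high edge density but no direct tie to an opposing subreddit can still be only two steps from it when both share a neighbour, which is the situation described above.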

6.2.4 The anti-echo chamber

In the last part of this section we turn to changemyview, the subreddit which set out to be an anti-echo chamber. We found that the subreddit had a high author comment count and middling author insubreddit proportion. Authors were not put off by the level of quality expected of comments in the subreddits and, as such, were no less active participants than those of other subreddits. We saw that changemyview’s authors tended to make more comments and across more subreddits than most others, showing that they were active across a a range of subreddits. This suggested that these authors therefore may have a wider topically palette than others but an examination of the network structure of subreddits proved that this was not quite the case. The majority of political subreddits also shared these characteristics suggesting that they may reflect the discursive nature of socio-political subreddits in general, more than an inherent or unique characteristic of changemyview authors. changemyview did prove to be one of the most connected subreddits in the geo-political community, and it was closely connected to other subreddits across a range of topics, especially discursive subreddits. However in examining those subreddits with which it shared the strongest connections we were able to see that they covered topics frequently associated with the right, in particular the alt right. This was the first suggestion that active authors of changemyview may reflect a diversity of perspectives and, as such, discourse within the subreddit may not be as unbiased or well rounded as one would expect of an anti-echo chamber. In addition, changemyview had a noticeable lack of similarity with left wing subreddits. Given the other signals of chamberness among those

subreddits which we have already discussed, this disconnect could be the result of push factors from the left. If the far left subreddits are not open to dissenting voices, it would make sense that they would not attract the same authors as changemyview. But, similarly, it means that there are relatively few members of the far left who are drawn to participating in the debates of changemyview to advocate for their own side. I would suggest that this disconnect may be in part a response to the general left-leaning tendencies of Reddit which have been discussed in previous research (Massanari 2015b). Authors who are drawn to changemyview may be drawn to it because they hold views which are considered ‘contrarian’ to the Reddit mainstream, rather than because they are interested in creating balanced debate. The non-political subreddits with which changemyview is most closely associated are all contrarian, but do not actually represent a wider diversity of viewpoints. JordanPeterson, MensRights, and MGTOW are all anti-political correctness and anti-social justice, common refrains of the alt right. We do not see a similar overlap between the authors of changemyview and any form of left wing contrarianism. An in depth topic analysis of the content of changemyview would need to be conducted to address whether the internal discourse in the subreddit actually meets the level of balanced, varied perspectives that we would expect from an anti-echo chamber. But examining the kinds of topics which changemyview authors tend to be drawn to the most has shown that there is a notable bias towards a limited set of alt right perspectives. This suggests that changemyview does not attract the kinds of diverse, open minded participants that we would hope for but may simply be part of a contrarian sub-culture.

6.3 Non-political echoing

This then leads us, in our next section, to consider whether echo chambers are inherently and exclusively an issue for explicitly political content. In presenting the existing literature in Chapter 2 we considered whether social media echo chambers are only a problem for issues around political views and/or content with a direct impact on democracy. In this section we will reflect on the research conducted here and consider whether echo chambers are inherently political. First we will look at two of the most distinct topical public sphericules identified, the sports and porn communities, and reflect on the extent to which they are, or could be, meta-echo chambers. Then we will examine the kinds of dyads between non-political subreddits which ranked very highly for our measure of chamberness similarity and discuss how understanding the specific relationship between subreddits is essential to how we interpret the meaning of this measure. Finally we will discuss how, ultimately, whether these subnetworks and ties raise moral or ethical issues depends on the content of subreddits, but also the chance for polarisation over time, irrespective of political relevance.

6.3.1 The sports community

Of the original 1000 subreddits 47 were labelled as ‘sports’. Of these 42 ended up within a single community. In total the sports community contained 47 subreddits including only five not given the primary label of ‘sports’. One of these outliers was geographic, Atlanta, a city which is home to many notable professional sports teams with their own subreddits. The four other non-sports subreddits were about sports-related video games,

such as FIFA. Only five sports-labelled subreddits were not part of this community - Cricket, skiing, and snowboarding belonged to the geo-political community and formula1 and bicycling to the generalist community. Though the overall density of the sports community was not very high at 25%, this was likely due to the variety of sports covered by the subreddits, ranging from football and American football to baseball and martial arts. The disconnect of the community from the wider subreddit network was very notable with a Yule’s Q of 0.99, an almost perfect association between subreddits being part of the community and sharing an edge. This therefore suggests that authors in sports subreddits, at least those in the sports community, form an insular community. The sports community thus presents a very natural example of topic-based homophily on the platform. The separateness of the sports community from the wider network suggests that many authors who are active in sports subreddits do not comment in other parts of Reddit. In Chapter 2 we reviewed the practice of multiple profile maintenance, when one person uses different usernames in different online spaces to prevent context collapse. As we will soon see this practice is likely common in the porn community, but we would not expect it to be common for participation in sports subreddits. There is of course the possibility that authors in the sports community are using secondary accounts, perhaps to hide some form of negative behaviour, but we would not expect this practice to be widespread. Therefore, the systemic disconnect of the sports community from the wider network suggests, more simply, that those authors only come to Reddit for sports content or at least only actively comment within sports subreddits. This finding in itself is not troublesome. The sports community may

represent the purpose of Reddit in its truest form wherein a homophilous group (or groups) of sports fans share and discuss the specific form of content they are interested in. But sports fandom offers us an interesting example of a sense of community which can, over time, develop negative behaviours. Increasing in-group identification of sports fans could lead to greater antagonism between opposing teams and fans, leading to greater hooliganism in real life. Online hooliganism has been shown to lead to offline abuse of players and supporters of opposing teams, for which offenders can be legally prosecuted in England and Wales (Press Association 2013). Recent research has shown that conflicts between football subreddits mirror real world, real time competition, with antagonism increasing between rival teams such as Manchester United and Liverpool FC during a championship league tournament (Aswath et al. 2020). While this thesis has not sought to develop our understanding of sports hooliganism specifically, it provides us with a comparative example of the possible dangers of even non-political homophilous online communities becoming more insular and polarised. We will return to the problem of polarisation at the end of this section, but first let us examine another non-political example.
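As an aside, the two network measures quoted for the sports community (internal density and Yule’s Q) can be illustrated with a minimal sketch. It assumes that Yule’s Q is computed over all pairs of subreddits from a 2x2 table crossing ‘both subreddits are in the community’ with ‘the pair shares an edge’; the exact construction used in the thesis codebase is not restated here, and the function name and toy network below are hypothetical.

```python
from itertools import combinations


def community_stats(nodes, edges, community):
    """Return (internal density, Yule's Q) for one community.

    Assumption: every unordered pair of nodes is classified by whether
    both nodes are in the community and whether the pair shares an edge.
    """
    edge_set = {frozenset(e) for e in edges}
    a = b = c = d = 0
    for u, v in combinations(sorted(nodes), 2):
        inside = u in community and v in community
        tied = frozenset((u, v)) in edge_set
        if inside and tied:
            a += 1  # within-community pair with an edge
        elif inside:
            b += 1  # within-community pair without an edge
        elif tied:
            c += 1  # other pair with an edge
        else:
            d += 1  # other pair without an edge
    density = a / (a + b) if (a + b) else float("nan")
    denom = a * d + b * c
    yules_q = (a * d - b * c) / denom if denom else float("nan")
    return density, yules_q


# Invented toy network: five subreddits, three of them in one community.
toy_nodes = {"A", "B", "C", "D", "E"}
toy_edges = [("A", "B"), ("B", "C"), ("D", "E"), ("C", "D")]
toy_community = {"A", "B", "C"}

density, yules_q = community_stats(toy_nodes, toy_edges, toy_community)
print(f"density={density:.2f}, Yule's Q={yules_q:.2f}")
```

Under this reading, a Q close to 1 means that edges are concentrated almost entirely within the community, while density describes only how complete the ties inside the community are. This is how a community can be sparse internally and still sharply separated from the rest of the network.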

6.3.2 The porn community

There were 40 pornographic subreddits in our network, all of which were contained within a single community, the porn community. This community included four non-pornographic subreddits which were labelled as either discussion or tv related subreddits. The community is much more densely internally connected than the sports community at 63%, which was second only to the four very small niche communities.

Like the sports community, it was almost completely separate from the wider subreddit network. This shows that not all commenters in porn subreddits choose to keep completely separate accounts for their not safe for work (NSFW) and safe for work (SFW) contributions. All porn subreddits are in the same community, but not all subreddits in the community are devoted to pornography. This shows that though there is very high homophily among the porn commenters, they are not exclusively participating in those communities. There are a number of reasons which could explain the patterns of chamberness shown in the porn community. Porn subreddits are likely farther from the overall network due to use of multiple profiles. As we reviewed in Chapter 2, previous research has shown that contributors to NSFW subreddits often practice multiple profile maintenance. Specifically, women who submitted nude images of themselves to gonewild used secondary accounts to prevent the photos from being linked to their main Reddit profile, which could be known to their friends or family members (van der Nagel and Frith 2015). I would suggest the same is likely to be true for many of the authors in our pornographic subreddits. They may in fact be commenting in other subreddits, but using different accounts. Many authors in this community will have separate usernames, likely their main accounts, which they use to comment in other parts of Reddit. They may not want their participation in NSFW subreddits associated with their main accounts (Leavitt 2015; Leavitt et al. 2016). Alternatively, as I have argued is possibly the case for the sports subreddits, other authors will simply only comment on porn subreddits. They exclusively visit Reddit for the pornographic content, or they choose not to

actively comment in the non-pornographic subreddits they frequent. There is, of course, also likely to be the opposite trend of authors in our non-pornographic subreddits visiting but not commenting in porn subreddits. As previous research has attempted to address, there are interesting social scientific questions around who comments on online porn, and why (Tyson et al. 2015). A closer inspection of these questions is beyond the scope of this research, but it is worth noting that there has been related research in this area. This issue of possible unseen overlap between subreddits, either through the use of multiple profiles or the differences between observing and participating in subreddits, highlights one of the key limitations of using trace data which we will explore in a later section of this chapter. We need to keep this potential missing data in mind as we interpret the findings of the porn community in particular. But the existence of the insular, cloistered NSFW community provides us with another important example of a potentially harmful non-political echo chamber. Returning to Reddit’s original goal of compiling links from across the internet of interest to niche communities, subreddits offer consumers of online pornography the ability to find and share thematically related content. Research has shown that in addition to having general porn subreddits such as r/PornVids there are more specialist subreddits on the platform, such as r/ChickFlixxx which specifically focuses on feminist porn (Smith 2015). These specialised subreddits may serve as a source for collating niche pornography, and allow those communities - fandoms in a way, as with sports subreddits - to coalesce. Similarly, the original research presented in the previous chapter highlighted the strength of homophily among subreddits which share niche

pornographic interest. The pairs of subreddits with the very highest levels of combined similarity were almost all pornographic, and from their names shared clear, specific niche focuses. While the existence of these subreddits may again serve the purpose Reddit originally intended - connecting specific audiences with content tailored to their specific interests - if left estranged from any form of outside influence these communities and their content could become more extreme over time. Again, a body of research into online pornography exists which is largely beyond the focus of the work presented here (Attwood 2010; Smith and Luykx 2017). But I suggest that this isolated pornographic community has the potential to pose similar general concerns of polarisation that we associate with political echo chambers. This is especially a concern given previous issues Reddit has faced with not addressing illegal and unethical adult content on the platform. In 2018 Reddit banned the proliferating number of ‘deep fake’ porn subreddits which featured faked nude images of celebrities created using artificial intelligence. Reddit only banned the practice after the problem received widespread media attention (Burkell and Gosse 2019). Years earlier in 2014 Reddit allowed the circulation of authentic adult imagery of female celebrities which had been illegally obtained by computer hackers (Massanari 2015a; Marwick 2017). Reddit only banned the host subreddits when it came to light that one of the subjects of the images was underage and the photos were therefore . Reddit’s resistance to addressing these previous issues with adult content was based in its desire to remain a platform for free speech. However these incidents are examples of the necessary limits to this policy as some content must be deemed as inappropriate and not allowed on the platform,

even where subreddits exist with audiences who desire that content. In some cases the legality of the content will be in question, as with the hacked images. In other instances the issues involve ongoing ethical dilemmas, as with celebrity deep fakes. The decisions to allow these kinds of content cannot, ultimately, lie with each subreddit itself. When these communities insulate themselves from any influence of outside perspectives they may develop increasingly anti-social behaviours that would be deemed unacceptable by the wider platform, such as sharing nude images of someone without their consent. Without the chance of hearing outside perspectives about the moral issues raised by the content favoured within them, these chambers may, over time, develop more extreme understandings of what is acceptable content.

6.3.3 Homophily vs echoing

These examples of the sports and porn subreddits bring us back to the perceived problem which inspires the study of echo chambers - the issue of increasing polarisation. As we have seen, there is nothing inherently wrong with the existence of topical public sphericules devoted to sports or pornography. Indeed, this topical homophily is an indication that Reddit is meeting its intended purpose. In our examination of the geo-political community we identified pairs of subreddits with very high measures of chamberness which did not pose the risk of being dyadic meta-echo chambers. Rather, the much higher than expected co-authorship between these subreddits successfully highlighted the latent relationships between their subject matter. Some subreddits appeared to be redundant (e.g. Seattle and SeattleWA), seemingly duplicating the same purpose. This raises the

question of why these subreddits exist separately, but are still so strongly connected. Other dyads were clearly hierarchically related either geographically (e.g. ontario and toronto) or in increasing specificity (e.g. unitedkingdom and ukpolitics). And yet others were part of a common latent parent class (e.g. skiing and snowboarding as winter sports). This shows that the methodology developed in this research is able to surface topically homophilous relationships between subreddits, the first stage in considering echo chamberness. But to get to the problem of echo chambers we need to also consider two additional factors: both the absence of external connections and the potential for group polarisation. As we have explored using the examples of the sports and porn communities, closely connected groups with limited ties to the wider network can be seen as a positive and not raise echo chamber fears. These communities may form chambers of like minded audiences but where the content they echo remains benign, the issue of reinforcement is not, in practice, an issue. By reasserting our focus on why echo chambers can pose an issue to civil society we have shown that more attention should be paid to the problem of non-political echo chambers which has previously been largely overlooked in the literature. Increasing group polarisation, fragmentation, and intolerance for alternative perspectives is not exclusively a concern for the large socio-political topics which have been the main focuses of the existing research on echo chambers such as Brexit (Del Vicario et al. 2017; Bastos et al. 2018), climate change (Williams et al. 2015), or American bi-partisanship (Adamic and Glance 2005; Bail et al. 2018). The problem of echo chambers may appear the starkest and most concerning around issues of great socio-political importance such as presidential elections and issues of national sovereignty. But any situation in

which homophily turns into unmitigated in-group fanaticism can lead to troubling extremist communities. And if we cannot practice facing criticism about our favourite sports teams or tastes in porn, will we be able to face criticism about larger, more difficult subjects?

6.4 Implications of the research

In the final section of this chapter we will reflect on the wider implications of the original research presented in this thesis. First we consider the theoretical implications of the findings by reflecting on the major theories which motivated it. We will next examine the strengths of the research, in particular how the novel methodology developed enabled it to fill gaps in the existing research, and how the full reproducibility of the code contributes to the growing movement for open science. Then we will discuss the limitations of the research and how these affect what we can and cannot conclude about the existence of echo chambers on Reddit. This will prepare us to discuss, in the final chapter, the types of future research which this original research has paved the way for, thus showing the role that this research has played in developing our understandings of the problem of echo chambers on social media.

6.4.1 Wider theoretical implications

In Section 2.1 we reviewed the key theories which motivated this research, in particular Sunstein’s introduction of the term ‘echo chamber’ and his work on the public forum doctrine and Habermas’s concept of the public sphere. Sunstein introduced the concept of new media ‘echo chambers’ as a move away from the communal knowledge that citizens gained from

‘the public forum’ of trusted mainstream media outlets (Sunstein 2018). He emphasises that citizens in a deliberative democracy should be informed of a range of viewpoints, particularly those they are opposed to. However, as the findings presented here demonstrate, Sunstein’s theory works best as a metaphor on which more developed, and less idealised, theories of social media can be built. Most notably, as his original thesis was developed in 2001, almost a decade before the proliferation of social media, it failed to address the positive dismediating effects that new media could provide. Counter to Sunstein’s fears, new media does not need to be placed in direct competition with traditional media but can offer a wholly different experience. This is evidenced by the wide range of topics covered by the subreddits we have studied. While mainstream media is focused on current affairs and the latest breaking news, social media allows users to engage with a wider range of content than that which is considered newsworthy. Indeed newsworthiness, and particularly political content, are just a small part of the larger information consumption patterns of social media users, and particularly redditors, as shown in the variety of topical public sphericules we have explored. Again in contrast to Sunstein’s suppositions, this diverse information environment actually highlights the commonalities between otherwise disparate groups, as shown in the high level of shared neighbours by left and right wing subreddits. This shows us that the political poles perhaps have more in common than not. Having largely disavowed the work of Sunstein in this context, we can reflect on the enduring importance of Habermas. Our modern understanding of social media has proved to be more in line with Habermas’s concept of the public sphere, even though it predates social media by half a century (Habermas 1991). Rather than revering traditional media as Sunstein

does, Habermas suggests that they act as gatekeepers who overly-mediate political communication on behalf of the masses. As such he calls for the public sphere to be a place for the wider citizenship to actively engage in political discourse. In later work specifically discussing new media Habermas demonstrates the power of the Internet to enable ordinary citizens to express their opinions and gain their own audiences (Habermas 2006). The public sphere theory is supported by the original research we have presented here. The Donald started as a fandom for a political candidate who was derided by much of mainstream media. It serves as an exemplar of the counter-establishment, populist movements which gained widespread appeal during this period. On the other side of the political spectrum, the far left subreddits covered a range of viewpoints which would be considered too fringe to be represented by the wider media. This is well exemplified by ChapoTrapHouse which was inspired by the eponymous podcast, another form of dismediating new media. The popularity of these ‘alt left’ subreddits on the platform highlights the affordances of social media to give a stage to new voices and opinions and to enable them to attract new audiences. Returning to our central thesis, this research supports the view that social media does not inherently lead to echo chamber behaviours. While there remain concerns of increasing fragmentation and extremism in certain corners, the fears of disintermediation as asserted by Sunstein are largely overstated and do a disservice to the level of positive niche community engagement that these platforms provide their users. Given the topic-based nature of Reddit, it may indeed provide an ideal form of public sphere as outlined by Habermas. Or more specifically topical ‘public sphericules’ as defined by Highfield and Bruns (2015). As our network analysis has

shown, subreddits naturally group into broad, loosely connected topic-based communities which could operate as separate sphericules. Additionally, and contrary to echo chamber fears, the left and right wing political subreddits were not strongly polarised but actually central parts of the same wider sphericule. As such the research here largely supports the theoretical work of Habermas, and its descendant work by Highfield and Bruns. It undermines the work of Sunstein who placed too great an emphasis on the benefits of traditional political media to the extent of approaching a moral panic on the hazards of new media. We can consider the wider implications this has for how we should think about echo chambers going forward by reviewing the specific strengths of the original research that has been presented here.

6.4.2 Strengths of the research

In reviewing the existing body of academic work on social media echo chambers we identified a gap between the theoretical and empirical work. We showed that although it had long been used as a theoretical concept, much work treated ‘echo chambers’ as self-explanatory and did not offer a clear definition of what was meant by the term. This led to inconsistencies in the way the term was applied, limiting the extent to which empirical studies could be compared. The research presented here has attempted to avoid this same mistake by drawing on the first clear definition to be provided in the literature, recently introduced by Bruns (Bruns 2019b). By presenting a specific definition of what an echo chamber is, we have been able to create a systematised, multi-approach method to identifying subreddits and communities which may be echo chambers. A key advantage of our method is that it treats chamberness

not as a binary but a spectrum, a level of degree. By addressing both the subreddit-level and network-level dynamics of chamberness we have also been able to develop a more rounded and nuanced approach. The ability to treat chamberness as a spectrum is the result of another key strength of the research methodology, the cross platform approach. While the majority of previous echo chamber research has focused on a bounded set of actors on Twitter or Facebook, this research has provided a full scale analysis of the most active 1000 subreddits of Reddit, which accounted for the majority of activity on the platform within our time frame of January 2019. By analysing such a wide range of subreddits we were able to create a background distribution for each of our measures of chamberness and, as such, provide context as to what was normally expected behaviour on the platform. This enabled us to interpret the relative level of echo chamberness (or anti echo chamberness) of our case study subreddits. The addition of the manual subreddit topic labels also enabled us to independently validate the existence of topical public sphericules within the subreddit network. Whereas previous research has relied on identifying the topics of communities after detection (Bruns et al. 2017; Olson and Neal 2015) our approach has been able to avoid the confirmation bias inherent to those methodologies. The research presented here also improves on the existing network approaches to identifying echo chambers on social media by considering two aspects of subreddit relatedness: co-authorship and text similarity. The unique nature of subreddits as both collections of authors and distinct nodes of analysis themselves has allowed us these two related but discrete sources of information. Subreddits can be associated by the number of

authors which they share, but also the similarity of the content their authors contribute. Taken together, these two data sources allow us to create a single measure of similarity between subreddits which gives a better representation of their potential chamberness than either measure would by itself. For two subreddits to exist in an echo chamber they must, by definition, share relatively many authors and highly similar content. By controlling for text similarity and identifying those subreddits which share even more authors than we would expect we are able to identify those pairs with unusually high closeness. The final key strength of this research is the reproducibility of the analysis. The codebase used to conduct this analysis, from data collection to visualisation, is publicly available in a GitHub repository. Further details of the repository are presented in Appendix A.1. The code has been written so that it can be completely re-run according to two parameters, the month of the dataset (e.g. August 2020 instead of January 2019) and the cut-off for the number of subreddits by author count to subset for (i.e. the top 10,000 subreddits instead of the top 1,000). This means that not only can the exact analysis here be reproduced and validated, the analysis can be extended to include a greater range of subreddits or time points.
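The idea of identifying pairs of subreddits which share more authors than their text similarity alone would lead us to expect can be illustrated with a simple residual approach. The sketch below is a hedged illustration only: it assumes an ordinary least squares line as the ‘expected’ level of co-authorship, which is not necessarily the model used in the thesis codebase, and the function name, pair names, and scores are all invented for the example.

```python
def excess_coauthorship(pairs):
    """Residual co-authorship after a linear fit on text similarity.

    pairs maps (subreddit_a, subreddit_b) to a tuple of
    (text_similarity, coauthorship_score). A positive residual means the
    pair shares more authors than the fitted line predicts.
    Assumption: a simple least squares line stands in for 'expected'.
    """
    xs = [text for text, _ in pairs.values()]
    ys = [coauth for _, coauth in pairs.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {
        pair: coauth - (intercept + slope * text)
        for pair, (text, coauth) in pairs.items()
    }


# Invented scores for four hypothetical subreddit pairs.
toy_pairs = {
    ("sub_a", "sub_b"): (0.1, 0.1),
    ("sub_c", "sub_d"): (0.2, 0.2),
    ("sub_e", "sub_f"): (0.3, 0.3),
    ("sub_g", "sub_h"): (0.4, 0.9),
}

residuals = excess_coauthorship(toy_pairs)
closest_pair = max(residuals, key=residuals.get)
print(closest_pair, round(residuals[closest_pair], 3))
```

Ranking pairs by such a residual, rather than by raw co-authorship, avoids rewarding pairs simply because they discuss similar content. Any real application would need to check whether a linear relationship is a reasonable description of the data.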

6.4.3 Limitations of the research

Throughout the presentation of this research we have noted limitations which resulted from the nature of Reddit as a platform and the methodologies employed. We saw that the practice of pseudonymity provided some interesting insights, such as why authors are more likely to avoid context collapse when participating in the porn community than in the

sports community, but it also hides some behaviours of interest which will never be observable using our chosen form of data alone. The use of digital trace data provides the opportunity for large scale insights into social behaviour that would otherwise be impossible (Latzko-Toth et al. 2016; Freelon 2014). In our own example, we would never have been able to understand the extent of connections between subreddits at this scale if we relied on self-reporting by Redditors or on observing their behaviours in real time. However, this form of data does miss out on sources of information which would only be observable by other methods of data collection such as interviews. Previous qualitative research into context collapse has shown that Christian LGBT teenagers choose to practice multiple profile maintenance to separate the two parts of their Reddit experience which they find to be conflicting (Triggs et al. 2019). Our methodology misses out on authors who connect subreddits but choose to do so privately and may therefore under-report the actual level of similarity between some subreddits. Another possible limitation of the research was the decision to sample only the most active subreddits and authors for our analysis. We selected the top 1000 subreddits because this provided a realistic range of subreddits to manually label, whilst still including the majority of activity on Reddit. Similarly, we limited our analysis to only the 25% of most active authors per subreddit as they were considered to be the most important contributors of the given communities. Adjusting these thresholds may alter the findings of the research. Additionally, due to the high level of activity on the platform, our analysis was limited to one monthly snapshot - January 2019. The same

analysis conducted on a different time frame may provide different results. In recognition of these limitations the public code repository provided for this research has been designed to allow for the analysis of different monthly datasets and numbers of subreddits. As we have shown, the manual labelling of the high level topic of each subreddit was a key strength of the research and enabled validation of the topical public sphericules. However inaccuracies may exist in the set of subreddit labels. In Chapter 3 I outlined the process of re-labelling the subreddits MGTOW and MensRights from ‘political’ to ‘discussion’. This reflected my personal bias associating these subreddits with The Donald and the alt right (Nagle 2017). It also showed that I honed the topic categories through the process of annotating, to only those which clearly discussed politics or political ideology. There may be other inaccuracies in how non-political subreddits were labelled. However, I posit that any errors would not have significantly affected the final analysis, as these subreddit labels were only used to show the breakdown of topics between communities, which would be insensitive to minor errors. The list of subreddit topic labels is included in the code repository and is available for scrutiny. The final limitation to this research is that it is limited to the exploration of a single online platform. As previous research has shown, to truly understand whether individuals exist within echo chambers we must understand their complete media diet, both online and offline (Dubois and Blank 2018a; Vaccari 2018). Even if users of the far left subreddits are in fact part of a meta echo chamber on Reddit, we cannot know if these individuals are participating in more diverse spaces as well. As such, this research has not set out to make generalisable statements

about the existence of echo chambers on social media in general. It has, however, attempted to understand the nature of echo chamber behaviours, among both political and non-political communities, within one of the most popular social media platforms in the world. In the final concluding chapter we will reflect on the implications of this work, and how it has paved the way for new avenues of future research.

Chapter 7

Conclusion

In the final chapter of this thesis we will provide some concluding remarks on the original research presented here. First we will highlight the key findings of the research, some of which addressed the guiding research questions and some of which were unexpected discoveries. Then we will summarise the key contributions of this research to our wider understanding of social media echo chambers. From there we will identify some of the most important avenues for future research which would naturally follow from the insights provided here and the open questions that remain. And finally, we will end with a summary overview of what this work has accomplished.

7.1 Summary of key findings

1. The Donald is not the echo chamber we would have expected it to be given its insistence on banning outside perspectives. The subreddit is closely connected to neighbouring subreddits covering a variety of topics, and it tends to share these neighbours with most other political subreddits.

2. The closest example we find of a political meta echo chamber is among left wing subreddits. The left wing subreddits are closely internally connected and even the most centrist, socialism, has a notable distance from The Donald. However, these subreddits also share many close neighbours with non-left political subreddits, so they cannot be considered isolated.

3. Inter-subreddit echo chamberness, as measured by the absence of cross-cutting ties, is likely a two-way street. The Donald and the left wing are disconnected because neither side of the political divide welcomes discourse with opposing perspectives. By contrast, ChapoTrapHouse and neoliberal show that, where cross-cutting discussions are embraced, the tensions between political poles can potentially be bridged.

4. We found evidence that changemyview is most likely not the anti-echo chamber it claims to be. Its most active authors also tend to be active participants in contrarian subreddits commonly associated with the alt-right. In the absence of similarly close connections with complementary left wing subreddits, this suggests that the most active members of changemyview show a right wing bias and may therefore not represent the diversity of perspectives expected of an anti-echo chamber.

7.2 Key contributions of the research

Beyond the specific findings about the nature of political echo chamber (and anti-echo chamber) dynamics on Reddit, the research also makes wider contributions to our understanding of the character of social media echo chambers and how to measure them. At a high level it provided a detailed examination of the entire active population of Reddit, an incredibly popular social media website which has received limited academic attention, specifically with regard to echo chambers.

In particular, our examination of the sports and porn public sphericules as communities isolated from the wider subreddit network highlighted the need for wider consideration of non-political echo chambers. This is supported by returning to the original concerns in the literature which led to echo chamber theory: the fear of increasing group polarisation, fragmentation, and a loss of the ability of disparate groups to find common ground. In identifying non-political public sphericules which are strikingly isolated from the wider network we have shown that any number of subjects, from sports fanaticism to porn fetishes, have the potential to become extremist over time. As such, the concern that social media may foster negative group dynamics should not be limited to socio-political issues.

This discovery of possible examples of non-political echo chambers was only achievable because of the computational approach of this work. As a burgeoning field, computational social science is still navigating the space between social science theory and computational methods. This research has shown the advantages of combining a solid theoretical grounding with a large scale analytical methodology. Practically, the use of digital trace data and programmatic analysis enabled us to contextualise the behaviours of individual subreddits within the norms of an entire platform. But a close understanding of the social mechanisms of Reddit and social media more generally also allowed us to gain a more nuanced understanding of the limitations of the computational approach. For example, in exploring the practice of pseudonymity on Reddit and the larger concern of context collapse online we have shown that Reddit serves as a highly informative site of study for common online phenomena which are largely overlooked by research on more mainstream platforms such as Facebook and Twitter. This also exposed the limitations that necessarily result from the use of digital trace data. These limitations can only be mitigated through the use of mixed methods in future research to complement and build upon the insights of computational work.

Finally, the research presented here has contributed to the growing open science movement. Throughout the course of this research I have carefully documented the steps taken to arrive at the findings presented. I have developed the accompanying codebase with the intention, and hope, that others will be able to make use of it going forward. My goal in doing this has been to ensure that this analysis is reproducible and, as such, open to scrutiny, and that it can be extended in the future to build on the insights provided here.

7.3 Future research

Future research should give greater attention to the issue of polarisation within echo chambers. As group polarisation and increasing societal fragmentation are the putative motivations for addressing social media echo chambers, more focus should be given to these specific processes. Identifying echo chambers alone will not solve a societal problem. Indeed, echo chambers in themselves may not be a problem.

To understand to what extent online echo chambers actually relate to offline concerns for individuals, work from the field of social psychology is required. This would provide insights into whether echo chamber participation and polarisation have tangible negative effects on individuals’ psychology (e.g. decreasing willingness to engage with conflicting opinion) or lead to worsening effects on others (e.g. increasing use of hate speech or offline violence).

To develop insights on echo chamber polarisation a longitudinal approach is required. The analysis presented here could be extended across a time frame of many years. This would allow for an understanding of, for example, whether The Donald exhibits greater chamberness at different time points and, as such, became more (or less) polarised over time. The same could be performed for changemyview to determine whether it has been more or less of an anti-echo chamber at different points in its history.

Our analysis could also be extended to include a greater range of subreddits and authors. The work presented here has been limited by the processing capacity of a single researcher and the desire to balance the breadth of a large range of subreddits with the depth of familiarity with each one. A researcher or research group with greater processing ability and resources could extend this methodology to map the complete network of Reddit and determine with authority whether echo chambers are observed on the platform across subreddits of all sizes.

A key tension in this research has been the line between positive homophily and negative echoing. We recommend that future research give more consideration to the continuum between these two states to gain a better understanding of the point at which echo chambers begin. This will require advancement in both the theory of the mechanisms behind echo chambers and the tools used to discern them. This would in turn provide invaluable insights into possible interventions to divert echo chambers before they develop, where that is considered a desirable course of action.

Finally, we suggest that future research should explore the issue of echo chambers of non-political content. The work presented here has raised concerns about the potential for isolated topical communities to develop increasingly anti-social behaviours over time, which have social and ethical rather than political ramifications. This reiterates the need to focus on the true concerns of echo chambers: their potential to polarise groups and erode common ground no matter the subject matter.

7.4 Concluding remarks

The research set out in this thesis has sought to determine whether echo chambers exist on Reddit, the self-purported ‘front page of the internet’. We have seen that Reddit provides a compelling site of study. It is at once one of the most popular websites in the Western world and also relatively unknown in comparison to more mainstream platforms such as Facebook and Twitter. Additionally, the organisation of the platform into topic-based subreddits makes it the ideal space in which to study echo chamber dynamics.

We found that The Donald, the subreddit of Donald Trump supporters, was not the echo chamber we anticipated it to be. On the contrary, we found the greatest signs of political echoing on Reddit among the many smaller far left wing subreddits. The idealised anti-echo chamber changemyview was also shown to host a less diverse group of authors than was expected. Rather, the subreddit appeared to attract a like-minded set of contrarians more than it attracted a wide range of diverse perspectives.

We saw that topical public sphericules do exist among the network of subreddits and that some of these present the risk of worrying polarisation despite their non-political orientation. Through the examples of sports fanaticism and the ethical quandaries raised by niche online pornography this work has shown the need for a wider consideration of the problem of social media echo chambers.

References

Adamic, Lada and Natalie Glance (Aug. 21, 2005). “The Political Blogosphere and the 2004 U.S. Election: Divided They Blog”. In: Proceedings of the 3rd International Workshop on Link Discovery. LinkKDD ’05. Chicago, Illinois: Association for Computing Machinery, pp. 36–43. URL: https://doi.org/10.1145/1134271.1134277 (visited on 08/01/2020).

Adamic, Lada and Bernardo A. Huberman (2002). “Zipf’s Law and the Internet.” In: Glottometrics 3.1, pp. 143–150.

Aragón, Pablo et al. (2013). “Communication Dynamics in Twitter during Political Campaigns: The Case of the 2011 Spanish National Election”. In: Policy & Internet 5.2, pp. 183–206. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/1944-2866.POI327 (visited on 11/27/2020).

Aswath, S, Deekshith Godavarthi, and Bhaskarjyoti Das (Feb. 2020). “Analysing Conflicts in Online Football Communities of Reddit”. In: 2020 International Conference on Emerging Trends in Information Technology and Engineering (Ic-ETITE), pp. 1–6.

Attwood, Feona (2010). Porn.Com: Making Sense of Online Pornography. New York, NY: Peter Lang.

Bail, Christopher A. et al. (Aug. 28, 2018). “Exposure to Opposing Views on Social Media Can Increase Political Polarization”. In: Proceedings of the National Academy of Sciences. URL: http://www.pnas.org/content/early/2018/08/27/1804840115 (visited on 09/17/2018).

Barabási, Albert-László (2016). Network Science. Cambridge: University Press.

Barnidge, Matthew (Apr. 3, 2017). “Exposure to Political Disagreement in Social Media Versus Face-to-Face and Anonymous Online Settings”. In: Political Communication 34.2, pp. 302–321. URL: https://doi.org/10.1080/10584609.2016.1235639 (visited on 05/13/2020).

Barthel, Michael (2020). How the 2016 Presidential Campaign Is Being Discussed on Reddit. URL: https://www.pewresearch.org/fact-tank/2016/05/26/how-the-2016-presidential-campaign-is-being-discussed-on-reddit/ (visited on 06/19/2020).

Bastos, Marco, Dan Mercea, and Andrea Baronchelli (Nov. 2, 2018). “The Geographic Embedding of Online Echo Chambers: Evidence from the Brexit Campaign”. In: PLOS ONE 13.11, e0206841. URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0206841 (visited on 05/11/2020).

Basu, Tanya (Feb. 7, 2020). “The ‘Manosphere’ Is Getting More Toxic as Angry Men Join the Incels”. In: MIT Technology Review. URL: https://www.technologyreview.com/s/615155/the-manosphere-is-getting-more-toxic-as-angry-men-join-the-incels/ (visited on 02/07/2020).

Baumgartner, Jason (June 9, 2020a). Pushshift API. URL: https://github.com/pushshift/api (visited on 06/10/2020).

Baumgartner, Jason (2020b). R/Datasets - My Response to the Paper Highlighting Issues with Data Incompleteness Concerning My Reddit Corpus. URL: https://www.reddit.com/r/datasets/comments/884vkh/my_response_to_the_paper_highlighting_issues_with/ (visited on 06/10/2020).

Baumgartner, Jason et al. (Jan. 23, 2020). The Pushshift Reddit Dataset. URL: http://arxiv.org/abs/2001.08435 (visited on 05/15/2020).

BBC News (Nov. 29, 2018). “Airline ‘sorry’ for Mocking Girl’s Name”. In: BBC News. US & Canada. URL: https://www.bbc.com/news/world-us-canada-46393501 (visited on 07/02/2020).

El-Bermawy, Mostafa M. (Nov. 18, 2016). “Your Filter Bubble Is Destroying Democracy”. In: Wired. URL: https://www.wired.com/2016/11/filter-bubble-destroying-democracy/ (visited on 07/05/2020).

BigQuery (2020). BigQuery Reddit Comments 2019 01. URL: https://console.cloud.google.com/bigquery?project=fh-bigquery&p=fh-bigquery&d=reddit_comments&t=2019_01&page=table (visited on 06/10/2020).

BigQuery (2020). BigQuery: Cloud Data Warehouse. URL: https://cloud.google.com/bigquery (visited on 06/10/2020).

Bird, Steven, Ewan Klein, and Edward Loper (June 12, 2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc. 506 pp.

Blondel, Vincent D. et al. (Oct. 9, 2008). “Fast Unfolding of Communities in Large Networks”. In: Journal of Statistical Mechanics: Theory and Experiment 2008.10, P10008. URL: http://arxiv.org/abs/0803.0476 (visited on 03/21/2019).

Borgatti, Stephen P., Martin G. Everett, and Jeffrey C. Johnson (2018). Analyzing Social Networks. Sage.

Borgatti, Stephen P., Ajay Mehra, et al. (Feb. 13, 2009). “Network Analysis in the Social Sciences”. In: Science 323.5916, pp. 892–895. URL: https://science.sciencemag.org/content/323/5916/892 (visited on 06/24/2020).

Borgesius, Frederik J. Zuiderveen et al. (Mar. 31, 2016). “Should We Worry about Filter Bubbles?” In: Internet Policy Review. URL: https://policyreview.info/articles/analysis/should-we-worry-about-filter-bubbles (visited on 05/13/2020).

Boulianne, Shelley, Karolina Koc-Michalska, and Bruce Bimber (Apr. 1, 2020). “Right-Wing Populism, Social Media and Echo Chambers in Western Democracies”. In: New Media & Society 22.4, pp. 683–699. URL: https://doi.org/10.1177/1461444819893983 (visited on 05/13/2020).

Bowles, Nellie (Feb. 29, 2020). “The Pied Pipers of the Dirtbag Left Want to Lead Everyone to Bernie Sanders”. In: The New York Times. U.S. URL: https://www.nytimes.com/2020/02/29/us/politics/bernie-sanders-chapo-trap-house.html (visited on 07/03/2020).

Boyd, Danah M. and Nicole B. Ellison (Oct. 1, 2007). “Social Network Sites: Definition, History, and Scholarship”. In: Journal of Computer-Mediated Communication 13.1, pp. 210–230. URL: http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2007.00393.x/abstract (visited on 09/26/2016).

Breiger, Ronald L. (1974). “The Duality of Persons and Groups”. In: Social Forces 53.2, pp. 181–190.

Brundidge, Jennifer and Ronald E. Rice (2009). “Political Engagement Online: Do the Information Rich Get Richer and the Like-Minded More Similar?” In: Routledge Handbook of Internet Politics, p. 144.

Bruns, Axel (2019a). Are Filter Bubbles Real? John Wiley & Sons.

Bruns, Axel (Nov. 29, 2019b). “Filter Bubble”. In: Internet Policy Review 8.4. URL: https://policyreview.info/concepts/filter-bubble (visited on 02/14/2020).

Bruns, Axel (2019c). It’s Not the Technology, Stupid: How the ‘Echo Chamber’ and ‘Filter Bubble’ Metaphors Have Failed Us. URL: http://snurb.info/node/2526 (visited on 02/14/2020).

Bruns, Axel et al. (Oct. 2017). “The Australian Twittersphere in 2016: Mapping the Follower/Followee Network”. In: Social Media + Society 3.4, p. 205630511774816. URL: http://journals.sagepub.com/doi/10.1177/2056305117748162 (visited on 07/26/2018).

Burgess, Matt (Dec. 1, 2016). “Reddit CEO Apologises for Editing Critical Posts about Himself”. In: Wired UK. URL: https://www.wired.co.uk/article/reddit-edit-post (visited on 05/11/2020).

Burkell, Jacquelyn and Chandell Gosse (Dec. 2, 2019). “Nothing New Here: Emphasizing the Social and Cultural Context of Deepfakes”. In: First Monday. URL: https://journals.uic.edu/ojs/index.php/fm/article/view/10287 (visited on 07/03/2020).

Chapo Trap House (2020). URL: http://www.chapotraphouse.com/ (visited on 07/03/2020).

Chater, James (July 6, 2016). “What the EU Referendum Result Teaches Us about the Dangers of the Echo Chamber”. In: New Statesman. URL: https://www.newstatesman.com/2016/07/what-eu-referendum-result-teaches-us-about-dangers-echo-chamber (visited on 05/13/2020).

Clauset, Aaron, M. E. J. Newman, and Cristopher Moore (Dec. 6, 2004). “Finding Community Structure in Very Large Networks”. In: Physical Review E 70.6, p. 066111. URL: http://arxiv.org/abs/cond-mat/0408187 (visited on 05/03/2020).

Cloud Storage (2020). URL: https://cloud.google.com/storage (visited on 07/27/2020).

Colleoni, Elanor, Alessandro Rozza, and Adam Arvidsson (2014). “Echo Chamber or Public Sphere? Predicting Political Orientation and Measuring Political Homophily in Twitter Using Big Data”. In: Journal of Communication 64, pp. 317–332.

Conover, Michael D. et al. (2011). “Predicting the Political Alignment of Twitter Users”. In: Proceedings of the 2011 International Conference on Privacy, Security, Risk and Trust. IEEE Third International Conference on Social Computing. Boston, MA: Citeseer, pp. 192–199.

Del Vicario, Michela et al. (July 1, 2017). “Mapping Social Dynamics on Facebook: The Brexit Debate”. In: Social Networks 50, pp. 6–16. URL: http://www.sciencedirect.com/science/article/pii/S0378873316304166 (visited on 05/11/2020).

Deltas, George (2003). “The Small-Sample Bias of the Gini Coefficient: Results and Implications for Empirical Research”. In: The Review of Economics and Statistics 85.1, pp. 226–234.

Dubois, Elizabeth and Grant Blank (May 4, 2018a). “The Echo Chamber Is Overstated: The Moderating Effect of Political Interest and Diverse Media”. In: Information, Communication & Society 21.5, pp. 729–745. URL: https://www.tandfonline.com/doi/full/10.1080/1369118X.2018.1428656 (visited on 07/26/2018).

Dubois, Elizabeth and Grant Blank (Mar. 8, 2018b). “The Myth of the Echo Chamber”. In: The Conversation. URL: http://theconversation.com/the-myth-of-the-echo-chamber-92544 (visited on 02/13/2019).

Eady, Gregory et al. (Jan. 1, 2019). “How Many People Live in Political Bubbles on Social Media? Evidence From Linked Survey and Twitter Data”. In: SAGE Open 9.1, p. 2158244019832705. URL: https://doi.org/10.1177/2158244019832705 (visited on 05/11/2020).

Europe Media Monitor (Feb. 11, 2013). URL: https://ec.europa.eu/jrc/en/scientific-tool/europe-media-monitor-newsbrief (visited on 05/11/2020).

Everett, Martin and Stephen Borgatti (May 2013). “The Dual-Projection Approach for Two-Mode Networks”. In: Social Networks 35.2, pp. 204–210. URL: https://linkinghub.elsevier.com/retrieve/pii/S0378873312000354 (visited on 06/05/2020).

Fletcher, Richard and Rasmus Kleis Nielsen (July 1, 2018). “Are People Incidentally Exposed to News on Social Media? A Comparative Analysis”. In: New Media & Society 20.7, pp. 2450–2468. URL: https://doi.org/10.1177/1461444817724170 (visited on 05/13/2020).

Fortunato, Santo and Darko Hric (Nov. 2016). “Community Detection in Networks: A User Guide”. In: Physics Reports 659, pp. 1–44. URL: http://arxiv.org/abs/1608.00163 (visited on 11/27/2020).

Freelon, Deen (Jan. 2, 2014). “On the Interpretation of Digital Trace Data in Communication and Social Computing Research”. In: Journal of Broadcasting & Electronic Media 58.1, pp. 59–75. URL: https://doi.org/10.1080/08838151.2013.875018 (visited on 07/05/2020).

Gaffney, Devin and J. Nathan Matias (July 6, 2018). “Caveat Emptor, Computational Social Science: Large-Scale Missing Data in a Widely-Published Reddit Corpus”. In: PLOS ONE 13.7.

Garcia, David et al. (2015). Ideological and Temporal Components of Network Polarization in Online Political Participatory Media.

Garrett, R. Kelly (2009). “Echo Chambers Online?: Politically Motivated Selective Exposure among Internet News Users”. In: Journal of Computer-Mediated Communication 14, pp. 265–285.

Gilbert, Eric (2013). “Widespread Underprovision on Reddit”. In: Proceedings of the 2013 Conference on Computer Supported Cooperative Work, pp. 803–808.

Goffman, Erving (1978). The Presentation of Self in Everyday Life. Harmondsworth London.

Groshek, Jacob and Karolina Koc-Michalska (Sept. 2, 2017). “Helping Populism Win? Social Media Use, Filter Bubbles, and Support for Populist Presidential Candidates in the 2016 US Election Campaign”. In: Information, Communication & Society 20.9, pp. 1389–1407. URL: https://doi.org/10.1080/1369118X.2017.1329334 (visited on 05/11/2020).

Guerra, Pedro Henrique Calais et al. (2013). “A Measure of Polarization on Social Media Networks Based on Community Boundaries.” In: ICWSM.

Guess, Andrew et al. (2018). “Avoiding the Echo Chamber about Echo Chambers: Why Selective Exposure to like-Minded Political News Is Less Prevalent than You Think”. In: Knight Foundation White Paper.

Habermas, Jürgen (1991). The Structural Transformation of the Public Sphere: An Inquiry Into a Category of Bourgeois Society. MIT Press. 330 pp.

Habermas, Jürgen (Nov. 1, 2006). “Political Communication in Media Society: Does Democracy Still Enjoy an Epistemic Dimension? The Impact of Normative Theory on Empirical Research”. In: Communication Theory 16.4, pp. 411–426. URL: https://academic.oup.com/ct/article/16/4/411/4098648 (visited on 09/20/2018).

Highfield, Tim and Axel Bruns (2015). “Is Habermas on Twitter? Social Media and the Public Sphere”. In: The Routledge Companion to Social Media and Politics. Routledge, pp. 78–95.

Himelboim, Itai, Stephen McCreery, and Marc Smith (2013). “Birds of a Feather Tweet Together: Integrating Network and Content Analyses to Examine Cross-Ideology Exposure on Twitter”. In: Journal of Computer-Mediated Communication 18, pp. 154–174.

Hogan, Bernie (Dec. 1, 2010). “The Presentation of Self in the Age of Social Media: Distinguishing Performances and Exhibitions Online”. In: Bulletin of Science, Technology & Society 30.6, pp. 377–386. URL: http://bst.sagepub.com/cgi/doi/10.1177/0270467610385893 (visited on 12/27/2015).

Hogan, Bernie (2013). “Pseudonyms and the Rise of the Real-Name Web”. In: A Companion to New Media Dynamics. Wiley Online Library, pp. 290–307.

Hooton, Christopher (Nov. 10, 2016). “Social Media Echo Chambers Gifted Donald Trump the Presidency”. In: The Independent. Voices. URL: http://www.independent.co.uk/voices/donald-trump-president-social-media-echo-chamber-hypernormalisation-adam-curtis-protests-blame-a7409481.html (visited on 04/17/2020).

Jamonnak, S. et al. (Dec. 2015). “Recommenddit: A Recommendation Service for Reddit Communities”. In: 2015 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 374–379.

Johnson, Steven L., Samer Faraj, and Srinivas Kudaravalli (2014). “Emergence of Power Laws in Online Communities: The Role of Social Mechanisms and Preferential Attachment”. In: MIS Quarterly 38.3, 795–A13.

Kiberd, Roisin (Dec. 5, 2016). “‘Spezgiving’: How Reddit’s CEO Tried And Failed to Troll the Trolls”. In: Vice. URL: https://www.vice.com/en_us/article/ezpkpk/spezgiving-how-reddits-ceo-triedand-failedto-troll-the-trolls (visited on 05/11/2020).

Latzko-Toth, Guillaume, Claudine Bonneau, and Mélanie Millette (2016). “Small Data, Thick Data: Thickening Strategies for Trace-Based Social Media Research”. In:

Leavitt, Alex (2015). “‘This Is a Throwaway Account’: Temporary Technical Identities and Perceptions of Anonymity in a Massive Online Community”. In: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 317–327.

Leavitt, Alex, Joshua A Clark, and Dennis Wixon (2016). “Uses of Multiple Characters in Online Games and Their Implications for Social Network Methods”. In: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing. CSCW ’16. ACM Press, pp. 647–661. URL: http://dl.acm.org/citation.cfm?doid=2818048.2819980 (visited on 07/26/2018).

Malone, Kenny (June 29, 2017). “Change My View On Reddit Helps People Challenge Their Own Opinions”. In: NPR. URL: https://www.npr.org/2017/06/29/534916052/change-my-view-on-reddit-helps-people-challenge-their-own-opinions (visited on 07/02/2020).

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.

Marantz, Andrew (Mar. 19, 2018). “Reddit and the Struggle to Detoxify the Internet”. In: The New Yorker. URL: https://www.newyorker.com/magazine/2018/03/19/reddit-and-the-struggle-to-detoxify-the-internet (visited on 02/14/2020).

Marwick, Alice E. (Sept. 1, 2017). “Scandal or Sex Crime? Gendered Privacy and the Celebrity Nude Photo Leaks”. In: Ethics and Information Technology 19.3, pp. 177–191. URL: https://doi.org/10.1007/s10676-017-9431-7 (visited on 07/03/2020).

Marwick, Alice E. and danah boyd (July 7, 2010). “I Tweet Honestly, I Tweet Passionately: Twitter Users, Context Collapse, and the Imagined Audience”. In: New Media & Society. URL: https://journals.sagepub.com/doi/10.1177/1461444810365313 (visited on 07/01/2020).

Marwick, Alice E. and Robyn Caplan (July 4, 2018). “Drinking Male Tears: Language, the Manosphere, and Networked Harassment”. In: Feminist Media Studies 18.4, pp. 543–559. URL: https://doi.org/10.1080/14680777.2018.1450568 (visited on 12/06/2019).

Massanari, Adrienne (Oct. 9, 2015a). “#Gamergate and The Fappening: How Reddit’s Algorithm, Governance, and Culture Support Toxic Technocultures”. In: New Media & Society. URL: http://nms.sagepub.com/content/early/2015/10/07/1461444815608807 (visited on 06/08/2016).

Massanari, Adrienne (2015b). Participatory Culture, Community, and Play. Learning from Reddit. Peter Lang.

Matias, Nathan (2016). “The Civic Labor of Online Moderators”. In: Internet Politics and Policy. Oxford, England, p. 10.

McPherson, Miller, Lynn Smith-Lovin, and James M Cook (Aug. 2001). “Birds of a Feather: Homophily in Social Networks”. In: Annual Review of Sociology 27.1, pp. 415–444. URL: http://www.annualreviews.org/doi/10.1146/annurev.soc.27.1.415 (visited on 07/26/2018).

Moore, Suzanne (July 23, 2019). “How Social Media Echo Chambers Fuelled the Rise of Boris Johnson | Suzanne Moore”. In: The Guardian. Opinion. URL: https://www.theguardian.com/commentisfree/2019/jul/23/how-social-media-echo-chambers-fuelled-the-rise-of-boris-johnson (visited on 05/13/2020).

Munafò, Marcus and Neil Jacobs (Jan. 5, 2019). “Reproducibility of Research Is Critical for Open Science and Open Britain”. In: Times Higher Education. URL: https://www.timeshighereducation.com/blog/reproducibility-research-critical-open-science-and-open-britain (visited on 07/05/2020).

Nagle, Angela (June 7, 2017). Kill All Normies: Online Culture Wars From 4chan And Tumblr To Trump And The Alt-Right. John Hunt Publishing. 125 pp.

Newman, M. E. J. (June 6, 2006). “Modularity and Community Structure in Networks”. In: Proceedings of the National Academy of Sciences 103.23, pp. 8577–8582. URL: http://www.pnas.org/cgi/doi/10.1073/pnas.0601602103 (visited on 12/03/2020).

Newman, M. E. J. (2010). Networks: An Introduction. Oxford: University Press. 772 pp.

Newton, Casey (June 29, 2020). “Reddit Bans r/The Donald and r/ChapoTrapHouse as Part of a Major Expansion of Its Rules”. In: The Verge. URL: https://www.theverge.com/2020/6/29/21304947/reddit-ban-subreddits-the-donald-chapo-trap-house-new-content-policy-rules (visited on 07/03/2020).

Nyhan, Brendan (Oct. 24, 2014). “Americans Don’t Live in Information Cocoons”. In: The New York Times. The Upshot. URL: https://www.nytimes.com/2014/10/25/upshot/americans-dont-live-in-information-cocoons.html (visited on 06/01/2020).

O’Hara, Kieron and David Stevens (2015). “Echo Chambers and Online Radicalism: Assessing the Internet’s Complicity in Violent Extremism”. In: Policy & Internet 7.4, pp. 401–422. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.88 (visited on 02/14/2020).

Olson, Randal (2020). Reddit Interest Network. URL: https://rhiever.github.io/redditviz/clustered/ (visited on 02/29/2020).

Olson, Randal and Zachary Neal (May 27, 2015). “Navigating the Massive World of Reddit: Using Backbone Networks to Map User Interests in Social Media”. In: PeerJ Computer Science 1, e4. URL: https://peerj.com/articles/cs-4/ (visited on 12/02/2015).

Open Science Framework (2020). URL: https://osf.io/ (visited on 07/05/2020).

Pariser, Eli (2011). The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. Penguin.

Perez, Sarah (Dec. 4, 2019). “Reddit’s Monthly Active User Base Grew 30% to Reach 430M in 2019”. In: TechCrunch. URL: https://techcrunch.com/2019/12/04/reddits-monthly-active-user-base-grew-30-to-reach-430m-in-2019/ (visited on 05/16/2020).

Pew Research Center (June 12, 2019). Who Uses YouTube, WhatsApp and Reddit. Pew Research Center. URL: https://www.pewresearch.org/internet/chart/who-uses-youtube-whatsapp-and-reddit/ (visited on 03/04/2020).

Press Association, The (Aug. 23, 2013). “Football Fans Face Prosecution for ‘Online Hooliganism’”. In: The Guardian. Football. URL: https://www.theguardian.com/football/2013/aug/23/football-fans-prosecution-online-hooliganism (visited on 05/20/2020).

Pushshift (2020). URL: https://pushshift.io/ (visited on 06/10/2020).

r/announcements (June 29, 2020). R/Announcements - Update to Our Content Policy. URL: https://www.reddit.com/r/announcements/comments/hi3oht/update_to_our_content_policy/ (visited on 06/30/2020).

R/Changemyview (2020). R/Changemyview. URL: https://www.reddit.com/r/changemyview/ (visited on 07/02/2020).

r/ModSupport (July 15, 2017). R/ModSupport - About That Autoban. URL: https://www.reddit.com/r/ModSupport/comments/6ni724/about_that_autoban/ (visited on 05/25/2020).

R/The Donald (2020). R/The Donald. URL: https://www.reddit.com/r/The_Donald (visited on 07/02/2020).

Reddit API Terms of Use (2016). URL: https://docs.google.com/forms/d/ 1ao gme8e xfZ41q4QymFqg5HD29HggOD8I9-MFTG7So/viewform? usp=embed facebook (visited on 01/05/2016).

Reddit Content Policy (2020). URL: https://www.redditinc.com/policies/content-policy (visited on 07/02/2020).

Reddit FAQ (2020). URL: https://www.reddit.com/wiki/faq (visited on 04/12/2020).

Reddit Help (July 13, 2020). Promoting Hate Based on Identity or Vulnerability. URL: https://www.reddithelp.com/en/categories/rules-reporting/account-and-community-restrictions/promoting-hate-based-identity-or (visited on 07/02/2020).

Reddit Inc (2020). URL: https://www.redditinc.com/ (visited on 01/24/2020).

Reddit (2019). Reddit: The Front Page of the Internet. URL: https://www.reddit.com/r/The Watch (visited on 07/03/2019).

Ribeiro, Manoel Horta et al. (Jan. 27, 2020). “Auditing Radicalization Pathways on YouTube”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. FAT* ’20. Barcelona, Spain: Association for Computing Machinery, pp. 131–141. URL: https://doi.org/10.1145/3351095.3372879 (visited on 06/24/2020).

Romano, Aja (Nov. 13, 2017). “Reddit Just Banned One of Its Most Toxic Forums. But It Won’t Touch The Donald.” In: Vox. URL: https://www.vox.com/culture/2017/11/13/16624688/reddit-bans-incels-the-donald-controversy (visited on 05/11/2020).

Romo, Vanessa (Sept. 6, 2017). “‘Ghost Skins’ And Masculinity: Alt-Right Terms, Defined”. In: NPR. URL: https://www.npr.org/2017/09/06/548858850/-ghost-skins-and-masculinity-alt-right-terms-defined (visited on 07/27/2020).

Roozenbeek, Jon and Adrià Salvador Palau (2017). “I Read It on Reddit: Exploring the Role of Online Communities in the 2016 US Elections News Cycle”. In: Social Informatics. Ed. by Giovanni Luca Ciampaglia, Afra Mashhadi, and Taha Yasseri. Lecture Notes in Computer Science. Cham: Springer International Publishing, pp. 192–220.

Sacks, Brianna (Oct. 25, 2017). “Reddit Is Removing Nazi And Alt-Right Groups As Part Of A New Policy And Some Users Are Confused”. In: BuzzFeed News. URL: https://www.buzzfeednews.com/article/briannasacks/reddit-is-banning-nazi-and-alt-right-groups-as-part-of-a (visited on 05/11/2020).

Scharkow, Michael (Jan. 2, 2016). “The Accuracy of Self-Reported Internet Use: A Validation Study Using Client Log Data”. In: Communication Methods and Measures 10.1, pp. 13–27. URL: http://www.tandfonline.com/doi/full/10.1080/19312458.2015.1118446 (visited on 06/30/2020).

Singer, Philipp et al. (2016). “Evidence of Online Performance Deterioration in User Sessions on Reddit”. In: PLOS ONE 11.8, e0161636.

Smith, Allegra (Jan. 13, 2015). “Porn Architecture: User Tagging and Filtering in Two Online Pornography Communities”. In: Communication Design Quarterly 3.1, pp. 17–23. URL: https://doi.org/10.1145/2721882.2721885 (visited on 05/20/2020).

Smith, Jesus and Aurolyn Luykx (Oct. 2, 2017). “Race Play in BDSM Porn: The Eroticization of Oppression”. In: Porn Studies 4.4, pp. 433–446. URL: https://doi.org/10.1080/23268743.2016.1252158 (visited on 07/03/2020).

Sparck Jones, Karen (Jan. 1, 1972). “A Statistical Interpretation of Term Specificity and Its Application in Retrieval”. In: Journal of Documentation 28.1, pp. 11–21. URL: https://doi.org/10.1108/eb026526 (visited on 07/13/2020).

Staff, G. Q. (July 2, 2019). “It Was Once A Beacon Of Authenticity, But Now Reddit Seems To Be Moving Away From Its Free Speech Roots”. In: GQ Australia. URL: https://www.gq.com.au/entertainment/tech/it-was-once-a-beacon-of-authenticity-but-now-reddit-seems-to-be-moving-away-from-its-free-speech-roots/news-story/484d02556bb342b68a631518a2abeafb (visited on 05/11/2020).

Statista (2020). Global Social Media Ranking 2020. URL: https://www.statista. com/statistics/272014/global-social-networks-ranked-by-number-of- users/ (visited on 05/15/2020). Steinberger, Ralf, Bruno Pouliquen, and Erik van der Goot (Aug. 24, 2009).

An Introduction to the Europe Media Monitor Family of Applications. URL: http://arxiv.org/abs/1309.5290 (visited on 07/13/2020). Stelter, Brian (Jan. 16, 2020). “How the Pro-Trump Echo Chamber Has

Changed Impeachment”. In: CNN. URL: https://www.cnn.com/2020/ 01/16/media/donald-trump-feedback-loop-reliable-sources/index. html (visited on 07/05/2020). Stroud, Natalie Jomini (2008). “Media Use and Political Predispositions: Revisiting the Concept of Selective Exposure”. In: Political Behavior 30.3, pp. 341–366. Suler, John (2004). “The Online Disinhibition Effect”. In: Cyberpsychology & Behavior 7.3, pp. 321–326.

278 Sunstein, Cass R. (2001). Republic.Com. Princeton University Press. — (2018). #Republic: Divided Democracy in the Age of Social Media. Princeton University Press. Tan, Chenhao et al. (2016). “Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-Faith Online Discussions”. In: Proceedings of the 25th International Conference on World Wide Web, pp. 613–624.

Top Sites in United Kingdom (2020). URL: https://www.alexa.com/topsites/ countries/GB (visited on 01/24/2020).

Top Sites in United States (2020). URL: https://www.alexa.com/topsites/ countries/US (visited on 01/24/2020). Triggs, Anthony Henry, Kristian Møller, and Christina Neumayer (Nov. 27, 2019). “Context Collapse and Anonymity among Queer Reddit Users”.

In: New Media & Society, p. 1461444819890353. URL: https://doi.org/10. 1177/1461444819890353 (visited on 12/04/2019). Tyson, Gareth et al. (Apr. 21, 2015). “Are People Really Social in Porn 2.0?” In: Ninth International AAAI Conference on Web and Social Media.

Ninth International AAAI Conference on Web and Social Media. URL: https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/ view/10511 (visited on 05/13/2020). u/Phill Hermouth (2020). R/Changemyview - CMV: Gender Dysphoria Is a Cureable Mental Illness, We’ve Stopped Looking for the Cure Because Society

Is Now Forced into Accepting Transgenders. URL: https://www.reddit. com/r/changemyview/comments/f2ol3f/cmv gender dysphoria is a cureable mental illness/ (visited on 07/02/2020). u/Pirateer (2020). R/Changemyview - CMV: If You Name Your Child Something like ”Abcde” (Pronounced ’Absidy’) and Get Upset at the Mispronunciation or Negative Attention It Brings, You Knew What You Were Doing and You

279 Wanted the Attention for Yourself. URL: https://www.reddit.com/r/ changemyview/comments/a2pwpy/cmv if you name your child something like abcde/ (visited on 07/02/2020). Uysal, Alper Kursat and Serkan Gunal (Jan. 2014). “The Impact of Prepro- cessing on Text Classification”. In: Information Processing & Management

50.1, pp. 104–112. URL: https://linkinghub.elsevier.com/retrieve/pii/ S0306457313000964 (visited on 06/08/2020). Vaccari, Cristian (Feb. 14, 2018). How Prevalent Are Filter Bubbles and Echo Chambers on Social Media? Not as Much as Conventional Wisdom Has

It. URL: https://blog.lboro.ac.uk/crcc/uncategorised/prevalent- filter-bubbles-echo-chambers-social-media-not-much-conventional- wisdom/ (visited on 05/13/2020). Vaccari, Cristian et al. (Sept. 2016). “Of Echo Chambers and Contrarian Clubs: Exposure to Political Disagreement Among German and Italian

Users of Twitter”. In: Social Media + Society 2.3, p. 205630511666422. URL: http://journals.sagepub.com/doi/10.1177/2056305116664221 (visited on 07/26/2018). Van der Nagel, Emily (2018). “Alts and Automediality: Compartmentalising the Self through Multiple Social Media Profiles”. In: M/C Journal 21.2. Van der Nagel, Emily and Jordan Frith (2015). “Anonymity, Pseudonymity, and the Agency of Online Identity: Examining the Social Practices of

r/Gonewild”. In: First Monday 20.3. URL: http://firstmonday.org/ojs/ index.php/fm/article/view/5615 (visited on 05/20/2016). Vicario, Michela Del et al. (Jan. 19, 2016). “The Spreading of Misinformation Online”. In: Proceedings of the National Academy of Sciences 113.3, pp. 554–

559. URL: https://www.pnas.org/content/113/3/554 (visited on 05/11/2020).

280 Weiss, Bari and Damon Winter (May 8, 2018). “Opinion — Meet the Rene- gades of the Intellectual Dark Web”. In: The New York Times. Opinion.

URL: https://www.nytimes.com/2018/05/08/opinion/intellectual- dark-web.html (visited on 06/24/2020). Wendling, Mike (June 13, 2019). “The 23-Year-Old Fixing the Broken World

of Online Argument”. In: BBC News. BBC Trending. URL: https://www. .com/news/blogs-trending-48579597 (visited on 07/02/2020). Williams, Hywel T.P. et al. (May 2015). “Network Analysis Reveals Open Forums and Echo Chambers in Social Media Discussions of Climate

Change”. In: Global Environmental Change 32, pp. 126–138. URL: http: //linkinghub.elsevier.com/retrieve/pii/S0959378015000369 (visited on 07/26/2018). Zollo, Fabiana et al. (July 24, 2017). “Debunking in a World of Tribes”. In:

PLoS ONE 12.7. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC5524392/ (visited on 05/11/2020).

Appendix A

Appendix

A.1 Code repository

The code used to perform this analysis is available online in a public GitHub repository at https://github.com/ellamguest/reddit-echo-chambers. As discussed in Chapter 3, data was collected from BigQuery using Structured Query Language (SQL). I used the google-cloud-bigquery Python package to run SQL queries on the BigQuery database via Python functions, and the google-cloud-storage Python package to store the results using Google storage services. I then downloaded the data from cloud storage as needed to conduct the analysis locally.

The repository contains a README which describes the content of each of the subdirectories and files. To re-run the analysis one would need to create one's own files with login details in the credentials folder, as described in the README. The code has been developed so that the number of subreddits or the monthly dataset can be changed. The repository also contains copies of the list of 1000 subreddits and their topic labels, as well as the descriptions of the political subreddits.
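The query-then-store workflow described in this section can be sketched roughly as follows. This is a minimal illustration and not the code in the repository: the table name, bucket name, blob name, the column names (author, subreddit) and all function names are assumptions made for the example. The client libraries are imported lazily so that the helper functions can be read and tested without cloud credentials.

```python
import csv
import io


def make_clients():
    """Create BigQuery and Cloud Storage clients.

    Requires google-cloud-bigquery and google-cloud-storage to be installed
    and credentials to be configured in the environment.
    """
    from google.cloud import bigquery, storage

    return bigquery.Client(), storage.Client()


def author_subreddit_counts_sql(table):
    """Build a query counting comments per author-subreddit pair.

    The table and its column names are assumptions for illustration.
    """
    return (
        "SELECT author, subreddit, COUNT(*) AS num_comments "
        f"FROM `{table}` "
        "GROUP BY author, subreddit"
    )


def run_query(bq_client, sql):
    """Run a SQL query and return the result rows as plain dictionaries."""
    return [dict(row.items()) for row in bq_client.query(sql).result()]


def rows_to_csv(rows):
    """Serialise a list of dictionaries to CSV text (header taken from the first row)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def upload_csv(storage_client, bucket_name, blob_name, rows):
    """Store query results as a CSV object in a Cloud Storage bucket."""
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(rows_to_csv(rows), content_type="text/csv")


def main():
    # Hypothetical table, bucket and object names; replace with your own.
    bq_client, storage_client = make_clients()
    sql = author_subreddit_counts_sql("my-project.my_dataset.comments_2019_01")
    rows = run_query(bq_client, sql)
    upload_csv(storage_client, "my-bucket", "2019_01/author_subreddit.csv", rows)
```

Calling main() would run the whole pipeline for one monthly dataset; changing the table name is how a different month could be selected in this sketch.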

A.2 Subreddit topic distributions

Table A.1 shows all topic labels originally applied to the subreddits. For efficiency, topics are grouped by the number of subreddits to which they were applied (Num subreddits). For example, two labels were each applied to 11 subreddits: finance and films.

A.3 Political subreddit results

Chapter 4 presented the percentile ranks of the within- and between-subreddit author values for the case study subreddits, to show the relative differences and enable interpretation of the results. Table A.2 shows the raw values for the 21 political subreddits and changemyview for reference.
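To illustrate how raw values relate to percentile ranks, the sketch below ranks five of the Gini values from Table A.2 against each other. It assumes one common definition of percentile rank (the percentage of values less than or equal to the given value), which may differ from the definition used in Chapter 4, and it ranks within this five-value sample only, so the output is illustrative and does not reproduce the thesis results. The function name is hypothetical.

```python
from bisect import bisect_right


def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    ordered = sorted(values)
    return 100.0 * bisect_right(ordered, x) / len(ordered)


# Gini values for five subreddits, copied from Table A.2.
gini = {
    "esist": 0.377484,
    "socialism": 0.472018,
    "changemyview": 0.566339,
    "politics": 0.59948,
    "neoliberal": 0.657753,
}

# Rank each subreddit against this small sample only.
ranks = {name: percentile_rank(gini.values(), value) for name, value in gini.items()}
```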

A.4 Inter-community edge weights

Table A.3 shows the number of edges shared between communities. For each pair of communities x and y, the weight is the number of edges in the subreddit network that connect a subreddit in x to a subreddit in y. These counts were used as edge weights to create the network graph in Figure 5.7 in Chapter 5.
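The counting rule for inter-community edge weights can be sketched as below. The subreddit names, community assignments and function name are toy placeholders, not data from the thesis, and edges are assumed to be undirected, so each community pair is stored in sorted order.

```python
from collections import Counter


def community_edge_weights(edges, community_of):
    """Count subreddit-network edges that run between two different communities.

    edges: iterable of (subreddit_a, subreddit_b) pairs, treated as undirected.
    community_of: mapping from subreddit name to community label.
    """
    weights = Counter()
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        if ca != cb:  # edges inside one community are not inter-community edges
            weights[tuple(sorted((ca, cb)))] += 1
    return weights


# Toy network: four subreddits in three communities.
community_of = {"s1": "C1", "s2": "C1", "s3": "C2", "s4": "C3"}
edges = [("s1", "s2"), ("s1", "s3"), ("s2", "s3"), ("s2", "s4"), ("s3", "s4")]
weights = community_edge_weights(edges, community_of)
```

Each key of the result corresponds to one source-target row of a table like Table A.3, with the count as its weight.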

A.5 Community subreddit topic counts

Table A.4 lists the number of subreddits for each topic t in each community c. These counts were used to create the topic-by-community heatmaps in Figure 5.5 in Chapter 5.
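A topic-by-community count of this kind can be produced with a nested counter, as in the sketch below. The subreddit names, labels and community numbers are toy placeholders and the function name is hypothetical. The sketch also assumes that subreddits without a community assignment are simply left out of the counts.

```python
from collections import Counter, defaultdict


def topic_counts_by_community(topic_of, community_of):
    """Return counts[topic][community] = number of subreddits with that topic."""
    counts = defaultdict(Counter)
    for subreddit, topic in topic_of.items():
        community = community_of.get(subreddit)
        if community is not None:  # skip subreddits with no community assignment
            counts[topic][community] += 1
    return counts


# Toy labels and community assignments.
topic_of = {"s1": "gaming", "s2": "gaming", "s3": "funny", "s4": "gaming"}
community_of = {"s1": 1, "s2": 1, "s3": 2, "s4": 3}
counts = topic_counts_by_community(topic_of, community_of)
```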

Num subreddits  List of topics
195             gaming
107             funny
99              images
79              discussion
50              tech
49              sports
47              geographic
40              porn
28              tv
21              music, political
17              mocking, relationships, advice
14              memes
13              entertainment
12              personal, anime
11              finance, films
10              lgbt
9               drugs
8               meta
7               books, health
6               women, fashion, news, stories, comics, food, nature, animals, fitness
5               cars, beauty, crafts
4               weapons
3               work, QUARANTINED, military, art, drinking
2               religion, men, wholesome, mental health
1               mra, writing, PRIVATE

Table A.1: Frequencies of all subreddit topic labels

subreddit                 Median com  Insub  Num comments  Num authors  Gini
A - changemyview          12          0.15   79446         2911         0.566339
L - ABoringDystopia       4           0.03   7714          1131         0.415007
L - COMPLETEANARCHY       9           0.09   24806         1472         0.504852
L - ChapoTrapHouse        26          0.35   253931        4470         0.570722
L - ENLIGHTENEDCENTRISM   8           0.06   14305         1104         0.41408
L - Fuckthealtright       4           0.03   8883          1140         0.476412
L - LateStageCapitalism   4           0.06   21084         2991         0.417063
L - SandersForPresident   7           0.07   13896         1105         0.508352
L - Trumpgret             4           0.03   6954          1032         0.401657
L - beholdthemasterrace   6           0.04   13649         1363         0.428277
L - esist                 4           0.03   7363          1146         0.377484
L - socialism             6           0.07   9752          882          0.472018
N - PoliticalDiscussion   11          0.08   20421         1075         0.484362
N - PoliticalHumor        7           0.07   169142        11810        0.538524
N - politics              16          0.24   1935457       51981        0.59948
N - ukpolitics            32          0.39   268155        3485         0.607645
N - worldpolitics         7           0.06   44875         2638         0.607263
R - Conservative          10          0.11   50156         2263         0.583869
R - Libertarian           10          0.11   107993        4310         0.612541
R - The_Donald            29                 753522        11990        0.579455
R - The_Mueller           7                  35563         3182         0.45677
R - neoliberal            38          0.41   116605        1073         0.657753

Table A.2: Chapter 4 raw values for the case study subreddits

A.6 Closest neighbours of political subreddits

Table A.5 shows the five closest neighbours of each of the case study subreddits. This was used to create Figure 5.12 in Chapter 5.
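Selecting the closest neighbours of a subreddit from pairwise similarity scores can be sketched as follows. The scores and subreddit names are toy placeholders and the function name is hypothetical. The sketch assumes that a higher score means a closer neighbour, which may not match the similarity measure used in the thesis.

```python
import heapq


def closest_neighbours(similarity, subreddit, k=5):
    """Return the k subreddits most similar to the given one, best first.

    similarity: mapping subreddit -> {other subreddit: score}, where a higher
    score is assumed to mean a closer neighbour.
    """
    candidates = (
        (name, score)
        for name, score in similarity[subreddit].items()
        if name != subreddit  # a subreddit is not its own neighbour
    )
    top = heapq.nlargest(k, candidates, key=lambda item: item[1])
    return [name for name, _ in top]


# Toy similarity scores for one subreddit.
similarity = {"a": {"a": 1.0, "b": 0.9, "c": 0.2, "d": 0.5, "e": 0.7}}
top_two = closest_neighbours(similarity, "a", k=2)
```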

source  target  weight
C1      C2      330
C1      C3      306
C1      C4      155
C1      C5      54
C1      C6      15
C1      C7      11
C1      C8      1
C1      C9      16
C2      C3      2267
C2      C4      1835
C2      C5      212
C2      C6      96
C2      C7      92
C2      C8      32
C2      C9      31
C3      C4      1867
C3      C5      123
C3      C6      77
C3      C7      8
C3      C8      58
C3      C9      12
C4      C5      30
C4      C6      175
C4      C7      176
C4      C8      11
C4      C9      23
C5      C6      22
C5      C9      3
C6      C9      5

Table A.3: Community edge weights

topic 1 2 3 4 5 6 7 8 9 10 11 12 13
gaming 159.0 4.0 10.0 12.0 1.0 4.0 2.0
funny 2.0 76.0 17.0 3.0 6.0 2.0 1.0
images 2.0 57.0 14.0 8.0 10.0 7.0 1.0
discussion 7.0 3.0 27.0 28.0 5.0 6.0 1.0 2.0
tech 39.0 2.0 2.0 5.0 2.0
sports 3.0 2.0 42.0
geographic 31.0 1.0 1.0 1.0 5.0 2.0 2.0
porn 40.0
tv 12.0 3.0 7.0 3.0
music 3.0 18.0
political 21.0

Table A.4: Community topic counts

Case study subreddit  Five closest neighbours
ABoringDystopia       ENLIGHTENEDCENTRISM; socialism; ChapoTrapHouse; LateStageCapitalism; COMPLETEANARCHY
COMPLETEANARCHY       ABoringDystopia; LateStageCapitalism; ENLIGHTENEDCENTRISM; ChapoTrapHouse; socialism
ChapoTrapHouse        neoliberal; LateStageCapitalism; ENLIGHTENEDCENTRISM; socialism; COMPLETEANARCHY
Conservative          KotakuInAction; JordanPeterson; unpopularopinion; Libertarian; The_Donald
ENLIGHTENEDCENTRISM   TopMindsOfReddit; SelfAwarewolves; socialism; ChapoTrapHouse; COMPLETEANARCHY
Fuckthealtright       Trumpgret; PoliticalHumor; TopMindsOfReddit; ENLIGHTENEDCENTRISM; beholdthemasterrace
LateStageCapitalism   ENLIGHTENEDCENTRISM; PoliticalHumor; ChapoTrapHouse; COMPLETEANARCHY; socialism
Libertarian           Economics; worldpolitics; neoliberal; ENLIGHTENEDCENTRISM; Conservative
PoliticalDiscussion   Economics; politics; changemyview; SandersForPresident; neoliberal
PoliticalHumor        esist; The_Mueller; politics; unpopularopinion; worldpolitics
SandersForPresident   PoliticalHumor; worldpolitics; socialism; ChapoTrapHouse; PoliticalDiscussion
Trumpgret             PoliticalDiscussion; worldpolitics; esist; The_Mueller; PoliticalHumor
beholdthemasterrace   ChapoTrapHouse; TopMindsOfReddit; COMPLETEANARCHY; ENLIGHTENEDCENTRISM; Fuckthealtright
changemyview          neoliberal; TrueReddit; PoliticalDiscussion; JordanPeterson; unpopularopinion
esist                 Fuckthealtright; Trumpgret; The_Mueller; worldpolitics; PoliticalHumor
neoliberal            Libertarian; Destiny; ChapoTrapHouse; Economics; PoliticalDiscussion
politics              news; worldpolitics; PoliticalDiscussion; The_Mueller; PoliticalHumor
socialism             neoliberal; ENLIGHTENEDCENTRISM; LateStageCapitalism; ChapoTrapHouse; COMPLETEANARCHY
ukpolitics            neoliberal; london; europe; AskUK; unitedkingdom
worldpolitics         esist; TrueReddit; unpopularopinion; Economics; PoliticalHumor

Table A.5: Closest neighbours of the case study subreddits