UPTEC F 17059 Degree project, 30 credits, January 2018
A Quantative Study of Social Media Echo Chambers
Joakim Johansson

Teknisk-naturvetenskaplig fakultet, UTH-enheten (http://www.teknat.uu.se/student)

Abstract

The changing online environment, where the breadth of the information we are exposed to is algorithmically narrowed, has raised concerns about the creation of "echo chambers", in which individuals are exposed mainly to information already in alignment with their preconceived ideas and opinions.

This thesis explores the role of Twitter as a social medium and as an information network, and investigates whether exposure to and participation in political discussions resemble echo chambers. Analysis of the Twitter friendship network shows that users on Twitter tend to prefer to follow like-minded individuals to some extent, but not to the degree that constitutes an echo chamber. However, analyzing all communication events associated with a set of influential politicians, newspapers, journalists and bloggers during a ten-day period in connection with the British general election in June 2017 reveals that users are more likely to engage with ideologically similar peers than with users whose political beliefs differ from their own.

Data is collected using both the Twitter REST and streaming APIs, and is analyzed using methods from graph theory and social network analysis. One part of the project involves the collection and analysis of several million tweets, for which the cluster computing platform Apache Spark is used. The other part is concerned with finding the degrees of separation between accounts in the Twitter network, for which the API is queried step by step.
Supervisor: Raazesh Sainudiin
Subject reader: David Sumpter
Examiner: Tomas Nyberg
ISSN: 1401-5757, UPTEC F17059

Acknowledgements

I would like to thank my supervisors David Sumpter and Raazesh Sainudiin for their encouragement and guidance in writing this thesis. The cloud-computing infrastructure was provided by the Databricks Academic Partners Program for Raazesh Sainudiin.
Popular Science Summary

The internet has taken on an ever larger role as a news medium, and is today the primary source of news and civic information for a large share of people. At the same time, the idea of a shrinking selection of information has grown stronger in recent years. This is because a larger supply of information requires manual or algorithmic sorting of the content, and shifts the responsibility for selecting news from newsrooms to individuals and their social networks. Despite the enormous amount of information available on the internet, it is feared that the need to focus the flow of information limits exposure to information that contradicts what we already believe, resulting in intellectual isolation in cultural or ideological bubbles.

The year 2016 was marked by two major political events: the United Kingdom voted to leave the European Union, and in the United States Donald Trump was elected president. Judging by the news coverage of these events, the world appears to have been thrown into shock twice in the same year, and in their aftermath the way we use the internet to follow and take part in political discussions has been strongly questioned. Accusations have been directed at social media and at how social platforms favour showing content that is already in line with the user's views, which fosters an environment where users are increasingly exposed to content from other like-minded people. In the extreme case an "echo chamber" arises, in which a user sees their opinions repeated and confirmed over and over like an echo, without being questioned or met with challenging alternatives.

The purpose of this thesis is to study to what extent echo chambers exist in social networks on Twitter in a political context. Because of Twitter's role as both a social network and an information network, the study is divided into two parts.

The first part focuses on the social network on Twitter and examines whether people tend to use the same news channels as their friends. Here the clear division in British media over the coverage of the Brexit campaign is exploited, and people who exclusively follow newspapers of one leaning or the other are compared with each other. The second part focuses on how Twitter is used to take part in political discussions, and tries to answer the question of whether people prefer to interact within ideologically homogeneous groups rather than with those who think differently. This part is also tied to a political theme, and in this case Twitter communication in connection with the British election that took place in June 2017 is studied.
Contents

1 Introduction
  1.1 Background
  1.2 What is Twitter?
    1.2.1 The Asymmetric Following Model
    1.2.2 Twitter Markup and Timelines
    1.2.3 Communication on Twitter
    1.2.4 The Twitter API
    1.2.5 Tweepy
  1.3 Outline

2 Theory
  2.1 Graph Theory
    2.1.1 Graph Traversal Algorithms
  2.2 Social Network Analysis
    2.2.1 Measures and Metrics
    2.2.2 Clustering
    2.2.3 Assortative Mixing

3 The Twitter Friendship Network and the EU Referendum
  3.1 Background
  3.2 Method
    3.2.1 Data Collection
    3.2.2 Mathematical Model of the Twitter Network
    3.2.3 Ego Networks
    3.2.4 Network Building Blocks of an Echo Chamber
    3.2.5 Degrees of Separation
    3.2.6 Principal Component Analysis
  3.3 Results
    3.3.1 Ego Networks
    3.3.2 Degrees of Separation
  3.4 Discussion
    3.4.1 Ego Networks
    3.4.2 Degrees of Separation and Principal Component Analysis
    3.4.3 Method and Limitations

4 Structure of the Twitter Retweet Network and the UK Election
  4.1 Background
  4.2 Method
    4.2.1 Design of the Twitter Data Collector
    4.2.2 Retrospective Data Augmentation
    4.2.3 Representation of the Retweet Network
    4.2.4 Degrees of Separation in the Geometric Network
    4.2.5 Neighbour Joining and the Classification Tree
  4.3 Result
    4.3.1 Influential Users
    4.3.2 Degrees of Separation from Key Accounts
    4.3.3 Communication in the Retweet Network
    4.3.4 Sharing of External Content
  4.4 Discussion
    4.4.1 Choice of Key Accounts
    4.4.2 Population of the Classification Tree

5 Conclusion
1 Introduction
1.1 Background
On Thursday 23 June 2016, a referendum was held to decide whether the United Kingdom (UK) should leave or remain in the European Union (EU), and Leave won by 51.9% to 48.1% of the votes. The UK's prospective withdrawal from the EU is commonly referred to as Brexit.

The outcome of the referendum, surprising to many, has been associated with the effect of "echo chambers", in which individuals are exposed only to information from like-minded people. Concerns have been raised that the lack of attitude-challenging content on social media platforms such as Twitter and Facebook may harm democracy [1, 2].

Digital media has become a primary source of news and civic information, which is increasingly mediated through people's personalized online social networks [3]. The abundance of information available online offers an unprecedented number of choices when it comes to what information to consume and whom to talk to. Hence, in contrast to its offline counterpart, a person's online environment is shaped through manual or algorithmic sorting rather than by geographical constraints or the variety of newspapers available. Even though the internet has the potential to expose people to a diverse range of viewpoints [4], which many argue is crucial for developing well-informed and tolerant citizens [5], it also has the potential to limit exposure to cross-cutting content, which in turn could lead to political polarization towards ideological extremes [6, 7].

Selective exposure describes the phenomenon whereby people select information or sources already aligned with their own beliefs, whilst filtering out others [8]. In the same way, content presented to an individual can be selected by algorithms based on the viewer's previous behaviour, commonly referred to as the "filter bubble" phenomenon [9]. Both concepts are important as they are closely related to the creation of echo chambers, which can stifle political debate and challenge the function of digital media as a public sphere where a diversity of opinions come together and interact.

In this work, online social networks on Twitter are used to address this issue by analyzing the structure of echo chambers during two major political events: the EU referendum and the British general election in June 2017. Twitter is used as the source of data in both cases, as it is the only major social media platform that publicly shares a large portion of its data for free, but different approaches are used to explore it. Underlying ideas from Big Data analysis, graph theory and social network analysis (SNA) are used in both studies.
1.2 What is Twitter?
Twitter is a microblogging service that allows its users to communicate on the platform in short messages called tweets. A tweet is limited to 140 characters, with the intention that it roughly corresponds to an idea or a thought. Only registered users can post tweets, but anyone can read what other people write without being registered. This, in combination with Twitter's asymmetric following model, makes it stand out from other social websites. Twitter applies very little governance over what a Twitter account can be. That is, a Twitter account need not be a real person, but can represent just about anything, for example a company, a music group, a sports team or a newspaper. Twitter users follow and are followed by others, but the relationship of following and being followed is not necessarily reciprocal. Twitter allows users to protect their accounts, which means that the user approves or denies every follow request, and their tweets are only visible to approved followers. The terminology is that if user A follows user B, then A is B's follower, and B is A's followee.
1.2.1 The Asymmetric Following Model

Twitter's asymmetric following model makes Twitter into both a social network and a news medium; users want both to stay in contact with their friends and to keep up with celebrities, sports teams, politics or news [10]. Thus, following a Twitter account does not imply a real-world connection, which is usually the case on other social networking sites, e.g. Facebook or LinkedIn. By following an account the user receives all tweets posted by the followed account. A Twitter user can then react to the tweets in several ways, which are covered in more depth in a later section, the most common being to retweet. Retweeting a post makes it appear to the retweeter's followers, allowing the tweet to reach beyond the original poster's followers. In this way, the retweet mechanism gives users a tool to spread information of their liking in their own networks.
1.2.2 Twitter Markup and Timelines

As mentioned above, tweets are the basic, atomic building blocks of Twitter, and a well-defined markup vocabulary has evolved around them. For example, an @ sign followed by a user's screen name is used to address that user, referred to as a mention, and a # sign followed by a word represents a hashtag [11]. User mentions and hashtags are examples of metadata in a tweet. There are two types of metadata: entities and places. Tweet entities are, in addition to user mentions and hashtags, URLs or any type of media that may be linked in the tweet, such as images or videos. Places are real-world locations associated with the tweet, e.g. the location from which the tweet was authored or a reference to a place described in the tweet [12].

Another central part of the Twitter platform is timelines, which are collections of tweets displayed in chronological order. There are two types of timelines: the home timeline and the user timeline. The home timeline is individual to each user and contains all the tweets from the users followed by that user. The user timeline is the collection of tweets from a single user only, and it looks the same regardless of who is accessing it.
1.2.3 Communication on Twitter

A tweet can be responded to in different ways on Twitter, either by commenting directly on the tweet, known as a reply tweet, or by publicly sharing the tweet as either a retweet or a quoted tweet. As long as the account responding to a tweet is not protected, all three types of responses are public, but for reply tweets only users that are part of the conversation will see the reply on their home timelines [13]. The retweet function is the simplest way to share information on Twitter, as the original tweet is shared unaltered with the retweeter's followers. The retweeter can choose to comment on the retweeted message, making it a quoted tweet. Retweets indicate interest and trust in the message and author, as well as endorsement [14, 15], whereas reply tweets and quoted tweets can also be used to express disagreement. Thus, analyzing all communication on Twitter would involve natural language processing algorithms, and the uncertainty associated with them; this can be circumvented by focusing only on the retweet network.
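As an illustration of how the three response types might be told apart in collected data, the following is a minimal sketch that classifies a tweet given as a plain dictionary and extracts retweet edges. The field names `retweeted_status`, `quoted_status`, `is_quote_status` and `in_reply_to_status_id` are assumptions taken from the tweet JSON of the Twitter API at the time and are not listed in this thesis; the helper names and the sample data are hypothetical.

```python
def classify_tweet(tweet):
    """Return 'retweet', 'quote', 'reply' or 'original' for a tweet dict.

    The field names checked here are assumptions about the tweet JSON,
    not fields documented in this thesis.
    """
    # A retweet of a quoted tweet carries both markers, so retweets are checked first.
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("is_quote_status") or "quoted_status" in tweet:
        return "quote"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"


def retweet_edges(tweets):
    """Yield (retweeter_id, original_author_id) pairs, ignoring all other tweets."""
    for tweet in tweets:
        if classify_tweet(tweet) == "retweet":
            yield tweet["user"]["id"], tweet["retweeted_status"]["user"]["id"]


# Made-up sample data for illustration only.
sample = [
    {"id": 1, "user": {"id": 10}, "text": "hello"},
    {"id": 2, "user": {"id": 11}, "retweeted_status": {"id": 1, "user": {"id": 10}}},
    {"id": 3, "user": {"id": 12}, "in_reply_to_status_id": 1},
    {"id": 4, "user": {"id": 13}, "is_quote_status": True, "quoted_status": {"id": 1}},
]
edges = list(retweet_edges(sample))
```

Keeping only the pairs produced by `retweet_edges` yields a directed edge list in which no text analysis is needed to interpret an edge.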
1.2.4 The Twitter API

Twitter allows developers to acquire and analyze Twitter data through its Application Programming Interface (API). In order to use the Twitter API, one needs to register as a developer on Twitter's developer platform and register an application. Open Authorization (OAuth) credentials are then generated, which are used in the application to gain authorization to query the API without sharing any sensitive information or passwords.

Tweets comprise a rich set of metadata, and the anatomy of a tweet is more complex than the 140 characters first suggest. Table 1.1 highlights some important fields of the tweet object; a complete list of all fields can be found in [16]. Entities, Places and Users are in turn objects of their own and contain additional fields. The Entities object found in a tweet object can contain any of the following entities: media, URLs, user mentions, hashtags, symbols etc. The Places object, when present, and the Users object contain information about a place the tweet is associated with (but did not necessarily originate from) and information about the user who posted the tweet, respectively [16–19].

Twitter limits the number of queries that can be issued per unit of time to reduce abuse of its resources. Rate limiting of the API is primarily on a per-user basis (per user access token), but rate limits can also be determined globally for the entire application, which is considered separately from the per-user limits. Rate limits are divided into 15-minute intervals, called rate limit windows. Different methods allow for a different number of requests per rate limit window; see table 1.2 for a selection of the methods.

Twitter also offers a streaming API for monitoring and processing Twitter's global stream of data in real time. The default access level allows developers to track up to 400 keywords and 5,000 users, as long as the traffic does not surpass 1% of the total data volume.
Table 1.1: A selection of the fields of the Tweet object. The types marked with (*), Entities, Places and Users, are the metadata objects and contain additional fields.

Field            Type       Description
entities         Entities*  Entities in the text of the tweet.
favorite_count   Int        Number of times the tweet has been liked.
id               Int        Unique integer identifier for this tweet.
place            Places*    A place that is associated with the tweet.
retweet_count    Int        Number of times the tweet has been retweeted.
retweeted        Boolean    Whether or not a tweet has been retweeted.
text             String     The text of the actual content in the tweet.
user             User*      The user who posted this tweet.

Table 1.2: Rate limits for a few API methods. The rate limits described in this table refer to read-only/GET methods. The requests given are per rate limit window of 15 minutes.

Method                  Resource type   Requests (user)   Requests (app)
GET followers/ids       followers       15                15
GET friends/ids         friends         15                15
GET friendships/show    friendships     180               15
GET users/lookup        users           900               300
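To give a feeling for what the rate limits in table 1.2 mean in practice, the following rough sketch estimates the waiting time imposed by the rate limit windows when collecting the follower IDs of a single account. The page size of 5,000 IDs per `GET followers/ids` request is an assumption based on the API documentation of the time and is not stated in this thesis; the function names are hypothetical.

```python
import math

WINDOW_MINUTES = 15  # length of a rate limit window, as described above


def windows_needed(n_requests, requests_per_window):
    """Number of rate limit windows required to issue n_requests."""
    return math.ceil(n_requests / requests_per_window)


def minutes_waiting_for_follower_ids(n_followers,
                                     ids_per_request=5000,    # assumed page size
                                     requests_per_window=15):  # from table 1.2
    """Minutes spent waiting for new windows while paging through follower IDs."""
    n_requests = max(1, math.ceil(n_followers / ids_per_request))
    windows = windows_needed(n_requests, requests_per_window)
    # The first window can be used immediately; every further window costs a wait.
    return (windows - 1) * WINDOW_MINUTES


wait = minutes_waiting_for_follower_ids(1_000_000)
```

Under these assumptions an account with one million followers needs 200 requests spread over 14 windows, i.e. a little over three hours of waiting for one access token.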
1.2.5 Tweepy

Tweepy is a Python library for accessing the Twitter API. Tweepy implements an API class that provides access to the Twitter API methods. Invoking an API method usually returns a Tweepy model class instance, which contains the data returned from Twitter. Tweepy supports OAuth authentication, which is handled by the OAuthHandler class. After the OAuth credentials are generated, an OAuthHandler instance can be created, into which the credentials are passed. The OAuthHandler also needs to be equipped with the access token and the access token secret. Once this is done, the OAuthHandler instance can be used to create an instance of the API class, which is ready to query the Twitter API.
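The authentication steps described above can be sketched as follows. This is a minimal illustration rather than the code used in this thesis: the function name `build_api` and the credential placeholders are hypothetical, and the `wait_on_rate_limit` option is an addition that is not mentioned in the text. Newer Tweepy releases may provide the same handler under a different name.

```python
def build_api(consumer_key, consumer_secret, access_token, access_token_secret):
    """Create an authenticated tweepy.API instance from OAuth credentials."""
    # Tweepy is a third-party package; it is imported lazily so that this
    # sketch can be loaded even where Tweepy is not installed.
    import tweepy

    # The consumer key and secret identify the registered application.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # The access token and its secret identify the user on whose behalf we query.
    auth.set_access_token(access_token, access_token_secret)
    # wait_on_rate_limit makes Tweepy sleep until the next rate limit window
    # instead of raising an error (an assumption about the desired behaviour).
    return tweepy.API(auth, wait_on_rate_limit=True)


# Hypothetical usage; the strings are placeholders, not real credentials.
# api = build_api("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
```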
1.3 Outline
This thesis is divided into three parts. The first part, comprising chapters 1 and 2, gives an introduction to the problem and the theoretical framework needed to understand the methods used in the two other parts. This part provides support for the rest of the thesis; a reader who is already familiar with graph theory, network analysis and online social media can skim through it swiftly. The two other parts, chapter 3 and chapter 4, together with the conclusion in chapter 5, constitute the main part of the thesis. Chapter 3 looks at the follower/followee network on Twitter in the context of the Brexit debate, while chapter 4 examines communication on Twitter associated with the United Kingdom general election in June 2017. As the two chapters involve two different data sets and different methods, dividing them was the most natural structure.
2 Theory
2.1 Graph Theory
A graph $G = (V, E)$ consists of a set of nodes, also called vertices, $V$, and a set of links, also called edges, $E \subseteq V \times V$, which connect pairs of nodes. A graph may be directed or undirected, unipartite or bipartite, connected or disconnected, and the edges can be weighted; see figure 2.1.

Definition 2.1.1. (Directed and Undirected Graph) A directed graph $G = (V, E)$ is a graph where every edge $(i, j) \in E$ is an ordered pair of nodes. An undirected graph $H = (V, E)$ is a graph where $(i, j) \in E \implies (j, i) \in E$.

Definition 2.1.2. (Unipartite and Bipartite Graph) A graph $G = (V, E)$ is called bipartite if the set $V$ can be partitioned into two disjoint sets $V_1$ and $V_2$, where $V_1 \cap V_2 = \emptyset$ and $V = V_1 \cup V_2$, such that there are no edges between nodes in the same partition, i.e. $(i, j) \in E \implies i \in V_1$ and $j \in V_2$. For a unipartite graph, the node set cannot be partitioned into disjoint sets such that there are no edges between nodes in the same partition.

Definition 2.1.3. (Connected and Disconnected Graph) In a connected graph there is a path from any node to any other node in the graph. A graph that is not connected is said to be disconnected.

Definition 2.1.4. (Multigraph, Self Edges and Simple Graph) A multigraph is a graph where multiple edges are allowed to connect the same two vertices. A self-edge, or a loop, is an edge that connects a vertex to itself. A simple graph is an undirected graph in which both multiple edges and self-edges are prohibited.
Figure 2.1: Examples of (a) a directed graph, (b) an undirected graph, (c) a weighted graph and (d) a bipartite graph.
A way to represent a finite graph is with an adjacency matrix. The adjacency matrix, $A$, of a simple, unweighted graph, $G = (V, E)$, is a square $|V| \times |V|$ matrix with elements $A_{ij}$ such that:

$$A_{ij} = \begin{cases} 1, & \text{if there is an edge from vertex } i \text{ to vertex } j, \\ 0, & \text{otherwise.} \end{cases} \tag{2.1.1}$$

For example,

$$A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix} \tag{2.1.2}$$

represents the graph on three vertices in which every pair of vertices is connected by an edge (a triangle).

Two properties of the adjacency matrix are that, first, for a graph with no self-edges the diagonal matrix elements are all zero, and second, for an undirected graph the adjacency matrix is symmetric. In a similar way, an adjacency matrix can be used to represent a non-simple graph or a weighted graph; the elements $A_{ij}$ then equal the multiplicity or the weight of the edge between vertices $i$ and $j$, respectively.

Because $A_{ij}$ is 1 when there is an edge from vertex $i$ to vertex $j$, it follows that the product $A_{ik}A_{kj}$ is 1 only when there is a path $(i \to k \to j)$, and so the total number of paths of length two from $i$ to $j$, via any other vertex, is given by:

$$\sum_{k=1}^{|V|} A_{ik}A_{kj} = [A^2]_{ij}. \tag{2.1.3}$$

This can be generalized to paths via an arbitrary number of intermediate vertices, and so the matrix $A^n$ has the property that the element $[A^n]_{ij}$ gives the number of paths of length $n$ from vertex $i$ to vertex $j$.
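The path-counting property can be checked numerically on the small example matrix in (2.1.2). The sketch below uses plain Python lists rather than a linear algebra library; the helper names `mat_mul` and `mat_pow` are hypothetical. Note that the paths counted in this way are allowed to revisit vertices.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(a, n):
    """Raise a square matrix to the non-negative integer power n."""
    size = len(a)
    # Start from the identity matrix, which corresponds to paths of length zero.
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, a)
    return result


# Adjacency matrix from equation (2.1.2).
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]

A2 = mat_pow(A, 2)  # [A^2]_ij: number of paths of length 2 from i to j
A3 = mat_pow(A, 3)  # [A^3]_ij: number of paths of length 3 from i to j
```

For instance, the diagonal entries of `A2` equal 2 because each vertex can reach itself in two steps via either of its two neighbours.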
2.1.1 Graph Traversal Algorithms

The goal of a graph traversal is, generally, to find every node reachable from a given source node. A common application of graph traversal algorithms is to find the shortest path between two nodes in a graph. In this section a few standard algorithms are presented.
Breadth-First Search

Breadth-first search finds the shortest path from a given source vertex to every other vertex within the same component of the graph. It works by starting with a given vertex, the root, and expanding outwards layer by layer, where the vertices in the first layer are the immediate neighbours of the starting vertex. The vertices in the next layer are the neighbours of the vertices in layer one, at distance 2 from the starting node, and so forth; see figure 2.2.

A first-in, first-out queue is used to keep track of the nodes on the frontier. Every new node pushed onto the queue is given a distance $k + 1$ from the root until all the nodes at distance $k$ are gone from the queue, after which $k$ is incremented by one.
Figure 2.2: A breadth-first search starts at the root and explores the graph outwards in layers. All the nodes in the first layer, marked with 1, are explored before moving on to the nodes marked with 2.
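The breadth-first search described above can be sketched in a few lines. This is an illustrative version, assuming the graph is stored as a dictionary mapping each node to a list of its neighbours; the function name and the example graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, root):
    """Return a dict with the distance from root to every reachable node."""
    dist = {root: 0}
    queue = deque([root])  # first-in, first-out queue holding the frontier
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # v has not been seen before
                dist[v] = dist[u] + 1  # one layer further out than u
                queue.append(v)
    return dist


# Small made-up graph: the root has three neighbours, two of which lead further out.
example = {
    "root": ["a", "b", "c"],
    "a": ["root", "d"],
    "b": ["root", "d", "e"],
    "c": ["root"],
    "d": ["a", "b"],
    "e": ["b"],
}
layers = bfs_distances(example, "root")
```

Instead of tracking the layer counter $k$ explicitly, this variant stores the distance of each node when it is first discovered, which gives the same result.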
The time complexity of the breadth-first search algorithm is $O(|V| + |E|)$, where $|V|$ is the number of vertices and $|E|$ is the number of edges in the graph. First, it takes $O(|V|)$ time to set up the distance array, which has one element per node in the graph. Then, each node in the queue has on average $O(|E|/|V|)$ neighbours, which are either added to the queue or are already known, in which case nothing is done. Either way the operations take $O(1)$ time. Thus, for all vertices in the graph this takes $|V| \cdot O(|E|/|V|) = O(|E|)$ time to complete, which gives the time complexity $O(|V| + |E|)$ for the whole algorithm [20].

Expressed in different terms, let $b$ be the average out-degree of each node in the graph, called the branching factor of the graph. Then each node will generate on average $b$ new nodes when expanded, each of which generates $b$ nodes, yielding $b^2$, and so forth. Thus, the time complexity for finding all nodes at a specific distance, $d$, is $O(b^{d+1})$.

Depth-First Search

In contrast to a breadth-first search, the depth-first search uses a last-in, first-out queue (a stack) to track the frontier, i.e. it starts at the root and explores as far as possible along each branch of the graph before backtracking and considering other nodes on the stack. The time complexity is the same as for the breadth-first search for traversing the whole graph.
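For comparison with breadth-first search, a depth-first traversal can be sketched by replacing the queue with a stack. The version below is a minimal illustration, assuming the graph is stored as a dictionary of neighbour lists; the function name and the example graph are hypothetical.

```python
def dfs_order(adj, root):
    """Return the nodes reachable from root in the order a depth-first search visits them."""
    visited = set()
    order = []
    stack = [root]  # last-in, first-out: the most recently found node is explored first
    while stack:
        u = stack.pop()
        if u in visited:
            continue
        visited.add(u)
        order.append(u)
        # Push neighbours in reverse so that the first listed neighbour is explored first.
        for v in reversed(adj.get(u, ())):
            if v not in visited:
                stack.append(v)
    return order


# Made-up directed graph for illustration.
example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
order = dfs_order(example, "a")
```

In this example the search follows the branch through `b` all the way to `d` before it backtracks to `c`, whereas a breadth-first search would visit `b` and `c` before `d`.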
Bi-directional Search

A bi-directional search algorithm is used when the source and target nodes are explicitly defined. Bi-directional search runs two simultaneous searches: one forward from the source node and one backward from the goal node, stopping when the two meet at an intersection point [21]; see figure 2.3.

Using breadth-first search in both directions ensures completeness and optimality of the algorithm. Since both searches run for the same amount of time and meet in the middle, each search only needs to expand $O(b^{d/2})$ nodes. Thus, the time complexity for the whole algorithm is $O(b^{d/2})$, which for a large $d$, or a large branching factor, can be many orders of magnitude faster than a regular breadth-first search.
Figure 2.3: The bi-directional search algorithm starts from two nodes: the source node and the target node. When an intersection point is found, the shortest path is computed. The gray nodes are the start points of the search, and the digits in the nodes give how far away a node is from the root.
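A bi-directional breadth-first search can be sketched as follows. To keep the illustration short it assumes an undirected graph stored as a dictionary of neighbour lists, so that the backward search can use the same neighbour lists as the forward search; for a directed graph the backward search would need the reversed edges. The function name and the choice to always expand the smaller frontier are assumptions made for this sketch.

```python
def bidirectional_distance(adj, source, target):
    """Return the length of a shortest path between source and target, or None."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = [source], [target]
    while frontier_s and frontier_t:
        # Expand one whole layer of the smaller frontier.
        expand_source = len(frontier_s) <= len(frontier_t)
        if expand_source:
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        best = None
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v in other:
                    # The two searches meet along the edge (u, v).
                    candidate = dist[u] + 1 + other[v]
                    if best is None or candidate < best:
                        best = candidate
                elif v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.append(v)
        if best is not None:
            # The whole layer was examined, so the shortest meeting has been found.
            return best
        if expand_source:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None  # one search ran out of nodes: no path exists


# Made-up undirected graphs for illustration.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
triangle_tail = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
```

Finishing the current layer before returning matters: the first meeting found within a layer is not necessarily the shortest one, so the minimum over the whole layer is taken.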
Dijkstra's Algorithm

Like breadth-first search, Dijkstra's algorithm finds the shortest paths between a starting vertex and every other vertex in the same component, but for weighted graphs, i.e. it takes the edge lengths into account. Let $s \in V$, a vertex in a weighted graph $G = (V, E)$, be the starting vertex. Assign to $s$ the tentative distance 0 and to every other node in $V$ a tentative distance of $+\infty$, and mark the distances as uncertain. The algorithm then proceeds as follows.
1. Find the vertex, $v$, with the smallest tentative distance from $s$ among the vertices whose distances are still uncertain.

2. Mark this distance as certain.

3. Consider the neighbourhood of $v$, and compute the distance from $s$ via $v$ to each neighbour, by summing the weights of the intermediate edges. If the computed distance is smaller than the current estimated distance, it replaces the older one.
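The three steps above, repeated until every distance is certain, can be sketched as follows. The sketch assumes non-negative edge weights and a graph stored as a dictionary mapping each node to a list of (neighbour, weight) pairs. Using a binary heap to find the smallest tentative distance in step 1 is a choice made for this illustration and is not prescribed by the description above; the function name and the example graph are hypothetical.

```python
import heapq


def dijkstra(adj, source):
    """Return a dict with the shortest distance from source to every reachable node."""
    dist = {source: 0}     # tentative distances; missing nodes count as +infinity
    certain = set()        # nodes whose distance has been marked as certain
    heap = [(0, source)]   # (tentative distance, node) pairs
    while heap:
        d, v = heapq.heappop(heap)        # step 1: smallest tentative distance
        if v in certain:
            continue                      # outdated heap entry, already settled
        certain.add(v)                    # step 2: mark the distance as certain
        for w, weight in adj.get(v, ()):  # step 3: update the neighbours of v
            new_dist = d + weight
            if new_dist < dist.get(w, float("inf")):
                dist[w] = new_dist
                heapq.heappush(heap, (new_dist, w))
    return dist


# Made-up weighted directed graph for illustration.
example = {
    "s": [("a", 4), ("b", 1)],
    "a": [("c", 1)],
    "b": [("a", 2), ("c", 5)],
    "c": [],
}
shortest = dijkstra(example, "s")
```

In the example the direct edge from `s` to `a` has weight 4, but the route via `b` has total weight 3 and therefore replaces the earlier estimate.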
Steps 1 through 3 are repeated until the distances to all vertices are flagged as being certain. The correctness of the algorithm can be proved by contradiction.

Claim. The smallest distance recorded to any vertex among the vertices which are marked uncertain is in fact the smallest possible distance to that vertex.

Proof. Suppose $v$ is the node in step 1, i.e. the next unexplored vertex that has the smallest estimated distance, $\delta(v)$, to $s$, and assume that there is a path, $p = (s \to u_1 \to u_2 \to \dots \to v)$, from $s$ to $v$ that is shorter than the current estimate; see figure 2.4. Since the path, $p$, is shorter than the estimate, the distance along the path to each vertex on the path must also be less than the estimate. Furthermore, somewhere along the path there must be a pair of vertices, $x, y$, such that $x$'s distance to $s$, $d(x)$, is flagged as being certain and $y$'s is not.

Suppose now that $y$ and $v$ are the same vertex; then a shorter path to $v$ must already have been found when the neighbourhood of $x$ was explored (remember $d(x) <$