
Proc. 2019 IEEE International Conference on Computer Communication and Networks, Valencia, Spain, July 29 - August 1, pp. 1-9.

A Reaction-Based Approach to Cascade Analysis

James Flamino
Department of Physics, Applied Physics, and Astrophysics
Rensselaer Polytechnic Institute
Troy, NY, USA
fl[email protected]

Boleslaw K. Szymanski
Department of Computer Science
Rensselaer Polytechnic Institute
Troy, NY, USA
[email protected]

Abstract—Online social media provides massive open-ended platforms for users of a wide variety of backgrounds, interests, and beliefs to interact and debate, facilitating countless information cascades across many subjects. With numerous unique voices lent to the ever-growing information stream, it is essential to consider the question: how do the many types of conversations within an information cascade characterize the process as a whole? In this paper we analyze the underlying features of the dynamics of communication, and use those features to explain the inherent properties of the encompassing information cascade. Utilizing "microscopic" trends to describe "macroscopic" phenomena, we set a paradigm for analyzing information dissemination through the individual user interactions that sprout from a source topic, instead of trying to interpret the emergent patterns themselves. This paradigm yields a set of unique tools for a myriad of applications in the field of information cascade analysis: from topic classification of sources to time-series forecasting. We use these tools on an 88-million-row dataset for Reddit to show their conceptual effectiveness and accuracy when compared to the ground truth.

Index Terms—information cascade, response features, topic classification, time-series forecasting, conversational dynamics, semantic analysis

I. INTRODUCTION

How can we understand the nature of a topic? To answer this query, let us consider a simpler question: how can we guess the genre of an unknown movie while actively watching it in a theatre? One way would be to examine the title and character dialogue. Another way would be to analyze the visuals and special effects. But a more unconventional approach might be to turn around and look to the audience for the answer. If we do not want to use dialogue or visuals, we can use the audience's reactions. For example, if the audience is laughing frequently, it is likely to be a comedy. But if they are mostly crying out in fear, the film is probably a horror movie. And while there are outliers, the process described is reliable enough given that an audience's emotional reaction is inherently tied to the nature of a movie's genre. After all, the director of a horror film needs the audience to react in fear, otherwise their film will most likely fail. And just as these emotional reactions from an audience fundamentally describe the associated movie, topics in online social media have their own set of unique "user reactions" that can be extracted and used to characterize the unique properties of the discussed subject without any help from text analysis. And while natural language processing (NLP) in itself has an unprecedented capacity for understanding human interactions and the content of information cascades, in this paper we will show that the alternative approach of using non-NLP, reaction-based analysis can provide significant insight into the understanding of topics in online social media.

In this paper we will start our description of this novel approach in Section 3 by describing the response features we choose to use, their mathematical quantification, the methods we use to employ them, and a visualization of their patterns. We then validate these features in order to show their consistency and resilience in real-world datasets in Section 4. In Section 5 we introduce time-dependency to further describe the robustness of this approach. Then in Section 6 we utilize the response features in all forms for a couple of applications in order to demonstrate their capabilities, while testing accuracy and effectiveness. We conclude our work in Section 8, where we review the implications of a response-based approach to information cascade analysis.

II. DATASETS

Given our focus on conversational dynamics, we decided to extract a dataset from an online social media platform that encourages in-depth discussions of a wide variety of topics and subjects. The format that we found to best fit this description was the group of online platforms called forums. A forum is a network of registered users in which any user can freely submit posts about certain topics (generally under some related category). Such a post triggers responses to the post material. In turn these responses trigger more responses, resulting in a cascade of information passed between unique users. A majority of these cascades will be short-lived and quickly superseded by more recent topics, but the conversations that do occur due to a post follow the theme established by the source post, characterizing a majority of the interactions that occur within a cascade.

One of the most well-known forum-like social media platforms is Reddit. In Reddit the posts are clustered by Subreddit, which generally encompasses a defining theme (e.g. the Subreddit r/politics is comprised of discussions about U.S. politics).

Within a Subreddit, a user can create a submission pertaining to the Subreddit's genre. The submission then becomes available to all other users. Users can vote on the quality of the submission and start discussions in the comment section of the submission. The more provocative the subject of a submission, the greater the response, ultimately increasing the activity of the submission and the vote count, which inevitably increases the exposure of the submission, evoking additional responses. In essence, Reddit is a platform that rewards posts that elicit conversational cascades; exactly what we are looking for.

We use the data provided by [1], which offers a Reddit dataset that covers 5,692 Subreddits, 88M submissions, and 887.5M comments over a time range of 2006 to 2014. The comments in this dataset are formatted as a comment tree extension, accentuating the natural branching conversations that sprout from the source submission. We reformatted this into an event sequence format, where the identifiers id, root id, and parent id illustrate an event's position in a cascade. To clarify, the root id indicates the id of the source submission, while the parent id indicates the id of the event that the posting user is responding directly to. This event sequence was uploaded into a MySQL database, with the 88M submissions being supplemented with their respective titles, text bodies, and scraped headlines from any linked URLs using Reddit's Official API.
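To make the event-sequence format concrete, the following is a minimal sketch of how a nested comment tree could be flattened into rows keyed by id, root id, and parent id. Only those three identifiers come from the description above; the author and created_utc fields, and the nested input structure, are hypothetical and used only for illustration.

```python
# Sketch: flatten a nested comment tree into event-sequence rows.
# 'id', 'root_id', and 'parent_id' follow the text; 'author' and
# 'created_utc' are hypothetical field names added for illustration.

def flatten_submission(submission):
    """Yield one event row per comment, walking the reply tree from the root."""
    root_id = submission["id"]
    stack = [(root_id, submission.get("comments", []))]
    while stack:
        parent_id, replies = stack.pop()
        for comment in replies:
            yield {
                "id": comment["id"],
                "root_id": root_id,        # id of the source submission
                "parent_id": parent_id,    # id of the event being replied to
                "author": comment.get("author"),
                "created_utc": comment.get("created_utc"),
            }
            stack.append((comment["id"], comment.get("comments", [])))
```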

III. CHARACTERIZING CONVERSATIONAL DYNAMICS

Our objective is to use the dynamics of conversations to characterize and understand the overarching topic. With respect to Reddit, this means we want to use the comment tree to identify the theme of the submission without any text mining from the comment tree and even from the submission itself. But what kind of features can we extract that are indicative enough? Our answer is this: we will capture features using the innate bias that a majority of users will unconsciously exhibit given a specific topic. If a user is biased in a certain way, they will react differently in more than just text.

A good example of the application of this idea can be found in [2], where the authors show that users have stable, consistent reactions associated with a given topic, called a "Social Genotype Model" (specifically within a Twitter dataset). While proving the stability of this model, the authors quantified non-semantic features to train a hashtag classifier. These features pertained to an individual user's use of a hashtag through retweets, and included measures such as the time between a user's exposure to a tweet given a hashtag and their first retweet of said hashtag. The results of the paper demonstrated consistent topic-dependent behavior and proved the underlying point previously mentioned: users as a whole will always respond differently depending on the subject matter. Given this understanding, we can begin to confidently consider measures that quantify user interactions.

A. Definitions

Given Reddit's emphasis on forum-like interactions, the emergent online social media network G(U, E) is not formed in a typical fashion, where edges connect unique users like e = (u_i, u_j), e ∈ E. And while users can "follow" other users in Reddit, the more apparent link to content is through the subscription to Subreddits. Once subscribed, a user becomes part of a community of fellow subscribers that are all updated when any other subscribed user posts a submission to that Subreddit. This leads us to assumption 1.

Assumption 1: Reddit can be considered a set of Subreddit tags R. Each Subreddit has a unique set of associated users U. All users in set U are connected with an edge representing information flow. This system forms an isolated, undirected complete graph.

Naturally the network of Reddit becomes more complicated once you consider that users subscribe to multiple Subreddits and follow other users, but when considering the graph topology within the scope of a single Subreddit, the assumption makes sense. And given this approach, we can intuitively state that submissions within a Subreddit are also generally detached from each other, with each information cascade eliciting responses from different subsets of users at different rates. The graphical topologies of these submissions are fundamentally different, however, as responses to a source node tend to stack recursively. This leads us to assumption 2.

Assumption 2: Each submission that occurs within a Subreddit is an independent information cascade, representing a unique set of users s = {u}, s ⊆ U interacting for a short period of time, with the resultant event sequence following the graph topology of a directed tree network.

We now represent the event sequence generated by the users of some submission s as T(N, E), where each n ∈ N represents a message somewhere in the event sequence triggered by said submission. A connecting edge is defined as e = (n_i, n_j), e ∈ E, where n_j is the responder's message and n_i is the respondee's message. It is important to note that each message n does have a uniquely associated user u. While each node n is unique, a user has free rein to generate as many nodes as they please, resulting in user degeneracy in T. We illustrate this terminology and the architecture of T in Figure 1.

Figure 1: Representation of a directed tree network T.

As seen in Figure 1, we refer to n_0 as the root node, and all nodes that directly link to the root node are referred to as initial nodes. All other nodes are simply referred to as response nodes. We emphasize that we can extract subsets of directly correlated messages called branches. Branches are represented by B = {B_1, B_2, ..., B_m}, where m is equal to the total number of initial nodes. We can use some kth branch B_k, where B_k ⊆ N, to evaluate the miniature response cascade that is triggered by the kth initial node. We also note that for every branch we include the root node as an element in the set. This is because, for all intents and purposes, the root node is the submission's contents itself, thus we must include this node to ensure our branch-specific features capture that initial reaction from users that directly respond to the submission. Table 1 summarizes the mentioned symbols.

Symbol     Definition
R          Some Subreddit within Reddit
U          All users subscribed to R
s          A submission within R, represented as the set of users that responded to the submission, s = {u}, s ⊆ U
T(N, E)    A tree network representing the structure of hierarchically linked comments within a submission
n          A user-generated comment within T, n ∈ N
n_0        The head node. Technically, this is the submission text submitted to the Subreddit that triggers the comment cascade.
e          A directed edge within T, representing the direction of information flowing between comments (from respondee to responder), e ∈ E
B          The set of branches found in T. B_k = {n}, B_k ⊆ N, where the kth branch contains some subset of linked comments (including n_0) generated for a submission.

Table 1: Symbols and definitions

Now that we have established our definitions, we can begin constructing our response features from the empirical data.
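As a minimal sketch of the definitions above, the snippet below builds the tree T from the event rows of Section 2 and extracts the branch set B, including the root node n_0 in every branch as described. It assumes the hypothetical event fields introduced earlier and uses networkx; it is an illustration, not the authors' implementation.

```python
import networkx as nx

def build_tree(events, root_id):
    """Directed tree T(N, E); edges point from respondee to responder."""
    T = nx.DiGraph()
    T.add_node(root_id)  # n0: the submission itself
    for ev in events:
        T.add_node(ev["id"], user=ev.get("author"), time=ev.get("created_utc"))
        T.add_edge(ev["parent_id"], ev["id"])
    return T

def branches(T, root_id):
    """B = {B_1, ..., B_m}: one branch per initial node, with n0 included."""
    B = []
    for initial in T.successors(root_id):          # initial nodes reply directly to n0
        B.append({root_id, initial} | nx.descendants(T, initial))
    return B
```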

B. Approach

For analysis, we split our features into two sets: individual features and aggregate features. For the individual features, the real values must describe the nature of user interactions within a single branch B_i ∈ B. To generate these values, we design features that capture innate behavior exhibited in both scale and time without recursively analyzing each involved user or message. We shall refer to these as our individual features.

1) Depth: The path length between the root node and the farthest response node in a branch. Using Dijkstra's Algorithm (DA),

DEP(B_i) = max_{n ∈ B_i, n ≠ n_0} DA(n_0, n)

2) Magnitude: The maximum in-degree centrality in a branch. Given the adjacency matrix A for T,

MAG(B_i) = max_{n ∈ B_i} ( Σ_k a_{k,n} ), a_{k,n} ∈ A

3) Engagement: The total number of users involved in a branch, where n_u is the set of messages generated by user u:

ENG(B_i) = |{u : |n_u ∩ B_i| > 0}|

4) Longevity: The absolute amount of time between the creation of an initial node and the latest response node:

LNG(B_i) = max_{n ∈ B_i} t(n) − min_{n ∈ B_i, n ≠ n_0} t(n)

In addition to these 4 features, we introduce two additional comprehensive features for a total of 6. We use these comprehensive features to evaluate a submission's aggregate conversational dynamic within T, independent of B. The methods are extracted from [3], in which the authors evaluate human communication in temporal networks and introduce entropy-based measures for human communication. In particular, we chose to consider both the first and second order entropy measures that were introduced. These two methods produce distinct numerical representations that portray aggregate comment interactions succinctly. Thus we refer to these measures as our aggregate features.

5) First order entropy: Based on the probability p_1(u) of some user u within s generating a message n for T:

S_1 = − Σ_{u ∈ s} p_1(u) ln(p_1(u))

6) Second order entropy: Based on the probability p_2(e_ij) of a unique edge e_ij = (n_i, n_j) being formed between two messages by two specific users u_i and u_j within T:

S_2 = − Σ_{e_ij ∈ E} p_2(e_ij) ln(p_2(e_ij))

We now have an appropriate set of features to describe the response dynamic within an information cascade. Employing the individual measures for some submission s yields a matrix M_s of dimensions m × l, where m = |B| and l = 4 for the 4 individual features defined above.

M_s^T =
[ DEP(B_1)  DEP(B_2)  ...  DEP(B_m) ]
[ MAG(B_1)  MAG(B_2)  ...  MAG(B_m) ]
[ ENG(B_1)  ENG(B_2)  ...  ENG(B_m) ]
[ LNG(B_1)  LNG(B_2)  ...  LNG(B_m) ]

Matrix M_s describes our individual features for all branches of s. Combined with our aggregate features, we have a compact yet detailed portrayal of a submission's entire conversational dynamic.
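The sketch below shows one way the four individual features and the two entropy-based aggregate features could be computed from T and B as defined above, yielding the matrix M_s. It builds on the earlier hypothetical helpers and treats magnitude as the largest number of direct replies received within a branch, matching the breadth reading of the in-degree formula; this is our interpretation, not the authors' code.

```python
import numpy as np
import networkx as nx
from collections import Counter

def branch_features(T, branch, root_id):
    """DEP, MAG, ENG, LNG for a single branch B_i (n0 is included in branch)."""
    hops = nx.shortest_path_length(T, source=root_id)        # path lengths from n0
    responses = [n for n in branch if n != root_id]
    dep = max(hops[n] for n in responses)                    # 1) depth
    mag = max(T.out_degree(n) for n in branch)               # 2) magnitude: most direct replies
    eng = len({T.nodes[n].get("user") for n in responses})   # 3) engagement: distinct users
    times = [T.nodes[n]["time"] for n in responses]
    lng = max(times) - min(times)                            # 4) longevity
    return [dep, mag, eng, lng]

def aggregate_entropies(T, root_id):
    """S1 over message authors, S2 over (respondee user, responder user) edge pairs."""
    def entropy(items):
        counts = np.array(list(Counter(items).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log(p)).sum())
    authors = [T.nodes[n].get("user") for n in T.nodes if n != root_id]
    pairs = [(T.nodes[a].get("user"), T.nodes[b].get("user")) for a, b in T.edges()]
    return entropy(authors), entropy(pairs)

def submission_matrix(T, B, root_id):
    """M_s: one row per branch, columns (DEP, MAG, ENG, LNG)."""
    return np.array([branch_features(T, b, root_id) for b in B], dtype=float)
```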
IV. VALIDATING RESPONSE FEATURES

With the methods for conversational characterization defined, a necessary next step is to validate the actual effectiveness of the selected features in the scope of our dataset. How accurately do these features depict a Reddit submission? Are the features capable of automatically extracting and representing underlying characteristics of submissions that can be empirically confirmed?

A. Genre classification

The first question can be answered through Subreddit classification. Given a few distinct Subreddits to act as labels, can we use the extracted response features from a subset of submissions to train a classifier to distinguish between these labels? To run this test we consider the Subreddits r/politics, r/gaming, r/soccer, and r/atheism. We curated a set of 1000 submissions from each Subreddit, only selecting submissions whose total comment counts ranged from 1000 to 3000 comments. Then for each submission we calculate M_s and the aggregate features. We condense M_s by finding the maximum depth (the height of the tree), average magnitude, average engagement, and average longevity for each submission. We map the submission's measures into R^6 feature space by taking these values and appending the aggregate features to the set. We can represent this process as

f = < max_B(DEP), avg_B(MAG), avg_B(ENG), avg_B(LNG), S_1, S_2 >
  = < f_1, f_2, f_3, f_4, f_5, f_6 >

where vector f is our feature vector.
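A short sketch of this condensation step, assuming the M_s matrix and the entropies from the previous sketches:

```python
import numpy as np

def feature_vector(M_s, S1, S2):
    """f = <max(DEP), avg(MAG), avg(ENG), avg(LNG), S1, S2> in R^6."""
    dep, mag, eng, lng = M_s[:, 0], M_s[:, 1], M_s[:, 2], M_s[:, 3]
    return np.array([dep.max(), mag.mean(), eng.mean(), lng.mean(), S1, S2])
```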
Pairing f with its associated label R_i, where R_i ∈ R, we then split the assembled dataset into 70%/30% training/testing subsets and train a Support Vector Machine classifier (SVM) [4]. For each element in the test subset the SVM attempts to correctly label R given f for s. A positive match results in a score of 1 for s, and a negative match results in a 0. We present the average score for the entire test subset as the classifier score for the SVM. Table 2 shows the classifier scores between several label sets.

R                          Classifier Score
politics, gaming, soccer   0.86
politics, gaming           0.99
politics, soccer           0.94
gaming, soccer             0.89
politics, atheism          0.78

Table 2: Classifier scores for combinations of Subreddit labels

Note that the average score is 0.89. This indicates that the patterns of conversation are distinctly different between Subreddits due to their themes. In general, people will not talk about politics in the same ways they will talk about gaming. This innate difference in conversation style contributes enough to the overall structure that an SVM can accurately classify a submission using only the six response features. In addition to this result, the scores in Table 2 can also be seen as a high-level quantification of the similarities and differences between themes. For example, while the classification score between r/politics and r/gaming is 0.99, the score between r/politics and r/atheism is 0.78. Intuitively, the score difference makes sense. While intense debates are not unknown to the world of gaming, the depth and intensity of the debates that sprout up from every aspect of politics provides a clear dichotomy in conversational structure. However, the Atheism Subreddit also provides an environment for fierce debate due to its religious connotations, which might be consistent enough to be more similar to politics.
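A hedged sketch of the classification test with scikit-learn: the paper specifies an SVM [4] and a 70%/30% split, while the feature scaling and the particular estimator settings below are our own assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def classifier_score(X, y):
    """X: one 6-dimensional vector f per submission; y: Subreddit labels."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0)   # 70/30 split
    clf = make_pipeline(StandardScaler(), SVC())             # scaling is our addition
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)                         # fraction labeled correctly
```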

B. Topic Clustering

Although classification is a fairly robust means to test the effectiveness of a set of feature labels, there are many other ways to evaluate the features. In particular, these features are meant to represent the patterns of user response within a submission. Knowing this, another effective form of testing might be to use the features to attempt to find underlying patterns that differentiate sub-clusters of submissions. If this is possible, it would indicate that the response features themselves are rich and informative enough to extract previously undetected trends. To test for potential sub-clusters, we evaluate a set of submissions under a single R_i. Using K-means clustering paired with the Silhouette Score [5] to determine K, we can find distinct clusters among the submissions of the same label. We define these clusters as C = {C_1, C_2, ..., C_K}, where K is the cluster number given the highest average Silhouette Score. Figure 2 presents the results of this exercise for 1000 submissions within the r/hockey Subreddit, with the clustered submissions projected into 2D Euclidean space using Principal Component Analysis (PCA). We reduce dimensions after K-means for the sake of visualization, to illustrate the distinct differences between groups due to response features.

Figure 2: PCA of response features for 1000 submissions from r/hockey, clustered by K-means (K=2).

Keyword      Cluster 1 Freq.   Cluster 2 Freq.
thread       204               1
playoff      76                1
series       35                22
friday       1                 31
trash talk   1                 32

Table 3: Comparison of keyword frequency between the two identified topic clusters in r/hockey

As seen in Figure 2, there are two apparent clusters, delineating a difference in conversational structure within the Subreddit itself. And as seen in Table 3, keyword extraction of the submissions involved in each cluster reveals that there is an underlying theme that separates the two clusters. Cluster 1 contains submissions pertaining to "Game Thread" discussions, while Cluster 2 is primarily related to trash talk. Intuitively we know there is an explicit difference between the two in how they are carried out in conversation, and this understanding is captured by K-means. By extension, this system of submission clustering can be considered a form of non-semantic topic modeling, as it identifies stable topics exclusively through conversational dynamics.

In addition to this, clustering response features can also highlight underlying differences between similar topics. We will consider an example by clustering 1000 submissions from r/soccer. As seen in Figure 3, there are 6 identified clusters, with 3 of them containing the majority of the submissions.

Figure 3: PCA of response features for 1000 submissions from r/soccer, clustered by K-means (K=6).

By analyzing the text of the involved submissions, we can identify the textual differences in topic between these primary clusters. While C4 contains only general discussions, C2 and C5 consist exclusively of "Match Thread" submissions. And aside from the fact that C2 and C5 contain entirely different sets of involved soccer teams in their submission text, there is no other clearly apparent textual disparity between these "Match Thread" clusters. But there is a definite underlying difference. Supplemental analysis of the comment structure reveals that submissions in C2 are often subject to more extensive debate, as opposed to submissions in C5, which tend to only have surface-level comments. Thus while there were few text-based dissimilarities between the two clusters, the response features were still able to identify additional latent differences in these similar topics due to user reactions. And although this auxiliary information is not necessarily useful for a purely keyword-based approach, it could be quite beneficial for more in-depth analyses of events by accentuating contrasts in audience feedback within even single topics.
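A sketch of the clustering pipeline used in this subsection: K is chosen by the highest average Silhouette Score [5], K-means assigns cluster labels, and PCA projects the points to 2D only for visualization. The candidate range for K and the feature standardization are our assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_and_project(X, k_candidates=range(2, 11)):
    """Return (best K, cluster labels, 2D PCA coordinates) for feature vectors X."""
    Xs = StandardScaler().fit_transform(X)
    def avg_silhouette(k):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
        return silhouette_score(Xs, labels)
    best_k = max(k_candidates, key=avg_silhouette)
    labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(Xs)
    coords = PCA(n_components=2).fit_transform(Xs)   # for plots like Figures 2 and 3
    return best_k, labels, coords
```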
V. TIME-DEPENDENT RESPONSE FEATURES

As seen in the previous section, response features describe not only overall response trends between themes, but also model the more subtle differences within themes. But before we consider applications using these static features, we can explore an even more dynamic representation of user interactions: time-dependent response features. While measuring the static network of a complete information cascade provides a rich set of features, evaluating the information as it evolves allows us to capture these same features as they change with the cascade. This approach of extracting time-dependent response features can potentially yield an even more in-depth understanding of the cascade.

A. Definition

Instead of analyzing a static graph tree network for a submission, we now look at the network as a temporal graph T_t(N_t, E_t), where T_t evolves with time t. t is defined within the time range t_0 < t < t_max, where t_0 is the time at which the submission was created and t_max is the time of the last recorded response. A temporal graph means that N_t ⊆ N and E_t ⊆ E, as there are messages and edges still to be created. There is also only a subset of existing branches at t, represented by B(t) ⊆ B(t_max), with a current subset of involved nodes. Now taking the available branches and the temporal network itself, we can take a static snapshot at time t to extract a row of response features.

B. Tests

Given this approach, we now have the ability to obtain a time-dependent response feature matrix F of dimensions t × l, where t ≤ t_max is used as a hyperparameter for our tests, and where l = 6, representing the size of our static feature space at time t.

F =
[ f_{1,1}  f_{1,2}  f_{1,3}  ...  f_{1,6} ]
[ f_{2,1}  f_{2,2}  f_{2,3}  ...  f_{2,6} ]
[   ...      ...      ...    ...    ...   ]
[ f_{t,1}  f_{t,2}  f_{t,3}  ...  f_{t,6} ]
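A sketch of how F could be assembled, reusing the hypothetical helpers from the earlier sketches: the event list is truncated at each snapshot time and the six static features are recomputed on the truncated tree T_t. The choice and spacing of the snapshot times is an assumption.

```python
import numpy as np

def feature_matrix(events, root_id, snapshot_times):
    """F: one row of the six response features per snapshot time."""
    rows = []
    for t in snapshot_times:
        visible = [ev for ev in events if ev["created_utc"] <= t]   # T_t holds only existing messages
        T = build_tree(visible, root_id)
        B = branches(T, root_id)
        if not B:                         # no responses recorded yet at this time
            rows.append(np.zeros(6))
            continue
        S1, S2 = aggregate_entropies(T, root_id)
        rows.append(feature_vector(submission_matrix(T, B, root_id), S1, S2))
    return np.vstack(rows)
```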

To evaluate the effectiveness of F, we consider two tests: classification by Subreddit, and classification by topic cluster. Though we are testing Subreddit classification again, our methodology for this test is different. We gather the same curated list of 1000 labeled submissions for r/politics, r/gaming, r/soccer, and r/atheism, and randomly choose two of those Subreddits to actively use. We then truncate their submission lists to a size of 100. Finding F for each submission for some t, we consider each submission using a Leave-One-Out approach. Thus given a test submission F_test with a hidden label R_test, we find the Euclidean Distance between the test submission's feature matrix and some other feature matrix. This measure is expressed by

d(F^X, F^Y) = (1/t) Σ_{t_step=0}^{t} Σ_{j=1}^{6} | f'^X_{t_step,j} − f'^Y_{t_step,j} |

where F^X and F^Y represent some arbitrary feature matrices, and f' indicates that the feature row at t_step has been standardized. So with a set of feature matrices and F_test, we find

F_target = argmin_{F_i ∈ F} d(F_test, F_i)

Essentially, we are performing feature matrix matching. With F_target being the closest to F_test, we take F_target's label R_target and assign it as R_test. Finally, we compare the real and predicted R_test. We repeat this process for topic clusters, but instead of looking for a Subreddit label, we try to predict to which cluster C_test ∈ C our current F_test belongs.

The results for both tests, given some set of C labels and some combination of R labels, each for a time range t, are presented in Table 4 and Table 5, respectively. For both tables the results hold that even for t = 3 the classification score is relatively high. This demonstrates that even within the first few hours the innate behaviors exhibited by conversing users are characteristic enough to be separable by both overarching theme and even by topic.

Timesteps   Gaming Score   Soccer Score
1           0.55           0.55
2           0.65           0.68
3           0.66           0.80
4           0.70           0.85
5           0.75           0.90
6           0.90           0.96

Table 4: Scores of K-means cluster classification for Subreddits Gaming (K=7) and Soccer (K=6)

Timesteps   Gaming, Politics   Soccer, Gaming
1           0.83               0.77
2           0.66               0.77
3           0.82               0.87
4           0.80               0.87
5           0.84               0.90
6           0.85               0.92

Table 5: Scores of Subreddit label classification for label sets, |R| = 2.
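A sketch of the matrix-matching step: each feature row is standardized, the distance above is computed against every candidate matrix, and the label of the nearest matrix is assigned to the test submission. How exactly the rows are standardized is not specified, so the per-row z-scoring below is our assumption.

```python
import numpy as np

def standardize_rows(F):
    """Standardize each timestep row of F (assumed: z-score across the six features)."""
    mu = F.mean(axis=1, keepdims=True)
    sigma = F.std(axis=1, keepdims=True)
    return (F - mu) / np.where(sigma == 0, 1.0, sigma)

def nearest_label(F_test, candidates):
    """candidates: (F_i, label_i) pairs with the same number of timesteps as F_test."""
    Ft, t = standardize_rows(F_test), F_test.shape[0]
    def distance(F_i):
        return np.abs(Ft - standardize_rows(F_i)).sum() / t
    F_target, label = min(candidates, key=lambda c: distance(c[0]))
    return label
```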
VI. APPLICATIONS

Now that we have shown the accuracy and effectiveness of response features in both static and time-dependent settings, we can begin considering the applications within the field of information cascade analysis. Given the robustness of the response features and the breadth of the field of analysis, there is a wide variety of viable applications available. In this section we consider two of the more straightforward applications: tail forecasting and outlier detection. For simplicity, we will refer to the models developed in the following subsections as resF models, as they incorporate response features as an integral component.

A. Tail forecasting

In the previous section we illustrated how powerful the time-dependent response features are, even for a small t. We can take advantage of feature matrix matching to help us predict the tail of a cascade and complete its time-series. To utilize features for such a task we consider the tests we ran in the previous section. Given some initial time t = t_limit, we find the closest feature matrix F_target to our test submission. At this point, we take the response features of F_target for the time range t_limit < t ≤ t_max to predict the remaining points in the test submission's time-series. Now the question arises: how do we reverse-engineer a time-series from response features?

Considering we know the time-series closest in similarity to our test submission, we can now use its response features to outline the constraints that must be followed by the time-series of the test submission as it evolves. We take the rows of F_target for t > t_limit and add them to the bottom of the incomplete matrix F_test, whose original dimensions were t_limit × 6. We then analyze the actual cascade that derives F_target to find the current number of branches at time t: β(t) = |B(t)|. With this and the expanded feature matrix F_test, we can outline our resF model for forecasting the incomplete tail of a test submission's time-series: at timestep t, iterate through the β(t) branches. For each branch, we sample depth, magnitude, engagement, and longevity from the closest submission's M_s. If these values are smaller than their previous ones, nothing changes for the branch. If the values are greater, then the branch is updated by adding nodes and edges to match the sampled values. This process is repeated for every branch at each timestep until t_max.

As a baseline, we also train an autoregressive integrated moving average (ARIMA) model [6] to predict the tail-end of test submission time-series. We compare the ARIMA model and the resF model on a selection of randomly selected time-series from r/gaming and r/politics where t_limit = 3. The predicted curves are evaluated against the ground truth using the Kolmogorov-Smirnov (KS) test [7] and Root-Mean-Squared-Error (RMSE). Figure 4 and Figure 5 show the results of the resF model compared to the predictions made by the ARIMA model for a submission from r/politics and a submission from r/gaming, respectively.

Figure 4: Forecast comparison for a r/politics submission where t_limit = 3.

Figure 5: Forecast comparison for a r/gaming submission where t_limit = 3.

In both comparisons, the resF model forecasts an accurate tail pattern, matching the tail's local maxima with significant consistency. Conversely, the ARIMA model's predicted tails tend to drop off quickly, missing a good portion of the natural taper. In terms of measurements, resF outperforms ARIMA as well, with KS D-values of 0.1166 and 0.1143, KS p-values of 0.7837 and 0.9672, and RMSE values of 4.972 and 11.57, in contrast to ARIMA's KS D-values of 0.5667 and 0.4571, KS p-values of 0.0001 and 0.0007, and RMSE values of 19.31 and 36.76. In a KS test, the null hypothesis indicates the two comparative samples are drawn from the same distribution. Thus, the high p-values and low D-values of the resF predictions imply the resultant empirical distribution functions are quite similar to the reference distribution functions derived from the ground truth. This conclusion, paired with the low RMSE values, demonstrates the predictive power of the resF model. It is important to note that tail forecasting can also be paired with the label classification method discussed in the previous section. Given a topic of the cascade, the set of potential F_target matrices can be filtered, allowing for more efficient and accurate predictions.

B. Outlier detection

One of the simplest yet most compelling applications is the use of the projection of the static response features for outlier detection. As illustrated in Section 4, we can identify topic clusters within a Subreddit using submission response features. Projecting these submissions into 2D Euclidean space as in Figure 2 or Figure 3 can visualize feature groupings that give us some insight into how similar submissions are in feature space. In addition to analyzing groupings, we can also pinpoint isolated submissions that lie apart from the clusters. The significant spatial distance indicates a distinct difference in the response features associated with an isolated submission. We can use this identifying factor to indicate submissions of potential interest in a large set of general submissions.

We test our method for outlier detection by analyzing 1000 submissions for r/gaming, again with a comment range of 1000 to 3000. Evaluating these submissions' response features and reducing dimensions using PCA produces Figure 6. While a majority of submissions can be found in C3 and C7, there are a few outliers. In terms of comparative distance, C5 is the most remote.

Figure 6: PCA of 1000 submission response features for the Subreddit r/gaming.

If we analyze the text of the largest groups of submissions, we find that C7 primarily contains video game giveaways, while C3 consists of more general discussions. C5, however, has a single submission whose text contains a story about a person who beat cancer and stayed emotionally strong thanks to video games. In the context of r/gaming this can be seen as a very appealing story, and the influx of supportive responses confirms this. This analysis emphasizes the indicative power of response features when it comes to capturing unique conversational cascades. Applied to a much larger, more general dataset for some arbitrary online social media platform, this approach would allow a researcher to identify a set of outliers. With these outliers pinpointed, they could then begin to find additional patterns that could inform them of the nature of virality for that particular platform.

VII. RELATED WORK

This paper began with a discussion of the paradigm of topic and sub-topic detection through user reactions. And while the idea of characterizing user reactions and behavior for the analysis of cascades is not new, its potential is primarily unrealized. As mentioned in Section 3, [2] introduces novel research that illustrates the stability of user topic bias in Twitter. Unfortunately, this stable representation of bias is only sparsely applied, being primarily used for the prediction of topic-specific influencers and the high-level analysis of Twitter's follower structure. Extracting user reactions has been seen as an essential tool for social media analysis for a while now as well. However, the use of reactions tends to be limited to NLP and the quantification of baseline interest (like number of comments per minute) [8]-[10]. In our paper, the idea of innate topic bias being exhibited through response behavior is expanded and used as the cornerstone for response feature design, allowing these features to avoid any semantic dependencies while completely overhauling the idea of how comments reflect interest.

The first two tests of these response features in this paper were on label classification and topic clustering, both of which can be considered a form of topic classification. Previously, topic classification model architecture was primarily built around the extraction of text [11], [12], though network structure has been used as auxiliary data to enhance the model [13]. A non-semantic approach to topic classification is novel, however, as the resultant model does not rely on any kind of text extraction methods to define topic groups. Instead, we harness user reactions for automatic classification of topic clusters, without a priori assumptions about topic modeling methods.

The response features outlined in this paper were also used in tail prediction of information cascades and the detection of submission outliers as application examples. Concerning the first of the two applications, the prediction of bursty information cascades is one of the central focuses of online social media analysis [14], and a necessary component of cascade prediction can be found in tail forecasting. Tail forecasting tests the ability to extract the latent trends of a bursty event and extrapolate the remainder of its lifetime, from predicting the magnitude of the initial peak to calculating the effect of additional shocks past the first spike. Most models tend to use a Hawkes process [15], [16] or follow some multi-feature architecture [17], [18]. Often the former outperforms the latter, given that the multi-feature approach often requires extensive feature engineering and ad hoc decisions on the learning hyperparameters. This approach also tends to lock the model into a certain platform or problem domain due to the potentially unique features that might end up being introduced into the model's training process. And while the resF model could be considered a multi-feature model in terms of cascade prediction, it does not suffer the common shortcomings mentioned above. The resF model has a small set of robust features that are independent of hyperparameters. And given resF's focus on simply leveraging a user's innate bias toward topics to predict cascades, this process can be generalized to any platform that has posts and comments that topologically represent isolated tree graphs.

Outlier detection in the broader scope of understanding viral or unique posts on online social media has also been a major focal point of research. This has usually been done through identifying influential users [19] or by analyzing the behavior of past trends [20], [21]. But these approaches can be refined by considering the response features of involved groups (i.e. followers or subscribers). Given that response features will automatically isolate unique posts, this phenomenon can be used to enhance methods of detection of both influential users and trending topics.

VIII. CONCLUSION

In this paper we introduced a novel approach to analyzing social media information cascades using only the audience reactions triggered by a source post. Following this approach, we designed unique features to represent different aspects of conversational dynamics. Using these measures as response features to characterize an information cascade, we validated the effectiveness of these representative features through label classification and topic clustering. After validation, we enhanced the static response features by introducing a temporal dependency, allowing us to analyze response features for a submission as it evolves with time. This time-dependent method was validated as well using label classification. Given these tools, we finally introduced some direct applications: tail forecasting for evolving time-series, and outlier detection. Both of these applications produced accurate and valid results, further indicating the effectiveness of harnessing response features. Overall, a reaction-based approach to information cascade analysis is an effective tool for non-semantic topic classification, time-series prediction, and much more.

IX. ACKNOWLEDGEMENTS

This work was sponsored in part by DARPA under contract W911NF-17-C-0099, the Army Research Office (ARO) under contract W911NF-17-C-0099, and the Office of Naval Research (ONR) under grant N00014-15-1-2640. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

REFERENCES

[1] J. Hessel, C. Tan, and L. Lee, "Science, askscience, and badscience: On the coexistence of highly related communities," in Tenth International AAAI Conference on Web and Social Media, 2016.
[2] P. Bogdanov, M. Busch, J. Moehlis, A. K. Singh, and B. K. Szymanski, "The social media genome: Modeling individual topic-specific behavior in social media," in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 236-242, ACM, 2013.
[3] M. Kulisiewicz, P. Kazienko, B. K. Szymanski, and R. Michalski, "Entropy measures of human communication dynamics," Scientific Reports, vol. 8, no. 1, p. 15697, 2018.
[4] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, vol. 1. Springer Series in Statistics, New York, 2001.
[5] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, vol. 344. John Wiley & Sons, 2009.
[6] P. J. Brockwell, R. A. Davis, and M. V. Calder, Introduction to Time Series and Forecasting, vol. 2. Springer, 2002.
[7] F. J. Massey Jr, "The Kolmogorov-Smirnov test for goodness of fit," Journal of the American Statistical Association, vol. 46, no. 253, pp. 68-78, 1951.
[8] N. A. Diakopoulos and D. A. Shamma, "Characterizing debate performance via aggregated twitter sentiment," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1195-1198, ACM, 2010.
[9] B. D. Horne, S. Adali, and S. Sikdar, "Identifying the social signals that drive online discussions: A case study of reddit communities," in 2017 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1-9, IEEE, 2017.
[10] C. Castillo, M. El-Haddad, J. Pfeffer, and M. Stempeck, "Characterizing the life cycle of online news stories using social media reactions," in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 211-223, ACM, 2014.
[11] S. Wang and C. D. Manning, "Baselines and bigrams: Simple, good sentiment and topic classification," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pp. 90-94, Association for Computational Linguistics, 2012.
[12] C. Li, H. Wang, Z. Zhang, A. Sun, and Z. Ma, "Topic modeling for short texts with auxiliary word embeddings," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 165-174, ACM, 2016.
[13] K. Lee, D. Palsetia, R. Narayanan, M. M. A. Patwary, A. Agrawal, and A. Choudhary, "Twitter trending topic classification," in 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 251-258, IEEE, 2011.
[14] A. Guille, H. Hacid, C. Favre, and D. A. Zighed, "Information diffusion in online social networks: A survey," ACM SIGMOD Record, vol. 42, no. 2, pp. 17-28, 2013.
[15] Y. Matsubara, Y. Sakurai, B. A. Prakash, L. Li, and C. Faloutsos, "Rise and fall patterns of information diffusion: model and implications," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 6-14, ACM, 2012.
[16] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec, "Seismic: A self-exciting point process model for predicting tweet popularity," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1513-1522, ACM, 2015.
[17] A. Guille and H. Hacid, "A predictive model for the temporal dynamics of information diffusion in online social networks," in Proceedings of the 21st International Conference on World Wide Web, pp. 1145-1152, ACM, 2012.
[18] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec, "Can cascades be predicted?," in Proceedings of the 23rd International Conference on World Wide Web, pp. 925-936, ACM, 2014.
[19] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, "Everyone's an influencer: quantifying influence on twitter," in Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 65-74, ACM, 2011.
[20] S. Asur, B. A. Huberman, G. Szabo, and C. Wang, "Trends in social media: Persistence and decay," in Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[21] L. Hong, O. Dan, and B. D. Davison, "Predicting popular messages in twitter," in Proceedings of the 20th International Conference Companion on World Wide Web, pp. 57-58, ACM, 2011.