Master’s thesis
Czech Technical University in Prague
Faculty of Electrical Engineering F3 Department of Cybernetics
Twitter’s local trends spread analysis
Gustav Šourek
Program: Open Informatics Field: Artificial Intelligence
April 2013 Supervisor: Ing. Ondřej Kuželka
Acknowledgement
I would like to thank my supervisor, Ing. Ondřej Kuželka, for giving me the opportunity to work on this interesting topic, his patient guidance, constant inflow of new ideas and encouragement. My thanks also go to my family and friends for their continued support.
Declaration
I hereby declare that this thesis is the result of my own work and all the sources I used are in the list of references, in accordance with the Methodological Instructions on Ethical Principles in the Preparation of University Theses.
Gustav Šourek
In Prague, May 6th, 2013
Abstract
The potential value of predicting trends in social media rises with its growing dominance in our lives. Whereas many works focus on anomaly or trend detection, there is still little knowledge on the evolution of trend dynamics. Inspired by studies on infection diffusion through a social network, we propose an approach to predict the spread of trends within a local subnetwork of Twitter, exploiting the network structure information beyond the scope of previous works. We base the anomaly pattern representation on graph features, reflecting various local relational topology options in the context of trend presence. Using a machine learning algorithm, the extracted information is used to predict the future behavior of trends and is evaluated over several demarcated targets. The contribution of our graph approach is then measured against a baseline model that utilizes the same learning strategy, yet considers the trends as time series, absent any knowledge of the network topology. Moreover, several other approaches are tested for comparison.
The results show that the network structure plays an important role in the dynamics of trend spreading, as the topology information extracted via graph features improves the accuracy of the learner considerably, beyond the reach of the other methods tested. Further feature options and combinations can be considered for prospective improvements of the network-related approach.
Keywords: Twitter, local trends spreading, relational machine learning
Czech title: Analýza šíření lokálních trendů v sociální síti Twitter
Contents
1 Introduction
1.1 Motivation
1.2 Related work
1.3 Our approach
1.3.1 Overview
2 Social Networks
2.1 Introduction
2.2 Digital social networks
2.3 Twitter
2.4 Social Network Analysis
2.4.1 Levels of analysis
2.5 Trends spreading
3 Data acquisition
3.1 Crawling strategy
3.2 Twitter API
3.2.1 Functionality
3.2.2 Rate limiting
3.2.3 Limits workaround
3.3 Implementation
4 Data analysis
4.1 Crawled data
4.1.1 Statistics overview
4.1.2 Network structure
4.1.3 Trending topics
4.2 Data transformation
4.3 Time structures
4.3.1 Sequential representation
4.3.2 Sliding window
4.4 Graphs
4.4.1 Relations
4.4.2 Representation
5 Learning
5.1 Target classes
5.1.1 Motivation
5.1.2 Basic class
5.1.3 Top-K% metric
5.1.4 Expands class
5.1.5 Enters top-K class
5.2 Approaches
5.2.1 Simple learner
5.2.2 Baseline learner
5.2.3 Graph learner
5.2.4 User modeling
5.3 Evaluation
5.3.1 Classifiers
5.3.2 Cross-validation
5.3.3 Test set validation
5.3.4 Weka
6 Features
6.1 Base features
6.1.1 Frequency rankings
6.1.2 User features
6.2 Model features
6.3 Graph features
6.3.1 Relational features
6.3.2 Time features
6.4 Graph features creation
6.4.1 Isolated feature check
6.4.2 Feature set check
6.5 Isomorphism problem
6.5.1 Calculating invariants
6.5.2 Isomorphic mapping
6.6 Feature matching
6.6.1 Heuristic ordering
6.6.2 Search method
6.6.3 Set intersection speedup
7 Experiments
7.1 Settings
7.1.1 Sliding window properties
7.1.2 Top-k threshold
7.1.3 Datasets
7.2 Feature options
7.2.1 Ranking
7.2.2 User features
7.2.3 User modeling
7.2.4 Graph features
7.3 Results
7.3.1 Shows or stays
7.3.2 Top-k%
7.3.3 Expands
8 Conclusion
8.1 Future work
References
A Specification
B Used Terms
B.1 Acronyms
B.2 Software
C CD content
Tables
4.1. crawled datasets comparison
6.1. features reduction
Figures
2.1. depiction of social network
2.2. levels of analysis
3.1. crawling strategy
3.2. implementation of crawling
4.1. friends degree distribution
4.2. retweets degree distribution
4.3. betweenness centrality distribution
4.4. time-series trend occurrence
4.5. network subset trends spread
4.6. sliding window
5.1. top-k% prediction task
6.1. size 1 features
6.2. size 2 features
6.3. features of size 3
6.4. spread potential
6.5. causality correlation
6.6. isomorphism problem
6.7. feature searching algorithm
6.8. feature matching algorithm
7.1. overfitting of graph learner
7.2. shows in top-k window parameters
7.3. stays in top-k window parameters
7.4. precision target size ordering
7.5. top-k threshold influence
7.6. datasets change resilience baseline approach
7.7. datasets change resilience graph approach
7.8. ranking baseline
7.9. ranking graph features
7.10. user subset sampling
7.11. user modeling top-k%
7.12. relations selection accuracy
7.13. time features addition
7.14. relations selection top-k
7.15. learners accuracy collation
7.16. shows in top-k collation
7.17. stays in top-k collation
7.18. learners top-k% collation
7.19. all learners collation at top-k%
7.20. core learners collation at expands
Chapter 1 Introduction
1.1 Motivation
Human beings have been assembling themselves into social networks for thousands of years. By forming relations such as friendship, kinship or coworker relationships with other people, we become embedded in the various social networks that these relations give rise to. The study of the rules that govern how social networks are assembled, how they operate and how they affect our lives has gained much attention in various scientific fields. The dynamics of human interaction, social behavior, processes and phenomena have been subjects of research for decades, as they have proved able to provide means for further improvements in related disciplines, such as epidemiology, healthcare or social behavior studies.
In recent years, a new variety of social networks has emerged. With the coming of the Internet and digital media, vast amounts of data on virtual social networks have become available for study at a scale that was never reachable for empirical studies before. Moreover, the data from digital networks can often be obtained passively, without actually affecting the users. Massive streams of user-generated content in virtual social networks provide a great opportunity to analyze social behavior and the spread of information within the network.
One of the works that inspired this thesis studies how an infection spreads through a face-to-face friendship network and how information on the network’s structure can help to predict upcoming epidemics [1–2]. The positive results on early detection of the disease outbreak in that work have driven us to test the effect of social network structure beyond the scope of face-to-face relations and physical infections. The idea is that the means by which the structure of a network affects infection dynamics apply not only to epidemics of germs transmitted through face-to-face relations, but to all sorts of things spread by social contagion, such as behavior, emotions, ideas and trends.
Following this assumption, we want to turn the data collected from a virtual social network into valuable insights that could be used for predicting the spread of trending topics in a local network subset.
The problem of prediction and classification of phenomena in complex systems is ubiquitous in science, engineering and society. From classifying proteins by their function, through detecting suspicious behavior in financial transactions, to revealing emerging trending topics in social networks, extracting information from data is important for understanding the world around us. A common presumption in solving these tasks is that there is an underlying process generating the observed data. A model of such a process in a complex network is generally hard to determine, as most real-world systems, especially those involving human behavior, defy simple model descriptions. Since, in contrast to anomaly detection, the literature still lacks a theoretical model for the evolution of anomalies in the network [3], we refrain from reasoning about particular social models and embrace the complexity of the data instead. To that end we utilize machine learning approaches building on general classification models, which we feed with particular features reflecting selected topology properties in order to effectively exploit the network structure information.
1.2 Related work
Although at the time of this thesis’s inception there was only little work published regarding the spread of Twitter trends, a number of related works were published while the project was in progress (2012).
Most closely related are structure-aware approaches, which use basic local network properties to model users’ behavior. Tsur and Rappoport [4] used a strategy based on linear regression, combining temporal and topological features, to predict the spread of an idea in a given time frame, demonstrating the importance of the content of the idea. In a similar work, Ruan et al. [5] combine multiple dimensions into one regression-based prediction framework, emphasizing the influence of network structure, user interaction and content characteristics over simple past activity features. In both these works, the network features are based on local user connectivity statistics, e.g. the average and maximal number of friends, and the corresponding interaction among users measured by the number of retweets.
The rest of the related trend prediction techniques, applied within and outside the scope of Twitter, are unaware of the network structure. Treating trend prediction as a time series problem, these approaches can be differentiated by whether or not they use an explicit model of the trend spread process.
The first type represents a popular strategy for detecting emergent trending topics and general network outbreaks. A recent work by Stanislav Nikolov (MIT and Twitter) [6], which gathered much attention in the media, predicts global Twitter trends using a stochastic model specified by a small collection of unknown “latent sources”. Relating these sources to topic observations, without using any information on the network structure, it reports very good results. In [7], Gupta et al. introduce a framework to experiment with various features encoding trend dynamics and utilize regression, classification and hybrid approaches to predict event popularity. Altshuler et al. [3] present an analytic model for the social diffusion dynamics of spreading network patterns, based on information diffusion models. In [8], Wang et al. use a branching process model to provide a theoretical basis for the formation, persistence and decay of trends on Twitter at a global scale and discover some factors influencing the spread of trends, such as user activity and resonance. In another work, Shtatland and Shtatland [9] utilize the susceptible-infectious-recovered (SIR) model, typically used for predicting the progress of an epidemic in a large population, to detect outbreaks of anomalous phenomena by means of the model’s stationarity.
In the latter type, with no explicit use of a process model, there are approaches utilizing clustering, e.g. the work of Becker et al. [10], where messages are arranged into meme 1) and real-event clusters based on their similarities, combining temporal, topical and social features. Moreover, there are methods based on trajectory clustering, operating on large collections of time series denoting topic frequencies. These include simple “nearest trajectory” strategies as well as more sophisticated approaches that use low-density areas between the trajectories to identify the number of clusters, as in the work of Murthy et al. [11].
1) From Greek “mimo”, meaning “to imitate”: an idea, behavior, or style that spreads from person to person within a culture.
Despite the number of approaches to Twitter trend prediction that have recently emerged, we are unaware of any work that investigates prediction capabilities on Twitter using structural information more complex than simple local connectivity measures.
1.3 Our approach
As the network structure has been shown to play an important role in the spreading of social trends [1, 4–5], we want to exploit the effect of social network topology beyond the scope of previous works. Inspired by the creation of network signatures from graphlet degree distributions [12] in biological networks, we use a similar representation to reflect the presence of a trend within our network. To that end we create graph features, i.e. small connected subgraphs representing various local relational topology options, and measure their presence in the network by means of subgraph matching. The network trend signatures, calculated from the frequencies of the respective feature occurrences, are then submitted as attribute vectors to a machine learning algorithm, specifically a Random Forest model. This model is finally trained on a network dataset collected from Twitter, and various experiments and tunings are performed to reason about the contribution of the features.
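To make the notion of a trend signature more concrete, the following minimal sketch counts a handful of tiny patterns around the users who adopted a topic and arranges the counts into a fixed-length attribute vector. The chosen patterns (single directed edges labelled by trend presence, and a three-user chain), the names `trend_signature`, `FEATURES` and `as_vector`, and the toy data are illustrative assumptions only; they are not the feature set or matching algorithm used in this thesis, which are described in Chapter 6.

```python
from collections import Counter


def trend_signature(edges, trending):
    """Count occurrences of a few small patterns in a directed graph.

    edges    -- iterable of directed (source, target) user pairs (assumed format)
    trending -- set of users who used the topic in the observed window
    """
    edges = set(edges)
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    signature = Counter()
    # Size-2 patterns: one directed edge, labelled by trend presence at both ends.
    for src, dst in edges:
        signature[("edge", src in trending, dst in trending)] += 1
    # Size-3 pattern: a directed chain a -> b -> c of three distinct trending users.
    for a in trending:
        for b in successors.get(a, ()):
            for c in successors.get(b, ()):
                if b in trending and c in trending and len({a, b, c}) == 3:
                    signature["chain"] += 1
    return signature


# Fixed feature order, so that every window yields a vector of the same layout.
FEATURES = [
    ("edge", True, True),
    ("edge", True, False),
    ("edge", False, True),
    ("edge", False, False),
    "chain",
]


def as_vector(signature):
    """Turn a signature into an attribute vector (missing patterns count as 0)."""
    return [signature[feature] for feature in FEATURES]
```

A vector of this kind, computed once per topic and time window, is the sort of input that could be handed to a general classifier such as a Random Forest.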
1.3.1 Overview
The following chapters describe the respective parts of the thesis:
- Chapter 1 provides the thesis introduction, motivation and related work.
- At the beginning of Chapter 2 the concept of social networks is introduced, specifically its digital type, and the reader becomes acquainted with Twitter and related definitions. Further, the field of social network analysis is outlined and related to our task of trends spreading.
- Chapter 3 describes the strategy and the means by which we acquire the data. A part is also dedicated to the implementation of the respective program.
- The collected data are analyzed in Chapter 4 to provide insights into their structure and the information included. Next, the data transformations necessary for subsequent learning are briefly outlined.
- Chapter 5 concerns the learning itself. It introduces the various learning approaches to be tested, lays out the targets to be achieved, and describes how to evaluate them.
- Chapter 6 gives an account of the features associated with the learners previously introduced. It describes their options, properties and enhancements, with special regard to graph feature creation and matching.
- Experiments are presented in Chapter 7, where the annotated settings and feature options are tested for performance. Finally, the results of all the learners are compared with respect to the selected targets.
- Chapter 8 provides the final conclusion of the thesis and outlines some further work and ideas.
Chapter 2 Social Networks
2.1 Introduction
A social network is a theoretical construct used in the social sciences to study relationships between social units, such as individuals, groups, organizations or even entire societies. The purpose of this construct is to describe the structure of the network determined by the interactions between these units. These interactions are usually represented as a complex set of dyadic 1) ties. This also motivates the relational approach often used in the study of social networks, as in our case in Section 5.2.3. The idea behind studying phenomena in social networks, such as trend spreading in our case, is that they are primarily caused by these relations rather than by the units themselves, and thus we should focus on the properties of these relations. This gives rise to a common criticism of social network theory, namely that the properties of individuals and individual agency, i.e. the capability of units to act independently and make free choices, are often ignored [13]. Still, social network analysis has become useful in a broad range of research enterprises such as economics, geography and organizational studies. On the academic side, a large amount of knowledge has accumulated on the formation and dynamics of these networks, fueled by the easy availability of data and the regularities found in the statistical distribution of nodes and links within these networks [14].
2.2 Digital social networks
Digital social networking has emerged in recent decades with the advance of new Internet-based technologies. It refers to the means of interaction among people in which they create, share and exchange information within their respective virtual communities. The main factor that gave rise to digital social networks was the coming of Web 2.0, allowing the creation and exchange of user-generated content. Digital social networks in the form of social media differ from traditional media in many aspects that stem from the use of the Internet, such as reach, frequency and permanence. They can take different forms, such as Internet forums, microblogging, wiki sites and authentic social networks like Facebook or Twitter, the latter being the subject of study in this thesis (see Section 2.3). Authentic social networks are also the most influential type of social media, and with interfaces that allow people to follow the lives of friends, acquaintances and families, the number of people in social networks has grown exponentially since the turn of this century [14].
1) From Greek “dýo”, meaning “two”: describes the interaction between a pair of individuals.
2.3 Twitter
Twitter is an online social networking service that enables its users to send and read text-based messages of up to 140 characters, known as “tweets”. At the same time it enables users to connect to others through the follows relationship. The users that a particular user is following through this relation are referred to as his friends. The users who, on the other side, are following the particular user are referred to simply as his followers.
Tweets posted by a particular user are stored and displayed as a chronological sequence in the user’s timeline. Each tweet being posted is also broadcast to the user’s followers. Tweets are public by default, which means that anyone can list them through Twitter’s search engine or other Twitter API facilities (see Section 3.2) and join the related conversation. Moreover, Twitter users can also engage in a direct conversation with each other.
As for the information content, users can group posts together by type through the use of hashtags, i.e. words or phrases prefixed with a “#” sign, referring a tweet to the specified topic. Tweets marked with hashtags receive special treatment in Twitter’s engine and can be easily searched for. Hashtags allow Twitter to effectively organize conversations and topics.
Moreover, there are other interactions between users and the information content. A user can either indirectly join a topic by using a hashtag in his tweet or respond directly to another user’s tweet. Tweets can also be reposted and shared on the user’s own timeline, which is referred to as a retweet and symbolized by an “RT” sign at the beginning of the message. The last interaction between users that leaves a direct trace in the message is the user mention, which uses an “@” sign to declare a searchable reference to another user’s profile.
A word, phrase or topic that is tagged at a greater rate than other tags is said to be a trending topic (see trends in Section 4.1.3).
Twitter displays trending topics in a special list to inform users about what is happening in the world, similarly to regular news media. A trending topic on Twitter can emerge through the relations representing the information channels in the network, or from an outer event that prompts users to talk about it.
Trending topics are sometimes the result of concerted efforts by fans of certain celebrities or cultural phenomena, particularly musicians. To prevent manipulation of this type, Twitter has altered the trend algorithm in the past, making the selection of trending topics non-transparent [15].
Finally, as there are many Twitter-specific terms used throughout the thesis, we list them here for reference:
- Tweet: a text-based message that users post
- Hashtag: a form of metadata tag used in a tweet to signify a reference to some topic
- Timeline: a chronological sequence of a user’s tweets
- Follower: a user that receives, by broadcast, tweets from the referred user
- Friend: a user that the given user subscribes to in order to receive tweets
- Retweet: a repost of another user’s tweet
- Respond: a direct reaction to another user’s tweet
- Mention: a reference to another user’s profile included in a tweet
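The textual markers described in this section (“#”, “@” and a leading “RT”) can be recovered from the raw text of a tweet. The sketch below illustrates this with regular expressions; the patterns and the function name `parse_tweet` are simplifying assumptions made for illustration and do not describe how the data in this thesis were actually processed.

```python
import re

# Simplified patterns (assumptions): a marker followed by word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT\s+@(\w+)")


def parse_tweet(text):
    """Extract hashtags, mentioned users and the retweeted user from tweet text."""
    retweet = RETWEET.match(text)
    return {
        # Hashtags are lower-cased so that "#Prague" and "#prague" count as one topic.
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
        "mentions": MENTION.findall(text),
        # None when the message does not start with the "RT @user" convention.
        "retweet_of": retweet.group(1) if retweet else None,
    }
```

Note that such naive patterns would also match, for example, the “@” inside an e-mail address, so they should be read as a demonstration of the conventions rather than as a robust parser.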
2.4 Social Network Analysis
Social network analysis (SNA) views social units and relationships by means of network theory. It denotes the study of network structure and its effects on social and cultural aspects. Some of the commonly studied tasks are to determine whether a given social network is tightly bounded, diversified or constricted, to find its density and clustering, and to study how the behavior of network members is affected by their positions and connections [16]. In terms of SNA, each social network consists of two main entities:
- Nodes: represent individual actors within the network
- Ties: represent relationships between the individuals, such as friendship
These networks are then often depicted in diagrams, where nodes are represented as points and ties as lines between them, as illustrated in Figure 2.1.
Figure 2.1. An example of a diagram commonly used in SNA for depiction of social network structure (generated from a random subset of our Twitter network).
Such a visual representation of social networks is an important tool for understanding the networked data and conveying the results of the analysis [16].
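The node and tie view of a network also lends itself to simple computation. As a minimal sketch, the code below stores undirected ties as pairs of nodes and computes two of the quantities mentioned in this section, the network density and the local clustering coefficient of a node. The function names and the undirected simplification are assumptions made for the example; the relations on Twitter used later in the thesis are directed.

```python
def build_neighbours(ties):
    """Map every node to the set of nodes it shares an (undirected) tie with."""
    neighbours = {}
    for u, v in ties:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return neighbours


def density(nodes, ties):
    """Fraction of all possible undirected ties that are actually present."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(ties) / (n * (n - 1))


def clustering(node, neighbours):
    """Fraction of pairs of the node's neighbours that are themselves tied."""
    nbrs = sorted(neighbours.get(node, ()))
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i, u in enumerate(nbrs)
        for v in nbrs[i + 1:]
        if v in neighbours[u]
    )
    return 2 * links / (k * (k - 1))
```

For a network of four nodes with the ties a-b, b-c, a-c and c-d, four of the six possible ties are present, so the density is 2/3, and node c, whose three neighbours share a single tie, has a clustering coefficient of 1/3.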
2.4.1 Levels of analysis
A social network is a complex emergent structure in which local interactions of the units that make up the system create global patterns of the phenomena we are interested in. As the size of the network increases, these patterns become more apparent. However, there are practical limitations to global network analysis, and a careful choice of scale is important, as it will ultimately influence the quality of the information derived from the network.
Most social networks are viewed as individual people taking part in interpersonal relationships with others. Often these networks then become “social facts” and take on a life of their own. For instance, a family as a network of close relations among a set of people has been institutionalized and given a name and a reality beyond that of its component nodes. Similarly, individuals in their work relations may be seen as nested within organizations. Neighborhoods, communities, and even societies are, to varying degrees, social entities in and of themselves. And, as social entities, they may form ties with the individuals nested within them, and with other social entities [17].
Even though there are social network methods suited for such multiple levels of analysis upon multi-modal data structures, it must be stated that analysts rarely take much advantage of them. The most common modalities used in modern network analysis usually work with reductionism at three levels [18]:
- Micro level: works with individuals in a particular social context, tracing relationships to create bigger units.
- Meso level: works with a population size that falls between the micro and macro levels and reveals the connection between them.
- Macro level: works with large populations and analyzes the general outcomes of interactions.
In our analysis of local trends spreading on Twitter, we focus on the micro level when creating graph features and reasoning about their properties in Section 6.3, and on the meso level when considering the occurrence of trends and their competition in network subsets in Section 4.1.3. A depiction of these levels can be seen in Figure 2.2.
Figure 2.2. Micro level focusing on relations (on the left) and meso level focusing on trend occurrence (on the right) in our social network analysis.
2.5 Trends spreading
Trends in social networks are elements of communication whose frequency of occurrence in the respective media is significantly higher than that of others. Within modern digital media networks, these can take different forms, such as Twitter’s trending topics (see Section 2.3) or phrases, pictures and videos that are shared amongst the users. The formation of a trend can be rooted in an external event that prompts people to talk about it, or it can have no apparent cause and spread virally through the ties of the network. Trends in the information network are an important indicator of what is happening amongst the users and are often used by social media analysts for statistical processing. The information derived can then be applied to improve social network services and to create social marketing strategies.
Sometimes the spread of trends is even manipulated by the concerted effort of a set of users, such as fans of celebrities or marketers. Understanding the dynamics of trends spreading, and thus gaining the potential to predict it, could therefore be of great informational value to everyone involved in the social media environment. The importance of the ability to predict social trends has been growing rapidly in the past few years with the growing dominance of social media in our everyday life. Whereas many works focus on the detection of anomalies in networks, there exists little theoretical work on predicting the likelihood that an anomalous network pattern will spread globally and become a “trend” [3].
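The frequency-based notion of a trend used in this section can be illustrated by a short sketch that counts hashtag occurrences in a set of tweets and returns the most frequent ones. Taking the k most frequent hashtags as the trends is only one possible reading of “significantly higher frequency”, and the function name `top_k_trends` and the input format are assumptions made for this example.

```python
from collections import Counter


def top_k_trends(tweets_hashtags, k):
    """Return the k most frequent hashtags.

    tweets_hashtags -- iterable of hashtag lists, one list per tweet (assumed format)
    """
    counts = Counter(
        tag.lower() for tags in tweets_hashtags for tag in tags
    )
    return [tag for tag, _ in counts.most_common(k)]
```

Applied to the tweets collected within one time window, such a ranking gives a simple picture of which topics currently dominate the observed part of the network.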
Chapter 3 Data acquisition
To test our graph-feature-based approach we chose data from the social network Twitter, mainly for its public availability. The desired information on users can be accessed through the Twitter API. To download the data we implemented a program managing the data flow from Twitter to a database where the data are finally stored. The sampling strategy, i.e. the choice of the set of users about whom we download the information, is determined by the approach to crawling the network, described in Section 3.1. The implementation of the whole process is then discussed in Section 3.3.
3.1 Crawling strategy
The crawling strategy defines how we proceed with gathering the users in our scope. In social network analysis there are various corresponding methods of network data collection, commonly divided into ego-centric and socio-centric categories. In the ego-centric approach we are interested in a set of specific persons referred to as “egos”, and the social network is constructed from references to their affiliates, referred to as “alters”. The socio-centric approach, on the other hand, focuses on whole-network analysis by measuring structural patterns of interaction and explaining the outcomes, regardless of the choice of individual persons. The underlying assumption is that members of a group interact more than would a randomly selected group of similar size [19].
This assumption corresponds to what we are trying to achieve, i.e. to determine the global outcome, in our case predicting the top trends, by measuring the structural patterns of interactions between users. Thus, following the socio-centric assumption, sampling a group of highly connected users is desirable, since the interactions between them should have a bigger effect on the global outcome than those within a uniformly sampled group.
To follow this policy we altered a technique commonly used in SNA called “snowball sampling”, where existing study subjects recruit future subjects from among their acquaintances, similarly to breadth-first search (BFS) used in computer science [18]. As in BFS, while proceeding with the search we work with a queue-like structure holding the users for further expansion. Each of these users is assigned a priority, calculated as the total number of connections with the already expanded set of users. At each step, the queue is updated with the newly explored users and the one with the highest priority is chosen for the next expansion (see Figure 3.1 for illustration). Following this simple approach from any given starting point, we gather a group of inter-connected users.
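The prioritized snowball sampling described above can be sketched in a few lines. The sketch assumes a hypothetical callback `get_neighbours(user)` that returns the set of users connected to a given user (in practice this would be backed by the Twitter API, see Section 3.2); the function name `crawl`, the alphabetical tie-breaking and the `limit` parameter are likewise assumptions made for this illustration and do not reproduce the actual implementation of Section 3.3.

```python
def crawl(seed, get_neighbours, limit):
    """Expand users in order of their connectivity to the already expanded set.

    seed           -- user to start from
    get_neighbours -- hypothetical callback: user -> set of connected users
    limit          -- maximum number of users to expand
    """
    expanded = []          # users in the order they were expanded
    expanded_set = set()
    # Frontier: candidate user -> number of connections to expanded users.
    priority = {seed: 0}

    while priority and len(expanded) < limit:
        # Highest priority first; ties are broken alphabetically for determinism.
        user = max(sorted(priority), key=priority.get)
        del priority[user]
        expanded.append(user)
        expanded_set.add(user)
        # Every unexpanded neighbour gains one connection to the expanded set.
        for other in get_neighbours(user):
            if other not in expanded_set:
                priority[other] = priority.get(other, 0) + 1
    return expanded
```

In contrast to plain BFS, a user discovered early but connected to only one expanded user can be overtaken by a user that is connected to several of them, which steers the crawl towards a densely inter-connected group.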