
Master’s thesis

Czech Technical University in Prague

Faculty of Electrical Engineering F3 Department of Cybernetics

Twitter’s local trends spread analysis

Gustav Šourek

Program: Open Informatics Field: Artificial Intelligence

April 2013

Supervisor: Ing. Ondřej Kuželka

Acknowledgement / Declaration

I would like to thank my supervisor, Ing. Ondřej Kuželka, for giving me the opportunity to work on this interesting topic, his patient guidance, constant inflow of new ideas and encouragement. My thanks also go to my family and friends for their continued support.

I hereby declare that this thesis is the result of my own work and all the sources I used are in the list of references, in accordance with the Methodological Instructions on Ethical Principles in the Preparation of University Theses.

Gustav Šourek

In Prague, May 6th, 2013

I declare that I completed the presented work independently and that I have listed all information sources used, in accordance with the Methodological Instructions on Ethical Principles in the Preparation of University Theses.

Gustav Šourek

In Prague, May 6th, 2013

Abstract

The potential value of predicting trends in social media rises with their growing dominance in our lives. Whereas many works focus on anomaly or trend detection, there is still little knowledge on the evolution of trend dynamics. Inspired by studies on infection diffusion through a social network, we propose an approach to predict trends spread within a local subnetwork of Twitter, exploiting the network structure information beyond the scope of previous works. We base the anomaly pattern representation on graph features, reflecting various local relational topology options in the context of trend presence. Utilizing a machine learning algorithm, the extracted information is used for prediction of future trend behavior and evaluated over several demarcated targets. The contribution of our graph approach is then measured against a baseline model, utilizing the same learning strategy, yet considering the trends as time series, absent any knowledge of the network topology. Moreover, several other approaches are tested for comparison.

The results prove that the network structure plays an important role in the trends spread dynamics, as the topology information extracted via graph features improves the accuracy of the learner considerably, out of the reach of the other methods tested. Further feature options and combinations can be considered for prospective improvements of the network related approach.

Keywords: Twitter, local trends spreading, relational machine learning

Title translation: Analýza šíření lokálních trendů v sociální síti Twitter

Contents

1 Introduction
  1.1 Motivation
  1.2 Related work
  1.3 Our approach
    1.3.1 Overview
2 Social Networks
  2.1 Introduction
  2.2 Digital social networks
  2.3 Twitter
  2.4 Analysis
    2.4.1 Levels of analysis
  2.5 Trends spreading
3 Data acquisition
  3.1 Crawling strategy
  3.2 Twitter API
    3.2.1 Functionality
    3.2.2 Rate limiting
    3.2.3 Limits workaround
  3.3 Implementation
4 Data analysis
  4.1 Crawled data
    4.1.1 Statistics overview
    4.1.2 Network structure
    4.1.3 Trending topics
  4.2 Data transformation
  4.3 Time structures
    4.3.1 Sequential representation
    4.3.2 Sliding window
  4.4 Graphs
    4.4.1 Relations
    4.4.2 Representation
5 Learning
  5.1 Target classes
    5.1.1 Motivation
    5.1.2 Basic class
    5.1.3 Top-K% metric
    5.1.4 Expands class
    5.1.5 Enters top-K class
  5.2 Approaches
    5.2.1 Simple learner
    5.2.2 Baseline learner
    5.2.3 Graph learner
    5.2.4 User modeling
  5.3 Evaluation
    5.3.1 Classifiers
    5.3.2 Cross-validation
    5.3.3 Test set validation
    5.3.4 Weka
6 Features
  6.1 Base features
    6.1.1 Frequency rankings
    6.1.2 User features
  6.2 Model features
  6.3 Graph features
    6.3.1 Relational features
    6.3.2 Time features
  6.4 Graph features creation
    6.4.1 Isolated feature check
    6.4.2 Feature set check
  6.5 Isomorphism problem
    6.5.1 Calculating invariants
    6.5.2 Isomorphic mapping
  6.6 Feature matching
    6.6.1 Heuristic ordering
    6.6.2 Search method
    6.6.3 Set intersection speedup
7 Experiments
  7.1 Settings
    7.1.1 Sliding window properties
    7.1.2 Top-k threshold
    7.1.3 Datasets
  7.2 Feature options
    7.2.1 Ranking
    7.2.2 User features
    7.2.3 User modeling
    7.2.4 Graph features
  7.3 Results
    7.3.1 Shows or stays
    7.3.2 Top-k%
    7.3.3 Expands
8 Conclusion
  8.1 Future work
References
A Specification
B Used Terms
  B.1 Acronyms
  B.2 Software
C CD content

Tables / Figures

Tables:
4.1 crawled datasets comparison
6.1 features reduction

Figures:
2.1 depiction of social network
2.2 levels of analysis
3.1 crawling strategy
3.2 implementation of crawling
4.1 friends degree distribution
4.2 retweets degree distribution
4.3 betweenness centrality distribution
4.4 time-series trend occurrence
4.5 network subset trends spread
4.6 sliding window
5.1 top-k% prediction task
6.1 size 1 features
6.2 size 2 features
6.3 features of size 3
6.4 spread potential
6.5 causality correlation
6.6 isomorphism problem
6.7 feature searching algorithm
6.8 feature matching algorithm
7.1 overfitting of graph learner
7.2 shows in top-k window parameters
7.3 stays in top-k window parameters
7.4 precision target size ordering
7.5 top-k threshold influence
7.6 datasets change resilience, baseline approach
7.7 datasets change resilience, graph approach
7.8 ranking baseline
7.9 ranking graph features
7.10 user subset sampling
7.11 user modeling top-k%
7.12 relations selection accuracy
7.13 time features addition
7.14 relations selection top-k
7.15 learners accuracy collation
7.16 shows in top-k collation
7.17 stays in top-k collation
7.18 learners top-k% collation
7.19 all learners collation at top-k%
7.20 core learners collation at expands

Chapter 1 Introduction

1.1 Motivation

Human beings have been assembling themselves into social networks for thousands of years. Forming friendship, family, or coworker relationships with other people, we become embedded in the various social networks that these relations give rise to. Studying the rules that govern how social networks are assembled, how they operate and how they affect our lives has gained much attention in various fields of scientific interest. The dynamics of human interaction, social behavior, processes and phenomena have been subjects of research for decades, as they proved able to provide means for further improvements in related disciplines, such as healthcare or social behavior studies.

In recent years, a new variety of social networks emerged. With the coming of the Internet and digital media, vast amounts of data on virtual social networks became available for study at a scale that would never have been reachable for empirical studies before. Moreover, the data from digital networks can often be obtained passively, without actually affecting the users. Massive streams of user generated content in virtual social networks provide a great opportunity to analyze social behavior and the spread of information within the network.

One of the works that stands as inspiration for this thesis studies how an infection spreads through a face to face friendship network, and how information on the network's structure can help to predict upcoming epidemics [1–2]. The positive results on early detection of a disease outbreak in that work have driven us to test the social network structure effect outside the scope of face to face relations and physical infections. The idea is that the means by which the structure of a network affects infection dynamics apply not only to epidemics of germs transmitted through face to face relations, but to all sorts of things spread through social ties, like behavior, emotions, ideas and trends.
Following this assumption, we want to turn the data collected from a virtual social network into valuable insights that could be used for prediction of the spread of trending topics in a local network subset.

The problem of prediction and classification of phenomena in complex systems is ubiquitous in science, engineering and society. From classifying proteins by their function, detecting suspicious behavior in financial transactions, to revealing emerging trending topics in social networks, extracting information from data is important to understand the world around us. A common presumption in solving these tasks is that there is an underlying process generating the observed data. A model of such a process in a complex network is generally hard to determine, as most real world systems, especially those involving human behavior, defy simple model descriptions. While, in contrast to anomaly detection, the literature still lacks a theoretic model for the evolution of anomalies in the network [3], we refrain from reasoning about particular social models and embrace the complexity of the data instead. For that we

utilize machine learning approaches building on general classification models, which we feed with particular features reflecting selected topology properties to effectively exploit the network structure information.

1.2 Related work

Although at the time of this thesis' inception there was little published work regarding Twitter trends spread, a number of related works appeared during the course of the project (2012).

Most closely related are structure aware approaches, using basic local network properties to model users' behavior. Tsur and Rappoport [4] used a strategy based on linear regression, combining temporal and topological features, to predict the spread of an idea in a given time frame, demonstrating the importance of the content of the idea. In a similar work, Ruan et al. [5] combine multiple dimensions in one regression-based prediction framework, emphasizing the influence of network structure, user interaction and content characteristics over simple past activity features. In both these works, the network features are based on local user connectivity statistics, e.g. the average and maximal number of friends, and the corresponding interaction among users measured by the number of retweets.

The rest of the related trends prediction techniques, applied in and out of the scope of Twitter, are unaware of the network structure. Taking trends prediction as time series, the approaches can be differentiated by whether they use an explicit model of the trend spread process or not. The first type represents a popular strategy for detecting emergent trending topics and general network outbreaks. In a recent work by Stanislav Nikolov (MIT & Twitter) [6], which gathered much attention in the media, global Twitter trends are predicted using a stochastic model specified by a small collection of unknown “latent sources”. Relating these sources to topic observations, without using any information on the network structure, they achieved some very good results. In [7], Gupta et al. introduce a framework to experiment with various features encoding trend dynamics and utilize regression, classification and hybrid approaches to predict event popularity. Work by Altshuler et al.
[3] presents an analytic model for the social diffusion dynamics of spreading network patterns, based on information diffusion models. In [8], Wang et al. use a branching process model to provide a theoretical basis for the formation, persistence and decay of trends on Twitter at global scale, and discover some factors influencing the trends spread, like user activity and resonance. In other work, Shtatland and Shtatland [9] utilize the susceptible-infectious-recovered (SIR) model, typically used for predicting the progress of an epidemic in a large population, to detect anomaly phenomena outbreaks by the means of the model's stationarity.

In the latter type, with no explicit use of a process model, there are approaches utilizing clustering, e.g. the work of Becker et al. [10], where messages are arranged into meme1) and real-event clusters based on their similarities, combining temporal, topical and social based features. Moreover, there are methods based on trajectory clustering, operating on large collections of time series denoting the topic frequencies. These include simple “nearest trajectory” strategies or more sophisticated approaches, based on low density areas between the trajectories to identify the number of clusters, as in the work of Murthy et al. [11].

1) from Greek “mimo”, “to imitate”: an idea, behavior, or style that spreads from person to person within a culture.


Despite the number of approaches to Twitter trends prediction that have recently emerged, we are unaware of any work that investigates prediction capabilities in Twitter with more complex structural information than simple local connectivity measures.

1.3 Our approach

As the network structure has been proved to play an important role in the spreading of social trends [1, 4–5], we want to exploit the effect of social network topology beyond the scope of previous works. Inspired by the creation of network signatures from graphlet degree distributions [12] in biological networks, we use a similar representation to reflect a trend's presence within our network. For that we create graph features, small connected subgraphs representing various local relational topology options, and measure their presence in the network by means of subgraph matching. The network trend signatures, calculated from the frequencies of the respective features' occurrences, are further submitted as attribute vectors to a machine learning algorithm, specifically a Random Forest model. This model is finally trained upon a network dataset collected from Twitter, and various experiments and tunings are performed to reason about the features' contribution.
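As a toy illustration of how such a network trend signature could be assembled (a minimal sketch, not the thesis implementation: the real features are connected subgraphs of sizes 1–3 over several relation types), consider counting three trivial patterns in a follow network. The function name `trend_signature` and the dict-of-sets encoding are hypothetical:

```python
def trend_signature(follows, active):
    """Count occurrences of three toy graph features in a follow network.

    follows: dict mapping a user to the set of users they follow
    (a stand-in for the crawled Twitter structure).
    active: set of users currently posting the trending topic.
    """
    f_active = len(active)                      # size-1 feature: active node
    f_aa = sum(1 for u in active
               for v in follows.get(u, ())
               if v in active)                  # active -> active edge
    f_ai = sum(1 for u in active
               for v in follows.get(u, ())
               if v not in active)              # active -> inactive edge
    return [f_active, f_aa, f_ai]

# Toy network: a follows b and c, b follows c; a and b use the trend
follows = {"a": {"b", "c"}, "b": {"c"}}
print(trend_signature(follows, active={"a", "b"}))  # [2, 1, 2]
```

A vector like this, computed per trend and per time window, is what would be handed to the classifier as one attribute row.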

1.3.1 Overview

The following chapters describe the respective parts of the thesis:

- Chapter 1 provides the thesis introduction, motivation and related work.
- In the beginning of Chapter 2, the concept of social networks is introduced, specifically its digital type, and the reader will become acquainted with Twitter and related definitions. Further, the field of social network analysis is outlined and related to our task of trends spreading.
- Chapter 3 describes the strategy and the means by which we acquire the data. A part is also dedicated to the implementation of the respective program.
- The collected data are analyzed in Chapter 4 to provide insights on their structure and the information included. Next, the data transformations necessary for subsequent learning are briefly outlined.
- Chapter 5 concerns the learning itself. It introduces various learning approaches to be tested, lays out the targets to be achieved, and describes how to evaluate them.
- The following Chapter 6 gives an account of the features associated with the learners previously introduced. It describes their options, properties and enhancements, with special regard to graph features creation and matching.
- Experiments are demonstrated in Chapter 7, where annotated settings and feature options are tested for performance. Finally, the results of all the learners are compared with respect to the selected targets.
- Chapter 8 provides the final conclusion of the thesis and outlines some further work and ideas.

Chapter 2 Social Networks

2.1 Introduction

A social network is a theoretical construct used in the social sciences to study relationships between social units, such as individuals, groups, organizations or even entire societies. The purpose of this construct is to describe the structure of the network determined by interactions between these units. These interactions are usually represented as a complex set of dyadic1) ties. This also underlies the relational approach often used in the study of social networks, as in our case in Section 5.2.3.

The idea behind studying phenomena in social networks, such as trend spreading in our case, is that they are primarily caused by these relations rather than by the units themselves, and thus we should focus on the properties of these relations. This implies a common criticism of social network theory: that the properties of individuals, and individual agency, i.e. the capability of units to act independently and make free choices, are often ignored [13]. Still, social network analysis has become useful in a broad range of research enterprises like economics, geography and organizational studies.

On the academic side, a large amount of knowledge has accumulated on the formation and dynamics of these networks, fueled by the easy availability of data and the regularities found in the statistical distribution of nodes and links within these networks [14].

2.2 Digital social networks

Digital social networking emerged in recent decades with the advance of new Internet based technologies. It refers to the means of interactions among people in which they create, share and exchange information within their respective virtual communities. The main factor that gave rise to digital social networks was the coming of Web 2.0, allowing the creation and exchange of user-generated content.

Digital social networks in the form of social media differ from traditional media in many aspects that stem from the usage of the Internet, such as reach, frequency and permanence. They can take different forms, such as Internet forums, microblogging, wiki sites and authentic social networks, such as Facebook or Twitter, the latter being the subject of study in this thesis in Section 2.3. This type is also the most influential type of social media, and with interfaces that allow people to follow the lives of friends, acquaintances and families, the number of people in social networks has grown exponentially since the turn of this century [14].

1) from Greek “dýo”, “two”: describes the interaction between a pair of individuals.


2.3 Twitter

Twitter is an online social networking service that enables its users to send and read text-based messages of up to 140 characters, known as “tweets”. At the same time, it enables users to connect to others through the follows relationship. The users that a particular user is following through this relation are referred to as his friends. Users on the other side, who are following the particular user, are referred to simply as his followers.

Tweets posted by a particular user are stored and displayed as a chronological sequence in the user's timeline. Each such tweet being posted is also broadcast to the user's followers. Tweets are, by default, public, which means that anyone can list them out through Twitter's search engine or other Twitter API facilities (see Section 3.2) and join the related conversation. Moreover, Twitter users can also engage in a direct conversation with each other.

As for the information content, users can group posts together by topic through the use of hashtags – words or phrases prefixed with a “#” sign, referring a tweet to the specified topic. Hashtag signed tweets receive special treatment in Twitter's engine and can be easily searched out. Hashtags allow Twitter to effectively organize conversations and topics.

Moreover, there are other interactions between users and the information content. A user can either indirectly join a topic by using a hashtag in his tweet, or respond directly to another user's tweet. Tweets can also be reposted and shared on a user's own timeline, which is referred to as a retweet, symbolized by an “RT” sign at the beginning of the message. The last interaction between users leaving a direct trace in the message is the user mention, using a “@” sign to declare a searchable reference to another user's profile.

A word, phrase or topic that is tagged at a greater rate than other tags is said to be a trending topic (see trends in 4.1.3).
Twitter displays trending topics in a special list to provide users with information on what is happening in the world, similarly to regular news media. A trending topic on Twitter can emerge through the relations representing the information channels in the network, or from an outer event that prompts users to talk about it.

Trending topics are sometimes the result of concerted efforts by fans of certain celebrities or cultural phenomena, particularly musicians. To prevent manipulation of this type, Twitter has altered the trend algorithm in the past, making the trending topics selection opaque [15].

Finally, as there are lots of Twitter-specific terms used throughout the thesis, we list them out for better reference:

- Tweet - a text-based message that the users are posting
- Hashtag - a form of metadata tag used in a tweet to signify a reference to some topic
- Timeline - the chronological sequence of a user's tweets
- Follower - a user that receives by broadcast the tweets of the referred user
- Friend - a user that the actual one is subscribing to for receiving tweets
- Retweet - a repost of another user's tweet
- Respond - a direct reaction to another user's tweet
- Mention - a reference to another user's profile included in a tweet
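The message-level interactions listed above (hashtags, mentions, retweets) leave a direct trace in tweet text and can be recovered with simple pattern matching. The sketch below is a deliberately simplified approximation; Twitter's real entity-extraction rules are considerably more involved (Unicode ranges, links, edge cases around punctuation):

```python
import re

def parse_tweet(text):
    """Extract hashtags, mentions, and retweet status from raw tweet text.

    Simplified sketch: '#word' marks a topic, '@word' a user mention,
    and a leading 'RT ' marks a retweet.
    """
    return {
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "is_retweet": text.startswith("RT "),
    }

tweet = "RT @alice: loving the #OpenInformatics thesis crunch #CTU"
print(parse_tweet(tweet))
```

Running this prints `{'hashtags': ['OpenInformatics', 'CTU'], 'mentions': ['alice'], 'is_retweet': True}`.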


2.4 Social Network Analysis

Social network analysis (SNA) views social units and relationships by the means of network theory. It indicates the study of network structure and its effects on social and cultural aspects. Some of the commonly studied subjects are to determine whether a given social network is tightly bounded, diversified or constricted, to find its density and clustering, and to study how the behavior of network members is affected by their positions and connections [16].

By the means of SNA, each social network consists of two main entities:

- Nodes - represent individual actors within the network
- Ties - represent relationships between the individuals, such as friendship

These networks are then often depicted in diagrams, where nodes are represented as points and ties as lines between them, as illustrated in Figure 2.1.

Figure 2.1. An example of a diagram commonly used in SNA for depiction of social network structure (generated from a random subset of our Twitter network).

Such a visual representation of social networks is an important tool to understand the networked data and convey the results of the analysis [16].
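As a small worked example of the node/tie representation, and of one of the measures mentioned above, a network's density is the ratio of observed ties to all possible ties. The helper below is illustrative only (undirected ties encoded as frozensets):

```python
def density(nodes, ties):
    """Density of an undirected network: observed ties / possible ties.

    nodes: set of actors; ties: set of frozenset pairs of actors.
    """
    n = len(nodes)
    possible = n * (n - 1) / 2          # every unordered pair could be a tie
    return len(ties) / possible if possible else 0.0

# Four actors, three ties forming a triangle among a, b, c
nodes = {"a", "b", "c", "d"}
ties = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "a")]}
print(density(nodes, ties))  # 0.5  (3 observed ties of 6 possible)
```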

2.4.1 Levels of analysis

A social network is a complex emergent structure, where the local interactions of the units that make up the system create the global patterns of phenomena we are interested in. As the size of the network increases, these patterns become more apparent. However, there are practical limitations to global network analysis, and a careful choice of scale is important, as it will finally influence the quality of information derived from the network.

Most social networks are viewed as individual people taking part in interpersonal relationships with others. Often these networks then become “social facts” and take on a life of their own. For instance, a family, as a network of close relations among a set of people, has been institutionalized and given a name and reality beyond that of its component nodes. Similarly, individuals in their work relations may be seen as nested


within organizations. Neighborhoods, communities, and even societies are, to varying degrees, social entities in and of themselves. And, as social entities, they may form ties with the individuals nested within them, and with other social entities [17].

Even though there are social network methods suited for such multiple levels of analysis upon multi-modal data structures, it must be stated that analysts rarely take much advantage of them. The most common modalities used in modern network analysis usually work with reductionism at three levels [18]:

- Micro level - works with individuals in a particular social context, tracing relationships to create bigger units.
- Meso level - works with a certain population size that falls in between the micro and macro levels and reveals connections between them.
- Macro level - works with large populations and analyzes the general outcomes of interactions.

In our analysis of local trends spreading on Twitter, we focus on the micro level, while creating graph features and reasoning about their properties in Section 6.3, and the meso level, while considering the trends occurrence and their competition in network subsets in Section 4.1.3. A depiction of these levels can be seen in Figure 2.2.

Figure 2.2. Micro level focusing on relations (on the left) and meso level focusing on trend occurrence (on the right) in our social network analysis.

2.5 Trends spreading

Trends in social networks are elements of communication whose frequency of occurrence in the respective media is significantly higher than that of others. Within modern digital media networks, these can take different forms, such as trending topics (see Section 2.3) or phrases, pictures and videos that are shared amongst the users. The formation of a trend can be rooted in an external event that prompts people to talk about it, or it can have no apparent substantiation and spread virulently through the ties of the network.

Trends in the information network are an important indicator of what is happening amongst the users and are often used by social media analysts for statistical processing. The information derived can then be applied in the improvement of social network services and the creation of social marketing strategies.
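The definition above, a trend as an element whose frequency is significantly higher than that of others, can be operationalized in many ways; Twitter's own trend algorithm is proprietary, so the cutoff below is purely an assumption for illustration. A minimal sketch flags tags whose count lies more than one standard deviation above the mean count:

```python
from collections import Counter
from statistics import mean, pstdev

def trending(tags, z_thresh=1.0):
    """Return tags whose frequency exceeds the mean by z_thresh std devs.

    A toy operationalisation of 'significantly more frequent than others'.
    """
    counts = Counter(tags)
    mu, sigma = mean(counts.values()), pstdev(counts.values())
    if sigma == 0:
        return []                       # all tags equally frequent: no trend
    return [t for t, c in counts.items() if (c - mu) / sigma > z_thresh]

# One dominant tag amongst background chatter
tags = ["#a"] * 8 + ["#b", "#c", "#d"]
print(trending(tags))  # ['#a']
```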


Sometimes the spread of trends is even manipulated by the concerted effort of a set of users, such as fans of celebrities, or marketers. Understanding the dynamics of trends spreading, i.e. potential prediction capabilities, could thus be of great information value to all people involved in the social media environment.

The importance of the ability to predict social trends has been growing rapidly in the past few years with the growing dominance of social media in our everyday lives. Whereas many works focus on the detection of anomalies in networks, there exists little theoretical work on predicting the likelihood of an anomalous network pattern spreading globally and becoming a “trend” [3].

Chapter 3 Data acquisition

For testing our graph feature based approach, we chose data from the social network Twitter, mainly for its public availability. The desired information on users can be accessed through the Twitter API. To download the data, we implemented a program managing the data flow from Twitter to a database, where the data is finally stored. The sampling strategy, i.e. the choice of the set of users we download information about, is determined by the approach of crawling the network, described in Section 3.1. The implementation of the whole process is then discussed in the corresponding Section 3.3.

3.1 Crawling strategy

The crawling strategy defines how we proceed with gathering the users in our scope. In social network analysis, there are various corresponding methods of network data collection, commonly divided into ego-centric and socio-centric categories. In the ego-centric approach, we are interested in a set of specific persons, referred to as “egos”, and the social network is constructed by references to their affiliates, referred to as “alters”. The socio-centric approach, on the other hand, focuses on whole network analysis, measuring structural patterns of interaction and explaining the outcomes, regardless of the choice of individual persons. The underlying assumption is that members of a group interact more than would a randomly selected group of similar size [19].

This assumption corresponds to what we are trying to achieve, i.e. to determine the global outcome, in our case predicting the top trends, by measuring the structural patterns of interactions between users. Thus, following the socio-centric assumption, sampling a group of highly connected users would be desirable, since the interactions between them should have a bigger effect on the global outcome than within a uniformly sampled group.

To follow this policy, we altered a commonly used SNA technique called “snowball sampling”, where existing study subjects recruit future subjects from among their acquaintances, similarly to the breadth-first search (BFS) used in computer science [18]. As in BFS, while proceeding with the search, we work with a queue-like structure holding the users for further expansion. Each of these users has a priority assigned, calculated as the number of total connections with the already expanded set of users. At each step, the queue is updated with the newly explored users, and the one with the highest priority is chosen for the next expansion (see Figure 3.1 for illustration). Following this simple approach from any given starting point, we gather a group of inter-connected users.
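The priority-driven snowball sampling described above can be sketched as follows. This is a simplified model, not the thesis implementation: the `friends_of` dict stands in for Twitter API friend lookups, friendship is treated as a plain symmetric tie, and priorities are recomputed from scratch at each step rather than maintained incrementally in a queue:

```python
def crawl(friends_of, seed, limit):
    """Priority-driven snowball sampling sketch.

    At each step, expand the frontier user with the most ties into the
    already-expanded group, until `limit` users have been gathered.
    """
    expanded = {seed}
    frontier = set(friends_of.get(seed, set())) - expanded
    while frontier and len(expanded) < limit:
        # priority = number of ties to the already-expanded group
        user = max(frontier,
                   key=lambda u: len(friends_of.get(u, set()) & expanded))
        frontier.discard(user)
        expanded.add(user)
        # enqueue the newly explored user's own friends
        frontier |= friends_of.get(user, set()) - expanded
    return expanded

# Toy symmetric friendship graph
friends_of = {"a": {"b", "c"}, "b": {"a", "c", "d"},
              "c": {"a", "b"}, "d": {"b"}}
print(sorted(crawl(friends_of, "a", 3)))  # ['a', 'b', 'c']
```

Note that the tightly knit triangle a–b–c is gathered before the peripheral user d, which is exactly the "highly connected group" bias the socio-centric policy asks for.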


Figure 3.1. Illustration of one step in the crawling strategy approach.

3.2 Twitter API

Twitter's application interface is the means by which we acquire the desired network user data. It provides developers, or anyone, programmatic access to a number of Twitter's features. The API allows for easy integration of Twitter functionality into custom applications and web solutions, and is said to be the most influential factor in Twitter's ubiquity and overall popularity. The Twitter API has gone through significant changes since its establishment in 2006, with the most crucial ones being applied at the time of this project's development in 2012, with the change from version 1.0 to 1.1 [15].

3.2.1 Functionality

The original concept of the API divides resources into three categories by the functionality delivered:
- REST - allows access to Twitter's core data, update timelines, status data, and user information
- Search - gives developers methods to interact with the Twitter search engine and trends data
- Stream - provides near real-time, high-volume access to Tweets in sampled and filtered form

With the new version of the API, 1.1, all the methods of these resource families are presented to developers through a unified interface under the REST header. There is a vast number of methods offered and we will pinpoint only a few categories that relate to the implementation of this project1):
- Timelines - a class of methods for retrieving users' statuses and retweets
- Tweets - provides methods for tracking specific tweets' metadata information
- Friends & Followers - allows for crawling by providing sets of related users
- OAuth - a set of tools for authentication of calls to the API

1) for a complete set visit the Twitter API specification site at https://dev.twitter.com/docs/api/1


3.2.2 Rate limiting

The Twitter API only allows clients to make a limited number of calls in a given time interval. This policy affects the APIs in different ways. While this poses a huge restriction on crawling larger amounts of data for research or other data-intensive purposes, Twitter used to allow "white-listing" of selected services upon request, increasing the number of requests to the API dramatically, from 350 to 20000. Unfortunately, in February 2011 Twitter turned down this possibility, creating a huge upset among researchers and developers.

In the new version of the API, 1.1, the rate limiting strategy changed again. While in 1.0 developers had a bucket of 350 requests they could make in any given hour period, in version 1.1 rate limits are divided into 15 minute intervals by different types of request. The applied rate limit for calls now varies depending on the method and authorization type being used. Additionally, all 1.1 endpoints require authentication, so there is no longer a concept of unauthenticated calls and rate limits. Rate limiting in version 1.1 of the API is primarily considered on a per-user basis, or more accurately, per access token. If a method allows for 15 requests per rate limit window, then it allows 15 requests per window per leveraged access token [15].

3.2.3 Limits workaround

By exploiting Twitter's new rate limiting policy, we can extend the limits several-fold. For that we make use of multiple prepared authorization tokens for calls on both the user and the application level. By constantly switching the tokens while accessing the API, we increased the data download rate up to the point where the whole process of crawling the network, with respect to the time needed for processing the data, runs continuously with no or only small delays due to the rate limit restrictions.
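The token-switching idea can be sketched as a simple round-robin pool. This is an illustrative sketch only (names are ours, not Twitter4j's); the real crawler additionally checks the remaining calls reported by the API before deciding to rotate.

```java
import java.util.*;

// Minimal sketch of the limits workaround: a round-robin pool that hands out
// the next authorization token whenever the current one exhausts its rate window.
public class TokenPool {
    private final List<String> tokens;
    private int current = 0;

    public TokenPool(List<String> tokens) {
        this.tokens = new ArrayList<>(tokens);
    }

    public String currentToken() {
        return tokens.get(current);
    }

    // switch to the next token in the pool and return it
    public String rotate() {
        current = (current + 1) % tokens.size();
        return tokens.get(current);
    }
}
```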

3.3 Implementation

The implementation of the program for Twitter data acquisition was one of the thesis assignment parts, so we provide a brief description of its core structure in this Section. The program consists of the following main parts:
- Download - the main part, implementing the loop for calls to the API and switching tokens
- Database - handles the database connection, data storage and extraction
- Twitter - a set of methods encapsulating the work with Twitter API calls
- User - internal representation of user information
- Tweet - internal representation of tweet parts

When the program is launched, memory statistics are checked and an IP proxy is set if needed. Next, the connection with the database is initialized and the authorization tokens are loaded into memory. These are subsequently validated against the Twitter API. Then the actual state of the queue, storing users for expansion and their priorities, is extracted from the database. The highest-priority user is selected and the main download loop is launched. First, the user's timeline is crawled into memory; here Twitter allows access to the last 3200 tweets. These are further transformed into the internal representation and stored in the database. Next, the set of the user's friends is downloaded, split and stored as connections to the current user in the database. The friends are further used in updating the queue. Finally, the current user's data is transformed into the internal representation, stored in the database and marked as expanded. At the end of the loop the rate limit of the current token is


checked and switched if needed. Then the next user is selected from the queue and the whole process is repeated until a sufficient number of users has been crawled. The application architecture flow chart is illustrated in Figure 3.2.

Figure 3.2. Simplified Twitter crawler implementation flow chart.

The program is implemented in the Java language. For easier work with the Twitter API a Java wrapper called "Twitter4j" 1) is used. When choosing the database engine we were forced to move from the open source Apache Derby 2) database, which is easily embedded in Java projects, to the proprietary MSSQL 3) for performance reasons. The performance of the crawler depends heavily on the database engine, which is accessed through JDBC.

All the main structures, including the queue, are stored in the database for the sake of safety and permanence. The memory consumption of the program is thus directly derived from the database server settings. The speed of the crawl also varies, depending on the amount of data loaded into the priority queue. With the settings we tested and a load of a few million users in the queue for potential expansion, one loop iteration takes approximately one second on a 2.4 GHz machine using 3 GB of memory.

Generally, the whole process could be accelerated by employing a distributed architecture as described in the recent project TwitterEcho [20]. There the API calls are distributed to a set of clients, each operating with its own authorization tokens. Each of these clients transmits a subset of the crawled data to a central server managing the database. This is a natural decomposition of the problem, yet special care would have to be taken with the shared resources, in our case the priority queue, to keep them synchronized.

1) the official twitter4j library site can be found at http://twitter4j.org/
2) the official site of the Derby project: http://db.apache.org/derby/
3) the official site of the MSSQL product: http://www.microsoft.com/en-us/sqlserver

Chapter 4
Data analysis

4.1 Crawled data

As in every data analysis, it is important to have at least a brief picture of the data collected. We provide a general description and some measures commonly applied in social network analysis of the downloaded data.

4.1.1 Statistics overview

We have crawled three datasets for testing purposes. Two of them were collected at different times over the same set of users, to test the influence of time shift and data size. The third, smaller set stands separate from the previous two, to evaluate the robustness of our approach to network structure change. Let these be, in the order of mention, labeled December, March and April, according to the month they were collected in. Their statistics overview can be seen in Table 4.1.

stats           set A - December   set A - March   set B - April
# users                8054              8054            3701
# connections       3044687           3044687         1065109
# days                   14                35              20
# tweets            1556067           4904096         1419261
# hashtags           561058           2220550          428253
# retweets           307230           1133715          277042

Table 4.1. Statistical comparison of the different datasets crawled.

4.1.2 Network structure

The characteristics of the structure of the network can be viewed through network density, clustering coefficients, reciprocity, degree distribution and betweenness centrality, measures commonly applied in SNA [21].

Network density represents the actual number of ties in a network as a ratio of the maximum number of ties possible among all the nodes of the network. A fully dense network has a density value of 1, which indicates that all nodes are connected to each other, while a density value near 0 indicates a sparsely-knit network. For a directed graph with N nodes and T ties the density D is defined as [19]:

D = T / (N (N - 1))

Since our crawling strategy, described in Section 3.1, implies high connectivity of our networks, the resulting density of 0.047 is two orders of magnitude higher than that of the large-scale Twitter networks studied in [22], yet still much lower than the mean density of 0.37 of the general social networks studied in other works [23].

Another measure, reciprocity, is of particular interest for the sake of our relational features, which map the ties between users in a directed manner. A traditional way to define the reciprocity R is as the ratio of the number of links pointing in both directions, L↔, to the total number of links, L→:

R = L↔ / L→

In our case that means merging all directed edges which are part of bilateral ties into one bidirectional edge and counting them as such. The resulting reciprocity of 33% is on average 10% higher than that reported for Twitter at the level of states in [22]. The higher reciprocity implies a more complicated study of the trends' spreading direction and a higher importance of the relational features utilizing it, as we will see in Section 6.3.

When talking about the local relational structure, as exploited by our graph features, we can investigate the clustering coefficients of the network. The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. We follow the standard definition of Watts and Strogatz [24] and calculate the clustering coefficient as the ratio of the number of edges between a vertex's neighbors to the total possible number of edges between the vertex's neighbors. I.e., for a vertex v with k_v neighbors there are at most k_v (k_v - 1) ties between the neighbors, in the case they are fully connected. The local clustering coefficient C_v is then defined as the fraction of these possible edges that actually exist. Finally, the global clustering coefficient C̄ of the whole network is the average local clustering coefficient over all n vertices:

C̄ = (1/n) Σ_{i=1}^{n} C_i

The average clustering coefficient of our network subset is 0.26, which is again more than two times higher than those of the country-level Twitter networks studied in [22], most probably due to our crawling strategy focused on high connectivity.
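The density, reciprocity and local clustering measures defined above can be computed directly from an adjacency map. The sketch below is illustrative (class and method names are ours); note that reciprocity here uses the standard directed-link counting (reciprocated links over all links), which differs slightly from the merged-edge counting described in the text.

```java
import java.util.*;

// Sketch of the network measures above, on a directed graph stored as an
// adjacency map (node -> set of out-neighbors).
public class NetworkStats {
    // D = T / (N (N - 1)) for a directed graph with N nodes and T ties
    public static double density(Map<Integer, Set<Integer>> adj) {
        int n = adj.size();
        int ties = adj.values().stream().mapToInt(Set::size).sum();
        return n < 2 ? 0.0 : (double) ties / (n * (n - 1));
    }

    // fraction of directed links that are reciprocated by a link back
    public static double reciprocity(Map<Integer, Set<Integer>> adj) {
        int total = 0, mutual = 0;
        for (Map.Entry<Integer, Set<Integer>> e : adj.entrySet())
            for (int target : e.getValue()) {
                total++;
                if (adj.getOrDefault(target, Set.of()).contains(e.getKey())) mutual++;
            }
        return total == 0 ? 0.0 : (double) mutual / total;
    }

    // local clustering C_v: fraction of the at most k_v(k_v - 1) directed ties
    // between v's neighbors that actually exist
    public static double localClustering(Map<Integer, Set<Integer>> adj, int v) {
        List<Integer> nb = new ArrayList<>(adj.getOrDefault(v, Set.of()));
        int k = nb.size();
        if (k < 2) return 0.0;
        int links = 0;
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++)
                if (i != j && adj.getOrDefault(nb.get(i), Set.of()).contains(nb.get(j)))
                    links++;
        return (double) links / (k * (k - 1));
    }
}
```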

Another property to examine is the degree distribution. In the study of graphs and networks, the degree of a node is the number of connections it has to other nodes, and the degree distribution is the probability distribution of these degrees over the whole network. In our case of directed relations between users, the degree of a node has two components, the in-degree and the out-degree, corresponding to the follows and befriends relationship respectively. The importance of the degree distribution measure in SNA rose with the discovery of the power-law degree distribution as a common property of many large affiliation networks [25–26]. This property, amongst others, was tested on Twitter with a global-scale crawl in 2010 [27]. The results showed that the power law applies on Twitter, but not completely, especially for very highly connected users. Let us examine this property in our network subset in Figure 4.1. We demonstrate the results with the befriends (out-degree) relation only, but the complementary follows (in-degree) relation behaves in the same manner.



Figure 4.1. Complementary cumulative distribution(CCDF) of friends degrees in the crawled subset of Twitter network on log-log scale graph, with dotted line designating the best power law fit.


Figure 4.2. CCDF of frequencies of retweets from users in the crawled subset of Twitter network on log-log scale graph, with dotted line designating the best power law fit.

As we can see, our network connection degrees do not actually follow the power-law distribution, marked by the dotted line in the graph, since an exponential distribution, designated by the solid curve, would make a better fit. Compared to the power-law trendline, we apparently have a higher number of users with too few friends and fewer users with very high connectivity, the so-called "hubs". That can be caused by the fact that we only count connections within our network subset, which does not reflect the real connectivity. E.g., even if we crawl some "hub users", we do not count their connections going outside our subset, which puts limits on the overall connectivity, implying a generally lower number of ties than the power law would predict. We can also notice a small glitch in the distribution around 2000 users, where the two trendlines cross. That could potentially be explained by the fact that before the year 2009 there was a uniform upper limit on the number of people a user could follow [28]. As a matter of interest, we plot the same statistic for the retweets relation, collected in a given time interval over our network subset, in Figure 4.2, with similar results.

The last property to study is the so-called betweenness centrality, a measure of node centrality in the network. It is equal to the number of shortest paths from all vertices to all others that pass through the respective node. This becomes important in real-world affiliation networks, as they tend to arrange themselves to create short path lengths across the network by creating a few "hub" nodes with much higher connectivity than the majority of the network. Over the past few years, betweenness centrality has become a popular strategy in dealing with complex networks, as it was again observed to follow a power-law distribution in networks like the world wide web, co-authorship, protein interaction and metabolic networks. The betweenness centrality of a node v is given by the following expression:

bc(v) = Σ_{s ≠ v ≠ t} ϕ_st(v) / ϕ_st

where ϕ_st is the total number of shortest paths from node s to node t and ϕ_st(v) is the number of those paths that pass through the node v. The average betweenness centrality, taken over all the nodes in the network and normalized by the total number of edges, is 0.0002, which is considerably lower than in the commonly studied scale-free networks. More information can be derived from the distribution over all the nodes in the network, which is plotted in Figure 4.3. Again, as in the case of the degree distribution, we lack the "hubs" that would connect some local components, and we have too many nodes with a low number of shortest paths going through them.
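The formula above can be evaluated naively by counting shortest paths with a BFS from every node; for each ordered pair (s, t) a path through v exists exactly when dist(s,v) + dist(v,t) = dist(s,t). This sketch (names ours) is only suitable for small graphs; at scale Brandes' algorithm would be used instead.

```java
import java.util.*;

// Naive betweenness centrality: BFS from every source to get distances and
// shortest-path counts (sigma), then sum the pair-wise fractions through v.
public class Betweenness {
    public static double bc(List<List<Integer>> adj, int v) {
        int n = adj.size();
        int[][] dist = new int[n][n];
        long[][] sigma = new long[n][n];
        for (int s = 0; s < n; s++) {
            Arrays.fill(dist[s], -1);
            dist[s][s] = 0;
            sigma[s][s] = 1;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty()) {
                int u = queue.poll();
                for (int w : adj.get(u)) {
                    if (dist[s][w] == -1) { dist[s][w] = dist[s][u] + 1; queue.add(w); }
                    if (dist[s][w] == dist[s][u] + 1) sigma[s][w] += sigma[s][u];
                }
            }
        }
        double score = 0.0;
        for (int s = 0; s < n; s++)
            for (int t = 0; t < n; t++)
                if (s != v && t != v && s != t && dist[s][t] > 0
                        && dist[s][v] > 0 && dist[v][t] > 0
                        && dist[s][v] + dist[v][t] == dist[s][t])
                    score += (double) sigma[s][v] * sigma[v][t] / sigma[s][t];
        return score;
    }
}
```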


Figure 4.3. CCDF of betweenness centrality distribution amongst the users in our network subset, dotted line designating the best power law fit, orange nodes denoting “hubs” and low centrality users.


That altogether points to a more uniform distribution of connectivity within our network than a power-law trend would predict for a scale-free network, which can all be explained by our focused crawling strategy, implying that all nodes are highly and closely connected within the subset.

4.1.3 Trending topics

The communication on Twitter is conveyed through tweets (see the overview in 2.3). The most accurate way of determining the topic of a tweet would be a decomposition of the text message using an appropriate dictionary to pinpoint the keywords. Even though there is a lot of research being done on text topic decomposition [29], this problem goes beyond the reach of this thesis and we follow an easier approach. In their internal representation, tweets consist of the message itself and metadata elements. Among these there is information on media and urls used, retweets, contributors etc. For the extraction of a tweet's topic we use the explicit theme metadata mark called a hashtag (described in 2.3), which has become a standard among Twitter users for highlighting the tweet's topic. Even though this is not a 100% accurate method, almost half of the collected tweets bear a hashtag sign (see the statistics in 4.1), forming a statistically significant group for our study. Moreover, the tweets that are not explicitly signed with a topic usually truly carry none and are referred to as "pointless babble", forming a significant part of Twitter's communication 1).


Figure 4.4. An example of the hashtag "#tcot" December occurrence as a time series.

Representing the tweet topics directly by hashtags, we can effectively store, filter and display the information in the network. We define trending topics by measuring the frequency of occurrence of the corresponding hashtag2). If the relative frequency of a hashtag in a particular timeframe is among the top-k, we declare it a "trend". We also often work with the number of "infected" users as a support measure of trends, especially in the graph approach (Section 6.3), since a single user can produce multiple tweets signed with the same particular hashtag in a given timeframe (see the time structures in 4.3).

1) according to a study performed by pearanalytics in 2009, http://www.pearanalytics.com/
2) The actual Twitter trending topics list is calculated using an internal algorithm counting with the topic history and some additional rules and exclusions [15]

By these means we can measure the spread of a trend in time, working with the frequencies of hashtag occurrence. As we can notice in Figure 4.4, many of the topics do not have a single period of being trending, but rather repeatedly increase and decrease their presence. Secondly, we can measure a hashtag's presence in space by tracking the infected users. The participation in a given topic among the users varies rapidly, as depicted in Figure 4.5, especially with trends of shorter time presence.

Figure 4.5. The hashtag "#tcot" December 4th propagation through a network subset; a darker color marks a later time of assumed transition through the corresponding relation.

4.2 Data transformation

For the analysis, the data loaded from the database needed to be transformed and processed in various ways, creating temporary structures, training sets for learning and results information, all according to a given schema defining various aspects of the data we are working with, like time intervals, the number of top trends etc. Firstly, the data are thus subject to a transformation in time and space, creating time-related (see Section 4.3) and graph-related (Section 4.4) structures.

4.3 Time structures

While in the database we stored the data in tables according to the present entities and relations, for the purpose of learning we needed to transform it accordingly. We need to define the whole time scope of the data, the granularity at which the data will be treated and how they will be stored in memory. For that we create the following structures:
- Time-fold - refers to a time "bucket" in which the occurrence of hashtags is considered. As the granularity can differ, we can have time-folds of weeks, days, hours etc.

18 ...... 4.4 Graphs

- Time-frame - a sequence of time-folds, the main structure operated on by the sliding window before the learning phase (see Section 4.3.2)
- Time-scope - represents the whole dataset of hashtag-time information available

4.3.1 Sequential representation

Learning is performed on the data as a time series, so while being loaded according to the settings in the schema, each hashtag occurrence, together with its user information, goes into a prepared time-fold according to its parent tweet's time of creation 1). A clear sequential representation of the data is important, as all the following learning processes, regardless of the features used, will use this property as the core for predicting the future from the past.

4.3.2 Sliding window

Having the hashtag occurrences represented as a time series, divided into time-folds whose sequences define time-frames, we can add another dimension for learning upon such a structure. The sliding window, a common technique for turning a sequential supervised learning problem into a classical supervised learning task [30], can be utilized. In our case we can vary the size of the window that the learning is based upon (trainSize), but also the size of the following window that we try to predict the hashtag occurrences for (targetSize), and possibly even the gap between them, as depicted in Figure 4.6.

Figure 4.6. Illustration of two subsequent sliding window positions on day-long time-folds. Parameters of the window: winSize = 3, targetSize = 2, gap = 1.

From each time-fold, underlying information on attributes such as the current hashtag rank, an occurrence flag, or graph feature ranks and time values (see the features in Chapter 6) is derived. This information is then serialized to produce a final set of features for the whole training part of the time-frame. By continuously shifting the window position we can create multiple learning instances from a single time-frame.
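The sliding window of Figure 4.6 can be sketched as follows: each window position yields one instance whose feature part is the training folds and whose label part is the folds after the gap. The class and parameter names follow the figure but the code itself is an illustrative sketch, not the thesis implementation.

```java
import java.util.*;

// Sketch of the sliding window: turn one occurrence time series into multiple
// (training folds, target folds) instance pairs for supervised learning.
public class SlidingWindow {
    public static List<int[][]> instances(int[] series, int winSize, int gap, int targetSize) {
        List<int[][]> out = new ArrayList<>();
        for (int start = 0; start + winSize + gap + targetSize <= series.length; start++) {
            int[] train = Arrays.copyOfRange(series, start, start + winSize);
            int[] target = Arrays.copyOfRange(series, start + winSize + gap,
                                              start + winSize + gap + targetSize);
            out.add(new int[][]{train, target}); // one learning instance per position
        }
        return out;
    }
}
```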

4.4 Graphs

Graph structures represent how the information on the various relations from the network is stored. During the implementation, different representations proved advantageous in different situations, and we briefly sum them up in the following sections.

1) As a matter of interest, each tweet in the network has the time of creation associated, yet because the network is global, different timezones need to be taken into account while performing merging operations.


4.4.1 Relations

As it was in our focus from the beginning, we explicitly stored the information on the follows relation of the users crawled, yet other relations can also be derived from the data: retweets, responds and mentions. Let us summarize their meanings:
- follows - the user has subscribed to another user's tweets
- retweets - the user has re-posted another user's tweet in a quotation-like manner
- responds - the user is simply responding to another user's tweet (also referred to as replies)
- mentions - the user included another user's identifier in his tweet

4.4.2 Representation

All these relations can eventually create a graph, yet while they are stored more or less as pairs of users in the database, for computing graph-related operations they need to be transformed into a more effective representation. At first they are loaded into hashmap-like structures, where each user is assigned a set of his ancestors and descendants. Later, where computational speed is the driver, they are further transformed into bit-array structures, indexing each user in the subset and creating a representation similar to an adjacency matrix.
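The two representations above can be sketched with the standard `BitSet` class: an adjacency map for flexible traversal, converted to per-user bit rows (an adjacency-matrix-like form) where raw speed matters, e.g. for set intersections over users. The names are ours, not those of the thesis implementation.

```java
import java.util.*;

// Sketch of the bit-array representation: one BitSet row per user, with bit v
// set when the user relates (e.g. follows) user v.
public class AdjacencyIndex {
    public static BitSet[] toBitSets(Map<Integer, Set<Integer>> relation, int nUsers) {
        BitSet[] rows = new BitSet[nUsers];
        for (int u = 0; u < nUsers; u++) {
            rows[u] = new BitSet(nUsers);
            for (int v : relation.getOrDefault(u, Set.of())) rows[u].set(v);
        }
        return rows;
    }

    // example of a fast bit-level operation: users related to by both u and w
    public static BitSet commonNeighbors(BitSet[] rows, int u, int w) {
        BitSet common = (BitSet) rows[u].clone();
        common.and(rows[w]);
        return common;
    }
}
```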

Chapter 5
Learning

The purpose of crawling the data from Twitter was to be able to perform an analysis of interesting factors affiliated with the spread of local trends in the network subset, with the main focus of the analysis on predicting trend behavior (see trends in 4.1.3). To extend the analysis with prediction we employ machine learning, yet there are several options to be considered when dealing with the trend prediction task. Firstly, we specify the target we try to achieve in Section 5.1, and secondly, we consider the overall approach to learning in Section 5.2.

5.1 Target classes

At first we focused on predicting the classes of trending and non-trending hashtags respectively, yet we soon realized that the training data would be very skewed, as most of the hashtags will not trend, and that accuracy on this task might not be the best indicator of what we are trying to achieve. So we created new classes and evaluation metrics expressing different aspects of trend spreading.

5.1.1 Motivation

The ultimate goal of learning is to make practical predictions and compare them with reality, e.g. to ask how many of tomorrow's real trends will be on our prediction list of the same length, as illustrated in Figure 5.1.

Figure 5.1. Top-K% prediction task illustration with the simple approach.

5.1.2 Basic class

The natural representation of positive and negative labeling, when it comes to top trends classification, is what the basic class represents. An instance in this case is labeled as positive if and only if its topic becomes part of the top-k list at a target time-fold, and negative otherwise. Because the target time zone can consist of a number of subsequent time-folds, it is important to decide how to treat them. We have considered the following two options:
- Shows in top-k - the hashtag has been present in the top-k list in at least one of the target's time-folds
- Stays in top-k - the hashtag has stayed present in the top-k list for the whole target's time scope

Even though this approach certainly provides us with some information, it is sensitive to the data skewness and pushes the whole classification accuracy scale close to 100%.

5.1.3 Top-K% metric

This metric was created to solve the sensitivity problem of the basic class as well as to bring the whole classification task closer to reality. The evaluation metric top-k% calculates the percentage of correctly identified trending topics. It takes the probability distribution of the classification and creates a ranking of the instances that are subject to the current prediction. That means that only the top k instances classified as trending with the highest confidence will be considered positive. This k-set is then compared with the true top-k list for the target day. The final score of the classification is the percentage of correctly predicted topics in the target list.
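The top-k% evaluation above can be sketched as follows: rank the hashtags by the classifier's confidence, take the first k, and score the overlap with the true top-k list of the target day. The confidence values and names are illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the top-k% metric: fraction of the true top-k trends that appear
// among the k hashtags predicted with the highest confidence.
public class TopKPercent {
    public static double score(Map<String, Double> confidence, Set<String> trueTopK, int k) {
        List<String> predicted = confidence.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        long hits = predicted.stream().filter(trueTopK::contains).count();
        return (double) hits / k;
    }
}
```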

5.1.4 Expands class

The expands class represents the growth of the overall topic's presence. It is calculated as the direction of change between the topic's frequency at the training and the target time-fold respectively. It was created as a compensation for the skewness of the training data. While the huge dominance of non-trending topics tends to drive the accuracy of the basic classification all the way up to certainty, the expands metric stays untouched. That makes it a statistically significant measure, and moreover it is resilient against the simple approach (introduced in 5.2.1), as it is not easy to determine which topic will increase its overall frequency better than a 50% guess.

5.1.5 Enters top-K class

This class was meant as a combination of the top-k% and expands strategies. The instance in this case is marked as positive if and only if it changes state from being outside the top-k list to being inside it while moving from the training to the target time-fold. Although such information would be of great value, this method suffers from the skewness the worst, and a different dataset would have to be created for its proper evaluation.

5.2 Approaches

5.2.1 Simple learner

From the very beginning of the analysis it was clear that it was going to be about searching for improvements with various approaches and methods, at best in the order of single percentage points. To stay clear of searching for ghosts in the noise while doing that, the simple learner approach was constructed. It stands as a threshold for all the algorithms and


represents the common-sense approach to the statistics measured, utilizing no machine learning. The method of simple prediction builds on the basic average occurrence of the hashtags, calculated over all the time-folds in the training part of the current window, and treats them as if they were to continue constantly with that occurrence in the target part. This means that to measure the performance of the simple prediction by the metrics introduced in Section 5.1, with respect to the current window and top-k threshold, we order the hashtags in our scope according to their average rank (see Section 6.1.1) from the training part of the window, and "classify" the first k of them as positive for the target part. This method is simple yet robust, especially where the trend behavior is more stable, and leads us to a better understanding of our predictions in environments where it is not (see the discussion in 7.3.1).
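The simple learner can be sketched in a few lines: average each hashtag's occurrence over the training folds and "classify" the k hashtags with the highest average as the predicted top-k for the target part. This is an illustrative sketch using raw counts; the thesis orders by average rank, which yields the same kind of ordering.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the simple learner: no machine learning, just the assumption that
// each hashtag continues with its average training-window occurrence.
public class SimplePredictor {
    public static List<String> predictTopK(Map<String, int[]> trainCounts, int k) {
        Map<String, Double> avg = new HashMap<>();
        for (Map.Entry<String, int[]> e : trainCounts.entrySet())
            avg.put(e.getKey(), Arrays.stream(e.getValue()).average().orElse(0.0));
        return avg.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```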

5.2.2 Baseline learner

This learner was created as the main competitor to our approach based on graph features. The motivation for the baseline learner comes from the well known problem of time series forecasting, which is not just a matter of SNA, but a widely researched area in the context of statistics, engineering and finance. It represents the use of a model to predict future values based on previously observed values in data with a natural temporal ordering. The data can be modeled with the use of regression analysis, working within the time domain, or a signal estimation approach, utilizing the frequency domain of the data with methods like spectral analysis, used mainly for signal processing.

In our case we want to set this approach as a baseline for the graph learner, which utilizes classification on relational features (Section 5.2.3). To make a clear comparison of the two, we turn the time series forecasting task into a standard classification problem as well, using the sliding window technique described in Section 4.3.2. This way the model is represented by the inner state of the classifier used, based on the features we feed it with. As defined in the thesis assignment, the baseline features are based strictly on global statistical measures, like the overall occurrence, frequency or ranking of hashtags in the network as a whole. And since our main focus lies with the features extracted, it always utilizes the same machine learning algorithm as the graph learner. A closer description of the features used with this approach can be found in Section 6.1.

5.2.3 Graph learner

The graph learner enhances the baseline with new features and is the core of this thesis. The idea of this approach builds on statistical relational learning (SRL) [31], which models uncertainty upon complex network structures, with associated tasks like link prediction, collective classification or social network modeling. The fundamental aspect of SRL is knowledge representation, i.e. a description of our object and relation properties that allows us to abstract away from concrete entities and represent the general principles applicable to the network as a whole. Our intention is to create such a representation through the use of the relational features described in Section 6.3.1.

The inspiration for this representation comes from the work of Nataša Pržulj on a network structure similarity measure using the so-called graphlet degree distribution [12]. Graphlets are small connected non-isomorphic induced subgraphs of a large network, which means they have exactly the edges that appear in the network over the same vertex set. The graphlet distribution then generalizes the regular degree distribution


(seen in Section 4.1.2) by measuring the nodes' incidence with the respective number of graphlets instead of edges. As a method analogous to biological sequence comparison, yet on the structural level, graphlet methods have been successfully applied in the study of biological networks, such as protein-protein interaction, and in discovering their underlying models. Our motivation was to prove the value of this concept in the domain of social networks.

Our approach differs in that, with our features, we generate the subgraphs separately, in advance of further matching in the whole network, and we do not restrict them to be induced. This way they are more similar to network motifs, which are recurrent and statistically significant partial 1) sub-graphs extracted from the network. Network motifs, possibly reflecting a network's functional properties, have recently also gathered attention as a useful concept to uncover the structural design principles of complex networks [32]. Yet in contrast to network motifs we are not interested in frequent patterns only, but rather in a whole network structure signature.

Thus the main focus of this learner lies in extracting the relational features. These reflect different relations in the context of hashtag presence in the incident user nodes, while some additional attributes have also been considered. The features are subsequently, as in the case of the baseline learner, fed into the selected classifier, creating the model for our predictions. The graph approach with all its features is described in detail in Section 6.3.

5.2.4 User modeling

Since user models proved valuable in studying trend diffusion dynamics [4–5], we want to examine, for comparison, their capability of predicting trends across the whole network. In this approach we simulate each user's behavior by a simple model and try to predict the overall network statistics as the sum of its elements' predictions. The model generates, for each user, the probability of using a given hashtag in a given time-fold. Treating this probability as a threshold, we then draw a random number from a uniform distribution and make a prediction based on its position relative to the threshold. Following this procedure for all users in the network, we accumulate the overall hashtag frequency as our prediction. The features used at the user level are described in Section 6.2.
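The per-user simulation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation; the class and method names are ours, and the per-user probabilities are assumed to come from the logistic model of Section 6.2:

```java
import java.util.Random;

public class UserModelPrediction {

    // For each user, a uniform draw below the model's probability counts as
    // a predicted hashtag occurrence; the per-user outcomes are summed into
    // the predicted network-wide frequency.
    static int predictFrequency(double[] userProbabilities, Random rng) {
        int frequency = 0;
        for (double p : userProbabilities) {
            if (rng.nextDouble() < p) {   // draw against the threshold
                frequency++;
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        // Hypothetical probabilities for four users in one time-fold.
        double[] probs = {0.0, 1.0, 0.5, 0.5};
        System.out.println("predicted frequency: "
                + predictFrequency(probs, new Random(42)));
    }
}
```

Note that the prediction is stochastic: repeated runs with different seeds yield different frequencies, concentrated around the sum of the probabilities.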

5.3 Evaluation

The evaluation process measures the actual impact of the selected features and provides a structured interpretation of the prediction results. Depending on what needs to be accomplished, different evaluation strategies can be applied. These consist of choosing a suitable classifier, tuning it, and selecting a final validation approach.

1) i.e. they don't require all the edges of network subgraphs to be present in the motif

5.3.1 Classifiers

There is a great number of classification algorithms, all of which perform differently on different tasks. To choose the right one we used both experience and intuition, and ended up deciding between two state-of-the-art algorithms, SVM [33] and Random Forest [34]. After performing some basic tests, it became clear that the latter was going to be more suitable from both the accuracy and the efficiency perspective. Although SVM, with its deep theoretical foundations, is widely considered one of the top performers in classification, we were unable to properly tune its parameters for our needs. It appeared more prone to the skewness of the data, tending to classify all examples into the majority class. Finally, its time complexity for testing a number of various datasets quickly became unbearable. From these perspectives Random Forest fared much better on our task. It is an ensemble classification technique where a set of randomly generated trees vote for the most popular class. Random Forest is a framework rather than a particular classification model, as it can vary in the decision shapes of the nodes, the type of predictor used in each leaf, etc. We use Weka's implementation, which follows Breiman's original prototype; it has good overall accuracy and some desirable properties, like robustness to outliers and noise, as described in his paper [34]. The Random Forest classifier is also convenient to tune. The only parameter that needed to be taken care of is the number of trees in the forest. Based on past experience we increased the value from the default of 10 up to 100 trees as an appropriate compromise between accuracy and time complexity.

5.3.2 Cross-validation

Cross-validation is a popular model validation technique for estimating how accurately a predictive model will perform in practice. Its main advantage is that no further training set transformations are required. It is simple and powerful, but in our case, not having the test set explicitly separated could lead to hidden inadequacies. The instances of our training sets are generated through a sliding-window technique (described in Section 4.3.2). This leads to a state where features belonging to consecutive time-folds of the same hashtag, in all possible time-shift variations, are present in a single training set. When performing a randomized k-fold cross-validation split on such a training set, it is theoretically possible to recover the information on an instance's predicted class by simple matching against another instance of the same hashtag within the training set. Even though the instances are "anonymized" with respect to the hashtag information, we refrained from cross-validation in favor of a cleaner evaluation method.

5.3.3 Test set validation

To make a clear border between the training and test sets, we decided to split the data in time at the level of time-frames (see Section 4.3). This way each prediction truly corresponds to its natural interpretation, i.e. predicting the future from the past. An assumption of this approach, which is fulfilled in our data, is that the population and the network structure do not change much between the training and target intervals. Various training and test sizes can be considered, affecting the classifier's bias and variance. In our case, while working with the sliding window upon the time-frame structure, some minimal target time interval must be taken into account to make it possible for a window to fit into the time-frame at least once. This criterion then derives a limit on the test set size.

5.3.4 Weka

Weka is a popular suite of machine learning software which, like this project, is written in Java. Choosing Weka for the evaluation part of the thesis was based mainly on its convenient interface for incorporation into our code. Weka implements all the classification algorithms required for our purpose, i.e. SVM, Random Forests and logistic regression, and it also takes care of preprocessing and performance evaluation techniques. Through Weka we generate both training and test sets in the ARFF format. The instances generated often consist of hundreds or even thousands of features and are thus internally represented in a sparse form. This representation is supported by Weka while processing the instances into ARFF files, which saves a lot of space. The preprocessing capabilities of Weka allow us to filter the attributes, rank them (used in Section 6.3) and change their types. Since our selected classifiers cannot handle numeric class attributes, it is important to assign proper nominal (in our case binary) types to all classes we are working with (see Section 5.1). Weka's interface also offers some important tools for performance evaluation and visualization. Assuming that the classifier can provide class probability distributions, we can study ROC curves and, by their means, a number of classification model properties to reason about further steps.

Chapter 6 Features

Features are the core of this thesis. A feature is defined as an individual measurable heuristic property of a phenomenon being observed. Correct feature extraction and selection is what we believe can elevate our approach above the baseline model. While focusing completely on feature engineering, the learning algorithm stayed fixed for both the baseline and graph feature methods in the testing phase. Some of the features proved to be more effective than others. The various tests performed are displayed in Chapter 7, and the final comparison is shown in Section 7.3.

6.1 Base features

Following the sequential data representation defined in Section 4.3, the baseline feature extraction is straightforward. For each of the time-folds, a hashtag occurrence frequency is calculated and saved into the prepared time-frame structure for further processing through a sliding window (described in Section 4.3.2). Moreover, a simple signature flag of hashtag presence in a given time-fold is stored. This becomes useful when the absolute hashtag frequencies are further transformed into rankings.

6.1.1 Frequency rankings

The purpose of ranking is that we are actually interested in the competition between topics rather than in their raw frequencies; it should also prevent the classifier from confusion when the overall frequency of a hashtag is shifted. Combining the hashtag presence flags with its ranking provides full information on its position in the context of trending topics 1). Due to ties in the frequencies, especially at low values, the rank strategy is not completely straightforward. The following options were considered while producing the final ranking:
. continuous ranking - keep increasing the rank in sequence, regardless of draws.
. step ranking - assign the first (highest) rank to all hashtags of the same frequency.
. average ranking - assign an average rank between the actual and the next step at draws.
It is important that hashtags of the same frequency are also represented by the same feature values. This cannot be left to chance, thus the first option is not admissible. Although the third option might be more robust for low-frequency hashtags when a shift occurs, our interest lies with the most frequent hashtags, where the latter two options do not differ much, and thus we get by with step ranking.
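The chosen step ranking is standard competition ranking: tied hashtags share the highest rank of their group, and the next distinct frequency resumes at one plus the number of items already ranked, so ranks run 1, 2, 2, 4, ... A minimal sketch (names are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StepRanking {

    // Step ("competition") ranking: hashtags with equal frequency get the
    // same rank; a new frequency value steps the rank to the current
    // position in the descending order.
    static Map<String, Integer> rank(Map<String, Integer> frequencies) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(frequencies.entrySet());
        sorted.sort((a, b) -> b.getValue() - a.getValue()); // descending frequency
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int rank = 0;
        Integer prevFreq = null;
        for (int i = 0; i < sorted.size(); i++) {
            if (!sorted.get(i).getValue().equals(prevFreq)) {
                rank = i + 1;                    // new frequency: step the rank
                prevFreq = sorted.get(i).getValue();
            }
            ranks.put(sorted.get(i).getKey(), rank);
        }
        return ranks;
    }
}
```

For frequencies {a: 10, b: 7, c: 7, d: 3} this yields a=1, b=2, c=2, d=4, so equally frequent hashtags always receive equal feature values, as required.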

1) Later it became clear that this applies not only to baseline features, so wherever possible the features' absolute frequencies were turned into ranks for each particular hashtag and time-fold, e.g. for graph features


6.1.2 User features

This is a way to represent a complete trace of a topic's occurrence, and it might be one of the first methods that comes to mind for solving the trends spread issue. User features actually lie somewhere on the border between base and graph features. The idea is to store the hashtag occurrence frequencies, similarly to the baseline method, yet bring them down to single-user granularity. This means that for each hashtag, time-fold and user tuple we need to store a number representing the hashtag's occurrence. This produces a vast amount of data, stored in a sparse representation, since most of the users will not have the hashtag present. Even though this representation compresses the data, in the learning phase the amount of information still proved unmanageable. Thus we decided on two strategies to reduce the dataset:
. sample a random subset of users of a certain size
. select a set of the most influential users, as measured by in-degree connectivity
See the corresponding Section 7.2.2 in experiments for their comparison from the accuracy point of view.

6.2 Model features

Model features support the bottom-up modeling approach to trend prediction. Here we want to determine what information plays a role in an individual person using a particular hashtag in the future target zone. The information is derived from every consecutive day of the training window part and submitted to logistic regression to determine the probability of using a given hashtag. For each user we extract the following features from each time-fold:
. percentage of friends that used the given hashtag
. percentage of followers that used the given hashtag
. signature of the user himself using the hashtag
. global percentage of users that used the hashtag within the given time-fold
As this approach creates a tremendous number of learning examples and the logistic classifier does not allow for incremental learning, it was necessary to subsample the training set. To reflect the irregular occurrence of hashtags, we tested weighting the sample size by the total frequency of each hashtag, i.e. most samples were taken from the most frequent hashtags. The effect on the results can be seen in Section 7.2.3.
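The four per-user features listed above can be computed directly from the user's neighborhood sets. The sketch below uses illustrative names (not the thesis code) and assumes the set of users that used the hashtag in the time-fold is available, as in the search structures of Section 6.6.2:

```java
import java.util.Set;

public class UserFoldFeatures {

    // Returns the four model features for one (user, hashtag, time-fold):
    // fraction of friends with the hashtag, fraction of followers with it,
    // the user's own usage flag, and the global usage fraction.
    static double[] extract(String user, Set<String> friends, Set<String> followers,
                            Set<String> usersWithHashtag, int totalUsers) {
        return new double[]{
            fractionIn(friends, usersWithHashtag),
            fractionIn(followers, usersWithHashtag),
            usersWithHashtag.contains(user) ? 1.0 : 0.0,
            (double) usersWithHashtag.size() / totalUsers
        };
    }

    static double fractionIn(Set<String> group, Set<String> tagged) {
        if (group.isEmpty()) return 0.0;   // avoid division by zero
        int hits = 0;
        for (String u : group) if (tagged.contains(u)) hits++;
        return (double) hits / group.size();
    }
}
```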

6.3 Graph features

As graph features we consider all features using information from one of the relations in the network that can be added on top of the baseline model. We start with the follows and other relations to create connected subgraphs for matching, and continue with calculating time variance and other attributes of the underlying relations in the original network.


Figure 6.1. "Graph" features of size = 1, recording only the absence (on the left) or presence (on the right) of a hashtag at a single node (user).

6.3.1 Relational features

Relational features represent how natural user-to-user connections in Twitter's network affect the topics discussed. They consist of nodes and edges representing the occurrence of a hashtag on a user's time-line in the context of his neighbors, i.e. his followers and friends. Edges in graph features are directed so as to reflect the unilateral nature of Twitter's relations. The value of each node then varies between two states: hashtag present and hashtag absent (see Figure 6.1). The single-node features are of little interest in the relational sense and, similarly to the baseline model, by the means of matching in the original graph they represent the total frequency of users that have used a given hashtag at a given time-fold.

Figure 6.2. The set of features of size = 2, displaying various relationships between users in various contexts of hashtag presence.

More interesting graph features come with increasing size. For size = 2 there are already 5 different non-isomorphic oriented graphs with binary node labels, depicted in Figure 6.2. Some of these features bring common-sense insight into the network and have been used separately for modeling social networks in other works [35–36].
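The count of 5 can be verified by brute force. The sketch below enumerates all 16 labeled digraphs on two nodes, discards those that fail the connectivity and hashtag-presence checks of Section 6.4, and collapses node-swap isomorphs into canonical codes (the encoding is ours, chosen for the example):

```java
import java.util.HashSet;
import java.util.Set;

public class SizeTwoFeatures {

    // Enumerates the 16 size-2 graphs (2 directed edges x 2 node labels,
    // each present or absent) and counts the isomorphism classes that are
    // connected and contain at least one hashtag-labeled node.
    static int countValidClasses() {
        Set<Integer> canonical = new HashSet<>();
        for (int g = 0; g < 16; g++) {
            boolean e01 = (g & 1) != 0, e10 = (g & 2) != 0; // directed edges
            boolean l0 = (g & 4) != 0, l1 = (g & 8) != 0;   // hashtag labels
            if (!e01 && !e10) continue;     // disconnected: no edge at all
            if (!l0 && !l1) continue;       // no hashtag occurrence
            // Swapping the two nodes reverses the edges and swaps the labels;
            // the smaller of the two codes is the canonical representative.
            int swapped = encode(e10, e01, l1, l0);
            canonical.add(Math.min(g, swapped));
        }
        return canonical.size();
    }

    static int encode(boolean e01, boolean e10, boolean l0, boolean l1) {
        return (e01 ? 1 : 0) | (e10 ? 2 : 0) | (l0 ? 4 : 0) | (l1 ? 8 : 0);
    }

    public static void main(String[] args) {
        System.out.println(countValidClasses()); // prints 5
    }
}
```

This reproduces the size-2 row of Table 6.1: 16 raw graphs, 5 surviving features.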

In the unlabeled version these features could reflect the network's density (single edge) or reciprocity (double edge). In our case we are interested in trends spread, and the relations in the context of hashtag presence go deeper. They reflect social attributes of the topic, like the virulence of the hashtag, the mutual willingness to share it, and the "potential" of a trend (Figure 6.4).

With increasing feature size, the number of features grows enormously. Thus size = 3 is the maximal feature size we used for evaluation, although some experiments with features of size 4 have also been done. The interpretation of these features is not trivial; some of them make more common sense with respect to spreading trends than others. We can set common sense aside and rank the features according to their information value for the given classification task. Some of the most frequently top-ranked features by this measure are displayed in Figure 6.3. One can notice common "sub-features" in the highly ranked features, such as the potential of an unaffected node following a node that has the hashtag present. We should mention that the feature of size = 2 representing this state, i.e. where one unaffected node follows a node with the hashtag present (see Figure 6.4), is also mostly amongst the highest ranked. Usually, the more a feature reflects this potential, the higher it is ranked, which supports our intuition about trend spreading.

Figure 6.3. Selected examples of features of size = 3, some with the highest information gain with respect to the trends spread.

Figure 6.4. A common "sub-feature" representing the potential of future trend spreading.

6.3.2 Time features

Time features can be seen as an extension of the relational features. The idea is to add a measure of some time properties of the underlying network's relations. The motivation comes from a natural intuition about trend spreading in social networks. In a directed network like Twitter, if information is being spread through the network, the fashion of the spreading should correspond with the network structure. By that we mean that the directed relations between users should actually represent causality links in the trend spread dynamics. If that is not the case, the information is probably not coming from the network and is being spread through other channels. The question is how to measure this network-causality correspondence. We are already working with graph features mapping general relationship patterns onto the network. All we need to do is attach to these patterns information that, for each particular incorporated relation, represents a measure of how well the causality has been met.

Figure 6.5. The two cases of correspondence between the relation and causality in hashtag spreading.

In implementation terms, this means recording the specific time of each hashtag occurrence and creating a structure that calculates and keeps the time differences of these occurrences between all neighboring pairs (in the context of the investigated relation) for each particular hashtag and time-fold. If the time difference logically corresponds with the linkage between the users, its value is considered positive, negative otherwise, as depicted in Figure 6.5. In the process of feature matching these values are aggregated to produce final statistical measures, which are further treated in the same manner as the original feature counts. As the number of graph feature matches is usually enormous, generating hundreds of millions of values for aggregation, we stuck to basic statistics like average and variance only.
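The aggregation step can be sketched as follows. This is an illustrative reduction, not the thesis code; it assumes the sign convention that the difference between the follower's and the followee's occurrence times is positive when the spread direction agrees with the link (the thesis may orient it the other way), and it computes only the average and variance kept in the actual implementation:

```java
public class TimeFeatureStats {

    // Aggregates signed occurrence-time differences over all matched edges
    // of one feature into {average, variance}. followerTimes[i] and
    // followeeTimes[i] are the hashtag timestamps on the two ends of the
    // i-th matched edge.
    static double[] averageAndVariance(long[] followerTimes, long[] followeeTimes) {
        int n = followerTimes.length;
        if (n == 0) return new double[]{0.0, 0.0};
        double sum = 0, sumSq = 0;
        for (int i = 0; i < n; i++) {
            // positive: followee posted first, consistent with the link
            double diff = followerTimes[i] - followeeTimes[i];
            sum += diff;
            sumSq += diff * diff;
        }
        double avg = sum / n;
        return new double[]{avg, sumSq / n - avg * avg};
    }
}
```

Running sums and sums of squares are used so that hundreds of millions of matches can be aggregated in a single pass without storing the individual differences.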

6.4 Graph features creation

The idea was that constructing relational features to reflect some of the network's characteristics, similar to the micro-level modeling of social networks using Markov random graphs in [36], could help reveal hidden information and thereby improve prediction of the trends spread. Each feature is thus represented as a small graph, yet not every graph is valuable to us. Before the graph features are created, there is a whole underlying process of generating and backward-checking each feature's feasibility and the whole subset's consistency. The results of this process, reducing the number of graph features, can be seen in Table 6.1.

graph size    # graphs    # features
1             2           2
2             16          5
3             512         72
4             65536       2644

Table 6.1. The effect of checking on the reduction of the number of graph features.

6.4.1 Isolated feature check

There is a tremendous number of graphs that can be generated even for a small number of nodes. Yet to bring some information as a feature, a general graph must meet certain criteria. First, it must be connected, i.e. all its nodes must have a corresponding ingoing or outgoing edge. Second, for a feature to make sense in some trend context, at least one of its nodes must represent a hashtag occurrence.

6.4.2 Feature set check

Since we did not want to make any unnecessary presumptions, the relational features in our set consist of all potentially valuable graphs up to a certain size 1). But to create the final subset, one has to check for their isomorphism (see Section 6.5), as mutually isomorphic features would finally carry the very same information by the means of matching.

1) Actually we do not hold the whole set of features until the final state, as they are created and checked sequentially

6.5 Isomorphism problem

Looking at the graph features, one can quickly notice that some of them represent the same relation topologies and will end up with the same number of matchings. These features are represented by isomorphic graphs and need to be eliminated, especially for computational speed, as repeated matching of the "same" feature is costly. Checking whether two graphs are isomorphic is a well-known problem in computational theory, one of the few remaining problems known to lie in the NP class yet not known to be either in P or NP-complete [37]. There is a number of heuristic algorithms designed for this task, yet for our purpose of detecting mutual isomorphism between small features we do not need to optimize to the extreme, and can proceed with only a few tricks that are common in approaching this problem (see the illustration in Figure 6.6). The task can then be seen as graph preparation, e.g. calculating invariants, and discovering an isomorphic mapping.

Figure 6.6. The graph isomorphism problem is not easy to solve even for small graphs. A hashtag presence in nodes "A" and "C" can quickly guide us to a solution, as described in Section 6.5.1 on invariants.

6.5.1 Calculating invariants

This is a very practical technique that helps solve the graph isomorphism problem, since it is independent of the actual algorithm used and is easy to calculate. An invariant is generally a number assigned to a vertex such that if there is an isomorphic projection between two graphs, the mapping of the vertices corresponds with the mapping of the invariants, i.e. a vertex with a given invariant will be mapped to a vertex with the same invariant value. Many different invariants have been proposed in the literature, all of them based on statistics describing the vertex's neighborhood. These include invariants like two-paths, adjacency triangles and k-cliques, which are examples of invariants actually used in one of the fastest programs for graph isomorphism checking, called Nauty [38]. These invariants are usually further combined to produce one final and possibly unique invariant value. Since our graphs are quite small and the number of possible projections is not that high, the potential cost of calculating complex invariants could easily outweigh the savings on the number of mappings tested. Thus only the most common invariant, combining the in-degree and out-degree of a node with its hashtag presence flag, suffices in our case.

6.5.2 Isomorphic mapping

Having the invariants prepared, we can calculate a "trace" of each graph feature by simply ordering the node invariants into a vector. These trace vectors are then subject to a mutual test every time a new feature attempts to be added to the existing set. If the test is passed, i.e. the vectors differ, the feature can be safely added to the set. If the vectors are the same, we proceed to further checking of mutual isomorphism.


The goal is to find all possible mappings on the trace vectors and apply them to the original graphs for a final check of their equality. Since the trace vectors are ordered, we can proceed systematically from the top and check all possible permutations of mappings where the elements of the trace vectors carry the same value. We finally use each such mapping to project one of the features' extended adjacency matrices 1) and check it against the other's. If both extended adjacency matrices are equal, the two original graphs are isomorphic; if not, we proceed to the next mapping. When all the mappings are depleted and the equality check has still not passed, the two graphs represent different relation structures, and thus both are valuable to us as features.
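The invariant and trace computation can be sketched as follows. The encoding that packs in-degree, out-degree and the hashtag flag into one comparable number is our illustration, not the thesis code; any injective combination works for graphs this small. Differing traces immediately prove two features non-isomorphic, so no mapping needs to be tested:

```java
import java.util.Arrays;

public class InvariantTrace {

    // adj[u][v] == true means a directed edge u -> v; label[v] is the
    // hashtag presence flag. Returns the sorted vector of per-node
    // invariants (in-degree, out-degree, label) packed into ints.
    static int[] trace(boolean[][] adj, boolean[] label) {
        int n = label.length;
        int[] t = new int[n];
        for (int v = 0; v < n; v++) {
            int in = 0, out = 0;
            for (int u = 0; u < n; u++) {
                if (adj[u][v]) in++;
                if (adj[v][u]) out++;
            }
            // pack (in, out, label) into one comparable invariant value
            t[v] = (in << 8) | (out << 1) | (label[v] ? 1 : 0);
        }
        Arrays.sort(t);  // node order must not matter
        return t;
    }
}
```

For example, the feature "labeled node pointing at an unlabeled node" and its node-relabeled copy produce identical traces (they are isomorphic), while moving the label to the other node changes the trace and rules isomorphism out without any permutation search.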

6.6 Feature matching

By feature matching we mean the process of searching for occurrences of the graph features (described in Section 6.3) in the original network by means of projection onto its subgraphs. This is known as the subgraph isomorphism problem, which was proved to be NP-complete in Cook's original theorem paper [39]. In the later phase of implementation we moved, for computational reasons, from isomorphism to homomorphism mapping, while maintaining the expressiveness to distinguish between various trends' occurrences.

6.6.1 Heuristic ordering

As a matter of implementation, it is clear that at some point of matching a feature can no longer be considered plainly as a graph, but rather as a sequence of nodes and edges. This sequence specifies the order in which the feature is going to be mapped into the original network. Such an order is important, as it can affect the performance of matching significantly. Similarly to the constraint satisfaction problem (CSP) approach [40], an effective and straightforward strategy is to begin with the most heavily connected node, applying the maximal number of restrictions from the beginning of the search, and continue in the same manner while adding the rest. That means at each step, from all potential nodes to be added, we select the one with the highest degree of connection to what we already have. This strategy helps to avoid unnecessary checks of nodes that are isolated from the currently matched feature part and decreases the number of candidates in the search context. The order of nodes generated is actually used as a projection of the original feature's extended adjacency matrix, a technique we already used in isomorphism testing in Section 6.5. This way all the features can be approached in a unified manner, following the ordered elements of their matrix as a path in the search strategy 2).
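The greedy ordering can be sketched as follows. This is an illustrative version with our own names and an arbitrary tie-break (first candidate wins); the thesis implementation may break ties differently:

```java
import java.util.ArrayList;
import java.util.List;

public class MatchingOrder {

    // Greedy CSP-style ordering: start from the most connected node, then
    // repeatedly add the remaining node with the most edges (in either
    // direction) into the already selected part of the feature.
    static List<Integer> order(boolean[][] adj) {
        int n = adj.length;
        boolean[] chosen = new boolean[n];
        List<Integer> result = new ArrayList<>();
        for (int step = 0; step < n; step++) {
            int best = -1, bestScore = -1;
            for (int v = 0; v < n; v++) {
                if (chosen[v]) continue;
                int score = 0;
                for (int u = 0; u < n; u++) {
                    boolean connected = adj[u][v] || adj[v][u];
                    // first pick: total degree; later picks: edges into the
                    // already chosen part only
                    if (connected && (result.isEmpty() || chosen[u])) score++;
                }
                if (score > bestScore) { bestScore = score; best = v; }
            }
            chosen[best] = true;
            result.add(best);
        }
        return result;
    }
}
```

On a size-3 "star" feature where node 1 points at nodes 0 and 2, the ordering starts at the hub node 1, so both subsequent expansions are immediately constrained by an edge into the matched part.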

1) The extended adjacency matrix represents the adjacency matrix with edge direction and additional information on hashtag presence in the nodes
2) Actually only the upper triangular part of the extended matrix is used, as it contains the information on edge direction

6.6.2 Search method

Having each feature represented as a sequence of nodes and edges, we can start testing it against the original network. For that we need some basic structures to work with. Because the feature's node ordering focuses on the degree of connectivity but does not make any preference on the direction of edges, we need to be able to search in both directions. For that purpose we store information on each network node's friends and followers in a hashmap structure. Additionally, for each hashtag and each time-fold interval, we store the set of users on whose timeline it was present. The search itself is now straightforward and is repeated for every hashtag and every time-fold given. We pick up the first node of the feature and check whether it is supposed to contain a hashtag or not. This gives us a first set of candidates from the network, labeled or unlabeled with that particular hashtag, to continue with. Next we start checking the adjacent edges, where each such edge that is present will, for each of the candidates, generate a new set of candidates, either friends, followers or both, all subject to a further check on their hashtag label. With each of these candidates we again continue in a recurrent manner.

1  procedure searchFeatures(TimeFrame, Features)
2    for each Hashtag in allHashtags
3      for each TimeFold in TimeFrame
4        hashtaggedUsers = TimeFold.getUsersWith(Hashtag)
5        unHashtaggedUsers = allUsers \ hashtaggedUsers
6        for each Feature in Features
7          frequency = 0
8          path = Feature.getPath()
9          for each user in allUsers
10           trace[0] = user
11           matchFeature(step.init(), trace)

Figure 6.7. The procedure applying matching of the provided features on the given time-frame.

The search procedure printed in Figure 6.7 just organizes the work across the features and time-frames, and its interpretation is straightforward. The feature matching itself, as outlined in Figure 6.8, is more interesting. Having initialized the matching in the search procedure, we continue following the path of each feature, as introduced in the ordering Section 6.6.1. The current state of the walk is held in the variable step, which basically substitutes for an edge in the graph feature. Each step on the path then represents an edge expansion, determining the source and target users on opposite sides of the corresponding tie in the network. The users expanded are stored in the corresponding places of the trace vector along the way. With each next step a new user is expanded (line 17) according to the current edge type it represents (line 9), tested for hashtag presence (line 6), and if a constraint has been set earlier in the walk by another edge (line 18), it is also tested for consistency (line 19). If all the edges are successfully checked, we step to the end of the path (line 2) and increase the frequency counter. It should be stated that the actual implementation of feature matching, even though it builds on this idea, is almost completely different from the pseudocode, which is very inefficient. The procedure has gone through many optimizations, making its structure


1  procedure matchFeature(step, trace)
2    if step at end of path
3      frequency++
4      return  // successful feature match
5
6    if path.user(step.source()).hasHashtag() != trace[step.source()].hasHashtag()
7      return  // hashtag presence inconsistency
8
9    case path.at(step).edge()
10     is follower
11       candidates = trace[step.source()].getFollowers()
12     is friend
13       candidates = trace[step.source()].getFriends()
14     is both
15       candidates = trace[step.source()].getBoth()
16
17   for each candidate in candidates
18     if trace[step.target()] is not empty
19       if trace[step.target()] = candidate
20         matchFeature(step.setNext(), trace)
21       else
22         continue  // target user inconsistent, try next candidate
23     else
24       trace[step.target()] = candidate
25       matchFeature(step.setNext(), trace)
26       trace[step.target()] = empty  // backtrack before the next candidate

Figure 6.8. An inside look at the feature matching procedure in pseudocode. Compared to a naive version, an inconsistent candidate only skips to the next one (line 22), and the trace assignment is undone on backtracking (line 26).

quite complicated. We needed to decrease the number of procedure calls by forward checking on hashtag presence and edge consistencies, get rid of the recursion, and mainly introduce a new data representation, as discussed in the following Section 6.6.3.

6.6.3 Set intersection speedup

It is important to note that the most crucial operation in the actual matching method is finding the intersection of two sets of users. That is, in forward checking, every time we ask which friends have used a hashtag, what the common followers of two users are, and generally which newly generated user candidates are contained in the previous candidate set, we are asking for a set intersection. With such sets of users, originally represented as hashmaps, each intersection operation takes time linear in the number of users present in one of the sets, O(n), since the access time to an object in a hashmap is constant, O(1). This unfortunately proved not to be good enough for searching among thousands of users in the network, and hashmap intersection became the narrowest bottleneck of the implementation. A performance tweak comes with bringing the problem down to a bit representation, as mentioned in Section 4.4.2. Since the set of all users in our network is static while performing the matching, we can represent each user as a single bit flag in an N-bit array, where N is the number of users. The advantage is that, having a whole subset of users represented as a single bit-array with flags at the corresponding elements, we can perform the intersection as a logical AND operation. Similarly, the set union corresponds to a logical OR operation upon the bit-arrays. Fortunately, there is a good implementation of a bit-array-like structure in Java called BitSet, allowing these and more useful operations, e.g. calculating cardinality, which in our case represents the number of users in the subset. Utilizing BitSet, the performance of feature matching increased enormously, enabling us to complete the task in a reasonable time. Another minor speedup is also implemented for linear features, i.e. features where no backward checking of edge consistency is required. With these features we can continue from node to node, and since the generated set of candidates will not change on the rebound, we can calculate purely with its cardinality.
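The bit-set representation can be illustrated with java.util.BitSet directly (the helper and variable names below are ours, not the thesis code):

```java
import java.util.BitSet;

public class UserSetOps {

    // User i is bit i. Intersection of two user sets is a logical AND;
    // cardinality() then gives the number of users in the result.
    static int commonUsers(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone(); // and() is destructive, keep operands intact
        intersection.and(b);
        return intersection.cardinality();
    }

    public static void main(String[] args) {
        BitSet friendsWithTag = new BitSet();
        friendsWithTag.set(2); friendsWithTag.set(5); friendsWithTag.set(7);
        BitSet candidates = new BitSet();
        candidates.set(5); candidates.set(7); candidates.set(9);
        System.out.println(commonUsers(friendsWithTag, candidates)); // prints 2
    }
}
```

BitSet performs and() word by word over 64-bit longs, so one machine instruction processes 64 users at once, which is where the enormous speedup over per-element hashmap lookups comes from; union is the analogous or() operation.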

Chapter 7 Experiments

In the experiments we demonstrate how different options influence the final outcome of learning. On one side we have global settings to study, e.g. window properties, top-k thresholds and datasets; on the other side there are various feature options for testing. All of these together form a training dataset as a subject of learning for the different types of learners. While these options possess an exhaustive number of combinations, we provide only a selected few that are important for intuition, and proceed sequentially to comparing the final results on selected target classes in the corresponding Section 7.3. Moreover, we test some interesting phenomena and secondary approaches. Unless a particular component under study dictates otherwise, in the following experiments we use the common context of the top-20 trending topics, measured by accuracy on the basic class within a one-day target window part, for demonstration.

7.1 Settings

In this section we provide examples of how different settings influence the learner's accuracy. Each of the following categories sums up the behavior of the respective component and compares it with our intuition. These comparisons are also illustrated in the accompanying charts. Although the settings influence the overall performance as a whole, we take each particular component out of the frame for demonstration purposes, while the presented behavior mostly holds true regardless of the context.

7.1.1 Sliding window properties

The sliding window consists of a training part and a target (goal) part. The training part is where the features are extracted and passed to the respective learner, while the target part is where the prediction is checked against reality. We examined a range of up to 7 days for the training part, as we assume weekly cycles in the communication on Twitter, and up to 3 subsequent days for the target part. Our intuition that the learner's accuracy should increase with the size of the training part is met in most cases. The exception to this assumption is mainly caused by over-fitting of the graph learner, where there are too many features and too few training examples, causing the learner's accuracy to decrease with increasing training window size. This behavior was suppressed by extending the time-scope in the new dataset and thus creating more training samples (see the influence in Figure 7.1).
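The sliding-window splitting over daily time-folds can be illustrated by the following sketch; the method name and day indexing are hypothetical simplifications, not the thesis' actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sliding-window generator over daily time-folds: each
// window is a training part of `trainDays` consecutive days followed
// immediately by a target part of `targetDays` days (no gap).
public class SlidingWindows {

    // Returns {trainStart, trainEnd, targetStart, targetEnd} quadruples
    // (inclusive day indices) for every full window fitting into `totalDays`.
    public static List<int[]> windows(int totalDays, int trainDays, int targetDays) {
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start + trainDays + targetDays <= totalDays; start++) {
            int trainEnd = start + trainDays - 1;
            result.add(new int[] {start, trainEnd, trainEnd + 1, trainEnd + targetDays});
        }
        return result;
    }

    public static void main(String[] args) {
        // e.g. 10 days of data, 7-day training part, 3-day target part
        for (int[] w : windows(10, 7, 3)) {
            System.out.printf("train %d-%d, target %d-%d%n", w[0], w[1], w[2], w[3]);
        }
    }
}
```

Each generated window yields one training example per hashtag (features from the training part, label from the target part), so a longer time-scope directly means more training samples.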

With the shows in top-k target class, tested in Figure 7.2, making predictions for a wider target window is harder, as the uncertainty of what can happen grows. A wider target part implies more topics to be classified as trending, which is generally more difficult than with non-trending topics. Moreover, determining which topics will be amongst the trending tomorrow is easier than doing so for a two or three day horizon (Figure 7.2). On the other hand, with the stays in top-k target class, as demonstrated on the graph learner in Figure 7.3, the situation is the opposite. The wider the target window part, the closer


Figure 7.1. Over-fitting of graph learner (on the left), solved by extending the time-scope for learning (on the right).

[Chart: window properties influence (simple learner); accuracy [%] vs. train window size 1-7 [days], goal sizes 1-3.]

Figure 7.2. Influence of window size properties on learners’ accuracy, tested on shows in top-k class.

38 ...... 7.1 Settings

[Chart: window properties influence (graph learner); accuracy [%] vs. train window size 1-7 [days], goal sizes 1-3.]

Figure 7.3. Influence of window size properties on learners’ accuracy, tested on stays in top-k class, shown on graph learner for demonstration.

restrictions are put on the trending topics, creating fewer positive examples that are generally hard to classify. The described target-size ordering holds with confidence for the learners' accuracy, yet does not always apply to other measures. As a matter of interest, the counter-example to this behavior is the simple learner: while the portion of positive and negative examples changes with increasing size of the target window, its specificity and precision follow the reverse order, opposite to what we defined by the target class (see Figure 7.4). Finally, we can state that an eventual gap between the training and target part decreases the accuracy consistently as the prediction goes further into the future.

Figure 7.4. Comparison of accuracy and precision of simple learner on shows in top-k class - reverse of order in target sizes.

Here we also decided on the granularity of time-folds to be equal to days. Smaller time-folds, e.g. hours, make the hashtag occurrences too sparse, and wider time-folds,

e.g. weeks, just make the whole prediction task too difficult. For further experiments we consider no gap, a target size of a single day (except for examining the shows/stays class performance in the results Section 7.3.1), and training window parts of 1-4 days, where the accuracy improvements are most significant.

7.1.2 Top-k threshold

In this part we focus on deciding the threshold for the number of topics to be considered trending. There is usually a list of 10 trending topics displayed in Twitter-related applications, but there is no reason not to go further. We examined how the difficulty of the task varies with respect to this threshold within the range of 10-50 (Figure 7.5).

[Chart: top-k threshold influence; accuracy [%] vs. top-k hashtags 10-50, for the simple, baseline and graph learners.]

Figure 7.5. Influence of top-k threshold choice on learners’ accuracy.

Clearly, the more hashtags we assume in the positive class, the more difficult the task is. This comes from the fact that non-trending hashtags with a low rating, far from the threshold, are very unlikely to reach the top-k and thus are easily classified, yet the situation among the top rated is more volatile. The more hashtags are assigned to the positive class, the smaller the portion of those we are fairly certain about. As stated in the chapter introduction, for further experiments we mostly consider the top-20 hashtags in the trending list, since this ensures that the threshold cuts clearly between trending and non-trending in the case of equal ratings amongst the top-10 (specifically for the top-k% metric).

7.1.3 Datasets

The choice of the datasets downloaded naturally influences the accuracy too. Some approaches are more resilient to a change of data and some less; the purpose of crawling several datasets was to test this ability. We have three datasets in total (see the overview in Section 4.1). Two of them, “December” and “March”, maintain the network structure


for testing the generalization ability on different hashtag sets. The third one, “April”, with a different distribution of ties amongst the users, is introduced to test the resilience to structural changes. In this experiment we focus on the two main approaches, baseline and graph, since we are mostly interested in the influence of hashtag frequency variation on the base features, and of network structure change on the graph learner. Each dataset generates a train and a test set separated in time, while learning is always performed on the training part and evaluation on the test part of the chosen dataset. We have tested the following schemas:
. Content resilience - train and test the learners on the “December” dataset, then train them on “March” but keep the test sets generated from “December”.
. Structural resilience - train and test the learners on the “April” dataset, then train them on “March” yet test on “April” again.

Let us take a look, in the order mentioned above, at how the baseline and graph learners deal with the task in Figure 7.6 and Figure 7.7, respectively.

[Chart: data change resilience (baseline); accuracy [%] for original vs. swapped training sets, over content and structure changes.]

Figure 7.6. Datasets change resilience of baseline learner.

To explain the phenomena seen in the charts, we need to recall the sizes of our datasets mentioned in Section 4.1. The largest dataset provided is “March”, then goes “December”, with an equal number of users yet a shorter time-scope, and “April”, with a smaller user set. Generally, the larger the training set, the better the accuracy. On that point, in the content part of the resilience charts, we wanted to demonstrate that a change of time-scope on the same user set plays a minimal role, as we even increased the original accuracy on “December” by swapping to the larger training dataset from “March” for both learners. Among other aspects, we can attribute the content resilience to the feature ranking, described in Section 6.1.1 and examined closer in Section 7.2.1.


[Chart: data change resilience (graph); accuracy [%] for original vs. swapped training sets, over content and structure changes.]

Figure 7.7. Datasets change resilience of graph learner.

Following the same point, the overall lower accuracy on the “April” dataset (on the right) can be explained by the smaller number of users crawled, which makes the hashtag occurrence data more sparse and the corresponding trends' behavior more fluctuating.

The structure resilience part of the charts then shows that our graph approach is more sensitive to network change, which is unfortunate, yet expected. While switching the train sets between “March” and “April” had almost no effect on the baseline learner's accuracy evaluated on “April”, the graph learner's performance decreased noticeably. This is predictable, since the graph approach indirectly uses the network tie distribution information through its relational features. We can mildly suggest that, despite the sensitivity to structural changes on “April”, it still fares slightly better than the baseline here, yet such comparisons belong to the results in Section 7.3.

7.2 Feature options

In this section we study how different parameters of the features influence the accuracy. We have listed a number of options in Chapter 6 on features; here we test how they fare. The purpose of these experiments is to determine the best feature options for each of the learners to proceed with into the final comparison in Section 7.3. Since this is going to influence the learners' resulting performance, we want to put more confidence in the experiment results and, rather than single-valued statements, we start displaying distributions over several runs and window sizes.

7.2.1 Ranking

Ranking of frequencies is a common concern of the baseline and graph features and is applied to both in the same manner, with a similar effect. Ranking was introduced into learning to build resistance to shifts in the overall hashtag frequencies while still being able to distinguish between trending and non-trending.


Since in the ranking the information on hashtag presence can be lost, we complete the features with signature flags distinguishing present from zero-frequency hashtags. We considered this component separately to measure its effect in combination with ranking. Finally, we display the distribution of its influence on the baseline (Figure 7.8) and graph learner (Figure 7.9).
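One way such a rank-plus-signature transform could look is sketched below; the exact ranking and tie-breaking scheme used in the thesis may differ:

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a rank-plus-signature feature transform: raw frequencies
// are replaced by their rank (1 = highest frequency), each paired with
// a presence flag (1 if frequency > 0, else 0). Ties are broken by the
// stable sort order; the thesis' exact scheme may differ.
public class FeatureRanking {

    public static int[][] rankWithSignatures(int[] frequencies) {
        Integer[] order = new Integer[frequencies.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Stable sort of indices by descending frequency.
        Arrays.sort(order, Comparator.comparingInt(i -> -frequencies[i]));
        int[][] out = new int[frequencies.length][2];
        for (int rank = 0; rank < order.length; rank++) {
            int idx = order[rank];
            out[idx][0] = rank + 1;                      // rank feature
            out[idx][1] = frequencies[idx] > 0 ? 1 : 0;  // signature flag
        }
        return out;
    }

    public static void main(String[] args) {
        int[] freqs = {120, 0, 45, 45, 3}; // hypothetical hashtag frequencies
        for (int[] f : rankWithSignatures(freqs)) {
            System.out.println("rank=" + f[0] + " present=" + f[1]);
        }
    }
}
```

The ranks are invariant to a uniform shift or scaling of all frequencies, which is exactly the resistance the ranking was introduced for, while the signature flag preserves the presence information that pure ranks would lose.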

[Chart: ranking influence (baseline); accuracy [%] for the frequency, ranking and +signatures variants.]

Figure 7.8. Absolute frequencies of hashtags vs. ranking: influence on the baseline learner's accuracy.

[Chart: ranking influence (graph features); accuracy [%] for the frequency, ranking and +signatures variants.]

Figure 7.9. Absolute numbers of graph feature matchings vs. ranking: influence on the graph learner's accuracy.


We noticed that ranking of the features improves the overall accuracy, especially with graph features, where the frequencies representing the numbers of feature matchings are extreme. Thus we consider the contribution of ranking positive and continue to use it in further experiments.

7.2.2 User features

We wanted to present user features mainly as another baseline to the other approaches, since their representation is very natural, yet there are problems that this approach carries. The first problem with user features is that they are, at least in their original form, custom-tailored to a particular network subset. Secondly, they are too demanding for learning, since every single user potentially represents a feature. To reduce this feature set we introduced two subsampling strategies in Section 6.1.2, which we compare here in Figure 7.10. Although user features do not strictly belong to either the graph or the baseline category, here we considered combining them with the base features to improve their accuracy, similarly to the graph approach. In both strategies we subsampled 100 users per hashtag.
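The two subsampling strategies might be sketched as follows; here "influential" is approximated by a hypothetical degree score (e.g. follower count), which may differ from the actual criterion of Section 6.1.2:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of the two user-subsampling strategies compared above.
// "Influence" is approximated here by a per-user degree score
// (e.g. follower count); the thesis' exact criterion may differ.
public class UserSubsampling {

    // Uniformly random sample of up to `k` user ids.
    public static List<Integer> randomSample(List<Integer> users, int k, Random rnd) {
        List<Integer> copy = new ArrayList<>(users);
        Collections.shuffle(copy, rnd);
        return copy.subList(0, Math.min(k, copy.size()));
    }

    // Top-k users by a degree score, highest first.
    public static List<Integer> influentialSample(List<Integer> users, int[] degree, int k) {
        List<Integer> copy = new ArrayList<>(users);
        copy.sort(Comparator.comparingInt((Integer u) -> -degree[u]));
        return copy.subList(0, Math.min(k, copy.size()));
    }

    public static void main(String[] args) {
        List<Integer> users = List.of(0, 1, 2, 3, 4);
        int[] degree = {10, 500, 30, 200, 5}; // hypothetical follower counts
        System.out.println(influentialSample(users, degree, 2)); // [1, 3]
        System.out.println(randomSample(users, 2, new Random(42)));
    }
}
```

With 100 users sampled per hashtag, the focused variant keeps the users most likely to carry signal, which matches the positive effect observed in Figure 7.10.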

[Chart: subset sampling effect (user features); accuracy [%] for random vs. influential subsampling.]

Figure 7.10. Effect of two subsampling strategies on user features accuracy.

As we can see, focused subsampling of influential users has a positive effect on accuracy. The problem is that the user features approach alone performs poorly. It is sensitive to user selection and settings, and since ranking cannot be used here, every user can change the outcome significantly. Adding the base features, as tested above, increases the accuracy rapidly, yet from the other point of view the user features make no improvement to the baseline model. As a result, we decided not to continue with user features in our approach, as they provide us neither with an improvement of our method nor with a closer threshold.

7.2.3 User modeling

We dedicate this section to the user model approach. Since it is very different from the rest of the methods, the corresponding experiments also run in a different manner. There


would be many possibilities to study with the user model features, yet as we consider it a secondary strategy in this thesis, here in the feature-options experiments we restrict ourselves to the effect of the weighted selection, described in the user model features Section 6.2, on the trends prediction performance. The result, in this case measured by the top-k% metric, can be observed in Figure 7.11.

[Chart: subsample selection (user modeling); top-k% [-] for random vs. weighted selection.]

Figure 7.11. Effect of weighted selection on user model performance as measured by the top-k% metric.

From the spread of values we can notice that the random choice of samples causes a high variance of performance, and the method is thus very sensitive to subsampling. This unfavorable behavior is suppressed by weighted subsampling. Even though the weighted selection helped the method dramatically, we have to mention that the user-model strategy proved not to be accurate enough on a global scale and it still remains below the demarcated threshold of the baseline model. We must state that these results are just preliminary, as in our user model we incorporated only a very basic strategy and features, while user modeling outlines a much wider frame of capabilities. Prospective improvements could lie in considering more than just a single user model. Users could then be clustered, e.g. according to their use of hashtags and willingness to share them. It should also be added that, although the performance seems quite low, user modeling still fares much better than the standalone user features. The handicap here is that it cannot be easily combined with the baseline.

7.2.4 Graph features

Throughout the whole project, the most extensive experiments have naturally been done with the graph features. We have tested and tuned a number of options. We chose relational features of one-, two- and three-node sizes, combined together. For the time features we finally incorporated only the linear subset of these features. We introduced ranking of the feature matching frequencies. We considered three different Twitter relations (see Section 4.4.1) in creating the network structure. And finally, we experimented with combinations of all the components mentioned.


Again, as in the previous experiments, we highlight only some of the options for demonstration, to guide us to the final tuning of the graph feature approach for the collation in the next Section 7.3. These are the effect of different relations, seen in Figure 7.12, and the addition of time features, displayed in Figure 7.13.

[Chart: relation selection performance; accuracy [%] for the follows, retweets and replies relations.]

Figure 7.12. Effect of different Twitter network relations on graph features.

[Chart: time features addition effect; accuracy [%] for relational, time and combined features.]

Figure 7.13. Performance of time features and effect of combining them with relational ones.


The results of the experiments with graph features, as measured by accuracy, come out pleasingly clear, following the intuition from the previous chapter on features (Section 6.3). The winner amongst the relations is the original follows type, although the other relations do not fare badly either. The time features have shown great value, as here they could compete with the preselected relational features on their own. That was not their purpose, and thus we can add them to create a superior combination as far as the measured accuracy is concerned. To make certain of this selection of options, we briefly check it again on the top-k% metric, ensuring we really choose the best possibility for the final collation.

[Chart: graph approach tuning; top-k% [-] for follows, retweets, replies, relational, time and both.]

Figure 7.14. Final tuning of graph learning approach on top-k% metric.

The situation observed with the graph approach tuning, as measured by the top-k% metric in Figure 7.14, is similar to the previous experiments. Even though the distinction is not as clear as with accuracy, especially concerning the time features, we decide to stay with the previously selected options.

7.3 Results

This section is where the tested learning approaches are submitted for the final comparison. They are evaluated on the target classes and metrics we laid out in the evaluation Section 5.3. Even though various combinations of features and settings have different influences on the outcome, here we continue with the static settings tuned in the previous sections for each learner. While the settings and features stay fixed, randomness is introduced by changing the seed of the random forest classifier, to provide confidence intervals on the results. The following collations provide the closing results for conclusions about our approach. Let us start straight away with the main declared purpose of the thesis and compare our core threshold learners introduced in Section 5.2, i.e. simple and baseline, with our graph approach, according to their accuracies on the basic class; see the result in Figure 7.15.


[Chart: core learners performance comparison; accuracy [%] for the simple, baseline and graph learners.]

Figure 7.15. Final comparison of our core learners measured by accuracy.

7.3.1 Shows or stays

While considering the basic performance class, we should also recall the two strategies we set up in the Section on classes 5.1.2, i.e. stays in top-k and shows in top-k, describing the trend's presence in the top-k listings accordingly. For this test, let us increase the target part of the sliding window up to three days, and see how our learners deal with each of them.

[Chart: shows in top-k comparison; accuracy [%] for the simple, baseline and graph learners.]

Figure 7.16. Final comparison of our core learners on the shows in top-k class for 3 days target window interval.


[Chart: stays in top-k comparison; accuracy [%] for the simple, baseline and graph learners.]

Figure 7.17. Final comparison of our core learners on the stays in top-k class for 3 days target window interval.

When deciding whether a hashtag will show up among the top-k on any of the target window days, the performance of all learners is comparable, as can be seen in Figure 7.16. Seemingly, this task is just too difficult for learning and comprises too wide a range of hashtag behavior, which could explain the competitive performance of the simple learner. The task that is more specific, as to the restrictions put on the hashtag's behavior, is deciding whether a hashtag will stay among the top-k for all the target window days. In this case, as displayed in Figure 7.17, we can clearly differentiate between the performance of each learner and support our approach one more time.

7.3.2 Top-k%

Until now we mostly tested the approaches for their accuracy, because it works consistently for the demonstration of different phenomena. Yet as depicted in the motivation of learning (Figure 4.6), the top-k% metric has become our desired measure of prediction success. It replaces the standard accuracy in determining the performance on the top trends listings in a natural way. Let us put the simple, baseline and graph approaches to a real test in this manner and see how they fare in Figure 7.18. Comparing the learners on this metric changes their standard behavior, as this is not a measure the classifiers try to optimize directly, but rather one derived from it. The changes are in favor of the graph approach, which beats the other learners by a margin of several percent. Since we consider top-k% the most important measure in this thesis, let us review all the learning approaches accordingly once more in Figure 7.19. User modeling, as a very distinctive approach, stands at the bottom of the performance scale and would require some further tuning. The rest of the learners utilize the base features, which puts them closer together. It is important to note that, while adding the user features at their best still decreases the baseline performance, adding the graph features increases it rapidly, out of the reach of the simple approach.
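Assuming the top-k% metric measures the overlap between the predicted and the actual top-k trending sets (the precise definition is given in the evaluation chapter and may differ in details such as tie handling), it could be computed as:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of a top-k% style measure: the fraction of the actual
// top-k trending hashtags that also appear in the predicted top-k.
public class TopKPercent {

    public static double topKPercent(List<String> predictedTopK, List<String> actualTopK) {
        Set<String> predicted = new HashSet<>(predictedTopK);
        long hits = actualTopK.stream().filter(predicted::contains).count();
        return (double) hits / actualTopK.size();  // overlap fraction in [0, 1]
    }

    public static void main(String[] args) {
        // Hypothetical top-4 listings: two of the four actual trends were predicted.
        List<String> predicted = List.of("#a", "#b", "#c", "#d");
        List<String> actual = List.of("#a", "#c", "#e", "#f");
        System.out.println(topKPercent(predicted, actual)); // 0.5
    }
}
```

Unlike accuracy, this measure is unaffected by the large pool of easily classified non-trending hashtags, which is why it separates the learners more sharply.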


[Chart: core learners performance comparison; top-k% [-] for the simple, baseline and graph learners.]

Figure 7.18. Final comparison of our core learners on the top-k% measure.

[Chart: all learners performance comparison; top-k% [-] for modeling, userFeatures, simple, baseline and graph.]

Figure 7.19. Final comparison overview of all tested learners on the top-k% measure.

7.3.3 Expands

The last challenge we should face our learners with is the ability to predict a common hashtag's spread. As described in Section 5.1.4, the expands target class became important for being an unbiased measure, resilient to data skewness and to the simple approach. Moreover, determining which hashtag will build up its presence in the near future of a network subset clearly has a practical value. The simple approach was not used here; even though there could be strategies estimating the future frequency difference, it would not be considered simple anymore, thus here


[Chart: comparison on expands class; accuracy [%] for the baseline and graph learners.]

Figure 7.20. Comparison of baseline and graph approach over the expands class.

we leave the position of the threshold solely to the baseline learner. Its comparison against our graph approach can be seen in the final chart in Figure 7.20. Although the results are not as completely discriminative as in some of the previous cases, we can state that on average the graph approach beats the baseline model by a margin of several percent.

Chapter 8 Conclusion

We have proposed, tested and evaluated an approach for the prediction of trends spread within a local Twitter subnetwork, utilizing the topology structure information, based on a representation inspired by methods from the area of biological networks. The demonstrated results prove the value of knowledge of the network structure and the contribution of the approach itself. The method is based on graph features, reflecting various settings of the local relational topology in the network. The features are represented as a set of small graphs, where each edge stands for a respective relation from the network, and each node, signed with a trend presence flag, stands for a Twitter user. The signature of the trend's presence in the network is then created by means of subgraph matching of these features, and subsequently fed into a machine learning model for classification. We have also created several other methods for the task: firstly the user modeling, employing a bottom-up, user-focused approach to predicting the trend behavior; next the user features, utilizing a raw user-trend occurrence representation; and the simple approach, predicting stable trend behavior based on past average occurrence. Finally, there is the baseline model, standing as the main threshold to our approach, utilizing machine learning for the prediction of trend occurrence as a pure time series problem. For the evaluation we have followed a framework by which we tested the selected approaches for comparison over the same targets, e.g. predicting when a topic spreads out, when it becomes a trend, and whether it shows up or stays a trend during a target time interval. At the same time, various options and settings of the learners and the respective feature enhancements were examined. As demonstrated in the experiments, the graph features were the only features tested to prove their value by consistently increasing the baseline performance. As a result, the graph learner emerged as the top performer, beating the rest of the methods by a clear margin at most of the tasks. In the thesis we also describe some implementation details and heuristic tweaks for the graph features' creation and subsequent matching. The implementation and strategy of the Twitter data acquisition is discussed, and the data are studied for their content and structure. At the beginning of the thesis, a reader can find a general introduction to social network analysis, Twitter and trends.

8.1 Future work

Considering the number of features and options outlined, there is significant space for prospective improvements. The extension of the relational features with time values, extracting knowledge of short-term trend dynamics, proved promising and could be further investigated for adding more information on the trend behavior, while distinguishing between memes and real-world events. Moreover, the metrics of the spatial distribution of users, e.g. centrality and clustering, could be taken into account, giving weights to users


and related features. Decomposition of the network into clusters, to reflect local fluctuations of trending behavior, could also be investigated. To expose more of the real structure of the network in terms of information diffusion, we could use the retweeting behavior among the users and some user similarity measures. Finally, we completely ignored the content part of the trends, which also proved to play an important role in a trend's potential; there we could match the words accompanying a hashtag against selected lexical terms for categorization, and use this information in combination with the other features.


Appendix A Specification

Czech Technical University in Prague, Faculty of Electrical Engineering

Department of Cybernetics

DIPLOMA THESIS ASSIGNMENT

Student: Bc. Gustav Šourek

Study programme: Open Informatics

Specialisation: Artificial Intelligence

Title of Diploma Thesis: Twitter's Local Trends Spread Analysis

Guidelines:

1. Learn about trends in social networks and networks in general.
2. Analyse Twitter's interface for mining the relevant data and implement a program for its acquisition.
3. Transform the data collected and extract the desired features.
4. Using machine learning algorithms, try to predict the spread of local trends using a baseline model built upon statistical features and a model utilizing relational graph structure.
5. Compare both approaches and evaluate the benefits of relational learning.

Bibliography/Sources: Newman M. E. J.: Networks, An Introduction. Oxford University Press, 2010

Diploma Thesis Supervisor: Ing. Ondřej Kuželka

Valid until: the end of the summer semester of academic year 2013/2014

prof. Ing. Vladimír Mařík, DrSc.
Head of Department

prof. Ing. Pavel Ripka, CSc.
Dean

Prague, March 5, 2013


Appendix B Used Terms

B.1 Acronyms

SNA Social network analysis
CSP Constraint satisfaction problem
BFS Breadth-first search
API Application programming interface
SVM Support vector machines
SRL Statistical relational learning
ROC Receiver operating characteristic
JDBC Java database connectivity
REST Representational state transfer
ARFF Attribute-Relation File Format

B.2 Software

Java A general-purpose, concurrent, class-based, object-oriented programming language; used for the implementation of both the data download and the analysis.
Netbeans An integrated development environment, primarily for Java development; used as the IDE.
Twitter API 1.1 An interface for interacting with Twitter data and core functions; used to crawl the network subset and data.
Twitter4J An unofficial Java library for the Twitter API; used as a wrapper for the default API functionality.
MSSQL A relational database management system; used to store the Twitter network data.
Weka Waikato Environment for Knowledge Analysis; used for the machine learning algorithms, e.g. Random Forest, SVM and regression.
GraphViz An open-source graph visualization software; used to generate all graph and network pictures.
MatlabBGL A Matlab library providing robust graph algorithms; used for computing some of the network metrics.
CTUstyle A template for theses at CTU, based on plainTeX, created by Petr Olšák1), who has my thanks for that; used to typeset this document.

1) http://petr.olsak.net/ctustyle.html

Appendix C CD content

Unfortunately, due to the new Twitter terms1), it is not permitted to share the downloaded data. The data used for the analysis can be obtained through the Twitter API, as described in the data acquisition part of the thesis. For this task we provide the source codes of the crawler, which downloads the data into a prepared database that can be reconstructed using the enclosed SQL script. The program has the following dependencies, all of which can be downloaded for free from the addresses in the notes:
• Java - version 1.6 or later2)
• MSSQL - release 2008 R2, the free evaluation version will suffice3)
• Twitter4J - the new 3.0.x versions designed for Twitter API 1.14)
For connecting to the new Twitter API 1.1, an authorization access token needs to be presented. Such tokens can be generated from active Twitter accounts through the method listed in the oAuth directory and stored in the Accounts table in the database. With all that in place, an initial user id should be set in the Queue table and the crawl can begin. For the analysis part, a recent version of Weka5) and the joda-time library6) need to be included in the project.
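The token and seed setup described above can be sketched as plain SQL statements. The table names Accounts and Queue come from the text; the column names used here are only illustrative assumptions, since the actual schema is defined by the enclosed SQL script:

```sql
-- Hypothetical column names; consult the enclosed SQL script for the real schema.
-- Store an OAuth access token generated for an active Twitter account:
INSERT INTO Accounts (consumer_key, consumer_secret, access_token, access_token_secret)
VALUES ('<key>', '<secret>', '<token>', '<token secret>');

-- Seed the crawler with an initial user id to start the crawl from:
INSERT INTO Queue (user_id) VALUES (12345);
```

Once both tables are populated, the crawler can authorize its API requests and begin expanding the network from the seeded user.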

The enclosed CD contains the following files and directories:
• souregus.pdf - the PDF file with this thesis
• docs - directory with the source files for this document, excluding CTUstyle7)
  • images - contains images used for illustration
  • charts - contains experiment results charts
  • data - contains experiment results data
• database - directory with SQL scripts for creating the database structure
• download - directory with the source codes of the data acquisition part
  • oAuth - contains the method for obtaining Twitter oAuth access tokens
  • Twitter - contains the source codes of the crawler
• analysis - directory with the source codes of the whole data analysis part

1) https://dev.twitter.com/terms/api-terms
2) http://java.com/en/download/index.jsp
3) http://www.microsoft.com/en-us/download/details.aspx?id=23650
4) http://twitter4j.org/
5) http://www.cs.waikato.ac.nz/ml/weka/
6) http://joda-time.sourceforge.net/
7) http://petr.olsak.net/ctustyle.html
