Temporal Walk Based Centrality Metric for Graph Streams
Total Page:16
File Type:pdf, Size:1020Kb
Temporal Walk Based Centrality Metric for Graph Streams Ferenc Béres1;2 Róbert Pálovics3 András A. Benczúr1 1Institute for Computer Science and Control of the Hungarian Academy of Sciences 2Eötvös Loránd University Budapest 3Stanford University {beres, benczur}@sztaki.hu, [email protected] t < t < t < t ABSTRACT t1 t2 1 2 3 a b c Centrality measures account for the importance of the nodes of t t a network. In the seminal study of Boldi and Vigna (2014), the 2 3 b c t3 comparative evaluation of centrality measures was termed a dif- t3 ficult, arduous task. In networks with fast dynamics, such asthe c u Twitter mention or retweet graphs, predicting emerging central- t5 t4 ity is even more challenging. Our main result is a new, temporal d e t walk based dynamic centrality measure that models temporal in- 5 formation propagation by considering the order of edge creation. e t 4 < t5 < t This measure outperforms graph snapshot based static and other recently proposed dynamic centrality measures in assigning the highest time-aware centrality to the actually relevant nodes of the Figure 1: Temporal walks ending at node u before time t. network. One of our main contributions is creating a quantitative experiment to assess temporal centrality metrics. Our data and codes are publicly available1. In our best effort to provide quantitative evaluation for dynamic 1 INTRODUCTION centrality, we consider daily granularity and compile ground We present temporal Katz centrality, an online updateable graph truth based on an external source. We collect tweets about Roland- centrality metric for measuring user importance over time. We Garros 2017, the French Open Tennis Tournament (RG17), and consider temporal networks where the edges of the network arrive US Open 2017, the United States Open Tennis Tournament (UO17). continuously in time. In other words the graph is represented as a We compute both static and dynamic centrality metrics over the sequence of time-stamped edges [20]. Our metric is based on the time-aware mention graph that we extract from the tweets. We concept of time-respecting walks containing a sequence of adjacent define the mention graph by adding a time-stamped edge (u;v; t) edges ordered in time. As seen in Figure 1, temporal Katz centrality whenever user u mentions v in a tweet at time t. For ground truth, aggregates each temporal walk ending at node u before time t. we consider the Twitter accounts of players participating in daily Our new method is a graph algorithm for online machine learn- rounds as relevant. We then hour by hour investigate how men- ing [5]. Our goal is to reduce the delay caused by collecting data tions of players for the coming day take over the importance of only for the range of hours to process as a graph snapshot. We pro- past participants. vide online updateability, which poses computational restrictions In this paper, we design and evaluate an online updateable, dy- and challenges to most centrality measures and graph algorithms. namic graph centrality measure. Our main contribution is threefold: Although many studies tried to identify the best estimates for (1) We propose a new, online updateable path count based centrality the importance of a social media user, to the best of our knowledge, measure as a temporal variant of the successful Katz index [12]. there is only one previous study [20] that proposes data stream Our measure incorporates arbitrary time decay functions that can updateable centrality measures. However, their algorithm, which be adapted to the task in question. (2) We compile a data set with we analyze in Section 3.2, cannot incorporate the actual edge arrival ground truth labels for the quantitative evaluation of dynamic cen- times in its calculations. We believe our method is superior in trality. Our evaluation is based on our Twitter collection about using the exact time of interaction between two social media users, tennis tournaments. For centrality ground truth at a given time, resulting in better performance in our prediction task. we set the players participating in rounds on given days. (3) We Another key issue that we address is the difficulty of the timely experiment over Twitter tennis tournament data sets and observe evaluation of fast changes in social media. Even a static ground that our method outperforms the temporal PageRank of [20]. truth labeling for a static centrality measure requires tedious human effort. In [6], for example, the Text Retrieval Conference (TREC) 2 RELATED RESULTS topics are used. In a dynamic graph, depending on time granularity, To quantify the popularity of a node, several graph centrality mea- the same human data curation may be required in each time step. sures have been proposed [6]. The definitions of centrality vary For example, in the study most similar to ours [20], only small greatly and incorporate both global and local factors of a node’s loca- temporal social network snapshots are collected, and evaluation is tion within the network. The high variability of centrality scores re- mostly based on convergence to static centrality measures. flects the nature of popularity observed in real-world networks [17]. 1GitHub repository of our research: https://github.com/ferencberes/online-centrality Several models have been suggested to explain the emergence of high variability, habitually involving some variation of the prefer- the best of our knowledge is the only temporal ranking metric ential attachment mechanism, also extended to the dynamic set- proposed in the literature prior to our work. ting [10]. PageRank, Katz index, and temporal PageRank are all based on For temporal networks, a few generalizations of static centrality counting paths in the underlying networks. Next, we review the measures to dynamic settings have been suggested recently [2, 9, general properties of the path counting centrality metrics and tem- 13, 22, 23]. In these works, tracking centrality of a single node poral PageRank [20]. Then in Section 4, we describe our temporal and determining its variability play a major role [23], as it has Katz centrality measure. been observed in the literature that centrality of nodes can change drastically from one time period to another [7]. 3.1 Path counting centrality metrics The above results [2, 9, 13, 22, 23], however, cannot be used for As perhaps the first centrality metric based on path counting, Katz computing and updating centrality online. The following results introduced his index [12] as the summation of all paths coming devise methods that are variants of our snapshot baselines: In [23], into a node, but with an exponentially decaying weight based on the spectrum of a set of discrete graph snapshots is analyzed in the length of the path: time; however, the spectrum cannot be dynamically updated with X1 fine time granularity, as required by our application. Similarly, Katz~ = 1 · βk Ak ; (1) in [9], sequences of snapshots are considered. Finally, in [2, 13, k=0 22], degree, closeness, and betweenness are considered in dynamic ~ graphs, but these measures, with the exception of the degree, cannot where Katz is the Katz index vector, A is the directed adjacency be efficiently updated online. Note that online degree, also with matrix, and β < 1 is a constant. Hence the Katz index of a node time decay, is compared as a baseline method in our experiments. is the weighted sum of the number of paths of different lengths k k In this paper we address a practically important variant of dy- terminating in u, where the weight is β : namic centrality: our goal is to compute online updateable measures 1 X X k that can be computed from a data stream of time-stamped edges. To Katz~ (u) := β jfpaths of length k from v to ugj; (2) the best of our knowledge, the only previous such algorithms are v k=0 temporal PageRank [20] and temporal degree [13]—other measures The Katz index is finite only if β < 1=jλ1j, where λ1 is the eigenvalue are inefficient to update online. We show that our algorithm has bet- of A with largest absolute value [12]. Since 1=jλ1j is often very small, ter performance for assessing centrality in a dynamic graph, which around 0.05 in our graphs, the relative weight of a length two path we explain in Section 3 by showing that we can incorporate tempo- stays very small compared to a single edge. In order to be able to ral information while keeping dynamic update computational costs use larger values of β, we introduce the truncated Katz index as very low. In fact, temporal PageRank is based on PageRank [19], K [K] X while our method is based on the Katz index [12], both of which are Katz~ = 1 · βk Ak : (3) shown to have similar theoretical and practical properties by [6]. k=0 To our knowledge, temporal PageRank [20] is the only published By the basic definition, PageRank is normally considered to be the work about temporal generalizations of PageRank. Other results static distribution of a random walk with damping [19]. In order focus on coarse, static snapshots [15], or use temporal information to compare PageRank and the Katz index, and to motivate online to calculate edges of a static graph [11, 16]. Finally, another line update rules, we use the result of [8], who show—and use as an of research considers updating PageRank in dynamic or online efficient algorithm—that PageRank is equal to the path counting scenarios [3, 4, 14, 18, 21]; however, in these results PageRank is formula 1 considered a stationary distribution over the current, static graph. c X In our experiments, we will show that our temporal Katz centrality PageRank~ = 1 · · (1 − c)k Mk ; (4) N outperforms snapshot-based static measures for assessing node k=0 importance in a temporally changing environment.