Mining Evolving Network Processes
Misael Mongiovì, Petko Bogdanov, Ambuj K. Singh

Abstract: Processes within real world networks evolve according to the underlying graph structure. A bounty of examples exists in diverse network genres: botnet communication growth, moving traffic jams [25], information foraging [37] in document networks (WWW and Wikipedia), and spread of viral memes or opinions in social networks. The network structure in all the above examples remains relatively fixed, while the shape, size and position of the affected network regions change gradually with time. Traffic jams grow, move, shrink and eventually disappear. Public attention shifts among current hot topics, inducing a similar shift of highly accessed Wikipedia articles. Discovery of such smoothly evolving network processes has the potential to expose the intrinsic mechanisms of complex network dynamics, enable new data-driven models and improve network design.

We introduce the novel problem of Mining smoothly evolving processes (MINESMOOTH) in networks with dynamic real-valued node/edge weights. We show that ensuring smooth transitions in the solution is NP-hard even on restricted network structures such as trees. We propose an efficient filtering-based framework, called LEGATO. It achieves a 3-7 times improvement in the obtained process scores (i.e. larger and stronger-impact processes) compared to alternatives on real networks, and above 80% accuracy in discovering realistic "embedded" processes in synthetic networks. In transportation networks, LEGATO discovers processes that conform to existing theoretical models for traffic jams, while its obtained processes in Wikipedia reveal the temporal evolution of information seeking of Internet users.

I. INTRODUCTION

Network processes in domains ranging from biology and neuroscience to transportation and information networks evolve temporally, following the underlying graph topology. One example is the process of fact searching in information networks such as Wikipedia. The network structure of interlinked pages remains fixed (relative to the rate of accesses to them) and, as major events unfold, the number of views of subnetworks of related pages increases above normal levels. An important question then arises: How can we capture the evolution of attention focus of information seekers in large networks such as Wikipedia?

One such information foraging process that we discovered in Wikipedia is summarized in Fig. 1. It coincides with the final games of the Cricket World Cup in 2011 and captures the gradual shift of interest across pages of currently playing teams, players, statistics and cricket rules. The underlying structure of the information (hyperlink) graph remains constant, but edge significance scores change in time, with high values if both connected pages receive higher than usual traffic (shaded nodes have high-value links). A temporally contiguous series of subgraphs that include many high-value edges captures the evolving process of information foraging [37] of fans who surf Wikipedia to get trivia and facts about teams, players, rules and records as the games proceed. This process is naturally localized in the information graph, as relevant pages are well interlinked and those links direct people in their search for facts. In addition, the attention of information seekers smoothly shifts from team to team as the world cup progresses. Our goal is to capture such localized, contiguous and smooth processes within the global dynamics of the underlying network.

[Figure 1: panels dated 28/3, 29/3, 30/3 - 31/3, 1/4 - 4/4 and 5/4; node labels include NZ Cricket Players, Cricket Pages, 2011 Cricket World Cup Finals Schedule, and the New Zealand, India, Sri Lanka and Pakistan NCT pages.]

Fig. 1. The information searching process in Wikipedia coinciding with the Cricket World Cup finals in 2011. The process is represented by an evolving network region of highly accessed pages. As the games proceed, attention shifts to currently playing teams and eventually onto the winning team.

Similar networked evolving processes are common to other domains as well. Cellular signaling pathways have traditionally been modeled as dynamic networked systems, whose evolution includes spatial and temporal switching and oscillating behavior in the global cell activity [42]. The nervous system also exhibits dynamic behavior within a fixed connectivity of neurons, with both theoretical [10] and experimental [14] support for the existence of evolving activation processes. In transportation, computer and social networks, diverse processes like traffic jams, denial of service attacks and social rumors all evolve gradually along the underlying structure. Traffic jams shift spatially in highway networks [21], [40]; malicious botnet activity propagates in computer networks as computers get infected and recover due to software patches [18]; and similarly rumors spread along friendship links in social networks [36].

Mining and characterization of evolving network processes has a number of important applications: it furthers the understanding of their domain-specific causes and mechanisms, and it is essential in summarizing the micro-evolution of the network behavior as well as in spotting anomalous behavior. In water networks, where low concentrations of chlorine indicate a contamination [31], mapping a contamination process as a growing network region may help indicate the source of contamination and predict the rate and direction of its spread. Similarly, identifying flexible congestion processes in highway networks helps the understanding of traffic jam propagation and may enable improved urban planning [25].

We propose a novel problem formulation and a corresponding solution for mining significant smoothly-evolving processes on a network with dynamic node/edge weights. In our setting, a process is a smoothly evolving subgraph in time that includes high-weighted (significant) edges. While we allow the subgraph to change in consecutive time steps to capture the evolution of the underlying phenomenon, we impose a smoothness constraint on its rate of change, ensuring that we capture a unique network process. Smoothness is important in modeling processes that evolve gradually in time due to physical constraints or limited resources (e.g. traffic jams and information foraging). Similar temporal smoothness notions have previously been adopted for different dynamic network problems such as evolutionary clustering [13].

Mining smoothly evolving processes is a computationally difficult task due to noise and the large scale of real-world networks evolving over long periods of time. Furthermore, we show that mining smoothly-evolving subgraphs is NP-hard even on simple structures such as trees. While the problem of mining fixed or connected temporal subgraphs can be solved by reduction to the corresponding single-slice (one time step) problem, this approach is not suitable for discovering evolving processes.

Our contributions are as follows:
Novelty: We introduce the problem of mining smoothly-evolving processes and prove that it is NP-hard even on trees.
Quality: We propose a high-quality method for the smoothly-evolving process mining problem that achieves above 80% accuracy in detecting realistic embedded processes and a 3-7 times improvement in solution scores compared to alternatives.
Efficiency: We design a filtering-based framework for reducing the size of the input, making our high-quality algorithm feasible for large datasets.
Relevance: Processes discovered in real-world transportation networks agree with established models for traffic jam evolution, while those in information networks reveal the information seeking behavior of a large user population as a response to major events.

II. PROBLEM DEFINITION AND METHOD OVERVIEW

We first introduce some notation, define the problem of […] evolution [13]. We will use the terms time-evolving network and dynamic network interchangeably.

A network process can be summarized by aggregating a connected set of edges that altogether have high (positive) weights. For example, in a traffic network a connected set of edges that prevalently have high positive weight represents a congestion. The sum of the weights quantifies the significance of the process. Summing up weights is a powerful method that can model a variety of problems by suitably balancing between positive and negative weights. In order to model processes in a dynamic network, we first introduce the concept of a smoothly evolving subgraph, then we formalize its score as the sum of its edge weights. Smoothness is targeted to processes that evolve gradually in time (such as traffic jams and information foraging), due to physical constraints or limited resources.

A smoothly evolving subgraph is a contiguous sequence of connected subgraphs that changes gradually in time. Smoothness is controlled by a parameter α. An evolving subgraph with smoothness not exceeding α is called an α-smooth evolving subgraph. More formally:

Definition 1: Given a dynamic network $\bar{G}$ and an integer $\alpha$, an α-smooth evolving subgraph $R = \{G_i, G_{i+1}, \ldots, G_j\}$ is a sequence of subgraphs (each one in a separate time slice) of $\bar{G}$ that satisfies the following constraints:
smoothness: $|E(G_t)| + |E(G_{t+1})| - |E(G_t) \cap E(G_{t+1})| \le \alpha, \ \forall t \in [i, j-1]$.
connectivity: every subgraph is connected within its time slice.
temporal connectivity: two contiguous subgraphs share at least one edge, i.e. $E(G_t) \cap E(G_{t+1}) \ne \emptyset, \ \forall t \in [i, j-1]$.
no-negative slices: all slices have positive score, i.e. $\mathrm{score}(G_t) > 0, \ \forall t \in [i, j]$.

We measure the amount of variation between two consecutive subgraphs as the Hamming distance of their edge-sets. For example, in Fig. 4 the evolving subgraph composed of A, B and C is a 1-smooth evolving subgraph, while D and E form a 2-smooth evolving subgraph. An alternative is to use the Jaccard similarity, which allows for a number of changes relative to the current size of the subgraph.
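The four constraints of Definition 1 can be checked mechanically for a candidate sequence of subgraphs. The following is a minimal sketch, not the LEGATO method: the names `edge`, `is_connected`, `slice_score` and `is_alpha_smooth` are hypothetical, edges are assumed to be undirected and stored as sorted tuples, and weights are assumed to be given as one dictionary per time slice. The variation between consecutive slices is computed here as the size of the symmetric difference of their edge sets, which is one reading of the Hamming distance mentioned above and is an assumption of this sketch.

```python
from collections import deque


def edge(u, v):
    """Canonical form of an undirected edge (hypothetical helper)."""
    return tuple(sorted((u, v)))


def is_connected(edges):
    """True if the (non-empty) edge set forms a single connected component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return False
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search from an arbitrary endpoint
        node = queue.popleft()
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == len(adj)


def slice_score(edges, weights):
    """Score of one slice: the sum of its edge weights in that time step."""
    return sum(weights[e] for e in edges)


def is_alpha_smooth(slices, weights, alpha):
    """Check the constraints of Definition 1 on a candidate sequence.

    slices  -- list of edge sets, one per consecutive time slice
    weights -- list of dicts (edge -> weight), aligned with slices
    alpha   -- maximum allowed change between consecutive slices
    """
    for edges, w in zip(slices, weights):
        # connectivity and positive score within each slice
        if not is_connected(edges) or slice_score(edges, w) <= 0:
            return False
    for cur, nxt in zip(slices, slices[1:]):
        if not (cur & nxt):            # temporal connectivity
            return False
        if len(cur ^ nxt) > alpha:     # assumed Hamming-distance smoothness
            return False
    return True


# Toy path graph a-b-c-d observed over two time slices.
ab, bc, cd = edge("a", "b"), edge("b", "c"), edge("c", "d")
slices = [{ab, bc}, {bc, cd}]
weights = [
    {ab: 2.0, bc: -0.5, cd: -1.0},
    {ab: -1.0, bc: 1.0, cd: 1.5},
]
total_score = sum(slice_score(s, w) for s, w in zip(slices, weights))
```

In this toy instance the subgraph drops one edge and gains another between the two slices, so under the assumed distance it satisfies the constraints for α = 2 but not for α = 1. Negative weights are tolerated inside a slice as long as the slice score stays positive, which mirrors the balancing of positive and negative weights described in the text.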