Better Approximation of Betweenness Centrality∗

Better Approximation of Betweenness Centrality∗ Robert Geisberger† Peter Sanders† Dominik Schultes† Abstract interior. Then, the betweenness centrality for node v is Estimating the importance or centrality of the nodes in large networks has recently attracted increased inter- σ (v) c (v) := st , (1.1) est. Betweenness is one of the most important central- B σ ∈ st ity indices, which basically counts the number of short- s,tXV est paths going through a node. Betweenness has been used in diverse applications, e.g., social network analy- where σst := |SPst| and σst(v) := |SPst(v)|. sis or route planning. Since exact computation is pro- This definition counts the number of shortest paths hibitive for large networks, approximation algorithms through v, counting paths with alternatives only frac- are important. In this paper, we propose a framework tionally. for unbiased approximation of betweenness that gener- Our original motivation for considering betweenness alizes a previous approach by Brandes. Our best new was to identify sets of important nodes that can define schemes yield significantly better approximation than a highway-node hierarchy [14], which is used for (dy- before for many real world inputs. In particular, we also namic) routing in road networks. For this application, get good approximations for the betweenness of unim- the requirement is to process huge networks with many portant nodes. million nodes in a few minutes. In this context, we also need reasonable approximations for the betweenness of 1 Introduction all nodes since we have to decide which of several neigh- One of the most important aspects of automatic analysis boring unimportant nodes will make it to the first level of networks is the computation of centrality indices that of the hierarchy. measure the importance of a node in some well defined way. Recently, the focus of attention in network anal- 1.1 Related Work. Brandes [4] gives an exact algo- ysis has shifted to the analysis of ever larger networks rithm for computing betweenness of all nodes that is that are rapidly becoming available in such diverse areas based on solving a single source shortest path problem as transportation networks (e.g., public transportation (SSSP) from each node. An SSSP computation from s or road networks), social networks (e.g., friendship cir- produces a directed acyclic graph (DAG) encoding all cles, recommendation networks, or citation networks), shortest paths starting at s. By backward aggregation of computer networks (e.g., the internet or peer-to-peer counter values, the contributions of these paths to the networks), or networks in bioinformatics (e.g., protein betweenness counters can be computed in linear time interaction networks). (Section 3 gives more details). Depending on the graph In this paper we consider betweenness centrality model, the exact algorithm takes time between Θ(nm) [8, 1], which is one of the most frequently considered (e.g., for unit edge weights) and Θ nm + n2 log(n) centrality indices. Our results might also be applicable (comparison based general edge weights). Although this to related concepts such as stress centrality [15] that is polynomial time, it is prohibitive for networks with are also based on counting shortest paths. Consider many millions of nodes and edges. Bader and Madduri a weighted directed (multi)-graph G = (V, E) with [2] present a massively parallel implementation of the n = |V |, m = |E|. Let SPst denote the set of shortest exact algorithm that can handle a few million nodes. paths between source s and target t and SPst(v) the Brandes and Pich [5] investigate how the exact al- subset of SPst consisting of paths that have v in their gorithm can be turned into an approximation algorithm by extrapolating from a subset of k starting nodes (pivots), otherwise using the same aggregation strategy as the exact algorithm. Subsequently, we refer to this ap- ∗Partially supported by DFG grant SA 933/1-3. proximation algorithm as “Brandes’ algorithm”. A ran- †Universität Karlsruhe (TH), 76128 Karlsruhe, Germany, dom sample of starting nodes turns out to work well. {robert.geisberger, sanders, schultes}@ira.uka.de In particular, the randomized method yields an unbiased estimator1 for betweenness. Unfortunately, this 2 A Generalized Framework for Betweenness method produces large overestimates for unimportant Approximation nodes that happen to be near a pivot. For example, Our estimator is parametrized by a length function consider a degree-two node v connecting a degree-one ℓ : E → R on the edges3 and a scaling function node u to the rest of the network. If u is selected as a f : [0, 1] → [0, 1]. For a path P = he1,...,eki let pivot, the betweenness of v is overestimated by a factor ℓ(P ):= 1≤i≤k ℓ(ei). of n/k. In each iteration, our algorithm performs one of 2n possibleP shortest path searches with uniform probability 1.2 Our Contributions. Our main idea is to solve 1/2n. Namely, forward search in G = (V, E) from the problem described above by changing the scheme for a pivot s ∈ V (|V | = n possibilities) or backward aggregating betweenness contributions so that nodes do search from a pivot t ∈ V , i.e., a search from t in not unduly ‘profit’ from being near a pivot. Section 2 (V, {(v,u) : (u, v) ∈ E}) (another n possibilities). For describes a general framework for this idea that yields each shortest path of the form unbiased estimators for betweenness. Our framework also applies to a simpler variant Q of betweenness, canonical-path betweenness centrality P = hs,...,v,...,ti cC (v), which we usually just call canonical centrality. We introduce this variant due to our original motivation found in this way, we definez }| a{scaled contribution of dealing with road networks, where shortest routes are 2 f(ℓ(Q)/ℓ(P )) for a forward search ‘almost unique’ . Note that some route planning meth- σst ods even enforce unique shortest paths by perturbing δP (v) := . 1−f(ℓ(Q)/ℓ(P )) for a backward search the edge weights (e.g., [9]). Consequently, in case of σst canonical centrality, we consider only a single canoni- Overall, v gets a contribution cal shortest path between any source-target pair. Our approach does not require edge perturbation. It is suffi- δ(v) := δ (v) := {δ (v) : P ∈ SP (v)} cient that some deterministic tie breaking ensures that s P st ∈ only a single shortest path is found. tXV X Section 3 describes how to efficiently implement two for a forward search, and instantiations of our framework—linear scaling, where the contribution of a sample depends linearly on the δ(v) := δt(v) := {δP (v) : P ∈ SPst(v)} distance to the sample, and bisection scaling, where a s∈V sample only contributes ‘on the second half’ of a path. X X Linear scaling can be implemented using a slight varia- for a backward search. tion of Brandes’ original aggregation scheme. Bisection Theorem 2.1. X:= 2nδ(v) is an unbiased estimator scaling requires a quite different approach and another for c (v), i.e., E(X)= c (v). level of random sampling. Section 4 reports on exten- B B sive experiments with a wide range of large graphs. The Proof. Summing over all 2n possible events with linear scaling is always better than [5]. (Sampled) bisec- probability 1/2n each, we get tion scaling is in many ways even better. In particular, it yields good approximation of the betweenness also δs(v)+ δt(v) for less important nodes already with a small number s∈V t∈V of pivots—an area in which the original method fails. E(X)=2n X 2n X Section 5 summarizes the results and outlines possible =1 further improvements. For example, we have evidence ℓ(Q) ℓ(Q) f +1− f that betweenness approximations can help to construct z ℓ(P ) }| ℓ(P ) { = :P ∈SPst(v) better highway-node hierarchies for road networks. σst s,t∈V ( ) X X=σst(v) |SPst(v)| (1.1) = = cB(v) . 1 σst That means the expectation of the estimated betweenness is s,t∈V z }| { the actual betweenness. X 2Indeed, multiple shortest routes do appear in practice. How- ever they usually share most edges so that in most cases, the term 3ℓ may or may not be identical to the edge-weight function σst(v)/σst is one or zero. used for shortest-path calculations. Hence, by averaging k independent runs of the paths of the form hs,...,v,wi and the δs(w) takes care above unbiased estimator, we obtain an approximation of nodes reached indirectly via w. X1+...+Xk k of the betweenness value of node v. In the following, we will instantiate the length 3.2 Linear Scaling is easiest to implement by using function ℓ either with the original edge weight or with the original edge weights as the length function ℓ. Then unit edge weight (hop counting). We will consider three it is possible to implement it with only small deviations variants for the choice of f. For constant f(x)=1/2 we from Brandes’ scheme. We have essentially get Brandes’ algorithm.4 Our new schemes use f(x)= x for linear scaling and µ(s, v) σsv δs(v)= · (1 + δs(w)) 0 for x ∈ [0, 1/2) µ(s, w) σsw f(x)= w∈Xsucc(v) (1 for x ∈ [1/2, 1] for bisection scaling. The intuition behind these meth- where µ(s, w) is the shortest path distance from s to w. ods is to reduce the contributions for nodes close to Less formally, the modification compared to Brandes’ the pivot. In a sense, bisection scaling is best for this scheme is to put values 1/µ(s, w) into the aggregation purpose.

Load more