Orion: Shortest Path Estimation for Large Social Graphs

Xiaohan Zhao, Alessandra Sala, Christo Wilson, Haitao Zheng and Ben Y. Zhao Department of Computer Science, UC Santa Barbara, USA {xiaohanzhao, alessandra, bowlin, htzheng, ravenben}@cs.ucsb.edu

Abstract nodes, computing exact values for node separation met- Through measurements, researchers continue to produce rics like graph radius, graph diameter, and average path large social graphs that capture relationships, transac- length, requires calculating O(n2) node distances. In de- tions, and social interactions between users. Efficient ployed social networks, LinkedIn users can use node dis- analysis of these graphs requires algorithms that scale tance to filter out query results in their neighborhood, and well with graph size. We examine node distance com- social e-commerce sites can use node distance to iden- putation, a critical primitive in graph problems such as tify more trustworthy sellers [27]. Node distance is also computing node separation, centrality computation, mu- the determining factor for other common graph problems tual friend detection, and community detection. For like centrality and mutual friend detection. large million-node social graphs, computing even a sin- Current methods for computing node distance do not gle shortest path using traditional breadth-first-search scale with graph size. For a graph with n nodes and can take several seconds. m edges, efficient implementations of traditional algo- In this paper, we propose a novel node distance esti- rithms including breadth-first-search (BFS), Dijkstra and mation mechanism that effectively maps nodes in high Floyd-Warshall can produce shortest paths for each node dimensional graphs to positions in low-dimension Eu- pair in O(n log n + m) time, and all pairs shortest-paths clidean coordinate , thus allowing constant time in Θ(n3) [6]. Tolerable for small graphs, the compu- node distance computation. We describe Orion, a pro- tation required for a single node distance computation totype graph coordinate system, and explore critical de- on a large million-node graph can take up to a minute cisions in its design. Finally, we evaluate the accuracy on modern computers [23]. Given the prohibitively high of Orion’s node distance estimates, and show that it can costs of storing precomputed distances, researchers have produce accurate results in applications such as node sep- little choice but to sample portions of the graph or seek aration, node centrality, and ranked social search. approximate results. In this paper, we propose a novel approach to ap- 1 Introduction proximating node distance measurements we call Graph Coordinate Systems. A graph coordinate system maps Analysis of graph properties is critical to understanding nodes in high dimensional graphs to positions in a fixed- the mechanisms underlying the formation and evolution dimension Euclidean coordinate space. Using the co- of complex networks, and is of particular importance in ordinates associated with each graph node, we can use the study of online social networks. In recent years, the a simple Euclidean distance computation to estimate, in research community has seen a rise in large-scale mea- constant time, its distance to any other node in the graph. surement studies of deployed social networks [2, 18] and Our work is inspired by the prior success of using vir- interaction networks [15, 32], some producing graphs of tual network coordinate systems [7, 8, 20] to predict la- up to tens of millions of nodes. The size of these massive tencies between Internet hosts. Studies show that in- graphs makes their analysis extremely challenging, as tegrating network coordinates into applications such as even efficient algorithms can become time-consuming. web caches and peer-to-peer systems significantly im- Computing node distance, or the shortest-path dis- proved their performance. Unlike latencies between In- tance between two nodes, is a primitive that lies at the ternet hosts, however, shortest path values on a graph, by core of both graph analysis algorithms and social net- definition, will never violate the triangle inequality [17]. work applications. For example, in a network with n Since triangle inequality violations are often cited as a

1 key source of error in network coordinate systems, graph tion. Applications that benefit from these systems in- coordinates could potentially be even more accurate. clude content distribution networks [24], multicast sys- We make three key contributions in this paper. First, tems [3], distributed file systems [26] and file-sharing we propose the use of graph coordinate systems to sim- networks [1, 5]. plify node distance computation on large graphs. While The majority of network coordinate systems work by similar in fundamental methodology to network coordi- mapping an Internet host to a specific position in a Eu- nates, several critical differences force a ground-up re- clidean space based on round-trip measurements to other design of graph coordinate systems. For example, while hosts. Depending on the protocol, a node’s coordinates network coordinates can be easily tuned using fast la- can be continually refined as additional measurement re- tency measurements (e.g. via Internet ping), measur- sults are added to the system. Once a pair of nodes has ing actual distances between graph nodes can be very converged to their positions in the coordinate space, their expensive. We describe Orion, a prototype graph coor- distance in the Internet (usually a round-trip-time or RTT dinate system, and explore critical decisions in its de- value) can be predicted by computing the Euclidean dis- sign. Second, we perform extensive validation of Orion’s tance between their coordinate values. node distance estimates using several real social graphs. Based on the way coordinates are computed for Finally, we explore the utility of graph coordinate sys- new nodes, NC systems can be generally categorized tems in graph analysis and social applications, and show into “landmark-based” and “decentralized” systems. that Orion produces effective results on large graphs for Landmark-based systems such as GNP [20] first compute applications such as node separation metrics, centrality coordinates for an initial set of well-known landmark computation, and ranked social search. nodes using pair-wise measurements, where errors be- tween virtual and measured distances are minimized us- Roadmap. We begin in Section 2 by defining our ing a non-linear optimization algorithm such as Simplex goals and assumptions, and describing key differences Downhill [19]. The NC then uses these nodes as fixed from prior work on network coordinate systems. We then points to calibrate coordinate values for the rest of the describe the Orion graph coordinate system and explain network. Landmark-based systems [17, 20, 21, 22, 29] key design decisions in Section 3. Next, we present ac- have fast convergence properties, since all nodes rely on curacy measurements of Orion in Section 4, and show the same fixed nodes for their coordinate calculations. the effectiveness of Orion in computing graph metrics However, the accuracy of these systems can suffer if the and graph applications in Section 5. Finally, we discuss choice of landmark nodes is suboptimal, i.e. they do not future directions and conclude in Section 6. sufficiently cover the network. In contrast, decentralized NCs such as PIC [7] and Vi- 2 Virtual Coordinates and Large Graphs valdi [8] allow incoming nodes to orient themselves in the coordinate space using any nodes already positioned The goal of our work is to find a compact representa- in the space. While these systems avoid dependence on tion of distances between nodes in a graph, such that we well-known landmarks, new nodes can force already cal- can quickly and easily compute estimates of shortest path ibrated nodes to adjust their coordinates, potentially in- distances between any two nodes. We are inspired by the creasing convergence time and propagating errors. For significant volume of prior work on the topic of network further details on NC systems, we refer the reader to a coordinate systems, much of which mapped distances recent survey [9]. between Internet hosts to distances in a Euclidean space. Successes and Limitations. NC systems have been In this section, we briefly summarize prior work in net- shown to be highly effective at improving performance work coordinates, and use it as context to identify key of large distributed systems [12, 1]. However, more re- differences and challenges in the design of graph coordi- cent work has questioned the validity of using Euclidean systems. Finally, we briefly discuss related projects spaces to approximate Internet latencies, which have as context for our work. been shown to violate the Triangle Inequality [13, 33].

2.1 Background: Network Coordinates 2.2 Graph Coordinates: Challenges Network coordinate (NC) systems [7, 8, 17, 20, 21, Our goal is to investigate the feasibility of using a Eu- 22, 29] were designed as efficient and scalable mech- clidean coordinate space to capture node distances on anisms to estimated distances or latencies between In- large graphs. Upon consideration, we find that three key ternet hosts. Such distance estimation mechanisms can differences separate the problems of estimating shortest prove critical to large-scale distributed systems that use paths on graphs and host latencies on the Internet. As a approximate distance values for performance optimiza- result, we cannot simply apply techniques from NC sys-

2 tems, but must instead carefully reevaluate them in the poses to compute nodes position in a graph by exploit- context of graph distances. ing a coordinate-like approach, called network structure index (NSI) [25]. Compared to Orion, NSI is more ex- Triangle Inequality. First, we note that while the pensive in both time and space complexity. The space presence of triangle inequality violations (TIV) is often complexity of NSI is O(nkD), where k is the number of identified as a barrier to accuracy in network coordinate zones and D is the number of dimensions, which are k systems, shortest path computation on graphs is guaran- times higher than Orion. On the other hand, NSI’s time teed to be TIV free. This is inherent in the definition complexity, O(mkD), is proportional to the number of of the shortest path metric. The proof is straightforward edges m while Orion takes only O(nkD) time, where n by contradiction. Assume a triangle inequality violation is the number of nodes. This also represents a significant for three nodes a, b, c, i.e. d(a, b)+d(a, c)

3 y a graph, since each computation can, in the worst case, a require a full traversal of the graph. Using a landmark 1 { approach, we limit the total number of Breadth-First- c b b c Search operations to k, the number of landmarks. Each BFS computes the shortest path distance from a land- d x mark to all other nodes. Computing BFS for all land- d 1{ marks essentially precomputes all values needed to cal- e f ibrate all nodes in the graph. In contrast, a decentral- e f ized approach such as the physical springs model used by Vivaldi [8] requires shortest path computations between Figure 1: Mapping graph nodes into Euclidean coordinate random node pairs, thus drastically increasing the num- space. For most node pairs, the Euclidean distance exactly ber of BFS operations. matches the hop-count separating them in the original graph. The second advantage of a landmark-based scheme is that the positions of incoming nodes depend only on the landmark nodes. This bounds the number of operations node-distance measurements. This “calibration phase” required to compute a node’s position, guaranteeing fast is where a graph coordinate system incurs its one-time convergence. In contrast, in decentralized models adding computational overhead. Once all nodes in the graph a new node will often force its nearby neighbors to make have been added, the resulting system can be integrated adjustments on their position, a process that can propa- with graph applications to answer node distance queries gate adjustments iteratively throughout the entire space. with estimates. Finally, we note that the challenges that make Land- Since the per-query computation cost is O(1), the fo- mark systems undesirable in Internet systems do not ap- cus of our design is to ensure the calibration phase is ply in our context. In network coordinate systems, land- computationally efficient, and the results are as accurate marks are physical machines that must remain available as possible. More specifically, our goals are three-fold: at all times, and processing load from other applications • Scalability. The computational cost of the calibra- (e.g. web traffic) can affect the accuracy of latency mea- tion phase must scale linearly with the number of surements to other machines in the network [21]. Com- nodes, i.e. O(n). promised landmarks can also significantly impact the en- • Accuracy. While individual node distance pre- tire system [9]. Those issues do not exist for graph coor- dictions might incur reasonable errors, predictions dinates, where nodes are just graph vertices and all com- should approximate ground truth at the large scale. putation can be performed on a centralized server. • Fast convergence. Impact of individual node cali- brations should be localized, i.e. should not trigger 3.2 Scalable Landmark Coordinates significant new adjustments to their neighbors. Intuitively, the number of landmarks used to calibrate a Based on these goals, we now describe the Orion de- graph should have a direct impact on the accuracy of the sign and explain key decisions. Euclidean mapping. Similar correlation between land- marks and accuracy has been observed in the context of 3.1 A Landmark-based Approach network coordinate systems [20]. The highly connected and complex nature of social graphs leads us to believe Figure 1 illustrates how Orion maps nodes in a graph to that an accurate graph coordinate system requires a sig- D positions in a -dimension Euclidean coordinate space. nificant number of landmarks. The challenge is to find a The goal is accurately translate pairwise hop-count dis- way to accurately and quickly compute the coordinates tances in the graph into Euclidean distances in the co- for a large number of landmarks. ordinate space. To do this, Orion uses a landmark ap- Traditional network coordinates determine a node’s proach, where the positions of all nodes are calibrated D-dimension coordinates by minimizing the sum of k with their relative distances to a fixed number ( ) of cho- squares of prediction errors using the Simplex Downhill sen landmark nodes. Landmark nodes are initially cho- algorithm [19], a nonlinear optimization algorithm. The sen from the entire graph based on their position and de- algorithm runs in O(k2 · D) time to compute coordinates gree of connectivity. of k landmarks. Why Landmarks? We use a landmark-based scheme Since running Simplex Downhill on our desired num- in Orion for two main reasons. First and foremost, we ber of landmarks (up to 100 in our study) is computa- wish to minimize the number of shortest path compu- tionally expensive, we propose a new approach, where tations needed to establish ground truth on the actual we separate our landmarks into two groups, a small ini-

4 tial group of 16 landmarks, and a larger secondary group Network Nodes Edges Avg. Path Len. composed of the remaining landmarks. Norway 293K 5,589K 4.2 We leverage the Simplex Downhill algorithm to com- Egypt 246K 1,618K 5.0 Los Angeles 275K 2,115K 5.1 pute the coordinates for the initial (kI =16) landmarks, 2 India 363K 1,556K 6.1 thus its asymptotical complexity is O(kI · D). The sec- ondary group of landmarks calibrate their positions us- Table 1: Properties of Social Graphs ing the initial kI landmarks as anchors, contributing to a computational complexity of only O(kI · D) each. Thus, the total time required to compute landmark coordinates We consider these strategies as approximations of the 2 is O(kI · D)+(k − kI ) × O(kI · D), where k is the high-centrality strategy, and evaluate their effectiveness total number of landmarks. empirically in Section 4. Furthermore, we describe two ways to compute the Summary. Orion works as a landmark-based scheme, coordinates of the secondary group of landmarks, while where an initial core of 16 landmarks is first fixed in maintaining the same computational complexity. In the the space using Simplex Downhill optimization. A sec- global approach, we compute the coordinates of each ondary group of landmarks position themselves based node in the secondary group relying only on the ini- on the original landmarks. Finally, all remaining graph tial group as anchors. In the incremental landmarks ap- nodes calibrate their positions based on node distances proach, nodes in the secondary group are added one by obtained from computing BFS from all landmarks. one. Once a node receives its coordinate values, it be- comes an anchor for all remaining nodes. To compute its coordinates, any remaining node in the secondary group 4 Experimental Results can choose any kI nodes from all embedded nodes to be its landmarks. In this section we analyze the accuracy of Orion’s node distance estimates. We study the impact on accuracy by key factors: Landmark selection strategy, cardinality of 3.3 Landmark Selection the Landmark set, and dimensionality of node coordi- nates. We preface our core discussion with an overview Finally, we consider the problem of choosing landmark of the experimental environment and evaluation metrics. nodes to produce the most accurate graph to Euclidean coordinate mapping. Prior work by Potamias et. al con- sidered the problem of choosing landmarks, and con- 4.1 Experimental Setup cluded experimentally that choosing nodes with high centrality performed significantly better than random We evaluate Orion accuracy using four anonymized choice [23]. Given the complexity of computing node datasets (Egypt, India, Los Angeles and Norway) gath- centrality, we consider two groups of alternative land- ered from Facebook regional networks [32]. These mark selection strategies as possible approximations of graphs were chosen because they are large, but not too centrality-based selection: Random and High-degree. large to make graph analysis intractable. Their statistical properties are consistent with other OSN datasets [2, 30]. • Random. This is the basic landmark selection strat- Table 1 reports their basic properties. egy. Landmarks are chosen uniformly at random All experiments were run on 2.4 GHz, dual core Xeon from all nodes in the graph. servers with 32GB of RAM. All machines ran Fedora • High-degree. Prior measurements on social net- Core, kernel version 2.6.x. works [18, 32] show that social graphs exhibit a Evaluation Metrics. We use two key metrics to eval- power-law-like degree distribution. Intuitively, high uate Orion accuracy. The first is Relative Error. This degree nodes reside at the core of social graphs, ef- metric is widely used in the study of Network Coordinate fectively approximating central nodes. This strategy Systems, although it must be modified slightly in order chooses nodes with the highest degree. to evaluate graph coordinate systems. Let a and b be two • m Landmark separation. Closely positioned land- nodes in the graph. Let da,b be the measured distance marks are less effective at “covering” the graph as between a and b on the real graph using the BFS algo- P anchors. Therefore, we add variants to the two ba- rithm, and let da,b be the estimated distance computed sic strategies, where we select the landmarks one using a and b’s coordinates from Orion. In our context, by one, ignore any potential landmarks that are too the relative error is: close in the graph to existing landmarks, and con- |dm − dP | tinue selecting landmarks until the desired number a,b a,b Re = m (1) has been met. da,b

5 0.4 1 Random Strategy (Global) 0.9 0.35 High-degree Strategy (Global) 0.8 0.3 Random Strategy (Incremental) High-degree Strategy (Incremental) 0.7 0.25 0.6 0.2 0.5 CDF 0.15 0.4 0.3 0.1 0.2

Average relative error 0.05 100 landmarks 0.1 30 landmarks 0 0 Original 2-hop 3-hop 4-hop 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Minimum distance between landmarks Relative error

Figure 2: ARE of nodes’ distances with different combination Figure 3: CDF of relative error on nodes distances on India. of landmark selection and computation strategies in India graph

Cardinality of Landmark Set. In this section we ex- The second metric is Average Relative Error (ARE) of plore the variation in accuracy when we initialize Orion predicted distances. Small ARE values are sufficient to using different cardinalities of the landmark set. The prove that the majority of node pairs in Orion have realis- intuition behind this experiment is that by having more tic predicted distances. Finally, we also use Computation landmarks spread in the graph there is a better space cov- time to investigate Orion’s efficiency. erage that should allow higher precision while placing nodes into this space. 4.2 Estimation Accuracy Figure 3 depicts the cumulative distribution function of the relative error for cardinality 30 and 100 of the land- We examine Orion’s estimation accuracy under the influ- mark set. Figure 3 shows that there is a small increase in ence of three different factors: landmark selection strat- precision with larger landmark set sizes. In general, al- egy, cardinality of the landmark set, and dimensionality most 70% of the computed distances have a relative error of node coordinates. less then 0.2 and more than 90% are less than 0.4, that allows us to validate a satisfactory accuracy in comput- Landmark Selection Strategies. We begin by analyz- ing node distances with a relatively small landmarks (i.e. ing the impact of landmark selection strategies on accu- 100 landmarks represent a millesimal of our graphs). racy. In Section 3.3, we describe two selection strategies (random and high-degree) and variants based on land- Dimensionality of Coordinates. Nodes are mapped mark separation. Figure 2 plots AREs for a variety of into geometric space based on the coordinates they ac- landmark selection strategies using the India graph. We quire during the initialization phase. Intuitively, cali- evaluate the accuracy of each different strategy on all brating node positions using a larger coordinate vector four datasets. These results are similar for all our graphs, should have a direct impact on the precision of the esti- and we only show India here for brevity. mated distances between nodes. Each evaluation is performed by selecting 1000 ran- We compute coordinates as dimensionality varies be- dom nodes in the graph and computing pairwise dis- tween 2 and 14. Figure 4 shows that increasing the tances between them, for a total of ≈ 500K distances. coordinates dimension also increases the predicted dis- These results form the control sample when calculating tances between nodes, confirming our intuition. Al- relative error vs. Orion. Each value reported in Figure 2 though higher dimensions produce smaller errors, as the is the average results over 5 sets of randomly selected dimension increases the time for coordinate and distance 1000 node groups. computation increases as well. We explore the trade off In general, Figure 2 shows that Orion provides low between predicted precision and efficiency and conclude relative errors compared to actual path lengths for differ- that using 10-dimensional coordinates is best compro- ent landmark selection strategies. Among the considered mise. In particular, as shown in Figure 4, the accuracy strategies, Figure 2 shows that high-degree strategies can gain for x ≥ 10 slightly decreases. produce lower errors. Furthermore, the impact of land- mark separation on the accuracy of shortest path length 4.3 Computational Complexity estimation is fairly small. Taking a close look, the high- degree incremental landmark selection strategy with 3- In this section we investigate Orion efficiency by analyz- hop separation provides the most accurate result among ing Orion bootstrap and pair distance computation time all the considered strategies. As a result, all remaining versus BFS. experiments run with this approach. Orion bootstrap involves two main operations: (i)

6 0.4 Metric Method India Egypt L. A. Norway Norway Avg. Relative Error Orion 11.7 9.5 10.8 8.1 0.35 Egypt Avg. Relative Error Radius Actual 11 9 11 8 0.3 L.A. Avg. Relative Error India Avg. Relative Error Orion 17.9 13.9 17.8 12.1 Diameter 0.25 Actual 17 13 17 12 0.2 Avg. Path Orion 5.8 4.8 4.9 4.1 0.15 Length Actual 6.1 5.0 5.1 4.2 0.1

Average relative error 0.05 Table 3: Comparing Node Separation Metrics for a 1000-node 0 sample in each of our four graphs. Orion’s approximations are 2 4 6 8 10 12 14 compared to results computed via BFS. Dimension

Figure 4: ARE of different coordinate dimensions. 5.1 Node Separation Metrics Time India Egypt L. A. Norway Orion Bootstrap 9499s 7852s 8856s 9383s Orion Response 0.2µs 0.2µs 0.18µs 0.19µs Node separation metrics are commonly used to charac- BFS Response 1.028s 0.75s 1.027s 1.44s terize overall graph structure. The common node sepa- Table 2: Computation times for Orion and BFS. ration metrics include graph radius, graph diameter and average path length. The eccentricity of a node is defined as the longest hop distance from it to all other nodes in measure distances from each landmark to all the nodes a graph. Graph radius is defined as the minimum eccen- using BFS, and (ii) compute coordinates using Simplex tricity across all nodes, while graph diameter is defined Downhill. We record the time for bootstrapping Orion as the maximum eccentricity across all nodes. Average on our four social graphs and show that Orion bootstrap path length is the mean of all shortest path lengths. time is about 2 hours (as shown in Table 2). These times are acceptable since bootstrapping is a one-time cost. Computation Time. Given their intensive use of Response time is the average time to compute pairwise shortest path computations, node separation metrics are node distances using Orion. As shown in Table 2, Orion an ideal application for Orion. We would like to quan- is 7 orders of magnitude faster than BFS. This result con- tify Orion’s accuracy in this context by computing these firms the huge gain a coordinate graph system like Orion metrics using Orion and compare them directly to those is able to achieve compared to traditional methods. from BFS. Given the large sizes of our graphs, however, Note that Orion bootstrap and response times are func- it was not possible for us to compute eccentricity for all tions of the number of nodes in the graph. Conversely, the nodes by BFS for direct comparison. From our time BFS computation time is a function of the number of measurements of single node full BFS we estimate a full edges. Thus Orion is likely to provide better scalability computation of the Los Angeles network (275K nodes) than BFS because, as social networks expand, the growth would take roughly 152 hours or more than 6 days of in edges far surpasses the growth in nodes. computation. In contrast, embedding the LA network into Orion takes less than 2 hours, and querying for all pairwise paths takes roughly 7000 seconds, for a total 5 Using Orion in Graph Applications process time of less than 4 hours.

To demonstrate Orion’s utility and accuracy in an opera- Accuracy Results. For a scalable side-by-side com- tional setting, we integrate Orion into several graph anal- parison, we randomly sample 1000 nodes from each of ysis and social applications that make extensive use of the graphs, and compute graph radius, diameter and av- shortest path computations. Under normal conditions, erage path length based on BFS from those nodes to these graph metrics and applications can be computa- all other nodes in the graph. We compare those results tionally intractable for large graphs. We show that we to those generated using node distance estimations from can use Orion to scalably obtain answers that reasonably Orion, and show the results in Table 3. We find that Orion approximate answers obtained from deterministic meth- performs very well in predicting these metrics. For graph ods. Specifically, we look at three common operations: radius and diameter, it always provides a result that is less computing node separation metrics such as graph radius, than 1 hop from the BFS answer. In the case of average diameter and average path length, locating central nodes path length, Orion is even more accurate, and provides in a graph, and ranked social search. results that never deviate more than 0.3 from BFS.

7 0.9 1 0.8 0.9 0.7 0.8 0.6 0.7 0.6 0.5 0.5 0.4 0.4 Accuracy Accuracy 0.3 India 0.3 Norway 0.2 Egypt 0.2 Egypt 0.1 Los Angeles Los Angeles Norway 0.1 India 0 0 50 100 200 5 10 20 50 Top # of 1000 nodes Top # of 100 responses

Figure 5: Accuracy of Top k high centrality nodes Figure 6: Accuracy of top k ranked nodes.

5.2 Computing Node Centrality rate because it has the longest average path lengths of our sample graphs. The results are generally good across Information dissemination is an active research area the board, with Orion giving correct estimates more than of social networks. Viral spread [31], influence cam- 50% of the time, when selecting top 50 highest centrality paigns [4, 14], and breaking-news coverage [10] are all nodes out of 1000. examples of information dissemination problems on so- cial graphs. A critical, but computationally expensive, 5.3 Ranked Social Search metric necessary for these applications is node central- ity. We leverage Orion coordinates to compute node’s Online social networks often need to rank their query re- centrality in order to compare its speed and accuracy sults by proximity in the social graph to the query owner. with centrality calculations performed using traditional For example, searches for specific names on Facebook shortest-path algorithms. and LinkedIn will only return the top results that are clos- Centrality is defined as the average shortest path est in social distance to the user. Social distance is used length from a node a to every other nodes in the graph. to rank query results because users generally care about The smaller the average path length for a node is, the people close to their social circles. higher its centrality is. Using Orion, a node can quickly We implement a ranked social search application. In estimate its centrality by computing its average Eu- each graph, we randomly select 100 nodes to represent clidean distance to all other nodes in the graph. the total set of results for each query. We run the simula- We estimate the precision of computing node central- tion 5000 times, each time with a randomly chosen node ity via Orion by comparing its results to actual results as the point of origin for the query. computed using BFS. Computationally, node centrality Accuracy Results. We sort the randomly selected 100 also requires all pairs of shortest paths computation, and nodes in increasing order and choose the top k nodes. our time estimates from node separation metrics also ap- Then we count the amount of overlap in the two sets ply here (152 hours for our LA graph). of top k nodes computed by Orion and the BFS-based Accuracy Results. To keep computation time man- approach. We define the accuracy of the ranked social ageable, we again sample 1000 random nodes from each search in Orion as the ratio of the number of overlapping graph, and compute node centrality values for each node nodes to the total number of all considered nodes. Fig- using both Orion and BFS. We sort nodes based on their ure 6 plots the accuracy values over different values of average shortest path length to every other node in the k, averaged across the 5000 runs. Again, Orion’s , in increasing order. Then we select the top k search produces fairly good results, with more than 60% nodes from each resulting group, and count the number overlap when choosing the top 20 responses. of top k central nodes (according to BFS) that also ap- peared in Orion’s results. We repeat this for 5 sets of 6 Conclusions and Future Directions 1000 random nodes and average the result. Figure 5 shows the percentage of top-k nodes that are Shortest path computation is one of the most critical correctly considered found by Orion, for different val- and computationally intensive primitives for both graph ues of k: 50, 100, and 200. The overlap between Orion analysis and social networking applications. We pro- and BFS’ results increases with k. As with results in pose graph coordinate systems, a new approach to dra- Section 4.3, centrality results for India are more accu- matically reduce the complexity of shortest paths com-

8 putation by mapping the entire graph into a multi- [10] KWAK, H., LEE, C., PARK, H., AND MOON, S. What is twitter, dimensional Euclidean coordinate space. We describe a social network or a news media? In Proc. of WWW (2010). the design of Orion, an efficient graph coordinate proto- [11] LANCICHINETTI, A., AND FORTUNATO, S. Community detec- tion algorithms: A comparative analysis. Phys. Rev. E 80, 5 (Nov type. Mapping a graph of n nodes takes time O(k I ·D·n) 2009). (roughly 2-3 hours for a 275K node graph), after which each node distance estimation takes less than 0.2 mi- [12] LEDLIE, J., GARDNER,P.,AND SELTZER, M. I. Network co- ordinates in the wild. In Proc. of NSDI (April 2007). croseconds. Our experiments show Orion can provide [13] LEE, S., ZHANG, Z., SAHU, S., AND SAHA, D. On suitability accurate results both for graph metrics such as graph ra- of euclidean embedding of internet hosts. In Proc. of SIGMET- dius and node centrality, as well as graph-based applica- RICS (June 2006).

tions such as ranked social search. [14] LESKOVEC, J., ET AL. Cost-effective outbreak detection in net- Future Directions. We believe graph coordinate sys- works. In Proc. of KDD (2007). tems are a promising new research direction for scalable [15] LESKOVEC, J., AND HORVITZ, E. Planetary-scale views on a Proc. of WWW graph analysis. While our work here is preliminary, we large instant-messaging network. In (2008). see three immediate areas for future work. First, we [16] LI, L., ET AL. Towards a theory of scale-free graphs: Definition, properties, and implications. Internet Math 2, 4 (2005), 431Ð523. would like to explore the efficacy of mapping graphs to non-Euclidean coordinate systems such as spherical [17] MAO,Y.,SAUL, L., AND SMITH, J. M. Ides: An internet dis- tance estimation service for large networks. IEEE JSAC 24,12 and hypercube. Second, we will examine the impact (Dec. 2006), 2273Ð2284. of graph coordinates on weighted graphs, e.g. geo- [18] MISLOVE, A., ET AL. Measurement and analysis of online social graphical graphs or temporal distance metrics for social networks. In Proc. of IMC (Oct 2007). graphs [28]. Finally, Orion is designed for static graphs. [19] NELDER,J.A.,AND MEAD, R. A simplex method for function Adding new nodes to the graph after the initial mapping minimization. The Computer Journal 7, 4 (Jan. 1965), 308Ð313. can change shortest path values for portions of the graph [20] NG,T.S.E.,AND ZHANG, H. Predicting internet network dis- and force a re-mapping of the graph. We will investigate tance with coordinates-based approaches. In Proc. of INFOCOM mechanisms and heuristics to allow run-time modifica- (New York, NY, June 2002). tions to graphs already mapped to the coordinate space. [21] NG, T. S. E., AND ZHANG, H. A network positioning system for the internet. In Proc. of USENIX ATC (June 2004). [22] PIAS, M., ET AL. Lighthouses for scalable distributed location. Acknowledgments In Proc. of IPTPS (Feb. 2003). [23] POTAMIAS, M., ET AL. Fast shortest path distance estimation in This material is based in part upon work supported by the large networks. In Proc. of CIKM (Hong Kong, Nov. 2009). National Science Foundation under grants IIS-847925, [24] RATNASAMY, S., HANDLEY, M., KARP, R., AND SCHENKER, CNS-0916307, and CAREER CNS-0546216. S. Topologically-aware overlay construction and server selection. In Proc. of INFOCOM (2002), IEEE. References [25] RATTIGAN,M.J.,MAIER, M., AND JENSEN, D. Using of structure indices for efficinet approximation of network proper- [1] Azureus-vuze. http://sourceforge.net/projects/ ties. In Proc. of ACM SIGKDD (Philadelphia, USA, 2006). azureus/. [26] RHEA, S., ET AL. Pond: The OceanStore prototype. In Proc. of [2] AHN, Y.-Y., ET AL. Analysis of topological characteristics of FAST (April 2003). huge online social networking services. In Proc of WWW (2007). [27] SWAMYNATHAN, G., ET AL. Do social networks improve e- [3] CASTRO, M., ET AL. Scribe: A large-scale and decentral- commerce: a study on social marketplaces. In Proc. of SIG- ized application-level multicast infrastructure. IEEE JSAC 20, COMM WOSN (August 2008). 8 (2002). [28] TANG, J., ET AL. Temporal distance metrics for social network [4] CHEN,W.,WANG,Y.,AND YANG, S. Efficient influence maxi- analysis. In Proc. of WOSN (Barcelona, Spain, August 2009). mization in social networks. In Proc. of ACM KDD (2009). [29] TANG, L., AND CROVELLA, M. Virtual landmarks for the inter- [5] COHEN, B. Incentives build robustness in bittorrent. In Proc. of net. In Proc. of IMC (Oct. 2003). P2P-Econ (June 2003). [30] VISWANATH, B., ET AL. On the evolution of user interaction in [6] CORMEN, T. H., LEISERSON, C. E., RIVEST,R.L.,AND facebook. In Proc. of SIGCOMM WOSN (2009). STEIN,C.Introduction to Algorithms, 3 ed. MIT Press, 2009. [31] WANG, C., KNIGHT, J. C., AND ELDER, M. C. On computer [7] COSTA, M., CASTRO, M., ROWSTRON, A., AND KEY, P. Pic: viral infection and the effect of immunization. In Proc. of ACSAC Practical internet coordinates for distance estimation. In Proc. of (2000). ICDCS (Tokyo, Japan, March 2004). [32] WILSON, C., BOE, B., SALA, A., PUTTASWAMY, K. P. N., [8] DABEK,F.,COX, R., KAASHOEK,F.,AND MORRIS,R.Vi- AND ZHAO, B. Y. User interactions in social networks and their valdi: A decentralized network coordinate system. In Proc. of implications. In Proc. of EuroSys (April 2009). SIGCOMM (Portland, OR, August 2004). [33] ZHENG, H., ET AL. Internet routing policies and round-trip [9] DONNET, B., GUEYE, B., AND KAAFAR, M. A. A survey on times. In Proc. of PAM (April 2005). network coordinates systems, design, and security. IEEE Com- munication Surveys & Tutorials (2009).

9