Copyright by Tae Won Cho 2010

The Dissertation Committee for Tae Won Cho certifies that this is the approved version of the following dissertation:

Enabling Information-centric Networking: Architecture, Protocols, and Applications

Committee:

Yin Zhang, Supervisor

Mohamed Gouda

Raymond Mooney

Lili Qiu

K. K. Ramakrishnan

Enabling Information-centric Networking: Architecture, Protocols, and Applications

by

Tae Won Cho, B.S., M.A.

DISSERTATION

Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN

August 2010

To Christy, Brandon, and Claire.

Enabling Information-centric Networking: Architecture, Protocols, and Applications

Tae Won Cho, Ph.D.

The University of Texas at Austin, 2010

Supervisor: Yin Zhang

As the Internet is becoming information-centric, network services increasingly demand scalable and efficient communication of information between a multitude of information producers and large groups of interested information consumers. Such information-centric services are growing rapidly in use and deployment. Examples of deployed services that are information-centric include: IPTV, MMORPG, VoD, video conferencing, file sharing, software updates, RSS dissemination, online markets, and grid computing.

To effectively support future information-centric services, the network infrastructure for multi-point communication has to address a number of significant challenges: (i) how to understand massive information-centric groups in a scalable manner, (ii) how to analyze and predict the evolution of those groups in an accurate and efficient way, and (iii) how to disseminate content from information producers to a vast number of groups with potentially long-lived membership and highly diverse, dynamic group activity levels.

This dissertation proposes a novel architecture and protocols that effectively address the above challenges in supporting multi-point communication for future information-centric network services. In doing so, we make the following three major contributions:

(1) We develop a novel technique called Proximity Embedding (PE) that can approximate a family of path-ensemble based proximity measures for information-centric groups. We also develop Clustered Spectral Graph Embedding (CSGE), which captures the essential structure of large graphs in a highly efficient and scalable manner. Our techniques help to explain the proximity (closeness) of users in information-centric groups, and can be applied to a variety of analysis tasks on complex network structures.

(2) Based on CSGE, we develop new supervision-based link prediction techniques called Clustered Spectral Learning (CSL) and Clustered Polynomial Learning (CPL) that enable us to predict the evolution of massive and complex network structures in an accurate and efficient way. By exploiting supervised information from past snapshots of network structures, our methods yield up to 20% improvement in link prediction accuracy when compared to existing state-of-the-art methods.

(3) Finally, we develop a novel multicast infrastructure called Multicast with Adaptive Dual-state (MAD). MAD supports a large number of groups, large group memberships, and efficient content dissemination in the presence of dynamic group activity. We demonstrate the effectiveness of our approach through extensive simulation, analysis, and emulation of a real system implementation.

Acknowledgments

First of all, I would like to thank my advisor Yin Zhang. Yin has guided me through my entire PhD career, and he has been an excellent advisor throughout. He has an exceptional vision in research and in selecting challenging problems. He also gave me a great deal of experience in formulating problems and attacking solutions based on intuition. I feel very fortunate to have worked with him.

I thank my mentors at AT&T Labs Research, K. K. Ramakrishnan and Divesh Srivastava. During two summer internships and four years of collaboration with AT&T Labs Research, they have been excellent mentors. K. K. has a great vision in research and extensive experience with real-world problems. He has helped me a great deal in designing a new multicast architecture and in formulating it as information-centric networking. Divesh has provided me invaluable insights based on his expertise in his area, and his sharp analytic skill has always inspired me. They gave me many helpful comments in positioning our work and getting it published in a top-tier conference.

It has been a great pleasure to work with my co-authors Inderjit Dhillon, Han Hee Song, Berkant Savas, and Vacha Dave. Inderjit has provided me many helpful comments and insights from his knowledge of data mining and mathematics. Han is a good friend and an excellent colleague to work with. Berkant has helped me a great deal in understanding and solving tough problems using his expertise in data mining and scientific computing. Vacha has been a great asset to our team; she is very good at collecting and analyzing tremendous amounts of network data.

I would like to thank my other committee members Mohamed Gouda, Raymond Mooney, and Lili Qiu. Professor Gouda gave me many helpful comments on network protocols and architectures. Raymond showed great interest in the link prediction problem, and I received very good feedback from his experience and knowledge in the machine learning field. Lili is an excellent researcher in the wireless area. Her passion and hard work in research have motivated me greatly, and she also gave me many helpful comments on writing papers.

Without my colleagues and lab members, I could not have survived graduate school. I want to thank Ajay Mahimkar, Upendra Shevade, Mikyoung Han, Eric Rozner, and the LASR group members. I also thank the lab alumni Jayaram Mudigonda, Ravi Kokku, Taylor Riche, and Umamaheswararao Karyampudi, who helped me set up the lab environment and understand research problems early in my graduate career.

Finally, I dedicate my dissertation to my loving wife Christy, my mischievous son Brandon, and my beautiful daughter Claire. They have always supported me with love, and motivated me to move forward in my life.

Table of Contents

List of Tables

List of Figures

Chapter 1. Introduction
1.1 Challenges
1.2 Approach
1.2.1 Information-centric Group Formation
1.2.2 Scalable Proximity Embedding
1.2.3 Supervised Link Prediction
1.2.4 MAD: Multicast with Adaptive Dual-state
1.3 Outline

Chapter 2. Information-centric Group
2.1 Information-centric Group Formation
2.1.1 User-based Group
2.1.1.1 Online Social Network
2.1.2 Content-based Group
2.1.2.1 Multicast

Chapter 3. Scalable Proximity Embedding
3.1 Introduction
3.2 Background
3.2.1 Proximity Measures
3.2.2 Spectral Graph Embedding
3.2.3 Graph Clustering
3.3 Proximity Embedding
3.3.1 Problem Formulation
3.3.2 Scalable Proximity Inversion
3.3.2.1 Preparation
3.3.2.2 Proximity Embedding
3.4 Clustered Spectral Graph Embedding
3.4.1 Proposed Algorithm
3.4.2 Advantages of Our Approach
3.4.3 Scalability Analysis
3.4.4 Proximity Estimation Using CSGE
3.5 Evaluation
3.5.1 Dataset Description
3.5.2 Graph Clustering
3.5.3 Scalability
3.5.4 Proximity Estimation
3.5.4.1 Evaluation Methodology
3.5.4.2 Estimating Proximity Metrics
3.6 Related Work

Chapter 4. Supervised Link Prediction
4.1 Introduction
4.2 Our Approach
4.2.1 Problem Setup
4.2.2 Supervised Learning Metric
4.2.3 Training and Testing of Link Predictors
4.2.4 Alignment
4.3 Link Prediction Evaluation
4.3.1 Dataset Description
4.3.2 Evaluation Methodology
4.3.3 Scalability Evaluation
4.3.4 Accuracy Evaluation

Chapter 5. MAD: Multicast with Adaptive Dual-state
5.1 Introduction
5.1.1 Requirements of Information-centric Network Services
5.1.2 MAD Approach and Contributions
5.2 Related Work and Limitations
5.3 MAD Overview
5.4 MAD Protocol Design
5.4.1 Mode Transition
5.4.2 Failure Recovery
5.5 Scaling of MAD Trees
5.5.1 Simulation Evaluation
5.5.1.1 Simulation Setup
5.5.1.2 Simulation Results
5.5.2 Formal Analysis of State Requirement
5.6 Evaluation of Implementation

Chapter 6. Conclusions and Future Work
6.1 Contributions
6.2 Future Work

Bibliography

Vita

List of Tables

3.1 Summary of Proximity Measures
3.2 Summary of Online Social Network Characteristics
3.3 Clustering Results
3.4 Preparation Time of Graph Embedding Algorithms (r = 100)
3.5 Query Time of Proximity Estimation Algorithms (r = 100, 0.6 million samples)
3.6 Clustered Spectral Graph Embedding Computation Time and Memory Usage (0.6 million samples)

4.1 Preparation Time of Graph Embedding Algorithms (r = 100)
4.2 Query Time of Proximity Estimation Algorithms (r = 100, 0.6 million samples)

List of Figures

3.1 Proximity embedding
3.2 Regular Embedding vs. Clustered Spectral Graph Embedding
3.3 Comparison of eigendecomposition time
3.4 CDF of Normalized Absolute Errors
3.5 CDF of relative errors for different metrics

4.1 Shortest path distance between user pairs who are disconnected in snapshot 1 and become connected in snapshot 2
4.2 Link prediction accuracy with Clustering
4.3 Link prediction accuracy of Spectral Learning Metric
4.4 Link prediction accuracy with 2-hop scenario

5.1 YouTube channel characteristics
5.2 Publishing characteristics of RSS feeds
5.3 Examples of MAD trees
5.4 Scaling of MAD trees on topology pow-16k
5.5 Max # of groups that 2^16 routers (each with 3GB MEM) can hold
5.6 Analytical results on the scaling of MAD trees
5.7 Efficiency of MAD
5.8 Cost of mode transition from MT to CBT

Chapter 1

Introduction

Large-scale information dissemination and aggregation have increasingly become a dominant use of the Internet. Diverse applications have hastened the migration from point-to-point communication of information between a single producer and consumer to multipoint communication among a multitude of distributed information producers and large groups of interested information consumers. These information-centric services are growing rapidly in use and deployment. For example, IPTV services that use IP multicast as the underlying distribution technology are being deployed by multiple carriers, and their subscriber numbers and revenues have grown rapidly [53]. Online social networks, such as MySpace [50], Facebook [20], and YouTube [79], have each attracted tens of millions of visitors every month [55] and are now among the most popular Web sites [3]. Multiplayer online games (MMORPGs) are reported to see 30-100% annual subscription growth [11]. Other common examples of deployed information-centric services include: file sharing, software updates, RSS dissemination, video conferencing, online markets, video-on-demand, and grid computing. All these services require the capability of large-scale information communication.

To effectively support future information-centric services, the network infrastructure for multi-point communication has to address a number of significant challenges. First, an important issue is how to identify "information" and how to form a "group" of people who share similar interests in that information. The massive scale and high dimensionality of large information-centric groups (e.g., social networks in [65]) pose significant scalability challenges in analyzing group structures. Second, information-centric groups are often highly dynamic, with hundreds of thousands of new users and millions of user relations added daily. It is challenging to predict the evolution of those groups in an accurate and efficient way. Third, information dissemination has to support a vast number of groups with potentially long-lived membership and highly diverse, dynamic group activity levels.

This dissertation proposes a novel architecture and protocols that effectively address the above challenges in supporting multi-point communication for future information-centric network services. In doing so, we make the following three major contributions:

(1) We first develop the Proximity Embedding (PE) technique, which can approximate the closeness of users in information-centric groups. Our proximity embedding technique can be applied to a large family of path-ensemble based proximity measures, which have been effective in many applications but are too costly to compute in massive network structures. Then, we develop Clustered Spectral Graph Embedding (CSGE), which captures the fundamental structure of the original information-centric groups and allows a wide range of computational analysis tasks to be performed on the reduced structure. Our scalable proximity estimation technique helps to quantify user relationships in massive online social networks, and our new techniques result in nearly an order of magnitude speedup over the state-of-the-art techniques [65].

(2) Using the essential clustering and spectral structure of information-centric groups, we develop scalable and efficient techniques, Clustered Spectral Learning (CSL) and Clustered Polynomial Learning (CPL), to predict the evolution of information-centric groups. Specifically, we derive new supervision-based proximity measures that are more effective for the link prediction problem (i.e., predicting which user pairs will become new friends), and our results show that the prediction accuracy is improved by up to 20%.

(3) We develop a novel multicast infrastructure called MAD (Multicast with Adaptive Dual-state) for content-based information-centric groups. MAD supports a large number of groups, large group sizes, and efficient content dissemination in the presence of dynamic group activity. We demonstrate the effectiveness of our approach using extensive simulation, analysis, and emulation of a real system implementation.

1.1 Challenges

Information-centric network services create significant new research challenges and opportunities. These services exhibit several key characteristics:

• Scalability in the number of groups. Given the increasing amount of electronic content and the need to ensure that only relevant information is disseminated, an increasing number of fine-granularity groups will need to be managed, with a distinct group for each piece of distributable content. For example, eBay lists over ten million new items every day [46], each of which can be a potential group. As a result, the number of groups that the underlying architecture can support will need to increase significantly.

• Scalability in group size. As the network evolves to support models of information communication such as publish/subscribe, membership is likely to be long-lived. Users tend to subscribe but not unsubscribe, and continue to be interested in receiving information sent infrequently by publishers. Long-lived membership can significantly increase group size over time, resulting in higher control overhead of group management.

• Flexibility in group formation. Users' interests are diverse and eclectic, and different applications require different types of information-centric groups. The Internet should enable users to communicate using flexibly defined notions of groups. For example, users could define groups of different sizes based on various types of proximity measures (e.g., distance functions) between users, depending on the situation.

• Ability to adapt to a wide range of group activity levels. Group activity tends to exhibit a skewed distribution. That is, most groups generate relatively infrequent and/or small amounts of data traffic, yet a small fraction (e.g., 20%) of active groups account for the vast majority (e.g., 80%) of data traffic. Note that the 80-20 rule is also observed in many other network applications, e.g., subscription counts of RSS feeds [42], view counts of video clips [12], incoming link counts of Web pages [13], and file access frequencies of online streaming servers [16]. The activity level within a group tends to vary over time, too. Some new groups become active quickly, whereas other groups become dormant after a peak.

• High dimensionality in proximity measures. Proximity measures are important for many applications in understanding group structures. Unfortunately, given the massive size of information-centric groups (e.g., online social networks with 100 million users), a few dimensions may not capture enough information about the underlying structure. Despite significant progress in low-rank approximation of massive network matrices, proximity measures such as rooted PageRank and escape probability require higher dimensionality and thus cannot be approximated accurately by existing techniques. Moreover, the required processing overhead becomes increasingly expensive as the size of the group grows.

1.2 Approach

1.2.1 Information-centric Group Formation

We develop novel techniques to facilitate the formation of fine-grained and meaningful information-centric groups. We consider two general classes of approaches for group formation:

(1) User-based approach, which uses the vector space model to encode users' interests based on their preferences for items or other users. Once a user is represented as a user vector, the relation between two users can be computed using a distance function on the two vectors. The members of a user-based group are then the set of users placed within the group perimeter. For example, online social networks can encode users as friendship vectors over other users.
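To make the user-based approach concrete, the following is a minimal sketch (our illustration, not the dissertation's implementation) of vector-space group formation. The user names, the choice of cosine similarity as the proximity function, and the threshold value are all assumptions for the example.

```python
import numpy as np

# Hypothetical user vectors: each row encodes one user's preferences
# (e.g., ratings on items or friendship indicators with other users).
users = {
    "alice":   np.array([5.0, 0.0, 3.0, 1.0]),
    "bob":     np.array([4.0, 0.0, 4.0, 1.0]),
    "charlie": np.array([0.0, 5.0, 0.0, 4.0]),
}

def cosine_similarity(u, v):
    """Proximity of two user vectors; higher means closer."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def user_based_group(center, threshold=0.9):
    """All users whose similarity to `center` exceeds the threshold
    fall within the group perimeter."""
    c = users[center]
    return [name for name, vec in users.items()
            if name != center and cosine_similarity(c, vec) >= threshold]

print(user_based_group("alice"))   # ['bob'] for this toy data
```

Lowering the threshold enlarges the group perimeter, which is how variable group sizes (discussed in Chapter 2) can be realized.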

(2) Content-based approach, which uses the notion of a "content descriptor" (CD) to capture both keyword and structure information. Identifying specific data items that meet users' demands requires performing expensive match operations repeatedly in many distributed places. CDs enable the independent matching of data items and interests. For example, any data item containing the CD /music/genre/rock is said to match the interest profiles of rock music fans. Thus, CDs provide a rendezvous point for information publishers and consumers. In a multicast service, for example, a multicast group can be formed for the corresponding interest group based on the content-based approach.

1.2.2 Scalable Proximity Embedding

An online social network can be viewed as a graph structure where a node represents a user and an edge between two users represents a friendship relation. We develop novel proximity embedding techniques for efficient and accurate proximity estimation to explain the closeness or similarity of two users in user-based information-centric groups.

We first propose the proximity embedding technique, which is applicable to a family of path-ensemble based proximity measures (e.g., the Katz measure [31]). Despite the effectiveness of path-ensemble based proximity measures, it is computationally expensive to summarize the ensemble of all paths between two nodes. The state of the art in estimating path-ensemble based proximity measures (e.g., [72]) typically can only handle social networks with tens of thousands of nodes. Our proximity embedding technique applies matrix factorization to approximate the proximity matrix as the product of two lower-rank factor matrices, which can handle massive online social networks with millions of nodes.

Second, we develop a novel dimensionality reduction technique called clustered spectral graph embedding that embeds the original massive but sparse graph into a much smaller but dense graph. The embedded graph captures the fundamental structure of the original graph. Our new technique results in nearly an order of magnitude speedup over the existing state-of-the-art proximity estimation techniques, and is able to create approximations whose rank is an order of magnitude higher than previous methods.

1.2.3 Supervised Link Prediction

Link prediction is the task of predicting the edges that will be added to a network in the future based on past snapshots of the network. Understanding which proximity measures lead to the most accurate link predictions provides valuable insights into the nature of networks and can serve as the basis for comparing various network evolution models. Accurate link prediction also allows information-centric groups to automatically make high-quality recommendations on potential group members, making it much easier for individual users to expand their social neighborhood.

We propose novel supervised proximity measures tailored to the link prediction problem (called spectral learning, clustered spectral learning, polynomial learning, and clustered polynomial learning). Based on the clustered spectral graph embedding of past snapshots of networks, our new proximity techniques learn optimal parameters to best approximate the proximity information.

1.2.4 MAD: Multicast with Adaptive Dual-state

We describe Multicast with Adaptive Dual-state (MAD), a novel architecture that can scalably support a vast number of multicast groups with diverse, time-varying activity, in an efficient and transparent manner on today's commercial hardware. MAD has the following key features.

1. MAD provides persistence in group membership by explicitly decoupling the membership state from the forwarding state. MAD uses a distributed state management approach to efficiently store group membership state at very large scale. For each group, a small number of routers form a membership tree (rooted at a core router) to maintain the membership state.

2. MAD achieves both efficiency in data forwarding and scalability in the number of groups by treating active groups and inactive groups differently to optimize for different performance objectives. Specifically, messages to an active group are handled using any existing multicast protocol (for maximizing forwarding efficiency), whereas messages to an inactive group are forwarded along the membership tree (for minimizing state requirement and control overhead). Our specific instantiation of MAD uses the Core Based Tree (CBT) [4] (or a shared tree using PIM-SM [22]) for active groups due to its known efficiency and scalability. We refer to the CBT of an active group as the dissemination tree, in contrast to the membership tree.

3. MAD provides transparency in the presence of dynamic changes in group activity level. Since group activity can drastically change over time, MAD provides seamless transition mechanisms to promote inactive groups to active groups, and vice versa, without any end-system participation. Messages are forwarded along the dissemination tree or the membership tree based on the activity level.

4. MAD is designed to be modular so that we can easily replace its components to take advantage of alternative multicast or information dissemination capabilities when they become available. Specifically, our current instantiation of MAD begins as an overlay multicast service. However, when multicast capability is available at the underlay, MAD can directly leverage such capability to further improve forwarding efficiency. For example, messages sent to an active group can be directly handled by IP multicast for better forwarding efficiency. Messages sent to an inactive group are forwarded along the membership tree. In doing so, stateless multicast protocols can be used to avoid generating redundant overlay unicast messages. Finally, it is possible to use existing peer-to-peer technology to disseminate content to active groups.

1.3 Outline

This dissertation is organized as follows. In Chapter 2, we describe information-centric group formation and information-centric network services. In Chapter 3, we describe scalable proximity embedding techniques. In Chapter 4, we describe supervised link prediction techniques. In Chapter 5, we describe the Multicast with Adaptive Dual-state (MAD) architecture.

Chapter 2

Information-centric Group

2.1 Information-centric Group Formation

To support information-centric networking services, an important issue is how to identify "information" and how to form a "group" of people who share similar interests in that information. It is critical to enable sharing of information at a fine enough granularity to ensure that only relevant and non-redundant information is accessed and disseminated. Also, the information-centric group definition has to be powerful enough to meet the needs of individual users, whose demands change rapidly over time.

We develop novel techniques to facilitate the formation of fine-grained and meaningful groups. We consider two general classes of approaches for group formation: a user-based approach and a content-based approach.

2.1.1 User-based Group

For the user-based approach, we propose the vector-space model, in which users' positions are encoded using preferences for items or other users. For example, users can provide relations with other users (e.g., friend relations, co-author relations) or their ratings on items (e.g., movie/book ratings) in user vectors.

We propose a novel architecture to support meaningful and flexible group formation by allowing multiple proximity definitions and variable group sizes to meet the diverse needs of users and application demands. In doing so, we define the neighborhood of a specific user as a user-based group by selecting similar or close users in the vector space. The size of a group can be controlled by setting the threshold of the proximity score accordingly.

We develop scalable techniques to capture proximity between users in massive social networks, where user vectors represent friendship information with other users.

2.1.1.1 Online Social Network

Online social networks (OSNs) have gained tremendous popularity recently. Social networking sites such as MySpace [50], Facebook [20], YouTube [79], Twitter [74] and LiveJournal [43] allow users to interact with one another, share information, and form virtual communities in cyberspace. These networks have each attracted tens of millions of visitors each month [55] and are among the most popular sites on the Internet [3]. The explosive growth of online social networks creates unprecedented research opportunities in the field of computer networks and beyond.

A key requirement of many social network applications is to accurately distinguish users that are similar or close to each other from those users that are far apart. For example, when two nodes are socially "close", they are often more trustworthy to each other, which is useful for fraud detection [15], spam mitigation [24,73], and identity verification [80,81]. Meanwhile, nearby nodes are likely to have similar interests, which is useful for Internet search [47] and content recommendation [6]. Proximity measures quantify the closeness or similarity between any given two nodes in a social network.

Most existing online social networks allow users to define social relationships with other users, often referred to as "friends". Suppose we map users into a vector space; then we can encode each user with a friendship vector, with each element representing the pair-wise relationship with the corresponding user in the social network. The friendship vector becomes a building block for computing proximity measures between users in a social network. Thus, user-based information-centric groups can be defined using proximity values on the social graph.

2.1.2 Content-based Group

We propose the notion of content descriptors (CDs) to capture both keyword information and document structure when forming content-based groups. To identify whether a specific data item meets users' demands, the published data item would have to be matched against users' interests. In a distributed environment, it is very expensive to perform such matches at multiple places in the network. Content descriptors decouple information producers and consumers by providing independent matching of data items and interests. For example, a CD could be a keyword element of an XML data path. A data item is then said to match an interest set only if their respective sets of CDs have at least one CD in common. CDs achieve a good balance between expressiveness and granularity.
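As a rough illustration of this decoupling, the sketch below matches a published data item against interest profiles purely by CD set intersection. The CD strings and profile names are illustrative assumptions, following the /music/genre/rock example above.

```python
# Content descriptors (CDs) attached to a published data item.
item_cds = {"/music/genre/rock", "/music/year/1990s"}

# Interest profiles of consumers, each expressed as a set of CDs.
profiles = {
    "rock_fans": {"/music/genre/rock"},
    "jazz_fans": {"/music/genre/jazz"},
}

def matches(item, profile):
    # A data item matches an interest set iff the two CD sets
    # have at least one CD in common.
    return bool(item & profile)

matched = [name for name, cds in profiles.items() if matches(item_cds, cds)]
print(matched)  # ['rock_fans']
```

Because the match is a simple set test, a publisher and a consumer never need to evaluate each other's full content or interest description; the CD acts as their rendezvous point.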

2.1.2.1 Multicast

Multicast is an approach that uses network and server resources efficiently to support multipoint communication. As the Internet evolves to become information-centric, network services increasingly demand scalable and efficient dissemination of information from a multitude of distributed information producers to large groups of interested information consumers. These information-centric services are growing in use and deployment. For example, IPTV services that use IP multicast as the underlying distribution technology are being deployed by multiple carriers. Multiplayer online games (MMORPGs) are reportedly seeing 30-100% annual subscription growth [11]. Other common examples of deployed services that are information-centric include: file sharing, software updates, RSS dissemination, video conferencing, online markets, video-on-demand, and grid computing.

Given the increasing amount of electronic content and the need to ensure that only relevant information is disseminated, multicast will need to manage an increasing number of fine-granularity groups, with a distinct multicast group for each piece of distributable content mapped to the corresponding content-based group. For example, eBay lists over ten million new items every day [46], each of which can be a potential group. As a result, the number of groups that the underlying multicast architecture can support will need to increase significantly beyond what we typically see with IP multicast in the underlay, or with overlay multicast.

Chapter 3

Scalable Proximity Embedding

3.1 Introduction

A central concept in the computational analysis of social networks is the proximity measure, which quantifies the closeness or similarity between nodes in a social network. Proximity measures form the basis for a wide range of important applications in social and natural sciences (e.g., modeling complex networks [5,19,29,52]), business (e.g., viral marketing [28], fraud detection [15]), information technology (e.g., improving Internet search [47], collaborative filtering [6]), computer networks (e.g., constructing overlay networks [56]), and cyber security (e.g., mitigating email spams [24], defending against Sybil attacks [81]).

Challenges. Unfortunately, the explosive growth of online social networks imposes significant challenges on proximity estimation.

1. Scalability. First, online social networks are typically massive in scale. For example, MySpace has over 400 million user accounts [51], and Facebook reportedly has over 120 million active users worldwide [21]. As a result, many proximity measures that are highly effective in relatively small social networks (e.g., the classic Katz measure [31]) become computationally prohibitive in large online social networks with millions of nodes [63].

2. Dynamic update. Second, online social networks are often highly dynamic, with hundreds of thousands of new nodes and millions of edges added daily. In such fast-evolving social networks, it is challenging to compute up-to-date proximity measures in a timely fashion.

3. Diverse networks. With the huge success of online social networks, different networks often exhibit diverse social structures and behavioral patterns. Thus, the effectiveness of different proximity measures varies significantly across networks, and it is challenging to figure out a priori which proximity measure will perform best for a given network.

Contributions. To address the above challenges, we first develop a novel technique, proximity embedding, for efficient and accurate proximity estimation in large social networks with millions of nodes. Our technique is applicable to a family of path-ensemble based proximity measures, which includes the Katz measure [31], as well as rooted PageRank [39,40] and escape probability [72]. These proximity measures are known to be highly effective for many applications [39,40,72], but were previously considered computationally prohibitive for large social networks [63,72].

Second, we develop a novel dimensionality reduction technique called clustered spectral graph embedding that embeds the original massive but sparse social graph into a much smaller but dense graph. The embedded graph captures the fundamental structure of the original social graph and allows a wide range of computational analysis tasks to be performed on the embedded graph instead of the original graph. In the context of proximity estimation, our new technique results in nearly an order of magnitude speedup over the state-of-the-art proximity estimation techniques [66]. More importantly, with the same memory requirement, our technique is able to create approximations whose rank is an order of magnitude higher than previous methods. As a result, our technique dramatically improves the approximation accuracy of proximity measures such as rooted PageRank and escape probability, which are not very low-rank and thus cannot be approximated accurately by previous methods. Finally, since our technique captures the essential clustering and spectral structure of the underlying social graph, it can be applied to derive new proximity measures that are more effective for specific social network applications.

3.2 Background

A social network is denoted as a graph G = (V, E), where V = {1, 2, ..., |V|} is the set of vertices, and E = {E_ij | i, j ∈ V} is the set of edges. In particular, if there is an edge between vertex i and vertex j, then E_ij denotes the weight of this edge. The adjacency matrix A of G is an m × m matrix with m = |V|:

$$a_{ij} = A(i,j) = \begin{cases} E_{ij}, & \text{if there is an edge between } i \text{ and } j, \\ 0, & \text{otherwise.} \end{cases}$$

3.2.1 Proximity Measures

Proximity measures are the basis for many applications of social networks. As a result, a variety of proximity measures have been proposed. The simplest proximity measures are based on either the shortest graph distance or the maximum information flow between two nodes. One can also define proximity measures based on node neighborhoods (e.g., the number of common neighbors). Finally, several more sophisticated proximity measures involve infinite sums over the ensemble of all paths between two nodes (e.g., Katz measure [31], rooted PageRank [39,40], and escape probability [72]). Compared with more direct proximity measures such as shortest graph distances and numbers of shared neighbors, path-ensemble based proximity measures incorporate more information about the underlying social structure and have been shown to be more effective in social networks with thousands of nodes [39,40,72].

Now, we formally define the following commonly used proximity measures: (i) common neighbor, (ii) Katz measure, (iii) rooted PageRank, and (iv) escape probability.

Common neighbor. Let N(i) be the neighbor set of a node i. Then, two nodes i and j are more likely to become friends when the number of common friends is high:

$$P_{CN}(i, j) = |N(i) \cap N(j)|$$

Katz measure. The Katz measure captures the relationship of two nodes by taking into account (i) the number of paths between the two nodes and (ii) the lengths of those paths. By summing over the paths between the two nodes, damped by path length, a high number of short paths between two nodes signifies a stronger relationship:

$$P_{KZ}(i, j) = \sum_{k=1}^{\infty} \beta^{k} \, |path^{\langle k \rangle}_{i,j}|$$

Rooted PageRank. The rooted PageRank is defined as the random walk probability for two nodes to run into each other on the graph. Specifically, it is the stationary probability of j under the following random walk: (i) jump to node i with probability 1 − β_RPR; (ii) move to a random neighbor of the current node with probability β_RPR. Let D be a diagonal matrix with D(i, i) = Σ_j A(i, j), and let T = D^{-1} A be the adjacency matrix with row sums normalized to 1. Then RPR is defined as follows:

$$P_{RPR} = (1 - \beta_{RPR})(I - \beta_{RPR} T)^{-1}$$

Escape probability. The escape probability computes the probability that a random walk from node i will visit node j before it returns to node i:

$$P_{EP}(i, j) = \frac{R(i, j)}{R(i,i)\,R(j, j) - R(i, j)\,R(j, i)}$$

where R(i, j) = P_RPR(i, j)/(1 − β_RPR).
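For concreteness, the following sketch computes these four measures directly with numpy on a small toy graph. The adjacency matrix and the damping factors β are illustrative assumptions, and the direct matrix inversions are only feasible at this tiny scale, which motivates the scalable techniques developed later in this chapter.

```python
import numpy as np

# Toy symmetric adjacency matrix of a 4-node social graph (assumed data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def common_neighbors(A):
    # A^2 counts length-2 paths, i.e., shared neighbors.
    return A @ A

def katz(A, beta=0.05):
    # P_KZ = (I - beta*A)^{-1} - I; requires beta < 1/|lambda_max|.
    I = np.eye(len(A))
    return np.linalg.inv(I - beta * A) - I

def rooted_pagerank(A, beta=0.85):
    # P_RPR = (1 - beta)(I - beta*T)^{-1}, with T = D^{-1} A row-normalized.
    T = A / A.sum(axis=1, keepdims=True)
    I = np.eye(len(A))
    return (1 - beta) * np.linalg.inv(I - beta * T)

def escape_probability(A, beta=0.85):
    R = rooted_pagerank(A, beta) / (1 - beta)
    P = np.zeros_like(R)
    for i in range(len(A)):
        for j in range(len(A)):
            if i != j:
                P[i, j] = R[i, j] / (R[i, i] * R[j, j] - R[i, j] * R[j, i])
    return P

print(common_neighbors(A)[0, 1])  # nodes 0 and 1 share one neighbor (node 2)
print(katz(A)[0, 1])              # short paths dominate the Katz score
```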

3.2.2 Spectral Graph Embedding

Let A be an m×m adjacency matrix of the original graph. For simplicity, we first assume that A is symmetric and later explain how to extend our formalization to the case when A is asymmetric. A graph embedding can be mathematically formalized as the following decomposition:

$$A_{m \times m} \approx U_{m \times n} D_{n \times n} U_{m \times n}^{T}, \qquad (3.1)$$

where U is an orthonormal matrix, i.e., U^T U = I_n, and D is an n × n matrix which represents the embedded adjacency matrix of the original graph. Equation (3.1) can be applied to approximate any matrix power A^k as follows:

$$A^{k} \approx (U D U^{T})^{k} = U D^{k} U^{T}. \qquad (3.2)$$

As a special case, A^2 gives the number of common neighbors between any pair of nodes, which is a frequently used proximity measure.

Many functions defined on A can be approximated using sums of matrix powers through Taylor series expansion. Using Equation (3.2) we can approximate these functions with corresponding functions in D. For example, for the Katz measure,

$$P_{KZ} = (I - \beta A)^{-1} - I = \sum_{k=1}^{\infty} \beta^{k} A^{k}, \qquad (3.3)$$

we can approximate P_KZ using Equations (3.1) and (3.2) as follows:

$$P_{KZ} \approx U \Big( \sum_{k=1}^{\infty} \beta^{k} D^{k} \Big) U^{T} = U \big( (I_n - \beta D)^{-1} - I_n \big) U^{T} \qquad (3.4)$$
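A minimal sketch of Equation (3.4): compute a rank-n spectral embedding with scipy's eigsh and evaluate approximate Katz scores in the embedded space. The random graph, the rank n, and the damping factor β are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

m, n, beta = 1000, 20, 0.005       # graph size, embedding rank, damping (assumed)
A = sp.random(m, m, density=0.01, format='csr', random_state=0)
A = A + A.T                        # symmetrize; edge weights are irrelevant here

# Spectral graph embedding (Equation 3.1): n largest-magnitude eigenpairs.
vals, U = eigsh(A, k=n, which='LM')

# Katz in the embedded space (Equation 3.4):
# P_KZ ~ U [ (I_n - beta*D)^{-1} - I_n ] U^T, with D = diag(vals).
core = np.linalg.inv(np.eye(n) - beta * np.diag(vals)) - np.eye(n)

def katz_score(x, y):
    """Approximate Katz proximity between nodes x and y from the embedding."""
    return float(U[x] @ core @ U[y])

print(katz_score(0, 1))
```

Once the small n × n core matrix is precomputed, each node-pair query touches only two length-n rows of U, which is what makes the embedded representation attractive at scale.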

Eigendecomposition as graph embedding. When m is not too large, a simple way of computing U is through eigendecomposition. But there are several drawbacks to applying eigendecomposition as the graph embedding for our application.

1. Approximation. The best approximation of A is given by A ≈ U Λ U^T, where Λ is a diagonal matrix whose diagonal consists of the n largest (in magnitude) eigenvalues of A, and the columns of U are the corresponding eigenvectors.

2. Limitations. Unfortunately, the computational complexity and memory requirement of this approximation through eigendecomposition increase quickly as the network size grows. Moreover, our experience suggests that with small n, spectral graph embedding cannot capture sufficient social/network structure for very large social networks and therefore yields poor approximation accuracy.

3.2.3 Graph Clustering

Assume that we have a clustering of the set of vertices V into c disjoint clusters V_i, i = 1, ..., c, i.e., V = ∪_{i=1}^{c} V_i and V_i ∩ V_j = ∅ for all i ≠ j. Let m_i = |V_i|. Without any loss of generality, we can assume that the vertices in V_1, ..., V_c are sorted in strictly increasing order. Then the adjacency matrix will have the following form:

$$A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1c} \\ A_{21} & A_{22} & & \\ \vdots & & \ddots & \\ A_{c1} & & & A_{cc} \end{bmatrix} \qquad (3.5)$$

where each diagonal block A_ii, i = 1, ..., c, is an m_i × m_i matrix that can be considered as a local adjacency matrix for cluster i. The off-diagonal blocks A_ij, i ≠ j, are m_i × m_j matrices and contain the set of edges that go between vertices belonging to different clusters. In an ideal scenario, with perfect clustering, the off-diagonal blocks will not contain any edges, thus yielding A_ij = 0, and the graph will consist of c disconnected components.

Benefits of clustering. Since the diagonal blocks can be decoupled, all the computation can be performed independently on each of the blocks and can easily be parallelized. The sizes of the diagonal blocks are typically several orders of magnitude smaller than the original graph. With the same amount of computation, we can achieve much more accurate results on these smaller graphs.

Objective function. For a given graph G = (V, E) there are various objective functions that measure the quality of a clustering, examples including minimum cut [78], ratio cut [26], and normalized cut [64]. It has been shown that optimizing these objective functions is NP-hard [75]. Although our approach is valid for any clustering, most computational benefits are obtained when cluster sizes are approximately the same, while capturing the largest number of edges within the clusters. In this respect, the normalized cut objective [64], which seeks to minimize disassociations between clusters and maximize associations within a cluster, is well suited for our clustering problem.

3.3 Proximity Embedding

In this section, we propose proximity embedding, an efficient and accurate technique to approximate a large family of path-ensemble based proximity measures. Our technique can handle social networks with millions of nodes, which are several orders of magnitude larger than what the state of the art can support.

Despite the effectiveness of path-ensemble based proximity measures, it is computationally expensive to summarize the ensemble of all paths between two nodes. The state of the art in estimating path-ensemble based proximity measures (e.g., [72]) typically can only handle social networks with tens of thousands of nodes. As a result, recent works on proximity estimation in large social networks (e.g., [63]) either dismiss path-ensemble based proximity measures due to their prohibitive computational cost or leave it as future work to compare with these proximity measures.

3.3.1 Problem Formulation

Below we first formally define the classic path-ensemble based proximity measure, the Katz measure, and then show that it can be efficiently estimated by solving a subproblem, the proximity inversion problem. In all our discussions below, we model a social network as a graph G = (V, E), where V is the set of nodes and E is the set of edges. G can be either undirected or directed, depending on whether the social relationship is symmetric.

Katz measure. The Katz measure [31] is a classic path-ensemble based proximity measure. It is designed to capture the following simple intuition: the more paths there are between two nodes and the shorter these paths are, the stronger the relationship is (because there are more opportunities for the two nodes to discover and interact with each other in the social network). Given two nodes x, y ∈ V, the Katz measure Katz[x, y] is a weighted sum of the number of paths from x to y, exponentially damped by length to count short paths more heavily. Formally, we have

$$Katz[x, y] = \sum_{\ell=1}^{\infty} \beta_{Katz}^{\ell} \cdot |paths^{\langle \ell \rangle}_{x,y}| \qquad (3.6)$$

where paths⟨ℓ⟩_{x,y} is the set of length-ℓ paths from x to y, and β_Katz is a damping factor. Let A be the adjacency matrix of graph G, where

$$A[x, y] = \begin{cases} 1, & \text{if } \langle x, y \rangle \in E, \\ 0, & \text{otherwise.} \end{cases} \qquad (3.7)$$

As shown in [40], the Katz measures between all pairs of nodes (represented as a matrix Katz) can be derived as a function of the adjacency matrix A and the damping factor β_Katz as follows.

$$Katz = \sum_{\ell=1}^{\infty} \beta_{Katz}^{\ell} A^{\ell} = (I - \beta_{Katz} A)^{-1} - I \qquad (3.8)$$

where I is the identity matrix. Thus, in order to compute Katz, we just need to compute the matrix inverse (I − β_Katz A)^{-1}.

The proximity inversion problem. The key to estimating path-ensemble based proximity measures is to efficiently compute elements of the following matrix inverse:

$$P \triangleq (I - \beta M)^{-1} = \sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell} \qquad (3.9)$$

where M is a sparse nonnegative matrix with millions of rows and columns, I is an identity matrix of the same size, and β ≥ 0 is a damping factor. We term this common subproblem the proximity inversion problem.

3.3.2 Scalable Proximity Inversion

The key challenge in solving the proximity inversion problem (i.e., computing elements of the matrix P = (I − βM)^{-1}) is that while M is a sparse matrix, P is a dense matrix with millions of rows and columns. It is thus computationally prohibitive to compute and/or store the entire P matrix. To address the challenge, we develop a novel dimensionality reduction technique, proximity embedding, to approximate elements of P = (I − βM)^{-1} based on a static snapshot of M.

3.3.2.1 Preparation

We first present an algorithm to approximate the sum of a subset of rows or columns of P = (I − βM)^{-1} efficiently and accurately. We use this algorithm as a basic building block in both proximity sketch and proximity embedding.

Algorithm. Suppose we want to compute the sum of a subset of columns Σ_{i∈S} P[∗, i], where S is a set of column indices. We first construct an indicator column vector v such that v[i] = 1 for all i ∈ S and v[j] = 0 for all j ∉ S. The sum of columns Σ_{i∈S} P[∗, i] is simply P v and can be approximated as:

$$P v = (I - \beta M)^{-1} v = \sum_{\ell=0}^{\infty} \beta^{\ell} M^{\ell} v \approx \sum_{\ell=0}^{\ell_{max}} \beta^{\ell} M^{\ell} v \qquad (3.10)$$

where ℓ_max bounds the maximum length of the paths over which the summation is performed.

Similarly, in order to compute the sum of a subset of rows Σ_{i∈S} P[i, ∗], we first construct an indicator row vector u such that u[i] = 1 for all i ∈ S and u[j] = 0 for all j ∉ S. We then approximate the sum of rows Σ_{i∈S} P[i, ∗] = u P as:

$$u P = u (I - \beta M)^{-1} = \sum_{\ell=0}^{\infty} \beta^{\ell} u M^{\ell} \approx \sum_{\ell=0}^{\ell_{max}} \beta^{\ell} u M^{\ell} \qquad (3.11)$$

As a special case, when S contains only one element, we can approximate a single row or column of P.
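A minimal sketch of the column-sum computation in Equation (3.10), using repeated sparse matrix-vector products. The function name, the choice of β and ℓ_max, and the random test matrix are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def approx_Pv(M, S, beta=0.05, l_max=6):
    """Approximate the column sums sum_{i in S} P[:, i] of
    P = (I - beta*M)^{-1} using the truncated expansion
    sum_{l=0}^{l_max} beta^l M^l v (Equation 3.10)."""
    m = M.shape[0]
    v = np.zeros(m)
    v[list(S)] = 1.0           # indicator vector of the column subset S
    result = v.copy()          # the l = 0 term
    Mlv = v
    for l in range(1, l_max + 1):
        Mlv = M @ Mlv          # one sparse mat-vec: M^l v from M^{l-1} v
        result += (beta ** l) * Mlv
    return result

# Toy usage with an assumed sparse matrix.
M = sp.random(10000, 10000, density=1e-3, format='csr', random_state=0)
col_sums = approx_Pv(M, S={0, 42})
```

Each loop iteration is one sparse matrix-vector product, so the total cost is O(n · ℓ_max) for a matrix with n non-zeros, matching the complexity analysis below.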

Complexity. Suppose M is an m-by-m matrix with n non-zeros. Computing the product of the sparse matrix M and a dense vector v takes O(n) time by exploiting the sparseness of M. Therefore, it takes O(n · ℓ_max) time to compute {M^ℓ v | ℓ = 1, ..., ℓ_max} and approximate P v. Note that the time complexity is independent of the size of the subset S. The complexity for computing u P is identical.

Note however that the above approximation algorithm is not efficient for estimating individual elements of P. In particular, even if we only want a single element P[x, y], we have to compute either a complete row P[x, ∗] or a complete column P[∗, y] in order to obtain an estimate of P[x, y]. As a result, we only apply the above technique for preprocessing. We will develop several techniques in the rest of this section to estimate individual elements of P efficiently.

Benefits of truncation. We achieve two key benefits by truncating the infinite expansion Σ_{ℓ=0}^{∞} β^ℓ M^ℓ to form a finite expansion Σ_{ℓ=0}^{ℓ_max} β^ℓ M^ℓ. First, we completely eliminate the influence of paths with length above ℓ_max on the resulting sums. This is desirable because, as pointed out in [39,40], proximity measures that are unable to limit the influence of overly lengthy paths tend to perform poorly in other applications. Second, we ensure that Σ_{ℓ=0}^{ℓ_max} β^ℓ M^ℓ is always finite, whereas elements of Σ_{ℓ=0}^{∞} β^ℓ M^ℓ may reach infinity when the damping factor β is not small enough.

3.3.2.2 Proximity Embedding

Our dimensionality reduction technique, proximity embedding, applies matrix factorization to approximate P as the product of two rank-r factor matrices U and V:

$$P_{m \times m} \approx U_{m \times r} \cdot V_{r \times m} \qquad (3.12)$$

In this way, with O(2 m r) total state for the factor matrices U and V, we can approximate any P[x, y] in O(r) time as:

$$\hat{P}[x, y] = \sum_{k=1}^{r} U[x, k] \cdot V[k, y] \qquad (3.13)$$

Our technique is motivated by recent research on embedding network distance (e.g., end-to-end round-trip time) into low-dimensional space (e.g., [41,45,54,71]). Note however that proximity is the opposite of distance: the lower the distance, the higher the proximity. As a result, techniques effective for distance embedding do not necessarily work well for proximity embedding.

Algorithm. As shown in Figure 3.1(a), our goal is to derive the two rank-r factor matrices U and V based on only a subset of rows P[L, ∗] and columns P[∗, L], where L is a set of indices (which we term the landmark set). We achieve this goal by taking the following five steps:

1. Randomly select a subset of ℓ nodes as the landmark set L. The probability for a node i to be included in L is proportional to the PageRank of node i in the underlying graph.¹

2. Compute sub-matrices P[L, ∗] and P[∗, L] efficiently by computing each row P[i, ∗] and each column P[∗, i] (i ∈ L) separately as described in Section 3.3.2.1.

3. As shown in Figure 3.1(b), use singular value decomposition (SVD) to obtain the best rank-r approximation of P[L, L]:

$$P[L, L] \approx U[L, *] \cdot V[*, L] \qquad (3.14)$$

¹We also consider uniform landmark selection, but it yields worse accuracy than PageRank based landmark selection (see Section 5.6).

Figure 3.1: Proximity embedding. (a) Goal: approximate P as the product of two rank-r matrices U and V by computing only a subset of rows P[L, ∗] and columns P[∗, L]. (b) Factorize P[L, L] to obtain U[L, ∗] and V[∗, L]. (c) Obtain U from P[∗, L] and V[∗, L]. (d) Obtain V from P[L, ∗] and U[L, ∗].

4. Our goal is to find U and V such that U · V is a good approximation of P. As a result, U · V[∗, L] should be a good approximation of P[∗, L]. We can therefore find U such that U · V[∗, L] best approximates the sub-matrix P[∗, L] in the least-squares sense (shown in Figure 3.1(c)). With our use of SVD in step 3, the best U is simply

$$U = P[*, L] \cdot V[*, L]^{T} \qquad (3.15)$$

5. Similarly, find V such that U[L, ∗] · V best approximates the sub-matrix P[L, ∗] in the least-squares sense (shown in Figure 3.1(d)):

$$V = U[L, *]^{T} \cdot P[L, *] \qquad (3.16)$$
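The following sketch implements steps 3-5 under assumed inputs: the landmark columns P[∗, L], rows P[L, ∗], and block P[L, L] are taken as already computed by the preparation algorithm of Section 3.3.2.1. As a hedged generalization, we solve the two least-squares fits with pseudo-inverses; the closed forms in Equations (3.15)-(3.16) correspond to the case where the relevant factor is orthonormal.

```python
import numpy as np

def proximity_embedding(P_cols, P_rows, P_LL, r):
    """Build rank-r factors U, V with P ~ U @ V from landmark data.
    P_cols = P[:, L] (m x l), P_rows = P[L, :] (l x m), P_LL = P[L, L]."""
    # Step 3: best rank-r approximation of P[L, L] via SVD.
    W, s, Zt = np.linalg.svd(P_LL)
    U_L = W[:, :r] * s[:r]        # U[L, :], carrying the singular values
    V_L = Zt[:r]                  # V[:, L], with orthonormal rows
    # Step 4: fit U so that U @ V[:, L] best approximates P[:, L].
    U = P_cols @ np.linalg.pinv(V_L)
    # Step 5: fit V so that U[L, :] @ V best approximates P[L, :].
    V = np.linalg.pinv(U_L) @ P_rows
    return U, V

def estimate(U, V, x, y):
    """Approximate P[x, y] in O(r) time (Equation 3.13)."""
    return float(U[x] @ V[:, y])
```

With this split, the rows of U at the landmark indices are consistent with U[L, ∗] from step 3, so the factors fitted in steps 4 and 5 agree on the landmark block.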

Accuracy. Proximity embedding does not provide any provable data-independent accuracy guarantee. However, as a data-adaptive dimensionality reduction technique, when the matrix P is in fact low-rank, proximity embedding can potentially yield even better accuracy than proximity sketch. Our empirical results in Section 3.5 suggest that this is indeed the case for the Katz measure.

3.4 Clustered Spectral Graph Embedding

In this section, we explain our clustered spectral graph embedding algorithm. We first describe the method, followed by a comparative discussion and a scalability analysis.

3.4.1 Proposed Algorithm

The key idea for scalable graph embedding is to combine clustering with the spectral graph embedding approach. Recall that the adjacency matrix for cluster i is simply the diagonal block A_ii in Equation (3.5).

We first compute the best rank-r_i approximation through the eigendecomposition of every cluster i, i = 1, ..., c:

$$A_{ii} \approx V_i D_i V_i^{T}, \quad i = 1, \ldots, c \qquad (3.17)$$

where D_i is diagonal and contains the r_i largest (in magnitude) eigenvalues of A_ii, and V_i is the orthonormal matrix with the corresponding eigenvectors.

Due to the orthonormality of each V_i, the block-diagonal matrix

$$V = \mathrm{diag}(V_1, \ldots, V_c) \qquad (3.18)$$

is also orthonormal.

Thus, the matrix V can be used as an embedding for the entire adjacency matrix A. We obtain the following clustered spectral approximation:

$$A \approx V V^{T} A V V^{T} := V D V^{T}, \qquad (3.19)$$

where V is given by (3.18). It follows that the embedded graph D has the following block structure:

$$D = V^{T} A V = \begin{bmatrix} D_{11} & \cdots & D_{1c} \\ \vdots & \ddots & \vdots \\ D_{c1} & \cdots & D_{cc} \end{bmatrix}, \qquad (3.20)$$

and D_ij = V_i^T A_ij V_j for i, j = 1, ..., c. Note that the D_ii = D_i are diagonal (from Equation (3.17)), but the off-diagonal blocks are dense.

With a large number of clusters and larger ranks in the approximations, the size of the embedded graph D could become large. In that case, we can further approximate D:

$$D_{cr \times cr} \approx Q_{cr \times \bar{r}} \cdot \bar{\Lambda}_{\bar{r} \times \bar{r}} \cdot Q_{cr \times \bar{r}}^{T} \qquad (3.21)$$

where r̄ < cr, Λ̄ is an r̄ × r̄ diagonal matrix containing D's eigenvalues, and Q is a cr × r̄ matrix containing D's eigenvectors. When D is large, the Lanczos algorithm [37,38] is more efficient than a direct method [25,69]. In such cases, we do not even explicitly form the dense matrix D but rather use it as an operator acting on a vector, i.e., Dv = (V^T A V)v = V^T(A(V v)). Since V is block diagonal and A is sparse, computing matrix-vector products with both V and A is efficient.
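A minimal sketch of Equations (3.17)-(3.20), assuming the clustering (e.g., from a normalized-cut algorithm) is given as index arrays and that A is a symmetric scipy CSR matrix with float entries; the per-cluster rank r and all names are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def clustered_spectral_embedding(A, clusters, r):
    """A: symmetric scipy.sparse CSR adjacency matrix (float);
    clusters: list of node-index arrays V_1, ..., V_c from a given clustering;
    r: rank per cluster. Returns the block factors V_i and the embedded
    graph D."""
    # Equation (3.17): rank-r eigendecomposition of each diagonal block A_ii.
    Vs = []
    for idx in clusters:
        A_ii = A[idx][:, idx]
        _, vecs = eigsh(A_ii, k=r, which='LM')
        Vs.append(vecs)
    # Equation (3.20): D_ij = V_i^T A_ij V_j; D_ii diagonal, D_ij dense.
    c = len(clusters)
    D = np.zeros((c * r, c * r))
    for i in range(c):
        for j in range(c):
            A_ij = A[clusters[i]][:, clusters[j]]
            D[i*r:(i+1)*r, j*r:(j+1)*r] = Vs[i].T @ (A_ij @ Vs[j])
    return Vs, D
```

If D itself grows large, it can be further compressed with the rank-r̄ eigendecomposition of Equation (3.21), or used implicitly as the operator Dv = V^T(A(V v)) as described above.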

Comparison vs. low-rank approximation. The clustered graph embedding approach is different from the best low-rank approximation of A, and it has several advantages. We now highlight these differences and advantages by comparing a best rank-r approximation A ≈ U Λ U^T with the clustered low-rank approximation with r_i = r (i.e., each A_ii ≈ V_i D_i V_i^T is a rank-r approximation). Figure 3.2 shows a pictorial illustration of the regular spectral embedding compared to the clustered spectral embedding.

3.4.2 Advantages of Our Approach

By incorporating clustering with the spectral graph embedding, our method has several advantages over the regular spectral graph embedding approach.

Fast computation of A^k. Fast and efficient computation of A^k is crucial in many applications, including computing proximity measures such as common neighbor and the Katz measure. In our approach, we can easily compute the approximation A^k ≈ V D^k V^T. Since D is dense, we further decompose D using the eigendecomposition D = Q Λ̄ Q^T:

$$A^{k} \approx V D^{k} V^{T} = V Q \bar{\Lambda}^{k} Q^{T} V^{T}.$$

Note that A^k is never formed explicitly. The factor matrices V and Q in V Q are stored separately, because computing V Q would destroy the block-diagonal structure and increase the memory requirements.

Figure 3.2: Regular Embedding vs. Clustered Spectral Graph Embedding. (a) The regular embedding A ≈ U Λ U^T stores an m × r factor U. (b) The clustered spectral graph embedding A ≈ V D V^T stores a block-diagonal m × cr factor V and a cr × cr embedded graph D.

Efficient memory use. By exploiting the block-diagonal structure of the matrix V, the memory usage of U in the regular embedding and of V in the clustered embedding is exactly the same. Specifically, the rank of the approximation in the regular embedding is r while the rank in the clustered embedding is cr (c is the number of clusters). Although the size of V is c times bigger than U (see Figure 3.2), V has a block-diagonal structure and only the diagonal blocks have to be stored in memory.

On the other hand, there is a difference in the memory requirements for the embedded graphs Λ ∈ ℝ^{r×r} and D ∈ ℝ^{cr×cr}, as Λ is diagonal but D is larger and dense. Note that in many cases U and V account for the dominant part of the memory usage, since m ≫ r.

Preserving both intra- and inter-cluster information. The clustered graph embedding preserves mostly intra-cluster spectral information through the D_i, but also global inter-cluster relationships through the off-diagonal blocks D_ij, i ≠ j, of D, and is thus more accurate than approximating the clusters alone.

Since many existing social networks form good clusters, the clustered low-rank approximation yields a smaller residual ‖A − V D V^T‖ than the regular spectral approximation ‖A − U Λ U^T‖. Note that the orthonormal matrices U and V use the same amount of memory. See Section 5.6 for numerical examples.

The regular eigen approximation emphasizes the large/dominant clusters of the entire graph, since they contain the largest eigenvalues. This has a negative impact on the smaller clusters because they are ignored in the approximation. Depending on the application, the small clusters may have their own importance, and thus including information from all clusters is desirable.

Trade-off between scalability and accuracy. The computation of the clustered embedding is faster than the computation of the regular embedding (depending on the number of clusters c). The more clusters there are, the faster the computation, but fewer intra-cluster links may be used in the per-cluster eigen computations (see Section 5.6 for timing results). Thus, we provide a trade-off between computational cost and approximation accuracy. Interestingly, it is not always helpful to decrease the number of clusters, due to the intrinsic clustering of graphs.

Flexibility of clustering. The clustered spectral graph embedding is general and flexible. Different clusterings could be beneficial for different applications, and our approach is not tied to any specific clustering algorithm. There is also freedom in choosing the ranks r_i in the approximations of the different clusters. For example, one could compute the individual approximations A_ii ≈ V_i D_i V_i^T with much larger r_i than would be possible to use in the accumulated approximation for A. At a later step, the r_i could be lowered to manageable sizes based on analysis using the full D_i.

Comparison vs. proximity embedding. In comparison with proximity embedding, the clustered spectral graph embedding is more scalable and more flexible. Proximity embedding can only be applied to the family of path-ensemble based measures, but the clustered spectral graph embedding can approximate any proximity measure that benefits from the fast computation of A^k. Moreover, the clustered spectral graph embedding can capture more ranks using the same amount of processing than the proximity embedding technique, so it yields highly accurate approximation results even when the original proximity measure is not low-rank. We present an extensive evaluation comparing proximity embedding and the clustered spectral graph embedding in the evaluation section (for details, see Section 5.6).

3.4.3 Scalability Analysis

Our clustered spectral graph embedding technique is highly scalable, both in terms of storage and computation requirements. In this section, we compare the spatial complexity of the regular spectral graph embedding with that of the clustered spectral graph embedding.

For the regular graph embedding, the memory usage of the spectral approximation is bounded by the size of the matrix U containing the eigenvectors. If m = |V|, then a rank-r approximation of the adjacency matrix A requires mr floating point numbers. State-of-the-art algorithms for computing a small number of eigenvectors of a large, sparse matrix, such as ARPACK [37,38], internally use m(r + p) floats (p is a user-defined parameter, usually p ≈ r). Thus the memory complexity of computing r eigenvectors using ARPACK is O(mr).
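For concreteness, here is a minimal sketch of the regular rank-r embedding in Python with SciPy, whose eigsh routine wraps ARPACK. The dissertation's own experiments use ARPACK/PROPACK via MATLAB, so this sketch is purely illustrative:

    import scipy.sparse.linalg as spla

    def regular_embedding(A, r):
        # A: sparse symmetric adjacency matrix (scipy.sparse, m x m).
        # eigsh wraps ARPACK; its workspace is roughly m*(r+p) floats,
        # so memory grows linearly in m for a fixed rank r.
        lam, U = spla.eigsh(A, k=r)   # A is approximated by U @ diag(lam) @ U.T
        return lam, U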

In the clustered embedding, the computation of each cluster's eigen approximation is independent of the other clusters' approximations. Each of these computations can be done using the full amount of memory, capturing much more structural information from the underlying graph.

Since the size of each cluster m_i = |V_i| is orders of magnitude smaller than the total number of vertices m, we can compute the eigen approximation with a much larger rank. For example, when c = 10, the average m_i would be about m/10; allowing the clustered embedding to fully utilize the entire memory space would increase the maximum number of dimensions computed from n to 10n. For a given amount of memory space, the maximum computable dimensions for clusters can be traded off based on the size of c. The approach can be extended to use a different number of dimensions n_i for different clusters. If the number of dimensions to be computed per cluster is known a priori (depending on the clusters' importance), it is possible to assign variable dimensions to clusters and make the combined approximation more efficient.

The limiting factor of the regular embedding is U ∈ R^{m×n}, which depends on m linearly. Since m_i does not need to increase when m increases (the number of clusters would have to increase instead), the limiting factor in the clustered case becomes the embedded graph matrix D of size cn × cn. This is preferable because the available memory space is used to store and work with the graph embedding itself instead of the orthonormal matrices generating the embeddings. The difference becomes more prominent for larger graphs. For a limited amount of memory, the dimension n of the embedding Λ decreases as m increases, whereas the dimension of the clustered embedding D remains constant.

In this discussion we have assumed that D is computed block-wise according to the clustering (see (3.20)), so that only two factors V_i and V_j are accessed at a time to compute D_ij.
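To make the block-wise construction concrete, here is a minimal sketch in Python with NumPy/SciPy. It assumes, following (3.20), that each block is computed as D_ij = V_i^T A_ij V_j; the function and variable names are illustrative, not taken from the dissertation's implementation:

    import numpy as np
    import scipy.sparse.linalg as spla

    def clustered_embedding(A, clusters, r):
        # A: sparse symmetric adjacency matrix; clusters: list of node-index arrays.
        V = []
        for idx in clusters:
            A_ii = A[idx, :][:, idx]              # diagonal block of one cluster
            lam, V_i = spla.eigsh(A_ii, k=r)      # rank-r eigen approximation
            V.append(V_i)
        c = len(clusters)
        D = np.zeros((c * r, c * r))              # dense embedded graph
        for i in range(c):
            for j in range(c):
                A_ij = A[clusters[i], :][:, clusters[j]]
                # only V_i and V_j are needed at a time: D_ij = V_i^T A_ij V_j
                D[i*r:(i+1)*r, j*r:(j+1)*r] = V[i].T @ (A_ij @ V[j])
        return V, D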

The scalability of the clustered spectral approach in terms of computation time is also better than that of the regular counterpart. The clustered spectral graph embedding scales linearly with the number of nodes m in the graph, as we can constrain the size of each cluster m_i and choose to increase the number of clusters c as m increases. In this way, the eigen approximation of each cluster stays relatively inexpensive to compute. On the other hand, the eigen approximation on the entire graph requires computation time that grows quadratically. In addition to the linearity of the spatial and temporal complexity in the number of nodes, we empirically observed that they are also linear in the number of links.

3.4.4 Proximity Estimation Using CSGE

For each proximity metric, there are three different methods of computing scores: (i) the direct method, (ii) approximation using spectral graph embedding (from Equation 3.1), and (iii) approximation using clustered spectral graph embedding (from Equation 3.21).

Table 3.1 summarizes the equations for existing proximity measures: common neighbor (denoted as CN), Katz measure (denoted as KZ), rooted PageRank (denoted as RPR), and escape probability (denoted as EP).

    Measure  Description                             Definition
    CN       common neighbor                         P_CN = A^2
    aCN      approx. common neighbor                 P_aCN = U_r Λ^2 U_r^T
    aCNc     approx. common neighbor w. cluster      P_aCNc = V_r Λ^2 V_r^T
    KZ       Katz measure                            P_KZ = Σ_{k=1}^∞ β^k A^k
    aKZ      approx. Katz measure                    P_aKZ = U (Σ_{k=1}^{kmax} β^k Λ^k) U^T
    aKZc     approx. Katz measure w. cluster         P_aKZc = V (Σ_{k=1}^{kmax} β^k D^k) V^T
                                                            = V Q (Σ_{k=1}^{kmax} β^k Λ^k) Q^T V^T
    RPR      rooted PageRank                         P_RPR = (1 − β_RPR) Σ_{k=0}^∞ β_RPR^k T^k
    aRPR     approx. rooted PageRank                 P_aRPR = (1 − β_RPR) U (Σ_{k=0}^{kmax} β_RPR^k Λ^k) U^T
    aRPRc    approx. rooted PageRank w. cluster      P_aRPRc = (1 − β_RPR) V (Σ_{k=0}^{kmax} β_RPR^k D^k) V^T
                                                             = (1 − β_RPR) V Q (Σ_{k=0}^{kmax} β_RPR^k Λ^k) Q^T V^T
    EP       escape probability                      P_EP(i,j) = R(i,j) / (R(i,i)R(j,j) − R(i,j)R(j,i)),
                                                     where R(i,j) = P_RPR(i,j)/(1 − β_RPR)
    aEP      approx. escape probability              as P_EP, with R(i,j) = P_aRPR(i,j)/(1 − β_RPR)
    aEPc     approx. escape probability w. cluster   as P_EP, with R(i,j) = P_aRPRc(i,j)/(1 − β_RPR)
    Table 3.1: Summary of Proximity Measures

We describe how to compute the direct method, the spectral approximation, and the clustered spectral approximation of common neighbor as an example. The predictor based on the direct method can be computed using the entire A:

P_cn(A) = A^2.

Computing the common neighbor for the entire A is almost infeasible due to the large size of the graph. Instead, we evaluate the predictor only on a small subset of the entries of P.

The approximated common neighbor is based on the best low-rank approximation A ≈ A_r = U_r Λ_r U_r^T:

P_acn(A_r) = A_r^2 = U_r Λ_r^2 U_r^T.

Using Equation 3.21, we can compute the clustered low-rank approximation of common neighbor:

P_acn-c(A_cr) = A_cr^2 = V_cr Λ_cr^2 V_cr^T.

Similarly, we can compute the approximated version and the clustered version of the other proximity measures.
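As an illustration of evaluating a predictor only on sampled entries, the approximated common neighbor scores for sampled pairs (r_i, c_i) can be computed without ever forming A_r^2. This is a Python/NumPy sketch with a hypothetical function name:

    import numpy as np

    def acn_scores(U, lam, rows, cols):
        # P_acn(a, b) = sum_k U[a, k] * lam[k]^2 * U[b, k],
        # evaluated only at the sampled pairs (rows[i], cols[i]).
        return np.einsum('ik,k,ik->i', U[rows], lam**2, U[cols])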

3.5 Evaluation

We first give an overview of our datasets in Section 3.5.1. We then evaluate the scalability of clustered spectral graph embedding in Section 3.5.3 and study the accuracy of proximity estimation in Section 3.5.4.

3.5.1 Dataset Description

We use three real large online social networks with millions of nodes: Flickr [23], LiveJournal [43] and MySpace [50]. Table 3.2 summarizes the characteristics of snapshots for each network. LiveJournal and MySpace datasets are from [66], and the Flickr dataset is collected by [48].

Note that we consider users with at least one friend link. We focus on predicting new links and do not consider link deletions (which are very rare). We made all the networks symmetric for the evaluation (Flickr and LiveJournal), although our approach can be generalized to work with asymmetric networks as well.

Flickr [23] is a photo-sharing website, where users can create “contact” relationships with each other. We obtain this dataset from [48]; it is generated by a breadth-first search on the graph starting from a few seed nodes. Depending on the connectivity of the seed nodes, some parts of the network may not be discovered during the collection period even when they already exist. To remove this artifact, we use the first few months as a bootstrap period, and create snapshots when most nodes have been discovered and link growth is relatively stable. For the same reason, even though the snapshot dates are only ten days apart, there is 2% growth in the number of links.

    Network      Date        # nodes    # links     # added links  % added links
    Flickr       4/14/2007   1,990,149  41,302,536  –              –
                 4/25/2007   1,990,149  42,056,754  754,218        1.8%
                 5/6/2007    1,990,149  42,879,714  822,960        1.9%
    LiveJournal  02/16/2009  1,770,961  83,663,478  –              –
                 03/4/2009   1,770,961  84,413,542  750,064        0.8%
                 04/03/2009  1,770,961  85,713,766  1,300,224      1.5%
    MySpace      12/11/2008  2,137,264  90,333,122  –              –
                 1/11/2009   2,137,264  90,979,264  646,142        0.7%
                 2/14/2009   2,137,264  91,648,716  669,452        0.7%
    Table 3.2: Summary of Online Social Network Characteristics

LiveJournal [43] is a blogging site, where members can become a “fan” of other members. We obtain the LiveJournal dataset from [66]. The network is collected by listening to the RSS server, which sends out recent update information. The statistics suggest that these bloggers are more active in creating new relationships than users in the other networks.

MySpace [50] is a social networking site for people to interact with their acquaintances via posts on each other's personal pages. We obtain this dataset from [66]. The graph is symmetric since both users need to agree to having a social link. The dataset is created by collecting information on the first 10 million user IDs. It appears that MySpace assigns user IDs chronologically; as a result, the MySpace dataset is relatively dormant in social activity (e.g., the fraction of new links is smaller than in the other networks).

                 # clusters  avg. size  % intra links  % inter links
    Flickr       18          110,563    71.8%          28.2%
    LiveJournal  17          106,241    72.5%          27.5%
    MySpace      17          125,721    51.9%          48.1%
    Table 3.3: Clustering Results

3.5.2 Graph Clustering

We cluster the social networks using GRACLUS [34]. Specifically, we restrict the size of the biggest cluster to be smaller than 1/10 of the total number of users in the network. Table 3.3 summarizes the cluster characteristics: the average cluster size (i.e., the average number of users in each cluster), the fraction of intra-cluster links, and the fraction of inter-cluster links. For all datasets, the average cluster size is about 100,000, and the number of clusters is 17 or 18. Note that more than 70% of the links are intra-cluster for the Flickr and LiveJournal datasets, whereas only 51.9% of the links are intra-cluster for MySpace.

3.5.3 Scalability

In this section, we empirically assess the scalability of clustered spectral graph embedding with a series of benchmarks measuring various aspects of computational cost. We first benchmark the timing of the entire process of clustered spectral graph embedding and show its scalability by comparing it against other proximity estimation techniques. Then, in the following two subsections, we present benchmarks specific to individual steps of the clustered graph embedding generation. The experiments are conducted on an AMD Opteron(tm) 850 with 32GB of memory, running Ubuntu Linux with kernel v2.6.

                                                  Flickr    LiveJournal  MySpace
    Proximity Embedding Preparation               7.1 hour  8.7 hour     8.9 hour
    Spectral Embedding Preparation                6.3 hour  6.1 hour     7.5 hour
    Clustered Spectral Embedding Preparation      48.7 min  44.5 min     60.6 min
    Breakdown of CSGE Preparation:
      Clustering                                  24.1 min  19.7 min     33.8 min
      Eigen Decomposition                         7.0 min   8.6 min      8.5 min
      Constructing D                              17.6 min  16.2 min     18.3 min
    Table 3.4: Preparation Time of Graph Embedding Algorithms (r = 100)

                                          Flickr    LiveJournal  MySpace
    Direct Method               CN        15.9 ms   23.5 ms      24.5 ms
                                Katz      8,040 ms  14,790 ms    16,655 ms
    Proximity Embedding         aKatzp    0.051 ms  0.045 ms     0.076 ms
    Spectral Embedding          aCN       0.042 ms  0.038 ms     0.040 ms
                                aKatz     0.045 ms  0.036 ms     0.036 ms
    Clustered Spectral Embed.   aCNc      2.72 ms   5.45 ms      2.60 ms
                                aKatzc    2.76 ms   2.03 ms      2.65 ms
    Table 3.5: Query Time of Proximity Estimation Algorithms (r = 100, 0.6 million samples)

Timing Micro-benchmarks. Table 3.4 shows the preparation time of the three embedding techniques for our datasets of up to two million users and a hundred million links. The preparation time for clustered spectral graph embedding comprises the clustering time, the spectral approximation time for all clusters, and the time required for constructing the graph embedding D, as described in Section 3.4. The total time for all three stages of clustered embedding is on the order of tens of minutes, while that for spectral embedding is seven times higher. Compared with proximity embedding, the temporal scalability of clustered spectral graph embedding is even better: 60 minutes vs. 9 hours.

Table 3.5 compares the query time over 0.6 million samples for the different proximity estimation techniques. Compared with directly measuring CN and Katz, all three embeddings are much faster. Especially for the Katz measure, the direct computation for the entire 0.6 million samples would require several months! Multiplying (or inverting) the huge matrices required by the direct measures incurs prohibitive computation overhead, as shown in Section 3.2.1.

In the following two subsections, we show benchmarks specific to two stages of clustered spectral graph embedding: the spectral approximation and the generation of the graph embedding D.

[Figure 3.3: Comparison of eigen decomposition time. Total time (in minutes) vs. the number of clusters, with curves for 10, 50, and 100 dimensions per cluster.]

Scalability of Spectral Approximation. We vary the number of clusters into which the LiveJournal dataset is divided (hence changing the cluster sizes) to understand the trade-off between the number of clusters and the cluster sizes, and its impact on the aggregate eigen approximation time. Figure 3.3 plots the total time required for the spectral approximation of all the clusters versus the number of clusters used. The 0 point on the x-axis represents the no-clustering case. We use PROPACK [36], a MATLAB package optimized for spectral approximation, which is up to four times faster than the regular ARPACK-based spectral approximation provided by MATLAB. However, our machines cannot handle spectral graph embedding on the entire A with PROPACK without clustering. Hence, the spectral approximation for the no-clustering (0) case is computed using ARPACK, while for the clustered cases the spectral approximation is done with PROPACK. We observe from the graph that, for any given number of dimensions, the aggregate time of spectral approximation on the clusters is significantly less than with no clustering. We also find that increasing the number of clusters decreases the time it takes to run the eigen approximation. The benefit of clustering seems to hit a knee at around 40 clusters, when overheads start to dominate.

Scalability of Graph Embedding. In Section 3.4.3, we found the size of D to be the limiting factor for the spatial scalability of our algorithm. With this in mind, we stress test the clustered spectral graph embedding's ability to handle larger datasets by creating two large adjacency matrices of up to 12 million nodes and 600 million links and computing their graph embedding D. This dataset size is comparable to some real social networks.

We generate this large matrix by merging one snapshot from each of the three datasets to create a 6-million-node dataset, and doubling it to create a 12-million-node dataset. We place the adjacency matrices of the networks into the diagonal blocks of the large matrix. To create a large connected component covering most of the nodes (similar to real social networks), we add a small percentage of links in the off-diagonal blocks.

Table 3.6 shows the memory usage and time required for clustered spectral graph embedding on the two large datasets. As the number of nodes increases, instead of increasing the size of each cluster, we increase the number of clusters c. The number of nodes in each cluster is fixed, and we evaluate with the number of dimensions per cluster fixed to r = 100.

    No. users     2 million   6 million    12 million
    No. links     40 million  239 million  598 million
    No. clusters  18          52           104
    Timing        22.6 min    116.3 min    391.2 min
    Memory Usage  74 MB       546 MB       1,197 MB
    Table 3.6: Clustered Spectral Graph Embedding Computation Time and Memory Usage (0.6 million samples)

The memory usage and time required for creating the graph embedding D increase quadratically with the number of users in the dataset. This, however, is natural, as the size of the embedding D grows quadratically with c. Despite the fact that the size of D is inherently quadratic in c, by using clustering we can handle the largest dataset we generated (12 million users and 0.6 billion links) with only 1.2 GB of RAM. On the other hand, the regular spectral embedding is unable to even load U for the 6-million-user dataset.

3.5.4 Proximity Estimation

We first give an overview of our evaluation methodology, followed by evaluation results. While we have evaluated proximity estimation on all three datasets (i.e., Flickr, LiveJournal, and MySpace), in the interest of brevity we present most results for LiveJournal, since they are similar across the three networks.

3.5.4.1 Evaluation Methodology

Using the largest connected component. Computing rooted PageRank and escape probability requires a normalized adjacency matrix, as stated in Section 3.2. Accordingly, we utilize the largest connected component of the first snapshot of each social network. We observe that, for all the social networks, the largest connected component contains almost all the nodes and links: for Flickr, the largest connected component covers 93.9% of the total nodes and 99.47% of the total links. For MySpace and LiveJournal, the largest connected component is even bigger, covering 97.6% and 99.2% of the nodes, and 99.93% and 99.94% of the links, respectively.

Accuracy Metrics. We quantify the estimation error using two different metrics: (i) Normalized Absolute Error (NAE), defined as |est_i − actual_i| / mean_i(actual_i), and (ii) Relative Error, defined as |est_i − actual_i| / actual_i, where est_i and actual_i denote the estimated and actual values of the proximity measure for node pair i.

Sampling Methodology. Since it is expensive to compute the actual proximity measures over all the data points, we randomly sample 100,000 data points by first randomly selecting 200 rows from the proximity matrix and then selecting 500 elements from each of these rows. We then compute errors for these 100,000 data points.
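A minimal sketch of this sampling and error computation in Python/NumPy (the function names are illustrative; estimate and actual stand for the approximated and directly computed proximity values of the sampled pairs):

    import numpy as np

    def sample_pairs(m, n_rows=200, n_cols=500, seed=0):
        # 200 random rows x 500 random elements per row = 100,000 sample points
        rng = np.random.default_rng(seed)
        rows = rng.choice(m, size=n_rows, replace=False)
        return [(i, j) for i in rows
                       for j in rng.choice(m, size=n_cols, replace=False)]

    def errors(estimate, actual):
        nae = np.abs(estimate - actual) / actual.mean()  # normalized absolute error
        rel = np.abs(estimate - actual) / actual         # relative error
        return nae, rel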

Parameter settings. For the Katz measure, we use damping factor β = 0.0005 and kmax = 6 unless otherwise specified. For rooted PageRank, we use damping factor β_RPR = 0.85 and kmax = 20. Refer to [66] for an analysis assessing the impact of these parameters on the accuracy.

43 3.5.4.2 Estimating Proximity Metrics

Katz. Figure 3.4(a) plots the CDF of normalized absolute errors for the Katz measure, computed using the clustered and regular adjacency matrices. In all three networks, while clustering improves the accuracy of the Katz measure, the improvement is not significant.

The likely reason for this trend is as follows: the truncated Katz that we consider here takes into account links that are up to six hops away. Many node pairs that are more hops apart are likely to be in different clusters, as the number of connections increases with hop count. Hence, the clustered computation of the Katz measure is limited because the proximity of node pairs across different clusters is more difficult to approximate than that of pairs within clusters. However, we find that such inter-cluster links make up only a fraction of the total links (for LiveJournal, for example, 71% of the links are intra-cluster), and the importance of the inter-cluster links is damped if β is smaller than 1/‖λ_1‖, where λ_1 is the largest eigenvalue of A. This is also supported by the plot, since the clustered Katz computation is no worse than the regular computation of the Katz measure.

Rooted PageRank. Figure 3.4(b) plots the CDF of normalized absolute errors for rooted PageRank, using both the clustered and regular adjacency matrices. We observe that clustering improves the accuracy of rooted PageRank immensely. The normalized absolute error plot for LiveJournal shows that, on average, the normalized absolute error is reduced by more than half when the clustered embedding is used. This improvement should be analyzed against the background that the normalized adjacency matrix T, on which rooted PageRank is calculated, has a much higher intrinsic dimensionality than the adjacency matrix A. Hence, clustering improves the accuracy for rooted PageRank because having 100 dimensions per cluster allows a much greater fraction of the variance to be captured, compared to having 100 dimensions for approximating the entire matrix.

[Figure 3.4: CDF of Normalized Absolute Errors for (a) the Katz measure (aKZ vs. aKZ-c), (b) rooted PageRank (aRPR vs. aRPR-c), and (c) escape probability (aEP vs. aEP-c).]

Escape Probability. Figure 3.4(c) shows the CDFs of normalized absolute errors for escape probability, using both the clustered and regular adjacency matrices. Again, in all three networks, using the clustered embedding yields a significant increase in accuracy. This improvement is expected because escape probability is a linear combination of multiple rooted PageRank values, as shown in Section 3.2, and clustering improves the accuracy of rooted PageRank.

Relative errors. Figure 3.5 further plots the CDF of relative errors for Katz, rooted PageRank, and escape probability. We take the top 1% and top 5% of the randomly selected data points and generate the CDF for each of the selections. In all datasets, we observe that the relative errors are smaller for elements with larger values. This is desirable because larger elements play a more important role in many applications and thus need to be estimated correctly. For the Katz measure, which is computed on A, the difference between the regular and clustered embeddings is subtle in all three networks. However, for rooted PageRank and escape probability, which are based on T, there is a significant improvement from using the additional dimensions provided by clustering. In contrast, almost all values estimated by the regular embedding have a relative error close to 1.

Summary. To summarize, our proximity estimation evaluation shows that clustered spectral graph embedding is not only effective in approximating proximity measures on A, but also performs well on intrinsically higher-dimensional matrices such as the normalized matrix T. Compared with regular spectral graph embedding, clustered spectral graph embedding can accurately approximate rooted PageRank and escape probability, even though these metrics are known to be difficult to approximate, even with the proximity embedding [66].

[Figure 3.5: CDF of relative errors for different metrics on LiveJournal: (a) Katz measure, (b) rooted PageRank, (c) escape probability; curves for the regular and clustered embeddings over the top 1% and top 5% of data points.]

3.6 Related Work

Analysis of Online Social Networks. The swelling number of users of online social networks such as Facebook [20] and MySpace [50,74] has brought them sharply into the focus of researchers. In particular, there have been in-depth measurement studies such as [2,48] that describe the topological structure of online social networks. Other work motivates leveraging social networks for a variety of diverse tasks such as Sybil attack detection [81] and viral marketing [28].

Spectral Embedding. Spectral decomposition, or spectral embedding, provided by the spectral theorem [67] is central to our approach. Eigendecomposition seeks to decompose a square matrix into a product of eigenvector matrices and a diagonal eigenvalue matrix. There is a large body of research on finding computationally efficient ways of eigendecomposing a matrix [32]; [25] provides a numerically stable and fast method for the eigendecomposition of matrices. A variety of software packages support eigendecomposition: MATLAB provides eigendecomposition functions, and there are optimized packages such as PROPACK [36] for MATLAB. ARPACK [37] is a Fortran package for large eigendecomposition problems.

Proximity measures. Over the years, many proximity measures have been proposed for studying relationships between nodes in a social graph (e.g., [1,31,33,39,40,49,60,63,72]). [31] proposed the Katz measure considered in this paper, while [39,40] propose rooted PageRank, calculated on the normalized adjacency matrix. [72] proposes escape probability as a useful measure of direction-aware proximity. Scalability is not addressed in any of these works, since they all consider networks with only tens of thousands of nodes. [66] first addresses scalability by providing two methods for computing low-rank approximations for a large family of proximity measures.

Link prediction. Link prediction for social networks was first defined in [39,40], which predicts new links as those with the highest proximity scores, on co-authorship networks with a few thousand nodes. A decision-tree-based learner has been used with multiple proximity measures in [66] for predicting links in social networks. [35] uses supervised learning methods for link prediction on diverse networks, such as hyperlink and citation graphs.

Chapter 4

Supervised Link Prediction

4.1 Introduction

A social network [77] is a social structure modeled as a graph, where nodes represent people or other entities embedded in a social context, and edges represent specific types of interdependency among entities, e.g., values, visions, ideas, financial exchange, friendship, kinship, dislike, conflict, or trade. Understanding the nature and evolution of social networks has important applications in a number of fields such as sociology, anthropology, biology, economics, information science, and computer science.

Link prediction refers to the task of predicting the edges that will be added to a social network in the future, based on past snapshots of the network. As shown in [39,40], proximity measures lie right at the heart of link prediction. Understanding which proximity measures lead to the most accurate link predictions provides valuable insights into the nature of social networks and can serve as the basis for comparing various network evolution models (e.g., [5,19,29,52]). Accurate link prediction also allows online social networks to automatically make high-quality recommendations on potential new friends, making it much easier for individual users to expand their social neighborhood.

Challenges. There are several challenges to enable accurate and efficient link prediction for massive online social networks:

1. Scalability to the size of the network. The explosive growth of online social networks creates unprecedented research opportunities in the field of computer networks and beyond. For example, MySpace has over 400 million user accounts [51], and Facebook reportedly has over 120 million active users worldwide [21]. As a result, many link prediction methods that are highly effective in relatively small social networks become computationally prohibitive in large online social networks with millions of nodes [63].

2. Highly dynamic networks. Online social networks are often highly dynamic, with hundreds of thousands of new nodes and millions of edges added daily. In such fast-evolving social networks, it is challenging to predict even the near future of the social graph structure.

3. Ability to adapt to diverse networks. The effectiveness of different proximity measures varies significantly across different networks, and no existing proximity measure performs consistently well across different social networks for the link prediction problem (as shown in [66]). In addition, proximity measures are often sensitive to the choice of control parameters, which are difficult to tune for different social networks. Therefore, a link prediction technique is required to adapt to the diverse nature of different network structures.

Contributions. We develop novel supervised link prediction methods, called spectral learning, clustered spectral learning, polynomial learning, and clustered polynomial learning, based on the essential clustering and spectral structure of the underlying social graph captured by the clustered spectral graph embedding technique. Supervision enables us to derive new proximity measures that are more effective for a specific social network application: link prediction.

By learning from the differences between past snapshots of the network structure, our supervised link prediction technique eliminates the need to choose the right proximity measure and to guess the optimal parameter configuration a priori for the link prediction problem.

Our results show that the new supervision-based link prediction techniques can yield up to 20% improvement in accuracy. We experimentally demonstrate the effectiveness of our approach using three large real-world social network datasets: Flickr, MySpace, and LiveJournal.

4.2 Our Approach

In this section, we explain the basic problem setup and describe our novel supervised learning metrics: (1) spectral learning, (2) clustered spectral learning, (3) polynomial learning, and (4) clustered polynomial learning. Finally, we describe supervision-based link predictors based on the proposed learning metrics.

4.2.1 Problem Setup

Learning from Snapshots. We introduce time steps t_1, . . . , t_K at which a “snapshot” G(t_k) = (V_k, E_k) of the graph is taken. Denote the corresponding adjacency matrices by A(k). We restrict ourselves to using V_1 for all time steps. We express the adjacency matrix as A(k+1) = A(k) + ∆_k, where ∆_k contains the edges or links that are formed between times t_k and t_{k+1}. Associated with each time step t_k there is a positive set P_k (i.e., node pairs whose links are created) and a negative set N_k (i.e., node pairs that remain unlinked), where

P_k = {i, j ∈ V | ∆_k(i, j) ≠ 0},  (4.1)

N_k = {i, j ∈ V | (i, j) ∉ E_k and (i, j) ∉ E_{k+1}}.  (4.2)

In contrast to unsupervised link prediction methods, such as Katz and rooted PageRank, we use the different snapshots to train the parameters of our models.

Objective. The objectives for the supervised link prediction methods are of the following general form:

min_{x_k} ‖(Ū_k F_k(x_k) Ū_k^T − ∆_k) ∘ W_{S_k}‖,  (4.3)

where Ū_k is an orthonormal matrix from the graph embedding, F_k(x_k) is a matrix depending on the model parameters x_k (in the simplest case, it can be a diagonal matrix of eigenvalues), ∆_k is the “target” of the model (for example, the set of node pairs with added links can be a target), W_{S_k} is a weight matrix such that W_{S_k}(i, j) = 1 if (i, j) is in the sample set, and finally, the symbol ∘ denotes the Hadamard or elementwise product between matrices. All quantities are associated with time step t_k.

Assuming the matrix F_k(x_k) is linear in x_k, equation (4.3) can be transformed into

min_{x_k} ‖M_k x_k − b_k‖,  (4.4)

where x_k ∈ R^d, M_k is an m_S × d matrix with m_S = |S_k|, and b_k is an m_S-dimensional vector containing the target values of interest. The transformed problem (4.4) only contains the parts of (4.3) for which W_S(i, j) = 1.

Sample set. The ideal choice for the sample set is to include all the positive cases (i.e., the set of node pairs with added links) and all the negative cases (i.e., the set of node pairs that remain unconnected): S_k = P_k ∪ N_k. Note that E_k, containing the known links at time t_k, is excluded from the model fitting objective. Unfortunately, for social networks with many millions of nodes, the choice S_k = P_k ∪ N_k would yield a sample set of the order |V|^2, which is practically infeasible to process. To make the problem manageable, we choose S_k to contain a subset of P_k and a subset of N_k. |S_k| should not only be large enough to capture the essence of the model, but also of manageable size. In our experiments we have |V| ≈ 2 · 10^6 and we choose |S_k| ≈ 5 · 10^5 (see Section 5.6 for details).

Introduce a vector r = (r_1, . . . , r_{m_S}) of row indices and a vector c = (c_1, . . . , c_{m_S}) of column indices such that W_S(r_i, c_i) = 1 for i = 1, . . . , m_S. For an m × n matrix U, introduce U_r to denote the m_S × n matrix with entries U_r(i, j) = U(r(i), j) for i = 1, . . . , m_S and j = 1, . . . , n; U_c is analogous. In MATLAB notation, this is simply U_r = U(r, :) and U_c = U(c, :). Using the vectors r and c, we can write the entries of b_k in Equation (4.4) as b_k(i) = ∆_k(r_i, c_i). In the following, we will drop the time step subscript k for clarity.

4.2.2 Supervised Learning Metric

We have two methods for approximating A, and we propose two parameter matrices for modeling our objective function. By combining these, we obtain four supervised learning metrics.

Approximation of A. In Sections 2 and 3, we have presented two different ways of obtaining orthonormal matrices that aid in the approximation of A:

1. The best rank-n approximation UΛU^T ≈ A using the regular embedding.

2. The clustered low-rank approximation VDV^T ≈ A, where V = diag(V_1, . . . , V_c) and the dense matrix D is computed by (3.20). Assuming each cluster is approximated with rank n and there are c clusters, the rank of the approximation becomes cn.

Parameter Matrices. For each of the two cases above, we have studied two different parameter matrices:

1. F(x) = diag(x), a parameter matrix that can take any diagonal form.

2. F(x) = Σ_{k=1}^p x_k D^k, a polynomial matrix based on the eigenvalue matrix D.

Therefore, we have four combinations: spectral learning, clustered spectral learning, polynomial learning, and clustered polynomial learning. We propose the following four new metrics deduced from the matrix M in equation (4.4).

Spectral learning. Let UΛU^T ≈ A be a spectral approximation and F(x) = diag(x); then

M_sl = U_c ∘ U_r.

This problem is preferably solved using the QR factorization of U_c ∘ U_r and an SVD of the triangular part to deal with potential ill-conditioning [27]. Note that the least squares problem (4.3) without the sampling is perfectly well conditioned (since U_k is orthonormal), but our experience revealed that the sampled problem is very ill-conditioned and needs to be dealt with actively.
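A minimal sketch of the spectral learning fit in Python/NumPy; here numpy's SVD-based least squares solver stands in for the QR+SVD procedure described above, and the names are illustrative:

    import numpy as np

    def fit_spectral_learning(U, rows, cols, b):
        # Row i of M_sl is the Hadamard product of rows rows[i] and cols[i] of U.
        M = U[rows] * U[cols]
        # An SVD-based solver copes with the ill-conditioning introduced by sampling.
        x, *_ = np.linalg.lstsq(M, b, rcond=None)
        return x   # F(x) = diag(x)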

Clustered spectral learning. Let A ≈ VDV^T be the clustered eigen approximation, and introduce the full eigendecomposition D = QΛQ^T, since D_k is dense. With F(x) = diag(x), then

M_csl = (V_c Q) ∘ (V_r Q).

An important difference from the regular spectral learning is the size of the least squares equations. It is no longer feasible to compute a QR factorization, since V_c is an m_S × cn matrix (compared to the m_S × n matrix U_c). Since V has a block-diagonal structure, V_c will have a structure as well. It is very important to note that computing V_c Q will destroy the block-diagonal structure, and the result cannot be loaded into memory. To solve the least squares problem for the clustered spectral learning without explicitly computing the factor matrices, we directly form the corresponding normal equations

min_x ‖M_csl^T M_csl x − M_csl^T b_k‖,

using only V_k and Q_k.
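A sketch of accumulating these normal equations in chunks, without ever materializing V Q, in Python/NumPy. The data layout is an assumption: Vblocks[i] is the m_i × r eigenvector block of cluster i, and cluster_of/local_index map a global node id to its cluster and to its row within that cluster's block; all names are illustrative:

    import numpy as np

    def fit_clustered_spectral(Vblocks, Q, cluster_of, local_index,
                               rows, cols, b, chunk=10000):
        r = Vblocks[0].shape[1]
        d = Q.shape[1]                 # d = c * r
        G = np.zeros((d, d))           # accumulates M^T M
        g = np.zeros(d)                # accumulates M^T b
        def vq_rows(nodes):
            # row u of V @ Q touches only the Q block of u's cluster,
            # since V is block diagonal
            out = np.empty((len(nodes), d))
            for t, u in enumerate(nodes):
                ci, li = cluster_of[u], local_index[u]
                out[t] = Vblocks[ci][li] @ Q[ci*r:(ci+1)*r]
            return out
        for s in range(0, len(rows), chunk):
            M = vq_rows(rows[s:s+chunk]) * vq_rows(cols[s:s+chunk])
            G += M.T @ M
            g += M.T @ b[s:s+chunk]
        return np.linalg.solve(G, g)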

Polynomial learning. Polynomial learning is closely related to the Katz measure. Considering the truncated approximation of the Katz measure (3.3) and replacing the single parameter β with a different α_i for each power term, we obtain a polynomial of degree p. Given the eigen approximation A ≈ UΛU^T and F(x) = Σ_{k=1}^p x_k Λ^k, the coefficient matrix becomes

M_pl = (U_c ∘ U_r)[l_1 l_2 ··· l_p],

where l_i = diag(Λ^i), i = 1, . . . , p.
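A corresponding Python/NumPy sketch for the polynomial learning fit (illustrative names):

    import numpy as np

    def fit_polynomial_learning(U, lam, rows, cols, b, p):
        # Column i of [l_1 ... l_p] is diag(Lambda^i) = lam**i.
        L = np.column_stack([lam**i for i in range(1, p + 1)])   # n x p
        M = (U[rows] * U[cols]) @ L                              # m_S x p
        x, *_ = np.linalg.lstsq(M, b, rcond=None)
        return x   # polynomial coefficients x_1, ..., x_p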

Clustered polynomial learning. Using the clustered low-rank approximation from (3.19), A ≈ VDV^T, and the eigendecomposition D = QΛQ^T, we obtain the coefficient matrix

M_cpl = ((V_c Q) ∘ (V_r Q))[l_1 l_2 ··· l_p],

where again l_i = diag(Λ^i). In this case the matrix M_cpl can be formed explicitly, but without forming any of the factors it involves.

4.2.3 Training and Testing of Link Predictors

We train our model from t_k to t_{k+1} by targeting ∆_k (A(k+1) = A(k) + ∆_k). For testing, we use the trained model to make predictions of new links for the time step t_{k+2}. The predictions are made based on the corresponding factorizations of A(k+1). Let U_{k+1} Λ_{k+1} U_{k+1}^T ≈ A_{k+1} be the best low-rank approximation, x_sl be the solution to the spectral learning fitting problem, and L = diag(x_sl); then the predictor becomes

P_sl = U_{k+1} L U_{k+1}^T.

Similarly, with the parameters x_csl, x_pl, x_cpl for the other models, we obtain the predictors P_csl, P_pl, P_cpl, respectively. For example, the clustered polynomial predictor takes the form

P_cpl = V_{k+1} (Σ_{i=1}^p x_i D_{k+1}^i) V_{k+1}^T,

where V_{k+1} D_{k+1} V_{k+1}^T ≈ A_{k+1} is the clustered low-rank approximation and x_cpl = (x_1, . . . , x_p). Since the dimension of the predictors is in the order of millions, they cannot be explicitly formed. As in the learning step, the predictions are computed and evaluated using a new sample set, S_{k+1}. Note that the model parameters x_sl, x_csl, x_pl, x_cpl are of different dimensions: the size of the spectral learning parameters equals the number of columns in Ū, whereas the size of the polynomial parameters is the degree of the polynomial, which is typically less than 10.
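Once the parameters are fit, the predictor entries on the new sample set can be evaluated pair by pair without ever forming the full predictor (a Python/NumPy sketch with illustrative names):

    import numpy as np

    def sl_scores(U_next, x_sl, rows, cols):
        # Entries P_sl(a, b) = sum_k U_{k+1}[a, k] * x_sl[k] * U_{k+1}[b, k],
        # evaluated only at the sampled pairs.
        return np.einsum('ik,k,ik->i', U_next[rows], x_sl, U_next[cols])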

4.2.4 Alignment

When training a model, we use the eigenspace U_k or V_k, both of which are obtained from A_k. After fitting the model parameters to the least squares objective, we make predictions using U_{k+1} or V_{k+1}. U_k is different from U_{k+1}, and similarly V_k is different from V_{k+1}. The eigenspaces U_k, U_{k+1} and the clustered eigenspaces V_k, V_{k+1} are close to each other, since the number of edges in ∆_k is small compared to the number of edges in A(k), and A(k+1) = A(k) + ∆_k. But the difference in predictions may be significant. To address this issue, we re-align the eigenvectors in U_{k+1} to better fit the eigenvectors in U_k. One method is to insert the product U_{k+1}^T U_k:

P_sl = U_{k+1} (U_{k+1}^T U_k) L (U_{k+1}^T U_k)^T U_{k+1}^T,

and similarly for the other methods. Experiments have shown that this step gives a substantial improvement over the previous method without alignment.
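A sketch of the aligned spectral predictor in Python/NumPy (illustrative names): the small n × n product U_{k+1}^T U_k is formed once, and only the sampled entries of P_sl are ever computed:

    import numpy as np

    def aligned_sl_scores(U_next, U_prev, x_sl, rows, cols):
        # P_sl = U_{k+1} R L R^T U_{k+1}^T with R = U_{k+1}^T U_k, L = diag(x_sl)
        R = U_next.T @ U_prev        # small n x n alignment matrix
        W = U_next @ (R * x_sl)      # = U_{k+1} R L  (diagonal L scales columns)
        B = U_next @ R               # = U_{k+1} R
        return np.einsum('ij,ij->i', W[rows], B[cols])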

4.3 Link Prediction Evaluation

We measure the accuracy of link prediction for various predictors; our spectral learning metric shows the best performance overall.

4.3.1 Dataset Description

We use three real large online social networks with millions of nodes: Flickr [23], LiveJournal [43] and MySpace [50]. For detailed statistics, please see Section 3.5.1.

[Figure 4.1: Shortest path distance between user pairs that are disconnected in snapshot 1 and become connected in snapshot 2. Y-axis: percentage of node pairs; one group of bars per network (Flickr, LiveJournal, MySpace), broken down by hop distance (2, 3, 4, and more than 4 hops).]

Figure 4.1 plots the shortest hop distance between user pairs in S1 that have become connected in S2. More than 70% of new links are created between user pairs that are two hops away, whereas the fraction of pairs that are 4 or more hops away is very small. Later, in Section 4.3.4, we evaluate our approach on two-hop user pairs.

4.3.2 Evaluation Methodology

Metrics. The accuracy of link prediction is measured by the false negative rate (FNR) and the false positive rate (FPR):

FNR = (# of missed friend links) / (# of new friend links),
FPR = (# of incorrectly predicted friend links) / (# of non-friend pairs).

Note that the denominator of FPR is the number of non-friend pairs (i.e., the number of user pairs not in a friend relationship), which is basically almost all user pairs (2 · 10^6 × 2 · 10^6 − 90 · 10^6 ≈ 10^12 for MySpace). Therefore, even an extremely small false positive rate translates into a very large false positive count. To illustrate this behavior, we plot the FPR on the x-axis in log scale and the FNR on the y-axis in linear scale in the accuracy evaluation.
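For instance, given predicted scores for the positive samples (new links) and negative samples (non-friend pairs), the two rates at a given score threshold can be computed as follows (a Python/NumPy sketch; sweeping the threshold traces out curves like those in Figures 4.2–4.4):

    import numpy as np

    def fnr_fpr(pos_scores, neg_scores, threshold):
        fnr = np.mean(pos_scores < threshold)    # missed new friend links
        fpr = np.mean(neg_scores >= threshold)   # incorrectly predicted links
        return fnr, fpr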

Training and testing sets. We have three snapshots of the friendship network (A(0), A(1), and A(2)) for each social network, as described in Table 3.2. We use A(0)–A(1) as the training set and A(1)–A(2) as the testing set. Since our datasets are very sparse, we use randomly selected subsets of positive user pairs as the positive sample set and of negative user pairs as the negative sample set. We have two scenarios for selecting positive and negative user pairs.

                                                  Flickr    LiveJournal  MySpace
    Proximity Embedding Preparation               7.1 hour  8.7 hour     8.9 hour
    Spectral Embedding Preparation                6.3 hour  6.1 hour     7.5 hour
    Clustered Spectral Embedding Preparation      48.7 min  44.5 min     60.6 min
    Clustered Polynomial Learning Training        17.9 min  15.2 min     17.6 min
    Clustered Spectral Learning Training          43.3 min  35.2 min     36.5 min
    Table 4.1: Preparation Time of Graph Embedding Algorithms (r = 100)

• All: we select 100,000 positive user pairs from all positive cases (i.e., those who become friends) and 500,000 negative user pairs from all negative cases (i.e., unrelated users).

• 2-hop: we consider only user pairs connected by two hops. Due to the sparsity of our datasets, it is a practical scenario to reduce the search space by focusing on user pairs that already have a close relationship (but are not yet direct friends).

Link predictors. We have common neighbor (CN) and the Katz measure (Katz) as the base measures, and we compute approximated values of these two metrics using spectral graph embedding (denoted as aCN, aKatz) and clustered spectral graph embedding (aCNc, aKatzc). Finally, we have the spectral learning metric (SL) with the clustered spectral graph embedding method (SLc). We also evaluated the polynomial learning metric, which performs second best after SLc: it shows better performance than the existing metrics, but is not as good as SLc (see Section 3.4 for details).

4.3.3 Scalability Evaluation

Timing Micro-benchmarks. Here, we present the scalability of the supervised metrics by comparing their training and testing (i.e., query) times with the preparation and query times of the other graph embedding methods. Table 4.1 shows the training time of clustered polynomial learning and clustered spectral learning, compared to the preparation time of the three embedding techniques, for our datasets of up to two million users and a hundred million links. Note that our supervised learning metrics require the clustered spectral embedding information for training and testing. Even after adding the preparation time of the clustered spectral embedding, the total training time is up to 6 times faster than the spectral embedding and 8 times faster than the proximity embedding in the case of the LiveJournal dataset.

                                           Flickr    LiveJournal  MySpace
    Direct Method                CN        15.9 ms   23.5 ms      24.5 ms
                                 Katz      8,040 ms  14,790 ms    16,655 ms
    Proximity Embedding          aKatzp    0.051 ms  0.045 ms     0.076 ms
    Spectral Embedding           aCN       0.042 ms  0.038 ms     0.040 ms
                                 aKatz     0.045 ms  0.036 ms     0.036 ms
    Clustered Spectral Embed.    aCNc      2.72 ms   5.45 ms      2.60 ms
                                 aKatzc    2.76 ms   2.03 ms      2.65 ms
    Clustered Poly. Learning     PLc       3.06 ms   1.27 ms      1.91 ms
    Clustered Spectral Learning  SLc       5.59 ms   4.57 ms      4.73 ms
    Table 4.2: Query Time of Proximity Estimation Algorithms (r = 100, 0.6 million samples)

Table 4.2 shows the query time of the supervised metrics, clustered polynomial learning (denoted as PLc) and clustered spectral learning (denoted as SLc), compared with the query times of other existing proximity estimation techniques for 0.6 million samples. The proximity metrics based on proximity embedding (aKatzp) and spectral embedding (aCN, aKatz) perform the best. Clustered polynomial learning and clustered spectral learning show performance similar to the clustered spectral embedding based proximity measures, and outperform the direct methods. We can see that once the preparation stage is over, the query time for the supervised metrics (SLc and PLc) does not add any significant overhead over clustered spectral embedding.

4.3.4 Accuracy Evaluation

Regular vs. Clustering. Figure 4.2 compares the link prediction accuracy of the approximation methods when using the spectral graph embedding (SGE) and the clustered spectral graph embedding (CSGE). The clustered approximation methods perform better than the regular versions. Katz with CSGE (aKatzc) shows better accuracy than the regular version for the Flickr and LiveJournal datasets, and similar performance on MySpace. For common neighbor, the clustered version has higher accuracy on LiveJournal and MySpace.

Spectral Learning metric. Figure 4.3 shows the link prediction accuracy of our proposed method, spectral learning (SLc). On Flickr, SLc shows the best performance, followed by aKatzc. For LiveJournal, SLc outperforms aKatzc in the low-FPR region; there is no significant difference on MySpace.

Two-hop user pairs. Figure 4.4 shows the accuracy of the link predictors when we evaluate them with only 2-hop user pairs, for both positive and negative samples. Our spectral learning metric outperforms the other metrics by up to 20% on the Flickr dataset and 10% on the LiveJournal dataset. For MySpace, the performance of aKatzc and SLc is better than the common neighbor based metrics by 4%.

Note that there is no significant difference among the predictors on the MySpace dataset. Each online social network has its own characteristics and structure. Also, the MySpace data collection, which takes the first 6 million users, may cause such behavior. Since the first users are relatively old users, it is possible that they are less active in creating new relationships with other users. The number of added links between two snapshots is indeed very small compared to the number of existing links (shown in Table 3.2).

Summary. Overall, the spectral learning metric (denoted as SLc) shows the best performance, followed by the Katz measure with the clustered spectral graph embedding (aKatzc). In particular, in the two-hop scenario, SLc outperforms the other metrics by up to 20% on Flickr and 10% on the LiveJournal dataset. This clearly demonstrates the benefit of supervised link predictors.

[Figure 4.2: Link prediction accuracy with clustering: false negative rate vs. false positive rate (log scale) for aKatz-c, aKatz, aCN-c, and aCN on (a) Flickr, (b) LiveJournal, (c) MySpace.]

[Figure 4.3: Link prediction accuracy of the spectral learning metric: false negative rate vs. false positive rate (log scale) for SL-c, aKatz-c, and aCN-c on (a) Flickr, (b) LiveJournal, (c) MySpace.]

[Figure 4.4: Link prediction accuracy in the 2-hop scenario: false negative rate vs. false positive rate (log scale) for SL-c, aKatz-c, aCN-c, and CN on (a) Flickr, (b) LiveJournal, (c) MySpace.]

Chapter 5

MAD: Multicast with Adaptive Dual-state

5.1 Introduction

Multicast is an approach that uses network and server resources efficiently to support multipoint communication. Despite its clear performance benefit, multicast has not seen wide deployment over the past two decades. Some of the past barriers to widespread use have been the lack of support by Internet service providers and (possibly as a consequence) a lack of application demand for multicast.

Recently, however, multicast is seeing a resurgence. As the Internet evolves to become information-centric, network services increasingly demand scalable and efficient dissemination of information from a multitude of distributed information producers to large groups of interested information consumers. These information-centric services are growing rapidly in use and deployment. For example, IPTV services that use IP multicast as the underlying distribution technology are being deployed by multiple carriers. These services are gaining rapid adoption, with the number of subscribers and revenues growing rapidly [53]. Multiplayer online games (MMORPGs) are reportedly seeing 30–100% annual subscription growth [11]. Other common examples of deployed services that are information-centric include: file sharing, software updates, RSS dissemination, video conferencing, online markets, video-on-demand, and grid computing. All these services require the capability of large-scale information dissemination and can therefore benefit significantly from the communication efficiency of multicast delivery.

[Figure 5.1: YouTube channel characteristics. (a) Changes in channel subscription count: fraction of channels vs. average daily change (%). (b) Channel lifetime in the top 100: fraction of channels vs. number of days, for the daily and weekly lists.]

[Figure 5.2: Publishing characteristics of RSS feeds. (a) Publishing rate (# updates/month) vs. fraction of publishers. (b) CDF of feed updates vs. fraction of publishers.]

5.1.1 Requirements of Information-centric Network Services

Information-centric network services create not only new opportunities but also significant new challenges for multicast. These services exhibit several key characteristics:

Vast number of groups. Given the increasing amount of electronic content and the need to ensure that only relevant information is disseminated, multicast will need to manage an increasing number of fine-granularity groups, with a distinct group for each piece of distributable content. For example, eBay lists over ten million new items every day [46], each of which can be a potential group. As a result, the number of groups that the underlying multicast architecture can support will need to increase significantly from what we typically see with IP multicast in the underlay, or with overlay multicast.

Long-lived group membership. As the network evolves to support models of information dissemination (such as publish/subscribe), membership is likely to be long-lived. Users tend to subscribe but do not unsubscribe, and continue to be interested in receiving information sent infrequently by publishers. As an example, we analyzed the average daily changes in the subscription counts of 754 YouTube [79] channels after they become inactive (i.e., stop appearing in any popular channel list). Figure 5.1(a) shows that only 2.3% of the channels experience a decrease in their subscription counts. Long-lived membership can significantly increase the state that has to be maintained in routers (which may have limited table space). Long-lived membership can also gradually increase the group size over time, resulting in higher control overhead.

Wide range of activity levels across groups. Group activity tends to exhibit a skewed distribution. That is, most groups generate relatively infrequent and/or small amounts of data traffic, yet a small fraction (e.g., 20%) of active groups accounts for the vast majority (e.g., 80%) of the data traffic. To illustrate this 80-20 rule, we analyze the publishing activity of RSS feeds using data from [42]. Figure 5.2(a) shows that only 5% of the RSS feeds publish more than 100 updates/month, and the median update rate is below 10 updates/month. Figure 5.2(b) shows that the 10% most active RSS feeds contribute 75% of the total feed updates. Note that the 80-20 rule is also observed in many other network applications, e.g., subscription counts of RSS feeds [42], view counts of video clips [12], incoming link counts of Web pages [13], and file access frequencies of online streaming servers [16].

Dynamic activity level within a group. The activity level within a group tends to vary over time. Some new groups become active quickly, whereas other groups become dormant after their peak. To illustrate such dynamic behavior, we measure how long a channel stays in the top-100 popular channel lists on YouTube. Figure 5.1(b) shows that 78% of channels disappear from the daily top-100 list just 2 days after their appearance. Similarly, 80% of the channels disappear from the weekly top-100 list after 4–5 days.

To effectively support such information-centric network services, we seek to design a multicast infrastructure that can provide multicast services at massive scale (with a billion users, hundreds of billions of groups, and long-lived group membership), while being efficient across a range of group sizes and diverse, time-varying activity levels. We want to realize this goal using today's commercial hardware.

5.1.2 MAD Approach and Contributions

In this paper, we describe Multicast with Adaptive Dual-state (MAD), a novel architecture that can scalably support a vast number of multicast groups with diverse, time-varying activity, in an efficient and transparent manner on today’s commercial hardware. MAD has the following key features.

1. MAD provides persistence in group membership by explicitly decoupling the membership state from the forwarding state. MAD uses a distributed state management approach to efficiently store group membership state at very large scale. For each group, a small number of routers form a membership tree (rooted at a core router) to maintain the membership state.

2. MAD achieves both efficiency in data forwarding and scalability in the number of groups by treating active groups and inactive groups differently to optimize for different performance objectives. Specifically, messages to an active group are handled using any existing multicast protocol (for maximizing forwarding efficiency), whereas messages to an inactive group are forwarded along the membership tree (for minimizing state requirements and control overhead). Our specific instantiation of MAD uses the Core Based Tree (CBT) [4] (or a shared tree using PIM-SM [22]) for active groups due to its known efficiency and scalability. We refer to the CBT of an active group as the dissemination tree, in contrast to the membership tree.

3. MAD provides transparency in the presence of dynamic changes in group activity level. Since group activity can change drastically over time, MAD provides seamless transition mechanisms to promote inactive groups to active groups and vice versa, without any end-system participation. Messages are forwarded along the dissemination tree or the membership tree based on the activity level.

4. MAD is designed to be modular so that we can easily replace its components to take advantage of alternative multicast or information dissemination capabilities when they become available. Specifically, our current instantiation of MAD begins as an overlay multicast service. However, when multicast capability is available in the underlay, MAD can directly leverage it to further improve forwarding efficiency. For example, messages sent to an active group can be directly handled by IP multicast for better forwarding efficiency. Messages sent to an inactive group are forwarded along the membership tree; in doing so, stateless multicast protocols like Xcast [8] can be used to avoid generating redundant overlay unicast messages. Finally, it is possible to use existing peer-to-peer technology to disseminate content to active groups. For example, we can easily implement an active group as a “torrent” using BitTorrent [7].


We demonstrate through analysis, simulation and implementation that MAD can support hundreds of billions of groups with hardware that is representative of today’s commercial platforms. At the same time, MAD achieves high performance and efficiency, while exploiting the wide range of activity likely to be seen across multicast groups.
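To illustrate the dual-state behavior, here is a hypothetical sketch (in Python) of per-group decision logic at a MAD router. The thresholds and all names are assumptions made for illustration, not values or interfaces from the MAD implementation:

    PROMOTE_RATE = 10.0   # msgs/sec above which a group is promoted (assumed)
    DEMOTE_IDLE = 300.0   # seconds of silence before demotion (assumed)

    class MadGroup:
        def __init__(self, gid):
            self.gid = gid
            self.active = False   # dual state: active vs. inactive
            self.last_msg = None
            self.rate = 0.0       # smoothed message rate

        def on_message(self, now, send_dissemination, send_membership):
            if self.last_msg is not None:
                dt = max(now - self.last_msg, 1e-6)
                self.rate = 0.9 * self.rate + 0.1 / dt   # activity estimate
            self.last_msg = now
            if not self.active and self.rate > PROMOTE_RATE:
                self.active = True            # promote: set up dissemination tree
            if self.active:
                send_dissemination(self.gid)  # e.g., CBT / IP multicast
            else:
                send_membership(self.gid)     # hop-by-hop along membership tree

        def maybe_demote(self, now):
            idle = self.last_msg is not None and now - self.last_msg > DEMOTE_IDLE
            if self.active and idle:
                self.active = False  # demote: tear down forwarding state;
                                     # the membership tree persists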

5.2 Related Work and Limitations

Twenty years of networking research has produced many efficient mechanisms for multicast communication (e.g., [4,8–10,18,22,59,70]). However, these current multicast approaches implicitly couple group membership state (i.e., which end hosts are members of a group) with forwarding state (i.e., how to reach those group members), and use a common approach for multicast distribution for all groups, whether small or large, active or inactive.

IP multicast-style approaches: IP multicast has focused on efficient forwarding of information to a large active group of recipients, with the goal of efficient lookup for forwarding. IP multicast-style approaches (at the network layer [4,22] or at the application layer with “overlay multicast” [9,10]) try to keep a relatively small amount of state (a limited number of groups and the associated interfaces downstream with recipients for the group). However, this state is maintained (as soft state) at every router on the multicast tree of the group for efficient forwarding. State maintenance is thus expensive: it requires a large number of routers to store the state, and involves considerable control overhead (periodic refresh and pruning) to keep the state current and refreshed. This approach is inappropriate for our problem because:

• First, IP multicast-style approaches are appropriate for a relatively small number of groups. Building a cache hierarchy with SRAM and DRAM in routers for keeping state can partially relieve the memory requirements for large numbers of groups. With an increasing number of active groups, either the cache size has to be increased to maintain forwarding performance, or cache misses will degrade system performance [30]. Moreover, we would like the environment to support a range of router sizes, including small routers such as IP DSLAMs which may only be able to support a few thousand multicast groups. It is important for the architecture to scale to large numbers of groups even with such small routers in the network.

• Second, when groups are long-lived, but have little or no activity over long periods of time, maintaining the membership state in IP multicast-style approaches requires a lot of control overhead (relative to the activity) to keep it from being aged out. Otherwise, data sent out by sources to a relatively inactive group will likely be lost, which we believe is highly undesirable.

• Third, when ISPs choose to support non-active groups with unicast (due to the limitation on the number of groups in existing multicast schemes), converting groups from multicast to unicast and vice versa requires additional intervention: allocating a multicast group address and initiating joins for those who are members of a group that becomes active, and tearing down an existing multicast group for a group that becomes inactive.

Multicast forwarding state reduction: There exists a wide range of approaches for reducing multicast forwarding state.

Pruning routers with multicast forwarding state. REUNITE [70] and the hop-by-hop multicast routing protocol of [17] propose to keep multicast forwarding state only at “branching” routers, while non-branching routers use unicast routing to forward traffic and keep the multicast control table in the control plane. However, the number of branching routers can grow as the size of the group increases in the above schemes, increasing the multicast state. Also, multicast forwarding tables are stored as soft state, which entails high control overhead for groups with little data activity.

Aggregating multicast forwarding state. Aggregated Multicast [18] aggregates multiple base multicast groups into a single tree to achieve better scalability (at the expense of sending irrelevant content to a subset of the members). Such aggregation is complementary to our approach of separating the multicast state between forwarding and membership states. If desired, MAD can apply aggregation to further reduce multicast forwarding state.

Making multicast stateless. Stateless approaches eliminate the need for routers to maintain multicast forwarding state by storing such state in packet headers instead. For example, Explicit Multicast (Xcast) [8] encodes the list of destination nodes into every packet. Free Riding Multicast (FRM) [59] reduces routing state by caching all source-based tree edges of a group at the sources and embedding a Bloom filter (that encodes all the tree edges) into the message itself. These stateless approaches do not effectively meet the needs of large multicast groups, because they can result in excessively large packet headers. They can also reduce forwarding efficiency, because they require routers to parse the header of every packet at every hop to extract the forwarding state.
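To make the header-size concern concrete, here is a back-of-the-envelope sketch (the arithmetic is ours, not taken from the Xcast or FRM specifications):

    def xcast_header_bytes(group_size, addr_bytes=4):
        """An Xcast-style header carries the full destination list in every
        packet, so header size grows linearly with group size."""
        return group_size * addr_bytes

    # A 1,000-member group already needs ~4 KB of destination list per packet
    # with 4-byte (IPv4) addresses -- far beyond a typical 1,500-byte MTU.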

P2P content dissemination: P2P solutions have also been developed for large-scale information dissemination (e.g., BitTorrent [7]). Such solutions can achieve high forwarding efficiency in disseminating popular content. However, they are less effective in a publish-subscribe environment where the information sent by the publisher is infrequent but still critical, and has to reach all the subscribers in a timely manner. In such an environment, it would be difficult for a user to quickly find enough peers that can share such content. This would be of particular concern when peers have limited up-link bandwidth or are unreliable. Thus, peer-to-peer content dissemination does not fully meet our goal of disseminating information to a vast number of groups with persistent membership and diverse, dynamic activity levels.

Points of departure: Our design of MAD seeks improvements over these traditional approaches. In contrast to IP multicast-style approaches, we wish to minimize the amount of control overhead associated with keeping state up over a long time, especially when groups are inactive. However, for active groups, we wish to take advantage of the structures that IP multicast designs have adopted. Thus, MAD seeks the best of both worlds — forwarding efficiently (a la IP multicast) when information is frequently generated, while also enabling the membership of a group to scale to large numbers where the membership may be long-lived.

5.3 MAD Overview

MAD environment: The MAD multicast service overlay consists of logical overlay routers, which reside on (or are owned by a provider of) the physical overlay routers. User management functionality is concentrated in the external subscription manager, which maintains subscriptions for all users connecting at a given MAD site, and initiates or cancels group subscriptions with corresponding logical overlay routers on behalf of the end users. Thus, every logical overlay router serves a single aggregated local subscriber representing all users assigned to it by the subscription manager. From the perspective of a MAD overlay router, no knowledge of end users is required and the only entities an overlay router communicates with are other MAD overlay routers and its single aggregated local subscriber.

MAD example: To illustrate the key ideas behind MAD, we consider an example in Figure 5.3(a), which shows a set of routers that are part of a traditional IP multicast tree. This example is for a single multicast group, with a set of 5 users subscribed to the group. With IP multicast-style approaches (e.g., PIM-SM [22], CBT [4]), every intermediate router on the path from the root (A) to the first-hop routers that users are connected to has to maintain state for this group. In this example, there are 11 routers that have to maintain state.

MAD’s membership tree (MT) protocol significantly reduces state by limiting the number of routers that have to keep multicast group state. As shown in Figure 5.3(b), the membership tree consists of only four nodes: A, B, C and E. The membership tree itself is shown in Figure 5.3(c). The core (i.e., root) of the membership tree is selected to be A, based on the hash of the group ID. To limit its depth, the membership tree is constructed on top of a base tree rooted at the core. A base tree is a balanced k-ary tree that comprises all logical overlay routers. For each node, MAD defines a single base tree with this node as its root. All groups that take this node as the core then use this base tree as the basis to construct their membership trees.

The membership tree is constructed as follows. When a subscriber wants to join a group, it issues a join message, which gets forwarded towards the core along the base tree rooted at the core until it reaches the first node already on the membership tree. An en route logical overlay router joins the membership tree whenever the subtree rooted at this node in the base tree has at least a minimum number Smin (2 in this example) of first-hop routers with attached subscribers.

Each node on the membership tree keeps the following state: (i) mtChild – children of this node in the base tree that belong to the MT (encoded as a bit-vector), and (ii) mtFHs – a list of first-hop (FH) routers with attached subscribers that are downstream of this node in the base tree.
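The per-node state just described can be computed as in the following minimal sketch, given the current set of FH routers with subscribers. The recompute-from-scratch style, router numbering, and helper names are illustrative assumptions; a real implementation updates this state incrementally as join messages propagate toward the core.

    K = 2       # base-tree fan-out; kept small for readability (MAD uses k = 16)
    S_MIN = 2   # minimum FH routers in a subtree before a node joins the MT

    def children(n, L, k=K):
        """Children of router n in a balanced k-ary base tree over routers
        0..L-1, with the core relabeled as router 0."""
        return list(range(n * k + 1, min(n * k + 1 + k, L)))

    def subtree_fhs(n, fhs, L):
        """FH routers with attached subscribers in the base subtree rooted at n."""
        found = {n} & fhs
        for c in children(n, L):
            found |= subtree_fhs(c, fhs, L)
        return found

    def membership_tree(fhs, L):
        """Map each MT node to its (mtChild bit-vector, mtFHs) state."""
        on_mt = {0} | {n for n in range(L)
                       if len(subtree_fhs(n, fhs, L)) >= S_MIN}
        state = {}
        for n in on_mt:
            bits, covered = 0, set()
            for i, c in enumerate(children(n, L)):
                if c in on_mt:                    # an MT child stores its own FHs
                    bits |= 1 << i
                    covered |= subtree_fhs(c, fhs, L)
            state[n] = (bits, subtree_fhs(n, fhs, L) - covered)
        return state

    # With 15 routers and FH routers {7, 8, 11, 12}, only a handful of routers
    # keep state, and each FH ID is stored at exactly one MT node:
    # print(membership_tree({7, 8, 11, 12}, 15))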

In our example, when the first subscriber S1 (attached to M) joins, its join message is propagated to the core A and the core adds M to its mtFHs. Subsequently, when another subscriber S2 (attached to O) joins, A sees that the subtree rooted at C in the base tree now has at least 2 FH routers with subscribers (i.e., M and O). So A informs C to create membership state. C then updates its mtFHs to include M and O. Meanwhile, A sets a bit in its mtChild to indicate subscribers downstream of child C. After subscribers S3 (attached to J) and S4 (attached to K) join, routers B and E create membership state, respectively. Finally, after subscriber S5 (attached to I) joins, only 4 routers (A, B, C, and E) maintain membership state. Their final state is shown in Figure 5.3(d).

Thus, even in this limited topology, only 4 routers maintain membership state, compared with 11 such routers under IP multicast-style approaches.

On the other hand, IP multicast-style approaches have better forwarding efficiency. Specifically, messages delivered to a group can be forwarded either using the IP multicast tree in Figure 5.3(a) or along the membership tree in Figure 5.3(c). The former is clearly more efficient than the latter. To maximize overall efficiency, MAD uses the IP multicast-style dissemination tree to deliver messages for active groups, and membership tree based forwarding to deliver messages for inactive groups. MAD also provides seamless transition mechanisms to switch between the dissemination tree and the membership tree as the level of group activity varies over time. Although we have chosen to classify a group as being active based on the “absolute” activity level of individual groups, MAD may instead classify groups as being active based on their relative activity level, so that the top k% of the groups are classified as active (to be resilient in the presence of particular groups unnecessarily generating traffic in an attempt to receive better forwarding efficiency).

[Figure 5.3: Examples of MAD trees. (a) IP multicast tree; (b) membership tree nodes; (c) membership tree and its base tree; (d) final membership tree state, reproduced below.]

    Node on MT   mtChild (children in MT)   mtFHs (FHs with subscribers)
    A            {B, C}  (encoded as 11)    {}
    B            {E}     (encoded as 01)    {I}
    C            {}      (encoded as 00)    {M, O}
    E            {}      (encoded as 00)    {J, K}

5.4 MAD Protocol Design

The MAD protocol consists of five components: (i) the membership tree sub-protocol, (ii) the dissemination tree sub-protocol, (iii) state transition mechanisms, (iv) failure recovery, and (v) mechanisms for operating across domain boundaries. Here, we explain the state transition mechanisms and failure recovery; please refer to our paper [14] for more details.

Preliminaries: We first introduce some notation before presenting the details of the MAD protocols. The MAD multicast service overlay consists of L logical overlay routers; assume L = 2^b is a power of two. Each logical overlay router is uniquely identified by a b-bit router ID (ranging from 0 to L − 1). We use FH(s) to denote the first-hop router that a subscriber s is connected to.

Each multicast group g is identified by a unique 128-bit group ID gid(g). Each group g maintains a membership tree MT(g) to record its set of members. If g is active, it also maintains a separate dissemination tree DT(g). MT(g) and DT(g) share a common core (i.e., root) logical overlay router core(g). To balance the load and reduce traffic concentration, we apply a hash function H(·) to map the 128-bit group ID to a random core ID, i.e., core(g) = H(gid(g)). A benefit of this hash-based scheme is that it obviates the need for a separate resolution procedure for mapping group IDs to core IDs, which can become expensive with hundreds of billions of groups. Finally, for each logical overlay router ℓ, MAD defines a virtual base tree BT(ℓ) rooted at ℓ that includes all L logical overlay routers. BT(ℓ) is the basis for constructing MT(g) for every group g with core(g) = ℓ.
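A minimal sketch of this hash-based core mapping follows (SHA-1 truncated to b bits is our illustrative choice; the text only requires some hash H that spreads group IDs uniformly over router IDs):

    import hashlib

    B = 16              # b: router IDs are b-bit, so L = 2**b logical routers
    L = 1 << B

    def core_of(gid: bytes) -> int:
        """core(g) = H(gid(g)): map a 128-bit group ID to a core router ID."""
        assert len(gid) == 16                    # 128-bit group ID
        return int.from_bytes(hashlib.sha1(gid).digest(), "big") % L

    # Any router can compute the core locally, so no resolution service is
    # needed even with hundreds of billions of groups:
    # core_of(bytes.fromhex("00112233445566778899aabbccddeeff"))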

5.4.1 Mode Transition

A major challenge in the design of MAD is how to ensure the smooth transition between membership tree based forwarding (i.e., the inactive mode) and dissemination tree based forwarding (i.e., the active mode). In particular, it is essential to avoid disruption of the multicast data delivery service during mode transition.

Our basic strategy for achieving smooth mode transition is to require every group in the system to always maintain the membership tree even when the group is considered active and the dissemination tree has been constructed. Having an always up-to-date membership tree ensures that during the transition period we can use membership tree based forwarding to deliver messages reliably to all the group members, with very little additional control overhead.

When a group transitions from being active to inactive, the transition is achieved simply by not forwarding on the dissemination tree and “tearing it down”. The key however is to have an efficient transition from inactive to active mode while ensuring no data is lost. Every new group g initially stays in the inactive mode. As group activity becomes high enough, core(g) may decide to improve forwarding efficiency by creating a separate dissemination tree DT(g). We first deliver data messages over both MT(g) and DT(g) during the transition (i.e., in transient mode) and stop delivering data to a subtree of MT(g) only after the root of the subtree is certain that all existing group members in the subtree are able to receive data from the dissemination tree.
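The following compact sketch captures this transition logic at the core as we have described it (the state names and the per-subtree confirmation callback are illustrative assumptions, not the exact prototype interface):

    from enum import Enum

    class Mode(Enum):
        INACTIVE = 1    # forward on the membership tree (MT) only
        TRANSIENT = 2   # forward on both MT and dissemination tree (DT)
        ACTIVE = 3      # forward on DT only; MT is still kept up to date

    class GroupCore:
        def __init__(self):
            self.mode = Mode.INACTIVE
            self.pending = set()    # MT subtrees not yet confirmed on the DT

        def activity_rises(self, mt_subtrees):
            """Build the DT and duplicate data until every subtree confirms."""
            self.pending = set(mt_subtrees)
            self.mode = Mode.TRANSIENT

        def subtree_confirmed(self, subtree):
            """A subtree root reports all its members now receive via the DT."""
            self.pending.discard(subtree)
            if not self.pending:
                self.mode = Mode.ACTIVE   # safe to stop MT-based data delivery

        def activity_falls(self):
            """Active -> inactive: simply stop using the DT and tear it down."""
            self.mode = Mode.INACTIVE

        def forward(self, msg, send_mt, send_dt):
            if self.mode in (Mode.INACTIVE, Mode.TRANSIENT):
                send_mt(msg)    # guarantees delivery during the transition
            if self.mode in (Mode.TRANSIENT, Mode.ACTIVE):
                send_dt(msg)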

While we have worked through the details of mode transition in the prototype implementation evaluated in this chapter, we elide them here for brevity.

5.4.2 Failure Recovery

MAD handles failures through replication, in a manner similar to other overlay based approaches [10,62]. We exploit the advantage of being able to establish connectivity in the overlay dynamically, in response to a failure. The main concern we address here is the careful management of state specific to MAD. Specifically, for each physical overlay router in our system, we designate a set of physical overlay routers as its shadow routers. To minimize replicated state, we only store state related to leaf nodes of the membership tree. Once the membership tree is repaired, it can perform normal multicast data delivery. For each logical overlay router ℓ owned by a physical overlay router p, the shadow routers of p only need to replicate the user subscription state (i.e., mtFHs) for all enrolled groups. The replicated state is saved in stable storage (e.g., a hard disk drive) at all the shadow routers of p. There is a keep-alive message exchange between a physical overlay router p and its shadow routers p′ for fast failure detection and recovery. For each logical overlay router ℓ previously owned by p, p′ needs to recover the role of ℓ in every group that ℓ is involved in. Each child of p may discover p′ through a directory service lookup that maintains an up-to-date list of shadow routers. Such a directory service may also be implemented in a distributed manner using DHTs.

Leaf: ℓ may be a leaf node in the membership tree MT(g) for group g. To take over such a role, p′ sends an MtJoinMsg towards the core of group g along the base tree BT(core(g)). We save bandwidth by aggregating multiple MtJoinMsgs with the same core into a single message with a list of group IDs.

On-tree node: ℓ may be an internal node of either a membership tree or a dissemination tree. MAD has on-tree nodes in both the MT and the DT send heartbeat messages down the tree. Upon failure detection, the child repairs the tree by sending a join message (i.e., either MtJoinMsg or DtJoinMsg) to its parent.

Core: ℓ may be the core for a group. After a failure, p′ starts receiving join messages from children of ℓ in group g. p′ infers the mode information from the received DtJoinMsg.

It is important to note that MAD only involves system-wide keep-alive messages (between each physical overlay router and its shadow routers) as opposed to per-group keep-alive messages. So the control overhead due to keep-alive messages is independent of the number of multicast groups. In contrast, IP multicast style protocols like CBT require per-group keep-alive messages to retain forwarding state.
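A back-of-the-envelope comparison of the two keep-alive regimes (the 60-second refresh interval is CBT's default, used in our evaluation below; the rest of the arithmetic is ours):

    def keepalive_msgs_per_sec(n_groups, avg_tree_nodes, n_phys_routers,
                               n_shadows=2, interval=60.0):
        """Per-group refresh (CBT-style) vs. MAD's system-wide keep-alives
        between each physical router and its shadow routers."""
        cbt = n_groups * avg_tree_nodes / interval   # grows with group count
        mad = n_phys_routers * n_shadows / interval  # independent of groups
        return cbt, mad

    # e.g. 1e9 groups x 10 on-tree nodes -> ~1.7e8 msgs/s of CBT-style refresh,
    # vs. 65,536 routers x 2 shadows / 60 s -> ~2.2e3 msgs/s for MAD.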

5.5 Scaling of MAD Trees

In this section, we conduct extensive simulations on realistic network topologies to examine the state requirement of MAD trees and the trade-off between state reduction and forwarding cost. To gain more insights into the scaling of MAD trees, we also analytically derive the state requirement.

In our model, it is the number of distinct subscription managers involved in a group (essentially the number of FH routers), rather than the number of subscribers, that reflects the scaling of the system. Thus, our results examine scaling based on the number of FH routers in the group or system.

5.5.1 Simulation Evaluation

5.5.1.1 Simulation Setup

Simulator: To evaluate MAD, we developed a simulator (with 6,000 lines of C/C++ code) that achieves scalability by avoiding the simulation of packet-level events.

[Figure 5.4: Scaling of MAD trees on topology pow-16k. (a) Total tree state (KB) for a group; (b) total underlay hops; (c) control messages per second — each plotted against the number of first-hop routers in a group, for MAD, MT, and CBT.]

Underlay distance matrix: Our simulator constructs overlay network topologies by first computing the pair-wise underlay distance between overlay routers and then computing multiple edge-disjoint Minimum Spanning Trees. To obtain the underlay distance matrix, we first generate an underlay network topology that comprises 16,000 nodes in all. We then use shortest hop-count routing to obtain the distance (i.e., hop count) between each pair of (overlay) routers. The underlay topology we consider is either a power-law topology (pow-16k), or a transit-stub topology with stub (access) nodes and transit nodes (ts-16k). In addition to computing hop counts between overlay routers, we also use the DS2 delay space synthesizer [82] to directly synthesize a realistic underlay distance matrix (ds2-16k).

Overlay topology: After obtaining the underlay distance matrix, the overlay network topology is constructed as the union of m edge-disjoint Minimum Spanning Trees (MSTs) for the logical full-mesh (i.e., clique) over the 16,000 overlay routers. Specifically, we first construct a logical full mesh spanning all the overlay routers, where the distance between router i and router j is computed as the hop-count distance between them on the underlying physical topology. We next compute a MST of the logical full-mesh, delete all its edges from the full-mesh, compute another MST, delete all its edges, and so on. After m edge-disjoint MSTs are obtained, we take the union of these MSTs as the overlay network topology. Shortest path routing is used to determine the overlay path between two overlay routers, where the weight for an overlay edge is set to its hop-count distance in the physical network.
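A small sketch of this construction, using the networkx library as an assumed dependency (function and variable names are ours):

    import networkx as nx

    def overlay_from_msts(dist, m):
        """Union of m edge-disjoint MSTs of the logical full mesh.

        dist[i][j] is the hop-count distance between overlay routers i and j
        on the underlying physical topology."""
        n = len(dist)
        mesh = nx.Graph()
        for i in range(n):
            for j in range(i + 1, n):
                mesh.add_edge(i, j, weight=dist[i][j])

        overlay = nx.Graph()
        for _ in range(m):
            mst = nx.minimum_spanning_tree(mesh)   # one MST of what remains
            overlay.add_edges_from(mst.edges(data=True))
            mesh.remove_edges_from(mst.edges())    # enforce edge-disjointness
        return overlay

    # Overlay routing then uses hop-count-weighted shortest paths, e.g.:
    # nx.shortest_path(overlay, src, dst, weight="weight")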

5.5.1.2 Simulation Results

We randomly form 100 multicast groups, each with a fixed group size. We then vary this fixed group size, measured as the number of first-hop routers. For each group, we compute the required state, forwarding cost, and control overhead of CBT and MT as follows.

• We compute the total tree state stored by all on-tree nodes. For CBT, an on-tree node stores a 128-bit group ID and a 32-bit bit-vector dtChild (which specifies all interfaces with members downstream). For MT, an on-tree node stores a 128-bit group ID, a 16-bit bit-vector mtChild (which indicates whether each child in the base tree belongs to MT), and a list of first-hop overlay routers in the subtree rooted at this node (mtFHs). (A worked example of this accounting follows the list.)

• The forwarding cost for a message is measured by the total number of underlay hops that the message traverses.

• The control overhead is measured by the total number of keep-alive (i.e., hello) messages sent by the group in a second. We use the default keep-alive message interval of 60 seconds in CBT [4], i.e., each node in a CBT sends a keep-alive message to its parent once every 60 seconds.
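For concreteness, a small sketch of the per-group state accounting (the field widths come from the first bullet; the node counts in the example comment are purely illustrative):

    GROUP_ID_BITS = 128

    def cbt_state_bits(on_tree_nodes):
        """CBT: each on-tree node stores a 128-bit group ID + 32-bit dtChild."""
        return on_tree_nodes * (GROUP_ID_BITS + 32)

    def mt_state_bits(on_tree_nodes, fh_routers):
        """MT: each on-tree node stores a 128-bit group ID + 16-bit mtChild,
        and each 16-bit FH router ID appears once in some node's mtFHs."""
        return on_tree_nodes * (GROUP_ID_BITS + 16) + 16 * fh_routers

    # Illustrative magnitudes for a large group (node counts made up):
    # cbt_state_bits(11_000) / 8e3      ->  220.0 KB of total tree state
    # mt_state_bits(1_200, 8_000) / 8e3 ->  ~37.6 KB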

Finally, to compute the state requirement, forwarding cost, and control overhead of the MAD protocol, we assume that 10% of the groups are active, and that they contribute 75% of the data traffic. These fractions are chosen based on the publishing behavior of RSS feeds as shown in Figure 5.2.

The average state requirement for a group is therefore 90% of the MT state requirement plus 10% of the CBT state requirement. Similarly, the average forwarding cost is 90% of the CBT forwarding cost plus 10% of the MT forwarding cost.

Figure 5.4 compares the state requirement, forwarding cost, and control overhead for pow-16k, where every data point is the average over 100 random groups (of the same size). The results for ts-16k are quantitatively similar and are omitted in the interest of brevity. Figure 5.4(a) compares the total tree state required by CBT, MT, and MAD. By combining MT with CBT, MAD achieves nearly an order of magnitude state reduction over CBT. Figure 5.4(b) shows the forwarding cost of CBT, MT, and MAD. The total forwarding cost for MAD is very close to that of CBT (both in delay and number of hops traversed) and significantly outperforms MT as the size of the group increases. Figure 5.4(c) shows the control overhead of CBT and MAD (measured by the number of keep-alive messages per second from each group). MAD achieves an order of magnitude reduction in control overhead over CBT, because 90% of the groups are inactive and only have to maintain the MT, which requires no per-group keep-alive messages.

Figure 5.5 shows the maximum number of groups that an overlay with 2^16 overlay routers (each with 3GB memory) can hold. MAD can easily support hundreds of billions of groups on today’s commercial hardware platform. For example, with a group size of 16, MAD supports 913 billion groups for pow-16k, and 750 billion groups for ts-16k, yielding a factor of 7–9 improvement over CBT. Note that such improvement is close to optimal, because in our simulation 10% of the groups are active and maintain both MT and CBT. The state reduction is thus bounded by

$$\frac{state^{CBT}}{0.1 \times state^{CBT} + state^{MT}} \le 10.$$

Summary: Our simulation results clearly show that MAD achieves both the improved forwarding efficiency of CBT and the state reduction and low control overhead of MT.

5.5.2 Formal Analysis of State Requirement

Consider an overlay with L logical overlay routers. Given a multicast group g that has F randomly selected first-hop routers to which subscribers are connected, below we analytically derive the expected state requirement for both CBT(g) and MT(g) with respect to F.

[Figure 5.5: Maximum number of groups (in billions) that 2^16 routers (each with 3GB memory) can hold, vs. the number of first-hop routers in a group, for MAD, MT, and CBT. (a) pow-16k; (b) ts-16k.]

Membership tree state requirement: Since all the base trees are isomorphic, without loss of generality we can assume that core(g) = 0. Therefore, MT(g) is constructed based on the base tree BT(0).

For any logical overlay router i, let N_i be the number of nodes in the subtree of BT(0) rooted at i (including node i itself), and let the random variable X_i denote the number of first-hop routers with subscribers in that subtree. Assuming that the F first-hop routers with attached subscribers are selected uniformly at random from all L logical overlay routers, X_i has a hypergeometric distribution:

$$\Pr(X_i = n) = \frac{\binom{F}{n}\binom{L-F}{N_i-n}}{\binom{L}{N_i}}.$$

The mean $\mu_i$ and variance $\sigma_i^2$ of $X_i$ are given by

$$\mu_i = N_i \frac{F}{L}, \qquad \sigma_i^2 = \frac{N_i(L-N_i)}{L-1} \cdot \frac{F}{L} \cdot \frac{L-F}{L}.$$

In order for logical overlay router i to become a member of MT(g), we need $X_i \ge S_{\min}$. The probability for this to occur is bounded by Chebyshev's Inequality:

$$\Pr(X_i \ge S_{\min}) \le \left[\frac{\sigma_i}{\max(S_{\min}-\mu_i,\, \sigma_i)}\right]^2 \triangleq P_i^{MT}.$$

The expected number of nodes on MT(g) is thus bounded by $\sum_{i=0}^{L-1} P_i^{MT}$. Each node on MT(g) stores a 128-bit group ID (as the key for the forwarding table), a 16-bit field mtChild, and a list of 16-bit first-hop router IDs (mtFHs). Since each 16-bit first-hop router ID is stored only once, the total amount of state devoted to mtFHs is $16 \times F$ bits. So the expected number of bits for the entire membership tree state is:

$$state^{MT} \le 16F + (128 + 16) \sum_{i=0}^{L-1} P_i^{MT}.$$

Dissemination tree state requirement: For simplicity, we only consider the state requirement of CBT(g) in the special case where all the unicast routes destined to core(g) together form a balanced k-ary tree isomorphic to BT(0). This is likely to underestimate the state requirement of CBT in the more general case where the unicast routes do not have such a regular underlying structure. However, our results clearly show that even in this special case, MT and MAD achieve much better state efficiency than CBT.

In the special case we consider, let N_i and X_i denote the same quantities as in the analysis for MT(g) above. A logical overlay router i becomes an on-tree node of CBT(g) whenever X_i > 0. The probability for this to occur is given by

$$P_i^{CBT} \triangleq 1 - \Pr(X_i = 0) = 1 - \frac{\binom{F}{0}\binom{L-F}{N_i - 0}}{\binom{L}{N_i}} = 1 - \frac{\binom{L-F}{N_i}}{\binom{L}{N_i}}.$$

The expected number of nodes on CBT(g) is $\sum_{i=0}^{L-1} P_i^{CBT}$. Each node on CBT(g) stores a 128-bit group ID (as the key for the forwarding table) plus a 32-bit bit-vector dtChild. So the expected number of bits for the dissemination tree state is:

$$state^{CBT} = (128 + 32) \sum_{i=0}^{L-1} P_i^{CBT}.$$

Numerical results: We numerically compute the state requirement for MT and CBT (with k = 16) for different sizes of the group and the overlay network. As shown in Figure 5.6, MAD achieves an order of magnitude state reduction over CBT. Moreover, depending on the group size, MAD can easily support hundreds of billions of groups on today’s commercial hardware. These results are consistent with our simulation results on more realistic topologies (i.e., Figure 5.4 and Figure 5.5) and demonstrate MAD’s ability to reduce group state by decoupling membership and dissemination.
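These bounds can be evaluated numerically with a short script such as the following sketch (it assumes the balanced k-ary special case above with k = 16 and L = 2^16; the lgamma-based binomials and all helper names are our own conveniences):

    from math import exp, lgamma

    K, L, S_MIN = 16, 2 ** 16, 2

    def log_comb(n, r):
        return lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)

    def subtree_sizes():
        """N_i for every node of a balanced K-ary tree over L nodes."""
        sizes = [1] * L
        for i in range(L - 1, 0, -1):
            sizes[(i - 1) // K] += sizes[i]   # fold each node into its parent
        return sizes

    def state_bits(F):
        """Bounds on expected state (bits) for MT(g) and CBT(g) with F FHs."""
        mt_nodes = cbt_nodes = 0.0
        for n_i in subtree_sizes():
            mu = n_i * F / L
            var = (n_i * (L - n_i) / (L - 1)) * (F / L) * ((L - F) / L)
            sigma = var ** 0.5
            if sigma == 0.0:              # the root: X_i = F deterministically
                mt_nodes += 1.0 if mu >= S_MIN else 0.0
            else:                         # Chebyshev bound P_i^MT
                mt_nodes += (sigma / max(S_MIN - mu, sigma)) ** 2
            p0 = exp(log_comb(L - F, n_i) - log_comb(L, n_i)) if n_i <= L - F else 0.0
            cbt_nodes += 1.0 - p0         # P_i^CBT = 1 - Pr(X_i = 0)
        return 16 * F + (128 + 16) * mt_nodes, (128 + 32) * cbt_nodes

    def max_groups(F, mem_bytes=3 * 2**30, active_frac=0.1):
        """Groups L routers (mem_bytes each) can hold when a fraction
        active_frac of the groups maintains both trees."""
        mt, cbt = state_bits(F)
        return L * mem_bytes * 8 / (mt + active_frac * cbt)

    # max_groups(16) comes out on the order of 1e11 -- hundreds of billions --
    # consistent with Figure 5.6(b).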

[Figure 5.6: Analytical results on the scaling of MAD trees. (a) Total tree state (KB) for a group; (b) maximum number of groups (in billions) that 2^16 routers can hold — each vs. the number of first-hop routers in a group, for MAD, MT, and CBT.]

5.6 Evaluation of Implementation

We implemented the MAD protocol on top of FreePastry 1.4.4 [61]. Below we evaluate the state efficiency, forwarding efficiency, and mode transition cost of our MAD prototype.

Experimental setup: We conducted experiments on the Emulab testbed [76]. Our experiments involve 105 Emulab nodes that range from systems with an Intel Pentium III with 256MB RAM to a Xeon with 2GB RAM. We use the sprintlink-us network topology available from Rocketfuel [68], with link latencies inferred in [44] and zero link loss. To control the total number of nodes in the network, we vary the number of routers in each city (PoP). Routers in the same city are connected via a LAN with zero latency. For each group, we select first-hop routers to join the group uniformly at random from the entire set of nodes in the experiment. Similar to our simulation, we assume that 10% of the groups are active and that they contribute 75% of the data traffic. Finally, we use the default settings of FreePastry 1.4.4.

State efficiency: We first look at the results from experiments to show the benefit of the membership tree (MT) over the dissemination tree (CBT) with respect to the total state stored at on-tree routers. Figure 5.7(a) shows the number of nodes that maintain state for a group, as a function of the number of subscribed first-hop routers. With our design of the membership tree, only a small number of nodes need to maintain state for the group, and this number grows slowly. With CBT, the number of on-tree routers grows much more rapidly as more first-hop routers join the tree. MT thus achieves significant state reduction over CBT. The state efficiency of MAD is close to that of MT because 90% of all groups are inactive and maintain only the efficient MT.

[Figure 5.7: Efficiency of MAD. (a) Number of nodes keeping tree state; (b) average message delivery latency (msec) — each vs. the number of first-hop routers, for MAD, MT, and CBT.]

Forwarding efficiency: Figure 5.7(b) shows the average latency of delivering a data message from the core to all the member routers in a tree. CBT achieves much lower message delivery latency than MT. This is not surprising because MT is designed primarily for state efficiency, not for forwarding efficiency. Meanwhile, the delivery latency of MAD is very close to that of CBT, because most of the data traffic comes from active groups and is thus delivered via CBT in the MAD protocol — recall that we assume active groups contribute 75% of all the data traffic.

Mode transition cost: We measure the cost of mode transition from MT to CBT (as an inactive group becomes active) in terms of (i) the transition latency (i.e., the time it takes for all nodes in a group to complete mode transition) and (ii) the message duplication ratio (i.e., the ratio between the number of duplicate messages and the total number of distinct messages received during the transition period). Recall that mode transition from CBT to MT is almost instantaneous and does not result in any duplicated messages (see §5.4.1).

Figure 5.8(a) shows the average, minimum, and maximum transition latency from MT to CBT as the number of first-hop routers in the group increases. The transition latency is quite low — the average transition latency is close to 1 second and the worst-case transition latency is only 1.8 seconds.

Figure 5.8(b) shows the average, minimum, and maximum message duplication ratio during mode transition. The message duplication ratio is less than 1 because as soon as a first-hop router receives the first message from the CBT, it informs its parent in the MT to suppress all future duplicate messages. Therefore, during the entire transition period, only those messages that are sent before the suppression occurs will be received in duplicate. Since the transition latency is low, the overhead due to message duplication is acceptable (especially given the increased forwarding efficiency after the mode transition completes).

[Figure 5.8: Cost of mode transition from MT to CBT. (a) Transition latency (msec); (b) message duplication ratio — each vs. the number of first-hop routers.]

Summary: Our experimental results clearly demonstrate that MAD achieves both the high state efficiency of MT and the high forwarding efficiency of CBT. Meanwhile, such a benefit comes at a low cost — the mode transition between MT and CBT only takes 1–2 seconds and the overhead due to message duplication during the transition is acceptable.

Chapter 6

Conclusions and Future Work

To support information-centric network services, we propose novel techniques to facilitate the formation of fine-grained meaningful groups: user-based groups and content-based groups. We propose the vector-space model for the user-based group, which can encode pair-wise relations in user vectors. The user-based approach is useful when selecting a close or similar user neighborhood in information-centric networking services such as online social networks. We propose the notion of content-based group to capture both keyword information and document structure. The content descriptor helps to identify the specific data item that meets users’ demands, and achieves a good balance between expressiveness and granularity. A distinct multicast group for each piece of distributable content can be mapped to the corresponding content-based group.

6.1 Contributions

1. Scalable Proximity Embedding. We develop several novel techniques to approximate a large family of proximity measures and analyze underlying complex network structures in massive, highly dynamic online social networks. Our Proximity Embedding technique can be applied to compute path-ensemble based measures, which are known to be effective but too expensive to compute for large networks.

We develop a novel technique called Clustered Spectral Graph Embedding, which combines the concepts of spectral graph embedding and clustering in a natural way. The method captures the essential structure of large graphs, is highly efficient and scalable, and works as a flexible tool for mining and analyzing massive online social networks. The clustered spectral graph embedding was evaluated using three large-scale, real-world social network datasets. We showed that our temporal scalability is an order of magnitude higher than the current state of the art. More importantly, with the same memory requirement, we prove that our technique is able to create approximations with an order of magnitude higher rank compared to existing approaches, dramatically improving proximity estimation accuracy.

2. Supervised Link Prediction. We develop new supervision based link prediction techniques called Clustered Spectral Learning (CSL) and Clustered Polynomial Learning (CPL) to predict the evolution of massive and complex online social network structures. Based on the analysis of past snapshots of network structures, our supervised link prediction methods learn optimal parameters of proximity metrics to predict future link creations. Evaluation results with three real-world online social networks (Flickr, LiveJournal, and MySpace) show that our techniques yield up to 20% improvement in link prediction accuracy compared to existing methods while using the same or less computing resources.

3. Multicast with Adaptive Dual-state. We presented Multicast with Adaptive Dual-state (MAD), a novel architecture for providing efficient multicast service at massive scale. The key to its scalability and efficiency is the decoupling of group membership and forwarding state, which allows us to optimize for different objectives for active and inactive groups. Group membership is maintained scalably in a distributed fashion using a hierarchical membership tree (MT). Inactive groups forward data over their membership trees. Active groups use IP multicast-style dissemination trees (DT) for efficient data forwarding. MAD provides seamless and efficient mechanisms for a group to transition between DT-based and MT-based forwarding as its activity level changes.

We examined the scaling characteristics of the MAD protocol through a prototype implementation. Compared with IP-multicast style approaches (e.g., CBT), MAD achieves nearly an order of magnitude reduction in state requirement and control overhead. As a result, MAD can support hundreds of billions of multicast groups with long-lived membership on today’s commercial hardware platform. Meanwhile, MAD achieves comparable forwarding efficiency and low message delivery latency, through the use of DT-based forwarding for active groups. Thus, MAD achieves the best of both worlds – scalability and persistence in group membership by using MT, and efficient data forwarding by using DT.

6.2 Future Work

Many information-centric networking services – such as voting, content recommendation, and trust networks – require support from the underlying infrastructure not only to handle private user data, but also to compute aggregate information accurately. We want to explore a new direction: a distributed aggregation system which can provide accurate content aggregation of sensitive user data in a privacy-preserving manner for various types of applications.

Below are some of the design challenges and trade-offs in solving the distributed privacy-aware constrained data aggregation problem:

• Accuracy versus performance. There are a large number of users, items, groups, and various types of aggregations depending on many different situations. How can we produce aggregations efficiently in such an environment? It would be almost infeasible to access every user in every group. We realize there is a trade-off between aggregation accuracy and the performance we achieve, and plan to provide efficient yet highly accurate aggregation mechanisms.

• Privacy risk versus aggregation quality. The quality of aggregated information relies on accurate data provided by users. Users desire good aggregation, but also wish to minimize the risk of revealing private information. How do we achieve these seemingly contradictory goals? We realize there is a trade-off between privacy risk and aggregation quality, and aim to provide statistical guarantees on the aggregation results at different privacy awareness levels.

We plan to design a distributed information aggregation system which provides:

• Declarative group creation. As there are various types of applications in information aggregation, the underlying architecture should be flexible enough to handle group formation dynamically, such that users can specify constraints on users and data items when requesting aggregations. For example, users may want to create groups based on similarities with other users (e.g., close friends in social networks), or on users with the same interest in a specific topic (e.g., fans of rock music) for content recommendation applications. Not only is maintaining information about a huge number of users and items itself difficult, it is also challenging to support flexible group creation in such a dynamic environment.

• Scalable and privacy-aware aggregation. We plan to design mechanisms which generate an accurate summary from the large number of users in the network. Our system will provide privacy-aware sharing of user information using data perturbation techniques [57,58]. We plan to provide a set of simple operators (e.g., sum, max, multiplication) which can be perturbed at different levels. To guide the aggregation process, we allow users to combine multiple operators to build more complex aggregation functions. For example, for movie recommendations, users can collect votes about movie selections from a friend group using the sum operator, then take the top-K movies by applying the max operator (see the sketch below). We aim to achieve high-quality data summaries even with distorted data, using statistical techniques.
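As a purely illustrative sketch of how such perturbed operators might compose (the zero-mean noise model follows the randomized perturbation idea of [57]; all names and parameters here are our assumptions):

    import random

    def perturb(value, noise_scale=1.0):
        """Each user adds zero-mean noise locally, so an individual report
        reveals little about the true private value."""
        return value + random.gauss(0.0, noise_scale)

    def private_sum(reports):
        """Sum operator over perturbed reports: the noise terms tend to
        cancel, so accuracy improves as the group grows."""
        return sum(reports)

    def top_k(per_item_reports, k):
        """Compose operators: a private sum per item, then a max/top-K step."""
        totals = {item: private_sum(r) for item, r in per_item_reports.items()}
        return sorted(totals, key=totals.get, reverse=True)[:k]

    # Movie recommendation example from the text: each friend submits
    # perturb(vote) per movie, and the aggregator picks the top-K movies.
    # votes = {"MovieA": [perturb(1) for _ in range(50)],
    #          "MovieB": [perturb(0) for _ in range(50)]}
    # print(top_k(votes, k=1))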

Bibliography

[1] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 2003.

[2] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In Proc. of WWW, 2007.

[3] Alexa global top 500 sites. http://www.alexa.com/site/ds/top_sites.

[4] A. J. Ballardie. RFC2189: Core based trees (CBT version 2) multicast routing: Protocol specification. RFC Editor United States, 1997.

[5] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaboration. Physica A: Statistical Mechanics and its Applications, 2002.

[6] R. M. Bell, Y. Koren, and C. Volinsky. Chasing $1,000,000: How we won the Netflix Progress Prize. Statistical Computing and Statistical Graphics Newsletter, 18(2):4– 12, 2007.

[7] Bittorrent. http://www.bittorrent.com.

[8] R. Boivie, N. Feldman, Y. Imai, W. Livens, D. Ooms, and O. Paridaens. Explicit multicast (Xcast) basic specification. IETF draft, 2000.

[9] A. Bozdog, R. van Renesse, and D. Dumitriu. SelectCast: A scalable and self-repairing multicast overlay routing facility. In Proc. ACM Workshop on Survivable and Self-Regenerative Systems, 2003.

[10] M. Castro, P. Druschel, A.-M. Kermarrec, and A. Rowstron. Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE J. on Selected Areas in Comm., 2002.

[11] E. Castronova. Network technology, markets and the growth of synthetic worlds. In Proc. NetGames, 2003.

[12] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system. In Proc. IMC, 2007.

[13] J. Cho and S. Roy. Impact of search engines on page popularity. In Proc. WWW, 2004.

[14] T. Cho, M. Rabinovich, K. K. Ramakrishnan, D. Srivastava, and Y. Zhang. Enabling content dissemination using efficient and scalable multicast. In Proc. INFOCOM, 2009.

[15] C. Cortes, D. Pregibon, and C. T. Volinsky. Communities of interest. Intelligent Data Analysis, 2002.

[16] C. P. Costa, I. S. Cunha, A. Borges, C. V. Ramos, M. M. Rocha, J. M. Almeida, and B. Ribeiro-Neto. Analyzing client interactivity in streaming media. In Proc. WWW, 2004.

[17] L. H. Costa, S. Fdida, and O. C. Duarte. Hop-by-hop multicast routing protocol. In Proc. SIGCOMM, 2001.

[18] J. Cui, M. Geria, K. Boussetta, M. Faloutsos, A. Fei, J. Kim, and D. Maggiorini. Aggregated multicast: A scheme to reduce multicast states. IETF draft, 2002.

[19] J. Davidsen, H. Ebel, and S. Bornholdt. Emergence of a small world from local interactions: Modeling acquaintance networks. Physical Review Letters, 2002.

[20] Facebook. http://www.facebook.com.

[21] Facebook statistics. http://www.facebook.com/press/info.php?statistics.

[22] B. Fenner, M. Handley, H. Holbrook, and I. Kouvelas. RFC4601: Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification. RFC Editor United States, 2006.

[23] Flickr. http://www.flickr.com.

[24] S. Garriss, M. Kaminsky, M. J. Freedman, B. Karp, D. Mazieres, and H. Yu. RE: Reliable Email. In Proc. of Networked Systems Design and Implementation (NSDI), 2006.

[25] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.

[26] L. Hagen and A. Kahng. New spectral methods for ratio cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 11(9):1074–1085, Sep 1992.

[27] P. C. Hansen. Rank-deficient and discrete ill-posed problems: numerical aspects of linear inversion. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1998.

[28] S. Hill, F. Provost, and C. Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, 2006.

[29] E. M. Jin, M. Girvan, and M. E. J. Newman. The structure of growing social networks. Physical Review Letters, 2001.

[30] Junos 9.2 multicast protocols configuration guide. http://netscreen.com/techpubs/software/junos/junos92/swconfig-multicast/.

[31] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 1953.

[32] A. V. Knyazev. Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput, 2000.

[33] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity in networks. In Proc. of KDD, 2006.

[34] B. Kulis and Y. Guan. Graclus—Efficient graph clustering software for normalized cut and ratio association on undirected graphs, 2008. 2010/01/26 (version 1.2).

[35] J. Kunegis and A. Lommatzsch. Learning spectral graph transformations for link prediction. In ICML, 2009.

[36] R. Larsen. Lanczos bidiagonalization with partial reorthogonalization. Technical Report DAIMI PB-357, Department of Computer Science, Aarhus University, 1998.

[37] R. Lehoucq, D. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of Large Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia, 1998.

[38] R. B. Lehoucq and D. C. Sorensen. Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM Journal on Matrix Analysis and Applications, 17(4):789–821, 1996.

[39] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proc. of Conference on Information and Knowledge Management (CIKM), 2003.

[40] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 2007.

[41] H. Lim, J. Hou, and C. H. Choi. Constructing Internet coordinate system based on delay measurement. In Proc. of IMC, 2003.

[42] H. Liu, V. Ramasubramanian, and E. G. Sirer. A measurement study of RSS, a publish-subscribe system for web micronews. In Proc. IMC, 2005.

[43] LiveJournal. http://www.livejournal.com.

[44] R. Mahajan, N. Spring, D. Wetherall, and T. Anderson. Inferring link weights using end-to-end measurements. In Proc. IMW, 2002.

[45] Y. Mao and L. K. Saul. Modeling distances in large-scale networks by matrix factorization. In Proc. of IMC, New York, NY, USA, 2004.

[46] M. Medved. eBay auction counts, 2008. http://www.medved.net/cgi-bin/cal.exe?EIND.

[47] A. Mislove, K. P. Gummadi, and P. Druschel. Exploiting social networks for Internet search. In Proc. of HotNets-V, 2006.

[48] A. Mislove, H. S. Koppula, K. Gummadi, P. Druschel, and B. Bhattacharjee. Growth of the flickr social network. In Proc. of SIGCOMM Workshop on Social Networks (WOSN), 2008.

[49] T. Murata and S. Moriyasu. Link prediction of social networks based on weighted proximity measures. In Proc. of International Conference on Web Intelligence, Washington, DC, USA, 2007. IEEE Computer Society.

[50] MySpace. http://www.myspace.com.

[51] MySpace 400 millionth user. http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=400000000.

[52] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 2003.

[53] Multichannel News. AT&T’s fast IPTV growth may have a downside. http://www.multichannel.com/article/ca6493387.

[54] T. E. Ng and H. Zhang. Predicting Internet network distance with coordinate-based approaches. In Proc. of INFOCOM, 2002.

[55] Nielsen Online. Fastest growing social networks for September 2008. http://blog.nielsen.com/nielsenwire/wp-content/uploads/2008/10/press_release24.pdf.

[56] J. A. Patel, I. Gupta, and N. Contractor. JetStream: Achieving predictable gossip dissemination by leveraging social network principles. In Proc. of IEEE Network Computing and Applications (NCA), 2006.

[57] H. Polat and W. Du. Privacy-preserving collaborative filtering using randomized perturbation techniques. In ICDM, 2003.

[58] H. Polat and W. Du. Privacy-preserving top-n recommendation on distributed data. Journal of American Society for Information Science & Technology, 2008.

[59] S. Ratnasamy, A. Ermolinskiy, and S. Shenker. Revisiting IP multicast. In Proc. SIGCOMM, 2006.

[60] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Proc. of NIPS, 2002.

[61] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proc. IFIP/ACM Middleware, 2001.

[62] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proc. SOSP, 2001.

[63] P. Sarkar, A. W. Moore, and A. Prakash. Fast incremental proximity search in large graphs. In Proc. of ICML, 2008.

[64] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888–905, 2000.

[65] H. Song, T. Cho, V. Dave, Y. Zhang, and L. Qiu. Scalable proximity estimation and link prediction in online social networks. In Proc. of IMC, 2009.

[66] H. H. Song, T. W. Cho, V. Dave, Y. Zhang, and L. Qiu. Scalable proximity estimation and link prediction in online social networks. In IMC, 2009.

[67] Spectral Theorem. http://en.wikipedia.org/wiki/Spectral_decomposition.

[68] N. Spring, R. Mahajan, D. Wetherall, and T. Anderson. Measuring ISP topologies with Rocketfuel. IEEE/ACM Transactions on Networking, 2004.

[69] G. Stewart. Matrix Algorithms, volume II: Eigensystems. SIAM, 2001.

[70] I. Stoica, T. S. E. Ng, and H. Zhang. REUNITE: A recursive unicast approach to multicast. In Proc. INFOCOM, 2000.

[71] L. Tang and M. Crovella. Virtual landmarks for the Internet. In Proc. of IMC, 2003.

[72] H. Tong, C. Faloutsos, and Y. Koren. Fast direction-aware proximity for graph mining. In Proc. of KDD, 2007.

[73] TrustMyMail. http://www.trustmymail.com.

[74] Twitter. http://www.twitter.com.

[75] D. Wagner and F. Wagner. Between min cut and graph bisection. In MFCS ’93: Proceedings of the 18th International Symposium on Mathematical Foundations of Computer Science, pages 744–750, London, UK, 1993. Springer-Verlag.

[76] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. In Proc. OSDI, 2002.

[77] Wikipedia. Social network. http://en.wikipedia.org/wiki/Social_network.

[78] Z. Wu and R. Leahy. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 1993.

[79] YouTube. http://www.youtube.com.

[80] H. Yu, P. B. Gibbons, M. Kaminsky, and F. Xiao. SybilLimit: A near-optimal social network defense against Sybil attacks. In Proc. IEEE Symp. on Security and Privacy, 2008.

[81] H. Yu, M. Kaminsky, P. B. Gibbons, and A. Flaxman. SybilGuard: Defending against Sybil attacks via social networks. In Proc. of SIGCOMM, 2006.

[82] B. Zhang, T. S. E. Ng, A. Nandi, R. Riedi, P. Druschel, and G. Wang. Measurement-based analysis, modeling, and synthesis of the Internet delay space. In Proc. IMC, 2006.

Vita

Tae Won Cho was born in Seoul, Republic of Korea on 16 September 1978, the son of Chong Sung Cho and Yeon Sim Lee. He married Christy S. Choi on 3 July 2003. He has a son, Brandon H. Cho, and a daughter, Claire E. Cho. He received the Bachelor of Science degree in Computer Science from Korea Advanced Institute of Science and Technology in 2000. He worked at Oriental Chemical Information & Communication between 2000 and 2003 as a substitute for the military obligation. He joined The University of Texas at Austin in Fall 2003 and received the Master of Arts in Computer Science in December 2005 under the supervision of Professor Harrick Vin and Professor Yin Zhang. His M.A. thesis was on designing an automated network attack detection system using replay-based techniques. He started his Ph.D. in the Computer Science Department at The University of Texas at Austin under the supervision of Professor Yin Zhang. During the course of his Ph.D., he spent one internship at the Microsoft Security Technology Unit in Washington, two internships at AT&T Labs Research in New Jersey, and one internship at AT&T Labs in Texas.

Permanent address: 1 University Station Austin, Texas 78712

This dissertation was typed by Tae Won Cho.
