Higher Order Information Identifies Tie Strength

Arnab Sarker1, Jean-Baptiste Seby1, Austin R. Benson2, and Ali Jadbabaie1,3

1Institute for Data, Systems, and Society, MIT 2Department of Computer Science, Cornell University 3Laboratory for Information and Decision Systems, MIT

March 2021

Abstract A key question in the analysis of sociological processes on networks is to identify pairs of individuals who have strong or weak social ties. Existing approaches mathematically model the social network as a graph, and tie strength is often inferred by examining the number of shared neighbors between individuals, or equivalently, the number of triangles in the graph which contain a specific pair of individuals. However, this approach misses out on critical information because it does not distinguish the case where interactions occur among groups involving more than two individuals. In this work, we measure tie strength by explicitly accounting for these higher order interactions in the network, through the use of a new measure called Edge PageRank. We show how Edge PageRank can be interpreted as the steady-state outcome of a dynamic, message-passing social process that characterizes the strength of weak ties by appropriately discounting the effect of higher-order interactions involving three individuals. Empirically, we find that Edge PageRank outperforms standard measures in identifying tie strength in several large-scale social networks. These results provide a new perspective on tie strength and demonstrate the importance of incorporating higher-order interactions in social network analysis.

1 Introduction

Decades of sociological research has focused on determining which social connections provide the most valuable information to an individual [2, 10, 21, 25, 38, 46, 52]. An important finding from this body of works is that the most novel information surprisingly comes from casual acquaintances, as opposed to close family and friends [25]. These casual relationships, referred to as weak ties, are useful in generating new ideas [10], finding a job [24], and transferring knowledge within organizations [27]. The intuition for this apparent “strength” of weak ties is subtle yet simple: whereas close family and friends, or strong ties, may have similar sources of information to an individual, acquaintances often have access to different resources altogether which could potentially expand the social reach of an agent. arXiv:2108.02091v1 [cs.SI] 4 Aug 2021 Along with allowing an individual to optimally mobilize their social contacts [34, 52], tie strength is used in various modern data-driven tasks such as predicting the formation of new ties [13, 32, 35], structuring online interactions [16], and defining the various social groups of an individual [39]. Moreover, the concept of tie strength is used throughout the social sciences, providing a valuable dimension to understand academic output [54], performance of teams [48], the formation of industry structures [57], social contagion [11], labor markets [40], political participation [56], and communication patterns [3]. Determining and quantifying the strength of the social tie between two individuals, however, remains a difficult task. Although many studies categorize social ties as either weak or strong, in reality tie strength is not a dichotomy. Individuals may have their strongest ties with close friends and family, and their weakest ties with casual acquaintances, but they may also have ties of medium strength with individuals such as coworkers who they see regularly [38]. Moreover, in many settings, the only data available consists of which connections exist between individuals, without explicitly detailing the quality of the connections [21, 38, 32]. In these cases, the social network is defined by a set of individuals, referred to as nodes, and information about social connections between pairs of individuals, which are referred to as edges or ties.

1 The most common approaches to the identification of tie strength based on network information rely on the property of triadic closure. Triadic closure reflects the following property common to many social networks: if two individuals share a friend, then it is likely they will eventually become acquainted [47, 14]. Moreover, this likelihood increases if both individuals share a strong tie with their mutual contact [25]. Hence, if two individuals A and B share a strong tie, then it should be more likely that their contacts in the network know one another. These observations of social networks define one of the most widely used network-based measure of tie strength, embeddedness, which considers A and B to have a strong tie if a large proportion of their friends are mutual friends [21, 25]. Variations of embeddedness, such as the unweighted overlap metric, have also been used to identify tie strength [38, 41]. Still, embeddedness is a coarse measure. Local bridges, for example, are a type of edge whose constituent nodes do not have any shared neighbors, but can vary significantly in tie strength based on other network structure [46]. Furthermore, while embeddedness characterizes tie strength, it does not emphasize the functional role of a particular edge, i.e., embeddedness does not characterize structural properties that make weak ties important. Other approaches identify the role of an edge within a network in order to measure tie strength. For example, the measure of betweenness, which counts the number of shortest paths through an edge, has been used as a tie strength metric [31]. In this way, betweenness characterizes the strength of a tie by quantifying the way in which the edge can facilitate efficient communication in the network. However, betweenness can be misleading because it places equal weight on all shortest paths throughout the network (Figure 1a). Another metric is dispersion, which identifies strong ties due to marriage in social networks based on the fact that marital ties often have common connections across multiple social groups [5]. Another reasonable approach to measure the strength of edges in a network would be to try and extend a traditional measure of centrality defined on nodes to define a measure on edges. Edges with high centrality could then be considered strong or weak depending on the corresponding definition from node centrality. However, such graph-based measures cannot explicitly capture the effect of higher order interactions repre- sented by triangles. For example, using the PageRank measure on the so-called line graph (a new network where edges in the original network correspond to nodes, connected if they share a node in common from the original network) tends to only highlight edges connected to many other edges, signifying a measure of high edge degree (Figure 1b). All of these tie strength measures rely on representing the interaction in social network with a set of dyadic or pairwise edges. However, social interactions often take place among groups larger than two, inherently limiting existing approaches, and much of the information collected on social networks goes beyond dyads [8]. Rather than modeling networks with individuals as nodes and relationships as edges, we can use more general structures to explicitly encode higher order interactions that occur between groups of more than two individuals [9, 18, 17, 26, 59]. In fact, the principle of triadic closure may be extended to closure of groups: If A and B each know C, then A, B, and C are relatively more likely to engage in a group interaction [8]. However, this observation, and the additional information that comes with it, cannot be used by existing approaches when relationships are only represented in terms of graphs with edges connecting two individuals at a time. In this work, we incorporate the information from higher order interactions by modeling network infor- mation with a more general mathematical structure called a simplicial complex, as opposed to a graph. This allows us to explicitly distinguish between a group interaction among three people, and the case where only pairwise interactions exist. Such a rich data representation also allows us to use a larger set of tools and centrality measures. More specifically, we use the recently-developed Edge PageRank measure [50] which is a distinct from the aforementioned PageRank on edges via the line graph. This measure was recently introduced as a mathematical description of a random walk process, without a particular social interpreta- tion [50]. Here, we show that this notion of Edge PageRank is the outcome of a natural social process. More specifically, we define a random, local interaction process, where each state is a directed communication from one node to another along an edge, and the possible transitions between states are dictated by the presence of both edges and triangles in the network. After the random process reaches a steady state, such that each possible directed communication is assigned a probability, the measure then takes a difference between the orientations along each pair of nodes connected by an edge. In this way, an edge receives a large value if there is asymmetric communication in the steady state of the random process, indicating a strong tie. Based on how the presence of triangles affects the random process, the net result appropriately discounts

2 (a) Betweenness Centrality

(b) PageRank on the Line Graph

(c) Edge PageRank

Figure 1: Comparison of dyad-based centrality measures (a–b) to the Edge PageRank measure (c). The top four edges of each centrality are bolded in the left plots (five in the case of betweenness centrality, due to a tie), and the values of the metrics for each edge is presented in sorted order in the right plots. (a) Betweenness centrality, which has been used as a continuous measure of tie strength [31], equally weights shortest paths across the entire network, resulting in {5, 8} considered weaker than {12, 14}. (b) The PageRank measure, a standard measure of importance in graphs, can be applied to the line graph transformationto define a measure of importance on edges [14]. However, instead of measuring tie strength, the measure is influenced by the size of the communities and transitive closure, resulting in edges such as {5, 6} considered weaker than {1, 8}. (c) Edge PageRank centrality clearly identifies the bridges in the network. 3 Figure 2: Potential states after [2, 3]. If the state of the random walk is currently [2, 3], i.e. 2 has just sent a message to 3, then there are 3 possible types of steps that can be taken next. The lower walk (purple) indicates that, after 3 receives a message, then 3 may send a message to one of its neighbors, including 2. The upper walk (green) represents the effect of higher order information in the process. Once 2 sends a message to 3, then 2 can send a message to 4, indicating that a message is being sent to both members of the simplex, or 4 might send a message to 3, indicating that after 4 sees a message sent from 2 to 3, node 4 may take the action to communicate with node 3. the effect of triadic interactions, allowing the measure to highly correlate with tie strength (Figure 1c). We evaluate Edge PageRank on nine large-scale empirical datasets of diverse social interactions, includ- ing physical proximity contacts, emails, and text messages. Each dataset provides auxiliary information, such as the frequency of interactions, that we use to generate (approximately) continuous measures of tie strength [25]. We find that Edge PageRank more effectively captures tie strength than dyadic measures such as embeddedness. We also conduct large-scale experiments that illuminate the role of higher-order interactions in identifying global and local bridges in a network. Global bridges are edges whose removal will disconnect the network, whereas local bridges are edges whose end points have no common connections. Both types of bridge epitomize the strength of a weak tie: bridges connect two individuals in disparate communities to one another, facilitating the spread of novel information [10, 14]. Edge PageRank more effectively finds bridges in social networks compared to other measures of tie strength such as betweenness. Altogether, our results show that restricting to the graph-based representations of networks results in a loss of potentially critical information for identifying sociologically relevant features of a network. Incorpo- rating higher-order information from group interactions provides a novel and empirically valuable character- ization of weak ties.

2 Characterizing Weak Ties

Edge PageRank was recently introduced by two of the authors as a way to quantify the importance of pairwise connections from a dataset that has “higher order interactions” involving multiple nodes at the same time [50]. The measure was originally motivated by algebraic-topological generalizations of the classical PageRank measure on graphs [43]. In this section, we instead develop a stochastic process contextualized in terms of tie strength whose equilibrium corresponds to the Edge PageRank scores. This allows us to interpret large and small Edge PageRank scores as strong and weak ties. For simplicity, we assume that we have a dataset of interactions, where each interaction is between two or three nodes (people in the social network). (In the appendix, we discuss various ways to extend this to cases where we have interactions involving more than three people.) We call the interactions involving two nodes edges and those among three nodes triangles. Each triangle also induces a hyper-edge among all three pairs of its constituent nodes. This type of edge inducement allows us to formally model the data with a simplicial

4 complex, which is the basis for the mathematics of Edge PageRank and many other network analyses using ideas from algebraic topology [7, 12, 20, 23, 55, 28].

The Information Exchange Process Sociologists point out that any measure of centrality on a social network can be understood through an underlying process [19], and here we show this to be true for Edge PageRank. The “states” in our stochastic process are given by the two possible directions that information could flow along each edge. Formally, each edge between individuals A and B can result in two states for the process: A sending a message to B, or B sending a message to A, denoted [A, B] and [B,A], respectively. The process “walks” (transitions or moves) between these discrete states randomly. Hence, the random walk corresponds to a (random) series of messages being sent throughout the network, and Edge PageRank will be a function of the steady-state distribution of this random walk. The transitions between states of the random walk are based on three possible types of transitions: a lower walk, an upper walk, and a teleportation maneuver (Figure2). The lower walk represents a standard local information exchange in a graph, where after an individual receives a message, then they send a message to one of their neighbors. That is, if the state is [A, B], the next state in the lower walk is of the form [B,C] where C is connected to B.For symmetry, the lower walk goes in both directions, so that if the state is [B,C], the next state of the random walk can return to [A, B]. The upper walk incorporates the higher order information of the network. If A has just sent a message to B, and A, B, and C are a part of a triangle, then the upper walk would allow two additional possible states for the next message. In the first, A sends a message to C, emulating a group conversation in which A has said something to both B and C. In the second additional state, C sends a message to B, indicating that due to the higher-order structure, C has seen the message which has just passed from A to B, and can act on this to communicate with B. These higher order steps from the upper walk allow the property of simplicial closure, the generalization of triadic closure to group interactions [8], to manifest itself in the measure of tie strength. Edges which take part in triangles to have more communication flow through them in the opposite direction of the lower walk, as is explicit in Figure2 where the steps of the upper walk are in direct opposition to those of the lower walk. This opposition represents a “vorticity” due to the higher order information, and is a common observation in analyses of networks which include higher order interactions [33]. The final possible transition in the social process is a teleportation step. In this step, independent of the preceding state of the random walk, the state “teleports” to a pre-specified seed message from one particular node to another. Because the teleportation step requires a particular seed edge, we refer to the outcome of the process as personalized Edge PageRank, such that each Edge PageRank computation is specific to a particular choice of seed message. This teleportation step is required for the stochastic process, due to the way in which the centrality measure takes differences between the two probabilities on each edge in order to define a centrality. Namely, because of the symmetry in the underlying stochastic process due to the reversibility of the lower walk and symmetry in the upper walk, without the teleportation step the differences along edges would become 0. This symmetry of the random walk also requires us to use personalized Edge PageRank, as opposed to the standard formulation of PageRank which uses a teleportation step which is uniform across all possibilities. If the process were to use a uniform teleportation step, then all messages in the random walk would have the same steady-state probability as the messages in the opposite direction, resulting in values on each edge which are identically 0. As we will discuss further, taking differences in between the two probabilities on each edge is crucial to the identification of tie strength because it highlights limitations in communication which are typical of weak ties [3, 48]. Once a steady-state distribution for the stochastic message passing process is computed and the differences are taken across each edge, the final step in developing a centrality measure is to assign a value to each edge. To do so, we compute the total size of the vector by using the 2-norm of each personalized Edge PageRank vector. For a specific choice of seed messages, this operation highlights the presence of edges along which there is asymmetry in the communication flow defined by the stochastic process, because larger asymmetries result in higher difference values and hence larger sizes of the Edge PageRank vector. Further, using the size of the vector allows the computed values to be independent of which orientation of the edge is used for the seed message, i.e. whether the seed is [A, B] or [B,A].

5 Tie Strength Identification The primary features which allow Edge PageRank to identify weak ties are two-fold. First, the use of differences in the summarizing the outcome of the stochastic process allows the measure to highlight ties along which messages flow in one direction but not the other. This inability for communication to flow easily along both directions is common to the study of weak ties, as their position in the network makes communication particularly difficult [3, 48]. That is, because weak ties are often between individuals with different backgrounds or who belong to different communities, communication is often limited when compared to strong ties [3]. In the defined stochastic process, this bandwidth limitation of weak ties is reflected by the inability of communication in the stochastic process to easily pass through both orientations of an edge, causing an asymmetry. Second, the transitions in the stochastic process make it such that the presence of triangles in the data results in more symmetric communication along edges which take part in triangles. As noted previously, this is explicit in Figure2 as the green arrows of the upper walk are in opposing directions to the blue and purple arrows of the lower walk. In this sense, we say that the stochastic process discounts the effect of triadic interactions. That is, the upper walk allows triangles in social networks to manifest themselves by allowing more avenues for communication within the network, in such a way to reduce the asymmetries of communication flows. Hence, if the seed edge in the personalized Edge PageRank computation takes part in many triangles, we would expect communication to be more symmetric along the seed edge compared the case in which the seed edge is a weak tie. This builds on the intuition from the measure of embeddedness: if an edge is a part of many triangles, so that the two nodes on the edge participate in many group interactions, we should expect the tie to be strong [25]. One of the benefits of using the simplicial complex model is that we can utilize the result of the so-called Hodge decomposition to break Edge PageRank into three types of scores [50]. When we wish to summarize tie strength in a single value, the size of the Personalized Edge PageRank vector can be used, and otherwise we can use the three scores from the Hodge decomposition to get a refined picture of the role of an edge. To do so, before taking the size of differences along each edge, we can first decompose the personalized Edge PageRank vector into “curl,” “gradient,” and “harmonic” components (see, e.g. Appendix A.6, Figure6). At a high level, the curl component represents dynamics supported on triangles, the gradient component represents acyclic dynamics, and the remaining harmonic component represents cyclic dynamics larger than triangles [33]. (A formal discussion and derivation of these components can be found in the appendix.) We find that this decomposition can be valuable information for identifying tie strength empirically. By computing the size of each of these components, we get an additional triplet of information which provides a concise yet nuanced perspective on the role of each edge in the network.

A Note on Computation Additional higher-order interaction data gives rise to an increase in com- putational demands. For a simplicial complex with n nodes, e edges, and t triangles, in the worst case the simplicial complex may be constructed in time O(e3/2), although many implementations are more ef- ficient [50]. Once the simplicial complex is constructed, the Personalized Edge PageRank vector can be computed in time O(n2e + e2t), although different solution techniques which utilize matrix sparsity can be used to increase the efficiency of the implementation [50]. Ultimately, this results in algorithms which run in time that is polynomial in the input size n + e + t, which can then efficiently be applied to identify the strength of ties in social networks.

3 Data

Our experiments use two categories of datasets: one for tie strength prediction and the other for bridge identification. The datasets used for tie strength prediction have some auxiliary proxy for tie strength, such as the frequency of interaction between two individuals [25]. We use this auxiliary information only for evaluation; the interaction data input to the method are given by the set of interactions without frequency information, i.e., an interaction between individuals is included if they interacted at least once. For tie strength experiments, we use the following social networks (see Table1 for summary statistics).

• contact-primary-school, contact-high-school, contact-university. Nodes are individuals, and simplices form when individuals are in proximity of one another within a short time period. Data comes from individuals at a primary school [53], a high school [37], and a university [49].

6 Table 1: Summary statistics for datasets used in tie strength experiments

dataset nodes edges triangles edge density contact-primary-school 242 8,317 5,139 2.85e-01 contact-high-school 327 5,818 2,370 1.09e-01 email-Enron 144 1,344 1,159 1.31e-01 email-Eu 986 16,064 27,655 3.31e-02 india-villages 1,774 294,626 314,768 1.87e-01 sms-a 30,278 42,882 1,581 9.36e-05 sms-c 11,714 17,050 1,114 2.49e-04 college-msg 1,899 13,838 5,403 7.68e-03

Table 2: Summary statistics of datasets used for bridge identification experiments.

local global dataset nodes edges triangles bridges bridges edge density coauth-MAG-10 80,198 237,261 355,186 1,701 8,758 7.38e-05 coauth-ai-systems 27,488 86,994 100,902 952 2,310 2.30e-04 dbpedia-producer 32,383 93,342 184,694 2,749 9,421 1.78e-04 dbpedia-starring 61,727 668,459 2,885,569 5,392 2,959 3.51e-04 dbpedia-writer 65,165 335,125 1,385,303 581 5,241 1.58e-04 email-Eu 979 29,299 160,605 53 63 6.12e-02 genius-rap 47,167 128,808 215,049 6,057 16,471 1.16e-04

• email-Enron, email-Eu. Nodes are email addresses and a simplex is formed if individuals send or CC one another on emails within a week; the email-Eu dataset spans over 2 years of communication at a European research institute [8, 45], and the email-Enron dataset spans the lifetime of the American energy company Enron [8, 29].

• india-villages. Nodes are individuals in villages in India, and edges form between individuals based on survey data [6], and simplices correspond to individuals living in the same household.

• sms-a, sms-c, college-msg. Nodes are individuals, and a simplex forms between individuals if they all text or send a message to each other within a week [58, 44].

For all datasets except india-villages, tie strength is based on the number of times two individuals appeared in some interaction together. In other words, if two individuals are in contact frequently, then their tie is considered to be stronger. Since the distribution of frequency of contacts is heavy-tailed, we use the log of the frequency for tie strength in our experiments. For the india-villages dataset, tie strength is determined on a scale of 1 to 12 based on answers to a questionnaire take by each individual [6]. For bridge identification, we use the following social network datasets, where each network has at least 50 global bridges and at least 50 local bridges (see Table2 for summary statistics).

• coauth-ai-systems, coauth-MAG-10. Nodes are computer science researchers and simplices corre- spond to multiple authors coauthoring a paper. The coauth-ai-systems dataset consists of arti- cles from computer systems conferences (VLDB, SIGMOD) and machine learning conferences (AAAI, NeurIPS, ICML, IJCAI), from 2000 to 2016. The coauth-MAG-10 dataset is from ten large computer science conferences. Both datasets are based on the the Microsoft Academic Graph [1, 51].

• email-Eu. Nodes are email addresses and a simplex is formed if individuals send or CC one another on emails within a week (same as above).

• genius-rap. Nodes are rap music artists and simplices are sets of rappers collaborating on songs [8].

7 Table 3: Accuracy in predicting tie strength with different sets of regressor. Edge PageRank is consistently the most accurate, but the methods are more comparable for the messaging datasets.

Edge PageRank Node PageRank Dataset Name Components Embeddedness local baseline [38] Variants contact-primary-school 0.77 (±0.003) 0.39 (±0.007) 0.43 (±0.007) 0.16 (±0.008) contact-high-school 0.61 (±0.004) 0.35 (±0.008) 0.37 (±0.009) 0.22 (±0.013) email-Enron 0.57 (±0.015) 0.36 (±0.016) 0.39 (±0.015) 0.28 (±0.016) email-Eu 0.66 (±0.003) 0.54 (±0.005) 0.54 (±0.005) 0.46 (±0.007) india-villages 0.52 (±0.001) 0.37 (±0.001) 0.34 (±0.000) -0.01 (±0.001) sms-a 0.73 (±0.001) 0.72 (±0.001) 0.72 (±0.001) 0.71 (±0.002) sms-c 0.76 (±0.002) 0.75 (±0.003) 0.75 (±0.003) 0.75 (±0.003) college-msg 0.57 (±0.004) 0.55 (±0.003) 0.55 (±0.003) 0.56 (±0.003)

• dbpedia-producer, dbpedia-writer, dbpedia-starring. Nodes are individuals, and simplices cor- respond to producer collaborations on creative works, coauthoring written works, and costarring in movies, respectively [4].

4 Experimental Results 4.1 Higher Order Information Identifies Tie Strength We now show on empirical data that Edge PageRank can effectively predict tie strength (Table3). We consider the following network-based baselines: embeddedness (i.e., number of common neighbors) [21, 30, 36, 42], three local measures identified by Mattie et al. as being predictive of tie strength1 [38], and a set of regressors based on extending Node (standard) PageRank, namely the average of Node PageRank scores for each edge and Node PageRank scores on the line graph, and the same measures for 2-norms of personalized Node PageRank vectors. These baselines are compared against using a combination of the three Hodge components of personalized Edge PageRank in order to predict tie strength. We use linear regression to predict tie strength using each set of features, and report the accuracy as 1− the mean squared error of the predictions compared to observed tie strength for each edge in dataset (see Section6 for additional details on methods). In the previous experiments, we compare the three Hodge components of Edge PageRank as predictors to the singular metric of betweenness. When restricting to a single measure for which to identify tie strength, we find that the size of the Personalized Edge PageRank vector, as opposed to the combination of individual Hodge components, appears to have the most consistent performance (AppendixC, Table6), outperforming network-based measures on datasets on the contact, email, and village datasets. The various methods have comparable accuracy on the messaging datasets. We also measured the predictive power of each Edge PageRank component (curl, gradient, and harmonic) on tie strength. Through regressions on each individual component, we find that the harmonic component of Edge PageRank tends to be indicative of weak ties, whereas large curl and gradient components tend to indicate that a tie is strong (Figure3). Finally, part of the motivation for Edge PageRank was that the method can incorporate higher order interactions directly, under the assumption that higher order information should be informative. To support this, in the appendix, we report results from additional experiments on our datasets where higher order information is subsumed by pairwise information. That is, we fill in every triangle formed in the social network by pairwise interactions with a three-way higher order interaction. In these cases, Edge PageRank is much less accurate in predicting tie strength (AppendixC, Table7), indicating that triads formed via higher-order interactions are important data for the strength of any two individuals within that tried.

1These three measure are the sum of degrees of the nodes on each edge, the unweighted overlap measure, and the sum of clustering coefficients of each node on an edge

8 (a) Gradient Component (b) Curl Component

(c) Harmonic Component

Figure 3: Regression coefficients and 95% confidence intervals across datasets for individual components of Edge PageRank. A large harmonic component is indicative of weak ties, whereas large curl and gradient components are indicative of strong ties.

9 Table 4: Accuracy in predicting if an edge is a global bridge, a local bridge, or neither. In all cases, Edge PageRank outperforms betweenness in the classification task.

Dataset Name Edge PageRank Betweenness coauth-MAG-10 0.83 (±0.007) 0.58 (±0.058) coauth-ai-systems 0.77 (±0.026) 0.51 (±0.097) dbpedia-producer 0.83 (±0.006) 0.45 (±0.081) dbpedia-starring 0.94 (±0.003) 0.58 (±0.018) dbpedia-writer 0.93 (±0.012) 0.58 (±0.014) email-Eu 0.92 (±0.015) 0.78 (±0.070) genius-rap 0.79 (±0.006) 0.55 (±0.076)

4.2 Bridge Identification with Higher Order Information Bridges satisfy a natural structural criteria for weak ties and the possible spread of novel information [10]. While bridges are not always weak ties in population-scale networks [46], they are important conduits for information. Organizational sociologists have found importance in distinguishing between global and local bridges, as local bridges in an organization can often be indicators of inefficient organizational practices, whereas global bridges better serve the role of providing novelty [48]. For this reason, identifying and distinguishing between local and global bridges is also an important task. In general, efficient algorithms for identifying local and global bridges are known, as global bridges can be found in time O(n + e) and local bridges can be found in time O(e3/2), where n is the number of nodes in the network and e is the number of edges. However, these definitions are combinatorial in nature, and the only differentiation between types of local bridges are due to their tie range, which measures the distance between the second shortest path between neighboring nodes [46]. Hence, measures such as betweenness have been developed in order to better quantify the way in which edges act as a bridge [22]. Our large-scale experiments indicate that Edge PageRank can effectively identify and distinguish between local and global bridges, outperforming betweenness. This suggests that Edge PageRank may be a more appropriate measure for determining the extent to which an edge acts as a bridge and provides additional evidence that Edge PageRank and higher-order information are effective for identifying tie strength. Empirically, we find that Edge PageRank outperforms betweenness in identifying and distinguishing between global and local bridges in the multiclass classification task (Table4). Here, we use multiclass logistic regression to predict if an edge is a global bridge, a local bridge, or neither, using either the size of the Personalized Edge PageRank vector or betweenness as the feature for prediction. Accuracy is then reported using the 0/1 classification loss (see Section6 for additional details on methods). The curl, gradient, and harmonic components are also particularly useful for this task, and in fact only two of the three components are required for near perfect classification (AppendixD, Table8). This is supported by the logistic regression decision boundary when two Edge PageRank components are used, as illustrated by the coauth-ai-systems dataset (Fig.4). In the appendix, we provide mathematical justification for the ability of these components to identify bridges. In particular, the gradient component can be associated with global bridges (AppendixE, Propo- sition1), and the harmonic component can be associated with local bridges (AppendixE, Proposition2). We also show that the decomposition on simpler vectors than the Edge PageRank vectors are also effective for bridge identification (AppendixF). When using a single metric to identify and distinguish between types of bridges, as opposed to combi- nations individual Hodge components of the personalized PageRank vectors, we find that the 2-norm of the Personalized Edge PageRank vector provides the best accuracy for the multiclass classification (AppendixD, Table9). As discussed in Section2, this ability of the personalized Edge PageRank vector is not unexpected, as its underlying stochastic process accentuates those edges who pass information from one community to another.

10 Figure 4: Decision boundaries of logistic regression using Edge PageRank components as regressors, illus- trated for the couath-ai-systems dataset. In (a) and (b), the curl component of Edge PageRank determines if an edge is either type of bridge, and then the remaining component distinguishes between local and global bridges. In (c), we see a unique tradeoff between harmonic and gradient components of Edge PageRank in the decision process. In particular, these plots suggest that the local bridges with the lowest harmonic and highest gradient values correspond to the “most global” of the local bridges. This interpretation is supported by a −0.55 correlation between Harmonic Edge PageRank and tie range of local bridges, and a 0.56 correlation between Gradient Edge PageRank and tie range of local bridges.

5 Discussion

While measures based on dyadic interactions have seen success in identifying the strength of ties, higher order information carries valuable network information for this task. We proposed the use of the Edge PageRank measure to identify weak and strong ties in networks, as it uses higher order information and models the steady state of a dynamic process related to tie strength. The process identifies edges with an asymmetric pattern of communication, and selects weak ties based on imbalance of communication flow on both directions of an edge, which is consistent with the role of weak ties described in the sociology literature [3,2, 48] as well as the role of weak ties in social contagion [11]. Moreover, the dynamic process encodes higher order information such that the inclusion of triangles reduces this asymmetry in communication. Edges which are a part of many triangles together have lower asymmetries in communication. This generalizes the intuition from the measure of embeddedness, which has long been known to be related to tie strength [25], into a form which explicitly accounts for interactions among all people in a triad. We also show that Edge PageRank can be used to identify bridges, as bridges tend to be a topological indicator of strong and weak ties [25, 14, 46]. Although our datasets only contain local bridges that have relatively small path lengths upon removal, the gradient component in tie strength regressions tends to be positive, which is consistent with recent results on the strength of long-range ties [46]. The observed relationship between Edge PageRank and tie strength has important implications for our understanding of social networks. Notably, the results suggest that higher order interactions are indicators of important sociological features in networks, and suggests the use of higher order information in other social network analysis tasks such as community detection.

6 Methods

For the experiments on tie strength, we denote the dependent variable as se, which is a real number which measures the tie strength of each edge, as defined Section3 for each dataset. We then compute a set of independent variable for each edge based on the network topology, either using personalized Edge PageRank and/or its components, or baselines from the literature. In all experiments, the teleportation parameter for Edge PageRank is set to α = 2/2.5 = 0.8 and for Node Pagerank baselines it is set to α = 0.85 to be consistent with previous literature [43, 50]. These independent variables are then used in a linear regression

11 on a training set, which allows for estimates of tie strength sbe to be computed for each edge. Then, on a test 1 P 2 set of edges T , we compute 1 − |T | e∈T (sbe − se) as a measure of accuracy of the prediction. The process is repeated five times, where each time the training set consists of a disjoint set of edges comprising 1/5 of the total dataset, i.e. we use 5-fold cross validation. The mean of the test accuracies and standard deviations are reported in Tables3 and6. For experiments on bridges, the dependent variable ye is categorical, as the edge is either a global bridge, a local bridge, or neither. For experiments, we take a subsample of the dataset to make sure each class of edges is equally represented. Then, using different sets of independent variables, we learn linear logistic regression model with a training set to generate a prediction ybe based on different features of the edge based 1 P on network topology, as done for tie strength. For a test set T , accuracy is computed as |T | e∈T 1(ybe = ye). The process is again repeated five times using 5-fold cross validation, and the mean of the test accuracies and standard deviations are reported in Tables4,8, and9.

Acknowledgements

A.S., J.B.S., A.R.B., and A.J. were all supported in part by ARO Award W911NF-19-1-0057. A.R.B. was supported in part by NSF CAREER Award IIS-2045555. A.J.’s research was supported in part by a Vannevar Bush Faculty Fellowship from the Office of Secretary of Defense.

12 References

[1] Ilya Amburg, Nate Veldt, and Austin R. Benson. Clustering in graphs and hypergraphs with categorical edge labels. In Proceedings of the Web Conference, 2020. [2] Sinan Aral. The future of weak ties. American Journal of Sociology, 121(6):1931–1939, 2016. [3] Sinan Aral and Marshall Van Alstyne. The diversity-bandwidth trade-off. American journal of sociology, 117(1):90–171, 2011.

[4] S¨orenAuer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007. [5] Lars Backstrom and Jon Kleinberg. Romantic partnerships and the dispersion of social ties: a network analysis of relationship status on facebook. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pages 831–841, 2014. [6] Abhijit Banerjee, Arun G Chandrasekhar, Esther Duflo, and Matthew O Jackson. The diffusion of microfinance. Science, 341(6144), 2013. [7] Sergio Barbarossa and Mikhail Tsitsvero. An introduction to hypergraph signal processing. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6425–6429. IEEE, 2016. [8] Austin R Benson, Rediet Abebe, Michael T Schaub, Ali Jadbabaie, and Jon Kleinberg. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences, 115(48):E11221– E11230, 2018.

[9] Claude Berge. Hypergraphs: combinatorics of finite sets, volume 45. Elsevier, 1984. [10] Ronald S Burt et al. The social capital of structural holes. The new economic sociology: Developments in an emerging field, 148(90):122, 2002. [11] Damon Centola and Michael Macy. Complex contagions and the weakness of long ties. American journal of Sociology, 113(3):702–734, 2007.

[12] Joseph Minhow Chan, Gunnar , and Raul Rabadan. Topology of viral evolution. Proceedings of the National Academy of Sciences, 110(46):18566–18571, 2013. [13] Hially Rodrigues De S´aand Ricardo BC Prudˆencio. Supervised link prediction in weighted networks. In The 2011 international joint conference on neural networks, pages 2281–2288. IEEE, 2011.

[14] David Easley and Jon Kleinberg. Networks, crowds, and markets, volume 8. Cambridge university press Cambridge, 2010. [15] Beno Eckmann. Harmonische funktionen und randwertaufgaben in einem komplex. Comment. Math. Helv., 17(1):240–255, 1944.

[16] Shelly D Farnham and Elizabeth F Churchill. Faceted identity, faceted lives: social and technical issues with being yourself online. In Proceedings of the ACM 2011 conference on Computer supported cooperative work, pages 359–368, 2011. [17] Scott L Feld. The focused organization of social ties. American journal of sociology, 86(5):1015–1035, 1981.

[18] Peter Frankl and Vojtˇech R¨odl.Extremal problems on set systems. Random Structures & Algorithms, 20(2):131–164, 2002. [19] Noah E Friedkin. Theoretical foundations for centrality measures. American journal of Sociology, 96(6):1478–1504, 1991.

13 [20] Abhirup Ghosh, Benedek Rozemberczki, Subramanian Ramamoorthy, and Rik Sarkar. Topological signatures for fast mobility analysis. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 159–168, 2018. [21] Eric Gilbert and Karrie Karahalios. Predicting tie strength with social media. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 211–220, 2009. [22] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Pro- ceedings of the national academy of sciences, 99(12):7821–7826, 2002. [23] Chad Giusti, Eva Pastalkova, Carina Curto, and Vladimir Itskov. Clique topology reveals intrinsic geo- metric structure in neural correlations. Proceedings of the National Academy of Sciences, 112(44):13455– 13460, 2015. [24] Mark Granovetter. Getting a job: A study of contacts and careers. University of Chicago press, 2018. [25] Mark S Granovetter. The strength of weak ties. American journal of sociology, 78(6):1360–1380, 1973. [26] Mangesh Gupte and Tina Eliassi-Rad. Measuring tie strength in implicit social networks. In Proceedings of the 4th Annual ACM Web Science Conference, pages 109–118, 2012. [27] Morten T Hansen. The search-transfer problem: The role of weak ties in sharing knowledge across organization subunits. Administrative science quarterly, 44(1):82–111, 1999. [28] Junteng Jia, Michael T Schaub, Santiago Segarra, and Austin R Benson. Graph-based semi-supervised & active learning for edge flows. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 761–771, 2019. [29] Bryan Klimt and Yiming Yang. The enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217–226. Springer, 2004. [30] Gueorgi Kossinets and Duncan J Watts. Empirical analysis of an evolving social network. science, 311(5757):88–90, 2006. [31] David Krackhardt, N Nohria, and B Eccles. The strength of strong ties. Networks in the knowledge economy, 82, 2003. [32] Ning Li, Xu Feng, Shufan Ji, and Ke Xu. Modeling relationship strength for link prediction. In Pacific- Asia Workshop on Intelligence and Security Informatics, pages 62–74. Springer, 2013. [33] Lek-Heng Lim. Hodge laplacians on graphs. SIAM Review, 62(3):685–715, 2020. [34] Nan Lin and Mary Dumin. Access to occupations through social ties. Social networks, 8(4):365–385, 1986. [35] Linyuan L¨uand Tao Zhou. Link prediction in weighted networks: The role of weak ties. EPL (Euro- physics Letters), 89(1):18001, 2010. [36] Peter V Marsden and Karen E Campbell. Measuring tie strength. Social forces, 63(2):482–501, 1984. [37] Rossana Mastrandrea, Julie Fournet, and Alain Barrat. Contact patterns in a high school: a compar- ison between data collected using wearable sensors, contact diaries and friendship surveys. PloS one, 10(9):e0136497, 2015. [38] Heather Mattie, Kenth Engø-Monsen, Rich Ling, and Jukka-Pekka Onnela. Understanding tie strength in social networks using a local “bow tie” framework. Scientific reports, 8(1):1–9, 2018. [39] Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In NIPS, volume 2012, pages 548–56. Citeseer, 2012. [40] James D Montgomery. Social networks and labor-market outcomes: Toward an economic analysis. The American economic review, 81(5):1408–1418, 1991.

14 [41] Mark EJ Newman and Juyong Park. Why social networks are different from other types of networks. Physical review E, 68(3):036122, 2003. [42] J-P Onnela, Jari Saram¨aki,Jorkki Hyv¨onen,Gy¨orgySzab´o,David Lazer, Kimmo Kaski, J´anosKert´esz, and A-L Barab´asi.Structure and tie strengths in mobile communication networks. Proceedings of the national academy of sciences, 104(18):7332–7336, 2007. [43] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999. [44] Pietro Panzarasa, Tore Opsahl, and Kathleen M Carley. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology, 60(5):911–932, 2009. [45] Ashwin Paranjape, Austin R Benson, and Jure Leskovec. Motifs in temporal networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 601–610, 2017. [46] Patrick S Park, Joshua E Blumenstock, and Michael W Macy. The strength of long-range ties in population-scale social networks. Science, 362(6421):1410–1413, 2018. [47] Anatol Rapoport. Spread of information through a population with socio-structural bias: I. assumption of transitivity. The bulletin of mathematical biophysics, 15(4):523–533, 1953. [48] Ray Reagans and Bill McEvily. Network structure and knowledge transfer: The effects of cohesion and range. Administrative science quarterly, 48(2):240–267, 2003. [49] Piotr Sapiezynski, Arkadiusz Stopczynski, David Dreyer Lassen, and Sune Lehmann. Interaction data from the copenhagen networks study. Scientific Data, 6(1):1–10, 2019. [50] Michael T Schaub, Austin R Benson, Paul Horn, Gabor Lippner, and Ali Jadbabaie. Random walks on simplicial complexes and the normalized hodge 1-laplacian. SIAM Review, 62(2):353–391, 2020. [51] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th international conference on world wide web, pages 243–246, 2015. [52] Sandra Susan Smith. “don’t put my name on it”: Social capital activation and job-finding assistance among the black urban poor. American journal of sociology, 111(1):1–57, 2005. [53] Juliette Stehl´e,Nicolas Voirin, Alain Barrat, Ciro Cattuto, Lorenzo Isella, Jean-Fran¸coisPinton, Marco Quaggiotto, Wouter Van den Broeck, Corinne R´egis,Bruno , et al. High-resolution measurements of face-to-face contact patterns in a primary school. PloS one, 6(8):e23176, 2011. [54] Richard Swedberg. Economics and Sociology: redefining their boundaries: conversations with economists and sociologists. Princeton University Press, 1990. [55] Alireza Tahbaz-Salehi and Ali Jadbabaie. Distributed coverage verification in sensor networks without location information. IEEE Transactions on Automatic Control, 55(8):1837–1849, 2010. [56] Sebasti´anValenzuela, Teresa Correa, and Homero Gil de Zuniga. Ties, likes, and tweets: Using strong and weak ties to explain differences in protest participation across facebook and twitter use. Political communication, 35(1):117–134, 2018. [57] Gordon Walker, Bruce Kogut, and Weijian Shan. Social capital, structural holes and the formation of an industry network. Organization science, 8(2):109–125, 1997. [58] Ye Wu, Changsong Zhou, Jinghua Xiao, J¨urgenKurths, and Hans Joachim Schellnhuber. Evidence for a bimodal distribution in human communication. Proceedings of the national academy of sciences, 107(44):18803–18808, 2010. [59] Jian Xu, Thanuka L Wickramarathne, and Nitesh V Chawla. Representing higher-order dependencies in networks. Science advances, 2(5):e1600028, 2016.

15 Table 5: Simplicial Complex Terminology

Term Definition Example(s) from Figure5 Vertex Set A set V{1, 2, 3, 4, 5} k-Simplex A subset of V of size k + 1 {1} is a 0-simplex, {1, 2} is a 1-simplex, {1, 2, 3} is a 2-simplex Face (of Sk)A(k − 1)-simplex Sk−1 ⊂ Sk {1, 2} is a face of {1, 2, 3} Co-Face (of Sk)A(k + 1)-simplex such that Sk ⊂ Sk+1 {1, 2, 3} is a co-face of {1, 2} Lower Adjacent k-simplices which share a face {3, 4} and {2, 4} Upper Adjacent k-simplices which share a co-face {1, 3} and {2, 3}

nk The number of k-simplices in a simplicial complex n0 = 5, n1 = 7, n2 = 2

1 3 5

2 4

Figure 5: An example of a simplicial complex X . Because the 2-simplex {1, 2, 3} is in X , then X also contains the 1-simplices {1, 2}, {2, 3}, {1, 3} and the nodes (0-simplices) {1}, {2}, {3}. Arrows in the example represent the orientation of each simplex, which provide a convention for the direction of flow.

A Simplicial Complexes and Edge PageRank

In many concrete problems, interactions between actors in a network are polyadic. This is the case for in- stance in communication networks, co-authorship networks, and even biological protein-interaction networks [8]. Therefore, tools and methods from graph theory, which restrict to dyadic interactions, may not be suffi- cient to analyze such data. In this section, we introduce the modeling of networks by simplicial complexes, which provides a useful framework to account for higher order interactions in networks.

A.1 Simplicial complexes Let V be a set of finite vertices. A k-simplex Sk is a subset of V with cardinality k + 1, i.e. a collection with no repeated elements. A simplicial complex is then defined as follows.

Definition 1. A simplicial complex X is a set of simplices such that if S ∈ X , then every subsets of S is in X . We summarize some key terminology associated with simplicial complexes in Table5. One important aspect of simplicial complexes is the notion of orientation. For the tools of algebraic topology to be well defined, we require an orientation for each simplex. For the purpose of this work, we will index nodes with natural numbers and the lexicographic ordering to define orientations on simplices. This orientation is shown using arrows in Figure5, and we will denote the oriented simplices with braces, e.g. [1 , 2] or [3, 4, 5]. Such orientation is important, for example, when we wish to define functions such as edge flows on simplicial complexes.

A.2 Functions on Simplicial Complexes When defining measures on simplicial complexes, we must be careful to appropriately define notions of functions on simplices. In the case of graphs, functions on nodes are typically represented as vectors. For

16 example, PageRank can be represented as a vector where each node’s PageRank score corresponds to an entry. When moving beyond nodes, we require that functions defined on higher order simplices be alternating. For example, an alternating function X on the set of edges E ⊆ X must satisfy

X(i, j) = −X(j, i) , if {i, j} ∈ E, and X(i, j) = 0 otherwise. More generally, a function is alternating if a permutation of its arguments results in a sign change corresponding to the parity of the permutation [33]. So, an alternating function on 2-simplices Φ will satisfy

Φ(i, j, k) = Φ(j, k, i) = Φ(k, i, j) = −Φ(i, k, j) = −Φ(k, j, i) = −Φ(j, i, k) .

A.3 Boundary operators In defining Edge PageRank, a measure on edges, we must describe the operators which allow relationships to form between simplices and their faces and co-faces. Much in the way PageRank defines a centrality using the co-faces (edges) of nodes, Edge PageRank will use both faces and co-faces to define a centrality on edges. n0×n1 We define B1 ∈ R as the incidence matrix of the underlying graph of a simplicial complex. More explicitly, if [i, j] is an (ordered) edge of the simplicial complex, i.e. i < j and {i, j} ∈ X , then B1[i, [i, j]] = −1 and B1[j, [i, j]] = 1, and all other entries of the column corresponding to [i, j] are 0 (Example1). This careful definition of signs helps to preserve functions on edges as being alternating as defined above. The matrix B1 is thought of as the graph-theoretic analogue of the divergence operator in multivariate calculus, as it sums the flows coming in and out of each node in order to provide a value at each node. Similarly, the matrix > −B1 can be thought of as an analogue to the gradient operator in multivariate calculus, as it takes the difference between potentials of incident nodes to define a function on the edge between them. [33]. n1×n2 Similarly, when moving to higher order interactions, we can define B2 ∈ R . For each 2-simplex [i, j, k], again ordered using natural ordering of nodes, B2[[i, j], [i, j, k]] = B2[[j, k], [i, j, k]] = 1, B2[[i, k], [i, j, k]] = −1, and remaining entries of the column are 0, where again the choice of signs is for mathematical consistency. > This definition allows B2 to be thought of as the analogue of the curl operator in multivariate calculus, since it sums edge flows around triangles, and allows B2 itself to be considered as the adjoint of the curl operator [33].

Example 1. In Figure5, the boundary maps B1 and B2 are given by the following matrices.

[1, 2] [1, 3] [2, 3] [2, 4] [3, 4] [3, 5] [4, 5] 1  −1 −1 0 0 0 0 0  2 1 0 −1 −1 0 0 0    B1 = 3 0 1 1 0 −1 −1 0    4 0 0 0 0 1 0 −1  5 0 0 0 0 0 1 1

[1, 2, 3] [3, 4, 5] [1, 2] 1 0  [1, 3]  −1 0    [2, 3] 1 0    B2 = [2, 4] 0 0    [3, 4] 0 1    [3, 5] 0 −1  [4, 5] 0 1 The higher-order boundary maps enable us to generalize the graph Laplacian operator for graphs with higher-order interactions.

17 A.4 Hodge Laplacians The Hodge Laplacian operators, also called Eckmann Laplacians [15], are a generalization of the graph Lapla- cian for simplicial complexes which incorporate higher-order interactions. The kth order Hodge Laplacian is defined

> > Lk = Bk Bk + Bk+1Bk+1 . (1)

The above definition can be seen as a generalization of the Graph Laplacian. Indeed, L0 corresponds precisely to the Laplacian matrix of the underlying graph of the simplicial complex. In the context of Edge PageRank, we focus on the first order Hodge Laplacian defined by the boundary operators presented in Section A.3, that is, we restrict to the case k = 1, and refer the reader to [33, 50] for a more general presentation of Hodge Laplacians of order k. We may then define the normalized 1-Hodge Laplacian Definition 2 (Normalized 1-Hodge Laplacian [50]). The normalized Hodge Laplacian for a simplicial complex with boundary operators B1 and B2 is defined

> −1 −1 L1 = D2B1 D1 B1 + B2D3B2D2 , (2) where D2 = max(diag(|B2|1,I)) is a diagonal matrix of an adjusted upper degree of edges, D1 = 2 · 1 diag(|B1|D21) is a diagonal matrix of weighted degrees of the nodes, and D3 = 3 I places the same weight on each 2−simplex.

We note here that the normalized 1-Hodge Laplacian L1 has a form similar to the normalized Laplacian for a graph, ˜ > ˜ −1 L0 = B1D0B1 D1 , (3) where D˜ 0 = I is the diagonal matrix of the number of faces of each node (which is 1 by convention) and D˜ 1 is a diagonal matrix of the number of co-faces of each node, i.e. its degree. As nodes do not have lower adjacent connections, we see that the normalized graph Laplacian is defined by one matrix as opposed to the sum of two matrices.

A.5 Edge PageRank Schaub et al. [50] extend the notion of the PageRank centrality measure to edges. In Section2, we provide a more detailed analysis of the Edge PageRank measure from a social perspective; we include the definition here for exposition.

Definition 3 ([50, Definition 6.1]). Let X be a simplicial complex with normalized Hodge Laplacian L1, x > be a vector of the form x = V µ where µ is a probability vector, and β ∈ (2, ∞). The PageRank vector π1 of the edges is then defined as the solution to the linear system

(βI + L1)π1 = (β − 2)x (4)

When applying this definition, we take x to be an indicator vector, whose i-th entry is non-zero. Other- wise, different orientations of the simplices in the simplicial complex may result in different vectors for the same underlying simplicial complex [50]. Instead, by using indicator vectors for x, the entries of the Edge PageRank vector π1 differ only in sign when orientations are changed. Using the normalized Hodge decomposition defined in Section A.6, we also consider the projections of the vector π1 onto the gradient, curl and harmonic spaces. We note that, in order to provide a score to each edge, we compute the two-norm of the stationary distribution vector π1 (or alternatively, the two-norm of its projection vectors). This allows us to assign a notion of importance to each edge that is invariant of the orientation of the simplicial complex.

18 1 3 1.24 1.90 1 3 5 1 3 5 2.80 2.76 3 2

-1 -2 -1.52 -0.90 1 1.28 2 4 2 4

(a) Original Flow, x (b) Gradient Component. xg -0.33 1.00 0.10 0.10 1 3 5 1 3 5 -0.10 -1.00 0.20 0.33

0.33 -1.00 0.20 -0.10 0 -0.30 2 4 2 4

(c) Curl Component, xc (d) Harmonic Component, xh

Figure 6: Example of an (unnormalized) Hodge decomposition of a flow on a simplicial complex. Gradient flows sum to 0 around cycles; curl flows are non-zero around triangles and 0 elsewhere; harmonic flows sum to 0 around triangles but are non-zero around longer cycles.

A.6 Hodge decomposition

Using the definition of the normalized Hodge Laplacian, it can be shown that the edge vector space Rn1 can be decomposed as the union of orthogonal subspaces. This decomposition is called the (normalized) Hodge decomposition [50].

n1 > R = im(B2) ⊕D−1 im(D2B1 ) ⊕ −1/2 ker(L1) (5) 2 D2

> −1 where ⊕ −1 is the union of orthogonal subspaces with respect to the inner product hu, vi −1 = u D v, D2 D2 2 The Hodge decomposition shows that every vector x ∈ Rn1 can be decomposed as follows x = xg ⊕ xc ⊕ xh (6)

g −1/2 > c −1/2 h where x is the projection of x onto im(D2 B1 ), x is the projection of x into im(D2 B2), and x h satisfies L1x = 0. The importance of the (normalized) Hodge Decomposition is in its interpretability, as it allows for different components of a flow on edges to become more clear. The interpretation of this decomposition is as follows:

> • im(D2B1 ) is the weighted cut space of the edges [50]. This subspace can be seen as the linear combinations of simplices whose cyclic components are zero. In this sense, g is a gradient flow between two nodes, as it corresponds to flows that have a (weighted) sum of 0 around cycles.

• im(B2) is made of the flows that can be described via local circulations along 2-simplices. r is a circulation around a filled triangle in the graph and can be seen as a curl flow.

• ker(L1) can be seen as a space of flows which are orthogonal to those described above. Hence, such flows are locally consistent around 2-simplices, but are globally inconsistent around longer cycles. Rather than use the weighted Hodge Decomposition above, it is sometimes convenient to use the sym- metric normalized Hodge Laplacian, s −1/2 1/2 L1 = D2 L1D2 , (7) which allows for the Hodge Decomposition to be defined with respect to the standard inner product, as opposed to the inner product weighted by D2.

19 A.7 Liftings of Edge Flows One way in which the effects of the Laplacian operator of (2) may be understood is through the use of a higher-dimensional, lifted state space. In particular, for any flow f ∈ Rn1 defined on the edges of a simplicial complex, we may define a lifted version of the flow which corresponds to the two possible orientations of each edge. To do so, we first define the matrix

+I  V = n1 . (8) −In1

The lifting2 of a particular flow f, which we denote fb, is then simply fb = V f = (f >, −f >)>. Importantly, > the lifting operator V is such that V V = 2In1 , so the Moore–Penrose pseudoinverse can be denoted † 1 > V = 2 V . In particular, this implies that any flow in the lifted space, say gb can be projected to the † lower-dimensional space of edges through the operation V gb. These notions then allow us to define corresponding notions of the lifting of a matrix operator.

Definition 4 (Definition 3.1, [50]). A matrix N ∈ R2n1×2n1 is a lifting of a matrix M ∈ Rn1×n1 if V >N = MV > .

In particular, if M has a lifting N, then by multiplying each equation on the right with V , we see M = V †NV . In this sense, the multiplication Mx can be interpreted in three steps: First, the matrix x is lifted to become V x. Then, the lifted operator N is applied to create NV x. Finally, the 2n1 dimensional vector is projected to the space of vectors, resulting in the final vector V †NV x = Mx. The notion of lifting becomes particularly important in developing a conceptual understanding of Edge PageRank, as we discuss in Section2. With the appropriate terminology defined for analyzing higher order interactions, we now shift our attention to the application of the concepts above to empirical data.

B Edge PageRank as a Dynamical System

In the same ways in which PageRank may be as a dynamical system, so can Edge PageRank.

(Node) PageRank as a Dynamical System We first recall the parameters necessary to define PageRank in a graph [14]. PageRank requires three components on a given a directed graph G = (V, E). First, a probability transition matrix P ∈ Rn0×n0 such that P[i, j] is non-zero only if (i, j) ∈ E and P is a column- stochastic matrix. Second, a preference vector v ∈ Rn0 representing a probability distribution on vertices of a graph. Third, PageRank requires a teleportation parameter α ∈ (0, 1). Algebraically, the PageRank vector π is defined as the solution to the system of equations

(I − αP)π = (1 − α)v .

In terms of the normalized Graph Laplacian (3), if P is a standard random walk on G, then the above may be written ((1 − α)I + αL0)π = (1 − α)v , which parallels the form in (4). In particular, the solution vector π can be thought of in two different ways. One interpretation of the vector π is as the equilibrium of a dynamical system, specifically the system

πk+1 = αP πk + (1 − α)v π0 = 0 . (9)

Solving this system explicitly yields

k X j πk = (1 − α) (αP ) v , (10) j=0

2 In general, we will use the b notation to represent mathematical objects pertaining to the lifted space.

20 which converges to the PageRank vector. In this context, we can think of PageRank as arising from a social process by which a node’s centrality is caused by its own importance score according to the preference vector v as well as the importance of and their k-hop neighborhoods, where the importance of nodes at increasing distances is modulated by the parameter α. We can make a similar comparison for the Edge PageRank process, and show that Edge PageRank can be represented as the equilibrium point of a dynamical system as well as the stationary distribution of a random walk.

B.1 A Dynamical System Perspective The Edge PageRank vector (4) may be represented as the outcome of the following dynamical system L x π = − 1 π + π = 0 . (11) k+1 β k β 0

We note that, if β > 2, the system above is stable since L1 will have eigenvalues bounded by 2 [50]. We also note that both the seed vector x and the iterates π may be written in terms of their normalized Hodge decomposition as in (6). Explicitly solving (11) yields a closed form solution k  j X L1 x π = − . (12) k β β j=0 That is, much in the way α modulates the rate at which farther neighbors impact a node’s PageRank score, the parameter β modulates the transience of the normalized 1-Hodge Laplacian and discounts higher order powers of the operator. Using the normalized Hodge decomposition provides further insight on the dynamics of the system (11).

h k  > −1 j g k  −1 j c x X D2B D B1 x X B2D3B2D x π = + − 1 1 + − 2 . (13) k β β β β β |{z} j=0 j=0 πh | {z } | {z } k g πc πk k As a result, much in the way PageRank can be thought of as arising from a dynamical system, we may think of Edge PageRank as three processes on the seed vector x, one which preserves on the harmonic component of the seed vector, another which distorts the gradient component, and a third which acts on the curl component In the form (13), we see that the harmonic component of π is undisturbed by the dynamical system. This is consistent with the fact that the dynamics are given by L1, so that any flow in the nullspace of L1 will not be affected. > −1 The gradient component of the system is acted on by the matrix D2B1 D1 B1, where the successive operators are modulated by the parameter β. In particular, the action of this operator may be interpreted as a weighted gradient of divergence. That is, the operator takes in a flow in the gradient space, computes the divergence to produce node values, weights these node values, and then computes a weighted gradient from the node values. −1 Similarly, the curl component of the system is acted on by the matrix B2D3B2D2 . This matrix has a similar effect as that in the gradient space, applying a weighted curl operator on a flow in the curl space, and then normalizing this flow among the triangle and applying the adjoint curl operator as a means to average out the curl flow along the triangles. Ultimately, the form (13) suggests that Personalized Edge PageRank unifies the components of Hodge Decomposition of the indicator vectors in a precise and interpretable fashion.

C Additional Results for Tie Strength

Figures7,8, and9 provide a short example displaying how Edge PageRank can be correlated with tie strength, and display the connections between these experiments on tie strength and Section2. In the

21 0.01 0.19 0.01 0.01 0.19 0.01 2 3 4 5 2 3 4 5

0.0 0.01 0.0 0.01 0.0 0.01 0.01 0.0 0.0 0.01 0.01 0.0 0.01 0.0 0.01 0.0

0 1 6 7 0 1 6 7 0.0 0.0 0.0 0.0 (a) Personalized Edge PageRank Vector (b) Gradient Component 0.0 0.0 0.0 0.0 0.0 0.0 2 3 4 5 2 3 4 5

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0 1 6 7 0 1 6 7 0.0 0.0 0.0 0.0 (c) Curl Component (d) Harmonic Component

Figure 7: Personalized Edge PageRank Decomposition of (3, 4), a weak tie in the network.

-0.04 0.0 0.0 0.01 0.0 0.0 2 3 4 5 2 3 4 5

0.0 0.0 0.0 0.0 0.0 0.0 -0.01 -0.01 0.0 0.0 0.04 0.04 0.0 0.18 0.0 0.08

0 1 6 7 0 1 6 7 0.0 0.0 0.01 -0.04

(a) Personalized Edge PageRank Vector (b) Gradient Component

0.0 0.0 0.05 0.0 0.0 0.0 2 3 4 5 2 3 4 5 0.0 0.0 0.0 0.0 0.0 0.0 -0.05 -0.05 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0 1 6 7 0 1 6 7 0.0 0.0 0.0 0.05 (c) Curl Component (d) Harmonic Component

Figure 8: Personalized Edge PageRank Decomposition of (5, 6), a tie of “medium” strength in the network.

22 -0.04 0.0 0.0 0.0 0.0 0.0 2 3 4 5 2 3 4 5 0.08 0.0 0.16 0.0 0.04 0.04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 1 6 7 0 1 6 7 0.0 0.0 0.0 -0.04 (a) Personalized Edge PageRank Vector (b) Gradient Component

0.04 0.0 0.0 2 3 4 5 0.0 0.0 0.0 2 3 4 5 0.08 0.0 0.0 0.0 -0.04 -0.04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 1 6 7 0.0 0 1 6 7 0.04 0.0 0.0 (c) Curl Component (d) Harmonic Component

Figure 9: Personalized Edge PageRank Decomposition of (1, 2), a strong tie in the network.

example graph, we compare three edges: (3, 4) in Figure7, which is a global bridge, and hence a weak tie if the network obeys strong triadic closure3 [14]. In contrast, (1, 2) in Figure9 can be viewed as a strong tie, as it has high embeddedness and lies within a clique of four closely connected friends. Finally, edge (5, 6) in Figure8 can be seen as a tie of “medium” strength, because while the community is interconnected, the individuals do not share any higher order interactions. An analysis of the Personalized Edge PageRank vectors on each edge reveals that (3, 4) has the highest 2-norm of the Edge PageRank vector, (5, 6) has the next highest, and (1, 2) has the lowest 2-norm of the personalized Edge PageRank vector of the three edges. Namely, this suggests that Edge PageRank is inversely correlated with tie strength, which is consistent with the observations in Section2 that the effect of additional 2-simplexes results in a negation of flows. By considering the (weighted) Hodge decomposition of each personalized Edge PageRank vector, we can also begin to understand why we observe this relationship. For the weak tie, since we have a global bridge, the curl component is negligible, as is the harmonic since there are no cycles that include the edge (3, 4). However, the gradient component of the global bridge is large, and hence we see that the mass of the flow is concentrated in a single direction. For the strong tie (1, 2), there are two flows which are not negligible, the gradient and curl components– however, the curl component on edges other than (1, 2) are opposed to the flow of the gradient component, resulting in some cancellation. For the medium tie, since there are no 2-simplices connected to (5, 6), the curl flow is negligible, but since there are other cycles, there is both a harmonic and gradient component. These flows are in some sense aligned in orientation, resulting in a lower Edge PageRank score than (3, 4) but a higher score than (1, 2).

D Experimental Results for Bridge Identification

See Tables8 and9.

3We note that for this edge to be considered a weak tie, it should also be the case that at least one of nodes 3 or 4 also has a strong tie. Given the topology of the network, it is reasonable to assume that the tie (1, 3) would be strong due to its high embeddedness and inclusion of higher-order interactions and hence (3, 4) must be a weak tie.

23 Table 6: Accuracy of Predicting Tie Strength with a Single Regressor. Each entry is the test accuracy of linear regression, computed using a 5 fold cross-validation. Bolded entries on the left three columns show the lowest error among the Edge PageRank components, and bolded entries on the right four columns compare the 2-norm of the Personalized Edge vector to three baselines which each measure tie strength with a single measure.

Total unweighted dataset harmonic gradient curl Edge PageRank embeddedness overlap [38] dispersion [5]

24 contact-primary-school 0.61 (±0.002) 0.69 (±0.004) 0.57 (±0.003) 0.58 (±0.003) 0.39 (±0.007) 0.40 (±0.007) 0.15 (±0.008) contact-high-school 0.49 (±0.008) 0.54 (±0.004) 0.48 (±0.008) 0.51 (±0.008) 0.35 (±0.008) 0.35 (±0.008) 0.21 (±0.013) email-Enron 0.39 (±0.014) 0.52 (±0.019) 0.38 (±0.013) 0.45 (±0.012) 0.36 (±0.016) 0.36 (±0.015) 0.23 (±0.022) email-Eu 0.51 (±0.006) 0.59 (±0.003) 0.53 (±0.006) 0.56 (±0.005) 0.54 (±0.005) 0.54 (±0.005) 0.46 (±0.006) india-villages 0.52 (±0.001) -0.07 (±0.001) 0.37 (±0.001) 0.51 (±0.001) 0.37 (±0.001) 0.26 (±0.001) -0.41 (±0.001) sms-a 0.71 (±0.002) 0.71 (±0.001) 0.73 (±0.001) 0.71 (±0.002) 0.72 (±0.001) 0.72 (±0.001) 0.72 (±0.001) sms-c 0.75 (±0.003) 0.75 (±0.003) 0.76 (±0.002) 0.75 (±0.003) 0.75 (±0.003) 0.75 (±0.003) 0.75 (±0.003) college-msg 0.55 (±0.004) 0.55 (±0.003) 0.56 (±0.004) 0.54 (±0.004) 0.55 (±0.003) 0.55 (±0.003) 0.54 (±0.004) Table 7: Comparison of Edge PageRank on the original datasets to Edge PageRank on modified datasets where all triangles in the underlying graph are filled. The same experimental setup as in previous experiments is used, presenting cross validation accuracy of linear regression using different features. The results show that the original data provides much more accurate information about tie strength, suggesting that the presence of triangles carries information about tie strength between individuals within the triangle.

Edge PageRank Edge PageRank Total Edge Total Edge dataset Components (Filled) Components PageRank (Filled) PageRank contact-primary-school 0.49 (±0.007) 0.77 (±0.003) 0.35 (±0.006) 0.58 (±0.003) contact-high-school 0.34 (±0.012) 0.61 (±0.004) 0.25 (±0.013) 0.51 (±0.008) email-Enron 0.31 (±0.018) 0.57 (±0.015) 0.25 (±0.021) 0.45 (±0.012) email-Eu 0.49 (±0.006) 0.66 (±0.003) 0.46 (±0.007) 0.56 (±0.005) india-villages 0.31 (±0.001) 0.52 (±0.001) 0.14 (±0.001) 0.51 (±0.001) sms-a 0.72 (±0.001) 0.73 (±0.001) 0.71 (±0.002) 0.71 (±0.002) sms-c 0.75 (±0.003) 0.76 (±0.002) 0.75 (±0.003) 0.75 (±0.003) college-msg 0.56 (±0.003) 0.57 (±0.004) 0.54 (±0.004) 0.54 (±0.004)

E Relationship of Bridges to the Hodge Decomposition

1 n1 Proposition 1 (Global Bridges and the Gradient Space). Let S be a 1-simplex, and let eS1 ∈ R be the indicator vector which is supported on the index corresponding to S1. Then, the gradient component g 1 eS1 = eS1 if and only if S is a global bridge. Proof. The backward direction follows from the fact that if S1 is a global bridge, then it is not part of any > cycles. Hence, the indicator vector eS1 belongs to im(D2B1 ), and will not change upon projection. 1 g g To prove the forward direction, assume S is not a global bridge but eS1 = eS1 . Since eS1 belongs to > g 1 im(D2B1 ), the (weighted) sum of eS1 is 0 around any cycles. However, since S is not a global bridge, it g must be a part of a cycle C, and eS1 = eS1 can not have a weighted sum of 0 around C since it will sum to 1 along the cycle. The above proof generalizes easily to the case where the normalized Hodge Decomposition is taken with respect to the symmetrized Hodge Laplacian (7). We also note that local bridges are connected to the harmonic vector space.

1 n1 Proposition 2 (Local Bridges and the Harmonic Space). Let S be a local bridge, and let eS1 ∈ R be the 1 h indicator vector which is supported on the index corresponding to S . Then, keS1 k2 > 0. h h g c Proof. Suppose towards contradiction that keS1 k2 = 0, so eS1 = 0. Then, eS1 = eS1 + eS1 , so the indicator on the local bridge must be the sum of a gradient flow which has a weighted sum of 0 around cycles, and a 1 c curl flow which is supported on edges connected to 2-simplices. Since S is a local bridge, eS1 must be 0 on 1 c g 1 g S itself. Hence, since eS1 is an indicator vector, eS1 = −eS1 on all entries other than S itself, and eS1 = 1 on S1. c g There are then two cases. If there is a triangle T on which eS1 has a non-zero entry, then −eS1 will not g c sum to 0 around T , so eS1 will not be a valid gradient flow, a contradiction. Otherwise, if eS1 = 0, then g eS1 = eS1 is not a gradient flow, again resulting in a contradiction. Hence, we see that the concepts of local and global bridges are closely related to the Hodge Decomposition of their respective indicator vectors.

F Comparison of Edge PageRank to Hodge Decomposition of the Seed Vector

See Figure 10.

25 Table 8: Using Hodge Decomposition of Edge PageRank (β = 2.5) to identify local and global bridges, with comparison to Betweenness, denoted b(e). Entries correspond to 5-fold cross validation test accuracy of a multiclass logistic regression with given regressors, with standard deviations in parentheses. Because bridges are uncommon in these networks, we down-sample prevalent classes to ensure non-bridges, local bridges, and global bridges have the same number of data points in train and test sets. Bolded entries indicate the highest performance with the given number of regressors. In most cases, any combination of two components of Edge PageRank results in over 95% accuracy in classification.

coauth- coauth- ai- dbpedia- dbpedia- dbpedia- email- genius- h g c b(e) kπe k kπek kπek MAG-10 systems producer starring writer Eu rap X 0.58 (± 0.058) 0.51 (± 0.097) 0.45 (± 0.081) 0.58 (± 0.018) 0.58 (± 0.014) 0.78 (± 0.07) 0.55 (± 0.076) X 0.77 (± 0.007) 0.65 (± 0.04) 0.66 (± 0.008) 0.79 (± 0.008) 0.7 (± 0.021) 0.92 (± 0.025) 0.56 (± 0.048) X 0.69 (± 0.016) 0.68 (± 0.04) 0.64 (± 0.011) 0.66 (± 0.016) 0.63 (± 0.017) 0.71 (± 0.023) 0.7 (± 0.01)

26 X 0.78 (± 0.016) 0.75 (± 0.015) 0.78 (± 0.004) 0.72 (± 0.006) 0.71 (± 0.031) 0.7 (± 0.063) 0.72 (± 0.007) XX 0.77 (± 0.006) 0.76 (± 0.109) 0.66 (± 0.008) 0.87 (± 0.008) 0.7 (± 0.023) 0.92 (± 0.016) 0.56 (± 0.022) XX 0.82 (± 0.022) 0.82 (± 0.018) 0.68 (± 0.016) 0.83 (± 0.011) 0.85 (± 0.018) 0.79 (± 0.049) 0.73 (± 0.011) XX 0.8 (± 0.012) 0.82 (± 0.02) 0.79 (± 0.014) 0.79 (± 0.019) 0.85 (± 0.026) 0.92 (± 0.025) 0.73 (± 0.005) XX 0.98 (± 0.003) 0.99 (± 0.005) 0.92 (± 0.008) 0.98 (± 0.001) 0.98 (± 0.004) 0.93 (± 0.036) 0.93 (± 0.002) XX 0.99 (± 0.002) 0.99 (± 0.003) 0.96 (± 0.004) 0.98 (± 0.003) 0.99 (± 0.005) 0.94 (± 0.023) 0.94 (± 0.004) XX 0.99 (± 0.004) 0.99 (± 0.003) 0.9 (± 0.009) 0.98 (± 0.002) 0.98 (± 0.007) 0.88 (± 0.037) 0.94 (± 0.003) XXX 0.98 (± 0.002) 0.99 (± 0.003) 0.92 (± 0.008) 0.98 (± 0.002) 0.98 (± 0.004) 0.94 (± 0.029) 0.93 (± 0.002) XXX 0.99 (± 0.002) 0.99 (± 0.003) 0.96 (± 0.004) 0.99 (± 0.003) 0.99 (± 0.006) 0.93 (± 0.024) 0.94 (± 0.004) XXX 0.99 (± 0.002) 0.99 (± 0.002) 0.9 (± 0.009) 0.99 (± 0.002) 0.98 (± 0.007) 0.93 (± 0.024) 0.94 (± 0.003) XXX 0.99 (± 0.003) 0.99 (± 0.003) 0.96 (± 0.004) 0.98 (± 0.001) 0.99 (± 0.005) 0.95 (± 0.025) 0.94 (± 0.003) XXXX 0.99 (± 0.002) 1.0 (± 0.002) 0.96 (± 0.004) 0.99 (± 0.003) 0.99 (± 0.005) 0.94 (± 0.02) 0.94 (± 0.004) Table 9: Using a Single Regressor to determine Local and Global Bridges. Regressors correspond to betweenness, 2-norm of Hodge Decomposition components of either the indicator vector of an edge or personalized Edge PageRank (β = 2.5) of the edge, or the 2-norm of the total Edge PageRank. Entries correspond to test accuracy of a 5-fold cross validation using the given regressor. Bolded entries indicate highest performance on a dataset.

Curl Curl of Gradient Gradient of Harmonic Harmonic of Personalized Dataset Name Edge PageRank Indicator Edge PageRank Indicator Edge PageRank Indicator Edge PageRank

27 cat-edge-MAG-10 0.78 (±0.016) 0.66 (±0.009) 0.69 (±0.016) 0.68 (±0.016) 0.77 (±0.007) 0.77 (±0.007) 0.83 (±0.007) coauth-ai-systems 0.75 (±0.015) 0.65 (±0.012) 0.68 (±0.040) 0.69 (±0.036) 0.65 (±0.040) 0.65 (±0.094) 0.77 (±0.026) dbpedia-producer 0.78 (±0.004) 0.65 (±0.008) 0.64 (±0.011) 0.64 (±0.010) 0.66 (±0.008) 0.62 (±0.007) 0.83 (±0.006) dbpedia-starring 0.72 (±0.006) 0.66 (±0.002) 0.66 (±0.016) 0.65 (±0.017) 0.79 (±0.008) 0.81 (±0.009) 0.94 (±0.003) dbpedia-writer 0.71 (±0.031) 0.65 (±0.014) 0.63 (±0.017) 0.63 (±0.017) 0.70 (±0.021) 0.70 (±0.021) 0.93 (±0.012) email-Eu 0.70 (±0.063) 0.57 (±0.024) 0.71 (±0.023) 0.71 (±0.029) 0.92 (±0.025) 0.87 (±0.061) 0.92 (±0.015) genius-rap 0.72 (±0.007) 0.63 (±0.003) 0.70 (±0.010) 0.69 (±0.009) 0.56 (±0.048) 0.67 (±0.026) 0.79 (±0.006) Figure 10: Comparison of 2-norm of Personalized Edge PageRank components to the 2-norm of the Hodge Decomposition of the indicator on each edge, for all three components and different values of the β parameter. Unless β is taken to be small, the components tend to coincide with one another.

28