A New Way to Measure Homophily in Directed Graphs
Total Page:16
File Type:pdf, Size:1020Kb
The Popularity-Homophily Index: A new way to measure Homophily in Directed Graphs Shiva Oswal UC Berkeley Extension, Berkeley, CA, USA Abstract In networks, the well-documented tendency for people with similar characteris- tics to form connections is known as the principle of homophily. Being able to quantify homophily into a number has a signicant real-world impact, ranging from government fund-allocation to netuning the parameters in a sociological model. This paper introduces the “Popularity-Homophily Index” (PH Index) as a new metric to measure homophily in directed graphs. The PH Index takes into account the “popularity” of each actor in the interaction network, with popularity precalculated with an importance algorithm like Google PageRank. The PH Index improves upon other existing measures by weighting a homophilic edge leading to an important actor over a homophilic edge leading to a less important actor. The PH Index can be calculated not only for a single node in a directed graph but also for the entire graph. This paper calculates the PH Index of two sample graphs, and concludes with an overview of the strengths and weaknesses of the PH Index, and its potential applications in the real world. Keywords: measuring homophily; popularity-homophily index; social interaction networks; directed graphs 1 Introduction Homophily, meaning “love of the same,” captures the tendency of agents, most com- monly people, to connect with other agents that share sociodemographic, behavioral, or intrapersonal characteristics (McPherson et al, 2001). In the language of social interaction networks, homophily is the “tendency for friendships and many other arXiv:2109.00348v1 [cs.SI] 29 Aug 2021 interpersonal relationships to occur between similar people” (Thelwall, 2009). The social, economic, and political outcomes of homophily are signicant. In urban set- tings for instance, homophilic tendencies in people tend to create exclaves of a highly concentrated race or ethnicity (Xu et al, 2021). Chinatown in San Francisco and Little Italy in New York, are prime examples of homophily in action. Race, socioeconomic factors, and religion all play a prime role in determining the likelihood of two people communicating with one another, or living nearby each other (Xu et al, 2021). Identifying and measuring homophily can have signicant real-world consequences. Identifying homophily in housing can help authorities deal with dogged pockets of segregation, while identifying homophily in social media can reveal the importance of 1 certain attributes and users, aiding social media companies, advertisers, and others (Weng et al, 2010). Measuring homophily provides numeric detail to the above situ- ations, which in turn can inuence the amount of funding allocated, and resources spent in response. For these reasons, an eective measure of homophily is needed. The most nuanced models need to take the structure of the graph into consideration. This paper presents a new way of measuring the homophily of an attribute in a directed graph — the “Popularity-Homophily Index” or PH Index for short. The paper then looks at two example datasets, calculates the PH Index in both cases, and then compares their results. The rst dataset is a social network of Github developers. The second dataset is an internet network of links between political blogs in the lead-up to the 2004 US presidential election. These two graphs are radically dierent in their contents, but this paper nds that the PH Index has relevance in both. For suciently complex directed graphs, the PH Index is signicantly more powerful than other related measures of homophily in common use, such as the classic External-Internal Index (EI Index). 2 Related Work The idea and motivation to measure homophily in directed graphs is not new. For example, Twitter’s social network can be modelled as a directed graph, with users as nodes, and follower relations as directed edges. Measuring homophily was an important part of the development of TwitterRank, which measured the inuence of prominent Twitterers in a sample dataset (Weng et al, 2010). Other papers have sought to determine the extent of the political “echo chamber” on social media such as Twitter. Aer classifying users as either Democratic or Republican based on the content of their tweets, one paper then sought to measure the political homophily present in the resulting graph (Colleoni et al, 2014). This measurement allowed them to quantify the scale and impact of the political echo chamber on social media. Additionally, “Degree Weighted Homophily” (DWH) takes the structure of the graph into account, and provably gives a lower-bound for the convergence time of certain simulations, where nodes represent “agents” that are either attracted to or repelled by their neighbors (Golub and Jackson, 2012). These simulations are oen excellent models for the behavior of real-world people, such as the tendency of people of the same race to cluster in the same neighborhoods. However, DWH is nearly impossible to compute eciently for real-world graphs, since its basic formulation takes exponential time to compute. A recently developed and widely used measure of homophily is called the Assor- tr(E)−sum(E2) tativity Coecient r, which satises the formula r = 1−sum(E2) (Newman, 2003). Easy and ecient to calculate, the assortativity coecient has been widely used in research into homophily (Chang et al, 2007). In the case of a binary classication, E is a 2x2 matrix containing the fraction of each possible type of edge in the graph. For example, if every node has a color attribute that is either “black” or “white”, the fraction of edges that start at a “black” node and end at “white” node is one of the 4 possible kinds of edges. Additionally, tr is the trace of a matrix (the sum of its main diagonal), and sum is the sum of the elements of the entire matrix. 2 A signicant issue with the assortativity coecient is that it does not take the overall structure of the graph into account since it reduces the entire graph into a matrix, losing much of the intricate complexity of the graph. All of these measures of homophily have their advantages and drawbacks. The PH Index is inuenced by many of these measures, and is designed to be versatile, easy, and ecient to calculate, while nuanced enough to take the entire structure of the graph into account. 3 Background Dene an edge to be a homophilic edge if it connects two nodes that share some attribute, and dene it be a non-homophilic edge if otherwise. For the purposes of this paper, assume that every node attribute is binary, and can be represented as either 0 or 1. Many non-binary attributes can be converted into binary attributes with a bit of extra work. For example, if every node is a person with their own “height” attribute, we can create a binary attribute of ‘tallness’. Arbitrarily dene a node to be ‘tall’ if its height attribute is more than 6 feet, and ‘not tall’ if otherwise. Dene a pair of similar nodes to be two nodes (not necessarily connected by an edge) that have the same value for a given attribute, either both 0 or both 1. For a given node n, one of the oldest, simplest, and most powerful measures of homophily is called the EI Index or alternatively the Coleman Homophily Index since it was introduced in his landmark book Introduction to Mathematical Sociology (Coleman, 1964). In the set of all node n’s out edges (in the case of a directed graph), let en represent the number of “external” non-homophilic edges, and in represent the number of “internal” homophilic edges. Then the EI Index is dened by Equation 1. e − i EI = n n (1) en + in Note that the EI Index is -1 for a completely homophilic node (all of its edges are homophilic edges), and 1 for a completely heterophilic node (Coleman, 1964). The EI Index for the entire graph is simply the average of the EI Index for each individual node. Most random graphs that do not demonstrate homophily tend to have a value of the EI Index of approximately 0. For the purposes of the PH Index, using the Weighted EI Index (WEI Index) is more appropriate. The necessity of the WEI Index stems from the fact that in a binary classication, the total number of nodes with one value of an attribute may dier signicantly from the total number of nodes with the other value of that same attribute. For example, consider a social network of 200 individuals, with 180 men and 20 women. If Bob (a male) and Jennifer (a female) both are connected to a single man and a single woman, then their EI indices will both equal 0:0. However, common sense suggests that Jennifer exhibits a greater amount of homophily, since there are very few other women she could be connected to. Inuenced by the above observation, dene the weight w of node n to be the ratio of the number of non-similar nodes in the entire graph, divided by the number of similar nodes. Therefore, Bob’s weight would equal 0.11, and Jennifer’s weight would 3 equal 9.0. For node n¸consider the same variables as in the denition of the EI Index. The equation for the WEI Index is stated in Equation 2. en − i WEI = w n en (2) w + in The Weighted EI Index essentially normalizes the EI Index, calculating it as if there were an equal number of nodes with each possible value of the attribute. In our example above, Bob’s WEI is 0:8 and Jennifer’s WEI is −0:8. As predicted, aer weighting, Jennifer exhibits strong homophily.