A New Way to Measure Homophily in Directed Graphs

Total Page:16

File Type:pdf, Size:1020Kb

A New Way to Measure Homophily in Directed Graphs The Popularity-Homophily Index: A new way to measure Homophily in Directed Graphs Shiva Oswal UC Berkeley Extension, Berkeley, CA, USA Abstract In networks, the well-documented tendency for people with similar characteris- tics to form connections is known as the principle of homophily. Being able to quantify homophily into a number has a signicant real-world impact, ranging from government fund-allocation to netuning the parameters in a sociological model. This paper introduces the “Popularity-Homophily Index” (PH Index) as a new metric to measure homophily in directed graphs. The PH Index takes into account the “popularity” of each actor in the interaction network, with popularity precalculated with an importance algorithm like Google PageRank. The PH Index improves upon other existing measures by weighting a homophilic edge leading to an important actor over a homophilic edge leading to a less important actor. The PH Index can be calculated not only for a single node in a directed graph but also for the entire graph. This paper calculates the PH Index of two sample graphs, and concludes with an overview of the strengths and weaknesses of the PH Index, and its potential applications in the real world. Keywords: measuring homophily; popularity-homophily index; social interaction networks; directed graphs 1 Introduction Homophily, meaning “love of the same,” captures the tendency of agents, most com- monly people, to connect with other agents that share sociodemographic, behavioral, or intrapersonal characteristics (McPherson et al, 2001). In the language of social interaction networks, homophily is the “tendency for friendships and many other arXiv:2109.00348v1 [cs.SI] 29 Aug 2021 interpersonal relationships to occur between similar people” (Thelwall, 2009). The social, economic, and political outcomes of homophily are signicant. In urban set- tings for instance, homophilic tendencies in people tend to create exclaves of a highly concentrated race or ethnicity (Xu et al, 2021). Chinatown in San Francisco and Little Italy in New York, are prime examples of homophily in action. Race, socioeconomic factors, and religion all play a prime role in determining the likelihood of two people communicating with one another, or living nearby each other (Xu et al, 2021). Identifying and measuring homophily can have signicant real-world consequences. Identifying homophily in housing can help authorities deal with dogged pockets of segregation, while identifying homophily in social media can reveal the importance of 1 certain attributes and users, aiding social media companies, advertisers, and others (Weng et al, 2010). Measuring homophily provides numeric detail to the above situ- ations, which in turn can inuence the amount of funding allocated, and resources spent in response. For these reasons, an eective measure of homophily is needed. The most nuanced models need to take the structure of the graph into consideration. This paper presents a new way of measuring the homophily of an attribute in a directed graph — the “Popularity-Homophily Index” or PH Index for short. The paper then looks at two example datasets, calculates the PH Index in both cases, and then compares their results. The rst dataset is a social network of Github developers. The second dataset is an internet network of links between political blogs in the lead-up to the 2004 US presidential election. These two graphs are radically dierent in their contents, but this paper nds that the PH Index has relevance in both. For suciently complex directed graphs, the PH Index is signicantly more powerful than other related measures of homophily in common use, such as the classic External-Internal Index (EI Index). 2 Related Work The idea and motivation to measure homophily in directed graphs is not new. For example, Twitter’s social network can be modelled as a directed graph, with users as nodes, and follower relations as directed edges. Measuring homophily was an important part of the development of TwitterRank, which measured the inuence of prominent Twitterers in a sample dataset (Weng et al, 2010). Other papers have sought to determine the extent of the political “echo chamber” on social media such as Twitter. Aer classifying users as either Democratic or Republican based on the content of their tweets, one paper then sought to measure the political homophily present in the resulting graph (Colleoni et al, 2014). This measurement allowed them to quantify the scale and impact of the political echo chamber on social media. Additionally, “Degree Weighted Homophily” (DWH) takes the structure of the graph into account, and provably gives a lower-bound for the convergence time of certain simulations, where nodes represent “agents” that are either attracted to or repelled by their neighbors (Golub and Jackson, 2012). These simulations are oen excellent models for the behavior of real-world people, such as the tendency of people of the same race to cluster in the same neighborhoods. However, DWH is nearly impossible to compute eciently for real-world graphs, since its basic formulation takes exponential time to compute. A recently developed and widely used measure of homophily is called the Assor- tr(E)−sum(E2) tativity Coecient r, which satises the formula r = 1−sum(E2) (Newman, 2003). Easy and ecient to calculate, the assortativity coecient has been widely used in research into homophily (Chang et al, 2007). In the case of a binary classication, E is a 2x2 matrix containing the fraction of each possible type of edge in the graph. For example, if every node has a color attribute that is either “black” or “white”, the fraction of edges that start at a “black” node and end at “white” node is one of the 4 possible kinds of edges. Additionally, tr is the trace of a matrix (the sum of its main diagonal), and sum is the sum of the elements of the entire matrix. 2 A signicant issue with the assortativity coecient is that it does not take the overall structure of the graph into account since it reduces the entire graph into a matrix, losing much of the intricate complexity of the graph. All of these measures of homophily have their advantages and drawbacks. The PH Index is inuenced by many of these measures, and is designed to be versatile, easy, and ecient to calculate, while nuanced enough to take the entire structure of the graph into account. 3 Background Dene an edge to be a homophilic edge if it connects two nodes that share some attribute, and dene it be a non-homophilic edge if otherwise. For the purposes of this paper, assume that every node attribute is binary, and can be represented as either 0 or 1. Many non-binary attributes can be converted into binary attributes with a bit of extra work. For example, if every node is a person with their own “height” attribute, we can create a binary attribute of ‘tallness’. Arbitrarily dene a node to be ‘tall’ if its height attribute is more than 6 feet, and ‘not tall’ if otherwise. Dene a pair of similar nodes to be two nodes (not necessarily connected by an edge) that have the same value for a given attribute, either both 0 or both 1. For a given node n, one of the oldest, simplest, and most powerful measures of homophily is called the EI Index or alternatively the Coleman Homophily Index since it was introduced in his landmark book Introduction to Mathematical Sociology (Coleman, 1964). In the set of all node n’s out edges (in the case of a directed graph), let en represent the number of “external” non-homophilic edges, and in represent the number of “internal” homophilic edges. Then the EI Index is dened by Equation 1. e − i EI = n n (1) en + in Note that the EI Index is -1 for a completely homophilic node (all of its edges are homophilic edges), and 1 for a completely heterophilic node (Coleman, 1964). The EI Index for the entire graph is simply the average of the EI Index for each individual node. Most random graphs that do not demonstrate homophily tend to have a value of the EI Index of approximately 0. For the purposes of the PH Index, using the Weighted EI Index (WEI Index) is more appropriate. The necessity of the WEI Index stems from the fact that in a binary classication, the total number of nodes with one value of an attribute may dier signicantly from the total number of nodes with the other value of that same attribute. For example, consider a social network of 200 individuals, with 180 men and 20 women. If Bob (a male) and Jennifer (a female) both are connected to a single man and a single woman, then their EI indices will both equal 0:0. However, common sense suggests that Jennifer exhibits a greater amount of homophily, since there are very few other women she could be connected to. Inuenced by the above observation, dene the weight w of node n to be the ratio of the number of non-similar nodes in the entire graph, divided by the number of similar nodes. Therefore, Bob’s weight would equal 0.11, and Jennifer’s weight would 3 equal 9.0. For node n¸consider the same variables as in the denition of the EI Index. The equation for the WEI Index is stated in Equation 2. en − i WEI = w n en (2) w + in The Weighted EI Index essentially normalizes the EI Index, calculating it as if there were an equal number of nodes with each possible value of the attribute. In our example above, Bob’s WEI is 0:8 and Jennifer’s WEI is −0:8. As predicted, aer weighting, Jennifer exhibits strong homophily.
Recommended publications
  • Sagemath and Sagemathcloud
    Viviane Pons Ma^ıtrede conf´erence,Universit´eParis-Sud Orsay [email protected] { @PyViv SageMath and SageMathCloud Introduction SageMath SageMath is a free open source mathematics software I Created in 2005 by William Stein. I http://www.sagemath.org/ I Mission: Creating a viable free open source alternative to Magma, Maple, Mathematica and Matlab. Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 2 / 7 SageMath Source and language I the main language of Sage is python (but there are many other source languages: cython, C, C++, fortran) I the source is distributed under the GPL licence. Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 3 / 7 SageMath Sage and libraries One of the original purpose of Sage was to put together the many existent open source mathematics software programs: Atlas, GAP, GMP, Linbox, Maxima, MPFR, PARI/GP, NetworkX, NTL, Numpy/Scipy, Singular, Symmetrica,... Sage is all-inclusive: it installs all those libraries and gives you a common python-based interface to work on them. On top of it is the python / cython Sage library it-self. Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 4 / 7 SageMath Sage and libraries I You can use a library explicitly: sage: n = gap(20062006) sage: type(n) <c l a s s 'sage. interfaces .gap.GapElement'> sage: n.Factors() [ 2, 17, 59, 73, 137 ] I But also, many of Sage computation are done through those libraries without necessarily telling you: sage: G = PermutationGroup([[(1,2,3),(4,5)],[(3,4)]]) sage : G . g a p () Group( [ (3,4), (1,2,3)(4,5) ] ) Viviane Pons (U-PSud) SageMath and SageMathCloud October 19, 2016 5 / 7 SageMath Development model Development model I Sage is developed by researchers for researchers: the original philosophy is to develop what you need for your research and share it with the community.
    [Show full text]
  • Networkx Tutorial
    5.03.2020 tutorial NetworkX tutorial Source: https://github.com/networkx/notebooks (https://github.com/networkx/notebooks) Minor corrections: JS, 27.02.2019 Creating a graph Create an empty graph with no nodes and no edges. In [1]: import networkx as nx In [2]: G = nx.Graph() By definition, a Graph is a collection of nodes (vertices) along with identified pairs of nodes (called edges, links, etc). In NetworkX, nodes can be any hashable object e.g. a text string, an image, an XML object, another Graph, a customized node object, etc. (Note: Python's None object should not be used as a node as it determines whether optional function arguments have been assigned in many functions.) Nodes The graph G can be grown in several ways. NetworkX includes many graph generator functions and facilities to read and write graphs in many formats. To get started though we'll look at simple manipulations. You can add one node at a time, In [3]: G.add_node(1) add a list of nodes, In [4]: G.add_nodes_from([2, 3]) or add any nbunch of nodes. An nbunch is any iterable container of nodes that is not itself a node in the graph. (e.g. a list, set, graph, file, etc..) In [5]: H = nx.path_graph(10) file:///home/szwabin/Dropbox/Praca/Zajecia/Diffusion/Lectures/1_intro/networkx_tutorial/tutorial.html 1/18 5.03.2020 tutorial In [6]: G.add_nodes_from(H) Note that G now contains the nodes of H as nodes of G. In contrast, you could use the graph H as a node in G.
    [Show full text]
  • Networkx: Network Analysis with Python
    NetworkX: Network Analysis with Python Salvatore Scellato Full tutorial presented at the XXX SunBelt Conference “NetworkX introduction: Hacking social networks using the Python programming language” by Aric Hagberg & Drew Conway Outline 1. Introduction to NetworkX 2. Getting started with Python and NetworkX 3. Basic network analysis 4. Writing your own code 5. You are ready for your project! 1. Introduction to NetworkX. Introduction to NetworkX - network analysis Vast amounts of network data are being generated and collected • Sociology: web pages, mobile phones, social networks • Technology: Internet routers, vehicular flows, power grids How can we analyze this networks? Introduction to NetworkX - Python awesomeness Introduction to NetworkX “Python package for the creation, manipulation and study of the structure, dynamics and functions of complex networks.” • Data structures for representing many types of networks, or graphs • Nodes can be any (hashable) Python object, edges can contain arbitrary data • Flexibility ideal for representing networks found in many different fields • Easy to install on multiple platforms • Online up-to-date documentation • First public release in April 2005 Introduction to NetworkX - design requirements • Tool to study the structure and dynamics of social, biological, and infrastructure networks • Ease-of-use and rapid development in a collaborative, multidisciplinary environment • Easy to learn, easy to teach • Open-source tool base that can easily grow in a multidisciplinary environment with non-expert users
    [Show full text]
  • Graph Database Fundamental Services
    Bachelor Project Czech Technical University in Prague Faculty of Electrical Engineering F3 Department of Cybernetics Graph Database Fundamental Services Tomáš Roun Supervisor: RNDr. Marko Genyk-Berezovskyj Field of study: Open Informatics Subfield: Computer and Informatic Science May 2018 ii Acknowledgements Declaration I would like to thank my advisor RNDr. I declare that the presented work was de- Marko Genyk-Berezovskyj for his guid- veloped independently and that I have ance and advice. I would also like to thank listed all sources of information used Sergej Kurbanov and Herbert Ullrich for within it in accordance with the methodi- their help and contributions to the project. cal instructions for observing the ethical Special thanks go to my family for their principles in the preparation of university never-ending support. theses. Prague, date ............................ ........................................... signature iii Abstract Abstrakt The goal of this thesis is to provide an Cílem této práce je vyvinout webovou easy-to-use web service offering a database službu nabízející databázi neorientova- of undirected graphs that can be searched ných grafů, kterou bude možno efektivně based on the graph properties. In addi- prohledávat na základě vlastností grafů. tion, it should also allow to compute prop- Tato služba zároveň umožní vypočítávat erties of user-supplied graphs with the grafové vlastnosti pro grafy zadané uži- help graph libraries and generate graph vatelem s pomocí grafových knihoven a images. Last but not least, we implement zobrazovat obrázky grafů. V neposlední a system that allows bulk adding of new řadě je také cílem navrhnout systém na graphs to the database and computing hromadné přidávání grafů do databáze a their properties.
    [Show full text]
  • Networkx Reference Release 1.9.1
    NetworkX Reference Release 1.9.1 Aric Hagberg, Dan Schult, Pieter Swart September 20, 2014 CONTENTS 1 Overview 1 1.1 Who uses NetworkX?..........................................1 1.2 Goals...................................................1 1.3 The Python programming language...................................1 1.4 Free software...............................................2 1.5 History..................................................2 2 Introduction 3 2.1 NetworkX Basics.............................................3 2.2 Nodes and Edges.............................................4 3 Graph types 9 3.1 Which graph class should I use?.....................................9 3.2 Basic graph types.............................................9 4 Algorithms 127 4.1 Approximation.............................................. 127 4.2 Assortativity............................................... 132 4.3 Bipartite................................................. 141 4.4 Blockmodeling.............................................. 161 4.5 Boundary................................................. 162 4.6 Centrality................................................. 163 4.7 Chordal.................................................. 184 4.8 Clique.................................................. 187 4.9 Clustering................................................ 190 4.10 Communities............................................... 193 4.11 Components............................................... 194 4.12 Connectivity..............................................
    [Show full text]
  • Exploring Network Structure, Dynamics, and Function Using Networkx
    Proceedings of the 7th Python in Science Conference (SciPy 2008) Exploring Network Structure, Dynamics, and Function using NetworkX Aric A. Hagberg ([email protected])– Los Alamos National Laboratory, Los Alamos, New Mexico USA Daniel A. Schult ([email protected])– Colgate University, Hamilton, NY USA Pieter J. Swart ([email protected])– Los Alamos National Laboratory, Los Alamos, New Mexico USA NetworkX is a Python language package for explo- and algorithms, to rapidly test new hypotheses and ration and analysis of networks and network algo- models, and to teach the theory of networks. rithms. The core package provides data structures The structure of a network, or graph, is encoded in the for representing many types of networks, or graphs, edges (connections, links, ties, arcs, bonds) between including simple graphs, directed graphs, and graphs nodes (vertices, sites, actors). NetworkX provides ba- with parallel edges and self-loops. The nodes in Net- sic network data structures for the representation of workX graphs can be any (hashable) Python object simple graphs, directed graphs, and graphs with self- and edges can contain arbitrary data; this flexibil- loops and parallel edges. It allows (almost) arbitrary ity makes NetworkX ideal for representing networks objects as nodes and can associate arbitrary objects to found in many different scientific fields. edges. This is a powerful advantage; the network struc- In addition to the basic data structures many graph ture can be integrated with custom objects and data algorithms are implemented for calculating network structures, complementing any pre-existing code and properties and structure measures: shortest paths, allowing network analysis in any application setting betweenness centrality, clustering, and degree dis- without significant software development.
    [Show full text]
  • Graph and Network Analysis
    Graph and Network Analysis Dr. Derek Greene Clique Research Cluster, University College Dublin Web Science Doctoral Summer School 2011 Tutorial Overview • Practical Network Analysis • Basic concepts • Network types and structural properties • Identifying central nodes in a network • Communities in Networks • Clustering and graph partitioning • Finding communities in static networks • Finding communities in dynamic networks • Applications of Network Analysis Web Science Summer School 2011 2 Tutorial Resources • NetworkX: Python software for network analysis (v1.5) http://networkx.lanl.gov • Python 2.6.x / 2.7.x http://www.python.org • Gephi: Java interactive visualisation platform and toolkit. http://gephi.org • Slides, full resource list, sample networks, sample code snippets online here: http://mlg.ucd.ie/summer Web Science Summer School 2011 3 Introduction • Social network analysis - an old field, rediscovered... [Moreno,1934] Web Science Summer School 2011 4 Introduction • We now have the computational resources to perform network analysis on large-scale data... http://www.facebook.com/note.php?note_id=469716398919 Web Science Summer School 2011 5 Basic Concepts • Graph: a way of representing the relationships among a collection of objects. • Consists of a set of objects, called nodes, with certain pairs of these objects connected by links called edges. A B A B C D C D Undirected Graph Directed Graph • Two nodes are neighbours if they are connected by an edge. • Degree of a node is the number of edges ending at that node. • For a directed graph, the in-degree and out-degree of a node refer to numbers of edges incoming to or outgoing from the node.
    [Show full text]
  • Networkx: Network Analysis with Python
    NetworkX: Network Analysis with Python Dionysis Manousakas (special thanks to Petko Georgiev, Tassos Noulas and Salvatore Scellato) Social and Technological Network Data Analytics Computer Laboratory, University of Cambridge February 2017 Outline 1. Introduction to NetworkX 2. Getting started with Python and NetworkX 3. Basic network analysis 4. Writing your own code 5. Ready for your own analysis! 2 1. Introduction to NetworkX 3 Introduction: networks are everywhere… Social Mobile phone networks Vehicular flows networks Web pages/citations Internet routing How can we analyse these networks? Python + NetworkX 4 Introduction: why Python? Python is an interpreted, general-purpose high-level programming language whose design philosophy emphasises code readability + - Clear syntax, indentation Can be slow Multiple programming paradigms Beware when you are Dynamic typing, late binding analysing very large Multiple data types and structures networks Strong on-line community Rich documentation Numerous libraries Expressive features Fast prototyping 5 Introduction: Python’s Holy Trinity Python’s primary NumPy is an extension to Matplotlib is the library for include primary plotting mathematical and multidimensional arrays library in Python. statistical computing. and matrices. Contains toolboxes Supports 2-D and 3-D for: Both SciPy and NumPy plotting. All plots are • Numeric rely on the C library highly customisable optimization LAPACK for very fast and ready for implementation. professional • Signal processing publication. • Statistics, and more…
    [Show full text]
  • PDF Reference
    NetworkX Reference Release 1.0 Aric Hagberg, Dan Schult, Pieter Swart January 08, 2010 CONTENTS 1 Introduction 1 1.1 Who uses NetworkX?..........................................1 1.2 The Python programming language...................................1 1.3 Free software...............................................1 1.4 Goals...................................................1 1.5 History..................................................2 2 Overview 3 2.1 NetworkX Basics.............................................3 2.2 Nodes and Edges.............................................4 3 Graph types 9 3.1 Which graph class should I use?.....................................9 3.2 Basic graph types.............................................9 4 Operators 129 4.1 Graph Manipulation........................................... 129 4.2 Node Relabeling............................................. 134 4.3 Freezing................................................. 135 5 Algorithms 137 5.1 Boundary................................................. 137 5.2 Centrality................................................. 138 5.3 Clique.................................................. 142 5.4 Clustering................................................ 145 5.5 Cores................................................... 147 5.6 Matching................................................. 148 5.7 Isomorphism............................................... 148 5.8 PageRank................................................. 161 5.9 HITS..................................................
    [Show full text]
  • Network Science and Python
    Network Analysis using Python Muhammad Qasim Pasta Assistant Professor Usman Institute of Technology About Me • Assistant Professor @ UIT • Research Interest: Network Science / Social Network Analysis • Consultant / Mentor • Python Evangelist • Github: https://github.com/mqpasta • https://github.com/mqpasta/pyconpk2018 • Website: http://qasimpasta.info Today’s Talk What is a Network? • An abstract representation to model pairwise relationship between objects. • Network = graph: mathematical representation nodes/vertices/actors edges/arcs/lines Birth of Graph Theory • Königsberg bridge problem: the seven bridges of the city of Königsberg over the river Preger • traversed in a single trip without doubling back? • trip ends in the same place it began Solution as graph! • Swiss mathematician - Leonhard Euler • considering each piece of island as dot • each bridge connecting two island as a line between two dots • a graph of dots (vertices or nodes) and lines (edges or links). Background | Literature Review | Questions | Contributions | Conclusion | Q/A Network Science • ―Network science is an academic field which studies complex networks such as telecommunication networks, computer networks …‖ [Wikipedia] • ―In the context of network theory, a complex network is a graph (network) with non- trivial topological features—features that do not occur in simple networks …‖ [Wikipedia] Study of Network Science Social Graph structure theory Inferential Statistical modeling mechanics Information Data Mining Visualization Background | Literature Review
    [Show full text]
  • Synthetic Networks Or Generative Models)
    Models of networks (synthetic networks or generative models) Prof. Ralucca Gera, Applied Mathematics Dept. Naval Postgraduate School Monterey, California [email protected] Excellence Through Knowledge Learning Outcomes • Identify network models and explain their structures; • Contrast networks and synthetic models; • Understand how to design new network models (based on the existing ones and on the collected data) • Distinguish methodologies used in analyzing networks. The three papers for each of the models Synthetic models are used as reference/null models to compare against and build new complex networks •“On Random Graphs I” by Paul Erdős and Alfed Renyi in Publicationes Mathematicae (1958) Times cited: 3, 517 (as of January 1, 2015) •“Collective dynamics of ‘small-world’ networks” by Duncan Watts and Steve Strogatz in Nature, (1998) Times cited: 24, 535 (as of January 1, 2015) •“Emergence of scaling in random networks” by László Barabási and Réka Albert in Science, (1999) Times cited: 21, 418 (as of January 1, 2015) 3 Why care? • Epidemiology: – A virus propagates much faster in scale-free networks. – Vaccination of random nodes in scale free does not work, but targeted vaccination is very effective • Create synthetic networks to be used as null models: – What effect does the degree distribution alone have on the behavior of the system? (answered by comparing to the configuration model) • Create networks of different sizes – Networks of particular sizes and structures can be quickly and cheaply generated, instead of collecting and cleaning
    [Show full text]
  • Networkx: Network Analysis with Python
    NetworkX: Network Analysis with Python Salvatore Scellato From a tutorial presented at the 30th SunBelt Conference “NetworkX introduction: Hacking social networks using the Python programming language” by Aric Hagberg & Drew Conway 1 Thursday, 1 March 2012 Outline 1. Introduction to NetworkX 2. Getting started with Python and NetworkX 3. Basic network analysis 4. Writing your own code 5. You are ready for your own analysis! 2 Thursday, 1 March 2012 1. Introduction to NetworkX. 3 Thursday, 1 March 2012 Introduction to NetworkX - network analysis Vast amounts of network data are being generated and collected • Sociology: web pages, mobile phones, social networks • Technology: Internet routers, vehicular flows, power grids How can we analyse these networks? Python + NetworkX! 4 Thursday, 1 March 2012 Introduction to NetworkX - Python awesomeness 5 Thursday, 1 March 2012 Introduction to NetworkX - Python in one slide Python is an interpreted, general-purpose high-level programming language whose design philosophy emphasises code readability. “there should be one (and preferably only one) obvious way to do it”. • Use of indentation for block delimiters (!!!!) • Multiple programming paradigms: primarily object oriented and imperative but also functional programming style. • Dynamic type system, automatic memory management, late binding. • Primitive types: int, float, complex, string, bool. • Primitive structures: list, tuple, dictionary, set. • Complex features: generators, lambda functions, list comprehension, list slicing. 6 Thursday, 1 March
    [Show full text]