arXiv:2010.12611v1 [cs.SI] 23 Oct 2020
Clustering via Information Access in a Network∗

Hannah C. Beilinson, Haverford College, Haverford, PA, [email protected]
Nasanbayar Ulzii-Orshikh, Haverford College, Haverford, PA, [email protected]
Ashkan Bashardoust, University of Utah, Salt Lake City, UT, [email protected]
Sorelle A. Friedler, Haverford College, Haverford, PA, [email protected]
Carlos E. Scheidegger, University of Arizona, Tucson, AZ, [email protected]
Suresh Venkatasubramanian, University of Utah, Salt Lake City, UT, [email protected]

October 27, 2020

Abstract

Information flow in a graph (say, a social network) has typically been modeled using standard influence propagation methods, with the goal of determining the most effective ways to spread information widely. More recently, researchers have begun to study the differing access to information of individuals within a network. This previous work suggests that information access is itself a potential aspect of privilege based on network position. While concerns about fairness usually focus on differences between demographic groups, characterizing network position may itself give rise to new groups for study. But how do we characterize position? Rather than using standard grouping methods for graph clustering, we design and explore a clustering that explicitly incorporates models of how information flows on a network. Our goal is to identify clusters of nodes that are similar based on their access to information across the network. We show, both formally and experimentally, that the resulting clustering method is a new approach to network clustering. Using a wide variety of datasets, our experiments show that the introduced clustering technique clusters individuals together who are similar based on an external information access measure.

1 Introduction

The rapidly growing literature on biases in automated decision making¹ illustrates the many ways in which structural biases can enter the decision-making process.
Online social networks have also been found to have problems with bias. The “data” in social networks is the way in which curation processes (recommendations most notably) construct links between people (nodes). It has been known for decades that position and access in a social network confer power Hethcote [2000], Burt [1987], Coleman et al. [1966], Granovetter [1978], and in the literature on information spreading and hiring networks (to name just two) there is evidence that groups that are disadvantaged in society continue to be disadvantaged in terms of access to information online Leskovec et al. [2010], Ball and Newman [2013], Clauset et al. [2015], Way et al. [2016], Morgan et al. [2018].

∗ This research was funded in part by the NSF under grants IIS-1955321, IIS-1956286, and IIS-1955162. We are grateful to Way et al. [2016] and Chen et al. [2017] for sharing their network datasets with us.
¹ http://fairmlbook.org.

Social networks are just that: social. The natural process of group formation (online or in person) tends to encourage segregation and clustering along many different dimensions: shared interests, values, identity and so on. These connections are not easily tied to any one of these factors. Rather, they are a complex mix of deliberate and opportunistic network building (with more than a little serendipity). Among others, Burt [2004], as well as boyd et al. [2014], have argued that network positioning as a component of social capital is a form of group affinity in and of itself. In other words, nodes that have similar “status” based on network position receive similar information, and node groups based on information access might be more salient to understanding group dynamics than traditional groups based on node-level features like demographics, class status and so on. It is this observation that motivates this work.
1.1 Contributions

Information access provides a novel, natural and effective mechanism for defining node signatures in a network. In this paper, we
• formalize the notion of information access, and show how it can be effectively computed for a variety of network sizes,
• show mathematically how information access measures similarity differently from methods relying on the spectrum of the graph Laplacian, and
• show, through experiments on a number of different real-world networks, that clustering nodes based on information access captures external measures of real-world interest.

2 Related Work

Work on the problem of influence maximization (also referred to as target seed selection) started with Domingos and Richardson [2001] and was formalized by Kempe et al. [2003], leading to an extensive literature on the subject Li et al. [2018]. Inspired by the literature on social position initiated by Granovetter [1977] and framed in the context of online social networks by boyd et al. [2014], there has been a more recent development of computational questions around fairness in access on social networks Fish et al. [2019], Tsang et al. [2019], Stoica and Chaintreau [2019]. The starting point for all of this is the idea of information access as a resource. In Fish et al. [2019], this is formalized in terms of the question “what is the probability that vertex $i$ in a graph gets information from source $j$”, and the paper focuses on explicit interventions to ensure that the minimum such probability is maximized. This concept is the basis for our representation-focused work in this paper. Subsequent research Tsang et al. [2019], Stoica and Chaintreau [2019] has focused on the problem of allocation with respect to information access, attempting to make sure that different demographic groups (represented as disjoint subsets of the graph) receive similar information access.
Graph clustering (and more generally community detection) has long been a focus of intense study (see Aggarwal and Wang [2010] for a survey). A detailed review of the different strands of graph clustering is beyond the scope of this paper. Broadly speaking, one can categorize graph clustering algorithms as those based on finding dense submotifs in a graph, those based on spectral analysis Shi and Malik [2000], Von Luxburg [2007], Gharan and Trevisan [2014], Kannan et al. [2004] (which in turn generalizes connectivity-based clusterings) and those based on the more general framework of unusual local density or modularity Brandes et al. [2007].

3 Information Access

Let $G = (V, E)$ be a network with sets of nodes $V$ and edges $E$, where $|V| = n$. Consider any information flow model that describes how information might transmit from one node to its neighbors, such as the independent cascade model, the linear threshold model, or any of the infection flow models from epidemiology. All these models are stochastic and assume some initial seed set of nodes that possess the information to be spread. For any given seed set $S$ there is then a fixed probability $p_{v,S}$ that node $v \in V$ possesses the information once the process terminates.

This motivates our idea of an information access signature: a way to encode the “view” from a node $v$ of the access it has to information sent from other nodes in the graph.

3.1 Signatures and representation

Definition 1 (information access signature). Let $v_j \in V$ be a node in $G$. The information access signature $s^G_\alpha : V \to \mathbb{R}^n$ is

$$s^G_\alpha(v_j) = (p_{1j}, \ldots, p_{ij}, \ldots, p_{nj})$$

where $p_{ij}$ is the probability that node $v_i \in V$ receives information seeded at node $v_j \in V$ under some information flow model with parameters $\alpha$.
Intuitively, this signature characterizes a node's information access based on how likely they are to receive information from everyone else in the network; people who are likely to receive information from the same part of the network will have similar signatures.

Transforming all nodes to their associated information access signature gives an information access representation of the network.

Definition 2 (information access representation). Let $G = (V, E)$ be a network with sets of nodes $V$ and edges $E$. The information access representation is

$$R^G_\alpha = \{\, s^G_\alpha(v_j) \mid v_j \in V \,\}$$

where $\alpha$ represents the parameters of the information flow model as before.

In this paper we will focus on the extensively studied independent cascade model Kempe et al. [2003] (also known as the SIR model in epidemiology). In this model, a node exists in one of three states: ready to receive, ready to transmit, and dormant.² Initially, all nodes are ready to receive information and some subset of seed nodes possess a bit of information and are ready to transmit. At each time step, a node that is ready to transmit decides (for each of its adjacent edges) with probability $\alpha$ to transmit the information to the corresponding neighbor. All such transmissions are imagined to happen simultaneously, after which the node goes dormant. Since the independent cascade model can be characterized by a single parameter $\alpha$, in what follows we will merely use $s^G_\alpha(v)$ to denote the information access signature.

An example: the star graph

We illustrate this construction on the star graph depicted in Figure 1. It is easy to see that for any distinct pair of nodes in the set $\{1, \ldots, 7\}$, the probability of information being transmitted from one to another is $\alpha^2$. For any pair involving the central node $t$ and any periphery node, the probability is $\alpha$. In particular $s_\alpha(1) = (1, \alpha^2, \ldots, \alpha^2, \alpha)$. On the other hand $s_\alpha(t) = (\alpha, \alpha, \ldots, \alpha, 1)$.
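The star-graph values above can be checked numerically. The following is a minimal sketch, not necessarily the authors' procedure: it estimates each signature by Monte Carlo simulation of the independent cascade model. The function names `independent_cascade` and `estimate_signatures`, the string label `"t"` for the center, the choice $\alpha = 0.5$, and the trial count are all assumptions made for illustration.

```python
import random
from collections import deque

def independent_cascade(adj, seed, alpha, rng):
    """Run one independent cascade from `seed`; return the set of informed nodes.

    Nodes are processed one at a time rather than in synchronous rounds.
    Each newly informed node still gets exactly one transmission attempt
    per adjacent edge, after which it is never revisited (dormant).
    """
    informed = {seed}
    frontier = deque([seed])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            # Transmit along edge (u, v) with probability alpha.
            if v not in informed and rng.random() < alpha:
                informed.add(v)
                frontier.append(v)
    return informed

def estimate_signatures(adj, alpha, trials=4000, rng_seed=0):
    """Estimate s_alpha(v_j) for every node v_j (hypothetical helper).

    Entry i of the signature of v_j is the fraction of trials in which
    node v_i ended up informed when v_j was the only seed.
    """
    rng = random.Random(rng_seed)
    nodes = list(adj)  # coordinate order follows dict insertion order
    index = {v: i for i, v in enumerate(nodes)}
    signatures = {}
    for vj in nodes:
        counts = [0] * len(nodes)
        for _ in range(trials):
            for vi in independent_cascade(adj, vj, alpha, rng):
                counts[index[vi]] += 1
        signatures[vj] = [c / trials for c in counts]
    return nodes, signatures

# Star graph: periphery nodes 1..7, center labeled "t" (listed last).
star = {i: ["t"] for i in range(1, 8)}
star["t"] = list(range(1, 8))

nodes, sig = estimate_signatures(star, alpha=0.5)
print("s(1) ~", [round(p, 2) for p in sig[1]])
print("s(t) ~", [round(p, 2) for p in sig["t"]])
```

With enough trials, the estimated entries of `sig[1]` should approach $\alpha^2$ for the other periphery nodes and $\alpha$ for the center, while the seed's own entry is exactly 1.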
Figure 1: A star graph

² These states are the susceptible, infectious and recovered states of the SIR model.

3.2 Information access clustering

The information access representation $R^G_\alpha$ represents each vertex of the graph as a point in an $n$-dimensional space. Two points are close in this space if they have similar views of information flow to other nodes. It is then a natural next step to try and group nodes together based on this proximity, i.e., to compute a graph clustering.
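To make "close in this space" concrete, the sketch below builds the closed-form star-graph signatures from Section 3.1 and compares them. The text here does not fix a distance or a clustering algorithm, so the use of Euclidean distance is an assumption, as are the helper names `star_signatures` and `euclidean` and the value $\alpha = 0.5$.

```python
import math

def star_signatures(alpha, k=7):
    """Closed-form signatures for a star with k periphery nodes and center "t".

    Coordinates are ordered (1, ..., k, t), as in the star-graph example.
    """
    sigs = {}
    for j in range(1, k + 1):
        row = [alpha ** 2] * k + [alpha]  # alpha^2 to other periphery, alpha to center
        row[j - 1] = 1.0                  # a seed always holds its own information
        sigs[j] = row
    sigs["t"] = [alpha] * k + [1.0]
    return sigs

def euclidean(a, b):
    """Euclidean distance between two signatures (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

sigs = star_signatures(alpha=0.5)
d_periphery = euclidean(sigs[1], sigs[2])   # two periphery nodes
d_center = euclidean(sigs[1], sigs["t"])    # periphery node vs. center
print(d_periphery, d_center)
```

Under these assumptions, any two periphery signatures differ in exactly two coordinates, giving distance $\sqrt{2}\,(1 - \alpha^2)$, so all periphery pairs are equidistant by symmetry. The periphery-to-center distance works out to $(1 - \alpha)\sqrt{2 + 6\alpha^2}$ for the 7-leaf star.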