Measurement Error in Network Data: a Re-Classification
Total Page:16
File Type:pdf, Size:1020Kb
G Model SON-706; No. of Pages 14 ARTICLE IN PRESS Social Networks xxx (2012) xxx–xxx Contents lists available at SciVerse ScienceDirect Social Networks journa l homepage: www.elsevier.com/locate/socnet ଝ Measurement error in network data: A re-classification a,∗ b a a Dan J. Wang , Xiaolin Shi , Daniel A. McFarland , Jure Leskovec a Stanford University, Stanford, CA 94305, United States b Microsoft Corporation, Redmond, WA 98052, United States a r t i c l e i n f o a b s t r a c t Keywords: Research on measurement error in network data has typically focused on missing data. We embed missing Measurement error data, which we term false negative nodes and edges, in a broader classification of error scenarios. This Missing data includes false positive nodes and edges and falsely aggregated and disaggregated nodes. We simulate these Simulation six measurement errors using an online social network and a publication citation network, reporting their effects on four node-level measures – degree centrality, clustering coefficient, network constraint, and eigenvector centrality. Our results suggest that in networks with more positively-skewed degree distributions and higher average clustering, these measures tend to be less resistant to most forms of measurement error. In addition, we argue that the sensitivity of a given measure to an error scenario depends on the idiosyncracies of the measure’s calculation, thus revising the general claim from past research that the more ‘global’ a measure, the less resistant it is to measurement error. Finally, we anchor our discussion to commonly-used networks in past research that suffer from these different forms of measurement error and make recommendations for correction strategies. © 2012 Elsevier B.V. All rights reserved. 1. Introduction focuses on missing data, it has overlooked other major classes of measurement error that have emerged (for an exception, see Network analysis has long been plagued by issues of measure- Borgatti et al. (2006)). ment error, usually in the form of missing data. For instance, survey Here, measurement error refers to mistakes in collecting or cod- data used in early sociometric research often contained misrepre- ing a network dataset. For example, in a sociometric survey, if a sentations of ego-networks due to the limits of respondent memory respondent misspells the name of a contact, then the contact might and survey design (Marsden, 1990). erroneously be treated as two different individuals. However, mea- While missing data remains a major issue, much network surement error can also refer to the extent to which a network research currently faces the opposite problem. The growing avail- dataset represents the reality of the relationships within a group ability of large, complex network datasets has transformed a under study. For instance, even if all respondents report the cor- research environment that lacked sufficient data into one with an rect spellings of their friends’ names, the understanding of what overabundance (De Choudhury et al., 2010). Network researchers qualifies as a friendship tie can vary by respondent. regularly analyze networks with millions of nodes and edges Thus, in network research, there exist three levels of empirical with multiplex relations. However, while much research still interpretation (see Table 1) – (1) the ideal network: the true set of relations among entities in a network, (2) the clean network: the set of relations among entities as coded in a network dataset with- out data entry mistakes, and (3) the observed network: a network ଝ This material is based upon work supported by the Office of the President at dataset, often suffering from coding errors, that is actually avail- Stanford University and the National Science Foundation under Grant No. 0835614. able to a researcher. Resolving the differences between the ideal In addition, this research was in part supported by NSF Grants CNS-101092, IIS- 1016909, IIS-1149837, the Albert Yu & Mary Bechmann Foundation, IBM, Lightspeed, network and clean network is difficult because it entails apply- Samsung, Yahoo, an Alfred P. Sloan Fellowship, and a Microsoft Faculty Fellowship. ing an objective understanding to what is an inherently subjective Any opinions, findings, conclusions, or recommendations expressed in this mate- relationship (see De Choudhury et al. (2010)). Because most mea- rial are those of the authors and do not necessarily reflect the views of Stanford surement errors are a result of inconsistencies in data collection University or the National Science Foundation. We are grateful to the comments and coding, we focus our analysis on the discrepancies between from members of the Mimir Project workshop at Stanford University on earlier iter- ations of this work. We also are indebted to the editors of Social Networks and the the clean and observed network. two anonymous reviewers, who provided helpful suggestions, many of which we First, we classify network measurement error into six types incorporated into the final version of this paper. ∗ – missing nodes, spurious nodes, missing edges, spurious edges, Corresponding author at: Department of Sociology, Stanford University, 450 falsely aggregated nodes, and falsely disaggregated nodes. In effect, Serra Mall, Stanford, CA 94305-2047, United States. E-mail address: [email protected] (D.J. Wang). we build on the important work of Borgatti et al. (2006), who 0378-8733/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.socnet.2012.01.003 Please cite this article in press as: Wang, D.J., et al., Measurement error in network data: A re-classification. Soc. Netw. (2012), doi:10.1016/j.socnet.2012.01.003 G Model SON-706; No. of Pages 14 ARTICLE IN PRESS 2 D.J. Wang et al. / Social Networks xxx (2012) xxx–xxx Table 1 Three levels of empirical network interpretation. Description Example 1 Example 2 Illustration Friendship network gathered Collaboration network gathered by through sociometric surveys scraping publication database Ideal network Network of true Ties represent actual, mutual Ties represent active collaborative relations between friendships between relationships between individuals entities individuals Clean network Network of relations Ties represent each Ties represent co-author encoded in data respondent’s own perception relationships, but not necessarily without measurement of friendship with others active collaboration between error individuals Observed network Network of relations Ties represent reported Ties represent co-author relations, encoded in data with friendships, but some nodes but some nodes and ties are measurement error and ties are erroneously erroneous coded coded compared the effects of spurious and missing nodes and edges on Our paper builds on the more recent work of Kossinets (2006) centrality measures in Erdos-Rényi˝ random graphs. By contrast, we and Borgatti et al. (2006), who simulate of measurement errors compare the effects of our error scenarios by simulating them in on random Erdos-Rényi˝ networks. Kossinets (2006) focuses on two real-world networks and one random graph, each of which missing network data, finding that clustering coefficients are over- vary in key structural characteristics, such as average clustering estimated when the boundary specification is too restricted, and and degree distribution. We then observe the effects of these error centrality measures are underestimated when non-response is per- scenarios on four different node-level network measures – degree vasive. Borgatti et al. (2006) begin an important conversation about centrality, clustering coefficient, network constraint, and eigenvec- the typology of measurement errors. They find that spurious nodes tor centrality. and edges diminish the similarity between a network dataset and In addition, we make recommendations for error correction its error-afflicted counterpart but not as much as the removal of strategies, which depend on the type of measurement error sce- nodes and edges. nario, the network measure affected, and the structural attributes We argue that key structural features of real-world social of the network under study. We also ground our discussion in com- networks can also influence the robustness of certain network mon examples of network datasets from past research. By bringing measures to error scenarios. Erdos-Rényi˝ graphs, like those in attention to these understudied classes of measurement error, we Kossinets (2006) and Borgatti et al. (2006), tend to have little alert network researchers to important sources of bias that extend clustering and more uniform degree distributions, making their beyond missing data. comparison to empirical networks unrealistic (Newman and Park, 2003). We therefore use empirical network datasets for simulating measurement error to issue more relevant cautions to empiri- 1.1. Related work cal researchers about different forms of network measurement error. Early examinations of the measurement error in networks focused on missing data in sociometric surveys. Common sources 2. Error scenarios of error include respondent bias (Pool and Kochen, 1978), non-response (Stork and Richards, 1992), and the design of ques- First, network data can suffer from missing or spurious nodes, tionnaires (Burt, 1984; Holland and Leinhardt, 1973). In addition, which we term false negative nodes and false positive nodes, missing this work identified factors in research design that could give or spurious edges, which are termed here as false negative edges and rise to measurement