Extracting Insights from Differences: Analyzing Node-Aligned Social
Total Page:16
File Type:pdf, Size:1020Kb
Extracting Insights from Differences: Analyzing Node-aligned Social Graphs by Srayan Datta A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan 2019 Doctoral Committee: Associate Professor Eytan Adar, Chair Associate Professor Mike Cafarella Assistant Professor Danai Koutra Associate Professor Clifford Lampe Srayan Datta [email protected] ORCID iD: 0000-0002-5800-830X c Srayan Datta 2019 To my family and friends ii ACKNOWLEDGEMENTS There are several people who made this dissertation possible, first among this long list is my adviser, Eytan Adar. Pursuing a doctoral program after just finishing un- dergraduate studies can be a daunting task but Eytan made it easy with his patience, kindness, and guidance. I learned a lot from our collaborations and idle conversations and I am very grateful for that. I would like to extend my thanks to the rest of my thesis committee, Mike Ca- farella, Danai Koutra and Cliff Lampe for their suggestions and constructive feed- back. I would also like to thank the following faculty members, Daniel Romero, Ceren Budak, Eric Gilbert and David Jurgens for their long insightful conversations and suggestions about some of my projects. I would like to thank all of friends and colleagues who helped (as a co-author or through critique) or supported me through this process. This is an enormous list but I am especially thankful to Chanda Phelan, Eshwar Chandrasekharan, Sam Carton, Cristina Garbacea, Shiyan Yan, Hari Subramonyam, Bikash Kanungo, and Ram Srivatasa. I would like to thank my parents for their unwavering support and faith in me. This dissertation would not be possible without their constant support throughout the long process of pursuing a Ph.D. My dissertation is supported by University of Michigan computer science depart- ment and school of information. My projects were partially supported by grants from the University of Michigan and NSF. Additionally, I would like to thank Jason Baum- gartner for compiling the Reddit dataset, the one I used throughout this dissertation. iii TABLE OF CONTENTS DEDICATION :::::::::::::::::::::::::::::::::: ii ACKNOWLEDGEMENTS :::::::::::::::::::::::::: iii LIST OF FIGURES ::::::::::::::::::::::::::::::: viii LIST OF TABLES :::::::::::::::::::::::::::::::: xi ABSTRACT ::::::::::::::::::::::::::::::::::: xii CHAPTER I. Introduction ..............................1 1.1 Dissertation Overview . .5 1.2 Motivation . .6 1.3 Contributions . .7 II. Network Representation of Data and Community Detection Algorithms ...............................9 2.1 Network representation of data . .9 2.1.1 Link inference . .9 2.1.2 Node aligned networks . 11 2.2 Community detection algorithms . 12 2.2.1 Fastgreedy . 12 2.2.2 InfoMap . 12 2.2.3 Label propagation . 13 2.2.4 Multilevel . 13 2.2.5 Spinglass . 13 2.2.6 Walktrap . 13 2.3 Summary . 14 III. Reddit: Data and Existing Work ................. 15 iv 3.1 Choice of the Reddit dataset . 15 3.2 Related work . 16 3.2.1 Trolling in Reddit . 18 IV. Identifying Misaligned Inter-Group Links and Communities in Reddit ................................ 19 4.1 Overview . 19 4.2 Related work . 22 4.3 Dataset . 22 4.4 Method . 23 4.4.1 Similarity Metrics . 23 4.4.2 Matrix Generation . 25 4.4.3 Pairwise relationships between subreddits . 25 4.4.4 Subreddit networks . 28 4.5 Results . 30 4.5.1 Matrices . 31 4.5.2 Characterizing subreddit pairs . 33 4.5.3 Characterizing individual subreddits . 35 4.5.4 Characterizing subreddit networks . 38 4.5.5 Summary . 42 4.6 Discussion . 43 4.6.1 Limitations . 44 4.6.2 Future extensions . 44 4.7 Conclusion . 45 V. Extracting Inter-community Conflicts in Reddit ........ 47 5.1 Overview . 47 5.2 Related Work . 50 5.2.1 Conflicts in Social Media . 50 5.2.2 Signed Social Networks . 51 5.3 Dataset . 51 5.4 Identifying Inter-community Conflicts . 51 5.4.1 Controversial Authors . 54 5.5 The Subreddit Conflict Graph . 55 5.5.1 Constructing the Conflict Graph . 55 5.5.2 Eliminating Edges Present due to Chance . 55 5.5.3 Conflict Graph Properties . 56 5.6 Co-Conflict Communities . 62 5.6.1 Creating the Co-conflict Graph . 62 5.6.2 Co-conflict Graph Properties . 63 5.6.3 Community Detection Results . 64 5.7 Conflict Dynamics . 65 5.8 Discussion . 69 v 5.8.1 Downvotes for determining community conflicts . 69 5.8.2 Subreddit conflict due to ideological differences . 70 5.8.3 Robustness of threshold parameters . 71 5.8.4 Identifying communal misbehavior . 72 5.8.5 Co-conflict graph . 73 5.8.6 Limitations . 74 5.8.7 Implications . 75 5.9 Conclusion . 76 VI. Identifying, analyzing and predicting banned subreddits ... 77 6.1 Overview . 77 6.2 Data Collection . 79 6.3 Properties of the Banned . 81 6.3.1 Banned subreddit comment properties . 82 6.3.2 Banned subreddit language . 84 6.3.3 Subreddit ban times . 84 6.3.4 Banned subreddit active time . 86 6.4 Clustering Subreddits . 86 6.4.1 Textual features . 86 6.4.2 Interaction features . 87 6.4.3 Measuring Similarity . 89 6.4.4 Generating Clusters . 89 6.4.5 Results . 90 6.4.6 The `Noise' . 91 6.5 Prediction . 92 6.5.1 Classifying Using Text . 93 6.5.2 Classifying Using Interaction . 93 6.5.3 Combining Classifiers . 93 6.5.4 Banning-by-example . 94 6.5.5 Predictive Words . 96 6.6 Discussion and Limitations . 97 6.7 Conclusion . 98 VII. An Interactive Visualization Tool for Community Detection 99 7.1 Overview . 99 7.2 Related Work . 103 7.2.1 Interactive Machine Learning . 103 7.2.2 Ensembles and Community Detection . 105 7.2.3 Active learning . 106 7.3 The Trouble With Community Detection . 106 7.3.1 The Analytical Pipeline . 109 7.4 The CommunityDiff Design . 111 7.4.1 Example Interaction . 114 vi 7.4.2 Design Guidelines . 116 7.5 System Details . 118 7.5.1 System and Interface Architecture . 118 7.5.2 Algorithmic Details . 121 7.5.3 Visualization and Interface Elements . 127 7.5.4 Human-in-the-Loop Machine Learning . 132 7.6 Evaluation . 134 7.6.1 Ensemble Evaluation . 135 7.6.2 Co-Community Evaluation . 136 7.6.3 User Study . 137 7.7 Case study . 140 7.8 Discussion . 141 7.9 Conclusion . 144 VIII. Conclusion and Future Work .................... 145 8.1 Summary . 145 8.2 Extensions . 147 BIBLIOGRAPHY :::::::::::::::::::::::::::::::: 149 vii LIST OF FIGURES Figure 1.1 A general pipeline for analyzing network-based communities from a set of given nodes/entities. .1 1.2 An expanded pipeline for analyzing network-based communities from a set of given nodes/entities. .3 1.3 A general methodology for identifying conflicts and creating a conflict graph from a set of given user communities. .4 4.1 Network inference pipeline. 20 4.2 Computing the z2-score. 24 4.3 Comparison between z-score distribution in rank difference list of all other subreddits for two subreddits: CatsStandingUp and pokemongo 28 4.4 Author similarity vs. term similarity for all pairs of subreddits that have non-zero author similarity score. 31 4.5 Distribution of z2-scores for all pairs of subreddits. 32 5.1 General methodology for identifying conflicts and creating the conflict graph. 53 5.2 Ego network for the subreddit Liberal................. 57 5.3 Conflict intensity vs intensity of reciprocation in the subreddit con- flict graph. 58 5.4 The Co-conflict Graph. 66 5.5 Change count for source subreddits who targeted at least 5 subreddits 67 viii 5.6 Rank by intensity of being targeted for four political subreddits over 2016. 67 5.7 Rank by conflict intensity for four political subreddits over 2016. 68 6.1 Average text similarity with banned subreddits banned within one day window of self-ban binned daily from Jun 2015 to Mar 2018. 81 6.2 Banned subreddit comment count histogram with 100 bins. The x- axis is log-scaled. 82 6.3 Banned and unbanned subreddit deleted comment percentages. 83 6.4 Subreddit ban time binned weekly from Jan 2014 to Mar 2018. 85 6.5 A bar chart showing the frequency of subreddit bans per day from January 2010 to March 2018. 85 6.6 A histogram showing active time of banned subreddits in our dataset binned weekly. 86 6.7 Largest cluster and noise subreddit comment count histograms with.