Towards a Better Understanding of Peer-Produced Structured Content Value

A THESIS SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF MINNESOTA BY

Andrew Hall

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Loren Terveen

July, 2019

Copyright © Andrew Hall 2019.

Acknowledgements

I would first like to thank my Ph.D. advisor, Dr. Loren Terveen, for his assistance over the last five years. Throughout the Ph.D. program, I had countless informal meetings with Loren in which he taught me how to be a more logical and critical thinker. Rather than directly providing answers to many of the questions that I would have for him, Loren helped me develop the necessary tools to derive such answers myself. He would also frequently push me to get out of my comfort zone to pursue challenging research questions/directions that had large potential impacts.

I would like to thank Dr. Aaron Halfaker, a Principal Research Scientist at the Wikimedia Foundation. I started actively working with Aaron during the spring of 2017 as a research intern at the Wikimedia Foundation. However, our collaboration did not finish once the internship ended, and for the last two years Aaron has served in a role that is essentially that of a co-advisor. I greatly appreciate Aaron’s willingness to discuss low-level technical details, high-level research questions, and anything in between during our weekly meetings.

I would also like to thank many of the professors at the University of Minnesota for helping me succeed with both coursework and research. Specifically, I would like to thank Drs. Haiyi Zhu and Daniel Keefe for being a part of my thesis committee and for providing feedback when I proposed the thesis. Additionally, I’m grateful to Dr. Eric Van Wyk, who introduced me to research as an undergraduate student and helped me realize that graduate school could serve an important role in achieving my career goals.

Many current and former members of GroupLens Research have helped me succeed in the projects I carried out. GroupLens is a very collaborative lab, and I believe that this has strengthened my research tremendously. I would especially like to thank Dr. Jacob Thebault-Spieker, Allen Lin, Sarah McRoberts, Dr. Isaac Johnson, Dr. Brent Hecht, and Dr. Shilad Sen.

Finally, I would like to thank my parents and brother for all the support they have provided over a challenging five years of my life.

Dedication

To my parents, grandparents, and brother.

Abstract

Over the last 30 years, peer production has created everything from software (e.g. Linux) to encyclopedia articles (e.g. Wikipedia) to geographic data (e.g. OpenStreetMap). In recent years, peer production has increased its focus on the production of structured (key-value pair) content. This content is designed to be consumed by applications and algorithms. This thesis explores two challenges in generating content that is as valuable as possible to these applications/algorithms. The first challenge is unique to the context of peer-produced structured data and centers on a tension between the core peer production ethos of contributor freedom and the need for highly standardized data in order for applications/algorithms to operate effectively. To explore this tension between freedom and standardization, I qualitatively analyze the ways in which it surfaces and then quantitatively analyze its impact. For the second challenge, I compare how different levels of automation affect content value. Contributions in peer production come from manual editing, semi-automated tool editing, and fully-automated bot editing. I use two important lenses to study the value provided by these different types of contributions. Specifically, I study value by considering 1) the relationship between content quality and demand, and 2) problematic societal-level content biases (e.g. along male versus female, Global North versus Global South, and urban versus rural lines). While peer-production research has explored these two lenses of value in the past, it has not sought to develop a robust understanding in the context of structured content. To ensure that automated and manual contributions are effectively differentiated, I also develop a bot detection model. Finally, I provide implications based on my results. For example, my work motivates socio-technical tools that can reduce the manual effort required to contribute structured data and tools that direct effort towards in-demand content.

Table of Contents

List of Tables
List of Figures
Notes on the Content in this Thesis
1 Introduction
  1.1 Defining the Problem Space
    1.1.1 Challenge 1: Contributor Freedom versus Data Standardization
    1.1.2 Challenge 2: Understanding the Different Roles that Manual and Automated Contributions Play in Affecting Content Value
  1.2 Research Questions
  1.3 Summary of Challenge 1 Studies
    1.3.1 Exploring the Tension Between Freedom and Data Standardization
    1.3.2 Measuring Contributor Freedom’s Effect on Data Standardization
  1.4 Summary of Challenge 2 Studies
    1.4.1 Unidentified Bot Detection
    1.4.2 Comparing Content Value Produced by Manual and Automated Contributions Along Three Intuitive Dimensions
  1.5 Thesis Organization
2 Background and Related Work
  2.1 Background on Peer Production
  2.2 Brief Background on Communities Studied in this Thesis
    2.2.1 OpenStreetMap
    2.2.2 Wikidata
  2.3 Performing Contributions in Peer Production
    2.3.1 Manual Contributions
    2.3.2 Automated Contributions
  2.4 Research Studying Contributor Freedom within Peer Production
  2.5 Studies of Content Value in Peer Production
3 Understanding the Causes of a Tension Between Freedom and Standardization in Peer-Produced Structured Content
  3.1 Introduction
  3.2 Related Work
  3.3 Method
  3.4 Results and Interpretations
    3.4.1 Theme 1: Freedom vs. Metadata Completeness
    3.4.2 Theme 2: Project-Specific Freedom and Metadata Correctness: Humanitarian OpenStreetMap (HOT)
    3.4.3 Theme 3: Cultural Differences Make Global Metadata Correctness Standards Difficult to Achieve and Maintain
    3.4.4 Theme 4: Community-Management Obstacles to Achieving Consensus
    3.4.5 Theme 5: Data Representation Prevents Conceptual Correctness
    3.4.6 Theme 6: Data Entry Tools May Harm Metadata Correctness and Privilege Certain Users
  3.5 Reflecting on Correctness, Community, and Code
4 Understanding the Effect of a Tension Between Freedom and Standardization in Peer-Produced Structured Content
  4.1 Introduction
  4.2 Related Work
  4.3 Additional Relevant Information on OpenStreetMap Tags and Tagging Standards
  4.4 Motivation for Analyzing Chain Business Standardization
  4.5 Methods
    4.5.1 Clustering Algorithm
  4.6 Determining a Metadata Taxonomy
  4.7 Results
    4.7.1 Measuring Standardization
    4.7.2 Detailed Results
  4.8 Discussion
  4.9 Limitations and Future Work
5 Bot Detection in Wikidata Using Behavioral and Other Informal Cues
  5.1 Introduction
    5.1.1 Contributions
  5.2 Background and Related Work
    5.2.1 Bot Detection
    5.2.2 Activity Session Behavioral Patterns
  5.3 Methods
  5.4 Results
    5.4.1 Formal Model Evaluation on Registered User Edits
    5.4.2 Qualitative Model Evaluation on Anonymous User Edits
    5.4.3 Summary of Model Evaluations
  5.5 Applying the Model To Registered and Anonymous Users
    5.5.1 Implications on Behavioral Research in Peer Production
    5.5.2 Implications for the Wikidata Community and Applications using Wikidata
  5.6 Discussion
    5.6.1 Model and Feature Implications
    5.6.2 Model Improvements
6 Exploring the Effects of Manual and Automated Contributions on Content Value in Wikidata
  6.1 Introduction
    6.1.1 Contributions
  6.2 Related Work
    6.2.1 Consumer-Level Value Analyses
    6.2.2 Societal-Level Value Analyses
  6.3 Methods and Data
    6.3.1 Background on Wikidata Revision Data Used
    6.3.2 Determining Contribution Strategy Types of Revisions
    6.3.3 Analyzing Wikidata Contribution Strategy Behavior Longitudinally
    6.3.4 Measuring Item Quality
    6.3.5 Measuring Item Demand
    6.3.6 Data Generation for Consumer-Level Value Analyses
    6.3.7 Data Generation for Societal-Level Value Analyses
    6.3.8 Additional Details of Our Analytic Methods
  6.4 Results
    6.4.1 Value from a Consumer-Level Perspective
    6.4.2 Value from a Societal-Level Perspective
  6.5 Discussion, Implications, and Future Work
    6.5.1 Should Bots – and Wikidata – adjust editing behavior?
    6.5.2 Potential Mechanisms to Support Aligned Content Editing
    6.5.3 Using Automated Contribution Strategies to Reduce Rural Content Disparities
    6.5.4 Future Work
  6.6 Limitations
7 Conclusion
  7.1 Summary of Studies
  7.2 Summary of Contributions
  7.3 Implications and Future Work
8 Bibliography
9 Appendix
  9.1 Study 2 Sampling Contingent Metadata
  9.2 Study 3 Details of Data Preparation
    9.2.1 Extracting Revision Data and Sessionization
    9.2.2 Ground Truth Bot Account Data
  9.3 Study 3 Bot Prediction Model Features
  9.4 Study 3 Detailed Anonymous Contributor Qualitative Coding Results
  9.5 Study 4 Additional Details of Determining Contribution Strategy Types of Revisions
    9.5.1 Identifying Revisions from Bots
    9.5.2 Identifying Revisions from Semi-Automated Tools
    9.5.3 Identifying Revisions from Manual Effort
  9.6 Study 4 Additional Tables and Figures


List of Tables

Table 3.1: Participant Information
Table 4.1: Chain Businesses Used in Standardization Analyses
Table 4.2: Chain Business Metadata (Key) Role Classes
Table 4.3: Business-Universal Pairs with 2 or More Aligned Values
Table 5.1: Bot Prediction Model Fitness on Registered User Contributions
Table 5.2: Qualitative Analysis Codes
Table 5.3: Bot Prediction Model Features
Table 5.3: Overview of the Results of the Anonymous Contributor Qualitative Coding Summary
Table 6.1: Content Edits Sampled from Each Contribution Strategy and Period
Table 6.2: Content Edits Sampled for Male and Female Items
Table 6.3: Content Edits Sampled for Global North and Global South Items
Table 6.4: Content Edits Sampled for Urban and Rural Items

List of Figures

Figure 1.1: Foursquare’s City Guide using OpenStreetMap Data
Figure 4.1: Universal Business-Key-Value Triples
Figure 5.1: Precision-Recall and Receiver Operating Characteristic Curves
Figure 5.2: Maximum Monthly Anonymous User Session Lengths
Figure 6.1: Mean Item Demand (Expected Quality) Percentile of Content Edits
Figure 6.2: Density Function Breaking Down Demand (Expected Quality) Percentiles for Bot Content Edits in the 2016-2017 Period
Figure 6.3: Mean Quality Difference (Actual Minus Expected) for Items within Content Edits
Figure 6.4: Item Mean Quality Difference (Actual Minus Expected), Broken Down by Demand (Expected Quality) Quantile. Plots represent the beginning and end of study.
Figure 6.5: Item Mean Quality, Broken Down by Demand Quantile. Plots represent the beginning and end of study.
Figure 6.6: Mean Absolute Error (MAE) Based on Item Quality Difference (Between Actual and Expected Item Quality)
Figure 6.7: Normalized Human Female Item Content Edit Proportion
Figure 6.8: Normalized Global South Item Content Edit Proportion
Figure 6.9: Normalized Rural Item Content Edit Proportion

Notes on the Content in this Thesis

I led all of the research in this thesis. As of the time of submission, the thesis incorporates the following three pieces of published content, included as chapters 3, 4, and 5.

• Hall, A., McRoberts, S., Thebault-Spieker, J., Lin, Y., Sen, S., Hecht, B., & Terveen, L. (2017, May). Freedom versus standardization: structured data generation in a peer production community. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 6352-6362). ACM.
• Hall, A., Thebault-Spieker, J., Sen, S., Hecht, B., & Terveen, L. (2018, August). Exploring the Relationship Between Informal Standards and Contributor Practice in OpenStreetMap. In Proceedings of the 14th International Symposium on Open Collaboration (p. 10). ACM.
• Hall, A., Terveen, L., & Halfaker, A. (2018). Bot Detection in Wikidata Using Behavioral and Other Informal Cues. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 64.

Since much of the work in this thesis was collaborative, I often use the pronoun “we” in the text.


1 Introduction

Built upon large-scale, decentralized collaboration, the phenomenon of peer production has been described by Benkler as “the most significant organizational innovation that has emerged from Internet-mediated social practice” [8]. It is easy to see why peer production is considered such an innovation if one looks at the success of its various communities. For instance, Linux is a highly-popular open source operating system. Photo sharing platforms like Flickr have enabled people to share tens of millions of photos with each other [101]. Citizen science platforms like Zooniverse and Foldit have produced large quantities of useful scientific information (e.g. [49]). Finally, the online encyclopedia, Wikipedia, is the world’s fifth-most-visited website [113], and Wikipedia’s content is also used in Google Knowledge Graph and other third-party services (e.g. [67]).

The focus of my thesis research has been peer production communities that produce repositories of structured data. The structured data produced in these communities are key-value pairs and are designed to be used by applications and algorithms. Indeed, such data is widely used. For example, data from Wikidata, Wikipedia’s sister project, is used in Google Knowledge Graph, Apple’s voice assistant Siri, and in Wikipedia infoboxes. Further, OpenStreetMap (OSM), “The Wikipedia of Maps” [17], produces structured map data used by Mapbox, Craigslist, Foursquare’s City Guide (pictured in Figure 1.1), and in humanitarian efforts.

1.1 Defining the Problem Space

It is important that the structured data generated by peer production communities are as valuable as possible to the applications and algorithms that consume them. My work has focused on understanding the challenges associated with creating valuable content. There are two broad themes in this work. I introduce these next.

1.1.1 Challenge 1: Contributor Freedom versus Data Standardization

Contributor freedom is a foundational principle of peer production communities and is responsible for the incredible amounts of content that communities have produced. This principle suggests that contributors “go for it” [114], “make changes as they see fit” [33], and “be bold” [114] and allows contributors to not be overburdened with complicated guidelines or rules. In the context of OpenStreetMap, this principle even indicates that “[OpenStreetMap will never] force any of its mappers to do anything” [115].

While contributor freedom within the community helped Wikipedia and OpenStreetMap create millions of encyclopedia articles and hundreds of gigabytes of geospatial data, it is not clear whether it will be as helpful as communities begin major pushes into producing highly standardized structured data. Wikipedia’s structured data efforts are manifest through Wikidata. OSM’s structured data creation occurs in its tagging infrastructure, which lets editors specify the semantics of a geospatial entity using key-value pairs (e.g. its name, whether it is a restaurant or a hospital, etc.). Both these initiatives are motivated by a desire to help computers understand real world semantics (e.g., moving towards computing over “things not strings”, as Google puts it [116]).

Figure 1.1: Foursquare’s City Guide using OpenStreetMap Data (https://foursquare.com)

For Wikidata and OSM to support computing applications most effectively, their structured data must have a high degree of standardization. This means that similar entities must be represented in syntactically similar ways within these communities or else applications will not use the data effectively. For example, popular tools developed to make maps from OSM data (e.g. Mapbox, CartoCSS) will not use the right color or icon for entities if the entities are not “tagged” with the proper structured data. Similarly, standardizing entities like roads is especially valuable for projects such as the Humanitarian OpenStreetMap Team (HOT) [117], which seeks to provide aid in humanitarian disasters. The story is similar for Wikidata. Tools developed to generate natural language Wikipedia articles from Wikidata (e.g. Reasonator [118]) work properly only if the data conforms to standards. The same is true for Wikipedia itself, which pulls in data from Wikidata for language alignment and infoboxes [119].
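
To make this dependence concrete, here is a minimal illustrative sketch (in Python, not code from Mapbox, CartoCSS, or any other tool named above) of how a renderer-style tag lookup silently degrades when tags deviate from the expected vocabulary; the tag values and icon file names are hypothetical.

    # Illustrative only: a toy tag-to-icon lookup in the spirit of map styling rules.
    # Icon file names and tag values are hypothetical.
    ICON_FOR_AMENITY = {
        "restaurant": "restaurant.svg",
        "hospital": "hospital.svg",
        "fast_food": "fast_food.svg",
    }

    def icon_for(tags: dict) -> str:
        # Non-standard values (e.g. "Restaurant" or "resturant") silently fall
        # through to a generic marker, so the entity renders incorrectly.
        return ICON_FOR_AMENITY.get(tags.get("amenity"), "generic_marker.svg")

    print(icon_for({"amenity": "restaurant"}))  # restaurant.svg
    print(icon_for({"amenity": "Restaurant"}))  # generic_marker.svg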

Hence, our first challenge focuses on a tension that is novel to peer production communities producing structured data. Specifically, contributors need freedom to produce large amounts of data, but applications/algorithms need data standardization to effectively operate. The first two studies in my thesis explore 1) how this tension manifests itself and 2) the impact on data standardization that occurs as a result.

1.1.2 Challenge 2: Understanding the Different Roles that Manual and Automated Contributions Play in Affecting Content Value

While initial contributions to peer production were largely manual work (e.g. manual Wikipedia article writing), automated contributions have played an ever-increasing role in these communities. Wikipedia, OpenStreetMap, and Wikidata all have taken advantage of automation to perform work at a rate and scale exceeding that of manual contributors. Automated contributions are particularly prevalent in Wikidata. In fact, 88% of Wikidata edits come from fully-automated bots [86] and an increasingly large number of edits are also coming from semi-automated contributor tools. Given the large prevalence of automated contributions in creating structured content, it is important to understand how they compare to manual contributors in their effect on content value. The last two studies in my thesis carry out such a comparison.

We compare value along two intuitive and important dimensions. First, we consider the relationship between content quality (as assessed by tools that are based on community-defined standards) and content demand to obtain a metric of value representing consumer data needs. Second, we consider value through the lens of problematic societal-level biases. For example, we consider male versus female, Global North versus Global South, and urban versus rural biases in peer-produced structured content. Pertaining to this second challenge, all of the dimensions in which we consider value have been studied in the context of peer production (e.g. [26,38,44,57,87]), or even more broadly (e.g. [2,48]). However, none of these dimensions have been studied to a significant degree in the context of peer-produced structured content.

1.2 Research Questions

The work in this thesis answers the following three research questions.

• How does peer production’s strong commitment to contributor freedom affect its efforts to produce standardized structured data? (answered via study 1)
• Given the fundamental ethos of contributor freedom, to what extent does actual contributor practice follow a community’s guidelines and best practices? (answered via study 2)
• How do manual and automated contributions affect content value? (answered via studies 3 and 4)

1.3 Summary of Challenge 1 Studies

1.3.1 Exploring the Tension Between Freedom and Data Standardization

As noted above, peer-produced structured data is used effectively by applications only if it follows standards. For the first piece of research in this thesis, we did an interview study focused on OpenStreetMap’s knowledge production processes to investigate how – and how successfully – this community creates and applies its data standards. Our study revealed a fundamental tension between the need to produce structured data in a standardized way and OpenStreetMap’s tradition of contributor freedom. We extracted six themes that manifested this tension and three overarching concepts, correctness, community, and code, which help make sense of and synthesize the themes. We also offered suggestions for improving OpenStreetMap’s knowledge production processes, including new data models, sociotechnical tools, and community practices (e.g. stronger leadership).

1.3.2 Measuring Contributor Freedom’s Effect on Data Standardization

For our second study, we wanted to measure the effect of contributor freedom on data standardization. To do so, we carried out a study in OpenStreetMap to investigate adherence to the community’s geographic structured data guidelines1. We found that most applied structured data was consistent with the community’s standards; however, we also found that the standards identified many opportunities for applying data that were not achieved. In addition, when we situated the standards in the context of OpenStreetMap’s data model, we found a significant amount of ambiguity: the syntax allowed only one value, but everyday meaning (and the standards themselves) called for multiple values. Our results suggested significant opportunities for OpenStreetMap to improve content value.

1.4 Summary of Challenge 2 Studies

1.4.1 Unidentified Bot Detection

Understanding the ways in which humans and bots behave and add value in peer production communities is an important topic, and one that relies on accurate bot recognition. Yet, in many cases, bot activities are not explicitly flagged and could be mistaken for human contributions. For our third study, we developed a machine classifier to detect previously unidentified bots using implicit behavioral and other informal editing characteristics. We showed that this method yields a high level of fitness under both formal evaluation (PR-AUC: 0.845, ROC-AUC: 0.985) and a qualitative analysis of “anonymous” contributor edit sessions. Our findings indicate that, most of the time, unidentified bots do not perform a significant portion of edits. However, we also identified some cases where unflagged bot activities can significantly misrepresent manual behavior in analyses. Our model has the potential to support future research and community patrolling activities.
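
As a point of reference for the fitness numbers above, the sketch below shows how PR-AUC and ROC-AUC can be computed for a binary bot-vs-human classifier with scikit-learn; the features, labels, and model here are synthetic placeholders, not the actual features or model used in the study.

    # Illustrative sketch of the reported evaluation metrics (PR-AUC, ROC-AUC).
    # The data and model below are stand-ins, not those used in the thesis.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import average_precision_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                          # stand-in per-session behavioral features
    y = (X[:, 0] + rng.normal(size=1000) > 1).astype(int)   # stand-in bot / not-bot labels

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]

    print("PR-AUC: ", average_precision_score(y_test, scores))
    print("ROC-AUC:", roc_auc_score(y_test, scores))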

1.4.2 Comparing Content Value Produced by Manual and Automated Contributions Along Three Intuitive Dimensions

For our final study, we explored how Wikidata’s automated and non-automated contributions differ in the value they produce. In performing this exploration, we define content value through two important and intuitive lenses. These lenses consider 1) the relationship between content quality and consumer demand and 2) problematic societal-level biases. Our results indicate that automated contribution mechanisms are less effective than manual contributions at targeting work based on consumer demand. However, automated mechanisms also appear effective in improving the quality of underrepresented content (e.g., pertaining to rural areas and the Global South). Based on our findings, we provide actionable insights for Wikidata and other peer production communities.

1 http://wiki.openstreetmap.org

1.5 Thesis Organization

The remainder of the thesis is organized as follows. The next chapter covers background information and a broad summary of relevant related work. The following four chapters each cover one of the four studies that I have performed. Each of these chapters also includes a discussion of related work specific to the respective study. The final chapter provides conclusions, a summary of contributions, and implications of results.


2 Background and Related Work

2.1 Background on Peer Production

Peer production has been described as an “open collaborative innovation and creation, performed by diverse, decentralized groups organized principally by neither price signals nor organization hierarchy, harnessing heterogeneous motivations, and governed and managed based on principles other than the residual authority of ownership implemented through contract” [8]. The roots of peer production stem from the FOSS (free and open-source software) community, which originated in the mid-twentieth century [111]. Building upon the ideals of data and software openness held by the FOSS community, peer production came into being in the 1990s due to the Internet’s ability to make extremely large-scale, remote collaboration possible. Such collaboration allowed for the creation of new types of socio-technical systems/communities.

Linux was one of the earliest examples of peer production, and other systems/communities soon followed. For instance, the online encyclopedia Nupedia was launched in the year 2000 and evolved shortly thereafter into the peer production community Wikipedia. Motivated by founder Jimmy Wales’s desire for “a world in which every single person on the planet is given free access to the sum of all human knowledge” [120], Wikipedia would eventually become the largest encyclopedia ever produced.

Openness and freedom have always been important in peer production and FOSS, both with regard to the content produced and to the ability of contributors to generally create what they wish. Wikipedia is defined as “the free encyclopedia that anyone can edit” [108], and other peer production communities have similar views. Given such views, questions have understandably arisen about the reliability of peer-produced content. Take Wikipedia, for example. The credibility of Wikipedia has been consistently brought into question ever since it was launched nearly 20 years ago. Numerous media stories have noted inaccuracies and inappropriate content in Wikipedia (e.g. [13,110,121]), and some of these stories have been published quite recently (e.g. [13]). However, work has also shown that, in a general sense, Wikipedia is remarkably accurate. One prominent 2005 study [22] in the journal Nature noted that the reliability of Wikipedia content is approximately the same as that of the highly-regarded Encyclopedia Britannica. The study provides a compelling argument for the “wisdom of the crowd” [112] approach taken by Wikipedia and peer production versus traditional “expert-only” approaches to content production.

2.2 Brief Background on Communities Studied in this Thesis

The studies in my thesis take place in the context of two peer production communities that generate structured data, namely OpenStreetMap and Wikidata. We briefly describe relevant details of each community next.

2.2.1 OpenStreetMap

OpenStreetMap (OSM) was founded in 2004 and has been described as “The Wikipedia of Maps” [17]. Its data is used in humanitarian mapping initiatives and in popular applications like Mapbox and Apple Maps.

In OSM, contributors can map out any geographic entities, including businesses, roads, parks, bodies of water, and more. OpenStreetMap contributors apply metadata to characterize the semantics of geographic entities. Specifically, they apply tags – tags are key-value pairs, where the key represents a concept and the value indicates a specific instance of the concept. For example, a fast food restaurant might include the tag “cuisine = burger”.
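
As a concrete (and entirely hypothetical) illustration of this key-value representation, a mapped fast food restaurant can be thought of as a location plus a small dictionary of tags; the identifier, coordinates, and name below are invented.

    # A hypothetical fast food restaurant as an OSM-style entity: a location
    # plus a set of tags (key-value pairs) describing its semantics.
    fast_food_node = {
        "id": 123456789,                      # placeholder entity id
        "lat": 44.9740, "lon": -93.2277,      # placeholder coordinates
        "tags": {
            "amenity": "fast_food",           # the concept: what kind of place this is
            "cuisine": "burger",              # a specific instance of the "cuisine" concept
            "name": "Example Burgers",        # invented name
        },
    }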

OSM’s tag data model effectively limits contributors to applying one tag with a given key (e.g. one “amenity” key) for any given mapped entity. Although technically semi-colons can be used to separate multiple values for a given key, semi-colon use is discouraged because applications don’t always handle this syntax appropriately [122]. A proposed solution is to provide only the “primary” value for an attribute [122]. However, as we will see (e.g., for Dairy Queen cuisine), often there is no single “primary” value.
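
The sketch below illustrates why the semi-colon workaround is fragile, assuming a consuming application that expects exactly one value per key; the tag values here are hypothetical.

    # A hypothetical consumer that assumes one value per key: a semicolon-separated
    # value such as "burger;ice_cream" matches neither "burger" nor "ice_cream"
    # unless the application explicitly splits on ";".
    tags = {"amenity": "fast_food", "cuisine": "burger;ice_cream"}

    def naive_is_burger_place(tags: dict) -> bool:
        return tags.get("cuisine") == "burger"                   # False here

    def semicolon_aware_is_burger_place(tags: dict) -> bool:
        return "burger" in tags.get("cuisine", "").split(";")    # True here

    print(naive_is_burger_place(tags), semicolon_aware_is_burger_place(tags))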

The OSM community also created and maintains a wiki2 that identifies “certain key and value combinations for the most commonly used tags, which act as informal standards.” In other words, the wiki tells contributors how to tag entities. The wiki specifies information such as relationships between tags – for example, that an entity tagged as “amenity = restaurant” also should include a value for the key “cuisine” to indicate the type of food served. It also characterizes appropriate values for a key – http://wiki.openstreetmap.org/wiki/Key:cuisine recommends the value “burger” for “e.g. McDonald's, Wendy's, Jollibee (Philippines)” and the value “coffee_shop” for places that serve “mainly coffee, [and] may have some light cold snacks such as cakes”.
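
As a hypothetical illustration of how such a wiki rule could be checked mechanically (this is not a tool the community actually uses), a simple validator might encode the “a restaurant should also have a cuisine” relationship as follows.

    # Hypothetical validator for one wiki rule: entities tagged
    # "amenity=restaurant" should also carry a "cuisine" key.
    def check_restaurant_rule(tags: dict) -> list:
        problems = []
        if tags.get("amenity") == "restaurant" and "cuisine" not in tags:
            problems.append("amenity=restaurant is missing a 'cuisine' tag")
        return problems

    print(check_restaurant_rule({"amenity": "restaurant", "name": "Example Diner"}))
    # ["amenity=restaurant is missing a 'cuisine' tag"]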

Thus, to the extent that OSM contributors can produce standardized metadata, it is by consulting the OSM wiki and understanding how it applies to the geographic entities they are editing. In addition, several editing tools (e.g. JOSM, Potlatch, iD) help editors by suggesting metadata to apply. We will discuss these tools in more detail in the next chapter.

2.2.2 Wikidata

The success of Wikipedia has helped spur the development in 2012 of its sister project, Wikidata, a major Wikimedia Foundation project described as “a free linked database that can be read and edited by both humans and machines” [123]. Whereas the focus of Wikipedia is on article content, the focus of Wikidata is on machine-readable structured content. Current applications using Wikidata include Apple’s Siri, Google Knowledge Graph, and even Wikipedia infoboxes.

The structured content in Wikidata describes all manner of concrete and abstract concepts. The Wikidata representations of these concepts are called items and exist for everything from people (e.g. “Nelson Mandela”) to emotions (e.g. “happiness”) to literature (e.g. “The Great Gatsby”). As of this writing, more than 50 million Wikidata items have been created.

2 http://wiki.openstreetmap.org

As in OSM, key-value pair structured data are applied to define entity attributes. For The Great Gatsby, for instance, Wikidata contributors have added information that tells us when the book was first published (publication date=10 April 1925), who wrote it (author=F. Scott Fitzgerald), and what its ISBN is (ISBN-10=0-7432-7356-7).

Unlike in OpenStreetMap, the Wikidata data model has been structured to accommodate multiple values per key. For instance, the “occupation” attribute for the item corresponding to Barack Obama includes both “politician” and “lawyer”.
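
For contrast with OSM’s one-value-per-key model, below is a heavily simplified sketch of a multi-valued Wikidata item; the structure is an abstraction of Wikidata’s actual JSON data model, and human-readable property labels stand in for the real P-identifiers (e.g. P106 for “occupation”).

    # Simplified sketch of multi-valued Wikidata statements; property labels are
    # used in place of the real P-identifiers, and values are abbreviated.
    item = {
        "id": "Q76",                                   # Barack Obama's item id
        "labels": {"en": "Barack Obama"},
        "statements": {
            "occupation": ["politician", "lawyer"],    # multiple values for one key
            "date of birth": ["1961-08-04"],
        },
    }

    for prop, values in item["statements"].items():
        print(prop, "->", ", ".join(values))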

2.3 Performing Contributions in Peer Production

2.3.1 Manual Contributions

Prototypically, peer production contributions have come from direct, manual editing by humans. In this thesis, we will use the phrases “manual editing” or “human editing” to describe this type of contribution. Communities like Wikipedia, Wikidata, and OpenStreetMap all allow manual editing and provide user interfaces in which to do so. In communities such as Wikipedia and Wikidata, manual contributions can come from two types of contributors: 1) registered contributors and 2) logged-out “anonymous” contributors. While the latter group performs a small proportion of edits [86], research [124,125] has shown that such anonymous manual contributions actually provide significant value to peer production communities. Given this, the work in this thesis intentionally studies both registered and anonymous manual contributions when applicable.

2.3.2 Automated Contributions

While manual contributions do still occur in Wikipedia and other peer production communities, automated mechanisms play a large and increasing role as communities have matured and technology has improved. Bot editing has become prevalent in Wikidata particularly quickly due to the relatively straightforward nature of extracting structured data from existing knowledge bases. Broadly, two types of automation are used in peer production: semi-automated tools and fully-automated bots. As a general rule, bots require less manual input to run than tools. However, the line between tools and bots can sometimes be blurry, since these strategies perform many of the same tasks and the level of automation may vary between individual tools and bots. For example, in Wikipedia, AutoWikiBrowser [107] is a semi-automated tool that can perform quite rapid automated editing [19].

Automation has many roles in peer production. Halfaker and Riedl [34] defined a set of four primary uses for bots in the context of Wikipedia. First, bots are “force multipliers” of manual work: they can edit faster and longer than manual contributors. Second, bots “monitor and curate...content”. For example, they can fix spelling errors or attempt to repair broken links. Third, bots can augment the boilerplate editing software to aid contributors. Lastly, bots identify and counteract malicious behavior such as vandalism.

As bots have begun to play a more prominent role in peer production, they have caught the attention of researchers, resulting in a number of studies (e.g. [18,20,34,78]) seeking to understand their effects on community dynamics and outcomes. Some of this work has noted complex interactions and tensions that have begun to exist between humans and automation. There has been much debate about the roles appropriate for bots. For example, bots can enforce policy by editing community discussions to include member signatures [18]. However, this use of bots resulted in community pushback against bots correcting human behavior. Skepticism towards bots is justified given some unfortunate experiences with them in peer production. A notable example occurred in OpenStreetMap, with a large-scale import of US TIGER4 geographic data [126]. This import provided basic map features like highways and rivers as a foundation for OpenStreetMap’s US map to build upon [126]. However, the import is notorious as a questionable use of automation since it introduced large amounts of “outdated and erroneous” data [99]. Instead of saving human editors years of manual work, the import has taken years of work to fix [127]. A similar problem occurred in Wikipedia with rambot imports of geographic data to thousands of articles on cities [103].

Due to the potential of bots to cause harm, communities have built bureaucratic infrastructures to govern bot deployment. English Wikipedia developed the Bot Approvals Group (BAG) [105] to “[oversee] most areas and processes dealing with bots” [102]. Bots cannot run without BAG approval [105]. Wikidata has adopted a similar approvals process. As will be discussed in the next chapter, OpenStreetMap has a policy called “mapping for locals” intended to ensure automated imports are used only if approved by contributors in the geographic region the imports affect [128]. These approval processes are burdensome, so some contributors seek to circumvent them. To the extent they succeed, they violate policy and hide their bot activities from analysis. The third study in this thesis explores the topic of bot detection. Of particular use in that study are implicit behavioral attributes of bots that have been discovered both in past peer production work [19,31] and in work in other domains [45,46,89,91], such as malware and video games, where bots often actively seek to avoid detection.

2.4 Research Studying Contributor Freedom within Peer Production

Peer production communities have created sets of core community ethos. For example, in Wikipedia, these ethos are known as the “Five Pillars” [104]. Work by Benkler [8] has identified the ethos of contributor freedom as particularly vital to the success of these communities. This ethos has allowed contributors to concentrate on adding content “as they see fit” [33] without excessively thinking about community guidelines or rules.

Particularly relevant to the first challenge studied in this thesis is work that has shown the contrast in how OSM and Wikipedia treat community norms. The Wikipedia community has developed hundreds of policies (including essays, guidelines, and “official policies”) to “[manage…] diverse views” [53]. Palen et al. [75] discussed how OSM differs from Wikipedia: OSM seeks to keep “bureaucracy at bay” as an effort to support “diverse participation”. They also note that as OSM grows, “social and technical approaches” for governance are preferred over bureaucratic ones, while Wikipedia managed “community growth through the creation and clearer articulation of policies”. While policy enforcement varies by language edition in Wikipedia (see [88]), Wikipedia, in general, is more policy-oriented than OSM3.

Policies can have a negative effect on a community. Halfaker et al. wrote that “Wikipedia has changed from ‘the encyclopedia that anyone can edit’ to ‘the encyclopedia that anyone who understands the norms, socializes him or herself, dodges the impersonal wall of semi-automated rejection and still wants to voluntarily contribute his or her time and energy can edit’.” [30] Strictly enforced norms certainly have positive effects, e.g. managing vandalism. However, they also may have negative side effects; for example, Halfaker et al. [30] found that Wikipedia’s strong and strictly enforced norms reduced retention of new editors. Moreover, Lin [63] has noted that OSM’s “lack of established rules” has advantages: it makes for a low “entry barrier” and provides a chance to become more of a community member. In chapter 3, we explore the ways in which OSM’s substantial scope for contributor freedom affects its ability to produce standardized data, where following standards is essential for automatic processing. In chapter 4, we quantify the impact of this tension.

2.5 Studies of Content Value in Peer Production

Broadly, many studies have considered the value of peer-produced content and have done so in diverse ways. This section provides a high-level summary of the most relevant of such work.

A common way to assess content value in Wikipedia, Wikidata, and OpenStreetMap is to study content quality (e.g. [24,26,29,35,40,44,51,54,64,96]). In the context of OpenStreetMap, many studies have considered content quality based on spatial attributes such as completeness and positional accuracy. Some have also considered the structured tags that are applied to geographic content. A frequent technique has been to compare tags to government and commercial data sources, e.g. [24,65,98]. Unfortunately, comparing OSM tag data with sources like this is not always possible and does not scale well. Hence, other work has sought to develop intrinsic measures of tagging quality, e.g. [7,41]. One such method [7] considers the mean tag count per OSM record: higher averages indicate higher-quality tagging and vice-versa.
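
As an illustration of this kind of intrinsic metric, a minimal sketch of the mean-tags-per-record calculation described above might look as follows; the sample records are invented.

    # Minimal sketch of an intrinsic tagging-quality metric: the mean number of
    # tags per OSM record (higher averages are read as richer, higher-quality tagging).
    records = [
        {"amenity": "restaurant", "cuisine": "pizza", "name": "Example Pizza"},
        {"highway": "residential"},
        {"amenity": "hospital", "name": "Example Hospital", "emergency": "yes", "phone": "+1-555-0100"},
    ]

    mean_tags_per_record = sum(len(tags) for tags in records) / len(records)
    print(mean_tags_per_record)  # (3 + 1 + 4) / 3 = 2.67 (approximately)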

Data that is less standardized provides less value to applications and algorithms and is arguably lower quality. The OpenStreetMap community has defined a global set of tagging recommendations4 that serve to encourage standardized tagging. Prior work [4] has studied these community-defined recommendations and identified issues, including culture-related problems that stem from using a global set of road tagging recommendations. We study the community-defined tagging recommendations in detail in the first two studies in this thesis.

3 For the discussions in this thesis, we focus on English Wikipedia.

4 https://wiki.openstreetmap.org/

In general, work has indicated that tag quality in OpenStreetMap needs improvement. OpenStreetMap’s data is characterized by Ballatore and Bertolotto [3] as being “spatially rich, but semantically poor”. To facilitate better-quality tagging in OpenStreetMap, some researchers have introduced tag recommendation systems (e.g. [47,93]).

In the context of Wikidata, some studies have considered content quality. For instance, Piscopo et al. [78] used a machine-learning framework called ORES [129] to predict content quality in order to understand how contributions coming from different sources (bots, registered manual contributors, and anonymous manual contributors) associate with content quality. They found that the highest-quality content tends to be that in which both bot and manual contributors provide a significant proportion of edits. They also found that anonymous manual contributions tend to be associated with lower-quality content.

Some peer production research has defined value by incorporating content demand into its metrics. Warncke-Wang et al. [95] took a consumer-focused approach to considering Wikipedia content value by exploring the relationship between article quality and demand. Per their value metric, high-demand articles should also be high-quality; in other words, quality and demand should be aligned. To obtain a holistic sense of alignment, Warncke-Wang et al. applied this metric to four large versions of Wikipedia: English, Portuguese, French, and Russian. They found that for the majority of articles, content quality and demand did align. However, they also found in-demand content, viewed a total of 2 billion times a month, whose quality was significantly lower than the metric would call for.
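
One simple way to operationalize this notion of alignment (an illustrative sketch, not Warncke-Wang et al.’s exact procedure) is to compare each article’s quality percentile to its demand (page-view) percentile and flag articles whose quality falls well below their demand; all numbers and the 0.25 threshold below are hypothetical.

    # Illustrative quality/demand alignment check: articles whose quality
    # percentile falls far below their demand percentile are flagged as misaligned.
    articles = {
        "A": {"views": 2_000_000, "quality": 0.9},
        "B": {"views": 1_500_000, "quality": 0.2},   # high demand, low quality
        "C": {"views": 5_000,     "quality": 0.4},
        "D": {"views": 100,       "quality": 0.1},
    }

    def percentiles(values):
        order = sorted(values)
        return {v: order.index(v) / (len(order) - 1) for v in order}

    view_pct = percentiles([a["views"] for a in articles.values()])
    qual_pct = percentiles([a["quality"] for a in articles.values()])

    for name, a in articles.items():
        gap = view_pct[a["views"]] - qual_pct[a["quality"]]
        print(name, "misaligned" if gap > 0.25 else "aligned")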

Warncke-Wang et al.’s results are likely due to the fact that content demand tends not to influence where contributors edit. As Warncke-Wang et al. [95] summarized, “consumer…demand is generally not a large consideration in how contributors decide to allocate their work.” Aside from Warncke-Wang et al.’s work, some Wikipedia studies have found content editing and content demand to be positively associated [39], while others have not indicated this is so [25,60]. The final study in my thesis seeks to better understand the relationship between content quality and content demand in the structured data found in Wikidata.

Some studies have also defined content value based on a community’s ability to represent minority or protected populations. For example, work in the context of Wikipedia [57,68] and OpenStreetMap [87] found better representation of male-related content compared to female-related content. Further, peer-produced information quality has been shown to be associated with the socio-economic development status of a region [26,85]. Finally, work in the context of OpenStreetMap and Wikipedia [42,66,100] found that rural areas tend to have poorer-quality content than urban areas. The final study in my thesis builds upon such work to explore problematic biases in the structured content found in Wikidata.


3 Understanding the Causes of a Tension Between Freedom and Standardization in Peer-Produced Structured Content

3.1 Introduction

As discussed in chapter 1, there is an inherent tension between the standardization needs of structured data and peer production communities’ ethos of contributor freedom, which encourages contributors to just “go for it” and employ “trial and error” [114]. OpenStreetMap (OSM) particularly emphasizes contributor freedom, doing so even more than Wikipedia. Where Wikipedia tempers contributor freedom with a set of policies (such as “Neutral Point of View”) that are strictly enforced by the community, OSM’s “Good Practice” [115] says “Nobody is forced to obey [the OSM guidelines], nor will OSM ever force any of its mappers to do anything.”

This tension led us to articulate a central research question: How does OSM’s strong commitment to contributor freedom affect its efforts to produce standardized data?

To answer this question, we performed an interview study of OSM contributors. The study focused on the production of OSM metadata (“tags”), investigating community practices that lead to standardization successes and failures.

Our contributions are as follows:

• We show why the large degree of contributor freedom affects the ability of peer production communities to be standardized. For example, some contributors – through greater technical skill or dedication to a cause – were able to influence standards. Cultural differences also caused standardization problems – for example, a “highway” can have different definitions in different regions.
• Based on our results, we offer several new sociotechnical strategies and tools to improve standardized data creation in peer production communities. For example, our work problematizes OSM’s 1:1 tagging structure, motivates the need to be able to link similar entities, and informs the design of tools that can improve standardization without increasing the effort required to contribute.

We next discuss related research and then our research methods. The heart of the chapter consists of our results, interleaved with discussion of their meaning and implications. We conclude with a synthesis.

3.2 Related Work

OSM has been the subject of considerable research since its inception in 2004. Much of this research has focused on comparing OSM spatial entities to ground truth data (e.g. [24,26,70]). This work mostly has looked at spatial dimensions such as positional accuracy and completeness of entities mapped.

While the predominant focus in the OSM literature is on the spatial entities themselves, some researchers have examined OSM metadata. Some of this work (e.g. [24,70,93]) touched on challenges in ensuring metadata standardization; these challenges helped motivate our research. For example, Girres and Touya compared OSM highway tag data to French BD TOPO® ground truth data in a small portion of France and found roadway standardization problems [24].

External OSM tag editing applications and the algorithms they leverage have played an increasingly important role in shaping the OSM tagging folksonomy. These algorithms do not necessarily enforce the ‘informal standards’ of the wiki since they factor in observed tagging practice (which may or may not follow the wiki standards). Vandecasteele and Devillers [93] point out the large degree of “semantic heterogeneity” in OSM and propose a solution to mitigate this problem through a tag recommender, OSMantic. This tool uses existing community tagging practices such as tag co-occurrence to recommend new tags for OSM records being edited. Karagiannakis et al. created a similar recommender, OSMRec [47]. While these systems may well help contributors follow community practice, there is no guarantee that practice actually follows standards; if not, these systems end up reinforcing suboptimal practices. Determining how standardization fails is thus of great importance, and we seek to understand it. Our study builds upon standardization work in OSM and seeks to understand the causes of standardization problems in data.
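
To make the co-occurrence idea concrete, the following is a toy sketch of co-occurrence-based key recommendation in the spirit of (but not taken from) tools like OSMantic or OSMRec; the mini “corpus” of tagged entities is invented.

    # Toy co-occurrence tag recommender: suggest keys that frequently co-occur
    # with the keys already present on the entity being edited.
    from collections import Counter
    from itertools import combinations

    corpus = [
        {"amenity": "restaurant", "cuisine": "pizza", "name": "X"},
        {"amenity": "restaurant", "cuisine": "burger", "opening_hours": "Mo-Su"},
        {"amenity": "restaurant", "cuisine": "sushi", "name": "Y"},
        {"highway": "residential", "name": "Z"},
    ]

    co_counts = Counter()
    for entity in corpus:
        for a, b in combinations(sorted(entity.keys()), 2):
            co_counts[(a, b)] += 1
            co_counts[(b, a)] += 1

    def recommend_keys(existing_keys, top_n=3):
        scores = Counter()
        for key in existing_keys:
            for (a, b), count in co_counts.items():
                if a == key and b not in existing_keys:
                    scores[b] += count
        return [k for k, _ in scores.most_common(top_n)]

    print(recommend_keys({"amenity"}))  # e.g. ['cuisine', 'name', 'opening_hours']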

A small amount of work has considered the OSM community’s process of standard creation (as specified in the wiki). Ballatore and Mooney examined the negotiation process behind the creation of the standards found on the wiki [5], using textual analysis of the OSM wiki and OSM mailing lists. They found some of the same impediments to standardization that we find, though through a different method. For instance, they found that cultural differences in road representation have resulted in the community’s inability to arrive at a global roadway tagging standard.

3.3 Method

To answer our research question, we performed 15 semi-structured interviews. Our questions focused on the process of creating OSM’s metadata standard and applying it while entering specific geographic data, with an eye to identifying reasons for standardization failures. The interviews occurred during March and April 2016 via Skype or Google Hangouts. Recruitment was conducted in three ways: two OSM mailing lists (“Tagging” and “Talk”), an OSM forum, and snowball sampling. I and a collaborator performed the interviews (mostly separately, but occasionally together) in English following predefined protocols. All participants consented electronically and verbally in line with our IRB-approved protocol. We compensated participants with a $10 USD gift card. Participants came from a number of countries, mainly Western and developed countries (~33% from each of North America, Europe, and Asia). This distribution is generally consistent with OSM demographics [10]. There was a wide range of OSM experience, from several months to 10 years (see Table 3.1). 80% of participants were male, broadly consistent with general OSM demographics [27]. Most participants edited OSM in a non-professional capacity (i.e. did not edit OSM as part of their jobs); a minority edited due to their role in a humanitarian or other organization. See Table 3.1 for additional participant information gathered from a web tool called “How did you contribute to OpenStreetMap?” [130].

Table 3.1: Participant Information

Participant   Sex   Time in OSM   Map Changes
P1            M     4 years       >1,500,000
P2            M     7 years       >250,000
P3            F     3 years       >1000
P4            M     10 years      >2,500,000
P5            M     3 years       >1000
P6            M     2 years       >1,000,000
P7            M     8 years       >250,000
P8            M     7 years       >250,000
P9            M     < 1 year      >250,000
P10           M     7 years       >500,000
P11           F     < 1 year      >100,000
P12           M     6 years       >2,500,000
P13           M     5 years       >1,500,00
P14           M     6 years       >100,000
P15           F     1 year        >100,000

To analyze the interviews, we first transcribed the audio recordings and then employed a Grounded Theory approach [71]. I and two collaborators applied open coding to the transcripts. Then I and three collaborators collaboratively reached consensus on general themes of the codes using an affinity-diagramming approach, involving constant comparison amongst codes. This resulted in six themes we describe next.

3.4 Results and Interpretations

Our themes all relate to how the OSM community’s attempts to produce standardized data mesh with its commitment to contributor freedom. To help understand the themes more holistically, we introduce a set of concepts that cut across many of the themes.

• Correctness. For metadata to be standardized, there must be definitions of “correct” and “incorrect” metadata. We observed two types of standardization issues associated with the notion of correctness: (a) how complete must metadata for an entity be for it to be considered correct? (b) how consistent must metadata be across different contexts for it to be considered correct?

• Community. How does the OSM community manage itself to produce standardized metadata? To what extent are standards enforced? How does the community reach consensus on standards? As we consider these questions, it is instructive to compare OSM to the most prominent peer production community, Wikipedia.
• Code. To what extent do OSM’s data ontology and data entry tools facilitate – or impede – the production of standardized data?

We conclude each of our themes with a discussion to situate it with respect to these concepts and draw implications for design, practice, or research.

3.4.1 Theme 1: Freedom vs. Metadata Completeness

Freedom is fundamental. A number of contributors (P1, P4, P7, P8, P10, P11) explicitly noted that lack of rules is fundamental to OSM’s ethos and differentiates it from Wikipedia. While freedom may harm standardization by letting contributors vary the amount of content they provide, contributor freedom often is precisely why participants preferred OSM (P8, P10). Four participants noted that – compared to Wikipedia, where articles must adhere to strict quality standards such as verifiability and neutrality – OSM has less rigid standards and less enforcement (P4, P7, P8, P10). Participant P10’s remark was illustrative:

Wikipedia to me feels like Germany, too many rules and regulations. (P10)

However, too much freedom may cause problems. One participant characterized another OSM contributor as taking freedom to an absurd degree by proposing excessive detail:

…he suggested that certain stores like supermarkets, have a list of things they sell, brands, and did we wanna put the brands they sell in OpenStreetMap. Well, for Christ's sake. Imagine that! How many brands? You'd have a list of thousands, and they would change all the time! (P1)

Several other participants explicitly described the lack of clearly defined and enforced standards as problematic. For example, P11 expressed confusion over what tags should be applied to a McDonald’s fast food restaurant.

Do you need to list, like everything that a McDonald's will ever give you? (P11)

While OSM seeks to minimize use and enforcement of policies [75], and the community values this, we see that too little direction also can be problematic. Thus, a balance must be struck. We next suggest several possibilities for doing so.

Region-specific information maturity can suggest appropriate contributor freedom. One participant (P4) noted that actual contributor freedom in OSM was a function of the completeness of the map in a given region.

…both in Wikipedia and OSM…freedom allows growth to happen quickly, allows iteration to happen fast. With the iteration, you improve and you figure out a system for that place…And this freedom is kind of contextual. Like in the UK for instance [in OSM], it doesn't seem like there's much freedom for things new, and it would be very wrong if they would impose such restrictions in growing communities like India and other places where they say “oh this is how it should be done and this is the right thing”. That'll completely stifle growth of the community…(P4)

Of course, a notion of localized freedom also can be problematic; as we discuss later, others desired global standards. However, P4 raises an issue that has become urgent for Wikipedia’s health – an increase in policies and restrictions aimed at standardization can negatively affect community growth [30]. Palen et al. characterized OSM as of 2015 as similar to the Wikipedia of 2007: it is growing fast and looking to “find ways to cope with and maximize this opportunity.” [75] They argue that it is too early to tell if OSM’s less policy-oriented approach to handling growth “will address problems of policy rigidity and unfriendliness to newcomers that Wikipedia has faced” (e.g. those pointed out by Halfaker et al. in [30]).

Contributors look to “satisfice” – apply “basic” tags and do a “good enough” job (P1, P2, P3, P4, P7, P10, P13, P15). Participants had different definitions of what they took to be “complete enough” when mapping. Adding basic tags to characterize entities was a suggested heuristic. For example:

…I always think the main thing is to just get it [a business] on there [OSM], and sort of basically what it is, if it's a restaurant, or you have a cuisine, maybe a website or something, the phone number, or the address. (P7)

And while contributors believed they put in good faith effort to tag entities, they also acknowledged that they sometimes skipped details – it was too much work or too time consuming to apply all possibly relevant tags to an entity.

Yeah, it’s too much work to add everything. I just add...minimum I feel is required to uniquely describe, or not uniquely but, to a good extent describe the object to somebody who is new. (P4)

3.4.1.1 Discussion

Freedom is highly valued by the OSM community and plays a major role in defining and differentiating the community, particularly in contrast to Wikipedia. This is reflected in relaxed attitudes about the degree of metadata completeness required. Participants resolved this in a rough and ready manner, by suggesting a “good enough” heuristic and noting that regions of differing information maturity rightly should expect different levels of completeness.

3.4.2 Theme 2: Project-Specific Freedom and Metadata Correctness: Humanitarian OpenStreetMap (HOT)

HOT is a special interest group in OSM that works to create maps “when relief organizations are responding to disasters or political crises” [131]. HOT and other humanitarian activities play a prominent role in the OSM community: HOT serves as “a driver of OSM’s evolution” [75].

At least half our participants (P2, P6, P7, P9, P11, P12, P15) did HOT work, often in Africa. HOT mapping is directed by the HOT Tasking Manager, a tool that lets volunteers do “armchair mapping” (mapping using remote imagery) for predefined regions throughout the world [132] [133].

HOT prioritizes only metadata needed for humanitarian efforts, not metadata completeness. P11 noted that HOT’s focus is achieving humanitarian goals, not producing detailed maps. Specifically, providing detail not needed for humanitarian efforts slows contributors down, resulting in fewer objects mapped per unit time. Highways, tracks, and paths were noted as particularly important to HOT, leading to metadata like “highway=path”, “highway=track”, etc.

I feel like I’ve been told to calm down and like hold back on some of the detail and I think it's because it's specifically for that project [HOT]; like I wanted to [map] every single house, and I wanted to tag whether I thought it was a house or whether it was a commercial building. I was going into all this detail…And someone was like 'just calm down, we’re trying to save lives here we don’t need like the most, like intricate map’. But I also really want to put in pretty much every bit of information. My theory is that eventually someone's gonna need that information… (P11)

…[HOT wants her to] 'tell us how big the village is and tell us where the roads are'. They’re really particular about ‘we only want highways and paths and tracks and that’s it' and I’m like ‘there’s so much more detail’ (P11)

HOT may redefine metadata meaning, leading to inconsistency with global standards. Participants observed that HOT contributors both invent new tags and redefine global definitions of tag meaning to achieve their goals.

The NGOs in HOT…invent their own tags...so they sort of define what they want (P12)

…we do bend the rules slightly with Humanitarian OpenStreetMap for the humanitarian work. For instance, in the earthquake in Nepal, we were trying to help the aid agencies to reach remote places. So what we were doing online was trying to identify helicopter landing sites. And what we did was, we found the feature for a helicopter-landing site. And what we would do is look for an open area about 30 meters...A clearance of 30 meters that was level land near a village, and we will label that as a helicopter landing site. (P6)

In other words, HOT appropriated and redefined the tag for a “helicopter landing site” to meet their immediate needs (for the April 2015 Nepal Earthquake). Interestingly, in less than a month, the OSM wiki was updated with a new tag for this purpose, “emergency=landing_site”. This is a vivid example of HOT bending OSM’s global rules, leading to (at least temporarily) inconsistent semantics, and eventually driving the evolution of the global ontology.

3.4.2.1 Discussion HOT exploits OSM’s freedom to meet its own needs for metadata correctness: only as complete as necessary for humanitarian work and inconsistent with global standards if needed. It is instructive to compare HOT to projects within Wikipedia; Wikipedia includes many WikiProjects, dedicated to improving Wikipedia coverage of specific topics. Like HOT (and other OSM projects), WikiProjects can define new information structures as needed. However, WikiProjects must abide by global Wikipedia standards such as the “five pillars” – not doing so, even temporarily, is considered a policy violation [134] and is grounds for immediate reverting. In contrast, as we have seen, HOT may choose not to follow global OSM recommendations for metadata completeness and may adopt inconsistent semantics for global metadata concepts.

HOT’s approach helps it achieve important humanitarian aims. However, as P11 noted, OSM data is persistent and potentially of interest over time and in multiple contexts. Thus, there is a tension between HOT’s need for focus and speed and the desire for generally useful and globally consistent metadata. We will revisit this tension in the themes that follow.

3.4.3 Theme 3: Cultural Differences Make Global Metadata Correctness Standards Difficult to Achieve and Maintain Since OSM is a global community, contributors come from different cultures and speak different languages. Our participants often mentioned how language and culture differences make it difficult to achieve globally consistent definitions of metadata correctness.

Must Western definitions be followed globally in OSM? (P1, P6, P10, P11, P15). OSM started in London, and participants noted the strong Western influence on the OSM metadata standards. This was reflected in tag naming conventions, available tags, and tags considered important.

Why should we be using British terminology just 'cause it started in Britain? (P1)

…a lot of the tags are very Western world-centric, you know, so there’s lots of amenities and services for people in the African countries where volunteers are serving and tags don't exist for those kind of things yet...one of our volunteers in Zambia has told me that in Zambia they have what’s called a maternity waiting shelter and it’s right next to their rural health centers so it’s a space for women to stay at the shelter until they give birth... (P15)

Prior research also found that global road standardization in OSM is a complex problem that has not been solved satisfactorily [5]; our participants likewise complained that the global road classification system did not accurately describe roads in all areas, particularly outside the global West. For example:

[In the context of humanitarian mapping in Africa] Someone says “this is a highway”, and I’m like I disagree. And I’m really afraid to map that as a highway if I think, for example, like a vehicle can’t go down it and that sort of thing. And I think there’s a lot of conflict between what sort of roads people use to differentiate between those things and what roads people think are important or not…people see things differently. (P11)

Different language editions differ in metadata correctness standards (P3, P5, P6, P7, P10). The OSM wiki (like Wikipedia) has language-specific versions. Some tags exist, or are documented, only in certain language versions.

[P5 noted that in Japan, navigation can occur by using neighborhood police boxes] This [system] is, as far as I can tell, a totally ubiquitous part of the urban landscape in Tokyo and most Japanese cities, and a really important way-finding mechanism...I think that's because it's just not a phenomenon that exists in too many other countries, and as a result the tagging documentation only exists in Japanese.

Further, a tag might have different descriptions in different language versions of the wiki; thus, there is no globally agreed-upon definition of correctness.

…there are Dutch translations, or French translations, or Spanish translations…they don't say the same things (P10)

As a specific example, P7 (an American) noted that different language versions specified wholly different tagging conventions (e.g., McDonald’s could be “amenity=fast_food” in one, and “amenity=restaurant” in another).

…last year I found that there's this whole parallel tagging scheme, so I tried to sort of communicate with the guy who has been working on that page, I think it was in German or something, and sort of, we sort of, if you think one of these is better than the other, and we didn't really come to anything… (P7)

It is interesting to note that all these phenomena also have been observed across Wikipedia language editions [12,36,37,62]. OSM tags that occur in only some languages correspond to “concept-level diversity” [37] in Wikipedia, and differences in tag descriptions correspond to “sub-concept-level diversity” [37]. The parallels with Wikipedia engender promising research possibilities. For instance, the methods used by Hecht and Gergle [37] and Bao et al. [6] could be leveraged to assess and visualize the differences in spatial attributes in different parts of the world. Similarly, work on Wikipedia has shown that the diversity between language editions surfaces in algorithms that leverage Wikipedia data [37]. It would be interesting to see if the same occurs with algorithms that utilize OSM data.

Local tagging practices are preferred over (conflicting) global correctness standards (P1, P3, P5, P7, P8, P9, P15). P3 (an American) noted that the wiki contained conflicting descriptions of what constitutes “correct” metadata. When she sees multiple recommendations for tagging an entity, she looks at how others have mapped that same entity in the area she is mapping to select “locally appropriate” tags.

Yeah, I know I have [seen conflicting advice on the wiki pages]…when that happens I guess I just pick a way, whatever I guess, if there's one way that seems to be prevalent in this area. (P3)

Another American participant noted that she believes that the tags she applies are correct per wiki guidelines, but that she also values “the local knowledge of the people involved…” (P15); thus, her tags correspond to local participants’ ideas of metadata correctness (this mapping philosophy also has been reported in prior work [52]). In general, our participants – who typically were not local to the areas they mapped – emphasized the value of mapping for the sake of local accuracy. P8, who had spent time mapping locally in Thailand, said just this, while also reporting that not everyone takes this approach (similar to the issues Wikipedia has had representing “Indigenous knowledge” [94]).

I believe the map has to be usable by locals, so for Thailand, it has to be Thai script [the name tag], but it's not a map for foreigners, made by foreigners, but it should be a map for locals. So that is some deviation, likely other mappers don't do. (P8)

Working towards a global consensus for tag definition (P1, P4, P5, P6, P7, P8, P12, P14, P15). In tension with the previous point, many participants expressed a desire to develop a globally consistent consensus on tag meanings. This is a challenge, particularly given different language representations of the same entity.

...there was a discussion on tagging the streets in a way that, for shop=marine, and people in the US it made sense that that's a place that would sell boating supplies, or gear, things like that, but people in Europe, and the UK were like no, that's chandlery…So there was some discussion, and I think settled on boat supplies or something, it was something that would be easily understandable around the world, so until you get that sense of how worldwide it is, and how diverse all the people are around the world, it's easy to just assume I know the right way, this is it, and completely not understand that there's other concepts, there are other things to call them. (P7)

P7 noted that the community insists on globally applicable rather than region-specific tags, regardless of the entity being mapped, resulting in these different interpretations of entities across cultures/languages. Similar to the “culture clashes” noted by Lin [63], P6 stated that defining a general tag that makes sense on a global scale is difficult.

[related to tagging apartment buildings] The result is that conflicts can occur because a [tag for an apartment] can be different things in different countries…What generally happens [hypothetical scenario] is that the French, if they're trying to set up a specific local feature, would actually come on to the OSM and discuss it. They'd put a proposal on the wiki page, and open it up for discussion on the tagging forum. And then everyone would come in and discuss the possibilities, how that would fit in…and how it can be used. And then after the proposal, they go for a vote to actually vote whose version is going to come up as being the accepted version. And after the vote takes place, then a wiki page is set up which describes that feature. (P6)

3.4.3.1 Discussion We saw that the OSM community insists on a global set of tags (i.e. the French do not have their own French-specific tags), reflecting a desire for consistent global standards. However, there also is a preference for local tagging schemes to be given priority over other, more global schemes (i.e. if both “highway=path” and “highway=primary” make sense for a road in Africa, the one preferred by locals should be used). This tension can result in a local-versus-global tug of war over metadata correctness that can be confusing to contributors. This is analogous to the distinction between personal and public tags in the folksonomy literature. In studying the movie recommendation site MovieLens5, Sen et al. [84] found that personal tags (analogous to locally appropriate OSM tags) were not as valuable to the community (analogous to OSM as a whole) as they are to a given person (or, for OSM, a local community).

This observation is consistent with the commentary of Ballatore and Mooney [5]. They stated that OSM’s “global, universalistic scope…clashes with the heterogeneity of its contributors and objects of interest” and also noted that “Recurrently, contributors set off to find a universalistic conceptualisation and, after encountering insurmountable problems, resorted to more contingent and localised approaches.”

There are two possible paths for resolving the tension participants reported between global standards and local knowledge. First, OSM could privilege the local, allowing language-specific tags and tagging schemes analogous to language-specific versions of Wikipedia. This would help purely local applications, say routing applications within a specific country or region. On the other hand, inter-regional comparisons and applications would suffer. Second, OSM could insist on a global set of tags and tagging standards. Several possible steps could make this effective across regions and contexts, including: (a) clearer definitions of tag meanings and conditions for applying them, and (b) (as suggested by P4) use of pictures on the wiki to help contributors understand the physical entity corresponding to a given tag and minimize misinterpretations due to cultural differences (the use of pictures has been successful for similar purposes in other multi-national online systems [28]).

5 https://movielens.org

3.4.4 Theme 4: Community-Management Obstacles to Achieving Consensus The OSM community uses various online media to discuss proposals for new tags and related topics. These include many forums and mailing lists. According to our participants, these forums have problems that make it hard for the community to achieve a clear and unambiguous metadata standard, and thus make it hard for contributors to produce correct metadata.

Standards proposals lack authority. Some participants did not like the voting process for proposed tags (P3, P10). Specifically, since so few people participated, our participants did not consider the process to be a valid way to reach a community consensus.

Think about the thousands of people who contribute to OpenStreetMap and think like it shouldn't just be 12 people deciding on this [proposed tag]. (P3)

Online discussions are often unproductive (P2, P4, P5, P8, P13). Five participants described the online communication among community members as unproductive. A veteran contributor from Thailand was particularly frustrated with discussions related to how to map in the US.

So I tell ya, I see a lot of discussion, particularly in the US map the discussions are just endless and they talk and talk and talk and they don't actually do anything. The US map, I shouldn't say this but I think it's in terrible condition. (P13)

P5 felt that stronger leadership was important for OSM:

[Wikipedia has] arrived at a stronger set of norms, governance. They definitely have their own problems, but they have a strong central leadership organization and have professionalized in a way that OpenStreetMap has not…I think [OSM should be more similar to Wikipedia]. (P5)

Hostility and toxic behavior online (P3, P4, P5, P9, P15). Five participants said that hostility/toxicity was a problem in the mailing lists and other communication media.

Oh boy so the lists can be really great and an awesome way to keep the community connected and supportive of each other…but they [emails] can also be like horrible…I know there’s been rifts in the past and sometimes the email list can be very toxic…people feel like they can say things that they wouldn't say to someone else's face. (P15)

P3 also mentioned that she had “heard of women not being listened to or respected”.

Another participant noted that small sets of contributors with rather extreme viewpoints were given more credence than they deserved. In one particular instance, the effectiveness of HOT was brought into question.

That is a fringe minority opinion that HOT could be doing bad work, and everyone's entitled to spout off their beliefs, but without the kind of strong leadership that can establish a vision for the project, these ideas get way more credence than they deserve. (P5)

3.4.4.1 Discussion Lam et al. noted that psychological research has shown that while large groups are more prone to conflict, “large and diverse groups can make better decisions than individuals or experts” [56]. Their study of Wikipedia found that more contributors increased quality while warning to “be wary of decisions that are made by groups that are very small” [56]. This research reinforces the idea that OSM decision processes would benefit by involving more people.

Further, as in many online communities, toxic behavior and inefficient communication reduce community effectiveness and member satisfaction. Similar issues exist in Bitcoin, another open source community that lacks strong leadership and enforced norms [135]. Crucial updates needed to handle the increased popularity of Bitcoin were hindered due to ineffective leadership [135]. As P5 stated above, strong, proactive leadership as in Wikipedia could improve communication, likely resulting in increased productivity. Improved productivity could reduce the number of people who are frustrated with the decision-making process (like P13), and make them more inclined to participate in that process, which could increase its authority.

3.4.5 Theme 5: Data Representation Prevents Conceptual Correctness In the OSM data model, only one value is allowed per key for any record; for example, a record representing a Dairy Queen shop cannot be tagged both “cuisine=burger” and “cuisine=ice_cream”. A possible workaround is to concatenate multiple values with semicolons (e.g. “cuisine=burger;ice_cream”). However, this workaround is rarely used: it breaks most map rendering applications, and the OSM wiki recommends that contributors apply only the “primary” attribute of an entity [122].
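
To make the one-value-per-key constraint concrete, the following sketch models an OSM record as a plain Python dictionary (the record structure and helper function are our own shorthand, not part of any OSM library) and shows the semicolon workaround together with the parsing burden it pushes onto data consumers.

    # A minimal sketch (our own dictionary-based shorthand, not an OSM tool)
    # of the one-value-per-key constraint and the semicolon workaround.
    dairy_queen = {
        "name": "Dairy Queen",
        "amenity": "fast_food",
        # Only one value is allowed for "cuisine", so a contributor must pick
        # "burger" or "ice_cream" -- or use the rarely supported workaround:
        "cuisine": "burger;ice_cream",
    }

    def multi_values(tags, key):
        """Split a possibly semicolon-delimited tag value into its parts."""
        return [v.strip() for v in tags.get(key, "").split(";") if v.strip()]

    print(multi_values(dairy_queen, "cuisine"))  # ['burger', 'ice_cream']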

However, the obvious problem is that many real-world entities have multiple valid values for certain attributes – a Dairy Queen does serve both ice cream and burgers. But OSM’s data model forces contributors to choose one, meaning that the entity cannot be represented correctly. Participants (P10, P14) described examples of this problem:

…you have to map the hotel, and the restaurant, but sometimes it's just a restaurant with a hotel upstairs, so it isn't really two things, it's just one thing. So you have to say it like this is a restaurant with rooms available or something like that. (P10)

3.4.5.1 Discussion To make sense of the problems noted here, it is useful to refer to Lessig’s notion of code [61]. He observed that the code of a system constitutes an architecture that constrains human behavior by allowing, forbidding, encouraging, or discouraging certain actions. Specifically, the OSM data model – adopted for certain technical reasons – constrains contributors’ ability to represent the real world correctly. And thus, changes to the code – the data model – are needed to enable correct representation.

An existing OSM technical solution is to use the concept of “relations” to link two geometries (e.g. the restaurant and hotel geometries) into one entity. However, relations are rarely used in OSM, and they do not accurately capture the real-world semantics anyway. A more fundamental change to OSM’s code, namely allowing multiple values per key, would solve the problem directly. Of course, it is likely that other parts of the OSM code base and applications that use OSM data rely on the one-value-per-key assumption, so such a change would have to be designed and implemented carefully.

3.4.6 Theme 6: Data Entry Tools May Harm Metadata Correctness and Privilege Certain Users As we mentioned, OSM contributors use a number of tools to facilitate the data entry process. Despite their benefits, they also raise problems of metadata correctness.

Data import tools. The OSM community is wary of bulk import of data from external sources “because poor imports can have significant impacts on both existing data and [the] local mapping community.” [136] Many participants (P1, P4, P5, P6, P8) expressed similar concerns that data imports can cause problems of metadata correctness.

They did a big data import from the New York State Department of Environmental Conservation and areas that are wilderness [they incorrectly tagged those areas]. “Landuse forest” [a tag], they're totally wrong… (P1)

…any import is viewed extremely skeptically, and the only ones that happen as a result are the ones that nobody talks about and that are done stealthily and improperly. (P5)

...if someone is importing a bogus tag...it comes up with a high number of occurrences, so just counting the number is not the correct method to say this is widely in use. So you would have to count a little bit more, you would have to count how many individual mappers have used it. (P8)

3.4.6.1 Discussion Searching for places of interest is a common use case for applications built on OSM data. Businesses in particular have an incentive for accurate representation in OSM, since they want to make it easy for potential customers to find them. Two participants (P4, P12) suggested that businesses update their own metadata in OSM – P12 specifically called for OSM to make it easier for businesses to do this. Since businesses maintain their own databases, this would require OSM to extend its sociotechnical code by developing data import tools and data correctness monitoring tools that enable it to trust bulk-imported data.

However, there is another technical problem that would have to be solved to enable this. With current OSM tagging practices, it is not easy to accurately and reliably identify the OSM entities that correspond to instances of a given business, e.g., Starbucks. Instead, user-assigned entity names (which often are inconsistent with official business names) and other entity attributes must be examined to infer that an OSM instance actually is (say) a Starbucks coffee shop. Modifying the OSM code to provide unique identifiers for real-world entities (say, a single ID for all Starbucks) is a reasonable approach to solving this problem (and is somewhat similar to the suggestion by Girres et al. [24]). P5 explicitly noted that this approach could facilitate metadata import by businesses and other organizations.
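
As a rough illustration of why this inference is brittle, the sketch below matches user-assigned “name” tags against a canonical business name using simple normalization; the name variants and normalization rules are hypothetical examples of our own, and a unique-identifier tag would remove the need for this kind of guesswork.

    import re

    # Hypothetical name variants of the kind seen in OSM "name" tags.
    osm_names = ["Starbucks", "starbucks coffee", "Starbuck's #1234", "STARBUCKS - Airport"]

    def normalize(name):
        """Lowercase, drop apostrophes, turn other punctuation into spaces,
        and remove store numbers."""
        name = name.lower().replace("'", "").replace("\u2019", "")
        name = re.sub(r"[^a-z0-9\s]", " ", name)
        name = re.sub(r"\b\d+\b", " ", name)
        return " ".join(name.split())

    def looks_like(name, canonical="starbucks"):
        return normalize(name).startswith(canonical)

    print([looks_like(n) for n in osm_names])  # [True, True, True, True], but only heuristically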

Data Entry Tools. Widely used OSM editing tools such as JOSM and Potlatch facilitate the metadata entry process by suggesting ‘preset’ tags for entities. For example, a preset for tagging fast food restaurants suggests some relevant keys and likely values for the keys. From the perspective of metadata correctness, however, we note that (a) each tool implements different presets – P9 and P2 mentioned this; and (b) the source for the suggested presets is unclear; for example, P6 thought they were based on actual tagging practice. More fundamentally, our participants were unsure whether these presets were consistent with the metadata standard laid out in the OSM wiki (P1, P2, P4, P5, P7, P8, P9, P10, P12, P14). Some thought they were; others did not:

…there is not a single source for those presets…it's all manually done by the developers of the tool. (P8)

[JOSM is] pretty much a direct implementation [of the wiki] (P9)
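
To make concrete what such a preset might contain, here is a purely hypothetical sketch; the structure and field names are ours and do not mirror JOSM’s or iD’s actual preset formats.

    # Hypothetical editor preset: fixed tags applied automatically, plus
    # suggested keys and likely values offered to the contributor.
    FAST_FOOD_PRESET = {
        "fixed": {"amenity": "fast_food"},
        "suggested_keys": ["name", "cuisine", "opening_hours", "website"],
        "likely_values": {"cuisine": ["burger", "pizza", "sandwich"]},
    }

    def apply_preset(preset, user_values):
        """Combine the preset's fixed tags with contributor-supplied values."""
        tags = dict(preset["fixed"])
        tags.update({k: v for k, v in user_values.items() if k in preset["suggested_keys"]})
        return tags

    print(apply_preset(FAST_FOOD_PRESET, {"name": "McDonald's", "cuisine": "burger"}))
    # {'amenity': 'fast_food', 'name': "McDonald's", 'cuisine': 'burger'}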

Some of our participants noted another aspect of these tools: the power of their code [61]. While OSM has little central structure and the community’s ethos champions contributor freedom, the data entry tools contributors use constrain their actions while elevating some contributors’ positions:

If you really want to push for specific tagging…you have to get the tool [that uses the desired tagging] in [use in OSM]. So if something is a preset in JOSM, if something is a preset in iD, and the stand up map is rendering it, you have a very high chance of that being actively used (P8)

3.4.6.2 Discussion These observations strongly remind us of the power of code [61]. Of course, it is natural and useful for data entry tools to suggest plausible metadata for entities as they are entered. However, to the extent that the community has a standard for metadata correctness, the tools should base their suggestions on that, rather than on observed practice (which may or may not accurately follow the standard).

Further, P8’s quote makes explicit the ‘politics’ of code, “the conditions of code and software in relation to power, capital, and control.” [58] Those who create (or influence the creation of) OSM tools are exerting control over the community [58]. This perspective is consistent with that of Lin, who argued that “technical skills…are interlinked with the roles one holds” in OSM [63]. Another example of this in OSM can be seen in several participants (P1, P7, P15) mentioning that editing the wiki requires technical expertise, and that not all OSM contributors felt comfortable doing so. If editing the wiki were easier, more contributors might be able and willing to participate in creating and maintaining the standard for metadata correctness. Once again, this could be done by modifying the OSM sociotechnical code, for example, by training contributors in the use of visual editing tools.

3.5 Reflecting on Correctness, Community, and Code We now take the opportunity to summarize and reflect on how the concepts of correctness, community, and code let us make sense of our results.

Theme 1 emphasized how OSM contributor freedom (“Nobody is forced to obey”) may result in incomplete (and thus incorrect) metadata. Contributors may settle for their own notion of “good enough”, with no assurance that they supplied sufficient metadata for applications that use the data.

Theme 2 showed how an OSM project, Humanitarian OpenStreetMap (HOT), uses OSM’s freedom to ignore global standards of metadata correctness, both completeness and consistency. They do this because they prioritize achieving their humanitarian goals; however, we also saw that their efforts help drive the global evolution of OSM. We also contrasted OSM’s community management with Wikipedia by comparing the latitude given to HOT and WikiProjects.

Theme 3 examined a set of culture- and language-based issues that impact metadata correctness and related community practices. The Western origin of OSM has led to the OSM ontology embodying Western concepts, for example, of what constitutes a “highway” or “helicopter landing site”. This reminds us of the power of code to constrain behavior. As contributors map in non-Western regions, they may have to appropriate and redefine the meaning of these concepts to make sense in their context, resulting in inconsistencies between “global” (more accurately, Western) and local standards. Moreover, different language versions of the OSM wiki – its metadata standard – may differ in their treatment of concepts, again putting a truly global correctness standard out of reach.

Theme 4 showed that the large amount of freedom OSM contributors enjoy and its weak community management can exacerbate common online community problems of inefficient communication and toxic behavior. Particularly important for our concerns is the effect of the problems on the development and propagation of metadata correctness standards. Some of our participants expressed concern that only a small, but very active group of people participates in the creation of these standards. In addition, toxic behavior, including sexism, can systematically exclude many contributors from participating in this process. This may result in standards that reflect the preferences of only a small group and not those of the entire OSM community.

Themes 5 and 6 emphasized the power of code in shaping OSM and OSM contributors’ behavior. OSM’s data model makes it impossible to represent certain real-world entities – for example, a fast food place that serves both ice cream and burgers – in an intuitively correct way. OSM data entry tools facilitate metadata entry, but can propagate metadata practices inconsistent with the OSM standard. In effect, because they are used by OSM contributors, they may define a second, unofficial standard with more power than the one laid out in the wiki. We also saw that OSM community members with greater technical ability and access – for example, knowledge of how to edit the OSM wiki or ability to modify a data entry tool – had greater power to influence the nature of OSM data. Code is especially powerful in OSM because community management is (intentionally) weak.


4 Understanding the Effect of a Tension Between Freedom and Standardization in Peer-Produced Structured Content 4.1 Introduction While the study in the previous chapter sought to understand the ways in which the tension between freedom and standardization manifests itself, the study in this chapter sought to understand the extent to which contributor freedom has been impacted by this tension.

In addition to core principles such as contributor freedom, we have noted that peer production communities also establish guidelines and rules to promote quality and consistency. For example, the pages in the OSM wiki serve as “informal standards” for OSM contributors. Given the prevailing ethos of contributor freedom, these guidelines and rules often require interpretation and generally do not have to be followed. As we have noted, there is good reason for such an attitude; when rules have proliferated and their enforcement has grown strict, productivity of peer production communities tends to decline [30]. Given the non-binding nature of the “informal standards”, we posed a question:

(RQ) How do OpenStreetMap’s ‘informal standards’ relate to actual contributor practice?

In this chapter, we use the term standardization to refer to the process by which OSM contributors orient their practice with the “informal standards” of OpenStreetMap.

We first investigated standardization by analyzing the extent to which OSM practice is consistent with the guidelines in the OSM wiki. We found that most applied metadata (or “tag” data) is in fact consistent with the wiki. However, this analysis revealed a second important observation: due to properties of the OSM data model, many of the guidelines cannot be complied with fully. Specifically, in a number of cases, the wiki accurately identifies multiple appropriate values for a given attribute: for example, a “Dairy Queen” serves both “ice cream” and “burgers”. However, the OSM data model restricts each attribute to a single value, so only one of the appropriate values can be recorded. The resulting ambiguity is a problem for applications that use OSM data because entities are only partially described. Our analysis also led to a third observation about the OSM standardization process: the wiki guidelines reveal many unmet opportunities for applying metadata. For example, operating hours and phone number data are identified as relevant to many business entities, but are rarely applied. These data are commonly used by popular mapping services such as Yelp and Google Maps when available, which indicates user demand for them.

Our work contributes by shedding light on the nature of standardization in OpenStreetMap as follows:

• Most applied metadata is consistent with the standard.
• The constraints of the OSM data model lead to a large amount of ambiguous metadata.
• The informal standard of the OSM wiki defines large unmet opportunities to apply useful metadata.

In the remainder of this chapter, we first discuss related work and provide additional relevant background about tagging in OSM. We then discuss our specific research context and methods. We next present our results. We conclude by discussing how our findings motivate changes to OSM and peer production more generally. Specifically, we discuss changes related to sociotechnical tools, data model structure, and community informal standards.

4.2 Related Work In this study, we examined how well tagging practice and the OSM wiki align. Several previous tag standardization studies have considered the wiki, e.g. [15,69,80]. Our work differs from such studies by systematically analyzing a substantial portion of the wiki to extract tag application “guidelines” and determining the adherence of each tag to the guidelines over a large number of OSM records. For example, we consider whether the tags applied across thousands of McDonald’s OSM records are each applied in accordance with wiki guidelines.

Further, an important part of the wiki standard is its prose-heavy descriptions of the entity characteristics that can be represented through tags. Our robust, large-scale qualitative and quantitative approach involved analyzing and interpreting the wiki instructions, including this prose. This analysis aimed to follow the same process that OSM contributors can (if they choose) follow when mapping. This analysis approach is novel in the context of OSM research and led us to identify data standard and data model issues that were not discovered or analyzed in prior work.

We also build upon the work in the previous chapter and elsewhere [4] that has identified challenges in creating or following the OSM wiki. More specifically, the current study complements the previous chapter, which found that OSM struggles to craft the wiki to represent the views of all its contributors. This is due to problems such as cultural differences and toxic behavior by some contributors. We quantify the effect that such problems have on data standardization. Our prior work also identified the data model/data standard issue that results in what we call ambiguous data. We quantify this issue here.

4.3 Additional Relevant Information on OpenStreetMap Tags and Tagging Standards As discussed in more detail in chapter 2, OSM’s tag data model effectively limits contributors to applying one tag with a given key for any given mapped entity. This fact will be relevant to the study in this chapter.

In this chapter, we refer to the OSM wiki as the OSM tagging standard. As discussed, it provides informal tagging guidelines, offering guidance on how to apply tags via pages often consisting of significant amounts of unstructured text. Many such pages are specific to a given key or tag. For example, the wiki describes the tag “amenity=fast_food” as appropriate “for a place concentrating on very fast counter-only service and take-away food.” [137]

As has been noted, contributors are not required to follow the wiki guidelines. However, the wiki represents common community tagging practices and consensus on how tags are intended to be used. As the community changes and grows, the wiki evolves. Such wiki modifications are often performed in accordance with a tag proposal and voting process [138]. This process determines what new tagging-related content should be added to the wiki.

In this chapter, we define tag standardization as the process of orienting contributor tagging practice with the informal standard of the OSM wiki. “Standardization” refers to the extent that tagging practice unambiguously adheres to the wiki guidelines.

4.4 Motivation for Analyzing Chain Business Standardization Since the OSM wiki consists of significant amounts of prose tag descriptions and application instructions, comparing this informal standard to tagging practice is hard. It is not practical to manually compare every wiki page with each of the more than 500 million OSM records in our dataset. We therefore needed to find a tractable approach to measuring standardization.

We did this by identifying a large and interesting subset of entities with substantially similar structure, specifically chain businesses such as McDonald’s, Starbucks, Safeway (a major U.S. supermarket chain), and Wal-Mart. Each individual McDonald’s restaurant, Starbucks coffee shop, or Safeway supermarket is similar to every other one. This contrasts to boutique or “one-off” businesses, where each instance is potentially unique, and thus manual information lookup and analysis would be needed to determine whether an OSM representation follows a standard. Because of the standardized nature of chain businesses, all OSM records for say, U.S. McDonald’s restaurants (likewise for other chain businesses) should be tagged substantially the same. These observations lead to a tractable process that yields a conservative estimate of standardization: group all OSM records for a chain business; identify the “substantially similar” metadata instances of a chain business should have; and determine whether individual OSM records have that metadata.

Focusing on chain businesses yields an approach that scales, but analyzing these businesses also has additional value. First, they are popular: McDonald’s accounts for over 17% of the fast food market share in the [139], and over 40% of Americans visit a Wal-Mart each week [81]. Second, chain businesses have been under-studied by geographic HCI researchers and the broad community, with most projects focusing on the discovery of boutique venues. Third, fast food restaurants and convenience stores (both of which we analyzed) are more prevalent in low SES areas [59]. Chain businesses are therefore important to the populations of those areas, who tend to be underserved in peer production [26,44]. Finally, and critically, if OSM contributors cannot apply tags in a standardized way to real-world entities that are in fact highly standardized, it is unlikely that less similar real-world entities will be standardized well in OSM either.

4.5 Methods 4.5.1 Clustering Algorithm OSM does not provide a widely adopted formal means to link together different instances of the same business (or any other conceptual category). Therefore, to analyze standardization of chain businesses, we first needed to extract chain businesses from the OSM dataset, which we did by developing a clustering procedure. We handled inconsistencies in OSM representations of businesses through a hybrid clustering approach that combined automated algorithms with manual verification and coding. We detail our procedure next.

4.5.1.1 Selecting OSM Instances for Analysis We used United States OSM data records from February 2014 that were available from [140]. Although our initial data contained records from outside the U.S., we used a U.S. Census TIGER shapefile [11] to filter out these records. Our dataset contained the current state of all OSM data records (node and way objects) in the 50 U.S. states. This included roads, bodies of water, and other entities. We limited records to the U.S. because our manual coding process required familiarity with the business data, and all our coders are from the United States. We removed non-business records by filtering based on tags. For example, we identified non-business tags (e.g., “amenity=university”) through manual inspection of the dataset and removed all records with these tags. This initial filtering step did not remove all irrelevant data; subsequent normalization and clustering steps were necessary.
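
The following sketch illustrates the general shape of this tag-based filtering; the record format and the example tag sets are our own illustration, not the exact filters used in the study.

    # Illustrative tag-based filtering: drop records carrying known
    # non-business tags and keep records with business-suggestive keys.
    NON_BUSINESS_TAGS = {
        ("amenity", "university"),
        ("natural", "water"),
        ("highway", "residential"),
    }
    BUSINESS_HINT_KEYS = {"amenity", "shop", "cuisine"}

    def looks_like_business(tags):
        """tags: dict of key -> value for one OSM node or way."""
        if set(tags.items()) & NON_BUSINESS_TAGS:
            return False
        return bool(BUSINESS_HINT_KEYS & tags.keys())

    records = [
        {"amenity": "university", "name": "State University"},
        {"amenity": "fast_food", "name": "McDonald's"},
        {"highway": "residential"},
    ]
    print([looks_like_business(r) for r in records])  # [False, True, False]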

4.5.1.2 Normalizing Instance Names A naive approach to clustering would group all records with the same value for the “name=” tag. However, a “name” can appear inconsistently; for example, McDonald’s locations have names ranging from the standard “McDonald’s” to “McDonald’s – East Liberty Station” to misspellings and variations in capitalization. We reduced these inconsistencies by 1) normalizing tag case, and 2) using Wikipedia redirects to help recognize naming variations of the same business. Wikipedia redirects link common name variations in Wikipedia searches to a central article. For example, Wikipedia redirects a search for “Chik-fil-A” (a fast food restaurant) to the article entitled “Chick-fil-A”. Our method sought to capture naming variations similar to these by making the “normalized” name field available to the automated clustering algorithm discussed next.
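
A minimal sketch of these two normalization steps appears below; the redirect table is a tiny stand-in for the much larger mapping derived from Wikipedia redirects.

    # Case normalization plus a Wikipedia-redirect-style lookup table
    # (illustrative entries only).
    REDIRECTS = {"chik-fil-a": "chick-fil-a", "mcdonalds": "mcdonald's"}

    def normalize_name(raw_name):
        """Lowercase the name, then map known variants to a canonical form."""
        name = raw_name.strip().lower()
        return REDIRECTS.get(name, name)

    for raw in ["McDonald's", "Chik-fil-A", "mcdonalds"]:
        print(raw, "->", normalize_name(raw))
    # McDonald's -> mcdonald's
    # Chik-fil-A -> chick-fil-a
    # mcdonalds -> mcdonald's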

Table 4.1: Chain Businesses Used in Standardization Analyses

7-Eleven, Applebee's, Arby's, AutoZone, Bank of America, Barnes & Noble, Best Buy, Burger King, Chevron, Chick-fil-A, Circle K, Culver's, CVS, Dairy Queen, Denny's, Dollar Tree, Dunkin' Donuts, H-E-B, Home Depot, IHOP, Jack in the Box, KFC, McDonald's, Olive Garden, Panda Express, Panera Bread, Pizza Hut, RadioShack, Rite Aid, Safeway, Sam's Club, Sonic, Staples, Starbucks, Stewart's, Subway, Taco Bell, Wal-Mart, Walgreens, Wells Fargo, Wendy's, Whataburger

4.5.1.3 Automated Clustering + Post-Processing We used a semi-supervised clustering algorithm [9] to further improve clustering results. The algorithm clustered instances based on their tags (including the “name=” tag and the redirect name created in the previous step). We manually selected the 50 largest business clusters, which represented some of the most common businesses in the United States.

We next performed a series of manual steps to ensure the precision of clusters. First, we combined several clusters that represented the same business (e.g. three McDonald’s clusters, two CVS pharmacy clusters, etc…). Second, we only retained instances that had the ‘standard’ name for a business ("mcdonald's"), or small variations ("mcdonalds" or "mcdonald's - east liberty station"). This process resulted in the 42 distinct clusters shown in Table 4.1.

We explicitly highlight that our clustering process reflected a need for high cluster precision that was essential to the accuracy of our tagging standardization analysis. This is because the goal of the analysis was to compare tags applied to the instances of a business cluster – say McDonald’s – to the wiki instructions that describe when those tags are appropriate. The comparison only made sense if all instances in the cluster did in fact represent McDonald’s instances. If say 25 instances in the “McDonald’s” cluster should belong to a “Safeway” cluster instead, we might falsely conclude that the tag “shop=supermarket” applied to those instances was applied in a way that was misaligned with wiki instructions. As noted previously, to avoid this problem we manually inspected “name” tags in our clusters to ensure instances were placed in appropriate clusters. Although we prioritized precision over recall, we note that our clustering approach identified some of the most common tagging practices for each business we analyzed.

After clustering was complete, our largest cluster was McDonald’s, with 3424 instances. Our smallest was Sonic (a fast food restaurant), with 169. The mean number of instances across all clusters was 672 (s.d. = 674) and the median was 343. Across all clusters, there were 28,420 business instances total.

As mentioned, chain businesses are inherently standardized in the real world because instances of a given business share many characteristics (e.g. all Dairy Queen locations serve ice cream, all Starbucks have operating hours, many McDonald’s have a drive-through). In our analysis, we focused on the tags corresponding to these inherent similarities. By focusing on the metadata that represents inherently standardized attributes of entities, our analyses should provide an upper bound of their standardization. Given these considerations, we removed, for example, tags related to the specific address of an instance (street address, city name, etc.) and miscellaneous notes pertaining to the instance (e.g. “created_by”, “note”, “attribution”, etc.).

Further, to ensure manual coding was tractable, we selected the 10 most applied keys for each business and their associated values; this resulted in 41 distinct keys and 416 distinct business and key combinations, or “business-key pairs”, collectively comprising over 94% of the remaining metadata in our clusters. Since we chose the most applied keys, this data also represented the most common tagging norms in terms of key applications in each respective business cluster.

4.6 Determining a Metadata Taxonomy Different tags in the OSM wiki serve different descriptive roles. Certain tags are appropriate for all instances of a given business. Examples include tags like “amenity=fast_food” for McDonald’s. Other tags contain a key that is appropriate for all instances of a given business, but whose value is instance-specific. This includes tags such as “opening_hours=” in the case of many businesses. Finally, other tags are appropriate for some – but not all – instances of a given business. This includes tags such as “drive_through=yes” to indicate the presence of a drive-through at McDonald’s. We developed a taxonomy to account for these different types of metadata. This taxonomy provides a foundation for evaluating the community’s standardization process. We defined three classes of metadata:

• Universal metadata describe key-value pairs appropriate for all instances of a business. All U.S. Starbucks have the same brand, so all Starbucks instances can be tagged “brand=starbucks”. The “brand” attribute has one value for all Starbucks instances.

• Universal-Varying metadata describe keys appropriate for all instances of a business, but whose values are instance-specific. All McDonald’s locations have an operating hours attribute, which can be denoted in OSM with the “opening_hours” key. The specific value appropriate for the key representing that attribute varies across instances of McDonald’s.

• Contingent metadata describe real-world variation, i.e., keys that may or may not apply to any given instance of a chain business since the attribute they represent may or may not be present (we discuss metadata describing nonexistent attributes later). For example, some McDonald’s locations have drive-through windows and some do not, some are wheelchair accessible and some are not, etc.

Table 4.2: Chain Business Metadata (Key) Role Classes

Universal: shop, amenity, contact:website, alt_name, cuisine, drive_in, website, brand, url, takeaway

Universal-Varying: ref:store_number, building:levels, opening_hours, contact:phone, phone, building

Contingent: drive_through, delivery, fax, smoking, wifi, outdoor_seating, area, contact:fax, operator, wheelchair, fuel:diesel, atm, motorcar, dispensing, fuel:octane_91, landuse, highway, internet_access, entrance

We categorized each key associated with a business as Universal, Universal-Varying, or Contingent. We did this categorization for keys (not tags), since a key can have only one value for a given OSM record, so, for example, the “amenity” key could not be both Universal and Universal-Varying.

To categorize keys, we systematically analyzed the OSM wiki page for each key, keeping in mind the context of each business it was applied to. Each key was placed into a single category (Universal, Universal-Varying, or Contingent) based on its role for the business. To ensure reliability of this qualitative process, a collaborator and I classified the keys independently and then resolved disagreements. See Table 4.2 for the results.6
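
The three classes can be summarized with a small illustrative mapping; the example keys follow the categorization reported above, while the dictionary form itself is just our shorthand.

    # Illustrative shorthand for the metadata taxonomy.
    METADATA_ROLES = {
        "brand": "Universal",                  # same value for every Starbucks
        "opening_hours": "Universal-Varying",  # applies to all, value differs
        "drive_through": "Contingent",         # only some McDonald's have one
    }

    def role_of(key):
        return METADATA_ROLES.get(key, "uncategorized")

    print(role_of("opening_hours"))  # Universal-Varying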

4.7 Results 4.7.1 Measuring Standardization We next systematically compared the tag data in each of our clusters to corresponding pages in the OSM wiki, thus assessing standardization. A collaborator and I carried out the coding procedures for this process. The procedure varied by metadata type.

4.7.1.1 Universal Metadata. For each tag in each cluster, the coders analyzed the corresponding wiki key and tag page descriptions. The coders did this to assess whether each tag was appropriate for the business instances it was applied to. For example, the wiki indicated that the tag “amenity=fast_food” would be appropriate for McDonald’s cluster instances but not for Safeway cluster instances.

6 Five keys were removed because both coders agreed they were not relevant (three were not in the wiki, one was not business related, and the final key, “ref:arbys”, was removed because it applied only to Arby’s restaurants).

Although 1,592 distinct key-values for Universal keys were applied in our dataset, we narrowed our focus by selecting the 10 most common values for each key for our coding process.7 Remaining values were considered applications that did not align with wiki instructions. We believe selecting the 10 most common values was reasonable, since this included all key-values that appeared more than once within a business (with two exceptions: one 11th-most-popular value was applied twice, the other was applied thrice).8 This coding process identified 133 business-Universal key pairs with at least one appropriate (according to the wiki) value. We used metadata associated with these pairs for Universal metadata standardization analyses.

This process showed that some applied metadata did not align with the wiki. For example, “shop=supermarket” was applied to 8 instances of the pharmacy CVS. The wiki states that “shop=supermarket” is for “a full service grocery store” [141]. Given this, and given the coders’ knowledge of CVS locations in the United States, it was clear that this tag was not appropriate for CVS instances. We classified such applications as misaligned since they did not align with wiki instructions. We consider misaligned metadata to be unstandardized metadata.

Many other tag applications were in alignment with the wiki instructions. For example, we observed both “amenity=fast_food” and “amenity=cafe” applied to different Panera Bread restaurants. Careful reading of the wiki suggested that both tags were appropriate. The wiki page for “amenity=fast_food” says that this tag should be used “for a place concentrating on very fast counter-only service and take-away food.” and “They usually, but not always, have sit-down facilities ranging from two or three to many easy-to-clean chairs and tables.” The wiki page for “amenity=cafe” describes a café as “a generally informal place with sit-down facilities selling beverages and light meals and/or snacks.” Both tags provide accurate and useful descriptive information about Panera Bread instances and were applied consistently with the wiki instructions. However, due to OSM’s one-key-one-value data model, only one of the values could be applied to a given Panera Bread instance. Hence, applications of either of these tags were considered ambiguous. More generally, whenever at least two distinct instances of the same business had different values for the same key and each value aligned with the wiki instructions, we considered those tag applications to be ambiguous. We consider ambiguous metadata to be unstandardized metadata.
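
The bookkeeping behind the misaligned and ambiguous categories can be sketched as follows. This is a simplified illustration under our own data structures; aligned_values stands in for the coders’ reading of the wiki and is not the actual analysis code.

    from collections import defaultdict

    def classify(applications, aligned_values):
        """applications: (instance_id, key, value) triples for one business.
        aligned_values: key -> set of wiki-appropriate values (coders' judgment)."""
        labels = {}
        seen_aligned = defaultdict(set)  # key -> distinct aligned values observed
        for inst, key, value in applications:
            if value in aligned_values.get(key, set()):
                labels[(inst, key)] = "aligned"
                seen_aligned[key].add(value)
            else:
                labels[(inst, key)] = "misaligned"
        # An aligned application becomes ambiguous when two or more distinct
        # aligned values were applied to the same key across instances.
        for (inst, key), label in labels.items():
            if label == "aligned" and len(seen_aligned[key]) >= 2:
                labels[(inst, key)] = "ambiguous"
        return labels

    panera = [("a", "amenity", "fast_food"), ("b", "amenity", "cafe"), ("c", "amenity", "parking")]
    print(classify(panera, {"amenity": {"fast_food", "cafe"}}))
    # ('a', 'amenity') and ('b', 'amenity') are ambiguous; ('c', 'amenity') is misaligned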

4.7.1.2 Location-Specific Metadata: Universal-Varying and Contingent. We found that very few Universal-Varying keys actually were applied to appropriate business instances. For example, “opening_hours” was Universal-Varying for Walgreens and other businesses, and thus was appropriate for all of them. However, only 3% of Walgreens had this key applied, and this trend was consistent for other businesses, too. A similar scenario played out for phone number metadata. Across all Universal-Varying metadata, 88% of potential metadata was unapplied. Note that mapping applications such as Google Maps provide this data when it is available, indicating there is user demand for this location-specific content. OSM severely lacks this type of metadata, limiting its utility as a data source.

7 We coded all tags for website-related keys. Further details of website analyses are discussed in Detailed Results.

8 Regardless, most data aligned with the wiki anyway.

A likely reason for the lack of Universal-Varying metadata is that applying it requires more work from contributors than business-wide (Universal) metadata does; contributors have to look up information for each individual business location. This extra work may be more than contributors are willing to do; our prior research in Study 1 has shown that contributors limit their effort when tagging individual records.

Contingent metadata was applied even more rarely than Universal-Varying metadata: 94% of potential metadata was unapplied. Determining this was more complex than for Universal-Varying metadata, since Contingent metadata only applies to some instances of a given business. (Although metadata is sometimes applied to indicate the lack of an attribute’s existence, this was not common in our dataset.) Hence, the effort of determining whether an attribute represented with Contingent metadata is present may explain why even less of it was applied. Given our need to look up location-specific information about Contingent metadata, we sampled an important and representative subset. For more details of this sampling process and of our rationale, see Appendix 9.1.
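
A back-of-the-envelope form of this calculation is shown below; framing “potential” applications as instances times applicable keys is our own shorthand, and the counts are invented for illustration rather than taken from the study.

    # Rough sketch of the 'missed opportunity' arithmetic.
    def unapplied_fraction(n_instances, applicable_keys, applied_count):
        potential = n_instances * len(applicable_keys)
        return 1 - applied_count / potential

    # e.g. a hypothetical chain: 1000 instances, 2 Universal-Varying keys,
    # and only 60 applications of those keys across all instances.
    print(round(unapplied_fraction(1000, ["opening_hours", "phone"], 60), 2))  # 0.97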

Given that Universal-Varying and Contingent metadata was so rarely applied when it was appropriate, we focused our remaining analysis on Universal metadata – 38,220 Universal business-key-values. We return to Universal-Varying and Contingent metadata when discussing important opportunities for the community to improve the number of tag applications.

4.7.2 Detailed Results 4.7.2.1 Universal Metadata Standardization. Recall that Universal metadata were key-values (tags) that were universal to instances of a given business. Figure 4.1 illustrates the results for Universal metadata standardization. There were 38,220 applied Universal business-key-value triples. Only 3,706 business-key-value triples did not align with wiki instructions. Thus, 90% of applied metadata aligned with wiki instructions.

However, out of the remaining 34,514 aligned triples, 76 of 133 Universal key-business pairs were ambiguous, leading to 18,841 ambiguous triples (49% of all triples). Thus, while most tag applications complied with the wiki, a significant amount of applied metadata was ambiguous. The result was that 15,673 triples were aligned and not ambiguous: that is, only 41% of metadata did not have standardization issues.

Figure 4.1: Universal Business-Key-Value Triples

4.7.2.2 Universal Standardization Failures through Different Lenses. We found that standardization of keys varied quite a bit, with a common pattern: keys whose OSM specifications are less clear are more likely to be misaligned. We discuss details next. We also observed that standardization of businesses depends largely on the keys applied to them; if keys are problematic, the businesses will be, too. Thus, analyzing standardization by businesses provided little new insight, so we do not discuss that dimension further.

Misalignment by key. Business website information (represented in our dataset by the keys “contact:website” and “website”) is prevalent in various mapping applications including those from Google, Bing, and Yelp – indicating its demand. Our initial analysis found that 63% of the website data was misaligned. As we coded the wiki, however, we observed that the wiki specifications for how to enter URLs were hard to interpret and contained very specific formatting instructions. Hence, many URLs were close to being aligned, with only small syntax problems, and most URL variations were infrequently applied. Although many URLs did not align precisely with the wiki, web browsers and other applications can handle a range of variations and still retrieve the appropriate web page. We applied some simple normalizations to the URLs in our dataset, which reduced the misalignment. We then took a further step: checking whether URLs navigated to a working website. This was the case for 93% of URLs in our dataset. Of the remaining 7%, most generated an HTTP 403 or 404 error, likely indicating that these URLs were not kept up-to-date as of the time we checked (April 2017). Figure 4.1 includes this most relaxed version of misalignment for website metadata. To sum up, from a strict syntactic perspective, most website data was misaligned, but from a practical perspective, nearly all website metadata was in fact aligned.
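
The sketch below shows the general idea of the URL normalization and liveness check; the specific normalization rules and the use of Python’s standard library are illustrative assumptions, not the exact procedure we ran.

    import urllib.request
    from urllib.error import HTTPError, URLError

    def normalize_url(raw):
        """Trim whitespace, lowercase, add a missing scheme, drop a trailing slash."""
        url = raw.strip().lower()
        if not url.startswith(("http://", "https://")):
            url = "http://" + url
        return url.rstrip("/")

    def is_working(url, timeout=10):
        """Return True if the URL resolves to a non-error HTTP response."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400
        except (HTTPError, URLError, ValueError):
            return False  # e.g. HTTP 403/404 or an unreachable host

    print(normalize_url("WWW.mcdonalds.com/"))  # http://www.mcdonalds.com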

We also further examined non-website-related misaligned metadata to check whether these tags were simply typos (i.e. slight and obvious misspellings of aligned metadata). 98% of misaligned metadata were not a result of typos; instead, the errors were due to substantive misalignments with wiki instructions. These tags appear to have been applied intentionally, even though the wiki does not indicate that they should be.

Ambiguity by key. Four Universal keys were sometimes ambiguous: “amenity”, “cuisine”, “shop”, and “website”. Table 4.3 provides details about the businesses each key was ambiguous for. “amenity” was ambiguous for 9 of the 28 businesses it was applied to. “cuisine” was ambiguous for almost all – 21 of 22 – of the businesses it was applied to. “website” was ambiguous for 35 of the 41 businesses it was applied to.

There are a couple of interesting implications from these results. First, the one-key-one-value data model restriction appears to be particularly incompatible with attributes such as restaurant cuisine. Essentially all restaurants had multiple types of cuisine, but a given instance could only have one of the types specified. Thus, applications using the data may be unaware that a given restaurant has multiple types of cuisine. Second, it is possible that some ambiguity (especially in the case of website metadata) is due to the hard-to-interpret instructions in the wiki that were discussed previously. Thus, clarifying wiki instructions is important if the community would like to improve metadata consistency. Work focused on understanding the degree to which the OSM wiki is universally understood (and informed by sociolinguistic theory on achieving common ground) is an important first step in this direction. Improving consistency would also allow the various applications using OSM data to more easily process the data.

4.7.2.3 Missed Opportunities to Apply Metadata. In addition to studying standardization, we noticed significant missed opportunities to apply useful metadata. We discuss these next.

Universal Metadata. If all appropriate Universal metadata were applied to the businesses in our dataset (e.g., if all Dairy Queens had cuisine information or all Wal-Marts had website information), the amount of applied Universal metadata would increase from 38,220 to 95,926 Universal business-key-value triples, roughly 2.5 times the current amount.

The amount of missed opportunities for Universal metadata applications varied widely across keys: mean = 76%, s.d. = 33%. Only the “amenity” key was applied with great consistency; just 4% of business instances where the wiki deemed this metadata appropriate (e.g. “amenity=fast_food” for McDonald’s) did not have it. There are several possible reasons why “amenity” is applied consistently: 1) “amenity” is the “primary” point of interest (e.g. chain business) key according to Over et al. [74], 2) applications such as OsmAnd9 appear to use “amenity=” tags to render icons, and 3) our method of filtering records (discussed in the Clustering Algorithm section of Methods) may have favored records with this key. Specifically, records chosen for further analysis either had “amenity=”, “shop=”, or “cuisine=” applied to them, or had the same “name=” tag as a record that did. We did this to remove the large amount of irrelevant records from our sample.

9 http://osmand.net
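The record-filtering rule described above can be made concrete with a short sketch. The record format and field names below are hypothetical simplifications of our actual data pipeline.

```python
def filter_records(records):
    """Keep records with an 'amenity', 'shop', or 'cuisine' tag, or whose
    'name' matches a record that has one of those tags (illustrative)."""
    target_keys = {"amenity", "shop", "cuisine"}
    # Names of records that carry at least one of the target keys.
    qualifying_names = {
        r["tags"].get("name")
        for r in records
        if target_keys & r["tags"].keys() and "name" in r["tags"]
    }
    return [
        r for r in records
        if target_keys & r["tags"].keys()
        or r["tags"].get("name") in qualifying_names
    ]

records = [
    {"id": 1, "tags": {"name": "McDonald's", "amenity": "fast_food"}},
    {"id": 2, "tags": {"name": "McDonald's"}},        # kept via shared name
    {"id": 3, "tags": {"name": "Some Power Pylon"}},  # filtered out
]
print([r["id"] for r in filter_records(records)])  # [1, 2]
```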

Table 4.3: Business-Universal Pairs with 2 or More Aligned Values

amenity: Bank of America, Dairy Queen, Denny's, Dunkin' Donuts, IHOP, McDonald's, Panera Bread, Starbucks, Wells Fargo

cuisine: Applebee's, Arby's, Burger King, Chick-fil-A, Culver's, Dairy Queen, Denny's, Dunkin' Donuts, IHOP, Jack in the Box, KFC, McDonald's, Olive Garden, Panda Express, Panera Bread, Pizza Hut, Sonic, Starbucks, Subway, Taco Bell, Wendy's

shop: Chevron, CVS, Dairy Queen, Dollar Tree, Home Depot, Radio Shack, Rite Aid, Sam's Club, Staples, Walgreens, Wal-Mart

website: 7-Eleven, Applebee's, Arby's, AutoZone, Bank of America, Barnes & Noble, Burger King, Chick-fil-A, Circle-K, CVS, Dairy Queen, Denny's, Dollar Tree, Dunkin' Donuts, Home Depot, IHOP, Jack in the Box, McDonald's, Olive Garden, Panda Express, Panera Bread, Pizza Hut, Radio Shack, Rite Aid, Safeway, Sam's Club, Staples, Starbucks, Subway, Taco Bell, Walgreens, Wal-Mart, Wells Fargo, Wendy's, Whataburger

As our research has shown in Study 1, OSM contributors have said that they just “basically” characterized objects with the “minimum” information; “it’s too much work to add everything”. Our results align with this observation: “amenity” is precisely the type of “minimum” “basic” information likely to be provided for an entity. The other Universal keys all had substantial missed opportunities to apply metadata; for example, 94% of potential “website” key applications did not exist. In the Discussion section, we consider ways to improve metadata application while still respecting OSM contributor values and attitudes.

Universal-Varying Metadata. Recall that Universal-Varying metadata were keys that were universal to instances of a given business, along with values that were location-specific. If all Universal-Varying metadata were applied to every instance of the respective businesses they belonged to (e.g. if all Olive Garden restaurants or Walgreens had operating hours information), the amount of applied Universal-Varying metadata would increase from 9,319 to 75,591 business-key-value triples, roughly eight times the current amount.

There was less variation in metadata application between different Universal-Varying keys compared to different Universal keys: nearly all appropriate Universal-Varying metadata was left unapplied. For example, two Universal-Varying keys – for operating hours (“opening_hours”) and for phone number (“phone”) – were applied to fewer than 5% of businesses they could be applied to. The information this metadata provides is very useful for potential customers, as evidenced by its use when available in applications such as Google and Bing Maps and Yelp. The absence of this data reduces the usefulness of mapping applications that use this information.

It makes sense that less Universal-Varying data would be applied than Universal data: determining the proper value for a Universal-Varying key for a given instance is a non-trivial task since location-specific information is needed. To illustrate the effort required, determining the opening hours for a specific Starbucks location requires looking up the information on the web. Obtaining other location-specific information may even require physically visiting the actual location (which is not part of OSM’s common remote “armchair mapping”10).

Contingent Metadata. Recall that Contingent metadata are key-values describing attributes that are not universal to instances of a given business. As mentioned previously, we sampled important and representative Contingent metadata. Specifically, we considered the “internet_access” key for McDonald's and Starbucks and the “drive_through” key for McDonald's, Starbucks, and Walgreens. Google Maps and Yelp also use these types of information when available, again indicating this data is important and in demand. Based on the results of our sampling process, if all Contingent metadata were applied to every instance of the respective businesses they belonged to (e.g., if all McDonald's containing a drive-through or all Starbucks with internet access had corresponding metadata applied), the amount of applied Contingent metadata in our dataset would increase from 14 to 245 Contingent business-key-value triples, 17.5 times the current amount.

Each business-key sampled had missed opportunities to apply metadata at least 90% of the time that it was appropriate. As mentioned previously and as discussed in Appendix 9.1, our samples were likely among the most applied Contingent metadata. The key “internet_access” for McDonald’s locations had the largest amount of missed opportunities, missing them 98% of the time. All but one of the McDonald’s sampled had internet access in reality – but there was no metadata to show for it.

4.8 Discussion

We summarize our core findings here. We found:

• The OSM community does a good job of applying data that is aligned with the wiki instructions.

10 https://wiki.openstreetmap.org/wiki/Armchair_mapping

• The one-key-one-value OSM data model restriction results in a very significant amount of ambiguous applied metadata.

• A significant number of opportunities to apply metadata are missed.

Based on these findings, we next provide implications for OSM and peer production more generally.

Increasing precision in data standards. Related to ambiguity and misalignment, the discrepancy in the level of “structure” between the OSM standard (wiki) and the data itself leaves room for interpretation. Similar to the observations of prior work [1,77], sometimes wiki descriptions are quite general and hard to interpret (as was seen for website metadata). This issue may be due to an effort to make the definitions globally applicable and relatable across languages; after all, contributors try to create one global tagging standard. However, this leaves room for contributors to tag the same thing in different ways. It may make sense for the community to consider alternative definitions and descriptions for metadata. Increased use of tagging examples (e.g. of business-specific examples) could leave less room for interpretation. Further, more pictures in wiki descriptions, as suggested in Study 1, may mitigate this problem. Here, it’s worth stating that since Wikidata is also a data repository with a single language-independent version, it may have similar problems and could potentially benefit from similar solutions.

OSM data model. While the above data standard changes might help reduce metadata misalignment/ambiguity, ambiguity stemming from the data model is still problematic. As we noted previously in Study 1, the data model could change to account for entities in the real-world that have multiple values for different attributes. A Dairy Queen specializes in both burgers and ice cream for its cuisine, and a contributor should not need to choose just one. This data model change would improve end-user experience since applications would have access to all information on OSM entities. This proposed data model has been shown to work in similar peer production communities: both Wikipedia and Wikidata have avoided OSM’s data model issue by opting for multiple values per key in their structured data.
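To make the contrast concrete, the sketch below shows, in simplified Python structures, how a one-key-one-value record forces a choice while a multi-value model (as used by Wikipedia and Wikidata for structured data) can capture both cuisines. The representation is purely illustrative and is not OSM's or Wikidata's actual storage format.

```python
# One-key-one-value model (OSM-style): the contributor must pick one cuisine.
osm_style = {"name": "Dairy Queen", "amenity": "fast_food", "cuisine": "ice_cream"}

# Multi-value model (Wikidata-style): each key maps to a list of values.
multi_value_style = {
    "name": ["Dairy Queen"],
    "amenity": ["fast_food"],
    "cuisine": ["ice_cream", "burger"],  # both specialties can be recorded
}

# A downstream application can now answer "does this place serve burgers?"
print("burger" in multi_value_style["cuisine"])  # True
```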

Metadata that is harder to apply or requires frequent maintenance is less likely to be applied. Based on our data and analyses of missed opportunities to apply metadata, 60% of potential Universal metadata, 88% of potential Universal-Varying metadata, and 94% of sampled potential Contingent metadata was not applied. This suggests that as metadata becomes more variable -- and thus, requires more work to apply – it will be applied less. Additionally, location-specific metadata requires more frequent updates than Universal metadata. For example, it is more likely that one specific Subway shop’s operating hours will change than it is for the cuisine of all Subways to change. Likewise, it is more likely that a Walgreen’s location will add or remove a drive-through than it is for all Walgreen’s to become something other than a pharmacy. Indeed, discussion with OSM contributors has indicated that the need to maintain metadata is a deterrent from applying it in the first place. Enabled by the core community value of contributor freedom, OSM contributors

limit their tagging effort (based on our findings in Study 1), and this shows in the form of lower percentages of potential location-specific (Universal-Varying and Contingent) metadata being applied.

Given these considerations, it might be worth pursuing new ways to use automation for tagging. A possible option that was also discussed in Study 1 would be to integrate data entry tools with businesses’ databases. While prior research (e.g. [16,44,99]) has shown the negative effects data imports and remote or non-local work can have on data quality, businesses are naturally incentivized to input and maintain accurate and detailed metadata for their locations. Of course, creating the code to facilitate business data imports would put an initial added burden on OSM contributors; however, we believe that in the end, this approach would ease the burden of getting business metadata into OSM.

Interestingly, the idea of businesses updating their own data has been considered in other peer production contexts, including Wikipedia and Wikidata. In fact, Wikipedia has a “conflict of interest”11 policy preventing businesses from doing this. However, this is not true for Wikidata, a community that, like OSM, focuses on producing structured data instead of prose. Discussion in Wikidata around creating a similar conflict of interest policy to Wikipedia indicated a feeling that “since Wikidata does not allow for natural language, a lot of nuance and opportunity for bias goes away.” [142] Given the similarities between Wikidata and OSM and the views that the Wikidata community has, it might be reasonable for OSM to follow suit and allow businesses to update their own metadata.

4.9 Limitations and Future Work

Our analysis of OSM wiki pages focused on the key and tag pages corresponding to metadata applied to instances of our chain business clusters. We note that eight of the businesses in our study currently have their own OSM wiki pages with business-specific tagging instructions. First, however, these pages are not widely adopted across businesses. Second, none of these pages actually existed when we obtained our dataset – hence, contributors could not follow them. Third and most importantly, these business pages provide specific tagging instructions, typically providing only one appropriate value for a given key. Given the one-key-one-value data model, this makes sense to do. However, the advantage of our approach is that it helps provide an understanding of the extent to which this data model constraint results in incomplete representations of the attributes of businesses. Our approach also more conservatively estimates misalignment, only considering a tag application to be misaligned if it is not helping to describe the entity that it is applied to (based on information regarding that tag in the wiki). Hence, we gave the benefit of the doubt to contributors when calculating misalignment.

As mentioned, the OSM wiki represents an ever-changing and expanding community tagging standard. As new keys and tags are added, opportunities to apply metadata increase accordingly. Because of this evolution,

11 https://en.wikipedia.org/wiki/Wikipedia:Conflict_of_interest

it’s important to note that not all the identified missed tagging opportunities were necessarily considered missed opportunities at earlier points in OSM’s history. For instance, a contributor might have applied much of the relevant metadata to a McDonald’s record when mapping it in 2010. However, by 2015, many opportunities might be missed for that record if additional relevant tags were introduced in the wiki but were not applied to the record. Future work should explore how the level of missed tagging opportunities has changed as OSM has matured.

“Coverage”, or the degree to which OSM provides data describing the real world, is a commonly used lens for considering OSM data quality. While both our work and prior work have considered missed coverage opportunities, prior work (e.g. [26,66,100]) measured such missed opportunities by considering how often objects (restaurants, highways, etc.) from the real world are represented by objects in OpenStreetMap. We, however, quantified coverage by instead considering missed opportunities to apply metadata for objects that do exist. Some work in OSM has provided evidence that coverage biases exist along dimensions such as population density [66]. Future OpenStreetMap work might consider whether similar biases occur with our definition of coverage, particularly given the substantial impact of these chain businesses in low-SES and rural areas as noted above.


5 Bot Detection in Wikidata Using Behavioral and Other Informal Cues

5.1 Introduction

The week-long period from April 22 to April 29, 2016 seems to represent an historic peak of human productivity. At around noon GMT on April 22, someone began making edits to Wikidata, a structured data peer production community and sister project to Wikipedia. Rather than choosing to have a relaxing Friday afternoon and evening, this dedicated contributor got caught up in the editing spirit and pushed on – editing late into the night. They performed their edits with incredible consistency, on average making an edit every 30 seconds. Amazingly, this contributor did not stop at all that night – nor the next, nor the next.... In fact, they continued working for 7 straight days, finally deciding to take a break over 21,000 edits later. What’s more, they made all their edits anonymously, without even logging in to a Wikidata contributor account.

It's unlikely that a single human could contribute at this pace and scale. Indeed, this editing behavior looks suspiciously like a bot. As noted previously, bot editing is particularly widely used in Wikidata (88% of all edits [86]). This is in part due to the relative ease with which its structured (key-value pair) content can be imported from external data sources. Community rules require bot operations to occur via Wikidata accounts approved to run with a “bot flag”. But obviously, the prolific anonymous contributor above was not using a bot flagged account – they were anonymous! Was this contributor really superhuman? Or was someone operating a bot against community policy12? Highly accurate bot detection in peer production is important for multiple reasons. We discuss these next.

First, inaccuracies in bot identification could invalidate analyses of peer production that attempt to distinguish the roles of bot and human edits. The second research challenge identified in this thesis sought to compare the value that human and automated contributions provide, and we wanted to be confident that we were accurately distinguishing between the two. If studies of human contributors inadvertently use datasets containing a large amount of bot activity, results supposedly about human behavior could instead largely reflect bot behavior. Since bot edit rates often are much higher than human edit rates, just a few bot activity sessions mistakenly labeled as human activity could dramatically skew the results of some analyses. For example, related to prior peer production work, if the dataset used for the seminal conclusion that “Wikipedians are born, not made” [76] contained bot edits (something that the authors noted explicitly that they wanted to avoid), it could turn out that bots are “born” and human contributors are “made”. Further, for work that has studied bot

12 https://www.wikidata.org/wiki/Wikidata:Bots

contributions (e.g., [21,78,92]), it is important to include as many bot contributions as possible to provide a complete picture of the role bots play in the community.

Highly accurate bot detection is also important since the Wikidata community is concerned with detecting all bot activities. Even just one undetected bot that is producing thousands of edits has the potential to do large amounts of damage to the community. To avoid damaging scenarios, mandatory bot approval policies have been put in place in both Wikidata and Wikipedia [105,143].

Concerns about unidentified bot activity are not merely theoretical. While not a topic of extensive analysis, previous work in Wikipedia and OpenStreetMap has identified behavioral anomalies resulting from unidentified bots [31]. Further, I have personally come across unidentified bots during data explorations13. Thus, while unidentified bots certainly exist, the prevalence of their edits remains unknown.

To facilitate both peer production research and Wikidata community policy enforcement, our work offers the following contributions.

5.1.1 Contributions

• We describe a bot detection strategy that uses machine classification with implicit behavioral and other informal editing characteristics, and show that it is effective in identifying unflagged bot activity in Wikidata.

• Using our model, we show that, for the most part, unflagged bot activities are rare. Hence, the activities of unflagged bots appear unlikely to significantly change results in many studies of human and bot behavior. However, analyses that are sensitive to outliers (e.g. max session duration) or that span relatively short subsets of Wikidata history should first extract previously unidentified bots from “human” contributions to ensure accurate analyses.

• We show that there is a meaningful amount of non-compliance with the bot policy in Wikidata. According to our model, 3% of registered “human” user edits overall and 2% of anonymous “human” user edits overall are from bots. These percentages are important since all bot activity that does not align with community policy matters to the Wikidata community.

• We make our datasets and bot detection code available under an open license: https://github.com/hall1467/wikidata_bot_prediction_model. This will facilitate future Wikidata contributor behavior research as well as better allow the Wikidata community to enforce its bot policies.

13 For example, we have identified anonymous edits coming from the Wikimedia internal network (Wikimedia operates Wikipedia). The IP addresses associated with these edits had a prefix of “10.68.” Such internal servers are not equipped to support a GUI/browser and thus the edits almost certainly came from bots.

5.2 Background and Related Work

5.2.1 Bot Detection

Current bot detection methods used in peer production rely on explicit signals (e.g. flagged bot accounts). Explicit signaling is common in communities such as Wikidata and Wikipedia since these communities are highly regulated and so tend to have relevant policies. Despite a long history of bot research, the methods used in many influential Wikipedia studies (e.g. [19,32,50,76]) are not always straightforward to apply and still miss some bots. Geiger and Halfaker [21] noted that “...getting a list of bot accounts has long been a challenge for Wikipedia researchers.” As mentioned, Wikipedia and Wikidata policies require users to gain approval to run bots; approved bot operation occurs through a separate “bot flagged” account. The communities maintain tables of both current and former approved bot accounts. Geiger and Halfaker state that a widely-used method of identifying bots in Wikipedia is to consider user accounts that are currently flagged as bots (some accounts are de-flagged when they become inactive). Another technique they mention is to find usernames with “bot” in them. These techniques have been applied in Wikidata research [72,78,86] and may have missed bots/bot edits for several specific reasons. First, contributors may forget to switch to their “bot account” to run bots, or even forget to log in at all, instead editing anonymously. Second, if a user wishes to avoid going through the bot approval process, they may secretly run their bot anyway, as an anonymous contributor or through their personal account. Third, Wikidata did not have an effective bot policy for the first 14 months of its existence14, and that policy undoubtedly was not consistently followed by all bot maintainers immediately after it was instituted.

In some contexts outside peer production – such as malware detection, video games, Twitter, etc. – it is not possible to recognize bots using explicit techniques. These contexts are often less regulated, and bots deployed in them will not identify themselves since their goal is precisely to avoid detection. Thus, implicit bot detection signals must be used in these contexts. Fortunately, bot behavior patterns often are quite distinct from human patterns, and bot detection techniques (e.g. [45,46,89,91]) can take advantage of implicit behavioral signals such as repetition.

5.2.2 Activity Session Behavioral Patterns

Because of the limitations of peer production bot recognition approaches that use explicit signals/labels, we drew insight from bot recognition techniques developed in contexts where only implicit behavioral signals are available. Behavioral patterns similar to those identified in the previous section have also been observed in peer production. Specifically, implicit characteristics of human edit “sessions” have been shown to follow predictable patterns [19,31]. In the context of such work, an edit session is a contiguous series of edits performed by a user without a substantial break. This work showed that people follow approximately

14 https://www.wikidata.org/w/index.php?title=Wikidata:Bots&diff=96055780&oldid=66572890

consistent distributions of inter-edit and between-session times, and that bot activity deviates from these human patterns.

Analyzing user contributions at the session level makes sense for two additional reasons as well. First, some registered users will make most of their edits manually but occasionally decide to run a bot through their own account for a short time. Detecting bots at the account level would miss this behavior since (presumably) most of such users’ activity would follow human-like patterns. Second, bot detection at the account level cannot be done for anonymous user contributions. Anonymous contributions are also known as “logged-out” or “IP address” contributions, since only the IP address is logged rather than a username/user id. IP addresses are often dynamically allocated; hence, an IP address could correspond to a certain user on one day and a different user on the next. While anonymous contributions compose a small percentage of contributions in peer production [86] -- less than 1% in our Wikidata dataset -- work has indicated that anonymous contributions in Wikipedia are disproportionately valuable to the community [124,125]. Thus, important questions remain to be answered in Wikidata regarding the roles of anonymous contributors and the value they add.
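As context for the session-based analyses that follow, the sketch below shows one way to group a user's edit timestamps into sessions using an inactivity cutoff. The one-hour cutoff is an illustrative assumption; the exact session definition we used is given in Appendix 9.2.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, cutoff=timedelta(hours=1)):
    """Split edit timestamps into sessions whenever the gap between
    consecutive edits exceeds the cutoff (illustrative definition)."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > cutoff:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

edits = [datetime(2016, 4, 22, 12, 0), datetime(2016, 4, 22, 12, 0, 30),
         datetime(2016, 4, 22, 18, 0)]
print(len(sessionize(edits)))  # 2 sessions
```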

5.3 Methods

We next discuss the steps we followed to build our bot prediction model. In-depth details of the generation of training, testing, and other necessary datasets used can be found in Appendix 9.2. Further, while not required to understand the discussions in this chapter, the reader may wish to review information on Wikidata terminology at [144].

We built our model using scikit-learn15. We tried two different models: a random forest classifier and a gradient boosting classifier. We expected these ensemble models to be effective at making predictions since they have proved effective in similar contexts such as Wikipedia vandalism detection [82]. We applied hyper-parameter optimization for both models, initially optimizing for ROC-AUC and PR-AUC. We used three-fold cross validation, scikit-learn’s default. ROC-AUC is a standard machine learning metric [79]. However, our training dataset was heavily skewed towards human sessions. In such cases, PR-AUC, a metric commonly used in information retrieval when skewness is a concern, is preferred [79]. We chose the gradient boosting classifier because it had a higher PR-AUC on test data (0.528 versus 0.486).
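A minimal sketch of this model comparison step, assuming a feature matrix X and binary bot labels y have already been built (the real pipeline and datasets are described in Appendix 9.2). Scikit-learn's "average_precision" scorer stands in for PR-AUC and "roc_auc" for ROC-AUC.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    """Compare a random forest and a gradient boosting classifier with
    3-fold cross-validation on PR-AUC (average precision) and ROC-AUC."""
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }
    for name, model in models.items():
        pr_auc = cross_val_score(model, X, y, cv=3, scoring="average_precision").mean()
        roc_auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
        print(f"{name}: PR-AUC={pr_auc:.3f}, ROC-AUC={roc_auc:.3f}")
```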

All features used in our model were based on “informal” characteristics of edits and editing sessions. Such “informal” characteristics were not based on the formal, explicit signals (e.g. bot flags) used in prior work to identify bots. Details of the necessary step of “session-izing” our data can be found in Appendix 9.2. As will be discussed, we iterated twice in our model feature selection.

15 http://scikit-learn.org/stable/

We first offer insights into our initial choice of model features. As stated, previous work showed that bots and humans have different inter-edit time patterns in peer production [31]. Further, bots have been shown to have longer periods of activity in (for example) online games [45]. We used these two implicit behavioral characteristics in our model and also considered the standard deviation of the time difference between edits – a similar behavioral signal as used in web robot detection [89]. We also conjectured that bots would focus mostly on actual items while people would likely also edit elsewhere in the community, for example performing policy development or engaging in other discussion. Such distinct editing activities occur in distinct “namespaces” and we derived features to measure the number of edits to these different areas of the community. Essentially, we felt bot and human edit sessions would have different distributions of edits across namespaces. Appendix 9.3 provides details about our initial features under Activity Pattern Features.
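The sketch below shows how activity-pattern features of this kind could be computed for one session. The session format (a list of edits with timestamps and namespace numbers) and the assumption that namespace 0 holds item pages are simplifications for illustration; the exact feature set is listed in Appendix 9.3.

```python
import statistics

def activity_pattern_features(session):
    """Compute illustrative behavioral features for one edit session.
    `session` is a list of dicts with 'timestamp' (datetime) and 'namespace'."""
    times = sorted(e["timestamp"] for e in session)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return {
        "n_edits": len(session),
        "session_seconds": (times[-1] - times[0]).total_seconds(),
        "mean_inter_edit_secs": statistics.mean(gaps) if gaps else 0.0,
        "stdev_inter_edit_secs": statistics.stdev(gaps) if len(gaps) > 1 else 0.0,
        # Distribution of edits across namespaces (items versus everything else).
        "item_namespace_edits": sum(e["namespace"] == 0 for e in session),
        "non_item_namespace_edits": sum(e["namespace"] != 0 for e in session),
    }
```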

After developing and formally testing our initial model’s fitness, we looked for ways to improve it. We applied the model to random samples from our anonymous user contribution sessions dataset (described in detail in Appendix 9.2) and spot-checked sessions classified as “bot” and “non-bot”. This helped us identify more features. One feature was a variant of one of our initial features, designed to better detect bots based on inter-edit time, specifically by counting the number of edits per session with an inter-edit time less than 2 seconds. The remaining features took advantage of the content in revision comments which we describe next.

Revision comments provide information about the nature of a Wikidata revision (used interchangeably in this chapter with the word edit). Revision comments are composed of both a structured and an unstructured (free-form) component. Consider the comment “/* wbcreateclaim-create:2| */ [[Property:P107]], [[Q618123]]”. The structured part is between “/*” and “*/” (e.g. “/* wbcreateclaim-create:2| */”) and is automatically generated by Wikidata’s software when a revision occurs. In this case, the structured part indicates that a claim was created. The rest of the comment (e.g. “ [[Property:P107]], [[Q618123]]”) contains information that a contributor is free to modify and replace. Bots sometimes provide additional information in this unstructured space about the nature of the revision. In our example, default information was left in place indicating the claim modified was mapping property 107 to item 618123. An example of a bot leaving its own information is “/* wbsetlabel-set:1|de */ Bot: change label for de after page move: Stena Nordica -> Malo Seaways”16.
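The split between the structured and unstructured parts of a comment can be done with a simple regular expression, as sketched below. This is an illustrative parser, not the exact code used to build our features.

```python
import re

STRUCTURED_RE = re.compile(r"^/\*\s*(?P<structured>.*?)\s*\*/(?P<rest>.*)$", re.S)

def parse_comment(comment):
    """Split a Wikidata revision comment into structured and free-form parts."""
    match = STRUCTURED_RE.match(comment)
    if not match:
        return None, comment  # no structured part present
    return match.group("structured"), match.group("rest").strip()

structured, rest = parse_comment(
    "/* wbcreateclaim-create:2| */ [[Property:P107]], [[Q618123]]")
print(structured)  # wbcreateclaim-create:2|
print(rest)        # [[Property:P107]], [[Q618123]]
```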

Most of our comment-based features distinguished different types of editing behavior by examining structured content. Among other things, such features measured the number of edits to different aspects of an item, for example, to descriptions, aliases, and labels. One feature based on the unstructured part of comments identified bots simply by looking for the word “bot” in the user-provided unstructured comment text. Similarly, another feature also relied on a (pseudo-)explicit signal, namely the presence of a “generic” boilerplate comment containing only a structured part of a certain syntax, for example: “/* wbeditentity-update:0| */”. This particular comment syntax only comes from API edits carried out by bots. Although the latter two features looked for specific and explicit signals, they still differed from bot flags: they are based on informal usage rather than mandated community policies and have not been considered in any previous bot detection methods. More details on the second iteration of model features are listed in Appendix 9.3 under Revision Comment-Based Features.

16 https://www.wikidata.org/w/index.php?title=Q3355946&oldid=209807505

Table 5.1: Bot Prediction Model Fitness on Registered User Contributions

Features                 PR-AUC    ROC-AUC
Initial                  0.528     0.888
Initial + Iteration 2    0.845     0.985

Figure 5.1: Precision-Recall and Receiver Operating Characteristic Curves

Finally, we re-tuned the gradient boosting model using the additional features and a new distinct test set of registered user contributions to avoid overfitting. Using a grid-search based hyper-parameter optimization, we optimized for PR-AUC by adjusting the number of estimators, max depth, and learning rate. The resulting parameters were: number of estimators = 1100, max depth = 3, and learning rate = 0.1.
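A sketch of this re-tuning step, under the same assumption as above that a feature matrix X and labels y are available. The parameter grid shown is illustrative and simply includes the values the search settled on.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def tune_gradient_boosting(X, y):
    """Grid-search key gradient boosting hyper-parameters, optimizing PR-AUC."""
    grid = {
        "n_estimators": [300, 700, 1100],
        "max_depth": [2, 3, 5],
        "learning_rate": [0.01, 0.1],
    }
    search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid,
                          scoring="average_precision", cv=3)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```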

5.4 Results

5.4.1 Formal Model Evaluation on Registered User Edits

Table 5.1 provides fitness information for our gradient boosting bot prediction model, both before and after we added the “Revision Comment-Based” features. These statistics result from applying the models to their respective registered user testing datasets: 1) the dataset generated for testing the initial “Activity Pattern” features and 2) the dataset generated for testing the “Activity Pattern” plus “Revision Comment-Based” (“iteration 2”) features. As can be seen, revision comment-based features considerably improved model performance. Figure 5.1 provides PR-AUC and ROC-AUC graphs from the final model (which contained both feature sets). Using testing data, with default parameters, this model had a precision of 0.88 and recall of 0.69.

5.4.2 Qualitative Model Evaluation on Anonymous User Edits

As stated earlier, anonymous edits are understudied and potentially quite valuable to peer production [124,125]. Hence, to better understand such edits, we wanted to ensure our model was effective at detecting anonymous bot contributions. When building our model, we hypothesized that the explicitly flagged bots used for model training would have behavioral characteristics similar enough to those of anonymous bots that a model trained on one set would be effective on the other. However, we wanted to be sure that this was the case. In this section, we describe how we went about ensuring our model performed well on anonymous contributions.

For unregistered (“anonymous”) user contributions, we did not have labeled testing data, so we needed to evaluate model fitness in a different way. We first applied our model to our anonymous contributions dataset (described in detail in Appendix 9.2) and generated likelihood estimates for each session. These likelihood estimates give the probability that a given session was produced by a bot. Setting a model's confidence threshold at a given likelihood estimate corresponds to a given level of model precision and recall when applied to the test dataset. If a user wants precision in the model to be higher than 0.95, the confidence threshold in the model can be set to a corresponding likelihood estimate. The model then would return predictions with precision of 0.95, assuming it is applied to data from the same population it was tested on. However, we cannot assume that the registered user data used to train/test the model was similar to the anonymous user data we applied it to. Instead, we had to test whether this was true. Therefore, to get a sense for how the model performed across meaningful likelihood ranges, we sampled at likelihoods with matching test set recall ranges of 0.0-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.4...0.9-1.0. This strategy allowed us to check how precision changed as we increased recall (sensitivity) when applying the model to anonymous contributions. If our model was effective, we should expect to see two results from this analysis. First, the model should find bots in the anonymous contributor pool at roughly the same rate as among registered accounts. Second, as one would expect in any prediction model, precision should be high when the recall associated with likelihood values is small and should decline as the recall associated with likelihood values increases. Eventually, precision should approach zero when all bots have been removed. If anonymous user sessions have significantly different behavioral characteristics than registered user sessions, we might find that precision stays roughly the same as recall increases. This would mean the model is ineffective at telling bot and human sessions apart.
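The mapping from target recall values to likelihood thresholds can be derived from the labeled registered-user test set, as sketched below. Here y_test and test_scores are assumed to be the test labels and the model's bot-likelihood estimates; this is an illustrative procedure, not the exact code we used.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def thresholds_for_recall_targets(y_test, test_scores, targets):
    """For each target recall, find the likelihood threshold on the labeled
    test set whose recall is closest to that target (illustrative)."""
    precision, recall, thresholds = precision_recall_curve(y_test, test_scores)
    # precision/recall have one more entry than thresholds; drop the last point.
    recall = recall[:-1]
    out = {}
    for target in targets:
        idx = int(np.argmin(np.abs(recall - target)))
        out[target] = thresholds[idx]
    return out

# Example: thresholds for recall targets 0.1, 0.2, ..., 1.0
# strata = thresholds_for_recall_targets(y_test, scores, [i / 10 for i in range(1, 11)])
```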

To determine whether our model was successful for anonymous user contributions in these two regards, we needed to generate ground truth data. Specifically, we needed to know which sessions were from bots and which were from humans within each likelihood/recall stratum. We refer to these strata as “recall strata” from this point forward for ease of discussion. We generated 20 random session samples per recall stratum17. A collaborator and I then developed a code book, shown in Table 5.2, in which each code identifies a bot-like session characteristic; a final code indicated the coder's overall judgement as to whether a session was from a bot or a human. I then systematically went through each sampled edit session within each stratum, examining revision comments, timestamps, namespaces, and page titles, and analyzed enough revisions per session to feel confident with the codes I applied. Depending upon the size of the session, this involved looking at as many as 50-100 or more revisions per session.

17 In one case, there were only 18 sessions in the stratum, in which case we analyzed them all.

5.4.2.1 Summary of Coding Results

Appendix 9.4 provides a detailed breakdown of coding results, including a table of results broken down by strata. As stated, the purpose of this coding was to check whether our model identifies bots in the anonymous contribution pool at roughly the same rate as for registered users and that it does so with high precision when recall is low and with declining precision as recall increases. Both characteristics occurred with our model.

Between 0.0 and ~0.3 recall, precision is very high. Between ~0.3 and ~0.7, precision drops noticeably as more human sessions appear. Finally, when recall is greater than ~0.7, most sessions are from humans.

Based on our results, we can make application-specific recommendations about how our model can be used. For applications requiring high recall bot prediction (e.g. filtering out bots for research of anonymous human contributors), we recommend optimizing a confidence threshold for recall of at least 0.7-0.8. For applications requiring high precision bot prediction (e.g. filtering out humans for research of anonymous bot contributions), we recommend optimizing a confidence threshold for recall of 0.2-0.3 or lower. Per our informal coding, when recall is between 0.0 and 0.1, precision is 1.0. When recall is between 0.1 and 0.2, precision is 0.9.

Our coding provides interesting insights into session characteristics in low recall strata where the prediction model was the most confident. Boilerplate bot comments (BBC), explicit bot comments (EBC), and fast edits (FE) codes were frequently applied when recall was below 0.4. We had derived model features to capture session characteristics represented by each of these three codes and our coding process identified the effectiveness of these features.

5.4.3 Summary of Model Evaluations

We ensured our bot prediction model worked effectively for both registered contributions (verified via a formal quantitative analysis) and anonymous contributions (verified via a qualitative thematic analysis [106]).

Table 5.2: Qualitative Analysis Codes (Code: Description)

Fast edits (FE): ~10+ edits in under ~20 seconds

Consistent revision frequency (CRV): ~3+ edits occurring over ~5+ equally-spaced minutes

Boilerplate bot comment (BBC): Comment(s) contain only a structured part whose syntax is likely indicative of bot edits coming from the Wikidata editing API. E.g. “/* wbeditentity-update:0| */”

Similar operations occur to different pages at a high frequency (SIM): >= ~40 edits/hour, over 3+ hours

Explicit bot comment (EBC): Comment(s) appear to explicitly (insensitive to case) indicate bot edits. E.g. “...Bot: change sitelink”

Short session with rapid revisions (SHORT): 1+ edits/second, over ~4+ seconds

Bot/human judgement: Given codes and intuition, is the user a bot?

5.5 Applying the Model To Registered and Anonymous Users

Given the scenario described and the questions posed in the Introduction, is there compelling evidence to be concerned about impacts from unidentified bots? We applied our model to all registered “human” and anonymous user contributions from November 2012 to April 2017 for items and properties in Wikidata (the two main “content” namespaces) to find out18. Per the recommendations mentioned in the last section, we set a confidence threshold that would correspond to 0.3 recall on our test data.

We found that 3% of registered “human” contributions were predicted to have come from bots. We broke this percentage down by month and found that 8 months in Wikidata’s history were predicted to have had more than 5% of such edits coming from bots (mean = 0.02, s.d. = 0.04). April 2017 experienced the largest percentage of predicted unflagged bot edits at 18%.

We found that 2% of anonymous contributions were predicted to have come from bots. We also broke this percentage down by month and found that 6 months in Wikidata’s history were predicted to have had more than 5% of such edits coming from bots (mean = 0.02, s.d. = 0.04). May 2016 experienced the largest percentage of predicted unflagged bot edits at 16%.
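A sketch of this monthly breakdown, assuming a pandas DataFrame of per-session predictions with a session timestamp, an edit count, and the model's bot prediction. The column names are hypothetical, not the names used in our released code.

```python
import pandas as pd

def monthly_bot_share(sessions: pd.DataFrame) -> pd.Series:
    """Fraction of edits per month predicted to come from unflagged bots.
    Expected columns (hypothetical names): 'start_time' (datetime),
    'n_edits' (int), 'predicted_bot' (bool)."""
    df = sessions.copy()
    df["month"] = df["start_time"].dt.to_period("M")
    df["bot_edits"] = df["n_edits"].where(df["predicted_bot"], 0)
    monthly = df.groupby("month")[["bot_edits", "n_edits"]].sum()
    return monthly["bot_edits"] / monthly["n_edits"]
```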

5.5.1 Implications on Behavioral Research in Peer Production

Application of our model indicates that most human or bot behavioral studies in Wikidata likely do not need to consider the effects of unidentified bots. When studying behavior across periods of several months or greater, the chances of unidentified bot activity skewing results appear minimal if considering the

18 One of our model features required edit sessions to have 3 or more edits in order to compute the standard deviation of inter-edit times within sessions. As a result, sessions smaller than this were considered human. We believe that this is justified since bots commonly edit at scale, and a primary goal when developing our model was to identify these large bot editing sessions. Such large sessions have the potential for a significant impact on the Wikidata community and on behavioral analyses if left unidentified. For more information on our model features, see Appendix 9.3.

behavioral characteristics of the “typical” contributor (e.g. taking means, medians, and similar aggregate statistics).

However, for short-term behavioral studies, it’s possible that anomalous spikes in undetected bot activity can affect results. For example, consider the scenario in April 2017 when 18% of “human” edits were predicted to come from unflagged bots. Application of our model would be necessary to avoid a false understanding of human behavior. Further, analysis of “outlier” contributor behavior (e.g. which human contributors produced the most edits in each month?) can be affected significantly by undetected bots. Consider Figure 5.2. The “All Sessions” plot represents the longest edit sessions per month in our anonymous contributor session dataset. The “Non-bot Only Sessions” plot represents the same analysis after we applied our bot prediction model to filter out bots. To ensure that nearly all bots would be removed, we used a confidence threshold that would correspond to 0.8 recall per the recommendations in the previous section.

Figure 5.2: Maximum Monthly Anonymous User Session Lengths

5.5.2 Implications for the Wikidata Community and Applications using Wikidata

The 2-3% of contributions in human contribution datasets that are unidentified bots are of serious concern to the Wikidata community and applications using Wikidata. This equates to over 1 million such edits. If even a relatively small portion of these are vandalism, the effect on data quality in the community can be large. Further, such damaging edits could propagate downstream to applications that use Wikidata such as Wikipedia and Google Knowledge Graph. Hence, we hope that our prediction model in the future can help the community identify and quickly mitigate any potentially damaging behavior. Further, work [82] has used contributor type (bot or human) as a Wikidata vandalism detection model feature. Our work could help improve model performance.

5.6 Discussion

5.6.1 Model and Feature Implications

5.6.1.1 Future Studies

Our model opens the door for a number of exciting research directions and socio-technical contributor tools. For example, important future questions in Wikidata require accurate bot detection in order to be answered. Much more work is needed to understand how human, bot, and various editing tool contributions relate to content quality, and also how such contributions relate to interest and consumption patterns. Considering the initial findings in Wikidata that a higher percentage of bot edits correlates with higher quality content [78], are bots putting their effort towards content that gets used in applications like Google Knowledge Graph and Wikipedia? Further, work in Wikipedia has indicated that the supply of good content often does not align with reader interest (demand) [95]. Are bots creating “good” content in Wikidata that is of little interest or use? Additionally, how much – and what type – of work is being done by Wikidata bots that are not explicitly flagged as bots?

5.6.1.2 Feature Implications

Revision comment-based features appear to be highly effective at predicting bot edits. Our initial model did not use these features and had a PR-AUC of 0.528 and ROC-AUC of 0.888. The addition of 13 features based on characteristics of revision comments moved PR-AUC to 0.845 and ROC-AUC to 0.985 (Table 5.1). Wikidata’s structured comments denote the types of edits occurring. Capturing when these edit types occur appears to be a highly predictive way of determining whether edits are from bots or humans. Further, based on our qualitative analysis, features derived from edit comments that are exclusive to bots (e.g. “generic” comments and comments containing the word “bot”) appeared to be effective at identifying bots. For example, “generic” comments in Wikidata are indicative of bot edits coming from Wikidata’s editing API. However, to ensure that other features also provided useful bot prediction signal, we removed the features ‘Word ending in “bot” (case insensitive) in revision comment’ and ‘# of bot generated generic comments’ and found that model performance changed minimally (PR-AUC: 0.849, ROC-AUC: 0.982). This indicates that the cumulative predictive power of the other features in this set is quite useful.

Bot prediction models in other peer production communities such as OpenStreetMap and Wikipedia almost certainly would benefit from similar features to those employed in Wikidata. As shown, inter-edit time is an effective differentiator of bots and humans in Wikipedia [19,31] and could be incorporated into a future model. Additionally, Wikipedia edit tags [145] can perhaps provide explicit and implicit prediction signals similar to those identified in Wikidata revision comments. Further, although there is no “structured” component of Wikipedia revision comments to use to identify distinct editing behaviors, work [97] has been successful at “deriving that structure” with a prediction model to identify editing behaviors from unstructured Wikipedia comments. Further, features such as inter-edit time might also be relevant for predicting whether OpenStreetMap changesets (similar to Wikidata and Wikipedia editing sessions) are human or bot-produced.

Such changesets also contain tags which can provide signal regarding whether edits originate from bots [146]. For example, the tag “bot=yes” can be applied to explicitly indicate bot changesets [146].

5.6.2 Model Improvements

We see several ways that future work could improve the prediction model. First, when recall is between ~0.3 and ~0.7 (and especially between ~0.4 and ~0.5), we coded a large number of sessions as “unknown” (it was unclear whether a session was from a bot or a human). From a feature engineering standpoint, close attention should be paid to the characteristics of these sessions, as additional high-value bot prediction signal could be found there. Further, we uncovered an additional likely source of bot prediction signal in a comment: “/* wbsetreference */”. In the future, this (and other similar comments, if they exist) could be incorporated into the feature that we called “# of bot generated generic comments”. Additionally, we noticed that bots often increment through items in order of their “item id”. This fact could also be useful as future prediction signal. Further, there was a single session in the 0.8-0.9 recall stratum that was coded as a bot (see Appendix 9.4 for details). Outliers such as this may be indicative of humans running a bot in a “semi-automated” tool-like fashion. Given the lack of other such outliers, it appears our model did well differentiating bots from semi-automated tools operated by humans. Future work may wish to differentiate human edits based on their level of automation (e.g. semi-automated tool use versus completely manual editing). Finally, the edit tags mentioned in the previous section are also used in Wikidata and could potentially be a good source of prediction signal.


6 Exploring the Effects of Manual and Automated Contributions on Content Value in Wikidata

6.1 Introduction

As we have discussed, there are several different “strategies” for contributing to Wikidata. At the most basic level, contributors can manually enter data via Wikidata’s web interface. To edit content faster, contributors can create software either in the form of semi-automated tools or in the form of fully-automated bots. Now that we had improved confidence in the ability to distinguish bot contributions from other contributions, the final work in my thesis sought to answer a fundamental question about these different contribution strategies: how does the value provided by each strategy differ? For instance, do tools add more value than manual contributors? Do bots add more value than tools?

But, what precisely do we mean by value? Structured data value in peer production can be considered through different lenses, and we use two intuitive lenses to shape our analyses. Peer production communities including Wikipedia and Wikidata have created quality scales [109,147] to define precisely what constitutes content quality within their communities. While syntactic details of these scales vary between communities, they all include notions of completeness and well-sourced information as essential. Higher-quality content clearly provides more value to anyone using it. However, whether content is useful to a typical consumer also depends on the demand for that content. Content is of limited value to consumers if either it is of high quality but low demand or low quality but high demand. Hence, we define our first lens of content value by incorporating both item quality and demand. We base much of our exploration into this lens upon past Wikipedia research [95] which argues that content quality should align (positively correlate) completely with demand in order to provide the most consumer benefit. We refer to this lens as consumer-level value.

Understanding the alignment between quality and consumer demand is a good way to obtain a sense of the value that a peer production community provides to the average consumer of structured data. However, it does not provide a complete picture of the value that content provides to society more generally. The second lens through which we consider content value is societal-level value. A societal-level definition of content value should consider whether data have biases that can cause harmful broader effects. For example, if content about protected or minority populations is under-represented or if content about privileged populations is over-represented, then consumers of such content may form harmfully biased perceptions of the world. Prior work has indeed indicated the ability of technology to contribute to potentially harmful biased perceptions of the world through, for example, Google Image Search [48]. Societal-level value is maximized when such harmful content biases are minimized.

The fundamental peer production ethos of contributor freedom allows individuals to decide for themselves what content to produce [114,115]. Prior work studying Wikipedia and OpenStreetMap has extensively explored the effects of this ethos on the different lenses of content value that we have discussed above. For

example, work has shown that contributors will generally edit Wikipedia in a way that optimizes for consumer-level value [39]. Research on Wikipedia and OpenStreetMap (e.g. [26,42,95]) has also shown that contributor freedom may result in societal-level value problems, by underrepresenting (for example) female, rural, low-SES, and LGBTQ-related content. To date, neither of the two lenses of content value that we define has been studied extensively in Wikidata. This lack of research is concerning since automated editing (from bots and tools) is the dominant contribution mechanism in Wikidata [86]. And automation has great potential to shift community-wide production dynamics and affect content value at a large scale. For example, significant harm may occur if automation were to shift production dynamics away from prioritizing societal-level value.

We compare the value that contribution strategies with differing levels of automation (bots, tools, and manual contributions) provide to peer production communities. We consider value in a robust manner by analyzing it through our two lenses: 1) consumer-level and 2) societal-level. We also investigate the underlying reasons for the behaviors we observe. Further, we perform our analyses longitudinally to see how community dynamics have changed. The results of this study are important for Wikidata and applications that use it as well as for other peer production communities as they increasingly integrate automation.

6.1.1 Contributions

This work provides several important contributions.

1. We find that manual contributors tend to work on higher-quality and higher-demand content than bots. Further, while the work performed by manual contributors tends to increase the alignment between content quality and demand, the work performed by bots has resulted in a misalignment between content quality and demand, which has increased over Wikidata’s history. We also argue that complete alignment of content quality and demand may not actually be a realistic or even desirable scenario in a community like Wikidata where automated contributions dominate and contributor freedom is highly-valued.

2. We provide evidence that distinct contribution strategies play different roles in affecting the representation of minority or protected content. For example, bots tend to disproportionately focus on Global South and rural content versus Global North and urban content. However, they also disproportionately focus on male-related content versus female-related content. On the other hand, manual contributions disproportionately focus on female-related content, but also disproportionately focus on urban content. Further, we find that unregistered manual contributions can have distinct behavior from registered manual contributions. For example, unregistered manual contributions tend to disproportionately edit Global South content while registered manual contributions tend to disproportionately edit Global North content.

3. Our work provides implications for targeting contribution strategies to optimize for content value. For instance, our findings motivate initiatives and tools to help manual contribution strategies focus on high-demand content.

6.2 Related Work

We next discuss relevant prior work. The two subsections each correspond to one of the two lenses through which we consider content value.

6.2.1 Consumer-Level Value Analyses

As discussed in the introduction to this chapter, the relationship between quality and demand is important to content consumers. Prior peer production work (e.g. [55,73]) has considered the motivations for the type of content that people work on. As summarized by Warncke-Wang et al. [95], past research “has found that consumer…demand is generally not a large consideration in how contributors decide to allocate their work.” Hence, it is not surprising that past Wikipedia work has found a mixed relationship concerning the alignment between contributor effort and content demand (e.g. [25,39,60]). For example, while work by Hill and Shaw [39] has indicated that higher-demand content tends to be edited more, other work (e.g. [25,60]) has provided evidence that content demand does not necessarily have a positive relationship with efforts to improve content quality.

The analysis of consumer-level value in our study is largely motivated by the work of Warncke-Wang et al. [95], which defined the Perfect Alignment Hypothesis (PAH). The PAH indicates the optimal relationship between consumer interest and content quality. PAH alignment is maximized when content demand perfectly correlates with content quality. In other words, the PAH states that the Nth most highly-viewed content item should also be the Nth highest-quality content item. When such total alignment occurs, content provides maximum consumer value since the highest possible number of consumers benefit from the best content. The closer content is to total alignment, the more efficient contributors are at providing value to consumers.

Warncke-Wang et al. tested the PAH in Wikipedia and found most article content was aligned. However, they also found that high-demand articles that were cumulatively viewed 2 billion times per month were of substantially worse quality than the PAH indicates they should be. We expand upon this work by exploring the PAH in the automated contribution-dominated context of Wikidata. Further, while Warncke-Wang et al. applied the PAH to a snapshot of Wikipedia’s history, we apply the PAH longitudinally to Wikidata to see how PAH adherence has changed over time. We are also particularly interested in the ways in which different contribution strategies affect PAH alignment. When exploring how contribution strategies affect PAH alignment, we consider the two possible ways that deviations from alignment occur. We define terms for both. First, a gap indicates lower quality content than demand warrants. Second, a surplus indicates higher quality content than demand warrants. Coverage gaps point to content areas where the impact of future contributor editing maximizes added value, while coverage surpluses indicate editing that could have provided more value to consumers if directed elsewhere.

6.2.2 Societal-Level Value Analyses Numerous prior studies have found problematic societal biases both in peer production communities (e.g. [26,38,42,57,87]) and other online contexts (e.g. [2,48]). For example, Kay et al. [48] found that Google

Images search results under-represent female-related content and also reinforce gender stereotypes. Work in the context of Wikipedia [57,68] and OpenStreetMap (our first study and [87]) found better representation of male-related content compared to female-related content. Further, peer-produced information quality has been shown to be associated with the socio-economic development status of a region [26,85]. Finally, work in the context of OpenStreetMap and Wikipedia [42,66,100] found that rural areas tend to have poorer quality content than urban areas.

To our knowledge, no work has studied problematic societal biases in Wikidata content. Given the disparities identified in other peer production communities, we considered it important to analyze biases in Wikidata along three dimensions: 1) male versus female, 2) Global North versus Global South, and 3) urban versus rural.

While prior work has shown that many problematic biases exist within peer production, work has also shown that it is possible for communities to reduce disparities once they are identified. Work by Halfaker [29] has indicated that concerted community efforts have significantly reduced gender disparities related to Wikipedia content on notable scientists. Therefore, an outcome of our explorations of problematic biases is to enable the Wikidata community to address any disparities that we find.

6.3 Methods and Data 6.3.1 Background on Wikidata Revision Data Used Basic contribution activities in Wikidata revolve around the creation and modification of items. All creation and modification actions are represented as revisions of item pages. Thus, analyzing revisions yields a holistic picture of Wikidata content production. We needed Wikidata revision data to perform most of our analyses. We used “stub-meta-history” XML dump files from May 1, 2017 from the Wikimedia Foundation19 to obtain Wikidata revision information. Wikidata's XML dumps provide a complete history of all revisions to all pages in Wikidata. These dump files provide information such as the item ids, user ids, and user names associated with each revision/edit (an “edit” is synonymous with a “revision” for the purposes of this chapter). The files also provide a revision comment for each edit, which typically gives insight into the type of edit performed. For additional information about Wikidata terminology and basic characteristics of Wikidata content, please refer to [144].
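To make the revision extraction step concrete, the sketch below iterates over a stub-meta-history dump with the mwxml Python library and pulls out the fields our analyses rely on. The file path is a placeholder, and the exact attribute names should be checked against the installed mwxml version; this is an illustrative sketch rather than our exact pipeline.

```python
import mwxml

# Iterate over every revision in a stub-meta-history dump and collect the fields
# used in our analyses: page/item id, revision id, parent revision id, user, and comment.
with open("wikidatawiki-20170501-stub-meta-history.xml") as dump_file:  # placeholder path
    dump = mwxml.Dump.from_file(dump_file)
    for page in dump:
        for revision in page:
            record = {
                "page_id": page.id,
                "rev_id": revision.id,
                "parent_id": revision.parent_id,
                "user": revision.user.text if revision.user else None,
                "timestamp": str(revision.timestamp),
                "comment": revision.comment,
            }
            # ... append `record` to the revision dataset ...
```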

6.3.2 Determining Contribution Strategy Types of Revisions To understand the value provided by different contribution strategies, we needed to partition our revision dataset based on strategy type. Appendix 9.5 provides a detailed description of our procedure for doing this. This process categorized a revision into one of four different contribution strategies: 1) bots, 2) tools, 3) registered manual contributions, and 4) anonymous manual contributions. We decided to break down non-automated

19 https://dumps.wikimedia.org

contributions into two groups (3 and 4) because prior peer production work [124,125] found that anonymous contributions can provide significant value to communities even though they only represent a small proportion of edits. Therefore, we thought it important to differentiate the value added by anonymous users from that added by registered users.
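To illustrate the general shape of this partitioning (the full rule set is given in Appendix 9.5), a heavily simplified classifier might look like the following; the bot account list, the bot-name heuristic, and the tool comment marker are placeholders rather than our actual rules.

```python
import re

# Placeholder rule sets -- the actual lists and patterns are described in Appendix 9.5.
KNOWN_BOT_ACCOUNTS = {"ExampleBot"}                    # hypothetical flagged-bot list
TOOL_COMMENT_MARKERS = ("#quickstatements",)           # hypothetical tool signature
IP_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")    # IPv4 usernames = anonymous editors

def contribution_strategy(user_name, comment):
    """Assign a revision to one of the four contribution strategies."""
    if user_name is None or IP_PATTERN.match(user_name):
        return "anonymous manual"
    if user_name in KNOWN_BOT_ACCOUNTS or user_name.lower().endswith("bot"):
        return "bot"
    if comment and any(marker in comment for marker in TOOL_COMMENT_MARKERS):
        return "tool"
    return "registered manual"
```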

6.3.3 Analyzing Wikidata Contribution Strategy Behavior Longitudinally The production dynamics of Wikidata have seen large changes over time as different automated and non-automated contributors have edited. Therefore, many of our analyses examined how contributor behavior has changed over time. To investigate this, we divided Wikidata’s revision history into 4 distinct, yearlong periods:

1. Period 1: May 1, 2013 to April 30, 2014
2. Period 2: May 1, 2014 to April 30, 2015
3. Period 3: May 1, 2015 to April 30, 2016
4. Period 4: May 1, 2016 to April 30, 2017

We split the time periods between April and May because our revision data ended in May 2017 and yearlong snapshots allowed us to account for yearly seasonality in interest.

6.3.4 Measuring Item Quality The Wikidata community has developed a content quality scale [147] which measures the value of a given item based on the degree to which it includes complete and well-referenced information. We used this scale to obtain item quality information for our analyses. According to this scale, a Wikidata item may be assigned a quality score ranging from E (lowest quality) to A (highest quality). An E quality score means that an item does not “provide enough [structured] information to easily identify the item” and further that the item does not contain -- among other things -- a significant amount of references and multilingual metadata translations. An A quality score means that an item has -- among other things -- a very complete set of metadata, including multilingual translations and extensive references from a diverse set of external sources.

To measure item quality in accordance with this scale, we used the Objective Revision Evaluation Service (ORES) provided by the Wikimedia Foundation [129]. ORES is based on a machine learning model that provides high-accuracy20 predictions of the quality of an item. ORES takes as input a Wikidata item’s revision id from any point in Wikidata’s history and predicts the quality of the item as of the time of the revision. ORES returns the probability of the item belonging to each quality class, from which we compute a “weighted sum” score. The weighted sum score formula (originally defined in prior work [29]) is computed as follows:

20 https://ores.wikimedia.org/v3/scores/wikidatawiki/?model=itemquality&model_info

\[
\begin{aligned}
\text{weighted sum} ={} & P(\text{item is of quality class E}) \cdot 0 \\
{}+{} & P(\text{item is of quality class D}) \cdot 1 \\
{}+{} & P(\text{item is of quality class C}) \cdot 2 \\
{}+{} & P(\text{item is of quality class B}) \cdot 3 \\
{}+{} & P(\text{item is of quality class A}) \cdot 4
\end{aligned}
\]

Given this formula, a weighted sum score of close to 0 would indicate item quality is likely of class E while a score close to 4 would indicate item quality is likely of class A.
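As a reference for how such a score can be computed in practice, the sketch below queries the ORES v3 endpoint for a single revision and applies the weighted sum formula; the revision id is a placeholder and the exact JSON layout should be verified against the live API response.

```python
import requests

CLASS_WEIGHTS = {"E": 0, "D": 1, "C": 2, "B": 3, "A": 4}

def weighted_sum(rev_id):
    """Return the ORES item-quality weighted sum score for one Wikidata revision."""
    url = f"https://ores.wikimedia.org/v3/scores/wikidatawiki/{rev_id}/itemquality"
    response = requests.get(url).json()
    # Probability of the item (as of this revision) belonging to each quality class.
    probabilities = (response["wikidatawiki"]["scores"][str(rev_id)]
                     ["itemquality"]["score"]["probability"])
    return sum(CLASS_WEIGHTS[cls] * prob for cls, prob in probabilities.items())

# e.g. weighted_sum(123456789)  # placeholder revision id
```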

6.3.5 Measuring Item Demand Our consumer-level value analysis required us to quantify the consumer demand for each Wikidata item. Prior work [95] has used Wikipedia article page views as a proxy for a global sense of concept demand. In such a context, a “concept” corresponds with an article. We built on this approach. Given a Wikidata item, we calculated the total page views of all Wikipedia articles corresponding to the concept represented by the item. For example, for the Wikidata item pertaining to Barack Obama, we aggregated all page views across all language versions of Wikipedia that had an article pertaining to Barack Obama. This gave us a global sense of the consumer demand for information on the former U.S. president. To further improve the sense of demand for the concepts represented by Wikidata items, we also incorporated page views from any content that used the item’s information on any Wikimedia project entities (e.g. entries, entries, etc.).

Wikimedia maintains a manually curated link between Wikidata items and corresponding content on other Wikimedia projects via sitelinks. We used a sitelink dataset from May 1, 2017 that was available in “wbc_entity_usage” SQL files from [148]. The files also indicated whether any Wikimedia project pages were using the information found in Wikidata items. To obtain a measurement of the overall page view rate, we aggregated view information from Wikimedia logs21 for a one-year period from June 2016 through June 2017. This was the most recent available data at the time. Using a yearlong period allowed us to account for seasonality22.
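A simplified sketch of this aggregation is shown below. It assumes the sitelink data, the entity-usage data, and the yearly page view counts have already been loaded into in-memory mappings; these structures stand in for the SQL files and page view logs described above.

```python
from collections import defaultdict

def item_demand(sitelinks, entity_usage, page_views):
    """Aggregate yearly page views over every page linked to or using each item.

    sitelinks:    {item_id: [(wiki, page_title), ...]}  -- sitelinked pages per item
    entity_usage: {item_id: [(wiki, page_title), ...]}  -- pages using the item's data
    page_views:   {(wiki, page_title): yearly_view_count}
    """
    demand = defaultdict(int)
    for item_id in set(sitelinks) | set(entity_usage):
        pages = set(sitelinks.get(item_id, [])) | set(entity_usage.get(item_id, []))
        for page in pages:
            demand[item_id] += page_views.get(page, 0)
    return dict(demand)
```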

6.3.6 Data Generation for Consumer-Level Value Analyses 6.3.6.1 Selecting Revisions To understand how editing behavior related to content demand and other dynamics in Wikidata, we sometimes needed to know the quality of items (as predicted through ORES) at the time they were being edited. One challenge to obtaining item quality was the fact that it would take months to obtain ORES quality scores for all Wikidata item revisions. To get around this technical limitation, we used stratified samples in

21 https://analytics.wikimedia.org/datasets/one-off/pageview_rate/20170607/

22 Our work assumes that item demand is generally constant, aside from seasonal differences. Other theories of content demand dynamics do exist. For example, supply-side information economics would argue that item demand would increase as item quality increases.

our revision analyses. Specifically, we generated datasets of 100,000 random item edits of each contribution strategy (100,000 bot, 100,000 tool, 100,000 registered manual, and 100,000 anonymous manual edits) within each period in our dataset. For each revision in these samples, we then applied ORES and obtained the quality score for the corresponding item directly prior to the edit occurring. This was possible because our revision dataset contained the “parent id” -- the revision id of the previous edit to the same item. This approach meant that, for the analyses requiring a quality score, we could not consider revisions that created items since they would not have a parent id. However, we did not consider this an issue since the vast majority of edits do not create items and because individual edits typically add only a small amount of quality. Hence, the edits that we did consider in such analyses account for most of the quality found within items.

Filtering sampled revisions to obtain content edits. Some Wikidata item revisions do not affect the quality of item content. We wanted to analyze only edits that were adding quality to item content. We used revision comments to help ensure this was the case. Revision comments are composed of a structured part and an unstructured part. The structured part contains information automatically generated by Wikidata’s software that indicates the type of edit that occurred. We filtered out edits that contained the text “sitelink”, “client”, or “merge” within the structured part of their comments. We refer to all remaining edits as content edits. According to Wikidata’s quality scale [147], these edits add quality. Further, our results also tend to confirm that these edits add quality because our analyses show longitudinal quality improvements to items where content edits occur. We further believe that – adjusting for the edit-time quality of an item – the typical content edits across different contribution strategies tend to all provide the same amount of quality on average. This belief is bolstered by the fact that Wikidata’s UI restricts the size of edits (unlike in Wikipedia). For additional details of the results of our revision filtering process, Table 1 in Appendix 9.6 shows the number of content edits from each contribution strategy and period after all filtering was complete.
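A minimal version of this comment-based filter might look like the sketch below; the keyword list mirrors the one described above, and the comment handling is simplified (in our pipeline only the structured portion of the comment is checked).

```python
NON_CONTENT_MARKERS = ("sitelink", "client", "merge")

def is_content_edit(structured_comment):
    """Return True if the structured comment suggests the edit adds item content."""
    comment = (structured_comment or "").lower()
    return not any(marker in comment for marker in NON_CONTENT_MARKERS)

# e.g. is_content_edit("/* wbsetdescription-add:1|en */ German botanist")   -> True
#      is_content_edit("/* wbsetsitelink-add:1|enwiki */ Some article")     -> False
```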

6.3.7 Data Generation for Societal-Level Value Analyses For each dimension of societal-level value that we considered (male versus female, Global North versus Global South, and urban versus rural), we wanted to generate as large a set of content edit samples as possible. For each dimension, we applied the same revision sampling and data processing approach that we described above, but we limited the populations that we sampled from to only relevant edits (e.g. edits to male and female content). In some cases, our populations were smaller than the desired sample size of 100,000 and we simply included the entire population in our sample. We next describe how we derived the relevant populations we sampled from.

6.3.7.1 Analyses Performed to Measure Male Versus Female Biases As an initial exploration of gender biases in Wikidata item data, we decided to confine our analyses to the representation of content about human males and human females. Contributors can indicate that an item represents a human by applying the “instance of” [149] property with a value of “human” [150] and can indicate that an item is male or female by applying the “sex or gender” property [151] with a value of “male”

[152] or "female" [153]. We used the Wikidata Query Service [154] to identify all items that were human male23 or human female24. For additional details, Table 2 in Appendix 9.6 shows the total number of human male and female item content edits sampled from each contribution strategy and period.
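For illustration, a query of roughly the following shape can be run against the Wikidata Query Service (here via its public SPARQL endpoint) to retrieve such items. The well-known identifiers P31 (“instance of”), Q5 (“human”), P21 (“sex or gender”), and Q6581072 (“female”) are used; our exact queries are the ones linked in the footnotes.

```python
import requests

# All items that are instances of human (Q5) with sex or gender (P21) = female (Q6581072);
# swapping Q6581072 for Q6581097 retrieves human male items instead.
QUERY = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q5 ;
        wdt:P21 wd:Q6581072 .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
)
item_ids = [binding["item"]["value"].rsplit("/", 1)[-1]
            for binding in response.json()["results"]["bindings"]]
```

In practice, a query over all humans of a given gender returns millions of results, so it needs to be paginated or run against the dumps to stay within the query service’s limits.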

6.3.7.2 Analyses Performed to Measure Global North Versus Global South Biases We wanted to see if geographic discrepancies exist in the representation of Wikidata item data between the Global North and Global South. Contributors can identify the geographical location of an item by applying the “coordinate location” property [155] along with a value that indicates a latitude and longitude pair. We used the Wikidata Query Service to identify such geotagged items25. Items were considered to be associated with the country they were physically closest to. We used a Python utility [90] to make this determination. We then flagged countries that were considered by the United Nations26 to be “developed” as the Global North. All other countries were considered to be a part of the Global South. For additional details, Table 3 in Appendix 9.6 shows the total number of geotagged item content edits sampled from each contribution strategy and period.
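The following sketch shows the general approach; the reverse_geocoder package is used here as a stand-in for the nearest-country utility cited above, and the set of UN “developed” country codes is an abbreviated placeholder for the full list in the UN annex referenced in the footnote.

```python
import reverse_geocoder as rg

# Placeholder subset of ISO codes for countries the UN classifies as "developed".
GLOBAL_NORTH_CODES = {"US", "CA", "DE", "FR", "GB", "JP", "AU", "NZ", "NO", "CH"}

def global_region(latitude, longitude):
    """Classify a geotagged item as Global North or Global South via its nearest country."""
    country_code = rg.search([(latitude, longitude)])[0]["cc"]
    return "Global North" if country_code in GLOBAL_NORTH_CODES else "Global South"

# e.g. global_region(48.8566, 2.3522) -> "Global North"  (coordinates for Paris)
```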

6.3.7.3 Analyses Performed to Measure Urban versus Rural Biases We also wanted to see if geographic discrepancies exist in the representation of Wikidata items between urban and rural areas. We considered such discrepancies at the county level in the context of the continental United States. We first used the Shapely Python library [23] in conjunction with county-perimeter data used in prior work [43] to identify the county associated with items. We then merged this data with the county-level “Urban-Rural Classification Scheme” provided by the National Center for Health Statistics (NCHS) [156]. The above scheme is based on a 1-6 scale where 1 indicates the most urban counties and 6 the most rural. We considered counties classified as 1-4 to be urban and counties classified as 5-6 to be rural. These distinctions were made based on what counties the NCHS considers to be “metropolitan” versus “nonmetropolitan” [156]. If the reader is interested, Table 4 in Appendix 9.6 shows the resulting number of urban and rural item content edits from each contribution strategy and period.
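A simplified version of the county lookup is sketched below. It assumes the county geometries (as GeoJSON-style dictionaries) and their NCHS codes have already been loaded; the point-in-polygon test uses Shapely, as in our pipeline.

```python
from shapely.geometry import Point, shape

def classify_urban_rural(longitude, latitude, counties):
    """Return 'urban', 'rural', or None for a point in the continental United States.

    counties: list of (geojson_geometry, nchs_code) pairs, where nchs_code is the
              county's value on the 1-6 NCHS Urban-Rural Classification Scheme.
    """
    point = Point(longitude, latitude)  # Shapely points are (x, y) = (lon, lat)
    for geometry, nchs_code in counties:
        if shape(geometry).contains(point):
            return "urban" if nchs_code <= 4 else "rural"
    return None  # the point falls outside the continental U.S. counties
```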

6.3.8 Additional Details of Our Analytic Methods 6.3.8.1 Generating and Using Item Expected Quality Distributions Many of our analyses required us to compare the “expected quality” of items within content edits to their actual quality to see how well the PAH had been adhered to. According to the PAH, an item in a given percentile in the distribution of item demand should also be in the same percentile in the distribution of item

23 http://tinyurl.com/ycdjuf53

24 http://tinyurl.com/ya6jvhdy

25 http://tinyurl.com/y73dg5c9

26 https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/WESP2018_Annex.pdf

quality. Since quality changes over time, the expected quality distribution of items will also change. Given this, we wanted to ensure that we were making comparisons between actual and expected quality using distributions that contained the most accurate item expected quality possible. To help ensure this, we recalculated distributions of item expected quality on the first day of each month for all items in our revision dataset. When doing so, we considered all items in existence at that time. Once these distributions were calculated, we were then able to use them in our analyses. Specifically, we compared a given item’s quality at edit time with its expected quality obtained from the item expected quality distribution generated for the month after the edit occurred.

Calculating all of the item expected quality distributions presented a technical challenge since it required obtaining many ORES scores. However, we generated monthly scores for each item in chronological order and noticed that many items were not edited from one month to the next. This scenario meant that the quality score obtained from the previous month could be used, making this a tractable approach.
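Stated concretely, an item’s expected quality in a given month is the quality score located at the item’s demand percentile within that month’s distribution of quality scores. The sketch below is a simplified restatement of that lookup (and of the gap/surplus comparison used later), not our exact implementation.

```python
import numpy as np

def expected_quality(item_demand_percentile, monthly_quality_scores):
    """PAH-implied quality for an item in a given month.

    item_demand_percentile: the item's percentile (0-100) in the demand distribution.
    monthly_quality_scores: ORES weighted sum scores of all items existing that month.
    """
    return np.percentile(monthly_quality_scores, item_demand_percentile)

def quality_difference(actual_quality, item_demand_percentile, monthly_quality_scores):
    """Positive values indicate a quality surplus; negative values indicate a quality gap."""
    return actual_quality - expected_quality(item_demand_percentile, monthly_quality_scores)
```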

6.4 Results We organize discussion of our results around the two lenses through which we consider content value.

6.4.1 Value from a Consumer-Level Perspective To understand content value from a consumer perspective, we first incorporated content quality and demand into our definition of value. The primary results of our exploration into consumer-level value address the following questions: 1) What is the relationship that content edits from different strategies have with demand? 2) How have those relationships changed over time? and 3) How do the dynamics identified in the first two questions affect the alignment of item quality and demand?

Manual and semi-automated strategies focus more on editing in-demand content. Recall from section 6.3.8.1 in Methods that we generated monthly distributions of item expected quality based on demand. Using these distributions, we computed the mean item demand/expected quality percentile across content edits and periods. Figure 6.1 shows our results. The y-axis shows the mean demand percentile and the x-axis shows each period in our study. We can see that manual contributors (especially anonymous manual contributors) tend to focus on high-demand items. Further, this trend has only increased over time. Across all periods, the typical anonymous manual contribution is to content that is roughly in the top 15% of demand, and the typical registered manual contribution is to content that is in the top 30% of demand. On the other hand, contributions with at least some degree of automation have less of a skew towards high-demand content. Semi-automated tools have a relatively minor skew towards high-demand content. Bots -- the most automated contribution strategy -- appear to not edit in a manner that is positively associated with content demand at all and sometimes appear to even focus editing on lower-demand content. We explore bot editing in more detail next and find that the situation is more nuanced than Figure 6.1 indicates.


Figure 6.1: Mean Item Demand (Expected Quality) Percentile of Content Edits

Figure 6.2: Density Function Breaking Down Demand (Expected Quality) Percentiles for Bot Content Edits in the 2016-2017 Period

Demand- and Initiative-Based Content Editing: The Twin Peaks of Bot Contributions. We further explored the distributions of expected quality percentiles for the different contribution strategies and noticed some intriguing patterns related to bot editing. Figure 6.2 shows the density function of expected quality percentiles for bot content edits for the 2016-2017 period in our study. The general trends were similar for other periods. The x-axis shows the expected quality quantile while the y-axis shows the density. Note the two distinct rises (or peaks) in bot content edit counts.


Figure 6.3: Mean Quality Difference (Actual Minus Expected) for Items within Content Edits

Figure 6.4: Item Mean Quality Difference (Actual Minus Expected), Broken Down by Demand (Expected Quality) Quantile. Plots represent the beginning and end of study.

First, consider the rise on the right-hand-side of the distribution. Recall that Figure 6.1 made apparent that tool and manual contributions tend to focus on higher-demand content, but that such a trend was not clearly the case for bot contributions. The right-hand-side rise in Figure 6.2 indicates that, in fact, a sizable number of bot contributions are similar to tool and manual contributions in their focus on higher-demand content.

The rise on the left-hand-side of the plot in Figure 6.2 was also interesting to us since it was clearly responsible for the low mean item demand found in Figure 6.1. We wanted to understand why so much low-demand bot content editing was occurring, so we performed a qualitative coding process. Specifically, we first sampled 50 bot content edits that were below the 40th expected quality percentile. Each edit had an equal chance to have been from any of the 4 yearlong periods in our study. We then considered 1) what topics the items pertained to and 2) what aspects of the items were being edited. I carried out this process. We found that the majority of the content edits were to items that pertained to a) obscure living entities including insects, worms, plants, and crustaceans, or b) geographical locations including towns, bodies of water, and even asteroids, or c) biographical and other information on botanists, screenwriters, chess players, and films. The aspects of items that were edited varied, but the majority of edits were either to descriptions or properties, with quite a few label or reference edits as well (for the definitions of different aspects of Wikidata items, see [144]). Given our findings, it is evident that bots are heavily-used within initiatives that import various biological, geographical, biographical, and other large datasets into Wikidata. However, only a small amount of such information is currently in high-demand27.

6.4.1.1 Identifying the Gaps and Surpluses Defined Through the Perfect Alignment Hypothesis (PAH). Overall, bot contributions create content surpluses while tool and manual contributions fill content gaps. Next, we sought to understand how each contribution strategy has affected the alignment of content quality and demand as defined through the Perfect Alignment Hypothesis (PAH). Specifically, we wanted to see which strategies were creating quality gaps and which were creating quality surpluses. In Figure 6.3, we plotted the mean difference between the actual28 and expected quality of the items in content edits (y-axis), broken down across periods (x-axis). The quality difference was computed by subtracting expected item quality from actual item quality. In this figure, when the mean is below 0, this indicates editing behavior that is closing gaps while a mean above 0 indicates editing that is creating surpluses. The figure indicates that bots consistently create surpluses of quality while manual and tool contributors consistently fill quality gaps. Anonymous manual contributions, in particular, focus on content that has the largest quality gaps; in 3 out of the 4 time periods in our study, the typical anonymous manual contribution is to an item that is over half a quality class lower than the PAH suggests it should be.

Figure 6.4 provides insight into why the trends in Figure 6.3 exist. The figure shows the relationship between item alignment and demand and how this relationship has shifted over time. To derive the plots within the figure, we bucketed the items in our dataset based on the creation of 20-quantiles of item expected quality (demand). This process split an item dataset into 20 (approximately) equal-sized groups each spanning a 5-percentile range of demand/expected quality. For simplicity, we will refer to these ranges as quantiles and, for our discussion, each quantile will be defined by the percentile that represents its upper limit.

27 Consider the over 400,000 Coleoptera beetles compared to the rare popularity of the Ladybug [113].

28 The actual item quality in this figure (and in other figures representing actual item quality) was calculated prior to the edits occurring.


Figure 6.5: Item Mean Quality, Broken Down by Demand Quantile. Plots represent the beginning and end of study.

Figure 6.6: Mean Absolute Error (MAE) Based on Item Quality Difference (Between Actual and Expected Item Quality)

The plots show actual minus expected quality for the beginning and end of our study (similar trends existed for other points in the study). The y-axes represent the average quality difference (alignment) and the x-axes represent item demand quantiles.

The main takeaway from Figure 6.4 is that low- and midrange-demand items have a surplus of quality while high-demand items have a quality gap. Given this, Figures 6.1 through 6.4 combine to create an interesting picture of how contribution strategies affect PAH alignment. Bots tend to focus the most on low- and midrange-demand content (Figures 6.1 and 6.2) and this means the typical bot content edit adds a surplus of quality to items (Figure 6.3) since low- and midrange-demand content typically has a surplus of quality (Figure 6.4). Additionally, since tools and manual contributions tend to focus on relatively higher-demand content (Figure 6.1), this results in their typical edits “counteracting” the work of bots by filling in content gaps (Figure 6.3) since such high-demand content typically has a quality gap (Figure 6.4).

As a result of bot editing, increasing numbers of high- and midrange-demand quantiles need quality. Note the points in Figure 6.4 when each of the quality difference curves cross over 0 as demand increases. These points indicate where the mean item quality difference (actual minus expected quality) shifts from a surplus in quality to a gap in quality. In the first period within our study, we can see that this crossover occurred between the 85th and 90th percentiles. Over time, this crossover shifted downward to between the 60th and 65th percentiles. This shift indicates that more and more mid- and high-demand item quantiles have an overall quality gap. The bot-driven creation of a surplus of quality in low- and midrange-demand quantiles has created this shift. Figure 6.5 shows the mean quality (y-axis) in each demand quantile (x-axis), prior to the first and last periods in our study. We can see that, initially, the average quality of items increased monotonically with demand. However, over time, the focus of bot editing on low- to midrange-demand content pushed the quality of some of this content higher than that in some neighboring higher-demand quantiles. Such behavior also increases the expected quality of this higher-demand content. And, since the higher-demand content tends to have not improved in quality as much as the lower-demand content, the gap between this higher-demand content’s actual and expected quality has increased. As stated earlier, bots provide most edits in Wikidata29 and, because of this, they can drive such large-scale changes.

6.4.1.2 Considering Wikidata-Wide Alignment Longitudinally Bot editing has steadily decreased the alignment of the average item. We wanted to precisely measure the longitudinal effect of content editing behaviors on the alignment of the typical item. For each month in Wikidata’s history, we applied a common error metric -- mean absolute error (MAE) -- to compare the absolute value of the difference between the actual and expected quality of all items in existence at that time. Figure 6.6 shows the results. The MAE is on the y-axis and each month is on the x-axis. Note the clear, steady increase in misalignment over time. Initially, the typical item had an absolute actual-minus-expected quality difference of ~0.2, while this has now shifted to ~0.75. These results provide a clear view of the community impact that misaligning bot content edits to low- to midrange-demand items have had over time. We return to this figure in Discussion, Implications, and Future Work.
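For reference, the metric plotted in Figure 6.6 can be written as follows, where N is the number of items in existence in a given month:

\[
\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\text{actual quality}_i - \text{expected quality}_i\right|
\]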

29 Only a small proportion of bot edits were filtered out as non-content edits.

6.4.2 Value from a Societal-Level Perspective We consider societal-level value to be maximized when harmful biases in the representation of protected- or minority-related content are minimized. The primary results of our exploration into societal-level value center around the following questions: 1) How does the representation of content vary along important dimensions (i.e. male versus female, Global North versus Global South, and urban versus rural)? 2) How has representation changed over time? and 3) How do different contribution strategies affect such representation? Understanding how different strategies affect representation is important to aid content curators and other Wikidata community members in reducing problematic biases.

To consider representation of content, we computed some basic descriptive statistics and compared the results between groups (e.g. between female and male content) within each dimension of societal-level value that we analyzed. We computed all statistics at yearly intervals throughout our study (May 2013 through May 2017). As our first statistic, we simply determined the number of items in a group. We refer to this statistic as item coverage. When comparing item coverage between groups, we will show that our results uncover biases between the coverage of different groups (e.g. between male and female content). It is not clear what proportion of such biases are due to under-representations in the real world (e.g. due to cultural biases or sexism) and what proportion are due to any biases in Wikidata’s representations of the real world. It is also not clear what content coverage the Wikidata community should strive for. For the purposes of our work, a non-controversial coverage goal for the Wikidata community would be for it to reflect trends in the real world. However, even identifying precisely what such real-world coverage looks like can be difficult and was not a goal of this work. Our coverage results shed light on the shifts in differences in coverage between groups and, in some cases, our results also provide intuitive indications that Wikidata’s representation of the real world is problematically biased away from protected or minority population content.

In addition to item coverage, we also considered the quality of the typical item in a group. We refer to this statistic as item quality; it is simply the mean item quality of the group. Intuitively, the Wikidata community should seek to provide consumers of protected/minority population content and consumers of privileged population content the same quality content. If quality is lower for protected/minority group content than that of a privileged group, consumers of the former content will be more likely to form inaccurate or less complete perceptions of topics relating to such groups.

6.4.2.1 Gender-Related Bias Human male item coverage is/has been much greater than human female coverage. We found that in May 2013, Wikidata contained 5.43 human male items for every human female item. By May 2017, there were 5.01 human male items for every human female item. Even though we do not derive what the ratio between human male and human female content should be (based on real-world representation of male and female topics), having 5 male items for every female item is obviously problematic. Further, given the quite mild decline in the item coverage ratio, our results indicate that the problematic coverage bias in favor of male content is not fixing itself quickly.


Figure 6.7: Normalized Human Female Item Content Edit Proportion


Human male and female item quality is/has been roughly the same. While a large discrepancy exists between human female and human male content in terms of item coverage, we also found that human female item quality has consistently been approximately equal to that of human male items. As of May 2013, the item quality metric was 0.16 for human female items and 0.18 for human male items. As of May 2017, item quality was 1.28 for human female items and the same for human male items.

Manual contributors disproportionately focus on female items. Bots disproportionately focus on male items. To help explore how each contribution strategy has affected the representation of female and male content, we computed the proportion of content edits that occur to human female items relative to human male items. We performed this separately for each strategy and, to see how behaviors changed over time, for each period. Since human male items significantly outnumber/have outnumbered human female items, it was unsurprising that contribution strategies tended to also devote significantly more editing to human male item content. However, we wanted to get a sense for how each contribution strategy was affecting representation shifts. This meant we needed to account for the fact that more human male items exist than human female items. To do so, we normalized each content edit proportion based on the proportion of female items to male items in existence at the beginning of a respective year-long period. Specifically, we divided the given content edit proportion by the item proportion. Finally, we subtracted 1 from the result. This last step led to a set of results that were intuitive to interpret. For example, a contribution strategy with a normalized human female content edit proportion of 0.2 would indicate that the strategy performs 20% more

content editing on human female content over the period than we would expect given disparities in item coverage at the beginning of that period. Positive scores indicate more than expected human female content editing and negative scores indicate more than expected human male content editing. Figure 6.7 shows our results. The y-axis is the normalized revision proportion and the x-axis is the time period.
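Under one reading of this procedure, the normalized proportion for human female content in a given period can be written as

\[
\text{normalized female edit proportion} = \frac{\text{female content edits} \;/\; \text{male content edits}}{\text{female items} \;/\; \text{male items}} - 1,
\]

where the item counts are taken at the beginning of the period. The analogous quantity is computed for the Global South (Figure 6.8) and rural (Figure 6.9) dimensions.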

The results in Figure 6.7 first indicate that manual (both registered and anonymous) contributors focus relatively the most on female-related items and bots focus the least. Second, given the normalizations performed, our results also indicate that manual contributors tend to have a consistent disproportionate focus on human female items. Anonymous manual contributions produce as much as ~80% more content edits towards female items than we would expect. Further, depending on the year, tools focus either more or less on human female items than we would expect. Finally, bots have a consistent disproportionate focus on human male items, and this is becoming increasingly the case over time.

6.4.2.2 Global North versus Global South-Related Bias A disparity in item coverage between Global South and Global North items still exists, but has significantly decreased. We found that in May 2013, Wikidata content contained 2.17 Global North items for every Global South item. By May 2017 the disparity in this ratio had declined substantially to 1.18. Much of this decline had occurred in the last two years of the study. The two-year period between May 2015 and May 2017 saw the creation of a particularly large number of items pertaining to the Global South (976,507 items). If the current trend continues, Wikidata will have the same number of Global South items as Global North items very soon.

Global North item quality is/has been higher than that of Global South items. While item coverage of Global South content has come close to matching that of the Global North in recent years, the average quality of Global North items has been consistently higher than that of Global South items, but only slightly. As of May 2013, the item quality metric was 0.09 for Global South items and 0.21 for Global North items. As of May 2017, the item quality metric was 0.62 for Global South items and 0.74 for Global North items. These differences are small enough that consumers would notice only a minor quality difference between the two groups.

Bots and anonymous manual contributors disproportionately focus on Global South items. Tools and registered manual contributors disproportionately focus on Global North items. Figure 6.8 shows the normalized content edit proportion between Global South and Global North content edits. This figure is analogous to the normalized revision proportion figure that was computed to study gender biases (Figure 6.7). In this figure, proportions greater than 0 indicate a greater than expected proportion of edits to Global South content while proportions less than 0 indicate a lesser than expected proportion. From the figure, we can see that bots and anonymous manual contributors focus the most on Global South content and registered manual contributions focus the least on such content. Further, while bots and anonymous manual contributors (generally) have a consistent disproportionate focus on Global South content, tools and registered manual contributors have a consistent disproportionate focus on Global North content.


Figure 6.8: Normalized Global South Item Content Edit Proportion

Figure 6.9: Normalized Rural Item Content Edit Proportion


6.4.2.3 Urban versus Rural-Related Bias Urban item coverage is/has been greater than rural item coverage. We found that in May 2013, Wikidata contained 1.93 (U.S.) urban (metropolitan) items for every (U.S.) rural (nonmetropolitan) item. By May 2017, this ratio had declined slightly to 1.82.

Rural item quality is/has been higher than that of urban items. We found that rural item quality has been consistently higher than urban item quality. The quality difference has widened slightly but is still relatively small. As of May 2013, the item quality metric was 0.14 for rural items and 0.08 for urban items. As of May 2017, the item quality metric was 0.85 for rural items and 0.66 for urban items. Our results are promising in the sense that they indicate the community has succeeded in providing relatively high-quality representations of rural regions that have been shown to lack high-quality representations in prior peer production work (e.g. [42,66,100]).

Automated contributions disproportionately focus on rural items. Manual contributions disproportionately focus on urban items. Figure 6.9 (which is analogous to Figures 6.7 and 6.8) shows the normalized content edit proportion between rural and urban content edits. Proportions greater than 0 indicate a greater than expected proportion of edits to rural content while proportions less than 0 indicate a lesser than expected proportion. From the figure, we can see that (generally) bots and tools focus the most on rural content while anonymous contributors consistently focus the least on such content. Further, while bots and tools have a (generally) consistent disproportionate focus on rural content, manual contributors have a (generally) disproportionate focus on urban content. We revisit these findings in Discussion, Implications, and Future Work.

6.5 Discussion, Implications, and Future Work 6.5.1 Should Bots – and Wikidata – adjust editing behavior? Bots should not be forced to align editing behavior with the PAH. Our results from applying the PAH indicate that -- year after year -- the typical bot content edit is inefficient at providing quality content where there is demand, instead creating a surplus of quality in low-demand content. Given the dominance of bot editing, such misaligning behavior has a large effect on the overall lack of alignment in Wikidata. This result raises a question: Should bots change their behavior to focus on closing content gaps? Further, and more generally: Is total PAH alignment realistic in Wikidata?

To explore these questions, it is useful to first consider the PAH in the context where Warncke-Wang et al. [95] originally defined and applied it: Wikipedia. Warncke-Wang et al. note that it is unclear if attaining total alignment in Wikipedia is an appropriate goal. They raise the concern that if the community decides to lessen fundamental contributor freedoms by insisting that editing be directed towards high-demand content, such initiatives might negatively impact community health and productivity. Given that Wikidata and Wikipedia share similar core contributor values -- in particular, related to contributor freedom -- it is also unclear whether total alignment is an appropriate goal in Wikidata.

In fact, total alignment might be even less an appropriate goal in Wikidata. To understand why, recall that most Wikipedia edits are performed manually, but most Wikidata edits are automated. Since each manual edit relies on manual effort, it intuitively makes sense to optimize such editing towards high-demand content

to maximize the consumer benefit from the effort made. However, for automated edits, it is not as clear that such a strategy makes sense. In fact, given our automated editing results, it actually may be inappropriate to expect that automated Wikidata editing should mostly focus on high-demand content. The reason is as follows. Automated Wikidata edits often occur through bots that import large external datasets of information. A large amount of manual effort is devoted upfront by bot creators to build bots that process data. However, once created, a bot can often quickly perform large-scale editing over an entire dataset with little or no additional manual effort. Given this, it would not make sense in Wikidata to only import the high-demand data from datasets. This is because -- assuming adequate/available computing resources -- the lower-demand data that composes the majority of bot-imported datasets can be processed with essentially no additional community effort. While such lower-demand data does not provide as much consumer value as higher-demand content does (i.e. it is not viewed as much), it still provides valuable content to consumers, and is therefore worth importing into Wikidata.

All contribution strategies -- especially bots and tools -- should be more cognizant of perpetuating problematic biases. While we do not suggest that bots change their editing behavior to account for PAH alignment, this does not mean that they should continue any behaviors that perpetuate problematic biases. Our results point to bots consistently having a disproportionate focus on male- and Global North-related content. Further, our results also indicate that all other contribution strategies have a consistent disproportionate focus on content related to at least one privileged group, at the expense of content related to a protected or minority group. Moving forward, all contribution strategies should carefully consider the potential to perpetuate problematic societal biases when they are importing data. This is particularly important for bots and tools given their ability to edit rapidly and at scale.

6.5.2 Potential Mechanisms to Support Aligned Content Editing We argue above that total PAH alignment may not be an appropriate goal in Wikidata. However, if any contributor editing behaviors are already filling in content gaps -- and hence pushing content towards PAH alignment -- then it is still useful from a consumer value-perspective to support such behaviors. Our results indicate that manual and semi-automated tool content editing is and has been counteracting some of the misalignment created by bots by filling in content gaps in midrange- to high-demand content. Given this, community initiatives and socio-technical tools may be useful to help incentivize and facilitate such contributors. We next provide some specific suggestions.

Gamified Socio-Technical Tools. One possible means to facilitate and increase manual and tool content editing is through gamification. Outside of Wikidata, gamified applications such as Foldit [157] have been used to collect useful user-generated content [49]. Further, within Wikidata, some gamification has already occurred in the form of a tool called the Wikidata Game [158]. Additional development and promotion of gamified tools in Wikidata is a promising direction to help counteract the misalignment created by bots.

Socio-Technical Tools to Surface Content Demand. Socio-technical tools could potentially help identify in-demand content. There are indications that Wikidata contributors would find these tools useful. Specifically, one indication stems from our own findings. Our results in section 6.4.1 indicate that the more manual effort a contribution strategy requires in order to edit, the more the strategy focuses on in-demand items. Since manual effort comes from humans, this result provides some reason to believe that humans are intentionally focusing on Wikidata content that is in high demand. Another indication stems from the socio-technical tools used in other peer production communities. SuggestBot [14] is a prominent tool used by the Wikipedia community and provides insight into article page view rates. Since Wikipedia and Wikidata are sister communities that share many community characteristics (and contributors!), tools similar to SuggestBot might be effective to facilitate high-demand content editing in Wikidata.

6.5.3 Using Automated Contribution Strategies to Reduce Rural Content Disparities Prior work by Johnson et al. [42] found that there exists lower quality peer-produced geographic information in rural areas and that the potential pool of contributors in such regions is smaller than in urban areas. Further, the work notes that while automation has been looked at with skepticism in peer production (flawed automation has caused significant problems in Wikipedia and OpenStreetMap, e.g. [99,103]), it might be a good option for rural content since few alternatives exist. Our work backs up this recommendation with an example of the effectiveness of automation in representing rural content. Recall that we found that the item quality for rural items was higher than that of urban items. Bots and tools focused more on such content than manual contributors did. This fact, combined with the additional fact that the bot and tool populations we sampled our urban and rural edits from were much larger than those of manual contributions, provides a compelling indication that automation played a large role in ensuring that rural item quality was relatively high.

6.5.4 Future Work To shed additional light on our findings, future qualitative work should seek to better understand how Wikidata contributor motivations relate to the lenses of content value that we considered. Do contributors consider the potential to introduce or reduce problematic societal biases when importing data? Is content demand considered? If such factors are considered, how do they change behavior (if at all)? Such a study might include interviews or surveys with bot and tool operators and creators as well as with manual contributors.

Figure 6.6 provided a clear indication of the misaligning effect that bot editing has had on the overall alignment of Wikidata for most of its history. The trend of decreased alignment tends to plateau at the end of our study even though bot contributions still tend to add quality surpluses and manual and tool contributions still tend to fill quality gaps. We believe a likely explanation for the plateau behavior is an uptick in the creation of new content where the content’s demand and quality tend to have only a relatively small misalignment. This would temporarily “mask” the misaligning behavior of bots in this figure. Moving

forward, we anticipate that alignment will continue to decrease. Future work should continue to monitor the overall alignment of Wikidata.

6.6 Limitations While our analyses identified important societal disparities in content representation, further methodological improvements could occur to better account for the content demand that is specific to minority and protected populations. Our metric of content demand currently does not distinguish content demand coming from -- for example -- the Global North versus the Global South. Further, our metric is somewhat biased towards the needs of consumers in the Global North and those who are in urban areas. This is because more page views are occurring to Global North and urban content – a likely result of more consumers living in those areas. When considering the mean demand percentile of items each year from the beginning to the end of our study, Global North and urban item means were almost always higher than Global South and rural item means. This gap has grown especially large between Global North and Global South content. As of the end of our study in May 2017, the mean demand percentile was 0.55 for Global North items and 0.35 for Global South items. Future work should hone metrics of content demand to specifically target the needs of minority and protected populations.

When considering how contribution strategies affect different dimensions of societal-level value, we assume that content edits to a given group (e.g. human female items) add, on average, the same amount of quality as content edits to the group they are compared against (e.g. human male items). In reality, the amount of quality that a content edit provides is a function of the edit-time quality of an item. Content edits provide less quality to items that are higher quality at edit-time. However, we felt confident in our approach since the item quality of content within the various groups we were comparing (e.g. male versus female content) tended to be quite close on the Wikidata item quality scale.


7 Conclusion 7.1 Summary of Studies In recent years, certain peer production communities have focused on producing structured data specifically designed for use by applications and algorithms. We identified two fundamental challenges associated with creating valuable content from a machine processing perspective. We summarize our explorations of those challenges next.

The first challenge focused on a tension unique to the context of peer-produced structured content. This tension is between the peer production ethos of contributor freedom and the need for highly-standardized structured data for effective machine readability. We studied this tension in two ways. First, we did an interview study focused on OpenStreetMap’s knowledge production processes to investigate how – and how successfully – this community creates and applies its data standards. In this study, we extracted six themes that manifested the tension of freedom and standardization and three overarching concepts – correctness, community, and code – that help make sense of and synthesize the themes. Within the themes, we identified various groups within the OSM community (such as Humanitarian OpenStreetMap) that defined their own set of OSM tagging standards. We also identified cultural challenges to defining the global tagging standards found within OSM. Finally, we offered suggestions for improving OpenStreetMap’s knowledge production processes, including new data models, sociotechnical tools, and community practices (e.g., stronger leadership).

Once we had identified challenges to standardization that resulted from contributor freedom, we wanted to measure the effect they had on data standardization. To do so, we carried out a second study in order to investigate adherence to OpenStreetMap’s structured data recommendations/guidelines. We found that most applied structured data was consistent with the community’s standards; however, we also found that the standards identified many opportunities for applying data that were not achieved. In addition, when we situated the standards in the context of OpenStreetMap’s data model, we found a significant amount of ambiguity; the syntax allowed only one value, but everyday meaning – and the standards themselves – called for multiple values. These results suggested significant opportunities for OpenStreetMap to produce additional valuable content to power applications.

While initial contributions to peer production were largely manual work (e.g., manual Wikipedia article writing), automated contributions have played an ever-increasing role in these communities. Wikipedia, OpenStreetMap, and Wikidata all have taken advantage of automation to perform work at a rate and scale exceeding that of manual contributors. Given the large prevalence of automated contributions in creating structured content, it is important to understand how they compare to manual contributors in their effect on content value. The second challenge covered in my thesis focused on making this comparison. We first wanted to ensure that consistent differentiations between automated and manual contributions occurred. In

peer production, bot activities are not always explicitly flagged and could be mistaken for human contributions. Hence, in the third study in my thesis, we developed a machine classifier to detect previously unidentified bots using implicit behavioral and other informal editing characteristics. We showed that this method yields a high level of fitness under both formal evaluation (PR-AUC: 0.845, ROC-AUC: 0.985) and a qualitative analysis of “anonymous” contributor edit sessions. Our findings indicate that, most of the time, unidentified bots do not perform a significant portion of edits. However, we also showed that, in some cases, unflagged bot activities can significantly misrepresent manual behavior in analyses. Our model also has the potential to support future research and community patrolling activities.

For our final study, we explored how Wikidata’s automated and non-automated contributions differ in the value they produce. In performing this exploration, we define content value through two important and intuitive lenses. These lenses consider 1) the relationship between content quality and consumer demand and 2) problematic societal-level biases. Our results indicate that automated contribution mechanisms are less effective than manual contributions at targeting work based on consumer demand. However, automated mechanisms also appear effective in improving the quality of underrepresented content (e.g., pertaining to rural areas and the Global South). Based on our findings, we provide actionable insights for Wikidata and other peer production communities.

7.2 Summary of Contributions My work has provided the following set of contributions towards better understanding the value of peer-produced structured content from a machine perspective.

• We show why the large degree of contributor freedom affects the ability of peer production communities to be standardized. For example, some contributors – through greater technical skill or dedication to a cause – are able to influence standards. Cultural differences also cause standardization problems – for example, a “highway” can have different definitions in different regions.

• We also quantify standardization in OSM and find that 1) most applied metadata is consistent with the standard, 2) the constraints of the OSM data model lead to a large amount of ambiguous metadata, and 3) the informal standard of the OSM wiki defines large unmet opportunities to apply useful metadata.

• Based on our results from considering the tension of freedom and standardization, we offer several new sociotechnical strategies and tools to improve standardized data creation in peer production communities. For example, our work problematizes OSM’s 1:1 tagging structure, motivates the need to be able to link similar entities, and informs the design of tools that can improve standardization without increasing the effort required to contribute.

• We describe an effective bot detection strategy using machine classification and implicit behavioral and other informal editing characteristics and show that it is effective in identifying unflagged bot activity in Wikidata.

• Using our bot detection model, we show that, for the most part, unflagged bot activities are rare. Hence, the activities of unflagged bots appear unlikely to significantly change results in many studies of human and bot behavior. However, analyses that are sensitive to outliers (e.g. max session duration) or that span relatively short subsets of Wikidata history should first extract previously unidentified bots from “human” contributions to ensure accurate analyses.

• We show that there is a meaningful amount of non-compliance with the bot policy in Wikidata. According to our model, 3% of registered “human” user edits overall and 2% of anonymous “human” user edits overall are from bots. These percentages are important since all bot activity that does not align with community policy matters to the Wikidata community.

• We make our bot detection datasets and bot detection code available under an open license: https://github.com/hall1467/wikidata_bot_prediction_model. This will facilitate future Wikidata contributor behavior research as well as better allow the Wikidata community to enforce its bot policies.

• In Wikidata, we find that manual contributors tend to work on higher-quality and higher-demand content than bots. Further, while the work performed by manual contributors tends to increase the alignment between content quality and demand, the work performed by bots has resulted in a misalignment between content quality and demand, which has increased over Wikidata’s history. We also argue that complete alignment of content quality and demand may not actually be a realistic or even desirable scenario in a community like Wikidata where automated contributions dominate and contributor freedom is highly-valued.

• We provide evidence that distinct contribution strategies play different roles in affecting the representation of minority or protected content in Wikidata. For example, bots tend to disproportionately focus on Global South and rural content versus Global North and urban content. However, they also disproportionately focus on male-related content versus female-related content. On the other hand, manual contributions disproportionately focus on female-related content, but also disproportionately focus on urban content. Further, we find that unregistered manual contributions can have distinct behavior from registered manual contributions. For example, unregistered manual contributions tend to disproportionately edit Global South content while registered manual contributions tend to disproportionately edit Global North content.

• Our work provides implications for targeting contribution strategies to optimize for content value. For instance, our findings motivate initiatives and tools to help manual contribution strategies focus on high-demand content.

7.3 Implications and Future Work

The four previous chapters each discussed implications for the respective individual studies. I conclude this thesis by providing a brief, higher-level discussion of implications as well as some potential directions for future work.

Socio-Technical Tools. Our studies motivate many contributor socio-technical tools as mechanisms to improve the value of peer-produced structured content. For example, our results in Study 4 indicate that manual contributors tend to focus on relatively high-demand content compared to other contribution strategies; tools could help these contributors find such content. Further, tools could be useful in both OpenStreetMap and Wikidata to increase community awareness about which contributors are increasing or decreasing problematic societal-level biases. Finally, our work has indicated that many opportunities exist to apply structured data within OpenStreetMap, and tools could help contributors efficiently add more tags. While OpenStreetMap and Wikidata already employ some tools to help contributors apply structured data, our findings indicate that some of these tools enforce community standards inconsistently and instead create their own. This results in data standardization issues that make the data less valuable to applications and algorithms. Given this, it is important for any new or existing tools to accurately embody community standards.

Working Towards Fair Representation within Communities. Of course, even if tools accurately embody community standards, it is still problematic if those standards do not represent the views of the entire community. Our Study 1 results (and results in prior work, e.g. [87]) indicate that hostile and sexist behavior occurs in OpenStreetMap and, further, that certain initiatives such as Humanitarian OpenStreetMap are more influential than others in driving the evolution of community standards. This is problematic because, when not everyone is given a voice, community standards and practice become biased towards the views of a subset of the community. The resulting effects on excluded individuals are problematic as well. While some excluded individuals have become highly successful activists in communities such as Wikipedia [29,159], this scenario is rare. It is more likely that those whose voices are not heard will either 1) leave the community or 2) perform contributions in their own way. The latter outcome results in standardization issues. Future work should prototype and test various socio-technical means of monitoring community interactions for unhealthy behavior. Additionally, future work should study ways to balance the needs of large-scale, important initiatives such as Humanitarian OpenStreetMap with the needs of other community members.

Resolving Misalignments between Practice and Standards. Communities like OpenStreetMap have a complex relationship between practice and standards. In OpenStreetMap, tagging practice tends to influence standards, but standards tend to also influence practice. Discrepancies inevitably occur between practice and standards as a result of contributor freedom and (the above mentioned) unequal power dynamics and biases. When such discrepancies occur, it is not always clear whether practice or data standards better represent community consensus. To facilitate the consistent representation of community consensus within community tagging standards and tagging practice, it would be useful to have tools that monitor practice and standards and then display where they align/differ. Further, tools could also show which contributors/contributions are responsible for the inconsistencies.

Importance of Addressing Standardization Issues when Creating Structured Data. As noted previously in this thesis, it is especially important for a peer production community to address standardization issues if it is producing structured data. While unstructured data (e.g., Wikipedia articles) is often still comprehensible to human consumers even if it is not well standardized (e.g., even if Wikipedia article structures or formats vary), the applications and algorithms consuming structured data will not effectively process such data unless it is highly standardized. Thus, communities that produce large amounts of structured data, such as OpenStreetMap and Wikidata, should especially invest energy into sociotechnical or other mechanisms for improving contributor representation and consensus building. Such mechanisms should particularly focus on addressing community issues that we and others have identified (e.g., hostility) that can affect standardization.

Better Understanding the Relationship between Contributor Motivations and Value Metrics. The final study in this thesis identifies the effects that different contribution strategies have on different lenses of content value. A natural follow-up study would seek to understand the contributor motivations that resulted in the trends we observed. Such a study might include interviews or surveys targeted at manual contributors and the creators and operators of bots and tools. Study questions could inquire about how contributor motivations relate (if at all) to content quality, demand, and problematic societal biases.

Better Understanding Application Needs. Another potential follow-up study to the work in this thesis would be a qualitative exploration of content value from the perspective of application and algorithm designers and creators. Such a study could inquire into which content characteristics are most valuable to applications and algorithms and which issues arise when processing peer-produced data. The study could look for general themes and challenges across many different applications and algorithms to help guide communities in making content that is optimized for value and for widespread usage. When performing such a study and publishing its results, care would need to be taken to avoid creating the appearance that peer production communities are beholden in any way to certain large corporations or proprietary applications (e.g. Google Knowledge Graph). After all, content produced in peer production is open for all to use, and it would not be in line with peer production principles to create content designed specifically for use within only certain proprietary applications.

Being Cognizant of the Effect of Changes on Contributor Freedom. Finally, it is worth stating that while I have provided suggestions towards improving the value of structured content created in peer production communities, communities that consider making changes to improve content value should carefully analyze the potential effects those changes have on contributor freedom. Communities such as Wikipedia, Wikidata, and OpenStreetMap were founded on principles of contributor freedom. As has been shown in Wikipedia [30], making community changes to improve content value can reduce contributor motivation if contributor freedom is inhibited. Given this, it would perhaps be too idealistic to assume that data produced by these communities can always be optimized to provide the most possible value to applications. Hence, applications and algorithms that use peer-produced structured data (e.g. Google Knowledge Graph, Mapbox, etc.) will necessarily have to learn to deal with some inconsistencies and other data problems.


8 Bibliography

[1] Ahmed Loai Ali, Falko Schmid, Rami Al-Salman, and Tomi Kauppinen. 2014. Ambiguity and plausibility: managing classification quality in volunteered geographic information. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 143–152. Retrieved November 9, 2015 from http://dl.acm.org/citation.cfm?id=2666392 [2] Rachel Baker, Thomas Dee, Brent Evans, and June John. 2018. Bias in Online Classes: Evidence from a Field Experiment. CEPA Working Paper No. 18-03. Stanford Center for Education Policy Analysis (2018). [3] Andrea Ballatore and Michela Bertolotto. 2011. Semantically Enriching VGI in Support of Implicit Feedback Analysis. In Web and Wireless Geographical Information Systems, Katsumi Tanaka, Peter Fröhlich and Kyoung-Sook Kim (eds.). Springer Berlin Heidelberg, 78–93. Retrieved May 12, 2015 from http://link.springer.com/chapter/10.1007/978-3-642-19173-2_8 [4] Andrea Ballatore and Peter Mooney. 2015. Conceptualising the geographic world: the dimensions of negotiation in crowdsourced cartography. International Journal of Geographical Information Science 29, 12 (2015), 2310–2327. [5] Andrea Ballatore and Peter Mooney. 2015. Conceptualising the geographic world: the dimensions of negotiation in crowdsourced cartography. International Journal of Geographical Information Science 0, 0 (August 2015), 1–18. DOI:https://doi.org/10.1080/13658816.2015.1076825 [6] Patti Bao, Brent Hecht, Samuel Carton, Mahmood Quaderi, Michael Horn, and Darren Gergle. 2012. Omnipedia: bridging the wikipedia language gap. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1075–1084. Retrieved September 18, 2016 from http://dl.acm.org/citation.cfm?id=2208553 [7] Christopher Barron, Pascal Neis, and Alexander Zipf. 2014. A Comprehensive Framework for Intrinsic OpenStreetMap Quality Analysis. Transactions in GIS 18, 6 (December 2014), 877–895. DOI:https://doi.org/10.1111/tgis.12073 [8] Yochai Benkler and others. 2016. Peer production and cooperation. Handbook on the Economics of the Internet 91, (2016). Retrieved May 10, 2017 from http://books.google.com/books?hl=en&lr=&id=jXwhDAAAQBAJ&oi=fnd&pg=PA91&dq=info:V b-VOwGptmsJ:scholar.google.com&ots=bNxztCB_E2&sig=ybRGYAE85tzPC87U5jsPfd_CiDk [9] Mikhail Yuryevich Bilenko. 2006. Learnable similarity functions and their application to record linkage and clustering. Retrieved September 24, 2015 from http://www.cs.utexas.edu/~ml/papers/marlin- proposal-03.pdf [10] Nama R. Budhathoki and Caroline Haythornthwaite. 2013. Motivation for Open Collaboration Crowd and Community Models and the Case of OpenStreetMap. American Behavioral Scientist 57, 5 (May 2013), 548–575. DOI:https://doi.org/10.1177/0002764212469364 [11] US Census Bureau. Census.gov. Retrieved June 4, 2018 from https://www.census.gov/en.html [12] Ewa S. Callahan and Susan C. Herring. 2011. Cultural bias in Wikipedia content on famous persons. Journal of the American society for information science and technology 62, 10 (2011), 1899–1915. [13] Chris Ciaccia. 2019. Doris Day Wikipedia page defaced with graphic image after her death. Fox News. Retrieved May 21, 2019 from https://www.foxnews.com/tech/doris-day-wikipedia-graphic-image [14] Dan Cosley, Dan Frankowski, Loren Terveen, and John Riedl. 2007. SuggestBot: using intelligent task routing to help people find work in wikipedia. In Proceedings of the 12th international conference on Intelligent user interfaces, 32–41. 
Retrieved September 16, 2016 from http://dl.acm.org/citation.cfm?id=1216309 [15] Nikola Davidovic, Peter Mooney, Leonid Stoimenov, and Marco Minghini. 2016. Tagging in Volunteered Geographic Information: An Analysis of Tagging Practices for Cities and Urban Regions in OpenStreetMap. ISPRS International Journal of Geo-Information 5, 12 (2016), 232. [16] Melanie Eckle and João Porto de Albuquerque. 2015. Quality Assessment of Remote Mapping in OpenStreetMap for Disaster Management Purposes. In ISCRAM.

85 [17] Killian Fox. 2012. OpenStreetMap: “It’s the Wikipedia of maps.” the Guardian. Retrieved March 7, 2018 from http://www.theguardian.com/theobserver/2012/feb/18/openstreetmap-world-map- radicals [18] R. Stuart Geiger. 2011. The lives of bots. (2011). [19] R. Stuart Geiger and Aaron Halfaker. 2013. Using edit sessions to measure participation in Wikipedia. In CSCW, 861–870. [20] R. Stuart Geiger and Aaron Halfaker. 2013. When the levee breaks: without bots, what happens to Wikipedia’s quality control processes? In OpenSym, 6. [21] R. Stuart Geiger and Aaron Halfaker. 2017. Operationalizing Conflict and Cooperation between Automated Software Agents in Wikipedia: A Replication and Expansion of “Even Good Bots Fight.” (2017). [22] Jim Giles. 2005. Internet encyclopaedias go head to head. Nature. DOI:https://doi.org/10.1038/438900a [23] Sean Gillies. Shapely 1.6.4.post2 : Geometric objects, predicates, and operations. Retrieved April 4, 2019 from https://github.com/Toblerity/Shapely [24] Jean-François Girres and Guillaume Touya. 2010. Quality Assessment of the French OpenStreetMap Dataset. Transactions in GIS 14, 4 (August 2010), 435–459. DOI:https://doi.org/10.1111/j.1467- 9671.2010.01203.x [25] Andreea D. Gorbatai. 2011. Exploring Underproduction in Wikipedia. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (WikiSym ’11), 205–206. DOI:https://doi.org/10.1145/2038558.2038595 [26] Mordechai Haklay. 2010. How Good is Volunteered Geographical Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets , How Good is Volunteered Geographical Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets. Environ Plann B Plann Des 37, 4 (August 2010), 682–703. DOI:https://doi.org/10.1068/b35097 [27] Muki Haklay and Nama Budhathoki. OpenStreetMap–Overview and Motivational Factors. ResearchGate. Retrieved September 18, 2016 from https://www.researchgate.net/publication/44295974_OpenStreetMap- Overview_and_Motivational_Factors [28] Scott A. Hale. 2012. Net Increase? Cross-Lingual linking in the blogosphere. Journal of Computer- Mediated Communication 17, 2 (2012), 135–151. [29] Aaron Halfaker. 2017. Interpolating Quality Dynamics in Wikipedia and Demonstrating the Keilana Effect. In Proceedings of the 13th International Symposium on Open Collaboration (OpenSym ’17), 19:1–19:9. DOI:https://doi.org/10.1145/3125433.3125475 [30] Aaron Halfaker, R. Stuart Geiger, Jonathan T. Morgan, and John Riedl. 2012. The rise and decline of an open collaboration system: How Wikipedia’s reaction to popularity is causing its decline. American Behavioral Scientist (2012), 0002764212469365. [31] Aaron Halfaker, Oliver Keyes, Daniel Kluver, Jacob Thebault-Spieker, Tien Nguyen, Kenneth Shores, Anuradha Uduwage, and Morten Warncke-Wang. 2015. User session identification based on strong regularities in inter-activity time. In WWW, 410–418. [32] Aaron Halfaker, Aniket Kittur, Robert Kraut, and John Riedl. 2009. A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia. In WikiSym (WikiSym ’09), 15:1–15:10. DOI:https://doi.org/10.1145/1641309.1641332 [33] Aaron Halfaker, Aniket Kittur, and John Riedl. 2011. Don’t bite the newbies: how reverts affect the quantity and quality of Wikipedia work. In WikiSym, 163–172. Retrieved January 5, 2016 from http://dl.acm.org/citation.cfm?id=2038585 [34] Aaron Halfaker and John Riedl. 2012. Bots and cyborgs: Wikipedia’s immune system. Computer 45, 3 (2012), 79–82. 
[35] Daniel Hasan Dalip, Marcos André Gonçalves, Marco Cristo, and Pável Calado. 2009. Automatic Quality Assessment of Content Created Collaboratively by Web Communities: A Case Study of Wikipedia. In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL ’09), 295–304. DOI:https://doi.org/10.1145/1555400.1555449

86 [36] Brent Hecht and Darren Gergle. 2009. Measuring self-focus bias in community-maintained knowledge repositories. In Proceedings of the fourth international conference on Communities and technologies, 11–20. Retrieved August 4, 2016 from http://dl.acm.org/citation.cfm?id=1556463 [37] Brent Hecht and Darren Gergle. 2010. The tower of Babel meets web 2.0: user-generated content and its applications in a multilingual context. In Proceedings of the SIGCHI conference on human factors in computing systems, 291–300. Retrieved September 21, 2016 from http://dl.acm.org/citation.cfm?id=1753370 [38] Brent Hecht and Monica Stephens. 2014. A Tale of Cities: Urban Biases in Volunteered Geographic Information. ICWSM 14, (2014), 197–205. [39] Benjamin Mako Hill and Aaron Shaw. 2014. Consider the Redirect: A Missing Dimension of Wikipedia Research. In Proceedings of The International Symposium on Open Collaboration - OpenSym ’14, 1–4. DOI:https://doi.org/10.1145/2641580.2641616 [40] Meiqun Hu, Ee-Peng Lim, Ee-Peng Lim, Aixin Sun, Hady Wirawan Lauw, and Ba-Quy Vuong. 2007. Measuring Article Quality in Wikipedia: Models and Evaluation. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management (CIKM ’07), 243– 252. DOI:https://doi.org/10.1145/1321440.1321476 [41] Musfira Jilani, Padraig Corcoran, and Michela Bertolotto. 2014. Automated highway tag assessment of OpenStreetMap road networks. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 449–452. Retrieved November 9, 2015 from http://dl.acm.org/citation.cfm?id=2666476 [42] Isaac L. Johnson, Yilun Lin, Toby Jia-Jun Li, Andrew Hall, Aaron Halfaker, Johannes Schöning, and Brent Hecht. 2016. Not at home on the range: Peer production and the urban/rural divide. In CHI, 13–25. Retrieved April 24, 2017 from http://dl.acm.org/citation.cfm?id=2858123 [43] Isaac L. Johnson, Subhasree Sengupta, Johannes Schöning, and Brent Hecht. 2016. The Geography and Importance of Localness in Geotagged Social Media. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16), 515–526. DOI:https://doi.org/10.1145/2858036.2858122 [44] Isaac Johnson, Yilun Lin, Toby Jia-Jun Li, Andrew Hall, Aaron Halfaker, Johannes Schöning, and Brent Hecht. 2016. Not at home on the range: Peer production and the urban/rural divide. (2016). Retrieved August 5, 2016 from https://uhdspace.uhasselt.be/dspace/handle/1942/20228 [45] Ah Reum Kang, Jiyoung Woo, Juyong Park, and Huy Kang Kim. 2013. Online game bot detection based on party-play log analysis. Computers & Mathematics with Applications 65, 9 (2013), 1384–1395. [46] Hongwen Kang, Kuansan Wang, David Soukal, Fritz Behr, and Zijian Zheng. 2010. Large-scale bot detection for search engines. In GROUP, 501–510. [47] Nikos Karagiannakis, Giorgos Giannopoulos, Dimitrios Skoutas, and Spiros Athanasiou. 2015. OSMRec Tool for Automatic Recommendation of Categories on Spatial Entities in OpenStreetMap. In Proceedings of the 9th ACM Conference on Recommender Systems, 337–338. Retrieved October 27, 2015 from http://dl.acm.org/citation.cfm?id=2796555 [48] Matthew Kay, Cynthia Matuszek, and Sean A. Munson. 2015. Unequal Representation and Gender Stereotypes in Image Search Results for Occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15), 3819–3828. 
DOI:https://doi.org/10.1145/2702123.2702520 [49] Firas Khatib, Frank DiMaio, Seth Cooper, Maciej Kazmierczyk, Miroslaw Gilski, Szymon Krzywda, Helena Zabranska, Iva Pichova, James Thompson, Zoran Popovi, Mariusz Jaskolski, and David Baker. 2011. Crystal structure of a monomeric retroviral protease solved by protein folding game players. Nat Struct Mol Biol 18, 10 (September 2011), 1175–1177. DOI:https://doi.org/10.1038/nsmb.2119 [50] Aniket Kittur, Ed Chi, Bryan A. Pendleton, Bongwon Suh, and Todd Mytkowicz. 2007. Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie. World wide web 1, 2 (2007), 19. [51] Aniket Kittur and Robert E. Kraut. 2008. Harnessing the Wisdom of Crowds in Wikipedia: Quality Through Coordination. In Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work (CSCW ’08), 37–46. DOI:https://doi.org/10.1145/1460563.1460572

87 [52] Marina Kogan, Jennings Anderson, Leysia Palen, Kenneth M. Anderson, and Robert Soden. 2016. Finding the Way to OSM Mapping Practices: Bounding Large Crisis Datasets for Qualitative Investigation. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2783–2795. Retrieved May 26, 2016 from http://dl.acm.org/citation.cfm?id=2858371 [53] Travis Kriplean, Ivan Beschastnikh, David W. McDonald, and Scott A. Golder. 2007. Community, consensus, coercion, control: cs* w or how policy mediates mass participation. In Proceedings of the 2007 international ACM conference on Supporting group work, 167–176. Retrieved September 11, 2016 from http://dl.acm.org/citation.cfm?id=1316648 [54] Baptiste de La Robertie, Yoann Pitarch, and Olivier Teste. 2015. Measuring Article Quality in Wikipedia Using the Collaboration Network. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (ASONAM ’15), 464–471. DOI:https://doi.org/10.1145/2808797.2808895 [55] Karim R. Lakhani and Robert G. Wolf. 2003. Why hackers do what they do: Understanding motivation and effort in free/open source software projects. (2003). [56] Shyong K. Lam, Jawed Karim, and John Riedl. 2010. The effects of group composition on decision quality in a social production community. In Proceedings of the 16th ACM international conference on Supporting group work, 55–64. Retrieved September 13, 2016 from http://dl.acm.org/citation.cfm?id=1880083 [57] Shyong Tony K. Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, David R. Musicant, Loren Terveen, and John Riedl. 2011. WP: clubhouse?: an exploration of Wikipedia’s gender imbalance. In Proceedings of the 7th international symposium on Wikis and open collaboration, 1–10. [58] Ganaele Langlois, Fenwick McKelvey, Greg Elmer, and Kenneth Werbin. 2009. Mapping commercial Web 2.0 worlds: Towards a new critical ontogenesis. Fibreculture 14, (2009), 1–14. [59] Helen Lee. 2012. The role of local food availability in explaining obesity risk among young school-aged children. Social Science & Medicine 74, 8 (April 2012), 1193–1203. DOI:https://doi.org/10.1016/j.socscimed.2011.12.036 [60] Janette Lehmann, Claudia Müller-Birn, David Laniado, Mounia Lalmas, and Andreas Kaltenbrunner. 2014. Reader Preferences and Behavior on Wikipedia. In Proceedings of the 25th ACM Conference on Hypertext and Social Media (HT ’14), 88–97. DOI:https://doi.org/10.1145/2631775.2631805 [61] Lawrence Lessig. 1999. Code and Other Laws of Cyberspace. [62] Yilun Lin, Bowen Yu, Andrew Hall, and Brent Hecht. 2017. Problematizing and Addressing the Article- as-Concept Assumption in Wikipedia. In Proceedings of the 20th ACM Conference on Computer Supported Cooperative Work & Social Computing. DOI:https://doi.org/10.1145/2998181.2998274 [63] Yu-Wei Lin. 2011. A qualitative enquiry into OpenStreetMap making. New Review of Hypermedia and Multimedia 17, 1 (2011), 53–71. [64] Nedim Lipka and Benno Stein. 2010. Identifying Featured Articles in Wikipedia: Writing Style Matters. In Proceedings of the 19th International Conference on World Wide Web (WWW ’10), 1147–1148. DOI:https://doi.org/10.1145/1772690.1772847 [65] Ina Ludwig, Angi Voss, and Maike Krause-Traudes. 2011. A Comparison of the Street Networks of Navteq and OSM in Germany. In Advancing geoinformation science for a changing world. Springer, 65–84. [66] Afra Mashhadi, Giovanni Quattrone, and Licia Capra. 2013. Putting Ubiquitous Crowd-sourcing into Context. 
In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (CSCW ’13), 611–622. DOI:https://doi.org/10.1145/2441776.2441845 [67] Olena Medelyan, David Milne, Catherine Legg, and Ian H. Witten. 2009. Mining meaning from Wikipedia. International Journal of Human-Computer Studies 67, 9 (September 2009), 716–754. DOI:https://doi.org/10.1016/j.ijhcs.2009.05.004 [68] Amanda Menking, David W. McDonald, and Mark Zachry. 2017. Who wants to read this?: A method for measuring topical representativeness in user generated content systems. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 2068– 2081. [69] Peter Mooney and Padraig Corcoran. 2012. The Annotation Process in OpenStreetMap. Transactions in GIS 16, 4 (August 2012), 561–579. DOI:https://doi.org/10.1111/j.1467-9671.2012.01306.x

88 [70] Peter Mooney, Padraig Corcoran, and Adam C. Winstanley. 2010. Towards Quality Metrics for OpenStreetMap. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS ’10), 514–517. DOI:https://doi.org/10.1145/1869790.1869875 [71] Michael Muller. 2014. Curiosity, Creativity, and Surprise as Analytic Tools: Grounded Theory Method. In Ways of Knowing in HCI, Judith S. Olson and Wendy A. Kellogg (eds.). Springer New York, 25– 48. DOI:https://doi.org/10.1007/978-1-4939-0378-8_2 [72] Claudia Müller-Birn, Benjamin Karran, Janette Lehmann, and Markus Luczak-Rösch. 2015. Peer- production system or collaborative ontology engineering effort: What is Wikidata? In OpenSym, 20. Retrieved June 27, 2016 from http://dl.acm.org/citation.cfm?id=2789836 [73] Oded Nov. 2007. What motivates wikipedians? Communications of the ACM 50, 11 (2007), 60–64. [74] M. Over, A. Schilling, S. Neubauer, and A. Zipf. 2010. Generating web-based 3D City Models from OpenStreetMap: The current situation in Germany. Computers, Environment and Urban Systems 34, 6 (November 2010), 496–507. DOI:https://doi.org/10.1016/j.compenvurbsys.2010.05.001 [75] Leysia Palen, Robert Soden, T. Jennings Anderson, and Mario Barrenechea. 2015. Success & scale in a data-producing organization: the socio-technical evolution of OpenStreetMap in response to humanitarian events. In Proceedings of the 33rd annual ACM conference on human factors in computing systems, 4113–4122. Retrieved September 15, 2016 from http://dl.acm.org/citation.cfm?id=2702294 [76] Katherine Panciera, Aaron Halfaker, and Loren Terveen. 2009. Wikipedians Are Born, Not Made: A Study of Power Editors on Wikipedia. In GROUP (GROUP ’09), 51–60. DOI:https://doi.org/10.1145/1531674.1531682 [77] S. Hassany Pazoky, F. Karimipour, and F. Hakimpour. 2015. YOU DESCRIBE IT, I WILL NAME IT: AN APPROACH TO ALLEVIATE THE EFFECT OF USERS’SEMANTICS IN ASSIGNING TAGS TO FEATURES IN VGI. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences 40, (2015). Retrieved October 26, 2015 from http://www.int-arch- photogramm-remote-sens-spatial-inf-sci.net/XL-3-W3/39/2015/isprsarchives-XL-3-W3-39- 2015.pdf [78] Alessandro Piscopo, Chris Phethean, and Elena Simperl. 2017. What Makes a Good Collaborative Knowledge Graph: Group Composition and Quality in Wikidata. In Social Informatics (Lecture Notes in Computer Science), 305–322. DOI:https://doi.org/10.1007/978-3-319-67217-5_19 [79] Martin Potthast, Benno Stein, and Teresa Holfeld. 2010. Overview of the 1st International Competition on Wikipedia Vandalism Detection. In CLEF (Notebook Papers/LABs/Workshops). [80] Giovanni Quattrone, Martin Dittus, and Licia Capra. 2017. Work Always in Progress: Analysing Maintenance Practices in Spatial Crowd-sourced Datasets. Retrieved March 13, 2017 from http://www0.cs.ucl.ac.uk/staff/l.capra/publications/cscw2017gio.pdf [81] Sapna Maheshwari BuzzFeed News Reporter. 19 Facts That Show Just How Massive Walmart Really Is. BuzzFeed. Retrieved August 8, 2016 from http://www.buzzfeed.com/sapna/19-facts-that-show- just-how-massive-walmart-really-is [82] Amir Sarabadani, Aaron Halfaker, and Dario Taraborelli. 2017. Building automated vandalism detection tools for Wikidata. In WWW, 1647–1654. [83] C. Sarasua, A. Checco, G. Demartini, D. Difallah, M. Feldman, and L. Pintscher. The Evolution of Power and Standard Wikidata Editors - Comparing Editing Behavior over Time to Predict Lifespan and Volume of Edits. [84] Shilad Sen, Shyong K. 
Lam, Al Mamunur Rashid, Dan Cosley, Dan Frankowski, Jeremy Osterhouse, F. Maxwell Harper, and John Riedl. 2006. Tagging, communities, vocabulary, evolution. In Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, 181–190. Retrieved September 19, 2016 from http://dl.acm.org/citation.cfm?id=1180904 [85] Shilad W. Sen, Heather Ford, David R. Musicant, Mark Graham, Oliver SB Keyes, and Brent Hecht. 2015. Barriers to the localness of volunteered geographic information. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 197–206. Retrieved August 4, 2016 from http://dl.acm.org/citation.cfm?id=2702170

89 [86] Thomas Steiner. 2014. Bots vs. wikipedians, anons vs. logged-ins (redux): A global study of edit activity on wikipedia and wikidata. In OpenSym, 25. Retrieved June 24, 2016 from http://dl.acm.org/citation.cfm?id=2641613 [87] Monica Stephens. 2013. Gender and the GeoWeb: divisions in the production of user-generated cartographic information. GeoJournal 78, 6 (2013), 981–996. [88] Besiki Stvilia, Abdullah Al-Faraj, and Yong Jeong Yi. 2009. Issues of cross-contextual information quality evaluation—The case of Arabic, English, and Korean . Library & Information Science Research 31, 4 (December 2009), 232–239. DOI:https://doi.org/10.1016/j.lisr.2009.07.005 [89] Pang-Ning Tan and Vipin Kumar. 2004. Discovery of web robot sessions based on their navigational patterns. In Intelligent Technologies for Information Analysis. Springer, 193–222. [90] Ajay Thampi. 2019. A fast, offline reverse geocoder in Python. Contribute to thampiman/reverse- geocoder development by creating an account on GitHub. Retrieved April 4, 2019 from https://github.com/thampiman/reverse-geocoder [91] Ruck Thawonmas, Yoshitaka Kashifuji, and Kuan-Ta Chen. 2008. Detection of MMORPG bots based on behavior analysis. In Proceedings of the 2008 International Conference on Advances in Computer Entertainment Technology, 91–94. [92] Milena Tsvetkova, Ruth García-Gavilanes, Luciano Floridi, and Taha Yasseri. 2016. Even Good Bots Fight. arXiv preprint arXiv:1609.04285 (2016). [93] Arnaud Vandecasteele and Rodolphe Devillers. 2015. Improving Volunteered Geographic Information Quality Using a Tag : The Case of OpenStreetMap. In OpenStreetMap in GIScience, Jamal Jokar Arsanjani, Alexander Zipf, Peter Mooney and Marco Helbich (eds.). Springer International Publishing, 59–80. Retrieved March 23, 2015 from http://link.springer.com/chapter/10.1007/978-3-319-14280-7_4 [94] Maja van der Velden. 2013. Decentering Design: Wikipedia and Indigenous Knowledge. International Journal of Human-Computer Interaction 29, 4 (April 2013), 308–316. DOI:https://doi.org/10.1080/10447318.2013.765768 [95] Morten Warncke-Wang, Vivek Ranjan, Loren Terveen, and Brent Hecht. 2015. Misalignment Between Supply and Demand of Quality Content in Peer Production Communities. In ICWSM. Retrieved September 16, 2016 from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10591 [96] Yanxiang Xu and Tiejian Luo. 2011. Measuring article quality in Wikipedia: Lexical clue model. In 2011 3rd Symposium on Web Society, 141–146. DOI:https://doi.org/10.1109/SWS.2011.6101286 [97] Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy. 2017. Identifying Semantic Edit Intentions from Revisions in Wikipedia. In EMNLP 2017, 2000–2010. Retrieved July 7, 2018 from https://www.aclweb.org/anthology/D17-1213 [98] Hongyu Zhang and Jacek Malczewski. 2017. Quality Evaluation of Volunteered Geographic Information: The Case of OpenStreetMap. In Volunteered Geographic Information and the Future of Geospatial Data. IGI Global, 19–46. [99] Dennis Zielstra, Hartwig H. Hochmair, and Pascal Neis. 2013. Assessing the effect of data imports on the completeness of OpenStreetMap–a United States case study. Transactions in GIS 17, 3 (2013), 315–334. [100] Dennis Zielstra and Alexander Zipf. 2010. A comparative study of proprietary geodata and volunteered geographic information for Germany. In 13th AGILE international conference on geographic information science. Retrieved January 5, 2016 from http://agile2010.dsi.uminho.pt/pen/shortpapers_pdf/142_doc.pdf [101] 2015. 
20 Interesting flickr Stats and Facts | By the Numbers. DMR. Retrieved May 20, 2019 from https://expandedramblings.com/index.php/flickr-stats/ [102] 2017. Wikipedia:Bot Approvals Group. Wikipedia. Retrieved January 20, 2018 from https://en.wikipedia.org/w/index.php?title=Wikipedia:Bot_Approvals_Group&oldid=807843217 [103] 2017. Wikipedia:History of Wikipedia bots. Wikipedia. Retrieved January 20, 2018 from https://en.wikipedia.org/w/index.php?title=Wikipedia:History_of_Wikipedia_bots&oldid=8129140 46 [104] 2017. Wikipedia:Five pillars. Wikipedia. Retrieved February 28, 2018 from https://en.wikipedia.org/w/index.php?title=Wikipedia:Five_pillars&oldid=815197819

90 [105] 2018. Wikipedia:Bot policy. Wikipedia. Retrieved January 20, 2018 from https://en.wikipedia.org/w/index.php?title=Wikipedia:Bot_policy&oldid=820435660 [106] 2018. Coding (social sciences). Wikipedia. Retrieved July 7, 2018 from https://en.wikipedia.org/w/index.php?title=Coding_(social_sciences)&oldid=834193623 [107] 2018. Wikipedia:AutoWikiBrowser. Wikipedia. Retrieved July 8, 2018 from https://en.wikipedia.org/w/index.php?title=Wikipedia:AutoWikiBrowser&oldid=840931199 [108] 2019. Wikipedia:The Free Encyclopedia. Wikipedia. Retrieved May 21, 2019 from https://en.wikipedia.org/w/index.php?title=Wikipedia:The_Free_Encyclopedia&oldid=878091811 [109] 2019. Wikipedia:Content assessment. Wikipedia. Retrieved April 4, 2019 from https://en.wikipedia.org/w/index.php?title=Wikipedia:Content_assessment&oldid=886342369 [110] 2019. Facebook, Axios And NBC Paid This Guy To Whitewash Wikipedia Pages. HuffPost Canada. Retrieved May 21, 2019 from https://www.huffpost.com/entry/wikipedia-paid-editing-pr-facebook- nbc-axios_n_5c63321be4b03de942967225 [111] 2019. History of free and open-source software. Wikipedia. Retrieved May 20, 2019 from https://en.wikipedia.org/w/index.php?title=History_of_free_and_open- source_software&oldid=893525939 [112] 2019. Wisdom of the crowd. Wikipedia. Retrieved May 21, 2019 from https://en.wikipedia.org/w/index.php?title=Wisdom_of_the_crowd&oldid=897664237 [113] Alexa Top 500 Global Sites. Retrieved January 24, 2018 from https://www.alexa.com/topsites [114] Wikipedia:Be bold - Wikipedia, the free encyclopedia. Retrieved August 7, 2016 from https://en.wikipedia.org/wiki/Wikipedia:Be_bold [115] Good practice - OpenStreetMap Wiki. Retrieved August 7, 2016 from http://wiki.openstreetmap.org/wiki/Good_practice [116] Official Google Blog: Introducing the Knowledge Graph: things, not strings. Retrieved August 8, 2016 from https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html [117] Humanitarian OpenStreetMap Team. Retrieved August 7, 2016 from https://hotosm.org/ [118] Reasonator. Retrieved August 7, 2016 from https://tools.wmflabs.org/reasonator/ [119] Category:Templates using data from Wikidata - Wikipedia, the free encyclopedia. Retrieved August 8, 2016 from https://en.wikipedia.org/wiki/Category:Templates_using_data_from_Wikidata [120] About Jimmy | . Retrieved May 21, 2019 from http://jimmywales.com/about-jimmy/ [121] Wikipedia Gets 9 Out of 10 Health Ailments Wrong. Time. Retrieved May 21, 2019 from http://time.com/118904/study-dont-trust-wikipedia-when-it-comes-to-your-health/ [122] Semi-colon value separator - OpenStreetMap Wiki. Retrieved December 5, 2016 from http://wiki.openstreetmap.org/wiki/Semi-colon_value_separator [123] Wikidata. Retrieved August 7, 2016 from https://www.wikidata.org/wiki/Wikidata:Main_Page [124] Who Writes Wikipedia? (Aaron Swartz’s Raw Thought). Retrieved January 18, 2018 from http://www.aaronsw.com/weblog/whowriteswikipedia [125] Research:Measuring edit productivity - Meta. Retrieved April 17, 2018 from https://meta.wikimedia.org/wiki/Research:Measuring_edit_productivity [126] TIGER - OpenStreetMap Wiki. Retrieved January 20, 2018 from https://wiki.openstreetmap.org/wiki/TIGER [127] TIGER fixup - OpenStreetMap Wiki. Retrieved January 20, 2018 from https://wiki.openstreetmap.org/wiki/TIGER_fixup [128] Import/Guidelines - OpenStreetMap Wiki. Retrieved January 20, 2018 from https://wiki.openstreetmap.org/wiki/Import/Guidelines [129] ORES - MediaWiki. 
Retrieved July 5, 2018 from https://www.mediawiki.org/wiki/ORES [130] How did you contribute to OpenStreetMap ? Retrieved September 18, 2016 from http://hdyc.neis- one.org/ [131] About | Humanitarian OpenStreetMap Team. Retrieved August 7, 2016 from https://hotosm.org/about [132] HOT Tasking Manager -. Retrieved August 7, 2016 from http://tasks.hotosm.org/ [133] LearnOSM. Retrieved September 15, 2016 from http://learnosm.org/en/coordination/remote/

91 [134] Wikipedia:WikiProject Council/Guide/WikiProject - Wikipedia. Retrieved December 31, 2016 from https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Guide/WikiProject#Violating_polici es [135] Episode 708: Bitcoin Divided. NPR.org. Retrieved September 13, 2016 from http://www.npr.org/sections/money/2016/06/29/484029238/episode-708-bitcoin-divided [136] Import/Guidelines - OpenStreetMap Wiki. Retrieved January 6, 2017 from http://wiki.openstreetmap.org/wiki/Import/Guidelines [137] Tag:amenity=fast_food - OpenStreetMap Wiki. Retrieved May 28, 2018 from https://wiki.openstreetmap.org/wiki/Tag:amenity=fast_food [138] Proposal process - OpenStreetMap Wiki. Retrieved June 10, 2018 from https://wiki.openstreetmap.org/wiki/Proposal_process [139] Fast food industry market share in the U.S. 2015 | Statistic. Retrieved August 7, 2016 from http://www.statista.com/statistics/196611/market-share-of-fast-food-restaurant-corporations-in-the- us/ [140] Index of /full-history-extracts/latest/continents. Retrieved June 2, 2018 from http://osm.personalwerk.de/full-history-extracts/latest/continents/ [141] Tag:shop=supermarket - OpenStreetMap Wiki. Retrieved June 3, 2018 from https://wiki.openstreetmap.org/wiki/Tag:shop%3Dsupermarket [142] Wikidata:Requests for comment/Conflict of Interest - Wikidata. Retrieved March 6, 2018 from https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Conflict_of_Interest [143] Wikidata:Bots - Wikidata. Retrieved July 4, 2018 from https://www.wikidata.org/wiki/Wikidata:Bots [144] Wikidata:Glossary - Wikidata. Retrieved July 1, 2018 from https://www.wikidata.org/wiki/Wikidata:Glossary [145] Manual:Tags - MediaWiki. Retrieved July 9, 2018 from https://www.mediawiki.org/wiki/Manual:Tags [146] Proposed features/changeset tags - OpenStreetMap Wiki. Retrieved September 3, 2018 from https://wiki.openstreetmap.org/wiki/Proposed_features/changeset_tags [147] Wikidata:Item quality - Wikidata. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Wikidata:Item_quality [148] Wikimedia Downloads. Retrieved April 4, 2019 from https://dumps.wikimedia.org/ [149] instance of. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Property:P31 [150] human. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Q5 [151] sex or gender. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Property:P21 [152] male. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Q6581097 [153] female. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Q6581072 [154] Wikidata Query Service. Retrieved April 4, 2019 from https://query.wikidata.org/ [155] coordinate location. Retrieved April 4, 2019 from https://www.wikidata.org/wiki/Property:P625 [156] Data Access - Urban Rural Classification Scheme for Counties. Retrieved August 7, 2016 from http://www.cdc.gov/nchs/data_access/urban_rural.htm [157] Solve Puzzles for Science | Foldit. Retrieved April 7, 2019 from https://fold.it/portal/ [158] Wikidata - The Game. Retrieved April 7, 2019 from https://tools.wmflabs.org/wikidata-game/ [159] The new alchemy: turning online harassment into Wikipedia articles on women scientists – Wikimedia Blog. Retrieved May 31, 2019 from https://blog.wikimedia.org/2016/03/08/alchemy- turning-harassment-into-women-scientists/ [160] QuickStatements - Meta. Retrieved April 4, 2019 from https://meta.wikimedia.org/wiki/QuickStatements [161] PetScan - Meta. Retrieved April 4, 2019 from https://meta.wikimedia.org/wiki/PetScan


9 Appendix

9.1 Study 2 Sampling Contingent Metadata

We sampled Contingent keys (specific samples discussed below) and consulted each business’s website store locator to determine whether the attribute represented by the key was present for a specific business location. If the attribute was present, we checked whether metadata indicating that attribute existed. For example, we checked whether a particular McDonald’s had a drive-through and, if it did, whether it then had metadata to represent that attribute.

We checked for attribute presence because tags may sometimes indicate a non-existent attribute. For example, the OSM wiki states that it is valid to explicitly indicate that a fast food restaurant does not have a drive-through: “drive_through=no”. It is important to note that, in our analysis of “missed tagging opportunities”, we did not consider it a “missed opportunity” if a tag for a non-existent attribute was unapplied. This situation appears fairly rare in practice, and tag application in this scenario is not necessarily described in the wiki. Thus, we chose a conservative interpretation of OSM’s ontology when defining “missed opportunities”.
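To make this decision rule concrete, a minimal sketch follows. The function and its arguments are hypothetical, with `tags` standing in for an OSM element’s key-value dictionary and `attribute_exists` standing in for the result of our store-locator check.

```python
# Hypothetical helper encoding the conservative "missed opportunity" rule
# described above, applied to an OSM element's tag dictionary.
def is_missed_opportunity(tags, key, attribute_exists):
    if not attribute_exists:
        return False        # absent real-world attributes are never "missed"
    return key not in tags  # missed only when no tag was applied for the key

assert is_missed_opportunity({}, "drive_through", attribute_exists=True)
assert not is_missed_opportunity({"drive_through": "yes"}, "drive_through", True)
assert not is_missed_opportunity({}, "drive_through", attribute_exists=False)
```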

To sample Contingent metadata, we manually selected the key “internet_access” for McDonald’s and Starbucks and the key “drive_through” for McDonald’s, Starbucks, and Walgreens. We chose these businesses and keys because the businesses were among the United States’ leaders in their market categories and because the two keys represent important attributes for potential customers of these businesses. Chain businesses are important fixtures in low-income areas, where internet penetration also tends to suffer; whether a given McDonald’s has internet access can thus be very important. When available, these types of metadata are also used by applications such as Google Maps and Yelp, indicating their importance and demand. Given that we looked at prominent businesses and broadly important attributes, we felt the sampled metadata should be among the most frequently applied Contingent metadata. We analyzed 60 instances per business for 180 total instances and 300 potential business-key-value triples. Out of the 300 triples, 245 of the attributes represented by those triples actually existed in the real world.

Given our choice of sampled Contingent metadata, the fact that Contingent metadata is laborious to apply, and the fact that even less Contingent metadata was applied in applicable scenarios relative to Universal-Varying metadata, we had confidence that our sampling approach provided a reasonable “best case” proxy for the degree to which all Contingent metadata is applied when it is applicable. Further, we were confident that this degree of application was very low.

9.2 Study 3 Details of Data Preparation

9.2.1 Extracting Revision Data and Sessionization

Building our prediction model required historical Wikidata revision data. When a bot or human edits Wikidata, they produce a revision (we use the word “edit” interchangeably) that contains the changes made as well as metadata such as namespace, page title, revision id, user id, username (in the case of anonymous user edits, IP address), timestamp, and comment. A given namespace corresponds to a distinct type of Wikidata content. Namespaces exist for items, properties (which represent attributes of items), general community discussion, policy development, and more. To access revision data for all pages throughout Wikidata’s history, we used the “stub-meta-history” xml dump files from May 1, 2017 from https://dumps.wikimedia.org. There were over 476 million revisions, predominantly from namespace 0 (item entities). We extracted this dump data using the mwxml utility^30.
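For reference, a minimal sketch of this extraction step is shown below. It assumes mwxml’s documented Dump → page → revision iteration interface; the file path is a placeholder, and the fields collected are the metadata listed above.

```python
import mwxml

# Sketch (assuming mwxml's documented iteration API): pull the revision
# metadata fields used in this study from a stub-meta-history dump file.
# The file path below is a placeholder for the May 1, 2017 Wikidata dump.
dump = mwxml.Dump.from_file(open("wikidatawiki-stub-meta-history.xml"))

rows = []
for page in dump:
    for revision in page:
        rows.append({
            "namespace": page.namespace,
            "page_title": page.title,
            "rev_id": revision.id,
            "user_id": revision.user.id if revision.user else None,
            "user_text": revision.user.text if revision.user else None,  # IP for anons
            "timestamp": revision.timestamp,
            "comment": revision.comment,
        })
```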

We next “session”-ized all revision data. As mentioned previously, prior work [19,31] has defined a human editing “session” as a period in which a human makes edits without taking large breaks. This work identified implicit behavioral signals in editing sessions that have predictive power to distinguish bots from human contributors. Specifically, people generally edit with predictable inter-edit times. Bots typically edit with significantly smaller inter-edit times. We incorporated these and other implicit session characteristics in our model.
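To make one such implicit signal concrete, a small sketch follows. The session representation (a list of edit timestamps in seconds) and the function names are illustrative, not the study’s exact feature code.

```python
from statistics import median

# Illustrative sketch of one implicit behavioral signal described above:
# the gaps between consecutive edits within a session (timestamps in seconds).
def inter_edit_gaps(timestamps):
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def median_inter_edit_gap(timestamps):
    gaps = inter_edit_gaps(timestamps)
    return median(gaps) if gaps else 0.0

# Bots tend to produce much smaller gaps than people.
print(median_inter_edit_gap([0, 2, 4, 6]))        # bot-like: 2.0
print(median_inter_edit_gap([0, 180, 420, 900]))  # human-like: 240.0
```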

To session-ize all the revisions from logged-in users for model training and testing (ground truth labels are discussed in Section 9.2.2), we applied the mwsessions utility^31 with default parameters. As recommended by prior research [19,31], the default cutoff of greater than one hour of inactivity was used to determine the end of one session and the beginning of another by the same user. This produced 4,990,652 logged-in user sessions corresponding to 177,134 unique user ids. The median number of sessions per user id was 1, while the mean was 28.17.
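A small sketch of this step is shown below. It assumes mwsessions’ documented `sessionize()` generator (default one-hour cutoff) over a timestamp-ordered stream of (user, timestamp, data) events.

```python
import mwsessions

# Sketch (assuming mwsessions.sessionize with its default 3600s cutoff):
# events are (user, unix_timestamp, payload) tuples sorted by timestamp.
events = [
    ("ExampleUser", 1490000000, {"rev_id": 1}),
    ("ExampleUser", 1490000100, {"rev_id": 2}),
    ("ExampleUser", 1490010000, {"rev_id": 3}),  # >1 hour later: new session
]

sessions = []
for user, session_events in mwsessions.sessionize(events):
    sessions.append((user, [e["rev_id"] for e in session_events]))

# sessions == [("ExampleUser", [1, 2]), ("ExampleUser", [3])]
```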

We then randomly sampled 100,000 sessions for model training and 100,000 for testing. We filtered out sessions from the test set that were also in the training set to ensure no overlap. We also filtered out sessions that did not contain at least one edit to a Property or Item page – the main content of Wikidata. Further, since we are concerned mainly with large sequences of potential bot activity, we filtered out sessions with fewer than three edits. After these filtering steps, we had 36,252 training and 35,191 testing sessions.
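A hedged sketch of these filtering rules follows. The session representation and the namespace numbers (0 for Items, 120 for Properties) are assumptions made for illustration.

```python
# Sketch of the session-filtering rules described above. A session here is a
# list of revision dicts with a "namespace" field; namespaces 0 (Item) and
# 120 (Property) are assumed to be Wikidata's content namespaces.
CONTENT_NAMESPACES = {0, 120}

def keep_session(session):
    if len(session) < 3:  # drop sessions with fewer than three edits
        return False
    # require at least one Item- or Property-page edit
    return any(rev["namespace"] in CONTENT_NAMESPACES for rev in session)

example_sessions = [
    [{"namespace": 0}, {"namespace": 0}, {"namespace": 1}],   # kept
    [{"namespace": 4}, {"namespace": 4}, {"namespace": 4}],   # no content edits
    [{"namespace": 0}, {"namespace": 120}],                   # too short
]
kept = [s for s in example_sessions if keep_session(s)]  # only the first survives
```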

To test our classifier model's performance on anonymous user bot activities, we gathered a similar sample of anonymous activities. We session-ized all of Wikidata’s 2,114,900 anonymous revisions into 677,050 sessions, session-izing based on IP address.

30 http://pythonhosted.org/mwxml/

31 http://pythonhosted.org/mwsessions/

After following the same filtering steps that we used for the training and testing datasets, we had 110,288 sessions for testing the detection of anonymous bot editing.

Recall that anonymous contributor edits are tied to an IP address; therefore, one cannot assume that a given IP address will correspond to only one user over long periods of time. However, an IP address is unlikely to change within a continuous activity session, so it serves as a durable-enough identifier for our purposes [19].

9.2.2 Ground Truth Bot Account Data

To obtain labeled training and testing data for registered user edits, we compiled a list of all currently and formerly approved bot accounts. We obtained this data by looking for “bot flags” in two publicly accessible tables called “user_groups” and “user_former_groups”^32. We then merged this bot account data with the registered user revision training and testing data based on the user id fields in each dataset^33. All accounts not labeled as bots were considered human accounts unless their username ended in “bot” (insensitive to case), in which case they were omitted from the training and testing datasets. This was a conservative approach: we wanted to train and test our model on the least noisy data possible in order to get the best predictions. In our training set, 34,797 sessions were labeled as human and 1,455 as bot. In our testing set, 33,775 sessions were labeled as human and 1,416 as bot. Since there are more bot edits than human edits in Wikidata, one might wonder why our dataset contains so many more human sessions than bot sessions. This is because bot edit sessions tend to produce more edits than human sessions.
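To make the labeling rule concrete, here is a minimal sketch. The function name and arguments are hypothetical, and the set of approved bot ids stands in for the merged user_groups/user_former_groups data described above.

```python
# Hypothetical helper encoding the labeling rule described above.
# approved_bot_ids stands in for the merged user_groups/user_former_groups data.
def label_account(user_id, username, approved_bot_ids):
    if user_id in approved_bot_ids:
        return "bot"
    if username is not None and username.lower().endswith("bot"):
        return None  # omitted from training/testing as a possible unflagged bot
    return "human"

approved_bot_ids = {123, 456}  # placeholder ids
assert label_account(123, "ExampleBot", approved_bot_ids) == "bot"
assert label_account(789, "SomeUserBot", approved_bot_ids) is None
assert label_account(790, "RegularUser", approved_bot_ids) == "human"
```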

32 https://quarry.wmflabs.org/query/19668

33 One “bot flagged” user account was not considered to be a bot account when merging because we had reason to believe that it had not been used exclusively for bot editing.


9.3 Study 3 Bot Prediction Model Features

Table 5.3: Bot Prediction Model Features

9.4 Study 3 Detailed Anonymous Contributor Qualitative Coding Results

We break down the high-level results of our anonymous contributor qualitative coding in each stratum. Results are summarized in Table 5.3.

Recall: 0.0-0.1 stratum. As seen in Table 5.3, all 15 sessions sampled (this stratum had 15 sessions total) were clearly from bots. Most sessions (11) in this stratum contained revisions with boilerplate bot comments (coded as BBC) that made bot recognition straightforward. Some also contained quickly occurring edits (that were coded as FE).

Table 5.3: Summary of the Anonymous Contributor Qualitative Coding Results.

This table organizes results of our coding by “recall stratum” (defined in the text). The fundamental pattern is that low-recall strata include all or nearly all bot sessions, high-recall strata include all or nearly all human sessions, and that the intermediate strata contain a mix of the two. The coding results also provide explanations of the makeup of the sessions in each stratum, specified by the codes defined in Table 5.2 of the chapter. This analysis confirms that the model is effective at identifying edit sessions made by bots.

Recall Stratum | # Bot Sessions | # Human Sessions | # Unknown Sessions | Notes (see Table 5.2 for code descriptions)
0.0-0.1 | 15 | 0 | 0 | 11/15 sessions were coded BBC. Lots of FE sessions.
0.1-0.2 | 18 | 0 | 2 | 9/20 sessions were EBC or FE. Quite a few BBC sessions as well.
0.2-0.3 | 16 | 4 | 0 | Quite a few EBC or FE sessions again. A few obvious human sessions, including edit war(s).
0.3-0.4 | 11 | 7 | 2 | Most bot sessions were BBC or FE sessions.
0.4-0.5 | 6 | 4 | 10 | Half of the sessions were coded as unknown.
0.5-0.6 | 7 | 6 | 7 | A few bot sessions were SIM. One session simply had blank comments.
0.6-0.7 | 6 | 8 | 6 | More edits appeared human than bot. A number were still hard to identify as either and coded as unknown.
0.7-0.8 | 2 | 16 | 2 | Nearly all sessions were obviously human. A couple sessions appeared to be possible vandalism.
0.8-0.9 | 1 | 18 | 1 | 19/20 sessions were obviously human in behavior. One contained “(BOT)” at the beginning of the unstructured part of its comment.
0.9+ | 0 | 20 | 0 | All sessions were clearly human.

Recall: 0.1-0.2 stratum. 18 sessions were clearly from bots. 2 were hard to identify and coded as unknown. Of the bot sessions, half were coded as EBC (they contained “bot” in their revision comments) or FE. As with the previous stratum, a number of sessions were coded as BBC as well.

Recall: 0.2-0.3 stratum. 16 sessions were clearly from bots and 4 were clearly from humans. As with the previous strata, several EBC or FE sessions occurred. One of the human sessions appeared to be part of edit war(s): the contributor was restoring content on multiple pages. For example, related to the item for Kim Kardashian, the contributor stated in the comment of a restoring edit: “...I definitely don’t understand why [you] removed Kendall and Kylie, they’re her sisters”.

Recall: 0.3-0.4 stratum. As seen in Table 5.3, at this point, a fair number of human sessions began appearing (7 human, 11 bot, 2 unknown). Bot sessions were primarily BBC or FE.

Recall: 0.4-0.5 stratum. Often no clear indicators existed to code sessions as either bot or human in this stratum, and the result was a large increase in sessions coded as unknown (10). There were 4 human and 6 bot sessions.

Recall: 0.5-0.6 and 0.6-0.7 strata. At this point, model precision had decreased significantly. The number of unknown sessions had decreased from the previous stratum while human sessions increased. 13 sessions (cumulatively across the two strata) were from bots, 14 from humans, and 13 were coded as unknown.

Recall: 0.7-0.8 stratum. This represented a clear turn towards human sessions. As seen in Table 5.3, 16 sessions were clearly from humans, 2 from bots, and 2 were coded as unknown. One human editor appeared to be vandalizing pages, providing comments such as “cgyjcfgmfhuk”.

Recall: 0.8-0.9 and 0.9+ strata. In the remaining two strata, all sessions were clearly from human editors, except for one in the 0.8-0.9 stratum which was coded as EBC.


9.5 Study 4 Additional Details of Determining Contribution Strategy Types of Revisions

We detail how we partitioned our dataset based on contribution strategy type.

9.5.1 Identifying Revisions from Bots

To identify revisions coming from bots, we obtained the user ids of all current and former community-approved bots^34,35 and flagged any revisions in our dataset that had those ids. Additionally, we identified unapproved bots running on internal Wikimedia servers based on their IP addresses and flagged the corresponding revisions in our dataset as well.

9.5.2 Identifying Revisions from Semi-Automated Tools

Some semi-automated tools such as QuickStatements [160] and PetScan [161] often leave a clear trace in revision comments, but there is no complete list of comment traces that we could use to identify all semi-automated tools. To build such a corpus, we first parsed the revision comments of all item revisions^36 and computed a list of the 100 most common words found within them. Semi-automated tools use an exact substring to identify their activity, and such extreme consistency should show up at the top of a simple word frequency measurement. I, along with a collaborator, then compared this list with known Wikidata editing tools. We then used these words to create regular expressions to flag semi-automated tool edits from registered users based on their revision comments. Further, informal exploration let us identify a few more regular expressions that also match semi-automated tools. To further ensure the validity of our approach, we posted all our regular expressions on a Wikimedia Meta page^37 and asked Wikidata contributors to review our list via the Wikidata mailing list. We made changes based on contributor feedback and spot checked to ensure that our regular expressions correctly flagged tool edits. Finally, we also flagged registered user edits based on a list of revision comment regular expressions that had been used in previous work to denote semi-automated tool edits^38 [83] and a list of “change tags” indicative of tool edits.
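As an illustration only, a sketch of this flagging step is shown below. The comment-trace patterns are hypothetical stand-ins, not the community-reviewed list actually used in the study.

```python
import re
from collections import Counter

# Hypothetical comment-trace patterns, standing in for the reviewed list
# described above; they are illustrative, not the study's actual regexes.
TOOL_PATTERNS = [
    re.compile(r"#quickstatements", re.IGNORECASE),
    re.compile(r"\bpetscan\b", re.IGNORECASE),
]

def is_tool_edit(comment):
    """Flag a registered-user revision as semi-automated if its comment
    matches any known tool trace."""
    return bool(comment) and any(p.search(comment) for p in TOOL_PATTERNS)

def most_common_comment_words(comments, n=100):
    """The simple word-frequency step used to surface candidate tool traces."""
    counts = Counter(word for c in comments for word in c.lower().split())
    return [word for word, _ in counts.most_common(n)]
```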

9.5.3 Identifying Revisions from Manual Effort

All remaining edits not previously flagged were considered to come via direct manipulation of Wikidata’s user interface. There were two types: 1) registered manual contributions and 2) anonymous manual contributions.

34 https://quarry.wmflabs.org/query/19668

35 We removed one user id from this list after the owner of the id indicated the account was also being used for tool edits.

36 We also included property revisions. Properties define the structured data that can be applied to items. The vast majority of edits are to items.

37 https://lists.wikimedia.org/pipermail/wikidata/2017-September/011197.html

38 https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata%27s_Value/final_semi-automated_tool_edit_indicators

Anonymous edits were those associated with an IP address rather than a user id of a registered contributor.

9.6 Study 4 Additional Tables and Figures

Table 6.1: Content Edits Sampled from Each Contribution Strategy and Period

Period | Bots | Reg. Manual Contributors | Anon. Manual Contributors | Tools
May 2013 - April 2014 | 98979 | 74524 | 48935 | 99767
May 2014 - April 2015 | 98865 | 66843 | 68548 | 97366
May 2015 - April 2016 | 98686 | 70903 | 83208 | 97175
May 2016 - April 2017 | 99180 | 74752 | 79121 | 98624

Table 6.2: Content Edits Sampled for Male and Female Items

Period | Bots | Reg. Manual Contributors | Anon. Manual Contributors | Tools
May 2013 - April 2014 | 99548 | 80112 | 47022 | 99962
May 2014 - April 2015 | 99614 | 78274 | 77259 | 99458
May 2015 - April 2016 | 99706 | 81885 | 95343 | 99545
May 2016 - April 2017 | 99851 | 84343 | 89806 | 99286

Table 6.3: Content Edits Sampled for Global North and Global South Items

Period | Bots | Reg. Manual Contributors | Anon. Manual Contributors | Tools
May 2013 - April 2014 | 98561 | 77760 | 27216 | 86294
May 2014 - April 2015 | 96894 | 68125 | 22873 | 97519
May 2015 - April 2016 | 98198 | 67963 | 24821 | 96462
May 2016 - April 2017 | 99028 | 76829 | 36662 | 99553

Table 6.4: Content Edits Sampled for Urban and Rural Items

Period | Bots | Reg. Manual Contributors | Anon. Manual Contributors | Tools
May 2013 - April 2014 | 97459 | 76201 | 2740 | 9515
May 2014 - April 2015 | 95774 | 45659 | 1914 | 99500
May 2015 - April 2016 | 98978 | 67292 | 2187 | 97238
May 2016 - April 2017 | 94798 | 77306 | 2181 | 99705