
Controlled Analyses of Social Biases in Wikipedia Bios

Anjalie Field, [email protected], Carnegie Mellon University
Chan Young Park, [email protected], Carnegie Mellon University
Yulia Tsvetkov, [email protected], Carnegie Mellon University

ABSTRACT

Social biases on Wikipedia, a widely-read global platform, could greatly influence public opinion. While prior research has examined man/woman gender bias in biography articles, possible influences of confounding variables limit conclusions. In this work, we present a methodology for reducing the effects of confounding variables in analyses of Wikipedia biography pages. Given a target corpus for analysis (e.g. biography pages about women), we present a method for constructing a comparison corpus that matches the target corpus in as many attributes as possible, except the target attribute (e.g. the gender of the subject). We evaluate our methodology by developing metrics to measure how well the comparison corpus aligns with the target corpus. We then examine how articles about gender and racial minorities (cisgender women, non-binary people, transgender women, and transgender men; African American, Asian American, and Hispanic/Latinx American people) differ from other articles, including analyses driven by social theories like intersectionality. In addition to identifying suspect social biases, our results show that failing to control for confounding variables can result in different conclusions and mask biases. Our contributions include methodology that facilitates further analyses of bias in Wikipedia articles, findings that can aid Wikipedia editors in reducing biases, and a framework and evaluation metrics to guide future work in this area.

1 INTRODUCTION

Almost since its inception, Wikipedia has attracted the interest of researchers in various disciplines because of its unique community and departure from traditional encyclopedias [25, 27]. As a collaborative knowledge platform where anyone can create or edit pages, Wikipedia effectively crowd-sources information. This setup allows for fast and inexpensive dissemination of information, but it risks introducing social and cultural biases [27]. These biases are problematic—not just because they can influence readers, but also because Wikipedia has become a popular data source for computational models in natural language processing [33, 37], which are prone to absorbing and even amplifying data bias [7, 54]. In this work, we develop methodology to identify possible content biases on Wikipedia along racial and gender dimensions.

Concerns about quality and objectivity have existed almost since the inception of Wikipedia. However, prior computational work on social biases has focused primarily on binary gender, comparing coverage of men and women [1, 52]. Much of this work focuses on all articles about men and women, without considering how confounding variables may affect analyses. For example, there are more male athletes than female athletes on Wikipedia, so it is difficult to disentangle if differences between articles occur because women and men are presented differently, or because non-athletes and athletes are presented differently [18, 24, 52]. Existing methodology for this task primarily consists of incorporating confounders as explanatory variables in a regression model, which restricts analysis to regression models and requires explicitly enumerating all confounds [52].

In contrast, we develop a matching algorithm that enables analyzing different demographic groups while reducing the influence of confounding variables. Given a corpus of Wikipedia biography pages for people that contain target attributes (e.g. pages for cisgender women), our algorithm builds a matched comparison corpus of biography pages for people that do not (e.g. for cisgender men). The comparison corpus is constructed so that it closely matches the target corpus on all known attributes except the targeted one. Thus, examining differences between the two corpora can reveal content bias [53] related to the target attribute, while reducing the influence of possible confounding variables. We develop metrics to evaluate our methodology that measure how closely the comparison corpus matches the target corpus for simulated sets of target corpora.

We ultimately use this method to analyze biography pages that Wikipedia editors or readers may perceive as describing gender (cisgender women, non-binary people, transgender women, and transgender men) and racial (African American,1 Asian American, Hispanic/Latinx American) minorities [9]. We additionally intersect these dimensions and examine portrayals of African American women [14]. To the best of our knowledge, this is the first work to examine gender disparities on Wikipedia beyond cisgender women, the first large-scale analysis of racial disparities [1], and the first consideration of intersectionality in Wikipedia biography pages. We compare article lengths, section lengths, and edit counts on English Wikipedia, and also consider language availability and length statistics in other language editions.

1 We used "African American" rather than Black throughout this paper, because it is the primary keyword on Wikipedia and Wikidata that we used for data collection.

Our analysis reveals systemic differences in how these groups are portrayed. For example, articles about cisgender women tend to be shorter and available in fewer languages than articles about cisgender men, articles about Asian Americans tend to be shorter than articles about other Americans, and articles about African American women tend to be available in more languages than articles about comparable other American men, but in fewer languages than articles about comparable other American women. Identifying these types of disparities can help Wikipedia editors investigate and mitigate them, especially considering the large volume of Wikipedia data, which has inspired other work on automated methods to improve content [40, 45]. These analyses can also reveal stereotypes and biases in society, as imbalanced content on Wikipedia can be indicative of imbalances in society, rather than editor bias.

In §3 we present our matching methodology, which is based on pivot-slope TF-IDF weighted vectors, and several baselines. We then evaluate these methods using novel metrics and simulations defined in §3.2 and present results in §5. We finally present our gender and race analyses in §6 and §7. Overall, our work offers methodology and initial findings for uncovering content biases on Wikipedia, as well as a framework and evaluation metrics for future work in this area.

2 RELATED WORK

Examining social biases on Wikipedia is not a new area, but our work differs from existing analyses in several key ways. Most prior work focuses on gender, and on coverage bias, structural bias, or content bias. Coverage bias involves examining how likely notable men and women are to have Wikipedia articles, usually comparing against external databases like Encyclopedia Britannica [39, 51, 53]. While earlier work found that notable women are more likely to be missing on Wikipedia than notable men [39], more recent work has found the opposite [51, 53]. Coverage bias also involves examining how much information is present on Wikipedia, e.g. article length. On average, articles about women are longer than articles about men [18, 39, 51, 53]. Structural bias denotes differences in article meta-data and other properties that are not directly connected to article text, such as links between articles, diversity of sources cited, and number of edits made by contributors [51, 53]. Examinations of link structures have suggested the presence of structural bias against articles about women (e.g. all biography articles tend to link to articles about men more than women) [17, 51-53]. Finally, content bias considers how the article text itself differs between demographic groups. Analysis using methods such as latent variable models [3], lexicon counts, and pointwise mutual information (PMI) scores [18, 52] has suggested that pages for women discuss personal relationships more frequently than pages for men. In the past, research on biases in Wikipedia has drawn the attention of the editor community and led to changes on the platform [39], which could explain why similar studies sometimes have different findings; it also motivates our work.

However, many of these studies draw limited conclusions because of the difficulty of controlling for confounds in the data. This challenge is exemplified through word statistics. When computing PMI or log-likelihood scores to find words that are over-represented on pages for men as opposed to women, the most common words consist of sports terms: "football", "footballer", "baseball", "league" [18, 52]. This result is not necessarily indicative of bias; Wikipedia editors do not omit the football achievements of women. Instead, this imbalance results because in society and on Wikipedia, there are more male football players than female players.2 Thus, the difference in occupation, rather than the difference in gender, likely explains this imbalance. While some traits can be accounted for, e.g. by including occupation as an explanatory variable in a regression model [52], it is difficult to explicitly enumerate all possible confounders, and this approach limits analysis to particular models, e.g. regression.

2 Whether or not the lack of female football players in society is a sign of bias is beyond the scope of this paper.

Confounding variables also impact cross-lingual analyses of Wikipedia biographies. Most cross-lingual studies focus on "local heroes", where language editions tend to favor people whose nationality is affiliated with the language, in terms of article coverage, length, and visibility [10, 17, 21]. Cross-lingual investigations of gender bias reveal some disparities: in Russian and English Wikipedia, articles about men tend to be more central to the information network, but not in Spanish Wikipedia, and articles about women tend to contain more words associated with family, relationships, and gender than articles about men in multiple languages (especially English, Russian, and German) [28, 51]. [24] similarly find language differences in biographies about European Parliament members, but suggest that their findings are influenced by nationality and birth year more than by gender, demonstrating how confounding variables and the "local heroes" effect can complicate analysis.

Beyond "local heroes", language editions can have systemic differences due to differing readership and cultural norms. In a hypothetical example, an English article about a Bollywood actress might specify that Bollywood is a central point of Indian cinema, but such information would be superfluous in a corresponding Hindi article. The argument that these differences are beneficial, since language editions serve different readers [10, 21], is one of the motivations behind tools like Omnipedia [4] and Manypedia [31] that allow side-by-side comparisons of different language editions. However, in the context of social biases, these systemic differences can confound our research questions. For instance, do English biographies about women contain a higher percentage of words related to family than Spanish articles because there is greater gender bias in English [51]? Or because English articles generally discuss family more than Spanish articles, regardless of gender? We can partially alleviate this ambiguity by comparing biography pages of men and women in each language, but other factors may also be influential. For example, suppose our data set contains proportionally more female singers than male singers [28]. Do we see a difference between the English and Spanish editions because these editions treat gender differently or because they treat singers differently? These ambiguities limit the conclusions in [24] and [51].

Our work approaches the problem of confounding variables and systemic language edition differences through matching. For every page that aligns with our target attribute, we identify a "comparison" biography page, where the comparison page matches as nearly as possible to the target page on all attributes except the targeted one. We then identify possible biases in biography pages as differences between the target and comparison corpora. Matching is a common method for confounder control in observational causal inference studies [42, 48], and has recently been adopted in language analysis [11, 16, 26]. We follow Roberts et al. [41] in using direct matching, rather than propensity matching, in order to facilitate intuitive matches that can be manually assessed.

While our work focuses on analyzing biases in Wikipedia articles regardless of their origin, we briefly discuss possible sources of bias as motivation for our work: why might we expect to find bias on Wikipedia? One prominent possible source of bias is lack of diversity in the Wikipedia contributor community [27]. Surveys from 2008 suggest that the proportion of female contributors was as low as 16.1% (after correcting for survey biases) [23]. A 2011 survey found that the average Wikipedian is around 30 years old, male, computer-savvy, and lives in the U.S. or Europe, and that the active editor community is only 8.5% female.3 The lack of diversity possibly occurs because of unwelcoming practices in the editor community [15, 29]. However, given the general increase in representation of women in society and on Wikipedia, as well as the attention called to these disparities and efforts to improve diversity, editor demographics may have changed in the last decade [2, 13, 15, 28, 29, 34]. A second possible source of bias is the information sources that Wikipedia editors draw from. Wikipedia upholds a "no original research" policy, mandating that all articles must cite only secondary sources [30]. Bias in these secondary sources would then propagate to Wikipedia. Finally, bias on Wikipedia may be reflective of broader societal biases. For example, women may be portrayed as less powerful than men on Wikipedia because editors write imbalanced articles, because other coverage of women such as newspaper articles or traditional encyclopedia articles downplays their power, or because societal constraints prevent women from obtaining the same high-powered positions as men [1].

3 https://meta.wikimedia.org/wiki/Research:Wikipedia_Editors_Survey_2011_April

Finally, nearly all of the cited work focuses on men/women gender bias. Almost no computational work has examined bias in Wikipedia biographies at scale along other dimensions. While observed racial bias on Wikipedia, such as a lack of Black history, has prompted edit-a-thons to correct omissions,4 these dimensions have not been systematically examined in research. One notable exception, [1], examines both gender and racial biases in pages about sociologists. However, their focus is limited to sociologists at R1 institutions, and because of the small data size, they distinguish race only as white or non-white. Nevertheless, their analysis, which focuses on coverage bias, does show that non-white sociologists are less likely to have Wikipedia biography pages.

4 https://en.wikipedia.org/wiki/Racial_bias_on_Wikipedia

3 METHODOLOGY

3.1 Matching Methodology

In this work, we present a method for identifying a "comparison" biography page for every page that aligns with a target attribute, where the comparison page closely matches the target page on all known attributes except the target one. The concept of a comparison group originates in randomized clinical trials, in which participants in a study are randomly assigned to the "treatment group" or "control group" and the effectiveness of the treatment is measured as the difference in results between groups [42].5 In observational studies, when the treatment and outcomes have already occurred, researchers can replicate the conditions of a randomized trial by constructing a treatment group and a control group so that the distribution of covariates is as identical as possible between the two groups for all traits except the target attribute [42]. Then, by comparing the constructed treatment and control groups, researchers can isolate the effects of the target attribute from other confounding variables.

5 We use the terminology target/comparison instead of treatment/control in order to clarify that our work does not involve any actual "treatment".

In our case, our outcome variable is how individuals are portrayed in the contents of their Wikipedia articles. Then, if our target attribute is gender, the target group may consist of Wikipedia biography pages about women and the comparison group may consist of biography pages about men. The two groups would be constructed so that they have similar distributions of covariates that could be confounding variables, such as age, occupation, and nationality. We focus on characteristics directly listed on Wikipedia pages, as these are the ones we can assume Wikipedia editors would be aware of and thus may affect the target outcome.

We describe several baseline methods for constructing a comparison group for a given target group, and we then describe our proposed method and how it addresses the limitations of the baseline methods. Our proposed method uses TF-IDF vectors with a pivot-slope correction [44, 47], where the vectors are constructed from Wikipedia metadata categories.

Given a set of target articles T, our goal is to construct a set of comparison articles C from a set of candidates A, such that C has a similar covariate distribution as T for all covariates except the target attribute. For example, T may be the set of all biography pages about women, and A may be the set of all biography articles about men. We construct C using a 1:1 matching algorithm. For each t ∈ T, we identify c_best ∈ A that best matches t and add c_best to C. If t is about an American actress born in the 1970s, c_best may be about an American male actor born in the 1970s.

In order to identify c_best for a given t, we leverage the category metadata associated with each article. Wikipedia articles contain category tags that enumerate relevant traits. For example, the page for Steve Jobs includes the categories "Pixar people", "Directors of Apple Inc.", "American people of German descent", etc. While relying on the category tags could introduce some bias, as articles are not always categorized correctly or with equal detail, using this metadata allows us to focus on covariates that are likely to affect our target outcome: these categories are assigned by editors and clearly displayed on Wikipedia pages, and thus reflect the traits of individuals that possibly affect how their article is written. We cannot use the article text for matching, as we cannot disambiguate what aspects of the text result from the target attribute and what results from confounding variables.

We describe several possible metrics for comparing the categories of each candidate c ∈ A with t: 3 baseline methods, and our proposed method. Throughout this section, we use CAT(c) to denote the set of categories associated with c. For all metrics, we perform matching with replacement, thus allowing each chosen comparison article to match to multiple target articles.

Number of Categories (baseline). We choose c_best as the article with the largest number of categories in common with t. Intuitively, the person whose article has the largest number of overlapping categories has the most in common with the subject of t and thus is the best possible match. More formally,

    c_{best} = \arg\max_{c_i} |CAT(c_i) \cap CAT(t)|

Percent Categories (baseline). One drawback of matching simply on the raw number of categories is that this method favors articles with more categories. For example, a candidate c_i that has 30 categories is more likely to have more categories in common with t than a candidate c_j that only has 10 categories. However, c_i having more categories in common with t does not necessarily mean that c_i is a better match than c_j—it suggests that the article is better written, rather than suggesting that the person c_i describes has more traits in common with the person t describes than c_j. We can reduce this favoritism by normalizing the number of overlapping categories by the total number of categories in the candidate c_i. Thus, we choose:

    c_{best} = \arg\max_{c_i} \frac{|CAT(c_i) \cap CAT(t)|}{|CAT(c_i)|}
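To make these two baseline scores concrete, the following is a minimal sketch (our own illustration, not the authors' released code). Articles are represented as sets of category strings, and candidates is a hypothetical dict mapping article titles to their category sets:

    def num_categories_score(cand_cats, target_cats):
        # Number of Categories: raw count of shared category tags.
        return len(cand_cats & target_cats)

    def percent_categories_score(cand_cats, target_cats):
        # Percent Categories: normalize the overlap by the candidate's total
        # category count, so candidates with long category lists are not favored.
        return len(cand_cats & target_cats) / len(cand_cats)

    def best_match(target_cats, candidates, score_fn):
        # 1:1 matching with replacement: each target independently takes the
        # argmax candidate, so one candidate may match several targets.
        return max(candidates, key=lambda name: score_fn(candidates[name], target_cats))

For example, best_match({"Living people", "American women novelists"}, candidates, percent_categories_score) returns the candidate title with the highest normalized overlap.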

TF-IDF Weighting (baseline). Both of the prior methods assume that all categories are equally meaningful, but this is an oversimplification. For example, a candidate c_i that has the category "American short story writers" in common with t is more likely to be a good match than a candidate that has the category "Living People" in common with t. We adopt a TF-IDF weighting schema from information retrieval to up-weight categories that are less common [44]. We represent each candidate c_i ∈ A as a sparse category vector. Each element in the vector is a product of the frequency of the category in c_i (i.e. 1/|CAT(c_i)| if the category is in c_i, and 0 otherwise) and the inverse frequency of the category, i.e. 1/|category|. Thus, very common categories like "Living People" are down-weighted as compared to more specific categories. We similarly construct a vector representation of t. We then select c_best as the c_i with the highest cosine similarity between its vector and the vector for t.

Pivot-Slope TF-IDF Weighting (proposed). TF-IDF Weighting and Percent Categories both have a potential problem in that they include the normalization term 1/|CAT(c_i)|. While this term is intended to normalize for articles having different numbers of categories, in actuality, it over-corrects and causes the algorithm to favor articles with fewer categories. This issue has been observed in information retrieval: using TF-IDF weighting to retrieve relevant documents causes shorter documents to have a higher probability of being retrieved [47]. In order to correct this, we adopt the pivot-slope normalization mechanism from [47]. In this method, instead of normalizing the TF-IDF term by |CAT(c_i)|, the term is normalized with an adjusted value:

    (1.0 - slope) \cdot pivot + slope \cdot |CAT(c_i)|

The Pivot-Slope TF-IDF Weighting approach requires setting two parameters, the slope and the pivot, that control the strength of the normalization adjustment. Following the recommendations in [47], the pivot is set to the average number of categories across all articles in the data set, and the slope is tuned over a development set (described in §4).
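A minimal sketch of the proposed matcher (again our own illustration): we use dense NumPy vectors for brevity, and assume idf maps each category to a precomputed inverse-frequency weight and vocab_index maps each category to a vector position:

    import numpy as np

    def category_vector(cats, idf, vocab_index, pivot, slope):
        # Pivot-slope normalized TF-IDF vector over the category vocabulary:
        # the raw length normalizer |CAT(c)| is replaced by the adjusted value
        # (1 - slope) * pivot + slope * |CAT(c)|.
        norm = (1.0 - slope) * pivot + slope * len(cats)
        vec = np.zeros(len(vocab_index))
        for cat in cats:
            vec[vocab_index[cat]] = idf[cat] / norm
        return vec

    def cosine_similarity(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    def pivot_slope_match(target_cats, candidates, idf, vocab_index, pivot, slope=0.3):
        # candidates: dict mapping article titles to category sets; returns the
        # candidate most cosine-similar to the target. slope=0.3 mirrors the
        # value tuned in §4; the pivot should be the corpus-average category count.
        t_vec = category_vector(target_cats, idf, vocab_index, pivot, slope)
        return max(candidates, key=lambda name: cosine_similarity(
            category_vector(candidates[name], idf, vocab_index, pivot, slope), t_vec))

A sparse-vector implementation would be preferable at the scale of the full corpus; the dense version above only illustrates the weighting scheme.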
3.2 Evaluation Framework

Our ultimate goal in identifying matches is to control for possible confounds in the text. Thus, for a given set of target articles, the optimal matching algorithm would produce a matched comparison set that has identical traits as the target set for all attributes, except the one being measured. In order to assess the effectiveness of each matching metric, we devised evaluation schemes that examine how well the algorithm creates comparison groups with similar attribute distributions as target groups.

We first construct simulated target sets. We then run each matching algorithm to identify a matched comparison set for each target set, and we use several metrics to examine how closely the comparison set matches the target set. We cannot evaluate in an attribute-specific setting, such as setting the target group as biographies about women and the comparison group as biographies about men, because if we identify differences between the target and comparison groups, we cannot determine if these differences are signs of gender bias or are inaccuracies in the matching algorithm. Instead, in these simulated target sets, we do not fix a target attribute, and thus we expect a high-quality matching algorithm to identify a comparison set that matches very closely to the target set. We use two methods to construct simulated target sets:

Article-Sampling. We randomly sample 1000 articles.

Category-Sampling. We randomly sample one category that has at least 500 members. We then sample 500 articles from the category. We do not expect there to be any bias towards a single category, since most categories are very specific, e.g. "Players of American football from Pennsylvania". While articles for football players might have different characteristics than other articles, we would not expect articles for players from Pennsylvania to be substantially different than articles for players from New York or New Jersey. However, this setup does more closely replicate the intended analysis setting than random sampling, as we ensure that all people in the target group have a common trait.

We then use several metrics to assess how well-matched the target and comparison groups are:

Average bias. Standardized bias is the typical method used to evaluate covariate balance in matching methods [20]. For a given covariate, the metric is calculated by taking the difference in means between the treatment and control groups and dividing by the standard deviation in the treatment group. In our case, we treat each category as a binary covariate that can be present or absent for each article. We then compute the standardized bias for each category and average across all categories. Since some categories appear only in the target group and some appear only in the comparison group, we compute this metric in two directions: for all the categories that appear in the target group ("Avg. Bias") and for all the categories that appear in the comparison group ("Avg. Bias 2"). High average bias suggests that the distribution of categories is very different between the target and the comparison groups.

Number of Categories. As discussed in §3, one of the concerns with the described methods is that they may favor articles with more or fewer categories. Thus, we compare the number of categories in the target group with the number of categories in the comparison group using Cohen's d, which measures effect size as the difference in means between two groups divided by the pooled standard deviation. A high value indicates that the two groups have different numbers of categories per article.

Text Length. The prior two metrics focus on the category level. However, we use categories as a proxy to control for confounds in the text, and thus we ultimately seek to assess how well our matching methods control for differences in the actual article text. We first compare article lengths (word counts) using Cohen's d.

Polar Log-odds. We then compare how article vocabularies differ by computing log-odds with a Dirichlet prior [32], which measures to what extent words are overrepresented in one corpus as compared to another. A high log-odds score indicates that a word is much more likely to appear in one corpus than the other. We compute log-odds between the target group and comparison group. We then take the absolute value of all log-odds scores, and compute the mean and standard deviation for the 200 most polar words. High log-odds polarities indicate dissimilar vocabulary between groups.
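The two covariate-balance metrics above can be sketched as follows (our own illustration; target_cat_sets and comp_cat_sets are assumed to be lists of per-article category sets, aligned arbitrarily):

    import numpy as np

    def standardized_bias(target_vals, comp_vals):
        # Difference in means divided by the standard deviation
        # in the target (treatment) group.
        sd = np.std(target_vals)
        return (np.mean(target_vals) - np.mean(comp_vals)) / sd if sd > 0 else 0.0

    def avg_bias(target_cat_sets, comp_cat_sets, categories):
        # Treat each category as a binary covariate per article and average
        # the absolute standardized bias over `categories` (e.g. all
        # categories appearing in the target group, for "Avg. Bias").
        biases = []
        for cat in categories:
            t = np.array([cat in cats for cats in target_cat_sets], dtype=float)
            c = np.array([cat in cats for cats in comp_cat_sets], dtype=float)
            biases.append(abs(standardized_bias(t, c)))
        return float(np.mean(biases))

    def cohens_d(x, y):
        # Effect size: difference in means over the pooled standard deviation,
        # used for the Number of Categories and Text Length metrics.
        nx, ny = len(x), len(y)
        pooled = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                         / (nx + ny - 2))
        return (np.mean(x) - np.mean(y)) / pooled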

[Figure 1: two grouped bar charts comparing the Random, Number, Percent, TF-IDF, and Pivot-Slope TF-IDF methods on Num. of Cats, Text Length, Polar log-odds (mean and std), Avg. Bias, Avg. Bias 2, KL 1, and KL 2.]

Figure 1: Evaluation of matching methods using article-sampling to select 1000-page target groups, averaged over 100 simulations (divided into two figures for readability). Lower scores indicate better matches; Pivot-Slope TF-IDF performs the best.

[Figure 2: two grouped bar charts comparing the same methods and metrics as Figure 1, under category-sampling.]

Figure 2: Evaluation of matching methods using category-sampling to select 500 pages as the target group, repeated 100 times.

KL Divergence. Finally, rather than examining just word-level differences, we use a topic model to examine topic-level differences. We train an LDA model with 100 topics across all articles in the corpus [5]. After running the matching algorithm, we average the topic vectors of the articles in the comparison group and in the target group, using 1/1000 additive smoothing to avoid having 0 probabilities for any topic, and then normalize these vectors into valid probability distributions. Thus, we obtain a topic probability distribution vector for the target group and for the comparison group. We then compute the KL-divergence between these two vectors. Since KL-divergence is not symmetric, we compute it in both the target-comparison ("KL") and the comparison-target ("KL 2") directions.

4 DATA

We gathered a corpus of Wikipedia biography pages by collecting all articles with the category "Living people" in March 2020. We discarded articles that had < 2 categories, < 100 tokens, or were marked as stubs, indicated by the presence of a stub category like "Actor stubs". In matching the articles, we ignore any categories that are focused on traits of the Wikipedia article rather than traits of the person, using a heuristically-defined list that includes categories containing the words "Use Indian English", "Pages with", "Contains Links", etc. After filtering, the data set that we use for evaluation simulations contains 444,045 pages. On average, pages contain 9.3 categories (after excluding categories as described above) and 628.2 tokens.

As described in §3, in constructing pivot-slope TF-IDF vectors, we set the pivot to 9.3 (the average number of categories). We constructed two development sets for tuning the slope: first, we sampled a fixed set of 1000 biography pages, and second, we sampled a fixed set of 10 categories and sampled 500 people from each category. We tested slope values between 0 and 1 in 0.1 increments and selected the value that minimized the difference between the target and comparison sets using the metrics described in §3.2. We fixed the slope to 0.3. We excluded the biography pages used for tuning when constructing the simulated sets for evaluation. We caution that tuning the slope parameter is an important step in using our algorithm, as changing the parameter does change the selected matches. We use the same data set and parameter settings for our analysis of gender and racial biases (described in §7).
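A sketch of the filtering heuristics described above (illustrative only; the marker list is abbreviated from the paper's heuristically-defined list, and testing for a "stubs" suffix is our assumption about how stub categories are detected):

    ARTICLE_TRAIT_MARKERS = ("Use Indian English", "Pages with", "Contains Links")

    def keep_page(categories, num_tokens):
        # Discard pages with < 2 categories, < 100 tokens, or a stub
        # category (e.g. "Actor stubs").
        if len(categories) < 2 or num_tokens < 100:
            return False
        return not any(c.endswith("stubs") for c in categories)

    def person_categories(categories):
        # Ignore categories describing the Wikipedia article itself
        # rather than traits of the person.
        return {c for c in categories
                if not any(m in c for m in ARTICLE_TRAIT_MARKERS)}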
5 RESULTS

We evaluate each method using 100 article-sampling and 100 category-sampling simulations. For each simulation, we construct a synthetic target group, and then we use the chosen method to identify a matched comparison group. We report results averaged over the 100 simulations. In addition to the described matching algorithms, we show the results of randomly sampling a comparison group that has the same number of people as the target group.

Figures 1 and 2 report results. All of the evaluation metrics measure differences between the target group and the comparison group, meaning a lower value indicates the comparison group is a better match. In the category-sampling simulations (Figure 2), which better simulate having a target group with a particular trait in common, all matching methods perform better than random sampling, and the Pivot-Slope TF-IDF method performs the best overall. In the article-sampling simulations (Figure 1), random sampling provides a strong baseline. This is unsurprising, as two randomly chosen groups of 1000 articles are unlikely to differ significantly from each other. Nevertheless, the Pivot-Slope TF-IDF method outperforms random sampling on the text-based metrics (polar log-odds and KL divergence) as well as on average bias.

The Number of Categories (Num. of Cats) and Text Length metrics do show possible biases of these methods. As expected, the Number of Categories, Percent Categories, and TF-IDF Weighting matching methods all exhibit bias towards articles with more or fewer categories, which results in worse performance than random sampling over these two metrics in Figures 1 and 2.
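As a concrete reference for the topic-level metric reported as "KL" and "KL 2" in Figures 1 and 2, a minimal sketch of the computation defined in §3.2 (our own illustration; per-article topic vectors from the 100-topic LDA model are assumed to be given as rows of a NumPy array):

    import numpy as np

    def group_topic_distribution(doc_topic_rows, smoothing=1.0 / 1000):
        # Average the per-article topic vectors for one group, apply 1/1000
        # additive smoothing, and renormalize into a probability distribution.
        avg = doc_topic_rows.mean(axis=0) + smoothing
        return avg / avg.sum()

    def kl_divergence(p, q):
        return float(np.sum(p * np.log(p / q)))

    # KL-divergence is asymmetric, so it is computed in both directions:
    # kl_divergence(target_dist, comp_dist)  -> "KL"
    # kl_divergence(comp_dist, target_dist)  -> "KL 2"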

In the Number of Categories matching method, this effect is positive, indicating that articles in the comparison group tend to have more categories, while in the Percent Categories and TF-IDF Weighting methods the effect is negative (Figures 1 and 2 report absolute values). These differences are also reflected in text length, in that articles with more categories also tend to be longer. In the category-sampling evaluation, the pivot-slope normalization corrects for this length bias and demonstrates better-than-random matches. In the article-sampling evaluation, while the pivot-slope normalization does outperform the other metrics, the random method exhibits the least category-number and text-length bias. However, as mentioned, random sampling is a strong baseline in this setting.

In Tables 1-4, we provide examples of matched comparison articles for a set of sample people. These examples illustrate some of the trends in Figures 1 and 2. Notably, the Percent Categories and TF-IDF Weighting methods prefer short comparison articles with few categories, even when they are not particularly meaningful. Both the Number of Categories and the Pivot-Slope TF-IDF methods produce meaningful pairs. However, the TF-IDF weighting upweights more specific categories. The Number of Categories method matches Barack Obama to Michael Moore based on broad categories like "American people of Irish descent"; the Pivot-Slope TF-IDF weighting matches Barack Obama to Roland Burris based on more specific categories, like "United States senators from Illinois". In the Number of Categories method, the match between T-Pain and Pharrell Williams seems adept, while in the Pivot-Slope TF-IDF weighting method, Yuna Kim and Mao Asada is a particularly accurate pairing, as these two figure skaters are well-known for their rivalry.

Table 1: Matches and common categories obtained using the Number of Categories method.
  T-Pain, Pharrell Williams: American music industry executives, 21st-century American rappers, American male singers, African-American male singers, Grammy Award winners for rap music, 21st-century American singers, American hip hop singers, African-American record producers, American hip hop record producers, Southern hip hop musicians, African-American male rappers, American contemporary R&B singers
  Barack Obama, Michael Moore: Male feminists, HuffPost writers and columnists, 21st-century American male writers, LGBT rights activists from the United States, American gun control activists, American people of English descent, American male non-fiction writers, American political writers, American people of Irish descent, 21st-century American non-fiction writers, 20th-century American male writers, 20th-century American non-fiction writers, American people of Scottish descent
  Yuna Kim, Tessa Virtue: Olympic medalists in figure skating, Medalists at the 2010 Winter Olympics, Figure skaters at the 2010 Winter Olympics, World Junior Figure Skating Championships medalists, Medalists at the 2014 Winter Olympics, Figure skaters at the 2014 Winter Olympics, World Figure Skating Championships medalists, Four Continents Figure Skating Championships medalists, Season-end world number one figure skaters
  Amitabh Bachchan, S. P. Balasubrahmanyam: 20th-century Indian singers, Filmfare Awards winners, Bollywood playback singers, Recipients of the Padma Shri in arts, Indian male voice actors, 21st-century Indian male actors, Indian male film actors, Indian male film singers, 21st-century Indian singers, Indian male singers, 20th-century Indian male actors, Indian television presenters, Recipients of the Padma Bhushan in arts
  Tim Cook, Bob Iger: American chief operating officers, Biography with signature, 20th-century American businesspeople, Directors of Apple Inc., 21st-century American businesspeople
  Ron Berger (professor), Yukiko Iwai (singer): 1968 births

Table 2: Matches obtained using Percent Categories.
  T-Pain, Tay Dizm: Musicians from Tallahassee, Florida, Singers from Florida, 21st-century American singers, 21st-century male singers, Jive Records artists, African-American male rappers, RCA Records artists, 21st-century American rappers, Rappers from Florida
  Barack Obama, Robert Moffit: American male non-fiction writers, American political writers
  Yuna Kim, Park Solhee: 1990 births, South Korean writers
  Amitabh Bachchan, Kapil Jhaveri: Male actors in Hindi cinema, Indian male film actors
  Tim Cook, Joe Fuca: American technology chief executives, 21st-century American businesspeople
  Ron Berger (professor), Jean-Christophe Valtat: 1968 births

Table 3: Matches obtained using TF-IDF.
  T-Pain, Vexxed: Twitch streamers
  Barack Obama, JJonak: Use American English from August 2018
  Yuna Kim, Tommy Chang (martial artist): South Korean expatriates in Canada
  Amitabh Bachchan, Kishore Lulla: Film producers from Mumbai, Hindi film producers
  Tim Cook, Stephen Austin (American football): National Football League executives
  Ron Berger (professor), Sheldon Hall (film historian): Academics of Sheffield Hallam University

Table 4: Matches obtained using Pivot-Slope TF-IDF.
  T-Pain, Tay Dizm: 21st-century American rappers, Jive Records artists, 21st-century American singers, Rappers from Florida, 21st-century male singers, Musicians from Tallahassee, Florida, Singers from Florida, RCA Records artists, African-American male rappers
  Barack Obama, Roland Burris: Illinois Democrats, United States senators from Illinois, African-American United States senators, Democratic Party United States senators, 21st-century American politicians, Politicians from Chicago, African-American people in Illinois politics
  Yuna Kim, Mao Asada: Olympic medalists in figure skating, 1990 births, Medalists at the 2010 Winter Olympics, Figure skaters at the 2014 Winter Olympics, World Figure Skating Championships medalists, Four Continents Figure Skating Championships medalists, Season-end world number one figure skaters, Figure skaters at the 2010 Winter Olympics, World Junior Figure Skating Championships medalists
  Amitabh Bachchan, Dilip Kumar: Male actors in Hindi cinema, Indian male voice actors, Recipients of the Padma Bhushan in arts, Indian actor-politicians, Indian male film actors, Male actors from Mumbai, 20th-century Indian male actors, Filmfare Awards winners, Recipients of the Padma Vibhushan in arts, Dadasaheb Phalke Award recipients, Biography with signature, Film producers from Mumbai
  Tim Cook, Ellen Hancock: American technology chief executives, Apple Inc. executives, IBM employees, American chief operating officers
  Ron Berger (professor), Liu Mingkang: Alumni of Cass Business School

6 ANALYSIS METHODOLOGY

We use our matching method to facilitate analyses of Wikipedia articles along different dimensions, including examining possible biases and content gaps for gender and racial minorities. We first describe how we identify target articles and then present results.

6.1 Data

Our primary data for analysis is biography pages in the Living People category of Wikipedia (444,045 pages, after the filtering and preprocessing described in §4). Table 5 reports the final data set sizes for each target and candidate comparison group. We describe the processing for determining biography pages for people of different genders and races below.

Identity categories like race and gender are fluid concepts that are difficult to operationalize and whose definitions depend on social context [9, 19]. As our focus is on Wikipedia, and we aim to identify how Wikipedia articles differ when editors and readers associate their subjects with different genders and races, we derive race and gender categories directly from Wikipedia articles and associated metadata. Our goal is to identify the observed gender and race of individuals as perceived by editors who assigned article meta-data or readers who may view them, as opposed to assuming absolute ground-truth characteristics of any individual [43].
In order to determine the observed gender of people in our corpus, we primarily rely on Wikidata—a crowd-sourced database of structured information corresponding to Wikipedia pages [50]. We identify pages for people of 5 genders (transgender men, transgender women, non-binary people, cisgender women, and cisgender men) as follows:

Transgender men/women. Articles whose Wikidata entry contains the Q_TRANSGENDER_MALE/_FEMALE property value.

Non-binary, gender-fluid, or gender-queer (termed "non-binary"). Articles whose Wikidata entry contains the Q_NONBINARY property value or Q_GENDER_FLUID property value. Additionally, Wikipedia pages with a category containing "-binary" or "Genderqueer". We found that some pages with non-binary gender indicators had binary gender properties in Wikidata, and thus we also use categories to identify pages for non-binary people.6

Cisgender (cis.) women. Articles whose Wikidata entry contains the Q_FEMALE property value. Also, pages for which we did not identify a Wikidata entry nor a category indicative of non-binary gender, but that contained more she/her than he/him pronouns.

Cisgender (cis.) men. Articles whose Wikidata entry contains the Q_MALE property value. Also, pages for which we did not identify a Wikidata entry nor a category indicative of non-binary gender, but that contained more he/him than she/her pronouns.

6 For example, https://en.wikipedia.org/wiki/Laganja_Estranja, "they identify as non-binary and does not have any preferred pronouns."; Wikidata gives gender as male.

Thus, for 38,955 articles for which we could not identify Wikidata entries nor any indication of non-binary gender, we assigned these pages as cis. man or cis. woman based on the most common pronouns used in the page. We found that for pages that did have Wikidata entries, pronoun-inferred gender aligned with their Wikidata gender in 98.0% of cases.7 We use cis. men as the comparison group, and separately run the matching algorithm to identify matches from this group for each other gender. We exclude cis. men with categories containing the keyword "LGBT" when matching transgender and non-binary people to them.

7 We acknowledge that there can be errors in Wikidata [22].
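The pronoun fallback can be sketched as follows (our own illustration; the paper does not specify its tokenization, so the regular expressions are assumptions):

    import re

    SHE_HER = re.compile(r"\b(she|her)\b", re.IGNORECASE)
    HE_HIM = re.compile(r"\b(he|him)\b", re.IGNORECASE)

    def pronoun_inferred_gender(article_text):
        # Fallback for pages with no Wikidata entry and no category indicative
        # of non-binary gender: assign cis. woman/man by the majority pronouns.
        n_she = len(SHE_HER.findall(article_text))
        n_he = len(HE_HIM.findall(article_text))
        if n_she == n_he:
            return None  # no majority; leave the page unassigned
        return "cis. woman" if n_she > n_he else "cis. man"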
We includes ones like “African-American Catholics”, “American aca- identify pages for people of 5 genders (transgender men, transgen- demics of Mexican descent”, and excludes “American expatriates in der women, non-binary people, cisgender women, and cisgender Hong Kong”, “Asian of American descent”. We thus leverage our fo- men) as follows: cus on articles about American people and derive categories where Transgender men/women. Articles whose Wikidata entry con- “American” indicates nationality and other country names indicate tains the Q_TRANSGENDER_MALE/_FEMALE property value. background or ethnicity. We acknowledge that our use of the term “race” is an oversimplification of the distinctions between articles Non-binary, gender-fluid, or gender-queer (termed “non-binary”). in our corpus. Nevertheless, we believe it reflects the perceptions Articles whose Wikidata entry contains the Q_NONBINARY prop- that readers and writers of Wikipedia may hold. erty value or Q_GENDER_FLUID property value. Additionally, We validate our category-based approach by comparing the set Wikipedia pages with a category containing "-binary" or "Gen- of articles that we identify as describing African American people derqueer". We found that some pages with non-binary gender indi- with the Wikidata “ethnic group” property. We cannot use this cators had binary gender properties in Wikidata, and thus we also property for all target groups, as it was only populated for 3.4% of use categories to identify pages for non-binary people.6 articles in our data set and is largely unused for relevant values other than “African American” (9 pages contained the “Hispanic Cisgender (cis.) women. Articles whose Wikidata entry contain and Latino American” property; 5 pages contained “Latino”; 21 Q_FEMALE property value. Also, pages for which we did not iden- pages contained “White”). Of the 9,668 people that our category tify a Wikidata entry nor a category indicative of non-binary gender, method identifies as African American, we were able to match but that contained more she/her than he/him pronouns. 7We acknowledge that there can be errors in Wikidata [22] Cisgender (cis.) men. Articles whose Wikidata entry contained 8https://unstats.un.org/unsd/demographic/sconcerns/popchar/popcharmethods.htm Q_MALE property value. Also, pages for which we did not identify 9https://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf 10https://www.census.gov/mso/www/training/pdf/race-ethnicity-onepager.pdf 11https://www.pewresearch.org/fact-tank/2015/06/15/is-being-hispanic-a-matter-of- 6For example, https://en.wikipedia.org/wiki/Laganja_Estranja, “they identify as non- race-ethnicity-or-both/ binary and does not have any preferred pronouns.”; Wikidata gives gender as male 12We identify country lists from worldometers.info Anjalie Field, Chan Young Park, and Yulia Tsvetkov

We validate our category-based approach by comparing the set of articles that we identify as describing African American people with the Wikidata "ethnic group" property. We cannot use this property for all target groups, as it was only populated for 3.4% of articles in our data set and is largely unused for relevant values other than "African American" (9 pages contained the "Hispanic and Latino American" property; 5 pages contained "Latino"; 21 pages contained "White"). Of the 9,668 people that our category method identifies as African American, we were able to match 5,776 of them to ethnicity information in Wikidata. Of these 5,776 pages, our method exhibited precision of 98.5%, in that 5,693 pages contained the African American Wikidata property. Our method showed 69.0% recall, in that we recovered 5,693 of the 8,375 pages that have the African American Wikidata property. In our analysis, precision is more important than recall, as low recall implies we are analyzing a smaller data set than we may have otherwise, while low precision implies we are analyzing the wrong data.

Table 5: Data set sizes for analysis corpora. The "Final" column indicates the target/comparison sizes after matching.

    Group                             Pre-match   Final
    African American                  9,668       8,405
    Asian American                    4,792       3,516
    Hispanic/Latinx American          4,480       3,811
    Unmarked American (comparison)    93,486      -
    Non-Binary                        198         132
    Cisgender women                   106,586     66,582
    Transgender women                 261         138
    Transgender men                   85          54
    Cisgender men (comparison)        331,484     -

Finally, a natural choice for comparisons would be articles about white/Caucasian Americans. However, in identifying these articles, we encountered the obstacle of "markedness": while articles about racial minorities are often explicitly marked as such, whiteness is assumed so generic and defined primarily in contrast to non-white identities that it is rarely explicitly marked [8, 49]. We see this in our data: an article about an African-American politician may have the category "African-American United States senators", whereas an article about a white American politician has the category "Governors of Texas" as opposed to "White Governors of Texas" (Barack Obama vs. George W. Bush). Thus, we draw from the theory that markedness is itself a social indicator, and we define our candidate comparison articles as ones that are "unmarked" as racial minorities. More specifically, we selected all pages that contain a category with the word "American", but do not contain a category indicative of a racial minority group (including categories referring to the Middle East, Native Americans, and Pacific Islanders) nor a Wikidata entry indicative of a racial minority. We additionally exclude football players and basketball players, as we found that these articles were very often unmarked, even for racial minorities.13 In manually reviewing the comparison corpus, based on pictures and information sources outside Wikipedia, we estimated that 90% of the corpus consists of Caucasian/white people. In practice, the 10% of non-white pages are often the ones selected during the matching algorithm.
Thus, as our analysis focuses on unmarked/marked articles, it likely underestimates the differences between how people of color and white people are portrayed on Wikipedia (since there are some people of color included in our "unmarked" corpus).

Table 5 presents the data set sizes. After matching, we excluded pairs from analysis if they contained < 2 categories in common.14 The rightmost column in Table 5 reflects sizes after this exclusion.

13 Articles about jazz musicians were also less-marked, though we did not exclude them. We suggest investigation of markedness on Wikipedia as an area for future work.
14 When counting common categories, we excluded categories that contained "births" or "alumn", as we found that "Alumni from X" and "[year] births" were extremely common categories that we considered too broad to be meaningful.

6.2 Analysis Dimensions

We compute several metrics to compare the target and comparison corpora. We describe them here, and then demonstrate how they can be used to identify content gaps, biases, and other areas for editing improvements. Due to space limitations, we discuss a subset of results and provide complete metrics at [ANONYMIZED LINK].

• Summary statistics: Using the English articles, we compute the average article length, the number of languages the articles are available in, and the number of edits. We compare target and comparison metrics using a paired t-test.

• Log-odds scores: We identify words that are over-represented in the target or comparison groups using log-odds [32].

• Per-Language availability: For all languages in which at least a threshold number t of target or comparison articles are available, we compare what percentage of target vs. comparison articles are available in the language (e.g. are target or comparison articles more likely to be available in German?). We compute significance using McNemar's test with a Benjamini-Hochberg multiple hypothesis correction (sketched below).15

• Normalized Section Lengths: For all second-level sections that at least t target or comparison articles contain, we compare, on average, what percentage of each article is devoted to that section (number of tokens in the section / total number of tokens in the article, averaged across the target/comparison group). We compute significance using a paired t-test with a Benjamini-Hochberg multiple hypothesis correction.

• Multilingual Differences: For the top 10 most edited languages on Wikipedia (English, French, Arabic, Russian, Japanese, Italian, Spanish, German, Portuguese, and Chinese), for all target-comparison pairs where both members of the pair are available in the language, we compare article lengths and normalized section lengths as described above.

15 We set t for language/section analysis as follows: all race/ethnicity (100), intersectional (50), women (500), transgender men/women (20), non-binary (50). We do not believe that these thresholds affect results, since they primarily exclude results that are not statistically significant, and we use them for convenience of output.
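As referenced in the Per-Language availability item above, the significance computation can be sketched as follows (our own illustration, assuming per-pair availability flags have already been collected; the helper and argument names are hypothetical):

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar
    from statsmodels.stats.multitest import multipletests

    def per_language_significance(availability, languages, alpha=0.05):
        # availability[lang]: list of (target_available, comparison_available)
        # boolean pairs, one per matched target-comparison pair.
        pvals = []
        for lang in languages:
            pairs = availability[lang]
            both = sum(1 for t, c in pairs if t and c)
            only_t = sum(1 for t, c in pairs if t and not c)
            only_c = sum(1 for t, c in pairs if c and not t)
            neither = sum(1 for t, c in pairs if not t and not c)
            # McNemar's test on the paired 2x2 table (driven by discordant cells).
            table = np.array([[both, only_t], [only_c, neither]])
            pvals.append(mcnemar(table, exact=False).pvalue)
        # Benjamini-Hochberg correction across all tested languages.
        reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
        return {lang: (p, sig) for lang, p, sig in zip(languages, corrected, reject)}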

7 ANALYSIS RESULTS

7.1 Matching Reduces Data Confounds

First, we revisit our motivation in this work by considering how results differ when matching is not used. Table 6 presents the 10 words that are most associated with biography pages about cisgender men and women, calculated using log-odds with a Dirichlet prior [32]. As shown in previous work, without matching, words highly associated with men include many sports terms, such as "season" and "League", which suggests that directly comparing these biographies could capture athlete/non-athlete differences rather than man/woman differences. After matching, these sports terms are no longer present in the top 10 most polar log-odds terms. Instead, they are replaced by overtly gendered terms like "himself" and "wife", which suggests that matching helps isolate gender as the primary variable of difference between the two corpora.
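Since the log-odds statistic recurs throughout our analyses, a minimal sketch of the computation (our own implementation of the Monroe et al. estimator cited as [32]; using combined-corpus counts as the prior is the standard choice, but an assumption here):

    import numpy as np
    from collections import Counter

    def log_odds_with_prior(counts_a, counts_b, prior):
        # Log-odds with an informative Dirichlet prior: returns a z-score per
        # word; large positive values are a-associated, large negative values
        # are b-associated.
        n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
        a0 = sum(prior.values())
        scores = {}
        for w, a_w in prior.items():
            y_a, y_b = counts_a[w], counts_b[w]
            delta = (np.log((y_a + a_w) / (n_a + a0 - y_a - a_w))
                     - np.log((y_b + a_w) / (n_b + a0 - y_b - a_w)))
            variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
            scores[w] = delta / np.sqrt(variance)
        return scores

    # counts_a, counts_b: collections.Counter over the tokens of each corpus;
    # prior: Counter over the combined corpora (so every scored word has a_w > 0).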

Beyond the top 10 log-odds scores, sports terms do occur, but they tend to be more specific and represented on both sides; for example, "WTA" is women-associated and "NBA" is men-associated.

Table 6: Log-odds scores between cisgender men's and women's pages after matching and without matching. Words are ordered most-to-least polar from top to bottom. Matching reduces sports terms (bold in the original) in favor of overtly gendered terms.

              Matched               Unmatched
    Men       Women       Men       Women
    he/He     her/Her     he/He     her/Her
    his/His   she/She     his/His   she/She
    him       Women       season    Women
    himself   women       him       women
    wife      husband     League    actress
    Park      herself     club      husband
    Men       woman       against   female
    chairman  female      games     Miss

In Table 7 we show the results of comparing article lengths between racial subgroups and the entire candidate comparison group, rather than comparing with the subset of matched articles. Table 7 suggests that articles for all racial subgroups are significantly longer than comparison articles. Thus, without matching, we would find that people from traditionally underrepresented races tend to have more detailed articles than people whose race is not specified. However, in Table 8, we show article length differences after matching. In this table, we do not find significant differences between matched comparison articles and articles about African American and Hispanic/Latinx American people. Instead, we find that articles about Asian American people are typically shorter than comparison articles. Thus, matching suggests that the differences observed in Table 7 occur because of differences in confounding variables between groups, rather than reflecting racial disparities.

Table 7: Average article lengths without matching. All target sets appear significantly longer than comparisons.

    Group                       Target   Comparison   p-value
    African American            902.0    711.4        6.81E-87
    Asian American              741.3    711.4        0.0198
    Hispanic/Latinx American    972.5    711.4        9.78E-82

7.2 Metrics Reveal Content Imbalances

We present some of the high-level statistics, including article lengths, number of edits, and article age, in Table 8. While prior work has examined similar statistics between articles about (assumedly cisgender) men and women, our work includes non-binary and transgender people, as well as racial subgroups, and we use pivot-slope TF-IDF matching to reduce the influence of confounding variables. High-level statistics can identify possible content gaps in Wikipedia and sets of articles that may benefit from additional editing.

The leftmost columns in Table 8 show that articles about Asian Americans and cis. women tend to be shorter than comparison articles. Shorter articles can indicate that articles were written less carefully, e.g. an editor may have spent less time researching the person and uncovering details to include in the article. They can also indicate that information is less carefully presented—in manually comparing articles with their matches, we often found that information in Asian American people's articles was displayed in lists or tables, whereas information in comparison articles was presented in descriptive paragraphs. Further, differences on Wikipedia could reflect external information disparities—biases in society could make it more difficult for underrepresented minorities to achieve career accomplishments, which would lead to shorter career sections, or individuals could avoid press coverage, which would make finding information to include in their Wikipedia page more difficult.

The center columns in Table 8 provide some insight into whether the differences are reflective of the editing process or other factors by comparing the average edit count and age (in months, at the time that edit data was collected) for each article.16 Notably, all articles about gender and racial minorities were written more recently than matched comparisons. This could reflect growing awareness of biases on Wikipedia and corrective efforts.17 Furthermore, articles about cis. women do have fewer edits than matched comparisons, while articles about Asian Americans do not (even though both were written more recently than comparisons).

The section-level analyses described in §6.2 also provide information about differences in the ways articles are written. In investigating what percent of each article is devoted to each section, we find that articles about racial minorities are significantly more likely to focus on "Early" sections like "Early life", "Early years", and "Early life and education". Articles about cis. women tend to have more space devoted to general sections, like "Career", "Personal Life", and "Life", and less space devoted to more specific sections: "Political Career", "Professional Career", "Coaching Career".

While prior work has suggested that articles about cis. women tend to be longer and more prominent than articles about cis. men, our work has the opposite finding [18, 39, 51, 53]. There are several differences between our work and prior work: use of matching rather than direct men/women comparison, discarding of incomplete "stub" articles, focus on "Living People", and consideration of gender as non-binary. Additionally, Wikipedia is constantly changing, and prior work identifying missing pages of women on Wikipedia has caused editors to create those pages [39]. However, our work suggests that there does still exist a disparity between the quality of articles about cis. men and women. Specifically, articles about cis. women tend to be shorter and more generic. Given the differences in edit counts, greater editor attention to articles about cis. women could reduce this gap.

In contrast, more investigation is needed to examine the disparity in articles about Asian American people, since we do not see a significant difference in the number of edits per article. Edit counts offer an overly simple view of the editing process, and a more in-depth analysis, including examining the type/size of edits, the identity of editors, and discussions between editors, could offer more insights.

16 Edit data was collected in September 2020, several months after the original data collection; a small subset of articles which had been deleted or had URL changes between collection times are not included in edit counts and article age.
17 Examples: https://meta.wikimedia.org/wiki/Gender_gap, https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_in_Red, https://whoseknowledge.org/
There are sev- find that people from traditionally underrepresented races tend to eral differences between our work and prior work: use of match- have more detailed articles than people whose race is not specified. ing rather than direct men/women comparison, discarding of in- However, in Table 8, we show article length differences after match- complete “stub” articles, focus on “Living People”, and considera- ing. In this table, we do not find significant differences between tion of gender as non-binary. Additionally, Wikipedia is constantly matched comparison articles and articles about African American changing, and prior work identifying missing pages of women on and Hispanic/Latinx American people. Instead, we find that articles Wikipedia has caused editors to create those pages [39]. However, about Asian American people are typically shorter than compari- our work suggests that there does still exist a disparity between son articles. Thus, matching suggests that the differences observed the quality articles about cis. men and women. Specifically, articles in Table 8 occur because of differences in confounding variables about cis. women tend to be shorter and more generic. Given the between groups, rather than reflecting racial disparities. differences in edit counts, greater editor attention to articles about cis. women could reduce this gap. 7.2 Metrics Reveal Content Imbalances In contrast, more investigation is needed to examine the disparity We present some of the high-level statistics, including article lengths, in articles about Asian American people, since we do not see a number of edits, and article age in Table 8. While prior work has significant difference in the number of edits per article. Edit counts examined similar statistics between articles about (assumedly cis- offer an overly simple view of the editing process, and a morein- gender) men and women, our work includes non-binary and trans- depth analysis, including examining type/size of edits, identity of gender people, as well as racial subgroups, and we use pivot-slope editors, and discussions between editors could offer more insights. TF-IDF matching to reduce the influence of confounding variables. High-level statistics can identify possible content gaps in Wikipedia 16Edit data was collected in September 2020, several months after the original data and sets of articles that may benefit from additional editing. collection; a small subset of articles which had been deleted or had URL changes The leftmost columns in Table 8 show that articles about Asian between collection times are not included in edit counts and article age. 17Examples: https://meta.wikimedia.org/wiki/Gender_gap Americans and cis. women tend to be shorter than comparison https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_in_Red articles. Shorter articles can indicate that articles were written less https://whoseknowledge.org/ Anjalie Field, Chan Young Park, and Yulia Tsvetkov

                               Article Lengths         Edit History          Article Age         # of Languages
                              Target   Comparison    Target   Comparison   Target   Comparison   Target   Comparison
    African American           942.8      955.5      243.80     247.02     128.35     136.06      6.2       6.7
    Asian American            795.64      854.6      193.12     197.74     123.05     130.19      5.93      7.08
    Hispanic/Latinx American 1017.37    1028.11      293.50     278.14     129.99     137.47      7.54      7.62
    Non-Binary                1002.5     783.45      363.77     181.48      92.54     115.71      6.48      4.24
    Cisgender women            665.2      786.0      126.58     150.50     110.86     129.19      5.43      6.26
    Transgender women        1106.30     839.32      265.21     161.46     117.74     135.64      6.76      4.65
    Transgender men            648.1      845.4      116.44     164.06      96.37     125.41      3.8       5.6

Table 8: Averaged statistics for articles in each target group and matched comparisons, where matching is conducted with pivot-slope TF-IDF. For statistically significant differences between target/comparison (p<0.05), the greater value is in bold.

Nevertheless, our results support work contradicting the "model minority myth" [12] in suggesting that Asian Americans are not exempt from prejudice and racial disparities.

7.3 Non-English Statistics Reveal "Local Heroes"

While §7.2 focuses on English articles, here we consider other language editions of Wikipedia. The rightmost columns in Table 8 show how many languages each set of articles is available in. Articles about African American people, Asian American people, and cis. women are typically available in fewer languages than comparison articles. In contrast, articles about non-binary people and transgender women are available in significantly more languages (discussed in more detail in §7.4). Language availability indicates possible content gaps that could perpetuate societal biases—a user searching non-English Wikipedia editions is less likely to find biography pages of African Americans than of other Americans.

When we examine differences by language (for each language, what percentage of target vs. comparison articles are available?), we find that the difference in total language availability is not driven by disparities in a few specific languages, but rather occurs broadly. Articles about African Americans are significantly more likely to have versions in Haitian, Yoruba, Swahili, Punjabi, and Ido, and less likely in 44 other languages. Articles about Asian Americans are significantly more likely to have versions in Hindi, Punjabi, Chinese, Tagalog, Tamil, and Thai, and less likely in 42 other languages. Similarly, articles about cis. women are more available in 11 languages and less available in 38 languages. For reference, articles about Latinx/Hispanic people, for which we do not see a significant difference in overall language availability (Table 8), are significantly more likely to have versions in Spanish and Haitian and less likely in 8 other languages. These results support prior work on "local heroes" showing that a person's biography is more likely to be available in languages common to the person's nationality [10, 17, 21]. Our results show that this pattern holds for a person's ethnicity and background, beyond current nationality. These results also show that reducing the observed language availability gap for African Americans, Asian Americans, and cis. women requires substantial effort, as it requires adding articles in a broad variety of languages, rather than focusing on adding articles in a few select languages.

Furthermore, we can examine additional information gaps by considering how the lengths of the same articles differ between languages. As discussed in §7.2, article lengths can differ because of a number of factors, including ones independent from Wikipedia. However, length differences in one language that do not exist in another language for the same set of articles indicate a content gap that can be addressed through additional editing—we know that the disparity does not occur because of external factors, because it does not exist in the other language. Here we highlight two observed disparities. For 234 articles about African American people, both the article and its match are available in Chinese. The articles about African Americans are significantly shorter than their matches in Chinese (target length: 25.4, comparison length: 35.1, p-val: 0.049), but not in English (target: 2,740.8, comparison: 2,468.7, p-val: 0.130). Similarly, for the 912 matched pairs available in Spanish, the articles about African Americans are significantly shorter in Spanish (target: 656.6, comparison: 790.3, p-val: 0.012), but not in English (target: 1,629.0, comparison: 1,501.4, p-val: 0.060). These results suggest that articles about African Americans are written less carefully in Chinese and Spanish than articles about other Americans.
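A sketch of this cross-language check follows; it assumes each article is represented as a hypothetical dict mapping language codes to article text, and reuses a paired t-test as above.

```python
from scipy import stats

def length_gap_in_language(matched_pairs, lang):
    """Compare matched target/comparison article lengths in one language
    edition, restricted to pairs where both articles exist there.

    matched_pairs: list of (target, comparison); each article is a dict
    mapping a language code (e.g. "zh", "es", "en") to its text.
    """
    lengths = [(len(t[lang].split()), len(c[lang].split()))
               for t, c in matched_pairs if lang in t and lang in c]
    target, comparison = zip(*lengths)
    _, p_value = stats.ttest_rel(target, comparison)
    return {"pairs": len(lengths),
            "target_mean": sum(target) / len(target),
            "comparison_mean": sum(comparison) / len(comparison),
            "p_value": p_value}

# e.g. the Chinese vs. English comparison reported above:
# print(length_gap_in_language(african_american_pairs, "zh"))
# print(length_gap_in_language(african_american_pairs, "en"))
```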
7.4 Statistics Align with Social Theories

While our goal in this work is not a thorough investigation of social theories, we briefly discuss some connections between trends in our data and existing theories, in order to demonstrate how our data and methodology can support this line of research and how these theories can provide insight into our data. Specifically, we highlight "the glass ceiling effect" and "exoticizing" or "othering".

Wagner et al. [52] show that women on Wikipedia tend to be more notable than men and suggest that Wikipedia's entry barrier serves as a subtle glass ceiling: there is a higher bar for women to have a biography article than for men. We can find evidence of glass ceiling effects by comparing article availability and length across languages. Specifically, articles about African American people are significantly less available in German (target: 29.86%, comparison: 31.86%, p-val: 0.0023). However, for the subset of 1,399 pairs for which both the matched target and comparison articles are available in German, the English target articles are significantly longer than the comparison articles (target: 1,396.2, comparison: 1,264.4, p-val: 0.007). While numerous factors can account for why an article may not be available in a particular language, our results suggest that one possible reason for the difference in German is that articles about African American people only exist if they are about particularly noteworthy people with long, well-written articles, whereas a broader variety of articles about other Americans are available.
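The availability comparison just described can be sketched as a two-proportion z-test; this excerpt does not name the exact test used, and McNemar's test over matched pairs would be another defensible choice.

```python
import math
from scipy import stats

def availability_gap(target_has_lang, comparison_has_lang):
    """Two-proportion z-test for language availability.

    Arguments are boolean lists: whether each target / comparison article
    has a version in the language of interest (e.g. German).
    """
    n_t, n_c = len(target_has_lang), len(comparison_has_lang)
    p_t, p_c = sum(target_has_lang) / n_t, sum(comparison_has_lang) / n_c
    pooled = (sum(target_has_lang) + sum(comparison_has_lang)) / (n_t + n_c)
    z = (p_t - p_c) / math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    return p_t, p_c, 2 * stats.norm.sf(abs(z))  # two-sided p-value

# e.g. share of African American vs. comparison articles available in German:
# p_target, p_comparison, p_value = availability_gap(target_de, comparison_de)
```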

Additionally, our results show that the biography pages of transgender women and non-binary people tend to be longer and available in more languages than matched comparison articles (Table 8), although length differences are not statistically significant, likely due to small data size. Non-binary people tend to have longer "Career" sections (target: 29.36%, comparison: 20.35%, p-val: 0.016; differences for transgender women are not significant), which could indicate a glass ceiling effect; non-binary people only have articles written about them if they have particularly noteworthy careers. However, one contributing factor to article length is that in these pages, the focal person's gender identity is often highlighted. Both non-binary people and transgender women tend to have a higher percentage of their article devoted to the "Personal Life" section (non-binary: 8.37%, comparison: 2.58%, p-val: 8.30E-06; transgender women: 7.12%, comparison: 1.76%, p-val: 1.58E-05). In examining these pages, personal life sections often focus on gender identity, including the person's self-disclosure of their gender identity and preferred pronouns. The implication that gender identity is a noteworthy trait for just these minority groups is possibly indicative of "othering" or "exoticization", where individuals are distinguished or labeled as not fitting a societal norm, which often occurs in the context of gender or sexual orientation [35, 36]. We briefly mention a few social theories to demonstrate the relevance of our analysis and leave in-depth exploration for future work.

7.5 Intersectionality Uncovers Differing Trends

Recent literature has called for a focus on intersectionality and shown that discrimination cannot be properly studied along a single axis. For example, a focus on race tends to highlight gender- or class-privileged people (e.g. Black men), while a focus on gender tends to highlight race- or class-privileged people (e.g. white women), which further marginalizes people who face discrimination along multiple axes (e.g. Black women) [14, 38, 46]. Although our work only focuses on two dimensions, which is not sufficient for representing identity, we can expand on the single-dimensional analyses shown previously by considering intersected dimensions. While we could compute all intersections, we focus on African American cis. women for alignment with prior work [14]. We compare this target group with three comparison groups: unmarked American cis. women, African American cis. men, and unmarked American cis. men, separately conducting matching for each set.

                                 Target    Comparison    p-value
    Article length
      vs. unmarked Amer. women    906.78      868.84      0.13
      vs. African Amer. men      1013.09      969.39      0.23
      vs. unmarked Amer. men     1012.60      956.99      0.11
    # of languages
      vs. unmarked Amer. women      5.93        7.11      3.80E-09
      vs. African Amer. men         6.54        6.38      0.57
      vs. unmarked Amer. men        6.10        5.60      0.035

Table 9: Article lengths (top) and language availability (bottom) for African American cis. women vs. different comparison groups. For significant results, the greater value is bolded. Paired data sizes were 2,930 (vs. unmarked women), 2,009 (vs. African American men), and 2,133 (vs. unmarked men). Sizes differ slightly for different comparison groups because of the exclusion of articles without close matches.

Table 9 reports article lengths and number of available languages. Notably, articles about African American cis. women are translated into significantly fewer languages than unmarked American cis. women, even though these articles are often longer (though not significantly). In contrast, articles about African American cis. women are available in more languages than articles about unmarked American cis. men. These results suggest that African American women tend to have more global notoriety than comparable American men, possibly a "glass ceiling" or an "othering" effect. However, African American women do not receive as much global recognition on Wikipedia as comparable American women. This result supports the theory that focusing on gender biases without considering other demographics can mask marginalization and fails to consider that some women face more discrimination than others. We present only one example of intersected dimensions, but our methodology can be applied to other combinations.

7.6 Limitations

There are numerous types of bias that our methodology does not capture. As discussed in §2, we do not compare articles against external databases, which precludes identifying many types of coverage bias. Additionally, our matching method depends on categories, which are imperfect controls, and articles with few categories are excluded. Our method does not capture whether certain articles have better category tags than others, and systemic differences in category tagging quality could reduce the reliability of matching. Future work could involve developing additional matching features. Relatedly, the pivot-slope correction relies on tuning the slope, and changing this parameter does change the selected matches. We tune this parameter on held-out data, and results do suggest that it was set correctly: for example, we find that articles for Asian Americans, but not African Americans, are significantly shorter than comparisons, whereas a slope that was set too low would favor long comparison articles and reveal that all targets were significantly shorter than comparisons. However, improperly setting the slope could introduce bias.
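For reference, a minimal sketch of the pivot-slope correction follows, under our reading of pivoted length normalization [47] applied to TF-IDF vectors; it illustrates the general technique, not the authors' exact matching pipeline.

```python
import numpy as np

def pivot_slope_normalize(tfidf, slope):
    """Pivoted length normalization (Singhal et al. [47]) for TF-IDF rows.

    tfidf: (docs x terms) array of unnormalized TF-IDF vectors.
    slope=1.0 recovers plain norm-based (cosine-style) normalization;
    lower slopes pull every denominator toward the average norm, so long
    documents are penalized less and score higher similarities.
    """
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    pivot = norms.mean()  # the pivot: the average document norm
    return tfidf / ((1.0 - slope) * pivot + slope * norms)

def nearest_candidates(target_vecs, candidate_vecs):
    """Pick, for each target, the highest-dot-product candidate.

    With slope < 1 the vectors are deliberately not unit length, which is
    how the correction trades off topical similarity against length.
    """
    return (target_vecs @ candidate_vecs.T).argmax(axis=1)
```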
Additionally, we take a U.S.-centric approach in our examination of race, and we also use English Wikipedia as a starting point, because it has the most active editor community and we expect categories in English to be more reliable than in other languages. A more thorough analysis would include articles about non-Americans that may not have versions in English. We also treat identity as fixed at the time of data collection, but given the fluidity of identity, values inferred in this work may not be correct in the future or in contexts beyond Wikipedia. Finally, it is difficult to determine whether biases are the result of Wikipedia editing, societal biases, or other factors. While our methodology aims to isolate specific social dimensions from other confounding variables, we do not suggest that it implies causal relations, and we view its main use case as identifying sets of articles that may benefit from additional manual investigation and editing.
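One way to probe the category-quality concerns above is a covariate-balance diagnostic from the matching literature [20, 48]; below is an illustrative standardized-mean-difference check over category indicator features, not a procedure described in the paper.

```python
import numpy as np

def standardized_mean_difference(target_feats, comparison_feats):
    """Per-feature standardized mean difference, a common balance
    diagnostic for matched samples [20, 48].

    target_feats, comparison_feats: (articles x features) arrays, e.g.
    0/1 indicators for Wikipedia categories. Values near 0 indicate good
    balance; |SMD| > 0.1 is a conventional flag for imbalance.
    """
    mean_t = target_feats.mean(axis=0)
    mean_c = comparison_feats.mean(axis=0)
    pooled_sd = np.sqrt((target_feats.var(axis=0) + comparison_feats.var(axis=0)) / 2)
    return (mean_t - mean_c) / np.where(pooled_sd == 0, 1.0, pooled_sd)

# categories with large |SMD| after matching point to residual confounds:
# smd = standardized_mean_difference(target_matrix, comparison_matrix)
# worst = np.argsort(-np.abs(smd))[:10]
```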
8 CONCLUSIONS

We present a method that can be used to control for confounding variables when analyzing Wikipedia articles or building training data sets. While we focus on Wikipedia biography pages, this method could be used to construct controlled corpora for any set of documents with category metadata. As demonstrated through our initial analysis, our methodology can help identify systemic differences between sets of articles, facilitate analysis of specific social theories, and provide guidance for reducing bias in corpora.

ACKNOWLEDGMENTS

We would like to thank Martin Gerlach, Kevin Lin, Xinru Yan, and Michael Miller Yoder for their helpful feedback. This material is based upon work supported by the NSF Graduate Research Fellowship Program under Grant No. DGE1745016 and the Google PhD Fellowship program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

REFERENCES
[1] Julia Adams, Hannah Brückner, and Cambria Naslund. 2019. Who Counts as a Notable Sociologist on Wikipedia? Gender, Race, and the "Professor Test". Socius 5 (2019), 1–14. https://doi.org/10.1177/2378023118823946
[2] Judd Antin, Raymond Yee, Coye Cheshire, and Oded Nov. 2011. Gender differences in Wikipedia editing. In Proc. of Symposium on Wikis and Open Collaboration. ACM, New York, 11–14. https://doi.org/10.1145/2038558.2038561
[3] David Bamman and Noah A Smith. 2014. Unsupervised discovery of biographical structure from text. TACL 2 (2014), 363–376. https://doi.org/10.1162/tacl_a_00189
[4] Patti Bao, Brent Hecht, Samuel Carton, Mahmood Quaderi, Michael Horn, and Darren Gergle. 2012. Omnipedia: Bridging the Wikipedia language gap. In Proc. of SIGCHI. ACM, New York, 1075–1084. https://doi.org/10.1145/2207676.2208553
[5] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. JMLR 3 (2003), 993–1022.
[6] Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In Proc. of EMNLP. ACL, Austin, TX, 1119–1130. https://doi.org/10.18653/v1/D16-1120
[7] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. of NeurIPS. Curran Associates Inc., Red Hook, NY, 4349–4357. https://doi.org/10.5555/3157382.315758
[8] Wayne Brekhus. 1998. A sociology of the unmarked: Redirecting our focus. Sociological Theory 16, 1 (1998), 34–51.
[9] Mary Bucholtz and Kira Hall. 2005. Identity and interaction: A sociocultural linguistic approach. Discourse Studies 7, 4-5 (2005), 585–614. https://doi.org/10.1177/1461445605054407
[10] Ewa S Callahan and Susan C Herring. 2011. Cultural bias in Wikipedia content on famous persons. Journal of the American Society for Information Science and Technology 62, 10 (2011), 1899–1915. https://doi.org/10.1002/asi.21577
[11] Eshwar Chandrasekharan, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. You can't stay here: The efficacy of Reddit's 2015 ban examined through hate speech. In Proc. of CSCW. ACM, New York, 1–22. https://doi.org/10.1145/3134666
[12] Rosalind S Chou and Joe R Feagin. 2015. Myth of the model minority: Asian Americans facing racism. Routledge, New York.
[13] Benjamin Collier and Julia Bear. 2012. Conflict, confidence, or criticism: An empirical examination of the gender gap in Wikipedia. In Proc. of CSCW. ACM, New York, 383–392. https://doi.org/10.1145/2145204.2145265
[14] Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. of Chicago Legal Forum 1989, 8 (1989), 139.
[15] Stine Eckert and Linda Steiner. 2013. Wikipedia's gender gap. In Media disparity: A gender battleground, Cory L. Armstrong (Ed.). Lexington Bks, Plymouth, 87–98.
[16] Naoki Egami, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. 2018. How to make causal inferences using texts. Working Paper (2018). https://arxiv.org/abs/1802.02163
[17] Young-Ho Eom, Pablo Aragón, David Laniado, Andreas Kaltenbrunner, Sebastiano Vigna, and Dima L Shepelyansky. 2015. Interactions of cultures and top people of Wikipedia from ranking of 24 language editions. PloS one 10, 3 (2015), 1–27. https://doi.org/10.1371/journal.pone.0114825
[18] Eduardo Graells-Garrido, Mounia Lalmas, and Filippo Menczer. 2015. First women, second sex: Gender bias in Wikipedia. In Proc. of Hypertext & Social Media. ACM, New York, 165–174.
[19] Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. 2020. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 501–512.
[20] Valerie S Harder, Elizabeth A Stuart, and James C Anthony. 2010. Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods 15, 3 (2010), 234. https://doi.org/10.1037/a0019623
[21] Brent Hecht and Darren Gergle. 2010. The tower of Babel meets web 2.0: User-generated content and its applications in a multilingual context. In Proc. of SIGCHI. ACM, New York, 291–300. https://doi.org/10.1145/1753326.1753370
[22] Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast. 2019. Debiasing Vandalism Detection Models at Wikidata. In Proc. of WWW. ACM, New York, 670–680. https://doi.org/10.1145/3308558.3313507
[23] Benjamin Mako Hill and Aaron Shaw. 2013. The Wikipedia gender gap revisited: Characterizing survey response bias with propensity score estimation. PloS one 8, 6 (2013), 1–5. https://doi.org/10.1371/journal.pone.0065782
[24] Laura Hollink, Astrid van Aggelen, and Jacco van Ossenbruggen. 2018. Using the web of data to study gender differences in online knowledge sources: the case of the European parliament. In Proc. of WebSci. ACM, New York, 381–385. https://doi.org/10.1145/3201064.3201108
[25] Nicolas Jullien. 2012. What We Know About Wikipedia: A Review of the Literature Analyzing the Project(s). 86 pages. https://hal.archives-ouvertes.fr/hal-00857208
[26] Katherine A. Keith, David Jensen, and Brendan O'Connor. 2020. Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates. In Proc. of ACL. ACL, Online, 5332–5344. https://doi.org/10.18653/v1/2020.acl-main.474
[27] Josef Kolbitsch and Hermann A Maurer. 2006. The transformation of the Web: How emerging communities shape the information we consume. J. UCS 12, 2 (2006), 187–213.
[28] Piotr Konieczny and Maximilian Klein. 2018. Gender gap through time and space: A journey through Wikipedia biographies via the Wikidata Human Gender Indicator. New Media & Society 20, 12 (2018), 4608–4633.
[29] Shyong (Tony) K Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, David R Musicant, Loren Terveen, and John Riedl. 2011. WP: clubhouse? An exploration of Wikipedia's gender imbalance. In Proc. of Symposium on Wikis and Open Collaboration. ACM, New York, 1–10. https://doi.org/10.1145/2038558.2038560
[30] Wei Luo, Julia Adams, and Hannah Brueckner. 2018. The ladies vanish?: American sociology and the genealogy of its missing women on Wikipedia. Comparative Sociology 17, 5 (2018), 519–556.
[31] Paolo Massa and Federico Scrinzi. 2012. Manypedia: Comparing language points of view of Wikipedia communities. In Proc. of Symposium on Wikis and Open Collaboration. ACM, New York, 1–9. https://doi.org/10.1145/2462932.2462960
[32] Burt L Monroe, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16, 4 (2008), 372–403.
[33] Marçal Mora-Cantallops, Salvador Sánchez-Alonso, and Elena García-Barriocanal. 2019. A systematic literature review on Wikidata. Data Technologies and Applications 53, 3 (2019), 250–268. https://doi.org/10.1108/DTA-12-2018-0110
[34] Jonathan T Morgan, Siko Bouterse, Heather Walls, and Sarah Stierch. 2013. Tea and sympathy: Crafting positive new user experiences on Wikipedia. In Proc. of CSCW. ACM, New York, 839–848. https://doi.org/10.1145/2441776.2441871
[35] Alison Mountz. 2009. The Other. In Key Concepts in Human Geography: Key Concepts in Political Geography. SAGE, London, 328–338.
[36] Kevin L Nadal, Avy Skolnik, and Yinglee Wong. 2012. Interpersonal and systemic microaggressions toward transgender people: Implications for counseling. Journal of LGBT Issues in Counseling 6, 1 (2012), 55–82.
[37] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proc. of NAACL. ACL, New Orleans, LA, 2227–2237.
[38] Yolanda A Rankin and Jakita O Thomas. 2019. Straighten up and fly right: Rethinking intersectionality in HCI research. Interactions 26, 6 (2019), 64–68.
[39] Joseph Reagle and Lauren Rhue. 2011. Gender bias in Wikipedia and Britannica. International Journal of Communication 5 (2011), 21.
[40] Miriam Redi, Besnik Fetahu, Jonathan Morgan, and Dario Taraborelli. 2019. Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability. In Proc. of WWW. ACM, New York, 1567–1578. https://doi.org/10.1145/3308558.3313618
[41] Margaret E Roberts, Brandon M Stewart, and Richard A Nielsen. 2020. Adjusting for confounding with text matching. American Journal of Political Science 64 (2020), 887–903. https://doi.org/10.1111/ajps.12526
[42] Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
[43] Wendy D Roth. 2016. The multiple dimensions of race. Ethnic and Racial Studies 39, 8 (2016), 1310–1338.
[44] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[45] Flavia Salutari, Diego Da Hora, Gilles Dubuc, and Dario Rossi. 2019. A Large-Scale Study of Wikipedia Users' Quality of Experience. In Proc. of WWW. ACM, New York, 3194–3200. https://doi.org/10.1145/3308558.3313467
[46] Ari Schlesinger, W Keith Edwards, and Rebecca E Grinter. 2017. Intersectional HCI: Engaging identity through gender, race, and class. In Proc. of CHI. ACM, New York, 5412–5427. https://doi.org/10.1145/3025453.3025766
[47] Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proc. of SIGIR, Vol. 51. ACM, New York, 176–184.
[48] Elizabeth Stuart. 2010. Matching methods for causal inference: A review and a look forward. Statistical Science 25, 1 (2010), 1–21. https://doi.org/10.1214/09-STS313
[49] Sara Trechter and Mary Bucholtz. 2001. Introduction: White noise: Bringing language into whiteness studies. Journal of Linguistic Anthropology 11, 1 (2001), 3–21.
[50] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[51] Claudia Wagner, David Garcia, Mohsen Jadidi, and Markus Strohmaier. 2015. It's a man's Wikipedia? Assessing gender inequality in an online encyclopedia. In Proc. of ICWSM. AAAI, Oxford, 454–463.
[52] Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science 5, 1 (2016), 5.
[53] Amber Young, Ari D Wigdor, and Gerald Kane. 2016. It's not what you think: Gender bias in information about Fortune 1000 CEOs on Wikipedia. In Proc. of ICIS. AIS, Ft. Worth, 1–16.
[54] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proc. of EMNLP. ACL, Copenhagen, 2979–2989. https://doi.org/10.18653/v1/D17-1323