
Controlled Analyses of Social Biases in Wikipedia Bios

Anjalie Field, [email protected], Carnegie Mellon University
Chan Young Park, [email protected], Carnegie Mellon University
Yulia Tsvetkov, [email protected], Carnegie Mellon University

ABSTRACT

Social biases on Wikipedia, a widely-read global platform, could greatly influence public opinion. While prior research has examined man/woman gender bias in biography articles, possible influences of confounding variables limit conclusions. In this work, we present a methodology for reducing the effects of confounding variables in analyses of Wikipedia biography pages. Given a target corpus for analysis (e.g. biography pages about women), we present a method for constructing a comparison corpus that matches the target corpus in as many attributes as possible, except the target attribute (e.g. the gender of the subject). We evaluate our methodology by developing metrics to measure how well the comparison corpus aligns with the target corpus. We then examine how articles about gender and racial minorities (cisgender women, non-binary people, transgender women, and transgender men; African American, Asian American, and Hispanic/Latinx American people) differ from other articles, including analyses driven by social theories like intersectionality. In addition to identifying suspect social biases, our results show that failing to control for confounding variables can result in different conclusions and mask biases. Our contributions include methodology that facilitates further analyses of bias in Wikipedia articles, findings that can aid Wikipedia editors in reducing biases, and a framework and evaluation metrics to guide future work in this area.

1 INTRODUCTION

Almost since its inception, Wikipedia has attracted the interest of researchers in various disciplines because of its unique community and departure from traditional encyclopedias [25, 27]. As a collaborative knowledge platform where anyone can create or edit pages, Wikipedia effectively crowd-sources information. This setup allows for fast and inexpensive dissemination of information, but it risks introducing social and cultural biases [27]. These biases are problematic—not just because they can influence readers, but also because Wikipedia has become a popular data source for computational models in natural language processing [33, 37], which are prone to absorbing and even amplifying data bias [7, 54]. In this work, we develop methodology to identify possible content biases on Wikipedia along racial and gender dimensions.

Concerns about quality and objectivity have existed almost since the inception of Wikipedia. However, prior computational work on social biases has focused primarily on binary gender, comparing coverage of men and women [1, 52]. Much of this work focuses on all articles about men and women, without considering how confounding variables may affect analyses. For example, there are more male athletes than female athletes on Wikipedia, so it is difficult to disentangle if differences between articles occur because women and men are presented differently, or because non-athletes and athletes are presented differently [18, 24, 52]. Existing methodology for this task primarily consists of incorporating confounders as explanatory variables in a regression model, which restricts analysis to regression models and requires explicitly enumerating all confounds [52].

In contrast, we develop a matching algorithm that enables analyzing different demographic groups while reducing the influence of confounding variables. Given a corpus of Wikipedia biography pages for people that contain target attributes (e.g. pages for cisgender women), our algorithm builds a matched comparison corpus of biography pages for people that do not (e.g. for cisgender men). The comparison corpus is constructed so that it closely matches the target corpus on all known attributes except the targeted one. Thus, examining differences between the two corpora can reveal content bias [53] related to the target attribute, while reducing the influence of possible confounding variables. We develop metrics to evaluate our methodology that measure how closely the comparison corpus matches the target corpus for simulated sets of target corpora.

We ultimately use this method to analyze biography pages that Wikipedia editors or readers may perceive as describing gender (cisgender women, non-binary people, transgender women, and transgender men) and racial (African American,1 Asian American, Hispanic/Latinx American) minorities [9]. We additionally intersect these dimensions and examine portrayals of African American women [14]. To the best of our knowledge, this is the first work to examine gender disparities on Wikipedia beyond cisgender women, the first large-scale analysis of racial disparities [1], and the first consideration of intersectionality in Wikipedia biography pages. We compare article lengths, section lengths, and edit counts on English Wikipedia, and also consider language availability and length statistics in other language editions.

1 We used "African American" rather than Black throughout this paper, because it is the primary keyword on Wikipedia and Wikidata that we used for data collection.

Our analysis reveals systemic differences in how these groups are portrayed. For example, articles about cisgender women tend to be shorter and available in fewer languages than articles about cisgender men, articles about Asian Americans tend to be shorter than articles about other Americans, and articles about African American women tend to be available in more languages than articles about comparable other American men, but in fewer languages than articles about comparable other American women. Identifying these types of disparities can help Wikipedia editors investigate and mitigate them, especially considering the large volume of Wikipedia data, which has inspired other work on automated methods to improve content [40, 45]. These analyses can also reveal stereotypes and biases in society, as imbalanced content on Wikipedia can be indicative of imbalances in society, rather than editor bias.

In §3 we present our matching methodology, which is based on pivot-slope TF-IDF weighted vectors, and several baselines. We then evaluate these methods using novel metrics and simulations defined in §3.2 and present results in §5. We finally present our gender and race analyses in §6 and §7. Overall, our work offers methodology and initial findings for uncovering content biases on Wikipedia, as well as a framework and evaluation metrics for future work in this area.

2 RELATED WORK

Examining social biases on Wikipedia is not a new area, but our work differs from existing analyses in several key ways. Most prior work focuses on gender, and on coverage bias, structural bias, or content bias. Coverage bias involves examining how likely notable men and women are to have Wikipedia articles, usually comparing against external databases like Encyclopedia Britannica [39, 51, 53]. While earlier work found that notable women are more likely to be missing on Wikipedia than notable men [39], more recent work has found the opposite [51, 53]. Coverage bias also involves examining how much information is present on Wikipedia, e.g. article length. On average, articles about women are longer than articles about men [18, 39, 51, 53]. Structural bias denotes differences in article meta-data and other properties that are not directly connected to article text, such as links between articles, diversity of sources cited, and number of edits made by contributors [51, 53]. Examinations of link structures have suggested the presence of structural bias against articles about women (e.g. all biography articles tend to link to articles about men more than women) [17, 51-53]. Finally, content bias considers how the article text itself differs between demographic groups. Analysis using methods such as latent variable models [3], lexicon counts, and pointwise mutual information (PMI) scores [18, 52] has suggested that pages for women discuss personal relationships more frequently than pages for men. In the past, research on biases in Wikipedia has drawn the attention of the editor community and led to changes on the platform [39], which could explain why similar studies sometimes have different findings; it also motivates our work.

However, many of these studies draw limited conclusions because of the difficulty of controlling for confounds in the data. This challenge is exemplified through word statistics. When computing PMI or log-likelihood scores to find words that are over-represented on pages for men as opposed to women, the most common words consist of sports terms: "football", "footballer", "baseball", "league" [18, 52]. This result is not necessarily indicative of bias; Wikipedia editors do not omit the football achievements of women. Instead, this imbalance results because in society and on Wikipedia, there are more male football players than female players.2 Thus, the difference in occupation, rather than the difference in gender, likely explains this imbalance. While some traits can be accounted for, e.g. by including occupation as an explanatory variable in a regression model [52], it is difficult to explicitly enumerate all possible confounders, and this approach limits analysis to particular models, e.g. regression.

2 Whether or not the lack of female football players in society is a sign of bias is beyond the scope of this paper.

Confounding variables also impact cross-lingual analyses of Wikipedia biographies. Most cross-lingual studies focus on "local heroes", where language editions tend to favor people whose nationality is affiliated with the language, in terms of article coverage, length, and visibility [10, 17, 21]. Cross-lingual investigations of gender bias reveal some disparities: in Russian and English Wikipedia, articles about men tend to be more central to the information network, but not in Spanish Wikipedia, and articles about women tend to contain more words associated with family, relationships, and gender than articles about men in multiple languages (especially English, Russian, and German) [28, 51]. [24] similarly find language differences in biographies about European Parliament members, but suggest that their findings are influenced by nationality and birth year more than by gender, demonstrating how confounding variables and the "local heroes" effect can complicate analysis.

Beyond "local heroes", language editions can have systemic differences due to differing readership and cultural norms. In a hypothetical example, an English article about a Bollywood actress might specify that Bollywood is a central point of Indian cinema, but such information would be superfluous in a corresponding Hindi article. The argument that these differences are beneficial, since language editions serve different readers [10, 21], is one of the motivations behind tools like Omnipedia [4] and Manypedia [31] that allow side-by-side comparisons of different language editions. However, in the context of social biases, these systemic differences can confound our research questions. For instance, do English biographies about women contain a higher percentage of words related to family than Spanish articles because there is greater gender bias in English [51]? Or because English articles generally discuss family more than Spanish articles, regardless of gender? We can partially alleviate this ambiguity by comparing biography pages of men and women in each language, but other factors may also be influential. For example, suppose our data set contains proportionally more female singers than male singers [28]. Do we see a difference between the English and Spanish editions because these editions treat gender differently or because they treat singers differently? These ambiguities limit the conclusions in [24] and [51].

Our work approaches the problem of confounding variables and systemic language edition differences through matching. For every page that aligns with our target attribute, we identify a "comparison" biography page, where the comparison page matches as nearly as possible to the target page on all attributes except the targeted one. We then identify possible biases in biography pages as differences between the target and comparison corpora. Matching is a common method for confounder control in observational causal inference studies [42, 48], and has recently been adopted in language analysis [11, 16, 26]. We follow Roberts et al. [41] in using direct matching, rather than propensity matching, in order to facilitate intuitive matches that can be manually assessed.

While our work focuses on analyzing biases in Wikipedia articles regardless of their origin, we briefly discuss possible sources of bias as motivation for our work: why might we expect to find bias on Wikipedia? One prominent possible source of bias is lack of diversity in the Wikipedia contributor community [27]. Surveys from 2008 suggest that the proportion of female contributors was as low as 16.1% (after correcting for survey biases) [23]. A 2011 survey found that the average Wikipedian is around 30 years old, male, computer-savvy, and lives in the U.S. or Europe, and that the active editor community is only 8.5% female.3 The lack of diversity possibly occurs because of unwelcoming practices in the editor community [15, 29]. However, given the general increase in representation of women in society and on Wikipedia, as well as the attention called to these disparities and efforts to improve diversity, editor demographics may have changed in the last decade [2, 13, 15, 28, 29, 34]. A second possible source of bias is the information sources that Wikipedia editors draw from. Wikipedia upholds a "no original research" policy, mandating that all articles must cite only secondary sources [30]. Bias in these secondary sources would then propagate to Wikipedia. Finally, bias on Wikipedia may be reflective of broader societal biases. For example, women may be portrayed as less powerful than men on Wikipedia because editors write imbalanced articles, because other coverage of women such as newspaper articles or traditional encyclopedia articles downplays their power, or because societal constraints prevent women from obtaining the same high-powered positions as men [1].

3 https://meta.wikimedia.org/wiki/Research:Wikipedia_Editors_Survey_2011_April

Finally, nearly all of the cited work focuses on men/women gender bias. Almost no computational work has examined bias in Wikipedia biographies at scale along other dimensions. While observed racial bias on Wikipedia, such as a lack of Black history, has prompted edit-a-thons to correct omissions,4 these dimensions have not been systematically examined in research. One notable exception, [1], examines both gender and racial biases in pages about sociologists. However, their focus is limited to sociologists at R1 institutions, and because of the small data size, they distinguish race only as white or non-white. Nevertheless, their analysis, which focuses on coverage bias, does show that non-white sociologists are less likely to have Wikipedia biography pages.

4 https://en.wikipedia.org/wiki/Racial_bias_on_Wikipedia

3 METHODOLOGY

3.1 Matching Methodology

In this work, we present a method for identifying a "comparison" biography page for every page that aligns with a target attribute, where the comparison page closely matches the target page on all known attributes except the target one. The concept of a comparison group originates in randomized clinical trials, in which participants in a study are randomly assigned to the "treatment group" or "control group" and the effectiveness of the treatment is measured as the difference in results between groups [42].5 In observational studies, when the treatment and outcomes have already occurred, researchers can replicate the conditions of a randomized trial by constructing a treatment group and a control group so that the distribution of covariates is as identical as possible between the two groups for all traits except the target attribute [42]. Then, by comparing the constructed treatment and control groups, researchers can isolate the effects of the target attribute from other confounding variables.

5 We use the terminology target/comparison instead of treatment/control in order to clarify that our work does not involve any actual "treatment".

In our case, our outcome variable is how individuals are portrayed in the contents of their Wikipedia articles. Then, if our target attribute is gender, the target group may consist of Wikipedia biography pages about women and the comparison group may consist of biography pages about men. The two groups would be constructed so that they have similar distributions of covariates that could be confounding variables, such as age, occupation, and nationality. We focus on characteristics directly listed on Wikipedia pages, as these are the ones we can assume Wikipedia editors would be aware of and thus may affect the target outcome.

We describe several baseline methods for constructing a comparison group for a given target group, and we then describe our proposed method and how it addresses the limitations of the baseline methods. Our proposed method uses TF-IDF vectors with a pivot-slope correction [44, 47], where the vectors are constructed from Wikipedia metadata categories.

Given a set of target articles T, our goal is to construct a set of comparison articles C from a set of candidates A, such that C has a similar covariate distribution as T for all covariates except the target attribute. For example, T may be the set of all biography pages about women, and A may be the set of all biography articles about men. We construct C using a 1:1 matching algorithm. For each t ∈ T, we identify c_best ∈ A that best matches t and add c_best to C. If t is about an American actress born in the 1970s, c_best may be about an American male actor born in the 1970s.

In order to identify c_best for a given t, we leverage the category metadata associated with each article. Wikipedia articles contain category tags that enumerate relevant traits. For example, the page for Steve Jobs includes the categories "Pixar people", "Directors of Apple Inc.", "American people of German descent", etc. While relying on the category tags could introduce some bias, as articles are not always categorized correctly or with equal detail, using this metadata allows us to focus on covariates that are likely to affect our target outcome: these categories are assigned by editors and clearly displayed on Wikipedia pages, and thus reflect the traits of individuals that possibly affect how their article is written. We cannot use the article text for matching, as we cannot disambiguate what aspects of the text result from the target attribute and what results from confounding variables.

We describe several possible metrics for comparing the categories of each candidate c ∈ A with t: 3 baseline methods, and our proposed method. Throughout this section, we use CAT(c) to denote the set of categories associated with c. For all metrics, we perform matching with replacement, thus allowing each chosen comparison article to match to multiple target articles.

Number of Categories (baseline). We choose c_best as the article with the largest number of categories in common with t. Intuitively, the person whose article has the largest number of overlapping categories has the most in common with the subject of t and thus is the best possible match. More formally,

    c_{best} = \arg\max_{c_i} |CAT(c_i) \cap CAT(t)|

Percent Categories (baseline). One drawback of matching simply on the raw number of categories is that this method favors articles with more categories. For example, a candidate c_i that has 30 categories is more likely to have more categories in common with t than a candidate c_j that only has 10 categories. However, c_i having more categories in common with t does not necessarily mean that c_i is a better match than c_j—it suggests that the article is better written, rather than suggesting that the person c_i describes has more traits in common with the person t describes than c_j. We can reduce this favoritism by normalizing the number of overlapping categories by the total number of categories in the candidate c_i. Thus, we choose:

    c_{best} = \arg\max_{c_i} \frac{|CAT(c_i) \cap CAT(t)|}{|CAT(c_i)|}
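To make these two baseline scores concrete, the following is a minimal sketch (our own illustration, not the authors' released code). Articles are represented as sets of category strings, and candidates is a hypothetical dict mapping article titles to their category sets:

    def num_categories_score(cand_cats, target_cats):
        # Number of Categories: raw count of shared category tags.
        return len(cand_cats & target_cats)

    def percent_categories_score(cand_cats, target_cats):
        # Percent Categories: normalize the overlap by the candidate's total
        # category count, so candidates with long category lists are not favored.
        return len(cand_cats & target_cats) / len(cand_cats)

    def best_match(target_cats, candidates, score_fn):
        # 1:1 matching with replacement: each target independently takes the
        # argmax candidate, so one candidate may match several targets.
        return max(candidates, key=lambda name: score_fn(candidates[name], target_cats))

For example, best_match({"Living people", "American women novelists"}, candidates, percent_categories_score) returns the candidate title with the highest normalized overlap.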

TF-IDF Weighting (baseline). Both of the prior methods assume that all categories are equally meaningful, but this is an oversimplification. For example, a candidate c_i that has the category "American short story writers" in common with t is more likely to be a good match than a candidate that has the category "Living People" in common with t. We adopt a TF-IDF weighting schema from information retrieval to up-weight categories that are less common [44]. We represent each candidate c_i ∈ A as a sparse category vector. Each element in the vector is a product of the frequency of the category in c_i (i.e. 1/|CAT(c_i)| if the category is in c_i, and 0 otherwise) and the inverse frequency of the category, i.e. 1/|category|. Thus, very common categories like "Living People" are down-weighted as compared to more specific categories. We similarly construct a vector representation of t. We then select c_best as the c_i with the highest cosine similarity between its vector and the vector for t.

Pivot-Slope TF-IDF Weighting (proposed). TF-IDF Weighting and Percent Categories both have a potential problem in that they include the normalization term 1/|CAT(c_i)|. While this term is intended to normalize for articles having different numbers of categories, in actuality, it over-corrects and causes the algorithm to favor articles with fewer categories. This issue has been observed in information retrieval: using TF-IDF weighting to retrieve relevant documents causes shorter documents to have a higher probability of being retrieved [47]. In order to correct this, we adopt the pivot-slope normalization mechanism from [47]. In this method, instead of normalizing the TF-IDF term by |CAT(c_i)|, the term is normalized with an adjusted value:

    (1.0 - slope) \cdot pivot + slope \cdot |CAT(c_i)|

The Pivot-Slope TF-IDF Weighting approach requires setting two parameters, the slope and the pivot, that control the strength of the normalization adjustment. Following the recommendations in [47], the pivot is set to the average number of categories across all articles in the data set, and the slope is tuned over a development set (described in §4).
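A minimal sketch of the proposed matcher (again our own illustration): we use dense NumPy vectors for brevity, and assume idf maps each category to a precomputed inverse-frequency weight and vocab_index maps each category to a vector position:

    import numpy as np

    def category_vector(cats, idf, vocab_index, pivot, slope):
        # Pivot-slope normalized TF-IDF vector over the category vocabulary:
        # the raw length normalizer |CAT(c)| is replaced by the adjusted value
        # (1 - slope) * pivot + slope * |CAT(c)|.
        norm = (1.0 - slope) * pivot + slope * len(cats)
        vec = np.zeros(len(vocab_index))
        for cat in cats:
            vec[vocab_index[cat]] = idf[cat] / norm
        return vec

    def cosine_similarity(u, v):
        return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    def pivot_slope_match(target_cats, candidates, idf, vocab_index, pivot, slope=0.3):
        # candidates: dict mapping article titles to category sets; returns the
        # candidate most cosine-similar to the target. slope=0.3 mirrors the
        # value tuned in §4; the pivot should be the corpus-average category count.
        t_vec = category_vector(target_cats, idf, vocab_index, pivot, slope)
        return max(candidates, key=lambda name: cosine_similarity(
            category_vector(candidates[name], idf, vocab_index, pivot, slope), t_vec))

A sparse-vector implementation would be preferable at the scale of the full corpus; the dense version above only illustrates the weighting scheme.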
3.2 Evaluation Framework

Our ultimate goal in identifying matches is to control for possible confounds in the text. Thus, for a given set of target articles, the optimal matching algorithm would produce a matched comparison set that has identical traits as the target set for all attributes, except the one being measured. In order to assess the effectiveness of each matching metric, we devised evaluation schemes that examine how well the algorithm creates comparison groups with similar attribute distributions as target groups.

We first construct simulated target sets. We then run each matching algorithm to identify a matched comparison set for each target set, and we use several metrics to examine how closely the comparison set matches the target set. We cannot evaluate in an attribute-specific setting, such as setting the target group as biographies about women and the comparison group as biographies about men, because if we identify differences between the target and comparison groups, we cannot determine if these differences are signs of gender bias or are inaccuracies in the matching algorithm. Instead, in these simulated target sets, we do not fix a target attribute, and thus we expect a high-quality matching algorithm to identify a comparison set that matches very closely to the target set. We use two methods to construct simulated target sets:

Article-Sampling. We randomly sample 1000 articles.

Category-Sampling. We randomly sample one category that has at least 500 members. We then sample 500 articles from the category. We do not expect there to be any bias towards a single category, since most categories are very specific, e.g. "Players of American football from Pennsylvania". While articles for football players might have different characteristics than other articles, we would not expect articles for players from Pennsylvania to be substantially different than articles for players from New York or New Jersey. However, this setup does more closely replicate the intended analysis setting than random sampling, as we ensure that all people in the target group have a common trait.

We then use several metrics to assess how well-matched the target and comparison groups are:

Average bias. Standardized bias is the typical method used to evaluate covariate balance in matching methods [20]. For a given covariate, the metric is calculated by taking the difference in means between the treatment and control groups and dividing by the standard deviation in the treatment group. In our case, we treat each category as a binary covariate that can be present or absent for each article. We then compute the standardized bias for each category and average across all categories. Since some categories appear only in the target group and some appear only in the comparison group, we compute this metric in two directions: for all the categories that appear in the target group ("Avg. Bias") and for all the categories that appear in the comparison group ("Avg. Bias 2"). High average bias suggests that the distribution of categories is very different between the target and the comparison groups.

Number of Categories. As discussed in §3, one of the concerns with the described methods is that they may favor articles with more or fewer categories. Thus, we compare the number of categories in the target group with the number of categories in the comparison group using Cohen's d, which measures effect size as the difference in means between two groups divided by the pooled standard deviation. A high value indicates that the two groups have different numbers of categories per article.

Text Length. The prior two metrics focus on the category level. However, we use categories as a proxy to control for confounds in the text, and thus we ultimately seek to assess how well our matching methods control for differences in the actual article text. We first compare article lengths (word counts) using Cohen's d.

Polar Log-odds. We then compare how article vocabularies differ by computing log-odds with a Dirichlet prior [32], which measures to what extent words are overrepresented in one corpus as compared to another. A high log-odds score indicates that a word is much more likely to appear in one corpus than the other. We compute log-odds between the target group and comparison group. We then take the absolute value of all log-odds scores, and compute the mean and standard deviation for the 200 most polar words. High log-odds polarities indicate dissimilar vocabulary between groups.
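The two covariate-balance metrics above can be sketched as follows (our own illustration; target_cat_sets and comp_cat_sets are assumed to be lists of per-article category sets, aligned arbitrarily):

    import numpy as np

    def standardized_bias(target_vals, comp_vals):
        # Difference in means divided by the standard deviation
        # in the target (treatment) group.
        sd = np.std(target_vals)
        return (np.mean(target_vals) - np.mean(comp_vals)) / sd if sd > 0 else 0.0

    def avg_bias(target_cat_sets, comp_cat_sets, categories):
        # Treat each category as a binary covariate per article and average
        # the absolute standardized bias over `categories` (e.g. all
        # categories appearing in the target group, for "Avg. Bias").
        biases = []
        for cat in categories:
            t = np.array([cat in cats for cats in target_cat_sets], dtype=float)
            c = np.array([cat in cats for cats in comp_cat_sets], dtype=float)
            biases.append(abs(standardized_bias(t, c)))
        return float(np.mean(biases))

    def cohens_d(x, y):
        # Effect size: difference in means over the pooled standard deviation,
        # used for the Number of Categories and Text Length metrics.
        nx, ny = len(x), len(y)
        pooled = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1))
                         / (nx + ny - 2))
        return (np.mean(x) - np.mean(y)) / pooled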

[Figure 1: two grouped bar charts comparing the Random, Number, Percent, TF-IDF, and Pivot-Slope TF-IDF methods on Num. of Cats, Text Length, Polar log-odds (mean and std), Avg. Bias, Avg. Bias 2, KL 1, and KL 2.]

Figure 1: Evaluation of matching methods using article-sampling to select 1000-page target groups, averaged over 100 simulations (divided into two figures for readability). Lower scores indicate better matches; Pivot-Slope TF-IDF performs the best.

[Figure 2: two grouped bar charts comparing the same methods and metrics as Figure 1, under category-sampling.]

Figure 2: Evaluation of matching methods using category-sampling to select 500 pages as the target group, repeated 100 times.

KL Divergence. Finally, rather than examining just word-level differences, we use a topic model to examine topic-level differences. We train an LDA model with 100 topics across all articles in the corpus [5]. After running the matching algorithm, we average the topic vectors of the articles in the comparison group and in the target group, using 1/1000 additive smoothing to avoid having 0 probabilities for any topic, and then normalize these vectors into valid probability distributions. Thus, we obtain a topic probability distribution vector for the target group and for the comparison group. We then compute the KL-divergence between these two vectors. Since KL-divergence is not symmetric, we compute it in both the target-comparison ("KL") and the comparison-target ("KL 2") directions.

4 DATA

We gathered a corpus of Wikipedia biography pages by collecting all articles with the category "Living people" in March 2020. We discarded articles that had < 2 categories, < 100 tokens, or were marked as stubs, indicated by the presence of a stub category like "Actor stubs". In matching the articles, we ignore any categories that are focused on traits of the Wikipedia article rather than traits of the person, using a heuristically-defined list that includes categories containing the words "Use Indian English", "Pages with", "Contains Links", etc. After filtering, the data set that we use for evaluation simulations contains 444,045 pages. On average, pages contain 9.3 categories (after excluding categories as described above) and 628.2 tokens.

As described in §3, in constructing pivot-slope TF-IDF vectors, we set the pivot to 9.3 (the average number of categories). We constructed two development sets for tuning the slope: first, we sampled a fixed set of 1000 biography pages, and second, we sampled a fixed set of 10 categories and sampled 500 people from each category. We tested slope values between 0 and 1 in 0.1 increments and selected the value that minimized the difference between the target and comparison sets using the metrics described in §3.2. We fixed the slope to 0.3. We excluded the biography pages used for tuning when constructing the simulated sets for evaluation. We caution that tuning the slope parameter is an important step in using our algorithm, as changing the parameter does change the selected matches. We use the same data set and parameter settings for our analysis of gender and racial biases (described in §7).
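A sketch of the filtering heuristics described above (illustrative only; the marker list is abbreviated from the paper's heuristically-defined list, and testing for a "stubs" suffix is our assumption about how stub categories are detected):

    ARTICLE_TRAIT_MARKERS = ("Use Indian English", "Pages with", "Contains Links")

    def keep_page(categories, num_tokens):
        # Discard pages with < 2 categories, < 100 tokens, or a stub
        # category (e.g. "Actor stubs").
        if len(categories) < 2 or num_tokens < 100:
            return False
        return not any(c.endswith("stubs") for c in categories)

    def person_categories(categories):
        # Ignore categories describing the Wikipedia article itself
        # rather than traits of the person.
        return {c for c in categories
                if not any(m in c for m in ARTICLE_TRAIT_MARKERS)}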
5 RESULTS

We evaluate each method using 100 article-sampling and 100 category-sampling simulations. For each simulation, we construct a synthetic target group, and then we use the chosen method to identify a matched comparison group. We report results averaged over the 100 simulations. In addition to the described matching algorithms, we show the results of randomly sampling a comparison group that has the same number of people as the target group.

Figures 1 and 2 report results. All of the evaluation metrics measure differences between the target group and the comparison group, meaning a lower value indicates the comparison group is a better match. In the category-sampling simulations (Figure 2), which better simulate having a target group with a particular trait in common, all matching methods perform better than random sampling, and the Pivot-Slope TF-IDF method performs the best overall. In the article-sampling simulations (Figure 1), random sampling provides a strong baseline. This is unsurprising, as two randomly chosen groups of 1000 articles are unlikely to differ significantly from each other. Nevertheless, the Pivot-Slope TF-IDF method outperforms random sampling on the text-based metrics (polar log-odds and KL divergence) as well as on average bias.

The Number of Categories (Num. of Cats) and Text Length metrics do show possible biases of these methods. As expected, the Number of Categories, Percent Categories, and TF-IDF Weighting matching methods all exhibit bias towards articles with more or fewer categories, which results in worse performance than random sampling over these two metrics in Figures 1 and 2.
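As a concrete reference for the topic-level metric reported as "KL" and "KL 2" in Figures 1 and 2, a minimal sketch of the computation defined in §3.2 (our own illustration; per-article topic vectors from the 100-topic LDA model are assumed to be given as rows of a NumPy array):

    import numpy as np

    def group_topic_distribution(doc_topic_rows, smoothing=1.0 / 1000):
        # Average the per-article topic vectors for one group, apply 1/1000
        # additive smoothing, and renormalize into a probability distribution.
        avg = doc_topic_rows.mean(axis=0) + smoothing
        return avg / avg.sum()

    def kl_divergence(p, q):
        return float(np.sum(p * np.log(p / q)))

    # KL-divergence is asymmetric, so it is computed in both directions:
    # kl_divergence(target_dist, comp_dist)  -> "KL"
    # kl_divergence(comp_dist, target_dist)  -> "KL 2"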

In the Number of Categories matching method, this effect is positive, indicating that articles in the comparison group tend to have more categories, while in the Percent Categories and TF-IDF Weighting methods the effect is negative (Figures 1 and 2 report absolute values). These differences are also reflected in text length, in that articles with more categories also tend to be longer. In the category-sampling evaluation, the pivot-slope normalization corrects for this length bias and demonstrates better-than-random matches. In the article-sampling evaluation, while the pivot-slope normalization does outperform the other metrics, the random method exhibits the least category-number and text-length bias. However, as mentioned, random sampling is a strong baseline in this setting.

In Tables 1-4, we provide examples of matched comparison articles for a set of sample people. These examples illustrate some of the trends in Figures 1 and 2. Notably, the Percent Categories and TF-IDF Weighting methods prefer short comparison articles with few categories, even when they are not particularly meaningful. Both the Number of Categories and the Pivot-Slope TF-IDF methods produce meaningful pairs. However, the TF-IDF weighting upweights more specific categories. The Number of Categories method matches Barack Obama to Michael Moore based on broad categories like "American people of Irish descent"; the Pivot-Slope TF-IDF weighting matches Barack Obama to Roland Burris based on more specific categories, like "United States senators from Illinois". In the Number of Categories method, the match between T-Pain and Pharrell Williams seems adept, while in the Pivot-Slope TF-IDF weighting method, Yuna Kim and Mao Asada is a particularly accurate pairing, as these two figure skaters are well-known for their rivalry.

Table 1: Matches and common categories obtained using the Number of Categories method.
  T-Pain, Pharrell Williams: American music industry executives, 21st-century American rappers, American male singers, African-American male singers, Grammy Award winners for rap music, 21st-century American singers, American hip hop singers, African-American record producers, American hip hop record producers, Southern hip hop musicians, African-American male rappers, American contemporary R&B singers
  Barack Obama, Michael Moore: Male feminists, HuffPost writers and columnists, 21st-century American male writers, LGBT rights activists from the United States, American gun control activists, American people of English descent, American male non-fiction writers, American political writers, American people of Irish descent, 21st-century American non-fiction writers, 20th-century American male writers, 20th-century American non-fiction writers, American people of Scottish descent
  Yuna Kim, Tessa Virtue: Olympic medalists in figure skating, Medalists at the 2010 Winter Olympics, Figure skaters at the 2010 Winter Olympics, World Junior Figure Skating Championships medalists, Medalists at the 2014 Winter Olympics, Figure skaters at the 2014 Winter Olympics, World Figure Skating Championships medalists, Four Continents Figure Skating Championships medalists, Season-end world number one figure skaters
  Amitabh Bachchan, S. P. Balasubrahmanyam: 20th-century Indian singers, Filmfare Awards winners, Bollywood playback singers, Recipients of the Padma Shri in arts, Indian male voice actors, 21st-century Indian male actors, Indian male film actors, Indian male film singers, 21st-century Indian singers, Indian male singers, 20th-century Indian male actors, Indian television presenters, Recipients of the Padma Bhushan in arts
  Tim Cook, Bob Iger: American chief operating officers, Biography with signature, 20th-century American businesspeople, Directors of Apple Inc., 21st-century American businesspeople
  Ron Berger (professor), Yukiko Iwai (singer): 1968 births

Table 2: Matches obtained using Percent Categories.
  T-Pain, Tay Dizm: Musicians from Tallahassee, Florida, Singers from Florida, 21st-century American singers, 21st-century male singers, Jive Records artists, African-American male rappers, RCA Records artists, 21st-century American rappers, Rappers from Florida
  Barack Obama, Robert Moffit: American male non-fiction writers, American political writers
  Yuna Kim, Park Solhee: 1990 births, South Korean writers
  Amitabh Bachchan, Kapil Jhaveri: Male actors in Hindi cinema, Indian male film actors
  Tim Cook, Joe Fuca: American technology chief executives, 21st-century American businesspeople
  Ron Berger (professor), Jean-Christophe Valtat: 1968 births

Table 3: Matches obtained using TF-IDF.
  T-Pain, Vexxed: Twitch streamers
  Barack Obama, JJonak: Use American English from August 2018
  Yuna Kim, Tommy Chang (martial artist): South Korean expatriates in Canada
  Amitabh Bachchan, Kishore Lulla: Film producers from Mumbai, Hindi film producers
  Tim Cook, Stephen Austin (American football): National Football League executives
  Ron Berger (professor), Sheldon Hall (film historian): Academics of Sheffield Hallam University

Table 4: Matches obtained using Pivot-Slope TF-IDF.
  T-Pain, Tay Dizm: 21st-century American rappers, Jive Records artists, 21st-century American singers, Rappers from Florida, 21st-century male singers, Musicians from Tallahassee, Florida, Singers from Florida, RCA Records artists, African-American male rappers
  Barack Obama, Roland Burris: Illinois Democrats, United States senators from Illinois, African-American United States senators, Democratic Party United States senators, 21st-century American politicians, Politicians from Chicago, African-American people in Illinois politics
  Yuna Kim, Mao Asada: Olympic medalists in figure skating, 1990 births, Medalists at the 2010 Winter Olympics, Figure skaters at the 2014 Winter Olympics, World Figure Skating Championships medalists, Four Continents Figure Skating Championships medalists, Season-end world number one figure skaters, Figure skaters at the 2010 Winter Olympics, World Junior Figure Skating Championships medalists
  Amitabh Bachchan, Dilip Kumar: Male actors in Hindi cinema, Indian male voice actors, Recipients of the Padma Bhushan in arts, Indian actor-politicians, Indian male film actors, Male actors from Mumbai, 20th-century Indian male actors, Filmfare Awards winners, Recipients of the Padma Vibhushan in arts, Dadasaheb Phalke Award recipients, Biography with signature, Film producers from Mumbai
  Tim Cook, Ellen Hancock: American technology chief executives, Apple Inc. executives, IBM employees, American chief operating officers
  Ron Berger (professor), Liu Mingkang: Alumni of Cass Business School

6 ANALYSIS METHODOLOGY

We use our matching method to facilitate analyses of Wikipedia articles along different dimensions, including examining possible biases and content gaps for gender and racial minorities. We first describe how we identify target articles and then present results.

6.1 Data

Our primary data for analysis is biography pages in the Living People category of Wikipedia (444,045 pages, after the filtering and preprocessing described in §4). Table 5 reports the final data set sizes for each target and candidate comparison group. We describe the processing for determining biography pages for people of different genders and races below.

Identity categories like race and gender are fluid concepts that are difficult to operationalize and whose definitions depend on social context [9, 19]. As our focus is on Wikipedia, and we aim to identify how Wikipedia articles differ when editors and readers associate their subjects with different genders and races, we derive race and gender categories directly from Wikipedia articles and associated metadata. Our goal is to identify the observed gender and race of individuals as perceived by editors who assigned article meta-data or readers who may view them, as opposed to assuming absolute ground-truth characteristics of any individual [43].
In order to determine the observed gender of people in our corpus, we primarily rely on Wikidata—a crowd-sourced database of structured information corresponding to Wikipedia pages [50]. We identify pages for people of 5 genders (transgender men, transgender women, non-binary people, cisgender women, and cisgender men) as follows:

Transgender men/women. Articles whose Wikidata entry contains the Q_TRANSGENDER_MALE/_FEMALE property value.

Non-binary, gender-fluid, or gender-queer (termed "non-binary"). Articles whose Wikidata entry contains the Q_NONBINARY property value or Q_GENDER_FLUID property value. Additionally, Wikipedia pages with a category containing "-binary" or "Genderqueer". We found that some pages with non-binary gender indicators had binary gender properties in Wikidata, and thus we also use categories to identify pages for non-binary people.6

Cisgender (cis.) women. Articles whose Wikidata entry contains the Q_FEMALE property value. Also, pages for which we did not identify a Wikidata entry nor a category indicative of non-binary gender, but that contained more she/her than he/him pronouns.

Cisgender (cis.) men. Articles whose Wikidata entry contains the Q_MALE property value. Also, pages for which we did not identify a Wikidata entry nor a category indicative of non-binary gender, but that contained more he/him than she/her pronouns.

6 For example, https://en.wikipedia.org/wiki/Laganja_Estranja, "they identify as non-binary and does not have any preferred pronouns."; Wikidata gives gender as male.

Thus, for 38,955 articles for which we could not identify Wikidata entries nor any indication of non-binary gender, we assigned these pages as cis. man or cis. woman based on the most common pronouns used in the page. We found that for pages that did have Wikidata entries, pronoun-inferred gender aligned with their Wikidata gender in 98.0% of cases.7 We use cis. men as the comparison group, and separately run the matching algorithm to identify matches from this group for each other gender. We exclude cis. men with categories containing the keyword "LGBT" when matching transgender and non-binary people to them.

7 We acknowledge that there can be errors in Wikidata [22].
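The pronoun fallback can be sketched as follows (our own illustration; the paper does not specify its tokenization, so the regular expressions are assumptions):

    import re

    SHE_HER = re.compile(r"\b(she|her)\b", re.IGNORECASE)
    HE_HIM = re.compile(r"\b(he|him)\b", re.IGNORECASE)

    def pronoun_inferred_gender(article_text):
        # Fallback for pages with no Wikidata entry and no category indicative
        # of non-binary gender: assign cis. woman/man by the majority pronouns.
        n_she = len(SHE_HER.findall(article_text))
        n_he = len(HE_HIM.findall(article_text))
        if n_she == n_he:
            return None  # no majority; leave the page unassigned
        return "cis. woman" if n_she > n_he else "cis. man"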
We includes ones like “African-American Catholics”, “American aca- identify pages for people of 5 genders (transgender men, transgen- demics of Mexican descent”, and excludes “American expatriates in der women, non-binary people, cisgender women, and cisgender Hong Kong”, “Asian of American descent”. We thus leverage our fo- men) as follows: cus on articles about American people and derive categories where Transgender men/women. Articles whose Wikidata entry con- “American” indicates nationality and other country names indicate tains the Q_TRANSGENDER_MALE/_FEMALE property value. background or ethnicity. We acknowledge that our use of the term “race” is an oversimplification of the distinctions between articles Non-binary, gender-fluid, or gender-queer (termed “non-binary”). in our corpus. Nevertheless, we believe it reflects the perceptions Articles whose Wikidata entry contains the Q_NONBINARY prop- that readers and writers of Wikipedia may hold. erty value or Q_GENDER_FLUID property value. Additionally, We validate our category-based approach by comparing the set Wikipedia pages with a category containing "-binary" or "Gen- of articles that we identify as describing African American people derqueer". We found that some pages with non-binary gender indi- with the Wikidata “ethnic group” property. We cannot use this cators had binary gender properties in Wikidata, and thus we also property for all target groups, as it was only populated for 3.4% of use categories to identify pages for non-binary people.6 articles in our data set and is largely unused for relevant values other than “African American” (9 pages contained the “Hispanic Cisgender (cis.) women. Articles whose Wikidata entry contain and Latino American” property; 5 pages contained “Latino”; 21 Q_FEMALE property value. Also, pages for which we did not iden- pages contained “White”). Of the 9,668 people that our category tify a Wikidata entry nor a category indicative of non-binary gender, method identifies as African American, we were able to match but that contained more she/her than he/him pronouns. 7We acknowledge that there can be errors in Wikidata [22] Cisgender (cis.) men. Articles whose Wikidata entry contained 8https://unstats.un.org/unsd/demographic/sconcerns/popchar/popcharmethods.htm Q_MALE property value. Also, pages for which we did not identify 9https://www.census.gov/prod/cen2010/briefs/c2010br-02.pdf 10https://www.census.gov/mso/www/training/pdf/race-ethnicity-onepager.pdf 11https://www.pewresearch.org/fact-tank/2015/06/15/is-being-hispanic-a-matter-of- 6For example, https://en.wikipedia.org/wiki/Laganja_Estranja, “they identify as non- race-ethnicity-or-both/ binary and does not have any preferred pronouns.”; Wikidata gives gender as male 12We identify country lists from worldometers.info Anjalie Field, Chan Young Park, and Yulia Tsvetkov

We validate our category-based approach by comparing the set of articles that we identify as describing African American people with the Wikidata "ethnic group" property. We cannot use this property for all target groups, as it was only populated for 3.4% of articles in our data set and is largely unused for relevant values other than "African American" (9 pages contained the "Hispanic and Latino American" property; 5 pages contained "Latino"; 21 pages contained "White"). Of the 9,668 people that our category method identifies as African American, we were able to match 5,776 of them to ethnicity information in Wikidata. Of these 5,776 pages, our method exhibited precision of 98.5%, in that 5,693 pages contained the African American Wikidata property. Our method showed 69.0% recall, in that we recovered 5,693 of the 8,375 pages that have the African American Wikidata property. In our analysis, precision is more important than recall, as low recall implies we are analyzing a smaller data set than we may have otherwise, while low precision implies we are analyzing the wrong data.

Table 5: Data set sizes for analysis corpora. The "Final" column indicates the target/comparison sizes after matching.

    Group                             Pre-match   Final
    African American                  9,668       8,405
    Asian American                    4,792       3,516
    Hispanic/Latinx American          4,480       3,811
    Unmarked American (comparison)    93,486      -
    Non-Binary                        198         132
    Cisgender women                   106,586     66,582
    Transgender women                 261         138
    Transgender men                   85          54
    Cisgender men (comparison)        331,484     -

Finally, a natural choice for comparisons would be articles about white/Caucasian Americans. However, in identifying these articles, we encountered the obstacle of "markedness": while articles about racial minorities are often explicitly marked as such, whiteness is assumed so generic and defined primarily in contrast to non-white identities that it is rarely explicitly marked [8, 49]. We see this in our data: an article about an African-American politician may have the category "African-American United States senators", whereas an article about a white American politician has the category "Governors of Texas" as opposed to "White Governors of Texas" (Barack Obama vs. George W. Bush). Thus, we draw from the theory that markedness is itself a social indicator, and we define our candidate comparison articles as ones that are "unmarked" as racial minorities. More specifically, we selected all pages that contain a category with the word "American", but do not contain a category indicative of a racial minority group (including categories referring to the Middle East, Native Americans, and Pacific Islanders) nor a Wikidata entry indicative of a racial minority. We additionally exclude football players and basketball players, as we found that these articles were very often unmarked, even for racial minorities.13 In manually reviewing the comparison corpus, based on pictures and information sources outside Wikipedia, we estimated that 90% of the corpus consists of Caucasian/white people. In practice, the 10% of non-white pages are often the ones selected during the matching algorithm.
Thus, as our analysis focuses on unmarked/marked articles, it likely underestimates the differences between how people of color and white people are portrayed on Wikipedia (since there are some people of color included in our "unmarked" corpus).

Table 5 presents the data set sizes. After matching, we excluded pairs from analysis if they contained < 2 categories in common.14 The rightmost column in Table 5 reflects sizes after this exclusion.

13 Articles about jazz musicians were also less-marked, though we did not exclude them. We suggest investigation of markedness on Wikipedia as an area for future work.
14 When counting common categories, we excluded categories that contained "births" or "alumn", as we found that "Alumni from X" and "[year] births" were extremely common categories that we considered too broad to be meaningful.

6.2 Analysis Dimensions

We compute several metrics to compare the target and comparison corpora. We describe them here, and then demonstrate how they can be used to identify content gaps, biases, and other areas for editing improvements. Due to space limitations, we discuss a subset of results and provide complete metrics at [ANONYMIZED LINK].

• Summary statistics: Using the English articles, we compute the average article length, the number of languages the articles are available in, and the number of edits. We compare target and comparison metrics using a paired t-test.

• Log-odds scores: We identify words that are over-represented in the target or comparison groups using log-odds [32].

• Per-Language availability: For all languages in which at least a threshold number t of target or comparison articles are available, we compare what percentage of target vs. comparison articles are available in the language (e.g. are target or comparison articles more likely to be available in German?). We compute significance using McNemar's test with a Benjamini-Hochberg multiple hypothesis correction (sketched below).15

• Normalized Section Lengths: For all second-level sections that at least t target or comparison articles contain, we compare, on average, what percentage of each article is devoted to that section (number of tokens in the section / total number of tokens in the article, averaged across the target/comparison group). We compute significance using a paired t-test with a Benjamini-Hochberg multiple hypothesis correction.

• Multilingual Differences: For the top 10 most edited languages on Wikipedia (English, French, Arabic, Russian, Japanese, Italian, Spanish, German, Portuguese, and Chinese), for all target-comparison pairs where both members of the pair are available in the language, we compare article lengths and normalized section lengths as described above.

15 We set t for language/section analysis as follows: all race/ethnicity (100), intersectional (50), women (500), transgender men/women (20), non-binary (50). We do not believe that these thresholds affect results, since they primarily exclude results that are not statistically significant, and we use them for convenience of output.
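As referenced in the Per-Language availability item above, the significance computation can be sketched as follows (our own illustration, assuming per-pair availability flags have already been collected; the helper and argument names are hypothetical):

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar
    from statsmodels.stats.multitest import multipletests

    def per_language_significance(availability, languages, alpha=0.05):
        # availability[lang]: list of (target_available, comparison_available)
        # boolean pairs, one per matched target-comparison pair.
        pvals = []
        for lang in languages:
            pairs = availability[lang]
            both = sum(1 for t, c in pairs if t and c)
            only_t = sum(1 for t, c in pairs if t and not c)
            only_c = sum(1 for t, c in pairs if c and not t)
            neither = sum(1 for t, c in pairs if not t and not c)
            # McNemar's test on the paired 2x2 table (driven by discordant cells).
            table = np.array([[both, only_t], [only_c, neither]])
            pvals.append(mcnemar(table, exact=False).pvalue)
        # Benjamini-Hochberg correction across all tested languages.
        reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
        return {lang: (p, sig) for lang, p, sig in zip(languages, corrected, reject)}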

7 ANALYSIS RESULTS

7.1 Matching Reduces Data Confounds

First, we revisit our motivation in this work by considering how results differ when matching is not used. Table 6 presents the 10 words that are most associated with biography pages about cisgender men and women, calculated using log-odds with a Dirichlet prior [32]. As shown in previous work, without matching, words highly associated with men include many sports terms, such as "season" and "League", which suggests that directly comparing these biographies could capture athlete/non-athlete differences rather than man/woman differences. After matching, these sports terms are no longer present in the top 10 most polar log-odds terms. Instead, they are replaced by overtly gendered terms like "himself" and "wife", which suggests that matching helps isolate gender as the primary variable of difference between the two corpora.
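Since the log-odds statistic recurs throughout our analyses, a minimal sketch of the computation (our own implementation of the Monroe et al. estimator cited as [32]; using combined-corpus counts as the prior is the standard choice, but an assumption here):

    import numpy as np
    from collections import Counter

    def log_odds_with_prior(counts_a, counts_b, prior):
        # Log-odds with an informative Dirichlet prior: returns a z-score per
        # word; large positive values are a-associated, large negative values
        # are b-associated.
        n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
        a0 = sum(prior.values())
        scores = {}
        for w, a_w in prior.items():
            y_a, y_b = counts_a[w], counts_b[w]
            delta = (np.log((y_a + a_w) / (n_a + a0 - y_a - a_w))
                     - np.log((y_b + a_w) / (n_b + a0 - y_b - a_w)))
            variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
            scores[w] = delta / np.sqrt(variance)
        return scores

    # counts_a, counts_b: collections.Counter over the tokens of each corpus;
    # prior: Counter over the combined corpora (so every scored word has a_w > 0).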

Beyond the top 10 log-odds scores, sports terms do occur, but they tend to be more specific and represented on both sides; for example, "WTA" is women-associated and "NBA" is men-associated.

Table 6: Log-odds scores between cisgender men's and women's pages after matching and without matching. Words are ordered most-to-least polar from top to bottom. Matching reduces sports terms (bold in the original) in favor of overtly gendered terms.

              Matched               Unmatched
    Men       Women       Men       Women
    he/He     her/Her     he/He     her/Her
    his/His   she/She     his/His   she/She
    him       Women       season    Women
    himself   women       him       women
    wife      husband     League    actress
    Park      herself     club      husband
    Men       woman       against   female
    chairman  female      games     Miss

In Table 7 we show the results of comparing article lengths between racial subgroups and the entire candidate comparison group, rather than comparing with the subset of matched articles. Table 7 suggests that articles for all racial subgroups are significantly longer than comparison articles. Thus, without matching, we would find that people from traditionally underrepresented races tend to have more detailed articles than people whose race is not specified. However, in Table 8, we show article length differences after matching. In this table, we do not find significant differences between matched comparison articles and articles about African American and Hispanic/Latinx American people. Instead, we find that articles about Asian American people are typically shorter than comparison articles. Thus, matching suggests that the differences observed in Table 7 occur because of differences in confounding variables between groups, rather than reflecting racial disparities.

Table 7: Average article lengths without matching. All target sets appear significantly longer than comparisons.

    Group                       Target   Comparison   p-value
    African American            902.0    711.4        6.81E-87
    Asian American              741.3    711.4        0.0198
    Hispanic/Latinx American    972.5    711.4        9.78E-82

7.2 Metrics Reveal Content Imbalances

We present some of the high-level statistics, including article lengths, number of edits, and article age, in Table 8. While prior work has examined similar statistics between articles about (assumedly cisgender) men and women, our work includes non-binary and transgender people, as well as racial subgroups, and we use pivot-slope TF-IDF matching to reduce the influence of confounding variables. High-level statistics can identify possible content gaps in Wikipedia and sets of articles that may benefit from additional editing.

The leftmost columns in Table 8 show that articles about Asian Americans and cis. women tend to be shorter than comparison articles. Shorter articles can indicate that articles were written less carefully, e.g. an editor may have spent less time researching the person and uncovering details to include in the article. They can also indicate that information is less carefully presented—in manually comparing articles with their matches, we often found that information in Asian American people's articles was displayed in lists or tables, whereas information in comparison articles was presented in descriptive paragraphs. Further, differences on Wikipedia could reflect external information disparities—biases in society could make it more difficult for underrepresented minorities to achieve career accomplishments, which would lead to shorter career sections, or individuals could avoid press coverage, which would make finding information to include in their Wikipedia page more difficult.

The center columns in Table 8 provide some insight into whether the differences are reflective of the editing process or other factors by comparing the average edit count and age (in months, at the time that edit data was collected) for each article.16 Notably, all articles about gender and racial minorities were written more recently than matched comparisons. This could reflect growing awareness of biases on Wikipedia and corrective efforts.17 Furthermore, articles about cis. women do have fewer edits than matched comparisons, while articles about Asian Americans do not (even though both were written more recently than comparisons).

The section-level analyses described in §6.2 also provide information about differences in the ways articles are written. In investigating what percent of each article is devoted to each section, we find that articles about racial minorities are significantly more likely to focus on "Early" sections like "Early life", "Early years", and "Early life and education". Articles about cis. women tend to have more space devoted to general sections, like "Career", "Personal Life", and "Life", and less space devoted to more specific sections: "Political Career", "Professional Career", "Coaching Career".

While prior work has suggested that articles about cis. women tend to be longer and more prominent than articles about cis. men, our work has the opposite finding [18, 39, 51, 53]. There are several differences between our work and prior work: use of matching rather than direct men/women comparison, discarding of incomplete "stub" articles, focus on "Living People", and consideration of gender as non-binary. Additionally, Wikipedia is constantly changing, and prior work identifying missing pages of women on Wikipedia has caused editors to create those pages [39]. However, our work suggests that there does still exist a disparity between the quality of articles about cis. men and women. Specifically, articles about cis. women tend to be shorter and more generic. Given the differences in edit counts, greater editor attention to articles about cis. women could reduce this gap.

In contrast, more investigation is needed to examine the disparity in articles about Asian American people, since we do not see a significant difference in the number of edits per article. Edit counts offer an overly simple view of the editing process, and a more in-depth analysis, including examining the type/size of edits, the identity of editors, and discussions between editors, could offer more insights.

16 Edit data was collected in September 2020, several months after the original data collection; a small subset of articles which had been deleted or had URL changes between collection times are not included in edit counts and article age.
17 Examples: https://meta.wikimedia.org/wiki/Gender_gap, https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_in_Red, https://whoseknowledge.org/
There are sev- find that people from traditionally underrepresented races tend to eral differences between our work and prior work: use of match- have more detailed articles than people whose race is not specified. ing rather than direct men/women comparison, discarding of in- However, in Table 8, we show article length differences after match- complete “stub” articles, focus on “Living People”, and considera- ing. In this table, we do not find significant differences between tion of gender as non-binary. Additionally, Wikipedia is constantly matched comparison articles and articles about African American changing, and prior work identifying missing pages of women on and Hispanic/Latinx American people. Instead, we find that articles Wikipedia has caused editors to create those pages [39]. However, about Asian American people are typically shorter than compari- our work suggests that there does still exist a disparity between son articles. Thus, matching suggests that the differences observed the quality articles about cis. men and women. Specifically, articles in Table 8 occur because of differences in confounding variables about cis. women tend to be shorter and more generic. Given the between groups, rather than reflecting racial disparities. differences in edit counts, greater editor attention to articles about cis. women could reduce this gap. 7.2 Metrics Reveal Content Imbalances In contrast, more investigation is needed to examine the disparity We present some of the high-level statistics, including article lengths, in articles about Asian American people, since we do not see a number of edits, and article age in Table 8. While prior work has significant difference in the number of edits per article. Edit counts examined similar statistics between articles about (assumedly cis- offer an overly simple view of the editing process, and a morein- gender) men and women, our work includes non-binary and trans- depth analysis, including examining type/size of edits, identity of gender people, as well as racial subgroups, and we use pivot-slope editors, and discussions between editors could offer more insights. TF-IDF matching to reduce the influence of confounding variables. High-level statistics can identify possible content gaps in Wikipedia 16Edit data was collected in September 2020, several months after the original data and sets of articles that may benefit from additional editing. collection; a small subset of articles which had been deleted or had URL changes The leftmost columns in Table 8 show that articles about Asian between collection times are not included in edit counts and article age. 17Examples: https://meta.wikimedia.org/wiki/Gender_gap Americans and cis. women tend to be shorter than comparison https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Women_in_Red articles. Shorter articles can indicate that articles were written less https://whoseknowledge.org/ Anjalie Field, Chan Young Park, and Yulia Tsvetkov

                               Article Lengths         Edit History          Article Age         # of Languages
                              Target   Comparison    Target   Comparison   Target   Comparison   Target   Comparison
    African American           942.8      955.5      243.80     247.02     128.35     136.06      6.2       6.7
    Asian American            795.64      854.6      193.12     197.74     123.05     130.19      5.93      7.08
    Hispanic/Latinx American 1017.37    1028.11      293.50     278.14     129.99     137.47      7.54      7.62
    Non-Binary                1002.5     783.45      363.77     181.48      92.54     115.71      6.48      4.24
    Cisgender women            665.2      786.0      126.58     150.50     110.86     129.19      5.43      6.26
    Transgender women        1106.30     839.32      265.21     161.46     117.74     135.64      6.76      4.65
    Transgender men            648.1      845.4      116.44     164.06      96.37     125.41      3.8       5.6

Table 8: Averaged statistics for articles in each target group and matched comparisons, where matching is conducted with pivot-slope TF-IDF. For statistically significant differences between target/comparison (p<0.05), the greater value is in bold.

Nevertheless, our results support work contradicting the "model minority myth" [12] in suggesting that Asian Americans are not exempt from prejudice and racial disparities.

7.3 Non-English Statistics Reveal "Local Heroes"

While §7.2 focuses on English articles, here we consider other language editions of Wikipedia. The rightmost columns in Table 8 show how many languages each set of articles is available in. Articles about African American people, Asian American people, and cis. women are typically available in fewer languages than comparison articles. In contrast, articles about non-binary people and transgender women are available in significantly more languages (discussed in more detail in §7.4). Language availability indicates possible content gaps that could perpetuate societal biases—a user searching non-English Wikipedia editions is less likely to find biography pages of African Americans than of other Americans.

When we examine differences by language (for each language, what percentage of target vs. comparison articles are available?), we find that the difference in total language availability is not driven by disparities in a few specific languages, but rather occurs broadly. Articles about African Americans are significantly more likely to have versions in Haitian, Yoruba, Swahili, Punjabi, and Ido, and less likely in 44 other languages. Articles about Asian Americans are significantly more likely to have versions in Hindi, Punjabi, Chinese, Tagalog, Tamil, and Thai, and less likely in 42 other languages. Similarly, articles about cis. women are more available in 11 languages and less available in 38 languages. For reference, articles about Latinx/Hispanic people, for which we do not see a significant difference in overall language availability (Table 8), are significantly more likely to have versions in Spanish and Haitian and less likely in 8 other languages. These results support prior work on "local heroes" showing that a person's biography is more likely to be available in languages common to the person's nationality [10, 17, 21]. Our results show that this pattern holds for a person's ethnicity and background, beyond current nationality. These results also show that reducing the observed language availability gap for African Americans, Asian Americans, and cis. women requires substantial effort, as it requires adding articles in a broad variety of languages, rather than focusing on adding articles in a few select languages.

Furthermore, we can examine additional information gaps by considering how the lengths of the same articles differ between languages. As discussed in §7.2, article lengths can differ because of a number of factors, including ones independent from Wikipedia. However, length differences in one language that do not exist in another language for the same set of articles indicate a content gap that can be addressed through additional editing—we know that the disparity does not occur because of external factors, because it does not exist in the other language. Here we highlight two observed disparities. For 234 articles about African American people, both the article and its match are available in Chinese. The articles about African Americans are significantly shorter than their matches in Chinese (target length: 25.4, comparison length: 35.1, p-val: 0.049), but not in English (target: 2,740.8, comparison: 2,468.7, p-val: 0.130). Similarly, for the 912 matched pairs available in Spanish, the articles about African Americans are significantly shorter in Spanish (target: 656.6, comparison: 790.3, p-val: 0.012), but not in English (target: 1,629.0, comparison: 1,501.4, p-val: 0.060). These results suggest that articles about African Americans are written less carefully in Chinese and Spanish than articles about other Americans.
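A sketch of this cross-language check follows; it assumes each article is represented as a hypothetical dict mapping language codes to article text, and reuses a paired t-test as above.

```python
from scipy import stats

def length_gap_in_language(matched_pairs, lang):
    """Compare matched target/comparison article lengths in one language
    edition, restricted to pairs where both articles exist there.

    matched_pairs: list of (target, comparison); each article is a dict
    mapping a language code (e.g. "zh", "es", "en") to its text.
    """
    lengths = [(len(t[lang].split()), len(c[lang].split()))
               for t, c in matched_pairs if lang in t and lang in c]
    target, comparison = zip(*lengths)
    _, p_value = stats.ttest_rel(target, comparison)
    return {"pairs": len(lengths),
            "target_mean": sum(target) / len(target),
            "comparison_mean": sum(comparison) / len(comparison),
            "p_value": p_value}

# e.g. the Chinese vs. English comparison reported above:
# print(length_gap_in_language(african_american_pairs, "zh"))
# print(length_gap_in_language(african_american_pairs, "en"))
```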
7.4 Statistics Align with Social Theories

While our goal in this work is not a thorough investigation of social theories, we briefly discuss some connections between trends in our data and existing theories, in order to demonstrate how our data and methodology can support this line of research and how these theories can provide insight into our data. Specifically, we highlight "the glass ceiling effect" and "exoticizing" or "othering".

Wagner et al. [52] show that women on Wikipedia tend to be more notable than men and suggest that Wikipedia's entry barrier serves as a subtle glass ceiling: there is a higher bar for women to have a biography article than for men. We can find evidence of glass ceiling effects by comparing article availability and length across languages. Specifically, articles about African American people are significantly less available in German (target: 29.86%, comparison: 31.86%, p-val: 0.0023). However, for the subset of 1,399 pairs for which both the matched target and comparison articles are available in German, the English target articles are significantly longer than the comparison articles (target: 1,396.2, comparison: 1,264.4, p-val: 0.007). While numerous factors can account for why an article may not be available in a particular language, our results suggest that one possible reason for the difference in German is that articles about African American people only exist if they are about particularly noteworthy people with long, well-written articles, whereas a broader variety of articles about other Americans are available.
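The availability comparison just described can be sketched as a two-proportion z-test; this excerpt does not name the exact test used, and McNemar's test over matched pairs would be another defensible choice.

```python
import math
from scipy import stats

def availability_gap(target_has_lang, comparison_has_lang):
    """Two-proportion z-test for language availability.

    Arguments are boolean lists: whether each target / comparison article
    has a version in the language of interest (e.g. German).
    """
    n_t, n_c = len(target_has_lang), len(comparison_has_lang)
    p_t, p_c = sum(target_has_lang) / n_t, sum(comparison_has_lang) / n_c
    pooled = (sum(target_has_lang) + sum(comparison_has_lang)) / (n_t + n_c)
    z = (p_t - p_c) / math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    return p_t, p_c, 2 * stats.norm.sf(abs(z))  # two-sided p-value

# e.g. share of African American vs. comparison articles available in German:
# p_target, p_comparison, p_value = availability_gap(target_de, comparison_de)
```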

Additionally, our results show that the biography pages of transgender women and non-binary people tend to be longer and available in more languages than matched comparison articles (Table 8), although length differences are not statistically significant, likely due to small data size. Non-binary people tend to have longer "Career" sections (target: 29.36%, comparison: 20.35%, p-val: 0.016; differences for transgender women are not significant), which could indicate a glass ceiling effect; non-binary people only have articles written about them if they have particularly noteworthy careers. However, one contributing factor to article length is that in these pages, the focal person's gender identity is often highlighted. Both non-binary people and transgender women tend to have a higher percentage of their article devoted to the "Personal Life" section (non-binary: 8.37%, comparison: 2.58%, p-val: 8.30E-06; transgender women: 7.12%, comparison: 1.76%, p-val: 1.58E-05). In examining these pages, personal life sections often focus on gender identity, including the person's self-disclosure of their gender identity and preferred pronouns. The implication that gender identity is a noteworthy trait for just these minority groups is possibly indicative of "othering" or "exoticization", where individuals are distinguished or labeled as not fitting a societal norm, which often occurs in the context of gender or sexual orientation [35, 36]. We briefly mention a few social theories to demonstrate the relevance of our analysis and leave in-depth exploration for future work.

7.5 Intersectionality Uncovers Differing Trends

Recent literature has called for a focus on intersectionality and shown that discrimination cannot be properly studied along a single axis. For example, a focus on race tends to highlight gender- or class-privileged people (e.g. Black men), while a focus on gender tends to highlight race- or class-privileged people (e.g. white women), which further marginalizes people who face discrimination along multiple axes (e.g. Black women) [14, 38, 46]. Although our work only focuses on two dimensions, which is not sufficient for representing identity, we can expand on the single-dimensional analyses shown previously by considering intersected dimensions. While we could compute all intersections, we focus on African American cis. women for alignment with prior work [14]. We compare this target group with three comparison groups: unmarked American cis. women, African American cis. men, and unmarked American cis. men, separately conducting matching for each set.

                                 Target    Comparison    p-value
    Article length
      vs. unmarked Amer. women    906.78      868.84      0.13
      vs. African Amer. men      1013.09      969.39      0.23
      vs. unmarked Amer. men     1012.60      956.99      0.11
    # of languages
      vs. unmarked Amer. women      5.93        7.11      3.80E-09
      vs. African Amer. men         6.54        6.38      0.57
      vs. unmarked Amer. men        6.10        5.60      0.035

Table 9: Article lengths (top) and language availability (bottom) for African American cis. women vs. different comparison groups. For significant results, the greater value is bolded. Paired data sizes were 2,930 (vs. unmarked women), 2,009 (vs. African American men), and 2,133 (vs. unmarked men). Sizes differ slightly for different comparison groups because of the exclusion of articles without close matches.

Table 9 reports article lengths and number of available languages. Notably, articles about African American cis. women are translated into significantly fewer languages than unmarked American cis. women, even though these articles are often longer (though not significantly). In contrast, articles about African American cis. women are available in more languages than articles about unmarked American cis. men. These results suggest that African American women tend to have more global notoriety than comparable American men, possibly a "glass ceiling" or an "othering" effect. However, African American women do not receive as much global recognition on Wikipedia as comparable American women. This result supports the theory that focusing on gender biases without considering other demographics can mask marginalization and fails to consider that some women face more discrimination than others. We present only one example of intersected dimensions, but our methodology can be applied to other combinations.

7.6 Limitations

There are numerous types of bias that our methodology does not capture. As discussed in §2, we do not compare articles against external databases, which precludes identifying many types of coverage bias. Additionally, our matching method depends on categories, which are imperfect controls, and articles with few categories are excluded. Our method does not capture whether certain articles have better category tags than others, and systemic differences in category tagging quality could reduce the reliability of matching. Future work could involve developing additional matching features. Relatedly, the pivot-slope correction relies on tuning the slope, and changing this parameter does change the selected matches. We tune this parameter on held-out data, and results do suggest that it was set correctly: for example, we find that articles for Asian Americans, but not African Americans, are significantly shorter than comparisons, whereas a slope that was set too low would favor long comparison articles and reveal that all targets were significantly shorter than comparisons. However, improperly setting the slope could introduce bias.
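For reference, a minimal sketch of the pivot-slope correction follows, under our reading of pivoted length normalization [47] applied to TF-IDF vectors; it illustrates the general technique, not the authors' exact matching pipeline.

```python
import numpy as np

def pivot_slope_normalize(tfidf, slope):
    """Pivoted length normalization (Singhal et al. [47]) for TF-IDF rows.

    tfidf: (docs x terms) array of unnormalized TF-IDF vectors.
    slope=1.0 recovers plain norm-based (cosine-style) normalization;
    lower slopes pull every denominator toward the average norm, so long
    documents are penalized less and score higher similarities.
    """
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    pivot = norms.mean()  # the pivot: the average document norm
    return tfidf / ((1.0 - slope) * pivot + slope * norms)

def nearest_candidates(target_vecs, candidate_vecs):
    """Pick, for each target, the highest-dot-product candidate.

    With slope < 1 the vectors are deliberately not unit length, which is
    how the correction trades off topical similarity against length.
    """
    return (target_vecs @ candidate_vecs.T).argmax(axis=1)
```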
Additionally, we take a U.S.-centric approach in our examination of race, and we also use English Wikipedia as a starting point, because it has the most active editor community and we expect categories in English to be more reliable than in other languages. A more thorough analysis would include articles about non-Americans that may not have versions in English. We also treat identity as fixed at the time of data collection, but given the fluidity of identity, values inferred in this work may not be correct in the future or in contexts beyond Wikipedia. Finally, it is difficult to determine whether biases are the result of Wikipedia editing, societal biases, or other factors. While our methodology aims to isolate specific social dimensions from other confounding variables, we do not suggest that it implies causal relations, and we view its main use case as identifying sets of articles that may benefit from additional manual investigation and editing.
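One way to probe the category-quality concerns above is a covariate-balance diagnostic from the matching literature [20, 48]; below is an illustrative standardized-mean-difference check over category indicator features, not a procedure described in the paper.

```python
import numpy as np

def standardized_mean_difference(target_feats, comparison_feats):
    """Per-feature standardized mean difference, a common balance
    diagnostic for matched samples [20, 48].

    target_feats, comparison_feats: (articles x features) arrays, e.g.
    0/1 indicators for Wikipedia categories. Values near 0 indicate good
    balance; |SMD| > 0.1 is a conventional flag for imbalance.
    """
    mean_t = target_feats.mean(axis=0)
    mean_c = comparison_feats.mean(axis=0)
    pooled_sd = np.sqrt((target_feats.var(axis=0) + comparison_feats.var(axis=0)) / 2)
    return (mean_t - mean_c) / np.where(pooled_sd == 0, 1.0, pooled_sd)

# categories with large |SMD| after matching point to residual confounds:
# smd = standardized_mean_difference(target_matrix, comparison_matrix)
# worst = np.argsort(-np.abs(smd))[:10]
```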
8 CONCLUSIONS

We present a method that can be used to control for confounding variables when analyzing Wikipedia articles or building training data sets. While we focus on Wikipedia biography pages, this method could be used to construct controlled corpora for any set of documents with category metadata. As demonstrated through our initial analysis, our methodology can help identify systemic differences between sets of articles, facilitate analysis of specific social theories, and provide guidance for reducing bias in corpora.

ACKNOWLEDGMENTS

We would like to thank Martin Gerlach, Kevin Lin, Xinru Yan, and Michael Miller Yoder for their helpful feedback. This material is based upon work supported by the NSF Graduate Research Fellowship Program under Grant No. DGE1745016 and the Google PhD Fellowship program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

REFERENCES
[1] Julia Adams, Hannah Brückner, and Cambria Naslund. 2019. Who Counts as a Notable Sociologist on Wikipedia? Gender, Race, and the "Professor Test". Socius 5 (2019), 1–14. https://doi.org/10.1177/2378023118823946
[2] Judd Antin, Raymond Yee, Coye Cheshire, and Oded Nov. 2011. Gender differences in Wikipedia editing. In Proc. of Symposium on Wikis and Open Collaboration. ACM, New York, 11–14. https://doi.org/10.1145/2038558.2038561
[3] David Bamman and Noah A Smith. 2014. Unsupervised discovery of biographical structure from text. TACL 2 (2014), 363–376. https://doi.org/10.1162/tacl_a_00189
[4] Patti Bao, Brent Hecht, Samuel Carton, Mahmood Quaderi, Michael Horn, and Darren Gergle. 2012. Omnipedia: Bridging the Wikipedia language gap. In Proc. of SIGCHI. ACM, New York, 1075–1084. https://doi.org/10.1145/2207676.2208553
[5] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. JMLR 3 (2003), 993–1022.
[6] Su Lin Blodgett, Lisa Green, and Brendan O'Connor. 2016. Demographic Dialectal Variation in Social Media: A Case Study of African-American English. In Proc. of EMNLP. ACL, Austin, TX, 1119–1130. https://doi.org/10.18653/v1/D16-1120
[7] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proc. of NeurIPS. Curran Associates Inc., Red Hook, NY, 4349–4357. https://doi.org/10.5555/3157382.315758
[8] Wayne Brekhus. 1998. A sociology of the unmarked: Redirecting our focus. Sociological Theory 16, 1 (1998), 34–51.
[9] Mary Bucholtz and Kira Hall. 2005. Identity and interaction: A sociocultural linguistic approach. Discourse Studies 7, 4-5 (2005), 585–614. https://doi.org/10.1177/1461445605054407
[10] Ewa S Callahan and Susan C Herring. 2011. Cultural bias in Wikipedia content on famous persons. Journal of the American Society for Information Science and Technology 62, 10 (2011), 1899–1915. https://doi.org/10.1002/asi.21577
[11] Eshwar Chandrasekharan, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. You can't stay here: The efficacy of Reddit's 2015 ban examined through hate speech. In Proc. of CSCW. ACM, New York, 1–22. https://doi.org/10.1145/3134666
[12] Rosalind S Chou and Joe R Feagin. 2015. Myth of the model minority: Asian Americans facing racism. Routledge, New York.
[13] Benjamin Collier and Julia Bear. 2012. Conflict, confidence, or criticism: An empirical examination of the gender gap in Wikipedia. In Proc. of CSCW. ACM, New York, 383–392. https://doi.org/10.1145/2145204.2145265
[14] Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. of Chicago Legal Forum 1989, 8 (1989), 139.
[15] Stine Eckert and Linda Steiner. 2013. Wikipedia's gender gap. In Media disparity: A gender battleground, Cory L. Armstrong (Ed.). Lexington Bks, Plymouth, 87–98.
[16] Naoki Egami, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. 2018. How to make causal inferences using texts. Working Paper (2018). https://arxiv.org/abs/1802.02163
[17] Young-Ho Eom, Pablo Aragón, David Laniado, Andreas Kaltenbrunner, Sebastiano Vigna, and Dima L Shepelyansky. 2015. Interactions of cultures and top people of Wikipedia from ranking of 24 language editions. PloS one 10, 3 (2015), 1–27. https://doi.org/10.1371/journal.pone.0114825
[18] Eduardo Graells-Garrido, Mounia Lalmas, and Filippo Menczer. 2015. First women, second sex: Gender bias in Wikipedia. In Proc. of Hypertext & Social Media. ACM, New York, 165–174.
[19] Alex Hanna, Emily Denton, Andrew Smart, and Jamila Smith-Loud. 2020. Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 501–512.
[20] Valerie S Harder, Elizabeth A Stuart, and James C Anthony. 2010. Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods 15, 3 (2010), 234. https://doi.org/10.1037/a0019623
[21] Brent Hecht and Darren Gergle. 2010. The tower of Babel meets web 2.0: User-generated content and its applications in a multilingual context. In Proc. of SIGCHI. ACM, New York, 291–300. https://doi.org/10.1145/1753326.1753370
[22] Stefan Heindorf, Yan Scholten, Gregor Engels, and Martin Potthast. 2019. Debiasing Vandalism Detection Models at Wikidata. In Proc. of WWW. ACM, New York, 670–680. https://doi.org/10.1145/3308558.3313507
[23] Benjamin Mako Hill and Aaron Shaw. 2013. The Wikipedia gender gap revisited: Characterizing survey response bias with propensity score estimation. PloS one 8, 6 (2013), 1–5. https://doi.org/10.1371/journal.pone.0065782
[24] Laura Hollink, Astrid van Aggelen, and Jacco van Ossenbruggen. 2018. Using the web of data to study gender differences in online knowledge sources: the case of the European parliament. In Proc. of WebSci. ACM, New York, 381–385. https://doi.org/10.1145/3201064.3201108
[25] Nicolas Jullien. 2012. What We Know About Wikipedia: A Review of the Literature Analyzing the Project(s). 86 pages. https://hal.archives-ouvertes.fr/hal-00857208
[26] Katherine A. Keith, David Jensen, and Brendan O'Connor. 2020. Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates. In Proc. of ACL. ACL, Online, 5332–5344. https://doi.org/10.18653/v1/2020.acl-main.474
[27] Josef Kolbitsch and Hermann A Maurer. 2006. The transformation of the Web: How emerging communities shape the information we consume. J. UCS 12, 2 (2006), 187–213.
[28] Piotr Konieczny and Maximilian Klein. 2018. Gender gap through time and space: A journey through Wikipedia biographies via the Wikidata Human Gender Indicator. New Media & Society 20, 12 (2018), 4608–4633.
[29] Shyong (Tony) K Lam, Anuradha Uduwage, Zhenhua Dong, Shilad Sen, David R Musicant, Loren Terveen, and John Riedl. 2011. WP: clubhouse? An exploration of Wikipedia's gender imbalance. In Proc. of Symposium on Wikis and Open Collaboration. ACM, New York, 1–10. https://doi.org/10.1145/2038558.2038560
[30] Wei Luo, Julia Adams, and Hannah Brueckner. 2018. The ladies vanish?: American sociology and the genealogy of its missing women on Wikipedia. Comparative Sociology 17, 5 (2018), 519–556.
[31] Paolo Massa and Federico Scrinzi. 2012. Manypedia: Comparing language points of view of Wikipedia communities. In Proc. of Symposium on Wikis and Open Collaboration. ACM, New York, 1–9. https://doi.org/10.1145/2462932.2462960
[32] Burt L Monroe, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16, 4 (2008), 372–403.
[33] Marçal Mora-Cantallops, Salvador Sánchez-Alonso, and Elena García-Barriocanal. 2019. A systematic literature review on Wikidata. Data Technologies and Applications 53, 3 (2019), 250–268. https://doi.org/10.1108/DTA-12-2018-0110
[34] Jonathan T Morgan, Siko Bouterse, Heather Walls, and Sarah Stierch. 2013. Tea and sympathy: Crafting positive new user experiences on Wikipedia. In Proc. of CSCW. ACM, New York, 839–848. https://doi.org/10.1145/2441776.2441871
[35] Alison Mountz. 2009. The Other. In Key Concepts in Human Geography: Key Concepts in Political Geography. SAGE, London, 328–338.
[36] Kevin L Nadal, Avy Skolnik, and Yinglee Wong. 2012. Interpersonal and systemic microaggressions toward transgender people: Implications for counseling. Journal of LGBT Issues in Counseling 6, 1 (2012), 55–82.
[37] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proc. of NAACL. ACL, New Orleans, LA, 2227–2237.
[38] Yolanda A Rankin and Jakita O Thomas. 2019. Straighten up and fly right: Rethinking intersectionality in HCI research. Interactions 26, 6 (2019), 64–68.
[39] Joseph Reagle and Lauren Rhue. 2011. Gender bias in Wikipedia and Britannica. International Journal of Communication 5 (2011), 21.
[40] Miriam Redi, Besnik Fetahu, Jonathan Morgan, and Dario Taraborelli. 2019. Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability. In Proc. of WWW. ACM, New York, 1567–1578. https://doi.org/10.1145/3308558.3313618
[41] Margaret E Roberts, Brandon M Stewart, and Richard A Nielsen. 2020. Adjusting for confounding with text matching. American Journal of Political Science 64 (2020), 887–903. https://doi.org/10.1111/ajps.12526
[42] Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
[43] Wendy D Roth. 2016. The multiple dimensions of race. Ethnic and Racial Studies 39, 8 (2016), 1310–1338.
[44] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[45] Flavia Salutari, Diego Da Hora, Gilles Dubuc, and Dario Rossi. 2019. A Large-Scale Study of Wikipedia Users' Quality of Experience. In Proc. of WWW. ACM, New York, 3194–3200. https://doi.org/10.1145/3308558.3313467
[46] Ari Schlesinger, W Keith Edwards, and Rebecca E Grinter. 2017. Intersectional HCI: Engaging identity through gender, race, and class. In Proc. of CHI. ACM, New York, 5412–5427. https://doi.org/10.1145/3025453.3025766
[47] Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proc. of SIGIR, Vol. 51. ACM, New York, 176–184.
[48] Elizabeth Stuart. 2010. Matching methods for causal inference: A review and a look forward. Statistical Science 25, 1 (2010), 1–21. https://doi.org/10.1214/09-STS313
[49] Sara Trechter and Mary Bucholtz. 2001. Introduction: White noise: Bringing language into whiteness studies. Journal of Linguistic Anthropology 11, 1 (2001), 3–21.
[50] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[51] Claudia Wagner, David Garcia, Mohsen Jadidi, and Markus Strohmaier. 2015. It's a man's Wikipedia? Assessing gender inequality in an online encyclopedia. In Proc. of ICWSM. AAAI, Oxford, 454–463.
[52] Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science 5, 1 (2016), 5.
[53] Amber Young, Ari D Wigdor, and Gerald Kane. 2016. It's not what you think: Gender bias in information about Fortune 1000 CEOs on Wikipedia. In Proc. of ICIS. AIS, Ft. Worth, 1–16.
[54] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proc. of EMNLP. ACL, Copenhagen, 2979–2989. https://doi.org/10.18653/v1/D17-1323