Bias in Wikipedia: Different Links, Different Stories
Raine Hoover
CS 229 Final Report: December 16th, 2016

Abstract

We attempt to uncover the latent biases in different narratives on Wikipedia by investigating the structure of the within-Wikipedia link graphs generated from a seed article of interest. The case study presented in this report is the Wikipedia article entitled "Arab-Israeli Conflict". We perform three experiments on this dataset: 1) hierarchical clustering on personalized PageRank vectors for each of the largest languages on Wikipedia, 2) logistic regression classification of an article as 'Hebrew' or 'Arabic' based solely on the article's links, 3) principal component analysis of the articles written in Hebrew and Arabic. We find that while clustering and PCA are inconclusive with regard to language being a primary explanation of variance between link structures, we can accurately classify the language of an article based on its concept links.

1 Introduction

There are many different historical narratives for the same underlying set of historical events. For instance, when examining the Israeli-Palestinian conflict, there are two dominant narratives: the Pro-Palestinian side tells a story in which immigrant Jews systematically displaced and oppressed native Palestinians, while the Pro-Israeli side presents a story of an oppressed people returning to their homeland. How can the same facts lead to such different narratives, and how can we relate multiple narratives to each other?

Wikipedia affords a unique opportunity to study these questions. Each article topic is resolved across all languages via the project Wikidata (Wikidata, 2016), so each language can serve as a proxy for a narrative surrounding that topic. We examine the within-Wiki link structure for different seed articles, looking for patterns across languages that correspond to geopolitical alignments around historical events.

Specifically, for the networks generated from the seed article "Arab-Israeli Conflict", we run: 1) hierarchical agglomerative clustering on each language's personalized PageRank vector starting from the seed article, 2) logistic regression on the adjacency matrices for Hebrew and Arabic, classifying whether a given article is written in Hebrew or Arabic based on which concepts it does or does not link to, 3) PCA on the adjacency matrices for Hebrew and Arabic, investigating the main axes of variance across individual article link structure.

2 Related Works

There has been some research into the differences in link structure across languages on Wikipedia. In particular, the project Omnipedia (Bao et al., 2012) highlights these differences, but leaves analysis to the user. With regard to cultural bias in Wikipedia, many of these investigations have been descriptive and/or limited to case studies, focusing for instance on coverage of famous people in English and Polish (Callahan and Herring, 2011). Another study by Laufer et al. focuses on several case studies, examining how different European cultures perceive one another's food practices. The authors use variations on the Jaccard similarity between the inter-Wikipedia links of the same article across languages to estimate the level of "cultural affinity" and "cultural understanding" between two European cultures (Laufer et al., 2015). While this study makes use of the within-Wikipedia link graph, it does not take into account the structure of these networks as we do in the PageRank clustering task, and looks only at pairwise comparisons between articles' link structures across languages. Furthermore, while it seeks to quantify agreement and understanding between cultures, we hope to delve into the underlying sources of cultural difference and distance.

Hecht et al. adopt an approach most similar to the current work, looking at the network structure of the within-Wikipedia link graph across different languages to investigate a form of bias called self-focus, the tendency of a language to emphasize the home region of that language (Hecht and Gergle, 2009). While Hecht et al. use similar methods to study the nature of different languages' Wikipedias, the current study goes deeper in asking how languages with very close or overlapping home regions diverge in their coverage of, or emphasis on, particular concepts dealing with these home regions.

3 Dataset and Graph Construction

Wikidata is a project that maintains a mapping from every Wikipedia article in every language to a resolved ID for that topic across all Wikipedias. For example, "The Six Day War" at en.wikipedia.org and "Guerre des Six Jours" at fr.wikipedia.org both refer to the resolved ID Q49077 (Wikidata, 2016). Wikidata provides a dump of its mappings, which we processed to get a mapping from the language code concatenated with the article title to the resolved ID for that article topic. The dump used for this milestone is slightly out of date (from November 2015), but the holes were negligible: on average, approximately 0.01% of the links discovered in any given article were not present in the mapping generated from this stale dump.

We then began with the seed article previously mentioned, "Arab-Israeli Conflict", and queried the Wikipedia API using pywikibot (Pywikibot, 2016) to get the links for this article, which we then mapped to their resolved IDs using the mappings from Wikidata. If we encountered a link that did not have a resolved ID in Wikidata's mapping, we first resolved any redirects this link may lead to, and attempted to find a resolved ID for the article title of the redirect. We repeated this sequence in a depth-first search, allowing for a depth of 2 hops, to construct the resolved inter-wiki link graph for each language (see Figure 1).

Figure 1: Example graphs: different link structures over the same set of concepts (c1, c2, ..., c5) in different languages' Wikipedias (l1, l2, ..., l39).

3.1 Data for PageRank Clustering

While we constructed graphs for all 75 languages that have articles on the Arab-Israeli conflict, we were only interested in clustering the largest 45 Wikipedias, a list obtained from Wikipedia's reporting of its Wikis' size (Wikipedia, 2016). Our interest in larger Wikipedias stems from the fact that a larger Wikipedia implies a more trafficked site and a more impactful narrative; in addition, a larger Wikipedia could mean more thorough coverage and a richer graph to study. Of these 45 languages, we were able to obtain a resolved ID for the seed article "Arab-Israeli Conflict", and thereby construct a link graph, for 39. We report on the size of these graphs in Table 1. In total, we examine 39 languages with 207,375 resolved article IDs (concepts).

Since we were aiming to model what a reader in a particular language would see were he or she to surf through Wikipedia's offerings starting from the seed [...] article in each language's network. Thus for our clustering task we were working with the matrix A, where $A_{ij}$ is the PageRank for concept (resolved article ID) j in language i's concept graph.

3.2 Data for Logistic Regression and PCA

For our classification work and for PCA, we limit the languages we are investigating to just Hebrew and Arabic, and examine the adjacency matrix of the articles and concepts in the two languages. That is, we work with the matrix B, where:

\[
B_{ij} =
\begin{cases}
1 & \text{if article } i \text{ has a link to concept } j \\
0 & \text{otherwise}
\end{cases}
\]

An article i can be in either Hebrew or Arabic, and a concept j is resolved across both languages. Of the 54,328 total articles we examine, 28,502 are from the Arabic Wikipedia and 25,826 are from the Hebrew Wikipedia. Across both languages, there are 47,521 concepts. Of these concepts, 19,412 have articles in both Hebrew and Arabic that can be cited (as opposed to concepts that appear only as either Hebrew or Arabic articles). We run our experiments both on this "intersect" graph, limited to concepts with articles in both languages, and on the graph with all 47,521 concepts, and note any differences in outcomes.

For the classification task, we take a stratified split of the data between training and test sets. In the test set, there are 14,174 Arabic articles and 12,990 Hebrew articles, for a total of 27,164 articles. In the training set, there are 14,328 Arabic articles and 12,836 Hebrew articles, also for a total of 27,164 articles.

For PCA, before running the algorithm we standardize the data: we make the mean of each feature column 0, and the standard deviation of each feature column 1.

4 Methodology

4.1 Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering is an unsupervised learning algorithm that iteratively groups samples in a bottom-up fashion, merging those samples or groups of samples that are closest to each other by some distance metric until a set number of clusters (usually one) remains. Hierarchical clustering in general is most useful for cases where the number of clusters is unknown, in contrast to k-means clustering, where the number of clusters must be specified (Manning et al., 2008). There are a variety of algorithms for hierarchical clustering, each of which defines the distance between two clusters differently. In accordance with our aim of finding the differences between languages' link structures, we focus on complete-link clustering, in which the distance between two clusters is defined as the distance between their two most distant members. We experimented with a variety of distance metrics, but as we want to examine the similarity between the
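As a rough illustration of the adjacency matrix B used for classification and PCA, the sketch below builds a small dense 0/1 matrix from a dictionary of article links. The function name `build_adjacency`, the toy article keys, and the label convention (1 for Hebrew, 0 for Arabic) are assumptions made for this example and are not taken from the report.

```python
def build_adjacency(article_links):
    """Build a dense 0/1 article-by-concept matrix.

    article_links maps a (language, title) pair to the set of resolved
    concept IDs that the article links to (hypothetical input format).
    """
    concepts = sorted({c for links in article_links.values() for c in links})
    column = {c: j for j, c in enumerate(concepts)}
    articles = sorted(article_links)
    B = [[0] * len(concepts) for _ in articles]
    for i, article in enumerate(articles):
        for concept in article_links[article]:
            B[i][column[concept]] = 1  # article i links to concept j
    return articles, concepts, B


# Toy data, invented for illustration only.
toy_links = {
    ("ar", "article_1"): {"Q1", "Q3"},
    ("he", "article_2"): {"Q2", "Q3"},
}
articles, concepts, B = build_adjacency(toy_links)
# Assumed label convention for the classifier: 1 = Hebrew, 0 = Arabic.
labels = [1 if lang == "he" else 0 for lang, _ in articles]
```

At the scale reported here (54,328 articles by 47,521 concepts), a dense list of lists would be wasteful; a sparse representation such as `scipy.sparse.csr_matrix` would be a more realistic choice.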
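The lookup with redirect fallback described for link resolution might be sketched as follows. This is a minimal illustration under stated assumptions: the `resolve_link` helper, the `"lang:title"` key format, and the in-memory `redirects` dictionary are hypothetical stand-ins, and the redirect entry in the toy data is invented. The report obtained links and redirects through pywikibot, which is not called here.

```python
def resolve_link(lang, title, id_map, redirects):
    """Return the resolved ID for a link, or None if it cannot be resolved.

    id_map:    maps "lang:title" to a resolved ID (key format is an assumption).
    redirects: maps (lang, title) to the title the link redirects to
               (hypothetical stand-in for a redirect lookup via the API).
    """
    key = f"{lang}:{title}"
    if key in id_map:
        return id_map[key]
    # Fall back to the redirect target, if the link is a redirect.
    target = redirects.get((lang, title))
    if target is not None:
        return id_map.get(f"{lang}:{target}")
    return None


id_map = {
    "en:The Six Day War": "Q49077",
    "fr:Guerre des Six Jours": "Q49077",
}
# Invented redirect entry, for illustration only.
redirects = {("en", "1967 War"): "The Six Day War"}

direct = resolve_link("fr", "Guerre des Six Jours", id_map, redirects)
via_redirect = resolve_link("en", "1967 War", id_map, redirects)
missing = resolve_link("en", "No Such Article", id_map, redirects)
```

Links that remain unresolved after the fallback return `None`, which corresponds to the small fraction of links reported as missing from the stale dump.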