Bias in Wikipedia: Different Links, Different Stories

Raine Hoover ([email protected])
CS 229 Final Report: December 16th, 2016

Abstract

We attempt to uncover the latent biases in different narratives on Wikipedia by investigating the link structure of the within-Wikipedia link graphs generated from a seed article of interest. The case study presented in this report is the Wikipedia article entitled "Arab-Israeli Conflict". We perform three experiments on this dataset: 1) hierarchical clustering on personalized PageRank vectors for each of the largest languages on Wikipedia, 2) logistic regression classification of an article as "Hebrew" or "Arabic" based solely on the article's links, 3) principal component analysis of the articles written in Hebrew and Arabic. We find that while clustering and PCA are inconclusive with regard to language being a primary explanation of variance between link structures, we can accurately classify the language of an article based on its concept links.

1 Introduction

There are many different historical narratives for the same underlying set of historical events. For instance, when examining the Israeli-Palestinian conflict, there are two dominant narratives: the Pro-Palestinian side tells a story in which immigrant Jews systematically displaced and oppressed native Palestinians, while the Pro-Israeli side presents a story of an oppressed people returning to their homeland. How can the same facts lead to such different narratives, and how can we relate multiple narratives to each other?

Wikipedia affords a unique opportunity to study these questions. Each article topic is resolved across all languages via the Wikidata project (Wikidata, 2016), so each language can serve as a proxy for a narrative surrounding that topic. We examine the within-Wiki link structure for different seed articles, looking for patterns across languages that correspond to geopolitical alignments around historical events.

Specifically, for the networks generated from the seed article "Arab-Israeli Conflict", we run: 1) hierarchical agglomerative clustering on each language's personalized PageRank vector starting from the seed article, 2) logistic regression on the adjacency matrices for Hebrew and Arabic, classifying whether a given article is written in Hebrew or Arabic based on which concepts it does or does not link to, and 3) PCA on the adjacency matrices for Hebrew and Arabic, investigating the main axes of variance across individual article link structures.

2 Related Works

There has been some research into the differences in link structure across languages on Wikipedia. In particular, the project Omnipedia (Bao et al., 2012) highlights these differences, but leaves analysis to the user.

With regard to cultural bias in Wikipedia, most investigations have been descriptive and/or limited to case studies, focusing for instance on coverage of famous people in English and Polish (Callahan and Herring, 2011). Another study, by Laufer et al., focuses on several case studies examining how different European cultures perceive one another's food practices. The authors use variations on the Jaccard similarity between the inter-Wikipedia links of the same article across languages to estimate the level of "cultural affinity" and "cultural understanding" between two European cultures (Laufer et al., 2015). While this study makes use of the within-Wikipedia link graph, it does not take into account the structure of these networks as we do in the PageRank clustering task, and it looks only at pairwise comparisons between articles' link structures across languages. Furthermore, while it seeks to quantify agreement and understanding between cultures, we hope to delve into the underlying sources of cultural difference and distance.

Hecht and Gergle adopt an approach most similar to the current work, looking at the network structure of the within-Wikipedia link graph across different languages to investigate a form of bias called self-focus, the tendency of a language to emphasize the home region of that language (Hecht and Gergle, 2009). While they use similar methods to study the nature of different languages' self-focus, the current study goes deeper in asking how languages with very close or overlapping home regions diverge in their coverage of, or emphasis on, particular concepts dealing with these home regions.

3 Dataset and Graph Construction

Wikidata is an organization that maintains a mapping from every Wikipedia article in every language to a resolved ID for that topic across all Wikipedias. For example, "The Six Day War" at en.wikipedia.org and "Guerre des Six Jours" at fr.wikipedia.org both refer to the resolved ID Q49077 (Wikidata, 2016). Wikidata provides a dump of its mappings, which we processed to get a mapping from the language code concatenated with the article title to the resolved ID for that article topic. The dump used for this milestone is slightly out of date (from November 2015), but the holes were negligible: on average, ~.01% of links discovered in any given article were not present in the mapping generated from this stale dump.

We then began with the seed article previously mentioned, "Arab-Israeli Conflict", and queried the Wikipedia API using pywikibot (Pywikibot, 2016) to get the links for this article, which we then mapped to their resolved IDs using the mappings from Wikidata. If we encountered a link that did not have a resolved ID in Wikidata's mapping, we first resolved any redirects this link may lead to and attempted to find a resolved ID for the article title of the redirect. We repeated this sequence in a depth-first search, allowing for a depth of 2 hops, to construct the resolved inter-wiki link graph for each language (see Figure 1).

[Figure 1: Example graphs: different link structures over the same set of concepts (c1, c2, ..., c5) in different languages' Wikipedias (l1, l2, ..., l39).]

3.1 Data for PageRank Clustering

While we constructed graphs for all 75 languages that have articles on the Arab-Israeli conflict, we were only interested in clustering the largest 45 Wikipedias, a list obtained from Wikipedia's reporting of its wikis' sizes (Wikipedia, 2016). Our interest in larger Wikipedias stems from the fact that a larger Wikipedia implies a more trafficked site and a more impactful narrative; in addition, a larger Wikipedia could mean more thorough coverage and a richer graph to study. Of these 45 languages, we were able to obtain a resolved ID for the seed article "Arab-Israeli Conflict", and thereby construct a link graph, for 39. We report the sizes of these graphs in Table 1. In total, we examine 39 languages with 207,375 resolved article IDs (concepts).

Since we were aiming to model what a reader in a particular language would see were he or she to surf through Wikipedia's offerings starting from the seed article, we ran personalized PageRank from the seed article in each language's network. Thus for our clustering task we were working with the matrix A, where A_ij = the PageRank for concept (resolved article ID) j in language i's concept graph.

    Language                Nodes    Edges
    Polish                  18888    45231
    Basque                   1598     2550
    Croatian                 9366    32406
    Indonesian               7344    16500
    English                 65450   207319
    Dutch                    7033    13349
    Hungarian               14589    53289
    Serbo-Croatian          12805    51590
    Slovenian                 748     1064
    Catalan                 11898    37439
    Korean                   5994    16561
    Armenian                  421      597
    Romanian                 1017     1500
    Persian                 12919    56591
    Portuguese              16357    44086
    Hebrew                  25826   162188
    French                  19341    45868
    German                  22975    49185
    Czech                    2671     3444
    Slovak                   7031    24243
    Norwegian                8568    20022
    Danish                   3879    10319
    Vietnamese               5243    15297
    Finnish                  7677    16217
    Russian                 37503   133417
    Spanish                 27814    62825
    Ukrainian               15869    48485
    Turkish                 11194    44499
    Swedish                  2616     4174
    Kazakh                    959     2647
    Chinese                 15546    41233
    Esperanto                6691    12612
    Malay (macrolanguage)    3050     9311
    Estonian                 3291     8664
    Japanese                33547    84882
    Bulgarian                6505    14412
    Serbian                  6314    21288
    Arabic                  28502   153589
    Waray (Philippines)       798     1278

Table 1: Sizes of the within-Wikipedia link graphs for each language.

3.2 Data for Logistic Regression and PCA

For our classification work and for PCA, we limit the languages we are investigating to just Hebrew and Arabic, and examine the adjacency matrix of the articles and concepts in the two languages. That is, we work with the matrix B, where:

    B_ij = 1 if article i has a link to concept j
           0 otherwise

An article i can be in either Hebrew or Arabic, and a concept j is resolved across both languages. Of the 54,328 total articles we examine, 28,502 are from the Arabic Wikipedia and 25,826 are from the Hebrew Wikipedia. Across both languages, there are 47,521 concepts. Of these concepts, 19,412 have articles in both Hebrew and Arabic that can be cited (as opposed to concepts that appear only as Hebrew or only as Arabic articles). We run our experiments both on this "intersect" graph, limited to concepts with articles in both languages, and on the graph with all 47,521 concepts, and note any differences in outcomes.

For the classification task, we take a stratified split of the data between training and test sets. In the test set, there are 14,174 Arabic articles and 12,990 Hebrew articles, for a total of 27,164 articles. In the training set, there are 14,328 Arabic articles and 12,836 Hebrew articles, again for a total of 27,164 articles.

For PCA, before running the algorithm we standardize the data: we make the mean of each feature column 0 and the standard deviation of each feature column 1.

4 Methodology

4.1 Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering is an unsupervised learning algorithm that iteratively groups samples in a bottom-up fashion, merging the samples or groups of samples that are closest to each other by some distance metric until a set number of clusters (usually one) remains. Hierarchical clustering is most useful when the number of clusters is unknown, in contrast to k-means clustering, where the number of clusters must be specified (Manning et al., 2008). There are a variety of algorithms for hierarchical clustering, each of which defines the distance between two clusters differently. In accordance with our aim of finding the differences between languages' link structures, we focus on complete-link clustering, in which the similarity between two clusters is defined as the distance between their two most distant members.

We experimented with a variety of distance metrics, but as we want to examine the similarity between the PageRank vectors regardless of their magnitudes, we focus on cosine similarity, defined as (Manning et al., 2008):

    sim(a, b) = (a · b) / (|a| |b|)

4.2 Logistic Regression

Logistic regression is a supervised learning algorithm that uses the sigmoid function to constrain the output of linear regression (in which features are weighted by coefficients and summed to get a predicted value) to values between 0 and 1, in order to solve binary classification tasks. Logistic regression makes predictions of the following form:

    h_θ(x) = 1 / (1 + e^(−θ^T x))

In order to fit θ in training, the algorithm seeks to maximize the likelihood of the data. Thus it seeks to maximize:

    L(θ) = ∏_{i=1}^{m} h_θ(x^(i))^{y^(i)} (1 − h_θ(x^(i)))^{1 − y^(i)}

where m is the number of training examples, x^(i) is the ith training example, y^(i) is the label for the ith training example, and h_θ is as defined above. Stochastic gradient ascent is used to update θ in the direction of fastest increase of the likelihood (equivalently, the log-likelihood), via the update rule:

    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

4.3 Principal Component Analysis

Principal component analysis (PCA) is a method of dimensionality reduction that attempts to find the axes that most explain the variance across a dataset. While it is often used as a way to improve performance by collapsing features and thereby projecting the data into a lower-dimensional space, PCA can also be used to uncover latent dimensions in the data; in our case, the latent dimension we hope to discover in the link structure of different articles is the articles' language (Niculae et al., 2015). For the first principal component, PCA finds the unit vector u that maximizes the variance over all the data points:

    arg max_{u : u^T u = 1}  u^T ( (1/m) ∑_{i=1}^{m} x^(i) (x^(i))^T ) u

which is equivalent to finding the principal eigenvector of the covariance matrix of the data, Σ. Projecting the data into a k-dimensional subspace (that is, finding the first k principal components) is equivalent to finding the top k eigenvectors of Σ.

There are different approaches to finding these eigenvectors, one of which is running singular value decomposition (SVD) on the centered data X, which decomposes X into matrices U, S, and V (Berrar et al., 2003):

    X = U S V^T

As mentioned in Section 3.2, we centered the columns of our data by scaling the mean of each feature to 0 and the standard deviation to 1. The implementation we chose for PCA therefore uses SVD internally.

5 Results and Analysis

5.1 PageRank Clustering

The clustering that results from complete-link clustering with the cosine similarity metric can be seen in Figure 2. The left-branching nature of the clustering suggests that the subspace is too large for meaningful clusters to appear: a majority of the languages' PageRank vectors are spread throughout the subspace nearly uniformly. Furthermore, there is no clear grouping across languages: geography, language, and political alignment all fail to explain the groupings of languages at the leaves of the dendrogram. The two languages that we most expect to diverge in narrative, Hebrew and Arabic, are among the closest to each other. Indeed, the correlation coefficient between the Hebrew and Arabic PageRank vectors is .997.

[Figure 2: Output of hierarchical agglomerative clustering using the 'complete' algorithm and the cosine similarity distance metric on the personalized PageRank vectors for each language's concept graph, starting from the seed article "Arab-Israeli Conflict". Languages are labeled using ISO two-letter language codes.]

Yet inspecting the PageRank vectors of these two languages for known controversial events in the history of the conflict reveals an interesting pattern, shown in Table 2: in both Hebrew and Arabic, acts of aggression perpetrated by people who speak that language are less heavily weighted than acts of aggression perpetrated by people who do not speak that language. That is, both languages place more weight on times when their speakers were victims of violence, and less weight on times when their speakers were perpetrators of violence. The effect is more pronounced for Hebrew than for Arabic, perhaps because there is only one country where a majority of citizens speak Hebrew, and therefore that language's Wikipedia more closely mirrors that country's dominant narrative. This key phenomenon is drowned out by the sheer number of non-zero PageRanks that Hebrew and Arabic have in common, in contrast to the rest of the languages in the subspace.

                                                    Arabic     Hebrew
    Israeli aggression
      King David Hotel Bombing                      0.00071    5.85E-06
      1948 Palestinian Exodus (Nakba)               0.00079    8.17E-05
      Plan Dalet                                    0.00056    3.96E-06
    Palestinian aggression
      1929 Hebron massacre                          0.00017    0.00060
      Terrorism                                     5.39E-05   0.00089
      2014 kidnapping and murder of
        Israeli teenagers                           2.25E-06   0.00061

Table 2: Differences in PageRank for controversial (violent) events in the history of the Arab-Israeli conflict.

5.2 Logistic Regression

The discovery of differences in PageRank between Arabic and Hebrew motivated the dive into classification of an article's language based solely on its links. After training our logistic regression classifier, we evaluated its performance on the test set (described in Section 3.2). We achieve 95.325% accuracy on the full concept matrix, and 93.377% accuracy on the "intersect" matrix. The confusion matrices for the classifier are given in Table 3. In both, the accuracy for Arabic is higher than that for Hebrew, likely because there are ~1,500 more Arabic articles than Hebrew articles in the training set.

    (a) Full                        (b) Intersect
             'ar'     'he'                   'ar'     'he'
    'ar'   13,715      459          'ar'   13,755      500
    'he'      811   12,179          'he'    1,299   11,610

Table 3: Confusion matrices for the logistic regression classifier when using either the full list of concepts (3a) or only concepts that appear in both Arabic and Hebrew (3b).

The high-performing classifier is a good indication that we can identify a particular language's narrative based solely on the concepts an article uses to explain a topic. In addition, an examination of the coefficients (θ) learned by the classifier aligns with the differences in PageRank noted between controversial (violent) article topics in Arabic and Hebrew, and also illuminates other key differences in the framing of the concepts related to the conflict. Salient categories for the concepts that were heavily weighted towards Hebrew included:

• Arab aggression: "Palestinian political violence", "Palestinian stone-throwing"

• Loaded pro-Israel terms: "Land of Israel", "Aliyah", "State of Palestine"

• Nuclear threat: "Plutonium", "Nuclear reactor"

Salient categories for the concepts that were heavily weighted towards Arabic included:

• Israeli aggression: "1982 Lebanon War", "1948 Palestinian Exodus"

• Religion: "Hebrew calendar", "Book of Ruth", "Jewish philosophy", "Islam", "Christianity", "Judaism"

Inspecting θ also suggested that the classifier was picking up on differences between the languages that had nothing to do with narrative framing, but rather reflect idiosyncrasies in Wikipedia practices across the two languages. For instance, the Hebrew Wikipedia tends to link to years and dates much more than the Arabic Wikipedia does.

5.3 Principal Component Analysis

The effects of these cross-language idiosyncrasies on our experiments can be seen even more strongly in our PCA results. As seen in Figure 3, the data does cluster along the first principal component, though not clearly according to language. Investigating the ordering induced on the articles by this component suggests a distribution over topics, with the rightmost cluster consisting of all the years and dates. The second principal component seems to group articles into geographical places.

[Figure 3: Plot of the data along the first two principal components. Hebrew articles are marked in red, Arabic articles in blue.]

The variance explained by any one principal component is small, with the first 10 principal components explaining ~11% of the variance in the data, and the first 50 principal components explaining ~32%. This implies that the concepts are fairly independent: they cannot be easily collapsed into a very small number of components that explain a majority of the variance of the data. However, 92% of the variance is explained by the first 500 components; given that we began with a space of ~50,000 concepts, dimensionality reduction is still effective.

It appears that most of the variance in the data is not aligned with language, though it is clear that Arabic and Hebrew articles do cluster in different subspaces (as the logistic regression confirmed). We would ideally like to find a component that captures the variance in narrative, and see that Hebrew and Arabic are clustered distinctly along this component. However, PCA focuses on the components that most explain variance: the experiment picked up on characteristics that did not directly relate to narrative, and thus did not group articles according to their narratives (languages).

6 Conclusion and Future Work

Overall, the experiments yielded mixed results: clustering and PCA did not yield obvious endorsements of our hypothesis, but the high performance of the logistic regression classifier suggests that we can indeed identify a particular language's narrative based on the concepts an article uses to explain a topic. Furthermore, the coefficients θ learned by the classifier endorse the theory that specific events are emphasized or deemphasized in producing a given narrative.

Both the PageRank clustering and the PCA experiments suggest that an informed reduction in feature space could lead to more clearly conclusive results. The differences in PageRank between Hebrew and Arabic were drowned out by the sheer number of concepts included, many of which were irrelevant. The components that PCA picked up on were not components that one would expect to differ across languages (i.e., topic and geographic location). This suggests that a useful next step for the project would be to prune the concept graphs that form the basis for these experiments, perhaps using a Jaccard similarity threshold between the article that links to a concept and the article that represents that concept in that language.

A more detailed error analysis of the logistic regression classifier could also prove useful for this task: perhaps articles that the classifier came close to misclassifying (those whose predicted probabilities were close to .5) could be considered neutral, and used as a way to limit the inclusion of articles irrelevant to the divergent narratives. Similarly, simply pulling out the concepts whose articles were close to equally likely to be Arabic as Hebrew could provide a source of neutral articles to be excluded from the studies.

References

Patti Bao, Brent Hecht, Samuel Carton, Mahmood Quaderi, Michael Horn, and Darren Gergle. 2012. Omnipedia: Bridging the Wikipedia language gap. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12, pages 1075–1084, New York, NY, USA. ACM.

Daniel P. Berrar, Werner Dubitzky, Martin Granzow, et al. 2003. A Practical Approach to Microarray Data Analysis. Springer.

Ewa S. Callahan and Susan C. Herring. 2011. Cultural bias in Wikipedia content on famous persons. Journal of the American Society for Information Science and Technology, 62(10):1899–1915.

Brent Hecht and Darren Gergle. 2009. Measuring self-focus bias in community-maintained knowledge repositories. In Proceedings of the Fourth International Conference on Communities and Technologies, pages 11–20. ACM.

Paul Laufer, Claudia Wagner, Fabian Flöck, and Markus Strohmaier. 2015. Mining cross-cultural relations from Wikipedia: A study of 31 European food cultures. In Proceedings of the ACM Web Science Conference, WebSci '15, pages 3:1–3:10, New York, NY, USA. ACM.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Vlad Niculae, Caroline Suen, Justine Zhang, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2015. QUOTUS: The structure of political media coverage as revealed by quoting patterns. In Proceedings of the 24th International Conference on World Wide Web, pages 798–808. ACM.

Pywikibot. 2016. pywikibot 2.0rc5: Python Package Index. https://pypi.python.org/pypi/pywikibot.

Wikidata. 2016. Wikidata:Main Page. https://www.wikidata.org/wiki/Wikidata:Main_Page.

Wikipedia. 2016. List of Wikipedias. https://en.wikipedia.org/wiki/List_of_Wikipedias#Detailed_list, November.
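Concretely, the construction of the matrix B from a mapping of articles to their resolved concept links can be sketched in a few lines of numpy. The article IDs, link sets, and concept index below are hypothetical toy values, not the real dataset:

```python
import numpy as np

def build_adjacency(articles, links, concept_index):
    """Build B, where B[i, j] = 1 iff article i links to concept j."""
    B = np.zeros((len(articles), len(concept_index)), dtype=np.int8)
    for i, article in enumerate(articles):
        for concept in links.get(article, ()):
            j = concept_index.get(concept)
            if j is not None:  # ignore concepts outside the vocabulary
                B[i, j] = 1
    return B

# Toy example: one Hebrew and one Arabic article over three concepts.
links = {"he:A": {"Q1", "Q2"}, "ar:B": {"Q2", "Q3"}}
concept_index = {"Q1": 0, "Q2": 1, "Q3": 2}
B = build_adjacency(["he:A", "ar:B"], links, concept_index)
```

Each row of B then serves directly as the binary feature vector for one article in the classification and PCA experiments.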
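The crawl described in this section (fetch an article's links, map each to a resolved Wikidata ID, and recurse to a depth of 2 hops) can be sketched as below. `get_links` and `resolve_id` are hypothetical stand-ins for the pywikibot query and the Wikidata dump lookup; this is an illustrative sketch, not the authors' code:

```python
def build_link_graph(seed_title, get_links, resolve_id, max_depth=2):
    """Depth-limited DFS from a seed article.

    get_links:  title -> iterable of linked article titles
    resolve_id: title -> resolved concept ID, or None if unmapped
    Returns {resolved ID: set of resolved IDs it links to}.
    """
    graph, seen = {}, set()

    def dfs(title, depth):
        source = resolve_id(title)
        if source is None or source in seen:
            return
        seen.add(source)
        graph[source] = set()
        if depth == 0:  # node on the 2-hop frontier: keep it, don't expand
            return
        for target_title in get_links(title):
            target = resolve_id(target_title)
            if target is not None:  # drop links with no resolved ID
                graph[source].add(target)
                dfs(target_title, depth - 1)

    dfs(seed_title, max_depth)
    return graph

# Toy wiki where every title resolves to itself.
toy_links = {"Seed": ["A", "B"], "A": ["C"], "B": [], "C": ["D"]}
g = build_link_graph("Seed", lambda t: toy_links.get(t, []), lambda t: t)
```

The redirect-resolution fallback described in the text would live inside `resolve_id`.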
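As a reference point for the PageRank vectors clustered here, a minimal power-iteration sketch of personalized PageRank: the random surfer follows out-links with probability `alpha` and otherwise teleports back to the seed article. The damping factor 0.85 and the dangling-node handling are assumed defaults; the report does not state the settings it used:

```python
import numpy as np

def personalized_pagerank(adj, seed, alpha=0.85, tol=1e-10):
    """adj maps every node to its set of out-neighbours (all targets
    must also appear as keys). Dangling nodes are treated here as
    linking back to the seed."""
    nodes = sorted(adj)
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    M = np.zeros((n, n))  # column-stochastic transition matrix
    for u, outs in adj.items():
        if outs:
            for v in outs:
                M[idx[v], idx[u]] = 1.0 / len(outs)
        else:
            M[idx[seed], idx[u]] = 1.0
    teleport = np.zeros(n)
    teleport[idx[seed]] = 1.0
    r = np.full(n, 1.0 / n)
    while True:
        r_next = alpha * (M @ r) + (1 - alpha) * teleport
        if np.abs(r_next - r).sum() < tol:
            return dict(zip(nodes, r_next))
        r = r_next

ranks = personalized_pagerank(
    {"seed": {"a", "b"}, "a": {"seed"}, "b": {"seed"}}, "seed")
```

Stacking one such vector per language over a shared concept vocabulary yields the matrix A used for clustering.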
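The complete-link clustering with cosine distance described above can be reproduced with scipy's hierarchical clustering routines. The matrix below is a tiny stand-in for the languages-by-concepts PageRank matrix A (four "languages", three "concepts"):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Rows 0-1 and rows 2-3 point in similar directions; cosine distance
# compares directions only, ignoring the vectors' magnitudes.
A = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.1, 0.9],
    [0.0, 0.2, 0.8],
])
Z = linkage(A, method="complete", metric="cosine")  # merge history
labels = fcluster(Z, t=2, criterion="maxclust")     # cut into 2 clusters
```

`scipy.cluster.hierarchy.dendrogram(Z)` renders the kind of tree shown in Figure 2.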
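The stochastic update rule above translates directly into code. A self-contained sketch on toy binary link-indicator rows; the learning rate, epoch count, and data are illustrative choices, not the report's settings:

```python
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=300, seed=0):
    """Ascend the log-likelihood with the stochastic update
    theta_j := theta_j + lr * (y_i - h_theta(x_i)) * x_ij."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # reshuffle each epoch
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))
            theta += lr * (y[i] - h) * X[i]
    return theta

def predict(X, theta):
    """Label 1 where the predicted probability is at least 0.5."""
    return (1.0 / (1.0 + np.exp(-X @ theta)) >= 0.5).astype(int)

# Toy data: the label is 1 exactly when the first "concept link" is set.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0])
theta = sgd_logistic(X, y)
```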
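The PCA pipeline described in Sections 3.2 and 4.3 (standardize each feature column, run SVD, then read per-component explained variance off the singular values) can be sketched as follows. The data here is synthetic; only the procedure mirrors the report:

```python
import numpy as np

def pca_svd(X, k):
    """Standardize columns, decompose X = U S V^T, and project onto the
    top-k principal directions (rows of V^T). The fraction of variance
    explained by component i is s_i**2 / sum(s**2)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return Xs @ Vt[:k].T, explained[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.01 * rng.normal(size=100)  # nearly collinear pair
proj, var = pca_svd(X, 2)
```

With two of the five standardized columns nearly identical, the first component absorbs roughly two columns' worth of variance, an analogue of the cumulative explained-variance figures quoted above.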