Bias In Wikipedia: Different Links, Different Stories
Raine Hoover
[email protected]
CS 229 Final Report: December 16th, 2016

Abstract

We attempt to uncover the latent biases in different narratives on Wikipedia by investigating the link structure of the within-Wikipedia link graphs generated from a seed article of interest. The case study presented in this report is the Wikipedia article entitled “Arab-Israeli Conflict”. We perform three experiments on this dataset: 1) hierarchical clustering on personalized PageRank vectors for each of the largest languages on Wikipedia, 2) logistic regression classification of an article as ‘Hebrew’ or ‘Arabic’ based solely on the article’s links, 3) principal component analysis of the articles written in Hebrew and Arabic. We find that while clustering and PCA are inconclusive with regard to language being a primary explanation of variance between link structures, we can accurately classify the lan-

personalized PageRank vector starting from the seed article, 2) logistic regression on the adjacency matrices for Hebrew and Arabic, classifying whether a given article is written in Hebrew or Arabic based on which concepts it does or does not link to, 3) PCA on the adjacency matrices for Hebrew and Arabic, investigating the main axes of variance across individual article link structure.

2 Related Works

There has been some research into the differences in link structure across languages on Wikipedia. In particular, the project Omnipedia (Bao et al., 2012) highlights these differences, but leaves analysis to the user. With regard to cultural bias in Wikipedia, most of these investigations have been descriptive and/or limited to case studies, focusing for instance on coverage of famous people in English and Polish (Callahan and Herring, 2011).
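
For readers unfamiliar with the personalized PageRank vector used in the first experiment above, the following is a minimal sketch of one common formulation: power iteration in which all teleportation mass returns to the seed article. It is not the report's implementation. The function name `personalized_pagerank`, the damping factor of 0.85, the handling of articles with no outgoing links, and the toy article titles are all assumptions made for illustration.

```python
def personalized_pagerank(links, seed, damping=0.85, tol=1e-10, max_iter=1000):
    """Personalized PageRank by power iteration.

    links: dict mapping an article title to the list of titles it links to.
    seed:  the article that receives all teleportation mass.
    Returns a dict mapping every article to its score; scores sum to 1.
    """
    # Collect every article that appears as a source, a target, or the seed.
    nodes = set(links) | {seed}
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Start with all probability mass on the seed article.
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}

    for _ in range(max_iter):
        new = {n: 0.0 for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                # Follow an outgoing link chosen uniformly at random.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: an article with no links sends its mass to the seed.
                new[seed] += damping * rank[n]
        # Teleport step: total mass is 1, so (1 - damping) returns to the seed.
        new[seed] += 1.0 - damping

        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy link graph; the titles are placeholders, not real data.
toy_links = {
    "Seed": ["A", "B"],
    "A": ["B"],
    "B": ["Seed"],
    "C": ["A"],  # links into the graph but is unreachable from the seed
}
scores = personalized_pagerank(toy_links, seed="Seed")
```

Because teleportation always returns to the seed, an article that cannot be reached from the seed (here `"C"`) keeps a score of zero, so the resulting vector describes the neighborhood of the seed article rather than the whole graph.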