Converting Western Internet to Indigenous Internet: Lessons from Wikipedia

Kul Wadhwa and Howie Fung

Kul Wadhwa worked at Wikimedia from 2007 to 2014; his most recent position was Head of Mobile, where he focused on expanding Wikipedia into the developing world. He’s currently Entrepreneur-In-Residence at Stanford University’s H-STAR Institute, working on collaborative learning technology for children. Howie Fung is the Director of Product Development at the Wikimedia Foundation, which he joined in 2009. Prior to Wikimedia, he was a Product Manager at Rhapsody and at PayPal. The authors would like to acknowledge Alolita Sharma for her advice, which helped shape this article.

© 2014 Kul Wadhwa and Howie Fung. innovations / volume 9, number 3/4, page 127. Downloaded from http://direct.mit.edu/itgg/article-pdf/9/3-4/127/705137/inov_a_00224.pdf by guest on 29 September 2021.

With the massive proliferation of communications technology, for the first time in human history the majority of people (more than 4 billion) have access to information through a mobile phone. But then we have to ask: What content? Who contributes? In which languages? How should it be presented? And in what ways should it be delivered? And, most importantly, what are all the pieces you need to put together to make this work?

The only Internet project that has been able to scale globally is Wikipedia. Everyone involved with this project got behind the vision of “a world in which every single human being can freely share in the sum of all knowledge.” Starting in 2001, this community was initially limited to people who had desktop or laptop computers, access to the Internet, and a strong comprehension of one of the 10 or so major languages on the planet.

Wikipedia has consistently ranked in the top 10 most visited websites on the planet [1] as a knowledge resource that is multilingual and available in over 280 languages [2]. Not only is it open and free to use, but anyone with Internet access can write and make changes to Wikipedia articles. Therefore, as a democratic platform for both contribution and dissemination of information, it makes sense simply to leverage it for greater use in the developing world, where such resources are either underdeveloped or largely unavailable.

But what made Wikipedia successful can’t be exported “as is” to the developing world. There are challenges that extend beyond the platform and into the real world, limitations beyond the project’s direct control: lack of adequate infrastructure and/or some people’s inability to comprehend the information that the Internet provides. However, other challenges can be tackled head on, such as cost of access, tools to improve contribution and content consumption, and educating users. These are typically perceived as “add-on” activities to be addressed later. They are not. When you build an Internet project, from the outset you must plan on “extending the core” of the overall platform as part of your foundational plan.

CONTENT ECOSYSTEMS

When creating an online product for users/customers in the developed world, you can focus on your product and not worry about all the other elements needed to make it useful, because they are either in place or being built by others. To be successful in the developing world, however, you have to think about building not just your project but an entire ecosystem.

Since Wikipedia was launched in and remains heavily focused on the Western world, it has relied on existing platforms to scale: the Internet itself, widespread access, and people’s familiarity with paper encyclopedias. Thus, Wikipedia has been able to focus on the content side without worrying about the larger ecosystem.

Content ecosystems are societal systems organized to create and distribute information.
There are primarily two kinds of content ecosystems: expert-driven and community-driven.

Expert-driven content ecosystems include existing, largely proprietary information sources and services. Examples are news and television companies, cable networks, educational institutions, and government. Community-driven online content ecosystems are relatively new and include Wikipedia, WikiHow, and various large-scale open-source software projects, such as Mozilla’s Firefox. These new content ecosystems are online, curated, and share information about topics of common interest.

Expert-driven content ecosystems traditionally separate creators from consumers of information. These ecosystems focus on information developed by an accorded authority in a specific sector. This approach requires a centralized governing body to generate and distribute the content. Expert-driven systems have existed and should continue to exist regardless of new technology.

Community-driven content ecosystems are built around a new model of information sharing, where barriers between creators and consumers are reduced or eliminated. They can reach participants quickly and cost-effectively because the model does not depend on external experts to create and tailor the content, which can be expensive and time-consuming. These models can be localized more easily to the community’s language and culture, thus making content more relevant to local needs than what is produced by outside experts. However, since anyone can contribute and modify content, the downside to this approach can be the difficulty of ensuring quality control and providing curation oversight.
Nevertheless, due to the content relevancy advantages of a community-driven content model, this will be our focus.

Sidebar: Experiments with the Mobile Internet

Based on Wikipedia’s experience to date on scaling in the developing world, which is generally positive but not an overall success, it’s clear that more experiments are needed to fine-tune Internet sites for mobile users in the developing world. To that end, Wikipedia launched a pilot program in Kenya in September 2013 with Airtel Africa to make Wikipedia available via text message. SMS is still the most popular data channel in the world, so content delivery must be tailored for that experience. However, it’s complicated to deliver Wikipedia content via SMS, with its limit of 140 bytes per message (160 characters in the default 7-bit encoding).

Regardless of the delivery mechanism, for content to be relevant, it needs to be simple and short. Even if people are using smartphones, the form factor (smaller screen size compared to desktop/laptop computers), the difficulty of entering text (telephone keypad or compact keyboard), and the cost of staying online longer all add up to the need for more efficient user engagement with content.

The benefit of community-driven ecosystems is that they can drive changes on their own. A group of volunteers within the Wikipedia community has started Wiki Project Med [12]. One of their initiatives is to make medical content more digestible for people in the developing world. They have teamed up with Translators without Borders to simplify medical content so it can be more useful to people and also easily translatable. That could make it easier for local communities to produce medical content in local languages.

Aneel Karnani, from the University of Michigan Ross School of Business, suggests that the only way to alleviate poverty is to focus on the poor as producers, rather than as a market of consumers [3]. Therefore, it’s important that the next billion (and more) are not passive observers in the information economy.

The Internet has, for the first time, allowed community-driven systems to develop and flourish. But because of the need for consumers to also be contributors, an ecosystem must be developed to allow a community-driven model to succeed. Experience in building online communities and in working within collaborative models to produce high-quality output is a prerequisite for tackling the complex task of building communities in local languages for the developing world.

There is a lot of “machinery” involved in making a content-driven model work: tools, processes, structures, workflows, governance, etc. For example, each Wikipedia project has developed its own processes around how to decide which articles out of the hundreds that are created every day are encyclopedia-worthy, how to protect articles from vandalism, how to resolve disputes between editors, how to identify and correct copyright violations, and how to improve quality and coverage within topic areas, to name a few. Each project developed these systems organically, so the rules around what is encyclopedia-worthy may be different on the German Wikipedia than they are on the English Wikipedia. Thus, each Wikipedia community has been able to develop a system of policies and processes resulting in the creation of its own online encyclopedia, whose quality rivals professionally produced encyclopedias (e.g., Britannica and Science) with coverage several orders of magnitude greater.

WIKIPEDIA’S TENTACLES IN THE DEVELOPING WORLD

Wikipedia strives to be all-inclusive: it’s the encyclopedia that anyone can edit. For this to truly be the case, everyone must be able to contribute to it and have access to it. However, that is not the case today.

Wikipedia currently holds almost 33 million articles (as of August 2014) in over 285 languages. But the vast majority of the content is in languages from the developed world. English alone accounts for almost 14 percent of the content (over 4.5 million articles), and the top five languages (English, Swedish, Dutch, German, and French) account for more than a third. Moreover, the top 10 languages account for more than 50 percent of the content on Wikipedia.