Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Total Pages: 16

File Type: PDF; Size: 1020 KB

Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Behrang Mohit∗  Nathan Schneider†  Rishav Bhowmick∗  Kemal Oflazer∗  Noah A. Smith†
School of Computer Science, Carnegie Mellon University
∗P.O. Box 24866, Doha, Qatar   †Pittsburgh, PA 15213, USA
{behrang@, nschneid@cs., rishavb@qatar., ko@cs., nasmith@cs.}cmu.edu

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner—a loss function encouraging it to "arrogantly" favor recall over precision—substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.¹

¹The annotated dataset and a supplementary document with additional details of this work can be found at: http://www.ark.cs.cmu.edu/AQMAR

1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly-used news domain. These data challenge past approaches in two ways:

First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER.

Second, much research has focused on the use of news text for system building and evaluation. However, Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly delineated. One hallmark of this divergence between Wikipedia and the news domain is a difference in the distributions of named entities. Indeed, the classic named entity types (person, organization, location) may not be the most apt for articles in other domains (e.g., scientific or social topics). On the other hand, Wikipedia is a large dataset, inviting semisupervised approaches.

In this paper, we describe advances on the problem of NER in Arabic Wikipedia. The techniques are general and make use of well-understood building blocks. Our contributions are:

• A small corpus of articles annotated in a new scheme that provides more freedom for annotators to adapt NE analysis to new domains;
• An "arrogant" learning approach designed to boost recall in supervised training as well as self-training; and
• An empirical evaluation of this technique as applied to a well-established discriminative NER model and feature set.

Experiments show consistent gains on the challenging problem of identifying named entities in Arabic Wikipedia text.
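The excerpt describes the "arrogant", recall-oriented modification only at a high level, so the following is a minimal sketch of one way such a bias can be wired into an online structured learner: a cost-augmented, perceptron-style update in which missed entity tokens (false negatives) cost more than spurious ones (false positives). The tag set, feature function, and cost values are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a recall-biased online update for a token-tagging model.
# Assumption: a simple unigram tagger stands in for the paper's sequence
# model; the asymmetric costs (FN_COST > FP_COST) encode the recall bias.
from collections import defaultdict

FN_COST = 2.0   # cost of missing an entity token (hurts recall)
FP_COST = 0.5   # cost of predicting a spurious entity token (hurts precision)

def cost(gold_tag, pred_tag):
    """Asymmetric Hamming-style cost: missing entities is worse than over-predicting."""
    if gold_tag == pred_tag:
        return 0.0
    if gold_tag != "O" and pred_tag == "O":
        return FN_COST      # false negative
    if gold_tag == "O" and pred_tag != "O":
        return FP_COST      # false positive
    return 1.0              # wrong entity type

def features(token, tag):
    return [f"w={token}|t={tag}", f"suf3={token[-3:]}|t={tag}"]

def cost_augmented_update(weights, tokens, gold_tags, tagset, lr=1.0):
    """One perceptron-style pass: decode with the cost added to the score
    (so costly mistakes are easier to make at training time and therefore
    get corrected), then update toward the gold tags."""
    for token, gold in zip(tokens, gold_tags):
        pred = max(tagset,
                   key=lambda t: sum(weights[f] for f in features(token, t))
                                 + cost(gold, t))
        if pred != gold:
            for f in features(token, gold):
                weights[f] += lr
            for f in features(token, pred):
                weights[f] -= lr

weights = defaultdict(float)
tagset = ["O", "PER", "ORG", "LOC", "MISC"]
cost_augmented_update(weights, ["Damascus", "is", "old"], ["LOC", "O", "O"], tagset)
```

Because the false-negative cost is larger, the training-time decoder is pushed toward exactly the mistakes that hurt recall, which in turn forces larger corrective updates against them.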
2 Arabic Wikipedia NE Annotation

Most of the effort in NER has been focused around a small set of domains and general-purpose entity classes relevant to those domains—especially the categories PER(SON), ORG(ANIZATION), and LOC(ATION) (POL), which are highly prominent in news text. Arabic is no exception: the publicly available NER corpora—ACE (Walker et al., 2006), ANER (Benajiba et al., 2008), and OntoNotes (Hovy et al., 2006)—all are in the news domain.²

²OntoNotes contains news-related text. ACE includes some text from blogs. In addition to the POL classes, both corpora include additional NE classes such as facility, event, product, vehicle, etc. These entities are infrequent and may not be comprehensive enough to cover the larger set of possible NEs (Sekine et al., 2002). Nezda et al. (2006) annotated and evaluated an Arabic NE corpus with an extended set of 18 classes (including temporal and numeric entities); this corpus has not been released publicly.

Appropriate entity classes will vary widely by domain; occurrence rates for entity classes are quite different in news text vs. Wikipedia, for instance (Balasuriya et al., 2009). This is abundantly clear in technical and scientific discourse, where much of the terminology is domain-specific, but it holds elsewhere. Non-POL entities in the history domain, for instance, include important events (wars, famines) and cultural movements (romanticism). Ignoring such domain-critical entities likely limits the usefulness of the NE analysis.

Recognizing this limitation, some work on NER has sought to codify more robust inventories of general-purpose entity types (Sekine et al., 2002; Weischedel and Brunstein, 2005; Grouin et al., 2011) or to enumerate domain-specific types (Settles, 2004; Yao et al., 2003). Coarse, general-purpose categories have also been used for semantic tagging of nouns and verbs (Ciaramita and Johnson, 2003). Yet as the number of classes or domains grows, rigorously documenting and organizing the classes—even for a single language—requires intensive effort. Ideally, an NER system would refine the traditional classes (Hovy et al., 2011) or identify new entity classes when they arise in new domains, adapting to new data. For this reason, we believe it is valuable to consider NER systems that identify (but do not necessarily label) entity mentions, and also to consider annotation schemes that allow annotators more freedom in defining entity classes.

Our aim in creating an annotated dataset is to provide a testbed for evaluation of new NER models. We will use these data as development and testing examples, but not as training data. In §4 we will discuss our semisupervised approach to learning, which leverages ACE and ANER data as an annotated training corpus.

Table 1: Translated titles of Arabic Wikipedia articles in our development and test sets, and some NEs with standard and article-specific classes. Additionally, Prussia and Amman were reserved for training annotators, and Gulf War for estimating inter-annotator agreement.
History: dev: Damascus, Imam Hussein Shrine; test: Crusades, Islamic Golden Age, Islamic History, Ibn Tolun Mosque, Ummaya Mosque
Science: dev: Atom, Nuclear power; test: Enrico Fermi, Light, Periodic Table, Physics, Muhammad al-Razi
Sports: dev: Raúl Gonzáles, Real Madrid; test: 2004 Summer Olympics, Christiano Ronaldo, Football, Portugal football team, FIFA World Cup
Technology: dev: Linux, Solaris; test: Computer, Computer Software, Internet, Richard Stallman, X Window System
Sample NEs: Claudio Filippone (PER) كلاوديو فيليبوني; Linux (SOFTWARE) لينكس; Spanish League (CHAMPIONSHIPS) الدوري الإسباني; proton (PARTICLE) بروتون; nuclear radiation (GENERIC-MISC) الإشعاع النووي; Real Zaragoza (ORG) ريال سرقسطة

2.1 Annotation Strategy

We conducted a small annotation project on Arabic Wikipedia articles. Two college-educated native Arabic speakers annotated about 3,000 sentences from 31 articles. We identified four topical areas of interest—history, technology, science, and sports—and browsed these topics until we had found 31 articles that we deemed satisfactory on the basis of length (at least 1,000 words), cross-lingual linkages (associated articles in English, German, and Chinese³), and subjective judgments of quality. The list of these articles along with sample NEs are presented in table 1. These articles were then preprocessed to extract main article text (eliminating tables, lists, info-boxes, captions, etc.) for annotation.

³These three languages have the most articles on Wikipedia. Associated articles here are those that have been manually hyperlinked from the Arabic page as cross-lingual correspondences. They are not translations, but if the associations are accurate, these articles should be topically similar to the Arabic page that links to them.
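The article-selection criteria in the preceding paragraph are concrete enough to express as a simple filter. The sketch below is a hypothetical illustration only: the candidate records, field names, and everything other than the 1,000-word minimum and the English/German/Chinese linkage requirement are invented, not tooling from the paper.

```python
# Hypothetical sketch of the article-selection criteria described above.
# The candidate records and field names are invented for illustration.
REQUIRED_LANGLINKS = {"en", "de", "zh"}
MIN_WORDS = 1000

def satisfies_criteria(article):
    """Length and cross-lingual-linkage checks; subjective quality is judged manually."""
    return (article["word_count"] >= MIN_WORDS
            and REQUIRED_LANGLINKS <= set(article["langlinks"]))

candidates = [
    {"title": "Atom", "word_count": 2400, "langlinks": ["en", "de", "zh", "fr"]},
    {"title": "Some stub", "word_count": 310, "langlinks": ["en"]},
]
selected = [a["title"] for a in candidates if satisfies_criteria(a)]
print(selected)   # ['Atom']
```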
menting and organizing the classes—even for a Our approach follows ACE guidelines (LDC, single language—requires intensive effort. Ide- 2005) in identifying NE boundaries and choos- ally, an NER system would refine the traditional ing POL tags. In addition to this traditional form classes (Hovy et al., 2011) or identify new entity of annotation, annotators were encouraged to ar- classes when they arise in new domains, adapting ticulate one to three salient, article-specific en- to new data. For this reason, we believe it is valu- tity categories per article. For example, names able to consider NER systems that identify (but of particles (e.g., proton) are highly salient in the do not necessarily label) entity mentions, and also Atom article. Annotators were asked to read the to consider annotation schemes that allow annota- entire article first, and then to decide which non- tors more freedom in defining entity classes. traditional classes of entities would be important Our aim in creating an annotated dataset is to in the context of article. In some cases, annotators provide a testbed for evaluation of new NER mod- reported using heuristics (such as being proper els. We will use these data as development and 3These three languages have the most articles on Wikipedia. Associated articles here are those that have been sible NEs (Sekine et al., 2002). Nezda et al. (2006) anno- manually hyperlinked from the Arabic page as cross-lingual tated and evaluated an Arabic NE corpus with an extended correspondences. They are not translations, but if the associ- set of 18 classes (including temporal and numeric entities); ations are accurate, these articles should be topically similar this corpus has not been released publicly. to the Arabic page that links to them. Token position agreement rate 92.6% Cohen’s κ: 0.86 History: Gulf War, Prussia, Damascus, Crusades Token agreement rate 88.3% Cohen’s κ: 0.86 WAR CONFLICT ••• Token F between annotators 91.0% 1 Science: Atom, Periodic table Entity boundary match F 94.0% 1 THEORY • CHEMICAL •• Entity category match F 87.4% 1 NAME ROMAN • PARTICLE •• Table 2: Inter-annotator agreement measurements. Sports: Football, Raul´ Gonzales´ SPORT ◦ CHAMPIONSHIP • nouns or having an English translation which is AWARD ◦ NAME ROMAN • conventionally capitalized) to help guide their de- Technology: Computer, Richard Stallman termination of non-canonical entities and entity COMPUTER VARIETY ◦ SOFTWARE • classes. Annotators produced written descriptions COMPONENT • of their classes, including example instances. Table 3: Custom NE categories suggested by one or This scheme was chosen for its flexibility: in both annotators for 10 articles. Article titles are trans- contrast to a scenario with a fixed ontology, anno- lated from Arabic. • indicates that both annotators vol- tators required minimal training beyond the POL unteered a category for an article; ◦ indicates that only conventions, and did not have to worry about one annotator suggested the category. Annotators were delineating custom categories precisely enough not given a predetermined set of possible categories; that they would extend straightforwardly to other rather, category matches between annotators were de- topics or domains. Of course, we expect inter- termined by post hoc analysis. NAME ROMAN indi- cates an NE rendered in Roman characters. annotator variability to be greater for these open- ended classification criteria.
Recommended publications
  • A Survey of Orthographic Information in Machine Translation
A Survey of Orthographic Information in Machine Translation

Bharathi Raja Chakravarthi · Priya Rani · Mihael Arcan · John P. McCrae

Abstract: Machine translation is one of the applications of natural language processing which has been explored in different languages. Recently researchers started paying attention towards machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the variation in orthographic conventions which causes many issues to traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthography's influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made, and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts of cognates information at different levels of machine translation and the lessons that can
  • Modeling Popularity and Reliability of Sources in Multilingual Wikipedia
information Article

Modeling Popularity and Reliability of Sources in Multilingual Wikipedia

Włodzimierz Lewoniewski *, Krzysztof Węcel and Witold Abramowicz
Department of Information Systems, Poznań University of Economics and Business, 61-875 Poznań, Poland; [email protected] (K.W.); [email protected] (W.A.)
* Correspondence: [email protected]
Received: 31 March 2020; Accepted: 7 May 2020; Published: 13 May 2020

Abstract: One of the most important factors impacting quality of content in Wikipedia is presence of reliable sources. By following references, readers can verify facts or find more details about described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, therefore information about the same topic may be inconsistent. This also applies to use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability in time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different languages versions of Wikipedia.
  • Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics
computers Article

Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics

Włodzimierz Lewoniewski *, Krzysztof Węcel and Witold Abramowicz
Department of Information Systems, Poznań University of Economics and Business, 61-875 Poznań, Poland
* Correspondence: [email protected]; Tel.: +48-(61)-639-27-93
Received: 10 May 2019; Accepted: 13 August 2019; Published: 14 August 2019

Abstract: On Wikipedia, articles about various topics can be created and edited independently in each language version. Therefore, the quality of information about the same topic depends on the language. Any interested user can improve an article and that improvement may depend on the popularity of the article. The goal of this study is to show what topics are best represented in different language versions of Wikipedia using results of quality assessment for over 39 million articles in 55 languages. In this paper, we also analyze how popular selected topics are among readers and authors in various languages. We used two approaches to assign articles to various topics. First, we selected 27 main multilingual categories and analyzed all their connections with sub-categories based on information extracted from over 10 million categories in 55 language versions. To classify the articles to one of the 27 main categories, we took into account over 400 million links from articles to over 10 million categories and over 26 million links between categories. In the second approach, we used data from DBpedia and Wikidata. We also showed how the results of the study can be used to build local and global rankings of the Wikipedia content.
  • Cross-Lingual Entity Extraction and Linking for 300 Languages
© 2020 Xiaoman Pan

CROSS-LINGUAL ENTITY EXTRACTION AND LINKING FOR 300 LANGUAGES

BY XIAOMAN PAN

DISSERTATION submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2020. Urbana, Illinois.
Doctoral Committee: Dr. Heng Ji (Chair), Dr. Jiawei Han, Dr. Hanghang Tong, Dr. Kevin Knight

ABSTRACT

Information provided in languages which people can understand saves lives in crises. For example, language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and ground the information into an existing Knowledge Base (KB) which is accessible to a user in their own language (e.g., a reporter from the World Health Organization who speaks English only). The ambitious goal of this thesis is to develop a Cross-lingual Entity Extraction and Linking framework for 1,000 fine-grained entity types and 300 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify entity name mentions, assign a fine-grained type to each mention, and link it to an English KB if it is linkable. Traditional entity linking methods rely on costly human annotated data to train supervised learning-to-rank models to select the best candidate entity for each mention. In contrast, we propose a novel unsupervised represent-and-compare approach that can accurately capture the semantic meaning representation of each mention, and directly compare its representation with the representation of each candidate entity in the target KB.
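The "represent-and-compare" idea in this abstract amounts to scoring each candidate KB entity by the similarity between its representation and the mention's representation. The sketch below is a generic illustration with made-up vectors and candidate names, not the dissertation's actual model.

```python
# Generic illustration of "represent and compare" entity linking:
# rank KB candidates by cosine similarity to the mention representation.
# The vectors and candidate labels below are made up; a real system learns them.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

mention_vec = [0.9, 0.1, 0.3]          # representation of the mention in context
candidates = {
    "Paris (France)": [0.8, 0.2, 0.4],
    "Paris (Texas)":  [0.1, 0.9, 0.2],
}
best = max(candidates, key=lambda name: cosine(mention_vec, candidates[name]))
print(best)   # Paris (France)
```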
  • Mapping Arabic Wikipedia Into the Named Entities Taxonomy
Mapping Arabic Wikipedia into the Named Entities Taxonomy

Fahd Alotaibi and Mark Lee
School of Computer Science, University of Birmingham, UK
{fsa081|m.g.lee}@cs.bham.ac.uk

ABSTRACT: This paper describes a comprehensive set of experiments conducted in order to classify Arabic Wikipedia articles into predefined sets of Named Entity classes. We tackle the task using four different classifiers, namely: Naïve Bayes, Multinomial Naïve Bayes, Support Vector Machines, and Stochastic Gradient Descent. We report on several aspects related to classification models in the sense of feature representation, feature set and statistical modelling. The results reported show that we are able to correctly classify the articles with scores of 90% on Precision, Recall and balanced F-measure.

KEYWORDS: Arabic Named Entity, Wikipedia, Arabic Document Classification, Supervised Machine Learning.

Proceedings of COLING 2012: Posters, pages 43–52, COLING 2012, Mumbai, December 2012.

1 Introduction
Relying on supervised machine learning technologies to recognize Named Entities (NE) in the text requires the development of a reasonable volume of data for the training phase. Manually developing a training dataset that goes beyond the news-wire domain is a non-trivial task. Examination of online and freely available resources, such as the Arabic Wikipedia (AW), offers promise because the underlying scheme of AW can be exploited in order to automatically identify NEs in context. To utilize this resource and develop a NEs corpus from AW means two different tasks should be addressed: 1) Identifying the NEs in the context regardless of assigning those into NEs semantic classes. 2) Classifying AW articles into a predefined NEs taxonomy.
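As a rough illustration of the supervised article-classification setup this abstract describes, the sketch below trains a TF-IDF plus linear SGD classifier with scikit-learn; the toy documents, labels, and features are invented stand-ins for the paper's actual data and feature engineering.

```python
# Rough sketch of supervised article classification into NE classes.
# The toy documents and labels are invented; the real features and data differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "footballer played for the national team",    # PER-like article
    "city located on the river with population",  # LOC-like article
    "company founded headquarters software",      # ORG-like article
    "singer released an album and went on tour",  # PER-like article
]
labels = ["PER", "LOC", "ORG", "PER"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SGDClassifier())
clf.fit(docs, labels)
print(clf.predict(["club founded in the city"]))
```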
  • Culture Contacts and the Making of Cultures
    CULTURE CONTACTS AND THE MAKING OF CULTURES Papers in homage to Itamar Even-Zohar Edited by Rakefet Sela-Sheffy Gideon Toury Tel Aviv Tel Aviv University: Unit of Culture Research 2011 Copyright © 2011 Unit of Culture Research, Tel Aviv University & Authors. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without prior written permission of the publisher or author. Unit of Culture Research Faculty of Humanities Tel Aviv University Tel Aviv, 69978 Israel www.tau.ac.il/tarbut [email protected] Culture Contacts and the Making of Cultures: Papers in Homage to Itamar Even-Zohar / Rakefet Sela-Sheffy & Gideon Toury, editors. Includes bibliographical references. ISBN 978-965-555-496-0 (electronic) The publication of this book was supported by the Bernstein Chair of Translation Theory, Tel Aviv University (Gideon Toury, Incumbent) Culture Contacts and the Making of Cultures: Papers in Homage to Itamar Even-Zohar Sela-Sheffy, Rakefet 1954- ; Toury, Gideon 1942- © 2011 – Unit of Culture Research & Authors Printed in Israel Table of Contents To The Memory of Robert Paine V Acknowledgements VII Introduction Rakefet Sela-Sheffy 1 Part One Identities in Contacts: Conflicts and Negotiations of Collective Entities Manfred Bietak The Aftermath of the Hyksos in Avaris 19 Robert Paine† Identity Puzzlement: Saami in Norway, Past and Present 67 Rakefet Sela-Sheffy High-Status Immigration Group and Culture Retention: 79 German Jewish Immigrants in British-Ruled Palestine
  • Automatically Extending Named Entities Coverage of Arabic Wordnet Using Wikipedia
International Journal on Information and Communication Technologies, Vol. 1, No. 1, June 2008

Automatically Extending Named Entities Coverage of Arabic WordNet Using Wikipedia

Musa Alkhalifa and Horacio Rodríguez

Abstract—This paper focuses on the automatic extraction of Arabic Named Entities (NEs) from the Arabic Wikipedia, their automatic attachment to Arabic WordNet, and their automatic link to Princeton's English WordNet. We briefly report on the current status of Arabic WordNet, focusing on its rather limited NE coverage. Our proposal of automatic extension is then presented, applied, and evaluated.

Index Terms—Arabic NLP, Arabic WordNet, Named Entities Extraction, Wikipedia.

I. INTRODUCTION

Ontologies have become recently a core resource for many knowledge-based applications such as knowledge management, natural language processing, e-commerce, intelligent information integration, database design and integration, information retrieval, bio-informatics, etc. The Semantic Web [7] is a prominent example on the extensive use of ontologies. The main goal of the Semantic Web is the semantic annotation of the data on the Web with the use of ontologies in order to have a machine-readable and machine-understandable Web that will enable computers, autonomous software agents, and humans to work and cooperate better through sharing knowledge and resources.

... performed automatically using English WordNet as source ontology and bilingual resources for proposing alignments. Okumura and Hovy [30] proposed a set of heuristics for associating Japanese words to English WordNet by means of an intermediate bilingual dictionary and taking advantage of the usual genus/differentiae structure of dictionary definitions. Later, Khan and Hovy [20] proposed a way to reduce the hyper-production of spurious links by this method by searching common hyperonyms that could collapse several ...

[Table 1. Content of different versions of Princeton's English WordNet]
  • Critical Point of View: a Wikipedia Reader
Critical Point of View: A Wikipedia Reader
Edited by Geert Lovink and Nathaniel Tkacz
INC Reader #7

Editors: Geert Lovink and Nathaniel Tkacz
Editorial Assistance: Ivy Roberts, Morgan Currie
Copy-Editing: Cielo Lutino
Design: Katja van Stiphout
Cover Image: Ayumi Higuchi
Printer: Ten Klei Groep, Amsterdam
Publisher: Institute of Network Cultures, Amsterdam 2011
ISBN: 978-90-78146-13-1

Contact
Institute of Network Cultures
phone: +3120 5951866
fax: +3120 5951840
email: [email protected]
web: http://www.networkcultures.org

Order a copy of this book by sending an email to: [email protected]
A pdf of this publication can be downloaded freely at: http://www.networkcultures.org/publications
Join the Critical Point of View mailing list at: http://www.listcultures.org

Supported by: The School for Communication and Design at the Amsterdam University of Applied Sciences (Hogeschool van Amsterdam DMCI), the Centre for Internet and Society (CIS) in Bangalore and the Kusuma Trust. Thanks to Johanna Niesyto (University of Siegen), Nishant Shah and Sunil Abraham (CIS Bangalore), Sabine Niederer and Margreet Riphagen (INC Amsterdam) for their valuable input and editorial support. Thanks to Foundation Democracy and Media, Mondriaan Foundation and the Public Library Amsterdam (Openbare Bibliotheek Amsterdam) for supporting the CPOV events in Bangalore, Amsterdam and Leipzig. (http://networkcultures.org/wpmu/cpov/) Special thanks to all the authors for their contributions and to Cielo Lutino, Morgan Currie and Ivy Roberts for their careful copy-editing.
  • Automatic Linking of Short Arabic Texts to Wikipedia Articles
    Journal of Software Automatic Linking of Short Arabic Texts to Wikipedia Articles Fatoom Fayad1*, Iyad AlAgha2 1 Computer Center, Palestine Technical College-Deir El-Balah, Gaza Strip, Palestine. 2 Faculty of Information Technology, The Islamic University of Gaza, Gaza Strip, Palestine. * Corresponding author. Tel.: 00972592542727; email: [email protected] Manuscript submitted January 10, 2016; accepted March 8, 2016. doi: 10.17706/jsw.11.12.1207-1223 Abstract: Given the enormous amount of unstructured texts available on the Web, there has been an emerging need to increase discoverability of and accessibility to these texts. One of the proposed solutions is to annotate texts with information extracted from background knowledge. Wikipedia, the free encyclopedia, has been recently exploited as a background knowledge to annotate text with complementary information. Given any piece of text, the main challenge is how to determine the most relevant information from Wikipedia with the least effort and time. While Wikipedia-based annotation has mainly targeted the English and Latin versions of Wikipedia, little effort has been devoted to annotate Arabic text using the Arabic version of Wikipedia. In addition, the annotation of short text presents further challenges due to the inability to apply statistical or machine learning techniques that are commonly used with long text. This work proposes an approach for automatic linking of Arabic short texts to articles drawn from Wikipedia. It reports on the several challenges associated with the design and implementation of the linking approach including the processing of the Wikipedia's enormous content, the mapping of texts to Wikipedia articles, the problem of article disambiguation, and the time efficiency.
  • A Wikipedia Reader
UvA-DARE (Digital Academic Repository)

Critical point of view: a Wikipedia reader
Lovink, G.; Tkacz, N.

Publication date: 2011
Document Version: Final published version
Link to publication

Citation for published version (APA):
Lovink, G., & Tkacz, N. (2011). Critical point of view: a Wikipedia reader. (INC reader; No. 7). Institute of Network Cultures. http://www.networkcultures.org/_uploads/%237reader_Wikipedia.pdf

General rights
It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).

Disclaimer/Complaints regulations
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl)
Download date: 05 Oct 2021
  • Analyzing Wikidata Transclusion on English Wikipedia
Analyzing Wikidata Transclusion on English Wikipedia

Isaac Johnson
Wikimedia Foundation
[email protected]

Abstract. Wikidata is steadily becoming more central to Wikipedia, not just in maintaining interlanguage links, but in automated population of content within the articles themselves. It is not well understood, however, how widespread this transclusion of Wikidata content is within Wikipedia. This work presents a taxonomy of Wikidata transclusion from the perspective of its potential impact on readers and an associated in-depth analysis of Wikidata transclusion within English Wikipedia. It finds that Wikidata transclusion that impacts the content of Wikipedia articles happens at a much lower rate (5%) than previous statistics had suggested (61%). Recommendations are made for how to adjust current tracking mechanisms of Wikidata transclusion to better support metrics and patrollers in their evaluation of Wikidata transclusion.

Keywords: Wikidata · Wikipedia · Patrolling

1 Introduction

Wikidata is steadily becoming more central to Wikipedia, not just in maintaining interlanguage links, but in automated population of content within the articles themselves. This transclusion of Wikidata content within Wikipedia can help to reduce maintenance of certain facts and links by shifting the burden to maintain up-to-date, referenced material from each individual Wikipedia to a single repository, Wikidata. Current best estimates suggest that, as of August 2020, 62% of Wikipedia articles across all languages transclude Wikidata content. This statistic ranges from Arabic Wikipedia (arwiki) and Basque Wikipedia (euwiki), where nearly 100% of articles transclude Wikidata content in some form, to Japanese Wikipedia (jawiki) at 38% of articles and many small wikis that lack any Wikidata transclusion.
  • AIDArabic: A Named-Entity Disambiguation Framework for Arabic Text
AIDArabic
A Named-Entity Disambiguation Framework for Arabic Text

Mohamed Amir Yosef, Marc Spaniol, Gerhard Weikum
Max-Planck-Institut für Informatik, Saarbrücken, Germany
{mamir|mspaniol|weikum}@mpi-inf.mpg.de

Abstract

There has been recently a great progress in the field of automatically generated knowledge bases and corresponding disambiguation systems that are capable of mapping text mentions onto canonical entities. Efforts like the before mentioned have enabled researchers and analysts from various disciplines to semantically "understand" contents. However, most of the approaches have been specifically designed for the English language and - in particular - support for Arabic is still in its infancy. Since the amount of Arabic Web contents (e.g. in social media) has been increasing dramatically over the last years, we see a great potential for endeavors that support an entity-level analytics of these data.

Named Entity Disambiguation (NED) is essential for many applications in the domain of Information Retrieval (such as information extraction). It also enables producing more useful and accurate analytics. The problem has been exhaustively studied in the literature. The essence of all NED techniques is using background information extracted from various sources (e.g. Wikipedia), and use such information to know the correct/intended meaning of the mention. The Arabic content is enormously growing on the Internet; nevertheless, background information is clearly lacking behind other languages such as English. Consider Wikipedia for example: while the English Wikipedia contains more than 4.5 million articles, the Arabic version contains less than 0.3 million ones¹. As a result, and up to our knowledge, there is no serious work that has been done in the area of performing NED for Arabic