arXiv:2010.11856v3 [cs.CL] 13 Apr 2021


XOR QA: Cross-lingual Open-Retrieval Question Answering

Akari Asai (University of Washington), Jungo Kasai (University of Washington), Jonathan H. Clark (Google Research), Kenton Lee (Google Research), Eunsol Choi (The University of Texas at Austin), Hannaneh Hajishirzi (University of Washington; Allen Institute for AI)

{akari, jkasai, hannaneh}@cs.washington.edu {jhclark, kentonl}@google.com, [email protected]

Abstract

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity (where languages have few reference articles) and information asymmetry (where questions reference concepts from other cultures). This work extends open-retrieval question answering to a cross-lingual setting, enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at https://nlp.cs.washington.edu/xorqa/.

Figure 1: Overview of XOR QA. Given a question in L_i, the model finds an answer in either English or L_i Wikipedia and returns an answer in English or L_i. L_i is one of the 7 typologically diverse languages. Example shown in the figure: the question ロン・ポールの学部時代の専攻は? [Japanese] (What did Ron Paul major in during undergraduate?) is posed against multilingual document collections (Wikipedias). The ja.wikipedia article ロン・ポール contains 高校卒業後はゲティスバーグ大学へ進学。 (After high school, he went to Gettysburg College.), while the en.wikipedia article Ron Paul contains "Paul went to Gettysburg College, where he was a member of the Lambda Chi Alpha fraternity. He graduated with a B.S. degree in Biology in 1957." The returned answer is 生物学 (Biology).

1 Introduction

Information-seeking questions (questions from people who are actually looking for an answer) have been increasingly studied in question answering (QA) research. Fulfilling these information needs has led the research community to look further for answers: beyond paragraphs and articles toward performing open retrieval [1] on large-scale document collections (Chen and Yih, 2020). Yet the bulk of this work has been exclusively on English. In this paper, we bring together for the first time information-seeking questions, open-retrieval QA, and multilingual QA to create a multilingual open-retrieval QA dataset that enables cross-lingual answer retrieval.

[1] We use open retrieval, instead of open domain, to refer to models that can access answer context from large document collections. We avoid using open domain due to its double meaning as "covering topics from many domains."

While multilingual open QA systems would benefit the many speakers of non-English languages, there are several pitfalls in designing such a dataset. First, a multilingual QA dataset should include questions from non-English native speakers to represent real-world applications. Questions in most recent multilingual QA datasets (Lewis et al., 2020; Artetxe et al., 2020; Longpre et al., 2020) are translated from English, which leads to English-centric questions such as questions about American sports, cultures and politics. Second, it is important to support retrieving answers in languages other than the original language due to information scarcity of low-resource languages (Miniwatts Marketing Group, 2011). Moreover, questions strongly related to entities from other cultures are less likely to have answer content in the questioner's language due to cultural bias (information asymmetry, Callahan and Herring, 2011). For example, Fig. 1 shows that the Japanese Wikipedia article on an American politician, Ron Paul, does not have information about his college degree, perhaps because Japanese Wikipedia editors are less interested in specific educational backgrounds of American politicians.

In this paper, we introduce the task of cross-lingual open-retrieval question answering (XOR QA), which aims at answering multilingual questions from non-English native speakers given multilingual resources. To support research in this area, we construct a dataset (called XOR-TyDi QA) of 40k annotated questions and answers across 7 typologically diverse languages. Questions in our dataset are inherited from TyDi QA (Clark et al., 2020), which are written by native speakers and are originally unanswerable due to the information scarcity or asymmetry issues. XOR-TyDi QA is the first large-scale cross-lingual open-retrieval QA dataset that consists of information-seeking questions from native speakers and multilingual reference documents.

XOR-TyDi QA is constructed with an annotation pipeline that allows for cross-lingual retrieval from large-scale Wikipedia corpora (§2). Unanswerable questions in TyDi QA are first translated into English by professional translators. Then, annotators find answers to translated queries given English Wikipedia using our new model-in-the-loop annotation framework that reduces annotation errors. Finally, answers are verified and translated back to the target languages.

Building on the dataset, we introduce three new tasks in the order of increasing complexity (§3). In XOR-Retrieve, a system retrieves English Wikipedia paragraphs with sufficient information to answer the question posed in the target language. XOR-EnglishSpan takes one step further and finds a minimal answer span from the retrieved English paragraphs. Finally, XOR-Full expects a system to generate an answer end to end in the target language by consulting both English and the target language's Wikipedia. XOR-Full is our ultimate goal, and the first two tasks enable researchers to diagnose where their models fail and to develop with less coding effort and fewer resources.

We provide baselines that extend state-of-the-art open-retrieval QA systems (Asai et al., 2020; Karpukhin et al., 2020) to our multilingual retrieval setting. Our best baseline achieves an average of 18.7 F1 points on XOR-Full. This result indicates that XOR-TyDi QA poses unique challenges to tackle toward building a real-world open-retrieval QA system for diverse languages. We expect that our dataset opens up new challenges to make progress in multilingual representation learning.

2 The XOR-TyDi QA Dataset

Our XOR-TyDi QA dataset comprises questions inherited from TyDi QA (Clark et al., 2020) and answers augmented with our annotation process across 7 typologically diverse languages. We focus on cross-lingual retrieval from English Wikipedia because in our preliminary investigation we were able to find answers to a majority of the questions from resource-rich English Wikipedia, and native speakers with much annotation experience were readily available via crowdsourcing in English.

2.1 XOR-TyDi QA Collection

Our annotation pipeline proceeds in four steps: 1) collection of questions from TyDi QA without a same-language answer, which require cross-lingual reference to answer (§2.1.1); 2) question translation from a target language to the pivot language of English, where the missing information may exist (§2.1.2); 3) answer retrieval in the pivot language given a set of candidate documents (§2.1.3); 4) answer verification and translation from the pivot language back to the original language (§2.1.4). Fig. 2 shows an overview of the pipeline.

2.1.1 Question Selection

Our questions are collected from unanswerable questions in TyDi QA. A question is unanswerable in TyDi QA if an annotator cannot select a passage answer (a paragraph in the article that contains an answer). We randomly sample 5,000 questions without any passage answer annotations (unanswerable questions) from the TyDi QA training data, and split them into training (4,500) and development (500) sets. We use the development data from TyDi QA as our test data, since TyDi QA's original test data is not publicly available. [2] We choose 7 languages with varying amounts of Wikipedia data out of the 10 non-English languages based on the cost and availability of translators: [3] Arabic, Bengali, Finnish, Japanese, Korean, Russian and Telugu.

[2] Furthermore, despite the benefits of hidden test sets, the resource-intensive nature of open-retrieval QA is not well suited to code-submission leaderboards. This further precluded the use of the original TyDi QA test sets.

2.1.2 Question Translation

We use a professional translation service, Gengo, [4] to translate all collected questions into English. Since named entities are crucial for QA, we instruct translators to carefully translate them by searching for common English translations from English Wikipedia or other external sources.

... annotation errors because annotators have to find answer context among many candidate articles.

Collaborative model-in-the-loop. To find a middle ground in the tradeoff, we introduce a collaborative model-in-the-loop framework that uses Google Search and a state-of-the-art paragraph ranker. We first run Google Search to retrieve as many as top

Figure 2: Overview of the annotation process for XOR-TyDi QA. Panels: 1. Question Selection (TyDi QA questions Q_L with no in-language answer); 2. Question Translation, Q_L → Q_en (human translation); 3. Answer Retrieval in English, (Q_en, P_en) (article retrieval with a search engine over top English Wikipedia articles, paragraph ranking, human answer annotation on Mechanical Turk); 4. Answer Translation, (Q_en, P_en, A_en → A_L) (answer verification and human translation). The running example is the Ron Paul question from Figure 1, with answer 生物学.
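The question selection step in §2.1.1 (sample questions that have no passage answer, then hold out a development set) can be sketched as below. This is only an illustration under stated assumptions, not the authors' released code: the record layout with a `passage_answers` field, the function name `select_and_split`, and the fixed seed are all hypothetical, and the toy records stand in for real TyDi QA training data.

```python
import random


def select_and_split(questions, n_sample=5000, n_dev=500, seed=0):
    """Sample unanswerable questions and split them into train/dev.

    `questions` is a list of dicts with a hypothetical `passage_answers`
    field; an empty list means no annotator selected a passage answer.
    """
    # Keep only questions without any passage answer annotation.
    unanswerable = [q for q in questions if not q["passage_answers"]]

    # Draw a random sample without replacement (capped at what is available).
    rng = random.Random(seed)
    sampled = rng.sample(unanswerable, min(n_sample, len(unanswerable)))

    # The sample is already in random order, so slicing gives a random split.
    dev = sampled[:n_dev]
    train = sampled[n_dev:]
    return train, dev


# Toy stand-in data: every fourth question has a passage answer.
toy_questions = [
    {"id": i, "passage_answers": ["p"] if i % 4 == 0 else []}
    for i in range(8000)
]

train, dev = select_and_split(toy_questions)
print(len(train), len(dev))  # 4500 500
```

With the defaults, the split sizes match the 4,500 training and 500 development questions reported in §2.1.1; how the sampling was distributed over languages is not modeled here.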