Universality, Similarity, and Translation in the Wikipedia Inter-Language Link Network

In Search of the Ur-Wikipedia: Universality, Similarity, and Translation in the Wikipedia Inter-language Link Network

Morten Warncke-Wang¹, Anuradha Uduwage¹, Zhenhua Dong², John Riedl¹

¹GroupLens Research, Dept. of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota
²Dept. of Information Technical Science, Nankai University, Tianjin, China

{morten,uduwage,riedl}@cs.umn.edu, [email protected]

ABSTRACT

Wikipedia has become one of the primary encyclopaedic information repositories on the World Wide Web. It started in 2001 with a single edition in the English language and has since expanded to more than 20 million articles in 283 languages. Criss-crossing between the Wikipedias is an inter-language link network, connecting the articles of one edition of Wikipedia to another. We describe characteristics of articles covered by nearly all Wikipedias and those covered by only a single language edition, we use the network to understand how we can judge the similarity between Wikipedias based on concept coverage, and we investigate the flow of translation between a selection of the larger Wikipedias. Our findings indicate that the relationships between Wikipedia editions follow Tobler's first law of geography: similarity decreases with increasing distance. The number of articles in a Wikipedia edition is found to be the strongest predictor of similarity, while language similarity also appears to have an influence. The English Wikipedia edition is by far the primary source of translations. We discuss the impact of these results for Wikipedia as well as user-generated content communities in general.

Categories and Subject Descriptors

H.3.4 [Information Systems]: Systems and software - Information networks; H.5.3 [Information Systems]: Group and Organization Interfaces - computer-supported collaborative work

General Terms

Human Factors, Measurement

Keywords

Wikipedia, Tobler's Law, First Law of Geography, Multilingual

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WikiSym '12, Aug 27-29, Linz, Austria
Copyright 2012 ACM 978-1-4503-1605-7/12/08 ...$15.00.

1. INTRODUCTION

The world: seven seas separating seven continents, seven billion people in 193 nations. The world's knowledge: 283 Wikipedias totalling more than 20 million articles. Some of the content that is contained within these Wikipedias is probably shared between them; for instance it is likely that they will all have an article about Wikipedia itself. This leads us to ask whether there exists some ur-Wikipedia, a set of universal knowledge that any human encyclopaedia will contain, regardless of language, culture, etc? With such a large number of Wikipedia editions, what can we learn about the knowledge in the ur-Wikipedia?

[Figure: three overlapping circles labelled Dutch Wikipedia, German Wikipedia, and French Wikipedia, drawn inside a region labelled World knowledge; the regions of the circles are numbered 1 to 7.]

Figure 1: Illustration of shared and non-shared concepts in the world of knowledge between three of the largest Wikipedia editions.

Shared concepts can be illustrated as shown in Figure 1, where we have simplified the case by using only three Wikipedia editions of approximately the same size. In this case there are seven numbered classes of concepts: the ones that are shared by all (1), those shared by two (2, 3, 4), and those unique to a specific edition (5, 6, 7). There is also a part of world knowledge that is not covered by any Wikipedia, as Wikipedia content usually needs to meet a notability threshold and have verifiable sources. We are interested in understanding what forces influence the number of concepts in each of the numbered classes, expanded to encompass all 283 Wikipedia editions. For instance it has been shown that location bias exists in some Wikipedias [6], and Dahinden [4] found a connection between where a language is used and the location of articles in the same Wikipedia language edition. If it is possible to locate certain Wikipedia editions to specific geographic regions we might therefore be able to decide what influence geographic distance between Wikipedias has on the similarity between them.

Articles about the same concept have been linked across the different language Wikipedias from the early days. The inter-language link (ILL) network contains as of early January 2012 more than 224 million links between the more than 20 million articles. It is a decentralised system where each article is responsible for linking to all other languages where the same article exists¹. Originally this network was maintained manually; now Wikipedia users have a collection of software robots called interwiki bots to help them. The bots follow the ILLs to decide which languages have an article in order to add, update, or remove links as necessary. Because they use only the ILLs they are unable to discover isolated articles that should be linked, and the number of missing links varies between editions [7]. Research on the existing network has identified errors from incorrect links [5], from differences when deciding how to cover certain topics [7], and because the boundaries of concepts vary [2].

It is not difficult to find anecdotal evidence to show that there are differences in content coverage of the same concept. For example the English edition's article about the Norwegian politician Erik Solheim² contains several paragraphs about his work in the Sri Lankan peace negotiation process, including a section on controversies about alleged bias towards one side of the conflict, while the Norwegian edition's article³ contains only a single three-sentence paragraph and completely omits the Sri Lankan point of view. There are two official languages in Sri Lanka: Sinhala and Tamil. The Sinhala Wikipedia edition contains as of March 2012 about 6,000 articles, but does not have an article about Erik Solheim. Tamil Wikipedia has more than 43,000 articles, and the one about Erik Solheim⁴ lists only the basic facts about him and that he was involved in the peace talks from 2000 to 2006.

There are also similarities and differences among Wikipedia editions as a whole. In the book "The Wikipedia Revolution" [16], Lih mentions the story of how contributors to the Spanish edition branched out and created Enciclopedia Libre in 2002 after disputes about financing Wikipedia with advertisements, leaving Spanish Wikipedia nearly inactive for a year and a half. The book also mentions that Japanese Wikipedia has a much larger proportion of users who do not create an account (so-called "anonymous users") compared to other Wikipedias of similar size.

Research has also emerged describing similarities and differences between Wikipedias. Pfeil et al. [17] looked at the coverage of the article "Game" in four Wikipedias and found correlation with some of Hofstede's dimensions of cultural influence. Callahan and Herring [3] analysed coverage of famous persons in the English and Polish Wikipedias and discovered systematic differences but no indication of intentional bias. Yasseri et al. [32] used regularity in the edit patterns of the larger Wikipedia editions to estimate the geographical distribution of their editors. Adar et al. [1] studied tables with summary information about a topic, called infoboxes, mining data from the English, German, French and Spanish Wikipedias to show that the information could to a large extent be discovered and consolidated automatically through machine learning. The featured article approval process of Arabic, English, and Korean Wikipedias was explored by Stvilia et al. [23], finding that both their quality models and understanding of quality differed. Hecht and Gergle [7] studied 25 of the largest Wikipedia editions and found strong evidence against there being a global consensus of world knowledge. In this paper we extend the existing literature by exploring metrics to quantify the similarity between Wikipedia editions, and by using these metrics to identify what factors affect similarity.

Much of the existing research has used a single language edition as a data source, English Wikipedia in particular. Some of the issues that have been explored are: how users work together [29, 30, 13, 25], how article quality develops [31, 24, 15], and how different users contribute vastly different amounts of content to the encyclopaedia [18, 11]. There are also examples of research using other Wikipedias as a source, such as Stein and Hess [22] who looked into how certain contributors greatly affected article quality in the German edition.

Researchers have also looked at other networks. Roth et al. [20] mined data from about 360 different wikis to understand how macroscopic indicators, structural features, and governance policies affect growth patterns, while Kittur and Kraut [12] sampled 6,118 wiki production groups to understand how a selection of coordination mechanisms affect task quality and conflict in systems that are similar to Wikipedia. On the popular social networking site Twitter, Hong et al. [10] discovered that across languages users differ in their inclusion of URLs in tweets, as well as Twitter-specific features like hashtags, mentions, and retweets. A paper on the blogging site LiveJournal has also shown that there existed at least four distinct networks of users writing blog entries in languages other than English: Russian, Portuguese, Finnish, and Japanese [9].

2. RESEARCH QUESTIONS

The existing literature raises a number of fascinating questions about information flow across languages and cultures, with the ILLs in Wikipedia providing a rich source of data about those questions.