Towards Understanding and Improving Multilingual Collaborative Ontology Development in Wikidata
Total Page:16
File Type:pdf, Size:1020Kb
Towards Understanding and Improving Multilingual Collaborative Ontology Development in Wikidata John Samuel CPE Lyon, Université de Lyon Villeurbanne CEDEX, France [email protected] ABSTRACT blogs, mailing lists, internet forums for discussion and obtaining Wikidata is a multilingual, open, linked, structured knowledge base. a final consensus. Understanding this whole process of commu- One of its major interesting features is the collaborative nature of nication during software development has helped to create new ontological development for diverse domains. Even though Wiki- full-fledged development environments that not only includes ver- data has around 4500 properties compared to the millions of items, sion control systems but also incorporates continuous integration, their limited number cannot be overlooked because they form a bug reporting, discussion forums etc. Understanding the role of major role in the creation of new items, especially for the manual external communication mechanisms is therefore very important interface. This article will take a look on understanding the present to build effective collaboration tools. approach of property creation, mainly focusing on its multilingual Wikis[24] are another major and often cited example of online aspects. It will also present a quick overview of an ongoing work collaboration. Wikipedia, undoubtedly comes to mind especially called WDProp, a dashboard for real-time statistics on properties considering its global outreach. Editors from around the world and their translation. create, edit and discuss various topics of interest. However, what distinguishes Wikipedia from many large software collaboration CCS CONCEPTS projects is also its goal of building a multilingual knowledge base right from its inception. This means, Wikis can be read and modified • Information systems → Web data description languages; by users from different linguistic backgrounds in their own native Structure and multilingual text search; • Software and its engi- (or preferred) languages. This multilingual nature is also one of neering → Collaboration in software development; the key factors for the popularity of Wikipedia, especially enabling KEYWORDS contributors to target local-language communities. Wikipedia has become a major source of information for human Collaboration, Multilingualism, Ontology development, Wikidata consumption. Even though, there are templates and information ACM Reference Format: boxes (or infoboxes) that highlight key information that have been John Samuel. 2018. Towards Understanding and Improving Multilingual extracted and analysed for machine-consumption, Wikipedia con- Collaborative Ontology Development in Wikidata. In Wiki Workshop (Wiki- sists of a large amount of unstructured data in the form of human- Workshop) , 5 pages. comprehensible text. With the growing explosion of data, it is now becoming difficult to maintain upto date information on Wikipedia. 1 INTRODUCTION Information is evolving at a very fast rate, making it infeasible for Online collaboration [5, 6] has caught the interest of many re- any one human or any one community to update content on all searchers during past many years. As internet became an integral the multilingual versions of an article[15]. This problem has been part of our everyday lives, it has become much easier to communi- well-known and in 2012, Wikidata[23] was created to tackle some cate and collaborate with people across the globe. Several collabo- of these problems, giving an equal opportunity to both humans and ration tools with real-time editing capabilities are now available. machines to access and update content and thereby maintaining the These tools and their usage can give wide insights to understand collaborative nature of Wikimedia projects. Wikidata1, is similar to human collaboration and to develop new time-saving features. its sister Wikimedia projects like Wikipedia, Wikibooks with the Collaborative software development has been around for several major difference that it is structured and linked knowledge base. years with version control systems (VCS) like GIT, SVN, CVS play- When multilingual Wikipedia projects are compared to Wikidata, ing an important role in development of softwares. VCS permits it must be noted that in case of Wikipedia, every language has its individual software developers to independently develop software own independent existence with a separate subdomain in the form code and later share their changes before making a final consensus of URL like https://{lc}.wikipedia.org, where lc is the language code on the final outcome. It is important to note here that the aforemen- for the concerned language. Wikidata is however different. There tioned tools are used with other tools of communication like email, are no separate subdomains, but recall that Wikidata is still multi- Permission to make digital or hard copies of part or all of this work for personal or lingual. Depending on the locale settings of web browser of the user, classroom use is granted without fee provided that copies are not made or distributed the user views the contents of a page in her local language. Details for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). WikiWorkshop, April 2018, © 2018 Copyright held by the owner/author(s). 1https://www.wikidata.org/ WikiWorkshop, April 2018, John Samuel of Douglas Adams2, Albert Einstein3, Alan Turing4, Leonardo da Wikidata contributors so that the entire workflow of ontology de- Vinci5 can be seen in the local language of the user. Nevertheless, velopment can be improved. That brings the readers to section 4 user can see the details in any supported language by changing which describes WDProp, its goals, ongoing development and fu- the default language. In some ways Wikidata may seem similar ture works. It also briefly discusses how some of these statistics to Wikipedia commons6, the media store for Wikimedia projects, may help in improving the ontology development. Finally section 5 where users can find description of media in several languages. concludes the article. This multilingual support is obtained thanks to the efforts of thousands of editors from around the world. Wikidata[23] consists of two types of pages: property pages and item pages. Item pages 2 MULTILINGUAL COLLABORATIVE describe the entities like persons (some examples given above), ONTOLOGY DEVELOPMENT books and articles, geographical places to name a few. Property A major step before setting up a knowledge base is to develop pages are dedicated to properties for describing the above entities. the required ontology. Ontology development is a long process Example properties include first name, last name, date of birth, and involves continuous discussion and feedback from domain native language, instance of, subclass of etc. Every domain may experts[25] and key players of the field. Different domains do share have its own dedicated set of properties, however some properties some common properties, yet identifying these common properties, may be used across domains. also called ontology reuse requires knowledge about the existence Defining the properties for a domain is one of the major tasks and capabilities of other existing ontologies. Nevertheless, ontology in ontology development. But in Wikidata, this ontology devel- reuse do constitute a major step in the ontology development. opment is done by the community, who may or may not be the Many concurrent ontologies have been proposed by different subject experts of the concerned domain. Additionally, as men- competing partners and it becomes a major problem during in- tioned above, Wikidata has no subdomains, dedicated for every formation interchange. In order to work with these ontologies, language like in the case of Wikipedia. Hence the creation of prop- automated ontology matching [11] has also been suggested. This erties is done by Wikidata contributors from across the world, who matching helps to identify the overlap between different ontologies. propose, discuss, vote and translate on one single website of Wiki- One other way is to standardize the ontology development process data. This process needs to be understood very well, especially in by involving multiple key players of the field. But these ontologies maintaining the significant feature of Wikidata: multilingualism. may not cater to all the needs of the wider community since it may Property and item labels need to be translated in all the supported only resolve the problems of the members of the standards body. languages in order to avoid creating duplicate and redundant pages. Other approaches to ontology development [20] include auto- Is it possible to achieve a truly multilingual collaborative ontology mated ontology development[1–3, 9] from structured and unstruc- development[14] in Wikidata or other similar projects, where the tured text. It involves identifying the key properties from documents entire process from discussion of ontology development to its final by using machine learning techniques. creation truly maintains multilingual nature? Despite the existence of several automated ontological devel- In this article, the focus is on Wikidata properties. Though few opment methods, collaborative ontology development[4, 16, 19] in number (currently less than 5000) compared to millions of items, still remains a major