Towards Bengali DBpedia

Arup Sarkar a,∗, Ujjal Marjit b, Utpal Biswas a

a Dept. of Computer Sc. & Engg., University of Kalyani, Kalyani 741235, W.B., India
b C.I.R.M., University of Kalyani, Kalyani 741235, W.B., India
∗ Corresponding author. E-mail address: [email protected]

International Conference on Computational Intelligence: Modeling Techniques and Applications (CIMTA) 2013
Procedia Technology 10 (2013) 890–899. doi: 10.1016/j.protcy.2013.12.435

Abstract

An online encyclopedia brings a whole universe of real information within a mouse click today. In recent years, online encyclopedias like Wikipedia have shown how big and vast such a resource can be. Collecting information from such a big resource and presenting it as per the user's requirements is always a challenging task. The Linked Data based project called DBpedia resolves this challenge to some extent by publishing data from Wikipedia semantically. Like Wikipedia, DBpedia also offers its datasets in different international languages. Generating DBpedia content in a different language always requires some extra effort to resolve language-related issues; some special configuration and settings have to be maintained within the DBpedia framework. This paper explains how a Bengali version can be derived from the original version of the DBpedia framework.

© 2013 The Authors. Published by Elsevier Ltd. Selection and peer-review under responsibility of the University of Kalyani, Department of Computer Science & Engineering.

Keywords: DBpedia; Bengali DBpedia; DBpedia Information Extraction Framework; Wikipedia

1. Introduction

Proper use of knowledge can change the history of mankind forever; that is why knowledge has always been a matter of praise among the intelligent. Nowadays the web has become a huge hub of information: a place where information is stored, retrieved and shared. The online encyclopedia has become a very popular concept, since it provides many categories of information in a single place. Among these, the most well known is the Wikipedia project. We can define Wikipedia as a common place for sharing up-to-date information worldwide. Though Wikipedia holds a lot of structured data, it is still not completely usable, because it is represented on the traditional non-semantic web as a collection of textual data, and it is the end user's sole responsibility to find the correct information among thousands of wiki pages. The solution is to add semantic annotations to the wiki pages of Wikipedia. With the Semantic Wiki methodology it was possible to add semantic annotations to the pages manually, but this was not a scalable solution for a dataset as huge as Wikipedia, where information is updated every second and the site grows all the time. A manual approach seemed unfit at this stage, so an automated technique was required. The DBpedia project, a joint venture by groups of researchers from Freie Universität Berlin, the University of Leipzig and OpenLink Software, is just that solution the world was looking for.

The DBpedia project plays an important role with regard to the Linked Open Data (LOD) project. LOD mainly deals with freely accessible semantic data over the Semantic Web, publishing it as Linked Data [1] in the LOD cloud and maintaining the links among the different data graphs. Maintaining links between RDF (Resource Description Framework) [2] statements across different datasets is one of the important aspects of the LOD project, since it is dedicated to Linked Data. DBpedia plays a central role in making the LOD project a success story: to date, a huge number of datasets in the LOD cloud are connected to the DBpedia datasets. One main reason for this behaviour of LOD participants is that they always find related information within the DBpedia datasets, which makes DBpedia a common target for linking up. DBpedia covers such a big area of information due to its objective of publishing the structured data of Wikipedia. In that sense, DBpedia actually acts as a gateway connecting related information from different datasets with Wikipedia.
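To make the notion of Linked Data concrete, the short Python sketch below (not part of the DBpedia codebase) shows how a single Wikipedia-derived fact can be expressed as RDF triples with the rdflib library, including an owl:sameAs link tying the resource to another LOD dataset. The dbr:Kalyani resource and the Geonames URI are illustrative choices, not verified entries.

    # A minimal sketch, assuming rdflib is installed: one fact as RDF
    # triples, plus an owl:sameAs link into another LOD dataset.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF, RDFS

    DBR = Namespace("http://dbpedia.org/resource/")
    DBO = Namespace("http://dbpedia.org/ontology/")

    g = Graph()
    g.bind("dbr", DBR)
    g.bind("dbo", DBO)
    g.bind("owl", OWL)

    # A fact extracted from a Wikipedia infobox, expressed as triples
    # (resource and class names here are illustrative).
    g.add((DBR["Kalyani"], RDF.type, DBO["Settlement"]))
    g.add((DBR["Kalyani"], RDFS.label, Literal("Kalyani", lang="en")))

    # A link into another dataset of the LOD cloud (URI illustrative).
    g.add((DBR["Kalyani"], OWL.sameAs,
           URIRef("http://sws.geonames.org/1268295/")))

    print(g.serialize(format="turtle"))

It is exactly such owl:sameAs links, maintained at scale, that let DBpedia act as the hub described above.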
2. Wikipedia

As we already know, Wikipedia [3] is one of the biggest and most popular online encyclopedias we have ever seen, and it is still growing. Currently it consists of approximately 23 million articles; among the language editions, the English Wikipedia alone holds 4,130,559 articles, with 1,461 administrators and 18,124,585 users [4]. These figures may have changed by the time of publication of this paper. A comparative study of the growth of the different language editions of Wikipedia can be found in [5]. The most important aspect of Wikipedia is that it is not only a web-based representation of textual data; it is a huge collection of structured data, which is a key resource for the futuristic, machine-readable Semantic Web. Within Wikipedia, this structured data is stored mainly using Infobox templates. Infoboxes are basically collections of key properties representing the important aspects and main features of the article using the template. A sample use of Infobox OS is shown in Fig. 1.

Fig. 1. Sample use of Infobox OS (adapted from http://bn.wikipedia.org/wiki/%E0%A6%9F%E0%A7%87%E0%A6%AE%E0%A6%AA%E0%A7)
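Since an infobox is, at heart, a list of keyword-value pairs embedded in wikitext, the following simplified Python sketch shows that structure being recovered. This is only an illustration: the sample template text and field names are invented, and a real extractor uses a full wikitext parser rather than a regular expression.

    import re

    # Invented sample of flat "| key = value" infobox wikitext.
    SAMPLE_WIKITEXT = """{{Infobox OS
    | name      = Ubuntu
    | developer = Canonical Ltd.
    | family    = Linux
    | released  = 2004
    }}"""

    def parse_infobox(wikitext):
        """Return the infobox's keyword-value pairs as a dict."""
        pairs = {}
        # One "| key = value" assignment per line of the template.
        for match in re.finditer(r"^\s*\|\s*(\w+)\s*=\s*(.+?)\s*$",
                                 wikitext, flags=re.MULTILINE):
            key, value = match.groups()
            pairs[key] = value
        return pairs

    print(parse_infobox(SAMPLE_WIKITEXT))
    # {'name': 'Ubuntu', 'developer': 'Canonical Ltd.', ...}

Extracting these pairs and mapping them onto a shared ontology is what turns an infobox like the one in Fig. 1 into the RDF statements DBpedia publishes.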
3. MediaWiki

MediaWiki is a free and open-source software tool specifically developed for Wikipedia. It was created by Magnus Manske, a student at the University of Cologne, with the aim of replacing UseModWiki, the software platform Wikipedia used at the time. The first version of MediaWiki was released in 2003. UseModWiki was written in Perl and stored all of its information in text files, which clearly limited the functionality of a daily-growing online encyclopedia like Wikipedia. MediaWiki, by contrast, is written in PHP, a lightweight yet high-performance and scalable scripting language, and it uses a MySQL database at the back end to store all the information instead of text files, which makes it a better substitute for UseModWiki. MediaWiki is rich in high-performance functionality and extensibility. It also supports multilingualism, since its developers are well aware of the efforts made by different editors to publish internationalized and localized versions of the original (English) Wikipedia site. Currently the MediaWiki user interface comes in many languages, making the creation and editing of wiki pages in different languages easier. Despite its functional and efficient nature, MediaWiki itself is still considered difficult to use by ordinary end users. With its thoroughly tested code base, the use of MediaWiki is no longer limited to Wikipedia: various governmental and non-governmental sites also use MediaWiki these days to publish their wiki pages online.

4. DBpedia

DBpedia [6] is a highly anticipated project for publishing structured data from Wikipedia. The main objectives of the DBpedia project can be pointed out as follows:

• extraction of structured data from the Wikipedia site (specifically, from Wikipedia pages);
• representation of this structured data as RDF statements, so that an RDF graph evolves;
• use of an ontology and the RDF data model as the backend of the RDF graph;
• making this RDF graph accessible through normal browsers;
• making the RDF graph accessible through a SPARQL endpoint, so that any SPARQL query can be entertained (a querying sketch is given at the end of Section 6);
• answering complex queries against Wikipedia, which was not possible prior to the development of the DBpedia project.

The framework used behind this initiative is known as the DBpedia Extraction Framework, DEF for short. DEF is responsible for extracting structured data from Wikipedia pages. Within Wikipedia, most of the structured data is stored using particular Infobox templates, which are basically collections of keyword-value pairs; example infobox content has already been shown in Fig. 1.

5. Motivation

As we already know, Wikipedia is a huge collection of information, and it is increasing day by day. In the beginning it was limited to the English edition; today it has expanded into several non-English languages, some of which are based on non-Latin scripts, such as Greek, Korean and Bengali. Extracting structured data from these versions of Wikipedia is always challenging. Despite all the challenges, contributors have already developed some non-English versions of DBpedia; the leading ones in this category are the Greek DBpedia [7] and the Korean DBpedia [8], both developed using the i18n (internationalization) extension of the DBpedia Extraction Framework. Our aim is to develop a pathway towards a Bengali version of DBpedia. The Bengali Wikipedia is a promising Wikipedia edition that holds much important information, and it is growing day by day, so the need for a Bengali version of DBpedia is justified.

6. DBpedia Extraction Framework

DEF [9] is also known as the DBpedia Information Extraction Framework (DIEF). DEF is generally divided into two modules, each with its own purpose: the Core Module and the Dump Extraction Module. The Core Module holds the main components of the framework, while the Dump Extraction Module handles the issues related to data extraction. Mercurial, the Java Development Kit, Maven, Scala and the Virtuoso server (optional) are the main requirements for setting up DEF. Mercurial [10] is a free-to-use source control management tool.
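Once a localized DEF deployment has loaded its extracted triples into Virtuoso, the data becomes queryable over SPARQL, as listed among the objectives in Section 4. The Python sketch below uses the SPARQLWrapper library against a hypothetical Bengali endpoint URL; an actual installation would substitute the address its own Virtuoso server exposes.

    # A minimal sketch, assuming SPARQLWrapper is installed and a
    # Bengali DBpedia endpoint is running (URL below is hypothetical).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://bn.dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?resource ?label WHERE {
            ?resource rdfs:label ?label .
            FILTER (lang(?label) = "bn")
        }
        LIMIT 10
    """)

    # Print ten resources that carry Bengali-language labels.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["resource"]["value"], row["label"]["value"])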