BibBase Triplified∗
http://data.bibbase.org

Oktie Hassanzadeh+, Reynold S. Xin×, Christian Fritz‡, Yang Yang+, Jiang Du+, Minghua Zhao+, Renée J. Miller+

+Department of Computer Science, University of Toronto
×EECS Department, University of California, Berkeley
‡Information Sciences Institute, University of Southern California

ABSTRACT

We present BibBase, a system for publishing and managing bibliographic data available in BibTeX files. BibBase uses a powerful yet light-weight approach to transform BibTeX files into rich triplified data as well as custom HTML and RSS code that can readily be integrated within a user's website, while the data can instantly be queried online on the system's SPARQL endpoint. In this short report, we present a brief overview of the features of our system and outline a few research challenges in building such a system.

Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services; H.2.4 [Database Management]: Systems

Keywords

Bibliographic Data Management, Linked Data, Data Integration

1. INTRODUCTION

Management of bibliographic data has received significant attention in the research community. Many online systems have been designed specifically for this purpose, including, but certainly not limited to, BibSonomy [13], CiteSeer [14], CiteULike [15], EPrints [16], Mendeley [17], PubZone [18], refbase [19], and RefWorks [20]. Work in the semantic web community in this area has also resulted in several tools (such as BibTeX-to-RDF conversion tools [7]), ontologies (such as the SWRC ontology [10]), and data sources (such as DBLP Berlin [11] and RKBExplorer [21]). These systems, tools, and data sources are widely used and have considerably simplified and enhanced many bibliographic data management tasks, such as the curation, storage, retrieval, and sharing of bibliographic data.

Despite the success of the above-mentioned systems, very few individuals and research groups publish the bibliographic data on their websites in a structured format, particularly following the principles of Linked Data [1], to provide users with HTTP-dereferenceable URIs that serve structured (RDF) data as well as nicely formatted HTML pages. This is mainly because existing systems either are not designed to be used within an external website, or they require expert users to set up complex software systems on machines that meet the requirements of this software. BibBase [12] aims to fill this gap by providing several distinctive features, described in the following sections.

2. LIGHT-WEIGHT LINKED DATA PUBLICATION

BibBase makes it easy for scientists to maintain publication lists on their personal web sites. Scientists simply maintain a BibTeX file of their publications, and BibBase does the rest. When a user visits a publication page, BibBase dynamically generates an up-to-date HTML page from the BibTeX file, as well as rich linked data with resolvable URIs that can be queried instantly on the system's SPARQL endpoint. Compared to existing linked data publication tools, this approach is notably easy to use and light-weight: it allows non-expert users to create a rich linked data source without any specific server requirements, without setting up a new system, and without defining complex mapping rules. All they need to know is how to create and maintain a BibTeX file, and there are tools to help with that.

It is important to note that this ease of use does not sacrifice the quality of the published data. Although the system is light-weight on the user's side, BibBase performs complex processing of the data in the back-end. When a new or updated BibTeX file arrives, the system transforms the data into several structured formats using a rich ontology, assigns URIs to all the objects (authors, papers, venues, etc.), performs duplicate detection and semantic linkage, and maintains and publishes provenance information, as described below.
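As a rough illustration of the back-end transformation described above, the following sketch turns one parsed BibTeX entry into RDF triples with minted URIs. The base URI follows the paper (http://data.bibbase.org), but the slug scheme, property choices (Dublin Core and FOAF here), and function names are our illustrative assumptions, not BibBase's actual ontology or code.

```python
import re

def slug(text):
    """Lower-case, hyphen-separated URI fragment, e.g. 'J. B. Smith' -> 'j-b-smith'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def triplify(entry):
    """Emit N-Triples lines for one BibTeX entry (given as a dict of fields).

    Assumed vocabulary: dcterms:title/creator and foaf:name stand in for
    whatever rich ontology BibBase actually uses.
    """
    base = "http://data.bibbase.org"
    pub = f"{base}/publication/{slug(entry['title'])}"
    triples = [f'<{pub}> <http://purl.org/dc/terms/title> "{entry["title"]}" .']
    # BibTeX separates multiple authors with the keyword "and".
    for author in entry["author"].split(" and "):
        person = f"{base}/author/{slug(author)}"
        triples.append(f'<{pub}> <http://purl.org/dc/terms/creator> <{person}> .')
        triples.append(f'<{person}> <http://xmlns.com/foaf/0.1/name> "{author}" .')
    return triples

entry = {"title": "Linked Data - The Story So Far",
         "author": "Tim Berners-Lee and Christian Bizer and Tom Heath"}
for t in triplify(entry):
    print(t)
```

The author URI pattern matches the one the paper shows later (http://data.bibbase.org/author/j-b-smith), which is what makes each author an HTTP-dereferenceable linked data resource.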
3. DUPLICATE DETECTION

BibBase needs to deal with several issues related to the heterogeneity of records within a single BibTeX file and across multiple BibTeX files. It uses existing duplicate detection techniques in addition to a novel way of managing duplicated data following the linked data principles. Within a single BibTeX file, the system uses a set of rules to identify duplicates and fix errors; we refer to this phase as local duplicate detection. For example, if a BibTeX file has two occurrences of the author names "J. B. Smith" and "John B. Smith", the system matches the two names and creates only a single author object. In this example, the assumption is that the combination of the first letter of the first name, the middle name, and the last name, "JBSmith", is a unique identifier for a person within a single file. If this assumption does not hold for a specific user (which is unlikely), BibBase allows the user to distinguish two people with the same identifier by adding a number at the end of one of the author names.

For the identification of duplicates across multiple BibTeX files, which we call global duplicate detection, the assumptions made for local duplicate detection may not hold: within different publication lists, "JBSmith" may (or may not) refer to the same author. BibBase deals with this type of uncertainty by providing a disambiguation page on the HTML interface that informs users looking for the author name "J. B. Smith" (by looking up the URI http://data.bibbase.org/author/j-b-smith) of the existence of all the entities with the same identifier, and by publishing skos:closeMatch and rdfs:seeAlso properties that link to related author entities on the RDF interface.

Duplicate detection, also known as entity resolution, record linkage, or reference reconciliation, is a well-studied problem and an active research area [5]. We use some of the existing techniques to define local and global duplicate detection rules, for example using fuzzy string similarity measures [4] or semantic knowledge for matching conference names and paper titles [6]. In addition to the definition of rules and online duplicate detection, we also use graph-based duplicate detection techniques [8] (also known as collective entity resolution [2]) to identify duplicates in an offline manner. However, in order to avoid loss of user data as a result of imperfect data cleaning, the results of this process will be published as additional data on our system that result in disambiguation pages or skos:closeMatch predicates.

[Figure 1: Sample entities in BibBase interlinked with several related data sources. The graph shows the author "Tim Berners-Lee" and the article "Linked Data - The Story So Far" on data.bibbase.org linked via owl:sameAs, skos:closeMatch, rdfs:seeAlso, and foaf:page to corresponding entities in DBLP Berlin, RKBExplorer, PubZone, Revyu.com, DBLP, and W3C pages, and via bibtex0.2:has_keyword to buzzwords such as "Linked Data" and "Semantic Web".]

4. SEMANTIC LINKAGE

BibBase entities also include links to related linked data sources and web pages. To discover such links, similar to our duplicate detection approach, we can leverage online and offline solutions. The online approach mainly uses a dictionary of terms and strings that can be mapped to external data sets. An important type of link comes from keywords in BibTeX entries, which can be used to relate publications to entries on DBpedia (and pages on Wikipedia), such as the DBpedia entities of type buzzword shown in Figure 1. A similar approach is used to match abbreviated venues, such as "ISWC" to "International Semantic Web Conference". The dictionaries (or ontology tables) are maintained inside BibBase and derived from sources such as DBpedia, Freebase, WordNet, and DBLP. We also allow users to extend the dictionaries through @string definitions in their BibTeX files, e.g.,

@string{ISWC = {Proc. of the Int'l Semantic Web Conference (ISWC)}}

Offline link discovery can be performed using existing link discovery tools [6, 9].

5. PROVENANCE AND USER FEEDBACK

Another highlight of the features implemented in BibBase is the storage and publication of provenance information, i.e., the source of each entity and each link in the data. This is of utmost importance in a system like BibBase, where the data comes from several different users and BibTeX files, and where (imperfect) automatic duplicate detection and linkage is performed over the data. Users will be able to see the source of entities and of the facts about those entities. In addition, they will be able to fix their own BibTeX files, or provide feedback to the system and to other users who need to fix their files or provide additional information.

∗ Supported in part by NSERC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. I-SEMANTICS 2010, September 1-3, 2010, Graz, Austria. Copyright © 2010 ACM 978-1-4503-0014-8/10/09 ...$10.00.
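To make the local duplicate-detection rule of Section 3 concrete, here is a minimal sketch. The key construction follows the paper's description (first initial of the first name, plus middle name and last name, so "J. B. Smith" and "John B. Smith" both yield "JBSmith"); the function names and the way name parts are split are illustrative assumptions, not BibBase's implementation.

```python
def author_key(name):
    """Local-duplicate key: first initial + middle name(s) + last name.

    'J. B. Smith' and 'John B. Smith' both map to 'JBSmith', so within a
    single BibTeX file the two occurrences become one author object.
    """
    # Normalize 'J.B. Smith' to 'J. B. Smith' before splitting, drop dots.
    parts = [p.strip(".") for p in name.replace(".", ". ").split()]
    if len(parts) == 1:
        return parts[0]
    first, middle, last = parts[0], parts[1:-1], parts[-1]
    return first[0].upper() + "".join(middle) + last

def merge_authors(names):
    """Group author-name occurrences from one BibTeX file by their key."""
    groups = {}
    for n in names:
        groups.setdefault(author_key(n), []).append(n)
    return groups
```

In global duplicate detection the same key is deliberately *not* treated as a merge decision: as Section 3 explains, two files both containing "JBSmith" instead produce a disambiguation page and skos:closeMatch links, so imperfect matching never destroys user data.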