Publishing Bibliographic Data on the Semantic Web Using Bibbase
Total Page:16
File Type:pdf, Size:1020Kb
Undefined 0 (0) 1 1 IOS Press Publishing Bibliographic Data on the Semantic Web using BibBase Editor(s): Carsten Keßler, University of Münster, Germany; Mathieu d’Aquin, The Open University, UK; Stefan Dietze, L3S Research Center, Germany Solicited review(s): Kai Eckert, University of Mannheim, Germany; Antoine Isaac, Vrije Universiteit Amsterdam, The Netherlands; Jan Brase, German National Library of Science and Technology, Germany Reynold S. Xin a, Oktie Hassanzadeh b,∗, Christian Fritz c, Shirin Sohrabi b and Renée J. Miller b a Department of EECS, University of California, Berkeley, California, USA E-mail: [email protected] b Department of Computer Science, University of Toronto, 10 King’s College Rd., Toronto, Ontario, M5S 3G4, Canada E-mail: {oktie,shirin,miller}@cs.toronto.edu c Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California, USA E-mail: [email protected] Abstract. We present BibBase, a system for publishing and managing bibliographic data available in BiBTeX files. BibBase uses a powerful yet light-weight approach to transform BiBTeX files into rich Linked Data as well as custom HTML code and RSS feed that can readily be integrated within a user’s website while the data can instantly be queried online on the system’s SPARQL endpoint. In this paper, we present an overview of several features of our system. We outline several challenges involved in on-the-fly transformation of highly heterogeneous BiBTeX files into high-quality Linked Data, and present our solution to these challenges. Keywords: Bibliographic Data Management, Linked Data, Data Integration 1. Introduction as DBLP Berlin [19] and RKBExplorer [24]). These systems, tools, and data sources are widely being used Management of bibliographic data has received sig- and have considerably simplified and enhanced many nificant attention in the research community. Many on- bibliographic data management tasks such as data cu- line systems have been designed specifically for this ration, storage, retrieval, and sharing of bibliographic purpose, including but certainly not limited to, Bib- data. Sonomy [16], CiteSeer [17], CiteULike [18], EPrints Despite the success of the above-mentioned sys- [20], Mendeley [22], PubZone [21], refbase [27] and tems, very few individuals and research groups publish RefWorks [23]. The work in the Semantic Web com- their bibliographic data on their websites in a struc- munity in this area has also resulted in several sys- tured format, particularly following the principles of tems (such as VIVO [10]), tools (such as BiBTeX to Linked Data [1], to provide users with HTTP derefer- RDF conversion tools [13]), ontologies (such as the enceable URIs that provide structured (RDF) data as BIBO [26] and SWRC [25]) and data sources (such well as nicely formatted HTML pages. This is mainly due to the fact that existing systems either are not de- *Corresponding author. E-mail: [email protected]. signed to be used within an external website, or they 0000-0000/0-1900/$00.00 c 0 – IOS Press and the authors. All rights reserved 2 R. S. Xin, O. Hassanazdeh, C. Fritz, S. Sohrabi, R. J. Miller / Publishing Bibliographic Data on the Semantic Web using BibBase require expert users to set up complex software sys- creates only a single author object. In this example, the tems on machines that meet the requirements of this assumption is that the combination of the first letter of software. BibBase [15] aims to fill this gap by provid- first name, middle name, and last name, “JBSmith”, is ing several distinctive features as described in the fol- a unique identifier for a person in a single file. If this lowing sections. assumption does not hold for a specific user (which is unlikely) BibBase allows the user to distinguish two people with the same identifier by adding a number at 2. Light-Weight Linked Data Publication the end of one of the author names. For identification of duplicates across multiple BibBase makes it easy for scientists to maintain BiBTeX files, which we call global duplicate detec- publication lists on their personal website. Scientists tion, the assumptions made for local duplicate detec- simply maintain a BiBTeX file of their publications, tion may not hold. Within different publication lists, and BibBase does the rest. When a user visits a pub- “JBSmith” may (or may not) refer to the same au- lication page, BibBase dynamically generates an up- thor. BibBase deals with this type of uncertainty by to-date HTML page from the BiBTeX file, as well as having a disambiguation page on the HTML inter- rich Linked Data with resolvable URIs that can be face that informs the users looking for author name “J. queried instantly on the system’s SPARQL endpoint. B. Smith” (by looking up the URI http://data. Compared to existing Linked Data publication tools, bibbase.org/author/j-b-smith) of the ex- this approach is notably easy-to-use and light-weight, istence of all the entities with the same identifier, and and allows non-expert users to create a rich Linked having skos:closeMatch and rdfs:seeAlso Data source without any specific server requirements, properties that link to related author entities on the the need to set up a new system, or define complex RDF interface. mapping rules. All they need to know is how to create Duplicate detection, also known as entity resolution, and maintain a BiBTeX file and there are tools to help record linkage, or reference reconciliation is a well- with that. studied problem and an attractive research area [5]. It is important to note that this ease of use does not We use some of the existing techniques to define lo- sacrifice the quality of the published data. In fact, al- cal and global duplicate detection rules, for example though the system is light-weight on the users’ side, using fuzzy string similarity measures [4] or semantic BibBase performs complex processing of the data in knowledge for matching conference names and paper the back-end. When a new or updated BiBTeX file titles [6]. In particular, we use LinQuer [7] to specify arrives, the system transforms the data into several linkage requirements in LinQL query language, and structured formats using a rich ontology, assigns URIs translate the specification into standard SQL queries to all the objects (authors, papers, venues, etc.), per- that can run over our relational backend. forms duplicate detection and semantic linkage, and In addition to the definition of rules and online du- maintains and publishes provenance information as de- plicate detection (i.e., detection of duplicates while the scribed below. input BibTeX file is being processed), we use some graph-based duplicate detection techniques [11] (also known as collective entity resolution [2]) to identify 3. Duplicate Detection duplicates in an offline manner (i.e., after the input file is submitted and processed). Such techniques take BibBase needs to deal with several issues related to advantage of structure of the data and the currently the heterogeneity of records in a single BiBTeX file, identified duplicates to iteratively enhance the qual- and across multiple BiBTeX files. BibBase uses ex- ity of the identified duplicates. For example, two au- isting duplicate detection techniques in addition to a thors with similar names that have many common co- novel way of managing duplicated data following the authors (or co-authors that are identified as duplicates Linked Data principles. Within a single BiBTeX file, in previous phases) are more likely to be duplicates. the system uses a set of rules to identify duplicates and Similarly, two papers with similar titles that have the fix errors. We refer to this phase as local duplicate de- same set of authors (or authors that are identified as tection. For example, if a BiBTeX file has two occur- duplicates) are more likely to refer to the same publi- rences of author names “J. B. Smith” and “John B. cation. In order to avoid loss of user data as a result of Smith”, the system matches the two author names and imperfect data cleaning, the results of this process will R. S. Xin, O. Hassanazdeh, C. Fritz, S. Sohrabi, R. J. Miller / Publishing Bibliographic Data on the Semantic Web using BibBase 3 DBLP Berlin RKBExplorer “Tim Berners-Lee” (person on dblp) owl:SameAs http://data.bibbase.org/ “Tim Berners-Lee” (person on dblp) owl:SameAs 1 “Tim Berners-Lee” (person on citeseer) skos:closeMatch “Tim Berners-Lee” (author) “Int. J. Semantic Web Inf. Syst.” (journal) src: http://www.w3.org/People/Berners-Lee/Publications owl:SameAs “Christian Bizer” (person on eprints) "Linked Data - The Story So Far" (article) W3C skos:closeMatch People “Tom Heath” (author) “Tim Berners-Lee” (foaf:Person) owl:SameAs foaf:page skos:closeMatch “Christian Bizer” (author) Revyu.com “Tim Berners-Lee” (person) bibtex0.2:has_keyword “Int. J. Semantic Web Inf. Syst.” (journal) rdfs:seeAlso “Linked Data” (buzzword) foaf:page foaf:page bibtex0.2: has_keyword "Linked Data - The Story So Far" (review) “Semantic Web” (buzzword) foaf:page PubZone DBLP “Tim Berners-Lee” "Linked Data - The Story So Far" "Linked Data - The Story So Far" “Tim Berners-Lee” Fig. 1. Sample entities in BibBase interlinked with several related data sources. be published as additional data on our system that re- @string{ISWC= sult in disambiguation pages or skos:closeMatch {Proc. of the Int’l Semantic Web Conference(ISWC)}} predicates. An offline link discovery can be performed using ex- isting link discovery tools [7,12], and more complex linkage requirements that require a longer processing 4. Discovering Links to External Data Sources time, and therefore cannot be performed online when the BibTeX files are submitted. Offline link discovery In order to publish our data in the Web, not just tasks are scheduled once a BibTeX file is submitted (or on the Web, to avoid creation of an isolated data silo, updated) or a change in the target data sources are de- and to add BibBase data to the growing Linking Open tected.