A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and Their Usage
Marko A. Rodriguez, Johan Bollen, Herbert Van de Sompel
Digital Library Research & Prototyping Team
Los Alamos National Laboratory
Los Alamos, NM 87545

ABSTRACT

The large-scale analysis of scholarly artifact usage is constrained primarily by current practices in usage data archiving, privacy issues concerned with the dissemination of usage data, and the lack of a practical ontology for modeling the usage domain. As a remedy to the third constraint, this article presents a scholarly ontology that was engineered to represent those classes for which large-scale bibliographic and usage data exists, to support usage research, and to scale in its instantiation to the order of 50 million articles along with their associated artifacts (e.g. authors and journals) and an accompanying 1 billion usage events. The real world instantiation of the presented abstract ontology is a semantic network model of the scholarly community that lends the scholarly process to statistical analysis and computational support. We present the ontology, discuss its instantiation, and provide some example inference rules for calculating various scholarly artifact metrics.

Categories and Subject Descriptors

I.2.4 [Knowledge Representation Formalisms and Methods]: Semantic Networks; H.3.7 [Digital Libraries]: Standards - ontologies

General Terms

Ontologies, Scholarly Communication

Keywords

Resource Description Framework and Schema, Web Ontology Language, Semantic Networks

This paper is authored by an employee(s) of the United States Government and is in the public domain.
JCDL’07, June 17–22, 2007, Vancouver, British Columbia, Canada.
ACM 978-1-59593-644-8/07/0006.
arXiv:0708.1150v1 [cs.DL] 8 Aug 2007

1. INTRODUCTION

New publications are added to the scholarly record at an accelerating pace. This is apparent from the number of publications indexed in Thomson Scientific's citation database over the last fifteen years: 875,310 in 1990; 1,067,292 in 1995; 1,164,015 in 2000; and 1,511,067 in 2005. However, the extent of the scholarly record reaches far beyond what is indexed by Thomson Scientific. While Thomson Scientific focuses primarily on quality-driven journals (roughly 8,700 in 2005), it does not index more novel scholarly artifacts such as preprints deposited in institutional or discipline-oriented repositories, datasets, software, and simulations that are increasingly being considered scholarly communication units in their own right.

While the size (and growth) of the scholarly record is impressive, the extent of its use is even more staggering. For instance, in November 2006, Elsevier's Science Direct, which provides access to articles from approximately 2,000 journals, celebrated its 1 billionth full-text download since counting started in April of 1999¹. And, again, the extent of scholarly usage clearly reaches far beyond Elsevier's repository. Furthermore, usage events include not only full-text downloads, but also events such as requesting services from linking servers, downloading bibliographic citations, emailing abstracts, etc.

To a large extent, the effect of usage behavior on the scholarly process is a horizon that is only beginning to be understood and, if properly studied, will offer clues to the evolutionary trends of science [1, 2, 3], quantitative models of the value of scholarly artifacts [4, 5], and services to support scholars [6]. The Andrew W. Mellon funded MESUR² project at the Research Library of the Los Alamos National Laboratory aims at developing metrics for assessing scholarly communication artifacts (e.g. articles, journals, conference proceedings, etc.) and agents (e.g. authors, institutions, publishers, repositories, etc.) on the basis of scholarly usage. In order to do this, the MESUR project makes use of a representative collection of bibliographic, citation and usage data. This data is collected from a wide variety of sources including academic publishers, secondary publishers, institutional linking servers, etc. Expectations are that the collected data will eventually encompass tens of millions of bibliographic records, hundreds of millions of citations, and billions of usage events. Mining such a vast data set in an efficient, performant, and flexible manner presents significant challenges regarding data representation and data access.

This article presents the OWL ontology [7] used by MESUR to represent bibliographic, citation and usage data in an integrated manner. The proposed MESUR ontology is practical, as opposed to all encompassing, in that it represents those artifacts and properties that, as previously shown in [6], are realistically available from modern scholarly information systems. This includes bibliographic data such as author, title, identifier, and publication date, and usage data such as the IP address of the accessing agent, the date and time of access, type of usage, etc. Finally, another novel contribution of this work is the hybrid storage and access architecture in which relational database and triple store technology are combined. This is achieved by storing core data and relationships in the triple store and auxiliary data in a relational database. This design choice is driven by the need to keep the size of the triple store to a level that can realistically be handled by current technologies. The combination of the data architecture and scholarly ontology presented in this article provides the foundation for the large-scale modeling and analysis of scholarly artifacts and their usage.

¹ Elsevier's 1 billion downloads article available at: http://www.info.sciencedirect.com/news/archive/2006/news billionth.asp
² MEtrics from Scholarly Usage of Resources available at: http://www.mesur.org/

2. SEMANTIC NETWORK ONTOLOGIES

A semantic network (sometimes called a multi-relational network or multi-graph) is composed of a set of nodes (representing heterogeneous artifacts) connected to one another by a set of qualified, or labeled, edges [8]. In a graph theoretic sense, a semantic network is a directed labeled graph. Because an edge is labeled, two nodes can be connected to one another by an infinite number of edges. However, in most cases, the possible interconnections between node types are constrained to a predetermined set. This predetermined set is made explicit in the semantic network's associated ontology. An ontology is generally defined as a set of abstract classes, their relationship to one another, and a collection of inference rules for deriving implicit relationships [9]. An ontology makes no explicit reference to the actual instances of the defined abstract classes; this is the role of the semantic network.

An ontology is related to the developer's API in object oriented programming languages such as C++ and Java (minus the explicit representation of class methods/functions). For example, the set of relationships of an ontological class is known as the class' properties and, in the object oriented lexicon, can be understood as class fields. Also, a taxonomy is usually expressed in a semantic network ontology. A taxonomy of sub- and super-classes supports the inheritance of class properties. For instance, if all mammals are warm blooded, then all humans are warm blooded because all humans are mammals. In an inheritance hierarchy, the warm blooded property of mammals is inherited by all sub-classes of mammal (e.g. human).

Figure 1 diagrams the relationship between an ontology and its semantic network instantiation. The circles represent objects that are instances of the abstract classes (the squares) to which the dash-dot lines point. The three lower squares are [...] work may exist. Figure 1 does not expose the range of conceptual nuances that can be expressed by modern ontology languages and thus only provides a rudimentary representation of the relationship between an ontology and its semantic network instantiation.

Figure 1: The relationship between an ontology and its semantic network instantiation

2.1 Semantic Network Technology

The most popular semantic network representational framework is the Resource Description Framework and Schema, or RDF(S) [10]. RDF(S) represents all nodes and edges by Uniform Resource Identifiers (URIs) [11]. The URI approach supports the use of namespacing such that the URI http://www.science.org#Article has a different meaning, or connotation, than what may be understood by the URI http://www.newspaper.net#Article.

The Web Ontology Language (OWL) is an extension of RDF(S) that supports a richer vocabulary (e.g. promotes many set theoretical concepts) [7]. Protégé³ is perhaps the most popular application for designing OWL ontologies [12]. While OWL is primarily a machine readable language, an OWL ontology can be diagrammed using the Unified Modeling Language's (UML) class diagrams (i.e. entity relationship diagrams).

Modern semantic network data stores represent the relationship between two nodes by a triple. For instance, the triple

⟨URI_a, http://xmlns.com/foaf/0.1/#knows, URI_b⟩

states that the resource identified by URI_a knows the resource identified by URI_b, where URI_a and URI_b are nodes and http://xmlns.com/foaf/0.1/#knows is a directed labeled edge (see Figure 2). The meaning of knows is fully defined by the URI http://xmlns.com/foaf/0.1/. The union of instantiated FOAF triples is a FOAF semantic network. Current platforms for storing and querying such semantic networks are called triple stores. Many open source and proprietary triple stores currently exist. Various querying languages exist as well [13].

Figure 2: A diagrammed triple