Standards-Based Interfaces for Harvesting and Obtaining Assets from Digital Repositories

Op standaarden gebaseerde interfaces voor het verzamelen en verwerven van objecten uit digitale bewaarplaatsen Standards-Based Interfaces for Harvesting and Obtaining Assets from Digital Repositories Jeroen Bekaert Promotoren: prof. dr. ir.-architect E. De Kooning, dr. H. Van de Sompel Proefschrift ingediend tot het behalen van de graad van Doctor in de Ingenieurswetenschappen Vakgroep Architectuur en Stedenbouw Voorzitter: prof. dr. B. Verschaffel Faculteit Ingenieurswetenschappen Academiejaar 2005 - 2006 iv dedicated to P.B. (in memoriam) iii iv English Summary Various communities are deploying and maintaining digital repositories, as a serious and long-term commitment to store, safeguard, and share rich collections of digital assets. Each community has a different perspective on the design and management of digital repositories. For example, the learning community is developing repositories that focus on use and re-use of learning objects; the digital library community focuses on storage and access of scholarly articles; the cultural heritage community works on curation of digital images, and so on. This focus on particular materials and application domains has led to a variety of parallel technical approaches used in the design and implementation of repository systems, and, as a result to a serious lack of cross-community and cross-repository interoperability. The net result of this lack of interoperability is the inability to use and re-use assets stored in heterogeneous repositories in cross-repository services and service workflows. As a result, the promise and potential of a truly interconnected digital knowledge environment remains unfulfilled. This general problem presents itself at a smaller, yet significant, scale in the context of the ongoing deployment of institutional repositories. These repositories are increasingly used by universities and research institutions as tools to take on the stewardship of the materials created by their constituency, and to make them accessible to the general public. Several institutional repository systems have been developed and adopted successfully, including v DSpace (Massachusetts Institute of Technology & Hewlett-Packard Labs), Fedora (Cornell University & University of Virginia), and EPrints (University of Southampton). All of these institutional repository systems are conceived as autonomous software components and their content is accessed using diverse proprietary interfaces. This lack of common repository interfaces hinders the realization of the vision of a new scholarly communication system that would have these institutional repositories – and other scholarly repositories – as its core components, as realizing that vision would require the development of federations of such scholarly repositories, the provision of cross-repository service overlays, and the expression and invocation of automated workflow applications. The study that is object of this doctoral dissertation is framed in this problem space, and focuses on the definition and design of two essential interfaces to digital repositories: An interface to enable harvesting of batches of digital assets (harvest interface) and an interface to enable obtaining disseminations of a single digital asset (obtain interface). The characteristics and requirements for such harvest and obtain interfaces have been explored over the course of two elaborate experiments. A first experiment focused on the multi-faceted use of the Open Archives Initiative Protocol for Metadata Harvesting (OAI- PMH) and NISO’s OpenURL Framework for Context-Sensitive Services as information access protocols in a repository architecture designed and implemented for ingesting, storing, and accessing a vast collection of digital assets at the Research Library of the Los Alamos National Laboratory. A second experiment aimed at designing and implementing a robust solution for the recurrent and accurate transfer of digital assets from the repository of the American Physical Society to the aDORe environment of the Los Alamos National Laboratory; again the OAI-PMH played a significant role in the deployed solution. The insights obtained from these experiments are generalized and they result in the definition of generic harvest and obtain interfaces that could be supported across heterogeneous repositories. The proposed protocol-based interfaces operate in a manner that is neutral to the way in which a digital repository stores, manages and identifies its digital assets, and they build upon existing technologies that have recently received considerable buy-in from the digital library and scholarly information communities. These technologies are the OAI-PMH, NISO’s OpenURL Framework and several XML-based techniques for the representation and packaging of compound digital assets. By proposing these repository interfaces, the author hopes to contribute to the ongoing global discourse aimed at reaching new levels of interoperability across systems that store valuable content, and hence towards the realization of a richer scholarly information environment. vi Nederlandstalige Synthese Digitale bewaarplaatsen voor het ontsluiten, opslaan en langdurig bewaren van datacollecties hebben de laatste tien jaar een sterke toename gekend in uiteenlopende disciplines en vakdomeinen. Elke discipline heeft haar eigen visie op het ontwerp en beheer van digitale bewaarplaatsen. Zo zijn er bewaarplaatsen die zich richten op het gebruik en hergebruik van elektronisch leer- en studiemateriaal, digitale bibliotheken voor het stockeren en toegankelijk maken van onderzoeksartikels, bewaarplaatsen voor het ontsluiten van gedigitaliseerd cultureel erfgoed, enzovoort. Deze evolutie kan ongetwijfeld positief worden genoemd, zij het niet dat deze verscheidenheid aan visies en materialen tot een veelvoud aan parallelle technologische oplossingen voor het ontwerpen en ontwikkelen van digitale bewaarsystemen heeft geleid. Het gevolg hiervan is een gebrek aan interoperabiliteit van de verscheidene bewaarsystemen en disciplines. Het gemis aan eenvormigheid en coherentie leidt tot een onvermogen om objecten, die in heterogene digitale bewaarplaatsen gestockeerd worden, te gebruiken en hergebruiken in systeemoverschrijdende online-diensten en werkschema- toepassingen. De mogelijkheden van een sterk geconnecteerde digitale kennisomgeving worden bijgevolg niet gerealiseerd, laat staan optimaal benut. Deze problematiek komt tot uiting op een kleinere, maar belangrijke schaal in de huidige, sterke ontwikkeling van zogenaamd institutionele digitale bewaarplaatsen. Deze laatste vii worden door universiteiten en andere onderzoeksinstellingen meer en meer als hulpmiddel gebruikt om eigen onderzoeksresultaten en -artikels intern te bewaren en via het Internet toegankelijk te maken voor externe geïnteresseerden. Voorbeelden van kant-en-klare softwarepakketten voor de uitbouw van dergelijke institutionele bewaarplaatsen zijn DSpace (Massachusetts Institute of Technology & Hewlett-Packard Labs), Fedora (Cornell University & University of Virginia) en EPrints (University of Southampton). Elk van deze systemen wordt autonoom ontwikkeld en ontsloten aan de hand van onderling sterk verschillende technische specificaties. Door het gebrek aan uniforme interfaces wordt de realisatie van nieuwe wetenschappelijke communicatienetwerken, gebaseerd op institutionele of andere bewaarplaatsen, aanzienlijk gehinderd. Dergelijke netwerken zijn immers sterk afhankelijk van de ontwikkeling van federaties van bewaarplaatsen en van de uitbouw van online-diensten die de verschillende bewaarplaatsen overkoepelen. Het onderzoek dat in het kader van deze doctorale studie werd uitgevoerd, moet binnen deze interoperabiliteitsproblematiek worden gesitueerd. De studie richt zich op de definitie en het ontwerp van een aantal generieke interfaces voor heterogene digitale bewaarsystemen. Het proefschrift onderzoekt in het bijzonder interfaces voor het verzamelen van bundels van complexe objecten (harvest interfaces), en interfaces voor het verwerven van disseminaties van individuele complexe objecten (obtain interfaces). Om deze interfaces te kunnen ontwikkelen, en hun eigenschappen te verkennen, werden twee uitgebreide experimenten uitgevoerd. In een eerste experiment werd onderzoek verricht naar een digitale bewaaromgeving voor het deponeren, stockeren, verwerken en ontsluiten van de omvangrijke datacollecties van de Research Library van het Los Alamos National Laboratory. Zowel het Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) en NISO’s OpenURL Framework for Context-Sensitive Services Framework worden hier op een veelzijdige manier ingezet. Een tweede experiment beoogde de ontwikkeling van een raamwerk voor de herhaaldelijke en accurate overdracht van nieuwe en gewijzigde complexe dataobjecten van de digitale bewaarplaats van de American Physical Society naar de digitale bewaaromgeving van het Los Alamos National Laboratory. Ook hier is een belangrijke rol weggelegd voor het OAI-PMH raamwerk. De inzichten die uit beide experimenten zijn voortgevloeid, worden veralgemeend en monden uit in de definitie van interfaces voor het verzamelen en verwerven van digitale objecten. Deze interfaces zijn inzetbaar in een verscheidenheid aan digitale bewaarsystemen. De voorgestelde interfaces functioneren onafhankelijk van de wijze waarop een digitaal bewaarsysteem dataobjecten stockeert, beheert en identificeert. De interfaces worden opgebouwd aan de hand van een aantal recente en eenvoudige technologieën die succesvol

Standards-Based Interfaces for Harvesting and Obtaining Assets from Digital Repositories

Bibliography of Erik Wilde

UK Research Data Registry Mapping Schemes

Describing Data Patterns. a General Deconstruction of Metadata Standards

Index of /Sites/Default/Al Direct/2007/April

Access Interfaces for Open Archival Information Systems Based on the OAI-PMH and the Openurl Framework for Context-Sensitive Services

Describing Data Patterns. a General Deconstruction of Metadata Standards 2013

LCC Principles of Identification V1 1 April 2014 with Appendices

Internet Bible As Is...(April 2011)

LCC Identifiers Workstream Paper V0 2 Nov 7 2012

Options for Improving Access to Legislative Records: a White Paper

OAI 4 Issues in Managing Persistent Identifiers

Document History 1. Definition of the Subject 2. Keywords