Open Archives Initiative Protocol for Metadata Harvesting, Dublin Core and Accessibility in the OAIster Repository

Michael Peake
San Jose State University

Library Philosophy and Practice (e-journal), Libraries at University of Nebraska-Lincoln, DigitalCommons@University of Nebraska - Lincoln. Winter 12-30-2012.

Peake, Michael, "Open Archives Initiative Protocol for Metadata Harvesting, Dublin Core and Accessibility in the OAIster Repository" (2012). Library Philosophy and Practice (e-journal). 892. https://digitalcommons.unl.edu/libphilprac/892

Abstract

This paper examines the use of Dublin Core as a minimum metadata standard for the Open Archives Initiative Protocol for Metadata Harvesting in terms of its impact on end user experience in the OAIster repository. Specifically, the study looked at the use of controlled vocabulary searches versus non-controlled vocabulary searches, as well as the impact of Dublin Core on the granularity and consistency of records. Searches were performed in OAIster using Library of Congress Subject Headings and Name Authority Files, as well as non-controlled vocabulary searches for the same terms. The study concluded that controlled vocabulary searches are good for retrieving relevant results, but non-controlled vocabulary searches can retrieve more relevant results, at the cost of large numbers of non-relevant results also being returned. The openness of Dublin Core does lead to problems of granularity and consistency, but some records indicate the potential of Dublin Core for providing very useful records.
The study concludes that institutions should give serious consideration to user experience and repository display before converting records to Dublin Core for harvesting.

Introduction

The rapid increase in information resources on the World Wide Web has created a situation where an unprecedented amount of knowledge is potentially available to anyone with internet access. Indeed, Tony Gill argues that “the Web is the largest and fastest-growing collection of documents the world has ever seen” (2008, p. 25). The publicly indexable web alone had at least 25.21 billion pages in 2009 (“World Wide Web”, 2012, Statistics section, para. 1), and this does not include the many resources in databases that require search forms to be filled out before they are accessible (Raghavan & Garcia-Molina, n.d., abstract section). With so many information resources available, the problem becomes one of successfully finding relevant resources among the billions available.

Metadata, or data about data, is key to retrieving relevant information because it structures data about information resources in ways that can provide meaningful access points for searchers. For example, the use of Machine-Readable Cataloging (MARC) records in online public access catalogs (OPACs) in the library field provides searchers with understandable access points, such as subject or author, and a controlled vocabulary that aids retrieval by bringing together resources on the same topic or by the same author. Although OPACs generally allow keyword searching, the combination of access points and controlled vocabulary can greatly aid retrieval of relevant resources, because it brings together like resources that may be described differently and thus not show up in a keyword search.
Works by an author who writes under more than one name, for example Stephen King/Richard Bachman, or subjects that may be named differently within the literature, for example the Spanish Civil War/Spanish Revolution, can be retrieved together in this way.

Paradoxically, metadata is both a key and a hindrance to finding relevant information. Different knowledge communities work with different metadata schemas and standards that are best suited to their purposes and priorities. For example, the “archival community has embraced standards such as” Encoded Archival Description (EAD) and Describing Archives: A Content Standard (DACS) (Spiro, 2009, The Role of Software in Addressing Hidden Collections section), while the library community currently tends towards MARC and Anglo-American Cataloguing Rules 2nd edition (AACR2). With researchers interested in a variety of resources, for example books, archival material, web documents, and images, the issue becomes one of metadata interoperability. As Woodley points out, bringing material together from a single community in a union catalog, like WorldCat for the library community, is possible because “the contributing community shares the same rules for description and access and the same protocol for encoding the information” (2008, p. 46). Bringing together data encoded in different metadata schemas and formatted according to different content standards poses issues of interoperability. In other words, records from within a single knowledge community may have a high level of interoperability, but “it is when communities want to share their content in a broader arena, or reuse the information for other purposes, that problems of interoperability arise” (Woodley, 2008, p. 39).

There are many approaches to improving metadata interoperability.
This paper will examine one of them, namely the role that the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), in conjunction with Dublin Core, plays in improving the accessibility of records held by diverse institutions. The OAIster database hosted by WorldCat will be considered in terms of its ability to make records accessible.

Statement of the Problem

There is a vast and continually growing number of information resources on the World Wide Web. Many of these resources remain hidden from commonly used search methods, such as the Google or Yahoo search engines. By exposing their records to OAI compliant harvesters, institutions hope to make their records easier to access by allowing them to be retrieved from larger repositories, such as OAIster or Europeana. These repositories bring records together and allow users to search through the records of various institutions in one place. However, different institutions use different metadata schemas and data standards, so in order to ensure the interoperability of the records, OAI requires all records, at a minimum, to be exposed in simple Dublin Core. There is a concern that the use of simple Dublin Core may lead, in itself, to retrieval issues due to lack of granularity and crosswalking problems. The question therefore arises: does the use of simple Dublin Core hamper the retrievability of records in OAI compliant repositories?

Literature Review

“The primary role of the OAI-PMH is to facilitate resource discovery when resources are stored in a number of distributed, independent repositories by exporting metadata about items in those repositories” (Fegen, 2007, What is the OAI Protocol for Metadata Harvesting for? section, para. 1). In doing this, the OAI-PMH attempts to address the problem of metadata interoperability.
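As a rough illustration of the harvesting model described above, the following Python sketch builds an OAI-PMH `ListRecords` request for the `oai_dc` metadata prefix and reads the simple Dublin Core fields out of a response. The verb, the metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, the trimmed sample response, and the helper names `list_records_url` and `parse_simple_dc` are invented for illustration and are not taken from OAIster or from any institution's repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a (hypothetical) data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# A trimmed, made-up response: a real one also carries responseDate,
# request, and datestamp elements, which are omitted here for brevity.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>The Shining</dc:title>
          <dc:creator>King, Stephen, 1947-</dc:creator>
          <dc:subject>Horror tales</dc:subject>
          <dc:subject>hotels; ghosts; Colorado</dc:subject>
          <dc:date>1977</dc:date>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_simple_dc(xml_text):
    """Return one dict per record, mapping each DC element name to its values."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        fields = {}
        for element in record.iter():
            # Keep only elements in the Dublin Core namespace.
            if element.tag.startswith(f"{{{DC_NS}}}"):
                name = element.tag.split("}", 1)[1]
                fields.setdefault(name, []).append((element.text or "").strip())
        records.append(fields)
    return records


print(list_records_url("https://repository.example.org/oai"))
for record in parse_simple_dc(SAMPLE_RESPONSE):
    print(record["creator"], record["subject"])
```

In this made-up record the two `subject` values sit side by side with nothing in the record to show which one, if either, was drawn from a controlled vocabulary, and the second packs several keywords into a single field. That is one concrete form the granularity and consistency concerns raised in the problem statement can take once records are reduced to simple Dublin Core.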
According to the National Information Standards Organization (NISO), “interoperability is the ability of multiple systems with different hardware and software platforms, data structures, and interfaces to exchange data with minimal loss of content and functionality” (2004, p. 4). Chan and Zeng argue that metadata interoperability is needed to make it “possible to facilitate the exchange and sharing of data prepared according to different metadata schemas and to enable cross-collection searching” (2006a, Introduction section, para. 1). Haslhofer and Klas consider metadata interoperability “a prerequisite for uniform access to media objects in multiple autonomous and heterogeneous information systems” (2010, p. 1). With Woodley’s declaration that “global access to the universe of traditional print materials and digital resources has become more than ever the goal of many institutions” (2008, p. 38), the importance of metadata interoperability can easily be seen.

Chan and Zeng identify three different levels of approach to achieving metadata interoperability: the schema level, the record level, and the repository level. They note that individual projects may combine more than one approach (2006a, Metadata Interoperability Projects at Different Levels section, para. 5). The OAI-PMH approach is focused on both the repository level and the schema level.

Repositories are identified by Woodley as a means of bringing metadata records together in a single database “with links from individual records back to their home environments” (2008, p. 47). Chan and Zeng differentiate between two types of repository: those that harvest records based on converted metadata, for example using the OAI-PMH, and those that harvest records without needing record conversion (2006b, Achieving Interoperability at the Repository Level section). This research paper only considers repositories that use the OAI-PMH. Woodley describes repositories as a “recent model for union catalogs” (2008, p. 47) that physically bring together records “in a single database, with links from individual records back to their home environments” (2008, p. 47). According to the Open Archives Initiative website, two
